<a href="https://colab.research.google.com/github/DataWithAaditya/Max-Life-Health-Insurance-Cross-Sell-Prediction/blob/main/Max_Life_Health_Insurance_Cross_Sell_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Max Life Health Insurance Cross Sell Prediction



##### **Project Type**    - EDA/Classification

##### **Contribution**    - Individual

# **Project Summary -**

This project aims to predict whether a customer will purchase health insurance alongside their existing vehicle insurance. By analyzing customer demographics, vehicle details, and insurance history, we provide data-driven insights to help the business target potential buyers more effectively. We started with Exploratory Data Analysis (EDA) to identify missing values, outliers, and key patterns in the dataset. Through Univariate, Bivariate, and Multivariate Analysis, we gained a deeper understanding of feature distributions and relationships.

Next, we focused on Data Preprocessing and Feature Engineering, where we handled missing values, treated outliers, encoded categorical variables, and scaled numerical features. We also created new meaningful features to enhance model performance. To validate our findings, we performed Hypothesis Testing, ensuring that key business assumptions were statistically significant before proceeding with model training.

In the Model Building and Evaluation phase, we experimented with multiple machine learning models, including Logistic Regression, Random Forest, and XGBoost. We evaluated them using performance metrics such as accuracy, precision, recall, and F1-score, followed by hyperparameter tuning to optimize results. Finally, we provided actionable insights based on our findings, helping the business improve its marketing strategy, reduce acquisition costs, and increase conversion rates. This project demonstrates a complete end-to-end machine learning workflow, showcasing practical skills in data analysis, feature engineering, and predictive modeling, with a strong focus on business impact.

# **GitHub Link -**

GitHub Link: [GitHub Link click here!](https://github.com/DataWithAaditya/Max-Life-Health-Insurance-Cross-Sell-Prediction/tree/main)

# **Problem Statement**


Insurance companies face challenges in identifying the right customers for additional policy sales. Many marketing efforts are wasted on customers who are not interested, while potential buyers are sometimes overlooked. Max Life Insurance wants to improve its cross-selling strategy by predicting which existing customers are likely to purchase health insurance. By leveraging machine learning, we can analyze customer data, such as demographics, vehicle history, and past insurance purchases, to build a predictive model that helps the company focus its marketing efforts on the right audience. The goal is to improve sales efficiency, reduce costs, and enhance customer satisfaction by offering relevant products to the right customers at the right time.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Mount drive if, working on Google Colab
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Max Life Health Insurance Cross Sell Prediction/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Dataset Size")
print("Rows: ", df.shape[0])
print("Columns: ", df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_values = df.duplicated().sum()
duplicate_values

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
missing_values

In [None]:
# Visualizing missing values using a bar plot
plt.figure(figsize=(8, 6))
missing_values.plot(kind="bar", color="skyblue")
plt.title("Missing Values in Each Column")
plt.xlabel("Columns")
plt.ylabel("Count of Missing Values")
plt.xticks(rotation=45)
plt.show()

This confirms that there are no missing values in the dataset.

### What did you know about your dataset?

After loading the dataset and performing initial checks, I learned the following:

- The dataset contains 381,109 rows and 12 columns.
- It includes numerical and categorical features related to customers, their insurance history, and vehicle details.
- The dataset has no missing values, which means it's complete and doesn’t require imputation.
- Data types are appropriate for analysis, with numerical and categorical values correctly assigned.
- There are no duplicate records, ensuring data integrity.

The dataset is clean and well-structured, with no missing or duplicate values. It provides valuable information to predict which customers are likely to purchase health insurance. Next, we can explore feature distributions, outliers, and relationships between variables.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Dataset Columns:\n", df.columns)

In [None]:
# Dataset Describe

df.describe()

### Variables Description

The dataset has 381,109 records with 12 features, including customer details, past insurance history, and vehicle-related information. Key features include:

- Gender: Male or Female
- Age: Customer's age in years
- Driving_License: If the customer has a driving license (1 = Yes, 0 = No)
- Region_Code: Location of the customer
- Previously_Insured: If the customer already has an insurance policy (1 = Yes, 0 = No)
- Vehicle_Age: How old the vehicle is (<1 Year, 1-2 Years, >2 Years)
- Vehicle_Damage: If the vehicle was damaged before (Yes/No)
- Annual_Premium: Insurance cost paid by the customer
- Policy_Sales_Channel: How the policy was sold
- Vintage: How long the customer has been with the company
- Response (Target Variable): If the customer is interested in buying health insurance (1 = Yes, 0 = No)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique()
unique_values

### Detect Outliers

In [None]:
# Select only numerical columns
num_cols = df.select_dtypes(include=np.number).columns

# Function to visualize outliers using Boxplots
def plot_boxplots(data, columns):
    plt.figure(figsize=(15, 8))
    for i, col in enumerate(columns, 1):
        plt.subplot((len(columns) // 4) + 1, 4, i)  # Adjust layout dynamically
        sns.boxplot(y=data[col], color='skyblue')
        plt.title(f'Boxplot of {col}')
    plt.tight_layout()
    plt.show()

# Call the function to plot boxplots
plot_boxplots(df, num_cols)

In [None]:
# Detecting outliers using IQR method
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)  # First quartile (25th percentile)
    Q3 = data[column].quantile(0.75)  # Third quartile (75th percentile)
    IQR = Q3 - Q1  # Interquartile range

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]

    print(f"{column}: {len(outliers)} outliers detected")
    return outliers

# Define numerical_features
numerical_features = df.select_dtypes(include=np.number).columns

# Check for outliers in all numerical features
outlier_counts = {}
for col in numerical_features:
    outliers = detect_outliers_iqr(df, col)
    outlier_counts[col] = len(outliers)

# Display outlier summary
print("\nOutlier Summary:")
print(outlier_counts)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write ready code for data analysis

# Handling Outliers for Annual_Premium using Winsorization (Capping)
def cap_outliers(data, column):
    Q1 = data[column].quantile(0.25)  # 25th percentile
    Q3 = data[column].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1  # Interquartile range

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Capping outliers
    data[column] = np.where(data[column] < lower_bound, lower_bound, data[column])
    data[column] = np.where(data[column] > upper_bound, upper_bound, data[column])

    print(f"Outliers in {column} capped between {lower_bound:.2f} and {upper_bound:.2f}")

# Apply capping to 'Annual_Premium'
cap_outliers(df, 'Annual_Premium')

# Print duplicate values
print("Duplicate Values: ", duplicate_values)

# Print missing values
print("Missing Values:\n ", missing_values)

# Unique values
print("Unique Values Each Columns:\n ", unique_values)

# Data size after handling outliers
print("Data Size")
print("Rows", df.shape[0])
print("Columns", df.shape[1])

In [None]:
# After handling outliers

# Detecting outliers using IQR method
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)  # First quartile (25th percentile)
    Q3 = data[column].quantile(0.75)  # Third quartile (75th percentile)
    IQR = Q3 - Q1  # Interquartile range

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]

    print(f"{column}: {len(outliers)} outliers detected")
    return outliers

# Define numerical_features
numerical_features = df.select_dtypes(include=np.number).columns

# Check for outliers in all numerical features
outlier_counts = {}
for col in numerical_features:
    outliers = detect_outliers_iqr(df, col)
    outlier_counts[col] = len(outliers)

# Display outlier summary
print("\nOutlier Summary:")
print(outlier_counts)

### What all manipulations have you done and insights you found?

We performed several preprocessing steps to clean and prepare the data:

1. Handling Missing Values & Duplicates

- What we did:

  - Checked for missing values in each column.
  - Checked for duplicate values in each column.

- Why we did it:

  - Missing values can cause bias or errors in model predictions.
  - Removing duplicates ensures that each data point is unique.

2. Detecting & Handling Outliers

- What we did:

  - Used the Interquartile Range (IQR) Method to detect extreme values.
  - Found outliers in three columns:
    - Driving_License (812 outliers)
    - Annual_Premium (10,320 outliers)
    - Response (46,710 outliers)
  - Applied capping (Winsorization) on Annual_Premium to limit extreme values.
  - Left Driving_License and Response unchanged, as they are categorical (binary 0/1).

- Why we did it:

  - Outliers in numerical data can distort the model's learning process.
  - Capping ensures extreme values don’t mislead the predictions.


**Insights Gained:**
- Missing Values Analysis:
  - Most columns had no missing values, indicating a well-maintained dataset.

- Duplicate Data:
  - We checked and removed duplicate records to ensure better model performance.

- Outlier Analysis:
  - Annual_Premium had extreme values, meaning some customers were paying very high premiums compared to others.
  - Insight: There might be a difference in premium rates based on customer segments.
- Driving_License & Response were categorical, and outliers weren’t an issue there.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1: Distribution of Gender

In [None]:
# Set figure size
plt.figure(figsize=(6, 4))

# Count plot for gender
sns.countplot(x=df['Gender'], palette='coolwarm')

# Add title and labels
plt.title('Distribution of Gender', fontsize=14)
plt.xlabel('Gender', fontsize=12)
plt.ylabel('Count', fontsize=12)

# Display the chart
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A count plot will help us understand the distribution of male and female customers. This is crucial because gender-based preferences may impact insurance purchase behavior.

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If the distribution is imbalanced, it may indicate that one gender is more likely to purchase health insurance than the other.
- If there is a significant gender gap, we might need targeted marketing strategies for the underrepresented group.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: If we find that one gender dominates, we can tailor promotional campaigns to the other gender to improve cross-selling.

- Negative Impact: If gender imbalance exists and isn't addressed, we may miss out on potential customers.

#### Chart - 2: Distribution of Age

In [None]:
# Set figure size
plt.figure(figsize=(6, 4))

# Histogram for Age
sns.histplot(df['Age'], bins=30, kde=True, color='royalblue')

# Add titles and labels
plt.title('Distribution of Age', fontsize=12)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)

# Display the chart
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A histogram will show the age distribution of customers. This helps us identify which age groups are most common in the dataset.

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If most customers fall within a certain age range, it indicates that the company has a specific target audience for health insurance.

- If we see two peaks (bimodal distribution), it may mean that there are two major customer segments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: If a certain age group dominates, marketing campaigns can be optimized to target them more effectively.
- Negative Impact: If some age groups are missing, it may indicate that Max Life Insurance is not reaching younger or older potential customers.

#### Chart - 3: Distribution of Region_code

In [None]:
# Set the figure size
plt.figure(figsize=(12, 4))

# Bar plot for Region_code
sns.countplot(x=df['Region_Code'], palette='viridis', order=df['Region_Code'].value_counts().index)

# Add title and labels
plt.title('Customer Distribution by Region', fontsize=12)
plt.xlabel('Region Code', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45) #Rotate labels for better readability

# Display the chart
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A bar chart will help visualize the number of customers from different regions. This is important because customer behavior may vary by region, affecting sales and marketing strategies.

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If certain regions dominate, it indicates that Max Life Insurance has a strong presence in specific locations.
- If some regions have low customer counts, it may signal untapped market potential.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: If the company knows which regions have high engagement, it can focus resources there for better customer retention.
- Negative Impact: If some regions are underrepresented, the company may be missing out on potential customers, signaling the need for region-specific marketing efforts.

#### Chart - 4: Distribution of Driving_License

In [None]:
# Set figure size
plt.figure(figsize=(6, 4))

# Count plot for Driving_License
sns.countplot(x=df['Driving_License'], palette='pastel')

# Add the title and labels
plt.title('Distribution of Driving License Holder', fontsize=14)
plt.xlabel('Driving License Holder', fontsize=12)
plt.ylabel('Count', fontsize=12)

# Display the chart
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A count plot will help us understand how many customers possess a driving license. Although this feature may seem unrelated to health insurance, it could correlate with customer lifestyle and risk assessment.

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If most customers have a driving license (1), it suggests that they are mobile and possibly more independent in decision-making.
- If there is a balanced split between license holders and non-holders, it indicates that driving status might not be a major factor in insurance sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: If driving license holders show higher conversion rates, Max Life Insurance can target vehicle owners for cross-selling opportunities.
- Negative Impact: If non-license holders are underrepresented, the company might miss out on a key customer base (e.g., those who prefer public transport).

#### Chart - 5: Distribution of Previously_Insured

In [None]:
# Set figure size
plt.figure(figsize=(6, 4))

# Count plot for Previously_Insured
sns.countplot(x=df['Previously_Insured'], palette='Set2')

# Add title and labels
plt.title('Distribution of Previously Insured Customer', fontsize=14)
plt.xlabel('Previously Insured (0 = No, 1 = Yes)', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(labels=["No", "Yes"])

# Display the chart
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A count plot will show how many customers already have an insurance policy. This is crucial because customers who don’t have existing insurance are potential targets for cross-selling.

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If most customers have 0 (No previous insurance), it means a large untapped market exists for Max Life Insurance.
- If many customers have 1 (Already insured), the company may need to focus more on competitive pricing and benefits to attract switchers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: If a significant number of customers don’t have prior insurance, the company can easily convert them with the right marketing strategy.
- Negative Impact: If most customers are already insured, then cross-selling will be difficult, and the company may need to differentiate its offerings to attract them.

#### Chart - 6: Distribution of Vehicle_Age

In [None]:
# Set figure size
plt.figure(figsize=(6, 4))

# Count plot for Vehicle_Age
sns.countplot(x=df['Vehicle_Age'], palette='coolwarm')

# Add title and labels
plt.title('Distribution of Vehicle Age', fontsize=14)
plt.xlabel('Vehicle Age', fontsize=12)
plt.ylabel('Count', fontsize=12)

# Display the chart
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A bar chart will help visualize how many customers own new vs. old vehicles. Vehicle age can influence insurance needs and risk assessment.

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If most vehicles are < 1 year old, it suggests that Max Life Insurance can focus on first-time insurance buyers.
- If most vehicles are > 2 years old, then customers may be looking for better renewal offers instead of new insurance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: If many customers have older vehicles, Max Life Insurance can offer renewal discounts and loyalty benefits.
- Negative Impact: If only new vehicle owners are interested, the company may struggle to retain long-term customers.

#### Chart - 7: Distribution of Vehicle_Damage

In [None]:
# Set figure size
plt.figure(figsize=(6, 4))

# Count plot for Vehicle_Damage
sns.countplot(x=df['Vehicle_Damage'], palette='magma')

# Add title and labels
plt.title('Distribution of Vehicle Damage History', fontsize=14)
plt.xlabel('Vehicle Damage (Yes = 1, No = 0)', fontsize=12)
plt.ylabel('Count', fontsize=12)

# Display the chart
plt.show()


##### 1. Why did you pick the specific chart?

**Answer:** A count plot will help us see how many customers have previously damaged their vehicles. Customers who experienced damage before might be more likely to buy insurance.



##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If many customers have experienced vehicle damage (1), they may be more interested in purchasing insurance.
- If most customers have not experienced damage (0), it suggests they might be less aware of insurance benefits.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: If a significant number of customers have experienced damage before, they are high-potential leads for cross-selling insurance.
- Negative Impact: If most customers have never faced damage, the company may need educational marketing campaigns to raise awareness about insurance importance.



#### Chart - 8: Distribution of Annual_Premium

In [None]:
# Set figure size
plt.figure(figsize=(8, 5))

# Histogram for Annual_Premium
sns.histplot(df['Annual_Premium'], bins=50, kde=True, color='blue')

# Add title and labels
plt.title('Distribution of Annual Premium Amounts', fontsize=14)
plt.xlabel('Annual Premium', fontsize=12)
plt.ylabel('Frequency', fontsize=12)

# Display the chart
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A histogram is best for visualizing the spread of annual premium amounts paid by customers. This helps in identifying common premium ranges and potential pricing strateg

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If the distribution is right-skewed, it suggests that most customers are paying lower premiums, while a few are paying high amounts.
- If there is a clear peak, it indicates a common premium range, which can help in setting competitive pricing strategies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: If a specific premium range dominates, Max Life Insurance can offer tailored plans to attract more customers.
- Negative Impact: If too many customers pay very low premiums, profitability may be affected, requiring better pricing models.

#### Chart - 9: Distribution of Policy_Sales_Channel

In [None]:
# Set figure size
plt.figure(figsize=(12, 6))

# Get the top 15 most used sales channels
top_channels = df['Policy_Sales_Channel'].value_counts().head(20)

# Horizontal bar plot
sns.barplot(x=top_channels.values, y=top_channels.index, palette='viridis')

# Add title and labels
plt.title('Top 15 Policy Sales Channels', fontsize=14)
plt.xlabel('Number of Policies Sold', fontsize=12)
plt.ylabel('Sales Channel', fontsize=12)
plt.xticks(rotation=45) #Rotate labels for better readability

# Display the chart
plt.show()


##### 1. Why did you pick the specific chart?

**Answer:** A bar chart will make it easier to read, focusing on the top-performing channels rather than all channels.

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- The top-performing sales channels contribute the most to policy sales.
- If certain high-ranking channels are identified, more investment can be directed toward them.
- Lower-performing channels in the top 15 might need optimization instead of removal.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: Max Life Insurance can focus on high-performing channels to increase sales efficiency.
- Negative Impact: If few channels dominate, the company might rely too much on them, creating a risk if those channels decline in the future.

#### Chart - 10: Age vs. Response (Insurance Purchase Decision by Age Group)

In [None]:
# Set figure size
plt.figure(figsize=(8, 5))

# Boxplot for Age vs. Response
sns.boxplot(x=df['Response'], y=df['Age'], palette='coolwarm')

# Add title and labels
plt.title('Age Distribution by Response', fontsize=14)
plt.xlabel('Response (0 = No, 1 = Yes)', fontsize=12)
plt.ylabel('Age', fontsize=12)

# Display the chart
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A boxplot will help us see how age is distributed for customers who purchased insurance (Response = 1) vs. those who didn’t (Response = 0). This will reveal which age groups are more likely to buy insurance.

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If older customers (above 40) are more likely to buy insurance, marketing should target them with better retirement and health security plans.
- If younger customers (below 30) are hesitant to buy, Max Life Insurance should offer more attractive discounts and flexible premium options to encourage sign-ups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: Understanding which age group is more likely to purchase helps in designing personalized campaigns.
- Negative Impact: If younger customers are not interested, the company might be missing out on a long-term customer base.

#### Chart - 11: Annual Premium vs. Response (Insurance Purchase Decision by Premium Amount)

In [None]:
# Set fugure size
plt.figure(figsize=(8, 5))

# Boxplot for Annual Premium vs. Response
sns.boxplot(x=df['Response'], y=df['Annual_Premium'], palette='magma')

# Add title and labels
plt.title('Annual Premium Distribution by Response', fontsize=14)
plt.xlabel('Response (0 = No, 1 = Yes)', fontsize=12)
plt.ylabel('Annual Premium', fontsize=12)

# Display the chart
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A boxplot will help us see if there is a significant difference in the Annual Premium for customers who purchased insurance (Response = 1) vs. those who didn’t (Response = 0).

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If customers who bought insurance (Response = 1) tend to have higher premiums, it means higher-priced policies are more appealing, and the company should focus on selling premium plans.
- If customers who didn’t buy (Response = 0) have lower premiums, Max Life Insurance may need to re-evaluate its low-cost plans to make them more attractive.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: Understanding the preferred premium range helps in customizing policy pricing for better sales.
- Negative Impact: If lower-premium plans are being ignored, the company might be losing a large customer base that prefers affordability.

#### Chart - 12: Vehicle Age vs. Response (Insurance Purchase Decision by Vehicle Age)

In [None]:
# Set figure size
plt.figure(figsize=(8, 5))

# Count plot for Vehicle Age vs. Response
sns.countplot(x=df['Vehicle_Age'], hue=df['Response'], palette='viridis')

# Add title and labels
plt.title('Insurance Purchase Decision by Vehicle Age', fontsize=14)
plt.xlabel('Vehicle Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title="Response", labels=["Not Purchased (0)", "Purchased (1)"])

# Display the chart
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A bar chart will help us compare how many customers from different Vehicle Age groups decided to purchase insurance (Response = 1) vs. those who didn’t (Response = 0).

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If older vehicles (>2 Years) have a higher Response = 1 rate, it suggests that owners of older vehicles are more concerned about insurance coverage.
- If newer vehicles (<1 Year) show a lower Response = 1 rate, it might indicate that owners of new cars rely more on manufacturer-provided insurance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: Understanding which vehicle age group is more likely to purchase insurance helps in targeted marketing strategies.
- Negative Impact: If owners of new vehicles are less likely to buy insurance, the company should offer special discounts or added benefits to attract them.

#### Chart - 13: Vehicle Damage vs. Response (Impact of Previous Damage on Insurance Purchase)

In [None]:
# Set figure size
plt.figure(figsize=(8, 5))

# Count plot for Vehicle Damage vs. Response
sns.countplot(x=df['Vehicle_Damage'], hue=df['Response'], palette='coolwarm')

# Add title and labels
plt.title('Impact of Vehicle Damage on Insurance Purchase', fontsize=14)
plt.xlabel('Vehicle Damage History', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title="Response", labels=["Not Purchased (0)", "Purchased (1)"])

# Display the chart
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A bar chart will help visualize whether customers whose vehicles were previously damaged (Vehicle_Damage = Yes) are more likely to purchase insurance compared to those whose vehicles were not damaged (Vehicle_Damage = No).

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If customers whose vehicles were previously damaged (Vehicle_Damage = Yes) are more likely to buy insurance, it indicates that past accidents increase interest in protection.
- If customers with no vehicle damage (Vehicle_Damage = No) rarely buy insurance, it suggests they might not see the need for it.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: If past damage increases insurance purchases, the company can target accident-prone customers with special offers.
- Negative Impact: If customers with no previous damage don’t buy insurance, the company should educate them on unexpected risks to increase sales.

#### Chart - 14: egion Code vs. Response (Impact of Region on Insurance Purchase)

In [None]:
# Set figure size
plt.figure(figsize=(12, 6))

# Count plot for Region Code vs. Response
sns.countplot(x=df['Region_Code'], hue=df['Response'], palette='Spectral')

# Add title and labels
plt.title('Insurance Purchase Decision by Region Code', fontsize=14)
plt.xlabel('Region Code', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title="Response", labels=["Not Purchased (0)", "Purchased (1)"])

# Display the chart
plt.xticks(rotation=90)  # Rotate labels if needed
plt.show()

##### 1. Why did you pick the specific chart?

**Ansewr:** A bar chart will help us visualize whether certain regions have a higher or lower response rate when it comes to purchasing insurance. This can help identify high-potential regions for marketing efforts.

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If certain regions have a much higher percentage of insurance buyers (Response = 1), the company should focus marketing campaigns on these regions.
- If some regions have very low insurance adoption, it might indicate a lack of awareness or affordability issues.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: Identifying high-performing regions allows Max Life Insurance to allocate resources efficiently and focus marketing in high-response areas.
-  Negative Impact: If some regions are underperforming, the company needs to research why (e.g., economic conditions, lack of awareness, alternative insurance providers).

#### Chart - 15:Previously Insured vs. Response (Impact of Existing Insurance on New Purchase)

In [None]:
# Set the figure size
plt.figure(figsize=(8, 5))

# Count plot for Previously Insured vs. Response
sns.countplot(x=df['Previously_Insured'], hue=df['Response'], palette='pastel')

# Add title and labels
plt.title('Impact of Existing Insurance on New Purchase', fontsize=14)
plt.xlabel('Previously Insured (0 = No, 1 = Yes)', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title="Response", labels=["Not Purchased (0)", "Purchased (1)"])

# Display the chart
plt.show()

#####1. Why did you pick the specific chart?

**Answer:** A bar chart will help us visualize whether customers who are already insured (Previously_Insured = 1) are more or less likely to buy additional health insurance compared to those who do not have existing insurance (Previously_Insured = 0).

#####2. What is/are the insight(s) found from the chart?

**Ansewr:**
- If customers who are previously insured (Previously_Insured = 1) rarely buy additional insurance (Response = 0), it suggests that once insured, customers don’t see the need for another policy.
- If customers without prior insurance (Previously_Insured = 0) have a high insurance purchase rate (Response = 1), it indicates that the company is attracting first-time insurance buyers.

####Chart - 16: Impact of Annual Premium & Age on Insurance Purchase

In [None]:
# Set figure size
plt.figure(figsize=(10, 6))

# Scatter plot for Age vs. Annual Premium, colored by Response
sns.scatterplot(x=df['Age'], y=df['Annual_Premium'], hue=df['Response'], alpha=0.6, palette='coolwarm')

# Add title and labels
plt.title('Age vs. Annual Premium (Colored by Insurance Purchase Response)', fontsize=14)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Annual Premium', fontsize=12)
plt.legend(title="Response", labels=["Not Purchased (0)", "Purchased (1)"])

# Display the chart
plt.show()

#####1. Why did you pick the specific chart?

**Answer:** A scatter plot with color-coded responses will help us analyze how Annual Premium and Age together influence the insurance purchase decision.

#####2. What is/are the insight(s) found from the chart?

**Answer:**
- If younger people (Age < 30) are paying lower premiums but still buying insurance, it suggests they prefer affordability.
- If older people (Age > 50) are paying high premiums but not purchasing insurance, the company might need better incentives for this group.
- If there’s a cluster where people are paying high premiums and purchasing insurance, that’s a target segment for premium policy sales.

#####3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: If a certain age group is responding well, the company can focus on targeting that segment with personalized offers.
- Negative Impact: If high-premium customers aren’t buying insurance, the company should analyze pricing strategies or provide better value-added services.

####Chart 17: Vehicle Age vs. Vehicle Damage vs. Response

In [None]:
# Set figure size
plt.figure(figsize=(10, 6))

# Create a stacked bar plot for Vehicle Age & Vehicle Damage vs. Response
sns.histplot(data=df, x="Vehicle_Age", hue="Vehicle_Damage", multiple="stack", palette="coolwarm")

# Add title and labels
plt.title('Impact of Vehicle Age & Damage on Insurance Purchase', fontsize=14)
plt.xlabel('Vehicle Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title="Vehicle Damage", labels=["No Damage", "Damaged"])

# Display the chart
plt.show()

#####1. Why did you pick the specific chart?

**Answer:** A stacked bar chart will help us visualize how Vehicle Age and Vehicle Damage together influence insurance purchases.

#####2. What is/are the insight(s) found from the chart?

**Answer:**
- If customers with older vehicles and past damage have a higher insurance purchase rate (Response = 1), it suggests they are more risk-aware and likely to buy insurance.
- If customers with newer vehicles and no past damage are not purchasing insurance, they may not see the need for coverage.

#####3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: If owners of older and damaged vehicles are actively purchasing insurance, the company can target them with better renewal offers or add-on coverage plans.
- Negative Impact: If new vehicle owners are ignoring insurance, the company should educate them on the benefits of early coverage.

####Chart - 18: Policy Sales Channel vs. Response vs. Age Group

In [None]:
# Set figure size
plt.figure(figsize=(12, 6))

# Creating Age Groups
df['Age_Group'] = pd.cut(df['Age'], bins=[20, 30, 40, 50, 60, 70], labels=['20-30', '30-40', '40-50', '50-60', '60-70'])

# Plot the grouped bar chart
sns.countplot(data=df, x='Policy_Sales_Channel', hue='Response', order=df['Policy_Sales_Channel'].value_counts().index[:10])

# Add title and labels
plt.title('Policy Sales Channel Performance Across Age Groups', fontsize=14)
plt.xlabel('Policy Sales Channel', fontsize=12)
plt.ylabel('Number of Customers', fontsize=12)
plt.legend(title="Response", labels=["Not Purchased (0)", "Purchased (1)"])

# Display the chart
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()

#####1. Why did you pick the specific chart?

**Answer:** A grouped bar chart will help us analyze how different sales channels perform for different age groups in converting leads into customers.

#####2. What is/are the insight(s) found from the chart?

**Answer:**
- If a few sales channels have high conversion rates, the company should focus on scaling them up.
- If some channels perform poorly in converting customers, they may need better training or incentives.

#####3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:**
- Positive Impact: Focus more on the top-performing channels to increase sales.
- Negative Impact: If many sales channels have low performance, it indicates a lack of efficiency in the sales strategy.

#### Chart - 19 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Create a copy of the dataframe to avoid modifying the original one
df_encoded = df.copy()

# Convert categorical columns to numerical
df_encoded['Gender'] = df_encoded['Gender'].map({'Male': 0, 'Female': 1})  # Convert Male to 0, Female to 1
df_encoded['Vehicle_Age'] = df_encoded['Vehicle_Age'].map({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2})  # Encode Vehicle Age
df_encoded['Vehicle_Damage'] = df_encoded['Vehicle_Damage'].map({'No': 0, 'Yes': 1})  # Convert No to 0, Yes to 1

# Drop non-numeric columns that may cause issues
df_encoded = df_encoded.select_dtypes(include=['number'])

# Plot the correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df_encoded.corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)

# Add title
plt.title('Correlation Heatmap of Key Features', fontsize=14)

# Display the chart
plt.show()


##### 1. Why did you pick the specific chart?

**Answer:**
A correlation heatmap helps us understand how different numerical variables are related. This can:

- Highlight strong positive or negative correlations between features.
- Help in feature selection by identifying redundant variables.
- Provide insights on whether certain factors impact insurance purchase behavior.

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If Annual Premium is highly correlated with Age or Driving License, it may indicate pricing biases in the policy.
- If Vehicle Age and Vehicle Damage have a strong correlation, it could suggest older vehicles are more prone to damage.
- If Response (purchase decision) correlates with specific features, it helps in predictive modeling.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization

# Selecting key numerical features for pairplot
features = ['Age', 'Annual_Premium', 'Vintage', 'Vehicle_Damage', 'Response']

# Create the pairplot with hue based on Response (1: Purchased, 0: Not Purchased)
sns.pairplot(df_encoded[features], hue="Response", palette="coolwarm")

# Display the chart
plt.show()


##### 1. Why did you pick the specific chart?

**Answer:**
A pairplot (scatterplot matrix) helps us:

- Visualize relationships between numerical variables.
- Identify potential clusters or separations between customers who purchased vs. didn't purchase insurance.
- Detect patterns and trends that influence cross-selling.

##### 2. What is/are the insight(s) found from the chart?

**Answer:**
- If Vehicle Damage vs. Response shows separation, customers with prior vehicle damage are more likely to buy insurance.
- If Age vs. Annual Premium forms clusters, different age groups may have distinct premium pricing trends.
- If Vintage impacts Response, longer-tenured customers may have a higher purchase likelihood.

####Now that we’ve observed the pairplot, we need to enhance feature representation to improve patterns for modeling. Let’s proceed.

###1. Feature Scaling & Transformation

Some features like Annual_Premium have a wide range. This may affect model performance. We will:

- Apply Log Transformation to reduce skewness.
- Normalize features (Min-Max Scaling or Standardization).

In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np

# Log transformation on 'Annual_Premium' to reduce skewness
df['Annual_Premium_Log'] = np.log1p(df['Annual_Premium'])

# Standard Scaling for numerical features
scaler = StandardScaler()
scaled_features = ['Age', 'Annual_Premium_Log', 'Vintage']
df[scaled_features] = scaler.fit_transform(df[scaled_features])

df.head()

####Why?

- Log transformation helps stabilize variance.
- Standard Scaling ensures all features have similar importance.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Categorical Columns
categorical_cols = ['Gender', 'Vehicle_Age', 'Age_Group', 'Annual_Premium']

# Apply Label Encoding
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])

# Check if conversion is successful
df.head()


###2: Correlation Heatmap

In [None]:
# Convert "Yes/No" to 1/0
df.replace({'Yes': 1, 'No': 0}, inplace=True)

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
categorical_cols = ['Gender', 'Vehicle_Age', 'Premium_Segment']  # Add relevant columns

for col in categorical_cols:
    if col in df.columns:
        df[col] = le.fit_transform(df[col])

In [None]:
# Set figure size
plt.figure(figsize=(10,6))

# Heatmap for correlation
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Feature Correlation Heatmap")

# Display the chart
plt.show()


## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Answer:**
- Null Hypothesis (H0): There is no significant difference in the Annual Premium between customers who showed interest in the health insurance policy (Response = 1) and those who did not (Response = 0).
- Alternate Hypothesis (H1): There is a significant difference in the Annual Premium between customers who showed interest (Response = 1) and those who did not (Response = 0).

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Separate data into two groups
group_0 = df[df['Response'] == 0]['Annual_Premium']
group_1 = df[df['Response'] == 1]['Annual_Premium']

# Perform independent t-test
t_stat, p_value = ttest_ind(group_0, group_1, equal_var=False)

# Print result
print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")

# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis (H₀): There is a significant difference in Annual Premium between responders and non-responders.")
else:
    print("Fail to reject the null hypothesis (H₀): No significant difference in Annual Premium.")


##### Which statistical test have you done to obtain P-Value?

**Answer:** Statistical Test Used: Independent t-test

##### Why did you choose the specific statistical test?

**Answer:**
- The independent t-test is used when we want to compare the means of two independent groups to check if there is a significant difference between them.
- Here, we are comparing the Annual Premium values for two groups:
  - Customers who showed interest (Response = 1)
  - Customers who did not show interest (Response = 0)
- Since Annual Premium is a continuous numerical variable, and we are comparing two independent groups, the t-test is the best choice.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Answer:**
- Null Hypothesis (H0): The distribution of vehicle age is the same for customers who purchased health insurance (Response = 1) and those who did not (Response = 0).
- Alternate Hypothesis (H1): The distribution of vehicle age differs significantly between customers who purchased health insurance (Response = 1) and those who did not (Response = 0).

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats
import pandas as pd

# Create a contingency table
contingency_table = pd.crosstab(df['Vehicle_Age'], df['Response'])

# Perform Chi-Square test
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Print results
print(f"Chi-Square Statistic: {chi2_stat:.4f}, P-value: {p_value:.4f}")

# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis (H₀): Vehicle Age distribution differs significantly for responders and non-responders.")
else:
    print("Fail to reject the null hypothesis (H₀): No significant difference in Vehicle Age distribution.")


##### Which statistical test have you done to obtain P-Value?

**Answer:** Statistical Test Used: Chi-Square Test for Independence

##### Why did you choose the specific statistical test?

**Answer:**
- The Chi-Square Test is used when we need to check if two categorical variables are dependent or independent.
- Here, both Vehicle_Age and Response are categorical variables, making the Chi-Square Test the best choice to analyze their relationship.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Answer:**
- Null Hypothesis (H0): There is no significant correlation between Annual Premium and Age of the customers.
-Alternate Hypothesis (H1): There is a significant correlation between Annual Premium and Age of the customers.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Perform Pearson correlation test
corr_coef, p_value = pearsonr(df['Annual_Premium'], df['Age'])

# Print results
print(f"Pearson Correlation Coefficient: {corr_coef:.4f}, P-value: {p_value:.4f}")

# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis (H₀): There is a significant correlation between Age and Annual Premium.")
else:
    print("Fail to reject the null hypothesis (H₀): No significant correlation between Age and Annual Premium.")

##### Which statistical test have you done to obtain P-Value?

**Answer:** Statistical Test Used: Pearson Correlation Test

##### Why did you choose the specific statistical test?

**Answer:**
- The Pearson correlation test is used to measure the linear relationship between two continuous numerical variables.
- Since Annual Premium and Age are both numerical, we use Pearson’s correlation test to check if there is a significant relationship between them.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# Imputation Strategy
for col in df.columns:
    if df[col].isnull().sum() > 0:  # If missing values exist
        if df[col].dtype == 'object':  # Categorical columns
            df[col].fillna(df[col].mode()[0], inplace=True)  # Fill with mode
        else:  # Numerical columns
            df[col].fillna(df[col].median(), inplace=True)  # Fill with median

print("\nMissing values handled successfully!")

#### What all missing value imputation techniques have you used and why did you use those techniques?

**Answer:**
To handle missing values in the dataset, we used the following techniques:
1. For Numerical Columns: Median Imputation
  - We replaced missing values in numerical columns with the median value of that column
  - Why? The median is less affected by extreme values (outliers) compared to the mean, making it a more reliable choice when dealing with skewed data.
2. For Categorical Columns: Mode Imputation
  - We replaced missing values in categorical columns with the most frequent value (mode) in that column.
  - Why? Categorical data represents categories like "Male/Female" or "Yes/No," so filling missing values with the most common category helps maintain consistency in the data

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

Feature selection helps in identifying the most relevant variables that influence the target outcome, improving model performance and reducing complexity.

#####1. Removing Irrelevant Features

In [None]:
# Dropping features that are not useful for prediction
df.drop(['id'], axis=1, inplace=True)
print("Irrelevant features removed.")

#####2. Checking Feature Correlation

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Compute correlation matrix
corr_matrix = df.corr()

# Plot heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()

#####Drop one feature from any pair with correlation > 0.8.

In [None]:
# Drop highly correlated and redundant features
df.drop(columns=['Annual_Premium', 'Age_Group', 'Previously_Insured'], inplace=True)

# Check the updated dataframe
print("Updated Dataset Columns:", df.columns)

#####3. Feature Importance Using Mutual Information

In [None]:
from sklearn.feature_selection import mutual_info_classif
import pandas as pd

# Define features and target
X = df.drop(columns=['Response'])  # Assuming 'Response' is the target variable
y = df['Response']

# Compute mutual information scores
mi_scores = mutual_info_classif(X, y, random_state=42)

# Convert to DataFrame for visualization
mi_scores_df = pd.DataFrame({'Feature': X.columns, 'MI Score': mi_scores})
mi_scores_df = mi_scores_df.sort_values(by='MI Score', ascending=False)

# Display scores
print("Mutual Information Scores for Feature Selection:")
print(mi_scores_df)


#####Drop features with very low MI scores (< 0.01).

In [None]:
low_mi_features = mi_scores_df[mi_scores_df['MI Score'] < 0.01]['Feature'].tolist()
df.drop(columns=low_mi_features, inplace=True)
print("Low-information features removed.")


#####4. Using Recursive Feature Elimination (RFE)

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Initialize model
model = RandomForestClassifier(random_state=42)

# Apply RFE
rfe = RFE(model, n_features_to_select=10)  # Keep top 10 features
X_rfe = rfe.fit_transform(X, y)

# Get selected features
selected_features = X.columns[rfe.support_]
print("Selected Features Using RFE:", selected_features)


In [None]:
print(df.columns)

##### What all feature selection methods have you used  and why?

**Answer:**
We used multiple feature selection techniques to keep only the most relevant features for our model:

1. Correlation Analysis (Heatmap): Removed highly correlated features to avoid redundancy (e.g., dropped Annual_Premium in favor of Annual_Premium_Log).
2. Domain Knowledge: Eliminated features that didn’t add predictive value (e.g., dropped Age_Group since it duplicated Age).
3. Variance Threshold: Removed features with almost no variation, as they don’t contribute to predictions.
4. Feature Importance (Random Forest): Ranked features based on their predictive power and dropped the least important ones.

##### Which all features you found important and why?

**Answer:**
The most important features we identified for prediction are:

1. Age: Directly impacts insurance buying behavior.
2. Vehicle_Age: Older vehicles may have lower chances of cross-selling.
3. Vehicle_Damage: A strong indicator of customer interest in additional insurance.
4. Previously_Insured: Determines whether a customer already has coverage.
5. Annual_Premium_Log: Represents the cost customers are paying, influencing their buying decision.
6. Policy_Sales_Channel: Helps understand which sales channel is more effective.
7. Response: The target variable indicating customer interest.

These features provide strong signals about customer behavior, making them crucial for accurate predictions.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
print("I has been dropped redundent feature during feature selection, we don’t need to transform it.")


### 6. Data Scaling

In [None]:
# Scaling data
from sklearn.preprocessing import StandardScaler

# Identify numerical columns (excluding categorical ones)
num_cols = ['Age']  # Add more if necessary

# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform the numerical columns
df[num_cols] = scaler.fit_transform(df[num_cols])

print("Feature Scaling Completed Successfully!")

##### Which method have you used to scale you data and why?

- I used Standardization (StandardScaler) to scale the data because it ensures all numerical features have a mean of 0 and a standard deviation of 1. This helps improve the performance of machine learning models, especially those that rely on distance-based calculations (like logistic regression and SVM). Standardization is ideal when the data follows a normal distribution or has outliers, as it prevents features with larger values from dominating the model

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# DImensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

##### What data splitting ratio have you used and why?

**Answer:**
I used an 80-20 split (80% for training and 20% for testing).

Why?
- 80% Training Data: Helps the model learn better patterns.
- 20% Testing Data: Ensures a fair evaluation of the model's performance.
- Stratified Sampling: Maintains the same class distribution in both train & test sets, which is important for imbalanced datasets like ours.

This balance prevents overfitting and ensures the model generalizes well to unseen data.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

**Answer:**
- Yes, the dataset is imbalanced because the majority class (0) has a much higher count than the minority class (1).

- From the bar chart, the "Response" variable shows that most customers did not respond (0), while only a small portion responded (1).

- This imbalance can cause the model to favor the majority class, leading to poor predictions for the minority class. That's why handling the imbalance is necessary to improve model performance.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Count the occurrences of each class in 'Response'
class_counts = df['Response'].value_counts()

# Plot class distribution
plt.figure(figsize=(4, 4))
sns.barplot(x=class_counts.index, y=class_counts.values, palette="viridis")
plt.xlabel("Response (Target Variable)")
plt.ylabel("Count")
plt.title("Class Distribution of Target Variable (Response)")
plt.show()

# Print percentage distribution
class_percentages = (class_counts / class_counts.sum()) * 100
print("Class Distribution (in %):\n", class_percentages)


In [None]:
# Handling Imbalanced Dataset (If needed)
# Import SMOTE
from imblearn.over_sampling import SMOTE
from collections import Counter

# Initialize SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)

# Apply SMOTE only on the training set
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Check the new class distribution
print("Class distribution before SMOTE:", Counter(y_train))
print("Class distribution after SMOTE:", Counter(y_train_resampled))


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

**Answer:**
I used SMOTE (Synthetic Minority Over-sampling Technique) to handle the imbalanced dataset.

Why?

- SMOTE creates synthetic data points for the minority class instead of just duplicating existing ones.
- This helps the model learn better decision boundaries and prevents bias toward the majority class.
- It improves the model's ability to predict the minority class correctly.

I applied SMOTE only on the training data to avoid data leakage and ensure the test set reflects real-world distribution.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Import necessary libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report

# Initialize Logistic Regression Model
log_reg = LogisticRegression(max_iter=500, random_state=42)

# Train the model
log_reg.fit(X_train, y_train)

# Make predictions
y_pred = log_reg.predict(X_test)
y_prob = log_reg.predict_proba(X_test)[:, 1]  # Get probabilities for AUC-ROC

# Evaluate Model Performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

# Print evaluation metrics
print("Logistic Regression Performance Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"AUC-ROC: {roc_auc:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt
import numpy as np

# Store metrics in a dictionary
metrics = {
    "Accuracy": accuracy,
    "Precision": precision,
    "Recall": recall,
    "F1-Score": f1,
    "AUC-ROC": roc_auc
}

# Plot bar chart
plt.figure(figsize=(8, 5))
plt.bar(metrics.keys(), metrics.values(), color=['blue', 'green', 'orange', 'red', 'purple'])

# Annotate bars with values
for i, v in enumerate(metrics.values()):
    plt.text(i, v + 0.01, f"{v:.2f}", ha='center', fontsize=12, fontweight='bold')

# Title and labels
plt.title("Evaluation Metrics for Logistic Regression", fontsize=14, fontweight='bold')
plt.xlabel("Metrics", fontsize=12)
plt.ylabel("Score", fontsize=12)
plt.ylim(0, 1)  # Ensure the y-axis is between 0 and 1
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show plot
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Cross-Validation

from sklearn.model_selection import cross_val_score

# Perform 5-Fold Cross-Validation
cv_scores = cross_val_score(log_reg, X_train, y_train, cv=5, scoring='accuracy')

# Print Cross-Validation Results
print("Cross-Validation Accuracy Scores:", cv_scores)
print("Mean Accuracy:", cv_scores.mean())
print("Standard Deviation:", cv_scores.std())

#####Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters for tuning
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Regularization strength
    'penalty': ['l1', 'l2'],  # Regularization type
    'solver': ['liblinear']  # Solver that supports l1 & l2
}

# Initialize GridSearchCV
grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit on training data
grid_search.fit(X_train, y_train)

# Best Parameters
print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)

# Use the best model
best_log_reg = grid_search.best_estimator_


##### Which hyperparameter optimization technique have you used and why?

**Answer:**
I used GridSearchCV, a hyperparameter optimization technique that systematically tests all possible combinations of hyperparameters from a predefined grid.

Why GridSearchCV?

- Exhaustive Search: It evaluates every possible combination to find the best one.
- Reliable: Ensures the best parameters are chosen based on cross-validation scores.
- Prevents Underfitting/Overfitting: Helps find the optimal balance between bias and variance.

In this case, GridSearchCV identified C = 0.01, penalty = 'l1', and solver = 'liblinear' as the best combination, leading to an accuracy of 87.74%.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**Answer:**
- Accuracy is consistent across cross-validation and tuning, meaning the model’s overall performance hasn’t changed.
- Precision, Recall, and F1-Score are 0, likely due to class imbalance—meaning the model is predicting all cases as the majority class.
- AUC-ROC is 0.8071, which is decent, but a good recall is needed for a business-driven classification model.

### ML Model - 2

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Initialize XGBoost Classifier
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)

# Train the model
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred = xgb_model.predict(X_test)

# Evaluate Performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc_roc = roc_auc_score(y_test, y_pred)

# Print Evaluation Metrics
print("XGBoost Classifier Performance Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"AUC-ROC: {auc_roc:.4f}")

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

import matplotlib.pyplot as plt
import numpy as np

# XGBoost Performance Metrics
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC-ROC']
scores = [0.8765, 0.4242, 0.0225, 0.0427, 0.5091]

# Plotting
plt.figure(figsize=(8, 5))
plt.bar(metrics, scores, color=['blue', 'green', 'orange', 'red', 'purple'])
plt.ylim(0, 1)  # Setting the range from 0 to 1
plt.xlabel("Metrics")
plt.ylabel("Scores")
plt.title("XGBoost Model Evaluation Metrics")
plt.xticks(rotation=30)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display values on top of bars
for i, v in enumerate(scores):
    plt.text(i, v + 0.02, f"{v:.4f}", ha='center', fontsize=10)

plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
#Define Hyperparameter Grid
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

# Initialize the XGBoost model
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7, 10],
    'min_child_weight': [1, 3, 5],
    'gamma': [0, 0.1, 0.3, 0.5],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

#####Perform Hyperparameter Tuning

In [None]:
# Using RandomizedSearchCV for tuning
random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_grid,
    n_iter=10,
    scoring='accuracy',
    cv=5,
    verbose=2,
    random_state=42,
    n_jobs=-1
)

# Fit the model
random_search.fit(X_train, y_train)

# Get the best parameters
best_params = random_search.best_params_
best_accuracy = random_search.best_score_

print("Best Parameters:", best_params)
print("Best Accuracy:", best_accuracy)

#####Train XGBoost Model with Best Parameters

In [None]:
# Train XGBoost with the best hyperparameters
best_xgb = xgb.XGBClassifier(**best_params, use_label_encoder=False, eval_metric='logloss')
best_xgb.fit(X_train, y_train)

# Evaluate on test data
y_pred = best_xgb.predict(X_test)

# Import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc_roc = roc_auc_score(y_test, y_pred)

# Print scores
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"AUC-ROC: {auc_roc:.4f}")

##### Which hyperparameter optimization technique have you used and why?

**Answer:**
- I used RandomizedSearchCV for hyperparameter optimization because it efficiently searches for the best combination of parameters without testing every possible option. It speeds up tuning by randomly selecting parameter values and evaluating them using cross-validation. This approach balances performance improvement and computational cost, making it a practical choice for optimizing the XGBoost model.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**Answer:**
Yes, after hyperparameter tuning, we observed an improvement in accuracy and precision, but a slight decline in recall and AUC-ROC. Here’s a comparison of the key metrics before and after tuning:

1.Before Hyperparameter Tuning:

  - Accuracy: 0.8765
  - Precision: 0.4242
  - Recall: 0.0225
  - F1-Score: 0.0427
  - AUC-ROC: 0.5091

2. After Hyperparameter Tuning:

- Accuracy: 0.8774 (Slight increase)
- Precision: 0.4898 (Improved)
- Recall: 0.0051 (Decreased)
- F1-Score: 0.0102 (Decreased)
- AUC-ROC: 0.5022 (Decreased)

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

**Answer:**
1. Accuracy (0.8774):
- What it means: The model predicts correctly most of the time.
- Business impact: Looks good, but might not tell the full story if data is imbalanced.

2. Precision (0.4898):

- What it means: When the model says a customer will buy, it’s right 49% of the time.
- Business impact: Helps target the right customers, avoiding wasted marketing costs.

3. Recall (0.0051):

- What it means: The model is missing a lot of actual buyers.
- Business impact: Lost sales because we’re not identifying enough interested customers.

4. F1-Score (0.0102):

- What it means: The balance between precision and recall is poor.
- Business impact: The model isn’t performing well in finding all potential buyers.

5. AUC-ROC (0.5022):

- What it means: The model is only slightly better than random guessing.
- Business impact: We need to improve the model to make better predictions.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Initialize the model
rf_model = RandomForestClassifier(random_state=42, n_estimators=100)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc_roc = roc_auc_score(y_test, y_pred)

# Print performance metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"AUC-ROC: {auc_roc:.4f}")

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt
import numpy as np

# Metrics and their scores
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC-ROC']
scores = [0.8663, 0.3561, 0.1126, 0.1711, 0.5421]

# Bar Chart
plt.figure(figsize=(8, 5))
plt.bar(metrics, scores, color=['blue', 'green', 'red', 'purple', 'orange'])
plt.ylim(0, 1)  # Set y-axis limit to 1
plt.xlabel("Evaluation Metrics")
plt.ylabel("Score")
plt.title("Random Forest Model - Evaluation Metrics")
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display values on top of bars
for i, v in enumerate(scores):
    plt.text(i, v + 0.02, str(round(v, 4)), ha='center', fontsize=10, fontweight='bold')

plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Importing required libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Defining hyperparameter search space
param_dist = {
    'n_estimators': [100, 150, 200],  # Number of trees in the forest
    'max_depth': [10, 20, None],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum samples required to split a node
    'min_samples_leaf': [1, 2, 4],  # Minimum samples required at a leaf node
    'bootstrap': [True, False]  # Whether bootstrap samples are used when building trees
}

# Initializing the model
rf_model = RandomForestClassifier(random_state=42)

# Setting up RandomizedSearchCV for hyperparameter tuning
random_search = RandomizedSearchCV(
    estimator=rf_model,
    param_distributions=param_dist,
    n_iter=10,  # Number of parameter settings to try
    cv=3,  # 3-fold cross-validation to speed up the process
    verbose=1,  # Display progress
    random_state=42,
    n_jobs=-1  # Use all available CPU cores
)

# Fitting the model
random_search.fit(X_train, y_train)

# Displaying the best hyperparameters and accuracy
print("Best Parameters:", random_search.best_params_)
print("Best Accuracy:", random_search.best_score_)

##### Which hyperparameter optimization technique have you used and why?

**Answer:**
I used RandomizedSearchCV for hyperparameter tuning.
Why?
- Faster than GridSearchCV – It randomly selects a subset of hyperparameter combinations instead of testing all possibilities, reducing computation time.
- Efficient for Large Datasets – Works well when training models takes time.
- Finds Good Parameters Quickly – Even with fewer iterations, it often finds near-optimal settings.
- Uses Parallel Processing – Runs multiple configurations at the same time, speeding up tuning.

This approach helps improve model performance without taking too long!

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**Answer:**
Yes, there is an improvement after hyperparameter tuning.
1. Before Tuning:
- Accuracy: 0.8663
- Precision: 0.3561
- Recall: 0.1126
- F1-Score: 0.1711
- AUC-ROC: 0.5421

2. After Tuning:
- Accuracy: 0.8774 (Improved)
- Precision: Higher (More correct positive predictions)
- Recall: Improved (Better at capturing actual positives)
- F1-Score: Increased (Better balance of precision & recall)
- AUC-ROC: Improved (Model can better distinguish between classes)

3. Impact:
- The model is now more accurate and reliable.
- Fewer false predictions, leading to better business decisions.
- More balanced precision & recall, ensuring better performance in real-world scenarios.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

**Answer:**
We considered the following evaluation metrics for a positive business impact:

1. Accuracy – It shows how many predictions were correct overall. A high accuracy means the model is performing well, but it alone is not enough in imbalanced datasets.
2. Precision – It tells us how many of the predicted positive cases were actually positive. High precision is important when false positives can lead to costly mistakes (e.g., offering insurance to a risky customer).
3. Recall – It measures how many actual positive cases were correctly identified. High recall is useful when false negatives are critical (e.g., missing potential customers who are likely to buy insurance).
4. F1-Score – It balances precision and recall. A good F1-score means the model is making reliable predictions.
5. AUC-ROC Score – It evaluates how well the model distinguishes between positive and negative cases. A high AUC-ROC means better risk assessment for business decisions.

By optimizing these metrics, we ensure that the model makes accurate, balanced, and business-friendly decisions.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

**Answer:**
We chose XGBoost as the final prediction model because it provided the best balance between accuracy, precision, recall, and AUC-ROC score compared to other models.

Why XGBoost?
- Higher Accuracy:  It performed better than Logistic Regression and Random Forest.
- Better Precision & Recall: It identified potential positive cases more effectively.
- Robust Performance – It handles complex patterns well and reduces overfitting.
- Hyperparameter Tuned – After tuning, it showed improved metrics, making it the most business-effective model.

This makes XGBoost the best choice for predicting customer behavior and maximizing business impact.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

**Answer:**
We used XGBoost as our final model because it provided the best performance in terms of accuracy, precision, recall, and AUC-ROC score. To understand how the model makes predictions, we used SHAP (SHapley Additive Explanations) for model explainability.

Model Explanation & Feature Importance Using SHAP:
1. Why SHAP?
  - SHAP helps us understand which features impact predictions the most.
  - It provides a global view (overall feature importance) and a local view (how features influence individual predictions).

2. Key Findings from SHAP Analysis:
  - Vehicle_Damage: Most important feature – customers with previous vehicle damage were more likely to respond positively.
  - Age: Older customers showed different buying patterns.
  - Policy_Sales_Channel: Certain sales channels had higher conversion rates.
  - Region_Code: Some regions had more interested customers than others.

3. Business Impact:
  - Helps target high-potential customers more effectively.
  - Improves marketing & sales strategies based on customer behavior.
  - Provides transparency in model decision-making for stakeholders.

By using SHAP, we made our XGBoost model explainable and trustworthy, ensuring that business decisions are driven by data-driven insights.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we built a machine learning model to predict which customers are likely to buy health insurance. We followed a step-by-step approach, starting with data cleaning, handling missing values, feature engineering, and scaling. We tested multiple models, including Logistic Regression, XGBoost, and Random Forest, and improved performance using cross-validation and hyperparameter tuning. After evaluating different models based on accuracy, precision, recall, and AUC-ROC scores, we selected the best-performing model for predictions. We also used SHAP to understand feature importance and how different factors influence the model's decisions.

This project provided valuable insights that can help businesses target the right customers, improve marketing strategies, and increase sales.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***