<a href="https://colab.research.google.com/github/Krishanu-Saha/data-science/blob/main/TRAIN_HEALTH_INSURANCE_CROSS_SELL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - HEALTH_INSURANCE_CROSS_SELL



##### **Project Type**    - Classification
##### **Contribution**    - Individual, BY KRISHANU SAHA


# **Project Summary -**

The project titled "Health Insurance Cross Sell" is a data science project that develops a predictive model to identify which customers of a health insurance company are likely to purchase vehicle insurance. The project was carried out in Python using libraries such as pandas, NumPy, Seaborn, and scikit-learn.

The dataset contains information about 381,109 customers, including their age, gender, driving license status, region, previously insured status, vehicle age, vehicle damage, annual premium, and policy sales channel. The objective of the project was to build a classification model that can predict if a customer will buy vehicle insurance or not, based on the given set of attributes.

The project involved several steps, starting with data cleaning and preprocessing. The dataset was checked for duplicate and missing values (none were found), and its categorical variables needed to be encoded into numerical values. The next step was exploratory data analysis to gain insights into the data and identify patterns or relationships between the variables.

After data cleaning and EDA, the dataset was split into training and testing sets, and several machine learning algorithms were applied to the training data to develop a predictive model: logistic regression, random forest, and XGBoost. Hyperparameters were tuned for each model, and the best values were selected using cross-validation.

The evaluation metrics used to measure the performance of the model were accuracy, precision, recall, F1 score, and ROC-AUC. The results showed that XGBoost, a gradient boosting algorithm, performed the best, with an accuracy of 0.87 and an F1 score of 0.56.
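
Because only about 12% of customers in this dataset responded positively, accuracy on its own can look high even for a model that never finds an interested customer, which is why precision, recall, and F1 are reported alongside it. The sketch below illustrates this with hypothetical confusion-matrix counts (they are not results from this project); `classification_metrics` is an illustrative helper, not a function from the notebook.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical baseline: 1,000 customers, 12% interested,
# and a "model" that predicts "not interested" for everyone.
baseline = classification_metrics(tp=0, fp=0, fn=120, tn=880)
print(baseline)
```

The baseline reaches an accuracy of 0.88 while its recall and F1 are both zero, so a useful model has to be judged on the minority-class metrics as well.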

In addition to developing a predictive model, the project also provided several business insights that could be useful for the health insurance company. For example, Previously_Insured had the strongest correlation with Response, and it was negative (about -0.34): customers who already had vehicle insurance were less likely to be interested, suggesting that the company should prioritise customers who are not yet insured. The analysis also showed that customers who had damaged their vehicles in the past were more likely to purchase insurance, suggesting that the company should target this segment with specific marketing campaigns.

Overall, the project demonstrated the potential of data science to extract insights from large datasets and develop predictive models that can help businesses make informed decisions. The project also highlighted the importance of data cleaning and exploratory data analysis in ensuring the accuracy and relevance of the model. The results of this project could be used by the health insurance company to improve its sales and marketing strategies, retain customers, and increase revenue.

# **GitHub Link -**

https://github.com/Krishanu-Saha/data-science/blob/main/TRAIN_HEALTH_INSURANCE_CROSS_SELL.ipynb

# **Problem Statement**


**Our client is an insurance company that has provided health insurance to its customers. It now needs your help in building a model to predict whether the policyholders (customers) from the past year will also be interested in the vehicle insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5000 each year for a health insurance cover of Rs. 200,000 so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5000, that is where the concept of probabilities comes into the picture. For example, like you, there may be 100 customers paying a premium of Rs. 5000 every year, but only a few of them (say 2-3) would be hospitalised that year. This way everyone shares the risk of everyone else.

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the provider pays compensation (called the 'sum assured') to the customer.

Building a model to predict whether a customer would be interested in vehicle insurance is extremely helpful for the company because it can then plan its communication strategy to reach out to those customers and optimise its business model and revenue.

To predict whether a customer would be interested in vehicle insurance, we have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.**
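
The risk-sharing arithmetic in the premium example above can be worked through in a few lines. This is only a sketch using the figures quoted in the problem statement; it assumes, as an upper bound, that every hospitalised customer uses the full cover, and `worst_case_payout` is an illustrative helper name.

```python
# Figures taken from the example in the problem statement
n_customers = 100
premium_per_customer = 5_000     # Rs. per year
cover_per_customer = 200_000     # Rs., maximum payout per customer

# Total premium collected from all customers in one year
premium_pool = n_customers * premium_per_customer


def worst_case_payout(n_claims: int) -> int:
    """Upper bound on payouts if every claimant uses the full cover."""
    return n_claims * cover_per_customer


for n_claims in (2, 3):
    payout = worst_case_payout(n_claims)
    print(f"{n_claims} claims: pool = Rs. {premium_pool:,}, "
          f"worst-case payout = Rs. {payout:,}, "
          f"balance = Rs. {premium_pool - payout:,}")
```

With two full claims the pool of Rs. 500,000 covers the payouts; with three full claims it falls short, which is why the insurer's pricing depends on how many customers claim and how much of the cover each claim actually uses.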

### ATTRIBUTE INFORMATION

id : Unique ID for the customer

Gender : Gender of the customer

Age : Age of the customer

Driving_License : 0 : Customer does not have a driving license (DL), 1 : Customer already has a DL

Region_Code : Unique code for the region of the customer

Previously_Insured : 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance

Vehicle_Age : Age of the Vehicle

Vehicle_Damage : 1 : Customer got his/her vehicle damaged in the past, 0 : Customer didn't get his/her vehicle damaged in the past

Annual_Premium : The amount customer needs to pay as premium in the year

Policy_Sales_Channel : Anonymized code for the channel of outreach to the customer, i.e. different agents, over mail, over phone, in person, etc.

Vintage : Number of days the customer has been associated with the company

Response : 1 : Customer is interested, 0 : Customer is not interested

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. For each chart, make sure the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. For each algorithm, make sure the following format is answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Install the pinned SHAP version before importing it (notebook shell command)
!pip install shap==0.40.0

# Basic
import math
import time
import warnings

import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

# Statistics, preprocessing and model selection
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedStratifiedKFold
from imblearn.over_sampling import SMOTE

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from xgboost import XGBRFClassifier

# Metrics
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve

# Model explanation and tree visualization
from sklearn.tree import export_graphviz
import shap
import graphviz

# Plot style and warning suppression
sns.set_style('darkgrid')
warnings.filterwarnings('ignore')

In [None]:
pip install missingno

In [None]:
import missingno as msno

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset
data = pd.read_csv('/content/drive/MyDrive/Almabetter /project/CLASSIFICATION/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION .csv')

### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(data[data.duplicated()])


No duplicate values found.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

No column has missing or null values.

In [None]:
# Visualizing the missing values
msno.matrix(data)

### What did you know about your dataset?

The shape of the dataset is (381109, 12), i.e. 381,109 rows and 12 columns.

No record is duplicated.

No column has missing or null values.
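
The three checks above (shape, duplicates, missing values) can be bundled into one reusable helper. The sketch below runs on a small made-up frame rather than the project data, and `summarize_frame` is a hypothetical name introduced only for illustration.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Return the shape, duplicate-row count and total missing-value count."""
    return {
        "shape": df.shape,
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isnull().sum().sum()),
    }


# Toy data: one fully duplicated row and one missing Age value
toy = pd.DataFrame({"id": [1, 2, 2, 4], "Age": [25, 40, 40, None]})
summary = summarize_frame(toy)
print(summary)
```

Calling `summarize_frame(data)` on the loaded dataset would report the same three facts listed above in a single step.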

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
list(data.columns)

In [None]:
# Dataset Describe
data.describe()


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
data.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
insurance_data = data.copy()

**Checking the distribution of the Response variable to see if it is balanced. This can be done by calculating the percentage of customers who are interested in purchasing insurance and comparing it to the percentage who are not interested.**

In [None]:
counts = insurance_data['Response'].value_counts()

percentage_interested = counts[1]/sum(counts)*100

percentage_not_interested = counts[0]/sum(counts)*100

# Print the results
print(f"Percentage of customers who are interested in purchasing insurance: {percentage_interested:.2f}%")
print(f"Percentage of customers who are not interested in purchasing insurance: {percentage_not_interested:.2f}%")

Percentage of customers who are interested in purchasing insurance: 12.26%

Percentage of customers who are not interested in purchasing insurance: 87.74%
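
The same class shares can be obtained in one step with `value_counts(normalize=True)`. A minimal sketch on made-up data (the toy `Response` column below is illustrative and is not the project dataset):

```python
import pandas as pd

# Toy Response column: 7 "not interested" (0) and 1 "interested" (1)
toy = pd.DataFrame({"Response": [0, 0, 0, 0, 0, 0, 0, 1]})

# normalize=True returns proportions instead of raw counts
shares = toy["Response"].value_counts(normalize=True).mul(100).round(2)
print(shares)
```

On the real data, replacing `toy` with `insurance_data` should reproduce the two percentages printed above.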

**Calculating the average age of customers who are interested in purchasing insurance and comparing it to the average age of customers who are not interested.**

In [None]:
interested_customers = insurance_data[insurance_data['Response'] == 1]
mean_age_interested = interested_customers['Age'].mean()

not_interested_customers = insurance_data[insurance_data['Response'] == 0]
mean_age_not_interested = not_interested_customers['Age'].mean()

# Print the results
print(f"Average age of customers who are interested in purchasing insurance: {mean_age_interested:.2f} years")
print(f"Average age of customers who are not interested in purchasing insurance: {mean_age_not_interested:.2f} years")

Average age of customers who are interested in purchasing insurance: 43.44 years

Average age of customers who are not interested in purchasing insurance: 38.18 years

**Calculating the average annual premium for customers who are interested in purchasing insurance and comparing it to the average annual premium for customers who are not interested.**

In [None]:
# Calculate the average annual premium of customers who are interested in purchasing insurance
interested_customers = insurance_data[insurance_data['Response'] == 1]
mean_premium_interested = interested_customers['Annual_Premium'].mean()

# Calculate the average annual premium of customers who are not interested in purchasing insurance
not_interested_customers = insurance_data[insurance_data['Response'] == 0]
mean_premium_not_interested = not_interested_customers['Annual_Premium'].mean()

# Print the results
print(f"Average annual premium of customers who are interested in purchasing insurance: ${mean_premium_interested:.2f}")
print(f"Average annual premium of customers who are not interested in purchasing insurance: ${mean_premium_not_interested:.2f}")


Average annual premium of customers who are interested in purchasing insurance: $31604.09

Average annual premium of customers who are not interested in purchasing insurance: $30419.16

**Calculating the percentage of customers who have a driving license.**

In [None]:
# Calculate the percentage of customers who have a driving license
license_counts = insurance_data['Driving_License'].value_counts()
percentage_licensed = license_counts[1] / sum(license_counts) * 100

# Print the result
print(f"Percentage of customers with a driving license: {percentage_licensed:.2f}%")


Percentage of customers with a driving license: 99.79%


**Calculating the percentage of customers who have previously purchased insurance.**

In [None]:
# Calculate the percentage of customers who have previously purchased insurance
prev_insurance_counts = insurance_data['Previously_Insured'].value_counts()
percentage_prev_insured = prev_insurance_counts[1] / sum(prev_insurance_counts) * 100

# Print the result
print(f"Percentage of customers who have previously purchased insurance: {percentage_prev_insured:.2f}%")


Percentage of customers who have previously purchased insurance: 45.82%


**Calculating the average number of days since policy start for customers who are interested in purchasing insurance and comparing it to the average number of days for customers who are not interested.**

In [None]:
# Calculate the average vintage for customers who are interested in purchasing insurance
interested_customers = insurance_data[insurance_data['Response'] == 1]
mean_vintage_interested = interested_customers['Vintage'].mean()

# Calculate the average vintage for customers who are not interested in purchasing insurance
not_interested_customers = insurance_data[insurance_data['Response'] == 0]
mean_vintage_not_interested = not_interested_customers['Vintage'].mean()

# Print the results
print(f"Average vintage for customers who are interested in purchasing insurance: {mean_vintage_interested:.2f}")
print(f"Average vintage for customers who are not interested in purchasing insurance: {mean_vintage_not_interested:.2f}")


Average vintage for customers who are interested in purchasing insurance: 154.11

Average vintage for customers who are not interested in purchasing insurance: 154.38
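
The per-group averages for Age, Annual_Premium, and Vintage computed in the separate cells above can also be produced in a single `groupby` call. The sketch below uses a small made-up frame with the same column names; the values are illustrative only.

```python
import pandas as pd

# Toy data with the same column names as the dataset (values are made up)
toy = pd.DataFrame({
    "Response": [1, 1, 0, 0],
    "Age": [44, 42, 36, 40],
    "Annual_Premium": [32000.0, 31000.0, 30000.0, 31000.0],
    "Vintage": [150, 158, 160, 148],
})

# One row per Response value, one column per averaged feature
group_means = toy.groupby("Response")[["Age", "Annual_Premium", "Vintage"]].mean()
print(group_means)
```

This avoids repeating the filter-then-mean pattern for every feature and keeps the interested and not-interested figures side by side.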

**Calculating the percentage of customers who have had vehicle damage in the past.**

In [None]:
# Calculate the proportion of customers who have had vehicle damage
vehicle_damage_count = insurance_data['Vehicle_Damage'].value_counts()
vehicle_damage_prop = vehicle_damage_count[1] / insurance_data.shape[0]

# Print the result
print(f"Percentage of customers who have had vehicle damage in the past: {vehicle_damage_prop*100:.2f}%")


Percentage of customers who have had vehicle damage in the past: 49.51%

**Checking the distribution of the Vehicle_Age variable by calculating the percentage of customers in each category.**


In [None]:
# Calculate the percentage of customers in each category of vehicle age
vehicle_age_count = insurance_data['Vehicle_Age'].value_counts()
vehicle_age_prop = vehicle_age_count / insurance_data.shape[0]*100

# Print the result
print(f"Percentage of customers in each category of vehicle age:\n{vehicle_age_prop}%")


Percentage of customers in each category of vehicle age:

'1-2 Year     52.561341'

'< 1 Year     43.238549'

'> 2 Years     4.200111'


**Calculating the correlation between the Response variable and the other numeric variables in the dataset using the corr function in pandas.**

In [None]:
# Calculate the correlation between Response and the other numeric variables.
# Non-numeric columns are dropped first, because DataFrame.corr raises an
# error on string columns in recent pandas versions.
corr = insurance_data.select_dtypes(include='number').corr()['Response'].sort_values()

# Print the result
print(f"Correlation between the Response variable and other variables:\n{corr}")


Correlation between the Response variable and other variables:

Previously_Insured     -0.341170

Policy_Sales_Channel   -0.139042

id                     -0.001368

Vintage                -0.001050

Driving_License         0.010155

Region_Code             0.010570

Annual_Premium          0.022575

Age                     0.111147

Response                1.000000

### What all manipulations have you done and insights you found?

Based on the given dataset, several manipulations were performed to gain insights into customer behavior. It was found that only 12.26% of customers were interested in purchasing insurance, while the remaining 87.74% were not interested. The average age of customers who were interested in purchasing insurance was 43.44 years, which was higher than the average age of customers who were not interested (38.18 years). The average annual premium for customers who were interested in purchasing insurance was $31604.09, which was slightly higher than the average annual premium for customers who were not interested ($30419.16).

Almost all customers (99.79%) had a driving license, and 45.82% of customers had previously purchased insurance. The average vintage for customers who were interested in purchasing insurance was 154.11 days, while for customers who were not interested, it was 154.38 days. Almost half of the customers (49.51%) had vehicle damage in the past, and the majority of customers had a vehicle age between 1-2 years (52.56%) or less than 1 year (43.24%).

Finally, the correlation between the Response variable and the other numeric variables was calculated. The strongest negative correlation was with Previously_Insured (-0.341170), followed by Policy_Sales_Channel (-0.139042). Age (0.111147) and Annual_Premium (0.022575) were positively correlated with Response, although only weakly. The correlations with id, Vintage, Driving_License, and Region_Code (0.010570) were negligible.
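
Gender, Vehicle_Age, and Vehicle_Damage do not appear in the correlation output, which suggests they are stored as text. One way to include such a column is to encode it first. The sketch below assumes, purely for illustration, that Vehicle_Damage holds the strings 'Yes' and 'No'; the toy values are made up and the mapping would need to match the actual data.

```python
import pandas as pd

# Toy data; the 'Yes'/'No' encoding of Vehicle_Damage is an assumption
toy = pd.DataFrame({
    "Vehicle_Damage": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "Previously_Insured": [0, 1, 0, 1, 0, 1],
    "Response": [1, 0, 1, 0, 0, 0],
})

# Map the text column to 0/1 so it can take part in the correlation
encoded = toy.assign(Vehicle_Damage=toy["Vehicle_Damage"].map({"Yes": 1, "No": 0}))

corr_with_response = (
    encoded.select_dtypes(include="number")
           .corr()["Response"]
           .drop("Response")
           .sort_values()
)
print(corr_with_response)
```

For a binary 0/1 column, the Pearson correlation computed this way is the same quantity as the point-biserial correlation, so the sign shows whether the flag goes with more or fewer positive responses.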

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
insurance_df = data.copy()

#### **Distribution of policyholders by age and gender**

In [None]:
# Create a new DataFrame with only the columns we need
df_age_gender = insurance_df[['Gender', 'Age']]

# Create separate DataFrames for males and females
df_males = df_age_gender[df_age_gender['Gender'] == 'Male']
df_females = df_age_gender[df_age_gender['Gender'] == 'Female']

# Plot histograms of age for each gender
fig, axs = plt.subplots(1, 2, figsize=(12, 6))
sns.histplot(data=df_males, x='Age', ax=axs[0], color='blue', alpha=0.5, bins=20)
sns.histplot(data=df_females, x='Age', ax=axs[1], color='pink', alpha=0.5, bins=20)

# Set plot titles and labels
axs[0].set_title('Age distribution for males')
axs[0].set_xlabel('Age')
axs[0].set_ylabel('Count')
axs[1].set_title('Age distribution for females')
axs[1].set_xlabel('Age')
axs[1].set_ylabel('Count')
fig.suptitle('Distribution of policyholders by age and gender')

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

The histogram is a popular graphing tool. It is used to summarize discrete or continuous data that are measured on an interval scale. It is often used to illustrate the major features of the distribution of the data in a convenient form.

##### 2. What is/are the insight(s) found from the chart?

The 20-30 age group has the highest counts for both male and female policyholders.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Both male and female policyholders are concentrated in the 20-30 age group, which can have a positive business impact for insurance companies: this is the largest segment of the customer base, so insurance products and campaigns aimed at it reach the most customers.

Furthermore, the insight that customers who are interested in purchasing insurance have a higher average age and are willing to pay a slightly higher annual premium than customers who are not interested can also be leveraged by insurance companies. It suggests that there may be a market for more expensive and comprehensive insurance products, which may result in higher revenue for the companies.

#### **Relationship between age and response to insurance:**

In [None]:
#Creating a box plot
sns.boxplot(x="Response", y="Age", data=insurance_df)

##### 1. Why did you pick the specific chart?

Box plots are used to show distributions of numeric data values, especially when you want to compare them between multiple groups. They are built to provide high-level information at a glance, offering general information about a group of data's symmetry, skew, variance, and outliers.

##### 2. What is/are the insight(s) found from the chart?

The box plot's centre line marks the median, not the mean: the median age for Response 1 is around 42, and for Response 0 it is around 35.
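
A box plot draws the median of each group, so the values read from it can differ from the group means calculated during data wrangling. The sketch below shows both statistics side by side on made-up ages; the numbers are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy ages: one older customer in each group pulls the mean above the median
toy = pd.DataFrame({
    "Response": [0, 0, 0, 0, 1, 1, 1, 1],
    "Age": [22, 24, 26, 60, 38, 42, 44, 60],
})

# Mean and median age per Response value
age_stats = toy.groupby("Response")["Age"].agg(["mean", "median"])
print(age_stats)
```

When a group's age distribution is right-skewed, as in this toy example, the mean sits above the median, which is one reason the two summaries should not be used interchangeably.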

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Insurance companies can leverage this insight to design insurance products that are more appealing to customers in their 40s and to target their marketing efforts towards this age group. This insight suggests that customers in their 40s may have higher disposable income and may be more willing to pay a slightly higher premium for comprehensive insurance products.

Therefore, this insight can help create a positive business impact by enabling insurance companies to better understand the age demographic of their potential customers and to design products and marketing campaigns that target this demographic more effectively.

However, this insight could also potentially lead to negative growth if insurance companies solely focus on targeting customers in their 40s and ignore other age groups. It is essential for insurance companies to carefully analyze their customer data to ensure they are targeting all age demographics effectively and not overlooking any potential customer segments.

#### **Distribution of policyholders by region**

In [None]:
# Count the number of policyholders in each region
region_count = insurance_df['Region_Code'].value_counts()

# Plot the distribution of policyholders by region
plt.figure(figsize = (15,8))
plt.bar(region_count.index, region_count.values)

# Add labels and title
plt.xlabel('Region Code')
plt.ylabel('Number of Policyholders')
plt.title('Distribution of Policyholders by Region')

# Show the plot
plt.show()

In [None]:
# Finding the region with the maximum occurrence
max_region = insurance_df['Region_Code'].mode()[0]

# Counting the number of customers in the most frequent region
num_customers = insurance_df[insurance_df['Region_Code'] == max_region]['Region_Code'].count()

# Printing the most frequent region and the corresponding number of customers in that region
print(f"The region code with the maximum customers is {max_region} with {num_customers} customers.")

##### 1. Why did you pick the specific chart?

A bar chart allows you to compare different sets of data among different groups easily. It instantly demonstrates this relationship using two axes, where the categories are on one axis and the various values are on the other. A bar graph can also illustrate important changes in data throughout a period of time.

##### 2. What is/are the insight(s) found from the chart?

The region code with the maximum customers is 28.0 with 106415 customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Based on the given information, the insight that the region code with the maximum customers is 28.0 with 106,415 customers may not have a significant positive or negative impact on business growth.

This insight could be useful for insurance companies to understand which regions have a higher concentration of potential customers. However, it is important to note that this insight does not provide any information on the demographic characteristics or purchasing behaviors of customers in this region.

Therefore, while this insight may be interesting from a descriptive perspective, it may not necessarily help insurance companies make strategic decisions that can lead to a positive impact on business growth. To create a positive impact on business growth, insurance companies may need to combine this insight with other insights, such as demographic characteristics, purchasing behaviors, and preferences of customers in this region, to develop effective marketing and product strategies.
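
One simple way to combine region size with purchasing behaviour is to report, for each region, both the number of customers and the share who responded positively. The sketch below uses made-up rows with the dataset's column names; the values are illustrative only.

```python
import pandas as pd

# Toy data: region 28.0 is the largest but has the lowest response rate
toy = pd.DataFrame({
    "Region_Code": [28.0, 28.0, 28.0, 28.0, 8.0, 8.0, 41.0, 41.0],
    "Response":    [0, 0, 0, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 column is the share of positive responses
region_summary = (
    toy.groupby("Region_Code")["Response"]
       .agg(customers="count", response_rate="mean")
       .sort_values("customers", ascending=False)
)
print(region_summary)
```

Sorting by `customers` keeps the largest regions at the top while the `response_rate` column shows whether their size is matched by interest.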

#### **Comparison of policyholders who responded to insurance with those who did not:**

In [None]:
# Group the data by Response and calculate the mean age for each group
age_response = insurance_df.groupby('Response')['Age'].mean()

# Create a bar chart of the mean age for each group
sns.barplot(x=age_response.index, y=age_response.values)
plt.title('Comparison of mean age for customers who responded to insurance')
plt.xlabel('Response')
plt.ylabel('Mean Age')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart allows you to compare different sets of data among different groups easily. It instantly demonstrates this relationship using two axes, where the categories are on one axis and the various values are on the other. A bar graph can also illustrate important changes in data throughout a period of time.

##### 2. What is/are the insight(s) found from the chart?

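Customers who responded positively have a higher mean age (above 40) than customers who did not, so customers over 40 appear more likely to respond to the insurance offer.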

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that people over the age of 40 are likely to respond to insurance buying can potentially have a positive business impact for insurance companies. By targeting this demographic with tailored marketing campaigns and products that cater to their needs, insurance companies can increase their sales and revenue.

However, it is important to note that not all insights lead to positive business impacts. For example, if an analysis of customer data reveals that a significant portion of customers are dissatisfied with a company's products or services, this insight may lead to negative growth if the company does not take appropriate action to address the issue.

In [None]:

# Categorizing Age feature
age_bins = [0, 30, 65, np.inf]
age_labels = ['YoungAdult', 'MiddleAdult', 'Senior']
insurance_df['Age_Group'] = pd.cut(insurance_df['Age'], bins=age_bins, labels=age_labels)

# Categorizing Policy_Sales_Channel feature
channel_bins = [0, 41, 81, 121, 165]
channel_labels = ['Channel_A', 'Channel_B', 'Channel_C', 'Channel_D']
insurance_df['Policy_Sales_Channel_Categorical'] = pd.cut(insurance_df['Policy_Sales_Channel'], bins=channel_bins, labels=channel_labels)

# Categorizing Region_Code feature
# include_lowest=True keeps a Region_Code equal to the lowest edge (0), if present,
# in the first bin; by default pd.cut intervals are open on the left.
region_bins = [0, 11, 21, 31, 41, 53]
region_labels = ['Region_E', 'Region_D', 'Region_C', 'Region_B', 'Region_A']
insurance_df['Region_Code_Categorical'] = pd.cut(insurance_df['Region_Code'], bins=region_bins, labels=region_labels, include_lowest=True)
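
`pd.cut` builds right-closed intervals by default, so a value equal to the lowest bin edge falls outside the first interval and becomes NaN unless `include_lowest=True` is passed. The sketch below demonstrates this with the region bin edges used above on a few made-up codes.

```python
import pandas as pd

# A few illustrative region codes, including the lowest bin edge (0)
codes = pd.Series([0, 11, 12, 52])
bins = [0, 11, 21, 31, 41, 53]
labels = ["Region_E", "Region_D", "Region_C", "Region_B", "Region_A"]

# Default behaviour: intervals are (0, 11], (11, 21], ... so 0 is not covered
default_cut = pd.cut(codes, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge
inclusive_cut = pd.cut(codes, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"code": codes, "default": default_cut, "inclusive": inclusive_cut}))
```

Checking `isna().sum()` on each binned column after `pd.cut` is a quick way to confirm that no value fell outside the chosen edges.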



#### **Distribution of Responses by Age Group,  Region Code and Policy Sales Channel**

In [None]:
# Set up the figure with 1 row and 3 columns, and a size of 22 x 5 inches
fig, axes = plt.subplots(1, 3, figsize=(22, 5))

# Plot a countplot of Age_Group, colored by Response, in the first column
sns.countplot(ax=axes[0], x='Age_Group', data=insurance_df, hue='Response')
axes[0].set_xlabel('Age Group', fontsize=14)
axes[0].set_ylabel('Count', fontsize=14)
axes[0].set_title('Distribution of Responses by Age Group', fontsize=15)

# Plot a countplot of Region_Code_Categorical, colored by Response, in the second column
sns.countplot(ax=axes[1], x='Region_Code_Categorical', data=insurance_df, hue='Response')
axes[1].set_xlabel('Region Code', fontsize=14)
axes[1].set_ylabel('Count', fontsize=14)
axes[1].set_title('Distribution of Responses by Region Code', fontsize=15)

# Plot a countplot of Policy_Sales_Channel_Categorical, colored by Response, in the third column
sns.countplot(ax=axes[2], x='Policy_Sales_Channel_Categorical', data=insurance_df, hue='Response')
axes[2].set_xlabel('Policy Sales Channel', fontsize=14)
axes[2].set_ylabel('Count', fontsize=14)
axes[2].set_title('Distribution of Responses by Policy Sales Channel', fontsize=15)

# Display the plot
plt.show()







##### 1. Why did you pick the specific chart?

I picked the countplot chart specifically for this example because we are interested in visualizing the distribution of a categorical variable ("Response") by other categorical variables ("Age_Group", "Region_Code_Categorical", and "Policy_Sales_Channel_Categorical").

The countplot is a good choice for this type of data because it shows the count of each category in a bar chart format, which makes it easy to compare the frequency of each category across different groups. Additionally, by using the hue parameter to color the bars by the "Response" variable, we can see the distribution of positive and negative responses within each category, which gives us more insight into the relationship between the different variables.

##### 2. What is/are the insight(s) found from the chart?

The countplots show that the middle adult age group has the highest number of responses, followed by the young adult and senior groups.

In terms of region codes, Region_C has the highest count, followed by Region_E.

Lastly, the Policy Sales Channel D has the highest count compared to other channels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from the countplots may help create a positive business impact in several ways.

For example, knowing that the middle adult age group has the highest number of responses could help the business to tailor their marketing strategies to target this group more effectively. Additionally, understanding which regions have the highest number of responses could help the business to allocate their resources more efficiently in those areas.

Similarly, knowing which policy sales channels have the highest count could help the business to focus their efforts on those channels, while possibly rethinking their strategy for channels that have a lower count.

Overall, having a better understanding of the relationship between these variables and the customer response rate could help the business to optimize their approach and improve their bottom line.

#### **Response V/S Gender**

In [None]:
# Creating a countplot using catplot to show the distribution of response by gender
sns.catplot(x="Response", hue="Gender", kind="count", palette="pastel", data=insurance_df)

# Adding x and y labels with font size specifications
plt.xlabel('Response', fontdict={'fontsize':12})
plt.ylabel('Count', fontdict={'fontsize':14})

# Adding a title with font size and weight specifications
plt.title('Response V/S Gender', fontdict={'fontsize':15, 'fontweight':'bold'})

# Displaying the plot
plt.show()


##### 1. Why did you pick the specific chart?

I picked the catplot with a count type and hue as "Gender" because it effectively shows the distribution of the "Response" variable by gender. The count type shows the frequency of responses in each category, and using "Gender" as the hue allows us to compare the response rate between male and female customers.

This type of visualization can be useful for understanding any potential differences in customer response rates based on gender and can help inform marketing strategies and target customer outreach.

##### 2. What is/are the insight(s) found from the chart?

The catplot with a count type and hue as "Gender" shows that in both Response categories, i.e., "0" and "1", male customers have a higher count than female customers. On its own this shows that male customers are more numerous in the data; whether they are also more responsive depends on the share of Response = 1 within each gender. Further analysis may also be needed to determine whether this pattern holds across different age groups or regions, and to identify any underlying factors influencing customer response rates.
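A minimal sketch of how counts and response rates can diverge, assuming a small made-up table (`toy` and its values are hypothetical; in the notebook the same calls would be applied to `insurance_df`):

```python
import pandas as pd

# Hypothetical toy data, not taken from the insurance dataset.
toy = pd.DataFrame({
    "Gender":   ["Male"] * 8 + ["Female"] * 2,
    "Response": [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# Raw counts per (Gender, Response) pair: this is what the count plot displays.
counts = toy.groupby(["Gender", "Response"]).size()

# Share of Response == 1 within each gender: the response rate.
rates = toy.groupby("Gender")["Response"].mean()

print(counts)
print(rates)
```

In this toy table men have the higher count in both response categories, yet the lower response rate, which is why the rate is the safer basis for a claim about responsiveness.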

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from data analysis can be useful in creating a positive business impact by helping companies identify customer segments that are more responsive to their products and services, and by informing targeted marketing strategies and customer outreach efforts. For example, the analysis of the "Response" variable by age group, region, and policy sales channel can help the company better understand customer preferences and tailor their marketing and outreach efforts accordingly.

However, there could be potential negative impacts from the analysis if the company uses the insights in a way that is discriminatory or violates customer privacy. For example, if the company were to use demographic information such as age or gender to target or exclude certain groups of customers, this could be seen as discriminatory and potentially harmful to the business.

#### **Analysis of Age Group and Previously Insured.**

In [None]:
# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(25, 8))

# Create a count plot on the first subplot
sns.countplot(ax=axes[0], x="Response", hue="Age_Group", palette="pastel",
              data=insurance_df)

# Set the x and y labels and title for the first subplot
axes[0].set_xlabel(xlabel='Response', fontdict={'fontsize': 14})
axes[0].set_ylabel(ylabel='Count', fontdict={'fontsize': 14})
axes[0].set_title('Age_Group', fontdict={'fontsize': 15, 'fontweight': 'bold'})

# Create a histogram plot on the second subplot
sns.histplot(ax=axes[1], binwidth=0.5, x="Age_Group",
             hue="Previously_Insured", data=insurance_df,
             stat="count", multiple="stack")

# Set the x and y labels and title for the second subplot
axes[1].set_xlabel(xlabel='Age_Group', fontdict={'fontsize': 14})
axes[1].set_ylabel(ylabel='Count', fontdict={'fontsize': 14})
axes[1].set_title('Age_Group V/S Previously_Insured', fontdict={'fontsize': 15, 'fontweight': 'bold'})

##### 1. Why did you pick the specific chart?

The first chart is a count plot, which is used to show the distribution of categorical variables. It shows the count of the Response variable based on the Age_Group variable using different colors for each age group. This chart is useful to get an idea of the number of people who responded to the insurance offer based on their age group.

The second chart is a histogram of the categorical variable Age_Group, with stacked colors for customers who are Previously_Insured or not. This plot is useful to show the relationship between the Age_Group and Previously_Insured variables, and how they are distributed in the data.

In summary, the specific charts were chosen based on the variables of interest and the best way to visualize them.

##### 2. What is/are the insight(s) found from the chart?

First plot: In response 0, both young adults and middle-aged adults have high counts, while in response 1, the count for middle-aged adults is the highest among all age groups.

Second plot: The ratio of young adults who are previously insured to those who are not previously insured is highest, as compared to middle-aged and senior adults, indicating that young adults are more likely to have prior insurance coverage.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In the first plot, the fact that middle-aged adults have the highest count in response 1 could be a valuable insight for a business that is trying to target its marketing efforts towards a specific age group. The business may decide to focus more on middle-aged adults in their advertising campaigns to capitalize on this finding.

On the other hand, there may be insights that could lead to negative growth. For example, if the second plot shows that young adults who are previously insured are more likely to switch to a different insurance provider, it could lead to negative growth for the current insurance provider. This could be because the current provider is not meeting the needs of these young adults, or because a competitor is offering a more attractive option.

#### **Vehicle_Damage V/S Response and Annual_Premium**

In [None]:
# Create subplots for two plots side by side
fig, axes = plt.subplots(1, 2, figsize=(22, 8))

# First plot - Point plot for Vehicle_Damage vs Response
sns.pointplot(ax=axes[0], x="Vehicle_Damage", y="Response", hue="Vehicle_Age", data=insurance_df)

# Set x and y labels and title for first plot
axes[0].set_xlabel(xlabel='Vehicle_Damage', fontdict={'fontsize': 14})
axes[0].set_ylabel(ylabel='Response', fontdict={'fontsize': 14})
axes[0].set_title('Vehicle_Damage V/S Response', fontdict={'fontsize': 15, 'fontweight': 'bold'})

# Second plot - Point plot for Vehicle_Damage vs Annual_Premium
sns.pointplot(x='Vehicle_Damage', y='Annual_Premium', data=insurance_df, ax=axes[1])

# Set x and y labels and title for second plot
axes[1].set_xlabel(xlabel='Vehicle_Damage', fontdict={'fontsize': 14})
axes[1].set_ylabel(ylabel='Annual_Premium', fontdict={'fontsize': 14})
axes[1].set_title('Vehicle_Damage V/S Annual_Premium', fontdict={'fontsize': 15, 'fontweight': 'bold'})

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

The specific chart used in the code is a point plot. A point plot is a good choice for visualizing the relationship between two variables when the independent variable is categorical and the dependent variable is numerical.

In the first plot, Vehicle_Damage is the independent categorical variable, and Response is the dependent variable. Because Response is binary (0/1), the mean of Response plotted for each category of Vehicle_Damage is the response rate of that category. The hue parameter is used to show the relationship between Vehicle_Damage, Response, and Vehicle_Age by using different colors for each level of Vehicle_Age.

In the second plot, Vehicle_Damage is again the independent categorical variable, and Annual_Premium is the dependent numerical variable. The plot shows the mean of Annual_Premium for each category of Vehicle_Damage.
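The values a point plot draws can be reproduced with a grouped mean. The sketch below assumes a small hypothetical table (`toy`, with made-up category labels and values) rather than `insurance_df`:

```python
import pandas as pd

# Hypothetical toy data; labels and values are illustrative only.
toy = pd.DataFrame({
    "Vehicle_Damage": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Vehicle_Age": ["> 2 Years", "> 2 Years", "1-2 Year", "1-2 Year",
                    "> 2 Years", "> 2 Years", "1-2 Year", "1-2 Year"],
    "Response": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of the binary Response per (Vehicle_Damage, Vehicle_Age) cell.
# Each cell corresponds to one point in the plot; each column to one hue line.
point_values = (
    toy.groupby(["Vehicle_Damage", "Vehicle_Age"])["Response"]
       .mean()
       .unstack("Vehicle_Age")
)

print(point_values)
```

Reading the table alongside the plot makes it easier to quote exact response rates instead of estimating them from the axis.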

##### 2. What is/are the insight(s) found from the chart?

The first plot shows that vehicle age is an important factor in determining the response rate to insurance offers after a vehicle has been damaged. The plot indicates that vehicles with an age greater than 2 years have the highest probability of getting a response, followed by vehicles with an age range of 1 to 2 years. This suggests that the age of the vehicle plays a role in determining the customer's willingness to purchase insurance after their vehicle has been damaged.

The second plot reveals an interesting relationship between vehicle damage and annual premium. The plot indicates that vehicles which have been damaged pay a higher annual premium than those which have not been damaged. This finding suggests that insurance companies are pricing their policies based on the risk associated with insuring a vehicle that has been previously damaged. Customers with damaged vehicles may be willing to pay a higher premium for coverage due to the increased risk associated with insuring a previously damaged vehicle.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the plots can potentially create a positive business impact for insurance companies. For example:

The first plot suggests that customers are more likely to respond to insurance offers when their vehicle is damaged and when the vehicle is over 2 years old. This insight could be used by insurance companies to target their marketing efforts towards customers with older vehicles or vehicles that have been previously damaged. This could potentially increase the response rate to insurance offers, leading to an increase in business for insurance companies.

The second plot suggests that customers with damaged vehicles pay a higher annual premium for insurance. This insight could be used by insurance companies to adjust their pricing strategies and offer higher premiums for customers with damaged vehicles. This could potentially increase revenue for insurance companies.

#### **Vehicle Age with hue Responses**

In [None]:
#Setting the figure size
plt.figure(figsize=(10, 8))

#Countplot of Vehicle age column.
sns.countplot(x = 'Vehicle_Age', hue='Response', data = insurance_df)

#Setting labels of x and y
plt.xlabel(xlabel = 'Vehicle_Age', fontdict={'fontsize': 14})
plt.ylabel(ylabel = 'Count', fontdict={'fontsize': 14})
plt.title('Vehicle_Age')


##### 1. Why did you pick the specific chart?

The chart is a count plot, which is used to show the distribution of categorical variables. It shows the number of customers in each Vehicle_Age category, with different colors for each Response value. This chart is useful to get an idea of how many people responded to the insurance offer for each vehicle age group.

##### 2. What is/are the insight(s) found from the chart?

The highest count of Response 1 occurs when the vehicle age is 1 to 2 years, followed by a vehicle age of less than 1 year.
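Since this insight is about counts, it can help to look at the response rate per vehicle age group as well. A minimal sketch with `pd.crosstab`, assuming a hypothetical toy table (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: many "1-2 Year" vehicles, few "> 2 Years" vehicles.
toy = pd.DataFrame({
    "Vehicle_Age": ["1-2 Year"] * 6 + ["> 2 Years"] * 2,
    "Response":    [1, 1, 0, 0, 0, 0, 1, 0],
})

# Absolute counts per group: what a count plot with hue='Response' shows.
counts = pd.crosstab(toy["Vehicle_Age"], toy["Response"])

# Row-normalised shares: the fraction of each Vehicle_Age group per Response.
rates = pd.crosstab(toy["Vehicle_Age"], toy["Response"], normalize="index")

print(counts)
print(rates)
```

Here the larger group contributes more responders in absolute terms while the smaller group responds at a higher rate, so the two views answer different questions.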

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that the highest response occurs when the vehicle age is between 1 to 2 years, followed by less than 1 year, can potentially help businesses make more informed decisions about their products and services related to vehicles.

For example, businesses that offer vehicle maintenance services can use this insight to target customers with vehicles in this age range. By offering maintenance packages and services that are tailored to the needs of vehicles in this age range, businesses can increase their sales and revenue.

Similarly, businesses that sell or lease vehicles can use this insight to inform their inventory and pricing decisions. By offering more vehicles in the 1 to 2-year age range and adjusting pricing accordingly, businesses may be able to attract more customers and increase their sales.

#### **Vehicle_Age V/S Annual_Premium**

In [None]:
#Setting the size of the figure
plt.figure(figsize = (15,8))
# plotting a box plot
sns.boxplot( y = 'Annual_Premium', x = 'Vehicle_Age', hue = 'Vehicle_Damage', data=insurance_df )

#Setting the labels and titles
plt.xlabel(xlabel = 'Vehicle_Age')
plt.ylabel(ylabel = 'Annual_Premium')
plt.title('Vehicle_Age V/S Annual_Premium')

##### 1. Why did you pick the specific chart?

Box and whisker plots are very effective and easy to read, as they can summarize data from multiple sources and display the results in a single graph. Box and whisker plots allow for comparison of data from different categories for easier, more effective decision-making.

Box plots are used to show distributions of numeric data values, especially when you want to compare them between multiple groups. They are built to provide high-level information at a glance, offering general information about a group of data's symmetry, skew, variance, and outliers.
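The quantities a box plot summarizes can be computed directly. This is a sketch on hypothetical premium values, assuming the common 1.5 × IQR whisker rule (the default in matplotlib and seaborn):

```python
import numpy as np

# Hypothetical premiums, with one unusually large value.
premiums = np.array([20, 25, 30, 35, 40, 45, 50, 200], dtype=float)

# Box edges and centre line: first quartile, median, third quartile.
q1, median, q3 = np.percentile(premiums, [25, 50, 75])

# The interquartile range sets how far the whiskers may extend.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are drawn individually as outliers.
outliers = premiums[(premiums < lower_fence) | (premiums > upper_fence)]

print(q1, median, q3, iqr, outliers)
```

Running the same calculation per Vehicle_Age group would give the numbers behind each box in the chart.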

##### 2. What is/are the insight(s) found from the chart?

Vehicles older than two years appear to be subject to higher premiums than the others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that vehicles older than two years appear to be subject to higher premiums than others can potentially help insurance companies adjust their pricing strategies and better target their customer segments.

For example, insurance companies can use this insight to adjust their pricing models for vehicles based on their age. By offering lower premiums for newer vehicles and higher premiums for older vehicles, insurance companies can potentially attract more customers and increase their revenue.

Additionally, insurance companies can use this insight to better target their customer segments. By identifying customers who own vehicles that are older than two years and offering them specific policies and services that are tailored to their needs, insurance companies can increase customer loyalty and retention.

#### **Analysis of Annual_Premium V/S Age_Group**

In [None]:
#Setting the size of figure
plt.figure(figsize = (15,8))

#plotting a bar chart
sns.barplot(y = 'Annual_Premium', x = 'Age_Group', data= insurance_df)

#Setting the labels and titles
plt.xlabel(xlabel = 'Age_Group')
plt.ylabel(ylabel = 'Annual_Premium')
plt.title('Annual_Premium V/S Age_Group')

##### 1. Why did you pick the specific chart?

Bar charts let us compare a numerical value across categories, using the length of each bar to represent the value. Seaborn's barplot() draws the mean of Annual_Premium for each Age_Group by default, with an error bar showing the uncertainty around that mean.

##### 2. What is/are the insight(s) found from the chart?

Senior customers appear to pay a slightly higher annual premium than Young Adults and MiddleAdults.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


#### **Analysis of Policy_Sales_Channel**

In [None]:
#Create subplots for six plots side by side
fig, axes = plt.subplots(2,3, figsize=(22,15))

#Creating a bar chart between Policy_Sales_Channel_Categorical and Vintage
sns.barplot(ax = axes[0][0], x='Policy_Sales_Channel_Categorical', y='Vintage',data=insurance_df)
axes[0][0].set_xlabel(xlabel = 'Policy_Sales_Channel_Categorical', fontdict={'fontsize': 14})
axes[0][0].set_ylabel(ylabel = 'Vintage', fontdict={'fontsize': 14})
axes[0][0].set_title('Policy_Sales_Channel V/S Vintage',
                    fontdict={'fontsize': 15, 'fontweight':'bold'})

#Creating a bar chart between Policy_Sales_Channel_Categorical and Annual_Premium
sns.barplot(ax = axes[0][1], x='Policy_Sales_Channel_Categorical', y='Annual_Premium',data=insurance_df)
axes[0][1].set_xlabel(xlabel = 'Policy_Sales_Channel_Categorical', fontdict={'fontsize': 14})
axes[0][1].set_ylabel(ylabel = 'Annual_Premium', fontdict={'fontsize': 14})
axes[0][1].set_title('Policy_Sales_Channel V/S Annual_Premium',
                    fontdict={'fontsize': 15, 'fontweight':'bold'})

#Creating a horizontal bar chart counting Policy_Sales_Channel_Categorical
insurance_df['Policy_Sales_Channel_Categorical'].value_counts().plot(ax = axes[0][2] ,kind='barh')
axes[0][2].set_xlabel(xlabel = 'Count', fontdict={'fontsize': 14})
axes[0][2].set_ylabel(ylabel = 'Policy_Sales_Channel_Categorical', fontdict={'fontsize': 14})
axes[0][2].set_title('Policy_Sales_Channel', fontdict={'fontsize': 15, 'fontweight':'bold'})

#plotting histogram for Policy_Sales_Channel_Categorical with hue Response
sns.histplot(ax = axes[1][0],x="Policy_Sales_Channel_Categorical", hue="Response", data=insurance_df, stat="count",
            multiple="stack",binwidth=0.5)
axes[1][0].set_xlabel(xlabel = 'Policy_Sales_Channel_Categorical', fontdict={'fontsize': 14})
axes[1][0].set_ylabel(ylabel = 'Count', fontdict={'fontsize': 14})
axes[1][0].set_title('Policy_Sales_Channel', fontdict={'fontsize': 15, 'fontweight':'bold'})

#Summing Response (the number of interested customers) per Policy_Sales_Channel_Categorical
groupPolicySalesBySum = insurance_df.groupby("Policy_Sales_Channel_Categorical", as_index=False)["Response"].sum()

#Plotting the barchart for the grouped Policy_Sales_Channel_Categorical
sns.barplot(ax = axes[1][1], x="Policy_Sales_Channel_Categorical", y="Response", data=groupPolicySalesBySum)
axes[1][1].set_xlabel(xlabel = 'Policy_Sales_Channel_Categorical', fontdict={'fontsize': 14})
axes[1][1].set_ylabel(ylabel = 'Response', fontdict={'fontsize': 14})
axes[1][1].set_title('Policy_Sales_Channel V/S Response', fontdict={'fontsize': 15, 'fontweight':'bold'})

#Plotting the barchart between Policy_Sales_Channel_Categorical and Response with hue
sns.barplot(ax = axes[1][2], x='Policy_Sales_Channel_Categorical', y='Response', data=insurance_df, hue='Region_Code_Categorical')
axes[1][2].set_xlabel(xlabel = 'Policy_Sales_Channel_Categorical', fontdict={'fontsize': 14})
axes[1][2].set_ylabel(ylabel = 'Response', fontdict={'fontsize': 14})
axes[1][2].set_title('Policy_Sales_Channel V/S Response', fontdict={'fontsize': 15, 'fontweight':'bold'})


##### 1. Why did you pick the specific chart?

Bar charts enable us to compare numerical values like integers and percentages. They use the length of each bar to represent the value of each variable, so variations across categories or subcategories are easy to see.
Here they are used for the Policy_Sales_Channel_Categorical variable.

##### 2. What is/are the insight(s) found from the chart?

The average Annual Premium is highest for Channel B.

Channel D has the highest count followed by Channel A.
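The per-channel figures behind these statements can be gathered in a single table. A minimal sketch using pandas named aggregation, assuming a hypothetical toy table (`toy`, its channel labels, and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data; channel labels and premiums are made up.
toy = pd.DataFrame({
    "Policy_Sales_Channel_Categorical": [
        "Channel_A", "Channel_A", "Channel_B",
        "Channel_D", "Channel_D", "Channel_D",
    ],
    "Annual_Premium": [30000.0, 32000.0, 45000.0, 28000.0, 29000.0, 30000.0],
    "Response": [0, 1, 1, 0, 0, 1],
})

# One row per channel: customer count, mean premium, and response rate.
summary = toy.groupby("Policy_Sales_Channel_Categorical").agg(
    customers=("Response", "size"),
    mean_premium=("Annual_Premium", "mean"),
    response_rate=("Response", "mean"),
)

print(summary)
```

Keeping count, average premium, and response rate side by side helps avoid mixing up "largest channel" with "best converting channel".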

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights that the average Annual Premium is highest for Channel B, and that Channel D has the highest count followed by Channel A, can potentially help businesses make more informed decisions about their sales channels and pricing strategies.

For example, if a business is looking to optimize their sales channels to increase their revenue, the insight that Channel D has the highest count followed by Channel A can help guide their decision-making process. By investing more resources into these channels and tailoring their marketing efforts to appeal to their customers, businesses can potentially increase their sales and revenue.

Similarly, if a business is looking to adjust their pricing strategies to optimize their revenue, the insight that the average Annual Premium is highest for Channel B can be very useful. By offering products and services with higher premiums through this channel, businesses can potentially increase their revenue and profitability.

### **Analysis of Responses over categorical variables**

In [None]:
#Storing only categorical columns
categorical_columns = ['Gender', 'Age_Group', 'Region_Code_Categorical', 'Previously_Insured', 'Vehicle_Age','Vehicle_Damage', 'Policy_Sales_Channel_Categorical']

#Setting the 14 sub countplots side by side
fig, axes =  plt.subplots(2, 7, figsize=(45, 15))

for i in range(7):
    #Countplot of Response 1 for all the categorical columns.
    sns.countplot(data = insurance_df[insurance_df['Response']==1], x=categorical_columns[i], ax=axes[0][i])
    axes[0][i].set_xlabel(xlabel = categorical_columns[i], fontdict={'fontsize': 14})
    axes[0][i].set_ylabel(ylabel = 'Count', fontdict={'fontsize': 14})
    axes[0][i].set_title(categorical_columns[i],
                      fontdict={'fontsize': 15, 'fontweight':'bold'})

    #Countplot of Response 0 for all the categorical columns.
    sns.countplot(data = insurance_df[insurance_df['Response']==0], x=categorical_columns[i], ax=axes[1][i])

    axes[1][i].set_xlabel(xlabel = categorical_columns[i], fontdict={'fontsize': 14})
    axes[1][i].set_ylabel(ylabel = 'Count', fontdict={'fontsize': 14})
    axes[1][i].set_title(categorical_columns[i],
                      fontdict={'fontsize': 15, 'fontweight':'bold'})


##### 1. Why did you pick the specific chart?

Seaborn's countplot() shows the number of observations in each categorical bin using bars. A count plot can be thought of as a histogram over a categorical, rather than quantitative, variable.
Its API and options are largely the same as those for barplot(), so counts can also be compared across nested variables. Here it is drawn once per categorical column, separately for Response 1 and Response 0.

##### 2. What is/are the insight(s) found from the chart?

The first row shows count plots of the categorical variables for Response 1. The highest counts are:
Gender: Male; Age_Group: MiddleAdult; Region_Code_Categorical: Region_C; Previously_Insured: 0; Vehicle_Age: 1-2 years; Vehicle_Damage: Yes; Policy_Sales_Channel_Categorical: Channel D.

The second row shows count plots of the categorical variables for Response 0. The highest counts are:
Gender: Male; Age_Group: MiddleAdult and YoungAdult; Region_Code_Categorical: Region_C; Previously_Insured: 1; Vehicle_Age: 1-2 years; Vehicle_Damage: Yes; Policy_Sales_Channel_Categorical: Channel D.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights provided by these count plots can potentially help businesses make more informed decisions about their marketing strategies and target customer segments.

For example, the insight that the highest counts for Response 1 are Male, MiddleAdult, Region_C, and previously uninsured customers can help businesses tailor their marketing campaigns to appeal to this demographic. By offering targeted promotions and messaging that speaks to the specific needs and concerns of this demographic, businesses can increase their likelihood of success in attracting and retaining these customers.

Similarly, the insight that the highest counts for Response 0 are Male, MiddleAdult, and Youngadult, Region_C, previously insured customers, and customers with vehicle ages between 1-2 years can help businesses identify areas where they may be losing potential customers. By analyzing the reasons why these customers are not responding to their marketing efforts, businesses can adjust their strategies and messaging to better meet their needs and increase their likelihood of success in converting these customers.

#### **Correlation Heatmap**

In [None]:
df = insurance_df.copy()
df.columns


In [None]:
# Drop unnecessary columns
df = df.drop(['id', 'Driving_License', 'Age_Group','Policy_Sales_Channel_Categorical','Region_Code_Categorical'], axis=1)

# Create a correlation matrix from the numeric columns only
corr = df.corr(numeric_only=True)

# Generate a heatmap of the correlation matrix
plt.figure(figsize=(10,8))
sns.heatmap(corr, cmap="YlGnBu", annot=True)
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

Correlation heatmaps show the pairwise correlation coefficients between numeric variables at a glance, which makes it easy to spot potential relationships and judge their strength. Because df.corr() computes the Pearson coefficient by default, the heatmap captures linear association; nonlinear relationships and outliers are better inspected with scatter or box plots.

##### 2. What is/are the insight(s) found from the chart?

With respect to Response, Age appears to be slightly positively correlated and the Previously_Insured column slightly negatively correlated.
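To make the sign and meaning of these coefficients concrete, the sketch below computes a Pearson correlation by hand and compares it with `DataFrame.corr()`. The table `toy` and its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the notebook.
toy = pd.DataFrame({
    "Age": [22, 25, 31, 40, 45, 52],
    "Previously_Insured": [1, 1, 1, 0, 0, 0],
    "Response": [0, 0, 0, 1, 0, 1],
})

# Pairwise Pearson correlations (the pandas default method).
corr = toy.corr()

# The same coefficient for Age vs Response, written out term by term.
x = toy["Age"].to_numpy(dtype=float)
y = toy["Response"].to_numpy(dtype=float)
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(corr["Response"])
print(r_manual)
```

A positive value means responders tend to have larger values of the variable, and a negative value means the opposite; the magnitude only describes the linear part of the relationship.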

####  **Pair Plot**

In [None]:
# Generate a pairplot
sns.pairplot(df, hue='Response', corner=True)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is handy for exploratory data analysis (“EDA”).

It plots pairwise relationships in a dataset: a scatter plot for each pair of variables and the distribution of each variable on the diagonal, with hue='Response' coloring the points by response class.

pairplot() is a function in the seaborn library, which provides a high-level interface for drawing attractive and informative statistical graphics.

##### 2. What is/are the insight(s) found from the chart?

We observed the relationships between the different variables and their effect on each other.

#### **Overall Insights**

The exploratory analysis above provides insights into various factors and their relationship with the response to insurance offers. Here is a summary of the key points:

Age and Previously_Insured: Age shows a slight positive correlation with response, while the previously_insured column exhibits a slight negative correlation.

Categorical variables for Response 1: The highest counts in Response 1 are observed for Gender (Male), Age_Group (MiddleAdult), Region_Code_Categorical (Region_C), Previously_Insured (0), Vehicle_Age (1-2 years), Vehicle_Damage (Yes), and Policy_Sales_Channel_Categorical (Channel D).

Categorical variables for Response 0: The highest counts in Response 0 are observed for Gender (Male), Age_Group (MiddleAdult and YoungAdult), Region_Code_Categorical (Region_C), Previously_Insured (1), Vehicle_Age (1-2 years), Vehicle_Damage (Yes), and Policy_Sales_Channel_Categorical (Channel D).

Average Annual Premium: Channel B has the highest average annual premium.

Policy Sales Channels: Channel D has the highest count, followed by Channel A.

Age and Annual Premium: Senior people tend to pay slightly higher annual premiums compared to Young Adults and MiddleAdults.

Vehicle Age and Response: Vehicles older than two years have the highest response rate, followed by vehicles with an age range of 1 to 2 years.

Vehicle Damage and Premium: Vehicles that have been damaged tend to have higher annual premiums compared to undamaged vehicles, suggesting increased pricing due to the associated risk.

Gender and Response: Male customers have a higher count than female customers in both response categories, which reflects their larger share of the data; responsiveness should be judged by the response rate within each gender.

Age Groups and Response: MiddleAdults have the highest count in both response categories, followed by Youngadults and Senior adults.

Region Codes: Region_C has the highest count, followed by Region_E.

Age and Response: Customers over the age of 40 are more likely to respond to insurance offers.

Region Code with Maximum Customers: Region code 28.0 has the highest number of customers (106,415).

Mean Age: The mean age for Response 1 is around 42, while for Response 0, it is around 35.

Age Group Counts: The counts of the 20-30 age group are the highest for both males and females.

This summary provides an overview of the correlations, patterns, and trends observed in the analysis of responses to insurance offers.

## ***5. Hypothesis Testing***

In [None]:
df = data.copy()

### Hypothetical Statement - 1

#### Hypothesis Test for Age and Response:

Null Hypothesis: There is no significant difference in the mean age of customers who responded to the insurance offer (Response = 1) and those who did not respond (Response = 0).

Alternative Hypothesis: The mean age of customers who responded to the insurance offer (Response = 1) is significantly different from those who did not respond (Response = 0).

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind
# separate the two groups based on interest in insurance
df_response1 = df[df['Response'] == 1]
df_response0 = df[df['Response'] == 0]

# perform the t-test
t_stat, p_value = ttest_ind(df_response1['Age'], df_response0['Age'], equal_var=False)

alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis: The mean age of customers who responded to the insurance offer is significantly different from those who did not respond.")
else:
    print("Fail to Reject Null Hypothesis: There is no significant difference in the mean age of customers who responded to the insurance offer and those who did not respond.")


Reject Null Hypothesis: The mean age of customers who responded to the insurance offer is significantly different from those who did not respond.

##### Which statistical test have you done to obtain P-Value?

The p-value was obtained from an independent two-sample t-test (Welch's variant, since equal_var=False) of whether the mean age differs between the two groups of customers: those who are interested in purchasing insurance (df_response1['Age']) and those who are not interested (df_response0['Age']).




##### Why did you choose the specific statistical test?

The code is performing an independent two-sample t-test using the ttest_ind() function from the scipy.stats module. The ttest_ind() function returns two values: the t-statistic and the p-value.

The t-statistic measures the difference between the two group means relative to its standard error, while the p-value is the probability of observing a difference at least as extreme as the one in the sample, assuming that there is no real difference between the two groups in the population (i.e., the null hypothesis is true).

If the p-value is less than the chosen significance level (e.g., 0.05), then we reject the null hypothesis and conclude that there is a significant difference in the mean age between the two groups.
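As an illustration of what the t-statistic measures, the sketch below computes Welch's statistic and its approximate degrees of freedom by hand, using only the standard library. The helper `welch_t` and the two age lists are hypothetical and are not part of the notebook:

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Sample variances (n - 1 in the denominator).
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)

    # Squared standard error contributed by each group.
    se_sq_a, se_sq_b = var_a / n_a, var_b / n_b

    # Difference in means divided by its standard error.
    t_stat = (mean_a - mean_b) / math.sqrt(se_sq_a + se_sq_b)

    # Degrees of freedom when the group variances are not assumed equal.
    dof = (se_sq_a + se_sq_b) ** 2 / (
        se_sq_a ** 2 / (n_a - 1) + se_sq_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Hypothetical ages for responders and non-responders.
ages_response1 = [40, 44, 46, 50]
ages_response0 = [25, 30, 35, 38, 42]

t_stat, dof = welch_t(ages_response1, ages_response0)
print(t_stat, dof)
```

This is the same statistic that `ttest_ind(..., equal_var=False)` reports; SciPy additionally converts it into a p-value using the t-distribution with the computed degrees of freedom.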

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Hypothesis Test 2: Is there a significant difference in the average annual premium paid by customers who are interested in purchasing insurance versus those who are not interested?

Null hypothesis: There is no significant difference in the average annual premium paid by customers who are interested in purchasing insurance versus those who are not interested.

Alternative hypothesis: The average annual premium paid by customers who are interested in purchasing insurance is significantly higher than the average annual premium paid by customers who are not interested.

#### 2. Perform an appropriate statistical test.

In [None]:
# separate the two groups based on interest in insurance
interested = df[df["Response"] == 1]["Annual_Premium"]
not_interested = df[df["Response"] == 0]["Annual_Premium"]

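# perform a one-sided t-test, because the alternative hypothesis states that the
# "interested" mean is higher (the alternative argument requires SciPy >= 1.6)
t_stat, p_val = ttest_ind(interested, not_interested, alternative="greater")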

# print the results
print("t-statistic:", t_stat)
print("p-value:", p_val)

alpha = 0.05
if p_val < alpha:
    print("Reject Null Hypothesis: The average annual premium paid by customers who are interested in purchasing insurance is significantly higher than the average annual premium paid by customers who are not interested.")
else:
    print("Fail to Reject Null Hypothesis: There is no significant difference in the average annual premium paid by customers who are interested in purchasing insurance versus those who are not interested.")


Reject Null Hypothesis: The average annual premium paid by customers who are interested in purchasing insurance is significantly higher than the average annual premium paid by customers who are not interested.


##### Which statistical test have you done to obtain P-Value?

The p-value was obtained from an independent two-sample t-test, which tests whether the mean annual premium differs between the two groups of customers: those who are interested in purchasing insurance (interested) and those who are not interested (not_interested).

##### Why did you choose the specific statistical test?

The code performs an independent two-sample t-test using the ttest_ind() function from the scipy.stats module. This test is appropriate because the two groups are independent and Annual_Premium is a continuous variable whose group means are being compared. The ttest_ind() function returns two values: the t-statistic and the p-value.

The t-statistic is a measure of the difference between the means of the two groups, while the p-value represents the probability of obtaining a difference as extreme as the one observed in the sample, assuming that there is no real difference between the two groups in the population (i.e., the null hypothesis is true).

If the p-value is less than the chosen significance level (e.g., 0.05), then we reject the null hypothesis of equal mean annual premiums in the two groups.
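To make the t-statistic concrete, the sketch below computes the pooled-variance two-sample t-statistic by hand with the standard library only. The premium values are made-up illustrative numbers, not rows from the dataset, and `pooled_t_statistic` is a hypothetical helper name. It assumes equal variances in both groups, which is also the default assumption of `ttest_ind`.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Two-sample t-statistic with pooled variance (equal variances assumed)."""
    n1, n2 = len(a), len(b)
    mean1, mean2 = statistics.mean(a), statistics.mean(b)
    var1, var2 = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err


# Illustrative premiums only (not real rows from the dataset)
interested_demo = [32000, 35000, 31000, 36000, 34000]
not_interested_demo = [30000, 29000, 31000, 28000, 32000]

t_demo = pooled_t_statistic(interested_demo, not_interested_demo)
print("t-statistic:", round(t_demo, 3))
```

A positive t-statistic means the first group has the larger sample mean; swapping the two arguments flips the sign but not the magnitude.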

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Hypothesis Test 3: Is there a significant difference in the proportion of customers who have previously purchased insurance between those who are interested in purchasing insurance versus those who are not interested?

Null hypothesis: There is no significant difference in the proportion of customers who have previously purchased insurance between those who are interested in purchasing insurance versus those who are not interested.

Alternative hypothesis: The proportion of customers who have previously purchased insurance is significantly higher among those who are interested in purchasing insurance compared to those who are not interested.

#### 2. Perform an appropriate statistical test.

In [None]:
from statsmodels.stats.proportion import proportions_ztest

# separate the two groups based on interest in insurance
interested = df[df["Response"] == 1]["Previously_Insured"]
not_interested = df[df["Response"] == 0]["Previously_Insured"]

# perform the z-test
count = [interested.sum(), not_interested.sum()]
nobs = [len(interested), len(not_interested)]
z_stat, p_val = proportions_ztest(count, nobs)

# print the results
print("z-statistic:", z_stat)
print("p-value:", p_val)

alpha = 0.05
if p_val < alpha:
    print("Reject Null Hypothesis: The proportion of customers who have previously purchased insurance is significantly higher among those who are interested in purchasing insurance compared to those who are not interested.")
else:
    print("Fail to Reject Null Hypothesis: There is no significant difference in the proportion of customers who have previously purchased insurance between those who are interested in purchasing insurance versus those who are not interested.")



Reject Null Hypothesis: The proportion of customers who have previously purchased insurance is significantly higher among those who are interested in purchasing insurance compared to those who are not interested.


##### Which statistical test have you done to obtain P-Value?

The p-value was obtained from a two-proportion z-test, which tests whether the proportion of previously insured customers differs between customers who are interested in purchasing insurance (interested) and those who are not interested (not_interested).

##### Why did you choose the specific statistical test?

The code is performing a two-proportions z-test using the proportions_ztest() function from the statsmodels.stats.proportion module. The proportions_ztest() function returns two values: the z-statistic and the p-value.

The z-statistic is a measure of the difference between the proportions of the two groups, while the p-value represents the probability of obtaining a difference as extreme as the one observed in the sample, assuming that there is no real difference between the two groups in the population (i.e., the null hypothesis is true).

The count parameter specifies the number of successes in each group, while the nobs parameter specifies the total number of trials in each group.

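If the p-value is less than the chosen significance level (e.g., 0.05), then we reject the null hypothesis and conclude that the proportion of previously insured customers differs between the two groups. Because the test as called is two-sided, the direction of the difference has to be read from the sign of the z-statistic (or from the two group proportions), not from the p-value alone.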
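The arithmetic behind the test can be sketched with the standard library alone. The helper below, `two_proportion_z_test`, is a hypothetical name, and the counts are illustrative rather than taken from the dataset. It uses the pooled standard error and a two-sided p-value, which is how the default call to `proportions_ztest` is understood to behave.

```python
import math


def two_proportion_z_test(count1, nobs1, count2, nobs2):
    """Two-sided z-test for equality of two proportions (pooled standard error)."""
    p1, p2 = count1 / nobs1, count2 / nobs2
    pooled = (count1 + count2) / (nobs1 + nobs2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    z = (p1 - p2) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Illustrative counts only (not taken from the dataset)
z_demo, p_demo = two_proportion_z_test(count1=30, nobs1=200, count2=90, nobs2=300)
direction = "lower" if z_demo < 0 else "higher"
print(f"z = {z_demo:.3f}, p = {p_demo:.4f}, first group is {direction}")
```

In this example the null hypothesis is rejected, yet the first group has the lower proportion. A claim that one group is "significantly higher" therefore needs either a check of the sign of z or a one-sided test (for instance `alternative='larger'` in `proportions_ztest`).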

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

No missing or null values found.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
fig, axes = plt.subplots(1, 3, figsize=(20,5))

sns.boxplot(ax = axes[0],y = 'Annual_Premium',x = 'Response', data = df)
axes[0].set_xlabel(xlabel = 'Response')
axes[0].set_ylabel(ylabel = 'Annual_Premium')
axes[0].set_title('Annual_Premium')

sns.boxplot(ax = axes[1],y = 'Age',x = 'Response', data = df)
axes[1].set_xlabel(xlabel = 'Response')
axes[1].set_ylabel(ylabel = 'Age')
axes[1].set_title('Age')

sns.boxplot(ax = axes[2],y = 'Vintage',x = 'Response', data = df)
axes[2].set_xlabel(xlabel = 'Response')
axes[2].set_ylabel(ylabel = 'Vintage')
axes[2].set_title('Vintage')

plt.suptitle('Outliers', fontsize = 22, fontweight = 'bold' )

**INSIGHTS**

The plots suggest that the Annual_Premium distribution is right-skewed, with many values far above the upper whisker. The Vintage variable shows no outliers. The Age column contains a few outliers, but we have decided not to treat them, as they are not expected to materially affect the results.

In [None]:
#25th percentile (first quartile)
Q1=df['Annual_Premium'].quantile(0.25)

#75th percentile (third quartile)
Q3=df['Annual_Premium'].quantile(0.75)

#Interquartile range
IQR=Q3-Q1

Lower_Whisker = Q1-1.5*IQR
Upper_Whisker = Q3+1.5*IQR

#Capping outliers: values above the upper whisker are replaced by the upper whisker
df['Annual_Premium_Treated'] = np.where(df['Annual_Premium']>Upper_Whisker, Upper_Whisker, df['Annual_Premium'])

In [None]:
#Setting figure size
plt.figure(figsize=(10,5))

#Plotting a box plot
sns.boxplot( y = 'Annual_Premium_Treated',x = 'Response', data = df)

#Labels and titles
plt.xlabel(xlabel = 'Response')
plt.ylabel(ylabel = 'Annual_Premium_Treated')
plt.title('Annual Premium Treated')


**INSIGHTS**

From the above plots we can see that there are no more outliers in Annual Premium.

##### What all outlier treatment techniques have you used and why did you use those techniques?

The outlier treatment technique used in the code snippet is called the "IQR method" or "Tukey's method". It involves calculating the interquartile range (IQR), which is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data. Then, a lower whisker is calculated as Q1 - 1.5 × IQR, and an upper whisker is calculated as Q3 + 1.5 × IQR. Any data points that fall outside of these whiskers are considered outliers and are either removed or replaced.

In this code, only the upper whisker is applied: any value in the 'Annual_Premium' column that is greater than it is replaced by the upper whisker, effectively capping the values at the upper limit of what is considered normal or expected. This technique was chosen because it is a simple and commonly used method for detecting and treating outliers. Capping also keeps every row in the dataset, whereas removing outliers would discard data and could lead to biased or incomplete analysis.
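The capping rule can be illustrated on a small list with the standard library. This is a sketch with made-up premiums, and `cap_at_upper_whisker` is a hypothetical helper name; `statistics.quantiles` with `method="inclusive"` is used here because it interpolates linearly between data points, in the same way as the default of pandas' `quantile`.

```python
import statistics


def cap_at_upper_whisker(values):
    """Cap values above Q3 + 1.5 * IQR at that limit; all rows are kept."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper_whisker = q3 + 1.5 * (q3 - q1)
    return [min(v, upper_whisker) for v in values], upper_whisker


# Illustrative premiums only; 100000 is an artificial extreme value
premiums_demo = [20000, 25000, 30000, 35000, 40000, 100000]
capped_demo, whisker_demo = cap_at_upper_whisker(premiums_demo)
print(whisker_demo, capped_demo)
```

Only the extreme value changes; the list keeps its length, which is the difference between capping and dropping outliers.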

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
df.nunique()

In [None]:
# Categorizing Policy_Sales_Channel feature
channel_bins = [0, 41, 81, 121,165]
channel_labels = ['Channel_A','Channel_B', 'Channel_C', 'Channel_D']
df['Policy_Sales_Channel_Categorical'] = pd.cut(df['Policy_Sales_Channel'], bins=channel_bins, labels=channel_labels)

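# Categorizing Region_Code feature
# include_lowest=True keeps Region_Code == 0 inside the first bin; without it that value becomes NaN
region_bins = [0,11,21,31,41,53]
region_labels = ['Region_E','Region_D','Region_C', 'Region_B', 'Region_A']
df['Region_Code_Categorical'] = pd.cut(df['Region_Code'], bins=region_bins, labels=region_labels, include_lowest=True)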

In [None]:
df.nunique()

In [None]:
categorical_features = ['Vehicle_Age','Policy_Sales_Channel_Categorical','Region_Code_Categorical']

In [None]:
df = df.drop(labels = ['id','Driving_License','Region_Code','Policy_Sales_Channel','Annual_Premium'], axis = 1)

In [None]:
df.info()

In [None]:
#Assigning Male to 1 and Female to 0 (case-insensitive, so 'Male' and 'male' are both mapped to 1)
df['Gender'] = df['Gender'].apply(lambda x : 1 if str(x).strip().lower() == 'male' else 0)
#Assigning Yes to 1 and No to 0
df['Vehicle_Damage'] = df['Vehicle_Damage'].apply(lambda x : 1 if x == "Yes" else 0)

In [None]:
# encode the categorical features using get_dummies
df_encoded = pd.get_dummies(df, columns=categorical_features)


In [None]:
df_encoded.info()

In [None]:
df = df_encoded

#### What all categorical encoding techniques have you used & why did you use those techniques?

One hot encoding is a technique used in machine learning to transform categorical data into a numerical representation that can be used in predictive models. In one hot encoding, each category is represented as a binary vector, where each element in the vector corresponds to a category and is either 0 or 1 depending on whether the sample belongs to that category or not. One hot encoding is used because many machine learning algorithms can't handle categorical data directly, and require numerical data instead. By converting categorical data into a numerical form using one hot encoding, we can make use of a wider range of machine learning algorithms and improve the accuracy of our models. Additionally, one hot encoding is useful because it does not impose any ordinal relationship between categories, which can be a limitation of other encoding techniques. However, one potential downside of one hot encoding is that it can result in a large number of features, which can increase the dimensionality of the data and slow down training times.
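As a minimal illustration of the binary-vector idea, the sketch below one-hot encodes a short list by hand. The `one_hot` helper is a hypothetical name and the values are illustrative; in the notebook itself the same job is done in a single call to `pd.get_dummies`.

```python
def one_hot(values):
    """Return (categories, rows), where each row is a 0/1 indicator vector."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Illustrative values for a vehicle-age style column
vehicle_age_demo = ["< 1 Year", "1-2 Year", "> 2 Years", "1-2 Year"]
categories_demo, rows_demo = one_hot(vehicle_age_demo)

print(categories_demo)
for value, row in zip(vehicle_age_demo, rows_demo):
    print(value, row)
```

Each row contains exactly one 1, so no ordering between the categories is implied, at the cost of one extra column per category.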

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Select your features wisely to avoid overfitting
def get_vif_factors(X):
  #converting the dataframe to a matrix
  X_matrix = X.to_numpy()
  #Using list comprehension to store vif of each variable
  vif = [ variance_inflation_factor(X_matrix,i) for i in range(X.shape[1])]
  #Creating an empty dataframe.
  vif_factors = pd.DataFrame()
  #Storing columns name
  vif_factors['column'] = X.columns
  #Storing corresponding vifs
  vif_factors['vif'] = vif

  return vif_factors

In [None]:
index = ['Age','Annual_Premium_Treated','Vintage']

In [None]:
#Calling get_vif_factors function
vif_factors = get_vif_factors(df[index])
vif_factors


In [None]:
plt.figure(figsize = (12,10))
sns.heatmap(df[index].corr(),annot = True)
plt.title(" Heatmap depicting correlation between features")


##### What all feature selection methods have you used  and why?

To improve the accuracy of a machine learning model, it is important to identify any predictor variables that may not have a significant impact on the target variable, which in this case is the 'Response' column. One way to do this is to examine a fitted model's summary and identify variables with high p-values.

Another important consideration is multicollinearity, which occurs when predictor variables are highly correlated with one another. To check for multicollinearity, we calculated the Variance Inflation Factor (VIF) for the numerical features (Age, Annual_Premium_Treated, and Vintage). A high VIF value, here taken to be greater than 4, suggests that a variable may be contributing to multicollinearity. To investigate these variables further, we created a heatmap of their pairwise correlations. This helped us gain a better understanding of which variables may need to be modified or removed to improve model performance.
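The definition behind the VIF, 1 / (1 - R²) from regressing one feature on the others, can be sketched directly with NumPy. The helper name `vif_with_intercept` and the synthetic columns are assumptions made for illustration. Unlike a bare call to statsmodels' `variance_inflation_factor`, which as far as its documentation describes regresses on exactly the columns passed in, this sketch adds an intercept column to each regression.

```python
import numpy as np


def vif_with_intercept(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        # Design matrix: intercept plus every column except column j
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1 - residuals.var() / target.var()
        vifs.append(1 / (1 - r_squared))
    return vifs


# Synthetic data: b is independent of a, while c is nearly a copy of a
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + rng.normal(scale=0.1, size=500)
vifs_demo = vif_with_intercept(np.column_stack([a, b, c]))
print([round(v, 2) for v in vifs_demo])
```

The independent column stays close to 1, while the two nearly identical columns receive very large values, which is the pattern the threshold is meant to catch.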

##### Which all features you found important and why?

We will move forward with all the available features, as the variables are independent of each other; that is, no multicollinearity was detected.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

It is generally not necessary to perform data transformation on the target variable in a binary classification problem, as the target variable only has two possible values (1 or 0) and is already in a numerical format. Transformations such as log or power transforms are typically applied to skewed continuous variables, not to a binary label.

In the case of a binary classification problem, the target variable is already in a format that can be used directly by classification algorithms. Transforming the target variable in this case may not provide any additional benefit and may even introduce errors or biases into the model.

It is important to note that while data transformation may not be necessary for the target variable in a binary classification problem, it may still be necessary for other variables in the dataset. For example, features may need to be scaled, normalized, or encoded to improve the performance of the classification algorithm. It is always important to carefully analyze the data and the problem at hand to determine the most appropriate data preprocessing steps to take.

### 6. Data Scaling

In [None]:
# Scaling your data

In [None]:
# Importing the MinMaxScaler from the preprocessing module of the scikit-learn library
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Using the scaler to transform the 'Annual_Premium_Treated' column of the 'df' DataFrame
# Reshaping the data using 'values.reshape(-1, 1)' to ensure the input data is in the right shape for the scaler
df['Annual_Premium_Treated'] = scaler.fit_transform(df['Annual_Premium_Treated'].values.reshape(-1,1))

# Using the scaler to transform the 'Vintage' column of the 'df' DataFrame
# Reshaping the data using 'values.reshape(-1, 1)' to ensure the input data is in the right shape for the scaler
df['Vintage_Treated'] = scaler.fit_transform(df['Vintage'].values.reshape(-1,1))

In [None]:
df = df.drop(labels = ['Vintage'], axis = 1)

##### Which method have you used to scale you data and why?

The MinMaxScaler method from the preprocessing module of the scikit-learn library has been used to scale the data.

This method scales the data to a specified range, usually between 0 and 1. It works by subtracting the minimum value in the feature column and then dividing by the range (i.e., the maximum value minus the minimum value). This ensures that the smallest value in the column is scaled to 0, the largest value is scaled to 1, and all other values are scaled proportionally in between.

MinMaxScaler is a good choice when the distribution of the feature values is unknown or non-Gaussian. It is also appropriate when the data values have meaningful bounds (i.e., a maximum and minimum value), as in this case where the annual premium and vintage columns have a natural minimum and maximum value.
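The min-max formula described above can be written out in a few lines. The sketch below uses hypothetical helper names (`fit_min_max`, `apply_min_max`) and illustrative values; it also separates learning the minimum and range from applying them, which is one way to reuse training-set parameters on held-out data.

```python
def fit_min_max(values):
    """Learn the minimum and the range from the data used for fitting."""
    lo, hi = min(values), max(values)
    return lo, hi - lo


def apply_min_max(values, lo, value_range):
    """Scale with previously learned parameters: (x - min) / (max - min)."""
    return [(v - lo) / value_range for v in values]


# Illustrative values only
vintage_train_demo = [10, 100, 299, 150]
vintage_test_demo = [55, 299]

lo, value_range = fit_min_max(vintage_train_demo)
train_scaled_demo = apply_min_max(vintage_train_demo, lo, value_range)
test_scaled_demo = apply_min_max(vintage_test_demo, lo, value_range)
print(train_scaled_demo)
print(test_scaled_demo)
```

The smallest fitted value maps to 0 and the largest to 1; values outside the fitted range would fall outside [0, 1].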

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Since a multicollinearity check and feature selection have already been performed, dimensionality reduction may not be needed. The aim of feature selection is to keep the most informative features and remove redundant or irrelevant ones, and the multicollinearity check confirms that the retained features are not highly correlated with each other, which could otherwise make the model unstable.

Moreover, dimensionality reduction techniques like PCA are usually used when the data has a large number of features, many of which are highly correlated. Here no strong correlations were detected among the features, so applying PCA is unlikely to improve model performance significantly and would reduce the interpretability of the model.

In summary, the existing features are non-redundant and informative, so dimensionality reduction was not applied.

### 9. Handling Imbalanced Dataset

In [None]:
df.Response.value_counts()

In [None]:
df.shape


##### Do you think the dataset is imbalanced? Explain Why.

The dataset is quite imbalanced: the two classes are not equally represented. Only 46,710 of the 381,109 customers responded positively (Response = 1), while 334,399 did not (Response = 0). In such cases, the model may be biased towards the majority class and fail to learn the minority class.

Even a model that predicts that no customer will respond (all zeros) would have an accuracy of about 88%. This is called the accuracy paradox. But the objective of building a model here is to identify the customers who will respond to the insurance offer (i.e., to increase the number of true positives).
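The accuracy paradox can be verified with plain arithmetic on the class counts reported in this notebook. The variable names are illustrative.

```python
total_customers = 381109      # rows in the dataset
responders = 46710            # rows with Response == 1
non_responders = total_customers - responders

# A model that always predicts 0 is right for every non-responder
baseline_accuracy = non_responders / total_customers
baseline_recall = 0 / responders   # it never finds a single responder

print(non_responders)
print(round(baseline_accuracy, 4), baseline_recall)
```

High accuracy and zero recall for the class of interest is why precision, recall, and F1 are reported alongside accuracy in the modelling sections.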

In [None]:
# Handling Imbalanced Dataset (If needed)
#Importing resample from *sklearn.utils* package.
from sklearn.utils import resample

#Separate the case of yes-Response  and no-Response
Response_no = df[df['Response'] == 0]
Response_yes = df[df['Response'] == 1]

#Upsample the yes-Response cases.
df_minority_upsampled = resample(Response_yes,replace = True,n_samples=167199)

#Combine majority class with upsampled minority class
new_df = pd.concat([Response_no,df_minority_upsampled])

After upsampling, new_df contains 334399 cases of Response = 0 and 167199 cases of Response = 1, a ratio of 67:33. Before using the dataset, the examples can be shuffled to make sure they are not in a particular order. sklearn.utils has a shuffle() method that does the shuffling.

In [None]:
from sklearn.utils import shuffle
new_df = shuffle(new_df)

In [None]:
new_df.columns

In [None]:
new_df.head()

In [None]:
#Rename columns to remove square brackets, angle brackets, and square braces
new_df.columns = [col.replace('[', '').replace(']', '').replace('<', 'less_than').replace('>', 'greater_than') for col in new_df.columns]


In [None]:
#Assigning list of all column names in the DataFrame
X_features = list(new_df.columns)

#Remove the response variable from the list
X_features.remove('Response')

#Storing X
X = new_df[X_features]

In [None]:
Y = new_df['Response']

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

One approach to dealing with an imbalanced dataset is bootstrapping, which relies on resampling techniques such as upsampling.

Upsampling: increase the instances of the under-represented minority class by replicating existing observations in the dataset. Sampling with replacement is used for this purpose; the technique is also called oversampling.
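One design point worth illustrating is the order of operations. Because upsampling replicates rows, resampling before the train/test split can place copies of the same customer in both sets. The sketch below shows the alternative of splitting first and upsampling only the training portion. It uses the standard library and synthetic `(row_id, label)` pairs, all of which are assumptions made for illustration, not the notebook's own pipeline.

```python
import random

rng = random.Random(42)

# Synthetic data: (row_id, label) pairs with 12.5% positives
rows = [(i, 1 if i % 8 == 0 else 0) for i in range(400)]

# 1. Hold out the test set BEFORE any resampling (70/30 split)
rng.shuffle(rows)
cut = len(rows) * 7 // 10
train_rows, test_rows = rows[:cut], rows[cut:]

# 2. Upsample the minority class inside the training set only
train_majority = [r for r in train_rows if r[1] == 0]
train_minority = [r for r in train_rows if r[1] == 1]
n_minority_target = len(train_majority) // 2   # roughly a 67:33 ratio
upsampled_minority = rng.choices(train_minority, k=n_minority_target)  # with replacement
train_balanced = train_majority + upsampled_minority
rng.shuffle(train_balanced)

# 3. No test row has a copy in the training data
train_ids = {row_id for row_id, _ in train_balanced}
test_ids = {row_id for row_id, _ in test_rows}
print(len(train_ids & test_ids))
```

With this ordering the test set keeps the original class distribution, so test metrics reflect what the model would see on new customers.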

### Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size = 0.3,random_state = 42)


##### What data splitting ratio have you used and why?

In the given code, the data has been split into training and testing sets with test_size = 0.3, which means 30% of the data is kept aside for testing and the remaining 70% is used for training the model.

The choice of splitting ratio depends on the size of the dataset and the problem at hand. In general, a larger ratio of training to testing data is preferred when the dataset is large, as this allows the model to be trained on a more diverse range of examples and can lead to better performance.

However, if the dataset is relatively small, a larger ratio of testing to training data is preferred to ensure that the model is evaluated on a sufficient number of examples and that the evaluation is representative of the generalization performance of the model.

In this case, 30% of the data has been chosen for testing, which is a common ratio used in many machine learning applications. The choice of 30% allows for a large enough test set to evaluate the model's performance, while still leaving a sufficiently large training set to train the model. The random state of 42 is set to ensure that the data is split in a consistent manner across multiple runs of the code.
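The split sizes can be cross-checked against the support totals that appear in the classification reports later in this notebook (351118 training rows and 150480 test rows). The sketch assumes that scikit-learn rounds a fractional test size up to the next whole row, which is how `train_test_split` is understood to behave.

```python
import math

n_rows = 334399 + 167199          # majority rows + upsampled minority rows
test_size = 0.3

n_test = math.ceil(n_rows * test_size)   # fractional test share rounded up
n_train = n_rows - n_test
print(n_train, n_test)
```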

## ***7. ML Model Implementation***

### ML Model - 1 Implementing Logistic Regression

In [None]:
# ML Model - 1 Implementation
clf = LogisticRegression(fit_intercept=True, max_iter=10000)
# Fit the Algorithm
clf.fit(x_train, y_train)


In [None]:
# Checking the coefficients
clf.coef_


In [None]:
# Checking the intercept value
clf.intercept_

In [None]:
# Predict on the model
# Get the predicted probabilities
train_preds = clf.predict_proba(x_train)
test_preds = clf.predict_proba(x_test)


In [None]:
# Get the predicted classes
train_class_preds = clf.predict(x_train)
test_class_preds = clf.predict(x_test)

In [None]:
# Get the accuracy scores
train_accuracy = accuracy_score(train_class_preds,y_train)
test_accuracy = accuracy_score(test_class_preds,y_test)

print("The accuracy on train data is ", train_accuracy)
print("The accuracy on test data is ", test_accuracy)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
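# Get the confusion matrix for the train set

# confusion_matrix orders the classes as [0, 1], so class 0 (Not_Responded) comes first
labels = ['Not_Responded', 'Responded']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax) #annot=True to annotate cells, fmt='d' shows integer counts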

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
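# Get the confusion matrix for the test set

# confusion_matrix orders the classes as [0, 1], so class 0 (Not_Responded) comes first
labels = ['Not_Responded', 'Responded']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax) #annot=True to annotate cells, fmt='d' shows integer counts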

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
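# y_true first, then y_pred. The output recorded below came from the reversed
# order, so its precision and recall columns are swapped.
print(metrics.classification_report(y_train, train_class_preds))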
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_train, train_class_preds))


             precision    recall  f1-score   support

           0       0.76      0.86      0.81    208901
           1       0.74      0.61      0.67    142217

    accuracy                           0.76    351118
   macro avg       0.75      0.73      0.74    351118

weighted avg       0.76      0.76      0.75    351118


roc_auc_score
0.7531599787801373

In [None]:
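# y_true first, then y_pred. The output recorded below came from the reversed
# order, so its precision and recall columns are swapped.
print(metrics.classification_report(y_test, test_class_preds))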
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_test, test_class_preds))


            precision    recall  f1-score   support

           0       0.77      0.86      0.81     89896
           1       0.74      0.61      0.67     60584

    accuracy                           0.76    150480
   macro avg       0.75      0.73      0.74    150480

weighted avg       0.76      0.76      0.75    150480


roc_auc_score
0.7530099387215005

Based on the evaluation metrics and the ROC AUC score provided for both the training and testing datasets, we can make the following observations:

The scores for class 0 are higher than those for class 1 (F1-score of 0.81 versus 0.67), so the model identifies non-responders more reliably than responders.

The overall accuracy of the model is 76%, which means that it correctly predicts the target variable in 76% of cases. However, accuracy alone can be misleading when the classes are imbalanced: the upsampled data still has a 67:33 class ratio, so a model that always predicts class 0 would already reach about 67%.

The ROC AUC score for both the training and testing datasets is around 0.75. A score of 0.5 indicates that the model is no better than random, while a score of 1 indicates perfect classification. Because the score here is computed from the predicted class labels rather than from the predicted probabilities, it equals the balanced accuracy at the default 0.5 threshold and does not capture how well the model ranks customers across thresholds.

The macro-average F1-score (0.74) is slightly lower than the weighted average (0.75), because the macro average gives the weaker class 1 the same weight as class 0, whereas the weighted average favours the larger class.

Overall, the model performs moderately well and almost identically on the train and test sets, which indicates little overfitting. However, further analysis of the data and the model is required to understand the factors influencing its performance and identify areas for improvement.
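The difference between a ROC AUC computed from probabilities and one computed from hard class labels can be shown with a small hand-rolled AUC. The `roc_auc` helper is a hypothetical name, and the labels and probabilities are illustrative, not model output. With scikit-learn, the probability-based version would correspond to passing the positive-class column of `predict_proba` to `roc_auc_score`.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Illustrative labels and predicted probabilities (not model output)
y_true_demo = [0, 0, 0, 1, 0, 1, 1, 1]
proba_demo = [0.10, 0.20, 0.45, 0.40, 0.60, 0.55, 0.80, 0.90]
labels_demo = [1 if p >= 0.5 else 0 for p in proba_demo]

auc_from_proba = roc_auc(y_true_demo, proba_demo)
auc_from_labels = roc_auc(y_true_demo, labels_demo)
print(auc_from_proba, auc_from_labels)
```

Thresholding discards the ordering information inside each predicted class, so the label-based value collapses to the average of the two per-class recalls.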

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
model = LogisticRegression(max_iter=10000)
solvers = ['lbfgs']
# lbfgs supports only the 'l2' penalty (or no penalty); other strings are invalid
penalty = ['l2']
c_values = [1000,100, 10, 1.0, 0.1, 0.01,0.001]
# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='f1',error_score=0)

# Fit the Algorithm
grid_result=grid_search.fit(x_train, y_train)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))


# Predict on the model
# Get the predicted classes
train_class_preds = grid_result.predict(x_train)
test_class_preds = grid_result.predict(x_test)


Best: 0.672391 using {'C': 100, 'penalty': 'l2', 'solver': 'lbfgs'}

In [None]:
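# y_true first, then y_pred. The output recorded below came from the reversed
# order, so its precision and recall columns are swapped.
print(metrics.classification_report(y_train, train_class_preds))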
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_train, train_class_preds))


              precision    recall  f1-score   support

           0       0.76      0.86      0.81    208448
           1       0.75      0.61      0.67    142670

    accuracy                           0.76    351118
   macro avg       0.76      0.74      0.74    351118

weighted avg       0.76      0.76      0.75    351118


roc_auc_score
0.7556186278541133


In [None]:
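# y_true first, then y_pred. The output recorded below came from the reversed
# order, so its precision and recall columns are swapped.
print(metrics.classification_report(y_test, test_class_preds))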
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_test, test_class_preds))


              precision    recall  f1-score   support

           0       0.76      0.86      0.81     89269
           1       0.75      0.61      0.67     61211

    accuracy                           0.76    150480
   macro avg       0.75      0.73      0.74    150480

weighted avg       0.76      0.76      0.75    150480


roc_auc_score
0.7536922958734591


##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used. It applies the grid search technique to find the hyperparameters that give the best model performance.

The goal is to find the hyperparameter values that give the best predictive performance. One option is a manual search by trial and error, but this takes a long time even for a single model.

For this reason, methods such as random search and grid search were introduced. Grid search evaluates every combination of the specified hyperparameter values and selects the combination with the best score. This makes it time-consuming and computationally expensive as the number of hyperparameters and candidate values grows.

GridSearchCV combines grid search with cross-validation: each combination is scored on held-out folds of the training data, so the selected hyperparameters are less likely to be tuned to a single split.

That is why I used GridSearchCV for hyperparameter optimization.
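The cost of an exhaustive grid can be estimated before running it. The sketch below enumerates the candidates with `itertools.product`, assuming a grid with the single `l2` penalty and the same C values, fold count, and repeat count used in the cell above; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "solver": ["lbfgs"],
    "penalty": ["l2"],
    "C": [1000, 100, 10, 1.0, 0.1, 0.01, 0.001],
}
n_splits, n_repeats = 10, 3   # as in RepeatedStratifiedKFold above

# Every combination of the listed values is one candidate model
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
total_fits = len(candidates) * n_splits * n_repeats

print(len(candidates), total_fits)
```

Each extra value in any list multiplies the number of fits, which is why the grid is kept small on a training set of this size. One additional fit is normally needed to refit the best candidate on the full training data.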

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Hyperparameter tuning produced no meaningful improvement. The grid search selected C = 100 with the l2 penalty, and the resulting precision, recall, F1-score, and ROC AUC on the test set are almost identical to those obtained with the default hyperparameters (test ROC AUC of 0.7537 versus 0.7530).

The tuned model reaches an accuracy of 0.76 on both the train and test sets, with an F1-score of 0.81 for class 0 and 0.67 for class 1. The nearly identical train and test scores suggest that the model is not overfitting, but its ability to identify responders remains moderate.

Overall, logistic regression is a stable baseline for this dataset, and tuning C alone does not change its performance.

### ML Model - 2 RandomForestClassifier

In [None]:
# ML Model - 2 Implementation
# Create an instance of the RandomForestClassifier
rf_model = RandomForestClassifier()

# Fit the Algorithm
rf_model.fit(x_train,y_train)

# Predict on the model
# Making predictions on train and test data
train_class_preds = rf_model.predict(x_train)
test_class_preds = rf_model.predict(x_test)

In [None]:
# Calculating accuracy on train and test
train_accuracy = accuracy_score(y_train,train_class_preds)
test_accuracy = accuracy_score(y_test,test_class_preds)

print("The accuracy on train dataset is", train_accuracy)
print("The accuracy on test dataset is", test_accuracy)


The accuracy on train dataset is 0.997052273025023

The accuracy on test dataset is 0.9109316852737905

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Get the confusion matrix for the train set

# confusion_matrix orders the classes as [0, 1], so class 0 (Not_Responded) comes first
labels = ['Not_Responded', 'Responded']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax) #annot=True to annotate cells, fmt='d' shows integer counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
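# Get the confusion matrix for the test set

# confusion_matrix orders the classes as [0, 1], so class 0 (Not_Responded) comes first
labels = ['Not_Responded', 'Responded']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax) #annot=True to annotate cells, fmt='d' shows integer counts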

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
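# y_true first, then y_pred. The output recorded below came from the reversed
# order, so its precision and recall columns are swapped.
print(metrics.classification_report(y_train, train_class_preds))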
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_train, train_class_preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    233127
           1       1.00      0.99      1.00    117991

    accuracy                           1.00    351118
   macro avg       1.00      1.00      1.00    351118
weighted avg       1.00      1.00      1.00    351118


roc_auc_score
0.9974447975740884

In [None]:
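# Note: predictions are passed as y_true here, so per-class precision and recall
# are interchanged in the printed report.
print(metrics.classification_report(test_class_preds, y_test))
print(" ")

# ROC AUC computed from hard class predictions rather than predicted probabilities
print("roc_auc_score")
print(metrics.roc_auc_score(y_test, test_class_preds))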

              precision    recall  f1-score   support

           0       0.90      0.97      0.93     93042
           1       0.94      0.82      0.88     57438

    accuracy                           0.91    150480
   macro avg       0.92      0.89      0.90    150480
weighted avg       0.91      0.91      0.91    150480


roc_auc_score
0.9185141588238699

The default Random Forest scores well on both sets, but not equally well. Training accuracy is 0.997 and test accuracy is 0.911, and the ROC-AUC score drops from 0.997 on the training set to 0.919 on the test set. On the test set, the reported precision, recall, and F1-score values range from 0.82 to 0.97 across the two classes.

Near-perfect performance on the training set does not guarantee good performance on unseen data. The gap between the training and test scores suggests that the model is overfitting to some degree, which is why the test-set figures are the ones to monitor.

Overall, the model's performance on the test set is good, with high precision, recall, and F1-score, as well as a high ROC-AUC score. This suggests that the model is likely to perform reasonably well on new data drawn from the same distribution.
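The evaluation cells above can be wrapped in one helper. The following is a minimal sketch, not the notebook's own code: it uses synthetic data from `make_classification` as a stand-in for `x_train`/`x_test`, and the names `evaluate`, `clf`, `X_demo`, and `y_demo` are hypothetical. It passes the true labels first to `classification_report` and computes ROC AUC from the predicted probability of class 1 instead of from hard class labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the insurance dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=30, random_state=0).fit(X_tr, y_tr)

def evaluate(model, X, y_true):
    """Return the text report and the ROC AUC for a fitted binary classifier."""
    y_pred = model.predict(X)
    # Probability of the positive class (column 1) gives a threshold-free AUC
    y_score = model.predict_proba(X)[:, 1]
    report = classification_report(y_true, y_pred)  # true labels first
    auc = roc_auc_score(y_true, y_score)
    return report, auc

test_report, test_auc = evaluate(clf, X_te, y_te)
print(test_report)
print("roc_auc_score:", test_auc)
```

Calling `evaluate` once for the training set and once for the test set gives the two reports side by side, which makes a train/test gap easy to spot.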

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Number of trees
n_estimators = [50,80,100]

# Maximum depth of trees
max_depth = [4,6,8]

# Minimum number of samples required to split a node
min_samples_split = [50,100,150]

# Minimum number of samples required at each leaf node
min_samples_leaf = [40,50]

# Hyperparameter grid
param_dict = {'n_estimators' : n_estimators,
              'max_depth' : max_depth,
              'min_samples_split' : min_samples_split,
              'min_samples_leaf' : min_samples_leaf}

# Create an instance of the RandomForestClassifier
rf_model1 = RandomForestClassifier()

# Grid search
rf_grid = GridSearchCV(estimator=rf_model1,
                       param_grid = param_dict,
                       cv = 5, verbose=2, scoring='f1')


# Fit the Algorithm
rf_grid.fit(x_train,y_train)



# Predict on the model
# Making predictions on train and test data
train_class_preds = rf_grid.predict(x_train)
test_class_preds = rf_grid.predict(x_test)

In [None]:
print("Best: %f using %s" % (rf_grid.best_score_, rf_grid.best_params_))

Best: 0.698365 using {'max_depth': 8, 'min_samples_leaf': 50, 'min_samples_split': 100, 'n_estimators': 100}


In [None]:
# Visualizing evaluation Metric Score chart
# Confusion matrix for the training set

# confusion_matrix orders classes ascending: 0 = Not_Responded, 1 = Responded
labels = ['Not_Responded', 'Responded']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax=ax)  # annot=True annotates cells; fmt='d' prints whole counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)


In [None]:
# Confusion matrix for the test set

# confusion_matrix orders classes ascending: 0 = Not_Responded, 1 = Responded
labels = ['Not_Responded', 'Responded']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax=ax)  # annot=True annotates cells; fmt='d' prints whole counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
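# Note: predictions are passed as y_true here, so per-class precision and recall
# are interchanged in the printed report.
print(metrics.classification_report(train_class_preds, y_train))
print(" ")

# ROC AUC computed from hard class predictions rather than predicted probabilities
print("roc_auc_score")
print(metrics.roc_auc_score(y_train, train_class_preds))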


              precision    recall  f1-score   support

           0       0.75      0.89      0.81    197475
           1       0.81      0.62      0.70    153643

    accuracy                           0.77    351118
   macro avg       0.78      0.75      0.76    351118
weighted avg       0.77      0.77      0.76    351118


roc_auc_score
0.7783553401864276


In [None]:
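# Hypertuned Random Forest
# Note: predictions are passed as y_true here, so per-class precision and recall
# are interchanged in the printed report.
print(metrics.classification_report(test_class_preds, y_test))
print(" ")

# ROC AUC computed from hard class predictions rather than predicted probabilities
print("roc_auc_score")
print(metrics.roc_auc_score(y_test, test_class_preds))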


              precision    recall  f1-score   support

           0       0.75      0.89      0.81     85077
           1       0.81      0.62      0.70     65403

    accuracy                           0.77    150480
   macro avg       0.78      0.75      0.76    150480
weighted avg       0.78      0.77      0.76    150480


roc_auc_score
0.7790602846747008

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV, which applies the Grid Search technique to find the hyperparameter values that give the best model performance.

The goal is to find the hyperparameter values that give the best predictions from the model. One option is manual search by trial and error, but that takes a great deal of time even for a single model.

For this reason, methods such as Random Search and Grid Search were introduced. Grid Search evaluates every combination of the specified hyperparameter values, scores each combination, and selects the best one. This makes the search time-consuming and expensive, and the cost grows with the number of hyperparameters and candidate values.

GridSearchCV combines Grid Search with cross-validation: each combination is scored on held-out folds of the training data rather than on the data it was fitted to.

That is why I used GridSearchCV for hyperparameter optimization.
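The cost of a grid can be counted before running it. As a small sketch, scikit-learn's `ParameterGrid` enumerates the combinations for the Random Forest grid defined earlier; the variable names `n_combinations`, `n_folds`, and `n_fits` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the Random Forest grid defined earlier
param_dict = {'n_estimators': [50, 80, 100],
              'max_depth': [4, 6, 8],
              'min_samples_split': [50, 100, 150],
              'min_samples_leaf': [40, 50]}

n_combinations = len(ParameterGrid(param_dict))  # 3 * 3 * 3 * 2
n_folds = 5                                      # cv=5 in GridSearchCV
n_fits = n_combinations * n_folds                # fits before the final refit

print(n_combinations, "combinations x", n_folds, "folds =", n_fits, "fits")
```

Each extra candidate value multiplies the total, which is why grid search becomes expensive quickly on a training set of this size.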

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

No. Performance decreased after hyperparameter tuning: the ROC AUC score on the test set fell from 0.919 to 0.779, and accuracy is now 0.77 on both the training and test sets. Precision and recall are not uniformly high, since the reported values range from 0.62 to 0.89 across the two classes. The F1-score, which is the harmonic mean of precision and recall, is 0.81 for class 0 and 0.70 for class 1 on both sets.

The macro-average and weighted-average F1-scores are both 0.76. The ROC AUC score of 0.779 on the test set indicates that the model can still discriminate between the positive and negative classes reasonably well. The training and test scores are now almost identical, so the tuned model no longer shows the overfitting gap of the default model, but the constrained trees (`max_depth` of at most 8 and at least 40 samples per leaf) fit the data less closely.
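The F1 values in the report can be reproduced by hand from the precision and recall columns. A short worked example follows; the helper name `f1_from` is hypothetical, and the inputs are the rounded figures from the tuned Random Forest test report.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 and class 0 values from the tuned Random Forest test report
f1_class1 = f1_from(0.81, 0.62)
f1_class0 = f1_from(0.75, 0.89)

print(f"class 0 F1 = {f1_class0:.2f}, class 1 F1 = {f1_class1:.2f}")
```

Because the harmonic mean is pulled towards the smaller of its two inputs, the 0.62 value drags the class 1 F1-score down to 0.70 even though the other value is 0.81.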

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
# Create an instance of the XGBClassifier
xg_model = XGBClassifier()

# Fit the Algorithm
xg_models=xg_model.fit(x_train,y_train)

# Predict on the model
# Making predictions on train and test data

train_class_preds = xg_models.predict(x_train)
test_class_preds = xg_models.predict(x_test)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Confusion matrix for the training set

# confusion_matrix orders classes ascending: 0 = Not_Responded, 1 = Responded
labels = ['Not_Responded', 'Responded']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax=ax)  # annot=True annotates cells; fmt='d' prints whole counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)


In [None]:
# Confusion matrix for the test set

# confusion_matrix orders classes ascending: 0 = Not_Responded, 1 = Responded
labels = ['Not_Responded', 'Responded']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax=ax)  # annot=True annotates cells; fmt='d' prints whole counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
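# Note: predictions are passed as y_true here, so per-class precision and recall
# are interchanged in the printed report.
print(metrics.classification_report(train_class_preds, y_train))
print(" ")

# ROC AUC computed from hard class predictions rather than predicted probabilities
print("roc_auc_score")
print(metrics.roc_auc_score(y_train, train_class_preds))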


              precision    recall  f1-score   support

           0       0.78      0.89      0.83    204132
           1       0.81      0.65      0.72    146986

    accuracy                           0.79    351118
   macro avg       0.79      0.77      0.78    351118
weighted avg       0.79      0.79      0.78    351118


roc_auc_score
0.794894003685679

In [None]:
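# Note: predictions are passed as y_true here, so per-class precision and recall
# are interchanged in the printed report.
print(metrics.classification_report(test_class_preds, y_test))
print(" ")

# ROC AUC computed from hard class predictions rather than predicted probabilities
print("roc_auc_score")
print(metrics.roc_auc_score(y_test, test_class_preds))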


              precision    recall  f1-score   support

           0       0.77      0.88      0.82     87798
           1       0.80      0.63      0.71     62682

    accuracy                           0.78    150480
   macro avg       0.78      0.76      0.77    150480
weighted avg       0.78      0.78      0.77    150480


roc_auc_score
0.783971538025934

Looking at the results for the training set, the model achieved a precision of 0.78 for predicting the negative class (0) and 0.81 for predicting the positive class (1). Recall values were 0.89 and 0.65 for classes 0 and 1, respectively. The F1-score was 0.83 for class 0 and 0.72 for class 1. The weighted average F1-score was 0.78, which indicates that the model performed reasonably well overall.

The model achieved an accuracy of 0.79 on the training set, which means that it correctly classified 79% of the instances. The ROC-AUC score was 0.7949, indicating that the model's ability to distinguish between positive and negative classes was good.

On the test set, the model achieved a precision of 0.77 for class 0 and 0.80 for class 1. Recall values were 0.88 and 0.63 for classes 0 and 1, respectively. The F1-score was 0.82 for class 0 and 0.71 for class 1. The weighted average F1-score was 0.77, which is slightly lower than the training set.

The model achieved an accuracy of 0.78 on the test set, which is slightly lower than the training set. The ROC-AUC score was 0.7839, which is slightly lower than the training set.

Overall, the model seems to perform reasonably well, with comparable results on both the training and test sets. However, there is some room for improvement, especially in predicting the positive class, where the recall is lower than desired.
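One common way to raise recall for the positive class without retraining is to lower the decision threshold applied to the predicted probabilities. The sketch below is illustrative only: it uses synthetic data, scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier`, and a hypothetical helper `recall_precision_at`. The thresholds shown are arbitrary examples, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: not the insurance dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

# sklearn's gradient boosting stands in for XGBClassifier here
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
proba_pos = gb.predict_proba(X_te)[:, 1]

def recall_precision_at(threshold):
    """Classify as positive when P(class 1) >= threshold."""
    y_pred = (proba_pos >= threshold).astype(int)
    return (recall_score(y_te, y_pred),
            precision_score(y_te, y_pred, zero_division=0))

for t in (0.5, 0.4, 0.3):
    rec, prec = recall_precision_at(t)
    print(f"threshold={t:.1f}  recall={rec:.2f}  precision={prec:.2f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases, while precision usually falls. The right trade-off depends on the relative cost of a missed customer versus an unnecessary contact.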

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Number of trees
n_estimators = [50,80,100]

# Maximum depth of trees
max_depth = [4,6,8]

# Hyperparameter grid
# min_samples_split and min_samples_leaf are RandomForest parameters, not XGBoost
# ones; keeping them in the grid multiplies the number of fits without changing the model.
param_dict = {'n_estimators' : n_estimators,
              'max_depth' : max_depth}

# Create an instance of the XGBClassifier
xg_model = XGBClassifier()

# Grid search
xg_grid = GridSearchCV(estimator=xg_model,
                       param_grid = param_dict,
                       cv = 5, verbose=2, scoring='roc_auc')

# Fit the Algorithm
xg_grid1=xg_grid.fit(x_train,y_train)

# Predict on the model
# Making predictions on train and test data
train_class_preds = xg_grid1.predict(x_train)
test_class_preds = xg_grid1.predict(x_test)

In [None]:
print("Best: %f using %s" % (xg_grid.best_score_, xg_grid.best_params_))


In [None]:
# Visualizing evaluation Metric Score chart
# Confusion matrix for the training set

# confusion_matrix orders classes ascending: 0 = Not_Responded, 1 = Responded
labels = ['Not_Responded', 'Responded']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax=ax)  # annot=True annotates cells; fmt='d' prints whole counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)


In [None]:
# Confusion matrix for the test set

# confusion_matrix orders classes ascending: 0 = Not_Responded, 1 = Responded
labels = ['Not_Responded', 'Responded']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax=ax)  # annot=True annotates cells; fmt='d' prints whole counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
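# classification_report expects the true labels first, then the predictions
print(metrics.classification_report(y_train, train_class_preds))
print(" ")

# ROC AUC from the predicted probability of class 1
print("roc_auc_score")
print(metrics.roc_auc_score(y_train, xg_grid1.predict_proba(x_train)[:, 1]))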


In [None]:
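# classification_report expects the true labels first, then the predictions
print(metrics.classification_report(y_test, test_class_preds))
print(" ")

# ROC AUC from the predicted probability of class 1
print("roc_auc_score")
print(metrics.roc_auc_score(y_test, xg_grid1.predict_proba(x_test)[:, 1]))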


##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV, which applies the Grid Search technique to find the hyperparameter values that give the best model performance.

The goal is to find the hyperparameter values that give the best predictions from the model. One option is manual search by trial and error, but that takes a great deal of time even for a single model.

For this reason, methods such as Random Search and Grid Search were introduced. Grid Search evaluates every combination of the specified hyperparameter values, scores each combination, and selects the best one. This makes the search time-consuming and expensive, and the cost grows with the number of hyperparameters and candidate values.

GridSearchCV combines Grid Search with cross-validation: each combination is scored on held-out folds of the training data rather than on the data it was fitted to.

That is why I used GridSearchCV for hyperparameter optimization.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**DISCLAIMER: This step could not be completed because the grid search took too long to execute.**
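When an exhaustive grid is too slow, Random Search is the cheaper alternative: it evaluates only a fixed number of sampled combinations. The following is a minimal sketch under stated assumptions, not something that was run in this project. It uses scikit-learn's `RandomizedSearchCV` on synthetic data, with `GradientBoostingClassifier` standing in for `XGBClassifier`; the values of `n_iter` and `cv` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the insurance dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

param_distributions = {'n_estimators': [50, 80, 100],
                       'max_depth': [4, 6, 8]}

# n_iter caps the number of sampled combinations (4 of the 9 possible here)
search = RandomizedSearchCV(estimator=GradientBoostingClassifier(random_state=0),
                            param_distributions=param_distributions,
                            n_iter=4, cv=3, scoring='roc_auc',
                            random_state=0)
search.fit(X_demo, y_demo)

print("Best: %f using %s" % (search.best_score_, search.best_params_))
```

The search cost is set by `n_iter` times the number of folds, regardless of how large the grid is, at the price of possibly missing the best combination.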

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I would go with both Recall and Precision, and the metric that combines the two is the F1 score.

Recall matters for reducing false negatives, and precision matters for reducing false positives. Where both need to be minimized, the F1 score is the metric to consider. A false positive means the model predicted that a customer would buy vehicle insurance but the customer did not. Such a customer may still be open to buying later, so sending them a tailored offer is a modest cost. A false negative means the model predicted that a customer would not buy, but the customer was in fact interested. That is the bigger problem for us, because the customer is never contacted and the sale is lost, so false negatives have to be minimized. Reducing both false negatives and false positives improves recall and precision, which directly improves the F1 score. In our case recall carries the most weight, but precision cannot be neglected, so *recall should be higher and f1_score should be moderate.*
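The false-positive and false-negative counts discussed above can be read directly from a confusion matrix. The toy labels below are made up for illustration (1 = interested, 0 = not interested) and are not project data.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 1 = interested in vehicle insurance, 0 = not interested
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted buyers who are real buyers
recall = tp / (tp + fn)      # share of real buyers the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"FP={fp} FN={fn} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the four interested customers are missed, so recall is 0.50 even though two of the three positive predictions are correct.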

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The RandomForestClassifier model (with default hyperparameters) is a good choice for this classification problem.

Firstly, the model achieved an accuracy of 0.997 on the training set and 0.91 on the test set, which is higher than the test accuracy of the tuned Random Forest (0.77) and XGBoost (0.78).

Secondly, the precision and recall metrics are good: on the test set the reported values range from 0.82 to 0.97 across the two classes. This means that the model identifies most true positives while keeping false positives and false negatives low.

Thirdly, the F1-score, which is a combined metric of precision and recall, is 0.93 for class 0 and 0.88 for class 1 on the test set. This indicates that the model has a good balance between precision and recall for both classes, which is important in a classification problem where both classes matter.

Furthermore, the macro-average and weighted-average F1-scores on the test set are 0.90 and 0.91, which means that the model performs similarly well on both classes and is not strongly biased towards either one.

Lastly, the ROC AUC score of 0.92 on the test set indicates that the model has a good ability to distinguish between the positive and negative classes, which is important in this binary classification problem.

In summary, the RandomForestClassifier model is a good choice for this classification problem due to its high accuracy, good precision and recall, high F1-scores, balanced performance on both classes, and good ROC AUC score.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The Random Forest algorithm reports feature importance by considering how each feature is used across all the trees in the forest. This gives good insight into which features carry important information about the outcome variable. It uses "Gini impurity reduction", also called "mean decrease in impurity", to calculate the importance.

Feature importance is calculated by multiplying the impurity reduction at each node that splits on the feature by the proportion of samples reaching that node. The values are then averaged over all the trees to give the feature importance.

In sklearn, the fitted classifier exposes an attribute called `feature_importances_`, which holds the feature importance values.
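The averaging step can be checked directly: the forest-level importances are the mean of the per-tree importances, renormalised to sum to 1. This is a sketch on synthetic data with hypothetical names (`forest`, `per_tree`, `manual`), and it assumes every tree has at least one split, which holds for this data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the insurance dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each tree exposes its own impurity-based importances (normalised to sum to 1)
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over the trees, then renormalise
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print(np.round(manual, 3))
print(np.allclose(manual, forest.feature_importances_))
```

Impurity-based importances are computed on the training data and tend to favour features with many distinct values, so they are best read as a relative ranking.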

In [None]:
#create importance feature variable
importances = rf_model.feature_importances_

#Create a DataFrame to store the features and their corresponding importances
importance_dict = {'Feature' : list(x_train.columns),
                   'Feature_Importance' : importances}
importance_df = pd.DataFrame(importance_dict)

#Rounding the Feature Importance
importance_df['Feature_Importance'] = round(importance_df['Feature_Importance'],2)


In [None]:
#Sorting the features based on their importances with most important feature at the top.
importance_df = importance_df.sort_values(by=['Feature_Importance'],ascending=False)


In [None]:
plt.figure(figsize = (8,6))
#plot the values
sns.barplot(y='Feature',x='Feature_Importance', data = importance_df)

The top five features are Vintage_Treated, Annual_Premium_Treated, Vehicle_Damage, Age, and Previously_Insured. The importances are normalised and show the relative importance of the features.

**LIME**

LIME (Local Interpretable Model-agnostic Explanations) is a popular method for explaining the predictions of machine learning models. It provides a way to explain the relationship between input features and model predictions at a local level, which can be especially useful for understanding the behavior of complex models like Random Forests.

One of the main advantages of LIME is its model-agnostic approach, meaning that it can be applied to any machine learning model regardless of its underlying algorithm or complexity. LIME works by approximating the model's behavior in the local vicinity of a particular instance, generating a simpler "local" model that can be more easily interpreted by humans. This allows users to identify which input features are most important for a particular prediction, and how they are influencing the output.

Another advantage of LIME is that it provides a flexible framework for visualizing and interpreting feature importance scores. The LimeTabularExplainer object used in the code below, for example, allows users to generate feature importance scores and visualize them as a bar chart, making it easy to identify which features have the greatest impact on the model's predictions.

Overall, LIME is a powerful and flexible tool for interpreting the behavior of machine learning models, making it a popular choice for data scientists and machine learning practitioners.
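The local-approximation idea can be sketched without the `lime` package. The code below is a simplified illustration of the concept, not LIME's actual implementation: it perturbs one instance, weights the perturbed points by proximity, and fits a weighted linear model whose coefficients act as the local explanation. The data is synthetic, and the noise scale, kernel width, and `Ridge` surrogate are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# Synthetic stand-in data (assumption: not the insurance dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
black_box = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

rng = np.random.default_rng(0)
instance = X_demo[0]

# 1. Perturb the instance with Gaussian noise scaled by each feature's spread
scale = X_demo.std(axis=0)
neighbours = instance + rng.normal(size=(500, X_demo.shape[1])) * scale

# 2. Ask the black-box model for P(class 1) on the perturbed points
target = black_box.predict_proba(neighbours)[:, 1]

# 3. Weight neighbours by proximity to the instance (exponential kernel)
distance = np.linalg.norm((neighbours - instance) / scale, axis=1)
kernel_width = 0.75 * np.sqrt(X_demo.shape[1])  # hypothetical choice
weights = np.exp(-(distance ** 2) / kernel_width ** 2)

# 4. Fit a weighted linear model; its coefficients are the local explanation
surrogate = Ridge(alpha=1.0).fit(neighbours, target, sample_weight=weights)
local_effects = surrogate.coef_

for i, coef in enumerate(local_effects):
    print(f"feature_{i}: {coef:+.3f}")
```

A positive coefficient means that increasing the feature near this instance pushes the predicted probability of class 1 up. The explanation is valid only in the neighbourhood of the chosen instance.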






In [None]:
!pip install lime

In [None]:
import lime
import lime.lime_tabular



In [None]:

# Define the LimeTabularExplainer object
explainer = lime.lime_tabular.LimeTabularExplainer(x_train.values, feature_names=list(x_train.columns), class_names=['0', '1'])
# Select the first instance from the test data
instance = x_test.iloc[0]

# Generate feature importance scores using Lime
exp = explainer.explain_instance(instance.values, rf_model.predict_proba, num_features=len(x_train.columns))


In [None]:
# Get the feature names and scores from the Lime explanation
features, scores = zip(*exp.as_list())

# Create a horizontal bar chart of the feature importances
fig, ax = plt.subplots(figsize=(10,6))
ax.barh(features, scores, color='blue')

# Add labels and title
ax.set_xlabel('Importance Score')
ax.set_ylabel('Feature')
ax.set_title('Feature Importance')

# Show the plot
plt.show()

# **Conclusion**

**Key Insights:**

The highest count of customers falls within the age range of 20-30 for both males and females.

Customers over 40 years of age are more likely to respond to insurance buying.

Mean age for Response 1 (interested in buying insurance) is around 42, and for Response 0 (not interested), it is around 35.

Region code 28.0 has the maximum customers, followed by Region_C and Region_E.

The middle-aged adult age group has the highest number of responses, followed by the young adult and senior groups.

Policy Sales Channel D has the highest count compared to other channels.

Male customers have a higher count than female customers in both Response categories, suggesting that male customers may be more responsive to the company's outreach efforts.

Vehicles older than two years appear to be subject to higher premiums than others, and customers with damaged vehicles are willing to pay a higher premium for coverage.

Almost all customers (99.79%) had a driving license, and 45.82% of customers had previously purchased insurance.

**Business Solutions:**

Target marketing efforts towards the age group of 20-30 for both males and females as they form the highest count of customers.

Create customized insurance plans for customers over 40 years of age to target their specific needs and interests.

Devote more resources to Region code 28.0, Region_C and Region_E as they have the maximum customers.

Focus on marketing efforts towards the middle-aged adult age group as they have the highest number of responses.

Devote resources to Policy Sales Channel D as it has the highest count compared to other channels.

Offer incentives for customers who have previously purchased insurance to encourage them to continue with the company.

Consider partnering with driving schools to target potential new customers and build brand awareness among them.


### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***