<a href="https://colab.research.google.com/github/Immortal-sage/hello-world/blob/master/Health_insurance_cross_sell_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -** Rahul Kumar


# **Project Summary -**

 Predictive Analysis for Vehicle Insurance Sales

In the ever-evolving landscape of the insurance industry, understanding and effectively reaching potential customers is paramount to a company's success. This project centers around the development of predictive models to identify customers interested in purchasing Vehicle Insurance, thereby enabling data-driven communication strategies and optimizing the business model for enhanced revenue.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Our client, an insurance company, currently offers health insurance to its customers and is seeking assistance in developing a predictive model. The goal is to forecast whether policyholders from the previous year would also express an interest in the company's vehicle insurance offerings.

Insurance policies provide a financial safety net, where customers pay regular premiums to receive compensation in the event of specific losses, damages, illnesses, or even unfortunate events like accidents. The premiums act as a collective pool, enabling the insurer to cover potential expenses. For instance, a customer might pay an annual premium of Rs. 5000 for health insurance with coverage up to Rs. 200,000. In case of hospitalization or medical expenses, the insurance company bears the costs up to the policy's limit.

This process hinges on the principles of probability. Not every customer will require a payout in a given year. Instead, a few among many policyholders will make claims, and the collective pool of premiums ensures that the company can meet those commitments.
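
As a rough illustration of that pooling arithmetic, the sketch below reuses the Rs. 5000 premium and Rs. 200,000 coverage from the example above. The pool size of 1,000 policyholders and the assumption that every claim pays out the full coverage limit are hypothetical simplifications, not figures from the client.

```python
# Figures taken from the health insurance example above.
annual_premium = 5_000        # Rs. per policyholder per year
coverage_limit = 200_000      # Rs. maximum payout per claim

# Hypothetical pool size, chosen only for illustration.
policyholders = 1_000

premium_pool = annual_premium * policyholders

# Worst-case assumption: every claim pays out the full coverage limit.
max_full_claims = premium_pool // coverage_limit
break_even_claim_rate = annual_premium / coverage_limit

print(f"Premium pool: Rs. {premium_pool:,}")
print(f"Full-limit claims the pool can cover: {max_full_claims}")
print(f"Break-even claim rate: {break_even_claim_rate:.1%}")
```

Under these assumptions the pool stays solvent only while fewer than about 1 in 40 policyholders claim the full limit in a year, which is why the share of customers who claim matters more than any single claim.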

Similar to health insurance, vehicle insurance requires customers to pay annual premiums. In return, the insurance provider commits to offering compensation in the event of vehicle-related accidents or damages.

Creating a predictive model to anticipate customer interest in vehicle insurance holds immense value for the company. It empowers the company to tailor its communication strategies, reach out to potential customers effectively, and fine-tune its business operations to boost revenue.

In the quest to make these predictions, the available data includes various customer attributes such as gender, age, region code, vehicle details (age and damage), and policy-related information (premiums and sourcing channels).

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. For each chart, make sure the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. For each algorithm, make sure the following format is answered.

*   Explain the ML Model used and its performance using Evaluation metric Score Chart.
*   Cross-Validation & Hyperparameter Tuning
*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.
*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
)
from imblearn.over_sampling import SMOTE, RandomOverSampler
from xgboost import XGBClassifier

### Dataset Loading

In [None]:
# Load Dataset

df=pd.read_csv('/content/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv')


### Dataset First View

In [None]:
# Dataset First Look

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = df.shape  # shape is (number of rows, number of columns)
print(f"The number of rows is {rows} and the number of columns is {columns}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Missing Values/Null Values

In [None]:
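# Missing Values/Null Values Count
def healthinsu():
  """Summarise dtype, non-null count, null count, null percentage and unique count per column."""
  temp=pd.DataFrame(index=df.columns)
  temp["datatype"]=df.dtypes
  temp["not null values"]=df.count()
  temp["null value"]=df.isnull().sum()
  temp["% of the null value"]=df.isnull().mean()*100  # mean() gives a fraction; scale it to a percentage
  temp["unique count"]=df.nunique()
  return temp
healthinsu()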

In [None]:
# Visualizing the missing values
plt.figure(figsize=(8,6))
sns.heatmap(df.isnull(),cbar =False)

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**Columns information**

id : Unique ID for the customer

Gender : Gender of the customer

Age : Age of the customer

Driving_License : 0 : Customer does not have DL, 1 : Customer already has DL

Region_Code : Unique code for the region of the customer

Previously_Insured : 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance

Vehicle_Age : Age of the Vehicle

Vehicle_Damage : 1 : Customer got his/her vehicle damaged in the past, 0 : Customer didn't get his/her vehicle damaged in the past

Annual_Premium : The amount customer needs to pay as premium in the year

Policy_Sales_Channel : Anonymized code for the channel of outreach to the customer, i.e. different agents, over mail, over phone, in person, etc.

Vintage : Number of Days, Customer has been associated with the company

Response : 1 : Customer is interested, 0 : Customer is not interested

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(df.nunique())
print("Duplicate rows:", df.duplicated().sum())  # fully duplicated rows

## ***3. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#Dependent variable 'Response'
plt.figure(figsize=(8,7))
sns.set_theme(style='whitegrid')
sns.countplot(x=df['Response'],data=df)

#####  What is/are the insight(s) found from the chart?

As evident from the provided figure, it's clear that the data exhibits a significant imbalance.
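
To put a number on that imbalance, one option is to look at class shares and at the accuracy a trivial majority-class predictor would reach. The sketch below uses a hypothetical toy `Response` series (88 zeros, 12 ones), not the real dataset; on the notebook's data the same calls would be applied to `df['Response']`.

```python
import pandas as pd

# Hypothetical toy labels: 1 = interested, 0 = not interested.
response = pd.Series([0] * 88 + [1] * 12, name="Response")

# Share of each class, as fractions of all rows.
class_share = response.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = class_share.max()

# Ratio of majority rows to minority rows.
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
print(f"Imbalance ratio: {imbalance_ratio:.1f} : 1")
```

Any model evaluated on the original class ratio should be compared against this baseline, since plain accuracy can look high while the minority class is ignored.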

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#Distribution of Age
plt.figure(figsize=(15,8))
sns.countplot(x=df['Age'],data=df)


##### 1. Why did you pick the specific chart?

To see how customer ages are distributed.

##### 2. What is/are the insight(s) found from the chart?

The age distribution chart above highlights that the majority of customers fall within the 21 to 25-year age group, with only a small number of customers aged 60 or older.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
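plt.figure(figsize=(7,9))
counts = df['Previously_Insured'].value_counts()  # ordered by frequency, not by value
# Build the legend from the index so that each label matches its slice
labels = counts.index.map({1: 'Insured', 0: 'Not insured'}).tolist()
plt.pie(counts, autopct='%.0f%%', shadow=True, startangle=200, explode=[0.01,0])
plt.legend(labels=labels)
plt.show()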

##### 1. Why did you pick the specific chart?

To compare the share of customers who are already insured with the share who are not.

##### 2. What is/are the insight(s) found from the chart?

Out of the total customers, 54% have previous insurance coverage, while the remaining 46% are not insured. Customers without prior insurance are more likely to express interest.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(15,9))
a=df['Annual_Premium']
sns.histplot(a, kde=True, color='purple')  # histplot replaces the deprecated distplot

##### 1. Why did you pick the specific chart?

To check whether the distribution is skewed.

##### 2. What is/are the insight(s) found from the chart?

The distribution plot indicates that the variable "annual premium" is skewed to the right.
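
The visual impression of right skew can be checked numerically with the sample skewness. The sketch below is only an illustration on hypothetical log-normal premiums (the distribution and its parameters are assumptions); with the real data the same `.skew()` call would be applied to `df['Annual_Premium']`. A `log1p` transform is shown as one common way to reduce right skew, not as a step this notebook performs.

```python
import numpy as np
import pandas as pd

# Hypothetical premiums drawn from a log-normal distribution (long right tail).
rng = np.random.default_rng(0)
premium = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=5_000))

skew_raw = premium.skew()             # a positive value indicates right skew
skew_log = np.log1p(premium).skew()   # the log transform compresses the tail

print(f"Skewness before log1p: {skew_raw:.2f}")
print(f"Skewness after log1p:  {skew_log:.2f}")
```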






#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10,6))
sns.boxplot(x=df['Annual_Premium'])  # pass the column by keyword so seaborn treats it as the plotted variable

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Looking at the boxplot above, it's evident that there are numerous outliers in the "annual premium" variable.
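
The points a default box plot draws beyond its whiskers follow the 1.5 × IQR rule, so the outliers can also be counted directly. The sketch below applies that rule to a small hypothetical list of premiums; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical premiums; the last two values are deliberately extreme.
premium = pd.Series([26000, 28000, 30000, 31000, 33000,
                     35000, 38000, 40000, 150000, 500000])

q1 = premium.quantile(0.25)
q3 = premium.quantile(0.75)
iqr = q3 - q1

# Tukey's fences: the same 1.5 * IQR limits a default box plot uses for its whiskers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = premium[(premium < lower_fence) | (premium > upper_fence)]
print(f"Fences: [{lower_fence:.0f}, {upper_fence:.0f}]")
print(f"Outliers: {outliers.tolist()}")
```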

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(5,7))
sns.countplot(x=df['Vehicle_Damage'])

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Customers who have experienced vehicle damage are more inclined to purchase insurance.






#### Chart - 7

In [None]:
# Chart - 7 visualization code
df['Vehicle_Age'].hist();

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

The plot above shows that most customers own a vehicle that is at most 2 years old (the "< 1 Year" and "1-2 Year" categories), and only a small number own a vehicle older than 2 years.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
#Age VS Response
plt.figure(figsize=(16,8))
sns.countplot(data=df, x='Age',hue='Response', palette='CMRmap_r')
plt.xlabel('Age')
plt.ylabel('count')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Individuals in the age range of 31 to 50 are more inclined to respond, whereas younger individuals below the age of 30 display less interest in vehicle insurance.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
#Gender vs Response
df.groupby(['Gender', 'Response']).size().unstack().plot(kind = 'bar', stacked = True)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Male customers are slightly more numerous than female customers, and their likelihood of purchasing insurance is also slightly higher.
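
A stacked count chart mixes group size with response rate, so it can help to compute the rate per group separately. The sketch below does this with a row-normalised crosstab on a hypothetical toy frame that only borrows the column names `Gender` and `Response` from the dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset.
toy = pd.DataFrame({
    "Gender":   ["Male"] * 6 + ["Female"] * 4,
    "Response": [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
})

# Row-normalised crosstab: each row sums to 1, so column 1 is the response rate.
rate_table = pd.crosstab(toy["Gender"], toy["Response"], normalize="index")
response_rate = rate_table[1]

print(response_rate)
```

The same pattern applies to any categorical feature, for example `Vehicle_Age` or `Vehicle_Damage`.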

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize = (10,6) )
sns.countplot(data = df, x = 'Vehicle_Age', hue = 'Response', palette='Dark2_r')
plt.xlabel('Vehicle Age', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.title('Vehicle Age and Customer Response analysis', fontsize = 19)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Customers with a vehicle age of 1-2 years are more likely to be interested compared to the other two groups.

Customers with a vehicle age of less than 1 year have a very low chance of buying insurance.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
sns.barplot(x = 'Response', y ='Annual_Premium', data = df)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Customers who responded positively pay a slightly higher annual premium on average.

#### Chart - 12 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize = (20, 8))
sns.heatmap(df.corr(numeric_only=True), annot = True)  # text columns are not encoded yet, so restrict to numeric ones

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

The "Vintage" variable has almost no correlation with the target variable, so we can consider dropping it.

## ***4. Feature Engineering & Data Pre-processing***

###  Categorical Encoding

In [None]:
# Encode your categorical columns
df['Gender'] = df['Gender'].map({'Female':1, 'Male':0})
df.head()

In [None]:
df['Vehicle_Age']= df['Vehicle_Age'].map({'< 1 Year':0,'1-2 Year':1,'> 2 Years':2})
df.head()

In [None]:
df['Vehicle_Damage']=df['Vehicle_Damage'].map({'Yes':1, 'No':0})
df.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.
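
One thing worth checking after dictionary-based encoding is whether every category was covered, because `Series.map` returns `NaN` for any value missing from the dictionary. The sketch below shows such a check on a hypothetical column that contains one misspelled category; the misspelling is invented for illustration.

```python
import pandas as pd

# Hypothetical column with one value the mapping does not cover.
vehicle_age = pd.Series(["< 1 Year", "1-2 Year", "> 2 Years", "1-2 Years"])

# Ordinal mapping: older vehicles get larger codes.
mapping = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
encoded = vehicle_age.map(mapping)

# Series.map returns NaN for any value missing from the dictionary.
unmapped = vehicle_age[encoded.isna()].unique().tolist()
print(f"Unmapped categories: {unmapped}")
```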

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
correlation = df.corr()
correlation['Response'].sort_values(ascending = False)[1:]

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
X=df.drop(columns=['id','Driving_License','Policy_Sales_Channel','Vintage','Response'])# independent variable
y = df['Response']# dependent variable

In [None]:
# Fill any NaNs with the column mode (most frequent value)

fill_mode = lambda col: col.fillna(col.mode().iloc[0])  # mode() returns a Series; take its first value
X = X.apply(fill_mode, axis=0)
df = df.apply(fill_mode, axis=0)

### 9. Handling Imbalanced Dataset

In [None]:
# check for imbalance in data
df['Response'].value_counts()

##### Do you think the dataset is imbalanced? Explain Why.

It's evident that there is a significant class imbalance in the dataset. Standard machine learning techniques like Decision Trees and Logistic Regression tend to be biased toward the majority class, often ignoring the minority class. To address this issue, we can employ resampling techniques.

In [None]:
# Handling Imbalanced Dataset (If needed)#Resampling
ros = RandomOverSampler(random_state=0)
X_new,y_new= ros.fit_resample(X, y)

print("After Random Over Sampling Of Minor Class Total Samples are :", len(y_new))
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_new)))

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.
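
One point to weigh with random oversampling is its order relative to the train/test split: when rows are duplicated before splitting, copies of the same minority row can land in both sets, which tends to inflate test scores. The sketch below shows the alternative order (split first, then oversample the training set only) on hypothetical random data. It uses `sklearn.utils.resample` in place of `RandomOverSampler` so that it depends only on NumPy and scikit-learn; that substitution is an assumption of this sketch, not what the notebook runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 90 negatives, 10 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# 1. Split first, stratifying so both sets keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 2. Oversample the minority class in the training set only.
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=42,
)
X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])

print("Train counts after resampling:", np.bincount(y_train_bal))
print("Test counts (untouched):      ", np.bincount(y_test))
```

With this order the test set keeps the real class ratio, so the reported metrics describe how the model would behave on unseen customers.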

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X_train, X_test ,y_train, y_test=  train_test_split(X_new, y_new, random_state=42, test_size=0.3)
X_train.shape, X_test.shape , y_train.shape, y_test.shape

### 6. Data Scaling

In [None]:
# Normalizing the Dataset using Standard Scaling Technique.
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)


## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
#Importing Logistic Regression
model= LogisticRegression(random_state=42)
model=model.fit(X_train, y_train)
#Making prediction
pred = model.predict(X_test)
prob = model.predict_proba(X_test)[:,1]

In [None]:
# Evaluation
r_lgt= recall_score(y_test, pred)
print("recall_score : ", r_lgt)

p_lgt= precision_score(y_test, pred)
print("precision_score :",p_lgt)

f1_lgt= f1_score(y_test, pred)
print("f1_score :", f1_lgt)

A_lgt= accuracy_score(y_test, pred)
print("accuracy_score :",A_lgt)

# ROC AUC takes the true labels first and the predicted probabilities second
acu_lgt = roc_auc_score(y_test, prob)
print("ROC_AUC Score:",acu_lgt)


In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, _ = roc_curve(y_test, prob)

plt.title('Logistic Regression ROC curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle="--",color='black')
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
matrix= confusion_matrix(y_test, pred)
print(matrix)
sns.heatmap(matrix ,annot=True, fmt='g')

In [None]:
print(classification_report(y_test, pred))  # true labels first, predictions second

### ML Model - 2

In [None]:
RF_model= RandomForestClassifier(random_state=42)  # fixed seed so the results are reproducible
RF_model= RF_model.fit(X_train, y_train)
#Making prediction
rf_pred= RF_model.predict(X_test)
rf_proba= RF_model.predict_proba(X_test)[:,1]

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

r_rf=  recall_score(y_test, rf_pred)
print("recall_score : ", r_rf)

p_rf= precision_score(y_test, rf_pred)
print("precision_score :",p_rf)

f1_rf= f1_score(y_test, rf_pred)
print("f1_score :", f1_rf)

A_rf= accuracy_score(y_test, rf_pred)
print("accuracy_score :",A_rf)

acu_rf = roc_auc_score(y_test, rf_proba)  # true labels first, then predicted probabilities
print("ROC_AUC Score:",acu_rf)

In [None]:
fpr, tpr, _ = roc_curve(y_test, rf_proba)

plt.title('Random Forest ROC curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle="--",color='black')
plt.show()

In [None]:
matrix= confusion_matrix(y_test,rf_pred)
print(matrix)
sns.heatmap(matrix ,annot=True, fmt='g')

The updated confusion matrix suggests that the model has improved in predicting positive responses.

In [None]:
print(classification_report(y_test, rf_pred))  # true labels first, predictions second

The model performs well on the test set, so it is a reasonable candidate for predicting unseen data.

### ML Model - 3

In [None]:
XG_model= XGBClassifier()
XG_model= XG_model.fit(X_train, y_train)
#Making prediction
XG_pred = XG_model.predict(X_test)
XG_prob = XG_model.predict_proba(X_test)[:,1]

In [None]:
# Evaluation
r_XG= recall_score(y_test, XG_pred)
print("recall_score : ", r_XG)

p_XG= precision_score(y_test, XG_pred)
print("precision_score :",p_XG)

f1_XG= f1_score(y_test, XG_pred)
print("f1_score :", f1_XG)

A_XG= accuracy_score( y_test, XG_pred)
print("accuracy_score :",A_XG)

acu_XG = roc_auc_score(y_test, XG_prob)  # true labels first, then predicted probabilities
print("ROC_AUC Score:",acu_XG)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
fpr, tpr, _ = roc_curve(y_test, XG_prob)

plt.title('XGBoost ROC curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle="--",color='black')
plt.show()

In [None]:
matrix= confusion_matrix(y_test,XG_pred)
print(matrix)
sns.heatmap(matrix ,annot=True, fmt='g')


The confusion matrix indicates that the model has shown some improvement in predicting positive responses.

In [None]:
print(classification_report(y_test, XG_pred))  # true labels first, predictions second


**Comparison of the models**

In [None]:
com= ['Logistic Regression','Randomforest','XGBClassifier']
data={'Accuracy':[A_lgt,A_rf,A_XG],'Recall':[r_lgt,r_rf, r_XG],'Precision':[p_lgt, p_rf, p_XG], 'f1_score':[f1_lgt, f1_rf, f1_XG],'ROC_AUC':[acu_lgt, acu_rf, acu_XG]}
result=pd.DataFrame(data=data, index=com)
result

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.
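
When ROC AUC is among the metrics considered, it matters what is passed to `roc_auc_score`: the function expects the true labels first and a score (such as the predicted probability of the positive class) second. The sketch below, on hypothetical labels and probabilities, shows how the value changes when hard 0/1 predictions are passed instead of probabilities.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.20, 0.45, 0.60, 0.40, 0.55, 0.80, 0.90])

# Hard labels at a 0.5 threshold discard the ranking information.
y_label = (y_prob >= 0.5).astype(int)

auc_from_prob = roc_auc_score(y_true, y_prob)     # uses the full ranking
auc_from_label = roc_auc_score(y_true, y_label)   # reflects a single threshold only

print(f"AUC from probabilities: {auc_from_prob:.4f}")   # 0.8125
print(f"AUC from hard labels:   {auc_from_label:.4f}")  # 0.7500
```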

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.
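
A single train/test split can make the comparison between models noisy, and the guidelines ask for cross-validation. As a minimal sketch of how candidate models could be compared with stratified k-fold cross-validation, the code below uses synthetic data from `make_classification` as a stand-in for the insurance features; the data, the two candidates and their settings are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical imbalanced data standing in for the insurance features.
X, y = make_classification(n_samples=500, n_features=6, weights=[0.85, 0.15],
                           random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Stratified folds keep the class ratio roughly the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="roc_auc")
    cv_scores[name] = scores
    print(f"{name}: mean ROC AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```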

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.
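
One model-agnostic way to approach feature importance is scikit-learn's `permutation_importance`, which shuffles one feature at a time and measures how much the score drops. The sketch below runs it on hypothetical random data in which only the first feature determines the label; the feature names are invented, and in practice the importance would be computed on held-out data with the notebook's own fitted model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Hypothetical data: only the first feature determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)
feature_names = ["informative", "noise_a", "noise_b"]  # hypothetical names

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the drop in accuracy.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

ranking = sorted(zip(feature_names, result.importances_mean),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```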

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

- We began by loading the dataset and checking it for null values and duplicates; none were found, so no cleaning was needed.

- During Exploratory Data Analysis (EDA), we found that customers aged 31 to 50 show greater interest in vehicle insurance, whereas customers under the age of 30 tend to be less inclined to buy it.

- Furthermore, our analysis indicated that customers with vehicles older than 2 years show a higher propensity to consider vehicle insurance. Additionally, those with damaged vehicles display an increased likelihood of being interested in vehicle insurance.

- The variables with the greatest influence on the target variable include Age, Previously_Insured, and Annual_Premium. Using the Mutual Information technique, we identified Previously_Insured as the feature with the greatest impact on the target variable. In contrast, Vintage showed no notable correlation with the target variable.

- To address the highly imbalanced target variable, we applied random oversampling of the minority class (`RandomOverSampler`).

- We introduced feature scaling techniques to normalize the data, standardizing all features to a consistent scale. This standardization makes the data more suitable for processing by machine learning algorithms.

- In the subsequent phase, we applied a range of machine learning algorithms to forecast customer interest in Vehicle Insurance. Results indicated that the logistic regression model delivered an accuracy of 78%, the XGBClassifier achieved an accuracy of 80%, while the Random Forest model excelled with an accuracy of roughly 91% and an ROC_AUC score of 92%. These outcomes highlight the Random Forest model as the most effective choice among the models considered.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***