<a href="https://colab.research.google.com/github/Manjari-001/Vehicle-Insurance-cross-sell/blob/main/All_stars_ML_Health_insurance_cross_sell.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**

Our client is an insurance company that has provided Health Insurance to its customers. They now need your help in building a model to predict whether last year's policyholders (customers) will also be interested in the Vehicle Insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5,000 each year for a health insurance cover of Rs. 200,000 so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5,000, that is where the concept of probabilities comes into the picture. For example, like you, there may be 100 customers paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would get hospitalised that year. This way everyone shares the risk of everyone else.
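The arithmetic behind this risk pooling can be sketched in a few lines. This is only an illustration using the figures from the example above; taking the midpoint of "2-3" claims per 100 customers as the claim rate, and assuming every claim uses the full cover, are simplifying assumptions.

```python
# Illustrative figures taken from the example above
premium_per_customer = 5_000      # Rs. per year
cover_per_customer = 200_000      # Rs., maximum payout per claim
customers = 100
claims_low, claims_high = 2, 3    # "say 2-3" hospitalisations per year

# Total premium collected from all customers in one year
premium_pool = premium_per_customer * customers

# Assumption: use the midpoint of 2-3 claims per 100 customers as the claim rate
claim_rate = (claims_low + claims_high) / 2 / customers

# Worst case: every claim uses the full cover
expected_cost_per_customer = claim_rate * cover_per_customer

print(f"Premium pool: Rs. {premium_pool:,}")
print(f"Expected worst-case claim cost per customer: Rs. {expected_cost_per_customer:,.0f}")
```

Under these assumptions the expected claim cost per customer is no higher than the premium, which is why a small premium can back a much larger cover.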

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the insurance provider will pay compensation (called ‘sum assured’) to the customer.

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimise its business model and revenue.

Now, in order to predict whether the customer would be interested in Vehicle Insurance, you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.

# <b> Business Goal </b>
Building a model to predict whether a customer would be interested in Vehicle Insurance is very useful for the company, since it can then plan its marketing strategy to reach those customers and improve its business model and revenue.


# **Attribute Information**

1. id :	Unique ID for the customer

2. Gender	: Gender of the customer

3. Age :	Age of the customer

4. Driving_License : 0 : Customer does not have a DL, 1 : Customer already has a DL

5. Region_Code :	Unique code for the region of the customer

6. Previously_Insured	: 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance

7. Vehicle_Age :	Age of the Vehicle

8. Vehicle_Damage : 1 : Customer got his/her vehicle damaged in the past, 0 : Customer didn't get his/her vehicle damaged in the past.

9. Annual_Premium	: The amount customer needs to pay as premium in the year

10. Policy_Sales_Channel : Anonymized code for the channel of outreach to the customer, i.e. different agents, over mail, over phone, in person, etc.

11. Vintage :	Number of Days, Customer has been associated with the company

12. Response :	1 : Customer is interested, 0 : Customer is not interested

In [None]:
#importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

from sklearn.model_selection import train_test_split

# this is a binary classification problem, so import classifiers (not regressors)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

import warnings
warnings.filterwarnings('ignore')

In [None]:
#importing the dataset
from google.colab import drive
drive.mount('/content/drive')


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Copy of TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv')

In [None]:
df.head()

In [None]:
df.tail()

# Step 1 : Statistical Inference and Data pre-processing

In [None]:
#Total Observations
print("rows: ", df.shape[0])

#Total Features
print("columns: ", df.shape[1])

In [None]:
#Overview of given dataset
df.info()

In [None]:
#Checking for data types of features given in the dataset
df.dtypes

In [None]:
#Data Description
df.describe()

In [None]:
#Checking for Missing Values in given dataset
df.isna().sum()

In [None]:
# Finding Unique Values for each Variable
df.nunique()

# <b> Data Visualization </b>

Target feature

In [None]:
#checking the distribution of target variable
plt.subplot(1, 2, 1)
sns.countplot(x = 'Response', data = df)
plt.title('Count of Target variable')

plt.subplot(1,2,2)
count = df['Response'].value_counts()
count.plot.pie(autopct = '%1.1f%%',figsize = (10,8),explode = [0,0.1])

plt.title('Percentage of Response variable')

- From the above graphs, we can see that far more people are not opting for the vehicle insurance than are opting for it.


Gender

In [None]:
#Checking the distribution of Gender variable and their responses
plt.figure(figsize = (13,5))
plt.subplot(1,2,1)
sns.countplot(x = 'Gender', data = df)
plt.title("Distribution of gender")
plt.subplot(1,2,2)
sns.countplot(x = 'Gender', hue = 'Response', data = df)
plt.title("Responses of Male and Female")
plt.show()

- We can see that more males than females opt for the insurance, probably because of the asymmetrical gender distribution in the given dataset.
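Because raw counts depend on how many customers fall in each group, a response rate per group is often easier to compare. The following is a minimal sketch on toy data; the values are hypothetical and not taken from the real dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    "Gender":   ["Male"] * 6 + ["Female"] * 4,
    "Response": [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
})

# Raw counts of interested customers depend on the size of each group...
interested_counts = toy.groupby("Gender")["Response"].sum()

# ...whereas the mean of a 0/1 column is the share of interested customers
response_rate = toy.groupby("Gender")["Response"].mean()

print(interested_counts)
print(response_rate)
```

On the real data, the same `groupby` call on `df` would show whether the difference between genders persists once group size is taken into account.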

Age variable

In [None]:
#Checking the distribution of age along with responses
plt.figure(figsize=(20,10))
sns.countplot(x='Age',hue='Response',data=df)

- We can see that people aged 28-55 tend to buy insurance more than those outside this age range.
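One way to check an age-range observation like this numerically is to bin ages and compare response rates per bin. The sketch below uses toy values; the bin edges and labels are assumptions chosen to match the 28-55 range mentioned above.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    "Age":      [22, 25, 30, 45, 50, 60, 70, 35],
    "Response": [0,  0,  1,  0,  1,  0,  0,  1],
})

# Bins are right-inclusive by default: (0, 27], (27, 55], (55, 120]
toy["Age_Band"] = pd.cut(toy["Age"], bins=[0, 27, 55, 120],
                         labels=["<28", "28-55", ">55"])

# Share of interested customers in each age band
rate_by_band = toy.groupby("Age_Band", observed=True)["Response"].mean()
print(rate_by_band)
```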

Driving License

In [None]:
#Checking the distribution of people with driving license along with the responses
print("Percentage of  Driving_License feature\n ")
print(df['Driving_License'].value_counts()/len(df)*100)
f,ax = plt.subplots(nrows=1,ncols=2,figsize=(12,6))
axx = ax.flatten()
sns.countplot(x = 'Driving_License', data = df, ax = axx[0], palette = 'rocket')
sns.countplot(x = 'Driving_License', hue = 'Response', data = df, ax = axx[1])
f.suptitle("Count plot of Driving_License vs Response")

- We can see that the vast majority of people in our dataset (99%) have a driving license.

Previously insured

In [None]:
#Checking the distribution of people if they were previously insured along with their responses
f,ax = plt.subplots(nrows=1,ncols=2,figsize=(12,5))
axx = ax.flatten()
sns.countplot(x = 'Previously_Insured', data = df, ax = axx[0])
sns.countplot(x = 'Previously_Insured', hue = 'Response', data = df, ax = axx[1])

- We can see that customers who were not previously insured are the ones buying the insurance now, with a sharp drop in interest among those who were previously insured.

Vehicle age

In [None]:
#Checking the distribution of vehicle age along with previously insured
sns.countplot(x = "Vehicle_Age", hue = "Previously_Insured", data = df)

- We can see that people whose vehicles are less than 1 year old are the ones who bought the insurance most often, followed by those whose vehicles are 1-2 years old. People whose vehicles are more than 2 years old show the least interest in buying the insurance this time.

Vehicle damage

In [None]:
#Checking the distribution of vehicle damage along with the responses
plt.title("Plot of vehicle damage vs response")
sns.countplot(x = 'Vehicle_Damage', hue = 'Response', data = df)
plt.show()

- We can see that people whose vehicles were damaged in the past tend to buy insurance more than those whose vehicles were not.

Annual Premium

In [None]:
#Checking the distribution of insurance premium
plt.figure(figsize=(13,7))
sns.boxplot(x = df['Annual_Premium'],color='green')
plt.title("boxplot of Annual premium")
plt.show()

In [None]:
#Checking correlation between variables
plt.figure(figsize=(10,5))
sns.heatmap(df.corr(numeric_only=True),annot=True)  # skip text columns not yet encoded
plt.show()

## <b>Label Encoding</b>

In [None]:
#Encoding categorical variables into numeric Values
from sklearn.preprocessing import LabelEncoder
labelEncoder= LabelEncoder()
df['Gender'] = labelEncoder.fit_transform(df['Gender'])
df['Vehicle_Age'] = labelEncoder.fit_transform(df['Vehicle_Age'])
df['Vehicle_Damage'] = labelEncoder.fit_transform(df['Vehicle_Damage'])
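Note that `LabelEncoder` assigns codes in sorted order of the labels, which for an ordered category such as `Vehicle_Age` need not match the natural order. The sketch below contrasts it with an explicit mapping; the three category labels are assumptions and should be checked against `df['Vehicle_Age'].unique()` before use.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed category labels for Vehicle_Age (verify against the real data)
toy = pd.DataFrame({"Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years", "1-2 Year"]})

# LabelEncoder assigns codes in sorted (lexicographic) order of the labels
encoder = LabelEncoder()
toy["label_encoded"] = encoder.fit_transform(toy["Vehicle_Age"])
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))

# An explicit mapping keeps the natural ordering from youngest to oldest
age_order = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
toy["ordinal_encoded"] = toy["Vehicle_Age"].map(age_order)
print(toy)
```

Tree-based models are largely insensitive to this difference, but for linear models such as logistic regression the order of the codes can matter.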

# Separating Target and Independent Variables

In [None]:
x=df.drop(['Response'],axis=1) 
y=df['Response']   
x.head()   

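| | id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 44 | 1 | 28.0 | 0 | 2 | 1 | 40454.0 | 26.0 | 217 |
| 1 | 2 | 1 | 76 | 1 | 3.0 | 0 | 0 | 0 | 33536.0 | 26.0 | 183 |
| 2 | 3 | 1 | 47 | 1 | 28.0 | 0 | 2 | 1 | 38294.0 | 26.0 | 27 |
| 3 | 4 | 1 | 21 | 1 | 11.0 | 1 | 1 | 0 | 28619.0 | 152.0 | 203 |
| 4 | 5 | 0 | 29 | 1 | 41.0 | 1 | 1 | 0 | 27496.0 | 152.0 | 39 |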


## <b>Feature Selection</b>

In [None]:
# from sklearn.ensemble import ExtraTreesClassifier
# model = ExtraTreesClassifier()
# model.fit(x,y)
# print(model.feature_importances_)


In [None]:
# feat_importances = pd.Series(model.feature_importances_, index=x.columns)
# feat_importances.nlargest(11).plot(kind='barh')
# plt.show()

***We can remove less important features (Driving_License, Gender) from the data set***
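A feature-importance ranking of this kind can be sketched with `ExtraTreesClassifier`. The example below runs on synthetic data with hypothetical column names, so it only illustrates the mechanics and says nothing about the importances in the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for (x, y); the column names are hypothetical
X_demo, y_demo = make_classification(n_samples=500, n_features=6, n_informative=3,
                                     n_redundant=0, shuffle=False, random_state=0)
X_demo = pd.DataFrame(X_demo, columns=[f"feature_{i}" for i in range(6)])

# Fit an ensemble of randomized trees and read off impurity-based importances
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

feat_importances = pd.Series(model.feature_importances_, index=X_demo.columns)
feat_importances = feat_importances.sort_values(ascending=False)
print(feat_importances)
```

The importances sum to 1, so features near the bottom of the ranking contribute little to the splits and are candidates for removal.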

In [None]:
x=x.drop(['Driving_License','Gender'],axis=1)

In [None]:
df["Response"].value_counts()

0    334399
1     46710
Name: Response, dtype: int64

**As we can see, the target variable is unevenly distributed: only 46,710 of 381,109 responses (about 12%) are positive. Hence we need to handle this imbalanced data.**

## Handling Data Imbalance

In [None]:
# from imblearn.over_sampling import RandomOverSampler
# from collections import Counter
# randomsample=  RandomOverSampler()
# x_new,y_new=randomsample.fit_resample(x,y)

In [None]:


# from imblearn.under_sampling import RandomUnderSampler
# RUS = RandomUnderSampler(sampling_strategy=.5,random_state=3,)
# X_train,Y_train  = RUS.fit_resample(x,y)
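As an alternative to the commented-out imblearn samplers, random undersampling can be sketched with `sklearn.utils.resample`. This example uses a synthetic frame with hypothetical values, and resampling only the training split (so that the test set keeps the original class ratio) is a design choice of this sketch, not something the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced frame (hypothetical values): 90 negatives, 10 positives
demo = pd.DataFrame({"feature": range(100), "Response": [0] * 90 + [1] * 10})

# Split first so that the test set keeps the original class ratio
train, test = train_test_split(demo, test_size=0.3,
                               stratify=demo["Response"], random_state=0)

majority = train[train["Response"] == 0]
minority = train[train["Response"] == 1]

# Randomly drop majority rows until both classes have the same size
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)
train_balanced = pd.concat([majority_down, minority])

print(train_balanced["Response"].value_counts())
```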

## Splitting the data into train and test sets

In [None]:
#dividing the dataset into training and testing
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=.30,random_state=0)



## Scaling

In [None]:

#feature scaling
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
xtrain=scaler.fit_transform(xtrain)
xtest=scaler.transform(xtest)
xtest

array([[-0.95475991, -1.39062523,  1.08636261, ...,  0.47576066,
         0.73670425,  1.34732072],
       [ 0.14106638,  0.12146714, -0.92050296, ..., -0.32864093,
         0.22013717,  0.02118141],
       [ 2.13933784, -1.39062523, -0.92050296, ...,  4.82773917,
        -1.58784763, -1.00627787],
       ...,
       [-0.82583917, -0.8613929 ,  1.08636261, ...,  0.49794502,
         0.73670425,  1.02474629],
       [ 0.65674934, -0.0297421 , -0.92050296, ...,  0.38253974,
         0.93964132, -0.98238347],
       [ 0.72120971,  0.65069947, -0.92050296, ..., -0.18441344,
         0.22013717, -0.0982906 ]])

The problem can be identified as binary classification (whether or not a customer opts for vehicle insurance).

The dataset has more than 300k records.

We cannot go with an SVM classifier, as its training time grows quickly as the dataset gets larger.

The idea is to start model selection with:

 1. Logistic Regression

 2. Random Forest

 3. XGBClassifier

## Model Training & Evaluation

## Logistic Regression

In [None]:
# Logistic Regression Algorithm
logreg=LogisticRegression()
logreg.fit(xtrain,ytrain)

# Model Prediction
ypred=logreg.predict(xtest)


In [None]:
# Model Evaluation
print(f"Logistic regression \n{classification_report(ytest, ypred)}")

Logistic regression 
              precision    recall  f1-score   support

           0       0.88      1.00      0.93    100241
           1       0.00      0.00      0.00     14092

    accuracy                           0.88    114333
   macro avg       0.44      0.50      0.47    114333
weighted avg       0.77      0.88      0.82    114333
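The report above shows the model never predicting class 1, which is a common outcome on imbalanced data. One option, sketched here on synthetic data rather than on the real dataset, is the `class_weight="balanced"` setting of `LogisticRegression`; the data-generation parameters are assumptions chosen only to mimic a roughly 88/12 class split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 88% class 0, 12% class 1)
X_demo, y_demo = make_classification(n_samples=5000, n_features=8,
                                     weights=[0.88, 0.12], class_sep=0.5,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

# Same model, with and without reweighting the classes by inverse frequency
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))
print(f"Recall for class 1, default weights:  {recall_plain:.2f}")
print(f"Recall for class 1, balanced weights: {recall_balanced:.2f}")
```

Balanced weights usually raise recall on the minority class at the cost of precision, so the right trade-off depends on how costly a missed interested customer is compared with a wasted contact.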



## Decision Tree Classifier

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

In [None]:
models = []
models.append(("DT-gini  ", DecisionTreeClassifier(criterion="gini")))
models.append(("DT-entropy  ", DecisionTreeClassifier(criterion="entropy")))


for name, model in models:
    model.fit(xtrain, ytrain)
    ypred= model.predict(xtest)
    print(f"Name -: {name}\n{classification_report(ytest, ypred)}")

Name -: DT-gini  
              precision    recall  f1-score   support

           0       0.90      0.90      0.90    100241
           1       0.29      0.31      0.30     14092

    accuracy                           0.82    114333
   macro avg       0.60      0.60      0.60    114333
weighted avg       0.83      0.82      0.82    114333

Name -: DT-entropy  
              precision    recall  f1-score   support

           0       0.90      0.90      0.90    100241
           1       0.29      0.30      0.30     14092

    accuracy                           0.83    114333
   macro avg       0.60      0.60      0.60    114333
weighted avg       0.83      0.83      0.83    114333



In [None]:
# Model Evaluation (ypred still holds the predictions of the last model in the loop, DT-entropy)
print(f"Decision Tree Classifier \n{classification_report(ytest, ypred)}")

Decision Tree Classifier 
              precision    recall  f1-score   support

           0       0.90      0.90      0.90    100241
           1       0.29      0.30      0.30     14092

    accuracy                           0.83    114333
   macro avg       0.60      0.60      0.60    114333
weighted avg       0.83      0.83      0.83    114333



## Random Forest Classifier

In [None]:
# Random Forest Classifier Algorithm
random_forest=RandomForestClassifier()
random_forest.fit(xtrain,ytrain)

# Model Prediction
ypred=random_forest.predict(xtest)

In [None]:
# Model Evaluation
from sklearn import metrics
print(f"Random Forest Classifier \n{classification_report(ytest, ypred)}")
print(metrics.confusion_matrix(ytest, ypred))

Random Forest Classifier 
              precision    recall  f1-score   support

           0       0.89      0.97      0.93    100241
           1       0.36      0.13      0.19     14092

    accuracy                           0.86    114333
   macro avg       0.62      0.55      0.56    114333
weighted avg       0.82      0.86      0.84    114333

[[97087  3154]
 [12306  1786]]
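The class 1 figures in the report can be reproduced directly from the confusion matrix printed above, where rows are actual classes and columns are predicted classes. A short worked computation:

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class
cm = np.array([[97087, 3154],
               [12306, 1786]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall = tp / (tp + fn)      # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The high accuracy comes almost entirely from the majority class; only 1,786 of the 14,092 interested customers are actually identified.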


## Hyperparameter Tuning

In [None]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold, KFold, GridSearchCV
random_search = {'criterion': ['entropy', 'gini'],
               'max_depth': [2,3,4,5,6,7,10],
               'min_samples_leaf': [4, 6, 8],
               'min_samples_split': [5, 7,10],
               'n_estimators': [300]}

random_f = RandomForestClassifier()
model = RandomizedSearchCV(estimator = random_f, param_distributions = random_search, n_iter = 10, 
                               cv = 4, verbose= 1, random_state= 101, n_jobs = -1)
model.fit(xtrain,ytrain)

Fitting 4 folds for each of 10 candidates, totalling 40 fits
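Once a search like this has finished, the selected settings and the refitted model can be read from the search object. The sketch below uses a deliberately small search on synthetic data so that it runs quickly; the parameter values are illustrative and are not the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the search finishes in seconds
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

param_distributions = {"max_depth": [2, 3, 4], "min_samples_leaf": [4, 6, 8]}
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions=param_distributions,
    n_iter=3, cv=3, random_state=101,
)
search.fit(X_demo, y_demo)

print(search.best_params_)            # best combination that was sampled
print(round(search.best_score_, 3))   # mean cross-validated score (accuracy by default)
best_model = search.best_estimator_   # refitted on all the data passed to fit
```

Since accuracy is the default score, passing for example `scoring="f1"` to `RandomizedSearchCV` may be worth considering on imbalanced data.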
