## Cross Sell Vehicle Insurance Prediction 🏠 🏥

#### To Identify the Health Insurance Owners Who Will Be Interested in Vehicle Insurance

**VISION of the Project**: An insurance company should be able to identify prospective auto insurance holders from within its existing pool of health insurance holders. Successful cross-selling efforts could also strengthen the brand’s image and position in the insurance market.

**ISSUE to address**: Many consider health insurance one of the most important financial covers an individual or household can have in the event of an unexpected medical condition or accident, which is why it has been one of the top-selling insurance products for many years. In the past few years, however, auto insurance has gradually become known as the most profitable insurance sold in the market, owing to the increasing number of vehicle owners and rising awareness of the benefits of vehicle insurance in covering expensive repairs or accident damage. Searching for prospective insurance buyers, however, can be a laborious and costly process for any company.

**APPROACH to solve the Issue**: Use the records of all existing health insurance holders, including those who also hold auto insurance, perform a detailed analysis, and use machine learning to build a predictive model that identifies which health insurance holders are potential vehicle insurance buyers based on their profile.


# Variable Description

- id: Unique ID for the customer

- Gender: Gender of the customer

- Age:	Age of the customer

- Driving_License:	0 : Customer does not have DL, 1 : Customer already has DL

- Region_Code:	Unique code for the region of the customer

- Previously_Insured:	1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance

- Vehicle_Age:	Age of the Vehicle

- Vehicle_Damage:	1 : Customer got his/her vehicle damaged in the past. 0 : Customer didn't get his/her vehicle damaged in the past.

- Annual_Premium:	The amount customer needs to pay as premium in the year

- Policy_Sales_Channel:	Anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc.

- Vintage:	Number of Days, Customer has been associated with the company

- Response:	1 : Customer is interested, 0 : Customer is not interested

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv('../input/health-insurance-cross-sell-prediction/train.csv')

# Dataset Description

In [None]:
data.head()

In [None]:
data.size

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.select_dtypes(include="object").head()

In [None]:
data.select_dtypes(include="object").nunique()

In [None]:
data.isnull().sum()

#### Inferences - 
- The training dataset contains close to 380K (3.8 lakh) records.
- The training dataset has 11 independent features and 1 dependent / target variable.
- Of the independent features, 3 are categorical.
- The remaining 8 are numerical in nature.
- The dataset contains no missing values.
- Each categorical feature has between 2 and 3 unique categories.

# Data Cleaning

In [None]:
for i in data.select_dtypes(include="object").columns:
    print(i)
    print(data[i].unique())
    print()

In [None]:
data.describe().iloc[:,:-1]

In [None]:
for i in data.select_dtypes(include=np.number).columns[:-1]:
    print(i)
    print(data[i].nunique())
    print()

In [None]:
data.select_dtypes(include=np.number).head().iloc[:,:-1]

#### Observations - 
From the observations made on the structure and values of the features, and on comparing them with the description of the variables, we can conclude -
- id will be of no use to us as it is unique to each record. Hence it can be removed.
- Region_Code should be a categorical feature as it contains a code for the location the customer resides in.
- Policy_Sales_Channel should be categorical as it contains codes for various types of sales channels.
- Driving_License should be a categorical feature as it contains only 2 values, i.e. 0 and 1, indicating whether or not a customer has a driving licence.
- Previously_Insured should be a categorical feature as it contains only 2 values, i.e. 0 and 1, indicating whether or not a customer already has vehicle insurance.
- Response (target feature) should also be a categorical feature as it contains only 2 values, i.e. 0 and 1, indicating whether or not a health insurance customer is interested in vehicle insurance.
- Also, for better clarity on categories during model building and feature selection, we will give proper labels to the categories of Driving_License and Previously_Insured.

In [None]:
data.drop('id',axis=1,inplace=True)

In [None]:
data[["Region_Code","Policy_Sales_Channel"]] = data[["Region_Code","Policy_Sales_Channel"]].astype('int').astype("object")

In [None]:
data[["Driving_License","Previously_Insured","Response"]] = data[["Driving_License","Previously_Insured","Response"]].astype('object') 


In [None]:
# Assign the relabelled columns back explicitly; in-place replace through attribute access may not update `data`
data["Driving_License"] = data["Driving_License"].replace({1:"Has_License",0:"No_License"})
data["Previously_Insured"] = data["Previously_Insured"].replace({1:"Vehicle_Insured",0:"Vehicle_Not_Insured"})
data["Vehicle_Damage"] = data["Vehicle_Damage"].replace({"Yes":"Vehicle_Damaged","No":"Vehicle_Not_Damaged"})
data["Vehicle_Age"] = data["Vehicle_Age"].replace({"> 2 Years":"MoreThan2Years","1-2 Year":"Years1-2","< 1 Year":"LessThan1Year"})

In [None]:
data.select_dtypes(include="object").head()

In [None]:
data.select_dtypes(include=np.number).head()

# Exploratory Data Analysis

## Univariate Analysis

In [None]:
num_data = data.select_dtypes(include=np.number)
cat_data = data.select_dtypes(include="object")

In [None]:
plt.figure(figsize = (5,5))
proportions = cat_data.Response.value_counts(1)*100
labels = cat_data.Response.value_counts(1).index
plt.pie(proportions,labels=labels,autopct="%.2f")
plt.title("Proportion of Classes in Target Feature")
plt.show()

In [None]:
num_data.skew()

In [None]:
num_data.describe()

In [None]:
for i in num_data.columns:
    sns.histplot(num_data[i])
    plt.title(f"Distribution of {i}")
    plt.show()

In [None]:
for i in num_data.columns:
    sns.boxplot(x=num_data[i])
    plt.title(f"Spread of data and Outlier Detection in {i}")
    plt.show()

In [None]:
sns.countplot(x=cat_data.Gender)
plt.title("Frequency Analysis of Gender")
plt.show()

In [None]:
cat_data.Gender.value_counts(1)*100

In [None]:
sns.countplot(x=cat_data.Driving_License)
plt.title("Frequency Analysis of Driving_License")
plt.show()

In [None]:
cat_data.Driving_License.value_counts(1)*100

In [None]:
sns.countplot(x=cat_data.Previously_Insured)
plt.title("Frequency Analysis of Previously_Insured")
plt.show()

In [None]:
cat_data.Previously_Insured.value_counts(1)*100

In [None]:
sns.countplot(x=cat_data.Vehicle_Age)
plt.title("Frequency Analysis of Vehicle_Age")
plt.show()

In [None]:
cat_data.Vehicle_Age.value_counts(1)*100

In [None]:
sns.countplot(x=cat_data.Vehicle_Damage)
plt.title("Frequency Analysis of Vehicle_Damage")
plt.show()

In [None]:
cat_data.Vehicle_Damage.value_counts(1)*100

In [None]:
plt.figure(figsize=(15,30))
labels = data.Policy_Sales_Channel.value_counts().index
values = data.Policy_Sales_Channel.value_counts()
plt.barh(y=labels,width=values)
plt.ylabel("Sales Channel No.")
plt.xlabel("Frequency")
plt.title("Frequency Analysis of Policy_Sales_Channel",size=15)
plt.show()

In [None]:
cat_data.Policy_Sales_Channel.value_counts(1)*100

In [None]:
plt.figure(figsize=(25,10))
labels = data.Region_Code.value_counts().index
values = data.Region_Code.value_counts()
plt.bar(x=labels,height=values)
plt.xlabel("Region Code",size=15)
plt.ylabel("Frequency",size=15)
plt.title("Frequency Analysis of Region Code",size=20)
plt.show()

In [None]:
cat_data.Region_Code.value_counts(1)*100

#### Inferences - 
For Numerical Features - 
- Age and Annual Premium are positively skewed, with Annual Premium having a large skewness value of 1.6.
- Vintage, i.e. the duration of a customer's tenure, is uniformly distributed.
- In line with its high skewness, Annual Premium has a large number of outliers, some with very large values.


- Most insurance holders are around 20-30 years old, with a significant number also in the 40-50 range. The distribution therefore looks slightly bimodal, and the median age of insurance holders in this dataset is around 36 years.
- The median premium health insurance holders pay per annum is around 31000.
- With respect to Vintage, the number of insurance holders is more or less the same for every tenure length. There is no clear majority.


For Categorical Features - 
- Except Sales Channel and Region, all categorical features have only around 2-3 unique categories.
- Within driving license, those without a license make up less than 1% of the records. Such a feature will not be of much use.
- Sales Channel and Region have more than 20 unique categories each, which would make them difficult to encode later.
- However, both Sales Channel and Region have only a few dominating categories, with the others very sparsely represented. We can consider grouping such categories.


- The target feature, i.e. whether a health insurance holder would take vehicle insurance, is highly imbalanced, with 87% of health insurance holders not having purchased vehicle insurance.
- We have more males than females in the dataset, but only by a margin of around 25k.
- Almost all the health insurance holders have a driving license.
- We have more health insurance holders who have previously been insured for a vehicle, but only by a very small margin of 25k.
- More than 50% of the vehicle owners have vehicles which are around 1-2 years old.
- Amongst all vehicle owners, the proportions of those who suffered vehicle damage and those who didn't are nearly the same.
- The frequently used policy sales channels fall in the ranges 0-30, 120-125 and 150-160. The most used channels are 152, 26 and 124.
- In terms of region, the majority of the insurance holders come from region number 28, followed by 8.
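
The skewness and outlier observations above can be reproduced numerically. The following is a minimal sketch on made-up premium values, not on the dataset itself; the helper names `skewness` and `iqr_outlier_mask` are hypothetical, and the skewness here is the simple population form, so it will differ slightly from the bias-corrected value that `DataFrame.skew()` reports.

```python
import numpy as np

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

def iqr_outlier_mask(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whisker rule."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Hypothetical premiums: mostly moderate values plus a long right tail.
premiums = np.array([25, 28, 30, 31, 31, 32, 33, 35, 38, 40, 120, 250]) * 1000.0

mask = iqr_outlier_mask(premiums)
print("skewness:", round(skewness(premiums), 2))
print("outliers:", premiums[mask])
```

A positive skewness together with flagged points only on the upper side is the pattern described for Annual Premium.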

## Bi-variate Analysis

In [None]:
data[data.Response==0].describe()

In [None]:
data[data.Response==1].describe()

In [None]:
for i in data.select_dtypes(include=np.number).columns:
    sns.boxplot(x=data.Response,y=data[i])
    plt.title(f"Target Class 0 vs 1 : {i}")
    plt.show()

In [None]:
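for i in data.select_dtypes(include=np.number).columns:
    # Overlay the density of each target class (distplot is deprecated in seaborn)
    sns.kdeplot(data[data.Response==1][i],label="Response = 1")
    sns.kdeplot(data[data.Response==0][i],label="Response = 0")
    plt.title(f"Target Class 0 vs 1 : {i}")
    plt.legend()
    plt.show()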

In [None]:
print("STATISTICAL TEST - T-TEST INDEPENDENT SAMPLES")
print("To determine difference in sample means of target class 0 and 1")
print("Null Hypothesis: There is no difference in means")
print('Alternate Hypothesis: There is a difference in means')
print()

for i in ["Age","Vintage"]:
    print(i)
    print("P value for no difference:",stats.ttest_ind(data[data.Response==0][i],data[data.Response==1][i])[1])
    print()

In [None]:
print("STATISTICAL TEST - MANN WHITNEY U")
print("To determine difference in sample medians of target class 0 and 1")
print("Null Hypothesis: There is no difference in medians")
print('Alternate Hypothesis: There is a difference in medians')
print()
# Annual_Premium is skewed, so a rank-based test is used instead of the t-test
print("Annual_Premium")
print("P value for no difference:",stats.mannwhitneyu(data[data.Response==0]["Annual_Premium"],data[data.Response==1]["Annual_Premium"],alternative="two-sided")[1])
print()

In [None]:
for col in cat_data.drop(["Region_Code","Policy_Sales_Channel","Response"],axis=1).columns:
    sns.countplot(x=data[col],hue=data.Response)
    plt.title(f"Frequency Analysis : Target class 0 vs 1 - {col}")
    plt.show()

In [None]:
plt.figure(figsize = (15,50))
sns.countplot(y=data.Policy_Sales_Channel,hue=data.Response)
plt.title(f"Frequency Analysis : Target class 0 vs 1 - Policy Sales Channel",size=20)
plt.show()

In [None]:
plt.figure(figsize = (20,6))
sns.countplot(x=data.Region_Code,hue=data.Response)
plt.title(f"Frequency Analysis : Target class 0 vs 1 - Region",size=20)
plt.show()

In [None]:
print("STATISTICAL TEST - CHI SQUARE TEST OF INDEPENDENCE")
print("To determine whether the target classes depend on the predictors")
print("Null Hypothesis: The target class and predictor are independent of each other.")
print('Alternate Hypothesis: The target class and predictor are dependent on each other.')
print()

for i in cat_data.columns[:-1]:
    print(i)
    print("P value for no difference:",stats.chi2_contingency(pd.crosstab(data[i],data.Response))[1])
    print()

#### Inference - 

For numerical features:
- In terms of age, there appears to be a significant difference in the means of people with and without vehicle insurance. The median age of non insurance holders appears to be around 35, whereas for those with insurance it appears to be around 42-44.
- In terms of Annual Premium, graphically we don't see any difference between the distributions of those with and without insurance; they look identical. The median premium paid by both groups is around 30-33K. Since the Annual Premium samples are not normally distributed, we used the Mann-Whitney U test to test for a difference in medians.
- In terms of days spent / Vintage, we again see no difference in the distributions for those with and without vehicle insurance. As both distributions are uniform, the median for both groups is around 154 days.
- Based on the statistical tests for a significant difference in means or medians, Vintage and Annual Premium were the numerical features for which p > 0.05. Hence we fail to reject the null hypothesis of no difference in the average / median values of those with and without insurance, and we can exclude Vintage and Annual Premium from our analysis later.


For categorical features:
- We can see that almost all those who took the vehicle insurance have a driving licence. However, based on the analysis of unique values, we also observed that the proportion of those who don't have a driving license is negligible. There is hardly any variance across the categories, so predictions would always be biased towards those with a license. Hence, it will be removed.
- We can also observe that those who already have vehicle insurance didn't take another vehicle insurance.
- The majority of those who took the vehicle insurance have vehicles which are 1-2 years old.
- Almost all those who took vehicle insurance had suffered damage to their vehicles in the past.
- Most of those who took vehicle insurance were sold insurance through sales channels 26 and 124.
- The majority of those who took vehicle insurance were from the region coded 28.
- Based on the statistical analysis of the dependency between the target feature and each predictor, we found that p < 0.05 for all pairs of categorical features and target classes. Hence we cannot remove any categorical feature from the analysis.
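
To make the chi-square test of independence more concrete, here is a minimal sketch that computes the statistic by hand with NumPy. The counts are hypothetical and the helper `chi_square_statistic` is not part of the notebook; for a table larger than 2x2, `stats.chi2_contingency` should report the same statistic along with the p-value.

```python
import numpy as np

def chi_square_statistic(observed):
    """Pearson chi-square statistic, degrees of freedom and expected counts."""
    obs = np.asarray(observed, dtype=float)
    row_totals = obs.sum(axis=1, keepdims=True)
    col_totals = obs.sum(axis=0, keepdims=True)
    # Expected counts if the row and column variables were independent.
    expected = row_totals @ col_totals / obs.sum()
    statistic = float(((obs - expected) ** 2 / expected).sum())
    dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
    return statistic, dof, expected

# Hypothetical counts: rows = three Vehicle_Age categories, columns = Response 0 / 1.
table = np.array([
    [900, 100],
    [800, 200],
    [700, 300],
])

stat, dof, expected = chi_square_statistic(table)
print(f"chi2 = {stat:.1f}, dof = {dof}")
print(expected)
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value. With several hundred thousand records, even modest differences between categories tend to produce very small p-values.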

## Multi-variate Analysis

In [None]:
sns.heatmap(data.select_dtypes(include=np.number).corr(),annot=True)
plt.title("Correlation Heatmap",size=15)
plt.show()

In [None]:
sns.scatterplot(x=data.Age,y=data.Annual_Premium,hue=data.Response)

In [None]:
sns.scatterplot(x=data.Age,y=data.Vintage,hue=data.Response)

In [None]:
sns.scatterplot(x=data.Annual_Premium,y=data.Vintage,hue=data.Response)

#### Inferences
- Based on the correlation heatmap, we can observe that none of the pairs of numerical features have a considerable relationship with each other.
- Based on the scatterplots, which show the relationship between two numerical features and how well they separate the target classes, we can conclude that the target classes cannot be separated by the interaction of two numerical features.

#### KEY CONCLUSIONS FROM EDA
- We can conclude that the target classes aren't going to be linearly separable based on the numerical features.
- Also, given the positive skewness and extreme outliers in the dataset, transformations might need to be performed on the numerical features, which could hurt the interpretability of the model and its business interpretation.
- Based on the bivariate analysis of the categorical features and their trends with respect to the target classes, we can conclude that tree based algorithms should work well.
- Also, with more than 380K records, a KNN model would take a large amount of time to run, as it would have to calculate the distance from each record to all records in the training data and repeat that process for every prediction. Hence we won't consider a KNN model during model building.
- Hence we will focus mainly on tree based algorithms during model building.
- Also, as several categorical features might need to be encoded, we might decide on using Naive Bayes.
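
A back-of-the-envelope count makes the KNN cost concrete. The sketch below assumes a brute-force neighbour search, a rounded record count of 380,000 and a 70/30 split; library implementations can use tree-based indexes that reduce the work, so treat the figure as a rough illustration only.

```python
# Rough cost of brute-force KNN prediction; all figures are approximations.
n_records = 380_000      # rounded record count (assumption)
test_fraction = 0.30     # assumed 70/30 train/test split

n_test = round(n_records * test_fraction)
n_train = n_records - n_test

# Brute force compares every test record with every training record.
n_distances = n_train * n_test
print(f"{n_train:,} train x {n_test:,} test = {n_distances:,} distance computations")
```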

# Data Preparation

In [None]:
data_copy = data.copy()

In [None]:
data.drop(["Vintage","Annual_Premium","Driving_License"],axis=1,inplace=True)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PowerTransformer, PolynomialFeatures, OneHotEncoder

In [None]:
data.select_dtypes(include="object").nunique()

#### NOTE - 
- Since we aren't performing regression or KNN, we don't need to worry about scaling the data or transforming it to be normal.
- Also, since tree based algorithms are robust to outliers, there is no need to treat them. However, if the models do struggle, we can consider it as a means of improvement.
- The only numerical feature left in the dataset, i.e. Age, doesn't seem to have any outliers.
- Since we don't have any missing values, there is no need for missing value treatment.
- The only need is to encode the categorical features.

Important Points about Encoding:
- Except region and sales channel, all other features can be one hot encoded as they have only around 2-3 unique categories.
- As region code and policy sales channel have more than 50 unique categories, we cannot encode them directly.
- Target and frequency encoding are not appropriate as they won't help in any logical interpretation.
- Label encoding is not suitable as the above two categorical features have no logical order.
- Hence, we need to try to one hot encode them. However, one hot encoding will create n additional columns, where n is the number of unique categories in the variable, which will lead to the curse of dimensionality.
- To prevent that, we will group together all categories which contribute less than 2% of the total records in the dataset under the category "Others".
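
The grouping rule in the last point can be sketched as a small reusable helper. The function name `group_rare_categories`, its parameters and the sample codes below are hypothetical and only illustrate the idea on toy data.

```python
import pandas as pd

def group_rare_categories(series, min_pct=2.0, prefix="", other_label="Others"):
    """Relabel categories whose share of rows is below min_pct as other_label."""
    share = series.value_counts(normalize=True) * 100
    mapping = {
        category: (other_label if pct < min_pct else f"{prefix}{category}")
        for category, pct in share.items()
    }
    return series.map(mapping)

# Hypothetical region codes: 28 and 8 dominate, 3 and 41 are rare (1% each).
codes = pd.Series([28] * 60 + [8] * 38 + [3] * 1 + [41] * 1)

grouped = group_rare_categories(codes, min_pct=2.0, prefix="Region_")
print(grouped.value_counts())
```

One design choice worth noting: the shares here are computed on whatever series is passed in. A stricter setup would learn the mapping from the training split only and then apply it to the test split.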

In [None]:
# Share (%) of records per region; regions below 2% are grouped under "Others"
region_share = data.Region_Code.value_counts(normalize=True)*100
Region_Dict = {}
for code,share in region_share.items():
    if share < 2:
        Region_Dict[code] = "Others"
    else:
        Region_Dict[code] = "Region_" + str(code)

In [None]:
# Share (%) of records per sales channel; channels below 2% are grouped under "Others"
channel_share = data.Policy_Sales_Channel.value_counts(normalize=True)*100
Policy_Sales_Channel_Dict = {}
for code,share in channel_share.items():
    if share < 2:
        Policy_Sales_Channel_Dict[code] = "Others"
    else:
        Policy_Sales_Channel_Dict[code] = "Channel_" + str(code)

In [None]:
data.Region_Code = data.Region_Code.map(Region_Dict)
data.Policy_Sales_Channel = data.Policy_Sales_Channel.map(Policy_Sales_Channel_Dict)

In [None]:
X = data.drop("Response",axis=1)
y= data.Response.astype("int")

In [None]:
xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=0.30,random_state=0,stratify=y)

In [None]:
xtrain.head()

In [None]:
xtrain.select_dtypes(include="object").nunique()

- The number of categories in Region and Policy Sales Channel has been drastically reduced, so they can now be one hot encoded.

In [None]:
OHE = OneHotEncoder(drop="first",handle_unknown="error").fit(xtrain[["Gender","Previously_Insured","Vehicle_Age","Vehicle_Damage","Region_Code","Policy_Sales_Channel"]])

In [None]:
xtrain_cat = pd.DataFrame(OHE.transform(xtrain[["Gender","Previously_Insured","Vehicle_Age","Vehicle_Damage","Region_Code","Policy_Sales_Channel"]]).toarray(),columns=OHE.get_feature_names_out(),index=xtrain.index)

In [None]:
xtest_cat = pd.DataFrame(OHE.transform(xtest[["Gender","Previously_Insured","Vehicle_Age","Vehicle_Damage","Region_Code","Policy_Sales_Channel"]]).toarray(),columns=OHE.get_feature_names_out(),index=xtest.index)

In [None]:
xtrain_cat.head()

In [None]:
xtrain = xtrain.merge(xtrain_cat,left_index=True,right_index=True).drop(["Gender","Previously_Insured","Vehicle_Age","Vehicle_Damage","Region_Code","Policy_Sales_Channel"],axis=1)

In [None]:
xtest = xtest.merge(xtest_cat,left_index=True,right_index=True).drop(["Gender","Previously_Insured","Vehicle_Age","Vehicle_Damage","Region_Code","Policy_Sales_Channel"],axis=1)

In [None]:
xtrain.head()

# Model Building

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score,f1_score,classification_report,confusion_matrix

In [None]:
results = pd.DataFrame({"Model":None,"Train F1":None,"Test F1":None,"CV Mean F1":None,"CV std in scores":None},index=range(0,6))

### Naive Bayes - Bernoulli

In [None]:
nb = BernoulliNB().fit(xtrain.drop("Age",axis=1),ytrain)
f1_train_nb = f1_score(ytrain,nb.predict(xtrain.drop("Age",axis=1)))
f1_test_nb = f1_score(ytest,nb.predict(xtest.drop("Age",axis=1)))
print("f1:",f1_train_nb)
print()
print("Model Performance Report")
print("Training Performance:",classification_report(ytrain,nb.predict(xtrain.drop("Age",axis=1))),sep="\n")
print()
print("Test Performance:",classification_report(ytest,nb.predict(xtest.drop("Age",axis=1))),sep="\n")

In [None]:
cv_nb = cross_val_score(BernoulliNB(),xtrain.drop("Age",axis=1),ytrain,cv=3,scoring="f1")
print("Mean of f1 scores:",cv_nb.mean())
print("Deviation in f1 scores:",cv_nb.std())
results.iloc[0,0] = "Naive Bayes"
results.iloc[0,1] = f1_train_nb
results.iloc[0,2] = f1_test_nb
results.iloc[0,3] = cv_nb.mean()
results.iloc[0,4] = cv_nb.std()

### Decision Tree

In [None]:
dt = DecisionTreeClassifier().fit(xtrain,ytrain)
f1_train_dt = f1_score(ytrain,dt.predict(xtrain))
f1_test_dt = f1_score(ytest,dt.predict(xtest))
print("f1 score:",f1_train_dt)
print()
print("Model Performance Report")
print("Training Performance:",classification_report(ytrain,dt.predict(xtrain)),sep="\n")
print()
print("Test Performance:",classification_report(ytest,dt.predict(xtest)),sep="\n")

In [None]:
cv_dt = cross_val_score(DecisionTreeClassifier(),xtrain,ytrain,cv=3,scoring="f1")
print("Mean of f1 scores:",cv_dt.mean())
print("Deviation in f1 scores:",cv_dt.std())
results.iloc[1,0] = "Decision Tree"
results.iloc[1,1] = f1_train_dt
results.iloc[1,2] = f1_test_dt
results.iloc[1,3] = cv_dt.mean()
results.iloc[1,4] = cv_dt.std()

### Random Forest

In [None]:
rf = RandomForestClassifier().fit(xtrain,ytrain)
f1_train_rf = f1_score(ytrain,rf.predict(xtrain))
f1_test_rf = f1_score(ytest,rf.predict(xtest))
print("f1:",f1_train_rf)
print()
print("Model Performance Report")
print("Training Performance:",classification_report(ytrain,rf.predict(xtrain)),sep="\n")
print()
print("Test Performance:",classification_report(ytest,rf.predict(xtest)),sep="\n")

In [None]:
cv_rf = cross_val_score(RandomForestClassifier(),xtrain,ytrain,cv=3,scoring="f1")
print("Mean of f1 scores:",cv_rf.mean())
print("Deviation in f1 scores:",cv_rf.std())
results.iloc[2,0] = "Random Forest"
results.iloc[2,1] = f1_train_rf
results.iloc[2,2] = f1_test_rf
results.iloc[2,3] = cv_rf.mean()
results.iloc[2,4] = cv_rf.std()

### Adaboosting Classifier

In [None]:
ada = AdaBoostClassifier().fit(xtrain,ytrain)
f1_train_ada = f1_score(ytrain,ada.predict(xtrain))
f1_test_ada = f1_score(ytest,ada.predict(xtest))
print("f1:",f1_train_ada)
print()
print("Model Performance Report")
print("Training Performance:",classification_report(ytrain,ada.predict(xtrain)),sep="\n")
print()
print("Test Performance:",classification_report(ytest,ada.predict(xtest)),sep="\n")

In [None]:
cv_ada = cross_val_score(AdaBoostClassifier(),xtrain,ytrain,cv=3,scoring="f1")
print("Mean of f1 scores:",cv_ada.mean())
print("Deviation in f1 scores:",cv_ada.std())
results.iloc[3,0] = "Adaboost"
results.iloc[3,1] = f1_train_ada
results.iloc[3,2] = f1_test_ada
results.iloc[3,3] = cv_ada.mean()
results.iloc[3,4] = cv_ada.std()

### Gradient Boosting

In [None]:
gb = GradientBoostingClassifier().fit(xtrain,ytrain)
f1_train_gb = f1_score(ytrain,gb.predict(xtrain))
f1_test_gb = f1_score(ytest,gb.predict(xtest))
print("f1:",f1_train_gb)
print()
print("Model Performance Report")
print("Training Performance:",classification_report(ytrain,gb.predict(xtrain)),sep="\n")
print()
print("Test Performance:",classification_report(ytest,gb.predict(xtest)),sep="\n")

In [None]:
cv_gb = cross_val_score(GradientBoostingClassifier(),xtrain,ytrain,cv=3,scoring="f1")
print("Mean of f1 scores:",cv_gb.mean())
print("Deviation in f1 scores:",cv_gb.std())
results.iloc[4,0] = "Gradient boost"
results.iloc[4,1] = f1_train_gb
results.iloc[4,2] = f1_test_gb
results.iloc[4,3] = cv_gb.mean()
results.iloc[4,4] = cv_gb.std()

### XGB

In [None]:
xgb = XGBClassifier().fit(xtrain,ytrain)
f1_train_xgb = f1_score(ytrain,xgb.predict(xtrain))
f1_test_xgb = f1_score(ytest,xgb.predict(xtest))
print("f1:",f1_train_xgb)
print()
print("Model Performance Report")
print("Training Performance:",classification_report(ytrain,xgb.predict(xtrain)),sep="\n")
print()
print("Test Performance:",classification_report(ytest,xgb.predict(xtest)),sep="\n")

In [None]:
cv_xgb = cross_val_score(XGBClassifier(),xtrain,ytrain,cv=3,scoring="f1")
print("Mean of f1 scores:",cv_xgb.mean())
print("Deviation in f1 scores:",cv_xgb.std())
results.iloc[5,0] = "XGB"
results.iloc[5,1] = f1_train_xgb
results.iloc[5,2] = f1_test_xgb
results.iloc[5,3] = cv_xgb.mean()
results.iloc[5,4] = cv_xgb.std()

In [None]:
results

#### Inferences - 

NOTE - Due to the heavy imbalance in the target classes, we need to give considerable importance to both precision and recall of class 1. Hence, throughout the evaluation, we will focus on the **F1 score**.
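
Why accuracy is a poor guide under this kind of imbalance can be shown with a few lines of arithmetic. The sketch below uses hypothetical confusion counts for a test set of 1,000 customers with an 87% / 13% class split; the helper `precision_recall_f1` is illustrative and not part of the notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical test set: 870 customers in class 0 and 130 in class 1.
n_neg, n_pos = 870, 130

# A model that always predicts class 0 looks accurate but never finds class 1.
accuracy_all_zero = n_neg / (n_neg + n_pos)
_, _, f1_all_zero = precision_recall_f1(tp=0, fp=0, fn=n_pos)

# A model that finds 100 of the 130 positives at the cost of 200 false alarms.
p, r, f1 = precision_recall_f1(tp=100, fp=200, fn=30)

print("always-0 accuracy:", accuracy_all_zero, "F1:", f1_all_zero)
print("precision:", round(p, 3), "recall:", round(r, 3), "F1:", round(f1, 3))
```

The always-0 model reaches high accuracy with an F1 of zero, which is why F1 on class 1 is the more informative metric here.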

We created three types of models: probabilistic, tree based and boosting based.
- The Naive Bayes model produced the best results of all models despite the heavy imbalance, as it relies only on probabilities. It achieved an F1 score of 0.419 on the test data and an average F1 score of 0.419 during cross validation, indicating no signs of overfitting.
- Tree based algorithms like decision tree and random forest recorded very poor scores across both train and test, with F1 scores in the range of 0.10-0.20 and mean cross validation F1 scores in the range of 0.11-0.14.
- Boosting techniques such as adaboost, gradient boost and xgb performed even worse, with negligible scores across train and test, falling in the range of 0.0 to 0.03.
- Tree based and boosting based algorithms suffered heavily from the imbalance in target classes across train and test, where the proportion of class 0 to class 1 is 87% to 13%.
- To improve these scores, we need to oversample the minority class in the TRAIN DATA only: this improves learning of class 1 during training, and leaving the test data untouched prevents data leakage.
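
How oversampling creates new minority rows can be illustrated with a simplified, SMOTE-like interpolation in NumPy. This is not the `imblearn` implementation; the function name `smote_like_samples`, its parameters and the sample rows are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X))             # random minority point
        j = neighbours[i, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                   # position along the segment i -> j
        synthetic[row] = X[i] + gap * (X[j] - X[i])
    return synthetic

# Hypothetical minority-class rows with two numeric features.
minority = np.array([[30.0, 1.0], [32.0, 1.0], [45.0, 0.0], [47.0, 0.0], [50.0, 1.0]])

new_rows = smote_like_samples(minority, n_new=4)
print(new_rows)
```

Every synthetic row lies on a line segment between two real minority rows. One consequence is that 0/1 dummy columns can take fractional values after interpolation, which is worth keeping in mind when oversampling one hot encoded data.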

## OverSampling the Minority Class

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.naive_bayes import BernoulliNB

In [None]:
print("Total instances = ",len(ytrain))
print("Instances of class 0 = ",len(ytrain[ytrain == 0]))
print("Instances of class 1 = ",len(ytrain[ytrain == 1]))
print("50% of class 0 = ",round(0.5*len(ytrain[ytrain == 0])))

In [None]:
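# Oversample class 1 up to 50% of the class 0 count printed in the previous cell
strategy = {1:int(round(0.5*len(ytrain[ytrain == 0])))}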
smote = SMOTE(sampling_strategy=strategy,random_state=0)

In [None]:
xtrain_over,ytrain_over = smote.fit_resample(xtrain,ytrain)

In [None]:
ytrain_over.value_counts(1)*100

In [None]:
results_smote = pd.DataFrame({"Model":None,"Train F1":None,"Test F1":None,"CV Mean F1":None,"CV std in scores":None},index=range(0,3))

## Creating Baseline Models with Oversampled Data

In [None]:
nb = BernoulliNB().fit(xtrain_over.drop("Age",axis=1),ytrain_over)
f1_train_nb = f1_score(ytrain_over,nb.predict(xtrain_over.drop("Age",axis=1)))
f1_test_nb = f1_score(ytest,nb.predict(xtest.drop("Age",axis=1)))

print("f1 on train:",f1_train_nb)
print("f1 on test:",f1_test_nb)

print()
print("Model Performance Report on train")
print(classification_report(ytrain_over,nb.predict(xtrain_over.drop("Age",axis=1))))

print("Model Performance Report on test")
print(classification_report(ytest,nb.predict(xtest.drop("Age",axis=1))))

In [None]:
cv_nb = cross_val_score(BernoulliNB(),xtrain_over.drop("Age",axis=1),ytrain_over,cv=3,scoring="f1")
print("Mean of f1 scores:",cv_nb.mean())
print("Deviation in f1 scores:",cv_nb.std())
results_smote.iloc[0,0] = "Naive Bayes"
results_smote.iloc[0,1] = f1_train_nb
results_smote.iloc[0,2] = f1_test_nb
results_smote.iloc[0,3] = cv_nb.mean()
results_smote.iloc[0,4] = cv_nb.std()

In [None]:
dt = DecisionTreeClassifier().fit(xtrain_over,ytrain_over)
f1_train_dt = f1_score(ytrain_over,dt.predict(xtrain_over))
f1_test_dt = f1_score(ytest,dt.predict(xtest))

print("f1 on train:",f1_train_dt)
print("f1 on test:",f1_test_dt)

print()
print("Model Performance Report on train")
print(classification_report(ytrain_over,dt.predict(xtrain_over)))

print("Model Performance Report on test")
print(classification_report(ytest,dt.predict(xtest)))

In [None]:
cv_dt = cross_val_score(DecisionTreeClassifier(),xtrain_over,ytrain_over,cv=3,scoring="f1")
print("Mean of f1 scores:",cv_dt.mean())
print("Deviation in f1 scores:",cv_dt.std())
results_smote.iloc[1,0] = "Decision Tree"
results_smote.iloc[1,1] = f1_train_dt
results_smote.iloc[1,2] = f1_test_dt
results_smote.iloc[1,3] = cv_dt.mean()
results_smote.iloc[1,4] = cv_dt.std()

In [None]:
rb = RandomForestClassifier().fit(xtrain_over,ytrain_over)
f1_train_rb = f1_score(ytrain_over,rb.predict(xtrain_over))
f1_test_rb = f1_score(ytest,rb.predict(xtest))

print("f1 on train:",f1_train_rb)
print("f1 on test:",f1_test_rb)

print()
print("Model Performance Report on train")
print(classification_report(ytrain_over,rb.predict(xtrain_over)))

print("Model Performance Report on test")
print(classification_report(ytest,rb.predict(xtest)))

In [None]:
cv_rb = cross_val_score(RandomForestClassifier(),xtrain_over,ytrain_over,cv=3,scoring="f1")
print("Mean of f1 scores:",cv_rb.mean())
print("Deviation in f1 scores:",cv_rb.std())
results_smote.iloc[2,0] = "Random Forest Classifier"
results_smote.iloc[2,1] = f1_train_rb
results_smote.iloc[2,2] = f1_test_rb
results_smote.iloc[2,3] = cv_rb.mean()
results_smote.iloc[2,4] = cv_rb.std()

In [None]:
results_smote

#### Inferences - 
On building baseline models using oversampled data, we can conclude the following - 
- We built 3 baseline models - Bernoulli Naive Bayes, Decision Tree and Random Forest.
- The decision tree and random forest gave the highest f1 scores on the train sample, at around 0.75.
- However, tree based algorithms are prone to overfitting: the deviation in their cross validation scores was close to 3%, and their f1 scores on test were only around 0.424-0.426.
- The naive bayes model generalized better and gave the most consistent results, with the lowest deviation of 0.01% compared to the 2-3% deviation of the tree models.
- However, the naive bayes model has few parameters to tune, so there is little scope for improvement, and it produced very low precision for class 1 on test data, i.e. 0.29 compared to a recall of 0.89.
- In this respect, the tree models performed better, with a smaller gap between recall and precision on the test sample. Hence we chose a tree based model.
- Since the decision tree gave a lower deviation in scores, is less complex to tune and is faster to train because it is a standalone estimator, we chose the decision tree for further tuning.
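
To see why a precision of 0.29 holds the f1 score down even with a recall of 0.89, recall that f1 is the harmonic mean of the two. The sketch below redoes that arithmetic with the naive bayes figures quoted above; the helper name `f1_from` is only illustrative and is not part of the notebook.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 figures quoted for the naive bayes model on the test sample.
nb_precision, nb_recall = 0.29, 0.89

nb_f1 = f1_from(nb_precision, nb_recall)
arithmetic_mean = (nb_precision + nb_recall) / 2

print("f1 (harmonic mean):", round(nb_f1, 3))
print("arithmetic mean   :", round(arithmetic_mean, 3))
```

The harmonic mean sits much closer to the weaker of the two numbers, which is why a model with a large gap between precision and recall scores poorly on f1.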

## Feature Selection

In [None]:
results_features = pd.DataFrame({"Model":None,"Train F1":None,"Test F1":None,"CV Mean F1":None,"CV std in scores":None},index=range(0,3))

In [None]:
from sklearn.feature_selection import RFE
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.metrics import roc_auc_score,roc_curve

### Recursive Feature Elimination

In [None]:
### Code commented due to long processing time
# rfe = RFE(DecisionTreeClassifier(),n_features_to_select=10).fit(xtrain_over,ytrain_over)

In [None]:
ranking = np.array([ 1,  1,  1, 15,  1,  1, 12,  8,  1, 13,  4,  9, 11, 10,  3,  2,  6,
        1,  7,  1, 14,  1,  1,  5])
ranking

In [None]:
print("Best Features:", xtrain_over.columns[ranking == 1])
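
The hard-coded `ranking` array can be read without re-running RFE. In scikit-learn's `RFE`, the selected features get rank 1 and larger ranks mark features that were eliminated earlier. The sketch below works on column positions only, because the column names of `xtrain_over` are not repeated here; the variable names are illustrative.

```python
# Ranking copied from the notebook cell above (one entry per column of xtrain_over).
ranking = [1, 1, 1, 15, 1, 1, 12, 8, 1, 13, 4, 9, 11, 10, 3, 2, 6,
           1, 7, 1, 14, 1, 1, 5]

# Rank 1 marks the columns that RFE kept.
kept_positions = [i for i, rank in enumerate(ranking) if rank == 1]

# Sort the dropped columns from first eliminated (largest rank) to last eliminated (rank 2).
dropped_first_to_last = sorted(
    (i for i, rank in enumerate(ranking) if rank > 1),
    key=lambda i: ranking[i],
    reverse=True,
)

print(len(kept_positions), "columns kept at positions", kept_positions)
print("first column eliminated:", dropped_first_to_last[0])
print("last column eliminated :", dropped_first_to_last[-1])
```

With 24 columns, 10 features to keep and one feature removed per round, the 14 dropped columns receive ranks 2 to 15, which matches the array.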

In [None]:
rfe_features =  xtrain_over.columns[ranking == 1]

dt_rfe = DecisionTreeClassifier().fit(xtrain_over[list(rfe_features)],ytrain_over)
f1_train_dt_rfe = f1_score(ytrain_over,dt_rfe.predict(xtrain_over[list(rfe_features)]))
f1_test_dt_rfe = f1_score(ytest,dt_rfe.predict(xtest[list(rfe_features)]))

print("f1 on train:",f1_train_dt_rfe)
print("f1 on test:",f1_test_dt_rfe)

print()
print("Model Performance Report on train")
print(classification_report(ytrain_over,dt_rfe.predict(xtrain_over[list(rfe_features)])))
print()
cv_dt_rfe = cross_val_score(DecisionTreeClassifier(),xtrain_over[list(rfe_features)],ytrain_over,cv=3,scoring="f1")
print("Mean of f1 scores:",cv_dt_rfe.mean())
print("Deviation in f1 scores:",cv_dt_rfe.std())
results_features.iloc[0,0] = "Recursive Feature Elimination"
results_features.iloc[0,1] = f1_train_dt_rfe
results_features.iloc[0,2] = f1_test_dt_rfe
results_features.iloc[0,3] = cv_dt_rfe.mean()
results_features.iloc[0,4] = cv_dt_rfe.std()
print()
print("Model Performance Report on test")
print(classification_report(ytest,dt_rfe.predict(xtest[list(rfe_features)])))

fpr, tpr, thres = roc_curve(ytest,dt_rfe.predict_proba(xtest[list(rfe_features)])[:,1])
plt.plot(fpr,tpr)
plt.fill_between(fpr,tpr)
plt.title(f"AUC Score: {roc_auc_score(ytest,dt_rfe.predict_proba(xtest[list(rfe_features)])[:,1])}",size=15)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()

sns.heatmap(confusion_matrix(ytest,dt_rfe.predict(xtest[list(rfe_features)])),annot=True,fmt="g")
plt.title("Decision Tree with RFE Features",size=15)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.xticks([0.5,1.5],["No","Yes"])
plt.yticks([0.5,1.5],["No","Yes"])
plt.show()

### Forward Feature Selection

In [None]:
## Code commented due to long processing time
# sfs = SequentialFeatureSelector(DecisionTreeClassifier(),k_features="best",cv=3,scoring="f1").fit(xtrain_over,ytrain_over)
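
As a rough picture of what the selector does, the sketch below implements a greedy forward search in plain Python: start with no features, add the single feature that improves the score most, repeat, and finally return the best subset seen at any size (the idea behind `k_features="best"`). The scoring function and its numbers are made up for illustration; in the notebook the score would be the cross validated f1 of a decision tree.

```python
from typing import Callable, List, Sequence, Tuple

def forward_select(
    features: Sequence[str],
    score: Callable[[Tuple[str, ...]], float],
) -> Tuple[List[str], float]:
    """Greedy forward search; returns the best-scoring subset of any size."""
    selected: List[str] = []
    remaining = list(features)
    best_subset: List[str] = []
    best_score = float("-inf")
    while remaining:
        # Try each remaining feature and keep the one giving the highest score.
        round_score, winner = max(
            (score(tuple(selected + [f])), f) for f in remaining
        )
        selected.append(winner)
        remaining.remove(winner)
        if round_score > best_score:
            best_score, best_subset = round_score, list(selected)
    return best_subset, best_score

# Hypothetical per-feature gains with a small penalty for every feature used.
toy_gain = {"Age": 0.30, "Vehicle_Damage": 0.25, "Previously_Insured": 0.20, "Gender": 0.01}

def toy_score(subset: Tuple[str, ...]) -> float:
    return sum(toy_gain[f] for f in subset) - 0.02 * len(subset)

chosen, chosen_score = forward_select(list(toy_gain), toy_score)
print(chosen, round(chosen_score, 2))
```

Each round refits and scores one candidate model per remaining feature, which is why the real search over 24 columns with 3-fold cross validation takes a long time.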

In [None]:
sfs_features = ['Age',
 'x0_Male',
 'x1_Vehicle_Not_Insured',
 'x2_MoreThan2Years',
 'x2_Years1-2',
 'x3_Vehicle_Not_Damaged',
 'x4_Region_11',
 'x4_Region_15',
 'x4_Region_28',
 'x4_Region_29',
 'x4_Region_3',
 'x4_Region_30',
 'x4_Region_33',
 'x4_Region_36',
 'x4_Region_41',
 'x4_Region_46',
 'x4_Region_50',
 'x4_Region_8',
 'x5_Channel_124',
 'x5_Channel_152',
 'x5_Channel_156',
 'x5_Channel_160',
 'x5_Channel_26',
 'x5_Others']
sfs_features

In [None]:
dt_sfs = DecisionTreeClassifier().fit(xtrain_over[list(sfs_features)],ytrain_over)
f1_train_dt_sfs = f1_score(ytrain_over,dt_sfs.predict(xtrain_over[list(sfs_features)]))
f1_test_dt_sfs = f1_score(ytest,dt_sfs.predict(xtest[list(sfs_features)]))

print("f1 on train:",f1_train_dt_sfs)
print("f1 on test:",f1_test_dt_sfs)

print()
print("Model Performance Report on train")
print(classification_report(ytrain_over,dt_sfs.predict(xtrain_over[list(sfs_features)])))
print()
cv_dt_sfs = cross_val_score(DecisionTreeClassifier(),xtrain_over[list(sfs_features)],ytrain_over,cv=3,scoring="f1")
print("Mean of f1 scores:",cv_dt_sfs.mean())
print("Deviation in f1 scores:",cv_dt_sfs.std())
results_features.iloc[1,0] = "Forward Feature Elimination"
results_features.iloc[1,1] = f1_train_dt_sfs
results_features.iloc[1,2] = f1_test_dt_sfs
results_features.iloc[1,3] = cv_dt_sfs.mean()
results_features.iloc[1,4] = cv_dt_sfs.std()
print()
print("Model Performance Report on test")
print(classification_report(ytest,dt_sfs.predict(xtest[list(sfs_features)])))

fpr, tpr, thres = roc_curve(ytest,dt_sfs.predict_proba(xtest[list(sfs_features)])[:,1])
plt.plot(fpr,tpr)
plt.fill_between(fpr,tpr)
plt.title(f"AUC Score: {roc_auc_score(ytest,dt_sfs.predict_proba(xtest[list(sfs_features)])[:,1])}",size=15)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()

sns.heatmap(confusion_matrix(ytest,dt_sfs.predict(xtest[list(sfs_features)])),annot=True,fmt="g")
plt.title("Decision Tree with SFS Features",size=15)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.xticks([0.5,1.5],["No","Yes"])
plt.yticks([0.5,1.5],["No","Yes"])
plt.show()

### Feature Importances

In [None]:
importances = pd.DataFrame({"Features":xtrain_over.columns,"Importances":dt.feature_importances_})

In [None]:
importances.sort_values("Importances",ascending=False,inplace=True)

In [None]:
importances

In [None]:
fe_features = importances[importances.Importances >= 0.01]["Features"].tolist()

dt_fe = DecisionTreeClassifier().fit(xtrain_over[list(fe_features)],ytrain_over)
f1_train_dt_fe = f1_score(ytrain_over,dt_fe.predict(xtrain_over[list(fe_features)]))
f1_test_dt_fe = f1_score(ytest,dt_fe.predict(xtest[list(fe_features)]))

print("f1 on train:",f1_train_dt_fe)
print("f1 on test:",f1_test_dt_fe)

print()
print("Model Performance Report on train")
print(classification_report(ytrain_over,dt_fe.predict(xtrain_over[list(fe_features)])))
print()
cv_dt_fe = cross_val_score(DecisionTreeClassifier(),xtrain_over[list(fe_features)],ytrain_over,cv=3,scoring="f1")
print("Mean of f1 scores:",cv_dt_fe.mean())
print("Deviation in f1 scores:",cv_dt_fe.std())
results_features.iloc[2,0] = "Feature Importances"
results_features.iloc[2,1] = f1_train_dt_fe
results_features.iloc[2,2] = f1_test_dt_fe
results_features.iloc[2,3] = cv_dt_fe.mean()
results_features.iloc[2,4] = cv_dt_fe.std()
print()
print("Model Performance Report on test")
print(classification_report(ytest,dt_fe.predict(xtest[list(fe_features)])))

fpr, tpr, thres = roc_curve(ytest,dt_fe.predict_proba(xtest[list(fe_features)])[:,1])
plt.plot(fpr,tpr)
plt.fill_between(fpr,tpr)
plt.title(f"AUC Score: {roc_auc_score(ytest,dt_fe.predict_proba(xtest[list(fe_features)])[:,1])}",size=15)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()

sns.heatmap(confusion_matrix(ytest,dt_fe.predict(xtest[list(fe_features)])),annot=True,fmt="g")
plt.title("Decision Tree with Feature Imp Features",size=15)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.xticks([0.5,1.5],["No","Yes"])
plt.yticks([0.5,1.5],["No","Yes"])
plt.show()

In [None]:
results_features

#### Inferences - 
After searching for the best set of features for a decision tree using multiple techniques, we concluded the following - 
- Using forward feature selection, the decision tree scored the highest f1 score on the train sample, i.e. 0.75. However, under cross validation it recorded a higher deviation in scores than the other 2 models, and due to this instability it also scored the lowest f1 score on test, i.e. 0.42.
- The decision tree models using the feature sets from recursive feature elimination and feature importances had similar scores. However, of these two, the model using feature importances recorded a higher f1 score on test, i.e. 0.4434, a higher train f1 score of 0.72 and a higher mean cv score of 0.7119.

Hence we will use the decision tree model with the set of features determined by feature importances.

## Parameter Optimization

In [None]:
from sklearn.model_selection import GridSearchCV

### Code commented due to long processing time
# params = {"max_depth":range(2,10),"criterion":["gini","entropy"],"min_samples_leaf":[30,40,50],"min_samples_split":[60,80,100]}
# grid = GridSearchCV(DecisionTreeClassifier(),param_grid=params,cv=3,scoring="precision").fit(xtrain_over[list(fe_features)],ytrain_over)
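
The long processing time follows from the size of the grid. The sketch below counts the candidates defined by the `params` dictionary above and the number of model fits needed for 3-fold cross validation; it uses only the standard library and does not train anything.

```python
from itertools import product

# Same grid as in the commented-out cell above.
params = {
    "max_depth": range(2, 10),
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [30, 40, 50],
    "min_samples_split": [60, 80, 100],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(params, combo)) for combo in product(*params.values())]
n_fits = len(candidates) * cv_folds

print("candidates:", len(candidates))
print("fits during cross validation:", n_fits)
```

`GridSearchCV` also refits the best candidate once on the full training data by default (`refit=True`), so one more fit is added to this count.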

In [None]:
grid_params = {"max_depth":9,"min_samples_leaf":30,"min_samples_split":80}
grid_params

In [None]:
fe_features = importances[importances.Importances >= 0.01]["Features"].tolist()

dt_grid = DecisionTreeClassifier(criterion="gini",max_depth=9,min_samples_leaf=30,min_samples_split=80).fit(xtrain_over[list(fe_features)],ytrain_over)
f1_train_dt_grid = f1_score(ytrain_over,dt_grid.predict(xtrain_over[list(fe_features)]))
f1_test_dt_grid = f1_score(ytest,dt_grid.predict(xtest[list(fe_features)]))

print("f1 on train:",f1_train_dt_grid)
print("f1 on test:",f1_test_dt_grid)

print()
print("Model Performance Report on train")
print(classification_report(ytrain_over,dt_grid.predict(xtrain_over[list(fe_features)])))
print()
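# Cross validate the same configuration as dt_grid (criterion="gini") so the CV scores describe the fitted model
cv_dt_grid = cross_val_score(DecisionTreeClassifier(criterion="gini",max_depth=9,min_samples_leaf=30,min_samples_split=80),xtrain_over[list(fe_features)],ytrain_over,cv=3,scoring="f1")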
print("Mean of f1 scores:",cv_dt_grid.mean())
print("Deviation in f1 scores:",cv_dt_grid.std())
results_features.iloc[2,0] = "Feature Importances (tuned)"
results_features.iloc[2,1] = f1_train_dt_grid
results_features.iloc[2,2] = f1_test_dt_grid
results_features.iloc[2,3] = cv_dt_grid.mean()
results_features.iloc[2,4] = cv_dt_grid.std()
print()
print("Model Performance Report on test")
print(classification_report(ytest,dt_grid.predict(xtest[list(fe_features)])))

fpr, tpr, thres = roc_curve(ytest,dt_grid.predict_proba(xtest[list(fe_features)])[:,1])
plt.plot(fpr,tpr)
plt.fill_between(fpr,tpr)
plt.title(f"AUC Score: {roc_auc_score(ytest,dt_grid.predict_proba(xtest[list(fe_features)])[:,1])}",size=15)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()

sns.heatmap(confusion_matrix(ytest,dt_grid.predict(xtest[list(fe_features)])),annot=True,fmt="g")
plt.title("Tuned Decision Tree with Feature Imp Features",size=15)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.xticks([0.5,1.5],["No","Yes"])
plt.yticks([0.5,1.5],["No","Yes"])
plt.show()

#### Inferences - 
- After finding the optimal parameters for the decision tree using the top features from feature importances, we obtained a model with a slight increase in the f1 score on class 1, i.e. from 0.443 to 0.450.
- The auc score also increased, from 0.8408 to 0.848.
- The deviation in the cross validation scores roughly halved, from 0.015 to 0.007, which indicates that the model has become more stable.

# Saving the Model

In [None]:
import pickle

In [None]:
# pickle.dump returns None, so its result is not assigned; the with block closes the file
with open("model.pkl","wb") as f:
    pickle.dump(dt_grid,f)
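
A saved model is only useful if it can be loaded back. The sketch below shows a save and load round trip with `pickle`, using a plain dictionary as a stand-in for the fitted tree and a temporary directory so that no real `model.pkl` is overwritten; both choices are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted estimator (hypothetical; the notebook pickles dt_grid).
fitted_model = {"max_depth": 9, "min_samples_leaf": 30, "min_samples_split": 80}

# Write to a temporary location instead of the notebook's working directory.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

with open(path, "wb") as fh:
    pickle.dump(fitted_model, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

print("round trip ok:", restored == fitted_model)
```

When loading a real estimator, the environment should have the same scikit-learn version that was used to save it, and pickle files from untrusted sources should never be loaded.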

# Business Interpretation of the Model

- We started with a dataset consisting of around 380K Health insurance holders, of which only 13% had gone on to purchase a vehicle insurance.


- Based on an in-depth analysis of all features and extensive trial and error, we have created a Decision tree model to predict which health insurance customers would buy a vehicle insurance.


- The model has been tuned with the optimal parameters to improve quality of the predictions.

**Key insights about the model** -
- To build the models, we chose the features that were most useful for identifying potential vehicle insurance customers. These features can be key focus points for salesmen or the organization when trying to identify potential vehicle insurance holders. The features were - 
     - 1 Whether the holder's vehicle had been damaged before.
     - 2 The age of the holder.
     - 3 Whether the holder's vehicle was already insured.
     - 4 The gender of the holder.
     - 5 Whether the holder belonged to the regions coded 28, 152 or 8.
     - 6 Whether the holder was approached through policy sales channel 152 or 160.
     - 7 Whether the holder's vehicle was around 1-2 years old.
     
     
- Based on the chosen features and optimal parameters, the final model produced an f1 score of 0.75 on the train sample and an f1 score of 0.45 on test. The model produced consistent results, with a deviation in f1 scores across the cross validation folds of only about 0.7%.


- During the testing phase, the model was able to identify close to 80% of the potential vehicle insurance holders in the sample.


However, **the model suffered from the following drawbacks** - 
- There was a huge difference between the train and test scores of the model, indicating that it did not generalize well to the test sample. This was a problem across almost all models we tried, due to the heavy imbalance between health insurance holders who purchased vehicle insurance and those who did not.


- As a result, the learning of potential vehicle insurance holders was not robust during training.


- After oversampling this class of customers in the train sample, we found that our results during the test phase improved significantly.


- However, while the scores during the test phase improved significantly, there was still a huge difference between the train and test results.


- As the minority class was not oversampled in test, the models found it difficult to identify such records and, as a result, were more biased in their predictions towards those insurance holders who didn't buy a vehicle insurance.
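
Part of the train and test gap can be explained by class proportions alone. The sketch below keeps the classifier's behaviour fixed and changes only the share of buyers, from a balanced sample to the 13% quoted above. The recall of 0.80 comes from the text; the false positive rate of 0.25 and the 50/50 balance of the oversampled train set are assumptions chosen for illustration.

```python
def precision_at(prevalence: float, recall: float, fpr: float) -> float:
    """Precision implied by a fixed recall and false positive rate at a given class share."""
    true_pos = recall * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

def f1_from(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

RECALL = 0.80  # share of buyers identified, as quoted in the text
FPR = 0.25     # hypothetical share of non-buyers wrongly flagged

f1_by_sample = {}
for label, prevalence in [("balanced train", 0.50), ("original test", 0.13)]:
    precision = precision_at(prevalence, RECALL, FPR)
    f1_by_sample[label] = f1_from(precision, RECALL)
    print(f"{label}: precision={precision:.3f}, f1={f1_by_sample[label]:.3f}")
```

Under these assumptions the same classifier scores a much lower f1 on the imbalanced sample, because the flagged non-buyers outnumber the true buyers once buyers are rare. This does not rule out overfitting; it only shows that the two f1 scores are not directly comparable.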

**Suggestions** -

- To improve the model in the near future, more records of health insurance holders who also hold vehicle insurance can be added to the data, so that the model can be retrained and give better results during the test phase.