<a href="https://colab.research.google.com/github/Avisikta-Majumdar/Capstone-Project_Health_Insurance_Cross_Sell_Prediction/blob/main/Individual_Notebook_Health_Insurance_Cross_Sell_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**

Our client is an insurance company that has provided Health Insurance to its customers. They now need your help in building a model to predict whether last year's policyholders (customers) will also be interested in the Vehicle Insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5,000 each year for a health insurance cover of Rs. 200,000, so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5,000, that is where the concept of probabilities comes into the picture. Like you, there may be 100 customers paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would be hospitalised that year. This way everyone shares the risk of everyone else.
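
The arithmetic behind this risk-sharing example can be sketched in a few lines. The figures are the ones quoted above; treating every claim as using the full cover is a worst-case assumption made only for illustration.

```python
# Risk pooling with the figures from the example above.
n_customers = 100      # policyholders paying a premium
premium = 5_000        # Rs. per customer per year
cover = 200_000        # Rs. maximum payout per claim

premiums_collected = n_customers * premium

# Assumption: every claim uses the full cover (worst case).
for n_claims in (2, 3):
    payout = n_claims * cover
    print(f"{n_claims} claims: payout Rs. {payout:,} vs premiums Rs. {premiums_collected:,}")

# Number of full-cover claims the pool can absorb before payouts exceed premiums.
break_even_claims = premiums_collected / cover
print(f"Break-even at {break_even_claims} full-cover claims")
```

Under these assumptions the pool covers two full claims but not three, which is why the probability of a claim matters so much when a premium is set.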

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the insurance provider will pay compensation (called ‘sum assured’) to the customer.

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimise its business model and revenue.

Now, in order to predict whether the customer would be interested in Vehicle Insurance, you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.

# **Attribute Information**

1. id :	Unique ID for the customer

2. Gender	: Gender of the customer

3. Age :	Age of the customer

4. Driving_License :	0 : Customer does not have DL, 1 : Customer already has DL

5. Region_Code :	Unique code for the region of the customer

6. Previously_Insured	: 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance

7. Vehicle_Age :	Age of the Vehicle

8. Vehicle_Damage :	1 : Customer got his/her vehicle damaged in the past, 0 : Customer didn't get his/her vehicle damaged in the past

9. Annual_Premium	: The amount customer needs to pay as premium in the year

10. PolicySalesChannel :	Anonymized Code for the channel of outreaching to the customer ie. Different Agents, Over Mail, Over Phone, In Person, etc.

11. Vintage :	Number of Days, Customer has been associated with the company

12. Response :	1 : Customer is interested, 0 : Customer is not interested

# <font size="+2" color='#ff3ba8'><b><i><u>Contents</u>
* <font size="+2" color='#053c96'> <b>Importing Libraries
* <font size="+2" color='#053c96'> <b>Import Data
* <font size="+2" color='#053c96'> <b>Data Summary
* <font size="+2" color='#053c96'> <b>Data Visualization
* <font size="+2" color='#053c96'> <b>Data Cleaning ( EDA )
* <font size="+2" color='#053c96'> <b>Feature Selection
* <font size="+2" color='#053c96'> <b> Model Selection
* <font size="+2" color='#053c96'><b> Hyperparameter Tuning
* <font size="+2" color='#053c96'><b>Conclusion


In [None]:
!pip install imbalanced-learn



In [None]:
!pip install xgboost



## 1.Importing Libraries

In [None]:
# import libraries
import pandas  as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Used in data preprocessing
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import StandardScaler

#used to split dataset
from sklearn.model_selection import train_test_split

#used to resampling(when our dependent variable is imbalanced)
from imblearn.over_sampling import RandomOverSampler
from collections import Counter


#Ml algorithms
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression


#used in feature selection
from sklearn.ensemble import ExtraTreesClassifier



from sklearn.metrics import precision_score 
from sklearn.metrics import recall_score 
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report

sns.set_theme(style="darkgrid")

## 2. Import Data

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
path = "/content/drive/MyDrive/AlmaBetter/Team Capstone Projects/Submitted Projects/3. Classification ( Health Insurance Cross Sell Prediction )/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv"
df = pd.read_csv(path)

## 3. Data Summary

### df.head()

In [None]:
df.head(2)

### df.tail()

In [None]:
df.tail()

### df.info()

In [None]:
df.info(memory_usage = 'deep')

### df.shape

In [None]:
df.shape

### df.columns

In [None]:
df.columns

###  Dataset details
*   A new *DataFrame* that lists, for each column of this df, its name, datatype, number of missing values, number of unique values, first value and second value

In [None]:
def DataInfoAll(df):
    print(f"Dataset Shape: {df.shape}")
    print("-"*75)
    summary = pd.DataFrame(df.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = df.isnull().sum().values    
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.iloc[0].values
    summary['Second Value'] = df.iloc[1].values
    return summary


In [None]:
DataInfoAll(df)

There are no null values in this dataset.

All the numerical values are stored as integer or float datatypes.

### Checking outliers

In [None]:
# Boxplot of every numeric column (the columns returned by df.describe())
df1 = df[list(df.describe())]

for column in df1:
    plt.figure(figsize=( 8 , 8))
    plt.title(f'Boxplot for {column}' , fontsize = 15)
    sns.boxplot(data=df1, x=column)

### Checking duplicate values 

In [None]:
duplicate = df[df.duplicated()]
print(f"There are {duplicate.shape[0]} duplicate rows present in the dataset.")

### Checking NaN values

In [None]:
df.isna().sum().to_frame().T

## Data Visualization

### Target Variable

In [None]:
sns.set_theme(style="darkgrid")
sns.countplot(x = 'Response' , data = df)

### The data is highly imbalanced.
As the graph above shows, very few customers are interested: fewer than 50,000 responded positively, while more than 300,000 are not interested.
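
To put a number on the imbalance rather than read it off the plot, the class counts and shares can be computed directly. This is a minimal sketch on a toy series; the 88/12 split is a made-up stand-in, and on the real data you would pass `df['Response']` instead.

```python
import pandas as pd

# Toy stand-in for df['Response'] (hypothetical 88/12 split, not the real counts).
response = pd.Series([0] * 88 + [1] * 12, name="Response")

counts = response.value_counts()                 # absolute count per class
shares = response.value_counts(normalize=True)   # fraction per class
imbalance_ratio = counts.max() / counts.min()    # majority : minority

print(counts)
print(shares.round(3))
print(f"Majority : minority = {imbalance_ratio:.1f} : 1")
```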

### Let's check the outliers present in all numerical columns

In [None]:
plt.rcParams['figure.figsize']=(20,12)
ax = df[list(df.describe())].plot(kind='box', title='Boxplot', showmeans=True)

plt.show()

#### As you can see
* ##### **Annual_Premium** has the most outliers in this dataset
* ##### **Driving_License** and **Response** are binary (0/1) columns, so the points flagged for them are just the less frequent class, not true outliers.

### Gender

In [None]:
plt.figure(figsize = (13,5))
plt.subplot(1,2,1)
# Pass the column via x= ; recent seaborn versions no longer accept it positionally
sns.countplot(x = 'Gender' , data = df , palette='husl')
plt.title("Count of Male & Female")
plt.subplot(1,2,2)
sns.countplot(x = 'Gender' , hue = 'Response' , data = df , palette="husl")
plt.title("Response in Male and Female Category")
plt.show()

* The gender split in the dataset is almost equal: there are slightly more male customers than female customers, and male customers are also slightly more likely to be interested in insurance.<br><br>
* There are more than 200,000 male customers and close to 175,000 female customers. More than 25,000 male customers are interested, compared with fewer than 25,000 female customers.

### Age vs Response

In [None]:
df.columns

In [None]:
#### Age VS Response
plt.figure(figsize=(20,10))
sns.countplot(x='Age',hue='Response',data=df)

### Checking whether any outliers are present

In [None]:
sns.boxplot(x = df['Age'])

* Customers below 30 are mostly not interested in vehicle insurance. Possible reasons are lack of experience, lower maturity, and not yet owning an expensive vehicle.
* Customers aged 30-60 are more likely to be interested.
* The boxplot shows no outliers in **Age**.

In [None]:
df.Driving_License.value_counts()

In [None]:
plt.figure( figsize = (10 , 6))
sns.countplot(x = 'Driving_License' , hue = 'Response' , data = df)

* Almost all customers who are interested in Vehicle Insurance have a driving license

### Previously_Insured Vs Response

In [None]:
plt.figure( figsize = (10 , 6))
sns.countplot(x = 'Previously_Insured' , hue = 'Response' , data = df , palette = 'husl' )

* Some of the customers who do not already have vehicle insurance are interested in buying it

### Vehicle_Age Vs Response

In [None]:
df.Vehicle_Age.value_counts()

In [None]:
plt.figure( figsize = (10 , 6))
sns.countplot(x = 'Vehicle_Age' , hue = 'Response' , data = df , palette = 'husl')
plt.axis([None,None,10,175000])

* From this graph we can say that owners of vehicles aged 1 to 2 years are more likely to buy insurance.<br><br>
* No of customers with Vehicle_Age >2 is more than the no of customers whose Vehicle_Age< 1

### Annual_Premium

In [None]:
plt.figure(figsize=(13,7))
# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(df['Annual_Premium'], kde=True, color='green')
plt.title("Distribution of Annual premium")
plt.show()

* From the distribution plot we can infer that **the annual premium variable is right skewed.**

In [None]:
plt.figure(figsize=(13,7))
sns.boxplot(x = df['Annual_Premium'])
plt.title("boxplot of Annual premium")
plt.show()

* As you can see, the **Annual_Premium** column has many outliers
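
The points a boxplot draws as outliers are those beyond 1.5 × IQR from the quartiles. A minimal sketch of that rule, assuming a hypothetical helper `iqr_outlier_mask` and made-up premium values (on the real data you would pass `df['Annual_Premium']`):

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Made-up premiums for illustration only.
premium = pd.Series([2630, 25000, 28000, 30000, 31000,
                     33000, 36000, 40000, 250000, 540000])

mask = iqr_outlier_mask(premium)
print(f"{mask.sum()} of {len(premium)} values flagged as outliers")
print(premium[mask].tolist())
```

Counting the flagged rows this way gives a figure to report alongside the plot.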

### Correlation Matrix

In [None]:
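# Pearson correlation of the numeric columns only (the text columns are not encoded yet)
corr = df.select_dtypes(include = 'number').corr()

f, ax = plt.subplots(figsize = (8 , 8 ))

sns.heatmap(corr, ax=ax, annot=True,linewidths=3,cmap='YlGn')

# fontsize and size are aliases, so only one of them is passed
plt.title("Pearson correlation of Features", fontsize=15 ,y=1.05)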

* The **target variable ( Response )** has very little correlation with the Vintage variable, so we can drop this least correlated variable.

## Data Cleaning ( EDA )

#### Removing duplicate rows 

In [None]:
df_old_row = df.shape[0]
df.drop_duplicates(inplace = True)
df_new_row = df.shape[0]
if df_old_row == df_new_row:
    print("There were no duplicate rows present")
else:
    print(f"There were {df_old_row - df_new_row} duplicate rows present")

In [None]:
numerical_cols = list(df.describe())
numerical_df = df[numerical_cols]
numerical_df.head()

In [None]:
# A list comprehension keeps the original column order (a set does not)
categorical_cols = [col for col in df.columns if col not in numerical_cols]
categorical_df = df[categorical_cols]
categorical_df.head()

Let's convert the categorical columns into numeric using **LabelEncoder**,<br>But before that let's check how many unique values are present in each column of categorical_df



In [None]:
for column_name in categorical_df.columns:
    print('-'*35)
    print(categorical_df[column_name].value_counts(),'\n')
    print('-'*35)

#### Using LabelEncoder

In [None]:
categorical_df.head(3)

In [None]:
le = LabelEncoder()
categorical_df = categorical_df.apply(le.fit_transform)
categorical_df.head(3)

In [None]:
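## le was refitted on each column in turn, so classes_ only holds the classes of the last column encoded
le.classes_

In [None]:
## inverse_transform therefore only decodes values of that last column
le.inverse_transform([1])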

In [None]:
categorical_df_new = categorical_df
categorical_df_new.head()

Let's make new **df** by merging *numerical_df* DataFrame with *categorical_df_new*

In [None]:
df = pd.merge( numerical_df , categorical_df_new , left_index = True , right_index = True )

In [None]:
df.head(2)

The **id** column is only an identifier and will not help prediction, so I'm dropping this column

In [None]:
df = df.drop( axis=1 , columns = ['id'])
df.head(2)

In [None]:
DataInfoAll(df)

### **Separating dependent and independent variables**

In [None]:
x = df.drop(columns = ['Response'])
y = df.Response

In [None]:
x.head()

In [None]:
y.head()

##  **Feature Selection**

In [None]:
# Building the model
from sklearn.ensemble import ExtraTreesClassifier
extra_tree_forest = ExtraTreesClassifier(n_estimators = 5,criterion ='entropy', max_features = 2)

# Training the model
extra_tree_forest.fit(x, y)

# Computing the importance of each feature (averaged over the trees)
feature_importance = extra_tree_forest.feature_importances_

# Normalizing the importances so that they sum to 1
# (np.std across the trees would measure their spread, not their size)
feature_importance_normalized = feature_importance / feature_importance.sum()


# Plotting a bar graph to compare the features
plt.figure(figsize = (24,12))
plt.bar(x.columns, feature_importance_normalized)
plt.xlabel('Feature Labels' , fontsize = 25)
plt.ylabel('Feature Importances' , fontsize = 25)
plt.title('Comparison of different Feature Importances' , fontsize = 45)
plt.show()



In [None]:
feat_importances_Series = pd.Series( feature_importance_normalized , index=x.columns)
print("Feature Name\t\t Importance")
print("-"*37 , end='\n')
feat_importances_Series.sort_values()

* We can **remove the less important features from the dataset**
* *Driving_License* and *Gender* contribute very little, so I'm removing those columns

In [None]:
x.columns

In [None]:
x.drop( labels = [ 'Driving_License'  , 'Gender' ] , axis = 1 , inplace = True)

In [None]:
x.head(2)

### Handling Imbalanced data
* *When one class has many more observations than the other, there is a class imbalance. We can clearly see a large difference between the two classes in this dataset. To address this, we use a resampling technique.*

In [None]:
type(y)

### Using **RandomOverSampler** to resample the dataset

In [None]:
"HEALTH INSURANCE CROSS SELL PREDICTION".title()

In [None]:
randomsample=  RandomOverSampler(random_state = 1)
x_new,y_new=randomsample.fit_resample(x,y)

# Class counts before and after resampling
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_new)))

# Left: original target, right: resampled target
plt.figure(figsize = (13,5))
plt.subplot(1,2,1)
sns.countplot(x = y , palette='husl')
plt.subplot(1,2,2)
sns.countplot(x = y_new , palette='husl')

* As you can see, both classes of the response now have the same number of observations

### Splitting Dataset into 80:20 ratio

In [None]:
#dividing the dataset into training and testing
xtrain , xtest , ytrain , ytest = train_test_split( x_new , y_new , test_size = 0.2 , random_state = 1 )
print(f"xtrain.shape\txtest.shape\tytrain.shape\tytest.shape")
print('-'*60)
print(f'{xtrain.shape}\t{xtest.shape}\t {ytrain.shape}\t {ytest.shape}')
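
One thing to watch with oversampling before the split is that copies of the same row can land in both the training and the test set, which tends to inflate test scores. A common alternative is to split first and oversample only the training part. The sketch below illustrates that order on toy data; it swaps in `sklearn.utils.resample` for `RandomOverSampler` so that it runs with scikit-learn alone, and every name in it is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data: 90 negatives, 10 positives (made-up values).
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([0] * 90 + [1] * 10, name="Response")

# 1. Split first, keeping the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=1
)

# 2. Oversample the minority class in the training part only.
train = X_tr.assign(Response=y_tr)
majority = train[train["Response"] == 0]
minority = train[train["Response"] == 1]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=1
)
train_balanced = pd.concat([majority, minority_upsampled])

X_tr_bal = train_balanced.drop(columns="Response")
y_tr_bal = train_balanced["Response"]

print("Train after resampling:", y_tr_bal.value_counts().to_dict())
print("Test (untouched):      ", y_te.value_counts().to_dict())
```

The test set keeps the original class ratio, so metrics computed on it reflect how the model would behave on unseen customers.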

## **Feature Scaling**

In [None]:
scaler = StandardScaler()
xtrain = scaler.fit_transform(xtrain)
xtest = scaler.transform(xtest)

## **Model Selection**
#### The problem is a binary classification (whether or not the customer opts for vehicle insurance)
#### The dataset has more than 300k records
#### We cannot go with an SVM classifier, as it takes much longer to train as the dataset grows

#### Idea is to start selection of models as:

### **1. Logistic Regression**
### **2. Random Forest**
### **3. XGBClassifier**

### <font size = +2 color = #2718d3> 1.Logistic Regression

In [None]:
model=LogisticRegression()

model=model.fit(xtrain,ytrain)

pred=model.predict(xtest)

lr_probability =model.predict_proba(xtest)[:,1]


acc_lr=accuracy_score(ytest,pred)
recall_lr=recall_score(ytest,pred)
precision_lr=precision_score(ytest,pred)
f1score_lr=f1_score(ytest,pred)
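# roc_auc_score expects (y_true, y_score); use the predicted probabilities, not the hard labels
AUC_LR=roc_auc_score(ytest,lr_probability)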

#print accuracy and Auc values of model
print("Accuracy : ", round(accuracy_score(ytest,pred) , 3))
print("Precision:" , round(precision_score(ytest,pred) , 3))
print("Recall:" , round(recall_score(ytest,pred), 3))
print("F1-Score:" , round(f1_score(ytest,pred) , 3))
print("ROC_AUC Score:" , round(AUC_LR , 3))

In [None]:
print(classification_report(ytest,pred))

#### ROC curve for logistic reg.

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, _ = roc_curve(ytest, lr_probability)
plt.figure( figsize = (4 ,4))
plt.title('Logistic Regression ROC curve')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR / Recall)')


plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
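
The ROC AUC depends on which inputs are passed to `roc_auc_score`: the true labels go first, and the second argument should be a score such as the predicted probability of class 1. A minimal sketch with made-up labels and scores shows how the value changes when thresholded labels are used instead.

```python
from sklearn.metrics import roc_auc_score

# Made-up ground truth and predicted probabilities of class 1.
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

# Hard labels at the default 0.5 threshold.
y_label = [int(score >= 0.5) for score in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)   # uses the full ranking
auc_from_labels = roc_auc_score(y_true, y_label)   # ranking collapsed to 0/1

print(f"AUC from probabilities: {auc_from_scores:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```

Hard labels throw away the ranking within each predicted class, so the second value describes a single threshold rather than the whole curve.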

#### Confusion Matrix for Logistic Reg.

In [None]:
cm=confusion_matrix(ytest,pred)
print(cm)
plt.figure( figsize = (4 ,4))
sns.heatmap(cm,annot=True,cmap='BuPu')

### <font size = +2 color = #2718d3> 2.RandomForest Classifier

In [None]:
randomforest = RandomForestClassifier()

randomforest=randomforest.fit(xtrain, ytrain)

y_pred = randomforest.predict(xtest)

RF_probability = randomforest.predict_proba(xtest)[:,1]



AUC_RF=roc_auc_score(ytest,RF_probability)
acc_rf=accuracy_score(ytest,y_pred)
recall_rf=recall_score(ytest,y_pred)
precision_rf=precision_score(ytest,y_pred)
f1score_rf=f1_score(ytest,y_pred)

#print accuracy and Auc values of model
print("Accuracy : ", round(accuracy_score(ytest , y_pred) , 3))
print("Precision:" , round(precision_score(ytest,y_pred) , 3))
print("Recall:" , round(recall_score(ytest , y_pred), 3))
print("F1-Score:" , round(f1_score(ytest , y_pred) , 3))
print("ROC_AUC Score:" , round(AUC_RF , 3))

In [None]:
print(classification_report(ytest,y_pred))

#### ROC curve for RandomForest

In [None]:
fpr, tpr, _ = roc_curve(ytest, RF_probability)
plt.figure( figsize = (4 , 4))
plt.title('RF Classifier ROC curve')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR / Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()

#### Confusion Matrix for Random Forest

In [None]:
cm=confusion_matrix(ytest,y_pred)
print(cm)
plt.figure( figsize = (4 , 4))
sns.heatmap(cm,annot=True,cmap='RdPu')

### <font size = +2 color = #2718d3>3. XGBClassifier

In [None]:
xgb=XGBClassifier()

XGB_fit=xgb.fit(xtrain, ytrain)

y_predict = XGB_fit.predict(xtest)

XGB_probability = XGB_fit.predict_proba(xtest)[:,1]



acc_xgb = accuracy_score( ytest , y_predict)
recall_xgb = recall_score( ytest , y_predict)
precision_xgb = precision_score( ytest , y_predict)
f1score_xgb = f1_score( ytest , y_predict)

AUC_xgb = roc_auc_score( ytest , XGB_probability)


#print accuracy and Auc values of model
print("Accuracy : ", round(acc_xgb , 3))
print("Precision:" , round(precision_xgb , 3))
print("Recall:" , round( recall_xgb , 3))
print("F1-Score:" , round( f1score_xgb , 3))
print("ROC_AUC Score:" , round(AUC_xgb , 3))

In [None]:
print(classification_report( ytest , y_predict ))

#### ROC curve for XGBoost

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, _ = roc_curve(ytest, XGB_probability)
plt.figure( figsize = (4 ,4))

plt.title('XGBoost ROC curve')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR / Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()

#### Confusion Matrix for XGBoost

In [None]:
#it helps to identify how many values are classified correctly
cm=confusion_matrix(ytest,y_predict)
print(cm)
plt.figure( figsize = ( 4 , 4 ))
sns.heatmap(cm,annot=True,cmap='GnBu')

##  <font size = '+2'  color = '#c4d318'><b> Let's compare the models

In [None]:
ind=['Logistic regression','RandomForest','XGBClassifier']
data={"Accuracy":[acc_lr,acc_rf,acc_xgb],"Recall":[recall_lr,recall_rf,recall_xgb],"Precision":[precision_lr,precision_rf,precision_xgb],
    'f1_score':[f1score_lr,f1score_rf,f1score_xgb],"ROC_AUC":[AUC_LR,AUC_RF,AUC_xgb]}
result=pd.DataFrame(data=data,index=ind)
result

## <font size='+2' color = '#2441ff'> Hyperparameter Tuning
* <i>RandomForestClassifier gives the highest accuracy, so I will tune its hyperparameters using GridSearchCV

In [None]:
from sklearn.model_selection import cross_val_score , ShuffleSplit , GridSearchCV

In [None]:
Model = RandomForestClassifier()

In [None]:
params = {
    'n_estimators': [5,10,25],
    'criterion':["gini", "entropy"],
    'max_depth' : [5,25,50],
    'min_samples_split':[2,15,45]
        }

In [None]:
gridsearch = GridSearchCV(Model , params , cv=5, return_train_score=True)

In [None]:
gridsearch.fit(xtrain , ytrain )

In [None]:
print(gridsearch.best_params_)

Now we got the best values of our hyperparameters,<br>
* #### Let's *build* the **FinalModel**
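
With the default `refit=True`, a fitted `GridSearchCV` already holds a model retrained on the whole training set with the best parameters, available as `best_estimator_`. The sketch below shows this on synthetic data with a deliberately tiny grid. The data, the grid values and the `scoring="roc_auc"` choice are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={"n_estimators": [5, 10], "max_depth": [3, 5]},
    scoring="roc_auc",   # assumed metric; the default is the estimator's accuracy
    cv=3,
)
search.fit(X_tr, y_tr)

# refit=True (the default) retrains the best configuration on all of X_tr.
final_model = search.best_estimator_

print("Best parameters:", search.best_params_)
print("Mean cross-validated score:", round(search.best_score_, 3))
print("Held-out accuracy:", round(final_model.score(X_te, y_te), 3))
```

Calling `search.predict(...)` and `final_model.predict(...)` gives the same predictions, since the former delegates to the refitted best estimator.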

In [None]:
gridsearch_predictions = gridsearch.predict( xtest ) 
  
# print classification report 
print(classification_report(ytest, gridsearch_predictions)) 

In [None]:
Grid_predict_proba = gridsearch.predict_proba(xtest)[:,1]

# Compare against the true test labels (ytest), not the untuned model's predictions
AUC_RF_Best = roc_auc_score(ytest , Grid_predict_proba)
acc_rf_Best = accuracy_score(ytest , gridsearch_predictions)
recall_rf_Best = recall_score(ytest , gridsearch_predictions)
precision_rf_Best = precision_score(ytest,gridsearch_predictions )
f1score_rf_Best = f1_score(ytest,gridsearch_predictions )

#print accuracy and Auc values of model
print("Accuracy : ", round(accuracy_score(ytest,gridsearch_predictions) , 3))
print("Precision:" , round(precision_score(ytest,gridsearch_predictions) , 3))
print("Recall:" , round(recall_score(ytest, gridsearch_predictions ), 3))
print("F1-Score:" , round(f1_score(ytest, gridsearch_predictions ) , 3))
print("ROC_AUC Score:" , round(AUC_RF_Best , 3))

In [None]:
print(classification_report(ytest , gridsearch_predictions ))

In [None]:
fpr, tpr, _ = roc_curve(ytest, Grid_predict_proba)

plt.figure( figsize = ( 6 , 6 ))
plt.title('RF Classification(with Hyperparameters) ROC curve')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR / Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()

In [None]:
cm=confusion_matrix(ytest,gridsearch_predictions)
print(cm)
plt.figure( figsize = ( 4 , 4 ))
sns.heatmap(cm,annot=True,cmap='RdPu')

## <b>Final Result

In [None]:
ind=['RandomForest','RandomForest(Using Hyper.)']
data={"Accuracy":[acc_rf , acc_rf_Best],"Recall":[ recall_rf , recall_rf_Best],"Precision":[ precision_rf , precision_rf_Best],
    'f1_score':[ f1score_rf , f1score_rf_Best],"ROC_AUC":[ AUC_RF , AUC_RF_Best]}
result=pd.DataFrame(data=data,index=ind)
result

* As you can see, after hyperparameter tuning *Accuracy , Precision , f1_score and ROC_AUC increased* (a tiny change) and *Recall decreased*<br>
* But the change is very small, so you can ignore it if you wish.

## **Conclusion**
### The ML model for the problem statement was created in Python using the dataset, and the RandomForest model performed better than Logistic Regression and XGBClassifier. <br>Thus, for the given problem, the Random Forest model is preferred.


#### 1. Customers of age between 30 to 60 are more likely to buy insurance.

#### 2. Customers with Driving License have higher chance of buying Insurance.

#### 3. Customers with Vehicle_Damage are likely to buy insurance.

#### 4. Variables such as Previously_Insured and Vehicle_Damage affect the target variable the most.

#### 5. Variables such as Driving_License and Gender have little effect on the target variable.

#### 6. Comparing the ROC curves, we can see that the Random Forest model performs better, because a curve closer to the top-left corner indicates better performance.