<a href="https://colab.research.google.com/github/DrBharathiTC/HEALTH-INSURANCE-CROSS-SELL-PREDICTION.ipynb/blob/main/HEALTH_INSURANCE_CROSS_SELL_PREDICTION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**

Our client is an insurance company that has provided Health Insurance to its customers. They now need your help in building a model to predict whether the policyholders (customers) from the past year will also be interested in Vehicle Insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5000 each year for a health insurance cover of Rs. 200,000/- so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5000/-, that is where the concept of probabilities comes into the picture. Like you, there may be 100 customers paying a premium of Rs. 5000 every year, but only a few of them (say 2-3) would be hospitalised that year. This way everyone shares the risk of everyone else.
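The arithmetic behind this risk pooling can be sketched in a few lines. The premium, cover and customer count come from the example above; the 2% claim probability is an assumption chosen for illustration (the low end of "2-3 in 100"), and every claim is assumed to use the full cover.

```python
# Risk pooling with the figures from the example above.
# The 2% claim probability is an assumption made for illustration.
n_customers = 100
annual_premium = 5_000        # Rs. per customer per year
cover = 200_000               # Rs. maximum payout per claim
claim_probability = 0.02      # assumed: 2 in 100 customers are hospitalised

premium_pool = n_customers * annual_premium
expected_claims = n_customers * claim_probability
expected_payout = expected_claims * cover   # worst case: each claim uses the full cover

print(f"Premium collected: Rs. {premium_pool:,}")
print(f"Expected payout:   Rs. {expected_payout:,.0f}")
print(f"Expected margin:   Rs. {premium_pool - expected_payout:,.0f}")
```

Under these assumptions the pool of Rs. 500,000 covers two full claims. A third full claim would exceed it, which is why the pricing depends on estimating claim probabilities well.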

Just like medical insurance, there is vehicle insurance, where every year the customer needs to pay a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the insurance provider will pay compensation (called ‘sum assured’) to the customer.

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimise its business model and revenue.

Now, in order to predict whether the customer would be interested in Vehicle Insurance, you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.

# **Attribute Information**

1. id :	Unique ID for the customer

2. Gender	: Gender of the customer

3. Age :	Age of the customer

4. Driving_License : 0 : Customer does not have DL, 1 : Customer already has DL

5. Region_Code :	Unique code for the region of the customer

6. Previously_Insured	: 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance

7. Vehicle_Age :	Age of the Vehicle

8. Vehicle_Damage : 1 : Customer got his/her vehicle damaged in the past. 0 : Customer didn't get his/her vehicle damaged in the past.

9. Annual_Premium	: The amount customer needs to pay as premium in the year

10. PolicySalesChannel : Anonymized code for the channel of outreach to the customer, i.e. different agents, over mail, over phone, in person, etc.

11. Vintage :	Number of Days, Customer has been associated with the company

12. Response :	1 : Customer is interested, 0 : Customer is not interested

# **Contents**


* Importing Libraries
* Import Data
* Data Summary
* Data Visualization
* Data Cleaning (EDA)
* Feature Selection
* Model Selection
* Hyperparameter Tuning
* Conclusion

In [None]:
#Installing package
!pip install pandas-profiling==2.7.1

In [None]:
!pip install scikit-optimize

# 1. Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import Perceptron
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBRFClassifier
from sklearn.tree import DecisionTreeClassifier
import lightgbm as ltb

from sklearn.model_selection import GridSearchCV
from skopt import BayesSearchCV
import time
from math import sqrt
from sklearn import metrics
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, confusion_matrix, roc_auc_score, classification_report
from skopt.space import Real, Categorical, Integer

# 2. Import Data

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
DF1=pd.read_csv("/content/drive/My Drive/almabetter projects/Health insurance cross cell prediction/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv")

In [None]:
#Copying the dataset
HIDF1=DF1.copy()

# 3. Data Summary

In [None]:
HIDF1.head()

In [None]:
HIDF1.tail()

In [None]:
HIDF1.info()

In [None]:
HIDF1.shape

In [None]:
HIDF1.columns

# **Dataset details**
The function below builds a new DataFrame that lists each column name of this df along with its datatype, number of missing values, number of unique values, first value and second value.

In [None]:
def DataInfoAll(HIDF1):
    print(f"Dataset Shape: {HIDF1.shape}")
    print("-"*75)
    summary = pd.DataFrame(HIDF1.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = HIDF1.isnull().sum().values    
    summary['Uniques'] = HIDF1.nunique().values
    summary['First Value'] = HIDF1.iloc[0].values
    summary['Second Value'] = HIDF1.iloc[1].values
    return summary


In [None]:
DataInfoAll(HIDF1)

There are no null values in this dataset.
All the numerical values are stored as integer or float datatypes.

### *Let's check outlier present in all numerical columns* 

In [None]:
plt.rcParams['figure.figsize']=(20,10)
ax =HIDF1[list(HIDF1.describe())].plot(kind='box', title='Boxplot', showmeans=True)

plt.show()

As you can see:
* Annual_Premium has the most outliers of any column in this dataset.
* Driving_License and Response show very few outliers. Both are binary columns, so the points flagged are simply the minority class rather than anomalous values.
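To put numbers on what the boxplot shows, the default whisker rule (points beyond 1.5 times the interquartile range from the quartiles) can be applied directly. The sketch below uses a small made-up frame, not the real dataset, and `count_iqr_outliers` is a hypothetical helper name.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], the default boxplot whisker rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "Annual_Premium": [2630, 25000, 28000, 30000, 31000, 33000, 35000, 38000, 40000, 540000],
    "Driving_License": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
})

outlier_counts = {col: count_iqr_outliers(toy[col]) for col in toy.columns}
print(outlier_counts)
```

In the toy frame the single `0` in the binary column is flagged because its interquartile range is zero, which is worth keeping in mind when reading boxplots of binary columns.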

In [None]:
HIDF1.describe()

# Checking duplicate values

In [None]:
duplicate = HIDF1[HIDF1.duplicated()]
print(f"There are {duplicate.shape[0]} duplicate rows present in the dataset.")

# Checking NaN values

In [None]:
HIDF1.isna().sum().to_frame().T

### Categorical features statistics details

The `include='O'` argument restricts `describe()` to the categorical (object) columns and shows a summary of all the categorical features.



In [None]:
HIDF1.describe(include='O')

Observation -

* In our dataset, there are more men than women.
* Vehicles aged 1-2 years are the most common vehicle age group in our dataset.
* Many of the clients' vehicles have been damaged.

# **Data Visualization**  - Exploratory Data Analysis

# Target Variable

In [None]:
#Storing target column into a variable 
Dependent_variable = HIDF1['Response']

In [None]:
sns.set_theme(style="darkgrid")
sns.countplot(x='Response', data=HIDF1)
plt.title('Not-Interested vs Interested Policyholders', fontsize=20) #title for the countplot
plt.show()

### The data is highly imbalanced.
As the graph above shows, fewer than 50,000 customers are interested, while more than 300,000 are not interested.

In [None]:
HIDF1.Response.value_counts()/HIDF1.shape[0]

Observation -

The dependent variable takes the binary values 0 and 1. The plot above shows that most clients have no interest in purchasing vehicle insurance: 12.2 percent of the rows are 1s and 87.7 percent are 0s. Because the target is this unbalanced, the data must be handled with a class-imbalance technique before modelling.

### Gender

In [None]:
plt.figure(figsize = (13,5))
plt.subplot(1,2,1)
sns.countplot(x='Gender', data=HIDF1, palette='husl')
plt.title("Count of Male & Female")
plt.subplot(1,2,2)
sns.countplot(x='Gender', hue='Response', data=HIDF1, palette="husl")
plt.title("Response in Male and Female Category")
plt.show()

In [None]:
a = HIDF1.groupby('Gender')['Age'].mean()
a

* The gender split in the dataset is close to even, with slightly more male customers than female customers. Male customers are also slightly more likely to be interested in buying the insurance.

* There are more than 200,000 male customers and close to 175,000 female customers. More than 25,000 of the male customers are interested, compared with fewer than 25,000 of the female customers.

## Age vs Response

In [None]:
#### Age VS Response
plt.figure(figsize=(20,10))
sns.countplot(x='Age',hue='Response',data=HIDF1)

* Most young people below 30 are not interested in vehicle insurance. Possible reasons are a lack of experience, lower maturity and not yet owning an expensive vehicle.
* People aged between 30 and 60 are more likely to be interested.



In [None]:
sns.boxplot(x=HIDF1['Age'])

As you can see, there are no outliers present in Age.

In [None]:
HIDF1.Driving_License.value_counts()

In [None]:
plt.figure(figsize = (8, 5))
sns.countplot(x='Driving_License', hue='Response', data=HIDF1)

* Almost all customers who are interested in Vehicle Insurance have a driving license.


# Previously_Insured Vs Response

In [None]:
plt.figure( figsize = (10 , 6))
sns.countplot(x = 'Previously_Insured' , hue = 'Response' , data = HIDF1 , palette = 'husl' )


* Some of the customers who do not already have vehicle insurance are interested in taking it.

# Vehicle_Age Vs Response

In [None]:
HIDF1.Vehicle_Age.value_counts()

In [None]:
plt.figure( figsize = (10 , 6))
sns.countplot(x = 'Vehicle_Age' , hue = 'Response' , data = HIDF1 , palette = 'husl')
plt.axis([None,None,10,175000])

* From this graph we can say that owners of vehicles between 1 and 2 years old make up the largest group of customers interested in buying insurance.

* The number of customers with Vehicle_Age > 2 years is smaller than the number of customers with Vehicle_Age < 1 year.

## Region code Vs Response

In [None]:
plt.figure(figsize = (20,15))
sns.countplot(x='Region_Code', hue='Response', data=HIDF1)
plt.title('Response in terms of Region_Code', fontsize=15)


*   Region_Code 28 has the most customers.


## Vehicle_Damage Vs Response

In [None]:
sns.countplot(x=HIDF1['Vehicle_Damage'], data=HIDF1)
plt.title('Vehicle Damage Status', fontsize=15)
plt.show()

We can infer from the above plot that the number of policyholders is almost equal for both vehicle damage statuses.

# Annual_Premium

In [None]:
plt.figure(figsize=(13,7))
# histplot with kde=True replaces distplot, which is deprecated in recent seaborn
sns.histplot(HIDF1['Annual_Premium'], kde=True, color='green')
plt.title("Distribution of Annual premium")
plt.show()

* From the distribution plot we can infer that the Annual_Premium variable is right-skewed.


In [None]:
plt.figure(figsize=(13,7))
sns.boxplot(x=HIDF1['Annual_Premium'])
plt.title("boxplot of Annual premium")
plt.show()

* As you can see, the Annual_Premium column contains many outliers.

## Converting Categorical columns into Numerical columns using Encoding techniques


Label Encoding on Vehicle_Age and Vehicle_Damage columns
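One thing to keep in mind with label encoding is that `LabelEncoder` numbers the categories in sorted string order, which need not match their natural order. The sketch below illustrates this on a toy series; the category strings are assumed to match the dataset's `Vehicle_Age` values, and the explicit `age_order` mapping is a hypothetical alternative rather than what this notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy series; the category strings are assumed to match the dataset
vehicle_age = pd.Series(["< 1 Year", "1-2 Year", "> 2 Years", "1-2 Year"])

# LabelEncoder sorts the distinct labels, so the codes follow string order
encoder = LabelEncoder()
codes = encoder.fit_transform(vehicle_age)
print(list(encoder.classes_))   # order in which codes 0, 1, 2 are assigned
print(list(codes))

# Hypothetical alternative: an explicit map that keeps youngest -> oldest
age_order = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
ordinal_codes = vehicle_age.map(age_order)
print(ordinal_codes.tolist())
```

Tree-based models are largely indifferent to the code order, but linear models treat the codes as magnitudes, so an order-preserving map may be the safer choice for them.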

In [None]:
from sklearn.preprocessing import LabelEncoder
#changing categorical value to numerical values
#LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels
labelEncoder = LabelEncoder()
HIDF1['Vehicle_Age'] = labelEncoder.fit_transform(HIDF1['Vehicle_Age'])
HIDF1['Vehicle_Damage'] = labelEncoder.fit_transform(HIDF1['Vehicle_Damage'])

One Hot Encoding on Gender Column

In [None]:
#One hot encoder on Gender
from sklearn.preprocessing import OneHotEncoder
enc=OneHotEncoder()
enc_data=pd.DataFrame(enc.fit_transform(HIDF1[['Gender']]).toarray())
names=enc.get_feature_names_out()
enc_data.columns=names
df1=HIDF1.join(enc_data)

In [None]:
#Data after Encoding
df1.info()

Observation -

We can see that the categorical columns have been converted to numeric form.

We are removing the Gender column since it has been separated into two columns, Gender_Female and Gender_Male.

In [None]:
#Removing Gender feature
df1.drop('Gender',axis=1,inplace=True)

In [None]:
#Checking shape after adding/removing features
df1.shape

In [None]:
df1.head()

In [None]:
#Once again checking the duplicates
duplicate = df1[df1.duplicated()]
print(duplicate)

No Duplicates found in this dataset.

## Variance Threshold Removal

Using this method we can check which columns have constant values.

In [None]:
from sklearn.feature_selection import VarianceThreshold

In [None]:
#Implementation Variance Threshold
variance_threshold = VarianceThreshold(threshold=0)
variance_threshold.fit(df1)
variance_threshold.get_support()

Observation -

In our data set, there isn't a single column with constant values.
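For comparison, here is a minimal sketch of what `VarianceThreshold(threshold=0)` reports when a constant column is present. The toy frame is made up for illustration and its `Driving_License` column is constant on purpose.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame (hypothetical values); Driving_License is constant on purpose
toy = pd.DataFrame({
    "Age": [25, 40, 33, 58],
    "Driving_License": [1, 1, 1, 1],
    "Previously_Insured": [0, 1, 0, 1],
})

selector = VarianceThreshold(threshold=0)
selector.fit(toy)

# get_support() is True for columns that are kept, False for zero-variance ones
support = selector.get_support()
constant_columns = toy.columns[~support].tolist()
print(constant_columns)
```

A `False` entry in the mask marks a column that carries no information and can be dropped.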

## Feature Selection using f_classif
## Separating Dependent and Independent Variables

In [None]:
#importing the libraries
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

In [None]:
independent = df1.drop(['Response'], axis=1) #Contain all independent variables
dependent = df1['Response'] #Contain Dependent variable

In [None]:
#Finding scores of each feature
f_scores = f_classif(independent, dependent)
f_scores

In [None]:
#The Second array consists of p-values that we need.
p_values = pd.Series(f_scores[1], index= independent.columns)
p_values.plot(kind='bar', color='blue', figsize=(16,5))
plt.title('p-value scores for numerical features')
plt.show()

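We can drop the id and Vintage columns: as the chart above shows, their p-values are high, so they show no statistically significant relationship with the target.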
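How `f_classif` separates useful features from uninformative ones can be seen on synthetic data. In this sketch the column names `informative` and `noise` are hypothetical: one feature shifts with the class label and the other is unrelated to it.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

X = pd.DataFrame({
    "informative": y * 2.0 + rng.normal(size=n),   # mean shifts with the class
    "noise": rng.normal(size=n),                    # unrelated to the class
})

# f_classif returns (F statistics, p-values), one entry per column
f_stat, p_val = f_classif(X, y)
p_values = pd.Series(p_val, index=X.columns)
print(p_values)
```

A small p-value suggests the feature's mean differs between the classes. A large one, as expected for the noise column, suggests the feature adds little on its own.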

# Feature Importance

In [None]:
#Checking Feature importance by using RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier
# Create the random forest with hyperparameters
model = RandomForestClassifier(n_estimators=340)
# Fit the model
model.fit(independent, dependent)
# Get the importance of the resulting features
importances = model.feature_importances_
# Create a data frame for visualization
final_df = pd.DataFrame({"Features": independent.columns, "Importances": importances})
# Sort in ascending order for better visualization
final_df = final_df.sort_values('Importances')
# Plot the feature importances as bars, labelled by feature name
final_df.plot.bar(x='Features', y='Importances', color='teal')



*   After applying f_classif and RandomForestClassifier we can observe that id, Vintage and Gender are less important, so we can drop those columns.



In [None]:
df1.info()

In [None]:
df1.drop(['id','Vintage'],axis=1,inplace=True)

# Correlation Feature Selection


In [None]:
#Checking correlation of all the columns using heatmap
plt.figure(figsize = (18,10))
correlation = df1.corr()
sns.heatmap(correlation, annot= True,linewidths=3,cmap='coolwarm')
plt.title("Pearson correlation of Features", y=1.05, size=15)

Observations based on correlation plot:-

* Gender_Female and Gender_Male are 100% multicollinear, so we can remove either one of these two features.
* Previously_Insured and Vehicle_Damage have high correlations with the dependent variable.

In [None]:
#Dropping gender female
df1.drop('Gender_Female',axis=1,inplace=True)

Observation -

We have removed Gender_Female since it is 100% multicollinear with the Gender_Male column, so only one of these two features needs to be kept.
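The perfect collinearity between the two indicator columns is easy to verify on a toy series. The sketch below uses `pd.get_dummies` instead of the `OneHotEncoder` used in this notebook, purely to keep the illustration short.

```python
import pandas as pd

# Toy series (hypothetical values)
gender = pd.Series(["Male", "Female", "Female", "Male", "Male"], name="Gender")

# Two indicator columns: each is exactly 1 minus the other
dummies = pd.get_dummies(gender, prefix="Gender", dtype=int)
corr = dummies["Gender_Female"].corr(dummies["Gender_Male"])
print(round(corr, 6))

# drop_first=True keeps a single indicator column from the start
single = pd.get_dummies(gender, prefix="Gender", drop_first=True, dtype=int)
print(single.columns.tolist())
```

Because one column fully determines the other, keeping both adds no information, and for linear models it makes the coefficients non-unique.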

In [None]:
#Checking shape after removing 3 columns
df1.shape

The final dataset shape will be used in Model Training.

## Split Train & Test data

In [None]:
#Splitting the data into train and test data

X = df1.drop(['Response'], axis=1) #Contain all independent variables
y = df1['Response'] 

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size=.30,random_state=0)
print(Xtrain.shape,Xtest.shape,ytrain.shape,ytest.shape)

In [None]:
#Make a list to get most important Features
train_col_list = list(Xtrain.columns)
train_col_list

In [None]:
ytrain = ytrain.values.ravel()  # 1-D array, the shape scikit-learn expects for targets

In [None]:
ytest = ytest.values.ravel()

In [None]:
ytrain.shape

In [None]:
ytest.shape

## **Handling Imbalanced data**

One of the most significant challenges when dealing with unbalanced datasets is choosing the metric used to evaluate the model. Simpler metrics, such as accuracy score, can be misleading. On a dataset with highly unbalanced classes, a classifier can reach a high accuracy by always "predicting" the most common class without performing any feature analysis, yet it will be wrong on every minority-class example.
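A quick way to see the problem is to score a "model" that ignores the features entirely. The labels below are made up to roughly mirror the class ratio in this dataset (12 positives in 100 rows).

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 12 interested customers out of 100
y_true = [1] * 12 + [0] * 88
# A "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)
print(accuracy, recall)
```

The baseline scores 0.88 on accuracy while finding none of the interested customers, which is why recall on the positive class is the more informative number here.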

### Using Over Sampling Technique

In [None]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state = 42)
X_ros, y_ros = ros.fit_resample(Xtrain, ytrain)

print('Original dataset size', len(HIDF1))
print('Resampled train target size', len(y_ros))
print('Resampled train feature size', len(X_ros))
print('Train target size before resampling', len(ytrain))
print('Train feature size before resampling', len(Xtrain))

Observation-

The training set has now been balanced using the oversampling technique, and it is ready for training the model.
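Conceptually, random oversampling draws minority-class rows with replacement until both classes are the same size. The sketch below does this by hand with `sklearn.utils.resample` on a made-up training frame. It is an illustration of the idea, not the internals of `RandomOverSampler`.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training split (hypothetical values): 6 negatives, 2 positives
train = pd.DataFrame({
    "Age": [22, 35, 41, 29, 53, 47, 31, 60],
    "Response": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = train[train["Response"] == 0]
minority = train[train["Response"] == 1]

# Draw minority rows with replacement until both classes are the same size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])

class_counts = balanced["Response"].value_counts().to_dict()
print(class_counts)
```

As in the notebook, only the training split is resampled. The test split keeps its original class ratio so that the evaluation reflects real-world proportions.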



### Feature Scaling

In [None]:
#Feature Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_ros = scaler.fit_transform(X_ros)
Xtest = scaler.transform(Xtest)

 

*   Scaled the training features to a common range, which makes it easier for a model to learn. The test features are transformed with the same fitted scaler.
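The fit-on-train, transform-on-test pattern can be checked on a tiny array. The numbers below are made up; the point is that the scaler learns its minimum and maximum from the training rows only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy arrays (hypothetical values): columns could be age and annual premium
X_train = np.array([[20.0, 10_000.0], [40.0, 30_000.0], [60.0, 50_000.0]])
X_test = np.array([[30.0, 20_000.0], [70.0, 60_000.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learns min/max from train only
X_test_scaled = scaler.transform(X_test)         # reuses the train min/max

print(X_train_scaled)
print(X_test_scaled)
```

Test values outside the training range map outside [0, 1] (here 1.25), which is expected and avoids leaking test-set statistics into the scaler.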



# Model Training

In [None]:
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBRFClassifier
from sklearn.ensemble import AdaBoostClassifier
import lightgbm as lgb

In [None]:
#Defining all these models
models = [
           ['LinearClassifier: ', Perceptron()],
           ['LogisticRegression: ', LogisticRegression()],
           ['GNB: ', GaussianNB()],
           ['BNB: ', BernoulliNB()],
           ['KNeighborsClassifier: ', KNeighborsClassifier()],
           ['DecisionTreeClassifier: ', DecisionTreeClassifier()],
           ['RandomForestClassifier ',RandomForestClassifier()],
           ['GradientBoostingClassifier: ', GradientBoostingClassifier()] ,
           ['XGBRFClassifier: ', XGBRFClassifier()],
           ['AdaBoostClassifier: ',AdaBoostClassifier()],
           ['LgbmClassifier: ', lgb.LGBMClassifier()]
         ]

In [None]:
#store all the metrics values in data frame
import time
model_data = []
for name,curr_model in models :
      curr_model_data = {}
      curr_model.random_state = 42
      curr_model_data["Name"] = name
      start = time.time()
      curr_model.fit(X_ros,y_ros)
      end = time.time()
      y_train_pred=curr_model.predict(X_ros)
      y_test_pred= curr_model.predict(Xtest)
      curr_model_data["Train_Time"] = end - start
      curr_model_data["Train accuracy"] =accuracy_score(y_ros,y_train_pred )
      curr_model_data["Test accuracy"] =accuracy_score(ytest, y_test_pred)
      curr_model_data["Train precision"] = precision_score(y_ros,y_train_pred)
      curr_model_data["Test precision"] = precision_score(ytest,y_test_pred)
      curr_model_data["Train recall"] = recall_score(y_ros,y_train_pred)
      curr_model_data["Test recall"] = recall_score(ytest,y_test_pred)
      curr_model_data["Train f1 score"] = f1_score(y_ros,y_train_pred)
      curr_model_data["Test f1 score"] = f1_score(ytest,y_test_pred)
      curr_model_data['Train ROC-AUC'] = roc_auc_score(y_ros,y_train_pred)
      curr_model_data["Test ROC-AUC"] = roc_auc_score(ytest,y_test_pred)
      model_data.append(curr_model_data)
 

In [None]:
results = pd.DataFrame(model_data)
results

Observation -

Hurrah! Here are the results of all the models. Recall is the most important evaluation metric for this problem, and we can see that the boosting algorithms perform well on it.

However, we can perform hyperparameter tuning on these models to determine the optimum model.
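One detail worth noting when reading ROC-AUC values: `roc_auc_score` is designed to take scores such as predicted probabilities, and passing hard 0/1 predictions reduces the ROC curve to a single threshold. The sketch below compares the two on synthetic data. The dataset and the logistic regression model are stand-ins chosen for illustration, not the models tuned in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 88% / 12%)
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.88, 0.12], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard labels uses one threshold; AUC from probabilities uses all of them
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(auc_from_labels, auc_from_scores)
```

When comparing models by ROC-AUC, using `predict_proba(...)[:, 1]` (or `decision_function`) generally gives the more meaningful figure.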

In [None]:
#Draw plot for the above models' metrics
results.plot(x="Name", y=['Train accuracy' , 'Test accuracy' ,'Train precision','Test precision','Train recall','Test recall','Train f1 score','Test f1 score'], kind="bar" , title = 'Accuracy Score Results' , figsize= (10,8)) 

Observation -

RandomForestClassifier is performing well in terms of accuracy.



In [None]:
model_data2 = []
for name,curr_model in models :
    curr_model_data = {}
    curr_model.random_state = 42
    curr_model_data["Name"] = name
    start = time.time()
    curr_model.fit(X_ros,y_ros)
    end = time.time()
    curr_model_data["Train_Time"] = end - start
    curr_model_data["conf_mat"] = confusion_matrix(ytest,[round(value) for value in (curr_model.predict(Xtest))])
    model_data2.append(curr_model_data)

In [None]:
Conf_Mat_df= pd.DataFrame(model_data2)
Conf_Mat_df

Observation -

We can observe from the confusion matrix that LgbmClassifier, XGBRFClassifier are the top models.


## Let's perform Cross Validation and Hyper parameter tuning on these models to get better results.

## Hyperparameter Tuning on RandomForestClassifier

In [None]:
rf = RandomForestClassifier(random_state=40)
#Cross validation and hyperparameter tuning
rf_bayes = BayesSearchCV(estimator= rf,
                         search_spaces = {
                          'max_depth': Integer(2,100),
                          'min_samples_leaf': Integer(1,100),
                          'min_samples_split': Integer(2,100),
                          'n_estimators': Integer(1,140),
                          'max_features': Categorical(["sqrt", "log2"])  # "auto" is no longer accepted by recent scikit-learn
                        },
                       cv = 5, verbose=2, scoring='accuracy',n_iter=10)

rf_bayes.fit(X_ros,y_ros)

In [None]:
rf_bayes.best_params_

In [None]:
rf_bayes.best_estimator_

In [None]:
#make prediction
train_pred=rf_bayes.best_estimator_.predict(X_ros)
test_pred=rf_bayes.best_estimator_.predict(Xtest)

In [None]:
# Calculating accuracy on train and test
train_accuracy = accuracy_score(y_ros, train_pred)
test_accuracy = accuracy_score(ytest, test_pred)

print("The accuracy on train dataset is", train_accuracy)
print("The accuracy on test dataset is", test_accuracy)

In [None]:
# Get the confusion matrices for train and test
train_cm = confusion_matrix(y_ros, train_pred)
test_cm = confusion_matrix(ytest, test_pred)

In [None]:
print(train_cm)
print(test_cm)

In [None]:
print(classification_report(y_ros,train_pred))
print("\n")
print(classification_report(ytest,test_pred))

## ROC curve for Train data

In [None]:
metrics.RocCurveDisplay.from_estimator(rf_bayes, X_ros, y_ros)  # plot_roc_curve was removed in scikit-learn 1.2

In [None]:
metrics.PrecisionRecallDisplay.from_estimator(rf_bayes, X_ros, y_ros)  # plot_precision_recall_curve was removed in scikit-learn 1.2

## ROC curve for Test data

In [None]:
metrics.RocCurveDisplay.from_estimator(rf_bayes, Xtest, ytest)  # plot_roc_curve was removed in scikit-learn 1.2

In [None]:
metrics.PrecisionRecallDisplay.from_estimator(rf_bayes, Xtest, ytest)  # plot_precision_recall_curve was removed in scikit-learn 1.2

## Hyperparameter Tuning on LgbmClassifier

In [None]:
lgbm = ltb.LGBMClassifier()
#Cross validation and hyperparameter tuning
lg_bayes = BayesSearchCV(estimator= lgbm,
                         search_spaces = {
                          'max_depth':Integer(4,100) ,
                          'num_leaves': Integer(3,200),
                          'n_estimators': Integer(3,200),
                          'min_split_gain': Real(1.0,10.0),  # continuous parameter, so a Real range
                          # n_jobs only controls parallelism, so it is left out of the search space
                        },
                       cv = 5, verbose=2, scoring='accuracy', n_iter=10)

lg_bayes.fit(X_ros,y_ros)

In [None]:
lg_bayes.best_params_

In [None]:
lg_bayes.best_estimator_

In [None]:
#make prediction
lgtrain_pred=lg_bayes.best_estimator_.predict(X_ros)
lgtest_pred=lg_bayes.best_estimator_.predict(Xtest)

In [None]:
# Calculating accuracy on train and test
train_accuracy = accuracy_score(y_ros, lgtrain_pred)
test_accuracy = accuracy_score(ytest, lgtest_pred)

print("The accuracy on train dataset is", train_accuracy)
print("The accuracy on test dataset is", test_accuracy)

In [None]:
# Get the confusion matrices for train and test
train_cm = confusion_matrix(y_ros, lgtrain_pred)
test_cm = confusion_matrix(ytest, lgtest_pred)

In [None]:
# Print the classification report for train and test
print(classification_report(y_ros,lgtrain_pred))
print("\n")
print(classification_report(ytest,lgtest_pred))

## ROC curve for Train data

In [None]:
metrics.RocCurveDisplay.from_estimator(lg_bayes, X_ros, y_ros)  # plot_roc_curve was removed in scikit-learn 1.2

In [None]:
metrics.PrecisionRecallDisplay.from_estimator(lg_bayes, X_ros, y_ros)  # plot_precision_recall_curve was removed in scikit-learn 1.2

## ROC curve for Test data

In [None]:
metrics.RocCurveDisplay.from_estimator(lg_bayes, Xtest, ytest)  # plot_roc_curve was removed in scikit-learn 1.2

In [None]:
metrics.PrecisionRecallDisplay.from_estimator(lg_bayes, Xtest, ytest)  # plot_precision_recall_curve was removed in scikit-learn 1.2

# **Observation**

The ML model for the problem statement was built in Python on the dataset above. Of the models tried, LGBM and Random Forest performed better than the others.

Comparing the two, the LGBM model performed well on the most essential evaluation metric, recall, with values of 0.95 on train data and 0.92 on test data. As a result, we conclude that LGBMClassifier is the best model for this dataset.

## Finally, let us highlight the most important features that will be beneficial to the client.


In [None]:
importances = lg_bayes.best_estimator_.feature_importances_
importance_dict = {'Feature' : train_col_list,
                   'Feature Importance' : importances}
importance_df = pd.DataFrame(importance_dict)

In [None]:
importance_df['Feature Importance'] = round(importance_df['Feature Importance'],2)

In [None]:
#Our top feature in descending order
importance_df.sort_values(by=['Feature Importance'],ascending=False)

The most significant features are listed from top to bottom.




## **Conclusion:**

Our client is an insurance firm that has supplied Health Insurance to its customers. They now need assistance in developing a model to predict whether the policyholders (customers) from the previous year will be interested in the company's Vehicle Insurance.

Building a model to predict whether a client is interested in Vehicle Insurance is extremely beneficial to the company because it can then plan its communication strategy to reach out to those customers and optimise its business model and revenue.

Now, we have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policies (premium, sourcing channel), and so on to predict whether the customer would be interested in Vehicle insurance.

Key points:

Customers aged between 30 and 60 are more likely to buy insurance.

Customers with Vehicle_Damage are likely to buy insurance.

Customers with a driving license have a higher chance of buying insurance.

Variables such as Age, Previously_Insured and Annual_Premium have the greatest effect on the target variable.

We can see that the LGBM model performs better for this dataset.



Improvements:

By using a marketing and advertising approach, we can reduce the gender gap.

We can clearly see that we have a larger number of consumers without vehicle insurance, so we can target them directly with our campaign.

Since there are fewer policyholders with vehicles older than two years, we should pay more attention to the other two categories (1-2 years and < 1 year). Most sales agencies that offer vehicle insurance for the first year are our target, and we can give them the best incentives to reduce competition in the market.

As we saw, we have nearly equal numbers of policyholders for both vehicle damage statuses, so we can target those policyholders whose vehicles were damaged in the past.