Context:
Our client is an insurance company that has provided Health Insurance to its customers. It now needs your help building a model to predict whether last year's policyholders (customers) will also be interested in the Vehicle Insurance the company offers.
An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.
Just like medical insurance, vehicle insurance requires the customer to pay a premium of a certain amount to the insurance provider every year, so that in the event of an unfortunate accident involving the vehicle, the provider pays compensation (called ‘sum assured’) to the customer.
A model that predicts whether a customer would be interested in Vehicle Insurance is extremely helpful for the company, because it can then plan its communication strategy to reach out to those customers and optimise its business model and revenue.
Dataset Here
- Increasing revenue from cross-selling additional Vehicle Insurance products.
- Building a Machine Learning model to predict potential buyers who are interested in vehicle insurance.
- Increasing Conversion Rate.

Conversion Rate: the share of customers who decide to take additional products such as Vehicle Insurance, which indirectly increases the company's revenue.
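As a rough illustration, the conversion rate can be read as the share of contacted customers who respond positively. The sketch below assumes a binary `Response` column (1 = interested); the toy dataframe and the `conversion_rate` helper are made up for this example and are not part of the project code.

```python
import pandas as pd

# Hypothetical toy data: 1 = customer accepted the vehicle insurance offer.
offers = pd.DataFrame({"Response": [0, 1, 0, 0, 1, 0, 0, 0]})


def conversion_rate(df: pd.DataFrame, target: str = "Response") -> float:
    """Share of customers whose binary target equals 1."""
    return float(df[target].mean())


rate = conversion_rate(offers)
print(f"Conversion rate: {rate:.1%}")
```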
- Exploratory Data Analysis
- Data Pre-Processing
- Modelling
- Deployment
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from scipy.stats import boxcox
from imblearn import under_sampling, over_sampling
import gdown
from sklearn.model_selection import train_test_split
from mlxtend.plotting import plot_confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, confusion_matrix, fbeta_score, make_scorer
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import cross_validate, RandomizedSearchCV, GridSearchCV, HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier, reset_parameter
import shap
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
print('numpy version : ',np.__version__)
print('pandas version : ',pd.__version__)
print('seaborn version : ',sns.__version__)
The boxplot shows that there are no outliers except in the `Annual_Premium` column.
No feature has a high correlation with the target (response). However, two features are strongly negatively correlated with each other: age and policy sales channel, with a correlation coefficient of -0.58.
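A correlation check of this kind could be sketched as follows. The `strongest_pairs` helper, the 0.5 threshold, and the synthetic data are assumptions for illustration only; the column names `Age` and `Response` are guessed from the prose, and the numbers produced here are unrelated to the -0.58 reported above.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.select_dtypes("number").corr()
    # Keep only the upper triangle so each pair appears once, without the diagonal.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs.abs() > threshold].sort_values()


# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(0)
age = rng.integers(20, 80, size=500)
demo = pd.DataFrame({
    "Age": age,
    "Policy_Sales_Channel": 160 - age + rng.normal(0, 15, size=500),
    "Response": rng.integers(0, 2, size=500),
})

result = strongest_pairs(demo)
print(result)
```

Only pairs above the threshold are returned, so a target column with no strong linear relationship to any feature simply does not show up in the output.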
Vehicle insurance is more attractive to customers aged 30-50 years, and it attracts more male customers than female customers. Customers whose vehicles are not in good condition and customers whose vehicles are still fairly new (1-2 years) show interest in buying vehicle insurance. Customers who do not yet have vehicle insurance are also more interested: about 20% of all customers without vehicle insurance respond positively.
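Segment-level findings like these are typically read from per-group response rates. Here is a minimal sketch, assuming a binary `Response` column and a `Previously_Insured` flag; the helper name and the toy rows are hypothetical and do not reproduce the real figures.

```python
import pandas as pd


def response_rate_by(df: pd.DataFrame, column: str, target: str = "Response") -> pd.Series:
    """Mean of the binary target within each category of `column`."""
    return df.groupby(column)[target].mean().sort_values(ascending=False)


# Hypothetical toy data, constructed so that 1 in 5 uninsured customers says yes.
demo = pd.DataFrame({
    "Previously_Insured": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "Gender": ["Male", "Female", "Male", "Male", "Female",
               "Female", "Male", "Female", "Male", "Female"],
    "Response": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})

rates = response_rate_by(demo, "Previously_Insured")
print(rates)
```

The same helper can be pointed at `Gender`, `Vehicle_Age`, or a binned age column to compare other segments.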
The EDA plots from Stage 1 show that `Annual_Premium` has outliers extreme enough to be candidates for IQR removal or capping.
We nevertheless decided to keep using the `df` dataframe unchanged: outliers are normal for the `Annual_Premium` column, so none were removed. This choice also reflects the aim of building a model that is robust to outliers.
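For reference, the IQR-based alternatives mentioned above could look roughly like this. The project itself kept the outliers, so this is only a sketch of the option that was not taken; the `iqr_bounds` helper, the conventional factor of 1.5, and the sample values are assumptions.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Lower and upper fences at k * IQR beyond the first and third quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up premiums with one very low and one very high value.
premium = pd.Series(
    [2630, 24000, 28000, 31000, 33000, 36000, 39000, 540000],
    name="Annual_Premium",
)

lower, upper = iqr_bounds(premium)

# Option 1: flag (and possibly drop) rows outside the fences.
outliers = premium[(premium < lower) | (premium > upper)]

# Option 2: cap values at the fences instead of dropping rows.
capped = premium.clip(lower=lower, upper=upper)

print(outliers.tolist())
print(capped.tolist())
```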
`Vehicle_Damage` is converted to an integer (0: the customer's vehicle has never been damaged, 1: the customer's vehicle has been damaged), and `Vehicle_Age` is converted to an ordinal integer (0: < 1 Year, 1: 1-2 Years, 2: > 2 Years). `Gender` is encoded with One Hot Encoding. Categories are mapped to numbers starting from 0 to facilitate machine learning, and columns with datatype bool are converted to int to make them easier for the model to process.
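The encoding steps above can be sketched with plain pandas. The raw category labels in `DAMAGE_MAP` and `AGE_MAP` are assumptions about how the values are spelled in the dataset, and `encode` is a hypothetical helper rather than the project's own function.

```python
import pandas as pd

# Assumed raw labels; adjust them to the spelling used in the actual data.
DAMAGE_MAP = {"No": 0, "Yes": 1}
AGE_MAP = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}


def encode(df: pd.DataFrame) -> pd.DataFrame:
    """Label-encode Vehicle_Damage and Vehicle_Age, one-hot encode Gender."""
    out = df.copy()
    out["Vehicle_Damage"] = out["Vehicle_Damage"].map(DAMAGE_MAP)
    out["Vehicle_Age"] = out["Vehicle_Age"].map(AGE_MAP)
    out = pd.get_dummies(out, columns=["Gender"])
    # Recent pandas versions return bool dummies; cast them to int.
    bool_cols = out.select_dtypes(include="bool").columns
    return out.astype({col: int for col in bool_cols})


demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
})

encoded = encode(demo)
print(encoded)
```

Any label missing from the maps becomes `NaN` after `.map`, which makes spelling mismatches easy to spot.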
Class imbalance is handled with a combination of oversampling and undersampling so that the data does not become biased toward one class. The difference between the two target values, 0 and 1, is more than 50%, so oversampling alone does not guarantee better machine learning performance; some oversampling is still needed so that the model does not underfit.
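One way to combine the two resampling directions is sketched below in plain pandas, as a stand-in for the `imblearn` samplers imported earlier. The `rebalance` helper, the 0.8 undersampling fraction, and the 0.5 target ratio are hypothetical choices, not the settings used in the project.

```python
import pandas as pd


def rebalance(
    df: pd.DataFrame,
    target: str = "Response",
    majority_frac: float = 0.8,
    minority_ratio: float = 0.5,
    seed: int = 42,
) -> pd.DataFrame:
    """Randomly undersample the majority class, then oversample the minority class."""
    majority_label = df[target].value_counts().idxmax()
    majority = df[df[target] == majority_label]
    minority = df[df[target] != majority_label]

    # Undersampling: keep only a fraction of the majority rows.
    majority_down = majority.sample(frac=majority_frac, random_state=seed)

    # Oversampling: duplicate minority rows (with replacement) up to the target ratio.
    n_target = int(len(majority_down) * minority_ratio)
    n_extra = max(n_target - len(minority), 0)
    extra = minority.sample(n=n_extra, replace=True, random_state=seed)

    balanced = pd.concat([majority_down, minority, extra])
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data with a 90/10 split between the two classes.
demo = pd.DataFrame({"Age": range(100), "Response": [0] * 90 + [1] * 10})
balanced = rebalance(demo)
counts = balanced["Response"].value_counts()
print(counts)
```

Resampling of this kind is usually applied to the training split only, so that the test split keeps the original class distribution.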
- `Premium_Per_Channel`: calculates the total premium of each `Policy_Sales_Channel` and provides new insight into it, so that channels can be grouped based on `Annual_Premium`.
- `Vintage_Group`: a new feature that converts the `Vintage` feature into a category with a certain range, defined as New (just joined), Intermediate (joined for a while), and Long-term (joined for a long time).
- `Not_Insured_and_Damaged`: a column that takes the value 1 if the `Previously_Insured` column has the value 0 and `Vehicle_Damage` has the value 1.
- `Channel_Response_Rate`: the response rate of each channel, which indicates how effective a channel is at getting a 'Yes' answer. It can also be used to group channels that have a high rate.
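The four features above could be derived along these lines. This is a sketch under assumptions: `Vehicle_Damage` is taken to be already encoded as 0/1, the target column is assumed to be named `Response`, and the `Vintage` cut points (100 and 200 days) are hypothetical because the source does not state the ranges.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the engineered columns described above to a copy of `df`."""
    out = df.copy()

    # Total premium collected through each sales channel.
    out["Premium_Per_Channel"] = (
        out.groupby("Policy_Sales_Channel")["Annual_Premium"].transform("sum")
    )

    # Share of positive responses per sales channel.
    out["Channel_Response_Rate"] = (
        out.groupby("Policy_Sales_Channel")["Response"].transform("mean")
    )

    # Hypothetical cut points; the real ranges are not given in the text.
    out["Vintage_Group"] = pd.cut(
        out["Vintage"],
        bins=[0, 100, 200, float("inf")],
        labels=["New", "Intermediate", "Long-term"],
        include_lowest=True,
    )

    # 1 only when the customer is uninsured AND the vehicle has been damaged.
    out["Not_Insured_and_Damaged"] = (
        (out["Previously_Insured"] == 0) & (out["Vehicle_Damage"] == 1)
    ).astype(int)
    return out


demo = pd.DataFrame({
    "Policy_Sales_Channel": [26, 26, 152, 152, 152],
    "Annual_Premium": [30000, 40000, 25000, 27000, 2630],
    "Response": [1, 0, 0, 0, 1],
    "Vintage": [50, 150, 250, 10, 299],
    "Previously_Insured": [0, 1, 0, 0, 1],
    "Vehicle_Damage": [1, 1, 0, 1, 1],
})

features = add_features(demo)
print(features)
```

Because `Channel_Response_Rate` is built from the target, computing it on the training split and mapping it onto the test split is one way to limit target leakage.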
Model | Accuracy Test | Accuracy Train | Precision Test | Precision Train | Recall Test | Recall Train | F1 Test | F1 Train | ROC AUC Test | ROC AUC Train | ROC AUC CrossVal Test | ROC AUC CrossVal Train |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Logistic | 0.79 | 0.79 | 0.72 | 0.72 | 0.93 | 0.94 | 0.82 | 0.82 | 0.87 | 0.87 | 0.93 | 0.92 |
KNN | 0.79 | 0.80 | 0.78 | 0.78 | 0.83 | 0.83 | 0.80 | 0.80 | 0.89 | 0.89 | 0.93 | 0.92 |
Decision Tree | 0.82 | 0.83 | 0.78 | 0.79 | 0.89 | 0.90 | 0.83 | 0.84 | 0.91 | 0.93 | 0.93 | 0.92 |
XGBoost | 0.82 | 0.83 | 0.78 | 0.78 | 0.91 | 0.91 | 0.84 | 0.84 | 0.92 | 0.92 | 0.93 | 0.92 |
Random Forest | 0.82 | 0.83 | 0.78 | 0.79 | 0.89 | 0.90 | 0.83 | 0.84 | 0.92 | 0.93 | 0.93 | 0.92 |
LightGBM | 0.82 | 0.82 | 0.77 | 0.77 | 0.91 | 0.91 | 0.84 | 0.84 | 0.92 | 0.92 | 0.93 | 0.92 |
Gradient Boost | 0.82 | 0.82 | 0.76 | 0.76 | 0.93 | 0.92 | 0.84 | 0.84 | 0.91 | 0.91 | 0.93 | 0.92 |

Model | Accuracy Test | Accuracy Train | Precision Test | Precision Train | Recall Test | Recall Train | F1 Test | F1 Train | ROC AUC Test | ROC AUC Train | ROC AUC CrossVal Test | ROC AUC CrossVal Train |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Logistic | 0.78 | 0.79 | 0.71 | 0.71 | 0.96 | 0.95 | 0.82 | 0.82 | 0.87 | 0.87 | 0.93 | 0.92 |
Decision Tree | 0.79 | 0.79 | 0.73 | 0.72 | 0.94 | 0.94 | 0.82 | 0.82 | 0.86 | 0.86 | 0.93 | 0.92 |
XGBoost | 0.82 | 0.82 | 0.76 | 0.76 | 0.94 | 0.94 | 0.84 | 0.84 | 0.92 | 0.92 | 0.93 | 0.92 |
Random Forest | 0.81 | 0.81 | 0.74 | 0.74 | 0.95 | 0.95 | 0.83 | 0.83 | 0.89 | 0.89 | 0.93 | 0.92 |
LightGBM | 0.82 | 0.82 | 0.77 | 0.77 | 0.92 | 0.92 | 0.84 | 0.84 | 0.92 | 0.92 | 0.93 | 0.92 |
Gradient Boost | 0.82 | 0.82 | 0.76 | 0.76 | 0.93 | 0.93 | 0.84 | 0.83 | 0.91 | 0.91 | 0.93 | 0.92 |
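A row of a table like the ones above could be produced with a helper along these lines. This is a sketch, not the project's evaluation code: the `evaluate` function, the 5-fold setting, and the synthetic data from `make_classification` are all assumptions, so the printed numbers will not match the tables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import cross_validate, train_test_split


def evaluate(model, X_train, X_test, y_train, y_test):
    """Collect train/test metrics plus cross-validated ROC AUC for a fitted model."""
    scores = {}
    for split, X, y in (("Test", X_test, y_test), ("Train", X_train, y_train)):
        pred = model.predict(X)
        proba = model.predict_proba(X)[:, 1]  # probability of the positive class
        scores[f"Accuracy {split}"] = accuracy_score(y, pred)
        scores[f"Precision {split}"] = precision_score(y, pred)
        scores[f"Recall {split}"] = recall_score(y, pred)
        scores[f"F1 {split}"] = f1_score(y, pred)
        scores[f"ROC AUC {split}"] = roc_auc_score(y, proba)

    # cross_validate clones and refits the model on each fold of the training data.
    cv = cross_validate(
        model, X_train, y_train, cv=5, scoring="roc_auc", return_train_score=True
    )
    scores["ROC AUC CrossVal Test"] = cv["test_score"].mean()
    scores["ROC AUC CrossVal Train"] = cv["train_score"].mean()
    return scores


# Synthetic stand-in data, not the insurance dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = evaluate(model, X_train, X_test, y_train, y_test)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Reporting train and test values side by side, as the tables do, makes it easy to spot overfitting: a large gap between the two suggests the model has memorised the training data.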