The aim of this report is to implement and evaluate several predictive models that predict whether an existing customer would be interested in vehicle insurance, using the predictors Gender, Age, Driving License, Region Code, Previously Insured, Vehicle Age, Vehicle Damage, Annual Premium, Policy Sales Channel and Vintage.

The data preprocessing has already been done in the datavisualization-preprocessing report.

# 1. Load the data

In [4]:
#the required libraries are imported
import numpy as np           #for efficient numerical operations
import pandas as pd          #for manipulating and visualising data

import time                  #for measuring how long model training takes

import seaborn as sns             #for data visualization

#load the required train and test datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [5]:
train.head()

Unnamed: 0,Gender,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Policy_Sales_Channel,Vintage,Age_log,Annual_Premium_log,Response
0,0.930654,0.047053,-1.197614,-0.956976,0.724263,1.022372,0.228267,0.210398,0.871617,-0.345003,0
1,-1.074513,0.047053,-0.654205,1.044959,-1.025121,-0.978117,0.747745,-0.852524,-0.785843,-0.331879,0
2,0.930654,0.047053,-0.654205,-0.956976,0.724263,1.022372,-1.589904,-1.461615,0.489337,-0.459352,1
3,0.930654,0.047053,-1.818652,1.044959,-1.025121,-0.978117,0.896167,-0.780867,-0.785843,-1.036554,0
4,-1.074513,0.047053,-1.896282,1.044959,-1.025121,-0.978117,0.747745,-1.354128,-0.598007,1.223449,0


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189674 entries, 0 to 189673
Data columns (total 11 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Gender                189674 non-null  float64
 1   Driving_License       189674 non-null  float64
 2   Region_Code           189674 non-null  float64
 3   Previously_Insured    189674 non-null  float64
 4   Vehicle_Age           189674 non-null  float64
 5   Vehicle_Damage        189674 non-null  float64
 6   Policy_Sales_Channel  189674 non-null  float64
 7   Vintage               189674 non-null  float64
 8   Age_log               189674 non-null  float64
 9   Annual_Premium_log    189674 non-null  float64
 10  Response              189674 non-null  int64  
dtypes: float64(10), int64(1)
memory usage: 15.9 MB


In [7]:
train.shape[0]

189674

In [8]:
test.shape[0]

152444

The train and test datasets are large (189,674 and 152,444 rows respectively), and training models on all of the data would take a very long time. Therefore, a random sample is taken from each dataset.
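
Because `sample` draws rows uniformly at random, the share of interested customers in a sample should stay close to the share in the full data. The following is a minimal sketch of that check on synthetic data; the `full` DataFrame and its 12% positive rate are assumptions made for illustration and are not the report's actual data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic stand-in for the train set: about 12% positive responses (assumed rate)
full = pd.DataFrame({"Response": (rng.random(189674) < 0.12).astype(int)})

# Same call as used for the real data
sample = full.sample(n=50000, random_state=7)

full_rate = full["Response"].mean()
sample_rate = sample["Response"].mean()
print(f"positive rate, full data: {full_rate:.4f}")
print(f"positive rate, sample:    {sample_rate:.4f}")
```

If the two rates differed noticeably, a stratified sample would be a possible alternative.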

In [9]:
#sampling from train dataset
ftrain = train.sample(n=50000, random_state=7)

In [10]:
ftrain.head()

Unnamed: 0,Gender,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Policy_Sales_Channel,Vintage,Age_log,Annual_Premium_log,Response
50183,0.930654,0.047053,-1.430503,1.044959,-1.025121,-0.978117,0.729192,1.416637,-1.096595,0.815633,0
46028,0.930654,0.047053,-1.430503,1.044959,-1.025121,-0.978117,0.747745,-0.374806,-0.988722,2.021025,0
41101,-1.074513,0.047053,0.587873,-0.956976,-1.025121,-0.978117,0.747745,0.735888,-0.509064,-0.685612,0
39742,-1.074513,0.047053,0.122094,-0.956976,0.724263,1.022372,-1.589904,0.819489,0.547607,2.429836,1
187634,0.930654,0.047053,-0.343685,1.044959,-1.025121,-0.978117,0.896167,0.640345,-1.327174,0.145065,0


In [11]:
#sampling from test dataset
ftest = test.sample(n=35000, random_state=7)

In [14]:
ftest.head()

Unnamed: 0,Gender,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Policy_Sales_Channel,Vintage,Age_log,Annual_Premium_log,Response
140791,0.922772,0.045936,0.122655,-0.919475,-1.074838,0.991209,0.220503,-0.864108,-1.257842,-0.084741,1
21139,-1.083691,0.045936,-1.161235,-0.919475,0.689124,0.991209,-1.587009,1.170042,0.51849,1.03298,1
12854,-1.083691,0.045936,-0.028391,1.087577,-1.074838,-1.008869,0.884487,-0.253863,-0.829732,-2.127575,0
13704,0.922772,0.045936,1.482068,1.087577,-1.074838,-1.008869,0.884487,-0.026517,-1.377059,-0.049361,0
21554,0.922772,0.045936,-0.406006,-0.919475,0.689124,0.991209,-1.587009,-1.306835,0.896837,0.59747,0


# 2. Train Models

The first step is to create separate arrays for the predictors (`Xtrain`) and for the target (`ytrain`):

In [15]:
from sklearn.model_selection import GridSearchCV

#separating the predictors and target variable
Xtrain = ftrain.drop('Response', axis=1)

ytrain = ftrain['Response'].copy()

# 2.1. Baseline Model

A majority class classifier is used as the baseline: it finds the most common class label in the training set and always predicts that label.

In [16]:
#count the number of instances
ftrain["Response"].value_counts()

0    43867
1     6133
Name: Response, dtype: int64

0: Not interested, 1: Interested

In [17]:
#train set size
ftrain.shape[0]

50000

According to the baseline classifier, the output will be "Not interested" for all predictions. In this project, macro-averaging will be used (precision, recall and F-score are evaluated in each class separately and then averaged across classes).

The baseline classifier is therefore applied to the whole sampled train dataset.

For responses with "Not interested", the accuracy measures will be:

 - Precision: 43867/50000 = 0.877
 - Recall: 43867/43867 = 1.0
 - F-score: 2/(1/precision+1/recall) = 0.935
 
For responses with "Interested", the accuracy measures will be:

 - Precision: 0/0, which is undefined and taken as 0.0 by convention
 - Recall: 0/6133 = 0.0
 - F-score: 0.0
 
The averages over the two classes, which are the eventual baseline scores, are:

 - Precision: 0.439
 - Recall: 0.5
 - F-score: 0.467
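
The arithmetic above can be reproduced with a few lines of plain Python. This is a minimal sketch: the helper `prf` is a hypothetical name introduced here, and it treats an undefined 0/0 ratio as 0.0, which is an assumed convention.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score for one class; 0/0 is treated as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

n_neg, n_pos = 43867, 6133  # class counts in the sampled train set

# The baseline predicts "Not interested" (0) for every customer.
not_interested = prf(tp=n_neg, fp=n_pos, fn=0)
interested = prf(tp=0, fp=0, fn=n_pos)

# Macro-average: unweighted mean of the per-class scores.
macro_p, macro_r, macro_f = (
    (a + b) / 2 for a, b in zip(not_interested, interested)
)
print(f"Precision: {macro_p:.3f}, Recall: {macro_r:.3f}, F-score: {macro_f:.3f}")
```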

# 2.2. Random Forest

In [21]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

#put in the hyperparameters
param_grid = {
    'n_estimators': [10, 100, 200, 1000],
    'max_depth': [3, 5, 15],
    'min_samples_split': [5, 10],
    'random_state': [7]
}

#5-fold cross-validation is used
grid_search = GridSearchCV(rf, param_grid, cv=5,
                          scoring='f1_macro',
                          return_train_score=True)

start = time.time()
grid_search.fit(Xtrain, ytrain)
end = time.time() - start
print(f"Took {end} seconds")

Took 899.0751824378967 seconds


In [22]:
grid_search.best_estimator_

RandomForestClassifier(max_depth=15, min_samples_split=5, n_estimators=10,
                       random_state=7)

In [23]:
grid_search.best_score_

0.5077910772485159

The best hyperparameters prove to be n_estimators = 10, max_depth = 15 and min_samples_split = 5. With these, the model achieves an F-score of 0.51, which is the best one so far.

The scores of the best model are recorded for each split. The command below gives the index of the best-performing model:

In [24]:
grid_search.cv_results_['rank_test_score'].tolist().index(1)

16
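
To see which hyperparameter combination an index refers to, the candidate list can be rebuilt with `ParameterGrid`, which is what `GridSearchCV` iterates over when it is given a `param_grid`. The sketch below repeats the grid from the Random Forest cell; the variable `candidates` is a name introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the Random Forest cell above
param_grid = {
    'n_estimators': [10, 100, 200, 1000],
    'max_depth': [3, 5, 15],
    'min_samples_split': [5, 10],
    'random_state': [7]
}

# Keys are iterated in sorted order, with the last key varying fastest
candidates = list(ParameterGrid(param_grid))
print(len(candidates))
print(candidates[16])
```

In the notebook itself, `grid_search.cv_results_["params"][16]` should give the same information directly from the fitted search.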

In [25]:
rf_split_test_scores = []
best_idx = grid_search.best_index_   #index of the best model (16, as found above)
for x in range(5):
    #extract f-score of the best model from each of the 5 splits
    val = grid_search.cv_results_[f"split{x}_test_score"][best_idx]
    rf_split_test_scores.append(val)

The scores achieved by all the models for the different hyperparameter combinations are reviewed:

In [26]:
val_scores = grid_search.cv_results_['mean_test_score']
train_scores = grid_search.cv_results_['mean_train_score']
params = [str(x) for x in grid_search.cv_results_["params"]]

for val_score, train_score, param in sorted(zip(val_scores, train_scores, params), reverse=True):
    print(val_score, train_score, param)

0.5077910772485159 0.6144937097675991 {'max_depth': 15, 'min_samples_split': 5, 'n_estimators': 10, 'random_state': 7}
0.500160391905794 0.5800966185190405 {'max_depth': 15, 'min_samples_split': 10, 'n_estimators': 10, 'random_state': 7}
0.4901043794684668 0.5983222542940497 {'max_depth': 15, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 7}
0.489769462731983 0.5974611399846733 {'max_depth': 15, 'min_samples_split': 5, 'n_estimators': 200, 'random_state': 7}
0.488091595400079 0.5962736744561885 {'max_depth': 15, 'min_samples_split': 5, 'n_estimators': 1000, 'random_state': 7}
0.48802775984303126 0.565265257292125 {'max_depth': 15, 'min_samples_split': 10, 'n_estimators': 200, 'random_state': 7}
0.4879105683806045 0.5624624248068428 {'max_depth': 15, 'min_samples_split': 10, 'n_estimators': 1000, 'random_state': 7}
0.487125083987973 0.5651251866402451 {'max_depth': 15, 'min_samples_split': 10, 'n_estimators': 100, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {

The mean F-score of the Random Forest models varies between 0.47 and 0.51, and the best scores are achieved with the greatest max_depth. However, the best score is only slightly better than that of the baseline model. Therefore, more models need to be evaluated for a better understanding.

In [27]:
# put them into a separate variable for convenience
feature_importances = grid_search.best_estimator_.feature_importances_

# the order of the features in `feature_importances` is the same as in the Xtrain dataframe,
# so we can "zip" the two and print in the descending order:

for k, v in sorted(zip(feature_importances, Xtrain.columns), reverse=True):
    print(f"{v}: {k}")

Vehicle_Damage: 0.23034578200483757
Age_log: 0.15910325280867196
Annual_Premium_log: 0.1399068056819435
Previously_Insured: 0.13163255776225508
Vintage: 0.11583783732321544
Policy_Sales_Channel: 0.08475334501851194
Region_Code: 0.06984894911839537
Vehicle_Age: 0.05473726453721193
Gender: 0.012907136629646949
Driving_License: 0.0009270691153103203


Vehicle damage, age, annual premium, previously insured and vintage are the most important features for predicting whether a customer would be interested in vehicle insurance or not.

The remaining variables contribute less; gender and driving license in particular have almost no influence on the model's predictions.

Next, the model is saved to disk so that it can later be loaded directly for testing instead of being re-trained.

In [28]:
import os
from joblib import dump

#creating a folder to save all the models
if not os.path.exists('ML models'):
    os.makedirs('ML models')

dump(grid_search.best_estimator_, 'ML models/rf-clf.joblib')

['ML models/rf-clf.joblib']

The model will be loaded later on using joblib's load function.
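
As a minimal sketch of the save and load round trip, the example below fits a small forest on synthetic data (an assumption made for illustration, not the report's data), writes it to a temporary folder and checks that the restored model gives the same predictions.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic stand-in for Xtrain / ytrain (not the report's data)
X, y = make_classification(n_samples=200, n_features=10, random_state=7)
clf = RandomForestClassifier(n_estimators=10, random_state=7).fit(X, y)

# Save to a temporary folder and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf-clf.joblib")
    dump(clf, path)
    restored = load(path)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X) == clf.predict(X)).all())
print(same_predictions)
```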

# 2.3. Support Vector Machines

# 2.3.1. Linear SVMs

In [21]:
from sklearn.svm import LinearSVC

lsvm = LinearSVC()

# specify the hyperparameters
param_grid = {
    'C': [0.1, 1, 3, 5],
    'max_iter': [5000],
    'random_state': [7]
}

#5-fold cross-validation is used
grid_search = GridSearchCV(lsvm, param_grid, cv=5,
                           scoring='f1_macro', 
                           return_train_score=True) 

start = time.time()
grid_search.fit(Xtrain, ytrain)
end = time.time() - start
print(f"Took {end} seconds")



Took 498.01508927345276 seconds


In [22]:
grid_search.best_estimator_

LinearSVC(C=0.1, max_iter=5000, random_state=7)

In [23]:
grid_search.best_score_

0.4673314366705239

The F-score of the Linear SVM is practically identical to that of the baseline model, which suggests that it predicts the majority class for nearly every customer. Therefore, this model turns out to be very poor.
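
A score that matches the majority class baseline this closely usually means that the classifier assigns the same label to almost every row. One way to check this is to count the predicted labels. The sketch below uses a hypothetical prediction array, `yhat_example`, which is not an output of this report; in the notebook, the predictions of `grid_search.best_estimator_.predict(Xtrain)` could be used instead.

```python
import numpy as np

def prediction_counts(yhat):
    """Return a {label: count} dictionary for an array of predicted labels."""
    labels, counts = np.unique(yhat, return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))

# Hypothetical predictions: every customer labelled "Not interested" (0)
yhat_example = np.zeros(10000, dtype=int)

counts = prediction_counts(yhat_example)
collapsed = len(counts) == 1  # True when only one class is ever predicted
print(counts, collapsed)
```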

In [24]:
val_scores = grid_search.cv_results_["mean_test_score"]
train_scores = grid_search.cv_results_["mean_train_score"]
params = [str(x) for x in grid_search.cv_results_["params"]]

for val_score, train_score, param in sorted(zip(val_scores, train_scores, params), reverse=True):
    print(val_score, train_score, param)

0.4673314366705239 0.46733143701057855 {'C': 3, 'max_iter': 5000, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'C': 1, 'max_iter': 5000, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'C': 0.1, 'max_iter': 5000, 'random_state': 7}
0.46732576201501475 0.4673662960221964 {'C': 5, 'max_iter': 5000, 'random_state': 7}


From the above results, it can be seen that the F-score barely changes as the C value changes.

Nevertheless, this model is saved for future reference.

In [35]:
import os
from joblib import dump

# create a folder where all trained models will be kept
if not os.path.exists("ML models"):
    os.makedirs("ML models")
    
dump(grid_search.best_estimator_, 'ML models/svm-lnr-clf.joblib')

['ML models/svm-lnr-clf.joblib']

# 2.3.2. Radial Basis Function

In [27]:
from sklearn.svm import SVC

svm = SVC()

#put in the parameters
param_grid = {
    'C': [1, 10, 100],
    'gamma': ["scale", "auto"],
    'kernel': ["rbf"],
    'random_state': [7]
}

#5-fold cross-validation is used
grid_search = GridSearchCV(svm, param_grid, cv=5,
                           scoring='f1_macro', 
                           return_train_score=True) 

start = time.time()
grid_search.fit(Xtrain, ytrain)
end = time.time() - start
print(f"Took {end} seconds")

Took 5844.312863826752 seconds


This grid search took about 1.6 hours to run, which makes the model impractical for large datasets.
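
To put the reported timings on a common scale, the grid search times printed in this report can be converted from seconds to hours and minutes. The dictionary `search_seconds` is a name introduced here; its values are copied from the outputs above.

```python
from datetime import timedelta

# Wall-clock times of the three grid searches, as printed in this report
search_seconds = {
    "Random Forest": 899.0751824378967,
    "Linear SVM": 498.01508927345276,
    "RBF SVM": 5844.312863826752,
}

for name, seconds in search_seconds.items():
    as_clock = timedelta(seconds=round(seconds))  # h:mm:ss
    print(f"{name}: {as_clock} ({seconds / 3600:.2f} h)")

rbf_hours = search_seconds["RBF SVM"] / 3600
```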

In [28]:
grid_search.best_estimator_

SVC(C=100, gamma='auto', random_state=7)

In [29]:
grid_search.best_score_

0.4752675855876845

The F-score of this model is approximately 0.475, which is only about 0.01 higher than that of the baseline model. Therefore, this model turns out to be hardly better than the baseline model as well.

In [30]:
# obtain the f-scores of the best model in each split

svmrbf_split_test_scores = []
best_idx = grid_search.best_index_  # index of the best model (C=100, gamma='auto')
for x in range(5):
    # extract f-score of the best model from each of the 5 splits
    val = grid_search.cv_results_[f"split{x}_test_score"][best_idx]
    svmrbf_split_test_scores.append(val)

In [31]:
val_scores = grid_search.cv_results_["mean_test_score"]
train_scores = grid_search.cv_results_["mean_train_score"]
params = [str(x) for x in grid_search.cv_results_["params"]]

for val_score, train_score, param in sorted(zip(val_scores, train_scores, params), reverse=True):
    print(val_score, train_score, param)

0.4752675855876845 0.49304582093749144 {'C': 100, 'gamma': 'auto', 'kernel': 'rbf', 'random_state': 7}
0.4749530799050228 0.49179605580100966 {'C': 100, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 7}
0.4682958744065912 0.47032797495572937 {'C': 10, 'gamma': 'auto', 'kernel': 'rbf', 'random_state': 7}
0.4682958744065912 0.470081072861702 {'C': 10, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'C': 1, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'C': 1, 'gamma': 'auto', 'kernel': 'rbf', 'random_state': 7}


From the above results, it can be seen that the F-score of the model increases as the C value increases. This is similar to the random forests, where higher values of max_depth produced better results.

The polynomial SVM was left out as it took too many hours to train.

The SVM rbf model is saved.

In [36]:
import os
from joblib import dump

# create a folder where all trained models will be kept
if not os.path.exists("ML models"):
    os.makedirs("ML models")
    
dump(grid_search.best_estimator_, 'ML models/svm-rbf-clf.joblib')

['ML models/svm-rbf-clf.joblib']

For the two SVM models above, the F-scores are lower than what was observed for the Random Forest and are not significantly different from that of the baseline model. Compared to the SVM models, the Random Forest was slightly better with an F-score of 0.51. However, this is still an extremely poor score in practice; a model with such a score has very little predictive power.


# 3. Test the Models

Even though the trained models have poor F-scores, the Random Forest, which has the highest F-score among them, will be evaluated on the test dataset.

The model is loaded from the local disk:

In [29]:
from joblib import load

best_rf = load("ML models/rf-clf.joblib")

In [30]:
# drop labels for test set, but keep all others
Xtest = ftest.drop("Response", axis=1)
ytest = ftest["Response"].copy()

In [31]:
from sklearn.metrics import precision_recall_fscore_support

# rf
yhat = best_rf.predict(Xtest)

# macro-averaged precision, recall and f-score
p, r, f, s = precision_recall_fscore_support(ytest, yhat, average="macro")
print("Random Forest:")
print(f"Precision: {p}")
print(f"Recall: {r}")
print(f"F score: {f}")

Random Forest:
Precision: 0.6461335174851621
Recall: 0.5258023688231142
F score: 0.5218418455618304


Thus, the Random Forest classifier achieves a similar F-score on the test set (0.52) to the one observed during cross-validation (0.51).

# 4. Future Improvements and Business Scenario

There is considerable room for future improvement, as the model's accuracy is very poor. Different steps need to be taken to overcome this problem. One of the reasons for this poor score could be the fact that there were significantly fewer customers interested in vehicle insurance than customers who were not. This could have potentially created a bias in the model's learning. Another problem is that the predictors may not be good indicators of the target variable. To support this, the correlation matrix in the group report showed very poor correlation between the target variable and the predictors. Addressing these problems could be potential future improvements.
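
One commonly used option for the class imbalance, which was not evaluated in this report, is the `class_weight="balanced"` setting of `RandomForestClassifier`. The sketch below compares it with the default on synthetic data; the dataset, the 88/12 class split and the hyperparameters are assumptions made for illustration, so the scores say nothing about the insurance data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 12% positives); not the report's data
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.88, 0.12], random_state=7)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7)

scores = {}
for class_weight in (None, "balanced"):
    clf = RandomForestClassifier(n_estimators=50, max_depth=5,
                                 class_weight=class_weight, random_state=7)
    clf.fit(X_tr, y_tr)
    # Macro-averaged F-score on the held-out part, as used in this report
    scores[str(class_weight)] = f1_score(y_va, clf.predict(X_va), average="macro")

print(scores)
```

Whether this helps on the real data would have to be checked with the same cross-validation setup as above.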

Currently, this model cannot be used in real-world scenarios due to its low accuracy, but similar models with high accuracy can be used in various business scenarios. For example, banks can use this type of model to predict who would be interested in a certain type of credit or debit card.