<h3> Introduction

The goal of this notebook is to predict whether policyholders filed a claim on their auto insurance.
The target is the variable "TARGET_FLAG" which is equal to 1 if there was an insurance claim and 0 otherwise.  
The dataset contains the following information:
- General information about the insured: sex, age, income, number of kids...  
- General information about the car: the type of car, the color of the car, the age of the car...
- Specific information about the insured: the value of previous claims, the frequency of claims...  
  
The notebook outputs the file data/test_prediction.csv, which contains the id of each insured and the predicted value for TARGET_FLAG.

In [3]:
!pip install lightgbm



In [4]:
import re
import numpy as np
import pandas as pd

In [5]:
pd.set_option("display.max_columns", 30)

<h3> 1) Data exploration

<h4> 1.1) Data import

In [6]:
root = "~/Descartes/data-scientist-auto-test-main/" #to be changed based on the root of your folder
df_train = pd.read_csv(root + "data/auto-insurance-fall-2017/train_auto.csv")
df_test = pd.read_csv(root + "data/auto-insurance-fall-2017/test_auto.csv")

<h4> 1.2) Missing values

In [7]:
count_nan = df_train.isna().sum() # number of missing values by column
count_nan = count_nan[count_nan >= 1]
print("Number of missing values by feature in train:")
print(count_nan)
na_rows = df_train.shape[0] - df_train.dropna().shape[0] # number of rows with missing values in at least one column
print(na_rows)
print("Proportion of rows with missing values: {:.2f}".format(na_rows / df_train.shape[0]))

Number of missing values by feature in train:
AGE           6
YOJ         454
INCOME      445
HOME_VAL    464
JOB         526
CAR_AGE     510
dtype: int64
2116
Proportion of rows with missing values: 0.26


Since the proportion of rows with missing values is fairly high (~26%) and the dataset is small (~8k rows), we keep the rows with missing values and handle them with a simple imputer in section 2.1.
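The row-level count above can also be obtained without building a second frame with `dropna()`. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the insurance data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the insurance dataset.
toy = pd.DataFrame({
    "AGE": [34, np.nan, 51, 45],
    "INCOME": [52000, 61000, np.nan, 48000],
    "CAR_AGE": [3, 7, 1, 12],
})

# True for every row that has a missing value in at least one column.
row_has_nan = toy.isna().any(axis=1)

n_rows_with_nan = int(row_has_nan.sum())   # number of incomplete rows
share_rows_with_nan = row_has_nan.mean()   # proportion of incomplete rows
print(n_rows_with_nan, share_rows_with_nan)
```

Both approaches should agree; the boolean mask is simply reusable afterwards, for example to inspect the incomplete rows.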

<h4> 1.3) Categorical variables

In [8]:
print("Columns stored as strings with their number of unique values:")
print(df_train.select_dtypes(include=['object']).apply(lambda column: len(pd.unique(column))))

Columns stored as strings with their number of unique values:
INCOME        6613
PARENT1          2
HOME_VAL      5107
MSTATUS          2
SEX              2
EDUCATION        5
JOB              9
CAR_USE          2
BLUEBOOK      2789
CAR_TYPE         6
RED_CAR          2
OLDCLAIM      2857
REVOKED          2
URBANICITY       2
dtype: int64


INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM should be numerical, so they will be converted to the appropriate data type.  
PARENT1, MSTATUS, SEX, CAR_USE, RED_CAR, REVOKED and URBANICITY are binary variables, so they can easily be converted to 0/1 columns.  
EDUCATION, JOB and CAR_TYPE are categorical variables with more than two values; they will be handled in section 2 with a OneHotEncoder.
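As an illustration of the currency conversion, here is a vectorized sketch using pandas string methods. The sample values are hypothetical and only assume a "$12,345" style format; this is an alternative to a per-value function, not necessarily the approach used in this notebook.

```python
import pandas as pd

# Hypothetical values in a "$12,345" style, with one missing entry.
raw = pd.Series(["$67,349", "$0", None, "$1,500"])

# Strip every non-digit character; missing entries stay missing.
digits = raw.str.replace(r"\D", "", regex=True)

# Convert to a nullable integer dtype so that missing values are kept as <NA>.
amounts = pd.to_numeric(digits).astype("Int64")
print(amounts.tolist())
```

Note that stripping every non-digit character would also drop a minus sign, so this sketch is only suitable for non-negative amounts.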

<h4> 1.4) Basic preprocessing

In [9]:
binary_columns = ["PARENT1", "MSTATUS", "CAR_USE", "RED_CAR", "REVOKED", "URBANICITY"]

def currency_to_int(value):
    """Transform string containing a currency and commas into int type"""
    if pd.notnull(value):
        return int(''.join([char for char in value if char.isnumeric()]))

def basic_preprocessing(df):
    """
    Transform binary columns into int and columns with currency into int
    Drop Sex column to avoid a sexist bias
    """
    res = df.copy()
    res = res.drop("SEX", axis=1)
    res["INCOME"] = res["INCOME"].apply(currency_to_int)
    res["HOME_VAL"] = res["HOME_VAL"].apply(currency_to_int)
    res["BLUEBOOK"] = res["BLUEBOOK"].apply(currency_to_int)
    res["OLDCLAIM"] = res["OLDCLAIM"].apply(currency_to_int)
    res["PARENT1"] = res["PARENT1"] == "Yes"
    res["MSTATUS"] = res["MSTATUS"] == "Yes"
    res["CAR_USE"] = res["CAR_USE"] == "Private"
    res["RED_CAR"] = res["RED_CAR"] == "yes"
    res["REVOKED"] = res["REVOKED"] == "Yes"
    res["URBANICITY"] = res["URBANICITY"] == "Highly Urban/ Urban"
    res[binary_columns] = res[binary_columns].astype(int)
    return res

df_train_p = basic_preprocessing(df_train)

In addition to converting the binary and currency columns to int, we remove the variable SEX so that the model cannot discriminate directly on sex. This may not be needed here, but it matters if the model is used to make real-life decisions such as pricing auto insurance.

<h4> 1.5) Data summary

In [10]:
df_train_p.describe()

statistic,INDEX,TARGET_FLAG,TARGET_AMT,KIDSDRIV,AGE,HOMEKIDS,YOJ,INCOME,PARENT1,HOME_VAL,MSTATUS,TRAVTIME,CAR_USE,BLUEBOOK,TIF,RED_CAR,OLDCLAIM,CLM_FREQ,REVOKED,MVR_PTS,CAR_AGE,URBANICITY
count,8161.0,8161.0,8161.0,8161.0,8155.0,8161.0,7707.0,7716.0,8161.0,7697.0,8161.0,8161.0,8161.0,8161.0,8161.0,8161.0,8161.0,8161.0,8161.0,8161.0,7651.0,8161.0
mean,5151.867663,0.263816,1504.324648,0.171057,44.790313,0.721235,10.499286,61898.094609,0.131969,154867.289723,0.599681,33.485725,0.628845,15709.899522,5.351305,0.291386,4037.076216,0.798554,0.122534,1.695503,8.328323,0.795491
std,2978.893962,0.440728,4704.02693,0.511534,8.627589,1.116323,4.092474,47572.682808,0.338478,129123.774574,0.489993,15.908333,0.483144,8419.734075,4.146635,0.454429,8777.139104,1.158453,0.327922,2.147112,5.700742,0.403367
min,1.0,0.0,0.0,0.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,1500.0,1.0,0.0,0.0,0.0,0.0,0.0,-3.0,0.0
25%,2559.0,0.0,0.0,0.0,39.0,0.0,9.0,28097.0,0.0,0.0,0.0,22.0,0.0,9280.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,5133.0,0.0,0.0,0.0,45.0,0.0,11.0,54028.0,0.0,161160.0,1.0,33.0,1.0,14440.0,4.0,0.0,0.0,0.0,0.0,1.0,8.0,1.0
75%,7745.0,1.0,1036.0,0.0,51.0,1.0,13.0,85986.0,0.0,238724.0,1.0,44.0,1.0,20850.0,7.0,1.0,4636.0,2.0,0.0,3.0,12.0,1.0
max,10302.0,1.0,107586.13616,4.0,81.0,5.0,23.0,367030.0,1.0,885282.0,1.0,142.0,1.0,69740.0,25.0,1.0,57037.0,5.0,1.0,13.0,28.0,1.0


In [11]:
df_train_p.corr()

feature,INDEX,TARGET_FLAG,TARGET_AMT,KIDSDRIV,AGE,HOMEKIDS,YOJ,INCOME,PARENT1,HOME_VAL,MSTATUS,TRAVTIME,CAR_USE,BLUEBOOK,TIF,RED_CAR,OLDCLAIM,CLM_FREQ,REVOKED,MVR_PTS,CAR_AGE,URBANICITY
INDEX,1.0,-0.00167,-0.000593,0.015576,0.033846,5.2e-05,0.026733,0.008821,-0.013674,0.012109,0.007954,-0.02307,-0.003683,0.013912,-0.009214,0.016855,-0.001264,0.01878,0.002896,0.007883,-0.000699,-0.000738
TARGET_FLAG,-0.00167,1.0,0.534246,0.103668,-0.103217,0.115621,-0.070512,-0.142008,0.157622,-0.183737,-0.135125,0.048368,-0.142674,-0.103383,-0.08237,-0.006947,0.138084,0.216196,0.151939,0.219197,-0.100651,0.224251
TARGET_AMT,-0.000593,0.534246,1.0,0.055394,-0.041728,0.061988,-0.022085,-0.058307,0.096965,-0.085602,-0.087661,0.027987,-0.098614,-0.0047,-0.046481,0.008092,0.070953,0.116419,0.061385,0.137866,-0.058822,0.120974
KIDSDRIV,0.015576,0.103668,0.055394,1.0,-0.075179,0.464015,0.043305,-0.047134,0.196604,-0.019792,0.042461,0.008447,-0.001422,-0.021549,-0.001989,-0.043638,0.020403,0.037063,0.043062,0.053566,-0.053993,-0.037124
AGE,0.033846,-0.103217,-0.041728,-0.075179,1.0,-0.445441,0.136072,0.18097,-0.314025,0.209984,0.090716,0.005269,0.033304,0.165025,-6.6e-05,0.020324,-0.02929,-0.024092,-0.038477,-0.071575,0.176221,0.051351
HOMEKIDS,5.2e-05,0.115621,0.061988,0.464015,-0.445441,1.0,0.086829,-0.15933,0.449274,-0.11068,0.043526,-0.007246,0.004458,-0.107894,0.011813,-0.068148,0.029911,0.029349,0.045116,0.060601,-0.152146,-0.063483
YOJ,0.026733,-0.070512,-0.022085,0.043305,0.136072,0.086829,1.0,0.286074,-0.049767,0.26992,0.145631,-0.016945,-0.022337,0.143465,0.024787,0.050633,-0.00298,-0.026308,-0.006415,-0.037855,0.061406,0.08387
INCOME,0.008821,-0.142008,-0.058307,-0.047134,0.18097,-0.15933,0.286074,1.0,-0.075257,0.575244,-0.030724,-0.047082,-0.081031,0.42928,-0.001035,0.058807,-0.045442,-0.047752,-0.020737,-0.063159,0.414238,0.206004
PARENT1,-0.013674,0.157622,0.096965,0.196604,-0.314025,0.449274,-0.049767,-0.075257,1.0,-0.261065,-0.477228,-0.023741,-0.006194,-0.050458,-0.001952,-0.042086,0.034689,0.048742,0.049719,0.068453,-0.061153,-0.02221
HOME_VAL,0.012109,-0.183737,-0.085602,-0.019792,0.209984,-0.11068,0.26992,0.575244,-0.261065,1.0,0.459408,-0.035525,-0.027353,0.259533,0.002063,0.016212,-0.069195,-0.094049,-0.050609,-0.085395,0.217468,0.11969


<h3> 2) Model Selection

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, f1_score, make_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier

<h4> 2.1) Preprocessing

In [13]:
X = df_train_p.drop(["TARGET_FLAG", "TARGET_AMT"], axis = 1)
y = df_train_p["TARGET_FLAG"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
categorical_col = ["EDUCATION", "JOB", "CAR_TYPE"]
numerical_col = ["INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM", "AGE", "TRAVTIME", "CAR_AGE", "YOJ"]

ct = ColumnTransformer(
    [("categorical", OneHotEncoder(), categorical_col),
     ("numerical", StandardScaler(), numerical_col)],
    remainder = "passthrough")

imputer = SimpleImputer(strategy='median', missing_values=np.nan)

I applied a standard scaler to the numerical variables whose mean is around 10 or more. Scaling brings the numerical variables to roughly the same magnitude, which helps scale-sensitive models such as logistic regression; tree-based models are largely unaffected.  
I applied a one-hot encoder to the categorical variables that take more than two values. It lets ML algorithms distinguish categorical values without adding too many features (about 20 more).  
I also defined an imputer for missing numeric values. Missing categorical values are handled automatically by the one-hot encoding, but missing numeric values have to be imputed explicitly. The strategy is to replace missing values with the median. I used the median instead of the mean because the mean is pulled upward by high values in fields such as income.
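To make the median-versus-mean argument concrete, here is a small sketch with hypothetical income values (the numbers are invented; only `SimpleImputer` and its `strategy` parameter come from the notebook):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical skewed incomes: three ordinary values, one very high value,
# and one missing entry to be imputed.
income = np.array([[30000.0], [35000.0], [40000.0], [500000.0], [np.nan]])

median_filled = SimpleImputer(strategy="median").fit_transform(income)
mean_filled = SimpleImputer(strategy="mean").fit_transform(income)

# The last row holds the imputed value under each strategy.
print(median_filled[-1, 0], mean_filled[-1, 0])
```

In this toy case the single high income drags the mean far above the typical values, while the median stays close to them.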

<h4> 2.2) Metrics

We want to predict the classes defined by TARGET_FLAG. Class 1 is the minority class (~26% of the entries). Using accuracy to assess our algorithms would be misleading, because we could reach a fairly high accuracy by predicting 0 all the time. The F1 score, the harmonic mean of precision and recall, measures how well we predict class 1. We will use this metric to measure the performance of our algorithms.
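The point about accuracy can be checked on a small synthetic example. The labels below are hypothetical (exactly 25% positives, chosen for round numbers) and the "model" is a constant predictor that always outputs 0:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 100 samples, 25 of them in class 1.
y_true = np.array([0, 0, 0, 1] * 25)

# A trivial predictor that never predicts class 1.
y_all_zero = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_all_zero)
# zero_division=0 makes the undefined precision count as 0 without a warning.
f1 = f1_score(y_true, y_all_zero, zero_division=0)
print(acc, f1)
```

The constant predictor looks reasonable in terms of accuracy but gets an F1 score of zero, since it never identifies a single positive case.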

<h4> 2.3) Linear model

In [15]:
logit_model = LogisticRegression(class_weight="balanced", max_iter=2000)
logit_pipe = Pipeline([("preprocessing", ct), ("imputer", imputer), ("model", logit_model)])
logit_pipe.fit(X_train, y_train)
y_pred_train = logit_pipe.predict(X_train)
print("Confusion matrix and f1 score on train data: ")
print(confusion_matrix(y_train, y_pred_train))
print(f1_score(y_train, y_pred_train))
y_pred_val = logit_pipe.predict(X_val)
print("Confusion matrix and f1 score on validation data: ")
print(confusion_matrix(y_val, y_pred_val))
print(f1_score(y_val, y_pred_val))

Confusion matrix and f1 score on train data: 
[[3424 1395]
 [ 422 1287]]
0.5861990434980642
Confusion matrix and f1 score on validation data: 
[[850 339]
 [124 320]]
0.5802357207615593


We start with logistic regression because it is one of the simplest classifiers. max_iter was raised because the default value was too low for the solver to converge. Finally, class_weight="balanced" makes the algorithm minimise a weighted loss in which the weight of each class is inversely proportional to its frequency. This is useful under class imbalance, which is the case in our dataset.
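As a sketch of what the "balanced" weights look like, the snippet below rebuilds a label vector with the class counts read off the training confusion matrix above (3424 + 1395 negatives, 422 + 1287 positives) and compares scikit-learn's helper with the formula `n_samples / (n_classes * count_per_class)`. The label vector itself is a reconstruction for illustration, not the actual `y_train`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Reconstructed labels with the same class counts as the training split.
n_negative = 3424 + 1395  # 4819
n_positive = 422 + 1287   # 1709
y = np.array([0] * n_negative + [1] * n_positive)

# Weights computed by scikit-learn for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The same weights from the formula n_samples / (n_classes * count_per_class).
manual = len(y) / (2 * np.bincount(y))
print(weights, manual)
```

Each positive example therefore counts for roughly three times as much as a negative one in the loss, which pushes the classifier towards predicting class 1 more often.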

<h4> 2.4) Random Forest Model

In [16]:
tree_model = RandomForestClassifier(class_weight="balanced", max_depth=7)
tree_pipe = Pipeline([("preprocessing", ct), ("imputer", imputer), ("model", tree_model)])
tree_pipe.fit(X_train, y_train)
y_pred_train = tree_pipe.predict(X_train)
print("Confusion matrix and f1 score on train data: ")
print(confusion_matrix(y_train, y_pred_train))
print(f1_score(y_train, y_pred_train))
y_pred_val = tree_pipe.predict(X_val)
print("Confusion matrix and f1 score on validation data: ")
print(confusion_matrix(y_val, y_pred_val))
print(f1_score(y_val, y_pred_val))

Confusion matrix and f1 score on train data: 
[[3689 1130]
 [ 317 1392]]
0.658000472701489
Confusion matrix and f1 score on validation data: 
[[883 306]
 [130 314]]
0.5902255639097744


The random forest classifier tends to perform well when the relationship between the target and the features is more complex than a linear function. In this case, it performs only marginally better than logistic regression on the validation set (F1 of 0.590 against 0.580). It might be because we do not have that many data points.  
I changed max_depth because the default parameter was causing the classifier to overfit: the F1 score was 1 on the training set and 0.4 on the validation set. I quickly searched for the max_depth that maximised the F1 score on our validation set.
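A quick search over max_depth could look like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the insurance data, and the candidate depths and `n_estimators=50` are arbitrary choices for illustration, not the values explored in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 25% positives).
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Fit one forest per candidate depth and record its validation F1 score.
scores = {}
for depth in [3, 5, 7, 9, None]:  # None means unlimited depth
    model = RandomForestClassifier(
        class_weight="balanced", max_depth=depth, n_estimators=50, random_state=0
    )
    model.fit(X_tr, y_tr)
    scores[depth] = f1_score(y_va, model.predict(X_va))

best_depth = max(scores, key=scores.get)
print(scores, best_depth)
```

Because the depth is selected on the validation set, the validation score of the selected model is slightly optimistic; cross-validation gives a less biased estimate.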

<h4> 2.5) Gradient Boosting

In [17]:
lgbm_model = LGBMClassifier(class_weight="balanced", verbose=-1)
lgbm_pipe = Pipeline([("preprocessing", ct), ("model", lgbm_model)])
lgbm_pipe.fit(X_train, y_train)
y_pred_train = lgbm_pipe.predict(X_train)
print("Confusion matrix and f1 score on train data: ")
print(confusion_matrix(y_train, y_pred_train))
print(f1_score(y_train, y_pred_train))
y_pred_val = lgbm_pipe.predict(X_val)
print("Confusion matrix and f1 score on validation data: ")
print(confusion_matrix(y_val, y_pred_val))
print(f1_score(y_val, y_pred_val))

Confusion matrix and f1 score on train data: 
[[4224  595]
 [  76 1633]]
0.8295656591313183
Confusion matrix and f1 score on validation data: 
[[949 240]
 [142 302]]
0.6125760649087221


The LGBM classifier is a gradient boosting model which tends to perform particularly well, especially on large datasets. In this case, it performs slightly better than the two previous algorithms on the validation set (F1 of 0.613), although the gap between its training and validation scores (0.830 against 0.613) indicates some overfitting. The imputer is not used for this model because LightGBM has a built-in mechanism to deal with missing values.

The best model so far is the __LGBM Classifier__, even if only by a small margin. We will therefore tune its hyperparameters and keep this model for the final prediction.
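The validation F1 scores reported above can be re-derived by hand from the printed confusion matrices, which makes the comparison between the three models easier to audit. The helper `f1_from_confusion` is a hypothetical name introduced for this sketch; it assumes scikit-learn's layout, with true classes in rows and predicted classes in columns.

```python
def f1_from_confusion(cm):
    """F1 score for class 1 from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return 2 * tp / (2 * tp + fp + fn)

# Validation confusion matrices copied from the outputs above.
validation_matrices = {
    "logistic": [[850, 339], [124, 320]],
    "random_forest": [[883, 306], [130, 314]],
    "lgbm": [[949, 240], [142, 302]],
}

f1_by_model = {name: f1_from_confusion(cm) for name, cm in validation_matrices.items()}
best_model = max(f1_by_model, key=f1_by_model.get)
print(f1_by_model, best_model)
```

Reading the matrices this way also shows where the gain comes from: the LGBM model raises fewer false positives (240) than the other two models (339 and 306), at the cost of a few more false negatives.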

<h3> 3) Model tuning and prediction

<h4> 3.1) Hyperparameter tuning

In [18]:
scorer = make_scorer(f1_score)
param_grid = {"model__num_leaves": [150, 200, 300, 500], "model__min_child_samples": [20, 80, 150, 200, 300]}
estimator = Pipeline([("preprocessing", ct), ("model", lgbm_model)])
grid_search = GridSearchCV(estimator, param_grid, scoring=scorer)
grid_search.fit(X, y)
final_model = grid_search.best_estimator_
best_score = grid_search.best_score_
best_params = grid_search.best_params_
print("Best F1 score: ", best_score)
print("Best parameters: ", best_params)

Best F1 score:  0.6065769080510934
Best parameters:  {'model__min_child_samples': 200, 'model__num_leaves': 150}


The F1 score of our final model, averaged over a 5-fold cross-validation, is __0.607__.
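To get a sense of the cost of this search, the number of candidate settings and model fits can be counted with scikit-learn's `ParameterGrid`. The sketch below copies the grid from the cell above and assumes the default of 5 folds, which applies because no `cv` argument was passed to `GridSearchCV`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above.
param_grid = {
    "model__num_leaves": [150, 200, 300, 500],
    "model__min_child_samples": [20, 80, 150, 200, 300],
}

n_candidates = len(ParameterGrid(param_grid))  # every combination of the two lists
n_folds = 5                                    # GridSearchCV default when cv is not set
n_fits = n_candidates * n_folds                # excludes the final refit on all data
print(n_candidates, n_fits)
```

The selected `num_leaves` value is the smallest one in the grid, so extending the grid towards smaller values might be worth trying.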

<h4> 3.2) Prediction

In [19]:
df_test_p = basic_preprocessing(df_test).drop(["TARGET_FLAG", "TARGET_AMT"], axis = 1)
test_predict = final_model.predict(df_test_p)
test_index = df_test_p["INDEX"].to_list()
res = pd.DataFrame({"id": test_index, "target": test_predict})
res.to_csv(root + "data/test_prediction.csv", index=False)

<h3> Conclusion

Our final model has an expected F1 score of 0.607 based on a 5-fold cross-validation. This model uses an LGBM classifier to predict whether the insured will file a claim on their auto insurance. The predictions for the insured in test_auto.csv can be found in the file data/test_prediction.csv.