# Insurance prediction : Classification

The goal of this project is to predict the binary target `TARGET_FLAG`, which takes the value 0 or 1.

## Description of my approach

### Load data and useful libraries

Here I load all the libraries I need, along with the training and testing sets.

### First dummy classifier : 

In the first part I will build a dummy classifier to serve as a reference that any real model must at least match. This way I will know how good my model is and how far I have improved on a very basic classifier. The dummy classifier only predicts the most common class between 0 and 1. It is also a way for me to get a first contact with the data.

### Data exploration : 

Then I will explore the distribution of my features and my target. This tells me what kind of model could be applied, how the features relate to each other, and whether the classes are imbalanced.

### Data treatment : 

I will then be able to: 
- fill in the missing values, if there are any
- properly encode the non-numeric data so that the model can take them into account
- rescale the data if needed, depending on the values and on the model that I apply


### Machine Learning Model :

Then I will choose a relevant algorithm and fine-tune its hyperparameters using cross-validation.


### Deep Learning Model :

Finally I will discuss the different deep learning solutions and the state of the art. If I still have time, I will propose a way to implement one of them.

## Load data and useful libraries

I need : 

- pandas, NumPy and scikit-learn
- the training and testing data sets

In [122]:
#Load the libraries I will need

#Basic libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV #to split the data and to tune hyperparameters
import csv

#in order to make the pipelines
from sklearn.pipeline import make_pipeline

#for preprocessing the data
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer #applies a different transformer to each type of column
from sklearn.compose import make_column_selector #to select our columns by their types, names etc

#This is our Dummy Classifier
from sklearn.dummy import DummyClassifier

#model predictions 
import xgboost as xgb

#to measure the performance of our model
from sklearn.metrics import accuracy_score

Now it is time to load our training and testing set.

In [165]:
test_path = 'auto-insurance-fall-2017/test_auto.csv' #change it if differs on your computer
train_path = 'auto-insurance-fall-2017/train_auto.csv' #change it if differs on your computer
target_amt_path = 'auto-insurance-fall-2017/SHELL_AUTO.csv' #change it if differs on your computer

#First treatment: strip '$' and ',' from the money columns so they can be parsed as numbers
def strip_currency(s):
    return s.replace('$', '').replace(',', '')

false_categorical_features = {name: strip_currency for name in ('INCOME', 'HOME_VAL', 'BLUEBOOK', 'OLDCLAIM')}


validation_set = pd.read_csv(test_path, converters=false_categorical_features) # unlabeled set on which we make the final predictions
training_set = pd.read_csv(train_path, converters=false_categorical_features) # labeled set used to train our model
TARGET_AMT = pd.read_csv(target_amt_path) 
validation_set['TARGET_AMT'] = TARGET_AMT['p_target'] # Complete the validation set

#Now I convert the columns that contained dollar amounts to numbers (empty strings become NaN)
for name in false_categorical_features:
    validation_set[name] = pd.to_numeric(validation_set[name], errors='coerce', downcast='integer')
    training_set[name] = pd.to_numeric(training_set[name], errors='coerce', downcast='integer')

final_target = validation_set['TARGET_FLAG'] # The target we want to predict (a pandas Series, empty in the validation set)
validation_features = validation_set.drop(columns='TARGET_FLAG') # the data set containing our features

training_target = training_set['TARGET_FLAG'] # The target we want to predict (a pandas Series)
training_features = training_set.drop(columns='TARGET_FLAG') # the data set containing our features

# It is time to split our data in two parts: the set to train on and the set to test our model
X_train, X_test, y_train, y_test = train_test_split(training_features, training_target, test_size=0.2) # 80 % for the training set and 20 % for the testing set


In [166]:
#This is what our training set looks like

training_set.head() # printing the head of the set

Unnamed: 0,INDEX,TARGET_FLAG,TARGET_AMT,KIDSDRIV,AGE,HOMEKIDS,YOJ,INCOME,PARENT1,HOME_VAL,...,BLUEBOOK,TIF,CAR_TYPE,RED_CAR,OLDCLAIM,CLM_FREQ,REVOKED,MVR_PTS,CAR_AGE,URBANICITY
0,1,0,0.0,0,60.0,0,11.0,67349.0,No,0.0,...,14230,11,Minivan,yes,4461,2,No,3,18.0,Highly Urban/ Urban
1,2,0,0.0,0,43.0,0,11.0,91449.0,No,257252.0,...,14940,1,Minivan,yes,0,0,No,0,1.0,Highly Urban/ Urban
2,4,0,0.0,0,35.0,1,10.0,16039.0,No,124191.0,...,4010,4,z_SUV,no,38690,2,No,3,10.0,Highly Urban/ Urban
3,5,0,0.0,0,51.0,0,14.0,,No,306251.0,...,15440,7,Minivan,yes,0,0,No,0,6.0,Highly Urban/ Urban
4,6,0,0.0,0,50.0,0,,114986.0,No,243925.0,...,18000,1,z_SUV,no,19217,2,Yes,3,17.0,Highly Urban/ Urban


In [167]:
#This is what our validation set looks like

validation_set.head() # printing the head of the set

Unnamed: 0,INDEX,TARGET_FLAG,TARGET_AMT,KIDSDRIV,AGE,HOMEKIDS,YOJ,INCOME,PARENT1,HOME_VAL,...,BLUEBOOK,TIF,CAR_TYPE,RED_CAR,OLDCLAIM,CLM_FREQ,REVOKED,MVR_PTS,CAR_AGE,URBANICITY
0,3,,161.1019,0,48.0,0,11.0,52881.0,No,0.0,...,21970,1,Van,yes,0,0,No,2,10.0,Highly Urban/ Urban
1,9,,253.867641,1,40.0,1,11.0,50815.0,Yes,0.0,...,18930,6,Minivan,no,3295,1,No,2,1.0,Highly Urban/ Urban
2,10,,145.172185,0,44.0,2,12.0,43486.0,Yes,0.0,...,5900,10,z_SUV,no,0,0,No,0,10.0,z_Highly Rural/ Rural
3,18,,148.92545,0,35.0,2,,21204.0,Yes,0.0,...,9230,6,Pickup,no,0,0,Yes,0,4.0,z_Highly Rural/ Rural
4,21,,263.740847,0,59.0,0,12.0,87460.0,No,0.0,...,15420,1,Minivan,yes,44857,2,No,4,1.0,Highly Urban/ Urban


In [168]:
validation_set.describe()

Unnamed: 0,INDEX,TARGET_FLAG,TARGET_AMT,KIDSDRIV,AGE,HOMEKIDS,YOJ,INCOME,HOME_VAL,TRAVTIME,BLUEBOOK,TIF,OLDCLAIM,CLM_FREQ,MVR_PTS,CAR_AGE
count,2141.0,0.0,2141.0,2141.0,2140.0,2141.0,2047.0,2016.0,2030.0,2141.0,2141.0,2141.0,2141.0,2141.0,2141.0,2012.0
mean,5150.098552,,270.318591,0.162541,45.016822,0.717422,10.379091,60324.265377,153217.671429,33.152265,15469.425502,5.244745,4022.167679,0.808968,1.765997,8.1834
std,2956.329272,,214.62908,0.486949,8.525006,1.116579,4.170008,47003.422189,129456.870285,15.722393,8462.367121,3.971026,8565.379145,1.137481,2.203413,5.766263
min,3.0,,3.165496,0.0,17.0,0.0,0.0,0.0,0.0,5.0,1500.0,1.0,0.0,0.0,0.0,0.0
25%,2632.0,,81.81315,0.0,39.0,0.0,9.0,25817.75,0.0,22.0,8870.0,1.0,0.0,0.0,0.0,1.0
50%,5224.0,,225.875822,0.0,45.0,0.0,11.0,51778.0,158840.0,33.0,14170.0,4.0,0.0,0.0,1.0,8.0
75%,7669.0,,405.88416,0.0,51.0,1.0,13.0,86278.25,236651.5,43.0,21050.0,7.0,4718.0,2.0,3.0,12.0
max,10300.0,,960.498458,3.0,73.0,5.0,19.0,291182.0,669271.0,105.0,49940.0,25.0,54399.0,5.0,12.0,26.0


In [126]:
#This is what our training target looks like

training_target.head() # printing the head of the set

0    0
1    0
2    0
3    0
4    0
Name: TARGET_FLAG, dtype: int64

## First dummy classifier : 

Now it is time to build the dummy classifier.

In [127]:
dummy = DummyClassifier(strategy="most_frequent") # We always predict the most frequent class of the training set

dummy.fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

print(dummy_pred) #a vector of 0

print(accuracy_score(y_test, dummy_pred)) # the share of correctly classified targets; about 75 % of y_test is 0

[0 0 0 ... 0 0 0]
0.747091243110839


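## Data exploration : 

The exploration consists of : 

- Printing the missing values 
- Having a quick look at the features


The `count` row below shows that several numerical columns have missing values (counts below 8161: AGE, YOJ, INCOME, HOME_VAL and CAR_AGE), which the transformation has to take into account : 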

In [128]:
training_features.describe()

Unnamed: 0,INDEX,TARGET_AMT,KIDSDRIV,AGE,HOMEKIDS,YOJ,INCOME,HOME_VAL,TRAVTIME,BLUEBOOK,TIF,OLDCLAIM,CLM_FREQ,MVR_PTS,CAR_AGE
count,8161.0,8161.0,8161.0,8155.0,8161.0,7707.0,7716.0,7697.0,8161.0,8161.0,8161.0,8161.0,8161.0,8161.0,7651.0
mean,5151.867663,1504.324648,0.171057,44.790313,0.721235,10.499286,61898.094609,154867.289723,33.485725,15709.899522,5.351305,4037.076216,0.798554,1.695503,8.328323
std,2978.893962,4704.02693,0.511534,8.627589,1.116323,4.092474,47572.682808,129123.774574,15.908333,8419.734075,4.146635,8777.139104,1.158453,2.147112,5.700742
min,1.0,0.0,0.0,16.0,0.0,0.0,0.0,0.0,5.0,1500.0,1.0,0.0,0.0,0.0,-3.0
25%,2559.0,0.0,0.0,39.0,0.0,9.0,28097.0,0.0,22.0,9280.0,1.0,0.0,0.0,0.0,1.0
50%,5133.0,0.0,0.0,45.0,0.0,11.0,54028.0,161160.0,33.0,14440.0,4.0,0.0,0.0,1.0,8.0
75%,7745.0,1036.0,0.0,51.0,1.0,13.0,85986.0,238724.0,44.0,20850.0,7.0,4636.0,2.0,3.0,12.0
max,10302.0,107586.13616,4.0,81.0,5.0,23.0,367030.0,885282.0,142.0,69740.0,25.0,57037.0,5.0,13.0,28.0


We can also observe that the data are not scaled at all, so rescaling the numerical features could help. Some features have large outliers, such as "TRAVTIME" (maximum of 142 against a median of 33), and "CAR_AGE" has an invalid negative minimum (-3).
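The two exploration steps listed above (missing values and a quick look at the target balance) can be sketched with pandas as follows. This is a minimal illustration on a made-up toy frame, not the notebook's actual output; with the real data, `toy` would be replaced by `training_set`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for training_set (values are made up for illustration)
toy = pd.DataFrame({
    "TARGET_FLAG": [0, 0, 1, 0, 1, 0, 0, 0],
    "AGE": [60, 43, np.nan, 51, 50, 34, np.nan, 45],
    "YOJ": [11, 11, 10, 14, np.nan, 12, 9, 13],
    "CAR_TYPE": ["Minivan", "Minivan", "z_SUV", np.nan, "z_SUV", "Van", "Pickup", "Van"],
})

# Number of missing values per column, largest first
missing_counts = toy.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Share of each class in the target; a skewed split signals class imbalance
class_share = toy["TARGET_FLAG"].value_counts(normalize=True)
print(class_share)
```

Unlike `describe()`, `isna().sum()` also covers the categorical columns, so gaps in features such as `CAR_TYPE` are not overlooked.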

## Data treatment : 

I will then be able to: 
- fill in the missing values, if there are any
- properly encode the non-numeric data so that the model can take them into account
- rescale the data if needed, depending on the values and on the model that I apply

In [146]:
# First we separate the features by type, since each type gets its own treatment
numerical_features = make_column_selector(dtype_include = np.number)
categorical_features = make_column_selector(dtype_exclude = np.number)


#Then we make our pipelines with our transformations
numerical_pipeline = make_pipeline(SimpleImputer(strategy='median'), RobustScaler()) 
categorical_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder(handle_unknown='ignore')) # ignore categories unseen during fit instead of raising an error
preprocessor = make_column_transformer((numerical_pipeline, numerical_features), (categorical_pipeline, categorical_features))
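To make the effect of this preprocessor concrete, here is a small sketch that applies the same kind of column transformer to a made-up two-column frame. The toy data, the `to_dense` helper and the `handle_unknown="ignore"` option are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data (made up): one numerical and one categorical column, each with a gap
toy = pd.DataFrame({
    "AGE": [60.0, 43.0, np.nan, 51.0],
    "CAR_TYPE": ["Minivan", "z_SUV", np.nan, "Minivan"],
})

numerical_pipeline = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())
categorical_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),  # assumption: unseen categories become all zeros
)
preprocessor = make_column_transformer(
    (numerical_pipeline, make_column_selector(dtype_include=np.number)),
    (categorical_pipeline, make_column_selector(dtype_exclude=np.number)),
)

def to_dense(matrix):
    """Return a dense array whether the transformer produced sparse or dense output."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

# Fit on the toy frame: the missing AGE gets the median, the missing CAR_TYPE the mode
transformed = to_dense(preprocessor.fit_transform(toy))
print(transformed)

# A category never seen during fit is encoded as all zeros instead of raising
new_row = pd.DataFrame({"AGE": [30.0], "CAR_TYPE": ["Pickup"]})
unseen = to_dense(preprocessor.transform(new_row))
print(unseen)
```

The third row, where both values were missing, ends up as the scaled median for `AGE` (that is, 0) and as `Minivan` for `CAR_TYPE`, which shows what the two imputers do.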

## Machine Learning Model :
I will implement an `XGBClassifier` and fine-tune its hyperparameters using cross-validation.

In [169]:
clf = xgb.XGBClassifier() # Defining my classifier

#Defining the params to be tuned
params = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5]
        }

#The grid search 
gd_sr = GridSearchCV(estimator=clf,
                     param_grid=params,
                     scoring='accuracy',
                     cv=5,
                     n_jobs=-1,
                     verbose=10)
#The final pipeline
model = make_pipeline(preprocessor, gd_sr)

model.fit(X_train,y_train)
test_pred = model.predict(X_test)

Fitting 5 folds for each of 405 candidates, totalling 2025 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   13.7s
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:   15.3s
[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:   15.7s
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:   17.1s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:   18.6s
[Parallel(n_jobs=-1)]: Done  66 tasks      | elapsed:   19.9s
[Parallel(n_jobs=-1)]: Done  81 tasks      | elapsed:   21.2s
[Parallel(n_jobs=-1)]: Done  96 tasks      | elapsed:   22.3s
[Parallel(n_jobs=-1)]: Done 113 tasks      | elapsed:   24.5s
[Parallel(n_jobs=-1)]: Done 130 tasks      | elapsed:   26.1s
[Parallel(n_jobs=-1)]: Done 149 tasks      | elapsed:   27.7s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:   29.2s
[Parallel(n_jobs=-1)]: Done 189 tasks      | elapsed:   31.3s
[Parallel(n_jobs=-1)]: Done 210 tasks      | elapsed:

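In the end, the model reaches an accuracy of 1.0 on the held-out test split. A perfect score is suspicious: `training_features` still contains `TARGET_AMT`, the claim amount, which appears to be nonzero only when `TARGET_FLAG` is 1. The model is therefore most likely reading the answer from that column (target leakage) rather than learning from the other features.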

In [170]:
# Accuracy on the held-out split; 1.0 means every row is classified correctly, which usually signals leakage (TARGET_AMT is still among the features)

print(accuracy_score(y_test, test_pred))

1.0
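
A perfect hold-out score is worth double-checking for target leakage. The sketch below reproduces the effect on synthetic data: a column that is positive exactly when the label is 1 (like a claim amount) lets even a shallow tree score 1.0, while the same model without that column does much worse. The data, the `holdout_accuracy` helper and the use of `DecisionTreeClassifier` are assumptions for illustration and are not the notebook's model.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000

# A feature that carries only a weak signal about the label
weak_feature = rng.normal(size=n)
flag = (weak_feature + rng.normal(scale=2.0, size=n) > 1.0).astype(int)

# Leaky column: strictly positive exactly when flag == 1, zero otherwise
amount = np.where(flag == 1, rng.uniform(100, 1000, size=n), 0.0)

X_leaky = np.column_stack([weak_feature, amount])
X_clean = weak_feature.reshape(-1, 1)

def holdout_accuracy(X, y):
    """Train a shallow tree on 80 % of the rows and score it on the other 20 %."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

acc_leaky = holdout_accuracy(X_leaky, flag)
acc_clean = holdout_accuracy(X_clean, flag)
print(acc_leaky, acc_clean)
```

In the notebook, the equivalent check would be to drop `TARGET_AMT` (and probably the row identifier `INDEX`) from the features before fitting and to compare the new accuracy with the dummy baseline.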


In [171]:
prediction = model.predict(validation_features)
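
A possible variant of the tuning setup is to wrap the whole pipeline in the grid search, rather than placing the grid search inside the pipeline. The imputers and the scaler are then refit on each training fold, so no statistics from the validation fold leak into preprocessing. The sketch below uses made-up data and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier`; the grid is kept deliberately tiny.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Made-up data: one numerical column with gaps and one categorical column
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "AGE": rng.integers(18, 80, size=n).astype(float),
    "CAR_TYPE": rng.choice(["Minivan", "Van", "z_SUV"], size=n),
})
X.loc[::10, "AGE"] = np.nan
y = (X["AGE"].fillna(45) + rng.normal(scale=10, size=n) > 45).astype(int)

preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="median"), RobustScaler()),
     make_column_selector(dtype_include=np.number)),
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")),
     make_column_selector(dtype_exclude=np.number)),
)

# Stand-in classifier; make_pipeline names the step after the lowercased class name
pipeline = make_pipeline(
    preprocessor,
    GradientBoostingClassifier(n_estimators=20, random_state=0),
)

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {
    "gradientboostingclassifier__max_depth": [2, 3],
    "gradientboostingclassifier__subsample": [0.8, 1.0],
}

search = GridSearchCV(pipeline, param_grid=param_grid, scoring="accuracy", cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

With this layout, `search.best_estimator_` is a complete fitted pipeline that can be applied directly to new raw data.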

## Deep Learning Model : 

I did not have time to implement a deep learning model (I had to do this in 2 hours). Still, we could implement a multilayer perceptron with PyTorch or Keras.
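
As a rough idea of what such a model could look like, here is a minimal multilayer perceptron sketch. It uses scikit-learn's `MLPClassifier` as a lightweight stand-in for a PyTorch or Keras network, and synthetic data from `make_classification`. The layer sizes and other settings are arbitrary assumptions, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the preprocessed features
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Neural networks are sensitive to feature scale, so we standardize first
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
mlp.fit(X_tr, y_tr)

mlp_pred = mlp.predict(X_te)
mlp_accuracy = mlp.score(X_te, y_te)
print(mlp_accuracy)
```

On the insurance data, the `StandardScaler` step would be replaced by the column transformer defined earlier, since the raw features include categorical columns and missing values.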