# Insurance Pricing with XGBoost
**Submitted as part of the Corpus of CO880: Project and Dissertation**

*gr305 - Gianni Riccardi*

This notebook is part of a series of interactive Python notebooks that test different approaches to insurance price prediction with machine learning techniques.

In this notebook, we attempt to use XGBoost to predict the number of claims made by a policyholder. With this information and the readily available exposure of the policyholder, we can predict the claim frequency (claims per unit of exposure) for any given policyholder.
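As a minimal sketch of the relationship between claim count, exposure and frequency, the snippet below uses a small hypothetical portfolio (the values are illustrative and not taken from the dataset). The column names mirror those used later in this notebook.

```python
import pandas as pd

# Hypothetical toy portfolio; values are illustrative, not from the dataset
policies = pd.DataFrame({
    "ClaimNb": [0, 1, 2],          # number of claims during the exposure period
    "Exposure": [1.0, 0.5, 0.25],  # exposure in years
})

# Frequency = claims per unit of exposure
policies["Frequency"] = policies["ClaimNb"] / policies["Exposure"]
print(policies)
```

A policy with a short exposure and a single claim ends up with a large frequency, which is why the frequency column in the real data contains values well above 1.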

In [1]:
# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_poisson_deviance

## The data

We will use the [French Motor Third-Party Liability Claims](https://www.openml.org/d/41214) dataset. This dataset has been studied in detail by A. Noll, R. Salzmann and M.V. Wüthrich in their 2018 paper [Case Study: French Motor Third-Party Liability Claims](doi:10.2139/ssrn.3164764).

Many assumptions and decisions made during the tests below were driven by this study.

In [2]:
from sklearn.datasets import fetch_openml


raw_df = fetch_openml(data_id=41214, as_frame=True).frame

df = raw_df.copy().drop(columns=['IDpol'])
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
df.head()

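| | ClaimNb | Exposure | Area | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas | Density | Region | Frequency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.1 | D | 5.0 | 0.0 | 55.0 | 50.0 | B12 | Regular | 1217.0 | R82 | 10.0 |
| 1 | 1.0 | 0.77 | D | 5.0 | 0.0 | 55.0 | 50.0 | B12 | Regular | 1217.0 | R82 | 1.298701 |
| 2 | 1.0 | 0.75 | B | 6.0 | 2.0 | 52.0 | 50.0 | B12 | Diesel | 54.0 | R22 | 1.333333 |
| 3 | 1.0 | 0.09 | B | 7.0 | 0.0 | 46.0 | 50.0 | B12 | Diesel | 76.0 | R72 | 11.111111 |
| 4 | 1.0 | 0.84 | B | 7.0 | 0.0 | 46.0 | 50.0 | B12 | Diesel | 76.0 | R72 | 1.190476 |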


### Cleaning up the data

We need to clean the data before feeding it to the boosted tree model. In the following steps, we use scikit-learn's `LabelEncoder` to encode the categorical columns "Area", "VehBrand", "VehGas" and "Region" as integers that the trees can work with.
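To illustrate what the label encoder does, here is a small sketch on a hypothetical toy column (the rows are made up for illustration). `LabelEncoder` assigns integer codes in sorted order of the distinct values, so the resulting ordering is arbitrary with respect to the meaning of the categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, mirroring the "VehGas" feature
toy = pd.DataFrame({"VehGas": ["Regular", "Diesel", "Diesel", "Regular"]})

encoder = LabelEncoder()
toy["VehGas_encoded"] = encoder.fit_transform(toy["VehGas"].astype(str))

# Show which integer each category was mapped to
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy)
```

Because a fresh fit overwrites the previous mapping, reusing one encoder object for several columns (as done below) only works if the mappings do not need to be inverted later.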

In [3]:
from sklearn import preprocessing

lbl = preprocessing.LabelEncoder()

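# Split data into features X and target Y.
# Each distinct frequency value is label-encoded as its own class
# (ordered by its string representation), since XGBClassifier expects class labels.
X = df.drop(columns=['ClaimNb', 'Exposure', 'Frequency'])
Y = lbl.fit_transform(df['Frequency'].astype(str))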

X['Area'] = lbl.fit_transform(X['Area'].astype(str))
X['VehBrand'] = lbl.fit_transform(X['VehBrand'].astype(str))
X['VehGas'] = lbl.fit_transform(X['VehGas'].astype(str))
X['Region'] = lbl.fit_transform(X['Region'].astype(str))

seed = 10
test_size = 0.30
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
print('Training labels shape:', y_train.shape)
print('Training features shape:', X_train.shape)
print('Validation labels shape:', y_test.shape)
print('Validation features shape:', X_test.shape)

Training labels shape: (474609,)
Training features shape: (474609, 9)
Validation labels shape: (203404,)
Validation features shape: (203404, 9)


We can now create our XGB Classifier model to predict frequency of claims.

In [4]:
model = XGBClassifier(
    max_depth=6,
    min_child_weight=1,
    gamma=0,
    seed=0, 
    eval_metric='mlogloss')

In [5]:
# fit model to training data
model.fit(X_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='mlogloss',
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=12,
              num_parallel_tree=1, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=None, seed=0,
              subsample=1, tree_method='exact', validate_parameters=1,
              verbosity=None)

In [6]:
# make predictions for test data
y_pred = model.predict(X_test)

In [7]:
# evaluate predictions
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
# mpd = mean_poisson_deviance(y_test, y_pred)  # disabled; requires strictly positive predictions

print("MAE:", mae, "MSE:", mse, "accuracy:", (y_test == y_pred).sum()/float(y_test.size))

MAE: 4.092112249513284 MSE: 520.3244577294448 accuracy: 0.9495585140901851


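The model reaches a very high accuracy, but it suffers from the class imbalance problem (discussed in more detail in the notebook "French Motor Claims - DNN - SMOTE"): most policies have no claims, so a model that mostly predicts the majority class already scores well on accuracy. Note also that the MAE and MSE above are computed on label-encoded class indices rather than on frequencies, so they are not directly interpretable as frequency errors.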
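The effect of class imbalance on accuracy can be sketched with a hypothetical toy label vector (the class proportions below are assumptions chosen for illustration, not the dataset's actual distribution). A baseline that always predicts the majority class gets a high accuracy while never detecting a single claim, which per-class recall makes visible.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 policies with 0 claims, 4 with 1, 1 with 2
y_true = np.array([0] * 95 + [1] * 4 + [2] * 1)

# Baseline that always predicts the majority class (0 claims)
y_majority = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_majority)
# Recall per class: fraction of each true class that was predicted correctly
per_class_recall = recall_score(y_true, y_majority, average=None)

print("accuracy:", accuracy)
print("recall per class:", per_class_recall)
```

Applying the same per-class view to `y_test` and `y_pred` from the cells above would show how much of the reported accuracy comes from the zero-claim class alone.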

## Comparison with the DNN

The test below compares XGBoost with the claim-number prediction neural network models created in the notebook "French Motor Claims - DNN - SMOTE".

In [8]:
cleaned_df = raw_df.copy().drop(columns=['IDpol'])

le = preprocessing.LabelEncoder()

cleaned_df['Area'] = le.fit_transform(cleaned_df['Area'].astype(str))
cleaned_df['VehBrand'] = le.fit_transform(cleaned_df['VehBrand'].astype(str))
cleaned_df['VehGas'] = le.fit_transform(cleaned_df['VehGas'].astype(str))
cleaned_df['Region'] = le.fit_transform(cleaned_df['Region'].astype(str))

# Cap claim counts at 5 so that all rare, higher counts share a single class
cleaned_df.loc[cleaned_df.ClaimNb >= 5, 'ClaimNb'] = 5

# split data into X and y
X = cleaned_df.drop(columns=['ClaimNb'])
Y = cleaned_df['ClaimNb']

In [9]:
# Train/test split
seed = 10
test_size = 0.30
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

X_train_exposure = X_train.pop('Exposure')
X_test_exposure = X_test.pop('Exposure')

print('Training labels shape:', y_train.shape)
print('Training features shape:', X_train.shape)
print('Validation labels shape:', y_test.shape)
print('Validation features shape:', X_test.shape)

Training labels shape: (474609,)
Training features shape: (474609, 9)
Validation labels shape: (203404,)
Validation features shape: (203404, 9)


In [10]:
model = XGBClassifier(
    max_depth=6,
    min_child_weight=1,
    gamma=0,
    seed=0,
    eval_metric='auc')

In [11]:
# fit model to training data
model.fit(X_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='auc',
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=12,
              num_parallel_tree=1, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=None, seed=0,
              subsample=1, tree_method='exact', validate_parameters=1,
              verbosity=None)

In [12]:
# make predictions for test data
y_pred = model.predict(X_test)

In [13]:
# evaluate predictions
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
# mpd = mean_poisson_deviance(y_test, y_pred)  # disabled; requires strictly positive predictions

print("MAE:", mae, "MSE:", mse, "accuracy:", (y_test == y_pred).sum()/float(y_test.size))

MAE: 0.05341094570411595 MSE: 0.05983166506066744 accuracy: 0.9495093508485576
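The commented-out `mean_poisson_deviance` call cannot be applied to hard class predictions, because scikit-learn requires strictly positive predicted values and the predicted class is often 0. One possible workaround, sketched here under assumptions and not part of the original experiment, is to turn class probabilities into an expected claim count. The probabilities below are hypothetical; in this notebook they could come from `model.predict_proba(X_test)`.

```python
import numpy as np
from sklearn.metrics import mean_poisson_deviance

# Hypothetical class probabilities for claim counts 0, 1, 2 (each row sums to 1);
# in the notebook these could come from model.predict_proba(X_test)
proba = np.array([
    [0.90, 0.08, 0.02],
    [0.70, 0.25, 0.05],
    [0.95, 0.04, 0.01],
])

# Expected claim count per policy: sum over k of k * P(ClaimNb = k)
claim_counts = np.arange(proba.shape[1])
expected_claims = proba @ claim_counts

# Hypothetical observed claim counts for the same three policies
y_true = np.array([0, 1, 0])

# The expected counts are strictly positive, as the metric requires
deviance = mean_poisson_deviance(y_true, expected_claims)
print("expected claims:", expected_claims)
print("mean Poisson deviance:", deviance)
```

This assumes the class indices coincide with the claim counts, which holds when the classes are the consecutive integers starting at 0.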


While this model performs well on these metrics, it still underperforms the `sampled_model` DNN.