# Application of XGBoost classification to Prudential data

This workbook applies XGBoost to predict the risk level of an applicant
for life insurance.

The Prudential life insurance company released this dataset as part of a competition to
predict the risk level of an applicant.

In [67]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import GridSearchCV
import time
from imblearn.over_sampling import SMOTE
import seaborn as sns
import warnings
from sklearn.preprocessing import MaxAbsScaler
warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline

SMALL_SIZE = 10
MEDIUM_SIZE = 12

plt.rc('font', size=SMALL_SIZE)
plt.rc('axes', titlesize=MEDIUM_SIZE)
plt.rc('axes', labelsize=MEDIUM_SIZE)
plt.rcParams['figure.dpi']=150

In [68]:
data = pd.read_csv("train_prud.csv")
data.head()

Unnamed: 0,Id,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,...,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,Response
0,2,1,D3,10,0.076923,2,1,1,0.641791,0.581818,...,0,0,0,0,0,0,0,0,0,8
1,5,1,A1,26,0.076923,2,3,1,0.059701,0.6,...,0,0,0,0,0,0,0,0,0,4
2,6,1,E1,26,0.076923,2,3,1,0.029851,0.745455,...,0,0,0,0,0,0,0,0,0,8
3,7,1,D4,10,0.487179,2,3,1,0.164179,0.672727,...,0,0,0,0,0,0,0,0,0,8
4,8,1,D2,26,0.230769,2,3,1,0.41791,0.654545,...,0,0,0,0,0,0,0,0,0,8


In [69]:
data.isnull().sum().sum()

393103

There are a total of 393103 missing values in the dataset, counted across all columns.
One advantage of XGBoost is that it handles missing values natively: each tree split
learns a default direction for missing entries, so no imputation is required before
training, unlike many linear models.
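Before relying on that behaviour, it may help to see where the missing values are concentrated. The following is a minimal sketch on a toy frame; the column names `feature_a`, `feature_b`, `feature_c` and all values are invented for illustration and are not taken from the Prudential data. On the real dataset the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values (assumption: stands in for the real `data`)
toy = pd.DataFrame({
    "feature_a": [0.64, 0.06, 0.03, 0.16],
    "feature_b": [0.03, np.nan, 0.03, 0.04],
    "feature_c": [np.nan, np.nan, np.nan, 240.0],
})

# Missing values per column, largest first
missing_per_column = toy.isnull().sum().sort_values(ascending=False)

# Fraction of rows missing in each column
missing_fraction = toy.isnull().mean()

# Grand total, the same quantity as data.isnull().sum().sum()
total_missing = toy.isnull().sum().sum()

print(missing_per_column)
print(missing_fraction)
print("Total missing:", total_missing)
```

Columns where nearly every row is missing carry little information, so a per-column view like this can inform whether to keep them even when the model tolerates NaN.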

In [70]:
data.describe()

Unnamed: 0,Id,Product_Info_1,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,Wt,...,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,Response
count,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,...,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0
mean,39507.211515,1.026355,24.415655,0.328952,2.006955,2.673599,1.043583,0.405567,0.707283,0.292587,...,0.056954,0.010054,0.045536,0.01071,0.007528,0.013691,0.008488,0.019905,0.054496,5.636837
std,22815.883089,0.160191,5.072885,0.282562,0.083107,0.739103,0.291949,0.19719,0.074239,0.089037,...,0.231757,0.099764,0.208479,0.102937,0.086436,0.116207,0.091737,0.139676,0.226995,2.456833
min,2.0,1.0,1.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,19780.0,1.0,26.0,0.076923,2.0,3.0,1.0,0.238806,0.654545,0.225941,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
50%,39487.0,1.0,26.0,0.230769,2.0,3.0,1.0,0.402985,0.709091,0.288703,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
75%,59211.0,1.0,26.0,0.487179,2.0,3.0,1.0,0.567164,0.763636,0.345188,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0
max,79146.0,2.0,38.0,1.0,3.0,3.0,3.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,8.0


The target variable Response is the risk level associated with each row,
where each row represents an applicant and the applicant's standing on various attributes.
Missing values occur in features such as Medical_History_10.

In [71]:
data['Response'].value_counts()


8    19489
6    11233
7     8027
2     6552
1     6207
5     5432
4     1428
3     1013
Name: Response, dtype: int64

All eight risk levels are represented, but unevenly: level 8 has 19489 applicants while level 3 has only 1013.
The feature Id is just an identifier for an applicant; it doesn't contribute to our prediction.
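To check how evenly the classes are spread, the counts printed by `value_counts()` can be turned into shares. The sketch below copies those counts by hand and also computes the accuracy of a trivial model that always predicts the most frequent level, which gives a reference point for any classifier's accuracy. The variable names are illustrative, not part of the notebook.

```python
# Class counts copied from the value_counts() output above
response_counts = {8: 19489, 6: 11233, 7: 8027, 2: 6552,
                   1: 6207, 5: 5432, 4: 1428, 3: 1013}

total = sum(response_counts.values())

# Share of applicants in each risk level
shares = {level: count / total for level, count in response_counts.items()}

# Most and least frequent risk levels
majority_level = max(response_counts, key=response_counts.get)
minority_level = min(response_counts, key=response_counts.get)

# Accuracy (in percent) of always predicting the majority level
baseline_accuracy = shares[majority_level] * 100

# How many times larger the biggest class is than the smallest
imbalance_ratio = response_counts[majority_level] / response_counts[minority_level]

for level in sorted(shares):
    print(f"Risk level {level}: {shares[level]:6.2%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}%")
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.1f}")
```

The counts sum to 59381, matching the row count reported by `describe()`, and the majority-class baseline comes out at roughly a third.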

In [72]:
data.drop(['Id'], axis=1, inplace=True)
data.head(1)

Unnamed: 0,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,Wt,...,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,Response
0,1,D3,10,0.076923,2,1,1,0.641791,0.581818,0.148536,...,0,0,0,0,0,0,0,0,0,8


In [73]:
le = LabelEncoder()
f_encoded = le.fit_transform(data['Product_Info_2'])
data['Product_Info_2'] = f_encoded
x = data.drop('Response', axis=1)
y = data[['Response']]

In [74]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)


In [75]:
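# XGBoost baseline: 100 trees, all other hyperparameters left at their defaults.
# Note: recent XGBoost releases may expect class labels 0..7 rather than 1..8;
# if fit() rejects the labels, subtract 1 from Response first.
xgb = XGBClassifier(n_estimators=100)

# Time the training step; ravel() passes the labels as a 1-D array
training_start = time.perf_counter()
xgb.fit(x_train, y_train.values.ravel())
training_end = time.perf_counter()

# Time the prediction step
prediction_start = time.perf_counter()
pred = xgb.predict(x_test)
prediction_end = time.perf_counter()

# Accuracy in percent: share of test rows whose predicted level matches the true one
acc_xgb = (pred.ravel() == y_test.values.ravel()).mean() * 100
xgb_train_time = training_end - training_start
xgb_prediction_time = prediction_end - prediction_start
print("XGBoost's prediction accuracy is: %3.2f" % acc_xgb)
print("Time consumed for training: %4.3f" % xgb_train_time)
print("Time consumed for prediction: %6.5f seconds" % xgb_prediction_time)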


XGBoost's prediction accuracy is: 58.89
Time consumed for training: 36.383
Time consumed for prediction: 0.43197 seconds


The accuracy may be improved by tuning the hyperparameters of XGBoost.