**Exercise predictive modeling:**<br>
In this exercise, you review the principles of predictive modeling. You will build a predictive model for travel 
insurance that predicts whether a given insurance offer leads to a claim. You can use "Ex03-Python_Machine_Learning.ipynb" 
to look up the model-building procedure and the required Python commands. 

In the next cells, we provide the code for importing the required packages and for loading the data set (you need to adapt the
path to the data). The exercises begin afterwards.  

In [35]:
# required packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.inspection import permutation_importance
from sklearn.inspection import plot_partial_dependence
import os
import seaborn as sns
import matplotlib.pyplot as plt

In [36]:
## read data
trav_ins = pd.read_csv("travel insurance.csv",index_col=False,sep=',', encoding='utf-8')

In [37]:
## get to know the data
print(trav_ins.shape)
trav_ins.head()

(63326, 11)


Unnamed: 0,Agency,Agency Type,Distribution Channel,Product Name,Claim,Duration,Destination,Net Sales,Commision (in value),Gender,Age
0,CBH,Travel Agency,Offline,Comprehensive Plan,No,186,MALAYSIA,-29.0,9.57,F,81
1,CBH,Travel Agency,Offline,Comprehensive Plan,No,186,MALAYSIA,-29.0,9.57,F,71
2,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,No,65,AUSTRALIA,-49.5,29.7,,32
3,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,No,60,AUSTRALIA,-39.6,23.76,,32
4,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,No,79,ITALY,-19.8,11.88,,41


In [38]:
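# number of exact duplicate rows; not displayed because it is not the last expression in the cell
trav_ins.duplicated().sum()
# keep only the first occurrence of each duplicated row
trav_ins = trav_ins.drop_duplicates()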

**Exercise 1:**<br>
Impute missing values with an approach of your own choice.

In [39]:
trav_ins.isnull().sum().sort_values(ascending=False)/len(trav_ins)

Gender                  0.693239
Age                     0.000000
Commision (in value)    0.000000
Net Sales               0.000000
Destination             0.000000
Duration                0.000000
Claim                   0.000000
Product Name            0.000000
Distribution Channel    0.000000
Agency Type             0.000000
Agency                  0.000000
dtype: float64

In [40]:
trav_ins.drop(columns='Gender', inplace=True)
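
Dropping `Gender` is one valid choice, since roughly 69% of its values are missing. As an alternative that actually imputes, the missing entries could be treated as their own category. The following is a minimal sketch on a made-up toy frame (the values and the label `"Unknown"` are assumptions for illustration, not part of the exercise data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for trav_ins; the values are invented for illustration.
toy = pd.DataFrame({
    "Gender": ["F", np.nan, "M", np.nan, np.nan],
    "Age": [81, 32, 41, 27, 57],
})

# Treat "missing" as a separate category instead of dropping the column.
# "Unknown" is an arbitrary label chosen for this sketch.
toy["Gender"] = toy["Gender"].fillna("Unknown")

gender_counts = toy["Gender"].value_counts()
print(gender_counts)
```

Whether this helps depends on whether missingness itself carries information about claims; comparing both variants on validation data would be one way to decide.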

**Exercise 2:**<br>
Appropriately encode the target "Claim".

In [41]:
from sklearn.preprocessing import OneHotEncoder

# "Claim" has two levels (No/Yes), so a single 0/1 column is sufficient

In [42]:
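# drop="if_binary" keeps one column (Claim_Yes) for the two-level target instead of two redundant ones
ohe = OneHotEncoder(sparse=False, drop="if_binary")
ohe.fit(trav_ins[['Claim']])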

OneHotEncoder(drop='if_binary', sparse=False)

In [43]:
trav_ins[ohe.get_feature_names_out()] = ohe.transform(trav_ins[['Claim']])

In [45]:
trav_ins.drop(columns = ["Claim"], inplace = True)

In [46]:
trav_ins

Unnamed: 0,Agency,Agency Type,Distribution Channel,Product Name,Duration,Destination,Net Sales,Commision (in value),Age,Claim_Yes
0,CBH,Travel Agency,Offline,Comprehensive Plan,186,MALAYSIA,-29.0,9.57,81,0.0
1,CBH,Travel Agency,Offline,Comprehensive Plan,186,MALAYSIA,-29.0,9.57,71,0.0
2,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,65,AUSTRALIA,-49.5,29.70,32,0.0
3,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,60,AUSTRALIA,-39.6,23.76,32,0.0
4,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,79,ITALY,-19.8,11.88,41,0.0
...,...,...,...,...,...,...,...,...,...,...
63320,JZI,Airlines,Online,Basic Plan,5,BRUNEI DARUSSALAM,18.0,6.30,27,0.0
63321,JZI,Airlines,Online,Basic Plan,111,JAPAN,35.0,12.25,31,0.0
63322,JZI,Airlines,Online,Basic Plan,58,CHINA,40.0,14.00,40,0.0
63323,JZI,Airlines,Online,Basic Plan,2,MALAYSIA,18.0,6.30,57,0.0
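
For a two-level target, a plain mapping gives the same 0/1 encoding as the one-hot encoder with less machinery and an integer dtype. A minimal sketch on a toy series (the values are made up; it assumes the only labels are "No" and "Yes", as in the output shown above):

```python
import pandas as pd

# Toy target column; the real data would be trav_ins["Claim"].
claim = pd.Series(["No", "No", "Yes", "No"], name="Claim")

# Map the two labels to 0/1. Any label not in the dict would become NaN,
# so the astype(int) call fails loudly if an unexpected label appears.
claim_encoded = claim.map({"No": 0, "Yes": 1}).astype(int)
print(claim_encoded.tolist())
```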


**Exercise 3:**<br>
Appropriately encode the categorical variables. We recommend merging categories with few observations into one group.

In [47]:
categorical_cols = ['Agency', 'Agency Type', 'Distribution Channel', 'Product Name','Destination'] 

#import pandas as pd
df = pd.get_dummies(trav_ins, columns = categorical_cols)
df

Unnamed: 0,Duration,Net Sales,Commision (in value),Age,Claim_Yes,Agency_ADM,Agency_ART,Agency_C2B,Agency_CBH,Agency_CCR,...,Destination_UNITED KINGDOM,Destination_UNITED STATES,Destination_URUGUAY,Destination_UZBEKISTAN,Destination_VANUATU,Destination_VENEZUELA,Destination_VIET NAM,"Destination_VIRGIN ISLANDS, U.S.",Destination_ZAMBIA,Destination_ZIMBABWE
0,186,-29.0,9.57,81,0.0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,186,-29.0,9.57,71,0.0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,65,-49.5,29.70,32,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,60,-39.6,23.76,32,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,79,-19.8,11.88,41,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63320,5,18.0,6.30,27,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
63321,111,35.0,12.25,31,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
63322,58,40.0,14.00,40,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
63323,2,18.0,6.30,57,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
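
The encoding above creates one dummy column per destination, including very rare ones. The recommendation to merge categories with few observations could be sketched as follows. The helper `merge_rare`, the threshold `min_count`, and the label `"Other"` are hypothetical names introduced for this example, and the toy data is made up:

```python
import pandas as pd

def merge_rare(series, min_count, other_label="Other"):
    """Replace categories that occur fewer than min_count times by other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; overwrite the rare ones.
    return series.where(~series.isin(rare), other_label)

# Toy destination column: two frequent and two rare categories.
dest = pd.Series(["MALAYSIA"] * 5 + ["AUSTRALIA"] * 4 + ["ITALY", "ZAMBIA"])

merged = merge_rare(dest, min_count=3)
dummies = pd.get_dummies(merged, prefix="Destination")
print(dummies.columns.tolist())
```

On the real data, the helper would be applied to each column in `categorical_cols` before calling `pd.get_dummies`; a sensible threshold depends on the data and is a modeling choice.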


**Exercise 4:**<br>
Split the data so that 80% goes into the training set and the remaining 20% into the test set.

In [49]:
X = df.drop('Claim_Yes',axis=1)
y = df['Claim_Yes']

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
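
If claims are rare, which is common for insurance data, a purely random split can leave the test set with very few positive cases. Passing `stratify` to `train_test_split` keeps the class proportions equal in both parts. A minimal sketch on synthetic data (the 2% claim rate is an assumption for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 20 of them positive (2% "claims").
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy preserves the 2% positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)
print(y_tr.sum(), y_te.sum())
```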

**Exercise 5:**<br>
Build a random forest model on the training data. Find the best tuning parameters by grid search.  
Below is a simple grid that you could use.

In [51]:
from sklearn.ensemble import RandomForestClassifier

# base estimator for the grid search; max_depth=2 is overridden by the values in param_grid
rand = RandomForestClassifier(max_depth=2, random_state=0)


In [52]:
## example parameter grid
param_grid = {'n_estimators': [500],
              'max_features': [4, 8, 12],
              'max_depth':[10]
              }


In [56]:
clf = GridSearchCV(rand, param_grid)
clf

GridSearchCV(estimator=RandomForestClassifier(max_depth=2, random_state=0),
             param_grid={'max_depth': [10], 'max_features': [4, 8, 12],
                         'n_estimators': [500]})

In [57]:
# tune on the training split only, so the test set stays unseen during model selection
clf.fit(X_train, y_train)

GridSearchCV(estimator=RandomForestClassifier(max_depth=2, random_state=0),
             param_grid={'max_depth': [10], 'max_features': [4, 8, 12],
                         'n_estimators': [500]})

In [59]:
clf.best_estimator_

RandomForestClassifier(max_depth=10, max_features=4, n_estimators=500,
                       random_state=0)
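
By default, `GridSearchCV` ranks parameter settings by the estimator's own score, which is accuracy for a classifier. With a rare positive class, a ranking metric such as ROC AUC may be more informative. The sketch below shows the pattern on a small synthetic data set; the grid, the forest size, and `cv=3` are deliberately tiny assumptions so that it runs quickly, not recommended settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data standing in for the insurance data.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Small illustrative grid; scoring="roc_auc" replaces the default accuracy.
small_grid = {"max_features": [2, 4], "max_depth": [3]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    small_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_tr, y_tr)  # only the training split is used for tuning

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because `refit=True` is the default, `search.best_estimator_` is already refitted on the data passed to `fit` and can be evaluated on the held-out test split directly.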

In [60]:
rand = RandomForestClassifier(max_depth=10, max_features=4, n_estimators=500,
                       random_state=0)
rand.fit(X_train,y_train)

RandomForestClassifier(max_depth=10, max_features=4, n_estimators=500,
                       random_state=0)

In [61]:
rand.score(X_test,y_test)

0.9858912905851497

In [66]:
from sklearn.metrics import roc_auc_score

In [73]:
# AUC on the test set, using the predicted probability of the positive class (Claim_Yes = 1)
roc_auc_score(y_test, rand.predict_proba(X_test)[:, 1])

0.7990607189648562

**Exercise 6:**<br>
Evaluate the model with appropriate metrics on the test set. Is the model able to predict claims with certainty?
Is the model useful in practice?
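
Accuracy alone can be misleading when one class dominates. The following sketch, on made-up labels with 5% positives, shows that a model which never predicts a claim still reaches high accuracy while finding no claims at all. The numbers are constructed for illustration and are not results from the insurance data:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# 95 negatives and 5 positives; the "model" always predicts the negative class.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
prec = precision_score(y_true, y_pred, zero_division=0)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(acc, rec, prec)
print(cm)
```

For the exercise, the same functions can be applied to `y_test` and `rand.predict(X_test)` to check how many actual claims the forest identifies.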

**Bonus:**<br>
Identify which features are most relevant in predicting claims and how they affect the predictions.

In [None]:
## feature importance
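
One possible approach to the feature-importance part is permutation importance, which the notebook already imports. It measures how much a score drops when a single feature is shuffled. The sketch below uses synthetic data and a small forest as stand-ins; for the exercise, `rand`, `X_test`, and `y_test` would take their place:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 2 informative features out of 5.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature n_repeats times on held-out data and record the AUC drop.
result = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=0
)

# Feature indices sorted from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(idx, round(result.importances_mean[idx], 4))
```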

In [3]:
## partial dependence plot
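
The plotting helpers for partial dependence differ between scikit-learn versions (`plot_partial_dependence` in older releases, `PartialDependenceDisplay.from_estimator` in newer ones), so this sketch computes the underlying values with `partial_dependence` and leaves plotting aside. Data and model are synthetic stand-ins, and feature index `0` is an arbitrary choice for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

# Synthetic stand-in data and a small forest.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Average predicted response as feature 0 varies over a 10-point grid,
# with all other features kept at their observed values.
pd_result = partial_dependence(
    model, X_toy, features=[0], kind="average", grid_resolution=10
)
avg = pd_result["average"]  # shape: (n_outputs, n_grid_points)
print(avg.shape)
print(avg[0].round(3))
```

Plotting `avg[0]` against the grid of feature values shows how the predicted claim probability changes with that feature on average.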