<!-- Simon-Style -->
<p style="font-size:19px; text-align:left; margin-top:    15px;"><i>German Association of Actuaries (DAV) — Working Group "Explainable Artificial Intelligence"</i></p>
<p style="font-size:25px; text-align:left; margin-bottom: 15px"><b>Use Case SOA GLTD Experience Study:<br>
Tree Model - Hyper Parameters
</b></p>
<p style="font-size:19px; text-align:left; margin-bottom: 15px">Guido Grützner (<a href="mailto:guido.gruetzner@quantakt.com">guido.gruetzner@quantakt.com</a>)</p>

# Introduction

This notebook runs a grid search to tune the `max_depth` hyperparameter of scikit-learn's `DecisionTreeClassifier`. The optimal value depends on `pct`, the fraction of the data used. See the notes before the last cell of "Initialisation" below for more information on `pct`.

This notebook will take roughly 5 min to run with the given search grid and a choice of `pct=0.3` for the amount of data. 

# Initialisation

In [1]:
from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import \
    GroupShuffleSplit, GridSearchCV

import gltd_utilities

import time
import numpy as np
import pandas as pd
pd.options.mode.copy_on_write = True

* Adapt the path to the data file in the call to `load_gltd_data`, if necessary.
* Set `pct` to any value with $0.05 \leq pct \leq 1$, as required.
* A value of 1 uses all available data; smaller values use the corresponding fraction. Below 0.05, predictions become somewhat volatile.

In [2]:
(X, Y, ID, nm_cat, nm_num, seed, rng) = gltd_utilities.load_gltd_data(
                                            "./", pct=0.3)
seed

'156700121105998886695440653718163913769'
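
The value shown above is the seed, rendered as a string of digits. How `gltd_utilities` turns it into `rng` is not shown in this notebook. The following is a minimal sketch, assuming the seed is passed as an integer to NumPy's `default_rng`. It illustrates why recording the seed makes the `random_state` values drawn via `rng.integers` reproducible.

```python
import numpy as np

# Seed as printed above. Seeding default_rng with it is an assumption;
# the actual logic inside gltd_utilities may differ.
seed = "156700121105998886695440653718163913769"

rng_a = np.random.default_rng(int(seed))
rng_b = np.random.default_rng(int(seed))

# Two generators built from the same seed yield the same sequence of draws
draws_a = [int(rng_a.integers(low=0, high=1000)) for _ in range(3)]
draws_b = [int(rng_b.integers(low=0, high=1000)) for _ in range(3)]
print(draws_a == draws_b)
```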

# Grid search

In [3]:
# One-hot encode the categorical columns; all remaining columns pass through
ct = ColumnTransformer(
    [("", OneHotEncoder(drop="first", sparse_output=False, dtype=int), nm_cat)],
    remainder="passthrough", verbose_feature_names_out=False)

X_ohe = ct.fit_transform(X)

# Split into train and test by group, so that no ID appears in both sets
train_indx, test_indx = next(
    GroupShuffleSplit(random_state=rng.integers(low=0, high=1000))
    .split(X_ohe, groups=ID))
xtrain, xtest = X_ohe[train_indx, :], X_ohe[test_indx, :]

ytrain, ytest = Y.iloc[train_indx], Y.iloc[test_indx]
idtrain, idtest = ID.iloc[train_indx], ID.iloc[test_indx]
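
The cell above splits by `ID` rather than by row. As a small illustration of what `GroupShuffleSplit` guarantees, the sketch below uses made-up toy groups (not the GLTD data) and checks that no group ends up on both sides of the split. The explicit `test_size=0.2` matches the scikit-learn default that applies when neither `test_size` nor `train_size` is given.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 hypothetical groups with 3 rows each
groups = np.repeat(np.arange(10), 3)
X_toy = np.zeros((groups.size, 2))

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X_toy, groups=groups))

train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])

# All rows of a group go to the same side, so the two sets are disjoint
print(train_groups.isdisjoint(test_groups))
print(f"{len(train_groups)} train groups, {len(test_groups)} test groups")
```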

In [4]:

# Set up possible values of parameters to optimize over
pgrid = {"max_depth": [6, 7, 8], "criterion": ["log_loss"]}

md = DecisionTreeClassifier(random_state=rng.integers(low=0, high=1000))

cv = GroupShuffleSplit(n_splits=5,
                            random_state=rng.integers(low=0, high=1000))
tic = time.time()
clf = GridSearchCV(estimator=md, param_grid=pgrid, cv=cv, 
                   scoring="neg_log_loss", verbose=4)

clf.fit(xtrain, ytrain, groups=idtrain)
print(f"Time it took: {np.ceil((time.time() - tic)/60)}min.")
df = pd.DataFrame(clf.cv_results_)
df

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV 1/5] END ..criterion=log_loss, max_depth=6;, score=-0.056 total time=  18.8s
[CV 2/5] END ..criterion=log_loss, max_depth=6;, score=-0.055 total time=  19.4s
[CV 3/5] END ..criterion=log_loss, max_depth=6;, score=-0.055 total time=  18.6s
[CV 4/5] END ..criterion=log_loss, max_depth=6;, score=-0.055 total time=  18.6s
[CV 5/5] END ..criterion=log_loss, max_depth=6;, score=-0.054 total time=  18.9s
[CV 1/5] END ..criterion=log_loss, max_depth=7;, score=-0.056 total time=  22.0s
[CV 2/5] END ..criterion=log_loss, max_depth=7;, score=-0.055 total time=  21.9s
[CV 3/5] END ..criterion=log_loss, max_depth=7;, score=-0.056 total time=  23.4s
[CV 4/5] END ..criterion=log_loss, max_depth=7;, score=-0.055 total time=  22.7s
[CV 5/5] END ..criterion=log_loss, max_depth=7;, score=-0.056 total time=  22.8s
[CV 1/5] END ..criterion=log_loss, max_depth=8;, score=-0.059 total time=  25.8s
[CV 2/5] END ..criterion=log_loss, max_depth=8;, 

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.791785 | 0.30073 | 0.187245 | 0.013073 | log_loss | 6 | {'criterion': 'log_loss', 'max_depth': 6} | -0.056156 | -0.055065 | -0.054993 | -0.054555 | -0.054376 | -0.055029 | 0.000621 | 1 |
| 1 | 22.470687 | 0.549937 | 0.17768 | 0.006303 | log_loss | 7 | {'criterion': 'log_loss', 'max_depth': 7} | -0.056007 | -0.055429 | -0.055568 | -0.054629 | -0.055721 | -0.055471 | 0.000463 | 2 |
| 2 | 24.791498 | 0.636613 | 0.194141 | 0.008316 | log_loss | 8 | {'criterion': 'log_loss', 'max_depth': 8} | -0.058606 | -0.057964 | -0.059386 | -0.057487 | -0.058476 | -0.058384 | 0.000639 | 3 |
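
The log line "Fitting 5 folds for each of 3 candidates, totalling 15 fits" follows directly from the grid and the number of splits. As a small illustration, scikit-learn's `ParameterGrid` enumerates the parameter combinations of a dict grid such as `pgrid`:

```python
from sklearn.model_selection import ParameterGrid

# Same grid and number of splits as in the search above
pgrid = {"max_depth": [6, 7, 8], "criterion": ["log_loss"]}
n_splits = 5

# Every combination of the listed values is one candidate
candidates = list(ParameterGrid(pgrid))
for params in candidates:
    print(params)

n_fits = len(candidates) * n_splits
print(f"{len(candidates)} candidates x {n_splits} splits = {n_fits} fits")
```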


## Results

In [5]:
clf.best_params_

{'criterion': 'log_loss', 'max_depth': 6}

In [6]:
clf.best_score_

-0.05502919947828779
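
`best_score_` is the mean cross-validated score of the best candidate, i.e. its `mean_test_score`. As a quick arithmetic check, the sketch below recomputes the mean and standard deviation for `max_depth=6` from the five split scores in the results table. The inputs are the rounded values as printed there, so agreement is only up to rounding.

```python
import numpy as np

# split0..split4 test scores for max_depth=6, copied from the results table
split_scores = np.array([-0.056156, -0.055065, -0.054993, -0.054555, -0.054376])

mean_score = split_scores.mean()
std_score = split_scores.std()  # population standard deviation (ddof=0)

print(f"mean: {mean_score:.6f}, std: {std_score:.6f}")
```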

# Sanity check

By construction, tree models tend to produce probability estimates of exactly zero or one: each leaf predicts the class frequencies of the training observations it contains, so a pure leaf predicts 0 or 1. Both values are impossible in this use case. The log-loss function used here partially mitigates the problem. Indeed, the number of exact zero and one predictions is limited, given the size of the data. The calibration diagram in "ana_fit" gives further evidence that the calibration is acceptable.
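
To illustrate the first sentence above: in scikit-learn, `predict_proba` of a tree returns the class fractions of the training samples in the leaf an observation falls into, so a pure leaf yields exactly 0 or 1. The toy data below are made up for illustration and are unrelated to the GLTD data. The sketch contrasts a fully grown tree with a depth-limited one.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: one feature with unique values and a noisy binary target
X_toy = np.arange(8).reshape(-1, 1)
y_toy = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# A fully grown tree isolates every observation in a pure leaf
deep = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
p_deep = deep.predict_proba(X_toy)[:, 1]

# A depth-1 tree has two mixed leaves and predicts their class fractions
shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)
p_shallow = shallow.predict_proba(X_toy)[:, 1]

print("deep:   ", p_deep)
print("shallow:", p_shallow)
```

Limiting `max_depth`, as the grid search does, reduces the number of pure leaves but does not rule them out, which is why the check below counts exact zeros and ones.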

In [7]:
def check_probabilities(res, label):
    """Raise on invalid probabilities and report exact zeros and ones."""
    if np.isnan(res).any():
        raise ValueError("Dreaded NaNs!")
    if (res <= 0).any():
        print(f"Number of exact zero values in {label}: {np.sum(res == 0)}")
        if (res < 0).any():
            raise ValueError("Dreaded Negatives!")
    if (res >= 1).any():
        print(f"Number of exact one values in {label}: {np.sum(res == 1)}")
        if (res > 1).any():
            raise ValueError("Dreaded Larger-Than-Ones!")

# Check the predicted probability of the positive class on train and test data
for label, x in [("train", xtrain), ("test", xtest)]:
    check_probabilities(clf.best_estimator_.predict_proba(x)[:, 1], label)

Number of exact zero values in train: 8394
Number of exact zero values in test: 2058
