<!-- Simon-Style -->
<p style="font-size:19px; text-align:left; margin-top:    15px;"><i>German Association of Actuaries (DAV) — Working Group "Explainable Artificial Intelligence"</i></p>
<p style="font-size:25px; text-align:left; margin-bottom: 15px"><b>Use Case SOA GLTD Experience Study:<br>
Gradient Boosting - Hyper Parameters
</b></p>
<p style="font-size:19px; text-align:left; margin-bottom: 15px">Guido Grützner (<a href="mailto:guido.gruetzner@quantakt.com">guido.gruetzner@quantakt.com</a>)</p>

# Introduction

This report performs a grid search to tune the hyperparameters `learning_rate` and `max_leaf_nodes` of Histogram-based Gradient Boosting Classification Trees (scikit-learn's `HistGradientBoostingClassifier`).

Optimal hyperparameters depend on `pct` (see the text before the last cell of "Initialisation" below for details on `pct`). Recommendation: `pct=0.3` with a `learning_rate` of 0.025 is a good compromise. For `pct=1` the learning rate can be decreased further for a (possibly small) improvement in predictive quality, at the cost of more execution time per fit. For a given `pct` value, the time required for a single fit is roughly inversely proportional to `learning_rate`. For example, a learning rate of 0.025 is roughly three times faster than 0.0075.

As-is, this notebook takes roughly 20 to 45 minutes to run, depending on hardware; the run recorded here took 42 minutes.

# Initialisation

In [1]:
from sklearn.ensemble import HistGradientBoostingClassifier

from sklearn.model_selection import \
    GroupShuffleSplit, GridSearchCV

import gltd_utilities

import time
import numpy as np
import pandas as pd
pd.options.mode.copy_on_write = True

# Adjust to your hardware:
# more CPUs are faster, but the script may then block your machine
import os
os.environ['LOKY_MAX_CPU_COUNT'] = '4'

* Adapt the path to the data file in the call to `load_gltd_data`, if necessary.
* Set `pct` to any value in the range $0.05\leq pct\leq1$, according to your requirements.
* A value of 1 uses all available data; lower values use the corresponding fraction. Below 0.05, predictions become somewhat volatile.

In [2]:
(X, Y, ID, nm_cat, nm_num, seed, rng) = gltd_utilities.load_gltd_data("./", pct=0.3)
seed

'202208864442763689745147491085394347929'

# Gridsearch

In [3]:
train_indx, test_indx = next(
    GroupShuffleSplit(random_state=rng.integers(low=0, high=1000)).split(X, groups=ID))
xtrain, xtest = X.iloc[train_indx], X.iloc[test_indx] 
ytrain, ytest = Y.iloc[train_indx], Y.iloc[test_indx]
idtrain, idtest = ID.iloc[train_indx], ID.iloc[test_indx]
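The split above is grouped by `ID`, so all rows sharing an ID end up on the same side of the split. A minimal sketch on toy data illustrates this property; the group labels `"a"` to `"d"` and the names `ids_demo` and `X_demo` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 12 rows belonging to 4 hypothetical group IDs, 3 rows each.
ids_demo = pd.Series(np.repeat(["a", "b", "c", "d"], 3))
X_demo = pd.DataFrame({"x": np.arange(12)})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X_demo, groups=ids_demo))

train_ids = set(ids_demo.iloc[train_idx])
test_ids = set(ids_demo.iloc[test_idx])
print("train IDs:", sorted(train_ids))
print("test IDs: ", sorted(test_ids))
```

Because no ID appears in both sets, rows from the same group cannot leak between training and test data.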

In [4]:
# Set up possible values of parameters to optimize over
# These values are for demonstration purposes and can be freely varied
pgrid = {"learning_rate": [0.01, 0.025, 0.03],
         "max_leaf_nodes": [80, 100, 120]}

tic = time.time()
md = HistGradientBoostingClassifier(
    max_iter=1000,
    categorical_features=nm_cat,
    random_state=rng.integers(low=0, high=1000))

cv = GroupShuffleSplit(n_splits=5,
                       random_state=rng.integers(low=0, high=1000))

clf = GridSearchCV(estimator=md, param_grid=pgrid, cv=cv, 
                   scoring="neg_log_loss", verbose=4)

clf.fit(xtrain, ytrain, groups=idtrain)

df = pd.DataFrame(clf.cv_results_)
df

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END learning_rate=0.01, max_leaf_nodes=80;, score=-0.052 total time= 1.2min
[CV 2/5] END learning_rate=0.01, max_leaf_nodes=80;, score=-0.053 total time= 1.1min
[CV 3/5] END learning_rate=0.01, max_leaf_nodes=80;, score=-0.052 total time= 1.2min
[CV 4/5] END learning_rate=0.01, max_leaf_nodes=80;, score=-0.052 total time= 1.3min
[CV 5/5] END learning_rate=0.01, max_leaf_nodes=80;, score=-0.053 total time= 1.2min
[CV 1/5] END learning_rate=0.01, max_leaf_nodes=100;, score=-0.052 total time= 1.2min
[CV 2/5] END learning_rate=0.01, max_leaf_nodes=100;, score=-0.053 total time= 1.2min
[CV 3/5] END learning_rate=0.01, max_leaf_nodes=100;, score=-0.052 total time= 1.2min
[CV 4/5] END learning_rate=0.01, max_leaf_nodes=100;, score=-0.052 total time= 1.4min
[CV 5/5] END learning_rate=0.01, max_leaf_nodes=100;, score=-0.053 total time= 1.2min
[CV 1/5] END learning_rate=0.01, max_leaf_nodes=120;, score=-0.052 total time= 1.2min

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_leaf_nodes,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,65.921828,3.833313,6.080504,0.361877,0.01,80,"{'learning_rate': 0.01, 'max_leaf_nodes': 80}",-0.052138,-0.053258,-0.051852,-0.051589,-0.053118,-0.052391,0.000675,1
1,67.621828,3.61567,6.056598,0.422609,0.01,100,"{'learning_rate': 0.01, 'max_leaf_nodes': 100}",-0.052167,-0.053283,-0.0519,-0.051598,-0.053181,-0.052426,0.000683,2
2,86.950743,16.041092,7.057517,2.091277,0.01,120,"{'learning_rate': 0.01, 'max_leaf_nodes': 120}",-0.052203,-0.053277,-0.051904,-0.051626,-0.053179,-0.052438,0.000671,3
3,44.296185,2.996142,3.385859,0.230172,0.025,80,"{'learning_rate': 0.025, 'max_leaf_nodes': 80}",-0.052192,-0.05333,-0.051915,-0.05166,-0.053198,-0.052459,0.00068,4
4,41.423154,6.115008,3.070336,0.182183,0.025,100,"{'learning_rate': 0.025, 'max_leaf_nodes': 100}",-0.052275,-0.053383,-0.051959,-0.051649,-0.05328,-0.052509,0.000701,6
5,36.932912,1.897958,3.103457,0.115718,0.025,120,"{'learning_rate': 0.025, 'max_leaf_nodes': 120}",-0.0523,-0.053355,-0.052019,-0.051749,-0.053242,-0.052533,0.00065,8
6,29.039922,1.360811,2.722639,0.098517,0.03,80,"{'learning_rate': 0.03, 'max_leaf_nodes': 80}",-0.052193,-0.053377,-0.051965,-0.051699,-0.053266,-0.0525,0.00069,5
7,28.615507,2.007628,2.715792,0.203726,0.03,100,"{'learning_rate': 0.03, 'max_leaf_nodes': 100}",-0.052308,-0.053384,-0.051979,-0.051668,-0.053312,-0.05253,0.000698,7
8,29.095299,1.818937,2.661654,0.172794,0.03,120,"{'learning_rate': 0.03, 'max_leaf_nodes': 120}",-0.052353,-0.053387,-0.052062,-0.051787,-0.053321,-0.052582,0.000655,9
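The long results table is easier to read as a grid of learning rate against leaf count. The sketch below rebuilds that grid from the `mean_test_score` values copied from the table above; the names `scores`, `grid` and `spread` are illustrative.

```python
import pandas as pd

# mean_test_score values copied from the results table above.
scores = pd.DataFrame({
    "learning_rate":  [0.01, 0.01, 0.01, 0.025, 0.025, 0.025, 0.03, 0.03, 0.03],
    "max_leaf_nodes": [80, 100, 120, 80, 100, 120, 80, 100, 120],
    "mean_test_score": [-0.052391, -0.052426, -0.052438,
                        -0.052459, -0.052509, -0.052533,
                        -0.052500, -0.052530, -0.052582],
})

# One row per learning rate, one column per max_leaf_nodes value.
grid = scores.pivot(index="learning_rate", columns="max_leaf_nodes",
                    values="mean_test_score")
print(grid)

# Difference between the best and the worst candidate.
spread = scores["mean_test_score"].max() - scores["mean_test_score"].min()
print(f"best minus worst mean score: {spread:.6f}")
```

With the fitted search at hand, the same grid can be obtained from `df` via `df.pivot(index="param_learning_rate", columns="param_max_leaf_nodes", values="mean_test_score")`. The spread over the whole grid is about 0.0002, which is smaller than the per-candidate `std_test_score` of about 0.0007, so the ranking should be read with some caution. Since all candidates share the same folds, a fold-by-fold comparison would likely be more informative.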


# Result

In [5]:
clf.best_params_

{'learning_rate': 0.01, 'max_leaf_nodes': 80}

In [6]:
clf.best_score_

-0.05239086062346413
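The score is negative because the `neg_log_loss` scorer returns the negated log loss, so that larger values are better; the best score above corresponds to a mean log loss of about 0.0524. As a reminder of what is being averaged, the following sketch computes the binary log loss by hand and compares it with `sklearn.metrics.log_loss`. The labels and probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 0, 1])
p_pred = np.array([0.1, 0.2, 0.7, 0.05, 0.6])

# Binary log loss: mean negative log-likelihood of the observed labels.
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
reference = log_loss(y_true, p_pred)

print(manual, reference)
```

The formula also shows why predicted probabilities of exactly zero or one are problematic: the logarithm diverges whenever such a prediction turns out to be wrong.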

# Sanity check

The sanity check ensures that all predicted probabilities are strictly greater than zero and strictly less than one, since exact zeros and ones are impossible for this use case. NaNs and values outside $[0, 1]$ raise an error, while exact zeros and ones are only counted and reported. Beyond this simple check, the calibration of the model is further evidenced by a calibration diagram in "ana_fit".

In [7]:
def check_probabilities(res, label):
    """Raise on NaNs or values outside [0, 1]; report exact zeros and ones."""
    if np.isnan(res).any():
        raise ValueError("Dreaded NaNs!")
    if (res <= 0).any():
        print(f"Number of exact zero values in {label}: {(res == 0).sum()}")
        if (res < 0).any():
            raise ValueError("Dreaded Negatives!")
    if (res >= 1).any():
        print(f"Number of exact one values in {label}: {(res == 1).sum()}")
        if (res > 1).any():
            raise ValueError("Dreaded Larger-Than-Ones!")

# Predicted probability of the positive class on train and test data
check_probabilities(clf.best_estimator_.predict_proba(xtrain)[:, 1], "train")
check_probabilities(clf.best_estimator_.predict_proba(xtest)[:, 1], "test")

In [8]:
print(f"Time it took: {np.ceil((time.time() - tic)/60)}min.")

Time it took: 42.0min.
