![title](img/logo_white_full.png)

# Hyperoptimization Tutorial

In this part of the tutorial, we are going to learn what **hyperoptimization** is. **Parameters** are learned during training (for example, the slope of a linear regression or the weights of a neural network), whereas **hyperparameters** are left for the data scientist to select beforehand.

Selecting good values for the hyperparameters is crucial and can significantly improve the performance of a model.

Three methods are commonly used for hyperoptimization:
* **Grid Search** - the most standard approach, which looks through the whole hyperparameter space
* **Random Search** - randomly selects combinations from the hyperparameter space
* **Tree-structured Parzen Estimator (TPE)** - a more intelligent way of tuning.

As insurers and banks require interpretability, Grid Search is the preferred and most understandable method. However, you can experiment with the other techniques too. To learn about Random Search and TPE, you can read [this post](http://dkopczyk.quantee.co.uk/hyperparameter-optimization/).

The tutorial is based on [Allstate Claim Severity Kaggle competition data](https://www.kaggle.com/c/allstate-claims-severity).

In [1]:
import pickle # Load and save Python objects

import numpy as np # Arrays
import pandas as pd # Data-Frames
from plotly.offline import init_notebook_mode # Plotly

from sklearn.model_selection import GridSearchCV # Hyperoptimization
from sklearn.metrics import mean_absolute_error, make_scorer # MAE
from lightgbm import LGBMRegressor # Model to tune

from utils import plot_gs_surface, plot_gs_scatter # Custom Utilities written for this tutorial

import warnings # Ignore annoying warnings
warnings.filterwarnings('ignore')

# Required for Jupyter to produce in-line Plotly graphs
init_notebook_mode(connected=True)

The selected estimator to tune is LightGBM. The plan is to:
1. Load the training and testing **data** produced in the Data Processing Tutorial. Once again, we do not use the Kaggle test dataset as it is unlabelled.
2. Select **hyperparameter space**.
3. Perform **Grid Search** with 5-fold cross-validation.
4. Check the **performance** on the testing data.

---
## Data
Load the training and testing data that you created in the Data Processing Tutorial.

In [2]:
with open('data/data.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)
print(X_train.shape)
print(X_test.shape)

(141238, 130)
(47080, 130)


---
## Estimator
Let's start with a LightGBM estimator with default parameters. As a reminder, we have transformed the losses to log-losses, so we need a custom evaluation metric that converts the values back to the original scale and computes MAE as Kaggle defines it. From the Regression Models Tutorial you might remember that the MAE of LightGBM with default parameters was 1152.14. Can we improve it with hyperoptimization?

In [3]:
def mae_from_logs(y_true, y_pred):
    # Custom LightGBM eval metric: returns (name, value, is_higher_better)
    return 'mae_from_logs', mean_absolute_error(np.exp(y_true), np.exp(y_pred)), False

base = LGBMRegressor(n_jobs=-1, random_state=2019)
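
To see why ```mae_from_logs``` exponentiates before computing the error, here is a minimal sketch that compares MAE on the log scale with MAE on the original scale. The loss values are made up for illustration, and the pure-Python ```mae``` helper is a hypothetical stand-in for ```mean_absolute_error```.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error for two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up losses and predictions on the original scale (illustration only).
losses = [1000.0, 2000.0, 5000.0]
preds = [1100.0, 1800.0, 5500.0]

# The model is trained on, and predicts, log-losses.
log_losses = [math.log(v) for v in losses]
log_preds = [math.log(v) for v in preds]

# MAE computed directly on the log scale is not comparable with Kaggle's MAE.
mae_log_scale = mae(log_losses, log_preds)

# Transforming back with exp() recovers the MAE in the units of the losses.
mae_original_scale = mae([math.exp(v) for v in log_losses],
                         [math.exp(v) for v in log_preds])

print(mae_log_scale, mae_original_scale)
```

The log-scale error is a small unitless number, while the back-transformed error is expressed in the same units as the original losses.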

---
## Hyperparameter space
Grid Search checks all combinations of hyperparameter values and then returns the best set. For instance, if we want to optimize two neural network hyperparameters, the number of layers $n_{layers}=[3,5,10]^T$ and the number of neurons in each layer $n_{neurons}=[64,128]^T$, then Grid Search will fit the estimator 3x2=6 times, once for each possible combination: $$(3,64), (3,128), (5,64), (5,128), (10,64), (10,128)$$
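
As a rough illustration, the same six combinations can be enumerated with the standard library. The dictionary keys ```n_layers``` and ```n_neurons``` are names chosen for this sketch only.

```python
from itertools import product

# Toy hyperparameter space from the example above (key names are illustrative).
space = {"n_layers": [3, 5, 10], "n_neurons": [64, 128]}

# Cartesian product of all value lists: one dict per candidate.
names = list(space)
candidates = [dict(zip(names, values)) for values in product(*space.values())]

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates")
```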

To avoid **overfitting**, we should not hyperoptimize on the same testing data that is used for checking the overall performance. Instead, we can take the training data and split it again or, better, use **cross-validation** and select the best set of hyperparameters based on the k-fold score. For this we are going to use the ```GridSearchCV``` implementation in scikit-learn.

We are not going to optimize all LightGBM hyperparameters, only the important ones. Furthermore, the candidate values may seem to be selected arbitrarily, but with some experience as a data scientist you will see that there exists a common set of values to check.
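
The idea behind k-fold cross-validation can be sketched in a few lines of pure Python. This is a simplified illustration with contiguous, unshuffled folds; the helper ```k_fold_indices``` is hypothetical and is not how you would call scikit-learn.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size

# Every sample is used for validation exactly once and for training k-1 times.
folds = list(k_fold_indices(n_samples=10, k=5))
for train, valid in folds:
    print("train:", train, "valid:", valid)
```

Each candidate set of hyperparameters is fitted once per fold, and the validation scores are averaged to give the cross-validated score.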

In [4]:
hyper_space = {'learning_rate': [0.001, 0.01, 0.1],
               'n_estimators': [10, 50, 250],
               'max_depth':  [4, 8, -1],
               'num_leaves': [15, 31, 127],
               'colsample_bytree': [0.6, 0.8, 1.0]}

Ideally, we would run Grid Search CV on all possible combinations, but to produce 5-fold CV scores the number of fits would be 3x3x3x3x3x5=1215. That's a lot! You can try it if you have a powerful CPU or GPU, but I propose splitting the hyperparameter space into three parts:
* Part 1: tune ```learning_rate``` and ```n_estimators```. They come together because for gradient boosting methods there is a trade-off between the learning rate and the number of trees (the smaller the learning rate, the more trees we need).
* Part 2: tune ```max_depth``` and ```num_leaves```. An unconstrained tree depth can induce overfitting. Thus, when we tune the number of leaves we should also control the maximum depth.
* Part 3: tune ```colsample_bytree```.
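
The saving from this split can be checked with a quick count. The sketch below assumes the same ```hyper_space``` and 5 folds as above; the ```stages``` list simply mirrors the three parts.

```python
from math import prod

hyper_space = {'learning_rate': [0.001, 0.01, 0.1],
               'n_estimators': [10, 50, 250],
               'max_depth': [4, 8, -1],
               'num_leaves': [15, 31, 127],
               'colsample_bytree': [0.6, 0.8, 1.0]}
cv_folds = 5

# Full grid: every combination of all five hyperparameters, fitted once per fold.
full_grid_fits = prod(len(values) for values in hyper_space.values()) * cv_folds

# Staged search: each part is a small grid of its own.
stages = [['learning_rate', 'n_estimators'],
          ['max_depth', 'num_leaves'],
          ['colsample_bytree']]
staged_fits = sum(prod(len(hyper_space[k]) for k in stage) for stage in stages) * cv_folds

print(full_grid_fits, staged_fits)
```

The staged search is a greedy shortcut: it is much cheaper, but it can miss combinations in which hyperparameters from different parts interact.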

In [5]:
hyper_space_1 = {k: hyper_space[k] for k in ['learning_rate', 'n_estimators']}
hyper_space_2 = {k: hyper_space[k] for k in ['max_depth', 'num_leaves']} 
hyper_space_3 = {k: hyper_space[k] for k in ['colsample_bytree']}

---
## Grid Search
### Part 0: Preliminaries
```GridSearchCV``` should select the best set of hyperparameters based on the cross-validated MAE calculated from log-losses. Thus, we define a custom scorer.

In [6]:
def mae_from_logs_score(y_true, y_pred):
    return mean_absolute_error(np.exp(y_true), np.exp(y_pred))

mae_from_logs_scorer = make_scorer(mae_from_logs_score, greater_is_better=False)

### Part 1: ```learning_rate``` and ```n_estimators```
Now we tune the learning rate and the number of estimators, as well as the trade-off between them. Notice that the custom scorer for grid search is passed when the ```GridSearchCV``` object is initialized, whereas the fitting parameters for LGBMRegressor can be passed directly to the ```fit``` method of ```GridSearchCV```. That is all thanks to the scikit-learn wrapper of LightGBM!

Now, run the following code snippet and wait a bit for the results. If you want more log output, change the ```verbose``` argument to 2.

In [7]:
est = GridSearchCV(base, hyper_space_1, scoring=mae_from_logs_scorer, cv=5, verbose=1)
est.fit(X_train, y_train, eval_metric=mae_from_logs)
est.best_params_, est.best_score_

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:  2.3min finished


({'learning_rate': 0.1, 'n_estimators': 250}, -1155.1192355309925)

You can examine the results of Grid Search CV by accessing the ```cv_results_``` attribute of the fitted ```GridSearchCV```.

In [8]:
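keys = ['params', 'mean_test_score', 'std_test_score', 'mean_fit_time', 'mean_score_time']
data_to_display = {k: est.cv_results_[k] for k in keys}
# Scores are negated MAE, so sort in descending order to list the best candidate first
pd.DataFrame(data_to_display).sort_values(by='mean_test_score', ascending=False)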

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_learning_rate | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 3.888909 | 0.09069 | 0.134342 | 0.007652 | 0.1 | 250 | {'learning_rate': 0.1, 'n_estimators': 250} | -1151.240322 | -1151.970104 | -1178.674841 | ... | -1155.119236 | 12.560195 | 1 | -1096.075996 | -1097.017964 | -1089.090743 | -1095.453339 | -1096.974588 | -1094.922526 | 2.974013 |
| 7 | 1.400648 | 0.008636 | 0.043737 | 0.01169 | 0.1 | 50 | {'learning_rate': 0.1, 'n_estimators': 50} | -1194.121787 | -1195.564874 | -1215.262153 | ... | -1194.660514 | 12.667367 | 2 | -1182.359816 | -1182.966887 | -1176.242977 | -1181.476787 | -1184.88972 | -1181.587237 | 2.897586 |
| 5 | 5.084647 | 0.264946 | 0.146839 | 0.012498 | 0.01 | 250 | {'learning_rate': 0.01, 'n_estimators': 250} | -1276.972272 | -1275.144329 | -1293.758606 | ... | -1275.894746 | 11.158078 | 3 | -1269.130215 | -1269.835176 | -1265.39964 | -1268.20112 | -1274.170738 | -1269.347378 | 2.844238 |
| 6 | 0.612357 | 0.006248 | 0.018744 | 0.006248 | 0.1 | 10 | {'learning_rate': 0.1, 'n_estimators': 10} | -1441.946337 | -1438.967286 | -1460.766284 | ... | -1440.420907 | 11.457151 | 4 | -1438.590882 | -1438.795357 | -1434.361492 | -1437.921265 | -1440.409883 | -1438.015776 | 2.001648 |
| 4 | 1.241092 | 0.02271 | 0.037489 | 0.007653 | 0.01 | 50 | {'learning_rate': 0.01, 'n_estimators': 50} | -1580.9287 | -1580.153071 | -1603.914835 | ... | -1582.70175 | 10.996244 | 5 | -1582.0848 | -1581.912153 | -1577.107786 | -1583.311355 | -1584.572435 | -1581.797706 | 2.533027 |
| 2 | 7.085296 | 2.302225 | 0.143451 | 0.049516 | 0.001 | 250 | {'learning_rate': 0.001, 'n_estimators': 250} | -1677.717755 | -1677.793345 | -1702.263059 | ... | -1680.557699 | 11.10378 | 6 | -1680.825689 | -1680.632047 | -1675.155251 | -1681.8029 | -1682.150261 | -1680.11323 | 2.544123 |
| 3 | 0.57468 | 0.030371 | 0.021866 | 0.007654 | 0.01 | 10 | {'learning_rate': 0.01, 'n_estimators': 10} | -1749.374195 | -1750.318356 | -1774.946446 | ... | -1753.106074 | 11.106976 | 7 | -1753.817735 | -1753.600206 | -1747.669689 | -1754.841036 | -1754.736174 | -1752.932968 | 2.676655 |
| 1 | 2.666527 | 0.045028 | 0.054905 | 0.007293 | 0.001 | 50 | {'learning_rate': 0.001, 'n_estimators': 50} | -1776.645545 | -1777.856832 | -1802.489479 | ... | -1780.71013 | 11.049997 | 8 | -1781.571243 | -1781.322129 | -1775.160959 | -1782.618409 | -1782.322121 | -1780.598972 | 2.759986 |
| 0 | 1.170227 | 0.061718 | 0.037089 | 0.010505 | 0.001 | 10 | {'learning_rate': 0.001, 'n_estimators': 10} | -1799.540957 | -1800.974053 | -1825.594592 | ... | -1803.85456 | 11.015323 | 9 | -1804.836021 | -1804.6137 | -1798.281981 | -1805.873329 | -1805.460488 | -1803.813104 | 2.801394 |
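
The scores are negative because the scorer was created with ```greater_is_better=False```: scikit-learn then flips the sign of the metric so that a higher score is always better. A minimal sketch of how to read them, using three of the ```mean_test_score``` values reported above:

```python
# mean_test_score for three (learning_rate, n_estimators) candidates,
# copied from the grid search results above
mean_test_score = {
    (0.1, 250): -1155.119236,
    (0.1, 50): -1194.660514,
    (0.01, 250): -1275.894746,
}

# The search maximizes the negated score, which is the same as minimizing MAE.
best_candidate = max(mean_test_score, key=mean_test_score.get)

# Flip the sign back to report the cross-validated MAE itself.
best_mae = -mean_test_score[best_candidate]

print(best_candidate, round(best_mae, 2))
```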


Let's visualize the 5-fold CV MAE. We will use the Plotly package with a predefined function that produces a surface plot. To check the inner workings of the function, you can open it in the GitHub repo.

In [9]:
plot_gs_surface(est, x_axis='learning_rate', y_axis='n_estimators',
                z_axis='5-fold CV MAE', greater_is_better=False)

### Part 2: ```max_depth``` and ```num_leaves```
As we have already tuned the number of estimators and the learning rate, we can proceed to the next pair of hyperparameters. Let's quickly review them:
- **max_depth**: the maximum depth of a tree, used to handle overfitting (whenever you discover that your testing score is much worse than your training score, you can lower this parameter).
- **num_leaves**: the maximum number of leaves per tree.
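
These two hyperparameters interact. Assuming binary splits, a tree of depth $d$ can have at most $2^d$ leaves, so a small ```max_depth``` caps the number of leaves regardless of ```num_leaves```. The sketch below applies this bound to the grid used here; the helper ```leaf_cap``` is hypothetical, and a non-positive ```max_depth``` is treated as unlimited, as in LightGBM.

```python
def leaf_cap(max_depth):
    """Upper bound on leaves for a binary tree; None if depth is unlimited."""
    return None if max_depth <= 0 else 2 ** max_depth

# Effective leaf limit for every (max_depth, num_leaves) pair in the grid.
effective_leaves = {}
for max_depth in [4, 8, -1]:
    cap = leaf_cap(max_depth)
    for num_leaves in [15, 31, 127]:
        effective_leaves[(max_depth, num_leaves)] = (
            num_leaves if cap is None else min(num_leaves, cap)
        )

for key, value in effective_leaves.items():
    print(key, "->", value)
```

Under this assumption, the candidates with ```max_depth=4``` and ```num_leaves``` of 31 or 127 are both limited to 16 leaves, so they can be expected to behave alike.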

In [10]:
# Update base estimator with tuned learning_rate and n_estimators
base.set_params(**est.best_params_)
# Tune max_depth and num_leaves
est = GridSearchCV(base, hyper_space_2, scoring=mae_from_logs_scorer, cv=5, verbose=1)
est.fit(X_train, y_train, eval_metric=mae_from_logs)
est.best_params_, est.best_score_
# Plot
plot_gs_surface(est, x_axis='max_depth', y_axis='num_leaves',
                z_axis='5-fold CV MAE', greater_is_better=False)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:  3.8min finished


### Part 3: ```colsample_bytree```
Lastly, we tune the feature fraction (```colsample_bytree```, an alias of LightGBM's ```feature_fraction```):
- **colsample_bytree**: LightGBM will randomly select a subset of the features on each iteration if ```feature_fraction``` is smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of the features before training each tree. Tuning it can speed up training or help to deal with overfitting.
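
Conceptually, the column subsampling looks like the sketch below. It only illustrates the idea with the standard library: the rounding rule and the random generator are assumptions made for this example and do not reproduce LightGBM's internal implementation.

```python
import random

n_features = 130          # number of columns in X_train
colsample_bytree = 0.8    # fraction of features made available to each tree

rng = random.Random(2019)

# Number of features kept per tree (the rounding rule is an assumption).
n_selected = max(1, round(n_features * colsample_bytree))

# Draw a fresh random subset of column indices for each of the first three trees.
for tree_index in range(3):
    selected = sorted(rng.sample(range(n_features), n_selected))
    print(tree_index, len(selected), selected[:5])
```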

In [11]:
# Update base estimator with tuned max_depth and num_leaves
base.set_params(**est.best_params_)
# Tune colsample_bytree
est = GridSearchCV(base, hyper_space_3, scoring=mae_from_logs_scorer, cv=5, verbose=1)
est.fit(X_train, y_train, eval_metric=mae_from_logs)
est.best_params_, est.best_score_
# Plot
plot_gs_scatter(est, x_axis='colsample_bytree', y_axis='5-fold CV MAE', greater_is_better=False)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  1.1min finished


---
## Performance
Finally, we can use the best set of hyperparameters to check the performance of the model on the testing data. The ```GridSearchCV``` object uses ```refit=True``` by default, which means the best estimator is refitted on the whole training set and we can directly use ```est``` to predict testing labels and calculate the score.

In [12]:
y_pred = est.predict(X_test)
print('The MAE calculated on testing data = {0:.2f}'.format(mae_from_logs_score(y_test, y_pred)))

The MAE calculated on testing data = 1144.19


We started with an MAE of 1152.14 and ended up with 1144.19. The gain is modest in this particular case, but we have used only the basics of hyperoptimization, and the benefit always depends on the use case. As a data scientist, it is always worth checking whether tuning the hyperparameters can improve predictive performance.
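
To put the two numbers in perspective, the improvement can be expressed in absolute and relative terms:

```python
baseline_mae = 1152.14  # LightGBM with default parameters
tuned_mae = 1144.19     # after the three-part grid search

absolute_gain = baseline_mae - tuned_mae
relative_gain = absolute_gain / baseline_mae

print(f"absolute gain: {absolute_gain:.2f}, relative gain: {relative_gain:.2%}")
```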

---
## Further notes
* You can try to optimize other hyperparameters, such as ```min_data_in_leaf```. For reference, check the LightGBM [docs](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html).
* Change the ranges of the hyperparameters. For instance, try values of ```colsample_bytree``` below 0.5 or increase ```n_estimators```.
* Try out the Random Search or TPE algorithms for hyperoptimization. They do not require computations as heavy as Grid Search.
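
As a starting point for the last suggestion, here is a minimal sketch of the idea behind Random Search: instead of fitting all 243 combinations, only a random subset is evaluated. The budget ```n_iter``` and the seed are arbitrary choices for this example; in practice, scikit-learn's ```RandomizedSearchCV``` handles the sampling and the cross-validation for you.

```python
import random
from itertools import product

hyper_space = {'learning_rate': [0.001, 0.01, 0.1],
               'n_estimators': [10, 50, 250],
               'max_depth': [4, 8, -1],
               'num_leaves': [15, 31, 127],
               'colsample_bytree': [0.6, 0.8, 1.0]}

# All combinations that a full Grid Search would have to evaluate.
all_candidates = [dict(zip(hyper_space, values))
                  for values in product(*hyper_space.values())]

# Random Search evaluates only a fixed budget of them, drawn without replacement.
n_iter = 20
rng = random.Random(2019)
sampled_candidates = rng.sample(all_candidates, n_iter)

print(len(all_candidates), "combinations in the grid,", len(sampled_candidates), "sampled")
```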