# Hyperparameter tuning

In the previous section, we did not discuss the hyperparameters of random
forest and gradient-boosting. However, a few points are worth keeping in
mind when setting them.

This notebook describes the most important hyperparameters of both random
forest and gradient-boosting decision tree models and how to set them.

<div class="admonition caution alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Caution!</p>
<p class="last">For the sake of clarity, no cross-validation is used to estimate the
testing error. We only show the effect of the parameters
on the validation set of what should be the inner cross-validation.</p>
</div>

## Random forest

The main parameter to tune for random forest is the `n_estimators` parameter.
In general, the more trees in the forest, the better the statistical
performance will be. However, adding trees slows down both fitting and
prediction. The goal is therefore to balance computing time and statistical
performance when setting the number of estimators, in particular when putting
such a learner in production.
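
The trade-off between computing time and statistical performance can be observed directly. The following is a minimal sketch, assuming a small synthetic dataset generated with `make_regression` (not the housing data used in this notebook) and an arbitrary choice of forest sizes. Timings depend on the machine, so only the trend is meaningful.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: an assumption of this sketch, not the housing dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

results = {}
for n_estimators in [5, 20, 80]:
    forest = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    start = time.perf_counter()
    forest.fit(X_train, y_train)
    fit_time = time.perf_counter() - start
    # score() returns the R2 coefficient on the held-out validation set.
    results[n_estimators] = (fit_time, forest.score(X_valid, y_valid))

for n_estimators, (fit_time, r2) in results.items():
    print(f"{n_estimators:>3} trees: fit in {fit_time:.3f} s, R2 = {r2:.3f}")
```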

The `max_depth` parameter could also be tuned. Sometimes, there is no need
to have fully grown trees. However, be aware that with random forest, trees
are generally deep: we deliberately let each tree overfit its bootstrap
sample, because combining the trees mitigates this overfitting.
Assembling underfitted trees (i.e. shallow trees) might also lead to an
underfitted forest.
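
To get a feeling for how deep the trees of a forest actually are, one can inspect the fitted trees stored in `estimators_`. This is a small sketch on synthetic data (an assumption made for illustration), comparing the default `max_depth=None` with `max_depth=3`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: an assumption of this sketch, not the housing dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

full_forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
shallow_forest = RandomForestRegressor(
    n_estimators=10, max_depth=3, random_state=0
).fit(X, y)

# Each element of `estimators_` is a fitted decision tree.
full_depths = [tree.get_depth() for tree in full_forest.estimators_]
shallow_depths = [tree.get_depth() for tree in shallow_forest.estimators_]

print("Depths with max_depth=None:", full_depths)
print("Depths with max_depth=3:   ", shallow_depths)
```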

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100  # rescale the target in k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0)

We can ask scikit-learn to display estimators as interactive diagrams with
the following command:

In [2]:
from sklearn import set_config
set_config(display='diagram')

In [3]:
%%time
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
}
grid_search = GridSearchCV(
    RandomForestRegressor(n_jobs=-1), param_grid=param_grid,
    scoring="neg_mean_absolute_error", n_jobs=-1
)
grid_search.fit(data_train, target_train)
grid_search

CPU times: user 3.4 s, sys: 212 ms, total: 3.61 s
Wall time: 5.6 s


In [4]:
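import pandas as pd

# Keep one column per tuned parameter, plus the mean score and its rank.
columns = [f"param_{name}" for name in param_grid.keys()]
columns += ["mean_test_score", "rank_test_score"]
cv_results = pd.DataFrame(grid_search.cv_results_)
# The scorer returns the negative MAE; flip the sign to get an error in k$.
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
cv_results[columns].sort_values(by="rank_test_score")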

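|   | param_n_estimators | param_max_depth | mean_test_score | rank_test_score |
|---|---|---|---|---|
| 8 | 30 | None | 34.457539 | 1 |
| 7 | 20 | None | 35.021611 | 2 |
| 6 | 10 | None | 36.27818 | 3 |
| 5 | 30 | 5 | 48.412611 | 4 |
| 4 | 20 | 5 | 48.883932 | 5 |
| 3 | 10 | 5 | 48.942815 | 6 |
| 1 | 20 | 3 | 57.222705 | 7 |
| 2 | 30 | 3 | 57.249009 | 8 |
| 0 | 10 | 3 | 57.497154 | 9 |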


We can observe that in our grid-search, fully grown trees
(`max_depth=None`) together with the largest `n_estimators` led to the best
statistical performance, with a mean absolute error of about 34.5 k$.

## Gradient-boosting decision trees

For gradient-boosting, parameters are coupled, so we can no longer set them
one after the other. The important parameters are
`n_estimators`, `max_depth`, and `learning_rate`.

Let's first discuss the `max_depth` parameter.
We saw in the section on gradient-boosting that the algorithm fits the error
of the previous tree in the ensemble. Thus, fitting fully grown trees would
be detrimental.
Indeed, the first tree of the ensemble would perfectly fit (overfit) the
training data and thus no subsequent tree would be required, since there
would be no residuals left.
Therefore, the trees used in gradient-boosting should have a low depth,
typically between 3 and 8 levels. Having weak learners at each step helps
reduce overfitting.
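
The claim that a fully grown tree leaves no residuals can be checked with a single decision tree. This is a minimal sketch on synthetic data (an assumption made for illustration): the residuals on the training set are computed for a fully grown tree and for a tree limited to 3 levels.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: an assumption of this sketch, not the housing dataset.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

max_abs_residual = {}
for max_depth in [None, 3]:
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0).fit(X, y)
    # Residuals on the training set: what a subsequent tree would have to fit.
    residuals = y - tree.predict(X)
    max_abs_residual[max_depth] = np.abs(residuals).max()

for max_depth, value in max_abs_residual.items():
    print(f"max_depth={max_depth}: largest training residual = {value:.3f}")
```

With `max_depth=None`, every training sample ends up alone in its leaf, so nothing remains for a second tree to learn from.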

With this consideration in mind, the deeper the trees, the faster the
residuals will be corrected and the fewer learners are required. Therefore,
`n_estimators` should be increased if `max_depth` is lower.
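
The interaction between tree depth and number of learners can be illustrated with the `train_score_` attribute, which stores the training loss after each boosting stage. The sketch below uses synthetic data and arbitrary depths (both are assumptions made for illustration). Note that it only looks at the training loss, which says nothing about generalization.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: an assumption of this sketch, not the housing dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

final_train_loss = {}
for max_depth in [2, 5]:
    gbdt = GradientBoostingRegressor(
        n_estimators=30, max_depth=max_depth, random_state=0
    )
    gbdt.fit(X, y)
    # Last entry of train_score_: training loss once all 30 trees are added.
    final_train_loss[max_depth] = gbdt.train_score_[-1]

for max_depth, loss in final_train_loss.items():
    print(f"max_depth={max_depth}: training loss after 30 trees = {loss:.1f}")
```

For the same number of trees, the shallower ensemble is left with a larger training loss, so it would need more estimators to reach the same level.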

Finally, we have overlooked the impact of the `learning_rate` parameter
until now. When fitting the residuals, we can let each tree correct all of
the remaining error or only a fraction of it.
The learning rate controls this behaviour: the prediction of each new tree
is multiplied by `learning_rate` before being added to the ensemble.
With a small learning rate, each tree only corrects a small fraction of the
residuals. With a large learning rate (e.g., 1), each tree tries to correct
the residuals entirely. So, with a very low learning rate, we will need
more estimators to correct the overall error. However, a learning rate that
is too large tends to produce an overfitted ensemble,
similar to having trees that are too deep.
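
The scaling role of `learning_rate` can be made visible with an ensemble of a single tree. The following sketch assumes synthetic data and the default squared-error loss, for which the initial prediction is the mean of the training target. It compares the correction applied by the first tree for two learning rates.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: an assumption of this sketch, not the housing dataset.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

first_step = {}
for learning_rate in [0.1, 1.0]:
    gbdt = GradientBoostingRegressor(
        n_estimators=1, learning_rate=learning_rate, max_depth=3, random_state=0
    )
    gbdt.fit(X, y)
    # Correction added by the single tree on top of the initial prediction
    # (the mean of the target for the default squared-error loss).
    first_step[learning_rate] = gbdt.predict(X) - y.mean()

# The same tree is learned in both cases; only its contribution is rescaled.
same_up_to_scaling = np.allclose(first_step[0.1], 0.1 * first_step[1.0])
print("Correction with 0.1 equals 0.1 x correction with 1.0:", same_up_to_scaling)
```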

In [5]:
%%time
from sklearn.ensemble import GradientBoostingRegressor

param_grid = {
    "n_estimators": [10, 30, 50],
    "max_depth": [3, 5, None],
    "learning_rate": [0.1, 1],
}
grid_search = GridSearchCV(
    GradientBoostingRegressor(), param_grid=param_grid,
    scoring="neg_mean_absolute_error", n_jobs=-1
)
grid_search.fit(data_train, target_train)

CPU times: user 2.75 s, sys: 244 ms, total: 2.99 s
Wall time: 16.5 s


In [6]:
columns = [f"param_{name}" for name in param_grid.keys()]
columns += ["mean_test_score", "rank_test_score"]
cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
cv_results[columns].sort_values(by="rank_test_score")

|   | param_n_estimators | param_max_depth | param_learning_rate | mean_test_score | rank_test_score |
|---|---|---|---|---|---|
| 5 | 50 | 5 | 0.1 | 35.669533 | 1 |
| 11 | 50 | 3 | 1.0 | 36.701843 | 2 |
| 10 | 30 | 3 | 1.0 | 37.50065 | 3 |
| 13 | 30 | 5 | 1.0 | 39.029043 | 4 |
| 4 | 30 | 5 | 0.1 | 39.277029 | 5 |
| 12 | 10 | 5 | 1.0 | 39.462326 | 6 |
| 14 | 50 | 5 | 1.0 | 39.78283 | 7 |
| 2 | 50 | 3 | 0.1 | 40.602042 | 8 |
| 9 | 10 | 3 | 1.0 | 41.589715 | 9 |
| 7 | 30 | None | 0.1 | 45.597385 | 10 |


<div class="admonition caution alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Caution!</p>
<p class="last">Here, we tune <tt class="docutils literal">n_estimators</tt>, but be aware that using early-stopping, as
in the previous exercise, is a better option.</p>
</div>
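
As a reminder of what early-stopping can look like, here is a minimal sketch. It assumes synthetic data and relies on the `n_iter_no_change` and `validation_fraction` parameters of `GradientBoostingRegressor`. The values chosen here are arbitrary, and the previous exercise may have used different settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: an assumption of this sketch, not the housing dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

gbdt = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on the number of trees
    n_iter_no_change=5,       # stop after 5 stages without improvement
    validation_fraction=0.1,  # part of the training set held out internally
    random_state=0,
)
gbdt.fit(X, y)

# n_estimators_ is the number of trees actually fitted before stopping.
print(f"Trees fitted: {gbdt.n_estimators_} (maximum allowed: 500)")
```

With this approach, `n_estimators` only needs to be large enough, and the grid-search can focus on `max_depth` and `learning_rate`.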