# 📝 Exercise M6.04

The aim of this exercise is to get familiar with histogram gradient
boosting in scikit-learn. We will also use this model within a
cross-validation framework in order to inspect the internal parameters found
via grid-search.

We will use the California housing dataset.

In [1]:
from sklearn.datasets import fetch_california_housing

data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100  # rescale the target in k$

First, create a histogram gradient boosting regressor. You can set the
number of trees to be large and configure the model to use early stopping.

In [4]:
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_validate

histogram_gradient_boosting = HistGradientBoostingRegressor(
    max_iter=1000, random_state=0, early_stopping=True)

cv_results_hgbdt = cross_validate(
    histogram_gradient_boosting, data, target,
    scoring="neg_mean_absolute_error", n_jobs=2,
)

cv_results_hgbdt

{'fit_time': array([5.80690765, 6.2191062 , 6.88225055, 5.49124002, 2.09159374]),
 'score_time': array([0.07534909, 0.07658339, 0.08445168, 0.06959248, 0.03091264]),
 'test_score': array([-44.91991964, -40.63537539, -43.94305064, -41.20125617,
        -48.282484  ])}
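
The `test_score` values above are negative because scikit-learn negates error metrics so that a higher score is always better. A minimal sketch of how they could be turned back into a mean absolute error in k$ follows; the array is copied by hand from the output above, and the variable names are only illustrative.

```python
import numpy as np

# Test scores copied from the cross-validation output above.
test_score = np.array(
    [-44.91991964, -40.63537539, -43.94305064, -41.20125617, -48.282484]
)

# "neg_mean_absolute_error" is the MAE with its sign flipped, so negate it
# to recover an error expressed in k$ (lower is better).
mae = -test_score
print(f"MAE: {mae.mean():.2f} k$ +/- {mae.std():.2f} k$")
```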

We will use a grid-search to find good parameter values for this model.
In this grid-search, you should search over the following parameters:

* `max_depth: [3, 8]`;
* `max_leaf_nodes: [15, 31]`;
* `learning_rate: [0.1, 1]`.

Feel free to explore the space with additional values. Create the
grid-search providing the previous gradient boosting instance as the model.

In [12]:
from sklearn.model_selection import GridSearchCV
import pandas as pd

param_grid = {
    "max_depth": [3, 8],
    "max_leaf_nodes": [15, 31],
    "learning_rate": [0.1, 1],
}

grid_search = GridSearchCV(
    estimator=histogram_gradient_boosting,
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",
)

grid_search.fit(data, target)

results = pd.DataFrame(grid_search.cv_results_)
results

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_depth,param_max_leaf_nodes,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,1.007783,0.043401,0.0497,0.000575,0.1,3,15,"{'learning_rate': 0.1, 'max_depth': 3, 'max_le...",-46.297464,-42.077569,-46.090133,-43.128872,-47.88175,-45.095158,2.1528,3
1,1.042863,0.079973,0.052702,0.007611,0.1,3,31,"{'learning_rate': 0.1, 'max_depth': 3, 'max_le...",-46.297464,-42.077569,-46.090133,-43.128872,-47.88175,-45.095158,2.1528,3
2,0.832551,0.216217,0.029398,0.005377,0.1,8,15,"{'learning_rate': 0.1, 'max_depth': 8, 'max_le...",-45.898979,-39.469801,-44.974679,-43.455593,-47.628497,-44.28551,2.761652,2
3,1.13392,0.158181,0.034464,0.006788,0.1,8,31,"{'learning_rate': 0.1, 'max_depth': 8, 'max_le...",-44.324271,-40.965863,-44.118187,-41.92332,-47.992446,-43.864818,2.428413,1
4,0.159366,0.014774,0.008941,0.001114,1.0,3,15,"{'learning_rate': 1, 'max_depth': 3, 'max_leaf...",-47.986569,-44.420546,-47.858845,-63.053725,-55.925714,-51.84908,6.755547,5
5,0.157971,0.015353,0.008154,0.000672,1.0,3,31,"{'learning_rate': 1, 'max_depth': 3, 'max_leaf...",-47.986569,-44.420546,-47.858845,-63.053725,-55.925714,-51.84908,6.755547,5
6,0.122979,0.02222,0.005211,0.000822,1.0,8,15,"{'learning_rate': 1, 'max_depth': 8, 'max_leaf...",-48.514604,-55.222343,-53.880653,-54.932201,-56.885613,-53.887083,2.854118,8
7,0.129859,0.021437,0.004363,0.000421,1.0,8,31,"{'learning_rate': 1, 'max_depth': 8, 'max_leaf...",-53.738311,-55.136105,-51.804251,-51.065385,-57.115494,-53.771909,2.201757,7
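
In the table above, rows 0 and 1 (and rows 4 and 5) report identical test scores. A plausible explanation is that a binary tree of depth 3 can have at most 2^3 = 8 leaves, so neither `max_leaf_nodes=15` nor `max_leaf_nodes=31` is ever the binding constraint when `max_depth=3`. The sketch below only does this arithmetic; the helper name `max_leaves_for_depth` is hypothetical and not part of scikit-learn.

```python
# Subset of the grid above that controls the tree shape.
param_grid = {"max_depth": [3, 8], "max_leaf_nodes": [15, 31]}


def max_leaves_for_depth(max_depth):
    """Upper bound on the number of leaves of a binary tree of this depth."""
    return 2 ** max_depth


for max_depth in param_grid["max_depth"]:
    cap = max_leaves_for_depth(max_depth)
    for max_leaf_nodes in param_grid["max_leaf_nodes"]:
        # The tighter of the two constraints limits the number of leaves.
        effective = min(cap, max_leaf_nodes)
        print(
            f"max_depth={max_depth}, max_leaf_nodes={max_leaf_nodes}: "
            f"at most {effective} leaves"
        )
```

With `max_depth=8` the depth bound is 256 leaves, so `max_leaf_nodes` becomes the active limit and the two settings can produce different trees.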


Finally, we will run our experiment through cross-validation. To do so,
define a 5-fold cross-validation and make sure to shuffle the data.
Then use the function `sklearn.model_selection.cross_validate`
to run the cross-validation. You should also set `return_estimator=True`,
so that we can investigate the inner model trained via cross-validation.

In [17]:
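from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit

# ShuffleSplit draws 5 independent random 70/30 train/test splits.
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

# No `scoring` is passed, so the regressor's default score (R^2) is reported.
cv_results = cross_validate(histogram_gradient_boosting,
                            data, target, cv=cv, return_estimator=True)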

cv_results['test_score']

array([0.84259533, 0.83665715, 0.83714738, 0.8478605 , 0.83789599])

Now that we have the cross-validation results, print out the mean and
standard deviation of the scores.

In [18]:
import statistics

print(statistics.mean(cv_results['test_score']))
print(statistics.stdev(cv_results['test_score']))

0.8404312694272594
0.004778395162123806
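
Note that `statistics.stdev` computes the sample standard deviation (dividing by n - 1), whereas NumPy's `std` divides by n unless `ddof=1` is passed. A small sketch of the difference, using the five scores copied by hand from the output above:

```python
import statistics

import numpy as np

# Test scores copied from the cross-validation output above.
scores = np.array([0.84259533, 0.83665715, 0.83714738, 0.8478605, 0.83789599])

sample_std = statistics.stdev(scores.tolist())  # divides by n - 1
numpy_default_std = scores.std()                # divides by n (ddof=0)
numpy_sample_std = scores.std(ddof=1)           # matches statistics.stdev

print(f"mean:               {scores.mean():.4f}")
print(f"statistics.stdev:   {sample_std:.6f}")
print(f"np.std (ddof=0):    {numpy_default_std:.6f}")
print(f"np.std (ddof=1):    {numpy_sample_std:.6f}")
```

With only five folds the two conventions differ noticeably, so it helps to state which one is being reported.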


Then inspect the `estimator` entry of the results and check the best
parameter values. Also check the number of trees used by the model.

In [22]:
for estimator in cv_results['estimator']:
    print(estimator.n_iter_)

271
119
159
156
282


Inspect the results of the inner CV for each estimator of the outer CV.
Aggregate the mean test score for each parameter combination and make a box
plot of these scores.

In [24]:
for i, estimator in enumerate(cv_results['estimator']):
    print(f"estimator number {i}")
    print()
    # Early-stopping validation score recorded at each boosting iteration.
    print(estimator.validation_score_)
    print()
    print()

estimator number 0

[-6442.8677972  -5687.40931488 -5031.45464239 -4503.06168986
 -4058.11314761 -3721.38826875 -3431.91133306 -3172.00954646
 -2942.06280347 -2751.84886147 -2593.06178496 -2459.49230728
 -2341.83670222 -2238.92922161 -2139.72478408 -2058.94320613
 -1993.32893307 -1938.48472493 -1883.10931909 -1824.84187944
 -1789.95724383 -1749.01307127 -1712.56971225 -1689.2054992
 -1653.36907856 -1613.22149266 -1587.26815687 -1559.43906705
 -1532.34207442 -1512.29239761 -1493.45701392 -1478.84839232
 -1470.37617271 -1461.6680932  -1452.64793587 -1442.63877171
 -1435.81081458 -1430.75057163 -1421.49647565 -1412.32480083
 -1403.82039484 -1396.84093237 -1390.09631307 -1377.65055405
 -1368.29957631 -1363.47272842 -1360.27076529 -1352.31643792
 -1345.34795518 -1346.20200615 -1340.43810092 -1340.86812738
 -1340.37687728 -1341.14343575 -1333.37090813 -1333.83099685
 -1327.57578254 -1326.77149784 -1321.07794946 -1319.73444876
 -1314.50863841 -1311.75239983 -1306.09819802 -1300.13408331
 -129