# 📝 Exercise M6.01

The aim of this notebook is to investigate if we can tune the hyperparameters
of a bagging regressor and evaluate the gain obtained.

We will load the California housing dataset and split it into a training and
a testing set.

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target in k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0, test_size=0.5)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
dtypes: float64(8)
memory usage: 1.3 MB


<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

We can display an interactive diagram with the following command:

In [3]:
from sklearn import set_config
set_config(display='diagram')

Create a `BaggingRegressor` and pass a `DecisionTreeRegressor` as its
`base_estimator` parameter (named `estimator` in recent scikit-learn
releases). Train the regressor and evaluate its statistical performance on
the testing set using the mean absolute error.

In [4]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

tree = DecisionTreeRegressor()
bagging = BaggingRegressor(base_estimator=tree, n_jobs=2)
bagging

In [5]:
%%time
from sklearn.metrics import mean_absolute_error

bagging.fit(data_train, target_train)
target_predicted = bagging.predict(data_test)
print(f"Basic mean absolute error of the bagging regressor:\n"
    f"{mean_absolute_error(target_test, target_predicted):.2f} k$")

Basic mean absolute error of the bagging regressor:
37.42 k$
CPU times: user 33.7 ms, sys: 67.7 ms, total: 101 ms
Wall time: 894 ms


Now, create a `RandomizedSearchCV` instance using the previous model and
tune the important parameters of the bagging regressor. Find the best
parameters and check whether any set of parameters improves on the default
regressor, still using the mean absolute error as the metric.

<div class="admonition tip alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Tip</p>
<p class="last">You can list the bagging regressor's parameters using the <tt class="docutils literal">get_params</tt>
method.</p>
</div>

In [6]:
for param in bagging.get_params().keys():
    print(param)

base_estimator__ccp_alpha
base_estimator__criterion
base_estimator__max_depth
base_estimator__max_features
base_estimator__max_leaf_nodes
base_estimator__min_impurity_decrease
base_estimator__min_impurity_split
base_estimator__min_samples_leaf
base_estimator__min_samples_split
base_estimator__min_weight_fraction_leaf
base_estimator__random_state
base_estimator__splitter
base_estimator
bootstrap
bootstrap_features
max_features
max_samples
n_estimators
n_jobs
oob_score
random_state
verbose
warm_start


In [12]:
%%time
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "n_estimators": randint(10, 30),
    "max_samples": [0.5, 0.8, 1.0],
    "max_features": [0.5, 0.8, 1.0],
    "base_estimator__max_depth": randint(3, 10)
}
search = RandomizedSearchCV(
    bagging, param_grid, n_iter=20, 
    scoring="neg_mean_absolute_error"
)
_ = search.fit(data_train, target_train)

CPU times: user 2.11 s, sys: 183 ms, total: 2.29 s
Wall time: 17.8 s
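
To see what kind of candidates a randomized search draws from such a search space, here is a minimal sketch using `ParameterSampler`. It only illustrates the sampling step: no model is fitted, the reduced dictionary `param_distributions` is a stand-in for the `param_grid` above, and `random_state=0` is an arbitrary choice made for reproducibility.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Reduced, illustrative search space (assumption: a subset of the one above).
param_distributions = {
    "n_estimators": randint(10, 30),  # integers from 10 to 29, upper bound excluded
    "max_samples": [0.5, 0.8, 1.0],   # a list is sampled uniformly
}

# Draw three candidate settings from the search space.
candidates = list(
    ParameterSampler(param_distributions, n_iter=3, random_state=0)
)
for candidate in candidates:
    print(candidate)

# Check the bounds of the integer distribution empirically.
draws = randint(10, 30).rvs(size=1000, random_state=0)
print(draws.min(), draws.max())
```

Note that `randint(3, 10)` therefore only proposes depths from 3 to 9, so the search above never tries trees deeper than 9 levels.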


In [15]:
import pandas as pd

columns = [f"param_{name}" for name in param_grid.keys()]
columns += ["mean_test_score", "std_test_score", "rank_test_score"]
cv_results = pd.DataFrame(search.cv_results_)
cv_results = cv_results[columns].sort_values(by="rank_test_score")
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
cv_results

| | param_n_estimators | param_max_samples | param_max_features | param_base_estimator__max_depth | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|
| 19 | 28 | 0.8 | 0.8 | 8 | 40.195642 | 0.699337 | 1 |
| 4 | 24 | 0.8 | 1.0 | 8 | 40.937565 | 1.149383 | 2 |
| 10 | 20 | 1.0 | 1.0 | 8 | 41.065162 | 1.234174 | 3 |
| 7 | 20 | 0.8 | 1.0 | 8 | 41.197871 | 1.25496 | 4 |
| 8 | 12 | 0.5 | 1.0 | 6 | 45.309351 | 1.012026 | 5 |
| 12 | 27 | 0.8 | 1.0 | 6 | 45.473691 | 1.202343 | 6 |
| 11 | 27 | 0.8 | 0.5 | 8 | 46.596409 | 1.654841 | 7 |
| 13 | 23 | 0.5 | 0.5 | 8 | 47.075243 | 0.855827 | 8 |
| 9 | 17 | 0.5 | 0.5 | 7 | 47.261583 | 1.065918 | 9 |
| 14 | 24 | 0.5 | 1.0 | 5 | 47.829583 | 1.425732 | 10 |


In [16]:
target_predicted = search.predict(data_test)
print(f"Mean absolute error after tuning of the bagging regressor:\n"
    f"{mean_absolute_error(target_test, target_predicted):.2f} k$")

Mean absolute error after tuning of the bagging regressor:
40.76 k$
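
The exercise asks for the best parameters, which a fitted search exposes through its `best_params_` and `best_score_` attributes. The following is a minimal sketch, assuming a small synthetic dataset from `make_regression` instead of the housing data, and a much smaller search (`n_iter=4`, `cv=3`) to keep it fast. The `prefix` variable is a helper introduced here only to cope with the `base_estimator` / `estimator` naming difference between scikit-learn releases.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: used only to keep the sketch small).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# The tree is passed positionally because the keyword name differs between
# scikit-learn releases (`base_estimator` in older ones, `estimator` in newer).
bagging = BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0)
prefix = "estimator" if "estimator" in bagging.get_params() else "base_estimator"

param_distributions = {
    "n_estimators": randint(5, 15),
    f"{prefix}__max_depth": randint(3, 10),
}
search = RandomizedSearchCV(
    bagging, param_distributions, n_iter=4, cv=3,
    scoring="neg_mean_absolute_error", random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_mae = -search.best_score_  # flip the sign back to a positive error
print(best_params)
print(f"Best cross-validated MAE: {best_mae:.2f}")
```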


Here the tuned model is no better than the default one: it reaches a mean absolute error of 40.76 k$ on the testing set, against 37.42 k$ without tuning. We see that the bagging regressor provides a predictor for which fine tuning is not as important as when fitting a single decision tree.
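
One possible reason the tuned model (40.76 k$) does not beat the default (37.42 k$) is that the search only tried trees with `max_depth` below 10, whereas the default tree has no depth limit. The sketch below compares both settings by cross-validation. It is an illustration under stated assumptions, not a reproduction of the results above: it uses synthetic data from `make_regression`, and the values `cv=3` and `n_estimators=10` are arbitrary choices, so the ranking it prints need not match the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the California housing dataset).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

maes = {}
for max_depth in (None, 3):  # None means the trees are grown without a depth limit
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    # The tree is passed positionally to avoid the keyword name, which
    # differs between scikit-learn releases.
    model = BaggingRegressor(tree, n_estimators=10, random_state=0)
    scores = cross_val_score(
        model, X, y, cv=3, scoring="neg_mean_absolute_error"
    )
    maes[max_depth] = -scores.mean()  # convert back to a positive error
    print(f"max_depth={max_depth}: MAE = {maes[max_depth]:.2f}")
```

Running the same comparison on the housing training set would be the way to check whether the depth cap explains the gap observed in this exercise.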