## Hyperparameter Tuning

In our pursuit of optimizing predictive performance for California housing price prediction, we turn our attention towards hyperparameter tuning.

Hyperparameters play a pivotal role in shaping the behavior and performance of machine learning models, and fine-tuning them can lead to significant improvements in predictive accuracy and generalization.

#### Loading and preparing the data

In [1]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error

In [2]:
california = fetch_california_housing()
print(california["DESCR"])

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [3]:
df_cali = pd.DataFrame(california["data"], columns = california["feature_names"])
df_cali["median_house_value"] = california["target"]

df_cali.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


#### Normalization & Feature Selection

As in the Feature Engineering lesson, we are going to normalize our data and select a subset of columns as our features.

#### Train Test Split

In [4]:
features = df_cali.drop(columns = ["median_house_value","AveOccup", "Population", "AveBedrms"])
target = df_cali["median_house_value"]

In [5]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)

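Create an instance of the normalizer and fit it on the training set only, so that no information from the test set leaks into the scaling.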

In [6]:
normalizer = MinMaxScaler()
normalizer.fit(X_train)

In [7]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)

In [8]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
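
To make the fit-on-train, transform-both pattern concrete, here is a minimal sketch on a tiny made-up array (the `toy_*` names and values are hypothetical, not taken from the housing data). `MinMaxScaler` learns each column's minimum and maximum from the training rows and reuses them for the test rows, so a test value can end up outside the [0, 1] range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one feature, three training rows and two test rows.
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[15.0], [40.0]])

toy_scaler = MinMaxScaler()
toy_scaler.fit(toy_train)  # learns min=10 and max=30 from the training rows only

toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.ravel())  # training values land inside [0, 1]
print(toy_test_scaled.ravel())   # 40 exceeds the training max, so it maps above 1
```

Values above 1 (or below 0) in the test set are expected with this approach and are not a sign of a bug.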

# Grid Search

**Grid Search** - we define a grid of hyperparameter values we want to try. Grid Search tries all possible combinations.

So far, our best model was AdaBoost, which yielded an R-squared of 0.83.


Let's see how to fine-tune this model. To do that, we will optimize the following hyperparameters:

- **n_estimators:** number of estimators, in this case, the number of trees

- **max_leaf_nodes:** maximum number of leaf nodes in each tree

- **max_depth:** maximum number of levels in each tree

- First, we define the grid of values to consider when training the possible combinations.


In [9]:
grid = {"n_estimators": [50, 100, 200],
        "estimator__max_leaf_nodes": [250, 500, 1000],
        "estimator__max_depth":[10,30,50]}

In [10]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor())

Next, we create an instance of `GridSearchCV`, a class that evaluates each combination of hyperparameters with K-Fold cross-validation. This technique gives a more reliable estimate of a model's performance by splitting the training set into K subsets called "folds".

For a given combination, the system trains K "surrogate models" "f". Each one is trained on all the folds except one, which is left out of training and used to score that model. As there are K folds, a different fold is left out for each surrogate model. The final model "F" is the one trained on the whole training set.

Because each surrogate model is trained on most of the rows of the training set, we can assume that it performs similarly to the final model. Averaging the scores of the K surrogate models therefore gives a better estimate of how well the final model will perform than a single train/validation split would, because the average is less sensitive to how any one split happens to fall.

$$\mathrm{score}(f_{i}) \approx \mathrm{score}(F)$$

$$\mathrm{score}_{\mathrm{estimation}} = \operatorname{mean}\big(\mathrm{score}(f_{1}), \mathrm{score}(f_{2}), \dots, \mathrm{score}(f_{K})\big)$$
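
The averaging in the formula above can be sketched by hand. The example below is only an illustration under stated assumptions: it uses a synthetic dataset from `make_regression` and a plain decision tree instead of the housing data and AdaBoost, and all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the California housing set).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

kfold = KFold(n_splits=5)  # K = 5 folds
fold_scores = []
for train_idx, val_idx in kfold.split(X_demo):
    # Train one model on K-1 folds and score it on the fold that was left out.
    fold_model = DecisionTreeRegressor(max_depth=5, random_state=0)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

# The cross-validated estimate is the mean of the K held-out scores.
score_estimate = float(np.mean(fold_scores))
print(fold_scores, score_estimate)
```

`GridSearchCV` repeats this loop once per hyperparameter combination and compares the resulting mean scores.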



In [17]:
model = GridSearchCV(estimator = ada_reg, param_grid = grid, cv=5, verbose=3, n_jobs=-1) # The "cv" option here is used to provide the desired number of folds K.

The next cell takes several minutes to run (~15 min): the system has to try every combination of hyperparameters (3 * 3 * 3 = 27) and, for each combination, train K (5 in our case) surrogate models, for a total of 27 * 5 = 135 fits. Once the best combination is found, one more model is trained on the whole training set with it. It is therefore critical to avoid putting too many values, or unreasonable ones, in the grid.
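
The fit count can be checked before launching the search. This is a small sketch that copies the grid values from above into a hypothetical `demo_grid` and counts the combinations with `ParameterGrid`.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid defined earlier, copied here so the snippet is self-contained.
demo_grid = {"n_estimators": [50, 100, 200],
             "estimator__max_leaf_nodes": [250, 500, 1000],
             "estimator__max_depth": [10, 30, 50]}

n_candidates = len(ParameterGrid(demo_grid))  # number of hyperparameter combinations
n_folds = 5                                   # the value passed as cv
n_cv_fits = n_candidates * n_folds            # models trained during cross-validation

print(n_candidates, n_cv_fits)  # 27 135
```

Adding a fourth value to each of the three lists would raise the count to 4 * 4 * 4 * 5 = 320 fits, which shows how quickly the cost grows.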

In [18]:
model.fit(X_train_norm, y_train)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
[CV 1/5; 1/27] START estimator__max_depth=10, estimator__max_leaf_nodes=250, n_estimators=50
[CV 1/5; 1/27] END estimator__max_depth=10, estimator__max_leaf_nodes=250, n_estimators=50;, score=0.781 total time=   3.9s
[CV 2/5; 1/27] START estimator__max_depth=10, estimator__max_leaf_nodes=250, n_estimators=50
[CV 2/5; 1/27] END estimator__max_depth=10, estimator__max_leaf_nodes=250, n_estimators=50;, score=0.790 total time=   3.9s
[CV 3/5; 1/27] START estimator__max_depth=10, estimator__max_leaf_nodes=250, n_estimators=50
[CV 3/5; 1/27] END estimator__max_depth=10, estimator__max_leaf_nodes=250, n_estimators=50;, score=0.777 total time=   4.1s
[CV 4/5; 1/27] START estimator__max_depth=10, estimator__max_leaf_nodes=250, n_estimators=50
[CV 4/5; 1/27] END estimator__max_depth=10, estimator__max_leaf_nodes=250, n_estimators=50;, score=0.785 total time=   4.1s
[CV 5/5; 1/27] START estimator__max_depth=10, estimator__max_leaf_nodes=250, n_estimators=50

- After training, we can check which combination of the tested hyperparameters was best in terms of the mean score of its surrogate models, each evaluated on its held-out fold of the training set.

In [None]:
model.best_params_
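
Beyond the single best combination, a fitted search keeps the scores of every candidate in its `cv_results_` attribute. The sketch below shows one way to inspect them; it assumes a synthetic dataset and a small one-parameter grid so that it runs in seconds, and the `demo_*` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the California housing set).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

demo_search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                           param_grid={"max_depth": [2, 4, 8]}, cv=5)
demo_search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the held-out scores, plus the rank.
results = pd.DataFrame(demo_search.cv_results_)
ranking = results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
print(ranking.sort_values("rank_test_score"))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the winner is clearly better or only marginally ahead of the runner-up.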

- Moreover, we can retrieve the best model, trained with the best parameters, through the **best_estimator_** attribute.

In [None]:
best_model = model.best_estimator_

- Let's evaluate this model on the TEST set (remember that the models were evaluated with the samples in the train set).

In [None]:
pred = best_model.predict(X_test_norm)

# scikit-learn metrics expect the true values first, then the predictions
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", root_mean_squared_error(y_test, pred))
print("R2 score", best_model.score(X_test_norm, y_test))
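
The three metrics can be reproduced by hand on a small made-up example (the `y_true` and `y_pred` values below are hypothetical). The sketch also illustrates why argument order matters: scikit-learn's metric functions expect `(y_true, y_pred)`, and while MAE and RMSE give the same result either way, R² does not.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted values, for illustration only.
y_true = np.array([3.0, 1.0, 2.0, 4.0])
y_pred = np.array([2.5, 1.0, 2.0, 5.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                   # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))            # root mean squared error
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # coefficient of determination

# Swapping the arguments changes R2 because ss_tot is computed from the first one.
r2_swapped = r2_score(y_pred, y_true)
print(mae, rmse, r2, r2_swapped)
```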

# Random Search

There is another strategy for searching for the best combination of hyperparameters with K-fold cross-validation. Instead of letting the system try every combination in `param_grid`, we provide a range of values for each hyperparameter and let the system try a number of randomly selected combinations. There is no way to know beforehand whether this approach will yield a better model than Grid Search.

**Random Search** - we define probability distributions for each hyperparameter, from which random values are sampled. It’s up to the researcher to set the maximum number of combinations.
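
As an illustration of sampling from probability distributions rather than from fixed lists, here is a minimal sketch that uses `scipy.stats.randint`. It is not the configuration used in this lesson: it assumes a synthetic dataset and a plain decision tree so that it runs quickly, and the `demo_*` names and ranges are hypothetical.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the California housing set).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# randint(low, high) draws integers from low (inclusive) to high (exclusive).
demo_distributions = {"max_depth": randint(1, 12),
                      "max_leaf_nodes": randint(5, 31)}

demo_search = RandomizedSearchCV(DecisionTreeRegressor(random_state=0),
                                 param_distributions=demo_distributions,
                                 n_iter=5,        # number of sampled combinations
                                 cv=5,
                                 random_state=0)  # makes the sampling reproducible
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_)
```

When every entry in `param_distributions` is a list, as in the cells that follow, `RandomizedSearchCV` samples combinations from those lists without replacement instead.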

In [None]:
grid = {"n_estimators": [int(x) for x in np.linspace(start = 2, stop = 20, num = 3)],
        "estimator__max_leaf_nodes": [int(x) for x in np.linspace(start = 5, stop = 30, num = 3)],
        "estimator__max_depth":[int(x) for x in np.linspace(1, 11, num = 3)]}

In [None]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor())

# n_iter specifies how many randomly selected combinations of hyperparameters will be tried.
# A higher n_iter can potentially yield better results, but increases computational time.
model = RandomizedSearchCV(estimator = ada_reg, param_distributions = grid, n_iter = 5, cv = 5)

Again, this will take some time to run.

In [None]:
model.fit(X_train_norm,y_train)

In [None]:
model.best_params_

- We can retrieve the best model, trained with the best parameters, through the **best_estimator_** attribute.

In [None]:
best_model = model.best_estimator_

- Evaluate our model

In [None]:
pred = best_model.predict(X_test_norm)

print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", root_mean_squared_error(y_test, pred))  # mean_squared_error is not imported above
print("R2 score", best_model.score(X_test_norm, y_test))

We don't guarantee that these hyperparameters are optimal! We can only guarantee that they are the best among the ones we tried!