# Automatic Hyperparameter Tuning with Sklearn Using Grid and Random Search
## Grid and Random Search vs. Halving Search in Sklearn
<img src='images/4.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@wildlittlethingsphoto?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Helena Lopes</a>
        on 
        <a href='https://www.pexels.com/photo/four-person-standing-on-cliff-in-front-of-sun-697243/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Setup

In [15]:
import datetime
import json
import os
import time
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from colorama import Back, Fore, Style
from matplotlib import rcParams
from tqdm.notebook import tqdm, trange

warnings.filterwarnings("ignore")

plt.style.use("ggplot")
rcParams["axes.spines.top"] = False
rcParams["axes.spines.right"] = False
rcParams["figure.figsize"] = [12, 9]
rcParams["figure.dpi"] = 300
rcParams["figure.autolayout"] = True
rcParams["font.size"] = 16
rcParams["xtick.labelsize"] = 10
rcParams["ytick.labelsize"] = 10
custom_palette = ["#221f1f", "#b20710", "#e50914", "#f5f5f1"]
sns.set_palette(custom_palette)

%config InlineBackend.figure_format = 'retina'

pd.set_option("display.max_colwidth", 100)
pd.set_option("display.precision", 4)
pd.options.display.max_columns = 12

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

np.random.seed(1121218)

### What is a hyperparameter?

Today, algorithms that hide a world of math under the hood can be trained with only a few lines of code. Their success depends first on the data they are trained on and then on the hyperparameters chosen by the user. So, what are these hyperparameters?

Hyperparameters are user-defined values like [*k* in kNN](https://towardsdatascience.com/intro-to-scikit-learns-k-nearest-neighbors-classifier-and-regressor-4228d8d1cba6?source=your_stories_page-------------------------------------) and *alpha* in [Ridge and Lasso regression](https://towardsdatascience.com/intro-to-regularization-with-ridge-and-lasso-regression-with-sklearn-edcf4c117b7a?source=your_stories_page-------------------------------------). They directly control how the model fits the data, which means each dataset has its own set of optimal hyperparameters to be found. The most basic way of finding this set would be trying out different values by hand based on gut feeling. However, as you might guess, this method quickly becomes impractical when there are many hyperparameters to tune.

Instead, today you will learn about two methods for automatic hyperparameter tuning: Random search and Grid search. Given a set of possible values for all hyperparameters of a model, Grid search fits a model using every single combination of these hyperparameters. What is more, Grid search scores each combination with cross-validation, so the result does not depend on a single lucky train/validation split. After all combinations are tried out, the search retains the parameters that resulted in the best score so that you can use them to build your final model.
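
To make the idea concrete, here is a minimal sketch of what an exhaustive grid search does by hand. The synthetic data, the `Ridge` model and the `toy_grid` values are assumptions chosen for illustration only; they are not part of the housing example used later.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic data stands in for a real dataset (assumption for illustration)
X_toy, y_toy = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)

# A small, hypothetical grid: 3 x 2 = 6 combinations
toy_grid = {"alpha": [0.1, 1.0, 10.0], "fit_intercept": [True, False]}

results = []
for params in ParameterGrid(toy_grid):
    # Score every combination with 3-fold cross-validation
    # (regressors are scored with R2 by default)
    scores = cross_val_score(Ridge(**params), X_toy, y_toy, cv=3)
    results.append((scores.mean(), params))

# Keep the combination with the highest mean cross-validated score
best_score, best_params = max(results, key=lambda item: item[0])
print(len(results), best_params, round(best_score, 3))
```

`GridSearchCV` wraps this loop and adds conveniences such as parallel fitting and refitting the best model.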

Random search takes a slightly different approach. Instead of exhaustively trying out every single combination of hyperparameters, which can be computationally expensive and time-consuming, it randomly samples a fixed number of combinations and keeps the best one it finds.
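
The difference between the two strategies can be sketched with Scikit-learn's `ParameterGrid` and `ParameterSampler` helpers. The `toy_grid` below is a hypothetical, deliberately small search space; the only point is that random search looks at a subset of the combinations that grid search would enumerate.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space: 4 * 4 * 2 = 32 combinations
toy_grid = {
    "n_estimators": [100, 200, 300, 400],
    "max_depth": [5, 10, 20, None],
    "bootstrap": [True, False],
}

# Grid search would try every combination
all_combinations = list(ParameterGrid(toy_grid))

# Random search draws only n_iter of them
sampled = list(ParameterSampler(toy_grid, n_iter=5, random_state=0))

print(len(all_combinations), len(sampled))
for params in sampled:
    print(params)
```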

Fortunately, Scikit-learn provides `GridSearchCV` and `RandomizedSearchCV` classes that make this process a breeze. Today, you will learn all about them!

### Prepping the Data

We will be tuning a `RandomForestRegressor` model on the [Iowa housing dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). I chose Random Forests because the model has enough hyperparameters to make this guide informative, but the process you will be learning can be applied to any model in the Sklearn API. So, let's start.

In [26]:
houses_train = pd.read_csv("data/train.csv")
houses_test = pd.read_csv("data/test.csv")

houses_train.head()

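|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | ... | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|----|------------|----------|-------------|---------|--------|-----|---------|--------|--------|----------|---------------|-----------|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | ... | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | ... | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | ... | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | ... | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | ... | 0 | 12 | 2008 | WD | Normal | 250000 |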


The target is `SalePrice`. For simplicity, I will choose only numeric features:

In [51]:
X = houses_train.select_dtypes(include="number").drop("SalePrice", axis=1)
y = houses_train.SalePrice

X_test = houses_test.select_dtypes(include="number")

Both the training and test sets contain missing values. We will use `SimpleImputer` to deal with them:

In [52]:
from sklearn.impute import SimpleImputer

# Learn the column means from the training set, then apply them to both sets
imputer = SimpleImputer(strategy="mean")
X = imputer.fit_transform(X)
X_test = imputer.transform(X_test)
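
A common convention is to learn the imputation statistics from the training data only and reuse them on any other data. The toy arrays below are made up for illustration; the sketch shows that `transform` fills the gaps in the second array with the means learned from the first.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up arrays with missing values (assumption for illustration)
train_toy = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
test_toy = np.array([[np.nan, 100.0], [5.0, np.nan]])

toy_imputer = SimpleImputer(strategy="mean")
train_filled = toy_imputer.fit_transform(train_toy)  # learns the column means
test_filled = toy_imputer.transform(test_toy)  # reuses the training means

print(toy_imputer.statistics_)  # column means learned from train_toy: 2.0 and 20.0
print(test_filled)
```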

Now, let's fit a base `RandomForestRegressor` with default parameters. As we will use the test set only for final evaluation, I will create a separate validation set using the training data:

In [55]:
%%time

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3)

# Fit a base model
forest = RandomForestRegressor()

_ = forest.fit(X_train, y_train)

print(f"R2 for training set: {forest.score(X_train, y_train)}")
print(f"R2 for validation set: {forest.score(X_valid, y_valid)}\n")

R2 for training set: 0.9785951576271396
R2 for validation set: 0.832622375495487

Wall time: 1.71 s


> Note: The main focus of this article is on how to perform hyperparameter tuning. We won't worry about other topics like overfitting or feature engineering and will focus only on how to use Random and Grid search so that you can apply automatic hyperparameter tuning in a real-life setting.

We got an R2 of 0.83 on the validation set. We fit the regressor with only the default parameters, which are:

In [37]:
forest.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

That's a lot of hyperparameters. We won't be tweaking all of them but focus only on the most important ones. Specifically:

- `n_estimators`: the number of trees in the forest
- `max_features`: the number of features to consider at each node split
- `max_depth`: the maximum depth of each tree
- `min_samples_split`: the minimum number of samples required to split an internal node
- `min_samples_leaf`: the minimum number of samples required in each leaf
- `bootstrap`: whether each tree is trained on a bootstrap sample (drawn with replacement) or on the whole dataset
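
To see how such settings shape the fitted forest, here is a small sketch on synthetic data. The dataset and the specific values are assumptions for illustration; it only checks how `n_estimators` and `max_depth` show up in the trained model.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data stands in for the housing data (assumption for illustration)
X_toy, y_toy = make_regression(
    n_samples=300, n_features=8, noise=5.0, random_state=0
)

shallow = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0)
deep = RandomForestRegressor(n_estimators=20, max_depth=None, random_state=0)
_ = shallow.fit(X_toy, y_toy)
_ = deep.fit(X_toy, y_toy)

# estimators_ holds the fitted trees; get_depth() reports each tree's depth
shallow_depths = [tree.get_depth() for tree in shallow.estimators_]
deep_depths = [tree.get_depth() for tree in deep.estimators_]

print(len(shallow.estimators_))  # one tree per n_estimators
print(max(shallow_depths), max(deep_depths))  # capped vs. unrestricted depth
```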

Both Grid Search and Random Search try to find the optimal values for each of these hyperparameters. Let's see this in action, first with Random Search.

### Randomized Search with Sklearn RandomizedSearchCV

Scikit-learn provides the `RandomizedSearchCV` class to implement random search. It requires two arguments to set up: an estimator and the set of possible values for hyperparameters, called a *parameter grid* or *space*. Let's define this parameter grid for our random forest model:

In [39]:
n_estimators = np.arange(100, 2000, step=100)
max_features = ["auto", "sqrt", "log2"]
max_depth = list(np.arange(10, 100, step=10)) + [None]
min_samples_split = np.arange(2, 10, step=2)
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

param_grid = {
    "n_estimators": n_estimators,
    "max_features": max_features,
    "max_depth": max_depth,
    "min_samples_split": min_samples_split,
    "min_samples_leaf": min_samples_leaf,
    "bootstrap": bootstrap,
}

param_grid

{'n_estimators': array([ 100,  200,  300,  400,  500,  600,  700,  800,  900, 1000, 1100,
        1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900]),
 'max_features': ['auto', 'sqrt', 'log2'],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, None],
 'min_samples_split': array([2, 4, 6, 8]),
 'min_samples_leaf': [1, 2, 4],
 'bootstrap': [True, False]}

The keys of this parameter grid dictionary should be hyperparameter names spelled exactly as they appear in the model's documentation. The possible values for each can be given as a list or an array.
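
It can help to count how large this space is before searching it. The sketch below repeats the same candidate values (as plain lists, so it runs on its own) and uses `ParameterGrid` only for counting. The 3 folds and the budget of 100 sampled combinations are assumed values for the comparison.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid above, written as plain lists
search_space = {
    "n_estimators": list(range(100, 2000, 100)),
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": list(range(10, 100, 10)) + [None],
    "min_samples_split": [2, 4, 6, 8],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# ParameterGrid is used here only to count combinations, not to fit anything
n_combinations = len(ParameterGrid(search_space))

cv_folds = 3  # assumed number of cross-validation folds
n_sampled = 100  # assumed budget for a random search

exhaustive_fits = n_combinations * cv_folds
random_fits = n_sampled * cv_folds

print(n_combinations)  # 13680
print(exhaustive_fits)  # 41040
print(random_fits)  # 300
```

Under these assumptions, a random search would fit well under 1% of the models an exhaustive search over the same space would need.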

Now, let's finally import `RandomizedSearchCV` from `sklearn.model_selection` and instantiate it:

In [43]:
from sklearn.model_selection import RandomizedSearchCV

forest = RandomForestRegressor()

random_cv = RandomizedSearchCV(
    forest, param_grid, n_iter=100, cv=3, scoring="r2", n_jobs=-1
)

Apart from the estimator and the parameter grid, `RandomizedSearchCV` takes an `n_iter` parameter, which controls how many hyperparameter combinations are randomly sampled during the search. We set it to 100, so the search will sample 100 combinations and keep the best-scoring one. We are also using 3-fold cross-validation with the coefficient of determination (R2) as the scoring metric, which is also what regressors use by default. You can pass any other scoring function listed in `sklearn.metrics.SCORERS.keys()` (`sklearn.metrics.get_scorer_names()` in newer Scikit-learn releases). Now, let's start the process:

> Note: since Randomized Search performs cross-validation, we can fit it on the training data as a whole. Cross-validation creates the separate training and evaluation folds internally. Also, I am setting `n_jobs` to -1 to use all cores on my machine.

In [57]:
%%time

_ = random_cv.fit(X, y)

print("Best params:\n")
print(random_cv.best_params_)

Best params:

{'n_estimators': 800, 'min_samples_split': 4, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 20, 'bootstrap': False}
Wall time: 16min 56s


After ~17 minutes of training, the best parameters found can be accessed with the `.best_params_` attribute. We can also see the best score:

In [58]:
random_cv.best_score_

0.8690868090696587

We got an R2 of around 0.87, which is an improvement of about 0.04 over the base model's 0.83.
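
Beyond the single best score, a fitted search object keeps the results of every sampled combination in `cv_results_` and, with the default `refit=True`, a model retrained on all the data passed to `fit` in `best_estimator_`. The sketch below shows one way to inspect both. It uses synthetic data and a tiny hypothetical search space so that it runs in seconds; none of its numbers relate to the housing results above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data and a small search space (assumptions for illustration)
X_toy, y_toy = make_regression(
    n_samples=200, n_features=8, noise=5.0, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

toy_space = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

toy_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    toy_space,
    n_iter=5,
    cv=3,
    scoring="r2",
    random_state=0,
)
_ = toy_search.fit(X_tr, y_tr)

# One row per sampled combination, best-ranked first
results = pd.DataFrame(toy_search.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "std_test_score"]])

# The refitted best model can be scored on data the search never saw
holdout_r2 = toy_search.best_estimator_.score(X_val, y_val)
print(holdout_r2)
```

Scoring `best_estimator_` on a held-out split like this gives a number that is directly comparable to a base model evaluated on the same split.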

### Grid Search