# Automatic Hyperparameter Tuning with Sklearn Using Grid and Random Search
## Grid and Random Search vs. Halving Search in Sklearn
<img src='images/4.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@wildlittlethingsphoto?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Helena Lopes</a>
        on 
        <a href='https://www.pexels.com/photo/four-person-standing-on-cliff-in-front-of-sun-697243/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Setup

In [15]:
import datetime
import json
import os
import time
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from colorama import Back, Fore, Style
from matplotlib import rcParams
from tqdm.notebook import tqdm, trange

warnings.filterwarnings("ignore")

plt.style.use("ggplot")
rcParams["axes.spines.top"] = False
rcParams["axes.spines.right"] = False
rcParams["figure.figsize"] = [12, 9]
rcParams["figure.dpi"] = 300
rcParams["figure.autolayout"] = True
rcParams["font.style"] = 16
rcParams["xtick.labelsize"] = 10
rcParams["ytick.labelsize"] = 10
custom_palette = ["#221f1f", "#b20710", "#e50914", "#f5f5f1"]
sns.set_palette(custom_palette)

%config InlineBackend.figure_format = 'retina'

pd.set_option("max_colwidth", 100)
pd.set_option("display.precision", 4)
pd.options.display.max_columns = 12

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

np.random.seed(1121218)

### What is a hyperparameter?

Today, algorithms that hide a world of math under the hood can be trained with only a few lines of code. Their success depends first on the data trained and then, on what hyperparameters that were used by the user. So, what are these hyperparameters?

Hyperparameters are user-defined values like [*k* in kNN](https://towardsdatascience.com/intro-to-scikit-learns-k-nearest-neighbors-classifier-and-regressor-4228d8d1cba6?source=your_stories_page-------------------------------------) and *alpha* in [Ridge and Lasso regression](https://towardsdatascience.com/intro-to-regularization-with-ridge-and-lasso-regression-with-sklearn-edcf4c117b7a?source=your_stories_page-------------------------------------). They strictly control the fit of the model and this means, for each dataset, there is a unique set of optimal hyperparameters to be found. The most basic way of finding this perfect set would be randomly trying out different values based on gut feeling. However, as you might guess, this method quickly becomes useless when there are many hyperparameters to tune.

Fortunately, Scikit-learn provides 4 different, yet similar classes that make hyperparameter tuning a breeze. Today, you will get a hands-on knowledge of two of these classes, namely `GridSearchCV` and `RandomSearchCV`.

### Prepping the Data

Today, we will be tuning a RandomForestRegressor model on [Iowa housing dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). I chose Random Forests because it has large enough hyperparameters that make this guide more informative but the process you will be learning can be applied to any model in the Sklearn API. So, let's start. 

In [26]:
houses_train = pd.read_csv("data/train.csv")
houses_test = pd.read_csv("data/test.csv")

houses_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,...,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,...,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,...,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,...,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,...,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,...,0,12,2008,WD,Normal,250000


The target is `SalePrice`. For simplicity, I will choose only numeric features:

In [33]:
X = houses_train.select_dtypes(include="number").drop("SalePrice", axis=1)
y = houses_train.SalePrice

X_test = houses_test.select_dtypes(include="number")

We will create separate training and validation sets from `X` and `y` and use `X_test` only for final evaluation:

In [35]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3)

Now, let's fit a base `RandomForestRegressor` with default parameters after imputing the missing values:

In [36]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

# Impute both train and test sets
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train)
X_valid = imputer.fit_transform(X_valid)

# Fit a base model
forest = RandomForestRegressor()

_ = forest.fit(X_train, y_train)

print(f"R2 for training set: {forest.score(X_train, y_train)}")
print(f"R2 for validation set: {forest.score(X_valid, y_valid)}")

R2 for training set: 0.9761145782229289
R2 for validation set: 0.877708998710343


> Note: The main focus of this article is on how to perform hyperparameter tuning. We won't worry about other topics but only narrow down on how to use Random and Grid search so that you can apply automatic hyperparameter tuning in real-life setting.

We got a 0.87 for R2 on the test set. We fit the regressor only with default parameters which are:

In [37]:
forest.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

That's a lot of hyperparameters. We won't be tweaking all of them but focus only on the most important ones. Specifically:

- `n_esimators` - number of trees to be used
- `max_feauters` - the number of features to use at each node plit
- `max_depth`: the number of leaves on each tree
- `min_samples_split`: the minimum number of samples required to split an internal node
- `min_samples_leaf`: the minimum number of samples in each leaf
- `bootstrap`: method of sampling - with or without replacement.

### Randomized Search with Sklearn RandomSearchCV

### Halving Grid and Randomized Searches

### Which one to choose?