#### Grid Search vs. Random Search vs. Bayesian Optimization

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

# load data
digits = datasets.load_digits()

# flatten the images
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Split data into train and test subsets
X_train, X_test, y_train, y_test = train_test_split(data, digits.target, test_size=0.25, shuffle=False)

The goal is to tune the hyperparameters of a random forest classifier with grid search, random search, and Bayesian optimization.

Each method will be evaluated based on:

- The total number of trials executed
- The number of trials needed to yield the optimal hyperparameters
- The score of the model (micro-averaged F1 score in this case)
- The run time

The random forest classifier object and the search space are shown below:

In [2]:
from sklearn.ensemble import RandomForestClassifier

# random forest classifier object
rfc = RandomForestClassifier(random_state=42)

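# define search space
param_grid = {
    'n_estimators': [100, 150, 200],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    # note: 'auto' (equivalent to 'sqrt' for classifiers) was removed in scikit-learn 1.3;
    # drop it on newer versions, which shrinks the grid from 810 to 540 candidates
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [5, 6, 7]
}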

Altogether, there are 3 × 2 × 3 × 5 × 3 × 3 = 810 unique hyperparameter combinations.
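As a quick sanity check on that count, here is a minimal sketch that uses only the standard library. It copies the search space from the cell above; the names `n_candidates`, `cv_folds`, and `total_fits` are introduced here for illustration only. Multiplying by the 5 cross-validation folds used later gives the number of model fits a full grid search has to perform.

```python
from itertools import product
from math import prod

# same search space as in the cell above
param_grid = {
    'n_estimators': [100, 150, 200],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [5, 6, 7],
}

# number of combinations = product of the number of options per hyperparameter
n_candidates = prod(len(values) for values in param_grid.values())

# cross-check by explicitly enumerating every combination
assert n_candidates == len(list(product(*param_grid.values())))

# each candidate is fitted once per cross-validation fold
cv_folds = 5
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)  # 810 4050
```

scikit-learn reports the same two numbers when the grid search starts ("Fitting 5 folds for each of 810 candidates, totalling 4050 fits").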

##### 1. Grid Search

First, let’s obtain the optimal hyperparameters using the grid search method and time the process. Of course, this means that we will test all 810 hyperparameter sets and pick out the one that yields the best results.

In [3]:
import time
from sklearn.model_selection import GridSearchCV

# create grid search object
gs = GridSearchCV(estimator=rfc,
                  param_grid=param_grid,
                  scoring='f1_micro',
                  cv=5,
                  n_jobs=-1,
                  verbose=2)

# perform hyperparameter tuning (while timing the process)
time_start = time.time()
gs.fit(X_train, y_train)
time_grid = time.time() - time_start

# store result in a data frame 
values_grid = [810, gs.best_index_+1, gs.best_score_, time_grid]
columns = ['Number of iterations', 'Iteration Number of Optimal Hyperparameters', 'Score', 'Time Elapsed (s)']
results_grid = pd.DataFrame([values_grid], columns = columns)

Fitting 5 folds for each of 810 candidates, totalling 4050 fits


##### 2. Random Search

Next, we will use the random search to identify the optimal hyperparameters and time the process. The search is limited to 100 trials.

In [4]:
from sklearn.model_selection import RandomizedSearchCV

# create a random search object
rs = RandomizedSearchCV(estimator=rfc,
                        param_distributions=param_grid,
                        scoring='f1_micro',
                        cv=5,
                        n_jobs=-1,
                        verbose=2,
                        n_iter=100)

# perform hyperparameter tuning (while timing the process)
time_start = time.time()
rs.fit(X_train, y_train)
time_random = time.time() - time_start

# store result in a data frame
values_random = [100, rs.best_index_+1, rs.best_score_, time_random]
results_random = pd.DataFrame([values_random], columns = columns)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


##### 3. Bayesian Optimization

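Finally, we perform hyperparameter tuning with Bayesian optimization and time the process. In Python, this can be accomplished with the Optuna library, here using its TPE (Tree-structured Parzen Estimator) sampler.

Its syntax differs from that of scikit-learn: instead of passing a parameter grid to a search object, we define an objective function that draws hyperparameters from a trial and returns the cross-validated score to maximize.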

For the sake of consistency, we will use 100 trials in this procedure as well.

In [6]:
import optuna 
from optuna.samplers import TPESampler
from sklearn.model_selection import cross_val_score

def objective(trial):
    """Return the mean cross-validated micro-averaged F1 score for one trial."""

    # search space
    n_estimators =  trial.suggest_int('n_estimators', low=100, high=200, step=50)
    criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
    min_samples_split = trial.suggest_int('min_samples_split', low=2, high=4, step=1)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', low=1, high=5, step=1)
    max_depth = trial.suggest_int('max_depth', low=5, high=7, step=1)
    max_features = trial.suggest_categorical('max_features', ['auto', 'sqrt','log2'])

    # random forest classifier object
    rfc = RandomForestClassifier(n_estimators=n_estimators,
                                 criterion=criterion,
                                 min_samples_split=min_samples_split,
                                 min_samples_leaf=min_samples_leaf,
                                 max_depth=max_depth,
                                 max_features=max_features,
                                 random_state=42)
    score = cross_val_score(estimator=rfc,
                            X=X_train,
                            y=y_train,
                            scoring='f1_micro',
                            cv=5,
                            n_jobs=-1).mean()
    
    return score

# create a study (aim to maximize score)
study = optuna.create_study(sampler=TPESampler(), direction='maximize')

# perform hyperparameter tuning (while timing the process)
time_start = time.time()
study.optimize(objective, n_trials=100)
time_bayesian = time.time() - time_start

# store result in a data frame
# (note: Optuna trial numbers are 0-based, whereas best_index_+1 above is 1-based)
values_bayesian = [100, study.best_trial.number, study.best_trial.value, time_bayesian]
results_bayesian = pd.DataFrame([values_bayesian], columns = columns)

[I 2022-05-03 18:06:22,939] A new study created in memory with name: no-name-89ee3c92-a1fb-4e7c-8f0e-2115369deb7c
[I 2022-05-03 18:06:26,131] Trial 0 finished with value: 0.9131488365689109 and parameters: {'n_estimators': 150, 'criterion': 'entropy', 'min_samples_split': 3, 'min_samples_leaf': 3, 'max_depth': 5, 'max_features': 'auto'}. Best is trial 0 with value: 0.9131488365689109.
[I 2022-05-03 18:06:27,320] Trial 1 finished with value: 0.9116673550874295 and parameters: {'n_estimators': 150, 'criterion': 'entropy', 'min_samples_split': 2, 'min_samples_leaf': 3, 'max_depth': 6, 'max_features': 'log2'}. Best is trial 0 with value: 0.9131488365689109.
[I 2022-05-03 18:06:28,376] Trial 2 finished with value: 0.9131653586672174 and parameters: {'n_estimators': 100, 'criterion': 'gini', 'min_samples_split': 4, 'min_samples_leaf': 2, 'max_depth': 6, 'max_features': 'sqrt'}. Best is trial 2 with value: 0.9131653586672174.
[I 2022-05

Now that we have executed hyperparameter tuning with all three approaches, let’s see how the results of each method compare to each other.

For convenience, we will store the results of all 3 hyperparameter tuning procedures in a single data frame.

In [7]:
# store all results in a single data frame
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df = pd.concat([results_grid, results_random, results_bayesian])
df.index = ['Grid Search', 'Random Search', 'Bayesian Optimization']
df

| | Number of iterations | Iteration Number of Optimal Hyperparameters | Score | Time Elapsed (s) |
|---|---|---|---|---|
| Grid Search | 810 | 680 | 0.935426 | 488.608124 |
| Random Search | 100 | 43 | 0.933196 | 63.926313 |
| Bayesian Optimization | 100 | 49 | 0.934685 | 69.563363 |


The grid search registered the highest score (0.9354), narrowly ahead of the Bayesian optimization method (0.9347). However, it required carrying out all 810 trials and only reached the best hyperparameters at the 680th iteration. Also, its run time (about 489 seconds) far exceeded that of the random search and the Bayesian optimization methods.



The random search method required only 100 trials and found its best hyperparameter set at the 43rd iteration. It also took the least amount of time to execute. However, it registered the lowest score of the 3 methods.



The Bayesian optimization also performed 100 trials and reached its best score, only marginally below that of the grid search, at trial 49, far fewer than the grid search's 680 iterations. Although it executed the same number of trials as the random search, it took slightly longer (about 70 seconds versus 64 seconds), since an informed search method spends extra time using earlier results to choose each new trial.
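To put those comparisons into numbers, here is a small sketch that derives a few ratios from the results table. The figures are copied from the table above and come from a single run, so treat them as indicative rather than definitive; the `results` and `summary` dictionaries are names introduced here for illustration.

```python
# figures copied from the results table above (single run)
results = {
    'Grid Search': {'trials': 810, 'score': 0.935426, 'seconds': 488.608124},
    'Random Search': {'trials': 100, 'score': 0.933196, 'seconds': 63.926313},
    'Bayesian Optimization': {'trials': 100, 'score': 0.934685, 'seconds': 69.563363},
}

baseline = results['Grid Search']
summary = {}

for name, r in results.items():
    summary[name] = {
        # how many times faster than the exhaustive grid search
        'speedup': baseline['seconds'] / r['seconds'],
        # how much score is given up relative to the grid search
        'score_gap': baseline['score'] - r['score'],
        # average wall-clock time per trial
        'seconds_per_trial': r['seconds'] / r['trials'],
    }
    s = summary[name]
    print(f"{name}: {s['speedup']:.1f}x speedup, "
          f"score gap {s['score_gap']:.6f}, "
          f"{s['seconds_per_trial']:.2f} s per trial")
```

In this run, both 100-trial methods finish roughly 7 to 8 times faster than the grid search while giving up less than 0.003 in score, and the Bayesian optimization has the highest average time per trial.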

If you hate diplomatic answers and just want my personal opinion, I would say that I usually favor Bayesian optimization.

Given the run time needed for fine-tuning models with larger training data sets and search spaces, I usually shun the grid search. The random search requires fewer iterations and is the fastest of all 3 methods, but its level of success depends on which hyperparameter sets happen to be selected at random. In some cases, the sample will include the optimal hyperparameters; in other cases, it will miss them completely. Due to this inconsistency, I do not like relying on randomness for bigger machine learning tasks.
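How likely a miss is can be estimated with a little combinatorics. The sketch below assumes the candidates are drawn uniformly without replacement, which is how scikit-learn's `RandomizedSearchCV` samples when every parameter is given as a list, and it treats the "best" configurations as a fixed set of `k` candidates. The helper `prob_hit_top_k` is a hypothetical name introduced here, not part of any library.

```python
from math import comb


def prob_hit_top_k(n_total: int, n_draws: int, k: int) -> float:
    """Probability that n_draws candidates sampled without replacement
    from n_total include at least one of the k best candidates."""
    # probability that all draws come from the (n_total - k) non-top candidates
    p_miss_all = comb(n_total - k, n_draws) / comb(n_total, n_draws)
    return 1.0 - p_miss_all


# search space and budget used in this article: 810 candidates, 100 trials
for k in (1, 5, 10, 40):
    p = prob_hit_top_k(n_total=810, n_draws=100, k=k)
    print(f"chance of sampling at least one of the top {k}: {p:.1%}")
```

For `k = 1` the formula reduces to 100/810, so a 100-trial random search has only about a 12% chance of sampling the single best combination. The chance rises quickly as `k` grows, which is why random search often lands close to the optimum without finding it exactly.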



I prefer the Bayesian optimization approach for its ability to reach near-optimal hyperparameters with fewer iterations. Its individual iterations may take more time than those of the uninformed search methods, but that is rarely a deal-breaker for me.