<a href="https://colab.research.google.com/github/LeonardoGoncRibeiro/06_MachineLearning/blob/main/01_Basic/08_OptimizingModelHyperparameters2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning: Optimizing model hyperparameters using random exploration

In previous courses, we learned how to optimize model hyperparameters using a grid search. We also saw that, to correctly evaluate the accuracy of a model, we should perform a nested cross-validation. However, when we combine nested cross-validation with a grid search, we soon have to fit our model thousands of times.

This is fine if we have a simple model, but, when we use more complex models such as CatBoost, Gradient Boosting or XGBoost, model fitting starts to take a rather long time.

In these cases, we can take advantage of a randomized search. It may not find the optimal values for the hyperparameters, but it can guide us towards good values much more quickly.

In this course, we will use the following packages:


In [79]:
import time

import pandas as pd
import numpy as np
from scipy.stats import randint

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score

from sklearn.model_selection import StratifiedShuffleSplit

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score

Also, we will use the following dataset, which contains information about cars: their price, their age, their km per year, and whether they were sold or not.

In [16]:
uri = "https://gist.githubusercontent.com/guilhermesilveira/e99a526b2e7ccc6c3b70f53db43a87d2/raw/1605fc74aa778066bf2e6695e24d53cf65f2f447/machine-learning-carros-simulacao.csv"

df = pd.read_csv(uri).drop(columns=["Unnamed: 0"])
df.columns = ['Price', 'Sold', 'Age', 'Km_per_year']

df.head()

Unnamed: 0,Price,Sold,Age,Km_per_year
0,30941.02,1,18,35085.22134
1,40557.96,1,20,12622.05362
2,89627.5,0,12,11440.79806
3,95276.14,0,3,43167.32682
4,117384.68,1,4,12770.1129


# Randomized Search Cross-Validation

In the previous course, we used a Grid Search Cross-Validation to get the best set of hyper-parameters of a decision tree model. For instance, we can do:

In [17]:
y = df.Sold
X = df.drop('Sold', axis = 1)

In [18]:
parameter_space = {'max_depth' : [3, 5, 7],
                   'min_samples_split' : [32, 64, 128],
                   'min_samples_leaf' : [32, 64, 128],
                   'criterion' : ['gini', 'entropy'],
                   'splitter' : ['best', 'random']}

In [19]:
SEED = 301
np.random.seed(SEED)

model = DecisionTreeClassifier( )
cv = StratifiedKFold(n_splits = 10, shuffle = True)

grid_search = GridSearchCV(model, parameter_space, cv = cv, return_train_score = True)
grid_search.fit(X, y)
results_df = pd.DataFrame(grid_search.cv_results_)

Note that this approach requires the fitting of $3 \times 3 \times 3 \times 2 \times 2 \times 10 = 1080$ models. In the end, we can get the best model parameters:

In [20]:
results_df_summary = results_df[['params', 'mean_train_score', 'mean_test_score', 'mean_fit_time']].copy( )

results_df_summary.sort_values('mean_test_score', ascending = False).head(1)

Unnamed: 0,params,mean_train_score,mean_test_score,mean_fit_time
0,"{'criterion': 'gini', 'max_depth': 3, 'min_sam...",0.787511,0.7869,0.012217


So, in the best configuration, we have a model that shows a train score of 78.75% and a test score of 78.69%. Let's use a nested cross-validation to get a robust estimate of the model accuracy:

In [21]:
scores = cross_val_score(grid_search, X, y, cv = cv)

acc_avg = scores.mean( )*100
acc_std = scores.std( )*100

print("Accuracy: {:.2f}% - Confidence interval [{:.2f}%, {:.2f}%]".format(acc_avg, acc_avg - 2*acc_std, acc_avg + 2*acc_std))

Accuracy: 78.69% - Confidence interval [77.19%, 80.19%]


We got a very close average accuracy!



Now, instead of using a Grid Search, we can also randomly pick $n$ points in the grid, and fit our model only at those $n$ points. To do that, we use ```RandomizedSearchCV```. Let's try it:

In [30]:
SEED = 301
np.random.seed(SEED)

model = DecisionTreeClassifier( )
cv = StratifiedKFold(n_splits = 10, shuffle = True)

number_of_sets_tested = 50

random_search = RandomizedSearchCV(model, parameter_space, cv = cv, return_train_score = True, n_iter = number_of_sets_tested)
random_search.fit(X, y)
results_df = pd.DataFrame(random_search.cv_results_)

results_df_summary = results_df[['params', 'mean_train_score', 'mean_test_score', 'mean_fit_time']].copy( )

results_df_summary.sort_values('mean_test_score', ascending = False).head(1)

Unnamed: 0,params,mean_train_score,mean_test_score,mean_fit_time
0,"{'criterion': 'entropy', 'max_depth': 3, 'min_...",0.787511,0.7869,0.01561


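Note that, here, instead of evaluating all 108 sets of parameters (1080 model fits), we only evaluated 50 sets of parameters chosen at random. Since each set is still fitted once per fold, this amounts to $50 \times 10 = 500$ model fits. Still, we were able to find a model which shows a very similar accuracy!

Thus, here, the Randomized Search CV roughly halved the cost of our search. The savings grow with the size of the grid, since the number of fits in a randomized search depends only on ```n_iter``` and on the number of folds.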
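To make the comparison concrete, we can count the model fits explicitly. The snippet below is a minimal sketch that uses ```ParameterGrid``` from scikit-learn to enumerate the grid. The variable names (```n_folds```, ```grid_fits```, ```random_fits```) are only illustrative, and the final refit on the whole dataset is ignored:

```python
from sklearn.model_selection import ParameterGrid

# Same discrete parameter space used for the decision tree
parameter_space = {'max_depth': [3, 5, 7],
                   'min_samples_split': [32, 64, 128],
                   'min_samples_leaf': [32, 64, 128],
                   'criterion': ['gini', 'entropy'],
                   'splitter': ['best', 'random']}

n_folds = 10   # n_splits used in StratifiedKFold
n_iter = 50    # number of parameter sets sampled by the randomized search

# Grid search: every combination is fitted once per fold
n_grid_sets = len(ParameterGrid(parameter_space))
grid_fits = n_grid_sets * n_folds

# Randomized search: only n_iter combinations are fitted once per fold
random_fits = n_iter * n_folds

print("Grid search:       {} sets, {} fits".format(n_grid_sets, grid_fits))
print("Randomized search: {} sets, {} fits".format(n_iter, random_fits))
```

Note that the cost of the randomized search does not depend on ```n_grid_sets``` at all.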

To get a more appropriate value for the model accuracy, we should use a nested cross validation. Thus, we can do:

In [32]:
scores = cross_val_score(random_search, X, y, cv = cv)

acc_avg = scores.mean( )*100
acc_std = scores.std( )*100

print("Accuracy: {:.2f}% - Confidence interval [{:.2f}%, {:.2f}%]".format(acc_avg, acc_avg - 2*acc_std, acc_avg + 2*acc_std))

Accuracy: 78.71% - Confidence interval [76.75%, 80.67%]
Accuracy: 78.70% - Confidence interval [75.53%, 81.87%]


Nice! Still, we got a very similar average accuracy! However, here, the confidence interval was "broader", as we are more susceptible to randomness.

## Extending the parameter space

Note that, when we use a Randomized Search CV, we can use any parameter space, as long as we set a feasible number of parameter sets to be tested (which is 50, in our case). Thus, while increasing the parameter space reduces the efficiency of a Grid Search CV, here, increasing the parameter space has no effect on the number of model fits.

So, actually, let's extend our parameter space, so that we may test even more sets of parameters:

In [33]:
parameter_space = {'max_depth' : [3, 5, 7, 10, 15, 20, 30, None],
                   'min_samples_split' : randint(32, 128),           # Random integer from 32 to 127 (upper bound is exclusive)
                   'min_samples_leaf' : randint(32, 128),            # Random integer from 32 to 127 (upper bound is exclusive)
                   'criterion' : ['gini', 'entropy'],
                   'splitter' : ['best', 'random']}
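Before running the search, it may help to see what kind of candidates are drawn from such a space. The snippet below is a minimal sketch using ```ParameterSampler``` from scikit-learn. The number of samples (5) and the ```random_state``` are arbitrary choices made only for this illustration:

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Same extended parameter space: lists are sampled uniformly,
# scipy distributions are sampled through their rvs method
parameter_space = {'max_depth': [3, 5, 7, 10, 15, 20, 30, None],
                   'min_samples_split': randint(32, 128),
                   'min_samples_leaf': randint(32, 128),
                   'criterion': ['gini', 'entropy'],
                   'splitter': ['best', 'random']}

# Draw 5 candidate parameter sets (illustrative values only)
sampled_sets = list(ParameterSampler(parameter_space, n_iter=5, random_state=301))
for params in sampled_sets:
    print(params)

# The upper bound of randint is exclusive, so draws lie in [32, 127]
leaf_values = randint(32, 128).rvs(size=10000, random_state=301)
print(leaf_values.min(), leaf_values.max())
```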

Nice! Now, let's perform our randomized search:

In [34]:
SEED = 301
np.random.seed(SEED)

model = DecisionTreeClassifier( )
cv = StratifiedKFold(n_splits = 10, shuffle = True)

number_of_sets_tested = 50

random_search = RandomizedSearchCV(model, parameter_space, cv = cv, return_train_score = True, n_iter = number_of_sets_tested)
random_search.fit(X, y)
results_df = pd.DataFrame(random_search.cv_results_)

results_df_summary = results_df[['params', 'mean_train_score', 'mean_test_score', 'mean_fit_time']].copy( )

results_df_summary.sort_values('mean_test_score', ascending = False).head(1)

Unnamed: 0,params,mean_train_score,mean_test_score,mean_fit_time
0,"{'criterion': 'entropy', 'max_depth': 3, 'min_...",0.787511,0.7869,0.015672


Great! Let's see the set of parameters for the best model:

In [35]:
random_search.best_params_

{'criterion': 'entropy',
 'max_depth': 3,
 'min_samples_leaf': 71,
 'min_samples_split': 100,
 'splitter': 'best'}

Nice! Note that the best ```min_samples_leaf``` is 71, and the best ```min_samples_split``` is 100. Previously, we were not testing with these values and, thus, we would never be able to find them!

Finally, let's run a nested cross validation:

In [36]:
scores = cross_val_score(random_search, X, y, cv = cv)

acc_avg = scores.mean( )*100
acc_std = scores.std( )*100

print("Accuracy: {:.2f}% - Confidence interval [{:.2f}%, {:.2f}%]".format(acc_avg, acc_avg - 2*acc_std, acc_avg + 2*acc_std))

Accuracy: 78.71% - Confidence interval [76.75%, 80.67%]


Still, accuracy was very similar. Note that, in fact, small changes in most of the parameters barely affect the model performance. For instance, it is hard to notice any difference between ```min_samples_leaf = 100``` and ```min_samples_leaf = 101```. When we use a very large parameter space, we are also adding a lot of nearly redundant candidates to it, since many of them are very similar. Thus, part of the budget of our randomized search is spent on candidates that behave almost identically.

Thus, one should still take care when defining the parameter space, and check whether different values of a parameter actually lead to different models.
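One possible way to avoid nearly identical candidates is to sample from a coarser list of values instead of every integer in the range. The snippet below is only a sketch: the step of 16 and the name ```parameter_space_coarse``` are arbitrary assumptions, not something evaluated in this course:

```python
# Coarser candidate values: 32, 48, ..., 128 (the step of 16 is an arbitrary choice)
coarse_values = list(range(32, 129, 16))

# Hypothetical parameter space built on the coarser values
parameter_space_coarse = {'max_depth': [3, 5, 7, 10, 15, 20, 30, None],
                          'min_samples_split': coarse_values,
                          'min_samples_leaf': coarse_values,
                          'criterion': ['gini', 'entropy'],
                          'splitter': ['best', 'random']}

print(coarse_values)
```

This dictionary can be passed to ```RandomizedSearchCV``` in the same way as the previous ones.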

# Comparing Grid Search with Randomized Search

Finally, we extend our comparison between Randomized and Grid Search. First, let's define a helper function:

In [71]:
def GetResults(estimator, X, y):
  # Fix the seed so that the CV shuffling and the random sampling are reproducible
  SEED = 301
  np.random.seed(SEED)

  # Run the search (grid or randomized) and collect the cross-validation results
  estimator.fit(X, y)
  results_df = pd.DataFrame(estimator.cv_results_)

  # Keep only the columns of interest; sorting is left to the caller
  results_df_summary = results_df[['params', 'mean_train_score', 'mean_test_score', 'mean_fit_time']].copy( )

  return results_df_summary

This time, we will test our methods using a different model: a Random Forest. Let's define the parameter space:

In [55]:
parameter_space = {'max_depth' : [3, 5],
                   'min_samples_split' : [32, 64, 128],          
                   'min_samples_leaf' : [32, 64, 128],            
                   'criterion' : ['gini', 'entropy'],
                   'n_estimators' : [50, 100],
                   'bootstrap' : [True, False]}

So, in total, we have $2 \times 3 \times 3 \times 2 \times 2 \times 2 = 144$ possible combinations. Now, let's define our model and our cross validation method:

In [56]:
model = RandomForestClassifier( )
cv = StratifiedKFold(n_splits = 5, shuffle = True)

Since, for each combination, we perform 5 splits, we have, in total, $144 \times 5 = 720$ model fits.

Finally, let's run our GridSearchCV. Here, we will measure the time spent:

In [57]:
tic = time.time( )

grid_search = GridSearchCV(model, parameter_space, cv = cv, return_train_score = True)
result_df_summary = GetResults(grid_search, X, y)

tac = time.time( )
time_spent = tac - tic

print("Time spent: {:.2f} s".format(time_spent))

Time spent: 355.83 s


So, it took 355.83 s (almost 6 minutes). Now, let's get the highest test accuracy:

In [60]:
result_df_summary.sort_values('mean_test_score', ascending = False).head(1)

Unnamed: 0,params,mean_train_score,mean_test_score,mean_fit_time
142,"{'bootstrap': False, 'criterion': 'entropy', '...",0.780075,0.7788,0.381708


So, the highest test accuracy is 77.88%. We could run a nested cross-validation (for instance, with ```cross_val_score```), but it would likely take too long. Now, let's use a Randomized Search:

In [62]:
tic = time.time( )

number_of_sets_tested = 20
random_search = RandomizedSearchCV(model, parameter_space, cv = cv, return_train_score = True, n_iter = number_of_sets_tested)
result_df_summary = GetResults(random_search, X, y)

tac = time.time( )
time_spent = tac - tic

print("Time spent: {:.2f} s".format(time_spent))

Time spent: 50.05 s


Nice! Now, it took only 50 s (7 times faster). This is expected, since, here, we perform only 100 model fits (down from 720). 

Now, let's check the highest test accuracy:

In [63]:
result_df_summary.sort_values('mean_test_score', ascending = False).head(1)

Unnamed: 0,params,mean_train_score,mean_test_score,mean_fit_time
2,"{'n_estimators': 100, 'min_samples_split': 32,...",0.7797,0.7768,0.611426


Our test accuracy is 77.68%, which is very close to the highest test accuracy found using a Grid Search (77.88%)! We were able to greatly improve our efficiency, with only a very minor accuracy loss.

# Extra: Optimizing hyper-parameters using a train-test-validation split

For hyper-parameter optimization, we used methods that select the best set of parameters based on cross-validation scores (```GridSearchCV``` and ```RandomizedSearchCV```). However, we may want to perform the optimization using a train-test-validation split, which is common in scenarios where we have more data, and we are worried about possible data leakage.

First, let's perform our split:

In [66]:
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2)

Thus, here, we are performing a train-test split, where the test set has 20% of the total number of entries.

Now, we can do:

In [68]:
number_of_sets_tested = 20
random_search = RandomizedSearchCV(model, parameter_space, cv = split, return_train_score = True, n_iter = number_of_sets_tested)
result_df_summary = GetResults(random_search, X, y)

Note that we passed the split as the ```cv``` parameter for the Randomized Search. This time, instead of performing multiple splits and fitting each candidate several times, we only perform one split (train and test). Each candidate set of parameters is trained only once, and its test accuracy is evaluated on the test set.
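To check that such a splitter really yields a single train-test split, we can inspect the indices it generates. The snippet below is a minimal sketch on hypothetical toy data (```X_toy``` and ```y_toy``` are not part of the dataset, and the ```random_state``` is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 100 samples with a balanced binary target
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=301)

# The splitter yields (train indices, test indices) pairs, one pair per split
folds = list(split.split(X_toy, y_toy))
train_idx, test_idx = folds[0]

print("Number of splits:", len(folds))
print("Train size:", len(train_idx), "- Test size:", len(test_idx))
```

Any object that generates pairs of index arrays like this one can be passed as ```cv``` to the search.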

Now, let's see our results. First, let's get the set of parameters which shows the highest test score:

In [70]:
result_df_summary.sort_values('mean_test_score', ascending = False).head(1)

Unnamed: 0,params,mean_train_score,mean_test_score,mean_fit_time
19,"{'n_estimators': 50, 'min_samples_split': 32, ...",0.775125,0.7785,0.271797


Nice! However, now, we have no data left to use as a validation set. In fact, we have to set the validation set aside before running the search. Thus, let's make the following split:

*   Train: 60%
*   Test: 20%
*   Validation: 20%

Thus, we can do:



In [84]:
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size = 0.2)

Now, let's define the split used by the search. Note that, to divide the remaining 80% of our data into a train set with 60% and a test set with 20% of the original data, the test size must be $0.2 / 0.8 = 0.25$:

In [86]:
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.25)
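We can verify the resulting proportions with a quick check. The snippet below is a sketch on hypothetical toy data with 1000 samples. The ```stratify``` argument and the ```random_state``` values are assumptions added for this illustration only:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Hypothetical toy data: 1000 samples with a balanced binary target
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.array([0, 1] * 500)

# Step 1: set aside 20% of the data as the validation set
X_rest_toy, X_val_toy, y_rest_toy, y_val_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=301)

# Step 2: split the remaining 80% into train (75% of it) and test (25% of it)
split_toy = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=301)
train_idx, test_idx = next(split_toy.split(X_rest_toy, y_rest_toy))

n_total = len(X_toy)
print("Train:      {:.0%}".format(len(train_idx) / n_total))
print("Test:       {:.0%}".format(len(test_idx) / n_total))
print("Validation: {:.0%}".format(len(X_val_toy) / n_total))
```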

Finally, let's perform our randomized search using the rest of the dataset (excluding the validation set):

In [87]:
number_of_sets_tested = 20
random_search = RandomizedSearchCV(model, parameter_space, cv = split, return_train_score = True, n_iter = number_of_sets_tested)
result_df_summary = GetResults(random_search, X_rest, y_rest)

Now, we can get the test score for the best model parameters:

In [88]:
result_df_summary.sort_values('mean_test_score', ascending = False).head(1)

Unnamed: 0,params,mean_train_score,mean_test_score,mean_fit_time
13,"{'n_estimators': 100, 'min_samples_split': 32,...",0.785667,0.7915,0.461831


Now, to get the accuracy for the validation set, we can do:

In [89]:
best_forest = random_search.best_estimator_

y_pred = best_forest.predict(X_val)

In [90]:
acc = accuracy_score(y_val, y_pred)

print("Accuracy: {:.2%}".format(acc))

Accuracy: 75.90%


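Nice! Note that the accuracy on the validation set (75.90%) was lower than the accuracy on the test set (79.15%). This is expected to some extent: the best set of parameters was selected precisely because it scored well on the test set, so the test score tends to be optimistic.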