<a href="https://colab.research.google.com/github/Sjoerd-de-Witte/Machine-Learning-2023/blob/main/4_6_Random_hyperparameter_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!gdown -O /tmp/ml.py 174lBNvDBJSVWs3OpNL3a68cnhWIcWYuY
%run /tmp/ml.py

Downloading...
From: https://drive.google.com/uc?id=174lBNvDBJSVWs3OpNL3a68cnhWIcWYuY
To: /tmp/ml.py
  0% 0.00/1.31k [00:00<?, ?B/s]100% 1.31k/1.31k [00:00<00:00, 4.00MB/s]


# Random Search

In this notebook, you will attempt to tune multiple hyperparameters for a study. The difficulty is that the optimal settings for one hyperparameter maybe obscured by a suboptimal setting of other hyperparameters. Therefore, multiple hyperparameters should be jointly tuned and not one after the other.

Traditionally, grid search is used to systematically try all combinations of settings for multiple hyperparameters. Since this is very inefficient, often a limited number of settings per hyperparameter are tried. In "Random search for hyper-parameter optimization", Bergsma et al. (2012) propose to use a Random search, which in practice is much more efficient. We will apply this technique to the Wine classification task and see how that works.

In [2]:
from pipetorch.data import DFrame
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
import numpy as np
import optuna

# Data

We will load and prepare the Wine dataset for binary classification over all features. We will tune a RandomForest (an ensemble consisting of decision trees) and therefore scaling is not relevant.

In [3]:
df = DFrame.read_from_kaggle('uciml/red-wine-quality-cortez-et-al-2009')
df['quality'] = df.quality > 5
df = df[['pH', 'alcohol', 'quality']]

Downloading dataset uciml/red-wine-quality-cortez-et-al-2009 from kaggle to /root/.pipetorchuser/red-wine-quality-cortez-et-al-2009


In [4]:
train_X, valid_X, train_y, valid_y = train_test_split(df.drop(columns='quality'), df.quality, test_size=0.2)

# Tune max_depth en n_estimators

Use Optuna to setup a study to tune these two hyperparameters simultaneously. You will need to write a function that performs a 'trail', that instantiates a RandomForestClassifier with the suggested values for the hyperparameters, that fits the model to the training set and returns the valdation F1 Score.

In [15]:
# complete the trial function, to sample n_estimators and max_depth from
# Optuna's trial object and run evaluate_settings to train a model and
# return the F1 score
def evaluate_settings(n_estimators, max_depth, train_X, train_y, valid_X, valid_y):
    model = RandomForestClassifier(max_depth=max_depth, n_estimators=n_estimators)
    model.fit(train_X, train_y)
    y_pred = model.predict(valid_X)
    f1 = f1_score(valid_y, y_pred)
    return f1

# Define the Optuna trial function
def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 10, 1000)
    max_depth = trial.suggest_int("max_depth", 1, 32)

    f1_score = evaluate_settings(n_estimators, max_depth, train_X, train_y, valid_X, valid_y)

    return f1_score

study = optuna.create_study()
study.optimize(objective, n_trials=10)


best_params = study.best_params
best_f1_score = study.best_value

print("Best Hyperparameters:", best_params)
print("Best F1 Score:", best_f1_score)

[I 2023-10-03 10:19:31,822] A new study created in memory with name: no-name-01e2d835-4f92-4ae1-a877-c9029a2149b1
[I 2023-10-03 10:19:35,447] Trial 0 finished with value: 0.7551622418879056 and parameters: {'n_estimators': 873, 'max_depth': 30}. Best is trial 0 with value: 0.7551622418879056.
[I 2023-10-03 10:19:36,060] Trial 1 finished with value: 0.7499999999999999 and parameters: {'n_estimators': 223, 'max_depth': 20}. Best is trial 1 with value: 0.7499999999999999.
[I 2023-10-03 10:19:37,775] Trial 2 finished with value: 0.7621951219512195 and parameters: {'n_estimators': 782, 'max_depth': 6}. Best is trial 1 with value: 0.7499999999999999.
[I 2023-10-03 10:19:38,803] Trial 3 finished with value: 0.7607361963190183 and parameters: {'n_estimators': 449, 'max_depth': 6}. Best is trial 1 with value: 0.7499999999999999.
[I 2023-10-03 10:19:42,991] Trial 4 finished with value: 0.7559523809523808 and parameters: {'n_estimators': 985, 'max_depth': 22}. Best is trial 1 with value: 0.749999

Best Hyperparameters: {'n_estimators': 924, 'max_depth': 19}
Best F1 Score: 0.744047619047619


# Tune max_depth, n_estimators and min_samples_leaf

Use Optuna to setup a study to tune these three hyperparameters simultaneously.

In [18]:
# complete the trial function, to sample n_estimators and max_depth from
# Optuna's trial object and run evaluate_settings to train a model and
# return the F1 score
def evaluate_settings(min_samples_leaf, n_estimators, max_depth, train_X, train_y, valid_X, valid_y):
    model = RandomForestClassifier(max_depth=max_depth, n_estimators=n_estimators, min_samples_leaf=min_samples_leaf)
    model.fit(train_X, train_y)
    y_pred = model.predict(valid_X)
    f1 = f1_score(valid_y, y_pred)
    return f1

# Define the Optuna trial function
def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 10, 1000)
    max_depth = trial.suggest_int("max_depth", 1, 32)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 4)

    f1_score = evaluate_settings(min_samples_leaf, n_estimators, max_depth, train_X, train_y, valid_X, valid_y)

    return f1_score

study = optuna.create_study()
study.optimize(objective, n_trials=10)


best_params = study.best_params
best_f1_score = study.best_value

print("Best Hyperparameters:", best_params)
print("Best F1 Score:", best_f1_score)

[I 2023-10-03 10:22:39,739] A new study created in memory with name: no-name-23f6ddf2-d81d-46e9-a753-913060a1c021
[I 2023-10-03 10:22:43,273] Trial 0 finished with value: 0.7311178247734139 and parameters: {'n_estimators': 699, 'max_depth': 13, 'min_samples_leaf': 2}. Best is trial 0 with value: 0.7311178247734139.
[I 2023-10-03 10:22:43,967] Trial 1 finished with value: 0.7125748502994012 and parameters: {'n_estimators': 301, 'max_depth': 30, 'min_samples_leaf': 2}. Best is trial 1 with value: 0.7125748502994012.
[I 2023-10-03 10:22:45,115] Trial 2 finished with value: 0.7261538461538461 and parameters: {'n_estimators': 515, 'max_depth': 12, 'min_samples_leaf': 2}. Best is trial 1 with value: 0.7125748502994012.
[I 2023-10-03 10:22:45,852] Trial 3 finished with value: 0.7425149700598803 and parameters: {'n_estimators': 331, 'max_depth': 28, 'min_samples_leaf': 3}. Best is trial 1 with value: 0.7125748502994012.
[I 2023-10-03 10:22:47,275] Trial 4 finished with value: 0.755952380952380

Best Hyperparameters: {'n_estimators': 301, 'max_depth': 30, 'min_samples_leaf': 2}
Best F1 Score: 0.7125748502994012


In [19]:
optuna.visualization.plot_slice(study)

In [None]:
halt_notebook()