# Hyperparameter Search

One can tune the hyperparameters of a machine learning model using Optuna. All that is needed is an objective function to minimize (or maximize) and the search space in which the parameters lie. Note: plots are not shown.

## Introduction

Import the libraries.

In [19]:
import numpy as np
np.random.seed(12345)
import pandas as pd
print(f"Numpy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
import re

Numpy version: 1.19.5
Pandas version: 1.1.5


In [20]:
import sklearn
print(f"Scikit-Learn version: {sklearn.__version__}")
from sklearn.base import clone
from sklearn.model_selection import KFold

Scikit-Learn version: 0.22.2.post1


In [3]:
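def mean_absolute_error(y, y_pred):
    """Return the mean absolute error between targets and predictions."""
    acc = 0
    for idx in range(len(y_pred)):
        # Accumulate the absolute error of each sample.
        error = abs(y[idx]-y_pred[idx])
        acc += error
    acc = acc / len(y_pred)
    return acc

def get_scores(model, X, y, k=10):
    """Return the mean absolute error of `model`, averaged over k folds."""
    kf = KFold(n_splits=k)
    errors = []
    for train_idx, test_idx in kf.split(X):
        # Fit a fresh, unfitted copy of the model on each training fold.
        model_ = clone(model)
        model_.fit(X[train_idx], y[train_idx])
        y_pred = model_.predict(X[test_idx])
        err = mean_absolute_error(y[test_idx], y_pred)
        errors.append(err)
    # Average the per-fold errors and round to two decimals.
    return round(np.mean(errors), 2)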
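The loop in `mean_absolute_error` can also be written as a single NumPy expression. The sketch below is one possible equivalent; the function name `mean_absolute_error_vectorized` and the demo arrays are hypothetical. Both inputs are flattened first, because subtracting an array of shape `(n, 1)` from one of shape `(n,)` would otherwise broadcast to an `(n, n)` matrix.

```python
import numpy as np

def mean_absolute_error_vectorized(y, y_pred):
    # Flatten both inputs so that (n, 1) and (n,) arrays line up element-wise.
    y = np.asarray(y, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean(np.abs(y - y_pred)))

# Hypothetical targets and predictions, used only for illustration.
y_true_demo = np.array([[3.0], [5.0], [2.0]])  # column vector, like y.values
y_pred_demo = np.array([2.0, 7.0, 2.0])

mae_demo = mean_absolute_error_vectorized(y_true_demo, y_pred_demo)
print(mae_demo)  # (1 + 2 + 0) / 3 = 1.0
```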

## Prepare dataset

Load the $X$ and $y$ datasets (from drive).

In [4]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)
X = pd.read_csv("/content/drive/MyDrive/data-science/solotodo-notebooks-analysis/X.csv")
y = pd.read_csv("/content/drive/MyDrive/data-science/solotodo-notebooks-analysis/y.csv")
X.drop(columns="Unnamed: 0", inplace=True)
y.drop(columns="Unnamed: 0", inplace=True)

Mounted at /content/drive


In [5]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import make_pipeline
scaler_pipeline = make_pipeline(StandardScaler(), MinMaxScaler())
X_scaled = scaler_pipeline.fit_transform(X)
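A side note on this pipeline, sketched with plain NumPy rather than scikit-learn: min-max scaling is unaffected by a preceding shift and positive rescaling of each column. For non-constant columns, standardizing first should therefore give the same result as min-max scaling alone. The helper names and the random data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, 3 features with different offsets and scales.
X_demo = rng.normal(loc=[50.0, -3.0, 0.0], scale=[10.0, 0.5, 2.0], size=(100, 3))

def standardize(a):
    # Zero mean and unit (population) standard deviation per column.
    return (a - a.mean(axis=0)) / a.std(axis=0)

def min_max(a):
    # Map each column linearly onto the interval [0, 1].
    return (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))

both_steps = min_max(standardize(X_demo))
min_max_only = min_max(X_demo)

# The two results agree up to floating-point error.
print(np.allclose(both_steps, min_max_only))
```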

## Ridge Tuning

Install Optuna in Google Colab.

In [6]:
!pip install optuna
import optuna
print(f"Optuna version: {optuna.__version__}")


Define the objective function and optimize it.

Note: both models use the same random state, so that they are compared under the same conditions.

In [7]:
from sklearn.linear_model import Ridge
def ridge_search(trial):
    alpha = trial.suggest_float("alpha", 0, 10)
    fit_intercept = trial.suggest_categorical("fit_intercept", [True, False])
    tol = trial.suggest_float("tol", 0, 1)
    random_state = trial.suggest_categorical("random_state", [555])
    model = Ridge(
        alpha=alpha, 
        fit_intercept=fit_intercept, 
        tol=tol, 
        random_state=random_state
    )
    score = get_scores(model, X=X_scaled, y=y.values, k=10)
    return score
ridge_study = optuna.create_study(direction="minimize")

[I 2021-10-14 19:25:37,476] A new study created in memory with name: no-name-40379db8-c104-4d60-94dd-b911f01bb45a


In [8]:
ridge_study.optimize(ridge_search, n_trials=500)

[I 2021-10-14 19:25:37,528] Trial 0 finished with value: 178400.74 and parameters: {'alpha': 9.819608006905868, 'fit_intercept': False, 'tol': 0.367954737447465, 'random_state': 555}. Best is trial 0 with value: 178400.74.
[I 2021-10-14 19:25:37,552] Trial 1 finished with value: 177033.66 and parameters: {'alpha': 4.008888792471018, 'fit_intercept': False, 'tol': 0.6621038007557737, 'random_state': 555}. Best is trial 1 with value: 177033.66.
[I 2021-10-14 19:25:37,577] Trial 2 finished with value: 180647.32 and parameters: {'alpha': 9.234653332988824, 'fit_intercept': True, 'tol': 0.929413971219498, 'random_state': 555}. Best is trial 1 with value: 177033.66.
[I 2021-10-14 19:25:37,602] Trial 3 finished with value: 178241.68 and parameters: {'alpha': 6.229424265577653, 'fit_intercept': True, 'tol': 0.3468997650261939, 'random_state': 555}. Best is trial 1 with value: 177033.66.
[I 2021-10-14 19:25:37,628] Trial 4 finished wi

In [21]:
print(ridge_study.best_params)

{'alpha': 3.357007933856045, 'fit_intercept': False, 'tol': 0.411850258814384, 'random_state': 555}


In [22]:
from optuna import visualization
fig = visualization.plot_optimization_history(ridge_study)
fig.show()

There seems to be no more room for a substantive improvement.
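Since the plot is not shown, the best-so-far curve that an optimization history highlights can be recomputed by hand as a running minimum. The sketch below uses only the four complete trial values visible in the Ridge log above. In a live session the full list could presumably be collected with `[t.value for t in ridge_study.trials]`, which is an assumption about how one would use it here and was not run for this notebook.

```python
import numpy as np

# Objective values of Ridge trials 0 to 3, copied from the log above.
trial_values = np.array([178400.74, 177033.66, 180647.32, 178241.68])

# Running minimum: the best (lowest) value seen up to and including each trial.
best_so_far = np.minimum.accumulate(trial_values)

for number, (value, best) in enumerate(zip(trial_values, best_so_far)):
    print(f"trial {number}: value={value:.2f}  best so far={best:.2f}")
```

A curve like this flattening out over many trials is what suggests that further search is unlikely to help much.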

## Random Forest Tuning

In [15]:
from sklearn.ensemble import RandomForestRegressor
def rfreg_search(trial):
    n_estimators = trial.suggest_int("n_estimators", 10, 1000)
    criterion = trial.suggest_categorical("criterion", ["mae"])
    max_depth = trial.suggest_int("max_depth", 1, 100)
    min_samples_split = trial.suggest_float("min_samples_split", 0.01, 1)
    min_samples_leaf = trial.suggest_float("min_samples_leaf", 0.01, 0.5)
    min_weight_fraction_leaf = trial.suggest_float("min_weight_fraction_leaf", 0, 0.5)
    max_features = trial.suggest_int("max_features", 1, 13)
    min_impurity_decrease = trial.suggest_float("min_impurity_decrease", 0, 1)
    ccp_alpha = trial.suggest_float("ccp_alpha", 0, 1)
    random_state = trial.suggest_categorical("random_state", [555])
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        criterion=criterion,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        min_weight_fraction_leaf=min_weight_fraction_leaf,
        max_features=max_features,
        min_impurity_decrease=min_impurity_decrease,
        ccp_alpha=ccp_alpha,
        random_state=random_state
    )
    score = get_scores(model, X=X.values, y=y.values.ravel(), k=10)
    return score
rfreg_study = optuna.create_study(direction="minimize")

[I 2021-10-14 19:30:04,264] A new study created in memory with name: no-name-39d6bcf9-4e27-4da4-9961-393ca24374b3


In [16]:
rfreg_study.optimize(rfreg_search, n_trials=500)

[I 2021-10-14 19:30:14,744] Trial 0 finished with value: 330686.22 and parameters: {'n_estimators': 405, 'criterion': 'mae', 'max_depth': 83, 'min_samples_split': 0.40544450093789824, 'min_samples_leaf': 0.2772112642226623, 'min_wieght_fraction_leaf': 0.2992450293530424, 'max_features': 11, 'min_impurity_decrease': 0.7448638003457366, 'ccp_alpha': 0.6169018735955101, 'random_state': 555}. Best is trial 0 with value: 330686.22.
[I 2021-10-14 19:30:23,830] Trial 1 finished with value: 378028.22 and parameters: {'n_estimators': 611, 'criterion': 'mae', 'max_depth': 38, 'min_samples_split': 0.6950614704147353, 'min_samples_leaf': 0.37407730027454994, 'min_wieght_fraction_leaf': 0.4948590423445442, 'max_features': 5, 'min_impurity_decrease': 0.7187704833720792, 'ccp_alpha': 0.747185877907856, 'random_state': 555}. Best is trial 0 with value: 330686.22.
[I 2021-10-14 19:30:24,739] Trial 2 finished with value: 371316.61 and parameters: {'n_estimators': 57, '

In [23]:
print(rfreg_study.best_params)

{'n_estimators': 508, 'criterion': 'mae', 'max_depth': 65, 'min_samples_split': 0.024371138348970756, 'min_samples_leaf': 0.010076361444197107, 'min_wieght_fraction_leaf': 0.009321734735842187, 'max_features': 8, 'min_impurity_decrease': 0.9324005001492703, 'ccp_alpha': 0.8910237262960391, 'random_state': 555}


In [24]:
fig = visualization.plot_optimization_history(rfreg_study)
fig.show()

The optimization seems to have converged to a value.