
# Hyperparameter tuning using Optuna

In [1]:
import seaborn as sns

In [2]:
import pandas as pd

In [3]:
healthexp=sns.load_dataset('healthexp')
healthexp.head()

| | Year | Country | Spending_USD | Life_Expectancy |
|---|---|---|---|---|
| 0 | 1970 | Germany | 252.311 | 70.6 |
| 1 | 1970 | France | 192.143 | 72.2 |
| 2 | 1970 | Great Britain | 123.993 | 71.9 |
| 3 | 1970 | Japan | 150.437 | 72.0 |
| 4 | 1970 | USA | 326.961 | 70.9 |


In [4]:
# convert categorical to numeric
healthexp=pd.get_dummies(healthexp,dtype=int)

In [5]:
healthexp.head()

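| | Year | Spending_USD | Life_Expectancy | Country_Canada | Country_France | Country_Germany | Country_Great Britain | Country_Japan | Country_USA |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1970 | 252.311 | 70.6 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1970 | 192.143 | 72.2 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1970 | 123.993 | 71.9 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1970 | 150.437 | 72.0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1970 | 326.961 | 70.9 | 0 | 0 | 0 | 0 | 0 | 1 |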
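
To make the encoding step concrete, here is a minimal sketch on a hypothetical three-row frame (the `toy` data is made up for illustration and is not taken from `healthexp`). It shows how `pd.get_dummies` replaces a string column with one 0/1 column per category and leaves numeric columns untouched.

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    "Year": [1970, 1970, 1971],
    "Country": ["Germany", "Japan", "Germany"],
})

# dtype=int gives 0/1 integers instead of booleans
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each new column is named `<original column>_<category>`, which is where names such as `Country_Great Britain` come from.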


In [6]:
# create X and y
X=healthexp.drop('Life_Expectancy',axis=1)
y=healthexp['Life_Expectancy']

In [7]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=54)

In [8]:
from sklearn.ensemble import RandomForestRegressor

In [9]:
model=RandomForestRegressor(random_state=34)
model.fit(X_train,y_train)

In [10]:
y_preds=model.predict(X_test)

In [11]:
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
print(mean_squared_error(y_test,y_preds))
print(mean_absolute_error(y_test,y_preds))
print(r2_score(y_test,y_preds))

0.1553235999999905
0.31138181818180044
0.9836234548107303
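
As a reference for the three metrics printed above, this sketch computes them by hand with NumPy and checks the results against scikit-learn. The arrays are made-up numbers chosen to keep the arithmetic easy; they are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Illustrative values only
y_true = np.array([70.0, 72.0, 74.0, 76.0])
y_hat = np.array([70.5, 71.5, 74.0, 77.0])

err = y_true - y_hat

mse = np.mean(err ** 2)                          # mean of squared errors
mae = np.mean(np.abs(err))                       # mean of absolute errors
ss_res = np.sum(err ** 2)                        # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # share of variance explained

print(mse, mae, r2)
print(mean_squared_error(y_true, y_hat),
      mean_absolute_error(y_true, y_hat),
      r2_score(y_true, y_hat))
```

MSE and MAE are errors (lower is better), while R² is a score (higher is better, with 1.0 being a perfect fit).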


In [12]:
!pip install optuna

Collecting optuna
  Downloading optuna-3.6.1-py3-none-any.whl (380 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.13.1-py3-none-any.whl (233 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.8.2-py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.3-py3-none-any.whl (78 kB)
Installing collected packages: Mako, colorlog, alembic, optuna
Successfully installed Mako-1.3.3 alembic-1.13.1 colorlog-6.8.2 optuna-3.6.1


In [18]:
# import Optuna
import optuna

In [14]:
from sklearn.model_selection import cross_val_score

In [31]:
# define the objective function for Optuna
def objective(trial):
    # search space for the random forest
    n_estimators=trial.suggest_int('n_estimators',100,1000)
    max_depth=trial.suggest_int('max_depth',10,50)
    min_samples_split=trial.suggest_int('min_samples_split',2,32)
    # the key is spelled 'min_smaples_leaf' (sic); later cells look it up under this spelling
    min_samples_leaf=trial.suggest_int('min_smaples_leaf',1,32)

    model=RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=21)

    # score with 5-fold cross-validation on the training data only,
    # so the test set is never used during tuning
    score=cross_val_score(model,X_train,y_train,cv=5,scoring='neg_mean_squared_error',n_jobs=-1).mean()
    # 'neg_mean_squared_error' is the negated MSE: values closer to 0 are better
    return score

In [25]:
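# the objective returns the negated MSE (higher is better), so the study should maximize it.
# The logged output in this notebook came from a run created with directions=['minimize'].
study=optuna.create_study(direction='maximize',sampler=optuna.samplers.RandomSampler(seed=42))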


[I 2024-04-22 16:51:06,589] A new study created in memory with name: no-name-d56bc315-357b-41ed-a6a1-48d800fba298


In [32]:
study.optimize(objective,n_trials=200)

[I 2024-04-22 16:54:48,381] Trial 3 finished with value: -2.3103640559247993 and parameters: {'n_estimators': 437, 'max_depth': 48, 'min_samples_split': 24, 'min_smaples_leaf': 20}. Best is trial 3 with value: -2.3103640559247993.
[I 2024-04-22 16:54:51,548] Trial 4 finished with value: -2.906414579705588 and parameters: {'n_estimators': 240, 'max_depth': 16, 'min_samples_split': 3, 'min_smaples_leaf': 28}. Best is trial 4 with value: -2.906414579705588.
[I 2024-04-22 16:54:56,483] Trial 5 finished with value: -3.3554945071958877 and parameters: {'n_estimators': 641, 'max_depth': 39, 'min_samples_split': 2, 'min_smaples_leaf': 32}. Best is trial 5 with value: -3.3554945071958877.
[I 2024-04-22 16:55:03,570] Trial 6 finished with value: -1.0175242099148512 and parameters: {'n_estimators': 850, 'max_depth': 18, 'min_samples_split': 7, 'min_smaples_leaf': 6}. Best is trial 5 with value: -3.3554945071958877.
[I 2024-04-22 16:55:06,156] Trial 7 finished with value: -1.573749748791981 and pa

In general, running more trials gives the sampler more chances to find good hyperparameters and a better score, at the cost of a longer run time. 100 trials is a reasonable starting point.
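
The values in the log above are negated MSEs, so it is worth checking which direction counts as "better". The sketch below uses the four complete values from the log and shows which trial `min` and `max` would each select. The dictionary is a hand-copied excerpt, not Optuna output.

```python
# Objective values copied from the log above (trial number -> negated MSE)
logged = {
    3: -2.3103640559247993,
    4: -2.906414579705588,
    5: -3.3554945071958877,
    6: -1.0175242099148512,
}

picked_by_minimize = min(logged, key=logged.get)
picked_by_maximize = max(logged, key=logged.get)

# Negating the score recovers the ordinary, non-negative MSE
mse = {trial: -value for trial, value in logged.items()}

print("minimize picks trial", picked_by_minimize, "with MSE", mse[picked_by_minimize])
print("maximize picks trial", picked_by_maximize, "with MSE", mse[picked_by_maximize])
```

Minimising the negated score selects trial 5, which has the largest error of the four, while maximising selects trial 6, which has the smallest. This matches the log, where trial 5 stays "best" even after trial 6 reports a value closer to zero.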

In [33]:
# getting our best hyperparameters
study.best_params

{'n_estimators': 625,
 'max_depth': 13,
 'min_samples_split': 32,
 'min_smaples_leaf': 32}

In [34]:
best_params=study.best_params

## Optuna visualisation

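This notebook uses four of Optuna's visualisations:
1. `optuna.visualization.plot_optimization_history(study)`
2. `optuna.visualization.plot_parallel_coordinate(study)`
3. `optuna.visualization.plot_slice(study)`
4. `optuna.visualization.plot_param_importances(study)`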

In [35]:
import matplotlib.pyplot as plt

In [37]:
optuna.visualization.plot_optimization_history(study)

The optimisation history plots the trial number (here 0 to 200) on the x-axis against each trial's objective value on the y-axis.

The next plot is less useful in practice, but it is included for completeness.

In [39]:
optuna.visualization.plot_parallel_coordinate(study)

This is a parallel coordinate plot. Each line is one trial, drawn across one axis for the objective value and one axis per hyperparameter, with every axis spanning that quantity's minimum to maximum. It shows which combinations work and which do not: lighter lines are the worse trials and darker lines are the better ones.

In [41]:
optuna.visualization.plot_slice(study,params=['n_estimators','max_depth','min_samples_split','min_smaples_leaf'])

In [42]:
optuna.visualization.plot_param_importances(study)

## Create a new model

In [45]:
best_n_estimators=best_params['n_estimators']
best_max_depth=best_params['max_depth']
best_min_samples_split=best_params['min_samples_split']
best_min_samples_leaf=best_params['min_smaples_leaf']

In [47]:
best_model=RandomForestRegressor(
    n_estimators=best_n_estimators,
    max_depth=best_max_depth,
    min_samples_split=best_min_samples_split,
    min_samples_leaf=best_min_samples_leaf)
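
An alternative to unpacking each value by hand is to pass the dictionary straight to the estimator with `**`. That only works when every key matches a `RandomForestRegressor` argument name, so the misspelled key has to be renamed first. A sketch, using a hand-copied version of the `study.best_params` output shown earlier, a hypothetical helper named `to_model_kwargs`, and an added `random_state=21` (an assumption, included only to make the fit reproducible):

```python
from sklearn.ensemble import RandomForestRegressor

# Copied by hand from the study.best_params output in this notebook
params = {
    "n_estimators": 625,
    "max_depth": 13,
    "min_samples_split": 32,
    "min_smaples_leaf": 32,  # misspelled key, as stored by the study
}

def to_model_kwargs(p):
    """Return a copy with the misspelled key renamed to the real argument name."""
    fixed = dict(p)
    fixed["min_samples_leaf"] = fixed.pop("min_smaples_leaf")
    return fixed

# random_state is an addition for reproducibility, not part of best_params
model_from_dict = RandomForestRegressor(**to_model_kwargs(params), random_state=21)
print(model_from_dict.get_params()["min_samples_leaf"])
```

Naming each `trial.suggest_*` parameter exactly after the estimator argument avoids the renaming step altogether.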

In [48]:
best_model.fit(X_train,y_train)

In [49]:
y_pred=best_model.predict(X_test)

In [50]:
mean_squared_error(y_test,y_pred)

2.9561974642137976

In [51]:
mean_absolute_error(y_test,y_pred)

1.340928399140877

In [52]:
r2_score(y_test,y_pred)

0.6883132932722098

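That covers hyperparameter tuning with Optuna. Note that in the recorded run the tuned model scores worse on the test set than the untuned baseline (R² of 0.688 against 0.984, MSE of 2.956 against 0.155). This is what minimising a negated MSE produces: the study picks the trial with the largest error, which is why `best_params` sits at the most restrictive end of the search space, with `min_samples_split` and the leaf size both at 32. For more detail, refer to this blog post: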

https://medium.com/@ethannabatchian/optimizing-random-forest-models-a-deep-dive-into-hyperparameter-tuning-with-optuna-b8e4fe7f3670