### Why learn Optuna when we already have GridSearchCV and RandomizedSearchCV?

- GridSearchCV -> Very costly, because it tries every possible combination of the values in the grid
- RandomizedSearchCV -> Cheaper, but because it samples combinations at random, the best parameters may never be tried
- Optuna -> Uses Bayesian search, which uses the results of earlier trials to decide which hyperparameters to try next
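
To make the cost difference concrete, here is a minimal sketch that counts how many model fits an exhaustive grid would need compared with a fixed trial budget. The grid values, fold count, and trial budget are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical search grid (values chosen only for illustration)
param_grid = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [5, 10, 20, 50],
    "min_samples_split": [2, 5, 10],
}
cv_folds = 3  # assumed number of cross-validation folds

# An exhaustive grid search evaluates every combination on every fold
combinations = list(product(*param_grid.values()))
grid_fits = len(combinations) * cv_folds

# A sampling-based search fits only as many candidates as its trial budget
n_trials = 20
sampled_fits = n_trials * cv_folds

print(f"Grid search: {len(combinations)} combinations, {grid_fits} fits")
print(f"Budgeted search: {n_trials} trials, {sampled_fits} fits")
```

Adding one more hyperparameter with four candidate values would multiply the grid cost by four, while the budgeted search would still cost `n_trials * cv_folds` fits.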

**Key Terms**

1. Study
  - A Study in Optuna is an optimization session that encompasses multiple trials. It's essentially a collection of trials aimed at optimizing the objective function. You can think of a study as the overall experiment or search process
  - Eg. A Study to find the best hyperparameters for an XGBoost Model

2. Trial
  - A trial is a single iteration of the optimization process where a specific set of hyperparameters is evaluated. Each trial runs the objective function once with a distinct set of hyperparameters
  - Example : One trial could involve training a model with learning rate of 0.01 and a max depth of 5

3. Trial Parameters
  - These are the specific hyperparameter values chosen during a trial. Each trial has its own combination of hyperparameter values, which is evaluated to see how it affects the objective function
  - Example : In one trial, the learning rate might be 0.001, while the batch size is 32 and in another trial, the learning rate could be 0.01 with the batch size 64

4. Objective Function
  - The objective function is the function to be optimized (minimized or maximized) during the hyperparameter search. It takes hyperparameters as input and returns a value (such as accuracy, loss, or any other metric) that Optuna tries to optimize
  - Example : In a classification task, the objective function could be the cross-entropy loss, which Optuna seeks to minimize

5. Sampler
  - A sampler is the algorithm that suggests which hyperparameter values should be evaluated next. Optuna uses the Tree-structured Parzen Estimator (TPE) by default, but it also supports other samplers, such as random search, grid search, or custom samplers
  - Example : TPE suggests promising areas of the hyperparameter space, focusing on regions that are likely to yield better results

### Imports

In [1]:
!pip install optuna

Collecting optuna
  Downloading optuna-4.2.1-py3-none-any.whl.metadata (17 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.15.1-py3-none-any.whl.metadata (7.2 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.9-py3-none-any.whl.metadata (2.9 kB)
Downloading optuna-4.2.1-py3-none-any.whl (383 kB)
Downloading alembic-1.15.1-py3-none-any.whl (231 kB)
Downloading colorlog-6.9.0-py3-none-any.whl (11 kB)
Downloading Mako-1.3.9-py3-none-any.whl (78 kB)
Installing collected packages: Mak

In [2]:
import optuna
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/refs/heads/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThcikness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(url, names=columns)

df.head()

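| | Pregnancies | Glucose | BloodPressure | SkinThcikness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |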


### Preprocessing

In [3]:
import numpy as np

cols_with_missing_vals = ['Glucose', 'BloodPressure', 'SkinThcikness', 'Insulin', 'BMI']

df[cols_with_missing_vals] = df[cols_with_missing_vals].replace(0, np.nan)
df.fillna(df.mean(), inplace=True)

print(df.isnull().sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThcikness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


In [5]:
X = df.drop('Outcome', axis = 1)
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)

(537, 8) (231, 8)
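
The preprocessing above fills missing values with column means computed on the full dataset before the split, so the test rows influence the imputed values. One possible alternative is to fit the imputer and the scaler on the training split only. The sketch below does this on a small synthetic frame; the column names, values, and variable names are made up and are not the diabetes data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical columns, not the diabetes dataset)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "feature_a": rng.normal(100, 15, size=200),
    "feature_b": rng.normal(30, 5, size=200),
})
# Blank out 10% of feature_a to imitate missing measurements
demo.loc[demo.sample(frac=0.1, random_state=0).index, "feature_a"] = np.nan
target = rng.integers(0, 2, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, target, test_size=0.3, random_state=42
)

# Imputation and scaling statistics are learned from the training rows only
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
Xd_train_prepared = preprocess.fit_transform(Xd_train)
Xd_test_prepared = preprocess.transform(Xd_test)

print(Xd_train_prepared.shape, Xd_test_prepared.shape)
```

Placing the model at the end of the same `Pipeline` would also keep the cross-validation folds inside an objective function free of this kind of leakage.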


In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

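def objective(trial):
  # Search space: Optuna suggests one value per hyperparameter for each trial
  # Pass step by keyword; the positional form triggers a warning
  n_estimators = trial.suggest_int('n_estimators', 200, 2000, step=10)
  max_depth = trial.suggest_int('max_depth', 10, 100)

  model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
  # Mean 3-fold cross-validated accuracy on the training set is the value to maximize
  score = cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy').mean()

  return score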

### Optuna using Bayesian Sampler (TPESampler)

In [10]:
study = optuna.create_study(direction='maximize', sampler = optuna.samplers.TPESampler()) # We aim to maximize accuracy
study.optimize(objective, n_trials=50)

[I 2025-03-09 16:16:11,830] A new study created in memory with name: no-name-ff76c2bb-c482-405a-8693-344d17e6f095
[I 2025-03-09 16:16:22,437] Trial 0 finished with value: 0.7672253258845437 and parameters: {'n_estimators': 1350, 'max_depth': 54}. Best is trial 0 with value: 0.7672253258845437.
[I 2025-03-09 16:16:29,102] Trial 1 finished with value: 0.7616387337057727 and parameters: {'n_estimators': 1010, 'max_depth': 11}. Best is trial 0 with value: 0.7672253258845437.
[I 2025-03-09 16:16:44,477] Trial 2 finished with value: 0.7635009310986964 and parameters: {'n_estimators': 1730, 'max_depth': 60}. Best is trial 0 with value: 0.7672253258845437.
[I 2025-03-09 16:16:49,919] Trial 3 finished with value: 0.7709497206703911 and

In [11]:
study.best_params

{'n_estimators': 330, 'max_depth': 69}

In [12]:
study.best_value

0.7746741154562384

In [13]:
from sklearn.metrics import accuracy_score
best_model = RandomForestClassifier(**study.best_trial.params, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(test_accuracy)

0.7445887445887446


### Optuna using RandomSampler

In [15]:
study = optuna.create_study(direction='maximize', sampler = optuna.samplers.RandomSampler()) # We aim to maximize accuracy
study.optimize(objective, n_trials=10)

[I 2025-03-09 16:25:16,457] A new study created in memory with name: no-name-618bfa8c-89f1-4196-8d9f-0b86d8a22691
[I 2025-03-09 16:25:30,962] Trial 0 finished with value: 0.7653631284916201 and parameters: {'n_estimators': 1620, 'max_depth': 100}. Best is trial 0 with value: 0.7653631284916201.
[I 2025-03-09 16:25:37,910] Trial 1 finished with value: 0.7672253258845437 and parameters: {'n_estimators': 650, 'max_depth': 58}. Best is trial 1 with value: 0.7672253258845437.
[I 2025-03-09 16:25:53,097] Trial 2 finished with value: 0.7616387337057727 and parameters: {'n_estimators': 1400, 'max_depth': 77}. Best is trial 1 with value: 0.7672253258845437.
[I 2025-03-09 16:25:56,063] Trial 3 finished with value: 0.7728119180633147 and

In [16]:
study.best_params

{'n_estimators': 510, 'max_depth': 74}

In [17]:
study.best_value

0.7728119180633147

In [18]:
from sklearn.metrics import accuracy_score
best_model = RandomForestClassifier(**study.best_trial.params, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(test_accuracy)

0.7402597402597403


### Optuna using GridSampler

In [19]:
search_space = {
    'n_estimators': [100, 200, 300],  # note: 100 lies below the 200-2000 range declared in objective()
    'max_depth': [20, 30]
}

In [20]:
study = optuna.create_study(direction='maximize', sampler = optuna.samplers.GridSampler(search_space)) # We aim to maximize accuracy
study.optimize(objective, n_trials=10)

[I 2025-03-09 16:27:07,965] A new study created in memory with name: no-name-1296bd9b-1f8e-40e8-8b95-390d170596c1
[I 2025-03-09 16:27:10,831] Trial 0 finished with value: 0.7746741154562384 and parameters: {'n_estimators': 300, 'max_depth': 30}. Best is trial 0 with value: 0.7746741154562384.
[I 2025-03-09 16:27:14,745] Trial 1 finished with value: 0.7728119180633147 and parameters: {'n_estimators': 300, 'max_depth': 20}. Best is trial 0 with value: 0.7746741154562384.
[I 2025-03-09 16:27:17,781] Trial 2 finished with value: 0.7728119180633147 and parameters: {'n_estimators': 200, 'max_depth': 20}. Best is trial 0 with value: 0.7746741154562384.
[I 2025-03-09 16:27:18,997] Trial 3 finished with value: 0.7690875232774674 and pa

In [21]:
study.best_params

{'n_estimators': 300, 'max_depth': 30}

In [22]:
study.best_value

0.7746741154562384

In [23]:
from sklearn.metrics import accuracy_score
best_model = RandomForestClassifier(**study.best_trial.params, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(test_accuracy)

0.7445887445887446


### Optuna Visualizations

In [24]:
from optuna.visualization import plot_optimization_history, plot_parallel_coordinate, plot_slice, plot_contour, plot_param_importances

In [25]:
plot_optimization_history(study).show()

In [26]:
plot_parallel_coordinate(study).show()

In [27]:
plot_slice(study).show()

In [28]:
plot_contour(study).show()

In [None]:
plot_param_importances(study).show() ## Use this only with a study created by RandomSampler or TPESampler

### Optimizing Multiple ML Models

In [31]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

In [34]:
def objective(trial):

  # First choose the model family, then suggest only that family's hyperparameters
  classifier_name = trial.suggest_categorical('classifier', ['RandomForest', 'GradientBoosting', 'SVC'])

  if classifier_name == 'RandomForest':
    # Pass step by keyword; the positional form triggers a warning
    n_estimators = trial.suggest_int('n_estimators', 200, 2000, step=10)
    max_depth = trial.suggest_int('max_depth', 10, 100)
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)

  elif classifier_name == 'GradientBoosting':
    n_estimators = trial.suggest_int('n_estimators', 200, 2000, step=10)
    max_depth = trial.suggest_int('max_depth', 10, 100)
    model = GradientBoostingClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)

  elif classifier_name == 'SVC':
    # C spans many orders of magnitude, so it is sampled on a log scale
    C = trial.suggest_float('C', 1e-10, 1e10, log=True)
    model = SVC(C=C, random_state=42)

  # Mean 3-fold cross-validated accuracy is the value to maximize
  score = cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy').mean()

  return score

In [35]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)

[I 2025-03-09 16:40:24,049] A new study created in memory with name: no-name-de190a0e-82f3-4909-8c75-5ec38cafad71

suggest_int() got {'step'} as positional arguments but they were expected to be given as keyword arguments.

[I 2025-03-09 16:40:35,976] Trial 0 finished with value: 0.7635009310986964 and parameters: {'classifier': 'RandomForest', 'n_estimators': 1740, 'max_depth': 96}. Best is trial 0 with value: 0.7635009310986964.
[I 2025-03-09 16:40:36,032] Trial 1 finished with value: 0.6499068901303539 and parameters: {'classifier': 'SVC', 'C': 1.7550455352738307e-05}. Best is trial 0 with value: 0.7635009310986964.

[I 2025-03-09 16:40:39,396] Trial 2 finished with value: 0.6945996275605214 and parameters: {'classifier': 'GradientBoosting', 'n_estimators': 320, 'max_depth': 65}. Best is trial 0 with value: 0.7635009310986964.


In [36]:
best_trial = study.best_trial
print("Best params : ", best_trial.params)
print("Best value : ", best_trial.value)

Best params :  {'classifier': 'RandomForest', 'n_estimators': 320, 'max_depth': 59}
Best value :  0.7728119180633147


In [39]:
study.trials_dataframe()

| | number | value | datetime_start | datetime_complete | duration | params_C | params_classifier | params_max_depth | params_n_estimators | state |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.763501 | 2025-03-09 16:40:24.054081 | 2025-03-09 16:40:35.976408 | 0 days 00:00:11.922327 | | RandomForest | 96.0 | 1740.0 | COMPLETE |
| 1 | 1 | 0.649907 | 2025-03-09 16:40:35.977738 | 2025-03-09 16:40:36.031987 | 0 days 00:00:00.054249 | 1.8e-05 | SVC | | | COMPLETE |
| 2 | 2 | 0.6946 | 2025-03-09 16:40:36.033185 | 2025-03-09 16:40:39.396531 | 0 days 00:00:03.363346 | | GradientBoosting | 65.0 | 320.0 | COMPLETE |
| 3 | 3 | 0.761639 | 2025-03-09 16:40:39.398079 | 2025-03-09 16:40:47.887684 | 0 days 00:00:08.489605 | | RandomForest | 43.0 | 1320.0 | COMPLETE |
| 4 | 4 | 0.649907 | 2025-03-09 16:40:47.889361 | 2025-03-09 16:40:47.943379 | 0 days 00:00:00.054018 | 0.037322 | SVC | | | COMPLETE |
| 5 | 5 | 0.759777 | 2025-03-09 16:40:47.945087 | 2025-03-09 16:40:54.566431 | 0 days 00:00:06.621344 | | RandomForest | 94.0 | 1000.0 | COMPLETE |
| 6 | 6 | 0.689013 | 2025-03-09 16:40:54.568191 | 2025-03-09 16:40:54.723990 | 0 days 00:00:00.155799 | 25083.189531 | SVC | | | COMPLETE |
| 7 | 7 | 0.77095 | 2025-03-09 16:40:54.727943 | 2025-03-09 16:40:57.412919 | 0 days 00:00:02.684976 | | RandomForest | 49.0 | 270.0 | COMPLETE |
| 8 | 8 | 0.772812 | 2025-03-09 16:40:57.416958 | 2025-03-09 16:41:02.518936 | 0 days 00:00:05.101978 | | RandomForest | 59.0 | 320.0 | COMPLETE |
| 9 | 9 | 0.6946 | 2025-03-09 16:41:02.529306 | 2025-03-09 16:41:16.268963 | 0 days 00:00:13.739657 | | GradientBoosting | 97.0 | 1770.0 | COMPLETE |


In [41]:
study.trials_dataframe()['params_classifier'].value_counts()

| params_classifier | count |
|---|---|
| RandomForest | 5 |
| SVC | 3 |
| GradientBoosting | 2 |


In [42]:
study.trials_dataframe().groupby('params_classifier')['value'].mean()

| params_classifier | value |
|---|---|
| GradientBoosting | 0.6946 |
| RandomForest | 0.765736 |
| SVC | 0.662942 |
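
The count and mean per classifier can be combined with the best score per classifier in a single aggregation. The sketch below rebuilds a small frame from the classifier and value columns of the trials table above, so that it runs without the study object; in the notebook the same `groupby` could be applied to `study.trials_dataframe()` directly.

```python
import pandas as pd

# Classifier and score of each trial, copied from the trials table above
trials = pd.DataFrame({
    "params_classifier": [
        "RandomForest", "SVC", "GradientBoosting", "RandomForest", "SVC",
        "RandomForest", "SVC", "RandomForest", "RandomForest", "GradientBoosting",
    ],
    "value": [
        0.763501, 0.649907, 0.6946, 0.761639, 0.649907,
        0.759777, 0.689013, 0.77095, 0.772812, 0.6946,
    ],
})

# One row per classifier: number of trials, mean score, and best score
summary = (
    trials.groupby("params_classifier")["value"]
    .agg(["count", "mean", "max"])
    .sort_values("max", ascending=False)
)
print(summary)
```

The mean describes how a model family did on average across its sampled settings, while the maximum is the figure that decides `study.best_trial`.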


In [43]:
plot_optimization_history(study).show()

### Distributed Computing