# Hyperparameter tuning with Optuna

## Introduction

In this notebook, you will learn why tuning hyperparameters matters. You will first try to optimize the parameters manually, and then see how to automate the search using Optuna.


## Install Dependencies and Stable Baselines3 Using Pip

```bash
pip install "stable-baselines3[extra]"
```

In [None]:
!pip install stable-baselines3 sb3-contrib optuna --quiet

## Imports

In [3]:
import gym
import numpy as np

The first thing you need to import is the RL model. Check the documentation to see which algorithm can be used on which problem.

In [4]:
from stable_baselines3 import PPO, A2C, SAC, TD3, DQN
from sb3_contrib import QRDQN, TQC
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
import torch.nn as nn

# Automatic Hyperparameter Tuning

In this part, we will create a script that searches for the best hyperparameters automatically.
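To make "searching automatically" concrete, here is a minimal sketch of the loop that a tuner automates: sample a configuration, evaluate it, keep the best one. It uses plain random search on a hypothetical `toy_objective` (a stand-in for "train an agent and return its mean reward"), not Optuna's own algorithm. Optuna keeps the same overall loop but can replace the random sampler with a smarter one (such as TPE) and stop unpromising trials early.

```python
import random
from typing import Tuple


def toy_objective(learning_rate: float) -> float:
    """Hypothetical stand-in for 'train an agent and return its mean reward'."""
    # Peaks at learning_rate = 0.01; purely illustrative.
    return -((learning_rate - 0.01) ** 2)


def random_search(n_trials: int, seed: int = 0) -> Tuple[float, float]:
    """Sample, evaluate, keep the best: the loop a tuner automates."""
    rng = random.Random(seed)
    best_lr, best_value = float("nan"), float("-inf")
    for _ in range(n_trials):
        # Log-uniform sample between 1e-5 and 1
        lr = 10 ** rng.uniform(-5, 0)
        value = toy_objective(lr)
        if value > best_value:
            best_lr, best_value = lr, value
    return best_lr, best_value


best_lr, best_value = random_search(n_trials=100)
print(f"best learning_rate={best_lr:.5f}, value={best_value:.6f}")
```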

### Imports

In [5]:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.visualization import plot_optimization_history, plot_param_importances

### Config

In [6]:
N_TRIALS = 100  # Maximum number of trials
N_JOBS = 1 # Number of jobs to run in parallel
N_STARTUP_TRIALS = 5  # Stop random sampling after N_STARTUP_TRIALS
N_EVALUATIONS = 2  # Number of evaluations during the training
N_TIMESTEPS = int(2e4)  # Training budget
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_ENVS = 5
N_EVAL_EPISODES = 10
TIMEOUT = int(60 * 15)  # 15 minutes

ENV_ID = "CartPole-v1"

DEFAULT_HYPERPARAMS = {
    "policy": "MlpPolicy",
    "env": ENV_ID,
}
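The relationship between `N_TIMESTEPS`, `N_EVALUATIONS` and `EVAL_FREQ` can be checked with a little arithmetic. The sketch below assumes that the evaluation callback counts callback calls rather than timesteps, and that each call advances every training environment by one step. In this notebook the model is created from an environment ID, so it presumably trains on a single environment and the two counts coincide. The helper `eval_schedule` and its `n_train_envs` parameter are hypothetical names used only for illustration.

```python
from typing import List, Tuple


def eval_schedule(
    n_timesteps: int, n_evaluations: int, n_train_envs: int = 1
) -> Tuple[int, List[int]]:
    """Return (eval_freq in callback calls, timesteps at which evaluations happen)."""
    eval_freq = int(n_timesteps / n_evaluations)
    # Each callback call advances every training env by one step, so divide
    # by the number of training envs to keep the same timestep budget.
    eval_freq_calls = max(eval_freq // n_train_envs, 1)
    n_calls = n_timesteps // n_train_envs
    eval_points = [
        call * n_train_envs
        for call in range(1, n_calls + 1)
        if call % eval_freq_calls == 0
    ]
    return eval_freq_calls, eval_points


print(eval_schedule(20_000, 2))     # (10000, [10000, 20000])
print(eval_schedule(20_000, 2, 4))  # (2500, [10000, 20000])
```

With the values from the config cell, this gives two evaluations per trial, at 10,000 and 20,000 timesteps.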

### Define the search space

In [7]:
from typing import Any, Dict
import torch
import torch.nn as nn

def sample_a2c_params(trial: optuna.Trial) -> Dict[str, Any]:
    """
    Sampler for A2C hyperparameters.

    :param trial: Optuna trial object
    :return: The sampled hyperparameters for the given trial.
    """
    # Discount factor between 0.9 and 0.9999
    gamma = 1.0 - trial.suggest_float("gamma", 0.0001, 0.1, log=True)
    max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 5.0, log=True)
    # 8, 16, 32, ... 1024
    n_steps = 2 ** trial.suggest_int("exponent_n_steps", 3, 10)


    # Learning rate between 1e-5 and 1 (log scale)
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1, log=True)
    net_arch = trial.suggest_categorical("net_arch", choices=["tiny", "small"])
    activation_fn = trial.suggest_categorical("activation_fn", choices=["tanh", "relu"])

    # Display true values
    trial.set_user_attr("gamma_", gamma)
    trial.set_user_attr("n_steps", n_steps)

    net_arch = [
        {"pi": [64], "vf": [64]}
        if net_arch == "tiny"
        else {"pi": [64, 64], "vf": [64, 64]}
    ]

    activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU}[activation_fn]

    return {
        "n_steps": n_steps,
        "gamma": gamma,
        "learning_rate": learning_rate,
        "max_grad_norm": max_grad_norm,
        "policy_kwargs": {
            "net_arch": net_arch,
            "activation_fn": activation_fn,
        },
    }
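The `gamma` line in the sampler does not sample the discount factor directly. It samples `1 - gamma` on a log scale and subtracts the result from 1. The sketch below reproduces that transform with the standard library only (no Optuna) to show its effect: the samples spread roughly evenly over the decades of `1 - gamma`, so values close to 1 such as 0.999 are explored as often as values near 0.9. The helper name `sample_gamma` is hypothetical.

```python
import math
import random


def sample_gamma(rng: random.Random, low: float = 1e-4, high: float = 0.1) -> float:
    """Sample 1 - gamma log-uniformly in [low, high], then return gamma."""
    one_minus_gamma = math.exp(rng.uniform(math.log(low), math.log(high)))
    return 1.0 - one_minus_gamma


rng = random.Random(0)
gammas = [sample_gamma(rng) for _ in range(10_000)]

# Roughly one third of the samples should fall in each decade of 1 - gamma
for lower, upper in [(0.9, 0.99), (0.99, 0.999), (0.999, 0.9999)]:
    share = sum(lower <= g < upper for g in gammas) / len(gammas)
    print(f"gamma in [{lower}, {upper}): {share:.2f}")
```

Sampling `gamma` uniformly in [0.9, 0.9999] instead would put about 90% of the samples below 0.99.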

### Define the evaluation callback

First we define a custom callback to report the results of periodic evaluations to Optuna:

In [8]:
from stable_baselines3.common.callbacks import EvalCallback

class TrialEvalCallback(EvalCallback):
    """
    Callback used for evaluating and reporting a trial.

    :param eval_env: Evaluation environment
    :param trial: Optuna trial object
    :param n_eval_episodes: Number of evaluation episodes
    :param eval_freq: Evaluate the agent every ``eval_freq`` calls of the callback.
    :param deterministic: Whether the evaluation should
        use a stochastic or deterministic policy.
    :param verbose: Verbosity level
    """

    def __init__(
        self,
        eval_env: gym.Env,
        trial: optuna.Trial,
        n_eval_episodes: int = 5,
        eval_freq: int = 10000,
        deterministic: bool = True,
        verbose: int = 0,
    ):

        super().__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Evaluate policy (done in the parent class)
            super()._on_step()
            self.eval_idx += 1
            # Send report to Optuna
            self.trial.report(self.last_mean_reward, self.eval_idx)
            # Prune trial if needed
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True
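The callback only reports intermediate rewards and asks the trial whether it should be pruned. The decision itself is made by the pruner attached to the study. As a rough illustration, the sketch below implements a simplified median rule for a maximization study in pure Python: a trial is stopped when its best intermediate value so far is below the median of what finished trials reported at the same step. The function `should_prune` and the reward values are hypothetical, and Optuna's real pruner has more options than shown here.

```python
from statistics import median
from typing import Dict, List


def should_prune(
    current: Dict[int, float],
    finished: List[Dict[int, float]],
    step: int,
    n_startup_trials: int = 5,
    n_warmup_steps: int = 0,
) -> bool:
    """Simplified median rule for a maximization study."""
    # Never prune while too few trials have finished or during warmup
    if len(finished) < n_startup_trials or step < n_warmup_steps:
        return False
    others = [trial[step] for trial in finished if step in trial]
    if not others:
        return False
    best_so_far = max(value for s, value in current.items() if s <= step)
    return best_so_far < median(others)


# Illustrative intermediate rewards: {evaluation index: mean reward}
finished_trials = [
    {1: 9.4, 2: 9.2},
    {1: 71.9, 2: 87.4},
    {1: 78.3, 2: 92.2},
    {1: 113.7, 2: 100.7},
    {1: 488.3, 2: 500.0},
]

print(should_prune({1: 9.4}, finished_trials, step=1))    # True: 9.4 < median 78.3
print(should_prune({1: 137.1}, finished_trials, step=1))  # False: 137.1 >= 78.3
```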

### Define the objective function

Then we define the objective function, which is in charge of sampling hyperparameters, creating the model, and returning the result to Optuna.

In [9]:
def objective(trial: optuna.Trial) -> float:
    """
    Objective function used by Optuna to evaluate
    one configuration (i.e., one set of hyperparameters).

    Given a trial object, it will sample hyperparameters,
    evaluate them and report the result (mean episodic reward after training).

    :param trial: Optuna trial object
    :return: Mean episodic reward after training
    """

    kwargs = DEFAULT_HYPERPARAMS.copy()

    # Sample hyperparameters and update the keyword arguments
    kwargs.update(sample_a2c_params(trial))

    # Create the RL model
    model = A2C(**kwargs)

    # Create the envs used for evaluation
    eval_envs = make_vec_env(ENV_ID, n_envs=N_EVAL_ENVS)

    # Create the `TrialEvalCallback` defined above; it periodically evaluates
    # the agent and reports the result to Optuna
    eval_callback = TrialEvalCallback(
        eval_envs,
        trial,
        n_eval_episodes=N_EVAL_EPISODES,
        eval_freq=EVAL_FREQ,
        deterministic=True,
        verbose=1,
    )

    nan_encountered = False
    try:
        # Train the model
        model.learn(N_TIMESTEPS, callback=eval_callback)
    except AssertionError as e:
        # Sometimes, random hyperparams can generate NaN
        print(e)
        nan_encountered = True
    finally:
        # Free memory
        model.env.close()
        eval_envs.close()

    # Tell the optimizer that the trial failed
    if nan_encountered:
        return float("nan")

    if eval_callback.is_pruned:
        raise optuna.exceptions.TrialPruned()

    return eval_callback.last_mean_reward

### The optimization loop

In [None]:
import torch as th

# Set pytorch num threads to 1 for faster training
th.set_num_threads(1)
# Select the sampler, can be random, TPESampler, CMAES, ...
sampler = TPESampler(n_startup_trials=N_STARTUP_TRIALS)
# Do not prune before 1/3 of the max budget is used
# (note: with N_EVALUATIONS = 2, N_EVALUATIONS // 3 is 0, so there is no warmup)
pruner = MedianPruner(
    n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=N_EVALUATIONS // 3
)
# Create the study and start the hyperparameter optimization
study = optuna.create_study(sampler=sampler, pruner=pruner, direction="maximize")

try:
    study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS, timeout=TIMEOUT)
except KeyboardInterrupt:
    pass

print("\n\n\n\n\nNumber of finished trials: ", len(study.trials))

print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")

print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

print("  User attrs:")
for key, value in trial.user_attrs.items():
    print(f"    {key}: {value}")

# Write report
study.trials_dataframe().to_csv("study_results_a2c_cartpole.csv")

fig1 = plot_optimization_history(study)
fig2 = plot_param_importances(study)

fig1.show()
fig2.show()

[I 2025-03-04 16:12:12,908] A new study created in memory with name: no-name-43efba92-aa7a-4e1f-8520-7f54ab5ddfd2


Eval num_timesteps=10000, episode_reward=9.40 +/- 0.66
Episode length: 9.40 +/- 0.66
New best mean reward!


[I 2025-03-04 16:12:30,827] Trial 0 finished with value: 9.2 and parameters: {'gamma': 0.0001565552546387575, 'max_grad_norm': 1.6446413164628697, 'exponent_n_steps': 8, 'learning_rate': 0.29211762192057805, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 0 with value: 9.2.


Eval num_timesteps=20000, episode_reward=9.20 +/- 0.60
Episode length: 9.20 +/- 0.60




Eval num_timesteps=10000, episode_reward=71.90 +/- 11.85
Episode length: 71.90 +/- 11.85
New best mean reward!


[I 2025-03-04 16:12:48,715] Trial 1 finished with value: 87.4 and parameters: {'gamma': 0.00013261331934499, 'max_grad_norm': 0.9773772272117918, 'exponent_n_steps': 3, 'learning_rate': 2.1269700904602606e-05, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 1 with value: 87.4.


Eval num_timesteps=20000, episode_reward=87.40 +/- 44.18
Episode length: 87.40 +/- 44.18
New best mean reward!
Eval num_timesteps=10000, episode_reward=78.30 +/- 33.35
Episode length: 78.30 +/- 33.35
New best mean reward!


[I 2025-03-04 16:13:02,387] Trial 2 finished with value: 92.2 and parameters: {'gamma': 0.05687182639947384, 'max_grad_norm': 0.6586832697701895, 'exponent_n_steps': 4, 'learning_rate': 0.0002542689742822569, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 2 with value: 92.2.


Eval num_timesteps=20000, episode_reward=92.20 +/- 28.33
Episode length: 92.20 +/- 28.33
New best mean reward!
Eval num_timesteps=10000, episode_reward=113.70 +/- 33.40
Episode length: 113.70 +/- 33.40
New best mean reward!
Eval num_timesteps=20000, episode_reward=100.70 +/- 45.97
Episode length: 100.70 +/- 45.97


[I 2025-03-04 16:13:14,676] Trial 3 finished with value: 100.7 and parameters: {'gamma': 0.0007340544047573383, 'max_grad_norm': 4.686071443578941, 'exponent_n_steps': 9, 'learning_rate': 0.00019796285091692518, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 3 with value: 100.7.


Eval num_timesteps=10000, episode_reward=488.30 +/- 35.10
Episode length: 488.30 +/- 35.10
New best mean reward!


[I 2025-03-04 16:13:27,034] Trial 4 finished with value: 500.0 and parameters: {'gamma': 0.06898337252688375, 'max_grad_norm': 1.0704444392930572, 'exponent_n_steps': 7, 'learning_rate': 0.00889227980581267, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 4 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-03-04 16:13:32,960] Trial 5 pruned. 


Eval num_timesteps=10000, episode_reward=9.40 +/- 0.80
Episode length: 9.40 +/- 0.80
New best mean reward!
Eval num_timesteps=10000, episode_reward=137.10 +/- 3.36
Episode length: 137.10 +/- 3.36
New best mean reward!


[I 2025-03-04 16:13:44,741] Trial 6 finished with value: 237.9 and parameters: {'gamma': 0.00816930731981947, 'max_grad_norm': 2.00779006126141, 'exponent_n_steps': 6, 'learning_rate': 0.006664807592420221, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 4 with value: 500.0.


Eval num_timesteps=20000, episode_reward=237.90 +/- 33.63
Episode length: 237.90 +/- 33.63
New best mean reward!


[I 2025-03-04 16:13:49,966] Trial 7 pruned. 


Eval num_timesteps=10000, episode_reward=9.40 +/- 0.80
Episode length: 9.40 +/- 0.80
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-03-04 16:14:01,963] Trial 8 finished with value: 500.0 and parameters: {'gamma': 0.021757294266547465, 'max_grad_norm': 3.112485503184081, 'exponent_n_steps': 7, 'learning_rate': 0.004168826186899289, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 4 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00


[I 2025-03-04 16:14:08,233] Trial 9 pruned. 


Eval num_timesteps=10000, episode_reward=9.40 +/- 0.80
Episode length: 9.40 +/- 0.80
New best mean reward!


[I 2025-03-04 16:14:13,973] Trial 10 pruned. 


Eval num_timesteps=10000, episode_reward=93.80 +/- 45.15
Episode length: 93.80 +/- 45.15
New best mean reward!
Eval num_timesteps=10000, episode_reward=198.20 +/- 29.97
Episode length: 198.20 +/- 29.97
New best mean reward!


[I 2025-03-04 16:14:25,606] Trial 11 finished with value: 500.0 and parameters: {'gamma': 0.02566265913506896, 'max_grad_norm': 3.964979309825269, 'exponent_n_steps': 7, 'learning_rate': 0.005423606529239947, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 4 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!
Eval num_timesteps=10000, episode_reward=371.80 +/- 176.54
Episode length: 371.80 +/- 176.54
New best mean reward!


[I 2025-03-04 16:14:37,786] Trial 12 finished with value: 410.9 and parameters: {'gamma': 0.024688414958465688, 'max_grad_norm': 2.473566817305961, 'exponent_n_steps': 7, 'learning_rate': 0.0010573695663104482, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 4 with value: 500.0.


Eval num_timesteps=20000, episode_reward=410.90 +/- 136.11
Episode length: 410.90 +/- 136.11
New best mean reward!


[I 2025-03-04 16:14:43,269] Trial 13 pruned. 


Eval num_timesteps=10000, episode_reward=83.90 +/- 11.65
Episode length: 83.90 +/- 11.65
New best mean reward!
Eval num_timesteps=10000, episode_reward=450.20 +/- 100.89
Episode length: 450.20 +/- 100.89
New best mean reward!


[I 2025-03-04 16:14:55,822] Trial 14 finished with value: 198.6 and parameters: {'gamma': 0.008408190333776691, 'max_grad_norm': 1.3706229118733158, 'exponent_n_steps': 5, 'learning_rate': 0.0023098648110028547, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 4 with value: 500.0.


Eval num_timesteps=20000, episode_reward=198.60 +/- 45.30
Episode length: 198.60 +/- 45.30


[I 2025-03-04 16:15:01,340] Trial 15 pruned. 


Eval num_timesteps=10000, episode_reward=22.30 +/- 3.32
Episode length: 22.30 +/- 3.32
New best mean reward!


[I 2025-03-04 16:15:06,827] Trial 16 pruned. 


Eval num_timesteps=10000, episode_reward=100.70 +/- 37.05
Episode length: 100.70 +/- 37.05
New best mean reward!


[I 2025-03-04 16:15:13,160] Trial 17 pruned. 


Eval num_timesteps=10000, episode_reward=9.40 +/- 0.80
Episode length: 9.40 +/- 0.80
New best mean reward!
Eval num_timesteps=10000, episode_reward=492.80 +/- 21.60
Episode length: 492.80 +/- 21.60
New best mean reward!
Eval num_timesteps=20000, episode_reward=414.60 +/- 64.65
Episode length: 414.60 +/- 64.65


[I 2025-03-04 16:15:25,721] Trial 18 finished with value: 414.6 and parameters: {'gamma': 0.0037346817796266286, 'max_grad_norm': 0.698561380205939, 'exponent_n_steps': 9, 'learning_rate': 0.0018565331094538976, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 4 with value: 500.0.


Eval num_timesteps=10000, episode_reward=215.30 +/- 155.26
Episode length: 215.30 +/- 155.26
New best mean reward!


[I 2025-03-04 16:15:37,766] Trial 19 finished with value: 279.6 and parameters: {'gamma': 0.04621051194933854, 'max_grad_norm': 1.39279710940246, 'exponent_n_steps': 6, 'learning_rate': 0.0007372764473025734, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 4 with value: 500.0.


Eval num_timesteps=20000, episode_reward=279.60 +/- 145.34
Episode length: 279.60 +/- 145.34
New best mean reward!


[I 2025-03-04 16:15:43,027] Trial 20 pruned. 


Eval num_timesteps=10000, episode_reward=94.80 +/- 24.55
Episode length: 94.80 +/- 24.55
New best mean reward!
Eval num_timesteps=10000, episode_reward=320.20 +/- 125.73
Episode length: 320.20 +/- 125.73
New best mean reward!


[I 2025-03-04 16:15:54,950] Trial 21 finished with value: 500.0 and parameters: {'gamma': 0.025115024776808346, 'max_grad_norm': 4.610950155397442, 'exponent_n_steps': 7, 'learning_rate': 0.006529135188959501, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 4 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!
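The parameters printed in the log are the raw values that Optuna sampled, not the values passed to A2C: `gamma` in the log is really `1 - gamma`, and `exponent_n_steps` is the exponent of `n_steps`. The sketch below applies the same transforms as `sample_a2c_params` to the parameters logged for trial 4. The helper `decode_a2c_params` is a hypothetical name. To stay dependency-free it keeps the activation function as a string, which would still need to be mapped to `nn.Tanh` or `nn.ReLU` before creating a model.

```python
from typing import Any, Dict


def decode_a2c_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Map the raw values stored by Optuna back to A2C keyword arguments."""
    net_arch = {
        "tiny": [{"pi": [64], "vf": [64]}],
        "small": [{"pi": [64, 64], "vf": [64, 64]}],
    }[params["net_arch"]]
    return {
        "n_steps": 2 ** params["exponent_n_steps"],
        "gamma": 1.0 - params["gamma"],
        "learning_rate": params["learning_rate"],
        "max_grad_norm": params["max_grad_norm"],
        "policy_kwargs": {
            "net_arch": net_arch,
            # Name only; map to the torch class as in `sample_a2c_params`
            "activation_fn": params["activation_fn"],
        },
    }


# Parameters logged for trial 4 above
trial_4_params = {
    "gamma": 0.06898337252688375,
    "max_grad_norm": 1.0704444392930572,
    "exponent_n_steps": 7,
    "learning_rate": 0.00889227980581267,
    "net_arch": "tiny",
    "activation_fn": "tanh",
}

decoded = decode_a2c_params(trial_4_params)
# n_steps is 2 ** 7 = 128; gamma is 1 minus the logged value, about 0.931
print(decoded["n_steps"], decoded["gamma"])
```

The true `gamma` and `n_steps` are also stored as user attributes (`gamma_` and `n_steps`) by `sample_a2c_params`, so they can be read from `trial.user_attrs` as well.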


# Conclusion

What we have seen in this notebook:
- how to do automatic hyperparameter search with Optuna
