**Each code cell appears twice: the second copy is a retyped practice version, except for cells that contain TODO exercises.**

# Hyperparameter tuning with Optuna

Github repo: https://github.com/araffin/tools-for-robotic-rl-icra2022

Optuna: https://github.com/optuna/optuna

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

SB3 Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo

[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines3.

It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.


## Introduction

In this notebook, you will learn the importance of tuning hyperparameters. You will first try to optimize the parameters manually and then we will see how to automate the search using Optuna.


## Install Dependencies and Stable Baselines3 Using Pip

The full list of dependencies can be found in the [README](https://github.com/DLR-RM/stable-baselines3).


```
pip install stable-baselines3[extra]
```

In [None]:
!pip install stable-baselines3

In [None]:
!pip install stable-baselines3

Collecting stable-baselines3
  Downloading stable_baselines3-2.0.0-py3-none-any.whl (178 kB)
Collecting gymnasium==0.28.1 (from stable-baselines3)
  Downloading gymnasium-0.28.1-py3-none-any.whl (925 kB)
Collecting jax-jumpy>=1.0.0 (from gymnasium==0.28.1->stable-baselines3)
  Downloading jax_jumpy-1.0.0-py3-none-any.whl (20 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium==0.28.1->stable-baselines3)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Installing collected packages: farama-notifications, jax-jumpy, gymnasium, stable-baselines3
Successfully installed farama-notifications-0.0.4 gymnasium-0.28.1 jax-jumpy-1.0.0 stable-baselines3-2.0.0


In [None]:
# Optional: install SB3 contrib to have access to additional algorithms
!pip install sb3-contrib

In [None]:
!pip install sb3-contrib

Collecting sb3-contrib
  Downloading sb3_contrib-2.0.0-py3-none-any.whl (80 kB)
Installing collected packages: sb3-contrib
Successfully installed sb3-contrib-2.0.0


In [None]:
# Optuna will be used in the last part when doing hyperparameter tuning
!pip install optuna

In [None]:
!pip install optuna

Collecting optuna
  Downloading optuna-3.2.0-py3-none-any.whl (390 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.11.1-py3-none-any.whl (224 kB)
Collecting cmaes>=0.9.1 (from optuna)
  Downloading cmaes-0.10.0-py3-none-any.whl (29 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)
Installing collected packages: Mako, colorlog, cmaes, alembic, optuna
Successfully installed Mako-1.2.4 alembic-1.11.1 cmaes-0.10.0 colorlog-6.7.0 optuna-3.2.0


## Imports

In [None]:
# SB3 2.0 is built on Gymnasium, the maintained successor of Gym
import gymnasium as gym
import numpy as np

In [None]:
import gymnasium as gym
import numpy as np

The first thing you need to import is the RL model. Check the documentation to see which algorithm can be used on which kind of problem.

In [None]:
from stable_baselines3 import PPO, A2C, SAC, TD3, DQN

In [None]:
from stable_baselines3 import PPO, A2C, SAC, TD3, DQN



In [None]:
# Algorithms from the contrib repo
# https://github.com/Stable-Baselines-Team/stable-baselines3-contrib
from sb3_contrib import QRDQN, TQC

In [None]:
from sb3_contrib import QRDQN, TQC



In [None]:
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

In [None]:
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# Part I: The Importance Of Tuned Hyperparameters



Compared with supervised learning, deep reinforcement learning is far more sensitive to the choice of hyperparameters such as the learning rate, the number of neurons, the number of layers, the optimizer, etc.

A poor choice of hyperparameters can lead to poor or unstable convergence. This challenge is compounded by the variability in performance across random seeds (used to initialize the network weights and the environment).

In addition to hyperparameters, selecting the appropriate algorithm is also an important choice. We will demonstrate it on the simple Pendulum task.

See the [gym doc](https://gym.openai.com/envs/Pendulum-v0/): "The inverted pendulum swingup problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so it stays upright."


Let's try first with PPO and a small budget of 4000 steps (20 episodes):

In [None]:
env_id = "Pendulum-v1"
# Env used only for evaluation
eval_envs = make_vec_env(env_id, n_envs=10)
# 4000 training timesteps
budget_pendulum = 4000

In [None]:
env_id = "Pendulum-v1"
eval_envs = make_vec_env(env_id, n_envs=10)
budget_pendulum = 4000



### PPO

In [None]:
ppo_model = PPO("MlpPolicy", env_id, seed=0, verbose=0).learn(budget_pendulum)

In [None]:
mean_reward, std_reward = evaluate_policy(ppo_model, eval_envs, n_eval_episodes=100, deterministic=True)

print(f"PPO Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

PPO Mean episode reward: -1175.13 +/- 264.33


In [None]:
ppo_model = PPO("MlpPolicy", env_id, seed=0, verbose=0).learn(budget_pendulum)

In [None]:
mean_reward, std_reward = evaluate_policy(ppo_model, eval_envs, n_eval_episodes=100, deterministic=True)
print(f"PPO Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

PPO Mean episode reward: -1200.28 +/- 267.60


### A2C

In [None]:
# Define and train an A2C model
a2c_model = A2C("MlpPolicy", env_id, seed=0, verbose=0).learn(budget_pendulum)

In [None]:
# Evaluate the trained A2C model
mean_reward, std_reward = evaluate_policy(a2c_model, eval_envs, n_eval_episodes=100, deterministic=True)

print(f"A2C Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

A2C Mean episode reward: -1365.89 +/- 33.68


Both are far from solving the environment, which corresponds to a mean reward of around -200.

### Training PPO longer?

Maybe training longer would help?

You can try with 10x the budget, but in the case of A2C/PPO, training longer won't help much. Better hyperparameters are needed instead.

In [None]:
# train longer
new_budget = 10 * budget_pendulum

ppo_model = PPO("MlpPolicy", env_id, seed=0, verbose=0).learn(new_budget)

In [None]:
new_budget = 10 * budget_pendulum
ppo_model = PPO("MlpPolicy", env_id, seed=0, verbose=0).learn(new_budget)

In [None]:
mean_reward, std_reward = evaluate_policy(ppo_model, eval_envs, n_eval_episodes=100, deterministic=True)

print(f"PPO Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

In [None]:
mean_reward, std_reward = evaluate_policy(ppo_model, eval_envs, n_eval_episodes=100, deterministic=True)
print(f"PPO Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

PPO Mean episode reward: -1148.25 +/- 223.49


### PPO - Tuned Hyperparameters

Using Optuna, we can in fact tune the hyperparameters and find a working solution (from the [RL Zoo](https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/ppo.yml)):

In [None]:
tuned_params = {
    "gamma": 0.9,
    "use_sde": True,
    "sde_sample_freq": 4,
    "learning_rate": 1e-3,
}

# budget = 10 * budget_pendulum
ppo_tuned_model = PPO("MlpPolicy", env_id, seed=1, verbose=1, **tuned_params).learn(50_000, log_interval=5)

In [None]:
tuned_params = {
    "gamma": 0.9,
    "use_sde": True,
    "sde_sample_freq": 4,
    "learning_rate": 1e-3,
}
ppo_tuned_model = PPO("MlpPolicy", env_id, seed=1, verbose=1, **tuned_params).learn(50_000, log_interval=5)

Using cuda device
Creating environment from the given name 'Pendulum-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 200        |
|    ep_rew_mean          | -1.17e+03  |
| time/                   |            |
|    fps                  | 509        |
|    iterations           | 5          |
|    time_elapsed         | 20         |
|    total_timesteps      | 10240      |
| train/                  |            |
|    approx_kl            | 0.02357509 |
|    clip_fraction        | 0.225      |
|    clip_range           | 0.2        |
|    entropy_loss         | -2.6       |
|    explained_variance   | 0.833      |
|    learning_rate        | 0.001      |
|    loss                 | 13.4       |
|    n_updates            | 40         |
|    policy_gradient_loss | -0.0229    |
|    std                  | 0.892      |
|    value_loss           | 35

In [None]:
mean_reward, std_reward = evaluate_policy(ppo_tuned_model, eval_envs, n_eval_episodes=100, deterministic=True)

print(f"Tuned PPO Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

Tuned PPO Mean episode reward: -192.88 +/- 111.45


In [None]:
mean_reward, std_reward = evaluate_policy(ppo_tuned_model, eval_envs, n_eval_episodes=100, deterministic=True)
print(f"Tuned PPO Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

Tuned PPO Mean episode reward: -195.89 +/- 213.87
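
One way to read the tuned `gamma` value: a discount factor weights a reward received `k` steps in the future by `gamma ** k`, so rewards further away than roughly `1 / (1 - gamma)` steps contribute little. The sketch below compares the tuned value (0.9) with the PPO default in SB3 (0.99). The helper names are ours and are not part of SB3, and this is only an intuition aid, not an explanation of why the tuned configuration works.

```python
def effective_horizon(gamma: float) -> float:
    """Approximate number of future steps that matter for a discount factor."""
    return 1.0 / (1.0 - gamma)


def discount_weight(gamma: float, k: int) -> float:
    """Weight given to a reward received k steps in the future."""
    return gamma ** k


for gamma in (0.99, 0.9):
    print(
        f"gamma={gamma}: horizon ~{effective_horizon(gamma):.0f} steps, "
        f"weight of a reward 50 steps ahead: {discount_weight(gamma, 50):.4f}"
    )
```

With episodes of 200 steps on Pendulum (see `ep_len_mean` in the training log above), both horizons are shorter than one episode, but the tuned value focuses on a much shorter window.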


Note: if you try SAC on the simple MountainCarContinuous environment, you will encounter some issues without tuned hyperparameters: https://github.com/rail-berkeley/softlearning/issues/76

Simple environments can be challenging even for SOTA algorithms.

# Part II: Grad Student Descent


### Challenge (10 minutes): "Grad Student Descent"
The challenge is to find the best hyperparameters (max performance) for A2C on `CartPole-v1` with a limited budget of 20 000 training steps.


Maximum reward: 500 on `CartPole-v1`

The hyperparameters should work for different random seeds.
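
To check that a configuration is robust to the random seed, one option is to repeat training and evaluation for several seeds and look at the aggregated score. The sketch below is a minimal version of that idea. `evaluate_across_seeds` and `fake_train_and_eval` are hypothetical names, and the placeholder function returns made-up scores so that the example runs without training anything.

```python
import statistics
from typing import Callable, Dict, Iterable


def evaluate_across_seeds(
    train_and_eval: Callable[[int], float], seeds: Iterable[int]
) -> Dict[str, float]:
    """Run one training + evaluation per seed and aggregate the scores."""
    scores = [train_and_eval(seed) for seed in seeds]
    return {
        "mean": statistics.mean(scores),
        "std": statistics.pstdev(scores),
        "min": min(scores),
    }


def fake_train_and_eval(seed: int) -> float:
    # Placeholder with made-up scores so the sketch runs without training.
    # In this notebook it would wrap something like:
    #   model = A2C("MlpPolicy", "CartPole-v1", seed=seed, **hyperparams).learn(budget)
    #   return evaluate_policy(model, eval_envs_cartpole, n_eval_episodes=50)[0]
    return 100.0 + 10.0 * seed


summary = evaluate_across_seeds(fake_train_and_eval, seeds=[0, 1, 2])
print(summary)
```

Looking at the minimum as well as the mean helps spot configurations that only work for a lucky seed.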

In [None]:
budget = 20_000

In [None]:
budget = 20_000



#### The baseline: default hyperparameters

In [None]:
eval_envs_cartpole = make_vec_env("CartPole-v1", n_envs=10)

In [None]:
eval_envs_cartpole = make_vec_env("CartPole-v1", n_envs=10)



In [None]:
model = A2C("MlpPolicy", "CartPole-v1", seed=8, verbose=1).learn(budget)

In [None]:
model = A2C("MlpPolicy", "CartPole-v1", seed=8, verbose=1).learn(budget)

Using cuda device
Creating environment from the given name 'CartPole-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 20.2     |
|    ep_rew_mean        | 20.2     |
| time/                 |          |
|    fps                | 475      |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.688   |
|    explained_variance | -0.0129  |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 1.95     |
|    value_loss         | 8.69     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 19.9     |
|    ep_rew_mean        | 19.9     |
| time/                 |          |
|    fps                | 457      |

In [None]:
mean_reward, std_reward = evaluate_policy(model, eval_envs_cartpole, n_eval_episodes=50, deterministic=True)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:135.02 +/- 99.80


In [None]:
mean_reward, std_reward = evaluate_policy(model, eval_envs_cartpole, n_eval_episodes=50, deterministic=True)
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:148.10 +/- 19.82


**Your goal is to beat that baseline and get closer to the optimal score of 500**

Time to tune!

In [None]:
import torch.nn as nn

In [None]:
import torch.nn as nn



In [None]:
policy_kwargs = dict(
    # Separate network architectures for the actor (pi) and the critic (vf)
    net_arch=dict(pi=[64, 64], vf=[64, 64]),
    activation_fn=nn.Tanh,
)

hyperparams = dict(
    n_steps=5, # number of steps to collect data before updating policy
    learning_rate=7e-4,
    gamma=0.99, # discount factor
    max_grad_norm=0.5, # The maximum value for the gradient clipping
    ent_coef=0.0, # Entropy coefficient for the loss calculation
    policy_kwargs=policy_kwargs, # otherwise the settings above are never used
)

model = A2C("MlpPolicy", "CartPole-v1", seed=8, verbose=1, **hyperparams).learn(budget)

In [None]:
policy_kwargs = dict(
    net_arch=dict(pi=[64, 64], vf=[64, 64]),
    activation_fn=nn.Tanh,
)

hyperparams = dict(
    n_steps=5,
    learning_rate=7e-4,
    gamma=0.99,
    max_grad_norm=0.5,
    ent_coef=0.0,
    policy_kwargs=policy_kwargs,
)

model = A2C("MlpPolicy", "CartPole-v1", seed=8, verbose=1, **hyperparams).learn(budget)

Using cuda device
Creating environment from the given name 'CartPole-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 20.2     |
|    ep_rew_mean        | 20.2     |
| time/                 |          |
|    fps                | 469      |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.688   |
|    explained_variance | -0.0129  |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 1.95     |
|    value_loss         | 8.69     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 19.9     |
|    ep_rew_mean        | 19.9     |
| time/                 |          |
|    fps                | 472      |

In [None]:
mean_reward, std_reward = evaluate_policy(model, eval_envs_cartpole, n_eval_episodes=50, deterministic=True)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

In [None]:
mean_reward, std_reward = evaluate_policy(model, eval_envs_cartpole, n_eval_episodes=50, deterministic=True)
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:146.82 +/- 16.71


Hint - Recommended Hyperparameter Range

```python
gamma = trial.suggest_float("gamma", 0.9, 0.99999, log=True)
max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 5.0, log=True)
# from 2**3 = 8 to 2**10 = 1024
n_steps = 2 ** trial.suggest_int("exponent_n_steps", 3, 10)
learning_rate = trial.suggest_float("lr", 1e-5, 1, log=True)
ent_coef = trial.suggest_float("ent_coef", 0.00000001, 0.1, log=True)
# net_arch tiny: {"pi": [64], "vf": [64]}
# net_arch default: {"pi": [64, 64], "vf": [64, 64]}
# activation_fn = nn.Tanh / nn.ReLU
```
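
The `log=True` ranges in the hint mean that values are spread evenly across orders of magnitude rather than evenly across the raw interval. The sketch below illustrates the idea with the standard library only. It is a conceptual illustration, not Optuna's implementation, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
learning_rates = [sample_log_uniform(rng, 1e-5, 1.0) for _ in range(10_000)]

# The range [1e-5, 1] spans five decades; [1e-5, 1e-3] covers two of them,
# so roughly 2/5 of the samples are expected below 1e-3.
below_1e3 = sum(lr < 1e-3 for lr in learning_rates) / len(learning_rates)
print(f"share of samples below 1e-3: {below_1e3:.2f}")

# n_steps is restricted to powers of two by sampling the exponent instead.
n_steps_choices = [2 ** exponent for exponent in range(3, 11)]
print(n_steps_choices)
```

With a plain uniform draw over `[1e-5, 1]`, only about 0.1% of the samples would fall below `1e-3`, which is why learning rates are usually searched on a log scale.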

# Part III: Automatic Hyperparameter Tuning





In this part, we will write a script that searches for the best hyperparameters automatically.

### Imports

In [None]:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.visualization import plot_optimization_history, plot_param_importances

In [None]:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.visualization import plot_optimization_history, plot_param_importances

### Config

In [None]:
N_TRIALS = 100  # Maximum number of trials
N_JOBS = 1 # Number of jobs to run in parallel
N_STARTUP_TRIALS = 5  # Stop random sampling after N_STARTUP_TRIALS
N_EVALUATIONS = 2  # Number of evaluations during the training
N_TIMESTEPS = int(2e4)  # Training budget
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_ENVS = 5
N_EVAL_EPISODES = 10
TIMEOUT = int(60 * 15)  # 15 minutes

ENV_ID = "CartPole-v1"

DEFAULT_HYPERPARAMS = {
    "policy": "MlpPolicy",
    "env": ENV_ID,
}

In [None]:
N_TRIALS = 100
N_JOBS = 1
N_STARTUP_TRIALS = 5
N_EVALUATIONS = 2
N_TIMESTEPS = int(2e4)
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_ENVS = 5
N_EVAL_EPISODES = 10
TIMEOUT = int(60 * 15)

ENV_ID = "CartPole-v1"

DEFAULT_HYPERPARAMS = {
    "policy": "MlpPolicy",
    "env": ENV_ID,
}
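
The evaluation schedule implied by this config can be worked out by hand. The sketch below recomputes it, assuming training uses a single environment as in this notebook. The multi-environment adjustment at the end follows the recommendation in the SB3 callback documentation; `n_envs = 4` is a hypothetical value that is not used elsewhere in this notebook.

```python
N_EVALUATIONS = 2
N_TIMESTEPS = int(2e4)
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)

# Timesteps at which an evaluation is triggered when training with a single env
eval_points = [EVAL_FREQ * i for i in range(1, N_EVALUATIONS + 1)]
print(eval_points)  # [10000, 20000]

# `eval_freq` counts callback calls, and each call advances n_envs timesteps,
# so with several training envs the frequency would need to be scaled down.
n_envs = 4  # hypothetical, not used in this notebook
adjusted_eval_freq = max(EVAL_FREQ // n_envs, 1)
print(adjusted_eval_freq)  # 2500
```

These two evaluation points match the `Eval num_timesteps=10000` and `Eval num_timesteps=20000` lines in the optimization log.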



### Exercise (5 minutes): Define the search space

In [None]:
from typing import Any, Dict
import torch
import torch.nn as nn

def sample_a2c_params(trial: optuna.Trial) -> Dict[str, Any]:
    """
    Sampler for A2C hyperparameters.

    :param trial: Optuna trial object
    :return: The sampled hyperparameters for the given trial.
    """
    # Discount factor between 0.9 and 0.9999
    gamma = 1.0 - trial.suggest_float("gamma", 0.0001, 0.1, log=True)
    max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 5.0, log=True)
    # 8, 16, 32, ... 1024
    n_steps = 2 ** trial.suggest_int("exponent_n_steps", 3, 10)

    ### YOUR CODE HERE
    # TODO:
    # - define the learning rate search space [1e-5, 1] (log) -> `suggest_float`
    # - define the network architecture search space ["tiny", "small"] -> `suggest_categorical`
    # - define the activation function search space ["tanh", "relu"]
    learning_rate = trial.suggest_float("lr", 1e-5, 1, log=True)
    net_arch = trial.suggest_categorical("net_arch", ["tiny", "small"])
    activation_fn = trial.suggest_categorical("activation_fn", ["tanh", "relu"])

    ### END OF YOUR CODE

    # Display true values
    trial.set_user_attr("gamma_", gamma)
    trial.set_user_attr("n_steps", n_steps)

    net_arch = {"pi": [64], "vf": [64]} if net_arch == "tiny" else {"pi": [64, 64], "vf": [64, 64]}
    # net_arch = [
    #     {"pi": [64], "vf": [64]}
    #     if net_arch == "tiny"
    #     else {"pi": [64, 64], "vf": [64, 64]}
    # ]

    activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU}[activation_fn]

    return {
        "n_steps": n_steps,
        "gamma": gamma,
        "learning_rate": learning_rate,
        "max_grad_norm": max_grad_norm,
        "policy_kwargs": {
            "net_arch": net_arch,
            "activation_fn": activation_fn,
        },
    }
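
Because the sampler searches over `1 - gamma` and over the exponent of `n_steps`, the values that Optuna logs under `gamma` and `exponent_n_steps` are not the ones passed to A2C. The sketch below converts logged parameters back to A2C-style values, using the parameters of trial 5 from the optimization log further down. `decode_logged_params` is a hypothetical helper written for this illustration.

```python
from typing import Any, Dict


def decode_logged_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Map the raw values stored by Optuna back to A2C-style values."""
    net_archs = {
        "tiny": {"pi": [64], "vf": [64]},
        "small": {"pi": [64, 64], "vf": [64, 64]},
    }
    return {
        "gamma": 1.0 - params["gamma"],  # the search is over 1 - gamma
        "n_steps": 2 ** params["exponent_n_steps"],  # the search is over the exponent
        "learning_rate": params["lr"],
        "max_grad_norm": params["max_grad_norm"],
        "net_arch": net_archs[params["net_arch"]],
        "activation_fn": params["activation_fn"],  # still a string: "tanh" or "relu"
    }


# Raw parameters of trial 5 as they appear in the optimization log
logged = {
    "gamma": 0.0001080633423743654,
    "max_grad_norm": 1.8653950360801839,
    "exponent_n_steps": 4,
    "lr": 0.0029079073425006753,
    "net_arch": "tiny",
    "activation_fn": "tanh",
}
decoded = decode_logged_params(logged)
print(decoded["gamma"], decoded["n_steps"])
```

The same information is stored per trial through `trial.set_user_attr("gamma_", ...)` and `trial.set_user_attr("n_steps", ...)`, which is why the final report prints the user attributes as well.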

### Define the objective function

First we define a custom callback to report the results of periodic evaluations to Optuna:

In [None]:
from stable_baselines3.common.callbacks import EvalCallback

class TrialEvalCallback(EvalCallback):
    """
    Callback used for evaluating and reporting a trial.

    :param eval_env: Evaluation environment
    :param trial: Optuna trial object
    :param n_eval_episodes: Number of evaluation episodes
    :param eval_freq: Evaluate the agent every ``eval_freq`` calls of the callback.
    :param deterministic: Whether the evaluation should
        use a stochastic or deterministic policy.
    :param verbose: Verbosity level
    """

    def __init__(
        self,
        eval_env: gym.Env,
        trial: optuna.Trial,
        n_eval_episodes: int = 5,
        eval_freq: int = 10000,
        deterministic: bool = True,
        verbose: int = 0,
    ):

        super().__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Evaluate policy (done in the parent class)
            super()._on_step()
            self.eval_idx += 1
            # Send report to Optuna
            self.trial.report(self.last_mean_reward, self.eval_idx)
            # Prune trial if needed
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True
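
The decision returned by `trial.should_prune()` is made by the pruner attached to the study (a `MedianPruner` in this notebook). The sketch below shows a simplified version of the median rule for a maximization study, using hypothetical reward values. It only conveys the idea: Optuna's actual pruner also accounts for startup trials, warmup steps, and the optimization direction.

```python
import statistics
from typing import List


def should_prune_median(current_value: float, previous_values: List[float]) -> bool:
    """Simplified median rule for a maximization study.

    Prune when the current intermediate value is below the median of the
    values that earlier trials reported at the same evaluation step.
    """
    if not previous_values:
        return False  # nothing to compare against yet
    return current_value < statistics.median(previous_values)


# Hypothetical mean rewards reported by earlier trials at the first evaluation
earlier_first_evals = [150.0, 9.0, 60.0, 110.0, 500.0]

print(should_prune_median(55.0, earlier_first_evals))   # True: 55 < median 110
print(should_prune_median(300.0, earlier_first_evals))  # False: 300 >= median 110
```

Stopping weak trials after the first evaluation saves roughly half of their training budget, which lets the study try more configurations within the timeout.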

### Exercise (10 minutes): Define the objective function

Then we define the objective function, which is in charge of sampling the hyperparameters, creating the model, and returning the result to Optuna.

In [None]:
def objective(trial: optuna.Trial) -> float:
    """
    Objective function used by Optuna to evaluate
    one configuration (i.e., one set of hyperparameters).

    Given a trial object, it will sample hyperparameters,
    evaluate them and report the result (mean episodic reward after training).

    :param trial: Optuna trial object
    :return: Mean episodic reward after training
    """

    kwargs = DEFAULT_HYPERPARAMS.copy()
    ### YOUR CODE HERE
    # TODO:
    # 1. Sample hyperparameters and update the default keyword arguments: `kwargs.update(other_params)`
    # 2. Create the evaluation envs
    # 3. Create the `TrialEvalCallback`

    # 1. Sample hyperparameters and update the keyword arguments
    kwargs.update(sample_a2c_params(trial))

    # Create the RL model
    model = A2C(**kwargs)

    # 2. Create envs used for evaluation using `make_vec_env`, `ENV_ID` and `N_EVAL_ENVS`
    eval_envs = make_vec_env(ENV_ID, N_EVAL_ENVS)

    # 3. Create the `TrialEvalCallback` callback defined above that will periodically evaluate
    # and report the performance using `N_EVAL_EPISODES` every `EVAL_FREQ`
    # TrialEvalCallback signature:
    # TrialEvalCallback(eval_env, trial, n_eval_episodes, eval_freq, deterministic, verbose)
    eval_callback = TrialEvalCallback(eval_envs,
                                        trial,
                                        N_EVAL_EPISODES,
                                        EVAL_FREQ,
                                        deterministic=True,
                                        verbose=1)

    ### END OF YOUR CODE

    nan_encountered = False
    try:
        # Train the model
        model.learn(N_TIMESTEPS, callback=eval_callback)
    except (AssertionError, ValueError) as e:
        # Sometimes, random hyperparams can generate NaN
        # (depending on where it surfaces, this is an AssertionError or a ValueError)
        print(e)
        nan_encountered = True
    finally:
        # Free memory
        model.env.close()
        eval_envs.close()

    # Tell the optimizer that the trial failed
    if nan_encountered:
        return float("nan")

    if eval_callback.is_pruned:
        raise optuna.exceptions.TrialPruned()

    return eval_callback.last_mean_reward

### The optimization loop

In [None]:
import torch as th

# Set pytorch num threads to 1 for faster training
th.set_num_threads(1)
# Select the sampler, can be random, TPESampler, CMAES, ...
sampler = TPESampler(n_startup_trials=N_STARTUP_TRIALS)
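# Intended: do not prune before 1/3 of the evaluations are done.
# Note: with N_EVALUATIONS = 2, N_EVALUATIONS // 3 == 0, so there is no warmup
# and a trial can be pruned at its first evaluation.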
pruner = MedianPruner(
    n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=N_EVALUATIONS // 3
)
# Create the study and start the hyperparameter optimization
study = optuna.create_study(sampler=sampler, pruner=pruner, direction="maximize")

try:
    study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS, timeout=TIMEOUT)
except KeyboardInterrupt:
    pass

print("Number of finished trials: ", len(study.trials))

print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")

print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

print("  User attrs:")
for key, value in trial.user_attrs.items():
    print(f"    {key}: {value}")

# Write report
study.trials_dataframe().to_csv("study_results_a2c_cartpole.csv")

fig1 = plot_optimization_history(study)
fig2 = plot_param_importances(study)

fig1.show()
fig2.show()

[I 2023-07-28 09:04:48,367] A new study created in memory with name: no-name-859e0d3c-3402-4b75-b498-b5912fdce5ce


Eval num_timesteps=10000, episode_reward=154.30 +/- 11.99
Episode length: 154.30 +/- 11.99
New best mean reward!


[I 2023-07-28 09:05:16,585] Trial 0 finished with value: 382.6 and parameters: {'gamma': 0.04215517403455309, 'max_grad_norm': 1.7073843260560286, 'exponent_n_steps': 7, 'lr': 0.00596240573126716, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 382.6.


Eval num_timesteps=20000, episode_reward=382.60 +/- 78.43
Episode length: 382.60 +/- 78.43
New best mean reward!
Eval num_timesteps=10000, episode_reward=9.20 +/- 0.75
Episode length: 9.20 +/- 0.75
New best mean reward!


[I 2023-07-28 09:05:55,088] Trial 1 finished with value: 9.6 and parameters: {'gamma': 0.07130142538734371, 'max_grad_norm': 0.462683705792196, 'exponent_n_steps': 3, 'lr': 0.24508814166578513, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 0 with value: 382.6.


Eval num_timesteps=20000, episode_reward=9.60 +/- 0.66
Episode length: 9.60 +/- 0.66
New best mean reward!
Eval num_timesteps=10000, episode_reward=9.20 +/- 0.40
Episode length: 9.20 +/- 0.40
New best mean reward!


[I 2023-07-28 09:06:22,497] Trial 2 finished with value: 9.4 and parameters: {'gamma': 0.002366802118240152, 'max_grad_norm': 4.246206446841251, 'exponent_n_steps': 7, 'lr': 0.021790037040285475, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 382.6.


Eval num_timesteps=20000, episode_reward=9.40 +/- 0.49
Episode length: 9.40 +/- 0.49
New best mean reward!
Eval num_timesteps=10000, episode_reward=63.80 +/- 20.20
Episode length: 63.80 +/- 20.20
New best mean reward!
Eval num_timesteps=20000, episode_reward=102.40 +/- 52.42
Episode length: 102.40 +/- 52.42
New best mean reward!


[I 2023-07-28 09:06:49,337] Trial 3 finished with value: 102.4 and parameters: {'gamma': 0.06546840684668276, 'max_grad_norm': 0.5394612161952804, 'exponent_n_steps': 10, 'lr': 0.00013642308894257194, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 382.6.


Eval num_timesteps=10000, episode_reward=111.00 +/- 52.91
Episode length: 111.00 +/- 52.91
New best mean reward!


[I 2023-07-28 09:07:18,359] Trial 4 finished with value: 108.1 and parameters: {'gamma': 0.0007147455458788434, 'max_grad_norm': 0.652114223001188, 'exponent_n_steps': 7, 'lr': 8.692812085900122e-05, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 0 with value: 382.6.


Eval num_timesteps=20000, episode_reward=108.10 +/- 68.17
Episode length: 108.10 +/- 68.17
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2023-07-28 09:07:50,505] Trial 5 finished with value: 500.0 and parameters: {'gamma': 0.0001080633423743654, 'max_grad_norm': 1.8653950360801839, 'exponent_n_steps': 4, 'lr': 0.0029079073425006753, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 5 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
Eval num_timesteps=10000, episode_reward=282.90 +/- 142.67
Episode length: 282.90 +/- 142.67
New best mean reward!


[I 2023-07-28 09:08:28,695] Trial 6 finished with value: 163.5 and parameters: {'gamma': 0.00014980041085903243, 'max_grad_norm': 1.6784446670410864, 'exponent_n_steps': 3, 'lr': 0.0008870127434244448, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 5 with value: 500.0.


Eval num_timesteps=20000, episode_reward=163.50 +/- 39.56
Episode length: 163.50 +/- 39.56


[I 2023-07-28 09:08:42,880] Trial 7 pruned. 


Eval num_timesteps=10000, episode_reward=63.50 +/- 56.07
Episode length: 63.50 +/- 56.07
New best mean reward!
Eval num_timesteps=10000, episode_reward=133.80 +/- 112.17
Episode length: 133.80 +/- 112.17
New best mean reward!


[I 2023-07-28 09:09:14,026] Trial 8 finished with value: 133.9 and parameters: {'gamma': 0.0005492659566364785, 'max_grad_norm': 1.0040439707529412, 'exponent_n_steps': 5, 'lr': 1.2809391741319797e-05, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 5 with value: 500.0.


Eval num_timesteps=20000, episode_reward=133.90 +/- 56.16
Episode length: 133.90 +/- 56.16
New best mean reward!


[I 2023-07-28 09:09:28,112] Trial 9 pruned. 


Eval num_timesteps=10000, episode_reward=8.80 +/- 0.75
Episode length: 8.80 +/- 0.75
New best mean reward!
Eval num_timesteps=10000, episode_reward=283.10 +/- 88.96
Episode length: 283.10 +/- 88.96
New best mean reward!
Eval num_timesteps=20000, episode_reward=402.30 +/- 59.44
Episode length: 402.30 +/- 59.44
New best mean reward!


[I 2023-07-28 09:09:57,319] Trial 10 finished with value: 402.3 and parameters: {'gamma': 0.00010841023957467492, 'max_grad_norm': 2.8526413455255444, 'exponent_n_steps': 10, 'lr': 0.0017686822890453516, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 5 with value: 500.0.


Eval num_timesteps=10000, episode_reward=144.40 +/- 18.29
Episode length: 144.40 +/- 18.29
New best mean reward!
Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2023-07-28 09:10:26,287] Trial 11 finished with value: 500.0 and parameters: {'gamma': 0.000111605412176099, 'max_grad_norm': 2.7294776693309797, 'exponent_n_steps': 10, 'lr': 0.0012847579297842933, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 5 with value: 500.0.


Eval num_timesteps=10000, episode_reward=213.90 +/- 34.24
Episode length: 213.90 +/- 34.24
New best mean reward!
Eval num_timesteps=20000, episode_reward=491.00 +/- 27.00
Episode length: 491.00 +/- 27.00
New best mean reward!


[I 2023-07-28 09:10:55,888] Trial 12 finished with value: 491.0 and parameters: {'gamma': 0.00039613533898523397, 'max_grad_norm': 2.4778846729897497, 'exponent_n_steps': 9, 'lr': 0.001077120387547686, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 5 with value: 500.0.


Eval num_timesteps=10000, episode_reward=233.30 +/- 127.01
Episode length: 233.30 +/- 127.01
New best mean reward!
Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2023-07-28 09:11:25,044] Trial 13 finished with value: 500.0 and parameters: {'gamma': 0.00032050383880305255, 'max_grad_norm': 4.598681568922146, 'exponent_n_steps': 8, 'lr': 0.007233605962143598, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 5 with value: 500.0.
[I 2023-07-28 09:11:40,539] Trial 14 pruned. 


Eval num_timesteps=10000, episode_reward=139.90 +/- 122.23
Episode length: 139.90 +/- 122.23
New best mean reward!


[I 2023-07-28 09:11:55,240] Trial 15 pruned. 


Eval num_timesteps=10000, episode_reward=54.80 +/- 7.98
Episode length: 54.80 +/- 7.98
New best mean reward!


[I 2023-07-28 09:12:09,112] Trial 16 pruned. 


Eval num_timesteps=10000, episode_reward=9.60 +/- 0.80
Episode length: 9.60 +/- 0.80
New best mean reward!
Eval num_timesteps=10000, episode_reward=425.40 +/- 78.47
Episode length: 425.40 +/- 78.47
New best mean reward!


[I 2023-07-28 09:12:40,646] Trial 17 finished with value: 122.7 and parameters: {'gamma': 0.0008236690606958723, 'max_grad_norm': 1.5319705410224365, 'exponent_n_steps': 4, 'lr': 0.0024374870506971404, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 5 with value: 500.0.


Eval num_timesteps=20000, episode_reward=122.70 +/- 5.88
Episode length: 122.70 +/- 5.88
Eval num_timesteps=10000, episode_reward=210.30 +/- 172.19
Episode length: 210.30 +/- 172.19
New best mean reward!
Eval num_timesteps=20000, episode_reward=161.00 +/- 117.99
Episode length: 161.00 +/- 117.99


[I 2023-07-28 09:13:08,043] Trial 18 finished with value: 161.0 and parameters: {'gamma': 0.00022613895803886915, 'max_grad_norm': 2.413328655182138, 'exponent_n_steps': 8, 'lr': 0.0003335834519441241, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 5 with value: 500.0.
[I 2023-07-28 09:13:22,720] Trial 19 pruned. 


Eval num_timesteps=10000, episode_reward=9.10 +/- 0.94
Episode length: 9.10 +/- 0.94
New best mean reward!


[I 2023-07-28 09:13:38,448] Trial 20 pruned. 


Eval num_timesteps=10000, episode_reward=9.70 +/- 0.78
Episode length: 9.70 +/- 0.78
New best mean reward!


[I 2023-07-28 09:13:52,610] Trial 21 pruned. 


Eval num_timesteps=10000, episode_reward=78.90 +/- 87.63
Episode length: 78.90 +/- 87.63
New best mean reward!


[I 2023-07-28 09:14:06,921] Trial 22 pruned. 


Eval num_timesteps=10000, episode_reward=155.00 +/- 6.10
Episode length: 155.00 +/- 6.10
New best mean reward!


[I 2023-07-28 09:14:21,229] Trial 23 pruned. 


Eval num_timesteps=10000, episode_reward=146.80 +/- 132.40
Episode length: 146.80 +/- 132.40
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!
Eval num_timesteps=20000, episode_reward=432.80 +/- 72.77
Episode length: 432.80 +/- 72.77


[I 2023-07-28 09:14:51,325] Trial 24 finished with value: 432.8 and parameters: {'gamma': 0.00015843703785598935, 'max_grad_norm': 3.9812831283786356, 'exponent_n_steps': 10, 'lr': 0.0033206866040116187, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 5 with value: 500.0.
[I 2023-07-28 09:15:05,549] Trial 25 pruned. 


Eval num_timesteps=10000, episode_reward=8.70 +/- 0.78
Episode length: 8.70 +/- 0.78
New best mean reward!


[I 2023-07-28 09:15:19,788] Trial 26 pruned. 


Eval num_timesteps=10000, episode_reward=169.80 +/- 27.50
Episode length: 169.80 +/- 27.50
New best mean reward!


[I 2023-07-28 09:15:34,076] Trial 27 pruned. 


Eval num_timesteps=10000, episode_reward=9.10 +/- 0.54
Episode length: 9.10 +/- 0.54
New best mean reward!


[I 2023-07-28 09:15:47,009] Trial 28 pruned. 


Eval num_timesteps=10000, episode_reward=41.70 +/- 8.43
Episode length: 41.70 +/- 8.43
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2023-07-28 09:16:15,352] Trial 29 finished with value: 500.0 and parameters: {'gamma': 0.00017824048058050765, 'max_grad_norm': 1.3452054905528885, 'exponent_n_steps': 7, 'lr': 0.0037552619066924158, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 5 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!
Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00


[I 2023-07-28 09:16:45,856] Trial 30 finished with value: 500.0 and parameters: {'gamma': 0.0005395856023812694, 'max_grad_norm': 3.9215518973340147, 'exponent_n_steps': 9, 'lr': 0.006252305893368215, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 5 with value: 500.0.


Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2023-07-28 09:17:14,255] Trial 31 finished with value: 500.0 and parameters: {'gamma': 0.00018680331918747165, 'max_grad_norm': 1.4358857551463482, 'exponent_n_steps': 7, 'lr': 0.002903182651194456, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 5 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00


[I 2023-07-28 09:17:27,808] Trial 32 pruned. 


Eval num_timesteps=10000, episode_reward=143.70 +/- 5.62
Episode length: 143.70 +/- 5.62
New best mean reward!


[I 2023-07-28 09:17:41,184] Trial 33 pruned. 


Eval num_timesteps=10000, episode_reward=114.40 +/- 70.33
Episode length: 114.40 +/- 70.33
New best mean reward!
Eval num_timesteps=10000, episode_reward=489.10 +/- 21.81
Episode length: 489.10 +/- 21.81
New best mean reward!


[I 2023-07-28 09:18:09,812] Trial 34 finished with value: 493.7 and parameters: {'gamma': 0.00023314790995219187, 'max_grad_norm': 1.891477786959778, 'exponent_n_steps': 6, 'lr': 0.0022390354271472845, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 5 with value: 500.0.


Eval num_timesteps=20000, episode_reward=493.70 +/- 18.24
Episode length: 493.70 +/- 18.24
New best mean reward!


[I 2023-07-28 09:18:28,227] Trial 35 pruned. 


Eval num_timesteps=10000, episode_reward=9.30 +/- 0.64
Episode length: 9.30 +/- 0.64
New best mean reward!
Eval num_timesteps=10000, episode_reward=304.00 +/- 196.46
Episode length: 304.00 +/- 196.46
New best mean reward!


[I 2023-07-28 09:18:55,855] Trial 36 pruned. 


Eval num_timesteps=20000, episode_reward=397.40 +/- 88.78
Episode length: 397.40 +/- 88.78
New best mean reward!


[I 2023-07-28 09:19:10,073] Trial 37 pruned. 


Eval num_timesteps=10000, episode_reward=86.90 +/- 25.72
Episode length: 86.90 +/- 25.72
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!
Eval num_timesteps=20000, episode_reward=218.10 +/- 94.44
Episode length: 218.10 +/- 94.44


[I 2023-07-28 09:19:39,487] Trial 38 finished with value: 218.1 and parameters: {'gamma': 0.0006341579796716915, 'max_grad_norm': 4.134219059764608, 'exponent_n_steps': 8, 'lr': 0.005713607480188818, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 5 with value: 500.0.
[I 2023-07-28 09:19:53,001] Trial 39 pruned. 


Eval num_timesteps=10000, episode_reward=82.60 +/- 29.71
Episode length: 82.60 +/- 29.71
New best mean reward!
Number of finished trials:  40
Best trial:
  Value: 500.0
  Params: 
    gamma: 0.0001080633423743654
    max_grad_norm: 1.8653950360801839
    exponent_n_steps: 4
    lr: 0.0029079073425006753
    net_arch: tiny
    activation_fn: tanh
  User attrs:
    gamma_: 0.9998919366576257
    n_steps: 16
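
The `User attrs` above are the values actually used by the model, while `Params` are the raw values drawn from the search space. Judging from the logged numbers, the mapping appears to be `gamma_ = 1 - gamma` and `n_steps = 2 ** exponent_n_steps`. The sketch below checks that reading against the best trial; the helper name `decode_params` is hypothetical and the mapping is an assumption inferred from the output.

```python
def decode_params(params: dict) -> dict:
    """Map raw search-space values to model values (assumed mapping)."""
    return {
        # The search samples (1 - gamma) on a log scale, so invert it here
        "gamma": 1.0 - params["gamma"],
        # The search samples an exponent, so n_steps is a power of two
        "n_steps": 2 ** params["exponent_n_steps"],
    }


# Raw values copied from the best trial printed above
best_params = {"gamma": 0.0001080633423743654, "exponent_n_steps": 4}
decoded = decode_params(best_params)
print(decoded)
```

Sampling `1 - gamma` on a log scale concentrates the search on discount factors close to 1, which is usually the region of interest.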


Complete example: https://github.com/DLR-RM/rl-baselines3-zoo

# Conclusion

What we have seen in this notebook:
- the importance of good hyperparameters
- how to automate the hyperparameter search with Optuna


# Practice


## Optimising RL hyperparameters using Optuna (LunarLander-v2)

In [23]:
!pip install stable-baselines3



In [24]:
!pip install sb3-contrib



In [25]:
!apt-get install swig cmake ffmpeg

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swig is already the newest version (4.0.2-1ubuntu1).
cmake is already the newest version (3.22.1-1ubuntu1.22.04.1).
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 9 not upgraded.


In [26]:
!pip install optuna



In [21]:
!pip install git+https://github.com/DLR-RM/rl-baselines3-zoo@update/hf

Collecting git+https://github.com/DLR-RM/rl-baselines3-zoo@update/hf
  Cloning https://github.com/DLR-RM/rl-baselines3-zoo (to revision update/hf) to /tmp/pip-req-build-f6kaaq4k
  Running command git clone --filter=blob:none --quiet https://github.com/DLR-RM/rl-baselines3-zoo /tmp/pip-req-build-f6kaaq4k
  Running command git checkout -b update/hf --track origin/update/hf
  Switched to a new branch 'update/hf'
  Branch 'update/hf' set up to track remote branch 'update/hf' from 'origin'.
  Resolved https://github.com/DLR-RM/rl-baselines3-zoo to commit 7dcbff7e74e7a12c052452181ff353a4dbed313a
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [22]:
!pip install gymnasium[box2d]



In [38]:
import gymnasium as gym  # SB3 2.x is built on Gymnasium
import numpy as np

In [28]:
from stable_baselines3 import PPO, A2C, SAC, TD3, DQN

In [29]:
from sb3_contrib import QRDQN, TQC

In [30]:
import torch.nn as nn

In [31]:
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

In [32]:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.visualization import plot_optimization_history, plot_param_importances

In [49]:
N_TRIALS = 100
N_JOBS = 2
N_STARTUP_TRIALS = 5
N_EVALUATIONS = 2
N_TIMESTEPS = int(2e4)
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_ENVS = 11
N_EVAL_EPISODES = 10
TIMEOUT = int(60 * 30)

ENV_ID = "LunarLander-v2"

DEFAULT_HYPERPARAMS = {
    # "policy": "MlpPolicy",
    "env": ENV_ID,
}
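
As a quick sanity check on these constants, the sketch below computes the timesteps at which an evaluation is triggered. It assumes a single training environment (so one callback call corresponds to one timestep); `n_train_envs` is a name introduced here for illustration only.

```python
N_EVALUATIONS = 2
N_TIMESTEPS = int(2e4)
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)

# Assumption: one training environment, so each callback call is one timestep
n_train_envs = 1

# Timesteps at which the evaluation callback fires during one trial
eval_timesteps = [EVAL_FREQ * n_train_envs * i for i in range(1, N_EVALUATIONS + 1)]
print(eval_timesteps)
```

Under that assumption, each trial is evaluated twice, which is consistent with the `Eval num_timesteps=10000` and `Eval num_timesteps=20000` lines in the logs.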

In [50]:
from typing import Any, Dict

import numpy as np
import optuna
from stable_baselines3.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise
from torch import nn as nn

from rl_zoo3 import linear_schedule

def sample_ppo_params(trial: optuna.Trial) -> Dict[str, Any]:
    """
    Sampler for PPO hyperparams.

    :param trial:
    :return:
    """
    policy = trial.suggest_categorical("policy", ["MlpPolicy"])
    # policy = trial.suggest_categorical("policy", ["MlpPolicy", "CnnPolicy"])
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32, 64, 128, 256, 512])
    n_steps = trial.suggest_categorical("n_steps", [8, 16, 32, 64, 128, 256, 512, 1024, 2048])
    gamma = trial.suggest_categorical("gamma", [0.9, 0.95, 0.98, 0.99, 0.995, 0.999, 0.9999])
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1, log=True)
    lr_schedule = "constant"
    # Uncomment to enable learning rate schedule
    # lr_schedule = trial.suggest_categorical('lr_schedule', ['linear', 'constant'])
    ent_coef = trial.suggest_float("ent_coef", 0.00000001, 0.1, log=True)
    clip_range = trial.suggest_categorical("clip_range", [0.1, 0.2, 0.3, 0.4])
    n_epochs = trial.suggest_categorical("n_epochs", [1, 5, 10, 20])
    gae_lambda = trial.suggest_categorical("gae_lambda", [0.8, 0.9, 0.92, 0.95, 0.98, 0.99, 1.0])
    max_grad_norm = trial.suggest_categorical("max_grad_norm", [0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 5])
    vf_coef = trial.suggest_float("vf_coef", 0, 1)
    net_arch = trial.suggest_categorical("net_arch", ["small", "medium"])
    # Uncomment for gSDE (continuous actions)
    # log_std_init = trial.suggest_float("log_std_init", -4, 1)
    # Uncomment for gSDE (continuous action)
    # sde_sample_freq = trial.suggest_categorical("sde_sample_freq", [-1, 8, 16, 32, 64, 128, 256])
    # Orthogonal initialization
    ortho_init = False
    # ortho_init = trial.suggest_categorical('ortho_init', [False, True])
    # activation_fn = trial.suggest_categorical('activation_fn', ['tanh', 'relu', 'elu', 'leaky_relu'])
    activation_fn = trial.suggest_categorical("activation_fn", ["tanh", "relu"])

    # TODO: account when using multiple envs
    if batch_size > n_steps:
        batch_size = n_steps

    if lr_schedule == "linear":
        learning_rate = linear_schedule(learning_rate)

    # Independent networks usually work best
    # when not working with images
    net_arch = {
        "small": dict(pi=[64, 64], vf=[64, 64]),
        "medium": dict(pi=[256, 256], vf=[256, 256]),
    }[net_arch]

    activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU, "elu": nn.ELU, "leaky_relu": nn.LeakyReLU}[activation_fn]

    return {
        "policy": policy,
        "n_steps": n_steps,
        "batch_size": batch_size,
        "gamma": gamma,
        "learning_rate": learning_rate,
        "ent_coef": ent_coef,
        "clip_range": clip_range,
        "n_epochs": n_epochs,
        "gae_lambda": gae_lambda,
        "max_grad_norm": max_grad_norm,
        "vf_coef": vf_coef,
        # "sde_sample_freq": sde_sample_freq,
        "policy_kwargs": dict(
            # log_std_init=log_std_init,
            net_arch=net_arch,
            activation_fn=activation_fn,
            ortho_init=ortho_init,
        ),
    }

In [51]:
from stable_baselines3.common.callbacks import EvalCallback

class TrialEvalCallback(EvalCallback):
    """
    Callback used for evaluating and reporting a trial.

    :param eval_env: Evaluation environment
    :param trial: Optuna trial object
    :param n_eval_episodes: Number of evaluation episodes
    :param eval_freq: Evaluate the agent every ``eval_freq`` calls of the callback.
    :param deterministic: Whether the evaluation should
        use a stochastic or deterministic policy.
    :param verbose: Verbosity level
    """

    def __init__(
        self,
        eval_env: gym.Env,
        trial: optuna.Trial,
        n_eval_episodes: int = 5,
        eval_freq: int = 10000,
        deterministic: bool = True,
        verbose: int = 0,
    ):

        super().__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Evaluate policy (done in the parent class)
            super()._on_step()
            self.eval_idx += 1
            # Send report to Optuna
            self.trial.report(self.last_mean_reward, self.eval_idx)
            # Prune trial if needed
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True
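
The callback only reports intermediate rewards; the decision to stop a trial is made by the pruner. As a rough illustration of the median rule, here is a simplified pure-Python sketch for a maximisation study. It is not Optuna's implementation: the function `should_prune_median` and the reward numbers are hypothetical, and details such as how unfinished trials are handled are omitted.

```python
from statistics import median
from typing import Dict, List


def should_prune_median(
    current: Dict[int, float],
    completed: List[Dict[int, float]],
    step: int,
    n_startup_trials: int = 5,
    n_warmup_steps: int = 0,
) -> bool:
    """Simplified median rule: prune when the trial's best intermediate
    value so far is below the median of earlier trials at the same step."""
    # Never prune until enough trials have finished, or during warm-up
    if len(completed) < n_startup_trials or step < n_warmup_steps:
        return False
    others = [values[step] for values in completed if step in values]
    if not others:
        return False
    best_so_far = max(v for s, v in current.items() if s <= step)
    return best_so_far < median(others)


# Illustrative intermediate rewards, indexed by evaluation number
completed = [{1: 233.3, 2: 500.0}, {1: 146.8, 2: 432.8}, {1: 210.3, 2: 161.0}]
print(should_prune_median({1: 9.6}, completed, step=1, n_startup_trials=3))
print(should_prune_median({1: 500.0}, completed, step=1, n_startup_trials=3))
```

In this sketch, a trial whose first evaluation is far below the median of earlier trials is stopped early, which saves the second half of its training budget.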

In [52]:
def objective(trial: optuna.Trial) -> float:
    """
    Objective function used by Optuna to evaluate
    one configuration (i.e., one set of hyperparameters).

    Given a trial object, it samples hyperparameters,
    evaluates them and reports the result (mean episodic reward after training).

    :param trial: Optuna trial object
    :return: Mean episodic reward after training
    """

    kwargs = DEFAULT_HYPERPARAMS.copy()
    ### YOUR CODE HERE
    # TODO:
    # 1. Sample hyperparameters and update the default keyword arguments: `kwargs.update(other_params)`
    # 2. Create the evaluation envs
    # 3. Create the `TrialEvalCallback`

    # 1. Sample hyperparameters and update the keyword arguments
    kwargs.update(sample_ppo_params(trial))

    # Create the RL model
    model = PPO(**kwargs)

    # 2. Create envs used for evaluation using `make_vec_env`, `ENV_ID` and `N_EVAL_ENVS`
    eval_envs = make_vec_env(ENV_ID, N_EVAL_ENVS)

    # 3. Create the `TrialEvalCallback` callback defined above that will periodically evaluate
    # and report the performance using `N_EVAL_EPISODES` every `EVAL_FREQ`
    # TrialEvalCallback signature:
    # TrialEvalCallback(eval_env, trial, n_eval_episodes, eval_freq, deterministic, verbose)
    eval_callback = TrialEvalCallback(eval_envs,
                                        trial,
                                        N_EVAL_EPISODES,
                                        EVAL_FREQ,
                                        deterministic=True,
                                        verbose=1)

    ### END OF YOUR CODE

    nan_encountered = False
    try:
        # Train the model
        model.learn(N_TIMESTEPS, callback=eval_callback)
    except AssertionError as e:
        # Sometimes, random hyperparams can generate NaN
        print(e)
        nan_encountered = True
    finally:
        # Free memory
        model.env.close()
        eval_envs.close()

    # Tell the optimizer that the trial failed
    if nan_encountered:
        return float("nan")

    if eval_callback.is_pruned:
        raise optuna.exceptions.TrialPruned()

    return eval_callback.last_mean_reward

In [53]:
import torch

# Set pytorch num threads to 1 for faster training
torch.set_num_threads(1)
# Select the sampler, can be random, TPESampler, CMAES, ...
sampler = TPESampler(n_startup_trials=N_STARTUP_TRIALS)
# Do not prune before 1/3 of the max budget is used
# (with N_EVALUATIONS = 2, N_EVALUATIONS // 3 is 0, so there are no warm-up steps here)
pruner = MedianPruner(
    n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=N_EVALUATIONS // 3
)
# Create the study and start the hyperparameter optimization
study = optuna.create_study(sampler=sampler, pruner=pruner, direction="maximize")

try:
    study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS, timeout=TIMEOUT)
except KeyboardInterrupt:
    pass

print("Number of finished trials: ", len(study.trials))

print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")

print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

print("  User attrs:")
for key, value in trial.user_attrs.items():
    print(f"    {key}: {value}")

# Write report
study.trials_dataframe().to_csv("study_results_ppo_lunarlander.csv")

fig1 = plot_optimization_history(study)
fig2 = plot_param_importances(study)

fig1.show()
fig2.show()

[I 2023-07-29 09:08:56,529] A new study created in memory with name: no-name-246e8e41-a500-4bfc-975c-d1ceaced2b26


Eval num_timesteps=10000, episode_reward=-653.67 +/- 79.66
Episode length: 67.20 +/- 6.21
New best mean reward!
Eval num_timesteps=20000, episode_reward=-393.17 +/- 34.02
Episode length: 73.30 +/- 8.78
New best mean reward!


[I 2023-07-29 09:10:36,627] Trial 1 finished with value: -393.16663439999996 and parameters: {'policy': 'MlpPolicy', 'batch_size': 128, 'n_steps': 256, 'gamma': 0.99, 'learning_rate': 0.07819875344488565, 'ent_coef': 8.420792635557077e-07, 'clip_range': 0.1, 'n_epochs': 5, 'gae_lambda': 0.8, 'max_grad_norm': 0.3, 'vf_coef': 0.8076595567033983, 'net_arch': 'medium', 'activation_fn': 'tanh'}. Best is trial 1 with value: -393.16663439999996.


Eval num_timesteps=10000, episode_reward=-553.63 +/- 115.29
Episode length: 64.00 +/- 13.93
New best mean reward!
Eval num_timesteps=10000, episode_reward=-746.01 +/- 141.59
Episode length: 209.30 +/- 68.94
New best mean reward!


[I 2023-07-29 09:11:54,209] Trial 2 finished with value: -554.925823 and parameters: {'policy': 'MlpPolicy', 'batch_size': 32, 'n_steps': 64, 'gamma': 0.995, 'learning_rate': 0.15774836580234916, 'ent_coef': 0.001076498675662731, 'clip_range': 0.2, 'n_epochs': 1, 'gae_lambda': 0.9, 'max_grad_norm': 5, 'vf_coef': 0.0031315483958033186, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 1 with value: -393.16663439999996.


Eval num_timesteps=20000, episode_reward=-554.93 +/- 188.89
Episode length: 63.00 +/- 12.09
Eval num_timesteps=10000, episode_reward=-580.10 +/- 162.42
Episode length: 65.70 +/- 12.95
New best mean reward!


[I 2023-07-29 09:13:20,150] Trial 3 finished with value: -493.25805410000004 and parameters: {'policy': 'MlpPolicy', 'batch_size': 256, 'n_steps': 16, 'gamma': 0.9, 'learning_rate': 0.21321983952137882, 'ent_coef': 0.05337789871369882, 'clip_range': 0.1, 'n_epochs': 1, 'gae_lambda': 0.9, 'max_grad_norm': 0.8, 'vf_coef': 0.8892468221536586, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 1 with value: -393.16663439999996.


Eval num_timesteps=20000, episode_reward=-493.26 +/- 79.40
Episode length: 62.80 +/- 7.76
New best mean reward!


[I 2023-07-29 09:14:07,116] Trial 0 finished with value: -231.7009669 and parameters: {'policy': 'MlpPolicy', 'batch_size': 512, 'n_steps': 16, 'gamma': 0.995, 'learning_rate': 3.742304996255937e-05, 'ent_coef': 0.07420634455536304, 'clip_range': 0.1, 'n_epochs': 10, 'gae_lambda': 1.0, 'max_grad_norm': 0.5, 'vf_coef': 0.5492692424051365, 'net_arch': 'medium', 'activation_fn': 'tanh'}. Best is trial 0 with value: -231.7009669.


Eval num_timesteps=20000, episode_reward=-231.70 +/- 134.49
Episode length: 325.10 +/- 107.59
New best mean reward!
Eval num_timesteps=10000, episode_reward=-757.62 +/- 239.15
Episode length: 111.90 +/- 31.10
New best mean reward!
Eval num_timesteps=20000, episode_reward=-635.24 +/- 170.58
Episode length: 94.00 +/- 20.84
New best mean reward!


[I 2023-07-29 09:15:09,877] Trial 4 finished with value: -635.2393652 and parameters: {'policy': 'MlpPolicy', 'batch_size': 8, 'n_steps': 1024, 'gamma': 0.98, 'learning_rate': 0.16186354072714873, 'ent_coef': 0.08196431105682786, 'clip_range': 0.3, 'n_epochs': 1, 'gae_lambda': 0.98, 'max_grad_norm': 0.5, 'vf_coef': 0.6613509856096131, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 0 with value: -231.7009669.


Eval num_timesteps=10000, episode_reward=-438.00 +/- 58.70
Episode length: 352.60 +/- 77.72
New best mean reward!
Number of finished trials:  7
Best trial:
  Value: -231.7009669
  Params: 
    policy: MlpPolicy
    batch_size: 512
    n_steps: 16
    gamma: 0.995
    learning_rate: 3.742304996255937e-05
    ent_coef: 0.07420634455536304
    clip_range: 0.1
    n_epochs: 10
    gae_lambda: 1.0
    max_grad_norm: 0.5
    vf_coef: 0.5492692424051365
    net_arch: medium
    activation_fn: tanh
  User attrs:
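
Note that `Params` lists the values as sampled, before the adjustment in `sample_ppo_params` that caps `batch_size` at `n_steps`. The sketch below reproduces that cap for the best trial, so a follow-up training run can use the same settings the trial actually ran with. The helper `effective_ppo_params` is hypothetical, and a single training environment is assumed.

```python
def effective_ppo_params(params: dict) -> dict:
    """Apply the same batch-size cap as `sample_ppo_params` (single env assumed)."""
    adjusted = dict(params)
    if adjusted["batch_size"] > adjusted["n_steps"]:
        adjusted["batch_size"] = adjusted["n_steps"]
    return adjusted


# Values copied from the best trial printed above
best_params = {"batch_size": 512, "n_steps": 16}
effective = effective_ppo_params(best_params)
print(effective)
```

For this trial the sampled batch size exceeds the rollout length, so the cap applies and the model was trained with a smaller batch than the one listed under `Params`.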
