<a href="https://colab.research.google.com/github/CaptainAmu/Reinforcement-Learning-Tutorial/blob/main/notebooks/unit3/optuna/hyp_optim_Lunarlander.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Optimizing the Hyperparameters for Training LunarLander-v2

In this notebook we use ```optuna``` to automatically search for the best hyperparameters for training a ```PPO``` agent on ```LunarLander-v2```.

# Preparation

## Install dependencies and create a virtual screen

In [None]:
!apt install swig cmake
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt

!sudo apt-get update
!sudo apt-get install -y python3-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
Suggested packages:
  swig-doc swig-examples swig4.0-examples swig4.0-doc
The following NEW packages will be installed:
  swig swig4.0
0 upgraded, 2 newly installed, 0 to remove and 35 not upgraded.
Need to get 1,116 kB of archives.
After this operation, 5,542 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig4.0 amd64 4.0.2-1ubuntu1 [1,110 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig all 4.0.2-1ubuntu1 [5,632 B]
Fetched 1,116 kB in 1s (1,023 kB/s)
Selecting previously unselected package swig4.0.
(Reading database ... 126435 files and directories currently installed.)
Preparing to unpack .../swig4.0_4.0.2-1ubuntu1_amd64.deb ...
Unpacking swig4.0 (4.0.2-1ubuntu1) ...
Selecting previously unselected package swig.
Preparing to unpack .../swig_4.0.2-1ubu

In [37]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()


datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).



<pyvirtualdisplay.display.Display at 0x7cf9f64fa810>

## Import the packages

In [None]:
!pip install pygame==2.5.2

# Manually install box2d-py
!pip install box2d-py==2.3.5

# Install gymnasium without forcing a pygame version
!pip install gymnasium==0.28.1

# stable-baselines3 alpha release
!pip install stable-baselines3==2.0.0a5

# Hugging Face tooling
!pip install huggingface_sb3

Collecting pygame==2.5.2
  Downloading pygame-2.5.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading pygame-2.5.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.9/13.9 MB 62.4 MB/s eta 0:00:00
Installing collected packages: pygame
  Attempting uninstall: pygame
    Found existing installation: pygame 2.6.1
    Uninstalling pygame-2.6.1:
      Successfully uninstalled pygame-2.6.1
Successfully installed pygame-2.5.2
Collecting box2d-py==2.3.5
  Using cached box2d-py-2.3.5.tar.gz (374 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: box2d-py
  Building wheel for box2d-py (setup.py) ... done
  Created wheel for box2d-py: filename=box2d_py-2.3.5-cp312-cp312-linux_x86_64.whl size=2381958 sha256=30270e74e8cd9683062d8dd3f52260365e56ea0fe7f88cc8e00baa3031d59fd5
  Stored in directory:

In [38]:
import gymnasium

from huggingface_sb3 import load_from_hub, package_to_hub
from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

## Create Environment

In [None]:
import gymnasium as gym

env = gym.make("LunarLander-v2")
print(f'Action space, {env.action_space}')
print(f'Action space sample: {env.action_space.sample()}')
print(f'Observation space, {env.observation_space}')
print(f'Observation space sample: {env.observation_space.sample()}')

env = make_vec_env("LunarLander-v2", n_envs = 16) # 16 envs in parallel

Action space, Discrete(4)
Action space sample: 3
Observation space, Box([-90.        -90.         -5.         -5.         -3.1415927  -5.
  -0.         -0.       ], [90.        90.         5.         5.         3.1415927  5.
  1.         1.       ], (8,), float32)
Observation space sample: [-78.315315    63.04755      2.13648     -0.6513512   -2.5556724
   4.4190054    0.20695183   0.46599555]


# Use Automatic Hyperparameter Tuning to train a PPO model

## Imports

In [None]:
!pip install optuna



In [None]:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.visualization import plot_optimization_history, plot_param_importances

## Config

In [None]:
N_TRIALS = 100  # Maximum number of trials
N_JOBS = 1 # Number of jobs to run in parallel
N_STARTUP_TRIALS = 5  # Stop random sampling after N_STARTUP_TRIALS
N_EVALUATIONS = 2  # Number of evaluations during the training
N_TIMESTEPS = int(1e5)  # Training budget
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_ENVS = 5
N_EVAL_EPISODES = 10
TIMEOUT = int(60 * 15)  # 15 minutes

ENV_ID = "LunarLander-v2"

DEFAULT_HYPERPARAMS = {
    "policy": "MlpPolicy",
    "env": ENV_ID,
}
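
To make the evaluation schedule implied by these constants concrete, here is a small arithmetic sketch. It only re-derives numbers from the config above. The name `warmup_steps` is introduced here for illustration and mirrors the `N_EVALUATIONS // 3` expression passed to the pruner in the optimization loop. The evaluation timesteps assume a single training environment, since the callback counts calls rather than raw timesteps.

```python
# Re-derive the schedule implied by the config above (plain arithmetic, no RL libraries).
N_EVALUATIONS = 2
N_TIMESTEPS = int(1e5)
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
TIMEOUT = int(60 * 15)

# Timesteps at which the callback evaluates, assuming one training env
# (each callback call then corresponds to exactly one environment step).
eval_points = [EVAL_FREQ * i for i in range(1, N_EVALUATIONS + 1)]

# Illustrative name: the warm-up value handed to the pruner later in the notebook.
warmup_steps = N_EVALUATIONS // 3

print(f"Evaluate at timesteps: {eval_points}")
print(f"Pruner warm-up steps:  {warmup_steps}")
print(f"Timeout: {TIMEOUT} s = {TIMEOUT / 60:.0f} min")
```

With only two evaluations per trial, the integer division yields zero warm-up steps, so pruning can already trigger at the first evaluation once enough trials have finished.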


## Defining the Search Space

Recall from unit 1 that the PPO model for ```LunarLander-v2``` used the following baseline hyperparameters:

```
model_PPO = PPO(
    policy = 'MlpPolicy',
    env = env,
    n_steps = 1024,
    batch_size = 64,
    n_epochs = 4,
    gamma = 0.999,
    gae_lambda = 0.98,
    ent_coef = 0.01,
    verbose = 1
)
```

which worked quite well. Let's set the search space for hyperparameters around them.
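
As a rough sanity check on what these numbers mean for one PPO update, the sketch below works out the rollout size and the number of gradient steps. It assumes the 16 parallel environments created earlier with `make_vec_env`; `n_envs` is taken from that cell and is not itself a PPO argument.

```python
# Baseline values from the PPO call above.
n_steps = 1024      # steps collected per environment before each update
batch_size = 64     # minibatch size
n_epochs = 4        # passes over the rollout buffer per update
n_envs = 16         # assumption: the vectorized env created earlier

rollout_size = n_steps * n_envs                      # transitions per update
minibatches_per_epoch = rollout_size // batch_size   # minibatches in one pass
gradient_steps = minibatches_per_epoch * n_epochs    # optimizer steps per update

print(f"Transitions per update:    {rollout_size}")
print(f"Minibatches per epoch:     {minibatches_per_epoch}")
print(f"Gradient steps per update: {gradient_steps}")
```

Changing `n_steps` or `batch_size` in the search space therefore also changes how many gradient steps each update performs, which is worth keeping in mind when comparing trials.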

In [None]:
### DEPRECATED: This sampler is adapted from optuna_lab.ipynb and does not incorporate the baseline hyperparameters for PPO. ###

from typing import Any, Dict
import torch
import torch.nn as nn

def sample_ppo_params(trial: optuna.Trial) -> Dict[str, Any]:
    """
    Sampler for PPO hyperparameters.

    :param trial: Optuna trial object
    :return: The sampled hyperparameters for the given trial.
    """
    # Discount factor between 0.9 and 0.9999
    gamma = 1.0 - trial.suggest_float("gamma", 0.0001, 0.1, log=True)
    max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 5.0, log=True)
    # 256, 512, 1024
    n_steps = 2 ** trial.suggest_int("exponent_n_steps", 8, 10)

    # - define the learning rate search space [1e-5, 1] (log) -> `suggest_float`
    # - define the network architecture search space ["tiny", "small"] -> `suggest_categorical`
    # - define the activation function search space ["tanh", "relu"]
    learning_rate = trial.suggest_float('lr', 1e-5, 1, log=True)
    net_arch = trial.suggest_categorical('net_arch', ['tiny', 'small'])
    activation_fn = trial.suggest_categorical('activation_fn', ['tanh', 'relu'])

    # Display true values
    trial.set_user_attr("gamma_", gamma)
    trial.set_user_attr("n_steps", n_steps)

    # Pass the dict directly: wrapping it in a list is deprecated since SB3 1.8
    net_arch = (
        {"pi": [64], "vf": [64]} if net_arch == "tiny"
        else {"pi": [64, 64], "vf": [64, 64]}
    )

    activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU}[activation_fn]

    batch_size = trial.suggest_categorical("batch_size", [64, 128, 256])
    clip_range = trial.suggest_float("clip_range", 0.1, 0.3)
    gae_lambda = trial.suggest_float("gae_lambda", 0.8, 1.0)
    ent_coef = trial.suggest_float("ent_coef", 1e-8, 0.01, log=True)

    return {
        "n_steps": n_steps,
        "gamma": gamma,
        "learning_rate": learning_rate,
        "max_grad_norm": max_grad_norm,
        "batch_size": batch_size,
        "clip_range": clip_range,
        "gae_lambda": gae_lambda,
        "ent_coef": ent_coef,
        "policy_kwargs": {
            "net_arch": net_arch,
            "activation_fn": activation_fn,
        },
    }

### DEPRECATED WARNING ###
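
The `gamma` line in the deprecated sampler searches `1 - gamma` on a log scale rather than `gamma` itself. The sketch below mimics that idea with the standard library only. `log_uniform` is a hypothetical helper standing in for a log-scale float suggestion; it is not an Optuna function.

```python
import math
import random

def log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Hypothetical stand-in for a log-scale float suggestion."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
for _ in range(5):
    one_minus_gamma = log_uniform(rng, 0.0001, 0.1)
    gamma = 1.0 - one_minus_gamma
    # 1 / (1 - gamma) is a common rule of thumb for the effective horizon.
    horizon = 1.0 / one_minus_gamma
    print(f"gamma={gamma:.5f}  effective horizon ~ {horizon:.0f} steps")
```

Sampling `1 - gamma` on a log scale spreads trials roughly evenly across the decades 0.9, 0.99 and 0.999, whereas sampling `gamma` uniformly in [0.9, 0.9999] would place very few trials close to 1.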

In [45]:
from typing import Any, Dict
import torch
import torch.nn as nn

def sample_ppo_params(trial):
    n_steps = 1024 # trial.suggest_categorical("n_steps", [512, 1024, 2048])
    batch_size = 64
    n_epochs = 4
    gamma = 0.999 # trial.suggest_float("gamma", 0.95, 0.9999, log=True)
    gae_lambda = 0.98
    ent_coef = 0.01 # trial.suggest_float("ent_coef", 1e-4, 0.05, log=True)
    learning_rate = trial.suggest_float("learning_rate", 1e-3, 3e-3, log=True)
    # clip_range = trial.suggest_float("clip_range", 0.1, 0.3)
    # max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 5.0, log=True)
    # vf_coef = trial.suggest_float("vf_coef", 0.1, 1.0)

    # net_arch = trial.suggest_categorical("net_arch", ['tiny', 'small'])
    # net_arch = dict(pi=[64], vf=[64]) if net_arch == 'tiny' else dict(pi=[64, 64], vf=[64, 64])

    # activation_fn = trial.suggest_categorical("activation_fn", ['tanh', 'relu'])
    # activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU}[activation_fn]

    return {
        "n_steps": n_steps,
        "batch_size": batch_size,
        "n_epochs": n_epochs,
        "gamma": gamma,
        "gae_lambda": gae_lambda,
        "ent_coef": ent_coef,
        "learning_rate": learning_rate,
        # "clip_range": clip_range,
        # "max_grad_norm": max_grad_norm,
        # "vf_coef": vf_coef,
        # "policy_kwargs": {
        #     "net_arch": net_arch,
        #     "activation_fn": activation_fn,
        # },
    }


## Defining the objective function

First define a custom callback to report the results of periodic evaluations to ```optuna```.

In [40]:
from stable_baselines3.common.callbacks import EvalCallback

class TrialEvalCallback(EvalCallback):
    """
    Callback used for evaluating and reporting a trial.

    :param eval_env: Evaluation environment
    :param trial: Optuna trial object
    :param n_eval_episodes: Number of evaluation episodes
    :param eval_freq: Evaluate the agent every ``eval_freq`` calls of the callback.
    :param deterministic: Whether the evaluation should
        use a stochastic or deterministic policy.
    :param verbose: Verbosity level (0 means no output).
    """

    def __init__(
        self,
        eval_env: gym.Env,
        trial: optuna.Trial,
        n_eval_episodes: int = 5,
        eval_freq: int = 10000,
        deterministic: bool = True,
        verbose: int = 0,
    ):

        super().__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Evaluate policy (done in the parent class)
            super()._on_step()
            self.eval_idx += 1
            # Send report to Optuna
            self.trial.report(self.last_mean_reward, self.eval_idx)
            # Prune trial if needed
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True
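
`trial.should_prune()` delegates the decision to the study's pruner. As a rough illustration of the median rule that the imported `MedianPruner` is named after, here is a simplified, library-free sketch. The function `median_should_prune`, its arguments and the sample data are all hypothetical; Optuna's actual implementation handles more cases and may differ in details such as how warm-up steps and a trial's best intermediate value are treated.

```python
from statistics import median
from typing import Dict, List

def median_should_prune(
    current_value: float,
    step: int,
    finished_trials: List[Dict[int, float]],
    n_startup_trials: int = 5,
    n_warmup_steps: int = 0,
) -> bool:
    """Simplified median rule for a maximization study (hypothetical helper)."""
    # Do not prune before enough trials have finished, or during warm-up.
    if len(finished_trials) < n_startup_trials or step < n_warmup_steps:
        return False
    # Collect what earlier trials reported at the same evaluation step.
    peers = [t[step] for t in finished_trials if step in t]
    if not peers:
        return False
    # Prune when the current trial is below the median of its peers.
    return current_value < median(peers)

# Five made-up finished trials, each reporting a reward at evaluation steps 1 and 2.
history = [{1: -200.0, 2: -50.0}, {1: -150.0, 2: 20.0}, {1: -100.0, 2: 60.0},
           {1: -180.0, 2: 10.0}, {1: -120.0, 2: 90.0}]

print(median_should_prune(-300.0, step=1, finished_trials=history))
print(median_should_prune(-110.0, step=1, finished_trials=history))
```

In this made-up history the median reward at step 1 is -150, so a trial reporting -300 would be stopped early while one reporting -110 would continue.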

Then we define the objective function that is in charge of sampling hyperparameters, creating the model and then returning the results to ```Optuna```.

In [43]:
def objective(trial: optuna.Trial) -> float:
    """
    Objective function used by Optuna to evaluate
    one configuration (i.e., one set of hyperparameters).

    Given a trial object, it samples hyperparameters,
    evaluates them and reports the result (mean episodic reward after training).

    :param trial: Optuna trial object
    :return: Mean episodic reward after training
    """

    kwargs = DEFAULT_HYPERPARAMS.copy()
    ### YOUR CODE HERE
    # TODO:
    # 1. Sample hyperparameters and update the default keyword arguments: `kwargs.update(other_params)`
    # 2. Create the evaluation envs
    # 3. Create the `TrialEvalCallback`

    # 1. Sample hyperparameters and update the keyword arguments
    kwargs.update(sample_ppo_params(trial))

    # Create the RL model
    model = PPO(**kwargs)

    # 2. Create envs used for evaluation using `make_vec_env`, `ENV_ID` and `N_EVAL_ENVS`
    eval_envs = make_vec_env(ENV_ID, n_envs=N_EVAL_ENVS)

    # 3. Create the `TrialEvalCallback` callback defined above that will periodically evaluate
    # and report the performance using `N_EVAL_EPISODES` every `EVAL_FREQ`
    # TrialEvalCallback signature:
    # TrialEvalCallback(eval_env, trial, n_eval_episodes, eval_freq, deterministic, verbose)
    eval_callback = TrialEvalCallback(
        eval_envs,
        trial,
        n_eval_episodes=N_EVAL_EPISODES,
        eval_freq=EVAL_FREQ,
        deterministic=True,
        verbose=0,
    )
    ### END OF YOUR CODE

    nan_encountered = False
    try:
        # Train the model
        model.learn(N_TIMESTEPS, callback=eval_callback)
    except (AssertionError, ValueError) as e:
        # Sometimes, random hyperparams can generate NaN
        # (PyTorch distributions raise ValueError on invalid parameters)
        print(e)
        nan_encountered = True
    finally:
        # Free memory
        model.env.close()
        eval_envs.close()

    # Tell the optimizer that the trial failed
    if nan_encountered:
        return float("nan")

    if eval_callback.is_pruned:
        raise optuna.exceptions.TrialPruned()

    return eval_callback.last_mean_reward

## The optimization loop

In [46]:
import torch as th

# Set pytorch num threads to 1 for faster training
th.set_num_threads(1)
# Select the sampler, can be random, TPESampler, CMAES, ...
sampler = TPESampler(n_startup_trials=N_STARTUP_TRIALS)
# Do not prune before 1/3 of the evaluations are done
# (with N_EVALUATIONS = 2, N_EVALUATIONS // 3 == 0, so there is no warm-up)
pruner = MedianPruner(
    n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=N_EVALUATIONS // 3
)
# Create the study and start the hyperparameter optimization
study = optuna.create_study(sampler=sampler, pruner=pruner, direction="maximize")

try:
    study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS, timeout=TIMEOUT)
except KeyboardInterrupt:
    pass

print("Number of finished trials: ", len(study.trials))

print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")

print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

print("  User attrs:")
for key, value in trial.user_attrs.items():
    print(f"    {key}: {value}")

# Write report
study.trials_dataframe().to_csv("study_results_ppo_lunarlander.csv")

fig1 = plot_optimization_history(study)
fig2 = plot_param_importances(study)

fig1.show()
fig2.show()

[I 2025-09-17 03:56:53,374] A new study created in memory with name: no-name-7e022cbd-b601-45c7-944d-82d28f1f7bb6
[I 2025-09-17 04:00:15,325] Trial 0 finished with value: 39.5296909 and parameters: {'learning_rate': 0.0014547741815038122}. Best is trial 0 with value: 39.5296909.
[I 2025-09-17 04:04:05,874] Trial 1 finished with value: 66.91349290000001 and parameters: {'learning_rate': 0.001906929064183189}. Best is trial 1 with value: 66.91349290000001.
[I 2025-09-17 04:07:20,774] Trial 2 finished with value: 112.94084219999999 and parameters: {'learning_rate': 0.0012823289467222385}. Best is trial 2 with value: 112.94084219999999.
[I 2025-09-17 04:11:00,826] Trial 3 finished with value: 154.71600569999998 and parameters: {'learning_rate': 0.0021176811650070824}. Best is trial 3 with value: 154.71600569999998.
[I 2025-09-17 04:15:06,331] Trial 4 finished with value: -139.309303 and parameters: {'learning_rate': 0.0017243105501260206}. Best is trial 3 with value: 154.71600569999998.


Number of finished trials:  5
Best trial:
  Value: 154.71600569999998
  Params: 
    learning_rate: 0.0021176811650070824
  User attrs:
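
To put the five logged trials side by side, the following sketch summarizes them with the standard library. The numbers are copied from the trial log above; nothing here re-runs the study.

```python
from statistics import mean, stdev

# (learning_rate, final mean reward) copied from the trial log above.
trials = [
    (0.0014547741815038122, 39.5296909),
    (0.001906929064183189, 66.91349290000001),
    (0.0012823289467222385, 112.94084219999999),
    (0.0021176811650070824, 154.71600569999998),
    (0.0017243105501260206, -139.309303),
]

values = [v for _, v in trials]
best_lr, best_value = max(trials, key=lambda t: t[1])

print(f"mean reward: {mean(values):.1f}, std: {stdev(values):.1f}")
print(f"best: lr={best_lr:.5f} -> {best_value:.1f}")

# Sorting by learning rate shows whether reward changes smoothly with it.
for lr, value in sorted(trials):
    print(f"lr={lr:.5f}  reward={value:8.1f}")
```

In this run the rewards do not vary monotonically with the learning rate, which may reflect run-to-run noise as much as the hyperparameter itself; five trials are too few to tell.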


