# Stable Baselines3 Tutorial - Callbacks and hyperparameter tuning
(Taken from <https://stable-baselines3.readthedocs.io/en/master/guide/callbacks.html#>)
- Comparing default and best hyperparameters in RL.
- Using callbacks for monitoring, auto-saving, model manipulation, progress bars...


In [1]:
# Dependencies: swig, tqdm

import gym
from stable_baselines3 import A2C, SAC, PPO, TD3

# 1. Hyperparameter tuning
Here we compare the performance of Soft Actor-Critic (SAC) on the Pendulum environment with default and "tuned" hyperparameters.

Resources:
- rl zoo: https://github.com/DLR-RM/rl-baselines3-zoo
- Optuna: https://github.com/optuna/optuna

In [2]:
import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

eval_env = Monitor(gym.make('Pendulum-v0')) # AH: Wrapped with Monitor to prevent erroneous metrics.

## Default model

In [3]:
default_model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1, seed=0, batch_size=64, policy_kwargs=dict(net_arch=[64, 64])).learn(8000)

Using cpu device
Creating environment from the given name 'Pendulum-v0'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.36e+03 |
| time/              |           |
|    episodes        | 4         |
|    fps             | 125       |
|    time_elapsed    | 6         |
|    total_timesteps | 800       |
| train/             |           |
|    actor_loss      | 20.3      |
|    critic_loss     | 0.968     |
|    ent_coef        | 0.812     |
|    ent_coef_loss   | -0.337    |
|    learning_rate   | 0.0003    |
|    n_updates       | 699       |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.55e+03 |
| time/              |           |
|    episodes        | 8         |
|    fps             | 121       |
|    time_e

In [4]:
mean_reward, std_reward = evaluate_policy(default_model, eval_env, n_eval_episodes=500)
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:-182.25 +/- 99.20


## Tuned model

In [5]:
tuned_model = SAC('MlpPolicy', 'Pendulum-v0', batch_size=256, verbose=1, policy_kwargs=dict(net_arch=[256, 256]), seed=0).learn(8000)

Using cpu device
Creating environment from the given name 'Pendulum-v0'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.56e+03 |
| time/              |           |
|    episodes        | 4         |
|    fps             | 51        |
|    time_elapsed    | 15        |
|    total_timesteps | 800       |
| train/             |           |
|    actor_loss      | 24.8      |
|    critic_loss     | 0.259     |
|    ent_coef        | 0.814     |
|    ent_coef_loss   | -0.339    |
|    learning_rate   | 0.0003    |
|    n_updates       | 699       |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.61e+03 |
| time/              |           |
|    episodes        | 8         |
|    fps             | 47        |
|    time_e

In [6]:
mean_reward, std_reward = evaluate_policy(tuned_model, eval_env, n_eval_episodes=500)
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:-148.26 +/- 91.90
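
To judge whether the gap between the two evaluation results is larger than evaluation noise, the reported means and standard deviations can be turned into a standard error. The sketch below is only a rough check: it assumes the 500 evaluation episodes are independent, it reuses the figures printed above, and the helper name `standard_error` is hypothetical. Both models were trained with a single seed, so it says nothing about how the hyperparameters behave across seeds.

```python
import math

# Figures copied from the two evaluation outputs above (500 episodes each).
N_EPISODES = 500
default_mean, default_std = -182.25, 99.20
tuned_mean, tuned_std = -148.26, 91.90


def standard_error(std: float, n: int) -> float:
    """Standard error of a mean estimated from n independent episodes."""
    return std / math.sqrt(n)


# Difference between the mean rewards (positive means the tuned model scored higher).
diff = tuned_mean - default_mean

# Standard error of that difference, assuming the two sets of episodes are independent.
se_diff = math.sqrt(
    standard_error(default_std, N_EPISODES) ** 2
    + standard_error(tuned_std, N_EPISODES) ** 2
)

print(f"difference: {diff:.2f}, standard error: {se_diff:.2f}, ratio: {diff / se_diff:.1f}")
```

A ratio well above 2 suggests the difference between these two trained policies is unlikely to come from evaluation noise alone.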


# 2. Callbacks
Callback = function that will be called at a given stage of the training.
They are passed as an argument of `model.learn()`.

Types:
* Custom callback: called at 5 specific moments of the training process.
* Event callback: called when a certain user-defined situation is detected.

In [2]:
from stable_baselines3.common.callbacks import BaseCallback

## Types of callback

### 2.1 Custom callback
The class derives from `BaseCallback`.

Events:
* `_on_training_start`  Called before the first rollout starts.
* `_on_rollout_start`   
    * Rollout = collection of environment interactions using current policy.
    * Triggered before collecting new samples.
    * For off-policy algorithms, rollout = steps taken in the env between two updates.
* `_on_step`
    * Called by the model after each call to `env.step()`.
    * For child callback (of an `EventCallback`), this is called when the event is triggered.
    * :return: (bool) If False, training is aborted early.
* `_on_rollout_end`     Triggered before updating the policy.
* `_on_training_end`    Triggered before exiting the `learn()` method.

Variables accessible in the callback:
* `self.model`          The RL model (`type: BaseAlgorithm`).
* `self.training_env`   The environment used for training (`type: Union[gym.Env, VecEnv, None]`).
* `self.n_calls`        Number of times the callback was called (`type: int`).
* `self.num_timesteps`  Total number of steps taken (number of envs x step calls) (`type: int`).
* `self.locals`         Local variables (`type: Dict[str, Any]`).
* `self.globals`        Global variables (`type: Dict[str, Any]`).
* `self.logger`         The logger object, to report on terminal (`type: stable_baselines3.common.logger`).
* `self.parent`         The parent object (`type: Optional[BaseCallback]`).
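
The order in which the five events fire can be illustrated with a toy loop. This is not Stable Baselines3 code: `ToyCallback` and `toy_learn` are hypothetical stand-ins with no environment or policy, the hook names drop the leading underscore, and skipping the rollout-end hook after an abort is a simplification made for this sketch.

```python
from typing import List


class ToyCallback:
    """Hypothetical stand-in for a callback; records which hook fired."""

    def __init__(self, stop_after: int) -> None:
        self.stop_after = stop_after  # ask to stop once this many steps were taken
        self.n_calls = 0
        self.events: List[str] = []

    def on_training_start(self) -> None:
        self.events.append("training_start")

    def on_rollout_start(self) -> None:
        self.events.append("rollout_start")

    def on_step(self) -> bool:
        self.n_calls += 1
        self.events.append("step")
        # Returning False asks the loop to stop training early.
        return self.n_calls < self.stop_after

    def on_rollout_end(self) -> None:
        self.events.append("rollout_end")

    def on_training_end(self) -> None:
        self.events.append("training_end")


def toy_learn(total_steps: int, steps_per_rollout: int, callback: ToyCallback) -> None:
    """Toy training loop: no environment, no policy, only the hook order."""
    callback.on_training_start()
    steps_done = 0
    keep_training = True
    while keep_training and steps_done < total_steps:
        callback.on_rollout_start()
        for _ in range(steps_per_rollout):
            steps_done += 1  # stands in for env.step()
            if not callback.on_step():
                keep_training = False
                break
        if keep_training:
            callback.on_rollout_end()  # a policy update would follow here
    callback.on_training_end()


callback = ToyCallback(stop_after=3)
toy_learn(total_steps=8, steps_per_rollout=2, callback=callback)
print(callback.events)
```

The third step returns `False`, so the loop leaves the second rollout early and goes straight to the training-end hook.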

### 2.2 Event callback
The class `EventCallback` derives from `BaseCallback`.
When an event is triggered (e.g. `EvalCallback` when there's a new best model) =>
a child callback is called (e.g. `StopTrainingOnRewardThreshold` if mean reward > a threshold).

Callback collection:
* Save the model periodically (`CheckpointCallback`)
* Evaluate the model periodically and save the best one (`EvalCallback`)
* Chain callbacks (`CallbackList`)
* Trigger callback on events (`EventCallback`, `EveryNTimesteps`)
* Stop training early based on a reward threshold (`StopTrainingOnRewardThreshold`)

(Note: when using multiple envs, each call to `env.step()` advances `n_envs` timesteps, so the frequency must be adjusted like this: `save_freq = max(save_freq // n_envs, 1)`.)
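
The frequency adjustment from the note can be checked with a few values. `per_env_freq` is a hypothetical helper name used only for this sketch; the formula itself is the one given in the note.

```python
def per_env_freq(freq: int, n_envs: int) -> int:
    """Convert a frequency in total timesteps into a number of callback calls."""
    # Integer division, floored at 1 so the callback never gets a frequency of 0.
    return max(freq // n_envs, 1)


# Same target (roughly every 1000 timesteps) for different numbers of environments.
for n_envs in (1, 4, 16):
    print(f"n_envs={n_envs}: save_freq={per_env_freq(1000, n_envs)}")
```

With 16 environments the result is 62 calls, i.e. 992 timesteps, so the target is only met approximately when `n_envs` does not divide the frequency.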

#### CheckpointCallback
Save the model periodically.

In [7]:
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import CheckpointCallback

# Save a checkpoint every 1000 steps
checkpoint_callback = CheckpointCallback(
    save_freq=1000, save_path='./logs/', name_prefix='rl_model')

model = SAC('MlpPolicy', 'Pendulum-v0')
model.learn(2000, callback=checkpoint_callback)

<stable_baselines3.sac.sac.SAC at 0x13de4ed60>

#### EvalCallback
Evaluate the model periodically and save the best one.

In [8]:
import gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback

# Separate evaluation env
eval_env = gym.make('Pendulum-v0')
# Use deterministic actions for evaluation
eval_callback = EvalCallback(
    eval_env, best_model_save_path='./logs/',
    log_path='./logs/', eval_freq=500,
    deterministic=True, render=False)

model = SAC('MlpPolicy', 'Pendulum-v0')
model.learn(5000, callback=eval_callback)



Eval num_timesteps=500, episode_reward=-1296.92 +/- 46.46
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=1000, episode_reward=-1654.46 +/- 110.17
Episode length: 200.00 +/- 0.00
Eval num_timesteps=1500, episode_reward=-1480.33 +/- 115.36
Episode length: 200.00 +/- 0.00
Eval num_timesteps=2000, episode_reward=-1111.86 +/- 71.71
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=2500, episode_reward=-660.48 +/- 110.03
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=3000, episode_reward=-1141.39 +/- 512.11
Episode length: 200.00 +/- 0.00
Eval num_timesteps=3500, episode_reward=-912.03 +/- 590.94
Episode length: 200.00 +/- 0.00
Eval num_timesteps=4000, episode_reward=-297.22 +/- 99.59
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=4500, episode_reward=-171.75 +/- 60.03
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=5000, episode_reward=-167.10 +/- 57.42
Episode lengt

<stable_baselines3.sac.sac.SAC at 0x13dabcfd0>

#### CallbackList
Chains several callbacks, which are then called sequentially. Alternatively, pass a plain list of callbacks to `learn()`.

In [9]:
import gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import CallbackList, CheckpointCallback, EvalCallback

checkpoint_callback = CheckpointCallback(save_freq=1000, save_path='./logs/')
# Separate evaluation env
eval_env = gym.make('Pendulum-v0')
eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/best_model',
                             log_path='./logs/results', eval_freq=500)
# Create the callback list
callback = CallbackList([checkpoint_callback, eval_callback])

model = SAC('MlpPolicy', 'Pendulum-v0')
# Equivalent to:
# model.learn(5000, callback=[checkpoint_callback, eval_callback])
model.learn(5000, callback=callback)

Eval num_timesteps=500, episode_reward=-1674.55 +/- 186.33
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=1000, episode_reward=-1790.63 +/- 140.02
Episode length: 200.00 +/- 0.00
Eval num_timesteps=1500, episode_reward=-1355.32 +/- 105.37
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=2000, episode_reward=-1041.60 +/- 99.52
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=2500, episode_reward=-717.63 +/- 126.30
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=3000, episode_reward=-811.61 +/- 357.61
Episode length: 200.00 +/- 0.00
Eval num_timesteps=3500, episode_reward=-153.40 +/- 52.65
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=4000, episode_reward=-223.64 +/- 90.65
Episode length: 200.00 +/- 0.00
Eval num_timesteps=4500, episode_reward=-120.91 +/- 72.26
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=5000, episode_reward=-125.08 +/

<stable_baselines3.sac.sac.SAC at 0x13de95790>
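
How a chain might combine the return values of its members can be sketched without the library. The function below assumes that every callback in the chain is still called after an earlier one asks to stop, and that training continues only if all of them return `True`. That behaviour is an assumption of this sketch, and `chained_on_step` and the three hooks are hypothetical names.

```python
from typing import Callable, List

StepHook = Callable[[], bool]


def chained_on_step(hooks: List[StepHook]) -> bool:
    """Call every hook in order; continue only if all of them return True."""
    continue_training = True
    for hook in hooks:
        # The hook is called first, so later hooks still run after an earlier False.
        continue_training = hook() and continue_training
    return continue_training


calls: List[str] = []


def save_hook() -> bool:
    calls.append("save")
    return True


def stop_hook() -> bool:
    calls.append("stop")
    return False  # asks to stop training


def log_hook() -> bool:
    calls.append("log")
    return True


result = chained_on_step([save_hook, stop_hook, log_hook])
print(calls, result)
```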

#### StopTrainingOnRewardThreshold
Stop training early based on a reward threshold.

It must be used with `EvalCallback`, which triggers it whenever a new best model is found.

In [None]:
import gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# Separate evaluation env
eval_env = gym.make('Pendulum-v0')
# Stop training when the model reaches the reward threshold
callback_on_best = StopTrainingOnRewardThreshold(reward_threshold=-200, verbose=1)
eval_callback = EvalCallback(eval_env, callback_on_new_best=callback_on_best, verbose=1)

model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1)
# Almost infinite number of timesteps, but the training will stop
# early as soon as the reward threshold is reached
model.learn(int(1e10), callback=eval_callback)

#### EveryNTimesteps
Triggers its child callback every `n_steps` timesteps.

In [None]:
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback, EveryNTimesteps

# this is equivalent to defining CheckpointCallback(save_freq=500)
# checkpoint_callback will be triggered every 500 steps
checkpoint_on_event = CheckpointCallback(save_freq=1, save_path='./logs/')
event_callback = EveryNTimesteps(n_steps=500, callback=checkpoint_on_event)

model = PPO('MlpPolicy', 'Pendulum-v0', verbose=1)

model.learn(int(2e4), callback=event_callback)
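
A trigger of this kind can be sketched as a comparison between the current timestep count and the timestep of the last trigger. `ToyEveryN` is a hypothetical stand-in, not the library class, and the assumption that the trigger counts timesteps rather than callback calls is specific to this sketch.

```python
from typing import List


class ToyEveryN:
    """Hypothetical stand-in for an every-n-timesteps trigger."""

    def __init__(self, n_steps: int) -> None:
        self.n_steps = n_steps
        self.last_trigger = 0
        self.triggered_at: List[int] = []

    def on_step(self, num_timesteps: int) -> bool:
        # Fire once at least n_steps timesteps have passed since the last trigger.
        if num_timesteps - self.last_trigger >= self.n_steps:
            self.last_trigger = num_timesteps
            self.triggered_at.append(num_timesteps)  # the child callback would run here
        return True


# With 4 environments, each call advances the timestep counter by 4.
n_envs = 4
trigger = ToyEveryN(n_steps=500)
for call in range(1, 501):
    trigger.on_step(num_timesteps=call * n_envs)
print(trigger.triggered_at)
```

Because the comparison is made on timesteps, this sketch needs no division by `n_envs`, unlike a frequency expressed in callback calls.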

#### StopTrainingOnMaxEpisodes
Stop training at a maximum number of episodes (ignoring model’s total_timesteps).

Note: For multiple environments it assumes max_episodes * n_envs episodes.

In [None]:
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import StopTrainingOnMaxEpisodes

# Stops training when the model reaches the maximum number of episodes
callback_max_episodes = StopTrainingOnMaxEpisodes(max_episodes=5, verbose=1)

model = A2C('MlpPolicy', 'Pendulum-v0', verbose=1)
# Almost infinite number of timesteps, but the training will stop
# early as soon as the max number of episodes is reached
model.learn(int(1e10), callback=callback_max_episodes)
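
The note about multiple environments can be made concrete with a small count. The sketch assumes the budget is the total `max_episodes * n_envs` summed over all environments, as the note states. `steps_until_stop` and the done flags are hypothetical and exist only for this illustration.

```python
from typing import Sequence


def steps_until_stop(done_flags: Sequence[Sequence[bool]], max_episodes: int) -> int:
    """Return the 1-based step at which the episode budget is reached, or -1 if never."""
    n_envs = len(done_flags[0])
    budget = max_episodes * n_envs  # total episodes over all environments
    finished = 0
    for step, dones in enumerate(done_flags, start=1):
        finished += sum(dones)  # each True marks one finished episode
        if finished >= budget:
            return step
    return -1


# Two environments; one row of done flags per vectorized step.
flags = [
    [False, False],
    [True, False],
    [True, False],
    [True, False],
    [True, True],
]
print(steps_until_stop(flags, max_episodes=2))
```

In this example the first environment finishes four episodes and the second only one, yet the budget of 2 * 2 = 4 is reached at the fifth step: under this assumption the limit applies to the total, not to each environment.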

## Examples of callback

### Example 1: Auto-saving best training model.
* Approach: to observe mean training reward over time.
* NOTE: the right approach would be to evaluate the model on a separate test environment with `EvalCallback`.

In [4]:
import os

import numpy as np

from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.results_plotter import load_results, ts2xy

In [13]:
class SaveOnBestTrainingRewardCallback(BaseCallback):
    """
    Callback for saving a model based on the training reward.

    :param check_freq: (int) The check is done every "check_freq" steps.
    :param log_dir: (str) Path to folder where model will be saved.
    :param verbose: (int)
    """
    def __init__(self, check_freq, log_dir, verbose=1):
        super(SaveOnBestTrainingRewardCallback, self).__init__(verbose)
        self.check_freq = check_freq
        self.log_dir = log_dir
        self.save_path = os.path.join(log_dir, 'best_model')
        self.best_mean_reward = -np.inf

    def _init_callback(self) -> None:
        # Create folder if needed.
        if self.save_path is not None:
            os.makedirs(self.save_path, exist_ok=True)
    
    def _on_step(self) -> bool:
        if self.n_calls % self.check_freq == 0:

            # Retrieve training reward.
            x, y = ts2xy(load_results(self.log_dir), 'timesteps')
            if len(x) > 0:
                # Mean training reward over the last 100 episodes.
                mean_reward = np.mean(y[-100:])
                if self.verbose > 0:
                    print(f"Num timesteps: {self.num_timesteps}")
                    print(f"Best mean reward: {self.best_mean_reward:.2} - Last mean reward per episode: {mean_reward:.2}")
                
                # New best model, you could save the agent here.
                if mean_reward > self.best_mean_reward:
                    self.best_mean_reward = mean_reward
                    # Example for saving best model.
                    if self.verbose > 0:
                        print(f"Saving new best model at {x[-1]} timesteps to {self.save_path}.zip")
                    self.model.save(self.save_path)
        
        return True

In [17]:
# Create log dir.
log_dir = "tmp/gym/"
os.makedirs(log_dir, exist_ok=True)

# Create and wrap the environment.
env = make_vec_env('CartPole-v1', n_envs=1, monitor_dir=log_dir)
# This is equivalent to:
# env = gym.make('CartPole-v1')
# env = Monitor(env, log_dir)
# env = DummyVecEnv([lambda: env])

# Create Callback:
callback = SaveOnBestTrainingRewardCallback(check_freq=20, log_dir=log_dir, verbose=1)

# Create and train the model.
model = A2C('MlpPolicy', env, verbose=0)
model.learn(total_timesteps=10000, callback=callback)

Num timesteps: 20
Best mean reward: -inf - Last mean reward per episode: 2e+01
Saving new best model at 20 timesteps to tmp/gym/best_model.zip
Num timesteps: 40
Best mean reward: 2e+01 - Last mean reward per episode: 1.6e+01
Num timesteps: 60
Best mean reward: 2e+01 - Last mean reward per episode: 1.6e+01
Num timesteps: 80
Best mean reward: 2e+01 - Last mean reward per episode: 1.7e+01
Num timesteps: 100
Best mean reward: 2e+01 - Last mean reward per episode: 1.7e+01
Num timesteps: 120
Best mean reward: 2e+01 - Last mean reward per episode: 1.8e+01
Num timesteps: 140
Best mean reward: 2e+01 - Last mean reward per episode: 1.7e+01
Num timesteps: 160
Best mean reward: 2e+01 - Last mean reward per episode: 1.8e+01
Num timesteps: 180
Best mean reward: 2e+01 - Last mean reward per episode: 1.9e+01
Num timesteps: 200
Best mean reward: 2e+01 - Last mean reward per episode: 1.9e+01
Num timesteps: 220
Best mean reward: 2e+01 - Last mean reward per episode: 2e+01
Saving new best model at 203 tim

<stable_baselines3.a2c.a2c.A2C at 0x13de848e0>

### Example 2: Realtime plotting of performance
[AH: Doesn't seem to work with VS Code notebooks.]

In [None]:
from stable_baselines3 import PPO

import matplotlib.pyplot as plt
import numpy as np
%matplotlib notebook

class PlottingCallback(BaseCallback):
    """
    Callback for plotting the performance in realtime.

    :param verbose: (int)
    """
    def __init__(self, verbose=1):
        super(PlottingCallback, self).__init__(verbose)
        self._plot = None

    def _on_step(self) -> bool:
        # get the monitor's data
        x, y = ts2xy(load_results(log_dir), 'timesteps')
        if self._plot is None: # make the plot
            plt.ion()
            fig = plt.figure(figsize=(6,3))
            ax = fig.add_subplot(111)
            line, = ax.plot(x, y)
            self._plot = (line, ax, fig)
            plt.show()
        else: # update and rescale the plot
            self._plot[0].set_data(x, y)
            self._plot[-2].relim()
            self._plot[-2].set_xlim([self.locals["total_timesteps"] * -0.02, 
                                    self.locals["total_timesteps"] * 1.02])
            self._plot[-2].autoscale_view(True,True,True)
            self._plot[-1].canvas.draw()
        return True  # Keep training; a falsy return value would request an early stop.
        
# Create log dir
log_dir = "tmp/gym/"
os.makedirs(log_dir, exist_ok=True)

# Create and wrap the environment
env = make_vec_env('MountainCarContinuous-v0', n_envs=1, monitor_dir=log_dir)

plotting_callback = PlottingCallback()
        
model = PPO('MlpPolicy', env, verbose=0)
model.learn(20000, callback=plotting_callback)

### Example 3: Progress bar
[AH: Low value.]

In [11]:
from stable_baselines3 import TD3
from tqdm.auto import tqdm

class ProgressBarCallback(BaseCallback):
    """
    :param pbar: (tqdm.pbar) Progress bar object
    """
    def __init__(self, pbar):
        super(ProgressBarCallback, self).__init__()
        self._pbar = pbar

    def _on_step(self) -> bool:
        # Update the progress bar:
        self._pbar.n = self.num_timesteps
        self._pbar.update(0)
        return True  # Keep training; a falsy return value would request an early stop.

# this callback uses the 'with' block, allowing for correct initialisation and destruction
class ProgressBarManager(object):
    def __init__(self, total_timesteps): # init object with total timesteps
        self.pbar = None
        self.total_timesteps = total_timesteps
        
    def __enter__(self): # create the progress bar and callback, return the callback
        self.pbar = tqdm(total=self.total_timesteps)
            
        return ProgressBarCallback(self.pbar)

    def __exit__(self, exc_type, exc_val, exc_tb): # fill and close the progress bar
        self.pbar.n = self.total_timesteps
        self.pbar.update(0)
        self.pbar.close()
        
model = TD3('MlpPolicy', 'Pendulum-v0', verbose=0)
with ProgressBarManager(2000) as callback: # this guarantees that the tqdm progress bar closes correctly
    model.learn(2000, callback=callback)

100%|██████████| 2000/2000 [00:25<00:00, 79.47it/s] 


### Example 4: Composition
* Several callbacks can be composed into a single callback (e.g. save best model, show progress bar...).
* To do that, a list is passed to `learn()`.

In [14]:
#from stable_baselines3.common.callbacks import CallbackList

# Create log dir
log_dir = "tmp/gym/"
os.makedirs(log_dir, exist_ok=True)

# Create and wrap the environment
env = make_vec_env('CartPole-v1', n_envs=1, monitor_dir=log_dir)

# Create callbacks
auto_save_callback = SaveOnBestTrainingRewardCallback(check_freq=1000, log_dir=log_dir)

model = PPO('MlpPolicy', env, verbose=0)
with ProgressBarManager(10000) as progress_callback:
  # This is equivalent to callback=CallbackList([progress_callback, auto_save_callback])
  model.learn(10000, callback=[progress_callback, auto_save_callback])

 12%|█▏        | 1210/10000 [00:00<00:05, 1699.06it/s]

Num timesteps: 1000
Best mean reward: -inf - Last mean reward per episode: 2.2e+01
Saving new best model at 980 timesteps to tmp/gym/best_model.zip


 19%|█▉        | 1900/10000 [00:01<00:04, 1700.13it/s]

Num timesteps: 2000
Best mean reward: 2.2e+01 - Last mean reward per episode: 2.2e+01
Saving new best model at 1978 timesteps to tmp/gym/best_model.zip


 32%|███▏      | 3233/10000 [00:03<00:05, 1228.17it/s]

Num timesteps: 3000
Best mean reward: 2.2e+01 - Last mean reward per episode: 2.6e+01
Saving new best model at 2971 timesteps to tmp/gym/best_model.zip


 40%|███▉      | 3989/10000 [00:03<00:04, 1344.00it/s]

Num timesteps: 4000
Best mean reward: 2.6e+01 - Last mean reward per episode: 2.8e+01
Saving new best model at 3991 timesteps to tmp/gym/best_model.zip


 52%|█████▏    | 5177/10000 [00:05<00:04, 1204.94it/s]

Num timesteps: 5000
Best mean reward: 2.8e+01 - Last mean reward per episode: 3.3e+01
Saving new best model at 4972 timesteps to tmp/gym/best_model.zip


 60%|██████    | 6027/10000 [00:06<00:02, 1539.62it/s]

Num timesteps: 6000
Best mean reward: 3.3e+01 - Last mean reward per episode: 3.6e+01
Saving new best model at 5948 timesteps to tmp/gym/best_model.zip


 72%|███████▏  | 7231/10000 [00:07<00:02, 1200.23it/s]

Num timesteps: 7000
Best mean reward: 3.6e+01 - Last mean reward per episode: 4e+01
Saving new best model at 6857 timesteps to tmp/gym/best_model.zip


 81%|████████  | 8079/10000 [00:08<00:01, 1217.26it/s]

Num timesteps: 8000
Best mean reward: 4e+01 - Last mean reward per episode: 4.8e+01
Saving new best model at 7975 timesteps to tmp/gym/best_model.zip


 92%|█████████▏| 9202/10000 [00:10<00:00, 1153.42it/s]

Num timesteps: 9000
Best mean reward: 4.8e+01 - Last mean reward per episode: 5.4e+01
Saving new best model at 8916 timesteps to tmp/gym/best_model.zip


10217it [00:10, 1593.83it/s]                          

Num timesteps: 10000
Best mean reward: 5.4e+01 - Last mean reward per episode: 6.3e+01
Saving new best model at 9955 timesteps to tmp/gym/best_model.zip


100%|██████████| 10000/10000 [00:11<00:00, 834.08it/s]
