# Stable Baselines3 Tutorial - Callbacks and hyperparameter tuning
(Taken from <https://stable-baselines3.readthedocs.io/en/master/guide/callbacks.html#>)
- Comparing default and best hyperparameters in RL.
- Using callbacks for monitoring, auto-saving, model manipulation, progress bars...


In [1]:
# Dependencies: swig, tqdm

import gym
from stable_baselines3 import A2C, SAC, PPO, TD3

# 1. Hyperparameter tuning
Here we compare the performance of Soft Actor-Critic (SAC) on the Pendulum environment with default and tuned hyperparameters.

Resources:
- RL Baselines3 Zoo: https://github.com/DLR-RM/rl-baselines3-zoo
- Optuna: https://github.com/optuna/optuna

In [6]:
import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

eval_env = Monitor(gym.make('Pendulum-v0')) # AH: Wrapped with Monitor to prevent erroneous metrics.

## Default model

In [3]:
default_model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1, seed=0, batch_size=64, policy_kwargs=dict(net_arch=[64, 64])).learn(8000)

Using cpu device
Creating environment from the given name 'Pendulum-v0'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.36e+03 |
| time/              |           |
|    episodes        | 4         |
|    fps             | 136       |
|    time_elapsed    | 5         |
|    total_timesteps | 800       |
| train/             |           |
|    actor_loss      | 20.3      |
|    critic_loss     | 0.968     |
|    ent_coef        | 0.812     |
|    ent_coef_loss   | -0.337    |
|    learning_rate   | 0.0003    |
|    n_updates       | 699       |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.55e+03 |
| time/              |           |
|    episodes        | 8         |
|    fps             | 129       |
|    time_e

In [10]:
mean_reward, std_reward = evaluate_policy(default_model, eval_env, n_eval_episodes=500)
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:-182.34 +/- 97.99


## Tuned model

In [1]:
tuned_model = SAC('MlpPolicy', 'Pendulum-v0', batch_size=256, verbose=1, policy_kwargs=dict(net_arch=[256, 256]), seed=0).learn(8000)


In [11]:
mean_reward, std_reward = evaluate_policy(tuned_model, eval_env, n_eval_episodes=500)
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:-141.48 +/- 90.55
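
To judge whether the gap between the two evaluations is larger than the evaluation noise, the reported figures can be turned into a standard error. This is a rough sketch, assuming the 500 evaluation episodes are independent and that the reported `+/-` value is the per-episode standard deviation; the variable and function names are illustrative. It only captures evaluation noise: both models were trained with a single seed, so run-to-run variance is not covered.

```python
import math

# Figures reported by evaluate_policy above (500 evaluation episodes each).
n_episodes = 500
default_mean, default_std = -182.34, 97.99
tuned_mean, tuned_std = -141.48, 90.55


def standard_error(std, n):
    """Standard error of a mean estimated from n independent samples."""
    return std / math.sqrt(n)


default_se = standard_error(default_std, n_episodes)
tuned_se = standard_error(tuned_std, n_episodes)

# Gap between the two mean rewards (positive means the tuned model is better).
improvement = tuned_mean - default_mean
# Standard error of the difference of two independent means.
improvement_se = math.sqrt(default_se**2 + tuned_se**2)

print(f"improvement: {improvement:.2f} +/- {improvement_se:.2f} (standard error)")
```

The gap of about 41 reward units is several standard errors wide, so it is unlikely to come from evaluation noise alone.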


# 2. Callbacks
A callback is a function that is called at a given stage of training.
Callbacks are passed as the `callback` argument of `model.learn()`.

Types:
* Custom callback: called at 5 specific moments of the training process.
* Event callback: called when a certain user-defined situation is detected.

## Types of callback

### 2.1 Custom callback
The class derives from `BaseCallback`.

Events:
* `_on_training_start`  Called before the first rollout starts.
* `_on_rollout_start`   
    * Rollout = collection of environment interactions using current policy.
    * Triggered before collecting new samples.
    * For off-policy algorithms, rollout = steps taken in the env between two updates.
* `_on_step`
    * Called by the model after each call to `env.step()`.
    * For a child callback (of an `EventCallback`), this is called when the event is triggered.
    * Returns a `bool`: if `False`, training is aborted early.
* `_on_rollout_end`     Triggered before updating the policy.
* `_on_training_end`    Triggered before exiting the `learn()` method.

Variables accessible in the callback:
* `self.model`          The RL model (`type: BaseAlgorithm`).
* `self.training_env`   The environment used for training (`type: Union[gym.Env, VecEnv, None]`).
* `self.n_calls`        Number of times the callback was called (`type: int`).
* `self.num_timesteps`  Total number of steps taken (number of envs x step calls) (`type: int`).
* `self.locals`         Local variables (`type: Dict[str, Any]`).
* `self.globals`        Global variables (`type: Dict[str, Any]`).
* `self.logger`         The logger object, to report on terminal (`type: stable_baselines3.common.logger`).
* `self.parent`         The parent object (`type: Optional[BaseCallback]`).
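
The order of the five events can be illustrated with a toy training loop in plain Python. This is a simplified stand-in, not SB3 code: `ToyCallback`, `toy_learn` and their parameters are hypothetical names, the hooks are public methods without the leading underscore, and the loop assumes that `on_rollout_end` is skipped when `on_step` returns `False`.

```python
class ToyCallback:
    """Records which hook fired, and asks to stop after a fixed number of steps."""

    def __init__(self, stop_after_steps):
        self.stop_after_steps = stop_after_steps
        self.n_calls = 0   # number of times on_step was called
        self.events = []   # names of the hooks, in firing order

    def on_training_start(self):
        self.events.append("training_start")

    def on_rollout_start(self):
        self.events.append("rollout_start")

    def on_step(self):
        self.n_calls += 1
        self.events.append("step")
        # Returning False asks the training loop to abort early.
        return self.n_calls < self.stop_after_steps

    def on_rollout_end(self):
        self.events.append("rollout_end")

    def on_training_end(self):
        self.events.append("training_end")


def toy_learn(callback, total_timesteps, rollout_length):
    """Toy loop: collect rollouts of fixed length until the budget is spent."""
    callback.on_training_start()
    num_timesteps = 0
    aborted = False
    while num_timesteps < total_timesteps and not aborted:
        callback.on_rollout_start()
        for _ in range(rollout_length):
            num_timesteps += 1  # stands in for one call to env.step()
            if not callback.on_step():
                aborted = True
                break
        if not aborted:
            callback.on_rollout_end()
            # A real algorithm would update the policy here.
    callback.on_training_end()
    return num_timesteps


cb = ToyCallback(stop_after_steps=3)
steps_taken = toy_learn(cb, total_timesteps=10, rollout_length=2)
print(steps_taken, cb.events)
```

Here the third call to `on_step` returns `False`, so the loop stops after 3 of the 10 requested timesteps and goes straight to `on_training_end`.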

### 2.2 Event callback
The class `EventCallback` derives from `BaseCallback`.
When an event is triggered (e.g. `EvalCallback` finds a new best model),
a child callback is called (e.g. `StopTrainingOnRewardThreshold`, which stops training once the mean reward exceeds a threshold).
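
The parent/child relationship can be sketched in plain Python. The classes below are toy stand-ins, not the SB3 implementations: `ToyEvalCallback`, `ToyStopOnThreshold` and `on_evaluation` are hypothetical names, and the evaluation results are made-up numbers fed in by hand. The sketch assumes the child compares the parent's best mean reward against its threshold, and that it is consulted only when a new best is found.

```python
class ToyStopOnThreshold:
    """Child: asks to stop once the parent's best mean reward reaches the threshold."""

    def __init__(self, reward_threshold):
        self.reward_threshold = reward_threshold
        self.parent = None  # set by the parent callback

    def on_step(self):
        # True means "continue training".
        return self.parent.best_mean_reward < self.reward_threshold


class ToyEvalCallback:
    """Parent: triggers the child only when an evaluation yields a new best reward."""

    def __init__(self, callback_on_new_best):
        self.best_mean_reward = float("-inf")
        self.child = callback_on_new_best
        self.child.parent = self  # give the child access to the parent's state

    def on_evaluation(self, mean_reward):
        if mean_reward > self.best_mean_reward:
            self.best_mean_reward = mean_reward
            return self.child.on_step()  # the event: new best model
        return True  # no event, keep training


eval_cb = ToyEvalCallback(ToyStopOnThreshold(reward_threshold=-200))
# Made-up evaluation results, for illustration only.
decisions = [eval_cb.on_evaluation(r) for r in [-900, -1000, -400, -150]]
print(decisions)
```

The second evaluation is worse than the first, so the child is not consulted. The last evaluation is a new best above the threshold, so the child returns `False` and training would stop.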

Callback collection:
* Save the model periodically (`CheckpointCallback`)
* Evaluate the model periodically and save the best one (`EvalCallback`)
* Chain callbacks (`CallbackList`)
* Trigger callback on events (`EventCallback`, `EveryNTimesteps`)
* Stop training early based on a reward threshold (`StopTrainingOnRewardThreshold`)

(Note: when using multiple envs, the frequency must be calculated like this: `save_freq = max(save_freq // n_envs, 1)`.)
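
The formula in the note can be wrapped in a small helper. The reasoning, as I understand it, is that the callback counts calls to `env.step()` on the vectorized env, and each call advances `n_envs` timesteps at once. `per_env_save_freq` is a hypothetical helper name, not part of the library.

```python
def per_env_save_freq(save_freq, n_envs):
    """Convert a frequency in total timesteps into a frequency in step calls.

    The max(..., 1) guard keeps the result at least 1 when n_envs > save_freq.
    """
    return max(save_freq // n_envs, 1)


# Saving roughly every 1000 timesteps with 4 parallel envs.
print(per_env_save_freq(1000, 4))
```

With 4 envs this gives 250 step calls, i.e. 1000 timesteps in total between saves.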

#### CheckpointCallback
Save the model periodically.

In [2]:
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import CheckpointCallback

# Save a checkpoint every 1000 steps
checkpoint_callback = CheckpointCallback(
    save_freq=1000, save_path='./logs/', name_prefix='rl_model')

model = SAC('MlpPolicy', 'Pendulum-v0')
model.learn(2000, callback=checkpoint_callback)

<stable_baselines3.sac.sac.SAC at 0x109ab33a0>

#### EvalCallback
Evaluate the model periodically and save the best one.

In [None]:
import gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback

# Separate evaluation env
eval_env = gym.make('Pendulum-v0')
# Use deterministic actions for evaluation
eval_callback = EvalCallback(
    eval_env, best_model_save_path='./logs/',
    log_path='./logs/', eval_freq=500,
    deterministic=True, render=False)

model = SAC('MlpPolicy', 'Pendulum-v0')
model.learn(5000, callback=eval_callback)

#### CallbackList
`CallbackList` chains callbacks, which are then called sequentially. Alternatively, pass a list of callbacks directly to `learn()`.

In [None]:
import gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import CallbackList, CheckpointCallback, EvalCallback

checkpoint_callback = CheckpointCallback(save_freq=1000, save_path='./logs/')
# Separate evaluation env
eval_env = gym.make('Pendulum-v0')
eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/best_model',
                             log_path='./logs/results', eval_freq=500)
# Create the callback list
callback = CallbackList([checkpoint_callback, eval_callback])

model = SAC('MlpPolicy', 'Pendulum-v0')
# Equivalent to:
# model.learn(5000, callback=[checkpoint_callback, eval_callback])
model.learn(5000, callback=callback)

#### StopTrainingOnRewardThreshold
Stop training early based on a reward threshold.

It must be used with `EvalCallback`, relying on the event triggered by a new best model.

In [None]:
import gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# Separate evaluation env
eval_env = gym.make('Pendulum-v0')
# Stop training when the model reaches the reward threshold
callback_on_best = StopTrainingOnRewardThreshold(reward_threshold=-200, verbose=1)
eval_callback = EvalCallback(eval_env, callback_on_new_best=callback_on_best, verbose=1)

model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1)
# Almost infinite number of timesteps, but the training will stop
# early as soon as the reward threshold is reached
model.learn(int(1e10), callback=eval_callback)

#### EveryNTimesteps
Triggers its child callback every `n_steps` timesteps.

In [None]:
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback, EveryNTimesteps

# this is equivalent to defining CheckpointCallback(save_freq=500)
# checkpoint_callback will be triggered every 500 steps
checkpoint_on_event = CheckpointCallback(save_freq=1, save_path='./logs/')
event_callback = EveryNTimesteps(n_steps=500, callback=checkpoint_on_event)

model = PPO('MlpPolicy', 'Pendulum-v0', verbose=1)

model.learn(int(2e4), callback=event_callback)
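
The trigger rule can be sketched without the library. The function below is a toy stand-in, not SB3 source: it assumes the event fires as soon as at least `n_steps` timesteps have elapsed since the last trigger, and that each step call advances the timestep counter by `n_envs`. `trigger_timesteps` and its parameters are hypothetical names.

```python
def trigger_timesteps(n_steps, n_envs, total_calls):
    """Return the timesteps at which the child callback would be triggered."""
    triggers = []
    last_trigger = 0
    num_timesteps = 0
    for _ in range(total_calls):
        num_timesteps += n_envs  # one step call advances every env by one step
        if num_timesteps - last_trigger >= n_steps:
            last_trigger = num_timesteps
            triggers.append(num_timesteps)
    return triggers


single_env = trigger_timesteps(n_steps=500, n_envs=1, total_calls=2000)
three_envs = trigger_timesteps(n_steps=500, n_envs=3, total_calls=400)
print(single_env, three_envs)
```

With a single env the child fires at exactly 500, 1000, 1500 and 2000 timesteps. With 3 envs the counter moves in steps of 3, so the triggers land on 501 and 1002. Since the child is only stepped when the event fires, `save_freq=1` on the child makes it save at every trigger.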

#### StopTrainingOnMaxEpisodes
Stop training once a maximum number of episodes is reached, regardless of the `total_timesteps` passed to `learn()`.

Note: with multiple environments, training stops after `max_episodes * n_envs` episodes in total.

In [None]:
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import StopTrainingOnMaxEpisodes

# Stops training when the model reaches the maximum number of episodes
callback_max_episodes = StopTrainingOnMaxEpisodes(max_episodes=5, verbose=1)

model = A2C('MlpPolicy', 'Pendulum-v0', verbose=1)
# Almost infinite number of timesteps, but the training will stop
# early as soon as the max number of episodes is reached
model.learn(int(1e10), callback=callback_max_episodes)

## Examples of callback