
# Stable Baselines3 - Advanced Saving and Loading

GitHub repository: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)


[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL), using Stable Baselines3.

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

Documentation is available online: [https://stable-baselines3.readthedocs.io/](https://stable-baselines3.readthedocs.io/)

## Introduction

In this notebook, you will learn how to use some advanced features of Stable Baselines3 (SB3): how to create a test environment for periodic evaluation, how to use, save and load a policy independently from a model, and how to save and load a replay buffer.

## Install Dependencies and Stable Baselines Using Pip


```
pip install stable-baselines3[extra]
```

In [1]:
!pip install "stable-baselines3[extra]>=2.0.0a4"


## Import policy, RL agent, ...

In [2]:
from stable_baselines3 import SAC, TD3
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env



## Create the Gym env and instantiate the agent

For this example, we will use the Pendulum environment.

"The inverted pendulum swingup problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so it stays upright."

Pendulum-v1 environment: [documentation](https://gymnasium.farama.org/environments/classic_control/pendulum/), [source code](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/pendulum.py)

![Pendulum](https://gymnasium.farama.org/_images/pendulum.gif)


We chose the MlpPolicy because the observation of Pendulum is a feature vector, not an image.

The type of action to use (discrete or continuous) is automatically deduced from the environment's action space.



### Create the agent and evaluation callback

We will use an [EvalCallback](https://stable-baselines3.readthedocs.io/en/master/guide/callbacks.html#evalcallback) to periodically evaluate the agent on a separate env.

In [3]:
model = SAC("MlpPolicy", "Pendulum-v1", verbose=1, learning_rate=1e-3)

Using cuda device
Creating environment from the given name 'Pendulum-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [4]:
eval_env = make_vec_env("Pendulum-v1", n_envs=5)

n_training_envs = 1 # adjust eval freq depending on the number of training envs
# Evaluate the model every 1000 steps on 5 test episodes and save the evaluation to the logs folder
eval_callback = EvalCallback(eval_env, eval_freq=1000 // n_training_envs, n_eval_episodes=5, log_path="./logs/")
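The division by `n_training_envs` is needed because `eval_freq` counts callback calls, and each call corresponds to one step of the vectorized environment, that is, `n_training_envs` timesteps. The sketch below spells out that arithmetic; the helper name `eval_freq_in_calls` is hypothetical and not part of SB3.

```python
def eval_freq_in_calls(target_timesteps: int, n_training_envs: int) -> int:
    """Convert an evaluation interval in timesteps into callback calls.

    Hypothetical helper: one callback call advances `n_training_envs`
    timesteps, so the interval is divided by the number of envs
    (and clamped to at least 1 call).
    """
    return max(target_timesteps // n_training_envs, 1)


# Evaluate roughly every 1000 timesteps for different numbers of training envs
for n_envs in (1, 4, 8):
    print(f"n_envs={n_envs}: eval_freq={eval_freq_in_calls(1000, n_envs)}")
```

With a single training environment, as in this notebook, the value stays at 1000.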

Train the agent and evaluate it periodically on the test env.

In [5]:
model.learn(6000, callback=eval_callback, progress_bar=False)

----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.55e+03 |
| time/              |           |
|    episodes        | 4         |
|    fps             | 73        |
|    time_elapsed    | 10        |
|    total_timesteps | 800       |
| train/             |           |
|    actor_loss      | 28.1      |
|    critic_loss     | 0.0535    |
|    ent_coef        | 0.504     |
|    ent_coef_loss   | -1.04     |
|    learning_rate   | 0.001     |
|    n_updates       | 699       |
----------------------------------
Eval num_timesteps=1000, episode_reward=-1653.85 +/- 187.63
Episode length: 200.00 +/- 0.00
----------------------------------
| eval/              |           |
|    mean_ep_length  | 200       |
|    mean_reward     | -1.65e+03 |
| time/              |           |
|    total_timesteps | 1000      |
| train/             |           |
|    actor_loss      | 34.7      |
|    critic_loss     | 0.0415    

<stable_baselines3.sac.sac.SAC at 0x7f46353783d0>
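Because the callback was created with `log_path="./logs/"`, the evaluation results are also written to `./logs/evaluations.npz`. The sketch below shows one way to read that file back with NumPy; the helper name `summarize_evaluations` is hypothetical, and the code assumes the file contains the `timesteps` and `results` arrays that `EvalCallback` stores.

```python
import os

import numpy as np


def summarize_evaluations(npz_path):
    """Return (timestep, mean_reward, std_reward) for each evaluation.

    Hypothetical helper. `results` is assumed to have shape
    (n_evaluations, n_eval_episodes).
    """
    data = np.load(npz_path)
    timesteps = data["timesteps"]
    results = data["results"]
    means = results.mean(axis=1)
    stds = results.std(axis=1)
    return [(int(t), float(m), float(s)) for t, m, s in zip(timesteps, means, stds)]


# Only print a summary if the training cell above has produced the file
log_file = os.path.join("logs", "evaluations.npz")
if os.path.exists(log_file):
    for timestep, mean, std in summarize_evaluations(log_file):
        print(f"timesteps={timestep}: {mean:.2f} +/- {std:.2f}")
```

The same arrays can be passed to a plotting library to draw a learning curve.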

## Save/Load the replay buffer

By default, the replay buffer is not saved when calling `model.save()`, in order to save disk space (a replay buffer can take up several GB when using images).
However, SB3 provides the `save_replay_buffer()` and `load_replay_buffer()` methods to save and load it separately.

In [6]:
# save the model
model.save("sac_pendulum")

# the saved model does not contain the replay buffer
loaded_model = SAC.load("sac_pendulum")
print(f"The loaded_model has {loaded_model.replay_buffer.size()} transitions in its buffer")

The loaded_model has 0 transitions in its buffer


In [7]:
# now save the replay buffer too
model.save_replay_buffer("sac_replay_buffer")

# load it into the loaded_model
loaded_model.load_replay_buffer("sac_replay_buffer")

# now the loaded replay is not empty anymore
print(f"The loaded_model has {loaded_model.replay_buffer.size()} transitions in its buffer")

The loaded_model has 6000 transitions in its buffer


## Save the policy only

In SB3, you can save the policy independently from the model if needed.

Note: if you don't save the complete model, you cannot continue training afterward.

In [8]:
policy = model.policy
policy.save("sac_policy_pendulum.pkl")

In [9]:
env = model.get_env()

In [10]:
# Evaluate the policy
mean_reward, std_reward = evaluate_policy(policy, eval_env, n_eval_episodes=10, deterministic=True)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

mean_reward=-174.52 +/- 112.33007423278072


## Load the policy only

In [11]:
from stable_baselines3.sac.policies import MlpPolicy
saved_policy = MlpPolicy.load("sac_policy_pendulum.pkl")
# also possible:
# saved_policy = SAC.policy_aliases["MlpPolicy"].load("sac_policy_pendulum.pkl")

In [12]:
# Evaluate the loaded policy
mean_reward, std_reward = evaluate_policy(saved_policy, eval_env, n_eval_episodes=10, deterministic=True)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

mean_reward=-108.26 +/- 62.81619865932142
