<a href="https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/sb3/2_gym_wrappers_saving_loading.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Baselines3 Tutorial - Gym wrappers, saving and loading models

Github repo: https://github.com/araffin/rl-tutorial-jnrr19/tree/sb3/

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo


## Introduction

Very often, you will want to extend an environment's functionality in some generic way, for example by transforming the observations before they are given to the agent. Gym provides a convenient framework for these situations: the `Wrapper` class.

Another class you should be aware of is `Monitor`. It is implemented like `Wrapper` and can write information about your agent's performance to a file, optionally with a video recording of your agent in action.

(Taken from the book *Deep Reinforcement Learning Hands-On* by Maxim Lapan.)

## Install Dependencies and Stable Baselines3 Using Pip

In [1]:
!apt install swig
!pip install stable-baselines3[extra]


In [2]:
import gym
from stable_baselines3 import A2C, SAC, PPO, TD3

# Saving and loading

Saving and loading Stable-Baselines3 models is straightforward: call `.save()` on a model instance and `.load()` on the corresponding model class.

In [3]:
import os

# Create the save directory (the saved files can be inspected there afterwards)
save_dir = "/tmp/gym/"
os.makedirs(save_dir, exist_ok=True)

model = PPO('MlpPolicy', 'Pendulum-v0', verbose=0).learn(8000)
# The model will be saved under PPO_tutorial.zip
# (os.path.join avoids the double slash of save_dir + "/PPO_tutorial")
model_path = os.path.join(save_dir, "PPO_tutorial")
model.save(model_path)

# Sample an observation from the environment
obs = model.env.observation_space.sample()

# Check the prediction before deleting the model
print("pre saved", model.predict(obs, deterministic=True))

del model  # delete the trained model to demonstrate loading

loaded_model = PPO.load(model_path)
# Check that the prediction is the same after loading (for the same observation)
print("loaded", loaded_model.predict(obs, deterministic=True))

pre saved (array([-0.04128068], dtype=float32), None)
loaded (array([-0.04128068], dtype=float32), None)


Saving in Stable-Baselines3 is quite powerful: the training hyperparameters are saved along with the current weights. In practice, this means you can load a custom model without redefining its parameters and continue learning.
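
Since `.save()` writes a single zip archive (`PPO_tutorial.zip` in the cell above), its contents can be listed with the standard library. The following is only a minimal sketch: `list_archive` is a hypothetical helper, and because the file names stored inside the archive may depend on the Stable-Baselines3 version, none are assumed here.

```python
import os
import zipfile


def list_archive(path):
    """Return the sorted names of the files stored in a zip archive."""
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())


# Location used in the cell above (assumption: that cell has already been run)
model_path = os.path.join("/tmp/gym", "PPO_tutorial.zip")
if os.path.exists(model_path):
    print(list_archive(model_path))
else:
    print(f"{model_path} not found, run the saving cell first")
```

Looking inside the archive is a quick way to confirm that more than the network weights is stored on disk.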

The loading function can also update the model's class variables when loading.
Here, we demonstrate this with another training algorithm, A2C.
See this page for more about vectorized environments: https://stable-baselines.readthedocs.io/en/master/guide/vec_envs.html

In [4]:
import os
from stable_baselines3.common.vec_env import DummyVecEnv

# Create the save directory
save_dir = "/tmp/gym/"
os.makedirs(save_dir, exist_ok=True)

model = A2C('MlpPolicy', 'Pendulum-v0', verbose=0, gamma=0.9, n_steps=20).learn(8000)
# The model will be saved under A2C_tutorial.zip
model_path = os.path.join(save_dir, "A2C_tutorial")
model.save(model_path)

del model  # delete the trained model to demonstrate loading

# Load the model and, when loading, set verbose to 1
loaded_model = A2C.load(model_path, verbose=1)

# Show the saved hyperparameters
print("loaded:", "gamma =", loaded_model.gamma, "n_steps =", loaded_model.n_steps)

# As the environment is not serializable, we need to set a new instance of the environment.
# DummyVecEnv steps its environments sequentially, in a single process.
loaded_model.set_env(DummyVecEnv([lambda: gym.make('Pendulum-v0')]))
# and continue training
loaded_model.learn(8000)
loaded_model.learn(8000)

loaded: gamma = 0.9 n_steps = 20
------------------------------------
| time/                 |          |
|    fps                | 1030     |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 2000     |
| train/                |          |
|    entropy_loss       | -1.47    |
|    explained_variance | 0.0108   |
|    learning_rate      | 0.0007   |
|    n_updates          | 499      |
|    policy_loss        | -43.9    |
|    std                | 1.05     |
|    value_loss         | 1.12e+03 |
------------------------------------
------------------------------------
| time/                 |          |
|    fps                | 1039     |
|    iterations         | 200      |
|    time_elapsed       | 3        |
|    total_timesteps    | 4000     |
| train/                |          |
|    entropy_loss       | -1.46    |
|    explained_variance | 0.0221   |
|    learning_rate      | 0.0007   |
|    n_updates          | 599      |
|    

<stable_baselines3.a2c.a2c.A2C at 0x7f167c47f8d0>

# Gym and VecEnv wrappers

## Anatomy of a gym wrapper

A gym wrapper follows the [gym](https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html) interface: it has a `reset()` and `step()` method.

Because a wrapper is *around* an environment, we can access that environment with `self.env`, which makes it easy to interact with it without modifying the original env.
Many wrappers are already predefined; for a complete list, refer to the [gym documentation](https://github.com/openai/gym/tree/master/gym/wrappers).
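
The delegation pattern can be illustrated without gym at all. The sketch below uses toy stand-ins (`ToyEnv`, `ToyWrapper`, `ScaleReward` and `unwrap` are hypothetical names, not part of gym) to show how each wrapper keeps the wrapped object in `self.env`, forwards `reset()` and `step()`, and how the original environment can be reached by following the `.env` chain.

```python
class ToyEnv:
    """Toy stand-in for an environment (not a real gym.Env)."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += action
        reward = float(action)
        done = self.state >= 3
        return self.state, reward, done, {}


class ToyWrapper:
    """Keeps the wrapped environment in self.env and forwards every call."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)


class ScaleReward(ToyWrapper):
    """Multiplies the reward and leaves everything else untouched."""

    def __init__(self, env, scale=10.0):
        super().__init__(env)
        self.scale = scale

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, reward * self.scale, done, info


def unwrap(env):
    """Follow the .env chain down to the innermost environment."""
    while hasattr(env, "env"):
        env = env.env
    return env


base_env = ToyEnv()
# Wrappers can be stacked: each one only knows about the object it wraps
wrapped_env = ScaleReward(ToyWrapper(base_env), scale=10.0)
wrapped_env.reset()
print(wrapped_env.step(1))
print(unwrap(wrapped_env) is base_env)
```

In gym itself, the innermost environment is available through the `unwrapped` property, so a helper like `unwrap` is not needed there.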

## First example: limit the episode length

One practical use case of a wrapper is limiting the number of steps per episode. To do so, you need to overwrite the `done` signal when the limit is reached. It is also good practice to pass that information in the `info` dictionary.

In [5]:
class TimeLimitWrapper(gym.Wrapper):
  """
  :param env: (gym.Env) Gym environment that will be wrapped
  :param max_steps: (int) Max number of steps per episode
  """
  def __init__(self, env, max_steps=100):
    # Call the parent constructor, so we can access self.env later
    super(TimeLimitWrapper, self).__init__(env)
    self.max_steps = max_steps
    # Counter of steps per episode
    self.current_step = 0
  
  def reset(self):
    """
    Reset the environment 
    """
    # Reset the counter
    self.current_step = 0
    return self.env.reset()

  def step(self, action):
    """
    :param action: ([float] or int) Action taken by the agent
    :return: (np.ndarray, float, bool, dict) observation, reward, is the episode over?, additional informations
    """
    self.current_step += 1
    obs, reward, done, info = self.env.step(action)
    # Overwrite the done signal when the step limit is reached
    if self.current_step >= self.max_steps:
      done = True
      # Update the info dict to signal that the limit was exceeded
      info['time_limit_reached'] = True
    return obs, reward, done, info


#### Test the wrapper

In [6]:
from gym.envs.classic_control.pendulum import PendulumEnv

# Here we create the environment directly, because gym.make() would already wrap the environment in a TimeLimit wrapper
env = PendulumEnv()
# Wrap the environment
env = TimeLimitWrapper(env, max_steps=100)

In [7]:
obs = env.reset()
done = False
n_steps = 0
while not done:
  # Take random actions
  random_action = env.action_space.sample()
  obs, reward, done, info = env.step(random_action)
  n_steps += 1

print(n_steps, info)

100 {'time_limit_reached': True}


In practice, `gym` already has a wrapper for that, named `TimeLimit` (`gym.wrappers.TimeLimit`), which is used by most environments.

## Second example: normalize actions

It is usually a good idea to normalize observations and actions before giving them to the agent; this prevents [hard-to-debug issues](https://github.com/hill-a/stable-baselines/issues/473).

In this example, we are going to normalize the action space of *Pendulum-v0* so it lies in [-1, 1] instead of [-2, 2].

In [8]:
import numpy as np

class NormalizeActionWrapper(gym.Wrapper):
  """
  :param env: (gym.Env) Gym environment that will be wrapped
  """
  def __init__(self, env):
    # Retrieve the action space
    action_space = env.action_space
    assert isinstance(action_space, gym.spaces.Box), "This wrapper only works with continuous action space (spaces.Box)"
    # Retrieve the max/min values
    self.low, self.high = action_space.low, action_space.high

    # We modify the action space, so all actions will lie in [-1, 1]
    env.action_space = gym.spaces.Box(low=-1, high=1, shape=action_space.shape, dtype=np.float32)

    # Call the parent constructor, so we can access self.env later
    super(NormalizeActionWrapper, self).__init__(env)
  
  def rescale_action(self, scaled_action):
    """
    Rescale the action from [-1, 1] to [low, high]
    (no need for symmetric action space)
    :param scaled_action: (np.ndarray)
    :return: (np.ndarray)
    """
    return self.low + (0.5 * (scaled_action + 1.0) * (self.high - self.low))

  def reset(self):
    """
    Reset the environment 
    """
    # Nothing to reset in this wrapper, simply forward the call
    return self.env.reset()

  def step(self, action):
    """
    :param action: ([float] or int) Action taken by the agent
    :return: (np.ndarray, float, bool, dict) observation, reward, is the episode over?, additional informations
    """
    # Rescale action from [-1, 1] to original [low, high] interval
    rescaled_action = self.rescale_action(action)
    obs, reward, done, info = self.env.step(rescaled_action)
    return obs, reward, done, info
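
As a quick sanity check of the formula in `rescale_action`, the sketch below applies the same affine map with plain Python floats to the Pendulum bounds quoted above (`low = -2`, `high = 2`). The inverse map `scale_action` is a hypothetical helper added only for illustration; it is not used by the wrapper.

```python
def rescale_action(scaled_action, low, high):
    """Map an action from [-1, 1] to [low, high]."""
    return low + 0.5 * (scaled_action + 1.0) * (high - low)


def scale_action(action, low, high):
    """Inverse map (illustration only): from [low, high] back to [-1, 1]."""
    return 2.0 * (action - low) / (high - low) - 1.0


low, high = -2.0, 2.0
for scaled in (-1.0, 0.0, 0.5, 1.0):
    action = rescale_action(scaled, low, high)
    # Print the normalized action, the rescaled one, and the round trip
    print(scaled, "->", action, "->", scale_action(action, low, high))
```

The end points `-1` and `1` land exactly on `low` and `high`, and the same formula works for bounds that are not symmetric around zero.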


#### Test before rescaling actions

In [9]:
original_env = gym.make("Pendulum-v0")

print(original_env.action_space.low)
for _ in range(10):
  print(original_env.action_space.sample())

[-2.]
[-1.821175]
[1.8000855]
[-0.72909755]
[-1.0894296]
[0.02703821]
[-0.4260787]
[0.8469812]
[-0.0250614]
[-0.60594916]
[-1.5737854]


#### Test the NormalizeAction wrapper

In [10]:
env = NormalizeActionWrapper(gym.make("Pendulum-v0"))

print(env.action_space.low)

for _ in range(10):
  print(env.action_space.sample())

[-1.]
[0.26289186]
[0.31486264]
[0.07406089]
[-0.25210303]
[0.3383621]
[0.04249078]
[0.97044873]
[-0.30929026]
[-0.65102375]
[-0.65454125]


#### Test with a RL algorithm

We are going to use the `Monitor` wrapper of Stable-Baselines3, which allows us to monitor training stats (mean episode reward, mean episode length).

See here for more details: https://stable-baselines3.readthedocs.io/en/master/common/monitor.html

In [11]:
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv

In [12]:
env = Monitor(gym.make('Pendulum-v0'))
env = DummyVecEnv([lambda: env])

In [13]:
model = A2C("MlpPolicy", env, verbose=1).learn(int(1000))

Using cpu device
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 200       |
|    ep_rew_mean        | -1.23e+03 |
| time/                 |           |
|    fps                | 652       |
|    iterations         | 100       |
|    time_elapsed       | 0         |
|    total_timesteps    | 500       |
| train/                |           |
|    entropy_loss       | -1.42     |
|    explained_variance | -0.00511  |
|    learning_rate      | 0.0007    |
|    n_updates          | 99        |
|    policy_loss        | -34       |
|    std                | 1         |
|    value_loss         | 1.33e+03  |
-------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 200       |
|    ep_rew_mean        | -1.21e+03 |
| time/                 |           |
|    fps                | 648       |
|    iterations         | 200       |
|    time_elapsed       | 1      

With the action wrapper

In [14]:
normalized_env = Monitor(gym.make('Pendulum-v0'))
# Note that we can use multiple wrappers
normalized_env = NormalizeActionWrapper(normalized_env)
normalized_env = DummyVecEnv([lambda: normalized_env])

In [15]:
model_2 = A2C("MlpPolicy", normalized_env, verbose=1).learn(int(1000))

Using cpu device
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 200       |
|    ep_rew_mean        | -1.25e+03 |
| time/                 |           |
|    fps                | 625       |
|    iterations         | 100       |
|    time_elapsed       | 0         |
|    total_timesteps    | 500       |
| train/                |           |
|    entropy_loss       | -1.43     |
|    explained_variance | -0.0204   |
|    learning_rate      | 0.0007    |
|    n_updates          | 99        |
|    policy_loss        | -35.7     |
|    std                | 1.01      |
|    value_loss         | 1.39e+03  |
-------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 200       |
|    ep_rew_mean        | -1.28e+03 |
| time/                 |           |
|    fps                | 631       |
|    iterations         | 200       |
|    time_elapsed       | 1      