<a href="https://colab.research.google.com/github/StevenJokess/rl-colab-notebooks/blob/sb3/stable_baselines_her(colab).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Baselines3 - Hindsight Experience Replay on Highway Env

GitHub repo: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)

Highway env: [https://github.com/eleurent/highway-env](https://github.com/eleurent/highway-env)

[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines3.

It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.

Documentation is available online: [https://stable-baselines3.readthedocs.io/](https://stable-baselines3.readthedocs.io/)

## Install Dependencies and Stable Baselines Using Pip


```
pip install stable-baselines3[extra]
```

In [1]:
# Install the latest development version of Stable-Baselines3
!pip install git+https://github.com/DLR-RM/stable-baselines3

Collecting git+https://github.com/DLR-RM/stable-baselines3
  Cloning https://github.com/DLR-RM/stable-baselines3 to /tmp/pip-req-build-x8fgi2vo
  Running command git clone -q https://github.com/DLR-RM/stable-baselines3 /tmp/pip-req-build-x8fgi2vo
Building wheels for collected packages: stable-baselines3
  Building wheel for stable-baselines3 (setup.py) ... done
  Created wheel for stable-baselines3: filename=stable_baselines3-0.11.0a5-cp36-none-any.whl size=150619 sha256=43d75c792bfc585e4dde2f165140cefd291545fa0ceb7be2418fc35d3a49569d
  Stored in directory: /tmp/pip-ephem-wheel-cache-ddna5e4k/wheels/cf/89/6b/cd4b89427eb5ff0858bcba73911088d606c59eb3a97290b1bb
Successfully built stable-baselines3


In [2]:
# Install highway-env
!pip install git+https://github.com/eleurent/highway-env

Collecting git+https://github.com/eleurent/highway-env
  Cloning https://github.com/eleurent/highway-env to /tmp/pip-req-build-6gshyyfd
  Running command git clone -q https://github.com/eleurent/highway-env /tmp/pip-req-build-6gshyyfd
Building wheels for collected packages: highway-env
  Building wheel for highway-env (setup.py) ... done
  Created wheel for highway-env: filename=highway_env-1.0.dev0-cp36-none-any.whl size=80900 sha256=deb8c69065052a48dbcacc8070c1ca24279042f98f550bce341043c46bafe298
  Stored in directory: /tmp/pip-ephem-wheel-cache-pl4fcmj6/wheels/e6/10/d8/02a077ca221bbac1c6fc12c1370c2f773a8cd602d4be3df0cc
Successfully built highway-env


## Import policy, RL agent, ...

In [3]:
import gym
import highway_env
import numpy as np

from stable_baselines3 import HER, SAC, DDPG
from stable_baselines3.common.noise import NormalActionNoise

## Create the Gym env and instantiate the agent

For this example, we will be using the parking environment from the [highway-env](https://github.com/eleurent/highway-env) repo by @eleurent.

The parking env is a goal-conditioned continuous control task, in which the vehicle must park in a given space with the appropriate heading.


![parking-env](https://raw.githubusercontent.com/eleurent/highway-env/gh-media/docs/media/parking-env.gif)
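To make "goal-conditioned" concrete, here is a minimal sketch of what such a task exposes. It assumes the Gym `GoalEnv` convention of a dict observation with `observation`, `achieved_goal` and `desired_goal` keys. The numbers and the reward form below are illustrative assumptions, not necessarily what `parking-v0` uses.

```python
# Hypothetical goal-conditioned observation with the three GoalEnv keys.
# All values are made up for illustration.
obs = {
    "observation":   [0.10, 0.20, 0.0, 0.0, 1.0, 0.0],  # full vehicle state
    "achieved_goal": [0.10, 0.20, 0.0, 0.0, 1.0, 0.0],  # where the vehicle is
    "desired_goal":  [0.00, 0.00, 0.0, 0.0, 0.0, 1.0],  # the target parking pose
}

def compute_reward(achieved_goal, desired_goal, p=0.5):
    """Illustrative dense reward: negative L1 distance to the goal, raised to p.

    Assumed form for this sketch only.
    """
    distance = sum(abs(a - d) for a, d in zip(achieved_goal, desired_goal))
    return -(distance ** p)

reward_now = compute_reward(obs["achieved_goal"], obs["desired_goal"])
reward_at_goal = compute_reward(obs["desired_goal"], obs["desired_goal"])
print(reward_now, reward_at_goal)
```

Because the reward depends only on the achieved and desired goals, it can be recomputed for any substitute goal, which is the property HER relies on.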



### Train Soft Actor-Critic (SAC) agent

Here, we use HER with the "future" goal-sampling strategy, which creates 4 artificial transitions for each real transition.

Note: the hyperparameters (network architecture, discount factor, ...) were tuned for this task.
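The "future" strategy can be sketched in plain Python: for each real transition, a goal is drawn from the goals achieved at the same step or later in that episode, and the reward is recomputed for it. The transition layout, the sparse reward and the name `relabel_future` are assumptions made for this sketch; this is not the Stable-Baselines3 implementation.

```python
import random

def relabel_future(episode, n_sampled_goal=4, rng=random):
    """Return artificial transitions built with the "future" strategy.

    `episode` is a list of dicts with keys "achieved_goal" (the goal reached
    after the step) and "desired_goal". Hypothetical layout for illustration.
    """
    artificial = []
    for t, transition in enumerate(episode):
        for _ in range(n_sampled_goal):
            # Pick a step at or after t and reuse the goal achieved there
            future = rng.randrange(t, len(episode))
            new_goal = episode[future]["achieved_goal"]
            relabeled = dict(transition, desired_goal=new_goal)
            # Sparse reward for the sketch: 0 if the new goal is reached, else -1
            reached = transition["achieved_goal"] == new_goal
            relabeled["reward"] = 0.0 if reached else -1.0
            artificial.append(relabeled)
    return artificial

# A 5-step episode that never reaches its original goal (9, 9)
episode = [
    {"achieved_goal": (x, 0), "desired_goal": (9, 9), "reward": -1.0}
    for x in range(1, 6)
]
extra = relabel_future(episode, n_sampled_goal=4, rng=random.Random(0))
print(len(episode), "real ->", len(extra), "artificial")
```

With `n_sampled_goal=4`, every real transition yields four relabeled copies, so a failed episode still produces transitions with a non-negative reward signal.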

In [4]:
env = gym.make("parking-v0")

In [5]:
# SAC hyperparams:
model = HER('MlpPolicy', env, SAC, n_sampled_goal=4,
            goal_selection_strategy='future', online_sampling=True,
            verbose=1, buffer_size=int(1e6),
            learning_rate=1e-3,
            gamma=0.95, batch_size=256,
            policy_kwargs=dict(net_arch=[256, 256, 256]), max_episode_length=100)

Using cuda device


In [6]:
# Short demo run of 2560 steps; use int(1e5) for a full training run
model.learn(2560)
# Save the trained agent
model.save('her_sac_highway')

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -54.9    |
|    success rate    | 0        |
| time/              |          |
|    episodes        | 4        |
|    fps             | 33       |
|    time_elapsed    | 11       |
|    total timesteps | 400      |
| train/             |          |
|    actor_loss      | -2.46    |
|    critic_loss     | 0.0269   |
|    ent_coef        | 0.742    |
|    ent_coef_loss   | -1       |
|    learning_rate   | 0.001    |
|    n_updates       | 299      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -57.8    |
|    success rate    | 0        |
| time/              |          |
|    episodes        | 8        |
|    fps             | 31       |
|    time_elapsed    | 25       |
|    total timesteps | 800      |
| train/             |          |
|    actor_los

In [7]:
# Load saved model
model = HER.load('her_sac_highway', env=env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


#### Evaluate the agent

In [8]:
obs = env.reset()

# Evaluate the agent
episode_reward = 0.0
for _ in range(10):  # use 1000 for a longer evaluation
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    episode_reward += reward
    # `env` is a plain (non-vectorized) Gym env, so `info` is a single dict
    success = info.get('is_success', False)
    if done or success:
        print("Reward:", episode_reward, "Success?", success)
        episode_reward = 0.0
        obs = env.reset()

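A note on the shape of `info` in the evaluation loop: a plain Gym env, such as the one returned by `gym.make`, returns a single dict from `step`, whereas a vectorized env returns a list with one dict per sub-environment. Indexing the dict form with `info[0]` looks up a key named `0` and raises a `KeyError`. The sketch below uses hypothetical `info` values and a hypothetical helper, `is_success`, that accepts either form.

```python
# Shape of `info` from a plain Gym env: one dict per step
single_info = {"is_success": True}

# Shape of `info` from a vectorized env: a list with one dict per sub-env
vec_info = [{"is_success": True}]

def is_success(info):
    """Read the success flag from either form of `info`."""
    if isinstance(info, (list, tuple)):
        info = info[0]  # first (here: only) sub-environment
    return bool(info.get("is_success", False))

try:
    single_info[0]  # a dict has no positional index; 0 is treated as a key
except KeyError as err:
    error_name = type(err).__name__

print(is_success(single_info), is_success(vec_info), error_name)
```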

### Train DDPG agent

In [9]:
# Create the action noise object that will be used for exploration
n_actions = env.action_space.shape[0]
noise_std = 0.2
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=noise_std * np.ones(n_actions))

model = HER('MlpPolicy', env, DDPG, n_sampled_goal=4,
            goal_selection_strategy='future', online_sampling=True,
            verbose=1, buffer_size=int(1e6),
            learning_rate=1e-3, action_noise=action_noise,
            gamma=0.95, batch_size=256,
            policy_kwargs=dict(net_arch=[256, 256, 256]), max_episode_length=100)

Using cuda device
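The `NormalActionNoise` object created above drives exploration for DDPG, whose policy is deterministic. Conceptually, zero-mean Gaussian noise with standard deviation `noise_std` is added to each action component during data collection. The following is a rough standard-library sketch of that idea; the function `noisy_action`, the sample action and the clipping bounds of [-1, 1] are assumptions for illustration, not the library's code.

```python
import random

def noisy_action(action, sigma=0.2, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise to each action component, then clip."""
    noisy = [a + rng.gauss(0.0, sigma) for a in action]
    return [min(high, max(low, a)) for a in noisy]

rng = random.Random(0)
deterministic = [0.5, -0.9]  # hypothetical policy output
explored = noisy_action(deterministic, sigma=0.2, rng=rng)
print(deterministic, "->", explored)
```

A larger `sigma` explores more widely but makes the collected actions noisier; with `sigma=0.0` the action passes through unchanged.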


In [10]:
# Short demo run of 2560 steps; use int(2e5) for a full training run
model.learn(2560)
# Save the trained agent
model.save('her_ddpg_highway')

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -50.1    |
|    success rate    | 0        |
| time/              |          |
|    episodes        | 4        |
|    fps             | 41       |
|    time_elapsed    | 9        |
|    total timesteps | 400      |
| train/             |          |
|    actor_loss      | 0.383    |
|    critic_loss     | 0.00158  |
|    ent_coef        | 0.0872   |
|    ent_coef_loss   | -7.69    |
|    learning_rate   | 0.001    |
|    n_updates       | 200      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -54.9    |
|    success rate    | 0        |
| time/              |          |
|    episodes        | 8        |
|    fps             | 37       |
|    time_elapsed    | 21       |
|    total timesteps | 800      |
| train/             |          |
|    actor_los

In [11]:
# Load saved model
model = HER.load('her_ddpg_highway', env=env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


#### Evaluate the agent

In [12]:
obs = env.reset()

# Evaluate the agent
episode_reward = 0.0
for _ in range(10):  # use 1000 for a longer evaluation
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    episode_reward += reward
    # `env` is a plain (non-vectorized) Gym env, so `info` is a single dict
    success = info.get('is_success', False)
    if done or success:
        print("Reward:", episode_reward, "Success?", success)
        episode_reward = 0.0
        obs = env.reset()