# Stable Baselines3 - Hindsight Experience Replay on Highway Env

GitHub repo: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)

Highway env: [https://github.com/eleurent/highway-env](https://github.com/eleurent/highway-env)

[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL), using Stable Baselines3.

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

Documentation is available online: [https://stable-baselines3.readthedocs.io/](https://stable-baselines3.readthedocs.io/)

## Install Dependencies and Stable Baselines Using Pip


```
pip install "stable-baselines3[extra]"
```

In [1]:
# for autoformatting
# %load_ext jupyter_black

In [2]:
# Install the latest version of Stable Baselines3
# !pip install "stable-baselines3[extra]>=2.0.0a4"

In [3]:
# Install highway-env
# !pip install highway-env

## Import policy, RL agent, ...

In [4]:
import gymnasium as gym
import highway_env
import numpy as np

from stable_baselines3 import HerReplayBuffer, SAC, DDPG
from stable_baselines3.common.noise import NormalActionNoise

import imageio

## Create the Gym env and instantiate the agent

For this example, we will be using the parking environment from the [highway-env](https://github.com/Farama-Foundation/HighwayEnv) repo by @eleurent.

The parking env is a goal-conditioned continuous control task, in which the vehicle must park in a given space with the appropriate heading.


![parking-env](https://raw.githubusercontent.com/eleurent/highway-env/gh-media/docs/media/parking-env.gif)
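
To make "goal-conditioned" concrete, here is a minimal sketch of the dictionary observation layout that HER relies on (`observation`, `achieved_goal`, `desired_goal`) together with a reward computed only from the two goals. The feature layout, the `WEIGHTS` values, the exponent `p`, and the `goal_reward` helper are illustrative assumptions, not the values or code used by highway-env.

```python
from typing import Dict, List, Sequence


def goal_reward(
    achieved: Sequence[float],
    desired: Sequence[float],
    weights: Sequence[float],
    p: float = 0.5,
) -> float:
    """Negative weighted L1 distance between two goals, raised to the power p."""
    distance = sum(w * abs(a - d) for a, d, w in zip(achieved, desired, weights))
    return -(distance**p)


# Hypothetical goal layout: [x, y, cos(heading), sin(heading)].
observation: Dict[str, List[float]] = {
    "observation": [0.0, 0.0, 1.0, 0.0],    # current vehicle state
    "achieved_goal": [0.0, 0.0, 1.0, 0.0],  # the goal the vehicle currently satisfies
    "desired_goal": [0.3, 0.4, 0.0, 1.0],   # the parking spot and heading to reach
}

# Illustrative weights: position errors count more than heading errors.
WEIGHTS = [1.0, 1.0, 0.1, 0.1]

reward = goal_reward(observation["achieved_goal"], observation["desired_goal"], WEIGHTS)
print(f"reward = {reward:.3f}")
```

Because the reward depends only on the achieved and desired goals, it can be recomputed for any substitute goal, which is the property that goal relabeling needs.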



### Train Soft Actor-Critic (SAC) agent

Here, we use the HER "future" goal sampling strategy, which creates 4 artificial transitions per real transition by relabeling each transition with goals achieved later in the same episode.

Note: the hyperparameters (network architecture, discount factor, ...) were tuned for this task.
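
The "future" strategy can be illustrated with a small, self-contained sketch. It uses scalar goals and builds the relabeled transitions eagerly, which is a simplification for readability and not how `HerReplayBuffer` is implemented. The helper names `relabel_future` and `relabeled_fraction`, the toy episode, and the choice to let a step sample its own achieved goal are assumptions made for this example.

```python
import random
from typing import List, Sequence, Tuple

# (step index, goal achieved at that step, goal the transition is labeled with)
Transition = Tuple[int, float, float]


def relabel_future(
    achieved_goals: Sequence[float],
    desired_goal: float,
    n_sampled_goal: int = 4,
    seed: int = 0,
) -> List[Transition]:
    """Return each real transition plus n_sampled_goal relabeled copies."""
    rng = random.Random(seed)
    last_step = len(achieved_goals) - 1
    transitions: List[Transition] = []
    for step, achieved in enumerate(achieved_goals):
        # The real transition keeps the goal the agent was actually pursuing.
        transitions.append((step, achieved, desired_goal))
        for _ in range(n_sampled_goal):
            # Pick a goal achieved at this step or later in the same episode
            # (inclusive lower bound is an assumption of this sketch).
            future_step = rng.randint(step, last_step)
            transitions.append((step, achieved, achieved_goals[future_step]))
    return transitions


def relabeled_fraction(n_sampled_goal: int) -> float:
    """Share of transitions that carry a substituted goal."""
    return n_sampled_goal / (n_sampled_goal + 1)


episode = [0.1, 0.4, 0.7, 0.9]  # toy achieved goals, one per step
replay = relabel_future(episode, desired_goal=1.0)
print(len(replay), relabeled_fraction(4))
```

With `n_sampled_goal=4`, four out of every five transitions in this sketch carry a substituted goal, so the agent sees many "successful" examples even when it never reaches the original target.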

In [5]:
env = gym.make("parking-v0")

In [6]:
# SAC hyperparams:
model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
    verbose=1,
    buffer_size=int(1e6),
    learning_rate=1e-3,
    gamma=0.95,
    batch_size=256,
    policy_kwargs=dict(net_arch=[256, 256, 256]),
)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [7]:
# Train for 1e5 steps
model.learn(int(1e5))
# Save the trained agent
model.save('her_sac_highway')

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 89       |
|    ep_rew_mean     | -52.8    |
|    success_rate    | 0        |
| time/              |          |
|    episodes        | 4        |
|    fps             | 10       |
|    time_elapsed    | 34       |
|    total_timesteps | 356      |
| train/             |          |
|    actor_loss      | -2.47    |
|    critic_loss     | 0.0573   |
|    ent_coef        | 0.776    |
|    ent_coef_loss   | -0.861   |
|    learning_rate   | 0.001    |
|    n_updates       | 255      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 81.2     |
|    ep_rew_mean     | -47.4    |
|    success_rate    | 0        |
| time/              |          |
|    episodes        | 8        |
|    fps             | 10       |
|    time_elapsed    | 63       |
|    total_timesteps | 650      |
| train/             |          |
|    actor_los

In [8]:
# Load saved model
model = SAC.load('her_sac_highway', env=env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


#### Evaluate SAC agent

In [9]:
# We use the Gymnasium API (gym >= 0.26) here. Note that you could also wrap the env
# in a DummyVecEnv, which gives you a simplified API.
obs, _ = env.reset()

# Evaluate the agent
episode_reward = 0.0
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = truncated or terminated
    episode_reward += reward
    if done or info.get("is_success", False):
        print("Reward:", episode_reward, "Success?", info.get("is_success", False))
        episode_reward = 0.0
        obs, _ = env.reset()

Reward: -13.220677567305547 Success? False
Reward: -12.489266528126073 Success? False
Reward: -6.312339707254436 Success? True
Reward: -5.989705652621492 Success? True
Reward: -14.453279608260836 Success? False
Reward: -9.729581976657908 Success? True
Reward: -7.401599480443453 Success? True
Reward: -11.037402987219409 Success? True
Reward: -3.1743355432581586 Success? True
Reward: -3.5783419868026614 Success? True
Reward: -3.3315189380794026 Success? True
Reward: -6.57567094880589 Success? True
Reward: -8.735434387996053 Success? True
Reward: -4.520352591625949 Success? True
Reward: -10.839384820143952 Success? False
Reward: -3.1086932440667376 Success? True
Reward: -11.491171472721291 Success? False
Reward: -6.462331375395717 Success? True
Reward: -5.513745468769035 Success? True
Reward: -5.294821258521379 Success? True
Reward: -5.346307176079068 Success? True
Reward: -4.353214644771106 Success? True
Reward: -5.041152246562712 Success? True
Reward: -4.369845515097199 Success? True
Re
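
The per-episode lines printed above can be aggregated into a mean reward and a success rate. The sketch below parses lines in the `Reward: ... Success? ...` format produced by the evaluation loop; the helper names `parse_episode_line` and `summarize` are hypothetical, and the sample uses only the first three lines of the output above. Truncated or malformed lines are skipped rather than guessed.

```python
from typing import List, Sequence, Tuple


def parse_episode_line(line: str) -> Tuple[float, bool]:
    """Parse 'Reward: <float> Success? <True|False>' into (reward, success)."""
    parts = line.split()
    return float(parts[1]), parts[3] == "True"


def summarize(lines: Sequence[str]) -> Tuple[float, float]:
    """Return (mean episode reward, success rate) over all well-formed lines."""
    episodes: List[Tuple[float, bool]] = [
        parse_episode_line(line)
        for line in lines
        if line.startswith("Reward:") and len(line.split()) == 4
    ]
    if not episodes:
        raise ValueError("no complete episode lines found")
    mean_reward = sum(reward for reward, _ in episodes) / len(episodes)
    success_rate = sum(success for _, success in episodes) / len(episodes)
    return mean_reward, success_rate


sample_lines = [
    "Reward: -13.220677567305547 Success? False",
    "Reward: -12.489266528126073 Success? False",
    "Reward: -6.312339707254436 Success? True",
    "Re",  # truncated line, ignored by summarize
]

mean_reward, success_rate = summarize(sample_lines)
print(f"mean reward: {mean_reward:.2f}, success rate: {success_rate:.2f}")
```

Stable Baselines3 also ships `evaluate_policy` in `stable_baselines3.common.evaluation`, which reports the mean and standard deviation of the episode reward directly from a model and an environment.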

### Train DDPG agent
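
DDPG learns a deterministic policy, so exploration comes from noise added to its actions. The following sketch shows the idea behind Gaussian action noise using only the standard library. `GaussianActionNoise` and `noisy_action` are hypothetical stand-ins written for this example, not the `NormalActionNoise` class from Stable Baselines3, and the two-dimensional action and the clipping bounds of [-1, 1] are assumptions.

```python
import random
from typing import Callable, List, Sequence


class GaussianActionNoise:
    """Independent Gaussian noise per action dimension (illustrative only)."""

    def __init__(self, mean: Sequence[float], sigma: Sequence[float], seed: int = 0) -> None:
        self.mean = list(mean)
        self.sigma = list(sigma)
        self._rng = random.Random(seed)

    def __call__(self) -> List[float]:
        # Draw one noise sample per action dimension.
        return [self._rng.gauss(m, s) for m, s in zip(self.mean, self.sigma)]


def noisy_action(
    action: Sequence[float],
    noise: Callable[[], List[float]],
    low: float = -1.0,
    high: float = 1.0,
) -> List[float]:
    """Add noise to a deterministic action and clip it back into the bounds."""
    return [min(high, max(low, a + n)) for a, n in zip(action, noise())]


n_actions = 2    # assumed action dimension for this sketch
noise_std = 0.2  # same standard deviation as in the cell below
noise = GaussianActionNoise([0.0] * n_actions, [noise_std] * n_actions)

explored = noisy_action([0.9, -0.1], noise)
print(explored)
```

A larger `noise_std` explores more aggressively but makes the collected trajectories less representative of the learned policy, so the value is a trade-off rather than a fixed rule.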

In [6]:
# Create the action noise object that will be used for exploration
n_actions = env.action_space.shape[0]
noise_std = 0.2
action_noise = NormalActionNoise(
    mean=np.zeros(n_actions), sigma=noise_std * np.ones(n_actions)
)

model = DDPG(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
    verbose=1,
    buffer_size=int(1e6),
    learning_rate=1e-3,
    action_noise=action_noise,
    gamma=0.95,
    batch_size=256,
    policy_kwargs=dict(net_arch=[256, 256, 256]),
)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [7]:
# Train for 2e5 steps
model.learn(int(2e5))
# Save the trained agent
model.save('her_ddpg_highway')

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 173      |
|    ep_rew_mean     | -87.2    |
|    success_rate    | 0        |
| time/              |          |
|    episodes        | 4        |
|    fps             | 53       |
|    time_elapsed    | 12       |
|    total_timesteps | 693      |
| train/             |          |
|    actor_loss      | 1.28     |
|    critic_loss     | 0.0345   |
|    learning_rate   | 0.001    |
|    n_updates       | 592      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 242      |
|    ep_rew_mean     | -118     |
|    success_rate    | 0        |
| time/              |          |
|    episodes        | 8        |
|    fps             | 43       |
|    time_elapsed    | 44       |
|    total_timesteps | 1934     |
| train/             |          |
|    actor_loss      | 1.64     |
|    critic_loss     | 0.0113   |
|    learning_

In [8]:
# Load saved model
model = DDPG.load('her_ddpg_highway', env=env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


#### Evaluate DDPG agent

In [9]:
# We use the Gymnasium API (gym >= 0.26) here. Note that you could also wrap the env
# in a DummyVecEnv, which gives you a simplified API.
obs, _ = env.reset()

# Evaluate the agent
episode_reward = 0.0
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = truncated or terminated
    episode_reward += reward
    if done or info.get("is_success", False):
        print("Reward:", episode_reward, "Success?", info.get("is_success", False))
        episode_reward = 0.0
        obs, _ = env.reset()

Reward: -10.88932621376994 Success? True
Reward: -7.232710411189086 Success? True
Reward: -3.5694489330911026 Success? True
Reward: -4.57312157455117 Success? True
Reward: -5.470549049887844 Success? True
Reward: -3.513680684665023 Success? True
Reward: -8.487132322612954 Success? True
Reward: -4.604462292015917 Success? True
Reward: -6.044720056588108 Success? True
Reward: -5.87483292411637 Success? True
Reward: -5.194191953385466 Success? True
Reward: -7.833608261846191 Success? True
Reward: -5.794497177199117 Success? True
Reward: -3.6651461906779335 Success? True
Reward: -4.5692794591671095 Success? True
Reward: -8.759227017443811 Success? True
Reward: -5.777826507868408 Success? True
Reward: -6.765293100222839 Success? True
Reward: -10.206196430375389 Success? True
Reward: -7.782004360009789 Success? True
Reward: -5.496592646557844 Success? True
Reward: -4.074065411663049 Success? True
Reward: -10.137375746957051 Success? True
Reward: -6.8234976561469 Success? True
Reward: -4.1867