# Stable Baselines - Hindsight Experience Replay on Highway Env

GitHub repo: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)

Highway env: [https://github.com/eleurent/highway-env](https://github.com/eleurent/highway-env)

[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL), using Stable Baselines3.

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

Documentation is available online: [https://stable-baselines3.readthedocs.io/](https://stable-baselines3.readthedocs.io/)

## Install Dependencies and Stable Baselines Using Pip


```
pip install "stable-baselines3[extra]"
```

In [1]:
# for autoformatting
# %load_ext jupyter_black

In [2]:
# Install stable-baselines latest version
!pip install "stable-baselines3[extra]>=2.0.0a4"

Collecting stable-baselines3[extra]>=2.0.0a4
  Downloading stable_baselines3-2.0.0-py3-none-any.whl (178 kB)
Collecting gymnasium==0.28.1 (from stable-baselines3[extra]>=2.0.0a4)
  Downloading gymnasium-0.28.1-py3-none-any.whl (925 kB)
Collecting shimmy[atari]~=0.2.1 (from stable-baselines3[extra]>=2.0.0a4)
  Downloading Shimmy-0.2.1-py3-none-any.whl (25 kB)
Collecting autorom[accept-rom-license]~=0.6.0 (from stable-baselines3[extra]>=2.0.0a4)
  Downloading AutoROM-0.6.1-py3-none-any.whl (9.4 kB)
Collecting jax-jumpy>=1.0.0 (from gymnasium==0.28.1->stable-baselines3[extra]>=2.0.0a4)
  Downloading jax_jumpy-1.0.0-py3-none-any.whl 

In [3]:
# Install highway-env
!pip install highway-env

Collecting highway-env
  Downloading highway_env-1.8.2-py3-none-any.whl (104 kB)
Installing collected packages: highway-env
Successfully installed highway-env-1.8.2


## Import policy, RL agent, ...

In [4]:
import gymnasium as gym
import highway_env
import numpy as np

from stable_baselines3 import HerReplayBuffer, SAC, DDPG
from stable_baselines3.common.noise import NormalActionNoise



## Create the Gym env and instantiate the agent

For this example, we will be using the parking environment from the [highway-env](https://github.com/Farama-Foundation/HighwayEnv) repo by @eleurent.

The parking env is a goal-conditioned continuous control task, in which the vehicle must park in a given space with the appropriate heading.


![parking-env](https://raw.githubusercontent.com/eleurent/highway-env/gh-media/docs/media/parking-env.gif)



### Train Soft Actor-Critic (SAC) agent

Here, we use the HER "future" goal selection strategy: for each real transition, 4 artificial transitions are created by replacing the desired goal with a goal achieved later in the same episode.

Note: the hyperparameters (network architecture, discount factor, ...) were tuned for this task.
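
To make the "future" strategy more concrete, here is a minimal NumPy sketch of the relabeling idea. It is an illustration only, not how `HerReplayBuffer` is implemented: the function names `toy_reward` and `relabel_future`, the distance-based reward, and the toy episode are all assumptions made for this example.

```python
import numpy as np


def toy_reward(achieved_goal, desired_goal):
    # Hypothetical reward: negative Euclidean distance to the goal.
    return -float(np.linalg.norm(achieved_goal - desired_goal))


def relabel_future(achieved_goals, n_sampled_goal=4, rng=None):
    """Build relabeled (step, new_goal, new_reward) tuples for one episode.

    achieved_goals[t] is the goal actually reached after step t.
    """
    rng = np.random.default_rng() if rng is None else rng
    episode_length = len(achieved_goals)
    relabeled = []
    for t in range(episode_length):
        # "future": draw goals achieved at step t or later in the same episode.
        future_steps = rng.integers(t, episode_length, size=n_sampled_goal)
        for f in future_steps:
            new_goal = achieved_goals[f]
            new_reward = toy_reward(achieved_goals[t], new_goal)
            relabeled.append((t, new_goal, new_reward))
    return relabeled


# Toy episode: 5 steps, 2-dimensional goals.
episode = np.arange(10, dtype=float).reshape(5, 2)
extra = relabel_future(episode, n_sampled_goal=4, rng=np.random.default_rng(0))
print(len(extra))  # 5 real transitions x 4 sampled goals each
```

Because the substituted goal was actually reached later in the episode, the relabeled transitions carry a useful learning signal even when the original goal was missed.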

In [5]:
env = gym.make("parking-v0")

In [6]:
# SAC hyperparams:
model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
    verbose=1,
    buffer_size=int(1e6),
    learning_rate=1e-3,
    gamma=0.95,
    batch_size=256,
    policy_kwargs=dict(net_arch=[256, 256, 256]),
)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [7]:
# Train for 1e5 steps
model.learn(int(1e5))
# Save the trained agent
model.save('her_sac_highway')

Streaming output truncated to the last 5000 lines.
|    success_rate    | 0.91     |
| time/              |          |
|    episodes        | 2212     |
|    fps             | 37       |
|    time_elapsed    | 2052     |
|    total_timesteps | 77402    |
| train/             |          |
|    actor_loss      | 1.59     |
|    critic_loss     | 0.0061   |
|    ent_coef        | 0.00465  |
|    ent_coef_loss   | -0.687   |
|    learning_rate   | 0.001    |
|    n_updates       | 77301    |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 20       |
|    ep_rew_mean     | -7.09    |
|    success_rate    | 0.91     |
| time/              |          |
|    episodes        | 2216     |
|    fps             | 37       |
|    time_elapsed    | 2054     |
|    total_timesteps | 77475    |
| train/             |          |
|    actor_loss      | 1.56     |
|    critic_loss     | 0.0062   |
|    ent_coef    

In [8]:
# Load saved model
model = SAC.load('her_sac_highway', env=env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


#### Evaluate the agent

In [9]:
# We use the Gymnasium API (gym >= 0.26) here. Alternatively, you could wrap the env
# in a DummyVecEnv, which exposes the simpler VecEnv API.
obs, _ = env.reset()

# Evaluate the agent
episode_reward = 0.0
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = truncated or terminated
    episode_reward += reward
    if done or info.get("is_success", False):
        print("Reward:", episode_reward, "Success?", info.get("is_success", False))
        episode_reward = 0.0
        obs, _ = env.reset()

Reward: -12.985952643578237 Success? False
Reward: -2.4818934176069205 Success? True
Reward: -6.200242145528466 Success? True
Reward: -6.804976008673456 Success? True
Reward: -2.4973094134231113 Success? True
Reward: -7.2051187561172165 Success? True
Reward: -8.514093503910791 Success? True
Reward: -5.985852771201228 Success? True
Reward: -4.641636335505931 Success? True
Reward: -3.435680310102163 Success? True
Reward: -4.113605305653299 Success? True
Reward: -11.414662685014392 Success? True
Reward: -5.474037742222492 Success? True
Reward: -4.441864032818608 Success? True
Reward: -5.915973986098036 Success? True
Reward: -14.548069039852653 Success? False
Reward: -2.5207500819519884 Success? True
Reward: -8.210777452752591 Success? True
Reward: -13.780042800807317 Success? False
Reward: -4.824219766937436 Success? True
Reward: -7.012449805631088 Success? True
Reward: -12.830665892762651 Success? False
Reward: -7.806276284088399 Success? True
Reward: -8.93582105393842 Success? True
Rewa
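
The loop above prints one line per episode. To condense such a run into a mean reward and a success rate, a small helper along the following lines could be used. This is a sketch under assumptions: `summarize_episodes` is a hypothetical name, and it only relies on the `reset`/`step`/`predict` calls and the `is_success` info key already used in the cell above.

```python
def summarize_episodes(model, env, n_episodes=20):
    """Run n_episodes with the deterministic policy.

    Returns (mean_episode_reward, success_rate).
    """
    rewards, successes = [], []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        episode_reward, done, info = 0.0, False, {}
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += float(reward)
            # Same stopping rule as the loop above: end on termination,
            # truncation, or a reported success.
            done = terminated or truncated or info.get("is_success", False)
        rewards.append(episode_reward)
        successes.append(bool(info.get("is_success", False)))
    return sum(rewards) / len(rewards), sum(successes) / len(successes)
```

With the objects from this notebook, it would be called as `mean_reward, success_rate = summarize_episodes(model, env)`.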

### Train DDPG agent

In [10]:
# Create the action noise object that will be used for exploration
n_actions = env.action_space.shape[0]
noise_std = 0.2
action_noise = NormalActionNoise(
    mean=np.zeros(n_actions), sigma=noise_std * np.ones(n_actions)
)

model = DDPG(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
    verbose=1,
    buffer_size=int(1e6),
    learning_rate=1e-3,
    action_noise=action_noise,
    gamma=0.95,
    batch_size=256,
    policy_kwargs=dict(net_arch=[256, 256, 256]),
)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
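
DDPG learns a deterministic policy, so exploration comes from the action noise configured above. The following NumPy sketch illustrates the idea of perturbing a deterministic action with zero-mean Gaussian noise and clipping the result. It is not the library's code: `noisy_action` is a hypothetical helper, and the `[-1, 1]` action bounds are an assumption made for the example.

```python
import numpy as np


def noisy_action(deterministic_action, sigma=0.2, low=-1.0, high=1.0, rng=None):
    """Add zero-mean Gaussian noise to an action, then clip to the bounds."""
    rng = np.random.default_rng() if rng is None else rng
    action = np.asarray(deterministic_action, dtype=float)
    # One independent noise sample per action dimension.
    noise = rng.normal(loc=0.0, scale=sigma, size=action.shape)
    return np.clip(action + noise, low, high)


# Example: a 2-dimensional action, as a stand-in for the policy output.
explored = noisy_action([0.9, -0.3], sigma=0.2, rng=np.random.default_rng(0))
print(explored)
```

A larger `sigma` explores more aggressively but makes the collected behavior noisier; `noise_std = 0.2` is the value used in the cell above.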


In [11]:
# Train for 2e5 steps
model.learn(int(2e5))
# Save the trained agent
model.save('her_ddpg_highway')

Streaming output truncated to the last 5000 lines.
|    time_elapsed    | 3715     |
|    total_timesteps | 172274   |
| train/             |          |
|    actor_loss      | 0.827    |
|    critic_loss     | 0.00734  |
|    learning_rate   | 0.001    |
|    n_updates       | 172164   |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 27.3     |
|    ep_rew_mean     | -8.13    |
|    success_rate    | 0.95     |
| time/              |          |
|    episodes        | 3164     |
|    fps             | 46       |
|    time_elapsed    | 3717     |
|    total_timesteps | 172369   |
| train/             |          |
|    actor_loss      | 0.853    |
|    critic_loss     | 0.011    |
|    learning_rate   | 0.001    |
|    n_updates       | 172262   |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 27.4     |
|    ep_rew_mean 

In [12]:
# Load saved model
model = DDPG.load('her_ddpg_highway', env=env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


#### Evaluate the agent

In [13]:
# We use the Gymnasium API (gym >= 0.26) here. Alternatively, you could wrap the env
# in a DummyVecEnv, which exposes the simpler VecEnv API.
obs, _ = env.reset()

# Evaluate the agent
episode_reward = 0.0
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = truncated or terminated
    episode_reward += reward
    if done or info.get("is_success", False):
        print("Reward:", episode_reward, "Success?", info.get("is_success", False))
        episode_reward = 0.0
        obs, _ = env.reset()

Reward: -7.252649405517162 Success? True
Reward: -12.512272431820186 Success? False
Reward: -6.9715930294429596 Success? True
Reward: -11.742236279859005 Success? False
Reward: -6.023317212987999 Success? True
Reward: -8.617461968026744 Success? True
Reward: -4.495503959671879 Success? True
Reward: -3.886870998409277 Success? True
Reward: -7.4517091689693995 Success? True
Reward: -5.464385185810589 Success? True
Reward: -5.921512318985064 Success? True
Reward: -2.7266699497834512 Success? True
Reward: -6.892278677988796 Success? True
Reward: -13.014264558156649 Success? False
Reward: -8.914868693307087 Success? True
Reward: -8.583111589178747 Success? True
Reward: -6.8875961123554355 Success? True
Reward: -8.55289886928479 Success? True
Reward: -3.29278398870272 Success? True
Reward: -5.060954916188682 Success? True
Reward: -8.679734648327164 Success? True
Reward: -3.6149708495630173 Success? True
Reward: -13.45216672826766 Success? False
Reward: -4.255882505093875 Success? True
Reward