# Physical Workshop on RL

Welcome to our first workshop for TIL! In this notebook, we set up a simple RL agent using Stable Baselines 3 (SB3). We also cover saving and loading a model and creating parallel environments.

In [16]:
!pip install swig gymnasium stable-baselines3
!pip install "gymnasium[box2d]"



### Why Stable Baselines 3?

You might notice that our training materials are in TorchRL, a different framework.

Stable Baselines 3 provides a simple way of training RL models while abstracting away many of the details, which makes it a good fit for a quick training run.

This also makes it great for iterating through different models on the fly. However, it offers less customisability, and modifying the model itself can become complex. For these reasons, we decided to use TorchRL as the main framework for the RL materials.

### Initialising the environment
We use gymnasium, an environment library maintained by the Farama Foundation. It ships with many environments, and we can load different ones to play with.

Gymnasium also gives every environment the same interface: we call `reset()` to start an episode and `step(action)` to apply an action and receive the next observation and reward.

#### Creating the agent
In Stable Baselines, the agent can be created with a single line of code. `MlpPolicy` tells it that the policy network is a default multi-layer perceptron (MLP), that is, a simple fully connected neural network.

In [17]:
import gymnasium as gym

from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv
from IPython.display import HTML

# Create environment
env = gym.make("LunarLander-v3", render_mode="rgb_array")

# Instantiate the agent
model = DQN("MlpPolicy", env, verbose=1)


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
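To give a feel for what an MLP policy computes, here is a minimal NumPy sketch of a forward pass: an 8-dimensional observation passes through two hidden layers and comes out as one Q-value per action. The layer sizes (two hidden layers of 64 units with ReLU) are an assumption about SB3's DQN defaults, the weights are random rather than trained, and `mlp_q_values` is a hypothetical helper, not SB3 code.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, N_ACTIONS = 8, 4  # LunarLander: 8 observation values, 4 discrete actions
HIDDEN = (64, 64)          # assumed hidden layer sizes

# Randomly initialised weights and zero biases for each layer (untrained)
sizes = (OBS_DIM, *HIDDEN, N_ACTIONS)
weights = [rng.normal(0.0, 0.1, size=(n_in, n_out)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]


def mlp_q_values(obs: np.ndarray) -> np.ndarray:
    """Forward pass: ReLU on hidden layers, linear output with one Q-value per action."""
    x = obs
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ w + b, 0.0)   # hidden layer followed by ReLU
    return x @ weights[-1] + biases[-1]  # output layer, no activation


obs = rng.uniform(-1.0, 1.0, size=OBS_DIM)  # a made-up observation
q_values = mlp_q_values(obs)
greedy_action = int(np.argmax(q_values))    # a greedy policy picks the largest Q-value
print("Q-values:", q_values, "-> greedy action:", greedy_action)
```

Training adjusts the weights so that the Q-values approximate the expected return of each action.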


### The Environment

<img src = "https://gymnasium.farama.org/_images/lunar_lander.gif"/>

We are using an environment called LunarLander-v3, where the goal is to land the lander on the pad between the two flags.

There are four discrete actions available:

  0: do nothing

  1: fire left orientation engine

  2: fire main engine

  3: fire right orientation engine

The state is an 8-dimensional vector: the coordinates of the lander in x & y, its linear velocities in x & y, its angle, its angular velocity, and two booleans that represent whether each leg is in contact with the ground or not.
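For readability, the 8 values can be unpacked into named fields. The sketch below assumes the ordering given in the description above. `LanderState`, `parse_observation` and the example numbers are hypothetical and for illustration only.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    leg_1_contact: bool
    leg_2_contact: bool


def parse_observation(obs: Sequence[float]) -> LanderState:
    """Convert a raw 8-value observation into a named structure."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    *continuous, leg_1, leg_2 = obs
    # The leg contacts arrive as 0.0 / 1.0 floats, so convert them to booleans
    return LanderState(*(float(v) for v in continuous), bool(leg_1), bool(leg_2))


# Made-up observation: slightly right of centre, descending, one leg touching
example_obs = [0.01, 1.40, 0.50, -0.30, -0.02, 0.10, 0.0, 1.0]
state = parse_observation(example_obs)
print(state)
```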

**This environment is a discrete action environment.**

Discrete: the agent chooses from a finite set of actions (here, one of four integers).

Continuous: actions are real-valued, and any value within the action space's bounds is a valid action.


### Checking the environment

It is a good habit to check the environment's specs, such as its observation space, action space and metadata, before training.

In [35]:
import stable_baselines3.common.env_checker as env_checker

env_checker.check_env(env)  # Checks that env follows the Gym API and is compatible with SB3

print("Observation space:", env.observation_space)
# e.g. Box([-2.5, -2.5, -10., …], [2.5, 2.5, 10., …], (8,), float32)

print("Action space:", env.action_space)
# e.g. Discrete(4)

print("Metadata:", env.metadata)
# e.g. {'render_modes': ['human', 'rgb_array'], 'render_fps': 50}

print("Spec:", env.spec)
# e.g. EnvSpec(id='LunarLander-v3', entry_point='gymnasium.envs.box2d.lunar_lander:LunarLander', …)


None
Observation space: Box([ -2.5        -2.5       -10.        -10.         -6.2831855 -10.
  -0.         -0.       ], [ 2.5        2.5       10.        10.         6.2831855 10.
  1.         1.       ], (8,), float32)
Action space: Discrete(4)
Metadata: {'render_modes': ['human', 'rgb_array'], 'render_fps': 50}
Spec: EnvSpec(id='LunarLander-v3', entry_point='gymnasium.envs.box2d.lunar_lander:LunarLander', reward_threshold=200, nondeterministic=False, max_episode_steps=1000, order_enforce=True, disable_env_checker=False, kwargs={'render_mode': 'rgb_array'}, namespace=None, name='LunarLander', version=3, additional_wrappers=(), vector_entry_point=None)


### Visualising rollouts

It is an even better idea to watch a rollout in the environment. Here we record a random agent to see what the environment looks like and how an episode plays out.

In [28]:
import os
import IPython.display as ipd

os.makedirs("logs", exist_ok=True)

# Create a single-environment vectorised env that renders frames as RGB arrays
vec_env = DummyVecEnv([lambda: gym.make("LunarLander-v3", render_mode="rgb_array")])
video_folder = "logs"
video_length = 100 #@param {type:"slider", min: 100, max:1000}

# Record the video starting at the first step
vec_env = VecVideoRecorder(vec_env, video_folder,
                       record_video_trigger=lambda x: x == 0, video_length=video_length,
                       name_prefix="random-agent-LunarLander-v3")

obs = vec_env.reset()
for _ in range(video_length + 1):
    # Sample a random action; the list holds one action per environment
    action = [vec_env.action_space.sample()]
    obs, _, _, _ = vec_env.step(action)
# Closing the recorder saves the video
vec_env.close()

# Render the video; the file name encodes the recorded step range
ipd.Video(f"{video_folder}/random-agent-LunarLander-v3-step-0-to-step-{video_length}.mp4", embed=True)

Saving video to /content/logs/random-agent-LunarLander-v3-step-0-to-step-100.mp4
Moviepy - Building video /content/logs/random-agent-LunarLander-v3-step-0-to-step-100.mp4.
Moviepy - Writing video /content/logs/random-agent-LunarLander-v3-step-0-to-step-100.mp4





Moviepy - Done !
Moviepy - video ready /content/logs/random-agent-LunarLander-v3-step-0-to-step-100.mp4


### Training
Unlike many other frameworks, SB3 provides a simple one-line interface for training models. However, you give up a lot of customisability, such as implementing your own buffers or modifying rewards.

In [14]:
# Train the agent and display a progress bar
model.learn(total_timesteps=int(2e5), progress_bar=True) #Takes 12 minutes
# Save the agent
model.save("dqn_lunar")


Output()

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 73       |
|    ep_rew_mean      | -126     |
|    exploration_rate | 0.986    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 712      |
|    time_elapsed     | 0        |
|    total_timesteps  | 292      |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 1.32     |
|    n_updates        | 47       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 76.2     |
|    ep_rew_mean      | -178     |
|    exploration_rate | 0.971    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 725      |
|    time_elapsed     | 0        |
|    total_timesteps  | 610      |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 2.33     |
|    n_updates      
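The `exploration_rate` column in the log above is DQN's epsilon, the probability of taking a random action, which is annealed as training progresses. The sketch below assumes a linear schedule with SB3's default DQN settings as I understand them (start at 1.0, end at 0.05, annealed over the first 10% of `total_timesteps`). `exploration_rate` is a hypothetical helper, not an SB3 function.

```python
def exploration_rate(
    timestep: int,
    total_timesteps: int = 200_000,
    initial_eps: float = 1.0,  # assumed default starting epsilon
    final_eps: float = 0.05,   # assumed default final epsilon
    fraction: float = 0.1,     # assumed share of training spent annealing
) -> float:
    """Linearly anneal epsilon from initial_eps to final_eps, then hold it constant."""
    progress = timestep / total_timesteps
    if progress >= fraction:
        return final_eps
    return initial_eps + progress * (final_eps - initial_eps) / fraction


for t in (292, 610, 20_000, 200_000):
    print(f"timestep {t:>7}: epsilon = {exploration_rate(t):.3f}")
```

Under these assumptions the sketch gives 0.986 at timestep 292 and 0.971 at timestep 610, which matches the two log entries shown.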

### Evaluating the agent

In [38]:
del model  # delete trained model to demonstrate loading

# Load the trained agent
# NOTE: if you have loading issue, you can pass `print_system_info=True`
# to compare the system on which the model was trained vs the current one
# model = DQN.load("dqn_lunar", env=env, print_system_info=True)

model = DQN.load("dqn_lunar", env=env)

# Evaluate the agent
# NOTE: If you use wrappers with your environment that modify rewards,
#       this will be reflected here. To evaluate with original rewards,
#       wrap environment in a "Monitor" wrapper before other wrappers.
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")





Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
mean_reward=-126.03 +/- 25.5738366973129
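`evaluate_policy` summarises the per-episode returns as a mean and a standard deviation. Below is a minimal sketch of that summary using made-up episode returns (not results from this notebook), compared against the `reward_threshold=200` shown in the environment spec earlier. It assumes a population standard deviation, which is what NumPy's `np.std` computes by default.

```python
from statistics import fmean, pstdev

# Hypothetical returns from 10 evaluation episodes (illustrative numbers only)
episode_returns = [-150.0, -120.0, -95.0, -140.0, -110.0, -160.0, -100.0, -130.0, -125.0, -135.0]

REWARD_THRESHOLD = 200  # reward_threshold from the LunarLander-v3 spec

mean_return = fmean(episode_returns)
std_return = pstdev(episode_returns)  # population standard deviation

print(f"mean_reward={mean_return:.2f} +/- {std_return:.2f}")
print("solved" if mean_return >= REWARD_THRESHOLD else "not solved yet")
```

A large standard deviation relative to the mean suggests the agent's behaviour still varies a lot between episodes, so more evaluation episodes give a more reliable estimate.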


### Rendering actions taken by the model

The mean reward is a useful gauge of how good your model is, but it is always better to also watch how the model behaves in the environment directly.

In [None]:
# Watch the trained agent
vec_env = model.get_env()
vec_env = VecVideoRecorder(vec_env, video_folder,
                       record_video_trigger=lambda x: x == 0, video_length=video_length,
                       name_prefix="DQN-LunarLander")
obs = vec_env.reset()
for i in range(1000):
    # Pick the greedy action and step the environment
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = vec_env.step(action)

# Closing the recorder saves the video
vec_env.close()

In [31]:
### Render Video
ipd.Video(f"{video_folder}/DQN-LunarLander-step-0-to-step-{video_length}.mp4", embed=True)

### Multiprocessing

If you actually trained the model above, you will have noticed that 200k timesteps on a single environment takes roughly 10 to 12 minutes.

That is slow, and we are not using all of the machine's resources either (spare CPU cores sit idle), so we are not getting our money's worth.

We can spawn multiple Lunar Lander environments and collect experience from them in parallel to speed things up.

In [43]:
import gymnasium as gym

from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.utils import set_random_seed

def make_env(env_id: str, rank: int, seed: int = 0):
    """
    Utility function for multiprocessed env.

    :param env_id: the environment ID
    :param rank: index of the subprocess
    :param seed: the initial seed for RNG
    :return: a function that creates and seeds one environment
    """
    def _init():
        env = gym.make(env_id, render_mode="rgb_array")
        # Give each subprocess its own seed so the environments differ
        env.reset(seed=seed + rank)
        return env
    set_random_seed(seed)
    return _init

env_id = "LunarLander-v3"
num_cpu = 4  # Number of processes to use
# Create the vectorized environment
vec_env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])

# Stable Baselines provides a make_vec_env() helper
# that does exactly the previous steps for you.
# You can choose between `DummyVecEnv` (usually faster) and `SubprocVecEnv`
# vec_env = make_vec_env(env_id, n_envs=num_cpu, seed=0, vec_env_cls=SubprocVecEnv)

model = DQN("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=int(2e5), progress_bar=True)  # Takes 6 minutes!



Output()

Using cpu device
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.976    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 1150     |
|    time_elapsed     | 0        |
|    total_timesteps  | 508      |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 1.66     |
|    n_updates        | 25       |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.958    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 1121     |
|    time_elapsed     | 0        |
|    total_timesteps  | 884      |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 1.23     |
|    n_updates        | 49       |
----------------------------------
----------------------------------
| rollout/            |          |
|  

<stable_baselines3.dqn.dqn.DQN at 0x796fde1614d0>
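The vectorised environment classes take a list of zero-argument callables, which is why `make_env` returns the inner `_init` function instead of an environment. Using a factory also avoids a common Python pitfall: lambdas created in a loop all see the loop variable's final value. The sketch below shows this with plain Python and no environments. `make_seed_fn` is a hypothetical stand-in for `make_env`.

```python
def make_seed_fn(rank: int, seed: int = 0):
    """Return a zero-argument callable that remembers its own rank."""
    def _init() -> int:
        # rank was bound when make_seed_fn was called, so each callable keeps its own value
        return seed + rank
    return _init


num_cpu = 4

# Pitfall: every lambda looks up `i` when it is called, after the loop has finished
late_bound = [lambda: 0 + i for i in range(num_cpu)]
# Factory: each callable captured its own rank at creation time
factory_made = [make_seed_fn(i) for i in range(num_cpu)]

late_bound_seeds = [fn() for fn in late_bound]
factory_seeds = [fn() for fn in factory_made]
print(late_bound_seeds)  # [3, 3, 3, 3]
print(factory_seeds)     # [0, 1, 2, 3]
```

With the factory, each subprocess gets a distinct seed, so the parallel environments do not all replay the same episode.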

### Evaluate your model

In [42]:
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10, deterministic=True)
print(f'mean_reward={mean_reward:.2f} +/- {std_reward:.2f}')




mean_reward=39.90 +/- 89.20


### Visualise Results

In [44]:
vec_env = VecVideoRecorder(vec_env, video_folder,
                       record_video_trigger=lambda x: x == 0, video_length=video_length,
                       name_prefix="DQN-Multi-LunarLander")

obs = vec_env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = vec_env.step(action)
# Closing the recorder saves the video
vec_env.close()
ipd.Video(f"{video_folder}/DQN-Multi-LunarLander-step-0-to-step-{video_length}.mp4", embed=True)

Saving video to /content/logs/DQN-Multi-LunarLander-step-0-to-step-100.mp4
Moviepy - Building video /content/logs/DQN-Multi-LunarLander-step-0-to-step-100.mp4.
Moviepy - Writing video /content/logs/DQN-Multi-LunarLander-step-0-to-step-100.mp4





Moviepy - Done !
Moviepy - video ready /content/logs/DQN-Multi-LunarLander-step-0-to-step-100.mp4
