
In this exercise we will use Stable-Baselines3 (SB3) to train agents on a few Gymnasium environments.

First, let's install the dependencies:

In [None]:
!apt-get update
!apt-get install ffmpeg freeglut3-dev xvfb  # For visualization
# !pip install 'shimmy>=0.2.1'
!pip install stable-baselines3


In [None]:
!pip install -q swig  # build dependency of the Box2D bindings, so install it first
!pip3 install Box2D
!pip install -q 'gymnasium[box2d]'
!pip install pyglet==1.5.27

In [None]:
import gymnasium as gym
import numpy as np
import stable_baselines3
import tensorflow as tf
import time

In [None]:
stable_baselines3.__version__

## 1 - Exercise Train CartPole with PPO (10 mins)

The following code constructs a CartPole-v1 environment and a PPO agent:

In [None]:
from stable_baselines3 import PPO

env = gym.make('CartPole-v1')
env.reset()
model = PPO('MlpPolicy', env, verbose=1)

Let's evaluate the untrained agent. Its policy network is randomly initialized, so it should perform about as well as a random agent.

In [None]:
from stable_baselines3.common.evaluation import evaluate_policy

# Random Agent, before training
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")
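For intuition about the two numbers printed above: *evaluate_policy* runs the requested episodes and reports the mean and the standard deviation of the per-episode returns. The sketch below reproduces that summary in plain Python. The return values are made up for illustration, and *summarize_returns* is a hypothetical helper, not an SB3 function.

```python
import statistics


def summarize_returns(episode_returns):
    """Return (mean, population standard deviation) of per-episode returns."""
    mean_return = statistics.fmean(episode_returns)
    std_return = statistics.pstdev(episode_returns)
    return mean_return, std_return


# Made-up returns of four evaluation episodes (illustrative values only).
toy_returns = [20.0, 22.0, 18.0, 24.0]
toy_mean, toy_std = summarize_returns(toy_returns)
print(f"mean_reward:{toy_mean:.2f} +/- {toy_std:.2f}")  # mean_reward:21.00 +/- 2.24
```

A large standard deviation relative to the mean suggests that more evaluation episodes are needed before two agents can be compared reliably.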

Now train the agent and evaluate its performance again.

Use *model.learn(total_timesteps)*

## 2 - Show results in Tensorboard (5 mins)

Let's view the training progress in TensorBoard:

In [None]:
model = PPO('MlpPolicy', env, tensorboard_log="./tb/")
model.learn(total_timesteps=10000, tb_log_name='tb1')

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./tb

## 3 - Exercise Train with Vectorized Envs (10 mins)

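Stable-Baselines3 lets us create several copies of an environment and step them together as one vectorized environment. By default *make_vec_env* wraps the copies in a *DummyVecEnv*, which steps them sequentially in a single process; pass *vec_env_cls=SubprocVecEnv* to run each copy in its own process: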

In [None]:
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env('CartPole-v1', n_envs=8)

In [None]:
# Resetting a vectorized environment returns one observation per environment
observations = env.reset()
print(observations.shape)

In [None]:
# A step needs one action per environment
observations, rewards, dones, info = env.step([0] * 8)

# Every return value holds one entry per environment:
print(rewards)
print(dones)
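Conceptually, a vectorized environment loops over its copies, batches the results, and immediately resets any copy whose episode has just ended. The following is a toy sketch of that idea in plain Python, not SB3's implementation. *CountdownEnv* and *ToyVecEnv* are hypothetical names invented for this illustration.

```python
class CountdownEnv:
    """Hypothetical toy environment whose episodes last exactly `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the observation is simply the step counter

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}


class ToyVecEnv:
    """Steps several environments one after another and batches the results."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        observations, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done, info = env.step(action)
            if done:
                # A finished environment restarts right away, so the returned
                # observation already belongs to the next episode.
                obs = env.reset()
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
            infos.append(info)
        return observations, rewards, dones, infos


toy_vec_env = ToyVecEnv([CountdownEnv(horizon=h) for h in (1, 2, 3)])
toy_vec_env.reset()
toy_obs, toy_rewards, toy_dones, _ = toy_vec_env.step([0, 0, 0])
print(toy_obs, toy_rewards, toy_dones)  # [0, 1, 1] [1.0, 1.0, 1.0] [True, False, False]
```

Because of the automatic reset, there is no need to call *reset* again during training: a *True* entry in *dones* only marks where one episode ended and the next began.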

Vectorized environments let the agent collect experience from several environment copies at every step.
Train a PPO agent again with the vectorized environment, then compare the mean reward and the training duration with the single-environment run.
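To compare training duration, it can help to wrap each training call in a small timer. The sketch below uses only the standard library. *timed* is a hypothetical helper name, and the *sum* call merely stands in for a real training call.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_seconds = time.perf_counter() - start
    return result, elapsed_seconds


# Stand-in workload; in the notebook this would be a training call.
total, seconds = timed(sum, [1, 2, 3])
print(f"result={total}, took {seconds:.6f} s")
```

In the notebook the same helper could wrap training, for example *timed(model.learn, total_timesteps=10000)*, once for the single environment and once for the vectorized one.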


## 4 - Custom Policy (10 mins)
Let's define a custom policy model for the A2C Algorithm:


In [None]:
from stable_baselines3 import A2C
import torch

# In Stable-Baselines3 the architecture is passed through policy_kwargs:
policy_kwargs = dict(activation_fn=torch.nn.ReLU, net_arch=dict(vf=[16, 8], pi=[24]))
model1 = A2C('MlpPolicy', env, policy_kwargs=policy_kwargs)

In [None]:
# pi = policy layers, vf = value layers (each dense layer has a weight matrix and a bias vector)
for name, tensor in model1.get_parameters()['policy'].items():
  print(name, tensor.shape)
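As a sanity check on the printed shapes, the expected layer sizes can be worked out by hand. The sketch below assumes CartPole-v1's 4-dimensional observation and 2 discrete actions, and assumes that SB3 appends one linear output layer to each branch (action logits and a scalar value). *dense_params* is a hypothetical helper name.

```python
def dense_params(n_inputs, n_outputs):
    """Number of trainable values in one dense layer: weights plus biases."""
    return n_inputs * n_outputs + n_outputs


OBS_DIM = 4    # CartPole-v1 observation size (assumption stated above)
N_ACTIONS = 2  # CartPole-v1 has two discrete actions (assumption stated above)

layer_sizes = {
    "policy hidden (pi=[24])": dense_params(OBS_DIM, 24),
    "value hidden 1 (vf=[16, 8])": dense_params(OBS_DIM, 16),
    "value hidden 2 (vf=[16, 8])": dense_params(16, 8),
    "action output": dense_params(24, N_ACTIONS),
    "value output": dense_params(8, 1),
}
for layer_name, count in layer_sizes.items():
    print(f"{layer_name}: {count}")

total_params = sum(layer_sizes.values())
print("total:", total_params)  # total: 395
```

Each weight matrix printed by the loop above should have as many entries as inputs times outputs of the corresponding layer, with the bias vector contributing one entry per output.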

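The equivalent Keras code for the hidden layers of such a policy model is shown below (the final action and value output layers that SB3 adds are omitted):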

In [None]:
import tensorflow as tf
obs = tf.keras.layers.Input(shape=(4,))
# common = tf.keras.layers.Dense(16, activation='relu', name='Common')(obs)

value = tf.keras.layers.Dense(16, activation='relu', name='Value1')(obs)
value = tf.keras.layers.Dense(8, activation='relu', name='Value2')(value)

policy = tf.keras.layers.Dense(24, activation='relu', name='Policy')(obs)

policy_model_tf = tf.keras.Model(inputs=obs, outputs=[value, policy])

tf.keras.utils.plot_model(policy_model_tf)

The equivalent PyTorch code for such a model is:

In [None]:
class policy_model(torch.nn.Module):
    def __init__(self):
        super(policy_model, self).__init__()
        # self.common_dense = torch.nn.Linear(4, 16)
        # Both branches read the 4-dimensional observation directly (no shared layer)
        self.value_dense1 = torch.nn.Linear(4, 16)
        self.value_dense2 = torch.nn.Linear(16, 8)
        self.policy_dense = torch.nn.Linear(4, 24)

    def forward(self, obs):
        # common = self.common_dense(obs)
        # common = torch.nn.ReLU()(common)

        value = self.value_dense1(obs)
        value = torch.nn.ReLU()(value)
        value = self.value_dense2(value)
        value = torch.nn.ReLU()(value)

        policy = self.policy_dense(obs)
        policy = torch.nn.ReLU()(policy)
        return value, policy

policy_model_torch=policy_model()

## 5 - Exercise Record Videos while Training (20 mins)

Train a PPO agent on the LunarLander-v2 environment with a smaller policy network than the default (the default *MlpPolicy* uses two hidden layers of 64 units for each of the policy and value branches).

In addition, use a callback during training that records a demo video of the current agent playing in the environment, several times over the course of training.

Inside the callback, *self.num_timesteps* tells you how many environment steps have been taken so far.
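One subtlety: with a vectorized environment the step counter advances by the number of environments on every call, so an exact test such as "counter is a multiple of 1000" can be skipped over. A threshold-based check avoids that. The sketch below shows only the scheduling logic in plain Python. *RecordSchedule* is a hypothetical helper, not part of SB3.

```python
class RecordSchedule:
    """Fires once each time the step counter passes another multiple of `every`."""

    def __init__(self, every):
        self.every = every
        self.next_at = every

    def due(self, num_timesteps):
        if num_timesteps < self.next_at:
            return False
        # Move the threshold to the next multiple of `every` beyond the counter.
        self.next_at = (num_timesteps // self.every + 1) * self.every
        return True


schedule = RecordSchedule(every=1000)
# Simulate a counter that grows by 6 per call, as it would with 6 environments.
fired_at = [t for t in range(6, 5001, 6) if schedule.due(t)]
print(fired_at)  # [1002, 2004, 3000, 4002]
```

Inside a callback, the *_on_step* method could call *schedule.due(self.num_timesteps)* and record a video whenever it returns *True*.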

Use the following code to generate a video:

In [None]:
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

import base64
from pathlib import Path
from IPython import display as ipythondisplay

In [None]:
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv

def record_video(env_id, model, video_length=500):
  # VecVideoRecorder needs frames as arrays, so create the environment with render_mode="rgb_array"
  eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])
  eval_env = VecVideoRecorder(eval_env, video_folder='videos/',
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix='Video_File_name')

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # Closing the recorder finalizes and saves the video file
  eval_env.close()

Finally, after training ends, you can use the following code to watch the saved videos:

In [None]:
video_path = ''  # folder that holds the videos, e.g. 'videos/'
prefix = ''  # prefix of the file names, e.g. 'Video_File_name'

html = []
for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
    video_b64 = base64.b64encode(mp4.read_bytes())
    html.append('''<video alt="{}" autoplay
                  loop controls style="height: 400px;">
                  <source src="data:video/mp4;base64,{}" type="video/mp4" />
              </video>'''.format(mp4, video_b64.decode('ascii')))
ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))