# COMP47590 Advanced Machine Learning

## Assignment 2: Going the Distance
This assignment uses the PPO actor-critic method to train a neural network to control a car in the CarRacing environment from OpenAI Gym (https://gym.openai.com/envs/RacingCar-v0/).

![Racing](racing_car.gif)

There are five discrete **actions** in this environment:
- left (0)
- right (1)
- brake (2)
- accelerate (3)
- none (4)

A **reward** of -0.1 is awarded every frame and +1000/N for every track tile visited, where N is the total number of tiles in the track. For example, if you finish in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points.

The **state** is represented using a single image frame (96 x 96).
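The reward rule above can be checked with a few lines of arithmetic. This is a minimal sketch rather than the environment's own code: the function name `episode_reward`, its parameters, and the track length of 300 tiles are assumptions introduced for illustration.

```python
def episode_reward(frames, tiles_visited, total_tiles):
    """Reward under the stated rule: -0.1 per frame, +1000/N per visited tile."""
    return tiles_visited * (1000.0 / total_tiles) - 0.1 * frames

# Visiting every tile in 732 frames reproduces the example in the text.
# The 300-tile track is a hypothetical value; any N gives the same total.
full_lap = episode_reward(frames=732, tiles_visited=300, total_tiles=300)
print(round(full_lap, 1))

# Visiting no tiles for 1000 frames leaves only the per-frame penalty.
idle = episode_reward(frames=1000, tiles_visited=0, total_tiles=300)
print(round(idle, 1))
```

The second call shows why an agent that barely moves ends up with a strongly negative score.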

### Initialisation

If you are using Google Colab, you need to install these packages, so make sure the lines below are uncommented.

In [1]:
!apt install swig cmake ffmpeg
!apt-get install -y xvfb x11-utils
!pip install stable-baselines3[extra] pyglet box2d box2d-kengz
!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate

Reading package lists... Done
Building dependency tree       
Reading state information... Done
swig is already the newest version (3.0.12-1).
cmake is already the newest version (3.10.2-1ubuntu2.18.04.2).
ffmpeg is already the newest version (7:3.4.8-0ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 40 not upgraded.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
x11-utils is already the newest version (7.7+3build1).
xvfb is already the newest version (2:1.19.6-1ubuntu4.10).
0 upgraded, 0 newly installed, 0 to remove and 40 not upgraded.


For Google Colab, make sure this cell is uncommented so that a virtual rendering canvas is created and render calls work (the display itself still won't be visible).

In [2]:
import pyvirtualdisplay
#
_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
                                    size=(1400, 900))
_ = _display.start()

Import required packages. 

In [3]:
import os
import base64
from pathlib import Path

import gym
import imageio
import IPython
import numpy as np # For general numeric operations
import pandas as pd # For data frames and data frame manipulation
import PIL.Image
import pyvirtualdisplay
import torch
import matplotlib.pyplot as plt

import stable_baselines3 as sb3
from stable_baselines3 import PPO, A2C
from stable_baselines3.common.evaluation import evaluate_policy # gives the mean reward and its standard deviation
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv, VecFrameStack, VecVideoRecorder

# Video display
from IPython import display as ipythondisplay

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)

%matplotlib inline 

### Create and Explore the Environment

Create the **CarRacing-v0** environment. Add wrappers to resize the images and convert to greyscale.

In [4]:
env = gym.make('CarRacing-v0')
env = gym.wrappers.resize_observation.ResizeObservation(env, 64)
env = gym.wrappers.gray_scale_observation.GrayScaleObservation(env, keep_dim = True)

Explore the environment - view the action space and observation space.

In [5]:
env.action_space

Box([-1.  0.  0.], [1. 1. 1.], (3,), float32)
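The `Box` printed above is a continuous space with three components, which in CarRacing-v0 are steering in [-1, 1], gas in [0, 1] and brake in [0, 1], rather than the five discrete actions listed in the introduction. One way to relate the two is sketched below. The mapping `DISCRETE_TO_CONTINUOUS`, its full-strength magnitudes and the helper `in_bounds` are assumptions made for illustration, not part of the environment.

```python
# Bounds copied from the Box printed above: low = [-1, 0, 0], high = [1, 1, 1].
LOW = (-1.0, 0.0, 0.0)
HIGH = (1.0, 1.0, 1.0)

# Hypothetical mapping from the five discrete action ids in the introduction
# to (steering, gas, brake) vectors; the magnitudes are assumptions.
DISCRETE_TO_CONTINUOUS = {
    0: (-1.0, 0.0, 0.0),  # left
    1: (1.0, 0.0, 0.0),   # right
    2: (0.0, 0.0, 1.0),   # brake
    3: (0.0, 1.0, 0.0),   # accelerate
    4: (0.0, 0.0, 0.0),   # none
}

def in_bounds(action):
    """Return True if every component lies inside the Box bounds."""
    return all(lo <= a <= hi for a, lo, hi in zip(action, LOW, HIGH))

for action_id, vector in DISCRETE_TO_CONTINUOUS.items():
    print(action_id, vector, in_bounds(vector))
```

Because the agent in this notebook acts directly in the continuous space, no such mapping is needed for training; it only helps when reading the action list against the printed space.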

In [6]:
env.observation_space

Box([[[0]
  [0]
  [0]
  ...
  [0]
  [0]
  [0]]

 [[0]
  [0]
  [0]
  ...
  [0]
  [0]
  [0]]

 [[0]
  [0]
  [0]
  ...
  [0]
  [0]
  [0]]

 ...

 [[0]
  [0]
  [0]
  ...
  [0]
  [0]
  [0]]

 [[0]
  [0]
  [0]
  ...
  [0]
  [0]
  [0]]

 [[0]
  [0]
  [0]
  ...
  [0]
  [0]
  [0]]], [[[255]
  [255]
  [255]
  ...
  [255]
  [255]
  [255]]

 [[255]
  [255]
  [255]
  ...
  [255]
  [255]
  [255]]

 [[255]
  [255]
  [255]
  ...
  [255]
  [255]
  [255]]

 ...

 [[255]
  [255]
  [255]
  ...
  [255]
  [255]
  [255]]

 [[255]
  [255]
  [255]
  ...
  [255]
  [255]
  [255]]

 [[255]
  [255]
  [255]
  ...
  [255]
  [255]
  [255]]], (64, 64, 1), uint8)

Play a few episodes of the environment using random actions.

In [7]:
episodes = 5
for episode in range(0, episodes + 1):  # runs episodes 0..5, i.e. six episodes in total
    state = env.reset()  # initial observation (an image array)
    done = False
    score = 0

    while not done:
        # Uncomment to render each frame inline:
        # screen = env.render(mode='rgb_array')
        # plt.imshow(screen)
        # ipythondisplay.clear_output(wait=True)
        # ipythondisplay.display(plt.gcf())
        action = env.action_space.sample()  # sample a random action
        n_state, reward, done, info = env.step(action)  # next observation, reward, whether the episode is done, extra info
        score += reward
    print("Episode:{},Score:{}".format(episode, score))
env.close()
    


Track generation: 1137..1425 -> 288-tiles track
Episode:0,Score:-30.313588850174582
Track generation: 1244..1559 -> 315-tiles track
Episode:1,Score:-36.30573248407693
Track generation: 1143..1433 -> 290-tiles track
Episode:2,Score:-30.795847750865516
Track generation: 1132..1419 -> 287-tiles track
Episode:3,Score:-33.56643356643401
Track generation: 1181..1480 -> 299-tiles track
Episode:4,Score:-32.88590604026898
Track generation: 963..1215 -> 252-tiles track
Episode:5,Score:-20.31872509960172


In [8]:
# Display video recording of the rendered event
def show_videos(video_path='', prefix='ppo-carracing'):
    """Embed every MP4 in video_path whose file name starts with prefix."""
    html = []
    for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

### Single Image Agent
Create an agent that controls the car using a single image frame as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False},
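To see what `n_steps`, `batch_size` and `n_epochs` imply together, the arithmetic below works out how much data each PPO update sees. It is a rough sketch assuming a single environment (as in this notebook) and the suggested budget of 500,000 timesteps; the variable names are chosen for illustration.

```python
import math

n_steps = 512        # transitions collected per environment per rollout
n_envs = 1           # assumption: a single environment, as in this notebook
batch_size = 128
n_epochs = 20
total_timesteps = 500_000  # the suggested training budget

rollout_size = n_steps * n_envs
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_rollout = minibatches_per_epoch * n_epochs
rollouts_needed = math.ceil(total_timesteps / rollout_size)

print(rollout_size, minibatches_per_epoch, gradient_steps_per_rollout, rollouts_needed)
```

Each rollout of 512 transitions is therefore reused for 80 minibatch updates, which is a fairly aggressive amount of reuse and one reason to keep the learning rate small.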

We also recommend enabling **tensorboard** monitoring of the training process.

In [9]:
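env_name = 'CarRacing-v0'

# Build the environment inside the factory so the lambda does not depend on the outer `env` name.
# Note: the raw 96 x 96 RGB frames are used here (no resize or greyscale wrappers).
env = DummyVecEnv([lambda: gym.make(env_name)])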

In [10]:
log_path=os.path.join('RL_Training','Logs')
model = PPO('CnnPolicy', env, verbose=1, tensorboard_log=log_path,
            learning_rate=3e-5, n_steps=512, ent_coef=0.001, batch_size=128,
            gae_lambda=0.9, n_epochs=20, use_sde=True, sde_sample_freq=4,
            clip_range=0.4,
            policy_kwargs={'log_std_init': -2, 'ortho_init': False})


Using cpu device
Wrapping the env in a VecTransposeImage.


Examine the actor and critic network architectures.

In [11]:
print(model.policy)

ActorCriticCnnPolicy(
  (features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(3, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=4096, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (mlp_extractor): MlpExtractor(
    (shared_net): Sequential()
    (policy_net): Sequential()
    (value_net): Sequential()
  )
  (action_net): Linear(in_features=512, out_features=3, bias=True)
  (value_net): Linear(in_features=512, out_features=1, bias=True)
)
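The `in_features=4096` of the linear layer follows from the three convolution layers printed above. The sketch below applies the standard output-size formula for an unpadded convolution; the helper names `conv_out` and `nature_cnn_features` are introduced here for illustration.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of an unpadded convolution."""
    return (size - kernel) // stride + 1

def nature_cnn_features(side):
    """Flattened feature count after the three conv layers printed above."""
    for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
        side = conv_out(side, kernel, stride)
    return side * side * 64  # 64 output channels in the last conv layer

print(nature_cnn_features(96))
print(nature_cnn_features(64))
```

A 96 x 96 input gives 4096 features, whereas a 64 x 64 input would give 1024, so the printed architecture indicates that the network received 96 x 96 frames.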


Create an evaluation callback that is called at regular intervals and renders the episode.

In [12]:
eval_env = gym.make('CarRacing-v0')
eval_callback = sb3.common.callbacks.EvalCallback(eval_env, 
                                                  best_model_save_path='./log_tb_carracing_PPO/',
                                                  log_path='./log_tb_carracing_PPO/', 
                                                  eval_freq=5000,
                                                  render=True)

Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [13]:
model.learn(total_timesteps=10000,
           callback=eval_callback,
            tb_log_name="Basic PPO Network - single Agent")

Track generation: 1030..1292 -> 262-tiles track
Logging to RL_Training/Logs/Basic PPO Network - single Agent_4




----------------------------
| time/              |     |
|    fps             | 70  |
|    iterations      | 1   |
|    time_elapsed    | 7   |
|    total_timesteps | 512 |
----------------------------
Track generation: 1051..1328 -> 277-tiles track
----------------------------------------
| time/                   |            |
|    fps                  | 47         |
|    iterations           | 2          |
|    time_elapsed         | 21         |
|    total_timesteps      | 1024       |
| train/                  |            |
|    approx_kl            | 0.22158775 |
|    clip_fraction        | 0.423      |
|    clip_range           | 0.4        |
|    entropy_loss         | 4.82       |
|    explained_variance   | -7.05e-05  |
|    learning_rate        | 3e-05      |
|    loss                 | 0.7        |
|    n_updates            | 20         |
|    policy_gradient_loss | 0.00199    |
|    std                  | 0.135      |
|    value_loss           | 1.13       |
-----------



Track generation: 1155..1448 -> 293-tiles track
Track generation: 1139..1428 -> 289-tiles track
Track generation: 1192..1494 -> 302-tiles track
Track generation: 1148..1446 -> 298-tiles track
Track generation: 1068..1342 -> 274-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1227..1538 -> 311-tiles track
Eval num_timesteps=5000, episode_reward=-45.99 +/- 7.27
Episode length: 1000.00 +/- 0.00
----------------------------------------
| eval/                   |            |
|    mean_ep_length       | 1e+03      |
|    mean_reward          | -46        |
| time/                   |            |
|    total_timesteps      | 5000       |
| train/                  |            |
|    approx_kl            | 0.06531533 |
|    clip_fraction        | 0.163      |
|    clip_range           | 0.4        |
|    entropy_loss         | -6.19      |
|    explained_variance   | 0.944      |
|    learning_rate        | 3e-05      |
|    loss 

<stable_baselines3.ppo.ppo.PPO at 0x7f17cbfca150>

Connect to the tensorboard log using **TensorBoard** from the command line to view training progress: 

`tensorboard --logdir RL_Training/Logs`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

Save the trained agent.

In [None]:
save_path_single_agent = os.path.join('RL_Training','Saved Models','PPO_self_driving-singleAgent')
model.save(save_path_single_agent)

For memory management delete old agent and environment (assumes variable names - change if required).

In [15]:
del model
del env
del eval_env

### Create Image Stack Agent

Create the CarRacing-v0 environment using wrappers to resize the images to 64 x 64 and change to greyscale. Also add a wrapper to create a stack of 4 frames. 

In [16]:
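env_name = 'CarRacing-v0'
# make_vec_env creates the environment itself, so a separate gym.make call is not needed.
# Note: frames stay 96 x 96 RGB here (no resize or greyscale wrappers).
env = sb3.common.env_util.make_vec_env(env_name, n_envs=1, seed=0)
env = VecFrameStack(env, n_stack=4)  # stack the 4 most recent frames along the channel axis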


Create an agent that controls the car using a stack of input image frames as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False},
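The `clip_range = 0.4` setting bounds how far the new policy's probability ratio may move away from 1 before the objective stops rewarding the change. The sketch below shows the standard PPO clipped surrogate for a single sample in plain Python. It is an illustration of the idea, not the Stable-Baselines3 implementation, and the function name `clipped_surrogate` is an assumption.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.4):
    """PPO clipped objective for a single sample (to be maximised)."""
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + clip_range.
print(clipped_surrogate(ratio=2.0, advantage=1.0))
# Negative advantage: the more pessimistic, unclipped term is kept.
print(clipped_surrogate(ratio=2.0, advantage=-1.0))
```

A clip range of 0.4 is wider than the commonly used 0.2, so individual updates are allowed to change the policy more.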

We also recommend enabling **tensorboard** monitoring of the training process.

In [17]:
log_path=os.path.join('RL_Training','Logs')

model = PPO('CnnPolicy', env, verbose=1, tensorboard_log=log_path,
            learning_rate=3e-5, n_steps=512, ent_coef=0.001, batch_size=128,
            gae_lambda=0.9, n_epochs=20, use_sde=True, sde_sample_freq=4,
            clip_range=0.4,
            policy_kwargs={'log_std_init': -2, 'ortho_init': False})


Using cpu device
Wrapping the env in a VecTransposeImage.


Examine the actor and critic network architectures.

In [18]:
print(model.policy)

ActorCriticCnnPolicy(
  (features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(12, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=4096, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (mlp_extractor): MlpExtractor(
    (shared_net): Sequential()
    (policy_net): Sequential()
    (value_net): Sequential()
  )
  (action_net): Linear(in_features=512, out_features=3, bias=True)
  (value_net): Linear(in_features=512, out_features=1, bias=True)
)


Create an evaluation callback that is called at regular intervals and renders the episode.

In [19]:
from stable_baselines3.common.callbacks import StopTrainingOnRewardThreshold

# The evaluation environment is wrapped the same way as the training environment
eval_env = sb3.common.env_util.make_vec_env('CarRacing-v0', n_envs=1, seed=0)
eval_env = sb3.common.vec_env.VecFrameStack(eval_env, n_stack=4)

# Stop training once the mean evaluation reward reaches the threshold
stop_call = StopTrainingOnRewardThreshold(reward_threshold=190, verbose=1)
eval_callback = sb3.common.callbacks.EvalCallback(eval_env, 
                                                  callback_on_new_best=stop_call,
                                                  best_model_save_path='./log_tb_carracing_PPO/',
                                                  log_path='./log_tb_carracing_PPO/', 
                                                  eval_freq=50000,
                                                  render=True)

In [20]:
eval_env.observation_space.shape

(96, 96, 12)
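The shape `(96, 96, 12)` comes from joining four RGB frames along the channel axis (4 x 3 = 12). The sketch below is a simplified stand-in for that idea using NumPy and a `deque`, not the `VecFrameStack` implementation itself. The class name `SimpleFrameStack` and the choice to start from all-zero frames are assumptions.

```python
from collections import deque
import numpy as np

class SimpleFrameStack:
    """Keep the last n frames and expose them as one channel-stacked array."""

    def __init__(self, n_stack, frame_shape):
        # Assumption: the history starts as all-zero frames.
        self.frames = deque(
            [np.zeros(frame_shape, dtype=np.uint8) for _ in range(n_stack)],
            maxlen=n_stack,
        )

    def push(self, frame):
        self.frames.append(frame)  # the oldest frame drops out automatically
        return np.concatenate(list(self.frames), axis=-1)

stack = SimpleFrameStack(n_stack=4, frame_shape=(96, 96, 3))
obs = stack.push(np.full((96, 96, 3), 255, dtype=np.uint8))
print(obs.shape)
```

After one push, the newest frame occupies the last three channels and the older slots are still empty; after four pushes the observation holds four consecutive frames, which is what lets the network see motion.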

Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [21]:
model.learn(total_timesteps=100000, 
            callback=eval_callback,
            tb_log_name="Basic PPO Network")


Track generation: 1143..1442 -> 299-tiles track
Logging to RL_Training/Logs/Basic PPO Network_14




----------------------------
| time/              |     |
|    fps             | 94  |
|    iterations      | 1   |
|    time_elapsed    | 5   |
|    total_timesteps | 512 |
----------------------------
Track generation: 1087..1369 -> 282-tiles track
---------------------------------------
| rollout/                |           |
|    ep_len_mean          | 1e+03     |
|    ep_rew_mean          | -43       |
| time/                   |           |
|    fps                  | 33        |
|    iterations           | 2         |
|    time_elapsed         | 30        |
|    total_timesteps      | 1024      |
| train/                  |           |
|    approx_kl            | 2.5118575 |
|    clip_fraction        | 0.55      |
|    clip_range           | 0.4       |
|    entropy_loss         | 3.52      |
|    explained_variance   | 0.000121  |
|    learning_rate        | 3e-05     |
|    loss                 | 0.505     |
|    n_updates            | 20        |
|    policy_gradient_loss | 0

<stable_baselines3.ppo.ppo.PPO at 0x7f173e213990>

Connect to the tensorboard log using **TensorBoard** from the command line to view training progress: 

`tensorboard --logdir RL_Training/Logs`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

Save the trained agent.

In [22]:
save_path_img_stack = os.path.join('RL_Training','Saved Models','PPO_self_driving-Image stack')
model.save(save_path_img_stack)

For memory management delete old agent and environment (assumes variable names - change if required).

In [23]:
del model
del env
del eval_env

### Evaluation

Load the single image saved agent

In [24]:
model =PPO.load(save_path_single_agent)

Setup the single image environment for evaluation.

In [25]:
env_name = 'CarRacing-v0'

# Build the environment inside the factory so the lambda does not depend on the outer `env` name
env = DummyVecEnv([lambda: gym.make(env_name)])

Evaluate the agent in the environment for 30 episodes, rendering the process. 

In [26]:
evaluate_policy(model,env,n_eval_episodes=30,render=True)



Track generation: 1172..1469 -> 297-tiles track
Track generation: 1262..1590 -> 328-tiles track
Track generation: 1134..1420 -> 286-tiles track
Track generation: 984..1234 -> 250-tiles track
Track generation: 1033..1297 -> 264-tiles track
Track generation: 1148..1439 -> 291-tiles track
Track generation: 1112..1394 -> 282-tiles track
Track generation: 1133..1420 -> 287-tiles track
Track generation: 1325..1660 -> 335-tiles track
Track generation: 1047..1313 -> 266-tiles track
Track generation: 1184..1484 -> 300-tiles track
Track generation: 1011..1268 -> 257-tiles track
Track generation: 1346..1689 -> 343-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1281..1605 -> 324-tiles track
Track generation: 1036..1303 -> 267-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1220..1529 -> 309-tiles track
Track generation: 1099..1385 -> 286-tiles track
Track generation: 1069..1

(-86.07127403368553, 1.0186510382780782)
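The tuple returned by `evaluate_policy` holds the mean episode reward followed by its standard deviation. The sketch below shows how such a pair is formed from a list of episode rewards, assuming a population standard deviation. The reward values are made up for illustration and are not results from this notebook.

```python
import statistics

# Made-up episode rewards for illustration only; not results from this notebook.
episode_rewards = [-85.0, -87.0, -86.0, -86.0]

mean_reward = statistics.mean(episode_rewards)
std_reward = statistics.pstdev(episode_rewards)  # population standard deviation
print(mean_reward, round(std_reward, 3))
```

A small standard deviation, as in the result above, means the episodes scored almost the same, which here indicates consistently poor rather than consistently good driving.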

The cell below uses a video recorder to capture the rendered episodes.

In [27]:
eval_env = VecVideoRecorder(env, video_folder='videos/',
                            record_video_trigger=lambda step: step == 0, video_length=5000,
                            name_prefix='ppo-carracing')

obs = eval_env.reset()
for _ in range(5000):
    action, _states = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

eval_env.close()

Track generation: 1212..1519 -> 307-tiles track
Track generation: 1167..1462 -> 295-tiles track
Track generation: 1119..1403 -> 284-tiles track
Track generation: 1101..1381 -> 280-tiles track
Track generation: 1156..1448 -> 292-tiles track
Track generation: 1122..1406 -> 284-tiles track
Saving video to /content/videos/ppo-carracing-step-0-to-step-5000.mp4


For memory management delete the single image agent (assumes variable names - change if required).

In [29]:
del model
del eval_env

Load the image stack agent

In [35]:
model=PPO.load(save_path_img_stack)

Set up the image stack environment

In [36]:
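env_name = 'CarRacing-v0'
# make_vec_env creates the environment itself, so a separate gym.make call is not needed
env = sb3.common.env_util.make_vec_env(env_name, n_envs=1, seed=0)
env = VecFrameStack(env, n_stack=4)  # same 4-frame stack as used for training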


Evaluate the agent in the environment for 30 episodes, rendering the process. 

In [37]:
evaluate_policy(model,env,n_eval_episodes=30,render=True)

Track generation: 1143..1442 -> 299-tiles track
Track generation: 1087..1369 -> 282-tiles track
Track generation: 964..1212 -> 248-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1176..1474 -> 298-tiles track
Track generation: 1283..1608 -> 325-tiles track
Track generation: 1217..1526 -> 309-tiles track
Track generation: 1096..1374 -> 278-tiles track
Track generation: 1198..1501 -> 303-tiles track
Track generation: 1159..1453 -> 294-tiles track
Track generation: 957..1205 -> 248-tiles track
Track generation: 1181..1480 -> 299-tiles track
Track generation: 979..1234 -> 255-tiles track


(-82.8233884, 1.2373295978671313)
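To judge whether the gap between the two evaluations is meaningful, one rough check is to compare the difference in mean reward with its standard error. The sketch below uses the two tuples printed in this notebook and assumes 30 independent episodes per agent; it is a back-of-the-envelope comparison, not a formal significance test.

```python
import math

n_episodes = 30
# (mean, std) pairs copied from the two evaluate_policy outputs in this notebook.
single_mean, single_std = -86.07127403368553, 1.0186510382780782
stack_mean, stack_std = -82.8233884, 1.2373295978671313

difference = stack_mean - single_mean
standard_error = math.sqrt(single_std**2 / n_episodes + stack_std**2 / n_episodes)
print(round(difference, 2), round(standard_error, 2), round(difference / standard_error, 1))
```

A difference that is many standard errors wide suggests the image stack agent's advantage is not just noise, although both scores remain far below the 926.8 points of the worked example in the introduction.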

In [38]:
# Note: this reuses the name_prefix of the single image recording,
# so the earlier video file with the same name is overwritten.
eval_env = VecVideoRecorder(env, video_folder='videos/',
                            record_video_trigger=lambda step: step == 0, video_length=5000,
                            name_prefix='ppo-carracing')

obs = eval_env.reset()
for _ in range(5000):
    action, _states = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

eval_env.close()

Track generation: 1320..1654 -> 334-tiles track
Track generation: 1067..1338 -> 271-tiles track
Track generation: 1067..1338 -> 271-tiles track
Track generation: 1207..1513 -> 306-tiles track
Track generation: 1106..1396 -> 290-tiles track
Track generation: 1296..1628 -> 332-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1047..1319 -> 272-tiles track
Saving video to /content/videos/ppo-carracing-step-0-to-step-5000.mp4


### Reflection

Reflect on which agent performs better at the task, and on the training process involved (max 200 words).

From the videos generated after training the models for 500,000 timesteps, the image stack agent appears to perform better than the single image agent. The likely reason is that frame stacking gives the agent the four most recent frames as one observation (12 channels rather than 3), so the network can infer motion, such as the car's speed and turning direction, which a single frame cannot convey. The single image agent sees only one frame per step and has to act without that temporal context, which explains its lower performance for the same number of timesteps. The 30-episode evaluations above are consistent with this, although the gap is small: a mean reward of about -82.8 for the image stack agent against about -86.1 for the single image agent.

**References**:

- https://github.com/hill-a/stable-baselines/issues/990 for the video recording and demo
- Brightspace materials for the underlying concepts
  