## Going the Distance
Uses the PPO actor-critic method to train a neural network to control a simple robot in the RacingCar environment from OpenAI gym (https://gym.openai.com/envs/RacingCar-v0/). 

![Racing](racing_car.gif)

There are five discrete **actions** in this environment:
- left (0)
- right (1)
- brake (2)
- accelerate (3)
- none (4)

**Reward** of -0.1 is awarded every frame and +1000/N for every track tile visited, where N is the total number of tiles in track. For example, if you have finished in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points.

And the **state** is represented using a single image frame (96 * 96).

### Initialisation

If using Google colab you need to install packages - comment out lines below.

In [1]:
!apt install swig cmake ffmpeg
!apt-get install -y xvfb x11-utils
!pip install stable-baselines3[extra] pyglet box2d box2d-kengz
!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate

Reading package lists... Done
Building dependency tree       
Reading state information... Done
swig is already the newest version (3.0.12-1).
cmake is already the newest version (3.10.2-1ubuntu2.18.04.2).
ffmpeg is already the newest version (7:3.4.8-0ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 40 not upgraded.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
x11-utils is already the newest version (7.7+3build1).
xvfb is already the newest version (2:1.19.6-1ubuntu4.10).
0 upgraded, 0 newly installed, 0 to remove and 40 not upgraded.


For Google colab comment out this cell to make a virtual rendering canvas so render calls work (we still won't see display!)

In [2]:
import pyvirtualdisplay

_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
                                    size=(1400, 900))
_ = _display.start()

Import required packages. 

In [3]:
import torch 
import gym
import stable_baselines3 as sb3
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy

import IPython
from IPython import display as ipythondisplay
import PIL.Image
import pyvirtualdisplay

import pandas as pd # For data frames and data frame manipulation
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)
import numpy as np # For general  numeric operations
import os

import matplotlib.pyplot as plt
%matplotlib inline 

### Create and Explore the Environment

Create the **CarRacing-v0** environment. Add wrappers to resize the images and convert to greyscale.

In [4]:
env = gym.make('CarRacing-v0')
env = gym.wrappers.resize_observation.ResizeObservation(env, 64)
env = gym.wrappers.gray_scale_observation.GrayScaleObservation(env, keep_dim = True)

Explore the environment - view the action space and observation space.

In [5]:
env.action_space

Box([-1.  0.  0.], [1. 1. 1.], (3,), float32)

In [6]:
env.observation_space

Box([[[0]
  [0]
  [0]
  ...
  [0]
  [0]
  [0]]

 [[0]
  [0]
  [0]
  ...
  [0]
  [0]
  [0]]

 [[0]
  [0]
  [0]
  ...
  [0]
  [0]
  [0]]

 ...

 [[0]
  [0]
  [0]
  ...
  [0]
  [0]
  [0]]

 [[0]
  [0]
  [0]
  ...
  [0]
  [0]
  [0]]

 [[0]
  [0]
  [0]
  ...
  [0]
  [0]
  [0]]], [[[255]
  [255]
  [255]
  ...
  [255]
  [255]
  [255]]

 [[255]
  [255]
  [255]
  ...
  [255]
  [255]
  [255]]

 [[255]
  [255]
  [255]
  ...
  [255]
  [255]
  [255]]

 ...

 [[255]
  [255]
  [255]
  ...
  [255]
  [255]
  [255]]

 [[255]
  [255]
  [255]
  ...
  [255]
  [255]
  [255]]

 [[255]
  [255]
  [255]
  ...
  [255]
  [255]
  [255]]], (64, 64, 1), uint8)

Play an episode of the environment using random actions

In [7]:

for episode in range(5):
    score = 0
    done = False
    state = env.reset() # return an initial observation
    
    # Starting the process using random actions
    while not done:
        # These steps have been tested in google colab ( might not work in jupyter notebook)
        # start
        #screen = env.render(mode = 'rgb_array')
        #plt.imshow(screen)
        #ipythondisplay.clear_output(wait = True)
        #ipythondisplay.display(plt.gcf())
        # end
        action = env.action_space.sample()  # Agent choosing an action randomly
        n_state,reward,done,info = env.step(action) # returning observation set, reward and completeness state
        score += reward
        
    print("Episode Number:{}, Score awarded:{}".format(episode,score))
    
env.close()


Track generation: 1177..1476 -> 299-tiles track
Episode Number:0, Score awarded:-32.885906040268935
Track generation: 1209..1516 -> 307-tiles track
Episode Number:1, Score awarded:-34.640522875817524
Track generation: 1091..1368 -> 277-tiles track
Episode Number:2, Score awarded:-27.536231884058182
Track generation: 1487..1863 -> 376-tiles track
Episode Number:3, Score awarded:-46.666666666667446
Track generation: 1019..1285 -> 266-tiles track
Episode Number:4, Score awarded:-24.52830188679274


### Single Image Agent
Create an agent that controls the car using a single image frame as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False},

We also recommend enabling **tensorboard** monitoring of the training process.

In [8]:
env = gym.make('CarRacing-v0')
env = DummyVecEnv([lambda:env])

In [9]:
# Setting the path to record logs
log_path = os.path.join('logs_carracing_PPO','Logs')

# PPO Agent creation using CNN Policy
agentModel = PPO('CnnPolicy', env, verbose = 1, tensorboard_log = log_path, learning_rate = 3e-5, n_steps = 512, 
                 ent_coef = 0.001, batch_size = 128, gae_lambda = 0.9, n_epochs = 20, use_sde = True, 
                 sde_sample_freq = 4, clip_range = 0.4, policy_kwargs = {'log_std_init': -2, 'ortho_init':False})


Using cpu device
Wrapping the env in a VecTransposeImage.


Examine the actor and critic network architectures.

In [10]:
agentModel.policy

ActorCriticCnnPolicy(
  (features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(3, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=4096, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (mlp_extractor): MlpExtractor(
    (shared_net): Sequential()
    (policy_net): Sequential()
    (value_net): Sequential()
  )
  (action_net): Linear(in_features=512, out_features=3, bias=True)
  (value_net): Linear(in_features=512, out_features=1, bias=True)
)

Create an evaluation callback that is called every at regular intervals and renders the episode.

In [11]:
eval_env = gym.make('CarRacing-v0')
eval_callback = sb3.common.callbacks.EvalCallback(eval_env, 
                                                  best_model_save_path = './logs_carracing_PPO/',
                                                  log_path = './logs_carracing_PPO/', 
                                                  eval_freq = 5000,
                                                  render = True)

Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [12]:
agentModel.learn(total_timesteps = 25000, callback = eval_callback, 
                 tb_log_name = "Single Image Agent Network")


Track generation: 1042..1313 -> 271-tiles track
Logging to logs_carracing_PPO/Logs/Single Image Agent Network_1




----------------------------
| time/              |     |
|    fps             | 69  |
|    iterations      | 1   |
|    time_elapsed    | 7   |
|    total_timesteps | 512 |
----------------------------
Track generation: 962..1215 -> 253-tiles track
---------------------------------------
| time/                   |           |
|    fps                  | 48        |
|    iterations           | 2         |
|    time_elapsed         | 21        |
|    total_timesteps      | 1024      |
| train/                  |           |
|    approx_kl            | 0.7076732 |
|    clip_fraction        | 0.47      |
|    clip_range           | 0.4       |
|    entropy_loss         | 2.54      |
|    explained_variance   | 6.18e-05  |
|    learning_rate        | 3e-05     |
|    loss                 | 0.603     |
|    n_updates            | 20        |
|    policy_gradient_loss | 0.00843   |
|    std                  | 0.135     |
|    value_loss           | 1.03      |
------------------------------



Track generation: 1195..1498 -> 303-tiles track
Track generation: 1081..1355 -> 274-tiles track
Track generation: 1187..1488 -> 301-tiles track
Track generation: 1020..1284 -> 264-tiles track
Track generation: 1129..1415 -> 286-tiles track
Eval num_timesteps=5000, episode_reward=-82.67 +/- 1.13
Episode length: 1000.00 +/- 0.00
-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 1e+03       |
|    mean_reward          | -82.7       |
| time/                   |             |
|    total_timesteps      | 5000        |
| train/                  |             |
|    approx_kl            | 0.040646125 |
|    clip_fraction        | 0.0514      |
|    clip_range           | 0.4         |
|    entropy_loss         | -6.69       |
|    explained_variance   | 0.526       |
|    learning_rate        | 3e-05       |
|    loss                 | -0.0538     |
|    n_updates            | 180         |
|    policy_gradient_loss | -0.0389     |

<stable_baselines3.ppo.ppo.PPO at 0x7f92b8329b10>

Connect to the tensorboard log using **TensorBoard** from the command line to view training progress: 

`tensorboard --logdir ./logs_carracing_PPO/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

Save the trained agent.

In [13]:
# Path to save single agent model
single_agent_path = os.path.join('logs_carracing_PPO','Saved_Models','singleAgentPPO_model')
agentModel.save(single_agent_path)




For memory management delete old agent and environment (assumes variable names - change if required).

In [14]:
del agentModel
del env
del eval_env

### Create Image Stack Agent

Create the CarRacing-v0 environment using wrappers to resize the images to 64 x 64 and change to greyscale. Also add a wrapper to create a stack of 4 frames. 

In [15]:
env = gym.make('CarRacing-v0')
#env = gym.wrappers.resize_observation.ResizeObservation(env, 64)
#env = gym.wrappers.gray_scale_observation.GrayScaleObservation(env, keep_dim = True)

# Creating Stack of four frames
#env = DummyVecEnv([lambda:env])
env = sb3.common.env_util.make_vec_env('CarRacing-v0',n_envs=1,seed=0)
env = VecFrameStack(env, n_stack=4)


Create an agent that controls the car using a stack of input image frames as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False},

We also recommend enabling **tensorboard** monitoring of the training process.

In [16]:
# Setting the path to record logs
log_path = os.path.join('log_tb_carracing_PPO','Logs')

# PPO Agent creation using CNN Policy
agentModel = PPO('CnnPolicy', env, verbose=1, tensorboard_log = log_path, learning_rate = 3e-5, n_steps = 512, 
          ent_coef = 0.001, batch_size = 128, gae_lambda = 0.9, n_epochs = 20, use_sde = True, 
          sde_sample_freq = 4, clip_range = 0.4, policy_kwargs = {'log_std_init': -2, 'ortho_init':False})


Using cpu device
Wrapping the env in a VecTransposeImage.


Examine the actor and critic network architectures.

In [17]:
agentModel.policy

ActorCriticCnnPolicy(
  (features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(12, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=4096, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (mlp_extractor): MlpExtractor(
    (shared_net): Sequential()
    (policy_net): Sequential()
    (value_net): Sequential()
  )
  (action_net): Linear(in_features=512, out_features=3, bias=True)
  (value_net): Linear(in_features=512, out_features=1, bias=True)
)

Create an evaluation callback that is called every at regular intervals and renders the episode.

In [18]:
#eval_env = gym.make('CarRacing-v0')
stop_callback = sb3.common.callbacks.StopTrainingOnRewardThreshold(reward_threshold=190, verbose=1)
eval_callback = sb3.common.callbacks.EvalCallback(env, 
                                                  callback_on_new_best=stop_callback,
                                                  best_model_save_path='./log_tb_carracing_PPO/',
                                                  log_path='./log_tb_carracing_PPO/', 
                                                  eval_freq=10000,
                                                  render=True)

Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [19]:
agentModel.learn(total_timesteps = 25000, callback = eval_callback, 
                 tb_log_name = "Image Stack Agent Network")


Track generation: 1143..1442 -> 299-tiles track
Logging to log_tb_carracing_PPO/Logs/Image Stack Agent Network_1




----------------------------
| time/              |     |
|    fps             | 98  |
|    iterations      | 1   |
|    time_elapsed    | 5   |
|    total_timesteps | 512 |
----------------------------
Track generation: 1087..1369 -> 282-tiles track
---------------------------------------
| rollout/                |           |
|    ep_len_mean          | 1e+03     |
|    ep_rew_mean          | -59.7     |
| time/                   |           |
|    fps                  | 33        |
|    iterations           | 2         |
|    time_elapsed         | 30        |
|    total_timesteps      | 1024      |
| train/                  |           |
|    approx_kl            | 1.8567095 |
|    clip_fraction        | 0.491     |
|    clip_range           | 0.4       |
|    entropy_loss         | 2.79      |
|    explained_variance   | 0.000721  |
|    learning_rate        | 3e-05     |
|    loss                 | 0.477     |
|    n_updates            | 20        |
|    policy_gradient_loss | 0

<stable_baselines3.ppo.ppo.PPO at 0x7f92398e3990>

Connect to the tensorboard log using **TensorBoard** from the command line to view training progress: 

`tensorboard --logdir ./log_tb_carracing_PPO/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

Save the trained agent.

In [20]:
# Path to save image stack agent model
stackAgent_path = os.path.join('log_tb_carracing_PPO','Saved_Models','stackAgentPPO_model')
agentModel.save(stackAgent_path)




For memory management delete old agent and environment (assumes variable names - change if required).

In [25]:
del agentModel
del env

### Evaluation

Load the single image saved agent

In [26]:
agentModel = PPO.load(single_agent_path)


Setup the single image environment for evaluation.

In [27]:
env = gym.make('CarRacing-v0')
env = DummyVecEnv([lambda:env])

Evaluate the agent in the environment for 30 episodes, rendering the process. 

In [28]:
evaluate_policy(agentModel, env, n_eval_episodes=30, render=True)




Track generation: 1172..1469 -> 297-tiles track
Track generation: 1060..1338 -> 278-tiles track
Track generation: 1118..1403 -> 285-tiles track
Track generation: 1087..1365 -> 278-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1112..1394 -> 282-tiles track
Track generation: 1363..1707 -> 344-tiles track
Track generation: 1184..1484 -> 300-tiles track
Track generation: 1180..1479 -> 299-tiles track
Track generation: 1138..1427 -> 289-tiles track
Track generation: 1201..1512 -> 311-tiles track
Track generation: 1326..1662 -> 336-tiles track
Track generation: 1043..1306 -> 263-tiles track


(-31.49887722209096, 13.147621395486937)

For memory management delete the single image agent (assumes variable names - change if required).

In [31]:
del agentModel
del env

Load the image stack agent

In [32]:
agentModel = PPO.load(stackAgent_path)


Set up the image stack environment

In [33]:
env = gym.make('CarRacing-v0')
#env = gym.wrappers.resize_observation.ResizeObservation(env, 64)
#env = gym.wrappers.gray_scale_observation.GrayScaleObservation(env, keep_dim = True)

#env = DummyVecEnv([lambda:env])
env = sb3.common.env_util.make_vec_env('CarRacing-v0',n_envs=1,seed=0)
env = VecFrameStack(env, n_stack=4)


Evaluate the agent in the environment for 30 episodes, rendering the process. 

In [34]:
evaluate_policy(agentModel, env, n_eval_episodes=30, render=True)


Track generation: 1143..1442 -> 299-tiles track
Track generation: 1087..1369 -> 282-tiles track
Track generation: 964..1212 -> 248-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1176..1474 -> 298-tiles track
Track generation: 1283..1608 -> 325-tiles track
Track generation: 1217..1526 -> 309-tiles track
Track generation: 1096..1374 -> 278-tiles track
Track generation: 1198..1501 -> 303-tiles track
Track generation: 1159..1453 -> 294-tiles track
Track generation: 957..1205 -> 248-tiles track
Track generation: 1181..1480 -> 299-tiles track
Track generation: 979..1234 -> 255-tiles track


(-17.695962399999996, 31.027389355649106)

### Reflection

Reflect on which  agent performs better at the task, and the training process involved (max 200 words).

I trained both the agents through `PPO` because it is well suited to our environment action space which is of type box while `DQN` cannot be used for box type action space. `PPO` takes less time and give best results because it tries to simplify and directly optimize the policy. Algorithm that I have used here is `CNNPolicy` rather than `mlpPolicy` because it is more suited to image data and works well with spatial relationship data while MLP works better with tabular data.

In terms of the __agent performance__, __Image Stack Agent__ works better than the __Single Image Agent__ as per the rendering screen generated on training the models on 0.5M timesteps because it is building on 4 environments per step and in general it allows for n environments per step. This improves the accuracy and increases it's speed. Not only this, it converts the action space which is being sent to the environment into multidimensional vector. Single Image agent on the other hand works on one environment per step hence the speed and accuracy suffers. 

I used Google Colab to train and evaluate the agents. Since 0.5M timesteps was taking a lot of time to run due to the system configurations, so I am showcasing the results from the 25K timesteps for Single Image and 25K timesteps for Image Stack agents where the evaluation results show that Image stack agents works far better than the single image agent. I evaluated the results for 0.2M and 0.1M timesteps for both the agents where Image Stack was rewarded better points in every case.

For single image agent the reward was -31 with 13 as variance when trained on 25K timesteps but it resulted in 200-300 reward points when trained on 0.1 to 0.2M timesteps.

For Image stack agent, the reward was -17 with 31 as variance when trained on 25K timesteps but it resulted in 300-400 reward points when trained on 0.1 to 0.2M timesteps.

In [5]:
!jupyter nbconvert --to html COMP47590_Assignment_2_Going_The_Distance_d.ipynb

[NbConvertApp] Converting notebook COMP47590_Assignment_2_Going_The_Distance_d.ipynb to html
[NbConvertApp] Writing 731222 bytes to COMP47590_Assignment_2_Going_The_Distance_d.html
