# COMP47590 Advanced Machine Learning

## Assignment 2: Going the Distance
This assignment uses the PPO actor-critic method to train a neural network to control a simple robot in the CarRacing environment from OpenAI Gym (https://gym.openai.com/envs/RacingCar-v0/).

![Racing](racing_car.gif)

There are five discrete **actions** in this environment:
- left (0)
- right (1)
- brake (2)
- accelerate (3)
- none (4)

A **reward** of -0.1 is awarded every frame and +1000/N for every track tile visited, where N is the total number of tiles in the track. For example, if you finish in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points.
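
The reward arithmetic above can be checked with a small sketch. The function name `episode_reward` and its arguments are hypothetical and are not part of the gym API; the 329-tile track length is only an illustrative value.

```python
def episode_reward(frames: int, tiles_visited: int, total_tiles: int) -> float:
    """Total episode reward: +1000/N per visited tile and -0.1 per frame."""
    tile_bonus = 1000.0 * tiles_visited / total_tiles
    frame_penalty = 0.1 * frames
    return tile_bonus - frame_penalty


# Visiting every tile of a 329-tile track in 732 frames
print(round(episode_reward(frames=732, tiles_visited=329, total_tiles=329), 1))  # 926.8
```

Visiting only some of the tiles scales the +1000 bonus down proportionally, while the frame penalty keeps accumulating.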

The **state** is represented using a single image frame (96 x 96 pixels).

### Initialisation

If using Google Colab you need to install packages - uncomment the lines below.

In [None]:
#!apt install swig cmake ffmpeg
#!apt-get install -y xvfb x11-utils
#!pip install stable-baselines3[extra] pyglet box2d box2d-kengz
#!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate

For Google Colab, uncomment this cell to create a virtual rendering canvas so that render calls work (the display still won't be visible).

In [None]:
#import pyvirtualdisplay
#
#_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
#                                    size=(1400, 900))
#_ = _display.start()

Import required packages. 

In [1]:
import torch 
import gym
import stable_baselines3 as sb3

import pandas as pd # For data frames and data frame manipulation
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)
import numpy as np # For general  numeric operations

import matplotlib.pyplot as plt
%matplotlib inline 

  logger.warn("Overriding environment {}".format(id))

### Create and Explore the Environment

Create the **CarRacing-v0** environment. Add wrappers to resize the images and convert to greyscale.

In [13]:
env = gym.make('CarRacing-v0')
env = gym.wrappers.resize_observation.ResizeObservation(env, 64)
env = gym.wrappers.gray_scale_observation.GrayScaleObservation(env, keep_dim = True)

Explore the environment - view the action space and observation space.

In [3]:
env.action_space.shape

(3,)

In [14]:
env.observation_space.shape

(64, 64, 1)

Play an episode of the environment using random actions

In [5]:
# Play one episode, sampling a random action at every step
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    env.render()

Track generation: 1269..1598 -> 329-tiles track


### Single Image Agent
Create an agent that controls the car using a single image frame as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False}
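
To get a feel for what these settings imply, the following sketch works out how many gradient updates PPO performs. It assumes a single environment (`n_envs = 1`) and the 500,000-timestep training run suggested later in this notebook; the variable names are illustrative only.

```python
import math

n_steps = 512              # steps collected per rollout
batch_size = 128           # minibatch size
n_epochs = 20              # passes over each rollout
total_timesteps = 500_000  # assumption: the training length suggested in this notebook
n_envs = 1                 # assumption: a single, non-vectorised environment

rollout_size = n_steps * n_envs
minibatches_per_epoch = rollout_size // batch_size
updates_per_rollout = minibatches_per_epoch * n_epochs
n_rollouts = math.ceil(total_timesteps / rollout_size)

print("minibatches per epoch:", minibatches_per_epoch)   # 4
print("gradient updates per rollout:", updates_per_rollout)  # 80
print("rollouts needed:", n_rollouts)                    # 977
print("total gradient updates:", n_rollouts * updates_per_rollout)  # 78160
```

Each rollout is reused for 20 epochs, so a larger `clip_range` or learning rate is likely to have a stronger effect here than it would with fewer epochs.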

We also recommend enabling **tensorboard** monitoring of the training process.

In [15]:
# Add code here
tb_log = './log_racing_PPO/'
agent = sb3.PPO('MlpPolicy', env,
                    learning_rate = 3e-5,
                    n_steps = 512,
                    ent_coef = 0.001,
                    batch_size = 128,
                    gae_lambda =  0.9,
                    n_epochs = 20,
                    use_sde = True,
                    sde_sample_freq = 4,
                    clip_range = 0.4,
                    policy_kwargs = {'log_std_init': -2, 'ortho_init':False},
                    tensorboard_log=tb_log)

Examine the actor and critic network architectures.

In [16]:
# Printing the policy lists the feature extractor, the actor layers
# (policy_net + action_net) and the critic layers (value_net)
print(agent.policy)

ActorCriticPolicy(
  (features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (mlp_extractor): MlpExtractor(
    (shared_net): Sequential()
    (policy_net): Sequential(
      (0): Linear(in_features=4096, out_features=64, bias=True)
      (1): Tanh()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): Tanh()
    )
    (value_net): Sequential(
      (0): Linear(in_features=4096, out_features=64, bias=True)
      (1): Tanh()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): Tanh()
    )
  )
  (action_net): Linear(in_features=64, out_features=3, bias=True)
  (value_net): Linear(in_features=64, out_features=1, bias=True)
)
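
One way to read the printout above is to count the trainable weights in each branch. The sketch below does this by hand for the `Linear` layers shown; the helper `linear_params` is hypothetical, and any parameters that are not `Linear` layers (and so do not appear in the printout) are not counted.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of weights (plus biases) in one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# The flattened 64 x 64 x 1 greyscale observation gives in_features=4096
obs_size = 64 * 64 * 1

# Actor: policy_net (4096 -> 64 -> 64) followed by action_net (64 -> 3)
actor = linear_params(obs_size, 64) + linear_params(64, 64) + linear_params(64, 3)

# Critic: value_net (4096 -> 64 -> 64) followed by the final value_net (64 -> 1)
critic = linear_params(obs_size, 64) + linear_params(64, 64) + linear_params(64, 1)

print("actor parameters: ", actor)   # 266563
print("critic parameters:", critic)  # 266433
```

Almost all of the weights sit in the first layer of each branch, because every one of the 4096 pixels is connected to each of the 64 hidden units.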


Create an evaluation callback that is called at regular intervals and renders the episode.

In [17]:
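# Use a separate evaluation env with the same wrappers as the training env,
# so that its observation shape matches what the agent expects
eval_env = gym.make('CarRacing-v0')
eval_env = gym.wrappers.resize_observation.ResizeObservation(eval_env, 64)
eval_env = gym.wrappers.gray_scale_observation.GrayScaleObservation(eval_env, keep_dim = True)
eval_callback = sb3.common.callbacks.EvalCallback(eval_env, 
                                                  best_model_save_path='./logs_racing1/',
                                                  log_path=tb_log, 
                                                  eval_freq=5000,
                                                  render=True)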

Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [18]:
# Add code here
agent.learn(total_timesteps=500000,callback=eval_callback)

Track generation: 1147..1441 -> 294-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1116..1399 -> 283-tiles track




Track generation: 1228..1540 -> 312-tiles track
Track generation: 1260..1579 -> 319-tiles track
Track generation: 1127..1420 -> 293-tiles track
Track generation: 1136..1428 -> 292-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1305..1635 -> 330-tiles track
Track generation: 1177..1475 -> 298-tiles track
Track generation: 1097..1381 -> 284-tiles track


ValueError: Error: Unexpected observation shape (1, 96, 96, 3) for Box environment, please use (1, 64, 64) or (n_env, 1, 64, 64) for the observation shape.


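Connect to the tensorboard log using **TensorBoard** from the command line to view training progress:

`tensorboard --logdir  ./log_racing_PPO/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`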

Save the trained agent.

In [None]:
# Add code here


For memory management delete old agent and environment (assumes variable names - change if required).

In [None]:
del agent
del env
del eval_env

### Create Image Stack Agent

Create the CarRacing-v0 environment using wrappers to resize the images to 64 x 64 and change to greyscale. Also add a wrapper to create a stack of 4 frames. 

In [None]:
# Add code here
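
As background for the frame-stack wrapper, here is a minimal sketch of the idea in plain Python: keep the last 4 frames and present them together as the state. The class `FrameStacker` is hypothetical and only illustrates the mechanism; it is not the gym wrapper itself, and strings stand in for image frames.

```python
from collections import deque


class FrameStacker:
    """Keeps the k most recent frames and exposes them as one stacked state."""

    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # At the start of an episode, fill the stack with copies of the first frame
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically
        self.frames.append(new_frame)
        return list(self.frames)


stacker = FrameStacker(k=4)
print(stacker.reset("f0"))  # ['f0', 'f0', 'f0', 'f0']
print(stacker.step("f1"))   # ['f0', 'f0', 'f0', 'f1']
print(stacker.step("f2"))   # ['f0', 'f0', 'f1', 'f2']
```

Stacking consecutive frames lets the agent infer motion, such as the car's speed and turning direction, which cannot be read from a single still image.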


Create an agent that controls the car using a stack of input image frames as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False}
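
The role of `clip_range` can be illustrated with the standard PPO clipped surrogate for a single sample. This is a sketch of the textbook formula rather than the library's internal code; the function name `clipped_surrogate` is hypothetical.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.4) -> float:
    """PPO objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# ratio = new policy probability / old policy probability for the taken action
for ratio in (0.5, 1.0, 1.2, 2.0):
    gain = clipped_surrogate(ratio, advantage=1.0)
    loss = clipped_surrogate(ratio, advantage=-1.0)
    print(f"ratio={ratio}: advantage=+1 -> {gain}, advantage=-1 -> {loss}")
```

With `clip_range = 0.4`, the objective stops rewarding the policy once the ratio moves outside [0.6, 1.4] in the direction the advantage favours, which limits how far one update can push the policy.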

We also recommend enabling **tensorboard** monitoring of the training process.

In [None]:
# Add code here


Examine the actor and critic network architectures.

In [None]:
# Add code here


Create an evaluation callback that is called at regular intervals and renders the episode.

In [None]:
# Add code here


Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [None]:
# Add code here


Connect to the tensorboard log using **TensorBoard** from the command line to view training progress:

`tensorboard --logdir  ./log_racing_PPO/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

Save the trained agent.

In [None]:
# Add code here

For memory management delete old agent and environment (assumes variable names - change if required).

In [None]:
del agent
del env
del eval_env

### Evaluation

Load the saved single image agent.

In [None]:
# Add code here

Set up the single image environment for evaluation.

In [None]:
# Add code here

Evaluate the agent in the environment for 30 episodes, rendering the process. 

In [None]:
# Add code here
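
When comparing agents over 30 episodes, it helps to summarise the episode returns as a mean and a spread. The sketch below shows only that bookkeeping: `evaluate` and `fake_episode` are hypothetical names, and the random returns are a stand-in for actually running the agent in the environment.

```python
import random
import statistics
from typing import Callable, Tuple


def evaluate(run_episode: Callable[[], float], n_episodes: int = 30) -> Tuple[float, float]:
    """Run n_episodes and return the mean and sample standard deviation of the returns."""
    returns = [run_episode() for _ in range(n_episodes)]
    return statistics.mean(returns), statistics.stdev(returns)


# Hypothetical stand-in for one episode: replace with a function that plays
# a full episode with the trained agent and returns the total reward
rng = random.Random(0)
fake_episode = lambda: rng.uniform(-100.0, 900.0)

mean_return, std_return = evaluate(fake_episode, n_episodes=30)
print(f"mean return {mean_return:.1f} +/- {std_return:.1f} over 30 episodes")
```

Reporting the spread alongside the mean matters here because each episode uses a freshly generated track, so returns can vary considerably between episodes.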

For memory management delete the single image agent (assumes variable names - change if required).

In [None]:
del agent
del eval_env

Load the image stack agent

In [None]:
# Add code here 


Set up the image stack environment

In [None]:
# Add code here


Evaluate the agent in the environment for 30 episodes, rendering the process. 

In [None]:
# Add code here


### Reflection

Reflect on which agent performs better at the task, and on the training process involved (max 200 words).