# COMP47590 Advanced Machine Learning

- **Student Name:** 
- **Student Number:** 


- **Student Name:** 
- **Student Number:** 


- **Student Name:** 
- **Student Number:** 

## Assignment 2: Going the Distance
This assignment uses the PPO actor-critic method to train a neural network to control a simple racing car in the CarRacing environment from OpenAI Gym (https://gym.openai.com/envs/RacingCar-v0/).

![Racing](racing_car.gif)

The **action** space can be continuous or discrete. If **continuous**, there are 3 actions:

- 0: steering, -1 is full left, +1 is full right
- 1: gas
- 2: braking

If **discrete** there are 5 actions:
- 0: do nothing
- 1: steer left
- 2: steer right
- 3: gas
- 4: brake

For this assignment we should use the continuous action space. 
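
As a minimal sketch of what the continuous action looks like, the snippet below clamps a (steering, gas, braking) triple to its valid range. The bounds are an assumption based on the description above (steering in [-1, 1], gas and braking in [0, 1]), and `clip_action` is a hypothetical helper, not part of the environment's API.

```python
# Assumed bounds for (steering, gas, braking); see the action list above.
LOW = (-1.0, 0.0, 0.0)
HIGH = (1.0, 1.0, 1.0)


def clip_action(action):
    """Clamp each component of a continuous action to its assumed bounds."""
    return tuple(min(max(a, lo), hi) for a, lo, hi in zip(action, LOW, HIGH))


# Steering is capped at full right, negative gas is raised to zero.
print(clip_action((1.7, -0.2, 0.5)))  # (1.0, 0.0, 0.5)
```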

A **reward** of -0.1 is given every frame and +1000/N for every track tile visited, where N is the total number of tiles in the track. For example, if you finish in 732 frames having visited every tile, your reward is 1000 - 0.1*732 = 926.8 points.
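
The reward arithmetic can be checked with a short sketch. The function name `episode_reward` and the track length of 300 tiles are illustrative assumptions; only the per-frame penalty and the per-tile bonus come from the description above.

```python
def episode_reward(frames, tiles_visited, total_tiles):
    """Total reward: +1000/N per visited tile, -0.1 per frame."""
    tile_bonus = 1000.0 * tiles_visited / total_tiles
    frame_penalty = 0.1 * frames
    return tile_bonus - frame_penalty


# A full lap (all tiles visited) finished in 732 frames.
# total_tiles=300 is an arbitrary illustrative track length.
print(round(episode_reward(732, 300, 300), 1))  # 926.8
```

Visiting only some of the tiles scales the bonus down proportionally, while the frame penalty keeps accumulating.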

The default **observation** is a single RGB image frame (96 x 96 pixels).

### Initialisation

If using Google Colab, you need to install the required packages: uncomment the lines below.

In [2]:
#!apt install swig cmake ffmpeg
#!apt-get install -y xvfb x11-utils
#!pip install stable-baselines3[extra] pyglet box2d box2d-kengz
#!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate

For Google Colab, uncomment this cell to create a virtual rendering canvas so that render calls work (we still won't see a display!).

In [3]:
#import pyvirtualdisplay
#
#_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
#                                    size=(1400, 900))
#_ = _display.start()

Import required packages. 

In [4]:
import torch 
import gymnasium as gym
import stable_baselines3 as sb3

import pandas as pd # For data frames and data frame manipulation
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)
import numpy as np # For general  numeric operations

import matplotlib.pyplot as plt
%matplotlib inline 

### Create and Explore the Environment

Create the **CarRacing-v3** environment. Add wrappers to resize the images to 64 x 64, convert them to greyscale, and limit each episode to 1,500 steps.

In [5]:
env = gym.make('CarRacing-v3', 
               render_mode = 'human')
env = gym.wrappers.ResizeObservation(env, (64,64))
env = gym.wrappers.GrayscaleObservation(env, keep_dim = True)
env = gym.wrappers.TimeLimit(env, 
                                max_episode_steps = 1500)

Explore the environment - view the action space and observation space.

In [6]:
env.action_space

Box([-1.  0.  0.], 1.0, (3,), float32)

In [7]:
env.observation_space

Box(0, 255, (64, 64, 1), uint8)

Play an episode of the environment using random actions.

In [15]:
obs, _ = env.reset()

terminate = False
truncate = False

while not (terminate or truncate):
    
    action = env.action_space.sample()
    obs, reward, terminate, truncate, info = env.step(action)
    
    env.render()


### Single Image Agent
Create an agent that controls the car using a single image frame as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False}
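
To give a feel for what `clip_range = 0.4` does, here is a minimal sketch of PPO's clipped surrogate objective for a single sample. It illustrates the standard formula only; `clipped_surrogate` is a hypothetical helper, not the Stable-Baselines3 implementation, which works on batches of tensors.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.4):
    """PPO clipped objective for one sample.

    ratio is pi_new(a|s) / pi_old(a|s); the objective takes the more
    pessimistic of the unclipped and clipped terms.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at (1 + clip_range).
print(clipped_surrogate(2.0, 1.0))
# A small ratio with a negative advantage is capped at (1 - clip_range).
print(clipped_surrogate(0.2, -1.0))
# Ratios inside the clip interval pass through unchanged.
print(clipped_surrogate(1.1, 1.0))
```

A wider clip range such as 0.4 allows larger policy changes per update than the more common 0.2, which is one reason to watch `approx_kl` and `clip_fraction` in the training log.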

We also recommend enabling **tensorboard** monitoring of the training process.

In [None]:
tb_log = './log_tb_carracing_PPO/'
agent = sb3.PPO('CnnPolicy', 
                env, 
                verbose=1,
                learning_rate=3e-5,
                n_steps = 512,
                ent_coef=0.001,
                batch_size = 128,
                gae_lambda = 0.9,
                n_epochs = 20,
                use_sde= True,
                sde_sample_freq = 4,
                clip_range = 0.4,
                policy_kwargs = {'log_std_init': -2, 'ortho_init':False},
                tensorboard_log= tb_log
                )



Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.


Examine the actor and critic network architectures.

In [None]:
# Add code here


Create an evaluation callback that is called at regular intervals and renders the episode.
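
Stable-Baselines3 provides `EvalCallback` in `stable_baselines3.common.callbacks` for this purpose (it accepts arguments such as `eval_freq`, `n_eval_episodes` and `render`). The sketch below is not that class; it is a library-free illustration of the underlying idea, with hypothetical names, showing how a callback can count training steps and trigger an evaluation every `eval_freq` calls.

```python
class PeriodicEvaluator:
    """Call `evaluate_fn` once every `eval_freq` training steps (illustrative only)."""

    def __init__(self, evaluate_fn, eval_freq):
        self.evaluate_fn = evaluate_fn
        self.eval_freq = eval_freq
        self.n_calls = 0
        self.history = []  # (step, score) pairs, one per evaluation

    def on_step(self):
        self.n_calls += 1
        if self.n_calls % self.eval_freq == 0:
            self.history.append((self.n_calls, self.evaluate_fn()))
        return True  # True means "keep training"


# Simulate 12 training steps with an evaluation every 5 steps.
evaluator = PeriodicEvaluator(evaluate_fn=lambda: 0.0, eval_freq=5)
for _ in range(12):
    evaluator.on_step()
print([step for step, _ in evaluator.history])  # [5, 10]
```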

In [None]:
# Add code here


Train the model for a large number of timesteps (500,000 timesteps will probably work well).
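
Before committing to a long run, it can help to estimate its cost. The sketch below assumes a single environment, the recommended `n_steps`, `batch_size` and `n_epochs` values, and a throughput of about 16 frames per second (the `fps` figure from a short CPU run); `training_budget` is a hypothetical helper and the real throughput will vary by machine.

```python
import math


def training_budget(total_timesteps, n_steps=512, batch_size=128,
                    n_epochs=20, fps=16):
    """Rough estimate of rollouts, minibatch gradient steps and wall-clock hours."""
    rollouts = math.ceil(total_timesteps / n_steps)
    minibatch_steps_per_rollout = (n_steps // batch_size) * n_epochs
    hours = total_timesteps / fps / 3600
    return rollouts, rollouts * minibatch_steps_per_rollout, hours


rollouts, grad_steps, hours = training_budget(500_000)
print(rollouts, grad_steps, f"{hours:.1f} h")  # 977 78160 8.7 h
```

At that assumed speed, 500,000 timesteps is an overnight run, so a short trial with a few thousand timesteps is a sensible smoke test first.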

In [11]:
agent.learn(total_timesteps=5000)


----------------------------
| time/              |     |
|    fps             | 16  |
|    iterations      | 1   |
|    time_elapsed    | 30  |
|    total_timesteps | 512 |
----------------------------
---------------------------------------
| rollout/                |           |
|    ep_len_mean          | 1e+03     |
|    ep_rew_mean          | -4.76     |
| time/                   |           |
|    fps                  | 16        |
|    iterations           | 2         |
|    time_elapsed         | 60        |
|    total_timesteps      | 1024      |
| train/                  |           |
|    approx_kl            | 0.3699383 |
|    clip_fraction        | 0.308     |
|    clip_range           | 0.4       |
|    entropy_loss         | 3.83      |
|    explained_variance   | -0.000654 |
|    learning_rate        | 3e-05     |
|    loss                 | 0.81      |
|    n_updates            | 20        |
|    policy_gradient_loss | -0.0596   |
|    std                  | 0.135     |
---------------------------------------

<stable_baselines3.ppo.ppo.PPO at 0x29c20de0ed0>

Connect to the tensorboard log using **TensorBoard** from the command line to view training progress: 

`tensorboard --logdir ./log_tb_carracing_PPO/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

Save the trained agent.

In [None]:
# Add code here


To free memory, delete the old agent and environments (this assumes the variable names below; change them if required).

In [None]:
del agent
del env
del eval_env

### Create Image Stack Agent

Create the CarRacing-v3 environment using wrappers to resize the images to 64 x 64 and convert them to greyscale. Also add a wrapper to create a stack of 4 frames.
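
Libraries provide ready-made wrappers for frame stacking (for example `FrameStackObservation` in recent Gymnasium releases, or `VecFrameStack` in Stable-Baselines3). To illustrate what such a wrapper does, here is a minimal, library-free sketch. The class name and the choice to fill the stack with the first frame on reset are assumptions made for illustration.

```python
from collections import deque


class FrameStack:
    """Keep the most recent `stack_size` frames, oldest first (illustrative only)."""

    def __init__(self, stack_size=4):
        self.frames = deque(maxlen=stack_size)

    def reset(self, first_frame):
        # Fill the whole stack with the first frame so the shape is constant.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(new_frame)
        return list(self.frames)


stack = FrameStack(stack_size=4)
stack.reset("f0")
print(stack.step("f1"))  # ['f0', 'f0', 'f0', 'f1']
```

Stacking frames lets the policy infer motion (speed and turning rate), which a single still image cannot convey.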

In [None]:
# Add code here


Create an agent that controls the car using a stack of input image frames as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False}
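
The `gae_lambda` setting controls how advantages are estimated from rewards and critic values. The following sketch computes Generalised Advantage Estimation for one rollout segment. It assumes no episode ends inside the segment and uses `gamma = 0.99` (the Stable-Baselines3 PPO default); `gae_advantages` is a hypothetical helper, not the library's rollout buffer code.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, gae_lambda=0.9):
    """Generalised Advantage Estimation over one segment (no episode boundaries)."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Work backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Two steps, reward 1 each, critic predicting 0 everywhere.
adv = gae_advantages([1.0, 1.0], [0.0, 0.0], last_value=0.0)
print([round(a, 3) for a in adv])  # [1.891, 1.0]
```

Lower `gae_lambda` values lean more on the critic (less variance, more bias); values near 1 lean more on observed rewards.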

We also recommend enabling **tensorboard** monitoring of the training process.

In [None]:
# Add code here


Examine the actor and critic network architectures.

In [None]:
# Add code here


Create an evaluation callback that is called at regular intervals and renders the episode.

In [None]:
# Add code here


Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [None]:
# Add code here


Connect to the tensorboard log using **TensorBoard** from the command line to view training progress: 

`tensorboard --logdir ./log_tb_carracing_PPO/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

Save the trained agent.

In [None]:
# Add code here

To free memory, delete the old agent and environments (this assumes the variable names below; change them if required).

In [None]:
del agent
del env
del eval_env

### Evaluation

Load the saved single-image agent.

In [None]:
# Add code here

Set up the single-image environment for evaluation.

In [None]:
# Add code here

Evaluate the agent in the environment for 30 episodes, rendering the process. 
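
Stable-Baselines3 offers `evaluate_policy` in `stable_baselines3.common.evaluation`, which returns the mean and standard deviation of episode rewards. To show what such an evaluation involves, the sketch below implements the loop by hand against the Gymnasium `reset`/`step` interface. `evaluate`, `ToyEnv` and `choose_action` are hypothetical names; with a trained agent, `choose_action` could wrap `agent.predict(obs, deterministic=True)`.

```python
from statistics import mean, pstdev


def evaluate(env, choose_action, n_episodes=30):
    """Run full episodes and return (mean, population std) of episode returns."""
    returns = []
    for _ in range(n_episodes):
        obs, _info = env.reset()
        terminated = truncated = False
        total = 0.0
        while not (terminated or truncated):
            obs, reward, terminated, truncated, _info = env.step(choose_action(obs))
            total += reward
        returns.append(total)
    return mean(returns), pstdev(returns)


class ToyEnv:
    """Stand-in environment: +1 reward per step, truncated after 5 steps."""

    def reset(self):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        return self.t, 1.0, False, self.t >= 5, {}


print(evaluate(ToyEnv(), lambda obs: 0, n_episodes=3))  # (5.0, 0.0)
```

Reporting the standard deviation alongside the mean matters here because each episode uses a different randomly generated track.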

In [None]:
# Add code here

To free memory, delete the single-image agent and its environment (this assumes the variable names below; change them if required).

In [None]:
del agent
del eval_env

Load the saved image-stack agent.

In [None]:
# Add code here 


Set up the image-stack environment for evaluation.

In [None]:
# Add code here


Evaluate the agent in the environment for 30 episodes, rendering the process. 

In [None]:
# Add code here


### Reflection

Reflect on which agent performs better at the task, and on the training process involved (max 200 words).