# COMP47590 Advanced Machine Learning

## Assignment 2: Going the Distance
Uses the PPO actor-critic method to train a neural network to control a simple robot in the CarRacing environment from OpenAI Gym (https://gym.openai.com/envs/CarRacing-v0/).

![Racing](racing_car.gif)

There are five discrete **actions** in this environment:
- left (0)
- right (1)
- brake (2)
- accelerate (3)
- none (4)

A **reward** of -0.1 is given every frame and +1000/N for every track tile visited, where N is the total number of tiles in the track. For example, if you visit every tile and finish in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points.
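
The reward rule can be checked with a short calculation. The sketch below is a hypothetical helper (the name `episode_reward` and its arguments are ours, not part of gym); it assumes a penalty of 0.1 per frame and a bonus of 1000/N per visited tile, as described above, and the 300-tile track is an arbitrary example.

```python
def episode_reward(frames, tiles_visited, total_tiles):
    """Return the episode reward under the rule described above.

    Hypothetical helper: -0.1 per frame plus 1000/N per visited tile,
    where N is the total number of tiles in the track.
    """
    tile_bonus = 1000.0 * tiles_visited / total_tiles
    frame_penalty = 0.1 * frames
    return tile_bonus - frame_penalty


# Visiting every tile of a (hypothetical) 300-tile track in 732 frames.
full_lap = episode_reward(frames=732, tiles_visited=300, total_tiles=300)
print(round(full_lap, 1))  # 926.8
```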

The **state** is represented using a single image frame (96 x 96).

<span style="color:blue">
    
## Foreword: How we conducted the work for this assignment
    
- We used the gym `Monitor` wrapper to record MP4 videos that we could display in Colab and share
- We worked on different machines and exchanged ZIP files
- Our own text in this notebook is highlighted in **blue**

</span>

### Initialisation

If using Google Colab, you need to install packages: uncomment the lines below.

In [None]:
## We used this in the colab environment

#!apt install swig cmake ffmpeg
#!apt-get install -y xvfb x11-utils
#!pip install stable-baselines3[extra] pyglet box2d box2d-kengz
#!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate

For Google Colab, uncomment this cell to create a virtual rendering canvas so that render calls work (we still won't see a display!).

In [None]:
#import pyvirtualdisplay
#
#_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
#                                    size=(1400, 900))
#_ = _display.start()

Import required packages. 

In [None]:
import torch 
import gym
import stable_baselines3 as sb3

import pandas as pd # For data frames and data frame manipulation
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)
import numpy as np # For general  numeric operations

import matplotlib.pyplot as plt
%matplotlib inline 

In [None]:
# Imports for MP4 rendering
import io
import base64
from IPython.display import HTML
from gym import wrappers

# Imports for inline Tensorboard
%load_ext tensorboard
import datetime, os

In [None]:
# In Colab, select the GPU for training when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
## Here comes all the logic for showing the mp4s

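def show_render_result(rend_env):
  """Embed the first video recorded by a Monitor-wrapped env as an HTML5 player."""
  video_path = './gym-results/openaigym.video.%s.video000000.mp4' % rend_env.file_infix
  # Read-only binary mode is enough; the context manager closes the file.
  with io.open(video_path, 'rb') as f:
    video = f.read()
  encoded = base64.b64encode(video)
  return HTML(data=''' 
  <video width="720" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
  .format(encoded.decode('ascii')))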

### Create and Explore the Environment

Create the **CarRacing-v0** environment. Add wrappers to resize the images and convert to greyscale.

In [None]:
env = gym.make('CarRacing-v0')
env = gym.wrappers.resize_observation.ResizeObservation(env, 64)
env = gym.wrappers.gray_scale_observation.GrayScaleObservation(env, keep_dim = True)

# This is the env we use to monitor when we want a video of the agent
render_env =  wrappers.Monitor(env, "./gym-results", force=True)
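
To make the effect of the two wrappers concrete, the sketch below traces only the observation shape through them. The helper names are hypothetical and no gym code is called; it assumes the raw frame is a 96 x 96 RGB image, that `ResizeObservation(env, 64)` resizes height and width to 64, and that `GrayScaleObservation(..., keep_dim=True)` keeps a single channel axis.

```python
def resize_shape(shape, size):
    """Shape after resizing height and width to `size`; other axes are kept."""
    return (size, size) + tuple(shape[2:])


def grayscale_shape(shape, keep_dim=True):
    """Shape after collapsing the colour channels to one grey channel."""
    height, width = shape[:2]
    return (height, width, 1) if keep_dim else (height, width)


raw = (96, 96, 3)                 # assumed raw RGB frame
resized = resize_shape(raw, 64)   # (64, 64, 3)
grey = grayscale_shape(resized)   # (64, 64, 1)
print(raw, "->", resized, "->", grey)
```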

Explore the environment - view the action space and observation space.

In [None]:
print("action_space: ", env.action_space)

In [None]:
print("observation_space: ", env.observation_space)

Play an episode of the environment using random actions

In [None]:
obs = env.reset()
done = False

while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    env.render()
    env.render('rgb_array')

<span style="color:blue">
    
## Exploration of the Environment
    
- Write about shapes and wrappers

</span>

### Single Image Agent
Create an agent that controls the car using a single image frame as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False}
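
As a rough guide to what these settings imply, the sketch below works out how many minibatches and gradient steps one rollout produces. It is plain arithmetic, not SB3 code; the variable names are ours, and it assumes a single environment and the 500,000 training timesteps suggested later in this notebook.

```python
import math

n_steps = 512              # transitions collected per rollout (one environment assumed)
batch_size = 128           # minibatch size
n_epochs = 20              # passes over each rollout
total_timesteps = 500_000  # training budget assumed from the training cell

minibatches_per_epoch = n_steps // batch_size                   # 4
gradient_steps_per_rollout = minibatches_per_epoch * n_epochs   # 80
rollouts = math.ceil(total_timesteps / n_steps)                 # 977

print(minibatches_per_epoch, gradient_steps_per_rollout, rollouts)
```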

We also recommend enabling **tensorboard** monitoring of the training process.

In [None]:
tb_log = './tb_logs_SingleFrame_Training/'

#policy = 'MlpPolicy'
policy = 'CnnPolicy'

agent = sb3.PPO(policy, env,
                    learning_rate = 3e-5,
                    n_steps = 512,
                    ent_coef = 0.001,
                    batch_size = 128,
                    gae_lambda =  0.9,
                    n_epochs = 20,
                    use_sde = True,
                    sde_sample_freq = 4,
                    clip_range = 0.4,
                    policy_kwargs = {'log_std_init': -2, 'ortho_init':False},
                    tensorboard_log=tb_log)
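
The `gae_lambda` setting controls how advantages are estimated from rewards and critic values. Below is a minimal sketch of generalised advantage estimation on a toy three-step rollout. It is our own illustration rather than SB3's implementation, the reward and value numbers are made up, and `gamma = 0.99` is an assumption (it is not set explicitly above).

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, gae_lambda=0.9):
    """Compute GAE advantages for one rollout that does not end in a terminal state.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * gae_lambda * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    next_advantage = 0.0
    # Work backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        next_advantage = delta + gamma * gae_lambda * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages


# Toy numbers, purely illustrative.
adv = gae_advantages(rewards=[-0.1, -0.1, 3.2], values=[1.0, 1.1, 1.2], last_value=0.9)
print([round(a, 3) for a in adv])
```

Lowering `gae_lambda` towards 0 makes each advantage rely more on the one-step error `delta_t`, while values near 1 weight longer reward sequences more heavily.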

Examine the actor and critic network architectures.

In [None]:
print(agent.policy)

<span style="color:blue">
    
## On the actor and critic network
    
- 2 heads, CNN or MLP output dims

</span>

Create an evaluation callback that is called at regular intervals and renders the episode.

In [None]:
eval_env = gym.make('CarRacing-v0')
eval_env = gym.wrappers.resize_observation.ResizeObservation(eval_env, 64)
eval_env = gym.wrappers.gray_scale_observation.GrayScaleObservation(eval_env, keep_dim = True)

# Using MLP policy change: best_model_save_path='./best_model_MLP_Single/'
eval_callback = sb3.common.callbacks.EvalCallback(eval_env, 
                                                  best_model_save_path='./best_model_CNN_Single/',
                                                  log_path=tb_log, 
                                                  eval_freq=5000,
                                                  render=False)
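
With `eval_freq=5000`, an evaluation is triggered every 5,000 callback calls. The sketch below lists the timesteps at which that would happen. It is plain arithmetic under two assumptions: a single environment (so one callback call per timestep) and the 500,000-timestep training budget suggested in this notebook.

```python
eval_freq = 5000
total_timesteps = 500_000  # budget assumed from the training cell

# Timesteps at which an evaluation would be triggered (one environment assumed).
eval_points = list(range(eval_freq, total_timesteps + 1, eval_freq))

print(len(eval_points))   # 100
print(eval_points[:3])    # [5000, 10000, 15000]
```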

Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [None]:
%tensorboard --logdir ./tb_logs_SingleFrame_Training/

In [None]:
agent.learn(total_timesteps=500000,callback=eval_callback)

Save the trained agent.

In [None]:
# Using MLP policy change to: agent.save("./final_models/final_model_MLP_single")
agent.save("./final_models/final_model_CNN_single")

For memory management delete old agent and environment (assumes variable names - change if required).

In [None]:
del agent
del env
del eval_env
del render_env

### Create Image Stack Agent

Create the CarRacing-v0 environment using wrappers to resize the images to 64 x 64 and change to greyscale. Also add a wrapper to create a stack of 4 frames. 

In [None]:
# Create Stacked env
env = gym.make('CarRacing-v0')
env = gym.wrappers.resize_observation.ResizeObservation(env, 64)
env = gym.wrappers.gray_scale_observation.GrayScaleObservation(env, keep_dim = True)
env = sb3.common.vec_env.DummyVecEnv([lambda: env]) 
env = sb3.common.vec_env.VecFrameStack(env, n_stack=4)

# Separate evaluation env
eval_env = gym.make('CarRacing-v0')
eval_env = gym.wrappers.resize_observation.ResizeObservation(eval_env, 64)
eval_env = gym.wrappers.gray_scale_observation.GrayScaleObservation(eval_env, keep_dim = True)
eval_env = sb3.common.vec_env.DummyVecEnv([lambda: eval_env]) 
eval_env = sb3.common.vec_env.VecFrameStack(eval_env, n_stack=4)

# Separate evaluation render env
render_env =  wrappers.Monitor(env, "./gym-results", force=True)
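
Stacking frames gives the agent a short history, so motion can be inferred from consecutive images. The sketch below shows the idea with a `collections.deque`; it is not `VecFrameStack` itself. The class name is hypothetical, frames are represented by simple labels, and padding the stack with `None` placeholders after a reset is our own choice for the illustration.

```python
from collections import deque


class FrameStack:
    """Keep the most recent `n_stack` frames, oldest first (hypothetical helper)."""

    def __init__(self, n_stack=4, pad=None):
        self.n_stack = n_stack
        self.pad = pad
        self.frames = deque([pad] * n_stack, maxlen=n_stack)

    def reset(self, first_frame):
        """Clear the history and start again from `first_frame`."""
        self.frames = deque([self.pad] * self.n_stack, maxlen=self.n_stack)
        return self.push(first_frame)

    def push(self, frame):
        """Add the newest frame; the oldest one falls off the front."""
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(n_stack=4)
print(stack.reset("f0"))          # [None, None, None, 'f0']
for name in ["f1", "f2", "f3", "f4"]:
    state = stack.push(name)
print(state)                      # ['f1', 'f2', 'f3', 'f4']
```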

Create an agent that controls the car using a stack of input image frames as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False}
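
The `clip_range` value bounds how far a single update can move the policy. The sketch below evaluates PPO's clipped surrogate term for one sample; the function name is ours and the ratios and advantages are made-up numbers chosen to show when clipping is active.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.4):
    """PPO clipped surrogate for one sample.

    ratio is pi_new(a|s) / pi_old(a|s); the objective is
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


print(clipped_surrogate(1.2, 2.0))   # ratio inside [0.6, 1.4]: no clipping
print(clipped_surrogate(1.6, 1.0))   # ratio above 1.4 with positive advantage: clipped
print(clipped_surrogate(0.5, -1.0))  # ratio below 0.6 with negative advantage: clipped
```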

We also recommend enabling **tensorboard** monitoring of the training process.

In [None]:
tb_log = './tb_logs_StackFrame_Training/'

#policy = 'MlpPolicy'
policy = 'CnnPolicy'

agent = sb3.PPO(policy, env,
                    learning_rate = 3e-5,
                    n_steps = 512,
                    ent_coef = 0.001,
                    batch_size = 128,
                    gae_lambda =  0.9,
                    n_epochs = 20,
                    use_sde = True,
                    sde_sample_freq = 4,
                    clip_range = 0.4,
                    policy_kwargs = {'log_std_init': -2, 'ortho_init':False},
                    tensorboard_log=tb_log)

Examine the actor and critic network architectures.

In [None]:
print(agent.policy)

<span style="color:blue">
    
## On the actor and critic network
    
- Only difference in dimension

</span>

Create an evaluation callback that is called at regular intervals and renders the episode.

In [None]:
eval_callback = sb3.common.callbacks.EvalCallback(eval_env, 
                                                  best_model_save_path='./best_model_CNN_4Stack/',
                                                  log_path=tb_log, 
                                                  eval_freq=5000,
                                                  render=False)

Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [None]:
%tensorboard --logdir ./tb_logs_StackFrame_Training/

In [None]:
agent.learn(total_timesteps=500000, callback=eval_callback)

Save the trained agent.

In [None]:
# Using MLP policy change to: agent.save("./final_models/final_model_MLP_4Stack")
agent.save("./final_models/final_model_CNN_4Stack")

For memory management delete old agent and environment (assumes variable names - change if required).

In [None]:
del agent
del env
del eval_env
del render_env

### Evaluation

Load the saved single image agent.

In [None]:
# Add code here

Set up the single image environment for evaluation.

In [None]:
# Add code here

Evaluate the agent in the environment for 30 episodes, rendering the process. 

In [None]:
# Add code here
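
One way to approach this step is a plain evaluation loop that collects the total reward of each episode. The sketch below is an illustration under stated assumptions, not a finished solution: `run_episodes`, `CountdownEnv` and the always-accelerate policy are hypothetical stand-ins so that the loop runs without gym, and it assumes the older gym API in which `step` returns `(obs, reward, done, info)`, as used earlier in this notebook.

```python
import statistics


class CountdownEnv:
    """Toy stand-in for a gym environment: each episode lasts `length` steps."""

    def __init__(self, length=5):
        self.length = length
        self.remaining = length

    def reset(self):
        self.remaining = self.length
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        reward = 1.0 if action == 3 else -0.1  # toy rule: reward "accelerate" only
        done = self.remaining == 0
        return self.remaining, reward, done, {}


def run_episodes(env, policy, n_episodes=30):
    """Run `n_episodes` episodes and return the total reward of each one."""
    totals = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _info = env.step(policy(obs))
            total += reward
        totals.append(total)
    return totals


totals = run_episodes(CountdownEnv(), policy=lambda obs: 3, n_episodes=30)
print(len(totals), statistics.mean(totals), statistics.pstdev(totals))
```

With a trained SB3 agent, the policy argument could plausibly be `lambda obs: agent.predict(obs, deterministic=True)[0]`, and SB3 also ships an `evaluate_policy` helper in `stable_baselines3.common.evaluation` that reports the mean and standard deviation of episode rewards.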

For memory management delete the single image agent (assumes variable names - change if required).

In [None]:
del agent
del eval_env

Load the image stack agent

In [None]:
# Add code here 


Set up the image stack environment

In [None]:
# Add code here


Evaluate the agent in the environment for 30 episodes, rendering the process. 

In [None]:
# Add code here


### Reflection

Reflect on which  agent performs better at the task, and the training process involved (max 200 words).