# COMP47590 Advanced Machine Learning

## Assignment 2: Going the Distance
Uses the PPO actor-critic method to train a neural network to control a simple robot in the CarRacing environment from OpenAI gym (https://gym.openai.com/envs/RacingCar-v2/).

![Racing](racing_car.gif)

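There are five discrete **actions** in this environment (selected with `continuous=False`; by default, as in the cells below, the action is a continuous vector of steering, gas, and brake):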
- left (0)
- right (1)
- brake (2)
- accelerate (3)
- none (4)

**Reward** of -0.1 is awarded every frame and +1000/N for every track tile visited, where N is the total number of tiles in track. For example, if you have finished in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points.
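The reward arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `episode_reward` and its parameters are illustrative and are not part of the environment's API, and the tile count of 300 is an arbitrary value chosen for the example.

```python
def episode_reward(frames: int, tiles_visited: int, total_tiles: int) -> float:
    """Total reward for one episode.

    -0.1 for every frame, plus 1000/N for each visited tile,
    where N is the total number of tiles in the track.
    """
    tile_bonus = 1000.0 * tiles_visited / total_tiles
    frame_penalty = 0.1 * frames
    return tile_bonus - frame_penalty


# Visiting every tile in 732 frames; the tile count cancels out in this case.
print(round(episode_reward(frames=732, tiles_visited=300, total_tiles=300), 1))  # 926.8
```

An agent that never reaches a new tile simply accumulates the frame penalty, which is why long episodes with no progress end with a reward close to -100.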

The **state** is represented using a single image frame (96 x 96 pixels).

### Initialisation

If using Google Colab, you need to install packages: uncomment the lines below.

In [2]:
#!apt install swig cmake ffmpeg
#!apt-get install -y xvfb x11-utils
#!pip install stable-baselines3[extra] pyglet box2d box2d-kengz
#!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate

For Google Colab, uncomment this cell to create a virtual rendering canvas so that render calls work (we still won't see a display!).

In [3]:
#import pyvirtualdisplay
#
#_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
#                                    size=(1400, 900))
#_ = _display.start()

Import required packages. 

In [4]:
import torch 
import gymnasium as gym
import stable_baselines3 as sb3

import pandas as pd # For data frames and data frame manipulation
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)
import numpy as np # For general  numeric operations

import matplotlib.pyplot as plt
%matplotlib inline 

from stable_baselines3.common.monitor import Monitor

### Create and Explore the Environment

Create the **CarRacing-v2** environment. Add wrappers to resize the images and convert to greyscale.

In [5]:
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

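def make_env():
    # Create an instance of the environment with all necessary wrappers.
    # render_mode='human' draws a window on every step, which slows training.
    env = gym.make('CarRacing-v2', render_mode='human')
    env = Monitor(env)  # Record episode reward and length for logging
    env = gym.wrappers.ResizeObservation(env, 64)  # 96 x 96 -> 64 x 64
    env = gym.wrappers.GrayScaleObservation(env, keep_dim=True)  # RGB -> 1 channel
    # gym.make already applies the registered 1000-step limit, so this outer
    # 1500-step limit is never reached (episodes end after 1000 steps).
    env = gym.wrappers.TimeLimit(env, max_episode_steps=1500)
    return env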

env = DummyVecEnv([make_env])
env = VecTransposeImage(env)

Explore the environment - view the action space and observation space.

In [6]:
env.action_space

Box([-1.  0.  0.], 1.0, (3,), float32)

In [7]:
env.observation_space

Box(0, 255, (1, 64, 64), uint8)
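The observation shape above follows from the wrapper chain in `make_env` followed by `VecTransposeImage`. The sketch below traces the per-environment shape using plain tuples only. The helper names are hypothetical and the code does not call gymnasium or Stable-Baselines3; it only mirrors what each wrapper does to the shape.

```python
def resize(shape, size):
    """ResizeObservation: change height and width, keep the channels."""
    _, _, channels = shape
    return (size, size, channels)


def grayscale_keep_dim(shape):
    """GrayScaleObservation(keep_dim=True): collapse RGB into one channel."""
    height, width, _ = shape
    return (height, width, 1)


def transpose_image(shape):
    """VecTransposeImage: channels-last (H, W, C) -> channels-first (C, H, W)."""
    height, width, channels = shape
    return (channels, height, width)


raw = (96, 96, 3)  # native CarRacing frame: 96 x 96 RGB
final = transpose_image(grayscale_keep_dim(resize(raw, 64)))
print(final)  # (1, 64, 64)
```

The channels-first layout is the one PyTorch convolution layers expect, which is why the transpose is applied last.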

Play an episode of the environment using random actions

In [8]:
# play one episode using random actions, with detailed logging
import pygame
pygame.init()

obs = env.reset()

done = False
score = 0
step_count = 0

while not done:
    env.render()
    action = env.action_space.sample()
    obs, reward, done, info = env.step([action])  # actions are passed as a list to VecEnv

    score += reward
    step_count += 1
    print(f"Step: {step_count}")
    print(f"Action Taken: {action}")
    print(f"Reward: {reward}")
    print(f"Terminated: {done}")
    print(f"Info: {info}")
    print(f"Score: {score}")
    print("-" * 30)

    if done:
        print(f"Episode finished after {step_count} timesteps with total reward {score}")
        break

Step: 1
Action Taken: [0.34457937 0.5487911  0.47159332]
Reward: [7.3349442]
Terminated: [False]
Info: [{'TimeLimit.truncated': False}]
Score: [7.3349442]
------------------------------
Step: 2
Action Taken: [-0.9577241   0.71824104  0.87855166]
Reward: [-0.1]
Terminated: [False]
Info: [{'TimeLimit.truncated': False}]
Score: [7.2349443]
------------------------------
Step: 3
Action Taken: [0.6720021  0.02354856 0.08385024]
Reward: [-0.1]
Terminated: [False]
Info: [{'TimeLimit.truncated': False}]
Score: [7.1349444]
------------------------------
Step: 4
Action Taken: [0.1566244 0.1258183 0.7440854]
Reward: [-0.1]
Terminated: [False]
Info: [{'TimeLimit.truncated': False}]
Score: [7.0349445]
------------------------------
Step: 5
Action Taken: [0.9913453  0.40902403 0.58703333]
Reward: [-0.1]
Terminated: [False]
Info: [{'TimeLimit.truncated': False}]
Score: [6.9349446]
------------------------------
Step: 6
Action Taken: [-0.8312545   0.70475125  0.3929863 ]
Reward: [-0.1]
Terminated: [Fa

### Single Image Agent
Create an agent that controls the car using a single image frame as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False},

We also recommend enabling **tensorboard** monitoring of the training process.

In [9]:
# Define the hyperparameters
learning_rate = 3e-5
n_steps = 512
ent_coef = 0.001
batch_size = 128
gae_lambda = 0.9
n_epochs = 20
use_sde = True
sde_sample_freq = 4
clip_range = 0.4
policy_kwargs = {'log_std_init': -2, 'ortho_init':False}

tensorboard_log = './car_racing_tensorboard_single/'

# Create the PPO agent
agent = sb3.PPO(policy="CnnPolicy", 
                env=env, 
                learning_rate=learning_rate, 
                n_steps=n_steps,
                ent_coef=ent_coef, 
                batch_size=batch_size, 
                gae_lambda=gae_lambda, 
                n_epochs=n_epochs, 
                use_sde=use_sde, 
                sde_sample_freq=sde_sample_freq, 
                clip_range=clip_range, 
                policy_kwargs=policy_kwargs,
                tensorboard_log=tensorboard_log)
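As a rough sketch of what these settings imply, assume that PPO collects `n_steps` transitions from each environment per rollout and then makes `n_epochs` passes over that data in minibatches of `batch_size`. The helper `ppo_update_size` is a hypothetical name used only for this illustration.

```python
def ppo_update_size(n_steps, n_envs, batch_size, n_epochs):
    """Return (rollout size, minibatches per epoch, gradient steps per update)."""
    rollout = n_steps * n_envs            # transitions collected before each update
    minibatches = rollout // batch_size   # minibatches in one pass over the rollout
    gradient_steps = minibatches * n_epochs
    return rollout, minibatches, gradient_steps


# Single environment with the hyperparameters defined above
print(ppo_update_size(n_steps=512, n_envs=1, batch_size=128, n_epochs=20))  # (512, 4, 80)

# The same settings with four parallel environments
print(ppo_update_size(n_steps=512, n_envs=4, batch_size=128, n_epochs=20))  # (2048, 16, 320)
```

With 20 epochs per rollout, each batch of experience is reused heavily, so the relatively small learning rate and the clipping range both influence how far the policy can move in a single update.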

Examine the actor and critic network architectures.

In [10]:
print("Policy network architecture:")
print(agent.policy)

Policy network architecture:
ActorCriticCnnPolicy(
  (features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(1, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (pi_features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(1, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (vf_features_extra
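The `in_features=1024` of the linear layer can be derived from the three convolutions printed above and the 64 x 64 input. A minimal sketch, assuming unpadded convolutions (no padding appears in the printout); `conv_out` is a helper written for this example:

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded convolution (floor division, as in PyTorch)."""
    return (size - kernel) // stride + 1


size = 64  # input images are 64 x 64
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    print(size)  # 15, then 6, then 4

flattened = 64 * size * size  # the last convolution has 64 output channels
print(flattened)  # 1024
```

The number of input channels does not affect this calculation, so the flattened size stays the same whether the network sees one frame or several stacked frames.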

Create an evaluation callback that is called at regular intervals and renders the episode.

In [11]:
# Use a list of environment creation functions for DummyVecEnv
eval_env = DummyVecEnv([make_env])
eval_env = VecTransposeImage(eval_env)

In [12]:
eval_env.action_space

Box([-1.  0.  0.], 1.0, (3,), float32)

In [13]:
eval_env.observation_space

Box(0, 255, (1, 64, 64), uint8)

In [14]:
observation = env.reset()

eval_log_path = './logs_carracing_PPO_single/'
eval_callback1 = sb3.common.callbacks.EvalCallback(eval_env,
                                                  best_model_save_path=eval_log_path,
                                                  log_path=eval_log_path,
                                                  eval_freq=5000,
                                                  deterministic=False,
                                                  verbose=1,
                                                  render= True)

eval_callback2 = sb3.common.callbacks.EvalCallback(eval_env,
                                                  best_model_save_path=eval_log_path,
                                                  log_path=eval_log_path,
                                                  eval_freq=5000,
                                                  deterministic=True,
                                                  verbose=1,
                                                  render= True)

Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [15]:
agent.learn(total_timesteps=250000,
            callback = eval_callback1,
            tb_log_name="PPO_CarRacing_Single_Pre")
agent.learn(total_timesteps=250000,
            callback = eval_callback2,
            tb_log_name="PPO_CarRacing_Single_Post")

Eval num_timesteps=5000, episode_reward=-61.52 +/- 3.82
Episode length: 1000.00 +/- 0.00
New best mean reward!
Eval num_timesteps=10000, episode_reward=-93.50 +/- 0.17
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=15000, episode_reward=-83.58 +/- 0.90
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=20000, episode_reward=-83.46 +/- 0.81
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=25000, episode_reward=29.43 +/- 56.13
Episode length: 1000.00 +/- 0.00
New best mean reward!
Eval num_timesteps=30000, episode_reward=-93.45 +/- 0.46
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=35000, episode_reward=-76.59 +/- 2.40
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=40000, episode_reward=-93.28 +/- 0.33
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=45000, episode_reward=-93.06 +/- 0.51
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=50000, episode_reward=-82.69 +/- 0.70
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=55000, episode_reward=-93.38 +/- 0.54


<stable_baselines3.ppo.ppo.PPO at 0x2134621ba10>

Connect to the tensorboard log using **TensorBoard** from the command line to view training progress: 

`tensorboard --logdir ./car_racing_tensorboard_single/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

Save the trained agent.

In [16]:
model_path = "./saved_models/ppo_car_racing_single"
agent.save(model_path)
env.close()



For memory management delete old agent and environment (assumes variable names - change if required).

In [17]:
del agent
del env
del eval_env

### Create Image Stack Agent

Create the CarRacing-v2 environment using wrappers to resize the images to 64 x 64 and change to greyscale. Also add a wrapper to create a stack of 4 frames. 
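To illustrate what a frame stack provides, the toy class below keeps the four most recent frames so that the agent can infer motion, which a single image cannot show. It is only a sketch: `FrameStack` is a hypothetical class, strings stand in for 64 x 64 images, and padding with blank frames on reset is an assumption of this example rather than a description of the wrapper's internals.

```python
from collections import deque


class FrameStack:
    """Toy frame stack: keeps the n most recent frames, oldest first."""

    def __init__(self, n_stack, blank):
        self.n_stack = n_stack
        self.blank = blank
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_frame):
        # Fill the history with blank frames, then place the first frame last.
        self.frames.clear()
        self.frames.extend([self.blank] * (self.n_stack - 1))
        self.frames.append(first_frame)
        return list(self.frames)

    def step(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(n_stack=4, blank="blank")
print(stack.reset("f0"))  # ['blank', 'blank', 'blank', 'f0']
for name in ["f1", "f2", "f3", "f4"]:
    latest = stack.step(name)
print(latest)  # ['f1', 'f2', 'f3', 'f4']
```

With greyscale frames, stacking four of them along the channel axis gives the network a 4-channel input instead of a 1-channel one.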

In [18]:
from stable_baselines3.common.vec_env import VecFrameStack
# Four parallel environments; each observation stacks the 4 most recent frames
env = DummyVecEnv([make_env for _ in range(4)])
env = VecFrameStack(env, n_stack=4)
env = VecTransposeImage(env)
env

<stable_baselines3.common.vec_env.vec_transpose.VecTransposeImage at 0x2135a718650>

Create an agent that controls the car using a stack of input image frames as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False},

We also recommend enabling **tensorboard** monitoring of the training process.

In [19]:
tensorboard_log = './car_racing_tensorboard_4_stack/'
agent = sb3.PPO(policy="CnnPolicy", 
                env=env, 
                learning_rate=learning_rate, 
                n_steps=n_steps,
                ent_coef=ent_coef, 
                batch_size=batch_size, 
                gae_lambda=gae_lambda, 
                n_epochs=n_epochs, 
                use_sde=use_sde, 
                sde_sample_freq=sde_sample_freq, 
                clip_range=clip_range, 
                policy_kwargs=policy_kwargs,
                tensorboard_log=tensorboard_log)

In [20]:
print("Agent's environment:", agent.env)

Agent's environment: <stable_baselines3.common.vec_env.vec_transpose.VecTransposeImage object at 0x000002135A718650>


Examine the actor and critic network architectures.

In [21]:
# Print the actor-critic policy, including its CNN feature extractors
print("Policy network architecture:")
print(agent.policy)

Policy network architecture:
ActorCriticCnnPolicy(
  (features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (pi_features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (vf_features_extra

Create an evaluation callback that is called at regular intervals and renders the episode.

In [22]:
eval_env = DummyVecEnv([make_env for _ in range(4)])
eval_env = VecFrameStack(eval_env, n_stack=4)
eval_env = VecTransposeImage(eval_env)

observation = env.reset()

eval_log_path = './logs_carracing_PPO_4_stack/'

eval_callback1 = sb3.common.callbacks.EvalCallback(eval_env,
                                                  best_model_save_path=eval_log_path,
                                                  log_path=eval_log_path,
                                                  eval_freq=5000,
                                                  deterministic=False,
                                                  verbose=1,
                                                  render= True)

eval_callback2 = sb3.common.callbacks.EvalCallback(eval_env,
                                                  best_model_save_path=eval_log_path,
                                                  log_path=eval_log_path,
                                                  eval_freq=5000,
                                                  deterministic=True,
                                                  verbose=1,
                                                  render= True)
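One detail worth checking when several environments run in parallel: `eval_freq` is counted in calls to the vectorised environment's `step`, and each call advances every environment by one step. Under that assumption, the sketch below shows how `eval_freq=5000` translates into total timesteps between evaluations. Both helper functions are hypothetical names for this illustration.

```python
def timesteps_between_evals(eval_freq, n_envs):
    """Total environment timesteps that pass between two evaluations."""
    return eval_freq * n_envs


def eval_freq_for(target_timesteps, n_envs):
    """eval_freq that gives roughly one evaluation per target_timesteps."""
    return max(target_timesteps // n_envs, 1)


print(timesteps_between_evals(5000, n_envs=1))  # 5000
print(timesteps_between_evals(5000, n_envs=4))  # 20000
print(eval_freq_for(5000, n_envs=4))            # 1250
```

This is consistent with the training logs in this notebook, where the single environment is evaluated every 5,000 timesteps and the four-environment setup every 20,000.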

Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [23]:
agent.learn(total_timesteps=250000,
            callback = eval_callback1,
            tb_log_name="PPO_CarRacing_4_Stack_Pre")
agent.learn(total_timesteps=250000,
            callback = eval_callback2,
            tb_log_name="PPO_CarRacing_4_Stack_Post")

Eval num_timesteps=20000, episode_reward=138.10 +/- 219.68
Episode length: 1000.00 +/- 0.00
New best mean reward!
Eval num_timesteps=40000, episode_reward=-90.90 +/- 3.33
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=60000, episode_reward=-81.56 +/- 9.23
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=80000, episode_reward=-58.51 +/- 38.65
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=100000, episode_reward=-82.10 +/- 2.60
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=120000, episode_reward=-78.65 +/- 6.02
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=140000, episode_reward=-81.30 +/- 5.69
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=160000, episode_reward=-76.17 +/- 5.08
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=180000, episode_reward=-81.11 +/- 13.03
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=200000, episode_reward=-76.69 +/- 10.33
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=220000, episode_reward=-4.65 +/- 63.95
Episode l

<stable_baselines3.ppo.ppo.PPO at 0x2135afaba10>

Connect to the tensorboard log using **TensorBoard** from the command line to view training progress: 

`tensorboard --logdir ./car_racing_tensorboard_4_stack/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

Save the trained agent.

In [24]:
model_path = "./saved_models/ppo_car_racing_4_stack"
agent.save(model_path)
env.close()

For memory management delete old agent and environment (assumes variable names - change if required).

In [25]:
del agent
del env
del eval_env

### Evaluation

Load the saved single image agent.

In [26]:
model_path = "./saved_models/ppo_car_racing_single"

agent = sb3.PPO.load(model_path)

Setup the single image environment for evaluation.

In [27]:
eval_env = DummyVecEnv([make_env])
eval_env = VecTransposeImage(eval_env)
agent.set_env(eval_env)

Evaluate the agent in the environment for 30 episodes, rendering the process. 

In [28]:
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent, 
                                                                agent.get_env(), 
                                                                n_eval_episodes=30,
                                                                render = True)
print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))

Mean Reward: -93.16878216666665 +/- 0.41625394955243944
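The two numbers reported above summarise the per-episode rewards as a mean and a standard deviation. The sketch below reproduces that summary with the standard library. The reward values are hypothetical and are used for illustration only; they are not results from this notebook. Using the population standard deviation is an assumption that matches NumPy's default.

```python
import statistics

# Hypothetical per-episode rewards, for illustration only
episode_rewards = [-93.0, -92.5, -93.5, -94.0]

mean_reward = statistics.fmean(episode_rewards)
std_reward = statistics.pstdev(episode_rewards)  # population standard deviation

print(f"Mean Reward: {mean_reward:.2f} +/- {std_reward:.2f}")  # Mean Reward: -93.25 +/- 0.56
```

A very small standard deviation across 30 episodes, as in the result above, suggests that the agent behaves almost identically in every episode. To inspect individual episodes, `evaluate_policy` accepts `return_episode_rewards=True`, which returns the per-episode rewards and lengths instead of the summary.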


For memory management delete the single image agent (assumes variable names - change if required).

In [29]:
del agent
del eval_env

Load the image stack agent

In [30]:
model_path = "./saved_models/ppo_car_racing_4_stack"

agent = sb3.PPO.load(model_path)

Set up the image stack environment

In [31]:
eval_env = DummyVecEnv([make_env for _ in range(4)])
eval_env = VecFrameStack(eval_env, n_stack=4)
eval_env = VecTransposeImage(eval_env)
agent.set_env(eval_env)

Evaluate the agent in the environment for 30 episodes, rendering the process. 

In [32]:
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent, 
                                                                agent.get_env(), 
                                                                n_eval_episodes=30,
                                                                render = True)
print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))

Mean Reward: -92.92273259999999 +/- 0.6404525699870152


### Reflection

Reflect on which  agent performs better at the task, and the training process involved (max 200 words).

During training, both agents showed the same problem: the car kept spinning in place while the entropy loss decreased. Even though I set the `deterministic` parameter to False for the first half of the training evaluations and made other attempts to improve the entropy loss, neither agent performed well enough. However, the image stack agent performed better than the single image agent. The highest mean episode rewards of the single image agent and the image stack agent were 109.16 and 138.10 respectively, although the image stack agent's margin of error was wider.

1. rollout_ep_rew_mean

![Single Image Agent](single_rollout_ep_rew_mean.png)
![Stack Image Agent](4_stack_rollout_ep_rew_mean.png)

2. eval_mean_rew_mean

![Single Image Agent](single_eval_mean_reward.png)
![Stack Image Agent](4_stack_eval_mean_reward.png)

As the graphs show, the image stack agent is more stable and performs better. Notably, although both agents' eval_mean_rew_mean was negative, the image stack agent's rollout_ep_rew_mean increased steadily according to TensorBoard. However, the messages from the callback function show a different result, so there is probably an error in how rewards are calculated or actions are chosen, which results in the car spinning around while the entropy decreases.

Overall, the task was challenging because training with rendering enabled took a long time, which made it hard to try out changes.