# COMP47590 Advanced Machine Learning

- **Student Name:** Lucas George Sipos
- **Student Number:** 24292215

## Assignment 2: Going the Distance
Uses the PPO actor-critic method to train a neural network to control a car in the CarRacing environment from OpenAI Gym (https://gym.openai.com/envs/RacingCar-v0/).

![Racing](racing_car.gif)

The **action** space can be continuous or discrete. If **continuous** there are 3 actions:

- 0: steering, -1 is full left, +1 is full right
- 1: gas
- 2: braking

If **discrete** there are 5 actions:
- 0: do nothing
- 1: steer left
- 2: steer right
- 3: gas
- 4: brake

For this assignment we should use the continuous action space. 
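
A continuous action is a 3-element vector of steering, gas and brake. The sketch below shows one way to keep hand-built actions inside the valid range. It assumes bounds of [-1, 1] for steering and [0, 1] for gas and brake; `make_action` is an illustrative helper, not part of Gym.

```python
def make_action(steering: float, gas: float, brake: float) -> list[float]:
    """Clamp each component to the assumed bounds of the continuous action space."""
    def clamp(value: float, low: float, high: float) -> float:
        return max(low, min(high, value))

    return [
        clamp(steering, -1.0, 1.0),  # -1 is full left, +1 is full right
        clamp(gas, 0.0, 1.0),
        clamp(brake, 0.0, 1.0),
    ]


print(make_action(-0.3, 0.8, 0.0))  # gentle left turn while accelerating
print(make_action(2.0, -0.5, 1.5))  # out-of-range values are clamped
```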

**Reward** of -0.1 is awarded every frame and +1000/N for every track tile visited, where N is the total number of tiles in track. For example, if you have finished in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points.
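
The reward rule can be checked with a few lines of arithmetic. This is a minimal sketch: `episode_reward` is an illustrative helper, and the tile count of 300 is a made-up value that cancels out when every tile is visited.

```python
def episode_reward(frames: int, tiles_visited: int, total_tiles: int) -> float:
    """Reward under the rule above: +1000/N per visited tile, -0.1 per frame."""
    tile_bonus = 1000.0 * tiles_visited / total_tiles
    frame_penalty = 0.1 * frames
    return tile_bonus - frame_penalty


# Full lap finished in 732 frames, as in the example above.
print(round(episode_reward(frames=732, tiles_visited=300, total_tiles=300), 1))  # 926.8
```

An agent that never reaches a new tile only accumulates the frame penalty, so its return gets more negative the longer the episode runs.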

The default **observation** is a single RGB image frame (96 x 96 pixels).

### Initialisation

If using Google Colab, you need to install packages: uncomment the lines below.

In [None]:
#!apt install swig cmake ffmpeg
#!apt-get install -y xvfb x11-utils
# !pip install stable-baselines3[extra] pyglet box2d box2d-kengz
# !pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate

For Google Colab, uncomment this cell to create a virtual rendering canvas so that render calls work (we still won't see a display!).

In [None]:
#import pyvirtualdisplay
#
#_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
#                                    size=(1400, 900))
#_ = _display.start()

Import required packages. 

In [1]:
import torch
import gymnasium as gym
import stable_baselines3 as sb3
from stable_baselines3.common.callbacks import EvalCallback

import pandas as pd  # For data frames and data frame manipulation

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)
import numpy as np  # For general  numeric operations

import matplotlib.pyplot as plt
%matplotlib inline 

### Create and Explore the Environment

Create the **CarRacing-v2** environment. Add wrappers to resize the images and convert to greyscale.

In [2]:
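def make_env(render: str = "rgb_array"):
    """Build a CarRacing-v2 environment with 84x84 greyscale observations."""
    _env = gym.make('CarRacing-v2', render_mode=render)
    # Downscale frames from 96x96 to 84x84.
    _env = gym.wrappers.ResizeObservation(_env, (84,84))
    # Convert RGB to greyscale, keeping a channel axis: (84, 84, 1).
    _env = gym.wrappers.gray_scale_observation.GrayScaleObservation(_env, keep_dim=True)
    # _env = gym.wrappers.TimeLimit(_env, max_episode_steps=2000)
    # Cap episode length. Note: gym.make already applies the registered limit
    # (1000 steps for CarRacing-v2, as far as I know), so this cap may never trigger.
    _env = gym.wrappers.TimeLimit(_env, max_episode_steps=1500)
    return _env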

In [4]:
env = make_env()

Explore the environment - view the action space and observation space.

In [11]:
env.action_space

Box([-1.  0.  0.], 1.0, (3,), float32)

In [12]:
env.observation_space

Box(0, 255, (84, 84, 1), uint8)

Play an episode of the environment using random actions

In [13]:
obs, _ = env.reset()
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()  # Random action
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward

env.close()
print(f"Total reward: {total_reward}")

Total reward: -30.656934306569767


### Single Image Agent
Create an agent that controls the car using a single image frame as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init': False}

We also recommend enabling **tensorboard** monitoring of the training process.

In [5]:
agent = sb3.PPO(
    "CnnPolicy",
    env,
    learning_rate=3e-5,
    n_steps=512,
    ent_coef=0.001,
    batch_size=128,
    gae_lambda=0.9,
    n_epochs=20,
    use_sde=True,
    sde_sample_freq=4,
    clip_range=0.4,
    policy_kwargs={'log_std_init': -2, 'ortho_init': False},
    tensorboard_log="./log_carracing_PPO/"
)
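
To get a feel for what these settings mean in practice, the sketch below works out how many gradient steps PPO takes per rollout and how many rollouts fit into a training run. It assumes a single environment and that training rounds up to whole rollouts; `ppo_update_budget` is an illustrative helper, not an SB3 function.

```python
import math


def ppo_update_budget(total_timesteps: int, n_steps: int, batch_size: int,
                      n_epochs: int, n_envs: int = 1) -> dict:
    """Back-of-the-envelope update counts for an on-policy PPO run."""
    rollout_size = n_steps * n_envs                      # transitions per rollout
    minibatches_per_epoch = rollout_size // batch_size   # passes per epoch
    gradient_steps_per_update = n_epochs * minibatches_per_epoch
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    return {
        "rollout_size": rollout_size,
        "minibatches_per_epoch": minibatches_per_epoch,
        "gradient_steps_per_update": gradient_steps_per_update,
        "n_rollouts": n_rollouts,
        "total_gradient_steps": n_rollouts * gradient_steps_per_update,
    }


budget = ppo_update_budget(total_timesteps=500_000, n_steps=512,
                           batch_size=128, n_epochs=20)
for name, value in budget.items():
    print(f"{name}: {value}")
```

With `n_epochs=20`, each batch of 512 transitions is reused for 80 gradient steps, which is a fairly aggressive amount of reuse and is one reason the small learning rate is a sensible pairing.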



Examine the actor and critic network architectures.

In [6]:
agent.policy

ActorCriticCnnPolicy(
  (features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(1, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=3136, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (pi_features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(1, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=3136, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (vf_features_extractor): NatureCNN(
    (cnn): 
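
The `in_features=3136` of the linear layer follows from applying the three unpadded convolutions to an 84 x 84 input. A small sketch of that arithmetic (helper names are illustrative; kernel sizes and strides are taken from the printout above):

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output width of an unpadded convolution along one spatial axis."""
    return (size - kernel) // stride + 1


def nature_cnn_features(side: int = 84, out_channels: int = 64) -> int:
    """Flattened feature count after the three NatureCNN convolutions."""
    for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
        side = conv_out(side, kernel, stride)
    return out_channels * side * side


print(nature_cnn_features(84))  # 84 -> 20 -> 9 -> 7, so 64 * 7 * 7 = 3136
```

The number of input channels does not affect this count, since only the spatial size and the final 64 output channels enter the calculation.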

Create an evaluation callback that is called at regular intervals and renders the episode.

In [7]:
eval_env = make_env()

eval_callback = EvalCallback(
    eval_env,
    eval_freq=10_000,
    render=True,
    best_model_save_path="./best_model/",
    log_path="./log_carracing_eval/",
)



Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [None]:
agent.learn(
    total_timesteps=500_000,
    callback=eval_callback
)
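
For planning purposes, it helps to know how often the evaluation callback will fire during this run. The sketch below assumes a single training environment and 5 episodes per evaluation round (which, to my knowledge, is the `EvalCallback` default for `n_eval_episodes`); `evaluation_schedule` is an illustrative helper.

```python
def evaluation_schedule(total_timesteps: int, eval_freq: int,
                        n_eval_episodes: int = 5) -> tuple[int, int]:
    """Return (evaluation rounds, total evaluation episodes) for one run."""
    rounds = total_timesteps // eval_freq
    return rounds, rounds * n_eval_episodes


rounds, episodes = evaluation_schedule(total_timesteps=500_000, eval_freq=10_000)
print(f"{rounds} evaluation rounds, {episodes} evaluation episodes")
```

Since every evaluation episode is rendered and can last many hundreds of steps, the evaluations add a noticeable amount of wall-clock time on top of training.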

Connect to the tensorboard log using **TensorBoard** from the command line to view training progress: 

`tensorboard --logdir ./log_carracing_PPO/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

Save the trained agent.

In [18]:
agent.save("ppo_carracing_trained")

For memory management, delete the old agent and environments (assumes these variable names; change if required).

In [9]:
del agent
del env
del eval_env

### Create Image Stack Agent

Create the CarRacing-v2 environment using wrappers to resize the images to 84 x 84 and convert them to greyscale. Also add a wrapper to create a stack of 4 frames.

In [3]:
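def make_env_tb(render: str = "rgb_array"):
    """Build a vectorised CarRacing-v2 environment that stacks 4 greyscale frames."""
    _env = gym.make('CarRacing-v2', render_mode=render)
    _env = gym.wrappers.resize_observation.ResizeObservation(_env, (84,84))
    _env = gym.wrappers.gray_scale_observation.GrayScaleObservation(_env, keep_dim=True)
    # _env = gym.wrappers.TimeLimit(_env, max_episode_steps=2000)
    _env = gym.wrappers.TimeLimit(_env, max_episode_steps=1500)

    # Record episode rewards and lengths for logging.
    _env = sb3.common.monitor.Monitor(_env)
    # Wrap as a single-environment vectorised env, since VecFrameStack needs a VecEnv.
    _env = sb3.common.vec_env.DummyVecEnv([lambda: _env])
    # Stack the 4 most recent frames along the channel axis.
    _env = sb3.common.vec_env.VecFrameStack(_env, n_stack=4)
    return _env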

In [11]:
env = make_env_tb()
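
The idea behind frame stacking can be shown with a small rolling buffer. This is a simplified sketch, not the SB3 implementation: it uses placeholder strings instead of image arrays, and padding with blank frames on reset is an assumption of the sketch. `FrameStack` is an illustrative class name.

```python
from collections import deque


class FrameStack:
    """Keep the n most recent frames, oldest first."""

    def __init__(self, n_stack: int = 4, blank=0):
        self.n_stack = n_stack
        self.blank = blank
        self.frames = deque([blank] * n_stack, maxlen=n_stack)

    def reset(self, first_frame):
        # Start from blank frames, then place the first real frame last.
        self.frames = deque([self.blank] * self.n_stack, maxlen=self.n_stack)
        self.frames.append(first_frame)
        return list(self.frames)

    def step(self, frame):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(n_stack=4)
print(stack.reset("f0"))
for name in ["f1", "f2", "f3", "f4"]:
    print(stack.step(name))
```

Because each observation now contains several consecutive frames, the difference between them carries information about speed and turning that a single frame cannot.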

Create an agent that controls the car using a stack of input image frames as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init': False}

We also recommend enabling **tensorboard** monitoring of the training process.

In [12]:
agent = sb3.PPO(
    "CnnPolicy",
    env,
    learning_rate=3e-5,
    n_steps=512,
    ent_coef=0.001,
    batch_size=128,
    gae_lambda=0.9,
    n_epochs=20,
    use_sde=True,
    sde_sample_freq=4,
    clip_range=0.4,
    policy_kwargs={'log_std_init': -2, 'ortho_init': False},
    tensorboard_log="./log_tb_carracing_PPO/",
)

Examine the actor and critic network architectures.

In [13]:
agent.policy

ActorCriticCnnPolicy(
  (features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=3136, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (pi_features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=3136, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (vf_features_extractor): NatureCNN(
    (cnn): 

Create an evaluation callback that is called at regular intervals and renders the episode.

In [14]:
eval_env = make_env_tb()

eval_callback = EvalCallback(
    eval_env,
    eval_freq=10_000,
    render=True,
    best_model_save_path="./best_model_tb/",
    log_path="./log_tb_carracing_eval/",
)

Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [None]:
agent.learn(
    total_timesteps=500_000,
    callback=eval_callback
)

Connect to the tensorboard log using **TensorBoard** from the command line to view training progress: 

`tensorboard --logdir ./log_tb_carracing_PPO/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

Save the trained agent.

In [8]:
agent.save("ppo_tb_carracing_trained")

For memory management, delete the old agent and environments (assumes these variable names; change if required).

In [None]:
del agent
del env
del eval_env

### Evaluation

Load the saved single-image agent.

In [7]:
agent = sb3.PPO.load("v6/ppo_carracing_trained.zip")
agent_best = sb3.PPO.load("v6/best_model/best_model.zip")

Set up the single-image environment for evaluation.

In [8]:
eval_env = make_env()
n_episodes = 30

Evaluate the agent in the environment for 30 episodes, rendering the process.

In [9]:
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent,
                                                                eval_env,
                                                                n_eval_episodes=n_episodes,
                                                                render=True)
print("Trained Agent Mean Reward: {} +/- {}".format(mean_reward, std_reward))

mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent_best,
                                                                eval_env,
                                                                n_eval_episodes=n_episodes,
                                                                render=True)
print("Best Agent Mean Reward: {} +/- {}".format(mean_reward, std_reward))



Trained Agent Mean Reward: -93.17763194491467 +/- 0.5450886293190688
Best Agent Mean Reward: 99.82978916689754 +/- 125.32855192728414


For memory management delete the single image agent (assumes variable names - change if required).

In [10]:
del agent
del agent_best
del eval_env

Load the image stack agent

In [11]:
agent = sb3.PPO.load("v6/ppo_tb_carracing_trained.zip")
agent_best = sb3.PPO.load("v6/best_model_tb/best_model.zip")

Set up the image stack environment

In [12]:
eval_env = make_env_tb()
n_episodes = 30

Evaluate the agent in the environment for 30 episodes, rendering the process.

In [14]:
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent,
                                                                eval_env,
                                                                n_eval_episodes=n_episodes,
                                                                render=True)
print("Trained Agent Mean Reward: {} +/- {}".format(mean_reward, std_reward))

mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent_best,
                                                                eval_env,
                                                                n_eval_episodes=n_episodes,
                                                                render=True)
print("Best Agent Mean Reward: {} +/- {}".format(mean_reward, std_reward))

Trained Agent Mean Reward: -86.01198786666667 +/- 14.52192114452869
Best Agent Mean Reward: 378.30467026666673 +/- 138.96342264659938
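
As a rough check on how far apart the two best-checkpoint results are, the reported means and standard deviations can be turned into standard errors and a z-score. This is a back-of-the-envelope sketch: it assumes the 30 evaluation episodes are independent and that the `+/-` figures are per-episode standard deviations. `standard_error` and `z_score` are illustrative helper names.

```python
import math


def standard_error(std: float, n: int) -> float:
    """Standard error of a mean estimated from n episodes."""
    return std / math.sqrt(n)


def z_score(mean_a: float, std_a: float, mean_b: float, std_b: float, n: int) -> float:
    """Difference of two means in units of their combined standard error."""
    combined = math.hypot(standard_error(std_a, n), standard_error(std_b, n))
    return (mean_a - mean_b) / combined


# Best image-stack checkpoint vs best single-image checkpoint, 30 episodes each.
z = z_score(378.30, 138.96, 99.83, 125.33, n=30)
print(f"difference = {378.30 - 99.83:.2f}, z = {z:.2f}")
```

The gap comes out at roughly 8 combined standard errors, so under these assumptions it is unlikely to be an artefact of episode-to-episode noise, even though the per-episode spread is large.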


### Reflection

Reflect on which  agent performs better at the task, and the training process involved (max 200 words).

In this task, we compared two agents trained on the `CarRacing-v2` environment: one using single grayscale frames and the other using a stack of four consecutive grayscale frames. Although both agents struggled during training (showing suboptimal behaviors such as doing donuts), the best checkpoint of the stacked-image agent performed far better in evaluation: it achieved a mean reward of `378.30 +/- 138.96`, compared to the single-image agent's `99.83 +/- 125.33`.

The training environments were nearly identical, with the primary difference being the use of `VecFrameStack` for the stacked agent. Stacking frames allowed the agent to infer motion and momentum, providing better context for decision-making in a dynamic environment. In contrast, the single-image agent lacked this temporal awareness, which likely made it harder to learn effective driving strategies.

Despite the training being mostly ineffective, the final evaluation suggests that the stacked-image agent is far superior for this task. The stark difference in reward highlights the importance of temporal information in environments like CarRacing, where understanding speed, direction, and continuity between frames matters.

Overall, the stacked-image approach appears to be the more effective input representation for this continuous control task.