# COMP47590 Advanced Machine Learning

## Assignment 2: Going the Distance
Uses the PPO actor-critic method to train a neural network to control a simple robot in the CarRacing environment from OpenAI Gym (https://gym.openai.com/envs/RacingCar-v0/).

Students:

* Carl Winkler. Student Number: 20207528 
* David Moreno Borràs. Student Number: 21200646

![Racing](racing_car.gif)

There are five discrete **actions** in this environment:
- left (0)
- right (1)
- brake (2)
- accelerate (3)
- none (4)

A **reward** of -0.1 is given every frame and +1000/N for every track tile visited, where N is the total number of tiles in the track. For example, if you finish in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points.
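
The reward rule above can be checked with a few lines of Python. This is only a sketch: `episode_reward` is a hypothetical helper (not part of Gym), and it assumes the episode total is simply the tile bonus minus the per-frame penalty.

```python
def episode_reward(frames: int, tiles_visited: int, total_tiles: int) -> float:
    """Total reward: +1000/N per visited tile, -0.1 per frame (N = total tiles)."""
    tile_bonus = 1000.0 * tiles_visited / total_tiles
    frame_penalty = 0.1 * frames
    return tile_bonus - frame_penalty


# Worked example from the text: every tile visited, finished in 732 frames.
print(round(episode_reward(frames=732, tiles_visited=297, total_tiles=297), 1))  # 926.8
```

Under the same assumption, an agent that visits no tiles during a 1000-frame episode would end at about -100, which is why rewards close to that value indicate an agent that barely progresses along the track.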

And the **state** is represented using a single image frame (96 * 96).

<span style="color:blue">
    
## Conventions
    
- We will highlight our texts in this notebook in **blue**
- Tensorboard is shown inline
- The best models can be found in these folders: best_model_multi, best_modelsC, best_modelsD
    
## Foreword: How we conducted the work for this assignment
    
We first started working with Google Colab but ran into some problems, especially with the display. We solved this by using a monitor wrapper and saving the result in an MP4 file. We finally decided to move back to local execution to find the best agents, because the basic version of Google Colab restricts GPU usage after some training time and only allows 90 minutes of background execution.
    
In each section we describe in more detail how we approached the task. For the final evaluation, we tested 4 different models (a single image or 4 stacked images, each with an MLP or CNN policy) on two different laptops. This is explained in more detail in the evaluation section, where we compare the resulting best models.

</span>

## Initialisation

If using Google Colab you need to install packages - uncomment the lines below.

In [None]:
## We used this in the colab environment

#!apt install swig cmake ffmpeg
#!apt-get install -y xvfb x11-utils
#!pip install stable-baselines3[extra] pyglet box2d box2d-kengz
#!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate

For Google Colab, uncomment this cell to make a virtual rendering canvas so render calls work (we still won't see the display!)

In [None]:
#import pyvirtualdisplay
#
#_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
#                                    size=(1400, 900))
#_ = _display.start()

Import required packages. 

In [1]:
import torch 
import gym
import stable_baselines3 as sb3

import pandas as pd # For data frames and data frame manipulation
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)
import numpy as np # For general  numeric operations
import os

import matplotlib.pyplot as plt
%matplotlib inline 

In [2]:
# Imports for MP4 rendering
import io
import base64

from pathlib import Path

from IPython.display import HTML
from IPython import display as ipythondisplay

from gym import wrappers
from stable_baselines3.common.vec_env import VecVideoRecorder


# Imports for inline Tensorboard
%load_ext tensorboard
import datetime, os


In [3]:
# In colab we ensure to use the GPU for training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Create and Explore the Environment

Create the **CarRacing-v0** environment. Add wrappers to resize the images and convert to greyscale.

In [4]:
env = gym.make('CarRacing-v0')
env = gym.wrappers.resize_observation.ResizeObservation(env, 64)
env = gym.wrappers.gray_scale_observation.GrayScaleObservation(env, keep_dim = True)

# This is the env we use to monitor when we want a video of the agent
render_env =  wrappers.Monitor(env, "./gym-results", force=True)

Explore the environment - view the action space and observation space.

In [5]:
print("action_space: ", env.action_space)

action_space:  Box([-1.  0.  0.], [1. 1. 1.], (3,), float32)


In [6]:
print("env.observation_space shape: ",env.observation_space.shape)

env.observation_space shape:  (64, 64, 1)


Play an episode of the environment using random actions

In [8]:
obs = render_env.reset()
done = False

while not done:
    action = render_env.action_space.sample()
    obs, reward, done, info = render_env.step(action)
    render_env.render()

render_env.close()

Track generation: 1144..1441 -> 297-tiles track


In [None]:
## Show the MP4 of an episode doing random actions. Not done locally
# show_render_result(render_env)
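
The helper `show_render_result` referenced in the cell above is not defined in this notebook. As a rough illustration of how a recorded MP4 could be shown inline, here is a minimal sketch using only the standard library. The function names `mp4_to_html` and `latest_mp4` are hypothetical, and the sketch assumes the monitor wrapper has written `.mp4` files into `./gym-results`.

```python
import base64
from pathlib import Path


def mp4_to_html(video_path: str, width: int = 400) -> str:
    """Return an HTML <video> tag with the MP4 embedded as base64 data."""
    encoded = base64.b64encode(Path(video_path).read_bytes()).decode("ascii")
    return (
        f'<video width="{width}" controls>'
        f'<source src="data:video/mp4;base64,{encoded}" type="video/mp4">'
        "</video>"
    )


def latest_mp4(folder: str = "./gym-results"):
    """Return the most recently modified .mp4 in `folder`, or None if there is none."""
    videos = sorted(Path(folder).glob("*.mp4"), key=lambda p: p.stat().st_mtime)
    return videos[-1] if videos else None
```

In a notebook the returned string could then be passed to `HTML(...)` from `IPython.display` (imported earlier) to display the video, after checking that `latest_mp4()` did not return `None`.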

<span style="color:blue">
    
## Exploration of the environment and action space
    
The environment is a racetrack on which a car drives. After each step it returns a 96x96 image as the observation. We preprocess this observation by shrinking it to 64x64 pixels and converting it to greyscale. The agent receives this observation and chooses an action. In the environment as created above, the action is a continuous 3-dimensional vector (see the printed `Box` action space) rather than one of a set of discrete choices.

</span>
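
To make the printed action space more concrete, the sketch below clips a raw 3-vector to the bounds shown in the output, `Box([-1, 0, 0], [1, 1, 1])`, and labels each component. The component order (steering, gas, brake) is an assumption that does not appear in the notebook output, and `describe_action` is a hypothetical helper.

```python
# Bounds copied from the printed action space: Box([-1, 0, 0], [1, 1, 1]).
LOW = (-1.0, 0.0, 0.0)
HIGH = (1.0, 1.0, 1.0)
NAMES = ("steering", "gas", "brake")  # assumed component order


def describe_action(action):
    """Clip a raw 3-vector to the bounds and label each component."""
    clipped = [min(max(a, lo), hi) for a, lo, hi in zip(action, LOW, HIGH)]
    return dict(zip(NAMES, clipped))


print(describe_action([-1.5, 0.8, -0.2]))
# {'steering': -1.0, 'gas': 0.8, 'brake': 0.0}
```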

### Single Image Agent
Create an agent that controls the car using a single image frame as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False},

We also recommend enabling **tensorboard** monitoring of the training process.

In [9]:
tb_log = './tb_logs_SingleFrame_Training/'

#policy = 'MlpPolicy'
policy = 'CnnPolicy'

agent = sb3.PPO(policy, env,
                    learning_rate = 3e-5,
                    n_steps = 512,
                    ent_coef = 0.001,
                    batch_size = 128,
                    gae_lambda =  0.9,
                    n_epochs = 20,
                    use_sde = True,
                    sde_sample_freq = 4,
                    clip_range = 0.4,
                    policy_kwargs = {'log_std_init': -2, 'ortho_init':False},
                    tensorboard_log=tb_log)
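
To get a feeling for what these hyper-parameters imply, the following sketch counts rollouts and gradient updates. It assumes that PPO collects `n_steps * n_envs` transitions per rollout and then makes `n_epochs` passes over them in minibatches of `batch_size`. `ppo_update_counts` is a hypothetical helper, not an sb3 function.

```python
import math


def ppo_update_counts(total_timesteps, n_steps, batch_size, n_epochs, n_envs=1):
    """Return (rollout size, gradient updates per rollout, number of rollouts)."""
    rollout_size = n_steps * n_envs
    minibatches = rollout_size // batch_size
    updates_per_rollout = minibatches * n_epochs
    rollouts = math.ceil(total_timesteps / rollout_size)
    return rollout_size, updates_per_rollout, rollouts


print(ppo_update_counts(500_000, n_steps=512, batch_size=128, n_epochs=20))
# (512, 80, 977)
```

Under these assumptions a full 500000-timestep run performs 977 rollouts of 512 steps, each followed by 80 gradient updates, while the short 10000-timestep demonstration covers only 20 rollouts.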

Examine the actor and critic network architectures.

In [10]:
print(agent.policy)

ActorCriticCnnPolicy(
  (features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(1, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (mlp_extractor): MlpExtractor(
    (shared_net): Sequential()
    (policy_net): Sequential()
    (value_net): Sequential()
  )
  (action_net): Linear(in_features=512, out_features=3, bias=True)
  (value_net): Linear(in_features=512, out_features=1, bias=True)
)
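
The `in_features=1024` of the linear layer follows from the three convolutions printed above. The sketch below redoes that arithmetic for an unpadded convolution, using the kernel sizes, strides and channel counts from the printout. The helper names are hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output side length of an unpadded convolution."""
    return (size - kernel) // stride + 1


def nature_cnn_flat_features(side: int = 64) -> int:
    """Number of features after the Flatten layer for a square input image."""
    # (kernel, stride, output channels) as printed in the policy summary
    layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]
    channels = 0
    for kernel, stride, channels in layers:
        side = conv_out(side, kernel, stride)
    return channels * side * side


print(nature_cnn_flat_features(64))  # 1024
```

The side length shrinks from 64 to 15, then 6, then 4, so the flattened output has 64 * 4 * 4 = 1024 values. The number of input channels does not affect this size.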


<span style="color:blue">
    
## Training of the Agents
    
We trained both agent types for the suggested 500000 timesteps, which takes a significant amount of time. The resulting agents are stored in the *best_models* folders, which we go over later in the Evaluation section. We trained both the single-frame agent and the stacked-frame agent with an MLP and a CNN policy network to see which performs better. As the input data is an image, we expect the agents with a CNN policy to perform better.

In the following section we show how the training of these agents works. Therefore, in this notebook, we train them for a low number of timesteps (10000).
    
The best models are evaluated in the last section. We trained them by executing separate Python files, so that we could run multiple trainings at the same time.

</span>

Create an evaluation callback that is called at regular intervals and evaluates the agent.

In [11]:
eval_env = gym.make('CarRacing-v0')
eval_env = gym.wrappers.resize_observation.ResizeObservation(eval_env, 64)
eval_env = gym.wrappers.gray_scale_observation.GrayScaleObservation(eval_env, keep_dim = True)

# Using MLP policy change: best_model_save_path='./best_model_MLP_Single/'
eval_callback = sb3.common.callbacks.EvalCallback(eval_env, 
                                                  best_model_save_path='./best_model_CNN_Single/',
                                                  log_path=tb_log, 
                                                  eval_freq=5000,
                                                  render=False)

Train the model for a large number of timesteps (500,000 timesteps will probably work well).

In [12]:
%tensorboard --logdir ./tb_logs_SingleFrame_Training/

In [13]:
agent.learn(total_timesteps=10000,callback=eval_callback)

Track generation: 1179..1478 -> 299-tiles track


2022-04-26 01:02:59.922162: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/david/.local/lib/python3.8/site-packages/cv2/../../lib64:
2022-04-26 01:02:59.922190: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


Track generation: 1116..1399 -> 283-tiles track
Track generation: 1125..1410 -> 285-tiles track
Track generation: 1164..1459 -> 295-tiles track
Track generation: 1067..1338 -> 271-tiles track
Track generation: 1085..1361 -> 276-tiles track
Track generation: 1055..1323 -> 268-tiles track




Track generation: 1157..1457 -> 300-tiles track
Track generation: 1227..1538 -> 311-tiles track
Track generation: 1285..1610 -> 325-tiles track
Track generation: 1047..1320 -> 273-tiles track
Track generation: 1255..1573 -> 318-tiles track
Eval num_timesteps=5000, episode_reward=-84.93 +/- 2.27
Episode length: 1000.00 +/- 0.00
New best mean reward!
Track generation: 1084..1359 -> 275-tiles track
Track generation: 978..1234 -> 256-tiles track
Track generation: 1057..1325 -> 268-tiles track
Track generation: 1065..1344 -> 279-tiles track
Track generation: 1244..1559 -> 315-tiles track
Track generation: 978..1233 -> 255-tiles track
Track generation: 1027..1293 -> 266-tiles track
Track generation: 1064..1334 -> 270-tiles track
Track generation: 1220..1529 -> 309-tiles track
Track generation: 1153..1445 -> 292-tiles track
Track generation: 1033..1295 -> 262-tiles track
Eval num_timesteps=10000, episode_reward=-81.89 +/- 1.24
Episode length: 1000.00 +/- 0.00
New best mean reward!


<stable_baselines3.ppo.ppo.PPO at 0x7f8cd230d910>

Save the trained agent.

In [14]:
# Using MLP policy change to: agent.save("./final_models/final_model_MLP_single")
agent.save("./final_models/final_model_CNN_single")



And here we can see a quick execution to see how well it performs:

In [15]:
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent, 
                                                                agent.get_env(), 
                                                                n_eval_episodes=15,
                                                                render = True)
print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))

Track generation: 1099..1382 -> 283-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1167..1463 -> 296-tiles track
Track generation: 1132..1419 -> 287-tiles track
Track generation: 1187..1488 -> 301-tiles track
Track generation: 1123..1407 -> 284-tiles track
Track generation: 1151..1452 -> 301-tiles track
Track generation: 961..1205 -> 244-tiles track
Track generation: 1083..1358 -> 275-tiles track
Track generation: 1060..1338 -> 278-tiles track
Track generation: 1162..1456 -> 294-tiles track
Track generation: 1102..1385 -> 283-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1075..1348 -> 273-tiles track
Track generation: 1132..1419 -> 287-tiles track
Track generation: 1184..1484 -> 300-tiles track
Track generation: 1192..1504 -> 312-tiles track
Track generation: 1063..1335 -> 272-tiles track
retry to generate track (normal if there are not manyinstances of this me

For memory management delete old agent and environment (assumes variable names - change if required).

In [16]:
del agent
del env
del eval_env
del render_env

### Create Image Stack Agent

Create the CarRacing-v0 environment using wrappers to resize the images to 64 x 64 and change to greyscale. Also add a wrapper to create a stack of 4 frames. 

In [17]:
# Create Stacked env
env = gym.make('CarRacing-v0')
env = gym.wrappers.resize_observation.ResizeObservation(env, 64)
env = gym.wrappers.gray_scale_observation.GrayScaleObservation(env, keep_dim = True)
env = sb3.common.vec_env.DummyVecEnv([lambda: env]) 
env = sb3.common.vec_env.VecFrameStack(env, n_stack=4)

# Separate evaluation env
eval_env = gym.make('CarRacing-v0')
eval_env = gym.wrappers.resize_observation.ResizeObservation(eval_env, 64)
eval_env = gym.wrappers.gray_scale_observation.GrayScaleObservation(eval_env, keep_dim = True)
eval_env = sb3.common.vec_env.DummyVecEnv([lambda: eval_env]) 
eval_env = sb3.common.vec_env.VecFrameStack(eval_env, n_stack=4)


Create an agent that controls the car using a stack of input image frames as the state input. We recommend a PPO agent with the following hyper-parameters (although you can experiment):
- learning_rate = 3e-5
- n_steps = 512
- ent_coef = 0.001
- batch_size = 128
- gae_lambda =  0.9
- n_epochs = 20
- use_sde = True
- sde_sample_freq = 4
- clip_range = 0.4
- policy_kwargs = {'log_std_init': -2, 'ortho_init':False},

We also recommend enabling **tensorboard** monitoring of the training process.

In [18]:
tb_log = './tb_logs_StackFrame_Training/'

#policy = 'MlpPolicy'
policy = 'CnnPolicy'

agent = sb3.PPO(policy, env,
                    learning_rate = 3e-5,
                    n_steps = 512,
                    ent_coef = 0.001,
                    batch_size = 128,
                    gae_lambda =  0.9,
                    n_epochs = 20,
                    use_sde = True,
                    sde_sample_freq = 4,
                    clip_range = 0.4,
                    policy_kwargs = {'log_std_init': -2, 'ortho_init':False},
                    tensorboard_log=tb_log)

Examine the actor and critic network architectures.

In [19]:
print(agent.policy)

ActorCriticCnnPolicy(
  (features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (mlp_extractor): MlpExtractor(
    (shared_net): Sequential()
    (policy_net): Sequential()
    (value_net): Sequential()
  )
  (action_net): Linear(in_features=512, out_features=3, bias=True)
  (value_net): Linear(in_features=512, out_features=1, bias=True)
)


<span style="color:blue">
      
## On the new actor and critic network
        
The difference we can see is that the first convolutional layer now has 4 input channels instead of 1: the network takes 4 64x64 frames representing 4 consecutive observations from the environment. This gives the agent some information about what happened in the recent past, which, in theory, should make it easier to find the best action in the current time step. This is investigated in the next sections.
    
</span>
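
The idea of frame stacking can be sketched in a few lines. This is not the `VecFrameStack` implementation, only a simplified illustration: strings stand in for 64x64 images, and filling the missing history with blank frames after a reset is an assumption. The class name `FrameStack` is hypothetical.

```python
from collections import deque


class FrameStack:
    """Keep the n most recent frames; missing history is filled with a blank frame."""

    def __init__(self, n_stack, blank):
        self.n_stack = n_stack
        self.blank = blank
        self.frames = deque([blank] * n_stack, maxlen=n_stack)

    def reset(self, first_frame):
        # Start a new episode: forget old frames, keep only the first observation.
        self.frames = deque([self.blank] * self.n_stack, maxlen=self.n_stack)
        self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(new_frame)
        return list(self.frames)


stack = FrameStack(n_stack=4, blank="blank")
print(stack.reset("f0"))  # ['blank', 'blank', 'blank', 'f0']
for name in ("f1", "f2", "f3", "f4"):
    obs = stack.step(name)
print(obs)  # ['f1', 'f2', 'f3', 'f4']
```

In the real wrapper the frames are arrays combined along the channel axis, which is why the first convolution takes 4 channels.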

Create an evaluation callback that is called at regular intervals and evaluates the agent.

In [20]:
eval_callback = sb3.common.callbacks.EvalCallback(eval_env, 
                                                  best_model_save_path='./best_model_CNN_4Stack/',
                                                  log_path=tb_log, 
                                                  eval_freq=5000,
                                                  render=False)

<span style="color:blue">

## Training
    
Here we train for only 10000 timesteps, as mentioned, to showcase how we trained the agents. The results of training for 500,000 timesteps can be seen in the evaluation section. As this computation is very time-consuming, we trained different models on different machines.

</span>


In [49]:
%tensorboard --logdir ./tb_logs_StackFrame_Training/

In [50]:
agent.learn(total_timesteps=10000, callback=eval_callback)

Track generation: 1063..1339 -> 276-tiles track




Track generation: 1304..1634 -> 330-tiles track
Track generation: 1048..1314 -> 266-tiles track
Track generation: 1051..1318 -> 267-tiles track
Track generation: 1252..1569 -> 317-tiles track




Track generation: 1081..1355 -> 274-tiles track
Track generation: 1136..1424 -> 288-tiles track
Track generation: 1112..1394 -> 282-tiles track
Track generation: 1265..1585 -> 320-tiles track
Track generation: 1262..1585 -> 323-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1219..1528 -> 309-tiles track
Track generation: 1307..1638 -> 331-tiles track
Eval num_timesteps=4403, episode_reward=-86.33 +/- 0.79
Episode length: 1000.00 +/- 0.00
New best mean reward!
Track generation: 1128..1414 -> 286-tiles track
Track generation: 1116..1399 -> 283-tiles track
Track generation: 1114..1397 -> 283-tiles track
Track generation: 1304..1634 -> 330-tiles track
Track generation: 1237..1556 -> 319-tiles track
Track generation: 1088..1371 -> 283-tiles track
Track generation: 1235..1548 -> 313-tiles track
Track generation: 1213..1520 -> 307-tiles track
Track generation: 1174..1472 -> 298-tiles track
Track generation: 1234..1551 -> 317-tiles

<stable_baselines3.ppo.ppo.PPO at 0x7fae1202d790>

Save the trained agent and test its performance:

In [51]:
# Using MLP policy change to: agent.save("./final_models/final_model_MLP_4Stack")
agent.save("./final_models/final_model_CNN_4Stack")

mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent, 
                                                                agent.get_env(), 
                                                                n_eval_episodes=15,
                                                                render = True)
print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))

Track generation: 1124..1409 -> 285-tiles track
Track generation: 1100..1379 -> 279-tiles track
Track generation: 1073..1349 -> 276-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1520..1904 -> 384-tiles track
Track generation: 1140..1429 -> 289-tiles track
Track generation: 1015..1278 -> 263-tiles track
Track generation: 1177..1475 -> 298-tiles track
Track generation: 983..1239 -> 256-tiles track
Track generation: 1048..1314 -> 266-tiles track
Track generation: 1197..1500 -> 303-tiles track
Track generation: 1126..1411 -> 285-tiles track
Track generation: 1087..1363 -> 276-tiles track
Track generation: 1065..1335 -> 270-tiles track
Track generation: 1148..1439 -> 291-tiles track
Track generation: 1136..1424 -> 288-tiles track
Track generation: 1043..1313 -> 270-tiles track
Track generation: 1149..1446 -> 297-tiles track
Mean Reward: -26.68511994878451 +/- 41.61278147827246


For memory management delete old agent and environment (assumes variable names - change if required).

In [None]:
del agent
del env
del eval_env

## Evaluation


Now that we have trained and saved multiple agents with different settings, we will proceed to evaluate each of them for 30 episodes and compute the mean reward to test which one yields better results.

In particular we have the following four combinations:

- MLP with a single image
- MLP with 4 stacked images
- CNN with a single image
- CNN with 4 stacked images

We trained these on two different computers for 500000 timesteps.

Let's define the possible combinations:

In [15]:
policies = ['MLP', 'CNN']
images = ['single', '4stack']

bestModelsPath = [('./best_modelsC/','PC1'), ('./best_modelsD/','PC2')]

Method to load the requested agent and create its corresponding environment: 

In [5]:
def loadAgent(modelPath, policy, images):
    modelName = 'best_model_' + policy + '_' + images 
    
    if not os.path.isfile(modelPath + modelName + '.zip'):
        return -1, -1
    
    agent = sb3.ppo.PPO.load(modelPath + modelName)
    
    print('Loading', modelPath + modelName)

    render_env = gym.make('CarRacing-v0')
    render_env = gym.wrappers.resize_observation.ResizeObservation(render_env, 64)
    render_env = gym.wrappers.gray_scale_observation.GrayScaleObservation(render_env, keep_dim = True)

    if images == '4stack':
        render_env = sb3.common.vec_env.DummyVecEnv([lambda: render_env]) 
        render_env = sb3.common.vec_env.VecFrameStack(render_env, n_stack=4)

    # render_env = wrappers.Monitor(render_env, "./gym-results", force=True)

    agent.set_env(render_env)
    return (agent, render_env)


Iterate over all possible combinations and execute them for 30 episodes, saving the results in a DataFrame:

In [5]:
results = {'MLP_single_PC1':{}, 'MLP_4stack_PC1':{}, 'CNN_single_PC1':{}, 'CNN_4stack_PC1':{},
           'MLP_single_PC2':{}, 'MLP_4stack_PC2':{}, 'CNN_single_PC2':{}, 'CNN_4stack_PC2':{}}

for modelPath, pathId in bestModelsPath:
    for policy in policies:
        for img in images:
            model_id = policy + '_' + img + '_' + pathId
            print("Testing", img, "image with", policy, "policy", "for", modelPath)
            agent, env = loadAgent(modelPath, policy, img)
            
            if agent == -1:
                print("Not found!")
                results[model_id]['Mean'] = -1
                results[model_id]['std'] = -1
                continue

            mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent, 
                                                                            agent.get_env(), 
                                                                            n_eval_episodes=30,
                                                                            render = True)
            print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))
            
            results[model_id]['Mean'] = mean_reward
            results[model_id]['std'] = std_reward
            del agent
            del env

resultsDF = pd.DataFrame(results)

Testing single image with MLP policy for ./best_modelsC/
Loading ./best_modelsC/best_model_MLP_single
Track generation: 1174..1471 -> 297-tiles track
Track generation: 1041..1315 -> 274-tiles track
Track generation: 1104..1385 -> 281-tiles track
Track generation: 1141..1437 -> 296-tiles track
Track generation: 1081..1363 -> 282-tiles track
Track generation: 1263..1583 -> 320-tiles track
Track generation: 1119..1403 -> 284-tiles track
Track generation: 1127..1419 -> 292-tiles track
Track generation: 1043..1308 -> 265-tiles track
Track generation: 1073..1345 -> 272-tiles track
Track generation: 1377..1725 -> 348-tiles track
Track generation: 1234..1547 -> 313-tiles track
Track generation: 1149..1440 -> 291-tiles track
Track generation: 1092..1369 -> 277-tiles track
Track generation: 1069..1340 -> 271-tiles track
Track generation: 1034..1297 -> 263-tiles track
Track generation: 1161..1459 -> 298-tiles track
Track generation: 1165..1465 -> 300-tiles track
Track generation: 1057..1326 -> 26



Track generation: 1266..1587 -> 321-tiles track
Track generation: 1088..1364 -> 276-tiles track
Track generation: 1056..1324 -> 268-tiles track
Track generation: 1029..1290 -> 261-tiles track
Track generation: 1152..1444 -> 292-tiles track
Track generation: 1196..1499 -> 303-tiles track
Track generation: 1163..1458 -> 295-tiles track
Track generation: 1093..1370 -> 277-tiles track
Track generation: 1052..1318 -> 266-tiles track
Track generation: 1180..1479 -> 299-tiles track
Track generation: 1351..1693 -> 342-tiles track
Track generation: 1141..1435 -> 294-tiles track
Track generation: 1283..1608 -> 325-tiles track
Track generation: 1135..1423 -> 288-tiles track
Track generation: 1109..1396 -> 287-tiles track
Track generation: 1196..1499 -> 303-tiles track
Track generation: 1028..1289 -> 261-tiles track
Track generation: 977..1233 -> 256-tiles track
Track generation: 958..1209 -> 251-tiles track
Track generation: 1319..1653 -> 334-tiles track
Track generation: 1172..1469 -> 297-tiles 

And these are the results we obtain after 30 episodes for each possibility:

In [7]:
print(resultsDF)

      MLP_single_PC1  MLP_4stack_PC1  CNN_single_PC1  CNN_4stack_PC1  MLP_single_PC2  MLP_4stack_PC2  CNN_single_PC2  CNN_4stack_PC2
Mean      -82.653779      564.247102              -1      675.449728      226.858870      611.919312      320.686394      449.626848
std         1.195378      232.277129              -1      208.861736      125.156839      221.358447      116.108880       59.669726


<span style="color:blue">
    
## Visual investigation of the best agents

Here we look at renderings of the following agents for a few episodes:

- CNN_single_PC2
- MLP_single_PC2
- CNN_4stack_PC1
- MLP_4stack_PC2
</span>

In [23]:
# Load the agents
modelPath = "./best_modelsD/"
policy = "MLP"
img = "single"
MLP_single_PC2_agent, MLP_single_PC2_env = loadAgent(modelPath, policy, img)

modelPath = "./best_modelsD/"
policy = "CNN"
img = "single"
CNN_single_PC2_agent, CNN_single_PC2_env = loadAgent(modelPath, policy, img)

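# NOTE: the variable names of the next two agents do not match what is loaded.
# MLP_4stack_PC2_agent actually holds the CNN 4-stack model from ./best_modelsC/ (PC1).
modelPath = "./best_modelsC/"
policy = "CNN"
img = "4stack"
MLP_4stack_PC2_agent, MLP_4stack_PC2_env = loadAgent(modelPath, policy, img)

# CNN_4stack_PC2_agent actually holds the MLP 4-stack model from ./best_modelsD/ (PC2).
modelPath = "./best_modelsD/"
policy = "MLP"
img = "4stack"
CNN_4stack_PC2_agent, CNN_4stack_PC2_env = loadAgent(modelPath, policy, img)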

agents = [MLP_single_PC2_agent, CNN_single_PC2_agent, MLP_4stack_PC2_agent, CNN_4stack_PC2_agent] 
envs = [MLP_single_PC2_env, CNN_single_PC2_env, MLP_4stack_PC2_env, CNN_4stack_PC2_env]

Loading ./best_modelsD/best_model_MLP_single
Loading ./best_modelsD/best_model_CNN_single
Loading ./best_modelsC/best_model_CNN_4stack
Loading ./best_modelsD/best_model_MLP_4stack


In [25]:
# Render five episodes for each agent
namelist = ["MLP_single_PC2_agent", "CNN_single_PC2_agent", "MLP_4stack_PC2_agent", "CNN_4stack_PC2_agent"]
for idx, agent in enumerate(agents):
    env = envs[idx]
    
    obs = env.reset()
    agent.set_env(env)
    print("Showing behaviour of: ", namelist[idx])
    mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent, 
                                                                    agent.get_env(), 
                                                                    n_eval_episodes=5,
                                                                    render = True)   
    env.close()


Track generation: 1155..1448 -> 293-tiles track
Showing behaviour of:  MLP_single_PC2_agent
Track generation: 1125..1410 -> 285-tiles track
Track generation: 1163..1458 -> 295-tiles track
Track generation: 1264..1584 -> 320-tiles track
Track generation: 1184..1484 -> 300-tiles track
Track generation: 1067..1338 -> 271-tiles track
Track generation: 1102..1386 -> 284-tiles track
Track generation: 1120..1413 -> 293-tiles track
Showing behaviour of:  CNN_single_PC2_agent
Track generation: 1036..1303 -> 267-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1105..1385 -> 280-tiles track
Track generation: 1207..1513 -> 306-tiles track
Track generation: 1224..1534 -> 310-tiles track
Track generation: 1049..1315 -> 266-tiles track
Track generation: 1238..1551 -> 313-tiles track
Track generation: 1042..1316 -> 274-tiles track
Track generation: 1207..1513 -> 306-tiles track
Showing behaviour of:  MLP_4stack_PC2_agent
Track generation: 11

<span style="color:blue">

CNN_4stack_PC1 often performs very well, so here we run it for 5 more episodes to get a better idea of its behaviour.

</span>

In [26]:
# The CNN 4-stack model from PC1 was loaded above into the variable MLP_4stack_PC2_agent
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(MLP_4stack_PC2_agent, 
                                                                MLP_4stack_PC2_agent.get_env(), 
                                                                n_eval_episodes=5,
                                                                render = True)  

Track generation: 1119..1403 -> 284-tiles track
Track generation: 1276..1601 -> 325-tiles track
Track generation: 1009..1275 -> 266-tiles track
Track generation: 973..1268 -> 295-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1224..1534 -> 310-tiles track
Track generation: 1092..1369 -> 277-tiles track
Track generation: 1057..1330 -> 273-tiles track


## Reflection

<span style="color:blue">

The performance of the trained agents varies a lot, with the best agents being the ones that use 4 stacked images: with the past frames available, the agent can better estimate the best action. We expected the CNN to be better suited to the image input, but the results are mixed: on PC1 the CNN with 4 stacked images reached the highest mean reward overall, while on PC2 the MLP with 4 stacked images scored higher than the CNN. In practice, the models show the best performance after roughly 60-80k training steps.

Some of the agents are able to recover when leaving the track. However, when the car goes too far, most agents end up driving in circles because the observations no longer show where the track is.

The problem with agents like MLP_single_PC1 is that they start driving out onto the grass from the beginning. It is then difficult to escape this local minimum, because small changes to the policy barely affect the reward.

We can see that the best-performing agents have a large std. This means that the reward varies considerably between episodes, so an individual run may perform poorly. The CNN_4stack_PC2 agent reaches a mediocre mean reward but has a much smaller std. However, the mean reward minus one std of the two best agents is still higher than that of CNN_4stack_PC2.

</span>
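
The comparison of means and standard deviations can be reproduced from the results table. The sketch below uses "mean minus one std" as a rough lower bound, which is our reading of the reflection and not a formal confidence interval. The numbers are copied from the table above, and CNN_single_PC1 is left out because no model was found for it.

```python
# (mean reward, std) per agent, copied from the results table above
results = {
    "MLP_single_PC1": (-82.653779, 1.195378),
    "MLP_4stack_PC1": (564.247102, 232.277129),
    "CNN_4stack_PC1": (675.449728, 208.861736),
    "MLP_single_PC2": (226.858870, 125.156839),
    "MLP_4stack_PC2": (611.919312, 221.358447),
    "CNN_single_PC2": (320.686394, 116.108880),
    "CNN_4stack_PC2": (449.626848, 59.669726),
}


def lower_bound(mean: float, std: float) -> float:
    """Rough pessimistic estimate: mean reward minus one standard deviation."""
    return mean - std


# Rank agents by their pessimistic estimate, best first.
ranked = sorted(results, key=lambda name: lower_bound(*results[name]), reverse=True)
for name in ranked:
    mean, std = results[name]
    print(f"{name:15s} mean={mean:8.2f}  std={std:7.2f}  mean-std={lower_bound(mean, std):8.2f}")
```

With these numbers the two agents with the highest mean reward (CNN_4stack_PC1 and MLP_4stack_PC2) also rank first and second by this estimate, although MLP_4stack_PC2 is only marginally ahead of CNN_4stack_PC2.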

<span style="color:blue">

# Appendix
</span>

<span style="color:blue">
    
## Voluntary extra work 1: Multiple environments
    
Another approach that we tried was training with multiple parallel environments, to take advantage of the multi-processing power of the computers.

The idea is that SubprocVecEnv creates a vectorized wrapper that runs several environments in parallel, each in its own process, which can speed up the collection of experience.
    
Because we couldn't get make_vec_env to work directly together with the greyscale and resize wrappers, we performed the vectorization manually, modifying the code taken from the sb3 documentation: https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/sb3/3_multiprocessing.ipynb#scrollTo=AvO5BGrVv2Rk
</span>

In [29]:
from typing import Callable
from stable_baselines3.common.utils import set_random_seed

def make_env(env_id: str, rank: int, seed: int = 0) -> Callable:
    def _init() -> gym.Env:
        env = gym.make(env_id)
        env = gym.wrappers.resize_observation.ResizeObservation(env, 64)
        env = gym.wrappers.gray_scale_observation.GrayScaleObservation(env, keep_dim = True)
        env.seed(seed + rank)

        return env
    set_random_seed(seed)
    return _init

env_id = "CarRacing-v0"
num_cpu = 4

# Create the vectorized environment
multi_env = sb3.common.vec_env.SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])

In [30]:
multi_eval_env = sb3.common.vec_env.SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])
tb_log = "tb_test"

eval_callback = sb3.common.callbacks.EvalCallback(multi_eval_env, 
                                                  best_model_save_path='./best_model_multi/',
                                                  log_path=tb_log, 
                                                  eval_freq=2000,
                                                  render=False)
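
One detail to keep in mind with vectorized training: as far as we understand the callback, `eval_freq` counts calls to the vectorized `step()`, and each call advances all environments at once. The sketch below shows the resulting arithmetic under that assumption. Both helper functions are hypothetical.

```python
def timesteps_between_evals(eval_freq: int, n_envs: int) -> int:
    """Total environment timesteps that pass between two evaluations."""
    return eval_freq * n_envs


def eval_freq_for(target_timesteps: int, n_envs: int) -> int:
    """eval_freq needed to evaluate roughly every `target_timesteps` timesteps."""
    return max(target_timesteps // n_envs, 1)


print(timesteps_between_evals(2000, 4))  # 8000
print(eval_freq_for(2000, 4))            # 500
```

So with 4 environments and `eval_freq=2000`, an evaluation would run about every 8000 timesteps. To evaluate every 2000 timesteps instead, `eval_freq` would need to be 500.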


In [31]:
tb_log = "tb_test"
agent = sb3.PPO('CnnPolicy', multi_env,
                    learning_rate = 3e-5,
                    n_steps = 512,
                    ent_coef = 0.001,
                    batch_size = 128,
                    gae_lambda =  0.9,
                    n_epochs = 20,
                    use_sde = True,
                    sde_sample_freq = 4,
                    clip_range = 0.4,
                    policy_kwargs = {'log_std_init': -2, 'ortho_init':False},
                    tensorboard_log=tb_log)

We did not train this agent for the full 500000 timesteps; we only wanted to test this setup as another possible option for our system.
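
To get a feeling for what the hyperparameters above mean with four parallel environments, the following sketch works through the update arithmetic. It assumes the usual sb3 PPO behaviour, where one rollout holds `n_steps * n_envs` transitions and training continues until at least `total_timesteps` have been collected; the function `ppo_update_budget` is a hypothetical helper of ours, not an sb3 call.

```python
import math


def ppo_update_budget(n_envs: int, n_steps: int, batch_size: int,
                      n_epochs: int, total_timesteps: int) -> dict:
    """Rough bookkeeping of rollouts and gradient steps for a PPO run."""
    rollout_size = n_steps * n_envs                      # transitions per rollout
    minibatches_per_epoch = rollout_size // batch_size   # minibatches per pass
    gradient_steps_per_update = minibatches_per_epoch * n_epochs
    n_updates = math.ceil(total_timesteps / rollout_size)
    return {
        "rollout_size": rollout_size,
        "minibatches_per_epoch": minibatches_per_epoch,
        "gradient_steps_per_update": gradient_steps_per_update,
        "n_updates": n_updates,
        "timesteps_collected": n_updates * rollout_size,
    }


# Values from the agent definition above and the learn() call below
budget = ppo_update_budget(n_envs=4, n_steps=512, batch_size=128,
                           n_epochs=20, total_timesteps=100000)
for key, value in budget.items():
    print("{}: {}".format(key, value))
```

Under these assumptions each rollout contains 2048 transitions, and the 100000 requested timesteps are rounded up to 49 full rollouts.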

In [32]:
agent.learn(total_timesteps=100000, callback=eval_callback)

Track generation: 1056..1324 -> 268-tiles track
Track generation: 1108..1389 -> 281-tiles track
Track generation: 1055..1332 -> 277-tiles track
Track generation: 1143..1442 -> 299-tiles track
Track generation: 1231..1543 -> 312-tiles track
Track generation: 1084..1359 -> 275-tiles track
Track generation: 1069..1347 -> 278-tiles track
Track generation: 1087..1369 -> 282-tiles track
Track generation: 1199..1503 -> 304-tiles track
Track generation: 981..1234 -> 253-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1251..1568 -> 317-tiles track
Track generation: 964..1212 -> 248-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1176..1474 -> 298-tiles track
Track generation: 1055..1323 -> 268-tiles track
Track generation: 1055..1332 -> 277-tiles track
Track generation: 1108..1389 -> 281-tiles track
Track generation: 1143..1442 -> 299-tiles track
Track generation: 1056..1324 -> 268-tiles track
Track generation: 1069..1347 -> 278-tiles track
Track generation: 1084..1359 -> 275-tiles track
Track generation: 1087..1369 -> 282-tiles track
Track generation: 1231..15

<stable_baselines3.ppo.ppo.PPO at 0x7fae12097f70>

Save the agent for future use:

In [33]:
agent.save("./best_model_multi/best_model_multi")

If we also evaluate it, the rendered episodes show the four environments running at the same time:

In [34]:
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent, 
                                                                agent.get_env(), 
                                                                n_eval_episodes=15,
                                                                render = True)
print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))

Track generation: 1079..1353 -> 274-tiles track
Track generation: 1208..1514 -> 306-tiles track
Track generation: 1072..1350 -> 278-tiles track
Track generation: 1067..1338 -> 271-tiles track
Track generation: 1207..1513 -> 306-tiles track
Track generation: 1154..1452 -> 298-tiles track
Track generation: 1055..1324 -> 269-tiles track
Track generation: 1187..1488 -> 301-tiles track
Track generation: 1228..1539 -> 311-tiles track
Track generation: 1126..1416 -> 290-tiles track
Track generation: 1106..1396 -> 290-tiles track
Track generation: 1049..1325 -> 276-tiles track
Track generation: 963..1208 -> 245-tiles track
Track generation: 1156..1455 -> 299-tiles track
Track generation: 1324..1659 -> 335-tiles track
Track generation: 1296..1628 -> 332-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1047..1319 -> 272-tiles track
Track generation: 1044..1304 -> 260-tiles track
retry to generate track (normal if there are not manyinst

We cannot yet say whether this agent would end up performing better, since we stopped after 100000 timesteps, but it seems to train significantly faster than the other approaches.

<span style="color:blue">
    
## Voluntary extra work 2: MP4 print functionalities

While working on this project we first used Colab, where we relied on the following functions to inspect the behaviour of the agents. In this final submission, however, we do the visual inspection with the render function of the environment itself, because the Monitor and VecVideoRecorder wrappers sometimes seem to get stuck when used locally on Ubuntu.
    
The videos are shown inline and are also saved as MP4 files: in the eval-video-s folder for single-frame agents and in eval-video-s4 for agents that use stacked input frames.

</span>

In [6]:
## Here comes all the logic for showing the MP4s
import base64
from IPython.display import HTML

# This function takes a closed Monitor wrapper and embeds its MP4 in HTML
def show_render_result(rend_env):
    video_path = './eval-video-s/openaigym.video.%s.video000000.mp4' % rend_env.file_infix
    # Read the recorded video as raw bytes; the file is closed automatically
    with open(video_path, 'rb') as f:
        video = f.read()
    encoded = base64.b64encode(video)
    return HTML(data='''
    <video width="720" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
    .format(encoded.decode('ascii')))

# Let an agent drive for an episode and return the MP4
def show_singleFrame_agent_episode(agent, rend_env):
    obs = rend_env.reset()
    done = False
    
    while not done:
        action, _states = agent.predict(obs)
        obs, reward, done, info = rend_env.step(action)

    rend_env.close()
    return show_render_result(rend_env)

In [7]:
# Load the single-frame agents
modelPath = "./best_modelsD/"
policy = "MLP"
img = "single"

MLP_single_PC2_agent, MLP_single_PC2_env = loadAgent(modelPath, policy, img)

modelPath = "./best_modelsD/"
policy = "CNN"
img = "single"

CNN_single_PC2_agent, CNN_single_PC2_env = loadAgent(modelPath, policy, img)

# Use a Monitor wrapper to "film" the agent
render_env_MLP =  wrappers.Monitor(MLP_single_PC2_env, "./eval-video-s/", force=True)
render_env_CNN =  wrappers.Monitor(CNN_single_PC2_env, "./eval-video-s/", force=True)

Loading ./best_modelsD/best_model_MLP_single
Loading ./best_modelsD/best_model_CNN_single


In [8]:
# MLP_single 
show_singleFrame_agent_episode(MLP_single_PC2_agent, render_env_MLP)

Track generation: 1135..1423 -> 288-tiles track


In [None]:
# CNN_single 
show_singleFrame_agent_episode(CNN_single_PC2_agent, render_env_CNN)

In [None]:
# Load the stacked-frame models
modelPath = "./best_modelsC/"
policy = "MLP"
img = "4stack"

MLP_4stack_PC2_agent, MLP_4stack_PC2_env = loadAgent(modelPath, policy, img)

modelPath = "./best_modelsD/"
policy = "CNN"
img = "4stack"

CNN_4stack_PC2_agent, CNN_4stack_PC2_env = loadAgent(modelPath, policy, img)

# Create a VecVideoRecorder for each agent and let it run one episode
for name, agent, env in [("MLP", MLP_4stack_PC2_agent, MLP_4stack_PC2_env),
                         ("CNN", CNN_4stack_PC2_agent, CNN_4stack_PC2_env)]:
    # Wrap the agent's own environment; a distinct name_prefix keeps the
    # two videos from overwriting each other
    VecVideoRecorderenv = VecVideoRecorder(env, video_folder="eval-video-s4",
                                  record_video_trigger=lambda step: step == 0, video_length=1000,
                                  name_prefix='ppo-' + name)
    obs = VecVideoRecorderenv.reset()
    agent.set_env(VecVideoRecorderenv)
    done = False
    while not done:
        action, _states = agent.predict(obs)
        obs, reward, done, info = VecVideoRecorderenv.step(action)

    # Closing the recorder finalises the MP4 file for this agent
    VecVideoRecorderenv.close()

# Function to show videos
def show_videos(video_path='', prefix=''):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the videos, showing only those starting with this prefix
  """
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

#Final print out
print("-----------------")
print("Videos for the Stacked agents")
print("first MLP then CNN")
show_videos('eval-video-s4', prefix='ppo')