### Comparison of Advantage Actor Critic (A2C) and Proximal Policy Optimization (PPO) Reinforcement Learning Algorithms using Space Invaders
- Each algorithm is trained with default parameters, first for 1000 timesteps and then for 20000 timesteps
- The algorithms play the Atari Space Invaders game, aiming for the highest score

Libraries/Software used:
- Python OpenAI Gym (https://gym.openai.com/)
- Stable Baselines 3 (https://stable-baselines3.readthedocs.io/en/master/index.html)

#### Dependencies

In [27]:
import gym
from stable_baselines3 import A2C
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_atari_env
import os
import urllib.request

#### Extra code required to load all of the Atari game emulators

In [None]:
# Required to load the atari environments and get the ROMS to work
# References local files (see file structure)
# http://www.atarimania.com/roms/Roms.rar
# https://stackoverflow.com/questions/67656740/exception-rom-is-missing-for-ms-pacman-see-https-github-com-openai-atari-py
urllib.request.urlretrieve('http://www.atarimania.com/roms/Roms.rar','Roms.rar')
!pip install unrar
!unrar x Roms.rar
!mkdir rars
!mv HC\ ROMS.zip   rars
!mv ROMS.zip  rars
!python -m atari_py.import_roms rars

#### Setting up the Environment

In [16]:
environment_name = 'SpaceInvaders-v0'
env = gym.make(environment_name)

# View the Action Space and Observation Space, to see what is available (and also data types)
env.reset()
print(env.action_space)
print(env.observation_space)

# Testing and visualising the environment
episodes = 5
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0

    while not done:
        # Renders the environment visually
        env.render()
        # Takes a random action from the action space
        action = env.action_space.sample()
        # step() returns four values: the next state, the reward, the done flag, and an info dict
        n_state, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{} '.format(episode,score))
env.close()

Discrete(6)
Box([[[0 0 0] ... [0 0 0]]], [[[255 255 255] ... [255 255 255]]], ...

### Vectorise the Environment
This lets a single model collect experience from several copies of the environment in parallel, which speeds up training

In [35]:
# Creates four copies of the environment that run in parallel
env = make_atari_env('SpaceInvaders-v0', n_envs=4, seed=0)
env = VecFrameStack(env, n_stack=4)
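
`VecFrameStack` with `n_stack=4` gives the agent the last four frames at once, so it can infer motion (for example, which way a shot is travelling) that a single frame cannot show. The idea can be sketched in plain Python as below. This is a toy illustration, not the Stable Baselines 3 implementation: the `FrameStacker` class and the string "frames" are hypothetical stand-ins for image arrays.

```python
from collections import deque


class FrameStacker:
    """Toy frame stacker (hypothetical helper, not the SB3 class)."""

    def __init__(self, n_stack, blank_frame):
        self.n_stack = n_stack
        self.blank_frame = blank_frame
        # A deque with maxlen drops the oldest frame automatically
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_frame):
        # Start with blank frames and put the first real frame last
        self.frames.clear()
        for _ in range(self.n_stack - 1):
            self.frames.append(self.blank_frame)
        self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Append the newest frame; the oldest one falls off the front
        self.frames.append(new_frame)
        return list(self.frames)


stacker = FrameStacker(n_stack=4, blank_frame="blank")
print(stacker.reset("f0"))
for name in ["f1", "f2", "f3", "f4"]:
    stacked = stacker.step(name)
print(stacked)
```

The observation passed to the policy is always the full stack, which is why the model sees four times as many channels as a single frame provides.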

### Train, Evaluate, and Test the different models

#### Model 1 - A2C (Advantage Actor Critic Method)
- Uses multiple workers to avoid the use of a replay buffer
- https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html
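
The "advantage" in A2C measures how much better a rollout turned out than the critic predicted. A minimal sketch of that calculation follows, assuming plain n-step discounted returns and ignoring episode boundaries. The function name `n_step_advantages` and the sample numbers are hypothetical, and the default `gamma=0.99` is an assumption that matches the usual Stable Baselines 3 default.

```python
def n_step_advantages(rewards, values, last_value, gamma=0.99):
    """Return advantage = discounted n-step return - critic value, per step."""
    advantages = []
    # Bootstrap from the critic's estimate of the state after the rollout
    running_return = last_value
    # Walk backwards so each step reuses the return of the step after it
    for reward, value in zip(reversed(rewards), reversed(values)):
        running_return = reward + gamma * running_return
        advantages.append(running_return - value)
    advantages.reverse()
    return advantages


# Hypothetical 3-step rollout
rewards = [1.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]
print(n_step_advantages(rewards, values, last_value=0.0, gamma=0.5))
```

A positive advantage pushes the actor towards the action that was taken, and a negative one pushes it away.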

In [36]:
# Sets the path to save the training logs to
log_path = os.path.join('Training','Logs')

# CNN is used, because this environment uses image-based observations
a2c_1000_model = A2C('CnnPolicy', env, verbose=1, tensorboard_log=log_path)
# 1000 steps initially just to test
a2c_1000_model.learn(total_timesteps=1000)

# Save the model after training
a2c_path = os.path.join('Training','Saved Models','A2C_SpaceInvaders_1000_Model')
a2c_1000_model.save(a2c_path)

Using cpu device
Wrapping the env in a VecTransposeImage.
Logging to Training\Logs\A2C_8


#### Evaluate A2C 1000

In [37]:
# Reduce to just one environment in order to evaluate it
env = make_atari_env('SpaceInvaders-v0', n_envs=1, seed=0)
env = VecFrameStack(env, n_stack=4)
evaluate_policy(a2c_1000_model, env, n_eval_episodes=10, render=True)
env.close()

#### Test A2C 1000

In [38]:
episodes = 10
for episode in range(1, episodes+1):
    obs = env.reset()
    done = False
    score = 0

    while not done:
        env.render()
        action, _ = a2c_1000_model.predict(obs) # Now using the model
        obs, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{} '.format(episode,score))
env.close()

Episode:1 Score:[0.] 
Episode:2 Score:[8.] 
Episode:3 Score:[9.] 
Episode:4 Score:[2.] 
Episode:5 Score:[10.] 
Episode:6 Score:[9.] 
Episode:7 Score:[0.] 
Episode:8 Score:[2.] 
Episode:9 Score:[1.] 
Episode:10 Score:[0.] 
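
These test scores are far lower than the `ep_rew_mean` values of over 100 that appear in the training logs elsewhere in this notebook. A likely explanation, assuming `make_atari_env` applied its default Atari preprocessing, is that the rewards returned by `env.step` are clipped to their sign and that an "episode" ends each time a life is lost, whereas the training logs report unclipped scores for whole games. The sketch below shows the effect of sign clipping; the raw reward values are hypothetical.

```python
def clip_reward(reward):
    """Map a raw reward to its sign: -1, 0, or +1."""
    return (reward > 0) - (reward < 0)


# Hypothetical raw rewards from one life of Space Invaders
raw_rewards = [5, 0, 10, 0, 30, 0, 15]

raw_score = sum(raw_rewards)
clipped_score = sum(clip_reward(r) for r in raw_rewards)

print("Raw score:", raw_score)
print("Clipped score:", clipped_score)
```

Under that assumption, a test score of 10 means ten rewarded hits in one life, not ten game points, so the test scores and the logged `ep_rew_mean` are not directly comparable.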


#### Train, Evaluate, and Test again, but this time with 20000 steps

In [39]:
# Revert to 4 environments
env = make_atari_env('SpaceInvaders-v0', n_envs=4, seed=0)
env = VecFrameStack(env, n_stack=4)

# Train
a2c_20000_model = A2C('CnnPolicy', env, verbose=1, tensorboard_log=log_path)
a2c_20000_model.learn(total_timesteps=20000)
a2c_path = os.path.join('Training','Saved Models','A2C_SpaceInvaders_Model')
a2c_20000_model.save(a2c_path)

Using cpu device
Wrapping the env in a VecTransposeImage.
Logging to Training\Logs\A2C_9
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 690      |
|    ep_rew_mean        | 124      |
| time/                 |          |
|    fps                | 99       |
|    iterations         | 100      |
|    time_elapsed       | 20       |
|    total_timesteps    | 2000     |
| train/                |          |
|    entropy_loss       | -1.06    |
|    explained_variance | 0.0797   |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | -1.04    |
|    value_loss         | 3.1      |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 763      |
|    ep_rew_mean        | 154      |
| time/                 |          |
|    fps                | 100      |
|    iterations         | 200      |
|    time_elapsed      

In [40]:
# Evaluate (Revert back to 1 environment)
env = make_atari_env('SpaceInvaders-v0', n_envs=1, seed=0)
env = VecFrameStack(env, n_stack=4)
evaluate_policy(a2c_20000_model, env, n_eval_episodes=10, render=True)
env.close()

In [41]:
# Test
episodes = 10
for episode in range(1, episodes+1):
    obs = env.reset()
    done = False
    score = 0

    while not done:
        env.render()
        action, _ = a2c_20000_model.predict(obs)
        obs, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{} '.format(episode,score))
env.close()

Episode:1 Score:[1.] 
Episode:2 Score:[2.] 
Episode:3 Score:[10.] 
Episode:4 Score:[7.] 
Episode:5 Score:[0.] 
Episode:6 Score:[1.] 
Episode:7 Score:[2.] 
Episode:8 Score:[3.] 
Episode:9 Score:[9.] 
Episode:10 Score:[6.] 


#### Conclusion of A2C
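- Test scores were sporadic and inconsistent: they ranged from 0 to 10 per episode after both 1000 and 20000 training steps, so the longer run showed no clear improvement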

#### Model 2 - PPO (Proximal Policy Optimization Algorithm)
- Combines the idea of multiple workers (from A2C) with a trust region (from TRPO) to improve the actor
- After each update, the new policy should not be too different from the old policy (enforced by clipping)
- https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html
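
The clipping mentioned above can be sketched for a single sample as follows. This is a minimal illustration of the clipped surrogate objective, not the Stable Baselines 3 code. The function name `clipped_surrogate` is hypothetical, and `clip_range=0.2` is taken from the `clip_range` value printed in the PPO training log.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO objective for one sample.

    ratio is new_policy_prob / old_policy_prob for the action taken.
    """
    # Keep the ratio inside [1 - clip_range, 1 + clip_range]
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Take the more pessimistic of the clipped and unclipped terms
    return min(ratio * advantage, clipped_ratio * advantage)


# A large policy change on a good action earns no extra credit past the clip
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# A small policy change is left untouched
print(clipped_surrogate(ratio=1.1, advantage=2.0))
```

Because the objective stops improving once the ratio leaves the clip range, the optimiser has no incentive to move the policy far from the one that collected the data.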

#### Initially Train, Evaluate, and Test for 1000 training steps

In [42]:
# Revert to 4 environments
env = make_atari_env('SpaceInvaders-v0', n_envs=4, seed=0)
env = VecFrameStack(env, n_stack=4)

# Train
ppo_1000_model = PPO('CnnPolicy', env, verbose=1, tensorboard_log=log_path)
ppo_1000_model.learn(total_timesteps=1000)
ppo_path = os.path.join('Training','Saved Models','PPO_SpaceInvaders_1000_Model')
ppo_1000_model.save(ppo_path)

Using cpu device
Wrapping the env in a VecTransposeImage.
Logging to Training\Logs\PPO_12
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 661      |
|    ep_rew_mean     | 139      |
| time/              |          |
|    fps             | 113      |
|    iterations      | 1        |
|    time_elapsed    | 72       |
|    total_timesteps | 8192     |
---------------------------------
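
The log above reports `total_timesteps` of 8192 even though `learn` was called with `total_timesteps=1000`. This is consistent with training proceeding in whole rollouts of `n_steps * n_envs` timesteps. The sketch below works through the arithmetic, assuming the default `n_steps` values of 2048 for PPO and 5 for A2C; the function name `timesteps_actually_run` is hypothetical.

```python
import math


def timesteps_actually_run(total_timesteps, n_steps, n_envs):
    """Round the requested timesteps up to a whole number of rollouts."""
    rollout_size = n_steps * n_envs
    iterations = math.ceil(total_timesteps / rollout_size)
    return iterations * rollout_size


N_ENVS = 4          # from make_atari_env(..., n_envs=4)
PPO_N_STEPS = 2048  # assumed default for PPO
A2C_N_STEPS = 5     # assumed default for A2C

for requested in (1000, 20000):
    ppo = timesteps_actually_run(requested, PPO_N_STEPS, N_ENVS)
    a2c = timesteps_actually_run(requested, A2C_N_STEPS, N_ENVS)
    print(f"requested={requested}: PPO runs {ppo}, A2C runs {a2c}")
```

Under these assumptions the "1000 step" PPO model saw roughly eight times as much experience as the "1000 step" A2C model, which is worth keeping in mind when comparing the two.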


In [43]:
# Evaluate (Revert back to 1 environment)
env = make_atari_env('SpaceInvaders-v0', n_envs=1, seed=0)
env = VecFrameStack(env, n_stack=4)
evaluate_policy(ppo_1000_model, env, n_eval_episodes=10, render=True)
env.close()

In [44]:
# Test
episodes = 10
for episode in range(1, episodes+1):
    obs = env.reset()
    done = False
    score = 0

    while not done:
        env.render()
        action, _ = ppo_1000_model.predict(obs)
        obs, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{} '.format(episode,score))
env.close()

Episode:1 Score:[6.] 
Episode:2 Score:[1.] 
Episode:3 Score:[2.] 
Episode:4 Score:[2.] 
Episode:5 Score:[3.] 
Episode:6 Score:[4.] 
Episode:7 Score:[12.] 
Episode:8 Score:[2.] 
Episode:9 Score:[1.] 
Episode:10 Score:[3.] 


#### Train, Evaluate, and Test for 20000 time steps

In [45]:
# Revert to 4 environments
env = make_atari_env('SpaceInvaders-v0', n_envs=4, seed=0)
env = VecFrameStack(env, n_stack=4)

# Train
ppo_20000_model = PPO('CnnPolicy', env, verbose=1, tensorboard_log=log_path)
ppo_20000_model.learn(total_timesteps=20000)
ppo_path = os.path.join('Training','Saved Models','PPO_SpaceInvaders_Model')
ppo_20000_model.save(ppo_path)

Using cpu device
Wrapping the env in a VecTransposeImage.
Logging to Training\Logs\PPO_13
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 662      |
|    ep_rew_mean     | 128      |
| time/              |          |
|    fps             | 112      |
|    iterations      | 1        |
|    time_elapsed    | 72       |
|    total_timesteps | 8192     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 683          |
|    ep_rew_mean          | 141          |
| time/                   |              |
|    fps                  | 82           |
|    iterations           | 2            |
|    time_elapsed         | 197          |
|    total_timesteps      | 16384        |
| train/                  |              |
|    approx_kl            | 0.0122767165 |
|    clip_fraction        | 0.11         |
|    clip_range           | 0.2          |
|    entrop

In [46]:
# Evaluate (Revert back to 1 environment)
env = make_atari_env('SpaceInvaders-v0', n_envs=1, seed=0)
env = VecFrameStack(env, n_stack=4)
evaluate_policy(ppo_20000_model, env, n_eval_episodes=10, render=True)
env.close()

In [47]:
# Test
episodes = 10
for episode in range(1, episodes+1):
    obs = env.reset()
    done = False
    score = 0

    while not done:
        env.render()
        action, _ = ppo_20000_model.predict(obs)
        obs, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{} '.format(episode,score))
env.close()

Episode:1 Score:[4.] 
Episode:2 Score:[5.] 
Episode:3 Score:[6.] 
Episode:4 Score:[5.] 
Episode:5 Score:[0.] 
Episode:6 Score:[13.] 
Episode:7 Score:[1.] 
Episode:8 Score:[5.] 
Episode:9 Score:[2.] 
Episode:10 Score:[5.] 


#### Conclusion
- Both algorithms produced inconsistent scores in the testing stage, which suggests that more training steps are probably required to achieve consistent results
- In terms of score, PPO achieved the best single test episode (13, versus 10 for A2C), but its results were inconsistent and sometimes worse than those of A2C
- During training, A2C achieved a higher mean episode length and a higher mean reward per episode, but also produced dramatically higher policy and value losses
- Overall, these results suggest that A2C may have the higher potential, while PPO is the safer option for better performance
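
To put numbers on the inconsistency described above, the test scores printed by the four test loops can be summarised with the standard library. The scores below are copied from those outputs; the dictionary labels are just names chosen for this sketch.

```python
from statistics import mean, pstdev

# Test scores copied from the outputs of the four test loops
test_scores = {
    "A2C 1000": [0, 8, 9, 2, 10, 9, 0, 2, 1, 0],
    "A2C 20000": [1, 2, 10, 7, 0, 1, 2, 3, 9, 6],
    "PPO 1000": [6, 1, 2, 2, 3, 4, 12, 2, 1, 3],
    "PPO 20000": [4, 5, 6, 5, 0, 13, 1, 5, 2, 5],
}

for name, scores in test_scores.items():
    print(
        f"{name}: mean={mean(scores):.1f}, "
        f"std={pstdev(scores):.2f}, best={max(scores)}"
    )
```

With only ten episodes per model and standard deviations of a similar size to the means, the differences between the models are too small to rank them with confidence.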