# Deep Reinforcement Learning Pacman Project

---



This notebook is devoted to solving a game from the Atari suite, namely MsPacman. \
MsPacman is an environment with a continuous state space and a discrete action space.

To solve the environment you will use the OpenAI Gym API for agent-environment communication and \
Stable-Baselines3 for the agent architecture and training.

OpenAI Gym Doc - [Gym Documentation](https://www.gymlibrary.dev) \
Stable-Baselines3 Doc - [SB3 Documentation](https://stable-baselines3.readthedocs.io/en/master/)
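
The agent-environment communication follows a reset/step loop. The sketch below illustrates it with a hypothetical `ToyEnv` class (not part of Gym) that mimics the classic Gym interface, assuming `reset()` returns the first observation and `step(action)` returns `(observation, reward, done, info)`. The real MsPacman environment is created later in the notebook.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a Gym environment (illustration only)."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.t = 0
        return self.t

    def step(self, action):
        # Advance one time step; reward 1.0 for action 0, otherwise 0.0.
        self.t += 1
        reward = 1.0 if action == 0 else 0.0
        done = self.t >= self.episode_length
        return self.t, reward, done, {}


def run_episode(env, policy):
    """Play one episode and return the total (undiscounted) reward."""
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = policy(obs)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
    return total_reward


rng = random.Random(0)


def random_policy(obs):
    # A random agent: ignores the observation and picks one of 9 actions.
    return rng.randrange(9)


episode_return = run_episode(ToyEnv(), random_policy)
print(f"episode return: {episode_return}")
```

A trained agent replaces `random_policy` with a function that maps the observation to a good action.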

---
\
***Notebook walkthrough*** \
In the notebook you will find two types of annotated cells:

```
! DO NOT modify cell
```
The purpose of this cell is to ensure code stability and proper installation.


```
@ Implementation cell
```
This cell is meant for you to implement your solution to the stated problem.

---
\
The notebook is divided into 6 sections \
**section 0** - Visualization tool \
**section 1** - MsPacman Rom Installation \
**section 2** - Environment Instantiation \
**section 3** - Agent Instantiation and Evaluation \
**section 4** - Agent Training \
**section 5** - Saving Model

# Section 0
> Visualization tool


Visualization tools to be used in Google Colab

```
! DO NOT modify cell
```

In [3]:
!apt-get install ffmpeg freeglut3-dev xvfb
!pip install pyglet==1.5
!pip install stable-baselines3[extra]
!pip install sb3-contrib


## Set Fake Display

---


Set up fake display; otherwise rendering will fail

```
! DO NOT modify cell
```

In [4]:
import pyglet
import os

os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

## Agent Recording

---

Record agent performance

```
! DO NOT modify cell
```

In [5]:
import gym
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv

def record_agent(env_id, model, video_length=500, prefix='', video_folder='videos/'):
  """
  Record a video of the agent acting in the environment.

  :param env_id: (str) id of the Gym environment to record
  :param model: (RL model) agent whose predict() method chooses the actions
  :param video_length: (int) number of steps to record
  :param prefix: (str) prefix of the video file name
  :param video_folder: (str) folder in which the video is saved
  """
  eval_env = DummyVecEnv([lambda: gym.make(env_id)])
  # Start the video at step=0 and record video_length steps
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # Close the video recorder
  eval_env.close()

## Implement Agent Display

---

Display agent Performance

```
! DO NOT modify cell
```

In [6]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay

def show_agent_video(video_path='', prefix=''):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
  """
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

# Section 1
> ## MsPacman ROM Installation




Since the new version of the Atari suite requires AutoROM to load game binaries, \
it is necessary to locate the ROMs on the operating system and use the ALE interface to register them as a Gym environment.

## Install AutoRom

---


Install required package and locate atari binaries

```
! DO NOT modify cell
```

In [None]:
!pip install autorom
!AutoROM --accept-license

```
! DO NOT modify cell
```


In [None]:
!pip install ale-py
!ale-import-roms /usr/local/lib/python3.7/dist-packages/AutoROM/roms

```
! DO NOT modify cell
```

In [None]:
!pip install gym[atari,accept-rom-license]

In [None]:
!pip install stable-baselines3[extra]
# Optional: install SB3 contrib to have access to additional algorithms
!pip install sb3-contrib


## Load atari game with AutoROM

---


Create ALE Interface and register ms_pacman binaries

```
! DO NOT modify cell
```

In [None]:
from ale_py import ALEInterface

ale = ALEInterface()
ms_pacman_rom_directory = "/usr/local/lib/python3.7/dist-packages/AutoROM/roms/ms_pacman.bin"

is_supported = ale.isSupportedROM(ms_pacman_rom_directory)
if is_supported:
  ale.loadROM(ms_pacman_rom_directory)
else:
  raise SystemError("ROM not supported by Arcade Learning Environment")

# Section 2

> ## Create Env Instance

In this section your role is to create the MsPacman environment




## Instantiate environment

---


Write code that instantiates the Pacman environment. \
The details are given in the cell below.

```
@ Implementation cell
> create env object initialized with ALE/MsPacman-v5 environment
```

In [None]:
import gym
import numpy as np

env = ' your code here '

##  Environment check

---


The cell below checks whether the environment was instantiated correctly.

```
! DO NOT modify cell
```

In [None]:
_target_observation_space = np.array([210, 160, 3], dtype=np.intc)
_env_observation_space = np.array(env.observation_space.shape, dtype=np.intc)

_target_action_space = gym.spaces.Discrete(9)
_env_action_space = env.action_space

assert np.array_equal(_target_observation_space, _env_observation_space), "The observation space shape does not match requirements"
assert _target_action_space == _env_action_space, "The action space does not match the requirements"

# Section 3

> ## Agent Initialization and Evaluation

In this section your role is to initialize the RL algorithm for the agent. \
Link to algorithm explanation - [deep reinforcement learning explained](https://spinningup.openai.com/en/latest/index.html) \
Please refer to section **ALGORITHMS DOCS**

List of steps required to solve this environment:
1.   Choose a proper algorithm ( e.g. PPO, A2C, DQN; note that DDPG and SAC support only continuous action spaces )
2.   Initialize the policy network ( e.g. MlpPolicy, CnnPolicy, etc. )
3.   Wrap your environment with a Stable-Baselines3 Monitor wrapper
4.   Evaluate the untrained agent on the Pacman environment
5.   Record and display a video of the untrained agent
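
Step 1 depends on the action space: MsPacman has a discrete action space, so only algorithms that support `Discrete` actions apply. The lookup below is a sketch summarising, for a few algorithms, the action-space support listed in the SB3 algorithm overview. The dictionary and the `compatible_algorithms` helper are illustrative names, not part of SB3, and only a subset of algorithms and space types is shown.

```python
# Action-space support per algorithm, as listed in the SB3 documentation
# (only a subset of algorithms and space types is shown here).
SUPPORTED_ACTION_SPACES = {
    "A2C": {"Box", "Discrete"},
    "PPO": {"Box", "Discrete"},
    "DQN": {"Discrete"},
    "DDPG": {"Box"},
    "SAC": {"Box"},
    "TD3": {"Box"},
}


def compatible_algorithms(action_space_type):
    """Return the algorithms (sorted by name) that handle the given space type."""
    return sorted(
        name
        for name, spaces in SUPPORTED_ACTION_SPACES.items()
        if action_space_type in spaces
    )


print(compatible_algorithms("Discrete"))  # ['A2C', 'DQN', 'PPO']
```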






##  Algorithm and Policy Network

---
Import the chosen algorithm and policy from SB3

SB3 algorithms - [deep reinforcement learning algos](https://stable-baselines3.readthedocs.io/en/master/guide/algos.html) \
SB3 policy networks - [policy networks](https://stable-baselines.readthedocs.io/en/master/modules/policies.html) \
Keep in mind that certain policy networks might not be implemented for certain algorithms.


```
@ Implementation cell
> import deep reinforcement learning algorithm
> import policy network
```

In [None]:
' your code here - imports '

' your code here - imports '

```
@ Implementation cell
> instantiate chosen model
> the parameters are (policy_network, environment, verbose)
  policy_network: chosen policy network ( e.g. CnnPolicy )
  environment: instance of environment
  verbose: int (0, 1, 2) - 0 prints nothing, 1 prints info messages ( nice to have ), 2 prints debug messages
```

In [None]:
model = ' your code here - model instance '

##  Model check

---
Predict first action given first observation

```
@ Implementation cell
> reset environment and get first observation
> feed first observation to the model's predict method and get first action
  keep in mind that the predict method returns the action and the next hidden state
```

In [None]:
obs = ' your code here - reset environment '
action, _ =  ' your code here - predict action given first observation [ obs ]'

```
! DO NOT modify cell
```

In [None]:
assert action in _env_action_space, "Action is not a part of environment action space"

##  Evaluate Vanilla Agent

---
Check the performance of untrained agent

```
@ Implementation cell
> import evaluate_policy method from SB3
> import Monitor class from SB3
```

In [None]:
' your code here - imports'

' your code here - imports'

```
@ Implementation cell
> wrap env with Monitor wrapper
  monitor wrapper is used to track agent's performance
```

In [None]:
env = ' your code here - wrap env '
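
To make the purpose of the wrapper concrete, here is a simplified sketch of the bookkeeping a monitor performs: it forwards `reset` and `step` to the wrapped environment and records the return and length of every finished episode. `EpisodeRecorder` and `CountdownEnv` are hypothetical names used for illustration only; the SB3 `Monitor` class does more than this (for example, it also tracks elapsed time).

```python
class CountdownEnv:
    """Hypothetical toy environment: reward 1.0 per step, episodes last `horizon` steps."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.horizon, {}


class EpisodeRecorder:
    """Simplified monitor: tracks the return and length of every finished episode."""

    def __init__(self, env):
        self.env = env
        self.episode_returns = []
        self.episode_lengths = []
        self._rewards = []

    def reset(self):
        self._rewards = []
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._rewards.append(reward)
        if done:
            # Episode finished: store its statistics and expose them through `info`.
            ep_return, ep_length = sum(self._rewards), len(self._rewards)
            self.episode_returns.append(ep_return)
            self.episode_lengths.append(ep_length)
            info = dict(info, episode={"r": ep_return, "l": ep_length})
        return obs, reward, done, info


monitored_env = EpisodeRecorder(CountdownEnv(horizon=3))
for _ in range(2):  # play two episodes
    monitored_env.reset()
    done = False
    while not done:
        _, _, done, info = monitored_env.step(action=0)

print(monitored_env.episode_returns, monitored_env.episode_lengths)  # [3.0, 3.0] [3, 3]
```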

```
@ Implementation cell
> evaluate untrained model using evaluate_policy method
> the parameters are:
  model: instantiated model
  environment: instantiated env
  n_eval_episodes: number of episodes to play ( 100 is a reasonable choice )
  deterministic: bool (True, False) - if True the agent chooses actions greedily ( full exploitation )
```

In [None]:
mean_reward, std_reward = ' your code here - evaluate policy '

mean_reward:60.00 +/- 0.00


```
! DO NOT modify cell
```

In [None]:
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")
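
The two printed numbers are the mean and the standard deviation of the per-episode returns. The sketch below shows that computation on made-up episode returns (not real results), assuming the population standard deviation, which is NumPy's default convention for `np.std`.

```python
from statistics import mean, pstdev

# Made-up per-episode returns, for illustration only (not real results).
episode_returns = [60.0, 210.0, 90.0, 120.0]

mean_reward = mean(episode_returns)
# Population standard deviation, the same convention as NumPy's default `np.std`.
std_reward = pstdev(episode_returns)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")  # mean_reward:120.00 +/- 56.12
```

A standard deviation of 0.00 means that every evaluation episode produced exactly the same return.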

##  Record Vanilla Agent and Display

---

Display agent's performance

```
@ Implementation cell
> record agent video
  for instance: record_agent('ALE/MsPacman-v5', ppo_model, video_length=500, prefix='vanilla-ppo-pacman')

> show agent video
  for instance: show_agent_video('videos', prefix='vanilla')
```

In [None]:
' your code here - record agent '

In [None]:
' your code here - show agent video '

# Section 4

> ## Agent Training

Train your model until it reaches the desired score.

---


DDPG Building Blocks and Learning Example: \
**ActorNetwork** - online network that is used to choose actions. \
**CriticNetwork** - online network that is used to evaluate the performance of the actor. \
**TargetActor** - frozen actor network (copy) that is used to make more stable updates to policy parameters. \
**TargetCritic** - frozen critic network (copy) that is used to make more stable updates to critic parameters.
\

Deep Reinforcement Learning common learning process:
1.   The agent interacts with the environment using the ActorNetwork to choose actions
2.   The interactions (transitions) are stored in the replay memory
3.   Once the replay memory holds enough transitions, we sample a mini-batch of them
4.   The objective functions for the actor and critic are computed w.r.t. the targets given by the target networks, yielding the losses
5.   The weights of the online networks are updated w.r.t. the computed losses
6.   Once every several iterations the target networks are synced with the online networks ( deep copy or Polyak averaging )


Example of how the learning process is implemented for DDPG: [DDPG Learning Example](https://gitlab.com/JamieChojnacki/reinforcementlearning/-/blob/master/from_paper_to_code/ddpg_from_paper_to_code/ddpg_torch.py)
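
The replay memory (steps 2 and 3) and the target-network sync (step 6) can be sketched in plain Python. `ReplayBuffer` and `polyak_update` are illustrative names under a simplifying assumption (network weights are represented as flat lists of floats); this is not the SB3 implementation, nor the one in the linked repository.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (obs, action, reward, next_obs, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, obs, action, reward, next_obs, done):
        self.memory.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniformly sample a mini-batch without replacement.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def polyak_update(online, target, tau):
    """Soft sync: target <- tau * online + (1 - tau) * target (tau=1 is a hard copy)."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


buffer = ReplayBuffer(capacity=100)
for t in range(150):
    buffer.add(obs=t, action=0, reward=1.0, next_obs=t + 1, done=False)

batch = buffer.sample(batch_size=8)
print(len(buffer), len(batch))  # 100 8

target = polyak_update(online=[1.0, 2.0], target=[0.0, 0.0], tau=0.5)
print(target)  # [0.5, 1.0]
```

A small `tau` moves the target networks slowly towards the online networks, which is what keeps the learning targets stable.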



##  Train model

---
Train the model until it reaches the desired score

```
@ Implementation cell
> train model until it reaches desired performance
```

In [None]:
' your code here - train model '

' your code here - train model '
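
One way to train "until the desired score is reached" is to alternate short training phases with evaluations and stop once a threshold is met. The sketch below is generic: `train_chunk` and `evaluate` are placeholder callables that you would back with your model's training and evaluation calls, and `target_score` and `max_chunks` are assumed values. SB3 also provides callbacks for this purpose (`EvalCallback` combined with `StopTrainingOnRewardThreshold`).

```python
def train_until(train_chunk, evaluate, target_score, max_chunks=100):
    """Alternate training and evaluation; stop at the target or after max_chunks.

    Returns (number of chunks trained, last evaluation score).
    """
    score = evaluate()
    chunks = 0
    while score < target_score and chunks < max_chunks:
        train_chunk()
        chunks += 1
        score = evaluate()
    return chunks, score


# Stand-in "agent" whose score grows by 50 per training chunk (illustration only).
state = {"score": 60.0}


def fake_train_chunk():
    state["score"] += 50.0


def fake_evaluate():
    return state["score"]


chunks, final_score = train_until(fake_train_chunk, fake_evaluate, target_score=300.0)
print(chunks, final_score)  # 5 310.0
```

The `max_chunks` limit guards against training forever when the target score is never reached.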

##  Evaluate Trained Agent

---
Check the performance of trained agent

```
@ Implementation cell
> evaluate trained model using evaluate_policy method
> the parameters are:
  model: instantiated model
  environment: instantiated env
  n_eval_episodes: number of episodes to play ( 100 is a reasonable choice )
  deterministic: bool (True, False) - if True the agent chooses actions greedily ( full exploitation )
```

In [None]:
mean_reward, std_reward = ' your code here - evaluate policy '

```
! DO NOT modify cell
```

In [None]:
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

##  Record Trained Agent and Display

---

Display agent's performance

```
@ Implementation cell
> record agent video
  for instance: record_agent('ALE/MsPacman-v5', ppo_model, video_length=500, prefix='trained-ppo-pacman')

> show agent video
  for instance: show_agent_video('videos', prefix='trained')
```

In [7]:
' your code here - record agent '

' your code here - record agent '

In [8]:
' your code here - show agent video '

' your code here - show agent video '

# Section 5
> ## Saving Model


Save the trained model. \
This is common practice in DRL, since we do not want to train our model from scratch every time.

##  Sample Observations and Actions

---
Sample observations from environment and predicted actions from agent. \
This step is required to validate that the policy network was saved successfully

```
@ Implementation cell
> gather 10 random observations from environment and save them as numpy array
  use env.observation_space.sample() method to gather those observations
> use your model to get predicted actions for those observations
  use model's predict() method
  make sure to set deterministic flag to True
```

In [None]:
observations = ' your code here - sample observations from env '
action_before_saving, _ = ' your code here - predict actions '
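
The deterministic flag matters for this check: a stochastic policy samples from its action distribution, so two calls on the same observation can disagree, while the greedy (argmax) choice is reproducible. The sketch below uses a made-up action distribution; `predict` here is a hypothetical helper for illustration, not the SB3 method.

```python
import random


def predict(action_probs, deterministic, rng):
    """Pick an action index from a probability vector."""
    if deterministic:
        # Greedy choice: always the most probable action.
        return max(range(len(action_probs)), key=lambda a: action_probs[a])
    # Stochastic choice: sample according to the probabilities.
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]


# Made-up distribution over 9 actions (illustration only).
probs = [0.05, 0.05, 0.40, 0.10, 0.10, 0.10, 0.10, 0.05, 0.05]
rng = random.Random(0)

greedy = [predict(probs, deterministic=True, rng=rng) for _ in range(5)]
sampled = [predict(probs, deterministic=False, rng=rng) for _ in range(5)]

print(greedy)   # [2, 2, 2, 2, 2]
print(sampled)  # depends on the random seed
```

Comparing actions before saving and after loading is only meaningful when both sets of predictions are made greedily.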

##  Save the Model

---
Save your trained model as a zip file

```
@ Implementation cell
> use your model's save() method
  remember to use .zip extension
```

In [None]:
' your code here - save model '

' your code here - save model '

```
! DO NOT modify cell
```

In [None]:
!ls *.zip

ppo_cartpole.zip


##  Load the Model

---
Load your saved model

```
@ Implementation cell
> use your model's load() method
```

In [None]:
loaded_model = ' your code here - load saved model '

```
! DO NOT modify cell
```

In [None]:
action_after_loading, _ = loaded_model.predict(observations, deterministic=True)

```
! DO NOT modify cell
```

In [None]:
assert np.allclose(action_before_saving, action_after_loading), "Something went wrong while loading the model"