# Freeway

This is the second project for the MC935A/MO436A - Reinforcement Learning course, taught by Prof. Esther Colombini.

In this project we propose to apply Reinforcement Learning methods to teach an agent how to play the Freeway Atari game.

**Group members:**
- Aline Gabriel de Almeida
- Dionisius Oliveira Mayr (229060)
- Marianna de Pinho Severo (264960)
- Victor Jesús Sotelo Chico (265173)

## Freeway game

![Baseline 1](./img/Freeway_logo.png)

Freeway is a video game written by David Crane for the Atari 2600 and published by Activision [[1]](https://en.wikipedia.org/wiki/Freeway_(video_game)).

In the game, two players compete against each other trying to make their chickens cross the street, while evading the cars passing by.
There are three possible actions: staying still, moving forward or moving backward.
Each time a chicken collides with a car, it is pushed back a few spaces and it takes a while until the player regains control of it.

When a chicken is successfully guided across the freeway, it is awarded one point and moved to the initial space, where it will try to cross the street again.
The game offers multiple scenarios with different vehicle configurations (varying their type, frequency and speed), and each match lasts 2 minutes and 16 seconds.
During the last 8 seconds the scores start blinking to indicate that the game is about to end.
Whoever has the most points after this, wins the game!

The image was extracted from the [manual of the game](https://www.gamesdatabase.org/Media/SYSTEM/Atari_2600/Manual/formated/Freeway_-_1981_-_Zellers.pdf).

[1 - Wikipedia - Freeway](https://en.wikipedia.org/wiki/Freeway_(video_game))

# Environment

We will be using the [OpenAI Gym](https://gym.openai.com/) toolkit.
This toolkit uses the [Arcade Learning Environment](https://github.com/mgbellemare/Arcade-Learning-Environment) to simulate the game through the [Stella](https://stella-emu.github.io/) emulator.

Although the game offers multiple scenarios, we are going to consider only the first one. Also, we will be controlling a *single chicken*, while we try to maximize its score.

In this configuration, there are ten lanes and each lane contains exactly one car (with a different speed and direction).
Whenever an action is chosen, it is repeated for $k$ frames, $k \in \{2, 3, 4\}$.

This means that our environment is **stochastic** and it is also **episodic**, with its terminal state being reached whenever 2 minutes and 16 seconds have passed.
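To illustrate why a random action repeat makes the environment stochastic, here is a minimal sketch in plain Python. The `emulate_frame` and `step_with_frame_skip` helpers and their toy dynamics are hypothetical and are not part of Gym or ALE; the sketch also assumes that $k$ is drawn uniformly.

```python
import random

def emulate_frame(position, action):
    """Toy single-frame transition: the chicken moves one unit per frame."""
    # Hypothetical dynamics: 0 = stay, 1 = forward, 2 = backward.
    delta = {0: 0, 1: 1, 2: -1}[action]
    return max(0, position + delta)

def step_with_frame_skip(position, action, rng):
    """Repeat `action` for k frames, with k drawn from {2, 3, 4}."""
    k = rng.choice([2, 3, 4])
    for _ in range(k):
        position = emulate_frame(position, action)
    return position, k

rng = random.Random(0)
# The same state and the same action can lead to different next states.
outcomes = {step_with_frame_skip(0, 1, rng)[0] for _ in range(100)}
print(sorted(outcomes))
```

Because the number of repeated frames is not under the agent's control, the next state is a random variable even when the cars themselves move deterministically.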

## Stable Baselines Library

We will be using the [Stable Baselines](https://stable-baselines.readthedocs.io/en/master/) implementations of the Deep Reinforcement Learning algorithms.

It is a fork of OpenAI Baselines that is fully compatible with the OpenAI Gym environments and straightforward to use.

Under the hood, it uses the TensorFlow library to implement the neural networks.

# Setup

Install the dependencies:
```bash
pip install -r requirements.txt
pip install stable-baselines[mpi]
pip install tensorflow==1.15.0
```

# Useful Resources

Here you can find a list of useful links and materials that were used during this project.

* [Freeway-ram-v0 from OpenAI Gym](https://gym.openai.com/envs/Freeway-ram-v0/)
* [Manual of the game](https://www.gamesdatabase.org/Media/SYSTEM/Atari_2600/Manual/formated/Freeway_-_1981_-_Zellers.pdf)
* [Freeway Benchmarks](https://paperswithcode.com/sota/atari-games-on-atari-2600-freeway)

# Imports

In [None]:
import tensorflow as tf
import gym

from stable_baselines.common import make_vec_env
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack
from stable_baselines.common.atari_wrappers import make_atari
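from stable_baselines.common.policies import MlpPolicy, CnnPolicy

# Alias the DQN-specific policy so that it does not shadow the actor-critic
# CnnPolicy above, which is the one PPO2 expects. DQN code should use the alias.
from stable_baselines.deepq.policies import CnnPolicy as DQNCnnPolicy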

from stable_baselines import DQN
from stable_baselines import PPO2

# Action space

As we said above, the agent in this game has three possible actions at each frame, each represented by an integer:

* 0: Stay
* 1: Move forward
* 2: Move backward

In theory, a perfect chicken wouldn't ever need to move backward, since it is possible to know if moving forward would lead you into a collision (in the immediate frame or in the future frames).

# Baseline

## State of the art benchmarks

The image below (extracted from https://paperswithcode.com/sota/atari-games-on-atari-2600-freeway) shows the evolution of the scores over time using different techniques.

Today, the state-of-the-art approaches score 34.0 points using Deep Reinforcement Learning methods.

![Benchmarks](./img/state_of_art_scores.png)

# Previous Work

In the previous assignment, we explored how classical tabular Reinforcement Learning methods performed on this game.

We were able to achieve a peak score of 34 points in a single run using the SARSA($\lambda$) algorithm, which matches the state-of-the-art score for this game; on average, however, we scored about 31 points.

## Linear Function Approximation

In Project 1, we also worked with linear function approximators, which were based on the Monte Carlo, Q-Learning and Sarsa($\lambda$) algorithms. For each of them, we experimented with different sets of features to represent the state-action pairs, varied reward functions, different exploration rates, and in all of them we used only two actions (move backward and move forward), which were the best actions found by the tabular methods.

All function approximators obtained results close to those of the baseline, being slightly better. Of them, the Monte Carlo based approximator was the only one that managed to improve in relation to its tabular version, reaching an average score of 22.2, compared to the 13 points obtained by the latter.

Throughout the experiments, we observed that some linear approximators can be faster than the original algorithms, as happened with Sarsa($\lambda$). In addition, we saw that the Q-Learning and Sarsa($\lambda$) approximators achieved better results when they were given greater exploration capacity, and that the Monte Carlo based approximator benefited when less sparse feature vectors were adopted.

In summary, although linear function approximators are powerful tools, they were not good enough to improve the performance of our solutions, either due to the nature of the problem, the possibly poor quality of the features we created, or other factors that were not addressed.

# Reward Policy

In the base environment we are awarded one point each time we successfully cross the freeway.

# Methodology

There are a lot of parameters to be tuned in our algorithms.
However, we don't have the required computational power to properly optimize them all.

Since we still want to experiment with them and observe their impact on the obtained reward, we train each configuration for 400k timesteps and then select a few of them to train for longer.

We are aware that this is far from ideal from an optimization point of view, but it is the best we can do with our available machines.

## Algorithms

We will be comparing two different algorithms here: DQN and PPO.

The idea is to compare an off-policy method (DQN) and an on-policy method (PPO) regarding their stability and number of samples to convergence.

## Tensorboard

Tensorboard allows us to visualize and compare the evolution of the algorithms being tested.
It shows metrics such as the reward over time, the loss and the advantage in an accessible way.

We will be using it to present our results, understand the impact of different parameters and compare our solutions.

## Frame Stack

Before diving into the experiments' results, it is worth understanding some nuances of our testing environment.

We will be using image representations (frames) of the game as our state.
But by doing so, we end up losing some relevant information about the problem, like the direction in which the cars are moving.
In other words, the problem we are trying to solve becomes non-Markovian.

To solve this, we use the `VecFrameStack` wrapper, which stacks $n$ consecutive frames of the game, making our environment Markovian once again.
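The idea behind frame stacking can be sketched in a few lines of plain Python. This is only an illustration: the `FrameStacker` class is hypothetical and is not how `VecFrameStack` is implemented, and padding the stack with copies of the first frame on reset is a choice made for this sketch.

```python
from collections import deque

class FrameStacker:
    """Keep the n most recent frames and expose them as one observation."""

    def __init__(self, n_stack):
        self.n_stack = n_stack
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_frame):
        # Fill the stack with copies of the first frame so its size is fixed.
        self.frames.clear()
        for _ in range(self.n_stack):
            self.frames.append(first_frame)
        return tuple(self.frames)

    def step(self, new_frame):
        # The bounded deque drops the oldest frame automatically.
        self.frames.append(new_frame)
        return tuple(self.frames)

# Toy "frames": the x position of a single car in one lane.
stacker = FrameStacker(n_stack=4)
obs = stacker.reset(10)
for x in (12, 14, 16):
    obs = stacker.step(x)

# The latest frame alone says nothing about direction; the stack does.
velocity = obs[-1] - obs[-2]
print(obs, velocity)
```

With a single frame the agent only sees where each car is; with two or more frames it can also infer in which direction and how fast the car is moving.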

## Vectorized Environments

[Vectorized environments](https://stable-baselines.readthedocs.io/en/master/guide/vec_envs.html) allow us to stack multiple environments, thus permitting us to train an agent on $n$ environments per step.

It is controlled by the `num_env` parameter in the `make_atari_env` function.

Although it isn't possible to use vectorized environments with DQN, we will be using them for the PPO2 algorithm.
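As a rough illustration of what vectorization means, the sketch below steps several independent toy environments with a single call. `ToyEnv` and `ToyVecEnv` are hypothetical stand-ins written for this example; they do not follow the Gym or Stable Baselines interfaces.

```python
import random

class ToyEnv:
    """Hypothetical stand-in for one game instance (not a Gym environment)."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.position = 0

    def step(self, action):
        # Moving forward succeeds unless a random "car" blocks the chicken.
        if action == 1 and self.rng.random() > 0.3:
            self.position += 1
        return self.position

class ToyVecEnv:
    """Step n independent environments with one call, one action per env."""

    def __init__(self, n_env):
        self.envs = [ToyEnv(seed=i) for i in range(n_env)]
        self.num_envs = n_env

    def step(self, actions):
        return [env.step(a) for env, a in zip(self.envs, actions)]

vec_env = ToyVecEnv(n_env=4)
for _ in range(10):
    observations = vec_env.step([1] * vec_env.num_envs)

# Each call produced 4 transitions, so 10 calls yield 40 samples.
print(observations)
```

Each call returns one observation per environment, so the agent collects $n$ transitions per step instead of one.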

# Experiments

## PPO2

To assess the performance of the PPO in solving our problem, we decided to vary some of its parameters, which were: the discount factor ($\gamma$), the learning rate, the trade-off factor between bias and variance (lam), the number of frames in the input stack of the algorithm, the type of representation of the states (image or RAM), the number of environments and the total timesteps of training.

For this, we created a baseline for comparison, with the following configurations:

- Policy network: CnnPolicy
- Discount factor ($\gamma$): 0.99.
- Learning rate: 0.00025.
- Trade-off factor (lam): 0.95.
- Number of frames in the stack: 4.
- Type of representation: image.
- Number of environments: 4.
- Training timesteps: 400K.

Below, we describe the experiments and results obtained.

In [5]:
TOTAL_TIMESTEPS = 400000

In [38]:
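from stable_baselines.common.vec_env import DummyVecEnv


def experiment(tb_log, gamma, learning_rate, lam, policy, timesteps=TOTAL_TIMESTEPS, n_stack=4, n_env=8, state_repr='frames'):
    """Train a PPO2 agent on Freeway and log its progress to Tensorboard."""
    if state_repr == 'ram':
        # make_atari returns a single Gym environment, so wrap copies of it in a
        # DummyVecEnv: VecFrameStack only accepts vectorized environments.
        env = DummyVecEnv([lambda: make_atari('Freeway-ramNoFrameskip-v0') for _ in range(n_env)])
    else:
        # Vectorized image-based environments
        env = make_atari_env('FreewayNoFrameskip-v0', num_env=n_env, seed=0)
    # Stack the last n_stack observations
    env = VecFrameStack(env, n_stack=n_stack)

    model = PPO2(policy, env, verbose=0, tensorboard_log=tb_log, gamma=gamma, learning_rate=learning_rate, lam=lam)
    model.learn(total_timesteps=timesteps)
    return model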

### Learning Rate

In [25]:
GAMMA = 0.99
LEARNING_RATE = 0.00025
LAM = 0.95

In [26]:
experiment('Experiments_PPO_A', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=LAM, policy=MlpPolicy);

In [28]:
experiment('Experiments_PPO_A', gamma=GAMMA, learning_rate=0.00025 * 0.25, lam=LAM, policy=MlpPolicy);

In [30]:
experiment('Experiments_PPO_A', gamma=GAMMA, learning_rate=0.00025 * 0.5, lam=LAM, policy=MlpPolicy);

In [32]:
experiment('Experiments_PPO_A', gamma=GAMMA, learning_rate=0.00025 * 2, lam=LAM, policy=MlpPolicy);

In [34]:
experiment('Experiments_PPO_A', gamma=GAMMA, learning_rate=0.00025 * 4, lam=LAM, policy=MlpPolicy);

|Parameter|A1|A2|A3|A4|A5|
|-|-|-|-|-|-|
|GAMMA|0.99|0.99|0.99|0.99|0.99|
|LEARNING_RATE|$2.5 \cdot 10^{-4}$|$6.25 \cdot 10^{-5}$|$1.25 \cdot 10^{-4}$|$5 \cdot 10^{-4}$|$1 \cdot 10^{-3}$|
|LAM|0.95|0.95|0.95|0.95|0.95|
|Smoothed Reward|21.49|21.41|0.0|21.21|21.17|

|![](./img/alpha_complete.png)|
|-| 

|LEARNING_RATE = $2.5 \cdot 10^{-4}$|LEARNING_RATE = $6.25 \cdot 10^{-5}$|LEARNING_RATE = $1.25 \cdot 10^{-4}$|
|-|-|-|
|![](./img/alpha_2.5-4.png)|![](./img/alpha_6.25-5.png)|![](./img/alpha_1.25-4.png)|

|LEARNING_RATE = $5 \cdot 10^{-4}$|LEARNING_RATE = $1 \cdot 10^{-3}$|
|-|-|
|![](./img/alpha_5-4.png)|![](./img/alpha_1-3.png)|

The first thing that catches our attention in these experiments is the run with Learning Rate $1.25 \cdot 10^{-4}$, which didn't score a single point.

Comparing it with a smaller Learning Rate ($6.25 \cdot 10^{-5}$) and with a higher Learning Rate ($5 \cdot 10^{-4}$), we can't see a clear relationship between the performance of the chicken and the Learning Rate used.
Thus, we came up with the hypothesis that this odd result is due to the random nature of the algorithm, as we will show in the next section.

### Variability of experiments

Here we will be exploring how unstable a run of the algorithm is by repeating it multiple times with the same parameter settings.

According to Nikishin et al. \[1\], Deep Reinforcement Learning methods are notoriously unstable during training, and their performance isn't guaranteed to increase monotonically.
We observed this exact behavior when training our agents, where the results would vary a lot between different runs with the same parameters.

Below you can find a comparison of three different runs of PPO2 with $\gamma$ = 0.99, learning rate = 0.00025 and lam = 0.95.

\[1\] - [Improving Stability in Deep Reinforcement Learning with Weight Averaging](https://www.gatsby.ucl.ac.uk/~balaji/udl-camera-ready/UDL-24.pdf)

In [20]:
GAMMA = 0.99
LEARNING_RATE = 0.00025
LAM = 0.95

In [21]:
experiment('Experiments_PPO_Var', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=LAM, policy=MlpPolicy);

In [22]:
experiment('Experiments_PPO_Var', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=LAM, policy=MlpPolicy);

In [None]:
experiment('Experiments_PPO_Var', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=LAM, policy=MlpPolicy);

|Parameter|R1|R2|R3|
|-|-|-|-|
|GAMMA|0.99|0.99|0.99|
|LEARNING_RATE|0.00025|0.00025|0.00025|
|LAM|0.95|0.95|0.95|
|Smoothed Reward|22.68|21.75|21.75|

|![](./img/var_complete.png)|
|-|

|R1|R2|R3|
|-|-|-|
|![](./img/var_1.png)|![](./img/var_2.png)|![](./img/var_3.png)|

From the graphs above, we can see that the first iteration in which the agent starts to score some points can be ~20k, ~140k or ~70k, purely based on random factors.

This is a significant problem that we are aware of, and it impacts our results.

### Influence of discount factor with image representation

In the table below, we can see the settings of the experiments carried out in this section, as well as the results achieved after being smoothed by a factor of 0.999. In addition, the following graphs show how the reward obtained by the agents varies according to the value of the discount factor ($\gamma$) and the number of timesteps.
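The smoothing factor mentioned above is assumed here to be the exponential moving average applied by Tensorboard's smoothing slider; the exact formula used by the Tensorboard version in these experiments was not checked, so the sketch below is only an approximation. The `smooth` function and the reward values are hypothetical.

```python
def smooth(values, weight):
    """Exponential moving average: higher `weight` means heavier smoothing."""
    smoothed = []
    last = values[0]  # seed the average with the first observation
    for value in values:
        last = weight * last + (1.0 - weight) * value
        smoothed.append(last)
    return smoothed

# Hypothetical per-episode rewards, for illustration only.
rewards = [0, 0, 10, 20, 22, 21, 23]
print(smooth(rewards, weight=0.6))
```

With a weight close to 1, such as 0.999, the smoothed curve reacts very slowly to new values, which is why smoothed rewards can sit well below the most recent raw episode scores.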

In [2]:
LEARNING_RATE = 0.00025
LAM = 0.95
N_STACK = 4
N_ENV = 4

In [6]:
experiment('Experiments_PPO_gamma_75', gamma=0.75, learning_rate=LEARNING_RATE, lam=LAM, policy=CnnPolicy,
           n_stack=N_STACK, n_env=N_ENV);
experiment('Experiments_PPO_gamma_90', gamma=0.90, learning_rate=LEARNING_RATE, lam=LAM, policy=CnnPolicy,
           n_stack=N_STACK, n_env=N_ENV);
experiment('Experiments_PPO_baseline', gamma=0.99, learning_rate=LEARNING_RATE, lam=LAM, policy=CnnPolicy,
           n_stack=N_STACK, n_env=N_ENV);

| **Parameter**   | **Exp 1** | **Exp 2** | **Exp 3** |
|-----------------|-----------|-----------|-----------|
| **Gamma**       | 0.75      | 0.90      | 0.99      |
| Learning Rate   | 0.00025   | 0.00025   | 0.00025   |
| Lam             | 0.95      | 0.95      | 0.95      |
| Stacks          | 4         | 4         | 4         |
| Representation  | image     | image     | image     |
| Environments    | 4         | 4         | 4         |
| Policy          | CnnPolicy | CnnPolicy | CnnPolicy |
| Smoothed reward | 9.55      | 15.74     | **20.04** |

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/01-images-gamma/svg/image_gamma_all.svg width="400"></center>

| $\gamma$=0.75 | $\gamma$=0.90 | $\gamma$=0.99 |  
|---|---|---|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/01-images-gamma/svg/image_gamma_75.svg width="250"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/01-images-gamma/svg/image_gamma_90.svg width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/01-images-gamma/svg/image_gamma_baseline.svg width="250"> |

We see that all the graphs show considerable variation in the reward and that the most stable curve is the one obtained with $\gamma = 0.99$ (baseline). On the other hand, the other curves drop sharply after a certain number of timesteps, the one obtained with $\gamma = 0.75$ being the first to drop.

This indicates that, for the problem addressed, a far-sighted PPO agent is able to achieve better performance and that the more limited this view is, the worse the results become. Finally, although the table shows the highest value of the best smoothed curve, the values of this curve vary around $22$ points.

Below you can find a gif showing an episode of the agent with $\gamma = 0.75$ after the 400k timesteps.

We can see that its performance has deteriorated: the agent un-learns how to score points and keeps moving meaninglessly back and forth.

| |
|-|
|<img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/01ad1553f86ea6038e4fa81bf338fc86bc1769b8/marianna/project02-rl/gif/ppo_gamma_75_gif.gif width="400">|

### Influence of the number of frames in the stack with image representation

Since the learning algorithms receive images as input, to model the problem as an MDP we need to group a set of frames (images) into structures called stacks, which are then given as input to these algorithms. This gives the models access to temporal information about the game.

In order to analyze the influence of the size of the frame stacks on the performance of the created PPO agents, we experimented with the baseline with stacks of size 1, 4, 32 and 64, as described in the table below.

In [None]:
LEARNING_RATE = 0.00025
LAM = 0.95
GAMMA = 0.99
N_ENV = 4

In [None]:
experiment('Experiments_PPO_stack_1', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=LAM, policy=CnnPolicy,
           n_stack=1, n_env=N_ENV);
experiment('Experiments_PPO_stack_4', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=LAM, policy=CnnPolicy,
           n_stack=4, n_env=N_ENV);
experiment('Experiments_PPO_stack_32', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=LAM, policy=CnnPolicy,
           n_stack=32, n_env=N_ENV);
experiment('Experiments_PPO_stack_64', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=LAM, policy=CnnPolicy,
           n_stack=64, n_env=N_ENV);

| **Parameter**   | **Exp 1** | **Exp 2** | **Exp 3** | **Exp 4** |
|-----------------|-----------|-----------|-----------|-----------|
| Gamma           | 0.99      | 0.99      | 0.99      | 0.99      |
| Learning Rate   | 0.00025   | 0.00025   | 0.00025   | 0.00025   |
| Lam             | 0.95      | 0.95      | 0.95      | 0.95      |
| **Stacks**      | 1         | 4         | 32        | 64        |
| Representation  | image     | image     | image     | image     |
| Environments    | 4         | 4         | 4         | 4         |
| Policy          | CnnPolicy | CnnPolicy | CnnPolicy | CnnPolicy |
| Smoothed reward | **21.16** | 20.04     | 18.23     | 9.72      |

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/02-images-stacks-gamma/images-stacks-gamma-all.svg width="400"></center>

| stack = 1 | stack = 4 | stack = 32 | stack = 64 |  
|---|---|---|---|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/02-images-stacks-gamma/images-stacks-gamma-stack-1.svg width="200"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/02-images-stacks-gamma/images-stacks-gamma-baseline.svg width="200"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/02-images-stacks-gamma/images-stacks-gamma-stack-32.svg width="200"> |<img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/02-images-stacks-gamma/images-stacks-gamma-stack-64.svg width="200"> |


According to the graphs, we observed that, as we increase the number of frames in the stacks, the agent needs more timesteps to learn a policy that allows it to earn good rewards. This behavior may indicate that knowledge about a more distant past doesn’t necessarily bring improvements to agents for the problem addressed.

Despite this, all curves tend to average reward values that are close and are around $21$ to $22$ points. In addition, the training time for each agent also increases by adding more frames, which is expected, since more data will be processed.

### Influence of the discount factor with RAM representation

In order to evaluate how PPO behaves when receiving the RAM values of the game as input instead of the image, we performed experiments using the full RAM and an MLP policy network, while varying the discount factor. The results are shown in the graphs below; the smoothed rewards for $\gamma = 0.75$, $\gamma = 0.90$ and $\gamma = 0.99$ are $20.92$, $20.84$ and $19.57$, respectively.

In [None]:
LEARNING_RATE = 0.00025
LAM = 0.95
N_STACK = 4
N_ENV = 4
REPR = 'ram'

In [None]:
experiment('Experiments_PPO_ram_gamma_75', gamma=0.75, learning_rate=LEARNING_RATE, lam=LAM, policy=MlpPolicy,
           n_stack=N_STACK, n_env=N_ENV, state_repr=REPR);
experiment('Experiments_PPO_ram_gamma_90', gamma=0.90, learning_rate=LEARNING_RATE, lam=LAM, policy=MlpPolicy,
           n_stack=N_STACK, n_env=N_ENV, state_repr=REPR);
experiment('Experiments_PPO_ram_baseline', gamma=0.99, learning_rate=LEARNING_RATE, lam=LAM, policy=MlpPolicy,
           n_stack=N_STACK, n_env=N_ENV, state_repr=REPR);

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/03-ram-gamma/ram-gamma-all.svg width="400"></center>

| $\gamma$=0.75 | $\gamma$=0.90 | $\gamma$=0.99 |  
|---|---|---|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/03-ram-gamma/ram-gamma-75.svg width="250"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/03-ram-gamma/ram-gamma-90.svg width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/03-ram-gamma/ram-gamma-baseline.svg width="250"> |

Observing the curves, we see that they are quite noisy, indicating difficulties during the agent's learning process. Furthermore, unlike what we saw when using images, now a lower value of $ \gamma $ leads to better results. This indicates that, for this agent, seeing values closer to the present time is more significant. This time, the average rewards were between 21 and 23 points.

### Influence of input representation

Another important comparison, still in the context of the previous section, is between the rewards obtained by using RAM or images as input. In the graphs below, we can see the results for the image and RAM representations and for the three discount factors tested.

#### <center>Discount factor equal to 0.75</center>

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-75-all.svg width="300"></center>
    

| Image | RAM |  
|---|---|
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-75-image.svg width="200"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-75-ram.svg width="200"> | 

---

#### <center>Discount factor equal to 0.90</center>

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-90-all.svg width="300"></center>
    
| Image | RAM |  
|---|---|
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-90-image.svg width="200"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-90-ram.svg width="200"> |

---

#### <center>Discount factor equal to 0.99</center>

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-baseline-all.svg width="300"></center>
    
| Image | RAM |  
|---|---|
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-baseline-image.svg width="200"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-baseline-ram.svg width="200"> |


We see that the rewards obtained are similar, except for the regions where the image-based curves for $\gamma = 0.75$ and $\gamma = 0.90$ start to decrease. This indicates that RAM can also provide a good representation for our problem.

Despite this, given the big variability of the experiments, more tests need to be carried out to verify this hypothesis.

### LAM

The `lam` is a factor that controls the trade-off of bias vs variance for Generalized Advantage Estimator [(PPO2 doc)](https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html).
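To make the role of `lam` more concrete, here is a minimal sketch of Generalized Advantage Estimation over a single trajectory. It is not the Stable Baselines implementation: the `gae` function, the rewards and the value estimates are hypothetical, and episode termination inside the trajectory is ignored for brevity.

```python
def gae(rewards, values, last_value, gamma, lam):
    """Generalized Advantage Estimation over one trajectory."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# Hypothetical rewards and value estimates, for illustration only.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8]

one_step = gae(rewards, values, last_value=0.0, gamma=0.99, lam=0.0)
full_return = gae(rewards, values, last_value=0.0, gamma=0.99, lam=1.0)
print(one_step)
print(full_return)
```

With `lam=0` each advantage is just the one-step TD error, which has low variance but inherits the bias of the value estimates; with `lam=1` it becomes the discounted return minus the value estimate, which is less biased but noisier.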

In [7]:
GAMMA = 0.99
LEARNING_RATE = 0.00025
LAM = 0.95

In [8]:
experiment('Experiments_PPO_L', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=LAM, policy=MlpPolicy);






In [10]:
experiment('Experiments_PPO_L', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=1.00, policy=MlpPolicy);

In [12]:
experiment('Experiments_PPO_L', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=0.99, policy=MlpPolicy);

In [14]:
experiment('Experiments_PPO_L', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=0.5, policy=MlpPolicy);

In [16]:
experiment('Experiments_PPO_L', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=0.25, policy=MlpPolicy);

In [18]:
experiment('Experiments_PPO_L', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=0.0, policy=MlpPolicy);

|Parameter|L1|L2|L3|L4|L5|L6|
|-|-|-|-|-|-|-|
|GAMMA|0.99|0.99|0.99|0.99|0.99|0.99|
|LEARNING_RATE|0.00025|0.00025|0.00025|0.00025|0.00025|0.00025|
|LAM|0.95|1.00|0.99|0.5|0.25|0.0|
|Smoothed Reward|21.62|23.71|22.51|21.88|24.28|20.39|

|![](./img/lam_complete.png)|
|-|

|LAM = 0.95|LAM = 1.00|LAM = 0.99|
|-|-|-|
|![](./img/lam_0.95.png)|![](./img/lam_1.0.png)|![](./img/lam_0.99.png)|

|LAM = 0.5|LAM = 0.25|LAM = 0.0|
|-|-|-|
|![](./img/lam_0.5.png)|![](./img/lam_0.25.png)|![](./img/lam_0.0.png)|

From the graphs above we can see that the LAM parameter doesn't seem to have a clear effect on the overall performance of the algorithm.

Although the run with `lam` = 0.25 ended up with a smoothed reward of 24.28, it appears to be quite unstable, and no apparent relationship between this parameter and the reward could be noted.

### Influence of the total timesteps

All previous experiments had a duration of 400K timesteps. In order to evaluate the influence of this parameter on the agents' performance, we ran new experiments using the baseline and 1M timesteps. The results are shown in the figure below, for two executions of the baseline setting.

In [None]:
LEARNING_RATE = 0.00025
LAM = 0.95
GAMMA = 0.99
N_ENV = 4
TIMESTEPS = 1000000

In [None]:
experiment('Experiments_PPO_baseline_1M', gamma=GAMMA, learning_rate=LEARNING_RATE, lam=LAM, policy=CnnPolicy,
           n_stack=N_STACK, n_env=N_ENV, timesteps=TIMESTEPS);

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/05-timesteps/timestep-baseline-1-2.svg width="400"></center>

As we can see, after an initial growth, the reward stabilizes for a few thousand timesteps and starts growing again, allowing us to reach values much higher than those obtained previously. This indicates that we can still exploit a lot of the PPO's capacity for our problem, if we train it for a sufficient number of timesteps. Additionally, applying a smoothing of 0.6 on the results, we achieve a final smoothed reward of 28.73, and a peak reward of 31.

### Example Episode

Below you can find a gif showing how our trained agent performs after these 1M timesteps.

| |
|-|
|<img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/01ad1553f86ea6038e4fa81bf338fc86bc1769b8/marianna/project02-rl/gif/ppo_baseline_1M_gif.gif width="400">|

### Important Notes

After the experiments with the discount factor, the forms of representation of the input, the number of frames of the stacks and the amount of training timesteps, we observed some important characteristics that should be highlighted.


The first is the big variability of the results, so that two experiments performed with the same configurations sometimes generate considerably different performances. 

Another important observation is that, as we realized for the experiments with the baseline and 1M timesteps, we would probably get better results for all other configurations if we trained them for longer.


---

## DQN

The [Deep-Q-Network](https://arxiv.org/pdf/1312.5602.pdf) is a deep learning model that learns control policies directly from high-dimensional sensory input using reinforcement learning.   

The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating the future rewards.  

The Deep-Q-Network algorithm observes the image $x_t$ from the emulator which is a vector of raw pixel values representing the current screen. In addition it receives a reward $r_t$ representing the change in game score.  

It considers sequences of actions and observations,  

$s_t = x_1, a_1, x_2, \dots, a_{t-1}, x_t$,  

and learns game strategies that depend upon these sequences.  


The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value $Q^*(s', a')$ of the sequence $s'$ at the next time-step was known for all possible actions $a'$, then the optimal strategy is to select the action $a'$ maximising the expected value of $r + \gamma Q^*(s', a')$, where $\gamma$ is the reward discount factor per time-step,  
  
$Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$  
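In practice the expectation in the Bellman equation is approximated from sampled transitions. The sketch below computes the resulting one-sample target for a single transition; the `q_learning_target` helper and the Q-value estimates are hypothetical, and a real DQN obtains the next-state values from a neural network rather than a list.

```python
def q_learning_target(reward, next_q_values, gamma, done):
    """Sample-based Bellman target: r + gamma * max_a' Q(s', a')."""
    if done:
        # No future reward can be collected after a terminal state.
        return reward
    return reward + gamma * max(next_q_values)

# Hypothetical Q-value estimates for the three Freeway actions in s'.
next_q = [1.0, 2.5, 0.3]  # stay, forward, backward
target = q_learning_target(reward=1.0, next_q_values=next_q, gamma=0.99, done=False)
print(target)
```

The network is then trained to move its estimate of $Q(s, a)$ toward this target.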

     


In this project we applied the [algorithm implemented by Stable Baselines](https://stable-baselines.readthedocs.io/en/master/modules/dqn.html) to the Atari Freeway game.

### Discount Factor $\gamma$

The discount factor $\gamma$ determines how much the agent cares about rewards in the distant future relative to those in the immediate future.  
  
If $\gamma$=0, the agent will be completely myopic and only learn about actions that produce an immediate reward.  

If $\gamma$=1, the agent will evaluate each of its actions based on the total sum of all future rewards.  
  
We used a $\gamma$ value of 0.99 in order to make our agent care about the distant future, and we also decreased this value to 0.90 and 0.75 to see how this impacts the agent's behavior.  
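The effect of the three values of $\gamma$ can be seen on a toy example. The reward sequence below is hypothetical (a single point scored after 20 steps of no reward) and is not taken from our experiments.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its delay."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical episode fragment: a single point scored after 20 steps.
rewards = [0.0] * 20 + [1.0]

for gamma in (0.99, 0.90, 0.75):
    print(gamma, discounted_return(rewards, gamma))
```

A point that is 20 steps away keeps roughly 82% of its value with $\gamma = 0.99$, about 12% with $\gamma = 0.90$, and well under 1% with $\gamma = 0.75$, so lower discount factors make a successful crossing nearly invisible to the agent at the start of the road.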

Thus, we will be experimenting with 3 different parameters set:

| Parameter | G1 | G2 | G3 |
|------|----|----|----|
| **`GAMMA`** | 0.99 | 0.90 | 0.75 |
| `LEARNING_RATE` | 0.0005 | 0.0005 | 0.0005 |
| `EXPLORATION_RATE` | 0.1 | 0.1 | 0.1 |
|`Smoothed Reward` |20.73|23.25|21.72|


| |  
|------|  
|<img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/dqn_gamma.png width="400">|

| $\gamma$=0.99 | $\gamma$=0.90 | $\gamma$=0.75 |  
|---|---|---|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/vermelho.png width="250"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/pink.png width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/azul_claro.png width="250"> |

From the plots above, we can see that the three values of $\gamma$ lead the agents to similar scores, but some of them take longer to reach those scores.

### Learning Rate


The learning rate determines to what extent newly acquired information overrides old information.  

If the learning rate is 0, the agent will learn nothing (exclusively exploiting prior knowledge).  
If the learning rate is 1, the agent considers only the most recent information (ignoring prior knowledge to explore possibilities).  
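
These two extremes can be illustrated with the tabular Q-learning update $Q \leftarrow Q + \alpha\,(\text{target} - Q)$. This is only a sketch of the intuition: in DQN the learning rate is the step size of the gradient-based optimizer rather than a direct interpolation weight, and the function name and numbers below are illustrative assumptions.

```python
def q_update(old_value: float, target: float, learning_rate: float) -> float:
    """Move the old estimate towards the target by a fraction learning_rate."""
    return old_value + learning_rate * (target - old_value)


old_q, target = 2.0, 10.0

print(q_update(old_q, target, learning_rate=0.0))  # estimate is unchanged
print(q_update(old_q, target, learning_rate=1.0))  # estimate is replaced by the target
print(q_update(old_q, target, learning_rate=0.5))  # halfway between the two
```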

We will be experimenting with three different parameter sets:

| Parameter | G1 | G2 | G3 |
|------|----|----|----|
| `GAMMA` | 0.99 | 0.99 | 0.99 |
| **`LEARNING_RATE`** | 0.0005 | 0.0010 | 0.0050 |
| `EXPLORATION_RATE` | 0.1 | 0.1 | 0.1 |
|`Smoothed Reward` |20.73|21.13|2.616e-19 (approx. 0)|

| |  
|------|  
|<img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/dqn_lr.png width="400">|


| `LEARNING_RATE`=0.0005 | `LEARNING_RATE`=0.0010 | `LEARNING_RATE`=0.0050 |  
|---|---|---|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/vermelho.png width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/cinza.png width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/verde.png width="250"> |

As we can see in the plots above, the learning rate of 0.0005 and 0.0010 achieved approximately the same score values.  
On the other hand, the learning rate of 0.0050 performed poorly and did not learn at all. 

### Exploration rate

The exploration rate is the probability that our agent will explore the environment rather than exploit it.  

We used 0.1 as our baseline exploration value. In order to see how the exploration rate impacts the agent's behavior, we also ran experiments using double this value (0.2) and half of it (0.05).
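
A minimal sketch of epsilon-greedy action selection, the usual way an exploration rate is applied, is shown below. It is an illustration under assumptions rather than the Stable Baselines code: the function name, the Q-values and the fixed (non-annealed) epsilon are all hypothetical.

```python
import random
from typing import Sequence


def epsilon_greedy(q_values: Sequence[float], epsilon: float,
                   rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # hypothetical estimates; action 1 is greedy

for epsilon in (0.05, 0.1, 0.2):
    actions = [epsilon_greedy(q_values, epsilon, rng) for _ in range(10_000)]
    greedy_share = actions.count(1) / len(actions)
    print(f"epsilon={epsilon}: greedy action chosen {greedy_share:.1%} of the time")
```

Note that a random pick can still land on the greedy action, so the greedy action is chosen somewhat more often than $1 - \epsilon$ of the time.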

All in all, these are the parameters that we are going to use to execute this experiment:

| Parameter | G1 | G2 | G3 |
|------|----|----|----|
| `GAMMA` | 0.99 | 0.99 | 0.99 |
| `LEARNING_RATE` | 0.0005 | 0.0005 | 0.0005 |
| **`EXPLORATION_RATE`** | 0.1 | 0.05 | 0.20 |
|`Smoothed Reward` |20.73|22.02|21.48|

| |  
|------|  
|<img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/dqn_exploration.png width="400">|


| `EXPLORATION_RATE`=0.20 | `EXPLORATION_RATE`=0.10 | `EXPLORATION_RATE`=0.05 |  
|---|---|---|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/laranja.png width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/vermelho.png width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/azul.png width="250"> |

As presented above, the three exploration rates lead the agents to about the same scores, but the scores do not start increasing at the same time, as we also saw for the $\gamma$ parameter.

### DQN experiments discussion

According to the DQN plots, the agents achieved approximately the same scores at the end of 400k steps; the main difference between them is how fast their scores increased.

From the experiments we ran, we are not able to indicate precisely which hyperparameters are best, because the scores do not seem to depend on them in a clearly linear way.

To explain that, we rely on Hado van Hasselt et al., who demonstrated in the paper [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461) that the DQN algorithm suffers from substantial overestimations in some games in the Atari domain.  

They demonstrated that estimation errors of any kind can induce an upward bias, regardless of whether these errors are due to environmental noise, function approximation, non-stationarity, or any other source. This is important, because in practice any method will incur some inaccuracies during learning, simply due to the fact that the true values are initially unknown.
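
The upward bias can be reproduced with a small simulation, sketched below under simplifying assumptions: every action has a true value of 0, and each estimate is corrupted by independent zero-mean Gaussian noise. The function name and the noise level are illustrative choices, not taken from the paper.

```python
import random
import statistics


def mean_of_max(num_actions: int, noise_std: float, trials: int,
                rng: random.Random) -> float:
    """Average of max_a (Q_true(a) + noise) when every true value is 0."""
    maxima = []
    for _ in range(trials):
        noisy_estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        maxima.append(max(noisy_estimates))
    return statistics.fmean(maxima)


rng = random.Random(0)
# All three actions are truly worth 0, so the true maximum is 0 as well.
bias = mean_of_max(num_actions=3, noise_std=1.0, trials=10_000, rng=rng)
print(f"estimated max: {bias:.3f} (true max: 0.000)")
```

Although each individual estimate is unbiased, taking the maximum over them yields a clearly positive average, which is the effect the `max` operator has inside the Q-learning target.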

As they show in their experiments, whose plots are presented below, the DQN algorithm can be consistently and sometimes vastly overoptimistic about the value of the current greedy policy, as can be seen by comparing the orange learning curves in the top row of plots to the straight orange lines, which represent the actual discounted value of the best learned policy.   

| |  
|------|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plot%20from%20the%20paper%20Deep%20Reinforcement%20Learning%20with%20Double%20Q-learning.png width="800"> |  
| Image from the paper [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461) |  


In the image above we can see the detrimental effect of the DQN overestimations on the score achieved by the agent as it is evaluated during training in comparison with Double-DQN.

Also, according to Sebastian Thrun and Anton Schwartz in the paper [Issues in Using Function Approximation for Reinforcement Learning](https://www.ri.cmu.edu/pub_files/pub1/thrun_sebastian_1993_1/thrun_sebastian_1993_1.pdf), Q-learning with function approximation (as in DQN) can systematically overestimate values when the approximator is used in a recursive value estimation scheme. This can make learning fail completely in some cases, if the parameters exceed the upper or lower bound for expected failure of Q-learning. The effect of exceeding the upper bound is presented in the figure below:  

| |  
|------|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plot%20from%20the%20paper%20Issues%20in%20Using%20Function%20Approx%20for%20RL.png width="600"> |  
| Image from the paper [Issues in Using Function Approximation for Reinforcement Learning](https://www.ri.cmu.edu/pub_files/pub1/thrun_sebastian_1993_1/thrun_sebastian_1993_1.pdf) |  


In the figure we can see the learning curves as a function of $\gamma$. Each diagram shows the
performance (probability of reaching the goal state) as a function of the number of training episodes. Note that learning fails completely if  $\gamma$ > 0.98.

Additionally, according to Kamyar Azizzadenesheli et al. in the paper [Efficient Exploration through Bayesian Deep Q-Networks](https://arxiv.org/abs/1802.04412), DQN is empirically sensitive to the learning rate, and changing it can degrade performance to even worse than a random policy.

As we could see in our learning rate experiments, the learning rate of 0.0050 possibly entered the failure region and therefore did not learn at all, or it may have suffered from the fact that DQN is sensitive to the learning rate.

For the discount factor and exploration rate experiments, we found no consistent pattern in which agent achieved higher scores faster. This apparent lack of correlation between the hyperparameter changes and the results may be caused by DQN's overestimation characteristic, and averaging over more than one run could be necessary to see the expected correlation. 
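
Averaging over several runs could be done along the lines of the sketch below. The scores in it are placeholders for illustration only, not measured results, and the configuration names and function name are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(final_scores: Sequence[float]) -> Dict[str, float]:
    """Mean and sample standard deviation of final scores across runs."""
    return {
        "mean": statistics.fmean(final_scores),
        "std": statistics.stdev(final_scores) if len(final_scores) > 1 else 0.0,
    }


# Placeholder scores for illustration only; these are NOT measured results.
placeholder_scores = {"config_a": [20.0, 22.0, 21.0], "config_b": [21.0, 19.0, 23.0]}
for name, scores in placeholder_scores.items():
    summary = summarize_runs(scores)
    print(f"{name}: {summary['mean']:.2f} +/- {summary['std']:.2f}")
```

Reporting the spread next to the mean makes it easier to tell whether a difference between two configurations is larger than the run-to-run variation.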

### Example Episode

---

# Final Thoughts

## About the stability of the solution

One of the biggest challenges of this work was how to deal with the instability of the Deep Reinforcement Learning methods.

Many times, simply re-running the experiment would lead to different results, making it hard to compare and draw conclusions out of it.

This can be seen in the Learning Rate experiments for both the DQN and the PPO algorithms, where one of the tests scored almost no points at all, while other parameter values achieved reasonable scores.

## Computational cost

Another issue was regarding the required time to train the algorithms.
In order to run the 400k iterations, DQN would take about an hour and a half, and due to our time constraints, it was hard to run many longer experiments.

The PPO was a lot faster to train than the DQN because it accepts a vectorized environment as input.
This allows the training agent to train in multiple environments per step, allowing a much faster training.
Also, PPO was already designed with performance in mind, aiming to be a faster algorithm than its peers.

## Optimality and convergence

Although we believe we would be able to achieve better results if we left our algorithms training longer, we are still satisfied with what we achieved here, particularly with PPO.
We were able to achieve a smoothed average of 28.73 points using it, and it isn't that far from the state-of-the-art score of 34 points.
It is worth noting here that this solution hadn't converged yet.

The DQN algorithm didn't perform as well as PPO.
Training it for 1M timesteps, the model converged to a 0.6 smoothed average of XXXXXXXXXx points.
Interestingly, the DQN converged a lot faster (in timesteps) than PPO, which is characteristic of off-policy methods.

## Comparing with classical methods

In the first assignment we used classical tabular methods to tackle this problem.
Using the SARSA($\lambda$) algorithm, we were able to achieve ~31 points on average, and a peak of 34.

The classical methods still provided a better solution for our problem, but we believe that training the deep learning algorithms longer could lead to better results.

All in all, we were satisfied with what we achieved, exploring and experimenting a lot on two different deep reinforcement learning methods, in a challenging problem!

# TODOS:

- [ ] Presentation
    - [ ] Prepare a set of slides to help us present
- [ ] Include all experiments in the report
    - [ ] 1M DQN
- [ ] Generate GIFs of the best chickens (the winners!)
- [X] Write the conclusion
    - [X] Include comparisons with project 1
- [X] Improve the text of the experiments section