<a href="https://colab.research.google.com/github/DionisiusMayr/FreewayGame/blob/main/aline.almeida/DQN/aa_DQN_freeway.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Freeway

This is the **second project** for the MC935rA/MO436A - Reinforcement Learning course, taught by Prof. Esther Colombini.

In this project we propose to apply Deep Reinforcement Learning methods to teach an agent how to play the Freeway Atari game.

**Group members:**
- Aline Gabriel de Almeida
- Dionisius Oliveira Mayr (229060)
- Leonardo de Oliveira Ramos (171941)
- Marianna de Pinho Severo (264960)
- Victor Jesús Sotelo Chico (265173)

## Freeway game

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/freeway/img/Freeway_logo.png>
</center>

Freeway is a video game written by David Crane for the Atari 2600 and published by Activision [[1]](https://en.wikipedia.org/wiki/Freeway_(video_game)).

In the game, two players compete against each other trying to make their chickens cross the street, while evading the cars passing by.
There are three possible actions: staying still, moving forward or moving backward.
Each time a chicken collides with a car, it is forced back some spaces and takes a while until the chicken regains its control.

When a chicken is successfully guided across the freeway, it is awarded one point and moved to the initial space, where it will try to cross the street again.
The game offers multiple scenarios with different vehicle configurations (varying their type, frequency and speed) and lasts for 2 minutes and 16 seconds.
During the last 8 seconds the scores start blinking to indicate that the game is about to end.
Whoever has the most points after this, wins the game!

The image was extracted from the [manual of the game](https://www.gamesdatabase.org/Media/SYSTEM/Atari_2600/Manual/formated/Freeway_-_1981_-_Zellers.pdf).

[1 - Wikipedia - Freeway](https://en.wikipedia.org/wiki/Freeway_(video_game))

# Environment

We will be using the [Stable Baselines](https://stable-baselines.readthedocs.io/en/master/index.html) toolkit.
This toolkit is a set of improved implementations of Reinforcement Learning algorithms based on [OpenAI Baselines](https://github.com/openai/baselines).

Although the game offers multiple scenarios, we are going to consider only the first one. Also, we will be controlling a *single chicken*, while we try to maximize its score.

In this configuration, there are ten lanes and each lane contains exactly one car (with a different speed and direction).
Whenever an action is chosen, it is repeated for $k$ frames, $k \in \{2, 3, 4\}$.

This means that our environment is **stochastic** and it is also **episodic**, with its terminal state being reached whenever 2 minutes and 16 seconds have passed.
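The source of this stochasticity can be illustrated with a small, self-contained sketch. It does not use the Gym API: `toy_frame` is a hypothetical stand-in for a single emulator frame, and $k$ is assumed to be drawn uniformly from $\{2, 3, 4\}$.

```python
import random

def repeat_action(step_fn, action, rng, choices=(2, 3, 4)):
    """Apply `action` for k frames, with k drawn uniformly from `choices`."""
    k = rng.choice(choices)
    total_reward = 0.0
    for _ in range(k):
        total_reward += step_fn(action)
    return k, total_reward

# Toy emulator (hypothetical): moving forward (action 1) advances one unit per frame.
state = {"position": 0}

def toy_frame(action):
    if action == 1:
        state["position"] += 1
    return 0.0  # no reward in this toy example

rng = random.Random(0)
ks = [repeat_action(toy_frame, 1, rng)[0] for _ in range(5)]
print(ks, state["position"])
```

The same sequence of chosen actions can therefore move the chicken a different number of frames from one run to the next.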

You can find more information regarding the environment used at [Freeway-ram-v0](https://gym.openai.com/envs/Freeway-v0/).

# Setup

Install the dependencies:
```sh
pip install -r requirements.txt
pip install stable-baselines
pip install tensorflow==1.15.0
```

# Useful Resources

Here you can find a list of useful links and materials that were used during this project.

* [Freeway-ram-v0 from OpenAI Gym](https://gym.openai.com/envs/Freeway-ram-v0/)
* [Manual of the game](https://www.gamesdatabase.org/Media/SYSTEM/Atari_2600/Manual/formated/Freeway_-_1981_-_Zellers.pdf)
* [Freeway Disassembly](http://www.bjars.com/disassemblies.html)
* [Atari Ram Annotations](https://github.com/mila-iqia/atari-representation-learning/blob/master/atariari/benchmark/ram_annotations.py)
* [Freeway Benchmarks](https://paperswithcode.com/sota/atari-games-on-atari-2600-freeway)

# Imports

In [None]:
import sys
sys.path.append('../')  # Enable importing from `src` folder

In [None]:
%matplotlib inline
import statistics
from collections import defaultdict
from functools import lru_cache
from typing import List

import numpy as np
import time
import matplotlib.pyplot as plt
import seaborn as sns

import src.agents as agents
import src.episode as episode
import src.environment as environment
import src.aux_plots as aux_plots
import src.serializer as serializer
import src.gif as gif

import tensorflow as tf

import gym
from stable_baselines.common import make_vec_env
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack
from stable_baselines.common.atari_wrappers import make_atari

from stable_baselines.deepq.policies import CnnPolicy
from stable_baselines import DQN

In [None]:
def print_result(i, scores, total_reward, score):
    if i % 10 == 0:
        print(f"Run [{i:4}] - Total reward: {total_reward:7.2f} Mean scores: {sum(scores) / len(scores):.2f} Mean scores[-10:]: {sum(scores[-10:]) / len(scores[-10:]):5.2f} Score: {score:2} ")

In [None]:
def read_int_array_from_file(fn: str):
    with open(f"./experiments/{fn}") as f:
        return [int(x) for x in f.read().splitlines()]

# Action space

As we said above, the agent in this game has three possible actions at each frame, each represented by an integer:

* 0: Stay
* 1: Move forward
* 2: Move backward

# Baseline

## State of the art benchmarks

The image below (extracted from https://paperswithcode.com/sota/atari-games-on-atari-2600-freeway) shows the evolution of the scores over time using different techniques.

Today, the state-of-the-art approaches score 34.0 points using Deep Reinforcement Learning methods.

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/47d8d1fd3b921471738b30b5f9ae447593705b09/freeway/img/state_of_art_scores.png>
</center>



## Simple baseline agent

As a simple baseline, we are using an agent that always moves **up**, regardless of the rewards received or the current state.

In [None]:
env, initial_state = environment.get_env()

In [None]:
agent = agents.Baseline()

In [None]:
total_rewards = []
n_runs = 10

In [None]:
%%time
for i in range(n_runs):
    render = i % 10 == 0

    game_over = False
    state = env.reset()
    action = agent.act(state)

    total_reward = 0

    while not game_over:
        if render:
            time.sleep(0.01)
            env.render()

        # Keep `state` up to date so the next action is chosen from the new observation
        state, reward, game_over, _ = env.step(action)

        total_reward += reward
        action = agent.act(state)  # Next action

    total_rewards.append(total_reward)

CPU times: user 21.5 s, sys: 544 ms, total: 22.1 s
Wall time: 50 s


In [None]:
total_rewards

[23.0, 21.0, 23.0, 21.0, 21.0, 21.0, 23.0, 21.0, 23.0, 21.0]

In [None]:
baseline_mean_score = np.mean(total_rewards)
baseline_mean_score

21.8

As we can see, this agent usually scores 21 or 23 points (as shown in the images below). The score depends on the values of $k$ sampled, and on average it is about 21.8 points per run.

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/47d8d1fd3b921471738b30b5f9ae447593705b09/freeway/img/baseline_1.png>
</center>

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/47d8d1fd3b921471738b30b5f9ae447593705b09/freeway/img/baseline_2.png>
</center>


# State Representation

Since the tabular methods we are going to use work with some representation of the actual environment state, we will need to understand it better in order to effectively approach this problem.

## Atari 2600

Before talking about the state representation, it is important to understand how the Atari 2600 works.

The Atari 2600 is a video game console released in 1977 by the American company Atari, Inc.
Its **8-bit** microprocessor was of the MOS **6502** family and it had **128 bytes** of RAM.

And these 128 bytes are what really matters here.

---

Recall that Gym gives us the RAM memory of the Atari as the state representation.
In other words, it gives us a 128-element `np.array`, where each element of the array is a `uint8` (*integer values ranging from 0 to 255*).

That said, we have (in theory) $256^{128} \approx 1.8 \cdot 10^{308}$ possible game states!
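This count is easy to check with Python's arbitrary-precision integers. The snippet below is only a sanity check of the arithmetic, not part of the agents.

```python
# 128 bytes, each with 256 possible values
n_states = 256 ** 128

# Same quantity expressed as a power of two: 8 bits * 128 bytes = 1024 bits
assert n_states == 2 ** 1024

digits = len(str(n_states))      # number of decimal digits
leading = str(n_states)[:4]      # leading digits of the mantissa
print(f"{leading[0]}.{leading[1:]}e{digits - 1}")
```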

This is *far* from being manageable, and thus we need to come up with a different approach to represent our state if we want our algorithms to converge.

One might argue that the RAM state is *sparse* and although that is true, it is still not sparse enough to apply tabular methods.

# Reward Policy

In the base environment we are awarded one point each time we successfully cross the freeway.

# Hyper Parameters

In [None]:
GAMMA = 0.99
LEARNING_RATE = 0.0005
EXPLORATION_RATE = 0.1

# Methodology

Since it takes a lot of time to train the models, we won't train them all in this report.
Instead, we will load the results of our simulations and specify the parameters used to obtain those results.
Of course, it is possible to reproduce our results simply by running the algorithms here using the same hyper parameters as specified.

Whenever possible, we will be adding plots comparing different approaches and parameters, as well as adding gifs in this notebook so that we can visualize the development of the agent and unique strategies that they learned.

Also, we focused a lot of our experiments on Q-Learning, since it was showing the most promising results.
Monte Carlo methods didn't really work out, and SARSA($\lambda$) methods took way too much time to run (roughly 12 hours per 2k iterations!).
Since QLearning and SARSA aren't really that different, we applied most of the knowledge we acquired from the QLearning experiments on SARSA, varying only its unique parameter, $\lambda$, in steps of 0.2.

Also, it is worth mentioning that we left the code used by each agent inside `./src/agents.py` and provided a model of how to run each agent in the environment throughout the notebook, with the `n_runs` parameter (which controls the number of episodes used to train the algorithm) set to `1`.

---

# Review of Project 1

### Tabular Methods

### Linear Function Approximation

In Project 1, we also worked with linear function approximators, which were based on the Monte Carlo, Q-Learning and Sarsa($\lambda$) algorithms. For each of them, we experimented with different sets of features to represent the state-action pairs, varied reward functions, different exploration rates, and in all of them we used only two actions (move backward and move forward), which were the best actions found by the tabular methods.

All function approximators obtained results close to those of the baseline, being slightly better. Of them, the Monte Carlo based approximator was the only one that managed to improve in relation to its tabular version, reaching an average score of 22.2, compared to the 13 points obtained by the latter.

Throughout the experiments, we observed that some linear approximators can be faster than the original algorithms, as happened with Sarsa($\lambda$). In addition, we saw that the Q-Learning and Sarsa($\lambda$) approximators achieved better results when they were given greater exploration capacity, and that the Monte Carlo based approximator benefited when less sparse feature vectors were adopted.

Given that, although linear function approximators are powerful tools, they were not good enough to improve the performance of our solutions, either due to the nature of the problem, the possible poor quality of the features we created or other factors that were not addressed.

---

# DQN

The [Deep-Q-Network](https://arxiv.org/pdf/1312.5602.pdf) is a deep learning model that learns control policies directly from high-dimensional sensory input using reinforcement learning.  

The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating the future rewards.  

We applied the [algorithm implemented by Stable Baselines](https://stable-baselines.readthedocs.io/en/master/modules/dqn.html) to the Atari Freeway game.
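As a rough sketch of how such a run could be set up with the classes imported at the top of this notebook: the environment id `FreewayNoFrameskip-v4`, the helper names `dqn_kwargs` and `train_dqn`, and the mapping of `EXPLORATION_RATE` to the `exploration_final_eps` argument are assumptions made for illustration, not necessarily the exact configuration used in our experiments.

```python
# Values from the "Hyper Parameters" section of this notebook
GAMMA = 0.99
LEARNING_RATE = 0.0005
EXPLORATION_RATE = 0.1

def dqn_kwargs(gamma=GAMMA, learning_rate=LEARNING_RATE, exploration_rate=EXPLORATION_RATE):
    """Map the notebook's hyperparameters onto DQN constructor arguments.

    Assumption: EXPLORATION_RATE plays the role of the final epsilon of the
    epsilon-greedy schedule (`exploration_final_eps`).
    """
    return {
        "gamma": gamma,
        "learning_rate": learning_rate,
        "exploration_final_eps": exploration_rate,
    }

def train_dqn(total_timesteps, env_id="FreewayNoFrameskip-v4", **overrides):
    # Lazy imports: Stable Baselines requires TensorFlow 1.x to be installed
    from stable_baselines import DQN
    from stable_baselines.common.atari_wrappers import make_atari
    from stable_baselines.deepq.policies import CnnPolicy

    env = make_atari(env_id)  # env_id is an assumption, see the text above
    model = DQN(CnnPolicy, env, verbose=1, **dqn_kwargs(**overrides))
    model.learn(total_timesteps=total_timesteps)
    return model

# Example: arguments for a run that only changes the discount factor
g2 = dqn_kwargs(gamma=0.90)
```

Calling `train_dqn(total_timesteps=...)` starts an actual training run, so it is left out of the snippet.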


The Deep-Q-Network algorithm observes the image $x_t$ from the emulator which is a vector of raw pixel values representing the current screen. In addition it receives a reward $r_t$ representing the change in game score.

It considers sequences of actions and observations, $s_t = x_1, a_1, x_2, ..., a_{t-1}, x_t$, and learns game strategies that depend upon these sequences.  

All sequences in the emulator are assumed to terminate in a finite number of time-steps. This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state.  

As a result, we can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence $s_t$ as the state representation at time $t$.  

Future rewards are discounted by a factor of $\gamma$ per time-step.  


The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value $Q^*(s', a')$ of the sequence $s'$ at the next time-step was known for all possible actions $a'$, then the optimal strategy is to select the action $a'$
maximising the expected value of $r + \gamma Q^*(s', a')$:  

$Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}[r + \gamma \max_{a'} Q^*(s', a') \mid s, a]$  

A Q-network can be trained by minimising a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$:  

$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}[(y_i - Q(s, a; \theta_i))^2]$ 

where  

$y_i = \mathbb{E}_{s' \sim \mathcal{E}}[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a]$   

is the target for iteration $i$, computed with the parameters $\theta_{i-1}$ of the previous iteration. Here $\mathcal{E}$ denotes the emulator and $\rho(s, a)$ the behaviour distribution over sequences and actions.
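The target and the loss can be made concrete with a small NumPy sketch on a hand-written batch. The arrays are made-up numbers, the `dones` mask (no bootstrapping after a terminal step) is a standard implementation detail that the equations leave implicit, and `next_q_values` stands for the output of the network with the older, fixed parameters.

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma):
    """y = r + gamma * max_a' Q(s', a'), without bootstrapping on terminal steps."""
    max_next_q = next_q_values.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next_q

def dqn_loss(q_values, actions, targets):
    """Mean squared error between Q(s, a) of the actions taken and the targets."""
    q_taken = q_values[np.arange(len(actions)), actions]
    return np.mean((targets - q_taken) ** 2)

# Toy batch of two transitions with three actions each (made-up values)
rewards = np.array([0.0, 1.0])
dones = np.array([0.0, 1.0])                  # second transition ends the episode
next_q = np.array([[0.5, 2.0, 1.0],
                   [3.0, 0.0, 0.0]])          # Q(s', .) from the older parameters
q = np.array([[1.0, 1.5, 0.0],
              [0.0, 0.5, 0.0]])               # Q(s, .) from the current parameters
actions = np.array([1, 1])

targets = td_targets(rewards, next_q, dones, gamma=0.99)
loss = dqn_loss(q, actions, targets)
print(targets, loss)
```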

## Experiments

### Influence of the discount factor $\gamma$

The discount factor $\gamma$ determines how much the agent cares about rewards in the distant future relative to those in the immediate future.  

If $\gamma$=0, the agent will be completely myopic and only learn about actions that produce an immediate reward. If $\gamma$=1, the agent will evaluate each of its actions based on the total sum of all future rewards.

We used a $\gamma$ value of 0.99 to make our agent care about the distant future, and we also decreased this value to 0.90 and 0.75 to see how this impacts the agent's behavior. 
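To get a feeling for these values, the sketch below computes how much a single reward of 1 is worth when it arrives 50 steps in the future. The horizon of 50 steps is an arbitrary number chosen for illustration, not a measured property of Freeway.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t, accumulated backwards from the last reward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical episode fragment: 50 steps without reward, then one point
rewards = [0.0] * 50 + [1.0]

values = {gamma: discounted_return(rewards, gamma) for gamma in (0.99, 0.90, 0.75)}
for gamma, value in values.items():
    print(f"gamma={gamma:.2f} -> {value:.2e}")
```

With $\gamma = 0.99$ the delayed point still keeps roughly 60% of its value, whereas with $\gamma = 0.75$ it is worth less than $10^{-6}$, so a short-sighted agent receives almost no learning signal from a crossing that is many steps away.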

Thus, we will be experimenting with three different parameter sets:

| Parameter | G1 | G2 | G3 |
|------|----|----|----|
| **`GAMMA`** | 0.99 | 0.90 | 0.75 |
| `LEARNING_RATE` | 0.0005 | 0.0005 | 0.0005 |
| `EXPLORATION_RATE` | 0.1 | 0.1 | 0.1 |


In [None]:
#### PLOT IMAGE HERE ####

In [None]:
# From the plots above we can see that $\gamma = 0.75$ led to poor results, but $\gamma = 0.9$ and $\gamma = 0.99$ seem to be equivalent.
# An explanation is that when we make our agent short-sighted, it doesn't try to cross all the lanes and receive that huge reward we are offering as much as the far-sighted agents try.

# That being said, we will be focusing on the $\gamma = 0.99$, arbitrarily.
# We could be using the $0.9$ too, since it appears to have the same performance.

### Influence of the learning rate parameter



We will be experimenting with three different parameter sets:

| Parameter | G1 | G2 | G3 |
|------|----|----|----|
| `GAMMA` | 0.99 | 0.99 | 0.99 |
| **`LEARNING_RATE`** | 0.0005 | 0.0010 | 0.0050 |
| `EXPLORATION_RATE` | 0.1 | 0.1 | 0.1 |

In [None]:
#### PLOT IMAGE HERE ####

### Influence of the agent's exploration rate

The exploration rate is the probability that our agent will explore the environment rather than exploit it.  

We used 0.1 as our baseline exploration value. To see how the exploration rate impacts the agent's behavior, we also ran experiments using double this value (0.2) and half of it (0.05).
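The role of this parameter can be illustrated with a minimal $\epsilon$-greedy action selection. This is a generic sketch with made-up Q-values, not the code used internally by Stable Baselines.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # made-up values for: stay, move forward, move backward

picks = [epsilon_greedy(q_values, 0.1, rng) for _ in range(10_000)]
greedy_fraction = picks.count(1) / len(picks)
print(f"greedy action chosen in {greedy_fraction:.1%} of the steps")
```

With three actions and $\epsilon = 0.1$, the greedy action is expected to be chosen in about $1 - 0.1 + 0.1/3 \approx 93\%$ of the steps, because a random pick can also land on the greedy action.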

All in all, these are the parameters that we are going to use to execute this experiment.

| Parameter | G1 | G2 | G3 |
|------|----|----|----|
| `GAMMA` | 0.99 | 0.99 | 0.99 |
| `LEARNING_RATE` | 0.0005 | 0.0005 | 0.0005 |
| **`EXPLORATION_RATE`** | 0.1 | 0.05 | 0.20 |

In [None]:
#### PLOT IMAGE HERE ####

In [None]:
# The exploration rate is the probability that our agent will explore the environment rather than exploit it.  

# As we can see from the results shown in the plots above, the lower the $N0$ value, the better the performance of the agent.
# Although this might seem counterintuitive at first, in fact, it stands to reason.
# When we explore more (higher $N0$), we exploit less, leading to worse results in the beginning.
# From the graphs above, we can see that all three lines are still increasing, and the gap between them is closing.
# We expect to achieve better results with higher $N0$s, but it would take too much time for it to happen (we even tested some of them overnight and it still wasn't enough).

# Based on our reward function, it is fairly simple to detect which action should be taken in most of the states.
# We want to move up always, unless it is leading to a collision.
# Thus, frequently it is easy to detect the best action, and for most of the states we don't need to explore a lot to find it.

---

# Proximal Policy Optimization (PPO)

As an on-policy algorithm for solving our problem, we chose [Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347). PPO is a policy optimization method that can be used in environments that work with continuous or discrete action spaces.

As with trust-region-based approaches, such as the TRPO algorithm, it tries to limit the size of the update applied to a policy, since large updates tend to generate agents with worse performance. However, it seeks to overcome some of the limitations of TRPO and other methods, being easier to implement and tune, and having a better sample complexity.

Finally, this algorithm has some variants. One of them, which is used in this project, adopts a specialized clipping function to calculate the objective function, as shown in the equation below. For the experiments, we used the [PPO2](https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html) implementation, from the Stable Baselines library.

$$
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]
$$
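The effect of the clipping can be checked numerically with a short NumPy sketch. The probability ratios and advantages are made-up values, and $\epsilon = 0.2$ is assumed here because it is the value suggested in the PPO paper; it is not a setting reported in this notebook.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Empirical mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.mean(np.minimum(unclipped, clipped))

ratio = np.array([1.5, 0.5, 1.0])       # r_t(theta), made-up values
advantage = np.array([1.0, -1.0, 2.0])  # advantage estimates, made-up values

objective = clipped_surrogate(ratio, advantage)
print(objective)
```

In the first sample the ratio 1.5 is clipped to 1.2, so the policy gains nothing from moving further in that direction. In the second sample the clipped term (-0.8) is lower than the unclipped one (-0.5), so the minimum keeps the more pessimistic value.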

## Experiments

To assess the performance of the PPO in solving our problem, we decided to vary some of its parameters, which were: the discount factor ($\gamma$), the learning rate, the trade-off factor between bias and variance (lam), the number of frames in the input stack of the algorithm, the type of representation of the states (image or RAM), the number of environments and the total timesteps of training.

For this, we created a baseline for comparison, with the following configurations:

- Policy network: CnnPolicy
- Discount factor ($\gamma$): 0.99.
- Learning rate: 0.00025.
- Trade-off factor (lam): 0.95.
- Number of frames in the stack: 4.
- Type of representation: image.
- Number of environments: 4.
- Training timesteps: 400K.
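A configuration like this one could be expressed with Stable Baselines roughly as follows. This is a sketch under assumptions: the environment id `FreewayNoFrameskip-v4`, the seed and the helper name `train_ppo_baseline` are illustrative and not taken from our experiment scripts.

```python
# Baseline settings listed above
PPO_BASELINE = {
    "gamma": 0.99,
    "learning_rate": 0.00025,
    "lam": 0.95,
}
N_STACK = 4                 # frames per stack
NUM_ENV = 4                 # parallel environments
TOTAL_TIMESTEPS = 400_000   # 400K training timesteps

def train_ppo_baseline(env_id="FreewayNoFrameskip-v4", seed=0):
    # Lazy imports: Stable Baselines requires TensorFlow 1.x to be installed
    from stable_baselines import PPO2
    from stable_baselines.common.cmd_util import make_atari_env
    from stable_baselines.common.vec_env import VecFrameStack

    # env_id and seed are assumptions made for this sketch
    env = make_atari_env(env_id, num_env=NUM_ENV, seed=seed)
    env = VecFrameStack(env, n_stack=N_STACK)

    model = PPO2("CnnPolicy", env, verbose=1, **PPO_BASELINE)
    model.learn(total_timesteps=TOTAL_TIMESTEPS)
    return model
```

Each experiment then changes a single entry of this configuration while keeping the others fixed.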

Below, we describe the experiments and results obtained.


### Influence of discount factor with image representation

In the table below, we can see the settings of the experiments carried out in this section, as well as the rewards achieved after smoothing by a factor of 0.999. In addition, the following graphs show how the reward obtained by the agents varies according to the value of the discount factor ($\gamma$) and the number of timesteps.

| **Parameter**   | **Exp 1** | **Exp 2** | **Exp 3** |
|-----------------|-----------|-----------|-----------|
| **Gamma**       | 0.75      | 0.90      | 0.99      |
| Learning Rate   | 0.00025   | 0.00025   | 0.00025   |
| Lam             | 0.95      | 0.95      | 0.95      |
| Stacks          | 4         | 4         | 4         |
| Representation  | image     | image     | image     |
| Environments    | 4         | 4         | 4         |
| Policy          | CnnPolicy | CnnPolicy | CnnPolicy |
| Smoothed reward | 9.55      | 15.74     | **20.04** |

| |
|-|
|<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/01-images-gamma/svg/image_gamma_all.svg width="400"></center>|

| $\gamma$=0.75 | $\gamma$=0.90 | $\gamma$=0.99 |  
|---|---|---|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/01-images-gamma/svg/image_gamma_75.svg width="250"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/01-images-gamma/svg/image_gamma_90.svg width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/01-images-gamma/svg/image_gamma_baseline.svg width="250"> |

We see that in all the graphs there is considerable variation in the reward, and that the most stable curve is the one obtained with $\gamma = 0.99$ (baseline). On the other hand, the other curves drop sharply after a certain number of timesteps, with the $\gamma = 0.75$ curve being the first to drop.

This indicates that, for the problem addressed, a far-sighted PPO agent is able to achieve better performance and that, the more limited this view is, the worse its results will be. Finally, although the highest value of the best smoothed curve is the one shown in the table, the values of this curve vary around $22$ points.

### Influence of the number of frames in the stack with image representation

Since the learning algorithms receive images as input, to model the problem as an MDP we need to group a set of frames (images) into structures called stacks, which are given as input to these algorithms. This gives the models access to temporal information about the game.
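The idea of a frame stack can be sketched in a few lines of NumPy. This is a simplified illustration and not the `VecFrameStack` implementation: the class name `FrameStack`, the 84x84 frame size and the choice of filling the stack with the first frame on reset are assumptions.

```python
from collections import deque

import numpy as np

class FrameStack:
    """Keep the last `n_stack` frames and expose them as a single observation."""

    def __init__(self, n_stack):
        self.frames = deque(maxlen=n_stack)

    def reset(self, frame):
        # Assumption: fill the whole stack with the first frame of the episode
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return self.observation()

    def push(self, frame):
        # The oldest frame is dropped automatically by the bounded deque
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Stack along a new last axis: (height, width, n_stack)
        return np.stack(list(self.frames), axis=-1)

stack = FrameStack(n_stack=4)
obs = stack.reset(np.zeros((84, 84), dtype=np.uint8))
obs = stack.push(np.ones((84, 84), dtype=np.uint8))
print(obs.shape)
```

With a stack of size 1 the agent sees only the current frame, so it cannot infer the direction or speed of the cars from a single observation.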

In order to analyze the influence of the size of the frame stacks on the performance of the created PPO agents, we experimented with the baseline with stacks of size 1, 4, 32 and 64, as described in the table below.

| **Parameter**  | **Exp 1** | **Exp 2** | **Exp 3** | **Exp 4** |
|----------------|-----------|-----------|-----------|-----------|
| Gamma          | 0.99      | 0.99      | 0.99      | 0.99      |
| Learning Rate  | 0.00025   | 0.00025   | 0.00025   | 0.00025   |
| Lam            | 0.95      | 0.95      | 0.95      | 0.95      |
| **Stacks**     | 1         | 4         | 32        | 64        |
| Representation | image     | image     | image     | image     |
| Environments   | 4         | 4         | 4         | 4         |
| Policy         | CnnPolicy | CnnPolicy | CnnPolicy | CnnPolicy |
| Smoothed reward | **21.16** | 20.04     | 18.23     | 9.72      |

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/02-images-stacks-gamma/images-stacks-gamma-all.svg width="400"></center>

| stack = 1 | stack = 4 | stack = 32 | stack = 64 |  
|---|---|---|---|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/02-images-stacks-gamma/images-stacks-gamma-stack-1.svg width="200"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/02-images-stacks-gamma/images-stacks-gamma-baseline.svg width="200"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/02-images-stacks-gamma/images-stacks-gamma-stack-32.svg width="200"> |<img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/02-images-stacks-gamma/images-stacks-gamma-stack-64.svg width="200"> |


According to the graphs, we observed that, as we increase the number of frames in the stacks, the agent needs more timesteps to learn a policy that allows it to earn good rewards. This behavior may indicate that knowledge about a more distant past doesn’t necessarily bring improvements to agents for the problem addressed.

Despite this, all curves tend to average reward values that are close and are around $21$ to $22$ points. In addition, the training time for each agent also increases by adding more frames, which is expected, since more data will be processed.

### Influence of the discount factor with RAM representation

In order to evaluate how PPO behaves when receiving the RAM values of the game as input, instead of the image, we performed experiments using the whole RAM and an MLP policy network, and varied the discount factor. The results are shown in the graphs below, and the smoothed rewards for $\gamma = 0.75$, $\gamma = 0.90$ and $\gamma = 0.99$ are $20.92$, $20.84$ and $19.57$, respectively.

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/03-ram-gamma/ram-gamma-all.svg width="400"></center>

| $\gamma$=0.75 | $\gamma$=0.90 | $\gamma$=0.99 |  
|---|---|---|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/03-ram-gamma/ram-gamma-75.svg width="250"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/03-ram-gamma/ram-gamma-90.svg width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/03-ram-gamma/ram-gamma-baseline.svg width="250"> |

Observing the curves, we see that they are quite noisy, indicating difficulties during the agent's learning process. Furthermore, unlike what we saw when using images, now a lower value of $ \gamma $ leads to better results. This indicates that, for this agent, seeing values closer to the present time is more significant. This time, the average rewards were between 21 and 23 points.

### Influence of input representation

Another important comparison, still in the context of the previous section, is between the rewards obtained by using RAM or images as input. In the graphs below, we can see the results for the image and RAM representations and for the three discount factors tested.

#### <center> Discount factor equal to 0.75

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-75-all.svg width="300"></center>
    

| Image | RAM |  
|---|---|
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-75-image.svg width="200"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-75-ram.svg width="200"> | 

---

#### <center> Discount factor equal to 0.90

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-90-all.svg width="300"></center>
    
| Image | RAM |  
|---|---|
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-90-image.svg width="200"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-90-ram.svg width="200"> |

---

#### <center> Discount factor equal to 0.99

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-baseline-all.svg width="300"></center>
    
| Image | RAM |  
|---|---|
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-baseline-image.svg width="200"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/04-images-ram/images-ram-baseline-ram.svg width="200"> |


We see that the rewards obtained are similar, except for the regions where the image-based curves for $\gamma = 0.75$ and $\gamma = 0.90$ start to decrease. This indicates that RAM can also provide a good representation for our problem.

Despite this, given the big variability of the experiments, more tests need to be carried out to verify this hypothesis.

### Influence of the total timesteps

All previous experiments had a duration of 400K timesteps. In order to evaluate the influence of this parameter on the agents' performance, we ran new experiments using the baseline and 1M timesteps. The results are shown in the figure below, for two executions of the baseline setting.

<center><img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/bae41b89189519f64aa78d12693f800f5d62e51c/marianna/project02-rl/01-figures-experiments/05-timesteps/timestep-baseline-1-2.svg width="400"></center>

As we can see, after an initial growth, the reward stabilizes for a few thousand timesteps and starts growing again, allowing us to reach values much higher than those obtained previously. This indicates that we can still exploit a lot of PPO's capacity for our problem, if we train it for a sufficient number of timesteps. Additionally, applying a smoothing of 0.999 to the results, we achieve a maximum reward of 23.76.
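For reference, a smoothing factor like the one mentioned above is commonly applied as an exponential moving average over the logged rewards. The sketch below assumes that form (similar to the smoothing slider of TensorBoard, without its debiasing step); the exact procedure behind the reported numbers may differ.

```python
def smooth(values, weight):
    """Exponential moving average: s_t = weight * s_{t-1} + (1 - weight) * v_t."""
    smoothed = []
    last = values[0]  # start from the first logged value
    for v in values:
        last = weight * last + (1.0 - weight) * v
        smoothed.append(last)
    return smoothed

# Tiny example with a low weight so the effect is visible by hand
example = smooth([0.0, 10.0, 10.0], weight=0.5)
print(example)
```

With a weight as high as 0.999 each new value contributes only 0.1% to the curve, which explains why the smoothed rewards reported in the tables lag behind the raw values of about 21 to 22 points seen in the plots.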


### Important Notes

After the experiments with the discount factor, the forms of representation of the input, the number of frames of the stacks and the amount of training timesteps, we observed some important characteristics that should be highlighted.


The first is the big variability of the results, so that two experiments performed with the same configurations sometimes generate considerably different performances. 

Another important observation is that, as we realized for the experiments with the baseline and 1M timesteps, we would probably get better results for all other configurations if we trained them for longer.
