## DQN

The [Deep Q-Network](https://arxiv.org/pdf/1312.5602.pdf) (DQN) is a deep learning model that learns control policies directly from high-dimensional sensory input using reinforcement learning.  

The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating the future rewards.  

The Deep Q-Network algorithm observes an image $x_t$ from the emulator, which is a vector of raw pixel values representing the current screen. In addition, it receives a reward $r_t$ representing the change in game score.  

It considers sequences of actions and observations,  

$s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$,  

and learns game strategies that depend upon these sequences.  


The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value $Q^*(s', a')$ of the sequence $s'$ at the next time-step were known for all possible actions $a'$, then the optimal strategy would be to select the action $a'$ maximising the expected value of $r + \gamma Q^*(s', a')$, where $\gamma$ is the reward discount factor per time-step,  
  
$Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$,  

where the expectation is taken over the next sequences $s'$ produced by the emulator $\mathcal{E}$.  
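As a rough illustration of the right-hand side of the Bellman equation, the sketch below computes the one-sample target $r + \gamma \max_{a'} Q(s', a')$ from a list of estimated action values. The function name, the `done` flag and the example numbers are assumptions made for this illustration; they are not taken from the project code.

```python
def bellman_target(reward, next_q_values, gamma, done=False):
    """One-sample estimate of r + gamma * max_a' Q(s', a')."""
    if done:
        # After a terminal transition there is no future reward to bootstrap from.
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical action-value estimates for the next state s'.
next_q = [1.0, 2.5, 0.5]
print(bellman_target(reward=1.0, next_q_values=next_q, gamma=0.99))
```

In DQN the list of next-state values would come from the network's output layer rather than from a hand-written list.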

     


In this project we applied the [algorithm implemented by Stable Baselines](https://stable-baselines.readthedocs.io/en/master/modules/dqn.html) to the Atari Freeway game.
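A minimal sketch of how such a run could be set up with Stable Baselines is shown below. It is not the project's actual training script: the environment id `FreewayNoFrameskip-v4`, the `make_atari` wrapper, the `CnnPolicy` choice and the helper name `train_dqn` are assumptions, and the 400k time steps are taken from the discussion later in this report. The report does not state which library argument `EXPLORATION_RATE` corresponds to, so it is left out here.

```python
# Baseline hyperparameters (set G1 in the tables of this report).
BASELINE = {"gamma": 0.99, "learning_rate": 0.0005}


def train_dqn(params, env_id="FreewayNoFrameskip-v4", total_timesteps=400_000):
    """Train a DQN agent; env_id and the policy choice are assumptions."""
    # Imported lazily so the configuration above can be inspected
    # without Stable Baselines installed.
    from stable_baselines import DQN
    from stable_baselines.common.atari_wrappers import make_atari

    env = make_atari(env_id)
    model = DQN("CnnPolicy", env, verbose=1, **params)
    model.learn(total_timesteps=total_timesteps)
    return model


# Example (long-running): model = train_dqn(BASELINE)
```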

### Discount Factor $\gamma$

The discount factor $\gamma$ determines how much the agent cares about rewards in the distant future relative to those in the immediate future.  
  
If $\gamma$=0, the agent will be completely myopic and only learn about actions that produce an immediate reward.  

If $\gamma$=1, the agent will evaluate each of its actions based on the total sum of all future rewards.  
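The two extremes can be checked with a small sketch that computes the discounted return $\sum_k \gamma^k r_k$ of a reward sequence. The reward sequence is a made-up example in which the only reward arrives on the fourth step.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a reward sequence."""
    total = 0.0
    # Work backwards using G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [0, 0, 0, 1]  # hypothetical episode: one point after four steps
for gamma in (0.0, 0.75, 0.90, 0.99, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With $\gamma$=0 the delayed reward contributes nothing to the return, while with $\gamma$=1 it counts in full.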
  
We used a $\gamma$ value of 0.99 so that our agent cares about the distant future, and we also decreased this value to 0.90 and 0.75 to see how it impacts the agent's behavior.  

Thus, we experimented with three different parameter sets:

| Parameter | G1 | G2 | G3 |
|------|----|----|----|
| **`GAMMA`** | 0.99 | 0.90 | 0.75 |
| `LEARNING_RATE` | 0.0005 | 0.0005 | 0.0005 |
| `EXPLORATION_RATE` | 0.1 | 0.1 | 0.1 |
|`Smoothed Reward` |20.73|23.25|21.72|


| |  
|------|  
|<img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/dqn_gamma.png width="400">|

| $\gamma$=0.99 | $\gamma$=0.90 | $\gamma$=0.75 |  
|---|---|---|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/vermelho.png width="250"> | <img src =https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/pink.png width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/azul_claro.png width="250"> |

From the plots above, we can see that the three values of $\gamma$ lead the agents to similar scores, but some agents take longer to reach them.

### Learning Rate


The learning rate determines to what extent newly acquired information overrides old information.  

If the learning rate is 0, the agent will learn nothing (exclusively exploiting prior knowledge).  
If the learning rate is 1, the agent considers only the most recent information (ignoring prior knowledge to explore possibilities).  
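These two limits are easiest to see in the tabular form of the update, $Q \leftarrow Q + \alpha\,(\text{target} - Q)$, sketched below with made-up numbers. In DQN the learning rate instead scales the gradient steps on the network weights, so this is only an analogy for the behaviour described above.

```python
def q_update(old_value, target, learning_rate):
    """Move the old estimate towards the target by a fraction learning_rate."""
    return old_value + learning_rate * (target - old_value)


old_value, target = 2.0, 5.0  # hypothetical estimate and new target
for learning_rate in (0.0, 0.5, 1.0):
    print(learning_rate, q_update(old_value, target, learning_rate))
```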

We experimented with three different parameter sets:

| Parameter | G1 | G2 | G3 |
|------|----|----|----|
| `GAMMA` | 0.99 | 0.99 | 0.99 |
| **`LEARNING_RATE`** | 0.0005 | 0.0010 | 0.0050 |
| `EXPLORATION_RATE` | 0.1 | 0.1 | 0.1 |
|`Smoothed Reward` |20.73|21.13|2.616e-19 (approx. 0)|

| |  
|------|  
|<img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/dqn_lr.png width="400">|


| `LEARNING_RATE`=0.0005 | `LEARNING_RATE`=0.0010 | `LEARNING_RATE`=0.0050 |  
|---|---|---|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/vermelho.png width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/cinza.png width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/verde.png width="250"> |

As we can see in the plots above, the learning rates of 0.0005 and 0.0010 achieved approximately the same scores.  
On the other hand, the agent with a learning rate of 0.0050 performed poorly and did not learn at all. 

### Exploration rate

The exploration rate is the probability that our agent will explore the environment rather than exploit it.  
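One common way to implement this trade-off is $\epsilon$-greedy action selection, sketched below. The function name and the example action values are assumptions for illustration; the report does not show how the library applies the exploration rate internally.

```python
import random


def epsilon_greedy(q_values, exploration_rate, rng=random):
    """Pick a random action with probability exploration_rate, else the greedy one."""
    if rng.random() < exploration_rate:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical action-value estimates for three actions.
action = epsilon_greedy([0.1, 0.9, 0.3], exploration_rate=0.1)
print(action)
```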

We used 0.1 as our baseline exploration value. In order to see how the exploration rate impacts the agent's behavior, we also ran experiments using double this value (0.2) and half of it (0.05).

All in all, these are the parameters that we are going to use to execute this experiment:

| Parameter | G1 | G2 | G3 |
|------|----|----|----|
| `GAMMA` | 0.99 | 0.99 | 0.99 |
| `LEARNING_RATE` | 0.0005 | 0.0005 | 0.0005 |
| **`EXPLORATION_RATE`** | 0.1 | 0.05 | 0.20 |
|`Smoothed Reward` |20.73|22.02|21.48|

| |  
|------|  
|<img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/dqn_exploration.png width="400">|


| `EXPLORATION_RATE`=0.20 | `EXPLORATION_RATE`=0.10 | `EXPLORATION_RATE`=0.05 |  
|---|---|---|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/laranja.png width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/vermelho.png width="250"> | <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plots/azul.png width="250"> |

As presented above, the three values of the exploration rate lead the agents to about the same scores, but, as we saw for the $\gamma$ parameter, the scores do not start increasing at the same time.


### DQN experiments discussion

According to the DQN plots, the agents achieved approximately the same scores at the end of 400k steps (with the exception of the run with a learning rate of 0.0050); the difference between them is mostly in how fast their scores increased.

From the experiments we ran, we cannot say precisely which hyperparameters are best, because their effect on the score does not seem to follow a strong linear trend.

To explain that, we rely on Hado van Hasselt et al., who demonstrated in the paper [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461) that the DQN algorithm suffers from substantial overestimations in some games in the Atari domain.  

They demonstrated that estimation errors of any kind can induce an upward bias, regardless of whether these errors are due to environmental noise, function approximation, non-stationarity, or any other source. This is important, because in practice any method will incur some inaccuracies during learning, simply due to the fact that the true values are initially unknown.
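The upward bias can be reproduced with a small simulation, given here as an illustration rather than a reproduction of the paper's experiments. Every action has a true value of 0, the estimates are corrupted by zero-mean Gaussian noise, and yet the average of the maximum estimate is clearly positive. The function name and all parameter values are assumptions chosen for this sketch.

```python
import random


def mean_max_estimate(num_actions, noise_std, trials, seed=0):
    """Average of max_a Q_hat(a) when every true action value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate is the true value (0) plus zero-mean noise.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)
    return total / trials


for num_actions in (1, 2, 4, 8):
    print(num_actions, round(mean_max_estimate(num_actions, 1.0, 5000), 3))
```

With a single action the average stays close to 0, and it grows as more actions compete in the maximum, even though no action is actually better than another.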

As they show in their experiments, whose plots are presented below, the DQN algorithm can be consistently and sometimes vastly overoptimistic about the value of the current greedy policy, as can be seen by comparing the orange learning curves in the top row of plots to the straight orange lines, which represent the actual discounted value of the best learned policy.   

| |  
|------|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plot%20from%20the%20paper%20Deep%20Reinforcement%20Learning%20with%20Double%20Q-learning.png width="800"> |  
| Image from the paper [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461) |  


In the image above we can see the detrimental effect of the DQN overestimations on the score achieved by the agent as it is evaluated during training in comparison with Double-DQN.

Also, according to Sebastian Thrun and Anton Schwartz in the paper [Issues in Using Function Approximation for Reinforcement Learning](https://www.ri.cmu.edu/pub_files/pub1/thrun_sebastian_1993_1/thrun_sebastian_1993_1.pdf), value-based methods such as DQN can suffer from a systematic overestimation of values, which is caused by function approximation when it is used in a recursive value estimation scheme. This can cause learning to fail completely in some cases, if the parameters exceed the upper or lower bound for expected failure of Q-learning. The failure that occurs when the upper bound is exceeded is presented in the figure below:  

| |  
|------|  
| <img src=https://raw.githubusercontent.com/DionisiusMayr/FreewayGame/main/aline.almeida/DQN/plot%20from%20the%20paper%20Issues%20in%20Using%20Function%20Approx%20for%20RL.png width="600"> |  
| Image from the paper [Issues in Using Function Approximation for Reinforcement Learning](https://www.ri.cmu.edu/pub_files/pub1/thrun_sebastian_1993_1/thrun_sebastian_1993_1.pdf) |  


In the figure we can see the learning curves as a function of $\gamma$. Each diagram shows the
performance (probability of reaching the goal state) as a function of the number of training episodes. The performance is evaluated on an independent testing set of twenty initial robot positions. Note that learning fails
completely if  $\gamma$ > 0.98.

Additionally, according to Kamyar Azizzadenesheli et al. in the paper [Efficient Exploration through Bayesian Deep Q-Networks](https://arxiv.org/abs/1802.04412), DQN is sensitive to the learning rate, and changing it can degrade the performance to even worse than that of a random policy.

As we could see in our learning rate experiments, the agent with a learning rate of 0.0050 possibly entered the failure region and could not learn at all, or it may have suffered from DQN's sensitivity to the learning rate.

For the discount factor and exploration rate experiments, we found seemingly arbitrary behavior regarding which agent would learn faster. This apparent lack of correlation between the hyperparameter changes and the learning speed may be caused by DQN's overestimation characteristic, and averaging over more than one run could be necessary to see the expected correlation. 

---

# Final Thoughts

## Computational cost

One of the biggest problems in this project was the computational cost of running an episode.

Since each run plays for 2 minutes and 16 seconds in the original game, there are quite a lot of frames that need to be computed for each episode.
Even though frame-sync is deactivated in our environment, each episode takes about 2 seconds to compute for Q-Learning and Monte Carlo, but around 21 seconds for SARSA($\lambda$)! This means that in order to run 4k episodes, one must wait about 23 hours. And since we want to experiment with 6 different values of $\lambda$, it would take about a week to compute all of this using a single-core processor.
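The estimate above follows from simple arithmetic, reproduced here using only the figures quoted in the text (21 seconds per episode, 4k episodes, 6 values of $\lambda$):

```python
SECONDS_PER_EPISODE = 21  # SARSA(lambda), as quoted above
EPISODES = 4_000
LAMBDA_VALUES = 6

hours_per_run = SECONDS_PER_EPISODE * EPISODES / 3600
days_total = hours_per_run * LAMBDA_VALUES / 24
print(f"{hours_per_run:.1f} hours per run, {days_total:.1f} days in total")
# prints "23.3 hours per run, 5.8 days in total"
```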

The memory usage is fairly low compared to the time that it takes to run the algorithms, even with a decent amount of unique states in our problem.

## Optimality

Regarding the optimality of our solution, we were able to achieve the state-of-the-art score (34.0) a few times with the SARSA($\lambda$) algorithm, but on average we are closer to 31 points (which is also good, given that our baseline was 21.8).
Q-Learning also showed really good results, achieving about 28 points on average.
On the other hand, Monte Carlo didn't perform well in our problem, achieving only 13 points on average.

## Regarding the Linear Function Approximators

From the experiments conducted with function approximation for the Monte Carlo, Q-Learning and SARSA($\lambda$) control algorithms, we obtained some meaningful results: 

* We were not able to improve the agents compared to the algorithms without function approximation. Some factors may have contributed to this, such as the simplicity of the linear approximator for a problem that probably has many nonlinear relationships between the environment variables. Also, the features that we created may not be good enough to provide meaningful information about the environment to the agents.

* Function approximation for Monte Carlo requires a complete sample before the weights can be updated. However, for our problem, this sample is the entire game, so each newly generated sample follows a past policy, which makes convergence slow.

* Linear approximators can be faster than their original counterparts, as we saw with SARSA, but that is not always the case.

* Adding randomness to underfitted models by using a bigger N0 can be beneficial to the solution.

* Each algorithm has its own specificities, which depend on how it works. For example, for the Monte Carlo approximator, using a less sparse feature vector contributed to better results. On the other hand, for the Q-Learning and SARSA($\lambda$) approximators, providing a bigger exploration capacity for the agent was very important to achieve better results.

* In all the experiments, the agents converged to a behavior similar to that of the baseline, learning that moving up most of the time is the best policy. However, as we saw from the algorithms without approximation, that is not actually the case.
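For reference, the kind of linear approximator discussed in the points above can be sketched as follows. This is a generic semi-gradient update, not the project's implementation: the action value is the dot product of a weight vector and a feature vector, and the weights move along the features in proportion to the prediction error. The function names, feature vector and numbers are assumptions for illustration.

```python
def q_value(weights, features):
    """Linear action-value estimate: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def semi_gradient_update(weights, features, target, learning_rate):
    """Return new weights after one step towards the target."""
    error = target - q_value(weights, features)
    return [w + learning_rate * error * x for w, x in zip(weights, features)]


weights = [0.0, 0.0, 0.0]
features = [1.0, 0.0, 1.0]  # hypothetical binary features of a (state, action) pair
weights = semi_gradient_update(weights, features, target=1.0, learning_rate=0.5)
print(weights, q_value(weights, features))
```

Because the estimate is a single weighted sum, it can only represent relationships that are linear in the chosen features, which is consistent with the limitation noted in the first point above.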

All in all, we were satisfied with what we achieved, being able to apply multiple tabular methods and experiment a lot with Reinforcement Learning.