# Project report

The solution code of the exercise can be found in the following files:
* Solution.ipynb: this Jupyter Notebook contains the DQN function to train and test the model using the Unity environment.
* dqn_agent.py: this code file contains the Agent class with all the functionality the agent needs to act and learn in the environment. It also contains the ReplayBuffer class, which implements Experience Replay.
* model.py: contains the Python code for the neural network architectures used to train the agent. The class QNetwork contains the initial vanilla DQN with 3 fully connected layers. The class Dueling_DQN builds on this class by turning it into a dueling DQN and adding dropout layers. Finally, the class Dueling_DQN6 makes the neural network deeper by using 6 fully connected layers.

## Learning Algorithm

Initially I tried the vanilla 3-layer DQN that was used in the exercise. I then improved it by:
* Increasing the number of layers to 6.
* Increasing the size of the fully connected layers (128, 64, 32, 32, 16).
* Turning the DQN into a dueling DQN.
* Adding dropout layers.

I used a dueling DQN model with 6 fully connected layers. A dueling DQN uses two streams: one estimates the state value function and the other estimates the advantage of each action. The desired Q-values are then obtained by combining the state value and the advantage values. For both streams I applied the following architecture:
* The first layer receives the state vector with 37 variables and outputs 128 nodes.
* The second layer receives 128 and outputs 64.
* The third layer receives 64 and outputs 32.
* The fourth layer receives 32 and outputs 32.
* The fifth layer receives 32 and outputs 16.
* The final layer receives 16 and outputs 4 (which is the number of possible actions).
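
The report does not spell out how the two streams are combined. A common choice, assumed here rather than taken from model.py, is to add the state value to the mean-centred advantages. The function name `combine_streams`, the scalar state value and the example numbers are hypothetical, and the sketch uses plain Python lists instead of network tensors:

```python
def combine_streams(state_value, advantages):
    """Combine V(s) and the per-action advantages A(s, a) into Q-values.

    Assumed aggregation: Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, a)).
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (a - mean_advantage) for a in advantages]


# Hypothetical outputs of the two streams for one state with 4 actions.
q_values = combine_streams(state_value=2.0, advantages=[1.0, 0.0, -1.0, 0.0])

# The greedy action is the index of the largest Q-value.
best_action = max(range(len(q_values)), key=q_values.__getitem__)
print(q_values, best_action)
```

Subtracting the mean advantage keeps the value and advantage estimates identifiable, since otherwise a constant could be shifted freely between the two streams.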

I implemented a dropout layer for each fully connected layer to prevent overfitting. I tried several dropout probabilities, and 10% seemed to give the best result.
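
To illustrate what a 10% dropout layer does, here is a minimal framework-free sketch of inverted dropout. It is not the code in model.py; the function name `dropout` and the example activations are assumptions made for the illustration:

```python
import random


def dropout(activations, p=0.1, training=True, rng=random):
    """Inverted dropout: zero each activation with probability p, rescale the rest."""
    if not training or p == 0.0:
        # At evaluation time the layer passes its input through unchanged.
        return list(activations)
    keep_scale = 1.0 / (1.0 - p)  # keeps the expected activation unchanged
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]


rng = random.Random(0)
hidden = [1.0] * 1000  # hypothetical activations of one layer

dropped = dropout(hidden, p=0.1, training=True, rng=rng)
evaluated = dropout(hidden, p=0.1, training=False)

print(dropped.count(0.0), "of", len(hidden), "activations were zeroed during training")
```

The `training` flag matters when scoring the agent: dropout should be switched off while the trained agent is evaluated, otherwise the Q-values stay noisy.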

I haven't tuned any of the original parameters in the dqn_agent.py file because:
* The architecture described above already reached an average score above 13 after 346 episodes with the original parameters.
* Tuning the parameters is very time-consuming and was not needed in this case.

I only reduced the max_t parameter in the DQN function from 1000 to 500 to speed up training.

The model uses Experience Replay to stabilize the learning process of the neural network and to extract more value from the training data by using each experience multiple times. The file dqn_agent.py contains the class ReplayBuffer, which keeps a collection of past experiences from which the model randomly picks a subset to re-use when training the agent. By sampling experiences at random from the replay buffer, the model breaks the temporal correlations between consecutive experiences. This helps prevent the action values from oscillating or diverging catastrophically.
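
The idea can be sketched with the standard library alone. This is an illustration, not the ReplayBuffer class from dqn_agent.py; the class name `SimpleReplayBuffer`, the field names and the sizes are assumptions:

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class SimpleReplayBuffer:
    """Fixed-size buffer of past experiences with uniform random sampling."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # the oldest experiences are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement breaks the temporal ordering.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


# Hypothetical usage: integer states stand in for the real state vectors.
buffer = SimpleReplayBuffer(buffer_size=5, batch_size=3)
for step in range(8):
    buffer.add(state=step, action=step % 4, reward=0.0, next_state=step + 1, done=False)

batch = buffer.sample()
print(len(buffer), [e.state for e in batch])
```

Because the buffer holds only the 5 most recent experiences, the first three steps have already been discarded by the time the batch is drawn.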

## Plot of rewards

It takes the model 346 episodes to train the agent to achieve an average score of 13.

![image.png](attachment:image.png)
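
The report does not state over how many episodes the average score is computed. The sketch below assumes a rolling window of the most recent 100 episodes and a target of 13; the function name `first_solved_episode` and the score history are hypothetical:

```python
from collections import deque


def first_solved_episode(scores, target=13.0, window=100):
    """Return the 1-based episode at which the rolling mean first reaches target, else None."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Hypothetical score history: 100 episodes scoring 10, then 100 episodes scoring 16.
scores = [10.0] * 100 + [16.0] * 100
solved_at = first_solved_episode(scores)
print(solved_at)
```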


After training, we let the agent apply the learned policy for 100 episodes. The agent achieves an average score of 13.5 on episode 95.

![image.png](attachment:image.png)


## Ideas for Future Work

I would take the following steps to further improve the model:
* Parameter tuning. For example, I could make the learning rate decay over time, which can speed up the convergence of the neural network.
* Apply double Q-learning in order to avoid overestimation of action values.
* Prioritized Experience Replay. This is an improvement on the Experience Replay algorithm that is used in the solution code. It gives priority to the most relevant experiences for replay. Relevant experiences could be those that resulted in high TD errors, because these are the experiences from which the agent can learn the most. The TD error is stored in the replay buffer alongside each experience and used to calculate a priority for it. An experience tuple with a high priority has a higher probability of being chosen.
* Apply the Rainbow algorithm, which is a combination of:
    * Double DQN
    * Prioritized Experience Replay
    * Dueling DQN
    * 3 other improvement techniques (multi-step bootstrap targets, Distributional DQN and Noisy DQN)
* Learn from the pixel data instead of receiving only 37 state variables as input. The pixel data contains much more information, so it would require a more complex learning architecture with convolutional layers, but it could ultimately result in a better-performing agent.
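
Two of the ideas listed above, double Q-learning and Prioritized Experience Replay, can be sketched in a few lines. None of this is part of the solution code. The function names, the discount factor `gamma`, the exponent `alpha`, the offset `eps` and all example numbers are assumptions, and the importance-sampling correction that usually accompanies prioritized replay is left out:

```python
def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """TD target where the online network selects the action and the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-5):
    """Proportional prioritization: P(i) = p_i**alpha / sum_k p_k**alpha, with p_i = |td_error_i| + eps."""
    priorities = [(abs(err) + eps) ** alpha for err in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


# Hypothetical Q-value estimates for the next state from the two networks.
target = double_dqn_target(
    reward=1.0,
    done=False,
    next_q_online=[0.2, 0.9, 0.1, 0.4],
    next_q_target=[0.5, 0.3, 0.8, 0.6],
    gamma=0.9,
)

# Hypothetical TD errors of three stored experiences.
probs = sampling_probabilities([0.1, 2.0, 0.5])
print(target, probs)
```

In this example the online network prefers action 1, so the target uses the target network's estimate for that action (0.3) rather than its own maximum (0.8), which is how double Q-learning dampens overestimation. The experience with the largest TD error receives the highest sampling probability.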