# Project 1: Navigation
This is a project report for a navigation task solved with DQN. The agent is tasked with collecting as many yellow bananas as possible while avoiding blue bananas.

### Environment
The project is based on a Unity environment in which an agent moves around a workspace to collect bananas. The agent receives a reward of +1 for each yellow banana it collects and a reward of -1 for each blue banana. The goal is therefore to train the agent to collect as many yellow bananas as possible while avoiding blue ones. The environment is considered solved when the agent achieves an average score of +13 over 100 consecutive episodes.
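
The solving criterion above can be checked with a running window over the episode scores. The following is a minimal sketch, not the project's code; the function name `first_solved_episode` and the constants are hypothetical names introduced for illustration.

```python
from collections import deque

SOLVE_THRESHOLD = 13.0  # average score required by the task
WINDOW = 100            # number of consecutive episodes averaged


def first_solved_episode(scores):
    """Return the 1-based episode at which the running 100-episode
    average first reaches the threshold, or None if it never does."""
    window = deque(maxlen=WINDOW)  # keeps only the most recent scores
    for episode, score in enumerate(scores, start=1):
        window.append(score)
        if len(window) == WINDOW and sum(window) / WINDOW >= SOLVE_THRESHOLD:
            return episode
    return None
```

Note that the criterion cannot be met before episode 100, because the window must be full before the average is compared with the threshold.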

<img src="banana.gif" />

The state space has 37 dimensions, which contain the agent's velocity as well as ray-based perception of objects. The action space has 4 discrete actions, corresponding to the 4 directions in which the agent can move (forward, backward, left, right).

### Algorithm Overview
A value-based [Deep Q-Network (DQN)](https://deepmind.com/research/publications/human-level-control-through-deep-reinforcement-learning) is used to solve this navigation task. A Deep Q-Network builds on Q-learning, a model-free reinforcement learning algorithm that learns the value of an action in a particular state. In Q-learning, a Q-table stores the value of each state-action pair. Such tabular methods work well when the state and action spaces are relatively small. For problems with larger or continuous state spaces, a deep neural network is used instead: it takes a state as input and predicts the values of all actions in that state. The loss compares the network's prediction with the target given by the Q-learning update, so that the network learns an approximation of the Q-table.
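
To make the Q-learning update concrete, here is a minimal tabular sketch. It is an illustration only and not taken from the project code: the function name, the step size `ALPHA`, and the string state labels are assumptions, while the discount of 0.99 matches the value reported for the agent. DQN replaces the table lookup with a network prediction and minimizes the squared TD error instead of applying the update directly.

```python
from collections import defaultdict

GAMMA = 0.99   # discount factor
ALPHA = 0.5    # step size, chosen only for this illustration
N_ACTIONS = 4  # size of the action space in this task


def q_learning_update(q_table, state, action, reward, next_state, done):
    """Apply one tabular Q-learning update and return the TD error."""
    # Bootstrap from the best next action unless the episode has ended.
    best_next = 0.0 if done else max(q_table[next_state])
    td_target = reward + GAMMA * best_next
    td_error = td_target - q_table[state][action]
    q_table[state][action] += ALPHA * td_error
    return td_error


# Unseen states start with all action values at zero.
q_table = defaultdict(lambda: [0.0] * N_ACTIONS)
```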

DQN introduces two major improvements:
- Experience Replay
- Fixed Q-Targets

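Experience replay stores experience tuples in a buffer and samples them at random, which alleviates the correlation between consecutive experience tuples. Fixed Q-targets compute the learning target from a separate copy of the network that is held fixed between updates, which reduces the instability that arises when the target shifts with every parameter update during training.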
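
Both ideas can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the code in `dqn_agent.py`: the class and function names are hypothetical, and `target_q` is a stand-in callable for the separate target network.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that returns uniformly sampled minibatches."""

    def __init__(self, buffer_size=100_000, batch_size=64, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest tuples are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement breaks up consecutive steps.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


def td_targets(batch, target_q, gamma=0.99):
    """Compute r + gamma * max_a Q_target(s', a) for each experience.

    `target_q` is any callable mapping a state to a list of action values;
    it stands in for the frozen target network.
    """
    return [
        e.reward + (0.0 if e.done else gamma * max(target_q(e.next_state)))
        for e in batch
    ]
```

Because the targets come from `target_q` rather than from the network being trained, they stay constant while the trained network's parameters change between target updates.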

### Code Overview
The code is adapted from the "Lunar Lander" mini-project in the same course. It contains
- ##### model.py
  which specifies the structure of the neural network. The structure is as follows. <br/>
  state (input) -> fully-connected layer (37, 64) -> ReLU -> fully-connected layer (64, 64) -> ReLU -> fully-connected layer (64, 4) -> action values (output).
- ##### dqn_agent.py 
  which defines the agent's behavior (choosing an epsilon-greedy action, adding experience tuples to the replay buffer, and training the model on batches of experience tuples). The buffer size is set to $\small 10^5$ and the batch size to 64. The discount factor is 0.99 and the learning rate is $\small 5 \times 10^{-4}$.
- ##### Navigation.ipynb
  which initializes the environment for the agent to interact with and trains the agent using DQN. The maximum number of training episodes is 2000 and the maximum number of time steps per episode is 1000. Epsilon starts at 1.0, has a minimum value of 0.1, and decays at a rate of 0.995.
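
The exploration settings above can be illustrated with a short sketch. It assumes that the decay is multiplicative and applied once per episode, which the report does not state explicitly; the function names are hypothetical and not taken from the project code.

```python
import random

EPS_START, EPS_MIN, EPS_DECAY = 1.0, 0.1, 0.995


def epsilon_schedule(n_episodes):
    """Return the epsilon used in each episode, assuming multiplicative decay."""
    eps, values = EPS_START, []
    for _ in range(n_episodes):
        values.append(eps)
        eps = max(EPS_MIN, EPS_DECAY * eps)  # never drop below the floor
    return values


def epsilon_greedy(action_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=action_values.__getitem__)
```

Under this assumption, epsilon reaches its floor of 0.1 after roughly 460 episodes, since ln(0.1) / ln(0.995) is about 459.4.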

### Result
The environment is considered solved when the agent achieves an average score of +13 over 100 consecutive episodes. In my run, the training log reports the environment as solved in 388 episodes, with the 100-episode average reaching 13.02 at episode 488. The training statistics are shown below.

Episode 0 	 Average Score: 0.00 <br/>
Episode 100 	 Average Score: 1.17 <br/>
Episode 200 	 Average Score: 4.92 <br/>
Episode 300 	 Average Score: 8.38 <br/>
Episode 400 	 Average Score: 10.40 <br/>
Episode 488 	 Average Score: 13.02 <br/>
Environment solved in 388 episodes!	Average Score: 13.02 <br/>

The plot below shows the score for each episode, as well as the average score over the last 100 episodes at each episode.
<img src="episode_scores.jpg" />

### Future Improvements
Several variants of DQN could potentially improve the agent's performance, such as Double DQN and Dueling DQN. Another possible improvement is prioritized experience replay, which samples the most important experience tuples more often so that the agent learns more from them.
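
As an illustration of how small the change for Double DQN is, the sketch below contrasts the two target computations for a single transition. It is not part of the project: the function names are hypothetical, and the action values of the local and target networks are passed in as plain lists.

```python
def dqn_target(reward, done, next_q_target, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    return reward + (0.0 if done else gamma * max(next_q_target))


def double_dqn_target(reward, done, next_q_local, next_q_target, gamma=0.99):
    """Double DQN: the local network selects the action and the
    target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_local)), key=next_q_local.__getitem__)
    return reward + gamma * next_q_target[best_action]
```

Decoupling action selection from action evaluation is intended to reduce the overestimation of action values that the `max` in the standard target can introduce.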