# Navigation Project Report - Deep Reinforcement Learning ND.

## Project background
This is the first project of the Udacity Deep Reinforcement Learning Nanodegree (ND). The goal is to train a self-learning agent that navigates a large, square 3D world and collects yellow bananas, reaching an average score of at least +13 over 100 consecutive episodes. The 3D environment was provided by [Udacity](https://www.udacity.com/) and is based on the [machine learning framework](https://unity3d.com/machine-learning) provided by [Unity](https://unity.com/).

![](./images/Banana.gif)

The project uses Deep Q-Learning, implemented in PyTorch, to train a virtual agent in this environment.

## Banana Environment Information

The environment is built with [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md).

    # Load the Unity Banana environment
    from unityagents import UnityEnvironment
    env = UnityEnvironment(file_name='Banana.exe')

Output:

    INFO:unityagents:
    'Academy' started successfully!
    Unity Academy name: Academy
            Number of Brains: 1
            Number of External Brains : 1
            Lesson number : 0
            Reset Parameters :

    Unity brain name: BananaBrain
            Number of Visual Observations (per agent): 0
            Vector Observation space type: continuous
            Vector Observation space size (per agent): 37
            Number of stacked Vector Observation: 1
            Vector Action space type: discrete
            Vector Action space size (per agent): 4
            Vector Action descriptions: , , , 


The environment has 4 actions:

* `0`: walk forward
* `1`: walk backward
* `2`: turn left
* `3`: turn right

The state space has 37 dimensions. The agent receives a reward of +1 for collecting a yellow banana and -1 for collecting a blue banana.

## Deep Q-Network Architecture

A simple fully connected neural network is used:

1. The input size equals the state size (37); the state vector is fed into the first hidden layer.
2. Two linear hidden layers: the first has 128 output units and the second has 64, each followed by a ReLU activation.
3. A linear output layer with 4 outputs, one per action, followed by a ReLU activation.

QNetwork(  
  (fc1): Linear(in_features=37, out_features=128, bias=True)  
  (fc2): Linear(in_features=128, out_features=64, bias=True)  
  (fc3): Linear(in_features=64, out_features=4, bias=True)  
)

## Epsilon policy
The *tanh* function from the *math* package is used to define the **epsilon policy**, which decays epsilon from 1.0 toward `epsilon_min` over the course of training:

	epsilon = [epsilon_min+(1.0-epsilon_min)*(1-tanh(10*(i/num_episodes))) for i in range(num_episodes+10)]
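The schedule can be reproduced in a few lines. This is a minimal sketch: the values chosen for `epsilon_min` and `num_episodes` are assumptions for illustration, not the values used in the training run, and `select_action` is a hypothetical helper showing how an epsilon-greedy choice might consume the schedule.

```python
import random
from math import tanh

# Assumed values for illustration only; the report does not state them.
epsilon_min = 0.01
num_episodes = 1000

# Same expression as in the report: starts at 1.0, decays toward epsilon_min.
epsilon = [epsilon_min + (1.0 - epsilon_min) * (1 - tanh(10 * (i / num_episodes)))
           for i in range(num_episodes + 10)]


def select_action(q_values, eps, rng=random):
    """Epsilon-greedy choice (hypothetical helper): a random action with
    probability eps, otherwise the action with the highest Q-value."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


print(f"episode 0: {epsilon[0]:.4f}")
print(f"episode {num_episodes // 2}: {epsilon[num_episodes // 2]:.4f}")
```

With this shape, most of the decay happens in the first fraction of training, so the agent spends the remaining episodes acting almost greedily.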

## Learning algorithm

The agent is trained with Q-learning using experience replay and fixed Q-targets. The network is optimized with the `Adam` optimizer.
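With fixed Q-targets, the learning target for a transition is usually computed from a separate target network, which then slowly tracks the local network through a soft update. The sketch below illustrates that standard formulation with plain Python lists in place of tensors, so it runs without PyTorch. `td_target` and `soft_update` are hypothetical helper names, not functions taken from the project code, and the project's own implementation may differ in detail.

```python
GAMMA = 0.99   # discount factor (value from this report)
TAU = 1e-3     # soft-update interpolation factor (value from this report)


def td_target(reward, next_q_target, done, gamma=GAMMA):
    """Target y = r + gamma * max_a Q_target(s', a), with no bootstrap
    term when the episode has ended."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def soft_update(local_params, target_params, tau=TAU):
    """Move each target parameter a small step toward the local one:
    theta_target = tau * theta_local + (1 - tau) * theta_target."""
    return [tau * l + (1.0 - tau) * t
            for l, t in zip(local_params, target_params)]


# Q-values of the target network for the next state, one per action.
y = td_target(reward=1.0, next_q_target=[0.2, 0.5, 0.1, 0.0], done=False)
new_target = soft_update(local_params=[1.0, -1.0], target_params=[0.0, 0.0])
```

Because the target parameters move by only a fraction `TAU` per update, the targets stay nearly fixed between learning steps, which tends to stabilize training.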

### Hyperparameters used


* `BUFFER_SIZE = int(1e5)` : replay buffer size
* `BATCH_SIZE = 64` : minibatch size
* `GAMMA = 0.99` : discount factor
* `TAU = 1e-3` : for soft update of target parameters
* `LR = 5e-4` : learning rate
* `UPDATE_EVERY = 4` : how often to update the network
* `EPS_DECAY=0.995` : the reduction factor of the epsilon-greedy policy
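To show where `BUFFER_SIZE` and `BATCH_SIZE` come into play, here is a minimal sketch of an experience replay buffer using only the standard library. The class and field names are assumptions for illustration; the project's actual buffer is not shown in this report.

```python
import random
from collections import deque, namedtuple

BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64         # minibatch size

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped once it is full."""

    def __init__(self, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw a uniformly random minibatch without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer()
for t in range(200):
    buffer.add(state=[0.0] * 37, action=t % 4, reward=0.0,
               next_state=[0.0] * 37, done=False)
batch = buffer.sample()
```

Sampling uniformly from past experience breaks the correlation between consecutive steps, which is the main reason for using a replay buffer in DQN.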



## Train the agent with DQN

    # Instantiate the agent
    agent = Agent(state_size=37, action_size=4, seed=32)

Epoch:  459; Score:  20.0; Epsilon: 0.0221; Mean (100): +13.01 #  
Criterion reached (Mean of recent 100 runs > 13), enviroment is considered as solved!  
![](images/scores_epochs.jpg)  
The model is then saved as *navigation.pth* for later use.
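The stopping criterion in the log (mean of the most recent 100 runs above 13) can be checked with a rolling window. This is a sketch under the assumption that the check only starts once a full 100-episode window is available; `first_solved_episode` is a hypothetical helper, and the scores below are synthetic, not the project's results.

```python
from collections import deque


def first_solved_episode(scores, target=13.0, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    scores first exceeds `target`, or None if that never happens."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window > target:
            return episode
    return None


# Synthetic scores for illustration only.
scores = [5.0] * 100 + [14.0] * 100
print(first_solved_episode(scores))
```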

## Testing the trained agent

    # Reload the saved weights in evaluation mode for testing; map_location='cpu' runs the test on the CPU
    import torch
    agent.qnetwork_local.load_state_dict(torch.load('navigation.pth', map_location='cpu'))
    agent.qnetwork_local.eval()

The test runs for 10 episodes (labelled epochs in the log), with the following scores and mean:  
Epoch:   10; Score:  13.0;  
Execution time: 29.8206  
Achieved mean and standard deviation over 10 test runs: 15.10 (mean) 2.51 (std)  
![](images/scores_epochs_test.jpg)  

## Conclusions
This simple model with two hidden layers converged quickly to the project goal, reaching a 100-episode mean score above 13 at episode 459. It then performed well in the test, with a mean score of 15.10 (std 2.51) over 10 runs.

## Future work to consider:

Planned extensions:
* Dueling DQN
* Double DQN
* Prioritized Experience Replay