## Report - Navigation Project 

#### Author : Bhuvaneswari Sankaranarayanan, Prepared on March 14, 2021 


#### Learning Algorithm Used

The Navigation project was solved with a DQN (Deep Q-Network) learning agent. The agent approximates the action-value function with a fully connected deep neural network that has a 37-dimensional input layer, 2 hidden layers, and a 4-dimensional output layer.

The 37-dimensional state vector obtained from the Banana Unity environment is fed as input to the network. The network has 4 output units, one per action, each giving the estimated value of that action.
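
The architecture described above can be sketched in PyTorch as follows. This is a minimal sketch, not the project's actual model file: the class name `QNetwork`, the layer names, and the ReLU activations are assumptions, while the layer sizes (37, 64, 64, 4) come from this report.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QNetwork(nn.Module):
    """Fully connected Q-network: state (37) -> 64 -> 64 -> action values (4)."""

    def __init__(self, state_size=37, action_size=4, hidden_size=64, seed=0):
        super().__init__()
        torch.manual_seed(seed)  # reproducible weight initialisation
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        # ReLU activations are an assumption; the report does not name them.
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # one estimated action value per action
```

With these sizes the network has 6,852 trainable parameters, and a batch of states of shape `(N, 37)` yields action values of shape `(N, 4)`.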

#### Hyperparameter settings

The following hyperparameter values were used:
- Learning Rate = 5e-4
- Number of hidden layers = 2 
- Number of neurons in each hidden layer = 64 
- Replay Buffer Size = 100000
- Batch Size = 64 
- Tau (interpolation coefficient for the soft update of the target Q-network, applied at every update step) = 0.001
- UPDATE_EVERY (number of agent steps between updates of the Q-network) = 4
- GAMMA (discount factor) = 0.99
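
As a rough illustration of how Tau and UPDATE_EVERY act, the sketch below writes the values above as constants and applies them to plain Python lists. The constant names `BUFFER_SIZE`, `BATCH_SIZE`, `LR` and `TAU` and both helper functions are assumptions made for this example; the actual agent code operates on network parameters and may be organised differently.

```python
# Values taken from the list above; names other than GAMMA and
# UPDATE_EVERY are assumed for illustration.
BUFFER_SIZE = 100_000  # replay buffer size
BATCH_SIZE = 64        # minibatch size
GAMMA = 0.99           # discount factor
TAU = 1e-3             # soft-update interpolation coefficient
LR = 5e-4              # learning rate
UPDATE_EVERY = 4       # agent steps between learning updates


def soft_update(local_params, target_params, tau=TAU):
    """Return target <- tau * local + (1 - tau) * target, element-wise."""
    return [tau * l + (1.0 - tau) * t
            for l, t in zip(local_params, target_params)]


def should_learn(step_count, buffer_len,
                 update_every=UPDATE_EVERY, batch_size=BATCH_SIZE):
    """Learn on every `update_every`-th step once the buffer holds a full batch."""
    return step_count % update_every == 0 and buffer_len > batch_size
```

With Tau = 0.001 the target network moves only 0.1% of the way towards the local network per update, which keeps the learning targets slowly varying.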

#### Plot of Rewards

A plot of the reward per episode is included to show that the agent reaches an average reward of at least +13 over 100 consecutive episodes. The Jupyter notebook screenshot below was taken after training the DQN agent. It reports that 545 episodes were needed to solve the environment, i.e. until the average reward crossed 13.0, and it plots the reward obtained in each episode. The average score (average reward) is calculated as the running average of the rewards from the last 100 episodes.
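
The running-average criterion can be sketched as below. This is an illustrative helper, not the notebook's code: the function name `first_solved_episode` is hypothetical, and requiring a full 100-episode window before testing the threshold is an assumption.

```python
from collections import deque


def first_solved_episode(scores, target=13.0, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)  # keeps only the most recent scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        # Assumption: only test the threshold once the window is full.
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

Note that some training scripts report the episode count with the window length subtracted, so the number printed by a notebook may follow a different convention from this helper.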

![alt text](Screenshot_of_output_from_training_1.png "Learning Curve from training")


#### Sample code to load the trained model for behaving in the environment

```python
# The code below was run on a Windows 10 system.
from unityagents import UnityEnvironment
import numpy as np
import torch

from dqn_agent import Agent

# Start the Banana environment and select its default brain.
env = UnityEnvironment(file_name="Banana_Windows_x86_64/Banana.exe")
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

agent = Agent(state_size=37, action_size=4, seed=0)

# Load the trained weights from file.
agent.qnetwork_local.load_state_dict(torch.load('checkpoint.pth'))
```

Output:

```
INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :

Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 

<All keys matched successfully>
```

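```python
# Run the trained agent for 50 episodes of at most 200 steps each.
for i in range(50):
    env_info = env.reset(train_mode=True)[brain_name]
    state = env_info.vector_observations[0]      # initial state
    for j in range(200):
        action = agent.act(state)                # action chosen by the loaded Q-network
        env_info = env.step(action)[brain_name]  # send the action to the environment
        state = env_info.vector_observations[0]  # next state
        reward = env_info.rewards[0]             # reward for this step
        done = env_info.local_done[0]            # True when the episode has ended
        if done:
            break

env.close()
```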

#### GIF of the trained agent's behaviour in the environment

![alt text](video_clip_of_learned_agent.gif "GIF of the agent's play after learning")

#### Ideas for future work 



The benchmark average score of 13.0 could likely be reached in fewer episodes by adding the following improvements on top of DQN.

1. Double Q-learning - Tabular methods are guaranteed to learn the true Q function exactly in the limit of infinite episodes. With function approximators, however, there is always some noise between the estimated Q and the target Q. Even if this noise is zero-mean, Q-learning ends up overestimating the action values, because it takes the max over the estimated Q values at the next state and then acts greedily with respect to those same estimates. In other words, the max operation does not preserve the zero-mean property of the errors in its operands, so its result is biased upwards. Double DQN reduces this bias by letting the local network select the next action while the target network evaluates it. Modifying the existing DQN in this way can give better action-value estimates and may lead to reaching the benchmark score earlier.

2. Prioritized experience replay - Instead of sampling uniformly, experiences in the replay memory are sampled with a probability that grows with the magnitude of the TD error of the (S, A, R, S') sample. This can improve the Q estimates for rare but important state-action pairs.

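3. Dueling architecture for the Q-network - A dueling Q-network estimates the state value and the per-action advantages in separate streams and combines them into Q values. This can help because, in many states, the choice of action matters little, so it is unnecessary to estimate the value of each action separately.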
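
The difference between the standard DQN target and the Double DQN target from item 1 can be sketched with plain Python lists. Both function names and the argument layout are assumptions made for this example; `next_q_local` and `next_q_target` stand for the action values that the local and target networks assign to the next state.

```python
def dqn_target(reward, next_q_target, done, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_local, next_q_target, done, gamma=0.99):
    """Double DQN: the local network selects the action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_local)), key=lambda a: next_q_local[a])
    return reward + gamma * next_q_target[best_action]
```

When the two networks disagree about the best next action, the Double DQN target is never larger than the DQN target, which is how the upward bias of the max operation is reduced.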

A network that implements all of the above has also been shown to outperform plain DQN considerably on many Atari games. A similar effect can be expected for this task as well.
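
As a rough sketch of the other two ideas, the helpers below compute proportional sampling probabilities for prioritized replay and the dueling aggregation Q(s, a) = V(s) + A(s, a) - mean(A). All function names are hypothetical, and the default exponents `alpha=0.6` and `beta=0.4` are illustrative assumptions rather than tuned values for this project.

```python
import random


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-5):
    """Proportional prioritization: P(i) = p_i / sum(p), p_i = (|delta_i| + eps)^alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.4):
    """Importance-sampling weights (N * P(i))^(-beta), scaled so the largest is 1."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]


def sample_indices(probs, batch_size, rng=random):
    """Draw replay-buffer indices with replacement according to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


def dueling_q_values(state_value, advantages):
    """Combine a state value and per-action advantages into Q values."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]
```

Subtracting the mean advantage makes the split into value and advantage identifiable, and the importance weights compensate for the non-uniform sampling when the loss is computed.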