## Report - Continuous Control task for the Reacher environment (Single Agent) 

#### Author : Bhuvaneswari Sankaranarayanan, Prepared on April 12, 2021 


#### Learning Algorithm Used

The Continuous Control task for the single-agent Reacher environment was solved with a DDPG (Deep Deterministic Policy Gradient) agent built on fully connected deep neural networks. The environment has a 33-dimensional state space and a 4-dimensional continuous action space.

DDPG learns a critic network and an actor network, each with a local and a target version. In this implementation, the critic is a fully connected network that takes the 33-dimensional state and the 4-dimensional action as input and passes them through three fully connected layers: FCS1 with 256 units, FC2 with 256 units and FC3 with 128 units. Its output is a single real value that approximates the action-value function for the input state and action. The actor is also a fully connected network: it takes the state as input, has a single hidden layer with 256 units and outputs the action to take in that state.
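As a rough illustration, the two networks described above could be written in PyTorch along the following lines. This is a minimal sketch, not the project's actual model file: the class and attribute names, the ReLU activations, the tanh output on the actor and the choice to concatenate the action with the state at the first layer are all assumptions. Only the layer sizes come from the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_SIZE, ACTION_SIZE = 33, 4  # sizes reported for the Reacher environment


class Actor(nn.Module):
    """Maps a state to an action through one hidden layer of 256 units."""

    def __init__(self, state_size=STATE_SIZE, action_size=ACTION_SIZE, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden)
        self.out = nn.Linear(hidden, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))      # ReLU is an assumption
        return torch.tanh(self.out(x))   # tanh bounds actions to [-1, 1] (assumption)


class Critic(nn.Module):
    """Maps a (state, action) pair to a single Q-value via 256-256-128 hidden units."""

    def __init__(self, state_size=STATE_SIZE, action_size=ACTION_SIZE):
        super().__init__()
        self.fcs1 = nn.Linear(state_size + action_size, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, 128)
        self.out = nn.Linear(128, 1)

    def forward(self, state, action):
        # Concatenating state and action at the input is an assumption.
        x = torch.cat((state, action), dim=1)
        x = F.relu(self.fcs1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        return self.out(x)               # unbounded scalar Q-value per sample
```

A batch of 8 states passed through `Actor` yields a tensor of shape `(8, 4)`, and feeding those states and actions to `Critic` yields a tensor of shape `(8, 1)`.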

Both networks were trained by backpropagation on the loss functions given in the DDPG paper. The Adam optimizer performed the gradient updates at every update step.
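For reference, the update rules from the DDPG paper can be written as follows. Here Q and μ denote the local critic and actor, Q' and μ' their target versions, N the minibatch size and γ the discount factor. The notation follows the paper and is not taken from this project's code.

```latex
% Target value for the i-th sampled transition
y_i = r_i + \gamma \, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right)

% Critic loss: mean squared error between the target and the local critic
L(\theta^{Q}) = \frac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2

% Actor update: sampled deterministic policy gradient
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i}
  \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_i,\, a = \mu(s_i)} \,
  \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s = s_i}
```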

<img src="critic.png" alt="Critic" style="width: 400px;"/> <img src="actor.png" alt="Actor" style="width: 250px;"/>

#### Hyperparameter settings

The diagrams above describe the network architecture, including the number of hidden layers and the number of units per layer. The other hyperparameters were set as follows:

- Learning rate for the actor = 1e-4
- Learning rate for the critic = 1e-3
- Buffer Size = 1000000
- Batch Size = 64
- Tau (hyperparameter for soft update of the target network) = 0.001
- Weight Decay (parameter of the PyTorch Adam optimizer) = 0
- UPDATE_EVERY = 4 (update the actor/critic network for every 4 steps of the agent)
- GAMMA, discount factor = 0.99
- Max_T, Maximum number of timesteps allowed per episode = 800
- No. of episodes = 3000
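
To show how several of these settings typically interact in a DDPG agent, here is a minimal, library-free sketch of a replay buffer, the update schedule and the soft update. The names `ReplayBuffer`, `should_learn` and `soft_update` are hypothetical and the exact learning condition is an assumption. The project's own agent code may differ.

```python
import random
from collections import deque

BUFFER_SIZE = 1_000_000  # replay buffer capacity
BATCH_SIZE = 64          # minibatch size
TAU = 1e-3               # soft-update interpolation factor
UPDATE_EVERY = 4         # learn once every 4 agent steps


class ReplayBuffer:
    """Fixed-size FIFO store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=BUFFER_SIZE, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size=BATCH_SIZE):
        # Uniform sampling without replacement.
        return self.rng.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


def should_learn(step, buffer):
    """Learn every UPDATE_EVERY steps once a full minibatch is available (assumed rule)."""
    return step % UPDATE_EVERY == 0 and len(buffer) >= BATCH_SIZE


def soft_update(local_params, target_params, tau=TAU):
    """Blend targets towards local weights: target <- tau * local + (1 - tau) * target."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]
```

With Tau = 0.001, each soft update moves a target weight only 0.1% of the way towards its local counterpart, which keeps the targets used in the critic loss slowly changing.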

#### Plot of Rewards

A plot of rewards per episode is included to show that the agent reaches an average reward (over 100 episodes) of at least +30. Note that this is a continuing task: an episode ends when the maximum number of timesteps, set as a hyperparameter in the code (Max_T), is reached.

The screenshot of the Jupyter notebook below was taken after training the DDPG agent. It reports that 2365 episodes were needed to solve the environment, i.e., for the average reward to cross 30.0. The average reward is the running average of the rewards from the last 100 episodes.  
![alt text](scrnshot_from_training.png "Screenshot of the average rewards over last 100 episodes displayed while training")

The notebook also plots the rewards obtained in each episode.

![alt text](learning_curve.png "Learning Curve from training")
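
The solve criterion used in this section, a running average over the last 100 episodes reaching 30.0, can be sketched as below. The function name `episodes_to_solve` is hypothetical, and requiring a full 100-episode window before declaring the task solved is an assumption rather than a detail confirmed by the notebook.

```python
from collections import deque


def episodes_to_solve(episode_scores, target=30.0, window=100):
    """Return the first episode (1-indexed) at which the mean of the last
    `window` scores reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` scores
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

For example, calling it on the list of per-episode scores collected during training would return the episode count that the training screenshot reports, provided the notebook applies the same rule.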


The average reward dropped considerably around episode 2000 but later caught up to the 30.0 mark as learning continued. Either early stopping or decaying the learning rates over episodes could help the algorithm stabilize at the optimal solution. Running 3000 episodes on a local computer's GPU took approximately 3 hours. Because of this long running time, training was stopped once the average reward reached 30.0 and stayed close to it for a few hundred episodes.
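
The two stabilization ideas mentioned above could look roughly like this. Neither was part of the reported training run. The function names, the decay factor of 0.999, the learning-rate floor and the patience of 100 episodes are illustrative assumptions, not tuned values.

```python
def decayed_lr(initial_lr, episode, decay=0.999, min_lr=1e-6):
    """Exponentially decay a learning rate per episode, never going below min_lr."""
    return max(min_lr, initial_lr * decay ** episode)


def should_stop(avg_scores, target=30.0, patience=100):
    """Early stopping: True once the running average has stayed at or above
    `target` for the last `patience` episodes."""
    if len(avg_scores) < patience:
        return False
    return all(avg >= target for avg in avg_scores[-patience:])
```

In PyTorch, a comparable schedule could be attached to the Adam optimizers with `torch.optim.lr_scheduler.ExponentialLR`, stepping the scheduler once per episode.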

#### Sample code to load the trained model for behaving in the environment

```python
# The code below was run on a Windows 10 system.
from unityagents import UnityEnvironment
import numpy as np
import torch

from ddpg_agent import Agent

# Start the Unity environment and select the default brain.
env = UnityEnvironment(file_name='./Reacher_Windows_x86_64/Reacher.exe')
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# Create an agent matching the environment's state and action sizes.
agent = Agent(state_size=33, action_size=4, random_seed=0)

# Load the trained weights from file.
agent.actor_local.load_state_dict(torch.load('checkpoint_actor.pth'))
agent.critic_local.load_state_dict(torch.load('checkpoint_critic.pth'))
```

```
INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
                goal_speed -> 1.0
                goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 

<All keys matched successfully>
```

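```python
# Run the trained agent for 50 episodes of at most 200 steps each.
for i in range(50):
    # train_mode=True runs the simulation at full speed;
    # pass train_mode=False to watch the agent in real time.
    env_info = env.reset(train_mode=True)[brain_name]
    state = env_info.vector_observations[0]
    for j in range(200):
        action = agent.act(state)                 # action chosen by the agent
        env_info = env.step(action)[brain_name]   # send the action to the environment
        state = env_info.vector_observations[0]   # next state
        reward = env_info.rewards[0]              # reward for this step
        done = env_info.local_done[0]             # whether the episode has ended
        if done:
            break

env.close()
```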



#### GIF of the trained agent's behaviour in the environment

![alt text](video_of_learned_agent.gif "GIF of the agent's play after learning")

#### Ideas for future work 



A possible direction for future work on this continuous control task is to try Proximal Policy Optimization (PPO). PPO methods have also been shown to do well on continuous control tasks and would be a natural alternative to DDPG for the current reaching task.