# Continuous_Control Project Report - DRLND.

[//]: # (Image References)
[image1]: https://user-images.githubusercontent.com/10624937/43851024-320ba930-9aff-11e8-8493-ee547c6af349.gif "Trained Agent"

## Project background
This is the second project of the Udacity Deep Reinforcement Learning Nanodegree (DRLND). The goal is to build a model that trains self-learning agents (double-jointed arms) to move to target locations and to hold their position there for as many time steps as possible. The model counts as successful once the agents reach an average score of +30 over 100 consecutive episodes, averaged over all agents. The 3D environment was provided by [Udacity](https://www.udacity.com/) and is based on the [machine learning framework](https://unity3d.com/machine-learning) provided by [Unity](https://unity.com/).

![Trained Agent][image1]

The project uses the Deep Deterministic Policy Gradient (DDPG) algorithm, with Actor and Critic networks implemented in PyTorch, to train 20 virtual agents in this environment.

## Continuous_Control Environment Information

The environment is built with [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md).

```python
from unityagents import UnityEnvironment

# select this option to load version 2 (with 20 agents) of the environment
env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64')
```

```
INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 
```


Agent:  
Each action is a vector of four numbers corresponding to the torques applied to the two joints. Every entry in the action vector must be a number between -1 and 1.

Environment:  
- The observation (state) space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocities of the arm.  
- A reward of +0.1 is provided for each step that the agent's hand is in the goal location.
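
As a minimal sketch of the action format, the snippet below draws random actions for 20 agents and clips every entry into [-1, 1] with NumPy. The names `num_agents` and `action_size` are illustrative values taken from the environment printout, and nothing here talks to the Unity environment itself.

```python
import numpy as np

num_agents = 20   # version 2 of the environment (20 agents)
action_size = 4   # per-agent action vector size from the brain info

rng = np.random.default_rng(0)
# Standard-normal samples can fall outside the valid range...
raw_actions = rng.standard_normal((num_agents, action_size))
# ...so every entry is clipped into [-1, 1] before it would be sent to the environment.
actions = np.clip(raw_actions, -1.0, 1.0)

print(actions.shape)  # (20, 4)
```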

## Deep Deterministic Policy Gradient Network Architecture

An agent can learn the policy either directly from the states, using policy-based methods, or via the action-value function, as in value-based methods. Policy-based methods tend to have high variance and low bias because they use Monte Carlo estimates, whereas value-based methods have low variance but high bias because they use TD estimates. Actor-critic methods were introduced to address this bias-variance trade-off by combining the two approaches.

In actor-critic methods, the actor is a neural network that updates the policy, and the critic is a second neural network that evaluates the policy being learned. The actor uses the value estimate provided by the critic to make each policy update.  

The DDPG algorithm, a member of the actor-critic family, comprises two networks: the actor produces a deterministic policy instead of the usual stochastic policy, and the critic evaluates that deterministic policy.  
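
The report does not show the update code, so the following is only a sketch of the textbook DDPG quantities, written with NumPy on plain arrays: the TD target for the critic, the critic's mean-squared-error loss, and the actor loss that pushes the policy toward actions the critic rates highly. The function names are hypothetical, and `gamma=0.99` is taken from the hyperparameter list.

```python
import numpy as np

def td_target(rewards, next_q_values, dones, gamma=0.99):
    """y = r + gamma * Q_target(s', mu_target(s')) * (1 - done)."""
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    dones = np.asarray(dones, dtype=float)
    return rewards + gamma * next_q_values * (1.0 - dones)

def critic_loss(q_expected, q_target):
    """Mean squared error between the critic's estimate and the TD target."""
    diff = np.asarray(q_expected, dtype=float) - np.asarray(q_target, dtype=float)
    return float(np.mean(diff ** 2))

def actor_loss(q_of_actor_actions):
    """Negative mean of Q(s, mu(s)); minimizing it maximizes the critic's value."""
    return -float(np.mean(np.asarray(q_of_actor_actions, dtype=float)))

# Two transitions: the second one ends its episode, so no bootstrapping there.
y = td_target(rewards=[0.1, 0.0], next_q_values=[1.0, 2.0], dones=[0, 1])
print(y)
```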

- The actor network consists of one linear hidden layer that takes state_size (33) as its input shape and has 256 units by default, followed by a *relu* activation function. The linear output layer maps the hidden layer's 256 outputs to action_size (4) and is followed by a *tanh* activation function.  

Actor(  
  (fc1): Linear(in_features=33, out_features=256, bias=True)  
  (fc2): Linear(in_features=256, out_features=4, bias=True)  
)
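
A PyTorch module consistent with the printed layer sizes could look like the sketch below. The constructor arguments and default values are assumptions chosen to reproduce the printout, not the author's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Policy network: maps a state to a deterministic action."""

    def __init__(self, state_size=33, action_size=4, fc1_units=256):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        # tanh keeps every action entry inside [-1, 1]
        return torch.tanh(self.fc2(x))

actor = Actor()
actions = actor(torch.zeros(20, 33))  # a batch of 20 dummy states
print(actions.shape)  # torch.Size([20, 4])
```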

- The critic network consists of three linear hidden layers with 256, 256 and 128 units by default. The first layer takes state_size (33) as its input shape; the second layer takes 260 inputs, which is consistent with the 256 first-layer features concatenated with the 4-dimensional action. Each hidden layer is followed by a *leaky_relu* activation function, which helps to avoid the *vanishing gradient* problem during training. The linear output layer maps the third hidden layer's 128 outputs to a single value and is also followed by a *leaky_relu* activation function.

Critic(  
  (fcs1): Linear(in_features=33, out_features=256, bias=True)  
  (fc2): Linear(in_features=260, out_features=256, bias=True)  
  (fc3): Linear(in_features=256, out_features=128, bias=True)  
  (fc4): Linear(in_features=128, out_features=1, bias=True)  
)
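
The sketch below reproduces the printed layer sizes in PyTorch. Concatenating the action after the first layer is an assumption inferred from `in_features=260` (256 + 4), the argument names are hypothetical, and the *leaky_relu* on the output follows the description in this report.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    """Value network: maps a (state, action) pair to a single Q-value."""

    def __init__(self, state_size=33, action_size=4,
                 fcs1_units=256, fc2_units=256, fc3_units=128):
        super().__init__()
        self.fcs1 = nn.Linear(state_size, fcs1_units)
        # Assumed: the action joins the state features here (256 + 4 = 260)
        self.fc2 = nn.Linear(fcs1_units + action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units, fc3_units)
        self.fc4 = nn.Linear(fc3_units, 1)

    def forward(self, state, action):
        xs = F.leaky_relu(self.fcs1(state))
        x = torch.cat((xs, action), dim=1)
        x = F.leaky_relu(self.fc2(x))
        x = F.leaky_relu(self.fc3(x))
        # Output activation as stated in the report
        return F.leaky_relu(self.fc4(x))

critic = Critic()
q_values = critic(torch.zeros(20, 33), torch.zeros(20, 4))
print(q_values.shape)  # torch.Size([20, 1])
```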

## Learning algorithm

Adam was used as an optimizer for both actor and critic networks.

### Hyperparameters used

* `BUFFER_SIZE = int(1e5)` : replay buffer size
* `BATCH_SIZE = 128      ` : minibatch size
* `GAMMA = 0.99          ` : discount factor
* `TAU = 1e-3            ` : for soft update of target parameters
* `LR_ACTOR = 1e-3       ` : learning rate of the actor
* `LR_CRITIC = 1e-3      ` : learning rate of the critic
* `WEIGHT_DECAY = 0      ` : L2 weight decay
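
To illustrate how `BUFFER_SIZE`, `BATCH_SIZE` and `TAU` are typically used in DDPG, here is a small sketch with the standard library and NumPy. The `ReplayBuffer` class and `soft_update` function are hypothetical stand-ins, not the project's code, and the parameters are plain arrays rather than network weights.

```python
import random
from collections import deque, namedtuple

import numpy as np

BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 128        # minibatch size
TAU = 1e-3              # soft-update interpolation factor

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped when it is full."""

    def __init__(self, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw a uniformly random minibatch without replacement."""
        return self.rng.sample(self.memory, k=self.batch_size)

    def __len__(self):
        return len(self.memory)

def soft_update(local_params, target_params, tau=TAU):
    """In place: theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    for local, target in zip(local_params, target_params):
        target *= (1.0 - tau)
        target += tau * local

# Small illustrative run with dummy transitions
buffer = ReplayBuffer(buffer_size=1000, batch_size=4)
for _ in range(10):
    buffer.add(np.zeros(33), np.zeros(4), 0.1, np.zeros(33), False)
batch = buffer.sample()

local = [np.ones(3)]
target = [np.zeros(3)]
soft_update(local, target)
print(len(batch), target[0])
```

With a small `TAU`, the target parameters trail the local ones slowly, which is the usual reason for using a soft update.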


## Train the agent with DDPG

```python
# Case: 20 agents
agents = Agents(state_size=state_size, action_size=action_size, n_agents=num_agents, random_seed=10)
```

Episode 100	Average Score: 0.69  
Episode 200	Average Score: 11.27  
Episode 274	Average Score: 30.05  
Environment solved in 174 episodes!	Average Score: 30.05  
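
The log reports the environment as solved in 174 episodes at episode 274, which matches the common convention of subtracting the 100-episode averaging window. The sketch below shows one way to compute such a criterion; `first_solved_episode` is a hypothetical helper, and the scores are synthetic data made up for illustration, not the project's results.

```python
from collections import deque

import numpy as np

def first_solved_episode(episode_scores, target=30.0, window=100):
    """Return the 1-based episode at which the moving average over `window`
    episodes (each averaged over all agents) first reaches `target`, or None."""
    recent = deque(maxlen=window)
    for episode, agent_scores in enumerate(episode_scores, start=1):
        recent.append(float(np.mean(agent_scores)))  # average over agents
        if len(recent) == window and np.mean(list(recent)) >= target:
            return episode
    return None

# Synthetic scores for 20 agents: 50 episodes at 0, then 150 episodes at 40
episode_scores = [np.zeros(20)] * 50 + [np.full(20, 40.0)] * 150
episode = first_solved_episode(episode_scores)
print(episode, episode - 100)  # episode reached, and "solved in" count
```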

![](images/scores_epochs.jpg)  
The model was then saved as *./saved_models/checkpoint_actor_20agents.pth* and *./saved_models/checkpoint_critic_20agents.pth* for later use.

## Testing the trained agent

```python
# Saved models are reloaded in evaluation mode for testing; the process runs on the CPU
agents.actor_local.load_state_dict(torch.load('./saved_models/checkpoint_actor_20agents.pth', map_location='cpu'))
agents.critic_local.load_state_dict(torch.load('./saved_models/checkpoint_critic_20agents.pth', map_location='cpu'))
agents.actor_local.eval()
agents.critic_local.eval()
```

The testing process was carried out over 20 epochs, with the scores and their mean shown below:  

![](images/scores_epochs_test.jpg)  

## Conclusions
The problem was solved using the DDPG algorithm: the average reward of +30 over 100 consecutive episodes was reached at episode 274, which the training log reports as solved in 174 episodes (the 100-episode averaging window is subtracted).  
The result depends significantly on hyperparameter tuning. For example, if the number of time steps per episode is too low, or if the learning rate or random seed is poorly chosen, training can get stuck in a local minimum, and the score may start to decrease after a certain number of episodes.  

The test ran for 20 epochs, and the average reward across the 20 agents was above 30 in every one of them.

## Future work to consider:

After a transformation, the data often varies widely and benefits from being standardized into a form that suits neural networks (roughly Gaussian, with fewer outliers). Adding batch normalization layers to the network architecture (before the activation functions) is recommended to help improve training.  
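
As a sketch of that suggestion, the actor below places an `nn.BatchNorm1d` layer between the hidden linear layer and its activation. This is a hypothetical variant of the actor described earlier, with illustrative default sizes, and it has not been trained or evaluated on the environment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorWithBatchNorm(nn.Module):
    """Actor variant with batch normalization before the hidden activation."""

    def __init__(self, state_size=33, action_size=4, fc1_units=256):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)
        self.fc2 = nn.Linear(fc1_units, action_size)

    def forward(self, state):
        # Linear -> BatchNorm -> ReLU, so the activation sees normalized inputs
        x = F.relu(self.bn1(self.fc1(state)))
        return torch.tanh(self.fc2(x))

bn_actor = ActorWithBatchNorm()
bn_actor.eval()  # eval mode uses running statistics instead of batch statistics
with torch.no_grad():
    bn_actions = bn_actor(torch.zeros(20, 33))
print(bn_actions.shape)  # torch.Size([20, 4])
```

Note that in training mode `BatchNorm1d` needs more than one sample per batch, so acting on a single state usually requires switching to `eval()` first.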

The planned next steps are:

1. Proximal Policy Optimization:  
The idea is to implement a policy gradient algorithm that finds an appropriate policy with gradient methods while keeping each update in the neighbourhood of the previous policy, so that the policy changes only gradually from one iteration to the next in the high-dimensional parameter space.

2. Prioritized Experience Replay:  
The idea behind this technique for sampling from the replay buffer is that not all experiences are equal: some are more important than others in terms of reward, so the agent should prioritize among them when sampling.

3. Asynchronous Actor Critic:  
The idea is to have a global network and multiple agents who all interact with the environment separately and send their gradients to the global network for optimization in an asynchronous way.
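
The first two ideas can be sketched with NumPy as follows: a clipped surrogate objective of the kind PPO uses to keep the new policy close to the old one, and proportional sampling probabilities for prioritized replay. The function names are hypothetical, and `epsilon=0.2` and `alpha=0.6` are commonly used default values assumed here, not settings from this project. In the prioritized replay literature the priorities are usually TD-error magnitudes; here they are just given numbers.

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the
    probability ratio between the new and the old policy."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return float(np.minimum(ratio * advantage, clipped * advantage).mean())

def prioritized_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 gives uniform sampling."""
    scaled = np.asarray(priorities, dtype=float) ** alpha
    return scaled / scaled.sum()

# A ratio of 1.5 is clipped to 1.2 when the advantage is positive
objective = ppo_clipped_objective(ratio=[1.5], advantage=[1.0])

# The third experience has twice the priority of the others
probs = prioritized_probabilities([1.0, 1.0, 2.0], alpha=1.0)
rng = np.random.default_rng(0)
indices = rng.choice(len(probs), size=2, p=probs, replace=False)
print(objective, probs)
```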
