## Project Continuous Control - Report

The project demonstrates how policy-based methods can be used to learn the optimal policy in a model-free Reinforcement Learning setting using a Unity environment, in which a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of the agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm. Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector is a number between -1 and 1. An agent choosing actions randomly can be seen in motion below:

![random agent](assets/random_agent.gif) 

The following report is written in three parts:

- **Implementation**

- **Results**

- **Ideas for improvement** 

## Implementation

The underlying algorithm is an actor-critic method.


### Actor-Critic dual mechanism

For each timestep _t_, we perform the following operations.

Let __*S*__ be the current state. It is the input to the _Actor NN_, whose output is the policy

![](images/policy_pi.png)

where $\pi$ is the policy function, i.e., the distribution over actions. The _Critic NN_ takes the state __*S*__ as input and outputs
the state-value function __*v(S, w)*__, that is, the _expected total reward_ for the agent starting from state __*S*__. Here, $\theta$ is
the parameter vector of the _Actor NN_ and _w_ is the parameter vector of the _Critic NN_. The task is to train both networks, i.e.,
to find the optimal values of $\theta$ and _w_. Following the policy $\pi$ we choose the action _A_; the environment then returns the reward _R_
and the next state __*S'*__. From these we compute the _TD estimate_:
 
![](images/TD_estimate.png)
		 
Next, we use the _Critic_ to calculate the _advantage function_ _A(s, a)_:

![](images/calc_advantage.png)
				 
Here, $\gamma$ is the _discount factor_. The parameter $\theta$ is updated by gradient ascent as follows:

![](images/update_theta.png)

The parameter _w_ is updated as follows:

![](images/update_w.png)
		
Here, $\alpha$ (resp. $\beta$) is the learning rate for the _Actor NN_ (resp. _Critic NN_). Before moving on to the next timestep, we update the state _S_ and multiply the operator _I_ by the _discount factor_ $\gamma$:

![](images/next_state.png)

At the start of the algorithm, the operator _I_ should be initialized to the identity operator.
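
The per-timestep quantities described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it follows the standard one-step actor-critic form, treats _I_ as a scalar that starts at 1, and sets the value of a terminal state to zero. The function name `actor_critic_step_terms` and its arguments are hypothetical and do not come from the project code, and the exact notation in the images may differ.

```python
def actor_critic_step_terms(reward, v_s, v_next, gamma, discount_acc, done=False):
    """Return (td_target, advantage, next_discount_acc) for one timestep.

    v_s and v_next stand for the critic outputs v(S, w) and v(S', w).
    discount_acc plays the role of I (assumed here to be a scalar).
    """
    # Assumption: a terminal next state contributes no future value.
    bootstrap = 0.0 if done else v_next
    # TD estimate: one-step bootstrapped return.
    td_target = reward + gamma * bootstrap
    # Advantage: how much better the outcome was than the critic expected.
    advantage = td_target - v_s
    # I is multiplied by the discount factor before the next timestep.
    next_discount_acc = gamma * discount_acc
    return td_target, advantage, next_discount_acc


# Illustrative numbers only (not measured values from the experiment).
td_target, advantage, i_next = actor_critic_step_terms(
    reward=0.1, v_s=1.0, v_next=2.0, gamma=0.99, discount_acc=1.0
)
```

A positive advantage pushes $\theta$ towards making the chosen action more likely, while the critic moves _v(S, w)_ towards the TD estimate.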

### DDPG Algorithm

In this project we use the _DDPG_ (_Deep Deterministic Policy Gradient_) algorithm. _DDPG_ concurrently learns
a Q-function and a policy. It uses off-policy data and the Bellman equation to learn
the Q-function, and uses the Q-function to learn the policy. This dual mechanism is the _actor-critic method_.
The DDPG algorithm also relies on two additional mechanisms: a _Replay Buffer_ and _Soft Updates_.
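
As an illustration of the replay buffer idea, here is a minimal sketch using only the standard library. It is not the project's actual implementation: the class layout, the `Experience` field names and the `seed` argument are assumptions, chosen so that `len(buffer)` and `buffer.sample()` behave the way they are used in `step()`.

```python
import random
from collections import deque, namedtuple

# Hypothetical record layout for one transition.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        # deque with maxlen silently drops the oldest item when full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement breaks the temporal
        # correlation between consecutive transitions.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


# Tiny usage example with made-up transitions.
buffer = ReplayBuffer(buffer_size=5, batch_size=2)
for t in range(8):
    buffer.add(state=t, action=0.0, reward=0.1, next_state=t + 1, done=False)
batch = buffer.sample()
```

In a real agent the sampled batch would then be stacked into tensors before being passed to the networks.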

### Goal of DDPG Agent 

The environment for this project involves controlling a **double-jointed arm** to reach target locations.
A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of      
this agent is to maintain its position at the target location for as many time steps as possible. 

The observation space (i.e., state space) has 33 dimensions corresponding to position, rotation, velocity,    
and angular velocities of the arm. The action space has 4 dimensions corresponding to torque applicable to    
two joints. Every entry in the action vector should be a number between -1 and 1.
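
To make the action constraint concrete, the following sketch samples a random 4-dimensional action and clips an arbitrary vector into the range [-1, 1]. The helper names `random_action` and `clip_action` are hypothetical and the snippet does not talk to the Unity environment; it only illustrates the shape and bounds of an action.

```python
import random

ACTION_SIZE = 4  # action dimension stated above


def random_action(rng):
    """Sample each torque entry uniformly from [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(ACTION_SIZE)]


def clip_action(action, low=-1.0, high=1.0):
    """Force every entry of the action vector into [low, high]."""
    return [max(low, min(high, a)) for a in action]


rng = random.Random(0)
sampled = random_action(rng)
clipped = clip_action([1.7, -3.0, 0.25, -1.0])
```

Clipping matters mostly when something is added to the actor output (for example exploration noise), since the sum can leave the valid range.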


Target networks are used for slow tracking of the learned networks. We create a copy of the _actor_ and _critic_ networks:
_actor_\__target_ (with the parameter vector _p'_) and _critic_\__target_ (with the parameter vector _w'_). The weights of
these _target networks_ are updated by having them slowly track the learned networks (with the parameter vectors _p_ and _w_):

    p'  <--  p * \tau + p' * (1 - \tau)  
    w'  <--  w * \tau + w' * (1 - \tau)

We use a very small value for $\tau$ (0.001). This means that the target values are constrained to change slowly, which greatly improves the stability of learning. This update is performed by the function _soft_\__update_.

_"This may slow learning, since the target network delays the propagation of value estimations.   
However, in practice we found this was greatly outweighed by the stability of learning."     
("Continuous control with deep reinforcement learning", Lillicrap et al., 2015, arXiv:1509.02971)_
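
A small worked example shows how slowly the target follows the learned weights with $\tau$ = 0.001. The sketch below applies the update rule above to plain Python lists instead of network parameters; the function is a hypothetical stand-in for _soft_\__update_, not the project's code, and the local weights are held fixed, which would not be the case during training.

```python
TAU = 1e-3


def soft_update(local_params, target_params, tau):
    """Blend local parameters into the target parameters in place."""
    for i, (local, target) in enumerate(zip(local_params, target_params)):
        target_params[i] = tau * local + (1.0 - tau) * target


# Made-up weights: the target starts at zero, the local network is fixed.
local = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(1000):
    soft_update(local, target, TAU)

# With a fixed local value, n updates close 1 - (1 - tau)^n of the gap.
fraction_covered = 1.0 - (1.0 - TAU) ** 1000
```

Under these assumptions, 1000 updates move the target only about 63% of the way to the learned weights, which is the delay in value propagation mentioned in the quote.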


### DDPG Neural Networks

The DDPG algorithm uses 4 neural networks: _actor_\__target_, _actor_\__local_, _critic_\__target_ and _critic_\__local_:

    actor_local = Actor(state_size, action_size, random_seed).to(device)
    actor_target = Actor(state_size, action_size, random_seed).to(device)

    critic_local = Critic(state_size, action_size, random_seed).to(device)
    critic_target = Critic(state_size, action_size, random_seed).to(device)

The classes _Actor_ and _Critic_ are provided by `model.py`. The typical behavior of _the actor_ and _the critic_
is as follows:

    actor_target(state) -> action
    critic_target(state, action) -> Q-value
    
    actor_local(states) -> actions_pred
    -critic_local(states, actions_pred) -> actor_loss
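
The four calls above correspond to the usual DDPG targets and losses. Written out in the standard form (an assumption about what `learn()` computes, since that function is not shown here), with $\mu_p$ for _actor_\__local_, $\mu_{p'}$ for _actor_\__target_, $Q_w$ for _critic_\__local_, $Q_{w'}$ for _critic_\__target_, $d$ a done flag (also an assumption) and $N$ the batch size:

$$
y_i = r_i + \gamma \, (1 - d_i) \, Q_{w'}\big(s'_i, \mu_{p'}(s'_i)\big)
$$

$$
L_{\text{critic}}(w) = \frac{1}{N} \sum_{i=1}^{N} \big(Q_w(s_i, a_i) - y_i\big)^2,
\qquad
L_{\text{actor}}(p) = -\frac{1}{N} \sum_{i=1}^{N} Q_w\big(s_i, \mu_p(s_i)\big)
$$

The target networks appear only in $y_i$, while the actor loss is the negated critic estimate of the actor's own actions, so minimizing it performs gradient ascent on the Q-value.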

### Hyperparameters

The hyperparameters used in the experiment and their values are listed below:

| Hyperparameter                      | Value |
| ----------------------------------- | ----- |
| Replay buffer size                  | 1e6   |
| Batch size                          | 1024  |
| $\gamma$ (discount factor)          | 0.99  |
| $\tau$                              | 1e-3  |
| Actor Learning rate                 | 1e-4  |
| Critic Learning rate                | 3e-4  |
| Update interval                     | 20    |
| Update times per interval           | 10    |
| Number of episodes                  | 500   |
| Max number of timesteps per episode | 1000  |
| Leak for LeakyReLU                  | 0.01  |


Note that the parameters LEARNING_PERIOD (the update interval) and UPDATE_FACTOR (the number of updates per interval) are critical for the **convergence** of the algorithm.
The corresponding code is in the function _step()_:
     
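    # Learn only once the buffer holds more than one batch,
    # and only on every LEARNING_PERIOD-th timestep.
    if len(self.memory) > BATCH_SIZE and timestep % LEARNING_PERIOD == 0:
        # Run several gradient updates per learning step.
        for _ in range(UPDATE_FACTOR):
            experiences = self.memory.sample()
            self.learn(experiences, GAMMA)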
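
To get a feeling for what this schedule means, the sketch below counts how many calls to `learn()` one episode would trigger. It assumes that the table rows "Update interval" (20) and "Update times per interval" (10) are LEARNING_PERIOD and UPDATE_FACTOR, that one transition is stored per timestep, and that timesteps are counted from 0; the helper `count_learning_passes` is hypothetical.

```python
BATCH_SIZE = 1024
LEARNING_PERIOD = 20   # assumed to be "Update interval" in the table
UPDATE_FACTOR = 10     # assumed to be "Update times per interval"
MAX_T = 1000           # max number of timesteps per episode


def count_learning_passes(buffer_len_at_start, max_t=MAX_T):
    """Count learn() calls in one episode under the gating rule of step()."""
    passes = 0
    buffer_len = buffer_len_at_start
    for timestep in range(max_t):
        buffer_len += 1  # assumption: one transition stored per timestep
        if buffer_len > BATCH_SIZE and timestep % LEARNING_PERIOD == 0:
            passes += UPDATE_FACTOR
    return passes


warm_episode = count_learning_passes(buffer_len_at_start=2000)
first_episode = count_learning_passes(buffer_len_at_start=0)
```

Under these assumptions an episode with a sufficiently filled buffer performs 500 gradient updates (50 learning steps of 10 updates each), while a very first episode starting from an empty buffer performs none, because the buffer never exceeds the batch size of 1024 within 1000 timesteps.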


## Results

   

  The trained agent can be seen in action below:

  ![trained](assets/trained_agent.gif) 



  The best performance was achieved by **DDPG**, which reached the reward of +30 in **337** episodes. I noticed that every single hyperparameter contributes significantly to getting the right results, and that it is hard to identify the values that work. The plot of the rewards across episodes is shown below:

  ![ddpg](assets/scores.png)

## Ideas for improvement

1. An improvement might be achieved by adding some layers to the Actor and Critic neural networks. Some sources state
   that Batch Normalization can accelerate deep network training,
   for example, [here](https://medium.com/@ilango100/batch-normalization-speed-up-neural-network-training-245e39a62f85) and [here](https://arxiv.org/pdf/1502.03167.pdf).

2. Try different values for hyperparameters such as BATCH_SIZE, LR_ACTOR, LR_CRITIC, LEARNING_PERIOD and UPDATE_FACTOR.
 
3. Instead of DDPG, other models can be considered, such as [PPO](https://openai.com/blog/openai-baselines-ppo/), 
   [A3C](https://blog.goodaudience.com/a3c-what-it-is-what-i-built-6b91fe5ec09c) and others.
4. The Q-prop algorithm, which combines both off-policy and on-policy learning, could be a good one to try.

5. General optimization techniques like cyclical learning rates and warm restarts could be useful as well.
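
As a rough illustration of the last idea, a cosine learning-rate schedule with warm restarts can be written as a small function. This is a sketch with a fixed restart period; the function name, the period of 100 steps and the lower bound of 1e-5 are illustrative assumptions, and only the upper bound of 1e-4 is taken from the actor learning rate in the hyperparameter table.

```python
import math


def cosine_warm_restart_lr(step, lr_max, lr_min, period):
    """Cosine-annealed learning rate that jumps back to lr_max every period."""
    t_cur = step % period  # position inside the current cycle
    cosine = 1.0 + math.cos(math.pi * t_cur / period)
    return lr_min + 0.5 * (lr_max - lr_min) * cosine


LR_MAX = 1e-4   # actor learning rate from the table
LR_MIN = 1e-5   # hypothetical lower bound
PERIOD = 100    # hypothetical number of steps per cycle

lr_start = cosine_warm_restart_lr(0, LR_MAX, LR_MIN, PERIOD)
lr_middle = cosine_warm_restart_lr(50, LR_MAX, LR_MIN, PERIOD)
lr_restart = cosine_warm_restart_lr(100, LR_MAX, LR_MIN, PERIOD)
```

The rate decays smoothly from `LR_MAX` towards `LR_MIN` within each cycle and then restarts at `LR_MAX`; whether this helps DDPG on this task would have to be checked experimentally.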