# Continuous Control Project Report

#### Deep Reinforcement Learning Continuous Control

## Introduction

#### Project Overview

In this project, 20 identical agents were trained using the Deep Deterministic Policy Gradient (DDPG) algorithm to solve the Reacher environment. Each agent has its own copy of the environment.

#### Environment
For this project, I worked with the Reacher environment.

In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of each agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm. Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector should be a number between -1 and 1.

I solved the second version of the Unity environment, which contains 20 identical agents, each with its own copy of the environment.

In order to solve the environment, the agents must achieve an average score of +30 (over 100 consecutive episodes, and over all agents).
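
The solve criterion above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the helper names `episode_score` and `is_solved` are hypothetical, and it assumes each episode's score is first averaged over the 20 agents before the 100-episode window is applied.

```python
from statistics import mean


def episode_score(agent_scores):
    """Average the per-agent scores of one episode into a single number."""
    return mean(agent_scores)


def is_solved(episode_scores, target=30.0, window=100):
    """True once the mean of the last `window` episode scores reaches `target`."""
    if len(episode_scores) < window:
        return False
    return mean(episode_scores[-window:]) >= target


# Made-up scores, for illustration only: 100 episodes with 20 agents each.
history = [episode_score([31.0] * 20) for _ in range(100)]
print(is_solved(history))
```

Averaging over agents first and over episodes second means a single strong agent or a single lucky episode is not enough to meet the +30 threshold.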

## Methodology and Algorithm

### Architecture
In this project, a single brain was used to control all 20 agents, rather than an individual brain for each agent. A policy-based method was adopted because such methods adapt naturally to continuous action spaces and learn the policy directly instead of deriving it from a value function estimate.

An extension of Deep Q-learning to continuous tasks, the Deep Deterministic Policy Gradient (DDPG) algorithm, was implemented.

![DDPG_Algorithm.png](./images/DDPG_Algorithm.png)

In the paper [Continuous Control with Deep Reinforcement Learning](https://arxiv.org/pdf/1509.02971.pdf), researchers at Google DeepMind presented a model-free, off-policy actor-critic algorithm using deep function approximators. The algorithm can learn policies in high-dimensional, continuous action spaces.

### Code Structure
Three main files implement this project. The code was split this way for modularity and easier debugging.

```model.py```: This file implements the Actor and Critic classes using the PyTorch framework. The actor-critic method combines value-based and policy-based methods.

- ```class Actor```

    - An input layer whose size depends on the state_size parameter.
    - A BatchNorm1D layer, added immediately after the first layer to normalize its outputs and keep the features on a comparable scale.
    - One further fully connected layer with in_units=400 and out_units=300.
    - An output layer whose size depends on the action_size parameter.
    - ```.reset_parameters()```: This method initializes the weights from a uniform distribution.
    - ```.forward()```: This method maps states to the corresponding actions. The ReLU activation function was used for the hidden layers, and tanh was used for the output layer to keep values between -1 and 1.

**Actor Architecture**

    Input nodes(33)
    (BatchNorm1D) BatchNorm Layer (400 nodes, ReLU activation)
    (fc) fully connected linear layer (300 nodes, ReLU activation)
    Output nodes(4, tanh activation) 
    
    
    
- ```class Critic```
    - Input layer: its size depends on the state_size parameter.
    - Two layers (BatchNorm1D and a fully connected linear layer): the BatchNorm layer serves the same purpose as in the Actor class. The fully connected linear layer has in_units=400+action_size and out_units=300.
    - Output layer: this layer produces a single value.
    - ```.reset_parameters()```: This method initializes the weights from a uniform distribution.
    - ```.forward()```: This method implements the forward pass and maps (state, action) pairs to Q-values. The ReLU activation function was used for the hidden layers. The output of the first activation layer was concatenated with the action vector. No activation function was used for the output layer.
    
**Critic Architecture**

     Input nodes(33)
    (BatchNorm1D) BatchNorm Layer (400 nodes, ReLU activation)
    (fc) fully connected linear layer (400+action_size inputs, 300 nodes, ReLU activation)
    Output nodes(1)    
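
One way to read the two layouts above is the PyTorch sketch below. It is an interpretation, not the project's model.py: the attribute names (`fc1`, `bn1`, `fcs1`, and so on) are assumptions, the first 400-unit layer is taken to be a linear layer followed by BatchNorm1D, and the uniform weight initialization done by `.reset_parameters()` is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Maps a batch of states to actions in [-1, 1]."""

    def __init__(self, state_size=33, action_size=4, fc1_units=400, fc2_units=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)  # normalizes the first layer's output
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.relu(self.bn1(self.fc1(state)))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # tanh keeps each action entry in [-1, 1]


class Critic(nn.Module):
    """Maps a batch of (state, action) pairs to one Q-value each."""

    def __init__(self, state_size=33, action_size=4, fcs1_units=400, fc2_units=300):
        super().__init__()
        self.fcs1 = nn.Linear(state_size, fcs1_units)
        self.bn1 = nn.BatchNorm1d(fcs1_units)
        # The action is concatenated after the first activation,
        # so this layer takes 400 + action_size inputs.
        self.fc2 = nn.Linear(fcs1_units + action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        xs = F.relu(self.bn1(self.fcs1(state)))
        x = torch.cat((xs, action), dim=1)
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # no activation on the output


# Shape check with random inputs for 20 agents (illustrative data only).
states = torch.randn(20, 33)
actor, critic = Actor(), Critic()
actions = actor(states)
q_values = critic(states, actions)
print(actions.shape, q_values.shape)
```

Feeding the action in after the first layer lets the critic build state features before mixing in the action, which matches the in_units=400+action_size figure given for the second layer.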
      
<br>  
    
```Agent.py```: This file contains the implementation of the actor-critic logic, the Ornstein-Uhlenbeck process, and experience replay.

- ```Class Agent```:
    - Separate local and target networks were initialized for both the actor and the critic to improve stability. OUNoise and ReplayBuffer are also instantiated here.
    - ```.step()```: This method saves experiences to the replay buffer and enforces the learning interval. Learning is performed only every 20 timesteps (LEARN_EVERY = 20), at which point the method samples from the buffer and runs .learn() 10 times (LEARN_NUM = 10).
    - ```.act()```: This method returns actions for a given state based on the current policy. The added noise is scaled by an epsilon parameter that decays the noise level over time.
    - ```.learn()```: Here, the policy and value parameters were updated using the sampled experiences. The critic network was updated first: after the forward pass, the loss was calculated and, before the optimization step, the gradient was clipped to guard against exploding gradients. The actor network was updated next, also with gradient clipping, and the noise scale was then decayed using EPSILON_DECAY.
    - ```.soft_update()```: The target network parameters are blended toward the local network parameters here, at a rate set by TAU.
   
   
- ```Class OUNoise```: This class implements the [Ornstein-Uhlenbeck process](https://arxiv.org/pdf/1509.02971.pdf). The process adds a certain amount of noise to the action values at each timestep, which helps address the exploration-exploitation dilemma. It was used for this purpose in the paper Continuous Control with Deep Reinforcement Learning.

    - The parameters (mu, theta, sigma, seed) were initialized.
    - ```.reset()```: This resets the internal state to a copy of the parameter mu.
    - ```.sample()```: This updates the internal state using the theta and sigma parameters and returns it as a noise sample.


- ```Class ReplayBuffer```: This class implements experience replay, which allows the agents to learn from past experiences. It is a fixed-size buffer that stores experience tuples. The 20 agents share one central replay buffer, so they can learn from each other's experiences since they all perform the same task. Experiences are sampled from the buffer at random.

    - The replay buffer parameters and the experience tuple were initialized.
    - ```.add()```: This method adds a new experience tuple _(state, action, reward, next_state, done)_ to the memory.
    - ```.sample()```: This method samples and returns a random batch of experiences from the memory.
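
The shared buffer and the learning interval described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the code in Agent.py: the helper `learning_passes` is hypothetical, the rule that the buffer must hold more than one batch before learning starts is an assumption, and the buffer and batch sizes are shrunk for the demo.

```python
import random
from collections import deque, namedtuple

LEARN_EVERY = 20  # learn only every 20 timesteps
LEARN_NUM = 10    # number of .learn() passes per learning step

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer shared by all agents."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop out first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Return a uniformly random batch of stored experiences."""
        return self.rng.sample(list(self.memory), k=self.batch_size)

    def __len__(self):
        return len(self.memory)


def learning_passes(timestep, buffer, learn_every=LEARN_EVERY, learn_num=LEARN_NUM):
    """Number of .learn() calls due at this timestep (assumed rule)."""
    if timestep % learn_every == 0 and len(buffer) > buffer.batch_size:
        return learn_num
    return 0


# Demo with placeholder transitions: 20 agents write to one buffer for 20 steps.
buffer = ReplayBuffer(buffer_size=1000, batch_size=8)
for timestep in range(1, 21):
    for agent_id in range(20):
        buffer.add([0.0] * 33, [0.0] * 4, 0.0, [0.0] * 33, False)

passes = learning_passes(timestep=20, buffer=buffer)
batch = buffer.sample()
print(len(buffer), passes, len(batch))
```

Because every agent writes to the same buffer, each sampled batch mixes transitions from different arms, which is what lets a single set of networks benefit from all 20 copies of the environment.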
    
<br>

```Continous_Control.ipynb```: This notebook contains the code for training the agents.

### Hyperparameters
These are the hyperparameters used in the Agent.py file.

```
BUFFER_SIZE       = int(1e6)  
BATCH_SIZE        = 128        
GAMMA             = 0.99            
TAU               = 1e-3              
LR                = 1e-3         
WEIGHT_DECAY      = 0        
LEARN_EVERY       = 20        
LEARN_NUM         = 10          
OU_SIGMA          = 0.2          
OU_THETA          = 0.15         
EPSILON           = 1.0           
EPSILON_DECAY     = 1e-6
```
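
To show how OU_THETA, OU_SIGMA, and EPSILON interact, here is a minimal sketch of the noise process and of epsilon-scaled exploration. It assumes mu = 0 and Gaussian increments, and `noisy_action` is a hypothetical helper, so it may differ in detail from the OUNoise class in Agent.py.

```python
import random

OU_THETA = 0.15
OU_SIGMA = 0.2
EPSILON = 1.0


class OUNoise:
    """Ornstein-Uhlenbeck process: noise that drifts back toward mu."""

    def __init__(self, size, mu=0.0, theta=OU_THETA, sigma=OU_SIGMA, seed=0):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to a copy of mu."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the state one step and return it as a noise sample."""
        # x <- x + theta * (mu - x) + sigma * N(0, 1)
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


def noisy_action(action, noise, epsilon=EPSILON):
    """Add epsilon-scaled noise and clip each entry to [-1, 1]."""
    return [max(-1.0, min(1.0, a + epsilon * n)) for a, n in zip(action, noise)]


noise = OUNoise(size=4)
action = noisy_action([0.0, 0.5, -0.5, 0.9], noise.sample())
print(action)
```

The theta term pulls the noise back toward mu while the sigma term perturbs it, so consecutive samples are correlated. As epsilon shrinks, the same noise contributes less and the agent acts closer to its learned policy.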

For both the Actor and Critic networks, the ```Adam``` optimizer was used with a learning rate (LR) of 1e-3.
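
TAU in the hyperparameter list controls how quickly the target networks track the local networks. The soft update rule from the DDPG paper, target = TAU * local + (1 - TAU) * target, is sketched below on plain lists of floats rather than PyTorch parameters, so the function signature is an illustrative assumption and not the one in Agent.py.

```python
TAU = 1e-3


def soft_update(local_params, target_params, tau=TAU):
    """Blend target parameters toward local ones by a fraction tau."""
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_params, target_params)
    ]


# Illustrative values: the target starts at 0 and the local weight is fixed at 1.
target = [0.0]
for _ in range(1000):
    target = soft_update([1.0], target)
print(target)
```

With TAU = 1e-3, each update moves the target only 0.1% of the way toward the local weights, so the targets used in the critic loss change slowly, which is the stabilizing effect the separate target networks are meant to provide.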

## Result

After training the agents with the specified hyperparameters and architecture, the plot below was generated. It shows the performance of the agents over the training episodes.

![Result.png](./images/result.png)

![Graph.png](./images/graph.png)

## Ideas on Performance Improvement
In the future, I will consider improving this project using:

- **Prioritized Experience Replay**: Experience replay lets online reinforcement learning agents remember and reuse experiences from the past. In prior work, experience transitions were uniformly sampled from a replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance. [Paper](https://arxiv.org/abs/1511.05952)
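
As a rough illustration of the idea, the proportional variant in the linked paper samples transition i with probability p_i^alpha / sum_k p_k^alpha, where p_i = |TD error_i| + eps. The sketch below covers only that sampling step. The function names, the TD-error values, and alpha = 0.6 are illustrative assumptions, and the importance-sampling correction the paper also applies is left out.

```python
import random


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-5):
    """Proportional prioritization: larger |TD error| means higher probability."""
    priorities = [(abs(err) + eps) ** alpha for err in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, batch_size, rng, alpha=0.6):
    """Draw buffer indices (with replacement) according to their priorities."""
    probs = sampling_probabilities(td_errors, alpha)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)


# Made-up TD errors for five stored transitions.
td_errors = [0.01, 0.5, 2.0, 0.1, 0.0]
probs = sampling_probabilities(td_errors)
indices = sample_indices(td_errors, batch_size=3, rng=random.Random(0))
print(probs, indices)
```

Setting alpha to 0 recovers the uniform sampling used in this project, while larger values focus replay on the transitions the critic currently predicts worst.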