# Project
## Description

To solve the **Reacher** challenge, we employed a _reinforcement learning_ algorithm named **"DDPG"**, which stands for _"Deep Deterministic Policy Gradient"_; its distributed distributional extension, _"Distributed Distributional Deterministic Policy Gradients"_ (D4PG), is explained [here](https://openreview.net/pdf?id=SyZipzbCb).

The core of this algorithm can be summarized as follows:
 - There are two kinds of networks:
     - **Actor**: network that predicts the actions, given a state, in a continuous manner.
     - **Critic**: given the state of the environment and the selected action, estimates the expected return (Q value).
 - The algorithm is suited for continuous action and state spaces.
 - To balance **"exploration/exploitation"**, noise scaled by a decaying coefficient is added to the action performed by the agent, i.e. after the _"Actor"_ has predicted an action, the noise is added.
 - Both __actor__ and __critic__ have their respective _target networks_.
 - The networks are updated using __experience replay__.
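
Two of the ingredients listed above, experience replay and target networks, can be sketched in plain Python. This is a minimal illustration and not the project's code: `Experience`, `ReplayBuffer`, `soft_update`, and the default buffer size are hypothetical names and values, and weights are shown as flat lists of floats rather than network tensors.

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one transition; the field names are assumptions.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size memory that returns uniformly sampled mini-batches."""

    def __init__(self, buffer_size=100_000, seed=0):
        # buffer_size is an illustrative value, not taken from the project.
        self.memory = deque(maxlen=buffer_size)  # oldest items are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement reduces temporal correlation.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def soft_update(local_weights, target_weights, tau):
    """Blend target weights toward local ones: tau*local + (1 - tau)*target."""
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_weights, target_weights)
    ]
```

With a small `tau`, the target networks trail the learned networks slowly, which is what keeps the critic's training targets from shifting abruptly.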
 

## Parameters
### Networks
#### Actor
 The actor model is based on a neural network, with the following architecture:
 - __Input__: 33 inputs, state size
 - __1st Layer__: 32 units | Layer Normalization | ReLU Activation
 - __2nd Layer__: 32 units | Layer Normalization | ReLU Activation
 - __Output__: 4 units (action size) | Layer Normalization | ReLU Activation | Hyperbolic Tangent
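
To make the output stack concrete, here is a small sketch of what "Layer Normalization, ReLU, Hyperbolic Tangent" computes when applied in exactly the listed order. It assumes that literal ordering and omits the learned scale and shift that layer normalization usually carries; `layer_norm`, `actor_output_stack`, and the input values are hypothetical, and the actual implementation may differ.

```python
import math


def layer_norm(values, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def actor_output_stack(pre_activations):
    """Apply Layer Normalization, then ReLU, then tanh, in the listed order."""
    normed = layer_norm(pre_activations)
    rectified = [max(0.0, v) for v in normed]
    return [math.tanh(v) for v in rectified]


# Four made-up pre-activation values, one per action dimension.
actions = actor_output_stack([-2.0, -0.5, 0.5, 2.0])
```

Read literally, this order confines every output to the range [0, 1), since ReLU removes negative values before the hyperbolic tangent is applied.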

#### Critic
The critic has a more complex architecture, as _between the 1st and 2nd layers_ we __added the action values from the Actor network__.
 - __Input__: 33 inputs, state size
 - __1st Layer__: 256 units + 4 units (actor output) | Layer Normalization | ReLU Activation
 - __2nd Layer__: 256 units | Layer Normalization | ReLU Activation
 - __3rd Layer__: 128 units | Layer Normalization | ReLU Activation
 - __4th Layer__: 64 units | Layer Normalization | ReLU Activation
 - __Output__: 1 unit (Q value) | Layer Normalization | ReLU Activation | Hyperbolic Tangent
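
The layer sizes of both networks can be checked with a short bookkeeping sketch. It assumes that "256 units + 4 units" means the 4 action values are concatenated to the 256 outputs of the critic's 1st layer, so the 2nd layer receives 260 inputs; the function names are hypothetical and normalization parameters are not counted.

```python
def actor_layer_shapes(state_size=33, action_size=4, hidden=(32, 32)):
    """Return (inputs, outputs) for each fully connected layer of the actor."""
    sizes = [state_size, *hidden, action_size]
    return list(zip(sizes[:-1], sizes[1:]))


def critic_layer_shapes(state_size=33, action_size=4, hidden=(256, 256, 128, 64)):
    """Return (inputs, outputs) per critic layer, with the action joined after layer 1."""
    shapes = []
    in_features = state_size
    for index, out_features in enumerate(hidden):
        shapes.append((in_features, out_features))
        # Assumption: the action values are concatenated to the 1st layer's
        # outputs, so only the 2nd layer sees action_size extra inputs.
        in_features = out_features + (action_size if index == 0 else 0)
    shapes.append((in_features, 1))  # single Q-value output
    return shapes


def parameter_count(shapes):
    """Count weights plus biases of the linear layers (normalization excluded)."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)
```

Under these assumptions, `critic_layer_shapes()` yields `[(33, 256), (260, 256), (256, 128), (128, 64), (64, 1)]`, and the critic carries far more linear-layer parameters than the actor (116,737 versus 2,276).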
 

### Hyperparameters
 - __Batch Size__ = 128
 - __Gamma__ = 0.99
 - __Tau__ = 1e-3
 - __LR Actor__ = 1e-4
 - __LR Critic__ = 2e-4
 - __Weight Decay__ = 1e-4
 - __Exploration Coef.__ = 1.0 (initial value)
 - __Exploration Coef. decay__ = 0.95 (each time the agent learns, the _"exploration coef."_ is multiplied by it)
 - __Update Every__ = 10 (number of iterations between learning rounds)
 - __Consecutive Learning Iterations__ = 4 (number of learning iterations in a row)
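
The following sketch shows how several of these values interact: `Gamma` in the critic's one-step target, and `Update Every`, `Consecutive Learning Iterations`, and the exploration decay in the learning schedule. It assumes the coefficient is decayed once per learning iteration and ignores details such as waiting for the replay memory to hold a full batch; `td_target` and `run_schedule` are hypothetical names.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """One-step critic target: r + gamma * Q_target(s', a'), or r if terminal."""
    return reward + gamma * next_q * (0.0 if done else 1.0)


def run_schedule(total_steps, update_every=10, learn_iterations=4,
                 exploration=1.0, decay=0.95):
    """Count learning iterations and track the exploration coefficient."""
    learn_calls = 0
    for step in range(1, total_steps + 1):
        if step % update_every == 0:
            for _ in range(learn_iterations):
                learn_calls += 1
                # Assumption: the coefficient decays once per learning iteration.
                exploration *= decay
    return learn_calls, exploration


learn_calls, exploration = run_schedule(total_steps=100)
```

Under this reading, 100 iterations trigger 40 learning iterations and the exploration coefficient falls to `0.95 ** 40`, so the added noise fades quickly in the early part of training.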
 


## Results
### Raw results
Results obtained from the last training iterations of the algorithm.

![python_console](ddpg_results.png)


### Plot results
Plot of the complete learning history of the actor-critic agent.
![results_plot](ddpg_plot.png)

The agent reached a mean reward (over a window of 100 values) of __above 30.0 points__.
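
The windowed mean can be computed as in the sketch below. It assumes each value is the score of one episode; `first_solved_episode` is a hypothetical helper and the scores are synthetic, not the project's results.

```python
from collections import deque


def first_solved_episode(scores, window=100, target=30.0):
    """Return the 1-based episode where the rolling mean first exceeds target."""
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window > target:
            return episode
    return None  # the target was never exceeded


# Made-up scores for illustration only: a linear ramp capped at 40 points.
synthetic_scores = [min(40.0, 0.5 * i) for i in range(300)]
solved_at = first_solved_episode(synthetic_scores)
```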

## Future Improvements
Improvements that should increase learning speed and stability:
 - **Prioritized Experience Replay**: being able to weight the best experiences (e.g. the most recent ones, or the ones that gave more reward).
 - **Implementing other algorithms**: such as the __Q-Prop__ [algorithm](https://arxiv.org/abs/1611.02247v1), which addresses the main problem of policy gradient methods (e.g. _"DDPG"_): they require extensive hyperparameter optimization to find a convergent and stable point.
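
As a rough idea of how the prioritized replay mentioned above could sample experiences, the sketch below draws indices with probability proportional to a priority raised to a power `alpha`. The priority itself is left open: it could be the reward or recency, or the magnitude of the TD error as in the usual formulation of prioritized replay. `sample_prioritized` and the value `alpha=0.6` are illustrative assumptions, not part of this project.

```python
import random


def sample_prioritized(priorities, batch_size, alpha=0.6, seed=0):
    """Sample experience indices with probability proportional to priority ** alpha."""
    rng = random.Random(seed)
    weights = [p ** alpha for p in priorities]
    # random.choices samples with replacement, proportionally to the weights.
    return rng.choices(range(len(priorities)), weights=weights, k=batch_size)
```

Setting `alpha=0` makes all weights equal and recovers the uniform sampling of plain experience replay.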