# Report of Project 3: Collaboration and Competition

## Learning algorithm

The algorithm used in this project is a __Deep Deterministic Policy Gradient__ (DDPG) agent, based on [Lillicrap et al.](https://arxiv.org/abs/1509.02971). DDPG is an actor-critic method that borrows many concepts from double DQN, with some modifications as described below.

__The policy (actor)__ in DDPG is a fully-connected neural network that outputs continuous values in the range [-1, 1] for a given state input and is deterministic by nature. Therefore, to encourage exploration, especially at the beginning of training, we add __Gaussian noise__ to the actions generated by the neural network and decay it over time (per episode). That way, as with double DQN, we favor exploration at the beginning of the learning process and exploitation of the learned policy towards the end. We also add __batch normalization__ after the first hidden layer to mitigate the _internal covariate shift_ problem and to stabilize and speed up the learning process.
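
As a minimal sketch of this exploration scheme, the snippet below perturbs a deterministic action with Gaussian noise and clips the result back into [-1, 1]. It assumes the noise is scaled by a decaying factor `epsilon`; the names `noisy_action` and `decay_epsilon` and the standard deviation `sigma=0.2` are hypothetical and not taken from the project code.

```python
import random


def noisy_action(action, epsilon, sigma=0.2, rng=random):
    """Add epsilon-scaled Gaussian noise to each action component, then clip to [-1, 1].

    sigma is an illustrative value (assumption), not a project setting.
    """
    return [max(-1.0, min(1.0, a + epsilon * rng.gauss(0.0, sigma))) for a in action]


def decay_epsilon(epsilon, eps_decay=0.999, eps_end=0.01):
    """Multiplicative per-episode decay with a lower bound."""
    return max(eps_end, epsilon * eps_decay)
```

With `epsilon` close to 1 the actions are strongly perturbed; as `epsilon` approaches its floor, the agent acts almost purely according to the learned policy.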

__The action-value function (critic)__ is a fully-connected neural network similar to the one in [Lillicrap et al.](https://arxiv.org/abs/1509.02971), but it differs in its layer dimensions $(256, 256, 128)$ and uses Leaky-ReLU activations. This architecture was inspired by the [ddpg-bipedal](https://github.com/udacity/deep-reinforcement-learning/blob/master/ddpg-bipedal/model.py) example provided by the authors of Udacity's Deep Reinforcement Learning Nanodegree program. For this network, too, we added batch normalization to stabilize and speed up convergence.

We use the same initialization scheme as proposed by [Lillicrap et al.](https://arxiv.org/abs/1509.02971), i.e. initializing the final layers of both the actor and critic with a uniform distribution $[-3 \times 10^{-3}, 3 \times 10^{-3}]$. The upstream layers were initialized with uniform distributions $[-\frac{1}{\sqrt{f}}, \frac{1}{\sqrt{f}}]$, where $f$ is the fan-in of the respective layer.
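
To make the two ranges concrete, here is a small sketch in plain Python that computes the fan-in bound and draws a weight matrix from the corresponding uniform distribution. The helper names `hidden_init_bound` and `init_uniform` are hypothetical and only illustrate the scheme; they are not the project's actual functions.

```python
import math
import random

FINAL_LAYER_BOUND = 3e-3  # fixed bound for the final layers of actor and critic


def hidden_init_bound(fan_in):
    """Bound 1 / sqrt(f) for a hidden layer with fan-in f."""
    return 1.0 / math.sqrt(fan_in)


def init_uniform(fan_in, fan_out, bound, rng=random):
    """Draw a (fan_out x fan_in) weight matrix uniformly from [-bound, bound]."""
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]


# Example: a hidden layer with 256 inputs uses the range [-1/16, 1/16].
hidden_weights = init_uniform(256, 128, hidden_init_bound(256))
final_weights = init_uniform(128, 1, FINAL_LAYER_BOUND)
```

The wider a layer's input, the narrower its initial weight range, which keeps the scale of the pre-activations roughly independent of the layer width.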

As with the double DQN case, each experience is stored in a fixed-size __experience replay buffer__ (a double-ended queue), from which the algorithm samples batches _uniformly at random_ to fit the neural networks. This breaks the temporal correlation between consecutive experiences and yields approximately i.i.d. samples, as expected by gradient-based __optimizers__ such as __Adam__, which is used here. Note that, in contrast to the learning rates suggested by [Lillicrap et al.](https://arxiv.org/abs/1509.02971), we use a learning rate of $0.001$ for both the actor's and the critic's optimizer instances.
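
A replay buffer of this kind can be sketched in a few lines using `collections.deque`. The class and field names below are illustrative assumptions rather than the project's actual code.

```python
import random
from collections import deque, namedtuple

# One transition as stored in the buffer (field names are an assumption).
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped once it is full."""

    def __init__(self, buffer_size, batch_size, seed=None):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw a batch uniformly at random, without replacement."""
        return self.rng.sample(self.memory, self.batch_size)

    def __len__(self):
        return len(self.memory)
```

Sampling is only meaningful once the buffer holds at least `batch_size` experiences, so a training loop would typically check `len(buffer) >= batch_size` before each update.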

__The model-fitting process__ (i.e. the update of the neural networks' weights) is performed on every 4th step of the agent, using a sampled batch of experiences. The process is similar to double DQN but involves four neural networks: two online networks for actor and critic, which are continuously updated, and two target networks, which are used to generate the next step's actions and Q-values, respectively.

The critic is optimized by regressing the online Q-value estimate towards a target built from the reward and the target networks' estimates of the next state's action and Q-value. For the loss function, `SmoothL1Loss` (a.k.a. Huber loss) was chosen, which is more robust to extreme values: it mitigates very large gradients that can lead to oscillations and thus slow convergence, especially at the beginning of the learning process. The objective is shown here in its squared-error form; the Huber loss replaces the square with a term that grows only linearly for large errors:

$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s')\sim U(D)} \Big[\big(r+\gamma Q(s', \mu(s'; \phi^-);\theta^-) - Q(s,a;\theta_i)\big)^2\Big]$$
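
For a single transition, the target inside the expectation and the Huber penalty can be sketched as follows. The function names are hypothetical, and the `done` mask, which drops the bootstrap term at the end of an episode, is a common practice assumed here; it does not appear in the formula above.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """r + gamma * Q_target(s', mu_target(s')), without bootstrapping on terminal steps.

    The (1 - done) mask is an assumption, not part of the formula in the text.
    """
    return reward + gamma * next_q * (1.0 - done)


def huber(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error * error
    return delta * (abs_err - 0.5 * delta)


# A TD error of 3.0 costs 2.5 under the Huber loss, versus 9.0 when squared.
loss_small = huber(0.5)
loss_large = huber(3.0)
```

Because the penalty is linear for large errors, its gradient with respect to the error is bounded by `delta`, which is what limits the size of the updates early in training.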

The policy is optimized by maximizing the expected value of the Q-network (critic), evaluated at the state and the action the policy selects for that state:

$$J_i(\phi_i) = \mathbb{E}_{s\sim U(D)} \Big[ Q(s, \mu(s; \phi_i); \theta_i)\Big]$$

The final step in the model-fitting process is to "soft-update" the target networks' weights. Instead of overwriting the target-network weights with the online-network weights every N time steps, we use __Polyak averaging__, which updates the target weights more often by moving them a small fraction $\tau$ towards the online weights:

$$\theta_i^- = \tau \theta_i + (1-\tau)\theta_i^-$$
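
The update rule can be illustrated on plain lists of weights. This is a minimal sketch; the function name `soft_update` and the example weights are assumptions made for illustration.

```python
def soft_update(online_weights, target_weights, tau=0.1):
    """Return target weights moved a fraction tau towards the online weights."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]


online = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(10):
    target = soft_update(online, target, tau=0.1)
# With fixed online weights, the remaining gap after n updates
# is (1 - tau) ** n times the initial gap.
```

With $\tau = 0.1$, about 35% of the initial gap remains after ten updates, so the target networks trail the online networks smoothly instead of jumping to them.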

Completing this fairly concise explanation of the algorithm, the following table provides a listing of hyper-parameters used for the final results presented in the next section:

Parameter Name | Value | Description
:--- | --- | :---
actor_hidden_layer_dimensions | (128, 64) | hidden layer dimensions of the policy network.
critic_hidden_layer_dimensions | (512, 256, 128) | hidden layer dimensions of Q-network.
activation_fn (critic)| Leaky-ReLU | the activation function used for the Q-network hidden layers.
activation_fn (actor) | ReLU | the activation function used for the policy network.
buffer_size | 1_000_000 | replay buffer size.
batch_size | 1024 | mini-batch size.
gamma | 0.99 | discount factor.
tau | 0.1 | interpolation parameter for target-network weight update.
lr_actor | 1e-4 | learning rate of the policy network.
lr_critic | 1e-4 | learning rate of the Q-value network.
update_every | 2 | perform optimization every N steps.
n_episodes | 5000 | maximum number of training episodes.
max_t | 1000 |  maximum number of time steps per episode.
eps_start | 1.0 | starting value of epsilon, for epsilon-greedy action selection.
eps_end | 0.01 | minimum value of epsilon.
eps_decay | 0.999 | multiplicative factor (per episode) for decreasing epsilon.
scores_window_length | 100 | length of scores window to monitor convergence.
average_target_score | 0.5 | average target score for scores_window_length at which learning stops.
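
To get a feeling for the epsilon schedule defined by `eps_start`, `eps_end` and `eps_decay`, the following sketch evaluates it in closed form. The function names are hypothetical; the only assumption is that epsilon is multiplied by `eps_decay` once per episode and floored at `eps_end`, as the table describes.

```python
import math


def epsilon_at(episode, eps_start=1.0, eps_end=0.01, eps_decay=0.999):
    """Epsilon after `episode` per-episode decay steps, floored at eps_end."""
    return max(eps_end, eps_start * eps_decay ** episode)


def episodes_until_floor(eps_start=1.0, eps_end=0.01, eps_decay=0.999):
    """Smallest number of decay steps after which epsilon has reached eps_end."""
    return math.ceil(math.log(eps_end / eps_start) / math.log(eps_decay))


floor_episode = episodes_until_floor()
```

With the values from the table, the floor of 0.01 is reached only after 4603 episodes, so the schedule keeps decaying for almost the whole budget of 5000 episodes.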

## Results

The environment was __solved in 2599 episodes__ using a local CPU environment:

* OS: macOS Big Sur (Version 11.4)
* Processors: 3.1 GHz Quad-Core Intel Core i7
* Memory: 16 GB 2133 MHz LPDDR3

![Scores](scores.png "Scores")
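
The stopping criterion implied by `scores_window_length` and `average_target_score` can be sketched as a moving average over the most recent episode scores. The function name is hypothetical, and whether the reported episode count refers to the end or the start of the qualifying window is not stated in this report; the sketch returns the episode at which the window ends.

```python
from collections import deque


def first_solved_episode(scores, window_length=100, target=0.5):
    """Return the first (1-based) episode at which the mean of the last
    `window_length` scores reaches `target`, or None if that never happens."""
    window = deque(maxlen=window_length)
    for episode, score in enumerate(scores, start=1):
        window.append(score)
        if len(window) == window_length and sum(window) / window_length >= target:
            return episode
    return None
```

Averaging over a window smooths out the large episode-to-episode variance of the scores, so a single lucky episode does not stop training prematurely.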

## Ideas for future work

From a modeling point of view, there are many ways to improve the algorithms and models even further. An (incomplete) list of possible future improvements and extensions:

* Try Bayesian variants of the given neural network architectures (e.g. Variational Inference or Bayesian approximation using Monte-Carlo Dropout), which could possibly improve decision making by incorporating (un)certainty based on posterior-predictive distributions
* Utilize hyper-parameter optimization frameworks (hyperopt or scikit-optimize) rather than manual trial-and-error to further optimize the agent's settings

## Additional References

* [Lowe et al., "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments", arXiv (2020)](https://arxiv.org/abs/1706.02275)