# Project 3 - Tennis
Written by: [Anthony Vergottis](http://github.com/anthonyvergottis) 

In this environment, two agents each control a racket, with the goal of hitting the ball over the net. If an agent hits the ball over the net, it receives a reward of 0.1; if it lets the ball hit the ground, it receives a reward of -0.01. The objective for each agent is therefore to keep the ball in play.

Each agent's observation consists of 8 variables, corresponding to the position and velocity of both the ball and the racket. Neither agent is aware of the existence of the other. The action space for each agent consists of two continuous values: moving towards or away from the net, and moving up or down. The environment is set up to return three stacked observations, which results in an overall state observation of size 24 for each agent.

The task is episodic. The goal is to reach an average score greater than or equal to 0.5 over 100 consecutive episodes, where the score of each episode is the maximum of the two agents' scores.
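
The solving criterion can be sketched as follows. This is a minimal illustration and not the project code: the names `episode_score` and `ScoreTracker` are hypothetical, and the choice to require a full window of 100 episodes before declaring the task solved is an assumption.

```python
from collections import deque

TARGET_SCORE = 0.5   # threshold stated in the text
WINDOW = 100         # number of consecutive episodes that are averaged


def episode_score(agent_returns):
    """Score of one episode: the larger of the two agents' summed rewards."""
    return max(agent_returns)


class ScoreTracker:
    """Keeps the most recent WINDOW episode scores (hypothetical helper)."""

    def __init__(self, window=WINDOW):
        self.scores = deque(maxlen=window)

    def add(self, agent_returns):
        self.scores.append(episode_score(agent_returns))

    def average(self):
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def solved(self):
        # Assumption: only count as solved once a full window is available.
        full = len(self.scores) == self.scores.maxlen
        return full and self.average() >= TARGET_SCORE
```

In a training loop, `add` would be called once per episode with the two agents' undiscounted returns, and training could stop as soon as `solved()` returns `True`.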

There are multiple ways in which this problem can be solved. Unfortunately, I was not able to get the MADDPG algorithm to work: it simply would not learn, and I could not determine why.

Instead, the same approach was used as in the second project (the Reacher arm), extended to work with two agents. A single actor-critic network was initialised and trained on the experience gathered by both agents, which added their experience to the same replay buffer. The network was trained 10 times every four time steps.
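
The shared buffer and the update schedule can be sketched roughly as below. This is an illustration under assumptions, not the project code: `SharedReplayBuffer`, `step`, `LEARN_EVERY`, `LEARN_NUM`, and `learn_fn` are hypothetical names, and waiting until the buffer holds at least one full minibatch before learning is an assumed detail.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", "state action reward next_state done")

LEARN_EVERY = 4    # time steps between learning phases (from the text)
LEARN_NUM = 10     # updates per learning phase (from the text)
BATCH_SIZE = 256   # minibatch size listed in the hyperparameters


class SharedReplayBuffer:
    """One buffer that receives the experience of both agents."""

    def __init__(self, capacity=int(1e6), seed=0):
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size=BATCH_SIZE):
        # Uniform sampling without replacement from the stored experience.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def step(buffer, t, states, actions, rewards, next_states, dones, learn_fn):
    """Store one transition per agent, then learn on the fixed schedule.

    `learn_fn` stands in for the DDPG update. Returns the number of
    updates performed at this time step.
    """
    for s, a, r, ns, d in zip(states, actions, rewards, next_states, dones):
        buffer.add(s, a, r, ns, d)

    updates = 0
    # Assumption: learning starts once a full minibatch can be sampled.
    if t % LEARN_EVERY == 0 and len(buffer) >= BATCH_SIZE:
        for _ in range(LEARN_NUM):
            learn_fn(buffer.sample())
            updates += 1
    return updates
```

Because both agents write into the same buffer, each minibatch mixes transitions from both sides of the net, and a single set of actor and critic weights is updated from all of them.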

The report is split into three parts:

1. **Learning algorithm**
2. **Plot of rewards**
3. **Ideas for future work**

# Learning algorithm

The learning algorithm used to solve this problem is the one described in this [paper](https://arxiv.org/abs/1509.02971). It is the deep deterministic policy gradient (DDPG) algorithm, which is capable of working with continuous action spaces. For this project it was extended to work with two agents.
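
For reference, the critic target, critic loss, and actor gradient of DDPG can be written as below. The notation follows the cited paper and is not taken from this project's code: $\theta^{Q}$ and $\theta^{\mu}$ are the critic and actor weights, primes denote the target networks, and $N$ is the minibatch size.

```latex
\begin{align}
y_i &= r_i + \gamma \, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right) \\
L &= \frac{1}{N} \sum_i \left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^2 \\
\nabla_{\theta^{\mu}} J &\approx \frac{1}{N} \sum_i
  \nabla_a Q(s, a \mid \theta^{Q})\big|_{s = s_i,\, a = \mu(s_i)} \,
  \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_i}
\end{align}
```

The critic is trained by minimising $L$, and the actor is updated along the sampled policy gradient in the last line.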



### Hyperparameters

The hyperparameter values are named as they are found in the code:

    - BUFFER_SIZE = int(1e6)  # replay buffer size
    - BATCH_SIZE = 256        # minibatch size
    - GAMMA = 0.99            # discount factor
    - TAU = 0.2               # for soft update of target parameters
    - LR_ACTOR = 1e-4         # learning rate actor
    - LR_CRITIC = 1e-3        # learning rate critic
    - n_episodes = 2000       # Limit of number of episodes to run
    - max_t = 2000            # Max No. of time steps per episode
    - WEIGHT_DECAY = 0        # L2 weight decay

Some changes were made relative to the original paper in order to improve the performance of the algorithm.

1. Instead of using 400 and 300 units in the first and second layer of the network, a new network architecture of 128 units in both layers was used.

2. Batch normalization was used after the first layer in both the actor and critic networks.

3. The L2 weight decay for the critic network was set to 0.

4. The gradients were clipped in the critic network using `torch.nn.utils.clip_grad_norm_(self.critic_local.parameters(), 1)`. This suggestion was found in the benchmark implementation.

5. Instead of a uniform distribution, a normal distribution was used to draw the random samples for the Ornstein-Uhlenbeck process. This yielded much better results.

6. One of my colleagues suggested increasing the value of tau to 0.2, as it would result in faster learning.
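
Items 5 and 6 can be illustrated with a small, dependency-free sketch. It is not the project code: the class and function names are hypothetical, plain Python lists stand in for tensors, and the `theta` and `sigma` defaults are the values reported in the DDPG paper rather than values confirmed for this project.

```python
import random

TAU = 0.2  # value listed in the hyperparameters


class OUNoise:
    """Ornstein-Uhlenbeck process driven by normally distributed samples."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        # theta and sigma are the DDPG paper's values (assumed here).
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Return the internal state to the mean."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process by one step and return the new state."""
        # A uniform variant would call self.rng.random() here instead of
        # self.rng.gauss(0.0, 1.0).
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


def soft_update(local_params, target_params, tau=TAU):
    """Blend local weights into target weights: tau*local + (1 - tau)*target."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]
```

With `tau = 0.2`, each soft update moves the target weights 20% of the way towards the local weights, so the targets track the learned network far more quickly than with the value of 0.001 used in the original paper.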



# Plot of rewards

The DDPG agent solved the environment in 664 episodes, with an average score of 0.5.
![ddpg_scoress.png](images/ddpg_scoress.png)




# Ideas for future work

1. Implement the PPO algorithm, since the DDPG algorithm took rather long to solve the environment.
2. Try using prioritized replay memory
3. Try adding noise to the policy parameters
4. Experiment with different network weight initialisation
5. Use Leaky ReLU activations in the network rather than ReLU
6. Further hyperparameter tuning. The slow training of the DDPG algorithm in this case did not allow for a proper exploration of the hyperparameter values.
7. The results appear to be rather noisy. Further exploration to find out the cause would be beneficial.