# Project 2 - Reacher Arm
Written by: [Anthony Vergottis](http://github.com/anthonyvergottis) 

In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocity of the arm. Each action is a vector of four numbers, corresponding to the torques applied to the two joints. Every entry in the action vector should be a number between -1 and 1. The simulation environment is provided by Unity ML-Agents.
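
The bound on the action vector can be enforced with a simple clip before the action is sent to the environment. The snippet below is a minimal sketch using NumPy; the helper name `clip_action` is hypothetical and not taken from the project's code.

```python
import numpy as np


def clip_action(raw_action, low=-1.0, high=1.0):
    """Force every entry of the action vector into [low, high]."""
    return np.clip(np.asarray(raw_action, dtype=float), low, high)


# A raw 4-entry action in which two entries fall outside the allowed range.
action = clip_action([0.3, -1.7, 1.2, 0.0])
print(action)
```

Clipping is typically needed once exploration noise is added to the policy output, since the sum can leave the valid range.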

The single-agent version of the environment was used.

The report is split into three parts:

1. **Learning algorithm**
2. **Plot of rewards**
3. **Ideas for future work**

# Learning algorithm

The learning algorithm used to solve this problem is deep deterministic policy gradient (DDPG), as described in this [paper](https://arxiv.org/abs/1509.02971). DDPG is an actor-critic method that works with continuous action spaces.
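
For reference, the core updates of DDPG as given in the cited paper can be summarised as follows. Here $Q$ is the critic with parameters $\theta^{Q}$, $\mu$ is the actor with parameters $\theta^{\mu}$, primed symbols denote the target networks, and $N$ is the minibatch size. This is a summary of the paper's algorithm, not a description of this project's exact code.

$$
y_i = r_i + \gamma \, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)
$$

$$
L = \frac{1}{N} \sum_i \big(y_i - Q(s_i, a_i \mid \theta^{Q})\big)^2
$$

$$
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^{Q})\big|_{s = s_i,\, a = \mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_i}
$$

$$
\theta' \leftarrow \tau \theta + (1 - \tau)\, \theta'
$$

The critic is trained by minimising $L$, the actor follows the sampled policy gradient, and both target networks slowly track their learned counterparts through the last update.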



### Hyperparameters

The hyperparameter values are named as they are found in the code:

    BUFFER_SIZE = int(1e5)  # replay buffer size
    BATCH_SIZE = 128        # minibatch size
    GAMMA = 0.99            # discount factor
    TAU = 1e-3              # for soft update of target parameters
    LR_ACTOR = 1e-4         # learning rate actor
    LR_CRITIC = 1e-3        # learning rate critic
    n_episodes = 2000       # limit on the number of episodes to run
    max_t = 1000            # max number of time steps per episode
    WEIGHT_DECAY = 0        # L2 weight decay
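
To illustrate where `GAMMA` and `TAU` enter the algorithm, here is a minimal sketch in plain Python. The function names `td_target` and `soft_update`, and the `done` flag that cuts off bootstrapping at the end of an episode, are assumptions made for illustration; the project's code operates on PyTorch tensors rather than Python floats.

```python
GAMMA = 0.99  # discount factor, as listed above
TAU = 1e-3    # soft-update rate, as listed above


def td_target(reward, next_q, done, gamma=GAMMA):
    """Critic target: r + gamma * Q'(s', a'), with no bootstrap when done == 1."""
    return reward + gamma * next_q * (1.0 - done)


def soft_update(local_params, target_params, tau=TAU):
    """Move each target parameter a small step towards its local counterpart."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


# One reward of +0.1 with a target-critic estimate of 2.0 for the next state.
print(td_target(reward=0.1, next_q=2.0, done=0.0))

# Target parameters starting at zero drift slowly towards the local ones.
print(soft_update([1.0, -1.0], [0.0, 0.0]))
```

With `TAU = 1e-3`, the target networks absorb only 0.1% of the local weights per update, which keeps the critic's targets changing slowly.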

### Improvements 

Several changes were made relative to the original paper in order to improve the performance of the algorithm.

1. Instead of 400 and 300 units in the first and second hidden layers of the networks, 128 units were used in both layers.

2. Batch normalization was used after the first layer in both the actor and critic networks.

3. The L2 weight decay for the critic network was set to 0.

4. The gradients of the critic network were clipped using `torch.nn.utils.clip_grad_norm_(self.critic_local.parameters(), 1)`. This suggestion was found in the benchmark implementation.

5. The random term of the Ornstein-Uhlenbeck process was sampled from a normal distribution instead of a uniform distribution. This yielded much better results.
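
The changes listed above could look roughly like the PyTorch sketch below. It is an illustration under stated assumptions rather than the project's actual code: the names `Actor`, `clipped_step`, and `OUNoise` are hypothetical, the `tanh` output layer is assumed as a way to keep actions in [-1, 1], and the noise parameters `theta=0.15` and `sigma=0.2` are the values reported in the DDPG paper, not values confirmed for this project.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Policy network mapping a 33-dim state to a 4-dim action."""

    def __init__(self, state_size=33, action_size=4, units=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, units)
        self.bn1 = nn.BatchNorm1d(units)  # batch norm after the first layer
        self.fc2 = nn.Linear(units, units)
        self.fc3 = nn.Linear(units, action_size)

    def forward(self, state):
        x = F.relu(self.bn1(self.fc1(state)))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # assumption: tanh bounds the actions


def clipped_step(network, optimizer, loss, max_norm=1.0):
    """One optimiser step with the total gradient norm clipped to max_norm."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(network.parameters(), max_norm)
    optimizer.step()


class OUNoise:
    """Ornstein-Uhlenbeck process driven by standard normal samples."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta, self.sigma = theta, sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Return the internal state to the mean."""
        self.state = self.mu.copy()

    def sample(self):
        """Advance the process by one step and return the new state."""
        normal = self.rng.standard_normal(self.state.shape)
        dx = self.theta * (self.mu - self.state) + self.sigma * normal
        self.state = self.state + dx
        return self.state


# Usage on random data: a batch of 8 states, then one noisy, clipped action.
actor = Actor()
actions = actor(torch.randn(8, 33))
noise = OUNoise(size=4)
noisy_action = np.clip(actions[0].detach().numpy() + noise.sample(), -1.0, 1.0)
```

Note that `BatchNorm1d` in training mode needs a batch of more than one state, so a single state should be passed through the actor in `eval()` mode when acting in the environment.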



# Plot of rewards

The DDPG agent solved the environment in 449 episodes, with an average score of 13.03.

![ddpg_scores.png](images/ddpg_scores.png)




# Ideas for future work

1. Implement the PPO algorithm, since DDPG took rather long to solve the environment.
2. Try using prioritized experience replay.
3. Try adding noise to the policy parameters.
4. Experiment with different network weight initialisations.
5. Use Leaky ReLU activations in the networks rather than ReLU.
6. Tune the hyperparameters further. DDPG trained slowly in this case, which did not allow for a proper exploration of the hyperparameter space.
7. The results appear to be rather noisy. Further investigation to find the cause would be beneficial.