# COLLABORATION AND COMPETITION - TENNIS
_Project for Udacity Deep Reinforcement Learning Nanodegree_
---

In [1]:
from unityagents import UnityEnvironment
import numpy as np

# Project Overview

## Agent overview
This code can run either a __DDPG (Deep Deterministic Policy Gradient)__ or a __D4PG (Distributed Distributional Deep Deterministic Policy Gradient)__ agent in either the single arm or multi arm environments.

### Deep Deterministic Policy Gradient Agent

DDPG uses an __Actor Critic__ architecture to estimate the best action for the current observation of the state. The __Actor__ is a policy-based agent that uses a neural network to directly estimate the best policy, while the __Critic__ is a value-based agent that estimates the value of an action in a given state. That value is represented by a _Q value_, computed from the current observation of the state and the action output by the Actor.

#### DDPG Learning Process
The DDPG agent is an off-policy agent that uses a __Replay Buffer__ to increase its sample efficiency: experiences (SARS' tuples) are stored in the buffer and sampled in batches to learn from. This allows each experience to be used several times before being discarded, so the agent can learn the dynamics of the environment from fewer observations.
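
To make the store-and-sample idea concrete, here is a minimal sketch of a uniform replay buffer. The names `Experience` and `SimpleReplayBuffer` are hypothetical and are not the `NStepReplayBuffer` used later in this notebook; the sketch only illustrates how old experiences are dropped and how batches are drawn.

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one SARS' tuple (plus a done flag).
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class SimpleReplayBuffer:
    """Fixed-size buffer that stores experiences and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest experiences are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement inside one batch;
        # the same experience can still be drawn again in later batches.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = SimpleReplayBuffer(buffer_size=5, batch_size=3)
for t in range(8):
    buffer.add(state=t, action=0.0, reward=0.1, next_state=t + 1, done=False)
batch = buffer.sample()
```

Because the buffer holds only 5 entries, the first three experiences have already been discarded by the time the batch is sampled.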

The DDPG agent consists of both __Active and Target__ networks. This is drawn from the concept of __Fixed Q Targets__ introduced with the __DQN (Deep Q Network)__, which stabilizes learning by slowing the rate of change of the target networks that the active network is working toward. DDPG takes this further by keeping target networks for both the Actor and the Critic and using them to train the Active Critic network.

The Active Critic learns by calculating the __Mean Squared Error Loss__ between two quantities: its value estimate for the action proposed by the Actor for the current observation of the state, and the sum of the current reward and the discounted value estimate by the Target Critic for the action proposed by the Target Actor for the observation of the next state. Once the MSE Loss is calculated, it is backpropagated through the Active Critic network to train it using __Gradient Descent__.

The gradients of the Active Critic are clipped to prevent large updates that would destabilize the neural network.
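
The text does not say which kind of clipping is used, so the following is only a sketch of one common choice: rescaling all gradients so that their combined L2 norm stays below a threshold. The function name `clip_by_global_norm` and the threshold of 1.0 are assumptions for illustration.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))  # never scale gradients up
    return [g * scale for g in grads], total_norm


grads = [np.array([3.0, 4.0]), np.array([12.0])]  # combined norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
```

Gradients that are already small pass through unchanged, so clipping only intervenes on the occasional large update.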

The Active Actor is then updated using the negative of the value predicted by the Active Critic as its loss. Backpropagating this loss amounts to __Gradient Ascent__ on the predicted value, which maximizes the score because the value is a representation of the score.
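
The learning process above can be summarized in equations. The notation is an assumption introduced here: $Q_\theta$ and $\mu_\phi$ are the Active Critic and Actor, $Q_{\theta'}$ and $\mu_{\phi'}$ are their Target copies, $\gamma$ is the discount factor, $d_t$ is 1 when the episode ended at step $t$ (a terminal mask the text does not mention), and $B$ is the batch size.

$$
y_t = r_t + \gamma\,(1 - d_t)\, Q_{\theta'}\big(s_{t+1},\, \mu_{\phi'}(s_{t+1})\big)
$$

$$
L_{\text{critic}}(\theta) = \frac{1}{B}\sum_{t}\big(Q_\theta(s_t, a_t) - y_t\big)^2,
\qquad
L_{\text{actor}}(\phi) = -\frac{1}{B}\sum_{t} Q_\theta\big(s_t,\, \mu_\phi(s_t)\big)
$$

Minimizing $L_{\text{actor}}$ with gradient descent is the same as performing gradient ascent on the Critic's value estimate.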


### Distributed Distributional Deep Deterministic Policy Gradient Agent
D4PG is an evolution of the DDPG architecture introduced by __Google DeepMind__. The D4PG paper introduces several key concepts that improve performance over the traditional DDPG architecture. These include:

- N-Step Rollouts
- Prioritized Experience Replay (Not Implemented)
- K-Actor Distributed Training 
- Distributional Value Estimation

_For simplicity, observation of state will be referred to as state._

#### N-Step Rollouts
N-step rollouts are not a new concept, but they are incorporated here to give a better estimate of the value of the state n steps from the current state. N-step learning is the middle ground between __Monte Carlo Learning__, where the agent learns at the end of every trajectory, and __Temporal Difference Learning__, where the agent learns at every timestep. The rollout length N controls how far ahead the agent looks. In this implementation, the __N Step Replay Buffer__ can switch between the standard single experience mode and N step mode. In N step mode, the agent keeps track of the experiences within a rollout of a fixed length; traditionally 5 is used. Once the rollout length is reached, the agent converts the set of experiences into a single experience by changing the next state to the state at the N'th step and changing the reward to the initial reward plus the discounted rewards up to the N'th state.
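
The conversion of a rollout into a single experience can be sketched as follows. This is not the code of `NStepReplayBuffer`; the function `collapse_rollout` and the tuple layout `(state, action, reward, next_state, done)` are assumptions made for the example.

```python
from collections import deque


def collapse_rollout(rollout, gamma):
    """Turn a list of (state, action, reward, next_state, done) steps into one experience."""
    state, action = rollout[0][0], rollout[0][1]
    n_step_reward = 0.0
    for k, (_, _, reward, next_state, done) in enumerate(rollout):
        n_step_reward += (gamma ** k) * reward  # reward k steps ahead is discounted k times
        if done:  # stop accumulating once the episode has ended
            break
    return state, action, n_step_reward, next_state, done


rollout_length, gamma = 3, 0.99
window = deque(maxlen=rollout_length)  # sliding window over the most recent steps
collapsed = []
for t in range(5):
    window.append((t, 0.0, 1.0, t + 1, False))
    if len(window) == rollout_length:
        collapsed.append(collapse_rollout(list(window), gamma))
```

When the Critic later bootstraps from such an experience, the value of the N'th state would be discounted by `gamma ** rollout_length` rather than by `gamma`.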

#### K-Actor Distributed Training
K Actors are used to gather experience and fill the replay buffer. Gathering experience in parallel is faster and yields a more diverse set of experiences, which speeds up learning.

#### Prioritized Replay
This was not implemented as the agent was easily able to beat the environment. This and an implementation of __Hindsight Experience Replay__ would be really interesting to evaluate.

#### Distributional Value Estimation
This allows the Critic to estimate the value more accurately by predicting a probability distribution over possible values rather than a single number. The agent can then represent that there is an X% chance of one value but also a Y% chance of another. This gives the Critic a wider perception of the possibilities, thereby making the Critic perform better. Since the loss used to update the Actor is the negative of the value estimated by the Critic, improving the Critic directly improves the Actor, resulting in better performance of the agent.
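
With a distributional Critic, the learning target is itself a distribution. The sketch below assumes the categorical form (a fixed set of atoms between `v_min` and `v_max`) and shows how a target distribution shifted by the reward and discount can be projected back onto those atoms. The function `project_distribution` and the toy numbers are illustrative assumptions, not this project's code.

```python
import numpy as np


def project_distribution(next_probs, rewards, dones, gamma, v_min, v_max, num_atoms):
    """Project the shifted distribution r + gamma * z back onto the fixed atoms."""
    batch_size = next_probs.shape[0]
    atoms = np.linspace(v_min, v_max, num_atoms)
    delta_z = (v_max - v_min) / (num_atoms - 1)

    # Bellman-shifted atom locations, clipped to the supported value range.
    tz = rewards[:, None] + gamma * (1.0 - dones[:, None]) * atoms[None, :]
    tz = np.clip(tz, v_min, v_max)

    # Fractional index of each shifted atom on the fixed support.
    b = np.clip((tz - v_min) / delta_z, 0, num_atoms - 1)
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    projected = np.zeros((batch_size, num_atoms))
    rows = np.arange(batch_size)[:, None].repeat(num_atoms, axis=1)
    # Split each probability mass between the two neighbouring atoms.
    np.add.at(projected, (rows, lower), next_probs * (upper - b))
    np.add.at(projected, (rows, upper), next_probs * (b - lower))
    # If b lands exactly on an atom, both weights above are zero, so add the mass here.
    np.add.at(projected, (rows, lower), next_probs * (lower == upper))
    return projected


num_atoms, v_min, v_max = 11, 0.0, 10.0
next_probs = np.zeros((2, num_atoms))
next_probs[:, :6] = 1.0 / 6.0  # toy next-state distribution
rewards = np.array([0.1, 0.1])
dones = np.array([0.0, 1.0])  # second transition is terminal
target = project_distribution(next_probs, rewards, dones, 0.9, v_min, v_max, num_atoms)
expected = target @ np.linspace(v_min, v_max, num_atoms)
```

Each row of `target` still sums to 1, and for the terminal transition all of the mass collapses around the reward itself.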


### Implemented Tweaks

#### Reward Hacking
I implemented reward hacking to adjust the rewards for better performance, based on a reward hypothesis that I personally felt was better suited to the environment. These hacked rewards are not used in the calculation of the scores, only in the creation of experience tuples, so they only affect learning.

Currently the agent receives a reward of 0.1 for hitting the ball over the net and a reward of -0.01 for hitting the ball out of bounds or letting it drop. I modified the rewards with a thin wrapper that partially duplicates them: the agent at play receives its own reward, and the other agent additionally receives 10% of that reward. This percentage is controlled by the `alternative_reward_scalar` variable. The idea is that sharing rewards encourages each agent to collaborate by providing a good ball to the other agent. The original rewards seem to keep the agents disconnected; I merely linked them here.
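
A minimal sketch of this reward sharing, assuming the rewards arrive as one number per agent. The function name `share_rewards` is hypothetical; only the `alternative_reward_scalar` name and its value of 0.1 come from the configuration.

```python
def share_rewards(rewards, alternative_reward_scalar=0.1):
    """Give each agent its own reward plus a fraction of the other agents' rewards."""
    total = sum(rewards)
    return [r + alternative_reward_scalar * (total - r) for r in rewards]


raw = [0.1, 0.0]  # agent 0 just hit the ball over the net
shaped = share_rewards(raw)
```

Here agent 0 keeps its reward of 0.1 and agent 1 receives roughly 0.01. Only the shaped rewards would go into the experience tuples; the scores would still be computed from `raw`.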

#### Batch Normalization
Batch Normalization normalizes the data at each stage as it moves from a hidden layer's output to the __Activation Function__, regularizing the data and thereby leading to improved and more stable learning. However, this initially wrecked learning when implemented across all layers with __ReLU and Tanh Activations__. Removing the Batch Normalization layers before the ReLU activations and keeping the Batch Normalization before the Tanh activations resulted in much more stable learning, even outperforming the other Batch Normalization configurations I tried. I believe this is because Tanh, unlike ReLU, squashes the majority of its outputs into a small range and so benefits much more from normalized inputs.

#### Delayed Hard Target Network Updates
Traditional soft updates blend the parameters of the Active networks into the Target networks at every step, scaled down by a factor of __TAU__. Instead, the Target networks here were updated every x timesteps with a hard update from the Active networks. In my experience, this seemed to stabilize learning further.
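
The difference between the two update styles can be sketched with plain NumPy arrays standing in for network weights. The helper names `soft_update` and `hard_update` are assumptions for this example, and the loop uses a fake gradient step.

```python
import numpy as np


def soft_update(active, target, tau):
    """Blend a small fraction tau of the active weights into the target weights."""
    return {k: tau * active[k] + (1.0 - tau) * target[k] for k in active}


def hard_update(active):
    """Copy the active weights into the target network unchanged."""
    return {k: active[k].copy() for k in active}


update_target_every = 350  # value listed in the hyperparameter section
active = {"w": np.ones(3)}
target = {"w": np.zeros(3)}
hard_updates = 0
for step in range(1, 1001):
    active["w"] = active["w"] + 0.01  # stand-in for one gradient step
    if step % update_target_every == 0:
        target = hard_update(active)
        hard_updates += 1

blended = soft_update({"w": np.ones(3)}, {"w": np.zeros(3)}, tau=0.1)
```

Between hard updates the target stays completely fixed, which is what gives the Critic a stationary goal for 350 steps at a time.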

#### Pre - Training 
When the agent is started, it generates experiences using random actions without forward propagating through the network. This fills up the replay buffer quickly with experiences gathered under much broader exploration. This seems to help the agent get through the essential first steps of learning by learning from a diverse set of experiences. Later on, the agent simply adds __Gaussian Noise__ to its actions, which produces a less diverse variety of experiences but ones closer to the direction in which the agent is heading.
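
As a rough sketch of the pre-training fill, assuming actions are drawn uniformly from [-1, 1] (the range of the Actor's Tanh output). The text does not state which distribution is used, and the environment step is left out here, so treat this purely as an illustration.

```python
import numpy as np


def random_actions(num_agents, action_size, rng):
    """Uniform random actions in [-1, 1], matching the range of a Tanh output."""
    return rng.uniform(-1.0, 1.0, size=(num_agents, action_size))


rng = np.random.default_rng(0)
pretrain_length = 100  # 5000 in the configuration; shortened for this example
buffer = []
while len(buffer) < pretrain_length:
    actions = random_actions(num_agents=2, action_size=2, rng=rng)
    # A real loop would step the environment with `actions` and store the full
    # experience tuple; only the actions are stored in this sketch.
    buffer.append(actions)
```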


### Failed Tweaks

#### Scheduled Learning Rates
Scheduling the learning rates to reduce on performance plateaus was tested. This seemed to cause problems when the agent hit a local optimum: the longer the agent is stuck, the more its learning rate is reduced, which hinders it from moving out of the local optimum by pulling its capacity to learn out from under it.

_This could possibly work if the patience factor between updates is increased so that the agent's learning rate is decreased less frequently._
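
The failure mode is easier to see in a small sketch of a reduce-on-plateau rule. The class `ReduceOnPlateau` is a hypothetical stand-in, not the scheduler used in this project; the factor of 0.9 and patience of 10 mirror `lr_reduction_factor` and `lr_patience_factor` in the configuration.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` after `patience` checks without improvement."""

    def __init__(self, lr, factor=0.9, patience=10, min_lr=1e-6):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("-inf")
        self.bad_checks = 0

    def step(self, score):
        if score > self.best:
            self.best = score
            self.bad_checks = 0
        else:
            self.bad_checks += 1
            if self.bad_checks >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_checks = 0
        return self.lr


scheduler = ReduceOnPlateau(lr=1e-4, factor=0.9, patience=10)
for _ in range(31):
    lr = scheduler.step(0.1)  # a score that is stuck on a plateau
```

After 31 checks on the same score the learning rate has already been cut three times, which is exactly the shrinking capacity to escape a plateau described above. A larger `patience` slows this down.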

#### Cyclical Learning Rates
Updating the learning rates periodically by multiplying the base learning rate by the oscillations of a __Cosine Function__ results in the learning rate cycling within a range of values. This seems to cause substantial instability in the network, and the agent sometimes gets thrown off a cliff once it has converged as the learning rate gets too high.

_This could possibly work if the base learning rate is kept lower, preventing the agent from being thrown off a cliff._
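
One way such a cosine cycle could look is sketched below. The function `cyclical_lr` and the scale bounds `min_scale` and `max_scale` are assumptions; the cycle length of 30 mirrors `lr_steps` in the configuration, but the exact formula used in the project is not shown in this notebook.

```python
import math


def cyclical_lr(base_lr, step, lr_steps, min_scale=0.1, max_scale=1.0):
    """Scale base_lr by a cosine wave that completes one cycle every lr_steps steps."""
    wave = 0.5 * (1.0 + math.cos(2.0 * math.pi * step / lr_steps))  # in [0, 1]
    return base_lr * (min_scale + (max_scale - min_scale) * wave)


schedule = [cyclical_lr(1e-4, step, lr_steps=30) for step in range(60)]
```

In this sketch the rate never exceeds `base_lr * max_scale`, so lowering either the base rate or `max_scale` caps the peak of every cycle.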

#### Parameter Noise
This was suggested in OpenAI's paper to encourage exploration; the paper reports that it performs better than both Gaussian and OU Noise, allowing for learning that is in some sense similar to an __Evolutionary Algorithm__. I tried adding small amounts of noise to the network weights / parameters several times, but this seemed to wreck any learning.

_This could possibly work if extremely small amounts of noise are added to the parameters. I find it questionable as it causes substantial instability in the environment._
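
For reference, perturbing the parameters rather than the actions can be sketched like this. The function `perturb_parameters`, the weight names and the noise scale of 0.01 are assumptions for illustration, with plain arrays standing in for the Actor's weights.

```python
import numpy as np


def perturb_parameters(params, stddev, rng):
    """Return a copy of the weights with zero-mean Gaussian noise added to every entry."""
    return {name: w + rng.normal(0.0, stddev, size=w.shape) for name, w in params.items()}


rng = np.random.default_rng(0)
actor_params = {"fc1": np.zeros((24, 400)), "fc2": np.zeros((400, 300))}
noisy_params = perturb_parameters(actor_params, stddev=0.01, rng=rng)
```

The original weights are left untouched, so the perturbed copy can be used only for acting while learning continues on the clean weights.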

#### Delayed Intensive Updates
This idea was taken from the Udacity benchmark implementation. I personally found it to result in greater instability and slower convergence. For testing, I updated the agents 10 times in a row every 20 timesteps.

_Honestly speaking, I don't think there is much point in this tweak._


### The Results 

The scores were tracked using Tensorboard along with the losses of the Actor and the Critic. The agent seems to converge faster than the benchmark implementation and remains stable after convergence with minor improvements over time. The Max Score represents the maximum score out of the two agents.
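
As a small illustration of the Max Score, assuming the per-step rewards are collected in an array with one row per agent (the function `episode_score` and the reward values are made up for the example):

```python
import numpy as np


def episode_score(per_agent_rewards):
    """Sum each agent's rewards over the episode and keep the larger total."""
    return float(np.max(np.sum(per_agent_rewards, axis=1)))


rewards = np.array([[0.1, 0.0, 0.1, -0.01],   # agent 0
                    [0.0, 0.1, 0.0, 0.0]])    # agent 1
score = episode_score(rewards)
```

Here agent 0 totals 0.19 and agent 1 totals 0.10, so the episode is logged with a score of about 0.19.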

#### Tracked by Tensorboard

![TITLE](images/Results.jpg)

### Hyper Parameter Selection

__ACTOR LEARNING RATE / CRITIC LEARNING RATE__ - Tests were run with several learning rates between 0.0001 and 0.0005 for the Actor and between 0.0005 and 0.001 for the Critic. D4PG seems to be quite robust, eventually converging either way.

__EPSILON__ - This controls the scale of the Gaussian Noise added to the action, and is therefore the variable used to control exploration of the environment. This was set to 0.3 and not annealed throughout the training.

__PRETRAIN__ - The number of random experiences that must be generated and stored in the replay buffer before the agent starts training with Gradient Descent.

__UPDATE_TARGETS_EVERY__ - This controls how frequently the parameters from the Active networks are used to update the Target networks. This was set to 350.

__ROLLOUT LENGTH__ - This is the rollout length used by the N step replay buffer, which decides how far ahead the agent should look.

__ATOMS__ - This controls the granularity with which the probability distribution is estimated from the Q network. The more atoms there are, the greater the granularity of the distribution. 

__VMIN / VMAX__ - The bounds of the values predicted by the agent. The atoms are spread between these two values; weighting each atom by its probability from the Critic network gives the value estimate.
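
The relationship between ATOMS, VMIN and VMAX can be sketched as follows, using the values from the configuration cell. The softmax step is an assumption about how the Critic's raw output is turned into probabilities, and the all-zero `logits` are a placeholder for a real network output.

```python
import numpy as np

num_atoms, v_min, v_max = 75, 0.0, 0.5          # values from the configuration cell
atoms = np.linspace(v_min, v_max, num_atoms)    # evenly spaced candidate values
logits = np.zeros(num_atoms)                    # placeholder for the Critic's raw output
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax turns logits into probabilities
q_value = float(np.sum(probs * atoms))          # expected value under the distribution
```

With 75 atoms over [0.0, 0.5] the spacing between neighbouring atoms is 0.5 / 74, which sets the resolution of the value estimate.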

### 1. Initialize the Environment 
Run the below section once to initialize the environment. Don't repeat!

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations, so each agent's state vector has length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

Run the code cell below to print some information about the environment.

In [2]:
from utilities import Seeds, initialize_env, get_device

device = get_device()                           # gets gpu if available

environment_params = {
    'no_graphics': False,                       # runs no graphics windows version
    'train_mode': True,                         # runs in train mode
    'offline': True,                            # toggle on for udacity jupyter notebook 
    'device': device
}

env, env_info, states, state_size, action_size, brain_name, num_agents = initialize_env(environment_params)


INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Number of agents: 2
Number of actions: 2
States have length: 24
States initialized: 2
Number of Agents: 2


### 2. Configure the Agent
Configures the respective 'knobs' on the agent to beat the environment. The best configurations from my experiments have already been set.

To load the complete version of the agent, please run as is. If you want to test the agent from scratch, simply change the name in agent params and rerun.

In [3]:
from unityagents import UnityEnvironment
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import torch.nn.functional as F
import torch.nn as nn

from agent import D4PGAgent
from train import train
from memory import NStepReplayBuffer
from noise import OUNoise, GaussianExploration


seedGenerator = Seeds('seeds')
seedGenerator.next()

experience_params = {
    'seed': seedGenerator,                      # seed for the experience replay buffer
    'buffer_size': 300000,                      # size of the replay buffer
    'batch_size': 128,                          # batch size sampled from the replay buffer
    'rollout_length': 15,                        # n step rollout length    
    'agent_count': 2,
    'gamma': 0.99,
    'device': device
}

experienceReplay = NStepReplayBuffer(experience_params)

noise_params = {
    'ou_noise_params': {                        # parameters for the Ornstein Uhlenbeck process
        'mu': 0.,                               # mean
        'theta': 0.15,                          # theta value for the ornstein-uhlenbeck process
        'sigma': 0.2,                           # noise scale (volatility) of the process
        'seed': seedGenerator,                  # seed
        'action_size': action_size  
    },  
    'ge_noise_params': {                        # parameters for the Gaussian Exploration process                   
        'max_epsilon': 0.3,                     
        'min_epsilon': 0.005,   
        'decay_epsilon': True,      
        'patience_episodes': 2,                 # episodes since the last best reward  
        'decay_rate': 0.95                   
    }
}

noise = GaussianExploration(noise_params['ge_noise_params'])

params = {
    'episodes': 2000,                           # number of episodes
    'maxlen': 100,                              # sliding window size of recent scores
    'brain_name': brain_name,                   # the brain name of the unity environment
    'achievement': 0.5,                         # score at which the environment is considered beaten
    'achievement_length': 100,                  # how long the agent needs to get a score above the achievement to solve the environment
    'environment': env,             
    'pretrain': True,                           # whether pretraining with random actions should be done
    'pretrain_length': 5000,                   # minimum experience required in replay buffer to start training 
    'random_fill': False,                       # basically repeat pretrain at specific times to encourage further exploration
    'random_fill_every': 10000,             
    'hack_rewards': True,                       # hack rewards
    'alternative_reward_scalar': 0.1,           # scales other agents rewards to current agent
    'log_dir': 'runs/',
    'load_agent': True,
    'save_every': 1000,                         # save every x episodes
    'agent_params': {
        'name': 'D4PG - High Rollout',
        'd4pg': True,
        'experience_replay': experienceReplay,
        'device': device,
        'seed': seedGenerator,
        'num_agents': num_agents,               # number of agents in the environment
        'gamma': 0.99,                          # discount factor
        'tau': 0.0001,                          # mixing rate soft-update of target parameters
        'update_target_every': 350,             # update the target network every n-th step
        'update_every': 1,                      # update the active network every n-th step
        'actor_update_every_multiplier': 1,     # update actor every x timestep multiples of the critic, critic needs time to adapt to new actor
        'update_intensity': 1,                  # learns from the same experiences several times
        'update_target_type': 'hard',           # should the update be soft at every time step or hard at every x timesteps
        'add_noise': True,                      # add noise using 'noise_params'
        'schedule_lr': False,                   # schedule learning rates 
        'lr_steps': 30,                         # step iterations to cycle lr using cosine
        'lr_reset_every': 5000,                 # steps learning rate   
        'lr_reduction_factor': 0.9,             # reduce lr on plateau reduction factor
        'lr_patience_factor': 10,               # reduce lr after x (timesteps/episodes) not changing tracked item
        'actor_params': {                       # actor parameters
            'lr': 0.0001,                       # learning rate
            'state_size': state_size,           # size of the state space
            'action_size': action_size,         # size of the action space
            'seed': seedGenerator,              # seed of the network architecture
        },
        'critic_params': {                      # critic parameters
            'lr': 0.0005,                        # learning rate
            'weight_decay': 3e-10,              # weight decay
            'state_size': state_size,           # size of the state space
            'action_size': action_size,         # size of the action space
            'seed': seedGenerator,              # seed of the network architecture
            'action_layer': True,
            'num_atoms': 75,
            'v_min': 0.0, 
            'v_max': 0.5
        },
        'noise': noise
    }
}

### 3. Train the Agent
Get the agent to learn to beat the environment and output the scores

In [4]:
agents = D4PGAgent(params=params['agent_params']) 

scores = train(agents=agents, params=params, num_processes=num_agents)

df = pd.DataFrame(data={'episode': np.arange(len(scores)), 'D4PG': scores})
df.to_csv('results/D4PG.csv', index=False)


################ ACTOR ################

Actor(
  (fc1): Sequential(
    (0): Linear(in_features=24, out_features=400, bias=True)
    (1): ReLU()
  )
  (fc2): Sequential(
    (0): Linear(in_features=400, out_features=300, bias=True)
    (1): ReLU()
  )
  (fc3): Sequential(
    (0): Linear(in_features=300, out_features=2, bias=True)
    (1): BatchNorm1d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Tanh()
  )
)

################ CRITIC ################

D4PGCritic(
  (fc1): Sequential(
    (0): Linear(in_features=26, out_features=400, bias=True)
    (1): ReLU()
  )
  (fc2): Sequential(
    (0): Linear(in_features=400, out_features=300, bias=True)
    (1): ReLU()
  )
  (fc3): Sequential(
    (0): Linear(in_features=300, out_features=75, bias=True)
    (1): ReLU()
  )
)
Loading checkpoint - Average Reward 0.1212500019930303 at Episode 289
Episode 290	Max: 0.00 	 Time: 0.04
Episode 291	Max: 0.00 	 Time: 0.22
Episode 292	Max: 0.10 	 Time: 0.08
Episode 293	Max:

Episode 499	Max: 0.10 	 Time: 4.56
Episode 500	Max: 0.40 	 Time: 12.65
Episode 501	Max: 0.20 	 Time: 6.48
Episode 502	Max: 0.29 	 Time: 10.83
Episode 503	Max: 0.10 	 Time: 4.55
Episode 504	Max: 0.00 	 Time: 1.30
Episode 505	Max: 0.00 	 Time: 1.70
Episode 506	Max: 0.49 	 Time: 16.12
Episode 507	Max: 0.30 	 Time: 11.60
Episode 508	Max: 0.10 	 Time: 2.77
Episode 509	Max: 0.10 	 Time: 2.78
Episode 510	Max: 0.39 	 Time: 12.74
Episode 511	Max: 0.30 	 Time: 11.32
Episode 512	Max: 0.10 	 Time: 4.54
Episode 513	Max: 0.20 	 Time: 5.59
Episode 514	Max: 0.30 	 Time: 9.65
Episode 515	Max: 0.00 	 Time: 1.25
Episode 516	Max: 0.20 	 Time: 8.02
Episode 517	Max: 0.10 	 Time: 2.68
Episode 518	Max: 0.20 	 Time: 6.30
Episode 519	Max: 0.30 	 Time: 9.64
Episode 520	Max: 0.10 	 Time: 4.65
Episode 521	Max: 0.30 	 Time: 9.59
Episode 522	Max: 0.20 	 Time: 6.39
Episode 523	Max: 0.20 	 Time: 7.98
Episode 524	Max: 0.20 	 Time: 7.91
Episode 525	Max: 0.60 	 Time: 19.70
Episode 526	Max: 0.00 	 Time: 1.89
Episode 527	M

Episode 732	Max: 0.20 	 Time: 8.02
Episode 733	Max: 0.10 	 Time: 2.58
Episode 734	Max: 0.20 	 Time: 7.30
Episode 735	Max: 0.00 	 Time: 1.28
Episode 736	Max: 0.10 	 Time: 4.67
Episode 737	Max: 0.00 	 Time: 1.26
Episode 738	Max: 0.20 	 Time: 7.90
Episode 739	Max: 0.20 	 Time: 6.34
Episode 740	Max: 0.10 	 Time: 4.72
Episode 741	Max: 0.10 	 Time: 4.63
Episode 742	Max: 0.30 	 Time: 11.01
Episode 743	Max: 0.10 	 Time: 4.62
Episode 744	Max: 0.00 	 Time: 1.26
Episode 745	Max: 0.10 	 Time: 4.62
Episode 746	Max: 0.10 	 Time: 4.67
Episode 747	Max: 0.09 	 Time: 2.61
Episode 748	Max: 0.09 	 Time: 4.26
Episode 749	Max: 0.10 	 Time: 2.58
Episode 750	Max: 0.10 	 Time: 4.54
Episode 751	Max: 0.10 	 Time: 2.58
Episode 752	Max: 0.20 	 Time: 8.11
Episode 753	Max: 0.10 	 Time: 3.37
Episode 754	Max: 0.00 	 Time: 1.33
Episode 755	Max: 0.10 	 Time: 2.58
Episode 756	Max: 0.09 	 Time: 2.79
Episode 757	Max: 0.10 	 Time: 4.55
Episode 758	Max: 0.10 	 Time: 2.83
Episode 759	Max: 0.20 	 Time: 7.75
Episode 760	Max: 0.

Episode 965	Max: 0.80 	 Time: 26.74
Episode 966	Max: 0.10 	 Time: 2.67
Episode 967	Max: 0.90 	 Time: 31.71
Episode 968	Max: 0.90 	 Time: 31.72
Episode 969	Max: 1.10 	 Time: 38.64
Episode 970	Max: 0.20 	 Time: 6.22
Episode 971	Max: 1.10 	 Time: 36.73
Episode 972	Max: 0.10 	 Time: 2.87
Episode 973	Max: 0.10 	 Time: 2.86
Episode 974	Max: 0.50 	 Time: 18.06
Episode 975	Max: 0.20 	 Time: 7.89
Episode 976	Max: 0.10 	 Time: 2.77
Episode 977	Max: 0.10 	 Time: 2.70
Episode 978	Max: 0.10 	 Time: 2.85
Episode 979	Max: 0.40 	 Time: 12.95
Episode 980	Max: 0.10 	 Time: 2.89
Episode 981	Max: 0.10 	 Time: 2.83
Episode 982	Max: 0.10 	 Time: 2.68
Episode 983	Max: 0.30 	 Time: 9.64
Episode 984	Max: 0.10 	 Time: 2.59
Episode 985	Max: 0.10 	 Time: 2.79
Episode 986	Max: 0.60 	 Time: 21.62
Episode 987	Max: 0.10 	 Time: 2.87
Episode 988	Max: 0.20 	 Time: 7.95
Episode 989	Max: 0.10 	 Time: 2.76
Episode 990	Max: 0.10 	 Time: 2.86
Episode 991	Max: 0.20 	 Time: 7.95
Episode 992	Max: 0.10 	 Time: 4.63
Episode 993	

KeyboardInterrupt: 


### Future Improvements

#### Prioritized Experience Replay
Prioritizing experiences and learning more often from those with 'more to learn from' will most likely result in faster convergence.

#### Annealing of Gaussian Noise
The agent is currently using Gaussian Noise to encourage exploration of the action space. A noise factor of epsilon 0.3 is used consistently throughout the training. Annealing this to lower values based on performance once the agent comes close to convergence may result in higher maximum scores.
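
One possible shape for this improvement is sketched below. The functions `explore` and `anneal_epsilon` and the score threshold of 0.4 are hypothetical; the starting epsilon of 0.3, the decay rate of 0.95 and the floor of 0.005 mirror values in the noise configuration.

```python
import numpy as np


def explore(action, epsilon, rng):
    """Add zero-mean Gaussian noise scaled by epsilon, then clip to the valid action range."""
    noisy = action + epsilon * rng.standard_normal(action.shape)
    return np.clip(noisy, -1.0, 1.0)


def anneal_epsilon(epsilon, average_score, threshold=0.4, decay_rate=0.95, min_epsilon=0.005):
    """Shrink epsilon only once the running average score is close to the target."""
    if average_score >= threshold:
        return max(epsilon * decay_rate, min_epsilon)
    return epsilon


rng = np.random.default_rng(0)
epsilon = 0.3
history = []
for average_score in [0.1, 0.2, 0.45, 0.5, 0.5]:
    epsilon = anneal_epsilon(epsilon, average_score)
    history.append(epsilon)

noisy_action = explore(np.array([0.9, -0.2]), epsilon, rng)
```

Exploration stays at full strength while the average score is low and only tapers off once the agent is near the 0.5 target, which is the behaviour proposed above.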