# Report for Continuous Control Project

## Environment description
The output below is from the Unity environment initialisation:

Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 
Size of each action: 4

There are 2 agents. Each observes a state with length: 8
        
This shows that each agent observes a state of size 8. The action size is reported as 2 per agent, or 4 in total for the two agents.

### Solving the Environment

The environment is considered solved when the average score over 100 consecutive episodes is +0.5 or more.
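
The solve criterion above can be checked with a rolling window. The snippet below is a minimal sketch, not the project's training loop: the function name `first_solved_episode` and the synthetic score list are assumptions made for illustration.

```python
from collections import deque


def first_solved_episode(scores, window=100, target=0.5):
    """Return the 1-based episode at which the rolling mean over `window`
    episodes first reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)  # keeps only the latest `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Synthetic scores for illustration only (not results from this project).
scores = [0.0] * 100 + [1.0] * 100
print(first_solved_episode(scores))
```

With these synthetic scores the window first averages 0.5 once it holds 50 zeros and 50 ones.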
        
## Model Description
This project uses PyTorch to construct a neural network for each of the Actor and Critic portions of the agent:

The Actor has:

- 8 inputs (state size from the environment)
- 256 units in the first fully connected hidden layer
- 128 units in the second fully connected hidden layer
- 2 outputs (action size from the environment)

The Critic has:

- 8 inputs (state size from the environment)
- 256 units in the first fully connected hidden layer
- 128 units in the second fully connected hidden layer
- 1 output (value for the (state, action) pair)
     
The hidden layers all use the ReLU activation function.

## Agent Description
The solution uses a DDPG agent, which relies on 4 neural networks to learn the task:

- Actor Local
- Actor Target
- Critic Local
- Critic Target

### Q-Network
The goal of a Q-Network is to learn the optimal policy for a given task. The agent initially performs random actions given the current state and uses the reward from the environment to update the likelihood of choosing each action again. Over time the Q-Network determines the best action to perform for each state from the environment. Once a policy that performs the best action in every state has been found, the optimal policy has been reached.

### DQN
A DQN is a reinforcement learning algorithm that uses a neural network as the function approximator. In this instance the model described above sits at the centre of the algorithm and learns which actions to perform for a given environment state.

### DDPG
The DDPG agent uses a similar methodology to the DQN agent. A DQN agent becomes much more complicated to implement if the action space of an environment is continuous. The DDPG agent solves this issue by using 2 neural networks: one (the actor) determines the best action under the current policy, and the other (the critic) estimates the action value of the action chosen by the actor. This allows a value to be assigned to an action.
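
In standard DDPG the critic is trained towards a one-step target built from the reward and the target networks' estimate for the next state. The sketch below shows that arithmetic for a single transition. The function name `td_target` and the example numbers are assumptions for illustration; `gamma` defaults to the `GAMMA` value listed in the hyperparameters.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """One-step target for the critic.

    next_q is assumed to be the target critic's value for the next state
    paired with the action chosen by the target actor. Terminal transitions
    (done=True) do not bootstrap from the next state.
    """
    return reward + gamma * next_q * (1.0 - float(done))


# Illustrative numbers only.
print(td_target(reward=0.1, next_q=2.0, done=False))
print(td_target(reward=0.1, next_q=2.0, done=True))
```

The critic loss would then compare this target with the local critic's value for the stored (state, action) pair.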

### MARL
In this project the two agents have separate, initially identical, actor networks. Both actor networks share the same critic network. The replay buffer, explained below, is also shared. The agents are trained separately, which results in a wider experience pool to learn from.

### Random Replay buffer
The replay buffer stores the results of the actions taken. Instead of learning from every action as it is taken, the agent draws on this store of previous experiences. At predetermined intervals a number of samples are chosen from the buffer at random, so the training data is out of sequence. This removes the possibility of certain sequences biasing the training of the neural network. In this project 10 training loops are performed every 20 timesteps.
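
A uniform replay buffer of this kind can be sketched with the standard library alone. This is an illustrative version, not the project's own class: the names `ReplayBuffer` and `Experience` and the tiny sizes in the usage lines are assumptions, and a real buffer would go on to convert the sampled batch into tensors.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that returns uniformly random minibatches."""

    def __init__(self, buffer_size, batch_size, seed=None):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop off
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Sampling without replacement breaks up the time ordering.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


# Tiny sizes for illustration; the project lists BUFFER_SIZE and BATCH_SIZE.
buffer = ReplayBuffer(buffer_size=5, batch_size=3, seed=8)
for t in range(8):
    buffer.add(state=t, action=0.0, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample()
print(len(buffer), [e.state for e in batch])
```

Because the buffer is shared, both agents would call `add` on the same instance.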


### Soft Target Update
With soft target updates, the weights of the target networks are moved towards the local network weights in small steps using the equation below:

![](results/softUpdate.png)

Soft updates have been reported to give better results than hard updates, where the target weights change in larger steps.
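
As a sketch, and assuming the equation image shows the usual DDPG rule `target = tau * local + (1 - tau) * target`, the update can be written for plain lists of weights as follows. The function name `soft_update` and the example weights are hypothetical; `tau` corresponds to `TAU` in the hyperparameters.

```python
def soft_update(local_params, target_params, tau):
    """Blend each target weight a fraction `tau` towards its local weight."""
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_params, target_params)
    ]


# Illustrative weights only.
local = [1.0, -2.0]
target = [0.0, 0.0]
target = soft_update(local, target, tau=1e-2)
print(target)
```

Setting `tau=1.0` would reduce this to a hard update that copies the local weights outright.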

### Noise
To produce random variance in the choice of action, noise is added to the chosen action on each timestep. The noise in this project is generated by an Ornstein-Uhlenbeck noise function. The code for this noise function was taken from the OpenAI Baselines repository: https://github.com/openai/baselines/blob/master/baselines/ddpg/noise.py
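
The behaviour of this kind of noise can be illustrated with a simplified discrete-time Ornstein-Uhlenbeck process. The class below is a sketch written for this report, not the Baselines code (it omits the time-step term, for example). The class name `OUNoise` is an assumption, and the defaults mirror `MU`, `THETA` and `SIGMA` from the hyperparameters.

```python
import random


class OUNoise:
    """Simplified Ornstein-Uhlenbeck process: noise that drifts back to mu."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=None):
        self.mu = [mu] * size
        self.theta = theta  # strength of the pull back towards mu
        self.sigma = sigma  # scale of the random perturbation
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process one step and return the new noise vector."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


noise = OUNoise(size=2, seed=8)
print(noise.sample())
```

Each sample would be added to the actor's output, with the result typically clipped to the valid action range.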


## Hyperparameters
The current hyperparameter list is shown below.

BUFFER_SIZE = int(1e5)  # replay buffer size

BATCH_SIZE = 1024       # minibatch size

GAMMA = 0.99            # discount factor

TAU = 1e-2      # for soft update of target parameters

LR_ACTOR = 1e-3        # learning rate of the actor 

LR_CRITIC = 1e-3     # learning rate of the critic

WEIGHT_DECAY = 0.       # weight decay

NO_LEARN = 20

LEARN_EVERY = 20

REPLAY_INIT_STEPS = 1024

SEED = 8

MU = 0.

THETA = 0.15

SIGMA = 0.2

ACTOR_FC1 = 500

ACTOR_FC2 = 500

CRITIC_FC1 = 500

CRITIC_FC2 = 500

N_EPISODES = 1000

MAX_T = 1000
    
# Results
The DDPG agent developed in this project reached an average score of >0.5 in 642 episodes.

![](results/ScoreImage.png)


# Future Development
There are some known drawbacks with the implementation in this project due to limitations in the application of the agent. The following changes could improve the performance of the training agent:

## Replay Buffer
One improvement could be to implement a Prioritised Experience Replay buffer. In the current implementation all experiences in the replay buffer have equal probability of being chosen. With a prioritised buffer, experiences would be prioritised according to how much the result diverges from the predicted result.
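
The sampling side of such a buffer could look like the sketch below. It assumes each stored experience carries a priority equal to the absolute difference between the predicted and observed value. The function name `sample_prioritised`, the `alpha` exponent and the example priorities are illustrative and not part of this project.

```python
import random


def sample_prioritised(priorities, batch_size, alpha=0.6, rng=random):
    """Sample indices with probability proportional to priority ** alpha.

    alpha=0 gives uniform sampling; larger values favour high priorities.
    Note that sampling here is with replacement.
    """
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]
    indices = rng.choices(range(len(priorities)), weights=probs, k=batch_size)
    return indices, probs


# Illustrative priorities: the third experience was predicted worst.
indices, probs = sample_prioritised(
    [0.1, 0.5, 2.0], batch_size=4, rng=random.Random(8)
)
print(indices, [round(p, 3) for p in probs])
```

A full implementation would also correct for the sampling bias with importance weights and refresh priorities after each learning step.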


## Other Agent Types
It may be worth experimenting with other agents such as D4PG, TRPO and TNPG implementations to investigate the best approach for solving the environment.

## Custom rewards
As the reward system of this environment only rewards the agent once the episode has finished, it may be worth implementing some custom rewards to aid training. The idea behind this approach is to use the environment data timestep by timestep to guide the learning of the agents. For example, a small reward could be generated every timestep depending on how close the agent on the ball's side of the net is to the ball. This would encourage the agent to track the ball movement in the environment, potentially speeding up the learning task.
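
A proximity bonus of this kind might be sketched as follows. Everything here is hypothetical: the function name `shaped_reward`, the `bonus_scale` value, and the assumption that the racket and ball positions can be read from the observation vector have not been checked against this environment.

```python
def shaped_reward(env_reward, racket_x, ball_x, ball_on_agent_side,
                  bonus_scale=0.01):
    """Add a small bonus that grows as the racket gets closer to the ball.

    The bonus applies only while the ball is on this agent's side of the
    net, and is kept small so it does not swamp the environment reward.
    """
    if not ball_on_agent_side:
        return env_reward
    distance = abs(racket_x - ball_x)
    return env_reward + bonus_scale / (1.0 + distance)


# Illustrative positions only.
print(shaped_reward(0.0, racket_x=-1.0, ball_x=-1.0, ball_on_agent_side=True))
print(shaped_reward(0.0, racket_x=-1.0, ball_x=3.0, ball_on_agent_side=False))
```

Scores used for the solve criterion would still need to be computed from the unshaped environment reward.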