# Collaboration & Competition Project Report - DRLND.

## Project background
Reinforcement learning is the fourth branch of machine learning, and the most complicated and fastest-growing one. Its nature is dynamic learning: an agent adjusts its actions based on continuous feedback in order to maximize a reward. This opens up a wide range of real-life ML applications that help humans carry out complicated and dangerous tasks.  
The third project of the Udacity Deep Reinforcement Learning Nanodegree simulates how multiple agents interact with each other in an artificial environment to achieve a specific task. The target is to build a model that enables 2 agents to collaborate and compete with each other concurrently in the Tennis environment. The goal of each agent (player) is to keep the ball in play, so the project's goal is to score as much reward as possible across episodes. A model counts as successful when the agents reach an average score (2 agents) of +0.5 over 100 consecutive episodes.  
Creating an environment where agents can live, interact and learn is the most difficult part. Here, however, the environment is provided by [Udacity](https://www.udacity.com/) and is based on the [machine learning framework](https://unity3d.com/machine-learning) provided by [Unity](https://unity.com/).  
A less complicated, but no less important, part is creating the agents and the model. The agent has its own methods and properties that enable it to interact with the environment and learn. The model is the way the agent explores and exploits the environment to solve the project's task.  

![](./images/tennis.gif)

This project uses (multi-agent) Deep Deterministic Policy Gradient (DDPG) learning, with Actor and Critic networks, to train 2 virtual agents in this environment using the PyTorch framework.

## Collaboration & Competition Environment Information

The environment is built with [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md).

    from unityagents import UnityEnvironment

    env = UnityEnvironment(file_name="/data/Tennis_Linux_NoVis/Tennis")

    INFO:unityagents:
    'Academy' started successfully!
    Unity Academy name: Academy
            Number of Brains: 1
            Number of External Brains : 1
            Lesson number : 0
            Reset Parameters :

    Unity brain name: TennisBrain
            Number of Visual Observations (per agent): 0
            Vector Observation space type: continuous
            Vector Observation space size (per agent): 8
            Number of stacked Vector Observation: 3
            Vector Action space type: continuous
            Vector Action space size (per agent): 2
            Vector Action descriptions: , 


Agent:  
Two agents control rackets to bounce a ball over a net.  
- The action space has two continuous actions per agent, corresponding to movement toward (or away from) the net and jumping.  

Environment:  
The observation space consists of 8 variables per agent, corresponding to the position and velocity of the ball and racket. Three consecutive observations are stacked, so each agent receives a state vector of size 24.  
- A reward of +0.1 is provided if an agent hits the ball over the net.  
- A reward of -0.01 is provided if an agent lets the ball hit the ground or hits the ball out of bounds.  
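
The state size of 24 follows from the brain information printed above: 8 variables per observation times 3 stacked observations. A minimal sketch of such stacking is shown below. `FrameStacker` is a hypothetical helper, not part of Unity ML-Agents, and both the frame order (oldest first) and the zero-filling at the start of an episode are assumptions made for illustration.

```python
from collections import deque

OBS_SIZE = 8      # variables per single observation (from the brain info)
NUM_STACKED = 3   # number of stacked vector observations (from the brain info)


class FrameStacker:
    """Hypothetical helper that concatenates the last NUM_STACKED observations."""

    def __init__(self, obs_size=OBS_SIZE, num_stacked=NUM_STACKED):
        self.obs_size = obs_size
        # Assumption: frames not yet seen at episode start are zero-filled.
        self.frames = deque(
            [[0.0] * obs_size for _ in range(num_stacked)], maxlen=num_stacked
        )

    def push(self, observation):
        """Add the newest observation and return the flattened stacked state."""
        if len(observation) != self.obs_size:
            raise ValueError("unexpected observation size")
        self.frames.append(list(observation))
        # Assumption: oldest frame first, newest frame last.
        return [value for frame in self.frames for value in frame]


stacker = FrameStacker()
state = stacker.push([1.0] * OBS_SIZE)
print(len(state))  # 24
```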

## Deep Deterministic Policy Gradient Network Architecture

An agent can learn the policy either directly from the states, using policy-based methods, or via the action-value function, as in value-based methods. Policy-based methods tend to have high variance and low bias because they use Monte Carlo estimates, whereas value-based methods have low variance but high bias because they use TD estimates. Actor-critic methods were introduced to address this bias-variance trade-off by combining the two approaches.

In actor-critic methods, the actor is a neural network that updates the policy, and the critic is another neural network that evaluates the policy being learned. The critic's evaluation is in turn used to train the actor: the actor uses the value provided by the critic to make each new policy update.  

DDPG, a member of the actor-critic family, comprises two networks. The actor produces a deterministic policy instead of the usual stochastic policy, and the critic evaluates that deterministic policy.  
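
For reference, the standard DDPG updates can be written as follows. The notation is an assumption introduced here and is not taken from the project code: $\mu_\theta$ is the actor, $Q_\phi$ the critic, $\mu_{\theta'}$ and $Q_{\phi'}$ their target copies, $\gamma$ the discount factor, $\tau$ the soft-update rate, and $d$ a flag equal to 1 when the episode has ended.

$$
y = r + \gamma \, (1 - d) \, Q_{\phi'}\big(s', \mu_{\theta'}(s')\big)
$$

$$
L(\phi) = \mathbb{E}\Big[\big(Q_\phi(s, a) - y\big)^2\Big]
\qquad
J(\theta) = \mathbb{E}\Big[Q_\phi\big(s, \mu_\theta(s)\big)\Big]
$$

$$
\theta' \leftarrow \tau \, \theta + (1 - \tau) \, \theta'
\qquad
\phi' \leftarrow \tau \, \phi + (1 - \tau) \, \phi'
$$

The critic is trained to minimize $L(\phi)$, the actor is trained to maximize $J(\theta)$, and the target networks slowly track the learned ones.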

- The Actor network consists of 2 linear hidden layers with 128 units each by default. The first takes state_size (24) as its input shape, and the hidden layers are followed by *relu* activation functions. A *BatchNorm1d* layer is applied to standardize the hidden-layer output. The linear output layer maps the 128 hidden units to action_size (2) and is followed by a *tanh* activation function.  

Actor(  
   (fc1): Linear(in_features=24, out_features=128, bias=True)  
   (fc2): Linear(in_features=128, out_features=128, bias=True)  
   (fc3): Linear(in_features=128, out_features=2, bias=True)  
   (bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True,  track_running_stats=True)  
 )  
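
The printed module above could be reproduced with a sketch like the following. The layer names and sizes come from the printout; the `forward` pass is an assumption, in particular the placement of `bn1` right after `fc1`, because the printout does not show the order in which the layers are applied.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Sketch of the actor: maps a state to a deterministic action in [-1, 1]."""

    def __init__(self, state_size=24, action_size=2, fc_units=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc_units)
        self.fc2 = nn.Linear(fc_units, fc_units)
        self.fc3 = nn.Linear(fc_units, action_size)
        self.bn1 = nn.BatchNorm1d(fc_units)

    def forward(self, state):
        # Assumption: batch norm is applied to the first hidden layer's output.
        x = F.relu(self.bn1(self.fc1(state)))
        x = F.relu(self.fc2(x))
        # tanh bounds each action component to [-1, 1].
        return torch.tanh(self.fc3(x))


actor = Actor()
actor.eval()  # use running statistics so any batch size works
with torch.no_grad():
    actions = actor(torch.zeros(4, 24))
print(actions.shape)  # torch.Size([4, 2])
```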

- The Critic network consists of 2 linear hidden layers with 128 units each by default, followed by *relu* activation functions. The first layer takes state_size (24) as its input shape. A *BatchNorm1d* layer is applied after this input layer to standardize its output, which is then concatenated with the action (action_size = 2) to form the 130-dimensional input of the 2nd hidden layer. The output layer produces a single value.

Critic(  
   (fcs1): Linear(in_features=24, out_features=128, bias=True)  
   (fc2): Linear(in_features=130, out_features=128, bias=True)  
   (fc3): Linear(in_features=128, out_features=1, bias=True)  
   (bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True,  track_running_stats=True)  
 )
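
A sketch that matches the printed Critic is given below. Layer names and sizes are taken from the printout; the `forward` pass, including the exact position of `bn1` relative to the activation, is an assumption made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Critic(nn.Module):
    """Sketch of the critic: maps a (state, action) pair to a single value."""

    def __init__(self, state_size=24, action_size=2, fcs1_units=128, fc2_units=128):
        super().__init__()
        self.fcs1 = nn.Linear(state_size, fcs1_units)
        # 128 state features + 2 action values = 130 inputs, as in the printout.
        self.fc2 = nn.Linear(fcs1_units + action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units, 1)
        self.bn1 = nn.BatchNorm1d(fcs1_units)

    def forward(self, state, action):
        # Assumption: batch norm directly follows the input layer, then ReLU.
        xs = F.relu(self.bn1(self.fcs1(state)))
        # The action enters the network only at the second hidden layer.
        x = torch.cat((xs, action), dim=1)
        x = F.relu(self.fc2(x))
        return self.fc3(x)


critic = Critic()
critic.eval()  # use running statistics so any batch size works
with torch.no_grad():
    q_values = critic(torch.zeros(4, 24), torch.zeros(4, 2))
print(q_values.shape)  # torch.Size([4, 1])
```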

## Learning algorithm

Adam was used as an optimizer for both actor and critic networks.

### Hyperparameters used

* `BUFFER_SIZE = int(1e5)` : replay buffer size
* `BATCH_SIZE = 128      ` : minibatch size
* `GAMMA = 0.99          ` : discount factor
* `TAU = 1e-3            ` : for soft update of target parameters
* `LR_ACTOR = 1e-3       ` : learning rate of the actor
* `LR_CRITIC = 1e-3      ` : learning rate of the critic
* `WEIGHT_DECAY = 0      ` : L2 weight decay
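
To make the roles of `BUFFER_SIZE`, `BATCH_SIZE` and `TAU` concrete, here is a minimal, framework-free sketch. `ReplayBuffer` and `soft_update` are illustrative names, and plain Python lists stand in for network weights; this is not the project's actual implementation.

```python
import random
from collections import deque

BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 128        # minibatch size
TAU = 1e-3              # soft-update interpolation factor


class ReplayBuffer:
    """Fixed-size store of experience tuples; oldest entries are dropped first."""

    def __init__(self, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self):
        """Draw a uniformly random minibatch without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


def soft_update(local_weights, target_weights, tau=TAU):
    """Return tau * local + (1 - tau) * target for each weight."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_weights, target_weights)]


buffer = ReplayBuffer()
for step in range(200):
    buffer.add(state=step, action=0.0, reward=0.1, next_state=step + 1, done=False)
batch = buffer.sample()  # 128 transitions drawn from the 200 stored

# With TAU = 1e-3 the target moves 0.1% of the way toward the local weights.
new_target = soft_update(local_weights=[1.0, 2.0], target_weights=[0.0, 0.0])
```

A small `TAU` keeps the target networks changing slowly, which is what stabilizes the critic's learning targets.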


## Train the agent with DDPG

    # Case: 2 agents
    agent = Agent(state_size=state_size, action_size=action_size, num_agents=num_agents, random_seed=40)

Episode 100	Average Score: 0.0038  
Episode 200	Average Score: 0.0351  
Episode 300	Average Score: 0.0010  
Episode 400	Average Score: 0.0000  
Episode 500	Average Score: 0.0143  
Episode 600	Average Score: 0.0519  
Episode 700	Average Score: 0.0887  
Episode 800	Average Score: 0.1002  
Episode 900	Average Score: 0.0927  
Episode 1000	Average Score: 0.0934  
Episode 1100	Average Score: 0.1065  
Episode 1200	Average Score: 0.0867  
Episode 1300	Average Score: 0.0936  
Episode 1400	Average Score: 0.1287  
Episode 1496	Average Score: 0.5030  
Environment solved in 1496 episodes!	Average Score: 0.5030  
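
The success criterion is an average score of +0.5 over 100 consecutive episodes. A minimal sketch of such a check is shown below, assuming each episode has already been reduced to a single score. `ScoreTracker` is a hypothetical helper, and how the two agents' scores are combined per episode is left to the caller.

```python
from collections import deque

SOLVED_SCORE = 0.5   # target average score
WINDOW = 100         # number of consecutive episodes


class ScoreTracker:
    """Keeps the most recent episode scores and reports their average."""

    def __init__(self, window=WINDOW, target=SOLVED_SCORE):
        self.scores = deque(maxlen=window)
        self.target = target

    def add(self, episode_score):
        """Record one episode score and return the current rolling average."""
        self.scores.append(episode_score)
        return self.average()

    def average(self):
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def solved(self):
        # Require a full window so a few lucky early episodes do not count.
        return len(self.scores) == self.scores.maxlen and self.average() >= self.target


tracker = ScoreTracker()
for _ in range(99):
    tracker.add(0.6)
print(tracker.solved())  # False: only 99 episodes recorded
tracker.add(0.6)
print(tracker.solved())  # True: 100 episodes averaging above 0.5
```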

![](images/scores_epochs.jpg)  
The model was then saved as *./saved_models/checkpoint_actor.pth* and *./saved_models/checkpoint_critic.pth* for later use.

## Testing the trained agent

    import torch

    # Reload the saved models and switch them to evaluation mode for testing.
    # map_location='cpu' lets the checkpoints load on a machine without a GPU.
    agent.actor_local.load_state_dict(torch.load('./saved_models/checkpoint_actor.pth', map_location='cpu'))
    agent.critic_local.load_state_dict(torch.load('./saved_models/checkpoint_critic.pth', map_location='cpu'))
    agent.actor_local.eval()
    agent.critic_local.eval()
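
Calling `eval()` matters here because both networks contain a *BatchNorm1d* layer: in training mode it normalizes with the statistics of the current batch, while in evaluation mode it uses the stored running statistics. A small stand-alone illustration follows; the tiny `net` is hypothetical and is not the project's model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(
    nn.Linear(24, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 2),
    nn.Tanh(),
)
single_state = torch.zeros(1, 24)  # one state, i.e. a batch of size 1

# Training mode: batch statistics cannot be computed from a single sample.
net.train()
try:
    net(single_state)
    train_mode_ok = True
except ValueError:
    train_mode_ok = False

# Evaluation mode: running statistics are used, so a single state works
# and repeated calls give identical outputs.
net.eval()
with torch.no_grad():
    first = net(single_state)
    second = net(single_state)
```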

Testing was carried out over 20 episodes, with the following scores and moving averages:  

Episodes 0000-0005  Max Reward: 2.800  Moving Average: 1.900  
Episodes 0005-0010  Max Reward: 5.200  Moving Average: 2.020  
Episodes 0010-0015  Max Reward: 2.700  Moving Average: 1.920  
Episodes 0015-0020  Max Reward: 5.200  Moving Average: 2.130  

![](images/scores_epochs_test.jpg)  

## Conclusions
The problem was solved using multi-agent DDPG (MADDPG): an average reward of +0.5 over 100 consecutive episodes was reached after 1496 episodes.  

The test was carried out over 20 episodes, and the moving average score stayed above +1.5 throughout.

## Future work
Add some noise to the agents during training, fine-tune the hyperparameters for optimal training, and build a more complex model to improve the scores.  
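
One common choice for exploration noise in DDPG is an Ornstein-Uhlenbeck process added to the actor's output. The sketch below only illustrates that idea: `OUNoise` and `add_noise` are hypothetical helpers, and `mu`, `theta` and `sigma` are typical defaults, not values from this project.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting noise."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean, e.g. at the start of an episode."""
        self.state = list(self.mu)

    def sample(self):
        """Advance one step: x <- x + theta * (mu - x) + sigma * N(0, 1)."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


def add_noise(action, noise, low=-1.0, high=1.0):
    """Add noise to an action and clip the result back into [low, high]."""
    return [min(high, max(low, a + n)) for a, n in zip(action, noise)]


noise = OUNoise(size=2)
noisy_action = add_noise([0.9, -0.9], noise.sample())
```

Clipping keeps the noisy action inside the range produced by the actor's *tanh* output.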

Try to create my own environment by following these [instructions](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Getting-Started-with-Balance-Ball.md).
