## Report - Collaboration and Competition - Multi-Agent task for the Tennis environment 

#### Author : Bhuvaneswari Sankaranarayanan, Prepared on April 27, 2021 


#### Learning Algorithm Used

The Collaboration and Competition task for the Tennis environment is a two-agent system in which the agents aim to keep the ball in play for as long as possible. The task was solved with the Multi-Agent DDPG (MADDPG) algorithm ([MADDPG paper](https://arxiv.org/abs/1706.02275)), using fully connected deep neural networks that operate on the 24-dimensional state and the 2-dimensional continuous action of each agent.

MADDPG learns a critic network (local and target versions) and an actor network (local and target versions) for each agent, but maintains a common replay memory that records the (state, action, reward, next state) tuples of both agents at every time step.
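The shared replay memory described above can be sketched roughly as follows. This is a minimal, standard-library illustration and not the code used in this project: the class name `SharedReplayBuffer`, the `Experience` record, the extra `dones` field, and uniform sampling are all assumptions made for the example.

```python
import random
from collections import deque, namedtuple

# One record holds the joint transition of all agents at a single time step.
# The field names (and the `dones` field) are assumptions for this sketch.
Experience = namedtuple(
    "Experience", ["states", "actions", "rewards", "next_states", "dones"]
)


class SharedReplayBuffer:
    """Fixed-size memory shared by all agents (hypothetical helper)."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest records are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, states, actions, rewards, next_states, dones):
        """Store one joint transition; each argument has one entry per agent."""
        self.memory.append(Experience(states, actions, rewards, next_states, dones))

    def sample(self):
        """Draw a uniform random mini-batch without replacement."""
        indices = self.rng.sample(range(len(self.memory)), self.batch_size)
        return [self.memory[i] for i in indices]

    def __len__(self):
        return len(self.memory)
```

Because every record contains the states and actions of both agents, each agent's critic can be trained on the joint 48-dimensional state and 4-dimensional action from the same sampled batch.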

In this implementation, the critic network is a fully connected neural network whose input is the 24-dimensional state of both agents (flattened into a 48-dimensional vector) together with the 2-dimensional action of both agents (flattened into a 4-dimensional vector). The input layer is followed by a batch normalization layer, a fully connected layer with 128 units, a fully connected layer with 256 units, and a third fully connected layer (FC3) with 128 units. The output of the critic network is a single real value that approximates the action-value function for the input state and action.

The actor network is also a fully connected neural network; it takes the state as input and outputs the action to be taken in that state. It has a batch normalization layer, a fully connected layer with 128 units, and a fully connected hidden layer with 64 units. The output of the actor network is the 2-dimensional action of the agent.

In MADDPG, the multi-agent version of the DDPG algorithm, each agent learns and acts with its own critic and actor networks. The networks were trained by backpropagation on the loss functions given in the MADDPG paper, and the Adam optimizer was used to perform the gradient updates at every update step.
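For reference, the loss functions from the MADDPG paper can be written as below for the two-agent case. The notation is an assumption taken from the paper, not from this project's code: $x$ is the joint (48-dimensional) state, $o_i$ and $a_i$ are the observation and action of agent $i$, $\mu_i$ and $Q_i$ are its actor and critic, primes denote target networks, and $\mathcal{D}$ is the shared replay memory.

```latex
% Critic target for agent i (target actors supply the next actions)
y = r_i + \gamma \, Q_i'\bigl(x', a_1', a_2'\bigr)\Big|_{a_j' = \mu_j'(o_j')}

% Critic loss: mean squared TD error
\mathcal{L}(\theta_i) = \mathbb{E}_{(x, a, r, x') \sim \mathcal{D}}
    \Bigl[ \bigl( Q_i(x, a_1, a_2) - y \bigr)^2 \Bigr]

% Actor update: deterministic policy gradient through the agent's own critic
\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{(x, a) \sim \mathcal{D}}
    \Bigl[ \nabla_{\theta_i} \mu_i(o_i) \,
           \nabla_{a_i} Q_i(x, a_1, a_2) \Big|_{a_i = \mu_i(o_i)} \Bigr]
```

In practice the actor update is usually implemented by minimizing the negative mean of $Q_i$ evaluated at the actor's own action, which yields the same gradient.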

<img src="./Figures_for_the_report/critic.png" alt="Critic" style="width: 400px;"/> <img src="./Figures_for_the_report/actor.png" alt="Actor" style="width: 300px;"/>

#### Hyperparameter settings

The diagrams above describe the neural network architecture, including hyperparameters such as the number of hidden layers and the number of units per layer. The values of the other hyperparameters are:

- Learning rate for the actor = 1e-4
- Learning rate for the critic = 5e-3
- Buffer Size = 1000000
- Batch Size = 128
- Tau (hyperparameter for soft update of the target network) = 0.001
- Weight Decay (parameter of the PyTorch Adam optimizer) = 0
- UPDATE_EVERY = 4 (update the actor and critic networks once every 4 steps of the agent)
- GAMMA (discount factor) = 0.995
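To illustrate how Tau and UPDATE_EVERY from the list above are typically used, here is a minimal sketch in plain Python. The function names `soft_update` and `should_learn`, the use of plain lists instead of network parameters, and the check that the memory holds at least one full batch are assumptions for illustration, not a copy of the project's code.

```python
TAU = 0.001        # soft-update rate from the list above
UPDATE_EVERY = 4   # learn once every 4 environment steps
BATCH_SIZE = 128


def soft_update(local_params, target_params, tau=TAU):
    """Return target parameters moved a fraction `tau` toward the local ones.

    Implements theta_target <- tau * theta_local + (1 - tau) * theta_target,
    here on plain lists of floats rather than on network weights.
    """
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


def should_learn(step, buffer_len, batch_size=BATCH_SIZE, update_every=UPDATE_EVERY):
    """Learn on every `update_every`-th step, once a full batch is available.

    The full-batch condition is an assumption of this sketch.
    """
    return step % update_every == 0 and buffer_len >= batch_size
```

With Tau = 0.001, the target networks move only 0.1% of the way toward the local networks at each update, which keeps the critic targets changing slowly.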

#### Plot of Rewards

A plot of rewards per episode is included to show that the agents reach an average score of at least +0.5 over the last 100 episodes, where the score of an episode is the maximum of the two agents' rewards in that episode.
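The solve criterion can be expressed as a short sketch. The helper name `episodes_to_solve` is hypothetical, and two details are assumptions: the average is only checked once a full window of 100 episodes is available, and the returned value is the 1-based index of the episode at which the threshold is first reached.

```python
from collections import deque


def episodes_to_solve(per_agent_scores, window=100, target=0.5):
    """Return the 1-based episode at which the moving average first reaches
    `target`, or None if it never does.

    per_agent_scores: iterable of (score_agent_0, score_agent_1), one per episode.
    """
    recent = deque(maxlen=window)  # scores of the most recent `window` episodes
    for episode, scores in enumerate(per_agent_scores, start=1):
        recent.append(max(scores))  # episode score = best of the two agents
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

Calling it with the list of per-episode score pairs recorded during training gives the episode count at which the environment counts as solved under these assumptions.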

The screenshot of the Jupyter notebook below was taken after training the MADDPG agents. It reports that 5387 episodes were needed to solve the environment, i.e. until the average reward crossed 0.5. The average reward is the running average of the rewards from the last 100 episodes.

![alt text](./Figures_for_the_report/screenshot_from_training_maddpg.png "Screenshot of the average rewards over last 100 episodes displayed while training")

The notebook also plots the rewards obtained in each episode.

![alt text](./Figures_for_the_report/learning_curve_maddpg_general.png "Learning Curve from training")


#### Sample code to load the trained model and run it in the environment

```python
# The code segment below works on a Windows 10 system

from unityagents import UnityEnvironment
import numpy as np
import torch

from maddpg_agent import Agent

env = UnityEnvironment(file_name='./Tennis_Windows_x86_64/Tennis.exe')
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

num_agents = 2
state_size = 24
action_size = 2

agents = []
for agent_id in range(num_agents):
    agents.append(Agent(agent_id, num_agents, state_size, action_size))

# load the weights from file
for agent_id in range(num_agents):
    agents[agent_id].actor_local.load_state_dict(torch.load('./MADDPG_model_Solution1/checkpoint_actor_'+str(agent_id)+'.pth'))
    agents[agent_id].critic_local.load_state_dict(torch.load('./MADDPG_model_Solution1/checkpoint_critic_'+str(agent_id)+'.pth'))
```

```text
INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :

Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 
```


```python
# Play 10 episodes with the trained agents
for i in range(10):
    env_info = env.reset(train_mode=False)[brain_name]
    states = env_info.vector_observations
    while True:
        # each agent selects an action from its own observation
        actions = []
        for agent_id in range(num_agents):
            actions.append(agents[agent_id].act(states[agent_id, :]))
        actions = np.squeeze(np.array(actions), axis=1)
        # send both actions to the environment and read the result
        env_info = env.step(actions)[brain_name]
        states = env_info.vector_observations
        rewards = env_info.rewards
        dones = env_info.local_done
        # the episode ends as soon as either agent is done
        if np.any(dones):
            break
env.close()
```



#### GIF of the trained agent's behaviour in the environment

![alt text](video_of_learned_agents.gif "GIF of the agent's play after learning")

#### Ideas for future work 



Possible future work for this multi-agent task could be to try a few tweaks to the vanilla MADDPG algorithm:
- Use prioritized experience replay to sample from the replay memory.
- Any improvement to the DDPG algorithm for a single-agent environment should carry over to MADDPG as well. For example, the [Twin Delayed DDPG (TD3)](https://spinningup.openai.com/en/latest/algorithms/td3.html) algorithm uses clipped Double Q-learning updates, updates the critic more often than the actor (typically two critic updates per actor update), and adds noise to the target action to obtain a smoother Q function across actions (our implementation already adds noise to the actions).
- Use task-specific information to modify the MADDPG algorithm. For example, our tennis-playing task is a fully cooperative learning task, so it makes sense to use a common critic for both agents to speed up learning.
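
As a rough illustration of the first idea, proportional prioritized sampling could look like the sketch below. It is not part of this project: the function name `sample_prioritized` is hypothetical, the priorities are assumed to be positive numbers (for example absolute TD errors plus a small constant), and `alpha=0.6` and `beta=0.4` are only commonly used example values.

```python
import random


def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=random):
    """Sample indices with probability proportional to priority ** alpha.

    Returns (indices, importance_weights). The weights correct for the
    non-uniform sampling and are scaled so that the largest one is 1.
    Sampling is done with replacement; priorities must be positive.
    """
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]
    n = len(priorities)
    indices = rng.choices(range(n), weights=probs, k=batch_size)
    weights = [(n * probs[i]) ** (-beta) for i in indices]
    max_w = max(weights)
    return indices, [w / max_w for w in weights]
```

The returned weights would multiply each sample's critic loss, so that transitions drawn more often because of their high priority do not bias the update.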