# Collaboration and Competition

---

## Preamble

### The project

In this notebook, an agent based on the MADDPG algorithm is trained to solve the Tennis environment of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

Before using this notebook, check that you have followed the README file available in the [GitHub Project repository](https://github.com/BDGITAI/RL_P3_COLLABORATION_COMPETITION).

For the notebook to work, you will need the Tennis environment executable, which is provided in the [GitHub Project repository](https://github.com/BDGITAI/RL_P3_COLLABORATION_COMPETITION/Tennis_Windows_x86_64/). The environment needs to be uncompressed so that the executable is located at `"./Tennis_Windows_x86_64/Tennis.exe"`.
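
Before launching Unity, it can help to confirm that the executable is where the notebook expects it. The following is a minimal sketch, assuming the relative path quoted above; the helper name `environment_available` is hypothetical and not part of the project code.

```python
from pathlib import Path

# Path quoted in the text above; adjust it if the archive was extracted elsewhere.
ENV_PATH = Path("./Tennis_Windows_x86_64/Tennis.exe")

def environment_available(path=ENV_PATH):
    """Return True when a file exists at `path` (hypothetical helper)."""
    return Path(path).is_file()

if not environment_available():
    print(f"Environment executable not found at {ENV_PATH}; extract the archive first.")
```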

This implementation uses the PyTorch library and was tested on a **64-bit Windows** platform using **CPU** computation.


This notebook is divided into two parts:
* **Part 1**: Training. We train an agent and see how learning can be applied to execute a task.
* **Part 2**: Evaluation. To watch an already trained agent, skip directly to Part 2 and load the saved weights.


### Base used for the project

Some files used in the project are based on starter code provided in the nanodegree to solve the Pendulum environment from OpenAI Gym.
Modifications to the original files are indicated in the comments.

---

## 1. Part 1 : Training an agent 


First, import the required libraries.

In [1]:
# import required libraries
from maddpg import MADDPG
import torch
import numpy as np
from unityagents import UnityEnvironment
from collections import deque

Next, we define a function that pre-fills the memory buffer. It was created to reinforce the positive reward signal and to try to stabilise learning. All actions are taken with a random policy, and experiences are stored according to fixed ratios of positive, negative and neutral rewards.

In [4]:
#####################################################################################################
#
#   Function to pre-initialise the experience buffer of the MADDPG agent
#   Used to try to reinforce the positive reward signal during learning
#   When not trained, the agent receives mostly neutral or negative rewards
#   This function uses random experiences to bootstrap the memory
#   Ratios control how the buffer is filled
#   For instance, half of the memory is filled with 20% positive, 40% negative and 40% zero-reward experiences
#
#####################################################################################################

def collect_experience(agent,number_of_episodes):
    """Pre-fill the agent's memory with experiences collected by a random policy
        Params
        ======
            agent (a MADDPG agent): MADDPG agent that is bootstrapped
            number_of_episodes (int): number of episodes used to fill the memory
    """ 
    # create a unity environment
    env = UnityEnvironment(file_name="./Tennis_Windows_x86_64/Tennis.exe")
    # get the default brain
    brain_name = env.brain_names[0]
    brain = env.brains[brain_name]
    # define filling ratio
    # e.g. : 20 % of positive reward = 0.2
    pos_ratio = 0.2
    neg_ratio = 0.4
    neutral_ratio = 0.4
    # counter for number of experiences of each type
    pos_actual = 0
    neg_actual = 0
    neu_actual = 0
    # amount of memory to be filled
    memory_fill_ratio = 0.5
    to_fill = agent.buffer_size*memory_fill_ratio
    
    # do random actions for the number of episode given
    for i_episode in range(1, number_of_episodes+1):
        # reset at each episode
        env_info = env.reset(train_mode=True)[brain_name] 
        # get all agents states
        states = env_info.vector_observations
        while True:
            actions = np.random.randn(2, 2)              # select an action (for each agent)
            actions = np.clip(actions, -1, 1)            # all actions between -1 and 1            
            env_info = env.step(actions)[brain_name]     # send all actions to the environment
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished   
            
            # add data to buffer if the size is inferior to threshold
            if len(agent.memory)< to_fill:
                # reshape data to be compatible with buffer 
                state = states.reshape(1, -1)  
                next_state = next_states.reshape(1, -1)  
                action= np.array(actions).reshape(1, -1)
                # if current rewards are positive store if under the ratio
                if np.mean(rewards)>0 and (pos_actual/to_fill)<=pos_ratio:
                    agent.memory.add(state, action, rewards, next_state, dones)
                    #update counter
                    pos_actual+=1
                if np.mean(rewards)<0 and (neg_actual/to_fill)<=neg_ratio:
                    agent.memory.add(state, action, rewards, next_state, dones)
                    neg_actual+=1
                if ((np.mean(rewards)==0) and ((neu_actual/to_fill)<=neutral_ratio)):
                    agent.memory.add(state, action, rewards, next_state, dones)
                    neu_actual+=1
            #update state for next time step
            states = next_states
            # if episode is finished
            if np.any(dones):
                break
        # monitoring values
        actual_pos = pos_actual/to_fill
        neg_pos = neg_actual/to_fill
        neu_pos = neu_actual/to_fill
        total = actual_pos + neg_pos + neu_pos
        # display filling progress
        print('\rEpisode {}\tPos: {:.4f}\tNeg: {:.4f}\tNeu: {:.4f}\tFill :{}'.format(i_episode, actual_pos,neg_pos,neu_pos,to_fill), end="")
        if len(agent.memory)== to_fill:
            break
    # close unity environment
    env.close()
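
The quota logic used above can be tried without launching Unity. The following is a simplified sketch, not the project code: the function `prefill_with_quotas` and the reward stream are hypothetical, and each category stops exactly at its quota, whereas the `<=` comparison above allows one extra sample per category.

```python
import numpy as np

def prefill_with_quotas(rewards_stream, capacity, ratios):
    """Keep samples until each reward category reaches its share of `capacity`.

    rewards_stream: iterable of per-step reward pairs (one value per agent)
    capacity: number of slots to fill (plays the role of to_fill above)
    ratios: dict with 'pos', 'neg' and 'neu' shares that sum to 1
    """
    quotas = {kind: int(round(capacity * share)) for kind, share in ratios.items()}
    counts = {"pos": 0, "neg": 0, "neu": 0}
    kept = []
    for rewards in rewards_stream:
        mean_reward = np.mean(rewards)
        kind = "pos" if mean_reward > 0 else "neg" if mean_reward < 0 else "neu"
        # store the sample only while its category is below quota
        if counts[kind] < quotas[kind]:
            kept.append(rewards)
            counts[kind] += 1
        if len(kept) == capacity:
            break
    return kept, counts

# Hypothetical reward stream: mostly neutral or negative, rarely positive.
rng = np.random.default_rng(0)
choices = [(0.0, 0.0), (0.0, -0.01), (0.1, 0.0)]
stream = [choices[i] for i in rng.choice(3, size=5000, p=[0.6, 0.35, 0.05])]
kept, counts = prefill_with_quotas(
    stream, capacity=100, ratios={"pos": 0.2, "neg": 0.4, "neu": 0.4}
)
print(counts)
```

Because positive rewards are rare under a random policy, the positive quota is the last to fill, which is why many episodes may be needed before the buffer reaches the requested mix.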


The training loop below can be used with or without the pre-fill function.

In [None]:
#####################################################################################################
#
#   Training loop 
#
#####################################################################################################    

def seeding(seed=1):
    np.random.seed(seed)
    torch.manual_seed(seed)
    
def train(agent, number_of_episodes = 1):
    """Train the MADDPG agent in the Tennis environment
        Params
        ======
            agent (a MADDPG agent): MADDPG agent to train
            number_of_episodes (int): maximum number of training episodes
    """ 
    seeding()
    # save initial weights
    # agent.save('init')
    # track the best average score so the networks can be saved even if the target score is not reached
    best_mean_score = -1

    # build environment
    env = UnityEnvironment(file_name="./Tennis_Windows_x86_64/Tennis.exe")
    # get the default brain
    brain_name = env.brain_names[0]
    brain = env.brains[brain_name]

    #
    scores = []     # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    
    # training loop    
    for i_episode in range(1, number_of_episodes+1):
        # init rewards for all agents
        reward_this_episode = np.zeros(len(agent.ddpg_agents))
        #reset env for each episode
        env_info = env.reset(train_mode=True)[brain_name] 
        states = env_info.vector_observations
        while True:
            # decide actions
            actions = agent.act(states, add_noise=True)
          
            # step forward 
            env_info = env.step(actions)[brain_name]     # send all actions to the environment
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished   
            
            # add data to buffer
            agent.step(states, actions, rewards, next_states, dones)
            
            # accumulate rewards
            reward_this_episode += rewards
            
            #update state for next time step
            states = next_states
            if np.any(dones):
                break             
                          
        # compute score 
        # use max amongst all agents as directed in project instructions
        score = np.max(reward_this_episode)
        scores_window.append(score)       # save most recent score
        scores.append(score)              # save most recent score
        # display progress
        print('\rEpisode {}\tAverage Score: {:.4f}\tMax Score: {:.4f}\tReward this episode: {}'.format(i_episode, np.mean(scores_window),np.max(scores_window),reward_this_episode), end="")

        # print every 100 episodes and at the last episode; save if the average improved
        if i_episode % 100 == 0 or i_episode == number_of_episodes:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
            if best_mean_score < np.mean(scores_window):
                agent.save('best_achieved')
                best_mean_score = np.mean(scores_window)
        # save and end if score is achieved      
        if np.mean(scores_window) >= 0.5:
            print('\rSolved Episode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
            agent.save('solved')
            break
    # save where training stopped if target score is not achieved
    if np.mean(scores_window) < 0.5:        
        agent.save('end')
    # close unity
    env.close()
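
The scoring rule used in the loop above (per-episode score is the larger of the two agents' cumulative rewards, and training stops once the mean over the last 100 scores reaches 0.5) can be isolated as follows. This is a minimal sketch with hypothetical helper names and made-up reward values, not part of the project code.

```python
from collections import deque
import numpy as np

def episode_score(step_rewards):
    """Sum rewards per agent over an episode and return the larger total."""
    totals = np.sum(np.asarray(step_rewards, dtype=float), axis=0)
    return float(np.max(totals))

def first_solved_episode(scores, target=0.5, window=100):
    """Return the 1-based episode at which the rolling mean reaches `target`, else None."""
    recent = deque(maxlen=window)
    for i_episode, score in enumerate(scores, start=1):
        recent.append(score)
        if np.mean(recent) >= target:
            return i_episode
    return None

# Hypothetical rewards for one episode: one row per time step, one column per agent.
print(episode_score([[0.0, 0.0], [0.1, 0.0], [0.0, -0.01]]))
```

As in the training loop, the rolling mean is computed over however many scores are available, so the target can in principle be reached before 100 episodes have been played.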

Create the agent and pre-fill its memory.

In [7]:
# create agent
# Observations dim for each agent = 24
# action dim = 2
# create 2 agents
agent = MADDPG(24, 2, 2, buffer_size=int(1e5), batch_size=128, discount_factor=0.99, tau=0.001)

# use this line to prefill the memory 
collect_experience(agent,10000)

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Episode 10000	Pos: 0.1487	Neg: 0.4001	Neu: 0.4001	Fill :15000.0
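
The environment summary above explains the value 24 passed to `MADDPG`: each agent receives 8 observation values stacked over 3 frames. The sketch below, which uses placeholder arrays rather than real observations, shows this arithmetic and the `reshape(1, -1)` step that `collect_experience` applies before storing a sample.

```python
import numpy as np

obs_size, stacked, num_agents, action_size = 8, 3, 2, 2
per_agent_obs = obs_size * stacked          # 24, the first argument given to MADDPG

# Placeholder arrays standing in for the per-agent observations and actions.
states = np.zeros((num_agents, per_agent_obs))
actions = np.zeros((num_agents, action_size))

joint_state = states.reshape(1, -1)         # shape (1, 48): both agents' observations
joint_action = actions.reshape(1, -1)       # shape (1, 4): both agents' actions
print(per_agent_obs, joint_state.shape, joint_action.shape)
```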

Now let's execute the training loop

In [8]:
# perform the training
train(agent,number_of_episodes = 10000)

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Episode 100	Average Score: 0.0000	Max Score: 0.0000	Reward this episode: [ 0.   -0.01]
Episode 200	Average Score: 0.0000	Max Score: 0.0000	Reward this episode: [ 0.   -0.01]
Episode 300	Average Score: 0.0029	Max Score: 0.2000	Reward this episode: [ 0.    0.09]
Episode 400	Average Score: 0.0252	Max Score: 0.1000	Reward this episode: [ 0.   -0.01]
Episode 500	Average Score: 0.0000	Max Score: 0.0000	Reward this episode: [ 0.   -0.01]
Episode 600	Average Score: 0.0000	Max Score: 0.0000	Reward this episode: [ 0.   -0.01]
Episode 700	Average Score: 0.0000	Max Score: 0.0000	Reward this episode: [-0.01  0.  ]
Episode 800	Average Score: 0.0018	Max Score: 0.0900	Reward this episode: [ 0.   -0.01]
Episode 900	Average Score: 0.0350	Max Score: 0.2000	Reward this episode: [ 0.    0.09]
Episode 1000	Average Score: 0.0766	Max Score: 0.3000	Reward this episode: [ 0.1  -0.01]
Episode 1100	Average Score: 0.1082	Max Score: 0.2000	Reward this episode: [ 0.    0.09]
Episode 1200	Average Score: 0.1285	Max Sc

## 2. Part 2 : Watch a trained agent

Import the libraries if Part 1 was skipped.

In [1]:
# import required libraries
from maddpg import MADDPG
import torch
import numpy as np
from unityagents import UnityEnvironment
from collections import deque

In [2]:
#####################################################################################################
#
#   Perform an evaluation in non training mode 
#
#####################################################################################################  

def evaluate(agent,number_of_episodes = 10):
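    """Run a trained agent without exploration noise and print its scores."""
    # create a unity environment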
    env = UnityEnvironment(file_name="./Tennis_Windows_x86_64/Tennis.exe")
    # get the default brain
    brain_name = env.brain_names[0]
    brain = env.brains[brain_name]
        
    # path to saved weights
    path = []
    path.append('./successful_weigths/solved1.actor.pth')
    path.append('./successful_weigths/solved0.actor.pth')
   
    # load networks
    agent.load(path)
    
    #
    scores = []     # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores

    for i_episode in range(1, number_of_episodes+1):
        
        reward_this_episode = np.zeros(2)
        # reset environment not in training mode
        env_info = env.reset(train_mode=False)[brain_name] 
        states = env_info.vector_observations
        while True:
            # noise is disabled during evaluation
            actions = agent.act(states, add_noise=False)
            # step forward one frame
            env_info = env.step(actions)[brain_name]           # send all actions to the environment
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished                    
            reward_this_episode += rewards
            states = next_states
            if np.any(dones):
                break
        # save scores
        score = np.max(reward_this_episode)
        scores_window.append(score)       # save most recent score
        scores.append(score)              # save most recent score
    print('\rEpisode {}\tAverage Score: {:.4f}\tMax Score: {:.4f}\tReward this episode: {}'.format(i_episode, np.mean(scores_window),np.max(scores_window),reward_this_episode), end="")
    env.close()    

In [3]:
agent = MADDPG(24, 2, 2)
evaluate(agent,10)

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Episode 10	Average Score: 1.3590	Max Score: 2.7000	Reward this episode: [ 0.1   0.09]