# Collaboration and Competition with MADDPG

---
In this project, we use the Unity ML-Agents environment to demonstrate how multi-agent deep deterministic policy gradient (MADDPG) can be used to solve collaboration and competition problems. This is the third project of the Deep Reinforcement Learning Nanodegree. Make sure you follow the steps outlined in the README file to set up the necessary packages and environment.

In this implementation, we use the two-agent environment.


### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

import random
import torch

from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

from maddpg import MADDPG, ReplayBuffer

import os
from utilities import transpose_list, transpose_to_tensor, convert_to_tensor

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```

In [2]:
env = UnityEnvironment(file_name=r'env\Tennis.app')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.
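The training loop later in this notebook scores each episode as the larger of the two agents' summed rewards, and treats the task as solved once the average of that score over the last 100 episodes reaches 0.5. A minimal sketch of that bookkeeping follows; the reward sequence is made up and `episode_score` is a hypothetical helper, not part of the project code.

```python
from collections import deque


def episode_score(rewards_per_step):
    """Return the larger of the two agents' summed rewards for one episode.

    rewards_per_step is a list of [agent0_reward, agent1_reward] pairs,
    one pair per time step (hypothetical helper, for illustration only).
    """
    totals = [sum(agent_rewards) for agent_rewards in zip(*rewards_per_step)]
    return max(totals)


# Made-up episode: agent 0 returns the ball once, then agent 1 misses it.
example_episode = [[0.0, 0.0], [0.1, 0.0], [0.0, -0.01]]
score = episode_score(example_episode)

# Rolling window over the most recent 100 episode scores.
scores_window = deque(maxlen=100)
scores_window.append(score)
average = sum(scores_window) / len(scores_window)
solved = average >= 0.5
```

Taking the maximum rather than the sum means an episode counts as good when at least one agent kept the rally going.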

Each agent's observation consists of 24 values: 8 variables corresponding to the position and velocity of the ball and racket, stacked over 3 consecutive frames. Each agent has two continuous actions available, corresponding to movement toward (or away from) the net, and jumping.
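The environment log earlier in this notebook reports a vector observation of size 8 per agent with 3 stacked observations, which accounts for the 24 values. The sketch below splits one observation back into its frames. The oldest-first ordering is an assumption inferred from the printed states in this notebook (the newest values show up in the last 8 slots), and `split_frames` is a hypothetical helper.

```python
def split_frames(observation, frame_size=8):
    """Split a flat, stacked observation into consecutive frames."""
    if len(observation) % frame_size != 0:
        raise ValueError("observation length must be a multiple of frame_size")
    return [observation[i:i + frame_size]
            for i in range(0, len(observation), frame_size)]


# First agent's observation right after reset, copied from this notebook's output.
obs = [0.0] * 16 + [-6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0]

frames = split_frames(obs)
latest = frames[-1]  # assumed to be the most recent frame
```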

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]


# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The states for both agents look like:', states)

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The states for both agents look like: [[ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.         -6.65278625 -1.5
  -0.          0.          6.83172083  6.         -0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.         -6.4669857  -1.5
   0.          0.         -6.83172083  6.          0.          0.        ]]


In [5]:
env_info.vector_observations

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -6.65278625, -1.5       , -0.        ,  0.        ,
         6.83172083,  6.        , -0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -6.4669857 , -1.5       ,  0.        ,  0.        ,
        -6.83172083,  6.        ,  0.        ,  0.        ]])

In [6]:
env_info.rewards

[0.0, 0.0]

In [7]:
env_info.local_done

[False, False]

In [8]:
actions = [[700,700],[700,-700]]


In [9]:
actions_array = torch.stack(transpose_to_tensor(actions)).detach().numpy()

In [10]:

actions_for_env = np.rollaxis(actions_array,1)

In [11]:
env_info_demo = env.step(actions_for_env)[brain_name] 

In [12]:
env_info_demo.vector_observations

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        , -6.65278625, -1.5       ,
        -0.        ,  0.        ,  6.83172083,  6.        , -0.        ,
         0.        , -3.65278697, -0.98316395, 30.        ,  6.21520042,
         6.83172083,  5.94114017, 30.        ,  6.21520042],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        , -6.4669857 , -1.5       ,
         0.        ,  0.        , -6.83172083,  6.        ,  0.        ,
         0.        , -3.46698642, -1.55886006, 30.        , -0.98100001,
        -6.83172083,  5.94114017, 30.        , -0.98100001]])

### 3. Multi-Agent Deep Deterministic Policy Gradient (MADDPG)

#### Key Concept:

In the previous Continuous Control with DDPG project, we successfully used DDPG to teach a double-jointed arm to move to target locations. The action space was continuous, which called for an actor-critic design. In this environment, however, we have 2 agents playing against each other, so we use a multi-agent version of DDPG (MADDPG) in this project.


#### Actor and Critic

In DDPG, we use a separate network to determine the best action for each state. This is the actor network. We use another network, the critic, to model the expected value of a state-action combination. As mentioned in the DDPG paper, it is important to use target networks for both the actor and the critic to help the models stabilize.

#### Centralized Critic input

In this project, we train two DDPG agents, each with its own actor and critic. However, the critics do not act in silos. Instead, each critic also takes the states and actions of all agents as input.
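A sketch of how a centralized critic input could be assembled: the observations and actions of every agent are concatenated into a single vector. The ordering (all observations first, then all actions) and the helper name `build_critic_input` are assumptions made for illustration; the project's own code may lay the vector out differently.

```python
def build_critic_input(all_obs, all_actions):
    """Concatenate every agent's observation and action into one flat list."""
    flat_obs = [value for obs in all_obs for value in obs]
    flat_actions = [value for action in all_actions for value in action]
    return flat_obs + flat_actions


num_agents, state_size, action_size = 2, 24, 2

# Placeholder observations and actions for the two agents.
all_obs = [[0.0] * state_size for _ in range(num_agents)]
all_actions = [[0.5, -0.5], [0.1, 0.2]]

critic_input = build_critic_input(all_obs, all_actions)
# Length is state_size * num_agents + action_size * num_agents.
```

Because each critic sees what the other agent observes and does, the environment looks stationary from the critic's point of view even while both policies are changing.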


##### Normalization
Normalization is extremely important for this task. When state inputs differ in scale, the differences in magnitude can make the networks unstable. In this implementation, we apply weight normalization to all layers except the last. For more details, see `networkforall.py`.

#### DDPG Agent

Just like DQN, having a replay buffer and separating target and current networks are two key ideas that allow the model to learn. More specifically:

1. Replay Buffer:
Instead of using (state, action, reward) tuples in their natural order, our agent stores many such tuples in a replay buffer. At each time step, the agent puts the new (state, action, reward) tuple into the buffer and pulls out a random batch of tuples to update the networks.

2. Target Q vs. Current Q:
At each step, instead of updating the current network toward values computed by the current network itself, we use a target network that is only slowly updated toward the current network. This prevents the networks from chasing after a moving target and helps the agent to learn better. We apply this concept to both actor and critic.
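A minimal replay buffer along these lines can be sketched with a `deque`. This is an illustration only, not the `ReplayBuffer` class imported from `maddpg`; the class name `SimpleReplayBuffer`, its methods, and the toy transitions are hypothetical.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=0):
        # Once capacity is reached, the oldest transitions are dropped.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Sampling without replacement breaks the temporal order of experience.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


demo_buffer = SimpleReplayBuffer(capacity=5)
for step in range(8):
    # Toy transition: (state, action, reward, next_state, done)
    demo_buffer.push((step, 0.0, 0.0, step + 1, False))

batch = demo_buffer.sample(3)
```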

##### Learning Steps
Having two networks makes things slightly more complicated. Here are the major steps that happen during learning:
1. Pick the next action using the target actor network
2. Obtain the corresponding Q-value estimate using the target critic network
3. Update the current critic by minimizing the MSE between the expected local Q values and the target Q values
4. Obtain the predicted action using the current actor network
5. Update the current actor by following the action-value gradient
6. Soft update the target networks with a small fraction of the current networks
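The target computation (steps 2 and 3) and the soft update (step 6) can be illustrated with plain numbers. The discount factor `gamma` and the soft-update rate `tau` below are placeholder values chosen for the example, not necessarily the settings used in `ddpg_agent.py`, and both helper functions are hypothetical.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """Critic target: r + gamma * Q_target(s', a'), with no bootstrap at episode end."""
    return reward + gamma * next_q * (0.0 if done else 1.0)


def soft_update(target_params, local_params, tau=0.01):
    """Move each target parameter a fraction tau of the way toward the local one."""
    return [tau * local + (1.0 - tau) * target
            for target, local in zip(target_params, local_params)]


# Non-terminal step: the target bootstraps from the target critic's estimate.
y = td_target(reward=0.1, next_q=2.0, done=False)

# Terminal step: the target is just the immediate reward.
y_terminal = td_target(reward=-0.01, next_q=2.0, done=True)

# Target parameters drift slowly toward the local (current) parameters.
new_target = soft_update(target_params=[0.0, 1.0], local_params=[1.0, 1.0])
```

The critic's loss is then the mean squared error between its own estimate for the sampled state-action pair and `y`.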

For more information on DDPG, please see `ddpg_agent.py`.

### 4. Train MADDPG!

To train our agents on the Tennis environment, we use the `MADDPG` class imported earlier. When training, set `train_mode=True`, so that the line for resetting the environment looks like the following:

```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [13]:
# setting parameters
# number of parallel environments; each environment has 2 agents
# more environments would generate more experience and smooth things out
# PARALLEL_ENVS = 1
# here we only have 1 env for simplicity

# number of training episodes.
# change this to higher number to experiment. say 30000.
NUMBER_OF_EPISODES = 6000
EPISODE_LENGTH = 1000
BATCHSIZE = 200

# amplitude of OU noise
# this slowly decreases to 0
# instead of resetting noise to 0 every episode, we let it decrease to 0 over a few episodes
NOISE = 1
NO_NOISE_AFTER = 40000
NOISE_DECAY = 0.9999
MIN_BUFFER_SIZE = 10000
BUFFER_SIZE = 1000000

IN_ACTOR_DIM = 24 
HIDDEN_ACTOR_IN_DIM = 300
HIDDEN_ACTOR_OUT_DIM = 100
OUT_ACTOR_DIM = 2

# Critic input contains both states AND all the actions of all the agents
# there are 2 agents, so 24*2 + 2*2 = 52
IN_CRIT_DIM = IN_ACTOR_DIM  * num_agents + action_size * num_agents
HIDDEN_CRIT_IN_DIM = 200
HIDDEN_CRIT_OUT_DIM = 80
OUT_CRIT_DIM = 1

# how many periods before update
UPDATE_EVERY = 20
SEED = 0
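With these settings, the training loop below multiplies `NOISE` by `NOISE_DECAY` on every step once the buffer holds more than `NO_NOISE_AFTER` transitions. The short calculation below shows how gradual that decay is; the 1% threshold is an arbitrary choice for illustration.

```python
import math

NOISE_DECAY = 0.9999

# Smallest number of decay steps after which the noise scale (starting at 1)
# has dropped to 1% or less of its initial value.
steps_to_one_percent = math.ceil(math.log(0.01) / math.log(NOISE_DECAY))
noise_after = NOISE_DECAY ** steps_to_one_percent
```

This works out to roughly 46,000 decay steps, so exploration noise stays meaningful for a long time after the decay begins.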

In [14]:
os.getcwd()

'C:\\Users\\kathl\\Desktop\\Data Science\\Udacity\\DeepReinf\\CollaborationCompetitionWithMADDPGDraft'

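Notes on the reward structure: an agent can receive at most one negative reward per episode, because the episode ends as soon as the ball is dropped or hit out of bounds, and it ends for both agents at the same time. A positive reward, in contrast, does not end the episode. For this reason, the training loop below gives extra weight to episodes with at least one successful hit by pushing their transitions to the replay buffer again.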

In [None]:
# main function that sets up environments
# perform training loop






np.random.seed(SEED)
torch.manual_seed(SEED)
t = 0




# torch.set_num_threads(PARALLEL_ENVS)
# env = envs.make_parallel_env(PARALLEL_ENVS)

# keep up to BUFFER_SIZE full-length episodes worth of replay
# (capacity is BUFFER_SIZE * EPISODE_LENGTH transitions)
buffer = ReplayBuffer(int(BUFFER_SIZE*EPISODE_LENGTH))
# initialize policy and critic through MADDPG
maddpg = MADDPG(IN_ACTOR_DIM, HIDDEN_ACTOR_IN_DIM, HIDDEN_ACTOR_OUT_DIM, OUT_ACTOR_DIM, IN_CRIT_DIM, HIDDEN_CRIT_IN_DIM, HIDDEN_CRIT_OUT_DIM)

# these will be used to print rewards for agents
agent0_reward = []
agent1_reward = []
scores_deque = deque(maxlen=100)
best_scores = []
avg_best_score = []
update_t = 0
#     max_state =  env_info_demo.vector_observations[0]
#     max_action = [0,0]
times_updated = 0

# main training loop: one iteration per episode
for episode in range(1, NUMBER_OF_EPISODES+1):



    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations


    # initialize scores for both agents
    scores = [0,0]   
    # this resets the noise
    maddpg.reset()
    all_t_episode = []


    for episode_t in range(EPISODE_LENGTH):
#             max_state =  find_max_state(max_state, states)


        # explore = only explore for a certain number of episodes
        # action input needs to be transposed

        state_tensors = convert_to_tensor(states)
        
        
        actions = maddpg.act(state_tensors, noise = NOISE)
        if len(buffer) > NO_NOISE_AFTER:
            NOISE *= NOISE_DECAY



        actions_array = torch.stack(actions).detach().numpy()


        # act (actions)       
        # [tensor([ 0.9857,  0.0912]), tensor([ 0.0951, -0.1229])]
        # stack (actions_array)             
        # [[ 0.98568964  0.09124897]
        #  [ 0.0950533  -0.12286544]]

        # step forward one frame

        env_info = env.step(actions_array)[brain_name] 

        next_states = env_info.vector_observations

        rewards = env_info.rewards
        dones  = env_info.local_done

        transition = (states, actions_array, rewards, next_states, dones)
        all_t_episode.append(transition)

        scores = [sum(x) for x in zip(scores, rewards)]

        states = next_states    
    
    

        buffer.push(transition)

        update_t = (update_t + 1) % UPDATE_EVERY
        
        if len(buffer) > MIN_BUFFER_SIZE and update_t == 0:
            times_updated += 1

            for a_i in range(num_agents):
                samples = buffer.sample(BATCHSIZE)
                maddpg.update(samples, a_i)               

            maddpg.update_targets()





        if dones[0]:
            if round(max(scores),1) != 0 :
                print(round(max(scores),1), NOISE, episode, episode_t, len(buffer), list(zip(*actions_array)))
            elif sum(np.isnan(list(zip(*actions_array))[0])) + sum(np.isnan(list(zip(*actions_array))[1]))  >= 1:               
                print(round(max(scores),1), NOISE, episode, episode_t, len(buffer), list(zip(*actions_array)), list(states), list(next_states))                    

            break
            
            



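    # oversample successful episodes: for each threshold the best score clears
    # (0.09, 0.19, 0.29, 0.39), push every transition of the episode twice more
    if max(scores) > 0.09:
        for i in range(1, 5):
            if max(scores) > 0.1 * i - 0.01:
                for transition in all_t_episode:
                    buffer.push(transition)
                    buffer.push(transition)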


    agent0_reward.append(scores[0])
    agent1_reward.append(scores[1])

    best_scores.append(max(scores))
    scores_deque.append(max(scores))    
    avg_best_score.append(np.mean(scores_deque))






    # print score every 100 episodes, on the last episode, and once solved
    if episode % 100 == 0 or episode == NUMBER_OF_EPISODES or np.mean(scores_deque)>=0.5:

        print('\rEpisode {}\tAverage Last 100 Episodes Score: {:.2f}'.format(episode, np.mean(scores_deque)))
        print("times_updated", times_updated)


    # problem solved
    if  np.mean(scores_deque)>=0.5:
        print('\nEnvironment solved in {:d} episodes!\tAverage Last 100 Episodes Score: {:.2f}'.format(episode, np.mean(scores_deque)))            

        break             



print(len(buffer))





0.1 1 28 30 427 [(-0.03193012, 0.2692544), (0.20518495, 0.17144242)]
0.1 1 42 48 722 [(0.9137176, -0.30156916), (0.16899443, -0.54591906)]
0.1 1 44 30 866 [(-0.18241909, -0.26968563), (-0.12184572, -0.08567905)]
0.1 1 62 44 1215 [(-0.8653891, 0.33459708), (0.04567094, -0.9818289)]
0.1 1 75 31 1514 [(0.36852983, -0.20088026), (0.16863576, 0.5808885)]
Episode 100	Average Last 100 Episodes Score: 0.00
times_updated 0
0.1 1 148 32 2665 [(0.37457782, 0.09285021), (0.47647592, -0.33167535)]
0.1 1 159 32 2924 [(0.54501146, 0.20213464), (0.20685694, 0.72580034)]
0.1 1 173 32 3208 [(0.8639217, 0.23231544), (-0.16728444, -0.25159463)]
0.1 1 181 20 3394 [(0.11172579, 0.24077916), (-0.8317204, 0.08684449)]
0.1 1 182 49 3486 [(0.25240842, -0.11388537), (-0.06006129, -0.5111423)]
Episode 200	Average Last 100 Episodes Score: 0.00
times_updated 0
0.1 1 233 31 4328 [(0.33557937, -0.341532), (0.29665688, -0.20772816)]
0.1 1 235 31 4439 [(0.67642593, 0.01812917), (0.12637056, -0.35087308)]
0.1 1 247 41 4