# Collaboration and Competition with MADDPG

---
In this project, we use the Unity ML-Agents environment to demonstrate how multi-agent deep deterministic policy gradient (MADDPG) can be used to solve collaboration and competition problems. This is the third project of the Deep Reinforcement Learning Nanodegree. Make sure you follow the steps outlined in the README file to set up the necessary packages and environment.

In this implementation, we use the two-agent environment.


### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

import random
import torch

from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

from maddpg import MADDPG, ReplayBuffer

import os
from utilities import transpose_list, transpose_to_tensor, convert_to_tensor

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```

In [2]:
env = UnityEnvironment(file_name=r'env\Tennis.app')  # raw string so the backslash is not read as an escape

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

Each agent observes 8 variables corresponding to the position and velocity of the ball and racket. Three consecutive frames are stacked, so the observation vector has 24 values per agent. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.
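
The environment summary printed earlier reports 8 observation variables per agent and 3 stacked observations, which is where the length of 24 comes from. Below is a minimal sketch of splitting one observation back into its frames. It assumes the frames are stored back to back, oldest first; that layout is only suggested by the leading zeros in a freshly reset state and is not confirmed here. `split_frames` is a hypothetical helper, not part of the project code.

```python
import numpy as np

FRAME_SIZE = 8   # "Vector Observation space size (per agent)"
NUM_FRAMES = 3   # "Number of stacked Vector Observation"

def split_frames(observation):
    """Reshape a flat 24-value observation into (NUM_FRAMES, FRAME_SIZE).

    Assumes frames are concatenated oldest first (hypothetical layout).
    """
    observation = np.asarray(observation, dtype=float)
    return observation.reshape(NUM_FRAMES, FRAME_SIZE)

# Dummy observation: two empty frames followed by one populated frame.
obs = np.concatenate([np.zeros(16), np.arange(1.0, 9.0)])
frames = split_frames(obs)
print(frames.shape)  # (3, 8)
print(frames[-1])    # the last of the three frames
```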

Run the code cell below to print some information about the environment.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]


# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states)

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [[ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.         -6.65278625 -1.5
  -0.          0.          6.83172083  6.         -0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.         -6.4669857  -1.5
   0.          0.         -6.83172083  6.          0.          0.        ]]


# Check weight initialization (uncomment to run)

In [5]:
# # setting parameters
# # number of parallel environment, each environment has 2 agents
# # this would generate more experience and smooth things out
# # PARALLEL_ENVS = 1
# # Here we only have 1 env for simplicity

# # number of training episodes.
# # change this to higher number to experiment. say 30000.
# NUMBER_OF_EPISODES = 6
# EPISODE_LENGTH = 1000
# BATCHSIZE =3

# # amplitude of OU noise
# # this slowly decreases to 0
# # instead of resetting noise to 0 every episode, we let it decrease to 0 over a few episodes
# NOISE = 1
# NO_NOISE_AFTER = 10000
# NOISE_DECAY = 0.9999
# MIN_BUFFER_SIZE = 15000
# BUFFER_SIZE = 1000000

# IN_ACTOR_DIM = 24 
# HIDDEN_ACTOR_IN_DIM = 100
# HIDDEN_ACTOR_OUT_DIM = 100
# OUT_ACTOR_DIM = 2

# # Critic input contains both states AND all the actions of all the agents
# # there are 2 agents, so 24*2 + 2*2 = 52
# IN_CRIT_S = IN_ACTOR_DIM  * num_agents 
# IN_CRIT_A = action_size * num_agents
# HIDDEN_CRIT_IN_DIM = 100
# HIDDEN_CRIT_OUT_DIM = 100
# OUT_CRIT_DIM = 1

# # how many periods before update
# UPDATE_EVERY = 1
# SEED = 6
# DISC = 0.99
# TAU = 0.001
# LR_ACT = 1.e-4
# LR_CRI = 1.e-3

# maddpg = MADDPG(IN_ACTOR_DIM, HIDDEN_ACTOR_IN_DIM, HIDDEN_ACTOR_OUT_DIM, OUT_ACTOR_DIM,\
#                 IN_CRIT_S, IN_CRIT_A, HIDDEN_CRIT_IN_DIM, HIDDEN_CRIT_OUT_DIM, SEED, LR_ACT, LR_CRI, DISC, TAU)


In [6]:
# list(maddpg.maddpg_agent[0].critic.parameters())[2]

In [7]:
# list(maddpg.maddpg_agent[0].target_critic.parameters())[2]

### 3. Multi-Agent Deep Deterministic Policy Gradient (MADDPG)

#### Key Concept:

In the previous Continuous Control with DDPG project, we successfully used DDPG to teach a double-jointed arm to move to target locations. The action space was continuous, so we used an actor-critic design. In this environment, however, we have 2 agents interacting on the same court. To solve it, we use a multi-agent version of DDPG in this project.


#### Actor and Critic

In DDPG, we use a separate network to determine the best action for each state. This is the actor network. We use another network, the critic, to model the expected value of a state-action pair. As mentioned in the DDPG paper, it is important to use target networks for both the actor and the critic to help our models stabilize.

#### Centralized Critic input

In this project, we train two DDPG agents, each with its own actor and critic. However, the critics do not work in isolation. Instead, each critic takes the states and actions of both agents as input, so the critics have complete information from both sides of the table. This is what differentiates MADDPG from DDPG.
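
As a rough illustration of what "states and actions of both agents" means for the critic's input size, here is a minimal sketch in plain Python. The function name `critic_input` and the ordering (all states first, then all actions) are assumptions made for the example; the actual networks may arrange their inputs differently.

```python
STATE_SIZE = 24   # per-agent observation length
ACTION_SIZE = 2   # per-agent action length
NUM_AGENTS = 2

def critic_input(all_states, all_actions):
    """Concatenate every agent's state and action into one flat list."""
    flat_states = [x for state in all_states for x in state]
    flat_actions = [a for action in all_actions for a in action]
    return flat_states + flat_actions

# Dummy data: zero observations and arbitrary actions for both agents.
states = [[0.0] * STATE_SIZE for _ in range(NUM_AGENTS)]
actions = [[0.1, -0.2], [0.3, 0.4]]

joint = critic_input(states, actions)
print(len(joint))  # 24*2 + 2*2 = 52
```

Each actor, by contrast, only sees its own 24-value observation.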


#### Normalization
Normalization is extremely important for this task. When state inputs differ in scale, the differences in magnitude can make our networks unstable. In this implementation, we apply weight normalization to every layer except the last. For more details, see `networkforall.py`.

#### DDPG Agent
Having a replay buffer and separating target and current networks are two key ideas that allow the model to learn. More specifically:

1. Replay Buffer:
Instead of learning from transitions in the order they occur, our agent stores them in a replay buffer. New transitions (state, action, reward, next state, done flag) are added to the buffer, and at each update step a random batch is drawn from it to update the networks. Random sampling breaks the correlation between consecutive transitions.

2. Target Q vs. Current Q:
At each step, instead of computing the update target from the current network itself, we use a target network that is only slowly moved toward the current network. This prevents the networks from chasing a moving target and helps the agent learn better. We apply this concept to both actor and critic.
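
The replay buffer idea can be sketched with a bounded `deque`. This is a minimal stand-in, not the `ReplayBuffer` class imported from `maddpg` (whose code is not shown here). The class name `SimpleReplayBuffer` and the attribute `memory` are hypothetical.

```python
import random
from collections import deque

class SimpleReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # Once full, the oldest transitions are dropped automatically.
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buf = SimpleReplayBuffer(capacity=3)
for step in range(5):
    # (state, action, reward, next_state, done) with dummy values
    buf.push((step, 0.0, 0.1, step + 1, False))

batch = buf.sample(2)
print(len(buf), len(batch))
```

Because the capacity is 3, only the three most recent transitions remain after five pushes.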


#### MADDPG Agent
The MADDPG agent contains 2 DDPG agents, one for each racket.

##### Learning Steps
Having two networks makes things slightly more complicated. Here are the major steps that happen during learning:
1. Pick the next action using the target actor network
2. Obtain the corresponding Q-value estimate using the target critic network
3. Update the current critic by minimizing the MSE between its Q estimate and the target Q value
4. Obtain the predicted action using the current actor network
5. Update the current actor by following the action-value gradient
6. Soft update the target networks with a small fraction of the current networks
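
Two of these steps reduce to short formulas: the target Q value used in step 3 and the soft update in step 6. The sketch below writes them out on plain numbers rather than tensors. It follows the standard DDPG formulation and uses the notebook's `DISC` (0.99) and `TAU` (0.001) values as defaults. The function names are hypothetical, and the project's own code in `ddpg_agent.py` may differ in detail.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """Target Q value: r + gamma * Q_target(s', a'), with no bootstrap once done."""
    return reward + gamma * next_q * (1.0 - float(done))

def soft_update(target_params, local_params, tau=0.001):
    """Move each target parameter a fraction tau toward the current one."""
    return [tau * local + (1.0 - tau) * target
            for target, local in zip(target_params, local_params)]

# Non-terminal transition: the target includes the discounted next Q value.
print(td_target(reward=0.1, next_q=1.0, done=False))
# Terminal transition: the target is just the reward.
print(td_target(reward=0.1, next_q=1.0, done=True))
# The target weight barely moves toward the current weight.
print(soft_update([0.0], [1.0]))
```

With `tau = 0.001`, the target networks lag far behind the current ones, which is what keeps the critic's regression target stable.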

For more information on DDPG, please see `ddpg_agent.py`

### 4. Train MADDPG!

To train our agents on the Tennis task, we first import the agent class we wrote. When training in the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:

```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [8]:
os.getcwd()

'C:\\Users\\kathl\\Desktop\\Data Science\\Udacity\\DeepReinf\\CollaborationCompetitionWithMADDPGDraft\\updating'

**Notes on the reward structure:**

- An agent can receive at most one negative reward per episode: once the ball is out, the episode is over.
- Both agents finish the episode at the same time.
- After a positive reward, the episode continues.
- For this reason, an episode's transitions are pushed to the buffer only if the episode contains at least one successful hit. Otherwise they are discarded.

# Set parameters for buffer filling

In [9]:
# setting parameters
# number of parallel environment, each environment has 2 agents
# this would generate more experience and smooth things out
# PARALLEL_ENVS = 1
# Here we only have 1 env for simplicity

# number of training episodes.
# change this to higher number to experiment. say 30000.
NUMBER_OF_EPISODES = 3000
EPISODE_LENGTH = 1000
BATCHSIZE =200
MIN_BUFFER_SIZE = 60000
UPDATE_EVERY = 1
# amplitude of OU noise
# this slowly decreases to 0
# instead of resetting noise to 0 every episode, we let it decrease to 0 over a few episodes

NOISE = 1
NO_NOISE_AFTER = 3000
NOISE_DECAY = 0.9999

BUFFER_SIZE = 300000

IN_ACTOR_DIM = 24 
HIDDEN_ACTOR_IN_DIM = 400
HIDDEN_ACTOR_OUT_DIM = 400
OUT_ACTOR_DIM = 2

# Critic input contains both states AND all the actions of all the agents
# there are 2 agents, so 24*2 + 2*2 = 52
IN_CRIT_S = IN_ACTOR_DIM  * num_agents 
IN_CRIT_A = action_size * num_agents
HIDDEN_CRIT_IN_DIM = 400
HIDDEN_CRIT_OUT_DIM = 400
OUT_CRIT_DIM = 1

# how many periods before update

SEED = 6
DISC = 0.99
TAU = 0.001
LR_ACT = 0.0001
LR_CRI = 0.001
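
The three noise settings interact as follows in the main training loop: the amplitude stays at `NOISE` until the episode number exceeds `NO_NOISE_AFTER`, and is then multiplied by `NOISE_DECAY` once per episode. Here is a minimal closed-form sketch of that schedule. `noise_at_episode` is a hypothetical helper, and the closed form may differ from repeated multiplication by floating-point rounding.

```python
NOISE_START = 1.0
NO_NOISE_AFTER = 3000
NOISE_DECAY = 0.9999

def noise_at_episode(episode):
    """Noise amplitude used during `episode` (1-based)."""
    decays = max(0, episode - NO_NOISE_AFTER)
    return NOISE_START * NOISE_DECAY ** decays

for episode in (1, 3000, 3001, 10000):
    print(episode, noise_at_episode(episode))
```

With these values the decay is slow: after 7000 decayed episodes the amplitude is still a little under 0.5.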


In [10]:
# Initialization
def process_data (states, actions, rewards, next_states, dones):
    full_state = states.flatten()
    next_full_state = next_states.flatten()
    return (states, full_state, actions, rewards, next_states, next_full_state, dones)

np.random.seed(SEED)
torch.manual_seed(SEED)
t = 0


# torch.set_num_threads(PARALLEL_ENVS)
# env = envs.make_parallel_env(PARALLEL_ENVS)

# keep 5000 episodes worth of replay


# initialize policy and critic through MADDPG
maddpg = MADDPG(IN_ACTOR_DIM, HIDDEN_ACTOR_IN_DIM, HIDDEN_ACTOR_OUT_DIM, OUT_ACTOR_DIM,\
                IN_CRIT_S, IN_CRIT_A, HIDDEN_CRIT_IN_DIM, HIDDEN_CRIT_OUT_DIM, SEED, LR_ACT, LR_CRI, DISC, TAU)

# these will be used to print rewards for agents
agent0_reward = []
agent1_reward = []
scores_deque = deque(maxlen=100)
best_scores = []
avg_best_score = []
update_t = 0
#     max_state =  env_info_demo.vector_observations[0]
#     max_action = [0,0]
times_updated = 0
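
To make the layout of a stored transition concrete, the sketch below runs `process_data` on dummy arrays with the shapes used in this environment (2 agents, 24 observations, 2 actions). The input values are placeholders chosen for the example.

```python
import numpy as np

def process_data(states, actions, rewards, next_states, dones):
    # Same function as in the cell above: adds the flattened joint states.
    full_state = states.flatten()
    next_full_state = next_states.flatten()
    return (states, full_state, actions, rewards,
            next_states, next_full_state, dones)

states = np.zeros((2, 24))
next_states = np.ones((2, 24))
actions = np.array([[0.1, -0.2], [0.3, 0.4]])
rewards = [0.1, 0.0]
dones = [False, False]

transition = process_data(states, actions, rewards, next_states, dones)
print(len(transition))      # 7 fields per transition
print(transition[1].shape)  # joint state seen by the critics: (48,)
print(transition[3])        # rewards sit at index 3
```

Index 3 is the field the buffer-filling code inspects (`transition[3]`) when it looks for positive rewards.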


In [11]:
# list(maddpg.maddpg_agent[0].critic.parameters())[0]

In [12]:
def fill_buffer(buffer):
    episodes = 0

    while len(buffer) <= MIN_BUFFER_SIZE:
        env_info = env.reset(train_mode=True)[brain_name]
        states = env_info.vector_observations
        scores = [0,0] 
        all_t_episode = []
        
        for t in range(1000): 
            state_tensors = convert_to_tensor(states)
            actions = maddpg.act(state_tensors, noise = NOISE)
            #print(actions)
            actions_array = torch.stack(actions).detach().numpy()

            env_info = env.step(actions_array)[brain_name] 
#             env_info = env.step(actions)[brain_name] 
            next_states = env_info.vector_observations
            rewards = env_info.rewards
            dones  = env_info.local_done   

            transition = process_data(states, actions_array, rewards, next_states, dones)
            
            all_t_episode.append(transition)

            scores = [sum(x) for x in zip(scores, rewards)]
            states = next_states  
            if dones[0] and max(scores) < 0.05:
                break
            elif dones[0] and max(scores) > 0.05:
                how_many_times = 1 if sum(scores) < 0.11 else int(sum(scores)/ 0.08) ** int(sum(scores)/ 0.08) *10
                
                print("episode", episodes, "scores", scores, "max t", t, "len buf", len(buffer), int(sum(scores)/ 0.08) )
                for t, transition in enumerate(all_t_episode):
                    buffer.push(transition) 
                    if t+3 > len(all_t_episode):
                        continue
                    elif max(transition[3]) > 0.05 or max(all_t_episode[t+1][3]) > 0.05  or max(all_t_episode[t+2][3]) > 0.05:
                        for j in range(min(how_many_times, 500)):
                            buffer.push(transition)                 
                                                        
                break
            else:
                continue
        episodes += 1        
        if len(buffer) % 100 == 0 and len(buffer)>1:
            print("buffer len", len(buffer),"episodes", episodes, "scores",scores)
    print(len(buffer), episodes, scores)
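
The oversampling rule inside `fill_buffer` is easier to read in isolation. Every transition of a successful episode is pushed once, and transitions at or within two steps before a positive reward are pushed additional times. The number of extra copies grows steeply with the episode's total score and is capped at 500. A minimal sketch of that count, with the hypothetical helper name `replication_count`:

```python
def replication_count(scores, cap=500):
    """Extra copies pushed for transitions near a positive reward.

    Mirrors the `how_many_times` expression in `fill_buffer`,
    including the cap applied in the inner loop.
    """
    total = sum(scores)
    if total < 0.11:
        return 1
    hits = int(total / 0.08)  # rough number of successful hits
    return min(hits ** hits * 10, cap)

print(replication_count([-0.01, 0.10]))  # one hit
print(replication_count([0.10, 0.09]))   # one hit per agent
print(replication_count([0.30, 0.29]))   # long rally, hits the cap
```

The training loop uses the same expression without the factor of 10 and with a cap of 200.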

    

In [13]:
buffer = ReplayBuffer(int(BUFFER_SIZE))
fill_buffer(buffer)

episode 15 scores [-0.009999999776482582, 0.10000000149011612] max t 43 len buf 0 1
episode 42 scores [0.0, 0.09000000171363354] max t 29 len buf 47 1
episode 50 scores [0.10000000149011612, -0.009999999776482582] max t 41 len buf 80 1
episode 54 scores [-0.009999999776482582, 0.10000000149011612] max t 30 len buf 125 1
episode 55 scores [-0.009999999776482582, 0.10000000149011612] max t 32 len buf 159 1
episode 69 scores [0.10000000149011612, -0.009999999776482582] max t 49 len buf 195 1
episode 73 scores [0.10000000149011612, 0.09000000171363354] max t 42 len buf 248 2
episode 104 scores [0.10000000149011612, -0.009999999776482582] max t 29 len buf 531 1
episode 109 scores [0.10000000149011612, -0.009999999776482582] max t 34 len buf 564 1
episode 116 scores [0.10000000149011612, -0.009999999776482582] max t 32 len buf 602 1
episode 117 scores [0.10000000149011612, -0.009999999776482582] max t 33 len buf 638 1
episode 124 scores [0.10000000149011612, -0.009999999776482582] max t 30 l

In [14]:
# import csv
# bufflist = list(buffer.deque)

# with open('buffer.csv', 'w', newline='') as myfile:
#     wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
#     wr.writerow(bufflist)



In [15]:
# maddpg = MADDPG(IN_ACTOR_DIM, HIDDEN_ACTOR_IN_DIM, HIDDEN_ACTOR_OUT_DIM, OUT_ACTOR_DIM,\
#                 IN_CRIT_S, IN_CRIT_A, HIDDEN_CRIT_IN_DIM, HIDDEN_CRIT_OUT_DIM, SEED, LR_ACT, LR_CRI, DISC, TAU)



# start training little by little
# parameters can be tuned here

In [16]:
NUMBER_OF_EPISODES = 10000
EPISODE_LENGTH = 1000
BATCHSIZE =200
MIN_BUFFER_SIZE = 60000
UPDATE_EVERY = 1
# amplitude of OU noise
# this slowly decreases to 0
# instead of resetting noise to 0 every episode, we let it decrease to 0 over a few episodes

NOISE = 1
NO_NOISE_AFTER = 3000
NOISE_DECAY = 0.9999

BUFFER_SIZE = 300000

IN_ACTOR_DIM = 24 
HIDDEN_ACTOR_IN_DIM = 400
HIDDEN_ACTOR_OUT_DIM = 400
OUT_ACTOR_DIM = 2

# Critic input contains both states AND all the actions of all the agents
# there are 2 agents, so 24*2 + 2*2 = 52
IN_CRIT_S = IN_ACTOR_DIM  * num_agents 
IN_CRIT_A = action_size * num_agents
HIDDEN_CRIT_IN_DIM = 400
HIDDEN_CRIT_OUT_DIM = 400
OUT_CRIT_DIM = 1

# how many periods before update

SEED = 6
DISC = 0.99
TAU = 0.001
LR_ACT = 0.0001
LR_CRI = 0.001


# these will be used to print rewards for agents
agent0_reward = []
agent1_reward = []
scores_deque = deque(maxlen=100)
best_scores = []
avg_best_score = []
update_t = 0
#     max_state =  env_info_demo.vector_observations[0]
#     max_action = [0,0]
times_updated = 0


In [17]:
# # reset network if did not learn well, do not reset buffer
# maddpg = MADDPG(IN_ACTOR_DIM, HIDDEN_ACTOR_IN_DIM, HIDDEN_ACTOR_OUT_DIM, OUT_ACTOR_DIM,\
#                 IN_CRIT_S, IN_CRIT_A, HIDDEN_CRIT_IN_DIM, HIDDEN_CRIT_OUT_DIM, SEED, LR_ACT, LR_CRI, DISC, TAU)


In [18]:
aaaa =[ [ -7.0089,   0.5568,   3.1098,   2.4874,   6.8317,   1.3903, 3.1098,   2.4874,  -6.0478,   0.7466,   9.6109,   1.5064, 6.8317,   0.4707,   9.6109,   1.5064,  -6.4336,   0.8384, -3.8586,   0.5254,   6.8317,  -0.4489,  -3.8586,   0.5254]]

In [None]:
bbbb = [  0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000, 0.0000,   0.0000,  -7.4364,  -1.5000,  -0.0000,   0.0000, 6.8317,   5.8587,  -0.0000,   0.0000,  -7.1158,  -1.5589, 3.2063,  -0.9810,   6.8317,   5.6429,   3.2063,  -0.9810]

In [None]:
aaaa.append(bbbb)


In [None]:
totest = convert_to_tensor([aaaa,aaaa])

In [None]:
# pre-update do not refresh
actions = maddpg.act(totest) 
actions

[tensor(1.00000e-02 *
        [[-3.6408,  4.4867],
         [-4.1375,  4.6489]]), tensor(1.00000e-02 *
        [[-3.6408,  4.4867],
         [-4.1375,  4.6489]])]

In [None]:
# pre-update do not refresh
actions = maddpg.target_act(totest) 
actions

[tensor(1.00000e-02 *
        [[-3.6408,  4.4867],
         [-4.1375,  4.6489]]), tensor(1.00000e-02 *
        [[-3.6408,  4.4867],
         [-4.1375,  4.6489]])]

In [None]:
# main function that sets up environments
# perform training loop



# use keep_awake to keep workspace from disconnecting
for episode in range(1, NUMBER_OF_EPISODES+1):

    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations

    scores = [0,0]   
    all_t_episode =[]
    
    if episode > NO_NOISE_AFTER:
        NOISE *= NOISE_DECAY

    for episode_t in range(EPISODE_LENGTH):

        state_tensors = convert_to_tensor(states)



        
        actions = maddpg.act(state_tensors, noise = NOISE)
        actions_array = torch.stack(actions).detach().numpy()

        env_info = env.step(actions_array)[brain_name] 
        next_states = env_info.vector_observations
        rewards = env_info.rewards
        dones  = env_info.local_done

        transition = process_data(states, actions_array, rewards, next_states, dones)
        all_t_episode.append(transition)
                
        scores = [sum(x) for x in zip(scores, rewards)]

        states = next_states    


        update_t = (update_t + 1) % UPDATE_EVERY
        
        if len(buffer) > MIN_BUFFER_SIZE and update_t == 0:
            times_updated += 1
            samplesa = buffer.sample(BATCHSIZE)
            samplesb = buffer.sample(BATCHSIZE)
            maddpg.update(samplesa, samplesb)

        if dones[0]:
            if max(scores) > 0.05 :
                how_many_times = 1 if sum(scores) < 0.11 else int(sum(scores)/ 0.08) ** int(sum(scores)/ 0.08) 
                print(round(max(scores),1), NOISE, episode, episode_t, len(buffer), list(zip(*actions_array)), int(sum(scores)/ 0.08) )
                for t, transition in enumerate(all_t_episode):  # t indexes the transition within this episode
                    buffer.push(transition) 
                    if t+3 > len(all_t_episode):
                        continue
                    elif max(transition[3]) > 0.05 or max(all_t_episode[t+1][3]) > 0.05  or max(all_t_episode[t+2][3]) > 0.05:
                        for j in range(min(how_many_times, 200)):
                            buffer.push(transition)          
                            
            elif sum(np.isnan(list(zip(*actions_array))[0])) + sum(np.isnan(list(zip(*actions_array))[1]))  >= 1:               
                print(round(max(scores),1), NOISE, episode, episode_t, len(buffer), list(zip(*actions_array)), list(states), list(next_states))                    

            break                      
                    
    agent0_reward.append(scores[0])
    agent1_reward.append(scores[1])

    best_scores.append(max(scores))
    scores_deque.append(max(scores))    
    avg_best_score.append(np.mean(scores_deque))
    
    
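    # NOTE: rebinding LR_ACT / LR_CRI only changes these notebook variables;
    # the values already passed to the MADDPG constructor are not affected.
    if np.mean(scores_deque) >= 0.4: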
        LR_ACT = 0.00001
        LR_CRI = 0.00005
    elif np.mean(scores_deque) >= 0.3:
        LR_ACT = 0.00005
        LR_CRI = 0.0001       
    elif np.mean(scores_deque) >= 0.15:
        LR_ACT = 0.00008
        LR_CRI = 0.0005    


    # print score every 100 episodes, on the last episode, and once the target is reached
    if episode % 100 == 0 or episode == NUMBER_OF_EPISODES or np.mean(scores_deque) >= 0.5:

        print('\rEpisode {}\tBuffer Len {}\tAverage Last 100 Episodes Score: {:.2f}'.format(episode, len(buffer),np.mean(scores_deque)))
        print("times_updated", times_updated)
        print()


    # problem solved
    if  np.mean(scores_deque)>=0.5:
        print('\nEnvironment solved in {:d} episodes!\tAverage Last 100 Episodes Score: {:.2f}'.format(episode, np.mean(scores_deque)))            

        break     
#     if times_updated == 10:
#         break



print(len(buffer))
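
The stopping rule used in the loop above can be isolated as follows: each episode contributes the larger of the two agents' scores, and training stops once the mean over the most recent 100 such values reaches 0.5. This is a minimal sketch with a hypothetical function name. Like the loop, it also evaluates the mean while the window is only partially filled.

```python
from collections import deque

def first_solved_episode(episode_scores, window=100, target=0.5):
    """Return the 1-based episode at which the rolling mean reaches `target`."""
    recent = deque(maxlen=window)
    for episode, agent_scores in enumerate(episode_scores, start=1):
        recent.append(max(agent_scores))  # best agent's score this episode
        if sum(recent) / len(recent) >= target:
            return episode
    return None

# Three weak episodes followed by one long rally.
history = [(0.0, 0.1)] * 3 + [(2.0, 0.0)]
print(first_solved_episode(history))
print(first_solved_episode([(0.0, 0.0)] * 5))
```

A stricter variant would additionally require `len(recent) == window` before declaring the environment solved, so that a single lucky early episode cannot end training.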





In [None]:
# refresh
actions = maddpg.act(totest) 
actions

In [None]:
states, full_state, actions, rewards, next_states, next_full_state, dones = map(transpose_to_tensor, samples)
full_states = [samples[1], samples[5]]
samples = [states, actions, rewards, next_states, dones]
samples.extend(convert_to_tensor(full_states))  
states, actions, rewards, next_states, dones, full_state, next_full_state = samples
states = torch.stack(states)   
next_states = torch.stack(next_states) 

In [22]:
import copy
totest = copy.deepcopy (states)
totest.shape

torch.Size([2, 3, 24])

In [29]:
# pre-update do not refresh
actions = maddpg.act(totest, noise = 0) 
print(actions[0][:5])
print(actions[1][:5])

tensor([[ 0.2843,  0.0706],
        [ 0.2297,  0.1283],
        [ 0.2186,  0.0077]])
tensor([[ 0.0456,  0.1667],
        [ 0.0543,  0.1656],
        [-0.0531,  0.2090]])


In [30]:

# pre-update do not refresh
actions = maddpg.target_act(totest) 
print(actions[0][:5])
print(actions[1][:5])

tensor([[ 0.5040, -0.0112],
        [-0.4286,  0.0675],
        [ 0.1861,  0.0641]])
tensor([[ 0.3712, -0.0869],
        [ 0.3318, -0.0746],
        [-0.6440, -0.0688]])


In [25]:
# post-update: refresh
actions = maddpg.act(totest) 
print(actions[0][:5])
print(actions[1][:5])

tensor([[ 0.2046,  0.1649],
        [ 0.1590,  0.2091],
        [ 0.1694,  0.0725]])
tensor([[-0.0192,  0.0352],
        [-0.0088,  0.0342],
        [-0.1244,  0.0633]])


In [27]:

# post update, refresh:
actions = maddpg.target_act(totest, noise = 0) 
print(actions[0][:5])
print(actions[1][:5])

tensor([[ 0.5017, -0.0086],
        [-0.4288,  0.0687],
        [ 0.1877,  0.0620]])
tensor([[ 0.3659, -0.0789],
        [ 0.3462, -0.0919],
        [-0.6502, -0.0618]])


In [46]:

# for i in range(5):                                         # play game for 5 episodes
#     env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
#     states = env_info.vector_observations                  # get the current state (for each agent)
#     scores = np.zeros(num_agents)                          # initialize the score (for each agent)
#     while True:
# #         actions = [0.1+i*0.1, -0.1+i*0.1, 0.1+i*0.1, -0.1+i*0.1] # select an action (for each agent)
# #         actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
#         state_tensors = convert_to_tensor(states)
        
#         actions = maddpg.act(state_tensors, noise = 0)
#         actions_array = torch.stack(actions).detach().numpy()
#         env_info = env.step(actions_array)[brain_name]           # send all actions to tne environment
#         next_states = env_info.vector_observations         # get next state (for each agent)
#         rewards = env_info.rewards                         # get reward (for each agent)
#         dones = env_info.local_done                        # see if episode finished
#         scores += env_info.rewards                         # update the score (for each agent)
#         states = next_states                               # roll over states to next time step
#         if np.any(dones):                                  # exit loop if episode finished
#             break
#     print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

In [56]:
print(list(states).append(list(states)))

None


In [63]:
env_info = env.reset(train_mode=True)[brain_name]
states = env_info.vector_observations
a = list(states)
a+a

[array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -7.61386442, -1.5       , -0.        ,  0.        ,
         6.49473906,  5.85873604, -0.        ,  0.        ]),
 array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -6.10514164, -1.5       ,  0.        ,  0.        ,
        -6.49473906,  5.85873604,  0.        ,  0.        ]),
 array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -7.61386442, -1.5       , -0.        ,  0.        ,


In [62]:
a

In [40]:
state_tensors = convert_to_tensor(states)

In [82]:
# target_actions = maddpg.target_act(state_tensors) 

In [45]:
maddpg.act(state_tensors) 

[tensor([-0.0414,  0.1445]), tensor([-0.0947,  0.2239])]

In [66]:
states, full_state, actions, rewards, next_states, next_full_state, dones = map(transpose_to_tensor, samples)
full_states = [samples[1], samples[5]]
samples = [states, actions, rewards, next_states, dones]
samples.extend(convert_to_tensor(full_states))  
states, actions, rewards, next_states, dones, full_state, next_full_state = samples
states = torch.stack(states)   
next_states = torch.stack(next_states) 

In [73]:
actions = maddpg.act(states, noise = 0) 
actions

[tensor([[ 0.1795,  0.4920],
         [-0.2261,  0.0490],
         [-0.0924, -0.0389]]), tensor([[ 0.0824,  0.5297],
         [ 0.1877,  0.0659],
         [-0.2619, -0.0863]])]

In [72]:
actions = maddpg.target_act(states) 
actions

[tensor([[-0.0192,  0.1840],
         [-0.2922,  0.1319],
         [ 0.1376, -0.2479]]), tensor([[ 0.1147,  0.3136],
         [ 0.0838,  0.4019],
         [-0.3635, -0.5947]])]

In [13]:
[[b - a for a, b in  zip(target, cur)] for target, cur in zip(target_actions, actions)]

[[tensor([ 0.2401, -0.0017]),
  tensor([ 0.3762,  0.3117]),
  tensor([-0.0924, -0.1699])],
 [tensor([ 0.5019,  0.0432]),
  tensor([-0.5772, -0.2432]),
  tensor([-0.5193, -0.0081])]]

In [23]:
actions = torch.cat(actions, dim = -1)
maddpg.maddpg_agent[0].critic(full_state, actions)



tensor([[ 0.0000],
        [ 0.0000],
        [ 0.1263],
        [ 0.0000],
        [ 0.0986],
        [ 0.0911],
        [ 0.0246],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.1755],
        [ 0.1201],
        [ 0.0000],
        [ 0.0631],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.1428],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.0028],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.1209],
        [ 0.0676],
        [ 0.0241],
        [ 0.0000],
        [ 0.0000],
        [ 0.0049],
        [ 0.0000],
        [ 0.0459],
        [ 0.0374],
        [ 0.0857],
        [ 0.0000],
        [ 0.0008],
        [ 0.0000],
        [ 0.0000],
        [ 0.0794],
        [ 0.0027],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.

# parameters before acting

In [44]:
list(maddpg.maddpg_agent[0].actor.parameters())

[Parameter containing:
 tensor([ 0.0614,  0.0776,  0.0733, -0.0623,  0.0069,  0.0882,  0.1878,
         -0.0775,  0.1624, -0.1298,  0.0489,  0.0519,  0.1075,  0.0757,
          0.0869,  0.0980,  0.1148,  0.1460, -0.0329, -0.1889, -0.1866,
          0.0521, -0.1902, -0.1207, -0.1146,  0.1954, -0.1872,  0.0555,
         -0.0057, -0.1293, -0.0743,  0.0381, -0.1553, -0.1234, -0.1482,
          0.1442, -0.0027,  0.1010,  0.0871,  0.1273,  0.0889, -0.0430,
         -0.1131,  0.1832, -0.1927, -0.1657, -0.1234, -0.0382,  0.1853,
         -0.0689,  0.1911,  0.1133, -0.0692,  0.0568, -0.0260,  0.1529,
          0.0542, -0.1123,  0.0204, -0.1332, -0.0745, -0.1476,  0.1090,
          0.1588,  0.0445,  0.0001, -0.1284, -0.1406, -0.1908, -0.0132,
          0.0420,  0.1286,  0.1713, -0.1106,  0.1939, -0.1120,  0.1367,
          0.1041,  0.1672, -0.0961, -0.1855, -0.1575,  0.0260, -0.1187,
          0.1208, -0.1921,  0.1859,  0.1075, -0.0026, -0.1263, -0.0985,
         -0.1018,  0.0053, -0.0533,  0.14

In [80]:
maddpg.maddpg_agent[0].act(state_tensors[0])

tensor([[ 0.2804,  0.0672]])

In [None]:
print(actions)

In [None]:
print(actions[1].shape)

In [None]:
print(torch.cat(actions, dim = -1))

In [None]:
next_full_state

In [None]:
isinstance(actions, list)

In [None]:
type(actions[0][0])

In [None]:
target_actions

In [None]:
len(next_states.shape)

In [None]:
isinstance(actions[0], torch.Tensor)

# Test training parameters step by step

In [13]:
# setting parameters
# number of parallel environment, each environment has 2 agents
# this would generate more experience and smooth things out
# PARALLEL_ENVS = 1
# Here we only have 1 env for simplicity

# number of training episodes.
# change this to higher number to experiment. say 30000.
NUMBER_OF_EPISODES = 12
EPISODE_LENGTH = 1000
BATCHSIZE =3
MIN_BUFFER_SIZE = 3
UPDATE_EVERY = 1
# amplitude of OU noise
# this slowly decreases to 0
# instead of resetting noise to 0 every episode, we let it decrease to 0 over a few episodes
NOISE = 1
NO_NOISE_AFTER = 10000
NOISE_DECAY = 0.9999

BUFFER_SIZE = 1000000

IN_ACTOR_DIM = 24 
HIDDEN_ACTOR_IN_DIM = 100
HIDDEN_ACTOR_OUT_DIM = 100
OUT_ACTOR_DIM = 2

# Critic input contains both states AND all the actions of all the agents
# there are 2 agents, so 24*2 + 2*2 = 52
IN_CRIT_S = IN_ACTOR_DIM  * num_agents 
IN_CRIT_A = action_size * num_agents
HIDDEN_CRIT_IN_DIM = 100
HIDDEN_CRIT_OUT_DIM = 100
OUT_CRIT_DIM = 1

# how many periods before update

SEED = 6
DISC = 0.99
TAU = 0.001
LR_ACT = 1.e-4
LR_CRI = 1.e-3



# Initialization
def process_data (states, actions, rewards, next_states, dones):
    full_state = states.flatten()
    next_full_state = next_states.flatten()
    return (states, full_state, actions, rewards, next_states, next_full_state, dones)

np.random.seed(SEED)
torch.manual_seed(SEED)
t = 0


# torch.set_num_threads(PARALLEL_ENVS)
# env = envs.make_parallel_env(PARALLEL_ENVS)

# keep 5000 episodes worth of replay


# initialize policy and critic through MADDPG
maddpg = MADDPG(IN_ACTOR_DIM, HIDDEN_ACTOR_IN_DIM, HIDDEN_ACTOR_OUT_DIM, OUT_ACTOR_DIM,\
                IN_CRIT_S, IN_CRIT_A, HIDDEN_CRIT_IN_DIM, HIDDEN_CRIT_OUT_DIM, SEED, LR_ACT, LR_CRI, DISC, TAU)

# these will be used to print rewards for agents
agent0_reward = []
agent1_reward = []
scores_deque = deque(maxlen=100)
best_scores = []
avg_best_score = []
update_t = 0
#     max_state =  env_info_demo.vector_observations[0]
#     max_action = [0,0]
times_updated = 0
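
The critic input size noted in the parameter cell can be checked with a short, framework-free sketch. It assumes the two-agent Tennis setup used here (24 observations and 2 actions per agent). `build_critic_input` is a hypothetical helper for illustration, and the ordering (all states first, then all actions) is an assumption rather than something confirmed by `maddpg.py`.

```python
# Sketch: size of the centralized critic input for 2 agents.
# build_critic_input is a hypothetical helper, not part of maddpg.py.
NUM_AGENTS = 2
STATE_SIZE = 24   # observations per agent (IN_ACTOR_DIM)
ACTION_SIZE = 2   # actions per agent (OUT_ACTOR_DIM)


def build_critic_input(states, actions):
    """Concatenate every agent's observation, then every agent's action."""
    full_state = [x for agent_obs in states for x in agent_obs]
    full_action = [a for agent_act in actions for a in agent_act]
    return full_state + full_action


# Dummy zero-valued observations and actions, one row per agent.
states = [[0.0] * STATE_SIZE for _ in range(NUM_AGENTS)]
actions = [[0.0] * ACTION_SIZE for _ in range(NUM_AGENTS)]
critic_input = build_critic_input(states, actions)
print(len(critic_input))  # 24*2 + 2*2 = 52
```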


In [14]:
buffer = ReplayBuffer(int(BUFFER_SIZE))
# # maddpg = MADDPG(IN_ACTOR_DIM, HIDDEN_ACTOR_IN_DIM, HIDDEN_ACTOR_OUT_DIM, OUT_ACTOR_DIM,\
# #                 IN_CRIT_S, IN_CRIT_A, HIDDEN_CRIT_IN_DIM, HIDDEN_CRIT_OUT_DIM, SEED, LR_ACT, LR_CRI, DISC, TAU)
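
The training loop in this notebook multiplies `NOISE` by `NOISE_DECAY` once per step, but only after the buffer holds more than `NO_NOISE_AFTER` transitions. A rough sketch of how slowly that schedule shrinks the amplitude, using the decay value from the parameter cell (`steps_until` is a hypothetical helper introduced only for this illustration):

```python
# Sketch of the multiplicative noise schedule; steps_until is a hypothetical helper.
NOISE_DECAY = 0.9999


def steps_until(threshold, noise=1.0, decay=NOISE_DECAY):
    """Count decay steps needed for the noise amplitude to fall to threshold or below."""
    steps = 0
    while noise > threshold:
        noise *= decay
        steps += 1
    return steps


half_life = steps_until(0.5)
print(half_life)  # number of decay steps for the amplitude to halve
```

With a decay of 0.9999 the amplitude needs several thousand decay steps to halve, so in short test runs the noise stays close to its initial value.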

In [15]:
# main function that sets up environments
# perform training loop



# use keep_awake to keep workspace from disconnecting
for episode in range(1, NUMBER_OF_EPISODES+1):

    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations

    scores = [0,0]   
    all_t_episode =[]

    for episode_t in range(EPISODE_LENGTH):

        state_tensors = convert_to_tensor(states)

        if len(buffer) > NO_NOISE_AFTER:
            NOISE *= NOISE_DECAY

        
        actions = maddpg.act(state_tensors, noise = NOISE)
        actions_array = torch.stack(actions).detach().numpy()

        env_info = env.step(actions_array)[brain_name] 
        next_states = env_info.vector_observations
        rewards = env_info.rewards
        dones  = env_info.local_done

        transition = process_data(states, actions_array, rewards, next_states, dones)
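        # NOTE: the push is disabled in this test run, so the buffer stays empty
        # and the update branch below never fires; uncomment to actually train.
        # buffer.push(transition)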
                
        scores = [sum(x) for x in zip(scores, rewards)]

        states = next_states    


        update_t = (update_t + 1) % UPDATE_EVERY
        
        if len(buffer) > MIN_BUFFER_SIZE and update_t == 0:
            times_updated += 1
            samplesa = buffer.sample(BATCHSIZE)
            samplesb = buffer.sample(BATCHSIZE)
            maddpg.update(samplesa, samplesb)

        if dones[0]:
            # prints: best score, noise amplitude, episode, step, buffer length, last actions
            print(round(max(scores),1), NOISE, episode, episode_t, len(buffer), list(zip(*actions_array)))
            break
                    
    agent0_reward.append(scores[0])
    agent1_reward.append(scores[1])

    best_scores.append(max(scores))
    scores_deque.append(max(scores))    
    avg_best_score.append(np.mean(scores_deque))






    # print progress every 100 episodes, at episode NUMBER_OF_EPISODES-1, or once the average reaches 0.5
    if episode % 100 == 0 or episode == NUMBER_OF_EPISODES-1 or  np.mean(scores_deque)>=0.5:

        print('\rEpisode {}\tBuffer Len {}\tAverage Last 100 Episodes Score: {:.2f}'.format(episode, len(buffer),np.mean(scores_deque)))
        print("times_updated", times_updated)
        print()


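    # stop when solved (average >= 0.5); the times_updated check is a shortcut
    # that ends this step-by-step test after two updates
    if np.mean(scores_deque) >= 0.5 or times_updated > 1: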
        print('\nEnvironment solved in {:d} episodes!\tAverage Last 100 Episodes Score: {:.2f}'.format(episode, np.mean(scores_deque)))            

        break     
#     if times_updated == 10:
#         break



print(len(buffer))





0.0 1 1 14 0 [(0.2395658, 0.900802), (-0.18102472, 0.79633856)]
0.0 1 2 13 0 [(-0.27538237, 0.8652375), (-0.22752392, -0.04210297)]
0.0 1 3 13 0 [(0.49065253, -0.2558384), (0.10818676, 0.0653445)]
0.0 1 4 13 0 [(0.13746555, -0.20754811), (0.35974222, -0.0018371493)]
0.0 1 5 13 0 [(0.8821971, 0.5649713), (0.82261586, 0.37034607)]
0.0 1 6 14 0 [(0.45597395, 0.044246167), (-0.40887004, 0.72332627)]
0.0 1 7 13 0 [(-0.28096032, 0.000490278), (-0.07019459, -0.153855)]
0.0 1 8 13 0 [(-0.08594951, 1.0), (0.14174736, -0.53325504)]
0.0 1 9 13 0 [(0.033820122, -0.5606962), (-0.020705104, -0.56146836)]
0.0 1 10 13 0 [(-0.4639236, 0.6014853), (0.36051536, 0.37391606)]
0.0 1 11 13 0 [(-0.11057241, 0.8189635), (0.075384736, 1.0)]
Episode 11	Buffer Len 0	Average Last 100 Episodes Score: 0.00
times_updated 0

0.0 1 12 14 0 [(0.76398396, -0.1391693), (-0.5609775, -0.24289367)]
0
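
The stopping rule in the loop above tracks the better of the two agents' scores in each episode and averages it over the last 100 episodes. A minimal sketch of that bookkeeping follows; the episode scores are made up for illustration and are not results from this run, and `record_episode` is a hypothetical helper.

```python
from collections import deque

# Sketch of the rolling 100-episode average; the scores below are made up.
scores_deque = deque(maxlen=100)


def record_episode(agent_scores, window=scores_deque):
    """Store the best agent score of the episode and return the rolling mean."""
    window.append(max(agent_scores))
    return sum(window) / len(window)


avg = 0.0
for episode_scores in [(0.0, 0.1), (0.6, 0.5), (1.0, 0.9)]:
    avg = record_episode(episode_scores)

# Same threshold as the notebook's "solved" check.
solved = avg >= 0.5
print(round(avg, 3), solved)
```

Because the deque has `maxlen=100`, older episodes drop out automatically once more than 100 have been recorded.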


In [17]:
states, full_state, actions, rewards, next_states, next_full_state, dones = map(transpose_to_tensor, samples)
full_states = [samples[1], samples[5]]
samples = [states, actions, rewards, next_states, dones]
samples.extend(convert_to_tensor(full_states))  
states, actions, rewards, next_states, dones, full_state, next_full_state = samples
states = torch.stack(states)   
next_states = torch.stack(next_states) 

In [18]:
import copy
totest = copy.deepcopy (states)
totest

tensor([[[ -7.0089,   0.5568,   3.1098,   2.4874,   6.8317,   1.3903,
            3.1098,   2.4874,  -6.0478,   0.7466,   9.6109,   1.5064,
            6.8317,   0.4707,   9.6109,   1.5064,  -6.4336,   0.8384,
           -3.8586,   0.5254,   6.8317,  -0.4489,  -3.8586,   0.5254],
         [  0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,
            0.0000,   0.0000,  -7.4364,  -1.5000,  -0.0000,   0.0000,
            6.8317,   5.8587,  -0.0000,   0.0000,  -7.1158,  -1.5589,
            3.2063,  -0.9810,   6.8317,   5.6429,   3.2063,  -0.9810],
         [ -8.3302,  -1.1838,   0.0600,   6.4114,   6.8317,   4.4069,
            0.0600,   6.4114,  -8.0220,  -0.6015,   3.0815,   5.4304,
            6.8317,   3.7986,   3.0815,   5.4304,  -7.8264,  -0.1173,
            1.9561,   4.4494,   6.8317,   3.0923,   1.9561,   4.4494]],

        [[ -0.4000,  -1.8522,  30.0000,  -0.0000,  -6.8317,   1.3903,
           30.0000,  -0.0000,  -0.4000,  -1.8522,  30.0000,  -0.0000,
           -6.8

In [18]:
aaaa =[ [ -7.0089,   0.5568,   3.1098,   2.4874,   6.8317,   1.3903, 3.1098,   2.4874,  -6.0478,   0.7466,   9.6109,   1.5064, 6.8317,   0.4707,   9.6109,   1.5064,  -6.4336,   0.8384, -3.8586,   0.5254,   6.8317,  -0.4489,  -3.8586,   0.5254]]

In [19]:
bbbb = [  0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000, 0.0000,   0.0000,  -7.4364,  -1.5000,  -0.0000,   0.0000, 6.8317,   5.8587,  -0.0000,   0.0000,  -7.1158,  -1.5589, 3.2063,  -0.9810,   6.8317,   5.6429,   3.2063,  -0.9810]

In [20]:
aaaa.append(bbbb)


In [21]:
totest = convert_to_tensor([aaaa,aaaa])

In [33]:
# pre-update output: do not re-run this cell
actions = maddpg.act(totest) 
actions

[tensor([[ 0.1225,  0.0145],
         [ 0.0556, -0.0109]]), tensor([[ 0.1225,  0.0145],
         [ 0.0556, -0.0109]])]

In [34]:
# pre-update output: do not re-run this cell
actions = maddpg.target_act(totest, noise = 0) 
actions

[tensor([[ 0.1225,  0.0145],
         [ 0.0556, -0.0109]]), tensor([[ 0.1225,  0.0145],
         [ 0.0556, -0.0109]])]

In [40]:
# post-update output: re-run this cell after the update
actions = maddpg.act(totest) 
actions

[tensor([[ 0.9491, -0.9785],
         [ 0.9642, -0.9536]]), tensor([[ 0.3634, -0.7679],
         [ 0.2861, -0.8780]])]

In [41]:

# post-update output: re-run this cell after the update
actions = maddpg.target_act(totest, noise = 0) 
actions

[tensor([[ 0.6874, -0.9446],
         [ 0.7541, -0.8920]]), tensor([[ 0.1387, -0.7869],
         [-0.0108, -0.8664]])]
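
The outputs above show the local and target actors agreeing before the update and diverging after it. DDPG-style agents usually keep the target network as a slowly moving copy of the local one through a soft update; whether `maddpg.py` does exactly this is an assumption. A framework-free sketch, with plain lists standing in for network weights and `soft_update` as a hypothetical helper:

```python
# Sketch of a soft (Polyak) target update on plain lists of weights.
# soft_update is a hypothetical helper, not the function in maddpg.py.
TAU = 0.001


def soft_update(local_params, target_params, tau=TAU):
    """Return target weights blended toward the local weights by a factor tau."""
    return [tau * lp + (1.0 - tau) * tp
            for lp, tp in zip(local_params, target_params)]


local = [1.0, -1.0]
target = [0.0, 0.0]
target = soft_update(local, target)
print(target)
```

A small `tau` moves the target only a fraction of the way toward the local weights on each call, which is why the two networks can give different actions for the same input after training steps.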