# Multi-Agent Deep Deterministic Policy Gradients (MADDPG)
# Assignment Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```
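The choice from the list above can also be made programmatically. The following is a minimal sketch, not part of the project code: the helper name `tennis_file_name` is hypothetical, it only covers the Mac and 64-bit entries, and the `path/to/` prefix is the same placeholder as in the list.

```python
import platform

def tennis_file_name(system=None, headless=False):
    """Return the Tennis build path from the list above for the given OS.

    `system` defaults to platform.system(); illustrative helper that
    assumes a Mac or a 64-bit Windows/Linux machine.
    """
    system = system or platform.system()
    if system == "Darwin":
        return "path/to/Tennis.app"
    if system == "Windows":
        return "path/to/Tennis_Windows_x86_64/Tennis.exe"
    if system == "Linux":
        folder = "Tennis_Linux_NoVis" if headless else "Tennis_Linux"
        return "path/to/{}/Tennis.x86_64".format(folder)
    raise ValueError("no Tennis build listed for {!r}".format(system))

print(tennis_file_name("Windows"))
```

The returned string would then be passed as `file_name` to `UnityEnvironment`.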

In [2]:
no_graphics=True
#no_graphics=False
# raw string, so that the backslashes in the Windows path are not treated as escape sequences
env = UnityEnvironment(file_name=r'C:\EigeneLokaleDaten\DeepRL\Value-based-methods\p3_collab-compet\Tennis_Windows_x86_64\Tennis.exe', no_graphics=no_graphics)

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
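The brain summary printed when the environment starts fixes the tensor shapes used in the rest of this notebook. As a quick sanity check, the arithmetic can be spelled out; the variable names are chosen here for illustration, and the critic layout (all observations plus all actions) is the one used in the debugging cells further down.

```python
# Values copied from the TennisBrain summary printed by the environment
obs_size_per_frame = 8      # "Vector Observation space size (per agent)"
stacked_frames = 3          # "Number of stacked Vector Observation"
action_size = 2             # "Vector Action space size (per agent)"
num_agents = 2              # two agents play against each other

# each agent sees 3 stacked frames of 8 values
obs_size = obs_size_per_frame * stacked_frames

# a centralised critic receives every agent's observation and action
critic_input_size = num_agents * (obs_size + action_size)

print(obs_size, critic_input_size)  # 24 52
```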

### 2. It's My Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [4]:
from collections import deque
import torch
import numpy as np
import os

from buffer import ReplayBuffer  # TODO: rewrite buffer; check utilities

# rewritten MADDPG to have actor/critic networks of appropriate shapes, 
# i.e. 24 states and 2 actions per agent
from maddpg import MADDPG       

def seeding(seed=1):
    np.random.seed(seed)
    torch.manual_seed(seed)

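def pre_process(entity, batchsize):
    """Regroup a batch so that each of its 3 components becomes one tensor."""
    processed_entity = []
    for j in range(3):
        # collect component j from every sample in the batch
        # (avoids shadowing the built-in `list`)
        component = [entity[i][j] for i in range(batchsize)]
        processed_entity.append(torch.Tensor(component))
    return processed_entity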

In [5]:
debug_ = False

scores_window = deque(maxlen=100)  # last 100 max scores
scores_window_mean = deque(maxlen=100)  # last 100 mean scores

seeding()
# number of parallel agents
parallel_envs = 1   # start with a single Unity-ML env
# number of training episodes.
training_episods = 10000*3
buffer_length = 100*10000

if debug_:
    batchsize = 3
else:
    batchsize = 128*4 

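UPDATE_EVERY_NTH_STEP = 30  # NOTE: compared against the episode index below, so updates run during every 30th episode
UPDATE_MANY_EPOCHS = 10     # number of update rounds per trigger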

t = 0
    
# amplitude of OU noise
# this decays towards 0: it is multiplied by noise_reduction after every environment step
noise = 4  # 2 before 0.1 not enough?
noise_reduction =  0.9

# how many episodes before update
episode_per_update = 2 * parallel_envs

log_path = os.getcwd()+"/log"
model_dir= os.getcwd()+"/model_dir"
    
os.makedirs(log_path, exist_ok=True)    
os.makedirs(model_dir, exist_ok=True)

torch.set_num_threads(parallel_envs)
    
#from tensorboardX import SummaryWriter
#logger = SummaryWriter(log_dir=log_path)
num_agents = 2 
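Because the training loop multiplies `noise` by `noise_reduction` after every environment step, the amplitude shrinks geometrically. The sketch below counts how many steps it takes with the values from the cell above; the threshold of 0.1 is only an illustrative cut-off and is not used anywhere in the training code.

```python
noise = 4.0
noise_reduction = 0.9
threshold = 0.1   # illustrative cut-off, not a value used by the training code

steps = 0
while noise >= threshold:
    noise *= noise_reduction   # same update as in the training loop
    steps += 1

print(steps)   # 36
```

Since the amplitude `noise` is never reset between episodes, with these settings most of the exploration noise is gone after the first few dozen environment steps.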

In [6]:
data = np.load('state_scale.npz')
scale = data['scale_int']
print(scale)

[21 30 30 30 30 30 30 30 30 30 30 23 23 23 30  8 12 14 30 30 30 30 30 30]
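The `scale` vector has one entry per observation feature, so dividing the `(2, 24)` observation array by it normalises both agents' observations in a single broadcast operation. A small NumPy sketch of that step follows; the `states` values are made up for illustration and are not real environment data.

```python
import numpy as np

# per-feature scale, copied from the printed contents of state_scale.npz
scale = np.array([21, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 23,
                  23, 23, 30, 8, 12, 14, 30, 30, 30, 30, 30, 30])

# hypothetical raw observations for the two agents
states = np.tile(scale, (2, 1)).astype(float)
states[1] *= -0.5

# (2, 24) / (24,) broadcasts the scale over the agent axis
scaled = states / scale

print(scaled.shape)                   # (2, 24)
print(scaled[0, :3], scaled[1, :3])
```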


In [7]:
# keep 1e6 samples of replay
buffer = ReplayBuffer(int(buffer_length))  #

print('batchsize',batchsize)

# initialize policy and critic
maddpg = MADDPG()
agent0_reward = []
agent1_reward = []

env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
states = states/scale
actions = maddpg.act(torch.from_numpy(states).unsqueeze(0).float(), noise=noise)
actions_array = torch.stack(actions).detach().numpy()

env_info = env.step(actions_array.squeeze())[brain_name]           # send all actions to the environment
next_states = env_info.vector_observations         # get next state (for each agent)
next_states = next_states/scale
rewards = env_info.rewards                         # get reward (for each agent)

not_yet_shown = True
max_100_average_score = -1

for i_episode in range(1, training_episods + 1):           # train for training_episods many episodes
    env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
    maddpg.rest_noise()                                    # reset the noise object
    
    states = env_info.vector_observations                  # get the current state (for each agent)
    states = states / scale
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)        
    jj = 0 
    while True:
        jj += 1 
        actions = maddpg.act(torch.from_numpy(states).unsqueeze(0).float(), noise=noise)
        noise *= noise_reduction            
        actions_array = torch.stack(actions).detach().numpy().squeeze()
        
        if debug_ :
            print('actions_array type',type(actions_array))
            print('actions_array shape',actions_array.shape)
       
        env_info = env.step(actions_array)[brain_name]     # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        next_states = next_states / scale
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        
        transition = ([states], [actions_array], [rewards], [next_states], [dones])
        buffer.push(transition)
        
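        # run updates on every step of each UPDATE_EVERY_NTH_STEP-th episode,
        # once the buffer holds more than 10 batches of transitions
        # (old condition, based on episode_per_update:)
        #if len(buffer) > batchsize*10 and i_episode % episode_per_update < parallel_envs: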
        
        if len(buffer) > batchsize*10 and i_episode % UPDATE_EVERY_NTH_STEP == 0:          
            for k in range(UPDATE_MANY_EPOCHS):
                for a_i in range(2):
                    samples = buffer.sample(batchsize)
                    maddpg.update(samples, a_i)
            maddpg.update_targets() #soft update the target network towards the actual networks
                
        
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break

    scores_window.append(scores.max())             # episode score: max over both agents
    scores_window_mean.append(scores.mean())       # episode score: mean over both agents
    if np.mean(scores_window) >= 0.5 and not_yet_shown:
        print('\rEpisode {}\tAverage Score: {:.2f}\tScore: {:.2f}'.format(i_episode, np.mean(scores_window), scores.max()))
        print('Assignment -DONE-')
        not_yet_shown = False
    
    if max_100_average_score <  np.mean(scores_window):
        max_100_average_score = np.mean(scores_window)
    print('\rEpisode {}\tAverage <Score>: {:.2f}\tAverage Max Score: {:.2f}\tMax Score: {:.2f}\tMax Average Max Score: {:.2f}'.format(i_episode, np.mean(scores_window_mean), np.mean(scores_window), np.max(scores_window), max_100_average_score), end="")                
    if i_episode % 1000 == 0:
        print('\rEpisode {}\tAverage <Score>: {:.2f}\tAverage Max Score: {:.2f}\tMax Score: {:.2f}\tMax Average Max Score: {:.2f}'.format(i_episode,  np.mean(scores_window_mean), np.mean(scores_window), np.max(scores_window), max_100_average_score))        

            
    #print('Score (max over agents) from episode {}: {} , steps: {}'.format(i, np.max(scores),jj))

print('')
print('Stop it...')

batchsize 512
init OUNoise with dim= 2
init OUNoise with dim= 2
Episode 329	Average <Score>: 0.00	Average Max Score: 0.01	Max Score: 0.10	Max Average Max Score: 0.02

  make_tensor = lambda x: torch.tensor(x, dtype=torch.float)


Episode 1000	Average <Score>: -0.00	Average Max Score: 0.00	Max Score: 0.00	Max Average Max Score: 0.02
Episode 2000	Average <Score>: -0.00	Average Max Score: 0.00	Max Score: 0.00	Max Average Max Score: 0.02
Episode 3000	Average <Score>: 0.03	Average Max Score: 0.05	Max Score: 0.20	Max Average Max Score: 0.14
Episode 4000	Average <Score>: 0.12	Average Max Score: 0.16	Max Score: 0.90	Max Average Max Score: 0.16
Episode 5000	Average <Score>: 0.15	Average Max Score: 0.18	Max Score: 1.00	Max Average Max Score: 0.18
Episode 5037	Average Score: 0.52	Score: 2.60
Assignment -DONE-
Episode 6000	Average <Score>: 0.58	Average Max Score: 0.61	Max Score: 2.70	Max Average Max Score: 2.50
Episode 6239	Average <Score>: 1.93	Average Max Score: 1.96	Max Score: 2.70	Max Average Max Score: 2.50

KeyboardInterrupt: 
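The "Assignment -DONE-" line is printed the first time the moving average of the per-episode score (the maximum over both agents) over the last 100 episodes reaches 0.5. The check can be isolated as follows; the function name and the synthetic score history are assumptions made for this sketch.

```python
from collections import deque

def first_solved_episode(episode_scores, target=0.5, window=100):
    """Return the first episode (1-based) at which the moving average of the
    per-episode max-over-agents score reaches `target`, or None."""
    scores_window = deque(maxlen=window)
    for i_episode, agent_scores in enumerate(episode_scores, start=1):
        scores_window.append(max(agent_scores))
        if sum(scores_window) / len(scores_window) >= target:
            return i_episode
    return None

# synthetic scores for illustration only: 100 weak episodes, then strong ones
history = [(0.0, 0.1)] * 100 + [(1.0, 0.9)] * 100
print(first_solved_episode(history))   # 145
```

In the synthetic history the window needs 45 strong episodes before its average crosses 0.5, which is why the reported episode lags well behind the point where the agents start playing well.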

In [8]:
# save the model (network weights plus optimizer states)
save_dict_list = []
for i in range(2):
    save_dict = {'actor_params': maddpg.maddpg_agent[i].actor.state_dict(),
                 'actor_optim_params': maddpg.maddpg_agent[i].actor_optimizer.state_dict(),
                 'critic_params': maddpg.maddpg_agent[i].critic.state_dict(),
                 'critic_optim_params': maddpg.maddpg_agent[i].critic_optimizer.state_dict()}
    save_dict_list.append(save_dict)
torch.save(save_dict_list,
           os.path.join(model_dir, 'Run5_episode-{}.pt'.format(i_episode)))

# save a reduced checkpoint (network weights only)
save_dict_list = []
for i in range(2):
    save_dict = {'actor_params': maddpg.maddpg_agent[i].actor.state_dict(),
                 'critic_params': maddpg.maddpg_agent[i].critic.state_dict()}
    save_dict_list.append(save_dict)
torch.save(save_dict_list,
           os.path.join(model_dir, 'Run5_reduced_episode-{}.pt'.format(i_episode)))





# Debugging/Checking the Code Elements....

In [11]:
from utilities import soft_update, transpose_to_tensor, transpose_list
samples = buffer.sample(batchsize)
obs, action, reward, next_obs, done = map(transpose_to_tensor, samples)
actions = maddpg.act(obs)
target_actions = maddpg.target_act(next_obs)
print('Actions:',actions[0])
print('Target Actions:',target_actions[0])
print('Actions:',actions[1])
print('Target Actions:',target_actions[1])

Actions: tensor([[-0.9839,  0.8409],
        [-0.9907,  0.9475],
        [-0.9735,  0.9821],
        ...,
        [-0.9914,  0.9628],
        [-0.9842,  0.9752],
        [-0.9856,  0.9375]], grad_fn=<AddBackward0>)
Target Actions: tensor([[-0.9827,  0.8484],
        [-0.9895,  0.9705],
        [-0.9454,  0.8281],
        ...,
        [-0.9919,  0.9661],
        [-0.9791,  0.9751],
        [-0.9901,  0.9651]], grad_fn=<AddBackward0>)
Actions: tensor([[0.9859, 0.9818],
        [0.9948, 0.9958],
        [0.9983, 0.9906],
        ...,
        [0.9831, 0.9839],
        [0.9979, 0.9948],
        [0.9819, 0.9631]], grad_fn=<AddBackward0>)
Target Actions: tensor([[0.9828, 0.9795],
        [0.9950, 0.9903],
        [0.9381, 0.9304],
        ...,
        [0.9814, 0.9835],
        [0.9973, 0.9929],
        [0.9951, 0.9900]], grad_fn=<AddBackward0>)
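The target networks queried by `target_act` trail the local networks, because `update_targets()` only blends a fraction of the local weights into them (a soft update). Below is a minimal sketch of that rule on plain lists of floats. The mixing factor `tau` and the function name are assumptions, since the values used inside `maddpg.py` are not shown in this notebook.

```python
def soft_update_params(target, source, tau):
    """Blend `source` into `target`: target <- tau * source + (1 - tau) * target."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]

local_weights = [1.0, -2.0, 0.5]
target_weights = [0.0, 0.0, 0.0]

for _ in range(3):   # three consecutive soft updates
    target_weights = soft_update_params(target_weights, local_weights, tau=0.1)

print(target_weights)
```

After `k` updates towards fixed local weights, the target has covered a fraction `1 - (1 - tau)**k` of the distance, so a small `tau` keeps the bootstrapping targets slowly moving.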


In [9]:
import logging
#https://realpython.com/python-logging/
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)
#logging.basicConfig(filename='C:\EigeneLokaleDaten\DeepRL\Value-based-methods\p3_collab-compet\test.log', filemode='a', format='%(name)s - %(levelname)s - %(message)s', level=logging.DEBUG)
logging.basicConfig(filename='test.txt')
print(os.getcwd())
# Creating an object
logger = logging.getLogger()

# Setting the threshold of logger to DEBUG
logger.setLevel(logging.DEBUG)

logger.debug('This will get logged')
logger.warning('This will get logged to a file')
logger.critical("Internet is down")

c:\EigeneLokaleDaten\DeepRL\Value-based-methods\p3_collab-compet


In [55]:
### DEBUG MADDPG-UPDATE

from utilities import soft_update, transpose_to_tensor, transpose_list
samples = buffer.sample(batchsize)
obs, action, reward, next_obs, done = map(transpose_to_tensor, samples)

obs_full = torch.stack(obs,dim=1)
n = obs_full.shape[0]
obs_full = obs_full.reshape(n,2*24)        
next_obs_full = torch.stack(next_obs,dim=1)
next_obs_full = next_obs_full.reshape(n,2*24)

'''
print(n)
print(len(obs))
print(obs[0].shape)
print(obs[1].shape)
print(obs[0][704,:])
print(obs[1][704,:])
print(obs_full[704,:])
print('#################')

print(len(next_obs))
print(next_obs[0].shape)
print(next_obs[1].shape)
print(next_obs[0][704,:])
print(next_obs[1][704,:])
print(next_obs_full.shape)
'''
print(next_obs_full[704,:])


#agent = self.maddpg_agent[agent_number]
#agent.critic_optimizer.zero_grad()

target_actions = maddpg.target_act(next_obs)
target_actions = torch.cat(target_actions, dim=1)

print('Actions:shape',target_actions.shape)
print('Actions: 704:',target_actions[704,:])


target_critic_input = torch.cat((next_obs_full,target_actions), dim=1)
print('target_critic_input:',target_critic_input.shape)
print('target_critic_input: 704',target_critic_input[704,:])
with torch.no_grad():
    q_next = maddpg.return_agent(0).target_critic(target_critic_input)
print('Qnext shape',q_next.shape)
print('Qnext 704:',q_next[704])


discount_factor = 0.95
agent_number = 0 

print(reward[0].shape)
print(done[agent_number].shape)
#print((1 - done[agent_number].view(-1, 1)))
#print(reward[agent_number].view(-1, 1))
#print(reward[agent_number])
y = reward[agent_number].view(-1, 1) + discount_factor * q_next * (1 - done[agent_number].view(-1, 1))
print(y.shape)
print('Y 704:',y[704])

action = torch.cat(action, dim=1)
print(action.shape)
print('Actions: 704:',action[704,:])
### OLD -> critic_input = torch.cat((obs_full.t(), action), dim=1).to(device)
critic_input = torch.cat((obs_full, action), dim=1)
print(critic_input.shape)
#print(obs_full[704,:])
print(critic_input[704,:])
q = maddpg.return_agent(0).critic(critic_input)
print(q.shape)
print('Q 704:',q[704,:])

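huber_loss = torch.nn.SmoothL1Loss()
critic_loss = huber_loss(q, y.detach())
print(critic_loss.shape)  # scalar: SmoothL1Loss averages over the batch by default
print(critic_loss)
# cross-check: as long as every |q - y| < 1, the Huber loss equals half the mean squared error
print(np.mean(0.5 * (q.detach().numpy() - y.detach().numpy())**2))  # detach() returns a new tensor,
                                                                # detached from the current graph;
                                                                # the result never requires gradient
## how to debug this??
maddpg.return_agent(0).critic_optimizer.zero_grad()  # clear stale gradients before backpropagating
critic_loss.backward()
maddpg.return_agent(0).critic_optimizer.step()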


tensor([ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
        -7.6477, -1.5000, -0.0000,  0.0000,  6.7403,  5.9765, -0.0000,  0.0000,
        -6.4677, -1.5589, 11.8004, -0.9810,  6.7403,  5.8587, 11.8004, -0.9810,
         0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
        -7.7758, -1.5000,  0.0000,  0.0000, -6.7403,  5.9765,  0.0000,  0.0000,
        -6.3379, -1.5589, 14.3793, -0.9810, -6.7403,  5.8587, 14.3793, -0.9810])
Actions:shape torch.Size([1000, 4])
Actions: 704: tensor([ 0.0214,  0.4292, -0.9239, -0.4335], grad_fn=<SliceBackward0>)
target_critic_input: torch.Size([1000, 52])
target_critic_input: 704 tensor([ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
        -7.6477, -1.5000, -0.0000,  0.0000,  6.7403,  5.9765, -0.0000,  0.0000,
        -6.4677, -1.5589, 11.8004, -0.9810,  6.7403,  5.8587, 11.8004, -0.9810,
         0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
        -
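The critic part of the cell above boils down to two formulas: the TD target `y = r + discount_factor * q_next * (1 - done)` and a Huber (smooth L1) loss between `q` and `y`. The NumPy sketch below reproduces both on a made-up batch of three transitions. The function names are chosen for illustration, and `beta=1.0` mirrors the default of `torch.nn.SmoothL1Loss`.

```python
import numpy as np

def td_target(reward, q_next, done, discount_factor=0.95):
    """y = r + gamma * Q'(s', a') * (1 - done), element-wise over the batch."""
    return reward + discount_factor * q_next * (1.0 - done)

def huber(q, y, beta=1.0):
    """Mean Huber (smooth L1) loss: quadratic below `beta`, linear above."""
    err = np.abs(q - y)
    per_sample = np.where(err < beta, 0.5 * err**2 / beta, err - 0.5 * beta)
    return per_sample.mean()

# made-up batch of three transitions; the last one is terminal
reward = np.array([[0.0], [0.1], [-0.01]])
q_next = np.array([[0.5], [0.4], [0.3]])
done = np.array([[0.0], [0.0], [1.0]])
q = np.array([[0.4], [0.5], [0.0]])

y = td_target(reward, q_next, done)
print(y.ravel())
print(huber(q, y), np.mean(0.5 * (q - y)**2))
```

For the terminal transition the bootstrap term vanishes and the target is just the reward. All errors in this batch are below 1, so the two printed loss values agree, which is the same cross-check the cell above performs.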

In [75]:
#### ACTOR 
maddpg_agents = [maddpg.return_agent(0),maddpg.return_agent(1)]
q_input = [ maddpg_agents[i].actor(ob) if i == agent_number \
                   else maddpg_agents[i].actor(ob).detach()
                   for i, ob in enumerate(obs) ]

print(len(q_input))
print(q_input[0].shape)
print(q_input[1].shape)
print(q_input[0][704,:])
print(q_input[1][704,:])


q_input = torch.cat(q_input, dim=1)
print(q_input.shape)
print(q_input[704,:])

q_input2 = torch.cat((obs_full, q_input), dim=1)
print(q_input2.shape)
print(q_input2[704,:])
actor_loss = -maddpg.return_agent(0).critic(q_input2).mean()
print(actor_loss)

maddpg.return_agent(0).actor_optimizer.zero_grad()  # clear stale gradients before backpropagating
actor_loss.backward()
torch.nn.utils.clip_grad_norm_(maddpg.return_agent(0).actor.parameters(),0.5)   ## added this clipping... which was commented out in the original
maddpg.return_agent(0).actor_optimizer.step()

print(actor_loss.cpu().detach().item())
print(critic_loss.cpu().detach().item())

2
torch.Size([1000, 2])
torch.Size([1000, 2])
tensor([-0.0426,  0.1942], grad_fn=<SliceBackward0>)
tensor([-0.7345, -0.3538])
torch.Size([1000, 4])
tensor([-0.0426,  0.1942, -0.7345, -0.3538], grad_fn=<SliceBackward0>)
torch.Size([1000, 52])
tensor([ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
        -7.6477, -1.5000, -0.0000,  0.0000,  6.7403,  5.9765, -0.0000,  0.0000,
         0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
        -7.7758, -1.5000,  0.0000,  0.0000, -6.7403,  5.9765,  0.0000,  0.0000,
        -0.0426,  0.1942, -0.7345, -0.3538], grad_fn=<SliceBackward0>)
tensor(-0.6452, grad_fn=<NegBackward0>)
-0.6451784372329712
0.043362393975257874
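Written out, the quantity minimised for agent 0 in the cell above is the negated mean critic value over the batch, with only agent 0's action carrying gradients. The notation is introduced here for illustration: $o_k$ is the concatenated observation `obs_full` of sample $k$, $o_{j,k}$ is agent $j$'s own observation, $\mu_j$ is agent $j$'s actor, $Q_0$ is agent 0's critic, $N$ is the batch size, and the bar marks an output that is detached from the graph.

$$
L_{\text{actor}}^{(0)} = -\frac{1}{N} \sum_{k=1}^{N} Q_0\big(o_k,\; \mu_0(o_{0,k}),\; \bar{\mu}_1(o_{1,k})\big)
$$

The first printed loss value above, -0.6452, is this expression evaluated on the sampled batch.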


In [11]:
## TESTING BUFFER push/pull
#buffer = ReplayBuffer(int(5000*episode_length))  ## buffer ToDo

dones = env_info.local_done                        # see if episode finished
print(dones)
#assert 1== 0
dones = [True,False]

def transpose_to_tensor(input_list):
    print('lets go')
    make_tensor = lambda x: torch.tensor(x, dtype=torch.float)
    return list(map(make_tensor, zip(*input_list)))

transition = ([states], actions_array, [rewards], [next_states], [dones])
for jj in range(10):
    buffer.push(transition)

samples = buffer.sample(3)
print('######')
print(type(samples))
print(len(samples))
print(samples)
print('######')

'''
print([rewards])
#obs, obs_full, action, reward, next_obs, next_obs_full, done = map(transpose_to_tensor, samples)
obs, action, reward, next_obs,  done = map(transpose_to_tensor, samples)
print(reward)
#obs, action, reward,  next_obs,  done   = map(transpose_to_tensor,(states, actions_array, [rewards], next_states, [dones]))
'''


obs, action, reward, next_obs,  done = map(transpose_to_tensor, samples)
'''
print('obs:',obs[0].shape)
print('obs:',len(obs))
print('rewards:',reward)
print('dones:',done)
'''
print('obs:',obs)

obs_full = torch.stack(obs,dim=1)
n = obs_full.shape[0]
obs_full = obs_full.reshape(n,2*24)
print('obs_full:',obs_full)
print('obs_full _shape',obs_full.shape)


print('next_obs:',next_obs)
next_obs_full = torch.stack(next_obs,dim=1)
#next_obs_full = next_obs_full.reshape(n,2*24)
print('next_obs_full:',next_obs_full)

'''
obs_full = torch.stack(obs_full)
next_obs_full = torch.stack(next_obs_full)
print('next_obs_full:',next_obs_full)
'''


[True, True]
######
<class 'list'>
5
[[array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -6.95863676, -1.5       , -0.        ,  0.        ,
        -6.47696209,  5.96076012, -0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -7.71691895, -1.5       ,  0.        ,  0.        ,
         6.47696209,  5.96076012,  0.        ,  0.        ]]), array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -6.95863676, 

"\nobs_full = torch.stack(obs_full)\nnext_obs_full = torch.stack(next_obs_full)\nprint('next_obs_full:',next_obs_full)\n"

In [17]:
print(states)

[[ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.         -6.68513823 -1.5
  -0.          0.         -6.70838642  5.96076012 -0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.         -6.23882484 -1.5
   0.          0.          6.70838642  5.96076012  0.          0.        ]]
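Arrays like the `states` printed above are what gets pushed into the replay buffer as part of each transition. The `ReplayBuffer` class itself lives in `buffer.py` and is not shown in this notebook, so the following is only a sketch of the general idea under that caveat: a bounded FIFO store with uniform random sampling. The class name and the dummy transitions are hypothetical.

```python
import random
from collections import deque

class SimpleReplayBuffer:
    """Fixed-size FIFO store of transitions with uniform random sampling."""

    def __init__(self, size):
        self.memory = deque(maxlen=size)   # oldest transitions drop out first

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batchsize):
        return random.sample(list(self.memory), batchsize)

    def __len__(self):
        return len(self.memory)

buffer_sketch = SimpleReplayBuffer(size=5)
for step in range(8):
    # (state, action, reward, next_state, done) with dummy scalar contents
    buffer_sketch.push((step, 0.0, 0.1, step + 1, False))

batch = buffer_sketch.sample(3)
print(len(buffer_sketch), sorted(t[0] for t in batch))
```

Unlike this sketch, the notebook's `buffer.sample` returns the batch regrouped by field (its result has length 5), which is why it can be passed directly to `map(transpose_to_tensor, samples)`.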


In [32]:
obs, action, reward, next_obs, done = map(transpose_to_tensor, samples)

lets go
lets go
lets go
lets go
lets go
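The `transpose_to_tensor` helper, as redefined in the buffer test, relies on `zip(*input_list)` to regroup a field of the batch from per-sample order into per-agent order before building tensors. Here is a torch-free sketch of that regrouping; the function name and the sample values are made up for illustration.

```python
def transpose_rows(rows):
    """Turn a list of per-sample tuples into a list of per-agent tuples."""
    return list(zip(*rows))

# three samples, each holding (reward of agent 0, reward of agent 1)
sample_rewards = [(0.0, 0.1), (0.0, -0.01), (0.1, 0.0)]

per_agent = transpose_rows(sample_rewards)
print(per_agent)   # [(0.0, 0.0, 0.1), (0.1, -0.01, 0.0)]
```

Each resulting tuple collects one agent's values across the whole batch, which matches the indexing `reward[agent_number]` used in the update code.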


In [33]:
action

[tensor([[-0.1356,  0.3300],
         [-0.1356,  0.3300],
         [-0.1356,  0.3300]]),
 tensor([[-0.2193,  0.3981],
         [-0.2193,  0.3981],
         [-0.2193,  0.3981]])]

In [16]:
action = torch.cat(action, dim=1)

In [20]:
action

tensor([[0.2373, 2.0914, 0.2131, 2.0083],
        [1.5938, 1.1650, 1.5151, 1.0715],
        [1.3315, 0.2975, 1.0845, 0.1133]])
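Concatenating the per-agent actions along `dim=1` gives one row of four values per sample, and the same pattern builds the full critic input from the per-agent observations. The sketch below mirrors those shape manipulations with NumPy stand-ins (`np.stack` and `np.concatenate` in place of `torch.stack` and `torch.cat`); all array contents are dummies.

```python
import numpy as np

n = 3                                                   # batch size, as in the buffer test
obs = [np.zeros((n, 24)), np.ones((n, 24))]             # dummy per-agent observations
action = [np.full((n, 2), 0.1), np.full((n, 2), 0.2)]   # dummy per-agent actions

obs_full = np.stack(obs, axis=1).reshape(n, 2 * 24)     # (n, 2, 24) -> (n, 48)
action_full = np.concatenate(action, axis=1)            # (n, 4)
critic_input = np.concatenate((obs_full, action_full), axis=1)

print(obs_full.shape, action_full.shape, critic_input.shape)
# (3, 48) (3, 4) (3, 52)
```

Each row of `obs_full` holds agent 0's 24 values followed by agent 1's, so the critic always sees the agents in the same order.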