# Multi-Agent Deep Deterministic Policy Gradients (MADDPG)
# Assignment Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

In [2]:
no_graphics=True
#no_graphics=False
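# raw string, so the backslashes in the Windows path are not read as escape sequences
env = UnityEnvironment(file_name=r'C:\EigeneLokaleDaten\DeepRL\Value-based-methods\p3_Soccer\Soccer_Windows_x86_64\Soccer.exe', no_graphics=no_graphics)
goalies_defending = True  # True: goalies take random actions; False: goalies always take action 2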

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 2
        Number of External Brains : 2
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: GoalieBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 112
        Number of stacked Vector Observation: 3
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 
Unity brain name: StrikerBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 112
        Number of stacked Vector Observation: 3
        Vector Action space type: discrete
        Vector Action space size (per agent): 6
        Vector Action descriptions: , , , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. This environment has two external brains, `GoalieBrain` and `StrikerBrain`; here we look up both by name so that each can be controlled from Python.

In [3]:
# follow the nomenclature of https://github.com/udacity/deep-reinforcement-learning/blob/master/p3_collab-compet/Soccer.ipynb
# set the goalie brain
g_brain_name = env.brain_names[0]
g_brain = env.brains[g_brain_name]

# set the striker brain
s_brain_name = env.brain_names[1]
s_brain = env.brains[s_brain_name]

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)

# number of agents 
num_g_agents = len(env_info[g_brain_name].agents)
print('Number of goalie agents:', num_g_agents)
num_s_agents = len(env_info[s_brain_name].agents)
print('Number of striker agents:', num_s_agents)

# number of actions
g_action_size = g_brain.vector_action_space_size
print('Number of goalie actions:', g_action_size)
s_action_size = s_brain.vector_action_space_size
print('Number of striker actions:', s_action_size)

# examine the state space 
g_states = env_info[g_brain_name].vector_observations
g_state_size = g_states.shape[1]
print('There are {} goalie agents. Each receives a state with length: {}'.format(g_states.shape[0], g_state_size))
s_states = env_info[s_brain_name].vector_observations
s_state_size = s_states.shape[1]
print('There are {} striker agents. Each receives a state with length: {}'.format(s_states.shape[0], s_state_size))

Number of goalie agents: 2
Number of striker agents: 2
Number of goalie actions: 4
Number of striker actions: 6
There are 2 goalie agents. Each receives a state with length: 336
There are 2 striker agents. Each receives a state with length: 336
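
The reported state length of 336 is consistent with the brain parameters printed earlier: 112 values per observation times 3 stacked observations. The training code further down keeps only the last 224 entries of each state (`states_f[:,-224:]`), which corresponds to two of the three stacked observations. A minimal sketch of that arithmetic follows. It uses plain Python lists in place of NumPy arrays and assumes the stacked observations are concatenated oldest first; the helper name `keep_last_frames` is hypothetical.

```python
OBS_SIZE = 112      # vector observation size per agent (from the brain summary)
NUM_STACKED = 3     # number of stacked vector observations

def keep_last_frames(state, n_frames, obs_size=OBS_SIZE):
    """Return the last n_frames stacked observations of a flat state vector."""
    return state[-n_frames * obs_size:]

# A dummy state: each entry records the index of the frame it belongs to.
state = [frame for frame in range(NUM_STACKED) for _ in range(OBS_SIZE)]
reduced = keep_last_frames(state, n_frames=2)

print(len(state))            # 336
print(len(reduced))          # 224
print(sorted(set(reduced)))  # [1, 2]
```

Under the stated ordering assumption, the slice drops the oldest frame and keeps the two most recent ones.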


### 2. It's My Turn!

Now it's your turn to train your own agent to solve the environment! When training, set `train_mode=True`, so that the line for resetting the environment looks like the following (with `brain_name` replaced by `g_brain_name` or `s_brain_name` in this two-brain environment):
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [5]:
from collections import deque
import torch
import numpy as np
import os

from buffer import ReplayBuffer  # TODO: rewrite buffer (check utilities)

# MADDPG rewritten so that the actor/critic networks have the appropriate shapes,
# i.e. 224 state values (the last two stacked observations) and 6 actions per striker
from maddpg import MADDPG

def seeding(seed=1):
    np.random.seed(seed)
    torch.manual_seed(seed)
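
The `ReplayBuffer` imported from `buffer` is not shown in this notebook. The training loop relies on only three operations: `push(transition)`, `sample(batchsize)`, and `len(buffer)`. The following is a minimal sketch of a buffer with that interface, not the implementation in `buffer.py`; the class name `SimpleReplayBuffer` and its internals are assumptions.

```python
import random
from collections import deque

class SimpleReplayBuffer:
    """Fixed-size FIFO store of transitions with uniform random sampling."""

    def __init__(self, size):
        # a deque with maxlen drops the oldest transition once it is full
        self.memory = deque(maxlen=size)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batchsize):
        # uniform sampling without replacement
        return random.sample(list(self.memory), batchsize)

    def __len__(self):
        return len(self.memory)

demo_buffer = SimpleReplayBuffer(size=3)
for step in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    demo_buffer.push((step, 0, 0.0, step + 1, False))

print(len(demo_buffer))                          # 3
print(sorted(t[0] for t in demo_buffer.memory))  # [2, 3, 4]
batch = demo_buffer.sample(2)
```

With a capacity of 3 and five pushes, only the three most recent transitions remain, which is the behaviour a bounded replay memory needs.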

In [None]:
debug_ = False

scores_window = deque(maxlen=100)  # last 100 max scores
scores_window_mean = deque(maxlen=100)  # last 100 mean scores

seeding(42)
# number of parallel agents
parallel_envs = 1   # start with a single Unity-ML env
# number of training episodes.
training_episods = 10000*3
buffer_length = 100*10000

if debug_:
    batchsize = 3
else:
    batchsize = 128*1 

UPDATE_EVERY_NTH_STEP = 30
UPDATE_MANY_EPOCHS = 20

t = 0
    
# exploration noise: initial scale and per-episode decay factor
noise_start = 1
noise_reduction = 0.999

# how many episodes before update
episode_per_update = 2 * parallel_envs

torch.set_num_threads(parallel_envs)
    
#from tensorboardX import SummaryWriter
#logger = SummaryWriter(log_dir=log_path)
num_agents = 2 
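
With `noise_start = 1` and `noise_reduction = 0.999`, the noise value shrinks geometrically, because the training loop multiplies `noise` by `noise_reduction` once at the start of every episode. The sketch below reproduces that schedule to show how quickly the value fades. The helper names `noise_at` and `first_episode_below` are hypothetical, and how `maddpg.act` applies the value internally is not shown in this notebook.

```python
noise_start = 1
noise_reduction = 0.999

def noise_at(i_episode):
    """Noise value used during episode i_episode (0-based).

    The loop applies the decay before acting, so episode 0 already
    runs with noise_start * noise_reduction.
    """
    return noise_start * noise_reduction ** (i_episode + 1)

def first_episode_below(threshold):
    """First 0-based episode whose noise value is below threshold."""
    i_episode = 0
    while noise_at(i_episode) >= threshold:
        i_episode += 1
    return i_episode

print(noise_at(0))               # 0.999
print(first_episode_below(0.1))  # 2301
```

The noise value therefore falls below 10% of its starting value after roughly 2,300 of the 30,000 planned episodes.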

In [None]:
# keep 1e6 samples of replay
buffer = ReplayBuffer(int(buffer_length))  #

print('batchsize',batchsize)

# initialize policy and critic
maddpg = MADDPG()
agent0_reward = []
agent1_reward = []

env_info = env.reset(train_mode=True)      # reset the environment    
states_f = env_info[s_brain_name].vector_observations                  # get the current state (for each agent)
states = states_f[:,-224:]
actions = maddpg.act(torch.from_numpy(states).unsqueeze(0).float(), noise=1.0)
print('#1',actions)
actions_array = torch.argmax(actions[0],dim=1).detach().numpy()
print('#2',actions_array)
actions_array_prob = actions[0].detach().numpy()
print('#3',actions_array_prob)

g_actions = np.random.randint(g_action_size, size=num_g_agents)
s_actions = actions_array.squeeze() #actions_array
actions = dict(zip([g_brain_name, s_brain_name], 
                           [g_actions, s_actions]))

env_info = env.step(actions)           # send all actions to the environment


next_states_f = env_info[s_brain_name].vector_observations         # get next state (for each agent)
next_states = next_states_f[:,-224:]
#next_states = next_states/scale
rewards = env_info[s_brain_name].rewards                         # get reward (for each agent)

not_yet_shown = True   # becomes False once the 'Assignment -DONE-' message has been printed
max_100_average_score = -1

noise = noise_start  # reset the initial noise value 

for i_episode in range(training_episods):               # train for training_episods many episodes
    env_info = env.reset(train_mode=True)               # reset the environment    
    noise *= noise_reduction                            # reduction across episodes...
    
    states_f = env_info[s_brain_name].vector_observations                  # get the current state (for each agent)
    states = states_f[:,-224:]                          # reduce observation space
    scores = np.zeros(num_agents)                       # initialize the score (for each agent)        
    while True:
        actions = maddpg.act(torch.from_numpy(states).unsqueeze(0).float(), noise=noise)
        actions_array_prob = actions[0].detach().numpy()
        actions_array = torch.argmax(actions[0],dim=1).detach().numpy()
              
            
        if goalies_defending:
            # goalies act uniformly at random
            g_actions = np.random.randint(g_action_size, size=num_g_agents)
        else:
            # goalies not defending: randint(1) always returns 0, so every goalie takes action 2
            # (0: towards center, 1: towards goal, 2: right, 3: left)
            g_actions = np.random.randint(1, size=num_g_agents)+2
        s_actions = actions_array.squeeze()
        actions = dict(zip([g_brain_name, s_brain_name], 
                           [g_actions, s_actions]))

        env_info = env.step(actions)           # send all actions to the environment
    
        next_states_f = env_info[s_brain_name].vector_observations         # get next state (for each agent)
        next_states = next_states_f[:,-224:]
        rewards = env_info[s_brain_name].rewards                         # get reward (for each agent)
        dones = env_info[s_brain_name].local_done                        # see if episode finished
        scores += env_info[s_brain_name].rewards                         # update the score (for each agent)
        
        transition = ([states], [actions_array_prob], [rewards], [next_states], [dones])
        buffer.push(transition)
                           
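        # once the buffer holds more than 10 batches, update on every step of every UPDATE_EVERY_NTH_STEP-th episode (the condition tests i_episode, not the step counter t)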
        if len(buffer) > batchsize*10 and i_episode % UPDATE_EVERY_NTH_STEP == 0:          
            for k in range(UPDATE_MANY_EPOCHS):
                for a_i in range(2):
                    samples = buffer.sample(batchsize)
                    maddpg.update(samples, a_i)
            maddpg.update_targets() #soft update the target network towards the actual networks
        
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break

    scores_window.append(scores.max())             # save most recent max score
    scores_window_mean.append(scores.mean())       # save most recent mean score

    if np.mean(scores_window) >= 0.5 and not_yet_shown:
        print('\rEpisode {}\tAverage Score: {:.2f}\tScore: {:.2f}'.format(i_episode, np.mean(scores_window), scores.max()))
        print('Assignment -DONE-')
        not_yet_shown = False
    
    if max_100_average_score <  np.mean(scores_window):
        max_100_average_score = np.mean(scores_window)
    print('\rEpisode {}\tAverage <Score>: {:.2f}\tAverage Max Score: {:.2f}\tMax Score: {:.2f}\tMax Average Max Score: {:.2f}'.format(i_episode, np.mean(scores_window_mean), np.mean(scores_window), np.max(scores_window), max_100_average_score), end="")                
    if i_episode % 50 == 0:
        print('\rEpisode {}\tAverage <Score>: {:.2f}\tAverage Max Score: {:.2f}\tMax Score: {:.2f}\tMax Average Max Score: {:.2f}'.format(i_episode,  np.mean(scores_window_mean), np.mean(scores_window), np.max(scores_window), max_100_average_score))        

print('')
print('Stop it...')

batchsize 128
init OUNoise with dim= 1
init OUNoise with dim= 1
#1 [tensor([[0.0992, 0.1471, 0.1421, 0.2501, 0.1385, 0.2230],
        [0.2024, 0.1089, 0.2462, 0.1767, 0.1707, 0.0951]], dtype=torch.float64)]
#2 [3 2]
#3 [[0.09919614 0.14708694 0.142117   0.25011663 0.13847178 0.22301151]
 [0.20235354 0.10892883 0.24619895 0.17668853 0.1707112  0.09511895]]
Episode 0	Average <Score>: -0.91	Average Max Score: -0.87	Max Score: 0.94	Max Average Max Score: -0.87
Episode 50	Average <Score>: -0.66	Average Max Score: -0.52	Max Score: 0.98	Max Average Max Score: -0.25
Episode 59	Average <Score>: -0.70	Average Max Score: -0.58	Max Score: 0.98	Max Average Max Score: -0.25
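
The printed values `#1` to `#3` show how the actor output is turned into environment actions: each row holds six action scores for one striker, and `torch.argmax(..., dim=1)` picks the index of the largest score per row. A dependency-free sketch of that step, using the rounded values from the `#1` printout, is given below. The helper `argmax_per_row` and the placeholder goalie actions are illustrative assumptions; the brain names are taken from the environment log.

```python
def argmax_per_row(rows):
    """Index of the largest entry in each row (first index wins on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]

# Rounded actor outputs copied from the '#1' printout, one row per striker.
action_scores = [
    [0.0992, 0.1471, 0.1421, 0.2501, 0.1385, 0.2230],
    [0.2024, 0.1089, 0.2462, 0.1767, 0.1707, 0.0951],
]
s_actions = argmax_per_row(action_scores)
print(s_actions)  # [3, 2]

# The environment expects one action array per brain, keyed by brain name.
g_actions = [0, 0]  # placeholder goalie actions
actions = dict(zip(['GoalieBrain', 'StrikerBrain'], [g_actions, s_actions]))
print(actions)  # {'GoalieBrain': [0, 0], 'StrikerBrain': [3, 2]}
```

The result `[3, 2]` matches the `#2` line of the output, while the unreduced scores (the `#3` values) are what the loop stores in the replay buffer.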

In [10]:
# save the model
model_dir = os.getcwd() + "/model_dir"
os.makedirs(model_dir, exist_ok=True)

# full checkpoint: actor and critic weights plus optimizer states
save_dict_list = []
for i in range(2):
    save_dict = {'actor_params': maddpg.maddpg_agent[i].actor.state_dict(),
                 'actor_optim_params': maddpg.maddpg_agent[i].actor_optimizer.state_dict(),
                 'critic_params': maddpg.maddpg_agent[i].critic.state_dict(),
                 'critic_optim_params': maddpg.maddpg_agent[i].critic_optimizer.state_dict()}
    save_dict_list.append(save_dict)
torch.save(save_dict_list,
           os.path.join(model_dir, 'StrikerOnly_Run3_episode-{}.pt'.format(i_episode)))

# reduced checkpoint: actor and critic weights only
save_dict_list = []
for i in range(2):
    save_dict = {'actor_params': maddpg.maddpg_agent[i].actor.state_dict(),
                 'critic_params': maddpg.maddpg_agent[i].critic.state_dict()}
    save_dict_list.append(save_dict)
torch.save(save_dict_list,
           os.path.join(model_dir, 'StrikerOnly_Run3_reduced_episode-{}.pt'.format(i_episode)))

# actor-only checkpoint
save_dict_list = []
for i in range(2):
    save_dict = {'actor_params': maddpg.maddpg_agent[i].actor.state_dict()}
    save_dict_list.append(save_dict)
torch.save(save_dict_list,
           os.path.join(model_dir, 'StrikerOnly_Run3_reduced_only_actor_episode-{}.pt'.format(i_episode)))




# Debugging/Checking the Code Elements....

In [None]:
### DEBUG update:
samples = buffer.sample(3)
maddpg.update(samples, 0)
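
The training loop follows each round of `maddpg.update` calls with `maddpg.update_targets()`, described in its comment as a soft update of the target networks. The implementation lives in `maddpg.py` and is not shown here. As a sketch of the general technique, a soft update moves every target parameter a small fraction `tau` toward the corresponding local parameter. The function name, the list-of-floats parameters, and the value of `tau` below are all assumptions for illustration.

```python
def soft_update(target_params, local_params, tau):
    """Blend local parameters into target parameters.

    target <- tau * local + (1 - tau) * target
    """
    return [tau * local + (1.0 - tau) * target
            for target, local in zip(target_params, local_params)]

target = [0.0, 1.0]
local = [1.0, 1.0]
for _ in range(3):
    target = soft_update(target, local, tau=0.5)

print(target)  # [0.875, 1.0]
```

A value of `tau` much smaller than the 0.5 used here is typical in practice, so that the target networks trail the local networks slowly.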