# Multi-Agent Deep Deterministic Policy Gradients (MADDPG) - reduced env.
# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```
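If the notebook is shared between machines, the choice from the list above can be made programmatically. The following is only a minimal sketch: `tennis_file_name` and its `base_dir` argument are hypothetical names that are not part of `unityagents`, and only the Mac and 64-bit entries from the list are covered.

```python
import os
import platform

# Relative build locations copied from the list above (Mac and 64-bit builds only).
_TENNIS_BUILDS = {
    "Darwin": "Tennis.app",
    "Windows": os.path.join("Tennis_Windows_x86_64", "Tennis.exe"),
    "Linux": os.path.join("Tennis_Linux", "Tennis.x86_64"),
}


def tennis_file_name(base_dir=".", system=None):
    """Return the assumed path of the Tennis build for the given OS name.

    `system` defaults to platform.system(); pass it explicitly for testing.
    """
    system = system or platform.system()
    if system not in _TENNIS_BUILDS:
        raise ValueError("no Tennis build known for system {!r}".format(system))
    return os.path.join(base_dir, _TENNIS_BUILDS[system])


# Hypothetical usage:
# env = UnityEnvironment(file_name=tennis_file_name("path/to"))
```

Headless Linux builds (`Tennis_Linux_NoVis`) would need an extra entry or an explicit path.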

In [2]:
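no_graphics = True   # set to False to watch the game window
# raw string, so the backslashes in the Windows path are not interpreted as escape sequences
env = UnityEnvironment(file_name=r'C:\EigeneLokaleDaten\DeepRL\Value-based-methods\p3_collab-compet\Tennis_Windows_x86_64\Tennis.exe', no_graphics=no_graphics)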

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

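Next, we check how the environment stacks observations over time: each agent's 24-value observation consists of 3 stacked frames of 8 values, with the most recent frame last.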

In [4]:
env_info = env.reset(train_mode=True)[brain_name]      # reset the environment
states = env_info.vector_observations                  # get the current state (for each agent)
print(states)
for jj in range(2):
    env_info = env.step([[0,0],[0,0]])[brain_name]     # send zero actions for both agents
    states = env_info.vector_observations              # get next state (for each agent)
    print(states)

[[ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.         -6.65278625 -1.5
  -0.          0.          6.83172083  6.         -0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.         -6.4669857  -1.5
   0.          0.         -6.83172083  6.          0.          0.        ]]
[[ 0.          0.          0.          0.          0.          0.
   0.          0.         -6.65278625 -1.5        -0.          0.
   6.83172083  6.         -0.          0.         -6.65278625 -1.55886006
  -0.         -0.98100001  6.83172083  5.94114017 -0.         -0.98100001]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.         -6.4669857  -1.5         0.          0.
  -6.83172083  6.          0.         

In [5]:
# keep only the most recent of the 3 stacked frames (last 8 values per agent)
print(states[:,-8:])
print(states[:,-8:].shape)
d_states = states[:,-8:]
print(d_states.shape)

states_f = env_info.vector_observations                  # full stacked observation (for each agent)
print(states_f.shape)
states = states_f[:,-8:]                                 # most recent frame only
print(states.shape)

[[-6.65278625 -1.71581995 -0.         -1.96200001  6.83172083  5.78418016
  -0.         -1.96200001]
 [-6.4669857  -1.71581995  0.         -1.96200001 -6.83172083  5.78418016
   0.         -1.96200001]]
(2, 8)
(2, 8)
(2, 24)
(2, 8)
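The slicing above relies on the newest frame sitting at the end of the 24-value observation. As a small sketch on synthetic data (the helper name `split_stacked` is hypothetical), the stacked observation can be reshaped into explicit frames to make that layout visible:

```python
import numpy as np


def split_stacked(obs, num_stacked=3):
    """Reshape (num_agents, num_stacked * frame_size) into
    (num_agents, num_stacked, frame_size), oldest frame first."""
    obs = np.asarray(obs)
    num_agents, total = obs.shape
    if total % num_stacked != 0:
        raise ValueError("observation length is not a multiple of num_stacked")
    return obs.reshape(num_agents, num_stacked, total // num_stacked)


# Synthetic stand-in for env_info.vector_observations (2 agents, 24 values each).
obs = np.arange(48, dtype=float).reshape(2, 24)
frames = split_stacked(obs)

# The last frame is identical to the slice used in the notebook.
latest = frames[:, -1, :]
assert np.array_equal(latest, obs[:, -8:])
```

Dropping the two older frames is what makes this the "reduced" environment: the agents see only the current 8 values per step.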


### 2. It's My Turn!

Now it's your turn to train your own agents to solve the environment!  When training the agents, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [6]:
from collections import deque
import os

import numpy as np
import torch

from buffer import ReplayBuffer  # replay buffer used during training
from maddpgred import MADDPGRED as MADDPG

maddpg = MADDPG()

def seeding(seed=1):
    """Seed NumPy and PyTorch for reproducibility."""
    np.random.seed(seed)
    torch.manual_seed(seed)

def pre_process(entity, batchsize):
    """Regroup a batch of samples into three tensors, one per component index."""
    processed_entity = []
    for j in range(3):
        items = []  # renamed from `list` to avoid shadowing the built-in
        for i in range(batchsize):
            items.append(entity[i][j])
        processed_entity.append(torch.Tensor(items))
    return processed_entity

init OUNoise with dim= 2
init OUNoise with dim= 2
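The two `init OUNoise` lines are printed when `MADDPG()` creates one exploration-noise object per agent. The `maddpgred` module is not shown in this notebook, so the following is only a generic sketch of an Ornstein-Uhlenbeck process; the class name `OUNoiseSketch` and the parameters `mu`, `theta` and `sigma` are assumptions and may differ from the implementation actually used.

```python
import numpy as np


class OUNoiseSketch:
    """Generic Ornstein-Uhlenbeck noise: temporally correlated, mean-reverting."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, seed=None):
        self.mu = mu * np.ones(dim)
        self.theta = theta          # strength of the pull back towards mu
        self.sigma = sigma          # scale of the random perturbation
        self.rng = np.random.RandomState(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean (done at the start of an episode)."""
        self.state = self.mu.copy()

    def sample(self):
        """Advance the process by one step and return the new noise vector."""
        dx = (self.theta * (self.mu - self.state)
              + self.sigma * self.rng.standard_normal(self.state.shape))
        self.state = self.state + dx
        return self.state.copy()


ou = OUNoiseSketch(dim=2, seed=0)   # dim=2 matches the action size per agent
first = ou.sample()
second = ou.sample()
```

In a DDPG-style agent, a sample like this is typically multiplied by a noise amplitude and added to the actor output before the action is sent to the environment.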


In [7]:
debug_ = False

scores_window = deque(maxlen=100)  # last 100 max scores
scores_window_mean = deque(maxlen=100)  # last 100 mean scores

seeding()
# number of parallel agents
parallel_envs = 1   # start with a single Unity-ML env
# number of training episodes.
training_episods = 10000*3
buffer_length = 100*10000

if debug_:
    batchsize = 3
else:
    batchsize = 128*4 

UPDATE_EVERY_NTH_STEP = 30
UPDATE_MANY_EPOCHS = 10

t = 0
    
# amplitude of OU noise
# this slowly decreases to 0
noise = 4  # 2 before 0.1 not enough?
noise_reduction =  0.9

# how many episodes before update
episode_per_update = 2 * parallel_envs

log_path = os.path.join(os.getcwd(), "log")
model_dir = os.path.join(os.getcwd(), "model_dir")
    
os.makedirs(log_path, exist_ok=True)    
os.makedirs(model_dir, exist_ok=True)

torch.set_num_threads(parallel_envs)
    
#from tensorboardX import SummaryWriter
#logger = SummaryWriter(log_dir=log_path)
num_agents = 2 
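To get a feeling for the `noise` and `noise_reduction` settings above, the following sketch computes how quickly the amplitude shrinks if it is multiplied by `noise_reduction` once per environment step, as the training loop in this notebook does. The helper names are hypothetical.

```python
import math


def noise_after(steps, start=4.0, reduction=0.9):
    """Noise amplitude after `steps` multiplicative decay steps."""
    return start * reduction ** steps


def steps_until_below(threshold, start=4.0, reduction=0.9):
    """Smallest number of steps n with start * reduction**n < threshold."""
    if not 0.0 < reduction < 1.0:
        raise ValueError("reduction must lie strictly between 0 and 1")
    # Solve start * reduction**n < threshold for n; log(reduction) is negative.
    n = math.floor(math.log(threshold / start) / math.log(reduction)) + 1
    return max(0, n)


n_small = steps_until_below(0.01)   # steps until the amplitude drops below 0.01
```

With `noise = 4` and `noise_reduction = 0.9`, the amplitude falls below 0.01 after 57 steps, so under this assumption the exploration noise is effectively gone within the first few episodes. A factor much closer to 1 would be needed for a decay that is slow relative to the length of training.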


In [8]:
data = np.load('state_scale.npz')
scale = data['scale_int']   # per-component scale factors (24 values)
print(scale)
scale = 30                  # override: use a single scalar scale for all components

[21 30 30 30 30 30 30 30 30 30 30 23 23 23 30  8 12 14 30 30 30 30 30 30]
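The printed array holds one scale factor per component of the full 24-value observation, while the notebook then falls back to the scalar 30. As a sketch on placeholder data (the helper `scale_latest_frame` is a hypothetical name), both variants can be applied to the most recent 8-value frame like this:

```python
import numpy as np

# Scale factors copied from the printed output above.
scale_vec = np.array([21, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 23,
                      23, 23, 30, 8, 12, 14, 30, 30, 30, 30, 30, 30], dtype=float)


def scale_latest_frame(states_full, scale, frame_size=8):
    """Keep the last `frame_size` values per agent and divide by `scale`.

    `scale` may be a scalar or a vector covering the full observation; in the
    vector case only the entries belonging to the last frame are used.
    """
    latest = np.asarray(states_full, dtype=float)[:, -frame_size:]
    scale = np.asarray(scale, dtype=float)
    if scale.ndim == 1:
        scale = scale[-frame_size:]
    return latest / scale


states_full = np.ones((2, 24))                               # placeholder observation
scaled_scalar = scale_latest_frame(states_full, 30)          # what the notebook does
scaled_vector = scale_latest_frame(states_full, scale_vec)   # per-component alternative
```

The scalar variant keeps the relative magnitudes of the components unchanged, whereas the per-component variant brings each component to a similar range.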


In [9]:
# keep 10000 episodes worth of replay
buffer = ReplayBuffer(int(buffer_length))  ## buffer ToDo

states = np.zeros((2,8))
next_states = np.zeros((2,8))

print('batchsize',batchsize)

# initialize policy and critic
maddpg = MADDPG()
agent0_reward = []
agent1_reward = []

env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
states_f = env_info.vector_observations                  # get the current state (for each agent)
states = states_f[:,-8:]
states = states/scale
print(states.shape)
actions = maddpg.act(torch.from_numpy(states).unsqueeze(0).float(), noise=noise)
actions_array = torch.stack(actions).detach().numpy()

env_info = env.step(actions_array.squeeze())[brain_name]           # send all actions to the environment
next_states_f = env_info.vector_observations         # get next state (for each agent)
next_states = next_states_f[:,-8:]
next_states = next_states/scale
rewards = env_info.rewards                         # get reward (for each agent)

not_yet_shown = True
max_100_average_score = -1

for i_episode in range(1, training_episods + 1):           # train for training_episods many episodes
    env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
    maddpg.rest_noise()                                    # reset the noise object
    
    states_f = env_info.vector_observations                  # get the current state (for each agent)
    states = states_f[:,-8:]
    states = states / scale
    
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)        
    jj = 0 
    while True:
        jj += 1 
        actions = maddpg.act(torch.from_numpy(states).unsqueeze(0).float(), noise=noise)
        noise *= noise_reduction            
        actions_array = torch.stack(actions).detach().numpy().squeeze()
        
        if debug_ :
            print('actions_array type',type(actions_array))
            print('actions_array shape',actions_array.shape)
        
        env_info = env.step(actions_array)[brain_name]     # send all actions to the environment
        next_states_f = env_info.vector_observations         # get next state (for each agent)
        next_states = next_states_f[:,-8:]
        next_states = next_states / scale        
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        
        transition = ([states], [actions_array], [rewards], [next_states], [dones])
        buffer.push(transition)
        
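        # once the buffer holds more than 10 batches, run updates on every step
        # of every UPDATE_EVERY_NTH_STEP-th episode (the check is on i_episode, not on the step)
        #if len(buffer) > batchsize*10 and i_episode % episode_per_update < parallel_envs:
        if len(buffer) > batchsize*10 and i_episode % UPDATE_EVERY_NTH_STEP == 0: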
            for k in range(UPDATE_MANY_EPOCHS):
                for a_i in range(2):
                    samples = buffer.sample(batchsize)
                    maddpg.update(samples, a_i)
            maddpg.update_targets() #soft update the target network towards the actual networks
                
        
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break

    scores_window.append(scores.max())             # save most recent max score over both agents
    scores_window_mean.append(scores.mean())       # save most recent mean score over both agents
    if np.mean(scores_window) >= 0.5 and not_yet_shown:
        print('Episode {}\tAverage Score: {:.2f}\tScore: {:.2f}'.format(i_episode, np.mean(scores_window), scores.max()))
        print('Assignment -DONE-')
        not_yet_shown = False
    
    if max_100_average_score <  np.mean(scores_window):
        max_100_average_score = np.mean(scores_window)
    print('\rEpisode {}\tAverage <Score>: {:.2f}\tAverage Max Score: {:.2f}\tMax Score: {:.2f}\tMax Average Max Score: {:.2f}'.format(i_episode, np.mean(scores_window_mean), np.mean(scores_window), np.max(scores_window), max_100_average_score), end="")                
    if i_episode % 1000 == 0:
        print('\rEpisode {}\tAverage <Score>: {:.2f}\tAverage Max Score: {:.2f}\tMax Score: {:.2f}\tMax Average Max Score: {:.2f}'.format(i_episode,  np.mean(scores_window_mean), np.mean(scores_window), np.max(scores_window), max_100_average_score))        

print('')
print('Stop it...')

batchsize 512
init OUNoise with dim= 2
init OUNoise with dim= 2
(2, 8)
Episode 269	Average <Score>: -0.00	Average Max Score: 0.00	Max Score: 0.10	Max Average Max Score: 0.01

  make_tensor = lambda x: torch.tensor(x, dtype=torch.float)


Episode 1000	Average <Score>: -0.00	Average Max Score: 0.00	Max Score: 0.10	Max Average Max Score: 0.01
Episode 2000	Average <Score>: -0.00	Average Max Score: 0.00	Max Score: 0.00	Max Average Max Score: 0.01
Episode 3000	Average <Score>: -0.00	Average Max Score: 0.00	Max Score: 0.00	Max Average Max Score: 0.06
Episode 4000	Average <Score>: -0.00	Average Max Score: 0.00	Max Score: 0.00	Max Average Max Score: 0.06
Episode 5000	Average <Score>: -0.00	Average Max Score: 0.00	Max Score: 0.10	Max Average Max Score: 0.06
Episode 6000	Average <Score>: 0.01	Average Max Score: 0.02	Max Score: 0.10	Max Average Max Score: 0.06
Episode 7000	Average <Score>: 0.04	Average Max Score: 0.06	Max Score: 0.30	Max Average Max Score: 0.14
Episode 8000	Average <Score>: 0.28	Average Max Score: 0.30	Max Score: 1.50	Max Average Max Score: 0.31
Episode 8321	Average <Score>: 0.45	Average Max Score: 0.49	Max Score: 2.70	Max Average Max Score: 0.49
Episode 8322	Average Score: 0.51	Score: 2.60
Assignment -DONE-
Epis

KeyboardInterrupt: 
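The "Assignment -DONE-" message is triggered by the rolling average of the per-episode maximum score over the last 100 episodes reaching 0.5. A minimal, standard-library sketch of that criterion follows; the function name `first_solved_episode` is hypothetical, and like the loop above it also averages over windows that are not yet full.

```python
from collections import deque


def first_solved_episode(episode_scores, window=100, target=0.5):
    """Return the 1-based episode at which the rolling mean of the per-episode
    maximum score (over agents) first reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)
    for episode, agent_scores in enumerate(episode_scores, start=1):
        recent.append(max(agent_scores))        # best agent in this episode
        if sum(recent) / len(recent) >= target:
            return episode
    return None


# Toy example with a window of 4 episodes and two agents per episode.
toy_scores = [(0.0, 0.0)] * 3 + [(1.0, 0.2)] * 3
solved_at = first_solved_episode(toy_scores, window=4)
```

In the toy example the window first averages to 0.5 at episode 5, when it contains two episodes with a maximum of 0.0 and two with a maximum of 1.0.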

In [13]:
# save the full model (network weights and optimizer states)
save_dict_list = []
for i in range(2):
    save_dict = {'actor_params': maddpg.maddpg_agent[i].actor.state_dict(),
                 'actor_optim_params': maddpg.maddpg_agent[i].actor_optimizer.state_dict(),
                 'critic_params': maddpg.maddpg_agent[i].critic.state_dict(),
                 'critic_optim_params': maddpg.maddpg_agent[i].critic_optimizer.state_dict()}
    save_dict_list.append(save_dict)
torch.save(save_dict_list,
           os.path.join(model_dir, 'Run4_episode-{}.pt'.format(i_episode)))


# save a reduced checkpoint with the network weights only
save_dict_list = []
for i in range(2):
    save_dict = {'actor_params': maddpg.maddpg_agent[i].actor.state_dict(),
                 'critic_params': maddpg.maddpg_agent[i].critic.state_dict()}
    save_dict_list.append(save_dict)
torch.save(save_dict_list,
           os.path.join(model_dir, 'Run4_reduced_episode-{}.pt'.format(i_episode)))
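To reuse such a checkpoint later, the saved list can be read back with `torch.load` and each `state_dict` passed to `load_state_dict` on a network of the same architecture. The sketch below shows the round trip with small `nn.Linear` layers standing in for the actors, because the real actor and critic classes live in the `maddpgred` module and are not shown here; the file name `checkpoint_sketch.pt` is likewise only a placeholder.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in networks (assumption): one small layer per agent instead of the real actors.
actors = [nn.Linear(8, 2) for _ in range(2)]

# Same list-of-dicts layout as the reduced checkpoint above, actor part only.
save_dict_list = [{'actor_params': actor.state_dict()} for actor in actors]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'checkpoint_sketch.pt')
    torch.save(save_dict_list, path)
    loaded = torch.load(path, map_location='cpu')   # read back while the file exists

# Freshly initialised networks receive the saved weights.
restored = [nn.Linear(8, 2) for _ in range(2)]
for net, saved in zip(restored, loaded):
    net.load_state_dict(saved['actor_params'])
    net.eval()   # inference mode, e.g. for watching the trained agents play
```

For the notebook's own checkpoints, the same pattern would presumably apply to `maddpg.maddpg_agent[i].actor` and `.critic`, using the `'actor_params'` and `'critic_params'` keys.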