# Collaboration and Competition

---
1. Detach tensors in `agent.forward_all()` when `agent != self`.
2. Add batch normalization to the actor and critic.
3. Change the first layer of the critic networks: its input size changes from `state_size` to `state_size + action_size`.
4. Effectively disable gradient clipping by setting `max_norm = 1000.0`.
5. Fix a bug in updating the target networks: move the update function from each agent to `agentGroup`.
6. Try different parameters.
7. Fix a bug in `maddpg_agent_version_5`: replace `actions` with `action_pred` in `self.critic_local()`.
8. Replace parameters of version 5.
9. Debugging.
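
Item 5 concerns the target-network update. The notes do not show the update itself, so the sketch below assumes the usual DDPG-style soft update controlled by the `tau` hyperparameter that appears later in this notebook. Plain dictionaries of floats stand in for network parameters, and the name `soft_update` and its arguments are hypothetical.

```python
def soft_update(local_params, target_params, tau):
    """Blend local parameters into the target parameters in place.

    target <- tau * local + (1 - tau) * target
    Both arguments map a parameter name to a float; real networks would
    hold tensors instead (assumption for illustration).
    """
    for name, local_value in local_params.items():
        target_params[name] = tau * local_value + (1.0 - tau) * target_params[name]
    return target_params


# Hypothetical parameters for two agents.
local = [{"w": 1.0, "b": 0.0}, {"w": -2.0, "b": 0.5}]
target = [{"w": 0.0, "b": 0.0}, {"w": 0.0, "b": 0.0}]

# Update every agent's target from one place, after all agents have learned.
# This mirrors moving the call from each agent to the group.
for local_params, target_params in zip(local, target):
    soft_update(local_params, target_params, tau=1e-3)

print(target)
```

With a small `tau`, each call moves the target only slightly toward the local values, so calling it from a single place keeps all agents' targets in step with one another.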

### 1. Start the Environment

Run the next code cell to install a few packages.  This line will take a few minutes to run!

In [1]:
!pip -q install ./python

The environment is already saved in the Workspace and can be accessed at the file path provided below. 

In [2]:
from unityagents import UnityEnvironment
import numpy as np

env = UnityEnvironment(file_name="/data/Tennis_Linux_NoVis/Tennis")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.         -6.65278625 -1.5        -0.          0.
  6.83172083  6.         -0.          0.        ]
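
The brain summary reports an observation size of 8 per agent and 3 stacked observations, which accounts for the state length of 24. A minimal sketch of cutting the printed state into its frames follows. It assumes the frames are stored contiguously, which the environment output does not confirm, and `split_frames` is a hypothetical helper.

```python
# First agent's state, copied from the output above.
state = [0.0] * 16 + [-6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0]

OBS_SIZE = 8     # "Vector Observation space size (per agent)"
NUM_STACKED = 3  # "Number of stacked Vector Observation"


def split_frames(flat_state, obs_size=OBS_SIZE):
    """Cut a flat stacked observation into consecutive frames of obs_size values."""
    if len(flat_state) % obs_size != 0:
        raise ValueError("state length is not a multiple of obs_size")
    return [flat_state[i:i + obs_size] for i in range(0, len(flat_state), obs_size)]


frames = split_frames(state)
assert len(frames) == NUM_STACKED
for index, frame in enumerate(frames):
    print(index, frame)
```

Right after a reset only the last group of 8 values is non-zero, which is consistent with the first two frames being empty history.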


### Identical and Complement Fields in Observation Spaces

In [5]:
for index, s in enumerate(states):
    print('Observation of agent {}:\n{}\n'.format(index, s))

print('Identical fields:\n{}\n{}\n'.format(states[0][17],
                                        states[0][21]))

print('Complement fields:\n{}\n{}\n'.format(states[0][20],
                                     states[1][20]))

Observation of agent 0:
[ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.         -6.65278625 -1.5        -0.          0.
  6.83172083  6.         -0.          0.        ]

Observation of agent 1:
[ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.         -6.4669857  -1.5         0.          0.
 -6.83172083  6.          0.          0.        ]

Identical fields:
-1.5
6.0

Complement fields:
6.83172082901001
-6.83172082901001
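
The comparison above picks the indices by hand. A sketch that scans both observations and reports which indices hold identical values and which hold negated values is shown below. It uses the values printed above, and `compare_observations` is a hypothetical helper. Indices where both values are zero are skipped, because zero is trivially both equal to and the negation of itself.

```python
# Observations of both agents, copied from the output above.
obs_0 = [0.0] * 16 + [-6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0]
obs_1 = [0.0] * 16 + [-6.4669857, -1.5, 0.0, 0.0, -6.83172083, 6.0, 0.0, 0.0]


def compare_observations(a, b, tol=1e-6):
    """Return (identical, negated) index lists for two equal-length observations."""
    identical, negated = [], []
    for index, (x, y) in enumerate(zip(a, b)):
        if abs(x) <= tol and abs(y) <= tol:
            continue  # both zero: carries no information
        if abs(x - y) <= tol:
            identical.append(index)
        elif abs(x + y) <= tol:
            negated.append(index)
    return identical, negated


identical, negated = compare_observations(obs_0, obs_1)
print('Identical indices:', identical)  # [17, 21]
print('Negated indices:', negated)      # [20]
```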



### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Note that **in this coding environment, you will not be able to watch the agents while they are training**, and you should set `train_mode=True` to restart the environment.

In [6]:
for i in range(5):                                         # play game for 5 episodes
    env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    while True:
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break
    print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: -0.004999999888241291
Total score (averaged over agents) this episode: -0.004999999888241291
Total score (averaged over agents) this episode: -0.004999999888241291
Total score (averaged over agents) this episode: 0.04500000085681677
Total score (averaged over agents) this episode: -0.004999999888241291


When finished, you can close the environment.

In [7]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  A few **important notes**:
- When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
- To structure your work, you're welcome to work directly in this Jupyter notebook, or you might like to start over with a new file!  You can see the list of files in the workspace by clicking on **_Jupyter_** in the top left corner of the notebook.
- In this coding environment, you will not be able to watch the agents while they are training.  However, **_after training the agents_**, you can download the saved model weights to watch the agents on your own machine! 

In [10]:
%load_ext autoreload
%autoreload 2

from utils.workspace_utils import active_session
import matplotlib.pyplot as plt
%matplotlib inline
from collections import deque
import numpy as np
from datetime import datetime
from utils import utils
from unity_env_decorator import UnityEnvDecorator
from agents.maddpg_agent_version_6 import MADDPGAgentVersion6
from agents.agent_group_version_3 import AgentGroupVersion3
from agents.game import Game
from utils.utils import ScoreParcels

version='MADDPG_version_9'
dir_logs='./logs/'
dir_checkpoints='./checkpoints/'

def init_agent_group(random_seed):
    # define common parameters
    param_agent = {'state_size': 24,
                   'action_size': 2,
                   'random_seed': random_seed,
                   'lr_critic': 1e-3,
                   'lr_actor': 1e-4,
                   'fc1_units': 256,
                   'fc2_units': 256,
                   'gamma': 0.99,
                   'tau': 1e-3,
                   'max_norm': 1000.0,
                   'epsilon_start': 1.0,
                   'epsilon_end': 0.0,
                   'epsilon_decay': 0.9999}

    param_agent_group = {'action_size': param_agent['action_size'],
                         'learn_period': 10,
                         'learn_sampling_num': 5,
                         'buffer_size': int(1e4),
                         'batch_size': 200,
                         'random_seed': random_seed}

    """
        class Game and class MADDPGAgentVersionX form a 'chain-of-responsibility' design pattern
    """
    game = Game()
    
    # Initialize 2 DDPG agents; neither has its own replay buffer
    num_agents = 2
    agent_list = []
    for i_agent in range(num_agents):
        agent = MADDPGAgentVersion6(game, num_agents, **param_agent, name='{}'.format(i_agent))
        game.add_agent(agent)
        agent_list.append(agent)

    """ 
        Initialize container of agents.
        This is a 'composite' design pattern
    """
    agentGroup = AgentGroupVersion3(agent_list, **param_agent_group)
        
    return agentGroup

def maddpg_framwork(envDecorator, agentGroup, n_episode=2000, max_episode_length=1000, 
                    print_every=100, size_moving_average=100, baseline_score=0.5, save_best=True):
    
    global_max_score = -1.0
    scores_deque = deque(maxlen=size_moving_average)
    scores = []
    
    total_time_steps = 0
    time_steps = 0
    
    # Declare time stamp for total execution time
    start_time_total = datetime.now()
    # Declare time stamp for execution time within 'print_every' episodes.
    start_time_moving_average = datetime.now()
    
    for i_episode in range(1, n_episode+1):
        states = envDecorator.reset()
        agentGroup.reset()
        score = np.zeros(envDecorator.num_agents)
        
        for i_step in range(max_episode_length):
            # actions[0] = actions of agent_0.
            # actions[1]= actions of agent_1
            actions = agentGroup.act(states)

            # next_states[0] = next_states of agent_0
            # next_states[1] = next_states of agent_1
            next_states, rewards, dones, _ = envDecorator.step(actions)

            agentGroup.step(states, actions, rewards, next_states, dones)

            
            
            # record scores
            score += rewards
            states = next_states

            time_steps += 1
            total_time_steps += 1
            
            if np.any(dones):
                break
                
        max_score = np.max(score)
        scores.append(max_score)
        scores_deque.append(max_score)
                
        
        print('\rEpisode {}\tScore={}\tStep:{}\tAbs Time{}'.format(i_episode,
                                                             score,
                                                              i_step+1,
                                                             datetime.now() - start_time_total),
                                                             end='')
    
        if i_episode % print_every == 0:
            print('\rEpisode {}\tAverage Score:{:.4f}\tTime Steps={}\tExecution Time:{}'.format(i_episode,
                                                                 np.mean(scores_deque),
                                                                 time_steps,
                                                                 datetime.now() - start_time_moving_average))
            
            start_time_moving_average = datetime.now()
            time_steps = 0
            
            
        # save the model with highest score
        if save_best is True:
            if (max_score > baseline_score) and (max_score > global_max_score):
                print('Save best model at episode {}'.format(i_episode))
                utils.save_agent(agentGroup.model_dicts(), dir_checkpoints, version+'_best')
                global_max_score = max_score
            
           
    print('Average Score: {:.4f}\tTotal Time Steps: {}\tTotal Time={}'.format(np.mean(scores_deque),
                                                        total_time_steps,
                                                        datetime.now() - start_time_total))
    return scores
    

def maddpg(unity_env, random_seed=0):
    with active_session():
    
        # Decorator of the Unity environment
        envDecorator = UnityEnvDecorator(unity_env)
    
        agentGroup = init_agent_group(random_seed)
    
        # run MADDPG
        scores = maddpg_framwork(envDecorator, agentGroup, n_episode=100, 
                    max_episode_length=20000, print_every=100)
    
        # save scores
        utils.save_logs(scores, dir_logs, version)
    
        path_score = utils.log_path_name(dir_logs, version)
        score_parcels = [ScoreParcels('MADDPG', path_score, 'r')]
        
        utils.plot_scores_v2(score_parcels,
                             size_window=100,
                             show_origin=True,
                             show_episode_on_label=True,
                             margin=0)

        # save the last agent
        utils.save_agent(agentGroup.model_dicts(), dir_checkpoints, version)  
    
    
maddpg(env)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Episode 13	Score=[ 0.   -0.01]	Step:15	Abs Time0:00:00.874063
> /home/workspace/MADDPG/agents/agent_group_version_3.py(96)step()
-> agent.step(*experiences)
(Pdb) s0 = np.expand_dims(states[0], axis=0)
(Pdb) s0_tensor = torch.from_numpy(s0).float().to(device)
(Pdb) self.agent_list[0].actor_local(s0_tensor)
*** ValueError: Expected more than 1 value per channel when training, got input size [1, 256]
(Pdb) s0_tensor.shape
torch.Size([1, 24])
(Pdb) experiences.shape
*** AttributeError: 'tuple' object has no attribute 'shape'
(Pdb) type(experiences)
<class 'tuple'>
(Pdb) type(experiences[0])
<class 'torch.Tensor'>
(Pdb) experiences[0].shape
torch.Size([200, 48])
(Pdb) c
> /home/workspace/MADDPG/agents/maddpg_agent_version_6.py(217)learn()
-> next_actions = self.forward_all(next_states)
(Pdb) type(state_i)
*** NameError: name 'state_i' is not defined
(Pdb) type(states_i)
<class 'torch.Tensor'>
(Pdb) state

(Pdb) states[:, 24:] == states_i
tensor([[ 0,  0,  0,  ...,  1,  0,  0],
        [ 0,  1,  0,  ...,  1,  0,  0],
        [ 0,  0,  0,  ...,  1,  0,  1],
        ...,
        [ 0,  1,  0,  ...,  1,  0,  1],
        [ 0,  1,  1,  ...,  1,  0,  1],
        [ 0,  0,  0,  ...,  1,  0,  0]], dtype=torch.uint8)
(Pdb) self.name
'0'
(Pdb) actions
tensor([[ 0.2402, -0.1070,  0.2366, -0.0640],
        [-0.6088, -0.5056, -0.2756,  0.6022],
        [-0.6657, -0.0045, -0.2417,  0.1038],
        [-0.6650, -0.2684, -0.7166, -0.0784],
        [-0.2532, -0.3806,  0.0953,  0.0111],
        [ 0.1080,  0.1349, -0.0343,  0.1879],
        [-0.0196, -0.6273, -0.0681, -0.4090],
        [-0.4980, -0.1680,  0.1814, -0.1633],
        [ 0.2069,  0.2683,  0.3553, -0.3310],
        [-0.1095,  0.0214, -0.0291, -0.0230],
        [-0.1771, -0.2706, -0.2657, -0.2107],
        [ 0.2123,  0.0020,  0.2162, -0.1664],
        [ 0.2559, -0.3182, -0.0523, -0.1169],
        [-0.1921, -0.0343,  0.1694, -0.1650],
        [ 0.0892

(Pdb) action.shape
*** NameError: name 'action' is not defined
(Pdb) actions.shape
torch.Size([200, 4])
(Pdb) actions[:,:2] == actions_i
tensor([[ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],

(Pdb) c
> /home/workspace/MADDPG/agents/maddpg_agent_version_6.py(221)learn()
-> assert (next_actions.shape[1] == (self.action_size * self.num_agents)), 'Wrong shape of next_actions'
(Pdb) next_actions.shape
torch.Size([200, 4])
(Pdb) next_states
tensor([[-10.8998,  -0.6013,  -0.0000,  ...,  -2.2882,   7.0968,
          -2.4176],
        [-10.8998,  -1.8522,  -0.0000,  ...,  -1.9203,  -8.2669,
           6.4114],
        [-10.8998,  -1.8522,  -0.0000,  ...,   2.4566,  -7.2509,
           0.0000],
        ...,
        [ -9.2612,  -1.8524,  -8.6835,  ...,   4.5167,   0.1864,
           0.0000],
        [ -8.5432,  -1.5589,  -7.8635,  ...,   5.3290,   4.4175,
           0.0000],
        [ -0.8685,  -1.9587,   6.5758,  ...,  -0.6575,  -0.6548,
          -0.0000]])
(Pdb) next_states.shape
torch.Size([200, 48])
(Pdb) self.forward_all(next_states)
tensor(1.00000e-02 *
       [[-5.0016, -5.4525, -5.8503, -4.7004],
        [-5.5576, -4.8985, -4.6882, -5.3513],
        [-5.6274, -5.1160, -5.9408

(Pdb) fake_next_states = torch.cat(next_actions[0], next_actions[0], dim=0)
*** TypeError: cat(): argument 'tensors' (position 1) must be tuple of Tensors, not Tensor
(Pdb) fake_next_states = torch.cat((next_actions[0], next_actions[0]), dim=0)
(Pdb) fake_next_states.shape
torch.Size([8])
(Pdb) fake_next_states = torch.cat((next_states[0], next_states[0]), dim=0)
(Pdb) fake_next_states.shape
torch.Size([96])
(Pdb) next_states.shape
torch.Size([200, 48])
(Pdb) fake_next_states = torch.cat((next_states[:, :24], next_states[:, :24]), dim=0)
(Pdb) fake_next_states.shape
torch.Size([400, 24])
(Pdb) self.forward(fake_next_states).shape
*** AttributeError: 'MADDPGAgentVersion6' object has no attribute 'forward'
(Pdb) self.forward_all(fake_next_states)
*** RuntimeError: size mismatch, m1: [1 x 400], m2: [24 x 256] at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/TH/generic/THTensorMath.c:2033
(Pdb) fake_next_states = torch.cat((next_states[:, :24], next_states[:, :24]), dim=1)
(Pdb)

(Pdb) fake_next_actions[:, 0:24] == fake_next_actions[:, 24:]
*** RuntimeError: The size of tensor a (4) must match the size of tensor b (200) at non-singleton dimension 1
(Pdb) fake_next_actions[:, :24]
tensor(1.00000e-02 *
       [[-5.0016, -5.4525, -5.0016, -5.4525],
        [-5.5576, -4.8985, -5.5576, -4.8985],
        [-5.6274, -5.1160, -5.6274, -5.1160],
        [-4.9488, -6.3734, -4.9488, -6.3734],
        [-6.0067, -5.0080, -6.0067, -5.0080],
        [-5.8175, -5.1850, -5.8175, -5.1850],
        [-6.1062, -4.8094, -6.1062, -4.8094],
        [-5.7198, -5.4904, -5.7198, -5.4904],
        [-5.6077, -5.4214, -5.6077, -5.4214],
        [-6.1569, -5.5522, -6.1569, -5.5522],
        [-6.2537, -5.4099, -6.2537, -5.4099],
        [-6.1738, -5.6085, -6.1738, -5.6085],
        [-5.6981, -5.0200, -5.6981, -5.0200],
        [-6.0400, -5.4579, -6.0400, -5.4579],
        [-5.7219, -5.2494, -5.7219, -5.2494],
        [-5.7580, -5.2723, -5.7580, -5.2723],
        [-5.9198, -5.5447, -5.9198, -5.

(Pdb) fake_next_actions[:, 24:]
tensor([-5.5576e-02, -5.6274e-02, -4.9488e-02, -6.0067e-02, -5.8175e-02,
        -6.1062e-02, -5.7198e-02, -5.6077e-02, -6.1569e-02, -6.2537e-02,
        -6.1738e-02, -5.6981e-02, -6.0400e-02, -5.7219e-02, -5.7580e-02,
        -5.9198e-02, -5.8676e-02, -5.6481e-02, -5.3939e-02, -6.3858e-02,
        -6.0923e-02, -6.0743e-02, -6.4030e-02, -4.8520e-02, -5.9129e-02,
        -5.7838e-02, -5.7008e-02, -5.9884e-02, -5.5711e-02, -6.5602e-02,
        -5.3140e-02, -5.8978e-02, -5.4037e-02, -5.8086e-02, -6.0633e-02,
        -5.5518e-02, -5.9203e-02, -5.4850e-02, -5.2050e-02, -5.8023e-02,
        -5.8464e-02, -5.6231e-02, -5.8372e-02, -5.4245e-02, -5.6142e-02,
        -5.6893e-02, -5.9747e-02, -5.7289e-02, -5.4750e-02, -5.3548e-02,
        -5.3417e-02, -5.4822e-02, -6.7495e-02, -6.1813e-02, -5.5582e-02,
        -5.3380e-02, -6.4111e-02, -5.5777e-02, -5.5543e-02, -5.6531e-02,
        -5.5246e-02, -5.5569e-02, -5.9245e-02, -5.5096e-02, -5.5882e-02,
        -6.0303e-02

BdbQuit: 
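
The training loop above records the maximum score over both agents for each episode and averages it over a window of `size_moving_average` episodes. The sketch below isolates that bookkeeping so it can be checked without the environment. Treating `baseline_score` as a threshold on the moving average is an assumption made for illustration (the loop above only uses it to decide when to save a checkpoint), and `first_episode_reaching` is a hypothetical helper.

```python
from collections import deque


def first_episode_reaching(episode_scores, threshold=0.5, window=100):
    """Return the 1-based episode at which the moving average of the
    per-episode maximum score first reaches threshold, or None.

    episode_scores is an iterable of per-agent score sequences, one per episode.
    The average is only checked once the window is full.
    """
    recent = deque(maxlen=window)
    for i_episode, agent_scores in enumerate(episode_scores, start=1):
        recent.append(max(agent_scores))
        if len(recent) == window and sum(recent) / window >= threshold:
            return i_episode
    return None


# Made-up scores for two agents, for illustration only.
history = [(0.0, -0.01), (0.1, -0.01), (1.0, 0.9), (0.6, 0.7)]
print(first_episode_reaching(history, threshold=0.5, window=2))  # 3
```

With the default window of 100, a run of only 100 episodes (as configured in `maddpg` above) can fill the window at most once, so the check is evaluated only at the final episode.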