# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

import random
import torch

from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

from maddpg import MADDPG, ReplayBuffer

import os
from utilities import transpose_list, transpose_to_tensor


Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```
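
Hard-coding the path is easy to get wrong when the notebook moves between machines. As a minimal sketch, assuming the builds sit in the directories listed above and that the 64-bit build with visualization is wanted, the path can be chosen from `platform.system()`. The helper name `tennis_build_path` and its `base_dir` parameter are hypothetical and not part of the project code.

```python
import os
import platform

# Relative build locations taken from the list above (64-bit, with visualization).
BUILD_BY_SYSTEM = {
    "Darwin": "Tennis.app",
    "Windows": os.path.join("Tennis_Windows_x86_64", "Tennis.exe"),
    "Linux": os.path.join("Tennis_Linux", "Tennis.x86_64"),
}

def tennis_build_path(base_dir=".", system=None):
    """Return the assumed Tennis build path for the given (or current) OS."""
    system = system or platform.system()
    if system not in BUILD_BY_SYSTEM:
        raise ValueError("no known Tennis build for system {!r}".format(system))
    return os.path.join(base_dir, BUILD_BY_SYSTEM[system])

# Example: the path a Mac user would pass as file_name.
print(tennis_build_path(system="Darwin"))
```

The returned string can then be passed as `file_name` to `UnityEnvironment`.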

In [2]:
env = UnityEnvironment(file_name='env/Tennis.app')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.
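
To make the reward scheme concrete, here is a small sketch that totals one agent's rewards over a made-up rally. The event names and the `episode_return` helper are illustrative assumptions; only the reward values (+0.1 and -0.01) come from the description above.

```python
# Reward values from the environment description.
HIT_OVER_NET = 0.1
BALL_LOST = -0.01   # ball hits the ground or goes out of bounds

def episode_return(events):
    """Sum the rewards for a sequence of 'hit' / 'miss' events."""
    reward_for = {"hit": HIT_OVER_NET, "miss": BALL_LOST}
    return sum(reward_for[event] for event in events)

# Five successful hits followed by one miss.
rally = ["hit"] * 5 + ["miss"]
print(round(episode_return(rally), 2))
```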

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations, so each agent receives a state vector of length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]
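
The printed state has length 24 although each observation has only 8 variables, because the environment log above reports 3 stacked vector observations per agent. The sketch below splits the flat vector back into frames. The ordering (oldest frame first) is an assumption based on the leading zeros in the printed state, and `split_frames` is a hypothetical helper.

```python
OBS_SIZE = 8      # "Vector Observation space size (per agent)" from the log
NUM_STACKED = 3   # "Number of stacked Vector Observation" from the log

def split_frames(state, obs_size=OBS_SIZE):
    """Split a flat stacked state into consecutive frames of obs_size values."""
    if len(state) % obs_size != 0:
        raise ValueError("state length must be a multiple of obs_size")
    return [list(state[i:i + obs_size]) for i in range(0, len(state), obs_size)]

# First agent's state as printed above.
state = [0.0] * 16 + [-6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0]

frames = split_frames(state)
print(len(frames))   # number of frames in the stack
print(frames[-1])    # the only non-zero frame right after a reset
```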


In [11]:
env_info.vector_observations

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -6.65278625, -1.5       , -0.        ,  0.        ,
         6.83172083,  6.        , -0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -6.4669857 , -1.5       ,  0.        ,  0.        ,
        -6.83172083,  6.        ,  0.        ,  0.        ]])

In [13]:
# transposing the (2, 24) observation array yields 24 pairs, one per state variable
transpose_list(env_info.vector_observations)

[[0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [0.0, 0.0],
 [-6.6527862548828125, -6.466985702514648],
 [-1.5, -1.5],
 [-0.0, 0.0],
 [0.0, 0.0],
 [6.83172082901001, -6.83172082901001],
 [6.0, 6.0],
 [-0.0, 0.0],
 [0.0, 0.0]]

### 3. Train MADDPG!

To train the two agents on the Tennis environment, we use the `MADDPG` class and the `ReplayBuffer` imported at the top of the notebook. When training, set `train_mode=True`, so that the line for resetting the environment looks like the following:

```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [None]:
# The MADDPG agents are constructed inside run_maddpg() below; no single-agent
# `Agent` class is imported in this notebook, so none is instantiated here.

In [None]:
# setting parameters
# number of parallel environments; each environment has 2 agents
# more environments would generate more experience and smooth things out,
# but here we only have 1 env for simplicity
PARALLEL_ENVS = 1

# number of training episodes.
# change this to higher number to experiment. say 30000.
NUMBER_OF_EPISODES = 1000
EPISODE_LENGTH = 80
BATCHSIZE = 1000

# amplitude of OU noise
# this slowly decreases to 0
# instead of resetting noise to 0 every episode, we let it decrease to 0 over a few episodes
NOISE = 2
NOISE_REDUCTION = 0.9999
BUFFER_SIZE = 5000

IN_ACTOR_DIM = 24 
HIDDEN_ACTOR_IN_DIM = 16
HIDDEN_ACTOR_OUT_DIM = 8
OUT_ACTOR_DIM = action_size

# Critic input contains the state AND all the actions of all the agents
# there are 2 agents, so 24+ 2*2 = 28
IN_CRIT_DIM = IN_ACTOR_DIM + action_size * num_agents
HIDDEN_CRIT_IN_DIM = 32
HIDDEN_CRIT_OUT_DIM = 16
OUT_CRIT_DIM = 1

# how many episodes before update
UPDATE_EVERY = 5

episode_per_update = 2 * PARALLEL_ENVS

# Progress-report and checkpoint snippet carried over from a single-agent DDPG loop.
# It is disabled here: it belongs inside an episode loop and relies on `agent` and
# `scores_deque`, which this parameter cell does not define.
#
# if i_episode % 50 == 0:
#     print('\rEpisode {}\tAverage Last 100 Episodes Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
#     torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
#     torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
#
# if np.mean(scores_deque) >= 30.0:
#     print('\nEnvironment solved in {:d} episodes!\tAverage Last 100 Episodes Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
#     torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
#     torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
#     break
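
Because `NOISE` is multiplied by `NOISE_REDUCTION` on every environment step of the training loop, the exploration amplitude decays geometrically. The sketch below estimates how fast that happens for the values set above. The helper names `noise_after` and `steps_to_fraction` are hypothetical, and the episode estimate assumes every episode runs the full `EPISODE_LENGTH` steps.

```python
import math

NOISE = 2
NOISE_REDUCTION = 0.9999
EPISODE_LENGTH = 80

def noise_after(steps, start=NOISE, factor=NOISE_REDUCTION):
    """Noise amplitude after a number of multiplicative decay steps."""
    return start * factor ** steps

def steps_to_fraction(fraction, factor=NOISE_REDUCTION):
    """Smallest step count after which the amplitude is at most `fraction` of its start."""
    return math.ceil(math.log(fraction) / math.log(factor))

half_life_steps = steps_to_fraction(0.5)
print(half_life_steps, "steps to halve the noise, about",
      round(half_life_steps / EPISODE_LENGTH), "full-length episodes")
print(noise_after(1000 * EPISODE_LENGTH))  # amplitude after 1000 full-length episodes
```

With these settings the amplitude halves roughly every 87 full-length episodes, so over 1000 such episodes it shrinks by a factor of roughly 3000.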

In [16]:
def transpose_list(mylist):
    return list(map(list, zip(*mylist)))

transpose_list([[[1,2,3],[2,3,4]],[[5,6,7],[8,9,0]] ])

[[[1, 2, 3], [5, 6, 7]], [[2, 3, 4], [8, 9, 0]]]
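
The same transpose is what lets a batch of transitions be regrouped by field. As an illustrative sketch, with made-up numbers and a simplified three-field transition (the transitions built in the training loop carry seven fields):

```python
def transpose_list(mylist):
    return list(map(list, zip(*mylist)))

# Each transition is (observation, action, reward); the values are placeholders.
transitions = [
    ([0.1, 0.2], [1.0], 0.0),
    ([0.3, 0.4], [-1.0], 0.1),
    ([0.5, 0.6], [0.5], -0.01),
]

# One list per field, each holding that field for every transition in the batch.
observations, actions, rewards = transpose_list(transitions)
print(observations)
print(actions)
print(rewards)
```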

In [None]:
# main function that sets up environments
# perform training loop



def seeding(seed=1):
    np.random.seed(seed)
    torch.manual_seed(seed)

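def pre_process(entity, batchsize):
    """Regroup a batch of per-sample triples into three tensors, one per field."""
    processed_entity = []
    for j in range(3):
        # collect field j from every sample (avoids shadowing the built-in `list`)
        values = [entity[i][j] for i in range(batchsize)]
        processed_entity.append(torch.Tensor(values))
    return processed_entity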

def run_maddpg():
    seeding()
    t = 0
    
    # NOISE is decayed in place inside the step loop below, so declare it global
    # (otherwise the augmented assignment would make it an unbound local)
    global NOISE

    # The hyperparameters (NUMBER_OF_EPISODES, EPISODE_LENGTH, BATCHSIZE, NOISE,
    # NOISE_REDUCTION, episode_per_update) are set in the parameter cell above.

    # how many episodes between saves
    save_interval = 1000

    log_path = os.getcwd()+"/log"
    model_dir= os.getcwd()+"/model_dir"
    
    os.makedirs(model_dir, exist_ok=True)

    # torch.set_num_threads(PARALLEL_ENVS)
    # env = envs.make_parallel_env(PARALLEL_ENVS)
    
    # keep 5000 episodes worth of replay
    buffer = ReplayBuffer(int(BUFFER_SIZE * EPISODE_LENGTH))
    
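    # initialize policy and critic through MADDPG
    maddpg = MADDPG(IN_ACTOR_DIM, HIDDEN_ACTOR_IN_DIM, HIDDEN_ACTOR_OUT_DIM, OUT_ACTOR_DIM,
                    IN_CRIT_DIM, HIDDEN_CRIT_IN_DIM, HIDDEN_CRIT_OUT_DIM)

    # assumption: SummaryWriter comes from tensorboardX, as in the Udacity MADDPG lab
    from tensorboardX import SummaryWriter
    logger = SummaryWriter(log_dir=log_path)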
    
    agent0_reward = []
    agent1_reward = []


    # training loop
    # show progressbar
    import progressbar as pb
    widget = ['episode: ', pb.Counter(),'/',str(NUMBER_OF_EPISODES),' ', 
              pb.Percentage(), ' ', pb.ETA(), ' ', pb.Bar(marker=pb.RotatingMarker()), ' ' ]
    
    timer = pb.ProgressBar(widgets=widget, maxval=NUMBER_OF_EPISODES).start()

    # use keep_awake to keep workspace from disconnecting
    for episode in range(1, NUMBER_OF_EPISODES + 1):

        timer.update(episode)

        # rewards accumulated by each agent over this episode (sum over all time steps)
        reward_this_episode = np.zeros((PARALLEL_ENVS, num_agents))

        # reset the Unity environment and read both agents' observations
        env_info = env.reset(train_mode=True)[brain_name]

        # The lab code this loop is based on expects a leading "parallel environment"
        # axis, so the single environment is wrapped in a list:
        # obs has shape (PARALLEL_ENVS, num_agents, 24) = 1x2x24.
        obs = [env_info.vector_observations]
        # Assumption: Tennis exposes no separate world state, so the first agent's
        # 24-dim observation stands in for obs_full (consistent with IN_CRIT_DIM = 24 + 2*2).
        obs_full = [env_info.vector_observations[0]]

        # save the model this episode or not
        save_info = (episode % save_interval < PARALLEL_ENVS
                     or episode == NUMBER_OF_EPISODES - PARALLEL_ENVS)

        for episode_t in range(EPISODE_LENGTH):

            t += PARALLEL_ENVS

            # act with exploration noise; the noise amplitude decays a little every step
            # action input needs to be transposed
            actions = maddpg.act(transpose_to_tensor(obs), noise=NOISE)
            NOISE *= NOISE_REDUCTION

            actions_array = torch.stack(actions).detach().numpy()

            # flip the first two indices so that the first index corresponds to
            # the parallel environment: shape (PARALLEL_ENVS, num_agents, action_size)
            actions_for_env = np.rollaxis(actions_array, 1)

            # step forward one frame; Unity expects shape (num_agents, action_size)
            env_info = env.step(actions_for_env[0])[brain_name]
            next_obs = [env_info.vector_observations]
            next_obs_full = [env_info.vector_observations[0]]  # same assumption as above
            rewards = [env_info.rewards]
            dones = [env_info.local_done]

            # add data to buffer
            transition = (obs, obs_full, actions_for_env, rewards, next_obs, next_obs_full, dones)

            buffer.push(transition)

            reward_this_episode += rewards

            obs, obs_full = next_obs, next_obs_full
        
        # update once after every episode_per_update
        if len(buffer) > BATCHSIZE and episode % episode_per_update < PARALLEL_ENVS:
            for a_i in range(num_agents):
                samples = buffer.sample(BATCHSIZE)
                maddpg.update(samples, a_i, logger)
            maddpg.update_targets() #soft update the target network towards the actual networks

        
        
        for i in range(PARALLEL_ENVS):
            agent0_reward.append(reward_this_episode[i,0])
            agent1_reward.append(reward_this_episode[i,1])


        if episode % 100 == 0 or episode == NUMBER_OF_EPISODES-1:
            avg_rewards = [np.mean(agent0_reward), np.mean(agent1_reward)]
            agent0_reward = []
            agent1_reward = []

            for a_i, avg_rew in enumerate(avg_rewards):
                logger.add_scalar('agent%i/mean_episode_rewards' % a_i, avg_rew, episode)

        # saving model
        save_dict_list = []
        if save_info:
            for i in range(num_agents):

                save_dict = {'actor_params' : maddpg.maddpg_agent[i].actor.state_dict(),
                             'actor_optim_params': maddpg.maddpg_agent[i].actor_optimizer.state_dict(),
                             'critic_params' : maddpg.maddpg_agent[i].critic.state_dict(),
                             'critic_optim_params' : maddpg.maddpg_agent[i].critic_optimizer.state_dict()}
                save_dict_list.append(save_dict)

            # write all agents' parameters once, after the list is complete
            torch.save(save_dict_list,
                       os.path.join(model_dir, 'episode-{}.pt'.format(episode)))

            # no GIF is written: the Unity environment used here provides no
            # env.render() call to capture frames from

    # the environment is closed in the final cell of the notebook, not here
    logger.close()
    timer.finish()

if __name__=='__main__':
    run_maddpg()
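
The training loop relies on `ReplayBuffer` from the `maddpg` module, whose source is not shown in this notebook. As a rough sketch of what such a buffer typically does (not the module's actual implementation), the hypothetical `SketchReplayBuffer` below keeps a bounded deque of transitions and samples from it uniformly at random. The two-field transitions are placeholders.

```python
import random
from collections import deque

class SketchReplayBuffer:
    """Fixed-capacity FIFO store with uniform random sampling (illustrative only)."""

    def __init__(self, size):
        self.deque = deque(maxlen=size)  # oldest transitions are dropped first

    def push(self, transition):
        self.deque.append(transition)

    def sample(self, batchsize):
        # sample without replacement, then regroup the batch by field
        batch = random.sample(list(self.deque), batchsize)
        return list(map(list, zip(*batch)))

    def __len__(self):
        return len(self.deque)

buffer = SketchReplayBuffer(size=5)
for step in range(8):
    buffer.push((step, step * 0.1))  # placeholder (observation, reward) pairs

print(len(buffer))
observations, rewards = buffer.sample(3)
print(observations, rewards)
```

With `BUFFER_SIZE * EPISODE_LENGTH = 5000 * 80`, the buffer created in this notebook is sized for 400,000 entries.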


When finished, you can close the environment.

In [5]:
env.close()