# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [4]:
import sys
sys.path.append('/home/ubuntu/DPRL-Tennis-ML-Agents/python')
# sys.path.append('./python')

In [5]:
from unityagents import UnityEnvironment
import numpy as np

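# Restore NumPy aliases that older code may still reference (for example,
# `np.float_` was removed in NumPy 2.0); presumably needed by the legacy
# `unityagents` package.
np.float_ = np.float64
np.int_ = np.int64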

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```

In [6]:
env = UnityEnvironment(file_name="Tennis_Linux/Tennis.x86_64", no_graphics=True)

Found path: /home/ubuntu/DPRL-Tennis-ML-Agents/Tennis_Linux/Tennis.x86_64
Mono path[0] = '/home/ubuntu/DPRL-Tennis-ML-Agents/Tennis_Linux/Tennis_Data/Managed'
Mono config path = '/home/ubuntu/DPRL-Tennis-ML-Agents/Tennis_Linux/Tennis_Data/MonoBleedingEdge/etc'
Preloaded 'libgrpc_csharp_ext.x64.so'
Unable to preload the following plugins:
	ScreenSelector.so
	libgrpc_csharp_ext.x86.so
	ScreenSelector.so
Logging to /home/ubuntu/.config/unity3d/Unity Technologies/Unity Environment/Player.log


INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continu

Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [7]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping. 
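
The state length reported by the environment is 24 rather than 8. One plausible reading, assumed here and not confirmed anywhere in this notebook, is that each agent's state stacks three consecutive 8-variable observations. The sketch below splits the first agent's initial state (copied from the output of the next cell) into three such frames; `split_frames` is a hypothetical helper, not part of the Unity API.

```python
# Hypothetical helper: split a flat state vector into equally sized frames.
def split_frames(state, frame_size=8):
    if len(state) % frame_size != 0:
        raise ValueError("state length must be a multiple of frame_size")
    return [state[i:i + frame_size] for i in range(0, len(state), frame_size)]

# Initial state of the first agent, as printed by the notebook.
state = [0.0] * 16 + [-6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0]

frames = split_frames(state)
for i, frame in enumerate(frames):
    print("frame", i, frame)
```

Under this assumption, the first two frames are all zeros at reset and only the last frame carries the current positions and velocities.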

Run the code cell below to print some information about the environment.

In [8]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agents and receive feedback from the environment.

Once this cell is executed, you can watch how the agents perform when they select actions at random at every time step.  A window should pop up that allows you to observe the agents (no window appears if the environment was started with `no_graphics=True`, as above).

Of course, as part of the project, you'll have to change the code so that the agents are able to use their experiences to gradually choose better actions when interacting with the environment!
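
The random policy in the commented cell below draws each action component from a standard normal distribution and clips it to [-1, 1]. A minimal standard-library sketch of that sampling step follows; `random_actions` is a hypothetical helper and the seed is chosen arbitrarily.

```python
import random

def random_actions(num_agents, action_size, rng):
    """Return one clipped Gaussian action vector per agent."""
    return [
        [max(-1.0, min(1.0, rng.gauss(0.0, 1.0))) for _ in range(action_size)]
        for _ in range(num_agents)
    ]

rng = random.Random(0)  # arbitrary seed, only for reproducibility
actions = random_actions(num_agents=2, action_size=2, rng=rng)
print(actions)
```

Because a standard normal variable falls outside [-1, 1] roughly a third of the time, a noticeable share of the clipped components sit exactly on a bound.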

In [None]:
# for i in range(1, 6):                                      # play game for 5 episodes
#     env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
#     states = env_info.vector_observations                  # get the current state (for each agent)
#     scores = np.zeros(num_agents)                          # initialize the score (for each agent)
#     while True:
#         actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
#         actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
#         env_info = env.step(actions)[brain_name]           # send all actions to the environment
#         next_states = env_info.vector_observations         # get next state (for each agent)
#         rewards = env_info.rewards                         # get reward (for each agent)
#         dones = env_info.local_done                        # see if episode finished
#         scores += env_info.rewards                         # update the score (for each agent)
#         states = next_states                               # roll over states to next time step
#         if np.any(dones):                                  # exit loop if episode finished
#             break
#     print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))

When finished, you can close the environment.

In [None]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [9]:
class Config:
    def __new__(self):
        """Define this class as a singleton"""
        if not hasattr(self, 'instance'):
            self.instance = super().__new__(self)

            self.instance.device = None
            self.instance.seed = None
            self.instance.target_score = None
            self.instance.target_episodes = None
            self.instance.max_episodes = None

            self.instance.state_size = None
            self.instance.action_size = None
            self.instance.num_agents = None

            self.instance.actor_layers = None
            self.instance.critic_layers = None
            self.instance.actor_lr = None
            self.instance.critic_lr = None
            self.instance.lr_sched_step = None
            self.instance.lr_sched_gamma = None

            self.instance.batch_normalization = None

            self.instance.buffer_size = None
            self.instance.batch_size = None
            self.instance.gamma = None
            self.instance.tau = None

            self.instance.noise = None
            self.instance.noise_theta = None
            self.instance.noise_sigma = None

        return self.instance
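
Because `Config.__new__` always returns the same stored instance, every class that calls `Config()` sees the values assigned elsewhere. The sketch below reproduces the pattern on a tiny stand-in class to show that sharing behaviour; `Settings` is hypothetical and not part of the notebook.

```python
class Settings:
    """Tiny stand-in for Config, using the same singleton pattern."""
    _instance = None

    def __new__(cls):
        # Create the shared instance on first use, then always return it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.tau = None
        return cls._instance

a = Settings()
a.tau = 8e-3          # set once ...
b = Settings()        # ... and visible through every later "instance"
print(a is b, b.tau)
```

One consequence of this design is that all fields must be assigned before the networks and agents are constructed, since they read the shared values in their constructors.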


In [10]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class BaseNN(nn.Module):
    """Superclass for the Actor and Critic classes"""
    def __init__(self):
        super(BaseNN, self).__init__()
        self.config = Config()
        self.to(self.config.device)
        torch.manual_seed(self.config.seed)
        self.module_list = nn.ModuleList()

    def create_fc_layer(self, nodes_in, nodes_out):
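        # Move each layer to the configured device as it is created; the
        # self.to(...) call in __init__ runs before any layer exists.
        layer = nn.Linear(nodes_in, nodes_out).to(self.config.device)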
        self.reset_parameters(layer)
        self.module_list.append(layer)

    def reset_parameters(self, layer):
        layer.weight.data.uniform_(-3e-3, 3e-3)


class Actor(BaseNN):
    """Build an actor (policy) network that maps states -> actions."""
    def __init__(self):
        super(Actor, self).__init__()
        for nodes_in, nodes_out in self.layers_nodes():
            self.create_fc_layer(nodes_in, nodes_out)

    def layers_nodes(self):
        nodes = []
        nodes.append(self.config.state_size)
        nodes.extend(self.config.actor_layers)
        nodes.append(self.config.action_size)
        nodes_in = nodes[:-1]
        nodes_out = nodes[1:]
        return zip(nodes_in, nodes_out)

    def forward(self, x):
        for layer in self.module_list[:-1]:
            x = F.relu(layer(x))
        x = self.module_list[-1](x)
        return torch.tanh(x)


class Critic(BaseNN):
    """Build a critic (value) network that maps
       (state, action) pair -> Q-values.
    """
    def __init__(self):
        super(Critic, self).__init__()
        for nodes_in, nodes_out in self.layers_nodes():
            self.create_fc_layer(nodes_in, nodes_out)
        if self.config.batch_normalization:
            # Keep the batch-norm layer on the same device as the linear layers
            self.bn = nn.BatchNorm1d(
                self.module_list[1].in_features).to(self.config.device)

    def layers_nodes(self):
        nodes = []
        nodes.append(self.config.state_size * self.config.num_agents)
        nodes.extend(self.config.critic_layers)
        nodes.append(1)
        nodes_in = nodes[:-1]
        nodes_in[1] += self.config.num_agents * self.config.action_size
        nodes_out = nodes[1:]
        return zip(nodes_in, nodes_out)

    def forward(self, state, action):
        x = F.relu(self.module_list[0](state))
        x = torch.cat((x, action), dim=1)
        if self.config.batch_normalization:
            x = self.bn(x)
        for layer in self.module_list[1:-1]:
            x = F.relu(layer(x))
        x = self.module_list[-1](x)
        return torch.sigmoid(x)
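
To make the layer bookkeeping in `layers_nodes` concrete, the sketch below recomputes the (input, output) size of every fully connected layer in plain Python. It assumes the values this notebook later assigns to `Config` (state size 24, action size 2, two agents, hidden layers `[64, 64]`); the function names are illustrative only.

```python
def actor_layer_sizes(state_size, hidden, action_size):
    nodes = [state_size, *hidden, action_size]
    return list(zip(nodes[:-1], nodes[1:]))

def critic_layer_sizes(state_size, hidden, action_size, num_agents):
    # The critic sees the observations of all agents ...
    nodes = [state_size * num_agents, *hidden, 1]
    nodes_in = nodes[:-1]
    # ... and the joint action is concatenated before the second layer.
    nodes_in[1] += num_agents * action_size
    return list(zip(nodes_in, nodes[1:]))

actor = actor_layer_sizes(24, [64, 64], 2)
critic = critic_layer_sizes(24, [64, 64], 2, num_agents=2)
print(actor)   # [(24, 64), (64, 64), (64, 2)]
print(critic)  # [(48, 64), (68, 64), (64, 1)]
```

The second critic layer takes 68 inputs because the 64 hidden activations are concatenated with the 4 action components of both agents.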


In [11]:
import copy
import random

import numpy as np
import torch
import torch.optim as optim



class Agent():
    """Interacts with and learns from the environment."""
    def __init__(self):
        self.config = Config()
        random.seed(self.config.seed)

        # Actor Network
        self.actor_local = Actor()
        self.actor_target = Actor()
        local_state_dict = self.actor_local.state_dict()
        self.actor_target.load_state_dict(local_state_dict)
        self.actor_optimizer = optim.Adam(
            self.actor_local.parameters(),
            lr=self.config.actor_lr)
        self.actor_lr_scheduler = optim.lr_scheduler.StepLR(
            self.actor_optimizer,
            step_size=self.config.lr_sched_step,
            gamma=self.config.lr_sched_gamma)

        # Critic Network
        self.critic_local = Critic()
        self.critic_target = Critic()
        local_state_dict = self.critic_local.state_dict()
        self.critic_target.load_state_dict(local_state_dict)
        self.critic_optimizer = optim.Adam(
            self.critic_local.parameters(),
            lr=self.config.critic_lr)
        self.critic_lr_scheduler = optim.lr_scheduler.StepLR(
            self.critic_optimizer,
            step_size=self.config.lr_sched_step,
            gamma=self.config.lr_sched_gamma)

        # Initialize a noise process
        self.noise = OUNoise()

    def soft_update(self):
        """Soft update actor and critic parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target
        """
        tau = self.config.tau
        for target_param, local_param \
                in zip(self.actor_target.parameters(),
                       self.actor_local.parameters()):
            target_param.data.copy_(
                tau * local_param.data
                + (1.0 - tau) * target_param.data)
        for target_param, local_param \
                in zip(self.critic_target.parameters(),
                       self.critic_local.parameters()):
            target_param.data.copy_(
                tau * local_param.data
                + (1.0 - tau) * target_param.data)

    def act(self, state):
        """Returns actions for given state as per current policy."""
        with torch.no_grad():
            self.actor_local.eval()
            # Tensor.to() returns a new tensor, so keep the result
            state = torch.from_numpy(state).float().to(self.config.device)
            action = self.actor_local(state).data.cpu().numpy()
            self.actor_local.train()

        if self.config.noise:
            action += self.noise.sample()
            np.clip(action, a_min=-1, a_max=1, out=action)

        return action

    def lr_step(self):
        self.actor_lr_scheduler.step()
        self.critic_lr_scheduler.step()

    def reset_noise(self):
        self.noise.reset()


class OUNoise:
    """Ornstein-Uhlenbeck process."""
    def __init__(self, mu=0.):
        """Initialize parameters and noise process."""
        self.config = Config()
        random.seed(self.config.seed)
        self.mu = mu * np.ones(self.config.action_size)
        self.reset()

    def reset(self):
        """Reset the internal state (= noise) to mean (mu)."""
        self.state = copy.copy(self.mu)

    def sample(self):
        """Update internal state and return it as a noise sample."""
        x = self.state
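        # Note: random.random() is uniform on [0, 1), so these increments
        # have a positive mean rather than being zero-mean Gaussian noise.
        random_array = [random.random() for i in range(len(x))]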
        dx = self.config.noise_theta * (self.mu - x) \
             + self.config.noise_sigma * np.array(random_array)
        self.state = x + dx
        return self.state
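
For comparison, a textbook discrete-time Ornstein-Uhlenbeck update draws zero-mean Gaussian increments, whereas `OUNoise.sample` above draws from `random.random()`, which is uniform on [0, 1). The sketch below is that Gaussian variant, not the notebook's implementation; the class name is hypothetical, and the default `theta` and `sigma` are borrowed from the values assigned to `Config` later on.

```python
import random

class GaussianOUNoise:
    """Discrete OU process: x <- x + theta * (mu - x) + sigma * N(0, 1)."""
    def __init__(self, size, mu=0.0, theta=0.9, sigma=0.01, seed=0):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process one step and return a copy of the state."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)

noise = GaussianOUNoise(size=2)
samples = [noise.sample() for _ in range(1000)]
mean_first = sum(s[0] for s in samples) / len(samples)
print(round(mean_first, 4))
```

With Gaussian increments the long-run average of the noise stays close to `mu`, so exploration does not push the actions systematically in one direction.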


In [12]:
import random
from collections import namedtuple, deque

import torch



class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""
    def __init__(self):
        self.config = Config()
        random.seed(self.config.seed)
        self.memory = deque(maxlen=self.config.buffer_size)

        self.experience = namedtuple(
            'Experience',
            field_names=['state', 'actions', 'rewards', 'next_state'])

    def store(self, state, actions, reward, next_state):
        """Add a new experience to memory."""
        e = self.experience(state, actions, reward, next_state)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, self.config.batch_size)

        states = self.create_tensor(dim=self.config.state_size)
        actions = self.create_tensor(dim=self.config.action_size)
        rewards = self.create_tensor()
        next_states = self.create_tensor(dim=self.config.state_size)
        for i, e in enumerate(experiences):
            states[i] = torch.as_tensor(e.state)
            actions[i] = torch.as_tensor(e.actions)
            rewards[i] = torch.as_tensor(e.rewards)
            next_states[i] = torch.as_tensor(e.next_state)
        return (states, actions, rewards, next_states)

    def create_tensor(self, dim=0):
        batch_size = self.config.batch_size
        num_agents = self.config.num_agents
        if dim > 0:
            size = (batch_size, num_agents, dim)
        else:
            size = (batch_size, num_agents)

        tensor = torch.empty(size=size, dtype=torch.float,
                             device=self.config.device,
                             requires_grad=False)
        return tensor

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)
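
The core of the buffer is a bounded `deque` plus uniform sampling. The stripped-down sketch below, independent of PyTorch and of `Config`, shows how the oldest experiences are discarded once `maxlen` is reached; `TinyBuffer` and its parameters are hypothetical.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "actions", "rewards", "next_state"])

class TinyBuffer:
    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # old items fall off the left
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def store(self, *fields):
        self.memory.append(Experience(*fields))

    def sample(self):
        # Uniform sampling without replacement.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)

buf = TinyBuffer(buffer_size=5, batch_size=3)
for t in range(8):
    buf.store(t, [0.0, 0.0], [0.0, 0.0], t + 1)

batch = buf.sample()
print(len(buf), sorted(e.state for e in batch))
```

After eight insertions into a buffer of size five, only the transitions with states 3 through 7 remain available for sampling.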


In [13]:
import torch
import torch.nn.functional as F



class MultiAgentDDPG():
    """Manage multi agents while interacting with the environment."""
    def __init__(self):
        super(MultiAgentDDPG, self).__init__()
        self.config = Config()
        self.agents = [Agent() for _ in range(self.config.num_agents)]
        self.buffer = ReplayBuffer()

    def act(self, state):
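        actions = [agent.act(obs) \
                   for agent, obs in zip(self.agents, state)]
        # Stack into one (num_agents, action_size) array so the replay
        # buffer can convert it to a tensor in a single step
        return np.stack(actions)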

    def actions_target(self, states):
        batch_size = self.config.batch_size
        num_agents = self.config.num_agents
        action_size = self.config.action_size
        with torch.no_grad():
            actions = torch.empty(
                (batch_size, num_agents, action_size),
                device=self.config.device)
            for idx, agent in enumerate(self.agents):
                actions[:,idx] = agent.actor_target(states[:,idx])
        return actions

    def actions_local(self, states, agent_id):
        batch_size = self.config.batch_size
        num_agents = self.config.num_agents
        action_size = self.config.action_size

        actions = torch.empty(
            (batch_size, num_agents, action_size),
            device=self.config.device)
        for idx, agent in enumerate(self.agents):
            action = agent.actor_local(states[:,idx])
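            if idx != agent_id:
                # detach() is not in-place: keep the returned tensor so no
                # gradient flows into the other agents' actors
                action = action.detach()
            actions[:,idx] = action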
        return actions

    def store(self, state, actions, rewards, next_state):
        self.buffer.store(state, actions, rewards, next_state)

        if len(self.buffer) >= self.config.batch_size:
            self.learn()

    def learn(self):
        batch_size = self.config.batch_size
        for agent_id, agent in enumerate(self.agents):
            # sample a batch of experiences
            states, actions, rewards, next_states = self.buffer.sample()
            # stack the agents' variables to feed the networks
            obs = states.view(batch_size, -1)
            actions = actions.view(batch_size, -1)
            next_obs = next_states.view(batch_size, -1)
            # Consider only the rewards for this agent
            r = rewards[:,agent_id].unsqueeze_(1)

            ## Train the Critic network
            with torch.no_grad():
                next_actions = self.actions_target(next_states)
                next_actions = next_actions.view(batch_size, -1)
                next_q_val = agent.critic_target(next_obs, next_actions)
                y = r + self.config.gamma * next_q_val
            agent.critic_optimizer.zero_grad()
            q_value_predicted = agent.critic_local(obs, actions)
            loss = F.mse_loss(q_value_predicted, y)
            loss.backward()
            agent.critic_optimizer.step()

            ## Train the Actor network
            agent.actor_optimizer.zero_grad()
            actions_local = self.actions_local(states, agent_id)
            actions_local = actions_local.view(batch_size, -1)
            q_value_predicted = agent.critic_local(obs, actions_local)
            loss = -q_value_predicted.mean()
            loss.backward()
            agent.actor_optimizer.step()

        for agent in self.agents:
            agent.soft_update()

    def reset_noise(self):
        for agent in self.agents:
            agent.reset_noise()

    def state_dict(self):
        return [agent.actor_local.state_dict() for agent in self.agents]

    def load_state_dict(self, state_dicts):
        for agent, state_dict in zip(self.agents, state_dicts):
            agent.actor_local.load_state_dict(state_dict)

    def lr_step(self):
        for agent in self.agents:
            agent.lr_step()
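
The critic update in `learn` regresses Q(s, a) toward the one-step target y = r + γ · Q_target(s', a'). Here is a worked example of that arithmetic in plain Python; the rewards and Q-values are made up for illustration and are not taken from a training run.

```python
gamma = 0.99  # discount factor, as assigned to Config later in the notebook

# Illustrative batch of three transitions for one agent.
rewards = [0.1, 0.0, -0.01]
next_q_values = [0.50, 0.20, 0.00]   # stand-ins for critic_target outputs
predicted_q = [0.55, 0.25, 0.05]     # stand-ins for critic_local outputs

# One-step TD targets: y = r + gamma * Q_target(s', a')
targets = [r + gamma * q for r, q in zip(rewards, next_q_values)]

# Mean squared error, the quantity F.mse_loss computes by default.
mse = sum((p - y) ** 2 for p, y in zip(predicted_q, targets)) / len(targets)

print([round(y, 4) for y in targets], round(mse, 6))
```

Note that `Critic.forward` ends in a sigmoid, so the predicted values in the notebook always lie strictly between 0 and 1.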


In [14]:
## Initialize values in Config
config = Config()
config.seed = 997
config.device = 'cpu'
config.target_score = 0.5
config.target_episodes = 100

config.num_agents = num_agents
config.action_size = action_size
config.state_size = state_size

config.actor_layers = [64, 64]
config.critic_layers = [64, 64]
config.actor_lr = 3e-3
config.critic_lr = 4e-4
config.lr_sched_step = 1
config.lr_sched_gamma = 0.2
config.batch_normalization = True

config.buffer_size = int(1e6)
config.batch_size = 64
config.gamma = 0.99
config.tau = 8e-3
config.noise = True
config.noise_theta = 0.9
config.noise_sigma = 0.01

config.max_episodes = 2000

# Instantiate a Multi Agent
maddpg = MultiAgentDDPG()
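
With `tau = 8e-3`, each call to `soft_update` moves a target parameter less than 1% of the way toward its local counterpart. The scalar sketch below applies the same rule, θ_target = τ·θ_local + (1 − τ)·θ_target, to a single number to show how slowly the target tracks a fixed local value; the starting values are arbitrary and the function is a simplified stand-in for the method in `Agent`.

```python
tau = 8e-3

def soft_update(target, local, tau):
    """Blend a target value toward the local value by a factor tau."""
    return tau * local + (1.0 - tau) * target

target, local = 0.0, 1.0   # arbitrary starting values
history = []
for step in range(1, 501):
    target = soft_update(target, local, tau)
    if step in (1, 100, 500):
        history.append((step, round(target, 4)))

print(history)
```

After n updates the remaining gap is (1 − τ)^n of the original one, so it takes on the order of a hundred learning steps for the target to cover about half the distance.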

In [15]:
## Define the training function
def train(env, maddpg, max_episodes=5000):
    """Train a Multi Agent Deep Deterministic Policy Gradients (MADDPG).
    Params
    ======
        env (UnityEnvironment): environment for the agents
        maddpg (MultiAgentDDPG): the multi-agent DDPG controller
        max_episodes (int): maximum number of training episodes
    """
    scores = []
    agents_scores = []
    moving_avg = []
    
    avg_checkpoints = iter([0.1, 0.2, 0.4, 0.6])
    next_avg_checkpoint = next(avg_checkpoints)
    ## Train for at most max_episodes episodes
    for i_episode in range(1, max_episodes+1):
        maddpg.reset_noise()
        env_info = env.reset(train_mode=True)[brain_name]
        state = env_info.vector_observations

        scores_episode = np.zeros(config.num_agents)
        while True:
            ## Perform a step: S;A;R;S'
            actions = maddpg.act(state)
            env_info = env.step(actions)[brain_name]

            rewards = env_info.rewards
            next_state = env_info.vector_observations

            maddpg.store(state, actions, rewards, next_state)

            state = next_state
            scores_episode += rewards

            if any(env_info.local_done):
                break

        agents_scores.append(scores_episode)
        scores.append(scores_episode.max())
        moving_avg.append(np.mean(scores[-config.target_episodes:]))

        if moving_avg[-1] >= next_avg_checkpoint:
            print('\nDecreasing learning rate ...')
            maddpg.lr_step()
            # Fall back to infinity so an exhausted iterator cannot raise
            # StopIteration
            next_avg_checkpoint = next(avg_checkpoints, float('inf'))

        print('\rEpisode {:4d}\t' \
              'Last score: {:5.2f} ({:5.2f} / {:5.2f})\t' \
              'Moving average: {:5.3f}'
              .format(i_episode,
                      scores[-1], scores_episode[0], scores_episode[1],
                      moving_avg[-1]),
              end='')
        if i_episode % 100 == 0:
            print()

        ## Check if the environment has been solved
        if moving_avg[-1].mean() >= config.target_score \
                and i_episode >= config.target_episodes:
            print('\n\nEnvironment solved in {:d} episodes!\t' \
                  'Moving Average Score: {:.3f}'
                  .format(i_episode-config.target_episodes, moving_avg[-1]))
            break

    print('\n')
    return scores, agents_scores, moving_avg
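
The training loop lowers the learning rates each time the moving average crosses one of the thresholds 0.1, 0.2, 0.4, 0.6. Since `StepLR` is built here with `step_size=1` and `gamma=0.2`, each `lr_step()` call multiplies the rate by 0.2. The sketch below replays that logic on a synthetic moving-average curve, invented purely for illustration, and passes a default to `next` so that an exhausted threshold iterator cannot raise `StopIteration`.

```python
actor_lr = 3e-3          # initial actor learning rate from Config
decay = 0.2              # lr_sched_gamma from Config

checkpoints = iter([0.1, 0.2, 0.4, 0.6])
next_checkpoint = next(checkpoints)

# Synthetic moving averages, invented for this illustration.
moving_avg = [0.05, 0.09, 0.12, 0.15, 0.25, 0.30, 0.45]

lr_history = []
for avg in moving_avg:
    if avg >= next_checkpoint:
        actor_lr *= decay                       # what one lr_step() does
        next_checkpoint = next(checkpoints, float("inf"))
    lr_history.append(actor_lr)

print(lr_history)
```

In this synthetic run the rate is cut three times, from 3e-3 to roughly 2.4e-5, before the average would reach the 0.5 target.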

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Train the agent
print('Starting training phase using these settings:')
print("\n '".join(str(config.__dict__).split(", '")), '\n')

scores, agents_scores, moving_avg = train(env, maddpg, config.max_episodes)

Starting training phase using these settings:
{'device': 'cpu'
 'seed': 997
 'target_score': 0.5
 'target_episodes': 100
 'max_episodes': 2000
 'state_size': 24
 'action_size': 2
 'num_agents': 2
 'actor_layers': [64, 64]
 'critic_layers': [64, 64]
 'actor_lr': 0.003
 'critic_lr': 0.0004
 'lr_sched_step': 1
 'lr_sched_gamma': 0.2
 'batch_normalization': True
 'buffer_size': 1000000
 'batch_size': 64
 'gamma': 0.99
 'tau': 0.008
 'noise': True
 'noise_theta': 0.9
 'noise_sigma': 0.01} 

Episode    4	Last score:  0.00 ( 0.00 / -0.01)	Moving average: 0.000



Episode  100	Last score:  0.00 (-0.01 /  0.00)	Moving average: 0.007
Episode  200	Last score:  0.00 ( 0.00 / -0.01)	Moving average: 0.006
Episode  300	Last score:  0.10 ( 0.10 / -0.01)	Moving average: 0.030
Episode  400	Last score:  0.00 (-0.01 /  0.00)	Moving average: 0.060
Episode  500	Last score:  0.00 (-0.01 /  0.00)	Moving average: 0.063
Episode  600	Last score:  0.20 ( 0.09 /  0.20)	Moving average: 0.092
Episode  700	Last score:  0.00 ( 0.00 / -0.01)	Moving average: 0.077
Episode  800	Last score:  0.10 ( 0.10 / -0.01)	Moving average: 0.074
Episode  900	Last score:  0.10 ( 0.10 / -0.01)	Moving average: 0.091
Episode  935	Last score:  0.10 ( 0.09 /  0.10)	Moving average: 0.098
Decreasing learning rate ...
Episode 1000	Last score:  0.20 ( 0.19 /  0.20)	Moving average: 0.123
Episode 1100	Last score:  0.09 ( 0.09 /  0.00)	Moving average: 0.140
Episode 1200	Last score:  0.10 ( 0.10 /  0.09)	Moving average: 0.124
Episode 1300	Last score:  0.10 ( 0.10 /  0.09)	Moving average: 0.116
Episo

In [16]:
## Plot graphic of rewards

# Trace a line indicating the target value
target = [config.target_score] * len(scores)

# Graphic with the total rewards
fig = plt.figure(figsize=(18,8))
ax = fig.add_subplot(111)
ax.set_title('Plot of the rewards', fontsize='xx-large')
ax.plot(scores, label='Score', color='Blue')
ax.plot(moving_avg, label='Moving Average',
        color='DarkOrange', linewidth=3)
ax.plot(target, linestyle='--', color='LightCoral', linewidth=1 )
ax.text(0, config.target_score, 'Target', color='LightCoral', fontsize='large')
ax.set_ylabel('Score')
ax.set_xlabel('Episode #')
ax.legend(fontsize='xx-large')
plt.show()

# Graphics for each one of the agents
fig, axs = plt.subplots(1, 2, figsize=(18, 8), sharex=True, sharey=True)
fig.suptitle('Rewards for each one of the agents', fontsize='xx-large')
axs = axs.flatten()
for idx, (ax, s) in enumerate(zip(axs, np.transpose(agents_scores))):
    ax.plot(s, label='Agent Score', color='DodgerBlue', zorder=1)
    ax.plot(moving_avg, label='Moving Avg (Total)',
            color='DarkOrange', linewidth=3, alpha=0.655, zorder=2)
    ax.plot(target, linestyle='--', color='LightCoral', linewidth=1, zorder=0)
    ax.text(0, config.target_score, 'Target',
            color='LightCoral', fontsize='large')
    ax.set_title('Agent #%d' % (idx+1))
    ax.set_ylabel('Score')
    ax.set_xlabel('Episode #')
    ax.label_outer()
    ax.legend(fontsize='medium')

plt.show()

NameError: name 'scores' is not defined

In [None]:
checkpoint = {
    'maddpg_state_dict': maddpg.state_dict(),
    'scores': scores,
    'agents_scores': agents_scores,
    'moving_avg': moving_avg}
torch.save(checkpoint, 'checkpoint.pt')

In [None]:
## Test the trained model
def test(env, maddpg, max_episodes=3):
    """Test a Multi Agent Deep Deterministic Policy Gradients (MADDPG).
    Params
    ======
        env (UnityEnvironment): environment for the agents
        maddpg (MultiAgentDDPG): the multi-agent DDPG controller
        max_episodes (int): number of test episodes to play
    """
    ## Play max_episodes test episodes
    for i_episode in range(1, max_episodes+1):
        env_info = env.reset(train_mode=False)[brain_name]
        scores = np.zeros(config.num_agents)
        while True:
            states = env_info.vector_observations
            actions = maddpg.act(states)
            env_info = env.step(actions)[brain_name]
            scores += env_info.rewards
            states = env_info.vector_observations
            if any(env_info.local_done):
                break
        print('\rEpisode {:4d}\tScore: {:5.2f} ({:5.2f} / {:5.2f})\t'
              .format(i_episode, scores.max(), scores[0], scores[1]))

checkpoint = torch.load('checkpoint.pt')
config.noise = False
maddpg = MultiAgentDDPG()
maddpg.load_state_dict(checkpoint['maddpg_state_dict'])
test(env, maddpg)

In [None]:
env.close()