# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
import torch
print(torch.cuda.is_available())

True


In [2]:
# !pip install numpy==1.13.3
# %cd python
# !pip install -r requirements.txt
# !pip install -e .

# !pip install protobuf==3.20.2

In [3]:
from unityagents import UnityEnvironment
import numpy as np
from ddpg_agent import Agent, load_and_run
from collections import deque
import torch
print(torch.cuda.is_available())


cuda:0
True


Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```

In [4]:
env = UnityEnvironment(file_name="Tennis_Windows_x86_64/Tennis.exe")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [5]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

Each observation consists of 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations, so each agent receives a state vector of length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

Run the code cell below to print some information about the environment.

In [6]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]
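
The 24 values printed above are consistent with the 3 stacked observations of 8 variables each that the environment reported at startup. The sketch below splits the printed state into three frames. It assumes the frames are stored contiguously with the most recent one last, which matches the leading zeros in the printout but is not confirmed here. `split_frames` is a hypothetical helper, not part of `unityagents`.

```python
# Values copied from the printed state of the first agent.
state = [
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
    -6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0,
]


def split_frames(flat_state, frame_size=8):
    """Split a flat stacked state into consecutive frames of frame_size values."""
    if len(flat_state) % frame_size != 0:
        raise ValueError("state length must be a multiple of frame_size")
    return [
        flat_state[start:start + frame_size]
        for start in range(0, len(flat_state), frame_size)
    ]


frames = split_frames(state)
for index, frame in enumerate(frames):
    # Assumption: index 2 is the most recent observation.
    print(index, frame)
```

Only the last frame is non-zero here, which is what one would expect right after a reset if the earlier frames have not been filled yet.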


### 3. Define the Learning Process with the Deep Deterministic Policy Gradient Method


The learning process, based on the Deep Deterministic Policy Gradient (DDPG) method, is defined in the function `ddpg` below. Its input parameters are described in the docstring. It returns two lists: the maximum score over the two agents for each episode, and those maximum scores averaged over the current and preceding 99 episodes (100-episode averages). For the first 99 episodes, the average is taken over the episodes completed so far, even though there are fewer than 100.
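
As a minimal sketch of the 100-episode average described above, the snippet below keeps the most recent scores in a `deque` with a fixed maximum length and averages whatever it currently holds. `rolling_averages` and `example_scores` are hypothetical names and values chosen for illustration only.

```python
from collections import deque


def rolling_averages(scores, window=100):
    """Average each score with up to window - 1 preceding scores."""
    recent = deque(maxlen=window)  # old scores drop out automatically
    averages = []
    for score in scores:
        recent.append(score)
        averages.append(sum(recent) / len(recent))
    return averages


# Hypothetical per-episode maximum scores: seven 0.0 episodes, two 0.1, one 0.0.
example_scores = [0.0] * 7 + [0.1, 0.1, 0.0]
averages = rolling_averages(example_scores)
print(['{:.4f}'.format(a) for a in averages[-3:]])
# prints ['0.0125', '0.0222', '0.0200']
```

Before the window is full, the divisor is the number of episodes seen so far, which is why early averages can move quickly.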

In [7]:
def ddpg(
    agent,
    n_episodes=2000,
    max_t=1500,
    print_every=100,
    gamma_initial = 0.9,
    gamma_final = 0.99,
    gamma_rate = 0.002,
    tau_initial = 0.02,
    tau_final = 0.001,
    tau_rate = 0.001,
    noise_factor = 1.0
):
    """
    Reinforcement learning with Deep Deterministic Policy Gradients

    agent (Agent): The learning agent to train
    n_episodes (int): Maximum number of training episodes
    max_t (int): Maximum number of timesteps per episode
    print_every (int): Number of episodes in the moving-average window (100 by default)
    gamma_initial (float): Initial gamma discount factor (0 to 1). Higher values favor long term over current rewards.
    gamma_final (float): Final gamma discount factor (0 to 1).
    gamma_rate (float): A rate (0 to 1) for moving gamma toward gamma_final each episode.
    tau_initial (float): Initial value for tau, the weighting factor for soft updating the neural network.
        The tau parameters have no effect if the agent uses fixed Q targets instead of soft updating.
    tau_final (float): Final value of tau.
    tau_rate (float): Rate (0 to 1) for moving tau toward tau_final each episode.
    noise_factor (float): Factor by which the noise scale is multiplied after each episode.
    
    Returned values:
        max_scores[]: The maximum scores for each episode.
        avg_max_scores[]: The maximum scores averaged over the maximum score for the current episode and the preceding
            99 episodes (100-episode averages).
    
    """
    
    gamma = gamma_initial
    gamma_scale = 1.0 - gamma_rate
    
    tau = tau_initial
    tau_scale = 1.0 - tau_rate
    
    noise_scale = 1.0
    
    success = False
    first05 = False
    both05 = False
    both1 = False
    
    max_scores_deque = deque(maxlen = print_every)
    #scores_deque = deque(maxlen=print_every)
    #scores = []
    max_scores = []
    avg_max_scores = []
    best_avg_max = 0.0
    best_agent_max = 0.0
    for i_episode in range(1, n_episodes+1):
        
        # Reset environment
        env_info = env.reset(train_mode=True)[brain_name]
        
        # Get next state
        state = env_info.vector_observations
        
        # state = env.reset()
        agent.reset()

        score = np.zeros(agent.num_agents)
        
        for t in range(max_t):
            
            # Get actions
            action = agent.act(state, noise_scale)
            #print(action)

            # Send actions to the environment
            env_info = env.step(action)[brain_name]
            
            # Get next state
            next_state = env_info.vector_observations
            
            # Get rewards
            reward = env_info.rewards
            
            # Check if episode is finished
            done = env_info.local_done
            
            # Make the agent proceed to the next timestep in the environment
            agent.step(state, action, reward, next_state, done, gamma, tau)
            
            # Add rewards to scores
            score += reward
            
            # Replace the current state with the next state for the next timestep
            state = next_state
            
            # Exit if episode is finished
            if np.any(done):
                break
                
        #print('Total score (averaged over agents) this episode: {}'.format(np.mean(score)))
        agent_avg = np.mean(score)
        agent_max = np.max(score)
        agent_min = np.min(score)
        max_scores.append(agent_max)
        max_scores_deque.append(agent_max)  
        avg_max = np.mean(max_scores_deque)
        avg_max_scores.append(avg_max)
        #scores_deque.append(agent_avg)
        #scores.append(agent_avg)
        # avg_score = np.mean(max_scores_deque)
                          
        print('Ep {}\tEp AvgMax: {:.4f}\tAg1: {:.4f}\tAg2: {:.4f}\tMax: {:.4f}\tg: {:.4f}\tns: {:.4f}\ttau: {:.4f}'.format(
            i_episode, avg_max, score[0], score[1], agent_max, gamma, noise_scale, tau))
        if not first05 and agent_max > 0.5:
            first05 = True
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor_first.pth')
            torch.save(agent.critic_local_1.state_dict(), 'checkpoint_critic_1_first.pth')
            torch.save(agent.critic_local_2.state_dict(), 'checkpoint_critic_2_first.pth')
            print("Agent max score >0.5 after {:d} episodes.".format(i_episode))
        if not both05 and score[0] > 0.5 and score[1] > 0.5:
            both05 = True
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor_both05.pth')
            torch.save(agent.critic_local_1.state_dict(), 'checkpoint_critic_1_both05.pth')
            torch.save(agent.critic_local_2.state_dict(), 'checkpoint_critic_2_both05.pth')
            print("Both agents score >0.5 after {:d} episodes.".format(i_episode))
        if not both1 and score[0] > 1.0 and score[1] > 1.0:
            both1 = True
            best_agent_max = agent_max
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor_both1.pth')
            torch.save(agent.critic_local_1.state_dict(), 'checkpoint_critic_1_both1.pth')
            torch.save(agent.critic_local_2.state_dict(), 'checkpoint_critic_2_both1.pth')
            print("Both agents score >1 after {:d} episodes.".format(i_episode))
        if i_episode >=100 and not success and avg_max > 0.5:
            success = True
            best_avg_max = avg_max
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
            torch.save(agent.critic_local_1.state_dict(), 'checkpoint_critic_1.pth')
            torch.save(agent.critic_local_2.state_dict(), 'checkpoint_critic_2.pth')
            print("100-episode-average max score >0.5 after {:d} episodes.".format(i_episode))
        if success and avg_max > best_avg_max:
            best_avg_max = avg_max
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor_best_avg_max.pth')
            torch.save(agent.critic_local_1.state_dict(), 'checkpoint_critic_1_best_avg_max.pth')
            torch.save(agent.critic_local_2.state_dict(), 'checkpoint_critic_2_best_avg_max.pth')
            print("New best 100-episode-average at Episode {:d}.".format(i_episode))
        if both1 and agent_max > best_agent_max:
            best_agent_max = agent_max
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor_best_agent_max.pth')
            torch.save(agent.critic_local_1.state_dict(), 'checkpoint_critic_1_best_agent_max.pth')
            torch.save(agent.critic_local_2.state_dict(), 'checkpoint_critic_2_best_agent_max.pth')
            print("New best agent maximum score at Episode {:d}.".format(i_episode))
                
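        # Move gamma toward gamma_final by a fraction gamma_rate of the remaining gap.
        gamma = gamma_final - gamma_scale * (gamma_final - gamma)
        
        # Move tau toward tau_final in the same way.
        tau = tau_final - tau_scale * (tau_final - tau)
        
        # Scale the exploration noise for the next episode.
        noise_scale *= noise_factor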

    torch.save(agent.actor_local.state_dict(), 'checkpoint_actor_final.pth')
    torch.save(agent.critic_local_1.state_dict(), 'checkpoint_critic_1_final.pth')
    torch.save(agent.critic_local_2.state_dict(), 'checkpoint_critic_2_final.pth')
    return max_scores, avg_max_scores
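
The per-episode update of `gamma` and `tau` at the end of the loop in `ddpg` closes a fixed fraction of the remaining gap to the final value, so after `n` updates the value is `final - (1 - rate)**n * (final - initial)`. The sketch below reproduces that update in isolation. `anneal` and `schedule` are hypothetical helper names, and the arguments are the values passed to `ddpg` in the training cell of this notebook.

```python
def anneal(value, final, rate):
    """Move value toward final by a fraction rate of the remaining gap."""
    return final - (1.0 - rate) * (final - value)


def schedule(initial, final, rate, n_episodes):
    """Return the value used in each of the first n_episodes episodes."""
    values = [initial]
    for _ in range(n_episodes - 1):
        values.append(anneal(values[-1], final, rate))
    return values


gammas = schedule(initial=0.95, final=0.99, rate=0.01, n_episodes=5)
taus = schedule(initial=0.01, final=0.001, rate=0.001, n_episodes=5)
print(['{:.4f}'.format(g) for g in gammas])
print(['{:.4f}'.format(t) for t in taus])
```

Because the same formula is used for both parameters, `gamma` rises toward `gamma_final` while `tau` falls toward `tau_final`. The direction depends only on whether the initial value is below or above the final one.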


### 4. Declare the Learning Agent

Agent has the following parameters:

    state_size: Number of parameters defining the environment state
    action_size: Number of parameters defining the actions
    num_agents: Number of learning agents
    random_seed: Random seed number
    batch_size: Batch size for neural network training
    lr_actor: Learning rate for the actor neural network
    lr_critic: Learning rate for the critic neural network
    noise_theta (float): theta for Ornstein-Uhlenbeck noise process
    noise_sigma (float): sigma for Ornstein-Uhlenbeck noise process
    actor_fc1 (int): Number of hidden units in the first fully connected layer of the actor network
    actor_fc2: Units in second layer
    actor_fc3: Units in third fully connected layer. This parameter does nothing for the "RELU" network
    critic_fc1: Number of hidden units in the first fully connected layer of the critic network
    critic_fc2: Units in second layer
    critic_fc3: Units in third layer. This parameter does nothing for the "RELU" network
    update_every: The number of time steps between each updating of the neural networks 
    num_updates: The number of times to update the networks at every update_every interval
    buffer_size: Buffer size for experience replay. Default 2e6.
    network (string): The name of the neural networks that are used for learning.
        There are only 2 choices, one with only 2 fully connected layers and RELU activations and one
        with 3 fully connected layers with SELU activations.
        Their names are "RELU" and "SELU," respectively. Default is "RELU."
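
For readers unfamiliar with the `noise_theta` and `noise_sigma` parameters, the following is a minimal sketch of a discrete Ornstein-Uhlenbeck process. It is an assumption about the general technique, not the implementation in `ddpg_agent`, which is not shown in this notebook. The class name `OUNoise` and its `mu` and `seed` arguments are hypothetical.

```python
import random


class OUNoise:
    """Illustrative Ornstein-Uhlenbeck process (not the class from ddpg_agent)."""

    def __init__(self, size, mu=0.0, theta=0.1, sigma=0.05, seed=0):
        self.mu = [mu] * size        # long-run mean of each component
        self.theta = theta           # strength of the pull back toward mu
        self.sigma = sigma           # scale of the random perturbation
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process by one step and return the new state."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


noise = OUNoise(size=2)
for _ in range(3):
    print(noise.sample())
```

A larger `theta` pulls the noise back to its mean faster, and a larger `sigma` makes each step more random. Successive samples are correlated, which is the usual reason for choosing this process over independent Gaussian noise.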


In [8]:
agent = Agent(
    state_size = state_size,
    action_size = action_size,
    num_agents = num_agents,
    random_seed = 0,
    batch_size = 1024, 
    lr_actor = 0.001,
    lr_critic = 0.001,
    noise_theta = 0.1,
    noise_sigma = 0.05,
    actor_fc1 = 128,
    actor_fc2 = 128,
    critic_fc1 = 128,
    critic_fc2 = 128,
    update_every = 20,
    num_updates = 15,
    buffer_size = int(2e6)
)

### 5. Train the Agent

Perform the training and collect the scores. The following are printed for every episode:

    Ep: The episode number
    Ep AvgMax: The agent-maximum score averaged over the current episode and the previous 99 episodes.
    Ag1: The score for the first agent for the current episode
    Ag2: The score for the second agent.
    Max: The maximum of the 2 scores
    g: The gamma discount factor for the current episode.
    ns: The noise scaling factor for the current episode.
    tau: The weight factor used for soft updating for the current episode.
    
    Messages will appear to notify the user when various achievements occur:
        1. First time an agent achieves a maximum score >0.5
        2. First time both agents score >0.5
        3. First time both agents score >1
        4. Each time a new best agent-maximum score is achieved (after achievement 3)
        5. First time the 100-episode-average maximum score >0.5 (after at least 100 episodes)
        6. Each time a new best 100-episode-average maximum score is achieved (after achievement 5)
        
        A checkpoint is saved at each of these notifications. The checkpoints are overwritten for 4 and 6.
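
The `tau` value printed for each episode is the weight used for soft updating. As a hedged illustration of what that weight usually means, the sketch below blends a list of target parameters toward the local ones. It uses plain Python floats rather than PyTorch tensors, and `soft_update` is a hypothetical function, since the agent's actual update lives in `ddpg_agent` and is not shown here.

```python
def soft_update(local_params, target_params, tau):
    """Return target parameters blended toward the local ones by weight tau."""
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_params, target_params)
    ]


local = [1.0, 2.0]    # stand-in for the local network weights
target = [0.0, 0.0]   # stand-in for the target network weights
for _ in range(3):
    target = soft_update(local, target, tau=0.01)
print(target)
```

With a small `tau` the target moves slowly, which is why lowering `tau` over training makes the target networks more stable later on.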

In [9]:
max_scores, avg_max_scores = ddpg(
    agent,
    n_episodes = 2000,
    max_t = 10000,
    gamma_initial = 0.95,
    gamma_final = 0.99,
    gamma_rate = 0.01,
    tau_initial = 0.01,
    tau_final = 0.001,
    tau_rate = 0.001,
    noise_factor = 0.9999
)



Ep 1	Ep AvgMax: 0.0000	Ag1: 0.0000	Ag2: -0.0100	Max: 0.0000	g: 0.9500	ns: 1.0000	tau: 0.0100
Ep 2	Ep AvgMax: 0.0000	Ag1: 0.0000	Ag2: -0.0100	Max: 0.0000	g: 0.9504	ns: 0.9999	tau: 0.0100
Ep 3	Ep AvgMax: 0.0000	Ag1: 0.0000	Ag2: -0.0100	Max: 0.0000	g: 0.9508	ns: 0.9998	tau: 0.0100
Ep 4	Ep AvgMax: 0.0000	Ag1: 0.0000	Ag2: -0.0100	Max: 0.0000	g: 0.9512	ns: 0.9997	tau: 0.0100
Ep 5	Ep AvgMax: 0.0000	Ag1: 0.0000	Ag2: -0.0100	Max: 0.0000	g: 0.9516	ns: 0.9996	tau: 0.0100
Ep 6	Ep AvgMax: 0.0000	Ag1: 0.0000	Ag2: -0.0100	Max: 0.0000	g: 0.9520	ns: 0.9995	tau: 0.0100
Ep 7	Ep AvgMax: 0.0000	Ag1: 0.0000	Ag2: -0.0100	Max: 0.0000	g: 0.9523	ns: 0.9994	tau: 0.0099
Ep 8	Ep AvgMax: 0.0125	Ag1: 0.1000	Ag2: 0.0900	Max: 0.1000	g: 0.9527	ns: 0.9993	tau: 0.0099
Ep 9	Ep AvgMax: 0.0222	Ag1: 0.0900	Ag2: 0.1000	Max: 0.1000	g: 0.9531	ns: 0.9992	tau: 0.0099
Ep 10	Ep AvgMax: 0.0200	Ag1: 0.0000	Ag2: -0.0100	Max: 0.0000	g: 0.9535	ns: 0.9991	tau: 0.0099
Ep 11	Ep AvgMax: 0.0182	Ag1: 0.0000	Ag2: -0.0100	Max: 0.0000	g: 0.9538	

KeyboardInterrupt: 

In [None]:
# 629m 25s
import matplotlib.pyplot as plt

#avg_mask = np.ones(100) / 100
#score_avg= np.convolve(scores, avg_mask, 'valid')

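fig = plt.figure()
ax = fig.add_subplot(111)
# Plot the per-episode maximum score and its 100-episode average on the same axes.
ax.plot(np.arange(1, len(max_scores)+1), max_scores, label='max score')
ax.plot(np.arange(1, len(max_scores)+1), avg_max_scores, label='avg')
ax.set_ylabel('Score')
ax.set_xlabel('Episode #')
ax.legend()
plt.show()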

### 6. Run the Trained Agent

In [None]:
# Define the agent again. This is not necessary if the entire notebook was run from the beginning.
# However, I had to run this section again and did not want to repeat the learning in Section 5.

agent = Agent(
    state_size = state_size,
    action_size = action_size,
    num_agents = num_agents,
    random_seed = 19,
    actor_fc1 = 128,
    actor_fc2 = 128,
    critic_fc1 = 128,
    critic_fc2 = 128,
)

In [None]:
load_and_run(agent, env, 'checkpoint_actor_best_avg_max.pth', 'checkpoint_critic_1_best_avg_max.pth', 100)

In [None]:
load_and_run(agent, env, 'checkpoint_actor_best_agent_max.pth', 'checkpoint_critic_1_best_agent_max.pth', 100)

When finished, you can close the environment.

In [None]:
env.close()