# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```python
env = UnityEnvironment(file_name="Tennis.app")
```

Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

Each observation consists of 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations, so each agent receives a state vector of length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

Run the code cell below to print some information about the environment.

In [1]:
from unityagents import UnityEnvironment
import numpy as np

env = UnityEnvironment(file_name="Tennis_Linux/Tennis.x86_64")#, no_graphics = True)
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]
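The state length of 24 reported above follows from the brain settings in the log: 8 variables per observation times 3 stacked observations. As a minimal sketch, the vector printed for the first agent can be split back into its individual frames. That the frames are ordered oldest first is an assumption here (suggested by the leading zeros right after a reset), not something the log states.

```python
import numpy as np

# State of the first agent, copied from the output above.
state = np.array([
    0., 0., 0., 0., 0., 0., 0., 0.,
    0., 0., 0., 0., 0., 0., 0., 0.,
    -6.65278625, -1.5, -0., 0., 6.83172083, 6., -0., 0.,
])

obs_size = 8      # "Vector Observation space size (per agent)" in the log
num_stacked = 3   # "Number of stacked Vector Observation" in the log

# One row per stacked observation (assumed order: oldest first).
frames = state.reshape(num_stacked, obs_size)
print(frames.shape)
print(frames[-1])   # the only non-zero frame in this particular state
```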


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agents and receive feedback from the environment.

Once this cell is executed, you can watch how the agents perform when they select actions at random at each time step.  A window should pop up that allows you to observe the agents.

Of course, as part of the project, you'll have to change the code so that the agents are able to use their experiences to gradually choose better actions when interacting with the environment!
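As a rough sketch of what "selecting actions at random" can look like, the helper below draws one action per agent from a standard normal distribution and clips it to [-1, 1]. The function name `sample_random_actions` and its `rng` argument are illustrative and not part of the Unity ML-Agents API. The defaults of 2 agents and 2 action dimensions match the values printed in section 2.

```python
import numpy as np

def sample_random_actions(num_agents=2, action_size=2, rng=None):
    """Return an array of shape (num_agents, action_size) with entries in [-1, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    actions = rng.standard_normal((num_agents, action_size))
    return np.clip(actions, -1, 1)   # keep every action component in the valid range

actions = sample_random_actions(rng=np.random.default_rng(0))
print(actions.shape)
```

An array like this is what would be passed to `env.step(...)` at each time step in place of a learned policy's output.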

When finished, you can close the environment.

### 4. It's Your Turn!

Now it's your turn to train your own agents to solve the environment!  When training, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [1]:
# Parallel Training ####

from unityagents import UnityEnvironment
import numpy as np

env = UnityEnvironment(file_name="Tennis_Linux/Tennis.x86_64")#, no_graphics = True)
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])




%load_ext autoreload
%autoreload 2

import torch
from matplotlib import pyplot as plt
from ddpg_agent import Agent
from collections import deque

env_info = env.reset(train_mode=True)[brain_name]
states = env_info.vector_observations
action_size = brain.vector_action_space_size
state_size = states.shape[1]



print_every = 30
n_episodes = 2000       # alternative value: 4000
max_t = 1000
Batch_size = 64         # alternative value: 512
N_Bootstrap = 4         # new parameter
Learning_Rate = 10
seed = 8

LR_actor = 1e-4         # learning rate of the actor
LR_critic = 1e-3        # learning rate of the critic
gamma = 0.9             # discount factor (alternative value: 0.99)

# noise parameters
theta = 0.2
sigma = 3.

def ddpg(env,agent0,agent1, print_every, n_episodes, max_t, Batch_size, N_Bootstrap, seed):
    scores_deque = deque(maxlen=print_every)
    score_list = []
    for i_episode in range(0, n_episodes+1):
        env_info = env.reset(train_mode=True)[brain_name]
        agent0.reset()
        agent1.reset()
        states = env_info.vector_observations
        score = np.asarray([0.,0.])
        #if i_episode < 300:
        #    reduction = 1.
        #else:
        #    reduction = .1
        reduction =((n_episodes-i_episode+0.0)/(n_episodes+0.0))**3.
        for t in range(max_t):
            action0 = agent0.act(np.asarray([states[0]]),reduction)   # agent 0 acts on its own observation
            action1 = agent1.act(np.asarray([states[1]]),reduction)   # agent 1 acts on its own observation
            env_info = env.step([action0,action1])[brain_name]           # send all actions to the environment
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished
            score += np.asarray(env_info.rewards)
            agent0.step(states[0], action0[0], rewards[0], next_states[0], dones[0])
            agent1.step(states[1], action1[0], rewards[1], next_states[1], dones[1])
            states = next_states
            if dones[0]:
                break
        scores_deque.append(score)
        if i_episode % print_every == 0:
            print ("start")
            print (len(agent0.memory.prios))
            print (max(agent0.memory.prios))
            #print(agent0.memory.prios)
            print('\rEpisode {}\tAverage Score: {:.4f}'.format(i_episode, np.mean(scores_deque)))    
            score_list.append(np.mean(scores_deque))
            #torch.save(agent.actor_local.state_dict(), 'checkpoint_actor_.pth')
            #torch.save(agent.critic_local.state_dict(), 'checkpoint_critic_.pth')
    return scores_deque, max(score_list)

for Batch_size in [64,128,256]:
    for factor in np.arange(1.,12.,10.):           # yields 1.0 and 11.0
        LR_actor = factor*1e-4                     # learning rate of the actor
        LR_critic = factor*1e-3                    # learning rate of the critic
        for gamma in np.arange(0.95,0.99,0.2):     # the step exceeds the range, so only 0.95 is tried
            agent0 = Agent(state_size, action_size, seed, Batch_size, Learning_Rate, N_Bootstrap, LR_actor, LR_critic, gamma, theta, sigma)
            agent1 = Agent(state_size, action_size, seed, Batch_size, Learning_Rate, N_Bootstrap, LR_actor, LR_critic, gamma, theta, sigma)
            scores, final_score = ddpg(env,agent0,agent1, print_every, n_episodes, max_t, Batch_size, N_Bootstrap, seed)
            print ('gamma:', gamma)
            print ('factor:', factor)
            print ('theta:', theta)
            print ('sigma:', sigma)
            print ('final_score:', final_score)
#torch.save(agent0.actor_local.state_dict(), 'checkpoint_actor_'+str(seed)+".pth")
#torch.save(agent0.critic_local.state_dict(), 'checkpoint_critic_'+str(seed)+".pth")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]




start
14
14.207725120544433
Episode 0	Average Score: -0.0050
start
0.09021023391683715
start
0.16187284970073368
start
0.1728696876300761
start
0.12502412313741582
start
0.11440199835869568
start
0.11197785651417408
start
0.16592011456748887
start
0.0939988368850888
start
0.11466147887262122
start
0.08509975704115864
start
0.07048381786968325
start
0.0919893352216908
start
0.08374761445134424
start
0.10120100257294756
start
0.07633750662323122
start
0.08136159828176068
start
0.08915557335389063
start
0.07906659245831141
start
0.0841221644258608
start
0.07429333994669865
start
0.22677755036109665
start
0.18825250517518186
start
0.4038324065772284
start
0.2848297687871433
start
0.4058572579387729
start
0.2852569474630946
start
0.3176432218727715
start
0.3315866572705733
start
0.16566849726869484
start
0.1737488538968433
start
0.18102649406986301
start
0.1617822097870738
start
0.20011748100183754
start
0.38056161207634587
start
0.2422268468431291
start
0.30305188452716963
start
0.17449355

KeyboardInterrupt: 
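The training loop above passes `reduction` to `agent.act` and recomputes it every episode as `((n_episodes - i_episode) / n_episodes) ** 3`. How `Agent.act` uses this value is not shown in this notebook. The sketch below assumes it scales the exploration noise and only illustrates the shape of the schedule. The function name `noise_scale` is hypothetical.

```python
def noise_scale(i_episode, n_episodes, power=3.0):
    """Decay factor that falls from 1 at episode 0 to 0 at the final episode."""
    return ((n_episodes - i_episode) / n_episodes) ** power

n_episodes = 2000
for i_episode in (0, 500, 1000, 1500, 2000):
    # print the factor at a few points along the run
    print(i_episode, noise_scale(i_episode, n_episodes))
```

Because of the cubic exponent the factor drops quickly: halfway through the run it is already (1/2)^3 = 0.125, so most of the exploration happens in the early episodes.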

In [2]:
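# rewards of all stored transitions, stacked into a column vector of shape (N, 1)
a = np.vstack([e.reward for e in agent.memory.memory if e is not None])
# replay priorities kept by the agent's memory
b = agent.memory.prios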

In [4]:
a[a>0]
#np.where(a>0)[0]

array([0.1, 0.1, 0.1])

In [11]:
a = np.array([1,1,1,2,2,2])
any(a)==2   # any(a) returns a bool, so comparing it with 2 is always False; np.any(a == 2) tests element-wise

False

In [6]:
any(a)>1   # any(a) returns a bool, so this comparison is always False; np.any(a > 1) tests element-wise

False

In [5]:
agent = Agent(state_size, action_size, seed, Batch_size, Learning_Rate, N_Bootstrap)
agent.reset()
next_state= torch.FloatTensor([next_states[1]])
state = torch.FloatTensor([states[1]])
action0 = agent.act(np.asarray([states[1]]),1)[0]
action = torch.FloatTensor([action0])

print (agent.critic_target(state, action))
print (agent.critic_local(state, action))

NameError: name 'next_states' is not defined

In [4]:
# mean replay priority of transitions with positive, negative, and zero reward
print (np.mean(np.asarray(b)[np.where(a>0)[0]]))
print (np.mean(np.asarray(b)[np.where(a<0)[0]]))
print (np.mean(np.asarray(b)[np.where(a==0)[0]]))

0.049400998604439555
0.005968637593173023
0.002796234717440362
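The three numbers above are the mean replay priority of transitions with positive, negative, and zero reward. In this run the positive-reward transitions carry the largest mean priority. The same comparison can be written with boolean masks. The sketch below uses made-up toy data rather than the agent's memory, and the helper name `mean_priority_by_reward_sign` is hypothetical.

```python
import numpy as np

def mean_priority_by_reward_sign(rewards, priorities):
    """Group priorities by the sign of the matching reward and average each group."""
    rewards = np.asarray(rewards, dtype=float).ravel()   # accepts (N,) or (N, 1)
    priorities = np.asarray(priorities, dtype=float)
    masks = {
        "positive": rewards > 0,
        "negative": rewards < 0,
        "zero": rewards == 0,
    }
    # An empty group yields NaN instead of raising a warning.
    return {
        name: float(priorities[mask].mean()) if mask.any() else float("nan")
        for name, mask in masks.items()
    }

# Toy data for illustration only.
toy_rewards = [0.1, 0.0, -0.01, 0.0, 0.1]
toy_priorities = [0.05, 0.002, 0.006, 0.004, 0.03]
result = mean_priority_by_reward_sign(toy_rewards, toy_priorities)
print(result)
```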


In [7]:
####### Random #####

states_ = []
rewards_= []
for i in range(1, 20):                                      # play game for 19 episodes
    env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    while True:
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        rewards_.append(rewards)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        states_.append(states)
        if np.any(dones):                                  # exit loop if episode finished
            break
    print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))

Score (max over agents) from episode 1: 0.0
Score (max over agents) from episode 2: 0.10000000149011612


KeyboardInterrupt: 
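The random-play loop reports the maximum score over the two agents, whereas the training loops in this notebook append the per-agent `score` array to `scores_deque` and call `np.mean` on it, which averages over both agents and all episodes in the window. The sketch below contrasts the two conventions on hypothetical per-episode scores. The variable names are illustrative.

```python
from collections import deque
import numpy as np

window = deque(maxlen=30)   # same role as scores_deque with print_every = 30

# Hypothetical per-episode scores for (agent 0, agent 1).
for episode_scores in ([0.0, -0.01], [0.1, -0.01], [0.0, -0.01]):
    window.append(np.asarray(episode_scores))

scores = np.array(window)                       # shape (episodes, agents)
mean_over_agents = scores.mean()                # what np.mean(scores_deque) computes
max_over_agents = scores.max(axis=1).mean()     # average of the per-episode maxima

print(mean_over_agents, max_over_agents)
```

Under the first convention, an episode in which one agent scores 0.0 and the other -0.01 averages to -0.005, which would be consistent with the `-0.0050` entries in the training logs.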

In [1]:
### Common training ###

from unityagents import UnityEnvironment
import numpy as np


%load_ext autoreload
%autoreload 2

import torch
from matplotlib import pyplot as plt
from ddpg_agent import Agent
from collections import deque

env = UnityEnvironment(file_name="Tennis_Linux/Tennis.x86_64")#, no_graphics = True)
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
env_info = env.reset(train_mode=True)[brain_name]
states = env_info.vector_observations
action_size = brain.vector_action_space_size
state_size = states.shape[1]



print_every = 30
n_episodes = 2500       # alternative value: 2000
max_t = 1000
Batch_size = 128        # alternative value: 512
N_Bootstrap = 1         # alternative value: 8
Learning_Rate = 10
seed = 7
LR_actor = 1e-4         # learning rate of the actor
LR_critic = 1e-4        # learning rate of the critic (alternative value: 1e-3)
gamma = 0.95            # discount factor (alternative value: 0.99)

# noise parameters
theta = 0.15
sigma = 1.2

def ddpg(env,agent, print_every, n_episodes, max_t, Batch_size, N_Bootstrap, seed):
    scores_list = []
    scores_deque = deque(maxlen=print_every)
    for i_episode in range(0, n_episodes+1):
        env_info = env.reset(train_mode=True)[brain_name]
        agent.reset()
        states = env_info.vector_observations
        score = np.asarray([0.,0.])
        #if i_episode < 300:
        #    reduction = 1.
        #else:
        #    reduction = .1
        reduction =((n_episodes-i_episode+0.0)/(n_episodes+0.0))**3.
        #reduction = 1
        for t in range(max_t):
            action0 = agent.act(np.asarray([states[0]]),reduction)[0]   # the shared agent acts for player 0
            action1 = agent.act(np.asarray([states[1]]),reduction)[0]   # the shared agent acts for player 1

            env_info = env.step([action0,action1])[brain_name]           # send all actions to the environment
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished
            score += np.asarray(env_info.rewards)
            agent.step(states[0], action0, rewards[0], next_states[0], dones[0])
            agent.step(states[1], action1, rewards[1], next_states[1], dones[1])
            states = next_states
            if dones[0]:
                break
        scores_deque.append(score)
        if i_episode % print_every == 0:
            print('\rEpisode {}\tAverage Score: {:.4f}'.format(i_episode, np.mean(scores_deque)))
            print (len(agent.memory.prios))
            print (max(agent.memory.prios))
            scores_list.append(np.mean(scores_deque))
            #print (reduction)
            #print (states[0])
            #print (action0)
            #print (states[1])
            #print (action1)           
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor_.pth')
            torch.save(agent.critic_local.state_dict(), 'checkpoint_critic_.pth')
    return scores_deque, max(scores_list)
for N_Bootstrap in np.arange(7,10,5):
    for seed in range(7,9):
        for sigma in np.arange(0.1,0.8,0.2):
            for gamma in np.arange(0.95,1.,0.09):
                agent = Agent(state_size, action_size, seed, Batch_size, Learning_Rate, N_Bootstrap, LR_actor, LR_critic, gamma, theta, sigma)
                scores, score = ddpg(env,agent, print_every, n_episodes, max_t, Batch_size, N_Bootstrap, seed)
                print ("seed")
                print (seed)
                print ("N_Boot")
                print (N_Bootstrap)
                print ("sigma")
                print (sigma)
                print ("score")
                print (score)
                print ("gamma")
                print (gamma)
        #torch.save(agent.actor_local.state_dict(), 'checkpoint_actor_'+str(seed)+".pth")
        #torch.save(agent.critic_local.state_dict(), 'checkpoint_critic_'+str(seed)+".pth")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Episode 0	Average Score: -0.0050
30
0.018332199960947038
Episode 30	Average Score: -0.0050
882
0.01692986099421978
Episode 60	Average Score: -0.0050
1734
0.017803840175271035
Episode 90	Average Score: -0.0050
2586
0.018573148012161256
Episode 120	Average Score: -0.0050
3438
0.020724912941455842
Episode 150	Average Score: -0.0050
4290
0.01608436631411314
Episode 180	Average Score: -0.0050
5142
0.017917172819375993
Episode 210	Average Score: -0.0050
5994
0.017519177705049516
Episode 240	Average Score: -0.0050
6846
0.015924567192792893
Episode 270	Average Score: -0.0050
7698
0.016179422684013844
Episode 300	Average Score: -0.0050
8550
0.015609619975090028
Episode 330	Average Score: -0.0050
9402
0.01415153017640114
Episode 360	Average Score: -0.0050
10254
0.01414496621489525
Episode 390	Average Score: -0.0050
11106
0.014008195906877519
Episode 420	Average Score: -0.0050
11958
0.012583123356103898
Episode 450	Average Score: -0.0050
12810
0.013063341215252877
Episode 480	Average Score: -0.00

Episode 1470	Average Score: 0.0083
42480
0.18341600692272186
Episode 1500	Average Score: 0.0217
43918
0.20366111195087433
Episode 1530	Average Score: 0.0300
45430
0.20395317471027374
Episode 1560	Average Score: 0.0267
46916
0.19380646741390228
Episode 1590	Average Score: 0.0382
48626
0.15537135100364685
Episode 1620	Average Score: 0.0233
50004
0.1251862326860428
Episode 1650	Average Score: 0.0050
51060
0.12123227661848068
Episode 1680	Average Score: 0.0033
52096
0.09908039665222168
Episode 1710	Average Score: -0.0033
52976
0.0877557098865509
Episode 1740	Average Score: -0.0050
53828
0.08343293517827988
Episode 1770	Average Score: -0.0050
54680
0.08851666748523712
Episode 1800	Average Score: -0.0050
55532
0.09386064112186432
Episode 1830	Average Score: -0.0050
56384
0.0970723032951355
Episode 1860	Average Score: -0.0050
57236
0.09606428444385529
Episode 1890	Average Score: -0.0050
58088
0.09857131540775299
Episode 1920	Average Score: -0.0050
58940
0.1029062569141388
Episode 1950	Average

Episode 360	Average Score: -0.0050
10252
0.027593005284667016
Episode 390	Average Score: -0.0050
11104
0.02599261870980263
Episode 420	Average Score: -0.0050
11956
0.03473662754893303
Episode 450	Average Score: -0.0050
12808
0.03287298029661179
Episode 480	Average Score: -0.0050
13660
0.042518334299325944
Episode 510	Average Score: -0.0050
14512
0.051521768629550935
Episode 540	Average Score: -0.0050
15364
0.05227645283937454
Episode 570	Average Score: -0.0050
16216
0.06630348211526871
Episode 600	Average Score: -0.0050
17068
0.05886816403269768
Episode 630	Average Score: -0.0050
17920
0.05420222303271294
Episode 660	Average Score: -0.0050
18772
0.051086259841918946
Episode 690	Average Score: -0.0050
19624
0.049379652202129365
Episode 720	Average Score: -0.0050
20476
0.046515965670347215
Episode 750	Average Score: -0.0050
21328
0.0456758046746254
Episode 780	Average Score: -0.0050
22180
0.04051624035835266
Episode 810	Average Score: -0.0050
23032
0.04051624035835266
Episode 840	Average

Episode 1830	Average Score: -0.0050
52000
0.11104939675331116
Episode 1860	Average Score: -0.0050
52852
0.11921364611387253
Episode 1890	Average Score: -0.0050
53704
0.12411632186174393
Episode 1920	Average Score: -0.0050
54556
0.13417640125751495
Episode 1950	Average Score: -0.0050
55408
0.13081267273426056
Episode 1980	Average Score: -0.0050
56260
0.14045215940475464
Episode 2010	Average Score: -0.0050
57112
0.1428905109167099
Episode 2040	Average Score: -0.0050
57964
0.15196762776374817
Episode 2070	Average Score: -0.0050
58816
0.15989330208301544
Episode 2100	Average Score: -0.0050
59668
0.16780288314819336
Episode 2130	Average Score: -0.0050
60520
0.17958409881591797
Episode 2160	Average Score: -0.0050
61372
0.1850859353542328
Episode 2190	Average Score: -0.0050
62224
0.18595292961597443
Episode 2220	Average Score: -0.0050
63076
0.19499398565292358
Episode 2250	Average Score: -0.0050
63928
0.19504138624668121
Episode 2280	Average Score: -0.0050
64780
0.1959545294046402
Episode 231

Episode 750	Average Score: -0.0050
21328
0.16316573119163513
Episode 780	Average Score: -0.0050
22180
0.16750937497615814
Episode 810	Average Score: -0.0050
23032
0.15823785758018494
Episode 840	Average Score: -0.0050
23884
0.1615757176876068
Episode 870	Average Score: -0.0050
24736
0.16341705417633057
Episode 900	Average Score: -0.0050
25588
0.16533335840702057
Episode 930	Average Score: -0.0050
26440
0.16391874647140503
Episode 960	Average Score: -0.0050
27292
0.16486432945728302
Episode 990	Average Score: -0.0050
28144
0.16197870469093323
Episode 1020	Average Score: -0.0050
28996
0.15433335101604462
Episode 1050	Average Score: -0.0050
29848
0.17459673976898193
Episode 1080	Average Score: -0.0050
30700
0.17068534886837006
Episode 1110	Average Score: -0.0050
31552
0.15344336426258087
Episode 1140	Average Score: -0.0050
32404
0.16493260657787323
Episode 1170	Average Score: -0.0050
33256
0.1678798327445984
Episode 1200	Average Score: -0.0050
34108
0.16985173320770264
Episode 1230	Averag

KeyboardInterrupt: 
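The nested `for` loops in the cell above define a small hyperparameter grid. Some of the `np.arange` calls produce a single value because the step is larger than the range. As a sketch, the same ranges can be enumerated with `itertools.product` to see which combinations are actually run. The plural variable names are introduced here for illustration.

```python
from itertools import product
import numpy as np

# Same ranges as in the training cell above.
n_bootstraps = np.arange(7, 10, 5)      # step exceeds the range: a single value
seeds = range(7, 9)                     # two seeds
sigmas = np.arange(0.1, 0.8, 0.2)       # several noise levels
gammas = np.arange(0.95, 1., 0.09)      # step exceeds the range: a single value

grid = list(product(n_bootstraps, seeds, sigmas, gammas))
print(len(grid), "training runs")
for n_boot, seed, sigma, gamma in grid:
    # round only for display; arange accumulates small floating-point errors
    print(int(n_boot), seed, round(float(sigma), 2), round(float(gamma), 2))
```

Listing the grid before launching training makes it easier to spot a range that collapses to one value, and to estimate the total run time from the number of combinations.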

In [3]:
print (len(agent.memory.prios))
print (max(agent.memory.prios))

9592
0.11192409491539002


In [4]:
agent.memory.prios

[0.0022690584883093834,
 0.0016085256356745958,
 0.001421253546141088,
 0.005809379741549492,
 0.0014201331650838256,
 0.003022050019353628,
 0.0022776685655117035,
 0.0022812196984887123,
 0.0027210828848183155,
 0.0012951868120580912,
 0.001032317872159183,
 0.0018233252922073007,
 0.0013378564035519958,
 0.0010935005266219378,
 0.0011042197002097964,
 0.0030044359154999256,
 0.0016791749512776732,
 0.0018595770234242082,
 0.0021947333589196205,
 0.00206717848777771,
 0.0014462462859228253,
 0.006615178193897009,
 0.0013813438126817346,
 0.0028803013265132904,
 0.0010158588411286473,
 0.0062149688601493835,
 0.00300012668594718,
 0.004571408033370972,
 0.0029637212865054607,
 0.002048277761787176,
 0.0068999966606497765,
 0.0016655271174386144,
 0.0018229936249554157,
 0.0012747751316055655,
 0.002908041700720787,
 0.0021429448388516903,
 0.001732514938339591,
 0.002091230358928442,
 0.0014329749392345548,
 0.002220365684479475,
 0.003539910539984703,
 0.0012159313773736358,
 0.00122

In [4]:
len(states)

2

In [16]:
env_info.local_done

[False, False]