# Collaboration and Competition

---

The goal of this project is to solve the [Collaboration and Competition](https://github.com/udacity/deep-reinforcement-learning/tree/master/p3_collab-compet) challenge from the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.



### 1. Required Libraries

The following libraries and dependencies are used:

1. [Python 3.6](https://www.python.org/downloads/)
2. [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md)
3. [NumPy](http://www.numpy.org/)
4. [PyTorch 1.3](https://pytorch.org/)

The following files have been defined:

1. agent.py: Contains the implementation of the Random and DDPG agents and of MADDPG.
2. model.py: Contains the Actor and Critic models used by the DDPG agents.
3. noise.py: Contains the implementation of the Ornstein–Uhlenbeck noise.
4. coach.py: Contains a function to run the environment with a specified agent and defines the structure used to learn from the environment.
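
For readers unfamiliar with Ornstein–Uhlenbeck noise, the following is a minimal sketch of such a process. It is not the code in noise.py: the class name `OUNoise` and the values of `theta` and `sigma` are assumptions chosen for illustration.

```python
import numpy as np


class OUNoise:
    """Ornstein-Uhlenbeck process: correlated noise that drifts back toward `mu`."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta   # assumed value: pull strength toward the mean
        self.sigma = sigma   # assumed value: scale of the random perturbation
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean (typically once per episode)."""
        self.state = self.mu.copy()

    def sample(self):
        """Advance the process by one step and return the new noise vector."""
        drift = self.theta * (self.mu - self.state)
        diffusion = self.sigma * self.rng.standard_normal(self.state.shape)
        self.state = self.state + drift + diffusion
        return self.state


noise = OUNoise(size=2)                                    # one value per action dimension
samples = np.array([noise.sample() for _ in range(1000)])
print(samples.shape)  # (1000, 2)
```

Consecutive samples are correlated, which tends to produce smoother exploration than independent noise at every step.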


In [1]:
from unityagents import UnityEnvironment
import numpy as np

from agent import RandomAgent, DDPGAgent, MADDPG
from coach import Coach

### 2. The Environment

The compiled environment can be downloaded from the following links:

- [Linux](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P3/Tennis/Tennis_Linux.zip)
- [Mac OSX](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P3/Tennis/Tennis.app.zip)
- [Windows (32-bit)](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P3/Tennis/Tennis_Windows_x86.zip)
- [Windows (64-bit)](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P3/Tennis/Tennis_Windows_x86_64.zip)

Once downloaded, please update the environment location below:

In [2]:
environment_location = "env/Tennis.app"

In [3]:
env = UnityEnvironment(file_name=environment_location)
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


#### Observation and Action Space

The environment consists of two agents (represented as rackets) that can be controlled to play a tennis-like game. Each agent receives a reward of +0.1 when it hits the ball and a penalty of -0.01 when the ball hits the ground in its half of the court. To maximise the reward, both agents should try to keep the ball in play for as long as possible.

Each agent's observation consists of 8 variables describing the position and velocity of the ball and of its racket. The environment stacks 3 such observations, so each agent receives a vector of length 24. The combination of the two agents' observations is called the "state" here, since it contains all the information required to re-create the current situation.
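
The shapes involved can be sketched as follows, using the sizes reported in the environment log above. The array contents are placeholders, and flattening both observations into a single vector is only an assumption about how a joint state might be built, not necessarily what agent.py does.

```python
import numpy as np

n_agents = 2
obs_size = 8      # variables per observation, as reported by the environment
n_stacked = 3     # stacked observations, as reported by the environment

# Placeholder with the same shape as env_info.vector_observations
observations = np.zeros((n_agents, obs_size * n_stacked))

# Joint "state": both agents' observations concatenated into one vector
joint_state = observations.reshape(-1)

print(observations.shape)  # (2, 24)
print(joint_state.shape)   # (48,)
```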

Actions are continuous and are defined by two variables: one to move toward or away from the net and one to jump.

The episode score is defined as the maximum score of the two agents. The environment is considered solved when the average of this score over 100 consecutive episodes reaches +0.5.
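
The scoring rule can be illustrated with a short sketch. The helper names `episode_score` and `is_solved` and the sample returns are hypothetical; they are not part of coach.py.

```python
def episode_score(agent_returns):
    """Score of one episode: the larger of the two agents' summed rewards."""
    return max(agent_returns)


def is_solved(scores, window=100, target=0.5):
    """True once the mean of the last `window` episode scores reaches `target`."""
    if len(scores) < window:
        return False
    recent = list(scores)[-window:]
    return sum(recent) / window >= target


# Hypothetical per-agent returns for three episodes
returns_per_episode = [(0.10, -0.01), (0.00, 0.09), (2.60, 2.50)]
scores = [episode_score(r) for r in returns_per_episode]

print(scores)             # [0.1, 0.09, 2.6]
print(is_solved(scores))  # False: fewer than 100 episodes so far
```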

In [4]:
env_info = env.reset(train_mode=True)[brain_name]

n_agents = len(env_info.agents)
action_space = brain.vector_action_space_size
state = env_info.vector_observations

observation_space = state.shape[1]

print('Number of agents:', n_agents)
print('Each agent makes an observation of length: {}'.format(observation_space))
print('And can make an action of length: {}'.format(action_space))

Number of agents: 2
Each agent makes an observation of length: 24
And can make an action of length: 2


### 3. Settings and Parameters

The following settings and parameters have been used to train the MADDPG.

In [5]:
batch_size = 128                   # Batch size for training neural network
gamma = 0.995                      # Discount factor for future rewards
replay_buffer_size = int(1e6)      # Memory buffer size
actor_lr = 1e-4                    # Learning rate of the local actor 
critic_lr = 1e-3                   # Learning rate of the local critic 
n_updates = 5                      # Number of learning updates
tau = 5e-3                         # Hyperparameter for the soft-updates

action_range = [-1.,1.]

eps_decay = 0.999                  # Decay of the Ornstein–Uhlenbeck Noise
min_eps = 0.001                    # Minimum noise multiplier after exploration

n_episodes = 2000                  # Number of episodes on which to train
max_steps = int(1e9)               # Maximum number of frames per episode
log_interval = 100                 # Number of episodes before a fixed log of results
save_interval = 100                # Number of episodes before saving model

save_directory = 'Checkpoint/'     # Directory where models will be saved (forward slash works on every OS)
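
To make the roles of `eps_decay`, `min_eps` and `tau` more concrete, here is a minimal sketch of how such parameters are commonly used. The function names, the starting value of 1.0 for the noise multiplier and the point at which the decay is applied are assumptions, not a description of agent.py.

```python
import numpy as np


def decay_eps(eps, eps_decay=0.999, min_eps=0.001):
    """Shrink the noise multiplier geometrically, never going below min_eps."""
    return max(min_eps, eps * eps_decay)


def soft_update(local_params, target_params, tau=5e-3):
    """Move each target parameter a fraction tau toward its local counterpart."""
    return [tau * local + (1.0 - tau) * target
            for local, target in zip(local_params, target_params)]


eps = 1.0                    # assumed initial noise multiplier
for _ in range(10000):
    eps = decay_eps(eps)
print(eps)  # 0.001

local = [np.ones(3)]         # stand-in for the local network weights
target = [np.zeros(3)]       # stand-in for the target network weights
target = soft_update(local, target)
print(target[0])  # [0.005 0.005 0.005]
```

With `tau = 5e-3`, the target weights move only 0.5% of the way toward the local weights on each update, which keeps the learning targets stable.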

### 4. Defining the Coach
The Coach is responsible for running the agent and supervising its training.

In [None]:
coach = Coach(env=env,
              brain_name=brain_name,
              save_directory=save_directory
             )

We first run a random agent to get a better feel for the environment.

In [None]:
random_agent = RandomAgent(n_agents=n_agents,
                           action_space=action_space)

In [None]:
coach.watch(agent=random_agent, n_episodes=5)
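
As a rough idea of what a random agent does at each step, the sketch below draws uniform actions within the action range. The function `random_actions` is hypothetical and is not the `RandomAgent` class from agent.py.

```python
import numpy as np


def random_actions(n_agents, action_space, action_range=(-1.0, 1.0), rng=None):
    """Draw one uniformly random action vector per agent within action_range."""
    rng = np.random.default_rng() if rng is None else rng
    low, high = action_range
    return rng.uniform(low, high, size=(n_agents, action_space))


actions = random_actions(n_agents=2, action_space=2, rng=np.random.default_rng(0))
print(actions.shape)  # (2, 2)
```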

### 5. Create and train Multi Agent DDPG

In [6]:
m_agent = MADDPG(n_agents=n_agents,
                 observation_space=observation_space, 
                 action_space=action_space,
                 action_range=action_range,
                 replay_buffer_size=replay_buffer_size, 
                 batch_size=batch_size, 
                 gamma=gamma, 
                 tau=tau,
                 actor_lr=actor_lr, 
                 critic_lr=critic_lr,
                 eps_decay=eps_decay,
                 min_eps=min_eps,
                 n_updates=n_updates,
                 seed=0)

In [None]:
scores, cum_scores = coach.train(agent=m_agent, 
                                 n_episodes=n_episodes, 
                                 max_steps=max_steps,
                                 log_interval=log_interval,
                                 save_interval=save_interval)

Episode:  100/2000 | Cum.Avg.Score: 0.003 | Epis.Score: 0.000 | Elaps.Time: 0h 10m 46s
Episode:  200/2000 | Cum.Avg.Score: 0.006 | Epis.Score: 0.000 | Elaps.Time: 0h 23m 08s
Episode:  300/2000 | Cum.Avg.Score: 0.053 | Epis.Score: 0.000 | Elaps.Time: 0h 44m 32s
Episode:  400/2000 | Cum.Avg.Score: 0.053 | Epis.Score: 0.100 | Elaps.Time: 1h 06m 54s
Episode:  500/2000 | Cum.Avg.Score: 0.072 | Epis.Score: 0.200 | Elaps.Time: 1h 33m 25s
Episode:  600/2000 | Cum.Avg.Score: 0.088 | Epis.Score: 0.100 | Elaps.Time: 2h 04m 02s
Episode:  700/2000 | Cum.Avg.Score: 0.124 | Epis.Score: 0.100 | Elaps.Time: 2h 45m 16s
Episode:  800/2000 | Cum.Avg.Score: 0.194 | Epis.Score: 0.100 | Elaps.Time: 3h 48m 10s
Episode:  900/2000 | Cum.Avg.Score: 0.250 | Epis.Score: 0.100 | Elaps.Time: 5h 09m 13s
Episode:  956/2000 | Cum.Avg.Score: 0.386 | Epis.Score: 2.500 | Elaps.Time: 6h 38m 59s

In [None]:
import matplotlib.pyplot as plt

# Plot the per-episode scores and the running average returned by coach.train
fig, ax = plt.subplots()
ax.plot(np.arange(len(scores)), np.asarray(scores), c='lightsteelblue', linestyle=':', label='Episode Score')
ax.plot(np.arange(len(cum_scores)), np.asarray(cum_scores), c='royalblue', label='Average 100 last scores')
ax.set(xlabel='Episodes', ylabel='Score')
ax.legend()
plt.show()

### 6. Watch the trained MADDPG

In [None]:
coach.watch(m_agent,n_episodes=5)

In [None]:
env.close()

### 7. Ideas for future work

1. We could tune the hyperparameters to train the agents in fewer episodes.
<br><br>
2. Some of the code for the learning part could be optimized to run faster.
<br><br>
3. Since each agent is the mirror image of the other, each transition could also be stored from the other agent's point of view, which would double the experience in the replay memory.
<br><br>
4. Since the reward is very sparse at the beginning, we could try different noise functions that explore better early in training, although this might be environment-specific.
<br><br>
5. The reward function could be adjusted to give a bonus based on how close the racket is to the ball. The model could then learn to move toward the ball earlier, instead of discovering this by trial and error as the current implementation does.
<br><br>
6. We could adapt the code to train on the [soccer environment](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md).
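
The third idea could be sketched as follows. This is an untested proposal rather than part of the current implementation: the helper `mirror_transition` and the placeholder data are hypothetical, and the approach only makes sense if each agent's observation is expressed in its own frame of reference.

```python
import numpy as np


def mirror_transition(obs, actions, rewards, next_obs, dones):
    """Return the same transition with the two agents' roles swapped."""
    swap = [1, 0]
    return obs[swap], actions[swap], rewards[swap], next_obs[swap], dones[swap]


# Placeholder transition for two agents (24-dim observations, 2-dim actions)
obs = np.arange(48, dtype=float).reshape(2, 24)
actions = np.array([[0.1, 0.0], [-0.3, 1.0]])
rewards = np.array([0.1, 0.0])
next_obs = obs + 1.0
dones = np.array([False, False])

buffer = []                                    # stand-in for the replay memory
transition = (obs, actions, rewards, next_obs, dones)
buffer.append(transition)                      # original experience
buffer.append(mirror_transition(*transition))  # same experience, roles swapped
print(len(buffer))  # 2
```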