## Project Collaboration and Competition - Report

### DDPG and MADDPG Algorithms 

In this project, we use the **DDPG** (Deep Deterministic Policy Gradient) algorithm and **MADDPG** (**Multi-Agent DDPG**),
a wrapper around DDPG. DDPG concurrently learns a Q-function and a policy.
It uses off-policy data and the Bellman equation to learn the Q-function, and uses
the Q-function to learn the policy. This dual mechanism is the actor-critic method. The DDPG algorithm uses
two additional mechanisms: _Replay Buffer_ and _Soft Updates_.

In MADDPG, we train two separate agents that need to both **collaborate** (for example, keep the ball from hitting the ground)
and **compete** (for example, gather as many points as possible). Simply extending single-agent RL
by training the two agents independently does not work well: each agent updates
its policy as learning progresses, which makes the environment appear non-stationary from the viewpoint
of any one agent.

In MADDPG, _each agent’s critic is trained using the observations and actions_ from **both agents**, whereas
each _agent’s actor is trained using just_ its **own observations**.
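
The difference between the two inputs can be sketched as follows. This is only an illustration: the helper names `critic_input` and `actor_input` are hypothetical, and the concatenation order (all states first, then all actions) is an assumption, not necessarily the order used in the project code.

```python
STATE_SIZE = 24   # per-agent observation size (see the architecture section)
ACTION_SIZE = 2   # per-agent action size
NUM_AGENTS = 2


def critic_input(all_states, all_actions):
    """Concatenate every agent's observation and action into one flat vector."""
    flat = []
    for state in all_states:
        flat.extend(state)
    for action in all_actions:
        flat.extend(action)
    return flat


def actor_input(all_states, agent_number):
    """The actor only sees the observation of its own agent."""
    return list(all_states[agent_number])


states = [[0.0] * STATE_SIZE for _ in range(NUM_AGENTS)]
actions = [[0.0] * ACTION_SIZE for _ in range(NUM_AGENTS)]

critic_vec = critic_input(states, actions)        # length (24 + 2) * 2
actor_vec = actor_input(states, agent_number=0)   # length 24
```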

In the function _step()_ of the _class maddpg_\__agent_, we store the current information
for **both agents** in the **common** variable
_memory_ of type _ReplayBuffer_. We then draw a random _sample_ from _memory_ into the variable _experiences_.
These _experiences_, together with the index of the current agent (0 or 1), are passed to the function _learn()_. There we get the corresponding
agent (of type _ddpg_\__agent_):

      agent = self.agents[agent_number]

and the _experiences_ are passed on to the function _learn()_ of the _class ddpg_\__agent_. There, the actor and the critic
are handled in different ways.
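
A shared replay buffer of this kind could look roughly like the sketch below. It is a minimal stand-in written with the standard library only; the class name `SharedReplayBuffer`, the `Experience` fields and the constructor arguments are assumptions for illustration and do not reproduce the project's _ReplayBuffer_.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["states", "actions", "rewards", "next_states", "dones"]
)


class SharedReplayBuffer:
    """Fixed-size buffer holding joint transitions for both agents."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, states, actions, rewards, next_states, dones):
        self.memory.append(Experience(states, actions, rewards, next_states, dones))

    def sample(self):
        """Draw a uniform random minibatch without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = SharedReplayBuffer(buffer_size=100, batch_size=4)
for t in range(10):
    # One joint transition: index 0 and index 1 are the two agents.
    buffer.add(
        states=[[t], [t]], actions=[[0.0], [0.0]],
        rewards=[0.0, 0.0], next_states=[[t + 1], [t + 1]], dones=[False, False],
    )
batch = buffer.sample()
```

Because each stored entry contains the data of both agents, a single sample gives a critic everything it needs, while an actor can pick out only its own slice.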







### Eight Neural Networks

In this project, there are 8 _neural networks_. For training, we
create one _maddpg agent_:

         maddpg = maddpg_agent()
         
In turn, _maddpg agent_ creates 2 _ddpg agents_: 
         
         self.agents = [ddpg_agent(state_size, action_size, i+1, random_seed=0) 
                  for i in range(num_agents)]    

Each of the two agents (red and blue) creates 4 neural networks:

        self.actor_local = Actor(state_size, action_size).to(device)
        self.actor_target = Actor(state_size, action_size).to(device)

        self.critic_local = Critic(state_size, action_size).to(device)
        self.critic_target = Critic(state_size, action_size).to(device)

The classes Actor and Critic are provided by **model.py**. The typical behavior of the actor is as follows:

        actor_target(state) -> next_actions
        actor_local(states) -> actions_pred
        
See the function _learn()_ in the maddpg agent. The typical behavior of the critic is as follows:

        critic_target(state, action) -> Q-value 
        -critic_local(states, actions_pred) -> actor_loss
        
See the function _learn()_ in the ddpg agent.
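
To make the roles of the target and local networks more concrete, here is a small numeric sketch of the quantities involved. It assumes the standard DDPG formulation (a Bellman target built from the target networks, a mean-squared-error critic loss, and an actor loss equal to the negative mean Q-value); the function names are hypothetical and plain floats replace tensors.

```python
GAMMA = 0.99  # discount factor from the Hyperparameters section


def td_target(reward, q_next, done, gamma=GAMMA):
    """Bellman target: r + gamma * Q_target(s', a'), or just r if the episode ended."""
    return reward + gamma * q_next * (1.0 - float(done))


def critic_loss(q_expected, q_targets):
    """Mean squared error between local critic estimates and TD targets (assumed loss)."""
    errors = [(e - t) ** 2 for e, t in zip(q_expected, q_targets)]
    return sum(errors) / len(errors)


def actor_loss(q_of_predicted_actions):
    """Negative mean Q-value: minimizing it pushes the actor toward higher Q."""
    return -sum(q_of_predicted_actions) / len(q_of_predicted_actions)


# q_next stands for critic_target(next_state, actor_target(next_state)).
targets = [td_target(0.1, 0.5, done=False), td_target(0.1, 0.5, done=True)]
loss_c = critic_loss([0.5, 0.2], targets)
loss_a = actor_loss([0.4, 0.6])
```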
        

### Architecture of the actor and critic networks

Both the actor and critic classes implement a neural network
with 3 fully connected layers and 2 ReLU (rectified linear unit) activations. The networks are implemented
in PyTorch. A network of this kind is used in the Udacity model.py code for the Pendulum model using DDPG.
The sizes (inputs x outputs) of the fully connected layers are as follows:

for the actor:  
Layer fc1, size: state_size x fc1_units,  
Layer fc2, size: fc1_units x fc2_units,  
Layer fc3, size: fc2_units x action_size;  

for the critic:  
Layer fcs1, size: ((state_size + action_size) x n_agents) x fcs1_units,  
Layer fc2, size: fcs1_units x fc2_units,  
Layer fc3, size: fc2_units x 1.  

Here, state_size = 24 and action_size = 2.  
The input parameters fc1_units, fc2_units and fcs1_units are all set to 64.
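
As a worked check of these sizes, the sketch below counts the trainable parameters of one actor and one critic. It assumes n_agents = 2 and that every fully connected layer has a bias term (the default for PyTorch linear layers); the helper `linear_params` is introduced here only for the arithmetic.

```python
STATE_SIZE, ACTION_SIZE, N_AGENTS = 24, 2, 2
FC1_UNITS = FC2_UNITS = FCS1_UNITS = 64


def linear_params(n_in, n_out, bias=True):
    """Number of trainable parameters in one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


actor_layers = [
    (STATE_SIZE, FC1_UNITS),
    (FC1_UNITS, FC2_UNITS),
    (FC2_UNITS, ACTION_SIZE),
]
critic_layers = [
    ((STATE_SIZE + ACTION_SIZE) * N_AGENTS, FCS1_UNITS),
    (FCS1_UNITS, FC2_UNITS),
    (FC2_UNITS, 1),
]

actor_total = sum(linear_params(i, o) for i, o in actor_layers)
critic_total = sum(linear_params(i, o) for i, o in critic_layers)
print(actor_total, critic_total)  # 5890 7617
```

Under these assumptions each actor has 5890 parameters and each critic has 7617, so all eight networks together remain very small.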

### Hyperparameters

From **ddpg_agent.py** 

        GAMMA = 0.99                    # discount factor  
        TAU = 5e-2                      # for soft update of target parameters   
        LR_ACTOR = 5e-4                 # learning rate of the actor   
        LR_CRITIC = 5e-4                # learning rate of the critic  
        WEIGHT_DECAY = 0.0              # L2 weight decay   
        NOISE_AMPLIFICATION = 1         # exploration noise amplification  
        NOISE_AMPLIFICATION_DECAY = 1   # noise amplification decay
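
One plausible reading of the two noise parameters is a multiplicative schedule, sketched below. This is an assumption about how the values are applied, not a description of the project code; the function `noise_scale` and the alternative decay value are hypothetical.

```python
NOISE_AMPLIFICATION = 1        # initial scale applied to the raw exploration noise
NOISE_AMPLIFICATION_DECAY = 1  # assumed to multiply the scale once per decay step


def noise_scale(step, start=NOISE_AMPLIFICATION, decay=NOISE_AMPLIFICATION_DECAY):
    """Scale applied to the exploration noise after `step` decay steps."""
    return start * decay ** step


constant = [noise_scale(s) for s in (0, 100, 10_000)]
decayed = noise_scale(1000, decay=0.999)  # hypothetical alternative setting
```

With both values equal to 1, as listed above, the scale never changes, so under this reading the exploration noise keeps its full amplitude throughout training.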

From **maddpg_agent.py**

        BUFFER_SIZE = int(1e6)          # replay buffer size   
        BATCH_SIZE = 512                # minibatch size   
        LEARNING_PERIOD = 2             # weight update frequency 
        
Note that the parameter LEARNING_PERIOD is important: it sets how often the weights are updated. The corresponding code is in the function _step()_:

        if len(self.memory) > BATCH_SIZE and timestep % LEARNING_PERIOD == 0:
            for a_i, agent in enumerate(self.agents):
                experiences = self.memory.sample()
                self.learn(experiences, a_i)
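
The effect of this condition can be checked with a small standalone sketch. It assumes that exactly one joint transition is stored per timestep and that timesteps are counted from 1; the helper names are hypothetical.

```python
BATCH_SIZE = 512
LEARNING_PERIOD = 2
NUM_AGENTS = 2


def should_learn(buffer_len, timestep):
    """Same condition as in step(): enough stored data and a learning timestep."""
    return buffer_len > BATCH_SIZE and timestep % LEARNING_PERIOD == 0


def count_updates(num_timesteps):
    """Count per-agent learning calls over a run of `num_timesteps` steps."""
    updates = 0
    for timestep in range(1, num_timesteps + 1):
        buffer_len = timestep  # assumption: the buffer grows by one entry per step
        if should_learn(buffer_len, timestep):
            updates += NUM_AGENTS  # each agent learns from its own fresh sample
    return updates


updates = count_updates(1000)
```

Under these assumptions no learning happens during the first 512 timesteps, and afterwards both agents are updated on every second timestep.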



### Training the Agent

On my local machine with a GPU, the desired average reward of **+0.5** was achieved in **2761** episodes in **28** minutes.
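
A stopping check of this kind can be sketched as a moving average over recent episode scores. The window of 100 episodes and the requirement that the window be full are assumptions for this example, and the scores below are synthetic, not the ones from this report.

```python
from collections import deque

TARGET_SCORE = 0.5
WINDOW = 100  # assumption: averaging window in episodes


def first_solved_episode(episode_scores, target=TARGET_SCORE, window=WINDOW):
    """Return the 1-based episode at which the moving average reaches `target`."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Synthetic scores for illustration only.
scores = [0.0] * 150 + [1.0] * 100
solved_at = first_solved_episode(scores)
```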

### Full log

Episode: 50, Score: -0.0050, 	Average Score: -0.0050, Time: 00:00:06

Episode: 100, Score: -0.0050, 	Average Score: -0.0050, Time: 00:00:21

Episode: 150, Score: -0.0050, 	Average Score: -0.0050, Time: 00:00:37

Episode: 200, Score: -0.0050, 	Average Score: -0.0050, Time: 00:00:52 

Episode: 250, Score: -0.0050, 	Average Score: -0.0050, Time: 00:01:08

Episode: 300, Score: -0.0050, 	Average Score: -0.0050, Time: 00:01:24

Episode: 350, Score: -0.0050, 	Average Score: -0.0050, Time: 00:01:40 

Episode: 400, Score: -0.0050, 	Average Score: 0.0045, Time: 00:02:04 

Episode: 450, Score: -0.0050, 	Average Score: 0.0095, Time: 00:02:26 

Episode: 500, Score: 0.0450, 	Average Score: 0.0055, Time: 00:02:46 

Episode: 550, Score: -0.0050, 	Average Score: 0.0025, Time: 00:03:03 

Episode: 600, Score: -0.0050, 	Average Score: 0.0005, Time: 00:03:21 

Episode: 650, Score: 0.0450, 	Average Score: 0.0010, Time: 00:03:39 

Episode: 700, Score: -0.0050, 	Average Score: -0.0010, Time: 00:03:55 

Episode: 750, Score: -0.0050, 	Average Score: -0.0020, Time: 00:04:12 

Episode: 800, Score: -0.0050, 	Average Score: 0.0000, Time: 00:04:31 

Episode: 850, Score: -0.0050, 	Average Score: 0.0000, Time: 00:04:48 

Episode: 900, Score: -0.0050, 	Average Score: -0.0010, Time: 00:05:05 

Episode: 950, Score: -0.0050, 	Average Score: 0.0020, Time: 00:05:24 

Episode: 1000, Score: -0.0050, 	Average Score: 0.0035, Time: 00:05:43 

Episode: 1050, Score: -0.0050, 	Average Score: 0.0025, Time: 00:06:01 

Episode: 1100, Score: -0.0050, 	Average Score: 0.0020, Time: 00:06:20 

Episode: 1150, Score: -0.0050, 	Average Score: 0.0025, Time: 00:06:39 

Episode: 1200, Score: -0.0050, 	Average Score: 0.0020, Time: 00:06:58 

Episode: 1250, Score: 0.0450, 	Average Score: 0.0040, Time: 00:07:18 

Episode: 1300, Score: -0.0050, 	Average Score: 0.0065, Time: 00:07:39 

Episode: 1350, Score: 0.0450, 	Average Score: 0.0180, Time: 00:08:09 

Episode: 1400, Score: -0.0050, 	Average Score: 0.0210, Time: 00:08:31 

Episode: 1450, Score: -0.0050, 	Average Score: 0.0235, Time: 00:09:05 

Episode: 1500, Score: -0.0050, 	Average Score: 0.0215, Time: 00:09:26 

Episode: 1550, Score: -0.0050, 	Average Score: 0.0050, Time: 00:09:44 

Episode: 1600, Score: -0.0050, 	Average Score: 0.0055, Time: 00:10:06 

Episode: 1650, Score: 0.0450, 	Average Score: 0.0130, Time: 00:10:31 

Episode: 1700, Score: -0.0050, 	Average Score: 0.0120, Time: 00:10:52 

Episode: 1750, Score: -0.0050, 	Average Score: 0.0075, Time: 00:11:13 

Episode: 1800, Score: 0.0450, 	Average Score: 0.0100, Time: 00:11:36 

Episode: 1850, Score: -0.0050, 	Average Score: 0.0060, Time: 00:11:54 

Episode: 1900, Score: -0.0050, 	Average Score: -0.0005, Time: 00:12:12 

Episode: 1950, Score: -0.0050, 	Average Score: 0.0115, Time: 00:12:40 

Episode: 2000, Score: -0.0050, 	Average Score: 0.0190, Time: 00:13:03 

Episode: 2050, Score: -0.0050, 	Average Score: 0.0210, Time: 00:13:34 

Episode: 2100, Score: 0.0450, 	Average Score: 0.0190, Time: 00:13:57 

Episode: 2150, Score: -0.0050, 	Average Score: 0.0215, Time: 00:14:28 

Episode: 2200, Score: 0.0950, 	Average Score: 0.0605, Time: 00:15:22 

Episode: 2250, Score: 0.0450, 	Average Score: 0.0705, Time: 00:16:00 

Episode: 2300, Score: 0.0450, 	Average Score: 0.0550, Time: 00:16:40 

Episode: 2350, Score: 0.0450, 	Average Score: 0.0670, Time: 00:17:29 

Episode: 2400, Score: 0.0450, 	Average Score: 0.0720, Time: 00:18:14 

Episode: 2450, Score: 0.2450, 	Average Score: 0.0680, Time: 00:19:01 

Episode: 2500, Score: 0.0450, 	Average Score: 0.0720, Time: 00:19:50 

Episode: 2550, Score: 0.1450, 	Average Score: 0.0670, Time: 00:20:31 

Episode: 2600, Score: -0.0050, 	Average Score: 0.0670, Time: 00:21:19 

Episode: 2650, Score: 0.1450, 	Average Score: 0.0890, Time: 00:22:19 

Episode: 2700, Score: 0.0950, 	Average Score: 0.1065, Time: 00:23:24 

Episode: 2750, Score: 2.6000, 	Average Score: 0.3097, Time: 00:27:27 

*** Environment solved in 2761 episodes!	Average Score: 0.52 ***

### Future ideas

1. Try different values for hyperparameters such as LEARNING_PERIOD and for the neural network parameters fc1_units, fc2_units, etc.
2. Investigate how adding further nonlinear layers to the neural networks affects the robustness of the algorithm.
3. It would be interesting to train agents using [MAPPO](https://github.com/kotogasy/unity-ml-tennis) and compare them with MADDPG.
4. Running the agent for more episodes would likely also improve the score, since the average score rises very sharply toward the end of training.