## Report - Project Collaboration and Competition

### Methods: DDPG and MADDPG Algorithms

In this collaboration and competition project, we use the **DDPG** algorithm together with **MADDPG** (**Multi-Agent DDPG**), which is implemented here as a wrapper around DDPG. DDPG learns a Q-function and a policy simultaneously. It uses off-policy data and the Bellman equation to learn the Q-function, and then uses the Q-function to learn the policy. This pairing is called an actor-critic method: the policy is the actor and the Q-function is the critic.  
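
As a rough illustration of the Bellman step, the critic's regression target for a single transition can be sketched as below. The function name `td_target` and the sample numbers are illustrative assumptions; only the value of GAMMA is taken from the hyperparameters listed in this report.

```python
GAMMA = 0.99  # discount factor (value listed under Hyperparameters)


def td_target(reward, next_q, done, gamma=GAMMA):
    """Bellman target y = r + gamma * Q_target(s', a').

    `next_q` stands for critic_target(next_state, actor_target(next_state)).
    When the episode has ended (`done` is True) there is no future value,
    so the target reduces to the immediate reward.
    """
    return reward + gamma * next_q * (0.0 if done else 1.0)


# Hypothetical numbers, chosen only to show the two cases.
y_running = td_target(reward=0.1, next_q=2.0, done=False)
y_terminal = td_target(reward=0.1, next_q=2.0, done=True)
```

The critic is then regressed toward this target, and the actor is improved using the critic's estimate.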


> Two additional mechanisms: _Replay Buffer_ and _Soft Updates_.
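
The second of these mechanisms, the soft update, blends the local weights into the target weights a little at a time. The sketch below uses plain Python lists in place of network parameters, which is an assumption made for brevity; the name `soft_update` and the sample values are illustrative, and only TAU is taken from the hyperparameters in this report.

```python
TAU = 5e-2  # interpolation factor (value listed under Hyperparameters)


def soft_update(local_params, target_params, tau=TAU):
    """Return target <- tau * local + (1 - tau) * target, element by element."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


# Hypothetical weights: the target slowly drifts toward the local values.
local = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(3):
    target = soft_update(local, target)
```

With a small TAU the target networks change slowly, which keeps the critic's learning targets from shifting too quickly.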



With MADDPG, we train two separate agents that both **collaborate** and **compete** with each other. MADDPG gives better results here than plain DDPG, because plain DDPG amounts to a simple extension of single-agent RL in which the two agents are trained independently. That approach rarely works well: each agent keeps updating its policy while the other is still learning, so from the viewpoint of either agent the environment appears non-stationary.

In MADDPG, the critic of each agent is _trained on the observations and actions_ of **both agents**, while the actor of each agent is _trained on_ only its **own observations**.  
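
The difference between the two inputs can be sketched in a few lines. The helper names `critic_input` and `actor_input` are hypothetical, and the order in which observations and actions are concatenated is an assumption; the sizes 24 and 2 are the ones used in this report.

```python
STATE_SIZE = 24   # per-agent observation size (from this report)
ACTION_SIZE = 2   # per-agent action size (from this report)
N_AGENTS = 2


def critic_input(all_obs, all_actions):
    """Concatenate every agent's observation and action into one flat vector."""
    flat = []
    for obs, act in zip(all_obs, all_actions):
        flat.extend(obs)
        flat.extend(act)
    return flat


def actor_input(all_obs, agent_index):
    """The actor only sees the observation of its own agent."""
    return list(all_obs[agent_index])


# Dummy zero-valued data, used only to check the resulting sizes.
obs = [[0.0] * STATE_SIZE for _ in range(N_AGENTS)]
acts = [[0.0] * ACTION_SIZE for _ in range(N_AGENTS)]
critic_in = critic_input(obs, acts)   # length (24 + 2) * 2
actor_in = actor_input(obs, 0)        # length 24
```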

In the function _step()_ of the class _maddpg_\__agent_, we gather all of the current information
for **both agents** into the **common** variable
_memory_ of type _ReplayBuffer_. We then draw a random _sample_ from _memory_ into the variable _experiences_.   
These _experiences_, together with the index of the current agent (0 or 1), are passed to the function _learn()_, which looks up the corresponding
agent (of type _ddpg_\__agent_):

      agent = self.agents[agent_number]

and then _experiences_ is passed on to the function _learn()_ of the class _ddpg_\__agent_. At this point, the actor and the critic are handled in different ways.  
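
The shared memory described above can be sketched as follows. This is a minimal stand-in, not the project's _ReplayBuffer_: the field names, the `seed` argument, and the toy data are assumptions made for illustration.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["states", "actions", "rewards", "next_states", "dones"]
)


class ReplayBuffer:
    """Fixed-size buffer shared by both agents."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop out first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, states, actions, rewards, next_states, dones):
        self.memory.append(Experience(states, actions, rewards, next_states, dones))

    def sample(self):
        """Draw a uniformly random minibatch without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(buffer_size=100, batch_size=4)
for t in range(10):
    # One entry holds the data of BOTH agents for timestep t.
    buffer.add(states=[[t], [t]], actions=[[0.0], [0.0]],
               rewards=[0.0, 0.0], next_states=[[t + 1], [t + 1]],
               dones=[False, False])
batch = buffer.sample()
```

Because every stored entry carries both agents' data, a single sampled batch is enough to feed a critic that sees both agents and an actor that sees only one.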







### Network Architecture

We use 8 _neural networks_ in total. In the training phase, we create one _maddpg agent_:

         maddpg = maddpg_agent()
         
In turn, the _maddpg agent_ creates 2 _ddpg agents_:
         
         self.agents = [ddpg_agent(state_size, action_size, i+1, random_seed=0) 
                  for i in range(num_agents)]    

Each of the two agents (red and blue) creates 4 neural networks, which gives 8 in total:

        self.actor_local = Actor(state_size, action_size).to(device)
        self.actor_target = Actor(state_size, action_size).to(device)

        self.critic_local = Critic(state_size, action_size).to(device)
        self.critic_target = Critic(state_size, action_size).to(device)

The classes Actor and Critic are defined in **model.py**. The typical behavior of the actor is as follows:

        actor_target(state) -> next_actions
        actor_local(states) -> actions_pred
        
See the function _learn()_ in the maddpg agent. The typical behavior of the critic is as follows:

        critic_target(state, action) -> Q-value 
        -critic_local(states, actions_pred) -> actor_loss
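
The minus sign in front of critic_local is what turns "maximize Q" into a loss that can be minimized. The toy below sketches that idea in plain Python and is not the project's code: the quadratic `critic`, the one-weight `actor`, and the finite-difference gradient are all assumptions made so that the example runs without PyTorch.

```python
def critic(state, action):
    """Toy critic: the value is highest when the action equals the state."""
    return -(action - state) ** 2


def actor(state, w):
    """Toy actor with a single weight."""
    return w * state


def actor_loss(states, w):
    """Negative mean Q-value of the actor's own actions."""
    q_values = [critic(s, actor(s, w)) for s in states]
    return -sum(q_values) / len(q_values)


states = [0.5, -1.0, 2.0]
w, lr, eps = 0.0, 0.1, 1e-6
initial_loss = actor_loss(states, w)
for _ in range(50):
    # Central finite difference stands in for backpropagation.
    grad = (actor_loss(states, w + eps) - actor_loss(states, w - eps)) / (2 * eps)
    w -= lr * grad
final_loss = actor_loss(states, w)
```

Minimizing the negated critic output moves the weight toward the value at which the toy critic rates the actor's actions highest.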
              

### Architecture Details for the Actor and Critic Networks

The actor and critic classes each implement a neural network
with 3 fully-connected layers and 2 rectified linear (ReLU) nonlinearities. The networks are implemented in
PyTorch, following the Udacity model.py code for the Pendulum model using DDPG.
The sizes of the fully-connected layers (inputs x outputs) are as follows:

for the actor:   
Layer fc1, size: state_size x fc1_units,   
Layer fc2, size: fc1_units x fc2_units,   
Layer fc3, size: fc2_units x action_size;   

for the critic:   
Layer fcs1, size: ((state_size + action_size) x n_agents) x fcs1_units,   
Layer fc2, size: fcs1_units x fc2_units,   
Layer fc3, size: fc2_units x 1.   

Here, state_size = 24 and action_size = 2.       
The parameters fc1_units, fc2_units, and fcs1_units are all set to 64.   
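
As a worked check of these sizes, the number of trainable parameters per network can be counted directly. The sketch assumes that every layer has a bias term and that there are no extra layers such as batch normalization; the helper `linear_params` is hypothetical.

```python
def linear_params(n_in, n_out, bias=True):
    """Number of trainable parameters in one fully-connected layer."""
    return n_in * n_out + (n_out if bias else 0)


state_size, action_size, n_agents = 24, 2, 2
fc1_units = fc2_units = fcs1_units = 64

# Actor: state -> fc1 -> fc2 -> action
actor_params = (
    linear_params(state_size, fc1_units)
    + linear_params(fc1_units, fc2_units)
    + linear_params(fc2_units, action_size)
)

# Critic: observations and actions of both agents -> fcs1 -> fc2 -> Q-value
critic_params = (
    linear_params((state_size + action_size) * n_agents, fcs1_units)
    + linear_params(fcs1_units, fc2_units)
    + linear_params(fc2_units, 1)
)
```

Under these assumptions the critic's first layer takes 52 inputs, compared with 24 for the actor.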

### Hyperparameters

From **ddpg_agent.py** 

        GAMMA = 0.99                    # discount factor  
        TAU = 5e-2                      # for soft update of target parameters   
        LR_ACTOR = 5e-4                 # learning rate of the actor   
        LR_CRITIC = 5e-4                # learning rate of the critic  
        WEIGHT_DECAY = 0.0              # L2 weight decay   
        NOISE_AMPLIFICATION = 1         # exploration noise amplification  
        NOISE_AMPLIFICATION_DECAY = 1   # noise amplification decay

From **maddpg_agent.py**

        BUFFER_SIZE = int(1e6)          # replay buffer size   
        BATCH_SIZE = 512                # minibatch size   
        LEARNING_PERIOD = 2             # weight update frequency 
        
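Note that the parameter LEARNING_PERIOD controls how often the weights are updated: learning runs only on timesteps divisible by LEARNING_PERIOD, and only once the buffer holds more than BATCH_SIZE experiences. The corresponding code is in the function _step()_: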

        if len(self.memory) > BATCH_SIZE and timestep % LEARNING_PERIOD == 0:
            for a_i, agent in enumerate(self.agents):
                experiences = self.memory.sample()
                self.learn(experiences, a_i)
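
To see which timesteps actually trigger learning, the condition can be evaluated on its own. The sketch assumes that exactly one transition is stored per timestep and that the buffer starts empty; `should_learn` is a hypothetical helper that mirrors the condition above.

```python
BATCH_SIZE = 512      # minibatch size (from maddpg_agent.py)
LEARNING_PERIOD = 2   # weight update frequency (from maddpg_agent.py)


def should_learn(memory_len, timestep):
    """True when the buffer is large enough and the timestep is an 'on' step."""
    return memory_len > BATCH_SIZE and timestep % LEARNING_PERIOD == 0


# Assumption: the buffer holds exactly `t` transitions at timestep `t`.
learn_steps = [t for t in range(1, 521) if should_learn(memory_len=t, timestep=t)]
```

On each of these timesteps both agents are updated, each from its own freshly sampled minibatch.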



### Summary for Training the Agent

In my Udacity-powered GPU environment, the target average reward of **+0.5** was reached in **26** minutes after **1101** episodes.

* Environment solved in 1101 episodes! Average Score: **0.50**
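
A stopping check of this kind can be sketched with a rolling window. The window length of 100 episodes is an assumption, since this report does not state it, and the scores below are synthetic rather than taken from the training run; `first_solved_episode` is a hypothetical helper.

```python
from collections import deque

TARGET_SCORE = 0.5   # threshold stated in this report
WINDOW = 100         # assumed averaging window (not stated in this report)


def first_solved_episode(episode_scores, target=TARGET_SCORE, window=WINDOW):
    """Return the 1-based episode at which the rolling mean first reaches target."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


scores = [0.0] * 100 + [1.0] * 100   # synthetic data, not the real training run
solved_at = first_solved_episode(scores)
```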

### Full Training Log

Episode: 20, Score: 0.0450, 	Average Score: 0.0475, Time: 00:00:13 <br>
Episode: 40, Score: 0.0450, 	Average Score: 0.0438, Time: 00:00:26 <br>
Episode: 60, Score: 0.0450, 	Average Score: 0.0425, Time: 00:00:38 <br>
Episode: 80, Score: 0.0450, 	Average Score: 0.0431, Time: 00:00:52 <br>
Episode: 100, Score: 0.0450, 	Average Score: 0.0475, Time: 00:01:08 <br>
Episode: 120, Score: 0.0450, 	Average Score: 0.0480, Time: 00:01:22 <br>
Episode: 140, Score: -0.0050, 	Average Score: 0.0480, Time: 00:01:35 <br>
Episode: 160, Score: 0.0450, 	Average Score: 0.0555, Time: 00:01:54 <br>
Episode: 180, Score: 0.0450, 	Average Score: 0.0580, Time: 00:02:09 <br>
Episode: 200, Score: 0.0450, 	Average Score: 0.0560, Time: 00:02:25 <br>
Episode: 220, Score: 0.0450, 	Average Score: 0.0580, Time: 00:02:41 <br>
Episode: 240, Score: 0.0450, 	Average Score: 0.0685, Time: 00:03:02 <br>
Episode: 260, Score: 0.0450, 	Average Score: 0.0710, Time: 00:03:23 <br>
Episode: 280, Score: 0.1450, 	Average Score: 0.0760, Time: 00:03:43 <br>
Episode: 300, Score: -0.0050, 	Average Score: 0.0815, Time: 00:04:03 <br>
Episode: 320, Score: 0.0950, 	Average Score: 0.0835, Time: 00:04:21 <br>
Episode: 340, Score: 0.0450, 	Average Score: 0.0820, Time: 00:04:41 <br>
Episode: 360, Score: 0.0950, 	Average Score: 0.0800, Time: 00:05:01 <br>
Episode: 380, Score: 0.0450, 	Average Score: 0.0765, Time: 00:05:17 <br>
Episode: 400, Score: 0.0450, 	Average Score: 0.0705, Time: 00:05:32 <br>
Episode: 420, Score: 0.0950, 	Average Score: 0.0705, Time: 00:05:50 <br>
Episode: 440, Score: 0.0450, 	Average Score: 0.0645, Time: 00:06:05 <br>
Episode: 460, Score: 0.0450, 	Average Score: 0.0595, Time: 00:06:20 <br>
Episode: 480, Score: 0.0950, 	Average Score: 0.0635, Time: 00:06:41 <br>
Episode: 500, Score: 0.1450, 	Average Score: 0.0705, Time: 00:07:01 <br>
Episode: 520, Score: 0.0950, 	Average Score: 0.0725, Time: 00:07:21 <br>
Episode: 540, Score: 0.2450, 	Average Score: 0.0820, Time: 00:07:44 <br>
Episode: 560, Score: 0.1450, 	Average Score: 0.0875, Time: 00:08:05 <br>
Episode: 580, Score: 0.1450, 	Average Score: 0.0880, Time: 00:08:27 <br>
Episode: 600, Score: 0.0450, 	Average Score: 0.0865, Time: 00:08:46 <br>
Episode: 620, Score: 0.0450, 	Average Score: 0.0885, Time: 00:09:08 <br>
Episode: 640, Score: -0.0050, 	Average Score: 0.0840, Time: 00:09:28 <br>
Episode: 660, Score: 0.0450, 	Average Score: 0.0910, Time: 00:09:55 <br>
Episode: 680, Score: 0.1450, 	Average Score: 0.0905, Time: 00:10:16 <br>
Episode: 700, Score: 0.2450, 	Average Score: 0.0970, Time: 00:10:41 <br>
Episode: 720, Score: 0.1950, 	Average Score: 0.0985, Time: 00:11:04 <br>
Episode: 740, Score: 0.1950, 	Average Score: 0.1095, Time: 00:11:34 <br>
Episode: 760, Score: 0.1950, 	Average Score: 0.1055, Time: 00:11:57 <br>
Episode: 780, Score: 0.0450, 	Average Score: 0.1135, Time: 00:12:23 <br>
Episode: 800, Score: 0.1450, 	Average Score: 0.1170, Time: 00:12:50 <br>
Episode: 820, Score: 0.0450, 	Average Score: 0.1160, Time: 00:13:11 <br>
Episode: 840, Score: 0.1450, 	Average Score: 0.1095, Time: 00:13:35 <br>
Episode: 860, Score: 0.1950, 	Average Score: 0.1130, Time: 00:14:00 <br>
Episode: 880, Score: 0.1450, 	Average Score: 0.1145, Time: 00:14:29 <br>
Episode: 900, Score: 0.0950, 	Average Score: 0.1090, Time: 00:14:53 <br>
Episode: 920, Score: 0.0450, 	Average Score: 0.1100, Time: 00:15:15 <br>
Episode: 940, Score: 0.2950, 	Average Score: 0.1080, Time: 00:15:37 <br>
Episode: 960, Score: 0.0950, 	Average Score: 0.1160, Time: 00:16:10 <br>
Episode: 980, Score: 0.1450, 	Average Score: 0.1240, Time: 00:16:45 <br>
Episode: 1000, Score: 0.0450, 	Average Score: 0.1435, Time: 00:17:26 <br>
Episode: 1020, Score: 0.4950, 	Average Score: 0.1705, Time: 00:18:12<br> 
Episode: 1040, Score: 0.0950, 	Average Score: 0.2485, Time: 00:19:39 <br>
Episode: 1060, Score: 1.1450, 	Average Score: 0.2910, Time: 00:20:49 <br>
Episode: 1080, Score: 0.0950, 	Average Score: 0.3651, Time: 00:22:28 <br>
Episode: 1100, Score: 1.6450, 	Average Score: 0.4451, Time: 00:24:16 <br>

**Environment solved in 1106 episodes!	Average Score: 0.50**

### Future Work

1. Try different values for hyperparameters such as LEARNING_PERIOD and for the network parameters fc1_units and fc2_units, and check whether they improve on the current performance.
2. Check how adding one or more nonlinear layers to the current neural networks affects the robustness of the algorithm.
3. Train the agents using [MAPPO](https://github.com/kotogasy/unity-ml-tennis) and compare the results with MADDPG.