## I. Learning algorithm

This solution is almost an exact copy of my [previous project](https://github.com/AlekseySyryh/DRL_ContinuousControl). However, there are a few differences.

1. I use two identical but independent Deep Deterministic Policy Gradient (DDPG) agents. I could have wrapped them in a single agent, but it seems to me that the existing implementation emphasizes their independence.

1. Normal (Gaussian) noise does not work here, so I use an Ornstein–Uhlenbeck process instead, and it works very well.

1. To balance the exploration-exploitation tradeoff, I use an epsilon parameter that decreases from 1 to 0.1 (although the environment will probably be solved before it gets that low). Epsilon is the probability that noise is used in a given episode.
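The last two points can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the class `OUNoise`, the helpers `epsilon_for` and `noisy_action`, the values of `theta`, `sigma` and `eps_decay`, the multiplicative shape of the decay, and the clipping of actions to [-1, 1] are all assumptions made for the example.

```python
import numpy as np


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise pulled back toward mu."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        # theta and sigma are illustrative defaults, not values taken from the project
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean (done at the start of every episode)."""
        self.state = self.mu.copy()

    def sample(self):
        """Advance the process by one step and return the new noise vector."""
        dx = (self.theta * (self.mu - self.state)
              + self.sigma * self.rng.standard_normal(self.state.shape))
        self.state = self.state + dx
        return self.state


def epsilon_for(episode, eps_start=1.0, eps_end=0.1, eps_decay=0.999):
    """Hypothetical schedule: multiplicative decay from eps_start, floored at eps_end."""
    return max(eps_end, eps_start * eps_decay ** episode)


def noisy_action(action, noise, use_noise):
    """Add OU noise only when this episode was chosen to be exploratory."""
    if use_noise:
        action = action + noise.sample()
    return np.clip(action, -1.0, 1.0)  # assumed action range


# Usage: the noise decision is made once per episode, with probability epsilon.
rng = np.random.default_rng(0)
noise = OUNoise(size=2)
for episode in range(3):
    noise.reset()
    use_noise = rng.random() < epsilon_for(episode)
    action = noisy_action(np.zeros(2), noise, use_noise)  # np.zeros(2) stands in for the actor output
```

Deciding once per episode, rather than once per step, means some training episodes are played entirely without noise, which matches the description of epsilon above.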

## II. Plot of Rewards

In [1]:
import pickle as pkl
with open('scores.pkl','rb') as f:
    stat=pkl.load(file=f)

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
data=pd.DataFrame({"Score":stat})
# 100-episode rolling mean, computed on the Score column explicitly
data["Mean"]=data["Score"].rolling(100).mean()
data.plot()
plt.show()

[Figure: score per episode and its 100-episode rolling mean]

Environment solved in 1946 episodes
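An episode count like this one could be derived from the score list roughly as follows. This is only a sketch: the function `first_solved_episode` is hypothetical, and the target of 0.5 averaged over 100 episodes is an assumption (it is the usual criterion for this task), since neither the threshold nor the counting convention is stated in this report.

```python
import numpy as np


def first_solved_episode(scores, target=0.5, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `target`, or None if that never happens."""
    scores = np.asarray(scores, dtype=float)
    if len(scores) < window:
        return None
    # Rolling means via cumulative sums: means[i] averages scores[i:i + window]
    csum = np.concatenate(([0.0], np.cumsum(scores)))
    means = (csum[window:] - csum[:-window]) / window
    hits = np.nonzero(means >= target)[0]
    # The window starting at index i ends at episode number i + window
    return int(hits[0]) + window if hits.size else None
```

With the `stat` list loaded above, this would be called as `first_solved_episode(stat)`.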

## III. Solution example

In [3]:
from unityagents import UnityEnvironment
import numpy as np
env = UnityEnvironment(file_name='c:/Tennis_Windows_x86_64/Tennis.exe')
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


In [4]:
from network import Actor
# 24 inputs = 8 observations x 3 stacked frames; 2 action outputs (see the brain info above)
actor1 = Actor(24,2,1)
actor2 = Actor(24,2,1)

In [5]:
import torch
actor1.load_state_dict(torch.load('actor1.final.pth'))
actor2.load_state_dict(torch.load('actor2.final.pth'))

In [6]:
for x in range(5):
    env_info = env.reset(train_mode=False)[brain_name]    # reset the environment
    state = env_info.vector_observations                  # get the current state (for each agent)
    score = np.zeros(2)
    while True:
        # each agent acts on its own observation with its own actor (no exploration noise)
        action = [actor(torch.tensor(s, dtype=torch.float32)).detach().numpy()
                  for actor, s in zip((actor1, actor2), state)]
        env_info = env.step(action)[brain_name]
        dones = env_info.local_done                       # see if episode finished
        score += env_info.rewards                         # update the score (for each agent)
        state = env_info.vector_observations              # roll over states to next time step
        if np.any(dones):
            print(score.max())                            # report the better of the two agents' scores
            break

2.600000038743019
2.7000000402331352
2.600000038743019
2.7000000402331352
2.7000000402331352


In [7]:
env.close()

An interesting effect: during training, in 50% of cases the agents managed to hit the ball no more than twice. I thought it was some kind of learning problem, but here (without noise) there is no such problem, which suggests that the short rallies were caused by the exploration noise rather than by the policy itself. So the algorithm looks good enough.

## IV. Ideas for future work

Collaboration looks good, and the agents can now keep the ball in play for a long time. It is time to start the competition part.

What if the reward of one agent were also a penalty for the other (maybe with some discount)? I do not think such an approach would work when learning from scratch, but for agents already trained on this task it might work (a kind of transfer learning).
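One way to read this idea is sketched below. It is untested as a training scheme; the function `competitive_rewards` and the `penalty_weight` parameter (playing the role of the discount) are hypothetical names introduced only for illustration.

```python
import numpy as np


def competitive_rewards(rewards, penalty_weight=0.5):
    """For two agents: each keeps its own reward and is charged
    `penalty_weight` times the reward earned by the other agent."""
    rewards = np.asarray(rewards, dtype=float)
    # rewards[::-1] swaps the two entries, giving each agent its opponent's reward
    return rewards - penalty_weight * rewards[::-1]
```

With `penalty_weight=0` the rewards are unchanged (the current collaborative setting), while `penalty_weight=1` makes the game fully zero-sum.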

As usual, I will probably try some other algorithms, but there is not much point in it: the result is already about as close to ideal as it can get. I have never managed to see the agents make a mistake.