# Report
## Method
We implement the DDPG (Deep Deterministic Policy Gradient) algorithm for a continuous control task called Reacher.
The DDPG algorithm is described as follows:
1. Create an actor, a parameterized function $a=\mu(s_{t}|\theta_{\mu})$ that represents the current policy by deterministically mapping each state to a specific action.
2. Create a critic to estimate the state-action value function, which satisfies the one-step recursion
$$Q(s_t, a) = R_t + \gamma Q(s_{t+1}, \mu(s_{t+1}|\theta_{\mu}))$$
3. The critic is trained to estimate $Q(s_t, a)$, where $a$ is the action taken in $s_t$, so we use the mean squared error as its loss:
$$(y_t - Q(s_t, a|\theta_{Q}))^{2}$$
where $y_t = R_t + \gamma Q(s_{t+1}, \mu(s_{t+1}|\theta_{\mu})|\theta_{Q})$.
4. The goal of the actor is to maximize the expected return as judged by the critic, $J(\theta_{\mu}) = \mathop{\mathbb{E}}\limits_{s_t} \left[ Q(s_t, \mu(s_{t}|\theta_{\mu})|\theta_{Q}) \right]$.
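
The critic target and loss from steps 2 and 3 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `model.py`: the names `td_target` and `critic_loss` and the `done` flag are hypothetical, and the Q-values are passed in as plain floats instead of coming from a network.

```python
def td_target(reward, next_q, gamma=0.99, done=False):
    """One-step target y_t = R_t + gamma * Q(s_{t+1}, mu(s_{t+1})).

    `done` is an assumption of this sketch: when the episode ends at t+1,
    the bootstrapped term is dropped.
    """
    return reward + (0.0 if done else gamma * next_q)


def critic_loss(targets, q_values):
    """Mean squared error between targets y_t and critic estimates Q(s_t, a)."""
    errors = [(y - q) ** 2 for y, q in zip(targets, q_values)]
    return sum(errors) / len(errors)


# Toy batch of two transitions (made-up numbers, for illustration only).
rewards = [0.1, 0.0]
next_qs = [2.0, 1.0]       # Q(s_{t+1}, mu(s_{t+1})) as given by the critic
current_qs = [2.0, 1.0]    # Q(s_t, a) as given by the critic
targets = [td_target(r, q) for r, q in zip(rewards, next_qs)]
loss = critic_loss(targets, current_qs)
```

The actor objective in step 4 would then be the batch mean of the critic's output at the actor's own actions, maximized with respect to $\theta_{\mu}$.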

## Implementation
1. We implement the critic as a four-layer fully connected network and the actor as a three-layer fully connected network.
2. Hyperparameters: `buffer_size=20000`, `batch_size=128`, `lr=1e-4`, `gamma=0.99`.
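
The `buffer_size` and `batch_size` settings suggest an experience replay buffer, as is usual for DDPG. The sketch below shows one way such a buffer could look. The class `ReplayBuffer`, its methods, and the `seed` parameter are hypothetical and are not taken from `model.py`.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO buffer of transitions with uniform random sampling."""

    def __init__(self, buffer_size=20000, batch_size=128, seed=0):
        # A deque with maxlen evicts the oldest transition once it is full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self):
        """Return `batch_size` transitions drawn uniformly without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


# Fill past capacity with dummy transitions to show the eviction behaviour.
buffer = ReplayBuffer()
for t in range(25000):
    buffer.add(t, 0.0, 0.0, t + 1, False)
batch = buffer.sample()
```

After 25000 insertions only the most recent 20000 transitions remain, so every sampled transition comes from the newer part of the experience.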

## Plot of Average Return
![average return](./experimental_results/return_curve.png)
The figure shows that the average return is above 30 at the 50th episode.

## Ideas for Future Work
### 1. Use the advantage function as the objective:
$$A(s_t, a) = Q(s_t, a) - V(s_t)$$
which can be estimated by the one-step return (TD error) as follows:
$$A(s_t, a) = R_t + \gamma V(s_{t+1}) - V(s_t) $$
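
As a small worked example, the one-step estimate can be written as a single function. The name `td_advantage` and the numbers are made up for illustration; in practice $V$ would come from a learned value network.

```python
def td_advantage(reward, value, next_value, gamma=0.99):
    """One-step advantage estimate: R_t + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * next_value - value


# Made-up values: reward 1.0, V(s_t) = 2.0, V(s_{t+1}) = 2.0.
advantage = td_advantage(reward=1.0, value=2.0, next_value=2.0)
```

A positive estimate means the action did better than the value baseline predicted for that state.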
### 2. n-step return.
The one-step return is calculated as follows:
$$Q(s_t, a) = R_t + \gamma Q(s_{t+1}, a)$$
The two-step return is calculated as follows:
$$Q(s_t, a) = R_t + \gamma R_{t+1} + \gamma^2 Q(s_{t+2}, a)$$
The n-step return is calculated as follows:
$$Q(s_t, a) = \sum \limits_{l=0}^{n-1} \gamma^{l} R_{t+l} + \gamma^n Q(s_{t+n}, a)$$
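
A minimal sketch of an n-step return: sum the first $n$ discounted rewards, then bootstrap from the critic at $s_{t+n}$ with weight $\gamma^n$. The function name `n_step_return` and the toy numbers are hypothetical, and $\gamma = 0.5$ is chosen only to keep the arithmetic easy to check.

```python
def n_step_return(rewards, bootstrap_q, gamma=0.99):
    """n-step return with n = len(rewards).

    Computes sum_{l=0}^{n-1} gamma^l * R_{t+l} + gamma^n * Q(s_{t+n}, a),
    where `bootstrap_q` is the critic's estimate at step t+n.
    """
    n = len(rewards)
    discounted = sum(gamma ** l * r for l, r in enumerate(rewards))
    return discounted + gamma ** n * bootstrap_q


# Made-up rewards and bootstrap value, for illustration only.
one_step = n_step_return([1.0], bootstrap_q=10.0, gamma=0.5)
two_step = n_step_return([1.0, 2.0], bootstrap_q=10.0, gamma=0.5)
```

With one reward the function reduces to the one-step return, and with two rewards to the two-step return.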

In [1]:
%load_ext autoreload
%autoreload 2
from tqdm import tqdm
from model import *
from unityagents import UnityEnvironment
import numpy as np

env = UnityEnvironment(file_name='./Reacher')

# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

env_info = env.reset(train_mode=True)[brain_name]

num_agents = len(env_info.agents)
print(num_agents)

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


20


In [11]:
agent = Agent(pre_trained=True)

avg_scores = []
best_scores = 30

env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = agent.act(states)                        # select an action (for each agent)
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    rewards = env_info.rewards                         # get reward (for each agent)
    next_states = env_info.vector_observations         # get next state (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += rewards                                  # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print(f"The best average return of the pre-trained model: {scores.mean():.1f}")

Load model successful
The best average return of the pre-trained model: 38.5
