# Project 2: Continuous Control

### Project description
This folder contains the code and a report for Project 2 of the Deep Reinforcement Learning Nanodegree. The project is based on the Unity ML-Agents Reacher environment, in which a double-jointed arm moves toward target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of the agent is to keep its hand at the target location for as many time steps as possible.

<img src="reacher.gif" />

The observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocity of the arm. Each action is a vector of four numbers, corresponding to the torque applied to the two joints. Every entry in the action vector must be a number between -1 and 1.

In this project, I chose the **Deep Deterministic Policy Gradient (DDPG)** algorithm, an actor-critic algorithm for learning continuous actions, to train 20 identical agents in the Unity environment. To solve the environment, the agents must reach an average score of +30 (over 100 consecutive episodes, and over all agents).

### Methodology
The **Deep Deterministic Policy Gradient (DDPG)** algorithm is effective for problems with continuous action spaces. The pseudocode below is taken from the [paper](https://arxiv.org/pdf/1509.02971.pdf).

<img src="pseudocode_ddpg.png" />

DDPG maintains a critic network and an actor network, each with a local and a target copy. It borrows experience replay and slowly updated target networks from DQN, and it builds on DPG, which can operate over continuous action spaces.
- The critic update is similar to DQN, except that the target actor network produces the next action used to compute the Q target ($y_i$).
- The actor update performs gradient ascent on the sampled policy gradient.
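
The critic update described above can be illustrated with a small NumPy sketch. This is not the code in `ddpg_agent.py`; the arrays are made-up stand-ins for network outputs, and the `(1 - done)` mask is a common implementation detail that I am assuming here (the paper's pseudocode does not show it).

```python
import numpy as np

def td_targets(rewards, dones, next_q, gamma=0.99):
    """y_i = r_i + gamma * (1 - done_i) * Q'(s_{i+1}, mu'(s_{i+1}))."""
    return rewards + gamma * (1.0 - dones) * next_q

# Toy minibatch of three transitions (made-up numbers).
rewards = np.array([0.1, 0.0, 0.1])
dones = np.array([0.0, 0.0, 1.0])     # the last transition ends its episode
next_q = np.array([2.0, 1.0, 5.0])    # stand-in for the target critic evaluated at the target actor's action
q_local = np.array([2.0, 1.0, 0.2])   # stand-in for the local critic's Q(s, a)

y = td_targets(rewards, dones, next_q)
critic_loss = np.mean((y - q_local) ** 2)   # mean squared TD error minimized by the critic
```

The actor is then trained to maximize the local critic's estimate of `Q(s, mu(s))`, which in practice is usually written as minimizing its negative mean.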

### Implementation
The code consists of 3 main files:
- `model.py`: defines the Actor and Critic networks. (This part builds on the [Udacity repository](https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-bipedal).)
- `ddpg_agent.py`: implements the DDPG training process and defines the action noise and the replay buffer. (This part builds on the [Udacity repository](https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-bipedal).)
- `Continuous_Control.ipynb`: the notebook used to train the agent. Run it cell by cell.

#### Network architecture
- Actor Network
```
input layer (33 units, state_space_size)
hidden layer 1 (fully connected 256 units, batch_norm, relu)
hidden layer 2 (fully connected 128 units, relu)
output layer (4 units, tanh, action_space_size)
```
- Critic Network
```
input layer (33 units, state_space_size)
hidden layer 1 (fully connected 256 units, batch_norm, relu)
hidden layer 2 (fully connected 128 units, relu)
output layer (1 unit, Q_value)
```
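
To make the layer shapes concrete, here is a framework-free NumPy sketch of a forward pass through the actor layout listed above. It is only an illustration: the real network lives in `model.py`, the weights here are random, and the batch norm is simplified (no learned scale or shift, no running statistics).

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_SIZE, ACTION_SIZE = 33, 4

def dense(n_in, n_out):
    """Random weights and zero biases for one fully connected layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch dimension (simplified)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

W1, b1 = dense(STATE_SIZE, 256)
W2, b2 = dense(256, 128)
W3, b3 = dense(128, ACTION_SIZE)

def actor_forward(states):
    h1 = np.maximum(batch_norm(states @ W1 + b1), 0.0)  # FC 256 -> batch norm -> ReLU
    h2 = np.maximum(h1 @ W2 + b2, 0.0)                  # FC 128 -> ReLU
    return np.tanh(h2 @ W3 + b3)                        # 4 outputs squashed into [-1, 1]

states = rng.normal(size=(20, STATE_SIZE))  # one random observation per agent
actions = actor_forward(states)             # shape (20, 4)
```

The `tanh` output is what keeps every action entry inside the range the environment accepts.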

#### Hyperparameters for training
```
BUFFER_SIZE = int(1e6)  # replay buffer size
BATCH_SIZE = 128        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR_ACTOR = 1e-4         # learning rate of the actor 
LR_CRITIC = 1e-3        # learning rate of the critic
WEIGHT_DECAY = 0        # L2 weight decay
EPSILON_DECAY = 0.9999  # decay rate
LEARNING_TIMES = 10     # the number of times for learning at a single phase.
```
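
As an illustration of what `TAU` controls, the soft update of the target parameters can be sketched as follows. This is a minimal NumPy version with plain arrays standing in for network weights; the function name `soft_update` is my own choice for this sketch.

```python
import numpy as np

TAU = 1e-3

def soft_update(local_params, target_params, tau=TAU):
    """In place: target <- tau * local + (1 - tau) * target, for each parameter array."""
    for local, target in zip(local_params, target_params):
        target *= (1.0 - tau)
        target += tau * local

# Toy parameters: the local network is all ones, the target network all zeros.
local = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]
soft_update(local, target)
```

With a small `TAU`, the target networks move only a fraction of the way toward the local networks at each update, which keeps the Q targets changing slowly.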

### Results and key notes
The training log and the plot of the scores are shown below.
```
Episode 10 	 Average Score: 2.33
Episode 20 	 Average Score: 4.77
Episode 30 	 Average Score: 5.87
Episode 40 	 Average Score: 6.95
Episode 50 	 Average Score: 8.17
Episode 60 	 Average Score: 9.00
Episode 70 	 Average Score: 10.06
Episode 80 	 Average Score: 10.98
Episode 90 	 Average Score: 11.78
Episode 100 	 Average Score: 12.71
Episode 110 	 Average Score: 14.91
Episode 120 	 Average Score: 17.00
Episode 130 	 Average Score: 19.21
Episode 140 	 Average Score: 21.66
Episode 150 	 Average Score: 24.21
Episode 160 	 Average Score: 26.78
Episode 170 	 Average Score: 29.00
Episode 176 	 Average Score: 30.13
Environment solved in 76 episodes!	Average Score: 30.13
```
<img src="plot.png" />

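Training went well. Note that the training output above states that the environment was solved in 76 episodes; this figure presumably excludes the 100-episode averaging window (176 - 100 = 76).
The actual episode at which the environment was solved (an average of +30 over the last 100 episodes) is 176.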
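
The solved criterion can be sketched as a rolling average over the last 100 episode scores. The function name and the score curve below are made up for illustration; they are not taken from the notebook.

```python
from collections import deque

def first_solved_episode(scores, target=30.0, window=100):
    """Return the 1-based episode at which the rolling mean first reaches target, or None."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None

# Made-up score curve: ramps up linearly to 40, then stays there.
scores = [min(40.0, 0.4 * i) for i in range(1, 301)]
solved_at = first_solved_episode(scores)
episodes_before_window = solved_at - 100   # the kind of offset figure a log line might report
```

Reporting `solved_at - 100` instead of `solved_at` is one plausible source of the mismatch between the two episode counts.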

#### Key notes
Initially, I could not get the agents to train effectively, no matter how I tuned the hyperparameters. Training showed some evidence of learning during the first 3-5 episodes, with scores reaching around 1.0, but after that the scores stalled and did not increase. I searched online and found another source of DDPG pseudocode [here](https://spinningup.openai.com/en/latest/algorithms/ddpg.html#pseudocode). Lines 9-10 caught my attention: "if it's time to update then...". This hinted that it might not be wise to train every time an experience tuple is added to the replay buffer. Learning too frequently, when the buffer has barely changed between updates, might drive the agent into a local optimum, and the scores I initially observed strongly suggested that this was happening after a few episodes. I therefore changed the training process to trigger learning only every 20 time steps, so that the agents add more new experience to the buffer before each round of training. This turned out to work.
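
The update schedule described above can be sketched as follows. `UPDATE_EVERY` and `learning_passes` are hypothetical names for this illustration, and the check that the buffer holds at least one minibatch is an assumption on my part; `LEARNING_TIMES` and `BATCH_SIZE` come from the hyperparameter list.

```python
UPDATE_EVERY = 20     # hypothetical name: learn only every 20 environment steps
LEARNING_TIMES = 10   # gradient updates per learning phase
BATCH_SIZE = 128
NUM_AGENTS = 20

def learning_passes(t_step, buffer_len):
    """Number of gradient updates to run after environment step t_step."""
    if t_step % UPDATE_EVERY != 0:
        return 0                  # not time to update yet
    if buffer_len < BATCH_SIZE:
        return 0                  # assumed guard: not enough experience to sample a minibatch
    return LEARNING_TIMES

total_updates = 0
buffer_len = 0
for t_step in range(1, 1001):     # a 1000-step stretch, for illustration
    buffer_len += NUM_AGENTS      # each step adds one experience tuple per agent
    total_updates += learning_passes(t_step, buffer_len)
```

In this sketch, each learning phase sees 400 new experience tuples (20 agents times 20 steps) since the previous phase, rather than a buffer that has barely changed.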

### Future Work
The project description mentions that the second version (the multi-agent version) is useful for algorithms like PPO, A3C, and D4PG, which use multiple (non-interacting, parallel) copies of the same agent to distribute the task of gathering experience. Future work could therefore include implementing PPO, A3C, and D4PG.