# Continuous Control

---

This notebook uses the Unity ML-Agents environment for the second project of the Udacity [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

## Import Packages

Make sure you have followed the setup instructions in the README before importing these packages.

In [2]:
from unityagents import UnityEnvironment
import numpy as np
import torch

**_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded during the README setup:

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`
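
The choice among the paths above can be automated. The following is a minimal sketch and not part of the original notebook: `reacher_file_name` is a hypothetical helper, `base` stands in for wherever you unpacked the download, and treating any machine name that ends in `64` as a 64-bit x86 build is an assumption (it would misfire on ARM Linux, for example).

```python
import platform


def reacher_file_name(base="path/to", headless=False, system=None, machine=None):
    """Return the Reacher build path for a platform (hypothetical helper)."""
    system = system or platform.system()      # 'Darwin', 'Windows' or 'Linux'
    machine = machine or platform.machine()   # e.g. 'x86_64', 'AMD64', 'i686'
    is_64bit = machine.endswith("64")         # assumption: a '64' suffix means x86_64

    if system == "Darwin":
        return f"{base}/Reacher.app"
    if system == "Windows":
        folder = "Reacher_Windows_x86_64" if is_64bit else "Reacher_Windows_x86"
        return f"{base}/{folder}/Reacher.exe"
    if system == "Linux":
        folder = "Reacher_Linux_NoVis" if headless else "Reacher_Linux"
        extension = "x86_64" if is_64bit else "x86"
        return f"{base}/{folder}/Reacher.{extension}"
    raise RuntimeError(f"No Reacher build listed for platform {system!r}")


# Explicit arguments keep the example reproducible on any machine.
print(reacher_file_name(system="Linux", machine="x86_64", headless=True))
```

Calling the helper with no `system` or `machine` argument falls back to the values reported by the running interpreter, and the result can be passed as `file_name` to `UnityEnvironment`.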

You can also watch the agent in the Unity environment window. I turned the window off below by passing `no_graphics=True`, because my computer has trouble running the environment with graphics enabled.

The environment also has a brain that we can use to control our agent.

In [None]:
env = UnityEnvironment(file_name='Reacher.app', no_graphics=True) # uses file name for Mac
# env = UnityEnvironment(file_name='Reacher.app') # opens Unity environment to watch agent visually

brain_name = env.brain_names[0] # get the default brain

## Untrained Agent

See how an untrained agent performs when it takes random actions.

In [None]:
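env_info = env.reset(train_mode=False)[brain_name]     # reset the environment
num_agents = len(env_info.agents)                      # number of agents in the environment
action_size = env.brains[brain_name].vector_action_space_size  # size of each action
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)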
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

## Training Phase

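Train the agents with DDPG. All 20 agents share a single actor, a single critic, one replay buffer, and one noise process, and the actor and critic weights are saved after every episode.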

In [None]:
from collections import deque

import torch.optim as optim

from agent import Agent
from memory import ReplayBuffer
from models import Actor, Critic
from noise import OUNoise

# Set some hyperparameters
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 128        # minibatch size
LR_ACTOR = 1e-4         # learning rate of the actor 
LR_CRITIC = 1e-4        # learning rate of the critic
WEIGHT_DECAY = 0        # L2 weight decay

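action_size = 4
# observation length per agent, read from the environment (needed by Actor and Critic)
state_size = env.reset(train_mode=True)[brain_name].vector_observations.shape[1]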
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
num_agents = 20
random_seed = 2

def ddpg(n_episodes=600, max_t=1000, print_every=100, eps_start=1, eps_decay=0.99, eps_end=0.01):
    actor_local = Actor(state_size, action_size, random_seed).to(device)
    actor_target = Actor(state_size, action_size, random_seed).to(device)
    actor_optimizer = optim.Adam(actor_local.parameters(), lr=LR_ACTOR)

    critic_local = Critic(state_size, action_size, random_seed).to(device)
    critic_target = Critic(state_size, action_size, random_seed).to(device)
    critic_optimizer = optim.Adam(critic_local.parameters(), lr=LR_CRITIC, weight_decay=WEIGHT_DECAY)

    memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, random_seed)

    noise = OUNoise(action_size, random_seed)

    # every agent shares the same networks, optimizers, replay buffer and noise process
    agents = [
        Agent(action_size, random_seed, BATCH_SIZE,
              actor_local, actor_target, actor_optimizer,
              critic_local, critic_target, critic_optimizer,
              memory, noise, device)
        for _ in range(num_agents)
    ]

    eps = eps_start
    scores_deque = deque(maxlen=print_every)
    scores = []

    for i_episode in range(1, n_episodes + 1):
        env_info = env.reset(train_mode=True)[brain_name]
        states = env_info.vector_observations
        for agent in agents:
            agent.reset()
        episode_scores = np.zeros(num_agents)
        for t in range(max_t):
            actions = []
            for q in range(num_agents):
                actions.append(agents[q].act(states[q], eps))
            env_info = env.step(actions)[brain_name]
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished
            for q in range(num_agents):
                agents[q].step(states[q], actions[q], rewards[q], next_states[q], dones[q])
            episode_scores += env_info.rewards                         # update the score (for each agent)
            states = next_states                               # roll over states to next time step
            if np.any(dones):                                  # exit loop if episode finished
                break
        eps = max(eps * eps_decay, eps_end)
        avg_score = np.mean(episode_scores)
        scores_deque.append(avg_score)
        scores.append(avg_score)
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)), end="")
        torch.save(actor_local.state_dict(), 'weights_actor.pth')
        torch.save(critic_local.state_dict(), 'weights_critic.pth')
        if i_episode % print_every == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
            
    return scores

scores = ddpg()
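
The `eps` value passed to `agent.act` decays geometrically once per episode and is clamped at `eps_end`. The sketch below reproduces only that schedule, using the default arguments of `ddpg`, so you can check how quickly it reaches its floor. `epsilon_schedule` is a hypothetical helper written for illustration; how `Agent.act` actually uses `eps` depends on the `agent` module, which is not shown here.

```python
def epsilon_schedule(n_episodes=600, eps_start=1.0, eps_decay=0.99, eps_end=0.01):
    """Return the epsilon in effect during each episode (hypothetical helper)."""
    schedule = []
    eps = eps_start
    for _ in range(n_episodes):
        schedule.append(eps)                 # value used while the episode runs
        eps = max(eps * eps_decay, eps_end)  # same update as at the end of each episode in ddpg()
    return schedule


schedule = epsilon_schedule()
# 1-based index of the first episode that runs with epsilon at its floor
first_floor_episode = next(i for i, eps in enumerate(schedule, start=1) if eps == 0.01)
print("epsilon reaches eps_end at episode", first_floor_episode)
```

With the defaults, the floor is reached well before the 600 episodes end, so the later part of training runs with a constant `eps` of 0.01.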

## Visualize Results

Plot the score of each training episode, averaged over all agents.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()
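
Raw per-episode scores are noisy, so a trailing average is often easier to read on the plot. The following is a minimal sketch, assuming the same 100-episode window that `print_every` gives `scores_deque` during training. `rolling_mean` is a hypothetical helper and not part of the notebook.

```python
from collections import deque


def rolling_mean(values, window=100):
    """Trailing mean over the last `window` values (hypothetical helper)."""
    recent = deque(maxlen=window)   # oldest value drops out automatically
    means = []
    for value in values:
        recent.append(value)
        means.append(sum(recent) / len(recent))
    return means


# Small worked example with a window of 2
print(rolling_mean([1, 2, 3, 4], window=2))
```

In the notebook, the smoothed curve could be drawn on top of the raw one with `plt.plot(rolling_mean(scores))`.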

## Watch Trained Agents

See how the trained agents perform over 10 episodes, using the weights saved during the training phase.