# Minion Navigation

---

In this notebook, we train an agent (here, a minion) to navigate a space, picking up yellow bananas and avoiding blue bananas as it walks around.

We use the Unity ML-Agents environment. The work is part of a project for the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) course.

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions (README) to double-check that you have installed the necessary packages. 

In [None]:
from collections import deque
import matplotlib.pyplot as plt
import numpy as np
import torch
from unityagents import UnityEnvironment

from agent import Agent

%matplotlib inline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Next, we will start the environment. Note that this will only work on a Linux (x64) machine. 

In [None]:
env = UnityEnvironment(file_name="../unity_environment/Banana.x86_64")

You should now see a newly spawned Unity visualization window. If a pop-up reports that the application is not responding, click "Wait" or ignore the message.

Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [None]:
# Get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.
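
To make the action indices and reward scheme concrete, here is a small sketch. The names `ACTION_NAMES` and `episode_score` are illustrative and not part of the Unity ML-Agents API, and the sketch assumes that steps on which no banana is collected give a reward of `0`.

```python
# Illustrative only: these names are assumptions, not part of the environment's API.
ACTION_NAMES = {0: "walk forward", 1: "walk backward", 2: "turn left", 3: "turn right"}

def episode_score(rewards):
    """Sum per-step rewards: +1 per yellow banana, -1 per blue banana.

    Assumes steps without a banana contribute 0.
    """
    return sum(rewards)

# Hypothetical reward sequence: two yellow bananas, one blue banana, two empty steps.
rewards = [0, 1, 0, -1, 1]
print(ACTION_NAMES[2], "| score:", episode_score(rewards))
```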

Run the code cell below to print some information about the environment.

In [None]:
# Reset the environment
env_info = env.reset(train_mode=False)[brain_name]
print('Number of agents:', len(env_info.agents))

action_size = brain.vector_action_space_size
print('Number of possible actions:', action_size)

state = env_info.vector_observations[0]
state_size = len(state)
print('State dimension:', state_size)

### 3. Take Random Actions in the Environment

Let's confirm that we have everything set up and the visualization works. 

Here, we watch the agent (our minion) as it selects an action uniformly at random at each time step.  A window should pop up that lets you observe the agent as it moves through the environment.

In [None]:
env_info = env.reset(train_mode=False)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))

### 4. Let's train the minion!

Let's construct our minion from the `Agent` class. 

In [None]:
minion = Agent(state_size=state_size, action_size=action_size, device=device, seed=0)
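
The `Agent` class is defined in `agent.py`, which is not shown in this notebook. As a rough sketch of what an epsilon-greedy `act(state, eps)` typically does, assume the agent has already computed one Q-value estimate per action. The function `epsilon_greedy` and its arguments are hypothetical and may differ from the actual implementation.

```python
import random

def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy one.

    Hypothetical helper: `q_values` is a list with one estimated value per action.
    """
    if rng.random() < eps:
        return rng.randrange(len(q_values))   # explore: uniform random action
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.7, -0.2, 0.3]          # made-up values for the four actions
print(epsilon_greedy(q_values, eps=0.0))  # eps = 0 always exploits
```

With `eps = 1.0` the choice is fully random, which is where training starts; as `eps` shrinks, the agent relies more on its learned estimates.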

Let's define some hyper-parameters for our training. 

In [None]:
NUM_EPISODES = 2000           # Maximum number of training episodes.
MAX_TIME_IN_EPISODE = 1000   # Maximum number of timesteps per episode.

def update_eps(eps): 
    """
    Updates the epsilon for the epsilon-greedy policy. 
    """
    eps = min(eps, 1.0)  # Cap epsilon at 1.0, its starting value.
    return max(0.01, 0.995 * eps)  # Decay by 0.5% per call, with a floor of 0.01.
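
To get a feel for this schedule, the sketch below counts how many updates it takes for epsilon to reach its floor, using the same decay factor (`0.995`) and floor (`0.01`) as `update_eps`. The helper name `episodes_until_floor` is made up for this illustration.

```python
import math

def episodes_until_floor(eps_start=1.0, decay=0.995, eps_min=0.01):
    """Count the decay steps until epsilon first hits eps_min."""
    eps, steps = eps_start, 0
    while eps > eps_min:
        eps = max(eps_min, decay * eps)
        steps += 1
    return steps

steps = episodes_until_floor()
# Closed form for comparison: smallest n with decay**n <= eps_min.
closed_form = math.ceil(math.log(0.01) / math.log(0.995))
print(steps, closed_form)
```

With these values, exploration stays above the floor for a little over 900 of the 2000 training episodes.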

In [None]:
SCORE_ACCEPTANCE_THRESHOLD = 13.

scores = []                        # List containing scores from each episode
scores_window = deque(maxlen=100)  # Last 100 scores
num_episodes_to_acceptance_threshold = -1

eps = 1.0 

for episode_idx in range(1, NUM_EPISODES + 1):
    # Reset. 
    state = env.reset(train_mode=True)[brain_name].vector_observations[0]
    score = 0
    
    # Roll out the episode until MAX_TIME_IN_EPISODE or episode termination. 
    for t in range(MAX_TIME_IN_EPISODE):
        # Get the action our agent must take at the current state.
        action = minion.act(state, eps)
        
        # Get the experience vectors. 
        step_info = env.step(action)[brain_name] 
        next_state = step_info.vector_observations[0]
        reward = step_info.rewards[0]
        done = step_info.local_done[0]
        
        # Learn from the experience. 
        minion.step(state, action, reward, next_state, done)
        
        # Update next state. 
        state = next_state
        
        # Update reward. 
        score += reward
        
        # If this episode terminates, move to the next episode. 
        if done:
            break 
    
    # Update epsilon.
    eps = update_eps(eps)
    
    # Update scores. 
    scores_window.append(score)       # rolling window for the 100-episode average
    scores.append(score)              # full history for plotting
    print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode_idx, np.mean(scores_window)), end="")
    if episode_idx % 100 == 0:
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode_idx, np.mean(scores_window)))
    # Record the first episode at which the rolling average reaches the threshold.
    if np.mean(scores_window) >= SCORE_ACCEPTANCE_THRESHOLD and num_episodes_to_acceptance_threshold < 0:
        num_episodes_to_acceptance_threshold = episode_idx

if num_episodes_to_acceptance_threshold > 0:
    print(f"\nOur minion learnt to get a score of {SCORE_ACCEPTANCE_THRESHOLD} in {num_episodes_to_acceptance_threshold} episodes.")
else:
    print(f"\nOur minion did not reach a score of {SCORE_ACCEPTANCE_THRESHOLD} within {NUM_EPISODES} episodes.")
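
The acceptance criterion is the mean score over the last 100 episodes reaching 13. Below is a minimal sketch of that check on a plain list of scores, separate from the training loop. `first_episode_meeting_threshold` is a hypothetical helper, and waiting for a full window before accepting is an assumption of this sketch rather than something the loop above does.

```python
from collections import deque

def first_episode_meeting_threshold(scores, threshold=13.0, window=100):
    """Return the 1-based episode at which the rolling mean first reaches
    `threshold`, or None if it never does."""
    recent = deque(maxlen=window)
    for episode_idx, score in enumerate(scores, start=1):
        recent.append(score)
        # Assumption of this sketch: only judge once the window is full.
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode_idx
    return None

# Synthetic scores: 50 poor episodes followed by consistently good ones.
scores = [0.0] * 50 + [14.0] * 200
print(first_episode_meeting_threshold(scores))
```

Because the mean covers 100 episodes, the reported episode lags the point at which individual scores first exceed the threshold.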

In [None]:
# Plot the scores over learning time. 

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores, label='raw')
# 10-episode moving average; 'valid' mode drops the first 9 points, so shift x to match.
averaged_scores = np.convolve(scores, np.ones(10)/10, mode='valid')
plt.plot(np.arange(len(averaged_scores)) + 9, averaged_scores, linewidth=4, label='averaged')
plt.plot([0, len(scores)], [SCORE_ACCEPTANCE_THRESHOLD, SCORE_ACCEPTANCE_THRESHOLD], 
         linestyle='dashed', 
         label='acceptance threshold')
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.legend()
plt.show()
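
The smoothed curve is a 10-episode moving average. The pure-Python sketch below performs the same computation as `np.convolve(scores, np.ones(w)/w, mode='valid')` for a window `w`, and shows that the result has `len(scores) - w + 1` points, each summarizing a window that ends `w - 1` episodes after it starts. `moving_average` is an illustrative name.

```python
def moving_average(scores, window=10):
    """Return (x, y): y[i] is the mean of scores[i:i+window],
    and x[i] is the index of the last episode in that window."""
    n = len(scores) - window + 1
    y = [sum(scores[i:i + window]) / window for i in range(n)]
    x = [i + window - 1 for i in range(n)]
    return x, y

x, y = moving_average(list(range(20)), window=10)
print(len(y), x[0], y[0])
```

Plotting `y` against `x` places each averaged value at the last episode it covers, so the smoothed curve lines up with the raw scores.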

Save the parameters of the Q network. 

In [None]:
torch.save(minion.q_network_local.state_dict(), '../model/trained_model.pt')

### 5. Test our minion's performance. 

In [None]:
# Load the weights from file. 
minion.q_network_local.load_state_dict(torch.load('../model/trained_model.pt', map_location=device))

env_info = env.reset(train_mode=False)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = minion.act(state)                     # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))

In [None]:
env.close()