# Navigation

---

This notebook builds a Deep Q-Network (DQN) agent to solve a project challenge from the
[Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

#### Set the `train` parameter to `True` to train the agent, or to `False` to watch a trained agent.

In [1]:
train = False # set true to train agent and false to watch the trained agent

---

## 1. Imports

We start by importing some useful packages.

In [2]:
from unityagents import UnityEnvironment # Runs the training environment
import numpy as np # General Purpose Data
from dqn_agent_new import Agent # The agent class that handles actions and training
from collections import deque # Used for bundling and extracting data
import matplotlib.pyplot as plt # used for plotting learning results
import torch # Saving and loading Nets
import time # for time estimation etc.
from sklearn.model_selection import ParameterGrid # The class that handles grid optimization
import seaborn as sns # Pretty Plotting
sns.set() # apply the default seaborn theme (the call parentheses were missing)
sns.set_palette("husl")
import pandas as pd
import statistics
%matplotlib inline

This code is made to run on a Linux x64 machine. For builds of the environment for other platforms, please refer to the Udacity GitHub repository.

## 2. Set up the environment and set parameters

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction. A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.
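
With four discrete actions, epsilon-greedy selection (the scheme that the `eps_*` parameters of the training function refer to) can be sketched as follows. The `Agent.get_action` method from `dqn_agent_new` is not shown in this notebook, so the helper `epsilon_greedy` and the Q-values below are illustrative assumptions, not the agent's actual code.

```python
import random

# Action indices as listed above: 0 forward, 1 backward, 2 turn left, 3 turn right
N_ACTIONS = 4

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    Hypothetical helper for illustration only.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

q = [0.1, -0.3, 0.7, 0.2]                       # made-up Q-values for the four actions
greedy_action = epsilon_greedy(q, epsilon=0.0)  # never explores
random_action = epsilon_greedy(q, epsilon=1.0)  # always explores
```

With `epsilon=0` the agent always takes the action with the highest estimated value, which is the setting used later when watching the trained agent.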

In [3]:
# create environment and load API
env = UnityEnvironment(file_name="../BananaFeast/Banana_Linux/Banana.x86_64")
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


---

## 3. Setting the functions
The `dqn` function trains an agent using a Double-Deep-Q-Network.
The agent is initialized inside the function so that several training runs can be executed in succession.
All training parameters can be adjusted through the arguments of `dqn`; their defaults are the best configuration found.
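
The exploration rate follows a per-episode multiplicative decay with a lower bound, `eps = max(eps_end, eps_decay * eps)`. The sketch below replays that rule on its own to show how the schedule behaves with the default values. The helper name `epsilon_schedule` is an assumption introduced for illustration.

```python
def epsilon_schedule(eps_start=1.0, eps_end=0.01, eps_decay=0.99, n_episodes=1500):
    """Return the epsilon value used in each episode (hypothetical helper)."""
    eps = eps_start
    schedule = []
    for _ in range(n_episodes):
        schedule.append(eps)                 # value in effect during this episode
        eps = max(eps_end, eps_decay * eps)  # same update rule as in dqn()
    return schedule

schedule = epsilon_schedule()
first_floor_index = schedule.index(0.01)     # first episode index at the floor
```

With the defaults, epsilon reaches its floor of `0.01` after roughly 460 episodes, so most of a 1500-episode run is spent acting almost greedily.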


A second function, `plot_res`, plots the agent's scores over the course of training.
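
The plotting function groups episodes into blocks of 20 (`episode // 20 + 1`) before drawing, so the line shows an aggregate per block rather than every noisy episode score. The sketch below reproduces only the averaging idea with the standard library. The helper `block_means` and the example scores are assumptions for illustration and are not part of the notebook.

```python
def block_means(scores, block=20):
    """Average consecutive groups of `block` episode scores (hypothetical helper)."""
    means = []
    for start in range(0, len(scores), block):
        chunk = scores[start:start + block]   # the last chunk may be shorter
        means.append(sum(chunk) / len(chunk))
    return means

example_scores = [float(i) for i in range(40)]  # made-up scores, not training results
smoothed = block_means(example_scores)
```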

In [4]:
def dqn(n_episodes=1500, max_t=1500, eps_start=1.0, eps_end=0.01, eps_decay=0.99, BUFFER_SIZE = int(1e5), 
        BATCH_SIZE = 64, GAMMA = 0.99, TAU = 1e-3, LR = 5e-4, UPDATE_INTERVAL = 20, model = True):
    """Deep Q-Learning.
    
    Params
    ======
        n_episodes (int):       maximum number of training episodes
        max_t (int):            maximum number of timesteps per episode
        eps_start (float):      starting value of epsilon, for epsilon-greedy action selection
        eps_end (float):        minimum value of epsilon
        eps_decay (float):      multiplicative factor (per episode) for decreasing epsilon
        BUFFER_SIZE (int):      Size of Memory Buffer
        BATCH_SIZE (int):       Size of memory batch trained on
        GAMMA (float):          discount factor for future rewards
        TAU (float):            for soft update of target parameters
        LR (float):             learning rate 
        UPDATE_INTERVAL (int):  how often to update the network
        model (boolean):        use small (True) or big (False) Net       
        
    """
    # 1. Initiate Agent class and stores for scores 
    agent = Agent(BUFFER_SIZE = BUFFER_SIZE, BATCH_SIZE = BATCH_SIZE, GAMMA = GAMMA, 
        TAU = TAU, LR = LR, UPDATE_INTERVAL = UPDATE_INTERVAL ,state_size=37, action_size=4, seed=42, EPSILON=eps_start, model = model)
    scores = []                        # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    
    # initialize epsilon for training
    eps = eps_start
    
    # iterate over the amount of training episodes
    for i_episode in range(1, n_episodes+1):
        
        #reset env, score and get first state value
        env_info = env.reset(train_mode=True)[brain_name]   #reset env
        state = env_info.vector_observations[0]          
        score = 0
        
        # act inside the environment until maxtime or the env is "done"
        for t in range(max_t):
            
            action = agent.get_action(state)        # get next action based on state
            env_info = env.step(action)[brain_name] # apply action to environment
            
            next_state, reward, done = env_info.vector_observations[0], env_info.rewards[0], env_info.local_done[0] # update values to new env state
            agent.step(state, action, reward, next_state, done) # record values and train if finished cycle
            
            state = next_state #update state
            score += reward
            
            if done:
                break 
        scores_window.append(score)       # save most recent score
        scores.append(score)              # save most recent score
        eps = max(eps_end, eps_decay*eps) # decrease epsilon
        agent.EPSILON = eps               # update new epsilon to agent
        
        if agent.scoremax < np.mean(scores_window): 
            agent.scoremax = np.mean(scores_window) # if agents performance increased, update own threshold 
        exp, lvl, maxscore = agent.get_stats()      # get agents current stats
        
        # Output training results for evaluation
        print('\rEpisode {}\tAvg. Score: {:.2f}\tExperience: {}\tcurrent Level: {}\tMax Score: {:.2f}'.format(i_episode, np.mean(scores_window), exp, lvl, maxscore), end="")
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
            
    # if the agent reaches minimum score, save the agents network    
    if np.mean(scores_window)>=13.0:
        print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
        torch.save(agent.localNet.state_dict(), 'weights.tnn')
    print("\nfinished training")
    
    return scores, agent.scoremax


def plot_res(score):
    """
    DQN result plotting
    
    plots the learning results of agent in a graph
    
    Params
    ======
        score (list<float>): scores of the past training episodes
        
    """
    
    # write results in dataframe
    data = pd.DataFrame({"episode": np.arange(len(score)), "score" : score})
    data.episode = (data.episode // 20) + 1 
    
    #plot learning graph
    fig, ax = plt.subplots(figsize=(25,20))
    ax = sns.lineplot(x = "episode", y = "score", data = data)
    plt.ylabel('Score')
    plt.xlabel('Episode #')    
    plt.show()

---

## 4. Train the Agent

Before training, we create a parameter grid for the training cycles.
This way we can iterate through different parameter settings and home in on the best-performing model.

Next, the model is trained. The parameters and the best average score of each run are saved to a CSV file to document the training.
After the agent has trained for the set number of episodes, the training process is displayed in a graph.
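
A parameter grid expands to every combination of the listed values, so the number of training runs is the product of the list lengths. The sketch below imitates that expansion with `itertools.product` from the standard library. It is an approximation of the idea behind `ParameterGrid`, not its implementation, and the helper `expand_grid` and the extra values (`1e-3`, `32`, `128`) are assumptions chosen for illustration.

```python
from itertools import product

def expand_grid(grid):
    """Return one dict per combination of the listed values (hypothetical helper)."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]

# Illustrative grid: 2 learning rates x 3 batch sizes x 1 discount factor
example_grid = {"LR": [5e-4, 1e-3], "BATCH_SIZE": [32, 64, 128], "GAMMA": [0.99]}
configs = expand_grid(example_grid)
n_runs = len(configs)
```

Because every run trains for the full number of episodes, adding values to several parameters at once multiplies the total training time.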


In [5]:
# Create parameter grid for optimization
# add values to the parameter list to train in different configs
grid = {
    "n_episodes" : [1500],
    "BUFFER_SIZE" : [int(1e5)],
    "BATCH_SIZE" : [64],
    "GAMMA" : [0.99],
    "TAU" : [1e-3],
    "LR" : [5e-4],
    "UPDATE_INTERVAL" : [20],
    "model" : [True],
    "eps_decay" : [0.99]
}

print("The amount of training configurations is:",len(list(ParameterGrid(grid))))

The amount of training configurations is: 1


In [6]:
if train:
    
    #set stores for training results
    scores = []
    scoremaxima = []
    scoreparam = []
    
    i = 1 # run counter
    
    # iterate over all parameter configs
    for params in ParameterGrid(grid):
        print("--------------------------------------------------------------------------------------------------")
        print("run\t", i)
        print(params)
        
        # run the agent with the parameter config
        kwargs = params
        %time score, scoremax = dqn(**kwargs)

        # save parameter values and best average score to file
        # (list.append returns None, so build the row by concatenation instead)
        scoreparam.append(list(params.values()) + [scoremax])
        pd.DataFrame(scoreparam).to_csv("results.csv")
    
        # plot the results of training
        plot_res(score)
        print("\n")
        
        i += 1

---

## 5. Watch the agent

Here you can watch the trained agent act inside the environment and collect bananas.
The agent should score around 15 points on average.

Depending on how the environment is randomly generated, the agent may score as low as 8 or as high as 21 points.

In [7]:
# Watch the smart agent
if not train:
    
    n_episodes = 1 # number of runs
    max_t = 1000   # max number of steps to solve env 
    
    # load the trained agent
    trained_agent = Agent(state_size=37, action_size=4, seed=42,EPSILON=0)
    trained_agent.localNet.load_state_dict(torch.load('weights.tnn'))
    
    score = 0
    
    # iterate over runs
    for i_episode in range(1, n_episodes+1):
        
        # get environment infos and set first state
        env_info = env.reset(train_mode=True)[brain_name]
        state = env_info.vector_observations[0]          
        
        # run until finished or max steps
        for t in range(max_t):
            
            action = trained_agent.get_action(state) # get next action based on state
            env_info = env.step(action)[brain_name]  # act in env
            
            # update environment info
            next_state, reward, done = env_info.vector_observations[0], env_info.rewards[0], env_info.local_done[0]
            state = next_state
            score = score + reward
            
            time.sleep(0.016) # makes actions visible for humans (~60 fps)
            print('\rScore: {:.2f}'.format(score), end="")
            if done:
                break 


Score: 16.00

In [8]:
# when finished, close the env
env.close()

When finished, you can close the environment.