# Navigation

---

This notebook contains the optional solution for the first project of the Deep Reinforcement Learning Nanodegree. It differs from the [first Navigation solution](Navigation_own.ipynb) in that the agent has to learn from raw pixels here, instead of receiving a compact vector observation from the observation space.  
This adds another layer of complexity.

### The Environment

The notebook uses the Banana environment from [Unity Technologies](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Learning-Environment-Examples.md#banana-collector).

![environment](banana.gif)

The agent's goal is to navigate the environment, collecting yellow bananas while avoiding blue ones.

The task is episodic. To solve the environment, the agent must achieve an average score of `+13` over `100` consecutive episodes.
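
The solving criterion can be sketched as a rolling mean over the most recent 100 episode scores. The helper name `is_solved` and the sample scores are illustrative only and are not part of the project files.

```python
from collections import deque

import numpy as np


def is_solved(scores_window, target=13.0):
    """Return True once the window is full and its mean reaches the target."""
    window_full = len(scores_window) == scores_window.maxlen
    return window_full and float(np.mean(scores_window)) >= target


# Illustrative scores only: 99 episodes at 13 points do not fill the window yet.
window = deque(maxlen=100)
window.extend([13.0] * 99)
print(is_solved(window))  # False, only 99 episodes recorded

window.append(13.0)
print(is_solved(window))  # True, 100 episodes with mean 13.0
```

Requiring a full window avoids declaring the environment solved after a few lucky early episodes.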

Note that the files `dqn_agent.py` and `model.py` are required to run this notebook.

### State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The environment state is an array of raw pixels with shape `(1, 84, 84, 3)`.  
A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.
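
PyTorch convolution layers expect channel-first input, so a state of shape `(1, 84, 84, 3)` usually has to be transposed before it reaches a network. The following is a minimal sketch under that assumption; `to_channel_first` is a hypothetical helper and may differ from what `dqn_agent.py` and `model.py` actually do.

```python
import numpy as np


def to_channel_first(state):
    """Reorder (batch, height, width, channels) to (batch, channels, height, width)."""
    state = np.asarray(state, dtype=np.float32)
    return np.ascontiguousarray(state.transpose(0, 3, 1, 2))


# Random stand-in for one observation; real frames come from the environment.
dummy_state = np.random.rand(1, 84, 84, 3)
print(to_channel_first(dummy_state).shape)  # (1, 3, 84, 84)
```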

In [1]:
from unityagents import UnityEnvironment
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
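# NOTE: the log below reports 0 visual observations for this build, so a
# pixel-based build of the environment is needed for the cells that follow.
env = UnityEnvironment(file_name="../Banana_Linux_NoVis/Banana.x86_64")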

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

In [6]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.visual_observations[0]
print('States look like:')
plt.imshow(np.squeeze(state))
plt.show()
state_size = state.shape
print('States have shape:', state.shape)

Number of agents: 1
Number of actions: 4


IndexError: list index out of range
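
The `IndexError` above is consistent with the brain summary, which reports 0 visual observations per agent, so `env_info.visual_observations` is an empty list. A small guard, sketched here, would make that failure explicit. `first_visual_observation` is a hypothetical helper, and the `SimpleNamespace` object only stands in for the real `env_info`.

```python
from types import SimpleNamespace


def first_visual_observation(env_info):
    """Return the first camera frame, or raise a descriptive error if none exist."""
    if len(env_info.visual_observations) == 0:
        raise RuntimeError(
            "The loaded environment provides no visual observations; "
            "a pixel-based build of the environment is required."
        )
    return env_info.visual_observations[0]


# Stand-in for an env_info object returned by a build without cameras.
fake_env_info = SimpleNamespace(visual_observations=[])
try:
    first_visual_observation(fake_env_info)
except RuntimeError as err:
    print(err)
```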

## Train agent

In [8]:
from dqn_agent import PixelAgent
from collections import deque
import torch

ModuleNotFoundError: No module named 'utils'

In [2]:
def dqn(n_episodes=2000, eps_start=1.0, eps_end=0.01, eps_decay=0.995, train=False, score_target=float('inf')):
    """Deep Q-Learning.

    Params
    ======
        n_episodes (int): maximum number of episodes to run
        eps_start (float): starting value of epsilon, for epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon
        train (bool): if True, train the agent; if False, only run the current policy
        score_target (float): average score over the last 100 episodes at which to stop and save a checkpoint
    """
    scores = []                        # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    eps = eps_start                    # initialize epsilon

    for i_episode in range(1, n_episodes+1):

        env_info = env.reset(train_mode=train)[brain_name] # reset the environment
        state = env_info.visual_observations[0]            # get the current state
        score = 0                                          # initialize the score

        while True:

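            # NOTE: eps is decayed each episode but not passed to act(), so exploration depends on PixelAgent.act
            action = agent.act(state)                      # select an action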
            env_info = env.step(action)[brain_name]        # send the action to the environment

            next_state = env_info.visual_observations[0]   # get the next state
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished

            if train:
                agent.step(state, action, reward, next_state, done)

            score += reward                                # update the score

            state = next_state                             # roll over the state to next time step

            if done:                                       # exit loop if episode finished
                break

        scores_window.append(score)       # add score to the 100-episode window
        scores.append(score)              # add score to the full history

        eps = max(eps_end, eps_decay*eps) # decrease epsilon

        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end="")

        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))

        if len(scores_window) == scores_window.maxlen and np.mean(scores_window) >= score_target:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
            torch.save(agent.qnetwork_local.state_dict(), 'checkpoint.pth')
            break

    return scores   
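
With the default arguments of `dqn`, epsilon shrinks by a factor of `eps_decay` per episode until it reaches `eps_end`. The sketch below counts how many episodes that takes and shows a generic epsilon-greedy choice over four actions. `episodes_until_floor` and `epsilon_greedy` are illustrative helpers, not part of `dqn_agent.py`, and the Q-values are made up.

```python
import random

import numpy as np


def episodes_until_floor(eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Count the episodes needed for the decayed epsilon to reach eps_end."""
    eps, episodes = eps_start, 0
    while eps > eps_end:
        eps = max(eps_end, eps_decay * eps)
        episodes += 1
    return episodes


def epsilon_greedy(q_values, eps):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))


print(episodes_until_floor())
print(epsilon_greedy([0.1, 0.7, 0.2, 0.0], eps=0.0))  # 1, the greedy choice
```

A slower decay (a value of `eps_decay` closer to 1) keeps the agent exploring for more episodes before it settles on the greedy policy.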

In [None]:
agent = PixelAgent(state_size=state_size, action_size=action_size, seed=0)

scores = dqn(train=True, score_target=13.0)  # stop and save checkpoint.pth once the average score reaches +13

In [None]:
# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

## Run agent

In [None]:
agent = PixelAgent(state_size=state_size, action_size=action_size, seed=0)

agent.qnetwork_local.load_state_dict(torch.load('checkpoint.pth'))

scores = dqn(n_episodes=1, train=False)

In [None]:
env.close()