# Banana Lover (Navigation)
This notebook describes the implementation of a DQN agent that solves the `Navigation` task.
All the modules required for this task are in `scripts/agent.py`, which consists of three main components (classes):

- **ReplayBuffer**
- **QNetwork**
- **Agent**

### ReplayBuffer
This module serves as a fixed-size memory that stores (via the `add()` method) the most recent experience tuples
`(state, action, reward, next_state, done)`, from which the agent randomly samples a batch for training (via the `sample()` method).

First, import the required libraries:

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import random
from collections import namedtuple, deque

In [2]:
class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, seed, device):
        """Initialize a ReplayBuffer object.

        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): random seed
            device (torch.device or str): device on which sampled tensors are placed
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = seed
        self.device = device
        random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(self.device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(self.device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(self.device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(self.device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(self.device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)
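
The eviction and sampling behavior described above can be seen in isolation with a minimal, standalone sketch. It uses only the standard library, and the toy sizes are hypothetical values chosen to make the effect visible; it is not a replacement for the `ReplayBuffer` class.

```python
import random
from collections import deque, namedtuple

# Hypothetical toy sizes, chosen only to make the eviction visible.
BUFFER_SIZE = 5
BATCH_SIZE = 3

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)
memory = deque(maxlen=BUFFER_SIZE)

# Store 8 transitions; once the deque is full, each append drops the oldest item.
for t in range(8):
    memory.append(
        Experience(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
    )

stored_states = [e.state for e in memory]  # only the 5 most recent remain

# Uniform sampling without replacement, as in `sample()`.
random.seed(10)
batch = random.sample(memory, k=BATCH_SIZE)
```

Because `deque(maxlen=...)` discards from the opposite end on `append`, the buffer always holds the latest experiences without any explicit bookkeeping.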

`buffer_size`, `batch_size`, and `seed` are hyperparameters; together with the others, they are stored in the `config` dictionary inside the `main.py` script:

In [3]:
config = {
    "BUFFER_SIZE": int(1e5),        # replay buffer size
    "BATCH_SIZE": 64,               # minibatch size
    "GAMMA": 0.99,                  # discount factor
    "TAU": 1e-3,                    # for soft update of target parameters
    "LR": 5e-4,                     # learning rate
    "UPDATE_EVERY": 4,              # how often to update the network
    "SEED": 10,                     # random seed
    "N_EPISODS": 2000,              # Number of episodes to train
    "EPS_START": 1.0,               # Epsilon starting value
    "EPS_END": 0.01,                # Minimum epsilon value
    "EPS_DECAY": 0.995,             # Epsilon decay rate
    "Q_NET_Hidden_Dims": (64, 64)   # Sizes of the hidden layers in Q-Net
}

### QNetwork
In DQN, a deep neural network approximates the Q-values of all actions for a given state. Here it is implemented as a simple
dense network with two hidden layers with `ReLU` nonlinearities and a linear output layer (since it is a regression task).

This architecture is used for both the local (value) network and the target network.

In [4]:
class QNetwork(nn.Module):
    """Q-Network: maps a state to one Q-value per action."""

    def __init__(self, state_size, action_size, seed, hidden_dims=(64, 64)):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            hidden_dims (tuple of int): Sizes of the hidden layers
        """
        super(QNetwork, self).__init__()
        self.seed = torch.manual_seed(seed)

        # Build the layers: input -> hidden layer(s) -> linear output
        self.h_layers = nn.ModuleList()
        last_dim = hidden_dims[0]

        self.input_layer = nn.Linear(state_size, last_dim)

        for h_dim in hidden_dims[1:]:
            self.h_layers.append(nn.Linear(last_dim, h_dim))
            last_dim = h_dim
        self.output_layer = nn.Linear(last_dim, action_size)

    def forward(self, state):
        """Build a network that maps state -> action values."""

        x = F.relu(self.input_layer(state))
        for h_layer in self.h_layers:
            x = F.relu(h_layer(x))

        return self.output_layer(x)
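
To make the resulting architecture concrete, the following sketch mirrors the constructor's layer-building logic in plain Python and counts the trainable parameters. The helper names `layer_shapes` and `parameter_count` are hypothetical; the sizes 37 and 4 are the state length and number of actions reported by the environment in the training log further down.

```python
def layer_shapes(state_size, action_size, hidden_dims=(64, 64)):
    """Return (in_features, out_features) for each linear layer, in order."""
    dims = [state_size, *hidden_dims, action_size]
    return list(zip(dims[:-1], dims[1:]))


def parameter_count(shapes):
    """Count weights plus biases of the fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


shapes = layer_shapes(state_size=37, action_size=4)
total = parameter_count(shapes)
```

With the default `hidden_dims`, `shapes` lists three linear layers (input, one hidden, output), so the network stays small enough to train quickly on a CPU.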

### Agent
This class builds on the two modules above and contains the code to interact with the environment: the `act()` method selects
an action for a given state, the `step()` method stores the resulting experience in the `memory` module (a `ReplayBuffer`),
and the `learn()` method trains the network on batches sampled from it.
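
As a compact restatement of what `learn()` computes, here is the update written out as equations. The notation is an assumption made for illustration: $\theta_{\text{local}}$ and $\theta_{\text{target}}$ denote the weights of the local and target networks, $d_i$ is the `done` flag, and $B$ is the batch size.

$$
y_i = r_i + \gamma \,(1 - d_i)\, \max_{a'} Q_{\theta_{\text{target}}}(s'_i, a'),
\qquad
L(\theta_{\text{local}}) = \frac{1}{B} \sum_{i=1}^{B} \bigl( Q_{\theta_{\text{local}}}(s_i, a_i) - y_i \bigr)^2
$$

The factor $(1 - d_i)$ removes the bootstrap term for terminal transitions, so their target reduces to the reward alone.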

In [5]:
class Agent:
    """Interacts with and learns from the environment."""

    def __init__(self, state_size, action_size, device, config):
        """Initialize an Agent object.

        Params
        ======
            state_size (int): dimension of each state
            action_size (int): dimension of each action
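            device (torch.device or str): device used for the networks and sampled batches
            config (dict): hyperparameters (see the `config` dictionary above)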
        """
        self.state_size = state_size
        self.action_size = action_size
        self.device = device
        self.config = config
        self.seed = config["SEED"]
        random.seed(self.seed)

        # Q-Network
        self.qnetwork_local = QNetwork(state_size, action_size, self.seed, self.config["Q_NET_Hidden_Dims"]).to(self.device)
        self.qnetwork_target = QNetwork(state_size, action_size, self.seed, self.config["Q_NET_Hidden_Dims"]).to(self.device)
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=self.config["LR"])

        # Replay memory
        self.memory = ReplayBuffer(action_size, self.config["BUFFER_SIZE"], self.config["BATCH_SIZE"], self.seed, device=self.device)
        # Initialize time step (for updating every UPDATE_EVERY steps)
        self.t_step = 0

    def step(self, state, action, reward, next_state, done):
        # Save experience in replay memory
        self.memory.add(state, action, reward, next_state, done)

        # Learn every UPDATE_EVERY time steps.
        self.t_step = (self.t_step + 1) % self.config["UPDATE_EVERY"]
        if self.t_step == 0:
            # If enough samples are available in memory, get random subset and learn
            if len(self.memory) > self.config["BATCH_SIZE"]:
                experiences = self.memory.sample()
                self.learn(experiences, self.config["GAMMA"])

    def act(self, state, eps=0.):
        """Returns actions for given state as per current policy.

        Params
        ======
            state (array_like): current state
            eps (float): epsilon, for epsilon-greedy action selection
        """
        state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
        self.qnetwork_local.eval()
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()

        # Epsilon-greedy action selection
        if random.random() > eps:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_size))

    def learn(self, experiences, gamma):
        """Update value parameters using given batch of experience tuples.

        Params
        ======
            experiences (Tuple[torch.Tensor]): batch of (s, a, r, s', done) tensors
            gamma (float): discount factor
        """
        states, actions, rewards, next_states, dones = experiences

        # DQN-Target
        next_max_value = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)

        target_q = rewards + gamma * next_max_value * (1 - dones)
        expected_q = self.qnetwork_local(states).gather(1, actions)

        # Compute loss
        loss = F.mse_loss(expected_q, target_q)
        # Minimize the loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # ------------------- update target network ------------------- #
        self.soft_update(self.qnetwork_local, self.qnetwork_target, self.config["TAU"])

    def soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target

        Params
        ======
            local_model (PyTorch model): weights will be copied from
            target_model (PyTorch model): weights will be copied to
            tau (float): interpolation parameter
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
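
To get a feel for how slowly the target network follows the local one with `TAU = 1e-3`, here is a small sketch of the same interpolation rule on plain Python lists. The parameter values are hypothetical, and the local parameters are held fixed purely for illustration (during real training they change at every step).

```python
TAU = 1e-3  # same value as config["TAU"]


def soft_update(local, target, tau):
    """Return tau * local + (1 - tau) * target, element by element."""
    return [tau * p_local + (1.0 - tau) * p_target
            for p_local, p_target in zip(local, target)]


local_params = [1.0, -2.0]   # hypothetical values, held fixed for illustration
target_params = [0.0, 0.0]

for _ in range(1000):
    target_params = soft_update(local_params, target_params, TAU)

# With a fixed local network, the fraction of the gap closed after n updates
# is 1 - (1 - tau) ** n.
closed = 1.0 - (1.0 - TAU) ** 1000
```

After 1000 updates the target has closed roughly 63% of the gap, so it tracks a smoothed, delayed version of the local network, which is what keeps the TD targets stable.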

### main.py
This is the main script used for running the experiments; it uses the `agent` module together with the `environment`
to train the agent (using the `train()` function) and to keep track of its learning behavior.

In [6]:
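def train(_env, _agent, _brain_name):
    """Train `_agent` in `_env` and return the per-episode scores and the agent.

    Hyperparameters are read from the module-level `config` dictionary.
    """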

    # watch an untrained agent
    env_info = _env.reset(train_mode=False)[_brain_name]
    state = env_info.vector_observations[0]
    score = 0  # initialize the score
    for _ in range(50):
        action = _agent.act(state)  # select an action
        env_info = _env.step(action)[_brain_name]  # send the action to the environment
        next_state = env_info.vector_observations[0]  # get the next state
        reward = env_info.rewards[0]  # get the reward
        done = env_info.local_done[0]  # see if episode has finished
        score += reward  # update the score
        state = next_state  # roll over the state to next time step
        if done:  # exit loop if episode finished
            break
    print("UnTrained Agent's Score: {}".format(score))


    scores = []  # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    eps = config["EPS_START"]  # initialize epsilon

    for i_episode in range(1, config["N_EPISODS"] + 1):

        env_info = _env.reset(train_mode=True)[_brain_name]
        state = env_info.vector_observations[0]
        score = 0

        while True:
            action = _agent.act(state, eps)

            env_info = _env.step(action)[_brain_name]
            next_state = env_info.vector_observations[0]
            reward = env_info.rewards[0]
            done = env_info.local_done[0]

            _agent.step(state, action, reward, next_state, done)

            state = next_state
            score += reward
            if done:
                break

        scores_window.append(score)  # update the 100-episode rolling window
        scores.append(score)  # save most recent score
        eps = max(config["EPS_END"], config["EPS_DECAY"] * eps)  # decrease epsilon
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end="")
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
        if np.mean(scores_window) >= 13:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode,
                                                                                         np.mean(scores_window)))
            torch.save(_agent.qnetwork_local.state_dict(), 'results/checkpoint.pth')
            break

    return scores, _agent
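
The exploration schedule used in the training loop can be examined on its own. The sketch below replays the same `max(EPS_END, EPS_DECAY * eps)` rule with the values from `config`, without any environment; the names `schedule` and `first_floor_episode` are hypothetical.

```python
EPS_START, EPS_END, EPS_DECAY = 1.0, 0.01, 0.995  # values from `config`
N_EPISODES = 2000

eps = EPS_START
schedule = []  # schedule[k] is the epsilon used during episode k + 1
for _ in range(N_EPISODES):
    schedule.append(eps)
    eps = max(EPS_END, EPS_DECAY * eps)  # same decay rule as in train()

# First episode that runs with epsilon clamped at its minimum value.
first_floor_episode = schedule.index(EPS_END) + 1
```

Under these settings epsilon needs more than 900 episodes to reach `EPS_END`, so a run that stops earlier is still taking a noticeable share of random actions in its final episodes (about 6% around episode 548).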

### Let's Train the Agent!
Run `python main.py --is-training True` in the terminal:

Number of agents: 1

Number of actions: 4

States look like: [0.         1.         0.         0.         0.19246322 0.
 1.         0.         0.         0.39209977 0.         0.
 0.         1.         0.         0.         1.         0.
 0.         0.19775437 0.         0.         1.         0.
 0.86202884 0.         1.         0.         0.         0.25187665
 0.         0.         0.         1.         0.         0.
 0.        ]

States have length: 37

UnTrained Agent's Score: 0.0

Episode 100	Average Score: 0.55

Episode 200	Average Score: 4.66

Episode 300	Average Score: 8.34

Episode 400	Average Score: 10.63

Episode 500	Average Score: 12.31

Episode 548	Average Score: 13.00

Environment solved in 548 episodes!	Average Score: 13.00

Trained Agent's Score: 17.0

**The agent was able to solve the task in 548 episodes!**

### Results
Results are stored in the `results` folder, which contains `checkpoint.pth` (the weights of the trained agent's local Q-network)
and `learning_curve.png` (the learning behavior of the agent throughout training):

<img src="./results/learning_curve.png">

### Ideas for Future Work
There are many possible extensions to this agent:
- First, try to learn the task from raw pixels using a ConvNet as the encoder.

- Second, assess the effect of using [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) and [weight_norm](https://pytorch.org/docs/stable/generated/torch.nn.utils.weight_norm.html) in the architecture of the `QNetwork` to see whether
they improve robustness (when tried on different seeds) and training speed.

- Next, add the improvements that have been made to the standard DQN agent, such as Double DQN, Dueling DQN,
Prioritized Experience Replay, and Rainbow.
