# Rainbow
---
In this notebook, a noisy neural network is used. There are two main aspects of this implementation.

1) The layers in the advantage and value blocks have factorized Gaussian noise added to them while in training mode. A layer with noise is implemented in the `NoisyLinear` class. In a "classical" linear layer, the output (before any activation function) is computed as:
$$y = wx + b$$
whereas in a noisy layer it is:
$$y=(\mu^w + \sigma^w \odot \epsilon^w)x + \mu^b + \sigma^b \odot \epsilon^b$$

where $\mu^w \in \mathbb{R}^{q \times p}$, $\sigma^w \in \mathbb{R}^{q \times p}$, $\mu^b \in \mathbb{R}^q$, and $\sigma^b \in \mathbb{R}^q$ are learnable parameters and $\epsilon^w \in \mathbb{R}^{q \times p}$, $\epsilon^b \in \mathbb{R}^q$ are random noise variables. In this notation, $p$ and $q$ are the numbers of layer inputs and outputs, respectively, and $\odot$ denotes elementwise multiplication.


The noise can be generated with one of the following approaches:
- **Independent Gaussian noise** - applied independently to each weight and bias.
- **Factorized Gaussian noise** - draws two random noise vectors $\epsilon_{in} \in \mathbb{R}^p$ and $\epsilon_{out} \in \mathbb{R}^q$, then applies the function $f(x) = \mathrm{sgn}(x) \sqrt{|x|}$ elementwise to obtain $f(\epsilon_{in})$ and $f(\epsilon_{out})$. Finally we set
$$\epsilon^w = f(\epsilon_{out}) \otimes f(\epsilon_{in})$$
$$\epsilon^b = f(\epsilon_{out})$$
where $\otimes$ is the outer product, so that $\epsilon^w \in \mathbb{R}^{q \times p}$ matches the shape of $\mu^w$ and $\sigma^w$.
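
The factorized scheme can be sketched in NumPy as follows. This is only a minimal illustration and not the `NoisyLinear` class used in this notebook: the helper names `scale_noise`, `factorized_noise` and `noisy_linear`, the layer sizes, and the constant value of $\sigma$ are all assumptions made for the example. The outer product is ordered so that $\epsilon^w$ has the $q \times p$ shape of the weight matrix.

```python
import numpy as np

def scale_noise(x):
    """f(x) = sgn(x) * sqrt(|x|), applied elementwise."""
    return np.sign(x) * np.sqrt(np.abs(x))

def factorized_noise(p, q, rng):
    """Return (eps_w, eps_b) for a layer with p inputs and q outputs."""
    eps_in = scale_noise(rng.standard_normal(p))    # shape (p,)
    eps_out = scale_noise(rng.standard_normal(q))   # shape (q,)
    eps_w = np.outer(eps_out, eps_in)               # shape (q, p)
    eps_b = eps_out                                 # shape (q,)
    return eps_w, eps_b

def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, eps_w, eps_b):
    """y = (mu_w + sigma_w * eps_w) x + mu_b + sigma_b * eps_b."""
    return (mu_w + sigma_w * eps_w) @ x + mu_b + sigma_b * eps_b

rng = np.random.default_rng(0)
p, q = 4, 3                                # hypothetical layer sizes
eps_w, eps_b = factorized_noise(p, q, rng)

# Hypothetical parameter values, chosen only for the demonstration.
mu_w = rng.standard_normal((q, p))
sigma_w = np.full((q, p), 0.5)
mu_b = np.zeros(q)
sigma_b = np.full(q, 0.5)

x = rng.standard_normal(p)
y = noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, eps_w, eps_b)
print(eps_w.shape, eps_b.shape, y.shape)   # (3, 4) (3,) (3,)
```

Note that the factorized variant draws only $p + q$ random numbers per layer, whereas independent noise needs $pq + q$.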

2) Instead of $\epsilon$-greedy exploration, we can use the noise itself to explore. This way the agent always picks the action that looks best under the current noisy weights, and exploration (choosing non-optimal actions) comes from the noise. In my code you can choose between noise-based and $\epsilon$-greedy exploration with the `explore_with_noise` parameter.
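
The difference between the two exploration modes can be sketched as below. This is a hedged illustration, not the agent's actual `act` method: the function `select_action` and its signature are hypothetical, and the Q-values are assumed to have already been computed by the network (with freshly sampled noise when noise-based exploration is on).

```python
import numpy as np

def select_action(q_values, eps, explore_with_noise, rng):
    """Pick an action index from a 1-D array of Q-value estimates."""
    if explore_with_noise:
        # Noise-based exploration: the Q-values are assumed to come from
        # noisy layers, so acting greedily already explores.
        return int(np.argmax(q_values))
    # Epsilon-greedy: with probability eps take a uniformly random action.
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])   # hypothetical Q-value estimates

# With noise-based exploration, eps is ignored and the choice is greedy.
noisy_choice = select_action(q_values, eps=1.0, explore_with_noise=True, rng=rng)

# With eps = 0, epsilon-greedy is purely greedy as well.
greedy_choice = select_action(q_values, eps=0.0, explore_with_noise=False, rng=rng)

# With eps = 1, every action is drawn uniformly at random.
random_choices = [select_action(q_values, 1.0, False, rng) for _ in range(20)]
print(noisy_choice, greedy_choice)     # 1 1
```

In the noise-based mode there is no epsilon schedule to tune; how much the agent explores is governed by the learned $\sigma$ parameters instead.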




In [None]:
# Import packages
import gym
import torch
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

import config

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('Device:', device)

In [None]:
env = gym.make(config.ENVIRONMENT)

In [None]:
from rainbow.rainbow_agent import Agent

agent = Agent(
    state_size=env.observation_space.shape[0], 
    action_size=env.action_space.n,
    buffer_size = int(1e5),
    batch_size = 64,
    gamma = 0.99,
    lr = 5e-4,
    update_every = 4, # How often to update the network
    device=device,
    # PER parameters
    per_alpha = 0.2,
    per_beta_start = 0.4,
    per_beta_frames = 1e5,
    per_prior_eps = 1e-6, 
    # Dueling parameters
    clip_grad=10, 
    #N-step parameters
    n_steps = 3, 
    # Distributional parameters
    atom_size=10, # Originally it was 51
    v_min=0,
    v_max=200,
    # NoisyNet parameters
    explore_with_noise=False,
    )



In [None]:
def train_agent(n_episodes=config.MAX_EPISODES, 
        max_t=config.MAX_TIMESTEPS, 
        eps_start=config.EPSILON_START, 
        eps_end=config.EPSILON_END, 
        eps_decay=config.EPSILON_DECAY,
        expected_reward = config.EXPECTED_REWARD,
        update_target_every = 4
): 
    """Deep Q-Learning.
    Args:
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        eps_start (float): starting value of epsilon, for 
            epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
        eps_decay (float): decay factor (per episode) 
            for decreasing epsilon
        expected_reward (float): finish when the average score
            is greater than this value
        update_target_every (int): how often (in episodes) the target 
            network is updated. Default: 4 (every fourth episode) 
    Returns:
        scores (list): list of scores from each episode
    """
    scores = []                        # score of every episode
    scores_window = deque(maxlen=100)  # scores of the last 100 episodes
    eps = eps_start                    # current value of epsilon
    for episode in range(1, n_episodes+1):
        state, info = env.reset()
        score = 0
        for t in range(max_t):
            
            action = agent.act(state, eps)
            next_state, reward, done, truncated, info = env.step(action)
            
            agent.step(state, action, reward, next_state, done)
            
            state = next_state
            score += reward
            # End the episode on termination or on truncation (e.g. time limit)
            if done or truncated:
                break
        scores_window.append(score)       
        scores.append(score)
                
        eps = max(eps_end, eps_decay*eps) 
        
        if episode % update_target_every == 0:
            agent.target_hard_update()
        
        mean_score = np.mean(scores_window)
        print(f'\rEpisode {episode}\tAverage Score: {mean_score:.2f}', end="")
        if episode % 100 == 0:
            print(f'\rEpisode {episode}\tAverage Score: {mean_score:.2f}')
            agent.save('checkpoint.pth')
        if mean_score >= expected_reward:
            print(f'\nDone in {episode:d} episodes!\tAverage Score: {mean_score:.2f}')
            agent.save('checkpoint.pth')
            break
    return scores

### Train the agent


In [None]:
scores = train_agent()

# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #') 
plt.show()