<a href="https://colab.research.google.com/github/AyHaski/DL_AtariRainbow/blob/master/Anyrl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<div>
<img src="https://upload.wikimedia.org/wikipedia/commons/4/45/Alien_pixel.png" height="100"/>
</div>
<h1>Deep Reinforcement Learning: Rainbow in Atari Simulator</h1>


This notebook shows how to train an agent on an Atari game with the Rainbow Agent, for the Deep Learning Seminar at the University of Offenburg.

A brief explanation of the agent is provided, as well as the code snippets for hyperparameter tuning and training the agent.


In [0]:
#@title Mount Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The framework used is called [Anyrl-py](https://github.com/unixpickle/anyrl-py). It is an open-source framework for Reinforcement Learning that implements different algorithms.

In [6]:
#@title Install Packages
!pip install anyrl
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip

--2020-01-18 12:53:04--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 34.204.156.91, 34.198.126.60, 54.165.216.26, ...
Connecting to bin.equinox.io (bin.equinox.io)|34.204.156.91|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13773305 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip’


2020-01-18 12:53:05 (67.0 MB/s) - ‘ngrok-stable-linux-amd64.zip’ saved [13773305/13773305]

Archive:  ngrok-stable-linux-amd64.zip
  inflating: ngrok                   


# Explanation: Rainbow Agent
The Rainbow Agent combines different variants and extensions of DQN (Deep Q-Network). To build an understanding of the agent, the following sections give a quick explanation of Q-learning, DQN itself and its extensions.

## Q-learning

Q-learning is a model-free temporal-difference learning method. It tries to find a policy that takes the best action in each state in order to maximize the future reward. It's called Q-learning because it calculates the quality of a state-action pair. To do so, it takes the reward of the current step and adds the estimated discounted maximum future reward.
The discount factor determines how important the values of the timesteps after the current one are.
The function looks as follows:
</br>
</br>
![alt text](https://cdn-media-1.freecodecamp.org/images/TnN7ys7VGKoDszzv3WDnr5H8txOj3KKQ0G8o)
</br>
</br>
When the problem is small, all combinations of states and actions can be saved in a so-called Q-table, which basically functions as a cheat sheet to determine the best next action. The bigger the problem, the bigger the table gets. This led to the idea of combining deep neural networks with the Q-function.
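
To make the Q-table idea concrete, here is a minimal sketch of one tabular Q-learning update. The toy table (3 states, 2 actions), the learning rate and the function name `q_learning_update` are assumptions chosen for illustration; they are not part of the notebook or of Anyrl-py.

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state, done,
                      learning_rate=0.1, discount=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Estimated best future value; zero if the episode ended.
    future = 0.0 if done else np.max(q_table[next_state])
    td_target = reward + discount * future
    td_error = td_target - q_table[state, action]
    # Move the stored Q-value a small step towards the target.
    q_table[state, action] += learning_rate * td_error
    return td_error

# Hypothetical toy problem: 3 states, 2 actions.
q_table = np.zeros((3, 2))
q_table[1] = [0.5, 2.0]
td_error = q_learning_update(q_table, state=0, action=1, reward=1.0,
                             next_state=1, done=False)
print(q_table[0, 1], td_error)
```

The table needs one row per state, which is why this approach stops scaling once the states are raw game images.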

## DQN

DQN represents the Q-function of Q-learning with a deep neural network. 
The details of DQN were published in the [Nature paper](https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf), which also includes this image.
</br></br>
![alt text](https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fnature14236/MediaObjects/41586_2015_Article_BFnature14236_Fig1_HTML.jpg)
</br></br>

The network takes 4 images of the game as input, so that movement in the game is visible. The images go through 3 convolutional layers, followed by 2 fully connected layers. The final output of the network is one value per action of the Atari game, 18 in total.

DQN introduced three important things to the Q-learning algorithm.

><h3> 1. ϵ-greedy strategy</h3>

At the beginning of training, actions are chosen in an exploratory way, because it is not yet clear which action will yield the best reward. To gather a lot of knowledge about the different state-action values (Q-values), exploration is favored.
</br>

After some training time, the strategy for choosing actions shifts to exploitation: the actions with the best expected rewards are chosen more often. The combination of the two strategies enables more stable learning.
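
The shift from exploration to exploitation can be sketched as follows. The linear schedule and its values (`start`, `end`, `decay_steps`) are assumptions for illustration only; the notebook's Rainbow setup does not use them.

```python
import numpy as np

def epsilon_at(step, start=1.0, end=0.01, decay_steps=250_000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)

def select_action(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.9, 0.3])
# With epsilon = 0 the choice is purely greedy.
greedy_action = select_action(q_values, epsilon=0.0, rng=rng)
# With epsilon = 1 every action is drawn uniformly at random.
random_action = select_action(q_values, epsilon=1.0, rng=rng)
```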

> <h3> 2. Experience replay buffer</h3> 

Transitions are saved into a buffer during training. In the Rainbow Agent the last million transitions are saved. Random mini-batches are drawn from the buffer as input data for the next training step. This way older experiences are not forgotten immediately, and the correlation between the experiences in a batch is reduced.
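
A minimal sketch of such a buffer with uniform sampling is shown below. The class `UniformReplayBuffer` is hypothetical and much simpler than the prioritized buffer used later in this notebook; it only illustrates the "keep the last N transitions, sample at random" idea.

```python
import random
from collections import deque

class UniformReplayBuffer:
    """Keeps the most recent `capacity` transitions and samples uniformly."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry automatically.
        self.transitions = deque(maxlen=capacity)

    @property
    def size(self):
        return len(self.transitions)

    def add_sample(self, transition):
        self.transitions.append(transition)

    def sample(self, batch_size):
        # Sampling without replacement from a snapshot of the buffer.
        return random.sample(list(self.transitions), batch_size)

random.seed(0)
buffer = UniformReplayBuffer(capacity=3)
for step in range(5):
    buffer.add_sample({'obs': step})
batch = buffer.sample(2)
```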

> <h3> 3. Online and Target Network</h3> 

In normal Q-learning, the current Q-value and the target (the reward plus the maximum future value) are calculated with the same parameters. This correlates the two values: whenever the current Q-value is updated, the target shifts as well. Training therefore chases a moving target that is never reached.
<br>
To fix this, two networks are introduced: an online network and a target network. The online network is responsible for calculating the current Q-values and its parameters are updated during training, while the target network is only updated periodically by copying the online network.
<br>

The Q-loss of DQN, computed with the online parameters $\theta$ and the target parameters $\bar{\theta}$, looks like the following:

>$(R_{t+1} + \gamma_{t+1}\max_{a'} q_{\bar{\theta}}(S_{t+1},a')-q_{\theta}(S_{t},A_{t}))^2$
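
As a rough illustration of this loss, the sketch below evaluates it with NumPy on a hand-made batch. The arrays stand in for the outputs of the online and target networks; the function name `dqn_loss` and all numbers are assumptions, not the Anyrl-py implementation.

```python
import numpy as np

def dqn_loss(q_online, q_target_next, actions, rewards, terminals, discount=0.99):
    """Squared TD error per transition.

    q_online:      (batch, num_actions) online-network values for S_t.
    q_target_next: (batch, num_actions) target-network values for S_{t+1}.
    """
    batch = np.arange(len(actions))
    # Maximum over actions, taken from the target network.
    max_next = q_target_next.max(axis=1)
    # No bootstrapping from terminal states.
    targets = rewards + discount * max_next * (~terminals)
    return (targets - q_online[batch, actions]) ** 2

q_online = np.array([[1.0, 2.0], [0.5, 0.0]])
q_target_next = np.array([[3.0, 1.0], [4.0, 5.0]])
actions = np.array([1, 0])
rewards = np.array([1.0, 2.0])
terminals = np.array([False, True])
losses = dqn_loss(q_online, q_target_next, actions, rewards, terminals, discount=0.5)
```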

---
### Extensions and Variants of DQN

### Double DQN

In Double DQN, the selection of an action and its evaluation are split: the online network selects the action and the target network evaluates it. This leads to more stable learning and counters the overestimation bias of normal DQN, which stems from the maximization step. The Q-loss is changed to represent this:

>$(R_{t+1} + \gamma_{t+1}q_{\bar{\theta}}(S_{t+1},\arg\max_{a'} q_{\theta}(S_{t+1},a'))-q_{\theta}(S_{t},A_{t}))^2$
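
The split between selection and evaluation can be sketched with NumPy. The function name `double_dqn_targets` and the example values are assumptions for illustration; the arrays stand in for network outputs.

```python
import numpy as np

def double_dqn_targets(q_online_next, q_target_next, rewards, terminals, discount=0.99):
    """Targets where the online net selects and the target net evaluates."""
    batch = np.arange(len(rewards))
    best_actions = q_online_next.argmax(axis=1)      # selection: online network
    evaluated = q_target_next[batch, best_actions]   # evaluation: target network
    return rewards + discount * evaluated * (~terminals)

q_online_next = np.array([[1.0, 2.0], [3.0, 0.0]])
q_target_next = np.array([[10.0, 4.0], [6.0, 8.0]])
rewards = np.array([0.0, 1.0])
terminals = np.array([False, False])

double_targets = double_dqn_targets(q_online_next, q_target_next,
                                    rewards, terminals, discount=0.5)
# Plain DQN would take the maximum of the target network instead.
plain_targets = rewards + 0.5 * q_target_next.max(axis=1)
```

In this hand-made batch the Double DQN targets are lower than the plain DQN targets, which is the effect that counters overestimation.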

---

### Multi-Step DQN

Normally, DQN bootstraps after a single step and uses a single reward. In multi-step DQN, the rewards of *n* consecutive steps are accumulated before bootstrapping from the state *n* steps ahead. This leads to significantly faster learning:

>$(R_{t}^{(n)} + \gamma_{t}^{(n)}\max_{a'} q_{\bar{\theta}}(S_{t+n},a')-q_{\theta}(S_{t},A_{t}))^2$
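
A small sketch of the multi-step quantities follows. It mirrors what `_discounted_rewards` and the `discounts_ph` feed compute in the modified DQN code further down in this notebook; the function names and numbers here are assumptions for illustration.

```python
def n_step_return(rewards, discount=0.99):
    """Discounted sum of the n collected rewards, i.e. R_t^(n)."""
    return sum(reward * discount ** k for k, reward in enumerate(rewards))

def n_step_target(rewards, bootstrap_value, discount=0.99):
    """n-step return plus the bootstrap value discounted by discount**n."""
    return n_step_return(rewards, discount) + discount ** len(rewards) * bootstrap_value

# Three rewards collected with n_step=3 and a hypothetical discount of 0.5.
rewards = [1.0, 0.0, 2.0]
ret = n_step_return(rewards, discount=0.5)
target = n_step_target(rewards, bootstrap_value=8.0, discount=0.5)
```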

---

### Prioritized Replay Buffer

By default, transitions are sampled uniformly at random from the replay buffer. A prioritized replay buffer instead samples transitions with a high Q-loss more often, as these are the transitions from which the most can be learned.
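
The following sketch shows how error terms could be turned into sampling probabilities and importance weights. It follows the same formulas as the `ModifiedPrioritizedReplayBuffer` further down (error plus epsilon raised to alpha, weights raised to minus beta and normalized by the largest weight), but the function names and example numbers are assumptions.

```python
import numpy as np

def sampling_probabilities(errors, alpha=0.5, epsilon=0.1):
    """Probability of each transition, proportional to (error + epsilon) ** alpha."""
    priorities = (np.asarray(errors, dtype=float) + epsilon) ** alpha
    return priorities / priorities.sum()

def importance_weights(probs, beta=0.4):
    """Weights that correct for non-uniform sampling, scaled so the maximum is 1."""
    size = len(probs)
    weights = (probs * size) ** (-beta)
    return weights / weights.max()

probs = sampling_probabilities([1.0, 1.0, 2.0], alpha=1.0, epsilon=0.0)
weights = importance_weights(probs, beta=1.0)
# alpha = 0 ignores the errors and samples uniformly.
uniform = sampling_probabilities([1.0, 5.0, 9.0], alpha=0.0)
```

Frequently sampled transitions get a smaller weight, so their larger share of the batches does not bias the gradient.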

---

### Dueling Architecture

There are two streams in the dueling architecture. One stream calculates the value of a state $V(s)$, while the other stream calculates the advantage of an action over the other actions in a specific state $A(s,a)$. This way the value of a state can be learned without having to learn the effect of every action, which helps in states where no action has an impact on the game.
The two streams are later combined in a special aggregation layer to get an estimate of $Q(s,a)$. A simple addition of the two values wouldn't suffice, as $V(s)$ and $A(s,a)$ themselves could then not be identified during backpropagation. Subtracting the average advantage over all actions of the specific state avoids this problem.
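
The aggregation step can be sketched in a few lines of NumPy. The function name `dueling_q_values` and the numbers are assumptions for illustration.

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) by subtracting the mean advantage."""
    advantages = np.asarray(advantages, dtype=float)
    return state_value + advantages - advantages.mean()

q_values = dueling_q_values(5.0, [1.0, 3.0, 2.0])
# Shifting all advantages by a constant does not change the result,
# which is what makes V(s) and A(s, a) identifiable.
q_shifted = dueling_q_values(5.0, [11.0, 13.0, 12.0])
```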

---

### Distributional RL

Normally, the expected (average) Q-value is used as the target. A single average is not very accurate, however, because the return can vary widely between situations. Distributional reinforcement learning instead learns the full distribution of returns. The resulting loss is the *Kullback-Leibler* divergence.
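
As a sketch of how such a distribution is represented, the snippet below builds a fixed support of atoms from `num_atoms`, `v_min` and `v_max` (the default values listed in the hyperparameter section of this notebook) and recovers an expected Q-value per action. The probability arrays are made up for illustration, and the function name `expected_q` is an assumption.

```python
import numpy as np

num_atoms, v_min, v_max = 51, -10, 10
# Evenly spaced return values ("atoms") between v_min and v_max.
support = np.linspace(v_min, v_max, num_atoms)

def expected_q(probabilities):
    """Expected return per action from (num_actions, num_atoms) probabilities."""
    return probabilities @ support

# Two hypothetical actions: one with a uniform distribution over the atoms,
# one that puts all probability mass on the highest atom.
probabilities = np.zeros((2, num_atoms))
probabilities[0] = 1.0 / num_atoms
probabilities[1, -1] = 1.0

q_values = expected_q(probabilities)
best_action = int(np.argmax(q_values))
```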

---

### Noisy Nets

Noisy Nets can replace the ϵ-greedy strategy for action selection. Noise with a learnable scale is added to all linear layers, so the amount of exploration adapts automatically: during training the agent can learn to ignore the noise.
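
A noisy linear layer can be sketched as below. This is a simplified version with independent Gaussian noise per weight, written in NumPy for illustration; it is an assumption and not the layer used by Anyrl-py. In a real network `mu_*` and `sigma_*` would be trainable parameters.

```python
import numpy as np

def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    """Linear layer whose weights and biases are perturbed by Gaussian noise."""
    eps_w = rng.standard_normal(mu_w.shape)
    eps_b = rng.standard_normal(mu_b.shape)
    weights = mu_w + sigma_w * eps_w
    bias = mu_b + sigma_b * eps_b
    return x @ weights + bias

rng = np.random.default_rng(0)
x = np.array([[1.0, 2.0]])
mu_w = np.eye(2)
mu_b = np.array([0.5, -0.5])

# With sigma = 0 the layer behaves like an ordinary linear layer.
quiet = noisy_linear(x, mu_w, np.zeros((2, 2)), mu_b, np.zeros(2), rng)
# With sigma > 0 the output is perturbed, which drives exploration.
noisy = noisy_linear(x, mu_w, 0.5 * np.ones((2, 2)), mu_b, 0.5 * np.ones(2), rng)
```

Driving `sigma_*` towards zero is how the agent can "learn to ignore the noise".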

## Rainbow Agent

The Rainbow Agent combines all 6 variants and extensions of DQN, creating a new state-of-the-art agent which plays Atari games better than any sub-combination of the extensions. The agent replaces the 1-step distributional loss with a multi-step variant and combines this with Double DQN. For the replay buffer a proportional prioritized replay is used, which prioritizes transitions by the *KL* loss. The dueling network is adapted for use with return distributions. Finally, all linear layers are replaced with their noisy equivalents.
<br>
The following graph shows the improvement of the scores in Atari games.

![alt text](https://media.arxiv-vanity.com/render-output/1731808/x1.png)

<br>



# Settings


When using the Rainbow Agent a lot of hyperparameters need to be set. 

## General Parameters

This section is for general settings like the save path and which game to train.
<br>

*  **RUN** - identifier of the overall run; used as a sub-folder of **path**
*  **RUN_RESTORE** - the sub-run number to restore from
*  **RUN_NUM** - number of the current sub-run, used when one big run is split into several training sessions
*  **path** - path to save data to
*  **restore** - whether to load the saved model and transitions
*  **game** - game to train
*  **workers** - number of workers (parallel environments) to run

In [0]:
#@title Organisation settings

RUN='1' #@param {type: "string"}
RUN_RESTORE='1_1' #@param {type: "string"}
RUN_NUM='1_1' #@param {type: "string"}
path='/tmp/anyrl_rainbow' #@param {type: "string"}
restore=False #@param {type: "boolean"}



In [0]:
#@title Training settings

game='SpaceInvaders' #@param
workers=8 #@param {type: "integer"}










## Hyperparameters

The next section is for setting the hyperparameters. The parameters are grouped by the extension of the Rainbow algorithm they belong to, for a clearer overview.


The default values of the Rainbow Agent are:
```
lr=6.25e-5 
num_steps=6000000 
train_interval=4 
batch_size=32 
target_interval=8192 
num_atoms=51 
v_min=-10 
v_max=10 
n_step=3 
min_buffer_size=20000 
buffer_size=1000000 
replay_epsilon=0.1 
replay_alpha=0.5 
replay_beta=0.4 
```





### Training Parameters
The Rainbow Agent uses the Adam optimizer (`AdamOptimizer`).
* **lr**  - learning rate
* **num_steps** - number of steps to train for
* **train_interval** - number of steps between training updates; with the default value the network learns every 4 steps
* **batch_size** - number of transitions sampled from the replay buffer per training update
* **target_interval** - number of steps between copies of the online network into the target network

In [0]:
lr=6.25e-5 #@param {type:"number"}
num_steps=6000000 #@param {type: "integer"}
train_interval=4 #@param {type: "integer"}
batch_size=32 #@param {type: "integer"}
target_interval=8192 #@param {type: "integer"}

### Distributional Parameters
*  **num_atoms** - number of atoms in the distribution
*  **v_min** - lowest value of the distribution
*  **v_max** - highest value of the distribution

In [0]:

num_atoms=51 #@param {type: "integer"}
v_min=-10 #@param {type: "integer"}
v_max=10 #@param {type: "integer"}

### Multi-Step Parameters


*   **n_step** - amount of steps to take in multi-step



In [0]:
n_step=3 #@param {type: "integer"}

### Replay Buffer Parameters
Learning is supposed to start after 80000 frames have passed and their transitions have been saved. As the input of the network is four frames at once, one step equals four frames, so training can start after 20000 steps. The buffer_size is normally 1000000, but it was reduced here because a buffer of that size is expensive to save.

*   **min_buffer_size** - minimum number of stored transitions before learning starts
*   **buffer_size** - size of the buffer that saves the transitions
*   **replay_epsilon** - value added to every error term
*   **replay_alpha** - controls the temperature. Higher values result in more prioritization. A value of 0 yields uniform prioritization
*   **replay_beta** - controls the amount of importance sampling. A value of 1 yields unbiased sampling. A value of 0 yields no importance sampling

In [0]:
min_buffer_size=20000 #@param {type: "integer"}
buffer_size=500000 #@param {type: "integer"}
replay_epsilon=0.1 #@param {type:"number"}
replay_alpha=0.5 #@param {type:"number"}
replay_beta=0.4 #@param {type:"number"}

# Modified Code

The replay buffer of the framework was modified so that the transitions in the replay buffer, whose size is specified in the hyperparameter section, can be saved to disk. This way the training can be continued at a later time.
The original buffer can be found at:
https://github.com/unixpickle/anyrl-py/blob/master/anyrl/rollouts/replay.py


In [0]:
#@title Modified Replay Buffer Code
"""
Various replay buffer implementations.
"""

from math import sqrt
import random
import pickle
from anyrl.rollouts import ReplayBuffer
import numpy as np

class ModifiedPrioritizedReplayBuffer(ReplayBuffer):
    """
    A prioritized replay buffer with loss-proportional
    sampling.
    Weights passed to add_sample() and update_weights()
    are assumed to be error terms (e.g. the absolute TD
    error).
    """

    def __init__(self, capacity, alpha, beta, first_max=1, epsilon=0):
        """
        Create a prioritized replay buffer.
        The beta parameter can be any object that has
        support for the float() built-in.
        This way, you can use a TFScheduleValue.
        Args:
          capacity: the maximum number of transitions to
            store in the buffer.
          alpha: an exponent controlling the temperature.
            Higher values result in more prioritization.
            A value of 0 yields uniform prioritization.
          beta: an exponent controlling the amount of
            importance sampling. A value of 1 yields
            unbiased sampling. A value of 0 yields no
            importance sampling.
          first_max: the initial weight for new samples
            when no init_weight is specified and the
            buffer is completely empty.
          epsilon: a value which is added to every error
            term before the error term is used.
        """
        self.capacity = capacity
        self.alpha = alpha
        self.beta = beta
        self.epsilon = epsilon
        self.transitions = []
        self.errors = FloatBuffer(capacity)
        self._max_weight_arg = first_max

    @property
    def size(self):
        return len(self.transitions)

    def sample(self, num_samples):
        indices, probs = self.errors.sample(num_samples)
        beta = float(self.beta)
        importance_weights = np.power(probs * self.size, -beta)
        importance_weights /= np.power(self.errors.min() / self.errors.sum() * self.size, -beta)
        samples = []
        for i, weight in zip(indices, importance_weights):
            sample = self.transitions[i].copy()
            sample['weight'] = weight
            sample['id'] = i
            samples.append(sample)
        return samples

    def add_sample(self, sample, init_weight=None):
        """
        Add a sample to the buffer.
        When new samples are added without an explicit
        initial weight, the maximum weight argument ever
        seen is used. When the buffer is empty, first_max
        is used.
        """
        self.transitions.append(sample)
        if init_weight is None:
            self.errors.append(self._process_weight(self._max_weight_arg))
        else:
            self.errors.append(self._process_weight(init_weight))
        while len(self.transitions) > self.capacity:
            del self.transitions[0]

    def update_weights(self, samples, new_weights):
        for sample, weight in zip(samples, new_weights):
            self.errors.set_value(sample['id'], self._process_weight(weight))

    def _process_weight(self, weight):
        self._max_weight_arg = max(self._max_weight_arg, weight)
        return (weight + self.epsilon) ** self.alpha

    """
    The two functions were added to save the transitions
    """
    #saving replay buffer to pickle file
    def save_samples(self):
        print("saving transitions: ",len(self.transitions))
        with open(path+'/'+RUN+'/transitions_'+RUN_NUM+'.p','wb') as handle:
            pickle.dump(self.transitions,handle,protocol=pickle.HIGHEST_PROTOCOL)

    #loading replay buffer from pickle file
    #have to add sample by sample due to populating the errors into the FloatBuffer (self.errors)
    def load_samples(self):
        # use a context manager so the pickle file is closed after loading
        with open(path+'/'+RUN+"/transitions_"+RUN_RESTORE+".p","rb") as handle:
            temp_trans = pickle.load(handle)
        for i in temp_trans:
            self.add_sample(i)
        print("loaded transitions: ",len(self.transitions))

class FloatBuffer:
    """A ring-buffer of floating point values."""

    def __init__(self, capacity, dtype='float64'):
        self._capacity = capacity
        self._start = 0
        self._used = 0
        self._buffer = np.zeros((capacity,), dtype=dtype)
        self._bin_size = int(sqrt(capacity))
        num_bins = capacity // self._bin_size
        if num_bins * self._bin_size < capacity:
            num_bins += 1
        self._bin_sums = np.zeros((num_bins,), dtype=dtype)
        self._min = 0

    def append(self, value):
        """
        Add a value to the end of the buffer.
        If the buffer is full, the first value is removed.
        """
        idx = (self._start + self._used) % self._capacity
        if self._used < self._capacity:
            self._used += 1
        else:
            self._start = (self._start + 1) % self._capacity
        self._set_idx(idx, value)

    def sample(self, num_values):
        """
        Sample indices in proportion to their value.
        Returns:
          A tuple (indices, probs)
        """
        assert self._used >= num_values
        res = []
        probs = []
        bin_probs = self._bin_sums / np.sum(self._bin_sums)
        while len(res) < num_values:
            bin_idx = np.random.choice(len(self._bin_sums), p=bin_probs)
            bin_values = self._bin(bin_idx)
            sub_probs = bin_values / np.sum(bin_values)
            sub_idx = np.random.choice(len(bin_values), p=sub_probs)
            idx = bin_idx * self._bin_size + sub_idx
            res.append(idx)
            probs.append(bin_probs[bin_idx] * sub_probs[sub_idx])
        return (np.array(list(res)) - self._start) % self._capacity, np.array(probs)

    def set_value(self, idx, value):
        """Set the value at the given index."""
        idx = (idx + self._start) % self._capacity
        self._set_idx(idx, value)

    def min(self):
        """Get the minimum value in the buffer."""
        return self._min

    def sum(self):
        """Get the sum of the values in the buffer."""
        return np.sum(self._bin_sums)

    def _set_idx(self, idx, value):
        assert not np.isnan(value)
        assert value > 0
        needs_recompute = False
        if self._min == self._buffer[idx]:
            needs_recompute = True
        elif value < self._min:
            self._min = value
        bin_idx = idx // self._bin_size
        self._buffer[idx] = value
        self._bin_sums[bin_idx] = np.sum(self._bin(bin_idx))
        if needs_recompute:
            self._recompute_min()

    def _bin(self, bin_idx):
        if bin_idx == len(self._bin_sums) - 1:
            return self._buffer[self._bin_size * bin_idx:]
        return self._buffer[self._bin_size * bin_idx:self._bin_size * (bin_idx + 1)]

    def _recompute_min(self):
        if self._used < self._capacity:
            self._min = np.min(self._buffer[:self._used])
        else:
            self._min = np.min(self._buffer)

In the training loop, a code section was added to record the values of different parameters while training, specifically the discount, weight, losses and rewards.
The original code can be found at:
https://github.com/unixpickle/anyrl-py/blob/master/anyrl/algos/dqn.py

In [0]:
#@title Modified DQN Code
%tensorflow_version 1.x
import time

import tensorflow as tf


class DQN:
    """
    Train TFQNetwork models using Q-learning.
    """

    def __init__(self, online_net, target_net, discount=0.99):
        """
        Create a Q-learning session.
        Args:
          online_net: the online TFQNetwork.
          target_net: the target TFQNetwork.
          discount: the per-step discount factor.
        """
        self.online_net = online_net
        self.target_net = target_net
        self.discount = discount

        obs_shape = (None,) + online_net.obs_vectorizer.out_shape
        self.obses_ph = tf.placeholder(online_net.input_dtype, shape=obs_shape)
        self.actions_ph = tf.placeholder(tf.int32, shape=(None,))
        self.rews_ph = tf.placeholder(tf.float32, shape=(None,))
        self.new_obses_ph = tf.placeholder(online_net.input_dtype, shape=obs_shape)
        self.terminals_ph = tf.placeholder(tf.bool, shape=(None,))
        self.discounts_ph = tf.placeholder(tf.float32, shape=(None,))
        self.weights_ph = tf.placeholder(tf.float32, shape=(None,))

        losses = online_net.transition_loss(target_net, self.obses_ph, self.actions_ph,
                                            self.rews_ph, self.new_obses_ph, self.terminals_ph,
                                            self.discounts_ph)
        self.losses = self.weights_ph * losses
        self.loss = tf.reduce_mean(self.losses)

        assigns = []
        for dst, src in zip(target_net.variables, online_net.variables):
            assigns.append(tf.assign(dst, src))
        self.update_target = tf.group(*assigns)

    def feed_dict(self, transitions):
        """
        Generate a feed_dict that feeds the batch of
        transitions to the DQN loss terms.
        Args:
          transition: a sequence of transition dicts, as
            defined in anyrl.rollouts.ReplayBuffer.
        Returns:
          A dict which can be fed to tf.Session.run().
        """
        obs_vect = self.online_net.obs_vectorizer
        res = {
            self.obses_ph: obs_vect.to_vecs([t['obs'] for t in transitions]),
            self.actions_ph: [t['model_outs']['actions'][0] for t in transitions],
            self.rews_ph: [self._discounted_rewards(t['rewards']) for t in transitions],
            self.terminals_ph: [t['new_obs'] is None for t in transitions],
            self.discounts_ph: [(self.discount ** len(t['rewards'])) for t in transitions],
            self.weights_ph: [t['weight'] for t in transitions]
        }
        new_obses = []
        for trans in transitions:
            if trans['new_obs'] is None:
                new_obses.append(trans['obs'])
            else:
                new_obses.append(trans['new_obs'])
        res[self.new_obses_ph] = obs_vect.to_vecs(new_obses)
        return res

    def optimize(self, learning_rate=6.25e-5, epsilon=1.5e-4, **adam_kwargs):
        """
        Create a TF Op that optimizes the objective.
        Args:
          learning_rate: the Adam learning rate.
          epsilon: the Adam epsilon.
        """
        optim = tf.train.AdamOptimizer(learning_rate=learning_rate, epsilon=epsilon, **adam_kwargs)
        return optim.minimize(self.loss)

    def train(self,
              num_steps,
              player,
              replay_buffer,
              optimize_op,
              train_interval=1,
              target_interval=8192,
              batch_size=32,
              min_buffer_size=20000,
              tf_schedules=(),
              handle_ep=lambda steps, rew, ep_id: None,  # called below with three arguments
              timeout=None):
        """
        Run an automated training loop.
        This is meant to provide a convenient way to run a
        standard training loop without any modifications.
        You may get more flexibility by writing your own
        training loop.
        Args:
          num_steps: the number of timesteps to run.
          player: the Player for gathering experience.
          replay_buffer: the ReplayBuffer for experience.
          optimize_op: a TF Op to optimize the model.
          train_interval: timesteps per training step.
          target_interval: number of timesteps between
            target network updates.
          batch_size: the size of experience mini-batches.
          min_buffer_size: minimum replay buffer size
            before training is performed.
          tf_schedules: a sequence of TFSchedules that are
            updated with the number of steps taken.
          handle_ep: called with information about every
            completed episode.
          timeout: if set, this is a number of seconds
            after which the training loop should exit.
        """
        
        
        sess = self.online_net.session
        
        """
        This section was added to record the values
        """
        tnrewsph = tf.summary.scalar(name='rews_ph', tensor=tf.reduce_mean(self.rews_ph))
        tndiscountsph = tf.summary.scalar(name='discounts_ph', tensor=tf.reduce_mean(self.discounts_ph))
        tnweightsph = tf.summary.scalar(name='weights_ph', tensor=tf.reduce_mean(self.weights_ph))
        tnlosses = tf.summary.scalar(name='losses', tensor=tf.reduce_mean(self.losses))
        merge = tf.summary.merge([tnrewsph,tndiscountsph,tnweightsph,tnlosses])
        train_writer = tf.summary.FileWriter( path+'/'+RUN+'/logs/'+ RUN_NUM +'/train', sess.graph)
        
        
        sess.run(self.update_target)
        steps_taken = 0
        next_target_update = target_interval
        next_train_step = train_interval
        start_time = time.time()

        if restore:
          replay_buffer.load_samples()

        while steps_taken < num_steps:
            if timeout is not None and time.time() - start_time > timeout:
                return
            transitions = player.play()
            for trans in transitions:
                if trans['is_last']:
                    handle_ep(trans['episode_step'] + 1, trans['total_reward'], trans['episode_id'])
                replay_buffer.add_sample(trans)
                steps_taken += 1
                for sched in tf_schedules:
                    sched.add_time(sess, 1)
                if replay_buffer.size >= min_buffer_size and steps_taken >= next_train_step:
                    next_train_step = steps_taken + train_interval
                    batch = replay_buffer.sample(batch_size)
                    
                    _, losses, summary = sess.run((optimize_op, self.losses, merge),
                                         feed_dict=self.feed_dict(batch))
                    
                    train_writer.add_summary(summary, steps_taken)
                    replay_buffer.update_weights(batch, losses)
                    
                if steps_taken >= next_target_update:
                    next_target_update = steps_taken + target_interval
                    sess.run(self.update_target)
                if (steps_taken % 100000 ==0):
                   replay_buffer.save_samples()
        replay_buffer.save_samples()

    def _discounted_rewards(self, rews):
        res = 0
        for i, rew in enumerate(rews):
            res += rew * (self.discount ** i)
        return res

# Training

The training loop was modified to allow recording of videos while training, as well as saving the parameters and rewards of each episode in the training process. An output is printed every 10 episodes with the episode id and the mean reward of the last 10 episodes.

In [5]:
#@title DQN Train
%tensorflow_version 1.x

import tensorflow as tf
import os
import time
from gym.wrappers import Monitor
from functools import partial
from anyrl.envs import batched_gym_env,BatchedGymEnv
from anyrl.envs.wrappers import BatchedFrameStack, DownsampleEnv, GrayscaleEnv
from anyrl.models import rainbow_models
from anyrl.rollouts import BatchedPlayer, NStepPlayer
from anyrl.spaces import gym_space_vectorizer
from anyrl.utils import tf_state
import gym
import numpy as np

REWARD_HISTORY = 10

def wrap_env(env):
  env = Monitor(env, path+'/'+RUN+'/videos/'+RUN_NUM, force=True )
  return env

def make_env():
    """
    Create an environment with some standard wrappers.
    """
    env = wrap_env(gym.make(game+'-v0'))
    env = GrayscaleEnv(DownsampleEnv(env, 2))
    return env

def main():

    env = batched_gym_env([partial(make_env)]* workers)
    env = BatchedFrameStack(env, num_images=4, concat=False)

    checkpoint_dir = os.path.join(os.getcwd(), path+'/'+RUN+'/checkpoints/'+RUN_NUM)
    results_dir = os.path.join(os.getcwd(), path+'/'+RUN+'/results/', time.strftime("%d-%m-%Y_%H-%M-%S__")+ RUN_NUM)
    if not os.path.exists(results_dir):
        os.makedirs(results_dir)
    summary_writer = tf.summary.FileWriter(results_dir)

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True # pylint: disable=E1101
    
    
    with tf.Session(config=config) as sess:  
        
        dqn=DQN(*rainbow_models(sess,
                              env.action_space.n,
                              gym_space_vectorizer(env.observation_space),
                              num_atoms=num_atoms,
                              min_val=v_min,
                              max_val=v_max))

        player = NStepPlayer(BatchedPlayer(env, dqn.online_net), n_step)
        optimize = dqn.optimize(learning_rate=lr)
        sess.run(tf.global_variables_initializer())

        if(restore):
          tf_state.load_vars(sess,path+"/"+RUN+"/anyrlModel_"+RUN_RESTORE)

        reward_hist = []
        total_steps = 0
        save_model_steps=100000
        episodes=0

        def _handle_ep(steps, rew, id):
            nonlocal total_steps
            nonlocal episodes
            nonlocal save_model_steps
            total_steps += steps
            episodes += 1
            reward_hist.append(rew)

            summary_reward = tf.Summary()
            summary_reward.value.add(tag='global/reward', simple_value=rew)
            summary_writer.add_summary(summary_reward, global_step=total_steps)

            if len(reward_hist) == REWARD_HISTORY:
              print('ID: %d | %d steps | %f mean' % (id,total_steps, sum(reward_hist) / len(reward_hist)))
            
              summary_meanreward = tf.Summary()
              summary_meanreward.value.add(tag='global/mean_reward', simple_value=sum(reward_hist) / len(reward_hist))
              summary_writer.add_summary(summary_meanreward, global_step=total_steps)

              reward_hist.clear()
            if(total_steps>= save_model_steps):
              save_model_steps +=100000
              print('save model')
              tf_state.save_vars(sess,path+"/"+RUN+"/anyrlModel_"+RUN_NUM)
                 
        
        dqn.train(num_steps=num_steps, 
                  player=player,
                  replay_buffer=ModifiedPrioritizedReplayBuffer(buffer_size, replay_alpha, replay_beta, epsilon=replay_epsilon),
                  optimize_op=optimize,
                  train_interval=train_interval,
                  target_interval=target_interval,
                  batch_size=batch_size,
                  min_buffer_size=min_buffer_size,
                  handle_ep=_handle_ep)
        
        print('save model')
        tf_state.save_vars(sess,path+"/"+RUN+"/anyrlModel_"+RUN_NUM)
    env.close()
    

if __name__ == '__main__':
      main()






ID: 8 | 6410 steps | 151.000000 mean
ID: 14 | 14021 steps | 200.500000 mean
ID: 26 | 22540 steps | 221.000000 mean
ID: 39 | 30789 steps | 240.000000 mean
ID: 50 | 39337 steps | 212.500000 mean
ID: 56 | 48636 steps | 314.500000 mean
ID: 70 | 56106 steps | 175.500000 mean
ID: 79 | 63941 steps | 215.500000 mean
ID: 90 | 70998 steps | 159.000000 mean
ID: 100 | 78276 steps | 195.500000 mean
ID: 106 | 85406 steps | 186.000000 mean
ID: 120 | 91971 steps | 162.000000 mean
saving transitions:  100000
saving transitions:  100000
save model


In [10]:
#@title Starting Tensorboard for the logs

LOG_DIR = path+'/'+RUN+'/logs/'
get_ipython().system_raw(
    'tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'
    .format(LOG_DIR)
)
get_ipython().system_raw('./ngrok http 6006 &')
!curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

https://dbe3063a.ngrok.io


Sometimes the ngrok link won't open. If that happens try running:

```
!tensorboard --logdir {LOG_DIR}
```

And click on the previous link again.

In [14]:
#@title Show the newest video of current run

from IPython.display import HTML
from base64 import b64encode
import glob
import os

list_of_files = glob.glob(path+'/'+RUN+'/videos/'+RUN_NUM+'/*.mp4') # * means all if need specific format then *.csv
latest_file = max(list_of_files, key=os.path.getctime)
print(latest_file)

mp4 = open(latest_file,'rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=400 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)

/tmp/anyrl_rainbow/1/videos/1_1/openaigym.video.0.413.video000008.mp4


# Links


**Q-learning and DQN**
* https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/
* https://www.freecodecamp.org/news/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8/
* https://www.freecodecamp.org/news/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682/
* https://medium.com/@jonathan_hui/rl-dqn-deep-q-network-e207751f7ae4

* https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf

**Rainbow Agent**

* https://arxiv.org/pdf/1710.02298.pdf
* https://medium.com/intelligentunit/conquering-openai-retro-contest-2-demystifying-rainbow-baseline-9d8dd258e74b