<a href="https://colab.research.google.com/github/NicMaq/Reinforcement-Learning/blob/master/Breakout_explained.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learning to play Breakout with a DQN 


Breakout was an arcade game developed and published by Atari on May 13, 1976. To play: a layer of bricks lines the top third of the screen, and the goal is to destroy them all! The following animated GIF was captured during one of our eval sessions. Execute the following cells and you'll train a network to do the same.

This Google Colab was written to support my post: Best practices for Reinforcement Learning XXX.


---


![alt text](http://www.modelfit.us/uploads/7/6/0/6/76068583/gymbreakout-20200617223846-7912_orig.gif)










# Imports and global constants

First we import the required packages and create a few global constants (size of the image sent to the DQN, number of frames to process, number of actions, ...), followed by the hyperparameters (learning rate, discount factor, epsilon for the e-greedy policy). These are the values to fine-tune if you want to improve performance.


I am running tensorflow 2.x and numpy 1.18.5.
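
To make the role of the epsilon constants concrete, here is a minimal sketch of the linear annealing schedule they describe, in plain Python. The helper `epsilon_at` is hypothetical (it is not part of the notebook) and idealized: the agent in this notebook decays epsilon one step at a time and only compares EXPLORE_STEPS with the step count known at the start of each game.

```python
# Values copied from the constants cell of this notebook.
MIN_EPSILON = 0.1
MAX_EPSILON = 1.0
EXPLORE_STEPS = 300000
ANNEALING_STEPS = 900000


def epsilon_at(global_step):
    """Idealized epsilon after `global_step` environment steps (hypothetical helper)."""
    if global_step <= EXPLORE_STEPS:
        return MAX_EPSILON  # pure exploration phase: no decay yet
    decay = (MAX_EPSILON - MIN_EPSILON) / ANNEALING_STEPS
    annealed_steps = global_step - EXPLORE_STEPS
    return max(MIN_EPSILON, MAX_EPSILON - decay * annealed_steps)


for step in (0, 300000, 750000, 1200000, 2500000):
    print(step, round(epsilon_at(step), 3))
```

Epsilon stays at MAX_EPSILON during the exploration phase, then decreases linearly and is clamped at MIN_EPSILON for the rest of training.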


In [None]:
%tensorflow_version 2.x

import time
from datetime import datetime
import gym
import tensorflow as tf
import numpy as np
import math
import sys
from skimage.transform import rescale, resize, downscale_local_mean
import imageio

In [None]:
# Global constants
MAX_STEPS = 2500000 
EVAL_STEPS = 200000 # Evaluate the model every EVAL_STEPS frames
EVAL_GAMES = 100     # For EVAL_GAMES games
MINI_BATCH_SIZE = 32 
MAX_SAMPLES = 1000000
IMG_HEIGHT = 84
IMG_WIDTH = 84

# Policy
# qlearning e-greedy = 0 ; expected sarsa e-greedy = 1 ; expected sarsa softmax = 2
POLICY = 0

# NUM_ACTIONS 
ACTIONS = {
    0: "NOOP",
    1: "FIRE",
    2: "RIGHT",
    3: "LEFT",
}
NUM_ACTIONS = len(ACTIONS)

# Epsilon = Greedy Policy
MIN_EPSILON = 0.1
MAX_EPSILON = 1 
EVAL_EPSILON = 0.0
EXPLORE_STEPS = 300000
ANNEALING_STEPS = 900000

# Tau = Softmax Policy
TAU = 0.00005 

# Network update
MODELUPDATE_TRAIN_STEPS = 5000
START_LEARNING = 50000
UPDATE_FREQ = 2
REPEAT_ACTION = 1 
NO_OP_MAX = 0

# Save model
SAVEMODEL_STEPS = 1000000 

# Learning rate (alpha) and Discount factor (gamma) 
ALPHA = 0.00001
GAMMA = 0.99

# Epochs for training the DNN - how many mini-batches are sent for training at each step (2 = 2 gradient descent updates per step)
EPOCHS = 1 
 
# Directories
SAVE_DIR = 'models/GymBreakout/ExpectedSarsa'
ROOT_TF_LOG = 'tf_logs'

#GPU CPU - Use Argparse to modify this 
USE_DEVICE = '/GPU:0' 
USE_CPU = '/CPU:0'
RENDER = False

# The agent and the environment

One of the key ideas in Reinforcement Learning is that an agent learns from its interactions with an environment. The agent is the learner and the decision maker. For each action the agent takes, it receives from the environment the new state (a 210*160 RGB image) and the reward (the points scored for the bricks we removed). We will create the environment later in this notebook using the OpenAI Gym environment: https://gym.openai.com/envs/Breakout-v0/

<br><br>
![alt text](http://www.modelfit.us/uploads/7/6/0/6/76068583/screen-shot-2020-06-17-at-5-33-36-pm_orig.png)
<br><br>

In this notebook, we will use one of the central ideas in Reinforcement Learning: Temporal Difference learning. We will implement QLearning, which is a special case of Expected Sarsa. In the Python version of this notebook (XXX), I implemented both Expected Sarsa and QLearning and designed two policies: e-greedy and softmax. To keep it simple, in this notebook we'll use an e-greedy policy.
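
To see why QLearning is a special case of Expected Sarsa, here is a small sketch in plain Python. The helper `expected_value_egreedy` is hypothetical and written for this illustration only; it computes the expected next-state value under an e-greedy policy (ignoring ties), and the action-values are made up.

```python
def expected_value_egreedy(q_values, epsilon):
    """Expected Q-value of the next state under an e-greedy policy.

    With probability epsilon the action is uniform over all actions,
    otherwise it is the greedy one (ties are ignored in this sketch).
    """
    n = len(q_values)
    greedy_q = max(q_values)
    uniform_part = epsilon * sum(q_values) / n
    return uniform_part + (1.0 - epsilon) * greedy_q


next_q = [0.2, 1.0, -0.5, 0.3]  # made-up action-values for one next state

# Expected Sarsa target value for an exploring policy.
print(expected_value_egreedy(next_q, epsilon=0.1))

# With epsilon = 0 the target policy is greedy and the expectation
# collapses to max_a Q(s', a), which is the QLearning target.
print(expected_value_egreedy(next_q, epsilon=0.0) == max(next_q))
```

Expected Sarsa averages the next action-values under the target policy; when that policy is greedy, the average is just the maximum.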

We are implementing the same network and strategies as in the DeepMind paper: https://www.nature.com/articles/nature14236.

The agent uses two networks: one (model) approximates the action-value function of the current state and is used to take actions; the other (target_model) provides the next-state action-values used to calculate the TD error. The rationale is to support convergence by not chasing our own tail.

We will also use an experience memory to feed random samples into the network at each step, but we'll discuss this in a later cell.

The run_training function (defined a few cells below) orchestrates training and evaluation. It asks the agent to start a game (def play_game), to decide the best action under the current policy (def choose_action), to calculate the TD error, and to back-propagate the gradients (def calculate_target_and_train). Then, every EVAL_STEPS frames, run_training evaluates the network's performance.

I implemented the following pseudo code as found in the Reinforcement Learning book by [Richard S. Sutton and Andrew G. Barto](https://www.amazon.com/Reinforcement-Learning-Introduction-Adaptive-Computation/dp/0262039249). I really encourage you to read (and reread) this book.
<br><br>
![alt text](http://www.modelfit.us/uploads/7/6/0/6/76068583/screen-shot-2020-06-18-at-3-56-38-pm_orig.png)

<br>

**The two critical parts: 1. The policies and 2. the Bellman equation.**

In GYM_ACROBOT.py, I offer the option to use two policies: the e-greedy and the softmax policies. These policies are discussed in a separate colab notebook: [e-greedy and softmax explained](https://colab.research.google.com/drive/1--qFcl5QuTuudC-yYcE1odKx_htui4h6?usp=sharing). QLearning is pretty straightforward as it uses a greedy policy for the TD update and an e-greedy policy to select the action to play.
In XXX GYM_ACROBOT.py and XXX GYM_SPACE_INVADERS.py, I implemented ties management for the e-greedy policy. I often turn it off because I noticed that for ACROBOT and BREAKOUT, ties rarely happen (but they do happen).
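
As an illustration of what ties management means, here is a minimal sketch of an e-greedy selection that breaks ties at random, in plain Python. The function `egreedy_action` is hypothetical and is not the implementation used in the scripts mentioned above; the action-values are made up.

```python
import random


def egreedy_action(q_values, epsilon, rng=random):
    """Pick an action e-greedily, breaking ties between best actions at random."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(best_actions)  # exploit: random choice among the ties


rng = random.Random(0)  # seeded only to make the sketch reproducible
q = [0.1, 0.7, 0.7, 0.2]  # made-up action-values with a tie between 1 and 2
picks = [egreedy_action(q, epsilon=0.0, rng=rng) for _ in range(1000)]
print(sorted(set(picks)))
```

A plain argmax would always return the first of the tied actions, whereas this version spreads the choice over all of them.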
<br><br>
**The Bellman equation**

The [Bellman equation](https://en.wikipedia.org/wiki/Bellman_equation#:~:text=A%20Bellman%20equation%2C%20named%20after,method%20known%20as%20dynamic%20programming.&text=This%20breaks%20a%20dynamic%20optimization,%E2%80%9Cprinciple%20of%20optimality%E2%80%9D%20prescribes.), named after Richard E. Bellman, writes the "value" of a decision problem at a certain point in time in terms of the payoff from some initial choices and the "value" of the remaining decision problem that results from those initial choices. Here is the form used for QLearning:

![alt text](http://www.modelfit.us/uploads/7/6/0/6/76068583/screen-shot-2020-06-18-at-3-18-06-pm_orig.png)
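
In case the image does not load, the QLearning update from the Sutton and Barto book can be written as follows. This is the textbook tabular form, where $\alpha$ is the learning rate and $\gamma$ the discount factor; in a DQN the step is presumably taken by the optimizer on the network weights rather than on a table.

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

The bracketed term is the TD error: the TD target $R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a)$ minus the current estimate $Q(S_t, A_t)$.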

As indicated in the pseudo code, this update is done after every transition from a nonterminal state; this update happens in calculate_target_and_train.
<br>
We start by calculating the next state's action-value with the target network for each element of the batch:

```
next_qsa = self.target_model((batch_next_history, batch_action_all_ones), training=True)
```
Then we take, for each element of the batch, the maximum action-value over all actions a.
```
max_q = tf.math.reduce_max(next_qsa, axis=1, keepdims=True)
```
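The Bellman equation applies to non-terminal states. For a terminal state, the target is simply the reward, so we multiply max_q by batch_terminal, which is 0 for terminal transitions and 1 otherwise (the dataset yields terminals = 1 - dones).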
```
v_next_vect = batch_terminal * max_q
```
Then we calculate the TD target:
```
target_vec = batch_reward + GAMMA * v_next_vect
```
Our goal is to update the action-value of the action that was taken. Therefore we tf.multiply the target_vec by the one-hot encoding of the batch of actions.
```
target_mat = tf.multiply(target_vec, batch_action_one_hot)
```
Now that we have the target action-value, we need the current output of the network.
```
qsa = self.model((batch_history, batch_action_one_hot), training=True)
```
The difference between the two will be used to calculate the loss.
```
qsa_mat = tf.multiply(qsa, batch_action_one_hot)
delta_mat = target_mat - qsa_mat
```
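
The steps above can be checked by hand on a tiny batch. The following sketch redoes them with plain Python lists instead of tensors; all numbers are made up, and the variable names only mirror the snippets for readability.

```python
GAMMA = 0.99  # discount factor, same value as in the constants cell

# One made-up transition per row: Q(s', .) from the target network,
# the reward, the "not terminal" flag and the action that was taken.
next_qsa = [[0.5, 2.0, 1.0, 0.0],
            [1.5, 0.2, 0.3, 0.1]]
batch_reward = [1.0, -1.0]
batch_terminal = [1.0, 0.0]  # 1 = game continues, 0 = terminal transition
batch_action = [3, 0]

# Current network output Q(s, .) for the same two samples (made up).
qsa = [[0.4, 0.1, 0.2, 2.5],
       [0.3, 0.9, 0.8, 0.7]]

delta_mat = []
for q_next, r, not_done, a, q_now in zip(next_qsa, batch_reward,
                                         batch_terminal, batch_action, qsa):
    target = r + GAMMA * not_done * max(q_next)  # TD target
    one_hot = [1.0 if i == a else 0.0 for i in range(len(q_now))]
    target_row = [target * m for m in one_hot]  # one row of target_mat
    qsa_row = [q * m for q, m in zip(q_now, one_hot)]  # one row of qsa_mat
    delta_mat.append([t - q for t, q in zip(target_row, qsa_row)])

for row in delta_mat:
    print(row)
```

Only the column of the action that was taken carries a non-zero difference, and the terminal transition in the second row uses the reward alone as its target.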

We are using the following Huber loss implementation:

```
squared_loss = 0.5 * tf.square(delta_mat)
linear_loss = tf.abs(delta_mat) -0.5
ones = tf.ones_like(delta_mat)
loss_mat = tf.where(tf.greater(linear_loss, ones), x = linear_loss, y = squared_loss)
loss_train = tf.reduce_mean(loss_mat, axis=1, keepdims=True)
```
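
For comparison, here is a sketch of the textbook Huber loss with a threshold of 1, in plain Python. Note that the snippet above compares linear_loss (that is, |delta| - 0.5) with 1, so it switches to the linear branch when |delta| > 1.5 rather than when |delta| > 1; whether this is intended is not stated here. Both helper names below are hypothetical.

```python
def huber(delta, threshold=1.0):
    """Textbook Huber loss: quadratic near zero, linear in the tails."""
    abs_delta = abs(delta)
    if abs_delta <= threshold:
        return 0.5 * delta * delta
    return threshold * (abs_delta - 0.5 * threshold)


def notebook_style(delta):
    """Plain-Python transcription of the tf.where expression above."""
    squared_loss = 0.5 * delta * delta
    linear_loss = abs(delta) - 0.5
    return linear_loss if linear_loss > 1.0 else squared_loss


for d in (0.5, 1.0, 1.2, 1.5, 2.0):
    print(d, huber(d), notebook_style(d))
```

The two versions agree for small and for large errors and differ only for errors between 1 and 1.5 in absolute value.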

We use TensorFlow's tf.GradientTape() to record all useful information from the forward propagation, then compute and apply the gradients:

```
grads = tape.gradient(loss_train, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))  
```




In [None]:
class Agent:

    def __init__(self, env, model, target_model, optimizer, exp_buffer):
        self.env = env
        self.exp_buffer = exp_buffer
        self.model = model
        self.target_model = target_model
        self.optimizer = optimizer
        
        with tf.device(USE_DEVICE):
            self.decay = (MAX_EPSILON-MIN_EPSILON) / ANNEALING_STEPS
            self.epsilon = tf.constant(MAX_EPSILON)
            self.epsilon = tf.cast(self.epsilon, dtype=tf.float32)
            self.min_epsilon = tf.constant(MIN_EPSILON, dtype=tf.float16)
            self.min_epsilon = tf.cast(self.min_epsilon, dtype=tf.float32)
            self.epsilon_evaluation = tf.constant(EVAL_EPSILON, dtype=tf.float16)
            self.epsilon_evaluation = tf.cast(self.epsilon_evaluation, dtype=tf.float32)

            assert self.epsilon.device[-5:].lower() == USE_DEVICE[-5:].lower(), "epsilon not on : %s" % USE_DEVICE
            assert self.min_epsilon.device[-5:].lower() == USE_DEVICE[-5:].lower(), "min_epsilon not on : %s" % USE_DEVICE
            assert self.epsilon_evaluation.device[-5:].lower() == USE_DEVICE[-5:].lower(), "epsilon_evaluation not on : %s" % USE_DEVICE
        
        self._reset()


    def _reset(self):
        self.image = self.env.reset()
        self.state = preprocess(self.image)


    def eval_game(self):

        dead = False
        steps = 0 
        game_reward = 0
        raw_images = []

        self._reset()

        raw_images.append(self.image)
        
        remaining_lives = 5 
        history = np.repeat(self.state, 4, axis=2)
        init_history = history 

        while True:

            # Play next step
            if RENDER: self.env.render()

            if steps % REPEAT_ACTION == 0:
                
                history_foraction =  np.reshape(history, (1, IMG_HEIGHT, IMG_WIDTH,4))

                with tf.device(USE_DEVICE):
                    
                    tf_history = tf.constant(history_foraction)
                    tf_history = tf.cast(tf_history, dtype=tf.float32)

                    assert tf_history.device[-5:].lower() == USE_DEVICE[-5:].lower(), "tf_history not on : %s" % USE_DEVICE
                    
                    if POLICY == 2:
                        action_probs = self.choose_action(tf_history)
                        probs = action_probs.numpy()
                        action = np.random.choice(NUM_ACTIONS, p=probs.squeeze())
                    else:
                        action = self.choose_action(tf_history)
                        action = action.numpy()
            
            # Do the NO_OP actions then fire once 
            if np.all(np.equal(history, init_history)):
                if NO_OP_MAX > 0:
                    no_op = np.random.randint(0, NO_OP_MAX)
                    for op in tf.range(no_op):
                        action = np.random.randint(2, 4) # Select either right or left
                        _, _, _, _ = self.env.step(action)
                action = 1
            
            next_image, step_reward, done, info = self.env.step(action)

            if steps > 2500:
                print('Max steps reached')
                done = True

            game_reward += step_reward
            raw_images.append(next_image)

            next_state = preprocess(next_image)
            next_history = np.append(history[:,:,-3:], next_state, axis=2)

            if  remaining_lives > info['ale.lives']:
                dead = True
                if info['ale.lives'] == 0: done = True
                print("Player is dead! game_reward is: %s" % (game_reward))
                remaining_lives = info['ale.lives']

   
            # if the game is done, break the loop
            if done:
                return game_reward, raw_images

            # move the agent to the next state 
            if dead:
                dead = False
                history = init_history

            else:
                history = next_history

            steps += 1


    def play_game(self, global_steps):

        loss = np.zeros((1,), dtype=np.float32) 
        dead = False

        steps = 0 
        game_reward = 0
        process_time = 0
        train_time = 0

        data_images = []
        data_actions = []
        data_rewards = []
        data_dones = []

        self._reset()

        remaining_lives = 5 
        history = np.repeat(self.state, 4, axis=2)
        init_history = history 

        while True:

            # Play next step
            if RENDER: self.env.render()
            
            if steps % REPEAT_ACTION == 0:
                
                history_foraction =  np.reshape(history, (1, IMG_HEIGHT, IMG_WIDTH,4))
               
                with tf.device(USE_DEVICE):
                    
                    tf_history = tf.constant(history_foraction)
                    tf_history = tf.cast(tf_history, dtype=tf.float32)

                    assert tf_history.device[-5:].lower() == USE_DEVICE[-5:].lower(), "tf_history not on : %s" % USE_DEVICE
                    
                    if POLICY == 2:
                        action_probs = self.choose_action(tf_history)
                        probs = action_probs.numpy()
                        action = np.random.choice(NUM_ACTIONS, p=probs.squeeze())
                    else:
                        action = self.choose_action(tf_history)
                        action = action.numpy()
            
            # Fire once at start # if no fire required the action has to be initialized
            if np.all(np.equal(history, init_history)):
                action = 1

            next_image, step_reward, done, info = self.env.step(action)

            if steps > 2500:
                print('Max steps reached')
                step_reward = -1
                done = True

            game_reward += step_reward
            
            # If you need clipping
            '''
            if step_reward > 0:
                step_reward = 1
            elif step_reward == 0:
                step_reward = 0
            else:
                step_reward = -1           
            '''

            lap_time = time.time()
            next_state = preprocess(next_image)
            process_time +=  time.time() - lap_time

            next_history = np.append(history[:,:,-3:], next_state, axis=2)

            # Decay epsilon
            with tf.device(USE_DEVICE):
                if self.epsilon > self.min_epsilon and global_steps > EXPLORE_STEPS:
                        self.epsilon -= self.decay
                        if self.epsilon < self.min_epsilon: 
                            self.epsilon = tf.constant(self.min_epsilon)
                        assert self.epsilon.device[-5:].lower() == USE_DEVICE[-5:].lower(), "self.epsilon not updated on : %s" % USE_DEVICE
            
            if  remaining_lives > info['ale.lives']:
                dead = True
                if info['ale.lives'] == 0: done = True
                print("Player is dead! game_reward is: %s" % (game_reward))
                remaining_lives = info['ale.lives']
                step_reward = -1

            data_actions.append(action) 
            data_images.append(next_state[:,:,0].numpy()) 
            data_rewards.append(step_reward) 
            data_dones.append(int(dead)) 
            
            if steps % UPDATE_FREQ == 0 :
                
                if global_steps > START_LEARNING:
                    
                    lap_time = time.time()
                    
                    with tf.device(USE_DEVICE):
                        # Calculate target
                        lossBatch = self.calculate_target_and_train() 
                        lossMean = tf.reduce_mean(lossBatch)
                        loss += lossMean.numpy() 
                    train_time +=  time.time() - lap_time
                
            # if the game is done, break the loop
            if done:

                np_data_images = np.asarray(data_images, dtype=np.int16)
                np_data_rewards = np.asarray(data_rewards, dtype=np.int16)
                np_data_actions = np.asarray(data_actions, dtype=np.int16)
                np_data_dones = np.asarray(data_dones, dtype=np.int16)
                
                data = (np_data_images, np_data_actions, np_data_rewards, np_data_dones)
                
                return data, steps, game_reward, loss, process_time, train_time

            # move the agent to the next state 
            if dead:
                dead = False
                history = init_history

            else:
                history = next_history

            steps += 1
 

    #@tf.function    
    def calculate_target_and_train(self):

        loss = tf.constant(0)
        loss = tf.cast(loss, dtype=tf.float32)

        #yield history, next_history, action_one_hot, terminals, rewards
        for batch_history, batch_next_history, batch_action_one_hot, batch_terminal, batch_reward in self.exp_buffer.dataset.take(EPOCHS):

            batch_action_all_ones = tf.ones_like(batch_action_one_hot)
            
            # predict Q(s',a') for the Bellman equation
            next_qsa = self.target_model((batch_next_history, batch_action_all_ones), training=True)
            
            if POLICY == 1:

                # e-greedy policy - Expected Sarsa
                sum_piq = egreedy_policy(next_qsa, self.epsilon)
                v_next_vect = batch_terminal * sum_piq

            elif POLICY == 2:

                # Softmax policy - Expected Sarsa
                action_probs = softmax_policy(next_qsa)
                expectation = tf.multiply(action_probs, next_qsa)
                sum_expectation = tf.reduce_sum(expectation, axis=1, keepdims=True)
                v_next_vect = batch_terminal * sum_expectation

            else:

                # e-greedy policy - Q-Learning
                max_q = tf.math.reduce_max(next_qsa, axis=1, keepdims=True)
                v_next_vect = batch_terminal * max_q
            
            target_vec = batch_reward + GAMMA * v_next_vect
            target_mat = tf.multiply(target_vec, batch_action_one_hot)

            # Predict Q(s,a)
            with tf.GradientTape() as tape:

                qsa = self.model((batch_history, batch_action_one_hot), training=True)

                qsa_mat = tf.multiply(qsa, batch_action_one_hot)
                delta_mat = target_mat - qsa_mat
                
                # Huber loss
                squared_loss = 0.5 * tf.square(delta_mat)
                linear_loss = tf.abs(delta_mat) -0.5
                ones = tf.ones_like(delta_mat)
                loss_mat = tf.where(tf.greater(linear_loss, ones), x = linear_loss, y = squared_loss)
                loss_train = tf.reduce_mean(loss_mat, axis=1, keepdims=True)

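            # Compute and apply the gradients outside the tape context
            # so that these operations are not recorded by the tape.
            grads = tape.gradient(loss_train, self.model.trainable_variables)
            self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))

            loss = tf.add(loss_train, loss)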

        return loss


    #@tf.function    
    def choose_action(self, states):
        
        actions_all_ones = tf.ones((1,NUM_ACTIONS))

        if POLICY == 2:
            
            # softmax
            qsa = self.model((states, actions_all_ones), training=True)
            action_probs = softmax_policy(qsa)

            return action_probs 
        
        else:

            # e-greedy
            randomNum = tf.random.uniform((), minval=0, maxval=1, dtype=tf.float32, seed=1)
            if randomNum < self.epsilon:
                random_action = tf.random.uniform((), minval=0, maxval=NUM_ACTIONS, dtype=tf.int32)
                best_action = random_action 

            else:

                qsa = self.model((states, actions_all_ones), training=True)
                best_action = tf.math.argmax(qsa, axis=1, output_type=tf.dtypes.int32)
                #best_action = argmax_ties(qsa)
                best_action = best_action[0]
            
            return best_action 

# The experience buffer 

The experience buffer, or replay memory, is a very simple class that stores all images, actions, rewards and terminal flags. There are a few functions to append, count, and limit the maximum number of experiences (Exp).

![alt text](http://www.modelfit.us/uploads/7/6/0/6/76068583/memory-replay_orig.png)

At each step (represented here as one column), we're adding the action that was taken, the resulting state, reward, and terminal. Terminal indicates if the state is terminal (a red column is a terminal state).

There are two important things to mention: 

1. The dtype you use to store the experiences directly determines how much memory is allocated to your process. If you want to run multiple training sessions in parallel, I would recommend using np.int16 instead of np.float32.

2. I added a TensorFlow dataset to this class to feed the experience replay into the training loop. We will detail this data pipeline in one of the next cells.
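
To give an order of magnitude for the first point, here is a back-of-the-envelope calculation in plain Python. It only counts the image array of a full buffer (actions, rewards and terminal flags are negligible next to it) and uses the constants from this notebook.

```python
MAX_SAMPLES = 1000000
IMG_HEIGHT = 84
IMG_WIDTH = 84

BYTES_PER_VALUE = {"int16": 2, "float32": 4}  # item sizes of the two dtypes

values = MAX_SAMPLES * IMG_HEIGHT * IMG_WIDTH  # pixels held by a full buffer
for dtype, size in BYTES_PER_VALUE.items():
    gigabytes = values * size / 1e9
    print("%s: %.3f GB" % (dtype, gigabytes))
```

Storing the frames as np.float32 doubles the footprint compared with np.int16, which matters as soon as several training sessions share one machine.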

# The dataset

As I detailed in the Medium post XXX, getting the best performance is difficult, especially for Reinforcement Learning. Each step of the process includes generating and storing data on the CPU and running network-related operations on the GPU. Therefore, it's not unusual to get better performance running exclusively on the CPU, as it limits the data transfers between the two worlds. After trying many different options, I settled on a TensorFlow dataset. This dataset is the bridge between the CPU's memory and the GPU's memory: it retrieves a batch of samples from the experience replay and feeds it to the network.

A batch of samples contains MINI_BATCH_SIZE elements. Each element is made of 5 experiences (5 columns). The network requires a stack of 4 states and the related action as inputs. Looking at only one image does not let you understand the motion in the game. Therefore, we need to feed the network a stack of images. DeepMind and many practitioners use a stack of 4, and we'll do the same for this implementation. Having a history of 4 experiences means you need 5 consecutive experiences, as we need to approximate the action-value function of both the current and the upcoming steps.
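
Here is a minimal sketch of how 5 consecutive experiences give both stacks of 4, in plain Python. The helper `split_window` is hypothetical, and frame identifiers stand in for the 84x84 images stored in the buffer.

```python
def split_window(frames):
    """Split 5 consecutive frames into (history, next_history)."""
    assert len(frames) == 5, "a sample needs exactly 5 consecutive frames"
    history = frames[0:4]       # input for Q(s, a)
    next_history = frames[1:5]  # input for Q(s', a')
    return history, next_history


# Frame identifiers stand in for the images stored in the buffer.
replay_frames = ["f%d" % i for i in range(10)]

last = 7  # index of the newest frame of the sample (made-up value)
window = replay_frames[last - 4:last + 1]
history, next_history = split_window(window)
print(history)
print(next_history)
```

The two stacks overlap on 3 frames, which is why a single window of 5 is enough for one training sample.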
<br><br>
The only drawback is that we need to ensure the consistency of the histories in the data pipeline. For example, a stack must not run across a terminal state as in the following picture, so we need to remove such samples.

![alt text](http://www.modelfit.us/uploads/7/6/0/6/76068583/memory-incorrect_orig.png)



As detailed below, we do so by removing every stack of 5 samples that contains a terminal state in one of its first 4 experiences.


```
# Remove bad samples
first4_dones = first5_dones[:,:-1]
any_bad_samples = np.any(first4_dones, axis=1)
indices_ok = np.logical_not(any_bad_samples)

rewards_filtered = gathered_rewards[indices_ok,:]
images_filtered = gathered_images[indices_ok,:,:,:]
actions_filtered = gathered_actions[indices_ok,:]
dones_filtered = gathered_dones[indices_ok,:]
```
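
The same rule can be tried on a few hand-made windows. This sketch uses plain Python lists rather than NumPy arrays, the helper `keep_valid` is hypothetical, and the terminal flags are made up.

```python
def keep_valid(windows_dones):
    """Return the positions of the windows whose first 4 flags are all 0."""
    valid = []
    for position, dones in enumerate(windows_dones):
        first4 = dones[:-1]  # a terminal flag on the last experience is fine
        if not any(first4):
            valid.append(position)
    return valid


# Each row holds the 5 terminal flags of one sampled window (made-up data).
first5_dones = [
    [0, 0, 0, 0, 0],  # ordinary sample: kept
    [0, 0, 0, 0, 1],  # ends on a terminal state: kept
    [0, 1, 0, 0, 0],  # terminal state inside the history: dropped
    [1, 0, 0, 0, 0],  # starts on a terminal state: dropped
]
print(keep_valid(first5_dones))
```

A window that ends on a terminal state is still a valid sample: it is exactly the transition whose target reduces to the reward.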

In [None]:
class ExperienceBuffer:
    def __init__(self):

        self.images = np.empty(shape=(1,IMG_HEIGHT,IMG_WIDTH), dtype=np.int16)
        self.actions = np.empty(shape=(1,), dtype=np.int16)
        self.rewards = np.empty(shape=(1,), dtype=np.int16)
        self.dones = np.empty(shape=(1,), dtype=np.int16)

        with tf.device(USE_DEVICE):
            types = tf.float32, tf.float32, tf.float32, tf.float32,tf.float32 
            shapes = (MINI_BATCH_SIZE,IMG_HEIGHT,IMG_WIDTH,4), \
                    (MINI_BATCH_SIZE,IMG_HEIGHT,IMG_WIDTH,4), \
                    (MINI_BATCH_SIZE,NUM_ACTIONS), \
                    (MINI_BATCH_SIZE,1), \
                    (MINI_BATCH_SIZE,1)  

            fn_generate = lambda: self.generate_data()
            self.dataset = tf.data.Dataset.from_generator(fn_generate, \
                                         output_types= types, \
                                         output_shapes = shapes)
            #fn_map = lambda *args: args                             
            #self.dataset = self.dataset.map(fn_map, num_parallel_calls=tf.data.experimental.AUTOTUNE)
            self.dataset = self.dataset.prefetch(buffer_size=2*EPOCHS)

    def count(self):
        return self.images.shape[0]

    def pop(self):
        self.images = self.images[1:,:,:]
        self.actions = self.actions[1:]
        self.rewards = self.rewards[1:]
        self.dones = self.dones[1:]

    def append(self, experiences):
        self.images = np.append(self.images, experiences[0], axis=0)
        self.actions= np.append(self.actions, experiences[1], axis=0)
        self.rewards = np.append(self.rewards, experiences[2], axis=0)
        self.dones = np.append(self.dones, experiences[3], axis=0)

        if self.images.shape[0] > MAX_SAMPLES:
            self.images = self.images[-MAX_SAMPLES:,:,:] 
            self.actions = self.actions[-MAX_SAMPLES:] 
            self.rewards = self.rewards[-MAX_SAMPLES:] 
            self.dones = self.dones[-MAX_SAMPLES:] 
    
    def generate_data(self):

        mini_batch_size = MINI_BATCH_SIZE
        mini_batch_size = float(mini_batch_size)
        num_samples = mini_batch_size * 1.7 # We don't know how many samples we'll remove
        num_samples = int(num_samples)

        while True:
           
            replay_images = self.images
            replay_actions = self.actions
            replay_rewards = self.rewards
            replay_dones = self.dones

            indices4 = np.random.randint(low=0, high=self.count()-4, size=num_samples)
            indices4 = indices4 + 4
            indices3 = indices4 -1
            indices2 = indices3 -1
            indices1 = indices2 -1
            indices0 = indices1 -1
            indices = np.stack((indices0,indices1,indices2,indices3,indices4), axis=0)
            indices = np.reshape(np.transpose(indices),(num_samples*5,))
            reshaped_indices= np.reshape(indices,(-1,5))
            reshaped_indices4 = np.reshape(indices4,(-1,1))

            gathered_images = np.take(replay_images, reshaped_indices, axis=0)
            gathered_actions = np.take(replay_actions, reshaped_indices4, axis=0)
            gathered_rewards = np.take(replay_rewards, reshaped_indices4, axis=0)
            gathered_dones = np.take(replay_dones, reshaped_indices4, axis=0)
            first5_dones =  np.take(replay_dones, reshaped_indices, axis=0)

            # Remove bad samples
            first4_dones = first5_dones[:,:-1]
            any_bad_samples = np.any(first4_dones, axis=1)
            indices_ok = np.logical_not(any_bad_samples)

            rewards_filtered = gathered_rewards[indices_ok,:]
            images_filtered = gathered_images[indices_ok,:,:,:]
            actions_filtered = gathered_actions[indices_ok,:]
            dones_filtered = gathered_dones[indices_ok,:]

            rewards = rewards_filtered[0:MINI_BATCH_SIZE,:]
            images = images_filtered[0:MINI_BATCH_SIZE,:,:,:]
            actions = actions_filtered[0:MINI_BATCH_SIZE,:]
            dones = dones_filtered[0:MINI_BATCH_SIZE,:]

            raw_history = images[:,0:4,:,:]
            history = np.transpose(raw_history,(0,2,3,1))
            raw_next_history = images[:,1:5,:,:]
            next_history = np.transpose(raw_next_history,(0,2,3,1))

            actions = actions.astype(int)
            actions = np.reshape(actions,(-1,))

            action_one_hot = np.eye(NUM_ACTIONS)[actions]
            action_one_hot = action_one_hot.astype(float)

            terminals = 1 - dones

            history = history.astype(np.float32)
            next_history = next_history.astype(np.float32)
            action_one_hot = action_one_hot.astype(np.float32)
            terminals = terminals.astype(np.float32)
            rewards = rewards.astype(np.float32)
            
            yield history, next_history, action_one_hot, terminals, rewards


# The Tensorflow Keras Model

The Keras model we're using is as close as it can be to the one used by DeepMind. My understanding (I'm not sure what the Torch defaults were in 2015) is that, apart from the use of Keras, the only differences are the use of the ReLU activation function and the kernel initializers.

Although the number of dense layers may vary, I decided to keep 2 fully connected layers before the last visible layer. 

I tried many different kernel initializers and settled on VarianceScaling for the convolution layers and the default GlorotUniform for the dense layers. This gave me the best results, especially for Expected Sarsa with softmax, which does not converge if VarianceScaling is used on all layers.
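
To get a feel for the tensor sizes in this network, here is a short calculation in plain Python. It assumes the three convolution layers of the model cell (kernel sizes 8, 4, 3 with strides 4, 2, 1, all without padding); the helper `conv_output_size` is hypothetical.

```python
def conv_output_size(size, kernel, stride):
    """Output length of a 'valid' (no padding) convolution along one axis."""
    return (size - kernel) // stride + 1


# (filters, kernel, stride) of the three Conv2D layers of the model.
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size = 84  # IMG_HEIGHT and IMG_WIDTH are both 84
for filters, kernel, stride in layers:
    size = conv_output_size(size, kernel, stride)
    print("%dx%dx%d" % (size, size, filters))

flattened = size * size * filters  # number of inputs of the first dense layer
print(flattened)
```

The 84x84 input shrinks to 20x20, then 9x9, then 7x7, so the Flatten layer hands 3136 values to the 512-unit dense layer.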

<br><br>
![alt text](http://www.modelfit.us/uploads/7/6/0/6/76068583/screen-shot-2020-06-22-at-11-20-00-am_orig.png)



In [None]:
def build_keras_Seq():
    
    frames = tf.keras.Input(shape=(IMG_HEIGHT,IMG_WIDTH, 4), name='frames')
    actions = tf.keras.Input(shape=(NUM_ACTIONS,), name='actions')

    normalized = tf.keras.layers.Lambda(lambda x: x / 255.0, name='normalization')(frames)
    
    init = tf.keras.initializers.VarianceScaling(scale=2.0, mode='fan_in', distribution='untruncated_normal', seed=None)
    init0 = tf.keras.initializers.Zeros()
    init1 = tf.keras.initializers.Ones()
    init2 = tf.keras.initializers.GlorotUniform(seed=1) #[-limit, limit], where limit = sqrt(6 / (fan_in + fan_out))
    init3 = tf.keras.initializers.he_uniform(seed=1)

    x = tf.keras.layers.Conv2D(32, (8, 8), strides=(4, 4), kernel_initializer=init, padding='valid', use_bias=False)(normalized)
    x = tf.keras.activations.relu(x) # , max_value=6)
    
    x = tf.keras.layers.Conv2D(64, (4, 4), strides=(2, 2), kernel_initializer=init, padding='valid', use_bias=False)(x)
    x = tf.keras.activations.relu(x) #, max_value=6)

    x = tf.keras.layers.Conv2D(64, (3, 3), strides=(1, 1), kernel_initializer=init, use_bias=False)(x)
    x = tf.keras.activations.relu(x) #, max_value=6)

    x = tf.keras.layers.Flatten()(x)

    x = tf.keras.layers.Dense(512, kernel_initializer=init2)(x)
    x = tf.keras.activations.relu(x) #, max_value=6)

    #x = tf.keras.layers.Dense(128, kernel_initializer=init2)(x)
    #x = tf.keras.activations.relu(x) #, max_value=6)
 
    #x = tf.keras.layers.Dense(64, kernel_initializer=init2)(x)
    #x = tf.keras.activations.relu(x) #, max_value=6)
    
    q_values = tf.keras.layers.Dense(NUM_ACTIONS, dtype='float32', name='q_values', kernel_initializer=init2, activation=None)(x)
    output = tf.keras.layers.Multiply(dtype='float32', name='Qs')([q_values, actions])
    
    model = tf.keras.Model(inputs=[frames,actions], outputs=output)
    return model     

# Run training

This cell orchestrates the training and the evaluation of the network. It regularly updates the target network (the interval is controlled by MODELUPDATE_TRAIN_STEPS) with:

```
agent.target_model.set_weights(agent.model.get_weights())
```
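
To see how often this update fires with the constants of this notebook, here is the counting done by run_training, reproduced in plain Python. The helper `train_steps_since` is hypothetical; note also that run_training only performs this check at the end of a game, so the actual update happens at the first game end after the threshold is crossed.

```python
EPOCHS = 1
MINI_BATCH_SIZE = 32
UPDATE_FREQ = 2
MODELUPDATE_TRAIN_STEPS = 5000


def train_steps_since(global_steps, previous_global_steps_tn):
    """Samples trained on since the last target update, as run_training counts them."""
    elapsed = global_steps - previous_global_steps_tn
    return elapsed * EPOCHS * MINI_BATCH_SIZE / UPDATE_FREQ


# Each environment step accounts for 32 / 2 = 16 trained samples, so the
# threshold is crossed after a bit more than 5000 / 16 = 312.5 steps.
for elapsed_steps in (300, 312, 313, 400):
    due = train_steps_since(elapsed_steps, 0) > MODELUPDATE_TRAIN_STEPS
    print(elapsed_steps, due)
```

With these values the target network is refreshed roughly every few hundred environment steps, that is, after about one or two games early in training.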

It registers the following TensorBoard scalars and histograms:

```
tf.summary.scalar('loss', loss[0], step=global_steps)
tf.summary.scalar('epsilon', agent.epsilon, step=global_steps)
tf.summary.scalar('score', successFrame[-1], step=global_steps)
tf.summary.scalar('steps', steps, step=global_steps)
tf.summary.histogram('actions', agent.exp_buffer.actions[-steps:], step=global_steps)
```

I decided to use TensorBoard to display the loss (of course), epsilon (it helps to see when the network starts to converge), the score, the number of steps for each game and the distribution of the actions (to understand how it evolves with convergence).


In [None]:
def run_training(agent, now):

    logdir = "{}/run/{}/".format(ROOT_TF_LOG, now)
    
    with tf.device(USE_DEVICE):
        file_writer = tf.summary.create_file_writer(logdir)

    modelId = now

    with tf.device(USE_DEVICE):
        agent.target_model.set_weights(agent.model.get_weights())

    # Metrics. TODO: replace these ever-growing arrays with collections.deque (maxlen >= 10, the window averaged into successFrame).
    successMemory = np.empty((1,0))
    successFrame = np.empty((1,0))
    previous_global_steps_tn = 0
    previous_global_steps_eval = 0
    
    game_count = 1
    global_steps = 0
    loss =  np.zeros((1,),dtype=np.float32)
    best_score = -1
    
    lap_time = time.time()
   
    try:

        while global_steps <= MAX_STEPS: 

            print('\nGame {} - Run {}'.format(game_count, now))

            #if global_steps % SAVEMODEL_STEPS  > previous_global_steps % SAVEMODEL_STEPS:
            #    save_theModel(model, modelId, game_count, samples)

            # Returns: data_game, steps, game_reward, loss, process_time, train_time
            data_game, steps, game_reward, loss, process_time, train_time = agent.play_game(global_steps)
            loss /= steps + 1 # steps starts at 0 

            buffer_previous_size = agent.exp_buffer.count()
            agent.exp_buffer.append(data_game)

            global_steps += steps + 1   
            print('Global_steps is: %s' % global_steps)
            
            if buffer_previous_size == 1 :
                print("Experience Replay buffer pop")
                agent.exp_buffer.pop()

            # Update the target network 
            train_steps = (global_steps - previous_global_steps_tn)*EPOCHS*MINI_BATCH_SIZE/UPDATE_FREQ
            if train_steps > MODELUPDATE_TRAIN_STEPS:
                with tf.device(USE_DEVICE):
                    agent.target_model.set_weights(agent.model.get_weights())
                print('Updating target model **************************** Updating target model ****************')
                previous_global_steps_tn = global_steps

            if POLICY == 0 or POLICY == 1: print('Epsilon is: %s' % agent.epsilon)
            
            # Evaluate performance every EVAL_STEPS frames once exploration and annealing are over, and once more at the end of training
            if (global_steps > EXPLORE_STEPS + ANNEALING_STEPS  and global_steps > previous_global_steps_eval + EVAL_STEPS) or global_steps > MAX_STEPS:
                
                successEval = np.empty((1,0))
                agent.epsilon = agent.epsilon_evaluation 
                remaining_eval_games = EVAL_GAMES
                previous_global_steps_eval = global_steps
                
                while remaining_eval_games > 0:
                    
                    print('Evaluation game %s' % remaining_eval_games)
                    remaining_eval_games -= 1 

                    game_reward, raw_frames = agent.eval_game()
                    print('game_reward is: ', game_reward)
                    successEval = np.append(successEval, game_reward)
        
                    if  game_reward > best_score:
                        generate_gif(raw_frames, modelId, game_count, game_reward)
                        best_score = game_reward
                        print('Generating GIF  **************************** Generating Gif ****************')
     
                    if remaining_eval_games == 0:
                        agent.epsilon = agent.min_epsilon 

                    assert agent.epsilon.device[-5:].lower() == USE_DEVICE[-5:].lower(), "agent.epsilon not on : %s" % USE_DEVICE
                
                with file_writer.as_default():
                    with tf.device(USE_DEVICE):
                        tf.summary.scalar('eval', np.mean(successEval), step=global_steps)
                        tf.summary.scalar('eval-var', np.var(successEval), step=global_steps)
                        tf.summary.histogram('scores', successEval, step=global_steps)
                
                print('Evaluation games average score is %s ' % np.mean(successEval)) 
                print('Evaluation games score variance is %s ' % np.var(successEval))


            successMemory = np.append(successMemory,game_reward)
            successFrame = np.append(successFrame,np.mean(successMemory[-10:successMemory.size]))
           
            actions_distrib = np.histogram(agent.exp_buffer.actions[-steps:], bins=[0,1,2,3,4,5,6], density=True)

            print('Memory contains %s samples' % agent.exp_buffer.count())       
            print('Reward over 10 games is: %s and loss is: %s' % (successFrame[-1],loss[0]))
            print('Actions distribution (last game, %) is: ', (100 * actions_distrib[0]).astype(int))
            print('Steps survived: %s' % (steps+1))
 
            # Add user custom data to TensorBoard
            with file_writer.as_default():
                with tf.device(USE_DEVICE):
                    tf.summary.scalar('loss', loss[0], step=global_steps)
                    tf.summary.scalar('epsilon', agent.epsilon, step=global_steps)
                    tf.summary.scalar('score', game_reward, step=global_steps)
                    tf.summary.scalar('steps', steps, step=global_steps)
                    tf.summary.histogram('actions', agent.exp_buffer.actions[-steps:], step=global_steps)

            previous_time = lap_time
            lap_time = time.time()
            print("Image processing time for the last game: ", process_time)
            print("Train time for the last game: ", train_time)
            print("Elapsed time for the last game: ", lap_time - previous_time)
            
            #if game_count == 1:
            #    break
            
            game_count += 1  
            
    except KeyboardInterrupt:

        print('Save the model')
        save_theModel(agent.model, modelId, game_count)
        file_writer.close()    

        raise

    print('Save the model ', modelId)
    save_theModel(agent.model, modelId, game_count)
    file_writer.close()   

# Utilities

Here we declare three helper functions to:

*   Preprocess the frames retrieved from the environment: convert them to grayscale, crop them, and scale them down to 84x84 so that the network processes a reasonable number of pixels,
*   Generate an animated GIF,
*   Save the model.
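
To make the first step concrete, here is a rough NumPy equivalent of the preprocessing: grayscale conversion, a 160x160 crop starting at row 34, and a nearest-neighbor resize to 84x84. It is a sketch under assumptions: the input is a 210x160x3 Atari frame, the luminance weights are those documented for `tf.image.rgb_to_grayscale`, and the index arithmetic only approximates the sampling done by `tf.image.resize`, so individual pixels may differ. The name `preprocess_np` is hypothetical.

```python
import numpy as np

IMG_HEIGHT, IMG_WIDTH = 84, 84

def preprocess_np(frame):
    """Grayscale, crop and downscale one RGB frame (illustrative NumPy version)."""
    # Weighted sum of the R, G, B channels -> shape (210, 160)
    weights = np.array([0.2989, 0.5870, 0.1140], dtype=np.float32)
    gray = frame.astype(np.float32) @ weights
    # Keep rows 34..193 and all 160 columns -> shape (160, 160)
    cropped = gray[34:194, 0:160]
    # Nearest-neighbor resize: pick one source row/column per target pixel
    rows = np.arange(IMG_HEIGHT) * cropped.shape[0] // IMG_HEIGHT
    cols = np.arange(IMG_WIDTH) * cropped.shape[1] // IMG_WIDTH
    resized = cropped[rows][:, cols]
    return resized[..., np.newaxis]  # trailing channel axis -> (84, 84, 1)

frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
print(preprocess_np(frame).shape)
```

Under these assumptions, each frame shrinks from 33,600 pixels with three channels to 7,056 pixels with one channel.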


In [None]:
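def preprocess(image):

    # Convert the RGB frame to a single grayscale channel
    img_gray = tf.image.rgb_to_grayscale(image)
    # Keep a 160x160 window starting at row 34, column 0
    img_cropped = tf.image.crop_to_bounding_box(img_gray, 34, 0, 160, 160)
    # Downscale to IMG_HEIGHT x IMG_WIDTH with nearest-neighbor interpolation
    img_resized = tf.image.resize(img_cropped, [IMG_HEIGHT, IMG_WIDTH], method='nearest')

    return img_resized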

In [None]:
def generate_gif(frames, pathName, game_count, game_reward):

    # Upscale each frame in place to 420x320 with nearest-neighbor interpolation (order=0)
    for idx, frame_idx in enumerate(frames):
        frames[idx] = resize(frame_idx, (420, 320, 3), preserve_range=True, order=0).astype(np.uint8)

    gif_path = f'{SAVE_DIR}/GymBreakout-{pathName}-{game_count}-{game_reward}.gif'
    imageio.mimsave(gif_path, frames, duration=1/30)

In [None]:
def save_theModel(model, pathName, game_count):
    
    now_save = pathName + '_' + str(game_count)
    modelPath = "{}/GymBreakout-{}.h5".format(SAVE_DIR, now_save)
    model.save(modelPath)
    print('Saved model: ', modelPath)
    print(datetime.utcnow().strftime("%a, %d %b %Y %H:%M:%S +0000"))

# Our main

The last step, our "main", is to instantiate the different objects and seed the random number generators. Most importantly, we create the Keras optimizer. We use Adam rather than RMSProp, the optimizer used by DeepMind.

```
optimizer = tf.keras.optimizers.Adam(ALPHA, epsilon=1e-8)
```
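
For readers unfamiliar with Adam, the sketch below applies the textbook update rule (Kingma and Ba) to a single scalar weight, to show where the `epsilon=1e-8` argument enters. It is plain Python rather than the Keras implementation, which may organize the computation differently, and the learning rate is a placeholder, not the notebook's `ALPHA`.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One textbook Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias corrections for the zero init
    v_hat = v / (1 - beta2 ** t)
    # epsilon keeps the division stable when v_hat is close to zero
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return theta, m, v

theta, m, v = adam_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(theta)  # the first step moves theta by roughly lr, whatever the gradient scale
```

Basic RMSProp keeps only the running mean of squared gradients; Adam adds the running mean of the gradients themselves and the two bias corrections.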



In [None]:
# If running in Colab, uncomment the following two lines to display TensorBoard.
# You should see no active dashboards until you run the following cells.
# Be aware that you're going to crash your session around game 1800.
# This process needs about 30 GB of RAM to store 1,000,000 experiences ;-)

#%load_ext tensorboard
#%tensorboard --logdir tf_logs

In [None]:
# Please allocate a GPU to this notebook.
gpus = tf.config.list_physical_devices('GPU')

assert len(gpus) > 0, "Please allocate a GPU to this notebook."
print('GPUS are: ', gpus)

for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu,True)

# Set to True to log device placement and check that everything runs on the GPU
tf.debugging.set_log_device_placement(False)

now = datetime.utcnow().strftime("%Y%m%d%H%M%S")

# Seed the random number generators
# Don't forget to seed the network activation function if needed
np.random.seed(seed=42)
#random.seed(43)
tf.random.set_seed(44)

# Create env
env = gym.make('BreakoutDeterministic-v0')
env.seed(45)  # seed the environment (classic Gym API)

print("obs shape is: ", env.observation_space.shape)
print("actions space is: ", env.action_space.n)
actions = env.unwrapped.get_action_meanings()
print('actions are: ', actions)

# Build Model
with tf.device(USE_DEVICE):

    model = build_keras_Seq()
    target_model = build_keras_Seq()
    optimizer = tf.keras.optimizers.Adam(ALPHA, epsilon=1e-8)
 
memory = ExperienceBuffer()
agent = Agent(env, model, target_model, optimizer, memory)

#Training
run_training(agent, now)

# Close env 
env.close()


Streaming output truncated to the last 5000 lines.
Player is dead! game_reward is: 0.0
Player is dead! game_reward is: 0.0
Player is dead! game_reward is: 0.0
Player is dead! game_reward is: 0.0
Player is dead! game_reward is: 0.0
Global_steps is: 269417
Epsilon is: tf.Tensor(1.0, shape=(), dtype=float32)
Memory contains 269417 samples
Reward over 10 games is: 1.6 and loss is: 0.011375404
Actions distribution (last game, %) is:  [17 32 29 20  0  0]
Steps survived: 123
Image processing time for the last game:  0.31781625747680664
Train time for the last game:  2.8231728076934814
Elapsed time for the last game:  6.19070029258728

Game 1562 - Run 20200911003313
Player is dead! game_reward is: 0.0
Player is dead! game_reward is: 0.0
Player is dead! game_reward is: 0.0
Player is dead! game_reward is: 0.0
Player is dead! game_reward is: 1.0
Global_steps is: 269569
Epsilon is: tf.Tensor(1.0, shape=(), dtype=float32)
Memory contains 269569 samples
Reward over 10 games is: 1.7 and