# Deep Q-Learning

Read the introduction to this notebook first...

... or dive right in :)

## Contents
1. Reinforcement Learning 
2. Q-Learning
3. Exploration-exploitation trade-off


## 1. Reinforcement Learning
In supervised learning, for instance, a neural network learns a function that maps an input to a corresponding output from a large amount of labeled training data consisting of example input-output pairs. Simply put, if you train a neural network to classify, for example, cats and dogs, you repeatedly show the network pictures of cats or dogs, compare the network's prediction to the label, and slightly adapt the network's parameters until the network is able to tell which animal is shown in a picture.

Now, let's say you let a child play a computer game it has never played before. In the case of [Breakout](https://www.youtube.com/watch?v=TmPfTpjtdgg) the player sees the pixel screen as input and has to decide whether to move left or right. You could certainly show the child many times in which situations it has to press left and in which situations right in order to win the game (this would be a classification problem, i.e. supervised learning), but surely the child would become bored quickly and would try to push you aside, wanting to try the game itself. And the child would quickly learn to play the game without being told how, simply by evaluating which actions lead to an increased score. In reinforcement learning, we try to make a computer learn in the same way, by letting it explore the environment and occasionally giving it a reward when the score increases.

However, in comparison to supervised learning, this poses a problem. On p. 1 of [Mnih et al. 2013](https://arxiv.org/abs/1312.5602) (from now on denoted as (DQN)) the authors say:

>RL algorithms [...] must be able to learn from a scalar reward signal that is frequently sparse [...] and delayed. The delay between actions and resulting rewards, which can be thousands of timesteps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning.

What do the authors mean by "sparse [...] and delayed"?

Rewards are *sparse* when most actions yield no reward at all. Think of a maze in which you collect gold: the less gold there is to find, the sparser the rewards. The sparser the reward, the harder a game is for an agent to learn. [Pong](https://gym.openai.com/envs/Pong-v0/) is one of the games DQN learns fastest because the score changes quite often. [Montezuma's Revenge](https://gym.openai.com/envs/MontezumaRevenge-v0/), on the other hand, has very sparse rewards, and DQN (at least without some additional tricks) is not able to learn the game at all.

And *delayed*?
Imagine you walk through a maze trying to find treasures. You get a reward once you find gold. Now imagine you encounter a fork in the path. Which way do you take? In contrast to supervised learning, the agent gets no reward at the fork for taking the right path, only later, once it finds gold. Yet taking, for example, the left path at the fork might have been crucial. This is what the authors mean by *delayed*. The problem is addressed by discounting future rewards with a factor $\gamma$ (between 0 and 1).

The discounted return $R_i$ is calculated as follows:  $R_{i} = r_i + \gamma r_{i+1} + \gamma^2 r_{i+2} + \gamma^3 r_{i+3} + ...$

Let us look at a very simple example where there is just one reward not equal to 0:

time step | $t_{i}$ | $t_{i+1}$ |$t_{i+2}$ |$t_{i+3}$ |$t_{i+4}$ |
:---| --- | ---| ---|---|---|
reward | 0 | 0 | 0 | 1 | 0 |
discounted reward | $\gamma^3$ | $\gamma^2$ | $\gamma$ | 1 | 0 |
for $\gamma=0.9$ | 0.729 | 0.81 | 0.9 | 1 | 0 |

Simply put, discounting propagates a future reward back to the time steps that preceded it, and the closer $\gamma$ is to 1, the further back its influence reaches.
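
The row for $\gamma=0.9$ in the table can be reproduced with a few lines of Python. This is only a minimal sketch; the function name `discounted_returns` is made up for illustration and is not used elsewhere in this notebook.

```python
def discounted_returns(rewards, gamma):
    """Return R_i = r_i + gamma * R_{i+1} for every time step i."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of its successor
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

# Same reward sequence as in the table above
print(discounted_returns([0, 0, 0, 1, 0], gamma=0.9))
```

Up to floating-point rounding, the printed values match the last row of the table.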

## 2. Q-Learning
So how does Q-Learning work? If the agent (whether trained or still untrained) is shown a state $s$ of the game, it has to decide which action $a$ to perform (for example, move the paddle left or right in Breakout). How does it do that? On page 2, [Mnih et al. 2013](https://arxiv.org/abs/1312.5602) define the so-called Q-function:
>We define the optimal action-value function $Q^*(s, a)$ as the maximum expected return achievable by following any strategy, after seeing some sequence $s$ and then taking some action $a$ 

This means that given a state of the game $s$ (for now, please think of *sequences* as states of the game), $Q^*(s,a)$ is the best (discounted) total score the agent can achieve if it performs action $a$ in the current state $s$. So how does it choose which action to perform, assuming we already know $Q^*(s,a)$? One obvious strategy would be to always choose the action with the maximum value of $Q^*$ (we will see later why this is slightly problematic). But first of all, we need to find this magical function $Q^*$:

Let's say we are in state $s$, decide to perform action $a$ and arrive in the next state $s'$. If we assume that in state $s'$ the $Q^*$-values for all possible actions $a'$ were already known, then the $Q^*$-value in state $s$ for action $a$ (the maximum discounted return in $s$ for action $a$) would be the reward $r$ we got for performing action $a$ plus the discounted maximum future reward in $s'$:

\begin{equation}
Q^*(s,a) = r + \gamma \max_{a'} Q^*(s',a')
\end{equation}

This is the so-called **Bellman equation**. Deep Q-Learning uses a neural network to find an approximation $Q(s,a;\theta)$ of $Q^*(s,a)$, where $\theta$ are the parameters of the neural network. We will discuss later how exactly the parameters of the network are updated. First, let us look at how the neural network maps a state $s$ to $Q$-values for the possible actions $a$.

Earlier I mentioned that I regard a *sequence* as a *state*. What did I mean by that? Imagine you have a pin-sharp image of a flying soccer ball. Can you tell in which direction it moves? No, you cannot (but you could if there were some motion blur in the picture). The same problem occurs in Atari games. From a single frame of the game [Pong](https://gym.openai.com/envs/Pong-v0/), the agent cannot discern in which direction the ball moves. DeepMind addressed this problem by stacking several consecutive frames and treating this sequence as a state that is passed to the neural network. From such a sequence the agent is able to detect the direction and speed of movement because the ball is in a different position in each frame.
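
To make the stacking concrete, here is a small NumPy sketch of how a state of four frames can be built and advanced by one frame. It uses the same `np.repeat` and `np.append` calls that appear in the training loop further down; the helper names `initial_state` and `next_state` are only illustrative.

```python
import numpy as np

def initial_state(frame, history=4):
    """Repeat the first (84, 84, 1) frame to fill the whole history."""
    return np.repeat(frame, history, axis=2)

def next_state(state, new_frame):
    """Drop the oldest frame and append the newest along the channel axis."""
    return np.append(state[:, :, 1:], new_frame, axis=2)

# Two dummy frames that differ everywhere, so the shift is easy to see
frame0 = np.zeros((84, 84, 1), dtype=np.uint8)
frame1 = np.ones((84, 84, 1), dtype=np.uint8)

state = initial_state(frame0)
state = next_state(state, frame1)
print(state.shape)     # (84, 84, 4)
print(state[0, 0, :])  # [0 0 0 1]
```

The newest frame always sits in the last channel, so the network sees the frames in chronological order.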

On page 5 of [Mnih et al. 2013](https://arxiv.org/abs/1312.5602) the authors explain the preprocessing of the frames:

>Working directly with raw Atari frames, which are 210 × 160 pixel images with a 128 color palette, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality. The raw frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to a 110×84 image. The final input representation is obtained by cropping an 84 × 84 region of the image that roughly captures the playing area. The final cropping stage is only required because we use the GPU implementation of 2D convolutions from [...], which expects square inputs. For the experiments in this paper, the function $\phi$ [...] applies this preprocessing to the last 4 frames of a history and stacks them to produce the input to the $Q$-function.

So let us start by looking at how the preprocessing can be implemented. I used `gym` from OpenAI to provide the environment. A frame returned by the environment has the shape `(210,160,3)`, where the 3 stands for the RGB color channels. Such a frame is passed to the method `process`, which transforms it into an `(84,84,1)` frame, where the 1 indicates that instead of three RGB channels there is one grayscale channel.


In [1]:
import gym
import tensorflow as tf
import random
import numpy as np
import os
import imageio
from skimage.transform import resize



In [11]:
class processFrame():
    def __init__(self):
        self.frame = tf.placeholder(shape=[210, 160, 3], dtype=tf.uint8)
        self.processed = tf.image.rgb_to_grayscale(self.frame)
        self.processed = tf.image.crop_to_bounding_box(self.processed, 34, 0, 160, 160)
        self.processed = tf.image.resize_images(self.processed, [84, 84], method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
    
    def process(self, sess, frame):
        """
        Args:
            sess: A Tensorflow session object
            frame: A (210, 160, 3) frame of an Atari game in RGB
        Returns:
            A processed (84, 84, 1) frame in grayscale
        """
        return sess.run(self.processed, feed_dict={ self.frame:frame})
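
If you want to check the individual steps without starting a TensorFlow session, the same pipeline can be approximated in plain NumPy. This is a rough sketch, not a drop-in replacement: the function name `preprocess`, the luminance weights and the index-based nearest-neighbour down-sampling are assumptions, so pixel values may differ slightly from the TensorFlow version above.

```python
import numpy as np

def preprocess(frame):
    """Turn a (210, 160, 3) RGB frame into an (84, 84, 1) grayscale frame."""
    # Luminance weights (assumed here; the TensorFlow op may round differently)
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame.astype(np.float32) @ weights
    # Keep rows 34..193, the same 160x160 window as crop_to_bounding_box above
    cropped = gray[34:194, :]
    # Nearest-neighbour down-sampling: pick 84 evenly spaced rows and columns
    idx = (np.arange(84) * 160) // 84
    small = cropped[idx][:, idx]
    return small.astype(np.uint8)[:, :, np.newaxis]

dummy = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
print(preprocess(dummy).shape)  # (84, 84, 1)
```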

### Network

Instead of the network architecture described in [Mnih et al. 2013](https://arxiv.org/abs/1312.5602) or [Mnih et al. 2015](https://www.nature.com/articles/nature14236/) I used the dueling network architecture described in [Wang et al. 2016](https://arxiv.org/abs/1511.06581).

![](pictures/dueling.png "Figure 1 in Wang et al. 2016")

Both the [Mnih et al. 2015](https://www.nature.com/articles/nature14236/) and the [Wang et al. 2016](https://arxiv.org/abs/1511.06581) dueling architecture have the same low-level convolutional structure:

>The first convolutional layer has 32 8x8 filters with stride 4, the second 64 4x4 filters with stride 2, and the third and final convolutional layer consists 64 3x3 filters with stride 1.

In the normal DQN architecture (top network in the figure) the *final hidden layer is fully-connected and consists of 512 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action.* (see page 6 of [Mnih et al. 2015](https://www.nature.com/articles/nature14236/)) These outputs are the predicted $Q(s,a;\theta)$-values for action $a$ in state $s$.

Instead of directly predicting a single $Q$-value for each action, the dueling architecture splits the output of the final convolutional layer into two streams, the value stream and the advantage stream. The value stream predicts a *state value* $V(s)$ that depends only on the state, and the advantage stream predicts *action advantages* $A(s,a)$ that depend on the state and the respective action. On page 2 of [Wang et al. 2016](https://arxiv.org/abs/1511.06581) the authors explain:

>Intuitively, the dueling architecture can learn which states are (or are not) valuable, without having to learn the effect of each action for each state. This is particularly useful in states where its actions do not affect the environment in any relevant way. 
In the experiments, we demonstrate that the dueling architecture can more quickly identify the correct action during policy evaluation as redundant or similar actions are added to the learning problem. 

The *state value* $V(s)$ predicts *how good it is to be in a certain state* $s$, and the *action advantage* $A(s,a)$ is a *relative measure of the importance of each action $a$ in the current state $s$*.
I suggest you take a look at figure 2 in [Wang et al. 2016](https://arxiv.org/abs/1511.06581) to better understand what the value- and advantage-stream learn to look at.

Next, we have to combine the value- and advantage-stream into $Q$-values $Q(s,a)$. This is done the following way (equation 9 in [Wang et al. 2016](https://arxiv.org/abs/1511.06581)):

\begin{equation}
Q(s,a) = V(s) + \left(A(s,a) - \frac 1{| \mathcal A |}\sum_{a'}A(s, a')\right)
\end{equation}

Why so complicated instead of just adding $V(s)$ and $A(s,a)$? Let's assume $Q(s,a) = V(s) + A(s,a)$. The $Q$-function measures the value of choosing a particular action when in a particular state. The value function $V$, which is the expected value of $Q$ over all possible actions, measures how good it is to be in this particular state. If you combine $E(Q) = V$ and $Q = V + A$, you find $E(Q) = E(V) + E(A)$. But $V$ does not depend on any action, which means $E(V)=V$ and thus $E(A)=0$: the expected value of the advantage $A(s,a')$ over all possible actions $a'$ has to be zero. The equation shown above ensures this by subtracting the mean of the advantages of all actions from every advantage.
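
A tiny numerical example may help to see the effect of subtracting the mean. The sketch below applies the combination rule to made-up numbers for one state with four actions; the function name `dueling_q` is only illustrative.

```python
import numpy as np

def dueling_q(value, advantage):
    """Combine V(s) of shape (batch, 1) and A(s, a) of shape (batch, actions)."""
    # Center the advantages so that they average to zero over the actions
    centered = advantage - advantage.mean(axis=1, keepdims=True)
    return value + centered

V = np.array([[2.0]])
A = np.array([[1.0, 3.0, -1.0, 1.0]])
Q = dueling_q(V, A)
print(Q)               # [[2. 4. 0. 2.]]
print(Q.mean(axis=1))  # [2.]
```

The mean of the resulting $Q$-values over all actions equals $V(s)$, which is exactly the property derived above.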

In the cell below you find the code that implements this architecture in TensorFlow. Some things to keep in mind: You should normalize the input pixel values to [0,1] by dividing the input by 255, because the frames the environment returns are uint8 arrays with values in the range [0,255]. Make sure you initialize the weights properly with the Xavier initializer. DeepMind used an implementation of the RMSProp optimizer that differs from the one in TensorFlow. Before implementing it myself, I tried the Adam optimizer, which gave promising results without much hyperparameter search. Adam had not been invented when [Mnih et al. 2013](https://arxiv.org/abs/1312.5602) was published, so one could argue that they might have used it instead of RMSProp had it been available. On the other hand, the authors of this [blog post](https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/) compare *Momentum, RMSProp and Adam* and argue:
>Out of the above three, you may find momentum to be the most prevalent, despite Adam looking the most promising on paper. Empirical results have shown the all these algorithms can converge to different optimal local minima given the same loss. However, SGD with momentum seems to find more flatter minima than Adam, while adaptive methods tend to converge quickly towards sharper minima. Flatter minima generalize better than sharper ones.

It might thus be well worth spending some time on playing with different optimizers and implementing the version of RMSProp used by DeepMind. For now, I stick with Adam and if I find some time in the future, I might come back to this.

If you compare the dueling architecture described above to the network implemented in the next cell, you will find a small difference. Instead of two hidden fully connected layers with 512 rectifier units each, one for the value stream and one for the advantage stream, I ended up adding a fourth convolutional layer with 512 filters whose output is then split into two streams. This architecture is suggested [here](https://github.com/awjuliani/DeepRL-Agents/blob/master/Double-Dueling-DQN.ipynb), and after performing some tests on the environment Pong, which is comparatively easy to learn for a DQN agent, I find that this small adjustment lets the reward increase slightly earlier and higher.
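
To see why the fourth convolutional layer can stand in for the fully connected layers, it helps to follow the spatial size of the feature maps through the network. The sketch below uses the standard formula for convolutions with "valid" padding and the kernel sizes and strides from the code cell that follows; the helper name `conv_output_size` is made up for illustration.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of a convolution with "valid" padding."""
    return (size - kernel) // stride + 1

size = 84  # height and width of the input frames
# (kernel size, stride) of conv1 to conv4
for kernel, stride in [(8, 4), (4, 2), (3, 1), (7, 1)]:
    size = conv_output_size(size, kernel, stride)
    print(size)  # prints 20, 9, 7, 1 in turn
```

After the fourth layer the feature map is reduced to a single spatial position with 512 channels, so splitting it along the channel axis gives each stream a vector of 256 features.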

In [3]:
class DQN():
    def __init__(self, hidden=512, learningRate=0.00005):
        self.hidden = hidden
        self.learningRate = learningRate
        
        self.input = tf.placeholder(shape=[None,84,84,4], dtype=tf.float32)
        # Normalizing the input
        self.inputscaled = self.input/255
        
        # Convolutional layers
        self.conv1 = tf.layers.conv2d(
            inputs=self.inputscaled, filters=32, kernel_size=[8,8], strides=4,
            padding="valid", activation=tf.nn.relu, use_bias=False)
        self.conv2 = tf.layers.conv2d(
            inputs=self.conv1, filters=64, kernel_size=[4,4], strides=2, 
            padding="valid", activation=tf.nn.relu, use_bias=False)
        self.conv3 = tf.layers.conv2d(
            inputs=self.conv2, filters=64, kernel_size=[3,3], strides=1, 
            padding="valid", activation=tf.nn.relu, use_bias=False)
        self.conv4 = tf.layers.conv2d(
            inputs=self.conv3, filters=hidden, kernel_size=[7,7], strides=1, 
            padding="valid", activation=tf.nn.relu, use_bias=False)
        
        # Splitting into value- and advantage-stream
        self.valuestream, self.advantagestream = tf.split(self.conv4,2,3)
        self.valuestream = tf.layers.flatten(self.valuestream)
        self.advantagestream = tf.layers.flatten(self.advantagestream)
        self.advantage = tf.layers.dense(
            inputs=self.advantagestream,units=env.action_space.n,
            kernel_initializer=tf.contrib.layers.xavier_initializer())
        self.value = tf.layers.dense(
            inputs=self.valuestream,units=1,kernel_initializer=tf.contrib.layers.xavier_initializer())
        
        # Combining value and advantage into Q-values as described above
        self.Qvalues = self.value + tf.subtract(self.advantage,tf.reduce_mean(self.advantage,axis=1,keepdims=True))
        self.bestAction = tf.argmax(self.Qvalues,1)
        
        # targetQ according to Bellman equation: Q = r + gamma*max Q'
        self.targetQ = tf.placeholder(shape=[None],dtype=tf.float32)
        self.action = tf.placeholder(shape=[None],dtype=tf.int32)
        self.Q = tf.reduce_sum(tf.multiply(self.Qvalues, tf.one_hot(self.action, env.action_space.n, dtype=tf.float32)), axis=1)
        
        self.loss = tf.reduce_mean(tf.losses.huber_loss(labels=self.targetQ, predictions=self.Q))
        self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learningRate)
        self.update = self.optimizer.minimize(self.loss)

## 3. Exploration-exploitation trade-off
If you look at the code in the previous cell, you will find that we are now able to predict the action the network considers best (`self.bestAction`) by taking the argument of the maximum $Q$-value. But initially, the agent does not know how to play the game. If we always exploit and never explore by always choosing the action with the highest $Q$-value (greedy), the agent will stick to the first strategy it discovers that returns a small reward. It then cannot continue exploring the environment and cannot continue to learn. The $\epsilon$-greedy algorithm offers a simple solution to that problem: we usually choose the action the network deems best, but with a probability of $\epsilon$ we choose a random action. $\epsilon$ is a function of the number of frames the agent has seen. For the first 50000 frames the agent only explores ($\epsilon=1$). Over the following 1 million frames, $\epsilon$ is linearly decreased to 0.1, meaning that the agent starts exploiting more and more while it learns. DeepMind then keeps $\epsilon=0.1$; I chose, however, to decrease it to $\epsilon=0.01$ over the remaining frames, as suggested by the [OpenAI Baselines for DQN](https://blog.openai.com/openai-baselines-dqn/) (in the plot the maximum number of frames is 2 million for demonstration purposes).

![](pictures/epsilon.png "See the gnuplot script to find out how to quickly create such a plot")

The method `getAction` in the cell below implements this behaviour: It first calculates $\epsilon$ from the number of the current frame and then either returns a random action (with probability $\epsilon$) or the action the DQN deems best. The variables in the constructor are the slopes and intercepts for the decrease of $\epsilon$ shown in the plot above.
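
Before looking at the class, here is the same schedule written as a stand-alone function, which is easier to test in isolation. It is a simplified sketch with hypothetical names (`epsilon`, `epsilon_greedy`) and interpolates directly instead of precomputing slopes and intercepts; the default numbers are the ones quoted in the text above.

```python
import random

def epsilon(frame, start=50_000, anneal=1_000_000, max_frames=25_000_000,
            eps_initial=1.0, eps_final=0.1, eps_end=0.01):
    """Piecewise-linear exploration rate as a function of the frame number."""
    if frame < start:
        return eps_initial  # pure exploration while the memory fills up
    if frame < start + anneal:
        fraction = (frame - start) / anneal
        return eps_initial + fraction * (eps_final - eps_initial)
    fraction = (frame - start - anneal) / (max_frames - start - anneal)
    return eps_final + min(fraction, 1.0) * (eps_end - eps_final)

def epsilon_greedy(q_values, eps):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

for frame in (0, 50_000, 550_000, 1_050_000, 25_000_000):
    print(frame, epsilon(frame))
```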

In [2]:
class ActionGetter:
    def __init__(self, explorationInitial = 1, explorationFinal = 0.1, explorationInference = 0.01, explorationAnnealingFrames = 1000000, memoryBufferStartSize = 50000, maxFrames = 25000000):
        self.explorationInitial = explorationInitial
        self.explorationFinal = explorationFinal
        self.explorationInference = explorationInference
        self.explorationAnnealingFrames = explorationAnnealingFrames
        self.memoryBufferStartSize = memoryBufferStartSize
        self.maxFrames = maxFrames
        
        # Slopes and intercepts for exploration decrease
        self.m = -(self.explorationInitial - self.explorationFinal)/self.explorationAnnealingFrames
        self.b = self.explorationInitial - self.m*self.memoryBufferStartSize
        self.m2 = -(self.explorationFinal - self.explorationInference)/(self.maxFrames - self.explorationAnnealingFrames - self.memoryBufferStartSize)
        self.b2 = self.explorationInference - self.m2*self.maxFrames

    def getAction(self, frameNumber, state, inference=False):
        """
        Args:
            frameNumber: An integer determining the number of the current frame
            state: A (84, 84, 4) sequence of frames of an Atari game in grayscale
            inference: A boolean saying whether the agent is being evaluated (True) or is learning (False)
        Returns:
            An integer between 0 and env.action_space.n - 1 determining the action the agent performs next
        """
        # Check inference first: the frame-number branches cover every frame,
        # so a trailing elif for inference would never be reached
        if inference:
            e = self.explorationInference
        elif frameNumber < self.memoryBufferStartSize:
            e = self.explorationInitial
        elif frameNumber < self.memoryBufferStartSize + self.explorationAnnealingFrames:
            e = self.m*frameNumber + self.b
        else:
            e = self.m2*frameNumber + self.b2
        if np.random.rand(1) < e:
            return np.random.randint(0, env.action_space.n)
        else:
            return sess.run(mainDQN.bestAction, feed_dict={mainDQN.input:[state]})[0]       

In [5]:
class TargetNetworkUpdater:
    def __init__(self, mainDQNVars, targetDQNVars):
        self.mainDQNVars = mainDQNVars
        self.targetDQNVars = targetDQNVars

    def _updateTargetVars(self):
        updateOps = []
        for i, var in enumerate(self.mainDQNVars):
            op = self.targetDQNVars[i].assign(var.value())
            updateOps.append(op)
        return updateOps
            
    def updateNetworks(self, sess):
        """
        Copies the weights of the main DQN to the target DQN
        Args:
            sess: A Tensorflow session object
        """
        updateOps = self._updateTargetVars()
        for op in updateOps:
            sess.run(op)

In [12]:
class MemoryBuffer:
    def __init__(self, size = 1000000, frameHeight=84, frameWidth=84, agentHistoryLength = 4, batchSize = 32):
        self.size = size
        self.frameHeight = frameHeight
        self.frameWidth = frameWidth
        self.agentHistoryLength = agentHistoryLength
        self.batchSize = batchSize
        self.count = 0
        self.current = 0
        
        # Pre-allocate memory
        self.actions = np.empty(self.size, dtype=np.int32)
        self.rewards = np.empty(self.size, dtype=np.float32)
        self.frames = np.empty((self.size, self.frameHeight,self.frameWidth), dtype=np.uint8)
        self.terminalFlags = np.empty(self.size, dtype=bool)  # np.bool was removed in NumPy 1.24
        
        # Pre-allocate memory for the States and newStates in a minibatch
        self.states = np.empty((self.batchSize, self.agentHistoryLength, self.frameHeight, self.frameWidth), dtype=np.uint8)
        self.newStates = np.empty((self.batchSize, self.agentHistoryLength, self.frameHeight, self.frameWidth), dtype=np.uint8)
        self.indices = np.empty(self.batchSize, dtype=np.int32)
        
    def addExperience(self, action, frame, reward, terminal):
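        """
        Args:
            action: An integer encoding the action performed
            frame: A (84, 84) frame of an Atari game in grayscale
            reward: A float determining the reward the agent received for performing the action
            terminal: A bool stating whether the episode terminated
        """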
        if frame.shape != (self.frameHeight, self.frameWidth):
            raise ValueError('Dimension of frame is wrong!')
        self.actions[self.current] = action
        self.frames[self.current,...] = frame
        self.rewards[self.current] = reward
        self.terminalFlags[self.current] = terminal
        self.count = max(self.count, self.current+1)
        self.current = (self.current + 1) % self.size
             
    def _getState(self, index):
        if self.count == 0:
            raise ValueError("The replay memory is empty!")
        if index < self.agentHistoryLength - 1:
            raise ValueError("Index must be min 3")
        return self.frames[index-self.agentHistoryLength+1:index+1,...]
        
    def _getValidIndices(self):
        for i in range(self.batchSize):
            while True:
                index = random.randint(self.agentHistoryLength, self.count - 1)
                if index < self.agentHistoryLength:
                    continue
                if index >= self.current and index - self.agentHistoryLength <= self.current:
                    continue
                if self.terminalFlags[index - self.agentHistoryLength:index].any():
                    continue
                break
            self.indices[i] = index
            
    def getMinibatch(self):
        """
        Returns:
            A minibatch of batchSize transitions: states, actions, rewards, newStates and terminalFlags
        """
        if self.count < self.agentHistoryLength:
            raise ValueError('Not enough memories to get a minibatch')
        
        self._getValidIndices()
            
        for i, idx in enumerate(self.indices):
            self.states[i] = self._getState(idx - 1)
            self.newStates[i] = self._getState(idx)
        
        return np.transpose(self.states,axes=(0,2,3,1)), self.actions[self.indices], self.rewards[self.indices], np.transpose(self.newStates,axes=(0,2,3,1)), self.terminalFlags[self.indices]
                

In [7]:
def learn():
    """
    Draws a minibatch from the memory buffer, computes the target Q-values
    and performs one update step on the main DQN
    """
    states, actions, rewards, newStates, terminalFlags = myMemoryBuffer.getMinibatch()    
    argQmax = sess.run(mainDQN.bestAction, feed_dict={mainDQN.input:newStates})
    Qvals = sess.run(targetDQN.Qvalues, feed_dict={targetDQN.input:newStates})
    
    doubleQ = Qvals[range(bs), argQmax]
    # Bellman equation
    targetQ = rewards + (discountFactor*doubleQ * (1-terminalFlags))
    _ = sess.run(mainDQN.update,feed_dict={mainDQN.input:states,mainDQN.targetQ:targetQ, mainDQN.action:actions})
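
The function `learn` selects the best next action with `mainDQN` but takes its value from `targetDQN`, which is the idea usually referred to as double Q-learning. The target computation can be tried out on its own with plain NumPy; the following is a sketch with made-up numbers, and the function name `double_q_targets` is hypothetical.

```python
import numpy as np

def double_q_targets(rewards, terminal, q_main_next, q_target_next, gamma=0.99):
    """Targets r + gamma * Q_target(s', argmax_a Q_main(s', a)) for a batch."""
    best = np.argmax(q_main_next, axis=1)                   # action selection
    evaluated = q_target_next[np.arange(len(best)), best]   # action evaluation
    # Terminal transitions get no bootstrap term
    return rewards + gamma * evaluated * (1.0 - terminal.astype(np.float32))

rewards = np.array([1.0, 0.0], dtype=np.float32)
terminal = np.array([False, True])
q_main_next = np.array([[0.2, 0.8], [0.5, 0.1]], dtype=np.float32)
q_target_next = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
print(double_q_targets(rewards, terminal, q_main_next, q_target_next))
```

For the first transition the target is $1 + 0.99 \cdot 2 = 2.98$; the second transition is terminal, so its target is just the reward of 0.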

In [8]:
def generateGif(sess, frameNumber, framesForGif, reward):
    """
        Args:
            sess: A Tensorflow session object
            frameNumber: An integer determining the number of the current frame
            framesForGif: A list of (210, 160, 3) frames of an Atari game in RGB
            reward: The total reward of the episode that is shown in the gif
    """
    for idx,frame_idx in enumerate(framesForGif): 
        framesForGif[idx] = resize(frame_idx,(420,320,3),preserve_range=True, order=0).astype(np.uint8)
        
    imageio.mimsave(f'{PATH}ATARI_frame_{frameNumber}_reward_{reward}.gif', framesForGif, duration=1/30)

In [4]:
# Control parameter
maxEpisodeLength = 18000

targetNetworkUpdateFreq = 10000
discountFactor = 0.99
memoryBufferStartSize = 50000
maxFrames = 25000000
memorySize = 1000000
noOpSteps = 20
gifFreq = 50

hidden = 512
learningRate = 0.00001
bs = 32

PATH = "output/"
os.makedirs(PATH,exist_ok=True)

env = gym.make('BreakoutDeterministic-v0')
print("The environment has {} possible actions {}".format(env.action_space.n, env.unwrapped.get_action_meanings()))

The environment has 4 possible actions ['NOOP', 'FIRE', 'RIGHT', 'LEFT']


If you want to make sure, that the environment ...
env = gym.make('BreakoutDeterministic-v0').env.ale
env.getFloat('repeat_action_probability')
should be 0.25 like in ALE


In [10]:
tf.reset_default_graph()

myMemoryBuffer = MemoryBuffer(size=memorySize, batchSize=bs)
mainDQN = DQN(hidden, learningRate)
targetDQN = DQN(hidden)
variables = tf.trainable_variables()
mainDQNVars = variables[0:len(variables)//2]
targetDQNVars = variables[len(variables)//2:]

NetworkUpdater = TargetNetworkUpdater(mainDQNVars, targetDQNVars)
frameProcessor = processFrame()
actionGetter = ActionGetter(memoryBufferStartSize=memoryBufferStartSize, maxFrames=maxFrames)

init = tf.global_variables_initializer()
saver = tf.train.Saver()
restore = False

with tf.Session() as sess:
    sess.run(init)
    if restore == True:
        saver.restore(sess,tf.train.latest_checkpoint(PATH))

    frameNumber = 0
    episodeNumber=0
    rewards = []
    
    while frameNumber < maxFrames:
        if episodeNumber % gifFreq == 0: 
            framesForGif = []
        
        frame = env.reset()
        terminal = False
        terminal2 = False
        lastLives = 0
        
        # No op steps
        for _ in range(random.randint(1, noOpSteps)):
            frame, _, _, _ = env.step(0)
            
        processedFrame = frameProcessor.process(sess,frame)
        state = np.repeat(processedFrame,4, axis=2)
        episodeRewardSum = 0
        
        for j in range(maxEpisodeLength):
            action = actionGetter.getAction(frameNumber,state)
            newFrame, reward, terminal, info = env.step(action)
            
            # Pass terminal=True to Memory if a life was lost
            if info['ale.lives'] < lastLives:
                terminal2 = True
            else:
                terminal2 = terminal
            lastLives = info['ale.lives']
            
            if episodeNumber % gifFreq == 0: 
                framesForGif.append(newFrame)
                        
            processedNewFrame = frameProcessor.process(sess,newFrame)
            newState = np.append(state[:,:,1:],processedNewFrame,axis=2)

            frameNumber += 1
            
            # Add current experience to Memory
            myMemoryBuffer.addExperience(action=action, frame=processedNewFrame[:,:,0], reward=reward, terminal=terminal2)
            
            if frameNumber > memoryBufferStartSize:
                learn()
            
            if frameNumber % targetNetworkUpdateFreq == 0 and frameNumber > memoryBufferStartSize:
                NetworkUpdater.updateNetworks(sess)
            
            episodeRewardSum += reward
            state = newState
            
            if terminal == True:
                break
                
        rewards.append(episodeRewardSum)
        if episodeNumber % gifFreq == 0: 
            generateGif(sess, frameNumber, framesForGif, episodeRewardSum)
            saver.save(sess,PATH+'/my_model',global_step=frameNumber)
        if episodeNumber % 10 == 0:
            print(episodeNumber, frameNumber,np.mean(rewards[-100:]), j)
            with open('rewards.dat','a') as f:
                print(episodeNumber, frameNumber,np.mean(rewards[-100:]), j,file=f)
        episodeNumber += 1


0 188 1.0 187
10 2049 1.2727272727272727 136
20 3897 1.2857142857142858 251
30 5598 1.1612903225806452 140
40 7362 1.146341463414634 231
50 9185 1.1372549019607843 225
60 10877 1.0819672131147542 227
70 12716 1.1126760563380282 171
80 14616 1.1481481481481481 176
90 16575 1.1978021978021978 165
100 18582 1.26 151
110 20453 1.28 213
120 22173 1.24 154
130 24253 1.34 237
140 26191 1.36 338
150 28176 1.41 160
160 30142 1.48 221
170 31817 1.44 160
180 33924 1.5 171
190 35652 1.44 207
200 37566 1.39 280
210 39265 1.34 153
220 41205 1.4 206
230 42924 1.32 241
240 44775 1.32 186
250 46938 1.42 318
260 48667 1.37 148
270 50548 1.42 395
280 52307 1.32 257
290 54329 1.39 146
300 56291 1.41 238
310 58420 1.53 209
320 60135 1.46 164
330 62174 1.5 249


KeyboardInterrupt: 

# Inference

In [None]:
tf.reset_default_graph()
frameProcessor = processFrame()
mainDQN = DQN(hidden, learningRate)
targetDQN = DQN(hidden)
# Create the initializer only after the networks exist in the new graph
init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)
    saver.restore(sess,tf.train.latest_checkpoint(PATH))
    framesForGif = []
    terminal = False
    frame = env.reset()
    processedFrame = frameProcessor.process(sess,frame)
    state = np.repeat(processedFrame,4, axis=2)
    episodeRewardSum = 0
    
    while not terminal:
        action = actionGetter.getAction(1,state,inference=True)
        newFrame, reward, terminal, _ = env.step(action)
            
        framesForGif.append(newFrame)
                        
        processedNewFrame = frameProcessor.process(sess,newFrame)
        newState = np.append(state[:,:,1:],processedNewFrame,axis=2)


        episodeRewardSum += reward
        state = newState
    print("Total reward: %s" % episodeRewardSum)
    generateGif(sess,0, framesForGif, episodeRewardSum)