# Math

In this notebook we are going to finally discuss Deep Q-Learning 

---

<h3>Deep Q-Learning</h3>

What we should find interesting is that aside from a few miscellaneous modifications most of this stuff we have already seen before

We've been set up so that we already know all the different components that are going to go into this

Specifically Deep Q-Learning can be used for playing much more complex games like Atari videogames using techniques we already know plus a few minor modifications

---

<h3>Practical Issues</h3>

But first let's discuss some practical issues

One thing to note is that actually training our agent is going to take a very long time

Why might this be?

Well first consider the size of the input or state space

CartPole has four inputs and mountain car has two inputs

If we look at the state space of the Atari game Breakout we see that it's a box of shape (210,160,3)

Let's get rid of the 3 color channels to turn our screen into a grayscale image

So now it's just 210x160

Now if we're familiar with typical screen resolutions we know intuitively that 210x160 is a very modestly sized screen

So this is already way smaller than a typical video game but at the same time 210x160 is 33,600

So we're going from four inputs to over 30,000 inputs into a neural network

This is going to take much longer to train since the patterns that have to be learned are much more complex
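To make the jump in input size concrete, here is a small arithmetic sketch. The variable names are just for illustration; the numbers are the ones quoted above.

```python
# Number of input values per observation for each environment.
cartpole_inputs = 4
mountain_car_inputs = 2

# Breakout frames are 210 x 160 pixels with 3 color channels.
height, width, channels = 210, 160, 3
breakout_color_inputs = height * width * channels
breakout_gray_inputs = height * width  # after dropping the color axis

print(breakout_color_inputs)                    # 100800
print(breakout_gray_inputs)                     # 33600
print(breakout_gray_inputs // cartpole_inputs)  # 8400 times more inputs than CartPole
```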

---

So we expect the training process to take time (maybe not that much aided with Colab's GPU)

But we are not going to let this stop us from writing code

So first and foremost we are still going to discuss the concepts and put them into code as usual 

In that respect, the notebook is going to proceed as normal

If the code takes a very long time to run, we will run it for a while and look for a little bit of improvement

This is good enough since understanding the concepts and putting them into code is the important part

# Math

In this section we're going to discuss techniques that allow deep learning and Q-learning to work together

---

<h3>Deep Q-Learning Techniques</h3>

Previously we discussed that we actually already have all the tools we need to implement Q-Learning with a deep neural network

Remember, our update equations are in terms of arbitrary gradients and therefore we can use any model we want as long as we can differentiate the output with respect to the parameters

$$\theta = \theta + \alpha(r+ \gamma \max_{a'} Q(s',a') - Q(s,a)) \nabla_\theta Q(s,a)$$

And of course we don't need to derive the gradients by hand since tensorflow has automatic differentiation 
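As a minimal sketch of what this update does, here it is written by hand for a linear model, where the gradient is easy to read off. The function names, the learning rate and the toy numbers are assumptions for illustration only, not part of the notebook's code.

```python
import numpy as np

def q_values(theta, s):
    """Linear model with one row of weights per action: Q(s, a) = theta[a] . s"""
    return theta @ s

def q_learning_update(theta, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One semi-gradient Q-learning step; returns an updated copy of theta."""
    target = r if done else r + gamma * np.max(q_values(theta, s_next))
    td_error = target - q_values(theta, s)[a]
    new_theta = theta.copy()
    # For a linear model the gradient of Q(s, a) with respect to theta[a] is just s.
    new_theta[a] += alpha * td_error * s
    return new_theta

theta = np.zeros((2, 3))             # 2 actions, 3 state features
s = np.array([1.0, 0.0, 2.0])
s_next = np.array([0.0, 1.0, 0.0])
theta = q_learning_update(theta, s, a=1, r=1.0, s_next=s_next, done=False)
print(theta)  # only the row for action 1 has changed
```

With a neural network the only difference is that the gradient of Q(s,a) comes from automatic differentiation instead of being written out by hand.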

Had we tried this by now, we would have seen firsthand that it doesn't seem to converge to a good solution (we skip the demonstration since it doesn't work anyway)

So in this section we are going to discuss some modifications we can make to the basic Q-Learning process that will make this work

---

<h3>Deep Q-Networks</h3>

First we may have heard of the term deep Q-network and wondered how this is different from a regular deep neural network

Well it is in fact not at all different from the stuff we've seen so far

Really it's just a regular deep neural network that we're using for Q-Learning or in other words it's being used as a function approximator for the action value function

There is an important subtlety that we have already seen in our implementation of function approximation for Q: rather than transforming the state-action pair into a feature, we transform only the state into an input feature and have multiple output nodes, each representing the value for a different action

This is a Deep-Q Network or DQN
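The "state in, one value per action out" layout can be sketched with a tiny NumPy network. The layer sizes and random weights below are hypothetical; the point is only the shape of the output.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, hidden_dim, n_actions = 4, 16, 2   # hypothetical sizes

# Randomly initialised weights of a tiny two-layer network.
W1 = rng.normal(size=(state_dim, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
W2 = rng.normal(size=(hidden_dim, n_actions)) * 0.1
b2 = np.zeros(n_actions)

def q_network(states):
    """Map a batch of states (N, state_dim) to Q-values (N, n_actions)."""
    hidden = np.maximum(0.0, states @ W1 + b1)  # ReLU hidden layer
    return hidden @ W2 + b2                     # one linear output per action

states = rng.normal(size=(5, state_dim))
q = q_network(states)
greedy_actions = np.argmax(q, axis=1)  # best action for each state in the batch
print(q.shape)  # (5, 2)
```

One forward pass gives the values of all actions at once, which is convenient for both the greedy action and the max in the TD target.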

---

<h3>Experience Replay</h3>

The next trick we use is called experience replay

We've alluded to this multiple times in the past

We can think of Monte-Carlo methods as a tiny step towards experience replay because when we use Monte-Carlo we obtain a set of state action return triples that we then use to train the function approximator all at the same time once the episode is over 

With experience replay we take this a step further

We introduce what is called a replay buffer 

The size of the buffer is chosen by the programmer and is considered to be a hyperparameter 

Inside the replay buffer, we store 4-tuples of state, action, reward and next state

To train the Deep Q network, we sample a random mini-batch from the replay buffer and use that as training data

The buffer basically acts as a queue

In computer science terms that means first in first out

So the buffer always contains the most recent 4-tuples
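A minimal sketch of such a buffer, assuming we are happy to store whole tuples in a `collections.deque`. The class name and method names are hypothetical, and the integer "states" are stand-ins for real observations.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, max_size):
        # A deque with maxlen drops the oldest item automatically when full.
        self.buffer = deque(maxlen=max_size)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random mini-batch, sampled without replacement.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(max_size=3)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1)

print(len(buffer))                           # 3
print([s for s, _, _, _ in buffer.buffer])   # [2, 3, 4]
batch = buffer.sample(batch_size=2)
```

This is fine for small states, but it is not how we will store Atari frames later, since each state there is large and overlaps heavily with its neighbors.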

---

By taking random samples from past experience, we get a better representation of the true data distribution

This is better than the default update on consecutive samples, where there can be hidden patterns and correlations that will affect how the neural network learns

Imagine if we are trying to classify pictures of animals but instead of our training data being an even mix of animals we first only have cats in our first batch then we only have dogs in our second batch and so on

This wouldn't work very well

---

Since learning happens on every step we might be wondering what happens at the beginning when there hasn't been any actual experience to replay

The answer to this is that initially we store random experience in the buffer 

For instance, we can keep playing episodes and doing random actions until the buffer is full

---

<h3>Semi-Gradients</h3>

The next trick is related to something we learned about in the previous reinforcement learning notebooks

Recall that Q-Learning is a temporal difference method

Temporal difference methods use the value function itself as the target return

This is weird because in regular supervised learning the target is always given as part of the training data, we would never have to estimate it 

Because of this, when we do gradient descent, it's not a true gradient, we call it a semi-gradient

For Deep Q-Learning this also leads to instability

---

<h3>Dual Networks</h3>

The solution to this is to introduce another Deep Q Network

We call this the target network, and as we might suspect this network will be responsible for creating the TD target 

The target network is essentially a copy of the main Q-Network but it isn't updated as often

So for example, we might keep the target network constant for 100 time steps while continuing to update the deep Q-network

Only every 100 steps, we'll copy the parameters of the main Deep Q Network into the target network

This helps stabilize the targets
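Here is a rough sketch of the copying schedule, using a plain dictionary of NumPy arrays as a stand-in for network weights. The names `copy_params` and `COPY_PERIOD`, and the fake "gradient step", are assumptions for illustration.

```python
import numpy as np

def copy_params(params):
    """Return an independent copy of a dict of weight arrays."""
    return {name: value.copy() for name, value in params.items()}

main_params = {"W": np.zeros((4, 2)), "b": np.zeros(2)}
target_params = copy_params(main_params)

COPY_PERIOD = 100  # hypothetical; matches the "every 100 steps" example
n_syncs = 0

for step in range(1, 251):
    # Stand-in for a gradient step on the main network.
    main_params["W"] += 0.01
    if step % COPY_PERIOD == 0:
        target_params = copy_params(main_params)
        n_syncs += 1

print(n_syncs)                   # 2
print(target_params["W"][0, 0])  # weights frozen at step 200
print(main_params["W"][0, 0])    # weights at step 250
```

Note that the copy must be a real copy of the arrays; if both names pointed at the same arrays, the target network would silently follow every update.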

---

<h3>Dealing with Images</h3>

The next technique isn't a trick per se but it has something to do with the fact that the states are raw pixels of the screen, these are images of course 

As we know the way we deal with images in deep learning is to use convolutional neural networks

One problem with still images that we saw before is that we can't tell how things are moving 

For example in Breakout, we can see the ball but we can't determine its velocity from just one image

<img src="https://th.bing.com/th/id/R.0c95a091403723fb06522c23339f8f18?rik=UV0PRkTjEgwtow&riu=http%3a%2f%2fgym.openai.com%2fv2018-02-21%2fvideos%2fBreakout-ram-v0-801e980f-733c-42df-9a47-1cd6fcc309ec%2fposter.jpg&ehk=iGyzxSTfZQgnVwPtYi%2fcAQxgdFWBXYrqUFhpBV%2babjw%3d&risl=&pid=ImgRaw&r=0" extras="700">

Therefore we can't use just this one image

But remember what we learned in previous reinforcement learning notebooks about the Markov property

It may seem limiting because it says that the current state depends only on the most recent previous state, but it is not necessarily limiting since we can be the ones who define the state

---

In other words suppose we are using Markov models for next word prediction

We were given the word 'the' and we were trying to predict the next word

This is of course not an easy task because many words can possibly come after 'the' 

One solution to this is to use a higher order Markov model so that the current word depends on a few previous words rather than just one 

---

Another solution is to simply define each state to be three or four words in a row

So now the previous state would be 'the quick brown fox' and we pretty much know that the next state is going to be 'quick brown fox jumps'

---

So this is the same thing we do in Deep Q-Learning

We use a Convolutional Neural Network but instead of using just one most recent frame we use the previous four frames

In addition we also convert the image to grayscale since as we can imagine color information isn't that useful for games (depends on game)

So we start out with a 4D tensor where the dimensions are frame number, height, width and color (4,H,W,C)

And we take the average along the color axis, so we end up with frame number, height and width (4,H,W)

This is a 3D tensor but we're already used to working with 3D tensors in convolutional neural networks because the third dimension is usually color

Now we're simply replacing color with time

This is irrelevant from the perspective of the neural network itself since all it sees is a 3-D tensor and it's doing convolution along the width and height dimensions

But it's nice to know that we already have all the tools we need

We don't need to build a special kind of convolution or anything like that
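The shape bookkeeping above can be checked with a few lines of NumPy. The random frames are placeholders for real screen captures.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 210, 160, 3

# Four consecutive RGB frames: shape (4, H, W, C), pixel values 0..255.
frames = rng.integers(0, 256, size=(4, H, W, C), dtype=np.uint8)

# Average over the color axis to get grayscale: shape (4, H, W).
state = frames.mean(axis=-1)

print(frames.shape)  # (4, 210, 160, 3)
print(state.shape)   # (4, 210, 160)
```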

---

<h3>Summary</h3>

So let's do a summary of all the modifications we just talked about

First we are going to build a Deep Q Network where each output node corresponds to a value for a specific action

This is as opposed to considering the action as an input into the neural network

Second we store four tuples of state, action, reward and next state in a buffer that we randomly sample for training

Third we use a target network that is a periodically updated version of the main Q-network

This target network is for generating the target for the TD error

And finally we are going to use a convolutional neural network with grayscale images and the past four frames to represent each input state

---

So what's interesting about these changes?

Well each of these changes isn't anything particularly challenging to implement

We know how to create a neural network with multiple output nodes, we know how to store things in an array or a list which will be a buffer

We know how to build a convolutional neural network

The only interesting challenge is making a copy of a neural network which is something we haven't done before

# code

# Math

In this section we are going to discuss some extra implementation details that we are going to need for implementing deep learning for Breakout

---

<h3>Extra Details for Breakout-v0</h3>

First we talked about how the input is a lot bigger than we're used to 

Recall that we're stacking four frames as the state 

Each frame is of size 210x160x3 but we can get rid of the 3 and make the image grayscale

So in total we have 210x160x4 for the size of each state which is 134,400 inputs which is a lot

The nice thing about Atari games is that their graphics are pretty simple so we can downsample the image without losing much useful information

In fact there are even parts of the image that we don't really need for making decisions at all so we can crop the image too

---

<h3>Let's Look at a frame</h3>

Let's inspect the frame so we can see what we mean

First we want to import gym, create the environment and then get the first state

```python
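import gym
env = gym.make('Breakout-v0') # the environment named in the heading above
A = env.reset() # first frame (older gym API; newer versions return (observation, info))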
```

Remember that `env.observation_space.sample()` will just give us uniformly sampled pixels so that won't help

---

<h3>Showing the frame</h3>

Next we want to import matplotlib so we can plot the image and see the first frame of Breakout

```python
import matplotlib.pyplot as plt
plt.imshow(A)
plt.show()
```

<img src="extras/63.0.PNG" width="400">

So we can see that we don't really need the top of the frame or the bottom of the frame underneath the paddle

A good idea is to try this and zoom in to see which parts we can crop out

---

<h3>Crop</h3>

So we choose pixel 31 and pixel 195 along the height dimension

So let's crop that out and see what it looks like

```python
B = A[31:195]
plt.imshow(B)
plt.show()
```

<img src="extras/63.1.PNG" width="400">

We can see that by doing this, we still have all the information we need to play the game

---

<h3>Bad downsample</h3>

Next we mentioned downsampling

So let's try that next

Luckily `scipy` has a function specifically for resizing images called `imresize` (it was removed in SciPy 1.3, so this code needs an older version)

```python
from scipy.misc import imresize
C = imresize(B,size=(105,80,3))
```

<img src="extras/63.2.PNG" width="400">

Notice how 105x80 is half the size of the original 210x160 frame along each dimension

This looks kind of blurry but not for the reason we think 

The reason is because image downsampling isn't just subsampling the original, it does some fancy math in between

---

<h3>Nearest neighbor interpolation</h3>

So we can do better by choosing a different interpolation method, specifically nearest neighbor

```python
C = imresize(B,size=(105,80,3),interp='nearest')
plt.imshow(C)
plt.show()
```

<img src="extras/63.3.PNG" width="400">

So this looks much better
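Since `scipy.misc.imresize` is not available in recent SciPy versions, here is a library-free sketch of nearest neighbor resizing with plain NumPy indexing. The helper `resize_nearest` is hypothetical and is not the same implementation as `imresize`; the random frame is a stand-in for a real Breakout frame.

```python
import numpy as np

def resize_nearest(image, new_height, new_width):
    """Nearest neighbor resize: every output pixel copies one input pixel."""
    height, width = image.shape[:2]
    rows = np.arange(new_height) * height // new_height  # source row per output row
    cols = np.arange(new_width) * width // new_width     # source column per output column
    return image[rows[:, None], cols]

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)  # stand-in frame
cropped = frame[31:195]                     # shape (164, 160, 3)
small = resize_nearest(cropped, 105, 80)    # shape (105, 80, 3)
print(cropped.shape, small.shape)
```

Because pixels are only copied and never averaged, no new in-between colors appear, which is why the result does not look blurry.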

---

<h3>Make it a square</h3>

Finally it's kind of convenient for us to treat images as squares with equal width and height

```python
C = imresize(B,size=(80,80,3),interp='nearest')
plt.imshow(C)
plt.show()
```

<img src="extras/63.4.PNG" width="400">

We might wonder if distorting the image like this might result in us losing information but this seems to look OK as well

---

<h3>Other</h3>

The other image processing steps are grayscaling, which is just taking the mean along the color axis, and scaling the pixel values to between 0 and 1, both of which we've seen before

All these steps help us save space in the experience replay buffer (actually we could save even more space by storing the images as uint8, then dividing each batch by 255 before feeding it to the model)
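A small sketch of that idea: keep frames as `uint8` in the buffer and only convert to floats in [0, 1] when building a training batch. The function names are hypothetical and the random frame is a placeholder.

```python
import numpy as np

def to_grayscale_uint8(frame_rgb):
    """Average the color channels and keep the result as uint8 for storage."""
    return frame_rgb.mean(axis=2).astype(np.uint8)

def to_model_input(batch_uint8):
    """Convert stored uint8 frames to floats in [0, 1] right before training."""
    return batch_uint8.astype(np.float32) / 255.0

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(80, 80, 3), dtype=np.uint8)
stored = to_grayscale_uint8(frame)
model_input = to_model_input(stored[None])  # add a batch axis

print(stored.nbytes)                     # 6400 bytes per frame as uint8
print(stored.astype(np.float32).nbytes)  # 25600 bytes per frame as float32
```

So storing `uint8` takes a quarter of the memory of `float32`, which adds up quickly over a large replay buffer.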

---

<h3>Tensorflow Layers</h3>

The next change we are going to talk about is using built-in tensorflow layers

Recall from convolutional neural networks, that in order to connect the convolution layers and the fully connected layers we need to flatten the final convolution output

But to make the first hidden layer weight the right size, we need to know what size the final convolution output is

This isn't a trivial task in general

So using built in layers helps make this less painful

In addition these are part of the official tensorflow library, so they are likely to be faster than anything we build ourselves

Speed really matters here, so it's good to use built in stuff if possible 

Tensorflow has functions for both convolutional and fully connected layers
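To see why the flattened size is a bit fiddly to work out by hand, here is a sketch that applies the standard convolution output-size formula layer by layer. The layer list (filters, kernel size, stride) is an assumed example architecture with no padding, not necessarily the one used in the code.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Output length along one axis of a convolution (floor division)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# Hypothetical architecture: (number of filters, kernel size, stride) per layer.
conv_layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size = 80  # square 80x80 input
sizes = []
for n_filters, kernel_size, stride in conv_layers:
    size = conv_output_size(size, kernel_size, stride)
    sizes.append(size)

flattened = size * size * conv_layers[-1][0]
print(sizes)      # [19, 8, 6]
print(flattened)  # 2304
```

Built-in layers do this bookkeeping for us, so we never have to hard-code a number like 2304.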

---

<h3>Epsilon Decay</h3>

Next is a minor detail but worth mentioning 

In the original Deep Q Learning paper they decreased Epsilon linearly, starting from 1 down to 0.1 over some number of steps

After that they kept Epsilon at 0.1

<img src="extras/63.5.PNG" width="400">

So we'll also take this approach
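A minimal sketch of this schedule is below. The function name and the default `decay_steps` value are assumptions for illustration; the start and end values are the ones mentioned above.

```python
def linear_epsilon(step, eps_start=1.0, eps_end=0.1, decay_steps=500_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it there."""
    if step >= decay_steps:
        return eps_end
    fraction = step / decay_steps
    return eps_start + fraction * (eps_end - eps_start)

print(linear_epsilon(0))          # 1.0 at the start
print(linear_epsilon(250_000))    # roughly halfway between 1.0 and 0.1
print(linear_epsilon(2_000_000))  # 0.1 once the decay has finished
```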

---

<h3>Hyperparameters</h3>

The paper contains details about the hyperparameters used, we may end up tweaking them

# Math

In this section, we are going to discuss the pseudocode for Deep Q Learning in Atari

---

<h3>Pseudocode for Atari Deep Q-learning</h3>

One of the key tricks in this section is how to implement the replay memory efficiently

So that's the main non-trivial thing we will discuss in this section

---

<h3>Top-Down Perspective</h3>

From a top down perspective, what we're doing is actually very simple

First we create three things: our environment, our target network and our main network

Then in a loop we just play the game for a predetermined number of episodes

```
Create environment
Create main network
Create target network
Loop for predetermined # episodes:
  Play episode
```

---

<h3>Playing the Game</h3>

Now we can look more in-depth at what playing the game for each episode entails

```python
while not done:
  action = model.get_action(state, epsilon) # select action (epsilon-greedy)
  next_frame, reward, done, _ = env.step(action) # perform action (older gym API returns 4 values)
  next_state = make_state(state, next_frame) # make next state
  # update experience replay memory
  experience_replay.add(state, action, reward, next_state, done)

  learn() # train the model
  state = next_state # carry the new state into the next iteration
```

As we've seen this means we're going to run a loop that quits when we see the done flag 

Inside the loop, we're going to take an action depending on the current state 

Of course we use the neural network to determine which action to take using an epsilon greedy policy 

This gives us back the next frame, the reward and the done flag 

Using the next frame we can append that to our state to get the next state for the next iteration of the loop

We also want to store the SARS' tuple in our experience replay memory

Lastly on each step we're going to call the learn function

We also want to make sure that we periodically copy the weights from the main network to the target network
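The "append the next frame to our state" step can be sketched as follows, assuming a state is a `(4, H, W)` array with the oldest frame first. This `make_state` is an illustrative guess at the helper used in the loop above, not the notebook's actual implementation.

```python
import numpy as np

def make_state(state, next_frame):
    """Drop the oldest frame and append the newest one.

    state:      array of shape (4, H, W), oldest frame first
    next_frame: array of shape (H, W)
    """
    return np.concatenate([state[1:], next_frame[None]], axis=0)

# Fill each frame with its own index so the shift is easy to see.
state = np.stack([np.full((80, 80), i, dtype=np.uint8) for i in range(4)])
next_frame = np.full((80, 80), 4, dtype=np.uint8)
next_state = make_state(state, next_frame)

print(next_state.shape)     # (4, 80, 80)
print(next_state[:, 0, 0])  # [1 2 3 4]
```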

---

<h3>Learning</h3>

So what goes inside the learn function?

```
def learn():
  (s, a, r, s_prime, done) batch = experience_replay.get_batch()
  create the targets for each sample in the batch
```

$$\text{target} = r+\gamma \max_{a'} Q(s',a') \quad \text{if } s' \text{ is not terminal}$$

$$\text{target} = r \quad \text{if } s' \text{ is terminal}$$

```
  do a step of gradient descent
```

$$\Delta \theta = -\alpha \frac{\partial}{\partial \theta} (\text{target} - Q(s,a))^2$$

Well this is where we actually train our model 

In this function, we're going to grab a random batch of samples from our experience replay memory

This is a batch of SARS' and done tuples

Recall that the s's form the inputs into the neural network

The targets are calculated using the target network 

In particular the target for a given r and s' is $r + \gamma \max_{a'} Q(s',a')$

It's important to remember that if the done flag is true that means s' is a terminal state in which case the target is just r, the reward 

Then we update the parameters of the main network using the squared error between the prediction and the target
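The target computation, including the terminal-state case, can be sketched for a whole batch at once. Here `next_q_values` stands for the target network's outputs at s', and the function name and toy numbers are assumptions for illustration.

```python
import numpy as np

def compute_targets(rewards, next_q_values, dones, gamma=0.99):
    """TD targets for a batch.

    rewards:       shape (N,)
    next_q_values: shape (N, n_actions), target network outputs at s'
    dones:         shape (N,), True where s' is terminal
    """
    max_next_q = next_q_values.max(axis=1)
    not_done = 1.0 - dones.astype(np.float32)  # zeroes out the bootstrap term at terminal states
    return rewards + gamma * max_next_q * not_done

rewards = np.array([1.0, 0.0, 1.0])
next_q = np.array([[0.5, 2.0], [1.0, 0.0], [3.0, 4.0]])
dones = np.array([False, False, True])
targets = compute_targets(rewards, next_q, dones, gamma=0.9)
print(targets)  # approximately 2.8, 0.9 and 1.0
```

The last sample is terminal, so its target is just the reward even though the target network predicts large values for s'.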

---

<h3>Experience Replay</h3>

So that's essentially it

The main trick in the upcoming code is how to efficiently implement the replay memory

Imagine if we just naively stored each (s,a,r,s',done) tuple 

Recall that each state, whether it's s or s', is a (4,80,80) array, that's a pretty big array

But not only that we're going to have multiple of these arrays

Let's suppose we store these in lists

So when our list becomes full we have two operations we need to do

First we want to append the latest state

So we call the append function

Then we also want to remove the oldest state, so we call the pop function with the argument 0 

```
Append latest (s,a,r,s',done) tuple
Remove oldest (s,a,r,s',done) tuple (pop(0))
```

Now this might appear to take up constant memory but in practice it may not

In Python this pattern can cause memory usage to keep growing, and `pop(0)` on a list is also O(n) because every remaining element has to shift

---

<h3>Problem</h3>

What's another problem with storing states this way

Consider s and s' 

Since s' is the state after s the last three frames of s are equal to the first three frames of s'

<img src="extras/63.6.PNG" width="400">

In other words we only have five unique frames but we would have to store eight frames which is almost double the amount we need

---

<h3>Even Worse Problem</h3>

But our problem is even worse than that

If we recall we're going to store these tuples at every time step

So the s' of this time step is equal to the s of the next time step

In other words these are totally redundant 

<img src="extras/63.7.PNG" width="400">

---

<h3>Solution</h3>

Here's the solution to this

Let's instead just store consecutive frames

Now suppose it comes time to draw a sample meaning a tuple of (s,a,r,s',done)

We have arrays of frames, actions, rewards and done flags

Let's suppose we choose a random index t

Now the state is stored in the frames array at indices t-4 up to but not including t (the slice t-4:t)

The next state is stored in the frames array at indices t-3 up to and including t (the slice t-3:t+1)

The action, reward and done flag are all stored at index t

<img src="extras/63.8.PNG" width="400">

So we can see how we implicitly have all the states we need just by storing all the raw frames
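A minimal sketch of this indexing scheme is below, ignoring episode boundaries and wrap-around for now. The array names and the helper `get_transition` are hypothetical; each fake frame is filled with its own index so the slices are easy to read.

```python
import numpy as np

n_frames, H, W = 20, 80, 80
frames = np.stack([np.full((H, W), i, dtype=np.uint8) for i in range(n_frames)])
actions = np.zeros(n_frames, dtype=np.int32)
rewards = np.zeros(n_frames, dtype=np.float32)
dones = np.zeros(n_frames, dtype=bool)

def get_transition(t):
    """Rebuild (s, a, r, s', done) from the raw arrays at index t."""
    state = frames[t - 4:t]           # frames t-4 .. t-1
    next_state = frames[t - 3:t + 1]  # frames t-3 .. t
    return state, actions[t], rewards[t], next_state, dones[t]

rng = np.random.default_rng(0)
t = int(rng.integers(4, n_frames))  # we need four frames before index t
s, a, r, s_next, done = get_transition(t)
print(s[:, 0, 0], s_next[:, 0, 0])
```

Each frame is stored once, yet it can appear in up to four different states and four different next states.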

---

<h3>Edge Cases</h3>

Now of course there are all sorts of edge cases we have to deal with

For example what happens at the boundary between episodes

We can't just look at the frames at index t-4:t if the index at t-2 is an episode boundary 

<img src="extras/63.9.PNG" width="400">

In addition we'll want to update the arrays in a circular fashion

This is because our array is preallocated and is a fixed length 

So once the array is full we'll just go back to the beginning
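The circular update can be sketched with a write pointer that wraps around using the modulo operator. The class `FrameBuffer` and its attribute names are assumptions for illustration, and sampling is left out.

```python
import numpy as np

class FrameBuffer:
    """Fixed-size, preallocated storage that overwrites the oldest entries."""

    def __init__(self, size, frame_height, frame_width):
        self.size = size
        self.frames = np.zeros((size, frame_height, frame_width), dtype=np.uint8)
        self.actions = np.zeros(size, dtype=np.int32)
        self.rewards = np.zeros(size, dtype=np.float32)
        self.dones = np.zeros(size, dtype=bool)
        self.current = 0  # index that the next entry will be written to
        self.count = 0    # number of valid entries stored so far

    def add(self, frame, action, reward, done):
        self.frames[self.current] = frame
        self.actions[self.current] = action
        self.rewards[self.current] = reward
        self.dones[self.current] = done
        self.count = min(self.count + 1, self.size)
        # Wrap around to the start once the end of the array is reached.
        self.current = (self.current + 1) % self.size

buffer = FrameBuffer(size=3, frame_height=2, frame_width=2)
for i in range(5):
    buffer.add(np.full((2, 2), i, dtype=np.uint8), action=0, reward=0.0, done=False)

print(buffer.frames[:, 0, 0])  # [3 4 2]
print(buffer.current)          # 2
```

Notice that after wrapping, neighboring slots around the write pointer hold frames that are far apart in time, so a sampler also has to avoid windows that straddle `current`.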

Another thing we have to take care of is ordering the dimensions for the library we're using

If we recall PyTorch uses the ordering color by height by width (CxHxW) whereas tensorflow uses the ordering height by width by color (HxWxC)

A single frame since it's grayscale is just a 2D array height by width

So our replay buffer object will have to handle creating batches of images such that the dimensions are in the correct order
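Converting between the two layouts is a single transpose. In this sketch the stacked frames play the role of the channel axis, and the batch size and random data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_frames, H, W = 32, 4, 80, 80

# States as stacked frames: (batch, frames, height, width), i.e. channels first.
states_nchw = rng.integers(0, 256, size=(batch_size, n_frames, H, W), dtype=np.uint8)

# Channels last: (batch, height, width, frames).
states_nhwc = np.transpose(states_nchw, (0, 2, 3, 1))

print(states_nchw.shape)  # (32, 4, 80, 80)
print(states_nhwc.shape)  # (32, 80, 80, 4)
```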

Of course these details are no good to discuss philosophically

So let's look at the code

# code

# Math

In this section we are going to discuss what are called partially observable MDPs or PO-MDPs

---

<h3>Partially Observable MDPs</h3>

The term partially observable should be pretty intuitive

Basically it means our state doesn't have full information about the environment which is something we already know

So in fact a PO-MDP is a more realistic model

---

<h3>Intuition</h3>

To get a better intuition about this, think about how we would describe the state of the room we are in 

To do this we would probably turn our heads and look around

While we look around we will probably keep track of what we saw and compile that information in our minds

So, similar to what we did with regular Deep Q-Learning, we combine observations to give us a more accurate depiction of the state we're actually in

The key concept here is if what we see at a particular time doesn't give us full information about a state then we can combine things we see at other times to give us more information

In other words we combine observations across time 

---

<h3>Sequences</h3>

In deep learning, this would be considered a sequence of observations

We know that the way we deal with sequences in deep learning is with recurrent neural networks 

By using a recurrent neural network to approximate either the action value function Q or the policy, we can make a function approximator that depends on previous states as well as the current state

$$\pi(a_t \vert s_t,s_{t-1},s_{t-2},\ldots) \\ Q(s_t,a_t \vert s_{t-1},s_{t-2},\ldots)$$

---

<h3>Computational details</h3>

Computationally, doing Deep Q-Learning with recurrent neural networks is hard

One complication stems from the fact that what we usually do is randomly sample from the experience replay buffer

However if we take a random sample then the states won't be ordered and we won't be able to fit a proper sequence of states into the recurrent network

---

Let's discuss some possible solutions to this

First we can continue to use the experience replay buffer as we have already

In other words each item in the experience replay buffer is a sequence of four observations 

Then instead of concatenating the four observations into one image with four colors we can consider it a sequence of four black and white images

The downside to this is the sequences will be limited in length

We could however use a longer sequence like five, six, seven or eight

While this limits the sequences to always being the same size this is actually easier implementation wise

---

Another thing we can do is just sample randomly from the experience replay buffer but then don't use those samples as indexes to states but rather use them as indexes of end states

So for example suppose we randomly pick index 300 and we want to consider sequences of length 10 

Then what we do is feed in all the states from index 291 up to 300 into the recurrent neural network

In this situation too, it is more convenient to use sequences of constant length

One advantage of this method is that we don't need to store four separate images as part of each state but rather just one

So in effect that saves memory

---

Remember that in regular Deep Q-Learning the state has redundant information

State s(t) contains the images [x(t),x(t-1),x(t-2),x(t-3)] but state s(t-1) already contains [x(t-1),x(t-2),x(t-3),x(t-4)]

However this redundancy makes things easier to implement during training

---

Another thing to think about is that if we consider each state to be just a single frame and we continuously append these to our experience replay buffer, we can't naively just always search 10 frames behind to find the beginning of a sequence

Why might that be?

Well if we choose the fifth frame of an episode let's say then it doesn't have 10 previous frames so we can't go back that far

So we also need some way to keep track of episode boundaries if we're going to use this method
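One possible way to respect episode boundaries, sketched under the assumption that we store a done flag per frame, is to precompute which end indexes have a full sequence inside a single episode. The function name is hypothetical.

```python
import numpy as np

def valid_end_indices(dones, seq_len):
    """Indices t such that frames t-seq_len+1 .. t all lie in one episode.

    dones[i] is True if frame i was the last frame of its episode.
    """
    valid = []
    for t in range(seq_len - 1, len(dones)):
        # An episode end before t inside the window would mean the
        # window straddles two episodes.
        if not dones[t - seq_len + 1:t].any():
            valid.append(t)
    return valid

dones = np.zeros(12, dtype=bool)
dones[4] = True  # first episode covers frames 0..4, second covers 5..11
print(valid_end_indices(dones, seq_len=3))  # [2, 3, 4, 7, 8, 9, 10, 11]
```

We would then sample end indexes only from this list, and feed frames t-seq_len+1 through t into the recurrent network.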

# Math

In this section we are going to summarize everything we learned in the notebook which was focused on Deep Q-Learning

---

<h3>Deep Q-Learning Summary</h3>

The surprising thing about Deep Q-Learning is that we didn't really do anything new

All of the techniques we used we had developed previously

The only changes we made were hyperparameter changes and things that seem surprisingly simple

---

<h3>Everything up to Deep Q-Learning</h3>

So to recap, we actually already covered Q-Learning in our previous reinforcement learning notebooks

So we're familiar with the basic form

We take the TD(0) error and do gradient descent with it

We did this both with Q tables and in the more recent notebooks with approximation methods

Of course Deep Learning is also an approximation method but as we may have noticed just trying to plug and play any old neural network won't result in convergence to a good solution


We were however able to use a type of neural network called the RBF network that has a fixed feature representation so that's everything we did up to this notebook

---

<h3>Deep Q-Learning Summary</h3>

In this notebook we answer the question, what additional tuning can we do to get neural networks to work with reinforcement learning algorithms like Q-Learning

So let's list out some of those techniques

There is experience replay

This should remind us of batch gradient descent: instead of stochastic gradient descent, where we just look at one sample at a time, we look at a batch of samples that we collected from the past

It's important to note that we always sample from the most recent set of experiences

The reason we want to do this is because when we use the TD error, we're not doing true gradient descent, we're only doing an approximation of gradient descent

And that's because our target is not a true target but rather it's yet another prediction made by our very own model

And that brings us to the next technique where we use a separate target network that periodically copies the weights from the main network

---

Another modification we made was for how to model Q 

Instead of doing a feature expansion on the state-action pair as in traditional reinforcement learning, we instead do a feature expansion on just the state and have an output node for each action 

In terms of neural networks, we can think of this as each of the inner layers of the neural network finding a good feature representation and then the final output layer being a separate linear regression for each action 

Something to consider though is how would this work if we had an infinite number of actions

---

Finally we saw some techniques that allow us to apply Deep Q-Learning to video games

Specifically we don't just want to extract features from the video game manually 

Since this is deep learning, we would like the neural network to learn feature representations of the input using gradient descent

And so as we know, our main tool for dealing with images, which can be considered the most raw feature we can get from a video game is the convolutional neural network

---

We also looked at this idea that we don't have to consider the state as just information from the current time

We can also incorporate any other information into the state that we would like

For instance images from previous frames

So this is exactly what we did

Our reasoning for this was because we can't infer motion or velocity from a still frame

And of course motion is very important in a videogame because it allows us to predict what's going to happen next

This is just like trying to catch a baseball

We can look at the trajectory of the baseball and our brain will be able to predict where it's going to land

---

So with these simple techniques we were able to combine Deep Learning with Reinforcement Learning, but not only that we were able to scale it up to much more complex games like Breakout

And of course as the original authors have shown, these same techniques apply equally well to a number of other Atari games like Pong, Space Invaders, Seaquest and Beamrider

