## Topics in this Notebook

[1-9]. Covered earlier 

10. Deep Q-Learning - Combining Neural Networks and Reinforcement Learning
11. Replay Memory Explained - Experience for Deep Q-Network Training
12. Training a Deep Q-Network - Reinforcement Learning
13. Training a Deep Q-Network with Fixed Q-targets - Reinforcement Learning

## 10. Deep Q-Learning - Combining Neural Networks and Reinforcement Learning

### Limitations of Q-learning with value iteration 

In Frozen Lake, our environment was relatively simple, with only 16 states and 4 actions, giving us a total state-action space of just 16 x 4. This means we only had 16 x 4, or 64, Q-values to update in the Q-table. Because these Q-value updates occur iteratively, we can imagine that as our state space grows, the time it takes to traverse all those states and iteratively update the Q-values will also grow.  

Think about a video game where a player has a large environment to roam around in. Each state in the environment would be represented by a set of pixels, and the agent may be able to take several actions from each state. The iterative process of computing and updating Q-values for each state-action pair in a large state space becomes computationally inefficient and perhaps infeasible due to the computational resources and time this may take.
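To get a feel for how quickly a Q-table grows, here's a rough back-of-the-envelope sketch. The `q_table_size` helper and the tiny 10 x 10 grayscale frame are assumptions made purely for illustration; real game frames are far larger.

```python
def q_table_size(num_states: int, num_actions: int) -> int:
    """Number of Q-values a tabular method has to store and update."""
    return num_states * num_actions


# Frozen Lake: 16 states, 4 actions.
frozen_lake_entries = q_table_size(16, 4)

# Hypothetical pixel-based state (an assumption for illustration only):
# a 10 x 10 grayscale frame where each pixel takes one of 256 values.
height, width, levels = 10, 10, 256
num_image_states = levels ** (height * width)
image_entries = q_table_size(num_image_states, 4)

print(frozen_lake_entries)      # 64
print(len(str(image_entries)))  # number of decimal digits in the table size
```

Even for this toy frame, the table would need more entries than we could ever store or visit, which is why we turn to function approximation.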

So, what can we do when we want to step up our game from a simple toy environment, like Frozen Lake, to something more sophisticated? Well, rather than using value iteration to directly compute Q-values and find the optimal Q-function, we instead use a **function approximator to estimate the optimal Q-function**.

### Deep Q-learning   
We’ll make use of a deep neural network to estimate the Q-values for each state-action pair in a given environment, and in turn, the network will approximate the optimal Q-function. The act of combining Q-learning with a deep neural network is called deep Q-learning, and a deep neural network that approximates a Q-function is called a deep Q-Network, or DQN.  


### Deep Q-networks

Suppose we have some arbitrary deep neural network that **accepts states** from a given environment as input. For each given state input, the network outputs **estimated Q-values** for each action that can be taken from that state. The objective of this network is to **approximate the optimal Q-function**, and remember that the optimal Q-function **will satisfy the Bellman equation** that we covered previously:
$$q_{*}(s,a)=E\left[R_{t+1}+\gamma\max_{a'}q_{*}(s',a')\right]$$

With this in mind, the **loss from the network is calculated** by comparing the outputted Q-values to the target Q-values from the right hand side of the Bellman equation, and as with any network, the **objective here is to minimize this loss**.

![title](../docs/images/DLiz/4_Deep_Q-Network.png)

After the loss is calculated, the **weights within the network are updated** via SGD and backpropagation, again, just like with any other typical network. This **process is done over and over again** for each state in the environment until we sufficiently minimize the loss and get an approximate optimal Q-function.

<div class="alert alert-info">

**Note:** So, take a second now to think about how we previously used the Bellman equation to compute and update Q-values in our Q-table in order to find the optimal Q-function. Now, with deep Q-learning, our network will make use of the Bellman equation to estimate the Q-values to find the optimal Q-function. So, we’re still solving the same general problem here, just with a different algorithm. *Rather than making use of value iteration to solve the problem, we’re now using a deep neural network.*

</div>

Alright, we should now have a general idea about what deep Q-learning is and what, at a high level, the deep Q-network is doing. Now, let’s get a little more into the details about the network itself.

#### The input

We discussed earlier that the network would accept states from the environment as input. Thinking of Frozen Lake, we could easily represent the states using a simple coordinate system from the grid of the environment and use this as input.

                SFFF
                FHFH
                FFFH
                HFFG

If we’re in a more complex environment, though, like a video game, for example, then we’ll use images as our input. Specifically, we’ll use still frames that capture states from the environment as the input to the network.

The **standard preprocessing** done on the frames usually involves **converting the RGB data into grayscale** data, since the color in the image usually doesn't affect the state of the environment. Additionally, we’ll typically see **some cropping and scaling** as well, both to cut out unimportant information from the frame and to shrink the size of the image.

Now actually, rather than having a single frame represent a single input, we usually use a **stack of a few consecutive frames to represent a single input**. So, we would grab, say, four consecutive frames from the video game. We’d then do all the preprocessing we mentioned earlier on each of these four frames (the grayscale conversion, the cropping, and the scaling), and then we’d take the preprocessed frames and stack them on top of each other in the order in which they occurred in the game.  

![title](../docs/images/DLiz/5_stacked_frames_for_DQN.png)

We do this because a **single frame usually isn’t going to be enough** for our network, or even for our human brains, to fully understand the state of the environment. For example, by just looking at the first single frame above from the Atari game, Breakout, we can’t tell if the ball is coming down to the paddle or going up to hit the block. We also don’t have any indication about the speed of the ball, or which direction the paddle is moving in.

If we look at four consecutive frames, though, then we have a much better idea about the current state of the environment because we now have information about all of these things that we didn’t know with just a single frame. So, the takeaway is that a stack of frames represents a single input, which in turn represents the state of the environment.
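Here's a minimal sketch of the preprocessing and frame stacking described above, using NumPy. The function and class names, the crop margins, the subsampling used for scaling, and the choice to repeat the first frame at the start of an episode are all assumptions for illustration, not something prescribed by the original material.

```python
from collections import deque

import numpy as np


def preprocess(frame_rgb, crop_top=2, crop_bottom=2, scale=2):
    """Grayscale, crop, and downscale one RGB frame of shape (H, W, 3)."""
    # Weighted sum of the R, G, B channels gives a grayscale image.
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame_rgb.astype(np.float32) @ weights
    # Cut off rows at the top and bottom that we assume carry no useful information.
    cropped = gray[crop_top:gray.shape[0] - crop_bottom, :]
    # Naive scaling: keep every `scale`-th pixel in each direction.
    small = cropped[::scale, ::scale]
    return small / 255.0


class FrameStack:
    """Keeps the most recent preprocessed frames, oldest first."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def push(self, frame_rgb):
        processed = preprocess(frame_rgb)
        if not self.frames:
            # Assumption: at the start of an episode, repeat the first frame.
            for _ in range(self.frames.maxlen):
                self.frames.append(processed)
        else:
            self.frames.append(processed)
        # One network input: shape (num_frames, height, width).
        return np.stack(list(self.frames), axis=0)


# Usage with random stand-in frames (a real emulator would supply these).
rng = np.random.default_rng(0)
raw_frames = [rng.integers(0, 256, size=(20, 16, 3), dtype=np.uint8) for _ in range(5)]
stack = FrameStack(num_frames=4)
for frame in raw_frames:
    state = stack.push(frame)
print(state.shape)  # (4, 8, 8)
```

The resulting array is what we'd treat as a single state when passing input to the network.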


#### The layers

Many deep Q-networks are simply a few convolutional layers, each followed by a non-linear activation function, and then a couple of fully connected layers after the convolutional layers, and that’s it.

#### The output

The output layer will be a fully connected layer, and it will produce the Q-value for each action that can be taken from the given state that was passed as input.

For example, suppose in a given game, the actions we can take consist of moving left, moving right, jumping, and ducking. Then the output layer would consist of four nodes, each representing one of the four actions. The value produced from a single output node would be the Q-value associated with taking the action that corresponds to that node from the state that was supplied as input to the network.

We won’t see the output layer followed by any activation function since **we want the raw, non-transformed Q-values from the network**.
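To make the output layer concrete, here's a small sketch in NumPy. To keep it short, it swaps the convolutional layers for a single fully connected hidden layer, so it is not the architecture described above, just an illustration of one output node per action with no activation on the output. The names `init_network` and `q_values`, the layer sizes, and the action list are assumptions.

```python
import numpy as np

ACTIONS = ["left", "right", "jump", "duck"]


def init_network(input_size, hidden_size, num_actions, seed=0):
    """Random weights for a tiny fully connected stand-in network."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (input_size, hidden_size)),
        "b1": np.zeros(hidden_size),
        "W2": rng.normal(0.0, 0.1, (hidden_size, num_actions)),
        "b2": np.zeros(num_actions),
    }


def q_values(params, state):
    """Forward pass: returns one raw Q-value per action."""
    x = np.asarray(state, dtype=np.float64).ravel()
    hidden = np.maximum(0.0, x @ params["W1"] + params["b1"])  # ReLU activation
    # Output layer is linear: no activation, so Q-values are not transformed.
    return hidden @ params["W2"] + params["b2"]


# Usage: a stand-in state shaped like a stack of four 8 x 8 frames.
state = np.random.default_rng(1).random((4, 8, 8))
params = init_network(input_size=state.size, hidden_size=32, num_actions=len(ACTIONS))
q = q_values(params, state)
best_action = ACTIONS[int(np.argmax(q))]
print(q.shape)  # (4,)
```

Each entry of `q` is the estimated Q-value for taking the corresponding action from the input state.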


## 11. Replay Memory Explained - Experience for Deep Q-Network Training

### Experience Replay and Replay Memory
With deep Q-networks, we often use a technique called experience replay during training. With experience replay, we store the agent’s experiences at each time step in a data set called the **replay memory**. We represent the agent’s experience at time $t$ as $e_{t}$.

At time $t$, the agent's experience $e_{t}$ is defined as this tuple: $$e_{t}=(s_{t},a_{t},r_{t+1},s_{t+1})$$

This tuple contains the state of the environment $s_{t}$, the action $a_{t}$ taken from state $s_{t}$, the reward $r_{t+1}$ given to the agent at time $t+1$ as a result of the previous state-action pair $(s_{t},a_{t})$, and the next state of the environment $s_{t+1}$. This tuple gives us a summary of the agent’s experience at time $t$. 

All of the agent's experiences at each time step over all episodes played by the agent are stored in the replay memory. Well actually, in practice, we’ll usually see the replay memory set to some finite size limit, $N$, and therefore, it will only store the last $N$ experiences.

This replay memory data set is what we’ll randomly sample from to train the network. The act of gaining experience and sampling from the replay memory that stores these experiences is called **experience replay**.
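A replay memory with a finite size limit $N$ can be sketched with a `deque` from the standard library. The `Experience` and `ReplayMemory` names and their methods are assumptions for illustration.

```python
import random
from collections import deque, namedtuple

# One experience e_t = (s_t, a_t, r_{t+1}, s_{t+1}).
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayMemory:
    """Stores only the last `capacity` experiences."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once it is full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def can_sample(self, batch_size):
        return len(self.memory) >= batch_size

    def sample(self, batch_size):
        # Uniform random sample without replacement.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Usage: capacity N = 3, so only the last three experiences survive.
memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push(state=t, action=0, reward=0.0, next_state=t + 1)
print(len(memory))             # 3
print(memory.memory[0].state)  # 2
batch = memory.sample(2)
```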

#### Why use experience replay?

Why would we choose to train the network on random samples from replay memory, rather than just providing the network with the sequential experiences as they occur in the environment?

**A key reason for using replay memory is to break the correlation between consecutive samples.**

If the network learned only from consecutive samples of experience as they occurred sequentially in the environment, the *samples would be highly correlated and would therefore lead to inefficient learning*. Taking random samples from replay memory breaks this correlation.


### Combining a deep Q-network with experience replay

Alright, we now have the idea of experience replay down. From last time, we should also have an understanding of a general deep Q-network architecture, the data that the network accepts, and the output from the network.

![title](../docs/images/DLiz/4_Deep_Q-Network.png)

As a quick refresher, remember that the network is passed a state from the environment, and in turn, the network outputs the Q-value for each action that can be taken from that state.

Let’s now bring all of this information in together with experience replay to see how they fit in with each other.

#### Setting up

- Before training starts, we first initialize the replay memory data set $D$ to capacity $N$. So, the replay memory $D$ will hold $N$ total experiences.  
- Next, we initialize the network with random weights.  
- Next, for each episode, we initialize the starting state of the episode.  

#### Gaining experience

- Now, for each time step $t$ within the episode, we either *explore* the environment and select a random action, or we *exploit* the environment and select the greedy action for the given state that gives the highest Q-value.  
- We then execute the selected action $a_{t}$ in an emulator. For example, if the selected action is to move right, then the agent actually moves right in the game environment running in the emulator. We then observe the reward $r_{t+1}$ given for this action, along with the next state of the environment, $s_{t+1}$, and store the entire experience tuple $e_{t}=(s_{t},a_{t},r_{t+1},s_{t+1})$ in replay memory $D$. 
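The explore-or-exploit choice from the steps above can be sketched as follows. The `select_action` name and the `epsilon` exploration rate parameter are assumptions here; how the rate is set or decayed over time is left out.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: choose any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Usage: with epsilon = 0 we always exploit.
estimated_q = [0.1, 0.9, 0.3]
greedy_action = select_action(estimated_q, epsilon=0.0)
print(greedy_action)  # 1
```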

## 12. Training a Deep Q-Network - Reinforcement Learning  

### The policy network

After storing an experience in replay memory, we then sample a random batch of experiences from replay memory. 

<div class="alert alert-info">
For ease of understanding, though, we're going to explain the remaining process for a single sample, and then you can generalize the idea to an entire batch.
</div>

Alright, so from a single experience sample from replay memory, we then preprocess the state (grayscale conversion, cropping, scaling, etc.), and pass the preprocessed state to the network as input. Going forward, we’ll refer to this network as the policy network since its objective is to approximate the optimal policy by finding the optimal Q-function.

The input state data then forward propagates through the network, using the same forward propagation technique that we’ve discussed for any other general neural network. The model then outputs an estimated Q-value for each possible action from the given input state.

The loss is then calculated. We do this by comparing the Q-value output from the network for the action in the experience tuple we sampled and the corresponding optimal Q-value, or target Q-value, for the same action.

Remember, the target Q-value is calculated using the expression from the right hand side of the Bellman equation. So, just as we saw when we initially learned about plain Q-learning earlier in this series, the loss is calculated by subtracting the Q-value for a given state-action pair from the optimal Q-value for the same state-action pair.
$$q_{*}(s,a)-q(s,a)=\text{loss}$$
$$E\left[R_{t+1}+\gamma\max_{a'}q_{*}(s',a')\right]-E\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}\right]=\text{loss}$$

### Calculating the max term

When we are calculating the optimal Q-value for any given state-action pair, notice from the equation for calculating loss that we used above, we have this term here that we must compute:
$$\max_{a'}q_{*}(s',a')$$

Recall that $s′$ and $a′$ are the state and action that occur in the following time step. Previously, we were able to find this $max$ term by peeking in the Q-table, remember? We'd just look to see which action gave us the highest Q-value for a given state.

Well that's old news now with deep Q-learning. In order to find this $max$ term now, what we do is pass $s′$ to the policy network, which will output the Q-values for each state-action pair using $s′$ as the state and each of the possible next actions as $a′$. Given this, we can obtain the max Q-value over all possible actions taken from $s′$, giving us $\max_{a'}q_{*}(s',a')$.

![title](../docs/images/DLiz/6_Deep_Q-Network_with_next_state_s_prime.png)

Once we find the value of this $max$ term, we can then calculate this term for the original state input passed to the policy network.
$$E\left[R_{t+1}+\gamma\max_{a'}q_{*}(s',a')\right]$$

#### Why do we need to calculate this term again?

Ah, yes, this term enables us to compute the loss between the Q-value given by the policy network for the state-action pair from our original experience tuple and the target optimal Q-value for this same state-action pair.

So, to quickly touch base, note that we first forward passed the state from our experience tuple to the network and got the Q-value for the action from our experience tuple as output. We then passed the next state contained in our experience tuple to the network to find the $max$ Q-value among the next actions that can be taken from that state. This second step was done just to aid us in calculating the loss for our original state-action pair.
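Here's a small worked sketch of those two passes for a single experience. The Q-value lists stand in for the network's outputs, and the numbers, the discount rate `gamma = 0.9`, and the function names are made up for illustration. Squaring the difference is a common choice of loss, included here as an assumption rather than something the text above specifies.

```python
def td_target(reward, next_q_values, gamma):
    """Right-hand side of the Bellman equation for one sampled experience."""
    return reward + gamma * max(next_q_values)


def td_error(q_values, action, target):
    """Difference between the target Q-value and the network's Q-value."""
    return target - q_values[action]


# First pass: network output for the state s in the experience tuple.
q_values_s = [0.2, 0.5, 0.1, 0.0]
# Second pass: network output for the next state s'.
q_values_s_next = [0.3, 0.8, 0.4, 0.1]

action, reward, gamma = 1, 1.0, 0.9

target = td_target(reward, q_values_s_next, gamma)  # 1.0 + 0.9 * 0.8
error = td_error(q_values_s, action, target)        # target - 0.5
squared_loss = error ** 2                           # assumed loss function
print(round(target, 4), round(error, 4), round(squared_loss, 4))  # 1.72 1.22 1.4884
```

Note that only the Q-value for the action we actually took enters the loss; the other outputs from the first pass are not used for this sample.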

This may seem a bit odd, but let it sink in for a minute and see if the idea clicks.

### Training the policy network

Alright, so after we're able to calculate the optimal Q-value for our state-action pair, we can calculate the loss from our policy network between the optimal Q-value and the Q-value that was output from the network for this state-action pair.

Gradient descent is then performed to update the weights in the network in an attempt to minimize the loss, just as with the other networks we've covered previously. In this case, minimizing the loss means that we’re aiming to make the policy network output Q-values for each state-action pair that approximate the target Q-values given by the Bellman equation.

Up to this point, everything we've gone over was for one single time step. We then move on to the next time step in the episode and repeat this process at every time step until we reach the end of the episode. At that point, we start a new episode, and do that over and over again until we reach the max number of episodes we’ve set. We’ll want to keep repeating this process until we’ve sufficiently minimized the loss.
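To see what a weight update does, here's a sketch of repeated gradient descent steps on a deliberately tiny model. It uses a linear Q-function, with one weight vector per action, in place of a deep network, and it holds the target Q-value constant, both simplifications for illustration. The names, learning rate, and numbers are assumptions.

```python
def q_value(weights, state, action):
    """Linear stand-in for the network: q(s, a) = w_a . s"""
    return sum(w * x for w, x in zip(weights[action], state))


def gradient_step(weights, state, action, target, learning_rate):
    """One gradient descent step on the loss 0.5 * (target - q(s, a))**2.

    Returns the loss measured before the update.
    """
    error = target - q_value(weights, state, action)
    # The gradient of the loss with respect to w_a is -error * state,
    # so we move w_a in the direction of +error * state.
    weights[action] = [
        w + learning_rate * error * x for w, x in zip(weights[action], state)
    ]
    return 0.5 * error ** 2


# Usage: two actions, two input features, one sampled state-action pair.
weights = [[0.0, 0.0], [0.0, 0.0]]
state, action, target = [1.0, 0.5], 1, 1.72

losses = [gradient_step(weights, state, action, target, learning_rate=0.1) for _ in range(5)]
print(losses[0] > losses[-1])  # True
```

Only the weights for the action in the experience tuple change; the weights for the other action are left untouched by this sample.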


## 13. Training a Deep Q-Network with Fixed Q-targets - Reinforcement Learning


### Potential training issues with deep Q-networks

Alright, now that we have that refresher out of the way, let's focus on the potential issues with this process. As mentioned, the issues stem from the second pass to the network.

We do the first pass to calculate the Q-value for the relevant action, and then we do a second pass in order to calculate the target Q-value for this same action. Our objective is to get the Q-value to approximate the target Q-value.

Remember, we don't know ahead of time what the target Q-value even is, so we attempt to approximate it with the network. This second pass occurs using the same weights in the network as the first pass.

Given this, when our weights update, our outputted Q-values will update, but so will our target Q-values since the targets are calculated using the same weights. So, our Q-values will be updated with each iteration to move closer to the target Q-values, but the target Q-values will also be moving in the same direction.

As Andong put it in the comments of the last video, this makes the optimization appear to be chasing its own tail, which introduces instability. As our Q-values move closer and closer to their targets, the targets continue to move further and further because we're using the same network to calculate both of these values.

### The target network

Well, here's a perfect time to introduce the second network that we mentioned earlier. Rather than doing a second pass to the policy network to calculate the target Q-values, we instead obtain the target Q-values from a completely separate network, appropriately called the target network.

The target network is a clone of the policy network. Its weights are frozen at the policy network’s original weights, and we update the target network’s weights to the policy network’s new weights every certain number of time steps. This number of time steps can be viewed as yet another hyperparameter that we'll have to test out to see what works best for us.

So now, the first pass still occurs with the policy network. The second pass, however, for the following state occurs with the target network. With this target network, we're able to obtain the $max$ Q-value for the next state, and again, plug this value into the Bellman equation in order to calculate the target Q-value for the first state.
$$\max_{a'}q_{*}(s',a')$$
This is all we use the target network for: to find the value of this $max$ term so that we can calculate the target Q-value for any state passed to the policy network.

As it turns out, this removes much of the instability introduced by using only one network to calculate both the Q-values and the target Q-values. We now have something fixed, i.e. fixed Q-targets, that we want our policy network to approximate. So, we no longer have the dog-chasing-its-tail problem.

As mentioned though, these values don't stay completely fixed the entire time. After $x$ time steps, we'll update the weights in the target network with the weights from our policy network, which will, in turn, update the target Q-values with respect to what the policy network has learned over those past time steps. This will cause the policy network to start approximating the updated targets. 
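The periodic weight copy can be sketched like this. The weights are plain Python lists standing in for real network parameters, the `+= 1.0` line stands in for a gradient update, and `maybe_sync_target` and `sync_every` (our $x$) are assumed names.

```python
import copy


def maybe_sync_target(policy_weights, target_weights, step, sync_every):
    """Return the weights the target network should hold after this step."""
    if step % sync_every == 0:
        # Deep copy so later policy updates don't leak into the target.
        return copy.deepcopy(policy_weights)
    return target_weights


policy_weights = {"w": [0.0, 0.0]}
target_weights = copy.deepcopy(policy_weights)  # target starts as a clone

history = []
for step in range(1, 7):
    policy_weights["w"][0] += 1.0  # stand-in for a gradient descent update
    target_weights = maybe_sync_target(policy_weights, target_weights, step, sync_every=3)
    history.append(target_weights["w"][0])

print(history)  # [0.0, 0.0, 3.0, 3.0, 3.0, 6.0]
```

Between syncs the target weights stay fixed while the policy weights keep changing, which is exactly what gives the policy network a stable target to approximate.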

### Summary

Here's a summary of what we have so far:
1. Initialize replay memory capacity.
2. Initialize the policy network with random weights, and clone it to create the target network.
3. For each episode:
    1. Initialize the starting state.
    2. For each time step:
        1. Select an action.
            - Via exploration or exploitation
        2. Execute selected action in an emulator.
        3. Observe reward and next state.
        4. Store experience in replay memory.
        5. Sample random batch from replay memory.
        6. Preprocess states from batch.
        7. Pass batch of preprocessed states to policy network.
        8. Calculate loss between output Q-values and target Q-values.
            - Requires a second pass to the target network for the next state.
        9. Gradient descent updates weights in the policy network to minimize loss.
            - After x time steps, weights in the target network are updated to the weights in the policy network.
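To tie the steps in the summary together, here's a compact end-to-end sketch on a made-up toy problem: a five-cell corridor where the agent starts at the left end and gets a reward of 1 for reaching the right end. Everything here is an assumption for illustration: the `step` emulator, the hyperparameters, and especially the "network", which is just a table of numbers so the example stays short and dependency-free. It also stores a `done` flag with each experience so the target for a terminal step is just the reward, a detail not covered in the text above. There is no preprocessing step because the states are already plain integers.

```python
import copy
import random
from collections import deque

NUM_STATES, NUM_ACTIONS, GOAL = 5, 2, 4  # actions: 0 = left, 1 = right


def step(state, action):
    """Hypothetical emulator: returns (reward, next_state, done)."""
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = next_state == GOAL
    return (1.0 if done else 0.0), next_state, done


def train(episodes=200, capacity=500, batch_size=8, gamma=0.9, lr=0.1,
          epsilon=0.5, sync_every=20, max_steps=50, seed=0):
    rng = random.Random(seed)
    memory = deque(maxlen=capacity)                               # 1. replay memory
    policy = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]     # 2. "policy network"
    target = copy.deepcopy(policy)                                #    and its clone
    total_steps = 0
    for _ in range(episodes):                                     # 3. for each episode
        state = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:                            # explore
                action = rng.randrange(NUM_ACTIONS)
            else:                                                 # exploit, random tie-break
                action = max(range(NUM_ACTIONS),
                             key=lambda a: (policy[state][a], rng.random()))
            reward, next_state, done = step(state, action)        # execute and observe
            memory.append((state, action, reward, next_state, done))
            if len(memory) >= batch_size:                         # sample a random batch
                for s, a, r, s2, d in rng.sample(list(memory), batch_size):
                    # Target comes from the target network (second pass).
                    y = r if d else r + gamma * max(target[s2])
                    # Gradient step on 0.5 * (y - q)^2 for a table entry.
                    policy[s][a] += lr * (y - policy[s][a])
            total_steps += 1
            if total_steps % sync_every == 0:                     # update target network
                target = copy.deepcopy(policy)
            state = next_state
            if done:
                break
    return policy, len(memory)


policy, memory_size = train()
# Next to the goal, moving right should look better than moving left.
print(policy[3][1] > policy[3][0])
```

With a real deep Q-network, the table lookups would be replaced by forward passes and the per-entry update by backpropagation, but the order of the steps stays the same.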
