# Value-based RL DNN approaches

First, consider again the curses we covered in the previous lesson:

* curse of dimensionality
* curse of modeling
* curse of credit assignment

The problem with the first two was the size of the state and action spaces: it is simply not possible to store a value for each state separately. We therefore turned to feature extraction and linear methods, and we saw that they come with fairly good convergence guarantees. Here are the tables again; YES means that the algorithm is guaranteed to converge with the given kind of function approximation.

**Prediction algorithms:**

| On/Off-policy | Algorithm | Tabular | Linear | Non-linear|
|------|------|------|------|------|
| On-policy | MC | YES | YES | YES |
| On-policy | TD($\lambda$) | YES | YES | NO |
| On-policy | **Gradient TD** | YES | YES | YES |
| Off-policy | MC | YES | YES | YES |
| Off-policy | TD($\lambda$) | YES | NO | NO |
| Off-policy | **Gradient TD** | YES | YES | YES |

**Control algorithms:**

| Algorithm | Tabular | Linear | Non-linear|
|------|------|------|------|
| MC | YES | YES | NO |
| Sarsa | YES | YES | NO |
| Q-learning | YES | NO | NO |
| **Gradient Q** | YES | YES | NO |

In the tables above, two new algorithms are highlighted which have better convergence properties. We do not discuss them in detail but it can be useful to know about them. Details: [Gradient TD](http://incompleteideas.net/Talks/gradient-TD-2011.pdf), [Gradient Q-learning 1](http://agi-conf.org/2010/wp-content/uploads/2009/06/paper_21.pdf) and [Gradient Q-learning 2](https://arxiv.org/pdf/1705.03967.pdf).

**Question:** can we eliminate or handle the curse of modeling and the curse of dimensionality?

So far, we have seen that the model can be eliminated by learning the Q-function from sampled transitions (e.g. Q-learning, Sarsa and their variants). Such algorithms are called model-free learning algorithms.

The **curse of credit assignment** is more challenging. The n-step return and the $\lambda$-return can make learning more stable, and a longer horizon can make credit assignment easier, but there is no clear-cut solution.

The **curse of dimensionality** requires a different approach from tabular methods. Manual feature extraction has a long history in machine learning, but deep learning makes it possible to learn the features automatically. It is therefore tempting to use deep learning to represent the value function.

### DQN - Deep Q-Network

DQN was one of the first algorithms to combine the RL framework with DNNs successfully. The properties of the algorithm:

* relatively easy to implement
* simple but powerful
* convergence can be difficult
* sensitive to the hyperparameters

You will implement this algorithm in today's handout [cartpole-dqn-handout](CartPole-with-DQN.ipynb).

DQN is based on Q-learning, but the $Q$-function is approximated by a neural network: $Q_\theta(s, a)$. With learning rate $\alpha$ and discount factor $\gamma$, the update rule for the parameters $\theta$ is:

$$\theta_{t+1} = \theta_t + \alpha \cdot \left( r_t + \gamma \max_{a'}Q_\theta(s', a') - Q_\theta(s, a) \right)\cdot \nabla_\theta Q_\theta(s, a)|_{\theta = \theta_t}$$
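
To make the update rule concrete, here is a minimal sketch of a single update step. It assumes a linear $Q$-function (one weight row per action) instead of a neural network, so the gradient is simply the feature vector. It also assumes a discount factor $\gamma$, as in the targets later in this lesson. The feature vectors, the toy sizes and the `done` flag for terminal transitions are illustrative assumptions, not part of the handout.

```python
import numpy as np

def q_values(theta, features):
    """Linear Q-function: Q(s, .) = theta @ phi(s), one weight row per action."""
    return theta @ features

def q_learning_step(theta, phi_s, action, reward, phi_next, done,
                    alpha=0.1, gamma=0.99):
    """One semi-gradient Q-learning update; returns new weights and TD error."""
    # Bootstrapped target; the next-state value is dropped on terminal transitions.
    next_value = 0.0 if done else np.max(q_values(theta, phi_next))
    td_error = reward + gamma * next_value - q_values(theta, phi_s)[action]
    new_theta = theta.copy()
    # For a linear Q, the gradient w.r.t. the chosen action's weights is phi(s).
    new_theta[action] += alpha * td_error * phi_s
    return new_theta, td_error

theta = np.zeros((2, 3))                  # 2 actions, 3 features (toy sizes)
phi_s = np.array([1.0, 0.0, 0.5])         # hypothetical features of s
phi_next = np.array([0.0, 1.0, 0.5])      # hypothetical features of s'
theta, td_error = q_learning_step(theta, phi_s, action=1, reward=1.0,
                                  phi_next=phi_next, done=False)
```

With a neural network, the only change is that the gradient is obtained by backpropagation instead of being written down by hand.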

**How can we represent the state?** When something is moving in the image (e.g. a ball), a single static frame cannot capture its motion (direction and speed). Convolutional networks are memoryless, so they cannot carry information across consecutive frames. Remember that RL assumes an MDP (Markov decision process), which requires the state to contain all the relevant information about the environment.

A good approximation is to use a stack of consecutive frames as the state, so that a moving object appears at different positions in the consecutive frames.
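
Frame stacking can be sketched with a fixed-length queue. The `FrameStack` class below is a hypothetical helper, not a library API; it assumes frames are already 2-D grayscale arrays and that the stack is filled with copies of the first frame at the start of an episode.

```python
from collections import deque

import numpy as np

class FrameStack:
    """Keeps the last k frames and exposes them as one stacked observation."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # At episode start, fill the whole stack with the first frame.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Channels-last layout: (height, width, k).
        return np.stack(self.frames, axis=-1)

stack = FrameStack(k=4)
state = stack.reset(np.zeros((84, 84), dtype=np.uint8))
state = stack.push(np.full((84, 84), 255, dtype=np.uint8))
```

After the `push`, the newest frame sits in the last channel and the three older frames in the channels before it.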

DQN has two major tricks to avoid the instability caused by the neural network approximator:

1. experience replay
2. iterative update

The figure below shows the pseudocode of the DQN algorithm. The experience replay is the buffer $D$ in the code: the algorithm stores experiences in the buffer and samples from it. The iterative update is implemented with $\hat{Q}$ and $Q$.

<img src="http://drive.google.com/uc?export=view&id=1EaDj9o9-ACuMsf9PtmMw4p3CMtknTVAg" width=55%>

**Experience replay:**
<img src="http://drive.google.com/uc?export=view&id=1rqQQyPxhDTSFScMwmXDLGAE6eUpAWoSu" width=75%>

There are two main reasons why experience replay can help to converge faster:

1. If samples are fed into the network in the order they were collected, consecutive samples are strongly correlated. This correlation slows learning down and harms generalization. The replay buffer collects the experiences, and the training batches are sampled from it according to a uniform distribution.
2. Some states (experiences) are valuable and should be used several times because they affect the policy strongly. However, such a state may be visited only rarely because it is hard to reach. Because the replay buffer stores a long history of experiences, rare experiences can be reused several times.
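
A replay buffer with uniform sampling can be sketched as a ring buffer over a Python list. The class and method names are illustrative assumptions; the handout may structure it differently.

```python
import random

class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are overwritten first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.position = 0   # index of the next slot to write

    def store(self, state, action, reward, next_state, done):
        transition = (state, action, reward, next_state, done)
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal correlation.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=5)
for t in range(10):
    buffer.store(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
states, actions, rewards, next_states, dones = buffer.sample(batch_size=3)
```

In this toy run only the five most recent transitions survive, and each sampled batch mixes transitions from different time steps.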

**Iterative update:**

One of the reasons behind the instability of Q-learning combined with a deep neural network is the fast change (high variance) of the one-step return. The one-step return depends on the network itself, and the network weights are updated frequently, so the network keeps chasing a target that moves with every update.

The iterative update (or delayed update) uses two networks to represent the $Q$-function. The architecture is the same but the weights are different. The weights are synchronized after a given number of steps.

The first network (the target network $\hat{Q}$) calculates the return and is not updated until synchronization. The second network ($Q$) is responsible for selecting the next action and is always updated according to the update rule.

The update rule changes to the following, where $\theta^-$ denotes the weights of $\hat{Q}$ and $\gamma$ is the discount factor:

$$\theta_{t+1} = \theta_t + \alpha \cdot \left( r_t + \gamma \max_{a'}\hat{Q}_{\theta^-}(s', a') - Q_\theta(s, a) \right)\cdot \nabla_\theta Q_\theta(s, a)|_{\theta = \theta_t}$$
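
The following sketch shows the delayed update with a frozen copy of the weights. It again assumes a linear $Q$-function so that it stays self-contained. `LinearDQN`, `sync_every` and the toy feature vector are hypothetical names and values chosen for illustration.

```python
import numpy as np

class LinearDQN:
    """Linear Q-function with a separate, periodically synchronized target copy."""

    def __init__(self, num_actions, num_features,
                 alpha=0.01, gamma=0.99, sync_every=100):
        self.theta = np.zeros((num_actions, num_features))   # online weights
        self.theta_target = self.theta.copy()                 # frozen weights
        self.alpha, self.gamma, self.sync_every = alpha, gamma, sync_every
        self.steps = 0

    def update(self, phi_s, action, reward, phi_next, done):
        # The bootstrapped value comes from the frozen target weights.
        next_value = 0.0 if done else np.max(self.theta_target @ phi_next)
        td_error = reward + self.gamma * next_value - (self.theta @ phi_s)[action]
        # Only the online weights are changed by the update rule.
        self.theta[action] += self.alpha * td_error * phi_s
        self.steps += 1
        if self.steps % self.sync_every == 0:
            # Synchronization: copy the online weights into the target.
            self.theta_target = self.theta.copy()
        return td_error

agent = LinearDQN(num_actions=2, num_features=3, sync_every=2)
phi = np.array([1.0, 0.0, 0.0])
agent.update(phi, action=0, reward=1.0, phi_next=phi, done=False)
target_after_one_step = agent.theta_target.copy()   # not yet synchronized
agent.update(phi, action=0, reward=1.0, phi_next=phi, done=False)
```

After the first update the target weights are unchanged; after the second one (`sync_every=2`) they are overwritten with the online weights.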

**Preprocessing steps:**

The frames arriving from the simulator need preprocessing before they are fed into the network.

<img src="http://drive.google.com/uc?export=view&id=1v6xXmKxSbElHF8RDgwmjP4eou2DxMHDj" width=75%>

The preprocessing of the raw input frames consists of the following steps, as the above image illustrates:

* grayscale the image
* cropping (only the interesting part of the image will remain)
* downsampling (or resizing) the image to $84\times 84$
* stacking four frames together to form the state
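
The first three steps can be sketched for a single frame using NumPy only. The sketch assumes a raw RGB frame of shape $210\times 160\times 3$ (the usual Atari frame size). The crop rows, the channel-average grayscale and the nearest-neighbour resizing are illustrative simplifications rather than the exact procedure behind the figure. Stacking is applied afterwards to four such processed frames.

```python
import numpy as np

def preprocess(frame, crop_rows=(34, 194), out_size=84):
    """Grayscale, crop and downsample one RGB frame to out_size x out_size."""
    # Grayscale as a plain average of the colour channels (a simplification).
    gray = frame.astype(np.float32).mean(axis=2)
    # Keep only the rows of interest; the crop limits are hypothetical values.
    top, bottom = crop_rows
    cropped = gray[top:bottom, :]
    # Nearest-neighbour downsampling: pick evenly spaced rows and columns.
    rows = np.linspace(0, cropped.shape[0] - 1, out_size).astype(int)
    cols = np.linspace(0, cropped.shape[1] - 1, out_size).astype(int)
    return cropped[np.ix_(rows, cols)].astype(np.uint8)

rng = np.random.default_rng(0)
raw_frame = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)
processed = preprocess(raw_frame)   # shape (84, 84), dtype uint8
```

In practice an image library would typically be used for the resizing step, since it can interpolate instead of dropping pixels.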

**Network architecture:**

* Conv2D(kernel\_num=32, kernel\_size=(8, 8), padding='valid', input_shape=(84, 84, 4), strides=(4, 4))
* Activation('relu')
* Conv2D(kernel\_num=64, kernel\_size=(4, 4), padding='valid', strides=(2, 2))
* Activation('relu')
* Conv2D(kernel\_num=64, kernel\_size=(3, 3), padding='valid', strides=(1, 1))
* Activation('relu')
* Flatten()
* Dense(units=512, activation='relu')
* Dense(num\_actions)
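
As a sanity check of the layer list above, the spatial size after each unpadded (`'valid'`) convolution can be computed by hand. The helper below is a small illustrative calculation, not part of any framework.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial output size of a 'valid' (unpadded) convolution."""
    return (input_size - kernel_size) // stride + 1

# (number of kernels, kernel size, stride) of the three convolutional layers
conv_layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size, shapes = 84, []
for kernels, kernel_size, stride in conv_layers:
    size = conv_output_size(size, kernel_size, stride)
    shapes.append((size, size, kernels))

flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)      # [(20, 20, 32), (9, 9, 64), (7, 7, 64)]
print(flattened)   # 3136
```

The `Flatten()` layer therefore passes a vector of 3136 values to the dense layer with 512 units.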

[Video playing Atari](https://www.youtube.com/watch?v=V1eYniJ0Rnk)

### Double DQN

[paper](https://arxiv.org/pdf/1509.06461.pdf)

$Q$-learning can overestimate the true value of $Q(s, a)$, because the $\max$ operator uses the same noisy estimates both to select and to evaluate an action, and this can harm convergence. Double $Q$-learning is an approach to handle the problem of overestimation.

The main idea is to use two separate $Q$-functions: one for choosing the best action and one for evaluating it when bootstrapping.

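Target for standard $Q$-learning:
$$Y^Q = r_t + \gamma Q(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta); \theta)$$

Target for **double $Q$-learning**:
$$Y^{DoubleQ} = r_t + \gamma Q(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta); \theta')$$

In the original double $Q$-learning, the roles of $\theta$ and $\theta'$ are switched from update to update. Double DQN instead uses the online network for $\theta$ and the target network for $\theta'$.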
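
The difference between the two targets is easiest to see on a small batch. The sketch below assumes the Double DQN setting, in which the online network selects the action and the target network evaluates it. The $Q$-values are made-up numbers standing in for network outputs, and the `dones` mask for terminal transitions is an added assumption.

```python
import numpy as np

def dqn_targets(rewards, q_next_target, dones, gamma=0.99):
    """Standard target: the same values select and evaluate the next action."""
    return rewards + gamma * (1.0 - dones) * q_next_target.max(axis=1)

def double_dqn_targets(rewards, q_next_online, q_next_target, dones, gamma=0.99):
    """Double target: online values select, target values evaluate."""
    best_actions = q_next_online.argmax(axis=1)
    evaluated = q_next_target[np.arange(len(rewards)), best_actions]
    return rewards + gamma * (1.0 - dones) * evaluated

rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])                          # second transition is terminal
q_next_online = np.array([[1.0, 2.0], [0.5, 0.1]])    # Q(s', .; theta)
q_next_target = np.array([[3.0, 1.5], [0.2, 0.4]])    # Q(s', .; theta')

y_single = dqn_targets(rewards, q_next_target, dones)
y_double = double_dqn_targets(rewards, q_next_online, q_next_target, dones)
```

For the first transition the standard target bootstraps from the largest target value (3.0), while the double target bootstraps from the value of the action preferred by the online network (1.5), so it can never be larger than the standard one.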

<img src="https://drive.google.com/uc?export=download&id=1jaH7q73Pc_GzuIDt2nkHPzo3rgAq7z16"  height=400>

### DQN with prioritized experience replay

[paper](https://arxiv.org/pdf/1511.05952.pdf)

Remember that in DQN we sample the experiences uniformly (with equal probability) from the experience replay.
Unfortunately, this approach assumes that all experiences have an equal impact on learning. It is easy to see that this is not true: some experiences are more relevant than others. If we define a metric that tells us which experiences are more important, then we can create a prioritized experience replay, in which experiences with higher priority are chosen more frequently.

How can we measure the importance? A common way is to calculate the TD-error (which we have already seen), which indicates how surprising or unexpected the transition is:

$$\delta = r + \gamma \max_{a'}Q(s', a') - Q(s, a)$$

Then, the probability of sampling transition $i$ is:

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

where:

$$p_i = |\delta_i| + \varepsilon$$

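Here $\alpha$ is a hyperparameter (not the learning rate) that controls how strongly the priorities influence sampling; $\alpha=0$ corresponds to uniform sampling. $\varepsilon$ is a small positive constant that keeps the sampling probability of a transition from becoming zero when its TD-error is zero.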
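
The sampling probabilities can be computed directly from a vector of TD-errors. In the sketch below the TD-errors are made-up numbers, and the default values of `alpha` and `eps` are illustrative choices only.

```python
import numpy as np

def sampling_probabilities(td_errors, alpha=0.6, eps=1e-2):
    """P(i) = p_i^alpha / sum_k p_k^alpha with p_i = |delta_i| + eps."""
    priorities = np.abs(td_errors) + eps
    scaled = priorities ** alpha
    return scaled / scaled.sum()

td_errors = np.array([0.0, 0.5, -2.0, 1.0])
probs = sampling_probabilities(td_errors)
uniform = sampling_probabilities(td_errors, alpha=0.0)   # every entry is 0.25

# Draw a training batch of indices according to the priorities.
rng = np.random.default_rng(0)
batch_indices = rng.choice(len(td_errors), size=2, p=probs)
```

The transition with the largest absolute TD-error gets the highest probability, and the transition with zero TD-error can still be drawn thanks to `eps`.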

<img src="https://drive.google.com/uc?export=download&id=1UK6o9wAMGZ_IC8UM2JD9yUt34YwZD7re"  height=350>

<img src="https://drive.google.com/uc?export=download&id=1Foki_zqVOJLkE2eJrpXmZJVxQ3w8prDB" >

### Dueling network for DQN

[paper](https://arxiv.org/pdf/1511.06581.pdf)

<img src="http://drive.google.com/uc?export=view&id=1EH5T77x-RqDyXm1u3rAf8RNBobOv71jm" width=75%>

Definition of the **advantage**:

$$A(s, a) = Q(s, a) - V(s)$$

Combining the estimated $V$ and $A$ in the network is tricky because of the identifiability problem mentioned in the paper: adding a constant to $V$ and subtracting the same constant from $A$ leaves $Q$ unchanged, so $V$ and $A$ cannot be recovered uniquely from $Q$.

However, for a deterministic policy that always picks $a^* = \arg\max_a Q(s, a)$, the following relation holds between $V$ and $Q$:

$$V(s) = Q(s, a^*)$$

Combining this with the definition of the advantage gives:

$$A(s, a^*) = 0$$

To avoid the identifiability problem, the authors suggest forcing the last equation to hold by:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \max_{a'} A(s, a'; \theta, \alpha) \right)$$

However, to further improve the stability of optimization, the paper proposed the following module to combine $V$ and $A$ at the output:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|D|} \sum_{a' \in D}A(s, a'; \theta, \alpha) \right)$$

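Here $D$ denotes the action space (not the replay buffer).

The main benefit of this approach, according to the paper, is that it helps generalize learning across actions, because the shared value stream $V(s)$ is updated by every transition regardless of the action taken. In the case of Atari, the games also differ in their action spaces (e.g. the number of actions and their meaning).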
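
The mean-subtracted combination can be sketched in a few lines. The arrays stand in for the outputs of the two network streams and are made-up numbers; the function name `dueling_q` is an illustrative assumption.

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine V(s) of shape (batch, 1) and A(s, .) of shape (batch, actions)."""
    # Subtracting the mean advantage removes the freedom to shift A by a constant.
    return value + advantages - advantages.mean(axis=1, keepdims=True)

value = np.array([[2.0]])
advantages = np.array([[1.0, 0.0, -1.0]])
q = dueling_q(value, advantages)                     # [[3.0, 2.0, 1.0]]

# Adding a constant to every advantage leaves Q unchanged.
q_shifted = dueling_q(value, advantages + 5.0)
```

With this combination the mean of $Q(s, \cdot)$ over the actions equals $V(s)$, so the value stream is pinned down by the $Q$-values.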

### TreeQN

[paper](https://arxiv.org/pdf/1710.11417.pdf)

TreeQN combines DQN with a learned model of the environment. The model is represented as a network, and it is applied recursively so that the resulting architecture is structured like a tree.


<img src="http://drive.google.com/uc?export=view&id=1fX44HRPKzRJI6U7vxaAZPa6bg7tHEGpY" width=75%>