# Workshop RL02: value-based RL (Deep Q-learning focus)
## Motivation:
In the last workshop on model-based RL, we saw how to model the state transitions (dynamics) with a transition probability matrix and then use it to compute the value functions iteratively. But what if the state and/or action space is enormous, or the transition probabilities are hard to define (e.g. self-driving, Atari, consumer marketing, healthcare)? Instead of computing the value functions explicitly, we can estimate them with a parameterized function, e.g.

<img src='param-est.png' width=400>



## Value Function Approximation (VFA)
There are many ways to approximate a value function:
- Linear approximation: $\hat{V}^{\pi}(s;w) = x(s)^Tw$
- Decision trees
- Nearest neighbours
- Neural networks
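
As a minimal sketch of the linear case $\hat{V}^{\pi}(s;w) = x(s)^Tw$, the snippet below evaluates the approximation for a scalar state. The feature map `features` (a polynomial in $s$) and the weight values are assumptions chosen for illustration; they are not part of the workshop material.

```python
# Minimal sketch of linear value function approximation: V_hat(s; w) = x(s)^T w.
# The feature map and weights are hypothetical, chosen only for illustration.

def features(state):
    """Hypothetical feature map x(s) for a scalar state: [1, s, s^2]."""
    s = float(state)
    return [1.0, s, s * s]


def v_hat(state, w):
    """Linear approximation: dot product of the features x(s) with the weights w."""
    return sum(x_i * w_i for x_i, w_i in zip(features(state), w))


w = [0.5, 2.0, -1.0]
print(v_hat(3, w))  # 0.5*1 + 2.0*3 - 1.0*9 = -2.5
```

Only the three numbers in `w` have to be learned, no matter how many distinct states exist.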

Our focus here will be on neural networks. An example of a neural network approximation:
<img src='DQN.png'>
- Input $s$ is a stack of raw pixels from the last 4 frames (which represents the current state)
- Reward is the change in score for that step
- Output is the approximation of the $Q(s,a)$ values for all actions (e.g. joystick/buttons) in the current state.


## Parameters updating
Our goal is then to find the set of parameters $w$ that minimizes the loss between the true value $Q^{\pi}(s,a)$ and its approximation $\hat{Q}^{\pi}(s,a;w)$. 

To bridge the gap between true values and approximations, first, let's define the **loss function**:
$$ J(w) = E_{\pi} \lbrack (Q^{\pi}(s,a)-\hat{Q^{\pi}}(s,a;w))^2 \rbrack $$


Then we use **gradient descent** to find a **local minimum**:
$$ \Delta w = -\frac{\alpha}{2}\nabla_wJ(w) $$
> note: $\Delta w$ is the update applied to $w$, and $\nabla_w$ is the gradient with respect to $w$. 

Now plugging in $\nabla_wJ(w)$: 
$$ \Delta w = \alpha E_{\pi} \lbrack (Q^\pi(s,a)-\hat{Q^{\pi}}(s,a;w))\nabla_w\hat{Q^{\pi}}(s,a;w) \rbrack $$
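
For reference, the gradient used in this step follows from the chain rule, treating the true value $Q^{\pi}(s,a)$ as a constant with respect to $w$ (and assuming the gradient can be moved inside the expectation):

$$ \nabla_wJ(w) = E_{\pi} \lbrack 2(Q^{\pi}(s,a)-\hat{Q^{\pi}}(s,a;w))\nabla_w(Q^{\pi}(s,a)-\hat{Q^{\pi}}(s,a;w)) \rbrack = -2E_{\pi} \lbrack (Q^{\pi}(s,a)-\hat{Q^{\pi}}(s,a;w))\nabla_w\hat{Q^{\pi}}(s,a;w) \rbrack $$

Multiplying by $-\frac{\alpha}{2}$ cancels the factor $-2$, which is why the step size is written as $\frac{\alpha}{2}$ in the gradient descent rule.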


Finally, since this is still an expected value, we can estimate it using **stochastic gradient descent**:
$$ \Delta w = \alpha (Q^\pi(s,a)-\hat{Q^{\pi}}(s,a;w))\nabla_w\hat{Q^{\pi}}(s,a;w) $$ 
> i.e. compute the update from a single random sample. If we do this enough times, we can converge to the result of gradient descent (the expected SGD update equals the full gradient descent update). 

>$Q^\pi(s,a)-\hat{Q^{\pi}}(s,a;w)$ is also called the **prediction error**. 
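
The stochastic update can be sketched for a linear approximator $\hat{Q}(s,a;w) = x(s,a)^Tw$, whose gradient with respect to $w$ is simply $x(s,a)$. This is an illustrative sketch only: the feature vector `x`, the constant `target` standing in for the true value $Q^{\pi}(s,a)$, and the step size are all assumptions.

```python
def sgd_update(w, x, target, alpha):
    """One SGD step for a linear approximator Q_hat = x^T w.

    `target` stands in for the true value Q^pi(s, a), assumed known here.
    For a linear model, the gradient of Q_hat with respect to w is x.
    """
    q_hat = sum(x_i * w_i for x_i, w_i in zip(x, w))
    error = target - q_hat  # prediction error
    return [w_i + alpha * error * x_i for w_i, x_i in zip(w, x)]


w = [0.0, 0.0]
x = [1.0, 2.0]  # hypothetical features x(s, a) of a single state-action pair
for _ in range(50):
    w = sgd_update(w, x, target=10.0, alpha=0.1)

q = sum(x_i * w_i for x_i, w_i in zip(x, w))
print(q)  # approaches the target value 10.0
```

With these numbers each step halves the prediction error, because $\alpha \lVert x \rVert^2 = 0.5$.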

## Approximation with an Oracle
Notice that we still don't have the true value of the value function $Q^{\pi}(s,a)$. We can overcome this by using some sort of **oracle** to stand in for the true value, e.g.:
- Monte Carlo method: generate an episode and use its return $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$ (summed until the end of the episode) as a substitute
$$\Delta w = \alpha (G_t-\hat{Q}(s_t,a_t;w))\nabla_w\hat{Q}(s_t,a_t;w)$$
> unbiased, high variance
- SARSA (State-Action-Reward-State-Action) method: use a Temporal Difference (TD) target that leverages the current function approximation value
$$\Delta w = \alpha (r_t+\gamma \hat{Q}(s_{t+1},a_{t+1};w)-\hat{Q}(s_t,a_t;w))\nabla_w\hat{Q}(s_t,a_t;w)$$
- Q-learning method: use a TD target that leverages the maximum of the current function approximation value over the next actions
$$\Delta w = \alpha (r_t+\gamma \max_a \hat{Q}(s_{t+1},a;w)-\hat{Q}(s_t,a_t;w))\nabla_w\hat{Q}(s_t,a_t;w)$$
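
The three targets differ only in what replaces the true value. The sketch below computes each of them from plain Python lists; the function names and the representation of $\hat{Q}(s_{t+1},\cdot;w)$ as a list `q_next` with one entry per action are assumptions made for illustration.

```python
def mc_return(rewards, gamma):
    """Monte Carlo target: discounted sum of the rewards from step t to the episode end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def sarsa_target(r, q_next, a_next, gamma):
    """SARSA target: bootstrap from the action actually taken in the next state."""
    return r + gamma * q_next[a_next]


def q_learning_target(r, q_next, gamma):
    """Q-learning target: bootstrap from the best action in the next state."""
    return r + gamma * max(q_next)


q_next = [0.2, 0.8]  # hypothetical Q_hat(s_{t+1}, a; w) for two actions
print(mc_return([1.0, 0.0, 2.0], gamma=0.5))           # 1 + 0.5*0 + 0.25*2 = 1.5
print(sarsa_target(1.0, q_next, a_next=0, gamma=0.9))  # 1 + 0.9*0.2
print(q_learning_target(1.0, q_next, gamma=0.9))       # 1 + 0.9*0.8
```

Terminal states are not handled here; at the end of an episode the bootstrap term would normally be dropped.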

## DQNs in practice
Our focus here is the Deep Q-Network (DQN), which uses a deep CNN as the approximator. Let's look in more detail at some methods that can improve the performance of DQNs. 

### Better Convergence
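Q-learning with VFA can diverge because of:
- correlation between samples (SGD convergence assumes i.i.d. samples, but consecutive transitions are strongly correlated)
- non-stationary targets (the targets depend on the weights, which keep changing, and the policy keeps changing too)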

DQN can address these problems by: 1. Experience Replay and 2. Fixed Q-targets

#### 1. Experience Replay
This removes the correlation between samples. The idea is to sample randomly from past experience, so that a sample is no longer correlated with the one preceding it, e.g. $s_t, s_{t+1}$.

First, store transitions from prior experience in a **replay buffer** $D$.

|Replay Buffer|
|:-----------:|
|$s_1,a_1,r_1,s_2$|
|$s_2,a_2,r_2,s_3$|
|$s_3,a_3,r_3,s_4$|
| ... |
|$s_t,a_t,r_t,s_{t+1}$|

> note: the subscript here denotes the timestep

Then sample a minibatch of $(s,a,r,s')$ from the buffer $D$ and update the weights using SGD as follows:
$$\Delta w = \alpha (r+\gamma \max_{a'} \hat{Q}(s',a';w)-\hat{Q}(s,a;w))\nabla_w\hat{Q}(s,a;w)$$
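
A replay buffer along these lines can be sketched with the standard library alone. The class name `ReplayBuffer`, its method names, and the bounded `capacity` (oldest transitions are discarded first) are assumptions for illustration, not the assignment's API.

```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal replay buffer D storing (s, a, r, s_next) transitions."""

    def __init__(self, capacity):
        # Assumption: a bounded buffer; the oldest transitions drop out first.
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=1000)
for t in range(10):
    buffer.add(t, 0, 1.0, t + 1)  # toy transitions: state t leads to state t + 1

batch = buffer.sample(4)
print(batch)
```

Each sampled tuple is then plugged into the update rule above in place of the most recent transition.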

#### 2. Fixed Q-targets
This improves the stability of training and keeps the weights from exploding. The idea is to hold the target fixed for multiple updates, so that the weights do not change drastically. 

First we define two different sets of weights:
- $w^-$: the weights used to compute the targets (held fixed for a number of updates, then refreshed from $w$)
- $w$: the weights that are being updated 

Then take a sample $(s,a,r,s')$ from the replay buffer $D$ and update the weights as follows:
$$\Delta w = \alpha (r+\gamma \max_{a'} \hat{Q}(s',a';w^-)-\hat{Q}(s,a;w))\nabla_w\hat{Q}(s,a;w)$$
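
The interplay of the two weight sets can be sketched with a tiny linear approximator. Everything specific here is an assumption for illustration: one weight vector per action, a single repeated toy transition, and a hypothetical `sync_every` interval after which $w$ is copied into $w^-$.

```python
import copy


def q_values(w, x):
    """Linear Q_hat for every action; w[a] is the weight vector of action a."""
    return [sum(x_i * w_i for x_i, w_i in zip(x, w_a)) for w_a in w]


def train_step(w, w_minus, transition, alpha, gamma):
    """Update w in place, using the frozen weights w_minus for the target."""
    x, a, r, x_next = transition
    target = r + gamma * max(q_values(w_minus, x_next))  # target uses w^-
    error = target - q_values(w, x)[a]
    # For a linear model, the gradient of Q_hat(s, a; w) w.r.t. w[a] is x.
    w[a] = [w_i + alpha * error * x_i for w_i, x_i in zip(w[a], x)]


w = [[0.0, 0.0], [0.0, 0.0]]  # two actions, two features
w_minus = copy.deepcopy(w)    # frozen copy used only for targets
sync_every = 5                # hypothetical number of updates between syncs
transition = ([1.0, 0.0], 0, 1.0, [1.0, 0.0])  # toy (x(s), a, r, x(s'))

for step in range(1, 11):
    train_step(w, w_minus, transition, alpha=0.5, gamma=0.9)
    if step % sync_every == 0:
        w_minus = copy.deepcopy(w)  # refresh the target weights

print(w[0][0])
```

During the first five updates the target stays at 1.0 because $w^-$ is still all zeros; only after the sync does the target move, which is the stabilizing effect described above.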

### Exploration or Exploitation
**Exploitation** means the agent uses the current information to make the best decisions whereas **exploration** means the agent explores other actions that might lead to better results. 

We want both, so we need to strike a balance between the two. We can achieve this with an **$\varepsilon$-greedy** policy: with probability $1-\varepsilon$ take the greedy action, and with probability $\varepsilon$ take an action uniformly at random from $A$:
$$\pi(a|s)=\begin{cases}
1-\varepsilon+\frac{\varepsilon}{|A|}, & a = \arg\max_{a'}Q(s,a') \\
\frac{\varepsilon}{|A|}, & \text{otherwise}
\end{cases}$$
> How do we choose $\varepsilon$? DeepMind's papers (mnih2015human, mnih-atari-2013) decrease $\varepsilon$ from 1 to 0.1 during the first million steps, and at test time use $\varepsilon_{soft} = 0.05$. 
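
The schedule and the action choice can be sketched as follows. The function names are hypothetical, and the linear shape of the decay is an assumption of this sketch (the text above only states the start value, end value, and number of steps).

```python
import random


def linear_epsilon(step, eps_begin=1.0, eps_end=0.1, n_steps=1_000_000):
    """Decay epsilon linearly from eps_begin to eps_end, then hold it at eps_end."""
    frac = min(step / n_steps, 1.0)
    return eps_begin + frac * (eps_end - eps_begin)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


q_values = [0.1, 0.9, 0.3]  # hypothetical Q(s, a) for three actions
for step in (0, 500_000, 2_000_000):
    eps = linear_epsilon(step)
    print(step, eps, epsilon_greedy(q_values, eps))
```

Because the random branch samples from all actions, the greedy action is chosen with total probability $1-\varepsilon+\frac{\varepsilon}{|A|}$.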


There are other, better-performing DQN variants, which are covered in the optional workshop. Feel free to go through it and try implementing them.

## Exercise
### Task
In this exercise we will again work on an [assignment from the Stanford Reinforcement Learning course](http://web.stanford.edu/class/cs234/assignment2/index.html). This time we will train a DQN (feel free to implement other improved DQNs as well) on the [Atari 2600 game Pong](https://gym.openai.com/envs/Pong-v0/) to play against another decent AI. Since training takes **12 hours** on a GPU, we will first test it on a test environment; you can then train it in your own free time if you're interested. 

### Settings
These are basically what we covered in the sections above; we repeat them here for your convenience.
- tabular setting: 
    In the tabular setting, we maintain a table of $Q(s,a)$ values and, for each observed transition $(s_t,a_t,r_t,s_{t+1})$, update $Q$ as follows
    $$Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha \big( r_t + \gamma \max\limits_{a' \in A} Q(s_{t+1},a') - Q(s_t,a_t) \big) $$
    
- approximation setting:
    In the approximation setting, instead of storing $Q(s,a)$ itself, we represent $Q(s_t,a_t)$ as $\hat{Q}(s_t,a_t;w)$, so we only need one set of weights $w$ for all $\hat{Q}$ values. Then we update the weights $w$ as follows
    $$w \leftarrow w + \alpha (r_t+\gamma \max_{a' \in A} \hat{Q}(s_{t+1},a';w)-\hat{Q}(s_t,a_t;w))\nabla_w\hat{Q}(s_t,a_t;w)$$

- target network:
    In the target network setting, we maintain two sets of weights: one that is updated ($w$) and one that fixes the target ($w^-$)
    $$w \leftarrow w + \alpha (r_t+\gamma \max_{a' \in A} \hat{Q}(s_{t+1},a';w^-)-\hat{Q}(s_t,a_t;w))\nabla_w\hat{Q}(s_t,a_t;w)$$

- replay memory:
    In the replay memory setting, we store the tuple $(s_t,a_t,r_t,s_{t+1})$ in a buffer, and sample a minibatch from the buffer to update weights.
    
- $\varepsilon$-greedy:
    In $\varepsilon$-greedy setting, we decrease $\varepsilon$ from 1 to 0.1 during the first million steps, and during test time we use $\varepsilon_{soft} = 0.05$. 
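
For the tabular setting in the list above, a single update can be sketched with a dictionary as the table. The state labels, action set, and hyperparameters below are hypothetical, and terminal states are not handled.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Tabular Q-learning: move Q(s, a) towards r + gamma * max_a' Q(s_next, a')."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
actions = [0, 1]

q_learning_update(Q, "A", 1, 1.0, "B", actions, alpha=0.5, gamma=0.9)
q_learning_update(Q, "B", 0, 2.0, "A", actions, alpha=0.5, gamma=0.9)

print(Q[("A", 1)])  # 0.5 * (1.0 + 0.9 * 0.0 - 0.0) = 0.5
print(Q[("B", 0)])  # 0.5 * (2.0 + 0.9 * 0.5 - 0.0)
```

The second update already bootstraps from the value learned in the first one, which is the same mechanism the approximation setting reproduces through the weights $w$.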

### Setting up the pipeline
To set up the whole RL pipeline, we need to do the following in these scripts:
- set up the learning rate schedule and $\varepsilon$-greedy in q1_schedule.py
- implement linear approximation in q2_linear.py
- implement DQN as described in [mnih2015human](https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf) in q3_nature.py


Follow the instructions in the assignment carefully and search the TensorFlow documentation for keywords as you go. Have fun!

In [2]:
! pip install -r ./assignment2/requirements.txt

Collecting matplotlib==2.0.2 (from -r ./assignment2/requirements.txt (line 3))
  Downloading https://files.pythonhosted.org/packages/f5/f0/9da3ef24ea7eb0ccd12430a261b66eca36b924aeef06e17147f9f9d7d310/matplotlib-2.0.2.tar.gz (53.9MB)
    Complete output from command python setup.py egg_info:
        pkg-config is not installed.
        matplotlib may not be able to find some of its dependencies
    Edit setup.cfg to change the build options
    
    BUILDING MATPLOTLIB
                matplotlib: yes [2.0.2]
                    python: yes [3.7.3 (default, Mar 27 2019, 16:54:48)  [Clang
                            4.0.1 (tags/RELEASE_401/final)]]
                  platform: yes [darwin]
    
    REQUIRED DEPENDENCIES AND EXTENSIONS
                     numpy: yes [version 1.16.2]
                       six: yes [using six version 1.12.0]
                  dateutil: yes [using dateutil version 2.8.0]
             