# Value Function Approximation

----

Represent the state-value or action-value function with a parameterized function instead of a lookup table. Because the parameters are shared across states, the estimate generalizes to states that were never visited.


# Recall: Monte-Carlo Methods

----

- Policy Evaluation, value function:
$$
V(s) = V(s) + \frac{1}{N(s)} (G_t - V(s))
$$

$$
Q(s, a) = Q(s, a) + \frac{1}{N(s, a)} (G_t - Q(s, a))
$$

- Difference between $G_t$ and $V(s)$ or $Q(s, a)$:  
$$
\delta = G_t - V(s)
$$  
  
  
$$
\delta = G_t - Q(s, a)
$$
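
The incremental-mean update above can be sketched in a few lines of Python. This is only a minimal every-visit sketch: the function name `mc_update`, the `(state, reward)` episode layout, and the toy episode are illustrative assumptions, not code from the repository.

```python
from collections import defaultdict

def mc_update(V, N, episode, gamma=1.0):
    """Every-visit Monte-Carlo update of V from one finished episode.

    episode: list of (state, reward) pairs, where reward is the reward
    received after leaving that state (hypothetical layout).
    """
    G = 0.0
    # Walk backwards so the return G_t accumulates in a single pass.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        N[state] += 1
        # V(s) <- V(s) + (G_t - V(s)) / N(s)
        V[state] += (G - V[state]) / N[state]
    return V

V = defaultdict(float)   # value estimates, default 0
N = defaultdict(int)     # visit counts
mc_update(V, N, [("s0", 0.0), ("s1", 1.0)], gamma=0.9)
```

Calling `mc_update` again with further episodes keeps averaging the observed returns per state.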

# Recall: Temporal Difference Learning

----

- Policy Evaluation  
  
$$
V(s_t) = V(s_t) + \alpha ((r_{t+1} + \gamma V(s_{t+1})) - V(s_t))
$$  

$$
Q(s_t, a_t) = Q(s_t, a_t) + \alpha ((r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})) - Q(s_t, a_t))
$$


- TD-error:  
  
$$
\delta = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
$$  
  
$$
\delta = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
$$
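
For comparison with the Monte-Carlo case, a tabular TD(0) step might look like the sketch below. The function name `td0_update`, the dictionary-based table, and the `done` flag for terminal transitions are assumptions made for illustration.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """One tabular TD(0) step; returns the TD error delta."""
    # Bootstrapped target; a terminal next state contributes no future value.
    target = r if done else r + gamma * V.get(s_next, 0.0)
    delta = target - V.get(s, 0.0)
    # V(s_t) <- V(s_t) + alpha * delta
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta

V = {"s0": 0.0, "s1": 1.0}
delta = td0_update(V, "s0", r=0.5, s_next="s1")
```

Unlike the Monte-Carlo update, this step can be applied after every transition, without waiting for the episode to end.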


# Linear Value Function Approximation

----

- Value Function

$$
\hat{V}(s; w) = \sum_{j=1}^{n} x_{j}(s) \, w_{j} = x(s)^{T} w
$$

- Q-Value Function

$$
\hat{Q}(s, a; w) = \sum_{j=1}^{n} x_{j}(s,a) \, w_{j} = x(s, a)^{T} w
$$


- $\delta$:  

$$
\delta = G_t - \hat{V}(s_t; w)
$$  
  
$$  
\delta = r_{t+1} + \gamma \hat{V}(s_{t+1}; w) - \hat{V}(s_t; w)
$$  
  
$$
\delta = r_{t+1} + \gamma \hat{Q}(s_{t+1}, a_{t+1}; w) - \hat{Q}(s_t, a_t; w)
$$  

- Loss function for Value function:  
  
$$
J(w) = \mathbb{E}^{\pi} [(G_t - \hat{V}(s_t; w))^{2}]
$$  
  
$$
J(w) = \mathbb{E}^{\pi} [(r_{t+1} + \gamma \hat{V}(s_{t+1}; w) - \hat{V}(s_t; w))^{2}]
$$

- Loss function for Q-Value function:  
  
$$  
J(w) = \mathbb{E}^{\pi} [(G_t - \hat{Q}(s_t, a_t; w))^{2}]
$$  
  
$$
J(w) = \mathbb{E}^{\pi} [(r_{t+1} + \gamma \hat{Q}(s_{t+1}, a_{t+1}; w) - \hat{Q}(s_t, a_t; w))^{2}]
$$
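
As a small numeric illustration of the linear form and the TD loss, the sketch below evaluates $x(s)^{T} w$ and the squared TD error for one sample. The feature vectors, weights, and function names are made up for the example; a single sample only gives a noisy estimate of the expectation $J(w)$.

```python
import numpy as np

def v_hat(x, w):
    """Linear value estimate x(s)^T w."""
    return float(x @ w)

def td_squared_error(x, r, x_next, w, gamma=0.9):
    """Squared TD error for one transition (single-sample estimate of J(w))."""
    delta = r + gamma * v_hat(x_next, w) - v_hat(x, w)
    return delta ** 2

w = np.array([0.5, -1.0, 2.0])       # hypothetical weights
x_s = np.array([1.0, 0.0, 1.0])      # hypothetical features x(s)
x_next = np.array([0.0, 1.0, 1.0])   # hypothetical features x(s')

value = v_hat(x_s, w)
loss = td_squared_error(x_s, 1.0, x_next, w)
```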


# Incremental Approaches (Policy Improvement)

----

- Monte Carlo

$$
\Delta w = \alpha (G_t - \hat{Q}(s, a; w)) \nabla_{w} \hat{Q}(s, a; w)
$$

- TD-learning: SARSA

$$
\Delta w = \alpha (r + \gamma \hat{Q}(s^{\prime}, a^{\prime}; w) - \hat{Q}(s, a; w)) \nabla_{w} \hat{Q}(s, a; w)
$$

- TD-learning: Q-learning

$$
\Delta w = \alpha (r + \gamma \max_{a^{\prime}}\hat{Q}(s^{\prime}, a^{\prime}; w) - \hat{Q}(s, a; w)) \nabla_{w} \hat{Q}(s, a; w)
$$
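
For a linear $\hat{Q}(s, a; w) = x(s, a)^{T} w$ the gradient $\nabla_{w} \hat{Q}$ is simply $x(s, a)$, so the three updates above reduce to a few lines. The sketch below assumes NumPy feature vectors; the function names and the toy numbers are hypothetical.

```python
import numpy as np

def q_hat(x_sa, w):
    """Linear action-value estimate x(s, a)^T w."""
    return float(x_sa @ w)

def mc_step(w, x_sa, G, alpha):
    # Target is the sampled return G_t.
    return w + alpha * (G - q_hat(x_sa, w)) * x_sa

def sarsa_step(w, x_sa, r, x_next_sa, alpha, gamma):
    # Target bootstraps from the action actually taken in s'.
    target = r + gamma * q_hat(x_next_sa, w)
    return w + alpha * (target - q_hat(x_sa, w)) * x_sa

def q_learning_step(w, x_sa, r, x_next_all, alpha, gamma):
    # x_next_all holds one feature row per action available in s';
    # the target bootstraps from the best of them.
    target = r + gamma * float(np.max(x_next_all @ w))
    return w + alpha * (target - q_hat(x_sa, w)) * x_sa

w0 = np.zeros(2)
w1 = mc_step(w0, np.array([1.0, 0.0]), G=2.0, alpha=0.5)
w2 = sarsa_step(w1, np.array([0.0, 1.0]), r=1.0,
                x_next_sa=np.array([1.0, 0.0]), alpha=0.5, gamma=0.5)
w3 = q_learning_step(w2, np.array([1.0, 0.0]), r=0.0,
                     x_next_all=np.eye(2), alpha=0.5, gamma=0.5)
```

The only difference between the three functions is the target; the gradient factor `x_sa` is the same in each.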


# Deep Reinforcement Learning

----

- Use a deep neural network to represent the value function and the Q-value function; the updates keep the same form, with $w$ now denoting the network weights.

- Monte Carlo

$$
\Delta w = \alpha (G_t - \hat{Q}(s, a; w)) \nabla_{w} \hat{Q}(s, a; w)
$$

- TD-learning: SARSA

$$
\Delta w = \alpha (r + \gamma \hat{Q}(s^{\prime}, a^{\prime}; w) - \hat{Q}(s, a; w)) \nabla_{w} \hat{Q}(s, a; w)
$$

- TD-learning: Q-learning

$$
\Delta w = \alpha (r + \gamma \max_{a^{\prime}}\hat{Q}(s^{\prime}, a^{\prime}; w) - \hat{Q}(s, a; w)) \nabla_{w} \hat{Q}(s, a; w)
$$


## Issues:

- Correlations between samples

- Non-stationary targets

## Strategies:

- Experience replay

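- Fixed Q-targets: compute the target with a separate copy of the parameters $w^{-}$ that is held fixed between periodic updates:

$$
r + \gamma \max_{a^{\prime}}\hat{Q}(s^{\prime}, a^{\prime}; w^{-})
$$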
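
Experience replay can be sketched with a bounded queue and uniform sampling. The class below is a minimal illustration using only the standard library; the name `ReplayBuffer` and the `(state, action, reward, next_state, done)` tuple layout are assumptions, not necessarily what the repository uses.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of past transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling mixes old and new transitions, which weakens
        # the correlation between consecutive training samples.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
```

After five pushes into a buffer of capacity three, only the three most recent transitions remain available for sampling.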


# Deep Q-learning Neural Network (DQN)

----

- Incremental Approaches

$$
\Delta w = \alpha (r + \gamma \max_{a^{\prime}}\hat{Q}(s^{\prime}, a^{\prime}; w^{-}) - \hat{Q}(s, a; w)) \nabla_{w} \hat{Q}(s, a; w)
$$

- where the target $\hat{Q}(s^{\prime}, a^{\prime}; w^{-})$ and the estimate $\hat{Q}(s, a; w)$ use different parameters: $w^{-}$ is a copy of $w$ that is refreshed only periodically.


- Pseudocode:
  
    1. Initialize replay memory $D$ to capacity $N$  
    1. Initialize action-value function $Q$ with random weights $\theta$  
    1. Initialize target action-value function $\hat{Q}$ with weights $\theta^- = \theta$  
    1. For episode = 1, $M$ do  
    1. $\quad$ Initialize sequence $s_1 = \{x_1\}$ and preprocessed sequence $\phi_1=\phi(s_1)$  
    1. $\quad$ For $t=1$, $T$ do  
    1. $\qquad$ With probability $\epsilon$ select a random action $a_t$; otherwise select $a_t=\arg\max_a Q(\phi(s_t), a; \theta)$  
    1. $\qquad$ Execute action $a_t$ in emulator and observe reward $r_t$ and image $x_{t + 1}$  
    1. $\qquad$ Set $s_{t+1}=s_t, a_t, x_{t+1}$ and preprocess $\phi_{t+1}=\phi(s_{t+1})$  
    1. $\qquad$ Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $D$  
    1. $\qquad$ Sample random minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from $D$  
    1. $\qquad$ Set $y_j = \begin{cases}
    r_j& \text{if the episode terminates at step } j+1\\
    r_j+\gamma \max_{a'}\hat{Q} (\phi_{j+1}, a'; \theta^-)&\text{otherwise}
    \end{cases}$  
    1. $\qquad$ Perform a gradient descent step on $(y_j-Q(\phi_j, a_j; \theta))^2$ with respect to the network parameters $\theta$  
    1. $\qquad$ Every $C$ steps reset $\hat{Q} = Q$  
    1. $\quad$ End For  
    1. End For  
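
The core steps of the pseudocode (ε-greedy action selection, the target $y_j$, and the gradient step) can be sketched with NumPy. To stay short, this sketch swaps the deep network for a linear model with one weight row per action, so it illustrates the update rule rather than a real DQN; all function names and the toy batch are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_values(phi, theta):
    """Linear stand-in for the Q-network: one row of theta per action."""
    return theta @ phi

def select_action(phi, theta, epsilon):
    # With probability epsilon explore, otherwise act greedily.
    if rng.random() < epsilon:
        return int(rng.integers(theta.shape[0]))
    return int(np.argmax(q_values(phi, theta)))

def td_targets(batch, theta_target, gamma):
    """y_j = r_j if terminal, else r_j + gamma * max_a' Q_hat(phi_{j+1}, a'; theta^-)."""
    return np.array([
        r if done else r + gamma * np.max(q_values(phi_next, theta_target))
        for (_, _, r, phi_next, done) in batch
    ])

def gradient_step(batch, theta, theta_target, alpha, gamma):
    """One step on the minibatch mean of (y_j - Q(phi_j, a_j; theta))^2 / 2."""
    y = td_targets(batch, theta_target, gamma)
    new_theta = theta.copy()
    for (phi, a, _, _, _), y_j in zip(batch, y):
        error = y_j - q_values(phi, theta)[a]
        # For a linear model the gradient w.r.t. row a is just phi.
        new_theta[a] += alpha * error * phi / len(batch)
    return new_theta

theta = np.zeros((2, 2))                          # online parameters
theta_target = np.array([[1.0, 0.0], [0.0, 2.0]]) # frozen target parameters
batch = [
    (np.array([1.0, 0.0]), 0, 1.0, np.array([0.0, 1.0]), False),
    (np.array([0.0, 1.0]), 1, -1.0, np.array([1.0, 0.0]), True),
]
new_theta = gradient_step(batch, theta, theta_target, alpha=0.5, gamma=0.5)
```

In a full training loop, the "every $C$ steps" reset would correspond to `theta_target = theta.copy()`; between resets only `theta` changes, which keeps the targets stable.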

# Example: Flappy Bird (Dog)

----

<img src="./flappy_dog.png" width="300" height="400">
<img src="./flappy_bird.png" width="300" height="400">
<img src="./flappy_bird_e.png" width="300" height="400">

# Usage

----

- Requirements
  - torch > 1.0.0
  - pygame >= 1.9.6
  - numpy >= 1.18.1
- Installation of pygame
    ```bash
    pip install pygame
    ```

## Input from keyboard

```bash
python FlappyDogEnv.py  # press the 'a' key to jump
```

## Play by RL agent

```bash
python main.py
```





> Welcome to my GitHub: https://github.com/ChenDdon/Reinforcement_Learning_one_by_one