<a href="https://colab.research.google.com/github/RLWH/reinforcement-learning-notebook/blob/master/5.%20Value%20Approximation/Value_Function_Approximation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 5. Value Approximation

All the model-free methods covered so far rely on either a state-value function or an action-value function, stored essentially as a lookup table with one entry per state $s$ (or per state-action pair). Such tabular methods can solve small to medium problems, but they do not scale to large problems such as Backgammon ($10^{20}$ states) or Computer Go ($10^{170}$ states), nor to problems with a continuous state space.

How, then, do we scale up the model-free methods? The answer is function approximation.

#### The feature weights
In this chapter, we will introduce a new weight vector $\vec{w} \in \mathop{\mathbb{R^d}}$.

Hence, we rewrite the value function as $\hat{v}(s, \vec{w}) \approx v_{\pi}(s)$: the approximate value function uses the weight vector to approximate the true value function. Similarly, we can approximate the action-value function, $\hat{q}(s, a, \vec{w}) \approx q_{\pi}(s,a)$. The purpose of the weights is to generalise what is learned from seen states to unseen states.

As a side note, extending reinforcement learning to function approximation also makes it applicable to partially observable problems, where the full state is not available to the agent.



# Value-function Approximation

In order to use SGD to approximate a value function, we need to first define the objective function. 

The goal of SGD is to find a parameter vector $\vec{w}$ that minimises the mean squared error between the approximate value function $\hat{v}(s, \vec w)$ and the true value function $v_{\pi}(s)$.
The states are not all equally important, so we must specify a state distribution $\mu(s) \geq 0, \sum_{s} \mu(s) = 1$, representing how much we care about the error in each state.

Suppose the true value $v_{\pi}(s)$ exists, the formulation of the objective function, also known as the *Mean Squared Value Error*, becomes
\begin{equation}
\begin{split}
J(\vec w) & = \sum_s\mu(s)[v_{\pi}(s) - \hat{v}(s, \vec{w})]^2 \\
& = \mathop{\mathbb{E_{\pi}}}[(v_{\pi}(s) - \hat{v}(s, \vec{w}))^2]
\end{split}
\end{equation}
for on-policy distribution in episodic tasks.

Ideally, the goal in terms of $J(\vec{w})$ is to find a global optimum: a weight vector $\vec{w^*}$ such that $J(\vec{w^*}) \leq J(\vec{w})$ for all possible $\vec{w}$.
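To make the objective concrete, here is a minimal sketch that evaluates the Mean Squared Value Error for a tiny, made-up problem. The three states, the distribution `mu` and the value arrays are illustrative assumptions, not taken from any real task.

```python
import numpy as np

def msve(mu, v_true, v_hat):
    """Mean squared value error: sum_s mu(s) * (v_pi(s) - v_hat(s, w))^2."""
    mu = np.asarray(mu, dtype=float)
    diff = np.asarray(v_true, dtype=float) - np.asarray(v_hat, dtype=float)
    return float(np.sum(mu * diff ** 2))

# Hypothetical 3-state example: the middle state carries half of the weight.
mu = [0.25, 0.5, 0.25]
v_true = [1.0, 2.0, 3.0]   # stands in for v_pi(s)
v_hat = [1.0, 1.0, 3.0]    # stands in for v_hat(s, w)

# Only the middle state is wrong (by 1.0), and it has weight 0.5.
error = msve(mu, v_true, v_hat)
print(error)
```

The same error in a rarely visited state would contribute less to $J(\vec{w})$, which is exactly the role of $\mu(s)$.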

## Stochastic gradient descent methods

Suppose the weight vector is a column vector with a fixed number of real valued components, $\vec{w} = (w_1, w_2, w_3, ..., w_d)^{T}$, and the approximate value function $\hat{v}(s, \vec{w})$ is a differentiable function of $\vec{w}$ for all $s \in S$. 

The weight vector $\vec{w}$ will be updated at each of the discrete time steps, for $t=0, 1, 2, 3, ...$, so we instead use $\vec{w_t}$ to represent the weight vector.

The SGD method adjusts the weight vector at each time step, or after each example, by a small amount in the direction that would most reduce the error on that example.
\begin{equation}
\Delta{\vec{w}} = \alpha [v_{\pi}(S_t) - \hat{v}(S_t,\vec{w_t})] \nabla{\hat{v}}(S_t, \vec{w_t})
\end{equation}
where $\alpha$ is a positive step-size parameter, and $\nabla f(\vec{w})$, for any scalar expression $f(\vec{w})$ that is a function of a vector, denotes the column vector of partial derivatives of the expression w.r.t the components of the vector.

## Incremental Prediction Algorithms

The discussion so far assumed that the true value $v_{\pi}$ is given. In reinforcement learning, however, there is no supervisor, only rewards, so in practice we must substitute a target for $v_{\pi}(s)$. Denote by $U_t \in \mathop{\mathbb{R}}$ the target of the *t*th training example, $S_t \mapsto U_t$. Note that $U_t$ is not the true value $v_{\pi}(S_t)$ but some, possibly random, approximation to it.

Thus, in general, the weight update can be rewritten as
\begin{equation}
\Delta{\vec{w}} = \alpha [U_t - \hat{v}(S_t,\vec{w_t})] \nabla_{\vec{w}}{\hat{v}}(S_t, \vec{w_t})
\end{equation}

$U_t$ can be approximated by various methods:

### 1. Monte Carlo
For the Monte Carlo method, the target is the return $G_t$, i.e. $U_t = G_t$.

The weight update then becomes
\begin{equation}
\Delta{\vec{w}} = \alpha [G_t - \hat{v}(S_t,\vec{w_t})] \nabla_{\vec{w}}{\hat{v}}(S_t, \vec{w_t})
\end{equation}

##### Pseudo Code
---
```
Input: The policy pi to be evaluated
Input: a differentiable function v_hat that takes a state and weights and maps them to a value: S x R^d -> R

Algorithm parameter: stepsize alpha > 0
Initialise value-function weights w (dimension d) arbitrarily

Loop forever (for each episode):
        Generate an episode S0, A0, R1, S1, A1, ..., RT, ST using pi
        Loop for each step of episode, t=0, 1, ..., T-1:
                w = w + alpha * (G_t - v_hat(S_t, w)) * grad_v_hat(S_t, w)
```
---
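The pseudocode above can be sketched in Python as follows. Everything concrete here is an assumption made for illustration: a 5-state random walk (reward 1 for leaving on the right, 0 otherwise), an equiprobable left/right policy, one-hot features, and a constant step size. With one-hot features the linear approximator reduces to a table, so the weights should drift towards the true values $1/6, 2/6, ..., 5/6$.

```python
import numpy as np

N_STATES = 5  # non-terminal states 0..4; stepping off either end terminates

def generate_episode(rng):
    """Random walk under the equiprobable policy.
    Returns the visited states S_0..S_{T-1} and the rewards R_1..R_T."""
    s, states, rewards = N_STATES // 2, [], []
    while 0 <= s < N_STATES:
        states.append(s)
        s += rng.choice([-1, 1])
        rewards.append(1.0 if s == N_STATES else 0.0)
    return states, rewards

def x(s):
    """One-hot feature vector (an illustrative choice of features)."""
    phi = np.zeros(N_STATES)
    phi[s] = 1.0
    return phi

def gradient_mc(n_episodes=5000, alpha=0.005, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES)
    for _ in range(n_episodes):
        states, rewards = generate_episode(rng)
        G = 0.0
        # Walk the episode backwards so that G accumulates the return G_t.
        for s, r in zip(reversed(states), reversed(rewards)):
            G = r + gamma * G
            # For a linear v_hat, grad_v_hat(s, w) is simply x(s).
            w += alpha * (G - w @ x(s)) * x(s)
    return w

w = gradient_mc()
print(np.round(w, 2))
```

The sketch processes each episode in reverse order purely to compute the returns in one pass; the pseudocode iterates forwards, which uses slightly different intermediate weights but the same targets.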

### 2. TD(0)

For one-step TD, the target is $U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \vec{w_t})$.

The weight update then becomes
\begin{equation}
\Delta{\vec{w}} = \alpha [R_{t+1} + \gamma \hat{v}(S_{t+1}, \vec{w_t}) - \hat{v}(S_t,\vec{w_t})] \nabla_{\vec{w}}{\hat{v}}(S_t, \vec{w_t})
\end{equation}

Bootstrapping methods such as TD(0) or TD($\lambda$) come with weaker convergence guarantees, because the target itself is built from the current estimate. In fact, bootstrapping methods are not instances of true gradient descent (Barnard, 1993): the target depends on the current value of the weight vector $\vec{w_t}$, so it is biased, and the update accounts for the effect of changing $\vec{w_t}$ on the estimate but ignores its effect on the target. Because they include only part of the gradient, these methods are called *semi-gradient* methods.

Although semi-gradient methods do not converge as robustly as gradient methods, they do converge reliably in important cases such as the linear case. Another advantage of semi-gradient methods is that they enable learning to be continual and online, without waiting for the end of an episode.

##### Pseudo Code
---
```
Input: The policy pi to be evaluated
Input: A differentiable function v_hat(state, weights) |-> R such that v_hat(terminal, .) = 0
Algorithm parameter: step size alpha > 0
Initialise value-function weights w (dimension d) arbitrarily, e.g. w = 0

Loop for each episode:
        Initialise S
        Loop for each step of episode:
                Choose A ~ pi(.|S)
                Take action A, observe R, S'
                w = w + alpha * (R + gamma * v_hat(S', w) - v_hat(S, w)) * grad_v_hat(S, w)
                S = S'
        until S is terminal
```
---
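A possible Python rendering of semi-gradient TD(0) is sketched below. The environment is an assumption for illustration only: a 5-state random walk with reward 1 for leaving on the right and 0 otherwise, evaluated under an equiprobable policy with one-hot features. Terminal states map to the zero feature vector so that `v_hat(terminal, .) = 0`, as the pseudocode requires.

```python
import numpy as np

N_STATES = 5  # non-terminal states 0..4; stepping off either end terminates

def x(s):
    """One-hot features; terminal states get the zero vector, so v_hat = 0."""
    phi = np.zeros(N_STATES)
    if 0 <= s < N_STATES:
        phi[s] = 1.0
    return phi

def semi_gradient_td0(n_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES)
    for _ in range(n_episodes):
        s = N_STATES // 2
        while 0 <= s < N_STATES:
            s_next = s + rng.choice([-1, 1])          # equiprobable policy
            r = 1.0 if s_next == N_STATES else 0.0
            # The target r + gamma * v_hat(S', w) is treated as a constant:
            # only the gradient of v_hat(S, w) appears (semi-gradient).
            td_error = r + gamma * (w @ x(s_next)) - w @ x(s)
            w += alpha * td_error * x(s)
            s = s_next
    return w

w = semi_gradient_td0()
print(np.round(w, 2))
```

Unlike the Monte Carlo version, the weights are updated after every step, so no episode has to be stored.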

### 3. TD($\lambda$)

For TD($\lambda$), the target is the $\lambda$-return $G_t^{\lambda}$, i.e. $U_t = G_t^{\lambda}$.

The weight update becomes
\begin{equation}
\Delta{\vec{w}} = \alpha [G_t^{\lambda} - \hat{v}(S_t,\vec{w_t})] \nabla_{\vec{w}}{\hat{v}}(S_t, \vec{w_t})
\end{equation}
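The $\lambda$-return is not written out in this section, so the sketch below relies on an assumption: the standard recursive form $G_t^{\lambda} = R_{t+1} + \gamma[(1-\lambda)\hat{v}(S_{t+1}, \vec{w}) + \lambda G_{t+1}^{\lambda}]$ for an episodic task, with the value of the terminal state taken as zero. The function name and argument layout are hypothetical.

```python
def lambda_returns(rewards, next_values, gamma, lam):
    """Compute G_t^lambda for every step of one finished episode.

    rewards[t]     holds R_{t+1}
    next_values[t] holds v_hat(S_{t+1}, w), which must be 0.0 for the terminal state
    """
    G = 0.0
    out = [0.0] * len(rewards)
    # Work backwards: G_t^lambda depends on G_{t+1}^lambda.
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * G)
        out[t] = G
    return out

# Hypothetical 3-step episode with a single reward at the end.
rewards = [0.0, 0.0, 1.0]
next_values = [0.5, 0.8, 0.0]

mc_like = lambda_returns(rewards, next_values, gamma=1.0, lam=1.0)
td_like = lambda_returns(rewards, next_values, gamma=1.0, lam=0.0)
print(mc_like, td_like)
```

With $\lambda = 1$ the targets coincide with the Monte Carlo returns, and with $\lambda = 0$ they coincide with the one-step TD targets, which is a quick sanity check on the recursion.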





## Linear Methods of Value Approximation

One of the most important special cases of function approximation is the one in which the approximate function, $\hat{v}(\cdot, \vec{w})$, is a linear function of the weight vector, $\vec{w}$.

Denote a real-valued feature vector
\begin{equation}
\vec{x}(s) = (x_1(s), x_2(s), ..., x_d(s))^{T}
\end{equation}
with the same number of components as $\vec{w}$.

Linear methods approximate the state-value function by the inner product between $\vec{w}$ and $\vec{x}$, such that 

\begin{equation}
\begin{split}
\hat{v}(s, \vec{w}) & = \vec{w} ^T \vec{x}(s) \\
& = \sum_{i=1}^{d}w_i x_i(s)
\end{split}
\end{equation}


Note: A feature can be anything that tells you something about the state. For example:
- Distance of robot from landmarks
- Trends in the stock market
- Piece and pawn configurations in chess


The update rule is particularly simple with Linear Value Function Approximation
\begin{equation}
\begin{split}
\nabla_{\vec{w}}\hat{v}(s, \vec{w}) &= \vec{x}(s) \\
\Delta \vec{w} &= \alpha[v_{\pi}(s) - \hat{v}(s, \vec{w})] \vec{x}(s)
\end{split}
\end{equation}

In other words:
**Update = step-size x prediction error x feature value**

In the linear case there is only one optimum (or, in degenerate cases, one set of equally good optima), so any method that is guaranteed to converge to or near a local optimum is automatically guaranteed to converge to or near the global optimum.
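As a minimal sketch of "update = step-size x prediction error x feature value", the snippet below applies one linear update by hand. The feature vector, target and step size are made-up numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np

def v_hat(x_s, w):
    """Linear value estimate: the inner product w^T x(s)."""
    return float(w @ x_s)

def linear_sgd_step(w, x_s, target, alpha):
    """One update: w + alpha * (target - v_hat(s, w)) * x(s)."""
    return w + alpha * (target - v_hat(x_s, w)) * x_s

w = np.zeros(3)
x_s = np.array([1.0, 0.5, 0.0])   # hypothetical feature vector x(s)

# Prediction error is 2.0 - 0.0 = 2.0, so each weight moves by 0.1 * 2.0 * x_i(s).
w_new = linear_sgd_step(w, x_s, target=2.0, alpha=0.1)
print(w_new, v_hat(x_s, w_new))
```

Weights whose feature is zero for this state are left untouched, which is how a linear approximator localises credit to the active features.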

### Monte Carlo with Linear Methods

- The return $G_t$ is an unbiased, noisy sample of the true value $v_{\pi}(S_t)$
- We can therefore apply supervised learning to the "training data"

\begin{equation}
\langle S_1, G_1\rangle, \langle S_2, G_2\rangle, ..., \langle S_T, G_T\rangle
\end{equation}

- The update used for MC policy evaluation
\begin{equation}
\Delta{\vec{w}} = \alpha [G_t - \hat{v}(S_t,\vec{w_t})] \vec{x}(S_t)
\end{equation}

The gradient MC algorithm converges to the global optimum of $J$ under linear function approximation if $\alpha$ is reduced over time according to the usual conditions.

### TD(0) with Linear Methods
- The TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \vec{w})$ is a biased sample of the true value $v_{\pi}(S_t)$
- We can still apply supervised learning to the "training data"
\begin{equation}
\langle S_1, R_2 + \gamma \hat{v}(S_2, \vec{w})\rangle, \langle S_2, R_3 + \gamma \hat{v}(S_3, \vec{w})\rangle, ..., \langle S_{T-1}, R_T + \gamma \hat{v}(S_T, \vec{w})\rangle
\end{equation}
- The gradient used for TD(0) policy evaluation
\begin{equation}
\begin{split}
\Delta{\vec{w}} &= \alpha [R_{t+1} + \gamma \hat{v}(S_{t+1}, \vec{w_t}) - \hat{v}(S_t,\vec{w_t})] \vec{x}(S_t) \\
&=\alpha \delta_t \vec{x}(S_t)
\end{split}
\end{equation}
- Linear TD(0) converges to the *TD fixed point* $\vec{w}_{TD}$ (Proof in p.206), which lies close to the global optimum in the sense that
\begin{equation}
J(\vec{w}_{TD}) \leq \frac{1}{1-\gamma} \min_{\vec{w}} J(\vec{w})
\end{equation}

###  Forward-view linear TD($\lambda$) and Backward-view linear TD($\lambda$)

- The $\lambda$-return $G_t^{\lambda}$ is also a biased sample of true value $v_{\pi}(s)$
- Again, we can apply supervised learning to the "training data"
\begin{equation}
\langle S_1, G_1^{\lambda}\rangle, \langle S_2, G_2^{\lambda}\rangle, ..., \langle S_{T-1}, G_{T-1}^{\lambda}\rangle
\end{equation}

#### Forward view linear TD($\lambda$)
\begin{equation}
\Delta{\vec{w}} = \alpha [G_t^{\lambda} - \hat{v}(S_t,\vec{w_t})] \vec{x}(S_t)
\end{equation}

#### Backward view linear TD($\lambda$)
\begin{equation}
\begin{split}
& \delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \vec{w}) - \hat{v}(S_t, \vec{w}) \\
& E_t = \gamma \lambda E_{t-1} + \vec{x}(S_t) \\
& \Delta{\vec{w}} = \alpha \delta_t E_t
\end{split}
\end{equation}
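The three backward-view equations translate almost line by line into code. The sketch below is illustrative and rests on several assumptions: $E_t$ is an accumulating eligibility trace reset to zero at the start of each episode, the task is a 5-state random walk with reward 1 for leaving on the right, the policy is equiprobable, and the features are one-hot.

```python
import numpy as np

N_STATES = 5  # non-terminal states 0..4; stepping off either end terminates

def x(s):
    """One-hot features; terminal states get the zero vector, so v_hat = 0."""
    phi = np.zeros(N_STATES)
    if 0 <= s < N_STATES:
        phi[s] = 1.0
    return phi

def linear_td_lambda(n_episodes=5000, alpha=0.01, gamma=1.0, lam=0.8, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES)
    for _ in range(n_episodes):
        s = N_STATES // 2
        e = np.zeros(N_STATES)                 # eligibility trace E, reset per episode
        while 0 <= s < N_STATES:
            s_next = s + rng.choice([-1, 1])
            r = 1.0 if s_next == N_STATES else 0.0
            delta = r + gamma * (w @ x(s_next)) - w @ x(s)   # TD error delta_t
            e = gamma * lam * e + x(s)                       # E_t = gamma*lambda*E_{t-1} + x(S_t)
            w += alpha * delta * e                           # Delta w = alpha * delta_t * E_t
            s = s_next
    return w

w = linear_td_lambda()
print(np.round(w, 2))
```

Setting `lam=0.0` makes the trace equal to $\vec{x}(S_t)$, which recovers the linear TD(0) update.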

# Action-Value $Q(s,a)$ Function Approximation

For policy control, we need to approximate the action-value function such that $\hat{q}(s, a, \vec{w}) \approx q_*(s,a)$, where $\vec{w} \in \mathop{\mathbb{R}}^d$ is a finite-dimensional weight vector. 

In this section, we will cover the on-policy methods only.

### From approximating state value to action value

The objective is to approximate the action value function

\begin{equation}
\hat{q}(S, A, \vec{w}) \approx q_{\pi}(S, A)
\end{equation}

We can use the same approach as for approximating the state-value function, applied to the action-value function instead. Suppose the true action value $q_{\pi}(S,A)$ is known.

The objective function becomes
\begin{equation}
J(\vec{w}) = \mathop{\mathbb{E_{\pi}}}[(q_{\pi}(S,A) - \hat{q}(S, A, \vec{w}))^2]
\end{equation}

We then use stochastic gradient descent to find a local minimum:

\begin{equation}
\begin{split}
-\frac{1}{2} \nabla_{\vec{w}} J(\vec{w}) &= (q_{\pi}(S,A) - \hat{q}(S, A, \vec{w})) \nabla_{\vec{w}}\hat{q}(S,A,\vec{w})\\
\Delta \vec{w} &= \alpha [q_{\pi}(S,A) - \hat{q}(S,A,\vec{w})] \nabla_{\vec{w}}\hat{q}(S,A,\vec{w})
\end{split}
\end{equation}

### Approximation by using Linear methods
Similarly, the action-value function can be represented by linear combinations of features

\begin{equation}
\hat{q}(S, A, \vec{w}) = \vec{x}(S,A)^T \vec{w} = \sum_{j=1}^{d} x_j(S,A) w_j
\end{equation}

### The update process
\begin{equation}
\begin{split}
\nabla_{\vec{w}} \hat{q}(S, A, \vec{w}) &= \vec{x}(S,A) \\
\Delta \vec{w} &= \alpha [q_{\pi}(S,A) - \hat{q}(S, A, \vec{w})]\vec{x}(S,A)
\end{split}
\end{equation}
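One way to build $\vec{x}(S,A)$ from state features is sketched below. The block construction (one copy of the state features per action, zeros elsewhere) is an assumption chosen for illustration and is not prescribed by the text. The target value stands in for $q_{\pi}(S,A)$.

```python
import numpy as np

def x_sa(state_features, action, n_actions):
    """Place the state features in the block belonging to `action`.
    All other blocks stay zero, so each action gets its own weights."""
    d = state_features.shape[0]
    phi = np.zeros(d * n_actions)
    phi[action * d:(action + 1) * d] = state_features
    return phi

def q_hat(state_features, action, w, n_actions):
    """Linear action-value estimate: x(S, A)^T w."""
    return float(x_sa(state_features, action, n_actions) @ w)

n_actions = 2
s_feat = np.array([1.0, 2.0])     # hypothetical state features
w = np.zeros(s_feat.shape[0] * n_actions)

# One update for action 1 towards a made-up target of 1.0 with alpha = 0.1.
phi = x_sa(s_feat, 1, n_actions)
w = w + 0.1 * (1.0 - phi @ w) * phi
print(w, q_hat(s_feat, 1, w, n_actions), q_hat(s_feat, 0, w, n_actions))
```

Because the blocks do not overlap, updating one action leaves the estimates for the other actions unchanged. Shared features across actions are equally valid and generalise differently.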

## Incremental Control Algorithms

Like prediction, we also need to substitute a target for $q_{\pi}(S,A)$.

### 1. Monte Carlo Methods
For the Monte Carlo method, the target is the return $G_t$, i.e. $U_t = G_t$.

The weight update then becomes
\begin{equation}
\Delta{\vec{w}} = \alpha [G_t - \hat{q}(S_t,A_t,\vec{w_t})] \nabla_{\vec{w}}{\hat{q}}(S_t, A_t, \vec{w_t})
\end{equation}

### 2. SARSA(0) Methods

This method is called *episodic semi-gradient one-step SARSA*. For a constant policy, this method converges in the same way that TD(0) does, with the same kind of error bound.

\begin{equation}
\Delta{\vec{w}} = \alpha [R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \vec{w}) - \hat{q}(S_t,A_t, \vec{w_t})] \nabla_{\vec{w}}{\hat{q}}(S_t, A_t, \vec{w_t})
\end{equation}

To form control methods, we need to plug the predicted action values into the GPI framework. That is, for each possible action $a$ available in the current state $S_t$, we can compute $\hat{q}(S_t, a, \vec{w}_{t-1})$ and then find the greedy action $A_t^* = \text{argmax}_a \hat{q}(S_t, a, \vec{w}_{t-1})$. For on-policy methods, policy improvement is done by changing the estimation policy to a soft approximation of the greedy policy, such as the $\epsilon$-greedy policy.

##### Pseudocode
---
```
Input: a differentiable action-value function parameterisation q_hat(state, action, weights) |-> R
Algorithm parameters: stepsize alpha > 0, small eps > 0
Initialise value-function weights w (dimension d) arbitrarily, e.g. w = 0

Loop for each episode:
       Initialise S, and choose A as a function of q_hat(S, ., w) (e.g. eps-greedy)

       Loop for each step of episode:
            Take action A, observe R, S'

            If S' is terminal:
                    w = w + alpha * (R - q_hat(S, A, w)) * grad_q_hat(S, A, w)
                    Go to next episode

            Choose A' as a function of q_hat(S', ., w) (e.g. eps-greedy)
            w = w + alpha * (R + gamma * q_hat(S', A', w) - q_hat(S, A, w)) * grad_q_hat(S, A, w)

            S = S'
            A = A'
```
---
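The SARSA pseudocode can be sketched in Python as below. The environment is a hypothetical corridor invented for this example: five states, actions left and right, a reward of -1 per step, and termination when the agent steps past the right end. One-hot state-action features are assumed, so the linear approximator behaves like a table.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2   # actions: 0 = left, 1 = right

def step(s, a):
    """Hypothetical corridor dynamics: returns (reward, next_state, done)."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return -1.0, s_next, s_next == N_STATES

def x(s, a):
    """One-hot feature vector for the pair (s, a)."""
    phi = np.zeros(N_STATES * N_ACTIONS)
    phi[s * N_ACTIONS + a] = 1.0
    return phi

def eps_greedy(s, w, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax([w @ x(s, a) for a in range(N_ACTIONS)]))

def semi_gradient_sarsa(n_episodes=1000, alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * N_ACTIONS)
    for _ in range(n_episodes):
        s = 0
        a = eps_greedy(s, w, eps, rng)
        while True:
            r, s_next, done = step(s, a)
            if done:
                # No bootstrapping from a terminal state.
                w += alpha * (r - w @ x(s, a)) * x(s, a)
                break
            a_next = eps_greedy(s_next, w, eps, rng)
            target = r + gamma * (w @ x(s_next, a_next))
            w += alpha * (target - w @ x(s, a)) * x(s, a)
            s, a = s_next, a_next
    return w

w = semi_gradient_sarsa()
greedy = [int(np.argmax([w @ x(s, a) for a in range(N_ACTIONS)])) for s in range(N_STATES)]
print(greedy)
```

Since every step costs -1, the shortest path is to always move right, so the greedy policy read off the learned weights should select action 1 in each state.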

### 3. SARSA($\lambda$)

##### Forward-view
\begin{equation}
\Delta w = \alpha (q_t^{\lambda} - \hat{q}(S_t, A_t, \vec{w})) \nabla_{\vec{w}} \hat{q}(S_t, A_t, \vec{w})
\end{equation}

##### Backward-view
\begin{equation}
\begin{split}
\delta_t &= R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \vec{w}) - \hat{q}(S_t, A_t, \vec{w}) \\
E_t &= \gamma \lambda E_{t-1} + \nabla_{\vec{w}}\hat{q}(S_t, A_t, \vec{w}) \\
\Delta{\vec{w}} &= \alpha \delta_t E_t
\end{split}
\end{equation}

### 4. Gradient TD
Not covered here


## Summary
![Prediction convergence summary](https://raw.githubusercontent.com/RLWH/reinforcement-learning-notebook/master/images/prediction_algos_convergence_summary.png)

![Control convergence summary](https://raw.githubusercontent.com/RLWH/reinforcement-learning-notebook/master/images/control_algo_convergence_summary.png)

# Exploiting the samples of data by using batch update

## Stochastic Gradient Descent with Experience Replay

Given experience consisting of $\langle \text{state}, \text{value} \rangle$ pairs
\begin{equation}
D = \{\langle s_1, v_1^{\pi} \rangle, \langle s_2, v_2^{\pi} \rangle, ..., \langle s_T, v_T^{\pi} \rangle\}
\end{equation}

##### Algorithm
---
```
Repeat:

    1. Sample state, value from experience
    2. Apply stochastic gradient descent update
    w = w + alpha * (v_pi - v_hat(s, w)) * grad_v_hat(s, w)

Until it converges to the least-squares solution
```
---
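A minimal sketch of the replay loop for a linear approximator follows. The experience set is synthetic and noise-free (targets generated from a known weight vector `true_w`), which is an assumption made so that the least-squares solution is known in advance. A fixed number of iterations stands in for a convergence test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical experience D: feature vectors x(s_t) and targets v_t.
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
v = X @ true_w

w = np.zeros(3)
alpha = 0.05
for _ in range(5000):
    i = rng.integers(len(X))                   # 1. sample <state, value> from experience
    w += alpha * (v[i] - w @ X[i]) * X[i]      # 2. SGD update; grad of w^T x is x

print(np.round(w, 3))
```

Sampling with replacement lets each stored pair be reused many times, which is the sense in which replay "exploits" the data.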

## Deep Q Network

Bootstrapping methods, as discussed previously, are not stable when combined with non-linear function approximators. DQN uses two tricks to stabilise learning with neural networks:
1. Experience Replay
2. Fixed Q-targets

Method
1. Take action $a_t$ according to $\epsilon$-greedy policy
2. Store transition ($s_t, a_t, r_{t+1}, s_{t+1}$) in replay memory $D$
3. Sample random mini-batch of transitions ($s, a, r, s'$) from $D$
4. **Compute Q-learning targets w.r.t old, fixed parameters $w^-$**
5. Optimise MSE between Q-network and Q-learning targets
\begin{equation}
\mathcal{L}_i(\vec{w_i}) = \mathbb{E}_{s, a, r, s' \sim D} \Big[ (r + \gamma \max_{a'}Q(s', a'; w_i^{-}) - Q(s, a; w_i))^2 \Big]
\end{equation}
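Steps 4 and 5 can be illustrated without a neural network. The sketch below uses a linear Q-function in place of the Q-network, and a `dones` flag that suppresses bootstrapping from terminal states. Both are assumptions for illustration; the mini-batch and parameter values are made up.

```python
import numpy as np

def q_values(states, W):
    """Q(s, .; W) for a batch of state-feature rows; W has shape (n_actions, d)."""
    return states @ W.T

def q_learning_targets(rewards, next_states, dones, W_target, gamma):
    """r + gamma * max_a' Q(s', a'; w^-), computed with the frozen parameters."""
    max_next = q_values(next_states, W_target).max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next

def mse_loss(states, actions, targets, W):
    """Mean squared error between the targets and Q(s, a; w) for the taken actions."""
    q_sa = q_values(states, W)[np.arange(len(actions)), actions]
    return float(np.mean((targets - q_sa) ** 2))

# Hypothetical mini-batch of two transitions sampled from the replay memory.
states      = np.array([[1.0, 0.0], [0.0, 1.0]])
actions     = np.array([0, 1])
rewards     = np.array([1.0, 0.0])
next_states = np.array([[0.0, 1.0], [1.0, 0.0]])
dones       = np.array([0.0, 1.0])               # second transition ends the episode

W        = np.zeros((2, 2))                      # online parameters w
W_target = np.array([[0.5, 1.0], [2.0, 0.0]])    # old, fixed parameters w^-

targets = q_learning_targets(rewards, next_states, dones, W_target, gamma=0.9)
loss = mse_loss(states, actions, targets, W)
print(targets, loss)
```

The targets depend only on `W_target`, so gradient steps on `W` do not move them. In a full implementation `W_target` would be refreshed from `W` only occasionally.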

## One-step Linear Least Square Method for Linear approximators

If the approximator is linear, we can solve for the least-squares solution directly in closed form. At the minimum, the expected update must be zero:

\begin{equation}
\begin{split}
\mathop{\mathbb{E}}_D[\Delta \vec{w}] &= 0\\
\alpha\sum_{t=1}^{T} \vec{x}(s_t)(v_t^{\pi} - \vec{x}(s_t)^T\vec{w}) &= 0 \\
\sum_{t=1}^{T} \vec{x}(s_t)v_t^{\pi} &= \sum_{t=1}^{T} \vec{x}(s_t) \vec{x}(s_t)^T\vec{w} \\
\vec{w} &= \Big( \sum_{t=1}^T \vec{x}(s_t) \vec{x}(s_t)^T \Big)^{-1} \sum_{t=1}^T \vec{x}(s_t)v_t^{\pi}
\end{split}
\end{equation}
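The closed-form solution maps directly onto a linear solve. The sketch below uses synthetic experience (features drawn at random, targets from a known `true_w` plus small noise), all of which are illustrative assumptions. It solves the linear system rather than forming the inverse explicitly, which is the usual numerical practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical experience: rows of X are x(s_t), entries of v are v_t.
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
v = X @ true_w + 0.01 * rng.normal(size=200)

A = X.T @ X                    # sum_t x(s_t) x(s_t)^T
b = X.T @ v                    # sum_t x(s_t) v_t
w_ls = np.linalg.solve(A, b)   # solves A w = b without computing A^{-1}

print(np.round(w_ls, 3))
```

The cost is dominated by the $d \times d$ solve, so this direct approach is practical only when the number of features $d$ is modest.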