# On-Policy Prediction with Approximation

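What if

- there are infinitely many states, or
- there are finitely many states, but still too many to store in a table?

Then we approximate the value function by a parameterized function

$$\hat v(s,w) \approx v_{\pi}(s)$$

with a weight vector $w\in\mathbb{R}^d$.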

## Value-function Approximation

In tabular methods, each update moved the value of a state $s$ toward an update target $u$:
$$
s\mapsto u
$$
The target $u$ can be:

* $G_t$ for Monte Carlo
* $R_{t+1}+\gamma \hat v(S_{t+1},w)$ for TD(0)
* $\mathbb{E}_\pi[R_{t+1}+\gamma \hat v(S_{t+1},w)\mid S_t=s]$ for dynamic programming
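The three targets listed above can be sketched in a few lines. This is only an illustration: the function names and the toy numbers are assumptions, and `rewards[t]` is taken to hold $R_{t+1}$.

```python
def mc_returns(rewards, gamma):
    """Monte Carlo targets G_t for every t, where rewards[t] is R_{t+1}."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g          # G_t = R_{t+1} + gamma * G_{t+1}
        out.append(g)
    return out[::-1]


def td_target(reward, v_next, gamma):
    """TD(0) target: R_{t+1} + gamma * v_hat(S_{t+1}, w)."""
    return reward + gamma * v_next


def dp_target(outcomes, gamma):
    """DP target: expectation over (probability, reward, v_next) triples under pi."""
    return sum(p * (r + gamma * v) for p, r, v in outcomes)


# Toy numbers, chosen only for illustration.
returns = mc_returns([1.0, 1.0, 1.0], gamma=0.5)
one_step = td_target(reward=1.0, v_next=2.0, gamma=0.5)
expected = dp_target([(0.5, 1.0, 2.0), (0.5, 0.0, 0.0)], gamma=0.5)
```

Note that only the Monte Carlo target is independent of the current estimate $\hat v$; the other two use it through `v_next`.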

Instead of updating a table entry, we treat each update $s\mapsto u$ as a training example for the approximator.

In principle, any supervised learning method can be used.

Not every method is suitable, though; we prefer:

* Online methods: they can be updated after each transition and can react to changing targets (including changes caused by GPI)


## The Prediction Objective $\bar{VE}$

Consider a state weighting $\mu(s)$ with $\mu(s)\geq 0$ and $\sum_{s}\mu(s)=1$; it expresses how much we care about the error in each state.

Using this, we can define the Mean Squared Value Error

$$
\bar{VE}(w) = \sum_s \mu(s)(v_{\pi}(s)-\hat v (s,w))^2
$$
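For a finite state set the objective is just a weighted sum of squared errors. A minimal sketch, with made-up values for $\mu$, $v_\pi$ and $\hat v$ (all three lists are assumptions for illustration):

```python
def mean_squared_value_error(mu, v_true, v_approx):
    """Sum over states of mu(s) * (v_pi(s) - v_hat(s, w))^2."""
    return sum(m * (vt - va) ** 2 for m, vt, va in zip(mu, v_true, v_approx))


mu = [0.5, 0.25, 0.25]       # state weighting, sums to 1
v_true = [1.0, 2.0, 4.0]     # v_pi(s), normally unknown
v_approx = [1.0, 1.0, 2.0]   # v_hat(s, w) for some fixed w

ve = mean_squared_value_error(mu, v_true, v_approx)
```

Here the first state is approximated exactly, so only the two less heavily weighted states contribute to the error.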

For on-policy training, we call $\mu(s)$ the on-policy distribution.

In continuing tasks, it is the *stationary* distribution under $\pi$.

For episodic tasks, it is trickier. Let $h(s)$ be the probability that an episode starts in $s$, and $\eta(s)$ the expected number of time steps spent in $s$ per episode:
$$\eta(s) = h(s) + \sum_{\bar s}\eta(\bar s)\sum_a \pi(a|\bar s)p(s|\bar{s},a)$$

$$\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')}$$
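The episodic case can be checked numerically on a tiny example. The sketch below takes $h(s)$ to be the probability that an episode starts in $s$ and assumes the policy has already been folded into a state-to-state matrix `P[s_bar][s]` $=\sum_a \pi(a|\bar s)p(s|\bar s,a)$; the two-state chain itself is invented for illustration. It solves the equation for $\eta$ by fixed-point iteration.

```python
def on_policy_distribution(h, P, iterations=200):
    """Iterate eta(s) = h(s) + sum_sbar eta(sbar) * P[sbar][s], then normalise."""
    n = len(h)
    eta = list(h)
    for _ in range(iterations):
        eta = [h[s] + sum(eta[sb] * P[sb][s] for sb in range(n)) for s in range(n)]
    total = sum(eta)
    return [e / total for e in eta]


# Hypothetical chain: episodes always start in state 0; from either state
# the process moves to the other state with probability 0.5 and terminates
# otherwise (the missing probability mass in each row is termination).
h = [1.0, 0.0]
P = [[0.0, 0.5],
     [0.5, 0.0]]

mu = on_policy_distribution(h, P)
```

Solving by hand gives $\eta = (4/3, 2/3)$ and therefore $\mu = (2/3, 1/3)$, which the iteration reproduces.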

Question:

* Is the prediction objective $\bar{VE}$ the ultimate goal of our learning?

Ideally, learning would find a global optimum of $\bar{VE}$, or at least a local one. In some cases, however, RL methods with function approximation are not guaranteed to converge to either.

Question:

* Why? Provide some intuition.



## Stochastic-gradient and Semi-gradient Methods

Stochastic gradient descent (SGD):

$$w_{t+1} = w_{t} - \frac{1}{2}\alpha\nabla[v_{\pi}(S_t)-\hat{v}(S_t,w_t)]^2$$
$$
= w_{t} +\alpha[v_{\pi}(S_t)-\hat{v}(S_t,w_t)]\nabla\hat{v}(S_t,w_t)
$$
Question:

* Why do we call this *stochastic*?

If we don't know $v_{\pi}(S_t)$, we replace it with a target $U_t$ that approximates it:

$$
w_{t+1}= w_{t} +\alpha[U_t-\hat{v}(S_t,w_t)]\nabla\hat{v}(S_t,w_t)
$$
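One step of this update is easy to write down once the approximator supplies its value and gradient at $S_t$. A minimal sketch; the function name and the numbers are assumptions, and the example uses a linear approximator only because its gradient is simply the feature vector.

```python
def sgd_step(w, target, value, gradient, alpha):
    """Return w + alpha * (U_t - v_hat(S_t, w)) * grad v_hat(S_t, w)."""
    error = target - value
    return [wi + alpha * error * gi for wi, gi in zip(w, gradient)]


# Illustration with a linear approximator: v_hat = w . x, gradient = x.
w = [0.0, 0.0]
x = [1.0, 2.0]                                    # hypothetical features of S_t
value = sum(wi * xi for wi, xi in zip(w, x))      # current estimate, here 0.0
w_new = sgd_step(w, target=1.0, value=value, gradient=x, alpha=0.1)
```

After the step the estimate for this state has moved from 0 toward the target 1, by an amount controlled by $\alpha$.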

For Monte Carlo, $U_t = G_t$ is an unbiased estimate of $v_\pi(S_t)$.
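A gradient Monte Carlo prediction loop might look as follows. Everything concrete here is an assumption made for illustration: the environment is a 5-state random walk (states 1 to 5, start in 3, reward 1 only when leaving on the right, $\gamma=1$, so $v_\pi(s)=s/6$), the two features are hand-picked, and the step size and episode count are untuned guesses.

```python
import random

N_STATES = 5   # non-terminal states 1..5; 0 and 6 are terminal
START = 3


def features(s):
    """Hypothetical features: a bias term plus the centred, scaled position."""
    return [1.0, (s - START) / 3.0]


def v_hat(s, w):
    """Linear value estimate w . x(s)."""
    return sum(wi * xi for wi, xi in zip(w, features(s)))


def generate_episode(rng):
    """Random walk under the fixed policy; returns visited states and final reward."""
    s, states = START, []
    while 1 <= s <= N_STATES:
        states.append(s)
        s += rng.choice((-1, 1))
    return states, (1.0 if s == N_STATES + 1 else 0.0)


def gradient_mc(episodes=20000, alpha=0.0005, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(episodes):
        # With gamma = 1 and zero intermediate rewards, G_t equals the
        # final reward for every step t of the episode.
        states, g = generate_episode(rng)
        for s in states:
            x = features(s)                   # gradient of a linear v_hat
            error = g - v_hat(s, w)
            w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    return w


w = gradient_mc()
```

The step size is kept small because the Monte Carlo targets are noisy: every return is either 0 or 1, regardless of the state it is used for.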
<img src="https://pic2.zhimg.com/80/v2-fe510bbb6ce95ccfec286aa98373fa99_hd.png" width="65%"/>

For DP and TD, we bootstrap: $U_t$ depends on the current value of $w_t$. Thus it is *biased*, and the update ignores the effect of $w_t$ on the target.

We then speak of *semi-gradient* methods. They:

* Have fewer theoretical guarantees.
* In practice, often converge faster (an advantage of bootstrapping).
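A semi-gradient TD(0) loop differs from plain SGD only in where the target comes from. The sketch below assumes a 5-state random walk (states 1 to 5, start in 3, reward 1 only when leaving on the right, $\gamma=1$, so $v_\pi(s)=s/6$) with two hand-picked linear features; the environment, features, step size and episode count are all illustrative choices, not part of the notes.

```python
import random

N_STATES = 5   # non-terminal states 1..5; 0 and 6 are terminal
START = 3


def features(s):
    """Hypothetical features: a bias term plus the centred, scaled position."""
    return [1.0, (s - START) / 3.0]


def v_hat(s, w):
    """Linear value estimate; terminal states have value 0 by definition."""
    if s < 1 or s > N_STATES:
        return 0.0
    return sum(wi * xi for wi, xi in zip(w, features(s)))


def semi_gradient_td0(episodes=10000, alpha=0.005, gamma=1.0, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(episodes):
        s = START
        while 1 <= s <= N_STATES:
            s_next = s + rng.choice((-1, 1))
            reward = 1.0 if s_next == N_STATES + 1 else 0.0
            # Bootstrapped target: it depends on w, but is treated as a
            # constant, so only v_hat(s, w) is differentiated.
            target = reward + gamma * v_hat(s_next, w)
            delta = target - v_hat(s, w)
            w = [wi + alpha * delta * xi for wi, xi in zip(w, features(s))]
            s = s_next
    return w


w = semi_gradient_td0()
```

Because the update happens after every transition, learning does not have to wait for the end of the episode.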

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQBKLq3I_pUmk-LLGTDQKQtDZ41XbTFhHSY3tOcFhCXwyn_YCkw" width="65%">

## Linear Methods
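
As a reminder of the standard setup (stated here as a summary, with $x(s)\in\mathbb{R}^d$ denoting the feature vector of state $s$), a linear approximator and its gradient are

$$
\hat v(s,w) = w^\top x(s) = \sum_{i=1}^{d} w_i x_i(s), \qquad \nabla\hat v(s,w) = x(s)
$$

so the general update with target $U_t$ reduces to

$$
w_{t+1} = w_{t} + \alpha[U_t - w_t^\top x(S_t)]\,x(S_t)
$$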
<img src="https://i.stack.imgur.com/h68dd.png" width="65%"/>

## Feature Extraction for Linear Methods

* Polynomial

* Fourier basis

* Coarse coding

* Tile coding (symmetrical vs. asymmetrical offsets)

* Radial basis functions
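
Several of the feature constructions above can be sketched for a one-dimensional state $s\in[0,1]$. The function names, default parameters and the uniform tiling offsets are assumptions made for illustration; tile coding is shown as one concrete form of coarse coding.

```python
import math


def polynomial_features(s, degree):
    """[1, s, s^2, ..., s^degree]."""
    return [s ** k for k in range(degree + 1)]


def fourier_features(s, order):
    """Cosine basis cos(k * pi * s) for k = 0..order, with s scaled to [0, 1]."""
    return [math.cos(k * math.pi * s) for k in range(order + 1)]


def rbf_features(s, centers, sigma):
    """Gaussian bumps exp(-(s - c)^2 / (2 sigma^2)), one per centre."""
    return [math.exp(-((s - c) ** 2) / (2.0 * sigma ** 2)) for c in centers]


def tile_indices(s, n_tilings=4, tiles_per_tiling=4):
    """Index of the active tile in each tiling; tilings are shifted uniformly."""
    width = 1.0 / tiles_per_tiling
    active = []
    for k in range(n_tilings):
        offset = k * width / n_tilings
        tile = int((s + offset) / width)          # 0..tiles_per_tiling
        active.append(k * (tiles_per_tiling + 1) + tile)
    return active


def tile_features(s, n_tilings=4, tiles_per_tiling=4):
    """Binary feature vector with exactly one active feature per tiling."""
    x = [0.0] * (n_tilings * (tiles_per_tiling + 1))
    for i in tile_indices(s, n_tilings, tiles_per_tiling):
        x[i] = 1.0
    return x


poly = polynomial_features(2.0, degree=3)
four = fourier_features(0.0, order=3)
rbf = rbf_features(0.5, centers=[0.0, 0.5, 1.0], sigma=0.1)
tiles = tile_features(0.3)
```

With tile coding the number of active features is the same for every state, which makes the step size $\alpha$ easier to interpret than with the dense feature sets.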

## Non-Linear Function Approximation

Neural Networks

Kernel-based methods (e.g. SVM)

Lazy Learning

# Homework

Obligatory:

* Consider the "do nothing" policy and approximate its value function for the Cart Pole example in OpenAI Gym, e.g. using Monte Carlo with a linear approximator and some feature coding.

Optional:

*  Same as above, for alternative settings of $\alpha$. When does it "work"? When does it behave "strangely"?
