# Off-policy Methods with Approximation

Let's recall what we should already know: what is an on-policy approach, and what is an off-policy one?

* SARSA
$$
Q(S_t,A_t) \gets Q(S_t,A_t) + \alpha \left[ R_{t+1} +\gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t) \right]
$$
* Q Learning
$$
Q(S_t,A_t) \gets Q(S_t,A_t) + \alpha \left[ R_{t+1} +\gamma \max_{a'}Q(S_{t+1},a') - Q(S_t,A_t) \right]
$$
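To make the difference concrete, here is a minimal tabular sketch. The table `q`, the state and action names, and the values of `gamma` and `alpha` are hypothetical and chosen only for illustration. SARSA bootstraps from the action the agent actually took next, while Q Learning bootstraps from the greedy action.

```python
def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + gamma * q[next_state][next_action]


def q_learning_target(q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the greedy action in the next state."""
    return reward + gamma * max(q[next_state].values())


def td_update(q, state, action, target, alpha):
    """Move Q(state, action) a fraction alpha of the way toward the target."""
    q[state][action] += alpha * (target - q[state][action])


# Hypothetical two-state, two-action table used only for illustration.
q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 3.0},
}
gamma, alpha = 0.9, 0.5

# Suppose the agent explored and picked "left" in s1.
t_sarsa = sarsa_target(q, reward=1.0, next_state="s1", next_action="left", gamma=gamma)
t_qlearn = q_learning_target(q, reward=1.0, next_state="s1", gamma=gamma)

# Apply the Q Learning target to the visited pair (s0, "right").
td_update(q, "s0", "right", t_qlearn, alpha)
```

The exploratory action lowers the SARSA target but leaves the Q Learning target untouched, which is why Q Learning can learn about the greedy policy while following a different one.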

Question:

* What about DP?

Main benefits of off-policy methods:

* Not mixing exploration and exploitation.
* Can use recorded experience of a different agent (even human).

Disclaimer:

* This lecture is quite close to the cutting edge of the field.
* It attempts to provide some intuition and understanding.

Two main challenges:

* Target of the update (the $U_t$ produced under the behavior policy is not the $U_t$ we are interested in) => can be handled by importance sampling
* Distribution of the updates (it no longer follows the on-policy distribution) => to be addressed here


## Semi-gradient Methods

Semi-gradient off-policy TD(0):

$$
w_{t+1} = w_t + \alpha \rho_t \delta_t \nabla \hat v (S_t,w_t)
$$
where
$$
\rho_t = \frac{\pi(A_t|S_t)}{b(A_t|S_t)}
$$
and
$$
\delta_t = R_{t+1}+\gamma \hat v (S_{t+1},w_t) - \hat v (S_t,w_t) 
$$
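A minimal sketch of one such update, assuming a linear approximator $\hat v(s,w)=w^\intercal x(s)$ so that the gradient is simply the feature vector. The function name, the feature vectors and the probabilities below are made up for illustration.

```python
def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def off_policy_td0_step(w, x, reward, x_next, pi_prob, b_prob, alpha, gamma):
    """One semi-gradient off-policy TD(0) update for linear v_hat(s, w) = w . x(s)."""
    rho = pi_prob / b_prob                                # importance sampling ratio
    delta = reward + gamma * dot(w, x_next) - dot(w, x)   # TD error
    # For a linear approximator the gradient of v_hat w.r.t. w is just x.
    return [wi + alpha * rho * delta * xi for wi, xi in zip(w, x)]


# Hypothetical transition: the target policy takes the action with
# probability 1.0, the behavior policy with probability 0.5.
w = off_policy_td0_step(
    w=[0.0, 0.0], x=[1.0, 0.0], reward=1.0, x_next=[0.0, 1.0],
    pi_prob=1.0, b_prob=0.5, alpha=0.1, gamma=0.9,
)
```

Here $\rho_t = 2$, so the update is twice as large as the on-policy one would be for the same transition.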

Possible variants:
* Expected Sarsa
* $n$-step tree-backup algorithm

## Examples of Off-policy Divergence
<img src="images/11ExamplesOfOffPolicyDivergence.png" />

## The Deadly Triad

We need:

* Function approximation - if the number of states is large or even infinite.
* Bootstrapping - if we cannot wait until the end of episodes.
* Off-policy training - especially when learning multiple targets (people and animals learn many things at once; a step towards general AI).

When all three are combined, learning tends to be unstable.
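A small sketch of how the instability can show up. The setup is an assumption chosen for illustration and is not taken from these notes: two states share one weight, with value estimates $w$ and $2w$, the reward is 0, $\rho=1$, and only the transition from the first state to the second is ever updated.

```python
def repeated_transition_update(w, alpha, gamma, steps):
    """Apply semi-gradient TD(0) to a single transition over and over.

    Assumed setup: the first state has feature 1 (value estimate w), the
    second has feature 2 (value estimate 2w), the reward is 0, and rho = 1.
    """
    history = [w]
    for _ in range(steps):
        delta = 0.0 + gamma * (2.0 * w) - w   # TD error of the transition
        w = w + alpha * delta * 1.0           # gradient at the first state is 1
        history.append(w)
    return history


diverging = repeated_transition_update(w=1.0, alpha=0.1, gamma=0.99, steps=50)
shrinking = repeated_transition_update(w=1.0, alpha=0.1, gamma=0.4, steps=50)
```

Each step multiplies $w$ by $1+\alpha(2\gamma-1)$, so in this toy setup the weight grows without bound whenever $\gamma > 0.5$, regardless of how small the step size is.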

## Linear Value-function Geometry
<img src="images/11LinearValueFunctionGeometry.png" />
$$
||v||^2_{\mu}=\sum_s \mu(s)v(s)^2
$$
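Using this notation, with $B_\pi$ the Bellman operator and $\Pi$ the projection onto the space of representable value functions, we have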
$$
\overline{VE}(w) = ||v_w -v_\pi||_{\mu}^{2}
$$ 

$$
\overline{BE}(w) = ||B_\pi v_w - v_w ||_{\mu}^{2}
$$

$$
\overline{PBE}(w) = ||\Pi(B_\pi v_w - v_w) ||_{\mu}^{2}
$$
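The first two objectives can be evaluated directly when the model is known. The sketch below does this for a hypothetical two-state Markov reward process; the transition matrix `P`, the rewards and the weighting `mu` are made-up numbers. $\overline{PBE}$ is left out because it also needs the projection $\Pi$.

```python
def weighted_sq_norm(v, mu):
    """||v||^2_mu = sum_s mu(s) * v(s)^2."""
    return sum(m * x * x for m, x in zip(mu, v))


def bellman_operator(v, rewards, P, gamma):
    """(B v)(s) = r(s) + gamma * sum_s' P[s][s'] * v(s') for a Markov reward process."""
    return [r + gamma * sum(p * vn for p, vn in zip(row, v))
            for r, row in zip(rewards, P)]


def value_error(v_w, v_pi, mu):
    """VE: mu-weighted squared distance between v_w and the true values."""
    return weighted_sq_norm([a - b for a, b in zip(v_w, v_pi)], mu)


def bellman_error(v_w, rewards, P, gamma, mu):
    """BE: mu-weighted squared distance between B v_w and v_w."""
    b_v = bellman_operator(v_w, rewards, P, gamma)
    return weighted_sq_norm([a - b for a, b in zip(b_v, v_w)], mu)


# Hypothetical process: both states move to either state with probability 0.5.
P = [[0.5, 0.5], [0.5, 0.5]]
rewards = [0.0, 2.0]
mu = [0.5, 0.5]
gamma = 0.5
v_pi = [1.0, 3.0]   # solves v = r + gamma * P v for these numbers
v_w = [2.0, 2.0]    # an approximation that gives both states the same value

ve = value_error(v_w, v_pi, mu)
be = bellman_error(v_w, rewards, P, gamma, mu)
be_of_true_values = bellman_error(v_pi, rewards, P, gamma, mu)
```

The true value function is the only one with zero Bellman error, which the last line checks for this example.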



## Stochastic Gradient Descent in the Bellman Error

Problems:

- slow
- wrong value functions
- simply a bad objective

## The Bellman Error is Not Learnable

Let's consider these two Markov reward processes (one action only).
<img src="images/11TheBellmanErrorIsNotLearnable1.png"/>
They produce statistically identical sequences of rewards, although internally they are different. By chance, we represent the value function by the same $w$.

For $\gamma=0$, the true values are 1, 0, and 2 respectively. Take $w=1$: even $\overline{VE}$ is not learnable, but the parameter that minimizes it is:

<img src="images/11TheBellmanErrorIsNotLearnable2.png"/>
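As a worked check, assume every state is represented by the same single parameter $w$ and that $\mu$ weights the two states of the second process equally (both are assumptions about the figure, not stated in the text). With true values 1 for the first process and 0 and 2 for the second, setting $w=1$ gives:

$$
\overline{VE}_{1}(1) = (1-1)^2 = 0, \qquad
\overline{VE}_{2}(1) = \tfrac{1}{2}(1-0)^2+\tfrac{1}{2}(1-2)^2 = 1
$$

So the same data can correspond to different values of $\overline{VE}$. Under these assumptions the minimizer is nevertheless the same, since $\tfrac{1}{2}w^2+\tfrac{1}{2}(w-2)^2$ and $(w-1)^2$ are both smallest at $w=1$.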

With the Bellman error, the situation is different: it is not learnable, and neither is the parameter that minimizes it:
<img src="images/11TheBellmanErrorIsNotLearnable3.png"/>

An example that generates the same data but results in different $\overline{BE}$ minimizers:
<img src="images/11TheBellmanErrorIsNotLearnable4.png"/>

Assume that $B$ and $B'$ cannot be distinguished. For the first process the minimizer is $w=(0,0)$; for the second, after a more involved calculation, it is $w=(-\frac{1}{2},0)$.



## Gradient-TD Methods
Gradient-TD methods minimize $\overline{PBE}$. Unlike the semi-gradient methods above, they perform true gradient descent.

$$
\nabla\overline{PBE}(w) = 2\mathbb{E}[\rho_t(\gamma x_{t+1}-x_t)x_t^{\intercal}]\mathbb{E}[x_t x_t^{\intercal}]^{-1}\mathbb{E}[\rho_t\delta_t x_t]
$$

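There is a trick to

- obtain an unbiased estimate of the product (the first and third factors both depend on $x_{t+1}$, so their sample estimates cannot simply be multiplied),
- avoid computing the matrix inverse explicitly.

The trick is to keep a second vector $v_t$ that estimates the product of the last two factors and to update it alongside $w_t$: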

$$
w_{t+1} = w_t + \alpha \rho_t(\delta_t x_t - \gamma x_{t+1}x_t^\intercal v_t)
$$
with
$$
v_{t+1}=v_t+\beta\rho_t(\delta_t-v_t^{\intercal}x_t)x_t
$$
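The two coupled updates can be sketched as follows, assuming linear features and taking $\rho_t$ as given. The function name and the numbers in the usage lines are hypothetical.

```python
def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def gradient_td_step(w, v, x, reward, x_next, rho, alpha, beta, gamma):
    """One update of the main weights w and the secondary vector v.

    Linear case: v_hat(s, w) = w . x(s). The vector v tracks
    E[x x^T]^{-1} E[rho * delta * x] without an explicit matrix inverse.
    """
    delta = reward + gamma * dot(w, x_next) - dot(w, x)   # TD error
    v_dot_x = dot(v, x)                                   # scalar x_t^T v_t
    w_new = [wi + alpha * rho * (delta * xi - gamma * xn * v_dot_x)
             for wi, xi, xn in zip(w, x, x_next)]
    v_new = [vi + beta * rho * (delta - v_dot_x) * xi
             for vi, xi in zip(v, x)]
    return w_new, v_new


# Apply the same hypothetical transition twice.
w1, v1 = gradient_td_step([0.0, 0.0], [0.0, 0.0], x=[1.0, 0.0], reward=1.0,
                          x_next=[0.0, 1.0], rho=1.0, alpha=0.1, beta=0.5, gamma=0.9)
w2, v2 = gradient_td_step(w1, v1, x=[1.0, 0.0], reward=1.0,
                          x_next=[0.0, 1.0], rho=1.0, alpha=0.1, beta=0.5, gamma=0.9)
```

Both new vectors are computed from the old $w_t$ and $v_t$. On the first step $v$ is still zero, so the update coincides with semi-gradient TD(0); the correction term only acts once $v$ has moved away from zero.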

## Emphatic-TD Methods


$$
\delta_t = R_{t+1}+\gamma \hat v (S_{t+1},w_t) - \hat v (S_t,w_t) 
$$

$$
w_{t+1} = w_t + \alpha M_t \rho_t \delta_t \nabla \hat v (S_t,w_t)
$$

$$
M_t = \gamma \rho_{t-1}M_{t-1}+I_t
$$
where $I_t$ is the interest and $M_t$ is the emphasis, initialized to $M_{-1}=0$.
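A minimal sketch of one step, again assuming linear features. The caller carries the emphasis and the previous importance sampling ratio from step to step, starting from a previous emphasis of 0. The function name and all numbers are made up for illustration.

```python
def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def emphatic_td_step(w, m_prev, rho_prev, x, reward, x_next, rho,
                     interest, alpha, gamma):
    """One emphatic-TD update for linear v_hat(s, w) = w . x(s).

    Returns the new weights and the emphasis M_t; the caller passes M_t
    back in as m_prev (and rho as rho_prev) on the next step.
    """
    m = gamma * rho_prev * m_prev + interest              # emphasis M_t
    delta = reward + gamma * dot(w, x_next) - dot(w, x)   # TD error
    w_new = [wi + alpha * m * rho * delta * xi for wi, xi in zip(w, x)]
    return w_new, m


# Two hypothetical steps with interest 1 in every state.
w1, m1 = emphatic_td_step([0.0, 0.0], m_prev=0.0, rho_prev=1.0,
                          x=[1.0, 0.0], reward=1.0, x_next=[0.0, 1.0],
                          rho=2.0, interest=1.0, alpha=0.1, gamma=0.9)
w2, m2 = emphatic_td_step(w1, m_prev=m1, rho_prev=2.0,
                          x=[0.0, 1.0], reward=0.0, x_next=[1.0, 0.0],
                          rho=1.0, interest=1.0, alpha=0.1, gamma=0.9)
```

The emphasis grows after a step with a large $\rho$, so updates to states that the target policy would reach often are weighted more heavily.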


## Reducing Variance

Off-policy learning => inherently higher variance than on-policy learning.

Various tricks have been proposed to cope with this.