
# On-policy learning vs Off-policy learning

On-policy learning is about learning the optimal behaviour through the same policy that the agent is following. However, the agent needs to behave non-optimally at times in order to explore all actions and find the optimal ones. So how can it learn about the optimal policy while behaving according to an exploratory policy?

An alternative way to learn the optimal policy is to use two policies:
- One policy is learned about and becomes the optimal policy - called the *target policy $\pi$*
- Another policy is more exploratory and is used to generate behaviour - called the *behaviour/exploratory policy $\mu$*

Because the learning in this case is from data "off" the target policy, the overall process is termed *off-policy* learning.

## Pros and cons of On-policy learning and Off-policy learning

##### On-policy
- Concept is simpler
- Easier to converge

##### Off-policy
- Concept is harder
- Slower to converge
- Greater variance
- More exploration
- Learn from others' experience
- Key to learning multi-step predictive models of the world's dynamics

## The prediction problem of off-policy methods

Suppose we wish to estimate $v_\pi$ or $q_\pi$, but all we have are episodes following another policy $\mu$, where $\mu \neq \pi$. In this setting, $\pi$ is called the target policy: it is the policy that we want to learn about and optimise. The policy $\mu$ is the behaviour policy, from which the actions are sampled. For the prediction problem, both policies are considered fixed and given.

##### Some requirements - The assumption of coverage
If we want to use episodes from $\mu$ to estimate values for $\pi$, we need to ensure that every action taken under $\pi$ is also taken, at least occasionally, under $\mu$. Formally, we require $\pi(a|s) > 0 \implies \mu(a|s) > 0$. 

Under this setting, the behaviour policy $\mu$ must be stochastic in states where it is not identical to $\pi$, while the target policy $\pi$ can be deterministic. For simplicity, we can assume that $\pi$ is a greedy policy, i.e. $\max_a \pi(a|s) = 1$ for every state $s$.
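The coverage requirement can be checked mechanically when both policies are available as tables. The following is a minimal sketch, assuming tabular policies stored as NumPy arrays of shape `(n_states, n_actions)`; the function name `has_coverage` and the example policies are hypothetical and only serve as illustration.

```python
import numpy as np

def has_coverage(pi, mu):
    """Return True if mu(a|s) > 0 wherever pi(a|s) > 0.

    pi and mu are arrays of shape (n_states, n_actions) whose rows sum to 1.
    """
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Select the entries of mu where pi puts positive probability.
    return bool(np.all(mu[pi > 0] > 0))

# Deterministic (greedy) target policy over 2 states and 2 actions.
pi_greedy = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
# Soft behaviour policy: every action has non-zero probability.
mu_soft = np.array([[0.9, 0.1],
                    [0.1, 0.9]])
# Behaviour policy that never takes action 1 in state 1.
mu_bad = np.array([[0.5, 0.5],
                   [1.0, 0.0]])

covered = has_coverage(pi_greedy, mu_soft)
not_covered = has_coverage(pi_greedy, mu_bad)
```

Here `mu_bad` fails the check because the target policy takes action 1 in state 1, which the behaviour policy never samples.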

### The cornerstone of off-policy methods - Importance Sampling
Importance sampling is a general technique for estimating expected values under one distribution given samples from another. It weights each return according to the relative probability of its trajectory occurring under the target and behaviour policies. 

Given a starting state $S_t$, the probability of the subsequent state-action trajectory $A_t, S_{t+1}, A_{t+1}, ..., S_T$ occurring **under policy $\pi$** (that is, with $A_{t:T-1} \sim \pi$) is 

\begin{equation}
\begin{split}
\Pr\{A_t, &S_{t+1}, A_{t+1}, ..., S_T | S_t, A_{t:T-1} \sim \pi\} \\
& = \pi(A_t|S_t)p(S_{t+1}|S_t,A_t) \pi(A_{t+1}|S_{t+1}) \cdots p(S_T|S_{T-1}, A_{T-1}) \\
& = \prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k,A_k)
\end{split}
\end{equation}
where $p$ is the state-transition probability function.

Thus, the relative probability of the trajectory under the target and behaviour policies, or the importance-sampling ratio, is
\begin{equation}
\begin{split}
\rho_{t:T-1} &= \frac{\prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1}\mu(A_k|S_k)p(S_{k+1}|S_k,A_k)} \\
&=\prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{\mu(A_k|S_k)}
\end{split}
\end{equation}

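which does not depend on the transition probabilities: the $p$ terms are identical in the numerator and the denominator and cancel, so the ratio depends only on the two policies and on the sequence of states and actions.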

Under off-policy learning, we wish to estimate the expected returns under the target policy. However, the actions are sampled based on the behaviour policy, so we cannot calculate this expectation directly from the sampled returns. The ratio $\rho_{t:T-1}$ transforms the returns so that they have the right expected value (the expectation below is taken over trajectories generated by $\mu$):
\begin{equation}
\mathop{\mathbb{E}}[\rho_{t:T-1}G_t | S_t = s] = v_{\pi}(s)
\end{equation}
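The identity can be checked exactly on a toy problem by enumerating every trajectory. The sketch below assumes a hypothetical single-state task with two actions, a two-step horizon, no discounting, and a reward that depends only on the action; all names and numbers are illustrative assumptions, not part of the original notes.

```python
import itertools
import numpy as np

pi = np.array([0.2, 0.8])      # target policy pi(a)
mu = np.array([0.5, 0.5])      # behaviour policy mu(a), covers pi
reward = np.array([1.0, 3.0])  # reward received for each action
horizon = 2                    # number of actions per episode

def importance_ratio(actions):
    """rho: product over the steps of pi(A_k) / mu(A_k)."""
    idx = list(actions)
    return float(np.prod(pi[idx] / mu[idx]))

expected_under_mu = 0.0  # E_mu[rho * G]
v_pi = 0.0               # E_pi[G], the true value under the target policy
for actions in itertools.product(range(2), repeat=horizon):
    idx = list(actions)
    G = float(reward[idx].sum())          # undiscounted return
    prob_mu = float(np.prod(mu[idx]))     # trajectory probability under mu
    prob_pi = float(np.prod(pi[idx]))     # trajectory probability under pi
    expected_under_mu += prob_mu * importance_ratio(actions) * G
    v_pi += prob_pi * G
```

Both sums agree because `prob_mu * rho` equals `prob_pi` for every trajectory, which is exactly the cancellation that makes the ratio work.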


### Applying to Monte Carlo value estimation
Recall from Monte Carlo value estimation:

If we wish to estimate the value of state $s$ under policy $\pi$, i.e. $v_{\pi}(s)$, we can generate a set of episodes that pass through $s$ and average the returns of the visits to $s$. 

There are two ways of averaging. We can either:
1. Average only the first visits to state $s$, which is called the first-visit MC method, or
2. Average all visits to state $s$, which is called every-visit MC method.

The same methodology also applies to off-policy methods with a slight tweak.
Let $J(s)$ denote the set of all time steps in which state $s$ is visited (for the first-visit method, only the time steps that are first visits to $s$ within their episodes), let $T(t)$ denote the first time of termination following time $t$, and let $G_t$ denote the return after $t$ up to $T(t)$. Then, to estimate $v_{\pi}(s)$, we simply scale the returns by the ratios and average the results (this simple average is known as ordinary importance sampling):
\begin{equation}
V(s) = \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1}G_t}{|J(s)|}
\end{equation}
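As a small illustration of this estimator, the sketch below assumes the ratios and returns for the visits in $J(s)$ have already been computed and are passed in as `(rho, G)` pairs. The function name and the sample numbers are hypothetical.

```python
def ordinary_is_estimate(weighted_returns):
    """Average of rho * G over the visits to a state.

    weighted_returns: list of (rho, G) pairs, one per time step in J(s).
    """
    if not weighted_returns:
        raise ValueError("need at least one visit to the state")
    total = sum(rho * G for rho, G in weighted_returns)
    return total / len(weighted_returns)

# Three visits: note that a ratio of 0 still counts in the denominator.
samples = [(2.0, 1.0), (0.0, 5.0), (0.5, 4.0)]
v_estimate = ordinary_is_estimate(samples)
```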

### Incremental update
Suppose we have a sequence of returns $G_1, G_2, ..., G_{n-1}$, all starting in the same state, and each with a corresponding random weight $W_i = \rho_{t_i:T(t_i) - 1}$. Dividing by the sum of the weights, instead of by the number of returns, gives the weighted importance-sampling estimate
\begin{equation}
V_n = \frac{\sum_{k=1}^{n-1}W_kG_k}{\sum_{k=1}^{n-1}W_k}, \hspace{1cm} n \geq 2
\end{equation}

and the update rule becomes
\begin{equation}
V_{n+1} = V_{n} + \frac{W_n}{C_n}[G_n - V_n], \hspace{1cm} n \geq 1
\end{equation}
and

\begin{equation}
C_{n+1} = C_n + W_{n+1}
\end{equation}
where $C_0 = 0$ and $V_1$ is arbitrary.

The update rule is the same for $Q$: just replace $V(s)$ with $Q(s,a)$.
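A minimal sketch of the incremental rule, assuming the returns and weights for one state arrive as two parallel sequences. It keeps the cumulative weight `c` (the $C_n$ above) and should reproduce the batch weighted average; the function name and the sample values are illustrative assumptions.

```python
def incremental_weighted_average(returns, weights, v_init=0.0):
    """Apply V <- V + (W / C) * (G - V) with C accumulating the weights."""
    v = v_init  # V_1 is arbitrary; it is overwritten by the first non-zero weight
    c = 0.0     # C_0 = 0
    for g, w in zip(returns, weights):
        if w == 0.0:
            continue  # a zero weight leaves both C and V unchanged
        c += w
        v += (w / c) * (g - v)
    return v

returns = [1.0, 5.0, 4.0]
weights = [2.0, 0.0, 0.5]
v_incremental = incremental_weighted_average(returns, weights)

# Batch form of the same estimate: sum(W * G) / sum(W).
v_batch = sum(w * g for g, w in zip(returns, weights)) / sum(weights)
```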

##### Pseudo code of off-policy MC prediction for Q
---
```
Input: an arbitrary target policy pi
Initialise, for all s in S, a in A(s):
    Q(s,a) arbitrarily
    C(s,a) = 0
    
Loop forever (for each episode):
    mu = any policy with coverage of pi
    Generate an episode following mu: S0, A0, R1, ..., ST-1, AT-1, RT
    
    G = 0
    W = 1
    
    Loop for each step of episode, t = T-1, T-2, ..., 0, while W =/= 0:
        G = gamma * G + Rt+1
        C(St,At) = C(St,At) + W
        Q(St,At) = Q(St,At) + W/C(St,At) * (G - Q(St,At))
        W = W * (pi(At|St) / mu(At|St))
```
---
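The inner loop of this procedure can be sketched in Python for a single episode, applying the incremental rule $V_{n+1} = V_n + \frac{W_n}{C_n}[G_n - V_n]$ to $Q$. This is only a sketch under stated assumptions: tabular `Q`, `C`, `pi` and `mu` stored as NumPy arrays indexed by `[state, action]`, and an episode given as a list of `(state, action, reward)` tuples. The function name and the toy episode are hypothetical.

```python
import numpy as np

def off_policy_mc_update(Q, C, episode, pi, mu, gamma=1.0):
    """Update Q and C in place from one episode generated by mu.

    episode: list of (S_t, A_t, R_{t+1}) tuples in time order.
    """
    G = 0.0  # return accumulated backwards from the end of the episode
    W = 1.0  # importance-sampling weight for the steps after time t
    for s, a, r in reversed(episode):
        G = gamma * G + r
        C[s, a] += W
        Q[s, a] += (W / C[s, a]) * (G - Q[s, a])
        W *= pi[s, a] / mu[s, a]
        if W == 0.0:
            break  # earlier steps would get zero weight, so stop here
    return Q, C

n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
C = np.zeros((n_states, n_actions))
pi = np.array([[1.0, 0.0], [0.0, 1.0]])    # deterministic target policy
mu = np.full((n_states, n_actions), 0.5)   # uniform behaviour policy

episode = [(0, 0, 1.0), (1, 1, 2.0)]
off_policy_mc_update(Q, C, episode, pi, mu, gamma=1.0)
```

For this episode the last step is weighted by 1 and the first step by $\pi(1|1)/\mu(1|1) = 2$, so `C[1, 1]` becomes 1 and `C[0, 0]` becomes 2.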

## Off-policy TD Control


## Q-learning
Q-learning is similar to SARSA(0), but for off-policy learning.
It learns the action values $Q(s,a)$ off-policy and does not require importance sampling. 
Here's the idea:
1. The next action is chosen using the behaviour policy, $A_{t+1} \sim \mu(\cdot|S_{t+1})$
2. But we consider an alternative successor action $A' \sim \pi(\cdot|S_{t+1})$
3. Then, we update $Q(S_t, A_t)$ towards the value of the alternative action

\begin{equation}
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t)]
\end{equation}

We now allow both the behaviour and target policies to improve.
Say the target policy $\pi$ is greedy w.r.t. $Q(s,a)$

\begin{equation}
\pi(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')
\end{equation}

and the behaviour policy $\mu$ is $\epsilon$-greedy w.r.t. $Q(s,a)$.

The Q-learning target then simplifies:

\begin{equation}
\begin{split}
& R_{t+1} + \gamma Q(S_{t+1}, A') \\
& = R_{t+1} + \gamma Q(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a')) \\
&= R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')
\end{split}
\end{equation}

#### SARSAMAX update
\begin{equation}
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_{a'}  Q(S_{t+1}, a') - Q(S_t, A_t)]
\end{equation}

##### Pseudo code
---
```
Algorithm parameters: step size alpha, small epsilon > 0
Initialise Q(s,a) for all s in S+, a in A(s),
arbitrarily except that Q(terminal,.) = 0

Loop for each episode:
    Initialise S
    Loop for each step of episode:
        Choose A from S using policy derived from Q (e.g. epsilon-greedy)
        Take action A, observe R, S'
        Q(S,A) = Q(S,A) + alpha * (R + gamma * max_a Q(S',a) - Q(S,A))
        S = S'
    until S is terminal
    
```
---
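The pseudocode can be turned into a small runnable sketch. The environment below is a hypothetical four-state chain (not from the original notes): the agent starts in state 0, moves left or right, and receives a reward of 1 only when it reaches the terminal state 3. The hyperparameters are arbitrary illustrative choices.

```python
import numpy as np

N_STATES = 4          # states 0..3; state 3 is terminal
LEFT, RIGHT = 0, 1

def step(state, action):
    """Hypothetical chain environment: reward 1 only on reaching the terminal state."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def epsilon_greedy(Q, state, epsilon, rng):
    """Behaviour policy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(rng.choice(best))  # break ties between equal values at random

def q_learning(n_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, 2))  # Q(terminal, .) stays 0: it is never updated
    for _ in range(n_episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Target uses the greedy (max) action in S', not the action taken next.
            td_target = reward + gamma * Q[next_state].max()
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
    return Q

Q = q_learning()
greedy_actions = Q[:-1].argmax(axis=1)  # greedy action in each non-terminal state
```

In this chain the learned greedy policy should move right in every non-terminal state, even though the agent keeps exploring with the epsilon-greedy behaviour policy.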

## Expected SARSA

Expected SARSA is an alternative algorithm that is just like Q-learning, except that instead of taking the maximum over the next state-action pairs, it uses their expected value, weighting each action by how likely it is under the current policy.

#### Formulation

\begin{equation}
\begin{split}
Q(S_t, A_t) & \leftarrow Q(S_t, A_t) + \alpha [R_{t+1}+ \gamma \mathbb{E}_\pi[Q(S_{t+1}, A_{t+1}) | S_{t+1}] - Q(S_t, A_t)] \\
& \leftarrow Q(S_t, A_t) + \alpha [R_{t+1}+ \gamma \sum_a \pi(a|S_{t+1})Q(S_{t+1}, a) - Q(S_t, A_t)]
\end{split}
\end{equation}

Given the next state $S_{t+1}$, this algorithm moves deterministically in the same direction as SARSA moves in expectation, and thus it is called Expected SARSA.

Expected SARSA is more complex computationally than SARSA but, in return, it eliminates the variance due to the random selection of $A_{t+1}$.
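The Expected SARSA target is a probability-weighted sum over the next action values. The sketch below assumes an epsilon-greedy policy for $\pi$ purely as an example; the helper names and the numbers are hypothetical.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

def expected_sarsa_target(reward, q_next, pi_next, gamma):
    """R + gamma * sum_a pi(a|S') * Q(S', a)."""
    return reward + gamma * float(np.dot(pi_next, q_next))

q_next = np.array([1.0, 3.0])                          # Q(S', a) for two actions
pi_next = epsilon_greedy_probs(q_next, epsilon=0.2)    # probabilities 0.1 and 0.9
target = expected_sarsa_target(1.0, q_next, pi_next, gamma=0.5)

# With epsilon = 0 the policy is greedy and the target equals the Q-learning target.
greedy_target = expected_sarsa_target(
    1.0, q_next, epsilon_greedy_probs(q_next, epsilon=0.0), gamma=0.5
)
```

Because the sum runs over all actions, the target does not depend on which action is actually sampled in $S_{t+1}$, which is where the variance reduction comes from.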

# Summary

![Relationship Between DP and TD_1](https://raw.githubusercontent.com/RLWH/reinforcement-learning-notebook/master/images/Relationship_DP_TD_1.png)

![Relationship Between DP and TD_2](https://raw.githubusercontent.com/RLWH/reinforcement-learning-notebook/master/images/Relationship_DP_TD_2.png)