## 1. Off-Policy control

**DILEMMA** (on-policy control) : agents need to learn about the *optimal* policy while behaving according to an *exploratory* policy

Off-Policy control separates target policy and behavior policy
* Evaluate target policy $\pi(a|s)$ to compute $V_{\pi}(s)$ or $q_{\pi}(s,a)$ while following behavior policy $\mu(a |s)$
* Learn from observing other agents
* Re-use experience generated from old policies
* Learn about optimal policy while following exploratory policy
* Learn about *multiple* policies while following *one* policy

## 2. Importance sampling

Importance sampling estimates properties of a distribution of interest using only *samples generated from a different distribution*

$$E_{X \sim P}[f(X)]  = \sum P(X)f(X)   = \sum Q(X) \frac{P(X)}{Q(X)} f(X)   = E_{X \sim Q} \left[ \frac{P(X)}{Q(X)}f(X) \right]  $$
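
The identity above can be checked numerically. The following is a minimal sketch, assuming small discrete distributions stored as dictionaries; the names `importance_sampling_estimate`, `p`, `q`, and `square` are hypothetical and chosen only for illustration.

```python
import random

def importance_sampling_estimate(f, p, q, n_samples, rng):
    """Estimate E_{X~P}[f(X)] from samples drawn from Q, reweighted by P(x)/Q(x)."""
    outcomes = list(q.keys())
    weights = [q[x] for x in outcomes]
    total = 0.0
    for _ in range(n_samples):
        x = rng.choices(outcomes, weights=weights)[0]  # sample X ~ Q
        total += (p[x] / q[x]) * f(x)                  # importance-weighted value
    return total / n_samples

def square(x):
    return x * x

# Illustrative distributions (assumed, not from the notes).
p = {0: 0.1, 1: 0.2, 2: 0.7}        # distribution of interest P
q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}  # sampling distribution Q

exact = sum(p[x] * square(x) for x in p)  # direct sum over P
estimate = importance_sampling_estimate(square, p, q, 20000, random.Random(0))
```

With enough samples, `estimate` should land close to `exact`, even though no sample was ever drawn from `p`.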

**2.1. Importance Sampling for Off-Policy MC**

* Use returns generated from $\mu$ to evaluate $\pi$
* Multiply importance sampling corrections along whole episode
$$ G_t^{\pi / \mu} = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})} \cdots \frac{\pi(A_T|S_T)}{\mu(A_T|S_T)}G_t$$
* Update value towards corrected return
$$V(S_t) \leftarrow V(S_t)+\alpha \left( G_{t}^{\pi / \mu} - V(S_t) \right) $$
* Importance sampling can dramatically increase variance
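
The corrected return and the value update above can be sketched as follows. This is an illustrative sketch, not a reference implementation: it assumes an episode is stored as a list of `(S_t, A_t, R_{t+1})` tuples and that both policies are nested dictionaries of action probabilities. The function names and the toy policies are hypothetical.

```python
def corrected_return(episode, t, pi, mu, gamma=1.0):
    """Return G_t scaled by the product of pi/mu ratios from step t to the episode end."""
    g = 0.0         # discounted return G_t
    rho = 1.0       # running product of importance sampling ratios
    discount = 1.0
    for state, action, reward in episode[t:]:
        g += discount * reward
        discount *= gamma
        rho *= pi[state][action] / mu[state][action]
    return rho * g

def mc_update(V, episode, pi, mu, alpha=0.1, gamma=1.0):
    """Move V(S_t) towards the corrected return for every step of the episode."""
    for t, (state, _, _) in enumerate(episode):
        target = corrected_return(episode, t, pi, mu, gamma)
        V[state] += alpha * (target - V[state])
    return V

# Assumed toy setup: the target policy prefers "left", the behavior policy is uniform.
pi = {"s0": {"left": 0.9, "right": 0.1}, "s1": {"left": 0.9, "right": 0.1}}
mu = {"s0": {"left": 0.5, "right": 0.5}, "s1": {"left": 0.5, "right": 0.5}}
episode = [("s0", "left", 1.0), ("s1", "left", 2.0)]

V = mc_update({"s0": 0.0, "s1": 0.0}, episode, pi, mu)
```

In this toy episode the ratio is 1.8 per step, so the weight on the return from `s0` is already 3.24 after two steps. The product grows or shrinks geometrically with episode length, which is where the variance comes from.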

**2.2. Importance Sampling for Off-Policy TD**
* Use TD targets generated from $\mu$ to evaluate $\pi$
* Weight TD target $R+\gamma V(S') $ by importance sampling
* Only need a single importance sampling correction
$$V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t|S_t)}{\mu (A_t|S_t)}(R_{t+1}+\gamma V(S_{t+1}))-V(S_t) \right)$$
* Much lower variance than Monte-Carlo importance sampling
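
A single off-policy TD update can be sketched as below. It assumes the same hypothetical dictionary representation for $\pi$ and $\mu$; the function name and the example values are illustrative only.

```python
def off_policy_td_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update with a single importance sampling correction."""
    rho = pi[s][a] / mu[s][a]               # one ratio, for the action actually taken
    target = rho * (r + gamma * V[s_next])  # weighted TD target
    V[s] += alpha * (target - V[s])
    return V[s]

# Assumed toy values for illustration.
pi = {"s0": {"left": 0.9, "right": 0.1}}
mu = {"s0": {"left": 0.5, "right": 0.5}}
V = {"s0": 0.0, "s1": 1.0}

new_value = off_policy_td_update(V, "s0", "left", 1.0, "s1", pi, mu)
```

Only one ratio enters each update, so the weight stays bounded by the largest single-step ratio instead of compounding over the whole episode.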

## 3. Q-learning

  **$\epsilon$-greedy (behavior policy) & greedy (target policy)**

target policy $\pi(S_{t+1})=\underset{a'}{\operatorname{argmax}}\,Q(S_{t+1},a')$

$\therefore$ target simplifies:

$$\begin{aligned} R_{t+1}+\gamma Q(S_{t+1},A') &= R_{t+1}+\gamma Q\left(S_{t+1},\underset{a'}{\operatorname{argmax}}\,Q(S_{t+1},a')\right) \\ &= R_{t+1}+\gamma \max_{a'} Q(S_{t+1},a') \end{aligned}$$

updating

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+\alpha \left( R_{t+1}+\gamma \max_{a'} Q(S_{t+1},a')-Q(S_t,A_t) \right)$$

converges to the optimal action-value function : $Q(s,a) \rightarrow q_{*}(s,a)$

### Algorithm

Initialize $Q(s,a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)$, arbitrarily, and $Q(\text{terminal-state}, \cdot)=0$

Repeat (for each episode):

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Initialize $S$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Repeat (for each step of episode):

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Choose $A$ from $S$ using policy derived from $Q$ &nbsp;&nbsp; **$\ast$ $\epsilon$-greedy**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Take action $A$, observe $R$,$S'$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$Q(S,A) \leftarrow Q(S,A)+\alpha \left[ R+\gamma \max_a Q(S',a)-Q(S,A) \right] $  &nbsp;&nbsp; **$\ast$ greedy**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (**cf. SARSA**: Choose $A'$ from $S'$ using policy derived from $Q$  &nbsp;&nbsp;  **$\ast$ $\epsilon$-greedy**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R+\gamma Q(S',A')-Q(S,A) \right] $ )

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$S \leftarrow S'$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; until $S$ is terminal
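
The algorithm above can be sketched in tabular form. The environment here is an assumption made only for illustration: a hypothetical 5-state corridor (not the Cliff Walking gridworld) with a reward of -1 per step and a terminal state at the right end. The names `step`, `epsilon_greedy`, and `q_learning` and all hyperparameter values are likewise hypothetical.

```python
import random

N_STATES = 5      # states 0..4; state 4 is terminal
ACTIONS = (0, 1)  # 0 = left, 1 = right

def step(state, action):
    """Assumed corridor dynamics: -1 reward per step, episode ends on reaching state 4."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return -1.0, next_state, done

def epsilon_greedy(Q, state, epsilon, rng):
    """Behavior policy: random action with probability epsilon, otherwise greedy."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])

def q_learning(n_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # terminal row is never updated, stays 0
    for _ in range(n_episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)   # behave epsilon-greedily
            reward, next_state, done = step(state, action)
            target = reward + gamma * max(Q[next_state])      # greedy target policy
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q

Q = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

The only place the two policies differ is inside the loop: the action is chosen by `epsilon_greedy`, while the target uses `max(Q[next_state])`. Replacing that maximum with the value of the next $\epsilon$-greedy action would turn the sketch into SARSA.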

### Q-Learning and SARSA

gridworld example : Cliff Walking