In [2]:
%matplotlib inline
import torch
import numpy as np 
import sklearn
import matplotlib.pyplot as plt

# Model-Free Approaches

In most real-life problems the transition dynamics are not available, so sampling becomes the main technique. 
Two families of methods are considered here, the Monte Carlo (MC) method and the temporal-difference (TD) method, along with three recurring themes:

1. Bootstrapping 
2. Exploration vs. exploitation
3. Off-policy vs. on-policy learning

In the MC method, the expected return is estimated by averaging sampled returns. This typically requires the whole trajectory, from the starting state all the way to the terminal state. 

`MC does not work for the non-terminating (continuing) case, because the return is only known once the episode ends.`

## First Visit MC

Within each episode, only the first visit to a state contributes a return to that state's estimate. 

Notation: $v(s)$ is the current value estimate and $N(s)$ is the number of visits to state $s$. 

Initialize: 

1. Cumulative state value: $S(s)=0, \forall s \in \mathbf{S}$
2. Estimated state value: $v(s)=0,  \forall s \in \mathbf{S}$
3. Visit count: $N(s)=0,  \forall s \in \mathbf{S}$

Loop:

Sample an episode from policy $\pi$: $S_0, A_0, R_1, S_1, A_1, R_2, ..., R_T,S_T$ `Note: some states may never be visited`

$G \leftarrow 0$

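Loop backward over the steps of the episode: $t=T-1, T-2, ..., 1, 0$

$$G \leftarrow \gamma G + R_{t+1}$$

Here $\gamma \in [0,1]$ is the discount factor; $\gamma = 1$ gives the undiscounted sum of rewards.

If $S_t$ does not appear in $S_0, S_1, ..., S_{t-1}$ ($\textbf{First Visit Cond}$), then with $s = S_t$: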
$$
\begin{align}
N(s) &\leftarrow N(s)+1 \\ 
S(s) &\leftarrow S(s)+G \\ 
v(s) &\leftarrow S(s)/N(s)
\end{align}
$$

The same strategy can also be applied to the state-action value function $q(s,a)$.
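
The first-visit procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the environment is a hypothetical five-state random walk (both ends terminal, reward 1 for reaching the right end), the policy moves left or right with equal probability, and the names `sample_episode` and `first_visit_mc` are made up for this example.

```python
import random
from collections import defaultdict

def sample_episode(rng, n_states=5, start=2):
    """Hypothetical random-walk chain: states 0..n_states-1, both ends terminal.
    The reward is 1 on reaching the right end and 0 otherwise."""
    states, rewards = [start], []
    s = start
    while 0 < s < n_states - 1:
        s += rng.choice((-1, 1))           # uniform random policy
        states.append(s)
        rewards.append(1.0 if s == n_states - 1 else 0.0)
    return states, rewards                 # T+1 states, T rewards

def first_visit_mc(n_episodes=5000, gamma=1.0, seed=0):
    rng = random.Random(seed)
    S = defaultdict(float)                 # cumulative return per state
    N = defaultdict(int)                   # visit count per state
    v = defaultdict(float)                 # value estimate per state
    for _ in range(n_episodes):
        states, rewards = sample_episode(rng)
        G = 0.0
        for t in reversed(range(len(rewards))):
            G = gamma * G + rewards[t]     # rewards[t] plays the role of R_{t+1}
            s = states[t]
            if s not in states[:t]:        # first-visit condition
                N[s] += 1
                S[s] += G
                v[s] = S[s] / N[s]
    return dict(v)

v = first_visit_mc()
print({s: round(v[s], 3) for s in sorted(v)})
```

For this toy chain the exact values of the three interior states are 0.25, 0.5 and 0.75, so the printed estimates should land close to those numbers. Dropping the `if` line turns the sketch into every-visit MC.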

## Every visit MC

`For the every visit algorithm, just remove the` ($\textbf{First Visit Cond}$).

## Observations and comments

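Let $N_n(s)$, $S_n(s)$ and $v_n(s)$ denote the quantities after $n$ updates of state $s$, so that $S_n(s)=v_n(s)N_n(s)$. By definition, we have the following:
$$
\begin{align}
N_{n+1}(s)&=N_n(s)+1\\
v_{n+1}(s)&=[S_n(s)+G]/N_{n+1}(s)\\
&=[v_n(s)N_n(s) +G]/N_{n+1}(s) \\ 
&=v_n(s)+\frac{1}{N_{n+1}(s)}[G-v_n(s)]
\end{align}
$$
This suggests a very general update form, with a step size $\alpha$ in place of $1/N_{n+1}(s)$: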
$$
\begin{equation}
v_{n+1} = v_{n} + \alpha (G-v_n)  \tag{1} \label{eq1}
\end{equation}
$$
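
As a quick numerical check of the derivation, the sketch below applies the incremental update with $\alpha = 1/N$ to a short list of made-up returns and compares the result with the ordinary batch average. The function name `incremental_mean` and the sample returns are illustrative only.

```python
def incremental_mean(returns):
    """Apply v <- v + (G - v) / N once per sampled return G."""
    v, n = 0.0, 0
    for G in returns:
        n += 1
        v += (G - v) / n
    return v

def constant_step(returns, alpha):
    """Same update with a fixed step size alpha, as in the general form."""
    v = 0.0
    for G in returns:
        v += alpha * (G - v)
    return v

returns = [1.0, 0.0, 4.0, 3.0]             # made-up sample returns
batch_mean = sum(returns) / len(returns)
running_mean = incremental_mean(returns)
recency_weighted = constant_step(returns, alpha=0.5)
print(batch_mean, running_mean, recency_weighted)
```

With $\alpha = 1/N$ the running value matches the batch mean, while a constant $\alpha$ weights recent returns more heavily.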

As pointed out before, some (state, action) pairs may never be visited because of the nature of Monte Carlo sampling. However, the policy improvement step computes 
$$\arg\max_{a} q(s,a)$$
and the values of the unvisited pairs are unknown at that point.
The standard optimization strategy in this framework alternates two steps: 
$$
\begin{cases}
    1. & \text{Evaluate the policy} \\
    2. & \text{Policy improvement}
\end{cases}
$$
so a purely greedy improvement step would never try the unvisited pairs. One therefore needs to balance exploration and exploitation, which is achieved with the $\epsilon$-greedy policy: instead of always taking the 'optimal' action, one still explores the other actions with some probability
$$
\pi(a|s)=
\begin{cases}
    1-\epsilon + \frac{\epsilon}{|A|} & \text{for  } a = \arg\max_{a'} Q(s,a') \\
     \frac{\epsilon}{|A|} & \text{otherwise}
\end{cases}
$$
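
A small sketch of the $\epsilon$-greedy rule, assuming the action values of one state are given as a plain Python list. The helper names `epsilon_greedy_probs` and `epsilon_greedy_action` are hypothetical, and ties are broken in favour of the lowest action index, which is one choice among several.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|s) for every action, given the action values of one state."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])   # ties go to the lowest index
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample an action: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

probs = epsilon_greedy_probs([0.1, 0.5, 0.2], epsilon=0.3)
action = epsilon_greedy_action([0.1, 0.5, 0.2], epsilon=0.3, rng=random.Random(0))
print(probs, action)
```

Note that the greedy action can also be drawn during the exploration branch, which is why its total probability is $1-\epsilon+\epsilon/|A|$ rather than $1-\epsilon$.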

We could run MC prediction followed by policy improvement on an episode-by-episode basis. This removes the need for a large number of iterations in the estimation/prediction step, which makes the scheme scalable to large Markov Decision Processes. 

For convergence, however, one needs to shrink the exploration rate over time, for example $\epsilon_k = 1/k$, where $k$ indexes the episode. 





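## Greedy in the limit with infinite exploration

The algorithm below is MC control with an $\epsilon$-greedy policy whose $\epsilon$ decays over the episodes. 

Notation: $q(s,a)$ is the state-action value function and $N(s,a)$ is the number of visits to the pair $(s,a)$. 

Initialize: 


1. Estimated state-action value: $q(s,a)=0,  \forall s \in \mathbf{S}, \forall a \in A$
2. Visit count: $N(s,a)=0,  \forall s \in \mathbf{S}, \forall a \in A$
3. Policy $\pi$ with enough exploration.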

Loop over episodes $k = 1, 2, ...$:

Sample episode $k$ from policy $\pi_k$: $S_0, A_0, R_1, S_1, A_1, R_2, ..., R_T,S_T$ 

$G \leftarrow 0$

Loop backward over the steps of the episode, $t=T-1, T-2, ..., 1, 0$, with $(s,a)=(S_t,A_t)$:

$$
\begin{align}
G &\leftarrow \gamma G + R_{t+1} \\
N(s,a) &\leftarrow N(s,a)+1 \\ 
q(s,a) &\leftarrow q(s,a)+[G-q(s,a)]/N(s,a)
\end{align}
$$
Update the policy for convergence: set $\epsilon=\frac{1}{k}$ and make $\pi_{k+1}$ the $\epsilon$-greedy policy with respect to the updated $q(s,a)$.
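
The loop above can be sketched as follows. Everything about the environment is an assumption made for illustration: a hypothetical three-state chain in which action 1 ("advance") moves one state forward and pays 1 when taken in the last state, while action 0 ("quit") ends the episode with reward 0. Every episode starts in state 0, ties in the greedy step are broken at random, and the update is the every-visit form written above.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1)   # hypothetical task: 0 = quit, 1 = advance

def step(state, action, n_states=3):
    """Toy chain. Returns (reward, next_state); next_state is None when the episode ends."""
    if action == 0:
        return 0.0, None
    if state == n_states - 1:
        return 1.0, None
    return 0.0, state + 1

def epsilon_greedy(q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])  # random tie-break

def glie_mc_control(n_episodes=500, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)
    N = defaultdict(int)
    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                          # exploration decays with the episode index
        episode, state = [], 0
        while state is not None:                   # sample one episode from pi_k
            action = epsilon_greedy(q, state, epsilon, rng)
            reward, next_state = step(state, action)
            episode.append((state, action, reward))
            state = next_state
        G = 0.0
        for s, a, r in reversed(episode):          # backward pass over the episode
            G = gamma * G + r
            N[(s, a)] += 1
            q[(s, a)] += (G - q[(s, a)]) / N[(s, a)]
    return q

q = glie_mc_control()
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(3)}
print(greedy)
```

In this toy task quitting always returns 0, so once a single successful episode has been sampled the greedy policy advances in every state.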

### Off-policy MC control

The previous algorithms use the same policy both to explore and as the policy being optimized. Such methods are called on-policy. 

In another approach, the samples are generated by a more exploratory policy with a higher $\epsilon$, while the policy being optimized has a lower $\epsilon$ or is even fully deterministic. 

$$
\begin{cases}
    1. & \text{Used to generate samples ---behavior policy} \\
    2. & \text{Being optimized ---Target Policy}
\end{cases}
$$

An on-policy method can only learn about the policy that generated the samples, while an off-policy algorithm can learn a policy from data generated by other, possibly sub-optimal, policies.

### `Algorithm off-policy`

1. Estimated state-action value: $q(s,a)=0,  \forall s \in \mathbf{S}, \forall a \in A$
2. Visit count: $N(s,a)=0,  \forall s \in \mathbf{S}, \forall a \in A$
3. Target policy $\pi(s)=\arg\max_a q(s,a)$.

Loop:

$b\leftarrow$ any 'behavior policy' with enough exploration

Sample episode $k$ from the behavior policy $b$: $S_0, A_0, R_1, S_1, A_1, R_2, ..., R_T,S_T$ 

$G \leftarrow 0$

Loop backward for each step of episode: $t=T-1, T-2, ..., 1, 0$

$$
\begin{align}
G &\leftarrow \gamma G + R_{t+1} \\
N(s,a) &\leftarrow N(s,a)+1 \\ 
q(s,a) &\leftarrow q(s,a)+[G-q(s,a)]/N(s,a)
\end{align}
$$
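$\pi(s) = \arg\max_a q(s,a)$

Note that, as written, the update treats returns generated by $b$ as if they came from $\pi$. A complete off-policy MC method reweights each return by the importance-sampling ratio between $\pi$ and $b$.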
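
The listing above leaves out the importance-sampling correction between the behavior and target policies. The sketch below adds it in the weighted importance sampling form, which is a swapped-in variant and not the exact update written above: the count $N(s,a)$ is replaced by a cumulative weight `C`. The environment is a hypothetical three-state chain (action 1 advances and pays 1 in the last state, action 0 quits with reward 0), and the behavior policy is assumed to be uniformly random.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1)   # hypothetical task: 0 = quit, 1 = advance

def step(state, action, n_states=3):
    """Toy chain. Returns (reward, next_state); next_state is None when the episode ends."""
    if action == 0:
        return 0.0, None
    if state == n_states - 1:
        return 1.0, None
    return 0.0, state + 1

def off_policy_mc_control(n_episodes=2000, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)
    C = defaultdict(float)                  # cumulative importance weights per (s, a)
    b_prob = 1.0 / len(ACTIONS)             # behavior policy b: uniform over actions
    for _ in range(n_episodes):
        episode, state = [], 0
        while state is not None:            # sample the episode from b, not from pi
            action = rng.choice(ACTIONS)
            reward, next_state = step(state, action)
            episode.append((state, action, reward))
            state = next_state
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            q[(s, a)] += (W / C[(s, a)]) * (G - q[(s, a)])
            greedy = max(ACTIONS, key=lambda x: q[(s, x)])   # target policy pi(s)
            if a != greedy:
                break                       # pi never takes a, so earlier steps get weight 0
            W /= b_prob                     # pi(a|s) = 1 for the greedy action
    return q

q = off_policy_mc_control()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(3)}
print(policy)
```

Because the target policy is deterministic, the importance ratio is either $1/b(a|s)$ or $0$, which is why the backward loop stops at the first action that disagrees with the greedy choice.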

# Temporal Differencing 

The value $v_{\pi}(s)$ is estimated from the current estimate $v_{\pi}(s')$ of the successor state; this is known as bootstrapping. 

Temporal differencing combines ideas from DP and MC: it learns from samples, as MC does, and it bootstraps, as DP does:
$$
\begin{align}
v(s) \leftarrow v(s) + \alpha [ \underbrace{R + \gamma v(s')}_{\text{replaces the original } G } -v(s)] \tag{2} \label{eq2}
\end{align}
$$
In \eqref{eq2}, $s'$ is the next state. This approach is called TD($0$).

### `Algorithm TD(0)`
1. Estimated state value: $v(s)=0,  \forall s \in \mathbf{S}$
2. Step size $\alpha \in (0,1]$ and discount factor $\gamma \in [0,1]$
3. Policy $\pi$ to be evaluated.

Loop for each episode:

$$\begin{align}
&\text{Choose a start state } s \\ 
&\text{Loop for each step in the episode}: \\
& \ \ \ \ \  \text{Take action } A \text{ in state } s \text{ as per } \pi \\
& \ \ \ \ \  \text{Observe } R \text{ and next state } s' \\
& \ \ \ \ \ \  v(s) \leftarrow v(s) + \alpha [R + \gamma v(s') -v(s)] \\
& \ \ \  \ \ \ s \leftarrow s'
\end{align}$$

The bracketed difference in the above update is called the `TD error`. 
$$\delta_t = R_{t+1} + \gamma v(s_{t+1}) -v(s_t)$$
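
A minimal TD(0) prediction sketch, under the same kind of assumptions as a textbook random walk: a hypothetical five-state chain with terminal ends, reward 1 for reaching the right end, and a policy that moves left or right with equal probability. The step size and episode count are arbitrary choices for the illustration.

```python
import random
from collections import defaultdict

def td0_prediction(n_episodes=10000, alpha=0.005, gamma=1.0, n_states=5, seed=0):
    rng = random.Random(seed)
    v = defaultdict(float)                  # v(s) = 0 initially; terminal values stay 0
    terminals = (0, n_states - 1)
    for _ in range(n_episodes):
        s = n_states // 2                   # start in the middle of the chain
        while s not in terminals:
            s_next = s + rng.choice((-1, 1))            # uniform random policy
            r = 1.0 if s_next == n_states - 1 else 0.0
            delta = r + gamma * v[s_next] - v[s]        # TD error
            v[s] += alpha * delta                       # update after every single step
            s = s_next
    return v

v = td0_prediction()
print({s: round(v[s], 3) for s in (1, 2, 3)})
```

Unlike the MC estimate, each update happens immediately after one transition, so no complete trajectory has to be stored. The exact values for this chain are 0.25, 0.5 and 0.75; with a constant step size the estimates keep fluctuating around them.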



### SARSA

SARSA is an on-policy TD control method.

1. Estimated state-action value: $q(s,a)=0,  \forall s \in \mathbf{S}$,  $\forall a \in A $
2. Policy: $\pi=\epsilon$-greedy policy
3. Learning rate (step size) $\alpha \in [0,1]$
4. Discount factor $\gamma \in [0,1]$

Loop for each episode:
$$\begin{align}
&\text{Start with a random S, choose A based on the policy.  }\\
&\text{Loop for each step until the episode ends:}  \\
&\ \ \ \text{take action A and observe R and next state S'}\\
&\ \ \  \text{choose A' using the }  \epsilon -\text{greedy policy based on the current q} \\
& \ \ \  \text{If S' is not terminal } \\
& \ \ \ \ \ \ q(S,A) \leftarrow q(S,A) + \alpha [R + \gamma q(S',A')-q(S,A)] \\ 
&\ \ \  \text{Else: } \\ 
& \ \ \ \ \ \ q(S,A) \leftarrow q(S,A) + \alpha [R -q(S,A)] \\
&\ \ \ S \leftarrow S' ; A \leftarrow A'
\end{align}$$

Return the policy $\pi$ based on the Q values
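
A tabular SARSA sketch under illustrative assumptions: the environment is a hypothetical three-state chain (action 1 advances and pays 1 in the last state, action 0 quits with reward 0), every episode starts in state 0 instead of a random state, and ties in the greedy choice are broken at random.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1)   # hypothetical task: 0 = quit, 1 = advance

def step(state, action, n_states=3):
    """Toy chain. Returns (reward, next_state); next_state is None when the episode ends."""
    if action == 0:
        return 0.0, None
    if state == n_states - 1:
        return 1.0, None
    return 0.0, state + 1

def epsilon_greedy(q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])  # random tie-break

def sarsa(n_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(n_episodes):
        s = 0
        a = epsilon_greedy(q, s, epsilon, rng)
        while s is not None:
            r, s_next = step(s, a)
            if s_next is not None:
                a_next = epsilon_greedy(q, s_next, epsilon, rng)   # action actually taken next
                target = r + gamma * q[(s_next, a_next)]
            else:
                a_next = None
                target = r                                          # terminal: no bootstrap term
            q[(s, a)] += alpha * (target - q[(s, a)])
            s, a = s_next, a_next
    return q

q = sarsa()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(3)}
print(policy)
```

The action `a_next` used in the target is the same action that is executed on the following step, which is what makes the method on-policy.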


### Off-Policy TD (Q- learning)

In the on-policy algorithm above, the next action $A'$ used in the update is sampled from the $\epsilon$-greedy policy at state $S'$. In off-policy TD, the update instead uses the greedy action 
$$A'=\arg\max_{\tilde{A}} q(S', \tilde{A})$$

1. Estimated state-action value: $q(s,a)=0,  \forall s \in \mathbf{S}$,  $\forall a \in A $
2. Policy: $\pi=\epsilon$-greedy policy
3. Learning rate (step size) $\alpha \in [0,1]$
4. Discount factor $\gamma \in [0,1]$

Loop for each episode:
$$\begin{align}
&\text{Start with a random S.}  \\
&\text{Loop for each step until the episode ends:}  \\
&\ \ \ \text{choose A in state S using the } \epsilon \text{-greedy policy}\\
&\ \ \ \text{take action A and observe R and next state S'}\\
& \ \ \  \text{If S' is not terminal } \\
& \ \ \ \ \ \ q(S,A) \leftarrow q(S,A) + \alpha [R + \gamma \max_{A'} q(S',A')-q(S,A)] \\ 
&\ \ \  \text{Else: } \\ 
& \ \ \ \ \ \ q(S,A) \leftarrow q(S,A) + \alpha [R -q(S,A)] \\
&\ \ \ S \leftarrow S'
\end{align}$$

Return the policy $\pi$ based on the q values
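
A tabular Q-learning sketch under illustrative assumptions: a hypothetical three-state chain (action 1 advances and pays 1 in the last state, action 0 quits with reward 0), a fixed start state 0, and random tie-breaking in the greedy choice. With $\gamma = 0.9$ the optimal values of advancing are $0.81$, $0.9$ and $1$ for states 0, 1 and 2.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1)   # hypothetical task: 0 = quit, 1 = advance

def step(state, action, n_states=3):
    """Toy chain. Returns (reward, next_state); next_state is None when the episode ends."""
    if action == 0:
        return 0.0, None
    if state == n_states - 1:
        return 1.0, None
    return 0.0, state + 1

def epsilon_greedy(q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])  # random tie-break

def q_learning(n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(n_episodes):
        s = 0
        while s is not None:
            a = epsilon_greedy(q, s, epsilon, rng)          # behavior: epsilon-greedy
            r, s_next = step(s, a)
            if s_next is not None:
                target = r + gamma * max(q[(s_next, x)] for x in ACTIONS)  # greedy target
            else:
                target = r
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s_next
    return q

q = q_learning()
print({s: round(q[(s, 1)], 3) for s in range(3)})
```

The only difference from SARSA is the target: the maximum over next actions is used regardless of which action the $\epsilon$-greedy behavior takes next.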

The update uses the max of the estimates rather than an estimate of the max. 

The target is meant to approximate an expectation of the return. Because the maximum of noisy estimates tends to overestimate the maximum of the true values, the learned values can be biased upward (maximization bias). 

One idea to overcome this difficulty is double Q-learning: replace $\max_a q(s,a)$ with $q_1(s, \arg\max_a q_2(s,a))$, where $q_1$ and $q_2$ are two separately learned value functions (for example, two neural networks), so that one selects the action and the other evaluates it. 
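
A tabular sketch of the double-estimator idea, assuming two dictionaries of action values keyed by `(state, action)`. The helper names `double_q_target` and `double_q_update`, the coin flip that decides which table to update, and the toy numbers are all assumptions for illustration.

```python
import random

def double_q_target(q_select, q_eval, reward, next_state, actions, gamma):
    """One table picks the action, the other table supplies its value."""
    if next_state is None:                  # terminal transition: no bootstrap term
        return reward
    best = max(actions, key=lambda a: q_select[(next_state, a)])
    return reward + gamma * q_eval[(next_state, best)]

def double_q_update(q1, q2, s, a, reward, next_state, actions, alpha, gamma, rng=random):
    """Update one of the two tables, chosen by a fair coin flip."""
    if rng.random() < 0.5:
        target = double_q_target(q1, q2, reward, next_state, actions, gamma)
        q1[(s, a)] += alpha * (target - q1[(s, a)])
    else:
        target = double_q_target(q2, q1, reward, next_state, actions, gamma)
        q2[(s, a)] += alpha * (target - q2[(s, a)])

# Toy numbers: the two tables disagree about which action is best in state "s1".
q1 = {("s0", 0): 0.0, ("s1", 0): 1.0, ("s1", 1): 5.0}
q2 = {("s0", 0): 0.0, ("s1", 0): 3.0, ("s1", 1): 2.0}
single_target = 1.0 + 0.5 * max(q1[("s1", a)] for a in (0, 1))
double_target = double_q_target(q1, q2, 1.0, "s1", (0, 1), gamma=0.5)
double_q_update(q1, q2, "s0", 0, 1.0, "s1", (0, 1), alpha=0.5, gamma=0.5,
                rng=random.Random(0))
print(single_target, double_target)
```

In the toy numbers, the single-table target uses the possibly inflated value 5.0, while the double-estimator target uses the second table's value 2.0 for the same action.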

We can also perform expected SARSA, which replaces the sampled $q(S',A')$ with its expectation under the policy:
$$q(S,A) \leftarrow q(S,A) + \alpha [R + \gamma \sum_a \pi(a|S')q(S',a)-q(S,A)]$$
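
The expectation in the target can be computed directly when $\pi$ is $\epsilon$-greedy. The sketch below assumes the action values of $S'$ are given as a list; the function names are hypothetical and ties go to the lowest action index.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """pi(a|s) of the epsilon-greedy policy for the action values of one state."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])   # ties go to the lowest index
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs

def expected_sarsa_target(reward, next_q_values, epsilon, gamma, terminal=False):
    """R + gamma * sum_a pi(a|S') q(S', a), or just R on a terminal transition."""
    if terminal:
        return reward
    probs = epsilon_greedy_probs(next_q_values, epsilon)
    expected_q = sum(p * qv for p, qv in zip(probs, next_q_values))
    return reward + gamma * expected_q

target = expected_sarsa_target(reward=1.0, next_q_values=[0.0, 2.0], epsilon=0.2, gamma=0.5)
print(target)
```

With $\epsilon = 0$ the expectation collapses to the maximum, so the target coincides with the Q-learning target.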

### Replay Buffer