<a href="https://colab.research.google.com/github/RLWH/reinforcement-learning-notebook/blob/master/4.%20Model%20Free%20Control/Ch6_7(b)_TD_Control.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# From MC Control to TD Control

TD learning has several advantages over MC methods:
1. Lower variance
2. Online training (no need to wait until the whole episode has finished)
3. It can learn from incomplete sequences

As with MC methods, we need to face the tradeoff between exploration and exploitation, and again approaches fall into two main classes:
- On-policy
- Off-policy

# Use TD Learning to estimate Q(S,A)

We follow the same recipe as in MC control:
1. Apply TD to $Q(S,A)$
2. Use $\epsilon$-greedy policy improvement
3. Update every time-step

The most commonly used on-policy TD control algorithm is SARSA.

The first step of an on-policy TD control algorithm is to learn an action-value function rather than a state-value function. As in MC control, working with the action-value function removes the need to know the complete MDP dynamics, because a greedy action can be read directly from $Q$.

In particular, for an on-policy method, we must estimate $q_{\pi}(s,a)$ for the current behaviour policy $\pi$ and for all states $s$ and actions $a$. If we review the sequence of events in an episode, it consists of an alternating sequence of states and state-action pairs:

![Sequences of events](https://raw.githubusercontent.com/RLWH/reinforcement-learning-notebook/master/images/sequence_of_events_in_episode.PNG)

### The TD(0) algorithm with Q-values, or, SARSA
For a transition from one state-action pair to the next, we update the Q-value as follows:
\begin{equation}
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]
\end{equation}

The update is applied after every transition. Each transition consists of the quintuple of events $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, which is where the name SARSA comes from.
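
As a minimal sketch of the update rule above, the function below applies one SARSA update to a tabular Q stored in a dictionary. The function name `sarsa_update`, the dictionary layout keyed by `(state, action)`, and the numbers in the example are assumptions made for illustration only.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Apply one SARSA update to Q in place and return the TD error."""
    # TD target: R_{t+1} + gamma * Q(S_{t+1}, A_{t+1})
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    # TD error: target minus the current estimate Q(S_t, A_t)
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return td_error


# Illustrative numbers only: Q(S_t, A_t) = 1.0 and Q(S_{t+1}, A_{t+1}) = 2.0.
Q = {(0, "right"): 1.0, (1, "right"): 2.0}
delta = sarsa_update(Q, 0, "right", 0.5, 1, "right", alpha=0.1, gamma=0.9)
print(delta)              # about 1.3 = 0.5 + 0.9 * 2.0 - 1.0
print(Q[(0, "right")])    # about 1.13 = 1.0 + 0.1 * 1.3
```

Unseen state-action pairs default to 0 here, which is one possible arbitrary initialisation.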

### Plugging SARSA into the generalised policy iteration (GPI) framework
1. Policy evaluation: SARSA, $Q \approx q_{\pi}$
2. Policy improvement: $\epsilon$-greedy policy improvement
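
The $\epsilon$-greedy improvement step can be sketched as a small action-selection function. This is only an illustration: the name `epsilon_greedy`, the dictionary-based Q table, and the random tie-breaking are assumptions, not something prescribed by the notes.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    # Greedy choice with respect to the current Q; ties are broken at random.
    best = max(Q.get((state, a), 0.0) for a in actions)
    greedy = [a for a in actions if Q.get((state, a), 0.0) == best]
    return rng.choice(greedy)


# Illustrative Q-values for a single state "s".
Q = {("s", "left"): 0.2, ("s", "right"): 0.7}
print(epsilon_greedy(Q, "s", ["left", "right"], epsilon=0.0))  # always greedy: right
```

Because every action keeps a probability of at least $\epsilon / |A(s)|$, all state-action pairs continue to be explored while the policy drifts towards the greedy one.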

### Pseudo code of SARSA
---
```
Initialise Q(s,a) arbitrarily for all s in S, a in A(s), with Q(terminal, .) = 0
Loop for each episode:
    Initialise S
    Choose A from S using policy derived from Q (e.g. eps-greedy)
    Loop for each step of episode:
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g. eps-greedy)
        Q(S,A) = Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A))
        S = S'; A = A'
    Until S is terminal
```
---
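
The pseudo code can be turned into a short runnable sketch. The environment below is a hypothetical five-state corridor invented for this example (start in state 0, terminal state 4, reward -1 per move); the hyperparameters and all function names are likewise assumptions.

```python
import random
from collections import defaultdict

ACTIONS = (-1, +1)       # hypothetical corridor: step left or step right
N_STATES, GOAL = 5, 4    # states 0..4, state 4 is terminal


def step(state, action):
    """Hypothetical environment: reward -1 per move, walls clamp the position."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return -1.0, next_state, next_state == GOAL


def eps_greedy(Q, state, epsilon, rng):
    """Epsilon-greedy action selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])


def sarsa(n_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q(terminal, .) stays 0 because it is never updated
    for _ in range(n_episodes):
        s = 0                                   # Initialise S
        a = eps_greedy(Q, s, epsilon, rng)      # Choose A from S
        done = False
        while not done:
            r, s_next, done = step(s, a)                  # Take A, observe R, S'
            a_next = eps_greedy(Q, s_next, epsilon, rng)  # Choose A' from S'
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next                         # S = S'; A = A'
    return Q


Q = sarsa()
for s in range(GOAL):
    print(s, {a: round(Q[(s, a)], 2) for a in ACTIONS})
```

In this corridor the pair (state 3, action +1) always leads to the terminal state with reward -1, so its estimate should settle at -1.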

### How does SARSA converge?

#### Theorem
> SARSA converges to the optimal action-value function, $Q(s,a) \to q_*(s,a)$, under the following conditions:
 1. GLIE (greedy in the limit with infinite exploration) sequence of policies $\pi_t(a|s)$
 2. Robbins-Monro sequence of step-sizes $\alpha_t$
    - $\sum_{t=1}^{\infty} \alpha_t = \infty$
    - $\sum_{t=1}^{\infty} \alpha_t^{2} < \infty$
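
To make the two conditions concrete, the sketch below uses two common textbook choices, $\alpha_t = 1/t$ and $\epsilon_k = 1/k$. These specific schedules and the function names are assumptions for illustration; many other schedules satisfy the same conditions.

```python
import math


def alpha_schedule(t):
    """Hypothetical Robbins-Monro step-size alpha_t = 1/t, for t = 1, 2, ..."""
    return 1.0 / t


def epsilon_schedule(k):
    """Hypothetical GLIE exploration rate epsilon_k = 1/k, for episode k = 1, 2, ..."""
    return 1.0 / k


n = 100_000
sum_alpha = sum(alpha_schedule(t) for t in range(1, n + 1))
sum_alpha_sq = sum(alpha_schedule(t) ** 2 for t in range(1, n + 1))

print(sum_alpha)                      # grows like log(n), so it is unbounded in n
print(sum_alpha_sq, math.pi ** 2 / 6) # partial sums stay below pi^2 / 6
```

With $\epsilon_k = 1/k$ every action is still tried infinitely often, while the policy becomes greedy in the limit.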
   

 



## From SARSA to n-step SARSA and SARSA($\lambda$)

We have seen that there is a whole spectrum of methods between Monte Carlo and one-step TD updates, known as n-step TD methods. Applying the same idea to the SARSA update gives n-step SARSA.

![n-step SARSA](https://raw.githubusercontent.com/RLWH/reinforcement-learning-notebook/master/images/n_step_sarsa.PNG)

#### The n-step Q-return

\begin{equation}
q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q(S_{t+n}, A_{t+n}), \hspace{1cm} n \geq 1, 0 \leq t \lt T-n
\end{equation}
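
The n-step Q-return is a discounted sum of $n$ rewards plus a bootstrapped value from the state-action pair reached after $n$ steps. A minimal sketch, with the function name and the example numbers chosen purely for illustration:

```python
def n_step_q_return(rewards, q_bootstrap, gamma):
    """Return q_t^(n) given rewards [R_{t+1}, ..., R_{t+n}] and the
    bootstrap value Q(S_{t+n}, A_{t+n})."""
    n = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted + gamma ** n * q_bootstrap


# n = 3: 1 + 0.5 * 0 + 0.25 * 2 + 0.125 * 4
print(n_step_q_return([1.0, 0.0, 2.0], q_bootstrap=4.0, gamma=0.5))  # 2.0
# n = 1 recovers the one-step SARSA target: 1 + 0.5 * 4
print(n_step_q_return([1.0], q_bootstrap=4.0, gamma=0.5))            # 3.0
```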

#### n-step SARSA update

\begin{equation}
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (q_t^{(n)} - Q(S_t, A_t)), \hspace{1cm} 0 \leq t \lt T
\end{equation}
while the values of all other state-action pairs remain unchanged.

![n-step SARSA backup diagram](https://raw.githubusercontent.com/RLWH/reinforcement-learning-notebook/master/images/n_step_sarsa_backup.PNG)

## Forward-view SARSA($\lambda$)
![Forward SARSA](https://github.com/RLWH/reinforcement-learning-notebook/blob/master/images/forward_sarsa_lambda.PNG?raw=true)
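
In the standard formulation, the forward view combines all n-step Q-returns $q_t^{(n)}$ into a single $\lambda$-return, using weights $(1-\lambda)\lambda^{n-1}$ that sum to 1:

\begin{equation}
q_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}
\end{equation}

The update then moves $Q(S_t, A_t)$ towards this target:

\begin{equation}
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (q_t^{\lambda} - Q(S_t, A_t))
\end{equation}

Setting $\lambda = 0$ leaves only the one-step return $q_t^{(1)}$, which recovers the ordinary SARSA update.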

## Backward-view SARSA($\lambda$)
Just like TD($\lambda$), SARSA($\lambda$) has a backward-view algorithm that uses *eligibility traces* to update online.

- SARSA($\lambda$) has one eligibility trace for each state-action pair
 
 \begin{equation}
 \begin{split}
 E_0(s,a) &= 0 \\
 E_t(s,a) &= \gamma \lambda E_{t-1}(s,a) + \mathop{\mathbb{1}}(S_t=s, A_t=a)
 \end{split}
 \end{equation}
 
- Hence, $Q(s,a)$ is updated for every state $s$ and action $a$ in proportion to the TD-error $\delta_t$ and the eligibility trace $E_t(s,a)$

 \begin{equation}
 \begin{split}
 &\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \\
 &Q(s,a) \leftarrow Q(s,a) + \alpha \delta_t E_t(s,a)
 \end{split}
 \end{equation}
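
The two sets of equations above can be sketched as a single backward-view step. The function name `sarsa_lambda_step`, the dictionary-based tables, and the numbers in the example are illustrative assumptions. Only pairs with a non-zero trace are stored, which is equivalent to looping over all $(s,a)$ because the remaining traces are zero.

```python
from collections import defaultdict


def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam):
    """One backward-view SARSA(lambda) step; updates Q and E in place."""
    # TD error for the transition just observed.
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    # E_t(s,a) = gamma * lambda * E_{t-1}(s,a) + 1(S_t = s, A_t = a)
    for key in list(E):
        E[key] *= gamma * lam
    E[(s, a)] += 1.0
    # Every pair is updated in proportion to its trace.
    for key, trace in E.items():
        Q[key] += alpha * delta * trace
    return delta


Q = defaultdict(float)
E = defaultdict(float)
E[(0, "a")] = 1.0   # pair visited on the previous step
sarsa_lambda_step(Q, E, 1, "b", 1.0, 2, "c", alpha=0.5, gamma=0.9, lam=0.5)
print(dict(E))      # trace of (0, "a") decayed to about 0.45, (1, "b") is 1.0
print(Q[(1, "b")], Q[(0, "a")])   # about 0.5 and 0.225
```

The earlier pair `(0, "a")` still receives part of the credit for the reward, scaled by its decayed trace.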

## Pseudo Code of SARSA($\lambda$) algorithm
---
 ```
 Initialise Q(s,a) arbitrarily, for all s in S, a in A(s)
 Loop for each episode:
     E(s,a) = 0 for all s in S, a in A(s)
     Initialise S, A

     Loop for each step of the episode:
         Take action A, observe R, S'
         Choose A' from S' using policy derived from Q (e.g. epsilon-greedy)
         delta = R + gamma * Q(S',A') - Q(S,A)
         E(S,A) = E(S,A) + 1      # Update the eligibility trace

         For all s in S, a in A(s):
             Q(s,a) = Q(s,a) + alpha * delta * E(s,a)
             E(s,a) = gamma * lambda * E(s,a)

         S = S'; A = A'
     Until S is terminal
 ```
---
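
A runnable sketch of the SARSA($\lambda$) pseudo code is given below. It uses a hypothetical five-state corridor (start in state 0, terminal state 4, reward -1 per move) invented for this example; the hyperparameters and function names are assumptions as well. Traces are kept only for visited pairs, since all other traces are zero.

```python
import random
from collections import defaultdict

ACTIONS = (-1, +1)       # hypothetical corridor: step left or step right
N_STATES, GOAL = 5, 4    # states 0..4, state 4 is terminal


def step(state, action):
    """Hypothetical environment: reward -1 per move, walls clamp the position."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return -1.0, next_state, next_state == GOAL


def eps_greedy(Q, state, epsilon, rng):
    """Epsilon-greedy action selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])


def sarsa_lambda(n_episodes=500, alpha=0.1, gamma=1.0, lam=0.8,
                 epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(n_episodes):
        E = defaultdict(float)                  # E(s,a) = 0 for all s, a
        s = 0                                   # Initialise S, A
        a = eps_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            r, s_next, done = step(s, a)
            a_next = eps_greedy(Q, s_next, epsilon, rng)
            delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
            E[(s, a)] += 1.0                    # update the eligibility trace
            for key in list(E):                 # pairs with non-zero trace
                Q[key] += alpha * delta * E[key]
                E[key] *= gamma * lam
            s, a = s_next, a_next               # S = S'; A = A'
    return Q


Q = sarsa_lambda()
for s in range(GOAL):
    print(s, {a: round(Q[(s, a)], 2) for a in ACTIONS})
```

Compared with one-step SARSA, a single reward here adjusts every pair visited earlier in the episode, weighted by how recently and how often it was visited.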


## Windy Gridworld Example