# Temporal Difference Learning

Instead of waiting for the full episode to complete, we can learn the $Q$ function step by step, directly from actual experience in the environment. We learn by **bootstrapping**, that is, we update our estimate of the value function as we play, using our current estimates in place of the rewards we have not yet observed. For the prediction problem, the goal is, given a policy $\pi$, to learn $q_\pi$ online from experience.

One way to learn from experience is incremental every-visit Monte Carlo: once the episode has finished, we update the estimate $Q(S_t, A_t)$ towards the **actual** return $G_t$:



<div class="alert alert-block alert-info">
<h4>Incremental Monte Carlo Update </h4>
<ul>
 $$\begin{aligned}
 Q(S_t,A_t) &\leftarrow Q(S_t,A_t)+\alpha \left (\underbrace{G_t}_{\text{actual return}}-Q(S_t,A_t) \right )
 \end{aligned}
 $$
 </ul>
</div> 
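As a minimal sketch of the update above, assuming a finished episode is stored as a list of `(state, action, reward)` tuples (the function name, the state labels, and the dictionary-based `Q` are hypothetical choices for illustration), the returns can be accumulated backwards and applied to every visit:

```python
from collections import defaultdict


def incremental_mc_update(Q, episode, alpha=0.1, gamma=0.99):
    """Every-visit incremental Monte Carlo update over one finished episode.

    `episode` is a list of (state, action, reward) tuples, where `reward`
    is the reward received after taking `action` in `state` (R_{t+1}).
    """
    G = 0.0
    # Walk backwards so that G holds the discounted return G_t at each step.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        Q[(state, action)] += alpha * (G - Q[(state, action)])
    return Q


# Hypothetical three-step episode that ends with a reward of 1.
Q = defaultdict(float)
episode = [("s0", "right", 0.0), ("s1", "right", 0.0), ("s2", "right", 1.0)]
incremental_mc_update(Q, episode, alpha=0.5, gamma=0.9)
```

Note that nothing can be updated until the episode is over, since every $G_t$ depends on all the rewards that follow step $t$.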

The TD way is, instead, to update the estimate of the value $Q(S_t, A_t)$ towards the **estimated** return (called the TD target):

<div class="alert alert-block alert-info">
<h4>TD Update</h4>
<ul>
 $$\begin{aligned}
 Q(S_t,A_t) &\leftarrow Q(S_t,A_t) + \alpha \left (\underbrace{R_{t+1}+\gamma Q(S_{t+1}, A_{t+1})}_{\text{TD target}}-Q(S_t, A_t) \right )
 \end{aligned}
 $$
 </ul>
</div> 
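The TD update can be sketched on a single transition. This is only an illustration: the function name `td_update`, the dictionary-based `Q`, and the state labels are assumptions, not part of the original material.

```python
def td_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one TD update to Q[(s, a)] and return the TD error."""
    # Terminal states have value 0, so the target reduces to the reward.
    target = r if done else r + gamma * Q.get((s_next, a_next), 0.0)
    td_error = target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return td_error


# Hypothetical transition: the next pair already has an estimate of 0.5.
Q = {("s1", "right"): 0.5}
delta = td_update(Q, "s0", "right", 0.0, "s1", "right", alpha=0.5, gamma=0.9)
```

Here the target is $0 + 0.9 \cdot 0.5 = 0.45$, so $Q(s_0, \text{right})$ moves from $0$ to $0.225$ after a single step, without waiting for the episode to end.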

How do these two methods compare? Well,
- TD updates towards a one-step ("first order") approximation of the expected return, instead of the full sampled return.
- TD can learn before the end of the episode, unlike MC.
- TD works in non-terminating environments, unlike MC.
- MC uses an **unbiased** estimate, whereas TD introduces bias because we update not towards the real value function, but towards our best guess so far.
- TD has less variance: the TD target depends on only one noisy step (one action, reward and transition), whereas the MC return accumulates noise across the whole trajectory.
- TD exploits the Markov property (hence more effective in Markov environments).
- MC does not exploit the Markov property (more effective in non-Markov environments).


A more dramatic example: assume you are driving and you come to an intersection. You don't see a car coming towards you, and you stop. If you made decisions by MC, you would have to crash to observe the (negative) reward. With TD, you don't need to wait until you die to update your value function.

## TD Control

There are two learning procedures for evaluating a policy (which you will later improve):

- **On-policy learning** learns about $\pi$ by sampling experience from $\pi$ itself.

- **Off-policy learning** learns about one policy (typically the *optimal* policy) while following a different, *exploratory* policy.

An example of an on-policy algorithm is SARSA, shown below. It gets its name from the sequence $S,A,R,S',A'$ used to generate the updates. At each step we take the action $A$ that the current policy chose in state $S$, which gives us a reward $R$ and moves us to the new state $S'$. Using the *same* policy we choose the next action $A'$, and we update our $Q$ function by taking a step of size $\alpha$ towards the TD target $R+\gamma Q(S',A')$.

<div class="alert alert-block alert-info">
<h4> Sarsa Algorithm </h4>
Initialize $Q(s,a)$ arbitrarily for all $s \in \mathcal{S}, a \in \mathcal{A}$, with $Q(\text{terminal state},\cdot) = 0$. 
    <ul>
        <li> Repeat for each episode: </li>
        Initialize the initial state $S$ and choose $A$ from $S$ using the policy derived from $Q$, for instance, $\epsilon$-greedy.
        <p>Repeat, for each step of the episode:
        <ul>
            
            <li> Take action $A$, observe $R$, $S'$ </li>
            <li> Choose action $A'$ from $S'$ using the policy derived from $Q$, for instance, $\epsilon$-greedy</li>
            <li> $Q(S,A) \leftarrow Q(S,A)+\alpha \left [ R+\gamma\cdot Q(S',A')-Q(S,A) \right ] $</li>
            <li> $S \leftarrow S'$, $A \leftarrow A'$ </li>
        </ul>
        
    </ul>
</div>

Q-learning, the second algorithm shown below, is off-policy: instead of bootstrapping from the action $A'$ that our policy actually selects in $S'$, we move the $Q$-function estimate towards the value of the best possible action in the new state $S'$.


<div class="alert alert-block alert-info">
<h4> Q-Learning Algorithm </h4>
Initialize $Q(s,a)$ arbitrarily for all $s \in \mathcal{S}, a \in \mathcal{A}$, with $Q(\text{terminal state},\cdot) = 0$. 
    <ul>
        <li> Repeat for each episode: </li>
        Initialize the initial state $S$.
        <p>Repeat, for each step of the episode:
        <ul>
            
            <li> Choose $A$ from $S$ using the policy derived from $Q$, for instance, using $\epsilon-$greedy.</li>
            <li> Take action $A$, observe $R$, $S'$. </li>
            <li> $Q(S,A) \leftarrow Q(S,A)+\alpha \left [ R+\gamma\cdot \max_{a \in \mathcal{A}} Q(S',a)-Q(S,A) \right ] $</li>
            <li> $S \leftarrow S'$ </li>
        </ul>
        
    </ul>
</div>
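The Q-learning loop can be sketched on a toy problem. The example below assumes a hypothetical deterministic chain of five states (action 0 moves left, action 1 moves right, and reaching the last state gives reward 1 and ends the episode); the environment, the function name, and the hyperparameters are illustrative choices rather than part of the original material.

```python
import numpy as np


def q_learning_chain(n_states=5, n_episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning on a hypothetical deterministic chain."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))  # the terminal row is never updated, stays 0
    goal = n_states - 1

    for episode in range(n_episodes):
        state = 0
        for step in range(max_steps):
            # Behaviour policy: epsilon-greedy with random tie-breaking.
            if rng.random() < epsilon:
                action = int(rng.integers(2))
            else:
                best = np.flatnonzero(Q[state] == Q[state].max())
                action = int(rng.choice(best))

            # Deterministic chain dynamics (state 0 is a wall on the left).
            new_state = max(state - 1, 0) if action == 0 else state + 1
            done = new_state == goal
            reward = 1.0 if done else 0.0

            # Target policy: greedy in S', whatever action is taken next.
            target = reward + gamma * Q[new_state].max()
            Q[state, action] += alpha * (target - Q[state, action])

            state = new_state
            if done:
                break
    return Q


Q = q_learning_chain()
greedy_actions = Q[:-1].argmax(axis=1)  # greedy action in each non-terminal state
```

The only difference from SARSA is the target: the update uses $\max_a Q(S',a)$ even when the $\epsilon$-greedy behaviour policy goes on to explore, which is what makes the method off-policy.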


Below you can see some sample code for implementing SARSA. 

In [1]:
import numpy as np
import gym

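def epsilon_greedy_policy(Q, epsilon, actions):
    """Return an epsilon-greedy policy over the Q table.

    Q is a numpy array of shape (n_states, n_actions), epsilon is the
    exploration probability (values >= 1 mean always explore), and
    actions is a sequence of the available actions."""
    
    def policy_fn(state):
        # Exploit with probability 1 - epsilon, explore otherwise
        if np.random.rand() > epsilon:
            action = np.argmax(Q[state,:])
        else:
            action = np.random.choice(actions)
        return action
    return policy_fn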


env = gym.make("FrozenLake-v0")
Q = np.zeros([env.observation_space.n, env.action_space.n])

gamma = 0.99 
alpha = 0.1
n_episodes = 10000


actions = range(env.action_space.n)

score = []    
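for j in range(n_episodes):
    done = False
    state = env.reset()
    # Exploration decays with the episode index
    policy = epsilon_greedy_policy(Q, epsilon=10./(j+1), actions=actions)
    
    # SARSA: choose the first action A before entering the loop
    action = policy(state)
    
    ### Generate sample episode
    t=0
    while not done:
        t+=1
        # Classic Gym API: step returns (observation, reward, done, info)
        new_state, reward, done, _ =  env.step(action)
        # Choose A' from S' with the same policy; it is also the next action taken
        new_action = policy(new_state)
        
        # SARSA update
        if done:
            # Q(terminal state, .) = 0, so the TD target is just the reward
            Q[state, action] = Q[state,action] + alpha*(reward-Q[state,action])
        else:
            Q[state, action] = Q[state,action] + alpha*(reward+gamma*Q[new_state,new_action]-Q[state,action])
            
        # S <- S', A <- A'
        state, action = new_state, new_action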
            
        if done:
            if len(score) < 100:
                score.append(reward)
            else:
                score[j % 100] = reward
                
                
            #if (j+1)%1000 == 0:
                #print("INFO: Episode {} finished after \
                #{} timesteps with r={}.\
                # Running score: {}".format(j+1, t, reward, np.mean(score)))
            

env.close()

In [2]:
sum(score)

0.0