# 1. Temporal Difference Learning Introduction
We are now going to look at a third method for solving MDPs, _**Temporal Difference Learning**_. TD is one of the most important ideas in RL, and we will see how it combines ideas from the first two techniques, Dynamic Programming and Monte Carlo. 

Recall that one of the disadvantages of DP was that it requires a full model of the environment, and never learns from experience. On the other hand, we saw that MC does learn from experience, and we will shortly see that TD learns from experience as well. 

With the Monte Carlo method, we saw that we could only update the value function after completing an episode. On the other hand, DP uses bootstrapping and was able to improve its estimates based on existing estimates. We will see that TD learning also uses bootstrapping, and furthermore is fully online, so we don't need to wait for an episode to finish before we start updating our value estimates. 

In this section we will take our standard approach:
> 1. First we will look at the prediction problem, aka finding the value function given a policy.
> 2. Second we will look at the control problem. We will look at 2 different ways of approaching the control problem: **SARSA** and **Q-Learning**.

---

# 2. Prediction Problem - `TD(0)`
We are now going to look at how to apply TD to the prediction problem, aka finding the value function. The 0 in the name is there because there are other TD learning methods such as `TD(1)` and `TD(`${\lambda}$`)`, but they are outside the scope of this course. They are similar, but not necessary to understand _Q-learning_ and _Approximation methods_, which is what we eventually want to get to.

## 2.1 Monte Carlo Disadvantage
One big disadvantage of Monte Carlo is that we need to wait until the episode is finished before we can calculate the returns, since the return depends on all future rewards. Also, recall that the MC method is to average the returns, and that earlier in the course we looked at different ways of calculating averages. 

## 2.2 `TD(0)`
In particular, we can look at the method that does not require us to store all of the returns: _the moving average_. 

#### $$Q_t = (1 - \alpha)Q_{t-1} + \alpha X_t$$
#### $$Q_t = Q_{t-1} - \alpha Q_{t-1} + \alpha X_t$$
#### $$Q_t = Q_{t-1} + \alpha ( X_t - Q_{t-1})$$

Recall that $\alpha$ can be constant or decay with time. So, if we use this formula it would be an alternative way of calculating the average return for a state. 

$$V(S_t) \leftarrow V(S_t) + \alpha \big[G(t) - V(S_t)\big]$$

Annotated:

<img src="images/moving-average-TD.png">

In this case we use the rearranged form of the moving average, so the previous value is not explicitly multiplied by $1 - \alpha$, and we have chosen to keep $\alpha$ constant (instead of letting it slowly get smaller over time).
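
To make the moving-average update concrete, here is a minimal sketch in Python. The function name `running_average` and the sample returns are made up for illustration. It assumes that a decaying step size of $1/t$ is used when no constant $\alpha$ is given, in which case the update reproduces the ordinary sample mean.

```python
def running_average(samples, alpha=None):
    """Incrementally average `samples` using Q <- Q + alpha * (x - Q).

    If alpha is None, use the decaying step size 1/t, which reproduces the
    ordinary sample mean. Otherwise use the given constant step size.
    """
    q = 0.0
    for t, x in enumerate(samples, start=1):
        step = 1.0 / t if alpha is None else alpha
        q = q + step * (x - q)
    return q


returns = [1.0, 0.0, 2.0, 1.0]  # made-up sample returns for one state
mean_estimate = running_average(returns)               # decaying step size 1/t
recent_estimate = running_average(returns, alpha=0.5)  # constant step size
print(mean_estimate, recent_estimate)
```

With a constant $\alpha$ the estimate weights recent samples more heavily than old ones, and no list of past returns has to be stored in either case.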

Now recall the definition of $V$; it is the expected value of the return, given a state:

$$V(s) = E \big[G(t) \mid S_t = s\big]$$

But, remember that we can also define it recursively:

$$V(s) = E \big[R(t+1) + \gamma V(S_{t+1}) \mid S_t =s\big]$$

So, it is reasonable to ask if we can just replace the return in the update equation with this recursive definition of $V$! What we get from this is the `TD(0)` method:

$$V(S_t) \leftarrow V(S_t) + \alpha \big[R(t+1) + \gamma V(S_{t+1}) - V(S_t)\big]$$

<span style="color:#0000cc">$$\text{TD(0)} \rightarrow V(s) \leftarrow V(s) + \alpha \big[r + \gamma V(s') - V(s)\big]$$</span>

We can also see how this is fully online! We are not calculating $G$, the full return. Instead, we are just using another $V$ estimate, in particular the $V$ for the next state. What this also tells us is that we cannot update $V(s)$ until we have observed the reward $r$ and the next state $s'$. So, rather than waiting for the entire episode to finish, we just need to wait until we reach the next state to update the value for the current state. 
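
As a rough illustration, a single `TD(0)` update can be written as a small function. The name `td0_update`, the state labels, and the starting values below are hypothetical; only the update rule itself comes from the equation above, with `gamma` as the discount factor.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    td_target = r + gamma * V[s_next]   # one-step estimate of the return
    td_error = td_target - V[s]
    V[s] = V[s] + alpha * td_error
    return td_error


# Made-up value table for a two-step episode A -> B -> END
V = {"A": 0.0, "B": 0.5, "END": 0.0}
td0_update(V, "A", 0.0, "B")     # uses only r and the current estimate V["B"]
td0_update(V, "B", 1.0, "END")   # terminal state has value 0
print(V)
```

Each call needs only the current transition `(s, r, s_next)`, which is why the update can be applied as soon as the next state is reached.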

## 2.3 Sources of Randomness
It is helpful to examine how these estimates work, and what the sources of randomness are. 

> * With MC, the randomness came from the fact that each episode could play out in a different way. So, the return for a state could differ from episode to episode if any of the later state transitions had some randomness. 
> * With **`TD(0)`** we have yet another source of randomness. In particular, we don't even know the return, so instead we use $r + \gamma V(s')$ to estimate the return $G$, and $V(s')$ is itself only an estimate. 
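
To get a feel for how the two targets behave, the sketch below samples both on a made-up task: every step pays $+1$ or $-1$ with equal probability and each episode lasts a fixed number of steps. The task, the names `sample_targets` and `target_variances`, and the choice to plug in the true value $0$ for $V(s')$ are all assumptions for illustration. In practice $V(s')$ is an estimate, so the TD target is also biased until $V$ improves.

```python
import random
import statistics

GAMMA = 0.9
N_STEPS = 5  # hypothetical fixed episode length


def sample_targets(rng, v_next=0.0):
    """Play one episode from the start state; return (mc_target, td_target).

    Every step pays +1 or -1 with equal probability, so the true value of
    every state is 0. v_next stands in for the current estimate V(s').
    """
    rewards = [rng.choice([-1.0, 1.0]) for _ in range(N_STEPS)]
    mc_target = sum(GAMMA ** k * r for k, r in enumerate(rewards))  # full return G
    td_target = rewards[0] + GAMMA * v_next                         # r + gamma * V(s')
    return mc_target, td_target


def target_variances(n_episodes=5000, seed=0):
    """Estimate the variance of the MC and TD targets over many episodes."""
    rng = random.Random(seed)
    samples = [sample_targets(rng) for _ in range(n_episodes)]
    mc_var = statistics.pvariance([mc for mc, _ in samples])
    td_var = statistics.pvariance([td for _, td in samples])
    return mc_var, td_var


mc_var, td_var = target_variances()
print(f"MC target variance: {mc_var:.2f}, TD target variance: {td_var:.2f}")
```

In this setup the MC target depends on every later random reward, while the TD target depends on a single reward plus the value estimate, so its spread is smaller.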

## 2.4 Summary
We just looked at why `TD(0)` is advantageous in comparison to MC/DP. 

> * Unlike DP, we do not require a full model of the environment, we learn from experience, and only update V for states we visit.
> * Unlike MC, we don't need to wait for an episode to finish before we can start learning. This is advantageous in situations where we have very long episodes. We can improve our performance during the episode itself, rather than having to wait until the next episode.
> * It can even be used for continuing tasks, in which there are no episodes at all. 

---

# 3. `TD(0)` in Code

```python
import numpy as np
from common import standard_grid, print_policy, print_values

GAMMA = 0.9
ALPHA = 0.1  # Learning Rate
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')

# NOTE: This is only policy evaluation, not optimization

def random_action(a, eps=0.1):
  """Adding randomness to ensure that all states are visited. We will use epsilon-soft 
  to ensure that all states are visited. What happens if you don't do this? i.e. eps = 0"""
  p = np.random.random()
  if p < (1 - eps):
    return a
  else:
    return np.random.choice(ALL_POSSIBLE_ACTIONS)

def play_game(grid, policy):
  """Much simpler than MC version, because we don't need to calculate any returns. All
  we need to do is return a list of states and rewards."""
  s = (2, 0)
  grid.set_state(s)
  states_and_rewards = [(s, 0)]  # list of tuples of (state, reward)
  while not grid.game_over():
    a = policy[s]
    a = random_action(a)
    r = grid.move(a)
    s = grid.current_state()
    states_and_rewards.append((s, r))
  return states_and_rewards

if __name__ == '__main__':
  # Use standard grid so that we can compare to iterative policy evaluation
  grid = standard_grid()

  # print rewards
  print("rewards:")
  print_values(grid.rewards, grid)

  # state -> action
  policy = {
    (2, 0): 'U',
    (1, 0): 'U',
    (0, 0): 'R',
    (0, 1): 'R',
    (0, 2): 'R',
    (1, 2): 'R',
    (2, 1): 'R',
    (2, 2): 'R',
    (2, 3): 'U',
  }

  # Initialize V(s) to 0 for every state
  states = grid.all_states()
  V = {s: 0 for s in states}

  # Run a fixed number of episodes
  for it in range(1000):

    # Generate an episode using pi
    states_and_rewards = play_game(grid, policy)

    # the first (s, r) tuple is the state we start in and 0
    # (since we don't get a reward) for simply starting the game
    # the last (s, r) tuple is the terminal state and the final reward
    # the value for the terminal state is by definition 0, so we don't
    # care about updating it.
    # Once we have our states and rewards, we loop through them and do the TD(0) update
    # which is the equation from the theory we discussed. Notice that here we have 
    # generated a full episode and then are making our V(s) updates. We could have done 
    # them inline, but this allows for them to be cleaner and easier to follow. 
    for t in range(len(states_and_rewards) - 1):
      s, _ = states_and_rewards[t]
      s2, r = states_and_rewards[t + 1]

      # TD(0) update for the transition s -> s2; it only needs r and V[s2]
      V[s] = V[s] + ALPHA * (r + GAMMA * V[s2] - V[s])

  print("values:")
  print_values(V, grid)
  print("policy:")
  print_policy(policy, grid)
```

```
rewards:
---------------------------
 0.00| 0.00| 0.00| 1.00|
---------------------------
 0.00| 0.00| 0.00|-1.00|
---------------------------
 0.00| 0.00| 0.00| 0.00|
values:
---------------------------
 0.78| 0.88| 0.98| 0.00|
---------------------------
 0.70| 0.00|-0.92| 0.00|
---------------------------
 0.62|-0.30|-0.60|-0.87|
policy:
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  U  |  R  |  R  |  U  |
```