# Temporal Difference (TD) Learning

Temporal Difference (TD) Learning combines ideas from **Monte Carlo methods** and **Dynamic Programming**.

It learns directly from experience **without a model of the environment**.

It is also a **bootstrapping** method: each value estimate is updated toward a target built from **other learned estimates**, rather than from a complete observed return.

---
### 🔍 Key Idea
- Update the value estimate $V(s_t)$ after every step, moving it toward the one-step target $r_{t+1} + \gamma V(s_{t+1})$:

$$V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$$

Here:
- $\alpha$ = learning rate
- $\gamma$ = discount factor
- $r_{t+1}$ = reward received on the transition from $s_t$ to $s_{t+1}$
- the bracketed term $r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is the **TD error**

This update rule defines **TD(0)**, the simplest form of Temporal Difference Learning.
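
To make the update concrete, here is a minimal sketch of a single TD(0) step as a plain function. The name `td0_update` and the example numbers are illustrative assumptions, not part of any library.

```python
def td0_update(v_s, reward, v_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update (hypothetical helper); return (new_value, td_error)."""
    td_error = reward + gamma * v_next - v_s  # one-step target minus current estimate
    return v_s + alpha * td_error, td_error

# Assumed example: V(s_t) = 0.0, no reward on this step, V(s_{t+1}) = 1.0
new_value, td_error = td0_update(v_s=0.0, reward=0.0, v_next=1.0)
print(new_value, td_error)
```

With these numbers the TD error is 0.9, so a learning rate of 0.1 moves the estimate from 0.0 to about 0.09: a small step toward the target, not a jump onto it.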

---
### ⚙️ Comparison
| Method | Requires Model? | Updates When | Bootstraps? |
|:--|:--:|:--:|:--:|
| Monte Carlo | ❌ No | End of episode | ❌ No |
| Dynamic Programming | ✅ Yes | After full sweep | ✅ Yes |
| Temporal Difference | ❌ No | After each step | ✅ Yes |
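
The "Bootstraps?" column can be illustrated by computing both kinds of target for the same state. This is a small sketch on a made-up three-step episode; the rewards and the current value guesses are assumptions chosen only for illustration.

```python
# Hypothetical recorded episode: s0 -> s1 -> s2 -> terminal
rewards = [0.0, 0.0, 1.0]           # r_1, r_2, r_3 (assumed)
value_estimates = [0.3, 0.5, 0.8]   # current guesses for V(s0), V(s1), V(s2) (assumed)
gamma = 0.9

# Monte Carlo target for s0: the full discounted return, known only at episode end
mc_target = sum(gamma**k * r for k, r in enumerate(rewards))

# TD(0) target for s0: one real reward plus the current estimate of the next state
td_target = rewards[0] + gamma * value_estimates[1]

print("MC target:", mc_target)
print("TD target:", td_target)
```

The Monte Carlo target uses only observed rewards, while the TD target replaces everything after the first step with the current estimate $V(s_1)$, which is why it is available immediately.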

---
### ✅ Advantages
- Can learn directly from raw experience
- Doesn’t require environment dynamics
- Updates after every step instead of waiting for an episode to end, which helps when episodes are long

---
### 🧩 Example: TD(0) Prediction

```python
import numpy as np
import random

# 1D chain: states 0..5, with both ends terminal
states = [0, 1, 2, 3, 4, 5]
terminal_states = [0, 5]
V = np.zeros(len(states))  # value estimates; terminal entries stay at 0
alpha = 0.1  # Learning rate
gamma = 0.9  # Discount factor

def step(state):
    """Take one step of the random policy; return (next_state, reward)."""
    if state in terminal_states:
        return state, 0
    action = random.choice([-1, 1])  # move left or right with equal probability
    next_state = state + action
    reward = 1 if next_state == 5 else 0  # reward only for reaching state 5
    return next_state, reward

for episode in range(100):
    state = 3  # start near the middle of the chain
    while state not in terminal_states:
        next_state, reward = step(state)
        # TD(0) update: move V[state] toward reward + gamma * V[next_state]
        V[state] += alpha * (reward + gamma * V[next_state] - V[state])
        state = next_state

print("Learned Value Function:")
print(V)
```
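
As a sanity check on the learned values, the same chain is small enough to solve exactly. The sketch below assumes the dynamics of the example (a uniformly random left/right policy, reward 1 only on entering state 5, `gamma = 0.9`) and solves the Bellman equations $v = r + \gamma P v$ for the non-terminal states with `np.linalg.solve`. The names `P`, `r`, and `v_exact` are illustrative.

```python
import numpy as np

gamma = 0.9
nonterminal = [1, 2, 3, 4]  # states 0 and 5 are terminal, with value 0
n = len(nonterminal)

P = np.zeros((n, n))  # transition probabilities among non-terminal states
r = np.zeros(n)       # expected immediate reward from each non-terminal state
for i, s in enumerate(nonterminal):
    for next_state in (s - 1, s + 1):  # each move has probability 0.5
        if next_state in nonterminal:
            P[i, nonterminal.index(next_state)] = 0.5
        if next_state == 5:
            r[i] += 0.5 * 1.0  # reward 1 for entering state 5

# Solve (I - gamma * P) v = r
v_exact = np.linalg.solve(np.eye(n) - gamma * P, r)
print(dict(zip(nonterminal, np.round(v_exact, 3))))
```

With a constant learning rate and only 100 episodes, the TD(0) estimates should fluctuate around these values rather than match them exactly.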

---
### 📘 TD Control Methods
TD learning can also be used for **control**, that is, for finding an optimal policy.

Two popular methods:
- **SARSA (On-policy TD Control)**
- **Q-Learning (Off-policy TD Control)**
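
As a brief preview, the two methods differ only in the target they bootstrap from. The sketch below uses the standard textbook update rules on a tiny action-value table; the function names, the table `Q`, and the example transition are all assumptions made for illustration.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action actually taken in s_next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the greedy (maximum-value) action in s_next."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Assumed table with 2 states and 2 actions; Q[state, action]
Q_sarsa = np.array([[0.0, 0.0], [0.2, 1.0]])
Q_qlearn = Q_sarsa.copy()

# Same transition for both; the behaviour policy happened to pick a_next = 0
sarsa_update(Q_sarsa, s=0, a=0, r=0.0, s_next=1, a_next=0)
q_learning_update(Q_qlearn, s=0, a=0, r=0.0, s_next=1)
print(Q_sarsa[0, 0], Q_qlearn[0, 0])
```

Here SARSA bootstraps from the value of the action that was actually chosen (0.2), while Q-Learning bootstraps from the best available action (1.0), so the same transition produces different updates.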

We’ll explore both in upcoming notebooks.

---
### 🧠 Summary
- **TD(0)** updates the value estimate after each step.
- It bridges Monte Carlo (learning from sampled experience) and Dynamic Programming (bootstrapping).
- It learns online and needs no model of the environment.
- Leads to **SARSA** and **Q-Learning**.

---
### 📚 References
- Sutton & Barto, *Reinforcement Learning: An Introduction*
- OpenAI Spinning Up Tutorials
- David Silver RL Course (Lecture 4)