## 🧠 What is TD?

**TD** stands for **Temporal Difference** learning.  
It’s a method for learning **value functions** (how good it is to be in each state) directly **from experience**, without needing a model of the environment.

### ✅ Key idea:
> Use the **difference between successive value estimates** to update the current one.

That difference is called the **TD error**:
$$
    \delta = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
$$

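Here \( R_{t+1} \) is the reward received after leaving \( S_t \), and \( \gamma \in [0, 1] \) is the discount factor. Instead of waiting for the **final return**, as Monte Carlo methods do, TD updates **after every step**. It uses \( R_{t+1} + \gamma V(S_{t+1}) \) (the **TD target**) as a stand-in for the rest of the return and nudges the current estimate toward it:
$$
    V(S_t) \leftarrow V(S_t) + \alpha\,\delta
$$
where \( \alpha \) is a step size. Updating an estimate from another estimate like this is called **bootstrapping**.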
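The update rule above can be sketched as tabular TD(0) prediction. This is a minimal sketch under stated assumptions: the environment is a hypothetical 5-state random walk (states 1 to 5, terminals 0 and 6, reward +1 only for reaching state 6, \( \gamma = 1 \)), and the names `td_error` and `td0_random_walk` are illustrative, not from any library. Because every step moves left or right with equal probability, the true values are \( 1/6, \dots, 5/6 \), which makes the estimates easy to check.

```python
import random


def td_error(reward, gamma, v_next, v_current):
    """TD error: delta = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)."""
    return reward + gamma * v_next - v_current


def td0_random_walk(num_episodes=5000, alpha=0.01, gamma=1.0, seed=0):
    """Tabular TD(0) prediction on a hypothetical 5-state random walk.

    States 1..5 are non-terminal; 0 and 6 are terminal. Each step moves
    left or right with probability 0.5. Reaching state 6 gives reward +1,
    every other transition gives 0. Episodes start in the middle (state 3).
    """
    rng = random.Random(seed)
    V = [0.0] + [0.5] * 5 + [0.0]  # terminal states are never updated, so stay 0
    for _ in range(num_episodes):
        s = 3
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            reward = 1.0 if s_next == 6 else 0.0
            # Update right after this step, bootstrapping from the estimate V[s_next]
            delta = td_error(reward, gamma, V[s_next], V[s])
            V[s] += alpha * delta
            s = s_next
    return V[1:6]


if __name__ == "__main__":
    estimates = td0_random_walk()
    true_values = [k / 6 for k in range(1, 6)]
    for s, (v, t) in enumerate(zip(estimates, true_values), start=1):
        print(f"state {s}: TD(0) estimate {v:.3f}, true value {t:.3f}")
```

For a single step with \( R_{t+1} = 1 \), \( \gamma = 0.9 \), \( V(S_{t+1}) = 2 \) and \( V(S_t) = 1.5 \), `td_error` gives \( 1 + 0.9 \cdot 2 - 1.5 = 1.3 \), so \( V(S_t) \) is pushed upward. A small constant \( \alpha \) is used here so the estimates settle close to the true values; larger step sizes learn faster but fluctuate more.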

---

## 📘 Why is it called "Temporal Difference"?

Because the update is based on the **difference between value estimates across time** (from one timestep to the next).

---

## ⚙️ What is Semi-Gradient TD?

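**Semi-gradient TD** is TD learning in which you:
- Use **function approximation** (e.g., a neural network or a linear model with weights \( \mathbf{w} \)) to estimate values as \( \hat{v}(s, \mathbf{w}) \)
- **Take the gradient only of your estimate** \( \hat{v}(S_t, \mathbf{w}) \), not of the TD target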

---

### ✅ Why is it called “semi-gradient”?

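Because the update follows only part of the gradient. With function approximation the TD error is \( \delta = R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \), and you compute: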
$$
    \mathbf{w}_{t+1} = \mathbf{w}_{t} + \alpha\delta\nabla\hat{v}(S_t, \mathbf{w}_{t})
$$


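You don’t differentiate the **target** \( R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbf{w}_t) \), even though it also depends on \( \mathbf{w} \); you treat it as a constant and differentiate only the prediction \( \hat{v}(S_t, \mathbf{w}_t) \). For comparison, the true gradient of the squared TD error \( \tfrac{1}{2}\delta^2 \) is \( \delta\,\big(\gamma\nabla\hat{v}(S_{t+1}, \mathbf{w}_t) - \nabla\hat{v}(S_t, \mathbf{w}_t)\big) \); the update above drops the \( \gamma\nabla\hat{v}(S_{t+1}, \mathbf{w}_t) \) term. That’s why it’s called **semi-gradient**, not full gradient.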
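Here is a minimal sketch of linear semi-gradient TD(0), assuming the same hypothetical 5-state random walk (terminals 0 and 6, reward +1 on reaching 6, \( \gamma = 1 \)) and a hypothetical two-feature representation \( \mathbf{x}(s) = [s/6,\ 1] \). For a linear model \( \hat{v}(s, \mathbf{w}) = \mathbf{w}^\top\mathbf{x}(s) \), the gradient \( \nabla\hat{v}(s, \mathbf{w}) \) is just \( \mathbf{x}(s) \). The names `features`, `v_hat`, and `semi_gradient_td0` are illustrative only.

```python
import random


def features(s):
    """Hypothetical feature vector for state s: scaled position plus a bias term."""
    return [s / 6.0, 1.0]


def v_hat(s, w, terminal):
    """Linear value estimate w . x(s); terminal states are defined to have value 0."""
    if terminal:
        return 0.0
    return sum(wi * xi for wi, xi in zip(w, features(s)))


def semi_gradient_td0(num_episodes=5000, alpha=0.01, gamma=1.0, seed=0):
    """Semi-gradient TD(0) with a linear approximator on the random walk."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(num_episodes):
        s = 3
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            reward = 1.0 if s_next == 6 else 0.0
            done = s_next in (0, 6)
            # The target is computed as a plain number: no gradient flows through it
            target = reward + gamma * v_hat(s_next, w, done)
            delta = target - v_hat(s, w, False)
            # For a linear v_hat, grad_w v_hat(S_t, w) is the feature vector x(S_t)
            grad = features(s)
            w = [wi + alpha * delta * gi for wi, gi in zip(w, grad)]
            s = s_next
    return w


if __name__ == "__main__":
    w = semi_gradient_td0()
    print("learned weights:", [round(wi, 3) for wi in w])
    for s in range(1, 6):
        print(f"state {s}: v_hat {v_hat(s, w, False):.3f}, true value {s / 6:.3f}")
```

Because the true values \( s/6 \) happen to be exactly linear in these features, the weights should end up near \( [1, 0] \). A full-gradient variant (the residual-gradient method) would instead move along \( \mathbf{x}(S_t) - \gamma\mathbf{x}(S_{t+1}) \); that is a different algorithm, not what the update above computes.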

---

## 🔁 How does TD compare to other RL methods?

| Method                  | Learns From                          | Waits for Episode End? | Variance           | Bias             |
|-------------------------|--------------------------------------|------------------------|--------------------|------------------|
| **TD**                  | Next reward + next-state estimate    | ❌ No                  | Lower than MC      | Yes (bootstraps) |
| **Monte Carlo**         | Full sampled return                  | ✅ Yes                 | High               | No               |
| **Dynamic Programming** | Known model of the environment       | ❌ No                  | None (no sampling) | Yes (bootstraps) |

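TD is great because it:
- **Doesn’t need a model** of the environment's dynamics, unlike Dynamic Programming
- Learns **online**, updating after each step instead of at the end of an episode
- Combines ideas from both: like Monte Carlo it learns from sampled experience, and like Dynamic Programming it bootstraps from current estimates, accepting some bias in exchange for lower variance than Monte Carlo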
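The "Learns From" column can be made concrete by computing both kinds of target for one recorded episode. This is a sketch assuming a hypothetical three-step episode with rewards \( R_1, R_2, R_3 = 0, 0, 1 \) and made-up current estimates for the next states; `monte_carlo_returns` and `td_targets` are illustrative names. The Monte Carlo return \( G_t \) needs every reward until the episode ends, while each TD target needs only the next reward and the current estimate of the next state.

```python
def monte_carlo_returns(rewards, gamma):
    """G_t = R_{t+1} + gamma * G_{t+1}, computed backwards; needs the whole episode.

    rewards[t] holds R_{t+1}, the reward received after leaving S_t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def td_targets(rewards, next_values, gamma):
    """One-step TD targets R_{t+1} + gamma * V(S_{t+1}); each is available after one step.

    next_values[t] holds the current estimate V(S_{t+1}), with 0 for a terminal state.
    """
    return [r + gamma * v for r, v in zip(rewards, next_values)]


if __name__ == "__main__":
    rewards = [0.0, 0.0, 1.0]      # hypothetical episode: reward only at the end
    next_values = [0.5, 0.5, 0.0]  # hypothetical estimates of S_1, S_2, and the terminal state
    print("MC returns:", monte_carlo_returns(rewards, 0.9))
    print("TD targets:", td_targets(rewards, next_values, 0.9))
```

With \( \gamma = 0.9 \), the Monte Carlo returns are \( 0.81, 0.9, 1.0 \), while the TD targets are \( 0.45, 0.45, 1.0 \): the early TD targets lean on the (possibly wrong) estimate 0.5, which is the bias in the table, but they do not depend on every later reward, which is where the lower variance comes from.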

---

## ✅ Summary

- **TD** = learn from the next reward plus the next state's estimated value, not the full return  
- **“Temporal Difference”** = difference in value estimates across time  
- **“Semi-gradient”** = you take the gradient of the prediction only, treating the TD target as a constant