## 1. Policies: The Agent's Playbook 

### What is a Policy?
- My "decision-making rulebook" - tells me what to do in each situation
- **Input:** Current state (s)  
- **Output:** Action to take (a)

### Policy Types:
| **Deterministic Policy**      | **Stochastic Policy**          |
|-------------------------------|--------------------------------|
| `π(s) = a`                    | `π(a∣s) = P(A_t=a ∣ S_t=s)`    |
| Always chooses the same action in a given state | Samples from a probability distribution over actions |
| Simple but rigid              | Flexible but more complex      |
| Example: "Always turn left"   | Example: "Turn left 80% of the time" |

> Key insight: These (stationary, Markov) policies ONLY look at the current state, not the history or the time step! ⏱
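
To make the two policy types concrete, here is a minimal Python sketch. The action names, the `"corridor"` state, and the 80/20 split are illustrative assumptions taken from the "turn left" examples, not part of any particular RL library.

```python
import random

# Hypothetical two-action setup; names are for illustration only.
ACTIONS = ["left", "right"]

def deterministic_policy(state):
    """pi(s) = a: a given state always maps to the same action."""
    return "left"

def stochastic_policy(state, rng):
    """pi(a|s): sample an action from a distribution over actions."""
    probs = {"left": 0.8, "right": 0.2}
    weights = [probs[a] for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded so the run is repeatable
samples = [stochastic_policy("corridor", rng) for _ in range(10_000)]
left_fraction = samples.count("left") / len(samples)

print(deterministic_policy("corridor"))  # always "left"
print(round(left_fraction, 2))           # close to 0.8, not exactly
```

Neither function looks at anything except the state it is handed, which is the point of the key insight above.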

## 2. Value Functions: The Crystal Ball 

### State-Value Function v(s)
- "How good is this state?"
- Expected return (discounted sum of future rewards) from state s when following policy π:  
  `v_π(s) = 𝔼[G_t | S_t = s]`  
- Example: In chess, value of having queen advantage
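
The return `G_t` inside that expectation can be sketched in a few lines of Python. This assumes the standard discounted-return definition `G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ...` for a finite episode; the reward numbers are made up.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence R_{t+1}, R_{t+2}, ..."""
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]  # made-up rewards for one short episode
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

`v_π(s)` is then the average of this quantity over all the episodes that could start in s under π.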

### Action-Value Function q(s,a)
- "How good is this action in this state?"
- Expected return after taking action a in state s, then following policy π:  
  `q_π(s,a) = 𝔼[G_t | S_t = s, A_t = a]`  
- Example: Value of moving pawn vs castle in specific position
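
The two value functions are linked: the state value is the policy-weighted average of the action values, `v_π(s) = ∑_a π(a|s) q_π(s,a)`. A tiny Python check, with made-up numbers for a single state:

```python
# Hypothetical numbers for one state s with two actions.
pi_s = {"left": 0.8, "right": 0.2}  # pi(a|s)
q_s = {"left": 1.0, "right": 5.0}   # q_pi(s, a)

# v_pi(s) = sum over actions of pi(a|s) * q_pi(s, a)
v_s = sum(pi_s[a] * q_s[a] for a in pi_s)
print(round(v_s, 3))  # 0.8*1.0 + 0.2*5.0 = 1.8
```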

### Why We Need Them:
1. **Don't wait for final outcome** → Predict future early
2. **Handle randomness** → Account for stochastic environments
3. **Compare alternatives** → Make optimal decisions

## 3. Bellman Equations: The RL Secret Sauce 

### State-Value Bellman Eq
`v_π(s) = ∑_a π(a|s) ∑_s' ∑_r p(s',r|s,a) [r + γv_π(s')]`

**Translation:**  
Value of state s =  
[Avg over actions I might take, weighted by π] of  
[Avg over possible next states/rewards, weighted by p] of  
[Immediate reward + Discounted value of next state]
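
Turning the state-value equation into an update rule gives iterative policy evaluation. The sketch below is a minimal Python version on a hypothetical 2-state MDP; the states, actions, rewards, uniform policy, and `GAMMA = 0.9` are all assumptions picked for illustration.

```python
GAMMA = 0.9

# Hypothetical 2-state MDP (not from these notes). P[s][a] lists
# (probability, next_state, reward) triples, standing in for p(s', r | s, a).
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
PI = {s: {a: 0.5 for a in P[s]} for s in P}  # uniform random policy pi(a|s)

def bellman_backup(v, s):
    """Right-hand side of the state-value Bellman equation for state s."""
    return sum(
        PI[s][a] * sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][a])
        for a in P[s]
    )

def evaluate_policy(theta=1e-10):
    """Apply the backup repeatedly until the values stop changing."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = bellman_backup(v, s)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

v_pi = evaluate_policy()
print({s: round(x, 3) for s, x in v_pi.items()})  # about A: 7.25, B: 7.75
```

Once the loop stops, plugging `v_pi` back into `bellman_backup` returns (almost exactly) the same numbers, which is what it means for `v_π` to satisfy the equation.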

### Action-Value Bellman Eq
`q_π(s,a) = ∑_s' ∑_r p(s',r|s,a) [r + γ ∑_a' π(a'|s') q_π(s',a')]`

**Translation:**  
Value of (s,a) =  
[Avg over possible outcomes, weighted by p] of  
[Immediate reward + Discounted avg value of the next actions π might pick]
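
The action-value equation can be iterated the same way, this time over (state, action) pairs. Again a rough Python sketch on a hypothetical 2-state MDP with a uniform random policy; every name and number here is an assumption for illustration.

```python
GAMMA = 0.9

# Hypothetical 2-state MDP (not from these notes). P[s][a] lists
# (probability, next_state, reward) triples, standing in for p(s', r | s, a).
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
PI = {s: {a: 0.5 for a in P[s]} for s in P}  # uniform random policy pi(a|s)

def q_backup(q, s, a):
    """Right-hand side of the action-value Bellman equation for (s, a)."""
    return sum(
        p * (r + GAMMA * sum(PI[s2][a2] * q[(s2, a2)] for a2 in P[s2]))
        for p, s2, r in P[s][a]
    )

def evaluate_q(theta=1e-10):
    """Apply the backup to every (s, a) pair until nothing changes."""
    q = {(s, a): 0.0 for s in P for a in P[s]}
    while True:
        delta = 0.0
        for s, a in list(q):
            new_q = q_backup(q, s, a)
            delta = max(delta, abs(new_q - q[(s, a)]))
            q[(s, a)] = new_q
        if delta < theta:
            return q

q_pi = evaluate_q()
print({k: round(x, 3) for k, x in q_pi.items()})
```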

> My cheat: Both equations are just "reward now + discounted future value" 

## 4. Optimal Policies: The RL Endgame 

### What Makes a Policy Optimal? (π*)
- Beats or equals ALL other policies in EVERY state
- Has highest possible value functions:  
  `v_*(s) = max_π v_π(s)`  
  `q_*(s,a) = max_π q_π(s,a)`

### Optimal Value Functions:
`v_*(s) = max_a ∑_s' ∑_r p(s',r|s,a) [r + γv_*(s')]`  
`q_*(s,a) = ∑_s' ∑_r p(s',r|s,a) [r + γ max_a' q_*(s',a')]`
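
Repeatedly applying the `v_*` equation as an update is the idea behind value iteration. A minimal Python sketch, assuming a hypothetical 2-state MDP and `GAMMA = 0.9` (none of these names come from a library):

```python
GAMMA = 0.9

# Hypothetical 2-state MDP (not from these notes). P[s][a] lists
# (probability, next_state, reward) triples, standing in for p(s', r | s, a).
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}

def value_iteration(theta=1e-10):
    """Iterate v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # max over actions replaces the policy-weighted average.
            best = max(
                sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            return v

v_star = value_iteration()
print({s: round(x, 3) for s, x in v_star.items()})  # about A: 19.0, B: 20.0
```

In this toy MDP the best plan is to reach B and stay there, so `v_*(B) = 2 / (1 - 0.9) = 20` and `v_*(A) = 1 + 0.9 * 20 = 19`.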

### Key Properties:
1. **Universality:** Same v* and q* for all optimal policies
2. **Greediness:** Any policy that is greedy w.r.t. q* is optimal  
   `π*(s) = argmax_a q_*(s,a)`
3. **Recursive Magic:** Best action now + optimal play afterwards ⇒ optimal overall (Bellman's principle of optimality)
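
The greediness property is easy to sketch in Python: given a table of `q_*` values, pick the highest-valued action in each state. The table below is made up for illustration, and `greedy_policy` is a hypothetical helper name.

```python
# Made-up q* table for illustration, keyed by (state, action).
q_star = {
    ("A", "stay"): 17.1,
    ("A", "go"): 19.0,
    ("B", "stay"): 20.0,
    ("B", "go"): 17.1,
}

def greedy_policy(q):
    """pi*(s) = argmax_a q(s, a), computed per state from a tabular q."""
    policy = {}
    for (s, a), value in q.items():
        # Keep the action with the largest value seen so far for this state.
        if s not in policy or value > q[(s, policy[s])]:
            policy[s] = a
    return policy

print(greedy_policy(q_star))  # {'A': 'go', 'B': 'stay'}
```

Ties are broken in favor of the first action encountered; any tie-breaking rule gives an optimal policy, which is why several optimal policies can share the same q*.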

## Student Cheat Sheet 📝

| **Concept**          | **Formula**                      | **When to Use**                  |
|----------------------|----------------------------------|----------------------------------|
| State-Value          | `v_π(s) = 𝔼[G_t ∣ S_t=s]`        | Evaluate positions               |
| Action-Value         | `q_π(s,a) = 𝔼[G_t ∣ S_t=s, A_t=a]` | Compare moves                  |
| State Bellman        | `v_π(s) = ... + γv_π(s')`       | Policy evaluation (value of a fixed policy) |
| Action Bellman       | `q_π(s,a) = ... + γq_π(s',a')`  | Evaluating q for a fixed policy (SARSA-style updates) |
| Optimal Value        | `v_*(s) = max_a[...]`           | Finding best strategy (value iteration; Q-learning uses the q* version) |

> Bellman equations ALWAYS have that γ discount! Don't forget it! (γ can be 1 in episodic tasks, but it's still in the formula.)