# Markov Decision Processes (MDPs)

## 1. Limitations of the k-Armed Bandit
- Has no notion of state: every decision is made in the same situation (no situational awareness)
- Focuses solely on immediate rewards (ignores future consequences)
- Example: Always taking the highest-paying job now might limit future opportunities


## 2. MDP Components
### Key Things to Remember:
1. **States (S_t):** 
   - "Snapshot" of the environment at time t
   - Examples: chess board position, robot sensor readings

2. **Actions (A_t):** 
   - What the agent decides to DO
   - Can be discrete (left/right) or continuous (steering angle)

3. **Transition Dynamics:**
   - Defined by probability distribution:  
      `p(s', r | s, a) = Pr(S_{t+1}=s', R_{t+1}=r | S_t=s, A_t=a)`
   - Must satisfy:  
      `∑_{s'∈𝓢} ∑_{r∈𝓡} p(s', r | s, a) = 1` for all s ∈ 𝓢, a ∈ 𝓐(s)

4. **Reward (R_{t+1}):**
   - Immediate feedback signal
   - Not always money, and the payoff of a good action may only show up in later rewards (delayed)
   - Design tip: Reward ≠ goal, but should guide toward goal
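
To make the transition dynamics in item 3 concrete, here is a minimal sketch of a tabular `p(s', r | s, a)`. The two-state MDP, its state and action names, and all probabilities and rewards are invented for illustration. The helpers check the normalization condition and derive two standard quantities from the joint distribution: the state-transition probability `p(s' | s, a)` and the expected reward `r(s, a)`.

```python
# Hypothetical two-state MDP, invented for illustration only.
# dynamics[(s, a)] is a list of (next_state, reward, probability) triples,
# i.e. a tabular encoding of p(s', r | s, a).
dynamics = {
    ("low", "wait"):     [("low", 0.0, 1.0)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
    ("high", "wait"):    [("high", 1.0, 1.0)],
    ("high", "search"):  [("high", 2.0, 0.7), ("low", 2.0, 0.2), ("low", -3.0, 0.1)],
}


def is_normalized(dynamics, tol=1e-9):
    """Check that the sum over (s', r) of p(s', r | s, a) is 1 for every (s, a)."""
    return all(
        abs(sum(p for _, _, p in outcomes) - 1.0) <= tol
        for outcomes in dynamics.values()
    )


def state_transition_prob(dynamics, s, a, s_next):
    """p(s' | s, a): marginalize the joint distribution over rewards."""
    return sum(p for sn, _, p in dynamics[(s, a)] if sn == s_next)


def expected_reward(dynamics, s, a):
    """r(s, a): probability-weighted average of the immediate reward."""
    return sum(r * p for _, r, p in dynamics[(s, a)])


print(is_normalized(dynamics))
print(state_transition_prob(dynamics, "high", "search", "low"))
print(expected_reward(dynamics, "high", "search"))
```

A dictionary keyed by `(state, action)` is only one possible encoding. Larger problems usually use arrays or a simulator instead of an explicit table.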

## 3. The Real Goal in RL
- Maximize **cumulative future rewards**, not immediate payoffs  
- Total return: `G_t = R_{t+1} + R_{t+2} + ... + R_T`  
- Key insight: Sacrifice short-term gains for better long-term outcomes
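
A small sketch of the total-return formula, using two made-up reward sequences (both hypothetical) for the same five-step episode. It shows how a strategy with a worse first reward can still end up with the larger `G_t`.

```python
def total_return(rewards):
    """G_t = R_{t+1} + R_{t+2} + ... + R_T for the remaining rewards of one episode."""
    return sum(rewards)


# Hypothetical reward sequences, invented for illustration.
greedy_rewards = [5, 1, 1, 1, 1]    # large payoff now, little afterwards
patient_rewards = [0, 2, 3, 4, 5]   # nothing now, more later

print(total_return(greedy_rewards))   # 9
print(total_return(patient_rewards))  # 14
```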

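## 4. Reward Hypothesis (Cool Analogy!)
Hypothesis: any goal can be framed as maximizing the expected cumulative sum of a scalar reward signal.
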
| Approach              | Proverb                          | Key Characteristic             |
|-----------------------|----------------------------------|--------------------------------|
| Programming AI        | "Give a man a fish..."           | Needs explicit instructions    |
| Supervised Learning   | "Teach a man to fish..."         | Requires labeled data          |
| **Reinforcement Learning** | "Give a taste of fish..."    | Learns from experience        |

### Reward Design Challenges
- No natural "currency" for rewards
- Complex/long-term goals
- Dynamic environments
- Risk vs reward tradeoffs

## 5. Task Types Comparison
|                       | Episodic Tasks                   | Continuing Tasks               |
|-----------------------|----------------------------------|--------------------------------|
| **Structure**         | Natural breaks (episodes)        | Never-ending                  |
| **Termination**       | Ends at terminal state           | No terminal state             |
| **Return Formula**    | `G_t = ∑_{k=1}^{T-t} R_{t+k}`   | Needs discounting (γ)         |

## 6. Handling Infinite Horizons
### Discounted Returns
`G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ...`  
`  = ∑_{k=0}^∞ γ^k R_{t+k+1}`

Why discount (0 ≤ γ < 1)?
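- Keeps the infinite sum finite when rewards are bounded: if |R| ≤ R_max, then |G_t| ≤ R_max / (1 - γ)
- Values immediate rewards more: the k-th term of the sum is scaled by γ^k
- Models uncertainty about the future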
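
As a rough numerical check of the first point, the sketch below computes truncated discounted returns for a constant reward of 1 per step. The reward sequence and the choice γ = 0.9 are assumptions made for illustration. As the horizon grows, the partial sums approach the geometric-series limit 1 / (1 - γ) = 10 instead of growing without bound.

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k >= 0 of gamma**k * R_{t+k+1}, truncated to the rewards given."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


gamma = 0.9  # assumed discount factor, chosen for illustration

# Constant reward of 1 at every step: partial sums approach 1 / (1 - gamma).
for horizon in (10, 100, 1000):
    print(horizon, discounted_return([1.0] * horizon, gamma))

print("limit:", 1.0 / (1.0 - gamma))
```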

### The Magic Recursion
`G_t = R_{t+1} + γG_{t+1}`  
*(This recursive relationship is FUNDAMENTAL to RL algorithms)*

> This recursion enables dynamic programming solutions!
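
One way to see the recursion at work is to compute every return of an episode in a single backward sweep. This is a minimal sketch, assuming the episode's rewards are stored in a list where index `t` holds `R_{t+1}`, and that `G_T = 0` at termination. The function name is hypothetical.

```python
def returns_from_rewards(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] given rewards [R_1, ..., R_T].

    Applies G_t = R_{t+1} + gamma * G_{t+1} backwards in time, starting from G_T = 0.
    """
    returns = [0.0] * len(rewards)
    g_next = 0.0  # G_T = 0: nothing is earned after the episode ends
    for t in reversed(range(len(rewards))):
        g_next = rewards[t] + gamma * g_next  # rewards[t] holds R_{t+1}
        returns[t] = g_next
    return returns


print(returns_from_rewards([1.0, 2.0, 3.0], gamma=0.5))  # [2.75, 3.5, 3.0]
```

The sweep reuses `G_{t+1}` to get `G_t`, so all T returns cost O(T) work in total instead of re-summing the tail of the episode for every time step.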