# Lab D.1: Markov Decision Processes

**Module:** D - Reinforcement Learning (Optional)
**Time:** 1.5-2 hours
**Difficulty:** ⭐⭐☆☆☆

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the components of a Markov Decision Process (MDP)
- [ ] Implement state, action, reward, and transition functions
- [ ] Calculate value functions using the Bellman equation
- [ ] Find optimal policies using value iteration
- [ ] Connect MDPs to real-world sequential decision problems

---

## 📚 Prerequisites

- Python programming fundamentals
- Basic understanding of probability
- NumPy array operations (Module 1.4)

---

## 🌍 Real-World Context

**Why MDPs matter for AI:**

Every time ChatGPT generates a response, it makes a sequence of decisions, choosing one word after another. Each word choice affects what comes next. This is exactly what MDPs model: **sequential decision-making where actions have consequences**.

Real-world MDP applications:
- 🤖 **Robotics**: A robot deciding how to navigate a room
- 🎮 **Games**: An AI learning to play chess or Atari
- 💬 **Chatbots**: RLHF training treats each token as a decision
- 📈 **Finance**: Portfolio optimization over time
- 🏥 **Healthcare**: Treatment planning for patients

---

## 🧒 ELI5: What is a Markov Decision Process?

> **Imagine you're a mouse in a maze looking for cheese.** 🐭🧀
>
> - **States (S)**: Where you are in the maze (each intersection is a "state")
> - **Actions (A)**: Which direction you can go (up, down, left, right)
> - **Rewards (R)**: You get +10 points for finding cheese, -1 for hitting a wall
> - **Transitions (P)**: Sometimes the floor is slippery, so you might slip and go the wrong way!
> - **Discount (γ)**: Cheese now is better than cheese later (you're hungry!)
>
> The **Markov Property** is the key insight: "Where you go next depends ONLY on where you are now (and the action you take), not on how you got there." It's like a GPS: it doesn't care about your journey so far, just your current location.
>
> **In AI terms:** An MDP is the mathematical framework for describing environments where an agent makes decisions, receives rewards, and tries to maximize total reward over time. It's the foundation of all reinforcement learning!

---

## Part 1: MDP Components

### The Five Elements of an MDP

An MDP is defined by a tuple $(S, A, P, R, \gamma)$:

| Symbol | Name | Description |
|--------|------|-------------|
| $S$ | States | All possible situations the agent can be in |
| $A$ | Actions | All possible moves the agent can make |
| $P(s'\|s, a)$ | Transition | Probability of reaching state $s'$ after taking action $a$ in state $s$ |
| $R(s, a, s')$ | Reward | Immediate payoff for a transition |
| $\gamma$ | Discount | How much we value future rewards (0 to 1) |
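
To make the tuple concrete before building the grid world, here is a minimal sketch of a hypothetical two-state MDP stored as NumPy arrays. The array layouts `P[s, a, s']` and `R[s, a, s']`, the helper name `is_valid_mdp`, and all the numbers are illustrative assumptions; the rest of this notebook uses the `GridWorldMDP` class instead.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
n_states, n_actions = 2, 2

# P[s, a, s'] = probability of landing in s' after taking action a in state s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # from state 0: action 0, action 1
    [[1.0, 0.0], [0.0, 1.0]],   # from state 1: action 0, action 1
])

# R[s, a, s'] = immediate reward for that transition.
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0   # any transition that lands in state 1 pays +1

gamma = 0.9

def is_valid_mdp(P: np.ndarray, gamma: float) -> bool:
    """Check that each P[s, a, :] is a probability distribution and 0 <= gamma <= 1."""
    rows_sum_to_one = bool(np.allclose(P.sum(axis=-1), 1.0))
    non_negative = bool(np.all(P >= 0))
    return rows_sum_to_one and non_negative and 0.0 <= gamma <= 1.0

print(is_valid_mdp(P, gamma))
```

Storing $P$ and $R$ as dense arrays is convenient for small problems; the dictionary-based `get_transition_probs` used below trades that convenience for a more readable, sparse representation.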

Let's implement each component!

In [None]:
# Setup - run this first!
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Print a banner and the NumPy version to confirm the environment is ready
print("🚀 Module D.1: Markov Decision Processes")
print(f"NumPy version: {np.__version__}")

In [None]:
class GridWorldMDP:
    """
    A simple 4x4 grid world MDP.
    
    The agent starts at top-left and must reach bottom-right (the goal).
    Think of it as our mouse trying to find the cheese! 🐭
    
    Grid Layout:
    +---+---+---+---+
    | 0 | 1 | 2 | 3 |
    +---+---+---+---+
    | 4 | 5 | 6 | 7 |
    +---+---+---+---+
    | 8 | 9 |10 |11 |
    +---+---+---+---+
    |12 |13 |14 |15 | <- Goal!
    +---+---+---+---+
    """
    
    def __init__(self, grid_size: int = 4, slip_prob: float = 0.0):
        """
        Initialize the grid world.
        
        Args:
            grid_size: Size of the grid (default 4x4)
            slip_prob: Probability that the chosen action is replaced by a uniformly random one (0.0 = deterministic)
        """
        self.grid_size = grid_size
        self.n_states = grid_size ** 2
        self.n_actions = 4  # Up, Right, Down, Left
        self.slip_prob = slip_prob
        
        # State indices
        self.start_state = 0
        self.goal_state = self.n_states - 1
        
        # Action mapping
        self.action_names = ['Up', 'Right', 'Down', 'Left']
        self.action_deltas = {
            0: (-1, 0),  # Up
            1: (0, 1),   # Right
            2: (1, 0),   # Down
            3: (0, -1)   # Left
        }
        
        # Discount factor (how much we value future rewards)
        self.gamma = 0.99
        
        print(f"✅ Created {grid_size}x{grid_size} Grid World MDP")
        print(f"   States: {self.n_states}")
        print(f"   Actions: {self.n_actions} ({', '.join(self.action_names)})")
        print(f"   Start: State {self.start_state} | Goal: State {self.goal_state}")
        print(f"   Slip probability: {slip_prob:.1%}")
    
    def state_to_pos(self, state: int) -> Tuple[int, int]:
        """Convert state index to (row, col) position."""
        return (state // self.grid_size, state % self.grid_size)
    
    def pos_to_state(self, row: int, col: int) -> int:
        """Convert (row, col) position to state index."""
        return row * self.grid_size + col
    
    def get_next_state(self, state: int, action: int) -> int:
        """
        Get the next state after taking an action (deterministic part).
        
        If the action would take us off the grid, we stay in place.
        """
        row, col = self.state_to_pos(state)
        dr, dc = self.action_deltas[action]
        
        # Apply action (stay in bounds)
        new_row = max(0, min(self.grid_size - 1, row + dr))
        new_col = max(0, min(self.grid_size - 1, col + dc))
        
        return self.pos_to_state(new_row, new_col)
    
    def get_reward(self, state: int, action: int, next_state: int) -> float:
        """
        Get the reward for a transition.
        
        - +1.0 for reaching the goal
        - -0.01 for each step (encourages efficiency)
        """
        if next_state == self.goal_state:
            return 1.0  # Found the cheese! 🧀
        else:
            return -0.01  # Small cost for each step (time is valuable)
    
    def step(self, state: int, action: int) -> Tuple[int, float, bool]:
        """
        Take an action in the environment.
        
        Returns:
            next_state: The resulting state
            reward: The reward received
            done: Whether the episode is over (reached goal)
        """
        # Handle slipping (stochastic transitions)
        if self.slip_prob > 0 and np.random.random() < self.slip_prob:
            action = np.random.randint(0, self.n_actions)
        
        next_state = self.get_next_state(state, action)
        reward = self.get_reward(state, action, next_state)
        done = (next_state == self.goal_state)
        
        return next_state, reward, done
    
    def get_transition_probs(self, state: int, action: int) -> Dict[int, float]:
        """
        Get transition probabilities P(s'|s, a).
        
        Returns a dictionary mapping next_state -> probability
        """
        if self.slip_prob == 0:
            # Deterministic: only one possible next state
            next_state = self.get_next_state(state, action)
            return {next_state: 1.0}
        else:
            # Stochastic: could slip to any adjacent cell
            probs = {}
            for a in range(self.n_actions):
                next_s = self.get_next_state(state, a)
                if a == action:
                    prob = 1.0 - self.slip_prob + self.slip_prob / self.n_actions
                else:
                    prob = self.slip_prob / self.n_actions
                probs[next_s] = probs.get(next_s, 0) + prob
            return probs
    
    def visualize_grid(self, values: Optional[np.ndarray] = None, 
                       policy: Optional[np.ndarray] = None,
                       title: str = "Grid World"):
        """
        Visualize the grid world, optionally with values or policy.
        """
        fig, ax = plt.subplots(figsize=(8, 8))
        
        # Draw grid
        for i in range(self.grid_size + 1):
            ax.axhline(y=i, color='black', linewidth=1)
            ax.axvline(x=i, color='black', linewidth=1)
        
        # Color cells based on values
        if values is not None:
            values_2d = values.reshape(self.grid_size, self.grid_size)
            im = ax.imshow(values_2d, cmap='RdYlGn', 
                          extent=[0, self.grid_size, self.grid_size, 0])
            plt.colorbar(im, ax=ax, label='Value')
        
        # Draw policy arrows
        arrow_map = {0: (0, 0.3), 1: (0.3, 0), 2: (0, -0.3), 3: (-0.3, 0)}
        for s in range(self.n_states):
            row, col = self.state_to_pos(s)
            x, y = col + 0.5, row + 0.5
            
            if s == self.goal_state:
                ax.text(x, y, '🧀', fontsize=24, ha='center', va='center')
            elif s == self.start_state:
                ax.text(x, y, '🐭', fontsize=24, ha='center', va='center')
            elif policy is not None:
                dx, dy = arrow_map[policy[s]]
                ax.arrow(x, y, dx, -dy, head_width=0.15, head_length=0.1, fc='blue', ec='blue')
            
            # Show state number
            ax.text(x - 0.35, y - 0.35, str(s), fontsize=8, color='gray')
        
        ax.set_xlim(0, self.grid_size)
        ax.set_ylim(self.grid_size, 0)
        ax.set_aspect('equal')
        ax.set_title(title, fontsize=14)
        ax.set_xticks([])
        ax.set_yticks([])
        plt.tight_layout()
        plt.show()

# Create our grid world!
mdp = GridWorldMDP(grid_size=4, slip_prob=0.0)
mdp.visualize_grid(title="Our 4x4 Grid World (Mouse must find the cheese!)")

### 🔍 What Just Happened?

We created a simple grid world where:
- The **mouse** 🐭 starts at state 0 (top-left)
- The **cheese** 🧀 is at state 15 (bottom-right)
- The mouse can move in 4 directions
- Each step earns a reward of -0.01 (encouraging the mouse to hurry!)
- Finding cheese gives +1.0 reward

This is a **deterministic** MDP: the mouse always moves where it intends.

In [None]:
# Let's see the transition probabilities
print("📊 Transition Probabilities from State 5:\n")

state = 5
for action in range(mdp.n_actions):
    probs = mdp.get_transition_probs(state, action)
    print(f"  Action '{mdp.action_names[action]}':")
    for next_s, prob in probs.items():
        row, col = mdp.state_to_pos(next_s)
        print(f"    → State {next_s} (row {row}, col {col}) with probability {prob:.2f}")

### ✋ Try It Yourself: Simulate a Random Walk

Let's see what happens when our mouse takes random actions!

In [None]:
# Exercise: Complete this function to simulate an episode

def run_random_episode(mdp: GridWorldMDP, max_steps: int = 100) -> Tuple[List[int], float]:
    """
    Run one episode with random actions.
    
    Returns:
        path: List of states visited
        total_reward: Sum of all rewards
    """
    state = mdp.start_state
    path = [state]
    total_reward = 0.0
    
    for step in range(max_steps):
        # TODO: Choose a random action
        action = np.random.randint(0, mdp.n_actions)  # Hint: use np.random.randint
        
        # TODO: Take the action using mdp.step()
        next_state, reward, done = mdp.step(state, action)  # Hint: returns (next_state, reward, done)
        
        # Update tracking
        path.append(next_state)
        total_reward += reward
        state = next_state
        
        if done:
            break
    
    return path, total_reward

# Run a few random episodes
print("🎲 Running 5 random episodes:\n")
for i in range(5):
    path, reward = run_random_episode(mdp)
    print(f"Episode {i+1}: {len(path)-1} steps, Reward: {reward:.3f}")
    if len(path) <= 10:
        print(f"   Path: {path}")

---

## Part 2: Value Functions

### The Big Question: How Good is a State?

The **Value Function** $V(s)$ tells us: "If I'm in state $s$ and follow the best possible actions from here, how much total discounted reward can I expect?"

> 🧒 **ELI5**: Imagine you're playing a video game. Some positions are clearly better than others: you're closer to the goal, or you have more options. The value function gives each position a "score" based on how good your future looks from there.

### The Bellman Equation: The Magic Formula

The value of a state depends on:
1. The immediate reward you get
2. The value of where you end up (discounted)

$$V(s) = \max_a \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma V(s') \right]$$

In plain English: **"The value of being here = the best action's expected (immediate reward + discounted future value)"**
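
As a worked example, the sketch below applies one Bellman backup to a single state of a hypothetical MDP. The state names, the `outcomes` dictionary, and the current estimates in `V` are made-up numbers chosen so the arithmetic is easy to check by hand.

```python
gamma = 0.9

# Current value estimates for three hypothetical states.
V = {"A": 0.0, "B": 2.0, "C": 5.0}

# From state "A": each action maps to a list of (next_state, probability, reward).
outcomes = {
    "left":  [("B", 1.0, 0.0)],                    # always reach B, no reward
    "right": [("C", 0.5, 1.0), ("A", 0.5, 0.0)],   # 50% reach C with +1, 50% stay in A
}

def backup(action: str) -> float:
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[s_next]) for s_next, p, r in outcomes[action])

action_values = {a: backup(a) for a in outcomes}
best_action = max(action_values, key=action_values.get)

# left:  1.0 * (0 + 0.9 * 2.0)                         = 1.8
# right: 0.5 * (1 + 0.9 * 5.0) + 0.5 * (0 + 0.9 * 0.0) = 2.75
print(action_values, best_action)
```

The new estimate for state "A" would be the larger of the two action values. Value iteration repeats this backup for every state until the estimates stop changing.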

In [None]:
def value_iteration(mdp: GridWorldMDP, threshold: float = 1e-6, max_iterations: int = 1000) -> Tuple[np.ndarray, np.ndarray]:
    """
    Solve the MDP using Value Iteration.
    
    This finds the optimal value function V*(s) and optimal policy π*(s).
    
    Args:
        mdp: The MDP to solve
        threshold: Stop when values change less than this
        max_iterations: Safety limit on iterations
    
    Returns:
        V: Optimal value function
        policy: Optimal policy (best action for each state)
    """
    # Initialize values to zero
    V = np.zeros(mdp.n_states)
    
    print("🔄 Running Value Iteration...")
    
    for iteration in range(max_iterations):
        V_new = np.zeros(mdp.n_states)
        
        for s in range(mdp.n_states):
            # Goal state has value 0 (terminal)
            if s == mdp.goal_state:
                V_new[s] = 0
                continue
            
            # Find best action using Bellman equation
            action_values = []
            for a in range(mdp.n_actions):
                # Get transition probabilities
                transitions = mdp.get_transition_probs(s, a)
                
                # Expected value = sum over all possible next states
                expected_value = 0
                for next_s, prob in transitions.items():
                    reward = mdp.get_reward(s, a, next_s)
                    expected_value += prob * (reward + mdp.gamma * V[next_s])
                
                action_values.append(expected_value)
            
            # Take the best action
            V_new[s] = max(action_values)
        
        # Check for convergence
        max_change = np.max(np.abs(V_new - V))
        V = V_new
        
        if max_change < threshold:
            print(f"✅ Converged after {iteration + 1} iterations!")
            break
    
    # Extract optimal policy from value function
    policy = np.zeros(mdp.n_states, dtype=int)
    for s in range(mdp.n_states):
        if s == mdp.goal_state:
            continue
        
        action_values = []
        for a in range(mdp.n_actions):
            transitions = mdp.get_transition_probs(s, a)
            expected_value = sum(
                prob * (mdp.get_reward(s, a, next_s) + mdp.gamma * V[next_s])
                for next_s, prob in transitions.items()
            )
            action_values.append(expected_value)
        
        policy[s] = np.argmax(action_values)
    
    return V, policy

# Solve our grid world!
V, policy = value_iteration(mdp)

print("\n📊 Optimal Value Function:")
print(V.reshape(4, 4).round(3))

print("\n🧭 Optimal Policy:")
policy_names = np.array([mdp.action_names[a] for a in policy]).reshape(4, 4)
print(policy_names)

In [None]:
# Visualize the solution
mdp.visualize_grid(values=V, policy=policy, title="Optimal Value Function and Policy")

### 🔍 What Just Happened?

1. **Value Iteration** computed $V^*(s)$ for every state
2. Non-terminal states closer to the goal have higher values (greener); the goal itself is terminal, so its value is fixed at 0
3. The **optimal policy** shows the best action at each state
4. Notice: the policy creates a "path" straight to the goal!

The mouse now knows exactly which way to go from any position.

---

## Part 3: The Discount Factor γ

> 🧒 **ELI5**: Would you rather have a cookie now or a cookie tomorrow? Most people prefer now! The discount factor $\gamma$ captures this: it's how much we "shrink" future rewards.

- $\gamma = 0$: Only care about immediate reward (very short-sighted)
- $\gamma = 1$: Care equally about all future rewards (dangerous for infinite horizons!)
- $\gamma = 0.99$: Future matters, but immediate is still slightly better
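
A quick calculation makes these cases tangible. The sketch below computes the discounted return of one hypothetical episode (five steps at -0.01, then +1.0 on the sixth, which mirrors a shortest path in the 4x4 grid) for several values of $\gamma$. The helper name `discounted_return` is an illustrative choice, not part of the notebook's API.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical episode: five steps costing -0.01 each, then +1.0 at the goal.
rewards = [-0.01] * 5 + [1.0]

for gamma in (0.0, 0.5, 0.99, 1.0):
    print(f"gamma = {gamma:<4}  return = {discounted_return(rewards, gamma):.4f}")
```

With $\gamma = 0$ only the first step's -0.01 counts, so the goal is invisible from the start. As $\gamma$ approaches 1, the return approaches the plain sum of rewards.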

Let's see how $\gamma$ affects our solution:

In [None]:
# Compare different discount factors
gammas = [0.5, 0.9, 0.99, 0.999]

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for ax, gamma in zip(axes, gammas):
    # Create MDP with this gamma
    test_mdp = GridWorldMDP(grid_size=4, slip_prob=0.0)
    test_mdp.gamma = gamma
    
    # Solve it
    V_g, policy_g = value_iteration(test_mdp)  # Separate names so the earlier V and policy are not overwritten
    
    # Visualize
    V_2d = V_g.reshape(4, 4)
    im = ax.imshow(V_2d, cmap='RdYlGn')
    ax.set_title(f'γ = {gamma}\nV(0) = {V_g[0]:.4f}')
    ax.set_xticks([])
    ax.set_yticks([])
    
    # Add text
    for i in range(4):
        for j in range(4):
            ax.text(j, i, f'{V_2d[i,j]:.3f}', ha='center', va='center', fontsize=8)

plt.suptitle('Effect of Discount Factor γ on Value Function', fontsize=14)
plt.tight_layout()
plt.show()

### Interpretation

- With low $\gamma$ (0.5), distant states barely see the goal's value
- With high $\gamma$ (0.999), value propagates all the way back
- The starting state's value tells us: "How good is it to start here?"

---

## Part 4: Stochastic MDPs

Real-world environments are noisy! Let's add some "slipperiness" to our floor:

> When the mouse tries to move, there's a chance it slips and goes a random direction instead!
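
Before running the grid world with slipping, it helps to see the resulting action distribution on its own. The sketch below assumes the slip model implemented in `GridWorldMDP.step`: with probability `slip_prob` the intended action is replaced by one drawn uniformly from all four actions (which can happen to be the intended one again). The function name `slip_action_distribution` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

def slip_action_distribution(intended: int, slip_prob: float, n_actions: int = 4) -> np.ndarray:
    """Probability that each action is the one actually executed."""
    # Every action can be picked by the uniform "slip" draw...
    dist = np.full(n_actions, slip_prob / n_actions)
    # ...and the intended action is also executed whenever no slip occurs.
    dist[intended] += 1.0 - slip_prob
    return dist

# Intended action 1 (Right) with a 30% slip chance.
dist = slip_action_distribution(intended=1, slip_prob=0.3)
print(dist)        # executed-action probabilities for Up, Right, Down, Left
print(dist.sum())
```

Note that the intended action is executed more often than `1 - slip_prob`, because a slip can still land on it by chance.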

In [None]:
# Create a slippery grid world (like FrozenLake!)
slippery_mdp = GridWorldMDP(grid_size=4, slip_prob=0.3)

print("\n📊 Transition probabilities with 30% slip chance:")
print("\nFrom State 5, trying to go Right:")
probs = slippery_mdp.get_transition_probs(5, 1)  # Action 1 = Right
for next_s, prob in sorted(probs.items()):
    print(f"  → State {next_s}: {prob:.1%} chance")

In [None]:
# Solve the slippery grid world
V_slippery, policy_slippery = value_iteration(slippery_mdp)

# Compare deterministic vs stochastic
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Deterministic
ax = axes[0]
im = ax.imshow(V.reshape(4, 4), cmap='RdYlGn', vmin=0, vmax=1)
ax.set_title('Deterministic (no slip)')
for i in range(4):
    for j in range(4):
        ax.text(j, i, f'{V[i*4+j]:.3f}', ha='center', va='center')
ax.set_xticks([])
ax.set_yticks([])

# Stochastic
ax = axes[1]
im = ax.imshow(V_slippery.reshape(4, 4), cmap='RdYlGn', vmin=0, vmax=1)
ax.set_title('Stochastic (30% slip)')
for i in range(4):
    for j in range(4):
        ax.text(j, i, f'{V_slippery[i*4+j]:.3f}', ha='center', va='center')
ax.set_xticks([])
ax.set_yticks([])

plt.colorbar(im, ax=axes)
plt.suptitle('Value Functions: Deterministic vs Stochastic Worlds', fontsize=14)
plt.tight_layout()
plt.show()

print(f"\n📉 Value of starting state:")
print(f"   Deterministic: {V[0]:.4f}")
print(f"   Stochastic:    {V_slippery[0]:.4f}")
print(f"\n   Uncertainty reduces value by {(1 - V_slippery[0]/V[0])*100:.1f}%!")

### Key Insight

**Uncertainty is costly!** When the environment is unpredictable, the expected total reward decreases. This is why:

- Robust policies try to avoid risky situations
- In RLHF, we want the LLM to give consistent, reliable answers
- Stochastic environments require more exploration to learn

---

## Part 5: Q-Values (Action Values)

Sometimes we want to know: "How good is it to take action $a$ in state $s$?"

This is the **Q-function** or **Action-Value function**:

$$Q(s, a) = \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma V(s') \right]$$

> 🧒 **ELI5**: If $V(s)$ tells you "how good is this room?", then $Q(s, a)$ tells you "how good is taking this specific door from this room?"
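
When $P$ and $R$ are stored as arrays, the formula collapses to a single NumPy expression. The sketch below uses a hypothetical 3-state, 2-action MDP with assumed value estimates `V`. The array layout `P[s, a, s']` is an illustrative convention and differs from the dictionary-based `get_transition_probs` used in this notebook.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP; state 2 is absorbing with zero reward.
P = np.zeros((3, 2, 3))          # P[s, a, s']
P[0, 0] = [0.0, 1.0, 0.0]        # state 0, action 0 -> state 1
P[0, 1] = [0.5, 0.0, 0.5]        # state 0, action 1 -> stay, or jump to state 2
P[1, 0] = [0.0, 0.0, 1.0]        # state 1, action 0 -> state 2
P[1, 1] = [1.0, 0.0, 0.0]        # state 1, action 1 -> back to state 0
P[2, :] = [0.0, 0.0, 1.0]        # state 2 loops to itself under both actions

R = np.zeros((3, 2, 3))          # R[s, a, s']
R[0, 1, 2] = 1.0                 # +1 for entering state 2 from state 0
R[1, 0, 2] = 1.0                 # +1 for entering state 2 from state 1

gamma = 0.9
V = np.array([0.9, 1.0, 0.0])    # assumed value estimates, for illustration

# Q[s, a] = sum over s' of P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
Q = np.sum(P * (R + gamma * V), axis=-1)
print(Q)
print(Q.argmax(axis=1))          # greedy action in each state
```

Broadcasting adds `gamma * V` along the last axis (the next-state axis), so no explicit loop over states or actions is needed.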

In [None]:
def compute_q_values(mdp: GridWorldMDP, V: np.ndarray) -> np.ndarray:
    """
    Compute Q(s, a) for all state-action pairs.
    
    Returns:
        Q: Array of shape (n_states, n_actions)
    """
    Q = np.zeros((mdp.n_states, mdp.n_actions))
    
    for s in range(mdp.n_states):
        if s == mdp.goal_state:
            continue  # Terminal state: the episode has ended, so Q stays 0
        for a in range(mdp.n_actions):
            transitions = mdp.get_transition_probs(s, a)
            
            Q[s, a] = sum(
                prob * (mdp.get_reward(s, a, next_s) + mdp.gamma * V[next_s])
                for next_s, prob in transitions.items()
            )
    
    return Q

# Compute Q-values for our deterministic MDP
Q = compute_q_values(mdp, V)

# Show Q-values for a specific state
state = 5
print(f"📊 Q-values for State {state}:")
for a in range(mdp.n_actions):
    print(f"   Q({state}, {mdp.action_names[a]:>5}) = {Q[state, a]:.4f}")

print(f"\n🎯 Best action: {mdp.action_names[np.argmax(Q[state])]} (matches policy: {mdp.action_names[policy[state]]})")

In [None]:
# Visualize Q-values as a heatmap
fig, ax = plt.subplots(figsize=(10, 8))

im = ax.imshow(Q, cmap='RdYlGn', aspect='auto')
ax.set_xticks(range(mdp.n_actions))
ax.set_xticklabels(mdp.action_names)
ax.set_yticks(range(mdp.n_states))
ax.set_yticklabels([f'State {s}' for s in range(mdp.n_states)])

ax.set_xlabel('Action', fontsize=12)
ax.set_ylabel('State', fontsize=12)
ax.set_title('Q-Values: Q(s, a) for All State-Action Pairs', fontsize=14)

plt.colorbar(im, label='Q-Value')
plt.tight_layout()
plt.show()

### Why Q-values Matter

1. **Policy Extraction**: $\pi^*(s) = \arg\max_a Q(s, a)$
2. **Learning**: We can learn $Q$ directly without knowing $P$ (Q-learning!)
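3. **RLHF**: The reward model scores complete responses; PPO-style training typically relies on learned value estimates (the same idea as $V$ and $Q$) to assign credit for that score to individual token choices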
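
As a small preview of point 2, the sketch below shows a single tabular Q-learning update from one observed transition, with no reference to $P$. The function name `q_learning_update`, the learning rate `alpha`, and the example transition are illustrative assumptions.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step from a single observed transition (s, a, r, s_next)."""
    # Bootstrap from the best action in the next state; terminal states contribute 0.
    target = r if done else r + gamma * np.max(Q[s_next])
    # Move the current estimate a fraction alpha of the way toward the target.
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((16, 4))   # 16 states and 4 actions, matching the 4x4 grid
Q = q_learning_update(Q, s=14, a=1, r=1.0, s_next=15, done=True)
print(Q[14, 1])
```

Only the entry for the visited state-action pair changes. Learning a full Q-table this way requires many sampled transitions.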

---

## ⚠️ Common Mistakes

### Mistake 1: Forgetting the Discount Factor

```python
# ❌ Wrong: Not discounting future rewards
V_new[s] = reward + V[next_state]  # Values can grow without bound!

# ✅ Right: Apply discount factor
V_new[s] = reward + gamma * V[next_state]
```

**Why:** Without discounting, the sum of rewards in an infinite-horizon MDP can diverge, so values may be infinite.

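### Mistake 2: Mixing Old and New Values in One Sweep

```python
# ❌ Wrong (for synchronous value iteration): overwriting V in place
for s in range(n_states):
    V[s] = max(...)  # Later states read values already updated this sweep!

# ✅ Right: Update to new array, then copy
for s in range(n_states):
    V_new[s] = max(...)
V = V_new.copy()
```

**Why:** Synchronous value iteration computes every new value from the previous iteration's values. In-place updates still converge, but they are a different (asynchronous) variant, so intermediate values and iteration counts depend on the order in which states are visited.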
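
The difference is easy to see on a tiny example. The sketch below runs one sweep each way on a hypothetical 4-state chain with a single "move right" action (so the max drops out), a terminal last state, and +1 for reaching it. The synchronous sweep reads only the previous values, while the in-place sweep picks up values written earlier in the same sweep. Both reach the same fixed point if repeated, but only the synchronous one matches the iteration-by-iteration updates in `value_iteration`.

```python
import numpy as np

gamma = 0.9
n = 4  # hypothetical chain 0 -> 1 -> 2 -> 3, where state 3 is terminal

def step_reward(s: int) -> float:
    """+1 when the move from s enters the terminal state, else 0."""
    return 1.0 if s + 1 == n - 1 else 0.0

def sweep_synchronous(V: np.ndarray) -> np.ndarray:
    V_new = np.zeros_like(V)
    for s in range(n - 1):                          # terminal state keeps value 0
        V_new[s] = step_reward(s) + gamma * V[s + 1]   # reads only the old array
    return V_new

def sweep_in_place(V: np.ndarray) -> np.ndarray:
    V = V.copy()
    for s in reversed(range(n - 1)):                # visit states nearest the goal first
        V[s] = step_reward(s) + gamma * V[s + 1]    # may read a value written this sweep
    return V

V0 = np.zeros(n)
print(sweep_synchronous(V0))   # only the state next to the goal changes
print(sweep_in_place(V0))      # the goal's value reaches every state in one sweep
```

With a different visiting order (for example, states 0, 1, 2), the in-place sweep would give yet another intermediate result, which is why mixing the two styles makes iteration counts hard to compare.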

### Mistake 3: Wrong Transition Probabilities

```python
# ❌ Wrong: Probabilities don't sum to 1
probs = {next_s: 0.5, other_s: 0.3}  # Sums to 0.8!

# ✅ Right: Ensure probabilities sum to 1
probs = {next_s: 0.7, other_s: 0.3}  # Sums to 1.0
```

**Why:** Transition function must be a valid probability distribution.
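
One way to catch this mistake early is to validate each distribution before using it. The sketch below is a hypothetical helper (`check_transition_probs` is not part of this notebook) that accepts the same `next_state -> probability` dictionaries returned by `get_transition_probs`.

```python
import math
from typing import Dict

def check_transition_probs(probs: Dict[int, float], tol: float = 1e-9) -> None:
    """Raise ValueError unless probs is a valid distribution over next states."""
    if any(p < 0 for p in probs.values()):
        raise ValueError(f"Negative probability in {probs}")
    total = sum(probs.values())
    # Compare with a tolerance: floating-point sums are rarely exactly 1.0.
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"Probabilities sum to {total}, expected 1.0")

check_transition_probs({6: 0.7, 4: 0.3})        # passes silently

try:
    check_transition_probs({6: 0.5, 4: 0.3})    # sums to 0.8
except ValueError as err:
    print(err)
```

Calling a check like this for every state-action pair once, right after constructing the MDP, is cheap for grids of this size.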

---

## ✋ Try It Yourself: Larger Grid with Obstacles

Create a 6x6 grid with obstacles (walls the mouse cannot pass through).

In [None]:
# Exercise: Implement a grid with obstacles

class ObstacleGridMDP(GridWorldMDP):
    """
    Grid world with obstacles (impassable cells).
    """
    
    def __init__(self, grid_size: int = 6, obstacles: Optional[List[int]] = None):
        super().__init__(grid_size, slip_prob=0.0)
        
        # TODO: Set obstacles (cells the agent cannot enter)
        # Default obstacles create a wall-like pattern
        if obstacles is None:
            self.obstacles = {8, 9, 14, 15, 20, 26, 27}  # Example pattern
        else:
            self.obstacles = set(obstacles)
        
        print(f"   Obstacles at: {sorted(self.obstacles)}")
    
    def get_next_state(self, state: int, action: int) -> int:
        """Modified to handle obstacles."""
        # Get the intended next state
        next_state = super().get_next_state(state, action)
        
        # TODO: If next_state is an obstacle, stay in current state
        if next_state in self.obstacles:
            return state  # Can't move into obstacle
        
        return next_state

# Test it!
obstacle_mdp = ObstacleGridMDP(grid_size=6)
V_obs, policy_obs = value_iteration(obstacle_mdp)

print("\n📊 Value Function with Obstacles:")
print(V_obs.reshape(6, 6).round(3))

<details>
<summary>💡 Hint</summary>

The key is in `get_next_state()`: check if the intended next state is an obstacle. If so, return the current state (agent "bounces" off the wall).

```python
if next_state in self.obstacles:
    return state  # Stay in place
return next_state
```
</details>

---

## 🎉 Checkpoint

You've learned:
- ✅ **MDP Components**: States, Actions, Rewards, Transitions, Discount
- ✅ **The Markov Property**: Future depends only on present
- ✅ **Value Functions**: $V(s)$ tells us how good a state is
- ✅ **Bellman Equation**: The recursive relationship between values
- ✅ **Value Iteration**: Algorithm to find optimal policy
- ✅ **Q-values**: Action-value function for evaluating actions

---

## 🚀 Challenge: Policy Iteration

Value iteration updates values until convergence. **Policy Iteration** is an alternative:

1. Start with a random policy
2. **Policy Evaluation**: Compute $V^\pi$ for current policy
3. **Policy Improvement**: Update policy greedily with respect to $V^\pi$
4. Repeat until policy doesn't change

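Can you implement it? Policy iteration often converges in fewer outer iterations than value iteration, although each iteration is more expensive because it includes a full policy evaluation.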
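
Step 2 does not have to be iterative. For a fixed policy the Bellman equation is linear in $V^\pi$, namely $V^\pi = r_\pi + \gamma P_\pi V^\pi$, so a small MDP can be evaluated with one linear solve. The sketch below is a swapped-in alternative to iterative evaluation, shown on a hypothetical 3-state chain. The names `evaluate_policy_exactly`, `P_pi`, and `r_pi` are illustrative assumptions.

```python
import numpy as np

def evaluate_policy_exactly(P_pi: np.ndarray, r_pi: np.ndarray, gamma: float) -> np.ndarray:
    """Solve V = r_pi + gamma * P_pi @ V for V.

    P_pi[s, s'] is the transition matrix under the fixed policy and
    r_pi[s] is the expected immediate reward under that policy.
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Hypothetical chain under a fixed policy: 0 -> 1 -> 2, where state 2 is
# terminal (modelled as an all-zero row, so nothing is collected after it).
P_pi = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])
r_pi = np.array([-0.01, 1.0, 0.0])   # step cost, then +1 for reaching the goal
gamma = 0.99

print(evaluate_policy_exactly(P_pi, r_pi, gamma))
```

The solve costs roughly cubic time in the number of states, so it suits small tabular problems like this grid; iterative evaluation scales better when the state space is large.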

In [None]:
# Challenge: Implement Policy Iteration

def policy_evaluation(mdp: GridWorldMDP, policy: np.ndarray, 
                      threshold: float = 1e-6) -> np.ndarray:
    """Evaluate a policy to get V^π."""
    V = np.zeros(mdp.n_states)
    
    while True:
        V_new = np.zeros(mdp.n_states)
        for s in range(mdp.n_states):
            if s == mdp.goal_state:
                continue
            
            a = policy[s]  # Use policy's action (not max!)
            transitions = mdp.get_transition_probs(s, a)
            
            V_new[s] = sum(
                prob * (mdp.get_reward(s, a, ns) + mdp.gamma * V[ns])
                for ns, prob in transitions.items()
            )
        
        delta = np.max(np.abs(V_new - V))
        V = V_new  # Keep the latest sweep so the returned values are the converged ones
        if delta < threshold:
            break
    
    return V


def policy_iteration(mdp: GridWorldMDP) -> Tuple[np.ndarray, np.ndarray]:
    """Find optimal policy using Policy Iteration."""
    # Start with random policy
    policy = np.random.randint(0, mdp.n_actions, size=mdp.n_states)
    
    iteration = 0
    while True:
        iteration += 1
        
        # Policy Evaluation
        V = policy_evaluation(mdp, policy)
        
        # Policy Improvement
        policy_stable = True
        for s in range(mdp.n_states):
            if s == mdp.goal_state:
                continue
            
            old_action = policy[s]
            
            # Find best action
            action_values = []
            for a in range(mdp.n_actions):
                transitions = mdp.get_transition_probs(s, a)
                q = sum(
                    prob * (mdp.get_reward(s, a, ns) + mdp.gamma * V[ns])
                    for ns, prob in transitions.items()
                )
                action_values.append(q)
            
            policy[s] = np.argmax(action_values)
            
            if old_action != policy[s]:
                policy_stable = False
        
        if policy_stable:
            print(f"✅ Policy Iteration converged in {iteration} iterations!")
            break
    
    return V, policy

# Test Policy Iteration
V_pi, policy_pi = policy_iteration(mdp)

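print("\n📊 Comparing methods:")
print(f"   Value Iteration V(0):  {V[0]:.6f}")
print(f"   Policy Iteration V(0): {V_pi[0]:.6f}")
# Compare only non-goal states: neither method ever updates the goal's action
non_goal = np.arange(mdp.n_states) != mdp.goal_state
print(f"   Policies match: {np.array_equal(policy[non_goal], policy_pi[non_goal])}")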

---

## 🔗 Connection to RLHF

How does this connect to fine-tuning LLMs?

| MDP Concept | RLHF Equivalent |
|-------------|------------------|
| State $s$ | Current context + tokens generated so far |
| Action $a$ | Next token to generate |
| Reward $R$ | Human preference or reward model score |
| Policy $\pi$ | The language model itself! |
| Value $V(s)$ | Expected quality of response from this point |

When we train ChatGPT with RLHF:
1. The LLM is the **policy** (maps context → token distribution)
2. Each token generation is an **action**
3. The reward model provides **rewards** for complete responses
4. PPO optimizes the policy to maximize expected reward

Understanding MDPs is the foundation for understanding how RLHF works!

---

## 📖 Further Reading

- [Sutton & Barto, Chapter 3](http://incompleteideas.net/book/the-book-2nd.html) - Finite MDPs
- [OpenAI Spinning Up: Key Concepts](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html)
- [Bellman Equation Explained](https://en.wikipedia.org/wiki/Bellman_equation)

---

## 🧹 Cleanup

In [None]:
# No GPU cleanup needed for this notebook (NumPy only)
import gc
gc.collect()

print("✅ Notebook complete! Ready for Lab D.2: Q-Learning")

---

## 📝 Summary

| Concept | Formula | Intuition |
|---------|---------|------------|
| Value Function | $V(s) = \mathbb{E}[\sum_t \gamma^t R_t]$ | "How good is this state?" |
| Q-Function | $Q(s,a) = \mathbb{E}[R + \gamma V(s')]$ | "How good is this action?" |
| Bellman Equation | $V(s) = \max_a \mathbb{E}[R + \gamma V(s')]$ | "Value = best immediate + future" |
| Optimal Policy | $\pi^*(s) = \arg\max_a Q(s,a)$ | "Always take the best action" |

**Next:** In Lab D.2, we'll learn Q-learning: how to find optimal policies **without knowing the transition probabilities**!