# Dynamic Programming in Reinforcement Learning

Welcome to this comprehensive tutorial on Dynamic Programming (DP) methods in Reinforcement Learning! By the end of this assignment, you'll be able to:

- Implement Policy Evaluation to compute state values under a given policy
- Implement Policy Improvement to derive better policies from value functions
- Apply Policy Iteration to find optimal policies
- Implement Value Iteration for efficient policy optimization
- Compare and contrast different DP algorithms
- Understand the trade-offs between computational complexity and convergence speed

Dynamic Programming provides exact solutions for Markov Decision Processes (MDPs) when we have complete knowledge of the environment dynamics. Think of it as finding the best route in a city where you know all the roads, traffic patterns, and destinations!

<img src="https://via.placeholder.com/650x300.png?text=GridWorld+Environment" style="width:650px;height:300px;">
<caption><center> <u> <b>Figure 1</b> </u>: <b>GridWorld Navigation Problem</b><br> The agent must navigate from the start position to the goal, finding the optimal path that maximizes total reward. </center></caption>


Let's get started!

## Important Note on Submission to the AutoGrader

Before submitting your assignment to the AutoGrader, please make sure that:

1. You have not added any _extra_ `print` statement(s) in the assignment.
2. You have not added any _extra_ code cell(s) in the assignment.
3. You have not changed any of the function parameters.
4. You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from it and use the local variables instead.
5. You are not changing the assignment code where it is not required, like creating _extra_ variables.

## Table of Contents
- [1 - Packages](#1)
- [2 - Theoretical Background](#2)
    - [2.1 - What is Dynamic Programming?](#2-1)
    - [2.2 - Markov Decision Processes (MDPs)](#2-2)
    - [2.3 - Bellman Equations](#2-3)
- [3 - Policy Iteration](#3)
    - [3.1 - Policy Evaluation](#3-1)
        - [Exercise 1 - policy_evaluation](#ex-1)
    - [3.2 - Policy Improvement](#3-2)
        - [Exercise 2 - policy_improvement](#ex-2)
    - [3.3 - Running Policy Iteration](#3-3)
- [4 - Value Iteration](#4)
    - [4.1 - Value Iteration Step](#4-1)
        - [Exercise 3 - value_iteration_step](#ex-3)
    - [4.2 - Extract Policy from Values](#4-2)
        - [Exercise 4 - extract_policy](#ex-4)
    - [4.3 - Running Value Iteration](#4-3)
- [5 - Algorithm Comparison](#5)
    - [Exercise 5 - compare_algorithms](#ex-5)
- [6 - Experiments and Analysis](#6)
    - [6.1 - Effect of Discount Factor](#6-1)
    - [6.2 - Scalability Analysis](#6-2)
    - [6.3 - GridWorld with Obstacles](#6-3)
- [7 - Advanced Visualizations](#7)
    - [7.1 - Q-Value Heatmaps](#7-1)
    - [7.2 - Convergence Animation](#7-2)
    - [7.3 - Optimal Trajectories](#7-3)

<a name='1'></a>
## 1 - Packages

In [None]:
import sys
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import FancyArrowPatch
from matplotlib.colors import LinearSegmentedColormap
import pandas as pd
import math

# Add repository path for imports
repo_path = '/home/user/Reinforcement-learning-guide'
if repo_path not in sys.path:
    sys.path.insert(0, repo_path)

# Add notebooks directory for dp_utils
notebooks_path = os.path.join(repo_path, 'notebooks')
if notebooks_path not in sys.path:
    sys.path.insert(0, notebooks_path)

# Import utility functions and tests
from dp_utils import (
    PolicyIteration, ValueIteration, create_gridworld_mdp,
    visualize_policy_and_values, visualize_trajectory, simulate_trajectory,
    policy_evaluation_test, policy_improvement_test, value_iteration_step_test,
    extract_policy_test, compare_algorithms_test, gridworld_environment_test
)

# Visualization settings
plt.rcParams['figure.figsize'] = (10.0, 6.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'RdYlGn'
%matplotlib inline

print("✓ Packages loaded successfully")

<a name='2'></a>
## 2 - Theoretical Background

<a name='2-1'></a>
### 2.1 - What is Dynamic Programming?

**Dynamic Programming (DP)** is a family of algorithms that can compute optimal policies given a perfect model of the environment as a **Markov Decision Process (MDP)**.

**Key Characteristics:**

1. **Requires complete model**: We need to know the transition probabilities $p(s',r|s,a)$
2. **Provides exact solutions**: Finds the optimal policy (not an approximation)
3. **Computationally expensive**: For large state spaces
4. **Theoretical foundation**: Basis for model-free methods (Q-Learning, SARSA, etc.)

**Real-World Applications:**

- Robot path planning and navigation
- Inventory management and supply chain optimization  
- Industrial process control
- Game playing with discrete states
- Resource allocation problems

<a name='2-2'></a>
### 2.2 - Markov Decision Processes (MDPs)

An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

- $\mathcal{S}$: Set of states
- $\mathcal{A}$: Set of actions
- $P$: Transition function $P(s'|s,a)$
- $R$: Reward function $R(s,a,s')$
- $\gamma \in [0,1]$: Discount factor

**Value Functions:**

The **state-value function** $V^\pi(s)$ is:
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s\right]$$

The **action-value function** $Q^\pi(s,a)$ is:
$$Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s, A_0 = a\right]$$
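
To make the sum inside these expectations concrete, here is a minimal sketch that computes the discounted return of one finite reward sequence. The helper name `discounted_return` and the reward numbers are illustrative assumptions and are not part of `dp_utils`.

```python
import numpy as np

def discounted_return(reward_sequence, gamma):
    """Sum of gamma**t * reward_sequence[t] over a finite reward sequence."""
    reward_sequence = np.asarray(reward_sequence, dtype=float)
    discounts = gamma ** np.arange(len(reward_sequence))
    return float(np.sum(discounts * reward_sequence))

# Three steps that each cost 0.01, followed by a goal reward of 1.0
# (illustrative numbers, chosen to resemble a short GridWorld episode).
example_rewards = [-0.01, -0.01, -0.01, 1.0]
example_return = discounted_return(example_rewards, gamma=0.9)
print(f"Discounted return: {example_return:.4f}")  # prints 0.7019
```

The value functions above are the expectation of this quantity over all trajectories that the policy and the dynamics can generate from a given start.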

<a name='2-3'></a>
### 2.3 - Bellman Equations

**Bellman Expectation Equation** for $V^\pi$:
$$V^\pi(s) = \sum_{a} \pi(a|s) \sum_{s',r} p(s',r|s,a)[r + \gamma V^\pi(s')]$$

**Bellman Optimality Equation** for $V^*$:
$$V^*(s) = \max_a \sum_{s',r} p(s',r|s,a)[r + \gamma V^*(s')]$$

**Bellman Optimality Equation** for $Q^*$:
$$Q^*(s,a) = \sum_{s',r} p(s',r|s,a)\left[r + \gamma \max_{a'} Q^*(s',a')\right]$$

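These equations express the recursive relationship between the value of a state and the values of its successor states, which is the foundation of Dynamic Programming! In this notebook the reward is a deterministic function $R(s,a,s')$, so each sum over $(s', r)$ reduces to $\sum_{s'} P(s'|s,a)\,[R(s,a,s') + \gamma V(s')]$, which is the form the code works with.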
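
For a fixed deterministic policy, the Bellman expectation equation is a linear system in $V^\pi$, so on a small problem it can be solved directly. The sketch below does this for a hypothetical 3-state chain; `make_chain_mdp` and `solve_policy_value` are names invented for this example, and the arrays follow the `(n_states, n_actions, n_states)` layout used in this notebook.

```python
import numpy as np

def make_chain_mdp():
    """Hypothetical chain 0 - 1 - 2 with an absorbing goal at state 2.

    Actions: 0 = left, 1 = right. Every move costs 0.01 except the move
    into the goal, which pays 1.0. The goal loops onto itself with reward 0.
    """
    P = np.zeros((3, 2, 3))
    R = np.zeros((3, 2, 3))
    P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 0] = P[1, 1, 2] = 1.0
    P[2, :, 2] = 1.0
    R[0, 0, 0] = R[0, 1, 1] = R[1, 0, 0] = -0.01
    R[1, 1, 2] = 1.0
    return P, R

def solve_policy_value(P, R, policy, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for a deterministic policy."""
    n = P.shape[0]
    idx = np.arange(n)
    P_pi = P[idx, policy]                          # (n, n) transitions under the policy
    r_pi = np.sum(P_pi * R[idx, policy], axis=1)   # expected one-step reward per state
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

P_chain, R_chain = make_chain_mdp()
always_right = np.array([1, 1, 0])
V_exact = solve_policy_value(P_chain, R_chain, always_right, gamma=0.9)
print(V_exact)  # approximately [0.89, 1.0, 0.0]
```

You can check the result by hand: $V(1) = 1.0$, and $V(0) = -0.01 + 0.9 \cdot 1.0 = 0.89$. The direct solve needs $\gamma < 1$ here, because the absorbing goal makes $I - P_\pi$ singular.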

<a name='3'></a>
## 3 - Policy Iteration

Policy Iteration is a two-step iterative algorithm:

1. **Policy Evaluation**: Compute $V^\pi$ for the current policy $\pi$
2. **Policy Improvement**: Improve $\pi$ using the computed values

This process continues until the policy no longer changes. For a finite MDP this happens after a finite number of iterations, and the final policy is the optimal policy $\pi^*$.

<a name='3-1'></a>
### 3.1 - Policy Evaluation

Policy Evaluation computes the state-value function $V^\pi$ for a given policy $\pi$. We iteratively apply the Bellman expectation equation:

$$V_{k+1}(s) = \sum_{s',r} p(s',r|s,\pi(s))[r + \gamma V_k(s')]$$

**Algorithm:**
```
Initialize V(s) = 0 for all s
Repeat until convergence:
    For each state s:
        V(s) ← Σ_{s',r} p(s',r|s,π(s))[r + γV(s')]
```
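
The loop above can also be written with array operations. The following is a minimal vectorized sketch on a hypothetical 3-state chain, not the graded solution; `make_chain_mdp` and `evaluate_policy` are names made up for this example, and the update is synchronous (each sweep reads only the values of the previous sweep).

```python
import numpy as np

def make_chain_mdp():
    """Hypothetical chain 0 - 1 - 2; state 2 is an absorbing goal."""
    P = np.zeros((3, 2, 3))
    R = np.zeros((3, 2, 3))
    P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 0] = P[1, 1, 2] = 1.0
    P[2, :, 2] = 1.0
    R[0, 0, 0] = R[0, 1, 1] = R[1, 0, 0] = -0.01
    R[1, 1, 2] = 1.0
    return P, R

def evaluate_policy(P, R, policy, gamma=0.9, theta=1e-6, max_iterations=1000):
    """Iterate V <- r_pi + gamma * P_pi V until the largest change is below theta."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, policy]                          # transitions under the policy
    r_pi = np.sum(P_pi * R[idx, policy], axis=1)   # expected one-step rewards
    V = np.zeros(n_states)
    sweeps = 0
    for sweeps in range(1, max_iterations + 1):
        V_new = r_pi + gamma * P_pi @ V
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:
            break
    return V, sweeps

P_chain, R_chain = make_chain_mdp()
V_iter, n_sweeps = evaluate_policy(P_chain, R_chain, np.array([1, 1, 0]))
print(V_iter, n_sweeps)
```

On this chain the values stop changing after three sweeps, because the reward information only has to travel two states back from the goal.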

<a name='ex-1'></a>
### Exercise 1 - policy_evaluation

Implement the policy evaluation algorithm. For each state, compute the expected value by summing over all possible next states, weighted by their transition probabilities.

**Instructions:**
- Loop through all states
- For each state, get the action from the policy
- For each next state, compute the reward plus the discounted next-state value, and weight it by the transition probability
- Sum these weighted terms over all next states
- Check convergence using max absolute change (delta < theta)

In [None]:
# GRADED FUNCTION: policy_evaluation

def policy_evaluation(policy, transition_probs, rewards, gamma=0.99, theta=1e-6, max_iterations=1000):
    """
    Evaluate a policy by computing the state-value function.
    
    Arguments:
    policy -- numpy array of shape (n_states,) containing action for each state
    transition_probs -- numpy array of shape (n_states, n_actions, n_states) 
                        containing transition probabilities
    rewards -- numpy array of shape (n_states, n_actions, n_states) containing rewards
    gamma -- discount factor, scalar (default: 0.99)
    theta -- convergence threshold, scalar (default: 1e-6)
    max_iterations -- maximum number of iterations (default: 1000)
    
    Returns:
    V -- numpy array of shape (n_states,) containing state values
    """
    
    n_states = transition_probs.shape[0]
    V = np.zeros(n_states)
    
    for iteration in range(max_iterations):
        delta = 0
        V_new = np.zeros(n_states)
        
        # Loop through all states
        for s in range(n_states):
            # (approx. 5 lines)
            # Get action from policy: action = ...
            # Compute value: sum over next states
            # value = 0
            # for s_next in range(n_states):
            #     value += transition_probs[s, action, s_next] * (rewards[s, action, s_next] + gamma * V[s_next])
            # YOUR CODE STARTS HERE
            
            
            # YOUR CODE ENDS HERE
            
            V_new[s] = value
            delta = max(delta, abs(V_new[s] - V[s]))
        
        V = V_new
        
        # Check convergence
        if delta < theta:
            break
    
    return V

Now let's test your implementation:

In [None]:
# Test policy_evaluation
print("Testing policy_evaluation...\n")

# Create simple GridWorld
transition_probs, rewards, n_states, n_actions = create_gridworld_mdp(grid_size=4)

# Create random policy
test_policy = np.random.randint(0, n_actions, size=n_states)

# Evaluate policy
V = policy_evaluation(test_policy, transition_probs, rewards, gamma=0.99)

print(f"State values (first 5): {V[:5]}")
print(f"Value shape: {V.shape}")
print(f"Max value: {np.max(V):.4f}")
print(f"Min value: {np.min(V):.4f}")

# Run automated test
policy_evaluation_test(policy_evaluation)

**Expected Output:**
```
Testing policy_evaluation...

State values (first 5): [...]
Value shape: (16,)
Max value: ...
Min value: ...
✓ All tests passed!
```

<a name='3-2'></a>
### 3.2 - Policy Improvement

Given a value function $V^\pi$, we can improve the policy by acting greedily:

$$\pi'(s) = \arg\max_a \sum_{s',r} p(s',r|s,a)[r + \gamma V^\pi(s')]$$

This greedy policy $\pi'$ is guaranteed to be at least as good as $\pi$, and strictly better unless $\pi$ is already optimal.
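
As a small illustration of the greedy step, the sketch below computes all action values at once and takes the best action per state. It uses a hypothetical 3-state chain; `make_chain_mdp` and `greedy_policy` are names assumed for this example only, and the graded exercise below asks for the explicit loop version instead.

```python
import numpy as np

def make_chain_mdp():
    """Hypothetical chain 0 - 1 - 2; state 2 is an absorbing goal."""
    P = np.zeros((3, 2, 3))
    R = np.zeros((3, 2, 3))
    P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 0] = P[1, 1, 2] = 1.0
    P[2, :, 2] = 1.0
    R[0, 0, 0] = R[0, 1, 1] = R[1, 0, 0] = -0.01
    R[1, 1, 2] = 1.0
    return P, R

def greedy_policy(V, P, R, gamma):
    """Return the greedy policy and the action values it was derived from."""
    # Q[s, a] = sum over s' of P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
    Q = np.sum(P * (R + gamma * V), axis=2)
    return np.argmax(Q, axis=1), Q

P_chain, R_chain = make_chain_mdp()
# Values of the policy "left in state 0, right in state 1" for gamma = 0.9.
V_old = np.array([-0.1, 1.0, 0.0])
new_policy, Q_chain = greedy_policy(V_old, P_chain, R_chain, gamma=0.9)
print(new_policy)  # [1 1 0]: state 0 switches from left to right
```

State 0 switches to "right" because $Q(0, \text{right}) = 0.89$ beats $Q(0, \text{left}) = -0.1$. In the goal state both actions tie, and `np.argmax` returns the first one.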

<a name='ex-2'></a>
### Exercise 2 - policy_improvement

Implement policy improvement. For each state, try all possible actions and select the one with the highest expected value.

**Instructions:**
- For each state, initialize action_values array
- For each action, compute expected value (similar to policy evaluation)
- Select action with maximum value using np.argmax()
- Return the improved policy

In [None]:
# GRADED FUNCTION: policy_improvement

def policy_improvement(V, transition_probs, rewards, gamma=0.99):
    """
    Improve a policy given state values by acting greedily.
    
    Arguments:
    V -- numpy array of shape (n_states,) containing state values
    transition_probs -- numpy array of shape (n_states, n_actions, n_states)
    rewards -- numpy array of shape (n_states, n_actions, n_states)
    gamma -- discount factor, scalar (default: 0.99)
    
    Returns:
    policy -- numpy array of shape (n_states,) containing improved policy
    """
    
    n_states = transition_probs.shape[0]
    n_actions = transition_probs.shape[1]
    policy = np.zeros(n_states, dtype=int)
    
    for s in range(n_states):
        # (approx. 6 lines)
        # Initialize action_values array
        # action_values = np.zeros(n_actions)
        # For each action:
        #     for a in range(n_actions):
        #         for s_next in range(n_states):
        #             action_values[a] += ...
        # Select best action: policy[s] = np.argmax(action_values)
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
    
    return policy

In [None]:
# Test policy_improvement
print("Testing policy_improvement...\n")

# Use values from previous evaluation
improved_policy = policy_improvement(V, transition_probs, rewards, gamma=0.99)

print(f"Improved policy (first 5 states): {improved_policy[:5]}")
print(f"Policy shape: {improved_policy.shape}")
print(f"Unique actions used: {np.unique(improved_policy)}")

# Run automated test
policy_improvement_test(policy_improvement)

**Expected Output:**
```
Testing policy_improvement...

Improved policy (first 5 states): [...]
Policy shape: (16,)
Unique actions used: [...]
✓ All tests passed!
```

<font color='blue'>
    
**What you should remember**:
- Policy Evaluation computes state values for a given policy using the Bellman expectation equation
- Policy Improvement creates a better policy by acting greedily with respect to the current value function
- These two steps alternate in Policy Iteration until convergence
- Convergence is guaranteed and the result is the optimal policy
</font>

<a name='3-3'></a>
### 3.3 - Running Policy Iteration

Now let's put it all together and run the complete Policy Iteration algorithm on a GridWorld environment!

In [None]:
# Create GridWorld environment
print("Creating GridWorld 4x4 environment...")
print("="*60)

transition_probs, rewards, n_states, n_actions = create_gridworld_mdp(
    grid_size=4,
    goal_reward=1.0,
    step_reward=-0.01
)

print(f"✓ States: {n_states}")
print(f"✓ Actions: {n_actions} (0=↑, 1=→, 2=↓, 3=←)")
print(f"✓ Start: State 0 (top-left)")
print(f"✓ Goal: State {n_states-1} (bottom-right)")

# Test environment
gridworld_environment_test(transition_probs, rewards, n_states, n_actions, expected_size=4)

In [None]:
# Run Policy Iteration
print("\n" + "="*60)
print("RUNNING POLICY ITERATION")
print("="*60 + "\n")

pi_solver = PolicyIteration(
    n_states=n_states,
    n_actions=n_actions,
    gamma=0.99,
    theta=1e-6
)

pi_results = pi_solver.solve(
    transition_probs=transition_probs,
    rewards=rewards,
    max_iterations=100
)

print(f"\n✓ Converged in {pi_results['iterations']} iterations")
print(f"✓ Average state value: {np.mean(pi_results['V']):.4f}")

In [None]:
# Visualize results
visualize_policy_and_values(
    pi_results['policy'],
    pi_results['V'],
    grid_size=4,
    title="(Policy Iteration)"
)

<a name='4'></a>
## 4 - Value Iteration

Value Iteration combines policy evaluation and improvement into a single update step. Instead of fully evaluating a policy, it performs one sweep of policy evaluation followed by policy improvement.

**Algorithm:**
```
Initialize V(s) = 0 for all s
Repeat until convergence:
    For each state s:
        V(s) ← max_a Σ_{s',r} p(s',r|s,a)[r + γV(s')]
```

**Key Differences from Policy Iteration:**
- Does not maintain explicit policy during iteration
- Makes simpler but more frequent updates  
- Each sweep is cheaper than a full policy evaluation, so it is often faster for large problems
- Policy is extracted only at the end
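
Here is a compact sketch of the whole loop on a hypothetical 3-state chain. The names `make_chain_mdp` and `value_iteration` are assumptions for this example and say nothing about how the `ValueIteration` class in `dp_utils` is written. The sketch uses synchronous sweeps.

```python
import numpy as np

def make_chain_mdp():
    """Hypothetical chain 0 - 1 - 2; state 2 is an absorbing goal."""
    P = np.zeros((3, 2, 3))
    R = np.zeros((3, 2, 3))
    P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 0] = P[1, 1, 2] = 1.0
    P[2, :, 2] = 1.0
    R[0, 0, 0] = R[0, 1, 1] = R[1, 0, 0] = -0.01
    R[1, 1, 2] = 1.0
    return P, R

def value_iteration(P, R, gamma=0.9, theta=1e-6, max_iterations=1000):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = np.zeros(P.shape[0])
    sweeps = 0
    for sweeps in range(1, max_iterations + 1):
        Q = np.sum(P * (R + gamma * V), axis=2)   # action values under current V
        V_new = np.max(Q, axis=1)                 # keep only the best action value
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:
            break
    # The policy is read off once, after the values have converged.
    policy = np.argmax(np.sum(P * (R + gamma * V), axis=2), axis=1)
    return V, policy, sweeps

P_chain, R_chain = make_chain_mdp()
V_star, pi_star, n_sweeps = value_iteration(P_chain, R_chain)
print(V_star, pi_star, n_sweeps)
```

Note that no policy appears inside the loop: the `max` over actions replaces the explicit improvement step of Policy Iteration.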

<a name='4-1'></a>
### 4.1 - Value Iteration Step

<a name='ex-3'></a>
### Exercise 3 - value_iteration_step

Implement one step of value iteration. For each state, compute the maximum value over all actions.

**Instructions:**
- For each state, compute value for each action
- Take the maximum over actions
- This is similar to policy_improvement but we update V instead of extracting policy

In [None]:
# GRADED FUNCTION: value_iteration_step

def value_iteration_step(V, transition_probs, rewards, gamma=0.99):
    """
    Perform one step of value iteration.
    
    Arguments:
    V -- numpy array of shape (n_states,) containing current state values
    transition_probs -- numpy array of shape (n_states, n_actions, n_states)
    rewards -- numpy array of shape (n_states, n_actions, n_states)
    gamma -- discount factor, scalar (default: 0.99)
    
    Returns:
    V_new -- numpy array of shape (n_states,) containing updated state values
    """
    
    n_states = transition_probs.shape[0]
    n_actions = transition_probs.shape[1]
    V_new = np.zeros(n_states)
    
    for s in range(n_states):
        # (approx. 6 lines)
        # Initialize action_values
        # action_values = np.zeros(n_actions)
        # For each action, compute expected value
        # for a in range(n_actions):
        #     for s_next in range(n_states):
        #         action_values[a] += ...
        # Take maximum: V_new[s] = np.max(action_values)
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
    
    return V_new

In [None]:
# Test value_iteration_step
print("Testing value_iteration_step...\n")

# Initialize V
V_test = np.zeros(n_states)

# Perform one step
V_new = value_iteration_step(V_test, transition_probs, rewards, gamma=0.99)

print(f"Updated values (first 5): {V_new[:5]}")
print(f"Max change: {np.max(np.abs(V_new - V_test)):.6f}")

# Run automated test
value_iteration_step_test(value_iteration_step)

**Expected Output:**
```
Testing value_iteration_step...

Updated values (first 5): [...]
Max change: ...
✓ All tests passed!
```

<a name='4-2'></a>
### 4.2 - Extract Policy from Values

After value iteration converges, we need to extract the optimal policy from the value function.

<a name='ex-4'></a>
### Exercise 4 - extract_policy

Extract the greedy policy from a value function. This is identical to policy improvement!

**Instructions:**
- For each state, compute value of each action
- Select action with maximum value
- This should look very similar to policy_improvement

In [None]:
# GRADED FUNCTION: extract_policy

def extract_policy(V, transition_probs, rewards, gamma=0.99):
    """
    Extract greedy policy from value function.
    
    Arguments:
    V -- numpy array of shape (n_states,) containing state values
    transition_probs -- numpy array of shape (n_states, n_actions, n_states)
    rewards -- numpy array of shape (n_states, n_actions, n_states)
    gamma -- discount factor, scalar (default: 0.99)
    
    Returns:
    policy -- numpy array of shape (n_states,) containing optimal policy
    """
    
    n_states = transition_probs.shape[0]
    n_actions = transition_probs.shape[1]
    policy = np.zeros(n_states, dtype=int)
    
    for s in range(n_states):
        # (approx. 5 lines)
        # This is identical to policy_improvement
        # action_values = np.zeros(n_actions)
        # for a in range(n_actions):
        #     ... compute action values ...
        # policy[s] = np.argmax(action_values)
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
    
    return policy

In [None]:
# Test extract_policy
print("Testing extract_policy...\n")

# Extract policy from test values
extracted_policy = extract_policy(V_new, transition_probs, rewards, gamma=0.99)

print(f"Extracted policy (first 5): {extracted_policy[:5]}")
print(f"Policy shape: {extracted_policy.shape}")

# Run automated test
extract_policy_test(extract_policy)

**Expected Output:**
```
Testing extract_policy...

Extracted policy (first 5): [...]
Policy shape: (16,)
✓ All tests passed!
```

<font color='blue'>
    
**What you should remember**:
- Value Iteration combines evaluation and improvement in one step
- It uses the Bellman optimality equation directly
- The policy is extracted only at the end using greedy action selection
- Generally more efficient than Policy Iteration for large state spaces
</font>

<a name='4-3'></a>
### 4.3 - Running Value Iteration

Now let's run the complete Value Iteration algorithm and compare it with Policy Iteration!

In [None]:
# Run Value Iteration
print("\n" + "="*60)
print("RUNNING VALUE ITERATION")
print("="*60 + "\n")

vi_solver = ValueIteration(
    n_states=n_states,
    n_actions=n_actions,
    gamma=0.99,
    theta=1e-6
)

vi_results = vi_solver.solve(
    transition_probs=transition_probs,
    rewards=rewards,
    max_iterations=1000,
    verbose=True
)

print(f"\n✓ Converged in {vi_results['iterations']} iterations")
print(f"✓ Average state value: {np.mean(vi_results['V']):.4f}")

In [None]:
# Visualize results
visualize_policy_and_values(
    vi_results['policy'],
    vi_results['V'],
    grid_size=4,
    title="(Value Iteration)"
)

<a name='5'></a>
## 5 - Algorithm Comparison

Now that we've implemented both algorithms, let's compare them!

<a name='ex-5'></a>
### Exercise 5 - compare_algorithms

Compare the results from Policy Iteration and Value Iteration. They should converge to the same optimal solution!

**Instructions:**
- Compute the maximum absolute difference in value functions
- Count how many states have different actions in the policies
- Print a summary comparison

In [None]:
# GRADED FUNCTION: compare_algorithms

def compare_algorithms(pi_results, vi_results):
    """
    Compare results from Policy Iteration and Value Iteration.
    
    Arguments:
    pi_results -- dictionary containing Policy Iteration results
    vi_results -- dictionary containing Value Iteration results
    
    Returns:
    comparison -- dictionary with comparison metrics
    """
    
    comparison = {}
    
    # (approx. 4 lines)
    # Compute maximum value difference
    # comparison['max_value_diff'] = np.max(np.abs(...)) 
    # Count policy differences
    # comparison['policy_diff_count'] = np.sum(...)
    # Add iteration counts
    # comparison['pi_iterations'] = ...
    # comparison['vi_iterations'] = ...
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return comparison

In [None]:
# Compare algorithms
print("="*70)
print("ALGORITHM COMPARISON")
print("="*70 + "\n")

comparison = compare_algorithms(pi_results, vi_results)

print(f"Policy Iteration:  {comparison['pi_iterations']} iterations")
print(f"Value Iteration:   {comparison['vi_iterations']} iterations")
print(f"\nMax value difference: {comparison['max_value_diff']:.8f}")
print(f"States with different actions: {comparison['policy_diff_count']}/{n_states}")

# Run automated test
compare_algorithms_test(pi_results, vi_results)

**Expected Output:**
```
ALGORITHM COMPARISON

Policy Iteration:  ... iterations
Value Iteration:   ... iterations

Max value difference: 0.00000...
States with different actions: 0/16 (or very few)
✓ All tests passed!
```

Let's create a detailed comparison table:

In [None]:
# Create comparison table
comparison_data = {
    'Metric': [
        'Iterations to Converge',
        'Average State Value',
        'Max State Value',
        'Min State Value'
    ],
    'Policy Iteration': [
        pi_results['iterations'],
        f"{np.mean(pi_results['V']):.6f}",
        f"{np.max(pi_results['V']):.6f}",
        f"{np.min(pi_results['V']):.6f}"
    ],
    'Value Iteration': [
        vi_results['iterations'],
        f"{np.mean(vi_results['V']):.6f}",
        f"{np.max(vi_results['V']):.6f}",
        f"{np.min(vi_results['V']):.6f}"
    ]
}

df = pd.DataFrame(comparison_data)
print("\n" + df.to_string(index=False))

<font color='blue'>
    
**What you should remember**:
- Both Policy Iteration and Value Iteration converge to the same optimal solution
- Policy Iteration typically requires fewer iterations but each iteration is more expensive
- Value Iteration requires more iterations but each iteration is faster
- For small problems, the difference may be minimal
- For large problems, Value Iteration is often preferred
</font>

<a name='6'></a>
## 6 - Experiments and Analysis

<a name='6-1'></a>
### 6.1 - Effect of Discount Factor

The discount factor $\gamma$ controls how much the agent values future rewards. Let's see how it affects the optimal policy!
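
Before running the experiment, a quick back-of-the-envelope sketch shows what to expect. It computes the return of a 6-step path (the shortest route from the top-left to the bottom-right of a 4x4 grid) for several values of $\gamma$. The reward convention (−0.01 per move, 1.0 for the move that reaches the goal) and the helper name `shortest_path_return` are assumptions made for this example; $1/(1-\gamma)$ is a common rule of thumb for the effective planning horizon.

```python
def shortest_path_return(n_steps, gamma, step_reward=-0.01, goal_reward=1.0):
    """Return of a path that pays step_reward on every move except the last,
    which pays goal_reward (an assumed reward convention)."""
    step_part = sum(gamma**t * step_reward for t in range(n_steps - 1))
    goal_part = gamma**(n_steps - 1) * goal_reward
    return step_part + goal_part

gammas_demo = [0.5, 0.9, 0.99, 0.999]
returns_demo = [shortest_path_return(6, g) for g in gammas_demo]

for g, ret in zip(gammas_demo, returns_demo):
    horizon = 1.0 / (1.0 - g)  # rule-of-thumb effective horizon
    print(f"gamma={g}: 6-step return={ret:.4f}, effective horizon ~ {horizon:.0f} steps")
```

With $\gamma = 0.5$ the goal reward is discounted by $0.5^5 \approx 0.03$ by the time it reaches the start state, so states far from the goal look almost worthless. With $\gamma$ close to 1 the same reward arrives nearly intact.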

In [None]:
print("="*70)
print("EXPERIMENT: Effect of Discount Factor (γ)")
print("="*70 + "\n")

gammas = [0.5, 0.9, 0.99, 0.999]
results_gamma = {}

for gamma in gammas:
    print(f"Testing γ = {gamma}...")
    solver = ValueIteration(
        n_states=n_states,
        n_actions=n_actions,
        gamma=gamma,
        theta=1e-6
    )
    result = solver.solve(transition_probs, rewards, verbose=False)
    results_gamma[gamma] = result
    print(f"  Iterations: {result['iterations']}, Avg Value: {np.mean(result['V']):.4f}\n")

In [None]:
# Visualize effect of gamma
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.flatten()

for idx, gamma in enumerate(gammas):
    value_grid = results_gamma[gamma]['V'].reshape(4, 4)
    im = axes[idx].imshow(value_grid, cmap='RdYlGn', interpolation='nearest')
    axes[idx].set_title(f'γ = {gamma} ({results_gamma[gamma]["iterations"]} iterations)',
                       fontsize=12, fontweight='bold')
    
    for i in range(4):
        for j in range(4):
            axes[idx].text(j, i, f'{value_grid[i, j]:.2f}',
                         ha='center', va='center', color='black', fontsize=9)
    
    plt.colorbar(im, ax=axes[idx])
    axes[idx].set_xticks(range(4))
    axes[idx].set_yticks(range(4))

plt.tight_layout()
plt.show()

<a name='6-2'></a>
### 6.2 - Scalability Analysis

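How do the algorithms scale with problem size? Keep in mind that the experiment below counts iterations, not running time. If one Policy Iteration step includes a full policy evaluation, as the comparison in Section 5 suggests, the two curves measure different amounts of work per iteration.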

In [None]:
print("="*70)
print("EXPERIMENT: Scalability with Grid Size")
print("="*70 + "\n")

grid_sizes = [3, 4, 5, 6]
results_sizes = {'PI': [], 'VI': []}

for size in grid_sizes:
    print(f"Grid {size}x{size} ({size*size} states):")
    
    trans, rew, n_s, n_a = create_gridworld_mdp(grid_size=size)
    
    # Policy Iteration
    pi = PolicyIteration(n_s, n_a, gamma=0.99, theta=1e-6)
    pi_res = pi.solve(trans, rew, max_iterations=100)
    results_sizes['PI'].append(pi_res['iterations'])
    
    # Value Iteration
    vi = ValueIteration(n_s, n_a, gamma=0.99, theta=1e-6)
    vi_res = vi.solve(trans, rew, verbose=False)
    results_sizes['VI'].append(vi_res['iterations'])
    
    print(f"  Policy Iteration: {pi_res['iterations']} iterations")
    print(f"  Value Iteration:  {vi_res['iterations']} iterations\n")

In [None]:
# Plot scalability
plt.figure(figsize=(10, 6))
x = [s*s for s in grid_sizes]
plt.plot(x, results_sizes['PI'], 'o-', linewidth=2, markersize=10, label='Policy Iteration')
plt.plot(x, results_sizes['VI'], 's-', linewidth=2, markersize=10, label='Value Iteration')
plt.xlabel('Number of States', fontsize=12)
plt.ylabel('Iterations to Convergence', fontsize=12)
plt.title('Algorithm Scalability', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

<a name='6-3'></a>
### 6.3 - GridWorld with Obstacles

Let's test our algorithms on a more challenging environment with obstacles!

In [None]:
def create_gridworld_with_obstacles(grid_size=5, obstacles=None):
    """
    Creates a GridWorld with obstacles.
    """
    if obstacles is None:
        obstacles = [(1, 2), (2, 2), (3, 2)]
    
    n_states = grid_size * grid_size
    n_actions = 4
    goal_state = n_states - 1
    
    transition_probs = np.zeros((n_states, n_actions, n_states))
    rewards = np.zeros((n_states, n_actions, n_states))
    
    obstacle_states = [r * grid_size + c for r, c in obstacles]
    
    def state_to_pos(state):
        return state // grid_size, state % grid_size
    
    def pos_to_state(row, col):
        return row * grid_size + col
    
    for s in range(n_states):
        if s == goal_state or s in obstacle_states:
            for a in range(n_actions):
                transition_probs[s, a, s] = 1.0
                rewards[s, a, s] = -1.0 if s in obstacle_states else 0.0
            continue
        
        row, col = state_to_pos(s)
        
        for a in range(n_actions):
            new_row, new_col = row, col
            
            if a == 0: new_row = max(0, row - 1)
            elif a == 1: new_col = min(grid_size - 1, col + 1)
            elif a == 2: new_row = min(grid_size - 1, row + 1)
            elif a == 3: new_col = max(0, col - 1)
            
            next_state = pos_to_state(new_row, new_col)
            
            if next_state in obstacle_states:
                next_state = s
                transition_probs[s, a, next_state] = 1.0
                rewards[s, a, next_state] = -0.1
            else:
                transition_probs[s, a, next_state] = 1.0
                rewards[s, a, next_state] = 1.0 if next_state == goal_state else -0.01
    
    return transition_probs, rewards, n_states, n_actions, obstacle_states

# Create and solve obstacle GridWorld
print("="*70)
print("GridWorld with Obstacles (5x5)")
print("="*70 + "\n")

trans_obs, rew_obs, n_s_obs, n_a_obs, obstacles = create_gridworld_with_obstacles(
    grid_size=5,
    obstacles=[(1, 2), (2, 2), (3, 2)]
)

print(f"Obstacles at states: {obstacles}\n")

vi_obs = ValueIteration(n_s_obs, n_a_obs, gamma=0.99, theta=1e-6)
result_obs = vi_obs.solve(trans_obs, rew_obs, verbose=True)

<a name='7'></a>
## 7 - Advanced Visualizations

<a name='7-1'></a>
### 7.1 - Q-Value Heatmaps

Let's visualize the Q-values for each action to understand how the agent evaluates different choices in each state.

In [None]:
def visualize_q_values(solver, transition_probs, rewards, grid_size=4):
    """
    Visualizes Q-values for each action in each state.
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 12))
    action_names = ['Up (↑)', 'Right (→)', 'Down (↓)', 'Left (←)']
    
    for action in range(4):
        ax = axes[action // 2, action % 2]
        
        q_values = np.zeros(grid_size * grid_size)
        for state in range(grid_size * grid_size):
            q_values[state] = solver.get_q_value(state, action, transition_probs, rewards)
        
        q_grid = q_values.reshape(grid_size, grid_size)
        im = ax.imshow(q_grid, cmap='coolwarm', interpolation='nearest')
        ax.set_title(f'Q-Values: {action_names[action]}', fontsize=12, fontweight='bold')
        
        for i in range(grid_size):
            for j in range(grid_size):
                ax.text(j, i, f'{q_grid[i, j]:.3f}',
                       ha='center', va='center', color='white', fontsize=9)
        
        plt.colorbar(im, ax=ax)
        ax.set_xticks(range(grid_size))
        ax.set_yticks(range(grid_size))
    
    plt.tight_layout()
    plt.show()

print("Visualizing Q-Values for each action...\n")
visualize_q_values(vi_solver, transition_probs, rewards, grid_size=4)

<a name='7-2'></a>
### 7.2 - Convergence Animation

Let's see how the value function evolves over successive Value Iteration sweeps, shown as a series of static snapshots!

In [None]:
def show_convergence_snapshots():
    """
    Shows snapshots of value function at different iterations.
    """
    print("Running Value Iteration with snapshots...\n")
    
    solver = ValueIteration(n_states=16, n_actions=4, gamma=0.99, theta=1e-6)
    snapshot_iters = [1, 3, 5, 10, 20, 50]
    snapshots = []
    
    # Count sweeps from 1 so that the snapshot labelled k shows V after exactly k updates
    for iteration in range(1, max(snapshot_iters) + 1):
        solver.value_update(transition_probs, rewards)
        if iteration in snapshot_iters:
            snapshots.append((iteration, solver.V.copy()))
    
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    
    for idx, (iter_num, V) in enumerate(snapshots):
        value_grid = V.reshape(4, 4)
        im = axes[idx].imshow(value_grid, cmap='RdYlGn',
                             interpolation='nearest', vmin=-0.1, vmax=1.0)
        axes[idx].set_title(f'Iteration {iter_num}', fontsize=12, fontweight='bold')
        
        for i in range(4):
            for j in range(4):
                axes[idx].text(j, i, f'{value_grid[i, j]:.2f}',
                             ha='center', va='center', color='black', fontsize=9)
        
        axes[idx].set_xticks(range(4))
        axes[idx].set_yticks(range(4))
    
    plt.tight_layout()
    plt.show()

show_convergence_snapshots()

<a name='7-3'></a>
### 7.3 - Optimal Trajectories

Finally, let's visualize the optimal path the agent takes from different starting positions!
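
To show what following a policy means in code, here is a minimal rollout sketch. It assumes deterministic transitions (it always moves to the most likely next state) and uses a hypothetical 3-state chain. `greedy_rollout` and `make_chain_mdp` are names invented for this example and may differ from how `simulate_trajectory` in `dp_utils` works.

```python
import numpy as np

def make_chain_mdp():
    """Hypothetical chain 0 - 1 - 2; state 2 is an absorbing goal."""
    P = np.zeros((3, 2, 3))
    P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 0] = P[1, 1, 2] = 1.0
    P[2, :, 2] = 1.0
    return P

def greedy_rollout(policy, P, start_state, goal_state, max_steps=100):
    """Follow the policy from start_state, recording every visited state.

    Assumes deterministic dynamics: the next state is the most likely one.
    max_steps guards against policies that never reach the goal.
    """
    trajectory = [start_state]
    state = start_state
    for _ in range(max_steps):
        if state == goal_state:
            break
        state = int(np.argmax(P[state, policy[state]]))
        trajectory.append(state)
    return trajectory

P_chain = make_chain_mdp()
path = greedy_rollout(np.array([1, 1, 0]), P_chain, start_state=0, goal_state=2)
print(path)  # [0, 1, 2]

# A poor policy (always left) never reaches the goal; the step limit stops it.
stuck = greedy_rollout(np.array([0, 0, 0]), P_chain, 0, 2, max_steps=5)
print(stuck)  # [0, 0, 0, 0, 0, 0]
```

For stochastic dynamics you would sample the next state from `P[state, action]` instead of taking the `argmax`.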

In [None]:
# Simulate trajectories from different start states
print("Simulating optimal trajectories...\n")

start_states = [0, 4, 8]
for start in start_states:
    trajectory = simulate_trajectory(
        vi_results['policy'],
        start_state=start,
        goal_state=15,
        grid_size=4
    )
    print(f"Start State {start}: {trajectory} ({len(trajectory)-1} steps)")

# Visualize one trajectory
print("\nVisualizing trajectory from state 0...")
trajectory_example = simulate_trajectory(vi_results['policy'], 0, 15, 4)
visualize_trajectory(vi_results['policy'], trajectory_example, grid_size=4)

## Congratulations!

You've successfully completed the Dynamic Programming tutorial! Here's what you've accomplished:

✅ Implemented Policy Evaluation to compute state values  
✅ Implemented Policy Improvement for greedy policy updates  
✅ Built a complete Policy Iteration algorithm  
✅ Implemented Value Iteration for efficient optimization  
✅ Compared and analyzed different DP algorithms  
✅ Experimented with various parameters and environments  

<font color='blue'>
    
**Key Takeaways**:
- Dynamic Programming requires complete knowledge of environment dynamics
- Policy Iteration alternates between evaluation and improvement steps
- Value Iteration combines both steps using the Bellman optimality equation
- Both algorithms guarantee convergence to the optimal policy
- The discount factor γ controls the agent's preference for immediate vs future rewards
- DP forms the theoretical foundation for model-free RL methods
</font>

**Next Steps**:

Dynamic Programming is powerful but limited to environments where we know all the dynamics. In the next tutorials, you'll learn:

1. **Monte Carlo Methods** - Learning from experience without a model
2. **Temporal Difference Learning** - Combining DP and MC (TD, Q-Learning, SARSA)
3. **Function Approximation** - Handling large/continuous state spaces
4. **Deep Reinforcement Learning** - Using neural networks (DQN, Policy Gradients)

Great work! Keep exploring and experimenting! 🚀