# Policy Iteration - Interactive Exercise

Welcome! In this notebook, you will implement **Policy Iteration**, one of the fundamental algorithms in Dynamic Programming for Reinforcement Learning.

## What is Policy Iteration?

Policy Iteration is an algorithm that finds the optimal policy by alternating between two steps:
1. **Policy Evaluation**: Compute the value function V^π for the current policy π
2. **Policy Improvement**: Update the policy to be greedy with respect to V^π

Unlike Value Iteration, which folds one truncated evaluation sweep and the greedy improvement into a single Bellman optimality backup, Policy Iteration evaluates each policy to convergence before improving it.

## Key Differences from Value Iteration

| Aspect | Value Iteration | Policy Iteration |
|--------|----------------|------------------|
| Update | Bellman Optimality | Bellman Expectation + Greedy |
| Convergence | Value function | Policy (can converge in fewer iterations) |
| Per Iteration | Faster (1 sweep) | Slower (multiple sweeps for evaluation) |
| Total Iterations | More iterations | Fewer iterations |

## Learning Objectives

By the end of this notebook, you will:
- Understand the two-phase structure of Policy Iteration
- Implement policy evaluation using the Bellman Expectation equation
- Implement policy improvement
- Combine both to create the complete Policy Iteration algorithm
- Compare Policy Iteration with Value Iteration

In [None]:
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
from policy_iteration_tests import *

## The Environment: FrozenLake

We'll use the same FrozenLake environment that we used for Value Iteration, which makes the two algorithms easy to compare.

```
SFFF       (S: starting point, safe)
FHFH       (F: frozen surface, safe)
FFFH       (H: hole, fall to your doom)
HFFG       (G: goal, where the frisbee is located)
```
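
The map above is flattened into integer states in row-major order, the same convention as `state = i * 4 + j` in the visualization cell later in this notebook. The sketch below is only an illustration of that mapping; `GRID`, `state_to_cell`, `n_cells`, `holes`, and `goal` are names introduced here, not part of the environment's API.

```python
# Illustrative mapping between flat state indices and the 4x4 map above.
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(GRID[0])


def state_to_cell(state):
    """Return (row, col, tile) for a flat, row-major state index."""
    row, col = divmod(state, N_COLS)
    return row, col, GRID[row][col]


n_cells = len(GRID) * N_COLS
holes = [s for s in range(n_cells) if state_to_cell(s)[2] == "H"]
goal = next(s for s in range(n_cells) if state_to_cell(s)[2] == "G")
print("holes:", holes, "goal:", goal)  # holes: [5, 7, 11, 12] goal: 15
```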

In [None]:
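# Create environment (unwrapped so the transition model env.P is directly accessible;
# recent Gymnasium versions no longer forward attributes through wrappers)
env = gym.make('FrozenLake-v1', map_name="4x4", is_slippery=True).unwrapped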
n_states = env.observation_space.n
n_actions = env.action_space.n

print(f"Number of states: {n_states}")
print(f"Number of actions: {n_actions}")
print(f"\nAction meanings: 0=Left, 1=Down, 2=Right, 3=Up")

## Exercise 1: Initialize Policy

In Policy Iteration, we start with an initial policy. A common choice is a **uniform random policy** that selects each action with equal probability.

**Task**: Initialize a stochastic policy where each action has equal probability.

**Policy Representation**: 
- Shape: (n_states, n_actions)
- policy[s, a] = probability of taking action a in state s
- For uniform random: policy[s, a] = 1/n_actions for all s, a
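
To make the representation concrete, here is a small sketch with a hand-written, non-uniform policy for a hypothetical problem with 3 states and 2 actions (the numbers are made up). It shows the row-sum check, sampling an action from one row, and reading off the most likely action per state.

```python
import numpy as np

# Hypothetical stochastic policy: 3 states (rows) x 2 actions (columns).
example_policy = np.array([
    [0.5, 0.5],
    [0.9, 0.1],
    [0.0, 1.0],
])

# Every row must be a valid probability distribution over actions.
assert np.allclose(example_policy.sum(axis=1), 1.0)

# Sample an action for state 1 according to its row of probabilities.
rng = np.random.default_rng(seed=0)
sampled_action = rng.choice(example_policy.shape[1], p=example_policy[1])

# Most likely action in each state (ties resolve to the lowest index).
most_likely = example_policy.argmax(axis=1)
print(most_likely)  # [0 0 1]
```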

In [None]:
# GRADED FUNCTION: initialize_policy

def initialize_policy(n_states, n_actions):
    """
    Initialize a uniform random policy.
    
    Arguments:
    n_states -- number of states
    n_actions -- number of actions
    
    Returns:
    policy -- numpy array of shape (n_states, n_actions) with uniform probabilities
    """
    # (approx. 1 line)
    # Create a matrix where each row sums to 1.0 (valid probability distribution)
    # Hint: Use np.ones() and divide by n_actions
    
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return policy

In [None]:
# Test your implementation
initialize_policy_test(initialize_policy)

## Exercise 2: Policy Evaluation Step

Policy evaluation computes V^π using the **Bellman Expectation Equation**:

$$V^{\pi}(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)[r + \gamma V^{\pi}(s')]$$

This differs from Value Iteration, which uses the Bellman **Optimality** Equation (a max over actions instead of an expectation under π).
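
As a worked example of the equation, the sketch below applies one Bellman expectation backup to a made-up MDP with 2 states and 2 actions. It is not FrozenLake and does not use `env.P`: the arrays `P` and `R` are assumptions introduced here, and `R[s, a]` stands for the expected immediate reward, which simplifies the joint distribution p(s', r | s, a) in the formula.

```python
import numpy as np

# Hypothetical toy MDP; all numbers are made up for illustration.
# P[s, a, s2] = probability of landing in s2 after taking a in s.
P = np.array([
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [0.9, 0.1]],
])
# R[s, a] = expected immediate reward for taking a in s.
R = np.array([[0.0, 0.0], [1.0, 0.0]])
gamma = 0.9

pi = np.array([[0.5, 0.5], [0.5, 0.5]])  # uniform random policy
V = np.array([1.0, 2.0])                 # arbitrary current estimate

# q[s, a] = R[s, a] + gamma * sum over s2 of P[s, a, s2] * V[s2]
q = R + gamma * (P @ V)
# Bellman expectation backup: weight q by the policy's action probabilities.
V_new = (pi * q).sum(axis=1)
print(V_new)
```

For state 0 this gives 0.5 * 0.9 + 0.5 * 1.62 = 1.26, and for state 1 it gives 0.5 * 2.8 + 0.5 * 0.99 = 1.895.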

**Task**: Implement one iteration of policy evaluation for a single state.

In [None]:
# GRADED FUNCTION: policy_evaluation_step

def policy_evaluation_step(env, V, policy, state, gamma=0.99):
    """
    Perform one step of policy evaluation for a single state.
    
    Arguments:
    env -- Gymnasium environment exposing the transition model env.P
    V -- current value function, numpy array of shape (n_states,)
    policy -- current policy, numpy array of shape (n_states, n_actions)
    state -- state to evaluate
    gamma -- discount factor
    
    Returns:
    new_value -- updated value for the state
    """
    # (approx. 8-10 lines)
    # 1. Initialize new_value = 0
    # 2. For each action:
    #    a. Get action probability: policy[state, action]
    #    b. Get transitions: env.P[state][action]
    #    c. For each (prob, next_state, reward, done):
    #       - Compute: prob * (reward + gamma * V[next_state])
    #       - Add to action_value
    #    d. Add: policy[state, action] * action_value to new_value
    
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return new_value

In [None]:
# Test your implementation
policy_evaluation_step_test(policy_evaluation_step)

## Exercise 3: Full Policy Evaluation

Now we need to evaluate the policy until convergence. We sweep through all states repeatedly until the value function stops changing significantly.

**Convergence Criterion**: Stop when $\max_s |V_{\text{new}}(s) - V_{\text{old}}(s)| < \theta$
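
To see the stopping rule in action, the following sketch evaluates a uniform policy on a made-up 2-state MDP (the arrays `P` and `R` are assumptions, not FrozenLake) and compares the result with the exact solution of the linear system V = r_π + γ P_π V. Note that it uses synchronous sweeps (all states updated from the previous V), whereas the exercise outline updates V in place; both variants converge to the same V^π.

```python
import numpy as np

# Hypothetical toy MDP; all numbers are made up for illustration.
P = np.array([
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [0.9, 0.1]],
])
R = np.array([[0.0, 0.0], [1.0, 0.0]])
gamma, theta = 0.9, 1e-8
pi = np.array([[0.5, 0.5], [0.5, 0.5]])

# Collapse the MDP under pi into a Markov reward process.
P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state transition matrix
r_pi = (pi * R).sum(axis=1)            # expected one-step reward per state

V_iter = np.zeros(2)
sweeps = 0
while True:
    V_next = r_pi + gamma * (P_pi @ V_iter)    # one synchronous sweep
    delta = np.max(np.abs(V_next - V_iter))    # largest change in any state
    V_iter = V_next
    sweeps += 1
    if delta < theta:
        break

# Exact answer for comparison: solve (I - gamma * P_pi) V = r_pi.
V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(sweeps, V_iter, V_exact)
```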

**Task**: Implement full policy evaluation with convergence check.

In [None]:
# GRADED FUNCTION: policy_evaluation

def policy_evaluation(env, policy, gamma=0.99, theta=1e-8, max_iterations=1000):
    """
    Evaluate a policy until convergence.
    
    Arguments:
    env -- Gymnasium environment exposing the transition model env.P
    policy -- policy to evaluate, numpy array of shape (n_states, n_actions)
    gamma -- discount factor
    theta -- convergence threshold
    max_iterations -- maximum number of iterations
    
    Returns:
    V -- value function for the policy, numpy array of shape (n_states,)
    iterations -- number of iterations until convergence
    """
    # (approx. 12-15 lines)
    # 1. Initialize V = zeros
    # 2. For iteration in range(max_iterations):
    #    a. delta = 0
    #    b. For each state:
    #       - old_value = V[state]
    #       - new_value = policy_evaluation_step(...)
    #       - V[state] = new_value
    #       - delta = max(delta, abs(old_value - new_value))
    #    c. If delta < theta: break
    # 3. Return V and number of iterations
    
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return V, iterations

In [None]:
# Test your implementation
policy_evaluation_test(policy_evaluation)

## Exercise 4: Policy Improvement

After evaluating the current policy, we improve it by making it **greedy** with respect to the value function:

$$\pi'(s) = \arg\max_a \sum_{s',r} p(s',r|s,a)[r + \gamma V^{\pi}(s')]$$

This is similar to extracting the policy in Value Iteration, but here the greedy step uses V^π (the value of the current policy) instead of V*.
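
The greedy step can be illustrated on a made-up 2-state MDP in array form. This is a sketch under stated assumptions: `P`, `R`, and the value vector `V_example` are invented for the example and stand in for the environment's transition model and an evaluated V^π.

```python
import numpy as np

# Hypothetical toy MDP; all numbers are made up for illustration.
P = np.array([
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [0.9, 0.1]],
])
R = np.array([[0.0, 0.0], [1.0, 0.0]])
gamma = 0.9
V_example = np.array([8.0, 10.0])  # pretend this came from policy evaluation

# Action values under V_example: Q[s, a] = R[s, a] + gamma * sum_s2 P * V.
Q = R + gamma * (P @ V_example)
best_actions = Q.argmax(axis=1)

# One-hot encode the greedy actions to get a deterministic policy matrix.
greedy_policy = np.eye(Q.shape[1])[best_actions]
print(best_actions)  # [1 0]
```

Here Q is [[7.2, 8.64], [10.0, 7.38]], so the greedy policy picks action 1 in state 0 and action 0 in state 1.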

**Task**: Implement policy improvement that returns a deterministic greedy policy.

In [None]:
# GRADED FUNCTION: policy_improvement

def policy_improvement(env, V, gamma=0.99):
    """
    Improve policy by making it greedy with respect to V.
    
    Arguments:
    env -- Gymnasium environment exposing the transition model env.P
    V -- value function, numpy array of shape (n_states,)
    gamma -- discount factor
    
    Returns:
    new_policy -- improved policy, numpy array of shape (n_states, n_actions)
                  deterministic: new_policy[s, best_action] = 1.0, others = 0.0
    """
    # (approx. 12-15 lines)
    # 1. Get n_states and n_actions from env
    # 2. Initialize new_policy = zeros(n_states, n_actions)
    # 3. For each state:
    #    a. Initialize action_values = zeros(n_actions)
    #    b. For each action:
    #       - Get transitions: env.P[state][action]
    #       - For each (prob, next_state, reward, done):
    #         * Compute: prob * (reward + gamma * V[next_state])
    #       - Store in action_values[action]
    #    c. Find best_action = argmax(action_values)
    #    d. Set new_policy[state, best_action] = 1.0
    
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return new_policy

In [None]:
# Test your implementation
policy_improvement_test(policy_improvement)

## Exercise 5: Complete Policy Iteration

Now let's combine policy evaluation and policy improvement into the complete Policy Iteration algorithm!

**Algorithm**:
1. Initialize policy (e.g., uniform random)
2. Repeat:
   - **Policy Evaluation**: Compute V^π
   - **Policy Improvement**: Make policy greedy w.r.t. V^π
   - **Check**: If policy didn't change, we've found the optimal policy
3. Return optimal policy and value function
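
The loop above can be sketched end to end on a made-up 2-state MDP. This is not the graded solution: it works on the assumed arrays `P` and `R` rather than `env.P`, and it swaps the iterative evaluation sweeps for an exact linear solve, which is only practical for small state spaces. The helper names `evaluate` and `improve` are hypothetical.

```python
import numpy as np

# Hypothetical toy MDP; all numbers are made up for illustration.
P = np.array([
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [0.9, 0.1]],
])
R = np.array([[0.0, 0.0], [1.0, 0.0]])
gamma = 0.9


def evaluate(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = (pi * R).sum(axis=1)
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)


def improve(V):
    """Return the one-hot greedy policy with respect to V."""
    Q = R + gamma * (P @ V)
    return np.eye(Q.shape[1])[Q.argmax(axis=1)]


pi = np.array([[0.5, 0.5], [0.5, 0.5]])  # start from a uniform policy
for n_rounds in range(1, 101):
    V = evaluate(pi)                 # policy evaluation
    new_pi = improve(V)              # policy improvement
    if np.array_equal(new_pi, pi):   # policy stable: stop
        break
    pi = new_pi

print(n_rounds, pi.argmax(axis=1), V)
```

On this toy problem the stable policy moves out of state 0 and stays in state 1, where staying earns a reward of 1 per step and is therefore worth 1 / (1 - 0.9) = 10.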

**Task**: Implement the complete Policy Iteration algorithm.

In [None]:
# GRADED FUNCTION: policy_iteration

def policy_iteration(env, gamma=0.99, theta=1e-8, max_iterations=100):
    """
    Solve an MDP using Policy Iteration.
    
    Arguments:
    env -- Gymnasium environment exposing the transition model env.P
    gamma -- discount factor
    theta -- convergence threshold for policy evaluation
    max_iterations -- maximum number of policy iterations
    
    Returns:
    policy -- optimal policy, numpy array of shape (n_states, n_actions)
    V -- optimal value function, numpy array of shape (n_states,)
    iterations -- number of policy iterations
    """
    # (approx. 12-15 lines)
    # 1. Get n_states and n_actions
    # 2. Initialize policy using initialize_policy()
    # 3. For iteration in range(max_iterations):
    #    a. Evaluate current policy: V = policy_evaluation(...)
    #    b. Improve policy: new_policy = policy_improvement(...)
    #    c. Check if policy changed:
    #       - If np.array_equal(policy, new_policy): break (converged)
    #       - Else: policy = new_policy
    # 4. Return policy, V, and number of iterations
    
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return policy, V, iterations

In [None]:
# Test your implementation
policy_iteration_test(policy_iteration)

## Visualization: Compare with Value Iteration

Let's compare Policy Iteration with Value Iteration on the same environment.

In [None]:
# Run Policy Iteration
policy, V, pi_iterations = policy_iteration(env)

print(f"Policy Iteration converged in {pi_iterations} iterations")
print(f"\nOptimal Value Function:")
print(V.reshape(4, 4))

# Extract deterministic policy (action with highest probability)
action_map = {0: '←', 1: '↓', 2: '→', 3: '↑'}
deterministic_policy = np.argmax(policy, axis=1)

print(f"\nOptimal Policy:")
for i in range(4):
    for j in range(4):
        state = i * 4 + j
        print(action_map[deterministic_policy[state]], end=' ')
    print()

## Key Insights

**Policy Iteration Characteristics**:
- Typically converges in **fewer iterations** than Value Iteration
- Each iteration is **more expensive** (full policy evaluation)
- Terminates when the policy stops changing, which in a finite MDP happens after finitely many improvement steps
- Often preferred when the goal is the optimal policy itself rather than highly accurate values

**When to use Policy Iteration vs Value Iteration**:
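- **Policy Iteration**: When the state space is small enough that full policy evaluation is affordable and you want the exact optimal policy in few iterations
- **Value Iteration**: When each iteration must be cheap, for example when the state space is large and a full evaluation per iteration is too costly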

Both algorithms are guaranteed to converge to an optimal policy for finite MDPs with a discount factor γ < 1!
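
The trade-off can be checked on a made-up 2-state MDP. The sketch below is an illustration only: the arrays `P` and `R` are assumptions, Policy Iteration uses an exact linear solve for evaluation, and the iteration counts on such a tiny problem say nothing about the counts you will see on FrozenLake.

```python
import numpy as np

# Hypothetical toy MDP; all numbers are made up for illustration.
P = np.array([
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [0.9, 0.1]],
])
R = np.array([[0.0, 0.0], [1.0, 0.0]])
gamma, theta = 0.9, 1e-8
states = np.arange(2)


def greedy(V):
    """Greedy action per state with respect to V."""
    return (R + gamma * (P @ V)).argmax(axis=1)


# Value Iteration: repeated Bellman optimality backups.
V_vi, vi_sweeps = np.zeros(2), 0
while True:
    V_next = (R + gamma * (P @ V_vi)).max(axis=1)
    delta = np.max(np.abs(V_next - V_vi))
    V_vi, vi_sweeps = V_next, vi_sweeps + 1
    if delta < theta:
        break

# Policy Iteration: exact evaluation, then greedy improvement.
actions, pi_rounds = np.zeros(2, dtype=int), 0
while True:
    P_pi, r_pi = P[states, actions], R[states, actions]
    V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    pi_rounds += 1
    new_actions = greedy(V_pi)
    if np.array_equal(new_actions, actions):
        break
    actions = new_actions

print("VI sweeps:", vi_sweeps, "PI rounds:", pi_rounds)
print("same greedy policy:", np.array_equal(greedy(V_vi), actions))
```

Both methods end at the same greedy policy, with Policy Iteration needing far fewer (but individually more expensive) rounds on this example.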

## Congratulations!

You've successfully implemented Policy Iteration! You now understand:
- ✅ The two-phase structure of Policy Iteration
- ✅ Policy evaluation using Bellman Expectation
- ✅ Policy improvement using greedy action selection
- ✅ How to combine them for the complete algorithm
- ✅ Differences between Policy Iteration and Value Iteration

**Next Steps**: Try Policy Gradient methods like REINFORCE for large/continuous state spaces!