# COGS 188 Lab 5 Part 1: Taxi Routing with Dynamic Programming

In this part of the lab, you will apply dynamic programming techniques to solve a taxi routing problem in a city grid using the Gymnasium "Taxi-v3" environment. You will implement the Policy Iteration and Value Iteration algorithms and compare their performance.


## Taxi-v3 Environment Details

Please read the following documentation to learn more about the Taxi-v3 environment: [https://gymnasium.farama.org/environments/toy_text/taxi/](https://gymnasium.farama.org/environments/toy_text/taxi/)

Make sure you understand the state space, action space, and reward structure of the environment before proceeding.
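
As a quick check on the state space, the sketch below reproduces the state encoding described in the Taxi-v3 documentation, where the 500 discrete states come from 25 taxi positions, 5 passenger locations (including "in the taxi"), and 4 destinations. The helper names `encode_state` and `decode_state` are hypothetical and written only for this illustration; they are not part of the Gymnasium API.

```python
def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into a single integer in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination


def decode_state(state):
    """Invert encode_state by peeling off components from the right."""
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination


# Taxi at row 3, column 1; passenger at location 2; destination 0
example_state = encode_state(3, 1, 2, 0)
print(example_state, decode_state(example_state))
```

Decoding a few states by hand this way can make it easier to interpret the policy array later, since each index of that array is one of these encoded integers.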

## Problem Description
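Your task is to compute a policy that moves the taxi efficiently: drive to the passenger, pick them up, drive to the destination, and drop them off. The environment is a simplified 5x5 city grid with four designated pick-up and drop-off locations. Each step yields a reward of -1, a successful drop-off yields +20, and an illegal pick-up or drop-off yields -10.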

In [None]:
# Import the necessary packages
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image, display
import imageio

In [None]:
# Initialize the Taxi environment using Gymnasium
env = gym.make("Taxi-v3", render_mode='rgb_array')

# Print environment details
print("State space:", env.observation_space)
print("Action space:", env.action_space)

env.reset(seed=42)

### Dynamic Programming Algorithms for MDPs

Here is a brief overview of the dynamic programming algorithms you will implement:

#### Policy Iteration
1. **Initialize:** Initialize an arbitrary policy $\pi(s)$ and the value function $V(s) = 0$ for all states.
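2. **Policy Evaluation:** For the current policy $\pi$, solve for $V^{\pi}$ (for a deterministic policy, $\pi(a|s) = 1$ when $a = \pi(s)$ and $0$ otherwise):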
   $$V^{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma V^{\pi}(s')]$$
3. **Policy Improvement:** Update the policy:
   $$\pi(s) \leftarrow \arg \max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma V^{\pi}(s')]$$
4. **Repeat** steps 2-3 until the policy stabilizes.
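
The steps above can be sketched on a tiny hand-made MDP. Everything in this example is an assumption made for illustration: the three-state problem, the helper names, and the tolerance are not part of the lab or of Taxi-v3. The transition table only mimics the `(probability, next_state, reward, done)` tuple layout used by the Taxi-v3 transition table.

```python
# Toy MDP (hypothetical, not Taxi-v3): P[state][action] is a list of
# (probability, next_state, reward, done) tuples.
P = {
    0: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 1, -1.0, False)]},
    1: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 2, 20.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},  # absorbing goal
}
GAMMA = 0.9


def action_values(V, state):
    """Expected return of each action in `state` under the estimate V."""
    return [
        sum(p * (r + GAMMA * V[s2]) for p, s2, r, _ in P[state][a])
        for a in sorted(P[state])
    ]


def evaluate(policy, tol=1e-10):
    """Iterative policy evaluation: sweep until V stops changing."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in P:
            v_new = action_values(V, s)[policy[s]]
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def improve(V):
    """Greedy policy with respect to V (ties go to the lowest action index)."""
    greedy = []
    for s in P:
        q = action_values(V, s)
        greedy.append(q.index(max(q)))
    return greedy


policy = [0, 0, 0]  # arbitrary starting policy
for _ in range(100):  # safety cap on the number of improvement rounds
    V = evaluate(policy)
    new_policy = improve(V)
    if new_policy == policy:  # policy is stable
        break
    policy = new_policy

print(policy, [round(v, 2) for v in V])
```

On this toy problem the loop settles on action 1 in states 0 and 1, with values of 17 and 20. Note that this sketch runs evaluation to convergence before each improvement step, which is one common variant; a single evaluation sweep per round is another.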

#### Value Iteration
1. **Initialize:** Initialize the value function $V(s) = 0$ for all states.
2. **Update:** For each state $s$:
   $$V(s) \leftarrow \max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma V(s')]$$
3. **Repeat** step 2 until the value function stabilizes.
4. **Policy Extraction:** Extract the policy:
   $$\pi(s) = \arg \max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma V(s')]$$
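
As a rough illustration of the value iteration steps, the sketch below applies them to a tiny hand-made MDP. The three-state problem, the variable names, and the threshold are assumptions chosen for this example only; they are not part of the lab or of Taxi-v3, though the transition table uses the same `(probability, next_state, reward, done)` tuple layout.

```python
# Toy MDP (hypothetical, not Taxi-v3): TOY_P[state][action] is a list of
# (probability, next_state, reward, done) tuples.
TOY_P = {
    0: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 1, -1.0, False)]},
    1: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 2, 20.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},  # absorbing goal
}
TOY_GAMMA = 0.9
TOY_THRESHOLD = 1e-4


def backup(values, state):
    """One-step lookahead: expected return of each action in `state`."""
    return [
        sum(p * (r + TOY_GAMMA * values[s2]) for p, s2, r, _ in TOY_P[state][a])
        for a in sorted(TOY_P[state])
    ]


values = [0.0] * len(TOY_P)
for sweep in range(1000):
    delta = 0.0
    for s in TOY_P:
        best = max(backup(values, s))  # Bellman optimality update
        delta = max(delta, abs(best - values[s]))
        values[s] = best
    if delta < TOY_THRESHOLD:  # value function has stabilized
        break

# Policy extraction: act greedily with respect to the converged values
greedy_policy = []
for s in TOY_P:
    q = backup(values, s)
    greedy_policy.append(q.index(max(q)))

print(greedy_policy, [round(v, 2) for v in values])
```

Unlike policy iteration, no explicit policy is maintained during the sweeps; the policy is read off only once, after the values have stopped changing.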

Complete the function below to implement the Policy Iteration algorithm for the Taxi-v3 environment.

In [None]:
def policy_iteration(env, gamma=0.9, max_iterations=1000):
    """
    Perform policy iteration to find the optimal policy and value function for a given environment.

    Parameters:
    - env: The environment object representing the problem.
    - gamma: The discount factor for future rewards (default: 0.9).
    - max_iterations: The maximum number of iterations to perform (default: 1000).

    Returns:
    - policy: The optimal policy, represented as an array of actions for each state.
    - value_function: The value function corresponding to the optimal policy.

    """
    # Initialize the number of states and actions
    n_states = env.observation_space.n
    n_actions = env.action_space.n

    # Initialize the policy and value function
    policy = np.zeros(n_states, dtype=int)
    value_function = np.zeros(n_states)

    # Function to calculate the action-value function using one-step lookahead
    def one_step_lookahead(state, V):
        """
        Compute action-value function using one-step lookahead.

        Args:
            state (int): Current state.
            V (np.ndarray): Current value function.

        Returns:
            A (np.ndarray): Action-value array.
        """
        A = np.zeros(n_actions)
        for action in range(n_actions):
            for prob, next_state, reward, _ in env.unwrapped.P[state][action]:
                # TODO: Calculate the expected value of each action
                # The reward is the immediate reward plus the discounted future reward
                ...
        return A

    for i in range(max_iterations):
        stable_policy = True
        for state in range(n_states):
            # TODO: Retrieve the current action from the policy
            

            # TODO: Compute the action-values using one-step lookahead
            

            # TODO: Select the best action (the one with the highest expected value)
            

            # TODO: Check if the current policy action is different from the best action
            # If so, set the stable_policy flag to False
            
            # TODO: Update the policy at the current state with the best action

        # Policy Evaluation
        for state in range(n_states):
            # TODO: Update the value function for each state based on the current policy
            ...

        # TODO: If the policy is stable, break out of the loop
        ...

    return policy, value_function


Complete the function below to implement the Value Iteration algorithm for the Taxi-v3 environment.

In [None]:
def value_iteration(env, gamma=0.9, max_iterations=1000, threshold=1e-4):
    """
    Perform value iteration to find the optimal policy and value function for a given environment.

    Parameters:
    - env: The environment object representing the problem.
    - gamma: The discount factor for future rewards (default: 0.9).
    - max_iterations: The maximum number of iterations to perform (default: 1000).
    - threshold: The convergence threshold for the value function (default: 1e-4).

    Returns:
    - policy: The optimal policy, represented as an array of actions for each state.
    - value_function: The value function corresponding to the optimal policy.
    """
    # Initialize the number of states and actions
    n_states = env.observation_space.n
    n_actions = env.action_space.n

    # Initialize the value function
    value_function = np.zeros(n_states)

    # Function to calculate the action-value function using one-step lookahead
    def one_step_lookahead(state, V):
        """
        Compute action-value function using one-step lookahead.

        Args:
            state (int): Current state.
            V (np.ndarray): Current value function.

        Returns:
            A (np.ndarray): Action-value array.
        """
        A = np.zeros(n_actions)
        for action in range(n_actions):
            # Each transition is a (probability, next state, reward, done) tuple
            for prob, next_state, reward, _ in env.unwrapped.P[state][action]:
                # TODO: Calculate the expected value of each action
                # Similar to policy iteration, the reward is the immediate reward plus the discounted future reward
                ...
        return A

    for i in range(max_iterations):
        delta = 0
        for state in range(n_states):
            # TODO: Compute the action-values using one-step lookahead

            # TODO: Update the value function for the state

            # TODO: Calculate the maximum change (delta) between the current value and the new values

        # TODO: Check for convergence: i.e., if the change in value is less than the threshold, break out of the loop

    # Extract the policy based on the optimal value function
    policy = np.zeros(n_states, dtype=int)
    for state in range(n_states):
        # TODO: Determine the best action for each state
        ...

    return policy, value_function


The function below runs a policy in the environment for `n_episodes` episodes and returns the average total reward per episode.

In [None]:
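# Function to evaluate a given policy
def evaluate_policy(env, policy, n_episodes=100):
    """Return the mean total (undiscounted) reward of `policy` over `n_episodes` episodes."""
    rewards = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False
        while not done:
            action = policy[state]
            state, reward, terminated, truncated, _ = env.step(action)
            # Stop on a successful drop-off (terminated) or at the 200-step time limit (truncated)
            done = terminated or truncated
            total_reward += reward
        rewards.append(total_reward)
    return np.mean(rewards)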


Run the cell below to evaluate both the Policy Iteration and Value Iteration algorithms on the Taxi-v3 environment.

In [None]:
# Run Policy Iteration
policy_pi, value_pi = policy_iteration(env)
print("Policy Iteration:")
print("Average reward:", evaluate_policy(env, policy_pi))

# Run Value Iteration
policy_vi, value_vi = value_iteration(env)
print("\nValue Iteration:")
print("Average reward:", evaluate_policy(env, policy_vi))

Lastly, we generate an animation of the taxi navigating the environment using the optimal policy found by the algorithms.

In [None]:
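def generate_animation(env, policy, filename='taxi_animation.gif', seed=42):
    frames = []
    state, _ = env.reset(seed=seed)
    done = False
    while not done:
        frame = env.render()
        frames.append(frame)
        action = policy[state]
        state, reward, terminated, truncated, _ = env.step(action)
        # Stop on a successful drop-off (terminated) or at the time limit (truncated)
        done = terminated or truncated

    # env is not closed here because the same environment is reused for the next animation
    imageio.mimsave(filename, frames, fps=5)
    return filename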

In [None]:
# Generate the animations for both policies
animation_pi = generate_animation(env, policy_pi, 'policy_iteration.gif')
animation_vi = generate_animation(env, policy_vi, 'value_iteration.gif')

Display the generated GIFs:

In [None]:
display(Image(filename=animation_pi))

In [None]:
display(Image(filename=animation_vi))

## Submission Instructions

Please submit the completed notebook (after running all cells) as a .ipynb file to Gradescope, along with the two .gif files generated at the end of the notebook.