# **Frozen Lake Environment - Policy Iteration vs. Value Iteration**

### **Objective**
Learn the optimal policy for the Frozen Lake environment using **Policy Iteration** and **Value Iteration**, and compare their performance.

### **Frozen Lake Environment**
We use OpenAI Gym's Frozen Lake environment:  
🔗 [Frozen Lake - Gym Documentation](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)  

---

## **1. Policy Iteration**
### **Parameters:**
- **Policy**: 2D array of shape (nS, nA), each cell represents the probability of taking action *a* in state *s*.
- **Environment**: Initialized OpenAI Gym environment.
- **Discount Factor** (*γ*): Factor for future rewards.
- **Theta**: Convergence threshold for value function updates.
- **Max Iterations**: Maximum number of iterations before stopping.


---

## **2. Value Iteration**
### **Parameters:**
- **Environment**: Initialized OpenAI Gym environment.
- **Discount Factor** (*γ*): Factor for future rewards.
- **Theta**: Convergence threshold for value function updates.
- **Max Iterations**: Maximum number of iterations before stopping.

###  c.  Compare the number of wins, and average return after 1000 episodes and comment on which method performed                   better.

In [1]:
import gymnasium as gym
import numpy as np
from gymnasium.envs.toy_text.frozen_lake import FrozenLake

In [2]:
env = gym.make('FrozenLake-v1',map_name="4x4", is_slippery=True)
print('observation space: ',env.observation_space)
print('action space: ',env.action_space)
num_of_states = env.observation_space.n
num_of_actions = env.action_space.n
print("Number of actions: ",num_of_actions)
print("Number of states: ",num_of_states)


observation space:  Discrete(16)
action space:  Discrete(4)
Number of actions:  4
Number of states:  16


In [3]:
#create a initial policy with 16*4 board with 0.25 probability for each
initial_policy = np.ones((num_of_states,num_of_actions))/num_of_actions
#discount_factor
gamma = 0.99
#threshold to compare between diff iterations the value function value
threshold = 1e-8
#maximum iterations to do to find the optimal policy
max_iterations = 1000

In [4]:
initial_policy

array([[0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25]])

### Policy Iteration function

In [14]:
def policy_iteration(policy,env,discount_factor=0.99, theta=1e-8,max_iterations=1000,num_of_actions=4,num_of_states=16):
    #performing policy iteration using Dynamic Programming(refer book and slides for understanding)
    """
    Performs Policy Iteration to find an optimal policy for a given environment.

    Args:
        policy: Initial policy (probabilities of actions in each state)
        env: OpenAI Gym environment
        discount_factor: Discount rate for future rewards
        theta: Convergence threshold
        max_iterations: Max number of iterations
        num_of_states: the number of states default 16(4x4map)
        num_of_actions: the number of actions default 4(up,down,left,right)

    Returns:
        value_function: Optimal state-value function
        policy: Optimized policy
    """
    if hasattr(env, 'unwrapped'):
        env = env.unwrapped  # Get the raw environment if wrapped
    
    #initialize the value function values for each state i.e 16 here with inital value of 0
    value_functions = np.zeros(num_of_states)
    
    #loop through max iterations
    #policy evaluation-------------
    for i in range(max_iterations):
        #Policy Evaluation: updation of each state value function with the help of the current policy and bellman equation
        #doing this till the theta i.e(the difference between the value function values between current and previous is same or below theta)
        while True:
            delta = 0 #tracking the values
            for state in range(num_of_states):
                #storing the value function for delta check
                old_value_function = value_functions[state]
                #init the new value function with 0
                new_value_funtion = 0
                for action ,action_prob in enumerate(policy[state]):
                    #getting the action and the action probability for the current state from the policy
                    for prob, next_state,reward,done in env.P[state][action]:
                        #loop through all the transistions to appl the bellman equation to find the new value of the value function of the state
                        #applying the bellman equation formula
                        new_value_funtion+=action_prob*prob*(reward + discount_factor*value_functions[next_state])
                    
                #update the value funtion
                value_functions[state] = new_value_funtion
                #find delta
                delta = max(delta,abs(old_value_function-new_value_funtion))
            
            if delta<theta:
                #once we get the delta values for entire all state update check for delta and if change compared to previous is small then we stop
                break
            #so stop the policy evaluation step and move to policy iteration
    
    #policy improvement:updating the current policy based on the new value functions
    #we say that the policy is stable if the new policy and the current policy are same then stop and return
    policy_stable = True
    for state in range(num_of_states):
        #we are finding the state action value for each state and find the best action for each state, then compare it with the current policy if the actions are same then policy is stable
        old_action = np.argmax(policy[state])#find the best action accoring to the current policy
        
        #define state action value functions all to zeros
        action_values = np.zeros(num_of_actions)
        #Loop through each action for the state and find the action value function according to the bellman equation
        for action in range(num_of_actions):
            for prob, next_state, reward, done in env.P[state][action]:
                #finding the action value on applying bellman equation
                action_values[action] += prob * (reward + discount_factor * value_functions[next_state])
            
            #best action for the state identify based on the action values
            best_action = np.argmax(action_values)
            new_policy = np.eye(num_of_actions)[best_action]  # Update policy for best action
            
            if not np.array_equal(new_policy, policy[state]):  # Check if policy changed
                policy_stable = False#if policy changed then not stable so again policy iteration
            policy[state] = new_policy  # Apply new policy

        if policy_stable:  # Stop if policy is stable
            break#if stable then finish of value iteration
    
    #if finish of iterations or policy eval and iteration gave stable policy finish and return the value functions and the optimal policy
    return value_functions,policy

### Value Iteration Functions

In [19]:
def value_iteration(env,discount_factor=0.99,theta = 1e-8,max_iterations=1000,num_of_states=16,num_of_actions=4):
    #performing value iterations using Dynamic Programming
    """
    Performs Value Iteration to find an optimal policy.

    Args:
        env: OpenAI Gym environment
        discount_factor: Discount rate for future rewards
        theta: Convergence threshold
        max_iterations: Max number of iterations
        num_of_states: the number of states default 16(4x4map)
        num_of_actions: the number of actions default 4(up,down,left,right)

    Returns:
        value_function: Optimal state-value function
        policy: Optimal policy
    """
    
    if hasattr(env, 'unwrapped'):
        env = env.unwrapped  # Get raw environment
        
    #initialize the value function values for each state i.e 16 here with inital value of 0
    value_functions = np.zeros(num_of_states)
    
    for i in range(max_iterations):
        delta = 0 #to track change
        for state in range(num_of_states):
            #storing the value function for delta check
            old_value_function = value_functions[state]
            #for each state in value function instead of value evaluation and improvement we find the action value functions for each and update the state value function with max among it  
            action_values = np.zeros(num_of_actions)
            
            #finding action value function value using bellman equation
            for action in range(num_of_actions):
                for prob, next_state, reward, done in env.P[state][action]:
                    action_values[action] += prob * (reward + discount_factor * value_functions[next_state])
            value_functions[state] = np.max(action_values)  # Update state value with maximum of action value
            delta = max(delta, abs(old_value_function - value_functions[state]))  # Check for convergence
        if delta < theta:  # Stop if change is small
            break
    
    # Derive policy from optimal value function
    policy = np.zeros((num_of_states,num_of_actions))/num_of_actions
    for state in range(num_of_states):
        action_values = np.zeros(num_of_actions)  # Compute action values
        for action in range(num_of_actions):
            for prob, next_state, reward, done in env.P[state][action]:
                action_values[action] += prob * (reward + discount_factor * value_functions[next_state])
        best_action = np.argmax(action_values)  # Select best action
        policy[state] = np.eye(num_of_actions)[best_action]  # One-hot encode policy

    return value_functions, policy

### Evaluation of the policy obtained on the environment and finding the win ratio

In [20]:
def evaluate_policy(env, policy, num_episodes=1000):
    """
    Evaluates a given policy by running multiple episodes and calculating win rate and average return.

    Args:
        env: OpenAI Gym environment
        policy: Policy to evaluate
        num_episodes: Number of episodes to simulate

    Returns:
        wins: Number of successful episodes (goal reached)
        avg_return: Average return per episode
    """
    wins = 0
    total_return = 0
    for i in range(num_episodes):
        state, _ = env.reset()  # Reset environment
        terminated = truncated = False
        episode_return = 0
        while not (terminated or truncated):  # Run until episode ends
            #action choosing from the policy based on probability
            action = np.random.choice(np.arange(env.action_space.n), p=policy[state])  # Choose action from policy
            state, reward, terminated, truncated, _ = env.step(action)  # Take action
            episode_return += reward  # Accumulate reward
        if reward > 0:  # Check if goal was reached
            wins += 1
        total_return += episode_return  # Track total return
    return wins, total_return / num_episodes  # Return win count and average return

### Running both approches and comparing

In [21]:
import time
start_time = time.time()
V_policy, policy_policy = policy_iteration(initial_policy.copy(), env, gamma, threshold, max_iterations,num_of_actions,num_of_states)
policy_time = time.time() - start_time
wins_policy, avg_return_policy = evaluate_policy(env, policy_policy, num_episodes=1000)

print("Policy Iteration Results:")
print(f"Wins: {wins_policy}/1000 episodes")
print(f"Average Return: {avg_return_policy:.3f}")
print(f"Time Taken: {policy_time:.6f} seconds")

start_time = time.time()
V_value, policy_value = value_iteration(env, gamma, threshold, max_iterations,num_of_states,num_of_actions)
value_time = time.time() - start_time
wins_value, avg_return_value = evaluate_policy(env, policy_value, num_episodes=1000)

print("\nValue Iteration Results:")
print(f"Wins: {wins_value}/1000 episodes")
print(f"Average Return: {avg_return_value:.3f}")
print(f"Time Taken: {value_time:.6f} seconds")

Policy Iteration Results:
Wins: 704/1000 episodes
Average Return: 0.704
Time Taken: 0.053150 seconds

Value Iteration Results:
Wins: 742/1000 episodes
Average Return: 0.742
Time Taken: 0.015628 seconds


### **Conclusion**
- **Value Iteration** performed better with **more wins (742 vs. 704)** and **higher average return**.
- **Value Iteration was faster**, converging in **less time** than Policy Iteration.

🔹 **Final Verdict**: **Value Iteration** is more efficient in this scenario.