# Frozen Lake Solution

The Frozen Lake problem is a reinforcement learning task in which an agent navigates a grid that represents a frozen lake with holes. The goal is to reach a target tile without falling into a hole. Q-learning is applied to learn an optimal strategy: it iteratively updates a Q-table, which estimates the value of taking each action in each state. This table guides the agent toward decisions that maximize long-term reward. The agent learns through exploration (trying random actions) and exploitation (using the best-known action), gradually refining its policy until it can cross the lake reliably.

# Installation

In [4]:
!pip install "gymnasium[toy-text]" tqdm



# Coding

In [6]:
import numpy as np
import gymnasium as gym
from tqdm import tqdm

# epsilon_greedy_policy

The function `epsilon_greedy_policy` selects an action for a given state using an epsilon-greedy approach. It first reads the number of possible actions `num_actions` for the current state from the Q-table, which holds the estimated value of each action in each state. With probability `epsilon` it explores by choosing one of those actions uniformly at random; otherwise it exploits by choosing the action with the highest Q-value for that state.

In [7]:
def epsilon_greedy_policy(Q_table, state, epsilon):
    """
    Selects an action using epsilon-greedy policy

    Args:
    Q_table (numpy.ndarray): Q-value table
    state (int): Current state
    epsilon (float): Epsilon value for exploration-exploitation trade-off

    Returns:
    int: Selected action
    """
    # Determine the number of actions available in the current state
    num_actions = len(Q_table[state])

    # Decide whether to explore or exploit
    if np.random.uniform(0, 1) < epsilon:
        # Explore: choose a random action
        return np.random.choice(num_actions)
    else:
        # Exploit: choose the best action based on current Q-values
        return np.argmax(Q_table[state])
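
As a quick sanity check of the selection rule, the sketch below repeats the function in compact form and counts how often the greedy action is chosen. The Q-values, seed, and sample count are made-up illustration values, not taken from the training run. Because a random draw can also land on the greedy action, the expected greedy frequency with four actions is 1 - epsilon + epsilon / 4, which is 0.925 for epsilon = 0.1.

```python
import numpy as np

def epsilon_greedy_policy(Q_table, state, epsilon):
    # Compact copy of the function above so this block runs on its own
    if np.random.uniform(0, 1) < epsilon:
        return np.random.choice(len(Q_table[state]))
    return np.argmax(Q_table[state])

np.random.seed(0)  # arbitrary seed, only for repeatability

# Hypothetical one-state Q-table in which action 2 has the highest value
Q_demo = np.array([[0.1, 0.3, 0.9, 0.2]])
epsilon = 0.1
samples = 20000

picks = [epsilon_greedy_policy(Q_demo, 0, epsilon) for _ in range(samples)]
greedy_fraction = np.mean(np.array(picks) == 2)

# Greedy action is picked when exploiting, plus 1/num_actions of the explore draws
expected_fraction = 1 - epsilon + epsilon / Q_demo.shape[1]
print(f"observed {greedy_fraction:.3f}, expected {expected_fraction:.3f}")
```

Note that with an all-zero Q-table, `np.argmax` returns the first index, so the greedy choice is action 0 until some value becomes positive.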

# q_learning

The `q_learning` function implements the Q-learning algorithm for a given environment over a specified number of episodes. It initializes a Q-table of zeros with one row per state and one column per action, then updates it after every step toward the temporal difference target, balancing exploration and exploitation with the epsilon-greedy policy. The learning rate `alpha` controls how far each update moves the current estimate, and the discount factor `gamma` controls how much future value counts. A progress bar tracks training, and the greedy policy is evaluated every 1,000 episodes and after the final episode.
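
For reference, the update that `q_learning` applies after each step can be written in the standard Q-learning form, where $s$ and $a$ are the current state and action, $r$ is the reward, and $s'$ is the next state (symbol names chosen here for illustration):

```latex
\[
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
\]
```

The bracketed term is the temporal difference error: the gap between the target $r + \gamma \max_{a'} Q(s', a')$ and the current estimate $Q(s, a)$.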

In [8]:
def q_learning(environment, episodes, alpha=0.2, gamma=0.99, epsilon=0.1):
    Q_table = np.zeros((environment.observation_space.n, environment.action_space.n))
    pbar = tqdm(total=episodes, dynamic_ncols=True)

    for episode in range(episodes):
        state, _ = environment.reset()
        done = False
        episode_reward = 0

        while not done:
            action = epsilon_greedy_policy(Q_table, state, epsilon)
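            next_state, reward, terminated, truncated, _ = environment.step(action)
            # The episode ends on a terminal state or when the step limit truncates it
            done = terminated or truncated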
            next_action = np.argmax(Q_table[next_state, :])

            td_target = reward + gamma * Q_table[next_state, next_action]
            Q_table[state, action] += alpha * (td_target - Q_table[state, action])

            state = next_state
            episode_reward += reward

        pbar.update(1)

        # Evaluate policy less frequently to save computation
        if episode % 1000 == 0 or episode == episodes - 1:
            avg_reward = evaluate_policy(environment, Q_table, 100)
            pbar.set_description(f"Episode: {episode} - Average Reward: {avg_reward:.2f}")

    pbar.close()
    return Q_table
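
To make the update concrete, here is one step worked through with the default `alpha=0.2` and `gamma=0.99`. The Q-values and reward are made-up numbers for illustration only.

```python
alpha, gamma = 0.2, 0.99   # defaults of q_learning above

q_current = 0.5            # hypothetical Q_table[state, action]
q_next_best = 0.8          # hypothetical max over Q_table[next_state, :]
reward = 0.0               # a step that does not reach the goal

td_target = reward + gamma * q_next_best    # value the estimate is pulled toward
td_error = td_target - q_current            # gap between target and estimate
q_updated = q_current + alpha * td_error    # move a fraction alpha of the gap

print(f"td_target={td_target:.4f}, td_error={td_error:.4f}, q_updated={q_updated:.4f}")
```

The estimate moves 20% of the way from 0.5 toward the target 0.792, ending at 0.5584.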

# evaluate_policy

The `evaluate_policy` function assesses the effectiveness of a policy, derived from a Q-table, in a specified environment over a set number of episodes. It derives the policy by taking the action with the highest value in the Q-table for each state. For each episode, the function resets the environment, then repeatedly selects actions according to the policy and updates the state from the environment's response, accumulating rewards until the episode ends. It returns the average total reward per episode. In FrozenLake the only reward is 1 for reaching the goal, so this average is also the fraction of episodes that succeed.

In [9]:
def evaluate_policy(environment, Q_table, episodes):
    """
    Evaluate the performance of a given policy over a certain number of episodes

    Args:
    environment: The environment to test the policy in
    Q_table (numpy.ndarray): The Q-table representing the policy
    episodes (int): Number of episodes to run the evaluation for

    Returns:
    float: The average total reward per episode
    """
    total_reward = 0
    # Extract the policy from the Q-table
    optimal_policy = np.argmax(Q_table, axis=1)

    for _ in range(episodes):
        state, _ = environment.reset()
        done = False
        episode_reward = 0

        while not done:
            action = optimal_policy[state]
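            state, reward, terminated, truncated, _ = environment.step(action)
            # Stop on a terminal state or when the step limit truncates the episode
            done = terminated or truncated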
            episode_reward += reward

        total_reward += episode_reward

    # Calculate the average reward per episode
    average_reward = total_reward / episodes
    return average_reward
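
The greedy policy extracted with `np.argmax(Q_table, axis=1)` is easier to inspect when laid out on the grid. The sketch below assumes the default 4x4 map and FrozenLake's action encoding (0 = left, 1 = down, 2 = right, 3 = up). `policy_to_arrows` is a hypothetical helper, not part of the notebook, and the Q-table is random filler rather than a trained one.

```python
import numpy as np

# FrozenLake action encoding: 0 = left, 1 = down, 2 = right, 3 = up
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}

def policy_to_arrows(Q_table, n_rows=4, n_cols=4):
    """Return the greedy action of every state as rows of arrow characters."""
    greedy_actions = np.argmax(Q_table, axis=1)
    grid = greedy_actions.reshape(n_rows, n_cols)
    return ["".join(ARROWS[int(a)] for a in row) for row in grid]

rng = np.random.default_rng(0)
Q_demo = rng.random((16, 4))  # placeholder values, not a trained table

for line in policy_to_arrows(Q_demo):
    print(line)
```

When this is applied to a table trained by `q_learning`, the arrows on hole and goal tiles carry no meaning, since episodes end there and those rows of the Q-table are never updated.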

# demo_agent

The `demo_agent` function visually demonstrates the behavior of an agent in a specified environment using a policy derived from a Q-table. It first computes the greedy policy by selecting the action with the highest value in the Q-table for each state. For each requested episode, the function resets the environment and then repeatedly chooses actions according to that policy, updating the state from the environment's response. If `render` is set to `True`, the environment is rendered at each step, so the agent's moves can be followed on screen. The loop continues until the episode ends, showing how the agent navigates the environment with the learned policy.

In [10]:
def demo_agent(environment, Q_table, episodes=1, render=True):
    """
    Demonstrates the behavior of an agent in a given environment using the policy derived from the Q-table

    Args:
    environment: The environment in which to demonstrate the agent
    Q_table (numpy.ndarray): The Q-table used to derive the policy
    episodes (int): Number of episodes to demonstrate
    render (bool): If True, the environment will be rendered during the demonstration
    """
    optimal_policy = np.argmax(Q_table, axis=1)

    for episode in range(episodes):
        state, _ = environment.reset()
        done = False
        print("\nEpisode:", episode + 1)

        while not done:
            if render:
                environment.render()
            action = optimal_policy[state]
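            state, _, terminated, truncated, _ = environment.step(action)
            # Stop on a terminal state or when the step limit truncates the episode
            done = terminated or truncated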

        if render:
            environment.render()

# main

The `main` function orchestrates training an agent with Q-learning, evaluating its performance, and demonstrating the learned behavior in the FrozenLake environment. It first creates the environment and runs Q-learning for the specified number of episodes, producing a trained Q-table. It then evaluates the learned policy by computing the average reward over the same number of episodes and prints the result. After evaluation, it demonstrates the agent for `demo_episodes` episodes in a second environment created with `render_mode='human'`. A `try`/`except`/`finally` block catches and reports exceptions and closes the environments whether the run succeeds or fails. Together these steps cover the full lifecycle of training, evaluating, and deploying a reinforcement learning agent.

In [11]:
def main(episodes=10000, demo_episodes=5):
    """
    Main function to run Q-Learning on the FrozenLake environment

    Args:
    episodes (int): Number of episodes to run Q-Learning
    demo_episodes (int): Number of episodes to demonstrate the learned policy
    """
    # Start with no environments so the cleanup below is safe even if creation fails
    environment = None
    visual_env = None
    try:
        # Create the training environment and run Q-learning on it
        environment = gym.make("FrozenLake-v1")
        Q_table = q_learning(environment, episodes)

        # Evaluate the learned policy
        avg_reward = evaluate_policy(environment, Q_table, episodes)
        print(f"Average reward after q-learning: {avg_reward}")

        # Demonstrate the learned policy in a rendered environment
        visual_env = gym.make("FrozenLake-v1", render_mode="human")
        demo_agent(visual_env, Q_table, demo_episodes)

    except Exception as e:
        print(f"An error occurred: {e}")

    finally:
        # Close whichever environments were actually created
        if environment is not None:
            environment.close()
        if visual_env is not None:
            visual_env.close()

In [12]:
if __name__ == "__main__":
    main()


Episode: 9999 - Average Reward: 0.80: 100%|████████████████████████████████████| 10000/10000 [00:21<00:00, 460.50it/s]


Average reward after q-learning: 0.8249

Episode: 1

Episode: 2

Episode: 3

Episode: 4

Episode: 5


`demo_episodes` is 5, so the demo runs five episodes and then closes. If an error occurs, stop the code and run it again.