## Frozen Lake Domain Description

Frozen Lake is a simple grid-world environment where an agent navigates a frozen lake to reach a goal while avoiding falling into holes. The environment is represented as a grid, with each cell being one of the following:

* **S**: Starting position of the agent
* **F**: Frozen surface, safe to walk on
* **H**: Hole, falling into one ends the episode with a reward of 0
* **G**: Goal, reaching it ends the episode with a reward of 1
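
As a minimal sketch of this representation, the default 4x4 map can be written as a list of strings. The names `LAKE` and `cell_type` are illustrative only, and the row-by-row state numbering is an assumption that follows Gym's convention (state 0 is the top-left cell, state 15 the bottom-right).

```python
# Default 4x4 layout; each character is one cell of the lake.
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(LAKE[0])

def cell_type(state):
    """Return the letter (S, F, H or G) of the cell with the given index."""
    row, col = divmod(state, N_COLS)  # assumed row-major numbering
    return LAKE[row][col]

n_states = len(LAKE) * N_COLS
holes = [s for s in range(n_states) if cell_type(s) == "H"]
goal = [s for s in range(n_states) if cell_type(s) == "G"]

print("Holes:", holes)
print("Goal:", goal)
```

Under this numbering the start is state 0 and the goal is state 15, so state 14 is the frozen cell immediately to the left of the goal.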

The agent can take four actions:

* **0: Left**
* **1: Down**
* **2: Right**
* **3: Up**

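However, the ice is slippery, so the agent does not always move in the intended direction. In the default `FrozenLake-v1` configuration (`is_slippery=True`), the agent moves in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3.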




In [29]:
import gym

# Create the environment
env = gym.make('FrozenLake-v1', render_mode='ansi')  # 'ansi' mode for text-based rendering

# Reset the environment to the initial state
observation = env.reset()

# Take a few actions and observe the results
for _ in range(5):
    action = env.action_space.sample()  # Choose a random action
    observation, reward, done, info = env.step(action)
    if done:
        # Episode ended (hole or goal): start a new one
        observation = env.reset()
    else:
        # Render the environment to see the agent's movement (text-based)
        rendered = env.render()
        if len(rendered) > 1:  # Check if there's a second frame
            print(rendered[1])  # Print the second frame
# Close the environment
env.close()

  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG

  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG



The transition model for the Frozen Lake world describes how the agent's actions affect its movement and the resulting state transitions. Here's a breakdown of the key components:

**Actions:**

* The agent can choose from four actions:
    * 0: Left
    * 1: Down
    * 2: Right
    * 3: Up

**State Transitions:**

* **Intended Movement:** Ideally, the agent moves one cell in the chosen direction.
* **Slippery Ice:** Because the ice is slippery, the agent may move perpendicular to the intended direction. The exact probabilities depend on the environment configuration; in the default `FrozenLake-v1` (`is_slippery=True`) the three outcomes are equally likely:
    * **Intended Move:** The agent moves in the chosen direction with probability 1/3.
    * **Perpendicular Move:** The agent moves 90 degrees to the left or to the right of the chosen direction, with probability 1/3 each.
* **Boundaries:** If the resulting move (intended or slipped) would take the agent outside the grid, it stays in its current cell.
* **Holes:** If the agent lands on a hole ("H"), the episode ends, and it receives a reward of 0.
* **Goal:** If the agent reaches the goal ("G"), the episode ends, and it receives a reward of 1.
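
The rules above can be sketched in plain Python without Gym. This is an illustration, not the library's implementation: the names `LAKE`, `move` and `slippery_transitions` are hypothetical, and the sketch assumes the default 4x4 map, row-by-row state numbering, and a uniform 1/3 split between the intended direction and its two perpendicular neighbours. It does not handle transitions out of terminal cells.

```python
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_ROWS, N_COLS = len(LAKE), len(LAKE[0])
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
# (row change, column change) for each action
DELTAS = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}

def move(state, direction):
    """Move one cell, staying in place if the move would leave the grid."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = DELTAS[direction]
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    return row * N_COLS + col

def slippery_transitions(state, action):
    """Return (probability, next_state, reward, done) for each outcome."""
    outcomes = []
    # In the Left, Down, Right, Up ordering, the neighbours of an action
    # (modulo 4) are exactly its two perpendicular directions.
    for direction in [(action - 1) % 4, action, (action + 1) % 4]:
        next_state = move(state, direction)
        cell = LAKE[next_state // N_COLS][next_state % N_COLS]
        reward = float(cell == "G")   # 1.0 only when the goal is reached
        done = cell in "GH"           # goal and holes end the episode
        outcomes.append((1 / 3, next_state, reward, done))
    return outcomes

print(slippery_transitions(14, RIGHT))
```

For state 14 and action Right, the three outcomes are staying in 14 (slipping down into the boundary), reaching the goal at 15, and slipping up to 10.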




In [30]:
import gym

# Create the environment
env = gym.make('FrozenLake-v1', render_mode='ansi')  # 'ansi' mode for text-based rendering

# Reset the environment to the initial state
observation = env.reset()

# Take a few actions and observe the results
for _ in range(5):
    action = env.action_space.sample()  # Choose a random action
    observation, reward, done, info = env.step(action)
    if done:
        # Episode ended (hole or goal): start a new one
        observation = env.reset()
    else:
        # Render the environment to see the agent's movement (text-based)
        rendered = env.render()
        if len(rendered) > 1:  # Check if there's a second frame
            print(rendered[1])  # Print the second frame
# Close the environment
env.close()
# Each entry of P[state][action] is (probability, next_state, reward, done)
print("State 14 Going Right: (prob, next_state, reward, done)", env.unwrapped.P[14][2])

  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG

State 14 Going Right: (prob, next_state, reward, done) [(0.3333333333333333, 14, 0.0, False), (0.3333333333333333, 15, 1.0, True), (0.3333333333333333, 10, 0.0, False)]


In [33]:
import gym
import numpy as np

# Create FrozenLake environment
env = gym.make("FrozenLake-v1")

# Value Iteration Algorithm
def value_iteration(env, gamma=0.9, num_iterations=1000):
    """
    Implements the Value Iteration algorithm.

    Args:
        env: The OpenAI Gym environment.
        gamma: Discount factor.
        num_iterations: Maximum number of iterations to run.

    Returns:
        The optimal value function and policy.
    """

    # Initialize value function and policy
    V = np.zeros(env.observation_space.n)  # Value function for all states
    policy = np.zeros(env.observation_space.n, dtype=int)  # Best action (integer index) for each state

    # Iterate for the number of iterations specified
    for i in range(num_iterations):
        prev_V = np.copy(V)

        # Iterate over all states
        for state in range(env.observation_space.n):
            Q_values = []  # To store Q-values for all actions

            # Calculate Q-values for each action in the current state
            for action in range(env.action_space.n):
                q_value = 0
                for prob, next_state, reward, done in env.unwrapped.P[state][action]:
                    q_value += prob * (reward + gamma * prev_V[next_state])
                Q_values.append(q_value)

            # Update the value function with the max Q-value
            V[state] = max(Q_values)

            # Update the policy to choose the action with the highest Q-value
            policy[state] = np.argmax(Q_values)

        # Stop early once the largest change in V falls below the tolerance
        if np.max(np.abs(V - prev_V)) < 1e-4:
            print(f"Value iteration converged after {i+1} iterations.")
            break

    return V, policy

# Apply Value Iteration
optimal_V, optimal_policy = value_iteration(env)

# Evaluate the resulting policy
def evaluate_policy(env, policy, num_episodes=100):
    """
    Evaluates the policy by running it for a number of episodes and calculating
    the average reward.

    Args:
        env: The OpenAI Gym environment.
        policy: The policy to evaluate.
        num_episodes: Number of episodes to run.

    Returns:
        The average reward obtained by following the policy.
    """

    total_reward = 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False

        while not done:
            action = int(policy[state])  # env.step expects an integer action
            state, reward, done, _ = env.step(action)
            total_reward += reward

    return total_reward / num_episodes

# Evaluate the optimal policy
average_reward = evaluate_policy(env, optimal_policy)
print("Average Reward:", average_reward)


Value iteration converged after 44 iterations.
Average Reward: 0.79


## The Value Iteration Algorithm

Value Iteration is a dynamic programming algorithm that computes the optimal value function and policy of a Markov Decision Process (MDP).
Each iteration sweeps over all states. For every action it computes a Q-value: the expected immediate reward plus the discounted value of the next state under the environment's transition model.
The state's value is set to the largest Q-value, and the policy records the action that achieves it. The loop stops once the largest change in the value function falls below 1e-4, or after `num_iterations` sweeps.
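
Written out, each sweep applies the Bellman optimality backup. The notation is introduced here for illustration and does not appear in the code: $P(s' \mid s, a)$ is the transition probability, $R(s, a, s')$ the reward, $\gamma$ the discount factor, and $V_k$ the value function after $k$ sweeps.

```latex
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V_k(s') \bigr],
\qquad
\pi(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V_k(s') \bigr]
```

The inner sum corresponds to `q_value` in the implementation, and the stopping test is $\max_s |V_{k+1}(s) - V_k(s)| < 10^{-4}$.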


## The Results of Value Iteration

Value iteration converged after 44 iterations with a discount factor of gamma = 0.9, which balances immediate rewards against future ones. The resulting policy was then evaluated over 100 episodes.
Because the only nonzero reward is the 1 received at the goal, the printed 'Average Reward' of 0.79 is the fraction of episodes that ended at the goal.
With only 100 episodes in a stochastic environment, this figure is a noisy estimate and will vary between runs.
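
To make the role of the discount factor concrete, the sketch below compares the discounted return of two hypothetical episodes that both reach the goal. The function `discounted_return` and the two reward sequences are illustrative assumptions, not results from the notebook.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each weighted by gamma raised to its time step."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical episodes: the reward of 1 arrives on the final step only.
short_path = [0, 0, 0, 0, 0, 1]  # goal reached in 6 moves
long_path = [0] * 19 + [1]       # goal reached in 20 moves

print("6 moves: ", discounted_return(short_path))
print("20 moves:", discounted_return(long_path))
```

With gamma = 0.9 the 6-move episode is worth about 0.59 and the 20-move episode about 0.135, so value iteration prefers actions that reach the goal sooner. The average reward reported by `evaluate_policy` is undiscounted, so it counts both episodes as 1.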