# Frozen Lake w/ Value Iteration & Direct Evaluation

## Frozen Lake Domain Description

Frozen Lake is a simple grid-world environment where an agent navigates a frozen lake to reach a goal while avoiding falling into holes. The environment is represented as a grid, with each cell being one of the following:

* **S**: Starting position of the agent
* **F**: Frozen surface, safe to walk on
* **H**: Hole; falling into one ends the episode with a reward of 0
* **G**: Goal; reaching it ends the episode with a reward of 1

The agent can take four actions:

* **0: Left**
* **1: Down**
* **2: Right**
* **3: Up**

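However, because the ice is slippery, the agent does not always move in the intended direction. In the slippery configuration used in this notebook, it moves in the intended direction with probability 1/3 and slips to each of the two perpendicular directions with probability 1/3, as the transition probabilities printed later in the notebook show.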
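To relate the state numbers used later (for example, state 14) to cells of the grid, here is a minimal sketch. It assumes the default 4x4 map shown in the rendered output of this notebook and row-major state numbering; `LAKE_MAP` and `state_to_cell` are illustrative names, not part of `gym`.

```python
# Default 4x4 map, copied from the rendered output in this notebook.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = 4


def state_to_cell(state):
    """Map a state index to (row, col, tile), assuming row-major numbering."""
    row, col = divmod(state, N_COLS)
    return row, col, LAKE_MAP[row][col]


for s in (0, 5, 14, 15):
    print(s, state_to_cell(s))
```

Under these assumptions, state 0 is the start tile, state 5 is a hole, and state 15 is the goal.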




In [None]:
import gym

# Create the environment
env = gym.make('FrozenLake-v1', render_mode='ansi')  # 'ansi' mode for text-based rendering

# Reset the environment to the initial state
observation = env.reset()

# Take a few actions and observe the results
for _ in range(5):
    action = env.action_space.sample()  # Choose a random action
    observation, reward, done, info = env.step(action)
    if done:
        # Episode ended: start a new one
        observation = env.reset()
    else:
        # Render the environment to see the agent's movement (text-based)
        rendered = env.render()
        if len(rendered) > 1:  # Check if there's a second element
            print(rendered[1])  # Print the second element
# Close the environment
env.close()

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG





The transition model for the Frozen Lake world describes how the agent's actions affect its movement and the resulting state transitions. Here's a breakdown of the key components:

**Actions:**

* The agent can choose from four actions:
    * 0: Left
    * 1: Down
    * 2: Right
    * 3: Up

**State Transitions:**

* **Intended Movement:** Ideally, the agent moves one cell in the chosen direction.
* **Slippery Ice:** Because the ice is slippery, the agent may move in a perpendicular direction instead of the intended one. The exact probabilities depend on the environment configuration. In the configuration used here, the three outcomes are equally likely, as the transition list printed for state 14 below shows:
    * **Successful Move:** The agent moves in the intended direction with probability 1/3.
    * **Perpendicular Move:** The agent moves 90 degrees to the left or to the right of the intended direction, each with probability 1/3.
* **Boundaries:** If a move (intended or slipped) would take the agent outside the grid boundaries, it remains in its current position.
* **Holes:** If the agent lands on a hole ("H"), the episode ends, and it receives a reward of 0.
* **Goal:** If the agent reaches the goal ("G"), the episode ends, and it receives a reward of 1.
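The rules above can be sketched without `gym` as a small function that lists the outcomes of one action. This is an illustrative reconstruction, assuming the default 4x4 map, row-major state numbering, and equal 1/3 probabilities; `move` and `slippery_transitions` are hypothetical helpers, and the sketch only covers non-terminal states.

```python
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_ROWS, N_COLS = 4, 4
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
# (row delta, column delta) for each action
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def move(state, direction):
    """Move one cell in `direction`, staying in place at the grid boundary."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[direction]
    new_row = min(max(row + d_row, 0), N_ROWS - 1)
    new_col = min(max(col + d_col, 0), N_COLS - 1)
    return new_row * N_COLS + new_col


def slippery_transitions(state, action):
    """List (probability, next_state, reward, done) for a non-terminal state."""
    outcomes = []
    # The two perpendicular directions are the neighbours of `action` modulo 4.
    for direction in ((action - 1) % 4, action, (action + 1) % 4):
        next_state = move(state, direction)
        tile = LAKE_MAP[next_state // N_COLS][next_state % N_COLS]
        reward = 1.0 if tile == "G" else 0.0
        done = tile in "GH"
        outcomes.append((1 / 3, next_state, reward, done))
    return outcomes


print(slippery_transitions(14, RIGHT))
```

For state 14 going right, the sketch yields next states 14, 15, and 10: slipping down hits the bottom boundary, so the agent stays at 14.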




In [None]:
import gym

# Create the environment
env = gym.make('FrozenLake-v1', render_mode='ansi')  # 'ansi' mode for text-based rendering

# Reset the environment to the initial state
observation = env.reset()

# Take a few actions and observe the results
for _ in range(5):
    action = env.action_space.sample()  # Choose a random action
    observation, reward, done, info = env.step(action)
    if done:
        # Episode ended: start a new one
        observation = env.reset()
    else:
        # Render the environment to see the agent's movement (text-based)
        rendered = env.render()
        if len(rendered) > 1:  # Check if there's a second element
            print(rendered[1])  # Print the second element
# Close the environment
env.close()
# Each entry of the transition list is (probability, next_state, reward, done)
print("State 14 Going Right: (prob, s', r, done)", env.unwrapped.P[14][2])

  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG

State 14 Going Right: (prob, s', r, done) [(0.3333333333333333, 14, 0.0, False), (0.3333333333333333, 15, 1.0, True), (0.3333333333333333, 10, 0.0, False)]




# Direct Evaluation

## Evaluate Single Episode

In [None]:
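def EvaluateEpisode(env, e, V_DE, V_Counts, gamma=0.9):
    """Add one episode's discounted returns to the running sums (every-visit).

    Each step t in e is indexed as t[0] = state and t[3] = reward; env is unused.
    """
    future_reward = 0
    for t in reversed(e):  # Walk backward so each return builds on the next one
        future_reward = t[3] + gamma * future_reward  # G_t = r_t + gamma * G_{t+1}
        V_DE[t[0]] += future_reward  # Sum of returns observed from this state
        V_Counts[t[0]] += 1  # Number of visits to this state
    return V_DE, V_Counts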
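As a self-contained illustration of the same idea, the sketch below runs direct evaluation on one hypothetical episode. It simplifies each step to a `(state, reward)` pair, which is not the tuple layout used by `EvaluateEpisode`, and the episode itself is made up for illustration: a path through states 0, 4, 8, 9, 10, and 14 that reaches the goal on its last step.

```python
import numpy as np


def direct_evaluation(episodes, n_states, gamma=0.9):
    """Average the discounted returns observed from each state (every-visit)."""
    return_sums = np.zeros(n_states)
    visit_counts = np.zeros(n_states)
    for episode in episodes:
        future_reward = 0.0
        # Walk backward so the return of each step reuses the next step's return.
        for state, reward in reversed(episode):
            future_reward = reward + gamma * future_reward
            return_sums[state] += future_reward
            visit_counts[state] += 1
    # Unvisited states keep a value of 0 instead of triggering a division by zero.
    return np.divide(return_sums, visit_counts,
                     out=np.zeros(n_states), where=visit_counts > 0)


# Hypothetical episode: (state, reward) pairs, reward 1 on the final step only.
episode = [(0, 0.0), (4, 0.0), (8, 0.0), (9, 0.0), (10, 0.0), (14, 1.0)]
values = direct_evaluation([episode], n_states=16)
print(values.reshape(4, 4))
```

With a single episode, each visited state's value is just a power of `gamma`: the last state gets 1, the one before it 0.9, and so on back to the start state.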

## Evaluate Episode 1



In [None]:
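import numpy as np

# Accumulate returns and visit counts for the first training episode
V_DE = np.zeros(env.observation_space.n)
V_Counts = np.zeros(env.observation_space.n)
V_DE, V_Counts = EvaluateEpisode(env, training_episodes[0], V_DE, V_Counts, 0.9)
# Average return per state; unvisited states keep a value of 0 (no division by zero)
V = np.divide(V_DE, V_Counts, out=np.zeros_like(V_DE), where=V_Counts != 0)
print(f"V_DE=\n{V_DE.reshape((4,4))}")
print(f"V_Counts=\n{V_Counts.reshape((4,4))}")
print(f"V=\n{V.reshape((4,4))}")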

V_DE=
[[0.59049 0.      0.      0.     ]
 [0.6561  0.      0.      0.     ]
 [0.729   0.81    0.9     0.     ]
 [0.      0.      1.      0.     ]]
V_Counts=
[[1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 1. 1. 0.]
 [0. 0. 1. 0.]]
V=
[[0.59049 0.      0.      0.     ]
 [0.6561  0.      0.      0.     ]
 [0.729   0.81    0.9     0.     ]
 [0.      0.      1.      0.     ]]




## Evaluate All Episodes

In [None]:
# Accumulate returns and visit counts over all training episodes
V_DE = np.zeros(env.observation_space.n)
V_Counts = np.zeros(env.observation_space.n)
for e in training_episodes2:
    V_DE, V_Counts = EvaluateEpisode(env, e, V_DE, V_Counts, 0.9)
# Average return per state; unvisited states keep a value of 0 (no division by zero)
V = np.divide(V_DE, V_Counts, out=np.zeros_like(V_DE), where=V_Counts != 0)
print(f"V_DE=\n{V_DE.reshape((4,4))}")
print(f"V_Counts=\n{V_Counts.reshape((4,4))}")
print(f"V_DirectEvaluation=\n{np.round(V.reshape((4,4)),2)}")
print(f"optimal policy= \n{optimal_policy.reshape((4,4))}\n optimal_V=\n{np.round(optimal_V.reshape((4,4)), 2)}")

V_DE=
[[ 879.44773993   12.02033939   32.45370905    0.        ]
 [ 910.67996731    0.           63.889207      0.        ]
 [ 943.61931212  887.58499323  418.52594351    0.        ]
 [   0.         1065.13267976 1369.46029196    0.        ]]
V_Counts=
[[12819.   227.   503.     0.]
 [ 9779.     0.   646.     0.]
 [ 6632.  3600.  1432.     0.]
 [    0.  2876.  2159.     0.]]
V_DirectEvaluation=
[[0.07 0.05 0.06 0.  ]
 [0.09 0.   0.1  0.  ]
 [0.14 0.25 0.29 0.  ]
 [0.   0.37 0.63 0.  ]]
optimal policy= 
[[0 3 0 3]
 [0 0 0 0]
 [3 1 0 0]
 [0 2 1 0]]
 optimal_V=
[[0.07 0.06 0.07 0.06]
 [0.09 0.   0.11 0.  ]
 [0.15 0.25 0.3  0.  ]
 [0.   0.38 0.64 0.  ]]




# My Code (Run This Section)

## Value Iteration Code From Previous Assignment

In [1]:
import gym
import numpy as np

# Create FrozenLake environment
env = gym.make("FrozenLake-v1")

# Value Iteration Algorithm
def value_iteration(env, gamma=0.9, num_iterations=1000):
    # Initialize value function and policy
    V = np.zeros(env.observation_space.n)
    policy_value_iteration = np.zeros(env.observation_space.n)

    for i in range(num_iterations):
        # Create a copy of the current value function
        prev_V = np.copy(V)

        # Iterate through all states
        for state in range(env.observation_space.n):
            # Initialize an array to store Q-values for all actions in this state
            Q_values = np.zeros(env.action_space.n)

            # Iterate through all possible actions
            for action in range(env.action_space.n):
                # Calculate the expected value of taking this action
                # env.unwrapped reaches the base environment that holds the transition table P
                for prob, next_state, reward, done in env.unwrapped.P[state][action]:
                    Q_values[action] += prob * (reward + gamma * prev_V[next_state])

            # Update the value function with the max Q-value
            V[state] = max(Q_values)

            # Update the policy to choose the action that gives the highest Q-value
            policy_value_iteration[state] = np.argmax(Q_values)

        # Early stopping condition (optional)
        if np.max(np.abs(prev_V - V)) < 1e-6:
            break

    return V, policy_value_iteration

# Apply Value Iteration
optimal_V, optimal_policy_value_iteration = value_iteration(env)
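
In equation form, the update inside the loop of `value_iteration` can be read as the Bellman backup below. The symbols are notation introduced here rather than names from the code: $P(s' \mid s, a)$ and $R(s, a, s')$ stand for the probabilities and rewards stored in `env.P`, $V_k$ corresponds to `prev_V`, and $\gamma$ to `gamma`.

$$
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V_k(s') \bigr]
$$

The policy entry for each state is the action $a$ that attains this maximum, and the loop stops early once $\max_s |V_{k+1}(s) - V_k(s)| < 10^{-6}$.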




## Submitted Code for Q-Learning

In [2]:
import numpy as np
import gym

# Create Frozen Lake environment
env = gym.make("FrozenLake-v1")

# Q-Learning algorithm function
def q_learning(env, num_episodes=10000, max_steps=100, alpha=0.1, gamma=0.99, epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    # Initialize Q-table with zeros
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    # Function for Epsilon-Greedy policy
    def epsilon_greedy_action(state, Q, epsilon):
        if np.random.rand() < epsilon:
            return env.action_space.sample()  # Exploration
        else:
            return np.argmax(Q[state])  # Exploitation

    # Q-Learning loop
    for episode in range(num_episodes):
        state = env.reset()
        done = False

        for step in range(max_steps):
            # Select action using epsilon-greedy policy
            action = epsilon_greedy_action(state, Q, epsilon)

            # Perform action and observe the next state and reward
            next_state, reward, done, _ = env.step(action)

            # Update Q-value
            best_next_action = np.argmax(Q[next_state])
            Q[state, action] = Q[state, action] + alpha * (reward + gamma * Q[next_state, best_next_action] - Q[state, action])

            # Move to the next state
            state = next_state

            if done:
                break

        # Decay epsilon to reduce exploration over time
        epsilon = max(epsilon_min, epsilon * epsilon_decay)

    # Extract the optimal policy from Q-table
    optimal_policy = np.argmax(Q, axis=1)
    return Q, optimal_policy

# Apply Q-Learning
Q_table, optimal_policy_q_learning = q_learning(env)
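
Written as an equation, the line that updates `Q[state, action]` in `q_learning` corresponds to the standard Q-learning rule below. Here $s'$ is `next_state`, $r$ is `reward`, $\alpha$ is `alpha` (0.1 by default), and $\gamma$ is `gamma` (0.99 by default); the symbols are notation added for this note.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \,\bigl[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\bigr]
$$

Rows of `Q` for terminal states are never updated and stay at zero, so the bootstrapped term vanishes when $s'$ is a hole or the goal.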



## Evaluate Policy

In [3]:
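# Evaluate Policy Function
def evaluate_policy(env, policy, num_episodes=1000):
    """Return the average total reward per episode when following `policy`."""
    total_reward = 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Cast to int because the value-iteration policy is stored as floats
            action = int(policy[state])
            state, reward, done, _ = env.step(action)
            total_reward += reward
    return total_reward / num_episodes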

## Extended Q_Learning

In [4]:
import numpy as np
import gym

# Create Frozen Lake environment
env = gym.make("FrozenLake-v1")

# Q-Learning algorithm function with optimizations
def q_learning_optimized(env, num_episodes=10000, max_steps=100, alpha=0.1, gamma=0.99, epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995, exploration_bonus=0.1):
    # Initialize Q-table with zeros
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    exploration_count = np.zeros((env.observation_space.n, env.action_space.n))  # To track visits to each state-action pair

    # Exploration function to favor less-explored actions
    def exploration_function(Q, state, action, exploration_count, exploration_bonus):
        return Q[state, action] + exploration_bonus / (exploration_count[state, action] + 1)  # Favor less-explored actions

    # Function for Epsilon-Greedy policy
    def epsilon_greedy_action(state, Q, epsilon, exploration_count, exploration_bonus):
        if np.random.rand() < epsilon:
            return env.action_space.sample()  # Exploration
        else:
            # Choose action based on the exploration function
            exploration_q_values = [exploration_function(Q, state, action, exploration_count, exploration_bonus) for action in range(env.action_space.n)]
            return np.argmax(exploration_q_values)  # Exploitation using exploration function

    # Q-Learning loop
    for episode in range(num_episodes):
        state = env.reset()
        done = False

        for step in range(max_steps):
            # Select action using epsilon-greedy policy with exploration bonus
            action = epsilon_greedy_action(state, Q, epsilon, exploration_count, exploration_bonus)

            # Perform action and observe the next state and reward
            next_state, reward, done, _ = env.step(action)

            # Update the exploration count
            exploration_count[state, action] += 1

            # Update Q-value
            best_next_action = np.argmax(Q[next_state])
            Q[state, action] = Q[state, action] + alpha * (reward + gamma * Q[next_state, best_next_action] - Q[state, action])

            # Move to the next state
            state = next_state

            if done:
                break

        # Decay epsilon to reduce exploration over time
        epsilon = max(epsilon_min, epsilon * epsilon_decay)

    # Extract the optimal policy from Q-table
    optimal_policy = np.argmax(Q, axis=1)
    return Q, optimal_policy

# Optimized Q-Learning
Q_table_optimized, optimal_policy_optimized = q_learning_optimized(env, exploration_bonus=0.1)  # Optimized version
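
To see how the exploration bonus can change the greedy choice, here is a small standalone sketch. It applies the same formula as `exploration_function`, `Q + exploration_bonus / (count + 1)`, to made-up Q-values and visit counts for a single state; the numbers and the helper name `exploration_scores` are illustrative only.

```python
import numpy as np


def exploration_scores(q_row, count_row, exploration_bonus=0.1):
    """Q-values plus a bonus that shrinks as an action is tried more often."""
    return q_row + exploration_bonus / (count_row + 1)


# Hypothetical Q-values and visit counts for the four actions of one state.
q_row = np.array([0.50, 0.48, 0.0, 0.0])
count_row = np.array([50, 0, 200, 200])

scores = exploration_scores(q_row, count_row)
greedy_action = int(np.argmax(q_row))   # Highest raw Q-value
bonus_action = int(np.argmax(scores))   # Highest Q-value plus bonus
print(greedy_action, bonus_action)
```

In this made-up case the untried action 1 overtakes action 0 once the bonus is added, which is the intended effect: rarely tried actions get a temporary boost that fades as their counts grow.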

## Print Results

In [5]:
# Evaluate the policy from Value Iteration
value_iteration_reward = evaluate_policy(env, optimal_policy_value_iteration)

# Evaluate the policy from Q-Learning
q_learning_reward = evaluate_policy(env, optimal_policy_q_learning)

# Evaluate policy with optimized Q-Learning
optimized_reward = evaluate_policy(env, optimal_policy_optimized)

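# Compute value function from Q-tables by taking the max Q-value for each state
# (Q-learning above uses gamma=0.99 while value iteration uses gamma=0.9,
# so these values are not directly comparable with optimal_V)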
values_q_learning = np.max(Q_table, axis=1)
values_optimized_q_learning = np.max(Q_table_optimized, axis=1)

# Round the values to three decimal places
values_q_learning = np.round(values_q_learning, 3)
values_optimized_q_learning = np.round(values_optimized_q_learning, 3)
optimal_V = np.round(optimal_V, 3)

# Print the average reward for all algorithms
print(f"Average reward using Value Iteration: {value_iteration_reward}")
print(f"Average reward using Q-Learning: {q_learning_reward}")
print(f"Average reward with exploration bonus: {optimized_reward}")

# Print the policies for each algorithm
print("\nPolicies and Values for each algorithm:")

# Value Iteration
print("Policy from Value Iteration:")
print(optimal_policy_value_iteration.reshape((4, 4)))  # Reshape for the 4x4 Frozen Lake
print("\nValues from Value Iteration:")
print(optimal_V.reshape((4, 4)))

# Q-Learning
print("\nPolicy from Q-Learning:")
print(optimal_policy_q_learning.reshape((4, 4)))  # Reshape for the 4x4 Frozen Lake
print("\nValues from Q-Learning:")
print(values_q_learning.reshape((4, 4)))

# Optimized Q-Learning
print("\nPolicy from Optimized Q-Learning:")
print(optimal_policy_optimized.reshape((4, 4)))  # Reshape for the 4x4 Frozen Lake
print("\nValues from Optimized Q-Learning:")
print(values_optimized_q_learning.reshape((4, 4)))

Average reward using Value Iteration: 0.744
Average reward using Q-Learning: 0.702
Average reward with exploration bonus: 0.723

Policies and Values for each algorithm:
Policy from Value Iteration:
[[0. 3. 0. 3.]
 [0. 0. 0. 0.]
 [3. 1. 0. 0.]
 [0. 2. 1. 0.]]

Values from Value Iteration:
[[0.069 0.061 0.074 0.056]
 [0.092 0.    0.112 0.   ]
 [0.145 0.247 0.3   0.   ]
 [0.    0.38  0.639 0.   ]]

Policy from Q-Learning:
[[0 1 0 3]
 [0 0 0 0]
 [3 1 0 0]
 [0 2 1 0]]

Values from Q-Learning:
[[0.49  0.175 0.184 0.188]
 [0.509 0.    0.165 0.   ]
 [0.528 0.558 0.443 0.   ]
 [0.    0.683 0.855 0.   ]]

Policy from Optimized Q-Learning:
[[0 3 0 3]
 [0 0 2 0]
 [3 1 0 0]
 [0 2 1 0]]

Values from Optimized Q-Learning:
[[0.563 0.407 0.354 0.1  ]
 [0.591 0.    0.366 0.   ]
 [0.621 0.653 0.613 0.   ]
 [0.    0.735 0.874 0.   ]]
