# Q Learning on Frozen Lake

This exercise challenges you to solve a reinforcement learning problem in the Frozen Lake environment. It seems quite simple, but watch out: it may have a few surprises in store!

Frozen lake involves crossing a frozen lake from Start(S) to Goal(G) without falling into any Holes(H) by walking over the Frozen(F) lake.

## Action Space
The agent takes a 1-element vector for actions. The action space is `(dir)`, where `dir` decides the direction to move in, which can be:

0: LEFT

1: DOWN

2: RIGHT

3: UP

## Observation Space
The observation is a value representing the agent’s current position as `current_row * ncols + current_col` (where both the row and col start at 0). For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations depends on the size of the map. For example, the 4x4 map has 16 possible observations.
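The index arithmetic can be sketched in a few lines. The helper names `to_state` and `to_position` are hypothetical and not part of Gymnasium; the sketch only assumes the row-major numbering described above.

```python
def to_state(row: int, col: int, ncols: int) -> int:
    """Encode a (row, col) grid position as a single observation index."""
    return row * ncols + col


def to_position(state: int, ncols: int) -> tuple[int, int]:
    """Decode an observation index back into (row, col)."""
    return divmod(state, ncols)


# The goal of the 4x4 map sits in the bottom-right corner.
print(to_state(3, 3, ncols=4))    # 15
print(to_position(15, ncols=4))   # (3, 3)
```

Decoding the index this way is handy when reading a Q-table row by row, since each row of the table corresponds to one cell of the grid.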

## Rewards
Reward schedule:

Reach goal(G): +1

Reach hole(H): 0 (Terminates the episode)

Reach frozen(F): 0

1. Let's start by installing some libraries

In [None]:
!pip install cmake
!pip install scipy
!pip install gymnasium

In [None]:
import gymnasium as gym
import numpy as np
import random
import matplotlib.pyplot as plt  # used to display rendered frames in the cells below
from IPython.display import clear_output

2. Use `gym` to set up a Frozen Lake environment with dimension 4x4 that is not slippery. You may learn how to do that [here](https://www.gymlibrary.ml/environments/toy_text/frozen_lake/). Reset the environment and display the first state using matplotlib and the `.render()` method (with Gymnasium, the render mode is chosen when the environment is created, via `render_mode="rgb_array"`).

In [None]:
print("[INFO] : Version Gym : ", gym.__version__)

env = gym.make("FrozenLake-v1", desc=None, map_name="4x4", is_slippery=False, render_mode="rgb_array").env
env.reset()
env.render()

3. Reset the environment and see what happens when taking action `0`.

In [None]:
env.reset()

In [None]:
env.step(0)


4. How do you interpret the resulting values?

5. Print the size of the action space and the observation space using attributes of the environment object.

In [None]:
print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))


6. Set up the Q-table (remember that it should hold a value for each state-action pair) and initialize all values to zero.

In [None]:
q_table = np.zeros([env.observation_space.n, env.action_space.n])
q_table

7. Run a Q-learning loop inspired by the demo over 10 000 episodes.

In [None]:
%%time
# magic command for measuring the cell's execution time
"""Training the agent"""

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

# Loop over a certain number of episodes
for i in range(1, 10001):
    state, info = env.reset() # start by re-initializing the environment

    epochs, penalties, reward, = 0, 0, 0
    done = False

    while not done: # starting a while loop that will keep running until the termination of an episode
        # We then define the epsilon greedy policy
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        # take action and get next state information
        next_state, reward, done, _, info = env.step(action)

        # update q table using the algorithm formula
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max) # decaying average of the old value and the new estimated value
        q_table[state, action] = new_value

        # update the state variable
        state = next_state
        epochs += 1

    # every 100 episodes we print the episode number
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")
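The update line in the loop above can be checked by hand. The sketch below isolates it in a hypothetical helper `q_update` (a name introduced here for illustration only) and uses the same `alpha = 0.1` and `gamma = 0.6` as the training cell.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """One tabular Q-learning update, written as in the training loop."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)


# A step that reaches the goal from a state-action pair that was still at zero:
# 0.9 * 0 + 0.1 * (1 + 0.6 * 0)
print(q_update(old_value=0.0, reward=1.0, next_max=0.0))   # 0.1

# On a later episode, the step *before* that one earns no reward, but the
# next state is now worth 0.1, so a small value (about 0.006) flows backwards.
print(q_update(old_value=0.0, reward=0.0, next_max=0.1))
```

This shows why a reward has to be found at least once before anything changes: as long as every reward and every `next_max` is zero, each update returns zero.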

8. Try to visualize what the agent is doing using `.render`. Does anything surprise you? Why do you think this is happening?

In [None]:
# watch trained agent
state, info = env.reset()
done = False
rewards = 0
max_steps = 20

for s in range(max_steps):

    print(f"TRAINED AGENT")
    print("Step {}".format(s+1))

    action = np.argmax(q_table[state])
    new_state, reward, done, _, info = env.step(action)
    rewards += reward
    plt.imshow(env.render())
    plt.show()
    print(f"score: {rewards}")
    state = new_state

    if done:
        break

env.close()

9. The agent does not move, although we know for a fact that this is not the optimal behaviour! The problem comes from the fact that we start with a Q-table filled with zeros. That means that at the beginning, the greedy action given by `np.argmax(q_table[state])` is always `0`, which leads our agent against the wall. The odds of picking a random action are low with our chosen policy (`epsilon = 0.1`). In this setting, it becomes almost impossible for the agent to randomly reach the goal (the only non-zero reward) and start learning!
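The tie-breaking behaviour behind this can be reproduced without the environment. The sketch below uses a plain Python list in place of a Q-table row so that it stays self-contained; picking the first maximal entry is the same rule `np.argmax` applies.

```python
q_row = [0.0, 0.0, 0.0, 0.0]  # Q-values of one state before any reward was seen

# list.index(max(...)) returns the first maximal entry, which is the same
# tie-breaking rule np.argmax applies to a Q-table row.
greedy_action = q_row.index(max(q_row))
print(greedy_action)  # 0, i.e. LEFT
```

From the start cell in the top-left corner, LEFT runs into the edge of the map, so the greedy agent stays where it is.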

Try to think of a solution for this for at least 5 minutes then click the spoiler to get a clue:

<details>
<summary>SPOILER</summary>
The solution is to force the agent to pick a random action whenever the scores of all actions are equal.
</details>

Once you think you have found a way to solve this issue, rerun your adapted training loop.

In [None]:
%%time
# magic command for measuring the cell's execution time
"""Training the agent"""

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

# Loop over a certain number of episodes
for i in range(1, 10001):
    state, info = env.reset() # start by re-initializing the environment

    epochs, penalties, reward, = 0, 0, 0
    done = False

    while not done: # starting a while loop that will keep running until the termination of an episode
        # We then define the epsilon greedy policy
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        elif np.min(q_table[state]) == np.max(q_table[state]):
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        # take action and get next state information
        next_state, reward, done, _, info = env.step(action)

        # update q table using the algorithm formula
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max) # decaying average of the old value and the new estimated value
        q_table[state, action] = new_value

        # update the state variable
        state = next_state
        epochs += 1

    # every 100 episodes we print the episode number
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")
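The loop above falls back to a random action only when all four Q-values of a state are equal. A slightly more general variant, shown here as an alternative and not as the method used in this notebook, breaks every tie at random, including ties between just two or three best actions. The helper name `greedy_with_random_ties` is hypothetical, and plain lists stand in for Q-table rows.

```python
import random


def greedy_with_random_ties(q_row, rng=random):
    """Return the index of a maximal entry, chosen uniformly among ties."""
    best = max(q_row)
    candidates = [a for a, q in enumerate(q_row) if q == best]
    return rng.choice(candidates)


rng = random.Random(0)  # seeded only to make the demonstration repeatable

# With an all-zero row, every action gets picked sooner or later.
print({greedy_with_random_ties([0.0, 0.0, 0.0, 0.0], rng) for _ in range(200)})

# With a partial tie, only the tied best actions (1 and 2) are candidates.
print(greedy_with_random_ties([0.0, 0.3, 0.3, 0.1], rng))
```

Whether this changes the learned policy on Frozen Lake is not something the sketch demonstrates; it only illustrates the selection rule.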

10. Calculate the average reward across 100 episodes

In [None]:
"""Evaluate agent's performance after Q-learning"""

total_epochs, total_penalties = 0, 0
episodes = 100
cumulated_reward = 0

for _ in range(episodes):
    state, info = env.reset()
    epochs, penalties, reward = 0, 0, 0

    done = False

    while not done:
        # this time we use the greedy policy
        action = np.argmax(q_table[state])
        state, reward, done, _, info = env.step(action)

        cumulated_reward += reward

        epochs += 1

    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average reward per episode: {cumulated_reward / episodes}")

11. What is the agent doing now? Show its behaviour visually.

In [None]:
# watch trained agent
state, info = env.reset()
done = False
rewards = 0
max_steps = 100

for s in range(max_steps):

    print(f"TRAINED AGENT")
    print("Step {}".format(s+1))

    action = np.argmax(q_table[state])
    new_state, reward, done, _, info = env.step(action)
    rewards += reward
    plt.imshow(env.render())
    plt.show()
    print(f"score: {rewards}")
    state = new_state

    if done:
        break

env.close()

Yay! The agent is now able to win the game in an optimal way!

## Frozen Lake 8x8

Now let's see if we can solve the frozen lake problem with a more challenging map!

1. Set up a non-slippery Frozen Lake environment with an 8x8 map.

In [None]:
print("[INFO] : Version Gym : ", gym.__version__)

env = gym.make("FrozenLake-v1", desc=None, map_name="8x8", is_slippery=False, render_mode="rgb_array").env
env.reset()
env.render()

2. In this setting, the probability of randomly reaching the goal is much lower. Let's see if our training loop has any chance of completing the Q-learning algorithm. Set up the Q-table with initial values equal to zero.
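To get a feel for how much harder blind exploration becomes, the sketch below plays uniformly random episodes on both maps. It is a pure-Python stand-in for the non-slippery environment, not Gymnasium itself: the two layouts are copied in under the assumption that they match the default `4x4` and `8x8` maps, and the function names are hypothetical.

```python
import random

# Default layouts of FrozenLake-v1, copied here as an assumption.
MAPS = {
    "4x4": ["SFFF", "FHFH", "FFFH", "HFFG"],
    "8x8": [
        "SFFFFFFF", "FFFFFFFF", "FFFHFFFF", "FFFFFHFF",
        "FFFHFFFF", "FHHFFFHF", "FHFFHFHF", "FFFHFFFG",
    ],
}
# (row delta, col delta) for LEFT, DOWN, RIGHT, UP
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]


def random_walk_reaches_goal(grid, rng):
    """Play one non-slippery episode with uniformly random actions."""
    size = len(grid)
    row, col = 0, 0
    while True:
        d_row, d_col = MOVES[rng.randrange(4)]
        # Moves into the edge of the map leave the agent where it is.
        row = min(max(row + d_row, 0), size - 1)
        col = min(max(col + d_col, 0), size - 1)
        if grid[row][col] == "G":
            return True
        if grid[row][col] == "H":
            return False


def success_rate(map_name, episodes=10_000, seed=0):
    """Fraction of random episodes that end on the goal cell."""
    rng = random.Random(seed)
    wins = sum(random_walk_reaches_goal(MAPS[map_name], rng) for _ in range(episodes))
    return wins / episodes


rate_4x4 = success_rate("4x4")
rate_8x8 = success_rate("8x8")
print(f"4x4: {rate_4x4:.4f}   8x8: {rate_8x8:.4f}")
```

The exact figures depend on the seed, but the 8x8 rate should come out clearly below the 4x4 one, which is why the larger map needs many more training episodes.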

In [None]:
q_table = np.zeros([env.observation_space.n, env.action_space.n])
q_table

3. Run the Q-learning algorithm over 100 000 episodes.

In [None]:
%%time
"""Training the agent"""


# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

# Loop over a certain number of episodes
for i in range(1, 100001):
    state, info = env.reset() # start by re-initializing the environment

    epochs, penalties, reward, = 0, 0, 0
    done = False

    while not done: # starting a while loop that will keep running until the termination of an episode
        # We then define the epsilon greedy policy
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        elif np.min(q_table[state]) == np.max(q_table[state]):
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        # take action and get next state information
        next_state, reward, done, _, info = env.step(action)

        # update q table using the algorithm formula
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max) # decaying average of the old value and the new estimated value
        q_table[state, action] = new_value

        # update the state variable
        state = next_state
        epochs += 1

    # every 100 episodes we print the episode number
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

4. Visualize the agent's behaviour.

In [None]:
state, info = env.reset()
done = False
rewards = 0
max_steps = 100

for s in range(max_steps):

    print(f"TRAINED AGENT")
    print("Step {}".format(s+1))

    action = np.argmax(q_table[state])
    new_state, reward, done, _, info = env.step(action)
    rewards += reward
    plt.imshow(env.render())
    plt.show()
    print(f"score: {rewards}")
    state = new_state

    if done:
        break

env.close()

Looks like it reached the goal!

## Slippery Frozen Lake

Now let's complicate things even further by making the lake slippery! This means that whenever you pick an action, the agent moves in the intended direction with probability $\frac{1}{3}$ and slips to one of the two perpendicular directions with probability $\frac{1}{3}$ each. For example, if I pick the action "down", the probability of going "down" is $\frac{1}{3}$, the probability of going "left" is $\frac{1}{3}$, and the probability of going "right" is $\frac{1}{3}$.
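The slipping rule can be written down compactly using the action numbering from the Action Space section. The sketch below is an illustration of the rule as described above, not a call into Gymnasium, and `slippery_outcomes` is a hypothetical helper name.

```python
ACTIONS = ["LEFT", "DOWN", "RIGHT", "UP"]


def slippery_outcomes(action: int) -> list[int]:
    """Directions the agent may actually move in, each with probability 1/3."""
    # In the cyclic order LEFT, DOWN, RIGHT, UP, the two neighbours of an
    # action are exactly the directions perpendicular to it.
    return [(action - 1) % 4, action, (action + 1) % 4]


for intended in range(4):
    actual = [ACTIONS[a] for a in slippery_outcomes(intended)]
    print(f"{ACTIONS[intended]:>5} -> {actual}")
```

Note that the agent never slips backwards: the direction opposite to the chosen action is the only one that cannot occur.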

1. Set up an environment with the 8x8 map in slippery mode.

In [None]:
print("[INFO] : Version Gym : ", gym.__version__)


env = gym.make("FrozenLake-v1", desc=None, map_name="8x8", is_slippery=True, render_mode="rgb_array").env
env.reset()
env.render()



2. Set up the Q-table

In [None]:
q_table = np.zeros([env.observation_space.n, env.action_space.n])
q_table

3. Run the Q learning algorithm over 100 000 episodes

In [None]:
%%time
# magic command for measuring the cell's execution time
"""Training the agent"""

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

# Loop over a certain number of episodes
for i in range(1, 100001):
    state, info = env.reset() # start by re-initializing the environment

    epochs, penalties, reward, = 0, 0, 0
    done = False

    while not done: # starting a while loop that will keep running until the termination of an episode
        # We then define the epsilon greedy policy
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        elif np.min(q_table[state]) == np.max(q_table[state]):
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        # take action and get next state information
        next_state, reward, done, _, info = env.step(action)

        # update q table using the algorithm formula
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max) # decaying average of the old value and the new estimated value
        q_table[state, action] = new_value

        # update the state variable
        state = next_state
        epochs += 1

    # every 100 episodes we print the episode number
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

4. Visualize what the agent is doing

In [None]:
# watch trained agent
state, info = env.reset()
done = False
rewards = 0
max_steps = 100

for s in range(max_steps):

    print(f"TRAINED AGENT")
    print("Step {}".format(s+1))

    action = np.argmax(q_table[state])
    new_state, reward, done, _, info = env.step(action)
    rewards += reward
    plt.imshow(env.render())
    plt.show()
    print(f"score: {rewards}")
    state = new_state

    if done:
        break

env.close()

5. Looks like the agent is moving, let's see what its average reward is across one hundred episodes under the greedy policy.

In [None]:
"""Evaluate agent's performance after Q-learning"""

total_epochs, total_penalties = 0, 0
episodes = 100
cumulated_reward = 0

for i in range(episodes):
    clear_output(wait=True)
    print(i)
    state, info = env.reset()
    epochs, penalties, reward = 0, 0, 0

    done = False

    while not done:
        # this time we use the greedy policy
        action = np.argmax(q_table[state])
        state, reward, done, _, info = env.step(action)

        cumulated_reward += reward

        epochs += 1

    total_epochs += epochs


print(f"Results after {episodes} episodes:")
print(f"Average reward per episode: {cumulated_reward / episodes}")

6. The agent has learned something, since it reached the goal in 6% of trials. Is there a way of getting a reward of 1 in 100% of episodes? Try running the algorithm for 500 000 more episodes to see whether the score improves!

In [None]:
%%time
# magic command for measuring the cell's execution time
"""Training the agent"""


# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

# Loop over a certain number of episodes
for i in range(1, 500001):
    state, info = env.reset() # start by re-initializing the environment

    epochs, penalties, reward, = 0, 0, 0
    done = False

    while not done: # starting a while loop that will keep running until the termination of an episode
        # We then define the epsilon greedy policy
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        elif np.min(q_table[state]) == np.max(q_table[state]):
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        # take action and get next state information
        next_state, reward, done, _, info = env.step(action)

        # update q table using the algorithm formula
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max) # decaying average of the old value and the new estimated value
        q_table[state, action] = new_value

        # update the state variable
        state = next_state
        epochs += 1

    # every 100 episodes we print the episode number
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

In [None]:
"""Evaluate agent's performance after Q-learning"""

total_epochs, total_penalties = 0, 0
episodes = 1000
cumulated_reward = 0

for i in range(episodes):
    clear_output(wait=True)
    print(i)
    state, info = env.reset()
    epochs, penalties, reward = 0, 0, 0

    done = False

    while not done:
        # this time we use the greedy policy
        action = np.argmax(q_table[state])
        state, reward, done, _, info = env.step(action)

        cumulated_reward += reward

        epochs += 1

    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average reward per episode: {cumulated_reward / episodes}")

7. Looks like in this case, more training does not let us win every time! Try to pick the actions manually to get a 100% chance of winning!

In [None]:
env.reset()

In [None]:
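done = False
action = 3  # UP: on the top row the agent either stays put or slips LEFT/RIGHT
state = 0   # the start cell, as returned by env.reset() in the previous cell
while not done:
    if state == 7:
        # top-right corner reached: RIGHT either stays put or slips UP/DOWN
        # along the last column, which contains no holes
        action = 2
    obs = env.step(action)
    state, reward, done, _, info = obs
    print(obs)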

It's technically feasible to reach the goal with probability 100% when picking the exact right policy, and even though the Q learning algorithm should ultimately converge to the optimal policy it may be very computationally expensive to get there, even with such a simple problem!
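The claim can be checked empirically with a small simulation. The sketch below does not use Gymnasium: it re-implements the slippery dynamics in plain Python, assuming the default 8x8 layout and the one-third slipping rule described earlier. The policy is expressed per column (`safe_policy`, a hypothetical name), which is equivalent to the manual loop above because the agent never leaves the last column once it gets there.

```python
import random

# Default 8x8 layout of FrozenLake-v1, copied here as an assumption.
GRID = [
    "SFFFFFFF", "FFFFFFFF", "FFFHFFFF", "FFFFFHFF",
    "FFFHFFFF", "FHHFFFHF", "FHFFHFHF", "FFFHFFFG",
]
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # LEFT, DOWN, RIGHT, UP
UP, RIGHT = 3, 2


def safe_policy(row, col):
    """UP everywhere except in the last column, where the action is RIGHT."""
    return RIGHT if col == len(GRID) - 1 else UP


def run_episode(rng, max_steps=100_000):
    """Return True if the episode ends on the goal, False otherwise."""
    row, col = 0, 0
    for _ in range(max_steps):
        intended = safe_policy(row, col)
        # Slippery ice: the intended direction or one of its two perpendiculars.
        actual = rng.choice([(intended - 1) % 4, intended, (intended + 1) % 4])
        d_row, d_col = MOVES[actual]
        row = min(max(row + d_row, 0), len(GRID) - 1)
        col = min(max(col + d_col, 0), len(GRID) - 1)
        if GRID[row][col] in "GH":
            return GRID[row][col] == "G"
    return False  # safety cap, not expected to be reached


rng = random.Random(0)
wins = sum(run_episode(rng) for _ in range(1_000))
print(f"{wins} wins out of 1000 episodes")
```

Under these assumptions every episode should end on the goal, because the agent only ever walks along the hole-free top row and last column. The price is a long, wandering trajectory, which hints at why a discounted Q-learning agent has a hard time discovering this policy on its own.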