![Logo](../assets/logo.png)

Made by **Domonkos Nagy**

[<img src="../assets/open_button.png">](https://colab.research.google.com/github/Fortuz/rl_education/blob/main/5.%20Temporal%20Difference/frozen_lake.ipynb)

# Frozen Lake

In Frozen Lake, the player crosses a frozen lake from start to goal without falling into any holes. Because the ice is slippery, the player may not always move in the intended direction.

The game starts with the player at location [0,0] of the frozen lake grid world, with the goal located at the far corner of the world, e.g. [3,3] for the 4x4 environment.
Holes in the ice are at fixed locations.
The player keeps moving until they reach the goal or fall into a hole.
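The slippery movement can be illustrated with a small simulation. This is a minimal sketch that assumes the default dynamics described in the Gymnasium documentation (the intended direction and the two perpendicular directions are each taken with probability 1/3) and the action encoding 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP. The helper `slippery_outcome` is hypothetical and not part of the library.

```python
import random
from collections import Counter

# Action encoding assumed here: 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP
ACTION_NAMES = ["LEFT", "DOWN", "RIGHT", "UP"]


def slippery_outcome(action, rng):
    """Return the direction actually taken for an intended action.

    Assumption: the intended direction and its two perpendicular
    neighbours are each chosen with probability 1/3.
    """
    return rng.choice([(action - 1) % 4, action, (action + 1) % 4])


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_samples = 30_000
counts = Counter(slippery_outcome(2, rng) for _ in range(n_samples))  # intend RIGHT

for a, n in sorted(counts.items()):
    print(f"{ACTION_NAMES[a]:>5}: {n / n_samples:.3f}")
```

Under these assumptions, the player never moves opposite to the intended direction, but ends up moving sideways about two thirds of the time.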

![Example image](assets/frozen_lake.png)

This problem can be formulated as a finite, undiscounted MDP, where the states are the positions in the grid world, the actions are UP, DOWN, LEFT and RIGHT, and the reward is 1 for reaching the goal and 0 otherwise (even for falling into a hole). In this example, we use the `FrozenLake-v1` environment from the `Gymnasium` library to represent the problem, and use *Q-learning* to solve it. Note that the training code below nevertheless uses a discount rate slightly below 1 (`GAMMA = 0.98`), which favors shorter paths to the goal.

- Documentation for the Frozen Lake environment: https://gymnasium.farama.org/environments/toy_text/frozen_lake/
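The environment reports each state as a single integer rather than a grid position. The sketch below shows the conversion, assuming the row-major encoding `row * n_cols + col` given in the Gymnasium documentation. The helper names `to_state` and `to_position` are hypothetical.

```python
N_COLS = 4  # width of the default 4x4 map


def to_state(row, col, n_cols=N_COLS):
    """Convert a grid position to the integer state index."""
    return row * n_cols + col


def to_position(state, n_cols=N_COLS):
    """Convert an integer state index back to (row, col)."""
    return divmod(state, n_cols)


print(to_state(0, 0))   # start position -> 0
print(to_state(3, 3))   # goal position  -> 15
print(to_position(14))  # -> (3, 2)
```

With this encoding, the 4x4 map has 16 states, one per row of the Q-table.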

In [7]:
import numpy as np
import gymnasium as gym
from gymnasium.wrappers import RecordVideo
import time
from tqdm.notebook import trange
from IPython import display
import matplotlib.pyplot as plt
import pickle
import ipywidgets as widgets

In [8]:
base_env = gym.make('FrozenLake-v1', render_mode='rgb_array')  # creating the environment

In [9]:
# initializing q-table
action_space_size = base_env.action_space.n
observation_space_size = base_env.observation_space.n

q_table = np.zeros((observation_space_size, action_space_size))
print("Q-TABLE:")
print(q_table)

Q-TABLE:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [10]:
# hyperparameters
N_EPISODES = 10_000
MAX_STEPS_PER_EPISODE = 100

ALPHA = 0.1  # learning rate
GAMMA = 0.98  # discount rate

EPSILON = 1  # exploration rate
EPSILON_MIN = 0.001
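# linear decay per episode; EPSILON reaches EPSILON_MIN about halfway through training
EPSILON_DECAY = (2 * EPSILON) / N_EPISODES

LOG_RATE = N_EPISODES // 10  # print the average reward every LOG_RATE episodes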
REC_EPISODES = np.linspace(0, N_EPISODES-1, num=3, dtype=int)
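To see what the exploration settings above imply, the linear decay can be written in closed form. This is a sketch only: `epsilon_after` is a hypothetical helper, `EPSILON_START` stands in for the initial value of `EPSILON`, and the training loop subtracts the decay step by step, so its values may differ by floating-point rounding.

```python
N_EPISODES = 10_000
EPSILON_START = 1.0  # initial exploration rate
EPSILON_MIN = 0.001
EPSILON_DECAY = (2 * EPSILON_START) / N_EPISODES


def epsilon_after(episodes_done):
    """Exploration rate after a number of completed episodes (closed form)."""
    return max(EPSILON_START - episodes_done * EPSILON_DECAY, EPSILON_MIN)


for k in (0, 2_500, 5_000, 10_000):
    print(k, round(epsilon_after(k), 4))
```

The agent explores almost purely at random at first, and acts almost greedily for the entire second half of training.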

In [11]:
# wrap environment
trigger = lambda t: t in REC_EPISODES
env = RecordVideo(base_env, video_folder="./videos", episode_trigger=trigger, disable_logger=True)



## Q-learning

Q-learning combines ideas from both *Dynamic Programming* and *Monte Carlo* methods. Like MC, Q-learning learns from simulated episodes rather than from a model of the environment.
However, the update rules of the two methods differ in an important way: MC waits until the end of an episode and updates its estimates toward the
observed return, whereas Q-learning utilizes *bootstrapping*, that is, it updates estimates after every step based on other learned estimates, without
waiting for a final outcome.
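The difference between the two kinds of update target can be sketched in a few lines. The episode rewards and the value estimates below are made-up numbers for illustration, and `mc_return` and `td_target` are hypothetical helper names.

```python
GAMMA = 0.98  # discount rate


def mc_return(rewards, gamma=GAMMA):
    """Discounted return of a finished episode, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_target(reward, next_state_values, gamma=GAMMA):
    """One-step bootstrapped target: reward plus discounted best estimate."""
    return reward + gamma * max(next_state_values)


# Hypothetical episode that reaches the goal on its third step
rewards = [0.0, 0.0, 1.0]
print(mc_return(rewards))  # needs the whole episode

# Hypothetical current estimates for the four actions in the next state
print(td_target(0.0, [0.1, 0.5, 0.2, 0.3]))  # needs only one step
```

The Monte Carlo target is available only once the episode has ended, while the bootstrapped target can be computed immediately after a single transition.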

The update rule for Q-learning looks like this:

$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t,A_t) \right] $$

where $\alpha \in (0, 1]$ is a constant step-size parameter and $\gamma \in [0, 1]$ is the discount rate.
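A single application of this rule can be sketched as follows. The two-state table and the transition are made up for illustration, and `q_learning_update` is a hypothetical helper. The training loop in this notebook uses the algebraically equivalent form $(1 - \alpha) Q(S_t,A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a)]$.

```python
def q_learning_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    td_target = reward + gamma * max(q[next_state])
    td_error = td_target - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


# Hypothetical table with 2 states and 2 actions
q = [[0.0, 0.0],
     [0.5, 1.0]]

err = q_learning_update(q, state=0, action=1, reward=0.0,
                        next_state=1, alpha=0.1, gamma=0.98)
print(err)      # TD error: 0.98 * 1.0 - 0.0
print(q[0][1])  # estimate moved 10% of the way toward the target
```

Even though the reward is 0, the estimate for state 0 increases, because value propagates backwards from the next state's estimate.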

In [13]:
# env is already wrapped in RecordVideo in the cell above, so it is not wrapped again here
sum_rewards = 0

for episode in trange(N_EPISODES):
    state, _ = env.reset()
    done = False

    for step in range(MAX_STEPS_PER_EPISODE):
        # epsilon-greedy action selection
        if np.random.rand() > EPSILON:
            action = np.argmax(q_table[state, :])
        else:
            action = env.action_space.sample()

        new_state, reward, done, truncated, info = env.step(action)

        # updating q-table
        q_table[state, action] = q_table[state, action] * (1 - ALPHA) + \
            ALPHA * (reward + GAMMA * np.max(q_table[new_state, :]))

        state = new_state
        sum_rewards += reward

        if done or truncated:  # stop on a terminal state or when the time limit is hit
            break

    # updating epsilon
    EPSILON = max(EPSILON - EPSILON_DECAY, EPSILON_MIN)

    # logging the results
    if (episode + 1) % LOG_RATE == 0:
        print(f'Episode {episode + 1} : avg={sum_rewards / LOG_RATE}')
        sum_rewards = 0

# saving the q-table
with open('q_table.bin', 'wb') as f:
    pickle.dump(q_table, f)


Episode 1000 : avg=0.726
Episode 2000 : avg=0.761
Episode 3000 : avg=0.743
Episode 4000 : avg=0.727
Episode 5000 : avg=0.731
Episode 6000 : avg=0.742
Episode 7000 : avg=0.744
Episode 8000 : avg=0.725
Episode 9000 : avg=0.702
Episode 10000 : avg=0.764


In [19]:
# Print updated Q-table
print("Q-TABLE:")
print(q_table)

Q-TABLE:
[[0.37466491 0.34553799 0.32557899 0.33504751]
 [0.25089242 0.24986768 0.17460412 0.36649769]
 [0.24799974 0.25409921 0.24602968 0.32296225]
 [0.23261584 0.13045663 0.22034271 0.29413206]
 [0.39781504 0.31866799 0.30784145 0.20311995]
 [0.         0.         0.         0.        ]
 [0.12552234 0.13038634 0.19133641 0.11120052]
 [0.         0.         0.         0.        ]
 [0.28874683 0.36867847 0.26763989 0.44445408]
 [0.33792861 0.50095111 0.33415657 0.22190402]
 [0.53087826 0.33426376 0.22245882 0.30284547]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.37585708 0.48888763 0.59097015 0.41864528]
 [0.6355178  0.72732036 0.64952208 0.67990017]
 [0.         0.         0.         0.        ]]
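The learned values are easier to interpret as a greedy policy on the grid. The sketch below copies the Q-table printed above (rounded to four decimals) and picks the highest-valued action in each state, assuming the action encoding 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP. All-zero rows (holes and the goal) are shown as `.`. The helper `greedy_symbol` is hypothetical.

```python
ACTIONS = "LDRU"  # 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP

# Q-table from the output above, rounded to four decimals
Q_TABLE = [
    [0.3747, 0.3455, 0.3256, 0.3350],
    [0.2509, 0.2499, 0.1746, 0.3665],
    [0.2480, 0.2541, 0.2460, 0.3230],
    [0.2326, 0.1305, 0.2203, 0.2941],
    [0.3978, 0.3187, 0.3078, 0.2031],
    [0.0, 0.0, 0.0, 0.0],
    [0.1255, 0.1304, 0.1913, 0.1112],
    [0.0, 0.0, 0.0, 0.0],
    [0.2887, 0.3687, 0.2676, 0.4445],
    [0.3379, 0.5010, 0.3342, 0.2219],
    [0.5309, 0.3343, 0.2225, 0.3028],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3759, 0.4889, 0.5910, 0.4186],
    [0.6355, 0.7273, 0.6495, 0.6799],
    [0.0, 0.0, 0.0, 0.0],
]


def greedy_symbol(q_row):
    """Letter of the highest-valued action, or '.' for an all-zero row."""
    if not any(q_row):
        return "."
    return ACTIONS[max(range(len(q_row)), key=q_row.__getitem__)]


policy_rows = [
    " ".join(greedy_symbol(Q_TABLE[row * 4 + col]) for col in range(4))
    for row in range(4)
]
print("\n".join(policy_rows))
# L U U U
# L . R .
# U D L .
# . R D .
```

Several of the greedy actions point away from the goal or into a wall. On slippery ice this can be the safer choice, because it rules out sliding into an adjacent hole.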


In [21]:
children = [widgets.Video.from_file(f'./videos/rl-video-episode-{episode}.mp4', autoplay=False, loop=False, width=500) for episode in REC_EPISODES]
tab = widgets.Tab()
tab.children = children
tab.titles = tuple([f'Episode {episode+1}' for episode in REC_EPISODES])
tab
