![Logo](../assets/logo.png)

Made by **Domonkos Nagy**

[<img src="https://colab.research.google.com/assets/colab-badge.svg">](https://colab.research.google.com/github/Fortuz/rl_education/blob/main/5.%20Temporal%20Difference/frozen_lake_solution.ipynb)

# Frozen Lake (solution)

Frozen Lake involves crossing a frozen lake from the start to the goal without falling into any holes. Because the ice is slippery, the player may not always move in the intended direction.

The game starts with the player at location [0,0] of the frozen lake grid world, with the goal located at the far extent of the world, e.g. [3,3] for the 4x4 environment.
Holes in the ice are distributed in set locations.
The player makes moves until they reach the goal or fall in a hole.
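
The slippery movement can be sketched in a few lines. This assumes the default `is_slippery=True` behaviour described in the Gymnasium documentation, where the intended direction and the two perpendicular directions are each taken with probability 1/3. The helper `slip` is a hypothetical name used only for illustration and is not part of Gymnasium.

```python
import random
from collections import Counter

# Action encoding used by FrozenLake: 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP
ACTION_NAMES = ["LEFT", "DOWN", "RIGHT", "UP"]

def slip(action, rng):
    """Return the direction actually taken for an intended action (hypothetical helper)."""
    # In this encoding, the neighbouring codes are the two perpendicular directions
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return rng.choice(candidates)

rng = random.Random(0)
# Intend RIGHT 3000 times and count where the player actually goes
counts = Counter(ACTION_NAMES[slip(2, rng)] for _ in range(3000))
print(counts)
```

Under this assumption the player never moves opposite to the intended direction, but ends up going sideways about two thirds of the time.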

<img src="assets/frozen_lake.gif" width="500"/>

In Frozen Lake, the states are the positions in the grid world (integers 0-15), and the actions are LEFT, DOWN, RIGHT and UP (integers 0-3, in that order). The reward is 1 for reaching the goal and 0 otherwise (even for falling in a hole).
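
As a small illustration of the state encoding, the sketch below converts between a state index and its grid position. It assumes row-major numbering on the 4x4 map (state = row * 4 + column), as in the Gymnasium documentation; the helper names are hypothetical.

```python
N_COLS = 4  # width of the default 4x4 map

def state_to_pos(state):
    """Convert a state index (0-15) to a (row, column) pair."""
    return divmod(state, N_COLS)

def pos_to_state(row, col):
    """Convert a (row, column) pair back to a state index."""
    return row * N_COLS + col

print(state_to_pos(0))     # start position
print(state_to_pos(15))    # goal position
print(pos_to_state(1, 1))  # index of the cell in row 1, column 1
```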

This notebook uses *Q-learning* to find the optimal policy in the `FrozenLake-v1` Gymnasium environment.

- This notebook is based on Chapter 6 of the book *Reinforcement Learning: An Introduction (2nd ed.)* by R. Sutton & A. Barto, available at http://incompleteideas.net/book/the-book-2nd.html
- Documentation for the Frozen Lake environment: https://gymnasium.farama.org/environments/toy_text/frozen_lake/

In [1]:
# Install dependencies if running in Colab
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install gymnasium==0.29.0

In [2]:
import numpy as np
import gymnasium as gym
from gymnasium.wrappers import RecordVideo
import time
from tqdm.notebook import trange
import matplotlib.pyplot as plt
import pickle
import ipywidgets as widgets
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Argmax function that breaks ties randomly
def argmax(arr):
    arr_max = np.max(arr)
    return np.random.choice(np.where(arr == arr_max)[0])

In [4]:
# Hyperparameters
N_EPISODES = 10_000 # Number of training episodes
MAX_STEPS_PER_EPISODE = 100  # Maximum number of steps per episode before truncation
EPSILON = 1  # Initial exploration
EPSILON_MIN = 0.001  # Final exploration
EPSILON_DECAY = (2 * EPSILON) / N_EPISODES  # Exploration decay rate
ALPHA = 0.1  # Learning rate
GAMMA = 0.98  # Discount factor
N_RECORDINGS = 3  # Number of episodes to record
REC_EPISODES = np.linspace(0, N_EPISODES-1, num=N_RECORDINGS, dtype=int)  # Episodes to record
LOG_FREQ = N_EPISODES // 10  # Progress log frequency (in episodes)
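
The exploration schedule implied by these hyperparameters can be previewed without running any training. The sketch below is a standalone illustration that copies the constants above and applies the same `max(EPSILON - EPSILON_DECAY, EPSILON_MIN)` rule used in the training loop; `EPSILON_START` and `epsilon_after` are hypothetical names, not part of the notebook.

```python
EPSILON_START = 1.0   # same value as EPSILON above
EPSILON_MIN = 0.001
N_EPISODES = 10_000
EPSILON_DECAY = (2 * EPSILON_START) / N_EPISODES

def epsilon_after(n_episodes):
    """Epsilon after n_episodes linear decay steps, floored at EPSILON_MIN."""
    eps = EPSILON_START
    for _ in range(n_episodes):
        eps = max(eps - EPSILON_DECAY, EPSILON_MIN)
    return eps

for n in (0, 2_500, 5_000, 10_000):
    print(n, round(epsilon_after(n), 4))
```

With these values the floor is reached about halfway through training, after which the agent acts almost entirely greedily.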

In [5]:
# Create environment
base_env = gym.make('FrozenLake-v1', render_mode='rgb_array')
# Wrap environment to record videos throughout the learning process 
trigger = lambda ep: ep in REC_EPISODES
env = RecordVideo(base_env, video_folder="./videos", episode_trigger=trigger, disable_logger=True)

In [6]:
# Initialize Q-table
action_space_size = env.action_space.n
observation_space_size = env.observation_space.n
q_table_shape = observation_space_size, action_space_size
q_table = np.zeros(q_table_shape)
print("Q-TABLE:")
print(q_table)

Q-TABLE:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


## Q-learning

Q-learning combines ideas from both *Dynamic Programming* (DP) and *Monte Carlo* (MC) methods. Like MC, Q-learning learns from simulated episodes rather than from a model of the environment. However, while MC waits until the end of an episode and updates its estimates from the observed return, Q-learning updates its estimates after every step, based on other learned estimates and without waiting for a final outcome. This property makes Q-learning a *bootstrapping* method, like DP.
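
To make the contrast concrete, here is a minimal sketch that computes a Monte Carlo target and a one-step bootstrapped target. The reward sequence and the next-state estimates are made up for illustration, the function names are hypothetical, and the bootstrapped target takes the maximum over next-state estimates, as Q-learning does.

```python
GAMMA = 0.98  # discount factor, same value as in the hyperparameters

def mc_return(rewards, gamma=GAMMA):
    """Discounted return from the rewards of a finished episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def bootstrapped_target(reward, next_q_values, gamma=GAMMA):
    """One-step target built from the current estimates for the next state."""
    return reward + gamma * max(next_q_values)

# Made-up episode: two steps without reward, then the goal is reached
print(mc_return([0, 0, 1]))
# Made-up estimates for the next state, usable after a single step
print(bootstrapped_target(0, [0.2, 0.5, 0.1, 0.4]))
```

The first target needs the whole episode, while the second is available as soon as one transition has been observed.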

The update rule for Q-learning is:

$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t,A_t)] $$

where $\alpha \in (0, 1]$ is a constant step-size parameter and $\gamma \in [0, 1]$ is the discount rate.
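
A single application of this rule can be sketched on a toy table. The two-state table and its values are made up for illustration, and `q_update` is a hypothetical helper rather than the notebook's training code.

```python
ALPHA = 0.1   # step size
GAMMA = 0.98  # discount rate

def q_update(q_table, state, action, reward, next_state, alpha=ALPHA, gamma=GAMMA):
    """Apply one tabular Q-learning update in place and return the TD error."""
    td_target = reward + gamma * max(q_table[next_state])
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return td_error

# Toy table with 2 states and 2 actions (values are made up)
q = [[0.0, 0.0],
     [0.5, 0.2]]
q_update(q, state=0, action=1, reward=0.0, next_state=1)
print(q[0][1])
```

Only the entry for the visited state-action pair changes, and it moves a fraction $\alpha$ of the way towards the target.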

In [7]:
# Re-initialize environment and Q-table
env = RecordVideo(base_env, video_folder="./videos", episode_trigger=trigger, disable_logger=True)
q_table = np.zeros(q_table_shape)
sum_rewards = 0

# Training loop
for episode in trange(N_EPISODES):
    obs, _ = env.reset()
    done = False
    step = 0

    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() > EPSILON:
            action = argmax(q_table[obs])
        else:
            action = env.action_space.sample()
            
        # Take step
        new_obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated or step >= MAX_STEPS_PER_EPISODE  # Stop on termination, env truncation, or the manual step cap

        # Update Q-table
        q_table[obs, action] += ALPHA * (reward + GAMMA * np.max(q_table[new_obs]) - q_table[obs, action])

        # Store reward and new state
        sum_rewards += reward
        obs = new_obs
        step += 1

    # Decay epsilon
    EPSILON = max(EPSILON - EPSILON_DECAY, EPSILON_MIN)

    # Log results
    if (episode + 1) % LOG_FREQ == 0:
        print(f'Episode {episode + 1} : avg={sum_rewards / LOG_FREQ}')
        sum_rewards = 0

# Save Q-table
with open('q_table.bin', 'wb') as f:
    pickle.dump(q_table, f)

# Close environment
env.close()


Episode 1000 : avg=0.018
Episode 2000 : avg=0.032
Episode 3000 : avg=0.065
Episode 4000 : avg=0.152
Episode 5000 : avg=0.416
Episode 6000 : avg=0.714
Episode 7000 : avg=0.729
Episode 8000 : avg=0.767
Episode 9000 : avg=0.745
Episode 10000 : avg=0.707


In [8]:
# Print updated Q-table
print("Q-TABLE:")
print(q_table)

Q-TABLE:
[[0.34798346 0.31604709 0.31920247 0.31920109]
 [0.27577851 0.20877358 0.18096888 0.30864357]
 [0.25478524 0.26154585 0.25434942 0.25457424]
 [0.1673691  0.16800298 0.19078848 0.25826826]
 [0.3744031  0.31658972 0.18656963 0.2815639 ]
 [0.         0.         0.         0.        ]
 [0.11431091 0.07316903 0.24596349 0.09962938]
 [0.         0.         0.         0.        ]
 [0.31589313 0.28015899 0.27582565 0.40717594]
 [0.37909307 0.45413064 0.41051007 0.29902925]
 [0.48756108 0.31962601 0.29624658 0.26183056]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.44713157 0.42553751 0.59194665 0.4338701 ]
 [0.59748442 0.75547702 0.68296786 0.66044644]
 [0.         0.         0.         0.        ]]
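
One way to read this table is to extract the greedy action for each state. The sketch below does so for the values printed above, rounded to three decimals, assuming the Gymnasium action encoding 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP. All-zero rows belong to terminal states (holes or the goal) and are shown as `.`; the helper `greedy_action` is hypothetical.

```python
ARROWS = ["<", "v", ">", "^"]  # 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP

# Learned Q-values from the output above, rounded to three decimals
q_table = [
    [0.348, 0.316, 0.319, 0.319],
    [0.276, 0.209, 0.181, 0.309],
    [0.255, 0.262, 0.254, 0.255],
    [0.167, 0.168, 0.191, 0.258],
    [0.374, 0.317, 0.187, 0.282],
    [0.0, 0.0, 0.0, 0.0],
    [0.114, 0.073, 0.246, 0.100],
    [0.0, 0.0, 0.0, 0.0],
    [0.316, 0.280, 0.276, 0.407],
    [0.379, 0.454, 0.411, 0.299],
    [0.488, 0.320, 0.296, 0.262],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.447, 0.426, 0.592, 0.434],
    [0.597, 0.755, 0.683, 0.660],
    [0.0, 0.0, 0.0, 0.0],
]

def greedy_action(q_values):
    """Index of the largest value (ties go to the lowest index, unlike the random argmax)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

# None marks states whose row was never updated
policy = [greedy_action(row) if any(row) else None for row in q_table]

# Print the policy as a 4x4 grid
for r in range(4):
    print(" ".join("." if a is None else ARROWS[a] for a in policy[r * 4:(r + 1) * 4]))
```

Some of the greedy actions point away from the goal, which is not unusual on slippery ice: an action can be preferred because none of its possible outcomes leads into a hole.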


## Results

You can watch the videos recorded throughout the training process here:

In [9]:
# Display recordings
children = [widgets.Video.from_file(f'./videos/rl-video-episode-{episode}.mp4', autoplay=False, loop=False, width=500) for episode in REC_EPISODES]
tab = widgets.Tab()
tab.children = children
titles = tuple([f'Episode {episode + 1:,}' for episode in REC_EPISODES])
for i in range(len(children)):
    tab.set_title(i, titles[i])
display(tab)
