![Logo](../assets/logo.png)

Made by **Domonkos Nagy**

[<img src="https://colab.research.google.com/assets/colab-badge.svg">](https://colab.research.google.com/github/Fortuz/rl_education/blob/main/5.%20Temporal%20Difference/frozen_lake.ipynb)

# Frozen Lake

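Frozen Lake involves crossing a frozen lake from start to goal without falling into any holes. Because the ice is slippery, the player may not always move in the intended direction: with the default `is_slippery=True`, the player moves in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3.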

The game starts with the player at location [0,0] of the frozen lake grid world, with the goal located at the far extent of the world, e.g. [3,3] for the 4x4 environment.
Holes in the ice are distributed in set locations.
The player makes moves until they reach the goal or fall in a hole.

<img src="assets/frozen_lake.gif" width="400"/>

In Frozen Lake, the states are the positions in the grid world (integers 0-15), and the actions are LEFT, DOWN, RIGHT and UP (integers 0-3, in that order). The reward is 1 for reaching the goal and 0 otherwise (even for falling in a hole).
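The state integers can be related back to grid coordinates. The sketch below assumes the default 4x4 map and the row-major numbering described in the Gymnasium documentation (`state = row * n_cols + col`); the helper names `state_to_position` and `position_to_state` are hypothetical and not part of Gymnasium.

```python
N_COLS = 4  # assumption: the default 4x4 map


def state_to_position(state, n_cols=N_COLS):
    """Convert a flat state index into a (row, col) pair."""
    return divmod(state, n_cols)


def position_to_state(row, col, n_cols=N_COLS):
    """Convert a (row, col) pair into a flat state index."""
    return row * n_cols + col


print(state_to_position(0))     # (0, 0) -> start
print(state_to_position(15))    # (3, 3) -> goal
print(position_to_state(1, 2))  # 6
```

This is mostly useful when reading the Q-table: row 15 of the table belongs to the goal cell [3,3].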

This notebook uses *Q-learning* to approximate the optimal policy in the `FrozenLake-v1` Gymnasium environment.

- This notebook is based on Chapter 6 of the book *Reinforcement Learning: An Introduction (2nd ed.)* by R. Sutton & A. Barto, available at http://incompleteideas.net/book/the-book-2nd.html
- Documentation for the Frozen Lake environment: https://gymnasium.farama.org/environments/toy_text/frozen_lake/

In [None]:
# Install dependencies if running in Colab
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install gymnasium==0.29.0

In [None]:
import numpy as np
import gymnasium as gym
from gymnasium.wrappers import RecordVideo
import time
from tqdm.notebook import trange
import pickle
import ipywidgets as widgets
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Argmax function that breaks ties randomly
def argmax(arr):
    arr_max = np.max(arr)
    return np.random.choice(np.where(arr == arr_max)[0])

In [None]:
# Hyperparameters
N_EPISODES = 10_000  # Number of training episodes
MAX_STEPS_PER_EPISODE = 100  # Number of steps before truncation (gym.make already applies a 100-step time limit to the default 4x4 FrozenLake-v1)
EPSILON_MAX = 1  # Initial exploration
EPSILON_MIN = 0.001  # Final exploration
EPSILON_DECAY = 2 * EPSILON_MAX / N_EPISODES  # Exploration decay rate
ALPHA = 0.1  # Learning rate
GAMMA = 0.98  # Discount factor
N_RECORDINGS = 3  # Number of episodes to record
REC_EPISODES = np.linspace(0, N_EPISODES-1, num=N_RECORDINGS, dtype=int)  # Episodes to record
LOG_FREQ = N_EPISODES // 10  # Progress log frequency (in episodes)

In [None]:
# Create environment
base_env = gym.make('FrozenLake-v1', render_mode='rgb_array')
# Wrap environment to record videos throughout the learning process 
trigger = lambda ep: ep in REC_EPISODES
env = RecordVideo(base_env, video_folder="./videos", episode_trigger=trigger, disable_logger=True)

In [None]:
# Initialize Q-table
observation_space_size = env.observation_space.n
action_space_size = env.action_space.n
q_table_shape = observation_space_size, action_space_size
q_table = np.zeros(q_table_shape)
print("Q-TABLE: (shape =", q_table.shape, ")")
print(q_table)

## Q-learning

Q-learning combines ideas from both *Dynamic Programming* and *Monte Carlo* methods. Like MC, Q-learning learns from simulated episodes and needs no model of the environment. However, while MC waits until the end of an episode and updates its estimates toward the observed return, Q-learning updates estimates based on other learned estimates after every step, without waiting for a final outcome. This property makes Q-learning a *bootstrapping* method, like DP.
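The difference between the two kinds of update target can be illustrated with a few numbers. The values below (`rewards`, `max_q_next`) are made up for illustration and do not come from a real run.

```python
GAMMA = 0.98

# Hypothetical rewards R_1, R_2, R_3 observed during one three-step episode
rewards = [0.0, 0.0, 1.0]

# Monte Carlo target for the first step: the full discounted return,
# which is only known once the episode has finished
mc_target = sum(GAMMA**k * r for k, r in enumerate(rewards))

# Bootstrapped target for the first step: the immediate reward plus the
# discounted current estimate for the next state, available after one step
max_q_next = 0.5  # hypothetical current estimate for the best action in the next state
td_target = rewards[0] + GAMMA * max_q_next

print(mc_target)
print(td_target)
```

The bootstrapped target depends on the quality of the current estimate, so it can be biased early in training, but it is available immediately.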

The update rule for Q-learning is:

$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t,A_t)] $$

where $\alpha \in (0, 1]$ is a constant step-size parameter and $\gamma \in [0, 1]$ is the discount rate.
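As a worked instance of the rule, the sketch below applies it to two made-up transitions on a zero-initialized 16x4 table. The function name `q_learning_update` and its `terminated` argument are assumptions for illustration; the bootstrap term is dropped on terminal transitions because terminal states have value 0.

```python
import numpy as np


def q_learning_update(q, state, action, reward, next_state, terminated, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    # Terminal states have value 0, so there is nothing to bootstrap from
    best_next = 0.0 if terminated else np.max(q[next_state])
    td_error = reward + gamma * best_next - q[state, action]
    q[state, action] += alpha * td_error
    return td_error


q = np.zeros((16, 4))

# Hypothetical transition: state 14, action 2, reaching the goal (state 15) with reward 1
q_learning_update(q, 14, 2, 1.0, 15, True, alpha=0.1, gamma=0.98)
print(q[14, 2])  # 0 + 0.1 * (1 + 0 - 0), i.e. about 0.1

# Hypothetical later transition: state 13, action 2, landing in state 14 with reward 0
q_learning_update(q, 13, 2, 0.0, 14, False, alpha=0.1, gamma=0.98)
print(q[13, 2])  # 0 + 0.1 * (0 + 0.98 * 0.1 - 0), i.e. about 0.0098
```

The second update shows bootstrapping at work: state 13 receives value even though its own reward was 0, because the estimate for state 14 is already positive.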

***

### **Your Task**

Implement this algorithm! The block below only contains the code necessary for logging the average reward every `LOG_FREQ` episodes. The algorithm itself is up to you! Pseudocode for this algorithm is shown in the box below.

<img src="assets/q-learning.png" width="700"/>

*Pseudocode from page 131 of the Sutton & Barto book*

#### **Hints:**

- Use the array defined above: `q_table` corresponds to $Q$ in the pseudocode.
- Instead of `np.argmax`, use the `argmax` function defined above!
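To see why the tie-breaking hint matters, consider a row of a freshly initialized Q-table, where all values are equal. The sketch below uses a stand-in named `random_argmax` with an explicit random generator `rng` (both hypothetical names) that follows the same idea as the notebook's `argmax`.

```python
import numpy as np


def random_argmax(arr, rng):
    """Index of a maximal entry, chosen uniformly at random among ties."""
    return rng.choice(np.flatnonzero(arr == np.max(arr)))


rng = np.random.default_rng(0)
q_row = np.zeros(4)  # one row of a zero-initialized Q-table

# np.argmax always returns the first maximal index
first_only = {int(np.argmax(q_row)) for _ in range(100)}
# Random tie-breaking spreads the choice over all tied actions
spread = {int(random_argmax(q_row, rng)) for _ in range(100)}

print(first_only)  # {0}
print(spread)
```

With `np.argmax`, the greedy choice in every unvisited state would be action 0, which biases early behaviour toward a single direction.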

***

In [None]:
# Re-initialize environment
env = RecordVideo(base_env, video_folder="./videos", episode_trigger=trigger, disable_logger=True)
sum_rewards = 0

# Training loop
for episode in trange(N_EPISODES):
    
    ############## CODE HERE ###################
        




    ############################################

    # Log results
    if (episode + 1) % LOG_FREQ == 0:
        print(f'Episode {episode + 1} : avg={sum_rewards / LOG_FREQ}')
        sum_rewards = 0

# Save Q-table
with open('q_table.bin', 'wb') as f:
    pickle.dump(q_table, f)

# Close environment
env.close()

In [None]:
# Print updated Q-table
print("Q-TABLE:")
print(q_table)

## Results

You can watch the videos recorded throughout the training process here:

In [None]:
# Display recordings
children = [widgets.Video.from_file(f'./videos/rl-video-episode-{episode}.mp4', autoplay=False, loop=False, width=500) for episode in REC_EPISODES]
tab = widgets.Tab()
tab.children = children
titles = tuple([f'Episode {episode + 1:,}' for episode in REC_EPISODES])
for i in range(len(children)):
    tab.set_title(i, titles[i])
display(tab)