# Part 2: FrozenLake

FrozenLake is a small gridworld problem where you must get from the top-left corner to the bottom-right corner without falling into a lake. If you hit a lake, you get -1 point. If you reach the goal, you get +1 point. There is an option called `is_slippery`, which we will turn on later in the lab. When the problem is slippery, each action has a 33% probability of instead moving in each of the two directions perpendicular to the one chosen, so the chosen direction is only taken about a third of the time.

Direction mapping:

- 0: left
- 1: down
- 2: right
- 3: up
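
To make the slippery behaviour concrete, here is a minimal sketch that samples the direction actually taken, using the direction mapping above. The helper name `sample_slippery_action` is hypothetical and not part of `frozenlake.py`; the sketch assumes the chosen direction and its two perpendicular neighbours are each taken with probability 1/3, which is how we read the description above.

```python
import numpy as np

# Direction mapping from the lab text.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def sample_slippery_action(action: int, rng: np.random.Generator) -> int:
    """Return the direction actually taken on slippery ice (hypothetical helper).

    Assumption: the chosen direction and its two perpendicular
    neighbours are each taken with probability 1/3.
    """
    # With the 0..3 encoding, (action - 1) % 4 and (action + 1) % 4 are
    # exactly the two directions perpendicular to `action`.
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return int(rng.choice(candidates))


rng = np.random.default_rng(0)
outcomes = [sample_slippery_action(RIGHT, rng) for _ in range(3000)]
counts = np.bincount(outcomes, minlength=4)
print(counts)  # the count for LEFT (index 0) stays at zero when RIGHT is chosen
```

Note that the agent never slips in the direction opposite to the one it chose.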

The environment is defined in [frozenlake.py](frozenlake.py), which is in turn a convenience wrapper around the FrozenLake environment from gymnasium.

The grid world is $n \times n$, and the squares are stored in a vector of length $n^2$, starting from the top left.

State space: The state is hence just your current location, as an integer $i \in \{0, \dots, n^2-1\}$.

Action space: The action is an integer between 0 and 3, inclusive.
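
As an illustration of the state encoding, the sketch below converts between a state index and a (row, column) cell. The helper names are hypothetical, and the sketch assumes the squares are numbered row by row (row-major order) from the top left, which the text above suggests but does not state explicitly.

```python
def state_to_cell(state: int, n: int) -> tuple[int, int]:
    """Convert a state index to (row, col), assuming row-major numbering."""
    return divmod(state, n)


def cell_to_state(row: int, col: int, n: int) -> int:
    """Convert (row, col) back to a state index, assuming row-major numbering."""
    return row * n + col


n = 8  # matches map_size=8 used in this lab
print(state_to_cell(0, n))          # top-left corner (start)
print(state_to_cell(n * n - 1, n))  # bottom-right corner (goal)
print(cell_to_state(1, 0, n))       # first square of the second row
```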

In [None]:
from typing import Literal

from tqdm import tqdm
from matplotlib import pyplot as plt

from frozenlake import FrozenLakeEnv
import numpy as np
import numpy.typing as npt
from dataclasses import dataclass

env = FrozenLakeEnv(map_size=8, seed=42, is_slippery=False)


## Let us first have a look at the environment

In [None]:
state, _ = env.reset()  # reset() returns (state, info), as in the training loop below
env.render()

print("Current state: ", state)
print("State space:", env.observation_space)
print("Action space:", env.action_space)

## Take a few steps

In [None]:
# ➡️ TODO : Take a few steps manually and re-render the environment. ⬅️


# Learn a policy: SARSA, Q-Learning, and Episodic

Now that we understand how the environment works, let us implement a few policy improvement methods. This problem has much longer episodes than BlackJack, so it makes sense to start looking at methods that update the policy through bootstrapping: we no longer look only at the reward of each complete episode, but instead update the values of Q based on other values of Q (as in SARSA and Q-Learning), which we can now do during the episode.
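
To illustrate what bootstrapping looks like in code, here is a minimal sketch of the standard textbook one-step updates for Q-Learning and SARSA on a tabular Q. The function names and the tiny 4-state, 2-action table are made up for the example; this is not the lab solution, and it leaves out details such as terminal-state handling and action selection.

```python
import numpy as np
import numpy.typing as npt


def q_learning_update(
    Q: npt.NDArray, state: int, action: int, reward: float, next_state: int,
    learning_rate: float, discount: float,
) -> None:
    """One-step Q-Learning update (in place): bootstrap from the best next action."""
    target = reward + discount * np.max(Q[next_state])
    Q[state, action] += learning_rate * (target - Q[state, action])


def sarsa_update(
    Q: npt.NDArray, state: int, action: int, reward: float, next_state: int,
    next_action: int, learning_rate: float, discount: float,
) -> None:
    """One-step SARSA update (in place): bootstrap from the next action actually taken."""
    target = reward + discount * Q[next_state, next_action]
    Q[state, action] += learning_rate * (target - Q[state, action])


# Toy table: 4 states, 2 actions (hypothetical, much smaller than FrozenLake).
Q = np.zeros((4, 2))
Q[1] = [0.5, 1.0]

q_learning_update(Q, state=0, action=0, reward=0.0, next_state=1,
                  learning_rate=0.1, discount=0.9)
sarsa_update(Q, state=0, action=1, reward=0.0, next_state=1, next_action=0,
             learning_rate=0.1, discount=0.9)
print(Q[0])
```

Both updates run after a single step, so Q changes while the episode is still in progress instead of only at its end.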

__Question__: The main difference between SARSA and Q-Learning is that SARSA is "On-Policy", whereas Q-Learning is "Off-Policy". What does this mean? How do we see it in the formulas? And what impact is it going to have in FrozenLake in particular?

Ok, off you go! We have given you a skeleton, but the rest is quite open.

In [None]:

# ➡️ TODO : You should do three things:
#    TODO :     - Implement the epsilon_greedy_policy (same as in part 1 so should be quick)
#    TODO :     - Implement the training loop and the Q-Learning, SARSA and Episodic improvement steps
#    TODO :     - Play around with and try to understand how the different methods work and
#    TODO :          how they interplay with the different parameters.
#   ⬅️


def epsilon_greedy_policy(
        state_: int,
        Q: npt.NDArray,
        epsilon: float,
) -> npt.NDArray:
    """
    Takes the state, Q, and epsilon value and returns the probabilities of taking each action.
    """

    # ➡️ TODO : implement the epsilon greedy policy. ⬅️
    return ...

@dataclass
class EpisodeHistory:
    """A storage container for the data from a single episode"""
    states: npt.NDArray
    actions: npt.NDArray
    reward: float

episodes: list[EpisodeHistory] = []

def train_policy(
    num_episodes: int,
    method: Literal["Q-Learning", "SARSA", "Full-Episode"],
    discount: float,
    learning_rate: float,
    epsilon: float,
    initial_value: float,
) -> npt.NDArray:

    """
    Takes some useful input (add/remove/rename as you like) and returns the policy state-action values Q.
    """

    # Here, the statespace is simpler, so we instead just save the Q values in an (n^2 x 4) matrix
    # with the states in the first dimension and the actions in the second.
    Q = np.ones((64, 4)) * initial_value
    Q[-1, :] = np.zeros((1, 4)) # There is no value of being in the final state, only the reward

    for _ in tqdm(range(num_episodes)):
        state, _ = env.reset()
        done = False
        state_history: list[int] = []    # used for plotting
        action_history: list[int] = []   # used for plotting
        reward = 0


        # ➡️ TODO : Create training data and update the Q values. ⬅️
        # ➡️ TODO : If you want the plots below to work, also update the state history and action history ⬅️


        episodes.append(
            EpisodeHistory(
                states=np.array(state_history),
                actions=np.array(action_history),
                reward=reward,
            )
        )

    return Q


# ➡️ TODO : run it with some different values for the hyperparameters. ⬅️
Q = train_policy(
    num_episodes=...,
    method=...,
    discount=...,
    learning_rate=...,
    epsilon=...,
    initial_value=...,
)


## Let us plot the state value function as we did in Part 1

In [None]:
env.plot_value_function(Q)

## Average reward

We can also plot the average reward every 100 episodes. The code for that is below.
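
As a small standalone illustration of the averaging step, the sketch below averages a reward array in blocks of 100 with a reshape. The reward values are made up for the example; the point is that any trailing episodes that do not fill a complete block are dropped.

```python
import numpy as np

# Made-up rewards for 250 episodes: 0.0, 1.0, ..., 249.0.
rewards = np.arange(250, dtype=float)
block = 100

# Keep only complete blocks of 100 episodes (here the first 200).
usable = rewards.shape[0] // block * block
avg_rewards = rewards[:usable].reshape(-1, block).mean(axis=1)

print(usable)       # number of episodes that contribute to the averages
print(avg_rewards)  # one average per complete block of 100 episodes
```

With fewer than 100 episodes there are no complete blocks, so the resulting array is empty and the plot shows nothing.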

In [None]:
def plot_average_performance(
        episodes: list[EpisodeHistory] | list[list[EpisodeHistory]],
        names: list[str] = [],
):
    if len(episodes) == 0:
        print("Nothing to plot")
        return

    if isinstance(episodes[0], EpisodeHistory):
        episodes = [episodes]

    names = names + [f"exp {i}" for i in range(len(episodes) - len(names))]

    for eps, name in zip(episodes, names):
        print("Plotting", name)
        rewards = np.array([e.reward for e in eps])
        avg_rewards = rewards[:rewards.shape[0] // 100 * 100].reshape(-1, 100).mean(axis=1)
        plt.plot(avg_rewards, label=name)
        plt.xlabel("Episodes (x100)")
        plt.ylabel("Average reward")

    plt.legend()
    plt.show()

plot_average_performance(episodes=episodes)

# Tests, conclusions and analysis

With the three methods implemented and tested, what did we see?

- Which methods worked better, which worked worse?
- What parameters seemed important? Any ideas why?

If you want to, you can go back up to where we define the environment and set `is_slippery` to `True` to see how that affects the results.

## Next steps

In this lab we are still working with tabular RL. In practice, we often cannot enumerate the state space, and in those cases we need a model to approximate Q. In most cases, this leads to Deep Reinforcement Learning, which would be the natural next step.