In [17]:
!pip -q install imageio[ffmpeg]

In [18]:
if False:
    import gymnasium as gym
    from gymnasium.utils.play import play

    play(
        gym.make(
            "FrozenLake-v1",
            map_name="8x8",
            is_slippery=False,
            render_mode="rgb_array"
        ),
        keys_to_action={'w': 3, 'a': 0, 'd': 2, 's':1},
        noop=2
    )

In [19]:
import numpy as np
import gymnasium as gym
import random
import imageio

In [20]:
TRAIN_FROZENLAKE_4X4 = True
RECORD_VIDEO_FROZENLAKE_4X4 = True

TRAIN_TAXI = True
RECORD_VIDEO_TAXI = True

# Q-Learning with FrozenLake-v1 ⛄ and Taxi-v3

In this notebook, **you'll code your first Reinforcement Learning agent from scratch** to play FrozenLake ❄️ using Q-Learning, then experiment with different configurations.

⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️


<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments" width=600px/>

*Q-Learning* **is the RL algorithm that**:

- Trains a *Q-Function*, an **action-value function** that is encoded in memory by a *Q-table* **containing all the state-action pair values.**

- Given a state and action, our Q-Function **will search the Q-table for the corresponding value.**
    
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function"  width="60%"/>

- When the training is done, **we have an optimal Q-Function, so an optimal Q-Table.**
    
- And if we **have an optimal Q-function**, we have an optimal policy, since we **know, for each state, the best action to take.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"  width="60%"/>
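As a minimal sketch of that link, the toy Q-table below (3 states, 2 actions, with made-up values chosen only for illustration) shows how a Q-function lookup and a greedy policy both fall out of the table:

```python
import numpy as np

# Toy Q-table: 3 states (rows) x 2 actions (columns); the values are made up
toy_qtable = np.array([
    [0.1, 0.5],   # state 0: action 1 looks best
    [0.7, 0.2],   # state 1: action 0 looks best
    [0.0, 0.9],   # state 2: action 1 looks best
])

# Q-function lookup: value of taking action 1 in state 0
q_value = toy_qtable[0, 1]

# Greedy policy: for each state, pick the action with the highest value
toy_policy = np.argmax(toy_qtable, axis=1)

print(q_value)     # 0.5
print(toy_policy)  # [1 0 1]
```

The policy is only as good as the values in the table, which is why training the Q-table is the whole job.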


But, in the beginning, our **Q-Table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-Table to 0)**. As we explore the environment and update our Q-Table, it gives us better and better approximations.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="60%"/>

This is the Q-Learning pseudocode:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="60%"/>


## Part 1: Frozen Lake ⛄ (non slippery version)

### Create and understand [FrozenLake environment ⛄](https://gymnasium.farama.org/environments/toy_text/frozen_lake/)

A good habit when you start to use an environment is to check its documentation

👉 https://gymnasium.farama.org/environments/toy_text/frozen_lake/

---

We're going to train our Q-Learning agent **to navigate from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H)**.

We can have two sizes of environment:

- `map_name="4x4"`: a 4x4 grid version
- `map_name="8x8"`: an 8x8 grid version


The environment has two modes:

- `is_slippery=False`: The agent always moves **in the intended direction** due to the non-slippery nature of the frozen lake (deterministic).
- `is_slippery=True`: The agent **may not always move in the intended direction** due to the slippery nature of the frozen lake (stochastic).

For now let's keep it simple with the 4x4 map and non-slippery.
We add a parameter called `render_mode` that specifies how the environment should be visualised. In our case because we **want to record a video of the environment at the end, we need to set render_mode to rgb_array**.

As [explained in the documentation](https://gymnasium.farama.org/api/env/#gymnasium.Env.render), with `"rgb_array"` the environment returns a single frame representing its current state. A frame is an `np.ndarray` with shape `(x, y, 3)` representing RGB values for an x-by-y pixel image.

In [21]:
EnvFrozenLake4X4 = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="rgb_array")

You can create your own custom grid like this:

```python
desc=["SFFF", "FHFH", "FFFH", "HFFG"]
gym.make('FrozenLake-v1', desc=desc, is_slippery=True)
```

but we'll use the default environment for now.

#### Let's see what the Environment looks like

In [22]:
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space", EnvFrozenLake4X4.observation_space)
print("Sample observation", EnvFrozenLake4X4.observation_space.sample()) # Get a random observation

_____OBSERVATION SPACE_____ 

Observation Space Discrete(16)
Sample observation 8


We see with `Observation Space Discrete(16)` that the observation is an integer representing the **agent’s current position as current_row * ncols + current_col (where both the row and col start at 0)**.

For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations depends on the size of the map: **the 4x4 map has 16 possible observations, while the 8x8 map has 64** (its goal is at 7 * 8 + 7 = 63).
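The index arithmetic can be sketched in a few lines. The helper names `to_state` and `to_position` are hypothetical and only serve this illustration; they are not part of Gymnasium:

```python
def to_state(row, col, ncols):
    """Flatten a zero-based (row, col) position into a state index."""
    return row * ncols + col


def to_position(state, ncols):
    """Recover the zero-based (row, col) position from a state index."""
    return divmod(state, ncols)


print(to_state(3, 3, ncols=4))   # 15, the goal of the 4x4 map
print(to_state(7, 7, ncols=8))   # 63, the goal of the 8x8 map
print(to_position(8, ncols=4))   # (2, 0)
```

So the sample observation `8` printed above corresponds to the first tile of the third row on the 4x4 map.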

For instance, this is what state = 0 looks like:

<img src="FrozenLakeInitial.png" alt="FrozenLake">

In [23]:
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", EnvFrozenLake4X4.action_space.n)
print("Action Space Sample", EnvFrozenLake4X4.action_space.sample()) # Take a random action


 _____ACTION SPACE_____ 

Action Space Shape 4
Action Space Sample 2


The action space is discrete with 4 actions available 🎮:
- 0: Go left
- 1: Go down
- 2: Go right
- 3: Go up

Reward function 💰:
- Reach goal: +1
- Reach hole: 0
- Reach frozen: 0
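To make the action and reward rules concrete, here is a simplified re-implementation of a single non-slippery move. It is a sketch, not Gymnasium's own code: `sketch_step` is a hypothetical helper, and the grid is the 4x4 layout from the custom-grid example above.

```python
# 4x4 layout from the custom-grid example (S: start, F: frozen, H: hole, G: goal)
GRID_4X4 = ["SFFF", "FHFH", "FFFH", "HFFG"]

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def sketch_step(state, action, grid=GRID_4X4):
    """Hypothetical helper: one deterministic move, returning (new_state, reward, terminated)."""
    nrows, ncols = len(grid), len(grid[0])
    row, col = divmod(state, ncols)

    # Moving into the border leaves the agent where it is
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, nrows - 1)
    elif action == RIGHT:
        col = min(col + 1, ncols - 1)
    elif action == UP:
        row = max(row - 1, 0)

    tile = grid[row][col]
    reward = 1 if tile == "G" else 0   # only the goal pays out
    terminated = tile in "GH"          # goal and holes both end the episode
    return row * ncols + col, reward, terminated


print(sketch_step(0, RIGHT))    # (1, 0, False): a frozen tile
print(sketch_step(1, DOWN))     # (5, 0, True): a hole
print(sketch_step(14, RIGHT))   # (15, 1, True): the goal
```

Because holes and frozen tiles both give 0, the only learning signal is the +1 at the goal, which then has to propagate backwards through the Q-table.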

### Create and initialize the Q-table 🗄️

(Step 1 of the pseudocode)

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="60%"/>

It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the size of the observation and action spaces. We already know their values from before, but we'll want to obtain them programmatically so that our algorithm generalizes to different environments. Gymnasium provides us a way to do that: `env.observation_space.n` and `env.action_space.n`.


In [24]:
StateSpace = EnvFrozenLake4X4.observation_space.n
print("There are ", StateSpace, " possible states")

ActionSpace = EnvFrozenLake4X4.action_space.n
print("There are ", ActionSpace, " possible actions")

There are  16  possible states
There are  4  possible actions


In [25]:
# Create a Q-table of shape (state_space, action_space) with every value initialized to 0 using np.zeros
def initialize_q_table(state_space, action_space):
  Qtable = np.zeros((state_space, action_space))
  return Qtable

### Define the greedy policy 🤖

In [26]:
def greedy_policy(Qtable, state):
  # Exploitation: take the action with the highest state-action value
  action = np.argmax(Qtable[state][:])

  return action

### Define the epsilon-greedy policy 🤖

(Step 2 of the pseudocode)

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="60%"/>

Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off.

The idea with epsilon-greedy:

- With probability $1-\varepsilon$ : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).

- With probability $\varepsilon$: **we do exploration** (trying a random action).

As the training continues, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-Learning" width="60%"/>
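One detail worth spelling out: when exploration samples uniformly over all actions, the greedy action can also be drawn by chance, so it is chosen slightly more often than $1-\varepsilon$. A small sketch, where `greedy_action_probability` is a hypothetical helper and uniform sampling is an assumption:

```python
def greedy_action_probability(epsilon, n_actions):
    """Chance that epsilon-greedy ends up taking the greedy action.

    Assumes exploration samples uniformly over all n_actions, so the
    greedy action can also be picked by chance while exploring.
    """
    return (1 - epsilon) + epsilon / n_actions


# With 4 actions, as in FrozenLake
for eps in (1.0, 0.5, 0.0):
    print(eps, greedy_action_probability(eps, n_actions=4))
```

With $\varepsilon = 1$ every action is equally likely (0.25 each for 4 actions), and with $\varepsilon = 0$ the policy is purely greedy.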


In [27]:
def epsilon_greedy_policy(Qtable, env, state, epsilon):
  # Randomly generate a number between 0 and 1
  random_num = random.uniform(0,1)
  # if random_num is greater than epsilon --> exploitation
  if random_num > epsilon:
    # Take the action with the highest value given a state
    # np.argmax can be useful here
    action = greedy_policy(Qtable, state)
  # else --> exploration
  else:
    action = env.action_space.sample()

  return action

### Create the training loop method

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="60%"/>

<img src="Q-learning-8.png" alt="Q-Learning-formula" width="60%">

The training loop goes like this:

```
For episode in the total of training episodes:
  Reduce epsilon (since we need less and less exploration)
  Reset the environment

  For step in max timesteps:
    Choose the action At using epsilon greedy policy
    Take the action (a) and observe the outcome state (s') and reward (r)
    Update the Q-value: Q(s,a) <- Q(s,a) + lr [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]
    If done, finish the episode
    Our next state is the new state
```
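A worked example of the update rule may help. The sketch below applies it twice by hand, using illustrative values `lr = 0.9` and `gamma = 0.9`; `q_update` is a hypothetical helper written only for this example:

```python
def q_update(q_sa, reward, max_q_next, lr, gamma):
    """One Q-Learning update for a single state-action pair."""
    td_target = reward + gamma * max_q_next
    return q_sa + lr * (td_target - q_sa)


lr, gamma = 0.9, 0.9  # illustrative values

# Step that reaches the goal: reward 1, and the terminal state is worth 0
q_last = q_update(q_sa=0.0, reward=1, max_q_next=0.0, lr=lr, gamma=gamma)

# Step just before it: no reward yet, but the next state is now worth q_last
q_previous = q_update(q_sa=0.0, reward=0, max_q_next=q_last, lr=lr, gamma=gamma)

print(round(q_last, 3))      # 0.9
print(round(q_previous, 3))  # 0.729
```

This is how the single +1 at the goal spreads backwards: each visit pulls a state-action value towards the discounted value of the best action in the next state.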

In [28]:
def train(n_training_episodes, max_epsilon, min_epsilon, discount_rate_g, learning_rate_a, decay_rate, env, Qtable):
  # Start fully exploratory; epsilon is initialized once and decayed after every episode
  epsilon = max_epsilon

  for episode in range(n_training_episodes):
    # Reset the environment
    state, info = env.reset()
    terminated = False
    truncated = False

    # Run until the environment terminates the episode or its time limit truncates it
    while not (terminated or truncated):
      # Choose the action At using epsilon greedy policy
      action = epsilon_greedy_policy(Qtable, env, state, epsilon)

      # Take the action (a) and observe the outcome state (s') and reward (r)
      new_state, reward, terminated, truncated, info = env.step(action)

      # Update Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
      Qtable[state][action] = Qtable[state][action] + learning_rate_a * (reward + discount_rate_g * np.max(Qtable[new_state]) - Qtable[state][action])

      # Our next state is the new state
      state = new_state

    # Reduce epsilon linearly (because we need less and less exploration)
    # Exponential alternative: epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
    epsilon = max(epsilon - decay_rate, min_epsilon)

  return Qtable

### Set the hyperparameters ⚙️

The exploration-related hyperparameters are some of the most important ones.

- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need to have progressive decay of the epsilon.
- If you decrease epsilon too fast (too high decay_rate), **you take the risk that your agent will be stuck**, since your agent didn't explore enough of the state space and hence can't solve the problem.
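To get a feel for how `decay_rate` shapes exploration, the sketch below compares two schedules: the linear one used by the training loop (subtract `decay_rate` after each episode) and the exponential one left as a comment in it. The function names are hypothetical, and the linear version uses the closed form of the repeated subtraction.

```python
import math


def linear_epsilon(episode, max_epsilon, min_epsilon, decay_rate):
    """Epsilon after `episode` episodes when decay_rate is subtracted once per episode."""
    return max(max_epsilon - decay_rate * episode, min_epsilon)


def exponential_epsilon(episode, max_epsilon, min_epsilon, decay_rate):
    """Epsilon after `episode` episodes under exponential decay towards min_epsilon."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


for episode in (0, 250, 500, 1000):
    lin = linear_epsilon(episode, 1.0, 0.0, 0.001)
    exp = exponential_epsilon(episode, 1.0, 0.0, 0.001)
    print(f"episode {episode:4d}: linear={lin:.3f} exponential={exp:.3f}")
```

Under the linear schedule, epsilon reaches `min_epsilon` after `(max_epsilon - min_epsilon) / decay_rate` episodes, so it is worth checking that number against the total number of training episodes before launching a run.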

In [29]:
# Training parameters
N_TRAINING_EPISODES = 1000  # Total training episodes
LEARNING_RATE_A = 0.9        # Learning rate

# Evaluation parameters
N_EVAL_EPISODES = 100        # Total number of test episodes
EVAL_SEED = []               # The evaluation seed of the environment

# Environment parameters
ENV_ID = "FrozenLake-v1-4X4"     # Name of the environment
MAX_STEPS = 200              # Max steps per episode (used when recording the video)
DISCOUNT_RATE_G = 0.9        # Discounting rate

# Exploration parameters
MAX_EPSILON = 1              # Exploration probability at start
MIN_EPSILON = 0              # Minimum exploration probability
DECAY_RATE  = 0.0001         # Decay rate for exploration prob

### Train the Q-Learning agent or load an already trained Q-Learning agent

In [38]:
if TRAIN_FROZENLAKE_4X4:
    Qtable_frozenlake = initialize_q_table(StateSpace, ActionSpace)
    Qtable_frozenlake = train(
        N_TRAINING_EPISODES,
        MAX_EPSILON,
        MIN_EPSILON,
        DISCOUNT_RATE_G,
        LEARNING_RATE_A,
        DECAY_RATE,
        EnvFrozenLake4X4,
        Qtable_frozenlake
        )
    np.savetxt("Qtable_frozenlake_4X4.csv", Qtable_frozenlake, delimiter=",")
else:
    Qtable_frozenlake = np.loadtxt("Qtable_frozenlake_4X4.csv", delimiter=",")
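The evaluation parameters defined above (`N_EVAL_EPISODES`, `EVAL_SEED`) suggest scoring the greedy policy over several episodes. A minimal sketch of such a check follows; `evaluate_agent` is a hypothetical helper, and `CorridorEnv` is a tiny stand-in environment (not part of Gymnasium) so that the snippet runs on its own.

```python
import numpy as np


def evaluate_agent(env, max_steps, n_eval_episodes, Qtable, seeds=None):
    """Run the greedy policy and return (mean, std) of the episode returns.

    Hypothetical helper: `env` is assumed to follow the Gymnasium API
    (reset() -> (state, info), step() -> 5-tuple).
    """
    episode_returns = []
    for episode in range(n_eval_episodes):
        # Use one fixed seed per episode when a list of seeds is provided
        if seeds:
            state, info = env.reset(seed=seeds[episode])
        else:
            state, info = env.reset()

        total_reward = 0.0
        for _ in range(max_steps):
            action = int(np.argmax(Qtable[state]))  # greedy action
            state, reward, terminated, truncated, info = env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        episode_returns.append(total_reward)

    return float(np.mean(episode_returns)), float(np.std(episode_returns))


class CorridorEnv:
    """Stand-in environment for this sketch only: 0 -> 1 -> 2 (goal); action 1 moves right."""

    def reset(self, seed=None):
        self.state = 0
        return self.state, {}

    def step(self, action):
        if action == 1:
            self.state += 1
        terminated = self.state == 2
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, False, {}


# A Q-table that prefers action 1 in states 0 and 1 reaches the goal every time
corridor_qtable = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
mean_reward, std_reward = evaluate_agent(CorridorEnv(), max_steps=10, n_eval_episodes=5, Qtable=corridor_qtable)
print(mean_reward, std_reward)  # 1.0 0.0
```

In this notebook the same function could be called as `evaluate_agent(EnvFrozenLake4X4, MAX_STEPS, N_EVAL_EPISODES, Qtable_frozenlake, EVAL_SEED)`; on the non-slippery 4x4 map, a well-trained table should reach the goal in every episode.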

### Record a video

In [31]:
def record_video(env, Qtable, out_directory, max_steps, random_policy=False, fps=1):
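  """
  Generate a replay video of the agent
  :param env: environment created with render_mode="rgb_array"
  :param Qtable: Qtable of our agent
  :param out_directory: path of the video file to write
  :param max_steps: maximum number of steps to record
  :param random_policy: if True, take random actions instead of the greedy ones
  :param fps: how many frames per second (with taxi-v3 and frozenlake-v1 we use 1)
  """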
  images = []
  terminated = False
  truncated = False

  state, info = env.reset(seed=random.randint(0,500))
  img = env.render()
  images.append(img)

  for step in range(max_steps):
    if random_policy:
      action = np.random.choice(range(env.action_space.n))
    else:
      # Take the action (index) that has the maximum expected future reward given that state
      action = np.argmax(Qtable[state][:])

    state, reward, terminated, truncated, info = env.step(action) # We directly put next_state = state for recording logic
    img = env.render()
    images.append(img)

    if terminated or truncated:
      break

  img = env.render()
  images.append(img)
  imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)

In [32]:
if RECORD_VIDEO_FROZENLAKE_4X4:
    record_video(EnvFrozenLake4X4, Qtable_frozenlake, ENV_ID + "-OptimalPolicy.mp4", MAX_STEPS, fps=1)
    record_video(EnvFrozenLake4X4, Qtable_frozenlake, ENV_ID + "-RandomPolicy.mp4", MAX_STEPS, random_policy=True, fps=1)

## Part 2: Taxi-v3

### Create and understand [Taxi-v3](https://gymnasium.farama.org/environments/toy_text/taxi/)
A good habit when you start to use an environment is to check its documentation

👉 https://gymnasium.farama.org/environments/toy_text/taxi/

---

In `Taxi-v3`, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue).

When the episode starts, **the taxi starts off at a random square** and the passenger is at a random location. The taxi drives to the passenger’s location, **picks up the passenger**, drives to the passenger’s destination (another one of the four specified locations), and then **drops off the passenger**. Once the passenger is dropped off, the episode ends.


<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi.png" alt="Taxi">


In [33]:
Env_taxi = gym.make("Taxi-v3", render_mode="rgb_array")

There are **500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger** (including the case when the passenger is in the taxi), and **4 destination locations.**
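The Taxi documentation linked above describes how these components are packed into one integer: `((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination`. The sketch below encodes and decodes a state with that formula; the function names are hypothetical and not part of Gymnasium.

```python
def encode_taxi_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four components into a single state index in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination


def decode_taxi_state(state):
    """Unpack a state index into (taxi_row, taxi_col, passenger_location, destination)."""
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination


print(25 * 5 * 4)                      # 500 states in total
print(encode_taxi_state(4, 4, 4, 3))   # 499, the largest index
print(decode_taxi_state(328))          # (3, 1, 2, 0)
```

This is the same flattening idea as in FrozenLake, extended from two components to four.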


In [34]:
StateSpace = Env_taxi.observation_space.n
print("There are ", StateSpace, " possible states")

There are  500  possible states


In [35]:
ActionSpace = Env_taxi.action_space.n
print("There are ", ActionSpace, " possible actions")

There are  6  possible actions


The action space (the set of possible actions the agent can take) is discrete with **6 actions available 🎮**:

- 0: move south
- 1: move north
- 2: move east
- 3: move west
- 4: pickup passenger
- 5: drop off passenger

Reward function 💰:

- -1 per step unless other reward is triggered.
- +20 delivering passenger.
- -10 executing “pickup” and “drop-off” actions illegally.

### Define the hyperparameters ⚙️

In [39]:
# Training parameters
N_TRAINING_EPISODES = 25000  # Total training episodes
LEARNING_RATE_A = 0.7        # Learning rate

# Evaluation parameters
N_EVAL_EPISODES = 100        # Total number of test episodes
EVAL_SEED = [16,54,165,177,191,191,120,80,149,178,48,38,6,125,174,73,50,172,100,148,146,6,25,40,68,148,49,167,9,97,164,176,61,7,54,55,
 161,131,184,51,170,12,120,113,95,126,51,98,36,135,54,82,45,95,89,59,95,124,9,113,58,85,51,134,121,169,105,21,30,11,50,65,12,43,82,145,152,97,106,55,31,85,38,
 112,102,168,123,97,21,83,158,26,80,63,5,81,32,11,28,148]

# Environment parameters
ENV_ID = "Taxi-v3"           # Name of the environment
MAX_STEPS = 99               # Max steps per episode (used when recording the video)
DISCOUNT_RATE_G = 0.9        # Discounting rate

# Exploration parameters
MAX_EPSILON = 1              # Exploration probability at start
MIN_EPSILON = 0              # Minimum exploration probability
DECAY_RATE  = 0.001          # Decay rate for exploration prob

### Train our Q-Learning agent

In [40]:
if TRAIN_TAXI:
    Qtable_taxi = initialize_q_table(StateSpace, ActionSpace)
    Qtable_taxi = train(N_TRAINING_EPISODES, MAX_EPSILON, MIN_EPSILON, DISCOUNT_RATE_G, LEARNING_RATE_A, DECAY_RATE, Env_taxi, Qtable_taxi)
    np.savetxt("Qtable_taxi.csv", Qtable_taxi, delimiter=",")
else:
    Qtable_taxi = np.loadtxt("Qtable_taxi.csv", delimiter=",")

### Record a video

In [None]:
if RECORD_VIDEO_TAXI:
    record_video(Env_taxi, Qtable_taxi, ENV_ID + "-OptimalPolicy.mp4", MAX_STEPS, fps=1)
    record_video(Env_taxi, Qtable_taxi, ENV_ID + "-RandomPolicy.mp4", MAX_STEPS, random_policy=True, fps=1)