# Workshop on Reinforcement Learning or How to drive a taxi in a data-driven way?

Welcome to the workshop on Reinforcement Learning. We introduce the concept of Reinforcement Learning in a problem-based way using a small interactive example, the so-called **Taxi environment**.

There are four designated pick-up and drop-off locations (Red, Green, Yellow and Blue) in the 5x5 grid world. The taxi starts off at a random square and the passenger at one of the designated locations.

The goal is to move the taxi to the passenger’s location, pick up the passenger, move to the passenger’s desired destination, and drop off the passenger. Once the passenger is dropped off, the episode ends.

The player receives a positive reward for successfully dropping off the passenger at the correct location, and negative rewards for incorrect attempts to pick up or drop off the passenger and for each step in which no other reward is received.
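
As a rough sketch of how these rewards add up over an episode: the Gymnasium documentation for the Taxi environment lists +20 for a successful drop-off, -10 for an illegal pick-up or drop-off, and -1 for every other step. These values are taken from that documentation, not defined in this notebook, and `episode_return` is a hypothetical helper used only for illustration.

```python
# Reward values as listed in the Gymnasium Taxi documentation
# (assumed here; they are not defined in this notebook).
STEP_REWARD = -1
ILLEGAL_ACTION_REWARD = -10
DELIVERY_REWARD = 20


def episode_return(n_actions, n_illegal=0, delivered=True):
    """Undiscounted return of an episode with `n_actions` actions in total.

    `n_illegal` of them are illegal pick-up/drop-off attempts and, if
    `delivered` is True, the last action is the successful drop-off.
    """
    n_special = n_illegal + (1 if delivered else 0)
    n_plain = n_actions - n_special
    if n_plain < 0:
        raise ValueError("more special actions than actions in total")
    total = n_plain * STEP_REWARD + n_illegal * ILLEGAL_ACTION_REWARD
    if delivered:
        total += DELIVERY_REWARD
    return total


# a clean 12-action episode: 11 ordinary steps and the final drop-off
print(episode_return(12))
# the same episode with two additional illegal pick-up attempts
print(episode_return(14, n_illegal=2))
```

A short, error-free episode therefore ends with a small positive return, while every wasted step or illegal attempt pushes the return down.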

<img src="mat/taxi.gif" alt="Taxi driver randomly driving around" width="400"/>

More information can be found in the [official documentation](https://gymnasium.farama.org/environments/toy_text/taxi/).

## Exercise 1: Visualizing your agent in the environment

To use the Taxi environment you need the `gymnasium` package.

If you run Python locally, you can use the `pygame` package to visualize the game. If you use Colab instead, you can create a video of your agent acting in the environment.
First, we try out the environment by instantiating it and setting up the typical RL data stream we introduced in the slides:

<img src="mat/01-RL-datastream.png" alt="RL datastream" width="400"/>

Therefore, we implement a `while` loop, sample an **action** from the action space and **do** one step with that action in the environment. As an agent, we get back the next state (called **observation**, short obs), a **reward** and some additional information, including whether the episode has ended.

### Note: Use the following code if you use Python locally

In [None]:
import gymnasium as gym
import pygame

# instantiation of the environment
env = gym.make('Taxi-v3', render_mode='human')
# resetting the environment for first start
obs, _ = env.reset()

done = False
while not done:
    # sample an action
    action = env.action_space.sample()
    # do one step in the environment
    obs, reward, terminated, truncated, _ = env.step(action)

    # flag whether the episode is finished
    done = terminated or truncated
    # render the game
    env.render()

    # event handling so that you can end the visualization by pressing the q key
    for event in pygame.event.get():
        if event.type == pygame.KEYDOWN:
            if event.key == pygame.K_q:
                pygame.quit()
                done = True

env.close()

### Note: Use the following code if you are using Google Colab

In [None]:
import gymnasium as gym
from IPython.display import HTML
from base64 import b64encode

import imageio

# instantiation of the environment
env = gym.make("Taxi-v3", render_mode="rgb_array")

# resetting the environment for first start
obs, _ = env.reset()

# initialize a list of frames for video creation
frames = []

done = False
while not done:
    # capture the frame and append it to frames list
    frame = env.render()
    frames.append(frame)

    # sample an action
    action = env.action_space.sample()
    # do one step in the environment
    obs, reward, terminated, truncated, info = env.step(action)

    # flag whether the episode is finished
    done = terminated or truncated

    # final rendering for last image of episode
    if done:
        frame = env.render()
        frames.append(frame)

env.close()

# save video as
video_path = "./taxi_vid.mp4"
imageio.mimsave(video_path, frames, fps=5)

In [None]:
# this is for displaying the video after saving
mp4 = open(video_path, 'rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()

HTML(f"""
<video width=400 controls>
    <source src="{data_url}" type="video/mp4">
</video>
""")

## Exercise 2

Find out the dimensions of the state and action spaces and check them against the ideas we introduced theoretically before.

In [None]:
print("Dimension observation space: ", env.observation_space.n)

In [None]:
print("Dimension action space: ", env.action_space.n)
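
One way to make sense of the number of states: according to the Gymnasium documentation, a state is a single integer that encodes the taxi row and column (5 x 5), the passenger location (the four designated locations plus "in the taxi") and the destination (four locations). The sketch below re-implements that encoding in plain Python under this assumption; `encode` and `decode` are illustrative stand-ins, not the environment's own methods.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into one integer (assumed layout)."""
    state = taxi_row
    state = state * 5 + taxi_col
    state = state * 5 + passenger_loc   # 0-3: R, G, Y, B; 4: in the taxi
    state = state * 4 + destination     # 0-3: R, G, Y, B
    return state


def decode(state):
    """Inverse of `encode`: recover the four components from the integer."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


# number of combinations: rows x columns x passenger locations x destinations
n_states = 5 * 5 * 5 * 4
print(n_states)
# components of one particular state index
print(decode(328))
```

Reading a state index this way also helps when a specific problem instance is given only by its number.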

## Exercise 3

How would you design and implement a strategy for the taxi driver to solve the taxi problem? Try out some ideas and hardcode the optimal policy for a given problem instance. Look at the following situation, which is defined as state 328:

<img src="mat/taxi-seed328.png" alt="Taxi problem state 328" width="400"/>

- think about the exact order of actions you have to do
- hardcode them in a list and try it out!

Remark: The actions are encoded in the following way according to the documentation:

- 0: Move south (down)
- 1: Move north (up)
- 2: Move east (right)
- 3: Move west (left)
- 4: Pickup passenger
- 5: Drop off passenger
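
To keep a hardcoded plan readable, the action codes can be given names. The following sketch uses a hypothetical `TaxiAction` enum (not part of `gymnasium`) whose values match the list above; the example plan is arbitrary and is not the solution for state 328.

```python
from enum import IntEnum


class TaxiAction(IntEnum):
    """Illustrative names for the six action codes listed above."""
    SOUTH = 0
    NORTH = 1
    EAST = 2
    WEST = 3
    PICKUP = 4
    DROPOFF = 5


def describe(plan):
    """Turn a list of action codes into a readable string."""
    return " -> ".join(TaxiAction(a).name for a in plan)


# an arbitrary example plan, written once with names and once with raw codes
plan = [TaxiAction.EAST, TaxiAction.SOUTH, TaxiAction.PICKUP]
print(describe(plan))
print(describe([2, 0, 4]))
# raw integer codes of the named plan
print([int(a) for a in plan])
```

Converting each entry with `int(a)` gives the raw code that `env.step()` expects.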

**Note**: From here on, I will always provide both options for visualization: the local Python setup and Colab.

### Note: Use the following code if you use Python locally

In [None]:
import gymnasium as gym
import pygame

# instantiation of the environment
env = gym.make('Taxi-v3', render_mode='human')
# resetting the environment for first start
obs, _ = env.reset()

# consider a specific problem instance
env.unwrapped.s = 328

done = False
while not done:
    # sample an action (replace this with the next action from your hardcoded list)
    action = env.action_space.sample()
    # do one step in the environment
    obs, reward, terminated, truncated, _ = env.step(action)

    # flag whether the episode is finished
    done = terminated or truncated
    # render the game
    env.render()

    # event handling so that you can end the visualization by pressing the q key
    for event in pygame.event.get():
        if event.type == pygame.KEYDOWN:
            if event.key == pygame.K_q:
                pygame.quit()
                done = True

env.close()

### Note: Use the following code if you use Google Colab

In [None]:
import gymnasium as gym
from IPython.display import HTML
from base64 import b64encode

import imageio

# instantiation of the environment
env = gym.make("Taxi-v3", render_mode="rgb_array")

# resetting the environment for first start
obs, _ = env.reset()

# consider a specific problem instance
env.unwrapped.s = 328

# initialize a list of frames for video creation
frames = []

done = False
while not done:
    # capture the frame and append it to frames list
    frame = env.render()
    frames.append(frame)

    # sample an action (replace this with the next action from your hardcoded list)
    action = env.action_space.sample()
    # do one step in the environment
    obs, reward, terminated, truncated, info = env.step(action)

    # flag whether the episode is finished
    done = terminated or truncated

    # final rendering for last image of episode
    if done:
        frame = env.render()
        frames.append(frame)

env.close()

# save video as
video_path = "./taxi_vid_own_policy.mp4"
imageio.mimsave(video_path, frames, fps=5)

In [None]:
# this is for displaying the video after saving
mp4 = open(video_path, 'rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()

HTML(f"""
<video width=400 controls>
    <source src="{data_url}" type="video/mp4">
</video>
""")

## Exercise 4

Let's start learning! Recall from the slides that there is an approach called Q-learning, based on a Q-value function for each state-action pair. Repeatedly applying the update
$$ Q(s,a) \leftarrow Q(s,a) + \alpha( r_{t+1} + \gamma \cdot \max_{a'} Q(s',a') - Q(s,a)) $$
lets $Q$ converge towards the optimal Q-function, where $\alpha$ is the learning rate, $\gamma$ the discount factor and $s'$ the next state. We start with a lot of exploration and update the estimate of $Q(s,a)$ in each time step for the experienced states and actions. Finally, we get the optimal policy by taking $\text{arg} \max_a Q(s,a)$ in each state.
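
As a small worked example with made-up numbers (they are assumptions for illustration, not values from the Taxi environment): let $\alpha = 0.1$, $\gamma = 0.99$, the current estimate $Q(s,a) = -2$, the received reward $r_{t+1} = -1$ and $\max_{a'} Q(s',a') = 2$. A single update then gives

$$ Q(s,a) \leftarrow -2 + 0.1 \cdot \big( -1 + 0.99 \cdot 2 - (-2) \big) = -2 + 0.1 \cdot 2.98 = -1.702 $$

The estimate moves a fraction $\alpha$ of the way from its old value $-2$ towards the target $r_{t+1} + \gamma \cdot \max_{a'} Q(s',a') = 0.98$.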

In order to balance exploration and exploitation during learning, we introduced an $\epsilon$-greedy approach. That is, the taxi driver first explores the environment a lot and exploits its knowledge more and more as it gains experience. One of the easiest implementations is a temporal decay of $\epsilon$ that starts at 1 and is repeatedly multiplied by a factor $\epsilon_{\text{decay}} \in [0,1)$. Hence, we have an $\epsilon$-greedy policy as

$$ \pi(s) = \begin{cases}
        \text{env.action\_space.sample()} & \text{if np.random.rand()} < \epsilon \\
        \text{arg}\max_a Q(s,a) & \text{otherwise}
\end{cases}$$

and $\epsilon$ decays after each episode following
$$ \epsilon = \epsilon \cdot \epsilon_{\text{decay}} $$
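
To get a feeling for how quickly exploration fades, the decay rule can be unrolled: after $n$ applications, $\epsilon$ equals its start value times $\epsilon_{\text{decay}}^n$. The sketch below uses the hypothetical helpers `eps_schedule` and `steps_until_below` and an example decay factor of 0.99; both names and the chosen numbers are assumptions for illustration.

```python
def eps_schedule(eps_start, eps_decay, n):
    """Return [eps_0, eps_1, ..., eps_n] for the multiplicative decay rule."""
    schedule = [eps_start]
    for _ in range(n):
        schedule.append(schedule[-1] * eps_decay)
    return schedule


def steps_until_below(threshold, eps_start, eps_decay):
    """Number of decay steps until eps first drops below `threshold`."""
    if not 0 <= eps_decay < 1 or threshold <= 0:
        raise ValueError("need 0 <= eps_decay < 1 and threshold > 0")
    eps, steps = eps_start, 0
    while eps >= threshold:
        eps *= eps_decay
        steps += 1
    return steps


schedule = eps_schedule(eps_start=1.0, eps_decay=0.99, n=300)
# exploration rate after a few selected numbers of decay steps
for n in (0, 50, 100, 300):
    print(n, round(schedule[n], 3))
# decay steps needed until fewer than 5% of the actions are random
print(steps_until_below(0.05, eps_start=1.0, eps_decay=0.99))
```

With a decay factor close to 1 the agent keeps exploring for many more rounds, while a small factor makes it greedy almost immediately.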

#### Task 1

Implement the $\epsilon$-greedy policy according to the formula above in the function `eps_greedy_policy(env, eps, s, Q)`. The function shall return the chosen action.


In [None]:
import numpy as np

def eps_greedy_policy(env, eps, s, Q):
    """
    Selects an action using the epsilon-greedy strategy.

    With probability `eps`, a random action is selected (exploration).
    Otherwise, the action with the highest Q-value for the given state is chosen (exploitation).

    Parameters
    ----------
    env : gym.Env
        The environment instance, used to sample random actions.
    eps : float
        The exploration rate (0 ≤ eps ≤ 1). Higher values increase the likelihood of random actions.
    s : int
        The current state of the agent (index into Q-table).
    Q : np.ndarray
        The Q-table with shape (num_states, num_actions), containing estimated action-values.

    Returns
    -------
    action : int
        The selected action to take in the current state.
    """

    # hint: np.argmax() returns the index of the largest value in an array
    pass
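
One detail worth knowing before using `np.argmax` here: when several entries share the maximum, it returns the first such index. The toy Q-table below (made-up numbers, 3 states x 6 actions) illustrates this; for an all-zero row, such as in a table initialized with zeros, the greedy action is therefore always action 0.

```python
import numpy as np

# toy Q-table with made-up values: 3 states, 6 actions
Q_toy = np.zeros((3, 6))
Q_toy[1] = [-1.0, 0.5, 0.5, -2.0, -10.0, -10.0]
Q_toy[2, 4] = 3.0

for s in range(Q_toy.shape[0]):
    # index of the largest entry in row s (the first one if there are ties)
    greedy_action = int(np.argmax(Q_toy[s]))
    print(s, greedy_action)
```

This is one reason why some exploration is needed early on: a purely greedy agent on an untrained table would keep repeating the same action.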

#### Task 2

Implement the function `Qlearn(env, alpha, gamma, eps, eps_decay, max_eps)` that takes an environment without rendering (otherwise training will take a long time) and several hyperparameters, learns a Q-function for each state-action pair, and returns it together with the list of total rewards per episode.

In [None]:
import gymnasium as gym
import numpy as np

def Qlearn(env, alpha, gamma, eps, eps_decay, max_eps):
    """
    Performs Q-learning to learn an optimal Q-value table for a given environment.

    Parameters
    ----------
    env : gym.Env
        The environment to learn from. Must have discrete observation and action spaces.
    alpha : float
        Learning rate (0 < alpha <= 1), determines how much new information overrides old.
    gamma : float
        Discount factor (0 <= gamma <= 1), determines the importance of future rewards.
    eps : float
        Initial epsilon for the epsilon-greedy policy (0 <= eps <= 1), controls exploration.
    eps_decay : float
        Multiplicative decay factor for epsilon after each episode (0 < eps_decay < 1).
    max_eps : int
        Number of episodes to train for.

    Returns
    -------
    Q : numpy.ndarray
        The learned Q-table of shape (num_states, num_actions), where each entry Q[s, a]
        estimates the expected return of taking action a in state s and following the
        learned policy thereafter.
    rewards : list
        The total (undiscounted) reward collected in each training episode.

    """
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rewards = []

    for episode in range(max_eps):
        s, _ = env.reset()
        done = False
        total_reward = 0
        while not done:
            # sample an action
            # TODO: select proper actions
            a = ...
            # do one step in the environment
            s_prime, reward, terminated, truncated, _ = env.step(a)
            total_reward += reward

            done = terminated or truncated

            # update step of Q-learning
            # TODO: do the proper update step
            Q[s, a] = ...

            # set state = next_state for next time step in episode
            s = s_prime
        rewards.append(total_reward)
        eps = eps*eps_decay
    return Q, rewards

#### Task 3

Test your implementation by running an episode with rendering. Check whether your taxi driver follows the optimal policy.



In [None]:
# instantiation of the environment without rendering
env = gym.make('Taxi-v3')

# applying Q-learning on the env; Qlearn returns the Q-table and the rewards per episode
Q, rewards = Qlearn(env, alpha=0.1, gamma=0.99, eps=1, eps_decay=.99, max_eps=1000)

# closing the training environment
env.close()

### Note: Use the following code if you use Python locally

In [None]:
import gymnasium as gym
import pygame

# instantiation of the environment
env = gym.make('Taxi-v3', render_mode='human')
# resetting the environment for first start
obs, _ = env.reset()

done = False
while not done:
    # choose the greedy action with respect to the learned Q-table
    action = np.argmax(Q[obs, :])
    # do one step in the environment
    obs, reward, terminated, truncated, _ = env.step(action)

    # flag whether the episode is finished
    done = terminated or truncated
    # render the game
    env.render()

    # event handling so that you can end the visualization by pressing the q key
    for event in pygame.event.get():
        if event.type == pygame.KEYDOWN:
            if event.key == pygame.K_q:
                pygame.quit()
                done = True

env.close()

### Note: Use the following code if you use Google Colab

In [None]:
import gymnasium as gym
from IPython.display import HTML
from base64 import b64encode

import imageio

# instantiation of the environment
env = gym.make("Taxi-v3", render_mode="rgb_array")

# resetting the environment for first start
obs, _ = env.reset()

# initialize a list of frames for video creation
frames = []

done = False
while not done:
    # capture the frame and append it to frames list
    frame = env.render()
    frames.append(frame)

    # choose the greedy action with respect to the learned Q-table
    action = np.argmax(Q[obs, :])
    # do one step in the environment
    obs, reward, terminated, truncated, info = env.step(action)

    # flag whether the episode is finished
    done = terminated or truncated

    # final rendering for last image of episode
    if done:
        frame = env.render()
        frames.append(frame)

env.close()

# save video as
video_path = "./taxi_vid_own_policy.mp4"
imageio.mimsave(video_path, frames, fps=5)

In [None]:
# this is for displaying the video after saving
mp4 = open(video_path, 'rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()

HTML(f"""
<video width=400 controls>
    <source src="{data_url}" type="video/mp4">
</video>
""")

#### Bonus Task:

Try to find the best hyperparameter setting to learn the optimal policy as fast as you can. Tune the parameters to find the optimal solution with a minimum number of episodes. The best team is awarded the title of **Q-Genius** team and gets a special reward!

**Note**: It may help to collect and visualize the rewards per episode over the whole training phase to check whether you have really reached the optimal Q-function. To do so, track the `total_reward` of each episode in the `Qlearn()` function and return the resulting list. Often, the rewards as well as a moving average are visualized. An exemplary code snippet can be found below.

In [None]:
import matplotlib.pyplot as plt

# instantiation of the environment
env = gym.make('Taxi-v3')
Q, rew = Qlearn(env, alpha=0.1, gamma=0.99, eps=1, eps_decay=.99, max_eps=1000)
# closing the training environment
env.close()

# Compute moving average of rewards
window = 50
moving_avg = np.convolve(rew, np.ones(window)/window, mode='valid')

# Plot
plt.figure(figsize=(10, 6))
plt.plot(rew, label='Episode Reward', alpha=0.3)
# align each average with the last episode of its window
plt.plot(np.arange(window - 1, len(rew)), moving_avg, label=f'{window}-Episode Moving Average', linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Q-learning Performance on Taxi-v3')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()