# Workshop on Reinforcement Learning or How to drive a taxi in a data-driven way?

Welcome to the workshop on Reinforcement Learning. We introduce the concept of Reinforcement Learning in a problem-based way using a small interactive example, the so-called **Taxi environment**.

There are four designated pick-up and drop-off locations (Red, Green, Yellow and Blue) in the 5x5 grid world. The taxi starts off at a random square and the passenger at one of the designated locations.

The goal is to move the taxi to the passenger’s location, pick up the passenger, move to the passenger’s desired destination, and drop off the passenger. Once the passenger is dropped off, the episode ends.

The player receives a positive reward for successfully dropping off the passenger at the correct location. Negative rewards are given for incorrect attempts to pick up or drop off the passenger, and for each step in which no other reward is received.

<img src="mat/taxi.gif" alt="Taxi driver randomly driving around" width="400"/>

More information can be found in the [official documentation](https://gymnasium.farama.org/environments/toy_text/taxi/).
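To make the reward scheme above concrete, here is a minimal sketch of how the return of one episode adds up. The numeric values (-1, +20, -10) are taken from the Gymnasium Taxi documentation linked above. The helper `episode_return` and the constant names are hypothetical and only serve as illustration.

```python
# Reward values as listed in the Gymnasium Taxi documentation.
STEP_REWARD = -1       # every step that triggers no other reward
DELIVERY_REWARD = 20   # passenger dropped off at the correct destination
ILLEGAL_REWARD = -10   # pick-up or drop-off attempted illegally


def episode_return(n_plain_steps, n_illegal_actions, delivered):
    """Sum the rewards of one episode (hypothetical helper for illustration)."""
    total = n_plain_steps * STEP_REWARD + n_illegal_actions * ILLEGAL_REWARD
    if delivered:
        total += DELIVERY_REWARD
    return total


# Nine ordinary steps followed by a successful drop-off: 9 * (-1) + 20
print(episode_return(9, 0, True))  # 11
```

A random agent typically collects many -1 and -10 rewards before it ever delivers the passenger, so its return is usually strongly negative.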

## Exercise 1: Visualizing your agent in the environment

To use the taxi environment, you need the `gymnasium` package.

If you run Python locally, you can use the `pygame` package to visualize the game. If you use Colab instead, you can create a video of your agent acting in the environment.
First, we try out the environment by instantiating it and setting up the typical RL data stream we introduced in the slides:

<img src="mat/01-RL-datastream.png" alt="RL datastream" width="400"/>

To do so, we implement a `while` loop, sample an **action** from the possible actions in the action space, and **do** one step with that action in the environment. As an agent, we get back the next state (called the **observation**, `obs` for short), a **reward**, two flags indicating whether the episode has ended (`terminated` and `truncated`), and some additional information.

In [7]:
import gymnasium as gym
import pygame

# instantiation of the environment
env = gym.make('Taxi-v3', render_mode='human')
# resetting the environment before the first step
obs, _ = env.reset()

done = False
while not done:
    # sample an action
    action = env.action_space.sample()
    # do one step in the environment
    obs, reward, terminated, truncated, _ = env.step(action)

    # flag whether the episode is finished
    done = terminated or truncated
    # render the game
    env.render()

    # event handling: press the q key to end the visualization
    for event in pygame.event.get():
        if event.type == pygame.KEYDOWN:
            if event.key == pygame.K_q:
                pygame.quit()
                done = True

env.close()

In [13]:
from IPython.display import HTML
from base64 import b64encode

import imageio

env = gym.make("Taxi-v3", render_mode="rgb_array")

frames = []
obs, _ = env.reset()
done = False

while not done:
    frame = env.render()  # Capture the frame
    frames.append(frame)

    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    if done:
        # capture the final frame after the last step
        frame = env.render()
        frames.append(frame)

env.close()

video_path = "./taxi_vid.mp4"
imageio.mimsave(video_path, frames, fps=5)

In [14]:
# read the video file and make sure the file handle is closed afterwards
with open(video_path, 'rb') as f:
    mp4 = f.read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()

HTML(f"""
<video width=400 controls>
    <source src="{data_url}" type="video/mp4">
</video>
""")



## Exercise 2

Find out the dimensions of the state and action spaces and compare them with the ideas we introduced theoretically before.

In [6]:
env.observation_space

Discrete(500)

In [4]:
env.action_space

Discrete(6)
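The 500 states can be understood as the product of 25 taxi positions, 5 passenger locations (the four designated locations plus "in the taxi") and 4 destinations. The sketch below reproduces the encoding formula given in the Gymnasium Taxi documentation in plain Python. The function names `encode_state` and `decode_state` are hypothetical and chosen for illustration only.

```python
def encode_state(taxi_row, taxi_col, passenger, destination):
    """Combine the four state components into a single index (0..499)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination


def decode_state(state):
    """Split a state index into (taxi_row, taxi_col, passenger, destination)."""
    state, destination = divmod(state, 4)   # 4 possible destinations
    state, passenger = divmod(state, 5)     # 5 possible passenger locations
    taxi_row, taxi_col = divmod(state, 5)   # 5x5 grid
    return taxi_row, taxi_col, passenger, destination


print(5 * 5 * 5 * 4)             # 500
print(decode_state(123))         # (1, 1, 0, 3)
print(encode_state(1, 1, 0, 3))  # 123
```

Passenger indices 0 to 3 stand for Red, Green, Yellow and Blue, and index 4 means the passenger is in the taxi. Destinations use the same indices 0 to 3.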

## Exercise 3

How would you design and implement a strategy for the taxi driver that properly solves the taxi problem? Try out some ideas and hardcode the optimal policy for a given problem instance.

- check what the initial problem for state 328 looks like
- think about the exact order of actions you have to take
- hardcode them in a list and try it out!

Remark: The actions are encoded in the following way according to the documentation:

- 0: Move south (down)
- 1: Move north (up)
- 2: Move east (right)
- 3: Move west (left)
- 4: Pickup passenger
- 5: Drop off passenger
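The encoding above can be kept in a small lookup table, and a hardcoded list of actions can be replayed with a helper like the following. This is only a sketch: `ACTION_NAMES`, `describe_plan` and `run_plan` are hypothetical names, not part of `gymnasium`. The helper assumes only that `env.step` returns the five values used elsewhere in this notebook.

```python
# Lookup table for the action encoding listed above.
ACTION_NAMES = {
    0: "south",
    1: "north",
    2: "east",
    3: "west",
    4: "pickup",
    5: "dropoff",
}


def describe_plan(actions):
    """Translate a list of action codes into readable names."""
    return [ACTION_NAMES[a] for a in actions]


def run_plan(env, actions):
    """Replay a fixed list of actions; return (total_reward, finished)."""
    total_reward = 0
    finished = False
    for action in actions:
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        finished = terminated or truncated
        if finished:
            # stop as soon as the episode has ended
            break
    return total_reward, finished


print(describe_plan([0, 0, 4]))  # ['south', 'south', 'pickup']
```

Called as `run_plan(env, my_actions)` on an environment that has been reset and set to the desired state, it replaces the random sampling in the loop. Comparing the returned total reward across different action lists shows which plan is shorter.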

In [12]:
import gymnasium as gym
import pygame

# instantiation of the environment
env = gym.make('Taxi-v3', render_mode='human')
# resetting the environment before the first step
obs, _ = env.reset()

# consider a specific problem instance
env.unwrapped.s = 328

done = False
while not done:
    # sample an action
    action = env.action_space.sample()
    # do one step in the environment
    obs, reward, terminated, truncated, _ = env.step(action)

    # flag whether the episode is finished
    done = terminated or truncated
    # render the game
    env.render()

    # event handling: press the q key to end the visualization
    for event in pygame.event.get():
        if event.type == pygame.KEYDOWN:
            if event.key == pygame.K_q:
                pygame.quit()
                done = True

env.close()