# Solving the Taxi-v3 Challenge with Q-Learning 🚕

<img src='https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi.png' alt='Taxi Environment' width='400'>

In this project, we'll apply the Q-Learning algorithm to a more challenging environment: `Taxi-v3`. Our goal is to train an agent to efficiently pick up a passenger and transport them to their destination. This notebook will guide you through understanding the environment, implementing the Q-Learning algorithm, tuning hyperparameters, and finally, publishing your trained agent to the Hugging Face Hub.

### 📜 For Certification

To validate this project for the [Hugging Face Deep RL Course certification](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and achieve a result of **>= 4.5**. Your result is calculated as `mean_reward - std_reward` on the official [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard).
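
As a rough illustration of the scoring rule above, the sketch below computes `mean_reward - std_reward` for a list of per-episode rewards. The function name `leaderboard_result` and the sample rewards are hypothetical, and the use of the population standard deviation is an assumption (it matches NumPy's default `np.std`); the leaderboard's own computation may differ in detail.

```python
from statistics import mean, pstdev


def leaderboard_result(episode_rewards):
    """Return mean_reward - std_reward for a list of per-episode rewards."""
    mean_reward = mean(episode_rewards)
    # Population standard deviation (assumption; same default as np.std)
    std_reward = pstdev(episode_rewards)
    return mean_reward - std_reward


# Hypothetical rewards from five evaluation episodes
example_rewards = [8, 7, 9, 6, 10]
result = leaderboard_result(example_rewards)
print(f"result = {result:.2f}, passes threshold = {result >= 4.5}")
```

Because the standard deviation is subtracted, an agent that is consistent across episodes scores higher than one with the same mean but more variable rewards.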

## Step 1: Setup and Installations

We begin by installing the required libraries and setting up a virtual display for rendering the environment, which is necessary for creating video replays.

In [None]:
!pip install numpy gymnasium pygame imageio tqdm pickle5 huggingface_hub pyvirtualdisplay > /dev/null 2>&1
!sudo apt-get update > /dev/null 2>&1
!sudo apt-get install -y python3-opengl > /dev/null 2>&1
!sudo apt-get install -y ffmpeg xvfb > /dev/null 2>&1

In [None]:
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

## Step 2: Import Libraries and Utilities

Import the necessary Python libraries and our custom helper functions for evaluation and model publishing.

In [None]:
import numpy as np
import gymnasium as gym
import random
from tqdm.notebook import tqdm

# Import helper functions
from utils import evaluate_agent, push_to_hub

## Step 3: The Environment - Taxi-v3 🚖

Let's create and explore the `Taxi-v3` environment.

👉 **Documentation**: [Taxi-v3](https://gymnasium.farama.org/environments/toy_text/taxi/)

In this environment, a taxi must navigate a grid to pick up a passenger and drop them off at a designated location. The environment has 500 discrete states, accounting for the taxi's position (25), the passenger's location (5, including inside the taxi), and the destination (4).
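
To make the count of 500 concrete, here is a minimal sketch of how the four components can be packed into one integer, following the encoding formula given in the Gymnasium documentation for Taxi. The helper names `encode` and `decode` are illustrative and written in plain Python so the arithmetic is visible.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into a single integer in [0, 500)."""
    state = taxi_row * 5 + taxi_col         # 25 taxi positions on the 5x5 grid
    state = state * 5 + passenger_location  # 5 passenger locations (4 = in taxi)
    state = state * 4 + destination         # 4 possible destinations
    return state


def decode(state):
    """Invert encode(): recover (taxi_row, taxi_col, passenger_location, destination)."""
    destination = state % 4
    state //= 4
    passenger_location = state % 5
    state //= 5
    taxi_col = state % 5
    taxi_row = state // 5
    return taxi_row, taxi_col, passenger_location, destination


# Enumerate every combination and confirm each maps to a unique state index
all_states = {
    encode(row, col, passenger, dest)
    for row in range(5)
    for col in range(5)
    for passenger in range(5)
    for dest in range(4)
}
print(len(all_states))
print(decode(encode(3, 1, 2, 0)))
```

Each row of the Q-table built later corresponds to one of these integers, so the table has 500 rows.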

In [None]:
env = gym.make("Taxi-v3", render_mode="rgb_array")

### Understanding the State and Action Spaces

In [None]:
state_space = env.observation_space.n
action_space = env.action_space.n
print(f"There are {state_space} possible states and {action_space} possible actions.")

- **Action Space**: `Discrete(6)` corresponds to 6 possible actions:
  - 0: `move south`
  - 1: `move north`
  - 2: `move east`
  - 3: `move west`
  - 4: `pickup`
  - 5: `dropoff`

- **Reward System**:
  - `-1` for each step, unless another reward is triggered.
  - `+20` for a successful drop-off.
  - `-10` for illegal pickup or drop-off actions.
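
The reward values above can be combined into a quick estimate of an episode's total reward. The sketch below is only an illustration: the function `episode_return` and its parameters are hypothetical, and it assumes the episode ends with a successful drop-off and that rewards are summed without discounting.

```python
STEP_REWARD = -1             # ordinary step (movement or legal pickup)
DROPOFF_REWARD = 20          # final, successful drop-off
ILLEGAL_ACTION_REWARD = -10  # illegal pickup or drop-off


def episode_return(n_ordinary_steps, n_illegal_actions=0):
    """Undiscounted total reward for an episode ending in a successful drop-off."""
    return (
        n_ordinary_steps * STEP_REWARD
        + n_illegal_actions * ILLEGAL_ACTION_REWARD
        + DROPOFF_REWARD
    )


# 12 ordinary steps followed by the successful drop-off
print(episode_return(12))
# The same episode with one illegal pickup attempt along the way
print(episode_return(12, n_illegal_actions=1))
```

A single illegal action costs as much as ten ordinary steps, so a well-trained agent has to learn to avoid them entirely.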

## Step 4: Building the Q-Learning Algorithm

<img src='https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg' alt='Q-Learning Pseudocode' width='800'/>

We'll use the same core Q-Learning functions as before, as the algorithm is general-purpose.

In [None]:
def initialize_q_table(state_space, action_space):
    # One row per state, one column per action, all values starting at zero
    return np.zeros((state_space, action_space))

def epsilon_greedy_policy(Qtable, state, epsilon):
    # With probability 1 - epsilon, exploit the best known action
    if random.uniform(0, 1) > epsilon:
        return np.argmax(Qtable[state][:])
    else:
        # Otherwise explore; note this samples from the global `env` created in Step 3
        return env.action_space.sample()

def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable, learning_rate, gamma):
    for episode in tqdm(range(n_training_episodes)):
        # Decay epsilon exponentially from max_epsilon toward min_epsilon
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        state, info = env.reset()
        terminated = False
        truncated = False

        for step in range(max_steps):
            action = epsilon_greedy_policy(Qtable, state, epsilon)
            new_state, reward, terminated, truncated, info = env.step(action)
            # Q-Learning update: move Q(s, a) toward reward + gamma * max over a' of Q(s', a')
            Qtable[state][action] += learning_rate * (reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action])
            # Stop the episode once the environment signals it is over
            if terminated or truncated:
                break
            state = new_state
    return Qtable
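
To see what a single update inside `train` does numerically, the sketch below applies the same update rule to plain numbers. The helper `q_update` is hypothetical and exists only for this illustration; the learning rate and discount factor are passed in explicitly so the example stands alone.

```python
def q_update(q_current, reward, max_q_next, learning_rate, gamma):
    """Apply one Q-Learning update to a single Q-value and return the new value."""
    td_target = reward + gamma * max_q_next  # estimate of the return from this step
    td_error = td_target - q_current         # how far the current estimate is off
    return q_current + learning_rate * td_error


# First update in a fresh (all-zero) table after an ordinary step
print(f"{q_update(0.0, -1, 0.0, learning_rate=0.7, gamma=0.95):.2f}")  # -0.70

# First update for a successful drop-off from a fresh table
print(f"{q_update(0.0, 20, 0.0, learning_rate=0.7, gamma=0.95):.2f}")  # 14.00
```

Early in training most entries are pushed slightly negative by the step penalty, and the positive drop-off reward then propagates backwards through `max_q_next` as more episodes are played.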

## Step 5: Define Hyperparameters

These hyperparameters have been chosen as a good starting point. Feel free to experiment with them to improve your agent's performance!

**⚠️ Do not modify `eval_seed`.** Keeping it fixed ensures your agent is evaluated on the same starting conditions as everyone else, for a fair comparison on the leaderboard.

In [None]:
# Training parameters
n_training_episodes = 25000
learning_rate = 0.7

# Environment parameters
env_id = "Taxi-v3"
max_steps = 99
gamma = 0.95

# Evaluation parameters
n_eval_episodes = 100
eval_seed = [16,54,165,177,191,191,120,80,149,178,48,38,6,125,174,73,50,172,100,148,146,6,25,40,68,148,49,167,9,97,164,176,61,7,54,55,161,131,184,51,170,12,120,113,95,126,51,98,36,135,54,82,45,95,89,59,95,124,9,113,58,85,51,134,121,169,105,21,30,11,50,65,12,43,82,145,152,97,106,55,31,85,38,112,102,168,123,97,21,83,158,26,80,63,5,81,32,11,28,148]

# Exploration parameters
max_epsilon = 1.0
min_epsilon = 0.05
decay_rate = 0.005
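
To get a feel for the exploration parameters, the sketch below evaluates the same exponential schedule that `train` uses for epsilon at a few episode numbers. The helper `epsilon_at` is hypothetical; its defaults simply mirror the values in the cell above.

```python
import math


def epsilon_at(episode, min_epsilon=0.05, max_epsilon=1.0, decay_rate=0.005):
    """Exploration rate for a given episode under exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


for episode in (0, 100, 500, 1000, 25000):
    print(f"episode {episode:>5}: epsilon = {epsilon_at(episode):.3f}")
```

With these values epsilon starts at 1.0 and is already close to `min_epsilon` (roughly 0.056) by episode 1,000, so most of the 25,000 training episodes are spent largely exploiting what the agent has learned.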

## Step 6: Train the Agent

Initialize the Q-table and start the training process. This may take a few minutes.

In [None]:
q_table_taxi = initialize_q_table(state_space, action_space)
trained_q_table_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, q_table_taxi, learning_rate, gamma)

## Step 7: Evaluate and Publish to Hugging Face Hub 🔥

Now for the final step: package our trained model and push it to the Hub to get a score on the leaderboard.

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
model = {
    "env_id": env_id,
    "max_steps": max_steps,
    "n_training_episodes": n_training_episodes,
    "n_eval_episodes": n_eval_episodes,
    "eval_seed": eval_seed,
    "learning_rate": learning_rate,
    "gamma": gamma,
    "max_epsilon": max_epsilon,
    "min_epsilon": min_epsilon,
    "decay_rate": decay_rate,
    "qtable": trained_q_table_taxi
}

In [None]:
username = "<your-username>" # FILL THIS
repo_name = "q-Taxi-v3" # You can choose a different name
repo_id = f"{username}/{repo_name}"

# The push_to_hub function will evaluate the model, record a video,
# and upload everything to the Hub.
push_to_hub(repo_id, model, env)

## Next Steps & Challenges

Congratulations on training an agent for `Taxi-v3`! Check the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) to see your score.

If your score isn't high enough, here are some ideas to improve it:
- **Train for more episodes**: Increase `n_training_episodes`.
- **Adjust the learning rate**: A smaller `learning_rate` might lead to more stable learning.
- **Tune the exploration decay**: A smaller `decay_rate` makes epsilon decay more slowly, which allows for more exploration and might be necessary for this larger state space.