#  Q-Learning on Taxi-v3 — Step-by-Step Student Notebook

---

### 1. Introduction

We are going to teach a virtual taxi driver using Reinforcement Learning (RL).
Think of RL like teaching a kid or a pet — we give rewards for good behavior and small punishments for mistakes.
The taxi must learn how to pick up and drop off passengers in a small city.

**Key Idea:**

- The taxi learns by trial and error.
- The world is a small 5x5 grid with pickup and drop-off spots: Red, Green, Yellow, and Blue.
- The taxi can move, pick up, and drop off passengers.

**Our goal:** train the taxi to earn more rewards and make fewer mistakes.

---

### 2. Problem Definition

Right now, our taxi doesn’t know where to go, when to stop, or how to pick up a passenger.

We want it to learn the best decisions through experience.

For that, we’ll use the Q-Learning algorithm, which helps it remember what worked best in different situations.

---
### 3. Goal and Approach

**We will:**

- Create the Taxi world using Gymnasium.
- Understand its states, actions, and rewards using the helper file `assignment2_utils.py`.
- Build the Q-learning algorithm.
- Train it with different learning rates (α) and exploration rates (ε).
- Plot results to see how learning improves.
- Choose the best model and retrain it.
- Test how well the final trained taxi performs.

---

## Step 1: Install and Import Libraries

In [5]:
# Install the required packages
%pip install gymnasium numpy matplotlib pandas


Note: you may need to restart the kernel to use updated packages.




In [1]:

# Import libraries
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
from assignment2_utils import describe_env, describe_obs
import pandas as pd





### Explanation 

- **gymnasium** – builds the taxi environment.
- **numpy** – helps with math operations.
- **matplotlib** – makes graphs so we can see learning progress.
- **pandas** – makes tables for results.
- **assignment2_utils** – provided helper file that describes actions, observations, and rewards.

---

### **Step 1 (continued): Create and Describe the Taxi World**

In [4]:
# Create Taxi environment
env = gym.make('Taxi-v3')

# Compatibility shim (not a wrapper): set env.reward_range if this Gymnasium version does not provide it
if not hasattr(env, 'reward_range'):
    env.reward_range = (-10, 20)

# Now we can safely describe the environment
describe_env(env)

num_states = env.observation_space.n
num_actions = env.action_space.n

Observation space:  Discrete(500)
Observation space size:  500
Reward Range:  (-10, 20)
Number of actions:  6
Action description:  {0: 'Move south (down)', 1: 'Move north (up)', 2: 'Move east (right)', 3: 'Move west (left)', 4: 'Pickup passenger', 5: 'Drop off passenger'}


### Explanation
- There are **500 states**: 25 taxi positions × 5 passenger locations (the four colored spots plus "inside the taxi") × 4 destinations.
- There are **6 actions**: move south, north, east, west, pick up, and drop off.
- Rewards in Taxi-v3 are -1 per step, -10 for an illegal pickup or drop-off, and +20 for a successful drop-off, which matches the reward range (-10, 20) printed above.

The helper function `describe_env()` prints the observation space, reward range, and action meanings, which helps us understand the environment structure.
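
To see where the 500 comes from, a state number can be packed and unpacked by hand. The sketch below assumes the encoding described in the Gymnasium Taxi documentation, `((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination`. The function names `encode_state` and `decode_state` are illustrative and are not part of `assignment2_utils`.

```python
def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four components into one integer in the range 0..499.

    passenger_location: 0-3 for Red, Green, Yellow, Blue; 4 means "in the taxi".
    destination: 0-3 for Red, Green, Yellow, Blue.
    """
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination


def decode_state(state):
    """Invert encode_state by peeling components off from the right."""
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination


# 25 taxi positions x 5 passenger locations x 4 destinations
print(5 * 5 * 5 * 4)             # 500
print(encode_state(3, 1, 2, 0))  # 328
print(decode_state(328))         # (3, 1, 2, 0)
```

State 328 therefore means: taxi at row 3, column 1, passenger waiting at Yellow, destination Red.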

---

## Step 2: Understanding Q-Learning
Q-learning helps the taxi learn what’s the **best action** in each situation.

### The Formula

```
Q(state, action) ← Q(state, action) + α * [reward + γ * max over a' of Q(next_state, a') - Q(state, action)]
```
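
A single update worked through with numbers may make the formula more concrete. This is a minimal sketch: the helper name `q_update` is made up for illustration, and the values α = 0.1 and γ = 0.9 are the ones this notebook sets later.

```python
def q_update(old_value, reward, next_max, alpha, gamma):
    """Return the new Q-value after one Q-learning update."""
    td_target = reward + gamma * next_max  # what this experience suggests the value is
    td_error = td_target - old_value       # how far off the old estimate was
    return old_value + alpha * td_error


# An ordinary step costs -1 while the table is still all zeros.
print(q_update(old_value=0.0, reward=-1, next_max=0.0, alpha=0.1, gamma=0.9))  # -0.1

# A successful drop-off pays +20.
print(q_update(old_value=0.0, reward=20, next_max=0.0, alpha=0.1, gamma=0.9))  # 2.0

# One step before the drop-off, after the drop-off value has grown to 2.0:
# the target is -1 + 0.9 * 2.0 = 0.8, so the value moves from -0.1 to about -0.01.
print(q_update(old_value=-0.1, reward=-1, next_max=2.0, alpha=0.1, gamma=0.9))
```

The third call shows how the reward for the drop-off slowly leaks backward to earlier states, one update at a time.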

### Example 

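Imagine you are playing a game:

- You jump over an obstacle → +10 points.
- You fall → -5 points.

Over time, you learn which moves give you the best score. The taxi learns in the same way!

In words, the formula says:

New Value = Old Value + α × (How Much Better the New Experience Was Than Expected)

- α (alpha): learning rate, how far each update moves the old value toward the new experience
- γ (gamma): discount factor, how much future rewards count compared with immediate ones
- ε (epsilon): exploration probability, how often the taxi tries a random action instead of the best known one

---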

## Code 2: Initialize the Q-Table

In [6]:

# Create a Q-table with all zeros

q_table = np.zeros([env.observation_space.n, env.action_space.n])

###  Explanation 

- The table has **500 rows (states)** and **6 columns (actions)**.
- Each number shows how good an action is in a particular state.
- Right now, all are **0** because the taxi hasn’t learned anything yet.
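
Reading the table is plain NumPy indexing. The sketch below builds a stand-alone table of the same shape; the state number 328 and the value 2.5 are made-up examples, not results from training.

```python
import numpy as np

# Same shape as the notebook's table: 500 states x 6 actions
q_table = np.zeros((500, 6))
print(q_table.shape)                    # (500, 6)

state = 328                             # an arbitrary example state
print(q_table[state])                   # the six action values for this state

# With every value tied at 0, argmax returns the first index: action 0 (move south)
print(int(np.argmax(q_table[state])))   # 0

# Pretend training found that "pickup" (action 4) is worth 2.5 here (made-up number)
q_table[state, 4] = 2.5
print(int(np.argmax(q_table[state])))   # 4
```

The tie-breaking behavior is one reason exploration matters: an untrained taxi that always picks the "best" action would just keep driving south.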

---

## Code 3: Define Learning Parameters

In [7]:
# Set learning parameters
alpha = 0.1      # Learning rate (how fast the taxi learns)
gamma = 0.9      # Discount factor (how much it values the future)
epsilon = 0.1    # Exploration rate (how often it tries random moves)
episodes = 1000  # Number of training games

###  **Explanation**

| Parameter | Meaning | Simple Example |
|------------|----------|----------------|
| α (alpha) | How fast we learn | Like studying — too fast, we forget old info |
| γ (gamma) | Looks at future rewards | Like saving for the future vs spending now |
| ε (epsilon) | How often we explore | Trying new routes sometimes helps! |
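
A little arithmetic shows what γ = 0.9 and ε = 0.1 mean in practice. This sketch assumes the +20 drop-off reward and the 6 actions of Taxi-v3, and that a random move picks uniformly among all actions.

```python
gamma = 0.9
epsilon = 0.1
num_actions = 6

# Discounting: a +20 drop-off reward that is k steps away is weighted by gamma**k
for k in (0, 1, 5, 10):
    print(k, round(20 * gamma**k, 2))
# 0 20.0
# 1 18.0
# 5 11.81
# 10 6.97

# Epsilon-greedy: chance that the best-known action is the one actually taken.
# The random branch can also land on the best action, hence epsilon / num_actions.
p_greedy = (1 - epsilon) + epsilon / num_actions
print(round(p_greedy, 4))  # 0.9167
```

So a reward 10 steps away still counts for about a third of its face value, and roughly 1 move in 12 is not the taxi's current favorite.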


---

## 🏃‍♂️ Code 4: Train the Taxi
```python
rewards = []

for i in range(episodes):
    state, _ = env.reset()
    done = False
    total_reward = 0

    while not done:
        # Explore vs Exploit decision
        if np.random.uniform(0,1) < epsilon:
            action = env.action_space.sample()  # Try something random
        else:
            action = np.argmax(q_table[state])  # Choose best known move

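        # Take action. Gymnasium returns two end-of-episode flags:
        # terminated (passenger delivered) and truncated (step limit reached)
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated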

        # Update rule
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        q_table[state, action] = old_value + alpha * (reward + gamma * next_max - old_value)

        # Move to next state
        state = next_state
        total_reward += reward

    rewards.append(total_reward)
```

### 🧠 Explanation
- The taxi plays **1000 games (episodes)**.
- In each game, it tries moves, gets rewards, and updates the Q-table.
- Slowly, it learns what actions lead to the best outcomes.

---

## 📊 Code 5: Plot the Learning Progress
```python
plt.plot(rewards)
plt.xlabel('Episodes')
plt.ylabel('Total Reward per Episode')
plt.title('Taxi Learning Over Time')
plt.show()
```

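### 📈 Explanation
- Early episodes have very negative totals because the taxi wanders, paying -1 per step and -10 for each illegal pickup or drop-off.
- As the taxi learns, the curve should rise. A higher total means fewer wasted steps and fewer illegal moves.
- The curve stays noisy even late in training because ε keeps some moves random.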
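
Per-episode rewards jump around a lot, so a smoothed curve is often easier to read. The sketch below uses a simple moving average; the helper name `moving_average` and the default window of 50 episodes are arbitrary choices for illustration.

```python
import numpy as np


def moving_average(values, window=50):
    """Average each run of `window` consecutive values (output is shorter than input)."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        raise ValueError("need at least `window` values")
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


# Tiny check with made-up numbers
print(moving_average([1, 2, 3, 4, 5], window=2))  # [1.5 2.5 3.5 4.5]
```

In the notebook, `plt.plot(moving_average(rewards))` would draw the smoothed version of the training curve.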

---

## 🚖 Code 6: Watch the Trained Taxi
```python
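# Use a separate environment with a text render mode;
# the training env was created without one, so env.render() would show nothing
demo_env = gym.make('Taxi-v3', render_mode='ansi')
state, _ = demo_env.reset()
done = False
print(demo_env.render())

while not done:
    action = int(np.argmax(q_table[state]))  # always exploit, no random moves
    state, reward, terminated, truncated, info = demo_env.step(action)
    done = terminated or truncated  # truncation stops a taxi that is stuck in a loop
    print(demo_env.render())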
```

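### 🎬 Explanation
If training went well, the taxi now drives to the passenger, picks them up, and drops them off at the destination in few steps. After only 1000 episodes some starting positions may still be poorly learned, so expect an occasional run where the taxi wanders instead of delivering.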
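
Watching one episode is anecdotal, so it can help to average over many. The following is a sketch, not the notebook's own evaluation code: the function name `evaluate_policy`, the 100 episodes, and the 200-step safety cap are assumptions. It expects an environment with the Gymnasium `reset()` / `step()` interface.

```python
import numpy as np


def evaluate_policy(env, q_table, episodes=100, max_steps=200):
    """Run the greedy policy (no exploration, no learning) and summarize the results."""
    totals, successes = [], 0
    for _ in range(episodes):
        state, _ = env.reset()
        total, terminated, truncated, steps = 0.0, False, False, 0
        # max_steps is a safety cap in case the environment has no step limit of its own
        while not (terminated or truncated) and steps < max_steps:
            action = int(np.argmax(q_table[state]))
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            steps += 1
        totals.append(total)
        successes += int(terminated)  # in Taxi, termination means the passenger was delivered
    return {"mean_reward": float(np.mean(totals)), "success_rate": successes / episodes}
```

With the notebook's objects this would be called as `evaluate_policy(env, q_table)`. A success rate well below 1.0 suggests the taxi needs more training episodes.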

---

## 🧪 Code 7: Experiment with Parameters
```python
# Try different learning rates and exploration factors
alphas = [0.01, 0.2]     # learning rates (α)
epsilons = [0.2, 0.3]    # exploration factors (ε)
```

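### 🧠 Explanation
Try changing one value at a time and watch how the learning curve changes.

| Parameter | Low Value | High Value | What to Watch For |
|------------|------------|-------------|---------------|
| α (alpha) | Learns slowly but steadily | Learns fast, but each new experience can overwrite older estimates | A curve that rises very slowly (low) or stays jumpy (high) |
| γ (gamma) | Focuses on immediate rewards | Gives future rewards more weight | With a low γ the far-off +20 drop-off reward barely matters |
| ε (epsilon) | Less exploration, may miss better routes | More exploration, more random moves | A high ε keeps training rewards lower because random moves never stop |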
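
One way to run such an experiment is to wrap the training loop in a function and call it once per setting. The sketch below is an assumption about how that could look, not the notebook's own code: `train_q_learning` and `one_at_a_time` are hypothetical names, the baseline of α = 0.1 and ε = 0.1 comes from Code 3, and the second list of values is treated as exploration rates (ε), as the code comment above describes.

```python
import numpy as np


def train_q_learning(env, n_states, n_actions, alpha, epsilon,
                     gamma=0.9, episodes=1000, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.

    Returns (q_table, episode_rewards). `env` must follow the Gymnasium
    reset()/step() interface.
    """
    rng = np.random.default_rng(seed)
    q_table = np.zeros((n_states, n_actions))
    episode_rewards = []
    for _ in range(episodes):
        state, _ = env.reset()
        total, done = 0.0, False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))    # explore
            else:
                action = int(np.argmax(q_table[state]))  # exploit
            next_state, reward, terminated, truncated, _ = env.step(action)
            target = reward + gamma * np.max(q_table[next_state])
            q_table[state, action] += alpha * (target - q_table[state, action])
            state, total = next_state, total + reward
            done = terminated or truncated
        episode_rewards.append(total)
    return q_table, episode_rewards


def one_at_a_time(base_alpha=0.1, base_epsilon=0.1,
                  alphas=(0.01, 0.2), epsilons=(0.2, 0.3)):
    """List (alpha, epsilon) pairs that change only one value from the baseline."""
    configs = [(base_alpha, base_epsilon)]
    configs += [(a, base_epsilon) for a in alphas]
    configs += [(base_alpha, e) for e in epsilons]
    return configs


print(one_at_a_time())
# [(0.1, 0.1), (0.01, 0.1), (0.2, 0.1), (0.1, 0.2), (0.1, 0.3)]
```

In the notebook, each pair from `one_at_a_time()` could be passed to `train_q_learning(gym.make('Taxi-v3'), 500, 6, alpha, epsilon)`, and the runs compared by the mean of the last 100 episode rewards. Starting every run from a fresh Q-table keeps the comparison fair.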

---

## 🧾 Conclusion
We built and trained a **Q-Learning taxi** that learned to pick up and drop off passengers on its own.

### Key Learnings:
- Q-Learning uses rewards to guide behavior.
- Hyperparameters (α, γ, ε) control how the taxi learns.
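- More training episodes generally improve the taxi's decisions, until the Q-values stop changing much.

🟩 **Real-life Example:** Robotics and self-driving research also trains agents in simulation before testing on real roads, although those systems use far more complex methods than a 500-row Q-table.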
