
# Reinforcement Learning – Solving Taxi-v3 with Q-Learning

This notebook uses **Gymnasium** to train a Taxi agent using **Q-learning**, and includes animated runs with a legend to explain the agent's environment.

- The **Taxi agent** learns to plan its moves more efficiently over time.
- **Before training**: random and wasteful behavior.
- **After training**: clear, goal-driven strategy.

### Environment Overview

- Grid: 5×5 world
- Objective: **Pick up** and **drop off** the passenger
- Rewards:
  - +20 for a successful drop-off
  - -1 per step (unless another reward applies)
  - -10 for an illegal pickup or drop-off
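
The environment reports each situation as a single integer, which is why the Q-table below can be indexed by `state` directly. The sketch below re-implements that indexing by hand to show where the 500 states come from (25 taxi positions × 5 passenger locations × 4 destinations). The helper names `encode_state` and `decode_state` are hypothetical and not part of the Gymnasium API, and the landmark order (0 = Red, 1 = Green, 2 = Yellow, 3 = Blue, 4 = in taxi) is assumed from the Taxi-v3 documentation.

```python
# Illustrative re-implementation of Taxi-v3's state indexing.
# encode_state / decode_state are hypothetical helpers, not Gymnasium functions.

def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into one integer in [0, 500)."""
    index = taxi_row
    index = index * 5 + taxi_col       # 5 columns
    index = index * 5 + passenger_loc  # 4 landmarks + "in taxi" (value 4)
    index = index * 4 + destination    # 4 landmarks
    return index

def decode_state(index):
    """Invert encode_state: recover (taxi_row, taxi_col, passenger_loc, destination)."""
    index, destination = divmod(index, 4)
    index, passenger_loc = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_loc, destination

n_states_total = 5 * 5 * 5 * 4
print(n_states_total)              # 500

state_index = encode_state(3, 1, 2, 0)
print(state_index)                 # 328
print(decode_state(state_index))   # (3, 1, 2, 0)
```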


In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from IPython.display import clear_output

sns.set_theme(style="whitegrid")

In [None]:
# Create the environment and an all-zero Q-table (one row per state, one column per action)
env = gym.make("Taxi-v3", render_mode="ansi")
n_states = env.observation_space.n
n_actions = env.action_space.n

q_table = np.zeros((n_states, n_actions))

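# Hyperparameters
alpha = 0.7          # learning rate: weight given to each new TD target
gamma = 0.618        # discount factor: how much future reward counts
epsilon = 1.0        # initial exploration rate (fully random at the start)
epsilon_min = 0.1    # floor on the exploration rate
decay = 0.995        # multiplicative epsilon decay applied after each episode
episodes = 2000      # number of training episodes
rewards = []         # total reward collected in each episode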
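
To see what these exploration settings imply, the sketch below works out how long the multiplicative decay takes to reach the floor. It assumes epsilon is multiplied by `decay` once per episode and clipped at `epsilon_min`, as in the training loop. The helper `epsilon_at` is a hypothetical name introduced only for this illustration.

```python
import math

epsilon_start, epsilon_min, decay = 1.0, 0.1, 0.995

# Smallest number of episodes n such that epsilon_start * decay**n <= epsilon_min
episodes_to_floor = math.ceil(math.log(epsilon_min / epsilon_start) / math.log(decay))
print(episodes_to_floor)  # 460

def epsilon_at(episode):
    """Exploration rate in effect after `episode` decay steps (hypothetical helper)."""
    return max(epsilon_start * decay ** episode, epsilon_min)

for ep in (0, 100, 459, 460, 2000):
    print(ep, round(epsilon_at(ep), 4))
```

Under these settings, roughly the first quarter of the 2000 episodes is spent annealing exploration, and the rest runs at a constant 10% random actions.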

In [None]:
# Training
for ep in range(episodes):
    state, _ = env.reset()
    total_reward = 0
    done = False

    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update: move Q(s, a) toward the TD target
        td_target = reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = next_state
        total_reward += reward

    epsilon = max(epsilon * decay, epsilon_min)
    rewards.append(total_reward)

print("Training complete.")
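
As a worked example of the update rule used in the training loop, the sketch below applies it to two hand-picked transitions with the notebook's `alpha = 0.7` and `gamma = 0.618`. The function `q_update` is a hypothetical standalone helper, and the transitions are illustrative numbers, not values taken from an actual training run.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.7, gamma=0.618):
    """One tabular Q-learning step: return the new value of Q(s, a)."""
    td_target = reward + gamma * max_q_next   # bootstrapped estimate of the return
    return q_sa + alpha * (td_target - q_sa)  # move part of the way toward it

# Transition 1: the successful drop-off (reward +20), Q-values still zero.
q_dropoff = q_update(q_sa=0.0, reward=20, max_q_next=0.0)
print(round(q_dropoff, 4))  # 14.0

# Transition 2: the move just before it (reward -1), whose next state
# now has a best action worth 14.0.
q_before = q_update(q_sa=0.0, reward=-1, max_q_next=q_dropoff)
print(round(q_before, 4))   # 5.3564
```

This is how the +20 reward propagates backward through the table: each update pulls a state's value toward the discounted value of its best successor.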

In [None]:
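# Plot the training reward, smoothed with a moving average
def moving_avg(data, window=100):
    # mode='valid' keeps only full windows, so the result has len(data) - window + 1 points
    return np.convolve(data, np.ones(window)/window, mode='valid')

plt.figure(figsize=(10, 4))
plt.plot(moving_avg(rewards), label='Smoothed Reward (100-episode moving average)')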
plt.axhline(y=8, color='r', linestyle='--', label='Solved Threshold')
plt.title("Taxi-v3 – Reward Progression")
plt.xlabel("Episode")
plt.ylabel("Reward")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
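# Helper functions for the animation
def print_legend():
    print("""
    - R, G, Y, B: Landmarks; the passenger's location is shown in blue, the destination in magenta
    - Highlighted cell: Taxi (yellow when empty, green when carrying the passenger)
    - | : Wall (cannot be crossed); ':' marks a passable cell boundary
    - Passenger is either waiting at a landmark or in the taxi
    """)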

def animate_run(q_table=None, title="Run"):
    """Play one episode (greedy if q_table is given, random otherwise), then replay it."""
    state, _ = env.reset()
    done = False
    steps = 0
    total_reward = 0
    frames = [env.render()]  # frame 0: the initial state

    while not done and steps < 100:
        if q_table is not None:
            action = np.argmax(q_table[state])
        else:
            action = env.action_space.sample()

        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
        steps += 1
        frames.append(env.render())  # also captures the final state

    for i, f in enumerate(frames):
        clear_output(wait=True)
        print(f"Step {i}/{steps} – {title}")
        print(f)
        print_legend()
        time.sleep(0.3)

    # Print the summary last so that clear_output does not erase it
    print(f"{title} – Steps: {steps}, Reward: {total_reward}\n")

In [None]:
animate_run(q_table=None, title="Before Training")

In [None]:
animate_run(q_table=q_table, title="After Training")
# Re-run this cell a few times: each reset samples a new starting configuration.
# (Note: states rarely visited during training may still have poorly estimated Q-values.)


### Key Reinforcement Learning Concepts

| Concept | Description |
|--------|-------------|
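| Q-Learning | Model-free, off-policy method that learns the value Q(s, a) of each action in each state |
| State | Taxi position, passenger location, and destination (500 discrete states) |
| Action | Move south/north/east/west, pick up, drop off (6 discrete actions) |
| Reward | +20 successful drop-off, -1 per step, -10 illegal pickup or drop-off |
| Policy | Mapping from states to actions; here, the highest-valued action in each Q-table row |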
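
To make the Policy row concrete, the sketch below derives a greedy policy from a small hand-written Q-table. The table values and the helper `greedy_policy` are hypothetical; the action order (0 = south, 1 = north, 2 = east, 3 = west, 4 = pickup, 5 = dropoff) is assumed from the Taxi-v3 documentation. For the notebook's NumPy `q_table`, `np.argmax(q_table, axis=1)` should give the same result in one call.

```python
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

def greedy_policy(q_table):
    """Return, for each state (row), the index of its highest-valued action."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_table]

# Toy Q-table with 3 states and 6 actions (illustrative numbers only)
toy_q_table = [
    [-1.2, -0.4, -2.0, -1.9, -9.0, -9.0],  # best: north
    [-0.7, -0.7, -0.7, -0.7, 14.0, -7.0],  # best: pickup
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],        # unvisited state: tie goes to index 0
]

policy = greedy_policy(toy_q_table)
print([ACTION_NAMES[a] for a in policy])  # ['north', 'pickup', 'south']
```

The last row shows why an undertrained table can misbehave: a state that was never updated has all-zero values, so the greedy choice defaults to the first action.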

Use this notebook to illustrate how RL learns multi-step plans from reward signals alone, the same idea that underlies planning applications in logistics and automation.