# Reinforcement Learning Assignment 2 – Taxi-v3 Environment
CSCN 8020 – Reinforcement Learning Programming  

## Done by ***Eris Leksi***


## Import required libraries

In [9]:
# Imports & setup
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from IPython.display import clear_output, display
import assignment2_utils as utils   # professor's helper file 

# Reproducibility & results folder
SEED = 42
np.random.seed(SEED)
RESULTS_DIR = "results"
os.makedirs(RESULTS_DIR, exist_ok=True)

### Explanation
Import the standard libraries and the professor's `assignment2_utils.py` helper module, seed NumPy's random generator, and create the `results/` directory.
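
As a small illustration of what the seed does, the sketch below reseeds NumPy's global generator and checks that the same draws come back. It only covers randomness drawn through `np.random`; the environment is seeded separately (the training loop passes `seed=` to `env.reset`).

```python
import numpy as np

SEED = 42

np.random.seed(SEED)
first = np.random.rand(3)     # three draws from the global generator

np.random.seed(SEED)          # reseeding restarts the same sequence
second = np.random.rand(3)

print(np.array_equal(first, second))
```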


## Create and Describe the Taxi Environment


In [10]:
import gymnasium as gym
import importlib
import assignment2_utils as utils
importlib.reload(utils)

env = gym.make("Taxi-v3")
num_obs, num_actions = utils.describe_env(env)


Observation space: Discrete(500)
Number of observations: 500
Action space: Discrete(6)
Number of actions: 6
Reward range: (-10, 20)


### Explanation
We create the `Taxi-v3` environment and call `describe_env()` from the supplied utility file to print action/observation details.
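
The 500 observations correspond to 25 taxi positions × 5 passenger locations × 4 destinations. The sketch below re-implements that index arithmetic in plain Python to make the layout concrete. `encode_state` and `decode_state` are illustrative names, and the formula is assumed to match the environment's own `encode()` and `decode()` methods, which the simulation cell calls through `env.unwrapped`.

```python
# Illustrative re-implementation of the Taxi-v3 state index (assumed layout:
# 5 rows x 5 columns x 5 passenger locations x 4 destinations = 500 states).

def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into a single integer in [0, 499]."""
    index = taxi_row
    index = index * 5 + taxi_col
    index = index * 5 + passenger_loc   # 0..3 = waiting on the map, 4 = in the taxi
    index = index * 4 + destination
    return index

def decode_state(index):
    """Invert encode_state; returns (taxi_row, taxi_col, passenger_loc, destination)."""
    index, destination = divmod(index, 4)
    index, passenger_loc = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_loc, destination

n_states = 5 * 5 * 5 * 4
example = encode_state(2, 0, 2, 3)   # taxi at (2, 0), passenger at location 2, destination 3
print(n_states, example, decode_state(example))
```

Each integer state therefore identifies one full situation, which is why a flat table with one row per state is enough for tabular Q-learning here.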


## Optional: Quick random rollout to verify environment (no learning)


In [24]:
# quick random rollout (short) to check env mechanics; no rendering here so it works in notebook
state, _ = env.reset(seed=SEED)
for step in range(10):
    action = env.action_space.sample()
    next_state, reward, terminated, truncated, info = env.step(action)
    print(f"Step {step+1}: action={action}, reward={reward}, done={terminated or truncated}")
    state = next_state
    if terminated or truncated:
        break


Step 1: action=0, reward=-1, done=False
Step 2: action=4, reward=-10, done=False
Step 3: action=2, reward=-1, done=False
Step 4: action=0, reward=-1, done=False
Step 5: action=0, reward=-1, done=False
Step 6: action=1, reward=-1, done=False
Step 7: action=1, reward=-1, done=False
Step 8: action=5, reward=-10, done=False
Step 9: action=0, reward=-1, done=False
Step 10: action=4, reward=-10, done=False


### Explanation
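A short random rollout verifies that the environment responds as expected (state transitions and rewards). In the output above, every move costs -1, and the failed pickup (action 4) and drop-off (action 5) attempts cost -10. The rollout does not open a render window, so it is safe to run in a notebook.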


## Hyperparameters & Experiment Configuration


In [12]:
# Baseline hyperparameters (assignment)
ALPHA_BASE = 0.1
EPSILON_BASE = 0.1
GAMMA_BASE = 0.9

# Parameter variations requested by the assignment
ALPHA_VARIATIONS = [0.01, 0.001, 0.2]   # change alpha separately
EPSILON_VARIATIONS = [0.2, 0.3]         # change epsilon separately (assignment probably meant ε)

# Training settings
N_EPISODES = 5000         
MAX_STEPS_PER_EP = 200


### Explanation
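Define the baseline hyperparameters (α = 0.1, ε = 0.1, γ = 0.9), the variations to test, and the run length. Each variation changes one parameter at a time while the others stay at their baseline values. Keep `N_EPISODES = 5000` for the reported results; lower it only for a faster debug run.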
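
To make the one-parameter-at-a-time design explicit, the configurations can be listed up front. This is a minimal sketch: `build_configs` is a hypothetical helper that is not part of the notebook, and the labels follow the `alpha_<value>` / `epsilon_<value>` naming used by the experiment cell.

```python
# One-factor-at-a-time experiment grid; values copied from the configuration cell.
ALPHA_BASE, EPSILON_BASE, GAMMA_BASE = 0.1, 0.1, 0.9
ALPHA_VARIATIONS = [0.01, 0.001, 0.2]
EPSILON_VARIATIONS = [0.2, 0.3]

def build_configs():
    """Return (label, alpha, epsilon, gamma) tuples; hypothetical helper."""
    configs = [("baseline", ALPHA_BASE, EPSILON_BASE, GAMMA_BASE)]
    # Vary alpha while epsilon and gamma stay at baseline.
    configs += [(f"alpha_{a}", a, EPSILON_BASE, GAMMA_BASE) for a in ALPHA_VARIATIONS]
    # Vary epsilon while alpha and gamma stay at baseline.
    configs += [(f"epsilon_{e}", ALPHA_BASE, e, GAMMA_BASE) for e in EPSILON_VARIATIONS]
    return configs

configs = build_configs()
for label, alpha, epsilon, gamma in configs:
    print(f"{label:12s} alpha={alpha} epsilon={epsilon} gamma={gamma}")
```

This yields six runs in total: the baseline, three α variations, and two ε variations.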


## Q-Learning Training Function (tabular)


In [13]:
def train_q_learning(env, n_episodes, alpha, gamma, epsilon, max_steps_per_episode=200, verbose=False):
    """
    Tabular Q-Learning on Taxi-v3.
    Returns: dict with episode list, steps, returns, final Q-table, runtime seconds.
    """
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    Q = np.zeros((n_states, n_actions), dtype=float)

    episode_steps = []
    episode_returns = []

    t0 = time.time()
    for ep in range(1, n_episodes + 1):
        reset_out = env.reset(seed=SEED + ep)  # slight variation to seed each episode
        state = reset_out[0] if isinstance(reset_out, tuple) else reset_out

        ep_return = 0
        ep_steps = 0
        for _ in range(max_steps_per_episode):
            # epsilon-greedy action
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))

            step_out = env.step(action)
            # gymnasium returns (obs, reward, terminated, truncated, info)
            if len(step_out) == 5:
                next_state, reward, terminated, truncated, _ = step_out
                done = terminated or truncated
            else:
                next_state, reward, done, _ = step_out

            # Q-learning update
            best_next = np.max(Q[next_state])
            td_target = reward + gamma * best_next
            td_error = td_target - Q[state, action]
            Q[state, action] += alpha * td_error

            state = next_state
            ep_return += reward
            ep_steps += 1

            if done:
                break

        episode_steps.append(ep_steps)
        episode_returns.append(ep_return)

        if verbose and (ep % 500 == 0 or ep == 1):
            print(f"Ep {ep}/{n_episodes} avg_last100={np.mean(episode_returns[-100:]):.2f}")

    runtime = time.time() - t0
    return {
        "episode": list(range(1, n_episodes + 1)),
        "steps": episode_steps,
        "returns": episode_returns,
        "Q": Q,
        "runtime_sec": runtime
    }


### Explanation
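This function implements tabular Q-learning with an ε-greedy behavior policy: with probability ε it samples a random action, otherwise it takes the greedy action `argmax(Q[state])`. After every step it applies the update `Q[s, a] += alpha * (reward + gamma * max(Q[s_next]) - Q[s, a])`. It returns the per-episode step counts and returns, plus the final Q-table and the wall-clock runtime in seconds.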
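
A worked example of the update may help. The sketch below isolates the update into a hypothetical helper `q_update` (the notebook performs it inline) and applies it twice to the same transition, using the baseline α = 0.1 and γ = 0.9 and a step reward of -1.

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    td_target = reward + gamma * np.max(Q[next_state])
    td_error = td_target - Q[state, action]
    Q[state, action] += alpha * td_error
    return td_error

Q = np.zeros((500, 6))   # same shape as the Taxi-v3 Q-table

# First visit: target = -1 + 0.9 * 0 = -1, error = -1, so Q[0, 1] becomes -0.1.
err1 = q_update(Q, state=0, action=1, reward=-1, next_state=1, alpha=0.1, gamma=0.9)

# Second visit: target is still -1, error = -1 - (-0.1) = -0.9, so Q[0, 1] becomes -0.19.
err2 = q_update(Q, state=0, action=1, reward=-1, next_state=1, alpha=0.1, gamma=0.9)

print(err1, err2, Q[0, 1])
```

Each update moves the estimate a fraction α of the way toward the target, which is consistent with the very small learning rates (0.01, 0.001) needing far more steps in the experiments.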


## Helper: Save metrics and plot utility


In [14]:
def save_metrics_csv(metrics, label):
    df = pd.DataFrame({"episode": metrics["episode"], "steps": metrics["steps"], "return": metrics["returns"]})
    path = os.path.join(RESULTS_DIR, f"metrics_{label}.csv")
    df.to_csv(path, index=False)
    return path

def plot_metrics(metrics, label, show=True):
    episodes = metrics["episode"]
    steps = metrics["steps"]
    returns = metrics["returns"]

    fig, axes = plt.subplots(2, 1, figsize=(9, 7), sharex=True)
    axes[0].plot(episodes, steps)
    axes[0].set_ylabel("Steps per episode")
    axes[0].grid(True)

    rolling = pd.Series(returns).rolling(100, min_periods=1).mean()
    axes[1].plot(episodes, returns, alpha=0.3, label="return")
    axes[1].plot(episodes, rolling, label="rolling mean (100)", linewidth=2)
    axes[1].set_ylabel("Return")
    axes[1].set_xlabel("Episode")
    axes[1].legend()
    axes[1].grid(True)

    plt.suptitle(f"Q-Learning: {label}")
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    png_path = os.path.join(RESULTS_DIR, f"plot_{label}.png")
    plt.savefig(png_path)
    if show:
        display(fig)
    plt.close(fig)
    return png_path

def summarize_metrics(metrics):
    ep_count = len(metrics["episode"])
    return {
        "episodes": ep_count,
        "avg_steps": float(np.mean(metrics["steps"])),
        "avg_return": float(np.mean(metrics["returns"])),
        "last100_avg_return": float(np.mean(metrics["returns"][-100:])),
        "total_steps": int(np.sum(metrics["steps"])),
        "runtime_sec": float(metrics.get("runtime_sec", 0.0))
    }


### Explanation
Utility functions to save CSVs, plot learning curves, and summarize run statistics for reports.
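
The plotting helper smooths returns with a 100-episode rolling mean and `min_periods=1`, so the first points average over however many episodes exist so far. The sketch below shows the same computation in NumPy only, with a small window for readability. `rolling_mean` is an illustrative name and the input values are arbitrary, not results from the runs.

```python
import numpy as np

def rolling_mean(values, window):
    """Trailing mean over up to `window` items (like rolling(window, min_periods=1))."""
    values = np.asarray(values, dtype=float)
    csum = np.cumsum(values)
    out = np.empty_like(values)
    for i in range(len(values)):
        start = max(0, i - window + 1)                       # first index inside the window
        total = csum[i] - (csum[start - 1] if start > 0 else 0.0)
        out[i] = total / (i - start + 1)                     # divide by the actual window size
    return out

smoothed = rolling_mean([1, 2, 3, 4], window=2)
print(smoothed)
```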


## Run Baseline & Hyperparameter Experiments (Q-Learning)


In [15]:
# Run baseline
experiments = []

print("Running baseline...")
metrics_base = train_q_learning(env, N_EPISODES, ALPHA_BASE, GAMMA_BASE, EPSILON_BASE, MAX_STEPS_PER_EP)
experiments.append(("baseline", ALPHA_BASE, EPSILON_BASE, GAMMA_BASE, metrics_base))
save_metrics_csv(metrics_base, "baseline")
plot_metrics(metrics_base, "baseline", show=False)

# Run alpha variations (keep epsilon, gamma baseline)
for a in ALPHA_VARIATIONS:
    label = f"alpha_{a}"
    print(f"Running alpha variation: {label}")
    m = train_q_learning(env, N_EPISODES, a, GAMMA_BASE, EPSILON_BASE, MAX_STEPS_PER_EP)
    experiments.append((label, a, EPSILON_BASE, GAMMA_BASE, m))
    save_metrics_csv(m, label)
    plot_metrics(m, label, show=False)

# Run epsilon variations (keep alpha baseline)
for e in EPSILON_VARIATIONS:
    label = f"epsilon_{e}"
    print(f"Running epsilon variation: {label}")
    m = train_q_learning(env, N_EPISODES, ALPHA_BASE, GAMMA_BASE, e, MAX_STEPS_PER_EP)
    experiments.append((label, ALPHA_BASE, e, GAMMA_BASE, m))
    save_metrics_csv(m, label)
    plot_metrics(m, label, show=False)

print("All experiments finished.")


Running baseline...
Running alpha variation: alpha_0.01
Running alpha variation: alpha_0.001
Running alpha variation: alpha_0.2
Running epsilon variation: epsilon_0.2
Running epsilon variation: epsilon_0.3
All experiments finished.


### Explanation
Run baseline training and all requested hyperparameter variations separately. Each run saves metrics and a plot PNG into the `results/` folder. This cell is the main deliverable code for the “Python code implementing Q-Learning and running it for the different hyperparameters.”


## Compare Experiments and Choose Best Combination


In [16]:
# Build summary table of experiments
rows = []
for label, alpha, eps, gamma, metrics in experiments:
    s = summarize_metrics(metrics)
    rows.append({
        "label": label, "alpha": alpha, "epsilon": eps, "gamma": gamma,
        "episodes": s["episodes"], "avg_steps": s["avg_steps"],
        "avg_return": s["avg_return"], "last100_avg_return": s["last100_avg_return"],
        "total_steps": s["total_steps"], "runtime_sec": s["runtime_sec"]
    })
df_summary = pd.DataFrame(rows).sort_values(by="last100_avg_return", ascending=False).reset_index(drop=True)
display(df_summary)


| | label | alpha | epsilon | gamma | episodes | avg_steps | avg_return | last100_avg_return | total_steps | runtime_sec |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | baseline | 0.1 | 0.1 | 0.9 | 5000 | 30.2672 | -21.557 | 2.3 | 151336 | 10.037345 |
| 1 | alpha_0.2 | 0.2 | 0.1 | 0.9 | 5000 | 23.4146 | -11.519 | 1.82 | 117073 | 7.228286 |
| 2 | epsilon_0.2 | 0.1 | 0.2 | 0.9 | 5000 | 32.7438 | -32.8638 | -4.75 | 163719 | 10.830549 |
| 3 | epsilon_0.3 | 0.1 | 0.3 | 0.9 | 5000 | 36.0178 | -47.371 | -15.03 | 180089 | 11.740064 |
| 4 | alpha_0.01 | 0.01 | 0.1 | 0.9 | 5000 | 126.416 | -159.6092 | -66.79 | 632080 | 33.903199 |
| 5 | alpha_0.001 | 0.001 | 0.1 | 0.9 | 5000 | 185.3298 | -258.504 | -246.82 | 926649 | 51.997869 |


### Explanation
Summarize metrics (average return, last-100 average, steps) and show a comparison table. Use the `last100_avg_return` column to pick the best-performing hyperparameter combination.
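
The selection rule can be sketched without pandas. The `last100_avg_return` values below are copied from the summary table above; the dictionary and variable names are illustrative.

```python
# last100_avg_return per experiment, copied from the summary table.
last100 = {
    "baseline": 2.30,
    "alpha_0.2": 1.82,
    "epsilon_0.2": -4.75,
    "epsilon_0.3": -15.03,
    "alpha_0.01": -66.79,
    "alpha_0.001": -246.82,
}

# Sort from best to worst, mirroring sort_values(..., ascending=False).
ranked = sorted(last100.items(), key=lambda item: item[1], reverse=True)
best_label, best_score = ranked[0]
margin = best_score - ranked[1][1]   # gap between the top two configurations

print(best_label, best_score, round(margin, 2))
```

The gap between the top two configurations is small, so the ranking between them may be sensitive to run-to-run variation.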


## Re-run the Best Combination and Save Q-Table


In [25]:
# pick best by last100_avg_return
best_row = df_summary.iloc[0]
best_label = best_row["label"]
best_alpha = best_row["alpha"]
best_eps = best_row["epsilon"]
best_gamma = best_row["gamma"]

print(f"Best found: {best_label} (alpha={best_alpha}, eps={best_eps}) — re-running to confirm.")
metrics_best = train_q_learning(env, N_EPISODES, best_alpha, best_gamma, best_eps, MAX_STEPS_PER_EP, verbose=True)
experiments.append(("best_rerun", best_alpha, best_eps, best_gamma, metrics_best))
save_metrics_csv(metrics_best, "best_rerun")
plot_metrics(metrics_best, "best_rerun", show=False)

# save Q-table
q_table_path = os.path.join(RESULTS_DIR, f"q_table_{best_label}.npy")
np.save(q_table_path, metrics_best["Q"])
print("Saved Q-table to:", q_table_path)


Best found: baseline (alpha=0.1, eps=0.1) — re-running to confirm.
Ep 1/5000 avg_last100=-569.00
Ep 500/5000 avg_last100=-96.71
Ep 1000/5000 avg_last100=-14.82
Ep 1500/5000 avg_last100=-2.74
Ep 2000/5000 avg_last100=-0.60
Ep 2500/5000 avg_last100=1.65
Ep 3000/5000 avg_last100=2.30
Ep 3500/5000 avg_last100=2.14
Ep 4000/5000 avg_last100=2.72
Ep 4500/5000 avg_last100=3.37
Ep 5000/5000 avg_last100=2.12
Saved Q-table to: results\q_table_baseline.npy


### Explanation
Re-run training with the best hyperparameters (from comparisons) to confirm results and save the learned Q-table (NumPy `.npy`) for later evaluation/simulation.
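
For later evaluation, the saved table can be read back with `np.load`. The sketch below shows the round trip using a temporary directory and a placeholder table, so it does not touch the real `results/` folder. The file name and the single non-zero entry are assumptions made for illustration, not trained values.

```python
import os
import tempfile
import numpy as np

Q = np.zeros((500, 6))
Q[0, 1] = -0.19   # placeholder entry, not a trained value

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "q_table_example.npy")   # hypothetical file name
    np.save(path, Q)
    Q_loaded = np.load(path)   # fully read into memory before the directory is removed

# Greedy action for state 0 from the reloaded table (first maximal entry wins ties).
greedy_action = int(np.argmax(Q_loaded[0]))
print(Q_loaded.shape, greedy_action)
```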


## Generate a PDF Report (metrics, plots, short comments)


In [18]:
# Create a simple multipage PDF with summary and plots
pdf_path = os.path.join(RESULTS_DIR, "assignment2_report.pdf")
with PdfPages(pdf_path) as pdf:
    # Title page
    fig = plt.figure(figsize=(8.5, 11))
    fig.text(0.5, 0.85, "CSCN 8020 — Assignment 2: Q-Learning on Taxi-v3", ha="center", fontsize=16)
    fig.text(0.5, 0.80, time.strftime("%Y-%m-%d %H:%M:%S"), ha="center", fontsize=10)
    fig.text(0.1, 0.7, "Deliverables included:", fontsize=12)
    fig.text(0.12, 0.66, "- Python code implementing Q-Learning and hyperparameter runs (notebook).")
    fig.text(0.12, 0.62, "- Metrics CSVs and plots in results/")
    fig.text(0.12, 0.58, "- Q-table for best run.")
    pdf.savefig()
    plt.close(fig)

    # Summary table page
    fig = plt.figure(figsize=(11, 8.5))
    fig.suptitle("Experiment Summary (sorted by last 100 avg return)", fontsize=14)
    plt.axis("off")
    tbl = plt.table(cellText=df_summary.round(3).values, colLabels=df_summary.columns, loc="center")
    tbl.auto_set_font_size(False)
    tbl.set_fontsize(8)
    tbl.scale(1, 1.5)
    pdf.savefig()
    plt.close(fig)

    # Plot pages for each experiment (embed saved pngs)
    for label, alpha, eps, gamma, metrics in experiments:
        png = os.path.join(RESULTS_DIR, f"plot_{label}.png")
        if os.path.exists(png):
            img = plt.imread(png)
            fig = plt.figure(figsize=(8.5, 11))
            plt.imshow(img)
            plt.axis("off")
            pdf.savefig()
            plt.close(fig)

print("PDF report saved at:", pdf_path)


PDF report saved at: results\assignment2_report.pdf


### Explanation
Generate `assignment2_report.pdf` containing a title page, a comparison table, and the saved learning-curve plots for each experiment. This satisfies the PDF deliverable requirement with metrics and plots.


## Agent Wrapper for Simulation


In [19]:
class QAgent:
    """
    Simple wrapper that exposes select_action(state) as required by utils.simulate_episodes().
    Uses an internal Q-table; selects greedy actions by default.
    """
    def __init__(self, Q_table):
        self.Q = Q_table

    def select_action(self, state):
        # greedy selection from Q-table (used for simulation)
        return int(np.argmax(self.Q[state]))
    


## Simulate the trained agent (visual, human render)

In [23]:
import gymnasium as gym
import time

# ---- Settings ----
NUM_EPISODES_SIM = 10
MAX_STEPS_PER_EP = 200    # keep same as training
RENDER_DELAY = 0.05       # seconds between renders; adjust to taste (0.0 = fastest)

# Names for taxi pickup/drop locations used by Taxi-v3 (standard)
# index 0..3 correspond to the four fixed map locations in Taxi-v3
LOC_NAMES = ["Red", "Green", "Yellow", "Blue"]

# ---- Prepare agent and env ----
agent = QAgent(metrics_best["Q"])
env_vis = gym.make("Taxi-v3", render_mode="human")

# Start first episode (no seed -> allow randomness)
state, info = env_vis.reset()        # initial state with random passenger/dest
done = False

for ep in range(1, NUM_EPISODES_SIM + 1):
    # Decode to get readable components
    taxi_row, taxi_col, passenger_loc, dest = env_vis.unwrapped.decode(state)

    # If passenger_loc == 4, passenger is already in taxi — show that as "in taxi"
    passenger_str = "in taxi" if passenger_loc == 4 else LOC_NAMES[passenger_loc]
    dest_str = LOC_NAMES[dest]

    # Print clear start-of-episode summary (taxi coords shown as (row, col))
    print(f"--- Episode {ep} ---")
    print(f"Passenger is at: {passenger_str}, wants to go to {dest_str}. Taxi currently at ({taxi_row:.0f}, {taxi_col:.0f})")

    # Run episode until termination (uses greedy policy)
    total_reward = 0
    step = 0
    done = False
    while not done and step < MAX_STEPS_PER_EP:
        action = agent.select_action(state)
        step_out = env_vis.step(action)
        # Gymnasium returns a 5-tuple; the 4-tuple branch covers the legacy Gym API
        if len(step_out) == 5:
            next_state, reward, terminated, truncated, info = step_out
            done = terminated or truncated
        else:
            next_state, reward, done, info = step_out

        total_reward += reward
        state = next_state
        step += 1

        # render human window (will show visually)
        try:
            env_vis.render()
        except Exception:
            pass

        # small delay so you can watch; set to 0 for faster execution
        time.sleep(RENDER_DELAY)

    print(f"Episode {ep} finished with total reward: {total_reward}\n")

    # Prepare next episode (if any): keep taxi at current location,
    # but resample passenger & destination randomly (ensure passenger != dest)
    # We use the environment's RNG so results differ each time.
    if ep < NUM_EPISODES_SIM:
        taxi_row, taxi_col, _, _ = env_vis.unwrapped.decode(state)
        # sample passenger_loc in 0..3 (on-map), and dest in 0..3, ensuring they differ
        rng = env_vis.unwrapped.np_random
        new_pass = int(rng.integers(0, 4))
        new_dest = int(rng.integers(0, 4))
        # ensure passenger != destination
        while new_dest == new_pass:
            new_dest = int(rng.integers(0, 4))

        # encode state and set it without teleporting taxi coords (we already set taxi_row/col)
        state = env_vis.unwrapped.encode(taxi_row, taxi_col, new_pass, new_dest)
        env_vis.unwrapped.s = state
        # small render to reflect new passenger on the map before next episode
        try:
            env_vis.render()
        except Exception:
            pass
        time.sleep(0.12)

# Done
env_vis.close()


--- Episode 1 ---
Passenger is at: Yellow, wants to go to Blue. Taxi currently at (2, 0)
Episode 1 finished with total reward: 10

--- Episode 2 ---
Passenger is at: Yellow, wants to go to Blue. Taxi currently at (4, 3)
Episode 2 finished with total reward: 5

--- Episode 3 ---
Passenger is at: Green, wants to go to Yellow. Taxi currently at (4, 3)
Episode 3 finished with total reward: 6

--- Episode 4 ---
Passenger is at: Yellow, wants to go to Green. Taxi currently at (4, 0)
Episode 4 finished with total reward: 11

--- Episode 5 ---
Passenger is at: Green, wants to go to Blue. Taxi currently at (0, 4)
Episode 5 finished with total reward: 14

--- Episode 6 ---
Passenger is at: Red, wants to go to Yellow. Taxi currently at (4, 3)
Episode 6 finished with total reward: 8

--- Episode 7 ---
Passenger is at: Blue, wants to go to Green. Taxi currently at (4, 0)
Episode 7 finished with total reward: 7

--- Episode 8 ---
Passenger is at: Red, wants to go to Blue. Taxi currently at (0, 4)
Ep

### Explanation
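This cell runs a custom simulation loop (rather than calling `utils.simulate_episodes()`) with the trained greedy agent. It uses `render_mode="human"` to visualize the taxi and prints each episode's passenger location, destination, and total reward. After each episode it keeps the taxi where it stopped and resamples a new passenger and destination. Run it after training when you want to watch the learned behavior.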


## Experiment Summary

- The **baseline** and each **parameter variation** were trained for **5,000 episodes**.  
- Performance was evaluated using the **average return** over all episodes and the **last-100-episode average return**, which indicates how well each run had converged by the end of training.  
- All detailed results, learning curves, and comparison tables are included in the accompanying **assignment2_report.pdf**.  
- Supporting files, including the CSV logs and the Q-table from the best run, are saved in the `results/` directory.  
- Based on the final **last-100 average return**, the **baseline configuration (α=0.1, ε=0.1, γ=0.9)** performed best (2.30, narrowly ahead of 1.82 for α=0.2) and was re-run for confirmation.  
  The resulting Q-table was saved as `results/q_table_baseline.npy`.


- **Suggested Next Steps:**
  - Implement **ε-decay** for dynamic exploration control.
  - Extend training to **10,000 episodes** for improved convergence stability.
  - Compare results against **SARSA** and **Deep Q-Network (DQN)** implementations for broader insight.
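
For the ε-decay step suggested above, one possible schedule is an exponential decay with a floor. This is only a sketch: `epsilon_schedule` and its default values (`eps_start`, `eps_min`, `decay`) are hypothetical and were not tuned or tested in this assignment.

```python
def epsilon_schedule(episode, eps_start=1.0, eps_min=0.05, decay=0.999):
    """Exponentially decayed exploration rate with a lower bound (hypothetical values)."""
    return max(eps_min, eps_start * decay ** episode)

# Exploration starts high and shrinks toward the floor as training progresses.
for ep in (0, 1000, 3000, 5000):
    print(ep, round(epsilon_schedule(ep), 4))
```

In the training loop, the fixed `epsilon` argument would be replaced by a call such as `epsilon_schedule(ep)` at the start of each episode.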
