# Curriculum_vs_NonCurriculum_MetaDrive_SB3_Experiments

**Generated:** 2025-11-20T19:41:51.012806Z

## Abstract

This notebook contains a full, reproducible experiment pipeline for comparing **non-curriculum** vs **curriculum** reinforcement learning for autonomous driving using **MetaDrive** and **Stable Baselines3 (SB3)**. It includes:

- Full environment factory and wrappers (including a discrete-action wrapper for DQN).
- Exact stage definitions (C0..C3) and matching budgets.
- Non-curriculum runner (train on each target map separately, each with the same total sample budget as the curriculum).
- Curriculum runner (C0→C1→C2→C3 sequential fine-tuning).
- Evaluation harness (metrics logging, CSV saving, TensorBoard integration).
- Hyperparameters and experiment folder conventions.

**Note:** This notebook provides runnable code cells and is designed for a machine with MetaDrive installed and sufficient CPU/GPU. Heavy training cells are **commented out** or parametrized with small pilot budgets; uncomment them or raise the budgets to run full experiments.


## 0. Install dependencies (run once)

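Run the command below once, in a terminal or notebook cell, if MetaDrive or SB3 is not already installed:
`pip install "stable-baselines3[extra]" gymnasium metadrive numpy pandas matplotlib tensorboard opencv-python`

The quotes keep shells such as zsh from expanding the `[extra]` brackets. If `metadrive` does not resolve on your package index, check the MetaDrive installation docs; the distribution may be published under a different name (for example `metadrive-simulator`).

**Important:** The install step needs internet access. On an offline machine, install the packages beforehand.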


In [None]:
# !pip install "stable-baselines3[extra]" gymnasium metadrive numpy pandas matplotlib tensorboard opencv-python
# Leave the line above commented out if the dependencies are already installed.
print('Install packages if needed.')
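
Before installing anything, it can help to see which packages are already importable. The following is a minimal sketch using only the standard library; the helper name `find_missing` and the module list are illustrative assumptions, not part of the notebook's pipeline.

```python
import importlib.util

# Import names (not pip distribution names) of the packages used in this notebook.
REQUIRED_MODULES = ["stable_baselines3", "gymnasium", "metadrive",
                    "numpy", "pandas", "matplotlib"]

def find_missing(modules):
    """Return the top-level module names that the import system cannot locate."""
    return [m for m in modules if importlib.util.find_spec(m) is None]

missing = find_missing(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

An empty result only shows that the modules can be found, not that their versions are compatible with each other.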

## 1. Imports and utility functions

In [None]:
import os
import time
import json
import math
import random
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# SB3 imports
from stable_baselines3 import PPO, SAC, DQN
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback, BaseCallback
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

# MetaDrive import guard: every experiment cell below needs MetaDrive installed
try:
    from metadrive.envs.metadrive_env import MetaDriveEnv
except Exception as e:  # broad on purpose: the failure can come from MetaDrive's own dependencies
    MetaDriveEnv = None
    print(f'MetaDrive import failed ({e!r}). Install metadrive before running MetaDrive experiments.')

print('Imports ready')

## 2. Experiment configuration (maps, stages, budgets, hyperparameters, folder layout)

In [None]:
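### Maps, stages and budgets
# Each stage is (stage_name, map_name, traffic_density, timesteps).
# map_name is passed unchanged to MetaDrive's 'map' config entry; confirm that these
# names are valid map specifications for your installed MetaDrive version.
STAGES = [
    ("C0_Straight",   "Straight",   0.0, 200_000),
    ("C1_Curve",      "Curve",      0.0, 300_000),
    ("C2_Roundabout", "Roundabout", 0.0, 400_000),
    ("C3_Dynamic",    "20-block",   0.3, 400_000),
]
TOTAL_CURRICULUM_BUDGET = sum(steps for _, _, _, steps in STAGES)
print('Total curriculum budget (per algorithm) =', TOTAL_CURRICULUM_BUDGET)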

# Folder convention
EXPERIMENT_ROOT = Path('experiments')
EXPERIMENT_ROOT.mkdir(exist_ok=True)

# Seeds and workers
SEEDS = [0,1,2]
N_ENVS = 8
EVAL_FREQ = 10_000
EVAL_EPISODES = 10

# Held-out test map
HELDOUT_MAP = ("Fork", 0.2)

# Hyperparameters (consistent)
HYPERS = {
    'PPO': {
        'policy':'MlpPolicy', 'policy_kwargs':{'net_arch':[64,64]}, 'learning_rate':3e-4,
        'n_steps':2048, 'batch_size':64, 'n_epochs':10, 'gamma':0.99, 'clip_range':0.2
    },
    'SAC': {
        'policy':'MlpPolicy', 'policy_kwargs':{'net_arch':[256,256]}, 'learning_rate':3e-4,
        'batch_size':256, 'buffer_size':100_000, 'gamma':0.99
    },
    'DQN': {
        'policy':'MlpPolicy', 'policy_kwargs':{'net_arch':[64,64]}, 'learning_rate':1e-4,'buffer_size':50_000,'batch_size':32,'train_freq':4
    }
}

print('Config ready')
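
SB3's `EvalCallback` and `CheckpointCallback` count `eval_freq` and `save_freq` in callback calls, and with a vectorized environment each call advances `n_envs` timesteps. The sketch below shows the conversion the SB3 documentation suggests, plus the number of evaluations each stage budget allows. The helper names and the lowercase variables are illustrative; the values are copied from the configuration above.

```python
def calls_per_eval(eval_freq_timesteps, n_envs):
    """Convert a frequency given in timesteps into a frequency in callback calls."""
    return max(eval_freq_timesteps // n_envs, 1)

def evals_per_stage(stage_steps, eval_freq_timesteps):
    """Number of evaluations that fit into one stage budget."""
    return stage_steps // eval_freq_timesteps

# Values copied from the configuration cell (renamed so they do not shadow it).
eval_freq_timesteps, n_envs = 10_000, 8
stage_budgets = {"C0_Straight": 200_000, "C1_Curve": 300_000,
                 "C2_Roundabout": 400_000, "C3_Dynamic": 400_000}

print("callback calls between evaluations:", calls_per_eval(eval_freq_timesteps, n_envs))
for name, steps in stage_budgets.items():
    print(name, "evaluations:", evals_per_stage(steps, eval_freq_timesteps))
```

Passing `EVAL_FREQ` unconverted to a callback with 8 environments would evaluate every 80,000 timesteps rather than every 10,000.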

## 3. Environment wrappers
Includes a discrete-action wrapper for DQN (maps discrete indices -> continuous steer/throttle).

In [None]:
import gymnasium as gym
from gymnasium import spaces

class DiscreteActionWrapper(gym.ActionWrapper):
    def __init__(self, env, mapping):
        super().__init__(env)
        self.mapping = mapping
        self.action_space = spaces.Discrete(len(mapping))

    def action(self, action):
        return np.array(self.mapping[action], dtype=np.float32)

print('Wrapper defined')
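
The `mapping` argument is just a list of `(steer, throttle)` pairs, so a denser action set can be generated instead of written by hand. This is a minimal sketch; `build_action_grid` and the value lists are hypothetical and have not been tuned for MetaDrive.

```python
from itertools import product

def build_action_grid(steer_values, throttle_values):
    """Return every (steer, throttle) combination as a list of float pairs.

    The list index is the discrete action id the agent would choose.
    """
    return [(float(s), float(t)) for s, t in product(steer_values, throttle_values)]

# Hypothetical grid: 5 steering levels x 3 throttle levels = 15 discrete actions.
mapping = build_action_grid([-1.0, -0.5, 0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
print(len(mapping))
print(mapping[7])
```

A larger grid gives finer control but also more actions for DQN to explore, so the size is a trade-off to check in a pilot run.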

## 4. Environment factory / vectorized env helper

In [None]:
from functools import partial

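def make_metadrive_env(map_name, traffic_density=0.0, use_discrete=False, seed=0, render=False):
    """Return a zero-argument factory that builds one Monitor-wrapped MetaDrive env."""
    def _init():
        if MetaDriveEnv is None:
            raise RuntimeError('MetaDrive is not installed; cannot create the environment.')
        cfg = {
            'map': map_name,
            'traffic_density': traffic_density,
            'use_render': render,  # honor the render argument
            'start_seed': seed,
            'random_spawn': True,
            'debug': False,
        }
        env = MetaDriveEnv(cfg)
        # DQN needs a discrete action space: map 5 indices to (steer, throttle) pairs
        if use_discrete:
            mapping = [(-1.0,0.0),(-1.0,0.3),(0.0,0.5),(1.0,0.3),(1.0,0.0)]
            env = DiscreteActionWrapper(env, mapping)
        return Monitor(env)
    return _init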

def make_vec_env(map_name, traffic_density=0.0, n_envs=8, use_discrete=False, seed=0, parallel=False):
    factories = [make_metadrive_env(map_name, traffic_density, use_discrete, seed+i) for i in range(n_envs)]
    if parallel:
        return SubprocVecEnv(factories)
    else:
        return DummyVecEnv(factories)

print('Env factory ready')
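
The factory gives worker `i` the seed `seed + i`, so runs started with `SEEDS = [0, 1, 2]` and 8 workers reuse most of each other's environment seeds. The sketch below checks this and shows one possible alternative that gives every run its own block of seeds. `worker_seeds` and the stride of 1000 are assumptions for illustration, not something the notebook uses; whether the overlap matters depends on how strongly the seed shapes the scenario.

```python
def worker_seeds(run_seed, n_envs, stride=1000):
    """Hypothetical scheme: reserve a block of `stride` seeds for each run."""
    base = run_seed * stride
    return {
        "train": [base + i for i in range(n_envs)],
        "eval": base + 100,
        "heldout": base + 500,
    }

def overlapping(seeds_a, seeds_b):
    """Seeds that two runs have in common."""
    return sorted(set(seeds_a) & set(seeds_b))

# Scheme used by make_vec_env: seed + i for each of the 8 workers.
run0 = [0 + i for i in range(8)]
run1 = [1 + i for i in range(8)]
print("shared seeds, seed+i scheme:", overlapping(run0, run1))

# Block scheme: runs 0 and 1 no longer share any training seed.
print("shared seeds, block scheme:",
      overlapping(worker_seeds(0, 8)["train"], worker_seeds(1, 8)["train"]))
```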

## 5. Custom callback to log per-eval metrics (success rate, collisions, speed etc)

In [None]:
class MetricsCallback(BaseCallback):
    """
    Custom callback to compute and append evaluation metrics to CSV on each eval.
    Assumes eval_env is a single (non-vectorized) Gymnasium env whose final-step info dict has
    the keys 'success','collision','avg_speed','traffic_violation'. MetaDrive may expose these
    under other names, so inspect the info dict of your version; missing keys use the fallbacks below.
    """
    def __init__(self, eval_env, out_csv, eval_episodes=10, eval_freq=EVAL_FREQ, verbose=0):
        super().__init__(verbose)
        self.eval_env = eval_env
        self.out_csv = out_csv
        self.eval_episodes = eval_episodes
        self.eval_freq = eval_freq  # in timesteps; converted to callback calls in _on_step
        self._cols = ['timestamp','total_timesteps','mean_reward','std_reward','success_rate','collision_rate','avg_speed','traffic_violations']
        if not os.path.exists(out_csv):
            pd.DataFrame(columns=self._cols).to_csv(out_csv, index=False)

    def _on_step(self) -> bool:
        # One callback call advances n_envs timesteps, so convert the frequency first
        n_envs = self.training_env.num_envs if self.training_env is not None else 1
        if self.n_calls % max(self.eval_freq // n_envs, 1) == 0:
            self.record_eval(self.model, self.num_timesteps)
        return True

    def record_eval(self, model, total_timesteps):
        # Run evaluation episodes and collect per-episode metrics
        rewards, successes, collisions, speeds, traffic_viol = [], [], [], [], []
        for ep in range(self.eval_episodes):
            obs, _ = self.eval_env.reset()
            done = False
            ep_r = 0.0
            info = {}
            while not done:
                action, _ = model.predict(obs, deterministic=True)
                obs, reward, terminated, truncated, info = self.eval_env.step(action)
                ep_r += reward
                done = terminated or truncated
            rewards.append(ep_r)
            info_ep = info if isinstance(info, dict) else {}
            # Fallbacks if the env does not provide these keys; for 'success' the fallback
            # is only a crude proxy (positive episode return)
            successes.append(float(info_ep.get('success', 1.0 if ep_r>0 else 0.0)))
            collisions.append(float(info_ep.get('collision', 0.0)))
            speeds.append(float(info_ep.get('avg_speed', info_ep.get('speed', 0.0))))
            traffic_viol.append(float(info_ep.get('traffic_violation', 0.0)))

        row = {
            'timestamp': time.time(), 'total_timesteps': total_timesteps,
            'mean_reward': np.mean(rewards), 'std_reward': np.std(rewards),
            'success_rate': np.mean(successes), 'collision_rate': np.mean(collisions),
            'avg_speed': np.mean(speeds), 'traffic_violations': np.mean(traffic_viol)
        }
        # Append one row in place; DataFrame.append no longer exists in pandas 2.x
        pd.DataFrame([row], columns=self._cols).to_csv(self.out_csv, mode='a', header=False, index=False)
        return row

print('MetricsCallback defined')
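
The metrics file grows by one row per evaluation, so rows can be appended without loading the whole file. The sketch below shows that append pattern with the standard `csv` module. The column subset, the `append_row` helper and the sample values are made up for illustration and are not the callback's actual implementation.

```python
import csv
import os
import tempfile

COLUMNS = ["total_timesteps", "mean_reward", "success_rate"]

def append_row(path, row, columns=COLUMNS):
    """Append one row; write the header only if the file is new or empty."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

def read_rows(path):
    """Read all rows back as dictionaries (values are strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Made-up sample values, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "metrics.csv")
append_row(path, {"total_timesteps": 10_000, "mean_reward": 1.5, "success_rate": 0.2})
append_row(path, {"total_timesteps": 20_000, "mean_reward": 2.5, "success_rate": 0.4})
rows = read_rows(path)
print(len(rows), rows[-1]["total_timesteps"])
```

Appending keeps the cost of each evaluation constant, which matters little here but avoids rewriting the file on long runs.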

## 6. Training functions: non-curriculum and curriculum runners

These functions create models, attach callbacks, and run training. Each run saves checkpoints, the best model, and CSV metrics.


In [None]:
def make_model(algo, env, hyperparams):
    """Build an SB3 model for `algo` from its HYPERS entry; all runs share one TensorBoard log dir."""
    common = dict(verbose=1, tensorboard_log=str(EXPERIMENT_ROOT/'tensorboard'),
                  policy_kwargs=hyperparams['policy_kwargs'], learning_rate=hyperparams['learning_rate'])
    if algo == 'PPO':
        # clip_range is read from HYPERS so that the configured value is applied
        return PPO(hyperparams['policy'], env, n_steps=hyperparams['n_steps'], batch_size=hyperparams['batch_size'],
                   n_epochs=hyperparams['n_epochs'], gamma=hyperparams['gamma'],
                   clip_range=hyperparams.get('clip_range',0.2), **common)
    if algo == 'SAC':
        return SAC(hyperparams['policy'], env, batch_size=hyperparams.get('batch_size',256),
                   buffer_size=hyperparams.get('buffer_size',100000), gamma=hyperparams.get('gamma',0.99), **common)
    if algo == 'DQN':
        return DQN(hyperparams['policy'], env, buffer_size=hyperparams.get('buffer_size',50000),
                   batch_size=hyperparams.get('batch_size',32), train_freq=hyperparams.get('train_freq',4), **common)
    raise ValueError(f'Unknown algo: {algo}')


def train_noncurriculum(algo, map_name, traffic, total_timesteps, seed, n_envs=N_ENVS):
    from stable_baselines3.common.utils import set_random_seed

    out_dir = EXPERIMENT_ROOT/f"{algo}/noncurriculum/seed_{seed}/{map_name}"
    out_dir.mkdir(parents=True, exist_ok=True)
    print('Training non-curriculum:', algo, map_name, 'seed', seed)

    set_random_seed(seed)  # seed Python, NumPy and PyTorch before the policy is created
    use_discrete = (algo=='DQN')
    env = make_vec_env(map_name, traffic_density=traffic, n_envs=n_envs, use_discrete=use_discrete, seed=seed)
    # Single Monitor-wrapped env for evaluation (SB3 vectorizes it internally where needed)
    eval_env = make_metadrive_env(map_name, traffic, use_discrete, seed+100)()

    model = make_model(algo, env, HYPERS[algo])

    # callbacks: SB3 counts eval_freq/save_freq in callback calls, and one call advances n_envs timesteps
    freq = max(EVAL_FREQ // n_envs, 1)
    eval_cb = EvalCallback(eval_env, best_model_save_path=str(out_dir/'best_model'), log_path=str(out_dir/'eval_logs'), eval_freq=freq, n_eval_episodes=EVAL_EPISODES, deterministic=True)
    ckpt_cb = CheckpointCallback(save_freq=freq, save_path=str(out_dir/'checkpoints'), name_prefix='ckpt')
    metrics_csv = out_dir/'metrics.csv'
    metrics_cb = MetricsCallback(eval_env, str(metrics_csv), eval_episodes=EVAL_EPISODES)

    # train
    model.learn(total_timesteps=total_timesteps, callback=[eval_cb, ckpt_cb, metrics_cb])
    model.save(str(out_dir/'model.zip'))
    env.close()
    eval_env.close()

    # held-out eval
    held_map, held_traffic = HELDOUT_MAP
    held_env = make_metadrive_env(held_map, held_traffic, use_discrete, seed+500)()
    mean_reward, std_reward = evaluate_policy(model, held_env, n_eval_episodes=100)
    held_env.close()
    pd.DataFrame([{'mean_reward':mean_reward,'std_reward':std_reward}]).to_csv(out_dir/'heldout_metrics.csv', index=False)
    print('Non-curriculum training complete and saved to', out_dir)
    return out_dir


def train_curriculum(algo, seed, n_envs=N_ENVS):
    from stable_baselines3.common.utils import set_random_seed

    out_base = EXPERIMENT_ROOT/f"{algo}/curriculum/seed_{seed}"
    out_base.mkdir(parents=True, exist_ok=True)
    print('Starting curriculum training for', algo, 'seed', seed)

    set_random_seed(seed)  # seed Python, NumPy and PyTorch before the policy is created
    use_discrete = (algo=='DQN')
    freq = max(EVAL_FREQ // n_envs, 1)  # SB3 callback frequencies count calls; one call = n_envs timesteps
    model = None

    for stage_idx, (stage_name, map_name, traffic, stage_steps) in enumerate(STAGES):
        out_dir = out_base/stage_name
        out_dir.mkdir(parents=True, exist_ok=True)
        print('Curriculum stage:', stage_name, 'map:', map_name)

        # Derive env seeds from the stage index; hash() of a str differs between Python processes
        vec_env = make_vec_env(map_name, traffic_density=traffic, n_envs=n_envs, use_discrete=use_discrete, seed=seed+stage_idx*n_envs)
        if model is None:
            model = make_model(algo, vec_env, HYPERS[algo])  # C0: create the model
        else:
            model.set_env(vec_env)  # later stages: keep the weights, swap the env

        # eval env
        eval_env = make_metadrive_env(map_name, traffic, use_discrete, seed+100)()

        eval_cb = EvalCallback(eval_env, best_model_save_path=str(out_dir/'best_model'), log_path=str(out_dir/'eval_logs'), eval_freq=freq, n_eval_episodes=EVAL_EPISODES, deterministic=True)
        ckpt_cb = CheckpointCallback(save_freq=freq, save_path=str(out_dir/'checkpoints'), name_prefix='ckpt')
        metrics_cb = MetricsCallback(eval_env, str(out_dir/'metrics.csv'), eval_episodes=EVAL_EPISODES)

        # reset_num_timesteps=False keeps one continuous timestep counter across stages
        model.learn(total_timesteps=stage_steps, reset_num_timesteps=False, callback=[eval_cb, ckpt_cb, metrics_cb])
        model.save(str(out_dir/'stage_model.zip'))
        vec_env.close()
        eval_env.close()

    # heldout eval
    held_map, held_traffic = HELDOUT_MAP
    held_env = make_metadrive_env(held_map, held_traffic, use_discrete, seed+500)()
    mean_reward, std_reward = evaluate_policy(model, held_env, n_eval_episodes=100)
    held_env.close()
    pd.DataFrame([{'mean_reward':mean_reward,'std_reward':std_reward}]).to_csv(out_base/'heldout_metrics.csv', index=False)

    print('Curriculum finished and saved under', out_base)
    return out_base

print('Training functions defined')
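
Because the curriculum runner trains with `reset_num_timesteps=False`, the timestep counter runs continuously across stages, and the stage budgets mark where each stage starts and ends on that shared axis. The sketch below computes those nominal boundaries; the helper names are illustrative, and the budgets are copied from `STAGES`. Actual boundaries can land slightly later for on-policy algorithms such as PPO, which finish a whole rollout before stopping.

```python
from itertools import accumulate

# (stage_name, timesteps) copied from STAGES; renamed to avoid shadowing it.
STAGE_BUDGETS = [("C0_Straight", 200_000), ("C1_Curve", 300_000),
                 ("C2_Roundabout", 400_000), ("C3_Dynamic", 400_000)]

def stage_boundaries(stages):
    """Return (name, start, end) per stage on the cumulative timestep axis."""
    ends = list(accumulate(steps for _, steps in stages))
    starts = [0] + ends[:-1]
    return [(name, start, end) for (name, _), start, end in zip(stages, starts, ends)]

def stage_at(timestep, stages):
    """Name of the stage that nominally contains `timestep`, or None if past the end."""
    for name, start, end in stage_boundaries(stages):
        if start <= timestep < end:
            return name
    return None

for name, start, end in stage_boundaries(STAGE_BUDGETS):
    print(f"{name}: {start:>9,} to {end:>9,}")
print(stage_at(650_000, STAGE_BUDGETS))
```

These boundaries are useful for drawing vertical stage markers on learning curves that span the whole curriculum.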

## 7. Visualization helpers (plot metrics and learning curves)

In [None]:
def plot_metrics(csv_path, title=None):
    """Plot reward, success rate, collision rate and average speed against timesteps."""
    if not os.path.exists(csv_path):
        print('CSV not found:', csv_path)
        return
    df = pd.read_csv(csv_path)
    if df.empty:
        print('No evaluation rows yet in', csv_path)
        return
    fig, axs = plt.subplots(2,2, figsize=(12,8))
    panels = [('mean_reward','Mean reward'), ('success_rate','Success rate'),
              ('collision_rate','Collision rate'), ('avg_speed','Avg speed')]
    for ax, (col, label) in zip(axs.flatten(), panels):
        ax.plot(df['total_timesteps'], df[col], marker='o')
        ax.set_title(label)
        ax.set_xlabel('Timesteps')
    if title:
        fig.suptitle(title)
    plt.tight_layout()
    plt.show()

print('Plot helper ready')
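
`plot_metrics()` shows a single run. To compare curriculum against non-curriculum, the curves from the different seeds usually need to be combined first. The sketch below averages one metric across seeds at the timesteps all seeds share. `aggregate_across_seeds` is a hypothetical helper and the numbers are made-up placeholders, not experiment results.

```python
from statistics import mean, stdev

def aggregate_across_seeds(curves):
    """curves: {seed: {timestep: value}} -> sorted [(timestep, mean, std), ...].

    Only timesteps present for every seed are kept; std is the sample
    standard deviation (0.0 when there is a single seed).
    """
    shared = set.intersection(*(set(c) for c in curves.values()))
    out = []
    for t in sorted(shared):
        vals = [curves[s][t] for s in curves]
        out.append((t, mean(vals), stdev(vals) if len(vals) > 1 else 0.0))
    return out

# Placeholder success rates for three seeds (illustrative only).
curves = {
    0: {10_000: 0.2, 20_000: 0.5},
    1: {10_000: 0.4, 20_000: 0.7},
    2: {10_000: 0.3, 20_000: 0.6, 30_000: 0.8},  # 30_000 is dropped: not shared
}
for t, m, s in aggregate_across_seeds(curves):
    print(f"{t:>6}: mean={m:.3f} std={s:.3f}")
```

With only three seeds the spread is a rough indication at best, so it is worth showing the individual curves alongside the mean.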

## 8. Quick pilot run (toy budget) — uncomment to test

Below is a small pilot run with a very small budget, meant only to validate the pipeline. **Do not use these tiny budgets for final experiments.** Use it to check that your MetaDrive and SB3 setup works end to end.


In [None]:
# Pilot example: run one tiny non-curriculum PPO for 2000 steps on Straight map
# Uncomment to run

# out = train_noncurriculum('PPO', 'Straight', 0.0, total_timesteps=2000, seed=0, n_envs=2)
# print('Pilot saved at:', out)

print('Pilot cell ready — uncomment to run a small test')

## 9. Full experiment sweep orchestration (commented)

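This cell launches the entire sweep: for each algorithm and each seed, both the non-curriculum and the curriculum runs.
**Careful**: with the configuration above this is 45 training runs (3 algorithms × 3 seeds × (4 non-curriculum + 1 curriculum)) of 1.3M timesteps each, 58.5M timesteps in total, not counting evaluation episodes. Uncomment only when you have the compute and time for it.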


In [None]:
# Full sweep (uncomment to run)
# algos = ['PPO','SAC','DQN']
# for algo in algos:
#     # non-curriculum: for each target map train for the total curriculum budget (to match sample budget)
#     for seed in SEEDS:
#         for (_, map_name, traffic, _) in STAGES:
#             train_noncurriculum(algo, map_name, traffic, total_timesteps=TOTAL_CURRICULUM_BUDGET, seed=seed, n_envs=N_ENVS)
#     # curriculum
#     for seed in SEEDS:
#         train_curriculum(algo, seed, n_envs=N_ENVS)

print('Full sweep cell ready (commented)')
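
Before launching the sweep, it can be useful to list the jobs it will create and check the total budget. The sketch below mirrors the loop structure of the commented cell using plain tuples. `enumerate_jobs` is an illustrative helper, and the algorithm list, seeds and budgets are copied from the configuration.

```python
from itertools import product

ALGOS = ["PPO", "SAC", "DQN"]
RUN_SEEDS = [0, 1, 2]
# map_name -> stage budget, copied from STAGES
MAP_BUDGETS = {"Straight": 200_000, "Curve": 300_000,
               "Roundabout": 400_000, "20-block": 400_000}

def enumerate_jobs(algos, seeds, map_budgets):
    """List (algo, mode, seed, target, timesteps) for every run in the sweep."""
    total = sum(map_budgets.values())  # every run gets the full curriculum budget
    jobs = []
    for algo, seed in product(algos, seeds):
        for map_name in map_budgets:
            jobs.append((algo, "noncurriculum", seed, map_name, total))
        jobs.append((algo, "curriculum", seed, "all stages", total))
    return jobs

jobs = enumerate_jobs(ALGOS, RUN_SEEDS, MAP_BUDGETS)
print("runs:", len(jobs))
print("total timesteps:", sum(job[-1] for job in jobs))
```

A job list like this also makes it easy to run a subset first, for example a single algorithm and seed.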

## 10. Notes, reproducibility, and next steps

- The notebook fixes the hyperparameters, budgets, maps, seeds and logging conventions in one place to support reproducible experiments.
- Highway-Env is a lighter option for fast early (Phase 1) iteration; use MetaDrive for the full experiments. This notebook contains only the MetaDrive pipeline.
- Always run a short pilot to measure `steps/sec` and set realistic budgets.
- For visualization, use TensorBoard pointed at the `experiments` folder and the `plot_metrics()` helper.
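
The pilot-run advice above can be turned into a quick estimate: once a pilot gives a throughput in steps per second, wall-clock time follows directly from the budget. This is a minimal sketch; both helper names are illustrative, and the 500 steps/sec figure is a made-up placeholder rather than a measurement.

```python
def estimate_hours(total_steps, steps_per_sec):
    """Wall-clock hours needed for `total_steps` at a measured throughput."""
    if steps_per_sec <= 0:
        raise ValueError("steps_per_sec must be positive")
    return total_steps / steps_per_sec / 3600.0

def affordable_steps(hours, steps_per_sec):
    """Largest budget (in timesteps) that fits in the given number of hours."""
    return int(hours * 3600.0 * steps_per_sec)

pilot_steps_per_sec = 500  # placeholder: replace with the value from your pilot run
print(f"one curriculum run: {estimate_hours(1_300_000, pilot_steps_per_sec):.2f} h")
print(f"budget for 8 hours: {affordable_steps(8, pilot_steps_per_sec):,} steps")
```

Throughput usually drops when traffic is enabled or evaluation runs often, so measure it on the most demanding stage.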


---

**End of notebook.**
