# RL Quickstart

A minimal example of training a reinforcement learning (RL) agent on the service scheduling problem.

This notebook uses DQN from Stable-Baselines3, but any Gymnasium-compatible algorithm that supports the environment's action space can be substituted.

In [None]:
import sys
sys.path.insert(0, '..')

import numpy as np
from stable_baselines3 import DQN

from src.rl import (
    ServiceEnv,
    get_default_scenario,
    compare_with_baselines,
    evaluate_model,
    DEFAULT_MAX_TIME,
)

## 1. Create Environment

The `ServiceEnv` wraps the simulation as a Gymnasium environment.

In [None]:
scenario = get_default_scenario()
env = ServiceEnv(scenario, max_time=DEFAULT_MAX_TIME)

print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")
print(f"Action delays: {env.action_delays[:5]} ... {env.action_delays[-2:]}")
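
To make "wraps the simulation as a Gymnasium environment" concrete, here is a minimal sketch of the interaction loop that any Gymnasium-style environment supports: `reset` returns `(obs, info)` and `step` returns `(obs, reward, terminated, truncated, info)`. `ToyQueueEnv`, `run_episode`, and the two policies are hypothetical stand-ins written for this illustration. They are not part of `src.rl`, and the toy dynamics are not the dynamics of `ServiceEnv`.

```python
import random


class ToyQueueEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step signatures.

    It does not import gymnasium and is NOT the real ServiceEnv.
    """

    def __init__(self, max_steps=20):
        self.max_steps = max_steps
        self._rng = random.Random()
        self._t = 0
        self._queue = 0

    def reset(self, seed=None):
        # Gymnasium convention: reset returns (observation, info)
        if seed is not None:
            self._rng.seed(seed)
        self._t = 0
        self._queue = 0
        return (self._queue,), {}

    def step(self, action):
        # One job arrives with probability 0.5
        if self._rng.random() < 0.5:
            self._queue += 1
        # The action is the number of jobs to serve this step
        served = min(self._queue, action)
        self._queue -= served
        self._t += 1
        # Penalise waiting jobs, plus a small cost per unit of service effort
        reward = -float(self._queue) - 0.1 * action
        terminated = False                      # no natural terminal state
        truncated = self._t >= self.max_steps   # time limit reached
        return (self._queue,), reward, terminated, truncated, {}


def run_episode(env, policy, seed=None):
    """Roll out one episode and return the undiscounted sum of rewards."""
    obs, info = env.reset(seed=seed)
    total = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total += reward
        done = terminated or truncated
    return total


def idle_policy(obs):
    """Handcrafted policy: never serve."""
    return 0


def serve_if_waiting(obs):
    """Handcrafted policy: serve one job whenever the queue is non-empty."""
    return 1 if obs[0] > 0 else 0


toy_env = ToyQueueEnv(max_steps=20)
idle_return = run_episode(toy_env, idle_policy, seed=0)
serve_return = run_episode(toy_env, serve_if_waiting, seed=0)
print(f"idle: {idle_return:.2f}, serve_if_waiting: {serve_return:.2f}")
```

Because `run_episode` relies only on the `reset`/`step` signatures, the same loop should work with `env` from the cell above, assuming `ServiceEnv` follows the standard Gymnasium API and the policy returns a valid action for `env.action_space`.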

## 2. Baseline Comparison

Measure how the handcrafted baseline policies perform, to give the trained agent a reference point.

In [None]:
results = compare_with_baselines(scenario, max_time=DEFAULT_MAX_TIME)

## 3. Train with Any Algorithm

Use any Gymnasium-compatible RL algorithm. Here we use DQN from Stable-Baselines3.

In [None]:
# Train DQN (or substitute your own algorithm)
model = DQN('MlpPolicy', env, verbose=0, seed=42)
model.learn(total_timesteps=50_000)
print("Training complete!")

## 4. Evaluate

Compare the trained model against the baselines.

In [None]:
rewards = evaluate_model(model, scenario, n_episodes=100, max_time=DEFAULT_MAX_TIME)

print(f"\nTrained model: {np.mean(rewards):.2f} +/- {np.std(rewards):.2f}")
print(f"Optimised linear baseline: {results['optimised_linear']['mean']:.2f}")
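
The `+/-` value printed above is the standard deviation across episodes, which describes episode-to-episode spread rather than uncertainty in the mean. When comparing against a baseline mean, the standard error of the mean is often the more useful number. The sketch below shows one way to compute both, assuming `rewards` is a flat sequence of per-episode returns. `summarise_returns` is a hypothetical helper, and the numbers are made-up placeholders, not results from this notebook.

```python
import math
import statistics


def summarise_returns(returns):
    """Summarise per-episode returns (hypothetical helper, not part of src.rl)."""
    n = len(returns)
    mean = statistics.fmean(returns)
    # Population standard deviation, matching np.std's default (ddof=0)
    std = statistics.pstdev(returns)
    # Standard error of the mean, using the sample standard deviation (ddof=1)
    sem = statistics.stdev(returns) / math.sqrt(n) if n > 1 else float("nan")
    return {"n": n, "mean": mean, "std": std, "sem": sem}


# Placeholder values for illustration only; NOT measured results
example_returns = [-12.0, -10.0, -11.0, -9.0, -13.0]
placeholder_baseline_mean = -11.5

summary = summarise_returns(example_returns)
gap = summary["mean"] - placeholder_baseline_mean

print(f"mean = {summary['mean']:.2f}, std = {summary['std']:.2f}, sem = {summary['sem']:.2f}")
print(f"gap to baseline mean = {gap:+.2f}")
```

A gap that is small relative to the standard error suggests running more evaluation episodes before concluding that one policy is better than the other.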

## Next Steps

- See `rl_dqn.ipynb` and `rl_sac.ipynb` for detailed experiments
- Tune hyperparameters for better performance
- Try different algorithms (PPO, SAC, etc.)
- For surrogate models and advanced algorithms, request access to `discretiser-surrogate`