Dynamic SWE-bench Verified evaluation for per-step model routers. Leaderboard ranked by a single metric: total USD spent.
Chinese translation: README.zh.md. External docs and communication default to English.
SWERouterBench is the dynamic sibling of CommonRouterBench. Where CommonRouterBench scores routers against a static question bank of 970 pre-recorded routing steps, SWERouterBench actually runs your router end-to-end on the 500 SWE-bench Verified instances. At every LLM call the harness calls back into your router, which picks a concrete model_id from an official locked pool. Billing follows published provider pricing for each model. Failed instances are charged an additional full rerun at the highest-priced pool model (1× baseline penalty). Lower total USD = better rank.
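As a rough illustration of the ranking rule above, the sketch below sums routed spend and adds the failure penalty. The `InstanceResult` record and its `baseline_rerun_usd` field (the cost of one full rerun on the highest-priced pool model) are hypothetical names chosen for this example; the authoritative computation is whatever `swerouterbench score` implements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceResult:
    """Hypothetical per-instance record, for illustration only."""
    instance_id: str
    routed_cost_usd: float     # what the routed model calls cost
    resolved: bool             # SWE-bench resolved status
    baseline_rerun_usd: float  # assumed cost of one full rerun on the priciest pool model


def total_bill_usd(results: list[InstanceResult]) -> float:
    """Routed spend plus a 1x baseline penalty for every unresolved instance."""
    total = 0.0
    for r in results:
        total += r.routed_cost_usd
        if not r.resolved:
            total += r.baseline_rerun_usd
    return total


results = [
    InstanceResult("a", routed_cost_usd=0.25, resolved=True, baseline_rerun_usd=2.0),
    InstanceResult("b", routed_cost_usd=0.5, resolved=False, baseline_rerun_usd=2.0),
]
print(total_bill_usd(results))  # 0.25 + 0.5 + 2.0 penalty for "b"
```

Under this reading, a cheap model that fails often can end up costing more than a pricier model that resolves the instance on the first attempt.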
Alpha. Current package version is 0.2.0 (see CHANGELOG.md, pyproject.toml).
| | CommonRouterBench | SWERouterBench |
|---|---|---|
| Evaluation | Static question bank, 970 rows | Dynamic, 500 SWE-bench Verified instances |
| Router output | Tier id `0..3` | Concrete `model_id` from locked pool |
| Pricing | Nominal tier prices | Real published provider prices |
| Cache model | Step-distance TTL = 3 | Wall-clock TTL = 300s |
| Pass criterion | `pred_tier >= gold_tier` (proxy) | SWE-bench resolved (`FAIL_TO_PASS` + `PASS_TO_PASS`) |
| Heavy deps | None | `swebench`, `docker` |
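The difference between the two cache models in the table can be sketched as two small predicates. This is a sketch under stated assumptions: the function names, the "last hit" bookkeeping, and the inclusive boundary are illustrative and not taken from either benchmark's code. Only the TTL values (3 steps, 300 seconds) come from the table.

```python
def warm_by_step_distance(last_hit_step: int, current_step: int, ttl_steps: int = 3) -> bool:
    """Step-distance rule: expiry is counted in routing steps (inclusive bound is an assumption)."""
    return current_step - last_hit_step <= ttl_steps


def warm_by_wall_clock(last_hit_ts: float, now_ts: float, ttl_seconds: float = 300.0) -> bool:
    """Wall-clock rule: expiry is counted in elapsed seconds (inclusive bound is an assumption)."""
    return now_ts - last_hit_ts <= ttl_seconds


# One step later, but a slow tool call took 10 minutes in between:
print(warm_by_step_distance(last_hit_step=4, current_step=5))  # still warm by step count
print(warm_by_wall_clock(last_hit_ts=0.0, now_ts=600.0))       # expired by wall clock
```

The practical consequence is that, under a wall-clock policy, long-running tool calls can turn a cache hit into a miss even when the model choice does not change between steps.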
SWERouterBench depends on CommonRouterBench for shared tokenizer and classifier utilities (main.tokenizer, main.router_llm).
- Python 3.10+
- Docker (daemon running; images pulled or buildable for SWE-bench Verified)
- LLM API compatible with the harness client (OpenAI-compatible base URL + API key; OpenRouter is a common choice)
- Network for dataset fetch / image pulls as required by `swebench`
1. Install (Python ≥ 3.10) from the repository root. `pip install -e .` pulls CommonRouterBench from PyPI; for a monorepo checkout you can still `pip install -e ../CommonRouterBench` first if needed.

       pip install -e .

2. Credentials: copy `.env.example` to `.env` in the repo root (or export variables in your shell). The `swerouterbench` CLI loads `.env` on startup when keys are unset. Never commit `.env` (it is listed in `.gitignore`).

       cp .env.example .env
       # edit .env: set OPENROUTER_BASE_URL + OPENROUTER_API_KEY (or SWEROUTER_* / COMMONSTACK_*)

3. Smoke run (first three Verified instances in dataset order, fixed pool model):

       swerouterbench run \
         --router-import swerouter.routers.always_model:AlwaysModelRouter \
         --router-arg model_id=deepseek/deepseek-v3.2 \
         --router-arg label=always_deepseek_smoke \
         --router-label always_deepseek_smoke \
         --output-dir runs/smoke_always \
         --limit 3 --run-id smoke_always

Non-install entrypoint: `scripts/run_router.py` forwards to the same CLI (`python scripts/run_router.py ...`).
From the repository root:

    pip install -e .

This installs the `swerouterbench` CLI (entry point).
| Command | Purpose |
|---|---|
| `swerouterbench run` | Run your router on SWE-bench Verified and write traces + per-instance results under `--output-dir`. |
| `swerouterbench score` | Recompute leaderboard billing (`total_actual_bill_usd`) from an existing run directory. Optional `--exclude-infra-failures`, `--reprice-from-raw-usage`. |
| `swerouterbench audit-infra` | Scan `results/*.json` for instances that fair-metrics exclusion would drop. |
| `swerouterbench audit-trace-cost` | Compare summed trace `step_cost_usd` vs provider `raw_usage.cost` in `*.trace.jsonl`. |
| `swerouterbench render` | Render a markdown leaderboard from one or more `score.json` files. |
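To make the `audit-trace-cost` comparison concrete, here is a minimal sketch of the same idea in plain Python. It assumes each line of a `*.trace.jsonl` file is a JSON object with a top-level `step_cost_usd` and an optional `raw_usage` object carrying `cost`. That layout is inferred from the field names in the table, not from the trace schema itself, and `compare_trace_costs` is a hypothetical helper.

```python
import json
from typing import Iterable


def compare_trace_costs(lines: Iterable[str]) -> tuple[float, float]:
    """Return (sum of step_cost_usd, sum of raw_usage.cost) over JSONL lines."""
    harness_total = 0.0
    provider_total = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        step = json.loads(line)
        harness_total += float(step.get("step_cost_usd") or 0.0)
        raw_usage = step.get("raw_usage") or {}
        provider_total += float(raw_usage.get("cost") or 0.0)
    return harness_total, provider_total


sample = [
    '{"step_cost_usd": 0.5, "raw_usage": {"cost": 0.5}}',
    '{"step_cost_usd": 0.25, "raw_usage": {"cost": 0.125}}',
]
harness, provider = compare_trace_costs(sample)
print(f"harness={harness} provider={provider} delta={harness - provider}")
```

A non-zero delta is not necessarily a bug; it simply flags steps where the harness's priced cost and the provider's reported cost disagree and deserve a closer look.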
Connection settings for run: pass --base-url / --api-key, or set environment
variables (see table below). CLI flags override environment defaults.
| Variable | Role |
|---|---|
| `OPENROUTER_BASE_URL` | OpenAI-compatible API base (used if `SWEROUTER_BASE_URL` is unset). |
| `OPENROUTER_API_KEY` | Bearer token for that base (used if `SWEROUTER_API_KEY` is unset). |
| `SWEROUTER_BASE_URL` / `SWEROUTER_API_KEY` | Explicit names matching the `swerouterbench run` flag names. |
| `COMMONSTACK_API_BASE` / `COMMONSTACK_API_KEY` | Optional: when set, mapped onto `OPENROUTER_*` / `SWEROUTER_*` (see `swerouter.cli._apply_gateway_aliases`). |
| `OPENROUTER_API_KEY_EXP` | Optional alternate key when `OPENROUTER_API_KEY` is empty. |
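The precedence described above (CLI flag first, then `SWEROUTER_*`, then `OPENROUTER_*`, with `OPENROUTER_API_KEY_EXP` as a last resort) can be sketched as follows. This is an illustration of the documented order, not the CLI's actual code: `first_set` and `resolve_connection` are hypothetical helpers, and the `COMMONSTACK_*` aliasing is deliberately left out because its exact precedence is defined in `swerouter.cli._apply_gateway_aliases`.

```python
from typing import Mapping, Optional


def first_set(env: Mapping[str, str], *names: str) -> Optional[str]:
    """Return the first non-empty value among the given variable names."""
    for name in names:
        value = env.get(name, "")
        if value:
            return value
    return None


def resolve_connection(
    env: Mapping[str, str],
    base_url_flag: Optional[str] = None,
    api_key_flag: Optional[str] = None,
) -> tuple[Optional[str], Optional[str]]:
    """CLI flags win; otherwise fall back through the environment variables."""
    base_url = base_url_flag or first_set(env, "SWEROUTER_BASE_URL", "OPENROUTER_BASE_URL")
    api_key = api_key_flag or first_set(
        env, "SWEROUTER_API_KEY", "OPENROUTER_API_KEY", "OPENROUTER_API_KEY_EXP"
    )
    return base_url, api_key


env = {
    "OPENROUTER_BASE_URL": "https://openrouter.ai/api/v1",
    "OPENROUTER_API_KEY": "",           # empty, so the alternate key is used
    "OPENROUTER_API_KEY_EXP": "alt-key",
}
print(resolve_connection(env))
print(resolve_connection(env, api_key_flag="flag-key"))
```

In a real shell you would pass `os.environ` as `env`; a plain dict is used here so the example runs without touching your environment.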
You pass --router-import module:Callable and repeated --router-arg key=value
(all values are strings). The CLI imports the callable and invokes it with
**router_args. The returned object must implement the
Router protocol: synchronous
select(ctx: RouterContext) -> RouterDecision with model_id in
ctx.available_models (invalid choices fail fast in the harness).
Module discovery: module must be importable from the current Python
environment (pip install -e /path/to/your_router_pkg or
PYTHONPATH=/path/to/parent python -m swerouter.cli ... if you keep the router
outside this repo).
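For readers who want to see what a `module:Callable` import target amounts to, the sketch below resolves such a string with `importlib` and parses repeated `key=value` arguments. It is a minimal re-implementation of the idea for illustration, not the CLI's own loader; `load_target` and `parse_router_args` are hypothetical names, and a standard-library target is used so the snippet runs without SWERouterBench installed.

```python
import importlib
from typing import Any, Callable


def load_target(spec: str) -> Callable[..., Any]:
    """Resolve 'pkg.mod:Name' or 'pkg.mod:Cls.factory' to a callable."""
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'module:Callable', got {spec!r}")
    target: Any = importlib.import_module(module_name)
    for part in attr_path.split("."):  # walk dotted paths such as Cls.from_cli_args
        target = getattr(target, part)
    if not callable(target):
        raise TypeError(f"{spec!r} does not resolve to a callable")
    return target


def parse_router_args(pairs: list[str]) -> dict[str, str]:
    """Turn repeated 'key=value' strings into kwargs; every value stays a string."""
    kwargs: dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")  # split on the first '=' only
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        kwargs[key] = value
    return kwargs


factory = load_target("fractions:Fraction.from_float")  # stand-in for a router factory
print(factory(0.5))
print(parse_router_args(["model_id=deepseek/deepseek-v3.2", "label=demo"]))
```

Because every value arrives as a string, any numeric or list-valued setting has to be converted inside the callable itself.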
Locked pool: every RouterDecision.model_id must be one of the entries in
data/model_pool.json. The harness passes the same ids
in ctx.available_models each step.
Built-in examples live under swerouter/routers/ (AlwaysModelRouter,
GoldTierRouter.from_cli_args, etc.). Full behaviour contract (timeouts,
fail-fast rules): docs/router_api_zh.md (Chinese).
Implement a class with `select(self, ctx: RouterContext) -> RouterDecision`, put it in any importable package, then point `--router-import` at the class and pass string kwargs. Example:

    # myteam_router/router.py
    from dataclasses import dataclass

    from swerouter.router import RouterContext, RouterDecision


    @dataclass
    class FirstPoolModelRouter:
        """Always pick the first model_id the harness exposes for this step."""

        label: str  # populated from --router-arg label=...

        def select(self, ctx: RouterContext) -> RouterDecision:
            # available_models mirrors the locked pool, so any entry is a valid choice.
            mid = ctx.available_models[0]
            return RouterDecision(model_id=mid, rationale=f"{self.label}:first_pool")

Install both packages, then run one instance with the custom router:

    pip install -e .                  # SWERouterBench (provides swerouter.*)
    pip install -e ../myteam_router   # your package exporting myteam_router.router

    swerouterbench run \
      --router-import myteam_router.router:FirstPoolModelRouter \
      --router-arg label=demo \
      --router-label demo_first_pool \
      --output-dir runs/demo_first_pool \
      --limit 1

The built-in `AlwaysModelRouter` is run the same way, here with explicit connection flags:

    swerouterbench run \
      --router-import swerouter.routers.always_model:AlwaysModelRouter \
      --router-arg model_id=deepseek/deepseek-v3.2 \
      --router-arg label=always_deepseek \
      --base-url https://openrouter.ai/api/v1 \
      --api-key "$OPENROUTER_API_KEY" \
      --output-dir runs/always_deepseek_smoke \
      --router-label always_deepseek \
      --limit 3

Routers whose constructors need non-string args can expose a `module:Class.factory` path (e.g. a `@classmethod` that only takes string kwargs from repeated `--router-arg key=value`):

    swerouterbench run \
      --router-import swerouter.routers.gold_tier:GoldTierRouter.from_cli_args \
      --router-arg question_bank_path=/path/to/CommonRouterBench/data/question_bank.jsonl \
      --router-arg tier_to_model_path=/path/to/SWERouterBench/data/tier_to_model.json \
      --router-arg allowed_instance_ids=django__django-11133,django__django-10097 \
      --router-arg label=gold_tier_smoke \
      --base-url https://openrouter.ai/api/v1 \
      --api-key "$OPENROUTER_API_KEY" \
      --output-dir runs/gold_tier_smoke \
      --router-label gold_tier_smoke \
      --instances django__django-11133 django__django-10097

| Flag | Meaning |
|---|---|
| `--router-import` | Required. `module:Callable` (class, function, or dotted path like `pkg.mod:Cls.from_cli_args`). The callable is invoked with kwargs assembled from `--router-arg`. |
| `--router-arg` | Repeatable `key=value`; all values are strings. Passed as `**kwargs` to the import target. |
| `--router-label` | Required human-readable id for summaries and leaderboard rows. |
| `--output-dir` | Run workspace: `results/`, traces, `eval_summary.json`, etc. |
| `--limit` | Only the first N instances in dataset order (after `--instances` filtering if any). |
| `--instances` | Optional explicit list of SWE-bench `instance_id` values. |
| `--workers` | Concurrent instances (default 8). |
| `--max-steps` | Global cap on LLM steps per instance (default 40). |
| `--max-steps-json` / `--max-steps-json-file` | Optional per-instance overrides of `max_steps` (JSON object: instance id → positive int). |
| `--budget-usd` | Per-instance spend cap for routed model calls (default 5.0). |
| `--run-id` | Passed through to SWE-bench harness logging (default `swerouter_default`). |
| `--force-rerun` | Ignore existing `<output_dir>/results/<instance_id>.json` and re-execute. |
| `--rm-image` | Remove Docker images after runs (optional cleanup). |
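Since every `--router-arg` value arrives as a string, a factory classmethod is a convenient place to do the conversion. The sketch below shows the pattern on a hypothetical `ExampleRouterConfig`; the class, its fields, and the comma-separated id format are assumptions made for illustration (a real router would also implement `select`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleRouterConfig:
    """Hypothetical router settings with non-string fields."""
    label: str
    max_cheap_steps: int
    allowed_instance_ids: frozenset[str]

    @classmethod
    def from_cli_args(
        cls,
        label: str,
        max_cheap_steps: str = "5",
        allowed_instance_ids: str = "",
    ) -> "ExampleRouterConfig":
        """Accept only strings (as the CLI supplies them) and convert here."""
        ids = frozenset(
            part.strip() for part in allowed_instance_ids.split(",") if part.strip()
        )
        return cls(label=label, max_cheap_steps=int(max_cheap_steps), allowed_instance_ids=ids)


cfg = ExampleRouterConfig.from_cli_args(
    label="demo",
    max_cheap_steps="12",
    allowed_instance_ids="django__django-11133, django__django-10097",
)
print(cfg.max_cheap_steps, sorted(cfg.allowed_instance_ids))
```

Keeping the conversion in one classmethod means a malformed value (for example a non-numeric `max_cheap_steps`) fails at construction time rather than in the middle of a run.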
Resume: if <output_dir>/results/<instance_id>.json already exists and is valid, that instance is skipped on the next run unless --force-rerun is set.
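The resume rule can be pictured as a small predicate over the output directory. This is a sketch only: `should_skip` is a hypothetical helper, and "valid" is taken here to mean "parses as JSON", which may be weaker than the check the harness actually performs.

```python
import json
from pathlib import Path


def should_skip(output_dir: Path, instance_id: str, force_rerun: bool = False) -> bool:
    """True when a readable result file already exists and no rerun is forced."""
    if force_rerun:
        return False
    result_path = output_dir / "results" / f"{instance_id}.json"
    try:
        json.loads(result_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing, unreadable, or malformed file
        return False
    return True
```

Under this reading, a truncated result file left behind by an interrupted run is treated as absent, so the instance is executed again on the next invocation.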
Score a finished run, then render a leaderboard from one or more score files:

    swerouterbench score --run-dir runs/always_deepseek_smoke --router-label always_deepseek
    # writes runs/always_deepseek_smoke/score.json by default

    swerouterbench render --score runs/a/score.json runs/b/score.json --out leaderboard.md

Optional `--pricing`, `--ttl`, `--pool` on `score` override the default JSON tables under `data/`.
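Because the leaderboard ranks by total USD, a quick way to compare a few runs by hand is to read each `score.json` and sort ascending. The sketch below assumes `total_actual_bill_usd` is a top-level key and that a `router_label` key may be present; the second assumption and the `rank_scores` helper are hypothetical, and the real file may carry more structure.

```python
import json
from pathlib import Path


def rank_scores(score_paths: list[Path]) -> list[tuple[str, float]]:
    """Return (label, total USD) pairs, cheapest first."""
    rows = []
    for path in score_paths:
        score = json.loads(path.read_text(encoding="utf-8"))
        # Fall back to the run directory name when no label is stored (assumed key).
        label = score.get("router_label", path.parent.name)
        rows.append((label, float(score["total_actual_bill_usd"])))
    return sorted(rows, key=lambda row: row[1])
```

For example, `rank_scores([Path("runs/a/score.json"), Path("runs/b/score.json")])` would list the cheaper of the two runs first.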
- For each SWE-bench Verified instance, the agent loop runs inside the work container.
- Before each LLM request, the harness calls `router.select(ctx)` with a frozen `RouterContext`: messages, tools, `step_index`, `available_models` (from the locked pool), cache snapshot, spend-so-far, and run limits.
- Your code must return a `RouterDecision` whose `model_id` is exactly one of `ctx.available_models`. Invalid or slow `select` calls fail fast (no silent fallback).
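The fail-fast contract above can be sketched with stand-in types. `Ctx` and `Decision` below are simplified placeholders for `RouterContext` and `RouterDecision`, and `checked_select` with its `max_seconds` limit is a hypothetical wrapper: the real harness defines its own timeout and may interrupt a slow call rather than check the elapsed time afterwards.

```python
import time
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Ctx:  # stand-in for RouterContext (only the fields used here)
    step_index: int
    available_models: tuple[str, ...]


@dataclass(frozen=True)
class Decision:  # stand-in for RouterDecision
    model_id: str


class Router(Protocol):
    def select(self, ctx: Ctx) -> Decision: ...


def checked_select(router: Router, ctx: Ctx, max_seconds: float = 5.0) -> Decision:
    """Call the router and raise instead of silently falling back."""
    started = time.monotonic()
    decision = router.select(ctx)
    elapsed = time.monotonic() - started
    if elapsed > max_seconds:  # assumed limit, checked after the call returns
        raise TimeoutError(f"select took {elapsed:.2f}s (limit {max_seconds}s)")
    if decision.model_id not in ctx.available_models:
        raise ValueError(f"{decision.model_id!r} is not in the locked pool")
    return decision


class FirstModel:
    def select(self, ctx: Ctx) -> Decision:
        return Decision(model_id=ctx.available_models[0])


ctx = Ctx(step_index=0, available_models=("model-a", "model-b"))
print(checked_select(FirstModel(), ctx).model_id)
```

The point of raising rather than substituting a default model is that a misbehaving router cannot quietly inherit someone else's routing decision and its cost.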
Built-in reference routers live under swerouter/routers/: AlwaysModelRouter, RoundRobinRouter, TierFromCRBRouter, GoldTierRouter, and the FunctionRouter wrapper for callables.
Further detail (Chinese): docs/router_api_zh.md — behaviour contract, fail-fast table, and what routers must not assume.
Before pushing or tagging a public release:
1. No secrets in git: `.env` must stay untracked (see `.gitignore`). Use `.env.example` only for placeholder names and fake values.

2. Scan the tree for accidental keys (adjust patterns for your provider):

       git grep -nE 'sk-[a-zA-Z0-9]{20,}' || true
       git grep -niE 'apikey|api_key|secret_key|bearer [a-z0-9]{20,}' || true

   A clean tree should only hit documentation telling people not to commit keys, or test fixtures with obviously fake payloads.

3. No personal run artefacts: do not `git add` `runs/`, `logs/`, `agent_logs/`, `*.trace.jsonl`, or local smoke scripts under `scripts/` that match the ignore rules in `.gitignore` (local-only patterns such as `scripts/smoke_*.py` are intentionally excluded from the published repo layout).

4. Run tests locally after `pip install -e ".[dev]"`:

       pytest -q

- `data/model_pool.json`: authoritative list of `model_id` values routers may return (`schema_version` in the file).
- `data/model_pricing.json`: billing table keyed by `model_id`.
- `data/ttl_policy.json`: cache TTL policy surfaced in `RouterContext.cache_state`.
- `data/tier_to_model.json`: maps CommonRouterBench tiers to pool `model_id` (for classifiers / oracle routers).

You can drive evaluation from Python without the CLI by building an EvalRequest and calling run_eval(request, router_label=...). The CLI’s run subcommand is a thin wrapper around the same path.
- Upstream SWE-bench: swebench.com
- CommonRouterBench — static router benchmark and GT bank.
- MiniSWERouterBench — same Router protocol and locked pool / pricing, but driven by the mini-swe-agent scaffold (bash-only + linear history). Use it when GT tier labels must align with CommonRouterBench trajectories.
- This repository: SWERouterBench on GitHub.
- Pricing and cache (internal, Chinese): docs/pricing_and_cache_zh.md
- Scoring (internal, Chinese): docs/scoring_zh.md
Apache-2.0. See LICENSE.