CommonstackAI/SWERouterBench
SWERouterBench

Dynamic SWE-bench Verified evaluation for per-step model routers. Leaderboard ranked by a single metric: total USD spent.

Chinese translation: README.zh.md. External docs and communication default to English.

SWERouterBench is the dynamic sibling of CommonRouterBench. Where CommonRouterBench scores routers against a static question bank of 970 pre-recorded routing steps, SWERouterBench actually runs your router end-to-end on the 500 SWE-bench Verified instances. At every LLM call the harness calls back into your router, which picks a concrete model_id from an official locked pool. Billing follows published provider pricing for each model. Failed instances are charged an additional full rerun at the highest-priced pool model (1× baseline penalty). Lower total USD = better rank.
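
The ranking metric can be sketched in a few lines (an illustration only, assuming per-instance spend totals are already known; the real billing in `swerouterbench score` works from per-call provider usage and published per-token prices):

```python
# Sketch of the leaderboard metric (total USD spent), assuming
# per-instance routed spend is already summed. Failed instances are
# charged one extra full rerun at the highest-priced pool model.
def total_actual_bill_usd(instances, baseline_rerun_usd):
    """instances: iterable of (routed_spend_usd, resolved) pairs."""
    total = 0.0
    for spend, resolved in instances:
        total += spend                   # every routed call is billed
        if not resolved:
            total += baseline_rerun_usd  # 1x baseline penalty
    return total
```

Two instances at $1.00 and $0.50, with the second unresolved against a $2.00 baseline rerun, bill $3.50 total.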

Status

Alpha. Current package version is 0.2.0 (see CHANGELOG.md, pyproject.toml).

Relationship to CommonRouterBench

|  | CommonRouterBench | SWERouterBench |
| --- | --- | --- |
| Evaluation | Static question bank, 970 rows | Dynamic, 500 SWE-bench Verified instances |
| Router output | Tier id 0..3 | Concrete model_id from locked pool |
| Pricing | Nominal tier prices | Real published provider prices |
| Cache model | Step-distance TTL = 3 | Wall-clock TTL = 300 s |
| Pass criterion | pred_tier >= gold_tier (proxy) | SWE-bench resolved (FAIL_TO_PASS + PASS_TO_PASS) |
| Heavy deps | None | swebench, docker |

SWERouterBench depends on CommonRouterBench for shared tokenizer and classifier utilities (main.tokenizer, main.router_llm).

Prerequisites

  • Python 3.10+
  • Docker (daemon running; images pulled or buildable for SWE-bench Verified)
  • LLM API compatible with the harness client (OpenAI-compatible base URL + API key; OpenRouter is a common choice)
  • Network for dataset fetch / image pulls as required by swebench

Quickstart

  1. Install (Python ≥ 3.10) from the repository root. pip install -e . pulls CommonRouterBench from PyPI; for a monorepo checkout you can still pip install -e ../CommonRouterBench first if needed.

    pip install -e .
  2. Credentials: copy .env.example to .env in the repo root (or export variables in your shell). The swerouterbench CLI loads .env on startup when keys are unset. Never commit .env (.gitignore).

    cp .env.example .env
    # edit .env — set OPENROUTER_BASE_URL + OPENROUTER_API_KEY (or SWEROUTER_* / COMMONSTACK_*)
  3. Smoke run (first three Verified instances in dataset order, fixed pool model):

    swerouterbench run \
      --router-import swerouter.routers.always_model:AlwaysModelRouter \
      --router-arg model_id=deepseek/deepseek-v3.2 \
      --router-arg label=always_deepseek_smoke \
      --router-label always_deepseek_smoke \
      --output-dir runs/smoke_always \
      --limit 3 --run-id smoke_always

    Non-install entrypoint: scripts/run_router.py forwards to the same CLI (python scripts/run_router.py ...).

Installation

From the repository root:

pip install -e .

This installs the swerouterbench CLI entry point.

CLI overview

| Command | Purpose |
| --- | --- |
| swerouterbench run | Run your router on SWE-bench Verified and write traces + per-instance results under --output-dir. |
| swerouterbench score | Recompute leaderboard billing (total_actual_bill_usd) from an existing run directory. Optional --exclude-infra-failures, --reprice-from-raw-usage. |
| swerouterbench audit-infra | Scan results/*.json for instances that fair-metrics exclusion would drop. |
| swerouterbench audit-trace-cost | Compare summed trace step_cost_usd vs provider raw_usage.cost in *.trace.jsonl. |
| swerouterbench render | Render a markdown leaderboard from one or more score.json files. |

Connection settings for run: pass --base-url / --api-key, or set environment variables (see table below). CLI flags override environment defaults.

Environment variables

| Variable | Role |
| --- | --- |
| OPENROUTER_BASE_URL | OpenAI-compatible API base (used if SWEROUTER_BASE_URL is unset). |
| OPENROUTER_API_KEY | Bearer token for that base (used if SWEROUTER_API_KEY is unset). |
| SWEROUTER_BASE_URL / SWEROUTER_API_KEY | Explicit names matching the swerouterbench run flag names. |
| COMMONSTACK_API_BASE / COMMONSTACK_API_KEY | Optional: when set, mapped onto OPENROUTER_* / SWEROUTER_* (see swerouter.cli._apply_gateway_aliases). |
| OPENROUTER_API_KEY_EXP | Optional alternate key when OPENROUTER_API_KEY is empty. |
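
The fallback order implied by this table can be sketched as follows (illustrative; the real resolution, including the COMMONSTACK_* aliasing, lives in swerouter.cli):

```python
import os

# Sketch of credential resolution: an explicit --api-key flag wins,
# then SWEROUTER_API_KEY, then OPENROUTER_API_KEY, then the alternate
# OPENROUTER_API_KEY_EXP. (COMMONSTACK_* aliasing is omitted here.)
def resolve_api_key(cli_api_key=None, env=None):
    env = os.environ if env is None else env
    for candidate in (
        cli_api_key,                        # --api-key flag
        env.get("SWEROUTER_API_KEY"),       # explicit harness variable
        env.get("OPENROUTER_API_KEY"),      # OpenRouter default
        env.get("OPENROUTER_API_KEY_EXP"),  # alternate key fallback
    ):
        if candidate:
            return candidate
    return None
```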

How to plug in your router

You pass --router-import module:Callable and repeated --router-arg key=value (all values are strings). The CLI imports the callable and invokes it with **router_args. The returned object must implement the Router protocol: synchronous select(ctx: RouterContext) -> RouterDecision with model_id in ctx.available_models (invalid choices fail fast in the harness).

Module discovery: module must be importable from the current Python environment (pip install -e /path/to/your_router_pkg or PYTHONPATH=/path/to/parent python -m swerouter.cli ... if you keep the router outside this repo).

Locked pool: every RouterDecision.model_id must be one of the entries in data/model_pool.json. The harness passes the same ids in ctx.available_models each step.
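
The fail-fast membership check can be mirrored locally while developing a router (a sketch; the pool-file schema, assumed here to be a JSON array of model-id strings, should be confirmed against data/model_pool.json):

```python
import json

def load_pool(path="data/model_pool.json"):
    # Assumption: the pool file is a JSON array of model-id strings.
    with open(path) as f:
        return list(json.load(f))

def check_decision(model_id, available_models):
    # Mirrors the harness fail-fast rule: an id outside the locked
    # pool is rejected immediately, with no silent fallback.
    if model_id not in available_models:
        raise ValueError(f"model_id {model_id!r} is not in the locked pool")
    return model_id
```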

Built-in examples live under swerouter/routers/ (AlwaysModelRouter, GoldTierRouter.from_cli_args, etc.). Full behaviour contract (timeouts, fail-fast rules): docs/router_api_zh.md (Chinese).

Minimal custom router

Implement a class with select(self, ctx: RouterContext) -> RouterDecision, put it in any importable package, then point --router-import at the class and pass string kwargs. Example:

# myteam_router/router.py
from dataclasses import dataclass

from swerouter.router import RouterContext, RouterDecision


@dataclass
class FirstPoolModelRouter:
    """Always pick the first model_id the harness exposes for this step."""

    label: str

    def select(self, ctx: RouterContext) -> RouterDecision:
        mid = ctx.available_models[0]
        return RouterDecision(model_id=mid, rationale=f"{self.label}:first_pool")

Then install both packages and run the CLI:

pip install -e .                  # SWERouterBench (provides swerouter.*)
pip install -e ../myteam_router   # your package exporting myteam_router.router
swerouterbench run \
  --router-import myteam_router.router:FirstPoolModelRouter \
  --router-arg label=demo \
  --router-label demo_first_pool \
  --output-dir runs/demo_first_pool \
  --limit 1

Example: fixed model baseline

swerouterbench run \
  --router-import swerouter.routers.always_model:AlwaysModelRouter \
  --router-arg model_id=deepseek/deepseek-v3.2 \
  --router-arg label=always_deepseek \
  --base-url https://openrouter.ai/api/v1 \
  --api-key "$OPENROUTER_API_KEY" \
  --output-dir runs/always_deepseek_smoke \
  --router-label always_deepseek \
  --limit 3

Example: oracle router via classmethod factory

Routers whose constructors need non-string args can expose a module:Class.factory path (e.g. a @classmethod that only takes string kwargs from repeated --router-arg key=value):

swerouterbench run \
  --router-import swerouter.routers.gold_tier:GoldTierRouter.from_cli_args \
  --router-arg question_bank_path=/path/to/CommonRouterBench/data/question_bank.jsonl \
  --router-arg tier_to_model_path=/path/to/SWERouterBench/data/tier_to_model.json \
  --router-arg allowed_instance_ids=django__django-11133,django__django-10097 \
  --router-arg label=gold_tier_smoke \
  --base-url https://openrouter.ai/api/v1 \
  --api-key "$OPENROUTER_API_KEY" \
  --output-dir runs/gold_tier_smoke \
  --router-label gold_tier_smoke \
  --instances django__django-11133 django__django-10097

Useful run flags

| Flag | Meaning |
| --- | --- |
| --router-import | Required. module:Callable — class, function, or dotted path like pkg.mod:Cls.from_cli_args. The callable is invoked with kwargs assembled from --router-arg. |
| --router-arg | Repeatable key=value; all values are strings. Passed as **kwargs to the import target. |
| --router-label | Required human-readable id for summaries and leaderboard rows. |
| --output-dir | Run workspace: results/, traces, eval_summary.json, etc. |
| --limit | Only the first N instances in dataset order (after --instances filtering, if any). |
| --instances | Optional explicit list of SWE-bench instance_id values. |
| --workers | Concurrent instances (default 8). |
| --max-steps | Global cap on LLM steps per instance (default 40). |
| --max-steps-json / --max-steps-json-file | Optional per-instance overrides of max_steps (JSON object: instance id → positive int). |
| --budget-usd | Per-instance spend cap for routed model calls (default 5.0). |
| --run-id | Passed through to SWE-bench harness logging (default swerouter_default). |
| --force-rerun | Ignore existing <output_dir>/results/<instance_id>.json and re-execute. |
| --rm-image | Remove Docker images after runs (optional cleanup). |
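
A per-instance override file for --max-steps-json-file can be generated like this (a sketch; the instance ids are examples from SWE-bench Verified, and the schema follows the flag description above):

```python
import json

# Map SWE-bench instance ids to positive max_steps overrides, then
# pass the file via --max-steps-json-file.
overrides = {
    "django__django-11133": 60,  # allow a longer agent run
    "django__django-10097": 20,  # cap a cheap instance early
}
with open("max_steps_overrides.json", "w") as f:
    json.dump(overrides, f, indent=2)
```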

Resume: if <output_dir>/results/<instance_id>.json already exists and is valid, that instance is skipped on the next run unless --force-rerun is set.
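
The resume rule can be approximated like this (a sketch of the described behaviour; the harness's own notion of a "valid" result file may be stricter than parseable JSON):

```python
import json
import os

def should_skip(output_dir, instance_id, force_rerun=False):
    # Skip an instance when a parseable results JSON already exists,
    # unless --force-rerun was requested.
    path = os.path.join(output_dir, "results", f"{instance_id}.json")
    if force_rerun or not os.path.exists(path):
        return False
    try:
        with open(path) as f:
            json.load(f)
        return True
    except ValueError:  # covers json.JSONDecodeError
        return False
```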

score and render

swerouterbench score --run-dir runs/always_deepseek_smoke --router-label always_deepseek
# writes runs/always_deepseek_smoke/score.json by default

swerouterbench render --score runs/a/score.json runs/b/score.json --out leaderboard.md

Optional --pricing, --ttl, --pool on score override the default JSON tables under data/.

How the harness calls your router

  1. For each SWE-bench Verified instance, the agent loop runs inside the work container.
  2. Before each LLM request, the harness calls router.select(ctx) with a frozen RouterContext: messages, tools, step_index, available_models (from the locked pool), cache snapshot, spend-so-far, and run limits.
  3. Your code must return a RouterDecision whose model_id is exactly one of ctx.available_models. Invalid or slow select calls fail fast (no silent fallback).
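
For illustration, a budget-aware select() under the contract above might prefer an expensive model until spend nears the cap (the spend and budget attribute names on RouterContext are assumptions here, so a stand-in Ctx keeps the sketch self-contained):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ctx:
    # Stand-in for swerouter.router.RouterContext; the real field
    # names for spend-so-far and the budget cap may differ.
    available_models: tuple
    spend_usd: float
    budget_usd: float

@dataclass
class BudgetAwareRouter:
    preferred: str      # model used while budget headroom remains
    cheap: str          # fallback once spend nears the cap
    fraction: float = 0.8

    def select(self, ctx):
        near_cap = ctx.spend_usd >= self.fraction * ctx.budget_usd
        pick = self.cheap if near_cap else self.preferred
        # Harness fail-fast rule: the pick must be in the locked pool.
        assert pick in ctx.available_models
        return pick  # the real protocol wraps this in a RouterDecision
```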

Built-in reference routers live under swerouter/routers/: AlwaysModelRouter, RoundRobinRouter, TierFromCRBRouter, GoldTierRouter, and the FunctionRouter wrapper for callables.

Further detail (Chinese): docs/router_api_zh.md — behaviour contract, fail-fast table, and what routers must not assume.

Open source / release checklist

Before pushing or tagging a public release:

  • No secrets in git: .env must stay untracked (see .gitignore). Use .env.example only for placeholder names and fake values.

  • Scan the tree for accidental keys (adjust patterns for your provider):

    git grep -nE 'sk-[a-zA-Z0-9]{20,}' || true
    git grep -niE 'apikey|api_key|secret_key|bearer [a-z0-9]{20,}' || true

    A clean tree should only hit documentation telling people not to commit keys, or test fixtures with obviously fake payloads.

  • No personal run artefacts: do not git add runs/, logs/, agent_logs/, *.trace.jsonl, or local smoke scripts under scripts/ that match the .gitignore rules (local-only patterns such as scripts/smoke_*.py are intentionally kept out of the published repo layout).

  • Run tests locally after pip install -e ".[dev]":

    pytest -q

Official locked pool and data files

Programmatic use

You can drive evaluation from Python without the CLI by building an EvalRequest and calling run_eval(request, router_label=...). The CLI’s run subcommand is a thin wrapper around the same path.

See also

License

Apache-2.0. See LICENSE.
