CommonstackAI/SWERouterBench
SWERouterBench

Dynamic SWE-bench Verified evaluation for per-step model routers. Leaderboard ranked by a single metric: total USD spent.

Chinese translation: README.zh.md. External docs and communication default to English.

SWERouterBench is the dynamic sibling of CommonRouterBench. Where CommonRouterBench scores routers against a static question bank of 970 pre-recorded routing steps, SWERouterBench actually runs your router end-to-end on the 500 SWE-bench Verified instances. At every LLM call the harness calls back into your router, which picks a concrete model_id from an official locked pool. Billing follows published provider pricing for each model. Failed instances are charged an additional full rerun at the highest-priced pool model (1× baseline penalty). Lower total USD = better rank.
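
The ranking metric can be sketched in a few lines (an illustration only, assuming per-instance spend totals are already known; the real billing in `swerouterbench score` works from per-call provider usage and published per-token prices):

```python
# Sketch of the leaderboard metric (total USD spent), assuming
# per-instance routed spend is already summed. Failed instances are
# charged one extra full rerun at the highest-priced pool model.
def total_actual_bill_usd(instances, baseline_rerun_usd):
    """instances: iterable of (routed_spend_usd, resolved) pairs."""
    total = 0.0
    for spend, resolved in instances:
        total += spend                   # every routed call is billed
        if not resolved:
            total += baseline_rerun_usd  # 1x baseline penalty
    return total
```

Two instances at $1.00 and $0.50, with the second unresolved against a $2.00 baseline rerun, bill $3.50 total.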

Status

Alpha. Current package version is 0.2.0 (see CHANGELOG.md, pyproject.toml).

Relationship to CommonRouterBench

|  | CommonRouterBench | SWERouterBench |
| --- | --- | --- |
| Evaluation | Static question bank, 970 rows | Dynamic, 500 SWE-bench Verified instances |
| Router output | Tier id 0..3 | Concrete model_id from locked pool |
| Pricing | Nominal tier prices | Real published provider prices |
| Cache model | Step-distance TTL = 3 | Wall-clock TTL = 300 s |
| Pass criterion | pred_tier >= gold_tier (proxy) | SWE-bench resolved (FAIL_TO_PASS + PASS_TO_PASS) |
| Heavy deps | None | swebench, docker |

SWERouterBench depends on CommonRouterBench for shared tokenizer and classifier utilities (main.tokenizer, main.router_llm).

Prerequisites

  • Python 3.10+
  • Docker (daemon running; images pulled or buildable for SWE-bench Verified)
  • LLM API compatible with the harness client (OpenAI-compatible base URL + API key; OpenRouter is a common choice)
  • Network for dataset fetch / image pulls as required by swebench

Quickstart

  1. Install (Python ≥ 3.10) from the repository root. pip install -e . pulls CommonRouterBench from PyPI; for a monorepo checkout you can still pip install -e ../CommonRouterBench first if needed.

    pip install -e .
  2. Credentials: copy .env.example to .env in the repo root (or export variables in your shell). The swerouterbench CLI loads .env on startup when keys are unset. Never commit .env (.gitignore).

    cp .env.example .env
    # edit .env — set OPENROUTER_BASE_URL + OPENROUTER_API_KEY (or SWEROUTER_* / COMMONSTACK_*)
  3. Smoke run (first three Verified instances in dataset order, fixed pool model):

    swerouterbench run \
      --router-import swerouter.routers.always_model:AlwaysModelRouter \
      --router-arg model_id=deepseek/deepseek-v3.2 \
      --router-arg label=always_deepseek_smoke \
      --router-label always_deepseek_smoke \
      --output-dir runs/smoke_always \
      --limit 3 --run-id smoke_always

    Non-install entrypoint: scripts/run_router.py forwards to the same CLI (python scripts/run_router.py ...).

Installation

From the repository root:

pip install -e .

This installs the swerouterbench CLI entry point.

CLI overview

| Command | Purpose |
| --- | --- |
| swerouterbench run | Run your router on SWE-bench Verified and write traces + per-instance results under --output-dir. |
| swerouterbench score | Recompute leaderboard billing (total_actual_bill_usd) from an existing run directory. Optional --exclude-infra-failures, --reprice-from-raw-usage. |
| swerouterbench audit-infra | Scan results/*.json for instances that fair-metrics exclusion would drop. |
| swerouterbench audit-trace-cost | Compare summed trace step_cost_usd vs provider raw_usage.cost in *.trace.jsonl. |
| swerouterbench render | Render a markdown leaderboard from one or more score.json files. |

Connection settings for run: pass --base-url / --api-key, or set environment variables (see table below). CLI flags override environment defaults.

Environment variables

| Variable | Role |
| --- | --- |
| OPENROUTER_BASE_URL | OpenAI-compatible API base (used if SWEROUTER_BASE_URL is unset). |
| OPENROUTER_API_KEY | Bearer token for that base (used if SWEROUTER_API_KEY is unset). |
| SWEROUTER_BASE_URL / SWEROUTER_API_KEY | Explicit names matching the swerouterbench run flag names. |
| COMMONSTACK_API_BASE / COMMONSTACK_API_KEY | Optional: when set, mapped onto OPENROUTER_* / SWEROUTER_* (see swerouter.cli._apply_gateway_aliases). |
| OPENROUTER_API_KEY_EXP | Optional alternate key when OPENROUTER_API_KEY is empty. |
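
The fallback order implied by this table can be sketched as follows (illustrative; the real resolution, including the COMMONSTACK_* aliasing, lives in swerouter.cli):

```python
import os

# Sketch of credential resolution: an explicit --api-key flag wins,
# then SWEROUTER_API_KEY, then OPENROUTER_API_KEY, then the alternate
# OPENROUTER_API_KEY_EXP. (COMMONSTACK_* aliasing is omitted here.)
def resolve_api_key(cli_api_key=None, env=None):
    env = os.environ if env is None else env
    for candidate in (
        cli_api_key,                        # --api-key flag
        env.get("SWEROUTER_API_KEY"),       # explicit harness variable
        env.get("OPENROUTER_API_KEY"),      # OpenRouter default
        env.get("OPENROUTER_API_KEY_EXP"),  # alternate key fallback
    ):
        if candidate:
            return candidate
    return None
```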

How to plug in your router

You pass --router-import module:Callable and repeated --router-arg key=value (all values are strings). The CLI imports the callable and invokes it with **router_args. The returned object must implement the Router protocol: synchronous select(ctx: RouterContext) -> RouterDecision with model_id in ctx.available_models (invalid choices fail fast in the harness).

Module discovery: module must be importable from the current Python environment (pip install -e /path/to/your_router_pkg or PYTHONPATH=/path/to/parent python -m swerouter.cli ... if you keep the router outside this repo).

Locked pool: every RouterDecision.model_id must be one of the entries in data/model_pool.json. The harness passes the same ids in ctx.available_models each step.
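
The fail-fast membership check can be mirrored locally while developing a router (a sketch; the pool-file schema, assumed here to be a JSON array of model-id strings, should be confirmed against data/model_pool.json):

```python
import json

def load_pool(path="data/model_pool.json"):
    # Assumption: the pool file is a JSON array of model-id strings.
    with open(path) as f:
        return list(json.load(f))

def check_decision(model_id, available_models):
    # Mirrors the harness fail-fast rule: an id outside the locked
    # pool is rejected immediately, with no silent fallback.
    if model_id not in available_models:
        raise ValueError(f"model_id {model_id!r} is not in the locked pool")
    return model_id
```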

Built-in examples live under swerouter/routers/ (AlwaysModelRouter, GoldTierRouter.from_cli_args, etc.). Full behaviour contract (timeouts, fail-fast rules): docs/router_api_zh.md (Chinese).

Minimal custom router

Implement a class with select(self, ctx: RouterContext) -> RouterDecision, put it in any importable package, then point --router-import at the class and pass string kwargs. Example:

# myteam_router/router.py
from dataclasses import dataclass

from swerouter.router import RouterContext, RouterDecision


@dataclass
class FirstPoolModelRouter:
    """Always pick the first model_id the harness exposes for this step."""

    label: str

    def select(self, ctx: RouterContext) -> RouterDecision:
        mid = ctx.available_models[0]
        return RouterDecision(model_id=mid, rationale=f"{self.label}:first_pool")

Then install both packages and run the CLI:

pip install -e .                  # SWERouterBench (provides swerouter.*)
pip install -e ../myteam_router   # your package exporting myteam_router.router
swerouterbench run \
  --router-import myteam_router.router:FirstPoolModelRouter \
  --router-arg label=demo \
  --router-label demo_first_pool \
  --output-dir runs/demo_first_pool \
  --limit 1

Example: fixed model baseline

swerouterbench run \
  --router-import swerouter.routers.always_model:AlwaysModelRouter \
  --router-arg model_id=deepseek/deepseek-v3.2 \
  --router-arg label=always_deepseek \
  --base-url https://openrouter.ai/api/v1 \
  --api-key "$OPENROUTER_API_KEY" \
  --output-dir runs/always_deepseek_smoke \
  --router-label always_deepseek \
  --limit 3

Example: oracle router via classmethod factory

Routers whose constructors need non-string args can expose a module:Class.factory path (e.g. a @classmethod that only takes string kwargs from repeated --router-arg key=value):

swerouterbench run \
  --router-import swerouter.routers.gold_tier:GoldTierRouter.from_cli_args \
  --router-arg question_bank_path=/path/to/CommonRouterBench/data/question_bank.jsonl \
  --router-arg tier_to_model_path=/path/to/SWERouterBench/data/tier_to_model.json \
  --router-arg allowed_instance_ids=django__django-11133,django__django-10097 \
  --router-arg label=gold_tier_smoke \
  --base-url https://openrouter.ai/api/v1 \
  --api-key "$OPENROUTER_API_KEY" \
  --output-dir runs/gold_tier_smoke \
  --router-label gold_tier_smoke \
  --instances django__django-11133 django__django-10097

Useful run flags

| Flag | Meaning |
| --- | --- |
| --router-import | Required. module:Callable — class, function, or dotted path like pkg.mod:Cls.from_cli_args. The callable is invoked with kwargs assembled from --router-arg. |
| --router-arg | Repeatable key=value; all values are strings. Passed as **kwargs to the import target. |
| --router-label | Required human-readable id for summaries and leaderboard rows. |
| --output-dir | Run workspace: results/, traces, eval_summary.json, etc. |
| --limit | Only the first N instances in dataset order (after --instances filtering, if any). |
| --instances | Optional explicit list of SWE-bench instance_id values. |
| --workers | Concurrent instances (default 8). |
| --max-steps | Global cap on LLM steps per instance (default 40). |
| --max-steps-json / --max-steps-json-file | Optional per-instance overrides of max_steps (JSON object: instance id → positive int). |
| --budget-usd | Per-instance spend cap for routed model calls (default 5.0). |
| --run-id | Passed through to SWE-bench harness logging (default swerouter_default). |
| --force-rerun | Ignore existing <output_dir>/results/<instance_id>.json and re-execute. |
| --rm-image | Remove Docker images after runs (optional cleanup). |
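
A per-instance override file for --max-steps-json-file can be generated like this (a sketch; the instance ids are examples from SWE-bench Verified, and the schema follows the flag description above):

```python
import json

# Map SWE-bench instance ids to positive max_steps overrides, then
# pass the file via --max-steps-json-file.
overrides = {
    "django__django-11133": 60,  # allow a longer agent run
    "django__django-10097": 20,  # cap a cheap instance early
}
with open("max_steps_overrides.json", "w") as f:
    json.dump(overrides, f, indent=2)
```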

Resume: if <output_dir>/results/<instance_id>.json already exists and is valid, that instance is skipped on the next run unless --force-rerun is set.
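
The resume rule can be approximated like this (a sketch of the described behaviour; the harness's own notion of a "valid" result file may be stricter than parseable JSON):

```python
import json
import os

def should_skip(output_dir, instance_id, force_rerun=False):
    # Skip an instance when a parseable results JSON already exists,
    # unless --force-rerun was requested.
    path = os.path.join(output_dir, "results", f"{instance_id}.json")
    if force_rerun or not os.path.exists(path):
        return False
    try:
        with open(path) as f:
            json.load(f)
        return True
    except ValueError:  # covers json.JSONDecodeError
        return False
```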

score and render

swerouterbench score --run-dir runs/always_deepseek_smoke --router-label always_deepseek
# writes runs/always_deepseek_smoke/score.json by default

swerouterbench render --score runs/a/score.json runs/b/score.json --out leaderboard.md

Optional --pricing, --ttl, --pool on score override the default JSON tables under data/.

How the harness calls your router

  1. For each SWE-bench Verified instance, the agent loop runs inside the work container.
  2. Before each LLM request, the harness calls router.select(ctx) with a frozen RouterContext: messages, tools, step_index, available_models (from the locked pool), cache snapshot, spend-so-far, and run limits.
  3. Your code must return a RouterDecision whose model_id is exactly one of ctx.available_models. Invalid or slow select calls fail fast (no silent fallback).
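
For illustration, a budget-aware select() under the contract above might prefer an expensive model until spend nears the cap (the spend and budget attribute names on RouterContext are assumptions here, so a stand-in Ctx keeps the sketch self-contained):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ctx:
    # Stand-in for swerouter.router.RouterContext; the real field
    # names for spend-so-far and the budget cap may differ.
    available_models: tuple
    spend_usd: float
    budget_usd: float

@dataclass
class BudgetAwareRouter:
    preferred: str      # model used while budget headroom remains
    cheap: str          # fallback once spend nears the cap
    fraction: float = 0.8

    def select(self, ctx):
        near_cap = ctx.spend_usd >= self.fraction * ctx.budget_usd
        pick = self.cheap if near_cap else self.preferred
        # Harness fail-fast rule: the pick must be in the locked pool.
        assert pick in ctx.available_models
        return pick  # the real protocol wraps this in a RouterDecision
```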

Built-in reference routers live under swerouter/routers/: AlwaysModelRouter, RoundRobinRouter, TierFromCRBRouter, GoldTierRouter, and the FunctionRouter wrapper for callables.

Further detail (Chinese): docs/router_api_zh.md — behaviour contract, fail-fast table, and what routers must not assume.

Open source / release checklist

Before pushing or tagging a public release:

  • No secrets in git: .env must stay untracked (see .gitignore). Use .env.example only for placeholder names and fake values.

  • Scan the tree for accidental keys (adjust patterns for your provider):

    git grep -nE 'sk-[a-zA-Z0-9]{20,}' || true
    git grep -niE 'apikey|api_key|secret_key|bearer [a-z0-9]{20,}' || true

    A clean tree should only hit documentation telling people not to commit keys, or test fixtures with obviously fake payloads.

  • No personal run artefacts: do not git add runs/, logs/, agent_logs/, *.trace.jsonl, or local smoke scripts under scripts/ that match the .gitignore rules (local-only patterns such as scripts/smoke_*.py are intentionally kept out of the published repo layout).

  • Run tests locally after pip install -e ".[dev]":

    pytest -q

Official locked pool and data files

Programmatic use

You can drive evaluation from Python without the CLI by building an EvalRequest and calling run_eval(request, router_label=...). The CLI’s run subcommand is a thin wrapper around the same path.

See also

License

Apache-2.0. See LICENSE.
