MiniSWERouterBench

Run SWERouterBench-style per-step routing and USD scoring on the mini-swe-agent scaffold (miniswerouterbench CLI).

Standard model pool (locked)

Your router’s select() must return a model_id from the locked pool shipped with SWERouterBench (data/model_pool.json inside the installed package, or the copy on GitHub). The pool currently contains four IDs:

| `model_id` | Role in the bench |
| --- | --- |
| `anthropic/claude-opus-4.6` | High baseline (`is_high_baseline=true`): failed runs bill a full replay at this model. |
| `google/gemini-3-flash-preview` | Pool member |
| `minimax/minimax-m2.7` | Pool member |
| `deepseek/deepseek-v3.2` | Pool member |

Pricing, TTL, and tier→model tables come from the same SWERouterBench data/ bundle. Pass --pool, --pricing, --ttl, or --tier-map to run / score only if you intentionally pin different JSON files.
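As a quick guard while developing a router, the pool constraint above can be sketched as a membership check. This is a minimal illustration that hard-codes the four IDs from the table; the helper name `check_model_id` is hypothetical, and the real source of truth remains data/model_pool.json, whose schema is not described here.

```python
# The four IDs listed in the pool table; hard-coded here for illustration.
# The authoritative list lives in data/model_pool.json in SWERouterBench.
LOCKED_POOL = (
    "anthropic/claude-opus-4.6",
    "google/gemini-3-flash-preview",
    "minimax/minimax-m2.7",
    "deepseek/deepseek-v3.2",
)


def check_model_id(model_id: str, pool=LOCKED_POOL) -> str:
    """Return model_id unchanged if it is in the pool, else raise ValueError."""
    if model_id not in pool:
        raise ValueError(f"{model_id!r} is not in the locked pool: {sorted(pool)}")
    return model_id
```

Calling this in your router's unit tests catches a misspelled ID before a Docker-backed run starts.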

Prerequisites

Python 3.10+
Docker (SWE-bench Verified images, same contract as SWERouterBench)
An OpenAI-compatible LLM gateway (base URL + API key; OpenRouter is typical)

Install

pip install -e .

API credentials

Copy .env.example to .env in the repo root (or export the variables in your shell). The CLI loads .env when variables are unset. Do not commit .env.

| Variable | Purpose |
| --- | --- |
| `OPENROUTER_BASE_URL` / `OPENROUTER_API_KEY` | Default gateway if `SWEROUTER_*` unset |
| `SWEROUTER_BASE_URL` / `SWEROUTER_API_KEY` | Explicit names for `run` defaults |
| `COMMONSTACK_API_BASE` / `COMMONSTACK_API_KEY` | Optional; mapped to the above (see `miniswerouter.cli`) |
| `OPENROUTER_API_KEY_EXP` | Optional backup key |

--base-url and --api-key on run override the environment.
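The precedence described above (CLI flag, then `SWEROUTER_*`, then `OPENROUTER_*`) can be sketched as follows. This is not the code in `miniswerouter.cli`; the function name `resolve_gateway` is hypothetical, and the `COMMONSTACK_*` and `OPENROUTER_API_KEY_EXP` variables are left out because their exact position in the order is not stated here.

```python
from typing import Mapping, Optional, Tuple


def resolve_gateway(
    env: Mapping[str, str],
    cli_base_url: Optional[str] = None,
    cli_api_key: Optional[str] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Pick (base_url, api_key): CLI flag, then SWEROUTER_*, then OPENROUTER_*."""
    base_url = (
        cli_base_url
        or env.get("SWEROUTER_BASE_URL")
        or env.get("OPENROUTER_BASE_URL")
    )
    api_key = (
        cli_api_key
        or env.get("SWEROUTER_API_KEY")
        or env.get("OPENROUTER_API_KEY")
    )
    return base_url, api_key
```

Passing `os.environ` as `env` reproduces the lookup against your current shell; passing a plain dict makes the order easy to test.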

Plug in and test one router

Implement the SWERouterBench Router protocol: synchronous select(ctx) -> RouterDecision, with model_id ∈ ctx.available_models (invalid IDs fail fast).
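The shape of such a router can be sketched like this. `FakeContext` and `FakeDecision` are stand-ins defined only so the example runs on its own; the real context and `RouterDecision` types come from SWERouterBench and may carry more or different fields. Only `select(ctx)`, `model_id`, and `ctx.available_models` are taken from the description above.

```python
from dataclasses import dataclass
from typing import Sequence


# Stand-ins for illustration only; use the real SWERouterBench types in practice.
@dataclass
class FakeContext:
    available_models: Sequence[str]


@dataclass
class FakeDecision:
    model_id: str


class PreferredModelRouter:
    """Pick one preferred model, falling back to the first available one."""

    def __init__(self, preferred: str = "deepseek/deepseek-v3.2"):
        self.preferred = preferred

    def select(self, ctx: FakeContext) -> FakeDecision:
        # select() is synchronous and must return an ID from ctx.available_models.
        if self.preferred in ctx.available_models:
            return FakeDecision(model_id=self.preferred)
        return FakeDecision(model_id=ctx.available_models[0])
```

The fallback branch keeps the router inside the pool even if the preferred ID is missing, which avoids the fail-fast path for invalid IDs.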

Run the harness (example: one Verified instance, smoke settings):

miniswerouterbench run \
  --router-import your.package.module:YourRouterClass \
  --router-arg some_param=value \
  --router-label my_router_smoke \
  --output-dir runs/my_router_smoke \
  --instances django__django-11133 \
  --limit 1 --workers 1 --run-id my_router_smoke

Built-in smoke reference (fixed model every step):

miniswerouterbench run \
  --router-import swerouter.routers.always_model:AlwaysModelRouter \
  --router-arg model_id=deepseek/deepseek-v3.2 \
  --router-arg label=always_deepseek \
  --router-label always_deepseek_smoke \
  --output-dir runs/smoke_always \
  --instances django__django-11133 \
  --limit 1 --workers 1 --run-id smoke_always

Routers whose constructors take non-string arguments can be loaded through a factory: point --router-import at a dotted path such as swerouter.routers.gold_tier:GoldTierRouter.from_cli_args and pass --router-arg key=value (all values arrive as strings). Reference routers live under swerouter.routers in the SWERouterBench package.
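Because every --router-arg value arrives as a string, a factory has to do the type conversion itself. The sketch below shows one way to write one. `ThresholdRouter`, its fields, and `parse_router_args` are hypothetical, and the keyword-argument signature of `from_cli_args` is an assumption rather than the documented SWERouterBench contract.

```python
from dataclasses import dataclass
from typing import Dict, List


def parse_router_args(pairs: List[str]) -> Dict[str, str]:
    """Turn ["key=value", ...] into a dict; values stay strings."""
    parsed = {}
    for pair in pairs:
        # Split on the first "=" only, so values may contain "=" themselves.
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        parsed[key] = value
    return parsed


@dataclass
class ThresholdRouter:  # hypothetical router, not part of SWERouterBench
    threshold: float
    verbose: bool

    @classmethod
    def from_cli_args(cls, threshold: str = "0.5", verbose: str = "false"):
        # Convert the string values into the types the constructor expects.
        return cls(
            threshold=float(threshold),
            verbose=verbose.strip().lower() in {"1", "true", "yes"},
        )
```

Keeping the conversion in one classmethod leaves the regular constructor typed, so the router is still easy to build directly in tests.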

Score the run directory:

miniswerouterbench score \
  --run-dir runs/my_router_smoke \
  --router-label my_router_smoke \
  --reprice-from-raw-usage \
  --out runs/my_router_smoke/score.json

Optional checks: audit-infra --run-dir ... (infra exclusions), audit-trace-cost --run-dir ... (trace vs provider cost), render --score ... --out leaderboard.md.

Shell helpers: scripts/examples/ (env.inc.sh, resume_until_n.sh, example_router_a.sh, example_router_b.sh).

CLI

| Command | Purpose |
| --- | --- |
| `run` | Run one router on SWE-bench Verified; writes `results/`, `.trace.jsonl`, `agent_logs/`, `case_summaries/`, `.mini_traj.json`, `eval_summary.json` under `--output-dir`. |
| `score` | Recompute `total_actual_bill_usd` (optional `--reprice-from-raw-usage`, `--exclude-infra-failures`, `--pool` / `--pricing` / `--ttl`). |
| `audit-infra` | List instances dropped by fair-metrics infra rules. |
| `audit-trace-cost` | Compare trace `step_cost_usd` vs `raw_usage.cost`. |
| `render` | Markdown leaderboard from `score.json` files. |
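To make the `audit-trace-cost` comparison concrete, here is a rough sketch of the same idea applied to JSONL lines. It is not the CLI's implementation. It assumes each trace record is a JSON object with a top-level `step_cost_usd` and a nested `raw_usage` object holding `cost`; the actual trace schema may differ, and the function name `cost_mismatches` is hypothetical.

```python
import json
from typing import Iterable, List, Tuple


def cost_mismatches(
    lines: Iterable[str], tolerance: float = 1e-9
) -> List[Tuple[int, float, float]]:
    """Return (line_number, step_cost_usd, raw_usage.cost) where the two differ."""
    mismatches = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        step_cost = record.get("step_cost_usd")
        provider_cost = (record.get("raw_usage") or {}).get("cost")
        # Skip records (for example a summary line) that lack either field.
        if step_cost is None or provider_cost is None:
            continue
        if abs(step_cost - provider_cost) > tolerance:
            mismatches.append((number, step_cost, provider_cost))
    return mismatches
```

Feeding it an open trace file (`cost_mismatches(open(path))`) works because file objects iterate line by line.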

Output layout (`--output-dir`)

results/<instance_id>.json — outcome (SWERouterBench-compatible schema)
<instance_id>.trace.jsonl — scorer-facing trace + loop_summary
agent_logs/<instance_id>/agent.log — harness log
case_summaries/<instance_id>.summary.json — per-case rollup
<instance_id>.mini_traj.json — full mini-swe-agent trajectory (there is no llm_io/ like the editor scaffold in SWERouterBench)
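When post-processing a run, the layout above maps to paths like the following. This is a small convenience sketch; the helper name `instance_paths` and the dictionary keys are hypothetical, while the file and directory names are the ones listed above.

```python
from pathlib import Path
from typing import Dict


def instance_paths(output_dir: str, instance_id: str) -> Dict[str, Path]:
    """Map each per-instance artifact to its expected path under output_dir."""
    root = Path(output_dir)
    return {
        "result": root / "results" / f"{instance_id}.json",
        "trace": root / f"{instance_id}.trace.jsonl",
        "agent_log": root / "agent_logs" / instance_id / "agent.log",
        "case_summary": root / "case_summaries" / f"{instance_id}.summary.json",
        "mini_traj": root / f"{instance_id}.mini_traj.json",
    }
```

For example, `instance_paths("runs/my_router_smoke", "django__django-11133")["trace"]` points at the scorer-facing trace for the smoke instance used earlier.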

Official eval defaults

CLI defaults match mini-swe-agent’s published SWE-bench settings: --max-steps 250, --budget-usd 3. For a real submission or comparable numbers, keep those defaults and avoid dev-only flags such as --max-steps-json unless you know why you need them.

Related repositories

SWERouterBench — router protocol, locked data/, editor-scaffold bench
CommonRouterBench — tier GT bank (e.g. for GoldTierRouter smoke)
mini-swe-agent — agent scaffold
This repo — source for miniswerouterbench

License

Apache-2.0.
