Run SWERouterBench-style per-step routing and USD scoring on the mini-swe-agent scaffold (miniswerouterbench CLI).
Your router’s `select()` must return a `model_id` that appears in the locked pool shipped with SWERouterBench (`data/model_pool.json` inside the installed package, or the copy on GitHub). The current four IDs are:
| `model_id` | Role in the bench |
|---|---|
| `anthropic/claude-opus-4.6` | High baseline (`is_high_baseline=true`): failed runs bill a full replay at this model. |
| `google/gemini-3-flash-preview` | Pool member |
| `minimax/minimax-m2.7` | Pool member |
| `deepseek/deepseek-v3.2` | Pool member |
Pricing, TTL, and tier→model tables come from the same SWERouterBench `data/` bundle.
Override paths only if you intentionally pin different JSON files (`--pool`, `--pricing`, `--ttl`, `--tier-map` on `run` / `score`).
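The pool constraint on `select()` can be illustrated with a minimal sketch. The `RouterDecision` and `RouterContext` classes below are stand-ins written for this example; the real types ship with SWERouterBench and their fields may differ. `step_index` and the `SwitchAfterRouter` class are hypothetical, and only the four model IDs and the `available_models` check come from this README.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence

# The four model IDs from the locked pool listed in this README.
POOL = (
    "anthropic/claude-opus-4.6",
    "google/gemini-3-flash-preview",
    "minimax/minimax-m2.7",
    "deepseek/deepseek-v3.2",
)


@dataclass(frozen=True)
class RouterDecision:
    """Stand-in for the SWERouterBench decision type (fields assumed)."""
    model_id: str


@dataclass(frozen=True)
class RouterContext:
    """Stand-in for the context passed to select(); step_index is hypothetical."""
    available_models: Sequence[str]
    step_index: int = 0


class Router(Protocol):
    """Synchronous router interface as described in this README."""
    def select(self, ctx: RouterContext) -> RouterDecision: ...


class SwitchAfterRouter:
    """Hypothetical router: one pool model for early steps, another afterwards."""

    def __init__(self, early_model: str, late_model: str, switch_after: int = 5) -> None:
        self.early_model = early_model
        self.late_model = late_model
        self.switch_after = switch_after

    def select(self, ctx: RouterContext) -> RouterDecision:
        wanted = self.early_model if ctx.step_index < self.switch_after else self.late_model
        # Reject IDs outside the pool before the harness has to.
        if wanted not in ctx.available_models:
            raise ValueError(f"{wanted!r} is not in the available pool")
        return RouterDecision(model_id=wanted)


router = SwitchAfterRouter("deepseek/deepseek-v3.2", "anthropic/claude-opus-4.6")
first = router.select(RouterContext(available_models=POOL, step_index=0))
later = router.select(RouterContext(available_models=POOL, step_index=7))
print(first.model_id, later.model_id)
```

Checking membership inside the router is optional, since the harness is described as failing fast on invalid IDs, but it surfaces a misconfigured ID with a clearer message.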
- Python 3.10+
- Docker (SWE-bench Verified images, same contract as SWERouterBench)
- An OpenAI-compatible LLM gateway (base URL + API key; OpenRouter is typical)
```bash
pip install -e .
```

Copy `.env.example` to `.env` in this repo root (or export the variables in your shell). The CLI loads `.env` when variables are unset. Do not commit `.env`.
| Variable | Purpose |
|---|---|
| `OPENROUTER_BASE_URL` / `OPENROUTER_API_KEY` | Default gateway if `SWEROUTER_*` is unset |
| `SWEROUTER_BASE_URL` / `SWEROUTER_API_KEY` | Explicit names for `run` defaults |
| `COMMONSTACK_API_BASE` / `COMMONSTACK_API_KEY` | Optional; mapped to the above (see `miniswerouter.cli`) |
| `OPENROUTER_API_KEY_EXP` | Optional backup key |

`--base-url` and `--api-key` on `run` override the environment.
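The precedence described above (CLI flags, then `SWEROUTER_*`, then `OPENROUTER_*`) can be sketched as follows. This is an illustration only, not the code in `miniswerouter.cli`: the `resolve_gateway` function is hypothetical, the example URLs are placeholders, and the `COMMONSTACK_*` mapping and backup key are left out because their exact handling is not documented here.

```python
from typing import Mapping, Optional, Tuple


def resolve_gateway(
    env: Mapping[str, str],
    cli_base_url: Optional[str] = None,
    cli_api_key: Optional[str] = None,
) -> Tuple[str, str]:
    """Pick a gateway: CLI flags first, then SWEROUTER_*, then OPENROUTER_*."""
    base_url = (
        cli_base_url
        or env.get("SWEROUTER_BASE_URL")
        or env.get("OPENROUTER_BASE_URL")
    )
    api_key = (
        cli_api_key
        or env.get("SWEROUTER_API_KEY")
        or env.get("OPENROUTER_API_KEY")
    )
    if not base_url or not api_key:
        raise RuntimeError("no gateway configured: set SWEROUTER_* or OPENROUTER_*")
    return base_url, api_key


# Placeholder values for illustration; pass os.environ in real use.
example_env = {
    "OPENROUTER_BASE_URL": "https://gateway-a.example/v1",
    "OPENROUTER_API_KEY": "key-a",
    "SWEROUTER_BASE_URL": "https://gateway-b.example/v1",
    "SWEROUTER_API_KEY": "key-b",
}
print(resolve_gateway(example_env))
print(resolve_gateway(example_env, cli_base_url="https://gateway-c.example/v1"))
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.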
1. Implement the SWERouterBench `Router` protocol: a synchronous `select(ctx) -> RouterDecision`, with `model_id` ∈ `ctx.available_models` (invalid IDs fail fast).

2. Run the harness (example: one Verified instance, smoke settings):

   ```bash
   miniswerouterbench run \
     --router-import your.package.module:YourRouterClass \
     --router-arg some_param=value \
     --router-label my_router_smoke \
     --output-dir runs/my_router_smoke \
     --instances django__django-11133 \
     --limit 1 --workers 1 --run-id my_router_smoke
   ```

   Built-in smoke reference (fixed model every step):

   ```bash
   miniswerouterbench run \
     --router-import swerouter.routers.always_model:AlwaysModelRouter \
     --router-arg model_id=deepseek/deepseek-v3.2 \
     --router-arg label=always_deepseek \
     --router-label always_deepseek_smoke \
     --output-dir runs/smoke_always \
     --instances django__django-11133 \
     --limit 1 --workers 1 --run-id smoke_always
   ```

   Factories with non-string constructor args use a dotted import, e.g. `swerouter.routers.gold_tier:GoldTierRouter.from_cli_args` plus `--router-arg key=value` (all string values). Reference routers live under `swerouter.routers` in the SWERouterBench package.

3. Score the run directory:

   ```bash
   miniswerouterbench score \
     --run-dir runs/my_router_smoke \
     --router-label my_router_smoke \
     --reprice-from-raw-usage \
     --out runs/my_router_smoke/score.json
   ```

4. Optional checks: `audit-infra --run-dir ...` (infra exclusions), `audit-trace-cost --run-dir ...` (trace vs provider cost), `render --score ... --out leaderboard.md`.
Shell helpers: scripts/examples/ (env.inc.sh, resume_until_n.sh, example_router_a.sh, example_router_b.sh).
| Command | Purpose |
|---|---|
| `run` | Run one router on SWE-bench Verified; writes `results/`, `*.trace.jsonl`, `agent_logs/`, `case_summaries/`, `*.mini_traj.json`, `eval_summary.json` under `--output-dir`. |
| `score` | Recompute `total_actual_bill_usd` (optional `--reprice-from-raw-usage`, `--exclude-infra-failures`, `--pool` / `--pricing` / `--ttl`). |
| `audit-infra` | List instances dropped by fair-metrics infra rules. |
| `audit-trace-cost` | Compare trace `step_cost_usd` vs `raw_usage.cost`. |
| `render` | Markdown leaderboard from `score.json` files. |
- `results/<instance_id>.json`: outcome (SWERouterBench-compatible schema)
- `<instance_id>.trace.jsonl`: scorer-facing trace + `loop_summary`
- `agent_logs/<instance_id>/agent.log`: harness log
- `case_summaries/<instance_id>.summary.json`: per-case rollup
- `<instance_id>.mini_traj.json`: full mini-swe-agent trajectory (unlike the editor scaffold in SWERouterBench, there is no `llm_io/` directory)
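A quick way to sanity-check a finished run is to confirm that these per-instance files exist. The sketch below assumes that `<instance_id>.trace.jsonl` and `<instance_id>.mini_traj.json` sit directly under `--output-dir`, which this README does not state explicitly. The helper names are hypothetical, and the only assumption about the trace is that each non-empty line is a JSON record.

```python
import json
from pathlib import Path
from typing import Dict, List


def expected_artifacts(output_dir: str, instance_id: str) -> Dict[str, Path]:
    """Map artifact names to their expected paths (layout partly assumed)."""
    root = Path(output_dir)
    return {
        "result": root / "results" / f"{instance_id}.json",
        "trace": root / f"{instance_id}.trace.jsonl",  # location assumed
        "agent_log": root / "agent_logs" / instance_id / "agent.log",
        "case_summary": root / "case_summaries" / f"{instance_id}.summary.json",
        "mini_traj": root / f"{instance_id}.mini_traj.json",  # location assumed
    }


def missing_artifacts(output_dir: str, instance_id: str) -> List[str]:
    """Return the names of expected artifacts that are not on disk."""
    paths = expected_artifacts(output_dir, instance_id)
    return [name for name, path in paths.items() if not path.exists()]


def count_trace_records(trace_path: Path) -> int:
    """Count non-empty lines in a .trace.jsonl file, parsing each as JSON."""
    count = 0
    with trace_path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                json.loads(line)  # raises ValueError if the record is malformed
                count += 1
    return count
```

For example, `missing_artifacts("runs/my_router_smoke", "django__django-11133")` returns an empty list when every file in the layout above is present.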
CLI defaults match mini-swe-agent’s published SWE-bench settings: `--max-steps 250`, `--budget-usd 3`. For a real submission or comparable numbers, keep those defaults and avoid dev-only flags such as `--max-steps-json` unless you know why you need them.
- SWERouterBench: router protocol, locked `data/`, editor-scaffold bench
- CommonRouterBench: tier GT bank (e.g. for `GoldTierRouter` smoke)
- mini-swe-agent: agent scaffold
- This repo: source for `miniswerouterbench`
Apache-2.0.