v3.3 — cross-model sweep (ARCHIVED, pre-v1.x)
Pre-release
⚠️ ARCHIVED: pre-v1.x naming. This is a historical research sweep (the v3.x line) that predates the current v1.x releases. It is not the latest version. For the current benchmark and datasets, see v1.5.1 (Latest). Kept only for reproducibility of the 3,581-row cross-model sweep.
Highlights
The biggest single benchmark sweep this repo has produced. 4.5 days of continuous compute on M4 Max 64 GB.
- 3,581 rows across 33 variant directories
- 6 local models × 5 routes × 7 routing strategies × 8 task shapes × 6 pricing scenarios
TL;DR
- Can hybrid routing save cost? Not via multi-step orchestration (R3/R4/R5 cost 1.9× to 5× more than R1 cloud-only). Yes via per-task gating: ~16-20% savings on a mixed workload.
- Best local model: Qwen3-Coder:30B at $0.229/correct. Beats devstral, qwen2.5-coder, gemma4, GLM, AND both newer Qwen 3.6 variants.
- Best routing strategy: Cascade with default threshold 15. Replicates as Pareto winner across all 6 models.
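The per-task gating idea in the first bullet can be sketched as follows. This is a minimal illustration, not the repo's router: the `ShapeResult` fields, the shape labels, and every number in `example` are hypothetical, and the sketch assumes that $/correct means total spend divided by the number of correct answers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShapeResult:
    """Aggregated results for one task shape (all field names are hypothetical)."""
    shape: str
    r1_cost: float      # USD spent on the cloud-only route (R1)
    r1_correct: int     # tasks R1 solved
    alt_cost: float     # USD spent on the best alternative route
    alt_correct: int    # tasks the alternative solved


def cost_per_correct(cost: float, correct: int) -> float:
    """Assumed definition of $/correct: total spend over correct answers."""
    return cost / correct if correct else float("inf")


def gate(results: list[ShapeResult]) -> dict[str, str]:
    """For each task shape, pick whichever route has the lower $/correct."""
    choice = {}
    for r in results:
        alt = cost_per_correct(r.alt_cost, r.alt_correct)
        r1 = cost_per_correct(r.r1_cost, r.r1_correct)
        choice[r.shape] = "alt" if alt < r1 else "R1"
    return choice


def gated_savings(results: list[ShapeResult]) -> float:
    """Fraction of spend saved by gating, relative to sending everything to R1."""
    choice = gate(results)
    baseline = sum(r.r1_cost for r in results)
    gated = sum(r.alt_cost if choice[r.shape] == "alt" else r.r1_cost
                for r in results)
    return 1 - gated / baseline


# Illustrative numbers only; they are not taken from the sweep.
example = [
    ShapeResult("shape_a", r1_cost=10.0, r1_correct=10, alt_cost=4.0, alt_correct=8),
    ShapeResult("shape_b", r1_cost=10.0, r1_correct=10, alt_cost=30.0, alt_correct=6),
]
routes = gate(example)
savings = gated_savings(example)
```

With these made-up inputs the gate keeps `shape_b` on R1 and moves only `shape_a` to the alternative, which is the pattern the TL;DR describes: savings come from choosing a route per task shape, not from chaining routes within one task.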
Major findings
- LLM-classifier is structurally broken on SWE-bench: 5 classifier sizes from 0.6B to 4B all score 0/10 (Phase 6 sub-sweep). Scaling does NOT help.
- Cascade threshold 15 is empirically optimal (Phase 7 sub-sweep tested 5/10/15/20/25; t=20 is a brittleness cliff).
- Newer Qwen 3.6 family regressed vs older Qwen3-Coder on this benchmark — counter-intuitive but reproducible across both 27B-mxfp8 and 35B-A3B-MoE variants.
- R5 DevMinion is catastrophically bad on prose — 5.13× R1 cost with composite 0.00 across 7 of 8 D3/D4 tasks.
- Multi-step hybrid loses on every count — cost, quality, latency. Skip it.
Attached artifacts
- v3.3-report.html: single-file standalone HTML report with all 5 charts base64-embedded. Open in any browser.
- cross-model-leaderboard.png: R3 heuristic scatter, lower-right is better
- strategy-model-heatmap.png: 7 models × 5 strategies $/correct heatmap
- phase-6-classifier-sweep.png: B-pass rate per classifier (the 0/10 collapse)
- phase-7-cascade-threshold.png: cascade threshold tuning
- per-shape-r1-vs-alt.png: R1 vs best alternative cost per task shape
Read the full article
- reports/ARTICLE.md: ~10,000-word comprehensive write-up
- docs/HYBRID_ROUTER_DESIGN.md: the deployable router architecture
- docs/REPRODUCING.md: copy-paste reproduction
Reproducibility
The full sweep takes ~4.5 days on an M4 Max 64 GB and ~$240 in OpenAI spend. The same dataset can be re-priced under all 6 cloud pricing scenarios without re-running inference. See REPRODUCING.md.
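Re-pricing without re-running inference works when each result row keeps its raw token counts, so cost becomes a pure function of the row and a price table. The sketch below assumes that layout. The `Row` and `Pricing` fields, the scenario names, and the prices are all hypothetical and may not match the dataset schema described in REPRODUCING.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """One cloud pricing scenario, in USD per million tokens (hypothetical)."""
    input_per_mtok: float
    output_per_mtok: float


@dataclass(frozen=True)
class Row:
    """One benchmark row with stored token counts (hypothetical schema)."""
    cloud_input_tokens: int
    cloud_output_tokens: int
    correct: bool


def row_cost(row: Row, pricing: Pricing) -> float:
    """Cloud cost of a single row under the given price table."""
    return (row.cloud_input_tokens * pricing.input_per_mtok
            + row.cloud_output_tokens * pricing.output_per_mtok) / 1_000_000


def reprice(rows: list[Row], scenarios: dict[str, Pricing]) -> dict[str, float]:
    """Return $/correct per scenario, computed from stored rows only."""
    n_correct = sum(r.correct for r in rows)
    result = {}
    for name, pricing in scenarios.items():
        total = sum(row_cost(r, pricing) for r in rows)
        result[name] = total / n_correct if n_correct else float("inf")
    return result


# Illustrative rows and prices only; they are not taken from the sweep.
rows = [
    Row(cloud_input_tokens=1_000_000, cloud_output_tokens=500_000, correct=True),
    Row(cloud_input_tokens=1_000_000, cloud_output_tokens=500_000, correct=False),
]
scenarios = {
    "cheap": Pricing(input_per_mtok=1.0, output_per_mtok=2.0),
    "pricey": Pricing(input_per_mtok=2.0, output_per_mtok=4.0),
}
repriced = reprice(rows, scenarios)
```

Because the token counts and correctness flags never change, adding another scenario only means adding another `Pricing` entry; the expensive inference pass is not repeated.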
🤖 Sweep + analysis + article + this release all generated via Claude Code.