Skip to content

v3.3 — cross-model sweep (ARCHIVED, pre-v1.x)

Pre-release
Pre-release

Choose a tag to compare

@sanchitmonga22 sanchitmonga22 released this 19 May 04:25

⚠️ ARCHIVED — pre-v1.x naming. This is a historical research sweep (the v3.x line) that predates the current v1.x releases. It is not the latest version. For the current benchmark and datasets, see v1.5.1 (Latest). Kept for reproducibility of the 3,581-row cross-model sweep only.


Highlights

The biggest single benchmark sweep this repo has produced. 4.5 days of continuous compute on M4 Max 64 GB.

  • 3,581 rows across 33 variant directories
  • 6 local models × 5 routes × 7 routing strategies × 8 task shapes × 6 pricing scenarios

TL;DR

  1. Can hybrid routing save cost? Not via multi-step orchestration (R3/R4/R5 cost 1.9× to 5× more than R1 cloud-only). Yes via per-task gating: ~16-20% savings on a mixed workload.
  2. Best local model: Qwen3-Coder:30B at $0.229/correct. Beats devstral, qwen2.5-coder, gemma4, GLM, AND both newer Qwen 3.6 variants.
  3. Best routing strategy: Cascade with default threshold 15. Replicates as Pareto winner across all 6 models.

Major findings

  • LLM-classifier is structurally broken on SWE-bench: 5 classifier sizes from 0.6B to 4B all score 0/10 (Phase 6 sub-sweep). Scaling does NOT help.
  • Cascade threshold 15 is empirically optimal (Phase 7 sub-sweep tested 5/10/15/20/25; t=20 is a brittleness cliff).
  • Newer Qwen 3.6 family regressed vs older Qwen3-Coder on this benchmark — counter-intuitive but reproducible across both 27B-mxfp8 and 35B-A3B-MoE variants.
  • R5 DevMinion is catastrophically bad on prose — 5.13× R1 cost with composite 0.00 across 7 of 8 D3/D4 tasks.
  • Multi-step hybrid loses on every count — cost, quality, latency. Skip it.

Attached artifacts

  • v3.3-report.html — single-file standalone HTML report with all 5 charts base64-embedded. Open in any browser.
  • cross-model-leaderboard.png — R3 heuristic scatter, lower-right is better
  • strategy-model-heatmap.png — 7 models × 5 strategies $/correct heatmap
  • phase-6-classifier-sweep.png — B-pass rate per classifier (the 0/10 collapse)
  • phase-7-cascade-threshold.png — cascade threshold tuning
  • per-shape-r1-vs-alt.png — R1 vs best alternative cost per task shape

Read the full article

Reproducibility

Full sweep takes ~4.5 days on M4 Max 64 GB and ~$240 in OpenAI spend. Same dataset re-prices under 6 cloud scenarios without re-running inference. See REPRODUCING.md.

🤖 Sweep + analysis + article + this release all generated via Claude Code.