Release v3.3 — cross-model sweep (ARCHIVED, pre-v1.x) · RunanywhereAI/hybrid-arena

⚠️ ARCHIVED — pre-v1.x naming. This is a historical research sweep (the v3.x line) that predates the current v1.x releases. It is not the latest version. For the current benchmark and datasets, see v1.5.1 (Latest). Kept for reproducibility of the 3,581-row cross-model sweep only.

Highlights

The largest single benchmark sweep this repo has produced: 4.5 days of continuous compute on an M4 Max (64 GB).

3,581 rows across 33 variant directories
6 local models × 5 routes × 7 routing strategies × 8 task shapes × 6 pricing scenarios

TL;DR

Can hybrid routing save cost? Not via multi-step orchestration (R3/R4/R5 cost 1.9× to 5× more than R1 cloud-only). Yes via per-task gating: ~16-20% savings on a mixed workload.
Best local model: Qwen3-Coder:30B at $0.229/correct. Beats devstral, qwen2.5-coder, gemma4, GLM, AND both newer Qwen 3.6 variants.
Best routing strategy: Cascade with default threshold 15. Replicates as Pareto winner across all 6 models.
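
As a rough illustration of the per-task gating idea in the TL;DR, the sketch below picks the cheaper route per task shape and compares the gated total against sending everything to R1. The shape names, the `alt` route label, and every cost figure are hypothetical placeholders, not numbers from the sweep. The sketch also ignores quality, which a real gate would have to check as well.

```python
# Hypothetical per-shape costs in USD per task. Illustrative numbers only,
# NOT results from the sweep.
COST = {
    "shape_a": {"R1": 1.00, "alt": 0.60},
    "shape_b": {"R1": 1.00, "alt": 1.90},
    "shape_c": {"R1": 1.00, "alt": 0.80},
    "shape_d": {"R1": 1.00, "alt": 3.00},
}


def gate(shape, cost=COST):
    """Pick the cheaper route for a task shape, staying on R1 on ties."""
    routes = cost[shape]
    return "alt" if routes["alt"] < routes["R1"] else "R1"


def savings_vs_r1(workload, cost=COST):
    """Fractional savings of gated routing versus sending every task to R1."""
    baseline = sum(cost[s]["R1"] for s in workload)
    gated = sum(cost[s][gate(s, cost)] for s in workload)
    return 1.0 - gated / baseline


workload = ["shape_a", "shape_b", "shape_c", "shape_d"]
print({s: gate(s) for s in workload})
print(f"savings vs R1-only: {savings_vs_r1(workload):.1%}")
```

Because the gate only ever swaps R1 for a cheaper route, the gated total can never exceed the R1-only total under this cost model. How much it saves depends entirely on the workload mix.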

Major findings

LLM-classifier is structurally broken on SWE-bench: 5 classifier sizes from 0.6B to 4B all score 0/10 (Phase 6 sub-sweep). Scaling does NOT help.
Cascade threshold 15 is empirically optimal (Phase 7 sub-sweep tested 5/10/15/20/25; t=20 is a brittleness cliff).
Newer Qwen 3.6 family regressed vs older Qwen3-Coder on this benchmark — counter-intuitive but reproducible across both 27B-mxfp8 and 35B-A3B-MoE variants.
R5 DevMinion is catastrophically bad on prose — 5.13× R1 cost with composite 0.00 across 7 of 8 D3/D4 tasks.
Multi-step hybrid loses on every count — cost, quality, latency. Skip it.
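
For readers unfamiliar with cascade routing, here is a minimal sketch of the general pattern: try the local model first and escalate to cloud only when a score falls below a threshold. This release note does not say what the threshold of 15 measures or in which direction it applies, so the `score` callable, its scale, and the `>=` comparison are all assumptions made for illustration. See docs/HYBRID_ROUTER_DESIGN.md for the actual design.

```python
from typing import Callable, Tuple


def cascade(
    task: str,
    run_local: Callable[[str], str],
    run_cloud: Callable[[str], str],
    score: Callable[[str, str], float],
    threshold: float = 15.0,
) -> Tuple[str, str]:
    """Try local first; escalate to cloud if the local answer scores too low.

    ASSUMPTION: `score` and the meaning of `threshold` are hypothetical.
    The release only states that the default threshold is 15.
    """
    local_answer = run_local(task)
    if score(task, local_answer) >= threshold:
        return "local", local_answer
    return "cloud", run_cloud(task)


# Stub models and scorer, for demonstration only.
def fake_local(task: str) -> str:
    return f"local:{task}"


def fake_cloud(task: str) -> str:
    return f"cloud:{task}"


def fake_score(task: str, answer: str) -> float:
    # Pretend short tasks are easy for the local model.
    return 20.0 if len(task) < 10 else 5.0


print(cascade("fix typo", fake_local, fake_cloud, fake_score))
print(cascade("refactor the whole module", fake_local, fake_cloud, fake_score))
```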

Attached artifacts

v3.3-report.html — single-file standalone HTML report with all 5 charts base64-embedded. Open in any browser.
cross-model-leaderboard.png — R3 heuristic scatter, lower-right is better
strategy-model-heatmap.png — 7 models × 5 strategies $/correct heatmap
phase-6-classifier-sweep.png — B-pass rate per classifier (the 0/10 collapse)
phase-7-cascade-threshold.png — cascade threshold tuning
per-shape-r1-vs-alt.png — R1 vs best alternative cost per task shape

Read the full article

reports/ARTICLE.md — ~10,000-word comprehensive write-up
docs/HYBRID_ROUTER_DESIGN.md — the deployable router architecture
docs/REPRODUCING.md — copy-paste reproduction

Reproducibility

The full sweep takes ~4.5 days on an M4 Max (64 GB) and ~$240 in OpenAI spend. The same dataset can be re-priced under the 6 cloud pricing scenarios without re-running inference. See REPRODUCING.md.
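
Re-pricing without re-running inference presumably works because each row stores token counts, so cost is a pure function of the row and a price table. The sketch below shows that idea under assumptions: the field names, scenario names, and prices are hypothetical and do not reflect the repo's actual schema or the 6 scenarios it ships.

```python
from dataclasses import dataclass


@dataclass
class Row:
    # Hypothetical fields: cloud tokens consumed by one benchmark row.
    cloud_input_tokens: int
    cloud_output_tokens: int


# Hypothetical price tables, in USD per million tokens.
SCENARIOS = {
    "scenario_a": {"input": 1.00, "output": 4.00},
    "scenario_b": {"input": 0.50, "output": 2.00},
}


def reprice(rows, scenario):
    """Total cloud cost in USD for the stored rows under one price scenario."""
    price = SCENARIOS[scenario]
    micro_usd = sum(
        r.cloud_input_tokens * price["input"]
        + r.cloud_output_tokens * price["output"]
        for r in rows
    )
    return micro_usd / 1_000_000


rows = [Row(1_000_000, 500_000), Row(2_000_000, 250_000)]
for name in SCENARIOS:
    print(name, reprice(rows, name))
```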

🤖 Sweep + analysis + article + this release all generated via Claude Code.
