Releases: RunanywhereAI/hybrid-arena
v1.6.0 — Hybrid Coding Arena (rebrand)
Hybrid Coding Arena is the new name for this benchmark (formerly hybrid-coding-eval), from RunAnywhere.
What changed in v1.6.0
This is a rebrand release. The benchmark, methodology, and dataset are unchanged.
- Python package: `hybrid_coding_eval` is now `hybrid_arena`.
- Distribution + repo: `hybrid-arena` (the old `hybrid-coding-eval` URL redirects here).
- CLI command: `bench` is now `arena` (e.g. `arena sweep`, `arena analyze`).
- Clearer headline chart: pass-rate and cloud usage are now separate, labeled elements.
git clone https://github.com/RunanywhereAI/hybrid-arena
cd hybrid-arena && python3.12 -m venv .venv
.venv/bin/pip install -e ".[dev,agents]"
arena setup
arena sweep --config configs/v1.4-smoke.yaml --strategies always-cloud --seeds 42
arena analyze results/runs/v1.4-smoke
Dataset
results-v1.6.0.tar.gz is byte-identical to the v1.5.0/v1.5.1 dataset (1,704 rows). No new benchmark runs in this release.
Headline
- cline + qwen3.6 + cascade on real-developer refactors: 24/24 = 100% at 8% cloud, about $0.022/task.
- Local-only solves 67% of the hard (D6) tasks at $0 cloud; cloud-only holds 100%.
- 1,704 rows, 3 local models, 3 coding agents, 8 routing strategies, 17 tasks, one M4 Max laptop, 95% bootstrap CIs.
v1.5.1: open-source polish
Open-source polish release. Addresses every finding from the pre-publish audit pass — security, licensing, UX, hygiene. No code-behaviour changes; safe to take.
Full release notes: docs/release-notes/v1.5.1.md.
What changed
Licensing simplified
- Deleted `NOTICE.md`, `LICENSE-DATA`, `LICENSE.md`.
- Single `LICENSE` (MIT) now covers code, data, charts, and docs prose.
- Citation request lives in the README's bibtex block.
Documentation rewritten
- `README.md`: six-cell headline table, full prereq + quickstart with realistic time/cost estimates, "picking a config for real work" section distilled from the v1.5 leaderboard, full `bench` CLI table.
- `AGENTS.md`: refreshed for v1.5.0. D6 task class documented, v1.5 configs added to the tree, conventions reflect that single-letter codes (A/B/D) are retired.
- `CODE_OF_CONDUCT.md`: short and direct.
- Source-tree docstrings + per-task READMEs: `lib.*` rewritten to `core.*`; "Category D / B / X" rewritten to `refactors` / `real-prs` / `puzzles` end-to-end.
UX cleanup
- Deleted `scripts/reproduce.sh`: `./bench setup` already does prereq checks, and the smoke run is `./bench sweep --config configs/v1.4-smoke.yaml`.
- Deleted `logs/v3.3/`: historical sweep logs moved out of git. `logs/` is now gitignored.
Hygiene
- `__version__` bumped 0.1.0 → 1.5.1 (it was stuck at 0.1.0).
- `pytest` moved from runtime deps to `[dev]` extras.
- Removed unused `pytest -m slow` filter from CI + docs.
- `.github/ISSUE_TEMPLATE/new_model.md`: broken `configs/variants/_template.yaml` reference fixed.
- `docs/HYBRID_ROUTING_DESIGN.md` + v1.4.{0,1} release notes: `jq` snippets updated from legacy `D::cline::heuristic` to `refactors::cline::heuristic`.
- Test aliases `r10_cline`, `r6_mini_swe_agent` renamed.
Privacy
- Sanitized 263 absolute-path leaks (`/Users/<owner>/...`) in tracked `raw.jsonl` / `progress.log`. JSON re-validated on every row (520 rows, 0 parse errors).
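A sanitization pass of this kind could look roughly like the sketch below. It is an illustration only, not the script used for this release: the regex, the function name, and the choice to count replacements are all assumptions.

```python
import json
import re

# Assumed pattern: a macOS home-directory prefix such as /Users/alice/.
# The lookahead skips prefixes that are already sanitized.
HOME_PREFIX = re.compile(r"/Users/(?!<owner>/)[^/\s\"]+/")


def sanitize_jsonl(lines):
    """Replace home-directory prefixes and re-validate each row as JSON.

    Returns (clean_lines, replacements, parse_errors).
    """
    clean, replacements, parse_errors = [], 0, 0
    for line in lines:
        fixed, n = HOME_PREFIX.subn("/Users/<owner>/", line)
        replacements += n
        try:
            json.loads(fixed)  # every row must still parse after the rewrite
        except json.JSONDecodeError:
            parse_errors += 1
        clean.append(fixed)
    return clean, replacements, parse_errors
```

Re-parsing every rewritten row is what backs a "0 parse errors" claim: a replacement that broke quoting would show up in `parse_errors` rather than silently corrupting the dataset.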
Verification
- 120 fast tests pass on Python 3.11 + 3.12 (CI matrix).
- `ruff check src/ tests/` clean.
Citation
If you use this benchmark, a citation would be really appreciated. BibTeX in the README.
📦 Dataset
results-v1.5.1.tar.gz is byte-identical to the v1.5.0 dataset — v1.5.1 added 0 new benchmark rows (it is an open-source-polish release). It is attached here so visitors landing on the Latest release can download the data directly. The canonical 1,704-row dataset is unchanged since v1.5.0.
gh release download v1.5.1 -p results-v1.5.1.tar.gz # or v1.5.0, same bytes
v1.5.0 — Hard-Task Stress Test (2026-05-27)
Theme: Push past the D1/D5 ceiling. v1.4.x had top configs hitting
100% on the canonical 8-task refactor set, which made the leaderboard
uninformative at the top end. v1.5 adds a new harder task shape (D6)
designed to actually stress 30B local models, and stress-tests the
v1.4.1 champion configurations against it.
What's new
New D6 task class: hard refactors (4 tasks, 80 acceptance tests)
Each D6 task is a single-file Python implementation challenge with
comprehensive pytest coverage. The reference solutions are all ≤200
LOC, so the difficulty isn't volume — it's the breadth of corner
cases the model has to internalise from the prompt alone.
| Task | LOC | Tests | What it stresses |
|---|---|---|---|
| `d6-lru-ttl-cache` | 100 | 23 | OrderedDict-based LRU + monkey-patchable TTL clock + careful eviction accounting |
| `d6-token-bucket` | 60 | 14 | Lazy refill correctness + multi-key isolation + non-positive-arg validation |
| `d6-toposort` | 90 | 16 | Kahn's with deterministic tie-break + DFS cycle detection with path reconstruction |
| `d6-mini-template` | 200 | 27 | Recursive-descent parser + AST evaluator + escape filters + nested if/for + comments |
Total: ~450 LOC of reference solution, 80 pytest assertions.
The same overlay+pytest scoring path D1/D5 uses — no new scoring
infrastructure needed.
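To make "breadth of corner cases" concrete, here is a minimal sketch of the kind of behaviour the token-bucket row describes: lazy refill, per-key isolation, and rejection of non-positive arguments. It is not the benchmark's reference solution, and the class name, method signature, and injectable `clock` parameter are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Per-key token bucket that refills lazily, only when a key is touched."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        if capacity <= 0 or refill_per_sec <= 0:
            raise ValueError("capacity and refill_per_sec must be positive")
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock  # injectable so tests can control time
        self._state = {}  # key -> (tokens, last_refill_time)

    def allow(self, key, cost=1):
        if cost <= 0:
            raise ValueError("cost must be positive")
        now = self.clock()
        # A key seen for the first time starts with a full bucket.
        tokens, last = self._state.get(key, (self.capacity, now))
        # Lazy refill: credit the elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._state[key] = (tokens, now)
        return allowed
```

Even at this size there are several places to go wrong (forgetting the cap on refill, sharing state across keys, accepting a zero cost), which is consistent with the point that the difficulty is in the corner cases rather than the line count.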
Stress-test sweeps
Two new sweep configs target the v1.4.1 audit's top configurations:
- `configs/v1.5-hard-gemma4.yaml`: aider+gemma4 on always-cloud and
  heuristic (the v1.3 marquee profile). 4 tasks × 2 strategies × 3
  seeds = 24 rows.
- `configs/v1.5-hard-qwen3.6.yaml`: cline+qwen3.6 on always-cloud,
  always-local, and cascade (the v1.4.1 champion profile). 4 tasks ×
  3 strategies × 3 seeds = 36 rows.
Total 60 new rows of hard-task data. Cost cap: $50 cloud spend.
Wall-time cap: 6 hours.
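The row counts above are a plain Cartesian product of tasks, strategies, and seeds. The sketch below reproduces that arithmetic; the row dictionary layout and the `sweep_rows` helper are hypothetical, while the task names, strategies, and seeds are the ones listed in these notes.

```python
from itertools import product


def sweep_rows(tasks, strategies, seeds):
    """Enumerate one row per (task, strategy, seed) combination."""
    return [
        {"task": t, "strategy": s, "seed": seed}
        for t, s, seed in product(tasks, strategies, seeds)
    ]


D6_TASKS = ["d6-lru-ttl-cache", "d6-token-bucket", "d6-toposort", "d6-mini-template"]
SEEDS = [42, 7, 13]

gemma4_rows = sweep_rows(D6_TASKS, ["always-cloud", "heuristic"], SEEDS)
qwen_rows = sweep_rows(D6_TASKS, ["always-cloud", "always-local", "cascade"], SEEDS)
print(len(gemma4_rows), len(qwen_rows), len(gemma4_rows) + len(qwen_rows))  # 24 36 60
```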
Findings
Both sweeps complete. Full analysis in
personal/reports/publish-v1.5/article.html §9.5.
| Agent | Model | Strategy | Pass | Cloud-frac | Notes |
|---|---|---|---|---|---|
| aider | gemma4:31b | always-cloud | 12/12 (100%) | 100% | ceiling |
| aider | gemma4:31b | heuristic | 7/12 (58%) | 61% | v1.3 marquee profile falls off on D6 |
| cline | qwen3.6:35b | always-cloud | 12/12 (100%) | 100% | ceiling |
| cline | qwen3.6:35b | always-local | 8/12 (67%) | 0% | 30B local-only ceiling — $0 cloud spend |
| cline | qwen3.6:35b | cascade | 9/12 (75%) | 13% | v1.4.1 champion holds 75% but loses on mini-template |
Key findings:
- 30B local-only solves 67% of hard refactors with zero cloud. cline + qwen3.6:35b
  nails token-bucket and toposort 3/3 each, partial-passes lru-ttl-cache and mini-template.
  Two of the four "failures" are cline-on-local session bugs, not model quality.
- The v1.4.1 cascade champion drops from 100% to 75% on D6. Cascade only marginally
  beats always-local (75% vs 67%) because the router has no global view of task difficulty.
  The `d6-mini-template` recursive-descent parser is the hard wall.
- Always-cloud (gpt-5.5) is 100% on both configs. The cloud advantage on D6 is real:
  42 percentage points over heuristic gemma4.
- The pytest-parser bug fix in `src/hybrid_coding_eval/agents/aider.py` was caught by a
  new parametrized test (`tests/agents/test_aider_parser.py`). The bug undercounted
  passes for aider rows where `X failed` appeared before `Y passed` in the summary line.
  Rescoring v1.5 gemma4 data found 0 affected rows; the fix is preventative.
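The class of bug described in the last finding is avoided by parsing each count independently instead of assuming a fixed order. The following is a minimal sketch of that idea, not the code in `aider.py`; the function name and regex are assumptions.

```python
import re

# Matches fragments such as "5 passed" or "2 failed" anywhere in the line.
_COUNT = re.compile(r"(\d+)\s+(passed|failed)\b")


def parse_pytest_summary(line):
    """Return (passed, failed) from a pytest summary line, in any order."""
    counts = {"passed": 0, "failed": 0}
    for number, outcome in _COUNT.findall(line):
        counts[outcome] = int(number)
    return counts["passed"], counts["failed"]
```

Because each outcome is looked up by name, a summary such as `2 failed, 5 passed in 0.12s` yields the same pass count as one where `passed` comes first.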
How to reproduce
git checkout v1.5.0
./scripts/reproduce.sh # one-time env setup if not done yet
./bench sweep --config configs/v1.5-hard-gemma4.yaml \
--strategies always-cloud,heuristic --seeds 42,7,13
./bench sweep --config configs/v1.5-hard-qwen3.6.yaml \
--strategies always-cloud,always-local,cascade --seeds 42,7,13
./bench analyze results/runs/v1.5-hard-gemma4
./bench analyze results/runs/v1.5-hard-qwen3.6
Expected wall time on M4 Max 64 GB: ~30 min for gemma4 sweep,
~80 min for qwen3.6 sweep. Expected cloud spend at gpt-5.5
list pricing: ~$8 total across both sweeps.
Migration from v1.4.4
Zero migration cost. v1.5 is purely additive:
- The D1–D5 task shapes are unchanged. The v1.4.1 canonical dataset
  remains the headline dataset.
- The D6 shape uses the same overlay+pytest scoring path as D1/D5.
  No new scoring dependencies.
- Existing configs continue to work. The new configs are isolated to
  `configs/v1.5-hard-*.yaml`.
v1.4.1 — 3-model agentic leaderboard (1,644 rows total)
Adds qwen3-coder:30b + qwen3.6:35b canonical sweeps to v1.4.0's gemma4 line. Combined v1.4 + v1.4.1: 1,644 rows.
Headline (the marquee cells)
| Cell | Pass-rate | Cloud-fraction |
|---|---|---|
| cline + qwen3.6 + cascade + refactors | 24/24 = 100% [100, 100] | low (~5-10%) |
| cline + qwen3.6 + heuristic + refactors | 22/24 = 92% | ~7% |
| cline + qwen3-coder + heuristic + refactors | 22/24 = 92% | ~7% |
| cline + qwen3.6 + always-local + puzzles | 15/15 = 100% | 0% |
| aider + gemma4 + heuristic + refactors (v1.4.0) | 23/24 = 96% [88, 100] | 48% |
What's new in v1.4.1
Router infrastructure fix (commit c7392db)
Three model-agnostic local-guard env vars in router/server.mjs:fetchLocalOllamaAsOpenAI():
| Env var | Default | Purpose |
|---|---|---|
| ROUTER_LOCAL_NUM_PREDICT_CAP | 4096 | max gen tokens per local call |
| ROUTER_LOCAL_REQUEST_TIMEOUT_MS | 180000 | 3-min per-request hard timeout |
| ROUTER_LOCAL_REPEAT_PENALTY | 1.1 | override weak model defaults |
Discovered when qwen3-coder's weak repeat_penalty=1.05 combined with cline's missing max_tokens caused a runaway repetition loop (34 MB streamed over 2h35m from a single HTTP request) that crashed Ollama and cascaded every subsequent task to timeout. Full RCA in the release tarball.
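The effect of the three guards can be sketched as follows. The real router is JavaScript (`router/server.mjs`); this Python version is only an illustration. The function names are hypothetical, and the assumption that the repeat penalty is always overridden is mine rather than something these notes confirm. `num_predict` and `repeat_penalty` are standard Ollama generation options.

```python
import os


def local_guards(env=os.environ):
    """Read the three guard settings, falling back to the documented defaults."""
    return {
        "num_predict_cap": int(env.get("ROUTER_LOCAL_NUM_PREDICT_CAP", "4096")),
        "request_timeout_ms": int(env.get("ROUTER_LOCAL_REQUEST_TIMEOUT_MS", "180000")),
        "repeat_penalty": float(env.get("ROUTER_LOCAL_REPEAT_PENALTY", "1.1")),
    }


def apply_guards(request, guards):
    """Return Ollama-style options with generation length capped.

    `request` is an OpenAI-style dict; a missing max_tokens falls back to the cap,
    so a client that never sets it can no longer generate without bound.
    """
    requested = request.get("max_tokens") or guards["num_predict_cap"]
    return {
        "num_predict": min(requested, guards["num_predict_cap"]),
        "repeat_penalty": guards["repeat_penalty"],
    }
```

The request timeout would be enforced by the HTTP client making the local call, which is left out of this sketch.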
2 new canonical sweeps (936 rows)
- `configs/v1.4-canonical-qwen3-coder.yaml` → 468 rows. qwen3-coder:30b (MoE coding specialist) across 3 agents × 4 strategies × 13 tasks × 3 seeds.
- `configs/v1.4-canonical-qwen3.6.yaml` → 468 rows. qwen3.6:35b (dense generalist) across the same matrix.
Three new findings
- qwen3.6:35b is the unsung champion. cline + qwen3.6 + cascade nails 100% on refactors. cline + qwen3.6 + always-local nails 100% on puzzles. cline + qwen3.6 + heuristic = 92% on refactors at ~7% cloud spend.
- opencode is gemma4-specific. v1.4.0's opencode resurrection (71% on refactors heuristic with gemma4) doesn't transfer to qwen models (21-33%). opencode's runLoop requires clean tool_calls: gemma4 produces them, qwen variants don't reliably.
- Aider is model-sensitive. 96% on gemma4 refactors heuristic → 50% qwen3.6 → 33% qwen3-coder. Aider's architect/editor protocol favors gemma4's dense-generalist training profile.
Reproducibility
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.4.1
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && pip install -e ".[dev]"
cp .env.example .env # set OPEN_AI_API_KEY
ollama pull gemma4:31b qwen3-coder:30b qwen3.6:35b
./bench setup
./bench start --config configs/v1.4-canonical-qwen3.6.yaml \
--strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench status # progress
./bench pause # if you need the laptop
./bench resume # picks up where it left off
./bench analyze results/runs/v1.4-canonical-qwen3.6
# headline: jq '.cells["refactors::cline::cascade"].pass_rate' \
# results/runs/v1.4-canonical-qwen3.6/bootstrap_cis.json
Artifacts attached
- `results-v1.4.1.tar.gz` (15 MB): qwen3-coder + qwen3.6 sweep dirs (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts)
- `article.html`: the v1.4.1 master article with code-generated charts (~10-min read covering v1.0 → v1.4.1)
- `qwen3-coder-timeout-rca.md`: full root-cause analysis of the router infrastructure fix
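For readers without `jq`, the headline lookup from the reproducibility commands can be done in a few lines of Python. This sketch assumes only what the `jq` path implies (a top-level `cells` object keyed by `task_class::agent::strategy`, each with a `pass_rate` entry); the helper name is hypothetical and the shape of the `pass_rate` value is not assumed.

```python
import json
from pathlib import Path


def headline_pass_rate(run_dir, cell="refactors::cline::cascade"):
    """Return the pass_rate entry for one cell of bootstrap_cis.json."""
    data = json.loads((Path(run_dir) / "bootstrap_cis.json").read_text())
    return data["cells"][cell]["pass_rate"]
```

Usage would be along the lines of `headline_pass_rate("results/runs/v1.4-canonical-qwen3.6")` after unpacking the release tarball.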
Migration from v1.4.0
No code changes. The router fix is backwards-compatible (env vars all default to safe values). Existing v1.4.0 sweeps re-run with the v1.4.1 router will be safer against runaway local-model loops.
v1.4.0 — agent-only release (708 rows, 3 sweeps, gemma4)
v1.4.0 — cleanup + production-pipeline release
The single canonical release for hybrid coding agents on local hardware.
After 7 prior releases (v1.0.0–v1.3.0), v1.4 narrows the project to agent-only scope (deletes the legacy R1–R5 non-agentic routes + the HumanEval+/BigCodeBench/custom-arch benchmarks), adds Claude Code + Cline runners, ships 5 production-grade lifecycle commands (./bench start|pause|resume|stop|status), and runs the canonical 3-sweep dataset against the gemma4:31b local model.
Headline numbers (708 rows, ~20h wall, $90.48 cloud spend on M4 Max 64GB)
| Cell | Pass-rate | Cloud-fraction | Notes |
|---|---|---|---|
| aider + heuristic + refactors | 23/24 = 96% [88, 100] | 48% | The marquee Pareto win — replicates v1.3.0 |
| aider + always-cloud + refactors | 24/24 = 100% [100, 100] | 100% | Cloud ceiling |
| cline + always-local + puzzles | 15/15 = 100% | 0% | First 30B local-only result that nails Exercism Python |
| opencode + heuristic + refactors | 17/24 = 71% | 46% | Resurrection: vs v1.1.x's 0/15 |
| cascade (all (agent, task-class) cells) | ≤ heuristic | varies | cascade is dead in agentic regime |
What changed in v1.4
Cleanup (the big one — 82 files deleted)
- Deleted non-agentic code: R1-R5 runners (cloud-only, local-only, hybrid-architect, Stanford Minion, Stanford DevMinion), HumanEval+ adapter, BigCodeBench-Hard adapter, custom-arch adapter, LLM-judge scorer, the entire `configs/variants/` directory (32 legacy YAMLs), `vendor/minions/`, `cli/{judge,rejudge,rescore,report}.py`.
- Schema rename: `BenchmarkConfig.routes` → `agents`, `categories` → `task_classes`. `TaskPlan.route` → `agent`. Drop `Rn` prefix throughout the codebase. The agent names are `aider`, `opencode`, `mini-swe-agent`, `claude-code`, `cline`.
- Directory rename: `runners/` → `agents/`, `benchmarks/` → `tasks/`. Sub-renames: `exercism_python/` → `puzzles/`, `real_dev/` → `refactors/`, `swebench_verified/` → `real_prs/`.
Added
- Two new agents: `claude-code` (Anthropic Claude CLI; runs always-cloud-direct in v1.4; Anthropic-compat router shim deferred to v1.5) and `cline` (Apache-2.0 Plan/Act agent; talks LiteLLM-compat → our router cleanly).
- `phase-aware` routing strategy: deterministic Aider role-marker split (architect→cloud, editor→local). Falls through to legacy heuristic for non-Aider agents. 8 strategies total now (`always-local`, `always-cloud`, `rules`, `heuristic`, `llm-classifier`, `embedding-knn`, `cascade`, `phase-aware`).
- 5 production lifecycle commands in `./bench`:
  - `bench start --config <yaml> --strategies ... --seeds ...`: spawn detached, auto-starts Ollama, writes /tmp/hcev-sweep.json
  - `bench pause`: kill orchestrator + agents + router; keep Ollama for fast resume
  - `bench resume`: relaunch with `--resume` to skip rows already in raw.jsonl
  - `bench stop [--keep-ollama-app] [--clear-state]`: like pause + kill Ollama (frees ~19 GB)
  - `bench status`: show RUNNING / PAUSED / NO SWEEP with PID, log path, row count
- `bench sweep --resume` flag + automatic router lifecycle from `models.local` in the config (eliminates the manual `cd router && ./start.sh` step from earlier reproducers).
- `/api/tags` stub on the router so cline's Ollama-listing probe gets a clean 200 instead of a noisy 404.
- 5 production v1.4 configs under `configs/`: `v1.4-smoke.yaml`, `v1.4-canonical-gemma4.yaml`, `v1.4-opencode-fairness.yaml`, `v1.4-strategy-sweep.yaml`, `v1.4-real-prs.yaml`, plus 2 queued (`v1.4-canonical-qwen3-coder.yaml`, `v1.4-canonical-qwen3.6.yaml`) for v1.4.1.
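The resume behaviour listed above ("skip rows already in raw.jsonl") amounts to a set difference between the planned matrix and the rows already recorded. A minimal sketch, assuming each row carries `task`, `agent`, `strategy`, and `seed` fields; those field names and both helper names are hypothetical, not the project's actual schema.

```python
import json


def completed_keys(raw_jsonl_lines):
    """Collect the (task, agent, strategy, seed) keys already recorded."""
    done = set()
    for line in raw_jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        row = json.loads(line)
        done.add((row["task"], row["agent"], row["strategy"], row["seed"]))
    return done


def remaining(plan, done):
    """Keep only the planned rows whose key is not in raw.jsonl yet."""
    return [
        p for p in plan
        if (p["task"], p["agent"], p["strategy"], p["seed"]) not in done
    ]
```

Keying on the full tuple rather than a row count means a pause in the middle of one strategy does not cause rows from another strategy to be skipped or repeated.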
Fixed
- Reproducibility audit blockers (10 items from v1.3 audit): README rewrite for v1.4, `requirements.txt` += pydantic + pyyaml, auto-spawn router in `bench sweep`, fresh CI workflow (no longer needs vendor/minions clone), docs rewrites for `REPRODUCING.md` + `BENCHMARK_NEW_MODEL.md`.
- Phase 2 rename fallout: 4 agent runners had hardcoded `benchmarks/real_dev/fixtures/` paths in `_REAL_DEV_FIXTURES_ROOT` constants, now updated to `tasks/refactors/fixtures/`.
What's still pending (v1.4.1)
- qwen3-coder:30b + qwen3.6:35b canonical sweeps: configs ready at `configs/v1.4-canonical-qwen3-coder.yaml` and `configs/v1.4-canonical-qwen3.6.yaml`; multi-model orchestrator script at `personal/scripts/v1.4-multi-model-orchestrator.sh`. Estimated +12h compute, +$30 cloud. Will land as v1.4.1 with an updated 3-model leaderboard.
- SWE-bench Verified real-PR replay: config at `configs/v1.4-real-prs.yaml` but blocked on local Docker availability for the SWE-bench testbed. Deferred to v1.5.
Reproducibility
# Fresh clone + setup (~10 min)
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.4.0
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && pip install -e ".[dev]"
cp .env.example .env # set OPEN_AI_API_KEY
ollama pull gemma4:31b # ~19 GB
./bench setup # installs aider + opencode + cline; verifies Ollama + Docker
# Run the canonical gemma4 sweep (~10h on M4 Max 64GB, ~$70 cloud)
./bench start --config configs/v1.4-canonical-gemma4.yaml \
--strategies always-cloud,always-local,heuristic,cascade \
--seeds 42,7,13
# Monitor / pause / resume
./bench status # show progress
./bench pause # if you need the laptop
./bench resume # picks up where it left off
# After sweep completes, analyze
./bench analyze results/runs/v1.4-canonical-gemma4
# headline lives in: bootstrap_cis.json["cells"]["refactors::aider::heuristic"]["pass_rate"]
Artifacts attached to this release
- `results-v1.4.0.tar.gz` (7.8 MB): bundle of all 3 sweep result directories (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts)
- `v1.4.0-article.html` (~6000 words, dark-mode publishable report)
- `findings.md`: technical diagnostic write-up with per-cell bootstrap CIs
- `deep-analysis.md`: deep-dive analysis of the 708 rows (per-task heatmaps, tail-latency analysis, routing-decision quality, etc.)
Migration from v1.3.0
If you have v1.3 variant configs (configs/variants/{28,29,30}-v1.3-*.yaml), they're deleted in v1.4. Use the v1.4 equivalents under configs/. The new schema uses agents: instead of routes:, task_classes: instead of categories:, and BenchmarkConfig.tasks_per_class instead of tasks_per_category. Variant configs are simpler: no _template.yaml wrapper, just standalone YAMLs at the top of configs/.
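If you would rather port an old config than rewrite it, the key renames above can be applied mechanically to a parsed YAML document. The sketch below is a hypothetical helper, not a tool shipped with the repo; it renames keys only and does not translate values such as legacy `Rn` route names to the new agent names.

```python
RENAMES = {
    "routes": "agents",
    "categories": "task_classes",
    "tasks_per_category": "tasks_per_class",
}


def migrate_config(node):
    """Recursively rename v1.3 keys to their v1.4 names in a parsed config."""
    if isinstance(node, dict):
        return {RENAMES.get(k, k): migrate_config(v) for k, v in node.items()}
    if isinstance(node, list):
        return [migrate_config(item) for item in node]
    return node
```

The recursion is there because these notes do not say at which nesting level each key lives; a flat rename would be enough if they are all top-level.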
v1.3.0 — multi-model + threshold sweep
The first Pareto-equivalent hybrid configuration in this benchmark, with statistical significance.
Headline finding
For real_dev D1+D5 (practical refactoring tasks), gemma4:31b + heuristic routing reaches:
- Pass-rate: 96% [88, 100] (95% bootstrap CI)
- vs always-cloud's 100% [100, 100] — CIs effectively overlap
- at 79% cloud_fraction — ≈21% reduction in cloud token spend
| Cell (gemma4:31b on real_dev D1+D5) | Pass-rate | 95% CI |
|---|---|---|
| always-cloud (gpt-5.5) | 1.00 | [1.00, 1.00] |
| always-local (gemma4:31b) | 0.88 | [0.71, 1.00] |
| heuristic | 0.96 | [0.88, 1.00] ← Pareto win |
| cascade | 0.88 | [0.71, 1.00] |
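For readers unfamiliar with how intervals like these arise, here is a minimal percentile-bootstrap sketch for a pass-rate over 0/1 outcomes. It is a generic illustration, not the project's analysis code (the attached findings mention stratified bootstrap CIs, which this sketch does not implement); the function name, resample count, and seed are assumptions.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the pass-rate of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


outcomes = [1] * 23 + [0]  # 23 passes out of 24 rows
low, high = bootstrap_ci(outcomes)
print(f"pass-rate {sum(outcomes) / len(outcomes):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

A cell with no failures always resamples to 100%, which is why the always-cloud row reports a degenerate [1.00, 1.00] interval.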
What's in this release
Three publishable canonical sweeps (507 rows total, 6h13m wall, $32.88 cloud spend):
| Sweep | Variant | Rows | Wall |
|---|---|---|---|
| 28 | qwen3-coder:30b expanded (13 tasks) | 156 | 75m |
| 29 | gemma4:31b expanded (13 tasks) | 156 | 222m |
| 30 | cascade × 5 thresholds × 3 seeds | 195 | 76m |
Task matrix: 5 Exercism Python + 4 real_dev D1 + 4 real_dev D5 = 13 tasks × R7 (aider) × {always-cloud, always-local, heuristic, cascade} × 3 seeds.
Three findings
- Local model selection > router strategy tuning. Switching qwen3-coder:30b → gemma4:31b raised always-local pass-rate by +39 percentage points (23% → 62%); raised heuristic by +31pp (36% → 67%). Threshold tuning on cascade only moves the needle by ≈7pp across a 5x parameter span.
- Task type matters as much as the model. Both 30B-class models choke on Exercism Python puzzles (always-local ≤25%); both excel on real_dev refactoring patterns when given gemma4. The "viability of local for coding" question has different answers by task class.
- Cascade threshold has a flat curve. Sweep across thresholds 5/10/15/20/25 produced pass-rates 21–28% with no monotonic trend. Cloud_fraction does change as designed (0.80 → 0.55), but pass-rate doesn't track. Cascade is a poor fit for agentic loops; threshold isn't the lever.
New in v1.3.0
- `benchmark.task_ids: list[str] | None`: explicit task-ID whitelist; scopes sweeps to a known-good subset
- `./bench sweep --cascade-thresholds 5,10,15,20,25`: sweep `ROUTER_CASCADE_THRESHOLD`; spawns fresh router per threshold
- R7 multi-file fixture support: enables real_dev D1+D5 tasks (multi-file edits) under aider
Reproducibility
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.3.0
./bench setup
ollama pull qwen3-coder:30b gemma4:31b
./bench sweep --config configs/variants/28-v1.3-aider-r7-expanded.yaml \
--strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench sweep --config configs/variants/29-v1.3-aider-r7-gemma4.yaml \
--strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench sweep --config configs/variants/30-v1.3-aider-r7-cascade-threshold.yaml \
--strategies cascade --cascade-thresholds 5,10,15,20,25 --seeds 42,7,13
Artifacts attached
- `results-v1.3.0.tar.gz` (4.2 MB): bundle of all 3 sweep result directories (`raw.jsonl` + `aggregate.json` + `bootstrap_cis.json` + `decision_matrix.md` + `charts/`)
- `v1.3.0-report.html` (54 KB): full publishable HTML report (~5,400 words, v3.3 style)
- `findings.md`: diagnostic write-up with per-cell stratified bootstrap CIs
- `orchestrator.log`: full sweep run log
Path forward (v1.3.x / v1.4)
- More local models: deepseek-coder-v3, qwen3.6:35b, codestral-medium-2
- Expand Exercism fixtures beyond 5 (the n=15 baseline is itself unstable)
- Task-aware local-model routing (different local for puzzle vs refactor classes)
- Persistent failure-mode analysis on Exercism A — why does every strategy under-route?
v1.2.0 — single-agent R7 aider canonical: hybrid on the Pareto frontier
The v1.2 release. Single-agent simplification: R7 (aider) is the canonical agentic route. Empirically validated against opencode (R8) — aider's architect/editor protocol works end-to-end with qwen3-coder:30b; opencode's free-form tool-use does not.
Headline (60 rows = 5 Exercism Python × 4 strategies × 3 seeds)
| Strategy | Pass | $ total | $/pass | Cloud-frac (tokens) |
|---|---|---|---|---|
| always-cloud (gpt-5.5) | 9/15 (60%) | $0.91 | $0.10 | 1.00 |
| always-local (qwen3-coder:30b) | 0/15 (0%) | $0.00 | n/a | 0.00 |
| heuristic (agent-aware) | 6/15 (40%) | $0.74 | $0.12 | 0.48 |
| cascade | 3/15 (20%) | $0.65 | $0.22 | 0.35 |
On grep and pig-latin, heuristic matches or beats always-cloud (3/3 vs 2/3 on grep; 3/3 vs 3/3 on pig-latin) while routing ~50% of token volume local. Aggregate pass-rate CIs overlap.
Why R7 (aider), not R8 (opencode)
Both runners exist in-tree. v1.1.3's R8 canonical (60 rows, same matrix) showed 0/15 hybrid pass — qwen3-coder:30b can drive opencode's tool-use loop syntactically (we fixed three Ollama tool-message format issues in v1.1.x) but it writes prose on tool-interpretation turns instead of follow-up tool_calls. Aider's structured architect/editor protocol bypasses that failure mode — architect plans in cloud, editor applies edits locally. R8 + R6 stay EXPERIMENTAL in v1.2.
Reproduce in 5 minutes
git clone https://github.com/RunanywhereAI/hybrid-coding-eval && cd hybrid-coding-eval
git checkout v1.2.0
python3.12 -m venv .venv && .venv/bin/pip install -e .
ollama pull qwen3-coder:30b
cp .env.example .env # add OPEN_AI_API_KEY
./bench setup
(cd router && LOCAL_MODEL=qwen3-coder:30b ./start.sh) &
./bench sweep --config configs/variants/26-v1.2-aider-r7-canonical.yaml \
--strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench analyze results/runs/26-v1.2-aider-r7-canonical/
~30-50 min wall on M4 Max, ~$1-2 API spend.
Benchmark a new local model
Edit models.local: in the yaml and re-sweep. Compare your bootstrap_cis.json against this release's results-v1.2.0-canonical.tar.gz baseline. See docs/BENCHMARK_NEW_MODEL.md for the full recipe.
Attached
- `results-v1.2.0-canonical.tar.gz`: 60-row canonical dataset
- `findings.md`: diagnostic write-up + reproducibility recipe
v1.1.3 — qwen3-coder ↔ opencode tool-format fix; hybrid loop now operational
The qwen3-coder + opencode tool-message format issue from v1.1.2 is fixed. Hybrid strategies now run the agent loop end-to-end without 400 errors.
What was the issue
Ollama's qwen3-coder renderer rejected OpenAI-standard tool messages:
- `tool_calls[].function.arguments` as JSON-encoded STRING (OpenAI spec) → Ollama wants OBJECT
- Multi-part `tool` content as ARRAY (OpenAI 1.x) → Ollama wants STRING
Bisected via curl probes. Confirmed against Ollama issue #11621 + goose issue #6883.
The fix
translateForLocal() in router/server.mjs (scoped to local backend only):
- Parse `tool_calls[].function.arguments` string → object
- Flatten array content → string
Inverse of the v1.1.1 outbound normalizer.
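The two translation steps can be sketched in Python as below. The actual `translateForLocal()` lives in the JavaScript router, so this is an illustration of the transformation rather than its implementation; joining text parts with an empty string and flattening array content on every message (not only `tool` messages) are assumptions.

```python
import json


def translate_for_local(messages):
    """Return a copy of OpenAI-style messages in the shape Ollama accepted."""
    out = []
    for msg in messages:
        msg = dict(msg)  # shallow copy so the caller's list is left untouched
        # Step 1: tool_calls[].function.arguments, JSON string -> object.
        if msg.get("tool_calls"):
            calls = []
            for call in msg["tool_calls"]:
                fn = dict(call.get("function", {}))
                if isinstance(fn.get("arguments"), str):
                    fn["arguments"] = json.loads(fn["arguments"])
                calls.append({**call, "function": fn})
            msg["tool_calls"] = calls
        # Step 2: array content -> one string (text parts joined).
        if isinstance(msg.get("content"), list):
            msg["content"] = "".join(
                part.get("text", "")
                for part in msg["content"]
                if isinstance(part, dict)
            )
        out.append(msg)
    return out
```

A production version would also need to decide what to do with an `arguments` string that is empty or not valid JSON; this sketch simply lets `json.loads` raise.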
Updated canonical headline (60 rows = 5 tasks × 4 strategies × 3 seeds)
| Strategy | Pass | Cloud tok | Local tok | Cloud_frac (calls) |
|---|---|---|---|---|
| always-cloud (gpt-5.5) | 15/15 ✓ | 16,094 | 0 | 1.00 |
| always-local (qwen3-coder:30b) | 0/15 | 0 | 2,916 | 0.00 |
| heuristic (agent-aware) | 0/15 | 2,064 | 1,439 | 0.50 |
| cascade | 0/15 | 447 | 2,774 | 0.10 |
Routing layer verified working. Hybrid strategies now make real cloud+local decisions. The remaining 0% hybrid pass is a model-quality gap — qwen3-coder writes prose instead of tool_calls on tool-interpretation turns. v1.2 unblockers: larger local model, or router-level system-prompt augmentation.
Attached
- `results-v1.1.3-canonical.tar.gz`: 60-row canonical sweep
- `findings.md`: diagnostic write-up
v3.3 — cross-model sweep (ARCHIVED, pre-v1.x)
⚠️ ARCHIVED: pre-v1.x naming. This is a historical research sweep (the v3.x line) that predates the current v1.x releases. It is not the latest version. For the current benchmark and datasets, see v1.5.1 (Latest). Kept for reproducibility of the 3,581-row cross-model sweep only.
Highlights
The biggest single benchmark sweep this repo has produced. 4.5 days of continuous compute on M4 Max 64 GB.
- 3,581 rows across 33 variant directories
- 6 local models × 5 routes × 7 routing strategies × 8 task shapes × 6 pricing scenarios
TL;DR
- Can hybrid routing save cost? Not via multi-step orchestration (R3/R4/R5 cost 1.9× to 5× more than R1 cloud-only). Yes via per-task gating: ~16-20% savings on a mixed workload.
- Best local model: Qwen3-Coder:30B at $0.229/correct. Beats devstral, qwen2.5-coder, gemma4, GLM, AND both newer Qwen 3.6 variants.
- Best routing strategy: Cascade with default threshold 15. Replicates as Pareto winner across all 6 models.
Major findings
- LLM-classifier is structurally broken on SWE-bench: 5 classifier sizes from 0.6B to 4B all score 0/10 (Phase 6 sub-sweep). Scaling does NOT help.
- Cascade threshold 15 is empirically optimal (Phase 7 sub-sweep tested 5/10/15/20/25; t=20 is a brittleness cliff).
- Newer Qwen 3.6 family regressed vs older Qwen3-Coder on this benchmark — counter-intuitive but reproducible across both 27B-mxfp8 and 35B-A3B-MoE variants.
- R5 DevMinion is catastrophically bad on prose — 5.13× R1 cost with composite 0.00 across 7 of 8 D3/D4 tasks.
- Multi-step hybrid loses on every count — cost, quality, latency. Skip it.
Attached artifacts
- `v3.3-report.html`: single-file standalone HTML report with all 5 charts base64-embedded. Open in any browser.
- `cross-model-leaderboard.png`: R3 heuristic scatter, lower-right is better
- `strategy-model-heatmap.png`: 7 models × 5 strategies $/correct heatmap
- `phase-6-classifier-sweep.png`: B-pass rate per classifier (the 0/10 collapse)
- `phase-7-cascade-threshold.png`: cascade threshold tuning
- `per-shape-r1-vs-alt.png`: R1 vs best alternative cost per task shape
Read the full article
- `reports/ARTICLE.md`: ~10,000-word comprehensive write-up
- `docs/HYBRID_ROUTER_DESIGN.md`: the deployable router architecture
- `docs/REPRODUCING.md`: copy-paste reproduction
Reproducibility
Full sweep takes ~4.5 days on M4 Max 64 GB and ~$240 in OpenAI spend. Same dataset re-prices under 6 cloud scenarios without re-running inference. See REPRODUCING.md.
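Re-pricing without re-running inference works because cost is a pure function of recorded token counts and a price table. A minimal sketch of that idea follows; the row field names (`cloud_input_tokens`, `cloud_output_tokens`) and the price-table layout are hypothetical, and no real pricing is implied.

```python
def reprice(rows, price_per_mtok):
    """Recompute cloud cost from recorded token counts under a price table.

    `price_per_mtok` maps "input"/"output" to USD per million tokens.
    """
    total = 0.0
    for row in rows:
        total += row["cloud_input_tokens"] / 1e6 * price_per_mtok["input"]
        total += row["cloud_output_tokens"] / 1e6 * price_per_mtok["output"]
    return total
```

Running the same rows through several price tables gives one cost figure per scenario, with the pass/fail outcomes unchanged.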
🤖 Sweep + analysis + article + this release all generated via Claude Code.
v1.1.2 — canonical sweep (60 rows, 3 seeds, bootstrap CIs)
The publishable canonical dataset for v1.1. 5 Exercism Python tasks × 4 strategies (always-cloud, always-local, heuristic, cascade) × 3 seeds = 60 rows. 95% bootstrap CIs at n=15 per cell.
Headline
| Cell | pass_rate | cloud_fraction |
|---|---|---|
| R8 / always-cloud (gpt-5.5) | 1.00 [1.00, 1.00] | 1.00 |
| R8 / always-local (qwen3-coder:30b) | 0.00 [0.00, 0.00] | 0.00 |
| R8 / heuristic (agent-aware) | 0.00 [0.00, 0.00] | 0.50 |
| R8 / cascade | 0.00 [0.00, 0.00] | 0.10 |
Verdict
The agent-aware heuristic strategy IS making rational decisions (first turn cloud for planning, post-tool-call local for tool-result interpretation, ~50% cloud-fraction over the loop). The 0% pass rate on hybrid is not a routing-logic bug — it's a model-compatibility issue between qwen3-coder + opencode tool-message format. v1.2's incoming-direction tool-message normalizer is the unblocker.
Attached
- `results-v1.1.2-canonical.tar.gz` — 60-row canonical sweep
- `findings.md` — diagnostic write-up
Reproducing
```bash
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.1.2
python3.12 -m venv .venv && .venv/bin/pip install -e .
./bench setup
(cd router && LOCAL_MODEL=qwen3-coder:30b ./start.sh) &
./bench sweep --config configs/variants/24-v1.1-opencode-canonical.yaml \
--strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench analyze results/runs/24-v1.1-opencode-canonical/
```