Skip to content

Releases: RunanywhereAI/hybrid-arena

v1.6.0 — Hybrid Coding Arena (rebrand)

17 Jun 02:39
d2b514d

Choose a tag to compare

Hybrid Coding Arena is the new name for this benchmark (formerly hybrid-coding-eval), from RunAnywhere.

What changed in v1.6.0

This is a rebrand release. The benchmark, methodology, and dataset are unchanged.

  • Python package: hybrid_coding_eval is now hybrid_arena.
  • Distribution + repo: hybrid-arena (the old hybrid-coding-eval URL redirects here).
  • CLI command: bench is now arena (e.g. arena sweep, arena analyze).
  • Clearer headline chart: pass-rate and cloud usage are now separate, labeled elements.
git clone https://github.com/RunanywhereAI/hybrid-arena
cd hybrid-arena && python3.12 -m venv .venv
.venv/bin/pip install -e ".[dev,agents]"
arena setup
arena sweep --config configs/v1.4-smoke.yaml --strategies always-cloud --seeds 42
arena analyze results/runs/v1.4-smoke

Dataset

results-v1.6.0.tar.gz is byte-identical to the v1.5.0/v1.5.1 dataset (1,704 rows). No new benchmark runs in this release.

Headline

  • cline + qwen3.6 + cascade on real-developer refactors: 24/24 = 100% at 8% cloud, about $0.022/task.
  • Local-only solves 67% of the hard (D6) tasks at $0 cloud; cloud-only holds 100%.
  • 1,704 rows, 3 local models, 3 coding agents, 8 routing strategies, 17 tasks, one M4 Max laptop, 95% bootstrap CIs.

v1.5.1: open-source polish

27 May 21:47

Choose a tag to compare

Open-source polish release. Addresses every finding from the pre-publish audit pass — security, licensing, UX, hygiene. No code-behaviour changes; safe to take.

Full release notes: docs/release-notes/v1.5.1.md.

What changed

Licensing simplified

  • Deleted NOTICE.md, LICENSE-DATA, LICENSE.md.
  • Single LICENSE (MIT) now covers code, data, charts, and docs prose.
  • Citation request lives in the README's bibtex block.

Documentation rewritten

  • README.md — six-cell headline table, full prereq + quickstart with realistic time/cost estimates, "picking a config for real work" section distilled from the v1.5 leaderboard, full bench CLI table.
  • AGENTS.md — refreshed for v1.5.0: D6 task class documented, v1.5 configs added to the tree, conventions reflect that single-letter codes (A/B/D) are retired.
  • CODE_OF_CONDUCT.md — short and direct.
  • Source-tree docstrings + per-task READMEs — lib.* rewritten to core.*; "Category D / B / X" rewritten to refactors / real-prs / puzzles end-to-end.

UX cleanup

  • Deleted scripts/reproduce.sh./bench setup already does prereq checks, smoke is ./bench sweep --config configs/v1.4-smoke.yaml.
  • Deleted logs/v3.3/ — historical sweep logs moved out of git. logs/ is now gitignored.

Hygiene

  • __version__ bumped 0.1.0 → 1.5.1 (it was stuck at 0.1.0).
  • pytest moved from runtime deps to [dev] extras.
  • Removed unused pytest -m slow filter from CI + docs.
  • .github/ISSUE_TEMPLATE/new_model.md: broken configs/variants/_template.yaml reference fixed.
  • docs/HYBRID_ROUTING_DESIGN.md + v1.4.{0,1} release notes: jq snippets updated from legacy D::cline::heuristic to refactors::cline::heuristic.
  • Test aliases r10_cline, r6_mini_swe_agent renamed.

Privacy

  • Sanitized 263 absolute-path leaks (/Users/<owner>/...) in tracked raw.jsonl / progress.log. JSON re-validated on every row (520 rows, 0 parse errors).

Verification

  • 120 fast tests pass on Python 3.11 + 3.12 (CI matrix).
  • ruff check src/ tests/ clean.

Citation

If you use this benchmark, a citation would be really appreciated. BibTeX in the README.


📦 Dataset

results-v1.5.1.tar.gz is byte-identical to the v1.5.0 dataset — v1.5.1 added 0 new benchmark rows (it is an open-source-polish release). It is attached here so visitors landing on the Latest release can download the data directly. The canonical 1,704-row dataset is unchanged since v1.5.0.

gh release download v1.5.1 -p results-v1.5.1.tar.gz   # or v1.5.0 — same bytes

v1.5.0 — Hard-Task Stress Test

27 May 10:08

Choose a tag to compare

v1.5.0 — Hard-Task Stress Test (2026-05-27)

Theme: Push past the D1/D5 ceiling. v1.4.x had top configs hitting
100% on the canonical 8-task refactor set, which made the leaderboard
uninformative at the top end. v1.5 adds a new harder task shape (D6)
designed to actually stress 30B local models, and stress-tests the
v1.4.1 champion configurations against it.

What's new

New D6 task class: hard refactors (4 tasks, 80 acceptance tests)

Each D6 task is a single-file Python implementation challenge with
comprehensive pytest coverage. The reference solutions are all ≤200
LOC, so the difficulty isn't volume — it's the breadth of corner
cases the model has to internalise from the prompt alone.

Task LOC Tests What it stresses
d6-lru-ttl-cache 100 23 OrderedDict-based LRU + monkey-patchable TTL clock + careful eviction accounting
d6-token-bucket 60 14 Lazy refill correctness + multi-key isolation + non-positive-arg validation
d6-toposort 90 16 Kahn's with deterministic tie-break + DFS cycle detection with path reconstruction
d6-mini-template 200 27 Recursive-descent parser + AST evaluator + escape filters + nested if/for + comments

Total: ~450 LOC of reference solution, 80 pytest assertions.
The same overlay+pytest scoring path D1/D5 uses — no new scoring
infrastructure needed.

Stress-test sweeps

Two new sweep configs target the v1.4.1 audit's top configurations:

  • configs/v1.5-hard-gemma4.yaml — aider+gemma4 on always-cloud and
    heuristic (the v1.3 marquee profile). 4 tasks × 2 strategies × 3
    seeds = 24 rows.
  • configs/v1.5-hard-qwen3.6.yaml — cline+qwen3.6 on always-cloud,
    always-local, and cascade (the v1.4.1 champion profile). 4 tasks ×
    3 strategies × 3 seeds = 36 rows.

Total 60 new rows of hard-task data. Cost cap: $50 cloud spend.
Wall-time cap: 6 hours.

Findings

Both sweeps complete. Full analysis in
personal/reports/publish-v1.5/article.html §9.5.

Agent Model Strategy Pass Cloud-frac Notes
aider gemma4:31b always-cloud 12/12 (100%) 100% ceiling
aider gemma4:31b heuristic 7/12 (58%) 61% v1.3 marquee profile falls off on D6
cline qwen3.6:35b always-cloud 12/12 (100%) 100% ceiling
cline qwen3.6:35b always-local 8/12 (67%) 0% 30B local-only ceiling — $0 cloud spend
cline qwen3.6:35b cascade 9/12 (75%) 13% v1.4.1 champion holds 75% but loses on mini-template

Key findings:

  1. 30B local-only solves 67% of hard refactors with zero cloud. cline + qwen3.6:35b
    nails token-bucket and toposort 3/3 each, partial-passes lru-ttl-cache and mini-template.
    Two of the four "failures" are cline-on-local session bugs, not model quality.
  2. The v1.4.1 cascade champion drops from 100% to 75% on D6. Cascade only marginally
    beats always-local (75% vs 67%) because the router has no global view of task difficulty.
    The d6-mini-template recursive-descent parser is the hard wall.
  3. Always-cloud (gpt-5.5) is 100% on both configs. The cloud advantage on D6 is real and
    42 percentage points over heuristic gemma4.
  4. The pytest-parser bug fix in src/hybrid_coding_eval/agents/aider.py was caught by a
    new parametrized test (tests/agents/test_aider_parser.py). The bug undercounted
    passes for aider rows where X failed appeared before Y passed in the summary line.
    Rescoring v1.5 gemma4 data found 0 affected rows; the fix is preventative.

How to reproduce

git checkout v1.5.0
./scripts/reproduce.sh    # one-time env setup if not done yet
./bench sweep --config configs/v1.5-hard-gemma4.yaml \
    --strategies always-cloud,heuristic --seeds 42,7,13
./bench sweep --config configs/v1.5-hard-qwen3.6.yaml \
    --strategies always-cloud,always-local,cascade --seeds 42,7,13
./bench analyze results/runs/v1.5-hard-gemma4
./bench analyze results/runs/v1.5-hard-qwen3.6

Expected wall time on M4 Max 64 GB: ~30 min for gemma4 sweep,
~80 min for qwen3.6 sweep. Expected cloud spend at gpt-5.5
list pricing: ~$8 total across both sweeps.

Migration from v1.4.4

Zero migration cost. v1.5 is purely additive:

  • The D1–D5 task shapes are unchanged. The v1.4.1 canonical dataset
    remains the headline dataset.
  • The D6 shape uses the same overlay+pytest scoring path as D1/D5.
    No new scoring dependencies.
  • Existing configs continue to work. The new configs are isolated to
    configs/v1.5-hard-*.yaml.

v1.4.1 — 3-model agentic leaderboard (1,644 rows total)

26 May 02:28

Choose a tag to compare

v1.4.1 — 3-model agentic leaderboard

Adds qwen3-coder:30b + qwen3.6:35b canonical sweeps to v1.4.0's gemma4 line. Combined v1.4 + v1.4.1: 1,644 rows.

Headline (the marquee cells)

Cell Pass-rate Cloud-fraction
cline + qwen3.6 + cascade + refactors 24/24 = 100% [100, 100] low (~5-10%)
cline + qwen3.6 + heuristic + refactors 22/24 = 92% ~7%
cline + qwen3-coder + heuristic + refactors 22/24 = 92% ~7%
cline + qwen3.6 + always-local + puzzles 15/15 = 100% 0%
aider + gemma4 + heuristic + refactors (v1.4.0) 23/24 = 96% [88, 100] 48%

What's new in v1.4.1

Router infrastructure fix (commit c7392db)

Three model-agnostic local-guard env vars in router/server.mjs:fetchLocalOllamaAsOpenAI():

ROUTER_LOCAL_NUM_PREDICT_CAP       default 4096   max gen tokens per local call
ROUTER_LOCAL_REQUEST_TIMEOUT_MS    default 180000 3-min per-request hard timeout
ROUTER_LOCAL_REPEAT_PENALTY        default 1.1    override weak model defaults

Discovered when qwen3-coder's weak repeat_penalty=1.05 + cline's missing max_tokens caused a runaway repetition loop (34 MB streamed over 2h35m from a single HTTP request), crashed Ollama, cascaded every subsequent task to timeout. Full RCA in the release tarball.

2 new canonical sweeps (936 rows)

  • configs/v1.4-canonical-qwen3-coder.yaml → 468 rows. qwen3-coder:30b (MoE coding specialist) across 3 agents × 4 strategies × 13 tasks × 3 seeds.
  • configs/v1.4-canonical-qwen3.6.yaml → 468 rows. qwen3.6:35b (dense generalist) across the same matrix.

Three new findings

  1. qwen3.6:35b is the unsung champion. cline + qwen3.6 + cascade nails 100% on refactors. cline + qwen3.6 + always-local nails 100% on puzzles. cline + qwen3.6 + heuristic = 92% on refactors at ~7% cloud spend.

  2. opencode is gemma4-specific. v1.4.0's opencode resurrection (71% on refactors heuristic with gemma4) doesn't transfer to qwen models (21-33%). opencode's runLoop requires clean tool_calls — gemma4 produces them, qwen variants don't reliably.

  3. Aider is model-sensitive. 96% on gemma4 refactors heuristic → 50% qwen3.6 → 33% qwen3-coder. Aider's architect/editor protocol favors gemma4's dense-generalist training profile.

Reproducibility

git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.4.1
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && pip install -e ".[dev]"
cp .env.example .env  # set OPEN_AI_API_KEY
ollama pull gemma4:31b qwen3-coder:30b qwen3.6:35b
./bench setup
./bench start --config configs/v1.4-canonical-qwen3.6.yaml \
    --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench status     # progress
./bench pause      # if you need the laptop
./bench resume     # picks up where it left off
./bench analyze results/runs/v1.4-canonical-qwen3.6
# headline: jq '.cells["refactors::cline::cascade"].pass_rate' \
#     results/runs/v1.4-canonical-qwen3.6/bootstrap_cis.json

Artifacts attached

  • results-v1.4.1.tar.gz (15 MB) — qwen3-coder + qwen3.6 sweep dirs (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts)
  • article.html — the v1.4.1 master article with code-generated charts (~10-min read covering v1.0 → v1.4.1)
  • qwen3-coder-timeout-rca.md — full root-cause analysis of the router infrastructure fix

Migration from v1.4.0

No code changes. The router fix is backwards-compatible (env vars all default to safe values). Existing v1.4.0 sweeps re-run with the v1.4.1 router will be safer against runaway local-model loops.

v1.4.0 — agent-only release (708 rows, 3 sweeps, gemma4)

22 May 16:50
945dd6a

Choose a tag to compare

v1.4.0 — cleanup + production-pipeline release

The single canonical release for hybrid coding agents on local hardware.

After 7 prior releases (v1.0.0–v1.3.0), v1.4 narrows the project to agent-only scope (deletes the legacy R1–R5 non-agentic routes + the HumanEval+/BigCodeBench/custom-arch benchmarks), adds Claude Code + Cline runners, ships 5 production-grade lifecycle commands (./bench start|pause|resume|stop|status), and runs the canonical 3-sweep dataset against the gemma4:31b local model.

Headline numbers (708 rows, ~20h wall, $90.48 cloud spend on M4 Max 64GB)

Cell Pass-rate Cloud-fraction Notes
aider + heuristic + refactors 23/24 = 96% [88, 100] 48% The marquee Pareto win — replicates v1.3.0
aider + always-cloud + refactors 24/24 = 100% [100, 100] 100% Cloud ceiling
cline + always-local + puzzles 15/15 = 100% 0% First 30B local-only result that nails Exercism Python
opencode + heuristic + refactors 17/24 = 71% 46% Resurrection: vs v1.1.x's 0/15
cascade (all (agent, task-class) cells) ≤ heuristic varies cascade is dead in agentic regime

What changed in v1.4

Cleanup (the big one — 82 files deleted)

  • Deleted non-agentic code: R1-R5 runners (cloud-only, local-only, hybrid-architect, Stanford Minion, Stanford DevMinion), HumanEval+ adapter, BigCodeBench-Hard adapter, custom-arch adapter, LLM-judge scorer, the entire configs/variants/ directory (32 legacy YAMLs), vendor/minions/, cli/{judge,rejudge,rescore,report}.py.
  • Schema rename: BenchmarkConfig.routesagents, categoriestask_classes. TaskPlan.routeagent. Drop Rn prefix throughout the codebase. The agent names are aider, opencode, mini-swe-agent, claude-code, cline.
  • Directory rename: runners/agents/, benchmarks/tasks/. Sub-renames: exercism_python/puzzles/, real_dev/refactors/, swebench_verified/real_prs/.

Added

  • Two new agents: claude-code (Anthropic Claude CLI; runs always-cloud-direct in v1.4 — Anthropic-compat router shim deferred to v1.5) and cline (Apache-2.0 Plan/Act agent; talks LiteLLM-compat → our router cleanly).
  • phase-aware routing strategy: deterministic Aider role-marker split (architect→cloud, editor→local). Falls through to legacy heuristic for non-Aider agents. 8 strategies total now (always-local, always-cloud, rules, heuristic, llm-classifier, embedding-knn, cascade, phase-aware).
  • 5 production lifecycle commands in ./bench:
    • bench start --config <yaml> --strategies ... --seeds ... — spawn detached, auto-starts Ollama, writes /tmp/hcev-sweep.json
    • bench pause — kill orchestrator + agents + router; keep Ollama for fast resume
    • bench resume — relaunch with --resume to skip rows already in raw.jsonl
    • bench stop [--keep-ollama-app] [--clear-state] — like pause + kill Ollama (frees ~19 GB)
    • bench status — show RUNNING / PAUSED / NO SWEEP with PID, log path, row count
  • bench sweep --resume flag + automatic router lifecycle from models.local in the config (eliminates the manual cd router && ./start.sh step from earlier reproducers).
  • /api/tags stub on the router so cline's Ollama-listing probe gets a clean 200 instead of a noisy 404.
  • 5 production v1.4 configs under configs/: v1.4-smoke.yaml, v1.4-canonical-gemma4.yaml, v1.4-opencode-fairness.yaml, v1.4-strategy-sweep.yaml, v1.4-real-prs.yaml, plus 2 queued (v1.4-canonical-qwen3-coder.yaml, v1.4-canonical-qwen3.6.yaml) for v1.4.1.

Fixed

  • Reproducibility audit blockers (10 items from v1.3 audit): README rewrite for v1.4, requirements.txt += pydantic + pyyaml, auto-spawn router in bench sweep, fresh CI workflow (no longer needs vendor/minions clone), docs rewrites for REPRODUCING.md + BENCHMARK_NEW_MODEL.md.
  • Phase 2 rename fallout: 4 agent runners had hardcoded benchmarks/real_dev/fixtures/ paths in _REAL_DEV_FIXTURES_ROOT constants — updated to tasks/refactors/fixtures/.

What's still pending (v1.4.1)

  • qwen3-coder:30b + qwen3.6:35b canonical sweeps — configs ready at configs/v1.4-canonical-qwen3-coder.yaml and configs/v1.4-canonical-qwen3.6.yaml; multi-model orchestrator script at personal/scripts/v1.4-multi-model-orchestrator.sh. Estimated +12h compute, +$30 cloud. Will land as v1.4.1 with an updated 3-model leaderboard.
  • SWE-bench Verified real-PR replay — config at configs/v1.4-real-prs.yaml but blocked on local Docker availability for the SWE-bench testbed. Deferred to v1.5.

Reproducibility

# Fresh clone + setup (~10 min)
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.4.0
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && pip install -e ".[dev]"
cp .env.example .env  # set OPEN_AI_API_KEY
ollama pull gemma4:31b   # ~19 GB
./bench setup           # installs aider + opencode + cline; verifies Ollama + Docker

# Run the canonical gemma4 sweep (~10h on M4 Max 64GB, ~$70 cloud)
./bench start --config configs/v1.4-canonical-gemma4.yaml \
    --strategies always-cloud,always-local,heuristic,cascade \
    --seeds 42,7,13

# Monitor / pause / resume
./bench status         # show progress
./bench pause          # if you need the laptop
./bench resume         # picks up where it left off

# After sweep completes, analyze
./bench analyze results/runs/v1.4-canonical-gemma4
# headline lives in: bootstrap_cis.json["cells"]["refactors::aider::heuristic"]["pass_rate"]

Artifacts attached to this release

  • results-v1.4.0.tar.gz (7.8 MB) — bundle of all 3 sweep result directories (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts)
  • v1.4.0-article.html (~6000 words, dark-mode publishable report)
  • findings.md — technical diagnostic write-up with per-cell bootstrap CIs
  • deep-analysis.md — deep-dive analysis of the 708 rows (per-task heatmaps, tail-latency analysis, routing-decision quality, etc.)

Migration from v1.3.0

If you have v1.3 variant configs (configs/variants/2{8,9,30}-v1.3-*.yaml), they're deleted in v1.4. Use the v1.4 equivalents under configs/. The new schema uses agents: instead of routes: and task_classes: instead of categories:; BenchmarkConfig.tasks_per_class instead of tasks_per_category. Variant configs are simpler — no _template.yaml wrapper, just standalone YAMLs at the top of configs/.

v1.3.0 — multi-model + threshold sweep

20 May 18:15
d966ba7

Choose a tag to compare

The first Pareto-equivalent hybrid configuration in this benchmark, with statistical significance.

Headline finding

For real_dev D1+D5 (practical refactoring tasks), gemma4:31b + heuristic routing reaches:

  • Pass-rate: 96% [88, 100] (95% bootstrap CI)
  • vs always-cloud's 100% [100, 100] — CIs effectively overlap
  • at 79% cloud_fraction — ≈21% reduction in cloud token spend
Cell (gemma4:31b on real_dev D1+D5) Pass-rate 95% CI
always-cloud (gpt-5.5) 1.00 [1.00, 1.00]
always-local (gemma4:31b) 0.88 [0.71, 1.00]
heuristic 0.96 [0.88, 1.00] ← Pareto win
cascade 0.88 [0.71, 1.00]

What's in this release

Three publishable canonical sweeps (507 rows total, 6h13m wall, $32.88 cloud spend):

Sweep Variant Rows Wall
28 qwen3-coder:30b expanded (13 tasks) 156 75m
29 gemma4:31b expanded (13 tasks) 156 222m
30 cascade × 5 thresholds × 3 seeds 195 76m

Task matrix: 5 Exercism Python + 4 real_dev D1 + 4 real_dev D5 = 13 tasks × R7 (aider) × {always-cloud, always-local, heuristic, cascade} × 3 seeds.

Three findings

  1. Local model selection > router strategy tuning. Switching qwen3-coder:30b → gemma4:31b raised always-local pass-rate by +39 percentage points (23% → 62%); raised heuristic by +31pp (36% → 67%). Threshold tuning on cascade only moves the needle by ≈7pp across a 5x parameter span.
  2. Task type matters as much as the model. Both 30B-class models choke on Exercism Python puzzles (always-local ≤25%); both excel on real_dev refactoring patterns when given gemma4. The "viability of local for coding" question has different answers by task class.
  3. Cascade threshold has a flat curve. Sweep across thresholds 5/10/15/20/25 produced pass-rates 21–28% with no monotonic trend. Cloud_fraction does change as designed (0.80 → 0.55), but pass-rate doesn't track. Cascade is a poor fit for agentic loops; threshold isn't the lever.

New in v1.3.0

  • benchmark.task_ids: list[str] | None — explicit task-ID whitelist; scopes sweeps to a known-good subset
  • ./bench sweep --cascade-thresholds 5,10,15,20,25 — sweep ROUTER_CASCADE_THRESHOLD; spawns fresh router per threshold
  • R7 multi-file fixture support — enables real_dev D1+D5 tasks (multi-file edits) under aider

Reproducibility

git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.3.0
./bench setup
ollama pull qwen3-coder:30b gemma4:31b

./bench sweep --config configs/variants/28-v1.3-aider-r7-expanded.yaml \
    --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13

./bench sweep --config configs/variants/29-v1.3-aider-r7-gemma4.yaml \
    --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13

./bench sweep --config configs/variants/30-v1.3-aider-r7-cascade-threshold.yaml \
    --strategies cascade --cascade-thresholds 5,10,15,20,25 --seeds 42,7,13

Artifacts attached

  • results-v1.3.0.tar.gz (4.2 MB) — bundle of all 3 sweep result directories (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts/)
  • v1.3.0-report.html (54 KB) — full publishable HTML report (~5,400 words, v3.3 style)
  • findings.md — diagnostic write-up with per-cell stratified bootstrap CIs
  • orchestrator.log — full sweep run log

Path forward (v1.3.x / v1.4)

  • More local models: deepseek-coder-v3, qwen3.6:35b, codestral-medium-2
  • Expand Exercism fixtures beyond 5 (the n=15 baseline is itself unstable)
  • Task-aware local-model routing (different local for puzzle vs refactor classes)
  • Persistent failure-mode analysis on Exercism A — why does every strategy under-route?

v1.2.0 — single-agent R7 aider canonical: hybrid on the Pareto frontier

20 May 03:46

Choose a tag to compare

The v1.2 release. Single-agent simplification: R7 (aider) is the canonical agentic route. Empirically validated against opencode (R8) — aider's architect/editor protocol works end-to-end with qwen3-coder:30b; opencode's free-form tool-use does not.

Headline (60 rows = 5 Exercism Python × 4 strategies × 3 seeds)

Strategy Pass $ total $/pass Cloud-frac (tokens)
always-cloud (gpt-5.5) 9/15 (60%) $0.91 $0.10 1.00
always-local (qwen3-coder:30b) 0/15 (0%) $0.00 n/a 0.00
heuristic (agent-aware) 6/15 (40%) $0.74 $0.12 0.48
cascade 3/15 (20%) $0.65 $0.22 0.35

On grep and pig-latin, heuristic matches or beats always-cloud (3/3 vs 2/3 on grep; 3/3 vs 3/3 on pig-latin) while routing ~50% of token volume local. Aggregate pass-rate CIs overlap.

Why R7 (aider), not R8 (opencode)

Both runners exist in-tree. v1.1.3's R8 canonical (60 rows, same matrix) showed 0/15 hybrid pass — qwen3-coder:30b can drive opencode's tool-use loop syntactically (we fixed three Ollama tool-message format issues in v1.1.x) but it writes prose on tool-interpretation turns instead of follow-up tool_calls. Aider's structured architect/editor protocol bypasses that failure mode — architect plans in cloud, editor applies edits locally. R8 + R6 stay EXPERIMENTAL in v1.2.

Reproduce in 5 minutes

git clone https://github.com/RunanywhereAI/hybrid-coding-eval && cd hybrid-coding-eval
git checkout v1.2.0
python3.12 -m venv .venv && .venv/bin/pip install -e .
ollama pull qwen3-coder:30b
cp .env.example .env  # add OPEN_AI_API_KEY

./bench setup
(cd router && LOCAL_MODEL=qwen3-coder:30b ./start.sh) &

./bench sweep --config configs/variants/26-v1.2-aider-r7-canonical.yaml \
  --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench analyze results/runs/26-v1.2-aider-r7-canonical/

~30-50 min wall on M4 Max, ~$1-2 API spend.

Benchmark a new local model

Edit models.local: in the yaml and re-sweep. Compare your bootstrap_cis.json against this release's results-v1.2.0-canonical.tar.gz baseline. See docs/BENCHMARK_NEW_MODEL.md for the full recipe.

Attached

  • results-v1.2.0-canonical.tar.gz — 60-row canonical dataset
  • findings.md — diagnostic write-up + reproducibility recipe

v1.1.3 — qwen3-coder ↔ opencode tool-format fix; hybrid loop now operational

20 May 01:17

Choose a tag to compare

The qwen3-coder + opencode tool-message format issue from v1.1.2 is fixed. Hybrid strategies now run the agent loop end-to-end without 400 errors.

What was the issue

Ollama's qwen3-coder renderer rejected OpenAI-standard tool messages:

  • tool_calls[].function.arguments as JSON-encoded STRING (OpenAI spec) → Ollama wants OBJECT
  • Multi-part tool content as ARRAY (OpenAI 1.x) → Ollama wants STRING

Bisected via curl probes. Confirmed against Ollama issue #11621 + goose issue #6883.

The fix

translateForLocal() in router/server.mjs (scoped to local backend only):

  1. Parse tool_calls[].function.arguments string → object
  2. Flatten array content → string

Inverse of the v1.1.1 outbound normalizer.

Updated canonical headline (60 rows × 3 seeds × 4 strategies)

Strategy Pass Cloud tok Local tok Cloud_frac (calls)
always-cloud (gpt-5.5) 15/15 16,094 0 1.00
always-local (qwen3-coder:30b) 0/15 0 2,916 0.00
heuristic (agent-aware) 0/15 2,064 1,439 0.50
cascade 0/15 447 2,774 0.10

Routing layer verified working. Hybrid strategies now make real cloud+local decisions. The remaining 0% hybrid pass is a model-quality gap — qwen3-coder writes prose instead of tool_calls on tool-interpretation turns. v1.2 unblockers: larger local model, or router-level system-prompt augmentation.

Attached

  • results-v1.1.3-canonical.tar.gz — 60-row canonical sweep
  • findings.md — diagnostic write-up

v3.3 — cross-model sweep (ARCHIVED, pre-v1.x)

19 May 04:25

Choose a tag to compare

⚠️ ARCHIVED — pre-v1.x naming. This is a historical research sweep (the v3.x line) that predates the current v1.x releases. It is not the latest version. For the current benchmark and datasets, see v1.5.1 (Latest). Kept for reproducibility of the 3,581-row cross-model sweep only.


Highlights

The biggest single benchmark sweep this repo has produced. 4.5 days of continuous compute on M4 Max 64 GB.

  • 3,581 rows across 33 variant directories
  • 6 local models × 5 routes × 7 routing strategies × 8 task shapes × 6 pricing scenarios

TL;DR

  1. Can hybrid routing save cost? Not via multi-step orchestration (R3/R4/R5 cost 1.9× to 5× more than R1 cloud-only). Yes via per-task gating: ~16-20% savings on a mixed workload.
  2. Best local model: Qwen3-Coder:30B at $0.229/correct. Beats devstral, qwen2.5-coder, gemma4, GLM, AND both newer Qwen 3.6 variants.
  3. Best routing strategy: Cascade with default threshold 15. Replicates as Pareto winner across all 6 models.

Major findings

  • LLM-classifier is structurally broken on SWE-bench: 5 classifier sizes from 0.6B to 4B all score 0/10 (Phase 6 sub-sweep). Scaling does NOT help.
  • Cascade threshold 15 is empirically optimal (Phase 7 sub-sweep tested 5/10/15/20/25; t=20 is a brittleness cliff).
  • Newer Qwen 3.6 family regressed vs older Qwen3-Coder on this benchmark — counter-intuitive but reproducible across both 27B-mxfp8 and 35B-A3B-MoE variants.
  • R5 DevMinion is catastrophically bad on prose — 5.13× R1 cost with composite 0.00 across 7 of 8 D3/D4 tasks.
  • Multi-step hybrid loses on every count — cost, quality, latency. Skip it.

Attached artifacts

  • v3.3-report.html — single-file standalone HTML report with all 5 charts base64-embedded. Open in any browser.
  • cross-model-leaderboard.png — R3 heuristic scatter, lower-right is better
  • strategy-model-heatmap.png — 7 models × 5 strategies $/correct heatmap
  • phase-6-classifier-sweep.png — B-pass rate per classifier (the 0/10 collapse)
  • phase-7-cascade-threshold.png — cascade threshold tuning
  • per-shape-r1-vs-alt.png — R1 vs best alternative cost per task shape

Read the full article

Reproducibility

Full sweep takes ~4.5 days on M4 Max 64 GB and ~$240 in OpenAI spend. Same dataset re-prices under 6 cloud scenarios without re-running inference. See REPRODUCING.md.

🤖 Sweep + analysis + article + this release all generated via Claude Code.

v1.1.2 — canonical sweep (60 rows, 3 seeds, bootstrap CIs)

19 May 23:59

Choose a tag to compare

The publishable canonical dataset for v1.1. 5 Exercism Python tasks × 4 strategies (always-cloud, always-local, heuristic, cascade) × 3 seeds = 60 rows. 95% bootstrap CIs at n=15 per cell.

Headline

Cell pass_rate cloud_fraction
R8 / always-cloud (gpt-5.5) 1.00 [1.00, 1.00] 1.00
R8 / always-local (qwen3-coder:30b) 0.00 [0.00, 0.00] 0.00
R8 / heuristic (agent-aware) 0.00 [0.00, 0.00] 0.50
R8 / cascade 0.00 [0.00, 0.00] 0.10

Verdict

The agent-aware heuristic strategy IS making rational decisions (first turn cloud for planning, post-tool-call local for tool-result interpretation, ~50% cloud-fraction over the loop). The 0% pass rate on hybrid is not a routing-logic bug — it's a model-compatibility issue between qwen3-coder + opencode tool-message format. v1.2's incoming-direction tool-message normalizer is the unblocker.

Attached

  • `results-v1.1.2-canonical.tar.gz` — 60-row canonical sweep
  • `findings.md` — diagnostic write-up

Reproducing

```bash
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.1.2
python3.12 -m venv .venv && .venv/bin/pip install -e .
./bench setup
(cd router && LOCAL_MODEL=qwen3-coder:30b ./start.sh) &
./bench sweep --config configs/variants/24-v1.1-opencode-canonical.yaml \
--strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench analyze results/runs/24-v1.1-opencode-canonical/
```