17 Jun 02:39

sanchitmonga22

d2b514d

v1.6.0 — Hybrid Coding Arena (rebrand) Latest

Hybrid Coding Arena is the new name for this benchmark (formerly hybrid-coding-eval), from RunAnywhere.

What changed in v1.6.0

This is a rebrand release. The benchmark, methodology, and dataset are unchanged.

Python package: hybrid_coding_eval is now hybrid_arena.
Distribution + repo: hybrid-arena (the old hybrid-coding-eval URL redirects here).
CLI command: bench is now arena (e.g. arena sweep, arena analyze).
Clearer headline chart: pass-rate and cloud usage are now separate, labeled elements.

git clone https://github.com/RunanywhereAI/hybrid-arena
cd hybrid-arena && python3.12 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,agents]"
arena setup
arena sweep --config configs/v1.4-smoke.yaml --strategies always-cloud --seeds 42
arena analyze results/runs/v1.4-smoke

Dataset

results-v1.6.0.tar.gz is byte-identical to the v1.5.0/v1.5.1 dataset (1,704 rows). No new benchmark runs in this release.
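One way to check the byte-identical claim locally is to compare SHA-256 digests of two downloaded tarballs. This is a minimal sketch: the helper names are illustrative and not part of the arena CLI, and the paths are wherever you saved the assets.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(a: Path, b: Path) -> bool:
    """True when both files have identical contents."""
    return sha256_of(a) == sha256_of(b)
```

For example, same_bytes(Path("results-v1.6.0.tar.gz"), Path("results-v1.5.1.tar.gz")) should return True if the two assets really are the same bytes.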

Headline

cline + qwen3.6 + cascade on real-developer refactors: 24/24 = 100% at 8% cloud, about $0.022/task.
Local-only solves 67% of the hard (D6) tasks at $0 cloud; cloud-only holds 100%.
1,704 rows, 3 local models, 3 coding agents, 8 routing strategies, 17 tasks, one M4 Max laptop, 95% bootstrap CIs.

27 May 21:47

sanchitmonga22

v1.5.1

3e76882

v1.5.1: open-source polish

Open-source polish release. Addresses every finding from the pre-publish audit pass — security, licensing, UX, hygiene. No code-behaviour changes; safe to take.

Full release notes: docs/release-notes/v1.5.1.md.

What changed

Licensing simplified

Deleted NOTICE.md, LICENSE-DATA, LICENSE.md.
Single LICENSE (MIT) now covers code, data, charts, and docs prose.
Citation request lives in the README's bibtex block.

Documentation rewritten

README.md — six-cell headline table, full prereq + quickstart with realistic time/cost estimates, "picking a config for real work" section distilled from the v1.5 leaderboard, full bench CLI table.
AGENTS.md — refreshed for v1.5.0: D6 task class documented, v1.5 configs added to the tree, conventions reflect that single-letter codes (A/B/D) are retired.
CODE_OF_CONDUCT.md — short and direct.
Source-tree docstrings + per-task READMEs — lib.* rewritten to core.*; "Category D / B / X" rewritten to refactors / real-prs / puzzles end-to-end.

UX cleanup

Deleted scripts/reproduce.sh: ./bench setup already does the prereq checks, and the smoke test is ./bench sweep --config configs/v1.4-smoke.yaml.
Deleted logs/v3.3/ — historical sweep logs moved out of git. logs/ is now gitignored.

Hygiene

__version__ bumped 0.1.0 → 1.5.1 (it was stuck at 0.1.0).
pytest moved from runtime deps to [dev] extras.
Removed unused pytest -m slow filter from CI + docs.
.github/ISSUE_TEMPLATE/new_model.md: broken configs/variants/_template.yaml reference fixed.
docs/HYBRID_ROUTING_DESIGN.md + v1.4.{0,1} release notes: jq snippets updated from legacy D::cline::heuristic to refactors::cline::heuristic.
Test aliases r10_cline, r6_mini_swe_agent renamed.

Privacy

Sanitized 263 absolute-path leaks (/Users/<owner>/...) in tracked raw.jsonl / progress.log. JSON re-validated on every row (520 rows, 0 parse errors).
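A sanitize-then-revalidate pass of this kind could look roughly like the sketch below. It is not the script used for the release: the regex, the placeholder, and the function names are assumptions, and it counts changed rows rather than individual leaks.

```python
import json
import re

# Assumption: leaks look like macOS home paths, e.g. /Users/alice/repo/...
HOME_PATH = re.compile(r"/Users/[^/\s\"']+")


def sanitize_line(line: str, placeholder: str = "/Users/<owner>") -> str:
    """Mask the home-directory prefix and confirm the row is still valid JSON."""
    cleaned = HOME_PATH.sub(placeholder, line)
    json.loads(cleaned)  # raises ValueError if the substitution broke the row
    return cleaned


def sanitize_jsonl(lines: list[str]) -> tuple[list[str], int]:
    """Return the cleaned rows and the number of rows that were changed."""
    cleaned, changed = [], 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        new = sanitize_line(line)
        changed += new != line
        cleaned.append(new)
    return cleaned, changed
```

Because every row goes through json.loads after the substitution, a parse error surfaces immediately instead of shipping a corrupted row.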

Verification

120 fast tests pass on Python 3.11 + 3.12 (CI matrix).
ruff check src/ tests/ clean.

Citation

If you use this benchmark, a citation would be really appreciated. BibTeX in the README.

📦 Dataset

results-v1.5.1.tar.gz is byte-identical to the v1.5.0 dataset — v1.5.1 added 0 new benchmark rows (it is an open-source-polish release). It is attached here so visitors landing on the Latest release can download the data directly. The canonical 1,704-row dataset is unchanged since v1.5.0.

gh release download v1.5.1 -p results-v1.5.1.tar.gz   # or v1.5.0 — same bytes

27 May 10:08

sanchitmonga22

v1.5.0

1cef619

v1.5.0 — Hard-Task Stress Test

v1.5.0 — Hard-Task Stress Test (2026-05-27)

Theme: Push past the D1/D5 ceiling. v1.4.x had top configs hitting
100% on the canonical 8-task refactor set, which made the leaderboard
uninformative at the top end. v1.5 adds a new harder task shape (D6)
designed to actually stress 30B local models, and stress-tests the
v1.4.1 champion configurations against it.

What's new

New D6 task class: hard refactors (4 tasks, 80 acceptance tests)

Each D6 task is a single-file Python implementation challenge with
comprehensive pytest coverage. The reference solutions are all ≤200
LOC, so the difficulty isn't volume — it's the breadth of corner
cases the model has to internalise from the prompt alone.

Task	LOC	Tests	What it stresses
`d6-lru-ttl-cache`	100	23	OrderedDict-based LRU + monkey-patchable TTL clock + careful eviction accounting
`d6-token-bucket`	60	14	Lazy refill correctness + multi-key isolation + non-positive-arg validation
`d6-toposort`	90	16	Kahn's with deterministic tie-break + DFS cycle detection with path reconstruction
`d6-mini-template`	200	27	Recursive-descent parser + AST evaluator + escape filters + nested if/for + comments

Total: ~450 LOC of reference solution and 80 pytest tests.
D6 uses the same overlay+pytest scoring path as D1/D5, so no new
scoring infrastructure is needed.

Stress-test sweeps

Two new sweep configs target the v1.4.1 audit's top configurations:

configs/v1.5-hard-gemma4.yaml: aider+gemma4 on always-cloud and heuristic (the v1.3 marquee profile). 4 tasks × 2 strategies × 3 seeds = 24 rows.
configs/v1.5-hard-qwen3.6.yaml: cline+qwen3.6 on always-cloud, always-local, and cascade (the v1.4.1 champion profile). 4 tasks × 3 strategies × 3 seeds = 36 rows.

Total 60 new rows of hard-task data. Cost cap: $50 cloud spend.
Wall-time cap: 6 hours.
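The row counts follow directly from the task × strategy × seed product. The sketch below only reproduces that arithmetic; it is not how the sweep runner builds its plan, and sweep_rows is an illustrative name. The seeds are the ones used in the reproduce commands for this release.

```python
from itertools import product

D6_TASKS = ["d6-lru-ttl-cache", "d6-token-bucket", "d6-toposort", "d6-mini-template"]
SEEDS = [42, 7, 13]


def sweep_rows(tasks, strategies, seeds):
    """One row per (task, strategy, seed) combination."""
    return list(product(tasks, strategies, seeds))


gemma4 = sweep_rows(D6_TASKS, ["always-cloud", "heuristic"], SEEDS)
qwen36 = sweep_rows(D6_TASKS, ["always-cloud", "always-local", "cascade"], SEEDS)
print(len(gemma4), len(qwen36), len(gemma4) + len(qwen36))  # 24 36 60
```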

Findings

Both sweeps complete. Full analysis in
personal/reports/publish-v1.5/article.html §9.5.

Agent	Model	Strategy	Pass	Cloud-frac	Notes
aider	gemma4:31b	always-cloud	12/12 (100%)	100%	ceiling
aider	gemma4:31b	heuristic	7/12 (58%)	61%	v1.3 marquee profile falls off on D6
cline	qwen3.6:35b	always-cloud	12/12 (100%)	100%	ceiling
cline	qwen3.6:35b	always-local	8/12 (67%)	0%	30B local-only ceiling — $0 cloud spend
cline	qwen3.6:35b	cascade	9/12 (75%)	13%	v1.4.1 champion holds 75% but loses on mini-template

Key findings:

30B local-only solves 67% of hard refactors with zero cloud. cline + qwen3.6:35b passes token-bucket and toposort 3/3 each and partially passes lru-ttl-cache and mini-template. Two of the four "failures" are cline-on-local session bugs, not model quality.
The v1.4.1 cascade champion drops from 100% to 75% on D6. Cascade only marginally beats always-local (75% vs 67%) because the router has no global view of task difficulty. The d6-mini-template recursive-descent parser is the hard wall.
Always-cloud (gpt-5.5) is 100% on both configs. The cloud advantage on D6 is real: 42 percentage points over heuristic gemma4 (100% vs 58%).
A pytest-parser bug in src/hybrid_coding_eval/agents/aider.py was caught by a new parametrized test (tests/agents/test_aider_parser.py) and fixed. The bug undercounted passes for aider rows where "X failed" appeared before "Y passed" in the summary line. Rescoring the v1.5 gemma4 data found 0 affected rows, so the fix is preventative.
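To illustrate the class of bug described in the last finding, here is one order-independent way to read a pytest summary line. It is a sketch, not the code in aider.py: the regex and function name are assumptions, and it only handles the passed / failed / error / skipped labels.

```python
import re

# Matches count/label pairs such as "3 failed" or "20 passed" anywhere in the line.
_COUNT = re.compile(r"(\d+) (passed|failed|errors?|skipped)")


def parse_pytest_summary(line: str) -> dict[str, int]:
    """Collect counts by label, regardless of the order they appear in."""
    counts = {"passed": 0, "failed": 0, "error": 0, "skipped": 0}
    for number, label in _COUNT.findall(line):
        key = "error" if label.startswith("error") else label
        counts[key] += int(number)
    return counts
```

Collecting every count/label pair avoids depending on "passed" being the first number in the line, which is the situation where the original parser undercounted.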

How to reproduce

git checkout v1.5.0
./scripts/reproduce.sh    # one-time env setup if not done yet
./bench sweep --config configs/v1.5-hard-gemma4.yaml \
    --strategies always-cloud,heuristic --seeds 42,7,13
./bench sweep --config configs/v1.5-hard-qwen3.6.yaml \
    --strategies always-cloud,always-local,cascade --seeds 42,7,13
./bench analyze results/runs/v1.5-hard-gemma4
./bench analyze results/runs/v1.5-hard-qwen3.6

Expected wall time on M4 Max 64 GB: ~30 min for gemma4 sweep,
~80 min for qwen3.6 sweep. Expected cloud spend at gpt-5.5
list pricing: ~$8 total across both sweeps.

Migration from v1.4.4

Zero migration cost. v1.5 is purely additive:

The D1–D5 task shapes are unchanged. The v1.4.1 canonical dataset remains the headline dataset.
The D6 shape uses the same overlay+pytest scoring path as D1/D5. No new scoring dependencies.
Existing configs continue to work. The new configs are isolated to configs/v1.5-hard-*.yaml.

26 May 02:28

sanchitmonga22

v1.4.1

374e797

v1.4.1 — 3-model agentic leaderboard (1,644 rows total)

v1.4.1 — 3-model agentic leaderboard

Adds qwen3-coder:30b + qwen3.6:35b canonical sweeps to v1.4.0's gemma4 line. Combined v1.4 + v1.4.1: 1,644 rows.

Headline (the marquee cells)

Cell	Pass-rate	Cloud-fraction
cline + qwen3.6 + cascade + refactors	24/24 = 100% [100, 100]	low (~5-10%)
cline + qwen3.6 + heuristic + refactors	22/24 = 92%	~7%
cline + qwen3-coder + heuristic + refactors	22/24 = 92%	~7%
cline + qwen3.6 + always-local + puzzles	15/15 = 100%	0%
aider + gemma4 + heuristic + refactors (v1.4.0)	23/24 = 96% [88, 100]	48%

What's new in v1.4.1

Router infrastructure fix (commit `c7392db`)

Three model-agnostic local-guard env vars in router/server.mjs:fetchLocalOllamaAsOpenAI():

ROUTER_LOCAL_NUM_PREDICT_CAP       default 4096   max gen tokens per local call
ROUTER_LOCAL_REQUEST_TIMEOUT_MS    default 180000 3-min per-request hard timeout
ROUTER_LOCAL_REPEAT_PENALTY        default 1.1    override weak model defaults

Discovered when qwen3-coder's weak repeat_penalty=1.05 and cline's missing max_tokens caused a runaway repetition loop (34 MB streamed over 2h35m from a single HTTP request) that crashed Ollama and pushed every subsequent task into a timeout. Full RCA in the release tarball.
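The guard logic can be pictured with the Python sketch below. The real router is JavaScript (router/server.mjs), so this is an illustration only: the function names and the way the guards are merged into the request are assumptions. The variable names and defaults come from the table above, and num_predict and repeat_penalty are standard Ollama option names.

```python
import os


def local_guards(env=os.environ) -> dict:
    """Read the three guard settings, falling back to the documented defaults."""
    return {
        "num_predict_cap": int(env.get("ROUTER_LOCAL_NUM_PREDICT_CAP", "4096")),
        "timeout_s": int(env.get("ROUTER_LOCAL_REQUEST_TIMEOUT_MS", "180000")) / 1000,
        "repeat_penalty": float(env.get("ROUTER_LOCAL_REPEAT_PENALTY", "1.1")),
    }


def apply_guards(request: dict, guards: dict) -> dict:
    """Return a copy of the request with capped generation length and repeat penalty."""
    requested = request.get("max_tokens")  # may be absent, as in the cline case
    cap = guards["num_predict_cap"]
    options = {
        "num_predict": cap if requested is None else min(requested, cap),
        "repeat_penalty": guards["repeat_penalty"],
    }
    return {**request, "options": options}
```

With no max_tokens on the incoming request, the cap becomes the generation limit, which is the case that produced the runaway loop. The timeout value would be passed to whatever HTTP client makes the local call.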

2 new canonical sweeps (936 rows)

configs/v1.4-canonical-qwen3-coder.yaml → 468 rows. qwen3-coder:30b (MoE coding specialist) across 3 agents × 4 strategies × 13 tasks × 3 seeds.
configs/v1.4-canonical-qwen3.6.yaml → 468 rows. qwen3.6:35b (dense generalist) across the same matrix.

Three new findings

qwen3.6:35b is the unsung champion. cline + qwen3.6 + cascade nails 100% on refactors. cline + qwen3.6 + always-local nails 100% on puzzles. cline + qwen3.6 + heuristic = 92% on refactors at ~7% cloud spend.
opencode is gemma4-specific. v1.4.0's opencode resurrection (71% on refactors heuristic with gemma4) doesn't transfer to qwen models (21-33%). opencode's runLoop requires clean tool_calls — gemma4 produces them, qwen variants don't reliably.
Aider is model-sensitive. 96% on gemma4 refactors heuristic → 50% qwen3.6 → 33% qwen3-coder. Aider's architect/editor protocol favors gemma4's dense-generalist training profile.

Reproducibility

git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.4.1
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && pip install -e ".[dev]"
cp .env.example .env  # set OPEN_AI_API_KEY
for m in gemma4:31b qwen3-coder:30b qwen3.6:35b; do ollama pull "$m"; done  # ollama pull takes one model per call
./bench setup
./bench start --config configs/v1.4-canonical-qwen3.6.yaml \
    --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench status     # progress
./bench pause      # if you need the laptop
./bench resume     # picks up where it left off
./bench analyze results/runs/v1.4-canonical-qwen3.6
# headline: jq '.cells["refactors::cline::cascade"].pass_rate' \
#     results/runs/v1.4-canonical-qwen3.6/bootstrap_cis.json
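The same lookup as the jq one-liner can be done from Python. This sketch assumes only the structure implied by that jq path (a top-level "cells" mapping keyed by "task_class::agent::strategy", each with a "pass_rate" entry); the helper name is illustrative.

```python
import json
from pathlib import Path


def headline_pass_rate(run_dir: str, cell: str):
    """Return cells[cell].pass_rate from a run's bootstrap_cis.json."""
    data = json.loads((Path(run_dir) / "bootstrap_cis.json").read_text())
    return data["cells"][cell]["pass_rate"]
```

For example: headline_pass_rate("results/runs/v1.4-canonical-qwen3.6", "refactors::cline::cascade").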

Artifacts attached

results-v1.4.1.tar.gz (15 MB) — qwen3-coder + qwen3.6 sweep dirs (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts)
article.html — the v1.4.1 master article with code-generated charts (~10-min read covering v1.0 → v1.4.1)
qwen3-coder-timeout-rca.md — full root-cause analysis of the router infrastructure fix

Migration from v1.4.0

No code changes. The router fix is backwards-compatible (env vars all default to safe values). Existing v1.4.0 sweeps re-run with the v1.4.1 router will be safer against runaway local-model loops.

22 May 16:50

sanchitmonga22

v1.4.0

945dd6a

v1.4.0 — agent-only release (708 rows, 3 sweeps, gemma4)

v1.4.0 — cleanup + production-pipeline release

The single canonical release for hybrid coding agents on local hardware.

After 7 prior releases (v1.0.0–v1.3.0), v1.4 narrows the project to agent-only scope (deletes the legacy R1–R5 non-agentic routes + the HumanEval+/BigCodeBench/custom-arch benchmarks), adds Claude Code + Cline runners, ships 5 production-grade lifecycle commands (./bench start|pause|resume|stop|status), and runs the canonical 3-sweep dataset against the gemma4:31b local model.

Headline numbers (708 rows, ~20h wall, $90.48 cloud spend on M4 Max 64GB)

Cell	Pass-rate	Cloud-fraction	Notes
aider + heuristic + refactors	23/24 = 96% [88, 100]	48%	The marquee Pareto win — replicates v1.3.0
aider + always-cloud + refactors	24/24 = 100% [100, 100]	100%	Cloud ceiling
cline + always-local + puzzles	15/15 = 100%	0%	First 30B local-only result that nails Exercism Python
opencode + heuristic + refactors	17/24 = 71%	46%	Resurrection: vs v1.1.x's 0/15
cascade (all (agent, task-class) cells)	≤ heuristic	varies	cascade is dead in agentic regime

What changed in v1.4

Cleanup (the big one — 82 files deleted)

Deleted non-agentic code: R1-R5 runners (cloud-only, local-only, hybrid-architect, Stanford Minion, Stanford DevMinion), HumanEval+ adapter, BigCodeBench-Hard adapter, custom-arch adapter, LLM-judge scorer, the entire configs/variants/ directory (32 legacy YAMLs), vendor/minions/, cli/{judge,rejudge,rescore,report}.py.
Schema rename: BenchmarkConfig.routes → agents, categories → task_classes. TaskPlan.route → agent. Drop Rn prefix throughout the codebase. The agent names are aider, opencode, mini-swe-agent, claude-code, cline.
Directory rename: runners/ → agents/, benchmarks/ → tasks/. Sub-renames: exercism_python/ → puzzles/, real_dev/ → refactors/, swebench_verified/ → real_prs/.

Added

Two new agents: claude-code (Anthropic Claude CLI; runs always-cloud-direct in v1.4 — Anthropic-compat router shim deferred to v1.5) and cline (Apache-2.0 Plan/Act agent; talks LiteLLM-compat → our router cleanly).
phase-aware routing strategy: deterministic Aider role-marker split (architect→cloud, editor→local). Falls through to legacy heuristic for non-Aider agents. 8 strategies total now (always-local, always-cloud, rules, heuristic, llm-classifier, embedding-knn, cascade, phase-aware).
5 production lifecycle commands in ./bench:
- bench start --config <yaml> --strategies ... --seeds ... — spawn detached, auto-starts Ollama, writes /tmp/hcev-sweep.json
- bench pause — kill orchestrator + agents + router; keep Ollama for fast resume
- bench resume — relaunch with --resume to skip rows already in raw.jsonl
- bench stop [--keep-ollama-app] [--clear-state] — like pause + kill Ollama (frees ~19 GB)
- bench status — show RUNNING / PAUSED / NO SWEEP with PID, log path, row count
bench sweep --resume flag + automatic router lifecycle from models.local in the config (eliminates the manual cd router && ./start.sh step from earlier reproducers).
/api/tags stub on the router so cline's Ollama-listing probe gets a clean 200 instead of a noisy 404.
5 production v1.4 configs under configs/: v1.4-smoke.yaml, v1.4-canonical-gemma4.yaml, v1.4-opencode-fairness.yaml, v1.4-strategy-sweep.yaml, v1.4-real-prs.yaml, plus 2 queued (v1.4-canonical-qwen3-coder.yaml, v1.4-canonical-qwen3.6.yaml) for v1.4.1.
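The resume behaviour (skip rows already present in raw.jsonl) can be sketched as follows. The field names that identify a row are assumptions made for illustration, as are the function names; the actual raw.jsonl schema may differ.

```python
import json

# Assumption: these four fields identify a row; the real schema may differ.
KEY_FIELDS = ("task", "agent", "strategy", "seed")


def completed_keys(raw_jsonl_lines):
    """Collect the identity of every row already written to raw.jsonl."""
    done = set()
    for line in raw_jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a partially written last line after a hard kill
        done.add(tuple(row.get(f) for f in KEY_FIELDS))
    return done


def remaining(plan, done):
    """Keep only planned rows that have not been recorded yet."""
    return [p for p in plan if tuple(p[f] for f in KEY_FIELDS) not in done]
```

Tolerating a truncated final line matters here because pause and stop kill the orchestrator mid-run.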

Fixed

Reproducibility audit blockers (10 items from v1.3 audit): README rewrite for v1.4, requirements.txt += pydantic + pyyaml, auto-spawn router in bench sweep, fresh CI workflow (no longer needs vendor/minions clone), docs rewrites for REPRODUCING.md + BENCHMARK_NEW_MODEL.md.
Phase 2 rename fallout: 4 agent runners had hardcoded benchmarks/real_dev/fixtures/ paths in _REAL_DEV_FIXTURES_ROOT constants — updated to tasks/refactors/fixtures/.

What's still pending (v1.4.1)

qwen3-coder:30b + qwen3.6:35b canonical sweeps: configs ready at configs/v1.4-canonical-qwen3-coder.yaml and configs/v1.4-canonical-qwen3.6.yaml; multi-model orchestrator script at personal/scripts/v1.4-multi-model-orchestrator.sh. Estimated +~12h compute, +~$30 cloud. Will land as v1.4.1 with an updated 3-model leaderboard.
SWE-bench Verified real-PR replay — config at configs/v1.4-real-prs.yaml but blocked on local Docker availability for the SWE-bench testbed. Deferred to v1.5.

Reproducibility

# Fresh clone + setup (~10 min)
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.4.0
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && pip install -e ".[dev]"
cp .env.example .env  # set OPEN_AI_API_KEY
ollama pull gemma4:31b   # ~19 GB
./bench setup           # installs aider + opencode + cline; verifies Ollama + Docker

# Run the canonical gemma4 sweep (~10h on M4 Max 64GB, ~$70 cloud)
./bench start --config configs/v1.4-canonical-gemma4.yaml \
    --strategies always-cloud,always-local,heuristic,cascade \
    --seeds 42,7,13

# Monitor / pause / resume
./bench status         # show progress
./bench pause          # if you need the laptop
./bench resume         # picks up where it left off

# After sweep completes, analyze
./bench analyze results/runs/v1.4-canonical-gemma4
# headline lives in: bootstrap_cis.json["cells"]["refactors::aider::heuristic"]["pass_rate"]

Artifacts attached to this release

results-v1.4.0.tar.gz (7.8 MB) — bundle of all 3 sweep result directories (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts)
v1.4.0-article.html (~6000 words, dark-mode publishable report)
findings.md — technical diagnostic write-up with per-cell bootstrap CIs
deep-analysis.md — deep-dive analysis of the 708 rows (per-task heatmaps, tail-latency analysis, routing-decision quality, etc.)

Migration from v1.3.0

If you have v1.3 variant configs (configs/variants/{28,29,30}-v1.3-*.yaml), they're deleted in v1.4. Use the v1.4 equivalents under configs/. The new schema uses agents: instead of routes:, task_classes: instead of categories:, and BenchmarkConfig.tasks_per_class instead of tasks_per_category. Variant configs are simpler: no _template.yaml wrapper, just standalone YAMLs at the top of configs/.
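For anyone porting a custom v1.3 config by hand, the key renames can be applied mechanically. The sketch below operates on an already-parsed config (for example, the dict a YAML loader returns) and only renames keys; it does not translate values such as the old Rn route names or the old category names, and the nesting in the usage example is hypothetical.

```python
RENAMES = {
    "routes": "agents",
    "categories": "task_classes",
    "tasks_per_category": "tasks_per_class",
}


def migrate_keys(node):
    """Recursively rename v1.3 keys to their v1.4 names; values are left untouched."""
    if isinstance(node, dict):
        return {RENAMES.get(k, k): migrate_keys(v) for k, v in node.items()}
    if isinstance(node, list):
        return [migrate_keys(v) for v in node]
    return node
```

For example, migrate_keys({"benchmark": {"routes": ["R7"], "tasks_per_category": 5}}) yields {"benchmark": {"agents": ["R7"], "tasks_per_class": 5}}; the "R7" value still has to be changed to "aider" by hand.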

20 May 18:15

sanchitmonga22

v1.3.0

d966ba7

v1.3.0 — multi-model + threshold sweep

The first Pareto-equivalent hybrid configuration in this benchmark, with statistical significance.

Headline finding

For real_dev D1+D5 (practical refactoring tasks), gemma4:31b + heuristic routing reaches:

Pass-rate: 96% [88, 100] (95% bootstrap CI)
vs always-cloud's 100% [100, 100] — CIs effectively overlap
at 79% cloud_fraction — ≈21% reduction in cloud token spend

Cell (gemma4:31b on real_dev D1+D5)	Pass-rate	95% CI
always-cloud (gpt-5.5)	1.00	[1.00, 1.00]
always-local (gemma4:31b)	0.88	[0.71, 1.00]
heuristic	0.96	[0.88, 1.00] ← Pareto win
cascade	0.88	[0.71, 1.00]
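For readers unfamiliar with bootstrap intervals, the sketch below shows a plain percentile bootstrap over per-row pass/fail outcomes. It is an illustration under stated assumptions, not the benchmark's analysis code: the release describes its CIs as stratified, while this version resamples all rows of a cell together, and the function name and defaults are made up for the example.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes.

    Returns (observed_pass_rate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample n rows with replacement and record each resample's pass rate.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return sum(outcomes) / n, lo, hi
```

Calling bootstrap_ci([1] * 23 + [0]) models a 23/24 cell. A cell with no failures, such as always-cloud, always returns a degenerate interval of [1.0, 1.0], which is why the table shows [1.00, 1.00] there.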

What's in this release

Three publishable canonical sweeps (507 rows total, 6h13m wall, $32.88 cloud spend):

Sweep	Variant	Rows	Wall
28	qwen3-coder:30b expanded (13 tasks)	156	75m
29	gemma4:31b expanded (13 tasks)	156	222m
30	cascade × 5 thresholds × 3 seeds	195	76m

Task matrix: 5 Exercism Python + 4 real_dev D1 + 4 real_dev D5 = 13 tasks × R7 (aider) × {always-cloud, always-local, heuristic, cascade} × 3 seeds.

Three findings

Local model selection > router strategy tuning. Switching qwen3-coder:30b → gemma4:31b raised always-local pass-rate by +39 percentage points (23% → 62%); raised heuristic by +31pp (36% → 67%). Threshold tuning on cascade only moves the needle by ≈7pp across a 5x parameter span.
Task type matters as much as the model. Both 30B-class models choke on Exercism Python puzzles (always-local ≤25%); both excel on real_dev refactoring patterns when given gemma4. The "viability of local for coding" question has different answers by task class.
Cascade threshold has a flat curve. Sweep across thresholds 5/10/15/20/25 produced pass-rates 21–28% with no monotonic trend. Cloud_fraction does change as designed (0.80 → 0.55), but pass-rate doesn't track. Cascade is a poor fit for agentic loops; threshold isn't the lever.

New in v1.3.0

benchmark.task_ids: list[str] | None — explicit task-ID whitelist; scopes sweeps to a known-good subset
./bench sweep --cascade-thresholds 5,10,15,20,25 — sweep ROUTER_CASCADE_THRESHOLD; spawns fresh router per threshold
R7 multi-file fixture support — enables real_dev D1+D5 tasks (multi-file edits) under aider

Reproducibility

git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.3.0
./bench setup
for m in qwen3-coder:30b gemma4:31b; do ollama pull "$m"; done  # ollama pull takes one model per call

./bench sweep --config configs/variants/28-v1.3-aider-r7-expanded.yaml \
    --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13

./bench sweep --config configs/variants/29-v1.3-aider-r7-gemma4.yaml \
    --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13

./bench sweep --config configs/variants/30-v1.3-aider-r7-cascade-threshold.yaml \
    --strategies cascade --cascade-thresholds 5,10,15,20,25 --seeds 42,7,13

Artifacts attached

results-v1.3.0.tar.gz (4.2 MB) — bundle of all 3 sweep result directories (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts/)
v1.3.0-report.html (54 KB) — full publishable HTML report (~5,400 words, v3.3 style)
findings.md — diagnostic write-up with per-cell stratified bootstrap CIs
orchestrator.log — full sweep run log

Path forward (v1.3.x / v1.4)

More local models: deepseek-coder-v3, qwen3.6:35b, codestral-medium-2
Expand Exercism fixtures beyond 5 (the n=15 baseline is itself unstable)
Task-aware local-model routing (different local for puzzle vs refactor classes)
Persistent failure-mode analysis on Exercism A — why does every strategy under-route?

20 May 03:46

sanchitmonga22

v1.2.0

f27dcfc

v1.2.0 — single-agent R7 aider canonical: hybrid on the Pareto frontier

The v1.2 release. Single-agent simplification: R7 (aider) is the canonical agentic route. Empirically validated against opencode (R8) — aider's architect/editor protocol works end-to-end with qwen3-coder:30b; opencode's free-form tool-use does not.

Headline (60 rows = 5 Exercism Python × 4 strategies × 3 seeds)

Strategy	Pass	$ total	$/pass	Cloud-frac (tokens)
always-cloud (gpt-5.5)	9/15 (60%)	$0.91	$0.10	1.00
always-local (qwen3-coder:30b)	0/15 (0%)	$0.00	n/a	0.00
heuristic (agent-aware)	6/15 (40%)	$0.74	$0.12	0.48
cascade	3/15 (20%)	$0.65	$0.22	0.35
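The $/pass column is total spend divided by the number of passing rows. A small sketch that recomputes it from the table's own numbers (the helper name is illustrative):

```python
def cost_per_pass(total_usd: float, passes: int):
    """Total spend divided by passing rows; None when nothing passed."""
    return None if passes == 0 else round(total_usd / passes, 2)


rows = {
    "always-cloud": (0.91, 9),
    "always-local": (0.00, 0),
    "heuristic": (0.74, 6),
    "cascade": (0.65, 3),
}
for name, (usd, passes) in rows.items():
    print(name, cost_per_pass(usd, passes))
```

always-local has no passes, so its $/pass is undefined, which is the n/a in the table.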

On grep and pig-latin, heuristic matches or beats always-cloud (3/3 vs 2/3 on grep; 3/3 vs 3/3 on pig-latin) while routing ~50% of token volume local. Aggregate pass-rate CIs overlap.

Why R7 (aider), not R8 (opencode)

Both runners exist in-tree. v1.1.3's R8 canonical (60 rows, same matrix) showed 0/15 hybrid pass — qwen3-coder:30b can drive opencode's tool-use loop syntactically (we fixed three Ollama tool-message format issues in v1.1.x) but it writes prose on tool-interpretation turns instead of follow-up tool_calls. Aider's structured architect/editor protocol bypasses that failure mode — architect plans in cloud, editor applies edits locally. R8 + R6 stay EXPERIMENTAL in v1.2.

Reproduce in 5 minutes

git clone https://github.com/RunanywhereAI/hybrid-coding-eval && cd hybrid-coding-eval
git checkout v1.2.0
python3.12 -m venv .venv && .venv/bin/pip install -e .
ollama pull qwen3-coder:30b
cp .env.example .env  # add OPEN_AI_API_KEY

./bench setup
(cd router && LOCAL_MODEL=qwen3-coder:30b ./start.sh) &

./bench sweep --config configs/variants/26-v1.2-aider-r7-canonical.yaml \
  --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench analyze results/runs/26-v1.2-aider-r7-canonical/

~30-50 min wall on M4 Max, ~$1-2 API spend.

Benchmark a new local model

Edit models.local: in the yaml and re-sweep. Compare your bootstrap_cis.json against this release's results-v1.2.0-canonical.tar.gz baseline. See docs/BENCHMARK_NEW_MODEL.md for the full recipe.

Attached

results-v1.2.0-canonical.tar.gz — 60-row canonical dataset
findings.md — diagnostic write-up + reproducibility recipe

20 May 01:17

sanchitmonga22

v1.1.3

dd8d9b1

v1.1.3 — qwen3-coder ↔ opencode tool-format fix; hybrid loop now operational

The qwen3-coder + opencode tool-message format issue from v1.1.2 is fixed. Hybrid strategies now run the agent loop end-to-end without 400 errors.

What was the issue

Ollama's qwen3-coder renderer rejected OpenAI-standard tool messages:

tool_calls[].function.arguments as JSON-encoded STRING (OpenAI spec) → Ollama wants OBJECT
Multi-part tool content as ARRAY (OpenAI 1.x) → Ollama wants STRING

Bisected via curl probes. Confirmed against Ollama issue #11621 + goose issue #6883.

The fix

translateForLocal() in router/server.mjs (scoped to local backend only):

Parse tool_calls[].function.arguments string → object
Flatten array content → string

Inverse of the v1.1.1 outbound normalizer.
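The two translations can be sketched in Python as below. The actual translateForLocal() lives in router/server.mjs and is JavaScript, so this is only an illustration of the two steps listed above; in particular, how multi-part content is joined into one string is an assumption.

```python
import json


def translate_for_local(message: dict) -> dict:
    """Return a copy of an OpenAI-style chat message in the shape described above."""
    out = dict(message)

    # 1. tool_calls[].function.arguments: JSON-encoded string -> object
    if out.get("tool_calls"):
        calls = []
        for call in out["tool_calls"]:
            fn = dict(call.get("function", {}))
            if isinstance(fn.get("arguments"), str):
                fn["arguments"] = json.loads(fn["arguments"] or "{}")
            calls.append({**call, "function": fn})
        out["tool_calls"] = calls

    # 2. content: array of parts -> single string (joining strategy is assumed)
    if isinstance(out.get("content"), list):
        out["content"] = "".join(
            part.get("text", "") if isinstance(part, dict) else str(part)
            for part in out["content"]
        )
    return out
```

The function copies rather than mutates, so the original message can still be sent unchanged to the cloud backend.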

Updated canonical headline (60 rows = 5 tasks × 4 strategies × 3 seeds)

Strategy	Pass	Cloud tok	Local tok	Cloud_frac (calls)
always-cloud (gpt-5.5)	15/15 ✓	16,094	0	1.00
always-local (qwen3-coder:30b)	0/15	0	2,916	0.00
heuristic (agent-aware)	0/15	2,064	1,439	0.50
cascade	0/15	447	2,774	0.10

Routing layer verified working. Hybrid strategies now make real cloud+local decisions. The remaining 0% hybrid pass is a model-quality gap — qwen3-coder writes prose instead of tool_calls on tool-interpretation turns. v1.2 unblockers: larger local model, or router-level system-prompt augmentation.

Attached

results-v1.1.3-canonical.tar.gz — 60-row canonical sweep
findings.md — diagnostic write-up

19 May 04:25

sanchitmonga22

v3.3

a584094

v3.3 — cross-model sweep (ARCHIVED, pre-v1.x) Pre-release

⚠️ ARCHIVED: pre-v1.x naming. This is a historical research sweep (the v3.x line) that predates the current v1.x releases. It is not the latest version. For the current benchmark and datasets, see the release marked Latest (v1.6.0). Kept for reproducibility of the 3,581-row cross-model sweep only.

Highlights

The biggest single benchmark sweep this repo has produced. 4.5 days of continuous compute on M4 Max 64 GB.

3,581 rows across 33 variant directories
6 local models × 5 routes × 7 routing strategies × 8 task shapes × 6 pricing scenarios

TL;DR

Can hybrid routing save cost? Not via multi-step orchestration (R3/R4/R5 cost 1.9× to 5× more than R1 cloud-only). Yes via per-task gating: ~16-20% savings on a mixed workload.
Best local model: Qwen3-Coder:30B at $0.229/correct. Beats devstral, qwen2.5-coder, gemma4, GLM, AND both newer Qwen 3.6 variants.
Best routing strategy: Cascade with default threshold 15. Replicates as Pareto winner across all 6 models.

Major findings

LLM-classifier is structurally broken on SWE-bench: 5 classifier sizes from 0.6B to 4B all score 0/10 (Phase 6 sub-sweep). Scaling does NOT help.
Cascade threshold 15 is empirically optimal (Phase 7 sub-sweep tested 5/10/15/20/25; t=20 is a brittleness cliff).
Newer Qwen 3.6 family regressed vs older Qwen3-Coder on this benchmark — counter-intuitive but reproducible across both 27B-mxfp8 and 35B-A3B-MoE variants.
R5 DevMinion is catastrophically bad on prose — 5.13× R1 cost with composite 0.00 across 7 of 8 D3/D4 tasks.
Multi-step hybrid loses on every count — cost, quality, latency. Skip it.

Attached artifacts

v3.3-report.html — single-file standalone HTML report with all 5 charts base64-embedded. Open in any browser.
cross-model-leaderboard.png — R3 heuristic scatter, lower-right is better
strategy-model-heatmap.png — 7 models × 5 strategies $/correct heatmap
phase-6-classifier-sweep.png — B-pass rate per classifier (the 0/10 collapse)
phase-7-cascade-threshold.png — cascade threshold tuning
per-shape-r1-vs-alt.png — R1 vs best alternative cost per task shape

Read the full article

reports/ARTICLE.md — ~10,000-word comprehensive write-up
docs/HYBRID_ROUTER_DESIGN.md — the deployable router architecture
docs/REPRODUCING.md — copy-paste reproduction

Reproducibility

Full sweep takes ~4.5 days on M4 Max 64 GB and ~$240 in OpenAI spend. Same dataset re-prices under 6 cloud scenarios without re-running inference. See REPRODUCING.md.

🤖 Sweep + analysis + article + this release all generated via Claude Code.

19 May 23:59

sanchitmonga22

v1.1.2

8cb0331

v1.1.2 — canonical sweep (60 rows, 3 seeds, bootstrap CIs)

The publishable canonical dataset for v1.1. 5 Exercism Python tasks × 4 strategies (always-cloud, always-local, heuristic, cascade) × 3 seeds = 60 rows. 95% bootstrap CIs at n=15 per cell.

Headline

Cell	pass_rate	cloud_fraction
R8 / always-cloud (gpt-5.5)	1.00 [1.00, 1.00]	1.00
R8 / always-local (qwen3-coder:30b)	0.00 [0.00, 0.00]	0.00
R8 / heuristic (agent-aware)	0.00 [0.00, 0.00]	0.50
R8 / cascade	0.00 [0.00, 0.00]	0.10

Verdict

The agent-aware heuristic strategy IS making rational decisions (first turn cloud for planning, post-tool-call local for tool-result interpretation, ~50% cloud-fraction over the loop). The 0% pass rate on hybrid is not a routing-logic bug — it's a model-compatibility issue between qwen3-coder + opencode tool-message format. v1.2's incoming-direction tool-message normalizer is the unblocker.

Attached

`results-v1.1.2-canonical.tar.gz` — 60-row canonical sweep
`findings.md` — diagnostic write-up

Reproducing

```bash
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.1.2
python3.12 -m venv .venv && .venv/bin/pip install -e .
./bench setup
(cd router && LOCAL_MODEL=qwen3-coder:30b ./start.sh) &
./bench sweep --config configs/variants/24-v1.1-opencode-canonical.yaml \
--strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench analyze results/runs/24-v1.1-opencode-canonical/
```

Releases: RunanywhereAI/hybrid-arena