Release v1.3.0 — multi-model + threshold sweep · RunanywhereAI/hybrid-arena

The first Pareto-equivalent hybrid configuration in this benchmark: its pass-rate confidence interval overlaps always-cloud's while it sends fewer requests to the cloud.

Headline finding

For real_dev D1+D5 (practical refactoring tasks), gemma4:31b + heuristic routing reaches:

Pass-rate: 96% [88, 100] (95% bootstrap CI)
vs always-cloud's 100% [100, 100] — CIs effectively overlap
at 79% cloud_fraction — ≈21% reduction in cloud token spend

Cell (gemma4:31b on real_dev D1+D5)	Pass-rate	95% CI
always-cloud (gpt-5.5)	1.00	[1.00, 1.00]
always-local (gemma4:31b)	0.88	[0.71, 1.00]
heuristic	0.96	[0.88, 1.00] ← Pareto win
cascade	0.88	[0.71, 1.00]
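
For readers unfamiliar with bootstrap intervals, the following is a minimal sketch of a plain percentile bootstrap on pass/fail outcomes. It is not the benchmark's own implementation (findings.md describes a per-cell stratified bootstrap, which this does not reproduce). The function name `bootstrap_ci` and the assumption of 24 runs per cell (8 real_dev tasks × 3 seeds, so 23 passes ≈ 0.96) are illustrative.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the pass-rate of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Assumption: 24 runs per cell; 23 passes approximates the 0.96 heuristic cell.
heuristic = [1] * 23 + [0]
always_cloud = [1] * 24

heuristic_ci = bootstrap_ci(heuristic)
cloud_ci = bootstrap_ci(always_cloud)
print("heuristic CI:", heuristic_ci)
print("always-cloud CI:", cloud_ci)
```

A cell with no failures can only ever resample to a 100% pass-rate, which is why the always-cloud interval collapses to [1.00, 1.00].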

What's in this release

Three publishable canonical sweeps (507 rows total, 6h13m wall, $32.88 cloud spend):

Sweep	Variant	Rows	Wall
28	qwen3-coder:30b expanded (13 tasks)	156	75m
29	gemma4:31b expanded (13 tasks)	156	222m
30	cascade × 5 thresholds × 3 seeds	195	76m

Task matrix: 5 Exercism Python + 4 real_dev D1 + 4 real_dev D5 = 13 tasks × R7 (aider) × {always-cloud, always-local, heuristic, cascade} × 3 seeds.
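
The row counts in the sweep table follow directly from this matrix. A short sketch that enumerates the cells, using placeholder task IDs (the real IDs are not listed here) and the seeds and thresholds from the reproduction commands:

```python
from itertools import product

# Placeholder IDs: 5 Exercism + 4 real_dev D1 + 4 real_dev D5 = 13 tasks.
tasks = [f"task-{i}" for i in range(13)]
strategies = ["always-cloud", "always-local", "heuristic", "cascade"]
seeds = [42, 7, 13]
thresholds = [5, 10, 15, 20, 25]

# Sweeps 28 and 29: every task under every strategy and seed.
rows_full = len(list(product(tasks, strategies, seeds)))
# Sweep 30: cascade only, one run per threshold and seed.
rows_cascade = len(list(product(tasks, thresholds, seeds)))
total_rows = 2 * rows_full + rows_cascade

print(rows_full, rows_cascade, total_rows)  # 156 195 507
```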

Three findings

Local model selection > router strategy tuning. Switching qwen3-coder:30b → gemma4:31b raised always-local pass-rate by +39 percentage points (23% → 62%); raised heuristic by +31pp (36% → 67%). Threshold tuning on cascade only moves the needle by ≈7pp across a 5x parameter span.
Task type matters as much as the model. Both 30B-class models choke on Exercism Python puzzles (always-local ≤25%), while gemma4:31b does well on real_dev refactoring patterns (always-local 88% on D1+D5). The "viability of local for coding" question has different answers by task class.
Cascade threshold has a flat curve. Sweep across thresholds 5/10/15/20/25 produced pass-rates 21–28% with no monotonic trend. Cloud_fraction does change as designed (0.80 → 0.55), but pass-rate doesn't track. Cascade is a poor fit for agentic loops; threshold isn't the lever.
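
The deltas above are absolute percentage-point differences, not relative changes. A small sketch using only the pass-rates quoted in the first finding makes the distinction concrete (the helper names are illustrative):

```python
def pp_delta(before_pct, after_pct):
    """Absolute change in percentage points."""
    return after_pct - before_pct


def relative_change(before_pct, after_pct):
    """Relative change, as a percentage of the starting value."""
    return (after_pct - before_pct) / before_pct * 100


# qwen3-coder:30b -> gemma4:31b, pass-rates quoted in the first finding.
always_local_pp = pp_delta(23, 62)
heuristic_pp = pp_delta(36, 67)
always_local_rel = relative_change(23, 62)

print(always_local_pp, heuristic_pp, round(always_local_rel, 1))
```

The always-local move from 23% to 62% is +39pp, and in relative terms the pass-rate more than doubles, whereas the cascade threshold sweep spans only about 7pp in total.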

New in v1.3.0

benchmark.task_ids: list[str] | None — explicit task-ID whitelist; scopes sweeps to a known-good subset
./bench sweep --cascade-thresholds 5,10,15,20,25 — sweep ROUTER_CASCADE_THRESHOLD; spawns fresh router per threshold
R7 multi-file fixture support — enables real_dev D1+D5 tasks (multi-file edits) under aider
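
As a rough illustration of the first two options, here is a sketch of how a task-ID whitelist and a comma-separated threshold list could be handled. The helper names `select_tasks` and `parse_thresholds` are hypothetical and not taken from the repository; the sketch also assumes that `None` means "run every task" and that thresholds are integers, as in the example command.

```python
from typing import Optional


def select_tasks(all_ids: list[str], task_ids: Optional[list[str]]) -> list[str]:
    """Return the tasks to run, keeping the original ordering of all_ids."""
    if task_ids is None:
        # Assumption: no whitelist means the full task set.
        return list(all_ids)
    unknown = set(task_ids) - set(all_ids)
    if unknown:
        raise ValueError(f"unknown task IDs: {sorted(unknown)}")
    allowed = set(task_ids)
    return [task for task in all_ids if task in allowed]


def parse_thresholds(arg: str) -> list[int]:
    """Parse a value such as '5,10,15,20,25' into a list of ints."""
    return [int(part) for part in arg.split(",") if part.strip()]


print(select_tasks(["a", "b", "c"], ["c", "a"]))  # ['a', 'c']
print(parse_thresholds("5,10,15,20,25"))  # [5, 10, 15, 20, 25]
```

Rejecting unknown IDs up front is one possible design choice; it keeps a typo in the whitelist from silently shrinking a sweep.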

Reproducibility

git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.3.0
./bench setup
ollama pull qwen3-coder:30b
ollama pull gemma4:31b

./bench sweep --config configs/variants/28-v1.3-aider-r7-expanded.yaml \
    --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13

./bench sweep --config configs/variants/29-v1.3-aider-r7-gemma4.yaml \
    --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13

./bench sweep --config configs/variants/30-v1.3-aider-r7-cascade-threshold.yaml \
    --strategies cascade --cascade-thresholds 5,10,15,20,25 --seeds 42,7,13

Artifacts attached

results-v1.3.0.tar.gz (4.2 MB) — bundle of all 3 sweep result directories (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts/)
v1.3.0-report.html (54 KB) — full publishable HTML report (~5,400 words, v3.3 style)
findings.md — diagnostic write-up with per-cell stratified bootstrap CIs
orchestrator.log — full sweep run log

Path forward (v1.3.x / v1.4)

More local models: deepseek-coder-v3, qwen3.6:35b, codestral-medium-2
Expand Exercism fixtures beyond 5 (the n=15 baseline is itself unstable)
Task-aware local-model routing (different local for puzzle vs refactor classes)
Persistent failure-mode analysis on Exercism A — why does every strategy under-route?
