v1.3.0 — multi-model + threshold sweep
The first Pareto-equivalent hybrid configuration in this benchmark, backed by 95% bootstrap confidence intervals.
Headline finding
For real_dev D1+D5 (practical refactoring tasks), gemma4:31b + heuristic routing reaches:
- Pass-rate: 96% [88, 100] (95% bootstrap CI)
- vs. always-cloud's 100% [100, 100]: the two intervals share the 100% endpoint
- at 79% cloud_fraction — ≈21% reduction in cloud token spend
| Cell (gemma4:31b on real_dev D1+D5) | Pass-rate | 95% CI |
|---|---|---|
| always-cloud (gpt-5.5) | 1.00 | [1.00, 1.00] |
| always-local (gemma4:31b) | 0.88 | [0.71, 1.00] |
| heuristic | 0.96 | [0.88, 1.00] ← Pareto win |
| cascade | 0.88 | [0.71, 1.00] |
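The intervals in the table can be approximated with a percentile bootstrap over per-run pass/fail outcomes. The following is a minimal sketch, not the benchmark's own code. It assumes 24 runs per cell (8 real_dev tasks × 3 seeds, inferred from the task matrix) and uses a plain resample rather than the stratified one mentioned in the artifacts. `bootstrap_ci` is a hypothetical helper name.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes (non-stratified)."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample runs with replacement and record each resample's pass-rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Assumed cell size: 8 real_dev tasks x 3 seeds = 24 runs.
heuristic = [1] * 23 + [0]   # 23/24 passes, about 0.96
always_cloud = [1] * 24      # 24/24 passes

print(bootstrap_ci(heuristic))
print(bootstrap_ci(always_cloud))
```

A degenerate interval such as [1.00, 1.00] falls out naturally when every run passes, because every resample then has the same mean.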
What's in this release
Three publishable canonical sweeps (507 rows total, 6h13m wall, $32.88 cloud spend):
| Sweep | Variant | Rows | Wall |
|---|---|---|---|
| 28 | qwen3-coder:30b expanded (13 tasks) | 156 | 75m |
| 29 | gemma4:31b expanded (13 tasks) | 156 | 222m |
| 30 | cascade × 5 thresholds × 3 seeds | 195 | 76m |
Task matrix: 5 Exercism Python + 4 real_dev D1 + 4 real_dev D5 = 13 tasks × R7 (aider) × {always-cloud, always-local, heuristic, cascade} × 3 seeds.
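The row counts in the sweep table follow directly from this matrix. As a quick sanity check, the sketch below enumerates the cells with placeholder task IDs (the real IDs are not listed here) and assumes sweep 30 runs the same 13 tasks, which is consistent with its 195 rows.

```python
from itertools import product

TASKS = [f"task_{i:02d}" for i in range(13)]  # placeholder IDs, not the real ones
STRATEGIES = ["always-cloud", "always-local", "heuristic", "cascade"]
SEEDS = [42, 7, 13]
THRESHOLDS = [5, 10, 15, 20, 25]

# Sweeps 28 and 29 share one matrix: tasks x strategies x seeds.
rows_28 = len(list(product(TASKS, STRATEGIES, SEEDS)))
rows_29 = rows_28
# Sweep 30 fixes the strategy to cascade and varies the threshold instead.
rows_30 = len(list(product(TASKS, ["cascade"], THRESHOLDS, SEEDS)))

total_rows = rows_28 + rows_29 + rows_30
print(rows_28, rows_29, rows_30, total_rows)  # 156 156 195 507
```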
Three findings
- Local model selection > router strategy tuning. Switching qwen3-coder:30b → gemma4:31b raised always-local pass-rate by +39 percentage points (23% → 62%); raised heuristic by +31pp (36% → 67%). Threshold tuning on cascade only moves the needle by ≈7pp across a 5x parameter span.
- Task type matters as much as the model. Both 30B-class models choke on Exercism Python puzzles (always-local ≤25%), yet on real_dev refactoring tasks gemma4:31b reaches 88% always-local and 96% with heuristic routing. The "viability of local for coding" question has different answers by task class.
- Cascade threshold has a flat curve. Sweep across thresholds 5/10/15/20/25 produced pass-rates 21–28% with no monotonic trend. Cloud_fraction does change as designed (0.80 → 0.55), but pass-rate doesn't track. Cascade is a poor fit for agentic loops; threshold isn't the lever.
New in v1.3.0
- `benchmark.task_ids: list[str] | None`: explicit task-ID whitelist; scopes sweeps to a known-good subset
- `./bench sweep --cascade-thresholds 5,10,15,20,25`: sweeps `ROUTER_CASCADE_THRESHOLD`; spawns a fresh router per threshold
- R7 multi-file fixture support: enables real_dev D1+D5 tasks (multi-file edits) under aider
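The whitelist semantics of `benchmark.task_ids` might look like the sketch below. This is an illustration under stated assumptions, not the project's implementation: `select_tasks` is a hypothetical helper, `None` is assumed to mean "run everything", and rejecting unknown IDs is a design choice made here for safety.

```python
from typing import Optional


def select_tasks(all_task_ids: list[str], task_ids: Optional[list[str]]) -> list[str]:
    """Return the tasks a sweep would run under an explicit whitelist."""
    if task_ids is None:
        # Assumption: None means "no whitelist", so every task runs.
        return list(all_task_ids)
    unknown = set(task_ids) - set(all_task_ids)
    if unknown:
        # Fail loudly rather than silently shrinking the sweep.
        raise ValueError(f"unknown task IDs: {sorted(unknown)}")
    allowed = set(task_ids)
    # Preserve the benchmark's own ordering, not the whitelist's.
    return [t for t in all_task_ids if t in allowed]


available = ["ex-01", "d1-01", "d5-01"]  # placeholder IDs
print(select_tasks(available, ["d5-01", "d1-01"]))  # ['d1-01', 'd5-01']
print(select_tasks(available, None))                # ['ex-01', 'd1-01', 'd5-01']
```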
Reproducibility
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.3.0
./bench setup
ollama pull qwen3-coder:30b
ollama pull gemma4:31b
./bench sweep --config configs/variants/28-v1.3-aider-r7-expanded.yaml \
--strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench sweep --config configs/variants/29-v1.3-aider-r7-gemma4.yaml \
--strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench sweep --config configs/variants/30-v1.3-aider-r7-cascade-threshold.yaml \
--strategies cascade --cascade-thresholds 5,10,15,20,25 --seeds 42,7,13
Artifacts attached
- `results-v1.3.0.tar.gz` (4.2 MB): bundle of all 3 sweep result directories (`raw.jsonl` + `aggregate.json` + `bootstrap_cis.json` + `decision_matrix.md` + `charts/`)
- `v1.3.0-report.html` (54 KB): full publishable HTML report (~5,400 words, v3.3 style)
- `findings.md`: diagnostic write-up with per-cell stratified bootstrap CIs
- `orchestrator.log`: full sweep run log
Path forward (v1.3.x / v1.4)
- More local models: deepseek-coder-v3, qwen3.6:35b, codestral-medium-2
- Expand Exercism fixtures beyond 5 (the current n=15 baseline, 5 tasks × 3 seeds, is itself unstable)
- Task-aware local-model routing (different local for puzzle vs refactor classes)
- Persistent failure-mode analysis on Exercism A — why does every strategy under-route?