v1.3.0 — multi-model + threshold sweep

Released by @sanchitmonga22 on 20 May 18:15 · commit d966ba7 · 40 commits to main since this release

The first hybrid configuration in this benchmark that is Pareto-equivalent to always-cloud: its 95% bootstrap CI on pass-rate overlaps always-cloud's, at a lower cloud fraction.

Headline finding

For real_dev D1+D5 (practical refactoring tasks), gemma4:31b + heuristic routing reaches:

  • Pass-rate: 96% [88, 100] (95% bootstrap CI)
  • Always-cloud reaches 100% [100, 100]; the two intervals meet at 100%, so the 4pp gap is not distinguishable from noise at this sample size
  • Cloud_fraction: 79%, i.e. ≈21% less cloud token spend than always-cloud

| Cell (gemma4:31b on real_dev D1+D5) | Pass-rate | 95% CI |
|---|---|---|
| always-cloud (gpt-5.5) | 1.00 | [1.00, 1.00] |
| always-local (gemma4:31b) | 0.88 | [0.71, 1.00] |
| heuristic | 0.96 | [0.88, 1.00] ← Pareto win |
| cascade | 0.88 | [0.71, 1.00] |
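
To make the intervals above easier to read, here is a minimal sketch of a percentile bootstrap on binary pass/fail outcomes. It is not the benchmark's own code: the release mentions stratified bootstrap CIs, whereas this sketch resamples runs without stratification. The function name `bootstrap_ci` and the pass counts (23/24 and 24/24, inferred from the rounded rates and the 8 tasks × 3 seeds in the task matrix) are assumptions.

```python
import random

def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes (unstratified sketch)."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample n runs with replacement and record the pass-rate of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Assumed counts: 8 real_dev tasks x 3 seeds = 24 runs per cell.
heuristic = [1] * 23 + [0]   # 23/24 is about 0.96
always_cloud = [1] * 24      # 24/24 = 1.00

print(bootstrap_ci(heuristic))
print(bootstrap_ci(always_cloud))  # every resample is all passes, so (1.0, 1.0)
```

The always-cloud interval collapses to a single point because no resample of an all-pass cell can contain a failure. That is why [1.00, 1.00] should be read as "no failures observed in 24 runs" rather than as certainty.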

What's in this release

Three publishable canonical sweeps (507 rows total, 6h13m wall, $32.88 cloud spend):

| Sweep | Variant | Rows | Wall |
|---|---|---|---|
| 28 | qwen3-coder:30b expanded (13 tasks) | 156 | 75m |
| 29 | gemma4:31b expanded (13 tasks) | 156 | 222m |
| 30 | cascade × 5 thresholds × 3 seeds | 195 | 76m |

Task matrix: 5 Exercism Python + 4 real_dev D1 + 4 real_dev D5 = 13 tasks × R7 (aider) × {always-cloud, always-local, heuristic, cascade} × 3 seeds.
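
The row counts and wall time quoted above follow directly from the task matrix. The sketch below enumerates the grid to check the arithmetic; the task IDs are placeholders, not the repository's real identifiers.

```python
from itertools import product

# Placeholder task IDs (hypothetical names; only the counts come from the release).
tasks = (
    [f"exercism-{i}" for i in range(1, 6)]       # 5 Exercism Python
    + [f"real_dev-D1-{i}" for i in range(1, 5)]  # 4 real_dev D1
    + [f"real_dev-D5-{i}" for i in range(1, 5)]  # 4 real_dev D5
)
strategies = ["always-cloud", "always-local", "heuristic", "cascade"]
seeds = [42, 7, 13]
thresholds = [5, 10, 15, 20, 25]

rows_expanded = len(list(product(tasks, strategies, seeds)))   # sweeps 28 and 29
rows_threshold = len(list(product(tasks, thresholds, seeds)))  # sweep 30, cascade only
total_rows = 2 * rows_expanded + rows_threshold
wall_minutes = 75 + 222 + 76

print(rows_expanded, rows_threshold, total_rows)  # 156 195 507
print(divmod(wall_minutes, 60))                   # (6, 13), i.e. 6h13m
```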

Three findings

  1. Local model selection > router strategy tuning. Switching qwen3-coder:30b → gemma4:31b raised always-local pass-rate by +39 percentage points (23% → 62%); raised heuristic by +31pp (36% → 67%). Threshold tuning on cascade only moves the needle by ≈7pp across a 5x parameter span.
  2. Task type matters as much as the model. Both 30B-class models struggle on Exercism Python puzzles (always-local ≤25%), while gemma4:31b reaches 88% always-local on the real_dev refactoring tasks. Whether local models are viable for coding therefore depends on the task class.
  3. Cascade pass-rate is flat across thresholds. Thresholds 5/10/15/20/25 produced pass-rates of 21–28% with no monotonic trend. Cloud_fraction changes as designed (0.80 → 0.55), but pass-rate does not follow it. Cascade is a poor fit for agentic loops, and the threshold is not the lever.
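
The first two findings can be cross-checked with a back-of-envelope calculation. It assumes that the overall rates in finding 1 cover all 13 tasks × 3 seeds (39 runs), and that the real_dev rates in the headline table cover 8 tasks × 3 seeds (24 runs). Under those assumptions the Exercism pass-rate implied for gemma4:31b is consistent with the "≤25%" figure. The helper `implied_exercism_rate` is a hypothetical name used only for this illustration.

```python
SEEDS = 3
N_EXERCISM = 5 * SEEDS            # 15 runs
N_REAL_DEV = (4 + 4) * SEEDS      # 24 runs
N_ALL = N_EXERCISM + N_REAL_DEV   # 39 runs

def implied_exercism_rate(overall_rate, real_dev_rate):
    """Back out the Exercism pass-rate from rounded overall and real_dev rates."""
    overall_passes = round(overall_rate * N_ALL)
    real_dev_passes = round(real_dev_rate * N_REAL_DEV)
    return (overall_passes - real_dev_passes) / N_EXERCISM

# gemma4:31b always-local: 62% overall, 88% on real_dev D1+D5.
print(implied_exercism_rate(0.62, 0.88))  # 0.2
# gemma4:31b heuristic: 67% overall, 96% on real_dev D1+D5.
print(implied_exercism_rate(0.67, 0.96))  # 0.2
```

Because the inputs are rounded percentages, the result is only an estimate; the per-cell figures in findings.md remain the authoritative numbers.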

New in v1.3.0

  • benchmark.task_ids: list[str] | None — explicit task-ID whitelist; scopes sweeps to a known-good subset
  • ./bench sweep --cascade-thresholds 5,10,15,20,25 — sweep ROUTER_CASCADE_THRESHOLD; spawns fresh router per threshold
  • R7 multi-file fixture support — enables real_dev D1+D5 tasks (multi-file edits) under aider
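
As a rough illustration of the first two items, the sketch below shows one plausible reading of a `list[str] | None` whitelist and of a comma-separated threshold flag. It is not taken from the repository: `select_tasks`, `parse_thresholds`, and the `bench-sketch` parser are hypothetical, and the real `./bench` implementation may behave differently (for example on unknown task IDs).

```python
import argparse

def select_tasks(all_task_ids, task_ids=None):
    """Apply an optional whitelist: None keeps every task, a list keeps only those IDs."""
    if task_ids is None:
        return list(all_task_ids)
    known = set(all_task_ids)
    unknown = [t for t in task_ids if t not in known]
    if unknown:
        # Assumed behavior: fail loudly rather than silently shrink the sweep.
        raise ValueError(f"unknown task IDs: {unknown}")
    return list(task_ids)

def parse_thresholds(value):
    """Parse a comma-separated string such as '5,10,15,20,25' into a list of ints."""
    return [int(part) for part in value.split(",") if part.strip()]

parser = argparse.ArgumentParser(prog="bench-sketch")
parser.add_argument("--cascade-thresholds", type=parse_thresholds, default=None)
args = parser.parse_args(["--cascade-thresholds", "5,10,15,20,25"])

print(args.cascade_thresholds)                 # [5, 10, 15, 20, 25]
print(select_tasks(["a", "b", "c"], ["c", "a"]))  # ['c', 'a']
```

Per the release note, each threshold gets a fresh router, so a driver built on this would loop over `args.cascade_thresholds` and start a new router process per value.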

Reproducibility

git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.3.0
./bench setup
ollama pull qwen3-coder:30b
ollama pull gemma4:31b

./bench sweep --config configs/variants/28-v1.3-aider-r7-expanded.yaml \
    --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13

./bench sweep --config configs/variants/29-v1.3-aider-r7-gemma4.yaml \
    --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13

./bench sweep --config configs/variants/30-v1.3-aider-r7-cascade-threshold.yaml \
    --strategies cascade --cascade-thresholds 5,10,15,20,25 --seeds 42,7,13

Artifacts attached

  • results-v1.3.0.tar.gz (4.2 MB) — bundle of all 3 sweep result directories (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts/)
  • v1.3.0-report.html (54 KB) — full publishable HTML report (~5,400 words, v3.3 style)
  • findings.md — diagnostic write-up with per-cell stratified bootstrap CIs
  • orchestrator.log — full sweep run log

Path forward (v1.3.x / v1.4)

  • More local models: deepseek-coder-v3, qwen3.6:35b, codestral-medium-2
  • Expand Exercism fixtures beyond 5 (the n=15 baseline is itself unstable)
  • Task-aware local-model routing (different local for puzzle vs refactor classes)
  • Persistent failure-mode analysis on Exercism A — why does every strategy under-route?