v1.2.0 — single-agent R7 aider canonical: hybrid on the Pareto frontier
The v1.2 release. Single-agent simplification: R7 (aider) is the canonical agentic route. Empirically validated against opencode (R8) — aider's architect/editor protocol works end-to-end with qwen3-coder:30b; opencode's free-form tool-use does not.
Headline (60 rows = 5 Exercism Python × 4 strategies × 3 seeds)
| Strategy | Pass | $ total | $/pass | Cloud-frac (tokens) |
|---|---|---|---|---|
| always-cloud (gpt-5.5) | 9/15 (60%) | $0.91 | $0.10 | 1.00 |
| always-local (qwen3-coder:30b) | 0/15 (0%) | $0.00 | n/a | 0.00 |
| heuristic (agent-aware) | 6/15 (40%) | $0.74 | $0.12 | 0.48 |
| cascade | 3/15 (20%) | $0.65 | $0.22 | 0.35 |
On grep and pig-latin, heuristic matches or beats always-cloud (3/3 vs 2/3 on grep; 3/3 vs 3/3 on pig-latin) while routing ~50% of token volume local. Aggregate pass-rate CIs overlap.
Why R7 (aider), not R8 (opencode)
Both runners exist in-tree. v1.1.3's R8 canonical (60 rows, same matrix) showed 0/15 hybrid pass — qwen3-coder:30b can drive opencode's tool-use loop syntactically (we fixed three Ollama tool-message format issues in v1.1.x) but it writes prose on tool-interpretation turns instead of follow-up tool_calls. Aider's structured architect/editor protocol bypasses that failure mode — architect plans in cloud, editor applies edits locally. R8 + R6 stay EXPERIMENTAL in v1.2.
Reproduce in 5 minutes
git clone https://github.com/RunanywhereAI/hybrid-coding-eval && cd hybrid-coding-eval
git checkout v1.2.0
python3.12 -m venv .venv && .venv/bin/pip install -e .
ollama pull qwen3-coder:30b
cp .env.example .env # add OPEN_AI_API_KEY
./bench setup
(cd router && LOCAL_MODEL=qwen3-coder:30b ./start.sh) &
./bench sweep --config configs/variants/26-v1.2-aider-r7-canonical.yaml \
--strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench analyze results/runs/26-v1.2-aider-r7-canonical/~30-50 min wall on M4 Max, ~$1-2 API spend.
Benchmark a new local model
Edit models.local: in the yaml and re-sweep. Compare your bootstrap_cis.json against this release's results-v1.2.0-canonical.tar.gz baseline. See docs/BENCHMARK_NEW_MODEL.md for the full recipe.
Attached
results-v1.2.0-canonical.tar.gz— 60-row canonical datasetfindings.md— diagnostic write-up + reproducibility recipe