v1.2.0 — single-agent R7 aider canonical: hybrid on the Pareto frontier

@sanchitmonga22 released this 20 May 03:46
· 45 commits to main since this release

v1.2 simplifies the setup to a single agent: R7 (aider) is the canonical agentic route. It was empirically validated against opencode (R8): aider's architect/editor protocol works end-to-end with qwen3-coder:30b, whereas opencode's free-form tool use does not.

Headline (60 rows = 5 Exercism Python exercises × 4 strategies × 3 seeds)

| Strategy | Pass | $ total | $/pass | Cloud-frac (tokens) |
|---|---|---|---|---|
| always-cloud (gpt-5.5) | 9/15 (60%) | $0.91 | $0.10 | 1.00 |
| always-local (qwen3-coder:30b) | 0/15 (0%) | $0.00 | n/a | 0.00 |
| heuristic (agent-aware) | 6/15 (40%) | $0.74 | $0.12 | 0.48 |
| cascade | 3/15 (20%) | $0.65 | $0.22 | 0.35 |
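
The derived columns follow from simple arithmetic on the pass counts and total spend. The sketch below recomputes the pass rate and $/pass from the figures in the table; the `ROWS` layout and the helper names are illustrative assumptions, not the repository's analysis code.

```python
# Figures copied from the headline table above.
# Layout (an assumption for this sketch): name -> (passed, attempts, total cost in USD).
ROWS = {
    "always-cloud": (9, 15, 0.91),
    "always-local": (0, 15, 0.00),
    "heuristic": (6, 15, 0.74),
    "cascade": (3, 15, 0.65),
}


def pass_rate(passed, attempts):
    """Fraction of rows that passed."""
    return passed / attempts


def cost_per_pass(total_cost, passed):
    """Total spend divided by passes; None when nothing passed (shown as n/a)."""
    return None if passed == 0 else total_cost / passed


for name, (passed, attempts, cost) in ROWS.items():
    cpp = cost_per_pass(cost, passed)
    cpp_text = "n/a" if cpp is None else f"${cpp:.2f}"
    print(f"{name:13s} {pass_rate(passed, attempts):4.0%} {cpp_text}")
```

$/pass is undefined for always-local because it has zero passes, which is why that cell reads n/a rather than $0.00.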

On grep and pig-latin, heuristic matches or beats always-cloud (3/3 vs 2/3 on grep; 3/3 vs 3/3 on pig-latin) while routing roughly half of token volume to the local model (cloud fraction 0.48). The aggregate pass-rate confidence intervals (CIs) overlap.
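
To see why intervals overlap at this sample size, here is a minimal percentile-bootstrap sketch over the aggregate counts (9/15 for always-cloud, 6/15 for heuristic). It treats all 15 rows as independent and ignores the task-by-seed structure, which is a simplifying assumption; the repository's own method for producing bootstrap_cis.json may differ, and `bootstrap_ci` is a hypothetical helper.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    # Resample n rows with replacement and record the pass rate each time.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Aggregate outcomes from the headline table: 1 = pass, 0 = fail.
cloud = [1] * 9 + [0] * 6
heuristic = [1] * 6 + [0] * 9

cloud_ci = bootstrap_ci(cloud)
heur_ci = bootstrap_ci(heuristic)

# Two intervals overlap when each lower bound sits below the other's upper bound.
overlap = cloud_ci[0] <= heur_ci[1] and heur_ci[0] <= cloud_ci[1]
print("always-cloud CI:", cloud_ci)
print("heuristic CI:   ", heur_ci)
print("overlap:", overlap)
```

With only 15 rows per strategy, each interval is wide, so a 60% vs 40% gap is not enough to separate them.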

Why R7 (aider), not R8 (opencode)

Both runners exist in-tree. The v1.1.3 R8 canonical run (60 rows, same matrix) showed a 0/15 hybrid pass rate: qwen3-coder:30b can drive opencode's tool-use loop syntactically (we fixed three Ollama tool-message format issues in v1.1.x), but on tool-interpretation turns it writes prose instead of issuing follow-up tool_calls. Aider's structured architect/editor protocol bypasses that failure mode: the architect plans in the cloud and the editor applies edits locally. R8 and R6 stay EXPERIMENTAL in v1.2.
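
The shape of the architect/editor split can be sketched as two plain prompt-to-text calls, with no tool-call loop for the local model to drive. This is an illustration of the idea only, not aider's implementation: `Model`, `architect_editor_turn`, the prompt wording, and the stub models are all hypothetical.

```python
from typing import Callable, List

# Hypothetical type: a model is any callable from prompt text to reply text.
Model = Callable[[str], str]


def architect_editor_turn(task: str, source: str,
                          architect: Model, editor: Model) -> str:
    """One two-stage turn: the architect writes a plan, the editor applies it."""
    # Stage 1 (cloud, per the release notes): describe the change in prose.
    plan = architect(
        f"Task:\n{task}\n\nCurrent file:\n{source}\n\nDescribe the change."
    )
    # Stage 2 (local): turn the plan into file content. The editor only has
    # to emit text, so it never needs to interpret a tool result.
    return editor(
        f"Plan:\n{plan}\n\nCurrent file:\n{source}\n\nReturn the updated file."
    )


# Stub models so the sketch runs offline; real ones would call an API.
calls: List[str] = []


def fake_cloud(prompt: str) -> str:
    calls.append("architect")
    return "Rename function foo to bar."


def fake_local(prompt: str) -> str:
    calls.append("editor")
    return "def bar():\n    return 1\n"


updated = architect_editor_turn(
    "Rename foo to bar", "def foo():\n    return 1\n", fake_cloud, fake_local
)
print(updated)
```

Because each stage is a single request and response, the local model is never asked to read a tool result and choose a follow-up call, which is the step where the R8 runs failed.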

Reproduce in 5 minutes

git clone https://github.com/RunanywhereAI/hybrid-coding-eval && cd hybrid-coding-eval
git checkout v1.2.0
python3.12 -m venv .venv && .venv/bin/pip install -e .
ollama pull qwen3-coder:30b
cp .env.example .env  # add OPEN_AI_API_KEY

./bench setup
(cd router && LOCAL_MODEL=qwen3-coder:30b ./start.sh) &

./bench sweep --config configs/variants/26-v1.2-aider-r7-canonical.yaml \
  --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench analyze results/runs/26-v1.2-aider-r7-canonical/

The sweep takes ~30-50 min wall-clock on an M4 Max and costs ~$1-2 in API spend.

Benchmark a new local model

Edit the models.local: entry in the YAML config and re-run the sweep. Compare your bootstrap_cis.json against the baseline in this release's results-v1.2.0-canonical.tar.gz. See docs/BENCHMARK_NEW_MODEL.md for the full recipe.

Attached

  • results-v1.2.0-canonical.tar.gz — 60-row canonical dataset
  • findings.md — diagnostic write-up + reproducibility recipe