Release v1.2.0 — single-agent R7 aider canonical: hybrid on the Pareto frontier · RunanywhereAI/hybrid-arena

v1.2 simplifies the benchmark to a single agent: R7 (aider) is now the canonical agentic route. It was validated empirically against opencode (R8): aider's architect/editor protocol works end-to-end with qwen3-coder:30b, while opencode's free-form tool use does not.

Headline (60 rows = 5 Exercism Python × 4 strategies × 3 seeds)

Strategy	Pass	$ total	$/pass	Cloud-frac (tokens)
always-cloud (gpt-5.5)	9/15 (60%)	$0.91	$0.10	1.00
always-local (qwen3-coder:30b)	0/15 (0%)	$0.00	n/a	0.00
heuristic (agent-aware)	6/15 (40%)	$0.74	$0.12	0.48
cascade	3/15 (20%)	$0.65	$0.22	0.35
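
The $/pass column can be re-derived from the Pass and $ total columns. The following is a minimal sketch of that arithmetic, not the repository's analysis code; the names HEADLINE, cost_per_pass, and summarize are made up for illustration, and the numbers are copied from the table above.

```python
# Values copied from the headline table: (passes, runs, total dollars).
# The dictionary and function names here are illustrative, not from the repo.
HEADLINE = {
    "always-cloud": (9, 15, 0.91),
    "always-local": (0, 15, 0.00),
    "heuristic": (6, 15, 0.74),
    "cascade": (3, 15, 0.65),
}


def cost_per_pass(passes, total_cost):
    """Return dollars per passing run, or None when nothing passed (n/a)."""
    if passes == 0:
        return None
    return total_cost / passes


def summarize(table):
    """Compute pass rate and $/pass for every strategy in the table."""
    summary = {}
    for strategy, (passes, runs, total_cost) in table.items():
        summary[strategy] = {
            "pass_rate": passes / runs,
            "cost_per_pass": cost_per_pass(passes, total_cost),
        }
    return summary


for name, row in summarize(HEADLINE).items():
    cpp = row["cost_per_pass"]
    shown = "n/a" if cpp is None else f"${cpp:.2f}"
    print(f"{name}: pass rate {row['pass_rate']:.0%}, $/pass {shown}")
```

Because the totals in the table are already rounded to cents, the recomputed values agree with the published column only to two decimal places.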

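On grep and pig-latin, heuristic matches or beats always-cloud (3/3 vs 2/3 on grep; 3/3 vs 3/3 on pig-latin) while routing roughly half of the token volume to the local model (cloud fraction 0.48). The aggregate pass-rate confidence intervals overlap, so with 15 runs per strategy the aggregate differences should be read with caution.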
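
To get a feel for how wide the intervals are at 15 runs per strategy, here is a rough sketch using a Wilson score interval. This is a swapped-in approximation chosen because it is deterministic and short; the release's bootstrap_cis.json suggests the repository uses bootstrap resampling, which this does not reproduce. The function names are illustrative.

```python
import math


def wilson_interval(passes, runs, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a, b):
    """True when two closed intervals (lo, hi) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


cloud = wilson_interval(9, 15)      # always-cloud, from the headline table
heuristic = wilson_interval(6, 15)  # heuristic, from the headline table
print("always-cloud:", cloud)
print("heuristic:   ", heuristic)
print("overlap:     ", intervals_overlap(cloud, heuristic))
```

Under this approximation the always-cloud and heuristic intervals overlap substantially, which is consistent with the caution above. The exact bounds will differ from the bootstrap values shipped with the release.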

Why R7 (aider), not R8 (opencode)

Both runners exist in-tree. The v1.1.3 canonical run on R8 (60 rows, same matrix) showed a 0/15 hybrid pass rate. qwen3-coder:30b can drive opencode's tool-use loop syntactically (we fixed three Ollama tool-message format issues in v1.1.x), but on tool-interpretation turns it writes prose instead of follow-up tool_calls. Aider's structured architect/editor protocol bypasses that failure mode: the architect plans in the cloud and the editor applies edits locally. R8 and R6 remain EXPERIMENTAL in v1.2.
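
The architect/editor split described above also explains how a token-weighted cloud fraction arises: architect tokens are billed to the cloud model and editor tokens stay local. The sketch below is an illustration only. The Turn class, ROLE_TO_MODEL mapping, and the token counts are hypothetical and are not taken from the repository or from aider's internals; only the two model names come from this release.

```python
from dataclasses import dataclass

CLOUD_MODEL = "gpt-5.5"          # cloud model named in the headline table
LOCAL_MODEL = "qwen3-coder:30b"  # local model named in the headline table

# Hypothetical role-to-model mapping: architect plans in the cloud,
# editor applies edits locally.
ROLE_TO_MODEL = {"architect": CLOUD_MODEL, "editor": LOCAL_MODEL}


@dataclass
class Turn:
    """One model call in a run (illustrative structure, not the repo's schema)."""
    role: str
    tokens: int


def model_for(role):
    """Look up which model serves a given role."""
    return ROLE_TO_MODEL[role]


def cloud_fraction(turns):
    """Share of total tokens that were sent to the cloud model."""
    total = sum(t.tokens for t in turns)
    if total == 0:
        return 0.0
    cloud = sum(t.tokens for t in turns if model_for(t.role) == CLOUD_MODEL)
    return cloud / total


# Made-up token counts, purely to show the calculation.
example_run = [Turn("architect", 400), Turn("editor", 600)]
print(cloud_fraction(example_run))
```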

Reproduce in 5 minutes

git clone https://github.com/RunanywhereAI/hybrid-coding-eval && cd hybrid-coding-eval
git checkout v1.2.0
python3.12 -m venv .venv && .venv/bin/pip install -e .
ollama pull qwen3-coder:30b
cp .env.example .env  # add OPEN_AI_API_KEY

./bench setup
(cd router && LOCAL_MODEL=qwen3-coder:30b ./start.sh) &

./bench sweep --config configs/variants/26-v1.2-aider-r7-canonical.yaml \
  --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench analyze results/runs/26-v1.2-aider-r7-canonical/

The full sweep takes ~30-50 min of wall-clock time on an M4 Max and costs ~$1-2 in API spend.
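
Before starting a sweep that takes this long, it can help to confirm the prerequisites from the steps above are in place. The following preflight check is a sketch and is not part of the repository: the helper names are made up, and it assumes the .env file uses plain KEY=value lines. The tool names and the OPEN_AI_API_KEY variable are taken from the commands above.

```python
import shutil

REQUIRED_TOOLS = ("git", "python3.12", "ollama")  # commands used in the steps above
REQUIRED_ENV_KEYS = ("OPEN_AI_API_KEY",)          # key named in the .env comment


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def missing_env_keys(env_text, required=REQUIRED_ENV_KEYS):
    """Return required keys that are absent or empty in .env-style text."""
    present = set()
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        # Treat empty or quote-only values as missing.
        if value.strip().strip("'\""):
            present.add(key)
    return [key for key in required if key not in present]


if __name__ == "__main__":
    print("missing tools:", missing_tools())
    try:
        with open(".env", encoding="utf-8") as fh:
            print("missing keys:", missing_env_keys(fh.read()))
    except FileNotFoundError:
        print(".env not found; copy .env.example first")
```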

Benchmark a new local model

Edit the models.local: key in the YAML config and re-run the sweep. Then compare your bootstrap_cis.json against the baseline in this release's results-v1.2.0-canonical.tar.gz. See docs/BENCHMARK_NEW_MODEL.md for the full recipe.
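
The comparison step could look roughly like the sketch below. The schema of bootstrap_cis.json is not documented in these notes, so this assumes a hypothetical layout of {strategy: {"lo": ..., "hi": ...}} for the pass-rate interval; adjust the key names to whatever the real file contains. The function names are also illustrative.

```python
import json


def load_cis(path):
    """Load a CI file; assumes the hypothetical {strategy: {"lo", "hi"}} layout."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def compare_to_baseline(candidate, baseline):
    """Classify each shared strategy as better, worse, or overlapping.

    A strategy is 'better' only when its whole interval sits above the
    baseline interval, and 'worse' only when it sits entirely below.
    """
    verdicts = {}
    for strategy in sorted(set(candidate) & set(baseline)):
        c_lo, c_hi = candidate[strategy]["lo"], candidate[strategy]["hi"]
        b_lo, b_hi = baseline[strategy]["lo"], baseline[strategy]["hi"]
        if c_lo > b_hi:
            verdicts[strategy] = "better"
        elif c_hi < b_lo:
            verdicts[strategy] = "worse"
        else:
            verdicts[strategy] = "overlap"
    return verdicts
```

A typical call would be compare_to_baseline(load_cis("your/bootstrap_cis.json"), load_cis("baseline/bootstrap_cis.json")), where both paths are placeholders. An "overlap" verdict means the new local model is not distinguishable from the baseline at this sample size under the assumed schema.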

Attached

results-v1.2.0-canonical.tar.gz — 60-row canonical dataset
findings.md — diagnostic write-up + reproducibility recipe
