Release v1.1.2 — canonical sweep (60 rows, 3 seeds, bootstrap CIs) · RunanywhereAI/hybrid-arena

The publishable canonical dataset for v1.1. 5 Exercism Python tasks × 4 strategies (always-cloud, always-local, heuristic, cascade) × 3 seeds = 60 rows. 95% bootstrap CIs at n=15 per cell.

Headline

Cell	pass_rate	cloud_fraction
R8 / always-cloud (gpt-5.5)	1.00 [1.00, 1.00]	1.00
R8 / always-local (qwen3-coder:30b)	0.00 [0.00, 0.00]	0.00
R8 / heuristic (agent-aware)	0.00 [0.00, 0.00]	0.50
R8 / cascade	0.00 [0.00, 0.00]	0.10

Verdict

The agent-aware heuristic strategy IS making rational decisions (first turn cloud for planning, post-tool-call local for tool-result interpretation, ~50% cloud-fraction over the loop). The 0% pass rate on hybrid is not a routing-logic bug — it's a model-compatibility issue between qwen3-coder + opencode tool-message format. v1.2's incoming-direction tool-message normalizer is the unblocker.

Attached

`results-v1.1.2-canonical.tar.gz` — 60-row canonical sweep
`findings.md` — diagnostic write-up

Reproducing

```bash
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.1.2
python3.12 -m venv .venv && .venv/bin/pip install -e .
./bench setup
(cd router && LOCAL_MODEL=qwen3-coder:30b ./start.sh) &
./bench sweep --config configs/variants/24-v1.1-opencode-canonical.yaml \
--strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench analyze results/runs/24-v1.1-opencode-canonical/
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v1.1.2 — canonical sweep (60 rows, 3 seeds, bootstrap CIs)

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Headline

Verdict

Attached

Reproducing

Uh oh!