v1.1.2 — canonical sweep (60 rows, 3 seeds, bootstrap CIs)
The publishable canonical dataset for v1.1: 5 Exercism Python tasks × 4 strategies (always-cloud, always-local, heuristic, cascade) × 3 seeds = 60 rows. Each strategy cell has n=15 rows (5 tasks × 3 seeds) and is reported with a 95% bootstrap CI.
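As a quick sanity check on the row counts, the sweep grid can be enumerated directly. This is a minimal sketch, not code from the repository: the task names are placeholders (the five Exercism tasks are not listed in these notes), while the strategy names and seeds are the ones used in the reproduction command.

```python
from itertools import product

# Placeholder task names (assumption): the actual Exercism tasks are not named here.
TASKS = [f"task-{i}" for i in range(1, 6)]
STRATEGIES = ["always-cloud", "always-local", "heuristic", "cascade"]
SEEDS = [42, 7, 13]

# One row per (task, strategy, seed) combination.
rows = [
    {"task": task, "strategy": strategy, "seed": seed}
    for task, strategy, seed in product(TASKS, STRATEGIES, SEEDS)
]

# Rows per strategy cell: 5 tasks x 3 seeds.
per_cell = {s: sum(1 for r in rows if r["strategy"] == s) for s in STRATEGIES}

print(len(rows))  # 60
print(per_cell)   # 15 rows for each strategy
```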
Headline
| Cell | pass_rate [95% CI] | cloud_fraction |
|---|---|---|
| R8 / always-cloud (gpt-5.5) | 1.00 [1.00, 1.00] | 1.00 |
| R8 / always-local (qwen3-coder:30b) | 0.00 [0.00, 0.00] | 0.00 |
| R8 / heuristic (agent-aware) | 0.00 [0.00, 0.00] | 0.50 |
| R8 / cascade | 0.00 [0.00, 0.00] | 0.10 |
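The zero-width intervals in the table are what a bootstrap produces when every row in a cell has the same outcome: each resample of an all-pass or all-fail cell has the same mean. The sketch below uses a percentile bootstrap to illustrate this. The exact method and resample count used by `./bench analyze` are not stated in these notes, so `bootstrap_ci` and its defaults are assumptions.

```python
import random

def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a pass rate (illustrative, not the repo's code).

    outcomes: list of 0/1 values, one per row in the cell.
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper

all_pass = [1] * 15   # shape of the always-cloud cell: 15 of 15 pass
all_fail = [0] * 15   # shape of the other cells: 0 of 15 pass

print(bootstrap_ci(all_pass))  # (1.0, 1.0, 1.0)
print(bootstrap_ci(all_fail))  # (0.0, 0.0, 0.0)
```

A degenerate interval like this reflects the lack of variation in the 15 observed rows, not certainty about the underlying pass rate.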
Verdict
The agent-aware heuristic strategy is routing as designed: the first turn goes to the cloud model for planning, post-tool-call turns go to the local model for tool-result interpretation, and the cloud fraction over the loop is about 50%. The 0% pass rate on the hybrid strategies is therefore not a routing-logic bug. It is a compatibility issue between qwen3-coder and the opencode tool-message format, which is consistent with always-local also scoring 0.00. The incoming-direction tool-message normalizer in v1.2 is the unblocker.
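The routing rule described above can be sketched as follows. This is an illustration under stated assumptions, not the router's actual code: the `route` function, the `prev_role` argument, the fallback to cloud for other turns, and the four-turn trace are all hypothetical.

```python
def route(turn_index, prev_role):
    """Pick a backend for one model turn (hypothetical sketch).

    turn_index: 0-based index of the model turn within the agent loop.
    prev_role: role of the message immediately before this turn.
    """
    if turn_index == 0:
        return "cloud"   # first turn: planning
    if prev_role == "tool":
        return "local"   # post-tool-call: interpret the tool result
    return "cloud"       # assumption: any other turn falls back to cloud

# Hypothetical trace: the role of the message preceding each model turn.
prev_roles = ["user", "tool", "user", "tool"]
decisions = [route(i, role) for i, role in enumerate(prev_roles)]
cloud_fraction = decisions.count("cloud") / len(decisions)

print(decisions)       # ['cloud', 'local', 'cloud', 'local']
print(cloud_fraction)  # 0.5
```

On this made-up trace the rule sends half of the turns to the cloud model. The real cloud fraction depends on how often tool results appear in the actual agent loop.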
Attached
- `results-v1.1.2-canonical.tar.gz` — 60-row canonical sweep
- `findings.md` — diagnostic write-up
Reproducing
```bash
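# Fetch the benchmark at the tagged release
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.1.2
# Create an isolated environment and install the package
python3.12 -m venv .venv && .venv/bin/pip install -e .
# Run the benchmark's setup step
./bench setup
# Start the router in the background with the local model
(cd router && LOCAL_MODEL=qwen3-coder:30b ./start.sh) &
# Run the sweep: 4 strategies × 3 seeds
./bench sweep --config configs/variants/24-v1.1-opencode-canonical.yaml \
  --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
# Analyze the sweep output
./bench analyze results/runs/24-v1.1-opencode-canonical/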
```