v1.1.2 — canonical sweep (60 rows, 3 seeds, bootstrap CIs)

@sanchitmonga22 sanchitmonga22 released this 19 May 23:59
· 49 commits to main since this release

The publishable canonical dataset for v1.1. 5 Exercism Python tasks × 4 strategies (always-cloud, always-local, heuristic, cascade) × 3 seeds = 60 rows. Each cell (one strategy) has n=15 runs (5 tasks × 3 seeds) and is reported with a 95% bootstrap CI.
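For readers unfamiliar with bootstrap intervals, here is a minimal sketch of a percentile bootstrap over 15 pass/fail outcomes. It is not taken from the repository: the function name `bootstrap_ci`, the resample count, and the percentile method are all assumptions made for illustration, and the analysis code shipped with `./bench analyze` may differ.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the cell with replacement and record each resample's pass rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    # Read the interval endpoints off the sorted resample means.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# A cell where all 15 runs pass (or all fail) has no variation to resample,
# so the interval collapses to a single point.
print(bootstrap_ci([1] * 15))  # (1.0, 1.0)
print(bootstrap_ci([0] * 15))  # (0.0, 0.0)
```

With n=15 and identical outcomes in every run, every resample has the same mean, so a degenerate interval such as [1.00, 1.00] or [0.00, 0.00] is the expected result of this method and says nothing about how the cell would behave on other tasks.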

Headline

| Cell | pass_rate [95% CI] | cloud_fraction |
| --- | --- | --- |
| R8 / always-cloud (gpt-5.5) | 1.00 [1.00, 1.00] | 1.00 |
| R8 / always-local (qwen3-coder:30b) | 0.00 [0.00, 0.00] | 0.00 |
| R8 / heuristic (agent-aware) | 0.00 [0.00, 0.00] | 0.50 |
| R8 / cascade | 0.00 [0.00, 0.00] | 0.10 |

Verdict

The agent-aware heuristic strategy is making rational decisions: the first turn goes to cloud for planning, post-tool-call turns go to local for tool-result interpretation, and the cloud fraction over the loop comes out at about 50%. The 0% pass rate on the hybrid strategies is not a routing-logic bug; it is a model-compatibility issue between qwen3-coder and the opencode tool-message format. The incoming-direction tool-message normalizer planned for v1.2 is the unblocker.
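The routing rule described above can be sketched roughly as follows. This is an illustration only: `route`, `cloud_fraction`, the role labels, and the example trace are hypothetical and are not the router's actual code or data from the sweep.

```python
def route(turn_index, last_message_role):
    """Pick a backend for one model call (hypothetical sketch of the heuristic)."""
    if turn_index == 0:
        return "cloud"  # first turn: planning goes to the cloud model
    if last_message_role == "tool":
        return "local"  # post-tool-call: interpret the tool result locally
    return "cloud"  # assumption: anything else falls back to cloud


def cloud_fraction(decisions):
    """Share of model calls that were routed to the cloud backend."""
    return sum(d == "cloud" for d in decisions) / len(decisions)


# Made-up trace: the role of the message that precedes each model call.
trace = ["user", "tool", "user", "tool"]
decisions = [route(i, role) for i, role in enumerate(trace)]
print(decisions)                  # ['cloud', 'local', 'cloud', 'local']
print(cloud_fraction(decisions))  # 0.5
```

The point of the sketch is that a rule of this shape can route sensibly and still produce a 0% pass rate if the local model cannot parse the tool messages it is handed, which is the failure the verdict attributes to the message format and not to the routing.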

Attached

  • `results-v1.1.2-canonical.tar.gz` — 60-row canonical sweep
  • `findings.md` — diagnostic write-up

Reproducing

```bash
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.1.2
python3.12 -m venv .venv && .venv/bin/pip install -e .
./bench setup
(cd router && LOCAL_MODEL=qwen3-coder:30b ./start.sh) &
./bench sweep --config configs/variants/24-v1.1-opencode-canonical.yaml \
--strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench analyze results/runs/24-v1.1-opencode-canonical/
```