v1.4.1 — 3-model agentic leaderboard (1,644 rows total)

@sanchitmonga22 released this 26 May 02:28

Adds qwen3-coder:30b + qwen3.6:35b canonical sweeps to v1.4.0's gemma4 line. Combined v1.4 + v1.4.1: 1,644 rows.

Headline (the marquee cells)

Cell                                              Pass rate                  Cloud fraction
cline + qwen3.6 + cascade + refactors             24/24 = 100% [100, 100]    low (~5-10%)
cline + qwen3.6 + heuristic + refactors           22/24 = 92%                ~7%
cline + qwen3-coder + heuristic + refactors       22/24 = 92%                ~7%
cline + qwen3.6 + always-local + puzzles          15/15 = 100%               0%
aider + gemma4 + heuristic + refactors (v1.4.0)   23/24 = 96% [88, 100]      48%
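The bracketed intervals in the headline table (for example `[88, 100]` next to 23/24) look like bootstrap confidence intervals, given the `bootstrap_cis.json` artifact. The release does not state the exact procedure, so the following is only a minimal sketch that assumes a percentile bootstrap at the 95% level; the function name, resample count, and seed are illustrative and not taken from the repository.

```python
import random


def bootstrap_pass_rate_ci(passes, total, n_resamples=10_000, alpha=0.05, seed=0):
    """Return (pass_rate, lo, hi) using a percentile bootstrap.

    Assumption: the release's intervals may be computed differently;
    this only illustrates the general technique.
    """
    rng = random.Random(seed)
    # One entry per run: 1 for a pass, 0 for a fail.
    outcomes = [1] * passes + [0] * (total - passes)
    # Resample the runs with replacement and record each resampled pass rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=total)) / total
        for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return passes / total, lo, hi


if __name__ == "__main__":
    for passes, total in [(24, 24), (23, 24)]:
        rate, lo, hi = bootstrap_pass_rate_ci(passes, total)
        print(f"{passes}/{total}: {rate:.0%} [{lo:.0%}, {hi:.0%}]")
```

A cell with no failures always resamples to 100%, which is why a 24/24 cell collapses to a degenerate `[100, 100]` interval under this method. That interval reflects the resampling, not a guarantee about unseen tasks.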

What's new in v1.4.1

Router infrastructure fix (commit c7392db)

Adds three model-agnostic local-guard env vars in router/server.mjs:fetchLocalOllamaAsOpenAI():

ROUTER_LOCAL_NUM_PREDICT_CAP       default 4096   max gen tokens per local call
ROUTER_LOCAL_REQUEST_TIMEOUT_MS    default 180000 3-min per-request hard timeout
ROUTER_LOCAL_REPEAT_PENALTY        default 1.1    override weak model defaults

The problem was discovered when qwen3-coder's weak default repeat_penalty=1.05, combined with cline's missing max_tokens, caused a runaway repetition loop (34 MB streamed over 2h35m from a single HTTP request). The loop crashed Ollama, and every subsequent task then timed out. Full RCA in the release tarball.
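To make the three guards concrete, here is a hedged sketch of how such env vars could be turned into request options. The real logic lives in the Node.js router (`router/server.mjs`), so this Python version is an illustration only. The helper names and the clamping rule (a missing `max_tokens` falls back to the cap, a larger one is clamped to it) are assumptions. `num_predict` and `repeat_penalty` are standard Ollama generation options.

```python
import os

# Defaults as listed in the release notes.
DEFAULTS = {
    "ROUTER_LOCAL_NUM_PREDICT_CAP": "4096",
    "ROUTER_LOCAL_REQUEST_TIMEOUT_MS": "180000",
    "ROUTER_LOCAL_REPEAT_PENALTY": "1.1",
}


def load_local_guards(env=None):
    """Read the guard values from the environment, falling back to the defaults."""
    env = os.environ if env is None else env
    get = lambda name: env.get(name, DEFAULTS[name])
    return {
        "num_predict_cap": int(get("ROUTER_LOCAL_NUM_PREDICT_CAP")),
        "timeout_s": int(get("ROUTER_LOCAL_REQUEST_TIMEOUT_MS")) / 1000,
        "repeat_penalty": float(get("ROUTER_LOCAL_REPEAT_PENALTY")),
    }


def build_ollama_request(model, messages, max_tokens=None, env=None):
    """Return (payload, timeout_seconds) for a guarded local chat call.

    Hypothetical helper: it does not send anything, it only shows
    where each guard would apply.
    """
    guards = load_local_guards(env)
    cap = guards["num_predict_cap"]
    # Assumed rule: never let a client generate more than the cap,
    # and never leave generation unbounded when max_tokens is absent.
    num_predict = cap if max_tokens is None else min(max_tokens, cap)
    payload = {
        "model": model,
        "messages": messages,
        "stream": True,
        "options": {
            "num_predict": num_predict,
            "repeat_penalty": guards["repeat_penalty"],
        },
    }
    return payload, guards["timeout_s"]
```

With an empty environment, a client that sends no `max_tokens` (as cline did here) would be bounded at 4096 generated tokens and a 180-second request timeout, instead of streaming indefinitely.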

2 new canonical sweeps (936 rows)

  • configs/v1.4-canonical-qwen3-coder.yaml → 468 rows. qwen3-coder:30b (MoE coding specialist) across 3 agents × 4 strategies × 13 tasks × 3 seeds.
  • configs/v1.4-canonical-qwen3.6.yaml → 468 rows. qwen3.6:35b (dense generalist) across the same matrix.
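The row counts follow directly from the sweep matrix, which can be checked with a short sketch. The agent and strategy names come from these notes. The task ids are placeholders, and the 8 refactors + 5 puzzles split is inferred from the 24-run and 15-run cells at 3 seeds each, not stated explicitly in the release.

```python
from itertools import product

AGENTS = ["cline", "aider", "opencode"]
STRATEGIES = ["always-cloud", "always-local", "heuristic", "cascade"]
SEEDS = [42, 7, 13]

# Placeholder task ids (hypothetical); only the counts matter here.
TASKS = [f"refactors-{i}" for i in range(1, 9)] + [f"puzzles-{i}" for i in range(1, 6)]

# One row per (agent, strategy, task, seed) combination.
rows = list(product(AGENTS, STRATEGIES, TASKS, SEEDS))
rows_per_sweep = len(rows)   # 3 * 4 * 13 * 3
new_rows = 2 * rows_per_sweep  # two new sweeps in v1.4.1


def runs_per_cell(family):
    """Number of runs in one (agent, strategy, task family) cell."""
    return sum(1 for task in TASKS if task.startswith(family)) * len(SEEDS)


if __name__ == "__main__":
    print(rows_per_sweep, new_rows)
    print(runs_per_cell("refactors"), runs_per_cell("puzzles"))
```

This is why the headline cells have denominators of 24 (refactors) and 15 (puzzles), and why two sweeps add 936 rows. The remaining 1,644 - 936 = 708 rows are the v1.4.0 share of the combined total.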

Three new findings

  1. qwen3.6:35b is the unsung champion. cline + qwen3.6 + cascade nails 100% on refactors. cline + qwen3.6 + always-local nails 100% on puzzles. cline + qwen3.6 + heuristic = 92% on refactors at ~7% cloud spend.

  2. opencode is gemma4-specific. v1.4.0's opencode resurrection (71% on refactors heuristic with gemma4) doesn't transfer to qwen models (21-33%). opencode's runLoop requires clean tool_calls — gemma4 produces them, qwen variants don't reliably.

  3. Aider is model-sensitive. 96% on gemma4 refactors heuristic → 50% qwen3.6 → 33% qwen3-coder. Aider's architect/editor protocol favors gemma4's dense-generalist training profile.

Reproducibility

git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.4.1
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && pip install -e ".[dev]"
cp .env.example .env  # set OPEN_AI_API_KEY
# ollama pull takes one model per invocation, so pull each in turn
for m in gemma4:31b qwen3-coder:30b qwen3.6:35b; do ollama pull "$m"; done
./bench setup
./bench start --config configs/v1.4-canonical-qwen3.6.yaml \
    --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench status     # progress
./bench pause      # if you need the laptop
./bench resume     # picks up where it left off
./bench analyze results/runs/v1.4-canonical-qwen3.6
# headline: jq '.cells["refactors::cline::cascade"].pass_rate' \
#     results/runs/v1.4-canonical-qwen3.6/bootstrap_cis.json
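For readers without `jq`, the headline lookup can be sketched in Python. The JSON layout (`cells` → key → `pass_rate`) is taken from the `jq` expression above. The key order `family::agent::strategy` is inferred from that single example, and the helper name is hypothetical.

```python
import json
from pathlib import Path


def cell_pass_rate(run_dir, family, agent, strategy):
    """Look up one cell's pass rate in a run directory's bootstrap_cis.json.

    Assumption: keys follow the 'family::agent::strategy' pattern seen
    in the jq example; other fields in the file are not modeled here.
    """
    data = json.loads((Path(run_dir) / "bootstrap_cis.json").read_text())
    key = f"{family}::{agent}::{strategy}"
    return data["cells"][key]["pass_rate"]


if __name__ == "__main__":
    run_dir = "results/runs/v1.4-canonical-qwen3.6"
    print(cell_pass_rate(run_dir, "refactors", "cline", "cascade"))
```

A missing cell raises `KeyError`, which is usually preferable to silently printing `null` as `jq` would.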

Artifacts attached

  • results-v1.4.1.tar.gz (15 MB) — qwen3-coder + qwen3.6 sweep dirs (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts)
  • article.html — the v1.4.1 master article with code-generated charts (~10-min read covering v1.0 → v1.4.1)
  • qwen3-coder-timeout-rca.md — full root-cause analysis of the router infrastructure fix

Migration from v1.4.0

No code changes are required. The router fix is backwards-compatible (all three env vars default to safe values). Existing v1.4.0 sweeps that are re-run with the v1.4.1 router will be better protected against runaway local-model loops.