v1.4.1 — 3-model agentic leaderboard

Adds qwen3-coder:30b + qwen3.6:35b canonical sweeps to v1.4.0's gemma4 line. Combined v1.4 + v1.4.1: 1,644 rows.

Headline (the marquee cells)

Cell	Pass-rate	Cloud-fraction
cline + qwen3.6 + cascade + refactors	24/24 = 100% [100, 100]	low (~5-10%)
cline + qwen3.6 + heuristic + refactors	22/24 = 92%	~7%
cline + qwen3-coder + heuristic + refactors	22/24 = 92%	~7%
cline + qwen3.6 + always-local + puzzles	15/15 = 100%	0%
aider + gemma4 + heuristic + refactors (v1.4.0)	23/24 = 96% [88, 100]	48%
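The bracketed intervals in the table come from the sweep's bootstrap_cis.json. As a rough illustration of how such an interval can be computed, here is a minimal percentile-bootstrap sketch. It assumes a flat resample over the 24 individual runs; the function name `bootstrap_ci`, the resample count, and the seed are hypothetical, and the harness may resample differently (for example, clustered by task).

```python
import random


def bootstrap_ci(passes, total, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (in percent) for a pass rate.

    Hypothetical helper: resamples the pass/fail outcomes with
    replacement and reads off the alpha/2 and 1-alpha/2 percentiles.
    """
    rng = random.Random(seed)
    outcomes = [1] * passes + [0] * (total - passes)
    rates = sorted(
        sum(rng.choices(outcomes, k=total)) / total
        for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return 100 * lo, 100 * hi


# A cell with no failures can only ever resample to 100%,
# which is why the 24/24 row reports a degenerate interval.
print(bootstrap_ci(24, 24))  # → (100.0, 100.0)

# A cell with one failure gives a wider interval below 100%.
print(bootstrap_ci(23, 24))
```

Under these assumptions the 23/24 case should land near the reported [88, 100], but the exact lower bound depends on the resampling scheme and seed.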

What's new in v1.4.1

Router infrastructure fix (commit `c7392db`)

Three model-agnostic local-guard env vars in router/server.mjs:fetchLocalOllamaAsOpenAI():

ROUTER_LOCAL_NUM_PREDICT_CAP       default 4096   max gen tokens per local call
ROUTER_LOCAL_REQUEST_TIMEOUT_MS    default 180000 3-min per-request hard timeout
ROUTER_LOCAL_REPEAT_PENALTY        default 1.1    override weak model defaults

These guards were added after a runaway repetition loop: qwen3-coder's weak default repeat_penalty=1.05, combined with cline sending no max_tokens, let a single HTTP request stream 34 MB over 2h35m. The loop crashed Ollama, and every subsequent task then timed out. The full RCA is in the release tarball.
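To make the three guards concrete, the sketch below shows one way they could be read and applied to a local request. The env var names and defaults are taken from the list above; everything else is an assumption for illustration. The real logic lives in router/server.mjs (JavaScript), and `load_guards` and `guarded_options` are hypothetical names. `num_predict` and `repeat_penalty` are standard Ollama generation options.

```python
import os


def load_guards(env=None):
    """Read the three guard settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "num_predict_cap": int(env.get("ROUTER_LOCAL_NUM_PREDICT_CAP", 4096)),
        # Stored in milliseconds; converted to seconds for a client timeout.
        "timeout_s": int(env.get("ROUTER_LOCAL_REQUEST_TIMEOUT_MS", 180000)) / 1000,
        "repeat_penalty": float(env.get("ROUTER_LOCAL_REPEAT_PENALTY", 1.1)),
    }


def guarded_options(requested_max_tokens, guards):
    """Build Ollama-style options with the generation cap applied.

    Assumption: a request without max_tokens (the cline case) gets the
    cap itself, and a larger request is clamped down to the cap.
    """
    cap = guards["num_predict_cap"]
    if requested_max_tokens is None:
        num_predict = cap
    else:
        num_predict = min(requested_max_tokens, cap)
    return {
        "num_predict": num_predict,
        # Assumption: the router always sends its own penalty,
        # replacing the model's weaker built-in default.
        "repeat_penalty": guards["repeat_penalty"],
    }


guards = load_guards({})  # empty env: documented defaults apply
print(guards["timeout_s"])                 # → 180.0
print(guarded_options(None, guards))       # uncapped client request
print(guarded_options(100_000, guards))    # oversized request, clamped
```

With the defaults, a request that omits max_tokens is limited to 4096 generated tokens and 180 seconds, which bounds the kind of loop described above.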

2 new canonical sweeps (936 rows)

configs/v1.4-canonical-qwen3-coder.yaml → 468 rows. qwen3-coder:30b (MoE coding specialist) across 3 agents × 4 strategies × 13 tasks × 3 seeds.
configs/v1.4-canonical-qwen3.6.yaml → 468 rows. qwen3.6:35b (dense generalist) across the same matrix.
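The row counts follow directly from the sweep matrix. The short check below enumerates it. The agent names are the three that appear in these notes, and the strategies and seeds are the ones passed to ./bench start; the task IDs are placeholders, since the 13 task names are not listed here.

```python
from itertools import product

agents = ["cline", "aider", "opencode"]
strategies = ["always-cloud", "always-local", "heuristic", "cascade"]
tasks = [f"task_{i:02d}" for i in range(13)]  # placeholder IDs, not real task names
seeds = [42, 7, 13]

# One row per (agent, strategy, task, seed) combination.
rows_per_sweep = len(list(product(agents, strategies, tasks, seeds)))
new_rows = 2 * rows_per_sweep   # qwen3-coder sweep + qwen3.6 sweep
v140_rows = 1644 - new_rows     # rows implied for the v1.4.0 gemma4 line

print(rows_per_sweep)  # → 468
print(new_rows)        # → 936
print(v140_rows)       # → 708
```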

Three new findings

qwen3.6:35b is the unsung champion. cline + qwen3.6 + cascade nails 100% on refactors. cline + qwen3.6 + always-local nails 100% on puzzles. cline + qwen3.6 + heuristic = 92% on refactors at ~7% cloud spend.
opencode is gemma4-specific. v1.4.0's opencode resurrection (71% on refactors heuristic with gemma4) doesn't transfer to qwen models (21-33%). opencode's runLoop requires clean tool_calls — gemma4 produces them, qwen variants don't reliably.
Aider is model-sensitive. 96% on gemma4 refactors heuristic → 50% qwen3.6 → 33% qwen3-coder. Aider's architect/editor protocol favors gemma4's dense-generalist training profile.

Reproducibility

git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.4.1
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && pip install -e ".[dev]"
cp .env.example .env  # set OPEN_AI_API_KEY
for m in gemma4:31b qwen3-coder:30b qwen3.6:35b; do ollama pull "$m"; done  # ollama pull takes one model per call
./bench setup
./bench start --config configs/v1.4-canonical-qwen3.6.yaml \
    --strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./bench status     # progress
./bench pause      # if you need the laptop
./bench resume     # picks up where it left off
./bench analyze results/runs/v1.4-canonical-qwen3.6
# headline: jq '.cells["refactors::cline::cascade"].pass_rate' \
#     results/runs/v1.4-canonical-qwen3.6/bootstrap_cis.json
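If jq is not installed, the same headline lookup can be done in Python. This is a minimal sketch that assumes only the JSON path used in the jq command above (`cells` → cell key → `pass_rate`); the helper name `headline_pass_rate` is hypothetical and the rest of the file's schema is not assumed.

```python
import json
from pathlib import Path


def headline_pass_rate(run_dir, cell="refactors::cline::cascade"):
    """Return the pass_rate stored for one cell in bootstrap_cis.json."""
    path = Path(run_dir) / "bootstrap_cis.json"
    with path.open() as fh:
        data = json.load(fh)
    return data["cells"][cell]["pass_rate"]


if __name__ == "__main__":
    run_dir = "results/runs/v1.4-canonical-qwen3.6"
    # Only attempt the lookup once the analyze step has produced the file.
    if (Path(run_dir) / "bootstrap_cis.json").exists():
        print(headline_pass_rate(run_dir))
```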

Artifacts attached

results-v1.4.1.tar.gz (15 MB) — qwen3-coder + qwen3.6 sweep dirs (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts)
article.html — the v1.4.1 master article with code-generated charts (~10-min read covering v1.0 → v1.4.1)
qwen3-coder-timeout-rca.md — full root-cause analysis of the router infrastructure fix

Migration from v1.4.0

No code changes are required. The router fix is backwards-compatible: all three env vars default to safe values. Existing v1.4.0 sweeps re-run with the v1.4.1 router are protected against runaway local-model loops.
