v1.4.0 — agent-only release (708 rows, 3 sweeps, gemma4)
v1.4.0 — cleanup + production-pipeline release
The single canonical release for hybrid coding agents on local hardware.
After 7 prior releases (v1.0.0–v1.3.0), v1.4 narrows the project to agent-only scope (deletes the legacy R1–R5 non-agentic routes + the HumanEval+/BigCodeBench/custom-arch benchmarks), adds Claude Code + Cline runners, ships 5 production-grade lifecycle commands (./bench start|pause|resume|stop|status), and runs the canonical 3-sweep dataset against the gemma4:31b local model.
Headline numbers (708 rows, ~20h wall, $90.48 cloud spend on M4 Max 64GB)
| Cell | Pass-rate | Cloud-fraction | Notes |
|---|---|---|---|
| aider + heuristic + refactors | 23/24 = 96% [88, 100] | 48% | The marquee Pareto win — replicates v1.3.0 |
| aider + always-cloud + refactors | 24/24 = 100% [100, 100] | 100% | Cloud ceiling |
| cline + always-local + puzzles | 15/15 = 100% | 0% | First 30B local-only result that nails Exercism Python |
| opencode + heuristic + refactors | 17/24 = 71% | 46% | Resurrection: vs v1.1.x's 0/15 |
| cascade (all (agent, task-class) cells) | ≤ heuristic | varies | cascade is dead in agentic regime |
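
The bracketed intervals in the table are bootstrap confidence intervals on the pass rate. As a rough illustration of how such an interval can be derived from a pass count, here is a minimal percentile-bootstrap sketch. The function name, the resample count, and the 95% level are assumptions made for illustration; the project's own analysis code may use a different bootstrap variant.

```python
import random


def bootstrap_pass_rate_ci(passes, total, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for a pass rate (hypothetical helper).

    Returns (point_estimate, lower, upper) as fractions in [0, 1].
    """
    rng = random.Random(seed)
    # One entry per row: 1 for a pass, 0 for a fail.
    outcomes = [1] * passes + [0] * (total - passes)
    # Resample the rows with replacement and record each resampled pass rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=total)) / total
        for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return passes / total, lower, upper


point, lower, upper = bootstrap_pass_rate_ci(23, 24)
print(f"{point:.0%} [{lower:.0%}, {upper:.0%}]")
```

A cell with 24/24 passes has no failing rows to resample, so this procedure collapses to a zero-width interval at 100%, which is consistent with the always-cloud row.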
What changed in v1.4
Cleanup (the big one — 82 files deleted)
- Deleted non-agentic code: R1-R5 runners (cloud-only, local-only, hybrid-architect, Stanford Minion, Stanford DevMinion), HumanEval+ adapter, BigCodeBench-Hard adapter, custom-arch adapter, LLM-judge scorer, the entire `configs/variants/` directory (32 legacy YAMLs), `vendor/minions/`, `cli/{judge,rejudge,rescore,report}.py`.
- Schema rename: `BenchmarkConfig.routes` → `agents`, `categories` → `task_classes`. `TaskPlan.route` → `agent`. Drop the `Rn` prefix throughout the codebase. The agent names are `aider`, `opencode`, `mini-swe-agent`, `claude-code`, `cline`.
- Directory rename: `runners/` → `agents/`, `benchmarks/` → `tasks/`. Sub-renames: `exercism_python/` → `puzzles/`, `real_dev/` → `refactors/`, `swebench_verified/` → `real_prs/`.
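
Scripts that still hold pre-rename paths can be updated with a small prefix map. The sketch below is illustrative only: `migrate_path` is a hypothetical helper, and it assumes that all three sub-renamed directories lived directly under `benchmarks/` (the notes confirm this only for `real_dev/`).

```python
# Old -> new path prefixes, taken from the rename list above.
# More specific prefixes come first so they win over the generic ones.
RENAMES = [
    ("benchmarks/exercism_python/", "tasks/puzzles/"),          # assumed parent dir
    ("benchmarks/real_dev/", "tasks/refactors/"),
    ("benchmarks/swebench_verified/", "tasks/real_prs/"),       # assumed parent dir
    ("benchmarks/", "tasks/"),
    ("runners/", "agents/"),
]


def migrate_path(path: str) -> str:
    """Rewrite a repo-relative v1.3 path to its v1.4 location (hypothetical helper)."""
    for old, new in RENAMES:
        if path.startswith(old):
            return new + path[len(old):]
    # Paths outside the renamed directories are returned unchanged.
    return path


print(migrate_path("benchmarks/real_dev/fixtures/"))
```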
Added
- Two new agents: `claude-code` (Anthropic Claude CLI; runs always-cloud-direct in v1.4; Anthropic-compat router shim deferred to v1.5) and `cline` (Apache-2.0 Plan/Act agent; talks LiteLLM-compat → our router cleanly).
- `phase-aware` routing strategy: deterministic Aider role-marker split (architect→cloud, editor→local). Falls through to legacy heuristic for non-Aider agents. 8 strategies total now (`always-local`, `always-cloud`, `rules`, `heuristic`, `llm-classifier`, `embedding-knn`, `cascade`, `phase-aware`).
- 5 production lifecycle commands in `./bench`:
  - `bench start --config <yaml> --strategies ... --seeds ...`: spawn detached, auto-starts Ollama, writes `/tmp/hcev-sweep.json`
  - `bench pause`: kill orchestrator + agents + router; keep Ollama for fast resume
  - `bench resume`: relaunch with `--resume` to skip rows already in `raw.jsonl`
  - `bench stop [--keep-ollama-app] [--clear-state]`: like pause + kill Ollama (frees ~19 GB)
  - `bench status`: show RUNNING / PAUSED / NO SWEEP with PID, log path, row count
- `bench sweep --resume` flag + automatic router lifecycle from `models.local` in the config (eliminates the manual `cd router && ./start.sh` step from earlier reproducers).
- `/api/tags` stub on the router so cline's Ollama-listing probe gets a clean 200 instead of a noisy 404.
- 5 production v1.4 configs under `configs/`: `v1.4-smoke.yaml`, `v1.4-canonical-gemma4.yaml`, `v1.4-opencode-fairness.yaml`, `v1.4-strategy-sweep.yaml`, `v1.4-real-prs.yaml`, plus 2 queued (`v1.4-canonical-qwen3-coder.yaml`, `v1.4-canonical-qwen3.6.yaml`) for v1.4.1.
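
The phase-aware strategy described above can be pictured as a small dispatch function. This is a sketch under stated assumptions, not the router's actual code: the function names, the way the role marker reaches the router, the toy fallback heuristic, and the choice to send Aider requests with an unrecognised role to the heuristic are all hypothetical.

```python
from typing import Callable, Optional

# Role split stated in the release notes: architect -> cloud, editor -> local.
ROLE_TO_TARGET = {"architect": "cloud", "editor": "local"}


def phase_aware_route(
    agent: str,
    role: Optional[str],
    prompt: str,
    heuristic: Callable[[str], str],
) -> str:
    """Return "cloud" or "local" for one request (hypothetical signature)."""
    # Deterministic split applies only to Aider requests carrying a known role marker.
    if agent == "aider" and role in ROLE_TO_TARGET:
        return ROLE_TO_TARGET[role]
    # Everything else falls through to the heuristic strategy.
    return heuristic(prompt)


def toy_heuristic(prompt: str) -> str:
    """Stand-in for the real heuristic: long prompts go to the cloud model."""
    return "cloud" if len(prompt) > 2000 else "local"


print(phase_aware_route("aider", "architect", "plan the refactor", toy_heuristic))
print(phase_aware_route("cline", None, "rename a variable", toy_heuristic))
```

Keeping the Aider branch free of any model call is what makes the split deterministic: the same agent and role always produce the same routing decision.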
Fixed
- Reproducibility audit blockers (10 items from v1.3 audit): README rewrite for v1.4, `requirements.txt` += pydantic + pyyaml, auto-spawn router in `bench sweep`, fresh CI workflow (no longer needs vendor/minions clone), docs rewrites for `REPRODUCING.md` + `BENCHMARK_NEW_MODEL.md`.
- Phase 2 rename fallout: 4 agent runners had hardcoded `benchmarks/real_dev/fixtures/` paths in `_REAL_DEV_FIXTURES_ROOT` constants; these now point to `tasks/refactors/fixtures/`.
What's still pending (v1.4.1)
- qwen3-coder:30b + qwen3.6:35b canonical sweeps: configs ready at `configs/v1.4-canonical-qwen3-coder.yaml` and `configs/v1.4-canonical-qwen3.6.yaml`; multi-model orchestrator script at `personal/scripts/v1.4-multi-model-orchestrator.sh`. Estimated +12h compute, +$30 cloud. Will land as v1.4.1 with an updated 3-model leaderboard.
- SWE-bench Verified real-PR replay: config at `configs/v1.4-real-prs.yaml` but blocked on local Docker availability for the SWE-bench testbed. Deferred to v1.5.
Reproducibility
# Fresh clone + setup (~10 min)
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.4.0
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && pip install -e ".[dev]"
cp .env.example .env # set OPEN_AI_API_KEY
ollama pull gemma4:31b # ~19 GB
./bench setup # installs aider + opencode + cline; verifies Ollama + Docker
# Run the canonical gemma4 sweep (~10h on M4 Max 64GB, ~$70 cloud)
./bench start --config configs/v1.4-canonical-gemma4.yaml \
--strategies always-cloud,always-local,heuristic,cascade \
--seeds 42,7,13
# Monitor / pause / resume
./bench status # show progress
./bench pause # if you need the laptop
./bench resume # picks up where it left off
# After sweep completes, analyze
./bench analyze results/runs/v1.4-canonical-gemma4
# headline lives in: bootstrap_cis.json["cells"]["refactors::aider::heuristic"]["pass_rate"]
Artifacts attached to this release
- `results-v1.4.0.tar.gz` (7.8 MB): bundle of all 3 sweep result directories (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts)
- `v1.4.0-article.html` (~6000 words, dark-mode publishable report)
- `findings.md`: technical diagnostic write-up with per-cell bootstrap CIs
- `deep-analysis.md`: deep-dive analysis of the 708 rows (per-task heatmaps, tail-latency analysis, routing-decision quality, etc.)
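
The headline cell named in the reproducibility steps can be read back from a result directory with a few lines of Python. The sketch below relies only on the key layout quoted there (`cells` → `task_class::agent::strategy` → `pass_rate`). The helper name is hypothetical, and it assumes `bootstrap_cis.json` sits directly inside the run directory; the shape of the value stored under `pass_rate` is not specified in these notes, so the function returns it unmodified.

```python
import json
from pathlib import Path


def read_cell(run_dir, task_class, agent, strategy, field="pass_rate"):
    """Return one field of one cell from bootstrap_cis.json (hypothetical helper)."""
    # Assumption: the file lives at the top level of the run directory.
    data = json.loads((Path(run_dir) / "bootstrap_cis.json").read_text())
    # Cell keys follow the "task_class::agent::strategy" pattern quoted above.
    key = f"{task_class}::{agent}::{strategy}"
    return data["cells"][key][field]
```

For the canonical sweep this would be called as `read_cell("results/runs/v1.4-canonical-gemma4", "refactors", "aider", "heuristic")`.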
Migration from v1.3.0
If you have v1.3 variant configs (configs/variants/2{8,9,30}-v1.3-*.yaml), note that they are deleted in v1.4. Use the v1.4 equivalents under `configs/`. The new schema uses `agents:` instead of `routes:`, `task_classes:` instead of `categories:`, and `BenchmarkConfig.tasks_per_class` instead of `tasks_per_category`. Variant configs are also simpler: there is no `_template.yaml` wrapper, just standalone YAMLs at the top of `configs/`.
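
For a hand-maintained v1.3 config, the key renames listed above amount to a simple mapping. The following is a minimal sketch that operates on an already-parsed config dictionary. It assumes the three keys appear at the top level of the config, which these notes do not confirm; the helper name and sample values are hypothetical, and values (including any `Rn`-prefixed route names) are left untouched.

```python
# v1.3 key -> v1.4 key, as listed in the migration notes.
KEY_RENAMES = {
    "routes": "agents",
    "categories": "task_classes",
    "tasks_per_category": "tasks_per_class",
}


def migrate_config(cfg: dict) -> dict:
    """Return a copy of a parsed v1.3 config with top-level keys renamed."""
    out = {}
    for key, value in cfg.items():
        new_key = KEY_RENAMES.get(key, key)
        if new_key in out:
            # Both the old and the new spelling are present; refuse to guess.
            raise ValueError(f"conflicting keys for {new_key!r}")
        out[new_key] = value
    return out


# Placeholder values, for illustration only.
old = {"routes": ["aider"], "categories": ["refactors"], "tasks_per_category": 8}
print(migrate_config(old))
```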