v1.4.0 — cleanup + production-pipeline release

The single canonical release for hybrid coding agents on local hardware.

After 7 prior releases (v1.0.0–v1.3.0), v1.4 narrows the project to agent-only scope (deletes the legacy R1–R5 non-agentic routes + the HumanEval+/BigCodeBench/custom-arch benchmarks), adds Claude Code + Cline runners, ships 5 production-grade lifecycle commands (./bench start|pause|resume|stop|status), and runs the canonical 3-sweep dataset against the gemma4:31b local model.

Headline numbers (708 rows, ~20h wall-clock on M4 Max 64GB, $90.48 cloud spend)

Cell	Pass-rate	Cloud-fraction	Notes
aider + heuristic + refactors	23/24 = 96% [88, 100]	48%	The marquee Pareto win — replicates v1.3.0
aider + always-cloud + refactors	24/24 = 100% [100, 100]	100%	Cloud ceiling
cline + always-local + puzzles	15/15 = 100%	0%	First 30B local-only result that nails Exercism Python
opencode + heuristic + refactors	17/24 = 71%	46%	Resurrection: vs v1.1.x's 0/15
cascade (all (agent, task-class) cells)	≤ heuristic	varies	cascade is dead in agentic regime
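
The bracketed ranges in the table appear to be the per-cell bootstrap confidence intervals that the analysis step writes to bootstrap_cis.json. As a rough illustration of how such an interval can be derived from per-row pass/fail outcomes, here is a minimal percentile-bootstrap sketch. The function name, the resample count, and the 95% level are assumptions made for this example; the project's actual analysis code may differ.

```python
import random


def bootstrap_pass_rate_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a pass rate.

    outcomes: list of 1 (pass) / 0 (fail), one entry per row in a cell.
    Returns (low, high) as fractions in [0, 1].
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(outcomes)
    # Resample the rows with replacement and record each resample's pass rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    # Read the interval off the sorted resampled rates.
    low = rates[int(alpha / 2 * n_resamples)]
    high = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Example: a cell with 23 passes out of 24 rows (assumed row layout).
cell = [1] * 23 + [0]
low, high = bootstrap_pass_rate_ci(cell)
print(f"{sum(cell)}/{len(cell)} pass, 95% CI [{low:.0%}, {high:.0%}]")
```

One property of this method is visible in the table: a cell where every row passes can only produce all-pass resamples, so its interval collapses to [100, 100].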

What changed in v1.4

Cleanup (the big one — 82 files deleted)

Deleted non-agentic code: R1-R5 runners (cloud-only, local-only, hybrid-architect, Stanford Minion, Stanford DevMinion), HumanEval+ adapter, BigCodeBench-Hard adapter, custom-arch adapter, LLM-judge scorer, the entire configs/variants/ directory (32 legacy YAMLs), vendor/minions/, cli/{judge,rejudge,rescore,report}.py.
Schema rename: BenchmarkConfig.routes → agents, categories → task_classes. TaskPlan.route → agent. The Rn prefix is dropped throughout the codebase. The agent names are aider, opencode, mini-swe-agent, claude-code, cline.
Directory rename: runners/ → agents/, benchmarks/ → tasks/. Sub-renames: exercism_python/ → puzzles/, real_dev/ → refactors/, swebench_verified/ → real_prs/.

Added

Two new agents: claude-code (Anthropic Claude CLI; runs always-cloud-direct in v1.4 — Anthropic-compat router shim deferred to v1.5) and cline (Apache-2.0 Plan/Act agent; talks LiteLLM-compat → our router cleanly).
phase-aware routing strategy: deterministic Aider role-marker split (architect→cloud, editor→local). Falls through to legacy heuristic for non-Aider agents. 8 strategies total now (always-local, always-cloud, rules, heuristic, llm-classifier, embedding-knn, cascade, phase-aware).
5 production lifecycle commands in ./bench:
- bench start --config <yaml> --strategies ... --seeds ... — spawn detached, auto-starts Ollama, writes /tmp/hcev-sweep.json
- bench pause — kill orchestrator + agents + router; keep Ollama for fast resume
- bench resume — relaunch with --resume to skip rows already in raw.jsonl
- bench stop [--keep-ollama-app] [--clear-state] — like pause + kill Ollama (frees ~19 GB)
- bench status — show RUNNING / PAUSED / NO SWEEP with PID, log path, row count
bench sweep --resume flag + automatic router lifecycle from models.local in the config (eliminates the manual cd router && ./start.sh step from earlier reproducers).
/api/tags stub on the router so cline's Ollama-listing probe gets a clean 200 instead of a noisy 404.
5 production v1.4 configs under configs/: v1.4-smoke.yaml, v1.4-canonical-gemma4.yaml, v1.4-opencode-fairness.yaml, v1.4-strategy-sweep.yaml, v1.4-real-prs.yaml, plus 2 queued (v1.4-canonical-qwen3-coder.yaml, v1.4-canonical-qwen3.6.yaml) for v1.4.1.
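
To make the phase-aware strategy listed above concrete, the following sketch shows one way the architect→cloud, editor→local split with a heuristic fall-through could look. It is not the router's actual code: the role strings, the function signature, the treatment of unmarked Aider calls, and the toy heuristic are all hypothetical stand-ins.

```python
from typing import Callable, Optional

# Hypothetical role labels; how the router detects Aider's role markers
# is not described in these notes.
PHASE_TARGETS = {"architect": "cloud", "editor": "local"}


def phase_aware_route(
    agent: str,
    role: Optional[str],
    heuristic: Callable[[str], str],
    prompt: str,
) -> str:
    """Return "cloud" or "local" for one request."""
    if agent == "aider" and role in PHASE_TARGETS:
        # Deterministic split: planning goes to cloud, editing stays local.
        return PHASE_TARGETS[role]
    # Non-Aider agents fall through to the heuristic strategy. Sending
    # unmarked Aider calls the same way is an assumption of this sketch.
    return heuristic(prompt)


def toy_heuristic(prompt: str) -> str:
    # Placeholder for the legacy heuristic: long prompts go to cloud.
    return "cloud" if len(prompt) > 2000 else "local"


print(phase_aware_route("aider", "architect", toy_heuristic, "plan the refactor"))
print(phase_aware_route("aider", "editor", toy_heuristic, "apply the edit"))
print(phase_aware_route("cline", None, toy_heuristic, "fix the failing test"))
```

Keeping the Aider branch as a plain lookup is what makes the split deterministic: the same role always maps to the same backend, regardless of prompt content.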

Fixed

Reproducibility audit blockers (10 items from v1.3 audit): README rewrite for v1.4, requirements.txt += pydantic + pyyaml, auto-spawn router in bench sweep, fresh CI workflow (no longer needs vendor/minions clone), docs rewrites for REPRODUCING.md + BENCHMARK_NEW_MODEL.md.
Phase 2 rename fallout: 4 agent runners had hardcoded benchmarks/real_dev/fixtures/ paths in _REAL_DEV_FIXTURES_ROOT constants — updated to tasks/refactors/fixtures/.

What's still pending (v1.4.1)

qwen3-coder:30b + qwen3.6:35b canonical sweeps: configs ready at configs/v1.4-canonical-qwen3-coder.yaml and configs/v1.4-canonical-qwen3.6.yaml; multi-model orchestrator script at personal/scripts/v1.4-multi-model-orchestrator.sh. Estimated +~12h compute, +~$30 cloud. Will land as v1.4.1 with an updated 3-model leaderboard.
SWE-bench Verified real-PR replay — config at configs/v1.4-real-prs.yaml but blocked on local Docker availability for the SWE-bench testbed. Deferred to v1.5.

Reproducibility

# Fresh clone + setup (~10 min)
git clone https://github.com/RunanywhereAI/hybrid-coding-eval
cd hybrid-coding-eval && git checkout v1.4.0
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && pip install -e ".[dev]"
cp .env.example .env  # set OPEN_AI_API_KEY
ollama pull gemma4:31b   # ~19 GB
./bench setup           # installs aider + opencode + cline; verifies Ollama + Docker

# Run the canonical gemma4 sweep (~10h on M4 Max 64GB, ~$70 cloud)
./bench start --config configs/v1.4-canonical-gemma4.yaml \
    --strategies always-cloud,always-local,heuristic,cascade \
    --seeds 42,7,13

# Monitor / pause / resume
./bench status         # show progress
./bench pause          # if you need the laptop
./bench resume         # picks up where it left off

# After sweep completes, analyze
./bench analyze results/runs/v1.4-canonical-gemma4
# headline lives in: bootstrap_cis.json["cells"]["refactors::aider::heuristic"]["pass_rate"]
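
The headline lookup mentioned in the last comment can be scripted. The sketch below assumes that ./bench analyze writes bootstrap_cis.json into the run directory it was given, and it makes no assumption about the shape of the pass_rate value (it may be a number or a nested object). The helper name is hypothetical.

```python
import json
from pathlib import Path


def headline_pass_rate(run_dir, cell="refactors::aider::heuristic"):
    """Return the pass_rate entry for one cell of bootstrap_cis.json."""
    data = json.loads((Path(run_dir) / "bootstrap_cis.json").read_text())
    return data["cells"][cell]["pass_rate"]


if __name__ == "__main__":
    # Assumed location: the same directory passed to ./bench analyze.
    run_dir = Path("results/runs/v1.4-canonical-gemma4")
    if (run_dir / "bootstrap_cis.json").exists():
        print(headline_pass_rate(run_dir))
    else:
        print(f"no bootstrap_cis.json under {run_dir}; run ./bench analyze first")
```

Passing a different cell key in the same task-class::agent::strategy form should return other cells, assuming they follow the same naming pattern as the headline cell.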

Artifacts attached to this release

results-v1.4.0.tar.gz (7.8 MB) — bundle of all 3 sweep result directories (raw.jsonl + aggregate.json + bootstrap_cis.json + decision_matrix.md + charts)
v1.4.0-article.html (~6000 words, dark-mode publishable report)
findings.md — technical diagnostic write-up with per-cell bootstrap CIs
deep-analysis.md — deep-dive analysis of the 708 rows (per-task heatmaps, tail-latency analysis, routing-decision quality, etc.)

Migration from v1.3.0

The v1.3 variant configs (configs/variants/{28,29,30}-v1.3-*.yaml) are deleted in v1.4; use the v1.4 equivalents under configs/. The new schema uses agents: instead of routes:, task_classes: instead of categories:, and BenchmarkConfig.tasks_per_class instead of tasks_per_category. Variant configs are simpler: no _template.yaml wrapper, just standalone YAMLs at the top of configs/.
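
For a custom config that still uses the old key names, the renames can be applied mechanically. This is a minimal sketch that assumes the renamed keys sit at the top level of the parsed config; the helper name and the sample values are hypothetical, and it does not touch values such as old route or category names, which may also need updating by hand.

```python
# Old-key -> new-key mapping taken from the migration notes.
KEY_RENAMES = {
    "routes": "agents",
    "categories": "task_classes",
    "tasks_per_category": "tasks_per_class",
}


def migrate_config(old: dict) -> dict:
    """Return a copy of a v1.3-style config dict with top-level keys renamed."""
    return {KEY_RENAMES.get(key, key): value for key, value in old.items()}


# Hypothetical v1.3-style config; the values are illustrative only.
old_cfg = {"routes": ["aider"], "categories": ["refactors"], "tasks_per_category": 8}
print(migrate_config(old_cfg))
```

Keys that are not in the mapping pass through unchanged, so the helper is safe to run on a config that has already been migrated.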
