v1.5.1: open-source polish
Open-source polish release. Addresses every finding from the pre-publish audit pass — security, licensing, UX, hygiene. No code-behaviour changes; safe to take.
Full release notes: docs/release-notes/v1.5.1.md.
What changed
Licensing simplified
- Deleted NOTICE.md, LICENSE-DATA, LICENSE.md.
- Single LICENSE (MIT) now covers code, data, charts, and docs prose.
- Citation request lives in the README's bibtex block.
Documentation rewritten
- README.md: six-cell headline table, full prereq + quickstart with realistic time/cost estimates, "picking a config for real work" section distilled from the v1.5 leaderboard, full bench CLI table.
- AGENTS.md: refreshed for v1.5.0. D6 task class documented, v1.5 configs added to the tree, conventions reflect that single-letter codes (A/B/D) are retired.
- CODE_OF_CONDUCT.md: short and direct.
- Source-tree docstrings + per-task READMEs: lib.* rewritten to core.*; "Category D / B / X" rewritten to refactors / real-prs / puzzles end-to-end.
UX cleanup
- Deleted scripts/reproduce.sh: ./bench setup already does prereq checks, and the smoke run is ./bench sweep --config configs/v1.4-smoke.yaml.
- Deleted logs/v3.3/: historical sweep logs moved out of git. logs/ is now gitignored.
Hygiene
- __version__ bumped 0.1.0 → 1.5.1 (it was stuck at 0.1.0).
- pytest moved from runtime deps to [dev] extras.
- Removed unused pytest -m slow filter from CI + docs.
- .github/ISSUE_TEMPLATE/new_model.md: broken configs/variants/_template.yaml reference fixed.
- docs/HYBRID_ROUTING_DESIGN.md + v1.4.{0,1} release notes: jq snippets updated from legacy D::cline::heuristic to refactors::cline::heuristic.
- Test aliases r10_cline, r6_mini_swe_agent renamed.
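
For anyone with older result files that still carry the legacy prefixes, the rename can be sketched as below. This is an illustration, not code from the repo: migrate_key and LEGACY_CATEGORIES are hypothetical names, and only the D to refactors mapping is stated in these notes. The B and X entries are assumed from the order in which the categories are listed.

```python
# Legacy single-letter category codes mapped to the names that replaced them.
# Only D -> refactors is confirmed by the release notes; the B and X entries
# are assumptions inferred from the listing order.
LEGACY_CATEGORIES = {
    "D": "refactors",
    "B": "real-prs",  # assumption
    "X": "puzzles",   # assumption
}


def migrate_key(key: str) -> str:
    """Rewrite e.g. 'D::cline::heuristic' to 'refactors::cline::heuristic'."""
    category, sep, rest = key.partition("::")
    if not sep:
        return key  # no '::' separator, so not a category-prefixed key
    # Unknown or already-migrated categories pass through unchanged.
    return LEGACY_CATEGORIES.get(category, category) + sep + rest
```

Keys that already use the new names are returned as-is, so the function is safe to run more than once over the same data.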
Privacy
- Sanitized 263 absolute-path leaks (/Users/<owner>/...) in tracked raw.jsonl / progress.log. JSON re-validated on every row (520 rows, 0 parse errors).
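
A sanitize-then-revalidate pass of this kind could look roughly like the sketch below. It is not the script used for this release: the regex, the function names, and the assumption that every leak has the form /Users/name/... are illustrative.

```python
import json
import re

# Assumption: leaks are macOS home directories of the form /Users/<name>/...
# The lookahead skips paths that already carry the placeholder.
HOME_PATH = re.compile(r'/Users/(?!<owner>)[^/\s"]+')
PLACEHOLDER = "/Users/<owner>"


def sanitize_line(line: str) -> tuple[str, int]:
    """Replace home-directory prefixes; return (new_line, replacement_count)."""
    return HOME_PATH.subn(PLACEHOLDER, line)


def sanitize_jsonl(lines: list[str]) -> tuple[list[str], int, int]:
    """Sanitize each row, then re-parse it to confirm it is still valid JSON.

    Returns (clean_rows, total_replacements, parse_errors).
    """
    clean, replaced, errors = [], 0, 0
    for line in lines:
        new_line, hits = sanitize_line(line)
        replaced += hits
        try:
            json.loads(new_line)
        except json.JSONDecodeError:
            errors += 1
        clean.append(new_line)
    return clean, replaced, errors
```

Re-parsing every row after the substitution is what guards against a replacement that accidentally breaks a JSON string.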
Verification
- 120 fast tests pass on Python 3.11 + 3.12 (CI matrix).
- ruff check src/ tests/ is clean.
Citation
If you use this benchmark, a citation would be really appreciated. BibTeX in the README.
📦 Dataset
results-v1.5.1.tar.gz is byte-identical to the v1.5.0 dataset — v1.5.1 added 0 new benchmark rows (it is an open-source-polish release). It is attached here so visitors landing on the Latest release can download the data directly. The canonical 1,704-row dataset is unchanged since v1.5.0.
gh release download v1.5.1 -p results-v1.5.1.tar.gz # or v1.5.0 — same bytes
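
To confirm locally that two downloaded archives really are the same bytes, comparing SHA-256 digests is enough. The sketch below assumes both tarballs are already on disk. The helper names are hypothetical, and the file paths in the usage note are placeholders for wherever the downloads were saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large archives are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(a: Path, b: Path) -> bool:
    """True when both files have identical SHA-256 digests."""
    return sha256_of(a) == sha256_of(b)
```

For example, same_bytes(Path("v1.5.0/results.tar.gz"), Path("v1.5.1/results.tar.gz")) should return True if the two releases ship the same dataset.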