Releases: JohnYCChiang/holon-bench
v0.1.0-alpha
Holon-Bench v0.1.0-alpha
First public pre-release of Holon-Bench, an open-source benchmark harness for evaluating AI coding agents on maintainer-style workflows.
What is included
- 9 benchmark tracks: Python tool engineering, Rust core, Rust Bevy, Rust porting, Go core, Go game server, Flutter cross-platform, graph memory workflow, repair needed
- Phase 1: 35 cases (5 per track) validating runner/scorer/report plumbing
- Deterministic runners with verifier-feedback repair loop support
- Repair cost metrics: `first_pass`, `repaired_pass`, `repair_tax_rate`
- Hidden and mutation verifier architecture (Phase 2+)
- JSON schemas for cases, results, scores, and failures
- GitHub Actions CI: schema check, `py_compile`, smoke test
- Minimal example case for onboarding new contributors
- OSS maintainer use case documentation
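
To illustrate how a verifier-feedback repair loop and the repair cost metrics listed above might fit together, here is a minimal sketch. It is not taken from the Holon-Bench codebase: `run_case`, `repair_metrics`, `CaseResult`, the `max_repairs` parameter, and the toy agents are all hypothetical names. In particular, the definition of `repair_tax_rate` used here (the share of passing cases that needed at least one repair round) is an assumption; the release notes do not define it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces, not the Holon-Bench API:
# an agent maps (task, verifier feedback or None) to a candidate patch,
# a verifier maps a patch to (passed, feedback message).
Agent = Callable[[str, Optional[str]], str]
Verifier = Callable[[str], Tuple[bool, str]]


@dataclass
class CaseResult:
    case_id: str
    attempts: int  # number of attempts made, including the first
    passed: bool


def run_case(case_id: str, task: str, agent: Agent, verifier: Verifier,
             max_repairs: int = 2) -> CaseResult:
    """Run one case: a first attempt plus up to max_repairs repair rounds."""
    feedback: Optional[str] = None
    for attempt in range(1, max_repairs + 2):
        patch = agent(task, feedback)
        passed, feedback = verifier(patch)
        if passed:
            return CaseResult(case_id, attempt, True)
    return CaseResult(case_id, max_repairs + 1, False)


def repair_metrics(results: List[CaseResult]) -> Dict[str, float]:
    """Aggregate repair cost metrics over a list of case results."""
    first_pass = sum(1 for r in results if r.passed and r.attempts == 1)
    repaired_pass = sum(1 for r in results if r.passed and r.attempts > 1)
    total_pass = first_pass + repaired_pass
    # ASSUMED definition: fraction of passing cases that required repair.
    repair_tax_rate = repaired_pass / total_pass if total_pass else 0.0
    return {
        "first_pass": first_pass,
        "repaired_pass": repaired_pass,
        "repair_tax_rate": repair_tax_rate,
    }


# Toy usage: the verifier accepts only the literal patch "fixed".
def toy_verifier(patch: str) -> Tuple[bool, str]:
    return patch == "fixed", "expected the patch 'fixed'"


def toy_agent(task: str, feedback: Optional[str]) -> str:
    # Solves "easy" tasks immediately, others only after seeing feedback.
    return "fixed" if task == "easy" or feedback is not None else "broken"


def stuck_agent(task: str, feedback: Optional[str]) -> str:
    # Never produces a passing patch, regardless of feedback.
    return "broken"


results = [
    run_case("c1", "easy", toy_agent, toy_verifier),
    run_case("c2", "hard", toy_agent, toy_verifier),
    run_case("c3", "hard", stuck_agent, toy_verifier),
]
metrics = repair_metrics(results)
print(metrics)
```

Because the toy agents and verifier are deterministic, repeated runs produce identical metrics, which is the property a deterministic runner would need for reproducible scoring.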
Evaluated models (Phase 1 partial)
| Model | python_tool first_pass | rust_porting first_pass |
|---|---|---|
| qwen36-27b-mtp-q4 (local) | 3/5 | 2/5 |
| gemma3-27b-q4 (local) | 2/5 | 1/5 |
Full baseline results: reports/baseline_summary.md
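
For readers who want to compare the partial results above as rates, the following sketch converts the `passed/total` cells from the table into fractions. The dictionary layout and the `pass_rate` helper are illustrative assumptions, not the format of `reports/baseline_summary.md`.

```python
# first_pass counts copied from the table above (5 cases per track).
baseline = {
    "qwen36-27b-mtp-q4": {"python_tool": "3/5", "rust_porting": "2/5"},
    "gemma3-27b-q4": {"python_tool": "2/5", "rust_porting": "1/5"},
}


def pass_rate(cell: str) -> float:
    """Convert a 'passed/total' table cell into a fraction in [0, 1]."""
    passed, total = (int(part) for part in cell.split("/"))
    return passed / total


rates = {
    model: {track: pass_rate(cell) for track, cell in tracks.items()}
    for model, tracks in baseline.items()
}

for model, tracks in rates.items():
    for track, rate in tracks.items():
        print(f"{model:20s} {track:14s} first_pass rate = {rate:.0%}")
```

With only five cases per track, each case shifts a rate by 20 percentage points, so differences between models at this stage should be read cautiously.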
Next
- Phase 2: 108 cases
- Codex API baseline evaluation
- Hidden verifier activation