Releases: JohnYCChiang/holon-bench

v0.1.0-alpha

02 Jun 14:24

Holon-Bench v0.1.0-alpha

First public pre-release of Holon-Bench, an open-source benchmark harness for evaluating AI coding agents on maintainer-style workflows.

What is included

  • 9 benchmark tracks: Python tool engineering, Rust core, Rust Bevy, Rust porting, Go core, Go game server, Flutter cross-platform, graph memory workflow, repair needed
  • Phase 1: 35 cases (5 per track) validating runner/scorer/report plumbing
  • Deterministic runners with verifier-feedback repair loop support
  • Repair cost metrics: first_pass, repaired_pass, repair_tax_rate
  • Hidden and mutation verifier architecture (Phase 2+)
  • JSON schemas for cases, results, scores, and failures
  • GitHub Actions CI: schema check, Python compile check, smoke test
  • Minimal example case for onboarding new contributors
  • OSS maintainer use case documentation
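
As a rough illustration of how the verifier-feedback repair loop and the repair cost metrics listed above could fit together, here is a minimal Python sketch. It is not taken from the Holon-Bench code: `CaseResult`, `run_case`, `summarize`, and the `agent` / `verifier` callables are hypothetical names, and the definition of `repair_tax_rate` used here (the share of solved cases that needed at least one repair round) is an assumption; the project may define it differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical signatures, assumed for illustration only:
#   agent(case_id, feedback) -> candidate patch (feedback is None on the first attempt)
#   verifier(case_id, patch) -> (passed, feedback message)
Agent = Callable[[str, Optional[str]], str]
Verifier = Callable[[str, str], Tuple[bool, str]]


@dataclass
class CaseResult:
    case_id: str
    first_pass: bool     # passed on the first attempt, before any feedback
    repaired_pass: bool  # passed only after at least one repair round
    attempts: int        # total attempts used


def run_case(case_id: str, agent: Agent, verifier: Verifier,
             max_repairs: int = 2) -> CaseResult:
    """Run one case, feeding verifier output back to the agent between attempts."""
    feedback: Optional[str] = None
    for attempt in range(1, max_repairs + 2):
        patch = agent(case_id, feedback)
        passed, feedback = verifier(case_id, patch)
        if passed:
            return CaseResult(case_id, attempt == 1, attempt > 1, attempt)
    # Every attempt failed, including all repair rounds.
    return CaseResult(case_id, False, False, max_repairs + 1)


def summarize(results: List[CaseResult]) -> Dict[str, float]:
    """Aggregate per-case results into the three repair cost metrics."""
    total = len(results)
    first = sum(r.first_pass for r in results)
    repaired = sum(r.repaired_pass for r in results)
    solved = first + repaired
    return {
        "first_pass": first / total if total else 0.0,
        "repaired_pass": repaired / total if total else 0.0,
        # Assumed definition: fraction of solved cases that needed repair.
        "repair_tax_rate": repaired / solved if solved else 0.0,
    }
```

Because the loop has no randomness of its own, a deterministic agent and verifier give the same `CaseResult` on every run, which is the property a deterministic runner would need for reproducible scores.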

Evaluated models (Phase 1 partial)

Model                      | python_tool first_pass | rust_porting first_pass
qwen36-27b-mtp-q4 (local)  | 3/5                    | 2/5
gemma3-27b-q4 (local)      | 2/5                    | 1/5

Full baseline results: reports/baseline_summary.md

Next

  • Phase 2: 108 cases
  • Codex API baseline evaluation
  • Hidden verifier activation