v0.1.0 - Initial public benchmark baseline

Dmatut7 released this 05 Jun 18:24

· 7 commits to main since this release

2c8244c

ShopPay Audit Benchmark v0.1.0

Initial public baseline release for evaluating whether AI coding agents can find business-logic defects from written product rules.

Included

SPEC.md business-rule source of truth.
Intentionally flawed ShopPay service implementation.
Baseline tests documenting seeded defects.
BENCHMARK.md audit task and maintainer answer key.
docs/SCORING.md 100-point scoring rubric.
examples/audit-report.md sample report format.
GitHub Actions test workflow.
Contribution guide, code of conduct, issue templates, PR template, and roadmap.

How to run

npm test

Expected baseline result: all tests pass against the intentionally flawed benchmark implementation.

Assets 2