## Problem

Proposal §7 promises Code2TestBench and headline metrics (70% acceptance, 80% first-run pass). No benchmark harness exists; metrics are unvalidated.

## Tasks

- [ ] Harness: run generation against sample repos (flask, requests) with their real tests hidden.
- [ ] Measure acceptance rate, first-run pass rate, diagnostic accuracy.
- [ ] Record results in docs; reconcile with proposal targets.

## Acceptance

Reproducible benchmark command + a results table committed.
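One possible shape for the measurement step, as a minimal sketch rather than a settled design. It assumes the harness emits one record per generated test; the `GeneratedTestResult` fields, the helper names, and the metric definitions (fraction of tests accepted, fraction passing on first run, fraction of emitted diagnostics judged correct) are all hypothetical and would need to be reconciled with whatever proposal §7 actually specifies. Only the 70% and 80% targets come from the proposal; it is assumed here that no target exists for diagnostic accuracy.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class GeneratedTestResult:
    """One generated test, as a hypothetical harness might record it."""
    repo: str                                  # e.g. "flask" or "requests"
    test_id: str
    accepted: bool                             # assumption: a reviewer kept the test
    passed_first_run: bool                     # passed without any repair step
    diagnostic_correct: Optional[bool] = None  # None when no diagnostic was emitted


# Headline targets quoted from the proposal; no target is assumed for
# diagnostic accuracy.
PROPOSAL_TARGETS: Dict[str, float] = {
    "acceptance_rate": 0.70,
    "first_run_pass_rate": 0.80,
}


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; NaN when there is nothing to measure."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def summarize(results: List[GeneratedTestResult]) -> Dict[str, float]:
    """Compute the three metrics named in the task list."""
    diagnostics = [
        r.diagnostic_correct for r in results if r.diagnostic_correct is not None
    ]
    return {
        "acceptance_rate": rate(r.accepted for r in results),
        "first_run_pass_rate": rate(r.passed_first_run for r in results),
        "diagnostic_accuracy": rate(diagnostics),
    }


def to_markdown_table(summary: Dict[str, float], targets: Dict[str, float]) -> str:
    """Render measured values next to proposal targets as a Markdown table."""
    lines = ["| Metric | Measured | Proposal target |", "| --- | --- | --- |"]
    for name, value in summary.items():
        target = targets.get(name)
        target_cell = f"{target:.0%}" if target is not None else "n/a"
        lines.append(f"| {name} | {value:.0%} | {target_cell} |")
    return "\n".join(lines)
```

Calling `to_markdown_table(summarize(results), PROPOSAL_TARGETS)` on the harness output would produce the results table that the acceptance criterion asks for. Keeping the targets in one dictionary makes the reconciliation with the proposal explicit in the committed table.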