Part of #1155.
Objective
Validate the premise behind the PR/issue-mining direction by sampling real merged PRs in this repo and checking how many convert to useful eval cases. If yield is too low, the dependent work (sub-issue on skill extension) should be rescoped or dropped.
Scope
- Take the 20 most recent merged PRs in main.
- For each, classify as one of:
  - useful: a plausible eval case, meaning the PR title/body works as a task spec and the diff represents a behavioral change an agent could reproduce.
  - not useful: typo fix, version bump, dep update, pure refactor with no behavior change, doc-only, or too small to be meaningful.
- Record classification with a one-line reason per PR.
- Report yield percentage.
- Recommendation: proceed, rescope, or drop.
Acceptance signals
- A short markdown note (in this issue, as a comment, or linked from the research repo) with a table: PR number, one-line summary, classification, reason.
- Yield percentage computed.
- Proceed/rescope/drop recommendation with rationale.
- No code changes.
Non-goals
- Not building any tooling to mine PRs programmatically — this is manual classification.
- Not evaluating quality of generated cases (we aren't generating any here).
- Not extending beyond 20 PRs unless initial signal is borderline.
Rule of thumb
- ≥50% useful: proceed with the skill extension as proposed.
- 30% to <50%: proceed but narrow the scope (e.g., filter by label, commit message pattern, or PR size).
- <30%: rescope or drop — the premise doesn't hold for this codebase.
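
The thresholds above can be sketched as a small calculation, purely to make the boundaries unambiguous. This is an illustration, not tooling to build for this issue. The function names and the sample counts are hypothetical, and it assumes that exactly 50% counts as "proceed" and exactly 30% counts as "narrow the scope".

```python
def yield_percentage(classifications):
    """Return the share of PRs labeled "useful", as a percentage."""
    if not classifications:
        raise ValueError("no classifications recorded")
    useful = sum(1 for label in classifications if label == "useful")
    return 100.0 * useful / len(classifications)


def recommend(pct):
    """Map a yield percentage to the rule-of-thumb recommendation.

    Assumption: boundary values go to the higher band
    (50 -> proceed, 30 -> narrow the scope).
    """
    if pct >= 50:
        return "proceed"
    if pct >= 30:
        return "proceed with narrowed scope"
    return "rescope or drop"


# Hypothetical sample: 7 of 20 PRs classified as useful.
sample = ["useful"] * 7 + ["not useful"] * 13
pct = yield_percentage(sample)
print(f"{pct:.0f}% -> {recommend(pct)}")
```

With a sample of 20 PRs, each PR moves the yield by 5 percentage points, so a result within one or two PRs of a threshold is the borderline case that the non-goals allow extending the sample for.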
Blocks
Sub-issue for agentv-eval-writer extension (see #1155).