Part of #1155. Blocked by #1156 (yield spike). Depends on #1157 (provenance schema) for provenance stamping.
## Objective
Extend `plugins/agentv-dev/skills/agentv-eval-writer/` to generate `EVAL.yaml` test cases from two new sources: merged pull requests and resolved issues. This makes PR/issue-driven eval authoring a first-class capability in AgentV's skill system without baking it into core.
## Design latitude

### Why a skill, not a CLI subcommand
PR/issue mining is authoring work — deciding which PRs are eval-worthy, phrasing the `criteria` line, choosing grader style. It is LLM-assisted judgment, not a pure transform. This skill already owns the analogous "generate from chat transcripts" workflow, so keeping PR/issue generation adjacent to it preserves "how do I generate an eval from X?" as one mental model.
### Scope
- GitHub only, mined via the `gh` CLI. No GitLab/Bitbucket/Linear adapters in v1 — add them later only if someone asks.
- Deterministic structural extraction in `scripts/*.sh`; LLM judgment (which PRs to include, how to phrase criteria) in skill prose.
- Open sub-decision: move the existing inline chat-transcript section into `references/from-chat-transcripts.md` for progressive disclosure, or leave it inline. Moving is recommended for consistency with the new sources, but not required in this issue.
## Proposed layout
```
plugins/agentv-dev/skills/agentv-eval-writer/
├── SKILL.md                      # Extend: add generation-source routing
├── references/
│   ├── config-schema.json        # existing
│   ├── custom-evaluators.md      # existing
│   ├── eval-schema.json          # existing
│   ├── rubric-evaluator.md       # existing
│   ├── from-pull-requests.md     # NEW — gh pr view patterns, mapping PR fields to test case fields
│   ├── from-issues.md            # NEW — gh issue view, linked-PR resolution
│   ├── grader-patterns-diff.md   # NEW — llm-grader with agent target for diff behavioral equivalence
│   └── provenance-block.md       # NEW — how to populate provenance metadata
└── scripts/                      # NEW directory
    ├── extract_pr.sh             # gh pr view --json … → normalized JSON
    └── extract_issue.sh          # gh issue view --json …
```
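The scripts only fetch and normalize; all judgment stays in the skill prose. A minimal sketch of what `scripts/extract_pr.sh` could look like, assuming `gh` and `jq` are available and the PR's commits are fetched locally. The output shape and the first-parent trick for recovering the pre-PR state are illustrative, not settled:

```bash
#!/usr/bin/env bash
# Sketch of scripts/extract_pr.sh: deterministic structural extraction, no LLM calls.
# Field selection and output shape are illustrative, not final.
set -euo pipefail

pr="$1"  # PR number

# Structured metadata straight from the GitHub API.
meta=$(gh pr view "$pr" --json number,title,body,url,baseRefName,mergeCommit)

# The merged diff; becomes expected_output downstream.
diff=$(gh pr diff "$pr")

# Pre-PR state: the first parent of the merge (or squash) commit is one
# reasonable approximation of the commit the agent should start from.
merge_oid=$(jq -r '.mergeCommit.oid' <<<"$meta")
checkout_ref=$(git rev-parse "${merge_oid}^")

# Emit one normalized JSON object on stdout for the skill to consume.
jq -n \
  --argjson meta "$meta" \
  --arg diff "$diff" \
  --arg checkout_ref "$checkout_ref" \
  '{pr: $meta, diff: $diff, checkout_ref: $checkout_ref}'
```

`extract_issue.sh` would follow the same pattern with `gh issue view "$n" --json title,body,url`, plus whatever linked-PR lookup `references/from-issues.md` settles on.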
## Test case mapping per mined PR
- PR title → test `id` (slugified) and seed for `criteria`.
- PR body + linked issue body → test `input`.
- PR merge-base commit → `workspace.checkout.ref` (the agent starts from the pre-PR state).
- PR diff → test `expected_output` as an inline multi-line string (YAML `|` block scalar; the schema is `z.string() | MessageContent` with no length cap, so no fixture files are needed).
- Grader: `llm-grader` with `target: claude-code`. No `max_steps` — the agent target controls its own loop. The grader prompt compares `{{file_changes}}` (the agent's output) against `{{expected_output}}` (the PR's diff) for behavioral equivalence, not textual match.
- Each generated test stamps a complete `provenance` block (depends on #1157).
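## Emitted test case shape

A minimal sketch of one emitted test case under the mapping above. Anything not named in the mapping, in particular the `criteria` text, the grader prompt wording, and the `provenance` keys, is a placeholder; the provenance shape comes from #1157:

```yaml
# Illustrative only; exact keys follow the EVAL.yaml schema.
- id: fix-race-in-job-scheduler        # slugified PR title
  criteria: Agent fixes the scheduler race without changing the public API.
  input: |
    <PR body + linked issue body>
  workspace:
    checkout:
      ref: 3f9c2ab                     # PR merge-base commit (pre-PR state)
  expected_output: |
    diff --git a/src/scheduler.ts b/src/scheduler.ts
    ...
  grader:
    type: llm-grader
    target: claude-code                # agent target controls its own loop; no max_steps
    prompt: |
      Compare {{file_changes}} against {{expected_output}} for behavioral
      equivalence, not textual match.
  provenance:                          # placeholder keys; shape pending #1157
    source: pull_request
    repo: owner/repo
    pr: 1234
```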
## Acceptance signals

- The generated `EVAL.yaml` passes `agentv eval lint`.
- Every emitted test carries a complete `provenance` block.
- Running the emitted eval against any configured target completes without harness/wiring errors. Scoring quality is not an acceptance criterion for this issue — only that the pipeline runs.
- `references/from-pull-requests.md` and `references/from-issues.md` document the mapping and `gh` CLI commands clearly enough that an LLM driving the skill can follow them without extra guidance.
## Non-goals
- Not a CLI subcommand.
- No LLM summarization of PRs/issues at extraction time — structural extraction stays deterministic, and reproducibility matters.
- Not solving large-PR prompt-size truncation — observe behavior first; solve it if it becomes a problem.
- No support for arbitrary Git providers — GitHub only in v1.
## Depends on

- #1156 (yield spike)
- #1157 (provenance schema)
## References

Design derivation and full design-iteration history: the `agentevals-research` repo (internal).