Part of #1155. Blocked by #1156 (yield spike). Depends on #1157 (provenance schema) for provenance stamping.
## Objective
Extend `plugins/agentv-dev/skills/agentv-eval-writer/` to generate `EVAL.yaml` test cases from two new sources: merged pull requests and resolved issues. This makes PR/issue-driven eval authoring a first-class capability in AgentV's skill system without baking it into core.
## Design latitude

### Why a skill, not a CLI subcommand
PR/issue mining is authoring work — deciding which PRs are eval-worthy, phrasing the `criteria` line, choosing grader style. It is LLM-assisted judgment, not a pure transform. This skill already owns the analogous "generate from chat transcripts" workflow, so keeping PR/issue generation adjacent to it preserves "how do I generate an eval from X?" as one mental model.
### Scope
- GitHub only, mined via the `gh` CLI. No GitLab/Bitbucket/Linear adapters in v1 — add them later only if someone asks.
- Deterministic structural extraction in `scripts/*.sh`; LLM judgment (which PRs to include, how to phrase criteria) in skill prose.
- Open sub-decision: move the existing inline chat-transcript section into `references/from-chat-transcripts.md` for progressive disclosure, or leave it inline. Moving is recommended for consistency with the new sources, but not required in this issue.
## Proposed layout
```
plugins/agentv-dev/skills/agentv-eval-writer/
├── SKILL.md                      # Extend: add generation-source routing
├── references/
│   ├── config-schema.json        # existing
│   ├── custom-evaluators.md      # existing
│   ├── eval-schema.json          # existing
│   ├── rubric-evaluator.md       # existing
│   ├── from-pull-requests.md     # NEW — gh pr view patterns, mapping PR fields to test case fields
│   ├── from-issues.md            # NEW — gh issue view, linked-PR resolution
│   ├── grader-patterns-diff.md   # NEW — llm-grader with agent target for diff behavioral equivalence
│   └── provenance-block.md       # NEW — how to populate provenance metadata
└── scripts/                      # NEW directory
    ├── extract_pr.sh             # gh pr view --json … → normalized JSON
    └── extract_issue.sh          # gh issue view --json …
```
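The scripts only fetch and normalize; all judgment stays in the skill prose. A minimal sketch of what `scripts/extract_pr.sh` could look like, assuming `gh` and `jq` are available and the PR's commits are fetched locally. The output shape and the first-parent trick for recovering the pre-PR state are illustrative, not settled:

```bash
#!/usr/bin/env bash
# Sketch of scripts/extract_pr.sh: deterministic structural extraction, no LLM calls.
# Field selection and output shape are illustrative, not final.
set -euo pipefail

pr="$1"  # PR number

# Structured metadata straight from the GitHub API.
meta=$(gh pr view "$pr" --json number,title,body,url,baseRefName,mergeCommit)

# The merged diff; becomes expected_output downstream.
diff=$(gh pr diff "$pr")

# Pre-PR state: the first parent of the merge (or squash) commit is one
# reasonable approximation of the commit the agent should start from.
merge_oid=$(jq -r '.mergeCommit.oid' <<<"$meta")
checkout_ref=$(git rev-parse "${merge_oid}^")

# Emit one normalized JSON object on stdout for the skill to consume.
jq -n \
  --argjson meta "$meta" \
  --arg diff "$diff" \
  --arg checkout_ref "$checkout_ref" \
  '{pr: $meta, diff: $diff, checkout_ref: $checkout_ref}'
```

`extract_issue.sh` would follow the same pattern with `gh issue view "$n" --json title,body,url`, plus whatever linked-PR lookup `references/from-issues.md` settles on.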
## Test case mapping per mined PR
- PR title → test `id` (slugified) and seed for `criteria`.
- PR body + linked issue body → test `input`.
- PR merge-base commit → `workspace.checkout.ref` (the agent starts from the pre-PR state).
- PR diff → test `expected_output` as an inline multi-line string (YAML `|` block scalar; the schema is `z.string() | MessageContent` with no length cap, so no fixture files are needed).
- Grader: `llm-grader` with `target: claude-code`. No `max_steps` — the agent target controls its own loop. The grader prompt compares `{{file_changes}}` (the agent's output) against `{{expected_output}}` (the PR's diff) for behavioral equivalence, not textual match.
- Each generated test stamps a complete `provenance` block (depends on #1157).
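## Emitted test case shape

A minimal sketch of one emitted test case under the mapping above. Anything not named in the mapping, in particular the `criteria` text, the grader prompt wording, and the `provenance` keys, is a placeholder; the provenance shape comes from #1157:

```yaml
# Illustrative only; exact keys follow the EVAL.yaml schema.
- id: fix-race-in-job-scheduler        # slugified PR title
  criteria: Agent fixes the scheduler race without changing the public API.
  input: |
    <PR body + linked issue body>
  workspace:
    checkout:
      ref: 3f9c2ab                     # PR merge-base commit (pre-PR state)
  expected_output: |
    diff --git a/src/scheduler.ts b/src/scheduler.ts
    ...
  grader:
    type: llm-grader
    target: claude-code                # agent target controls its own loop; no max_steps
    prompt: |
      Compare {{file_changes}} against {{expected_output}} for behavioral
      equivalence, not textual match.
  provenance:                          # placeholder keys; shape pending #1157
    source: pull_request
    repo: owner/repo
    pr: 1234
```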
## Acceptance signals

- The generated `EVAL.yaml` passes `agentv eval lint`.
- Every emitted test carries a complete `provenance` block.
- Running the emitted eval against any configured target completes without harness/wiring errors. Scoring quality is not an acceptance criterion for this issue — only that the pipeline runs.
- `references/from-pull-requests.md` and `references/from-issues.md` document the mapping and `gh` CLI commands clearly enough that an LLM driving the skill can follow them without extra guidance.
## Non-goals
- Not a CLI subcommand.
- No LLM summarization of PRs/issues at extraction time — structural extraction stays deterministic, and reproducibility matters.
- Not solving large-PR prompt-size truncation — observe behavior first; solve it if it becomes a problem.
- No support for arbitrary Git providers — GitHub only in v1.
## Depends on

- #1156 (yield spike)
- #1157 (provenance schema)
## References

Design derivation and full design-iteration history: the `agentevals-research` repo (internal).