Improve agent-plugin-review skill to pass remaining 3 eval tests #779
Closed
Labels
in-progress: Claimed by an agent, do not duplicate work
Description
Summary
The agent-plugin-review skill passes 6/9 eval tests against pi-cli (mean score 0.722). Three tests fail consistently:
| Test | Score | Issue |
|---|---|---|
| detect-relative-file-paths | 0.500 | Partially detected — skill mentions leading / but agent doesn't consistently flag it |
| detect-repeated-inputs | 0.000 | Missed — agent doesn't suggest top-level input for repeated file references |
| detect-missing-hard-gates | 0.000 | Missed — agent doesn't flag missing artifact existence checks between phases |
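As a quick sanity check on the numbers above, the reported mean of 0.722 is consistent with the three failing scores in the table if each of the six passing tests scored exactly 1.0. That last part is an assumption; the issue does not list the passing scores. A minimal sketch of the arithmetic:

```typescript
// Scores for the three failing tests, taken from the table.
const failingScores: number[] = [0.5, 0.0, 0.0];
// ASSUMPTION: each of the six passing tests scored exactly 1.0 (not stated in the issue).
const assumedPassingScores: number[] = [1, 1, 1, 1, 1, 1];

// Arithmetic mean of a non-empty list of scores.
function meanScore(scores: number[]): number {
  if (scores.length === 0) {
    throw new Error("cannot average an empty score list");
  }
  let sum = 0;
  for (const s of scores) {
    sum += s;
  }
  return sum / scores.length;
}

const overallMean = meanScore(assumedPassingScores.concat(failingScores));
console.log(overallMean.toFixed(3)); // prints "0.722" (6.5 / 9)
```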
Approach
Use the agentv-bench eval-driven iteration loop:
- Analyze the failing test transcripts to understand what the agent does instead
- Identify which SKILL.md instructions are unclear or missing
- Make targeted edits to the skill
- Re-run evals to verify improvement
- Repeat until all 9 pass
Possible improvements
- Relative file paths: Add an explicit checklist item about checking `type: file` values in eval YAML
- Repeated inputs: Add guidance about the top-level `input` field from AgentV eval docs
- Hard gates: Make the workflow-checklist.md more prescriptive about what to look for (artifact existence checks at the start of each phase skill)
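To make the "repeated inputs" and "hard gates" items more concrete, here is a rough sketch of the kind of checks a reviewer might apply. The `FileRef` and `EvalTest` shapes, the function names, and the example paths are all hypothetical illustrations. They are not the AgentV schema or API, which may look quite different.

```typescript
// HYPOTHETICAL shapes for illustration only; the real AgentV eval schema may differ.
interface FileRef { type: "file"; value: string }
interface EvalTest { id: string; input: FileRef[] }

// Return file values referenced by more than one test. These are candidates
// for hoisting into a single shared top-level input.
function findRepeatedFileRefs(tests: EvalTest[]): string[] {
  const counts: Record<string, number> = {};
  for (const test of tests) {
    // Count each file once per test so duplicates inside one test do not inflate the total.
    const seen: Record<string, boolean> = {};
    for (const ref of test.input) {
      if (!seen[ref.value]) {
        seen[ref.value] = true;
        counts[ref.value] = (counts[ref.value] || 0) + 1;
      }
    }
  }
  return Object.keys(counts).filter((value) => counts[value] > 1).sort();
}

// Return the artifacts a phase requires that are not yet present.
// `exists` is injected so the sketch does not depend on any particular filesystem API.
function missingArtifacts(required: string[], exists: (path: string) => boolean): string[] {
  return required.filter((path) => !exists(path));
}

// Example data (hypothetical paths).
const exampleTests: EvalTest[] = [
  { id: "a", input: [{ type: "file", value: "fixtures/SKILL.md" }] },
  {
    id: "b",
    input: [
      { type: "file", value: "fixtures/SKILL.md" },
      { type: "file", value: "fixtures/README.md" },
    ],
  },
];
const repeatedRefs = findRepeatedFileRefs(exampleTests);
console.log(repeatedRefs); // ["fixtures/SKILL.md"]

const presentArtifacts = ["plan.md"];
const missingForPhase = missingArtifacts(
  ["plan.md", "design.md"],
  (path) => presentArtifacts.indexOf(path) !== -1
);
console.log(missingForPhase); // ["design.md"]
```

A non-empty result from `missingArtifacts` at the start of a phase is the situation a hard gate would stop on. A non-empty result from `findRepeatedFileRefs` suggests a shared input worth recommending.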
Eval command
bun run --filter @agentv/core build && bun apps/cli/src/cli.ts eval evals/agentic-engineering/agent-plugin-review.eval.yaml --target pi-cli

Note: you must rebuild the @agentv/core dist before running if the core source was modified.
Related
- PR feat: add workspace skills for pi-cli eval execution #776 — baseline eval results (6/9 pass)
- PR feat: add agentv-plugin-review skill #772 — original skill creation
Projects
Status
Done