diff --git a/apps/cli/src/commands/eval/artifact-writer.ts b/apps/cli/src/commands/eval/artifact-writer.ts index 003d8886f..755d36a1d 100644 --- a/apps/cli/src/commands/eval/artifact-writer.ts +++ b/apps/cli/src/commands/eval/artifact-writer.ts @@ -15,7 +15,9 @@ export function buildTestTargetKey(testId?: string, target?: string): string { } // Deduplication helper — keeps the last entry per (test_id, target) pair. -export function deduplicateByTestIdTarget(results: readonly EvaluationResult[]): EvaluationResult[] { +export function deduplicateByTestIdTarget( + results: readonly EvaluationResult[], +): EvaluationResult[] { const seen = new Map(); for (let i = 0; i < results.length; i++) { seen.set(buildTestTargetKey(results[i].testId, results[i].target), i); diff --git a/apps/web/src/assets/screenshots/autoresearch-trajectory.png b/apps/web/src/assets/screenshots/autoresearch-trajectory.png new file mode 100644 index 000000000..da53e874f Binary files /dev/null and b/apps/web/src/assets/screenshots/autoresearch-trajectory.png differ diff --git a/apps/web/src/content/docs/docs/guides/agent-eval-layers.mdx b/apps/web/src/content/docs/docs/guides/agent-eval-layers.mdx index f4058dcf6..329242aa5 100644 --- a/apps/web/src/content/docs/docs/guides/agent-eval-layers.mdx +++ b/apps/web/src/content/docs/docs/guides/agent-eval-layers.mdx @@ -130,6 +130,8 @@ A complete `EVAL.yaml` covering all four layers: ```yaml description: Four-layer agent evaluation starter +sidebar: + order: 1 execution: target: default diff --git a/apps/web/src/content/docs/docs/guides/autoresearch.mdx b/apps/web/src/content/docs/docs/guides/autoresearch.mdx new file mode 100644 index 000000000..f5f23362a --- /dev/null +++ b/apps/web/src/content/docs/docs/guides/autoresearch.mdx @@ -0,0 +1,207 @@ +--- +title: Autoresearch +description: Run an unattended eval-improve loop that iteratively optimizes agent skills +sidebar: + order: 5 +--- + +import { Image } from 'astro:assets'; +import trajectoryChart 
from '../../../../assets/screenshots/autoresearch-trajectory.png';

Autoresearch is an unattended optimization loop that **automatically improves your agent skills** through repeated eval cycles. It runs the same evaluate → analyze → improve loop described in the [Skill Improvement Workflow](/docs/guides/skill-improvement-workflow/), but does it hands-free — no human review between cycles.

<Image src={trajectoryChart} alt="Autoresearch trajectory chart showing score improvement from 0.48 to 0.90 over 9 cycles" />

The chart above shows a real optimization run: an incident severity classifier starts at 48% accuracy and reaches 90% after 9 automated cycles — each cycle taking seconds and costing fractions of a cent.

## How It Works

```
┌─────────────┐
│   1. EVAL   │ ◄──────────────────────────────────┐
└──────┬──────┘                                    │
       ▼                                           │
┌─────────────┐                                    │
│  2. ANALYZE │  dispatcher → analyzer subagent    │
└──────┬──────┘                                    │
       ▼                                           │
┌─────────────┐  wins > losses → KEEP              │
│  3. DECIDE  │  else → DROP                       │
└──────┬──────┘                                    │
       ▼                                           │
┌─────────────┐                                    │
│  4. MUTATE  │  dispatcher → mutator subagent ────┘
└─────────────┘

Stops after 3 consecutive no-improvement cycles
or 10 total cycles (configurable).
```

Each cycle:
1. **Runs `agentv eval`** against the current version of the artifact
2. **Analyzes** failures via the analyzer subagent
3. **Decides** keep or discard using `agentv compare --json` (automated — no human needed)
4. **Mutates** the artifact to address failing assertions, then loops back

The system uses a **hill-climbing ratchet**: each mutation builds on the best-scoring version, never a failed candidate. Improvements compound; regressions get discarded.

## What Gets Optimized

Any file or directory artifact: SKILL.md, prompt template, agent config, system prompt, or a directory of related files (e.g., a skill with `references/` and `agents/` subdirectories). The artifact mode is auto-detected — pass a file path for single-file optimization, or a directory path for multi-file optimization.
The mutator rewrites artifacts in place while the eval stays fixed — same test cases, same assertions, different artifact versions. + +## Prerequisites + +- An eval file (EVAL.yaml or evals.json) that covers the behavior you care about +- The artifact must be a file or directory within a git repository (autoresearch uses git for versioning) +- Run at least one manual eval cycle first to validate your test cases + +:::tip +Autoresearch is only as good as your eval. If your assertions don't catch the failures you care about, the optimizer won't fix them. Start with the [manual improvement loop](/docs/guides/skill-improvement-workflow/) to build confidence in your eval quality before going unattended. +::: + +## Triggering Autoresearch + +Autoresearch runs through the `agentv-bench` Claude Code skill. Trigger it with natural language: + +``` +"Run autoresearch on my classifier prompt" +"Optimize this skill unattended for 5 cycles" +"Run autoresearch on examples/features/autoresearch/EVAL.yaml" +``` + +No CLI flags or YAML schema changes needed — the skill handles everything. + +## Output Structure + +Each autoresearch session creates a self-contained experiment directory: + +``` +.agentv/results/runs/autoresearch-/ +├── _autoresearch/ +│ ├── iterations.jsonl # Per-cycle data (score, decision, mutation) +│ └── trajectory.html # Live-updating Chart.js visualization +├── 2026-04-15T10-30-00/ # Cycle 1 run artifacts +│ ├── index.jsonl +│ ├── grading.json +│ └── timing.json +├── 2026-04-15T10-35-00/ # Cycle 2 run artifacts +│ └── ... +└── ... +``` + +Autoresearch uses **git-based versioning** instead of backup files. Each successful mutation is committed (`git add && git commit`), and failed mutations are reverted (`git checkout`). The optimized artifact lives in the working tree and the latest commit — no separate `best.md` to copy. + +- **`_autoresearch/trajectory.html`** — Open in a browser to see the score trajectory, per-assertion breakdown, and cumulative cost. 
Auto-refreshes during the loop, becomes static on completion.
- **`_autoresearch/iterations.jsonl`** — Machine-readable log of every cycle for downstream analysis.

Review the mutation history with `git log` after the run completes.

## The Keep/Drop Decision

After each eval cycle, autoresearch runs `agentv compare` between the current candidate and the best baseline:

```bash
agentv compare <baseline-run>/index.jsonl <candidate-run>/index.jsonl --json
```

The decision rule:

| Condition | Decision | Outcome |
|-----------|----------|---------|
| `wins > losses` | **KEEP** | Promote to new baseline, reset convergence counter |
| `wins <= losses` | **DROP** | Revert to best version, increment convergence counter |
| `wins == losses`, `mean_delta == 0`, simpler artifact | **KEEP** | Exception to the DROP rule: simpler is better at equal performance |

Three consecutive DROPs trigger convergence — the optimizer stops because it can't find improvements.

## Example: Incident Severity Classifier

Here's a real scenario showing autoresearch in action. We start with a minimal classifier prompt:

```markdown
# classifier-prompt.md (initial version)
Classify the incident into P0, P1, P2, or P3.
Give your answer as JSON with severity and reasoning fields.
```

And an eval with 7 test cases covering edge cases — payment failures, SSL cert expiry, gradual memory leaks:

```yaml
# EVAL.yaml (stays fixed — only the prompt changes)
tests:
  - id: total-outage
    assertions:
      - type: contains
        value: '"P0"'
      - type: is-json
      - "Reasoning mentions complete service outage"
  - id: payment-failures
    assertions:
      - type: contains
        value: '"P1"'
      - type: is-json
      - "Reasoning weighs revenue impact despite intermittent nature"
  # ...
5 more test cases
```

Running autoresearch produces this trajectory:

```
Cycle  Score  Decision  Mutation
─────  ─────  ────────  ──────────────────────────────────────
  1    0.48   KEEP      initial baseline — no mutations applied
  2    0.62   KEEP      added explicit JSON format, defined P0-P3 levels
  3    0.52   DROP      added verbose rules — over-constrained reasoning
  4    0.71   KEEP      added revenue-impact heuristic for P1
  5    0.81   KEEP      enforced raw JSON output — removed code fences
  6    0.90   KEEP      added time-urgency rule for SSL/cert cases
  7    0.86   DROP      attempted decision tree merge — regressed
  8    0.90   DROP      minor wording cleanup — no meaningful change
  9    0.88   DROP      reordered severity rules — no improvement
       ↳ 3 consecutive drops → CONVERGED
```

**Result:** 0.48 → 0.90 (+42 points) in 9 cycles, $0.03 total cost. The optimized prompt is in the working tree (and the latest git commit).

Key observations:
- **Cycle 3** shows a failed mutation (verbose rules hurt reasoning) — the ratchet discarded it and continued from the cycle 2 version
- **Cycles 7–9** show convergence — three consecutive DROPs, so the optimizer stopped automatically
- **Per-assertion tracking** reveals which aspects improved: classification accuracy reached 100% by cycle 6, while JSON format compliance and reasoning quality improved more gradually

## Convergence

Autoresearch stops when either condition is met:

- **3 consecutive no-improvement cycles** (configurable) — the optimizer has converged
- **10 total cycles** (configurable) — hard limit to bound cost

You can override both limits when triggering autoresearch:

```
"Run autoresearch with max 20 cycles and convergence threshold of 5"
```

## Best Practices

**Start manual, then automate.** Run 2-3 manual eval cycles to validate your test cases catch real issues. Once you trust the eval, switch to autoresearch.
+ +**Same-model pairings work best.** The meta-agent running autoresearch should match the model used by the task agent (e.g., Claude optimizing a Claude agent). Same-model pairings produce better mutations because the optimizer has implicit knowledge of how the target model interprets instructions. + +**Watch the per-assertion chart.** If one assertion is stuck at 0% while others improve, the eval may be too strict or testing something the prompt can't control. Consider adjusting the assertion. + +**Review the optimized artifact.** Autoresearch improves scores, but always review the changes (`git diff `) before adopting them. The optimizer may have found a valid but unexpected approach. + +**Keep artifact directories focused.** For directory mode, keep artifacts to 5–15 files. The mutator works best when it can reason about the full scope without reading dozens of files. Split large skill directories if needed. + +## Relationship to Manual Workflow + +| Aspect | Manual Loop | Autoresearch | +|--------|-------------|--------------| +| Human checkpoints | Every iteration | None (opted in to unattended) | +| Keep/discard | You decide | Automated via `agentv compare` | +| Mutation | You edit the skill | Mutator subagent rewrites | +| Max iterations | Unbounded | 10 cycles or convergence | +| Best for | Building eval intuition | Scaling optimization | +| Trajectory chart | Not included | Auto-generated with live refresh | + +Start with the [manual loop](/docs/guides/skill-improvement-workflow/) to understand the workflow, then use autoresearch to scale it. 
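For readers who want the ratchet semantics precisely, here is a minimal simulation of the keep/drop loop. This is illustrative TypeScript with a stubbed score sequence, not AgentV source: the real loop shells out to `agentv eval` and `agentv compare`, and `ratchetLoop` is a hypothetical name.

```typescript
interface LoopResult {
  bestScore: number;
  cycles: number;
  converged: boolean;
}

// scores stands in for per-cycle eval results; maxCycles and patience
// mirror the default limits (10 cycles, 3 consecutive no-improvement cycles).
function ratchetLoop(
  scores: number[],
  maxCycles = 10,
  patience = 3,
): LoopResult {
  let bestScore = -Infinity;
  let drops = 0;
  let cycles = 0;
  for (const score of scores.slice(0, maxCycles)) {
    cycles++;
    if (score > bestScore) {
      bestScore = score; // KEEP: promote the candidate, reset the counter
      drops = 0;
    } else {
      drops++; // DROP: revert to the best version
      if (drops >= patience) return { bestScore, cycles, converged: true };
    }
  }
  return { bestScore, cycles, converged: false };
}

// A stubbed trajectory: peaks at 0.90, then three drops trigger convergence.
const run = ratchetLoop([0.48, 0.62, 0.52, 0.71, 0.81, 0.9, 0.86, 0.9, 0.88]);
```

Note that a tie with the best score counts as a drop: only strict improvement resets the counter, which is what makes the ratchet monotone.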
diff --git a/apps/web/src/content/docs/docs/guides/eval-authoring.mdx b/apps/web/src/content/docs/docs/guides/eval-authoring.mdx index d316f85d0..bdee29168 100644 --- a/apps/web/src/content/docs/docs/guides/eval-authoring.mdx +++ b/apps/web/src/content/docs/docs/guides/eval-authoring.mdx @@ -2,7 +2,7 @@ title: Eval Authoring Guide description: Practical guidance for writing workspace-based evals that work reliably across providers. sidebar: - order: 7 + order: 3 --- ## Workspace Setup: Skill Discovery Paths diff --git a/apps/web/src/content/docs/docs/guides/evaluation-types.mdx b/apps/web/src/content/docs/docs/guides/evaluation-types.mdx index c97c46ece..368ffd26d 100644 --- a/apps/web/src/content/docs/docs/guides/evaluation-types.mdx +++ b/apps/web/src/content/docs/docs/guides/evaluation-types.mdx @@ -2,7 +2,7 @@ title: Execution Quality vs Trigger Quality description: Two distinct evaluation concerns for AI agents and skills — what AgentV measures, and what belongs to skill-creator tooling. sidebar: - order: 6 + order: 2 --- Agent evaluation has two fundamentally different concerns: **execution quality** and **trigger quality**. They require different tooling, different methodologies, and different optimization surfaces. Conflating them leads to eval configs that are noisy, hard to maintain, and unreliable. diff --git a/apps/web/src/content/docs/docs/guides/skill-improvement-workflow.mdx b/apps/web/src/content/docs/docs/guides/skill-improvement-workflow.mdx index 6cc7bddc6..ec5ac0d2e 100644 --- a/apps/web/src/content/docs/docs/guides/skill-improvement-workflow.mdx +++ b/apps/web/src/content/docs/docs/guides/skill-improvement-workflow.mdx @@ -2,14 +2,14 @@ title: Skill Improvement Workflow description: Iteratively evaluate and improve agent skills using AgentV sidebar: - order: 6 + order: 4 --- ## Introduction AgentV supports a full evaluation-driven improvement loop for skills and agents. 
Instead of guessing whether a change makes things better, you run structured evaluations before and after, then compare. -This guide teaches the **core manual loop**. For automated iteration that runs the full cycle hands-free, see [agentv-bench](#automated-iteration). +This guide teaches the **core manual loop**. For automated iteration that runs the full cycle hands-free, see [Autoresearch](/docs/guides/autoresearch/). ## The Core Loop @@ -244,7 +244,7 @@ After converting, you can: - Use `code-grader` for custom scoring logic - Define `tool-trajectory` assertions to check tool usage patterns -See [Skill Evals (evals.json)](/docs/guides/agent-skills-evals/) for the full field mapping and side-by-side comparison. +See [Skill Evals (evals.json)](/docs/integrations/agent-skills-evals/) for the full field mapping and side-by-side comparison. ## Migration from Skill-Creator @@ -316,21 +316,14 @@ Start simple and add complexity only when the evaluation results demand it: ## Automated Iteration -For users who want the full automated improvement cycle, the `agentv-bench` skill runs a 5-phase optimization loop: +When you're confident in your eval quality, graduate to **autoresearch** — an unattended optimization loop that runs the full evaluate → analyze → improve cycle hands-free. -1. **Analyze** — examines the current skill and evaluation results -2. **Hypothesize** — generates improvement hypotheses from failure patterns -3. **Implement** — applies targeted skill modifications -4. **Evaluate** — re-runs the evaluation suite -5. **Decide** — keeps improvements that help, reverts those that don't +Autoresearch uses the same `agentv eval` and `agentv compare` primitives described above, but automates the human decision steps. A mutator subagent rewrites the artifact based on failure analysis, and an automated keep/discard rule promotes improvements and reverts regressions. -The optimizer uses the same core loop described in this guide but automates the human steps. 
Start with the manual loop to build intuition, then graduate to the optimizer when you're comfortable with the evaluation workflow. - -Its bundled scripts map directly onto the workflow stages: +``` +"Run autoresearch on my skill" +``` -- `run-eval.ts` and `compare-runs.ts` run and compare evaluations while still delegating to `agentv` -- `run-loop.ts` repeats the evaluation loop without moving grader logic into the script layer -- `aggregate-benchmark.ts` and `generate-report.ts` summarize AgentV artifacts into review-friendly output -- `improve-description.ts` proposes follow-up description experiments once execution quality is stable +One command starts the loop. It runs until the optimizer converges (3 consecutive no-improvement cycles) or hits the cycle limit. Typical runs: 5–10 cycles, under $0.05 total cost. -Code-grader execution, grading semantics, and artifact schemas still live in AgentV core. The scripts layer is orchestration glue over those existing primitives. +See the full guide: [Autoresearch](/docs/guides/autoresearch/) diff --git a/apps/web/src/content/docs/docs/guides/git-cache-workspace.mdx b/apps/web/src/content/docs/docs/guides/workspace-architecture.mdx similarity index 97% rename from apps/web/src/content/docs/docs/guides/git-cache-workspace.mdx rename to apps/web/src/content/docs/docs/guides/workspace-architecture.mdx index 3079c151d..bc61c9fa1 100644 --- a/apps/web/src/content/docs/docs/guides/git-cache-workspace.mdx +++ b/apps/web/src/content/docs/docs/guides/workspace-architecture.mdx @@ -2,7 +2,7 @@ title: Workspace Architecture description: How AgentV clones and materializes working trees for eval runs, with performance guidance for large repos. sidebar: - order: 3 + order: 7 --- AgentV evaluations that use `workspace.repos` clone repositories directly from their source (git URL or local path) into a workspace directory. 
[Workspace pooling](/docs/guides/workspace-pool/) (enabled by default) eliminates repeated clone costs by reusing materialized workspaces across runs. @@ -131,4 +131,4 @@ To disable pooling for a run: agentv eval evals/my-eval.yaml --no-pool ``` -See the [Workspace Pool](/guides/workspace-pool/) guide for details on pool configuration, clean modes, concurrency, and drift detection. +See the [Workspace Pool](/docs/guides/workspace-pool/) guide for details on pool configuration, clean modes, concurrency, and drift detection. diff --git a/apps/web/src/content/docs/docs/guides/workspace-pool.mdx b/apps/web/src/content/docs/docs/guides/workspace-pool.mdx index ea0b33e30..50224bddd 100644 --- a/apps/web/src/content/docs/docs/guides/workspace-pool.mdx +++ b/apps/web/src/content/docs/docs/guides/workspace-pool.mdx @@ -2,7 +2,7 @@ title: Workspace Pool description: Reuse materialized workspaces across eval runs with fingerprint-based pooling, eliminating repeated clone and checkout costs. sidebar: - order: 4 + order: 8 --- Workspace pooling keeps materialized workspaces on disk between eval runs. Instead of cloning repos and checking out files every time, pooled workspaces reset in-place — typically reducing setup from minutes to seconds for large repositories. diff --git a/apps/web/src/content/docs/docs/guides/agent-skills-evals.mdx b/apps/web/src/content/docs/docs/integrations/agent-skills-evals.mdx similarity index 99% rename from apps/web/src/content/docs/docs/guides/agent-skills-evals.mdx rename to apps/web/src/content/docs/docs/integrations/agent-skills-evals.mdx index 846915036..54c03807e 100644 --- a/apps/web/src/content/docs/docs/guides/agent-skills-evals.mdx +++ b/apps/web/src/content/docs/docs/integrations/agent-skills-evals.mdx @@ -2,7 +2,7 @@ title: Skill Evals (evals.json) description: Run evals.json skill evaluations with AgentV, and graduate to EVAL.yaml when you need more power. 
sidebar: - order: 5 + order: 2 --- ## Overview diff --git a/apps/web/src/content/docs/docs/guides/autoevals-integration.mdx b/apps/web/src/content/docs/docs/integrations/autoevals-integration.mdx similarity index 99% rename from apps/web/src/content/docs/docs/guides/autoevals-integration.mdx rename to apps/web/src/content/docs/docs/integrations/autoevals-integration.mdx index cb5401419..12c700023 100644 --- a/apps/web/src/content/docs/docs/guides/autoevals-integration.mdx +++ b/apps/web/src/content/docs/docs/integrations/autoevals-integration.mdx @@ -2,7 +2,7 @@ title: Autoevals Integration description: Use Braintrust's open-source autoevals scorers (Factuality, Faithfulness, etc.) as code_grader graders in AgentV. sidebar: - order: 2 + order: 3 --- ## Overview diff --git a/apps/web/src/content/docs/docs/tools/convert.mdx b/apps/web/src/content/docs/docs/tools/convert.mdx index 6c40fd5b1..1a754967a 100644 --- a/apps/web/src/content/docs/docs/tools/convert.mdx +++ b/apps/web/src/content/docs/docs/tools/convert.mdx @@ -31,7 +31,7 @@ Outputs a `.eval.yaml` file alongside the input. agentv convert evals.json ``` -Converts an [Agent Skills `evals.json`](/docs/guides/agent-skills-evals) file into an AgentV EVAL YAML file. The converter: +Converts an [Agent Skills `evals.json`](/docs/integrations/agent-skills-evals) file into an AgentV EVAL YAML file. The converter: - Maps `prompt` → `input` message array - Maps `expected_output` → `expected_output` diff --git a/examples/features/autoresearch/EVAL.yaml b/examples/features/autoresearch/EVAL.yaml new file mode 100644 index 000000000..c8c1819ef --- /dev/null +++ b/examples/features/autoresearch/EVAL.yaml @@ -0,0 +1,142 @@ +name: incident-severity-classifier +description: | + Evaluates the incident severity classifier prompt for accuracy, output format, + and reasoning quality. Used as an autoresearch demo — the prompt artifact starts + weak and the optimization loop iteratively improves it. 
+ +execution: + target: llm + +defaults: + system_message: &system_prompt | + Classify the incident into P0, P1, P2, or P3. + Give your answer as JSON with severity and reasoning fields. + +tests: + - id: total-outage + criteria: Should classify a complete production outage as P0 + input: + - role: system + content: *system_prompt + - role: user + content: | + All production servers are unreachable. The main database cluster has failed over + but the failover node is also unresponsive. Customer-facing APIs return 503 for + all requests. Estimated 100% of users affected. Data replication lag detected. + assertions: + - type: contains + value: '"P0"' + id: CLASSIFIES_P0_OUTAGE + - type: is-json + id: RETURNS_VALID_JSON + - "Reasoning mentions complete service outage or total unavailability" + + - id: degraded-search + criteria: Should classify degraded search with no workaround as P1 + input: + - role: system + content: *system_prompt + - role: user + content: | + Search functionality is returning incomplete results for approximately 60% of queries. + Users report that recent items (last 2 hours) are not appearing in search results. + The indexing pipeline appears to be stuck. No workaround available for affected users. + assertions: + - type: contains + value: '"P1"' + id: CLASSIFIES_P1_DEGRADED + - type: is-json + id: RETURNS_VALID_JSON_P1 + - "Reasoning explains the user impact and lack of workaround" + + - id: slow-dashboard + criteria: Should classify slow dashboard with workaround as P2 + input: + - role: system + content: *system_prompt + - role: user + content: | + The analytics dashboard is loading slowly (15-20 seconds instead of usual 2-3 seconds). + Users can still access all data but the experience is degraded. The issue appears to be + related to a recent deployment. Users can use the API directly as a workaround. 
+ assertions: + - type: contains + value: '"P2"' + id: CLASSIFIES_P2_SLOW + - type: is-json + id: RETURNS_VALID_JSON_P2 + - "Reasoning mentions the workaround availability" + + - id: typo-in-footer + criteria: Should classify a cosmetic typo as P3 + input: + - role: system + content: *system_prompt + - role: user + content: | + A user reported that the footer on the settings page shows "Copyrigth 2024" + instead of "Copyright 2024". This is visible on all pages with the settings layout. + No functional impact. + assertions: + - type: contains + value: '"P3"' + id: CLASSIFIES_P3_COSMETIC + - type: is-json + id: RETURNS_VALID_JSON_P3 + - "Reasoning identifies this as a cosmetic or non-functional issue" + + - id: payment-failures + criteria: Should classify intermittent payment failures as P1 due to revenue impact + input: + - role: system + content: *system_prompt + - role: user + content: | + Approximately 15% of payment transactions are failing with a timeout error from the + payment gateway. The issue is intermittent — retrying usually works on the second attempt. + Revenue impact estimated at $50K/hour during peak hours. No permanent data loss detected. + assertions: + - type: contains + value: '"P1"' + id: CLASSIFIES_P1_PAYMENTS + - type: is-json + id: RETURNS_VALID_JSON_PAYMENTS + - "Reasoning weighs the revenue impact despite the issue being intermittent" + + - id: memory-leak-gradual + criteria: Should classify a gradual memory leak with days until impact as P2 + input: + - role: system + content: *system_prompt + - role: user + content: | + Monitoring detected a gradual memory leak in the auth service. Memory usage is growing + at ~50MB/hour. Current usage is 2.1GB of 8GB available. At current rate, the service + will need a restart in approximately 3 days. Automated restarts are configured as a + safety net. No current user impact. 
+ assertions: + - type: contains + value: '"P2"' + id: CLASSIFIES_P2_MEMORY + - type: is-json + id: RETURNS_VALID_JSON_MEMORY + - "Reasoning considers the time buffer and automated mitigations" + + - id: ssl-cert-expiry + criteria: Should classify imminent SSL cert expiry as P1 + input: + - role: system + content: *system_prompt + - role: user + content: | + SSL certificate for api.example.com expires in 4 hours. Auto-renewal failed due to + DNS validation timeout. If the cert expires, all HTTPS traffic to the production API + will show security warnings and many clients will refuse to connect. The team has + manual renewal steps documented but needs to act urgently. + assertions: + - type: contains + value: '"P1"' + id: CLASSIFIES_P1_SSL + - type: is-json + id: RETURNS_VALID_JSON_SSL + - "Reasoning mentions the time urgency and potential service disruption" diff --git a/examples/features/autoresearch/README.md b/examples/features/autoresearch/README.md new file mode 100644 index 000000000..85952a752 --- /dev/null +++ b/examples/features/autoresearch/README.md @@ -0,0 +1,26 @@ +# Autoresearch Example: Incident Severity Classifier + +Demonstrates the autoresearch optimization loop with a practical scenario. + +## Files + +- `classifier-prompt.md` — The artifact to optimize (a severity classification prompt) +- `EVAL.yaml` — 7 test cases with mixed assertion types (deterministic + rubric) + +## Running + +This example works like any other eval: + +```bash +agentv eval EVAL.yaml --experiment autoresearch-classifier --target azure +``` + +To run autoresearch, use the `agentv-bench` skill: + +``` +"Run autoresearch on examples/features/autoresearch/EVAL.yaml" +``` + +## Note + +Autoresearch is a **workflow pattern** — it works with any eval file, not just this one. This example exists as a ready-to-run demo and documentation reference. 
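The deterministic half of the eval (the `is-json` and `contains` assertions) amounts to a check like the following. This is a hypothetical sketch for illustration, not AgentV grader code; `passesDeterministicChecks` is an invented name.

```typescript
// Mirrors the eval's deterministic checks: output must parse as JSON with a
// severity field, and the raw text must contain the expected quoted severity.
function passesDeterministicChecks(output: string, expectedSeverity: string): boolean {
  try {
    const parsed = JSON.parse(output); // the is-json assertion
    if (typeof parsed.severity !== "string") return false;
  } catch {
    return false;
  }
  return output.includes(`"${expectedSeverity}"`); // the contains assertion
}

const good = passesDeterministicChecks(
  '{"severity":"P0","reasoning":"complete production outage"}',
  "P0",
);
const bad = passesDeterministicChecks("Severity: P0 (outage)", "P0");
```

The rubric assertions (plain strings like "Reasoning mentions...") are graded by an LLM and have no deterministic equivalent.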
diff --git a/examples/features/autoresearch/classifier-prompt.md b/examples/features/autoresearch/classifier-prompt.md new file mode 100644 index 000000000..8ae36f1c9 --- /dev/null +++ b/examples/features/autoresearch/classifier-prompt.md @@ -0,0 +1,5 @@ +# Incident Severity Classifier + +Classify the incident into P0, P1, P2, or P3. + +Give your answer as JSON with severity and reasoning fields. diff --git a/plugins/agentv-dev/skills/agentv-bench/SKILL.md b/plugins/agentv-dev/skills/agentv-bench/SKILL.md index 5ee0657b5..bf6e60e7b 100644 --- a/plugins/agentv-dev/skills/agentv-bench/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-bench/SKILL.md @@ -3,7 +3,8 @@ name: agentv-bench description: >- Run AgentV evaluations and optimize agents through eval-driven iteration. Triggers: run evals, benchmark agents, optimize prompts/skills against evals, compare - agent outputs across providers, analyze eval results, offline evaluation of recorded sessions. + agent outputs across providers, analyze eval results, offline evaluation of recorded sessions, + run autoresearch, optimize unattended, run overnight optimization loop. Not for: writing/editing eval YAML without running (use agentv-eval-writer), analyzing existing traces/JSONL without re-running (use agentv-trace-analyst). --- @@ -321,8 +322,8 @@ After improving: 1. Apply your changes to the agent's prompts/skills/config 2. Re-run all test cases (agentv creates a new `.agentv/results/runs//` directory automatically) -3. Compare against the previous iteration (Step 4) -4. Present results to the user +3. Compare against the previous iteration (Step 4). If running in automated mode, use the **automated keep/discard** logic below instead of manual judgment — it will decide whether to keep or revert the change for you. +4. Present results to the user (or log the decision if running automated keep/discard) 5. 
Stop when ANY of:
- The user says they're happy
- Feedback is all empty (everything looks good)

@@ -332,6 +333,87 @@ After improving:

**Human checkpoints**: At iterations 3, 6, and 9, always present progress to the user regardless of automation settings. Push back if optimization is accumulating contradictory rules or overfitting to specific test cases.

### Automated keep/discard

After each iteration, you can automatically decide whether to keep or discard the change using structured comparison output. This replaces manual judgment at steps 3–4 of the iteration loop above, except at human checkpoint iterations (3, 6, 9) where you must still present results to the user.

#### 1. Run the comparison

After re-running test cases, compare the new results against the previous iteration's baseline:

```bash
agentv compare <baseline>.jsonl <candidate>.jsonl --json
```

Where `<baseline>.jsonl` is the `index.jsonl` from the previous best iteration and `<candidate>.jsonl` is the `index.jsonl` from the run you just completed.

#### 2. Parse the output

The `--json` flag produces structured output:

```json
{
  "summary": {
    "wins": 3,
    "losses": 1,
    "ties": 6,
    "mean_delta": 0.05
  }
}
```

- **wins**: number of test cases where the candidate scored higher than the baseline
- **losses**: number of test cases where the candidate scored lower
- **ties**: number of test cases with no score change
- **mean_delta**: average score difference across all test cases (positive = candidate is better)

#### 3. Apply decision rules

Use these rules in order:

| Condition | Decision | Action |
|-----------|----------|--------|
| `wins > losses` | **KEEP** | Promote the candidate to the new baseline. Copy or note its `index.jsonl` path as the baseline for the next iteration. |
| `wins <= losses` | **DISCARD** | Revert the prompt/skill/config change (unless the equal-performance exception below applies). The previous baseline remains. Try a different mutation on the next iteration.
|
| `wins == losses` AND `mean_delta == 0` AND candidate prompt is shorter (fewer lines) | **KEEP** | Exception to the DISCARD rule above: simpler prompts are preferred when performance is equal. Promote the candidate as the new baseline. |

When `mean_delta == 0` and the candidate prompt is *not* shorter, treat it as a **DISCARD** — there's no reason to keep a change that adds complexity without improving results.

#### 4. Log the decision

Before proceeding to the next iteration, log the decision and rationale so the user can review later:

```
Iteration 2: KEEP
  wins=3, losses=1, ties=6, meanDelta=+0.05
  Rationale: candidate wins outweigh losses (3 > 1)
  Baseline promoted: .agentv/results/runs/20250101-120000/index.jsonl
```

```
Iteration 3: DISCARD
  wins=1, losses=2, ties=7, meanDelta=-0.03
  Rationale: candidate losses outweigh wins (2 > 1)
  Reverted to baseline: .agentv/results/runs/20250101-110000/index.jsonl
  Next: try a different mutation
```

Include this log in your progress summary. At human checkpoints (iterations 3, 6, 9), present the full log of automated decisions since the last checkpoint alongside the current results.

#### 5. Integration with the iteration loop

The automated keep/discard replaces the manual compare-and-present cycle (steps 3–4) during non-checkpoint iterations. The full flow becomes:

1. Apply change to prompts/skills/config
2. Re-run all test cases
3. Run `agentv compare baseline.jsonl candidate.jsonl --json`
4. Apply keep/discard rules → promote or revert
5. Log the decision
6. If this is iteration 3, 6, or 9 → present progress to the user (human checkpoint)
7. Check stop conditions → continue or stop

Both modes coexist: if the user is actively reviewing results, present to them as before. If the user has asked you to iterate autonomously, use automated keep/discard and only pause at human checkpoints.
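The decision rules can be sketched as a small function. This is illustrative TypeScript, not part of AgentV; the field names mirror the `summary` object that `agentv compare --json` emits, and `decide` is a hypothetical name.

```typescript
interface CompareSummary {
  wins: number;
  losses: number;
  ties: number;
  mean_delta: number;
}

// candidateShorter: whether the candidate prompt has fewer lines than the baseline.
function decide(summary: CompareSummary, candidateShorter: boolean): "KEEP" | "DISCARD" {
  if (summary.wins > summary.losses) return "KEEP";
  // Equal-performance exception: at identical scores, prefer the simpler prompt.
  if (summary.mean_delta === 0 && candidateShorter) return "KEEP";
  return "DISCARD";
}

// More wins than losses: KEEP regardless of prompt length.
const d1 = decide({ wins: 3, losses: 1, ties: 6, mean_delta: 0.05 }, false);
// No score movement but a shorter candidate: KEEP via the exception.
const d2 = decide({ wins: 0, losses: 0, ties: 10, mean_delta: 0 }, true);
// More losses than wins: DISCARD and revert.
const d3 = decide({ wins: 1, losses: 2, ties: 7, mean_delta: -0.03 }, false);
```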
---

## Entering Mid-Lifecycle

@@ -364,6 +446,235 @@ After the agent is working well, offer to optimize the skill's `description` field

---

## Autoresearch Mode

Autoresearch is an unattended eval-improve loop that runs multiple optimize cycles without human intervention. The user triggers it with natural language (e.g., "run autoresearch on this skill", "optimize this skill unattended"). No YAML schema changes or CLI flags are needed.

### Prerequisites

- An eval file (`EVAL.yaml` or `evals.json`) must exist for the artifact being optimized.
- The artifact must be a file or directory (SKILL.md, prompt template, agent config, or a directory of related files like a skill with `references/`).
- The user should have run at least one interactive eval cycle to build confidence in eval quality before going unattended.

### The loop

```
1. RUN EVAL — agentv eval with current artifact
2. ANALYZE  — dispatch analyzer subagent on results
3. DECIDE   — if score > best_score: KEEP, else DROP (automated keep/discard from Step 5)
4. MUTATE   — dispatch mutator subagent with failure analysis (agents/mutator.md)
5. GOTO 1   — until convergence or max_cycles
```

### Experiment naming

Derive the experiment name from the artifact: `autoresearch-<artifact-name>` (e.g., `autoresearch-pdf-skill`). The user can also provide a custom name.

### Artifact mutation flow

The mutator rewrites artifacts in the working tree in place. **Git is used for versioning** — HEAD always contains the best-known version:

1. Record the starting commit SHA before the first cycle: `initial_sha=$(git rev-parse HEAD)`.
2. On each **KEEP**: `git add <artifact-path> && git commit -m "autoresearch cycle N: <mutation summary>"`.
3. On each **DROP**: `git checkout -- <artifact-path>` (restores the working tree to HEAD, the last KEEP commit).
4. The eval always runs against the real file path — no temp files or indirection.
5. The mutator can reference the original via `git show <initial-sha>:<artifact-path>`.
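The KEEP/DROP git flow above can be exercised end to end in a throwaway repo. A runnable sketch (the file contents, commit messages, and throwaway identity are illustrative):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email autoresearch@example.com   # throwaway identity
git config user.name autoresearch

artifact=SKILL.md
echo "v0: original instructions" > "$artifact"
git add "$artifact" && git commit -qm "baseline"
initial_sha=$(git rev-parse HEAD)                # recorded before cycle 1

# Cycle 1 scored higher -> KEEP: commit the candidate as the new baseline.
echo "v1: improved instructions" > "$artifact"
git add "$artifact" && git commit -qm "autoresearch cycle 1: added null-check rule"

# Cycle 2 regressed -> DROP: restore the working tree to the last KEEP commit.
echo "v2: regressed instructions" > "$artifact"
git checkout -- "$artifact"

cat "$artifact"                       # v1: improved instructions
git show "$initial_sha:$artifact"     # v0: original instructions
```

Note the hill-climbing property: after the DROP, the working tree holds the cycle-1 version (the best so far), while `initial_sha` still gives the mutator access to the untouched original.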
### How the skill invokes eval

Shell out to `agentv eval <eval-path> --experiment autoresearch-<artifact-name>` via the Bash tool, same as the existing interactive bench workflow.

### Artifact layout

Each cycle is a standard eval run. Autoresearch session metadata lives in `_autoresearch/` within the experiment directory:

```
.agentv/results/runs/<experiment-name>/
  _autoresearch/
    iterations.jsonl        # one line per cycle — data for chart + mutator
    trajectory.html         # live-updating score trajectory chart
  2026-04-15T10-30-00/      # cycle 1 — standard run artifacts
    index.jsonl
    grading.json
    timing.json
    benchmark.json
    report.html
  2026-04-15T10-35-00/      # cycle 2 — standard run artifacts
  ...
```

No `original.md` or `best.md` files — git history serves as the backup. The `_` prefix convention distinguishes workflow folders from timestamped run dirs.

### iterations.jsonl

One JSON object per line, one line per cycle:

```jsonl
{"cycle":1,"score":0.65,"decision":"keep","cost_usd":0.12,"assertions":{"IDENTIFIES_BUG":0.8,"SUGGESTS_FIX":0.4},"mutation":"added explicit null-check instruction","run_dir":"2026-04-15T10-30-00","timestamp":"2026-04-15T10:32:15Z"}
```

Fields: `cycle` (1-indexed), `score` (overall pass rate 0–1), `decision` ("keep" or "drop"), `cost_usd` (eval run cost), `assertions` (per-assertion pass rates), `mutation` (one-line description of what changed), `run_dir` (timestamped directory name), `timestamp` (ISO 8601).

### trajectory.html

A standalone HTML chart file with embedded Chart.js. Copy the template from `scripts/trajectory.html` into the `_autoresearch/` directory. It fetches `iterations.jsonl` from the same directory on each auto-refresh — no data injection needed. Shows:

- Score over iterations (line chart) with KEEP (green) / DISCARD (red) markers
- Per-assertion pass rates over iterations
- Cumulative cost across iterations
- Best vs original score summary

Auto-refreshes every 2 seconds during the loop.
Becomes static after completion (remove the auto-refresh meta tag on the final update).

### Convergence

Stop after **3** consecutive cycles with no improvement (no KEEP). Also stop at **max_cycles** (default 10). Either limit can be overridden by the user.

### Human checkpoints

Autoresearch mode **skips** human checkpoints at iterations 3/6/9. The user opted in to unattended operation by requesting autoresearch.

### Context hygiene

The orchestrator must run for many cycles without exhausting its context window. To do this:

- **Never read eval results, artifacts, or transcripts into your own context.** Use bash commands (jq, the agentv CLI) that output small structured summaries.
- **Delegate all heavy reading to subagents.** The mutator reads artifacts, grading results, and transcripts from disk — you pass it paths, not content.
- **Use bash for all file I/O** in the loop body: appending to `iterations.jsonl`, git operations, score extraction. The only tool calls per cycle should be bash commands and one subagent dispatch (mutator).
- **trajectory.html auto-loads `iterations.jsonl`** via fetch — no need to read or update the HTML file after the initial copy.

### Procedure

Follow this step-by-step procedure to execute autoresearch:

#### 1. Setup

1. Determine the **artifact path** (file or directory to optimize) and **eval path** (EVAL.yaml or evals.json).
2. Detect the **artifact mode**: `file` if the artifact path is a file, `directory` if it's a directory.
3. Derive the **experiment name**: `autoresearch-<artifact-name>` from the artifact filename/dirname, or use a user-provided name.
4. Set the experiment directory: `.agentv/results/runs/<experiment-name>/`.
5. Create the `_autoresearch/` subdirectory inside the experiment directory.
6. Record `initial_sha=$(git rev-parse HEAD)` — the commit before any mutations.
7. Copy `scripts/trajectory.html` to `_autoresearch/trajectory.html`.
8.
Initialize variables:
   - `best_score = 0`
   - `convergence_count = 0`
   - `cycle = 1`
   - `max_cycles = 10` (or user-specified)
   - `max_convergence = 3` (or user-specified)

#### 2. Main loop

Repeat while `cycle <= max_cycles` and `convergence_count < max_convergence`:

**a. Run eval**

```bash
agentv eval <eval-path> --experiment autoresearch-<artifact-name>
```

**b. Extract scores (bash only — do NOT read result files into your context)**

Find the latest timestamped directory in the experiment folder. Use bash/jq to extract small structured values:

```bash
# Find latest run dir
RUN_DIR=$(ls -td <experiment-dir>/20*/ | head -1)

# Overall score (mean of all scores in index.jsonl)
SCORE=$(jq -sr '[.[].scores[].score] | add / length' "$RUN_DIR/index.jsonl")

# Per-assertion pass rates as JSON object
PASS_RATES=$(jq -sr '[.[].scores[]] | group_by(.type) | map({key: .[0].type, value: (map(.score) | add / length)}) | from_entries' "$RUN_DIR/index.jsonl")

# Cost (if timing.json exists)
COST=$(jq -r '.cost_usd // 0' "$RUN_DIR/timing.json" 2>/dev/null || echo 0)
```

Capture only these small outputs (`SCORE`, `PASS_RATES`, `COST`) — never read the full JSONL into context.

**c. Update iterations.jsonl (bash only)**

After the KEEP/DROP decision (step e), append one JSON line via bash:

```bash
echo '{"cycle":'$CYCLE',"score":'$SCORE',"decision":"'$DECISION'","cost_usd":'$COST',"assertions":'$PASS_RATES',"mutation":"'"$MUTATION_DESC"'","run_dir":"'"$(basename $RUN_DIR)"'","timestamp":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> <experiment-dir>/_autoresearch/iterations.jsonl
```

**d. trajectory.html — no action needed**

The trajectory chart fetches `iterations.jsonl` directly via HTTP on each auto-refresh. No file manipulation is required after the initial copy in setup.

**e. Decide: KEEP or DROP**

Apply the automated keep/discard rules from Step 5:

1.
Run `agentv compare <baseline>.jsonl <candidate>.jsonl --json`, where `<baseline>` is the best iteration's `index.jsonl` (or the first run's `index.jsonl` for cycle 1) and `<candidate>` is this cycle's `index.jsonl`.
2. If `wins > losses` → **KEEP**.
3. If `wins <= losses` → **DISCARD**.
4. If `mean_delta == 0` and the artifact is simpler → **KEEP** (simpler is better at equal performance). Simplicity: for files, compare line count; for directories, compare total size via `du -sb`.

For cycle 1, there is no baseline to compare against — always **KEEP** the first cycle.

**f. If KEEP**

- Update `best_score` to this cycle's score.
- Commit the artifact: `git add <artifact-path> && git commit -m "autoresearch cycle N: <mutation summary>"`.
- Record the current `index.jsonl` path as the new baseline for future comparisons.
- Reset `convergence_count = 0`.

**g. If DROP**

- Revert the working tree to HEAD: `git checkout -- <artifact-path>` (for files) or `git checkout -- <artifact-dir>/` (for directories).
- Increment `convergence_count`.

**h. Check stop conditions**

If `convergence_count >= max_convergence` or `cycle >= max_cycles` → break out of the loop.

**i. Mutate**

Dispatch the **mutator** subagent (`agents/mutator.md`) with:
- `artifact-path`: the file or directory to mutate
- `artifact-mode`: `file` or `directory`
- `initial-sha`: the starting commit SHA (for referencing the original via `git show`)
- `pass-rates`: the `$PASS_RATES` JSON object from step (b) (small — just assertion names and rates)
- `run-dir`: path to this cycle's run directory (the mutator reads `grading.json` and transcripts itself)
- `iterations-path`: path to `_autoresearch/iterations.jsonl` (the mutator reads mutation history itself)
- For directory mode: `focus-files` (optional — files most likely contributing to failures, derived from assertion names)

**Do NOT pass failure descriptions, transcripts, or grading content** to the mutator — pass paths and let it read what it needs from disk. This keeps the orchestrator's context clean.
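One way to honor the paths-only rule is to assemble the dispatch payload as a small JSON object of paths and pass rates. A sketch with entirely hypothetical values (the skill path, run directory, SHA, and assertion names are made up for illustration, not taken from the workflow):

```shell
# Illustrative only: every concrete path and value below is hypothetical.
EXPERIMENT_DIR=".agentv/results/runs/autoresearch-pdf-skill"
RUN_DIR="$EXPERIMENT_DIR/2026-04-15T10-30-00"
PASS_RATES='{"IDENTIFIES_BUG":0.8,"SUGGESTS_FIX":0.4}'

# Paths only: the mutator reads grading.json and transcripts itself.
payload=$(cat <<EOF
{
  "artifact-path": "skills/pdf-skill/SKILL.md",
  "artifact-mode": "file",
  "initial-sha": "abc1234",
  "pass-rates": $PASS_RATES,
  "run-dir": "$RUN_DIR",
  "iterations-path": "$EXPERIMENT_DIR/_autoresearch/iterations.jsonl"
}
EOF
)
echo "$payload"
```

Everything the mutator needs to read in bulk stays on disk; the payload itself stays a few hundred bytes regardless of how large the transcripts grow.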
The mutator rewrites artifacts in place. Verify that the artifact was modified (e.g., `git diff --stat`) before continuing.

**j. Continue**

Increment `cycle` and return to step (a).

#### 3. Completion

1. Finalize `trajectory.html`: remove the line containing the auto-refresh `<meta http-equiv="refresh">` tag so the chart becomes static.
2. Log a final summary:
   - Total cycles run
   - Final best score vs original score (cycle 1)
   - Number of KEEPs and DROPs
   - Total cost across all cycles
   - The optimized artifact is in the working tree (and the latest commit)
   - Run `git diff <initial-sha>` to see the total changes from the original
   - Run `git log --oneline <initial-sha>..HEAD` to see the mutation history
   - Path to `_autoresearch/trajectory.html` (the score chart)
3. Present results to the user with a recommendation: adopt the optimized version, revert to the original (`git checkout <initial-sha> -- <artifact-path>`), or continue iterating interactively.

### Interactive/autonomous hybrid

Users can start in interactive mode (the existing Step 3–5 loop with human checkpoints), build confidence in their eval quality, and then switch to autoresearch mode to run unattended. The two modes share the same eval infrastructure and artifact layout — autoresearch simply automates the keep/discard decisions and removes human checkpoints.

### Model empathy recommendation

For best results, use same-model pairings: the meta-agent running autoresearch should match the model used by the task agent being evaluated (e.g., Claude optimizing a Claude agent, GPT optimizing a GPT agent). Per AutoAgent research findings, same-model pairings produce better mutations because the optimizer has implicit knowledge of how the target model interprets instructions.

---

## Environment Adaptation

For provider-specific notes (Copilot, Codex, Claude SDK, custom CLI), CI/headless mode behavior, and fallback strategies when subagents aren't available, read `references/environment-adaptation.md`.
@@ -380,6 +691,7 @@ The `agents/` directory contains instructions for specialized subagents. Read th

| grader | `agents/grader.md` | Grade responses with per-assertion evidence | Step 3 (grading — one per test × LLM grader pair) |
| comparator | `agents/comparator.md` | Blind N-way comparison + post-hoc analysis | Step 4 (comparing iterations/targets) |
| analyzer | `agents/analyzer.md` | Quality audit, deterministic upgrades, benchmarks | Step 4 (pattern analysis) |
| mutator | `agents/mutator.md` | Rewrite artifact from failure analysis | Step 5 (autoresearch — dispatched per cycle) |

The `references/` directory has additional documentation:
- `references/eval-yaml-spec.md` — Eval YAML schema and assertion grading recipes

diff --git a/plugins/agentv-dev/skills/agentv-bench/agents/mutator.md b/plugins/agentv-dev/skills/agentv-bench/agents/mutator.md
new file mode 100644
index 000000000..e01c29ac1
--- /dev/null
+++ b/plugins/agentv-dev/skills/agentv-bench/agents/mutator.md
@@ -0,0 +1,172 @@
---
name: mutator
description: >-
  Generate improved versions of the artifact under test (skill, prompt, config,
  or directory of related files) based on failure analysis. Reads the current
  best artifact from the working tree, applies targeted mutations to address
  failing assertions, and writes changes in place. Supports single files and
  multi-file directories. Dispatch this agent after the analyzer identifies
  failure patterns.
model: inherit
color: green
tools: ["Read", "Write", "Bash", "Glob", "Grep"]
---

You are the Mutator for AgentV's evaluation workflow. Your job is to rewrite the artifact under test so that failing assertions start passing, while preserving everything that already works. You produce **complete replacement files** — never diffs, patches, or suggestion lists.

## Core Principles

1. **Hill-climbing ratchet**: Always read from the "best" version, never from a failed candidate. Each mutation builds on the highest-scoring artifact so far.
2. **Evidence-driven only**: Every change you make must trace back to a specific failing assertion or failure description. Never add speculative features.
3. **Preserve passing behavior**: Instructions that already pass consistently must survive unchanged in meaning. You may rephrase for clarity, but do not alter intent.
4. **Simplicity criterion**: When two versions score equally, prefer the simpler one. Remove redundant or verbose instructions that don't contribute to passing assertions. Cleaner artifacts at equal performance are improvements.

## Input Parameters

You will receive:
- `artifact-path`: Path to the file or directory to mutate (the artifact under test). **Write changes back to this same path.**
- `artifact-mode`: `file` or `directory`. Determines how you read and write the artifact.
- `initial-sha`: The git commit SHA before any autoresearch mutations began. Use `git show <initial-sha>:<path>` to reference the original version when needed.
- `pass-rates`: Per-assertion pass rates as a JSON mapping, e.g. `{"IDENTIFIES_CLARITY_ISSUES": 0.6, "SUGGESTS_CONCRETE_FIX": 1.0, "OUTPUT_IS_STRUCTURED": 0.2}`
- `run-dir`: Path to this cycle's eval run directory. **Read `grading.json` here** to understand why assertions failed (evidence, per-test scores). Read test transcripts/responses as needed.
- `iterations-path`: Path to `_autoresearch/iterations.jsonl`. **Read this** to see mutation history and avoid repeating failed strategies.
- `iteration`: Current iteration number (for context in the changelog)
- `focus-files` (directory mode, optional): Files most likely contributing to failures — read these first.

## Process

### Step 1: Read Inputs

1. **Read the current best artifact** from the working tree at `artifact-path`. This is your mutation base (HEAD always contains the best-known version after KEEP commits or DROP reverts).
2. **Reference the original** via `git show <initial-sha>:<path>` when you need to understand the author's original intent.
Don't use this as the mutation base.
3. **For directory mode**: Read `focus-files` first if provided, then selectively read others as needed. For large directories (>15 files), don't read everything — focus on the files most relevant to failing assertions.
4. **Parse pass rates** to classify each assertion:
   - **Passing** (≥ 80%): Preserve the instructions responsible for these.
   - **Failing** (< 80%): These are your mutation targets.
   - **Near-passing** (60–79%): May need only minor reinforcement.
   - **Hard-failing** (< 40%): Need substantial new instructions.
5. **Read failure evidence** from `<run-dir>/grading.json` to understand *why* assertions fail — look at per-test assertion evidence, not just which ones fail. For deeper analysis, read individual test responses in `<run-dir>/<test-id>/response.md`.
6. **Read mutation history** from `iterations-path` to see what was tried before — avoid repeating strategies that led to DROPs.

### Step 2: Analyze Failure Causes

For each failing assertion, determine the root cause:

| Pattern | Likely Cause | Mutation Strategy |
|---------|-------------|-------------------|
| Agent omits a required behavior | Missing instruction | Add an explicit, concrete instruction |
| Agent does the opposite of what's expected | Ambiguous or contradictory instruction | Rewrite the instruction to be unambiguous |
| Agent partially satisfies the criterion | Instruction is vague | Add specifics — examples, formats, constraints |
| Agent satisfies it sometimes but not always | Instruction exists but is easy to overlook | Elevate priority — move to a prominent position, add emphasis |
| Output format doesn't match expectations | Missing format specification | Add explicit format requirements with examples |

### Step 3: Plan Mutations

Before writing, plan your changes:

1. **List each failing assertion** and the specific instruction change that addresses it.
2.
**Check for conflicts**: Will a new instruction contradict or undermine a passing one? If so, find a formulation that satisfies both.
3. **Check for redundancy**: If two failing assertions share a root cause, one instruction change may fix both.
4. **Apply the simplicity criterion**: If the best artifact has verbose instructions for passing assertions, consider simplifying them — but only if you're confident the simplification won't cause regressions.

### Step 4: Write the Mutated Artifact

1. **Re-read the artifact** from the working tree to ensure you have the latest content.
2. **Apply your planned mutations** to produce complete rewritten files.
3. **Write the result** to `artifact-path` (in-place mutation).

**File mode**: Write a single complete file. The output must be standalone — no diff markers, comments about what changed, or meta-content.

**Directory mode**: You can modify any file within the artifact scope, and you can create new files within it. Only write files you actually changed — don't rewrite unchanged files. Do not delete files (modifications and creations only).

### Step 5: Produce a Changelog

After writing the artifact, output a structured changelog explaining what you changed and why. This will be logged in `iterations.jsonl` for audit.

```
## Mutation Report (Iteration {iteration})

### Assertions Targeted

| Assertion | Pass Rate | Action Taken |
|-----------|-----------|-------------|
| IDENTIFIES_CLARITY_ISSUES | 3/5 (60%) | Added explicit instruction to check for ambiguous pronouns |
| OUTPUT_IS_STRUCTURED | 1/5 (20%) | Added format specification with markdown header requirements |
| SUGGESTS_CONCRETE_FIX | 5/5 (100%) | No change (passing) |

### Changes Made

1. **[Section/Location]**: [What changed] — addresses [ASSERTION_NAME] failing because [reason from failure descriptions]
2. ...

### Preserved

- [List of key instructions left unchanged because their assertions pass]

### Simplifications

- [Any instructions simplified or removed, with justification]

### Risk Assessment

- [Any changes that might affect currently-passing assertions, and why you believe they're safe]
```

## Mutation Strategies

### For assertions below 80% pass rate: Add explicit instructions

**Bad** (vague):
> Be thorough in your analysis.

**Good** (concrete and actionable):
> For each input, check for: (1) ambiguous pronouns — flag any pronoun without a clear antecedent within the same sentence, (2) implicit assumptions — identify claims that assume context not provided in the input.

### For near-passing assertions (60–79%): Reinforce existing instructions

The instruction likely exists but is too easy to overlook. Options:
- Move it to a more prominent position (the beginning of a section, its own subsection)
- Add a concrete example showing the expected behavior
- Rephrase for clarity without changing intent

### For hard-failing assertions (< 40%): Add substantial new content

The artifact likely lacks any instruction addressing this criterion. Add a dedicated subsection with:
- A clear directive
- The reasoning (why this matters)
- One or two concrete examples
- Edge cases to watch for

### Simplification opportunities

When the artifact scores well but is verbose:
- Remove duplicated instructions that say the same thing in different words
- Collapse overly detailed examples when a concise one suffices
- Remove hedging language ("you might want to consider possibly...") in favor of direct instructions

## Directory Mode: Scoping Guidance

When `artifact-mode` is `directory`:

- **Minimize blast radius** — prefer fewer file changes per iteration. Changing one file precisely is better than touching five files superficially.
- **One logical change per iteration** across all files.
If you need to add a new reference file AND update the main SKILL.md to reference it, that counts as one logical change.
- **For large directories (>15 files)**, don't read everything. Use `focus-files` to identify the most relevant files, read those, and only read others if the failure analysis points to them.
- **New files are OK** — if the artifact needs a new reference doc, example, or sub-agent definition, create it within the artifact directory.
- **Don't delete files** — only modify existing files or create new ones. Deletion risks breaking references elsewhere.

## Guardrails

**DO:**
- Trace every change to a specific failing assertion or failure description
- Preserve the artifact's original format and structure conventions
- Write a complete, self-contained file — someone reading it should not need to know a mutation happened
- Explain every change in the changelog with evidence

**DO NOT:**
- Add instructions for things that aren't being tested (speculative features)
- Use a failed candidate as your mutation base — always start from the working tree (which is the best version after KEEP/DROP)
- Produce diffs, patches, or suggestion lists instead of complete files
- Delete files in directory mode (modifications and creations only)
- Add meta-commentary inside the artifact (e.g., a comment like `<!-- changed in iteration 3 -->`)
- Remove instructions for passing assertions to "make room" for new ones
- Make changes based on intuition alone — every mutation must connect to observed failure data
- Over-engineer: if a simple one-line instruction would fix a failing assertion, don't add a full subsection with examples unless the failure pattern suggests the agent needs that level of detail

diff --git a/plugins/agentv-dev/skills/agentv-bench/scripts/trajectory.html b/plugins/agentv-dev/skills/agentv-bench/scripts/trajectory.html
new file mode 100644
index 000000000..10488792e
--- /dev/null
+++ b/plugins/agentv-dev/skills/agentv-bench/scripts/trajectory.html
@@ -0,0 +1,462 @@
(Markup of the 462-line trajectory.html omitted; only its text content survived extraction. It is a standalone Chart.js page titled "AgentV Autoresearch Trajectory", with an auto-refresh meta tag, chart panels for "Score over Iterations", "Per-Assertion Pass Rates", and "Cumulative Cost (USD)", and an "Iteration Log" table.)