Merged
4 changes: 3 additions & 1 deletion apps/cli/src/commands/eval/artifact-writer.ts
@@ -15,7 +15,9 @@ export function buildTestTargetKey(testId?: string, target?: string): string {
}

// Deduplication helper — keeps the last entry per (test_id, target) pair.
-export function deduplicateByTestIdTarget(results: readonly EvaluationResult[]): EvaluationResult[] {
+export function deduplicateByTestIdTarget(
+  results: readonly EvaluationResult[],
+): EvaluationResult[] {
  const seen = new Map<string, number>();
  for (let i = 0; i < results.length; i++) {
    seen.set(buildTestTargetKey(results[i].testId, results[i].target), i);
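For review context, here is a self-contained sketch of how the full helper behaves. The hunk is truncated in this view, so the key separator and the final filter pass are assumptions rather than the file's actual code:

```typescript
// Hypothetical reconstruction of the truncated helper. The "::" separator
// and the second pass are assumptions, not the file's actual code.
type EvaluationResult = { testId?: string; target?: string };

function buildTestTargetKey(testId?: string, target?: string): string {
  return `${testId ?? ""}::${target ?? ""}`;
}

function deduplicateByTestIdTarget(
  results: readonly EvaluationResult[],
): EvaluationResult[] {
  // First pass: record the index of the last occurrence per key.
  const seen = new Map<string, number>();
  for (let i = 0; i < results.length; i++) {
    seen.set(buildTestTargetKey(results[i].testId, results[i].target), i);
  }
  // Second pass: keep only last occurrences, preserving their relative order.
  return results.filter(
    (r, i) => seen.get(buildTestTargetKey(r.testId, r.target)) === i,
  );
}
```

The signature reformat in this hunk does not change behavior; the helper still keeps the last entry per (test_id, target) pair.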
2 changes: 2 additions & 0 deletions apps/web/src/content/docs/docs/guides/agent-eval-layers.mdx
@@ -130,6 +130,8 @@ A complete `EVAL.yaml` covering all four layers:

```yaml
description: Four-layer agent evaluation starter
+sidebar:
+  order: 1

execution:
  target: default
207 changes: 207 additions & 0 deletions apps/web/src/content/docs/docs/guides/autoresearch.mdx
@@ -0,0 +1,207 @@
---
title: Autoresearch
description: Run an unattended eval-improve loop that iteratively optimizes agent skills
sidebar:
  order: 5
---

import { Image } from 'astro:assets';
import trajectoryChart from '../../../../assets/screenshots/autoresearch-trajectory.png';

Autoresearch is an unattended optimization loop that **automatically improves your agent skills** through repeated eval cycles. It runs the same evaluate → analyze → improve loop described in the [Skill Improvement Workflow](/docs/guides/skill-improvement-workflow/), but does it hands-free — no human review between cycles.

<Image src={trajectoryChart} alt="Autoresearch trajectory chart showing score improvement from 0.48 to 0.90 over 9 cycles" />

The chart above shows a real optimization run: an incident severity classifier starts at 48% accuracy and reaches 90% after 9 automated cycles — each cycle taking seconds and costing fractions of a cent.

## How It Works

```
┌───────────┐
│  1. EVAL  │ ◄───────────────────────────────┐
└─────┬─────┘                                 │
      ▼                                       │
┌───────────┐                                 │
│ 2. ANALYZE│ dispatcher → analyzer subagent  │
└─────┬─────┘                                 │
      ▼                                       │
┌───────────┐ wins > losses → KEEP            │
│ 3. DECIDE │ else → DROP                     │
└─────┬─────┘                                 │
      ▼                                       │
┌───────────┐                                 │
│ 4. MUTATE │ dispatcher → mutator subagent ──┘
└───────────┘

Stops after 3 consecutive no-improvement cycles
or 10 total cycles (configurable).
```

Each cycle:
1. **Runs `agentv eval`** against the current version of the artifact
2. **Analyzes** failures via the analyzer subagent
3. **Decides** keep or discard using `agentv compare --json` (automated — no human needed)
4. **Mutates** the artifact to address failing assertions, then loops back

The system uses a **hill-climbing ratchet**: each mutation builds on the best-scoring version, never a failed candidate. Improvements compound; regressions get discarded.
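The ratchet can be sketched in a few lines of illustrative TypeScript (names are hypothetical, not AgentV's actual implementation):

```typescript
// Sketch of the hill-climbing ratchet: every mutation starts from the
// best-scoring version, and only score improvements are promoted.
type Candidate = { score: number; artifact: string };

function ratchet(
  best: Candidate,
  mutate: (base: string) => string,
  evaluate: (artifact: string) => number,
  cycles: number,
): Candidate {
  for (let i = 0; i < cycles; i++) {
    // Always mutate from the best version, never a failed candidate.
    const artifact = mutate(best.artifact);
    const score = evaluate(artifact);
    if (score > best.score) {
      best = { score, artifact }; // KEEP: improvements compound
    } // else DROP: the regression is discarded
  }
  return best;
}
```

The key property is that `best` only ever moves upward, so a bad mutation in one cycle cannot poison later cycles.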

## What Gets Optimized

Any file or directory artifact: SKILL.md, prompt template, agent config, system prompt, or a directory of related files (e.g., a skill with `references/` and `agents/` subdirectories). The artifact mode is auto-detected — pass a file path for single-file optimization, or a directory path for multi-file optimization. The mutator rewrites artifacts in place while the eval stays fixed — same test cases, same assertions, different artifact versions.
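Auto-detection can be as simple as a filesystem stat. The following is a sketch under the assumption that the path exists; AgentV's actual detection logic may differ:

```typescript
import { statSync } from "node:fs";

// Hypothetical sketch: a directory path selects multi-file optimization,
// a file path selects single-file optimization.
type ArtifactMode = "file" | "directory";

function detectArtifactMode(path: string): ArtifactMode {
  return statSync(path).isDirectory() ? "directory" : "file";
}
```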

## Prerequisites

- An eval file (EVAL.yaml or evals.json) that covers the behavior you care about
- The artifact must be a file or directory within a git repository (autoresearch uses git for versioning)
- Run at least one manual eval cycle first to validate your test cases

:::tip
Autoresearch is only as good as your eval. If your assertions don't catch the failures you care about, the optimizer won't fix them. Start with the [manual improvement loop](/docs/guides/skill-improvement-workflow/) to build confidence in your eval quality before going unattended.
:::

## Triggering Autoresearch

Autoresearch runs through the `agentv-bench` Claude Code skill. Trigger it with natural language:

```
"Run autoresearch on my classifier prompt"
"Optimize this skill unattended for 5 cycles"
"Run autoresearch on examples/features/autoresearch/EVAL.yaml"
```

No CLI flags or YAML schema changes needed — the skill handles everything.

## Output Structure

Each autoresearch session creates a self-contained experiment directory:

```
.agentv/results/runs/autoresearch-<name>/
├── _autoresearch/
│ ├── iterations.jsonl # Per-cycle data (score, decision, mutation)
│ └── trajectory.html # Live-updating Chart.js visualization
├── 2026-04-15T10-30-00/ # Cycle 1 run artifacts
│ ├── index.jsonl
│ ├── grading.json
│ └── timing.json
├── 2026-04-15T10-35-00/ # Cycle 2 run artifacts
│ └── ...
└── ...
```

Autoresearch uses **git-based versioning** instead of backup files. Each successful mutation is committed (`git add && git commit`), and failed mutations are reverted (`git checkout`). The optimized artifact lives in the working tree and the latest commit — no separate `best.md` to copy.

- **`_autoresearch/trajectory.html`** — Open in a browser to see the score trajectory, per-assertion breakdown, and cumulative cost. Auto-refreshes during the loop, becomes static on completion.
- **`_autoresearch/iterations.jsonl`** — Machine-readable log of every cycle for downstream analysis.
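For downstream analysis, the JSONL log can be consumed with a few lines of TypeScript. The field names here are assumptions based on the descriptions above, not a documented schema:

```typescript
// Hypothetical record shape for _autoresearch/iterations.jsonl; the
// actual field names may differ.
interface IterationRecord {
  cycle: number;
  score: number;
  decision: "KEEP" | "DROP";
}

// Parse JSONL text: one JSON object per non-empty line.
function parseIterations(jsonl: string): IterationRecord[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as IterationRecord);
}
```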

Review the mutation history with `git log` after the run completes.

## The Keep/Drop Decision

After each eval cycle, autoresearch runs `agentv compare` between the current candidate and the best baseline:

```bash
agentv compare <baseline>/index.jsonl <candidate>/index.jsonl --json
```

The decision rule:

| Condition | Decision | Outcome |
|-----------|----------|---------|
| `wins > losses` | **KEEP** | Promote to new baseline, reset convergence counter |
| `mean_delta == 0` and simpler artifact | **KEEP** | Simpler wins at equal performance |
| otherwise (`wins <= losses`) | **DROP** | Revert to best version, increment convergence counter |

Three consecutive DROPs trigger convergence — the optimizer stops because it can't find improvements.
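The decision rule translates to a small function. This is a sketch; the field names in `agentv compare --json` output are assumptions here:

```typescript
// Hypothetical shape of the comparison summary; actual JSON keys may differ.
interface CompareResult {
  wins: number;
  losses: number;
  meanDelta: number;
}

function decide(c: CompareResult, candidateIsSimpler: boolean): "KEEP" | "DROP" {
  if (c.wins > c.losses) return "KEEP";
  // Tie-break: at equal performance, prefer the simpler artifact.
  if (c.meanDelta === 0 && candidateIsSimpler) return "KEEP";
  return "DROP";
}
```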

## Example: Incident Severity Classifier

Here's a real scenario showing autoresearch in action. We start with a minimal classifier prompt:

```markdown
# classifier-prompt.md (initial version)
Classify the incident into P0, P1, P2, or P3.
Give your answer as JSON with severity and reasoning fields.
```

And an eval with 7 test cases covering edge cases — payment failures, SSL cert expiry, gradual memory leaks:

```yaml
# EVAL.yaml (stays fixed — only the prompt changes)
tests:
  - id: total-outage
    assertions:
      - type: contains
        value: '"P0"'
      - type: is-json
      - "Reasoning mentions complete service outage"
  - id: payment-failures
    assertions:
      - type: contains
        value: '"P1"'
      - type: is-json
      - "Reasoning weighs revenue impact despite intermittent nature"
  # ... 5 more test cases
```

Running autoresearch produces this trajectory:

```
Cycle  Score  Decision  Mutation
─────  ─────  ────────  ──────────────────────────────────────
1      0.48   KEEP      initial baseline — no mutations applied
2      0.62   KEEP      added explicit JSON format, defined P0-P3 levels
3      0.52   DROP      added verbose rules — over-constrained reasoning
4      0.71   KEEP      added revenue-impact heuristic for P1
5      0.81   KEEP      enforced raw JSON output — removed code fences
6      0.90   KEEP      improved reasoning template — cite impact metrics
7      0.86   DROP      attempted decision tree merge — regressed
8      0.86   DROP      added time-urgency rule for SSL/cert cases — no net gain
9      0.90   DROP      minor wording cleanup — no meaningful change
       ↳ 3 consecutive drops → CONVERGED
```

**Result:** 0.48 → 0.90 (+42 points) in 9 cycles, $0.03 total cost. The optimized prompt is in the working tree (and the latest git commit).

Key observations:
- **Cycle 3** shows a failed mutation (verbose rules hurt reasoning) — the ratchet discarded it and continued from the cycle 2 version
- **Cycles 7–9** show convergence — three consecutive drops, so the optimizer stopped automatically
- **Per-assertion tracking** reveals which aspects improved: classification accuracy reached 100% by cycle 6, while JSON format compliance and reasoning quality improved more gradually

## Convergence

Autoresearch stops when either condition is met:

- **3 consecutive no-improvement cycles** (configurable) — the optimizer has converged
- **10 total cycles** (configurable) — hard limit to bound cost

You can override both limits when triggering autoresearch:

```
"Run autoresearch with max 20 cycles and convergence threshold of 5"
```
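The two stop conditions reduce to a simple predicate. This is a sketch, not AgentV's actual code; the defaults mirror the values above:

```typescript
// Stop when the optimizer has converged (too many consecutive drops)
// or when the hard cycle limit is reached. Both limits are configurable.
function shouldStop(
  consecutiveDrops: number,
  totalCycles: number,
  convergenceThreshold = 3,
  maxCycles = 10,
): boolean {
  return consecutiveDrops >= convergenceThreshold || totalCycles >= maxCycles;
}
```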

## Best Practices

**Start manual, then automate.** Run 2-3 manual eval cycles to validate your test cases catch real issues. Once you trust the eval, switch to autoresearch.

**Same-model pairings work best.** The meta-agent running autoresearch should match the model used by the task agent (e.g., Claude optimizing a Claude agent). Same-model pairings produce better mutations because the optimizer has implicit knowledge of how the target model interprets instructions.

**Watch the per-assertion chart.** If one assertion is stuck at 0% while others improve, the eval may be too strict or testing something the prompt can't control. Consider adjusting the assertion.

**Review the optimized artifact.** Autoresearch improves scores, but always review the changes (`git diff <initial_sha>`) before adopting them. The optimizer may have found a valid but unexpected approach.

**Keep artifact directories focused.** For directory mode, keep artifacts to 5–15 files. The mutator works best when it can reason about the full scope without reading dozens of files. Split large skill directories if needed.

## Relationship to Manual Workflow

| Aspect | Manual Loop | Autoresearch |
|--------|-------------|--------------|
| Human checkpoints | Every iteration | None (runs unattended) |
| Keep/discard | You decide | Automated via `agentv compare` |
| Mutation | You edit the skill | Mutator subagent rewrites |
| Max iterations | Unbounded | 10 cycles or convergence |
| Best for | Building eval intuition | Scaling optimization |
| Trajectory chart | Not included | Auto-generated with live refresh |

Start with the [manual loop](/docs/guides/skill-improvement-workflow/) to understand the workflow, then use autoresearch to scale it.
2 changes: 1 addition & 1 deletion apps/web/src/content/docs/docs/guides/eval-authoring.mdx
@@ -2,7 +2,7 @@
title: Eval Authoring Guide
description: Practical guidance for writing workspace-based evals that work reliably across providers.
sidebar:
-  order: 7
+  order: 3
---

## Workspace Setup: Skill Discovery Paths
2 changes: 1 addition & 1 deletion apps/web/src/content/docs/docs/guides/evaluation-types.mdx
@@ -2,7 +2,7 @@
title: Execution Quality vs Trigger Quality
description: Two distinct evaluation concerns for AI agents and skills — what AgentV measures, and what belongs to skill-creator tooling.
sidebar:
-  order: 6
+  order: 2
---

Agent evaluation has two fundamentally different concerns: **execution quality** and **trigger quality**. They require different tooling, different methodologies, and different optimization surfaces. Conflating them leads to eval configs that are noisy, hard to maintain, and unreliable.
@@ -2,14 +2,14 @@
title: Skill Improvement Workflow
description: Iteratively evaluate and improve agent skills using AgentV
sidebar:
-  order: 6
+  order: 4
---

## Introduction

AgentV supports a full evaluation-driven improvement loop for skills and agents. Instead of guessing whether a change makes things better, you run structured evaluations before and after, then compare.

-This guide teaches the **core manual loop**. For automated iteration that runs the full cycle hands-free, see [agentv-bench](#automated-iteration).
+This guide teaches the **core manual loop**. For automated iteration that runs the full cycle hands-free, see [Autoresearch](/docs/guides/autoresearch/).

## The Core Loop

@@ -244,7 +244,7 @@ After converting, you can:
- Use `code-grader` for custom scoring logic
- Define `tool-trajectory` assertions to check tool usage patterns

-See [Skill Evals (evals.json)](/docs/guides/agent-skills-evals/) for the full field mapping and side-by-side comparison.
+See [Skill Evals (evals.json)](/docs/integrations/agent-skills-evals/) for the full field mapping and side-by-side comparison.

## Migration from Skill-Creator

@@ -316,21 +316,14 @@ Start simple and add complexity only when the evaluation results demand it:

## Automated Iteration

-For users who want the full automated improvement cycle, the `agentv-bench` skill runs a 5-phase optimization loop:
+When you're confident in your eval quality, graduate to **autoresearch** — an unattended optimization loop that runs the full evaluate → analyze → improve cycle hands-free.

-1. **Analyze** — examines the current skill and evaluation results
-2. **Hypothesize** — generates improvement hypotheses from failure patterns
-3. **Implement** — applies targeted skill modifications
-4. **Evaluate** — re-runs the evaluation suite
-5. **Decide** — keeps improvements that help, reverts those that don't
+Autoresearch uses the same `agentv eval` and `agentv compare` primitives described above, but automates the human decision steps. A mutator subagent rewrites the artifact based on failure analysis, and an automated keep/discard rule promotes improvements and reverts regressions.

-The optimizer uses the same core loop described in this guide but automates the human steps. Start with the manual loop to build intuition, then graduate to the optimizer when you're comfortable with the evaluation workflow.
-
-Its bundled scripts map directly onto the workflow stages:
+```
+"Run autoresearch on my skill"
+```

-- `run-eval.ts` and `compare-runs.ts` run and compare evaluations while still delegating to `agentv`
-- `run-loop.ts` repeats the evaluation loop without moving grader logic into the script layer
-- `aggregate-benchmark.ts` and `generate-report.ts` summarize AgentV artifacts into review-friendly output
-- `improve-description.ts` proposes follow-up description experiments once execution quality is stable
+One command starts the loop. It runs until the optimizer converges (3 consecutive no-improvement cycles) or hits the cycle limit. Typical runs: 5–10 cycles, under $0.05 total cost.

-Code-grader execution, grading semantics, and artifact schemas still live in AgentV core. The scripts layer is orchestration glue over those existing primitives.
+See the full guide: [Autoresearch](/docs/guides/autoresearch/)
@@ -2,7 +2,7 @@
title: Workspace Architecture
description: How AgentV clones and materializes working trees for eval runs, with performance guidance for large repos.
sidebar:
-  order: 3
+  order: 7
---

AgentV evaluations that use `workspace.repos` clone repositories directly from their source (git URL or local path) into a workspace directory. [Workspace pooling](/docs/guides/workspace-pool/) (enabled by default) eliminates repeated clone costs by reusing materialized workspaces across runs.
@@ -131,4 +131,4 @@ To disable pooling for a run:
agentv eval evals/my-eval.yaml --no-pool
```

-See the [Workspace Pool](/guides/workspace-pool/) guide for details on pool configuration, clean modes, concurrency, and drift detection.
+See the [Workspace Pool](/docs/guides/workspace-pool/) guide for details on pool configuration, clean modes, concurrency, and drift detection.
2 changes: 1 addition & 1 deletion apps/web/src/content/docs/docs/guides/workspace-pool.mdx
@@ -2,7 +2,7 @@
title: Workspace Pool
description: Reuse materialized workspaces across eval runs with fingerprint-based pooling, eliminating repeated clone and checkout costs.
sidebar:
-  order: 4
+  order: 8
---

Workspace pooling keeps materialized workspaces on disk between eval runs. Instead of cloning repos and checking out files every time, pooled workspaces reset in-place — typically reducing setup from minutes to seconds for large repositories.
@@ -2,7 +2,7 @@
title: Skill Evals (evals.json)
description: Run evals.json skill evaluations with AgentV, and graduate to EVAL.yaml when you need more power.
sidebar:
-  order: 5
+  order: 2
---

## Overview
@@ -2,7 +2,7 @@
title: Autoevals Integration
description: Use Braintrust's open-source autoevals scorers (Factuality, Faithfulness, etc.) as code_grader graders in AgentV.
sidebar:
-  order: 2
+  order: 3
---

## Overview
2 changes: 1 addition & 1 deletion apps/web/src/content/docs/docs/tools/convert.mdx
@@ -31,7 +31,7 @@ Outputs a `.eval.yaml` file alongside the input.
agentv convert evals.json
```

-Converts an [Agent Skills `evals.json`](/docs/guides/agent-skills-evals) file into an AgentV EVAL YAML file. The converter:
+Converts an [Agent Skills `evals.json`](/docs/integrations/agent-skills-evals) file into an AgentV EVAL YAML file. The converter:

- Maps `prompt` → `input` message array
- Maps `expected_output` → `expected_output`