feat(bench): autoresearch optimization loop (#958, #746, #748) by christso · Pull Request #1112 · EntityProcess/agentv

christso · 2026-04-15T22:56:46Z

Summary

Implements the agentv autoresearch optimization loop — three related features that compose into an unattended eval-improve cycle:

#958 — Automated keep/discard decision

Adds automated keep/discard logic to Step 5 using agentv compare --json structured output:

wins > losses → KEEP, promote to new baseline
wins <= losses → DISCARD, revert to best
mean_delta == 0 but simpler prompt → KEEP (simplicity criterion)
Human checkpoints preserved at iterations 3/6/9
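
The decision rules above can be sketched as a small function over the parsed `agentv compare --json` output. This is only an illustration: the field names `wins`, `losses`, `ties`, and `mean_delta` come from this PR, while the `decide` function, the `candidate_is_simpler` flag, and the choice to let the simplicity criterion override a tie are assumptions, not the skill's actual logic.

```python
import json


def decide(compare: dict, candidate_is_simpler: bool = False) -> str:
    """Return "KEEP" or "DISCARD" for one iteration (hypothetical helper)."""
    if compare["wins"] > compare["losses"]:
        return "KEEP"  # candidate becomes the new baseline
    # Assumption: equal performance plus a simpler prompt still counts as a keep.
    if compare.get("mean_delta") == 0 and candidate_is_simpler:
        return "KEEP"
    return "DISCARD"  # revert to the best-known version


# Illustrative payload only; real output may carry more fields.
sample = json.loads('{"wins": 0, "losses": 2, "ties": 0, "mean_delta": -0.05}')
print(decide(sample))
```

How "simpler" is judged is left to the caller here; the PR describes that judgment as part of the agent's procedure rather than a computed metric.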

#746 — Mutator subagent

New agents/mutator.md that autonomously rewrites artifacts from failure analysis:

Hill-climbing ratchet: always reads from best version, never failed candidate
Evidence-driven mutations: every change traces to a failing assertion
Simplicity criterion: cleaner artifacts at equal performance are improvements
Complete file replacement, not diffs or suggestions

#748 Phase 1 — Autoresearch mode

Wires keep/discard + mutator into an unattended loop in SKILL.md:

Full procedure: eval → analyze → decide → mutate → repeat
_autoresearch/ output folder: original.md, best.md, iterations.jsonl, trajectory.html
Each cycle writes standard run artifacts via agentv eval --experiment autoresearch-<name>
Convergence detection (3 consecutive no-improvement cycles)
Live Chart.js trajectory visualization with auto-refresh
Studio sees each cycle as a normal run — no special handling needed
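
As a rough illustration of the convergence rule (stop after 3 consecutive no-improvement cycles), the sketch below scans a list of per-cycle mean scores. The function name, the `patience` parameter, and the reading of "no improvement" as "score did not exceed the best so far" are assumptions; the skill may define it differently.

```python
from typing import Optional, Sequence


def convergence_cycle(scores: Sequence[float], patience: int = 3) -> Optional[int]:
    """Return the 1-based cycle at which the loop would stop, or None."""
    best = float("-inf")
    stale = 0  # consecutive cycles without a new best score
    for cycle, score in enumerate(scores, start=1):
        if score > best:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return cycle
    return None


# Cycles 3, 4, and 5 fail to beat the best score (0.60), so the loop stops at 5.
print(convergence_cycle([0.48, 0.60, 0.60, 0.55, 0.58]))
```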

Scope

Skill-only change — no CLI, schema, or core code modifications. All changes are in plugins/agentv-dev/skills/agentv-bench/.

Files changed

SKILL.md — Added automated keep/discard subsection + full autoresearch mode section + updated subagent table + updated trigger description
agents/mutator.md — New subagent for artifact mutation
scripts/trajectory.html — Standalone Chart.js trajectory chart template

Verification

✅ Build passes (bun run build)
✅ All 472 tests pass (bun run test)
✅ Lint passes (bun run lint)
✅ All 18 acceptance signals verified across all three issues
✅ Code review: wire format documentation corrected (snake_case keys)

Closes #958
Closes #746
Closes #748

Add a new 'Automated keep/discard' subsection to the iteration loop in Step 5 (Improve). After each iteration, the agent can now automatically decide whether to keep or discard a change by running:

agentv compare baseline.jsonl candidate.jsonl --json

Decision rules:
- wins > losses → keep, promote to new baseline
- wins <= losses → discard, revert, try different mutation
- meanDelta == 0 but simpler prompt → keep (simplicity criterion)

Each decision is logged with rationale. Human checkpoints at iterations 3, 6, 9 still fire. Both manual and automated modes coexist.

Closes #958

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add unattended eval-improve loop to agentv-bench:
- #958: Automated keep/discard decision in Step 5 using agentv compare --json output (wins/losses/ties/meanDelta rules), preserving human checkpoints at 3/6/9
- #746: Mutator subagent (agents/mutator.md) that rewrites artifacts from failure analysis with hill-climbing ratchet, evidence-driven mutations, and simplicity criterion
- #748 Phase 1: Autoresearch mode wired into SKILL.md with full procedure: eval → analyze → decide → mutate → repeat. Includes _autoresearch/ output folder (original.md, best.md, iterations.jsonl, trajectory.html), convergence detection (3 consecutive no-improvement cycles), and standalone Chart.js trajectory visualization

Skill-only change — no CLI, schema, or core code modifications.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

cloudflare-workers-and-pages · 2026-04-15T22:57:13Z

Deploying agentv with Cloudflare Pages

Latest commit:	`8f18ade`
Status:	✅ Deploy successful!
Preview URL:	https://73a610ff.agentv.pages.dev
Branch Preview URL:	https://feat-958-746-748-autoresearc.agentv.pages.dev

- Add autoresearch-automation.mdx guide with trajectory screenshot, ASCII output table, and incident classifier walkthrough
- Add examples/features/autoresearch/ with working eval and prompt
- Fix trajectory.html: add actual auto-refresh meta tag (was a non-functional comment), standardize badge text to "drop" (matching SKILL.md data format), guard cumulative cost against non-numeric values
- Update SKILL.md completion step to match actual template markup
- Update skill-improvement-workflow.mdx cross-reference to new guide

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

christso · 2026-04-15T23:43:49Z

E2E Verification & Documentation

Infrastructure Verified

✅ agentv eval --experiment autoresearch-<name> creates correct directory structure under .agentv/results/runs/
✅ agentv compare <baseline>.jsonl <candidate>.jsonl --json outputs structured JSON with wins, losses, ties, mean_delta fields
✅ trajectory.html renders correctly with Chart.js (score chart, per-assertion chart, cost chart, iteration table)
✅ All 56 example evals validate (including new autoresearch example)
✅ All 2,184 tests pass
✅ Docs site builds successfully (45 pages)

Real Eval Run

Ran the incident severity classifier scenario against Azure (gpt-5.4-mini):

Cycle 1 (weak prompt): 95% mean score — the model is already strong at classification
Cycle 2 (improved prompt): Verified agentv compare --json correctly identified regression (0 wins, 2 losses → DISCARD)
The compare output matches the keep/discard decision rules in SKILL.md

Bugs Fixed

trajectory.html: Auto-refresh meta tag was non-functional — Template had an HTML comment in place of an actual <meta http-equiv="refresh"> tag, so the page never reloaded. Fixed to include the real tag.
trajectory.html: Badge text inconsistency — Badge displayed "discard" but SKILL.md data format uses "drop". Standardized badge text to "drop".
trajectory.html: Cumulative cost coercion — Added parseFloat() guard to handle string or null cost_usd values without producing NaN.
SKILL.md: Completion step mismatch — Step 3.1 said "remove the auto-refresh <meta> tag" but didn't specify the actual template markup. Updated to match the actual template markup.
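
The cumulative cost fix lives in the template's JavaScript and uses parseFloat(). The sketch below is a Python analogue of the same idea, not the template code: coerce each `cost_usd` to a number before summing so that a string or null value cannot turn the running total into NaN. The helper names and the choice to count unparseable values as 0 are assumptions.

```python
import math
from typing import Any, Iterable, List


def to_cost(value: Any) -> float:
    """Coerce a cost_usd value to a finite float, falling back to 0.0."""
    try:
        cost = float(value)
    except (TypeError, ValueError):  # None, missing, or non-numeric string
        return 0.0
    return cost if math.isfinite(cost) else 0.0


def cumulative_costs(iterations: Iterable[dict]) -> List[float]:
    """Running total of cost_usd across iterations."""
    total = 0.0
    totals = []
    for record in iterations:
        total += to_cost(record.get("cost_usd"))
        totals.append(total)
    return totals


rows = [{"cost_usd": 0.5}, {"cost_usd": "0.25"}, {"cost_usd": None}, {}]
print(cumulative_costs(rows))  # [0.5, 0.75, 0.75, 0.75]
```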

Documentation Added

New guide: apps/web/src/content/docs/docs/guides/autoresearch-automation.mdx
- Trajectory chart screenshot
- ASCII output table showing 9-cycle optimization run
- Flow diagram, decision rules, output structure
- Incident severity classifier example
Updated skill-improvement-workflow.mdx to cross-reference the new guide
Example: examples/features/autoresearch/ with EVAL.yaml, classifier-prompt.md, README.md

Screenshot

The trajectory chart shows a representative optimization run: 0.48 → 0.90 (+42 points) over 9 cycles at $0.03 total cost.

…validation Rewrites the autoresearch trajectory visualization to match the Studio design system (bg-gray-950 canvas, cyan accent, emerald/red status, system sans-serif, font-medium max). Adds defensive validation: try/catch for unresolved placeholder, Array.isArray guard, per-iteration field checks with graceful error messages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Previous screenshot had excessive empty margins from a 1920px viewport. Resized to 960px so content fills the frame. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… autoresearch Replace file-copy backup/restore (original.md, best.md) with git commit/revert. HEAD always contains the best-known version: KEEP commits, DROP reverts working tree. The mutator can now optimize directories (multi-file artifacts like skills with references/) in addition to single files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
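
One way to picture the git-based ratchet is a helper that maps a decision to the git commands an orchestrator might run. This is a sketch under assumptions: the commit message does not say which git commands the skill uses, so the function name, the commit message text, and the specific `git add` / `git restore` / `git clean` invocations are illustrative only. The helper returns argument lists instead of executing anything.

```python
from typing import List


def git_commands(decision: str, artifact_path: str,
                 message: str = "autoresearch: keep candidate") -> List[List[str]]:
    """Return the git invocations for a KEEP or DROP decision (hypothetical)."""
    if decision == "KEEP":
        # Commit the candidate so HEAD holds the best-known version.
        return [
            ["git", "add", "-A", "--", artifact_path],
            ["git", "commit", "-m", message],
        ]
    if decision == "DROP":
        # Restore tracked files from HEAD and delete files the mutator added.
        return [
            ["git", "restore", "--source=HEAD", "--staged", "--worktree", "--", artifact_path],
            ["git", "clean", "-fd", "--", artifact_path],
        ]
    raise ValueError(f"unknown decision: {decision!r}")


for argv in git_commands("DROP", "skills/example"):
    print(" ".join(argv))
```

Because the path can be a directory, the same commands cover multi-file artifacts; `git clean` is included because restoring alone would leave newly created files behind.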

… screenshot
- Rename autoresearch-automation.mdx → autoresearch.mdx (slug: /docs/guides/autoresearch/)
- Title: "Autoresearch" (was "Autoresearch — Automated Optimization")
- Retake screenshot at 1280px full-page to show all content including iteration log
- Fix stale best.md reference in example section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…earch

The orchestrator never reads eval results, artifacts, or transcripts into its own context — it can run indefinitely without exhausting the context window.
- Extract scores via bash/jq (small structured outputs only)
- trajectory.html now fetches iterations.jsonl via HTTP instead of requiring inline data injection — no file manipulation after setup
- Mutator self-serves failure evidence from disk (grading.json, transcripts, iterations.jsonl) — orchestrator passes paths not content
- iterations.jsonl appended via bash echo, not read-modify-write
- Retake screenshot without bottom whitespace

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
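
The append-only logging described above is done with bash echo in the skill. As a sketch of the same pattern in Python, the helper below writes one JSON object per line in append mode, so the existing file is never read back or rewritten. The function name and the record fields other than `cost_usd` and the "drop" decision value are assumptions made for illustration.

```python
import json
from pathlib import Path


def append_iteration(path: str, record: dict) -> None:
    """Append one iteration record as a single JSON line (no read-modify-write)."""
    line = json.dumps(record, separators=(",", ":"))
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")


def example() -> None:
    # Hypothetical record shape; the real iterations.jsonl schema is defined in SKILL.md.
    append_iteration("iterations.jsonl",
                     {"iteration": 1, "score": 0.48, "decision": "drop", "cost_usd": 0.003})
```

Appending keeps each write small and independent of file size, which fits the goal of keeping eval data out of the orchestrator's context.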

Trim 1280x1513 → 1280x1044 by removing empty background below the iteration log table. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Format all scores as percentages (48%, 90%, +42%) instead of decimals (0.48, 0.90, +0.42) in summary cards, chart axes, tooltips, and iteration log table. Underlying data stays 0–1 in iterations.jsonl. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
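
A minimal sketch of the display formatting described above, written in Python for illustration (the template itself is JavaScript). The stored value stays in the 0–1 range and only the rendered string changes. The function names and the choice to round to whole percentages are assumptions.

```python
def pct(score: float) -> str:
    """Render a 0-1 score as a whole-number percentage."""
    return f"{round(score * 100)}%"


def pct_delta(delta: float) -> str:
    """Render a 0-1 score change with an explicit sign."""
    return f"{round(delta * 100):+d}%"


print(pct(0.48), pct(0.90), pct_delta(0.42))  # 48% 90% +42%
```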

Logical progression: conceptual foundations → eval authoring → improvement workflow (manual then automated) → advanced topics (external scorers, workspace infrastructure). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move agent-skills-evals.mdx from guides/ to integrations/ since it documents an external format integration, not a guide workflow. Update internal links and re-number remaining guide sidebar orders to close the gap. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Autoevals documents an external library (Braintrust scorers), fitting the Integrations section alongside Langfuse and Skill Evals. Re-number remaining guide sidebar orders to close the gap. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The page was renamed from autoresearch-automation to autoresearch but two internal links in skill-improvement-workflow.mdx still pointed at the old slug. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…broken link Rename slug to match the page title "Workspace Architecture". Also fix the Workspace Pool link which was missing the /docs prefix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

christso and others added 2 commits April 15, 2026 22:39

christso and others added 12 commits April 16, 2026 00:01

fix(docs): retake trajectory screenshot at 960px viewport

0cf7f91

fix(docs): crop blank space from trajectory screenshot

d4c6ee0

docs: reorder guides sidebar from basic to advanced

44d8296

fix(docs): update broken autoresearch links to new slug

74bcdea

fix(docs): rename git-cache-workspace to workspace-architecture, fix …

8f18ade

christso merged commit 87d62a4 into main Apr 16, 2026
4 checks passed

christso deleted the feat/958-746-748-autoresearch branch April 16, 2026 04:21

christso merged 15 commits into main from feat/958-746-748-autoresearch
