feat(bench): autoresearch optimization loop (#958, #746, #748) #1112
Merged
Add a new 'Automated keep/discard' subsection to the iteration loop in Step 5 (Improve). After each iteration, the agent can now automatically decide whether to keep or discard a change by running:

agentv compare baseline.jsonl candidate.jsonl --json

Decision rules:
- wins > losses → keep, promote to new baseline
- wins <= losses → discard, revert, try different mutation
- meanDelta == 0 but simpler prompt → keep (simplicity criterion)

Each decision is logged with rationale. Human checkpoints at iterations 3, 6, 9 still fire. Both manual and automated modes coexist.

Closes #958

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
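The decision rules above can be sketched as a small function. This is an illustrative sketch, not the skill's actual logic: the `wins`, `losses`, and `meanDelta` keys are assumed from the commit message, the `candidate_is_simpler` flag is hypothetical, and checking the simplicity criterion before the discard rule is an assumption about how the overlapping rules are ordered.

```python
from typing import Literal

Decision = Literal["keep", "discard"]


def decide(summary: dict, candidate_is_simpler: bool = False) -> tuple[Decision, str]:
    """Return (decision, rationale) for one iteration.

    `summary` is assumed to carry `wins`, `losses`, and `meanDelta`;
    the real `agentv compare --json` output may be shaped differently.
    """
    wins = int(summary.get("wins", 0))
    losses = int(summary.get("losses", 0))
    mean_delta = float(summary.get("meanDelta", 0.0))

    if wins > losses:
        return "keep", f"wins ({wins}) > losses ({losses}): promote to new baseline"
    # Assumption: the simplicity criterion overrides the discard rule
    # when scores are unchanged on average.
    if mean_delta == 0 and candidate_is_simpler:
        return "keep", "meanDelta == 0 and prompt is simpler: simplicity criterion"
    return "discard", f"wins ({wins}) <= losses ({losses}): revert, try a different mutation"
```

Returning the rationale alongside the decision keeps the "each decision is logged with rationale" requirement in one place.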
Add unattended eval-improve loop to agentv-bench:
- #958: Automated keep/discard decision in Step 5 using agentv compare --json output (wins/losses/ties/meanDelta rules), preserving human checkpoints at 3/6/9
- #746: Mutator subagent (agents/mutator.md) that rewrites artifacts from failure analysis with hill-climbing ratchet, evidence-driven mutations, and simplicity criterion
- #748 Phase 1: Autoresearch mode wired into SKILL.md with full procedure: eval → analyze → decide → mutate → repeat. Includes _autoresearch/ output folder (original.md, best.md, iterations.jsonl, trajectory.html), convergence detection (3 consecutive no-improvement cycles), and standalone Chart.js trajectory visualization

Skill-only change: no CLI, schema, or core code modifications.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
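The eval → decide → mutate → repeat cycle with convergence detection can be sketched as a hill-climbing ratchet. This is a minimal sketch under stated assumptions: `evaluate` and `mutate` are hypothetical callables standing in for the eval run and the mutator subagent, the decision is simplified to a single score comparison rather than wins/losses, and human checkpoints are left out.

```python
from typing import Callable


def run_loop(
    evaluate: Callable[[str], float],
    mutate: Callable[[str], str],
    artifact: str,
    max_cycles: int = 20,
    patience: int = 3,
) -> tuple[str, float, list[dict]]:
    """Hill-climb on `artifact`; stop after `patience` cycles without improvement."""
    best, best_score = artifact, evaluate(artifact)
    log: list[dict] = []
    stale = 0  # consecutive cycles that did not beat the best score
    for cycle in range(1, max_cycles + 1):
        candidate = mutate(best)  # always mutate from the best-known version
        score = evaluate(candidate)
        if score > best_score:
            best, best_score, stale = candidate, score, 0
            decision = "keep"
        else:
            stale += 1
            decision = "drop"
        log.append({"cycle": cycle, "score": score, "decision": decision})
        if stale >= patience:
            break  # converged: no improvement for `patience` cycles in a row
    return best, best_score, log
```

Because every mutation starts from `best`, a dropped candidate never contaminates later cycles, which is the ratchet property the commit describes.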
Deploying agentv with
Latest commit: 8f18ade
Status: ✅ Deploy successful!
Preview URL: https://73a610ff.agentv.pages.dev
Branch Preview URL: https://feat-958-746-748-autoresearc.agentv.pages.dev
- Add autoresearch-automation.mdx guide with trajectory screenshot, ASCII output table, and incident classifier walkthrough
- Add examples/features/autoresearch/ with working eval and prompt
- Fix trajectory.html: add actual auto-refresh meta tag (was a non-functional comment), standardize badge text to "drop" (matching SKILL.md data format), guard cumulative cost against non-numeric values
- Update SKILL.md completion step to match actual template markup
- Update skill-improvement-workflow.mdx cross-reference to new guide

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
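The cumulative-cost guard mentioned above lives in the JavaScript of trajectory.html. The idea can be illustrated in Python as follows. This is a sketch only: the `cost` field name is hypothetical, and treating a non-numeric cost as contributing nothing is an assumption about the intended behaviour.

```python
import math


def cumulative_costs(iterations: list[dict]) -> list[float]:
    """Running cost total per iteration, skipping non-numeric cost values."""
    totals: list[float] = []
    running = 0.0
    for entry in iterations:
        cost = entry.get("cost")  # hypothetical field name
        # bool is a subclass of int in Python, so it is excluded explicitly.
        is_number = isinstance(cost, (int, float)) and not isinstance(cost, bool)
        if is_number and math.isfinite(cost):
            running += cost
        totals.append(running)
    return totals
```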
E2E Verification & Documentation

Infrastructure Verified

Real Eval Run
Ran the incident severity classifier scenario against Azure (gpt-5.4-mini):

Bugs Fixed

Documentation Added

Screenshot
The trajectory chart shows a representative optimization run: 0.48 → 0.90 (+42 points) over 9 cycles at $0.03 total cost.
…validation

Rewrites the autoresearch trajectory visualization to match the Studio design system (bg-gray-950 canvas, cyan accent, emerald/red status, system sans-serif, font-medium max). Adds defensive validation: try/catch for unresolved placeholder, Array.isArray guard, per-iteration field checks with graceful error messages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
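The three validation layers described above (parse failure, array check, per-iteration field checks) can be illustrated outside the browser. This is a Python sketch of the same shape, not the template's JavaScript: the required field names are hypothetical, and `{{DATA}}` in the usage note stands in for whatever placeholder the template actually used.

```python
import json

# Hypothetical required fields; the real iteration schema is not shown in this PR.
REQUIRED_FIELDS = ("iteration", "score", "decision")


def validate_iterations(raw: str) -> tuple[list[dict], list[str]]:
    """Return (valid entries, human-readable errors) instead of raising."""
    try:
        data = json.loads(raw)  # fails on an unresolved template placeholder
    except json.JSONDecodeError as exc:
        return [], [f"could not parse iteration data: {exc.msg}"]
    if not isinstance(data, list):  # counterpart of the Array.isArray guard
        return [], ["iteration data must be a list"]

    valid: list[dict] = []
    errors: list[str] = []
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            errors.append(f"entry {index}: expected an object")
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in item]
        if missing:
            errors.append(f"entry {index}: missing {', '.join(missing)}")
            continue
        valid.append(item)
    return valid, errors
```

Calling `validate_iterations("{{DATA}}")` returns no entries and one parse error, so the page can show a message rather than a blank chart.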
Previous screenshot had excessive empty margins from a 1920px viewport. Resized to 960px so content fills the frame.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… autoresearch

Replace file-copy backup/restore (original.md, best.md) with git commit/revert. HEAD always contains the best-known version: KEEP commits, DROP reverts working tree. The mutator can now optimize directories (multi-file artifacts like skills with references/) in addition to single files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
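The commit/revert ratchet can be sketched as a mapping from decision to git commands. This is an assumption-laden sketch: the exact commands the skill runs are not shown in this PR, the function names are hypothetical, and `git restore` requires Git 2.23 or newer. It also assumes the artifact path is already tracked.

```python
import subprocess


def ratchet_commands(decision: str, artifact_path: str, message: str) -> list[list[str]]:
    """Build the git commands for one keep/drop decision (nothing is executed)."""
    if decision == "keep":
        # Stage everything under the artifact path and commit only that path,
        # so HEAD becomes the new best-known version.
        return [
            ["git", "add", "-A", "--", artifact_path],
            ["git", "commit", "-m", message, "--", artifact_path],
        ]
    if decision == "drop":
        # Reset tracked files to HEAD and delete files the mutation created.
        return [
            ["git", "restore", "--source=HEAD", "--staged", "--worktree", "--", artifact_path],
            ["git", "clean", "-fd", "--", artifact_path],
        ]
    raise ValueError(f"unknown decision: {decision!r}")


def apply_ratchet(decision: str, artifact_path: str, message: str, cwd: str = ".") -> None:
    """Run the commands in order, raising if any git call fails."""
    for argv in ratchet_commands(decision, artifact_path, message):
        subprocess.run(argv, cwd=cwd, check=True)
```

Scoping every command to `artifact_path` is what lets the same ratchet cover a single file or a whole directory.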
… screenshot

- Rename autoresearch-automation.mdx → autoresearch.mdx (slug: /docs/guides/autoresearch/)
- Title: "Autoresearch" (was "Autoresearch — Automated Optimization")
- Retake screenshot at 1280px full-page to show all content including iteration log
- Fix stale best.md reference in example section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…earch

The orchestrator never reads eval results, artifacts, or transcripts into its own context, so it can run indefinitely without exhausting the context window.

- Extract scores via bash/jq (small structured outputs only)
- trajectory.html now fetches iterations.jsonl via HTTP instead of requiring inline data injection (no file manipulation after setup)
- Mutator self-serves failure evidence from disk (grading.json, transcripts, iterations.jsonl); the orchestrator passes paths, not content
- iterations.jsonl appended via bash echo, not read-modify-write
- Retake screenshot without bottom whitespace

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
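The append-only logging pattern above can be sketched in Python, although the skill itself uses bash echo and jq. In this sketch the record fields and function names are hypothetical. The point is that each iteration adds one line and the reader pulls out only a small field, so neither step loads eval results or transcripts.

```python
import json
from pathlib import Path


def append_iteration(log_path: Path, record: dict) -> None:
    """Append one record as a single JSON line; the existing file is never rewritten."""
    line = json.dumps(record, separators=(",", ":"))
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")


def read_scores(log_path: Path) -> list[float]:
    """Extract only the `score` field from each line (hypothetical field name)."""
    scores: list[float] = []
    with log_path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                scores.append(float(json.loads(line)["score"]))
    return scores
```

Because each write is a plain append, a page polling the file over HTTP sees either the previous set of lines or the new one, never a half-rewritten document.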
Trim 1280x1513 → 1280x1044 by removing empty background below the iteration log table.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Format all scores as percentages (48%, 90%, +42%) instead of decimals (0.48, 0.90, +0.42) in summary cards, chart axes, tooltips, and iteration log table. Underlying data stays 0–1 in iterations.jsonl.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
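Formatting at display time while leaving stored scores in the 0–1 range can be sketched with two small helpers. The helper names are hypothetical, and the real template does this in JavaScript. Rounding to whole percentage points is an assumption based on the examples in the commit message.

```python
def format_score(score: float) -> str:
    """Render a 0-1 score as a whole-number percentage."""
    return f"{round(score * 100)}%"


def format_delta(before: float, after: float) -> str:
    """Render the change between two 0-1 scores with an explicit sign."""
    points = round((after - before) * 100)
    return f"{points:+d}%"
```

With the figures from the screenshot, `format_score(0.48)` gives `48%` and `format_delta(0.48, 0.90)` gives `+42%`.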
Logical progression: conceptual foundations → eval authoring → improvement workflow (manual then automated) → advanced topics (external scorers, workspace infrastructure).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move agent-skills-evals.mdx from guides/ to integrations/ since it documents an external format integration, not a guide workflow. Update internal links and re-number remaining guide sidebar orders to close the gap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Autoevals documents an external library (Braintrust scorers), fitting the Integrations section alongside Langfuse and Skill Evals. Re-number remaining guide sidebar orders to close the gap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The page was renamed from autoresearch-automation to autoresearch but two internal links in skill-improvement-workflow.mdx still pointed at the old slug.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…broken link

Rename slug to match the page title "Workspace Architecture". Also fix the Workspace Pool link which was missing the /docs prefix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Implements the agentv autoresearch optimization loop: three related features that compose into an unattended eval-improve cycle.

#958: Automated keep/discard decision
Adds automated keep/discard logic to Step 5 using agentv compare --json structured output:
- wins > losses → KEEP, promote to new baseline
- wins <= losses → DISCARD, revert to best
- mean_delta == 0 but simpler prompt → KEEP (simplicity criterion)

#746: Mutator subagent
New agents/mutator.md that autonomously rewrites artifacts from failure analysis:

#748 Phase 1: Autoresearch mode
Wires keep/discard + mutator into an unattended loop in SKILL.md:
- _autoresearch/ output folder: original.md, best.md, iterations.jsonl, trajectory.html
- agentv eval --experiment autoresearch-<name>

Scope
Skill-only change: no CLI, schema, or core code modifications. All changes are in plugins/agentv-dev/skills/agentv-bench/.

Files changed
- SKILL.md: Added automated keep/discard subsection + full autoresearch mode section + updated subagent table + updated trigger description
- agents/mutator.md: New subagent for artifact mutation
- scripts/trajectory.html: Standalone Chart.js trajectory chart template

Verification
- bun run build
- bun run test
- bun run lint

Closes #958
Closes #746
Closes #748