feat(bench): autoresearch optimization loop (#958, #746, #748) #1112

Merged
christso merged 15 commits into main from feat/958-746-748-autoresearch
Apr 16, 2026

@christso
Collaborator

Summary

Implements the agentv autoresearch optimization loop — three related features that compose into an unattended eval-improve cycle:

#958 — Automated keep/discard decision

Adds automated keep/discard logic to Step 5 using agentv compare --json structured output:

  • wins > losses → KEEP, promote to new baseline
  • wins <= losses → DISCARD, revert to best
  • mean_delta == 0 but simpler prompt → KEEP (simplicity criterion)
  • Human checkpoints preserved at iterations 3/6/9
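
The rules above can be sketched as a small pure function. This is an illustration only, not the skill's actual implementation: the field names `wins`, `losses`, `ties`, and `mean_delta` come from the compare output described in this PR, while `decide`, `candidateIsSimpler`, and the precedence of the simplicity rule over the discard rule are assumptions.

```typescript
// Fields reported by `agentv compare --json` according to this PR.
interface CompareResult {
  wins: number;
  losses: number;
  ties: number;
  mean_delta: number;
}

type Decision = "KEEP" | "DISCARD";

// Hypothetical helper. `candidateIsSimpler` is an assumed flag supplied by
// the caller; the PR does not specify how simplicity is measured.
function decide(
  result: CompareResult,
  candidateIsSimpler: boolean,
): { decision: Decision; rationale: string } {
  if (result.wins > result.losses) {
    return {
      decision: "KEEP",
      rationale: `wins (${result.wins}) > losses (${result.losses})`,
    };
  }
  // Assumed precedence: the simplicity criterion is checked before the
  // discard rule, so a simpler prompt at equal performance is kept.
  if (result.mean_delta === 0 && candidateIsSimpler) {
    return {
      decision: "KEEP",
      rationale: "mean_delta == 0 and the candidate is simpler",
    };
  }
  return {
    decision: "DISCARD",
    rationale: `wins (${result.wins}) <= losses (${result.losses})`,
  };
}
```

Returning a rationale string alongside the decision mirrors the requirement that each decision is logged with its reasoning.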

#746 — Mutator subagent

New agents/mutator.md that autonomously rewrites artifacts from failure analysis:

  • Hill-climbing ratchet: always reads from best version, never failed candidate
  • Evidence-driven mutations: every change traces to a failing assertion
  • Simplicity criterion: cleaner artifacts at equal performance are improvements
  • Complete file replacement, not diffs or suggestions
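
A minimal sketch of the hill-climbing ratchet, under simplifying assumptions: the real loop decides on wins and losses from `agentv compare`, whereas this sketch uses a single scalar score, and it uses artifact length as a stand-in for "simpler". The names `Ratchet`, `ratchetStep`, `mutate`, and `score` are hypothetical.

```typescript
interface Ratchet {
  best: string; // full text of the best-known artifact
  bestScore: number; // score of the best-known artifact (0 to 1)
}

// Hypothetical callbacks: `mutate` returns a complete replacement file,
// `score` stands in for an eval run.
type Mutate = (best: string) => string;
type Score = (artifact: string) => number;

function ratchetStep(state: Ratchet, mutate: Mutate, score: Score): Ratchet {
  // The candidate is always derived from the best version, never from a
  // previously failed candidate.
  const candidate = mutate(state.best);
  const candidateScore = score(candidate);

  if (candidateScore > state.bestScore) {
    return { best: candidate, bestScore: candidateScore };
  }
  // Simplicity criterion: equal score and a shorter artifact counts as an
  // improvement (length is only a proxy chosen for this sketch).
  if (candidateScore === state.bestScore && candidate.length < state.best.length) {
    return { best: candidate, bestScore: candidateScore };
  }
  // Otherwise the state is unchanged, so the next mutation starts from the
  // same best version again.
  return state;
}
```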

#748 Phase 1 — Autoresearch mode

Wires keep/discard + mutator into an unattended loop in SKILL.md:

  • Full procedure: eval → analyze → decide → mutate → repeat
  • _autoresearch/ output folder: original.md, best.md, iterations.jsonl, trajectory.html
  • Each cycle writes standard run artifacts via agentv eval --experiment autoresearch-<name>
  • Convergence detection (3 consecutive no-improvement cycles)
  • Live Chart.js trajectory visualization with auto-refresh
  • Studio sees each cycle as a normal run — no special handling needed
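
Convergence detection could look roughly like the following. It assumes that a "no-improvement cycle" is one whose logged decision is `"drop"`, and that decisions are available as an ordered list; `hasConverged` and `patience` are hypothetical names.

```typescript
type IterationDecision = "keep" | "drop";

// Returns true when the last `patience` cycles were all dropped.
function hasConverged(decisions: IterationDecision[], patience = 3): boolean {
  if (decisions.length < patience) {
    return false;
  }
  return decisions.slice(-patience).every((d) => d === "drop");
}

// hasConverged(["keep", "drop", "drop", "drop"]) → true
// hasConverged(["drop", "drop", "keep"])         → false
```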

Scope

Skill-only change — no CLI, schema, or core code modifications. All changes are in plugins/agentv-dev/skills/agentv-bench/.

Files changed

  • SKILL.md — Added automated keep/discard subsection + full autoresearch mode section + updated subagent table + updated trigger description
  • agents/mutator.md — New subagent for artifact mutation
  • scripts/trajectory.html — Standalone Chart.js trajectory chart template

Verification

  • ✅ Build passes (bun run build)
  • ✅ All 472 tests pass (bun run test)
  • ✅ Lint passes (bun run lint)
  • ✅ All 18 acceptance signals verified across all three issues
  • ✅ Code review: wire format documentation corrected (snake_case keys)

Closes #958
Closes #746
Closes #748

christso and others added 2 commits April 15, 2026 22:39
Add a new 'Automated keep/discard' subsection to the iteration loop in
Step 5 (Improve). After each iteration, the agent can now automatically
decide whether to keep or discard a change by running:

  agentv compare baseline.jsonl candidate.jsonl --json

Decision rules:
- wins > losses → keep, promote to new baseline
- wins <= losses → discard, revert, try different mutation
- mean_delta == 0 but simpler prompt → keep (simplicity criterion)

Each decision is logged with rationale. Human checkpoints at iterations
3, 6, 9 still fire. Both manual and automated modes coexist.

Closes #958

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add unattended eval-improve loop to agentv-bench:

- #958: Automated keep/discard decision in Step 5 using agentv compare --json
  output (wins/losses/ties/mean_delta rules), preserving human checkpoints at 3/6/9

- #746: Mutator subagent (agents/mutator.md) that rewrites artifacts from failure
  analysis with hill-climbing ratchet, evidence-driven mutations, and simplicity
  criterion

- #748 Phase 1: Autoresearch mode wired into SKILL.md with full procedure:
  eval → analyze → decide → mutate → repeat. Includes _autoresearch/ output
  folder (original.md, best.md, iterations.jsonl, trajectory.html), convergence
  detection (3 consecutive no-improvement cycles), and standalone Chart.js
  trajectory visualization

Skill-only change — no CLI, schema, or core code modifications.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Apr 15, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 8f18ade
Status: ✅  Deploy successful!
Preview URL: https://73a610ff.agentv.pages.dev
Branch Preview URL: https://feat-958-746-748-autoresearc.agentv.pages.dev

- Add autoresearch-automation.mdx guide with trajectory screenshot,
  ASCII output table, and incident classifier walkthrough
- Add examples/features/autoresearch/ with working eval and prompt
- Fix trajectory.html: add actual auto-refresh meta tag (was a
  non-functional comment), standardize badge text to "drop" (matching
  SKILL.md data format), guard cumulative cost against non-numeric values
- Update SKILL.md completion step to match actual template markup
- Update skill-improvement-workflow.mdx cross-reference to new guide

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@christso
Collaborator Author

E2E Verification & Documentation

Infrastructure Verified

  • agentv eval --experiment autoresearch-<name> creates correct directory structure under .agentv/results/runs/
  • agentv compare <baseline>.jsonl <candidate>.jsonl --json outputs structured JSON with wins, losses, ties, mean_delta fields
  • ✅ trajectory.html renders correctly with Chart.js (score chart, per-assertion chart, cost chart, iteration table)
  • ✅ All 56 example evals validate (including new autoresearch example)
  • ✅ All 2,184 tests pass
  • ✅ Docs site builds successfully (45 pages)
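
For readers who want to consume the compare output programmatically, here is a hedged sketch of a parser. It assumes the four fields named above appear as numbers at the top level of the JSON object; the real output may nest them or include additional fields. `parseCompareSummary` is a hypothetical helper, not part of the agentv CLI.

```typescript
const FIELDS = ["wins", "losses", "ties", "mean_delta"] as const;
type CompareSummary = Record<(typeof FIELDS)[number], number>;

// Parses the text printed by `agentv compare ... --json` and checks that
// the four expected numeric fields are present (assumed top-level layout).
function parseCompareSummary(raw: string): CompareSummary {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("expected a JSON object");
  }
  const record = data as Record<string, unknown>;
  const summary: CompareSummary = { wins: 0, losses: 0, ties: 0, mean_delta: 0 };
  for (const field of FIELDS) {
    const value = record[field];
    if (typeof value !== "number") {
      throw new Error(`missing or non-numeric field: ${field}`);
    }
    summary[field] = value;
  }
  return summary;
}
```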

Real Eval Run

Ran the incident severity classifier scenario against Azure (gpt-5.4-mini):

  • Cycle 1 (weak prompt): 95% mean score — the model is already strong at classification
  • Cycle 2 (improved prompt): Verified agentv compare --json correctly identified regression (0 wins, 2 losses → DISCARD)
  • The compare output matches the keep/discard decision rules in SKILL.md

Bugs Fixed

  1. trajectory.html: Auto-refresh meta tag was non-functional — Template had <!-- __AUTO_REFRESH__ --> (a comment) instead of an actual <meta http-equiv="refresh"> tag. Fixed to include the real tag.
  2. trajectory.html: Badge text inconsistency — Badge displayed "discard" but SKILL.md data format uses "drop". Standardized badge text to "drop".
  3. trajectory.html: Cumulative cost coercion — Added parseFloat() guard to handle string or null cost_usd values without producing NaN.
  4. SKILL.md: Completion step mismatch — Step 3.1 said "remove the auto-refresh <meta> tag" but didn't specify the actual template markup. Updated to reference <!-- __AUTO_REFRESH__ -->.
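
The cumulative cost guard from fix 3 can be illustrated as below. This is a sketch rather than the template's actual code: `toCost` and `cumulativeCosts` are hypothetical names, and treating unparseable values as zero is an assumption about the intended behavior.

```typescript
// Coerces a cost_usd value that may be a number, a numeric string, or null.
function toCost(value: unknown): number {
  const parsed = typeof value === "number" ? value : parseFloat(String(value));
  // parseFloat("null") and parseFloat("abc") are NaN; fall back to 0 so the
  // running total never becomes NaN.
  return isFinite(parsed) ? parsed : 0;
}

// Running total of per-iteration costs, one entry per iteration.
function cumulativeCosts(costs: unknown[]): number[] {
  let total = 0;
  return costs.map((cost) => {
    total += toCost(cost);
    return total;
  });
}

// cumulativeCosts([1, "2", null, "abc", 3]) → [1, 3, 3, 3, 6]
```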

Documentation Added

  • New guide: apps/web/src/content/docs/docs/guides/autoresearch-automation.mdx
    • Trajectory chart screenshot
    • ASCII output table showing 9-cycle optimization run
    • Flow diagram, decision rules, output structure
    • Incident severity classifier example
  • Updated skill-improvement-workflow.mdx to cross-reference the new guide
  • Example: examples/features/autoresearch/ with EVAL.yaml, classifier-prompt.md, README.md

Screenshot

The trajectory chart shows a representative optimization run: 0.48 → 0.90 (+42 points) over 9 cycles at $0.03 total cost.

christso and others added 12 commits April 16, 2026 00:01
…validation

Rewrites the autoresearch trajectory visualization to match the Studio
design system (bg-gray-950 canvas, cyan accent, emerald/red status,
system sans-serif, font-medium max). Adds defensive validation: try/catch
for unresolved placeholder, Array.isArray guard, per-iteration field
checks with graceful error messages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
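
The defensive validation described in the commit above might be sketched as follows. The per-iteration field names `score` and `decision` are assumptions (only the `"keep"`/`"drop"` values and `cost_usd` are mentioned in this PR), and `validateIterations` is a hypothetical name.

```typescript
type ValidationResult =
  | { ok: true; iterations: Record<string, unknown>[] }
  | { ok: false; error: string };

function validateIterations(raw: string): ValidationResult {
  let data: unknown;
  try {
    // An unresolved template placeholder is not valid JSON and throws here.
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "Iteration data is not valid JSON (placeholder may be unresolved)." };
  }
  if (!Array.isArray(data)) {
    return { ok: false, error: "Iteration data must be an array." };
  }
  const iterations: Record<string, unknown>[] = [];
  for (let i = 0; i < data.length; i++) {
    const item: unknown = data[i];
    if (typeof item !== "object" || item === null) {
      return { ok: false, error: `Iteration ${i + 1} is not an object.` };
    }
    const record = item as Record<string, unknown>;
    // Assumed field names, for illustration only.
    if (typeof record["score"] !== "number") {
      return { ok: false, error: `Iteration ${i + 1} has no numeric score.` };
    }
    if (record["decision"] !== "keep" && record["decision"] !== "drop") {
      return { ok: false, error: `Iteration ${i + 1} has an unknown decision.` };
    }
    iterations.push(record);
  }
  return { ok: true, iterations };
}
```

Returning an error value instead of throwing lets the page render a readable message in place of the charts.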
Previous screenshot had excessive empty margins from a 1920px viewport.
Resized to 960px so content fills the frame.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… autoresearch

Replace file-copy backup/restore (original.md, best.md) with git
commit/revert. HEAD always contains the best-known version: KEEP
commits, DROP reverts working tree. The mutator can now optimize
directories (multi-file artifacts like skills with references/) in
addition to single files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… screenshot

- Rename autoresearch-automation.mdx → autoresearch.mdx (slug: /docs/guides/autoresearch/)
- Title: "Autoresearch" (was "Autoresearch — Automated Optimization")
- Retake screenshot at 1280px full-page to show all content including iteration log
- Fix stale best.md reference in example section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…earch

The orchestrator never reads eval results, artifacts, or transcripts
into its own context — it can run indefinitely without exhausting the
context window.

- Extract scores via bash/jq (small structured outputs only)
- trajectory.html now fetches iterations.jsonl via HTTP instead of
  requiring inline data injection — no file manipulation after setup
- Mutator self-serves failure evidence from disk (grading.json,
  transcripts, iterations.jsonl) — orchestrator passes paths not content
- iterations.jsonl appended via bash echo, not read-modify-write
- Retake screenshot without bottom whitespace

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
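
The append-only JSONL approach in the commit above works because each line is an independent JSON document. A rough sketch of both sides, writing one line and parsing the whole file, follows; `toJsonlLine` and `parseJsonl` are hypothetical helpers, and the actual skill appends via bash rather than TypeScript.

```typescript
// Serializes one iteration record as a single JSONL line. Appending this
// string to the file never requires reading the existing contents.
function toJsonlLine(record: Record<string, unknown>): string {
  return JSON.stringify(record) + "\n";
}

// Parses the text of a .jsonl file, skipping blank lines such as the one
// produced by a trailing newline.
function parseJsonl(text: string): unknown[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as unknown);
}
```

Because `JSON.stringify` escapes newlines inside string values, a record always occupies exactly one line.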
Trim 1280x1513 → 1280x1044 by removing empty background below the
iteration log table.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Format all scores as percentages (48%, 90%, +42%) instead of decimals
(0.48, 0.90, +0.42) in summary cards, chart axes, tooltips, and
iteration log table. Underlying data stays 0–1 in iterations.jsonl.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
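
The display-only percentage formatting from the commit above can be sketched like this. The function names are hypothetical and rounding to whole percentages is an assumption; the stored scores remain in the 0 to 1 range.

```typescript
// Formats a 0-1 score for display, e.g. in a summary card or table cell.
function formatPercent(score: number): string {
  return `${Math.round(score * 100)}%`;
}

// Formats a score difference with an explicit sign for improvements.
function formatDelta(delta: number): string {
  const points = Math.round(delta * 100);
  return `${points > 0 ? "+" : ""}${points}%`;
}

// formatPercent(0.48)      → "48%"
// formatPercent(0.9)       → "90%"
// formatDelta(0.9 - 0.48)  → "+42%"
```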
Logical progression: conceptual foundations → eval authoring →
improvement workflow (manual then automated) → advanced topics
(external scorers, workspace infrastructure).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move agent-skills-evals.mdx from guides/ to integrations/ since it
documents an external format integration, not a guide workflow. Update
internal links and re-number remaining guide sidebar orders to close
the gap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Autoevals documents an external library (Braintrust scorers), fitting
the Integrations section alongside Langfuse and Skill Evals. Re-number
remaining guide sidebar orders to close the gap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The page was renamed from autoresearch-automation to autoresearch but
two internal links in skill-improvement-workflow.mdx still pointed at
the old slug.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…broken link

Rename slug to match the page title "Workspace Architecture". Also fix
the Workspace Pool link which was missing the /docs prefix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@christso christso merged commit 87d62a4 into main Apr 16, 2026
4 checks passed
@christso christso deleted the feat/958-746-748-autoresearch branch April 16, 2026 04:21