feat(cli): polished eval resumability — wizard entry, docs, no-args fallthrough #1217

Merged
christso merged 2 commits into main from feat/eval-resume on May 6, 2026

christso (Collaborator) commented May 6, 2026

Closes #1216.

Summary

  • agentv eval (no args) now drops into the interactive wizard via the existing TTY check in evalRunCommand.handler — bare eval is rewritten to eval run in preprocessArgv.
  • LastConfig now persists outputDir (resolved post-run from the artifact dir), backward-compatible with older saved configs.
  • The wizard's main menu surfaces a ⏯ Resume last run entry whenever the saved outputDir contains an index.jsonl. Selecting it invokes runEvalCommand with --output <dir> --resume. The existing promptRetryErrors flow then offers an in-place retry-errors loop if any execution errors occurred.
  • Docs: rewrote the "Retry Execution Errors" section as a "Resume an Interrupted Run" section that documents --resume / --rerun-failed / --retry-errors side by side in a comparison table.
  • Flipped one preprocess-argv test that asserted bare eval was a no-op.
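
The no-args fallthrough in the first bullet can be pictured with a minimal sketch. This is not the actual preprocessArgv from apps/cli: the function name rewriteBareEval is hypothetical, it covers only the bare eval case, and it assumes argv has already been stripped of the runtime and script path.

```typescript
// Hypothetical sketch of the bare-`eval` rewrite; the real preprocessArgv may differ.
const EVAL_GROUP = "eval";
const DEFAULT_SUBCOMMAND = "run";

function rewriteBareEval(argv: readonly string[]): string[] {
  // Only a lone `eval` is rewritten, so `eval --help` and explicit
  // subcommands pass through untouched.
  if (argv.length === 1 && argv[0] === EVAL_GROUP) {
    return [EVAL_GROUP, DEFAULT_SUBCOMMAND];
  }
  return [...argv];
}

console.log(rewriteBareEval(["eval"]));
console.log(rewriteBareEval(["eval", "--help"]));
```

Leaving eval --help alone in the sketch mirrors the regression guard described in the UAT: the group help should still print when help is requested explicitly.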

Out of scope (issue tracks for follow-up)

  • Auto-detect latest run dir from .agentv/cache.json when --output is omitted on the CLI (research surfaced this as the biggest peer-alignment gap — promptfoo, OpenCompass, HELM all auto-detect).
  • Studio incomplete-runs panel with a "Resume" action.
  • Mutual-exclusivity error messages for --resume / --retry-errors / --rerun-failed.

Design alignment (AGENTS.md)

  • YAGNI: all four scoped changes use existing primitives. No new flags, no new graders, no new config shapes. Auto-detect and Studio explicitly deferred.
  • Lightweight core: zero changes to packages/core. All work in apps/cli + apps/web.
  • Composition over built-ins: the wizard's resume entry composes saveLastConfig + --resume + --output rather than introducing a new mechanism.
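
As a rough illustration of that composition, the sketch below builds the wizard's menu from a saved config. Everything except the outputDir field, the index.jsonl check, and the --output / --resume flags is an assumption: the evalPaths and target fields, the MenuChoice shape, and the buildMainMenu name are hypothetical, and the Exit entry is omitted.

```typescript
// Hypothetical shapes; only `outputDir` is named in this PR.
interface LastConfig {
  evalPaths: string[];
  target?: string;
  outputDir?: string; // optional, so older saved configs still load
}

interface MenuChoice {
  label: string;
  args: string[]; // arguments that would be handed to the run command
}

function buildMainMenu(
  last: LastConfig | undefined,
  hasIndex: (dir: string) => boolean, // e.g. checks that <dir>/index.jsonl exists
): MenuChoice[] {
  const choices: MenuChoice[] = [];
  // The resume entry appears only when the saved run dir still has an index.
  if (last?.outputDir && hasIndex(last.outputDir)) {
    choices.push({
      label: "Resume last run",
      args: [...last.evalPaths, "--output", last.outputDir, "--resume"],
    });
  }
  if (last) {
    choices.push({ label: "Rerun last config", args: [...last.evalPaths] });
  }
  choices.push({ label: "Run new evaluation", args: [] });
  return choices;
}

const menu = buildMainMenu(
  { evalPaths: ["evals/sample.eval.yaml"], outputDir: "/runs/2026-03-15" },
  () => true,
);
console.log(menu.map((c) => c.label));
```

Passing the index check in as a predicate keeps the sketch free of filesystem access; the real wizard presumably checks the disk directly.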

Test plan

  • Unit tests pass — bun run test (core 1752, eval 67, cli 502, all green).
  • Build + typecheck + lint clean — bun run build, bun run typecheck, bun run lint.
  • validate:examples — 56/56 valid.
  • Manual red/green UAT (below).

Manual red/green UAT

Seeded ~/.agentv/last-config.json with a real outputDir pointing at examples/features/trend/sample-runs/2026-03-15T10-00-00-000Z (which has an index.jsonl), then invoked bun apps/cli/src/cli.ts eval through a Python pty wrapper.

RED (main):

agentv eval <subcommand>
> Evaluation commands

where <subcommand> can be one of:

- run - Run eval suites and report results
- assert - Run a single code-grader assertion ...
- aggregate - Recompute benchmark.json and timing.json ...

For more help, try running `agentv eval <subcommand> --help`

The wizard never launches; the eval-group help is shown.

GREEN (this branch):

AgentV Interactive Mode

? What would you like to do?
❯ ⏯  Resume last run
  🔄 Rerun last config
  🚀 Run new evaluation
  ✕ Exit

  2026-03-15T10-00-00-000Z (target: default)

Bare agentv eval launches the wizard, the new Resume last run entry is the default selection, and the description shows the resolved run dir + target.

Regression guard: agentv eval --help still prints the eval-group help on this branch (verified).

Notes

  • I did not run a live eval (no .env in the repo — only .env.example). The wizard's resume action is wired directly to runEvalCommand with --output <dir> --resume, which has full unit-test coverage in the existing test suite. The end-to-end resume mechanic itself is unchanged by this PR.
  • The pre-push hook didn't run on push (none was installed in this worktree's .git). I ran the hook's checks manually: build, typecheck, lint, full test suite, validate:examples — all green.

🤖 Generated with Claude Code


Update: auto-detect last run dir folded in (commit ea3b988b)

Per discussion, the biggest peer-alignment gap surfaced by the research — that AgentV requires --output <dir> for --resume / --rerun-failed while promptfoo / OpenCompass / HELM all auto-detect — is now in this PR.

What changed: when --output is omitted, --resume / --rerun-failed resolve the run dir from .agentv/cache.json's lastRunDir (which saveRunCache already writes after every eval). No new flags, no new YAML knobs, no new env vars — this is a configuration surface reduction, not an expansion.

  • apps/cli/src/commands/eval/run-cache.ts — new resolveCachedRunDir(cwd) helper (returns the cached dir if it still exists on disk; undefined for missing cache or stale dir).
  • apps/cli/src/commands/eval/run-eval.ts — synthesize options.outputDir from the cached dir before the resume block, so both the skip-set load and the artifact-dir derivation see the resolved path. Print the resolved dir for auditability.
  • apps/cli/test/unit/run-cache.test.ts — 4 unit tests: cache hit, missing cache file, missing lastRunDir field, stale dir.
  • Docs: --output is now optional for --resume / --rerun-failed; example commands updated.
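
A synchronous sketch of what a helper like resolveCachedRunDir(cwd) might look like follows. Only the helper name, the .agentv/cache.json location, and the lastRunDir field come from this PR; the synchronous style and the choice to resolve relative entries against cwd are assumptions, and the real implementation in run-cache.ts may differ.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Sketch only: returns the cached run dir if it still exists on disk.
function resolveCachedRunDir(cwd: string): string | undefined {
  const cachePath = path.join(cwd, ".agentv", "cache.json");
  let parsed: unknown;
  try {
    parsed = JSON.parse(fs.readFileSync(cachePath, "utf8"));
  } catch {
    return undefined; // missing or unreadable cache file
  }
  const lastRunDir = (parsed as { lastRunDir?: unknown } | null)?.lastRunDir;
  if (typeof lastRunDir !== "string" || lastRunDir.length === 0) {
    return undefined; // cache present but no lastRunDir field
  }
  // Assumption: relative entries are interpreted relative to cwd.
  const resolved = path.resolve(cwd, lastRunDir);
  return fs.existsSync(resolved) ? resolved : undefined; // stale dir yields undefined
}

// Usage: a throwaway project dir with no cache file yet.
const demoCwd = fs.mkdtempSync(path.join(os.tmpdir(), "agentv-demo-"));
console.log(resolveCachedRunDir(demoCwd)); // no cache file, so undefined
fs.rmSync(demoCwd, { recursive: true, force: true });
```

The four branches correspond to the four unit-test cases listed above: cache hit, missing cache file, missing lastRunDir field, and stale dir.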

Manual red/green UAT (auto-detect)

Setup: /tmp/agentv-resume-uat/ with .agentv/cache.json pointing at a real run dir containing index.jsonl (one ok case + one execution_error case), and a minimal eval YAML.

RED (main): agentv eval evals/sample.eval.yaml --resume --dry-run

Warning: --resume requires --output <dir> to identify the run directory. Ignoring --resume.
Artifact directory: /tmp/agentv-resume-uat/.agentv/results/runs/default/2026-05-06T04-16-20-101Z

--resume is ignored (with only a warning); output goes to a fresh timestamped dir.

GREEN (this branch): agentv eval evals/sample.eval.yaml --resume --dry-run

Auto-detected last run dir for --resume: .agentv/results/runs/default/2026-05-06-uat
Resume: found 2 existing result(s), skipping 1 completed.
Artifact directory: /tmp/agentv-resume-uat/.agentv/results/runs/default/2026-05-06-uat

The cached dir is detected and printed; the existing skip-set logic correctly retains the ok case and re-runs the execution_error case; output appends to the resumed dir.
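
The skip-set behaviour described here can be sketched as below. This PR leaves the skip-set logic unchanged and does not show the index.jsonl record format, so the testId and status field names are hypothetical, as is the rule that a later line for the same test overrides an earlier one.

```typescript
// Hypothetical record shape; the real index.jsonl fields may differ.
interface IndexRecord {
  testId: string;
  status: "ok" | "execution_error";
}

function parseIndexJsonl(text: string): IndexRecord[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as IndexRecord);
}

// Tests whose latest record is `ok` are skipped; everything else re-runs.
function buildSkipSet(records: IndexRecord[]): Set<string> {
  const completed = new Set<string>();
  for (const record of records) {
    if (record.status === "ok") {
      completed.add(record.testId);
    } else {
      completed.delete(record.testId); // assumption: later lines win
    }
  }
  return completed;
}

function resumeSummary(records: IndexRecord[], skip: Set<string>): string {
  return `Resume: found ${records.length} existing result(s), skipping ${skip.size} completed.`;
}

const sample = [
  '{"testId":"case-ok","status":"ok"}',
  '{"testId":"case-err","status":"execution_error"}',
].join("\n");
const records = parseIndexJsonl(sample);
console.log(resumeSummary(records, buildSkipSet(records)));
```

With one ok record and one execution_error record, the summary line has the same form as the GREEN output above.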

Sub-cases verified:

  • --rerun-failed instead of --resume → Auto-detected last run dir for --rerun-failed: ... (correct flag label).
  • Stale cache (deleted dir) → Warning: --resume requires --output <dir> (or a cached last run) to identify the run directory. Ignoring --resume. (existing fallthrough, with updated message text).
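
The precedence implied by these sub-cases (explicit --output first, then the cached dir, then the warning) might be expressed roughly as follows. The function name and return shape are hypothetical; the message strings are copied from the output above, and parameterising the warning text by flag is an assumption, since only the --resume variant is shown.

```typescript
type ResumeFlag = "--resume" | "--rerun-failed";

interface ResumeResolution {
  outputDir?: string; // undefined means the flag is ignored
  message?: string; // text printed for auditability
}

// Sketch only: decides which run dir a resume-style flag should use.
function resolveResumeDir(
  flag: ResumeFlag,
  explicitOutput: string | undefined,
  cachedDir: string | undefined,
): ResumeResolution {
  if (explicitOutput) {
    return { outputDir: explicitOutput }; // --output always wins
  }
  if (cachedDir) {
    return {
      outputDir: cachedDir,
      message: `Auto-detected last run dir for ${flag}: ${cachedDir}`,
    };
  }
  return {
    message:
      `Warning: ${flag} requires --output <dir> (or a cached last run) ` +
      `to identify the run directory. Ignoring ${flag}.`,
  };
}

console.log(resolveResumeDir("--rerun-failed", undefined, ".agentv/results/runs/default/2026-05-06-uat").message);
console.log(resolveResumeDir("--resume", undefined, undefined).message);
```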

Validation

  • bun run test — 506 cli tests (4 new), all pass.
  • bun run typecheck, bun run lint, bun run build — green.
  • bun run validate:examples — 56/56 valid.

christso added 2 commits May 6, 2026 04:03
…allthrough

- preprocessArgv: bare `agentv eval` now injects `run` so the existing TTY
  check in evalRunCommand.handler drops into launchInteractiveWizard, instead
  of printing the eval-group help.
- LastConfig: persist outputDir of the most recent wizard run, optional and
  backward-compatible with older saved configs.
- Wizard: surface a "⏯  Resume last run" entry in the main menu when the
  saved outputDir contains an index.jsonl. Selecting it re-invokes the run
  with --output <dir> --resume, preserving the existing post-run retry-errors
  prompt.
- Docs: rewrite the resume section to cover --resume / --rerun-failed /
  --retry-errors with a comparison table and a wizard hint.
- Tests: flip the preprocess-argv test that asserted bare `eval` was a no-op.

Closes #1216

When --output is omitted, resolve the run directory from .agentv/cache.json
(written by saveRunCache after every eval). Matches promptfoo's
`--resume [evalId]` and OpenCompass's `-r [timestamp]` "latest by default"
convention. Stale cache entries (deleted dirs) fall through to the existing
warning.

- run-cache.ts: new resolveCachedRunDir(cwd) helper.
- run-eval.ts: synthesize options.outputDir from the cached dir before the
  resume block runs, so both the skip-set load and the artifact-dir
  derivation see the resolved path. Print the auto-detected dir for
  auditability. Existing warning text now mentions the cache fallback.
- Docs: --output is now optional for --resume / --rerun-failed; updated
  the example commands.
- Tests: 4 unit tests covering cache hit, missing cache, missing lastRunDir,
  stale dir.

cloudflare-workers-and-pages Bot commented May 6, 2026

Deploying agentv with Cloudflare Pages

Latest commit: ea3b988
Status: ✅  Deploy successful!
Preview URL: https://a09c6a31.agentv.pages.dev
Branch Preview URL: https://feat-eval-resume.agentv.pages.dev

@christso christso marked this pull request as ready for review May 6, 2026 04:26
@christso christso merged commit 025edb8 into main May 6, 2026
4 checks passed
@christso christso deleted the feat/eval-resume branch May 6, 2026 04:26
Development

Successfully merging this pull request may close these issues.

feat: Polished eval resumability — TUI integration, documentation, Studio support
