feat(cli): polished eval resumability — wizard entry, docs, no-args fallthrough by christso · Pull Request #1217 · EntityProcess/agentv

christso · 2026-05-06T02:06:11Z

Closes #1216.

Summary

agentv eval (no args) now drops into the interactive wizard via the existing TTY check in evalRunCommand.handler — bare eval is rewritten to eval run in preprocessArgv.
LastConfig now persists outputDir (resolved post-run from the artifact dir), backward-compatible with older saved configs.
The wizard's main menu surfaces a ⏯ Resume last run entry whenever the saved outputDir contains an index.jsonl. Selecting it invokes runEvalCommand with --output <dir> --resume. The existing promptRetryErrors flow then offers an in-place retry-errors loop if any execution errors occurred.
Docs: rewrote the "Retry Execution Errors" section into a "Resume an Interrupted Run" section that documents --resume / --rerun-failed / --retry-errors side by side in a comparison table.
Flipped one preprocess-argv test that asserted bare eval was a no-op.
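
For reviewers who want the argv rewrite in concrete terms, here is a minimal sketch of the bare-`eval` case. The function name `injectRunForBareEval` and the assumption that argv holds only user arguments are illustrative; the real preprocessArgv in apps/cli handles more cases and may be structured differently.

```typescript
// Hypothetical sketch, not the actual preprocessArgv implementation.
// Assumes `argv` contains only the user-supplied arguments
// (no node binary or script path).
function injectRunForBareEval(argv: readonly string[]): string[] {
  // Only a bare `eval` (nothing after it) is rewritten, so
  // `eval --help` and `eval <subcommand>` are passed through untouched.
  if (argv.length === 1 && argv[0] === "eval") {
    return ["eval", "run"];
  }
  return [...argv];
}
```

Once `run` is injected, the existing TTY check in evalRunCommand.handler decides whether to launch the wizard, so no new branching is needed in the handler itself.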

Out of scope (issue tracks for follow-up)

Auto-detect latest run dir from .agentv/cache.json when --output is omitted on the CLI (research surfaced this as the biggest peer-alignment gap — promptfoo, OpenCompass, HELM all auto-detect).
Studio incomplete-runs panel with a "Resume" action.
Mutual-exclusivity error messages for --resume / --retry-errors / --rerun-failed.

Design alignment (`AGENTS.md`)

YAGNI: all four scoped changes use existing primitives. No new flags, no new graders, no new config shapes. Auto-detect and Studio explicitly deferred.
Lightweight core: zero changes to packages/core. All work in apps/cli + apps/web.
Composition over built-ins: the wizard's resume entry composes saveLastConfig + --resume + --output rather than introducing a new mechanism.
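
The composition point can be illustrated with a small sketch, under stated assumptions: only `outputDir` is named in this PR, so the other `LastConfig` fields (`evalPaths`, `target`) and both function names are hypothetical.

```typescript
// Assumed shape: `outputDir` comes from the PR text; the other fields
// are illustrative placeholders for whatever the wizard already saves.
interface LastConfig {
  evalPaths: string[]; // hypothetical field
  target?: string; // hypothetical field
  outputDir?: string; // optional, so older saved configs still parse
}

// Parse a saved config, tolerating files written before outputDir existed.
function parseLastConfig(json: string): LastConfig | undefined {
  try {
    const raw = JSON.parse(json) as Partial<LastConfig>;
    if (!Array.isArray(raw.evalPaths)) return undefined;
    return {
      evalPaths: raw.evalPaths.filter((p): p is string => typeof p === "string"),
      target: typeof raw.target === "string" ? raw.target : undefined,
      outputDir: typeof raw.outputDir === "string" ? raw.outputDir : undefined,
    };
  } catch {
    return undefined; // malformed JSON (or a non-object payload)
  }
}

// Compose the resume invocation from existing flags only.
function buildResumeArgs(config: LastConfig): string[] | undefined {
  if (!config.outputDir) return undefined; // older config: nothing to resume
  return [...config.evalPaths, "--output", config.outputDir, "--resume"];
}
```

Because the resume path is just the existing `--output <dir> --resume` invocation, an old config without `outputDir` simply yields no resume entry rather than an error.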

Test plan

Unit tests pass — bun run test (core 1752, eval 67, cli 502, all green).
Build + typecheck + lint clean — bun run build, bun run typecheck, bun run lint.
validate:examples — 56/56 valid.
Manual red/green UAT (below).

Manual red/green UAT

Seeded ~/.agentv/last-config.json with a real outputDir pointing at examples/features/trend/sample-runs/2026-03-15T10-00-00-000Z (which has an index.jsonl), then invoked bun apps/cli/src/cli.ts eval through a Python pty wrapper.

RED (main):

agentv eval <subcommand>
> Evaluation commands

where <subcommand> can be one of:

- run - Run eval suites and report results
- assert - Run a single code-grader assertion ...
- aggregate - Recompute benchmark.json and timing.json ...

For more help, try running `agentv eval <subcommand> --help`

The wizard never launches; the eval-group help is shown.

GREEN (this branch):

AgentV Interactive Mode

? What would you like to do?
❯ ⏯  Resume last run
  🔄 Rerun last config
  🚀 Run new evaluation
  ✕ Exit

  2026-03-15T10-00-00-000Z (target: default)

Bare agentv eval launches the wizard, the new Resume last run entry is the default selection, and the description shows the resolved run dir + target.
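
A rough sketch of how the menu could gate the resume entry on the presence of index.jsonl follows. The `MenuChoice` type, the `buildMainMenu` name, and the injectable `exists` parameter are assumptions made for illustration, and the sketch assumes a saved config already exists so that "Rerun last config" is always offered.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical menu model; the real wizard's prompt types may differ.
interface MenuChoice {
  value: "resume" | "rerun" | "new" | "exit";
  label: string;
}

// `exists` is injectable so the gating logic can be checked without disk I/O.
function buildMainMenu(
  outputDir: string | undefined,
  exists: (p: string) => boolean = fs.existsSync,
): MenuChoice[] {
  const choices: MenuChoice[] = [];
  // The resume entry appears only when the saved run dir has an index.jsonl.
  if (outputDir !== undefined && exists(path.join(outputDir, "index.jsonl"))) {
    choices.push({ value: "resume", label: "⏯  Resume last run" });
  }
  choices.push(
    { value: "rerun", label: "🔄 Rerun last config" },
    { value: "new", label: "🚀 Run new evaluation" },
    { value: "exit", label: "✕ Exit" },
  );
  return choices;
}
```

Placing the resume entry first is what makes it the default selection in a list prompt that highlights the first choice.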

Regression guard: agentv eval --help still prints the eval-group help on this branch (verified).

Notes

I did not run a live eval (no .env in the repo — only .env.example). The wizard's resume action is wired directly to runEvalCommand with --output <dir> --resume, which has full unit-test coverage in the existing test suite. The end-to-end resume mechanic itself is unchanged by this PR.
The pre-push hook didn't run on push (none was installed in this worktree's .git). I ran the hook's checks manually: build, typecheck, lint, full test suite, validate:examples — all green.

🤖 Generated with Claude Code

Update: auto-detect last run dir folded in (commit `ea3b988b`)

Per discussion, the biggest peer-alignment gap surfaced by the research — that AgentV requires --output <dir> for --resume / --rerun-failed while promptfoo / OpenCompass / HELM all auto-detect — is now in this PR.

What changed: when --output is omitted, --resume / --rerun-failed resolve the run dir from .agentv/cache.json's lastRunDir (which saveRunCache already writes after every eval). No new flags, no new YAML knobs, no new env vars — this is a configuration surface reduction, not an expansion.

apps/cli/src/commands/eval/run-cache.ts — new resolveCachedRunDir(cwd) helper (returns the cached dir if it still exists on disk; undefined for missing cache or stale dir).
apps/cli/src/commands/eval/run-eval.ts — synthesize options.outputDir from the cached dir before the resume block, so both the skip-set load and the artifact-dir derivation see the resolved path. Print the resolved dir for auditability.
apps/cli/test/unit/run-cache.test.ts — 4 unit tests: cache hit, missing cache file, missing lastRunDir field, stale dir.
Docs: --output is now optional for --resume / --rerun-failed; example commands updated.
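
To make the helper's contract concrete, here is a minimal sketch of what a `resolveCachedRunDir(cwd)` with the described behavior could look like. It is not the code in run-cache.ts: synchronous I/O and resolving a relative `lastRunDir` against `cwd` are assumptions.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Sketch only: returns the cached run dir if it still exists on disk,
// undefined for a missing cache file, a missing lastRunDir field, or a stale dir.
function resolveCachedRunDir(cwd: string): string | undefined {
  const cachePath = path.join(cwd, ".agentv", "cache.json");
  let lastRunDir: unknown;
  try {
    const parsed = JSON.parse(fs.readFileSync(cachePath, "utf8")) as { lastRunDir?: unknown };
    lastRunDir = parsed.lastRunDir;
  } catch {
    return undefined; // missing, unreadable, or malformed cache file
  }
  if (typeof lastRunDir !== "string" || lastRunDir.length === 0) return undefined;
  // Assumption: a relative lastRunDir is interpreted relative to cwd.
  const resolved = path.resolve(cwd, lastRunDir);
  const isDir = fs.existsSync(resolved) && fs.statSync(resolved).isDirectory();
  return isDir ? resolved : undefined; // stale entry -> undefined
}

// Usage: a throwaway project dir whose cache points at an existing run dir.
const demoCwd = fs.mkdtempSync(path.join(os.tmpdir(), "agentv-cache-demo-"));
const demoRunDir = path.join(demoCwd, ".agentv", "results", "runs", "default", "demo-run");
fs.mkdirSync(demoRunDir, { recursive: true });
fs.writeFileSync(
  path.join(demoCwd, ".agentv", "cache.json"),
  JSON.stringify({ lastRunDir: demoRunDir }),
);
console.log(resolveCachedRunDir(demoCwd) === demoRunDir); // true while the dir exists
```

Returning `undefined` for every failure mode keeps the caller simple: it only needs to distinguish "have a dir" from "fall through to the existing warning".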

Manual red/green UAT (auto-detect)

Setup: /tmp/agentv-resume-uat/ with .agentv/cache.json pointing at a real run dir containing index.jsonl (one ok case + one execution_error case), and a minimal eval YAML.

RED (main): agentv eval evals/sample.eval.yaml --resume --dry-run

Warning: --resume requires --output <dir> to identify the run directory. Ignoring --resume.
Artifact directory: /tmp/agentv-resume-uat/.agentv/results/runs/default/2026-05-06T04-16-20-101Z

--resume is ignored (with only a warning); output goes to a fresh timestamped dir.

GREEN (this branch): agentv eval evals/sample.eval.yaml --resume --dry-run

Auto-detected last run dir for --resume: .agentv/results/runs/default/2026-05-06-uat
Resume: found 2 existing result(s), skipping 1 completed.
Artifact directory: /tmp/agentv-resume-uat/.agentv/results/runs/default/2026-05-06-uat

The cached dir is detected and printed; the existing skip-set logic correctly retains the ok case and re-runs the execution_error case; output appends to the resumed dir.
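
The skip-set behavior exercised above can be sketched as follows. The record fields `testId` and `status`, and the rule that only `"ok"` counts as completed, are assumptions for illustration; this PR does not show the real index.jsonl schema or how other statuses are treated, and the existing skip-set code is unchanged by it.

```typescript
interface SkipSet {
  total: number; // distinct results found in the index
  skip: Set<string>; // test ids treated as completed
}

// Hypothetical sketch of deriving a skip set from index.jsonl content.
function buildSkipSet(indexJsonl: string): SkipSet {
  const latest = new Map<string, string>();
  for (const line of indexJsonl.split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // tolerate a truncated final line from an interrupted run
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const { testId, status } = parsed as { testId?: unknown; status?: unknown };
    if (typeof testId !== "string" || typeof status !== "string") continue;
    latest.set(testId, status); // a later record for the same test wins
  }
  const skip = new Set<string>();
  latest.forEach((status, testId) => {
    if (status === "ok") skip.add(testId); // execution_error cases are re-run
  });
  return { total: latest.size, skip };
}
```

With one ok record and one execution_error record, this yields two results found and one skipped, which matches the counts in the GREEN output above.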

Sub-cases verified:

--rerun-failed instead of --resume → Auto-detected last run dir for --rerun-failed: ... (correct flag label).
Stale cache (deleted dir) → Warning: --resume requires --output <dir> (or a cached last run) to identify the run directory. Ignoring --resume. (existing fallthrough, with updated message text).
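
The precedence these sub-cases describe (explicit --output, then the cached dir, then the warning) can be summarized in a small sketch. The `resolveResumeDir` name and the `Resolution` shape are hypothetical; the message strings are copied from the UAT output above.

```typescript
type ResumeFlag = "--resume" | "--rerun-failed";

// Hypothetical result shape for illustration only.
interface Resolution {
  resumeEnabled: boolean;
  outputDir?: string;
  message?: string;
}

function resolveResumeDir(
  flag: ResumeFlag,
  explicitOutput: string | undefined,
  cachedDir: string | undefined, // e.g. the result of resolveCachedRunDir(cwd)
): Resolution {
  // 1. An explicit --output always wins; nothing to announce.
  if (explicitOutput) {
    return { resumeEnabled: true, outputDir: explicitOutput };
  }
  // 2. Otherwise use the cached last run dir and print it for auditability.
  if (cachedDir) {
    return {
      resumeEnabled: true,
      outputDir: cachedDir,
      message: `Auto-detected last run dir for ${flag}: ${cachedDir}`,
    };
  }
  // 3. No dir available (missing or stale cache): warn and ignore the flag.
  return {
    resumeEnabled: false,
    message: `Warning: ${flag} requires --output <dir> (or a cached last run) to identify the run directory. Ignoring ${flag}.`,
  };
}
```

Passing the flag name through keeps the printed label correct for both --resume and --rerun-failed without duplicating the branch logic.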

Validation

bun run test — 506 cli tests (4 new), all pass.
bun run typecheck, bun run lint, bun run build — green.
bun run validate:examples — 56/56 valid.

…allthrough

- preprocessArgv: bare `agentv eval` now injects `run` so the existing TTY check in evalRunCommand.handler drops into launchInteractiveWizard, instead of printing the eval-group help.
- LastConfig: persist outputDir of the most recent wizard run, optional and backward-compatible with older saved configs.
- Wizard: surface a "⏯ Resume last run" entry in the main menu when the saved outputDir contains an index.jsonl. Selecting it re-invokes the run with --output <dir> --resume, preserving the existing post-run retry-errors prompt.
- Docs: rewrite the resume section to cover --resume / --rerun-failed / --retry-errors with a comparison table and a wizard hint.
- Tests: flip the preprocess-argv test that asserted bare `eval` was a no-op.

Closes #1216

When --output is omitted, resolve the run directory from .agentv/cache.json (written by saveRunCache after every eval). Matches promptfoo's `--resume [evalId]` and OpenCompass's `-r [timestamp]` "latest by default" convention. Stale cache entries (deleted dirs) fall through to the existing warning.

- run-cache.ts: new resolveCachedRunDir(cwd) helper.
- run-eval.ts: synthesize options.outputDir from the cached dir before the resume block runs, so both the skip-set load and the artifact-dir derivation see the resolved path. Print the auto-detected dir for auditability. Existing warning text now mentions the cache fallback.
- Docs: --output is now optional for --resume / --rerun-failed; updated the example commands.
- Tests: 4 unit tests covering cache hit, missing cache, missing lastRunDir, stale dir.

cloudflare-workers-and-pages · 2026-05-06T04:19:19Z

Deploying agentv with Cloudflare Pages

Latest commit:	`ea3b988`
Status:	✅ Deploy successful!
Preview URL:	https://a09c6a31.agentv.pages.dev
Branch Preview URL:	https://feat-eval-resume.agentv.pages.dev


christso added 2 commits May 6, 2026 04:03

christso marked this pull request as ready for review May 6, 2026 04:26

christso merged commit 025edb8 into main May 6, 2026
4 checks passed

christso deleted the feat/eval-resume branch May 6, 2026 04:26

christso mentioned this pull request May 6, 2026

feat(studio): expose eval resumability — API + Resume action on run detail #1219

Closed

7 tasks
