Summary
agentv eval already supports resuming interrupted runs via three CLI flags (--resume, --rerun-failed, --retry-errors), but the experience is rough:
- The flags are undocumented on agentv.dev (only --retry-errors gets a brief mention).
- They are not surfaced in the interactive wizard — no menu entry, no auto-suggestion.
- They are not surfaced in Studio (web UI), so a user staring at an incomplete run has no in-app affordance.
- Calling agentv eval with no arguments shows the eval-group help instead of dropping into the wizard.
This issue proposes a focused polish pass to make resumability feel like a first-class workflow, without altering the underlying mechanics.
Research: how peer frameworks handle this
A short comparison of five peer frameworks (full notes in working materials):
| Framework | Flag | Auto-detect latest | Errored-vs-unrun split |
| --- | --- | --- | --- |
| promptfoo | --resume [evalId] | Yes (omit id) | Yes (--retry-errors, --filter-failing) |
| Inspect AI | eval-retry <log> | No (path required) | No (single command) |
| lm-eval-harness | --use_cache <dir> | N/A (cache-based) | No (errors not cached) |
| OpenCompass | -r [timestamp] | Yes (omit ts) | No |
| HELM | (transparent cache) | N/A (always on) | No |
| AgentV today | --resume + required --output <dir> | No | Yes (--retry-errors, --rerun-failed) |
Key takeaways:
- Flag names are well-aligned. --resume and --retry-errors match promptfoo verbatim. --rerun-failed has no exact peer but reads correctly. No renames needed.
- Auto-detect is industry standard. Three of five peers default to the latest run when no identifier is given. AgentV is the outlier, requiring --output <dir>. We already record lastRunDir per cwd in .agentv/cache.json: the data is there, the flag just doesn't use it.
- Interactive resume is not common in peers — but they don't have wizards at all. AgentV's existing wizard makes a "Resume last run" entry a natural, low-cost UX win.
Sources: promptfoo CLI, Inspect eval-retry, OpenCompass quickstart, lm-eval-harness, HELM benchmark docs.
Current state in the codebase
apps/cli/src/commands/eval/run-eval.ts:1009-1064 — --retry-errors, --resume, --rerun-failed plumbing.
apps/cli/src/commands/eval/run-eval.ts:1731-1738 — post-run tip prints the exact --rerun-failed command.
apps/cli/src/commands/eval/retry-errors.ts — manifest filtering helpers.
apps/cli/src/commands/eval/run-cache.ts — already records lastRunDir per cwd in .agentv/cache.json.
apps/cli/src/commands/eval/last-config.ts — wizard's last-config persistence; does not include outputDir.
apps/cli/src/commands/eval/interactive.ts — wizard already has a "Rerun last config" entry and a chained post-run "Retry execution errors?" prompt; no resume entry.
apps/cli/src/index.ts:108-114 — preprocessArgv injects an implicit run when eval is followed by a non-subcommand arg, but not when eval is bare, so agentv eval shows the eval-group help instead of the wizard.
apps/web/src/content/docs/docs/evaluation/running-evals.mdx:244-252 — only --retry-errors is documented; --resume and --rerun-failed are missing.
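The preprocessArgv behaviour noted above can be illustrated with a minimal sketch. This is not the code in apps/cli/src/index.ts: the subcommand set, the treatBareEvalAsRun toggle, and the assumption that the function receives only the user-supplied arguments (without the node and script entries) are all hypothetical, chosen to show the difference between today's behaviour and the bare-eval case.

```typescript
// Hypothetical sketch; the real preprocessArgv may be structured differently.
// Assumed list of eval subcommands, for illustration only.
const EVAL_SUBCOMMANDS = new Set<string>(["run"]);

// `args` is assumed to be the user-supplied arguments, e.g. ["eval", "suite.yaml"].
// `treatBareEvalAsRun` is a hypothetical toggle: false mirrors the behaviour
// described in this issue, true mirrors the proposed behaviour.
function preprocessArgv(args: string[], treatBareEvalAsRun: boolean): string[] {
  if (args[0] !== "eval") return args;

  const next = args[1];
  if (next === undefined) {
    // Bare `eval`: without an injected `run`, the eval-group help is shown.
    return treatBareEvalAsRun ? ["eval", "run"] : args;
  }

  // An explicit subcommand is left untouched.
  if (EVAL_SUBCOMMANDS.has(next)) return args;

  // `eval <something-else>` gets an implicit `run` inserted after `eval`.
  return ["eval", "run", ...args.slice(1)];
}
```

With the toggle off, `preprocessArgv(["eval"], false)` returns the input unchanged, which corresponds to the help output users see today; with it on, the same call yields `["eval", "run"]`, so the existing TTY check can take over.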
Proposed changes (this PR)
- agentv eval (no args) → wizard. In preprocessArgv, treat bare eval like eval <path> and inject run. The existing TTY check in evalRunCommand.handler (commands/run.ts:237) then drops into launchInteractiveWizard.
- Persist outputDir in LastConfig. Add the field to LastConfig, capture the resolved run directory after runEvalCommand returns, and write it into the saved config. Backward-compatible (optional field).
- Add "Resume last run" to the wizard main menu. When lastConfig.outputDir exists and contains an index.jsonl, surface it as a menu choice. On selection, call runEvalCommand with --output <lastConfig.outputDir> --resume. After it completes, the existing promptRetryErrors flow already offers a retry-errors loop if any execution errors occurred.
- Document --resume / --rerun-failed / --retry-errors. Expand the "Retry Execution Errors" section in apps/web/src/content/docs/docs/evaluation/running-evals.mdx to cover all three flags, when to use each, and how they compose with --output.
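The eligibility check for the "Resume last run" entry could look roughly like the sketch below. Only the outputDir field and the index.jsonl file name come from this issue; the helper names canResumeLastRun and buildResumeArgs are hypothetical, and the real runEvalCommand may take a structured options object rather than a flag array.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Assumed shape: only the proposed optional outputDir field is modelled here;
// the real LastConfig has other fields that this sketch omits.
interface LastConfig {
  outputDir?: string;
}

// Hypothetical helper: true when the saved run directory still exists
// and contains an index.jsonl, so there is something to resume.
function canResumeLastRun(lastConfig: LastConfig | undefined): boolean {
  const dir = lastConfig?.outputDir;
  if (!dir) return false;
  return fs.existsSync(path.join(dir, "index.jsonl"));
}

// Hypothetical helper: the flags the menu entry would hand to runEvalCommand.
function buildResumeArgs(outputDir: string): string[] {
  return ["--output", outputDir, "--resume"];
}
```

Checking for index.jsonl rather than just the directory keeps the menu entry hidden when a previous run directory was created but never written to, or was cleaned up by hand.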
Out of scope (follow-ups)
- Auto-detect latest run dir from .agentv/cache.json when --output is omitted. Research surfaced this as the biggest peer-alignment gap (3/5 peers do it). Worth a separate, focused PR.
- Studio: incomplete-runs panel with a "Resume" action. Largest scope item; needs UX design work and an API endpoint to trigger a re-run from the server. Stretch — defer.
- Mutual-exclusivity error messages for --resume + --retry-errors + --rerun-failed (promptfoo does this explicitly; we silently let one win).
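For the mutual-exclusivity follow-up, a check along these lines is one possibility. It is a sketch only: the ResumeFlags shape (three booleans) and the assertSingleResumeMode name are assumptions, and the actual option types in run-eval.ts may differ, for example if a flag carries a value.

```typescript
// Assumed option shape for illustration; the real parsed CLI options may differ.
interface ResumeFlags {
  resume?: boolean;
  retryErrors?: boolean;
  rerunFailed?: boolean;
}

// Hypothetical guard: throws when more than one resume-style flag is set,
// instead of silently letting one of them win.
function assertSingleResumeMode(flags: ResumeFlags): void {
  const active: string[] = [];
  if (flags.resume) active.push("--resume");
  if (flags.retryErrors) active.push("--retry-errors");
  if (flags.rerunFailed) active.push("--rerun-failed");

  if (active.length > 1) {
    throw new Error(`${active.join(" and ")} cannot be combined; pass only one.`);
  }
}
```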
Alignment with AGENTS.md design principles
- YAGNI: All four scoped changes use existing primitives (the --resume flag, the wizard, LastConfig). No new flags, no new graders, no new config shapes. The auto-detect change and the Studio change are explicitly deferred.
- Lightweight core, plugin extensibility: Zero changes to the core evaluation engine. All work is in apps/cli and apps/web.
- Composition over built-ins: The wizard's resume entry composes existing primitives (saveLastConfig + --resume + --output) rather than introducing a new mechanism.
- AI-First design: The wizard surfaces resumability without an agent having to know the flag exists. Documentation update keeps the docs site in sync with what the CLI already does.
Acceptance criteria
- agentv eval with no positional args drops into the wizard in a TTY.
- [...] --output <dir> --resume to runEvalCommand and prints the resolved resume dir.
- LastConfig round-trips an outputDir field; old saved configs without the field still load.
- [...] eval behaviour.
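The LastConfig round-trip criterion can be sketched as follows. The evalPaths field and both function names are hypothetical placeholders standing in for whatever last-config.ts actually stores; the point is only that outputDir is optional on read, so files written before the field existed still parse.

```typescript
// Hypothetical shape: evalPaths stands in for the existing fields,
// outputDir is the new optional field proposed in this issue.
interface LastConfig {
  evalPaths: string[];
  outputDir?: string;
}

function serializeLastConfig(config: LastConfig): string {
  return JSON.stringify(config);
}

// Returns undefined for unreadable input; tolerates a missing outputDir.
function parseLastConfig(raw: string): LastConfig | undefined {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return undefined;
  }
  if (typeof data !== "object" || data === null) return undefined;

  const record = data as Record<string, unknown>;
  const evalPaths = record.evalPaths;
  if (!Array.isArray(evalPaths) || !evalPaths.every((p) => typeof p === "string")) {
    return undefined;
  }

  // Old saved configs have no outputDir; leave the field off rather than failing.
  const outputDir = typeof record.outputDir === "string" ? record.outputDir : undefined;
  return outputDir === undefined ? { evalPaths } : { evalPaths, outputDir };
}
```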