feat: Polished eval resumability — TUI integration, documentation, Studio support #1216

@christso

Summary

agentv eval already supports resuming interrupted runs via three CLI flags (--resume, --rerun-failed, --retry-errors), but the experience is rough:

  • The flags are undocumented on agentv.dev (only --retry-errors gets a brief mention).
  • They are not surfaced in the interactive wizard — no menu entry, no auto-suggestion.
  • They are not surfaced in Studio (web UI), so a user staring at an incomplete run has no in-app affordance.
  • Calling agentv eval with no arguments shows the eval-group help instead of dropping into the wizard.

This issue proposes a focused polish pass to make resumability feel like a first-class workflow, without altering the underlying mechanics.

Research: how peer frameworks handle this

A short comparison of five peer frameworks (full notes in working materials):

| Framework | Flag | Auto-detect latest | Errored-vs-unrun split |
|---|---|---|---|
| promptfoo | --resume [evalId] | Yes (omit id) | Yes (--retry-errors, --filter-failing) |
| Inspect AI | eval-retry <log> | No (path required) | No (single command) |
| lm-eval-harness | --use_cache <dir> | N/A (cache-based) | No (errors not cached) |
| OpenCompass | -r [timestamp] | Yes (omit ts) | No |
| HELM | (transparent cache) | N/A (always on) | No |
| AgentV today | --resume + required --output <dir> | No | Yes (--retry-errors, --rerun-failed) |

Key takeaways:

  1. Flag names are well-aligned. --resume and --retry-errors match promptfoo verbatim. --rerun-failed has no exact peer but reads correctly. No renames needed.
  2. Auto-detect is industry standard. Three of five peers default to the latest run when no identifier is given. AgentV is the outlier, requiring --output <dir>. We already record lastRunDir per cwd in .agentv/cache.json — the data is there, the flag just doesn't use it.
  3. Interactive resume is not common in peers — but they don't have wizards at all. AgentV's existing wizard makes a "Resume last run" entry a natural, low-cost UX win.
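Takeaway 2 amounts to a small resolution rule: an explicit --output wins, otherwise fall back to the recorded lastRunDir. The following is a rough sketch only. It assumes the cache content is a JSON object with a top-level lastRunDir string, which may not match the real layout of .agentv/cache.json, and the names resolveResumeDir, readLastRunDir and RunCache are hypothetical.

```typescript
// Hypothetical cache shape; the real .agentv/cache.json may differ.
interface RunCache {
  lastRunDir?: string;
}

// Returns the recorded run directory, or undefined when the cache is
// missing the field, is not a JSON object, or cannot be parsed.
function readLastRunDir(cacheJson: string): string | undefined {
  try {
    const parsed: unknown = JSON.parse(cacheJson);
    if (typeof parsed === "object" && parsed !== null) {
      const dir = (parsed as RunCache).lastRunDir;
      return typeof dir === "string" && dir.length > 0 ? dir : undefined;
    }
  } catch {
    // A corrupt cache is treated the same as a missing one.
  }
  return undefined;
}

// Explicit --output takes priority; the cache is only a fallback.
function resolveResumeDir(
  explicitOutput: string | undefined,
  cacheJson: string | undefined,
): string {
  if (explicitOutput !== undefined && explicitOutput.length > 0) {
    return explicitOutput;
  }
  const cached = cacheJson === undefined ? undefined : readLastRunDir(cacheJson);
  if (cached !== undefined) {
    return cached;
  }
  throw new Error("No previous run found; pass --output <dir> to choose a run to resume.");
}
```

Keeping the explicit flag as the highest-priority source means existing --resume --output <dir> invocations would behave exactly as they do today.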

Sources: promptfoo CLI, Inspect eval-retry, OpenCompass quickstart, lm-eval-harness, HELM benchmark docs.

Current state in the codebase

  • apps/cli/src/commands/eval/run-eval.ts:1009-1064 — --retry-errors, --resume, --rerun-failed plumbing.
  • apps/cli/src/commands/eval/run-eval.ts:1731-1738 — post-run tip prints the exact --rerun-failed command.
  • apps/cli/src/commands/eval/retry-errors.ts — manifest filtering helpers.
  • apps/cli/src/commands/eval/run-cache.ts — already records lastRunDir per cwd in .agentv/cache.json.
  • apps/cli/src/commands/eval/last-config.ts — wizard's last-config persistence; does not include outputDir.
  • apps/cli/src/commands/eval/interactive.ts — wizard already has a "Rerun last config" entry and a chained post-run "Retry execution errors?" prompt; no resume entry.
  • apps/cli/src/index.ts:108-114 — preprocessArgv injects an implicit run when eval is followed by a non-subcommand arg, but not when eval is bare, so agentv eval shows the eval-group help instead of the wizard.
  • apps/web/src/content/docs/docs/evaluation/running-evals.mdx:244-252 — only --retry-errors is documented; --resume and --rerun-failed are missing.
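For reference, the implicit-run injection and the bare-eval gap can be pictured with a minimal sketch. This is not the actual preprocessArgv: the signature, the subcommand set, the help-flag guard and the injectOnBareEval switch are all assumptions made for illustration.

```typescript
// Hypothetical: the real eval group may register other subcommands.
const EVAL_SUBCOMMANDS = new Set<string>(["run"]);
// Assumption: help flags are left alone so the group help stays reachable.
const HELP_FLAGS = new Set<string>(["--help", "-h"]);

// `args` is the argument list after the binary name, e.g. ["eval", "suite.yaml"].
function preprocessArgv(args: string[], injectOnBareEval: boolean): string[] {
  if (args[0] !== "eval") {
    return args;
  }
  const next = args[1];
  if (next === undefined) {
    // Bare `eval`: unchanged today, gains an implicit `run` under the proposal.
    return injectOnBareEval ? ["eval", "run"] : args;
  }
  if (EVAL_SUBCOMMANDS.has(next) || HELP_FLAGS.has(next)) {
    return args;
  }
  // `eval <path>` or `eval --flag ...`: insert the implicit `run`.
  return ["eval", "run", ...args.slice(1)];
}
```

With injectOnBareEval set to false the sketch mirrors the behaviour described above; setting it to true models bare eval being treated like eval <path>.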

Proposed changes (this PR)

  1. agentv eval (no args) → wizard. In preprocessArgv, treat bare eval like eval <path> and inject run. The existing TTY check in evalRunCommand.handler (commands/run.ts:237) then drops into launchInteractiveWizard.

  2. Persist outputDir in LastConfig. Add the field to LastConfig, capture the resolved run directory after runEvalCommand returns, and write it into the saved config. Backward-compatible (optional field).

  3. Add "Resume last run" to the wizard main menu. When lastConfig.outputDir exists and contains an index.jsonl, surface it as a menu choice. On selection, call runEvalCommand with --output <lastConfig.outputDir> --resume. After it completes, the existing promptRetryErrors flow already offers a retry-errors loop if any execution errors occurred.

  4. Document --resume / --rerun-failed / --retry-errors. Expand the "Retry Execution Errors" section in apps/web/src/content/docs/docs/evaluation/running-evals.mdx to cover all three flags, when to use each, and how they compose with --output.
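Changes 2 and 3 could look roughly like the sketch below. Only the optional outputDir field and the index.jsonl check come from this proposal. The evalPaths field, the canResume and buildResumeArgs helpers, and the idea of handing runEvalCommand an argument list are hypothetical stand-ins, since the real LastConfig shape and runEvalCommand signature are not shown here.

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Hypothetical shape: `evalPaths` is illustrative, `outputDir` is the new optional field.
interface LastConfig {
  evalPaths: string[];
  outputDir?: string;
}

type ResumableConfig = LastConfig & { outputDir: string };

// True only when a previous run directory was saved and still holds an index.jsonl.
// `fileExists` is injectable so the check can be exercised without touching disk.
function canResume(
  config: LastConfig | undefined,
  fileExists: (path: string) => boolean = existsSync,
): config is ResumableConfig {
  const dir = config?.outputDir;
  if (dir === undefined) {
    return false;
  }
  return fileExists(join(dir, "index.jsonl"));
}

// Arguments the "Resume last run" entry would pass on (hypothetical helper).
function buildResumeArgs(config: ResumableConfig): string[] {
  return [...config.evalPaths, "--output", config.outputDir, "--resume"];
}

// Old saved configs simply lack the field, so the menu entry stays hidden for them.
const legacy: LastConfig = JSON.parse('{"evalPaths":["suite.yaml"]}') as LastConfig;
const showResumeEntry = canResume(legacy);
```

Because outputDir is optional, no migration step is needed for configs saved before the field existed.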

Out of scope (follow-ups)

  • Auto-detect latest run dir from .agentv/cache.json when --output is omitted. Research surfaced this as the biggest peer-alignment gap (3/5 peers do it). Worth a separate, focused PR.
  • Studio: incomplete-runs panel with a "Resume" action. Largest scope item; needs UX design work and an API endpoint to trigger a re-run from the server. Stretch — defer.
  • Mutual-exclusivity error messages for --resume + --retry-errors + --rerun-failed (promptfoo does this explicitly; we silently let one win).
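For the mutual-exclusivity follow-up, a check along these lines might be enough. It is a sketch under assumptions: the flags are modelled as plain booleans, and the ResumeFlags shape and assertExclusiveResumeFlags name are hypothetical rather than existing CLI code.

```typescript
// Hypothetical parsed-flag shape; the real option types may differ.
interface ResumeFlags {
  resume?: boolean;
  retryErrors?: boolean;
  rerunFailed?: boolean;
}

// Throws a descriptive error when more than one resume-style flag is set,
// instead of silently letting one of them win.
function assertExclusiveResumeFlags(flags: ResumeFlags): void {
  const candidates: Array<[string, boolean | undefined]> = [
    ["--resume", flags.resume],
    ["--retry-errors", flags.retryErrors],
    ["--rerun-failed", flags.rerunFailed],
  ];
  const active = candidates
    .filter(([, enabled]) => enabled === true)
    .map(([name]) => name);
  if (active.length > 1) {
    throw new Error(`${active.join(" and ")} cannot be combined; pass only one of them.`);
  }
}
```

Running the check before any run directory is touched would keep a conflicting invocation from modifying an existing run.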

Alignment with AGENTS.md design principles

  • YAGNI: All four scoped changes use existing primitives (the --resume flag, the wizard, LastConfig). No new flags, no new graders, no new config shapes. The auto-detect change and the Studio change are explicitly deferred.
  • Lightweight core, plugin extensibility: Zero changes to the core evaluation engine. All work is in apps/cli and apps/web.
  • Composition over built-ins: The wizard's resume entry composes existing primitives (saveLastConfig + --resume + --output) rather than introducing a new mechanism.
  • AI-First design: The wizard surfaces resumability without an agent having to know the flag exists. Documentation update keeps the docs site in sync with what the CLI already does.

Acceptance criteria

  • agentv eval with no positional args drops into the wizard in a TTY.
  • After a wizard run completes, a subsequent wizard invocation shows a "Resume last run" entry.
  • Selecting "Resume last run" passes --output <dir> --resume to runEvalCommand and prints the resolved resume dir.
  • LastConfig round-trips an outputDir field; old saved configs without the field still load.
  • Docs site has a "Resume an Interrupted Run" section covering all three flags.
  • Existing tests updated to reflect the new bare-eval behaviour.
