feat: Polished eval resumability — TUI integration, documentation, Studio support

## Summary

`agentv eval` already supports resuming interrupted runs via three CLI flags (`--resume`, `--rerun-failed`, `--retry-errors`), but the experience is rough:

- The flags are **undocumented on agentv.dev** (only `--retry-errors` gets a brief mention).
- They are **not surfaced in the interactive wizard** — no menu entry, no auto-suggestion.
- They are **not surfaced in Studio** (web UI), so a user staring at an incomplete run has no in-app affordance.
- Calling `agentv eval` with no arguments shows the eval-group help instead of dropping into the wizard.

This issue proposes a focused polish pass to make resumability feel like a first-class workflow, without altering the underlying mechanics.

## Research: how peer frameworks handle this

A short comparison of five peer frameworks (full notes in working materials):

| Framework        | Flag                  | Auto-detect latest | Errored-vs-unrun split |
|------------------|-----------------------|--------------------|------------------------|
| promptfoo        | `--resume [evalId]`   | Yes (omit id)      | Yes (`--retry-errors`, `--filter-failing`) |
| Inspect AI       | `eval-retry <log>`    | No (path required) | No (single command)    |
| lm-eval-harness  | `--use_cache <dir>`   | N/A (cache-based)  | No (errors not cached) |
| OpenCompass      | `-r [timestamp]`      | Yes (omit ts)      | No                     |
| HELM             | (transparent cache)   | N/A (always on)    | No                     |
| **AgentV today** | `--resume` + required `--output <dir>` | **No** | **Yes** (`--retry-errors`, `--rerun-failed`) |

**Key takeaways:**

1. **Flag names are well-aligned.** `--resume` and `--retry-errors` match promptfoo verbatim. `--rerun-failed` has no exact peer but reads correctly. No renames needed.
2. **Auto-detect is industry standard.** Three of five peers resume without an explicit identifier: promptfoo and OpenCompass default to the latest run when the id or timestamp is omitted, and HELM's transparent cache is always on. AgentV is the outlier, requiring `--output <dir>`. The data is already there: we record `lastRunDir` per cwd in `.agentv/cache.json`, but the flag doesn't use it.
3. **Interactive resume is not common in peers**, but they don't have wizards at all. AgentV's existing wizard makes a "Resume last run" entry a natural, low-cost UX win.

Sources: [promptfoo CLI](https://www.promptfoo.dev/docs/usage/command-line/), [Inspect eval-retry](https://inspect.aisi.org.uk/reference/inspect_eval-retry.html), [OpenCompass quickstart](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html), [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM benchmark docs](https://github.com/stanford-crfm/helm/blob/main/docs/benchmark.md).

## Current state in the codebase

- `apps/cli/src/commands/eval/run-eval.ts:1009-1064` — `--retry-errors`, `--resume`, `--rerun-failed` plumbing.
- `apps/cli/src/commands/eval/run-eval.ts:1731-1738` — post-run tip prints the exact `--rerun-failed` command.
- `apps/cli/src/commands/eval/retry-errors.ts` — manifest filtering helpers.
- `apps/cli/src/commands/eval/run-cache.ts` — already records `lastRunDir` per cwd in `.agentv/cache.json`.
- `apps/cli/src/commands/eval/last-config.ts` — wizard's last-config persistence; **does not include `outputDir`**.
- `apps/cli/src/commands/eval/interactive.ts` — wizard already has a "Rerun last config" entry and a chained post-run "Retry execution errors?" prompt; no resume entry.
- `apps/cli/src/index.ts:108-114` — `preprocessArgv` injects an implicit `run` when `eval` is followed by a non-subcommand arg, but **not when `eval` is bare**, so `agentv eval` shows the eval-group help instead of the wizard.
- `apps/web/src/content/docs/docs/evaluation/running-evals.mdx:244-252` — only `--retry-errors` is documented; `--resume` and `--rerun-failed` are missing.
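
The `preprocessArgv` gap noted above can be illustrated with a minimal sketch. This is not the code in `apps/cli/src/index.ts`; it assumes `argv` has already had the node binary and script path stripped, and `EVAL_SUBCOMMANDS` is a hypothetical stand-in for whatever set of subcommands the real CLI recognises. How the real function treats flags such as `--help` is not stated here, so the sketch does not special-case them.

```typescript
// Hypothetical subcommand list; the real CLI may register more.
const EVAL_SUBCOMMANDS = new Set<string>(["run"]);

// Inject an implicit `run` after `eval` unless an explicit subcommand follows.
// The bare-`eval` case (no following argument) is the proposed new behaviour.
function preprocessArgv(argv: readonly string[]): string[] {
  if (argv[0] !== "eval") return [...argv];

  const next = argv[1];
  if (next !== undefined && EVAL_SUBCOMMANDS.has(next)) {
    // Explicit subcommand: leave the arguments untouched.
    return [...argv];
  }

  // Covers both `eval <path>` (existing behaviour) and bare `eval` (proposed).
  return ["eval", "run", ...argv.slice(1)];
}
```

With this shape, `["eval"]` becomes `["eval", "run"]`, so the TTY check in the run handler gets a chance to launch the wizard instead of the group help being printed.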

## Proposed changes (this PR)

1. **`agentv eval` (no args) → wizard.** In `preprocessArgv`, treat bare `eval` like `eval <path>` and inject `run`. The existing TTY check in `evalRunCommand.handler` (`commands/run.ts:237`) then drops into `launchInteractiveWizard`.

2. **Persist `outputDir` in `LastConfig`.** Add the field to `LastConfig`, capture the resolved run directory after `runEvalCommand` returns, and write it into the saved config. Backward-compatible (optional field).

3. **Add "Resume last run" to the wizard main menu.** When `lastConfig.outputDir` exists and contains an `index.jsonl`, surface it as a menu choice. On selection, call `runEvalCommand` with `--output <lastConfig.outputDir> --resume`. After it completes, the existing `promptRetryErrors` flow already offers a retry-errors loop if any execution errors occurred.

4. **Document `--resume` / `--rerun-failed` / `--retry-errors`.** Expand the "Retry Execution Errors" section in `apps/web/src/content/docs/docs/evaluation/running-evals.mdx` to cover all three flags, when to use each, and how they compose with `--output`.
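
Changes 2 and 3 can be sketched together as follows. This is an illustration under assumptions, not the wizard's actual code: only the `outputDir` field is named in this issue, so `evalPaths` is a placeholder for whatever `LastConfig` already stores, and `parseLastConfig`, `canResume`, and `resumeArgs` are hypothetical helper names. The real `runEvalCommand` may take structured options rather than an argv-style array.

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Hypothetical shape: `evalPaths` is a placeholder for the existing fields.
interface LastConfig {
  evalPaths: string[];
  outputDir?: string; // optional, so configs saved by older versions still load
}

// Parse a saved config, tolerating a missing or malformed `outputDir`.
function parseLastConfig(raw: string): LastConfig {
  const data = JSON.parse(raw) as Record<string, unknown>;
  const evalPaths = Array.isArray(data.evalPaths)
    ? data.evalPaths.filter((p): p is string => typeof p === "string")
    : [];
  const outputDir = typeof data.outputDir === "string" ? data.outputDir : undefined;
  return outputDir === undefined ? { evalPaths } : { evalPaths, outputDir };
}

// The menu entry is offered only when the run directory holds an index.jsonl.
// `exists` is injectable so the check can be exercised without touching disk.
function canResume(
  config: LastConfig,
  exists: (path: string) => boolean = existsSync,
): boolean {
  return config.outputDir !== undefined && exists(join(config.outputDir, "index.jsonl"));
}

// Arguments the "Resume last run" entry would forward (argv-style, as an assumption).
function resumeArgs(config: LastConfig): string[] {
  if (config.outputDir === undefined) {
    throw new Error("cannot resume: no outputDir in last config");
  }
  return [...config.evalPaths, "--output", config.outputDir, "--resume"];
}
```

Keeping `outputDir` optional is what makes the change backward-compatible: a config written before this change parses to an object without the field, and `canResume` simply returns `false` for it.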

## Out of scope (follow-ups)

- **Auto-detect latest run dir from `.agentv/cache.json` when `--output` is omitted.** Research surfaced this as the biggest peer-alignment gap (3/5 peers do it). Worth a separate, focused PR.
- **Studio: incomplete-runs panel with a "Resume" action.** Largest scope item; needs UX design work and an API endpoint to trigger a re-run from the server. Stretch goal; defer.
- **Mutual-exclusivity error messages** for `--resume` + `--retry-errors` + `--rerun-failed` (promptfoo does this explicitly; we silently let one win).
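
For the mutual-exclusivity follow-up, a check along these lines might be enough. It is a sketch only: `ResumeFlags` and `checkResumeFlags` are hypothetical names, each flag is modelled as a simple "was it passed?" boolean (the real flags may carry values), and the message wording is illustrative.

```typescript
// Assumption: each flag is reduced to whether it was passed at all.
interface ResumeFlags {
  resume?: boolean;
  retryErrors?: boolean;
  rerunFailed?: boolean;
}

// Returns an error message when more than one flag is set, otherwise null.
function checkResumeFlags(flags: ResumeFlags): string | null {
  const passed: string[] = [];
  if (flags.resume) passed.push("--resume");
  if (flags.retryErrors) passed.push("--retry-errors");
  if (flags.rerunFailed) passed.push("--rerun-failed");

  if (passed.length <= 1) return null;
  return `${passed.join(", ")} are mutually exclusive; pass only one.`;
}
```

Returning a message instead of throwing keeps the helper easy to test and leaves the exit behaviour to the caller.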

## Alignment with `AGENTS.md` design principles

- **YAGNI:** All four scoped changes use existing primitives (the `--resume` flag, the wizard, `LastConfig`). No new flags, no new graders, no new config shapes. The auto-detect change and the Studio change are explicitly deferred.
- **Lightweight core, plugin extensibility:** Zero changes to the core evaluation engine. All work is in `apps/cli` and `apps/web`.
- **Composition over built-ins:** The wizard's resume entry composes existing primitives (`saveLastConfig` + `--resume` + `--output`) rather than introducing a new mechanism.
- **AI-First design:** The wizard surfaces resumability without an agent having to know the flag exists. Documentation update keeps the docs site in sync with what the CLI already does.

## Acceptance criteria

- [ ] `agentv eval` with no positional args drops into the wizard in a TTY.
- [ ] After a wizard run completes, a later wizard invocation in the same working directory shows a "Resume last run" entry.
- [ ] Selecting "Resume last run" passes `--output <dir> --resume` to `runEvalCommand` and prints the resolved resume dir.
- [ ] `LastConfig` round-trips an `outputDir` field; old saved configs without the field still load.
- [ ] Docs site has a "Resume an Interrupted Run" section covering all three flags.
- [ ] Existing tests updated to reflect the new bare-`eval` behaviour.
