feat(cli): polished eval resumability — wizard entry, docs, no-args fallthrough#1217
Merged
feat(cli): polished eval resumability — wizard entry, docs, no-args fallthrough#1217
Conversation
…allthrough - preprocessArgv: bare `agentv eval` now injects `run` so the existing TTY check in evalRunCommand.handler drops into launchInteractiveWizard, instead of printing the eval-group help. - LastConfig: persist outputDir of the most recent wizard run, optional and backward-compatible with older saved configs. - Wizard: surface a "⏯ Resume last run" entry in the main menu when the saved outputDir contains an index.jsonl. Selecting it re-invokes the run with --output <dir> --resume, preserving the existing post-run retry-errors prompt. - Docs: rewrite the resume section to cover --resume / --rerun-failed / --retry-errors with a comparison table and a wizard hint. - Tests: flip the preprocess-argv test that asserted bare `eval` was a no-op. Closes #1216
When --output is omitted, resolve the run directory from .agentv/cache.json (written by saveRunCache after every eval). Matches promptfoo's `--resume [evalId]` and OpenCompass's `-r [timestamp]` "latest by default" convention. Stale cache entries (deleted dirs) fall through to the existing warning. - run-cache.ts: new resolveCachedRunDir(cwd) helper. - run-eval.ts: synthesize options.outputDir from the cached dir before the resume block runs, so both the skip-set load and the artifact-dir derivation see the resolved path. Print the auto-detected dir for auditability. Existing warning text now mentions the cache fallback. - Docs: --output is now optional for --resume / --rerun-failed; updated the example commands. - Tests: 4 unit tests covering cache hit, missing cache, missing lastRunDir, stale dir.
Deploying agentv with
|
| Latest commit: |
ea3b988
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://a09c6a31.agentv.pages.dev |
| Branch Preview URL: | https://feat-eval-resume.agentv.pages.dev |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #1216.
Summary
agentv eval(no args) now drops into the interactive wizard via the existing TTY check inevalRunCommand.handler— bareevalis rewritten toeval runinpreprocessArgv.LastConfignow persistsoutputDir(resolved post-run from the artifact dir), backward-compatible with older saved configs.⏯ Resume last runentry whenever the savedoutputDircontains anindex.jsonl. Selecting it invokesrunEvalCommandwith--output <dir> --resume. The existingpromptRetryErrorsflow then offers an in-place retry-errors loop if any execution errors occurred.--resume/--rerun-failed/--retry-errorsside-by-side with a comparison table.evalwas a no-op.Out of scope (issue tracks for follow-up)
.agentv/cache.jsonwhen--outputis omitted on the CLI (research surfaced this as the biggest peer-alignment gap — promptfoo, OpenCompass, HELM all auto-detect).--resume/--retry-errors/--rerun-failed.Design alignment (
AGENTS.md)packages/core. All work inapps/cli+apps/web.saveLastConfig+--resume+--outputrather than introducing a new mechanism.Test plan
bun run test(core 1752, eval 67, cli 502, all green).bun run build,bun run typecheck,bun run lint.validate:examples— 56/56 valid.Manual red/green UAT
Seeded
~/.agentv/last-config.jsonwith a realoutputDirpointing atexamples/features/trend/sample-runs/2026-03-15T10-00-00-000Z(which has anindex.jsonl), then invokedbun apps/cli/src/cli.ts evalthrough a Python pty wrapper.RED (main):
The wizard never launches; the eval-group help is shown.
GREEN (this branch):
Bare
agentv evallaunches the wizard, the newResume last runentry is the default selection, and the description shows the resolved run dir + target.Regression guard:
agentv eval --helpstill prints the eval-group help on this branch (verified).Notes
.envin the repo — only.env.example). The wizard's resume action is wired directly torunEvalCommandwith--output <dir> --resume, which has full unit-test coverage in the existing test suite. The end-to-end resume mechanic itself is unchanged by this PR..git). I ran the hook's checks manually: build, typecheck, lint, full test suite, validate:examples — all green.🤖 Generated with Claude Code
Update: auto-detect last run dir folded in (commit
ea3b988b)Per discussion, the biggest peer-alignment gap surfaced by the research — that AgentV requires
--output <dir>for--resume/--rerun-failedwhile promptfoo / OpenCompass / HELM all auto-detect — is now in this PR.What changed: when
--outputis omitted,--resume/--rerun-failedresolve the run dir from.agentv/cache.json'slastRunDir(whichsaveRunCachealready writes after every eval). No new flags, no new YAML knobs, no new env vars — this is a configuration surface reduction, not an expansion.apps/cli/src/commands/eval/run-cache.ts— newresolveCachedRunDir(cwd)helper (returns the cached dir if it still exists on disk; undefined for missing cache or stale dir).apps/cli/src/commands/eval/run-eval.ts— synthesizeoptions.outputDirfrom the cached dir before the resume block, so both the skip-set load and the artifact-dir derivation see the resolved path. Print the resolved dir for auditability.apps/cli/test/unit/run-cache.test.ts— 4 unit tests: cache hit, missing cache file, missinglastRunDirfield, stale dir.--outputis now optional for--resume/--rerun-failed; example commands updated.Manual red/green UAT (auto-detect)
Setup:
/tmp/agentv-resume-uat/with.agentv/cache.jsonpointing at a real run dir containingindex.jsonl(oneokcase + oneexecution_errorcase), and a minimal eval YAML.RED (main):
agentv eval evals/sample.eval.yaml --resume --dry-run--resumeis silently ignored; output goes to a fresh timestamped dir.GREEN (this branch):
agentv eval evals/sample.eval.yaml --resume --dry-runThe cached dir is detected and printed; the existing skip-set logic correctly retains the
okcase and re-runs theexecution_errorcase; output appends to the resumed dir.Sub-cases verified:
--rerun-failedinstead of--resume→Auto-detected last run dir for --rerun-failed: ...(correct flag label).Warning: --resume requires --output <dir> (or a cached last run) to identify the run directory. Ignoring --resume.(existing fallthrough, with updated message text).Validation
bun run test— 506 cli tests (4 new), all pass.bun run typecheck,bun run lint,bun run build— green.bun run validate:examples— 56/56 valid.