
feat(cli): incremental eval runs — resume, append, and aggregate#1110

Merged
christso merged 4 commits into main from feat/1071-incremental-eval on Apr 15, 2026


@christso
Collaborator

Summary

Implements three related capabilities for incremental eval runs (#1071):

1. agentv eval aggregate <runDir> subcommand

Acts as a pure function of the run directory: reads index.jsonl, deduplicates by (test_id, target) keeping the last entry, recomputes benchmark.json and timing.json, and prints a summary.

agentv eval aggregate .agentv/results/runs/2026-04-13T12-00-00/
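
The last-entry-wins deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the code in artifact-writer.ts: the `IndexEntry` shape and the name `dedupeLastWins` are hypothetical, and only the `test_id` and `target` fields come from the PR description.

```typescript
// Hypothetical shape of one index.jsonl record. Only test_id and target are
// named in the PR description; score is included purely for illustration.
interface IndexEntry {
  test_id: string;
  target: string;
  score?: number;
}

// Keep only the last entry seen for each (test_id, target) pair.
function dedupeLastWins<T extends IndexEntry>(entries: T[]): T[] {
  const latest = new Map<string, T>();
  for (const entry of entries) {
    // Serialising the pair as a JSON array avoids key collisions that a
    // plain delimiter could cause if an id contained that delimiter.
    const key = JSON.stringify([entry.test_id, entry.target]);
    // Map.set on an existing key overwrites the value but keeps the key's
    // original position, so output order follows first appearance.
    latest.set(key, entry);
  }
  return Array.from(latest.values());
}
```

Because the later line always overwrites the earlier one, a rerun appended to index.jsonl supersedes the original result without the file ever being rewritten in place.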

2. --resume flag on agentv eval run

Resumes an interrupted run: skips completed (non-error) tests, appends new results to the existing index.jsonl, and aggregates at the end.

agentv eval foo.yaml --output ./my-run/ --resume

3. --rerun-failed flag on agentv eval run

Reruns failed (quality_failure) and errored (execution_error) tests while keeping passing results. Implies --resume.

agentv eval foo.yaml --output ./my-run/ --rerun-failed

Changes

  • artifact-writer.ts: Add deduplicateByTestIdTarget(), aggregateRunDir(), writePerTestArtifacts()
  • jsonl-writer.ts: Support append mode (flags: 'a')
  • output-writer.ts: Pass append option through
  • commands/aggregate.ts: New subcommand
  • commands/run.ts: Add --resume and --rerun-failed flags
  • run-eval.ts: Resume/rerun skip logic, append writer, aggregate after run
  • aggregate.test.ts: 10 new tests for dedup, aggregate, and per-test artifact writing
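
For readers unfamiliar with the append mode mentioned for jsonl-writer.ts, here is a minimal sketch of how Node's `fs.createWriteStream` behaves with `flags: 'a'`. The helper names `writeRecords` and `readRecords` are hypothetical and are not the writer's actual API; the sketch only shows that `'a'` preserves existing lines while `'w'` truncates.

```typescript
import { createWriteStream, readFileSync } from "node:fs";

// Write records as JSON Lines. With append=true the stream is opened with
// flags "a", so existing lines are preserved; with "w" the file is truncated.
// The promise resolves on "finish", i.e. once all data has been flushed.
function writeRecords(path: string, records: object[], append: boolean): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const stream = createWriteStream(path, { flags: append ? "a" : "w" });
    stream.on("error", reject);
    stream.on("finish", () => resolve());
    for (const record of records) {
      stream.write(JSON.stringify(record) + "\n");
    }
    stream.end();
  });
}

// Read a JSON Lines file back, ignoring blank lines.
function readRecords(path: string): unknown[] {
  return readFileSync(path, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}
```

Awaiting `writeRecords` before calling `readRecords` ensures the reader never sees a partially flushed file.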

Testing

  • 472 existing tests pass (0 regressions)
  • 10 new tests covering deduplication, aggregation, and per-test artifact writing
  • Build and lint clean

Closes #1071

Add three related capabilities for incremental eval runs:

1. `agentv eval aggregate <runDir>` subcommand
   - Reads index.jsonl, deduplicates by (test_id, target) keeping last entry
   - Recomputes benchmark.json and timing.json
   - Prints summary to stdout

2. `--resume` flag on `agentv eval run`
   - Skips already-completed (non-error) tests
   - Appends new results to existing index.jsonl
   - Aggregates with deduplication at the end

3. `--rerun-failed` flag on `agentv eval run`
   - Like --resume but only skips tests with execution_status "ok"
   - Reruns execution_error and quality_failure tests
   - New results replace old ones via last-entry-wins deduplication
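
The difference between the two skip rules above can be sketched as below. This is an illustration under stated assumptions, not the logic in run-eval.ts: the type and function names are hypothetical, and "non-error" is read here as "any status other than execution_error". The three status values are the ones named in this commit message.

```typescript
type ExecutionStatus = "ok" | "quality_failure" | "execution_error";
type IncrementalMode = "resume" | "rerun-failed";

// Hypothetical record shape for a prior result read from index.jsonl.
interface PriorResult {
  test_id: string;
  target: string;
  execution_status: ExecutionStatus;
}

// Decide whether a previously recorded result lets us skip the test.
function shouldSkip(status: ExecutionStatus, mode: IncrementalMode): boolean {
  if (mode === "rerun-failed") {
    // Only passing results are kept; failures and errors are rerun.
    return status === "ok";
  }
  // Assumption: --resume treats every non-error result as completed.
  return status !== "execution_error";
}

// Build the set of (test_id, target) keys that should not be run again.
function keysToSkip(prior: PriorResult[], mode: IncrementalMode): Set<string> {
  // Later lines override earlier ones, mirroring last-entry-wins dedup.
  const lastStatus = new Map<string, ExecutionStatus>();
  for (const result of prior) {
    lastStatus.set(JSON.stringify([result.test_id, result.target]), result.execution_status);
  }
  const skip = new Set<string>();
  lastStatus.forEach((status, key) => {
    if (shouldSkip(status, mode)) skip.add(key);
  });
  return skip;
}
```

Under these assumptions, a quality_failure result is skipped by `--resume` but rerun by `--rerun-failed`, which is the only case where the two modes differ.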

Key changes:
- artifact-writer.ts: Add deduplicateByTestIdTarget(), aggregateRunDir(),
  writePerTestArtifacts()
- jsonl-writer.ts: Support append mode (flags: "a")
- output-writer.ts: Pass append option through
- commands/aggregate.ts: New subcommand
- commands/run.ts: Add --resume and --rerun-failed flags
- run-eval.ts: Resume/rerun skip logic, append writer, aggregate after run

Closes #1071

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Apr 15, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 2202938
Status: ⚡️ Build in progress...

christso and others added 2 commits April 15, 2026 12:38
Without this, `agentv eval aggregate <dir>` was rewritten to
`agentv eval run aggregate <dir>` by preprocessArgv(), causing
aggregate to be treated as an eval file path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
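
The argv rewrite problem described in the commit above can be illustrated with a short sketch. The real preprocessArgv() is not shown in this PR, so everything here is an assumption: the function name `preprocessArgvSketch`, the subcommand set (only `run` and `aggregate` are named in this PR), and the choice to leave flag-first invocations such as `eval --help` untouched.

```typescript
// Assumed set of `eval` subcommands; only `run` and `aggregate` appear in this PR.
const EVAL_SUBCOMMANDS = new Set<string>(["run", "aggregate"]);

// Insert the implicit `run` after `eval` unless the next token is already a
// known subcommand or a flag. Without "aggregate" in the set,
// `eval aggregate <dir>` would become `eval run aggregate <dir>`.
function preprocessArgvSketch(argv: string[]): string[] {
  const evalIndex = argv.indexOf("eval");
  if (evalIndex === -1) return argv;

  const next = argv[evalIndex + 1];
  if (next === undefined || next.startsWith("-") || EVAL_SUBCOMMANDS.has(next)) {
    return argv;
  }
  return argv
    .slice(0, evalIndex + 1)
    .concat(["run"], argv.slice(evalIndex + 1));
}
```

With this shape, `eval foo.yaml --resume` still expands to `eval run foo.yaml --resume`, while `eval aggregate <dir>` passes through unchanged.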
…isplay

- Close outputWriter before reading index.jsonl for summary computation
  to avoid race condition with unflushed stream data
- Use summaryResults (all deduplicated) instead of allResults (new only)
  for matrix summary in resume mode

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@christso christso marked this pull request as ready for review April 15, 2026 12:44
@christso
Collaborator Author

E2E UAT Results

Red (main)

  • agentv eval run --help on main: No --resume or --rerun-failed flags exist
  • agentv eval --help on main: No aggregate subcommand exists
  • No way to resume a partially-completed eval run

Green (branch)

1. Full baseline run — 3 tests, all pass (100%, 96%, 82%)

RESULT: PASS  (3/3 scored >= 80%, mean: 93%)
Artifact workspace written to: /tmp/uat-1071-e2e/baseline

2. Partial run + resume — ran 1/3 tests, then --resume completed remaining 2:

# Partial: ran only greeting
# Resume:
Resume: found 1 existing result(s), skipping 1 completed.
0/2   🔄 with-custom-eval | azure
0/2   🔄 skip-defaults | azure
RESULT: PASS  (3/3 scored >= 80%, mean: 97%)

3. Artifact parity — resumed run produces identical file structure to baseline:

# Both have identical structure:
benchmark.json, timing.json, index.jsonl, transcript.jsonl
+ 3 per-test dirs (grading.json, timing.json, input.md, response.md)

# Resumed benchmark.json has all 3 tests in tests_run ✅
# Resumed index.jsonl has exactly 3 entries (no duplicates) ✅

4. Compare compatibility (agentv compare works correctly on resumed output):

Comparing: baseline/index.jsonl → resumed/index.jsonl
  greeting              1.00       1.00     +0.00  = tie
  with-custom-eval      0.96       0.95     -0.01  = tie
  skip-defaults         0.82       0.95     +0.13  ✓ win
Summary: 1 win, 0 losses, 2 ties | Mean Δ: +0.040 | Status: improved

5. --rerun-failed — correctly reruns only quality_failure/execution_error tests:

# Modified greeting to quality_failure (score 0.3):
Rerun-failed: found 3 existing result(s), skipping 2 completed.
0/1   🔄 greeting | azure    # only reruns the failure
RESULT: PASS  (3/3 scored >= 80%, mean: 97%)

6. agentv eval aggregate — deduplicates and recomputes:

# 4 entries in JSONL (3 original + 1 rerun) → 3 after dedup
Aggregated 3 test result(s) across 1 target(s)
# greeting score = 1.0 (last-entry-wins), not 0.3 ✅
# pass_rate mean = 1 ✅

7. Edge case: all tests complete — exits cleanly:

Resume: found 3 existing result(s), skipping 3 completed.
Nothing to resume — all 3 test(s) already completed.

Verdict: ✅ PASS — all scenarios verified with live Azure OpenAI (gpt-5.4-mini)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@christso christso merged commit 28bd1b6 into main Apr 15, 2026
3 of 4 checks passed
@christso christso deleted the feat/1071-incremental-eval branch April 15, 2026 22:05