feat(cli): incremental eval runs — resume, append, and aggregate by christso · Pull Request #1110 · EntityProcess/agentv

christso · 2026-04-15T12:36:50Z

Summary

Implements three related capabilities for incremental eval runs (#1071):

1. `agentv eval aggregate <runDir>` subcommand

A pure function: reads index.jsonl, deduplicates by (test_id, target) keeping the last entry, recomputes benchmark.json + timing.json, prints summary.

agentv eval aggregate .agentv/results/runs/2026-04-13T12-00-00/
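The last-entry-wins deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the code in artifact-writer.ts: the `IndexEntry` shape and the `parseJsonl` and `dedupeLastWins` helpers are hypothetical names, and real index.jsonl entries presumably carry more fields.

```typescript
// Hypothetical record shape (assumption); only test_id and target are named in the PR.
interface IndexEntry {
  test_id: string;
  target: string;
  score: number;
}

// Parse JSONL text into entries, ignoring blank lines.
function parseJsonl(text: string): IndexEntry[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as IndexEntry);
}

// Keep only the last entry seen for each (test_id, target) pair.
function dedupeLastWins(entries: IndexEntry[]): IndexEntry[] {
  const byKey = new Map<string, IndexEntry>();
  for (const entry of entries) {
    // Serializing the pair avoids collisions that plain string concatenation could cause.
    const key = JSON.stringify([entry.test_id, entry.target]);
    // Map.set overwrites the value but keeps the key's first-seen position.
    byKey.set(key, entry);
  }
  return [...byKey.values()];
}

const raw = [
  '{"test_id":"greeting","target":"azure","score":0.3}',
  '{"test_id":"with-custom-eval","target":"azure","score":0.96}',
  '{"test_id":"greeting","target":"azure","score":1.0}',
].join("\n");

const deduped = dedupeLastWins(parseJsonl(raw));
console.log(deduped.length); // 2
```

Because later lines overwrite earlier ones, an append-only index.jsonl can hold a rerun result alongside the original without the original affecting the aggregate.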

2. `--resume` flag on `agentv eval run`

Resume an interrupted run: skip completed tests, append new results, aggregate at end.

agentv eval foo.yaml --output ./my-run/ --resume

3. `--rerun-failed` flag on `agentv eval run`

Rerun failed/errored tests while keeping passing results. Implies --resume.

agentv eval foo.yaml --output ./my-run/ --rerun-failed
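The difference between the two flags comes down to which prior results count as done. A rough sketch of that decision, assuming the three execution_status values mentioned in the commit message ("ok", "quality_failure", "execution_error") and reading "non-error" as "anything except execution_error"; `shouldSkip` and `selectPending` are hypothetical helper names, not the ones in run-eval.ts:

```typescript
type ExecutionStatus = "ok" | "quality_failure" | "execution_error";
type SkipMode = "resume" | "rerun-failed";

interface PriorResult {
  test_id: string;
  target: string;
  execution_status: ExecutionStatus;
}

interface PlannedTest {
  test_id: string;
  target: string;
}

// Decide whether an existing result means the test can be skipped.
function shouldSkip(status: ExecutionStatus, mode: SkipMode): boolean {
  // --rerun-failed: only clean passes are skipped.
  if (mode === "rerun-failed") return status === "ok";
  // --resume: anything that completed without an execution error is skipped (assumption).
  return status !== "execution_error";
}

// Return the tests that still need to run, using the latest prior result per (test_id, target).
function selectPending(tests: PlannedTest[], prior: PriorResult[], mode: SkipMode): PlannedTest[] {
  const latest = new Map<string, ExecutionStatus>();
  for (const r of prior) {
    latest.set(JSON.stringify([r.test_id, r.target]), r.execution_status);
  }
  return tests.filter((t) => {
    const status = latest.get(JSON.stringify([t.test_id, t.target]));
    return status === undefined || !shouldSkip(status, mode);
  });
}

const tests: PlannedTest[] = [
  { test_id: "greeting", target: "azure" },
  { test_id: "with-custom-eval", target: "azure" },
  { test_id: "skip-defaults", target: "azure" },
];
const prior: PriorResult[] = [
  { test_id: "greeting", target: "azure", execution_status: "quality_failure" },
  { test_id: "with-custom-eval", target: "azure", execution_status: "ok" },
  { test_id: "skip-defaults", target: "azure", execution_status: "execution_error" },
];

const resumePending = selectPending(tests, prior, "resume").map((t) => t.test_id);
const rerunPending = selectPending(tests, prior, "rerun-failed").map((t) => t.test_id);
console.log(resumePending, rerunPending);
```

Under these assumptions, `--resume` would rerun only `skip-defaults`, while `--rerun-failed` would rerun both `greeting` and `skip-defaults`.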

Changes

artifact-writer.ts: Add deduplicateByTestIdTarget(), aggregateRunDir(), writePerTestArtifacts()
jsonl-writer.ts: Support append mode (flags: 'a')
output-writer.ts: Pass append option through
commands/aggregate.ts: New subcommand
commands/run.ts: Add --resume and --rerun-failed flags
run-eval.ts: Resume/rerun skip logic, append writer, aggregate after run
aggregate.test.ts: 10 new tests for dedup, aggregate, and per-test artifact writing
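For reviewers unfamiliar with the append-mode change, here is a minimal sketch of a JSONL writer that opens its stream with `flags: 'a'`. The `JsonlWriter` class and its `append` option are illustrative assumptions; the real jsonl-writer.ts is not shown in this PR text and may differ.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical minimal writer (assumption), not the project's implementation.
class JsonlWriter {
  private stream: fs.WriteStream;

  constructor(filePath: string, options: { append?: boolean } = {}) {
    // 'a' appends to an existing file (creating it if missing); 'w' truncates it.
    this.stream = fs.createWriteStream(filePath, { flags: options.append ? "a" : "w" });
  }

  // Queue one record as a single JSON line.
  write(record: unknown): void {
    this.stream.write(JSON.stringify(record) + "\n");
  }

  // Resolve once all queued data has been flushed.
  close(): Promise<void> {
    return new Promise((resolve, reject) => {
      this.stream.once("error", reject);
      this.stream.end(() => resolve());
    });
  }
}

// Write one record in a fresh run, then append a second in a "resumed" run.
async function demo(): Promise<string[]> {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "jsonl-"));
  const file = path.join(dir, "index.jsonl");

  const first = new JsonlWriter(file);
  first.write({ test_id: "greeting" });
  await first.close();

  const second = new JsonlWriter(file, { append: true });
  second.write({ test_id: "skip-defaults" });
  await second.close();

  return fs.readFileSync(file, "utf8").trim().split("\n");
}
```

With `append: true` the second writer leaves the first run's line in place, which is what lets a resumed run add to an existing index.jsonl instead of replacing it.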

Testing

472 existing tests pass (0 regressions)
10 new tests covering deduplication, aggregation, and per-test artifact writing
Build and lint clean

Closes #1071

Add three related capabilities for incremental eval runs:

1. `agentv eval aggregate <runDir>` subcommand
   - Reads index.jsonl, deduplicates by (test_id, target) keeping last entry
   - Recomputes benchmark.json and timing.json
   - Prints summary to stdout

2. `--resume` flag on `agentv eval run`
   - Skips already-completed (non-error) tests
   - Appends new results to existing index.jsonl
   - Aggregates with deduplication at the end

3. `--rerun-failed` flag on `agentv eval run`
   - Like --resume but only skips tests with execution_status "ok"
   - Reruns execution_error and quality_failure tests
   - New results replace old ones via last-entry-wins deduplication

Key changes:
- artifact-writer.ts: Add deduplicateByTestIdTarget(), aggregateRunDir(), writePerTestArtifacts()
- jsonl-writer.ts: Support append mode (flags: "a")
- output-writer.ts: Pass append option through
- commands/aggregate.ts: New subcommand
- commands/run.ts: Add --resume and --rerun-failed flags
- run-eval.ts: Resume/rerun skip logic, append writer, aggregate after run

Closes #1071

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

cloudflare-workers-and-pages · 2026-04-15T12:37:18Z

Deploying agentv with Cloudflare Pages

Latest commit:	`2202938`
Status:	⚡️ Build in progress...

Without this, `agentv eval aggregate <dir>` was rewritten to `agentv eval run aggregate <dir>` by preprocessArgv(), causing aggregate to be treated as an eval file path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
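To illustrate the kind of rewrite involved, here is a hedged sketch of an argv preprocessor that inserts an implicit `run` after `eval` unless a known subcommand already follows. The function name `preprocessArgvSketch`, the subcommand set, and the rule for leaving flags untouched are all assumptions for illustration; the project's actual preprocessArgv() is not shown in this PR text.

```typescript
// Assumed set of eval subcommands; only `run` and `aggregate` are confirmed by this PR.
const EVAL_SUBCOMMANDS = new Set(["run", "aggregate"]);

// Insert an implicit `run` after `eval` unless a known subcommand or a flag follows.
function preprocessArgvSketch(argv: string[]): string[] {
  const evalIndex = argv.indexOf("eval");
  if (evalIndex === -1) return argv;

  const next = argv[evalIndex + 1];
  // Leave bare `eval`, `eval --help`, and explicit subcommands untouched (assumption).
  if (next === undefined || next.startsWith("-") || EVAL_SUBCOMMANDS.has(next)) {
    return argv;
  }
  return [...argv.slice(0, evalIndex + 1), "run", ...argv.slice(evalIndex + 1)];
}

console.log(preprocessArgvSketch(["eval", "foo.yaml", "--resume"]));
// [ 'eval', 'run', 'foo.yaml', '--resume' ]
console.log(preprocessArgvSketch(["eval", "aggregate", "./my-run/"]));
// [ 'eval', 'aggregate', './my-run/' ]
```

If `aggregate` were missing from the known-subcommand set, the second call would produce `eval run aggregate ./my-run/`, which matches the misparse described in the commit message.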

…isplay

- Close outputWriter before reading index.jsonl for summary computation to avoid race condition with unflushed stream data
- Use summaryResults (all deduplicated) instead of allResults (new only) for matrix summary in resume mode

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
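The ordering issue in the first bullet can be sketched as below: writes to a Node.js stream are buffered, so the file should only be read back after the stream has finished. `closeStream` and `writeThenCount` are hypothetical helpers for illustration, not functions from run-eval.ts.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Resolve once the stream has flushed everything it was given.
function closeStream(stream: fs.WriteStream): Promise<void> {
  return new Promise((resolve, reject) => {
    stream.once("error", reject);
    stream.end(() => resolve());
  });
}

// Append records, wait for the flush, then count the lines on disk.
async function writeThenCount(filePath: string, records: object[]): Promise<number> {
  const stream = fs.createWriteStream(filePath, { flags: "a" });
  for (const record of records) {
    stream.write(JSON.stringify(record) + "\n");
  }
  // Reading before this await could miss lines still buffered in the stream.
  await closeStream(stream);
  const text = fs.readFileSync(filePath, "utf8");
  return text.split("\n").filter((line) => line.trim() !== "").length;
}

// Build a throwaway path for trying the helper out.
function tempIndexPath(): string {
  return path.join(fs.mkdtempSync(path.join(os.tmpdir(), "flush-")), "index.jsonl");
}
```

Calling `writeThenCount(tempIndexPath(), [...])` with three records should report three lines, because the read only happens after the stream's end callback fires.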

christso · 2026-04-15T21:31:30Z

E2E UAT Results

Red (main)

agentv eval run --help on main: No --resume or --rerun-failed flags exist
agentv eval --help on main: No aggregate subcommand exists
No way to resume a partially-completed eval run

Green (branch)

1. Full baseline run — 3 tests, all pass (100%, 96%, 82%)

RESULT: PASS  (3/3 scored >= 80%, mean: 93%)
Artifact workspace written to: /tmp/uat-1071-e2e/baseline

2. Partial run + resume — ran 1/3 tests, then --resume completed remaining 2:

# Partial: ran only greeting
# Resume:
Resume: found 1 existing result(s), skipping 1 completed.
0/2   🔄 with-custom-eval | azure
0/2   🔄 skip-defaults | azure
RESULT: PASS  (3/3 scored >= 80%, mean: 97%)

3. Artifact parity — resumed run produces identical file structure to baseline:

# Both have identical structure:
benchmark.json, timing.json, index.jsonl, transcript.jsonl
+ 3 per-test dirs (grading.json, timing.json, input.md, response.md)

# Resumed benchmark.json has all 3 tests in tests_run ✅
# Resumed index.jsonl has exactly 3 entries (no duplicates) ✅

4. Compare compatibility — agentv compare works correctly on resumed output:

Comparing: baseline/index.jsonl → resumed/index.jsonl
  greeting              1.00       1.00     +0.00  = tie
  with-custom-eval      0.96       0.95     -0.01  = tie
  skip-defaults         0.82       0.95     +0.13  ✓ win
Summary: 1 win, 0 losses, 2 ties | Mean Δ: +0.040 | Status: improved
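As a sanity check on the Mean Δ figure, the arithmetic can be reproduced from the three rows above. This only recomputes the mean of per-test deltas; it makes no claim about how agentv compare classifies wins and ties, and the `Row` type and `meanDelta` helper are illustrative names.

```typescript
interface Row {
  testId: string;
  baseline: number;
  candidate: number;
}

// Scores copied from the compare output above.
const rows: Row[] = [
  { testId: "greeting", baseline: 1.0, candidate: 1.0 },
  { testId: "with-custom-eval", baseline: 0.96, candidate: 0.95 },
  { testId: "skip-defaults", baseline: 0.82, candidate: 0.95 },
];

// Mean of (candidate - baseline) across all rows.
function meanDelta(input: Row[]): number {
  const total = input.reduce((sum, r) => sum + (r.candidate - r.baseline), 0);
  return total / input.length;
}

// (0.00 - 0.01 + 0.13) / 3 = 0.04
console.log(meanDelta(rows).toFixed(3)); // "0.040"
```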

5. --rerun-failed — correctly reruns only quality_failure/execution_error tests:

# Modified greeting to quality_failure (score 0.3):
Rerun-failed: found 3 existing result(s), skipping 2 completed.
0/1   🔄 greeting | azure    # only reruns the failure
RESULT: PASS  (3/3 scored >= 80%, mean: 97%)

6. agentv eval aggregate — deduplicates and recomputes:

# 4 entries in JSONL (3 original + 1 rerun) → 3 after dedup
Aggregated 3 test result(s) across 1 target(s)
# greeting score = 1.0 (last-entry-wins), not 0.3 ✅
# pass_rate mean = 1 ✅

7. Edge case: all tests complete — exits cleanly:

Resume: found 3 existing result(s), skipping 3 completed.
Nothing to resume — all 3 test(s) already completed.

Verdict: ✅ PASS — all scenarios verified with live Azure OpenAI (gpt-5.4-mini)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

christso and others added 2 commits April 15, 2026 12:38

christso marked this pull request as ready for review April 15, 2026 12:44

refactor(cli): extract eval resume key helpers

2202938

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

christso merged commit 28bd1b6 into main Apr 15, 2026
3 of 4 checks passed

christso deleted the feat/1071-incremental-eval branch April 15, 2026 22:05

christso merged 4 commits into main from feat/1071-incremental-eval
