Problem
The current `agentv compare` command is strictly pairwise: it takes two pre-split JSONL files and computes deltas between them. With 3+ models this creates compounding friction:
- Manual splitting required — matrix eval produces one combined JSONL, but `compare` needs separate files. Users must run `split-by-target` first.
- Combinatorial pairwise runs — 3 models = 3 comparisons, 4 models = 6, 5 models = 10. Each is a separate command.
This is the most common workflow after a multi-target eval and it should be one command.
Industry Research
Every major eval framework treats N-way matrix comparison as the default, not pairwise:
| Framework | Approach |
| --- | --- |
| promptfoo | Matrix view: prompts × providers × tests. All models evaluated simultaneously. No separate compare command — comparison is the eval output. |
| Braintrust | N-way experiment comparison in UI. PR comments show deltas vs baseline. |
| LangSmith | `evaluate_comparative()` accepts 2+ experiments. Pairwise annotation queues for human review. |
| Arize Phoenix | Side-by-side experiments with baseline diffing. |
| wandb | Compare 50+ runs instantly. Parallel coordinates plots. |
| MLflow | Select N runs → compare. Table + chart views. |
No framework uses a dedicated CLI command that takes two pre-split files. The standard pattern is: run eval with multiple targets, see all results in a matrix.
Key takeaways:
- N-way matrix is the standard — show all models side by side per test case
- Baseline designation drives CI exit codes — one model is the reference, regressions against it fail the pipeline
- Pairwise summaries are derived from the matrix, not a separate workflow
Full research: `agentevals-research/comparisons/multi-model-comparison-and-baseline-regression.md`
Proposed Enhancement
Phase 1: Combined JSONL with target filtering (pairwise shortcut)
```bash
# Filter a combined results file by target — no pre-splitting needed
agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
```
- Read one JSONL file, filter records by `target` field
- Reuse existing pairwise comparison logic
- Backward compatible — two-file positional args still work
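A minimal sketch of that filtering step, assuming each record in the combined JSONL carries a `target` field alongside `test_id` and `score` (the helper name is hypothetical):

```ts
// Sketch only: load a combined results file and keep just one target's records.
// Assumes each JSONL line has at least test_id, target, and score fields.
import { readFileSync } from 'node:fs';

interface CombinedRecord {
  test_id: string;
  target: string;
  score: number;
}

function loadRecordsForTarget(filePath: string, target: string): CombinedRecord[] {
  return readFileSync(filePath, 'utf8')
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as CombinedRecord)
    .filter((record) => record.target === target);
}

// With --baseline and --candidate, the two filtered slices feed straight into the
// existing pairwise comparison logic, e.g.:
//   compareResults(loadRecordsForTarget(file, baseline), loadRecordsForTarget(file, candidate), threshold)
```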
Phase 2: N-way matrix comparison
```bash
# Auto-detect all targets in a combined file, show matrix
agentv compare results.jsonl

# Specify which targets to include
agentv compare results.jsonl --targets gemini-3-flash-preview gpt-4.1 gpt-5-mini

# Designate a baseline for CI exit code
agentv compare results.jsonl --baseline gpt-4.1
```
Output — matrix table with scores per test × target, pairwise summaries below:
```
Test ID            gemini-3-flash-preview   gpt-4.1   gpt-5-mini
─────────────────────────────────────────────────────────────────
greeting                             0.90      0.85         0.95
code-generation                      0.70      0.80         0.75
summarization                        0.85      0.90         0.80

Pairwise Summary:
  gemini-3-flash-preview → gpt-4.1:    1 win, 1 loss, 1 tie  (Δ +0.033)
  gemini-3-flash-preview → gpt-5-mini: 1 win, 1 loss, 1 tie  (Δ +0.017)
  gpt-4.1 → gpt-5-mini:                1 win, 1 loss, 1 tie  (Δ -0.017)
```
JSON output (`--json`) includes the full matrix and all pairwise comparisons.
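One possible shape for that payload, mirroring the `MatrixOutput` described in the Implementation Guide below (field names here are illustrative, not a settled contract):

```ts
// Sketch: an illustrative --json payload shape for matrix mode.
// Field names are assumptions and should track whatever MatrixOutput ends up being.
interface MatrixJsonOutput {
  targets: string[];                  // e.g. ["gemini-3-flash-preview", "gpt-4.1", "gpt-5-mini"]
  matrix: Array<{
    testId: string;                   // one row per test case
    scores: Record<string, number>;   // target name -> score
  }>;
  pairwise: Array<{
    baseline: string;                 // first target of the pair
    candidate: string;                // second target of the pair
    wins: number;
    losses: number;
    ties: number;
    meanDelta: number;                // candidate mean minus baseline mean
  }>;
}
```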
Exit code behavior
| Mode | Exit Code |
| --- | --- |
| Two-file pairwise (existing) | Same as today — exit 1 on regression |
| Combined JSONL with `--baseline` | Exit 1 if any target regresses vs baseline |
| Combined JSONL without `--baseline` | Exit 0 (informational) |
Implementation Guide
Current implementation
File: `apps/cli/src/commands/compare/index.ts`

Current structure:
- CLI args: two positional `string` args (`result1`, `result2`), `--threshold`, `--format`/`--json`
- `loadJsonlResults(filePath)` — reads JSONL, extracts `test_id` + `score` (ignores `target` field)
- `compareResults(results1, results2, threshold)` — matches by `test_id`, computes deltas, classifies win/loss/tie
- `formatTable(comparison, file1, file2)` — renders pairwise table with ANSI colors
- `determineExitCode(meanDelta)` — exit 0 if candidate >= baseline, else exit 1
What to change
1. CLI args
Make `result2` optional. Add new flags:

```ts
// Existing (keep)
result1: positional({ type: string, description: 'Path to JSONL result file' }),
result2: positional({ type: optional(string), description: 'Path to second JSONL (pairwise mode)' }),

// New
baseline: option({ type: optional(string), long: 'baseline', short: 'b',
  description: 'Target name to use as baseline (filters combined JSONL)' }),
candidate: option({ type: optional(string), long: 'candidate', short: 'c',
  description: 'Target name to use as candidate (filters combined JSONL)' }),
targets: option({ type: optional(restPositionals(string)), long: 'targets',
  description: 'Target names to include in matrix comparison' }),
```
2. Mode detection in handler
- If `result2` is provided → existing pairwise mode (two files)
- Else if `--baseline` and `--candidate` → pairwise mode from combined JSONL
- Else → N-way matrix mode from combined JSONL
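A small self-contained sketch of that dispatch, assuming the flag names from step 1 (the `CompareArgs` shape and mode names are illustrative):

```ts
// Sketch: mode detection for the compare handler.
type CompareMode = 'pairwise-two-files' | 'pairwise-combined' | 'matrix';

interface CompareArgs {
  result1: string;
  result2?: string;
  baseline?: string;
  candidate?: string;
  targets?: string[];
}

function detectMode(args: CompareArgs): CompareMode {
  if (args.result2) {
    // Two positional files: existing pairwise behaviour, unchanged.
    return 'pairwise-two-files';
  }
  if (args.baseline && args.candidate) {
    // One combined file, filtered down to a baseline/candidate pair.
    return 'pairwise-combined';
  }
  // Default: N-way matrix across all targets found in the file (or --targets).
  return 'matrix';
}
```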
3. New functions
`loadCombinedResults(filePath: string): Map<string, EvalResult[]>`
- Reads JSONL, groups records by `target` field
- Each `EvalResult` needs `target` added: `{ testId, score, target }`

`compareMatrix(groups: Map<string, EvalResult[]>, threshold: number): MatrixOutput`
- For each `test_id`, collect scores across all targets
- Run pairwise comparisons across all target pairs (reuse existing `compareResults`)
- Return: `{ matrix: TestRow[], pairwise: ComparisonOutput[], targets: string[] }`

`formatMatrix(matrix: MatrixOutput, baselineTarget?: string): string`
- Render the score matrix table (`test_id` rows × target columns)
- Below the matrix, render pairwise summaries
- If `baselineTarget` specified, highlight regressions vs that target
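A minimal sketch of `compareMatrix` under these assumptions; `compareResults` is declared rather than reimplemented, and the target names are attached to each pairwise result here because the existing two-file output may not carry them:

```ts
// Sketch only: N-way matrix built on top of the existing pairwise comparison.
// The EvalResult shape and the compareResults signature are assumptions based on
// the current command; adjust to the real types.
interface EvalResult {
  testId: string;
  score: number;
  target: string;
}

interface TestRow {
  testId: string;
  scores: Record<string, number>; // target name -> score
}

interface ComparisonOutput {
  baseline: string;
  candidate: string;
  summary: { meanDelta: number };
}

interface MatrixOutput {
  matrix: TestRow[];
  pairwise: ComparisonOutput[];
  targets: string[];
}

// Existing pairwise function, reused per target pair (declared, not reimplemented).
declare function compareResults(
  baseline: EvalResult[],
  candidate: EvalResult[],
  threshold: number,
): { summary: { meanDelta: number } };

function compareMatrix(groups: Map<string, EvalResult[]>, threshold: number): MatrixOutput {
  const targets = [...groups.keys()];

  // Collect each test's score across all targets into one row.
  const rowsById = new Map<string, TestRow>();
  for (const [target, results] of groups) {
    for (const result of results) {
      const row = rowsById.get(result.testId) ?? { testId: result.testId, scores: {} };
      row.scores[target] = result.score;
      rowsById.set(result.testId, row);
    }
  }

  // Pairwise comparisons across all target pairs, reusing the existing logic and
  // attaching target names so formatMatrix and the exit-code check can use them.
  const pairwise: ComparisonOutput[] = [];
  for (let i = 0; i < targets.length; i++) {
    for (let j = i + 1; j < targets.length; j++) {
      const result = compareResults(groups.get(targets[i])!, groups.get(targets[j])!, threshold);
      pairwise.push({ baseline: targets[i], candidate: targets[j], ...result });
    }
  }

  return { matrix: [...rowsById.values()], pairwise, targets };
}
```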
4. Exit code for matrix mode
```ts
if (baselineTarget) {
  // Exit 1 if any target regresses vs baseline
  const baselinePairs = pairwise.filter(p => p.baseline === baselineTarget);
  const anyRegression = baselinePairs.some(p => p.summary.meanDelta < 0);
  process.exit(anyRegression ? 1 : 0);
} else {
  process.exit(0); // Informational
}
```
Update benchmark-tooling example
After implementing the compare enhancement, update `examples/features/benchmark-tooling/` to demonstrate the N-way workflow instead of the split workflow.
Update `examples/features/benchmark-tooling/README.md`
Replace the current split-focused content with:
````markdown
# Benchmark Tooling

Multi-model benchmarking workflow with AgentV.

## Quick Start

### 1. Run a matrix evaluation

```bash
agentv eval examples/features/benchmark-tooling/evals/benchmark.eval.yaml
```

This evaluates all tests against 3 targets and writes a combined results JSONL.

### 2. Compare all targets

```bash
# N-way matrix — see all models side by side
agentv compare .agentv/results/<output>.jsonl

# Designate a baseline for CI regression gating
agentv compare .agentv/results/<output>.jsonl --baseline gpt-4.1

# JSON output for CI pipelines
agentv compare .agentv/results/<output>.jsonl --json
```

### 3. Pairwise comparison (optional)

```bash
# Compare two specific targets from the combined file
agentv compare .agentv/results/<output>.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
```
````
Add `examples/features/benchmark-tooling/evals/benchmark.eval.yaml`

```yaml
execution:
  targets:
    - gemini-3-flash-preview
    - gpt-4.1
    - gpt-5-mini

tests:
  - id: greeting
    input: "Say hello"
    criteria: "The response should contain a greeting"
  - id: code-generation
    input: "Write a fibonacci function in Python"
    criteria: "The response should contain a valid Python function"
  - id: summarization
    input: "Summarize the key benefits of automated testing"
    criteria: "The response should mention reliability, speed, or regression detection"
```
Add `examples/features/benchmark-tooling/fixtures/combined-results.jsonl`

Sample combined output (9 records: 3 tests × 3 targets) so the compare command can be demonstrated without running a live eval:

```bash
# Works out of the box — no API keys needed
agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl
```

Each record needs: `test_id`, `target`, `score`, `input`, `answer`. Use realistic mock data.
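For example, a couple of records might look like the lines below (mock values chosen to match the matrix example above; any extra fields are up to the implementer):

```jsonl
{"test_id": "greeting", "target": "gemini-3-flash-preview", "score": 0.90, "input": "Say hello", "answer": "Hello! How can I help you today?"}
{"test_id": "greeting", "target": "gpt-4.1", "score": 0.85, "input": "Say hello", "answer": "Hi there! How can I help?"}
```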
Acceptance Criteria
- `agentv compare results.jsonl` reads a combined JSONL and shows the N-way matrix
- `agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini` filters by target and shows a pairwise comparison
- `agentv compare results.jsonl --baseline gpt-4.1` shows the matrix and exits 1 on regression vs baseline
- `agentv compare results.jsonl --targets t1 t2` limits the matrix to the specified targets
- `agentv compare results.jsonl --json` outputs machine-readable matrix + pairwise data
- Existing two-file pairwise mode (`agentv compare a.jsonl b.jsonl`) still works unchanged
- `examples/features/benchmark-tooling/` updated with the eval YAML, fixture, and N-way README
- `agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl` works out of the box (no API keys needed)
Supersedes
Closes #380 — the split-by-target example is no longer needed as a primary workflow. The script stays as a niche utility.