
feat(compare): support combined JSONL input and N-way multi-model comparison #381


Description


Problem

The current agentv compare command is strictly pairwise: it takes two pre-split JSONL files and computes deltas between them. With 3+ models this creates compounding friction:

  1. Manual splitting required — matrix eval produces one combined JSONL, but compare needs separate files. Users must run split-by-target first.
  2. Combinatorial pairwise runs — 3 models = 3 comparisons, 4 models = 6, 5 models = 10. Each is a separate command.

This is the most common workflow after a multi-target eval, and it should be one command.

Industry Research

Every major eval framework treats N-way matrix comparison as the default, not pairwise:

| Framework | Approach |
| --- | --- |
| promptfoo | Matrix view: prompts × providers × tests. All models evaluated simultaneously. No separate compare command — comparison is the eval output. |
| Braintrust | N-way experiment comparison in the UI. PR comments show deltas vs baseline. |
| LangSmith | evaluate_comparative() accepts 2+ experiments. Pairwise annotation queues for human review. |
| Arize Phoenix | Side-by-side experiments with baseline diffing. |
| wandb | Compare 50+ runs instantly. Parallel coordinates plots. |
| MLflow | Select N runs → compare. Table + chart views. |

No framework uses a dedicated CLI command that takes two pre-split files. The standard pattern is: run eval with multiple targets, see all results in a matrix.

Key takeaways:

  • N-way matrix is the standard — show all models side by side per test case
  • Baseline designation drives CI exit codes — one model is the reference, regressions against it fail the pipeline
  • Pairwise summaries are derived from the matrix, not a separate workflow

Full research: agentevals-research/comparisons/multi-model-comparison-and-baseline-regression.md


Proposed Enhancement

Phase 1: Combined JSONL with target filtering (pairwise shortcut)

# Filter a combined results file by target — no pre-splitting needed
agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
  • Read one JSONL file, filter records by target field
  • Reuse existing pairwise comparison logic
  • Backward compatible — two-file positional args still work
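
A minimal sketch of the filtering step, assuming each record carries the target field written by matrix eval; compareResults is the existing pairwise function and appears only in the trailing comment:

```ts
import { readFileSync } from 'node:fs';

// Load a combined JSONL file and keep only the records for one target.
// Field names (test_id, score, target) follow the record shape described
// in the implementation guide below.
function filterByTarget(filePath: string, target: string) {
  return readFileSync(filePath, 'utf8')
    .split('\n')
    .filter((line) => line.trim().length > 0) // tolerate blank lines
    .map((line) => JSON.parse(line))
    .filter((record) => record.target === target)
    .map((record) => ({ testId: record.test_id, score: record.score }));
}

// Pairwise mode then reuses the existing logic unchanged:
//   compareResults(
//     filterByTarget('results.jsonl', 'gpt-4.1'),    // baseline
//     filterByTarget('results.jsonl', 'gpt-5-mini'), // candidate
//     threshold,
//   );
```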

Phase 2: N-way matrix comparison

# Auto-detect all targets in a combined file, show matrix
agentv compare results.jsonl

# Specify which targets to include
agentv compare results.jsonl --targets gemini-3-flash-preview gpt-4.1 gpt-5-mini

# Designate a baseline for CI exit code
agentv compare results.jsonl --baseline gpt-4.1

Output — matrix table with scores per test × target, pairwise summaries below:

Test ID              gemini-3-flash-preview  gpt-4.1  gpt-5-mini
─────────────────────────────────────────────────────────────────
greeting                              0.90     0.85        0.95
code-generation                       0.70     0.80        0.75
summarization                         0.85     0.90        0.80

Pairwise Summary:
  gemini-3-flash-preview → gpt-4.1:    2 wins, 1 loss, 0 ties  (Δ +0.033)
  gemini-3-flash-preview → gpt-5-mini: 2 wins, 1 loss, 0 ties  (Δ +0.017)
  gpt-4.1 → gpt-5-mini:                1 win, 2 losses, 0 ties (Δ -0.017)

JSON output (--json) includes the full matrix and all pairwise comparisons.
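
The schema is not pinned down here; one plausible shape, using the field names the exit-code sketch below relies on (baseline, summary.meanDelta) with values taken from the example above, is:

```json
{
  "targets": ["gemini-3-flash-preview", "gpt-4.1", "gpt-5-mini"],
  "matrix": [
    { "test_id": "greeting", "scores": { "gemini-3-flash-preview": 0.9, "gpt-4.1": 0.85, "gpt-5-mini": 0.95 } }
  ],
  "pairwise": [
    {
      "baseline": "gemini-3-flash-preview",
      "candidate": "gpt-4.1",
      "summary": { "wins": 2, "losses": 1, "ties": 0, "meanDelta": 0.033 }
    }
  ]
}
```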

Exit code behavior

| Mode | Exit code |
| --- | --- |
| Two-file pairwise (existing) | Same as today — exit 1 on regression |
| Combined JSONL with --baseline | Exit 1 if any target regresses vs baseline |
| Combined JSONL without --baseline | Exit 0 (informational) |

Implementation Guide

Current implementation

File: apps/cli/src/commands/compare/index.ts

Current structure:

  • CLI args: Two positional string args (result1, result2), --threshold, --format/--json
  • loadJsonlResults(filePath) — reads JSONL, extracts test_id + score (ignores target field)
  • compareResults(results1, results2, threshold) — matches by test_id, computes deltas, classifies win/loss/tie
  • formatTable(comparison, file1, file2) — renders pairwise table with ANSI colors
  • determineExitCode(meanDelta) — exit 0 if candidate >= baseline, else exit 1

What to change

1. CLI args

Make result2 optional. Add new flags:

// Existing (keep)
result1: positional({ type: string, description: 'Path to JSONL result file' }),
result2: positional({ type: optional(string), description: 'Path to second JSONL (pairwise mode)' }),

// New
baseline: option({ type: optional(string), long: 'baseline', short: 'b',
  description: 'Target name to use as baseline (filters combined JSONL)' }),
candidate: option({ type: optional(string), long: 'candidate', short: 'c',
  description: 'Target name to use as candidate (filters combined JSONL)' }),
// Assuming cmd-ts: restPositionals is an argument collector, not a type,
// so it cannot be wrapped in optional(). multioption collects a repeated
// flag (--targets a --targets b); a space-separated list as in the Phase 2
// example would need a custom type instead.
targets: multioption({ type: array(string), long: 'targets',
  description: 'Target names to include in matrix comparison' }),

2. Mode detection in handler

if result2 is provided → existing pairwise mode (two files)
else if --baseline and --candidate → pairwise mode from combined JSONL
else → N-way matrix mode from combined JSONL
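
In the handler this is a plain branch; the run* helpers below are placeholders for the three code paths described above, not existing functions:

```ts
// Handler sketch; the run* helpers are placeholders for the three paths.
if (args.result2 !== undefined) {
  // Existing two-file pairwise mode, unchanged.
  await runTwoFilePairwise(args.result1, args.result2, args.threshold);
} else if (args.baseline !== undefined && args.candidate !== undefined) {
  // Pairwise mode sourced from one combined JSONL.
  await runCombinedPairwise(args.result1, args.baseline, args.candidate, args.threshold);
} else {
  // Default: N-way matrix over the targets found in the file.
  await runMatrix(args.result1, { targets: args.targets, baseline: args.baseline });
}
```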

3. New functions

loadCombinedResults(filePath: string): Map<string, EvalResult[]>

  • Reads JSONL, groups records by target field
  • Each EvalResult needs target added: { testId, score, target }
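
A sketch under those assumptions (synchronous I/O for brevity; error handling for malformed lines omitted):

```ts
import { readFileSync } from 'node:fs';

interface EvalResult {
  testId: string;
  score: number;
  target: string;
}

function loadCombinedResults(filePath: string): Map<string, EvalResult[]> {
  const groups = new Map<string, EvalResult[]>();
  for (const line of readFileSync(filePath, 'utf8').split('\n')) {
    if (!line.trim()) continue; // tolerate blank lines
    const record = JSON.parse(line);
    const result: EvalResult = {
      testId: record.test_id,
      score: record.score,
      target: record.target,
    };
    const bucket = groups.get(result.target) ?? [];
    bucket.push(result);
    groups.set(result.target, bucket);
  }
  return groups;
}
```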

compareMatrix(groups: Map<string, EvalResult[]>, threshold: number): MatrixOutput

  • For each test_id, collect scores across all targets
  • Run pairwise comparisons across all target pairs (reuse existing compareResults)
  • Return: { matrix: TestRow[], pairwise: ComparisonOutput[], targets: string[] }
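
A sketch of the aggregation, reusing EvalResult from the sketch above. ComparisonOutput and compareResults are declared here only so the snippet type-checks; the existing compareResults may need a small extension so each pair records which targets it compared:

```ts
interface ComparisonOutput {
  baseline: string;
  candidate: string;
  summary: { wins: number; losses: number; ties: number; meanDelta: number };
}
declare function compareResults(
  baseline: EvalResult[],
  candidate: EvalResult[],
  threshold: number,
): ComparisonOutput;

interface TestRow {
  testId: string;
  scores: Record<string, number | undefined>; // undefined: target skipped this test
}

interface MatrixOutput {
  matrix: TestRow[];
  pairwise: ComparisonOutput[];
  targets: string[];
}

function compareMatrix(groups: Map<string, EvalResult[]>, threshold: number): MatrixOutput {
  const targets = [...groups.keys()].sort();

  // Pivot: one row per test_id, one score column per target.
  const rows = new Map<string, TestRow>();
  for (const target of targets) {
    for (const result of groups.get(target) ?? []) {
      const row = rows.get(result.testId) ?? { testId: result.testId, scores: {} };
      row.scores[target] = result.score;
      rows.set(result.testId, row);
    }
  }

  // Pairwise summaries are derived from the same groups, not a separate pass.
  const pairwise: ComparisonOutput[] = [];
  for (let i = 0; i < targets.length; i++) {
    for (let j = i + 1; j < targets.length; j++) {
      pairwise.push(compareResults(groups.get(targets[i])!, groups.get(targets[j])!, threshold));
    }
  }

  return { matrix: [...rows.values()], pairwise, targets };
}
```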

formatMatrix(matrix: MatrixOutput, baselineTarget?: string): string

  • Render the score matrix table (test_id rows × target columns)
  • Below the matrix, render pairwise summaries
  • If baselineTarget specified, highlight regressions vs that target
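
A rough rendering sketch using the MatrixOutput shape above; plain string building, where the real implementation would reuse the command's existing ANSI color helpers:

```ts
function formatMatrix(output: MatrixOutput, baselineTarget?: string): string {
  const lines: string[] = [];

  // Score matrix: test_id rows × target columns.
  const idWidth = Math.max('Test ID'.length, ...output.matrix.map((r) => r.testId.length));
  lines.push(['Test ID'.padEnd(idWidth), ...output.targets].join('  '));
  for (const row of output.matrix) {
    const cells = output.targets.map((t) =>
      (row.scores[t]?.toFixed(2) ?? '-').padStart(t.length),
    );
    lines.push([row.testId.padEnd(idWidth), ...cells].join('  '));
  }

  // Pairwise summaries below the matrix, flagging regressions vs the baseline.
  lines.push('', 'Pairwise Summary:');
  for (const pair of output.pairwise) {
    const regressed =
      baselineTarget !== undefined &&
      pair.baseline === baselineTarget &&
      pair.summary.meanDelta < 0;
    const sign = pair.summary.meanDelta >= 0 ? '+' : '';
    lines.push(
      `  ${pair.baseline} → ${pair.candidate}: Δ ${sign}${pair.summary.meanDelta.toFixed(3)}` +
        (regressed ? ' (regression)' : ''),
    );
  }

  return lines.join('\n');
}
```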

4. Exit code for matrix mode

if (baselineTarget) {
  // Exit 1 if any target regresses vs baseline
  const baselinePairs = pairwise.filter(p => p.baseline === baselineTarget);
  const anyRegression = baselinePairs.some(p => p.summary.meanDelta < 0);
  process.exit(anyRegression ? 1 : 0);
} else {
  process.exit(0); // Informational
}

Update benchmark-tooling example

After implementing the compare enhancement, update examples/features/benchmark-tooling/ to demonstrate the N-way workflow instead of the split workflow.

Update examples/features/benchmark-tooling/README.md

Replace the current split-focused content with:

# Benchmark Tooling

Multi-model benchmarking workflow with AgentV.

## Quick Start

### 1. Run a matrix evaluation

```bash
agentv eval examples/features/benchmark-tooling/evals/benchmark.eval.yaml
```

This evaluates all tests against 3 targets and writes a combined results JSONL.

### 2. Compare all targets

```bash
# N-way matrix — see all models side by side
agentv compare .agentv/results/<output>.jsonl

# Designate a baseline for CI regression gating
agentv compare .agentv/results/<output>.jsonl --baseline gpt-4.1

# JSON output for CI pipelines
agentv compare .agentv/results/<output>.jsonl --json
```

### 3. Pairwise comparison (optional)

```bash
# Compare two specific targets from the combined file
agentv compare .agentv/results/<output>.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
```

Add examples/features/benchmark-tooling/evals/benchmark.eval.yaml

execution:
  targets:
    - gemini-3-flash-preview
    - gpt-4.1
    - gpt-5-mini

tests:
  - id: greeting
    input: "Say hello"
    criteria: "The response should contain a greeting"

  - id: code-generation
    input: "Write a fibonacci function in Python"
    criteria: "The response should contain a valid Python function"

  - id: summarization
    input: "Summarize the key benefits of automated testing"
    criteria: "The response should mention reliability, speed, or regression detection"

Add examples/features/benchmark-tooling/fixtures/combined-results.jsonl

Sample combined output (9 records: 3 tests × 3 targets) so the compare command can be demonstrated without running a live eval:

# Works out of the box — no API keys needed
agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl

Each record needs: test_id, target, score, input, answer. Use realistic mock data.
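
For illustration, the first two records might look like this; scores match the example matrix earlier in this issue, and the answer text is mock:

```jsonl
{"test_id": "greeting", "target": "gemini-3-flash-preview", "score": 0.90, "input": "Say hello", "answer": "Hello! How can I help you today?"}
{"test_id": "greeting", "target": "gpt-4.1", "score": 0.85, "input": "Say hello", "answer": "Hi there!"}
```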


Acceptance Criteria

  • agentv compare results.jsonl reads a combined JSONL and shows N-way matrix
  • agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini filters by target, shows pairwise
  • agentv compare results.jsonl --baseline gpt-4.1 shows matrix, exits 1 on regression vs baseline
  • agentv compare results.jsonl --targets t1 t2 limits matrix to specified targets
  • agentv compare results.jsonl --json outputs machine-readable matrix + pairwise data
  • Two-file pairwise mode (agentv compare a.jsonl b.jsonl) still works unchanged
  • examples/features/benchmark-tooling/ updated with benchmark.eval.yaml, fixture, and N-way README
  • Fixture runs out of the box: agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl

Supersedes

Closes #380 — the split-by-target example is no longer needed as a primary workflow. The script stays as a niche utility.
