Objective
Enhance the Studio UI to provide side-by-side comparison of eval results across different targets/models within the same experiment.
Motivation
The agentv CLI has a compare command, but the Studio UI currently shows runs individually. When running the same eval across different providers (e.g., azure vs gemini) with the experiment feature (#977), users need to compare results side-by-side.
From benchmarking work: running reasoning evals across azure (gpt-5.4-mini) and gemini (gemini-3-flash-preview) with the with-superpowers and without-superpowers experiments produces 4 separate runs. The Studio should enable:
- Seeing all 4 runs in a comparison matrix
- Identifying which target + experiment combination performs best
- Spotting grading anomalies (e.g., "no response provided" grading failures)
Design
Extend the existing experiments tab in Studio to show a comparison matrix:
| Target          | without-superpowers | with-superpowers |
| --------------- | ------------------- | ---------------- |
| azure (gpt-5.4) | 73.5% (1/2 pass)    | 25.0% (0/2 pass) |
| gemini (flash)  | 100% (2/2 pass)     | 75.0% (1/2 pass) |
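Each cell aggregates the runs for one target × experiment pair. A minimal sketch of the data shape a cell might need, using hypothetical type and field names (EvalRun, MatrixCell) rather than anything from the agentv codebase:

```ts
// Hypothetical shapes; names are illustrative, not real agentv/Studio APIs.
interface EvalRun {
  target: string;      // e.g. "azure:gpt-5.4-mini"
  experiment: string;  // e.g. "with-superpowers"
  results: { testCaseId: string; passed: boolean; score: number }[];
}

interface MatrixCell {
  passRate: number;  // passed test cases / total test cases
  avgScore: number;  // mean per-test-case score
  runCount: number;  // number of runs merged into this cell
}
```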
Implementation approach
- Add a "Compare" view to the experiments tab
- Group runs by experiment × target (see the sketch after this list)
- Show pass rate, average score, and per-test-case breakdown
- Highlight best/worst performers
- Support drill-down to individual test case differences
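As a sketch of the grouping and aggregation step (assuming the hypothetical EvalRun/MatrixCell shapes from the Design section; illustrative only, not the actual Studio implementation):

```ts
// Group runs by target × experiment and compute pass rate / average score.
// Raw counts are accumulated first so multiple runs per cell merge correctly.
function buildMatrix(runs: EvalRun[]): Map<string, MatrixCell> {
  const totals = new Map<
    string,
    { passed: number; total: number; scoreSum: number; runs: number }
  >();
  for (const run of runs) {
    const key = `${run.target}::${run.experiment}`;
    const acc = totals.get(key) ?? { passed: 0, total: 0, scoreSum: 0, runs: 0 };
    acc.passed += run.results.filter((r) => r.passed).length;
    acc.total += run.results.length;
    acc.scoreSum += run.results.reduce((sum, r) => sum + r.score, 0);
    acc.runs += 1;
    totals.set(key, acc);
  }
  const cells = new Map<string, MatrixCell>();
  for (const [key, acc] of totals) {
    cells.set(key, {
      passRate: acc.total > 0 ? acc.passed / acc.total : 0,
      avgScore: acc.total > 0 ? acc.scoreSum / acc.total : 0,
      runCount: acc.runs,
    });
  }
  return cells;
}
```

Best/worst highlighting and drill-down can then read from this map, keyed by `target::experiment`.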
Acceptance Criteria
Non-goals
- Statistical significance testing (future enhancement)
- Automated recommendations