Context
Studio's Compare tab currently shows tables and pass-rate pills but has no charts or statistical visualization. For benchmarks that compare skill/workflow plugin variants (e.g. `examples/showcase/bug-fix-benchmark`), users need visual analysis tools beyond raw numbers.
The normalized gain metric (`g`) was added in #1101. Studio should surface it alongside existing metrics in visual form.
What's needed
0. Rename Compare tab to Analytics
Rename the "Compare" tab to "Analytics" throughout Studio (tab label, route, component name, any internal references). The existing matrix/table content stays — it becomes one section within the Analytics tab alongside the new charts.
1. Charting library
Add `recharts` only. Do not add shadcn/charts, radix-ui, visx, d3, or chart.js — Studio has no existing component library, and adding shadcn/charts would pull in the entire radix/shadcn dependency tree unnecessarily. Style recharts components with existing Tailwind utilities to match Studio's color conventions.
2. Charts (Analytics tab)
Add charts as a collapsible section below the existing matrix in the aggregated view. The baseline selector is a dropdown in that section's header — selecting a baseline target enables delta and `g` computation for all other targets.
Normalized gain bar chart ← MVP, implement this first
- Horizontal bars showing `g` per task, grouped by target
- Color-coded: green (positive), red (negative), gray (null/no headroom)
- Sorted by effect size descending
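The data shaping for this chart can be sketched as below. This assumes `g` is the headroom-normalized gain (delta divided by remaining headroom, `null` when the baseline is already at the maximum); #1101 is the source of truth for the real formula, and all type and function names here are illustrative, not Studio's actual schema:

```typescript
// Sketch: shape per-task rows for the normalized gain bar chart.
// ASSUMPTION: g = (score - baselineScore) / (maxScore - baselineScore),
// and null when the baseline already sits at maxScore (no headroom).

interface TaskScore {
  task: string;
  target: string;
  score: number;         // this target's score on the task
  baselineScore: number; // baseline target's score on the same task
}

interface GainRow {
  task: string;
  target: string;
  g: number | null;
  fill: string; // per-bar color, usable via a recharts <Cell fill={...} />
}

function normalizedGain(score: number, baseline: number, maxScore = 1): number | null {
  const headroom = maxScore - baseline;
  if (headroom <= 0) return null; // no headroom -> gray bar
  return (score - baseline) / headroom;
}

// Nulls sort last; otherwise larger |g| first
function effectSize(g: number | null): number {
  return g === null ? -1 : Math.abs(g);
}

function gainRows(scores: TaskScore[], maxScore = 1): GainRow[] {
  return scores
    .map(({ task, target, score, baselineScore }) => {
      const g = normalizedGain(score, baselineScore, maxScore);
      // Tailwind green-500 / red-500 / gray-400 hex values
      const fill = g === null ? "#9ca3af" : g >= 0 ? "#22c55e" : "#ef4444";
      return { task, target, g, fill };
    })
    .sort((a, b) => effectSize(b.g) - effectSize(a.g));
}
```

Precomputing `fill` per row keeps the color-coding rule in one place instead of inside the chart component.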
Domain/tag heatmap
- Pass rate by tag × target in a color-coded grid (green high, red low)
- Uses existing `tags` field on test records
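The pivot behind the heatmap might look like the sketch below. Only the `tags` field is confirmed to exist on test records; the rest of the `TestRecord` shape is an assumption to make the example self-contained:

```typescript
// Sketch: pivot per-test records into a tag x target pass-rate grid.
// ASSUMPTION: record shape beyond `tags` mirrors the description, not Studio's schema.

interface TestRecord {
  target: string;
  tags: string[];
  passed: boolean;
}

// tag -> target -> pass rate in [0, 1]; a test counts once per tag it carries
function passRateByTag(records: TestRecord[]): Map<string, Map<string, number>> {
  const counts = new Map<string, Map<string, { pass: number; total: number }>>();
  for (const r of records) {
    for (const tag of r.tags) {
      const byTarget = counts.get(tag) ?? new Map();
      counts.set(tag, byTarget);
      const c = byTarget.get(r.target) ?? { pass: 0, total: 0 };
      c.total += 1;
      if (r.passed) c.pass += 1;
      byTarget.set(r.target, c);
    }
  }
  const grid = new Map<string, Map<string, number>>();
  for (const [tag, byTarget] of counts) {
    const row = new Map<string, number>();
    for (const [target, { pass, total }] of byTarget) row.set(target, pass / total);
    grid.set(tag, row);
  }
  return grid;
}
```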
Negative delta table
- Filtered view showing only tasks where a non-baseline target scored worse than baseline
- Shows Δ and `g` columns side by side
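The filter itself is simple. A sketch with a hypothetical per-task row shape, assuming `delta` and `g` have already been computed against the selected baseline:

```typescript
// Sketch: keep only regressions for the negative delta table.
// ASSUMPTION: RegressionRow is illustrative; delta/g come from the baseline computation.

interface RegressionRow {
  task: string;
  target: string;
  delta: number;   // score - baseline score for this task
  g: number | null;
}

function regressions(rows: RegressionRow[]): RegressionRow[] {
  return rows
    .filter((r) => r.delta < 0)           // worse than baseline only
    .sort((a, b) => a.delta - b.delta);   // most negative first
}
```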
Score distribution histogram
- Histogram of scores across test cases for a single run
- Shows variance, not just mean
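A minimal binning helper for the histogram might look like this; the bin count and the assumption that scores fall in [0, 1] are illustrative choices:

```typescript
// Sketch: bin scores into fixed-width buckets for the distribution histogram.
// ASSUMPTION: scores are normalized to [0, 1].

function histogram(scores: number[], bins = 10): { bin: string; count: number }[] {
  const counts = new Array(bins).fill(0);
  for (const s of scores) {
    // clamp so a score of exactly 1.0 lands in the last bin
    const i = Math.min(bins - 1, Math.floor(s * bins));
    counts[i] += 1;
  }
  return counts.map((count, i) => ({
    bin: `${(i / bins).toFixed(1)}-${((i + 1) / bins).toFixed(1)}`,
    count,
  }));
}
```

Emitting empty bins (rather than dropping them) keeps the x-axis uniform so variance is visible at a glance.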
Cost vs. improvement scatter (only if token/cost data present)
- X-axis: token cost delta vs. baseline; Y-axis: score delta
- Each point is a test case, colored by target
Trend-over-time line chart
- Mean score over time, one line per target
- Sourced from existing `/api/runs` endpoint sorted by timestamp — no new backend work needed
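Folding the runs list into per-target series could be sketched as below. The `Run` shape (`timestamp`, `target`, `meanScore`) is an assumption about what `/api/runs` returns; adapt the field names to the real response:

```typescript
// Sketch: fold /api/runs output into one time series per target for the trend chart.
// ASSUMPTION: Run field names are illustrative, not the actual /api/runs schema.

interface Run {
  timestamp: string; // ISO 8601, so lexicographic order === chronological order
  target: string;
  meanScore: number;
}

function trendSeries(runs: Run[]): Map<string, { t: string; score: number }[]> {
  const series = new Map<string, { t: string; score: number }[]>();
  const sorted = [...runs].sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  for (const run of sorted) {
    const points = series.get(run.target) ?? [];
    points.push({ t: run.timestamp, score: run.meanScore });
    series.set(run.target, points);
  }
  return series;
}
```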
3. API additions
The `/api/compare` endpoint currently returns `pass_rate` and `avg_score` per (experiment, target) cell but has no concept of a baseline for delta computation.
Add a `?baseline=<target>` query param. When provided:
- Each non-baseline cell gets `delta` (avg_score − baseline avg_score) and `normalized_gain` (`g`) appended
- Baseline cell is unchanged
- If the specified baseline target does not exist in the data, return 400 with a clear error message
No schema changes to existing response shape — these are additive fields on existing cell objects.
Acceptance signals
- "Compare" tab renamed to "Analytics" throughout (label, route, component)
- `recharts` added to `apps/studio/package.json` (no other new UI libraries)
- Charts section renders below existing matrix, collapsed by default, with baseline selector dropdown
- Normalized gain bar chart renders when a baseline is selected and run has multiple targets
- `g` values sourced from `/api/compare?baseline=<target>` response
- `?baseline=<target>` query param implemented in the compare endpoint
- No regression in existing matrix/table functionality
- No regression in existing API responses when `?baseline` is omitted
Non-goals
- Not adding shadcn/charts, radix-ui, or any other component library
- Not building a full dashboarding system
- Not adding chart export or sharing
- Not adding `g` to the run-level `index.jsonl` artifact
Related
- `agentv compare` (already merged)