feat(studio): comparison analytics charts for skills/workflow benchmarking #1102

## Context

Studio's Compare tab currently shows tables and pass-rate pills but has no charts or statistical visualization. For benchmarks that compare skill/workflow plugin variants (e.g. `examples/showcase/bug-fix-benchmark`), users need visual analysis tools beyond raw numbers.

The normalized gain metric (`g`) was added in #1101. Studio should surface it alongside existing metrics in visual form.

## What's needed

### 0. Rename Compare tab to Analytics

Rename the "Compare" tab to "Analytics" throughout Studio (tab label, route, component name, any internal references). The existing matrix/table content stays — it becomes one section within the Analytics tab alongside the new charts.

### 1. Charting library

Add **recharts** only. Do not add shadcn/charts, radix-ui, visx, d3, or chart.js — Studio has no existing component library and adding shadcn/charts would pull in the entire radix/shadcn dependency tree unnecessarily. Style recharts components with existing Tailwind utilities to match Studio's color conventions.

### 2. Charts (Analytics tab)

Add charts as a collapsible section below the existing matrix in the aggregated view. The baseline selector is a dropdown in that section's header — selecting a baseline target enables delta and `g` computation for all other targets.

**Normalized gain bar chart** ← MVP, implement this first
- Horizontal bars showing `g` per task, grouped by target
- Color-coded: green (positive), red (negative), gray (null/no headroom)
- Sorted by effect size descending
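
The data preparation for this chart could be sketched as below. This is a minimal sketch under assumptions: it takes `g` to be the common normalized-gain form, (score − baseline) / (1 − baseline), with scores in [0, 1], and it reads "effect size" as `g` itself. The definition that actually shipped in #1101 is authoritative. `GainRow`, `normalizedGain`, `barColor`, and `sortByGain` are hypothetical names, not existing Studio code.

```typescript
// Hypothetical row shape for one bar; not taken from Studio's codebase.
interface GainRow {
  task: string;
  target: string;
  g: number | null;
}

// ASSUMPTION: g = (score - baseline) / (1 - baseline), scores in [0, 1].
// Returns null when the baseline leaves no headroom (baseline >= 1).
function normalizedGain(score: number, baseline: number): number | null {
  const headroom = 1 - baseline;
  if (headroom <= 0) return null;
  return (score - baseline) / headroom;
}

// Color mapping from the spec: green positive, red negative, gray null.
function barColor(g: number | null): "green" | "red" | "gray" {
  if (g === null) return "gray";
  if (g > 0) return "green";
  if (g < 0) return "red";
  return "gray"; // g === 0 is not specified in the issue; gray is a guess
}

// ASSUMPTION: "effect size descending" means g descending, null rows last.
function sortByGain(rows: GainRow[]): GainRow[] {
  return rows.slice().sort((a, b) => {
    if (a.g === null && b.g === null) return 0;
    if (a.g === null) return 1;
    if (b.g === null) return -1;
    return b.g - a.g;
  });
}
```

In the real chart the `g` values would come from the API response rather than being recomputed client-side; the helper is shown only to make the null/no-headroom case concrete.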

**Domain/tag heatmap**
- Pass rate by tag × target in a color-coded grid (green high, red low)
- Uses existing `tags` field on test records
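
The tag × target grid could be aggregated roughly as follows. Only the `tags` field is confirmed by this issue; the `target` and `passed` fields on `TestRecord`, and the function name `passRateByTag`, are assumptions made for illustration.

```typescript
// Hypothetical record shape: only `tags` is named in the issue.
interface TestRecord {
  target: string;
  tags: string[];
  passed: boolean;
}

interface TagCell {
  pass: number;
  total: number;
}

// Returns tag -> target -> pass rate in [0, 1].
// A record with several tags contributes to each of its tags.
function passRateByTag(records: TestRecord[]): Map<string, Map<string, number>> {
  const counts = new Map<string, Map<string, TagCell>>();
  for (const record of records) {
    for (const tag of record.tags) {
      let row = counts.get(tag);
      if (!row) {
        row = new Map<string, TagCell>();
        counts.set(tag, row);
      }
      let cell = row.get(record.target);
      if (!cell) {
        cell = { pass: 0, total: 0 };
        row.set(record.target, cell);
      }
      cell.total += 1;
      if (record.passed) cell.pass += 1;
    }
  }

  // Convert raw counts into rates; every cell has total >= 1 by construction.
  const rates = new Map<string, Map<string, number>>();
  counts.forEach((row, tag) => {
    const rateRow = new Map<string, number>();
    row.forEach((cell, target) => rateRow.set(target, cell.pass / cell.total));
    rates.set(tag, rateRow);
  });
  return rates;
}
```

Tag/target combinations with no records are simply absent from the result, so the grid can render them as empty cells instead of as a 0% pass rate.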

**Negative delta table**
- Filtered view showing only tasks where a non-baseline target scored worse than baseline
- Shows `Δ` and `g` columns side by side

**Score distribution histogram**
- Histogram of scores across test cases for a single run
- Shows variance, not just mean
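
One way to bin scores for the histogram is sketched below. It assumes scores lie in [0, 1] and uses equal-width bins; the bin count and the name `scoreHistogram` are illustrative choices, not part of the spec.

```typescript
// ASSUMPTION: scores are in [0, 1]; out-of-range values are clamped,
// and non-finite values (NaN, Infinity) are skipped.
function scoreHistogram(scores: number[], binCount: number = 10): number[] {
  const bins: number[] = new Array<number>(binCount).fill(0);
  for (const score of scores) {
    if (!Number.isFinite(score)) continue;
    const clamped = Math.min(1, Math.max(0, score));
    // A score of exactly 1 falls into the last bin rather than overflowing.
    const index = Math.min(binCount - 1, Math.floor(clamped * binCount));
    bins[index] += 1;
  }
  return bins;
}
```

The resulting counts can be mapped to `{ bin, count }` objects and passed to a bar chart.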

**Cost vs. improvement scatter** (only if token/cost data present)
- X-axis: token cost delta vs. baseline; Y-axis: score delta
- Each point is a test case, colored by target

**Trend-over-time line chart**
- Mean score over time, one line per target
- Sourced from existing `/api/runs` endpoint sorted by timestamp — no new backend work needed

### 3. API additions

The `/api/compare` endpoint currently returns `pass_rate` and `avg_score` per (experiment, target) cell but has no concept of a baseline for delta computation.

Add a `?baseline=<target>` query param. When provided:
- Each non-baseline cell gets `delta` (avg_score − baseline avg_score) and `normalized_gain` (`g`) appended
- Baseline cell is unchanged
- If the specified baseline target does not exist in the data, return 400 with a clear error message

No schema changes to existing response shape — these are additive fields on existing cell objects.
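
The baseline behavior described above could be sketched as a pure function that the endpoint handler calls. This is an illustration under assumptions: the cell shape beyond `pass_rate`, `avg_score`, `delta`, and `normalized_gain` is hypothetical, the `g` formula is the common (score − baseline) / (1 − baseline) form and should be replaced by whatever #1101 implemented, and `applyBaseline` is not an existing function.

```typescript
// Hypothetical cell shape; `experiment` and `target` keys are assumed.
interface CompareCell {
  experiment: string;
  target: string;
  pass_rate: number;
  avg_score: number;
  delta?: number;
  normalized_gain?: number | null;
}

type BaselineResult =
  | { status: 200; cells: CompareCell[] }
  | { status: 400; error: string };

function applyBaseline(cells: CompareCell[], baseline?: string): BaselineResult {
  // Omitted param: response is identical to today's.
  if (baseline === undefined) return { status: 200, cells };

  if (!cells.some((c) => c.target === baseline)) {
    return { status: 400, error: `Unknown baseline target: ${baseline}` };
  }

  const augmented = cells.map((cell) => {
    if (cell.target === baseline) return cell; // baseline cell unchanged
    const base = cells.find(
      (c) => c.experiment === cell.experiment && c.target === baseline,
    );
    // The issue does not say what to do when an experiment lacks a
    // baseline cell; this sketch leaves such cells untouched.
    if (!base) return cell;
    const delta = cell.avg_score - base.avg_score;
    const headroom = 1 - base.avg_score;
    // ASSUMPTION: g is null when the baseline has no headroom.
    const normalized_gain = headroom > 0 ? delta / headroom : null;
    return { ...cell, delta, normalized_gain };
  });
  return { status: 200, cells: augmented };
}
```

Returning a status-tagged result keeps the function independent of any particular HTTP framework; the handler only has to translate it into a response.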

## Acceptance signals

- "Compare" tab renamed to "Analytics" throughout (label, route, component)
- `recharts` added to `apps/studio/package.json` (no other new UI libraries)
- Charts section renders below existing matrix, collapsed by default, with baseline selector dropdown
- Normalized gain bar chart renders when a baseline is selected and the run has multiple targets
- `g` values sourced from `/api/compare?baseline=<target>` response
- `?baseline=<target>` query param implemented in the compare endpoint
- No regression in existing matrix/table functionality
- No regression in existing API responses when `?baseline` is omitted

## Non-goals

- Not adding shadcn/charts, radix-ui, or any other component library
- Not building a full dashboarding system
- Not adding chart export or sharing
- Not adding `g` to the run-level `index.jsonl` artifact

## Related

- #1100 expand bug-fix-benchmark with workflow evals
- #1101 normalized gain in `agentv compare` (already merged)
- #1079 static HTML report command
