feat(studio): comparison analytics charts for skills/workflow benchmarking #1102

@christso

Context

Studio's Compare tab currently shows tables and pass-rate pills but has no charts or statistical visualization. For benchmarks that compare skill/workflow plugin variants (e.g. examples/showcase/bug-fix-benchmark), users need visual analysis tools beyond raw numbers.

The normalized gain metric (g) was added in #1101. Studio should surface it alongside existing metrics in visual form.

What's needed

0. Rename Compare tab to Analytics

Rename the "Compare" tab to "Analytics" throughout Studio (tab label, route, component name, any internal references). The existing matrix/table content stays — it becomes one section within the Analytics tab alongside the new charts.

1. Charting library

Add recharts only. Do not add shadcn/charts, radix-ui, visx, d3, or chart.js — Studio has no existing component library and adding shadcn/charts would pull in the entire radix/shadcn dependency tree unnecessarily. Style recharts components with existing Tailwind utilities to match Studio's color conventions.

2. Charts (Analytics tab)

Add charts as a collapsible section below the existing matrix in the aggregated view. The baseline selector is a dropdown in that section's header — selecting a baseline target enables delta and g computation for all other targets.

Normalized gain bar chart ← MVP, implement this first

  • Horizontal bars showing g per task, grouped by target
  • Color-coded: green (positive), red (negative), gray (null/no headroom)
  • Sorted by effect size descending
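A minimal sketch of the coloring and ordering rules above. The `GainRow` shape and both helper names are hypothetical, not Studio's actual types. Two details are assumptions: a gain of exactly zero is drawn gray, and "effect size descending" is read as g descending with null values last.

```typescript
// Hypothetical row shape for the chart; field names are assumptions.
interface GainRow {
  task: string;
  target: string;
  g: number | null; // null when the baseline leaves no headroom
}

type BarColor = "green" | "red" | "gray";

// Map a gain value to the color scheme from the issue.
// Assumption: g === 0 is treated like null and drawn gray.
function barColor(g: number | null): BarColor {
  if (g === null || g === 0) return "gray";
  return g > 0 ? "green" : "red";
}

// Sort a copy of the rows by g descending; rows with null g go last.
function sortByGain(rows: GainRow[]): GainRow[] {
  return [...rows].sort((a, b) => {
    if (a.g === null && b.g === null) return 0;
    if (a.g === null) return 1;
    if (b.g === null) return -1;
    return b.g - a.g;
  });
}
```

If effect size is meant as magnitude instead, the comparator would use `Math.abs(b.g) - Math.abs(a.g)`.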

Domain/tag heatmap

  • Pass rate by tag × target in a color-coded grid (green high, red low)
  • Uses existing tags field on test records
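One way the tag × target grid could be derived on the client, as a sketch. The `TestRecord` shape is hypothetical: only the `tags` field is named in this issue, while `target` and `passed` are assumed for illustration.

```typescript
// Hypothetical test record; only `tags` is confirmed by the issue text.
interface TestRecord {
  target: string;
  tags: string[];
  passed: boolean;
}

// grid[tag][target] = pass rate in [0, 1]
type PassRateGrid = Record<string, Record<string, number>>;

function passRateGrid(records: TestRecord[]): PassRateGrid {
  const counts: Record<string, Record<string, { pass: number; total: number }>> = {};
  for (const r of records) {
    // A record with several tags contributes to each tag's row.
    for (const tag of r.tags) {
      if (!counts[tag]) counts[tag] = {};
      if (!counts[tag][r.target]) counts[tag][r.target] = { pass: 0, total: 0 };
      counts[tag][r.target].total += 1;
      if (r.passed) counts[tag][r.target].pass += 1;
    }
  }
  const grid: PassRateGrid = {};
  for (const tag of Object.keys(counts)) {
    grid[tag] = {};
    for (const target of Object.keys(counts[tag])) {
      const { pass, total } = counts[tag][target];
      grid[tag][target] = pass / total;
    }
  }
  return grid;
}
```

Cells with no records are simply absent from the grid, so the heatmap can render them as empty rather than as a 0% pass rate.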

Negative delta table

  • Filtered view showing only tasks where a non-baseline target scored worse than baseline
  • Shows Δ and g columns side by side

Score distribution histogram

  • Histogram of scores across test cases for a single run
  • Shows variance, not just mean
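A sketch of the binning step behind the histogram, assuming scores lie in [0, 1] (the issue does not state the score range). The `Bin` type and `histogram` helper are hypothetical names.

```typescript
interface Bin {
  start: number; // inclusive lower edge
  end: number;   // exclusive upper edge, except for the last bin
  count: number;
}

// Assumption: scores are in [0, 1]; out-of-range values are clamped.
function histogram(scores: number[], binCount = 10): Bin[] {
  const bins: Bin[] = [];
  for (let i = 0; i < binCount; i++) {
    bins.push({ start: i / binCount, end: (i + 1) / binCount, count: 0 });
  }
  for (const s of scores) {
    const clamped = Math.min(1, Math.max(0, s));
    // A score of exactly 1 falls into the last bin.
    const idx = Math.min(binCount - 1, Math.floor(clamped * binCount));
    bins[idx].count += 1;
  }
  return bins;
}
```

The resulting array can be passed as the data for a bar chart, with `count` as the bar height.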

Cost vs. improvement scatter (only if token/cost data present)

  • X-axis: token cost delta vs. baseline; Y-axis: score delta
  • Each point is a test case, colored by target

Trend-over-time line chart

  • Mean score over time, one line per target
  • Sourced from existing /api/runs endpoint sorted by timestamp — no new backend work needed
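A sketch of shaping run data into one series per target. The `RunSummary` fields are assumptions; the actual `/api/runs` response shape is not shown in this issue.

```typescript
// Hypothetical shape of one entry from /api/runs.
interface RunSummary {
  target: string;
  timestamp: string; // assumed to be an ISO 8601 string
  mean_score: number;
}

interface TrendPoint {
  timestamp: string;
  mean_score: number;
}

// Group runs by target, with each series ordered oldest to newest.
function trendSeries(runs: RunSummary[]): Record<string, TrendPoint[]> {
  const sorted = [...runs].sort(
    (a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp),
  );
  const series: Record<string, TrendPoint[]> = {};
  for (const run of sorted) {
    if (!series[run.target]) series[run.target] = [];
    series[run.target].push({ timestamp: run.timestamp, mean_score: run.mean_score });
  }
  return series;
}
```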

3. API additions

The /api/compare endpoint currently returns pass_rate and avg_score per (experiment, target) cell but has no concept of a baseline for delta computation.

Add a ?baseline=<target> query param. When provided:

  • Each non-baseline cell gets delta (avg_score − baseline avg_score) and normalized_gain (g) appended
  • Baseline cell is unchanged
  • If the specified baseline target does not exist in the data, return 400 with a clear error message

No schema changes to existing response shape — these are additive fields on existing cell objects.
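A sketch of the additive computation, under stated assumptions. The `Cell` shape uses the field names from this issue, but the formula for g is an assumption here: it uses the common normalized-gain form (score − baseline) / (1 − baseline) with scores in [0, 1]; the authoritative definition is whatever #1101 implemented. `applyBaseline` is a hypothetical helper name, and the route handler is expected to translate its error into the 400 response.

```typescript
interface Cell {
  experiment: string;
  target: string;
  pass_rate: number;
  avg_score: number;
  delta?: number;
  normalized_gain?: number | null;
}

// Throws when the baseline target is absent; the handler would map this to HTTP 400.
function applyBaseline(cells: Cell[], baseline: string): Cell[] {
  const baselineScore: Record<string, number> = {};
  for (const c of cells) {
    if (c.target === baseline) baselineScore[c.experiment] = c.avg_score;
  }
  if (Object.keys(baselineScore).length === 0) {
    throw new Error(`Unknown baseline target: ${baseline}`);
  }
  return cells.map((c) => {
    if (c.target === baseline) return c; // baseline cell is unchanged
    const base = baselineScore[c.experiment];
    if (base === undefined) return c; // no baseline run for this experiment
    const delta = c.avg_score - base;
    const headroom = 1 - base; // assumption: maximum score is 1
    return { ...c, delta, normalized_gain: headroom > 0 ? delta / headroom : null };
  });
}
```

When `?baseline` is omitted the helper is never called, so existing responses stay byte-for-byte the same.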

Acceptance signals

  • "Compare" tab renamed to "Analytics" throughout (label, route, component)
  • recharts added to apps/studio/package.json (no other new UI libraries)
  • Charts section renders below existing matrix, collapsed by default, with baseline selector dropdown
  • Normalized gain bar chart renders when a baseline is selected and the run has multiple targets
  • g values sourced from /api/compare?baseline=<target> response
  • ?baseline=<target> query param implemented in the compare endpoint
  • No regression in existing matrix/table functionality
  • No regression in existing API responses when ?baseline is omitted

Non-goals

  • Not adding shadcn/charts, radix-ui, or any other component library
  • Not building a full dashboarding system
  • Not adding chart export or sharing
  • Not adding g to the run-level index.jsonl artifact

Labels

  • enhancement: New feature or request
  • in-progress: Claimed by an agent; do not duplicate work
  • wui: Relates to the browser dashboard / web UI runtime

Status: Done
