Skip to content

feat(inference): measured-power Y-axis metrics on scatter chart#375

Open
arygupt wants to merge 3 commits into
masterfrom
feat/measured-power-metrics
Open

feat(inference): measured-power Y-axis metrics on scatter chart#375
arygupt wants to merge 3 commits into
masterfrom
feat/measured-power-metrics

Conversation

@arygupt
Copy link
Copy Markdown
Collaborator

@arygupt arygupt commented May 22, 2026

Summary

Adds two new options under a new Measured Energy dropdown group on both the vs. Interactivity and vs. End-to-end Latency charts:

  • Measured Avg Power per GPU (W) — Pareto envelope: lower_right (interactivity) / lower_left (e2e), matching "lower power at the same interactivity is more efficient"
  • Measured J per Output Token (J/tok) — same envelope directions

Distinct from the existing y_jTotal/y_jOutput/y_jInput which derive joules from each GPU's spec-sheet TDP. The new metrics are sourced from runner GPU telemetry averaged over the exact bench load window — see companion PR semianalysisai/InferenceX#1551 which adds aggregate_power.py and emits avg_power_w + joules_per_output_token into the agg JSON.

Wiring

File Change
packages/constants/src/metric-keys.ts Register avg_power_w, joules_per_output_token so the ETL auto-capture warning doesn't fire
packages/app/src/lib/benchmark-transform.ts Pass the two raw fields through rowToAggDataEntry. Left undefined when absent so downstream code can distinguish "no measurement" from "0 W"
packages/app/src/components/inference/types.ts Extend AggDataEntry, InferenceData, YAxisMetricKey, ChartDefinition
packages/app/src/lib/chart-utils.ts Extend Y_AXIS_METRICS, createChartDataPoint (gated on typeof === 'number'), calculateRoofline/computeAllRooflines yKey union, markRooflinePoints init+mark blocks
packages/app/src/components/inference/inference-chart-config.json Add y_measured* entries to both chartTypes. y_measuredAvgPower_roofline is intentionally omitted — ScatterGraph falls through to lower_right by default, which matches the desired semantic.
packages/app/src/components/inference/ui/ChartControls.tsx Add "Measured Energy" group to METRIC_GROUPS

Overlay path

The unofficial-run overlay path (?unofficialrun=…) is automatic — transformBenchmarkRows is shared between official and overlay rendering, so the new metrics flow to overlay points once #1551 merges and benchmarks emit the new fields. Per CLAUDE.md the code-wiring requirement is satisfied; Cypress E2E covering the overlay path is a follow-up to add once real data lands in the DB so the mocked artifact JSON can mirror a real payload shape.

Graceful degradation (audited end-to-end)

Site does not break when rows are missing the new fields. The protection points:

  1. rowToAggDataEntry — assigns undefined rather than 0 so the absence is preserved through the data pipeline.
  2. createChartDataPointtypeof === 'number' gate omits the derived measuredAvgPower / measuredJPerOutputToken fields entirely when the raw value is absent.
  3. useChartData (useChartData.ts:301-305) — filteredData.some((d) => metricKey in d) short-circuits to an empty array when no row has the metric, so the chart renders cleanly empty rather than throwing.
  4. useChartData mapping (useChartData.ts:305) — .filter((d) => metricKey in d) drops any individual row missing the field before the y-axis remap.
  5. processOverlayChartData (utils.ts:104) — same metricKey in d filter for the overlay path.
  6. ScatterGraph Pareto front — runs only on the already-filtered point set, so the roofline is never poisoned by missing-data points.

For rows without measured-power data (historical runs, runs predating the runner PR, runs where the SMI poller didn't start), the new chart options simply omit those points — the existing TDP-derived y_jTotal/y_jOutput/y_jInput stay visible as a comparable fallback. There's a test guarding the typeof === 'number' gate so a future refactor to truthiness check doesn't silently drop legitimate zero-value measurements.

Design notes

  1. typeof === 'number' gate, not truthiness. Preserves legitimate zero-value measurements (e.g. an idle GPU) while still treating undefined as "no measurement." Test in chart-utils.test.ts enforces this.
  2. No new D3 / chart-rendering code. The new metrics piggyback on the existing selectedYAxisMetric plumbing — every place that already handles tpPerGpu / jTotal / etc. now also handles measuredAvgPower / measuredJPerOutputToken.

Test plan

  • pnpm typecheck — clean
  • pnpm lint — 0 warnings, 0 errors
  • pnpm test:unit — 1921/1921 passing (+7 new tests covering rowToAggDataEntry pass-through, createChartDataPoint field gating, zero-value preservation, missing-field handling)
  • Pre-commit hook (lint + format + typecheck) — passes
  • Dev-server smoke test — confirmed "Measured Energy" group label and both metric labels are present in the served JS bundle at /_next/static/chunks/
  • Code-path audit for the "missing field" case — see "Graceful degradation" above; every code path that touches the new fields either filters first or is guarded
  • After #1551 merges and one benchmark seeds real data: verify the live chart renders measured points in the same shape as the existing TDP-derived ones
  • Add Cypress E2E covering official + overlay paths against the first real payload
  • Decide on ChartDisplay caveat subtitle ("Measured during load window via 1 Hz SMI poll") once visible with real data

Adds two new options under a new "Measured Energy" dropdown group on
both the "vs. Interactivity" and "vs. E2E Latency" charts:

  - Measured Avg Power per GPU (W)        — no roofline (no universal
                                             "better" direction)
  - Measured J per Output Token (J/tok)   — roofline lower_right
                                             (interactivity) / lower_left
                                             (e2e)

Distinct from the existing y_jTotal/y_jOutput/y_jInput which derive
joules from each GPU's spec-sheet TDP. The new metrics are sourced
from runner GPU telemetry averaged over the exact bench load window
(see aggregate_power.py in semianalysisai/InferenceX).

Wiring:

  - packages/constants/src/metric-keys.ts: register avg_power_w,
    joules_per_output_token in the canonical metric key set so the ETL
    auto-capture warning doesn't fire.
  - packages/app/src/lib/benchmark-transform.ts: pass the two raw
    fields through rowToAggDataEntry. Left undefined when absent so
    downstream code can distinguish "no measurement" from "0 W".
  - packages/app/src/components/inference/types.ts: extend AggDataEntry,
    InferenceData, YAxisMetricKey, and ChartDefinition.
  - packages/app/src/lib/chart-utils.ts: extend Y_AXIS_METRICS,
    createChartDataPoint (gated on typeof===number), calculateRoofline
    and computeAllRooflines yKey union, markRooflinePoints init+mark
    blocks.
  - packages/app/src/components/inference/inference-chart-config.json:
    add y_measured* entries to both chartTypes.
  - packages/app/src/components/inference/ui/ChartControls.tsx: add
    "Measured Energy" group to METRIC_GROUPS.

The overlay (unofficial run) path is automatic — transformBenchmarkRows
is shared between official and overlay rendering, so the new metrics
flow to ?unofficialrun= URLs once the runner-side PR is merged and
benchmarks ingest the new fields.

For rows without measured-power data (historical runs, runs predating
aggregate_power.py, runs where the SMI poller didn't start), points
are simply omitted from the new charts — the existing TDP-derived
y_jTotal/y_jOutput/y_jInput stay visible as a comparable fallback.

Verification:

  - pnpm typecheck: clean
  - pnpm lint: 0 warnings, 0 errors
  - pnpm test:unit: 1921/1921 passing (+7 new tests covering
    rowToAggDataEntry pass-through, createChartDataPoint field gating,
    zero-value preservation, missing-field handling)
  - Dev-server smoke: confirmed "Measured Energy" group label and
    both metric labels are present in the served JS bundle at
    /_next/static/chunks/

Follow-up: Cypress E2E covering both the official path and
?unofficialrun= overlay path for the two new metrics, to be added
once the runner PR ships real data to the DB.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@vercel
Copy link
Copy Markdown

vercel Bot commented May 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
inferencemax-app Ready Ready Preview, Comment May 22, 2026 8:30pm

Request Review

@claude
Copy link
Copy Markdown

claude Bot commented May 22, 2026

Claude finished @arygupt's task in 4m 18s —— View job


🟢 LGTM — no blocking issues

Wiring is consistent across all six layers (constants, transform, types, chart-utils, config, controls). The graceful-degradation contract holds: useChartData.ts line 301-325 filters via metricKey in d, and the typeof === 'number' gate in createChartDataPoint keeps 'measuredAvgPower' in d false for legacy rows (rather than true with value: undefined). The "no roofline" choice for y_measuredAvgPower works because computeAllRooflines skips any metric without a _roofline entry. Test coverage for the new lib code is good — the zero-value preservation test (avg_power_w: 0) specifically guards the typeof === 'number' gate against a future refactor to truthiness.

Non-blocking observation

🟡 Historical Trends tab will render flat-zero lines for the new metrics
rowToLightweightPoint in packages/app/src/components/inference/hooks/useInterpolatedTrendData.ts:25-69 doesn't build measuredAvgPower / measuredJPerOutputToken. Because ChartControls is shared between Inference and Historical Trends, a user who selects "Measured Energy" on the Trends tab hits this path: extractMetric() returns null, interpolateMetricAtInteractivity coerces with ?? 0, the spline interpolates zeros, and every GPU's trend line renders at y=0 across all dates — a misleading flat line instead of "no data". Same latent issue already exists for y_inputTputPerMw / y_outputTputPerMw, so this PR doesn't introduce the pattern, just extends its surface. Worth either (a) adding the two measured fields to the lightweight builder once real data lands, or (b) filtering the "Measured Energy" group out of the trends-tab dropdown.

The deferred items in the PR description (ChartDisplay caveat subtitle, Cypress E2E against real payload, AI-chart prompt-template update) are appropriately blocked on #1551 emitting real data, so not a blocker for this PR.


· branch feat/measured-power-metrics

The previous comment claimed avg-power has no roofline because the field
is omitted from the chart config. That's wrong: ScatterGraph.tsx:270-277
falls back to paretoFrontLowerRight when the direction is undefined, so
the chart actually draws a lower_right roofline by default.

The behavior happens to be sensible ("lower power at the same
interactivity is more efficient"), so leaving it as-is and just fixing
the comment to describe what the code actually does.
@claude
Copy link
Copy Markdown

claude Bot commented May 22, 2026

Claude finished @arygupt's task in 2m 24s —— View job


Re-review: 🟢 LGTM

Only one new commit since the prior review: 19f9ab1 docs(inference): correct y_measuredAvgPower_roofline comment.

Comment-only changepackages/app/src/components/inference/types.ts:296-299. The author correctly noticed that the previous comment's claim ("no roofline for avg power") was wrong: ScatterGraph.tsx:270-277 falls back to paretoFrontLowerRight when _roofline is undefined, so a roofline line is drawn for y_measuredAvgPower. The new comment accurately describes the actual behavior, and the chosen fallback ("lower power at same interactivity = more efficient") is defensible.

Verified the doc claim:

  • ScatterGraph.tsx:262-277dir is undefined for y_measuredAvgPower, falls through the ternary chain to paretoFrontLowerRight. ✓
  • chart-utils.ts:616computeAllRooflines does still skip metrics without a direction, so per-point roof flags won't be set. Minor inconsistency (line drawn, no on-roofline markers) but it's pre-existing for any metric where config omits _roofline, not introduced here.

No behavior change, no test impact. Previous review's non-blocking observation about the Historical Trends tab still applies as a follow-up (verified useInterpolatedTrendData.ts:25-69 still doesn't build the new measured fields; extractMetric → ?? 0 coercion at line 109 would still produce flat-zero lines).

  • Check what changed since previous review
  • Verify the Historical Trends concern from prior review
  • Inspect overlay-path behavior with new metrics
  • Post re-review summary
    · branch feat/measured-power-metrics

Wires the new "Measured Energy" chart options into the existing
↑↑↓↓ feature-gate so they can be toggled without a deploy.

Two changes:

1. Extract useFeatureGate to a shared hook at lib/use-feature-gate.ts.
   It was previously private to tab-nav.tsx; three other components
   (gpu-power, submissions, feedback) read the same localStorage key
   and DOM events ad-hoc. Now there's one source of truth, and the
   key + event names are exported as named constants so other
   consumers can migrate when convenient. No behavior change for
   tab-nav.

2. ChartControls METRIC_GROUPS entries gain an optional `gated` flag.
   Inside the component, the group list, the metric→group map, and
   the dropdown options are derived via useMemo from useFeatureGate(),
   so toggling the gate flips visibility without a page reload. The
   Measured Energy group is marked `gated: true`; the rest stay
   as-is.

Verified:
- pnpm typecheck clean
- pnpm test:unit 1921/1921
- Pre-commit hook (lint + format + typecheck) passes
@claude
Copy link
Copy Markdown

claude Bot commented May 22, 2026

Claude finished @arygupt's task in 1m 59s —— View job


Re-review: 🟢 LGTM

One new commit since the prior review: aff6054 feat(inference): gate measured-energy dropdown behind feature flag.

What changed:

  1. packages/app/src/lib/use-feature-gate.ts (new) — extracts the ↑↑↓↓ konami hook that was previously private to tab-nav.tsx. Exports FEATURE_GATE_KEY, FEATURE_GATE_UNLOCKED_EVENT, FEATURE_GATE_LOCKED_EVENT as named constants. Logic is byte-for-byte the prior in-file hook — pure refactor, no behavior change for tab-nav.
  2. tab-nav.tsx — drops the inline hook and imports the shared one.
  3. ChartControls.tsx — adds an optional gated flag to METRIC_GROUPS, marks "Measured Energy" as gated: true, and derives visibleGroups / metricGroupMap / groupedYAxisOptions via useMemo(…, [featureGateUnlocked]). The dropdown re-renders when the gate flips without a reload.

Verified:

  • Hook effects use stable deps (unlocked for the keydown effect, [] for the lock/unlock event listeners) — no stale-closure or listener-leak risk.
  • The three other consumers (gpu-power, submissions, feedback-viewer) still hard-code FEATURE_GATE_KEY and event-name string literals. Commit message confirms this is intentional ("can migrate when convenient") and the literals match the exported constants exactly. ✓
  • SSR-safe: useState(false) initial value plus localStorage read inside useEffect. Server and first-client render both produce the locked-default option list; hydration won't mismatch.
  • Default selectedYAxisMetric is y_tpPerGpu (InferenceContext.tsx:128), not a gated metric — fresh visitors are never stranded on a gated selection.
  • The previous review's Historical Trends flat-zero concern is effectively masked for normal users now that the new metrics are dropdown-hidden until the gate unlocks. The underlying useInterpolatedTrendData issue is still latent for insiders who unlock the gate, but that audience can be told to avoid the Trends tab — appropriately deferred.

No test for use-feature-gate.ts — but the logic is unchanged from the prior in-file implementation (which was also untested), so per the "only flag NEW code, not existing untested code" rule this is a refactor, not a new untested feature. Not blocking.

Minor (non-blocking) observation

🟡 If a user unlocks the gate, selects y_measuredAvgPower, then re-locks via the admin button in gpu-power/submissions/feedback-viewer, selectedYAxisMetric stays on the gated value while the dropdown filter hides it — the chart keeps rendering measured energy but the controls trigger shows a label without a matching SelectItem. Same goes for opening a shared ?i_metric=y_measuredAvgPower URL while locked. Not a crash; the user can switch to any non-gated metric. Worth a small reset-on-lock if you want belt-and-braces, but acceptable as-is for an insider-only feature.

Tasks

  • Identify changes since last review
  • Review the new feature-flag commit (aff6054)
  • Inspect use-feature-gate.ts (new file)
  • Inspect tab-nav.tsx changes
  • Verify ChartControls integration
  • Post summary
    · branch feat/measured-power-metrics

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant