Skip to content

feat(power): aggregate measured GPU power into agg result JSON#1558

Open
arygupt wants to merge 5 commits into
mainfrom
chore/measured-power-aggregation
Open

feat(power): aggregate measured GPU power into agg result JSON#1558
arygupt wants to merge 5 commits into
mainfrom
chore/measured-power-aggregation

Conversation

@arygupt
Copy link
Copy Markdown
Collaborator

@arygupt arygupt commented May 22, 2026

Summary

Adds measured per-GPU power and joules-per-output-token to every benchmark's agg_<run>.json, sourced from the existing gpu_metrics.csv that start_gpu_monitor already produces. The InferenceX-app dashboard consumes these via a companion PR (SemiAnalysisAI/InferenceX-app#375) to render new chart options alongside the existing TDP-derived jTotal/jOutput/jInput.

Two new fields land in the agg JSON:

  • avg_power_w — mean per-GPU draw during the load window
  • joules_per_output_tokenavg_power_w * num_gpus * duration / total_output_tokens

How it works

  1. benchmark_serving.py now records benchmark_start_time_unix and benchmark_end_time_unix (wall-clock epoch) alongside the existing duration. The aggregator needs these to know which slice of the long-running monitor CSV is the actual load window — without them, naive averaging would mix in ~60s of server warmup (~120W) and the optional eval phase (~300W), biasing a 720W per-GPU draw down to roughly 440W.

  2. utils/aggregate_power.py (new, stdlib only, ~210 lines) reads the CSV, detects vendor schema by header regex (handles nvidia-smi power.draw [W] and amd-smi socket_power), filters samples to the bench window, averages per-GPU power per timestamp then over time, and atomically patches the agg JSON. Best-effort throughout — missing/empty/malformed CSV is logged to stderr and skipped without ever failing the run.

  3. utils/process_result.py calls the aggregator right after writing the agg JSON. Path resolution checks $GPU_METRICS_CSV./gpu_metrics.csv/workspace/gpu_metrics.csv. Wrapped in try/except so telemetry never blocks the upload.

  4. benchmarks/benchmark_lib.sh exports GPU_METRICS_CSV so per-script CSV-path overrides cross the shell→Python boundary.

No workflow YAML change, no schema migration anywhere downstream — the InferenceX-app ETL's benchmark-mapper.ts is permissive about numeric keys in the agg JSON.

Sweep validation

The third commit (66344e74) adds a perf-changelog.yaml entry pointing at qwen3.5-fp8-h200-sglang so the sweep workflow will fire when sweep-enabled is added. The sweep will produce the first real agg_<run>.json with the new fields populated — meant to be inspected before merge.

Verification

End-to-end smoke test on a synthesized 1680-row CSV (8 GPUs × 210s spanning warmup at 120W, bench at 720W, eval at 300W):

[aggregate_power] avg_power_w=715.69 (per GPU, n=8)
                  joules_per_output_token=8.3869
                  duration=120.0s output_tokens=81920

Cross-check: 720W × 8 GPUs / 682 tok/s ≈ 8.45 J/tok. ✓ Window isolation correctly excluded the warmup + eval samples (naive average would have given ~440W).

Test plan

  • 26 unit tests covering NVIDIA + AMD CSV formats, multi-GPU per-sample aggregation, window filtering, malformed-row resilience, missing files, atomic JSON patching, divide-by-zero on failed runs
  • 3 subprocess integration tests through process_result.py: stages a CSV + bench JSON + env vars, asserts agg_<run>.json gets patched
  • All 22 existing test_process_result.py tests still pass
  • Sweep run with sweep-enabled label — verify avg_power_w and joules_per_output_token appear in the artifact with plausible values
  • After merge: InferenceX-app ingest picks up the new keys (no METRIC_KEYS warning in ETL logs)

arygupt and others added 2 commits May 22, 2026 14:06
Adds two new fields to agg_<run>.json so the InferenceX-app dashboard
can chart measured-energy metrics alongside the existing TDP-derived
ones:

  - avg_power_w               (mean per-GPU draw during the load window)
  - joules_per_output_token   (avg_power_w * num_gpus * duration / total_output_tokens)

How it works:

1. benchmark_serving.py now records benchmark_start_time_unix and
   benchmark_end_time_unix alongside the existing duration field so the
   aggregator knows exactly which slice of the long-running monitor CSV
   to read (the bracket-the-whole-job monitor includes server warmup
   and the optional eval phase, which would otherwise bias the average).

2. aggregate_power.py reads /workspace/gpu_metrics.csv (path overridable
   via GPU_METRICS_CSV, which benchmark_lib.sh now exports), detects the
   vendor schema by header regex (handles nvidia-smi "power.draw [W]"
   and amd-smi socket_power formats), filters samples to the bench
   window, and atomically patches the agg JSON. Best-effort: missing /
   empty / malformed CSV is logged to stderr and skipped without
   failing the run.

3. process_result.py invokes the aggregator right after writing the
   agg JSON — no workflow YAML change needed.

The InferenceX-app ETL (benchmark-mapper.ts) auto-captures unknown
numeric metrics into the metrics JSONB column, so no schema migration
or downstream change is required for the data to land in the DB. A
follow-up PR on InferenceX-app adds the two Y-axis options to the
inference scatter chart.

26 unit tests covering NVIDIA + AMD CSV shapes, window filtering,
multi-GPU per-sample aggregation, malformed-row resilience, missing
files, division-by-zero guards, and atomic JSON patching.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three new tests in TestPowerAggregationIntegration:

  - test_agg_json_gets_patched_with_power_and_joules: full pipeline.
    Stages a 1Hz nvidia-smi CSV with warmup/bench/eval phases, runs
    process_result.py as a subprocess with GPU_METRICS_CSV set, and
    verifies the agg JSON gets patched with avg_power_w (600W) and
    joules_per_output_token (9.6 J/tok = 600W * 8 GPUs * 60s / 30k tok).
    Warmup (100W) and eval (200W) samples must be excluded by the
    timestamp window — would otherwise bias the result downward.

  - test_missing_csv_does_not_break_process_result: production case for
    runs that ship without monitoring. process_result.py succeeds and
    writes the agg JSON sans power fields.

  - test_missing_bench_timestamps_does_not_patch: legacy bench JSON
    without benchmark_start_time_unix gracefully skips aggregation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment on lines +1 to +20
"""Tests for aggregate_power.py.

Covers:
- NVIDIA CSV (nvidia-smi --query-gpu format with "X W" power cells)
- AMD CSV (amd-smi --csv with ISO/epoch timestamps and bare numeric power)
- Window filtering (samples outside [start, end] are excluded)
- Multi-GPU per-sample aggregation (sum across GPUs at each timestamp,
then mean over samples — yields per-GPU mean)
- Missing / empty / malformed CSV: returns None, no exception
- End-to-end run(): patches agg JSON with avg_power_w + joules_per_output_token
- Missing bench window keys: skips gracefully without patching
"""
from __future__ import annotations

import json
import sys
from datetime import datetime
from pathlib import Path

import pytest
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The 26 new unit tests in utils/test_aggregate_power.py will not run in CI. The .github/workflows/test-process-result.yml workflow hardcodes pytest test_process_result.py -v (line 36), which never collects the new test file — even though this PR happens to trigger the workflow because process_result.py is in the paths filter. Additionally, the paths filter (lines 5-7) doesn't include utils/aggregate_power.py or utils/test_aggregate_power.py, so future PRs touching only those files won't trigger the workflow at all. Fix: add both new files to the paths list and change the pytest invocation to pytest test_process_result.py test_aggregate_power.py -v (or simply pytest -v).

Extended reasoning...

What the bug is

The PR adds two new files — utils/aggregate_power.py and utils/test_aggregate_power.py — and the description's Test plan explicitly claims "[x] 26 unit tests covering NVIDIA + AMD CSV formats, multi-GPU per-sample aggregation, window filtering, malformed-row resilience, missing files, atomic JSON patching, divide-by-zero on failed runs". None of those 26 tests will actually run in CI as the workflow is configured.

The code path that triggers it

.github/workflows/test-process-result.yml has two compounding issues:

  1. paths filter is too narrow (lines 5-7):

    paths:
      - 'utils/process_result.py'
      - 'utils/test_process_result.py'

    Neither utils/aggregate_power.py nor utils/test_aggregate_power.py is listed. A future PR that touches only those files would never start the workflow.

  2. Hardcoded pytest invocation (line 36):

    run: |
      cd utils
      pytest test_process_result.py -v

    Even when the workflow does fire, it only collects test_process_result.py. test_aggregate_power.py is never named on the pytest command line and pytest's default discovery is bypassed because a positional file argument is passed.

Why existing code doesn't prevent it

Grepping all of .github/workflows/ confirms no other workflow runs pytest against test_aggregate_power.py — the only pytest invocations in workflows target test_generate_sweep_configs.py, test_validation.py, and test_process_result.py. There is no umbrella test job that picks up new test files automatically.

Step-by-step proof for THIS PR

  1. PR modifies utils/process_result.py ✓ — paths filter matches, workflow triggers.
  2. Workflow runs cd utils && pytest test_process_result.py -v.
  3. pytest collects tests only from test_process_result.py: the original 22 tests + the 3 new TestPowerAggregationIntegration subprocess tests = 25 tests total.
  4. test_aggregate_power.py is in the same directory but pytest is given an explicit file argument, so it's not discovered.
  5. Net effect: the 26 unit tests covering the parsing edge cases (ISO/epoch timestamps, ms-vs-s heuristic at the 1e12 boundary, [N/A] cells, malformed rows, power.limit vs power.draw column disambiguation, empty/missing CSV, atomic temp-file rename, divide-by-zero on total_output_tokens=0) silently do not execute on review.

Only the 3 subprocess integration tests in test_process_result.py run; those exercise the wiring but not the parser edge cases that the unit tests cover.

Impact

  • The PR description's "[x] 26 unit tests covering ..." checkbox is misleading without a workflow update.
  • Future regressions in aggregate_power.py (e.g. a vendor schema tweak that breaks the regex column detection, a timestamp-format change, a divide-by-zero reintroduction) won't be caught by CI on PRs that only touch aggregate_power.py.
  • Even on this PR, parser correctness rests entirely on the 3 subprocess tests, which only test the happy path + two skip cases.

How to fix

One-line workflow patch, in .github/workflows/test-process-result.yml:

on:
  pull_request:
    paths:
      - 'utils/process_result.py'
      - 'utils/test_process_result.py'
      - 'utils/aggregate_power.py'
      - 'utils/test_aggregate_power.py'

# ...

      - name: Run pytest
        run: |
          cd utils
          pytest test_process_result.py test_aggregate_power.py -v

Or simply pytest -v from the utils directory if you want new test files in that directory to be picked up automatically going forward.

Comment thread perf-changelog.yaml Outdated
Comment on lines +3086 to +3091

- config-keys:
- qwen3.5-fp8-h200-sglang
description:
- "Smoke run validating measured-power aggregation pipeline. No config change; entry exists to trigger a sweep that produces the first agg_<run>.json with avg_power_w + joules_per_output_token populated by aggregate_power.py."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1551
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new perf-changelog.yaml entry's pr-link points to #1551, but the PR description states this PR (#1558) supersedes #1551 and #1551 will be closed without merge. Update the link to /pull/1558 so the changelog navigates to the PR with the actually-merged commit history.

Extended reasoning...

What the bug is

perf-changelog.yaml line 3091 reads:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1551

But the PR description on #1558 explicitly states:

This PR supersedes #1551 (which was opened from a fork before I had direct push access). Same commits, same diff — just routed through an upstream branch so the run-sweep.yml workflow can fire.

So #1551 will be closed without merge — anyone (or any dashboard / sweep report) that follows the pr-link from this changelog entry lands on a closed fork PR with no merged commit history, rather than on the PR (#1558) that actually produced the sweep data.

Why existing convention catches this

Every neighboring entry in this file links to the PR that was actually merged for the change:

The new entry is the only one in the file that links to a PR that is known not to be merged at the time of writing.

Step-by-step proof

  1. Open perf-changelog.yaml at line 3086.
  2. Read the new entry — pr-link is /pull/1551.
  3. Open the PR feat(power): aggregate measured GPU power into agg result JSON #1558 description in GitHub.
  4. Read the first paragraph — '"This PR supersedes feat(power): aggregate measured GPU power into agg result JSON #1551 ... routed through an upstream branch so the run-sweep.yml workflow can fire (fork-PRs get cancelled by the sweep workflow security gate)."'
  5. Observe that feat(power): aggregate measured GPU power into agg result JSON #1551 will be closed without merge.
  6. Result: clicking the pr-link from the changelog entry resolves to a closed fork PR with no merged code, not to the PR that produced the sweep run.

Impact

Cosmetic / traceability only — no functional impact on the sweep, the aggregator, or the agg JSON. The InferenceX-app ETL does not consult pr-link. But the changelog is the canonical pointer from sweep reports back to the originating PR, so a broken link here adds friction for anyone later trying to reconstruct what produced a given avg_power_w / joules_per_output_token data point.

Fix

One-character change at perf-changelog.yaml:3091:

-  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1551
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1558

Comment thread utils/aggregate_power.py
Comment on lines +141 to +168
for row in reader:
ts_raw = (row.get(timestamp_col) or "").strip()
pw_raw = (row.get(power_col) or "").strip()
ts = _parse_timestamp(ts_raw)
pw = _parse_power(pw_raw)
if ts is None or pw is None:
continue
if ts < start_unix or ts > end_unix:
continue
# Bucket by sample timestamp (rounded to ms to absorb sub-ms drift).
bucket = round(ts, 3)
per_sample_total[bucket] = per_sample_total.get(bucket, 0.0) + pw
gpu_id = (row.get(gpu_col) or "0").strip() if gpu_col else "0"
per_sample_gpus.setdefault(bucket, set()).add(gpu_id)
gpu_keys.add(gpu_id)
except (OSError, csv.Error):
return None

if not per_sample_total:
return None

# Number of distinct GPUs seen across the window.
num_gpus = max(len(gpu_keys), 1)
# Per-sample mean power per GPU = sum across GPUs at that timestamp / GPUs seen at that timestamp.
per_sample_mean_per_gpu = [
total / max(len(per_sample_gpus[ts]), 1) for ts, total in per_sample_total.items()
]
return mean(per_sample_mean_per_gpu), num_gpus
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Latent robustness bug in aggregate_power (utils/aggregate_power.py:141-168): when _detect_columns can't find a GPU index column (strict regex ^(index|gpu|gpu_id|gpu_index|card|device)$), every row collapses to gpu_id='0', causing avg_power_w to report system-total power (N × W) instead of per-GPU mean. Today's nvidia-smi (index) and amd-smi (gpu) headers match the regex so the in-tree pipeline is safe, but any future amd-smi schema variant (e.g. device_id, GPU ID) would silently break the headline new metric by a factor of N. Suggested fix: when gpu_col is None, infer num_gpus from the distinct row count per timestamp.

Extended reasoning...

What the bug is

In utils/aggregate_power.py, _detect_columns matches a GPU-index column using a strict regex anchored at both ends: r"^(index|gpu|gpu_id|gpu_index|card|device)$". When no header matches (gpu_col is None), the row-processing loop at line 153 falls back to a literal '0':

gpu_id = (row.get(gpu_col) or "0").strip() if gpu_col else "0"

Every row at every timestamp now gets bucketed under the single key '0'. Three downstream values are affected:

  1. per_sample_total[bucket] accumulates power across all N GPUs at that timestamp (e.g. 8 × 500 W = 4000)
  2. per_sample_gpus[bucket] = {'0'} — a single-element set, since every row contributes the same '0'
  3. gpu_keys = {'0'}num_gpus = max(1, 1) = 1

Then at line 166:

per_sample_mean_per_gpu = [total / max(len(per_sample_gpus[ts]), 1) for ts, total in per_sample_total.items()]

becomes total / 1 — the SUM, not the per-GPU mean. aggregate_power returns (N*W, 1) where it should return (W, N).

Step-by-step proof

CSV with 8 GPUs at 500 W, headers timestamp,device_id,power.draw [W] (device_id does NOT match ^(index|gpu|gpu_id|gpu_index|card|device)$ — note the trailing _id):

step value
_detect_columns returns ('timestamp', 'power.draw [W]', None)
Row 1 (ts=T, gpu 0, 500W) → gpu_id '0'
Row 2 (ts=T, gpu 1, 500W) → gpu_id '0'
…rows 3-8 all '0'
per_sample_total[T] 500+500+500+500+500+500+500+500 = 4000
per_sample_gpus[T] {'0'}
num_gpus max(len({'0'}), 1) = 1
per_sample_mean_per_gpu for T 4000 / 1 = 4000
Returned tuple (4000.0, 1)

run() then computes joules_per_output_token = 4000 × 1 × duration / tokens, which coincidentally equals the correct 500 × 8 × duration / tokens because num_gpus=1 cancels the inflation. But avg_power_w is written to the agg JSON as 4000 W — a value impossible for a single H100/H200 (TDP ~700 W) that will mislead the dashboard.

Why existing code doesn't prevent it

The regex ^(index|gpu|gpu_id|gpu_index|card|device)$ is anchored at both ends, so common alternatives like device_id, gpu_serial, GPU ID (with space), or slot all fail to match. The power-column regex by contrast is permissive (just r"power" with an excludelist) — the asymmetry means the function tolerates power-column name variation but not GPU-column variation.

Impact

In this PR's pipeline: zero. benchmarks/benchmark_lib.sh invokes nvidia-smi --query-gpu=timestamp,index,... (index matches) or amd-smi metric --csv (gpu matches), so the bug is latent. The submitter and all verifiers acknowledge this.

Outside this PR: the docstring of _parse_timestamp notes "amd-smi varies by version" — a future amd-smi release that renames gpu to gpu_id would still match, but a release that uses device_id or GPU ID would silently produce a factor-N error in the headline avg_power_w field. This is the worst kind of regression: the number stays plausible-looking (within an order of magnitude) but is systematically wrong.

Addressing the refutation

A refuting verifier correctly notes that (a) the CSV producer is in-tree, (b) the joules-per-token field stays correct due to the cancellation, and (c) this is defensive coding rather than an actionable production bug. All true today. The reason to flag it anyway is that avg_power_w is the standalone headline new metric introduced by this PR — it's surfaced directly to the dashboard, not just used as an intermediate. A silent factor-N error in a charted value is exactly the failure mode regex-based schema detection is supposed to avoid, and the fix is a one-liner.

Suggested fix

When gpu_col is None, count distinct rows per timestamp to recover num_gpus:

if gpu_col is None:
    # Each row at the same timestamp must be a distinct GPU.
    per_sample_gpus.setdefault(bucket, set()).add(str(per_sample_total.get(bucket, 0)))  # unique key per row

Or more cleanly, track row counts per bucket separately and use total / row_count as the per-GPU mean for that timestamp. Either approach keeps the per-GPU mean honest while the existing gpu_col-based path remains unchanged.

Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires
when the sweep-enabled label is added to PR #1551. The sweep will produce
the first agg_<run>.json containing avg_power_w and joules_per_output_token,
validating the aggregator end-to-end on real GPU hardware.

Cheap single-node H200 config picked to minimize runner-pool contention.
@arygupt arygupt force-pushed the chore/measured-power-aggregation branch from 8b84114 to 7c29bde Compare May 22, 2026 21:09
…bsent

Addresses review feedback on PR #1558. The original _detect_columns used a
strict-anchored regex ^(index|gpu|gpu_id|gpu_index|card|device)$ for the
GPU-index column while the power column uses a permissive r"power" match.
That asymmetry meant a future SMI schema variant with header device_id,
gpu_serial, or "GPU ID" would silently collapse every row to gpu_id="0",
yielding system-total power instead of per-GPU mean.

In-tree pipelines (nvidia-smi index, amd-smi gpu) match the regex and are
unaffected — this is a latent bug, not a current-production one. But
avg_power_w is the standalone headline new metric this PR adds; a future
4000W reading on the dashboard is the worst kind of regression (still
plausible-looking, off by 8x).

Fix:
- Maintain per_sample_row_count independently of GPU-column detection.
- When gpu_col is present, divisor stays len(per_sample_gpus[ts]) — same
  as before, behavior unchanged for the existing pipeline.
- When gpu_col is absent, divisor is per_sample_row_count[ts] and
  num_gpus is the modal row count per sample. Both assume one row per
  GPU per sample, which is what every SMI tool we've encountered emits.

New test (test_aggregate_power_no_gpu_column_infers_from_row_count)
exercises a CSV header with device_id (regex miss) and asserts avg_power
is the per-GPU mean (500W), not the system sum (2000W).

Verified:
- 27 unit tests pass (was 26)
- 25 process_result tests pass (no change)
@arygupt
Copy link
Copy Markdown
Collaborator Author

arygupt commented May 22, 2026

Point 1 — perf-changelog pr-link: false positive. The link points to /pull/1558 (this PR), not /pull/1551. The rebase that resolved the merge conflict updated it; the review likely ran against a pre-rebase snapshot. You can verify in the current diff:

- config-keys:
    - qwen3.5-fp8-h200-sglang
  description:
    - "Smoke run validating measured-power aggregation pipeline. ..."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1558  ← correct

Point 2 — GPU-column detection asymmetry: fixed in b33f2237. The analysis is correct: _GPU_INDEX_COL_RE was strict-anchored (^(index|gpu|gpu_id|gpu_index|card|device)$) while the power column uses a permissive r"power" match. A future SMI schema variant with device_id, gpu_serial, or "GPU ID" would silently collapse every row to gpu_id="0", returning system-total power as avg_power_w (factor-N inflation). The joules-per-token field would have stayed correct by coincidence (the num_gpus = 1 bug cancels the inflation in that calculation), but avg_power_w would be wrong on the dashboard.

In-tree pipelines (nvidia-smi index, amd-smi gpu) match the regex and were unaffected — this was a latent bug, not a current-production one. But fixing it costs ~20 lines and the headline new metric of the PR is avg_power_w, so defense in depth is worth it.

Fix tracks row count per timestamp independently of GPU-column detection, then either uses len(per_sample_gpus[ts]) (when gpu_col present, behavior unchanged) or per_sample_row_count[ts] (when absent, assuming one row per GPU per sample). New test test_aggregate_power_no_gpu_column_infers_from_row_count exercises a synthetic device_id header and asserts per-GPU mean (500W) is returned, not system sum (2000W).

@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant