feat(power): aggregate measured GPU power into agg result JSON#1558
feat(power): aggregate measured GPU power into agg result JSON#1558arygupt wants to merge 5 commits into
Conversation
Adds two new fields to agg_<run>.json so the InferenceX-app dashboard can chart measured-energy metrics alongside the existing TDP-derived ones: - avg_power_w (mean per-GPU draw during the load window) - joules_per_output_token (avg_power_w * num_gpus * duration / total_output_tokens) How it works: 1. benchmark_serving.py now records benchmark_start_time_unix and benchmark_end_time_unix alongside the existing duration field so the aggregator knows exactly which slice of the long-running monitor CSV to read (the bracket-the-whole-job monitor includes server warmup and the optional eval phase, which would otherwise bias the average). 2. aggregate_power.py reads /workspace/gpu_metrics.csv (path overridable via GPU_METRICS_CSV, which benchmark_lib.sh now exports), detects the vendor schema by header regex (handles nvidia-smi "power.draw [W]" and amd-smi socket_power formats), filters samples to the bench window, and atomically patches the agg JSON. Best-effort: missing / empty / malformed CSV is logged to stderr and skipped without failing the run. 3. process_result.py invokes the aggregator right after writing the agg JSON — no workflow YAML change needed. The InferenceX-app ETL (benchmark-mapper.ts) auto-captures unknown numeric metrics into the metrics JSONB column, so no schema migration or downstream change is required for the data to land in the DB. A follow-up PR on InferenceX-app adds the two Y-axis options to the inference scatter chart. 26 unit tests covering NVIDIA + AMD CSV shapes, window filtering, multi-GPU per-sample aggregation, malformed-row resilience, missing files, division-by-zero guards, and atomic JSON patching. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three new tests in TestPowerAggregationIntegration:
- test_agg_json_gets_patched_with_power_and_joules: full pipeline.
Stages a 1Hz nvidia-smi CSV with warmup/bench/eval phases, runs
process_result.py as a subprocess with GPU_METRICS_CSV set, and
verifies the agg JSON gets patched with avg_power_w (600W) and
joules_per_output_token (9.6 J/tok = 600W * 8 GPUs * 60s / 30k tok).
Warmup (100W) and eval (200W) samples must be excluded by the
timestamp window — would otherwise bias the result downward.
- test_missing_csv_does_not_break_process_result: production case for
runs that ship without monitoring. process_result.py succeeds and
writes the agg JSON sans power fields.
- test_missing_bench_timestamps_does_not_patch: legacy bench JSON
without benchmark_start_time_unix gracefully skips aggregation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| """Tests for aggregate_power.py. | ||
|
|
||
| Covers: | ||
| - NVIDIA CSV (nvidia-smi --query-gpu format with "X W" power cells) | ||
| - AMD CSV (amd-smi --csv with ISO/epoch timestamps and bare numeric power) | ||
| - Window filtering (samples outside [start, end] are excluded) | ||
| - Multi-GPU per-sample aggregation (sum across GPUs at each timestamp, | ||
| then mean over samples — yields per-GPU mean) | ||
| - Missing / empty / malformed CSV: returns None, no exception | ||
| - End-to-end run(): patches agg JSON with avg_power_w + joules_per_output_token | ||
| - Missing bench window keys: skips gracefully without patching | ||
| """ | ||
| from __future__ import annotations | ||
|
|
||
| import json | ||
| import sys | ||
| from datetime import datetime | ||
| from pathlib import Path | ||
|
|
||
| import pytest |
There was a problem hiding this comment.
🔴 The 26 new unit tests in utils/test_aggregate_power.py will not run in CI. The .github/workflows/test-process-result.yml workflow hardcodes pytest test_process_result.py -v (line 36), which never collects the new test file — even though this PR happens to trigger the workflow because process_result.py is in the paths filter. Additionally, the paths filter (lines 5-7) doesn't include utils/aggregate_power.py or utils/test_aggregate_power.py, so future PRs touching only those files won't trigger the workflow at all. Fix: add both new files to the paths list and change the pytest invocation to pytest test_process_result.py test_aggregate_power.py -v (or simply pytest -v).
Extended reasoning...
What the bug is
The PR adds two new files — utils/aggregate_power.py and utils/test_aggregate_power.py — and the description's Test plan explicitly claims "[x] 26 unit tests covering NVIDIA + AMD CSV formats, multi-GPU per-sample aggregation, window filtering, malformed-row resilience, missing files, atomic JSON patching, divide-by-zero on failed runs". None of those 26 tests will actually run in CI as the workflow is configured.
The code path that triggers it
.github/workflows/test-process-result.yml has two compounding issues:
-
pathsfilter is too narrow (lines 5-7):paths: - 'utils/process_result.py' - 'utils/test_process_result.py'
Neither
utils/aggregate_power.pynorutils/test_aggregate_power.pyis listed. A future PR that touches only those files would never start the workflow. -
Hardcoded pytest invocation (line 36):
run: | cd utils pytest test_process_result.py -v
Even when the workflow does fire, it only collects
test_process_result.py.test_aggregate_power.pyis never named on the pytest command line and pytest's default discovery is bypassed because a positional file argument is passed.
Why existing code doesn't prevent it
Grepping all of .github/workflows/ confirms no other workflow runs pytest against test_aggregate_power.py — the only pytest invocations in workflows target test_generate_sweep_configs.py, test_validation.py, and test_process_result.py. There is no umbrella test job that picks up new test files automatically.
Step-by-step proof for THIS PR
- PR modifies
utils/process_result.py✓ —pathsfilter matches, workflow triggers. - Workflow runs
cd utils && pytest test_process_result.py -v. - pytest collects tests only from
test_process_result.py: the original 22 tests + the 3 newTestPowerAggregationIntegrationsubprocess tests = 25 tests total. test_aggregate_power.pyis in the same directory but pytest is given an explicit file argument, so it's not discovered.- Net effect: the 26 unit tests covering the parsing edge cases (ISO/epoch timestamps, ms-vs-s heuristic at the 1e12 boundary, [N/A] cells, malformed rows,
power.limitvspower.drawcolumn disambiguation, empty/missing CSV, atomic temp-file rename, divide-by-zero ontotal_output_tokens=0) silently do not execute on review.
Only the 3 subprocess integration tests in test_process_result.py run; those exercise the wiring but not the parser edge cases that the unit tests cover.
Impact
- The PR description's "[x] 26 unit tests covering ..." checkbox is misleading without a workflow update.
- Future regressions in
aggregate_power.py(e.g. a vendor schema tweak that breaks the regex column detection, a timestamp-format change, a divide-by-zero reintroduction) won't be caught by CI on PRs that only touchaggregate_power.py. - Even on this PR, parser correctness rests entirely on the 3 subprocess tests, which only test the happy path + two skip cases.
How to fix
One-line workflow patch, in .github/workflows/test-process-result.yml:
on:
pull_request:
paths:
- 'utils/process_result.py'
- 'utils/test_process_result.py'
- 'utils/aggregate_power.py'
- 'utils/test_aggregate_power.py'
# ...
- name: Run pytest
run: |
cd utils
pytest test_process_result.py test_aggregate_power.py -vOr simply pytest -v from the utils directory if you want new test files in that directory to be picked up automatically going forward.
|
|
||
| - config-keys: | ||
| - qwen3.5-fp8-h200-sglang | ||
| description: | ||
| - "Smoke run validating measured-power aggregation pipeline. No config change; entry exists to trigger a sweep that produces the first agg_<run>.json with avg_power_w + joules_per_output_token populated by aggregate_power.py." | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1551 |
There was a problem hiding this comment.
🟡 The new perf-changelog.yaml entry's pr-link points to #1551, but the PR description states this PR (#1558) supersedes #1551 and #1551 will be closed without merge. Update the link to /pull/1558 so the changelog navigates to the PR with the actually-merged commit history.
Extended reasoning...
What the bug is
perf-changelog.yaml line 3091 reads:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1551But the PR description on #1558 explicitly states:
This PR supersedes #1551 (which was opened from a fork before I had direct push access). Same commits, same diff — just routed through an upstream branch so the
run-sweep.ymlworkflow can fire.
So #1551 will be closed without merge — anyone (or any dashboard / sweep report) that follows the pr-link from this changelog entry lands on a closed fork PR with no merged commit history, rather than on the PR (#1558) that actually produced the sweep data.
Why existing convention catches this
Every neighboring entry in this file links to the PR that was actually merged for the change:
- Line 3058 — pr-link to the merged PR for that entry
- Line 3066 — same
- Line 3072 — same ([AMD][ROCM] dsv4-fp4-mi355x-vllm, Bump vLLM ROCm image to (nightly-4f940896) #1546)
- Line 3078 — same ([AMD][MI35X] 0521 DSV4 #1548)
- Line 3085 — [Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update qwen3.5-fp8-b300-sglang (+mtp) SGLang image to v0.5.12-cu130 #1451 (visible in the diff right above the new entry)
The new entry is the only one in the file that links to a PR that is known not to be merged at the time of writing.
Step-by-step proof
- Open
perf-changelog.yamlat line 3086. - Read the new entry — pr-link is
/pull/1551. - Open the PR feat(power): aggregate measured GPU power into agg result JSON #1558 description in GitHub.
- Read the first paragraph — '"This PR supersedes feat(power): aggregate measured GPU power into agg result JSON #1551 ... routed through an upstream branch so the
run-sweep.ymlworkflow can fire (fork-PRs get cancelled by the sweep workflow security gate)."' - Observe that feat(power): aggregate measured GPU power into agg result JSON #1551 will be closed without merge.
- Result: clicking the pr-link from the changelog entry resolves to a closed fork PR with no merged code, not to the PR that produced the sweep run.
Impact
Cosmetic / traceability only — no functional impact on the sweep, the aggregator, or the agg JSON. The InferenceX-app ETL does not consult pr-link. But the changelog is the canonical pointer from sweep reports back to the originating PR, so a broken link here adds friction for anyone later trying to reconstruct what produced a given avg_power_w / joules_per_output_token data point.
Fix
One-character change at perf-changelog.yaml:3091:
- pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1551
+ pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1558| for row in reader: | ||
| ts_raw = (row.get(timestamp_col) or "").strip() | ||
| pw_raw = (row.get(power_col) or "").strip() | ||
| ts = _parse_timestamp(ts_raw) | ||
| pw = _parse_power(pw_raw) | ||
| if ts is None or pw is None: | ||
| continue | ||
| if ts < start_unix or ts > end_unix: | ||
| continue | ||
| # Bucket by sample timestamp (rounded to ms to absorb sub-ms drift). | ||
| bucket = round(ts, 3) | ||
| per_sample_total[bucket] = per_sample_total.get(bucket, 0.0) + pw | ||
| gpu_id = (row.get(gpu_col) or "0").strip() if gpu_col else "0" | ||
| per_sample_gpus.setdefault(bucket, set()).add(gpu_id) | ||
| gpu_keys.add(gpu_id) | ||
| except (OSError, csv.Error): | ||
| return None | ||
|
|
||
| if not per_sample_total: | ||
| return None | ||
|
|
||
| # Number of distinct GPUs seen across the window. | ||
| num_gpus = max(len(gpu_keys), 1) | ||
| # Per-sample mean power per GPU = sum across GPUs at that timestamp / GPUs seen at that timestamp. | ||
| per_sample_mean_per_gpu = [ | ||
| total / max(len(per_sample_gpus[ts]), 1) for ts, total in per_sample_total.items() | ||
| ] | ||
| return mean(per_sample_mean_per_gpu), num_gpus |
There was a problem hiding this comment.
🟡 Latent robustness bug in aggregate_power (utils/aggregate_power.py:141-168): when _detect_columns can't find a GPU index column (strict regex ^(index|gpu|gpu_id|gpu_index|card|device)$), every row collapses to gpu_id='0', causing avg_power_w to report system-total power (N × W) instead of per-GPU mean. Today's nvidia-smi (index) and amd-smi (gpu) headers match the regex so the in-tree pipeline is safe, but any future amd-smi schema variant (e.g. device_id, GPU ID) would silently break the headline new metric by a factor of N. Suggested fix: when gpu_col is None, infer num_gpus from the distinct row count per timestamp.
Extended reasoning...
What the bug is
In utils/aggregate_power.py, _detect_columns matches a GPU-index column using a strict regex anchored at both ends: r"^(index|gpu|gpu_id|gpu_index|card|device)$". When no header matches (gpu_col is None), the row-processing loop at line 153 falls back to a literal '0':
gpu_id = (row.get(gpu_col) or "0").strip() if gpu_col else "0"Every row at every timestamp now gets bucketed under the single key '0'. Three downstream values are affected:
per_sample_total[bucket]accumulates power across all N GPUs at that timestamp (e.g. 8 × 500 W = 4000)per_sample_gpus[bucket] = {'0'}— a single-element set, since every row contributes the same'0'gpu_keys = {'0'}→num_gpus = max(1, 1) = 1
Then at line 166:
per_sample_mean_per_gpu = [total / max(len(per_sample_gpus[ts]), 1) for ts, total in per_sample_total.items()]becomes total / 1 — the SUM, not the per-GPU mean. aggregate_power returns (N*W, 1) where it should return (W, N).
Step-by-step proof
CSV with 8 GPUs at 500 W, headers timestamp,device_id,power.draw [W] (device_id does NOT match ^(index|gpu|gpu_id|gpu_index|card|device)$ — note the trailing _id):
| step | value |
|---|---|
_detect_columns returns |
('timestamp', 'power.draw [W]', None) |
Row 1 (ts=T, gpu 0, 500W) → gpu_id |
'0' |
Row 2 (ts=T, gpu 1, 500W) → gpu_id |
'0' |
| …rows 3-8 | all '0' |
per_sample_total[T] |
500+500+500+500+500+500+500+500 = 4000 |
per_sample_gpus[T] |
{'0'} |
num_gpus |
max(len({'0'}), 1) = 1 |
per_sample_mean_per_gpu for T |
4000 / 1 = 4000 |
| Returned tuple | (4000.0, 1) |
run() then computes joules_per_output_token = 4000 × 1 × duration / tokens, which coincidentally equals the correct 500 × 8 × duration / tokens because num_gpus=1 cancels the inflation. But avg_power_w is written to the agg JSON as 4000 W — a value impossible for a single H100/H200 (TDP ~700 W) that will mislead the dashboard.
Why existing code doesn't prevent it
The regex ^(index|gpu|gpu_id|gpu_index|card|device)$ is anchored at both ends, so common alternatives like device_id, gpu_serial, GPU ID (with space), or slot all fail to match. The power-column regex by contrast is permissive (just r"power" with an excludelist) — the asymmetry means the function tolerates power-column name variation but not GPU-column variation.
Impact
In this PR's pipeline: zero. benchmarks/benchmark_lib.sh invokes nvidia-smi --query-gpu=timestamp,index,... (index matches) or amd-smi metric --csv (gpu matches), so the bug is latent. The submitter and all verifiers acknowledge this.
Outside this PR: the docstring of _parse_timestamp notes "amd-smi varies by version" — a future amd-smi release that renames gpu to gpu_id would still match, but a release that uses device_id or GPU ID would silently produce a factor-N error in the headline avg_power_w field. This is the worst kind of regression: the number stays plausible-looking (within an order of magnitude) but is systematically wrong.
Addressing the refutation
A refuting verifier correctly notes that (a) the CSV producer is in-tree, (b) the joules-per-token field stays correct due to the cancellation, and (c) this is defensive coding rather than an actionable production bug. All true today. The reason to flag it anyway is that avg_power_w is the standalone headline new metric introduced by this PR — it's surfaced directly to the dashboard, not just used as an intermediate. A silent factor-N error in a charted value is exactly the failure mode regex-based schema detection is supposed to avoid, and the fix is a one-liner.
Suggested fix
When gpu_col is None, count distinct rows per timestamp to recover num_gpus:
if gpu_col is None:
# Each row at the same timestamp must be a distinct GPU.
per_sample_gpus.setdefault(bucket, set()).add(str(per_sample_total.get(bucket, 0))) # unique key per rowOr more cleanly, track row counts per bucket separately and use total / row_count as the per-GPU mean for that timestamp. Either approach keeps the per-GPU mean honest while the existing gpu_col-based path remains unchanged.
Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR #1551. The sweep will produce the first agg_<run>.json containing avg_power_w and joules_per_output_token, validating the aggregator end-to-end on real GPU hardware. Cheap single-node H200 config picked to minimize runner-pool contention.
8b84114 to
7c29bde
Compare
…bsent Addresses review feedback on PR #1558. The original _detect_columns used a strict-anchored regex ^(index|gpu|gpu_id|gpu_index|card|device)$ for the GPU-index column while the power column uses a permissive r"power" match. That asymmetry meant a future SMI schema variant with header device_id, gpu_serial, or "GPU ID" would silently collapse every row to gpu_id="0", yielding system-total power instead of per-GPU mean. In-tree pipelines (nvidia-smi index, amd-smi gpu) match the regex and are unaffected — this is a latent bug, not a current-production one. But avg_power_w is the standalone headline new metric this PR adds; a future 4000W reading on the dashboard is the worst kind of regression (still plausible-looking, off by 8x). Fix: - Maintain per_sample_row_count independently of GPU-column detection. - When gpu_col is present, divisor stays len(per_sample_gpus[ts]) — same as before, behavior unchanged for the existing pipeline. - When gpu_col is absent, divisor is per_sample_row_count[ts] and num_gpus is the modal row count per sample. Both assume one row per GPU per sample, which is what every SMI tool we've encountered emits. New test (test_aggregate_power_no_gpu_column_infers_from_row_count) exercises a CSV header with device_id (regex miss) and asserts avg_power is the per-GPU mean (500W), not the system sum (2000W). Verified: - 27 unit tests pass (was 26) - 25 process_result tests pass (no change)
|
Point 1 — perf-changelog pr-link: false positive. The link points to - config-keys:
- qwen3.5-fp8-h200-sglang
description:
- "Smoke run validating measured-power aggregation pipeline. ..."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1558 ← correctPoint 2 — GPU-column detection asymmetry: fixed in In-tree pipelines (nvidia-smi Fix tracks row count per timestamp independently of GPU-column detection, then either uses |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26312107787 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26313298597 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26313725530 |
Summary
Adds measured per-GPU power and joules-per-output-token to every benchmark's
agg_<run>.json, sourced from the existinggpu_metrics.csvthatstart_gpu_monitoralready produces. The InferenceX-app dashboard consumes these via a companion PR (SemiAnalysisAI/InferenceX-app#375) to render new chart options alongside the existing TDP-derivedjTotal/jOutput/jInput.Two new fields land in the agg JSON:
avg_power_w— mean per-GPU draw during the load windowjoules_per_output_token—avg_power_w * num_gpus * duration / total_output_tokensHow it works
benchmark_serving.pynow recordsbenchmark_start_time_unixandbenchmark_end_time_unix(wall-clock epoch) alongside the existingduration. The aggregator needs these to know which slice of the long-running monitor CSV is the actual load window — without them, naive averaging would mix in ~60s of server warmup (~120W) and the optional eval phase (~300W), biasing a 720W per-GPU draw down to roughly 440W.utils/aggregate_power.py(new, stdlib only, ~210 lines) reads the CSV, detects vendor schema by header regex (handles nvidia-smipower.draw [W]and amd-smisocket_power), filters samples to the bench window, averages per-GPU power per timestamp then over time, and atomically patches the agg JSON. Best-effort throughout — missing/empty/malformed CSV is logged to stderr and skipped without ever failing the run.utils/process_result.pycalls the aggregator right after writing the agg JSON. Path resolution checks$GPU_METRICS_CSV→./gpu_metrics.csv→/workspace/gpu_metrics.csv. Wrapped intry/exceptso telemetry never blocks the upload.benchmarks/benchmark_lib.shexportsGPU_METRICS_CSVso per-script CSV-path overrides cross the shell→Python boundary.No workflow YAML change, no schema migration anywhere downstream — the InferenceX-app ETL's
benchmark-mapper.tsis permissive about numeric keys in the agg JSON.Sweep validation
The third commit (
66344e74) adds aperf-changelog.yamlentry pointing atqwen3.5-fp8-h200-sglangso the sweep workflow will fire whensweep-enabledis added. The sweep will produce the first realagg_<run>.jsonwith the new fields populated — meant to be inspected before merge.Verification
End-to-end smoke test on a synthesized 1680-row CSV (8 GPUs × 210s spanning warmup at 120W, bench at 720W, eval at 300W):
Cross-check: 720W × 8 GPUs / 682 tok/s ≈ 8.45 J/tok. ✓ Window isolation correctly excluded the warmup + eval samples (naive average would have given ~440W).
Test plan
process_result.py: stages a CSV + bench JSON + env vars, assertsagg_<run>.jsongets patchedtest_process_result.pytests still passsweep-enabledlabel — verifyavg_power_wandjoules_per_output_tokenappear in the artifact with plausible values