feat(power): aggregate measured GPU power into agg result JSON by arygupt · Pull Request #1551 · SemiAnalysisAI/InferenceX

arygupt · 2026-05-22T00:42:58Z

Summary

Adds measured per-GPU power and joules-per-output-token to every benchmark's agg_<run>.json, sourced from the existing gpu_metrics.csv that start_gpu_monitor already produces. The InferenceX-app dashboard consumes these via a companion PR (semianalysisai/InferenceX-app) to render new chart options alongside the existing TDP-derived jTotal/jOutput/jInput.

Two new fields land in the agg JSON:

avg_power_w — mean per-GPU draw during the load window
joules_per_output_token — avg_power_w * num_gpus * duration / total_output_tokens
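
As a quick illustration of the second field, here is a minimal sketch of the formula. The function and argument names are hypothetical and are not taken from utils/aggregate_power.py; the zero-token guard is an assumption based on the "divide-by-zero on failed runs" item in the test plan.

```python
from typing import Optional


def joules_per_output_token(
    avg_power_w: float,
    num_gpus: int,
    duration_s: float,
    total_output_tokens: int,
) -> Optional[float]:
    """Energy per generated token over the load window (hypothetical helper)."""
    if total_output_tokens <= 0:
        # A failed run may report zero tokens; skip rather than divide by zero.
        return None
    # Watts * seconds = joules, summed across all GPUs.
    total_joules = avg_power_w * num_gpus * duration_s
    return total_joules / total_output_tokens
```

With avg_power_w=600, 8 GPUs, a 60 s window, and 30,000 output tokens, this gives 9.6 J/tok.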

How it works

benchmark_serving.py now records benchmark_start_time_unix and benchmark_end_time_unix (wall-clock epoch) alongside the existing duration. The aggregator needs these to know which slice of the long-running monitor CSV is the actual load window — without them, naive averaging would mix in ~60s of server warmup (~120W) and the optional eval phase (~300W), biasing a 720W per-GPU draw down to roughly 440W.
utils/aggregate_power.py (new, stdlib only, ~210 lines) reads the CSV, detects vendor schema by header regex (handles nvidia-smi power.draw [W] and amd-smi socket_power), filters samples to the bench window, averages per-GPU power per timestamp then over time, and atomically patches the agg JSON. Best-effort throughout — missing/empty/malformed CSV is logged to stderr and skipped without ever failing the run.
utils/process_result.py calls the aggregator right after writing the agg JSON. Path resolution checks $GPU_METRICS_CSV → ./gpu_metrics.csv → /workspace/gpu_metrics.csv, accommodating the scripts in benchmarks/single_node/ that override the default path. Wrapped in try/except so telemetry never blocks the upload.
benchmarks/benchmark_lib.sh exports GPU_METRICS_CSV so per-script CSV-path overrides cross the shell→Python boundary.
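
The window filtering and two-stage averaging described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in utils/aggregate_power.py: the `Sample` tuple shape and the function name `window_avg_power` are hypothetical, and samples are assumed to be already parsed from the CSV into epoch timestamps and watts.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, Optional, Tuple

# Hypothetical parsed sample: (unix_timestamp, gpu_index, power_watts)
Sample = Tuple[float, int, float]


def window_avg_power(
    samples: Iterable[Sample], start: float, end: float
) -> Optional[float]:
    """Mean per-GPU power inside [start, end], or None if no samples match."""
    by_timestamp = defaultdict(list)
    for ts, _gpu, watts in samples:
        if start <= ts <= end:  # drop warmup and eval samples
            by_timestamp[ts].append(watts)
    if not by_timestamp:
        return None
    # Stage 1: average across GPUs at each timestamp.
    per_timestamp = [mean(readings) for readings in by_timestamp.values()]
    # Stage 2: average those per-timestamp means over time.
    return mean(per_timestamp)


# Synthetic trace: 2 GPUs, warmup at 120 W, bench at 720 W, eval at 300 W.
trace = [
    (float(t), gpu, 120.0 if t < 60 else 720.0 if t < 180 else 300.0)
    for t in range(210)
    for gpu in range(2)
]
bench_avg = window_avg_power(trace, start=60.0, end=179.0)
```

Averaging per timestamp first keeps a timestamp with a missing GPU row from being under-weighted relative to complete timestamps.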

No workflow YAML change, no schema migration anywhere downstream — the InferenceX-app ETL's benchmark-mapper.ts is permissive about numeric keys in the agg JSON.

Verification

End-to-end smoke test on a synthesized 1680-row CSV (8 GPUs × 210s spanning warmup at 120W, bench at 720W, eval at 300W):

[aggregate_power] avg_power_w=715.69 (per GPU, n=8)
                  joules_per_output_token=8.3869
                  duration=120.0s output_tokens=81920

Cross-check: 720W × 8 GPUs / 682 tok/s ≈ 8.45 J/tok. ✓ Window isolation correctly excluded the warmup + eval samples (naive average would have given ~440W).
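
The cross-check arithmetic can be reproduced from the logged values alone. The sketch below uses only numbers printed in the smoke-test output; because the logged avg_power_w is rounded to two decimals, the recomputed value may differ from the logged joules_per_output_token in the last digit.

```python
# Values copied from the smoke-test log above.
avg_power_w = 715.69
num_gpus = 8
duration_s = 120.0
output_tokens = 81920

# Throughput during the load window, roughly 682 tok/s.
throughput_tok_s = output_tokens / duration_s

# Measured energy per token, recomputed from the rounded log values.
measured_j_per_tok = avg_power_w * num_gpus * duration_s / output_tokens

# Nominal energy per token if every GPU drew exactly 720 W.
nominal_j_per_tok = 720.0 * num_gpus / throughput_tok_s

print(f"throughput={throughput_tok_s:.1f} tok/s")
print(f"measured={measured_j_per_tok:.4f} J/tok, nominal={nominal_j_per_tok:.4f} J/tok")
```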

Test plan

26 unit tests covering NVIDIA + AMD CSV formats, multi-GPU per-sample aggregation, window filtering, malformed-row resilience, missing files, atomic JSON patching, divide-by-zero on failed runs
3 subprocess integration tests through process_result.py: stages a CSV + bench JSON + env vars, asserts agg_<run>.json gets patched
All 22 existing test_process_result.py tests still pass (no regressions in the established flow)
First real benchmark run after merge — verify the two new keys appear in the uploaded agg_<run>.json artifact
InferenceX-app ingest picks up the new keys (no METRIC_KEYS warning in ETL logs)

Backfill option

The workflow already uploads gpu_metrics.csv as an artifact for every run (.github/workflows/benchmark-tmpl.yml). After this merges, historical runs that have both a CSV and an agg_<run>.json could be backfilled by re-running this aggregator against the artifact store. Out of scope here.

🤖 Generated with Claude Code

Adds two new fields to agg_<run>.json so the InferenceX-app dashboard can chart measured-energy metrics alongside the existing TDP-derived ones:

- avg_power_w (mean per-GPU draw during the load window)
- joules_per_output_token (avg_power_w * num_gpus * duration / total_output_tokens)

How it works:

1. benchmark_serving.py now records benchmark_start_time_unix and benchmark_end_time_unix alongside the existing duration field so the aggregator knows exactly which slice of the long-running monitor CSV to read (the bracket-the-whole-job monitor includes server warmup and the optional eval phase, which would otherwise bias the average).
2. aggregate_power.py reads /workspace/gpu_metrics.csv (path overridable via GPU_METRICS_CSV, which benchmark_lib.sh now exports), detects the vendor schema by header regex (handles nvidia-smi "power.draw [W]" and amd-smi socket_power formats), filters samples to the bench window, and atomically patches the agg JSON. Best-effort: missing / empty / malformed CSV is logged to stderr and skipped without failing the run.
3. process_result.py invokes the aggregator right after writing the agg JSON, so no workflow YAML change is needed.

The InferenceX-app ETL (benchmark-mapper.ts) auto-captures unknown numeric metrics into the metrics JSONB column, so no schema migration or downstream change is required for the data to land in the DB. A follow-up PR on InferenceX-app adds the two Y-axis options to the inference scatter chart.

26 unit tests covering NVIDIA + AMD CSV shapes, window filtering, multi-GPU per-sample aggregation, malformed-row resilience, missing files, division-by-zero guards, and atomic JSON patching.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
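
For readers unfamiliar with the pattern, the CSV path lookup and the atomic JSON patch mentioned in this commit message might look roughly like the sketch below. The helper names `resolve_csv_path` and `patch_json_atomically` are hypothetical, and the write-to-temp-then-`os.replace` approach is an assumption about what "atomically patches" means here, not a description of the actual implementation. The lookup order follows the PR description ($GPU_METRICS_CSV, then ./gpu_metrics.csv, then /workspace/gpu_metrics.csv).

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def resolve_csv_path(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the first existing candidate CSV path, or None (hypothetical helper)."""
    candidates = [
        env.get("GPU_METRICS_CSV"),
        "./gpu_metrics.csv",
        "/workspace/gpu_metrics.csv",
    ]
    for candidate in candidates:
        if candidate and Path(candidate).is_file():
            return Path(candidate)
    return None


def patch_json_atomically(path: Path, new_fields: dict) -> None:
    """Merge new_fields into the JSON file at path via temp file + rename."""
    data = json.loads(path.read_text())
    data.update(new_fields)
    # Write the temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(data, handle)
        os.replace(tmp_name, path)  # readers see the old or the new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A caller following the best-effort policy would wrap both calls in try/except, log to stderr, and continue, so a telemetry failure never fails the run.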

Three new tests in TestPowerAggregationIntegration:

- test_agg_json_gets_patched_with_power_and_joules: full pipeline. Stages a 1Hz nvidia-smi CSV with warmup/bench/eval phases, runs process_result.py as a subprocess with GPU_METRICS_CSV set, and verifies the agg JSON gets patched with avg_power_w (600W) and joules_per_output_token (9.6 J/tok = 600W * 8 GPUs * 60s / 30k tok). Warmup (100W) and eval (200W) samples must be excluded by the timestamp window; they would otherwise bias the result downward.
- test_missing_csv_does_not_break_process_result: production case for runs that ship without monitoring. process_result.py succeeds and writes the agg JSON sans power fields.
- test_missing_bench_timestamps_does_not_patch: legacy bench JSON without benchmark_start_time_unix gracefully skips aggregation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR SemiAnalysisAI#1551. The sweep will produce the first agg_<run>.json containing avg_power_w and joules_per_output_token, validating the aggregator end-to-end on real GPU hardware. Cheap single-node H200 config picked to minimize runner-pool contention.

arygupt · 2026-05-22T20:56:32Z

Closed this in favor of #1558

arygupt and others added 2 commits May 21, 2026 16:40

arygupt requested a review from a team May 22, 2026 00:42

github-project-automation Bot added this to InferenceMAX Board May 22, 2026

claude Bot reviewed May 22, 2026

arygupt mentioned this pull request May 22, 2026

feat(inference): measured-power Y-axis metrics on scatter chart SemiAnalysisAI/InferenceX-app#375

Open

9 tasks

arygupt added sweep-enabled full-sweep-enabled labels May 22, 2026

arygupt mentioned this pull request May 22, 2026

feat(power): aggregate measured GPU power into agg result JSON #1558

Open

5 tasks

arygupt closed this May 22, 2026

github-project-automation Bot moved this to Done in InferenceMAX Board May 22, 2026

arygupt wants to merge 3 commits into SemiAnalysisAI:main from arygupt:chore/measured-power-aggregation
