feat(power): aggregate measured GPU power into agg result JSON #1551
Closed
arygupt wants to merge 3 commits into
Adds two new fields to agg_<run>.json so the InferenceX-app dashboard can chart measured-energy metrics alongside the existing TDP-derived ones:
- avg_power_w (mean per-GPU draw during the load window)
- joules_per_output_token (avg_power_w * num_gpus * duration / total_output_tokens)

How it works:
1. benchmark_serving.py now records benchmark_start_time_unix and benchmark_end_time_unix alongside the existing duration field so the aggregator knows exactly which slice of the long-running monitor CSV to read (the bracket-the-whole-job monitor includes server warmup and the optional eval phase, which would otherwise bias the average).
2. aggregate_power.py reads /workspace/gpu_metrics.csv (path overridable via GPU_METRICS_CSV, which benchmark_lib.sh now exports), detects the vendor schema by header regex (handles nvidia-smi "power.draw [W]" and amd-smi socket_power formats), filters samples to the bench window, and atomically patches the agg JSON. Best-effort: missing / empty / malformed CSV is logged to stderr and skipped without failing the run.
3. process_result.py invokes the aggregator right after writing the agg JSON; no workflow YAML change needed.

The InferenceX-app ETL (benchmark-mapper.ts) auto-captures unknown numeric metrics into the metrics JSONB column, so no schema migration or downstream change is required for the data to land in the DB. A follow-up PR on InferenceX-app adds the two Y-axis options to the inference scatter chart.

26 unit tests covering NVIDIA + AMD CSV shapes, window filtering, multi-GPU per-sample aggregation, malformed-row resilience, missing files, division-by-zero guards, and atomic JSON patching.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
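The window-filtering step in item 2 can be sketched roughly as follows. This is an illustration, not the code in aggregate_power.py: the function name windowed_avg_power, the two regexes, and the assumption that the timestamp column already holds Unix epoch seconds are all hypothetical (real nvidia-smi output uses a formatted date string that would need parsing first).

```python
import csv
import io
import re
from collections import defaultdict

# Assumed header patterns; the PR only states that the schema is detected by regex.
POWER_HEADER = re.compile(r"power\.draw|socket_power", re.IGNORECASE)
TIME_HEADER = re.compile(r"timestamp", re.IGNORECASE)


def windowed_avg_power(csv_text, start_unix, end_unix):
    """Mean per-GPU watts over [start_unix, end_unix], or None if nothing usable."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if not header:
        return None
    header = [h.strip() for h in header]
    power_col = next((i for i, h in enumerate(header) if POWER_HEADER.search(h)), None)
    time_col = next((i for i, h in enumerate(header) if TIME_HEADER.search(h)), None)
    if power_col is None or time_col is None:
        return None

    by_ts = defaultdict(list)  # timestamp -> one reading per GPU
    for row in reader:
        try:
            ts = float(row[time_col])  # assumption: epoch seconds
            watts = float(row[power_col].strip().split()[0])  # tolerates "600.0 W"
        except (IndexError, ValueError):
            continue  # malformed row: skip it rather than fail the run
        if start_unix <= ts <= end_unix:
            by_ts[ts].append(watts)

    if not by_ts:
        return None
    # Average across GPUs at each timestamp first, then across time.
    per_ts = [sum(v) / len(v) for v in by_ts.values()]
    return sum(per_ts) / len(per_ts)


SAMPLE = """timestamp, index, power.draw [W]
100, 0, 100.0
100, 1, 100.0
160, 0, 590.0
160, 1, 610.0
161, 0, 600.0
161, 1, bad
220, 0, 200.0
"""
print(windowed_avg_power(SAMPLE, 160, 161))  # 600.0
```

Averaging per timestamp before averaging over time keeps a sample with a dropped GPU reading from being weighted differently than a complete one.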
Three new tests in TestPowerAggregationIntegration:
- test_agg_json_gets_patched_with_power_and_joules: full pipeline.
Stages a 1Hz nvidia-smi CSV with warmup/bench/eval phases, runs
process_result.py as a subprocess with GPU_METRICS_CSV set, and
verifies the agg JSON gets patched with avg_power_w (600W) and
joules_per_output_token (9.6 J/tok = 600W * 8 GPUs * 60s / 30k tok).
Warmup (100W) and eval (200W) samples must be excluded by the
timestamp window — would otherwise bias the result downward.
- test_missing_csv_does_not_break_process_result: production case for
runs that ship without monitoring. process_result.py succeeds and
writes the agg JSON sans power fields.
- test_missing_bench_timestamps_does_not_patch: legacy bench JSON
without benchmark_start_time_unix gracefully skips aggregation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
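The behaviour these tests exercise (compute the energy figure, patch the JSON atomically, and never fail the run) might look something like the sketch below. The helper names joules_per_output_token and patch_agg_json are hypothetical, and write-to-temp-then-os.replace is one common way to get an atomic update; the PR does not say which mechanism aggregate_power.py uses.

```python
import json
import os
import sys
import tempfile


def joules_per_output_token(avg_power_w, num_gpus, duration_s, total_output_tokens):
    """avg_power_w * num_gpus * duration / total_output_tokens, guarded against zero."""
    if not total_output_tokens or total_output_tokens <= 0:
        return None
    return avg_power_w * num_gpus * duration_s / total_output_tokens


def patch_agg_json(path, fields):
    """Merge fields into the JSON file at path; return False instead of raising."""
    try:
        with open(path) as f:
            data = json.load(f)
        data.update(fields)
        # Write next to the target so os.replace stays on one filesystem.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)), suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(data, f, indent=2)
            os.replace(tmp, path)  # atomic rename over the original
        finally:
            if os.path.exists(tmp):  # only true if the replace did not happen
                os.unlink(tmp)
        return True
    except (OSError, ValueError, TypeError, AttributeError) as exc:
        print(f"power aggregation skipped: {exc}", file=sys.stderr)
        return False


# The figures from the integration test: 600 W * 8 GPUs * 60 s / 30k tokens.
print(joules_per_output_token(600.0, 8, 60.0, 30_000))  # 9.6
```

Returning False rather than raising mirrors the best-effort contract: a missing or unreadable agg JSON leaves the run's other outputs untouched.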
Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR SemiAnalysisAI#1551. The sweep will produce the first agg_<run>.json containing avg_power_w and joules_per_output_token, validating the aggregator end-to-end on real GPU hardware. Cheap single-node H200 config picked to minimize runner-pool contention.
Closed this in favor of #1558
arygupt added a commit that referenced this pull request May 22, 2026
Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR #1551. The sweep will produce the first agg_<run>.json containing avg_power_w and joules_per_output_token, validating the aggregator end-to-end on real GPU hardware. Cheap single-node H200 config picked to minimize runner-pool contention.
Summary
Adds measured per-GPU power and joules-per-output-token to every benchmark's agg_<run>.json, sourced from the existing gpu_metrics.csv that start_gpu_monitor already produces. The InferenceX-app dashboard consumes these via a companion PR (semianalysisai/InferenceX-app) to render new chart options alongside the existing TDP-derived jTotal/jOutput/jInput.

Two new fields land in the agg JSON:
- avg_power_w: mean per-GPU draw during the load window
- joules_per_output_token: avg_power_w * num_gpus * duration / total_output_tokens

How it works
1. benchmark_serving.py now records benchmark_start_time_unix and benchmark_end_time_unix (wall-clock epoch) alongside the existing duration. The aggregator needs these to know which slice of the long-running monitor CSV is the actual load window. Without them, naive averaging would mix in ~60s of server warmup (~120W) and the optional eval phase (~300W), biasing a 720W per-GPU draw down to roughly 440W.
2. utils/aggregate_power.py (new, stdlib only, ~210 lines) reads the CSV, detects vendor schema by header regex (handles nvidia-smi power.draw [W] and amd-smi socket_power), filters samples to the bench window, averages per-GPU power per timestamp then over time, and atomically patches the agg JSON. Best-effort throughout: missing/empty/malformed CSV is logged to stderr and skipped without ever failing the run.
3. utils/process_result.py calls the aggregator right after writing the agg JSON. Path resolution checks $GPU_METRICS_CSV → ./gpu_metrics.csv → /workspace/gpu_metrics.csv, accommodating the scripts in benchmarks/single_node/ that override the default path. Wrapped in try/except so telemetry never blocks the upload.
4. benchmarks/benchmark_lib.sh exports GPU_METRICS_CSV so per-script CSV-path overrides cross the shell→Python boundary.

No workflow YAML change, no schema migration anywhere downstream: the InferenceX-app ETL's benchmark-mapper.ts is permissive about numeric keys in the agg JSON.

Verification
End-to-end smoke test on a synthesized 1680-row CSV (8 GPUs × 210s spanning warmup at 120W, bench at 720W, eval at 300W):
Cross-check: 720W × 8 GPUs / 682 tok/s ≈ 8.45 J/tok. ✓ Window isolation correctly excluded the warmup + eval samples (naive average would have given ~440W).
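The arithmetic behind the cross-check can be reproduced with a few lines. The phase lengths below are an assumption: only the 210 s total, the roughly 60 s warmup, and the three wattages come from this description, so the 90 s bench and 60 s eval split is illustrative and the naive figure it yields will not match the quoted ~440W exactly.

```python
def time_weighted_mean(phases):
    """phases: list of (seconds, watts) pairs; returns the duration-weighted mean."""
    total_s = sum(seconds for seconds, _ in phases)
    return sum(seconds * watts for seconds, watts in phases) / total_s


# Assumed split of the 210 s smoke-test CSV: warmup, bench, eval.
phases = [(60, 120.0), (90, 720.0), (60, 300.0)]

naive_w = time_weighted_mean(phases)          # whole CSV, all three phases
windowed_w = time_weighted_mean(phases[1:2])  # bench window only

# Duration cancels out of the energy formula once throughput is known:
# avg_power_w * num_gpus * duration / tokens == avg_power_w * num_gpus / (tokens / duration)
num_gpus = 8
throughput_tok_s = 682
j_per_tok = windowed_w * num_gpus / throughput_tok_s

print(round(naive_w, 1), windowed_w, round(j_per_tok, 2))
```

With this assumed split the naive average lands near 429 W, in the same range as the figure quoted above and well below the 720 W the bench window actually drew.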
Test plan
- process_result.py: stages a CSV + bench JSON + env vars, asserts agg_<run>.json gets patched
- test_process_result.py tests still pass (no regressions in the established flow)
- agg_<run>.json artifact

Backfill option
The workflow already uploads gpu_metrics.csv as an artifact for every run (.github/workflows/benchmark-tmpl.yml). After this merges, historical runs that have both a CSV and an agg_<run>.json could be backfilled by re-running this aggregator against the artifact store. Out of scope here.

🤖 Generated with Claude Code