feat(power): aggregate measured GPU power into agg result JSON#1551

Closed
arygupt wants to merge 3 commits into SemiAnalysisAI:main from arygupt:chore/measured-power-aggregation

@arygupt (Collaborator) commented May 22, 2026

Summary

Adds measured per-GPU power and joules-per-output-token to every benchmark's agg_<run>.json, sourced from the existing gpu_metrics.csv that start_gpu_monitor already produces. The InferenceX-app dashboard consumes these via a companion PR (semianalysisai/InferenceX-app) to render new chart options alongside the existing TDP-derived jTotal/jOutput/jInput.

Two new fields land in the agg JSON:

  • avg_power_w — mean per-GPU draw during the load window
  • joules_per_output_token = avg_power_w * num_gpus * duration / total_output_tokens
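
The second formula can be sketched as a small helper. This is an illustration only, not the code in utils/aggregate_power.py: the function name and the choice to return None when no tokens were produced are assumptions, and the sample figures are illustrative.

```python
from typing import Optional


def joules_per_output_token(
    avg_power_w: float,
    num_gpus: int,
    duration_s: float,
    total_output_tokens: int,
) -> Optional[float]:
    """Energy per generated token: (watts per GPU * GPUs * seconds) / tokens."""
    # Assumed behaviour: a failed run with no tokens yields None instead of
    # raising ZeroDivisionError.
    if total_output_tokens <= 0 or duration_s <= 0:
        return None
    total_energy_j = avg_power_w * num_gpus * duration_s
    return total_energy_j / total_output_tokens


# Illustrative figures: 600 W per GPU, 8 GPUs, a 60 s window, 30,000 tokens.
print(joules_per_output_token(600.0, 8, 60.0, 30_000))
```

Multiplying by num_gpus matters because avg_power_w is a per-GPU mean, not the draw of the whole node.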

How it works

  1. benchmark_serving.py now records benchmark_start_time_unix and benchmark_end_time_unix (wall-clock epoch) alongside the existing duration. The aggregator needs these to know which slice of the long-running monitor CSV is the actual load window — without them, naive averaging would mix in ~60s of server warmup (~120W) and the optional eval phase (~300W), biasing a 720W per-GPU draw down to roughly 440W.

  2. utils/aggregate_power.py (new, stdlib only, ~210 lines) reads the CSV, detects vendor schema by header regex (handles nvidia-smi power.draw [W] and amd-smi socket_power), filters samples to the bench window, averages per-GPU power per timestamp then over time, and atomically patches the agg JSON. Best-effort throughout — missing/empty/malformed CSV is logged to stderr and skipped without ever failing the run.

  3. utils/process_result.py calls the aggregator right after writing the agg JSON. Path resolution checks $GPU_METRICS_CSV, then ./gpu_metrics.csv, then /workspace/gpu_metrics.csv, accommodating the scripts in benchmarks/single_node/ that override the default path. Wrapped in try/except so telemetry never blocks the upload.

  4. benchmarks/benchmark_lib.sh exports GPU_METRICS_CSV so per-script CSV-path overrides cross the shell→Python boundary.
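
The window filtering and two-stage averaging from step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the aggregator itself: the (timestamp, gpu index, watts) tuple layout, the function name, and the synthetic samples are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean
from typing import Iterable, Optional, Tuple

Sample = Tuple[float, int, float]  # (unix timestamp, gpu index, watts)


def windowed_avg_power_w(
    samples: Iterable[Sample], start_unix: float, end_unix: float
) -> Optional[float]:
    """Mean per-GPU power over samples inside [start_unix, end_unix]."""
    by_timestamp = defaultdict(list)
    for ts, _gpu, watts in samples:
        if start_unix <= ts <= end_unix:
            by_timestamp[ts].append(watts)
    if not by_timestamp:
        return None  # nothing fell inside the bench window
    # Average across GPUs at each timestamp first, then across timestamps.
    return fmean(fmean(watts) for watts in by_timestamp.values())


# Synthetic 2-GPU trace: 5 s warmup at 120 W, 10 s bench at 720 W, 5 s eval at 300 W.
samples = []
for t in range(20):
    level = 120.0 if t < 5 else 720.0 if t < 15 else 300.0
    samples += [(float(t), gpu, level) for gpu in range(2)]

windowed = windowed_avg_power_w(samples, 5.0, 14.0)
naive = fmean(w for _, _, w in samples)
print(windowed, naive)
```

On this synthetic trace the windowed mean stays at the bench level while the naive mean over the whole trace is pulled down by the warmup and eval samples, which is the bias the timestamps are meant to remove.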

No workflow YAML change, no schema migration anywhere downstream — the InferenceX-app ETL's benchmark-mapper.ts is permissive about numeric keys in the agg JSON.

Verification

End-to-end smoke test on a synthesized 1680-row CSV (8 GPUs × 210s spanning warmup at 120W, bench at 720W, eval at 300W):

[aggregate_power] avg_power_w=715.69 (per GPU, n=8)
                  joules_per_output_token=8.3869
                  duration=120.0s output_tokens=81920

Cross-check: 720W × 8 GPUs / 682 tok/s ≈ 8.45 J/tok. ✓ Window isolation correctly excluded the warmup + eval samples (naive average would have given ~440W).

Test plan

  • 26 unit tests covering NVIDIA + AMD CSV formats, multi-GPU per-sample aggregation, window filtering, malformed-row resilience, missing files, atomic JSON patching, divide-by-zero on failed runs
  • 3 subprocess integration tests through process_result.py: stages a CSV + bench JSON + env vars, asserts agg_<run>.json gets patched
  • All 22 existing test_process_result.py tests still pass (no regressions in the established flow)
  • First real benchmark run after merge — verify the two new keys appear in the uploaded agg_<run>.json artifact
  • InferenceX-app ingest picks up the new keys (no METRIC_KEYS warning in ETL logs)

Backfill option

The workflow already uploads gpu_metrics.csv as an artifact for every run (.github/workflows/benchmark-tmpl.yml). After this merges, historical runs that have both a CSV and an agg_<run>.json could be backfilled by re-running this aggregator against the artifact store. Out of scope here.

🤖 Generated with Claude Code

arygupt and others added 2 commits May 21, 2026 16:40
Adds two new fields to agg_<run>.json so the InferenceX-app dashboard
can chart measured-energy metrics alongside the existing TDP-derived
ones:

  - avg_power_w               (mean per-GPU draw during the load window)
  - joules_per_output_token   (avg_power_w * num_gpus * duration / total_output_tokens)

How it works:

1. benchmark_serving.py now records benchmark_start_time_unix and
   benchmark_end_time_unix alongside the existing duration field so the
   aggregator knows exactly which slice of the long-running monitor CSV
   to read (the bracket-the-whole-job monitor includes server warmup
   and the optional eval phase, which would otherwise bias the average).

2. aggregate_power.py reads /workspace/gpu_metrics.csv (path overridable
   via GPU_METRICS_CSV, which benchmark_lib.sh now exports), detects the
   vendor schema by header regex (handles nvidia-smi "power.draw [W]"
   and amd-smi socket_power formats), filters samples to the bench
   window, and atomically patches the agg JSON. Best-effort: missing /
   empty / malformed CSV is logged to stderr and skipped without
   failing the run.

3. process_result.py invokes the aggregator right after writing the
   agg JSON — no workflow YAML change needed.
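
For readers unfamiliar with the "atomically patches" step, one common stdlib approach is to write a temp file next to the target and rename it over the original. The sketch below assumes that approach; the function name is hypothetical and the real aggregate_power.py may differ.

```python
import json
import os
import tempfile


def patch_json_atomically(path: str, new_fields: dict) -> None:
    """Merge new_fields into the JSON object at path via temp file + rename."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    data.update(new_fields)

    # The temp file must live on the same filesystem for the rename to be atomic.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # readers see the old or the new file, never a partial one
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A caller would wrap this in try/except and log to stderr to keep the best-effort behaviour described above.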

The InferenceX-app ETL (benchmark-mapper.ts) auto-captures unknown
numeric metrics into the metrics JSONB column, so no schema migration
or downstream change is required for the data to land in the DB. A
follow-up PR on InferenceX-app adds the two Y-axis options to the
inference scatter chart.

26 unit tests covering NVIDIA + AMD CSV shapes, window filtering,
multi-GPU per-sample aggregation, malformed-row resilience, missing
files, division-by-zero guards, and atomic JSON patching.
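
The header-based vendor detection that these tests exercise might look roughly like the sketch below. The patterns are assumptions built only from the two column names quoted in this commit message, the AMD header row is hypothetical, and the helper name is not taken from the PR.

```python
import re
from typing import List, Optional, Tuple

# Assumed patterns: only the column names "power.draw [W]" and "socket_power"
# come from the commit message; everything else here is illustrative.
POWER_COLUMN_PATTERNS = {
    "nvidia": re.compile(r"power\.draw\s*\[W\]", re.IGNORECASE),
    "amd": re.compile(r"socket_power", re.IGNORECASE),
}


def find_power_column(header: List[str]) -> Optional[Tuple[str, int]]:
    """Return (vendor, column index) for the first header cell that matches."""
    for index, cell in enumerate(header):
        for vendor, pattern in POWER_COLUMN_PATTERNS.items():
            if pattern.search(cell):
                return vendor, index
    return None  # unknown schema: the caller should skip aggregation


nvidia_header = ["timestamp", " index", " power.draw [W]"]
amd_header = ["timestamp", "gpu", "socket_power"]  # hypothetical layout
print(find_power_column(nvidia_header), find_power_column(amd_header))
```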

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Three new tests in TestPowerAggregationIntegration:

  - test_agg_json_gets_patched_with_power_and_joules: full pipeline.
    Stages a 1Hz nvidia-smi CSV with warmup/bench/eval phases, runs
    process_result.py as a subprocess with GPU_METRICS_CSV set, and
    verifies the agg JSON gets patched with avg_power_w (600W) and
    joules_per_output_token (9.6 J/tok = 600W * 8 GPUs * 60s / 30k tok).
    Warmup (100W) and eval (200W) samples must be excluded by the
    timestamp window — would otherwise bias the result downward.

  - test_missing_csv_does_not_break_process_result: production case for
    runs that ship without monitoring. process_result.py succeeds and
    writes the agg JSON sans power fields.

  - test_missing_bench_timestamps_does_not_patch: legacy bench JSON
    without benchmark_start_time_unix gracefully skips aggregation.
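
The two skip paths covered by the last two tests (no CSV, no timestamps) could be guarded along these lines. This is a sketch under assumptions: the helper names are hypothetical, and falling through to the default paths when GPU_METRICS_CSV points at a missing file is a guess about the intended behaviour.

```python
import os
from typing import Mapping, Optional, Tuple

# Default locations named in the PR summary, checked after $GPU_METRICS_CSV.
DEFAULT_CANDIDATES = ("./gpu_metrics.csv", "/workspace/gpu_metrics.csv")


def resolve_metrics_csv(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first candidate path that exists, or None if none does."""
    candidates = [env.get("GPU_METRICS_CSV"), *DEFAULT_CANDIDATES]
    for candidate in candidates:
        if candidate and os.path.isfile(candidate):
            return candidate
    return None  # run shipped without monitoring: skip, do not fail


def bench_window(bench_json: dict) -> Optional[Tuple[float, float]]:
    """Return (start, end) epoch seconds, or None for legacy bench JSON."""
    start = bench_json.get("benchmark_start_time_unix")
    end = bench_json.get("benchmark_end_time_unix")
    if start is None or end is None or end <= start:
        return None
    return float(start), float(end)
```

When either helper returns None, the caller would log to stderr and leave the agg JSON without the power fields.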

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@arygupt arygupt requested a review from a team May 22, 2026 00:42
@claude (bot, Contributor) left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires
when the sweep-enabled label is added to PR SemiAnalysisAI#1551. The sweep will produce
the first agg_<run>.json containing avg_power_w and joules_per_output_token,
validating the aggregator end-to-end on real GPU hardware.

Cheap single-node H200 config picked to minimize runner-pool contention.

@arygupt (Collaborator, Author) commented May 22, 2026

Closed this in favor of #1558

@arygupt arygupt closed this May 22, 2026
arygupt added a commit that referenced this pull request May 22, 2026
Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires
when the sweep-enabled label is added to PR #1551. The sweep will produce
the first agg_<run>.json containing avg_power_w and joules_per_output_token,
validating the aggregator end-to-end on real GPU hardware.

Cheap single-node H200 config picked to minimize runner-pool contention.