feat(power): per-worker prefill/decode power + role-split joules (stacked on #1574) #1577
arygupt wants to merge 1 commit into
… joules

Layers per-worker breakdown on top of the cluster-wide multinode aggregation in the parent PR #1574.

New agg JSON fields (additive; all existing keys preserved bit-for-bit for backward compat):

- workers: [{role, worker_idx, num_gpus, avg_power_w}, ...]
  role ∈ "prefill" / "decode" / "agg" / "frontend". Each (role, idx) aggregates across all CSVs for that worker: a multi-node TP=16 decode worker on 4 nodes produces one workers entry with num_gpus=16.
- prefill_avg_power_w, decode_avg_power_w (disagg only)
  Weighted per-GPU averages within each role.
- joules_per_input_token = prefill_energy / total_input_tokens
  joules_per_output_token_decode = decode_energy / total_output_tokens
  Disagg-only role-split metrics. Existing joules_per_output_token and joules_per_total_token keep their cluster-wide semantics so the chart won't shift on existing data.

Worker → CSV mapping is by filename: srt-slurm's perfmon (companion change on SemiAnalysisAI/srt-slurm c4c86dc) writes `perf_samples_<role>_w<worker_idx>_<host>.csv`. Unlabeled filenames (old single-CSV format) silently emit empty workers list and skip the role split; cluster-wide metrics are unchanged in that case.

77/77 tests pass (68 existing + 9 new: per-worker grouping, multi-node worker aggregation, mixed labeled/unlabeled inputs, disagg E2E with role split, agg E2E omitting disagg-only fields, bit-for-bit backward compat for old-format callers).
LGTM — additive per-worker breakdown on top of existing cluster-wide aggregation, backward compatible for legacy CSV names, well-covered by 9 new tests.
Extended reasoning...
Overview
Extends utils/aggregate_power.py with per-worker power attribution: a new _parse_role_from_filename helper, a new aggregate_power_per_worker function that wraps the existing aggregator and groups CSVs by (role, worker_idx) parsed from perf_samples_<role>_w<idx>_<host>.csv, a new optional extras kwarg to patch_agg_result, and an extended run() that emits a workers array plus role-split scalars (prefill_avg_power_w, decode_avg_power_w, joules_per_input_token, joules_per_output_token_decode) when both prefill and decode workers are present. utils/test_aggregate_power.py adds 9 tests.
Security risks
None. This is offline post-processing of NVIDIA/AMD SMI CSVs produced by the benchmark harness; no network calls, no auth, no user-controlled input. Filename regex is anchored and narrow.
Level of scrutiny
Low. This touches only optional benchmark power telemetry — not the inference, serving, or scheduling hot paths. The aggregator is best-effort and returns 0 on every error path so monitoring failures cannot break benchmark uploads. Cluster-wide fields (avg_power_w, joules_per_output_token, joules_per_total_token) keep their existing semantics and old-format unlabeled CSVs produce bit-for-bit identical output.
Other factors
- Tests cover filename parsing (labeled / multi-digit idx / old format), per-worker grouping with multi-node decode worker aggregation, mixed labeled+unlabeled inputs, disagg E2E with role-split joules, agg E2E without disagg-only fields, and legacy single-CSV bit-for-bit compat.
- The V1 limitation (multiple workers of the same role colocated on one node collapse to lowest worker_idx) is documented in the PR description and does not affect the target gb300-cw multi-node disagg case.
- Bug hunter found no issues.
Stacked on #1574 — review #1574 first; this PR adds per-worker breakdown on top of the cluster-wide multinode aggregation that PR lands.
Summary
Manager-requested follow-up. Adds per-worker power attribution and role-split energy metrics for disagg runs:

- `workers: [{role, worker_idx, num_gpus, avg_power_w}, ...]`: per-worker breakdown in the agg JSON. `role` ∈ `prefill` / `decode` / `agg` / `frontend`. Multi-node workers (e.g. TP=16 decode across 4 nodes) aggregate as one entry with `num_gpus=16`.
- `prefill_avg_power_w`, `decode_avg_power_w`: weighted per-GPU averages within each role (disagg only).
- `joules_per_input_token` = `prefill_energy / total_input_tokens` (disagg only).
- `joules_per_output_token_decode` = `decode_energy / total_output_tokens` (disagg only).

Backward compatible: existing `avg_power_w`, `joules_per_output_token`, `joules_per_total_token` keep their cluster-wide semantics. Old-format unlabeled CSVs (single-node `start_gpu_monitor` path) emit no `workers` field and no role split, so output is bit-for-bit identical to the parent PR's behavior for those callers.

Pipeline
Companion srt-slurm change
SemiAnalysisAI/srt-slurm c4c86dc: `_start_perf_monitor` derives role + worker_idx per node from `backend_processes` (which carries `endpoint_mode` and `endpoint_index` per Process) and labels output filenames. The allocator invariant (prefill and decode never share a node) means a single (role, worker_idx) per node is well-defined.

Files
- `utils/aggregate_power.py`: new `_parse_role_from_filename` + `aggregate_power_per_worker`. `patch_agg_result` gains `extras` kwarg. `run()` computes role-split energy and per-worker breakdown when filenames are labeled.
- `utils/test_aggregate_power.py`: 9 new tests.

Test plan
workersarray +prefill_avg_power_w/decode_avg_power_w/joules_per_input_token/joules_per_output_token_decodein the agg JSONbenchmark-transform.tsto surfaceworkersarray; add prefill/decode chart legendApp-side follow-up
ETL auto-captures any numeric field, so the scalar role-split fields (`prefill_avg_power_w` etc.) land in `metrics` JSONB automatically. The nested `workers` array needs explicit handling; that's a separate PR on `InferenceX-app` covering: type extension, `benchmark-transform.ts` mapper, chart legend with prefill/decode swatches.

Limitation (V1)
When multiple workers of the same role share a node (uncommon; this happens in single-node multi-worker setups, e.g. 2 decode workers per 8-GPU node), perfmon labels the file by the lowest worker_idx, and the aggregator currently attributes that node's full power to the lowest-idx worker. The other colocated workers do not appear in the workers list. For the GB300 multi-node disagg case (each worker spans 1+ nodes and never shares one), this isn't an issue.
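For reference, the filename-based grouping described in this PR can be sketched roughly as follows. This is an illustration only, not the code in `utils/aggregate_power.py`: the regex, the helper names `parse_role` and `group_by_worker`, and the sample filenames (including the unlabeled `perf_samples.csv`) are assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical pattern for perf_samples_<role>_w<worker_idx>_<host>.csv.
# The role names come from the PR description; the regex itself is an assumption.
LABELED_CSV = re.compile(
    r"^perf_samples_(?P<role>prefill|decode|agg|frontend)"
    r"_w(?P<idx>\d+)_(?P<host>.+)\.csv$"
)


def parse_role(filename):
    """Return (role, worker_idx, host), or None for an unlabeled name."""
    m = LABELED_CSV.match(filename)
    if m is None:
        return None
    return m.group("role"), int(m.group("idx")), m.group("host")


def group_by_worker(filenames):
    """Map (role, worker_idx) -> list of CSV names; unlabeled files are skipped."""
    groups = defaultdict(list)
    for name in filenames:
        parsed = parse_role(name)
        if parsed is not None:
            role, idx, _host = parsed
            groups[(role, idx)].append(name)
    return dict(groups)


# Illustrative inputs: one decode worker spanning two nodes, one prefill
# worker, and one unlabeled (old-format style) file.
files = [
    "perf_samples_decode_w0_node1.csv",
    "perf_samples_decode_w0_node2.csv",
    "perf_samples_prefill_w3_node7.csv",
    "perf_samples.csv",
]
groups = group_by_worker(files)
```

Under this scheme a node produces one CSV and therefore one label, which is why a second decode worker colocated on the same node has no file of its own and never shows up as a separate group.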
Note
Low Risk
Changes are confined to optional benchmark power post-processing and agg JSON fields; cluster-wide metrics stay the same for legacy CSV names, with broad test coverage.
Overview
Extends `aggregate_power` so multinode perf CSVs named `perf_samples_<role>_w<idx>_<host>.csv` (srt-slurm perfmon) drive per-worker power in the agg JSON, on top of unchanged cluster-wide `avg_power_w` / joules fields.

`aggregate_power_per_worker` keeps cluster totals via existing `aggregate_power`, groups labeled files by `(role, worker_idx)` (multi-node workers merge multiple CSVs), and returns a `workers` list. `run()` switches to that API; `patch_agg_result` accepts optional `extras`. When both prefill and decode workers are present, the agg JSON also gets role-weighted `prefill_avg_power_w` / `decode_avg_power_w`, `joules_per_input_token`, and `joules_per_output_token_decode`. Unlabeled legacy CSVs still patch only the original three keys (no `workers`).

Nine new tests cover filename parsing, grouping, mixed labeled/unlabeled inputs, and disagg vs agg vs old-format E2E behavior.
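A minimal sketch of how the role-split scalars could be derived from a `workers` list, under stated assumptions: each worker's `avg_power_w` is taken to be a per-GPU average, role energy is approximated as per-GPU power × GPU count × benchmark duration, and the function names and the `duration_s` parameter are hypothetical. The PR does not spell out the actual energy calculation, so treat this as an illustration of the formulas only.

```python
def weighted_avg_power(workers, role):
    """GPU-count-weighted mean of per-GPU avg_power_w over one role's workers."""
    selected = [w for w in workers if w["role"] == role]
    gpus = sum(w["num_gpus"] for w in selected)
    if gpus == 0:
        return None
    return sum(w["avg_power_w"] * w["num_gpus"] for w in selected) / gpus


def role_energy_j(workers, role, duration_s):
    """Assumed energy model: per-GPU power * GPU count * duration (joules)."""
    watts = sum(w["avg_power_w"] * w["num_gpus"] for w in workers if w["role"] == role)
    return watts * duration_s


def role_split(workers, duration_s, total_input_tokens, total_output_tokens):
    """Return disagg-only extras, or {} unless both prefill and decode exist."""
    roles = {w["role"] for w in workers}
    if not {"prefill", "decode"} <= roles:
        return {}
    extras = {
        "prefill_avg_power_w": weighted_avg_power(workers, "prefill"),
        "decode_avg_power_w": weighted_avg_power(workers, "decode"),
    }
    # Guard against empty runs so the division never raises.
    if total_input_tokens > 0:
        extras["joules_per_input_token"] = (
            role_energy_j(workers, "prefill", duration_s) / total_input_tokens
        )
    if total_output_tokens > 0:
        extras["joules_per_output_token_decode"] = (
            role_energy_j(workers, "decode", duration_s) / total_output_tokens
        )
    return extras


# Made-up numbers purely for illustration.
workers = [
    {"role": "prefill", "worker_idx": 0, "num_gpus": 4, "avg_power_w": 500.0},
    {"role": "decode", "worker_idx": 0, "num_gpus": 16, "avg_power_w": 400.0},
    {"role": "decode", "worker_idx": 1, "num_gpus": 8, "avg_power_w": 700.0},
]
out = role_split(workers, duration_s=100.0,
                 total_input_tokens=1_000_000, total_output_tokens=600_000)
```

Returning an empty dict for non-disagg inputs mirrors the described behavior that agg runs omit the disagg-only fields, leaving the cluster-wide keys as the only ones patched.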
Reviewed by Cursor Bugbot for commit f951aef.