feat(power): per-worker prefill/decode power + role-split joules (stacked on #1574) by arygupt · Pull Request #1577 · SemiAnalysisAI/InferenceX

arygupt · 2026-05-28T01:22:11Z

Stacked on #1574 — review #1574 first; this PR adds per-worker breakdown on top of the cluster-wide multinode aggregation that PR lands.

Summary

Manager-requested follow-up. Adds per-worker power attribution and role-split energy metrics for disagg runs:

workers: [{role, worker_idx, num_gpus, avg_power_w}, ...] — per-worker breakdown in the agg JSON. role ∈ prefill / decode / agg / frontend. Multi-node workers (e.g. TP=16 decode across 4 nodes) aggregate as one entry with num_gpus=16.
prefill_avg_power_w, decode_avg_power_w — weighted per-GPU averages within each role (disagg only).
joules_per_input_token = prefill_energy / total_input_tokens (disagg only).
joules_per_output_token_decode = decode_energy / total_output_tokens (disagg only).
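
To make the role-split definitions above concrete, here is a minimal sketch. It is not the code in utils/aggregate_power.py: the function names, the assumption that each worker's avg_power_w is a per-GPU average, and the energy model (energy = per-GPU watts × GPU count × benchmark duration) are all illustrative assumptions. Only the field names and the two ratios come from this PR.

```python
from typing import List, Optional


def role_avg_power_w(workers: List[dict], role: str) -> Optional[float]:
    """GPU-count-weighted mean of avg_power_w over the workers of one role."""
    selected = [w for w in workers if w["role"] == role]
    total_gpus = sum(w["num_gpus"] for w in selected)
    if total_gpus == 0:
        return None  # no worker of this role was found
    weighted = sum(w["avg_power_w"] * w["num_gpus"] for w in selected)
    return weighted / total_gpus


def role_split_metrics(workers: List[dict], duration_s: float,
                       total_input_tokens: int, total_output_tokens: int) -> dict:
    """Return the disagg-only fields, or {} unless both roles are present."""
    prefill_w = role_avg_power_w(workers, "prefill")
    decode_w = role_avg_power_w(workers, "decode")
    if prefill_w is None or decode_w is None:
        return {}
    prefill_gpus = sum(w["num_gpus"] for w in workers if w["role"] == "prefill")
    decode_gpus = sum(w["num_gpus"] for w in workers if w["role"] == "decode")
    # ASSUMPTION: energy in joules = per-GPU watts * GPU count * seconds.
    prefill_energy = prefill_w * prefill_gpus * duration_s
    decode_energy = decode_w * decode_gpus * duration_s
    out = {"prefill_avg_power_w": prefill_w, "decode_avg_power_w": decode_w}
    if total_input_tokens > 0:
        out["joules_per_input_token"] = prefill_energy / total_input_tokens
    if total_output_tokens > 0:
        out["joules_per_output_token_decode"] = decode_energy / total_output_tokens
    return out


# Made-up numbers, chosen only so the arithmetic is easy to follow.
example_workers = [
    {"role": "prefill", "worker_idx": 0, "num_gpus": 8, "avg_power_w": 500.0},
    {"role": "prefill", "worker_idx": 1, "num_gpus": 8, "avg_power_w": 700.0},
    {"role": "decode", "worker_idx": 0, "num_gpus": 16, "avg_power_w": 400.0},
]
example_metrics = role_split_metrics(example_workers, 10.0, 96000, 32000)
```

With these made-up inputs, the prefill role averages 600 W per GPU across 16 GPUs, so 96,000 J over 10 s and 96,000 input tokens gives 1 J per input token; the decode side works out to 2 J per output token.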

Backward compatible: existing avg_power_w, joules_per_output_token, joules_per_total_token keep their cluster-wide semantics. Old-format unlabeled CSVs (single-node start_gpu_monitor path) emit no workers field and no role split — bit-for-bit identical to the parent PR's behavior for those callers.

Pipeline

srt-slurm perfmon (SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon c4c86dc)
  → perf_samples_<role>_w<worker_idx>_<host>.csv on shared NFS
  → process_result.py (parent PR's bridge) → aggregate_power.run()
  → aggregate_power_per_worker() groups CSVs by (role, worker_idx)
  → agg JSON gets workers array + role-split joules (disagg) + existing cluster fields
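
The grouping step in the pipeline above could look roughly like the following. This is a sketch under stated assumptions, not the PR's _parse_role_from_filename or aggregate_power_per_worker: the regex, the helper names parse_label and group_by_worker, and the example paths are hypothetical. Only the filename pattern perf_samples_<role>_w<worker_idx>_<host>.csv and the four role names come from the description.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Optional, Tuple

# Anchored pattern for labeled perfmon CSVs (hypothetical regex).
_LABELED = re.compile(
    r"^perf_samples_(prefill|decode|agg|frontend)_w(\d+)_(.+)\.csv$"
)


def parse_label(path: str) -> Optional[Tuple[str, int]]:
    """Return (role, worker_idx) for a labeled CSV, or None for other names."""
    match = _LABELED.match(Path(path).name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def group_by_worker(paths: Iterable[str]) -> Dict[Tuple[str, int], List[str]]:
    """Group labeled CSVs by (role, worker_idx); unlabeled files are skipped."""
    groups: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for path in paths:
        label = parse_label(path)
        if label is not None:
            groups[label].append(path)
    return dict(groups)


# Hypothetical paths: one decode worker spanning two nodes, one prefill worker,
# and one unlabeled file standing in for the old single-CSV format.
example_groups = group_by_worker([
    "/nfs/run/perf_samples_decode_w0_node-a.csv",
    "/nfs/run/perf_samples_decode_w0_node-b.csv",
    "/nfs/run/perf_samples_prefill_w0_node-c.csv",
    "/nfs/run/gpu_power.csv",
])
```

Here the two decode CSVs land under the single key ("decode", 0), which is how a multi-node worker would end up as one workers entry, while the unlabeled file contributes to no group.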

Companion srt-slurm change

SemiAnalysisAI/srt-slurm c4c86dc: _start_perf_monitor derives role+worker_idx per node from backend_processes (which carries endpoint_mode and endpoint_index per Process) and labels output filenames. Allocator invariant — prefill and decode never share a node — means a single (role, worker_idx) per node is well-defined.

Files

utils/aggregate_power.py: new _parse_role_from_filename + aggregate_power_per_worker. patch_agg_result gains extras kwarg. run() computes role-split energy and per-worker breakdown when filenames are labeled.
utils/test_aggregate_power.py: 9 new tests.
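
As a rough illustration of how an optional extras argument can stay backward compatible, the sketch below merges extra fields only when they are supplied. The function patch_with_extras and its signature are hypothetical and are not the real patch_agg_result; the three cluster-wide key names are the ones listed in this PR.

```python
from typing import Optional

# Cluster-wide keys that exist with or without per-worker labels.
CLUSTER_KEYS = ("avg_power_w", "joules_per_output_token", "joules_per_total_token")


def patch_with_extras(agg: dict, cluster: dict,
                      extras: Optional[dict] = None) -> dict:
    """Return a copy of agg with cluster-wide keys set and optional extras merged."""
    patched = dict(agg)
    for key in CLUSTER_KEYS:
        if key in cluster:
            patched[key] = cluster[key]
    if extras:  # None or {} leaves the legacy output untouched
        patched.update(extras)
    return patched


# Made-up values for illustration.
cluster_fields = {
    "avg_power_w": 450.0,
    "joules_per_output_token": 3.0,
    "joules_per_total_token": 0.75,
}
legacy = patch_with_extras({"model": "example"}, cluster_fields)
labeled = patch_with_extras(
    {"model": "example"},
    cluster_fields,
    extras={"workers": [{"role": "agg", "worker_idx": 0,
                         "num_gpus": 8, "avg_power_w": 450.0}]},
)
```

Callers that pass no extras get exactly the original keys, which is the property the backward-compat tests in this PR are described as checking.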

Test plan

77/77 tests pass (68 existing + 9 new — filename parsing, per-worker grouping, multi-node worker aggregation, mixed labeled/unlabeled inputs, disagg E2E with role split, agg E2E without disagg-only fields, bit-for-bit backward compat)
After parent PR merges and srt-slurm change lands, first real disagg sweep on gb300-cw emits workers array + prefill_avg_power_w / decode_avg_power_w / joules_per_input_token / joules_per_output_token_decode in the agg JSON
InferenceX-app stage (separate PR): extend benchmark-transform.ts to surface workers array; add prefill/decode chart legend

App-side follow-up

ETL auto-captures any numeric field, so the scalar role-split fields (prefill_avg_power_w etc.) land in metrics JSONB automatically. The nested workers array needs explicit handling — that's a separate PR on InferenceX-app covering: type extension, benchmark-transform.ts mapper, chart legend with prefill/decode swatches.

Limitation (V1)

When multiple workers of the same role share a node (uncommon — happens in single-node multi-worker setups, e.g. 2 decode workers per 8-GPU node), perfmon labels by the lowest worker_idx and the aggregator currently attributes that node's full power to the lowest-idx worker. The other colocated workers would not appear in the workers list. For the GB300 multi-node disagg case (worker spans 1+ nodes, never shares), this isn't an issue.
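
The effect of this limitation can be illustrated with a small sketch. The helpers node_label and attribute_node_power are hypothetical and only mimic the behavior described above (label by lowest worker_idx, full node power attributed to that worker); they are not taken from perfmon or the aggregator.

```python
from typing import Dict, List, Tuple


def node_label(colocated: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Pick the label for a node whose workers all share one role."""
    role = colocated[0][0]
    return role, min(idx for _, idx in colocated)


def attribute_node_power(colocated: List[Tuple[str, int]],
                         node_avg_power_w: float) -> Dict[Tuple[str, int], float]:
    """Attribute the whole node's power to the lowest-idx colocated worker."""
    return {node_label(colocated): node_avg_power_w}


# Two decode workers sharing one 8-GPU node (made-up power figure).
attribution = attribute_node_power([("decode", 1), ("decode", 0)], 3200.0)
```

In this example only ("decode", 0) appears in the result, and ("decode", 1) is absent even though it ran on the same node.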

Note

Low Risk
Changes are confined to optional benchmark power post-processing and agg JSON fields; cluster-wide metrics stay the same for legacy CSV names, with broad test coverage.

Overview
Extends aggregate_power so multinode perf CSVs named perf_samples_<role>_w<idx>_<host>.csv (srt-slurm perfmon) drive per-worker power in the agg JSON, on top of unchanged cluster-wide avg_power_w / joules fields.

aggregate_power_per_worker keeps cluster totals via existing aggregate_power, groups labeled files by (role, worker_idx) (multi-node workers merge multiple CSVs), and returns a workers list. run() switches to that API; patch_agg_result accepts optional extras. When both prefill and decode workers are present, the agg JSON also gets role-weighted prefill_avg_power_w / decode_avg_power_w, joules_per_input_token, and joules_per_output_token_decode. Unlabeled legacy CSVs still patch only the original three keys (no workers).

Nine new tests cover filename parsing, grouping, mixed labeled/unlabeled inputs, and disagg vs agg vs old-format E2E behavior.

Reviewed by Cursor Bugbot for commit f951aef.

… joules

Layers per-worker breakdown on top of the cluster-wide multinode aggregation in the parent PR #1574.

New agg JSON fields (additive; all existing keys preserved bit-for-bit for backward compat):

workers: [{role, worker_idx, num_gpus, avg_power_w}, ...]
role ∈ "prefill" / "decode" / "agg" / "frontend". Each (role, idx) aggregates across all CSVs for that worker: a multi-node TP=16 decode worker on 4 nodes produces one workers entry with num_gpus=16.

prefill_avg_power_w, decode_avg_power_w (disagg only)
Weighted per-GPU averages within each role.

joules_per_input_token = prefill_energy / total_input_tokens
joules_per_output_token_decode = decode_energy / total_output_tokens
Disagg-only role-split metrics. Existing joules_per_output_token and joules_per_total_token keep their cluster-wide semantics so the chart won't shift on existing data.

Worker → CSV mapping is by filename: srt-slurm's perfmon (companion change on SemiAnalysisAI/srt-slurm c4c86dc) writes `perf_samples_<role>_w<worker_idx>_<host>.csv`. Unlabeled filenames (old single-CSV format) silently emit empty workers list and skip the role split; cluster-wide metrics are unchanged in that case.

77/77 tests pass (68 existing + 9 new: per-worker grouping, multi-node worker aggregation, mixed labeled/unlabeled inputs, disagg E2E with role split, agg E2E omitting disagg-only fields, bit-for-bit backward compat for old-format callers).

claude

LGTM — additive per-worker breakdown on top of existing cluster-wide aggregation, backward compatible for legacy CSV names, well-covered by 9 new tests.

Extended reasoning...

Overview

Extends utils/aggregate_power.py with per-worker power attribution: a new _parse_role_from_filename helper, a new aggregate_power_per_worker function that wraps the existing aggregator and groups CSVs by (role, worker_idx) parsed from perf_samples_<role>_w<idx>_<host>.csv, a new optional extras kwarg to patch_agg_result, and an extended run() that emits a workers array plus role-split scalars (prefill_avg_power_w, decode_avg_power_w, joules_per_input_token, joules_per_output_token_decode) when both prefill and decode workers are present. utils/test_aggregate_power.py adds 9 tests.

Security risks

None. This is offline post-processing of NVIDIA/AMD SMI CSVs produced by the benchmark harness; no network calls, no auth, no user-controlled input. Filename regex is anchored and narrow.

Level of scrutiny

Low. This touches only optional benchmark power telemetry — not the inference, serving, or scheduling hot paths. The aggregator is best-effort and returns 0 on every error path so monitoring failures cannot break benchmark uploads. Cluster-wide fields (avg_power_w, joules_per_output_token, joules_per_total_token) keep their existing semantics and old-format unlabeled CSVs produce bit-for-bit identical output.

Other factors

Tests cover filename parsing (labeled / multi-digit idx / old format), per-worker grouping with multi-node decode worker aggregation, mixed labeled+unlabeled inputs, disagg E2E with role-split joules, agg E2E without disagg-only fields, and legacy single-CSV bit-for-bit compat.
The V1 limitation (multiple workers of the same role colocated on one node collapse to lowest worker_idx) is documented in the PR description and does not affect the target gb300-cw multi-node disagg case.
Bug hunter found no issues.

github-project-automation Bot added this to InferenceMAX Board May 28, 2026

claude Bot reviewed May 28, 2026

arygupt wants to merge 1 commit into feat/measured-power-multinode from feat/measured-power-per-worker
