
feat(power): per-worker prefill/decode power + role-split joules (stacked on #1574)#1577

Open
arygupt wants to merge 1 commit into feat/measured-power-multinode from feat/measured-power-per-worker

Conversation

@arygupt
Collaborator

@arygupt arygupt commented May 28, 2026

Stacked on #1574 — review #1574 first; this PR adds per-worker breakdown on top of the cluster-wide multinode aggregation that PR lands.

Summary

Manager-requested follow-up. Adds per-worker power attribution and role-split energy metrics for disagg runs:

  • workers: [{role, worker_idx, num_gpus, avg_power_w}, ...] is the per-worker breakdown in the agg JSON. role ∈ prefill / decode / agg / frontend. Multi-node workers (e.g. TP=16 decode across 4 nodes) aggregate as one entry with num_gpus=16.
  • prefill_avg_power_w, decode_avg_power_w — weighted per-GPU averages within each role (disagg only).
  • joules_per_input_token = prefill_energy / total_input_tokens (disagg only).
  • joules_per_output_token_decode = decode_energy / total_output_tokens (disagg only).

Backward compatible: existing avg_power_w, joules_per_output_token, joules_per_total_token keep their cluster-wide semantics. Old-format unlabeled CSVs (single-node start_gpu_monitor path) emit no workers field and no role split — bit-for-bit identical to the parent PR's behavior for those callers.
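
To make the role-split definitions above concrete, here is a minimal sketch of the arithmetic. It is not the code in utils/aggregate_power.py. The function names are hypothetical, and two things are assumed rather than taken from the PR: that each worker's avg_power_w is a per-GPU average, and that role energy is per-GPU power × GPU count × a benchmark duration in seconds.

```python
from typing import Dict, List


def role_avg_power_w(workers: List[dict], role: str) -> float:
    """GPU-count-weighted mean of per-GPU average power for one role."""
    selected = [w for w in workers if w["role"] == role]
    total_gpus = sum(w["num_gpus"] for w in selected)
    if total_gpus == 0:
        return 0.0
    return sum(w["avg_power_w"] * w["num_gpus"] for w in selected) / total_gpus


def role_energy_j(workers: List[dict], role: str, duration_s: float) -> float:
    """ASSUMPTION: energy = (per-GPU avg W * GPUs), summed over workers, * seconds."""
    watts = sum(w["avg_power_w"] * w["num_gpus"] for w in workers if w["role"] == role)
    return watts * duration_s


def role_split_metrics(
    workers: List[dict],
    duration_s: float,
    total_input_tokens: int,
    total_output_tokens: int,
) -> Dict[str, float]:
    """Return the disagg-only fields, or {} unless both roles are present."""
    roles = {w["role"] for w in workers}
    if not {"prefill", "decode"} <= roles:
        return {}
    out = {
        "prefill_avg_power_w": role_avg_power_w(workers, "prefill"),
        "decode_avg_power_w": role_avg_power_w(workers, "decode"),
    }
    # Guard against empty runs so a zero token count never raises.
    if total_input_tokens > 0:
        out["joules_per_input_token"] = (
            role_energy_j(workers, "prefill", duration_s) / total_input_tokens
        )
    if total_output_tokens > 0:
        out["joules_per_output_token_decode"] = (
            role_energy_j(workers, "decode", duration_s) / total_output_tokens
        )
    return out
```

Under these assumptions, an agg-only run yields an empty dict, which matches the "disagg only" qualifier on all four fields.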

Pipeline

srt-slurm perfmon (SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon c4c86dc)
  → perf_samples_<role>_w<worker_idx>_<host>.csv on shared NFS
  → process_result.py (parent PR's bridge) → aggregate_power.run()
  → aggregate_power_per_worker() groups CSVs by (role, worker_idx)
  → agg JSON gets workers array + role-split joules (disagg) + existing cluster fields
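
The grouping step in this pipeline can be sketched as follows. This is an illustration of the filename convention only: parse_labeled_name and group_by_worker are hypothetical stand-ins, not the PR's _parse_role_from_filename or aggregate_power_per_worker, and the exact regex is an assumption based on the perf_samples_<role>_w<worker_idx>_<host>.csv pattern.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Optional, Tuple

# ASSUMPTION: anchored pattern for perf_samples_<role>_w<worker_idx>_<host>.csv
_LABELED = re.compile(
    r"^perf_samples_(prefill|decode|agg|frontend)_w(\d+)_(.+)\.csv$"
)


def parse_labeled_name(path: str) -> Optional[Tuple[str, int, str]]:
    """Return (role, worker_idx, host), or None for unlabeled (old-format) names."""
    m = _LABELED.match(Path(path).name)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), m.group(3)


def group_by_worker(paths: Iterable[str]) -> Dict[Tuple[str, int], List[str]]:
    """Group labeled CSV paths by (role, worker_idx); unlabeled files are skipped."""
    groups: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for p in paths:
        parsed = parse_labeled_name(p)
        if parsed is None:
            continue  # unlabeled files would only feed the cluster-wide totals
        role, idx, _host = parsed
        groups[(role, idx)].append(p)
    return dict(groups)
```

A decode worker that spans four nodes produces four CSVs that all land under the same ("decode", idx) key, which is how a single workers entry can report num_gpus=16.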

Companion srt-slurm change

SemiAnalysisAI/srt-slurm c4c86dc: _start_perf_monitor derives role+worker_idx per node from backend_processes (which carries endpoint_mode and endpoint_index per Process) and labels output filenames. Allocator invariant — prefill and decode never share a node — means a single (role, worker_idx) per node is well-defined.

Files

  • utils/aggregate_power.py: new _parse_role_from_filename + aggregate_power_per_worker. patch_agg_result gains extras kwarg. run() computes role-split energy and per-worker breakdown when filenames are labeled.
  • utils/test_aggregate_power.py: 9 new tests.
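
As a rough illustration of how an optional extras argument can stay backward compatible, consider the sketch below. The function name, signature, and merge policy are all hypothetical; the real patch_agg_result signature is not shown in this PR description.

```python
from typing import Optional


def patch_agg_sketch(agg: dict, cluster: dict, extras: Optional[dict] = None) -> dict:
    """Merge cluster-wide metrics into the agg result, then any additive extras.

    ASSUMPTION: extras may only add new keys and never overrides a
    cluster-wide key, so legacy callers see unchanged output.
    """
    patched = dict(agg)
    patched.update(cluster)
    for key, value in (extras or {}).items():
        if key not in cluster:
            patched[key] = value
    return patched
```

With extras omitted or empty, the result contains only the original keys, which is the property the backward-compat tests are described as checking.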

Test plan

  • 77/77 tests pass (68 existing + 9 new — filename parsing, per-worker grouping, multi-node worker aggregation, mixed labeled/unlabeled inputs, disagg E2E with role split, agg E2E without disagg-only fields, bit-for-bit backward compat)
  • After parent PR merges and srt-slurm change lands, first real disagg sweep on gb300-cw emits workers array + prefill_avg_power_w / decode_avg_power_w / joules_per_input_token / joules_per_output_token_decode in the agg JSON
  • InferenceX-app stage (separate PR): extend benchmark-transform.ts to surface workers array; add prefill/decode chart legend

App-side follow-up

ETL auto-captures any numeric field, so the scalar role-split fields (prefill_avg_power_w etc.) land in metrics JSONB automatically. The nested workers array needs explicit handling — that's a separate PR on InferenceX-app covering: type extension, benchmark-transform.ts mapper, chart legend with prefill/decode swatches.

Limitation (V1)

When multiple workers of the same role share a node (uncommon — happens in single-node multi-worker setups, e.g. 2 decode workers per 8-GPU node), perfmon labels by the lowest worker_idx and the aggregator currently attributes that node's full power to the lowest-idx worker. The other colocated workers would not appear in the workers list. For the GB300 multi-node disagg case (worker spans 1+ nodes, never shares), this isn't an issue.
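
One way a consumer of the agg JSON could notice this limitation is to compare the workers list against the worker counts it expects. The helper below is a hypothetical sketch, not part of this PR, and expected_counts is an assumed input that the caller would have to supply from the run config.

```python
from typing import Dict, List, Tuple


def missing_workers(
    workers: List[dict], expected_counts: Dict[str, int]
) -> List[Tuple[str, int]]:
    """List (role, worker_idx) pairs that were expected but have no workers entry."""
    seen = {(w["role"], w["worker_idx"]) for w in workers}
    return sorted(
        (role, idx)
        for role, count in expected_counts.items()
        for idx in range(count)
        if (role, idx) not in seen
    )
```

For a node hosting two decode workers, only worker 0 would be reported, so the helper would return the pair for decode worker 1.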


Note

Low Risk
Changes are confined to optional benchmark power post-processing and agg JSON fields; cluster-wide metrics stay the same for legacy CSV names, with broad test coverage.

Overview
Extends aggregate_power so multinode perf CSVs named perf_samples_<role>_w<idx>_<host>.csv (srt-slurm perfmon) drive per-worker power in the agg JSON, on top of unchanged cluster-wide avg_power_w / joules fields.

aggregate_power_per_worker keeps cluster totals via existing aggregate_power, groups labeled files by (role, worker_idx) (multi-node workers merge multiple CSVs), and returns a workers list. run() switches to that API; patch_agg_result accepts optional extras. When both prefill and decode workers are present, the agg JSON also gets role-weighted prefill_avg_power_w / decode_avg_power_w, joules_per_input_token, and joules_per_output_token_decode. Unlabeled legacy CSVs still patch only the original three keys (no workers).

Nine new tests cover filename parsing, grouping, mixed labeled/unlabeled inputs, and disagg vs agg vs old-format E2E behavior.

Reviewed by Cursor Bugbot for commit f951aef. Bugbot is set up for automated code reviews on this repo.

… joules

Layers per-worker breakdown on top of the cluster-wide multinode
aggregation in the parent PR #1574. New agg JSON fields (additive — all
existing keys preserved bit-for-bit for backward compat):

  workers: [{role, worker_idx, num_gpus, avg_power_w}, ...]
    role ∈ "prefill" / "decode" / "agg" / "frontend". Each (role, idx)
    aggregates across all CSVs for that worker — a multi-node TP=16
    decode worker on 4 nodes produces one workers entry with num_gpus=16.

  prefill_avg_power_w, decode_avg_power_w  (disagg only)
    Weighted per-GPU averages within each role.

  joules_per_input_token         = prefill_energy / total_input_tokens
  joules_per_output_token_decode = decode_energy  / total_output_tokens
    Disagg-only role-split metrics. Existing joules_per_output_token and
    joules_per_total_token keep their cluster-wide semantics so the chart
    won't shift on existing data.

Worker → CSV mapping is by filename: srt-slurm's perfmon (companion change
on SemiAnalysisAI/srt-slurm c4c86dc) writes
`perf_samples_<role>_w<worker_idx>_<host>.csv`. Unlabeled filenames (old
single-CSV format) silently emit empty workers list and skip the role
split — cluster-wide metrics unchanged in that case.

77/77 tests pass (68 existing + 9 new — per-worker grouping, multi-node
worker aggregation, mixed labeled/unlabeled inputs, disagg E2E with role
split, agg E2E omitting disagg-only fields, bit-for-bit backward compat
for old-format callers).
Contributor

@claude claude Bot left a comment


LGTM — additive per-worker breakdown on top of existing cluster-wide aggregation, backward compatible for legacy CSV names, well-covered by 9 new tests.

Extended reasoning...

Overview

Extends utils/aggregate_power.py with per-worker power attribution: a new _parse_role_from_filename helper, a new aggregate_power_per_worker function that wraps the existing aggregator and groups CSVs by (role, worker_idx) parsed from perf_samples_<role>_w<idx>_<host>.csv, a new optional extras kwarg to patch_agg_result, and an extended run() that emits a workers array plus role-split scalars (prefill_avg_power_w, decode_avg_power_w, joules_per_input_token, joules_per_output_token_decode) when both prefill and decode workers are present. utils/test_aggregate_power.py adds 9 tests.

Security risks

None. This is offline post-processing of NVIDIA/AMD SMI CSVs produced by the benchmark harness; no network calls, no auth, no user-controlled input. Filename regex is anchored and narrow.

Level of scrutiny

Low. This touches only optional benchmark power telemetry — not the inference, serving, or scheduling hot paths. The aggregator is best-effort and returns 0 on every error path so monitoring failures cannot break benchmark uploads. Cluster-wide fields (avg_power_w, joules_per_output_token, joules_per_total_token) keep their existing semantics and old-format unlabeled CSVs produce bit-for-bit identical output.
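
The best-effort behavior described here can be sketched as a thin wrapper. This is an assumption-laden illustration: it reads "returns 0" as the process exit code, and best_effort_main is a hypothetical name rather than a function from utils/aggregate_power.py.

```python
import sys
from typing import Callable, List


def best_effort_main(run: Callable[[List[str]], None], argv: List[str]) -> int:
    """Run the aggregation step; log and swallow any failure, always return 0."""
    try:
        run(argv)
    except Exception as exc:  # deliberately broad: telemetry must never block uploads
        print(f"[aggregate_power] skipped: {exc}", file=sys.stderr)
    return 0
```

The trade-off of this pattern is that failures only show up in logs, so missing power fields in the agg JSON are the main signal that something went wrong.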

Other factors

  • Tests cover filename parsing (labeled / multi-digit idx / old format), per-worker grouping with multi-node decode worker aggregation, mixed labeled+unlabeled inputs, disagg E2E with role-split joules, agg E2E without disagg-only fields, and legacy single-CSV bit-for-bit compat.
  • The V1 limitation (multiple workers of the same role colocated on one node collapse to lowest worker_idx) is documented in the PR description and does not affect the target gb300-cw multi-node disagg case.
  • Bug hunter found no issues.
