[NVIDIA] Reduce B200 Runs & add B200 FP4 Docker Script#35

Merged: functionstackx merged 5 commits into main from reduce-b200-runs on Sep 22, 2025

@kimbochen
Collaborator

  • Copied commands from the non-TRT Slurm scripts into the Docker scripts
  • Reduced the B200 TP lists to [1, 8] for the baseline and low-latency scenarios
  • Removed the b200 labels from the NV Slurm runners
  • Validating the B200 run here

@functionstackx changed the title from "Reduce B200 Runs" to "Reduce B200 Runs & add B200 FP4 Docker Script" on Sep 22, 2025
@functionstackx merged commit 27f29ac into main on Sep 22, 2025
@cquil11 added the NVIDIA label on Apr 8, 2026
@cquil11 changed the title from "Reduce B200 Runs & add B200 FP4 Docker Script" to "[NVIDIA] Reduce B200 Runs & add B200 FP4 Docker Script" on Apr 8, 2026
arygupt added a commit that referenced this pull request May 28, 2026
… runs

Builds on PR #1558 (single-node measured-power) for multinode benchmarks
via srt-slurm. Pipeline:

  srt-slurm perfmon (per-node nvidia-smi sampling; PR #35 on
    NVIDIA/srt-slurm, layered on SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon)
  -> perf_samples_<host>.csv in outputs/<job>/logs/ on shared NFS
  -> launch_gb300-cw.sh exports GPU_METRICS_CSV_GLOB to $GITHUB_ENV
  -> process_result.py expands the glob and hands the list to
     aggregate_power.run()
  -> aggregate_power.py namespaces local GPU indices per source CSV stem so
     each node's local indices 0..N-1 don't collide across nodes; emits
     cluster-wide avg_power_w + joules_per_*_token
  -> InferenceX-app ETL auto-captures the numeric fields (no schema change)
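
The per-source namespacing step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in aggregate_power.py: the function name `namespace_gpu_ids`, the `<stem>:<index>` key format, and the in-memory sample layout are all hypothetical. Only the behaviour comes from the commit message: indices are namespaced by CSV stem, and only when there are 2+ sources.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def namespace_gpu_ids(
    samples_by_source: Dict[Path, Iterable[Tuple[int, float]]],
) -> Dict[str, List[float]]:
    """Group power samples under GPU ids that are unique across nodes.

    samples_by_source maps each per-node CSV path to (local_gpu_index, power_w)
    pairs. The "<stem>:<index>" key format is an assumption for illustration.
    """
    # With a single source the local index is already unique, so it is left
    # untouched and single-node results keep their original GPU ids.
    multi_source = len(samples_by_source) >= 2
    power_by_gpu: Dict[str, List[float]] = defaultdict(list)
    for path, samples in samples_by_source.items():
        for gpu_index, power_w in samples:
            key = f"{path.stem}:{gpu_index}" if multi_source else str(gpu_index)
            power_by_gpu[key].append(power_w)
    return dict(power_by_gpu)


if __name__ == "__main__":
    grouped = namespace_gpu_ids(
        {
            Path("perf_samples_nodeA.csv"): [(0, 100.0), (1, 110.0)],
            Path("perf_samples_nodeB.csv"): [(0, 200.0)],
        }
    )
    # Both nodes report a local GPU 0, yet they stay distinct after namespacing.
    print(len(grouped), sorted(grouped))
```

Without the prefix, GPU 0 on every node would be merged into a single series and the cluster-wide GPU count would be understated.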

Changes:

- utils/aggregate_power.py: widen csv_path to Path | Iterable[Path] keeping
  the original param name. Per-source GPU-id namespacing only kicks in when
  there are 2+ sources so single-node num_gpus is unchanged. CLI adds
  --csv-glob (Python-side glob, mutually exclusive with --csv).
- utils/process_result.py: bridge GPU_METRICS_CSV_GLOB env var. Glob takes
  precedence over single GPU_METRICS_CSV when both are set.
- runners/launch_gb300-cw.sh: point dynamo-sglang at our srt-slurm fork,
  append `monitoring:` block to each recipe post-copy (idempotent), and
  write GPU_METRICS_CSV_GLOB to $GITHUB_ENV after the job for the
  downstream Process result step.
- 8 new multinode tests in test_aggregate_power.py (per-source namespacing,
  sub-second clock drift, asymmetric prefill/decode power, missing-CSV
  silent skip, backward-compat single-path-in-list, Iterable acceptance,
  E2E run with list). 3 new in test_process_result.py (glob aggregation,
  precedence over single CSV, empty-match falls through). 64/64 pass.
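
The input widening and the environment-variable precedence described above might look roughly like this. It is a hedged sketch, not the actual utils code: `normalize_csv_paths` and `resolve_csv_sources` are hypothetical names, and reading "empty-match falls through" as "fall back to GPU_METRICS_CSV" is an assumption. Only the two variable names and the glob-over-single precedence come from the commit message.

```python
import glob
import os
from pathlib import Path
from typing import Iterable, List, Mapping, Union

PathLike = Union[str, Path]


def normalize_csv_paths(csv_path: Union[PathLike, Iterable[PathLike]]) -> List[Path]:
    """Accept one path or an iterable of paths and always return a list."""
    # A str is itself iterable, so the single-path case is checked first.
    if isinstance(csv_path, (str, Path)):
        return [Path(csv_path)]
    return [Path(p) for p in csv_path]


def resolve_csv_sources(env: Mapping[str, str] = os.environ) -> List[Path]:
    """Pick CSV sources from the environment, preferring the glob."""
    pattern = env.get("GPU_METRICS_CSV_GLOB")
    if pattern:
        matches = sorted(glob.glob(pattern))  # sorted for a stable source order
        if matches:
            return normalize_csv_paths(matches)
    # Assumption: an unset or non-matching glob falls back to the single CSV.
    single = env.get("GPU_METRICS_CSV")
    return normalize_csv_paths(single) if single else []


if __name__ == "__main__":
    print(resolve_csv_sources({"GPU_METRICS_CSV": "gpu_metrics.csv"}))
```

Passing the environment as a mapping keeps the precedence rule testable without mutating `os.environ`.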

Verified the data format end-to-end on gb300 hardware: nvidia-smi
inside the sglang container emits the columns aggregate_power.py needs:
timestamp, gpu, power_w.

3 participants