Perf tests by shanmugamr1992 · Pull Request #4917 · NVIDIA/Megatron-LM

shanmugamr1992 · 2026-05-21T18:48:18Z

Summary

Consolidates inference throughput coverage for the 583M dynamic-batching DP=8 config into the perf-tests harness. Coverage previously lived in two places:

1. tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/ plus a dedicated pytest at tests/functional_tests/python_test_utils/test_inference_regular_pipeline.py. This entry was already commented out in tests/test_utils/recipes/h100/gpt-dynamic-inference-with-coordinator.yaml, so it was dormant in CI.
2. tests/performance_tests/test_cases/gpt/gpt_583m_perf/ (DP=1): the new harness, but exercising only the single-GPU path.

This PR drops (1), upgrades (2) to DP=8, and folds the rule that pytest enforced (≤ 10% slower / ≤ 20% faster on throughput) into compare_to_baseline.py. End result: one place to look for 583M throughput regressions.
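The two-sided rule described above can be sketched as follows. This is a minimal illustration, not the code in compare_to_baseline.py: the function name, the `lower_tolerance_pct` parameter, and the message strings are assumptions made for the example; only the 10% / 20% bounds and the "refresh baseline" wording come from the PR text.

```python
def check_throughput(measured: float, baseline: float,
                     lower_tolerance_pct: float = 10.0,
                     upper_tolerance_pct: float = 20.0) -> tuple[bool, str]:
    """Return (passed, message) for a two-sided throughput check.

    Hypothetical helper: fails when throughput drops by more than
    lower_tolerance_pct, and also when it improves by more than
    upper_tolerance_pct, so that a stale baseline gets refreshed.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    delta_pct = (measured - baseline) / baseline * 100.0
    if delta_pct < -lower_tolerance_pct:
        return False, f"regression: {delta_pct:+.1f}% vs baseline"
    if delta_pct > upper_tolerance_pct:
        return False, f"refresh baseline: {delta_pct:+.1f}% vs baseline"
    return True, f"ok: {delta_pct:+.1f}% vs baseline"


if __name__ == "__main__":
    # 5643 tok/s is the batch_128 baseline quoted in the commit message.
    for measured in (5000.0, 5643.0, 6700.0, 6800.0):
        print(measured, check_throughput(measured, 5643.0))
```

Failing on large improvements looks odd at first, but it keeps the regression floor meaningful: if the baseline is far below current performance, a real 10% regression would still pass.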

I, the PR author, have personally reviewed every line of this PR.

What does this PR do ?

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code Typing guidelines
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

… GPT 16B

Adds tests/performance_tests/, a server+client benchmark harness that spins up `tools.run_dynamic_text_generation_server`, sweeps batch sizes against the OpenAI-compatible `/v1/completions` endpoint, and compares throughput / avg/p50/p99 latency / TPOT against checked-in baseline_values.json with a 10% tolerance. Vendored from /Users/shanmugamr/inference-bench and adapted to work inside the existing cog + nemo-run CI flow.

Three test cases shipped with baselines captured on cw-dfw H100:
- gpt/gpt_583m_perf: 4 batches (1, 8, 32, 128), ~5 min
- hybrid/hybrid_2b_perf: 4 batches (1, 8, 32, 128), ~8 min
- gpt/gpt_16b_perf: 3 batches (1, 8, 32), ~13 min

CI recipes in tests/test_utils/recipes/h100/{gpt,hybrid}-perf.yaml trigger on every MR via the existing mr-github scope (alongside functional tests).

Skill + slash command for human-driven runs live under skills/run-performance-tests/ and .claude/commands/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
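To make the compared metrics concrete, here is a rough sketch of how per-request measurements could be reduced to throughput, avg/p50/p99 latency, and TPOT. The `RequestResult` type, the nearest-rank percentile, and the TPOT definition (end-to-end latency divided by output tokens, which is consistent with the figures quoted later in this PR, e.g. 201 s over 2048 tokens giving about 98 ms/tok) are assumptions for illustration and may differ from what the harness actually does.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestResult:
    latency_s: float      # end-to-end latency of one request, in seconds
    output_tokens: int    # number of generated tokens for that request


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(results: list[RequestResult], wall_time_s: float) -> dict[str, float]:
    """Aggregate one batch-size run into the metrics a baseline would store."""
    latencies = [r.latency_s for r in results]
    total_tokens = sum(r.output_tokens for r in results)
    return {
        "throughput_tok_s": total_tokens / wall_time_s,
        "avg_latency_ms": 1000.0 * sum(latencies) / len(latencies),
        "p50_latency_ms": 1000.0 * percentile(latencies, 50),
        "p99_latency_ms": 1000.0 * percentile(latencies, 99),
        # Assumed definition: per-request latency / output tokens, averaged.
        "tpot_ms": 1000.0 * sum(r.latency_s / r.output_tokens for r in results) / len(results),
    }


if __name__ == "__main__":
    demo = [RequestResult(2.0, 100), RequestResult(4.0, 100)]
    print(summarize(demo, wall_time_s=4.0))
```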

These were committed by mistake in 1404dbe; they're local-only Claude Code workflow files and should not live in the repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dd +20% upper bound

Before this commit, throughput coverage for the dynamic-inference 583M dp=8 config lived in two places:
- tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/ (with a pytest at tests/functional_tests/python_test_utils/test_inference_regular_pipeline.py enforcing throughput within 10% and +20% upper-bound)
- tests/performance_tests/test_cases/gpt/gpt_583m_perf/ (the new perf harness, DP=1, throughput-only regression check)

The functional-test entry was already commented out in tests/test_utils/recipes/h100/gpt-dynamic-inference-with-coordinator.yaml, so it had been dormant in CI. To keep all perf tests in one place, this commit:
1. Deletes the dormant functional test directory + its commented recipe stub.
2. Switches the perf-harness gpt_583m_perf test from DP=1 to DP=8 so it exercises the same multi-rank dynamic-batching + ZMQ coordinator path the deleted test used to cover. Re-recorded its baseline_values.json on cw-dfw H100 (batch_128 → 5643 tok/s, p99 latency 2901 ms, tpot 22.7 ms/tok).
3. Adds a +20% upper-bound throughput check to tests/performance_tests/shell_test_utils/compare_to_baseline.py, matching the rule the deleted pytest applied (improvements beyond UPPER_TOLERANCE_PCT fail with "refresh baseline" so the regression floor doesn't go stale). New `UPPER_TOLERANCE_PCT` config field, default 20.

Test-case changes:
- DELETED: tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/
- DELETED: commented entry from tests/test_utils/recipes/h100/gpt-dynamic-inference-with-coordinator.yaml
- MODIFIED: tests/performance_tests/test_cases/gpt/gpt_583m_perf/model_config.yaml (DP 1 → 8; same batch sweep 1,8,32,128; same ISL/OSL 512/128)
- MODIFIED: tests/performance_tests/test_cases/gpt/gpt_583m_perf/baseline_values.json (re-recorded for DP=8 on cw-dfw H100, 2026-05-19)

Recipe changes:
- SPLIT: tests/test_utils/recipes/h100/gpt-perf.yaml now holds 1-GPU tests only (gpt_16b_perf). 8-GPU tests move to the new tests/test_utils/recipes/h100/gpt-perf-dp8.yaml (gpt_583m_perf at gpus: 8, ntasks-per-node: 1).

Runner changes (tests/performance_tests/shell_test_utils/run_perf_test.sh):
- Added optional EP schema field in model_config.yaml. world_size = TP * PP * max(DP, EP). EP defaults to 1 so existing dense configs (583m, 2b, 16b) are unaffected.
- Reordered the torchrun arg vector so SERVER_COMMON_ARGS come before MODEL_ARGS. With argparse's last-wins semantics, this lets a model's .args file override default server flags (e.g. --inference-dynamic-batching-buffer-size-gb, --inference-dynamic-batching-max-requests) without having to edit the runner.

Verified (cw-dfw H100, 2026-05-19): gpt_583m_perf (DP=8), hybrid_2b_perf, gpt_16b_perf. All three pass compare_to_baseline.py against their checked-in baselines within tolerance.

Not addressed in this PR (filed as follow-ups in skills/run-performance-tests/SKILL.md):
- mem-max-allocated-bytes regression check (deleted pytest had ±5% on this; bringing it back via this harness needs a server-side /v1/stats endpoint or server.log parser).
- Real-prompt-from-disk mode (deleted pytest loaded sharegpt-vicuna; our client uses synthetic "hello " * N prompts).
- nanov3 perf coverage: 7 bootstrap attempts on cw-dfw via cog all hit a Triton incompatibility in megatron.core.inference.moe.permute kernels (`'constexpr_type' object has no attribute 'is_ptr'`) against the Triton 3.6.0 shipped in the cog dev image. The existing nanov3 functional test uses a different CI-built image. Will revisit once a Triton-compatible image is available.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
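The two runner changes in this commit message can be illustrated with a short sketch. It is not the logic in run_perf_test.sh (which is a shell script); the function names and the argument types are assumptions, and the flag values are made up. It shows the world-size formula as stated and why placing common server args before model args lets the model's values win under argparse.

```python
import argparse


def world_size(tp: int, pp: int, dp: int, ep: int = 1) -> int:
    """world_size = TP * PP * max(DP, EP); EP defaults to 1 for dense configs."""
    return tp * pp * max(dp, ep)


def parse_server_args(server_common_args: list[str],
                      model_args: list[str]) -> argparse.Namespace:
    """Parse common args first, then model args.

    For a plain (store) option that appears more than once, argparse keeps
    the last value, so anything in model_args overrides the common default.
    The int/float types below are assumptions for this example.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--inference-dynamic-batching-max-requests", type=int)
    parser.add_argument("--inference-dynamic-batching-buffer-size-gb", type=float)
    return parser.parse_args(server_common_args + model_args)


if __name__ == "__main__":
    print(world_size(tp=1, pp=1, dp=8))          # dense DP=8 config
    print(world_size(tp=1, pp=1, dp=1, ep=8))    # EP larger than DP
    ns = parse_server_args(
        ["--inference-dynamic-batching-max-requests", "128"],   # illustrative default
        ["--inference-dynamic-batching-max-requests", "512"],   # illustrative override
    )
    print(ns.inference_dynamic_batching_max_requests)
```

With the previous ordering (model args first), the common default would have been the last occurrence and would have silently overridden the per-model value.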

copy-pr-bot · 2026-05-21T18:48:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-21T18:48:28Z

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

Add the oncall reviewer (optional reviewer)
Add required review teams based on your changes

See the contribution guide for more details.

Resolves 5 add/add conflicts on perf-tests scaffolding that landed separately on main (with the original DP=1 baseline) and on this branch (with DP=8 + +20% upper-bound check). Took OUR side for all five since this branch's content is the consolidation work.
- tests/performance_tests/shell_test_utils/compare_to_baseline.py
- tests/performance_tests/shell_test_utils/run_perf_test.sh
- tests/performance_tests/test_cases/gpt/gpt_583m_perf/baseline_values.json
- tests/performance_tests/test_cases/gpt/gpt_583m_perf/model_config.yaml
- tests/test_utils/recipes/h100/gpt-perf.yaml

Also retains the deletion of tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/ and the corresponding commented stub removal in tests/test_utils/recipes/h100/gpt-dynamic-inference-with-coordinator.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…back)

Reviewer feedback on PR NVIDIA#4917: MoE models should not benchmark with synthetic `"hello " * N` prompts. Identical token IDs route to the same expert subset every step, so router/dispatcher load is uniform → misleading throughput. Same concern applies to mamba/hybrid models where SSM state evolution is content-sensitive.

Changes:
- Vendor 256 gsm8k prompts at tests/performance_tests/client/data/gsm8k_prompts.jsonl (from openai/gsm8k test split, ~64 KB). Keeps the test data-independent of cluster-side mounts.
- Add `--dataset {synthetic,gsm8k}` flag to static_benchmark.py. gsm8k mode cycles through the vendored prompts deterministically across requests/iterations and ignores --num-input-tokens (prompts have natural variable length). results.json now records `num_input_tokens_avg` (measured) and `dataset` instead of the assumed fixed-length field.
- Add `DATASET` field to model_config.yaml schema (default: synthetic for back-compat). Wired through run_perf_test.sh.
- Switch gpt_16b_perf and hybrid_2b_perf to `DATASET: gsm8k`. gpt_583m_perf (dense) stays on synthetic, since the MoE/hybrid concern doesn't apply.
- Re-record baselines for gpt_16b_perf and hybrid_2b_perf on cw-dfw H100. The new MoE numbers tell the real story: gpt_16b at batch=32 measures 317 tok/s with 101 ms/tok (vs the prior synthetic 1356 tok/s @ 24 ms/tok, a ~4× artifact from uniform expert routing).

Also restores the .pth-based mamba-ssm shim in run_perf_test.sh that was inadvertently reverted by the merge from main: the older PYTHONPATH-prepend form shadowed the cog venv's nvrx 0.6.0 with /opt/venv's 0.6.0.dev69 and broke hybrid model load. The .pth approach appends /opt/venv *after* the venv on sys.path, so newer cog-venv packages win.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
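As a rough picture of "cycles through the vendored prompts deterministically across requests/iterations", the sketch below picks prompts by a running index modulo the dataset size. It is not taken from static_benchmark.py: the function names and the `prompt` JSONL field are hypothetical, and the real file layout may differ.

```python
import json
from typing import Iterable


def load_prompts(jsonl_lines: Iterable[str], field: str = "prompt") -> list[str]:
    """Parse one JSON object per non-empty line and pull out the prompt text.

    The field name is an assumption; adjust it to the actual file schema.
    """
    return [json.loads(line)[field] for line in jsonl_lines if line.strip()]


def pick_prompts(prompts: list[str], batch_size: int, iteration: int) -> list[str]:
    """Select a batch deterministically: no RNG, just a wrapping offset.

    Iteration k of a given batch size always sees the same prompts, so
    repeated runs are comparable while requests in a batch still differ.
    """
    start = iteration * batch_size
    return [prompts[(start + i) % len(prompts)] for i in range(batch_size)]


if __name__ == "__main__":
    lines = ['{"prompt": "a"}', '{"prompt": "b"}', '{"prompt": "c"}']
    prompts = load_prompts(lines)
    print(pick_prompts(prompts, batch_size=2, iteration=0))
    print(pick_prompts(prompts, batch_size=2, iteration=1))
```

Because prompt lengths vary, a client built this way would have to measure the average input length rather than assume it, which matches the switch to a measured `num_input_tokens_avg` field described above.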

Reviewer feedback on PR NVIDIA#4917: OSL=128 lets prefill dominate the per-token average and inflates TPOT; the production MoE workload (Nemo-RL rollouts) is decode-heavy. Bumping to OSL=2048 to reflect that regime.

Recorded fresh baseline on cw-dfw H100 (1 GPU, gsm8k prompts, ~1h 12m total wall time across batches 1/8/32). The actual finding is mild:

OSL=128  batch=32 -> 317 tok/s, tpot=101 ms/tok
OSL=2048 batch=32 -> 326 tok/s, tpot= 98 ms/tok

TPOT only dropped 3 ms/tok. Prefill *was* a small share of total work (ISL ~60 tokens vs OSL 128), and ~95-100 ms/tok is the actual steady-state MoE decode rate, not a prefill artifact. The longer OSL still gives a cleaner decode-only signal, which is what we want for catching regressions in the expert-dispatch path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New 8-GPU inference perf test mirroring the nanov3 3B hybrid MoE functional config (chunked prefill + local CUDA graphs + nvls dispatcher + inference_optimized transformer impl + mamba state dtypes fp32). Uses gsm8k prompts since both the mamba state path and the MoE router/dispatcher are content-sensitive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…flag list)

Reviewer asked to add to gpt_16b:
--transformer-impl inference_optimized
--inference-grouped-gemm-backend vllm
--inference-moe-token-dispatcher-type nvls
--moe-shared-expert-overlap
--cuda-graph-impl local
--cuda-graph-scope full_iteration_inference
--inference-dynamic-batching-num-cuda-graphs -1
--enable-chunked-prefill

Only the last one is compatible with this model:
- `inference_optimized` rejects --swiglu, but the 16B deepseek checkpoint requires SwiGLU (encoded in its FFN weight layout). Dropping --swiglu would silently corrupt model outputs.
- vllm-backend MoE / nvls dispatcher / shared-expert overlap all *require* inference_optimized, so they're out by transitive closure.
- CUDA graph capture at full_iteration_inference scope crashes because the alltoall MoE dispatcher (transformer_engine path) does `d2h_event.synchronize()` inside its forward, which is illegal during stream capture.

Comment block in gpt_16b.args records the incompatibility so reviewers don't re-request these flags. The inference_optimized stack will become available for this model once it supports SwiGLU (groundwork exists, see commit 6815c0f "Combine GEMM + SwiGLU fused MLP PRs", but the route isn't wired through inference_optimized yet).

Re-recorded baseline on cw-dfw H100. Chunked prefill is a ~no-op in our decode-heavy regime (OSL=2048, avg ISL=60 means decode is 97% of work): batch=1 10.5 -> 10.2 tok/s (-3%), batch=8 83.4 -> 80.9 (-3%), batch=32 325.7 -> 320.7 (-1.5%), all within run-to-run noise. Keeping the flag for the prefill-warm-up portion and for prompt-mix tests we might add later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
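The arithmetic behind the "97% of work" and percentage figures above can be reproduced with a few lines. The helper names are made up for this example, and treating token counts as a proxy for work is an assumption (prefill processes input tokens in parallel, so its share of wall time is likely smaller still).

```python
def decode_share(avg_isl: float, osl: float) -> float:
    """Fraction of processed tokens that are decode (output) tokens."""
    return osl / (avg_isl + osl)


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # OSL=2048 with an average ISL of 60 tokens, as quoted above.
    print(f"decode share: {decode_share(60, 2048):.1%}")

    # Throughput before/after enabling chunked prefill (tok/s).
    for batch, before, after in [(1, 10.5, 10.2), (8, 83.4, 80.9), (32, 325.7, 320.7)]:
        print(f"batch={batch}: {pct_change(before, after):+.1f}%")
```

The computed deltas (about -2.9%, -3.0%, and -1.5%) agree with the rounded values in the commit message.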

Drops the per-CI-run wall time for gpt_16b_perf from ~1h 12m down to ~10m without losing the steady-state decode signal. The numbers barely move:

OSL=2048 batch=32: 326 tok/s, tpot=98 ms/tok, avg_lat=201 s
OSL=256  batch=32: 325 tok/s, tpot=99 ms/tok, avg_lat= 25 s

Prefill (avg ISL=60 in gsm8k mode) is amortized over 256 output tokens, which is enough that the TPOT delta from OSL=2048 is well inside run-to-run noise (~1 ms). The decode-dominated regime the reviewer cares about is already visible at OSL=256.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

shanmugamr1992 · 2026-05-22T01:07:28Z

/ok to test e1b02c7

svcnvidia-nemo-ci · 2026-05-22T15:43:27Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26297439694

shanmugamr1992 and others added 8 commits May 14, 2026 12:21

Merge branch 'main' into perf-tests

b90fdcd

Updated to fix errors

08e3c68

Merge branch 'main' into perf-tests

1e73e1e

Making mr run only on merge in slurm clusters

cce0734

Merge branch 'main' into perf-tests

ce2c8c9

chore: untrack run-performance-tests skill and slash command

b571da0

These were committed by mistake in 1404dbe; they're local-only Claude Code workflow files and should not live in the repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

shanmugamr1992 requested a review from a team as a code owner May 21, 2026 18:48

svcnvidia-nemo-ci marked this pull request as draft May 21, 2026 18:48

shanmugamr1992 marked this pull request as ready for review May 21, 2026 18:54

svcnvidia-nemo-ci requested a review from a team May 21, 2026 18:54

svcnvidia-nemo-ci added the complexity: medium label May 21, 2026

shanmugamr1992 and others added 5 commits May 21, 2026 12:32

shanmugamr1992 enabled auto-merge May 22, 2026 00:40

shanmugamr1992 disabled auto-merge May 22, 2026 00:45

chtruong814 approved these changes May 22, 2026

svcnvidia-nemo-ci added the Approved All necessary approvals have been made label May 22, 2026

perf-tests: add MIT attribution for vendored gsm8k prompt subset

e1b02c7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

copy-pr-bot Bot temporarily deployed to public May 22, 2026 01:07 Inactive

tdene approved these changes May 22, 2026

copy-pr-bot Bot temporarily deployed to test May 22, 2026 01:08 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 01:11 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 01:19 Inactive

shanmugamr1992 added this pull request to the merge queue May 22, 2026

sidsingh-nvidia approved these changes May 22, 2026

Merged via the queue into NVIDIA:main with commit 686aa8c May 22, 2026
80 checks passed

shanmugamr1992 deleted the perf-tests branch May 22, 2026 16:36

chtruong814 mentioned this pull request May 23, 2026

ci: restore perf test torchrun logs #4951

Merged

Perf tests #4917
shanmugamr1992 merged 15 commits into NVIDIA:main from shanmugamr1992:perf-tests

5 participants