Perf tests #4917
Merged
… GPT 16B
Adds tests/performance_tests/, a server+client benchmark harness that spins
up `tools.run_dynamic_text_generation_server`, sweeps batch sizes against the
OpenAI-compatible `/v1/completions` endpoint, and compares throughput,
avg/p50/p99 latency, and TPOT against checked-in baseline_values.json with a 10%
tolerance. Vendored from /Users/shanmugamr/inference-bench and adapted to
work inside the existing cog + nemo-run CI flow.
Three test cases shipped with baselines captured on cw-dfw H100:
- gpt/gpt_583m_perf: 4 batches (1, 8, 32, 128), ~5 min
- hybrid/hybrid_2b_perf: 4 batches (1, 8, 32, 128), ~8 min
- gpt/gpt_16b_perf: 3 batches (1, 8, 32), ~13 min
CI recipes in tests/test_utils/recipes/h100/{gpt,hybrid}-perf.yaml trigger on
every MR via the existing mr-github scope (alongside functional tests).
Skill + slash command for human-driven runs live under
skills/run-performance-tests/ and .claude/commands/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
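As a rough illustration of the latency summary described above (avg, p50, p99), here is a minimal sketch. The function names and the nearest-rank percentile method are assumptions made for this example; the harness may compute its percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (an assumed method)."""
    ordered = sorted(values)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize_latencies(latencies_ms):
    """Collapse per-request latencies into the avg/p50/p99 triple."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
    }


# Hypothetical per-request latencies for one batch-size sweep point.
summary = summarize_latencies([100.0, 200.0, 300.0, 400.0, 500.0])
```

With only a handful of requests per sweep point, p99 collapses to the maximum, which is one reason a tail-latency baseline can be noisier than the average.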
These were committed by mistake in 1404dbe; they're local-only Claude Code workflow files and should not live in the repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dd +20% upper bound
Before this commit, throughput coverage for the dynamic-inference 583M dp=8
config lived in two places:
- tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/
(with a pytest at tests/functional_tests/python_test_utils/test_inference_regular_pipeline.py
enforcing throughput within 10% and +20% upper-bound)
- tests/performance_tests/test_cases/gpt/gpt_583m_perf/ (the new perf harness,
DP=1, throughput-only regression check)
The functional-test entry was already commented out in
tests/test_utils/recipes/h100/gpt-dynamic-inference-with-coordinator.yaml, so
it had been dormant in CI. To keep all perf tests in one place, this commit:
1. Deletes the dormant functional test directory + its commented recipe stub.
2. Switches the perf-harness gpt_583m_perf test from DP=1 to DP=8 so it
exercises the same multi-rank dynamic-batching + ZMQ coordinator path the
deleted test used to cover. Re-recorded its baseline_values.json on cw-dfw
H100 (batch_128 → 5643 tok/s, p99 latency 2901 ms, tpot 22.7 ms/tok).
3. Adds a +20% upper-bound throughput check to
tests/performance_tests/shell_test_utils/compare_to_baseline.py, matching
the rule the deleted pytest applied (improvements beyond UPPER_TOLERANCE_PCT
fail with "refresh baseline" so the regression floor doesn't go stale).
New `UPPER_TOLERANCE_PCT` config field, default 20.
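The two-sided rule in step 3 can be sketched as below. This is not the code in compare_to_baseline.py; the function name, signature, and message wording are hypothetical, and only the 10% floor, the 20% ceiling, and the "refresh baseline" outcome come from the description above.

```python
def check_throughput(measured, baseline, lower_tol_pct=10.0, upper_tol_pct=20.0):
    """Return (ok, message) for one throughput sample against its baseline.

    lower_tol_pct mirrors the 10% regression tolerance and upper_tol_pct
    mirrors UPPER_TOLERANCE_PCT (default 20); both names are illustrative.
    """
    floor = baseline * (1.0 - lower_tol_pct / 100.0)
    ceiling = baseline * (1.0 + upper_tol_pct / 100.0)
    if measured < floor:
        return False, f"regression: {measured:.1f} < floor {floor:.1f} tok/s"
    if measured > ceiling:
        # Too fast also fails, so a stale baseline cannot hide later regressions.
        return False, f"refresh baseline: {measured:.1f} > ceiling {ceiling:.1f} tok/s"
    return True, "within tolerance"


# Baseline from the batch_128 figure quoted above; measured values are made up.
ok_same, _ = check_throughput(5643.0, 5643.0)
ok_slow, msg_slow = check_throughput(5000.0, 5643.0)
ok_fast, msg_fast = check_throughput(7000.0, 5643.0)
```

Failing on large improvements looks odd at first, but it forces the baseline to be re-recorded, which keeps the 10% floor anchored to current performance.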
Test-case changes:
- DELETED: tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/
- DELETED: commented entry from
tests/test_utils/recipes/h100/gpt-dynamic-inference-with-coordinator.yaml
- MODIFIED: tests/performance_tests/test_cases/gpt/gpt_583m_perf/model_config.yaml
(DP 1 → 8; same batch sweep 1,8,32,128; same ISL/OSL 512/128)
- MODIFIED: tests/performance_tests/test_cases/gpt/gpt_583m_perf/baseline_values.json
(re-recorded for DP=8 on cw-dfw H100, 2026-05-19)
Recipe changes:
- SPLIT: tests/test_utils/recipes/h100/gpt-perf.yaml now holds 1-GPU tests
only (gpt_16b_perf). 8-GPU tests move to the new
tests/test_utils/recipes/h100/gpt-perf-dp8.yaml (gpt_583m_perf at
gpus: 8, ntasks-per-node: 1).
Runner changes (tests/performance_tests/shell_test_utils/run_perf_test.sh):
- Added optional EP schema field in model_config.yaml.
world_size = TP * PP * max(DP, EP). EP defaults to 1 so existing dense
configs (583m, 2b, 16b) are unaffected.
- Reordered the torchrun arg vector so SERVER_COMMON_ARGS come before
MODEL_ARGS. With argparse's last-wins semantics, this lets a model's
.args file override default server flags
(e.g. --inference-dynamic-batching-buffer-size-gb,
--inference-dynamic-batching-max-requests) without having to edit the
runner.
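Both runner changes above can be illustrated in a few lines of Python. The runner itself is a shell script, so this is only a sketch: the stand-in parser, the helper names, and the flag values 256 and 64 are assumptions, and TP=1, PP=1 for the dense DP=8 case is inferred from the deleted test's directory name.

```python
import argparse


def world_size(tp, pp, dp, ep=1):
    """World size per the rule above: TP * PP * max(DP, EP), EP defaulting to 1."""
    return tp * pp * max(dp, ep)


def resolve_max_requests(argv):
    """Parse a flag vector with a stand-in parser and return the resolved value."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--inference-dynamic-batching-max-requests", type=int)
    return parser.parse_args(argv).inference_dynamic_batching_max_requests


# Hypothetical values, for illustration only.
SERVER_COMMON_ARGS = ["--inference-dynamic-batching-max-requests", "256"]
MODEL_ARGS = ["--inference-dynamic-batching-max-requests", "64"]

dense_dp8 = world_size(tp=1, pp=1, dp=8)        # EP omitted, so DP decides
expert_par = world_size(tp=1, pp=1, dp=1, ep=8)  # EP larger than DP decides

# argparse keeps the last occurrence of a repeated option, so whichever
# list comes later in the argument vector wins.
new_order = resolve_max_requests(SERVER_COMMON_ARGS + MODEL_ARGS)
reverse_order = resolve_max_requests(MODEL_ARGS + SERVER_COMMON_ARGS)
```

With the server defaults first, the model's value (64 here) is the one that takes effect; in the reverse order the default (256) would silently override it.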
Verified (cw-dfw H100, 2026-05-19): gpt_583m_perf (DP=8), hybrid_2b_perf,
gpt_16b_perf — all three pass compare_to_baseline.py against their checked-in
baselines within tolerance.
Not addressed in this PR (filed as follow-ups in skills/run-performance-tests/SKILL.md):
- mem-max-allocated-bytes regression check (deleted pytest had ±5% on this;
bringing it back via this harness needs a server-side /v1/stats endpoint
or server.log parser).
- Real-prompt-from-disk mode (deleted pytest loaded sharegpt-vicuna; our
client uses synthetic "hello " * N prompts).
- nanov3 perf coverage: 7 bootstrap attempts on cw-dfw via cog all hit a
Triton incompatibility in megatron.core.inference.moe.permute kernels
(`'constexpr_type' object has no attribute 'is_ptr'`) against the
Triton 3.6.0 shipped in the cog dev image. The existing nanov3
functional test uses a different CI-built image. Will revisit once a
Triton-compatible image is available.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves 5 add/add conflicts on perf-tests scaffolding that landed separately on main (with the original DP=1 baseline) and on this branch (with DP=8 plus the +20% upper-bound check). Took OUR side for all five since this branch's content is the consolidation work.
- tests/performance_tests/shell_test_utils/compare_to_baseline.py
- tests/performance_tests/shell_test_utils/run_perf_test.sh
- tests/performance_tests/test_cases/gpt/gpt_583m_perf/baseline_values.json
- tests/performance_tests/test_cases/gpt/gpt_583m_perf/model_config.yaml
- tests/test_utils/recipes/h100/gpt-perf.yaml
Also retains the deletion of tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/ and the corresponding commented stub removal in tests/test_utils/recipes/h100/gpt-dynamic-inference-with-coordinator.yaml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…back)
Reviewer feedback on PR NVIDIA#4917: MoE models should not benchmark with synthetic `"hello " * N` prompts. Identical token IDs route to the same expert subset every step, so router/dispatcher load is uniform and the measured throughput is misleading. The same concern applies to mamba/hybrid models, where SSM state evolution is content-sensitive.
Changes:
- Vendor 256 gsm8k prompts at tests/performance_tests/client/data/gsm8k_prompts.jsonl (from the openai/gsm8k test split, ~64 KB). Keeps the test independent of cluster-side data mounts.
- Add `--dataset {synthetic,gsm8k}` flag to static_benchmark.py. gsm8k mode cycles through the vendored prompts deterministically across requests/iterations and ignores --num-input-tokens (prompts have natural variable length). results.json now records `num_input_tokens_avg` (measured) and `dataset` instead of the assumed fixed-length field.
- Add `DATASET` field to model_config.yaml schema (default: synthetic for back-compat). Wired through run_perf_test.sh.
- Switch gpt_16b_perf and hybrid_2b_perf to `DATASET: gsm8k`. gpt_583m_perf (dense) stays on synthetic; the MoE/hybrid concern doesn't apply.
- Re-record baselines for gpt_16b_perf and hybrid_2b_perf on cw-dfw H100. The new MoE numbers tell the real story: gpt_16b at batch=32 measures 317 tok/s with 101 ms/tok (vs the prior synthetic 1356 tok/s @ 24 ms/tok, a ~4× artifact from uniform expert routing).
Also restores the .pth-based mamba-ssm shim in run_perf_test.sh that was inadvertently reverted by the merge from main: the older PYTHONPATH-prepend form shadowed the cog venv's nvrx 0.6.0 with /opt/venv's 0.6.0.dev69 and broke hybrid model load. The .pth approach appends /opt/venv *after* the venv on sys.path, so newer cog-venv packages win.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
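One way the deterministic cycling through vendored prompts could work is sketched below. The helper name `select_prompts` and the indexing scheme are assumptions made for illustration; static_benchmark.py may order its requests differently.

```python
def select_prompts(prompts, batch_size, iteration):
    """Pick batch_size prompts for one iteration by walking the list in order.

    The starting offset depends only on (iteration, batch_size), so repeated
    runs send the same prompts in the same order, wrapping around when the
    list is exhausted.
    """
    start = iteration * batch_size
    return [prompts[(start + i) % len(prompts)] for i in range(batch_size)]


def average_input_tokens(token_counts):
    """Measured average prompt length, in the spirit of num_input_tokens_avg."""
    return sum(token_counts) / len(token_counts)


# Tiny stand-in for the 256 vendored gsm8k prompts.
prompts = ["p0", "p1", "p2"]
first = select_prompts(prompts, batch_size=2, iteration=0)
second = select_prompts(prompts, batch_size=2, iteration=1)
```

Because the selection has no random component, a re-run of the same test case exercises the same token sequences, which keeps the baseline comparison meaningful.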
Reviewer feedback on PR NVIDIA#4917: OSL=128 lets prefill dominate the per-token average and inflates TPOT; the production MoE workload (Nemo-RL rollouts) is decode-heavy. Bumping to OSL=2048 to reflect that regime.
Recorded fresh baseline on cw-dfw H100 (1 GPU, gsm8k prompts, ~1h 12m total wall time across batches 1/8/32). The actual finding is mild:
  OSL=128  batch=32 -> 317 tok/s, tpot=101 ms/tok
  OSL=2048 batch=32 -> 326 tok/s, tpot= 98 ms/tok
TPOT only dropped 3 ms/tok. Prefill *was* a small share of total work (ISL ~60 tokens vs OSL 128), and ~95-100 ms/tok is the actual steady-state MoE decode rate, not a prefill artifact. The longer OSL still gives a cleaner decode-only signal, which is what we want for catching regressions in the expert-dispatch path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New 8-GPU inference perf test mirroring the nanov3 3B hybrid MoE functional config (chunked prefill + local CUDA graphs + nvls dispatcher + inference_optimized transformer impl + mamba state dtypes fp32). Uses gsm8k prompts since both the mamba state path and the MoE router/dispatcher are content-sensitive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…flag list)
Reviewer asked to add to gpt_16b:
--transformer-impl inference_optimized
--inference-grouped-gemm-backend vllm
--inference-moe-token-dispatcher-type nvls
--moe-shared-expert-overlap
--cuda-graph-impl local
--cuda-graph-scope full_iteration_inference
--inference-dynamic-batching-num-cuda-graphs -1
--enable-chunked-prefill
Only the last one is compatible with this model:
- `inference_optimized` rejects --swiglu, but the 16B deepseek checkpoint
requires SwiGLU (encoded in its FFN weight layout). Dropping --swiglu
would silently corrupt model outputs.
- vllm-backend MoE / nvls dispatcher / shared-expert overlap all *require*
inference_optimized, so they're out by transitive closure.
- CUDA graph capture at full_iteration_inference scope crashes because the
alltoall MoE dispatcher (transformer_engine path) does
`d2h_event.synchronize()` inside its forward, which is illegal during
stream capture.
Comment block in gpt_16b.args records the incompatibility so reviewers don't
re-request these flags. The inference_optimized stack will become available
for this model once it supports SwiGLU (groundwork exists — see commit
6815c0f "Combine GEMM + SwiGLU fused MLP PRs" — but the route isn't wired
through inference_optimized yet).
Re-recorded baseline on cw-dfw H100. Chunked prefill is a ~no-op in our
decode-heavy regime (OSL=2048, avg ISL=60 means decode is 97% of work):
batch=1 10.5 -> 10.2 tok/s (-3%), batch=8 83.4 -> 80.9 (-3%),
batch=32 325.7 -> 320.7 (-1.5%) — all within run-to-run noise. Keeping the
flag for the prefill-warm-up portion and for prompt-mix tests we might add
later.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
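The arithmetic behind the "decode is 97% of work" and percent-change figures above can be reproduced as follows. Counting tokens is used here as a proxy for work, which is an assumption: prefill and decode tokens do not cost the same on the GPU.

```python
def decode_share(isl, osl):
    """Fraction of processed tokens that are decode tokens (token-count proxy)."""
    return osl / (isl + osl)


def pct_change(old, new):
    """Signed percent change from old to new."""
    return (new - old) / old * 100.0


share = decode_share(isl=60, osl=2048)

# Throughput in tok/s before and after enabling chunked prefill, as quoted above.
deltas = {
    1: pct_change(10.5, 10.2),
    8: pct_change(83.4, 80.9),
    32: pct_change(325.7, 320.7),
}
```

With the numbers quoted in the commit message, the decode share comes out just above 0.97 and the three deltas are roughly -3%, -3%, and -1.5%.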
Drops the per-CI-run wall time for gpt_16b_perf from ~1h 12m down to ~10m without losing the steady-state decode signal. The numbers barely move:
  OSL=2048 batch=32: 326 tok/s, tpot=98 ms/tok, avg_lat=201 s
  OSL=256  batch=32: 325 tok/s, tpot=99 ms/tok, avg_lat= 25 s
Prefill (avg ISL=60 in gsm8k mode) is amortized over 256 output tokens, which is enough that the TPOT delta from OSL=2048 is well inside run-to-run noise (~1 ms). The decode-dominated regime the reviewer cares about is already visible at OSL=256.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
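The three reported metrics are tied together by simple arithmetic, which makes a handy sanity check on a results file. The sketch below assumes TPOT is end-to-end request latency divided by output tokens and throughput is total output tokens per second across the batch; these definitions are inferred from the quoted numbers, not confirmed against the harness.

```python
def tpot_ms(avg_latency_s, osl):
    """Time per output token in ms, assuming TPOT = latency / output tokens."""
    return avg_latency_s * 1000.0 / osl


def throughput_tok_s(batch, osl, avg_latency_s):
    """Output tokens per second across a batch of concurrent requests."""
    return batch * osl / avg_latency_s


# Figures quoted above for batch=32; latencies are rounded in the source.
long_tpot = tpot_ms(201.0, 2048)
long_tput = throughput_tok_s(32, 2048, 201.0)
short_tpot = tpot_ms(25.0, 256)
short_tput = throughput_tok_s(32, 256, 25.0)
```

Under these assumed definitions the OSL=2048 row reproduces to within rounding (about 98 ms/tok and 326 tok/s), and the OSL=256 row lands within a few percent, limited by the latency being quoted to the nearest second.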
chtruong814 approved these changes on May 22, 2026
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ok to test e1b02c7
tdene approved these changes on May 22, 2026
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26297439694
sidsingh-nvidia approved these changes on May 22, 2026
Summary
Consolidates inference throughput coverage for the 583M dynamic-batching DP=8 config into the perf-tests harness. Coverage previously lived in two places:
1. tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/ plus a dedicated pytest at tests/functional_tests/python_test_utils/test_inference_regular_pipeline.py. This was already commented out in tests/test_utils/recipes/h100/gpt-dynamic-inference-with-coordinator.yaml, so it was dormant in CI.
2. tests/performance_tests/test_cases/gpt/gpt_583m_perf/ (DP=1): the new harness, but exercising only the single-GPU path.
This PR drops (1), upgrades (2) to DP=8, and folds the rule that pytest enforced (≤ 10% slower / ≤ 20% faster on throughput) into compare_to_baseline.py. End result: one place to look for 583M throughput regressions.