Perf tests #4917

Merged
shanmugamr1992 merged 15 commits into NVIDIA:main from shanmugamr1992:perf-tests
May 22, 2026


@shanmugamr1992 shanmugamr1992 commented May 21, 2026

Summary

Consolidates inference throughput coverage for the 583M dynamic-batching DP=8 config into the perf-tests harness. Coverage previously lived in two places:

  1. tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/ + a dedicated pytest at tests/functional_tests/python_test_utils/test_inference_regular_pipeline.py. Its recipe entry was already commented out in tests/test_utils/recipes/h100/gpt-dynamic-inference-with-coordinator.yaml, so it was dormant in CI.
  2. tests/performance_tests/test_cases/gpt/gpt_583m_perf/ (DP=1) — the new harness, but exercising only the single-GPU path.

This PR drops (1), upgrades (2) to DP=8, and folds the rule that pytest enforced (≤ 10% slower / ≤ 20% faster on throughput) into compare_to_baseline.py. End result: one place to look for 583M throughput regressions.

  • I, the PR author, have personally reviewed every line of this PR.

What does this PR do?

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

  • New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
  • Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

shanmugamr1992 and others added 8 commits May 14, 2026 12:21
… GPT 16B

Adds tests/performance_tests/, a server+client benchmark harness that spins
up `tools.run_dynamic_text_generation_server`, sweeps batch sizes against the
OpenAI-compatible `/v1/completions` endpoint, and compares throughput,
avg/p50/p99 latency, and TPOT against the checked-in baseline_values.json
with a 10% tolerance. Vendored from /Users/shanmugamr/inference-bench and
adapted to work inside the existing cog + nemo-run CI flow.

Three test cases shipped with baselines captured on cw-dfw H100:
- gpt/gpt_583m_perf: 4 batches (1, 8, 32, 128), ~5 min
- hybrid/hybrid_2b_perf: 4 batches (1, 8, 32, 128), ~8 min
- gpt/gpt_16b_perf: 3 batches (1, 8, 32), ~13 min

CI recipes in tests/test_utils/recipes/h100/{gpt,hybrid}-perf.yaml trigger on
every MR via the existing mr-github scope (alongside functional tests).

Skill + slash command for human-driven runs live under
skills/run-performance-tests/ and .claude/commands/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

These were committed by mistake in 1404dbe; they're local-only Claude
Code workflow files and should not live in the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dd +20% upper bound

Before this commit, throughput coverage for the dynamic-inference 583M dp=8
config lived in two places:

  - tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/
    (with a pytest at tests/functional_tests/python_test_utils/test_inference_regular_pipeline.py
    enforcing throughput within 10% and +20% upper-bound)
  - tests/performance_tests/test_cases/gpt/gpt_583m_perf/ (the new perf harness,
    DP=1, throughput-only regression check)

The functional-test entry was already commented out in
tests/test_utils/recipes/h100/gpt-dynamic-inference-with-coordinator.yaml, so
it had been dormant in CI. To keep all perf tests in one place, this commit:

1. Deletes the dormant functional test directory + its commented recipe stub.
2. Switches the perf-harness gpt_583m_perf test from DP=1 to DP=8 so it
   exercises the same multi-rank dynamic-batching + ZMQ coordinator path the
   deleted test used to cover. Re-recorded its baseline_values.json on cw-dfw
   H100 (batch_128 → 5643 tok/s, p99 latency 2901 ms, tpot 22.7 ms/tok).
3. Adds a +20% upper-bound throughput check to
   tests/performance_tests/shell_test_utils/compare_to_baseline.py, matching
   the rule the deleted pytest applied (improvements beyond UPPER_TOLERANCE_PCT
   fail with "refresh baseline" so the regression floor doesn't go stale).
   New `UPPER_TOLERANCE_PCT` config field, default 20.
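The two-sided rule in item 3 can be sketched as follows. This is a minimal illustration and not the code in compare_to_baseline.py: `check_throughput`, `ThroughputVerdict`, and the `lower_tolerance_pct` parameter are hypothetical names. Only the 10% floor, the 20% ceiling, and the "refresh baseline" failure message come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class ThroughputVerdict:
    ok: bool
    delta_pct: float
    reason: str


def check_throughput(measured, baseline, lower_tolerance_pct=10.0, upper_tolerance_pct=20.0):
    """Two-sided check: fail on regressions and on improvements that outgrow the baseline."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    # Signed change relative to the baseline, in percent (negative means slower).
    delta_pct = (measured - baseline) * 100.0 / baseline
    if delta_pct < -lower_tolerance_pct:
        return ThroughputVerdict(False, delta_pct, "regression")
    if delta_pct > upper_tolerance_pct:
        # A large speedup also fails so the regression floor is re-anchored.
        return ThroughputVerdict(False, delta_pct, "refresh baseline")
    return ThroughputVerdict(True, delta_pct, "within tolerance")
```

Failing on large improvements looks counterintuitive, but without it a baseline recorded before a big speedup would tolerate a later regression of the same size.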

Test-case changes:

  - DELETED: tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/
  - DELETED: commented entry from
    tests/test_utils/recipes/h100/gpt-dynamic-inference-with-coordinator.yaml
  - MODIFIED: tests/performance_tests/test_cases/gpt/gpt_583m_perf/model_config.yaml
    (DP 1 → 8; same batch sweep 1,8,32,128; same ISL/OSL 512/128)
  - MODIFIED: tests/performance_tests/test_cases/gpt/gpt_583m_perf/baseline_values.json
    (re-recorded for DP=8 on cw-dfw H100, 2026-05-19)

Recipe changes:

  - SPLIT: tests/test_utils/recipes/h100/gpt-perf.yaml now holds 1-GPU tests
    only (gpt_16b_perf). 8-GPU tests move to the new
    tests/test_utils/recipes/h100/gpt-perf-dp8.yaml (gpt_583m_perf at
    gpus: 8, ntasks-per-node: 1).

Runner changes (tests/performance_tests/shell_test_utils/run_perf_test.sh):

  - Added optional EP schema field in model_config.yaml.
    world_size = TP * PP * max(DP, EP). EP defaults to 1 so existing dense
    configs (583m, 2b, 16b) are unaffected.
  - Reordered the torchrun arg vector so SERVER_COMMON_ARGS come before
    MODEL_ARGS. With argparse's last-wins semantics, this lets a model's
    .args file override default server flags
    (e.g. --inference-dynamic-batching-buffer-size-gb,
    --inference-dynamic-batching-max-requests) without having to edit the
    runner.
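Both runner changes can be illustrated in a few lines of Python. This is a sketch under stated assumptions: the real runner is a shell script, so `world_size` and `parse_server_args` are hypothetical helpers, the parser is a stand-in for the server's own argument parser, and the flag values and types are invented for the example. The flag names and the `TP * PP * max(DP, EP)` rule are taken from the commit message.

```python
import argparse


def world_size(tp, pp, dp, ep=1):
    """Number of ranks to launch: TP * PP * max(DP, EP), with EP defaulting to 1."""
    return tp * pp * max(dp, ep)


def parse_server_args(server_common_args, model_args):
    """Parse defaults followed by per-model overrides with a stand-in parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--inference-dynamic-batching-max-requests", type=int)
    parser.add_argument("--inference-dynamic-batching-buffer-size-gb", type=float)
    # Defaults come first; when a flag repeats, argparse keeps the last value.
    return parser.parse_args(server_common_args + model_args)


# Hypothetical values, chosen only to show the override behaviour.
defaults = [
    "--inference-dynamic-batching-max-requests", "256",
    "--inference-dynamic-batching-buffer-size-gb", "20",
]
overrides = ["--inference-dynamic-batching-max-requests", "64"]
parsed = parse_server_args(defaults, overrides)
```

Here `parsed.inference_dynamic_batching_max_requests` takes the override while the buffer size keeps its default, which is the behaviour the reordering relies on.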

Verified (cw-dfw H100, 2026-05-19): gpt_583m_perf (DP=8), hybrid_2b_perf,
gpt_16b_perf — all three pass compare_to_baseline.py against their checked-in
baselines within tolerance.

Not addressed in this PR (filed as follow-ups in skills/run-performance-tests/SKILL.md):
  - mem-max-allocated-bytes regression check (deleted pytest had ±5% on this;
    bringing it back via this harness needs a server-side /v1/stats endpoint
    or server.log parser).
  - Real-prompt-from-disk mode (deleted pytest loaded sharegpt-vicuna; our
    client uses synthetic "hello " * N prompts).
  - nanov3 perf coverage: 7 bootstrap attempts on cw-dfw via cog all hit a
    Triton incompatibility in megatron.core.inference.moe.permute kernels
    (`'constexpr_type' object has no attribute 'is_ptr'`) against the
    Triton 3.6.0 shipped in the cog dev image. The existing nanov3
    functional test uses a different CI-built image. Will revisit once a
    Triton-compatible image is available.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@shanmugamr1992 shanmugamr1992 requested a review from a team as a code owner May 21, 2026 18:48

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft May 21, 2026 18:48
@github-actions
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

Resolves 5 add/add conflicts on perf-tests scaffolding that landed
separately on main (with the original DP=1 baseline) and on this branch
(with DP=8 + +20% upper-bound check). Took OUR side for all five since
this branch's content is the consolidation work.

  - tests/performance_tests/shell_test_utils/compare_to_baseline.py
  - tests/performance_tests/shell_test_utils/run_perf_test.sh
  - tests/performance_tests/test_cases/gpt/gpt_583m_perf/baseline_values.json
  - tests/performance_tests/test_cases/gpt/gpt_583m_perf/model_config.yaml
  - tests/test_utils/recipes/h100/gpt-perf.yaml

Also retains the deletion of tests/functional_tests/test_cases/gpt/
gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/ and the
corresponding commented stub removal in
tests/test_utils/recipes/h100/gpt-dynamic-inference-with-coordinator.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shanmugamr1992 shanmugamr1992 marked this pull request as ready for review May 21, 2026 18:54
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team May 21, 2026 18:54
shanmugamr1992 and others added 5 commits May 21, 2026 12:32
…back)

Reviewer feedback on PR NVIDIA#4917: MoE models should not benchmark with
synthetic `"hello " * N` prompts. Identical token IDs route to the same
expert subset every step, so router/dispatcher load is uniform → misleading
throughput. Same concern applies to mamba/hybrid models where SSM state
evolution is content-sensitive.
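A toy model makes the routing concern concrete. It is a deliberately simplified sketch: `toy_route` and `expert_load` are hypothetical, and the router here picks experts from the token ID alone, whereas a real MoE router scores hidden states. It only shows why a prompt made of one repeated token cannot spread load across experts.

```python
from collections import Counter


def toy_route(token_id, num_experts=8, top_k=2):
    """Toy deterministic router: choose top_k adjacent experts from the token ID."""
    first = token_id % num_experts
    return tuple((first + i) % num_experts for i in range(top_k))


def expert_load(token_ids, num_experts=8, top_k=2):
    """Count how many tokens each expert receives under the toy router."""
    load = Counter()
    for token_id in token_ids:
        load.update(toy_route(token_id, num_experts, top_k))
    return [load.get(expert, 0) for expert in range(num_experts)]


# One repeated token (like a "hello " * N prompt) versus varied token IDs.
repeated_load = expert_load([7] * 64)
varied_load = expert_load(list(range(64)))
```

With the repeated token only two of the eight toy experts ever receive work, while the varied sequence touches all of them.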

Changes:

  - Vendor 256 gsm8k prompts at tests/performance_tests/client/data/gsm8k_prompts.jsonl
    (from openai/gsm8k test split, ~64 KB). Keeps the test data-independent
    of cluster-side mounts.

  - Add `--dataset {synthetic,gsm8k}` flag to static_benchmark.py. gsm8k
    mode cycles through the vendored prompts deterministically across
    requests/iterations and ignores --num-input-tokens (prompts have natural
    variable length). results.json now records `num_input_tokens_avg`
    (measured) and `dataset` instead of the assumed fixed-length field.

  - Add `DATASET` field to model_config.yaml schema (default: synthetic for
    back-compat). Wired through run_perf_test.sh.

  - Switch gpt_16b_perf and hybrid_2b_perf to `DATASET: gsm8k`.
    gpt_583m_perf (dense) stays on synthetic — the MoE/hybrid concern
    doesn't apply.

  - Re-record baselines for gpt_16b_perf and hybrid_2b_perf on cw-dfw H100.
    The new MoE numbers tell the real story: gpt_16b at batch=32 measures
    317 tok/s with 101 ms/tok (vs the prior synthetic 1356 tok/s @ 24 ms/tok
    — ~4× artifact from uniform expert routing).
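The deterministic cycling described for gsm8k mode could look roughly like this. It is a sketch, not the code in static_benchmark.py: the helper names, the `"prompt"` JSONL key, and the whitespace tokenizer are all assumptions made for illustration.

```python
import json


def load_prompts(jsonl_lines):
    """Parse JSONL records; the "prompt" key is an assumed field name."""
    return [json.loads(line)["prompt"] for line in jsonl_lines if line.strip()]


def batch_prompts(prompts, batch_size, iteration):
    """Pick the prompts for one iteration, wrapping around the prompt list."""
    start = (iteration * batch_size) % len(prompts)
    return [prompts[(start + i) % len(prompts)] for i in range(batch_size)]


def avg_input_tokens(prompts, tokenize=str.split):
    """Mean measured prompt length; whitespace split stands in for a real tokenizer."""
    return sum(len(tokenize(prompt)) for prompt in prompts) / len(prompts)


# Tiny in-memory stand-in for the vendored JSONL file.
sample_lines = ['{"prompt": "a b c"}', '{"prompt": "d e"}', '{"prompt": "f"}']
sample_prompts = load_prompts(sample_lines)
```

Because the selection depends only on the iteration index and batch size, two runs of the same config send the same prompts in the same order, which keeps the baseline comparison meaningful.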

Also restores the .pth-based mamba-ssm shim in run_perf_test.sh that was
inadvertently reverted by the merge from main: the older PYTHONPATH-prepend
form shadowed the cog venv's nvrx 0.6.0 with /opt/venv's 0.6.0.dev69 and
broke hybrid model load. The .pth approach appends /opt/venv *after* the
venv on sys.path, so newer cog-venv packages win.
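The ordering effect behind the shim can be reproduced in isolation. The sketch below uses temporary directories and a hypothetical module named `fakepkg` instead of the real venvs and nvrx; the directory names only mirror the two locations mentioned above. It relies on the standard behaviour that the first search-path entry containing a module wins, and that entries from `.pth` files are appended to `sys.path` rather than prepended.

```python
import tempfile
from importlib.machinery import PathFinder
from pathlib import Path


def make_pkg_dir(root, name, version):
    """Create a directory holding a one-line fakepkg module with the given version."""
    directory = Path(root) / name
    directory.mkdir()
    (directory / "fakepkg.py").write_text(f"__version__ = {version!r}\n")
    return str(directory)


def winner(search_path):
    """Return the name of the directory that fakepkg resolves from for this search path."""
    spec = PathFinder.find_spec("fakepkg", search_path)
    return Path(spec.origin).parent.name


with tempfile.TemporaryDirectory() as root:
    cog_venv = make_pkg_dir(root, "cog_venv", "0.6.0")
    opt_venv = make_pkg_dir(root, "opt_venv", "0.6.0.dev69")
    # PYTHONPATH-prepend form: the /opt/venv stand-in comes first and shadows the venv.
    prepended = winner([opt_venv, cog_venv])
    # .pth form: the /opt/venv stand-in is appended, so the venv copy wins.
    appended = winner([cog_venv, opt_venv])
```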

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewer feedback on PR NVIDIA#4917: OSL=128 lets prefill dominate the per-token
average and inflates TPOT; the production MoE workload (Nemo-RL rollouts) is
decode-heavy. Bumping to OSL=2048 to reflect that regime.

Recorded fresh baseline on cw-dfw H100 (1 GPU, gsm8k prompts, ~1h 12m total
wall time across batches 1/8/32). The actual finding is mild:

  OSL=128  batch=32  ->  317 tok/s, tpot=101 ms/tok
  OSL=2048 batch=32  ->  326 tok/s, tpot= 98 ms/tok

TPOT only dropped 3 ms/tok. Prefill *was* a small share of total work (ISL
~60 tokens vs OSL 128), and ~95-100 ms/tok is the actual steady-state MoE
decode rate, not a prefill artifact. The longer OSL still gives a cleaner
decode-only signal, which is what we want for catching regressions in the
expert-dispatch path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New 8-GPU inference perf test mirroring the nanov3 3B hybrid MoE
functional config (chunked prefill + local CUDA graphs + nvls dispatcher
+ inference_optimized transformer impl + mamba state dtypes fp32).
Uses gsm8k prompts since both the mamba state path and the MoE
router/dispatcher are content-sensitive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…flag list)

Reviewer asked to add to gpt_16b:
  --transformer-impl inference_optimized
  --inference-grouped-gemm-backend vllm
  --inference-moe-token-dispatcher-type nvls
  --moe-shared-expert-overlap
  --cuda-graph-impl local
  --cuda-graph-scope full_iteration_inference
  --inference-dynamic-batching-num-cuda-graphs -1
  --enable-chunked-prefill

Only the last one is compatible with this model:

  - `inference_optimized` rejects --swiglu, but the 16B deepseek checkpoint
    requires SwiGLU (encoded in its FFN weight layout). Dropping --swiglu
    would silently corrupt model outputs.
  - vllm-backend MoE / nvls dispatcher / shared-expert overlap all *require*
    inference_optimized, so they're out by transitive closure.
  - CUDA graph capture at full_iteration_inference scope crashes because the
    alltoall MoE dispatcher (transformer_engine path) does
    `d2h_event.synchronize()` inside its forward, which is illegal during
    stream capture.

Comment block in gpt_16b.args records the incompatibility so reviewers don't
re-request these flags. The inference_optimized stack will become available
for this model once it supports SwiGLU (groundwork exists — see commit
6815c0f "Combine GEMM + SwiGLU fused MLP PRs" — but the route isn't wired
through inference_optimized yet).

Re-recorded baseline on cw-dfw H100. Chunked prefill is a ~no-op in our
decode-heavy regime (OSL=2048, avg ISL=60 means decode is 97% of work):
batch=1 10.5 -> 10.2 tok/s (-3%), batch=8 83.4 -> 80.9 (-3%),
batch=32 325.7 -> 320.7 (-1.5%) — all within run-to-run noise. Keeping the
flag for the prefill-warm-up portion and for prompt-mix tests we might add
later.
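The percentages quoted above can be rechecked with a short calculation. The helper names are hypothetical. The decode share is computed from token counts, which is one plausible reading of "97% of work". The throughput pairs are the before/after numbers from this commit message.

```python
def decode_share(avg_isl, osl):
    """Fraction of processed tokens that are decode tokens."""
    return osl / (avg_isl + osl)


def pct_change(before, after):
    """Signed relative change in percent (negative means slower)."""
    return (after - before) / before * 100.0


# batch size -> (tok/s without chunked prefill, tok/s with chunked prefill)
chunked_prefill_runs = {1: (10.5, 10.2), 8: (83.4, 80.9), 32: (325.7, 320.7)}
deltas = {batch: round(pct_change(*pair), 1) for batch, pair in chunked_prefill_runs.items()}
share_at_osl_2048 = decode_share(60, 2048)
```

All three deltas fall between -1.5% and -3%, far inside the 10% regression tolerance, so enabling the flag does not change the pass/fail outcome.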

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops the per-CI-run wall time for gpt_16b_perf from ~1h 12m down to ~10m
without losing the steady-state decode signal. The numbers barely move:

  OSL=2048 batch=32:  326 tok/s, tpot=98 ms/tok, avg_lat=201 s
  OSL=256  batch=32:  325 tok/s, tpot=99 ms/tok, avg_lat= 25 s

Prefill (avg ISL=60 in gsm8k mode) is amortized over 256 output tokens,
which is enough that the TPOT delta from OSL=2048 is well inside run-to-run
noise (~1 ms). The decode-dominated regime the reviewer cares about is
already visible at OSL=256.
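As a consistency check on the two rows above, average latency in a decode-dominated run should be close to OSL × TPOT. The sketch below assumes prefill and queueing time are negligible; `estimated_latency_s` is a hypothetical helper, and the inputs are the batch=32 numbers reported in this commit message.

```python
def estimated_latency_s(osl, tpot_ms):
    """Approximate per-request latency in seconds when decode dominates."""
    return osl * tpot_ms / 1000.0


# (OSL, TPOT in ms/tok, reported average latency in seconds) at batch=32
reported = [(2048, 98, 201), (256, 99, 25)]
estimates = [estimated_latency_s(osl, tpot) for osl, tpot, _ in reported]
gaps = [abs(est - lat) for est, (_, _, lat) in zip(estimates, reported)]
```

Both estimates land within a second of the reported averages, which supports the claim that the shorter run still measures steady-state decode.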

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@shanmugamr1992 shanmugamr1992 enabled auto-merge May 22, 2026 00:40
@shanmugamr1992 shanmugamr1992 disabled auto-merge May 22, 2026 00:45
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved label (All necessary approvals have been made) May 22, 2026
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shanmugamr1992
Contributor Author

/ok to test e1b02c7

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26297439694

Merged via the queue into NVIDIA:main with commit 686aa8c May 22, 2026
80 checks passed
@shanmugamr1992 shanmugamr1992 deleted the perf-tests branch May 22, 2026 16:36

Labels

Approved (All necessary approvals have been made), complexity: medium

5 participants