
[None][feat] Minimax RMS norm optimization #12163

Merged
syuoni merged 20 commits into NVIDIA:main from jmydurant:user/mingyangj/minimax_opt_rebase
Apr 20, 2026

Conversation

jmydurant (Collaborator) commented Mar 12, 2026

This PR optimizes MiniMax M2 Q/K RMSNorm in tensor-parallel attention.

Previously, after qkv_proj, each rank only owned a local shard [N, D / tp]. To perform RMSNorm over the full Q/K hidden dimension, the implementation first all-gathered local shards
into a full [N, D] tensor, applied RMSNorm, and then sliced the result back to each rank. This introduced unnecessary communication and temporary full-tensor materialization.

This PR adds a dedicated MiniMax allreduce RMS kernel that keeps computation on local shards. Each rank computes the per-token sum of squares for its [N, D / tp] shard, all-reduces these partial sums across TP ranks, and then applies RMSNorm locally using the rank-local gamma shard. This shrinks the communication volume from full Q/K activations down to per-token scalar sums and removes the allgather -> full RMSNorm -> reshard path.

Main changes:

  • Add a dedicated CUDA kernel for MiniMax allreduce RMS and fused Q+K RMS.
  • Add PyTorch custom-op bindings for single-input RMS and fused Q+K RMS.
  • Integrate the new path into MiniMaxM2 TP attention Q/K norm.
  • Load Q/K RMSNorm weights as TP-local shards.
  • Add unit tests and a microbenchmark.

Benchmark results on 4x B200, ISL/OSL 2k/256, concurrency 10:

method          total throughput (tokens/s)
origin TP       4643.2
new TP          7088.5
attention DP    5791.7

The new TP path is roughly 1.53x the origin-TP baseline.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@jmydurant jmydurant force-pushed the user/mingyangj/minimax_opt_rebase branch from 096ec75 to 201717c Compare March 27, 2026 05:19
@jmydurant jmydurant force-pushed the user/mingyangj/minimax_opt_rebase branch 2 times, most recently from 049daba to d406d02 Compare April 9, 2026 07:49
@jmydurant jmydurant marked this pull request as ready for review April 9, 2026 07:51
@jmydurant jmydurant requested review from a team as code owners April 9, 2026 07:51
jmydurant (Collaborator, Author):

/bot help

github-actions (Bot) commented Apr 9, 2026:

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

jmydurant (Collaborator, Author):

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator):

PR_Github #42503 [ run ] triggered by Bot. Commit: d406d02 Link to invocation

coderabbitai (Bot, Contributor) commented Apr 9, 2026

📝 Walkthrough

Walkthrough

This pull request introduces a new MiniMax collective all-reduce operation for RMS normalization using Lamport-style cross-rank synchronization. It adds CUDA kernels, PyTorch bindings, a distributed module wrapper, integration into the MiniMaxM2 attention layer, benchmarks, and unit tests to support both single-tensor and dual Q+K tensor paths.

Changes

Cohort / File(s) Summary
CUDA Kernel Implementation
cpp/tensorrt_llm/kernels/communicationKernels/MiniMaxReduceRMSKernel.cu, MiniMaxReduceRMSKernel.h
New CUDA kernels (minimax_reduce_rms_kernel_lamport and minimax_reduce_qk_rms_kernel_lamport_float4) implementing RMS normalization with variance allreduce via Lamport synchronization, volatile global loads, warp/block-level reductions, and device utilities for RMS reciprocal operations. Includes host-side launchers for cluster/stream configuration and dtype/rank-based kernel selection.
PyTorch Bindings
cpp/tensorrt_llm/thop/allreduceOp.cpp
Added two new torch operators (minimax_allreduce_rms and minimax_allreduce_rms_qk) with CUDA implementations. Includes kernel dispatch logic constructing parameter structs, runtime validation for tensor rank/contiguity/dtype matching, and dimension constraints for the Q+K variant.
Distributed Module API
tensorrt_llm/_torch/distributed/__init__.py, tensorrt_llm/_torch/distributed/ops.py
New MiniMaxAllReduceRMS nn.Module exposing forward() for single tensors and forward_qk() for dual Q+K paths, wrapping workspace allocation and torch operation dispatch.
Model Integration
tensorrt_llm/_torch/models/modeling_minimaxm2.py
Replaced prior allgather-based QK RMS normalization with new MiniMaxRMSNorm module using MiniMaxAllReduceRMS for tensor-parallel configurations. Includes new imports for tensor-parallel utilities and distributed collectives.
Testing & Benchmarking
tests/microbenchmarks/minimax_all_reduce.py, tests/unittest/_torch/multi_gpu/test_allreduce.py
Added comprehensive benchmark script measuring latency across tensor shapes with CUDA graph capture and MPI coordination. Added test cases validating single-tensor and Q+K RMS normalization correctness against reference implementations with ~0.2 relative tolerance in bfloat16.
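The "Lamport-style" synchronization the walkthrough mentions boils down to: pre-fill a shared slot per rank with a sentinel bit pattern, have each rank publish its partial value, and spin-read until no slot still holds the sentinel — no separate barrier is needed. A toy single-process sketch, with Python threads standing in for TP ranks (the real kernel does this with volatile global loads and NaN/negative-zero sentinels; the names here are illustrative):

```python
import threading
import numpy as np

SENTINEL = float("-inf")  # stand-in for the sentinel bit pattern a real Lamport kernel uses

tp = 4
buf = np.full((tp,), SENTINEL)  # one shared slot per rank

def rank(r, local_sq_sum, results):
    buf[r] = local_sq_sum              # 1. publish this rank's partial sum
    while np.any(buf == SENTINEL):     # 2. spin until every rank has published
        pass
    results[r] = buf.sum()             # 3. all ranks now read the same total

vals = [1.0, 2.0, 3.0, 4.0]
results = [None] * tp
threads = [threading.Thread(target=rank, args=(r, vals[r], results)) for r in range(tp)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert all(x == 10.0 for x in results)  # every "rank" saw the full reduction
```

The appeal over a flag-then-data protocol is that the data write itself doubles as the readiness signal, saving one round of synchronization per reduction.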

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 24.64%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check — ⚠️ Warning: The PR description lacks the required template sections. The author provided context about the optimization but did not fill in Description or Test Coverage, or complete the PR Checklist. Resolution: add a substantive 'Description' section, list all added tests under 'Test Coverage', and verify the checklist items.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed: The title '[None][feat] Minimax RMS norm optimization' is related to the changeset, which implements RMS norm optimization kernels and distributed collective support, but lacks specificity about the main innovation (cross-rank synchronization via Lamport clock).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai (Bot, Contributor) left a comment:

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tensorrt_llm/kernels/communicationKernels/MiniMaxReduceRMSKernel.cu`:
- Around line 1-7: This new CUDA source (MiniMaxReduceRMSKernel.cu) is missing
the required NVIDIA copyright/SPDX file header; add the standard NVIDIA header
block at the very top of MiniMaxReduceRMSKernel.cu (before any `#include`),
including the current year, "NVIDIA CORPORATION" copyright line and the
SPDX-License-Identifier (e.g., Apache-2.0) per the repo guideline so the file
matches other TensorRT-LLM sources.

In `@cpp/tensorrt_llm/kernels/communicationKernels/MiniMaxReduceRMSKernel.h`:
- Around line 1-68: This header (containing MiniMaxReduceRMSParams and
minimax_reduce_rms_op in namespace kernels::minimax_ar) is new and must include
the required NVIDIA OSS file header; add the NVIDIA copyright/SPDX header block
at the top of the file (with the correct latest modification year and SPDX
identifier) before the `#pragma` once so the file complies with the project coding
guidelines.

In `@cpp/tensorrt_llm/thop/allreduceOp.cpp`:
- Around line 1837-1848: The kernel currently assumes rms_gamma is BF16 but
accepts any norm_weight dtype; add a runtime dtype guard that rejects non-BF16
norm_weight before constructing MiniMaxReduceRMSParams to avoid silent mis-typed
gamma (check norm_weight.scalar_type() and return/throw a clear error if not
torch::kBFloat16), and apply the same guard for the analogous assignment blocks
around the other allreduce params region (the block using
allreduce_params.rms_gamma/_k at ~1867-1898) so both entrypoints refuse non-BF16
gamma until the kernel supports other types.

In `@tensorrt_llm/_torch/distributed/__init__.py`:
- Around line 5-8: The export block in
tensorrt_llm/_torch/distributed/__init__.py is not sorted and fails pre-commit;
run isort (or manually sort alphabetically) on the from .ops import (...) line
so the imported names (AllReduce, AllReduceParams, AllReduceStrategy,
HelixAllToAllNative, MiniMaxAllReduceRMS, MoEAllReduce, MoEAllReduceParams,
all_to_all_4d, all_to_all_5d, allgather, alltoall_helix, cp_allgather,
reducescatter, userbuffers_allreduce_finalize) are in the linter-expected order
and update the single import line accordingly.

In `@tests/unittest/_torch/multi_gpu/test_allreduce.py`:
- Around line 900-903: The test test_minimax_allreduce_rms_qk currently forces
mpi_pool_executor=4 but lacks a guard; add a pytest skip condition to the test
so it only runs when at least 4 GPUs are visible (e.g., use
pytest.mark.skipif(torch.cuda.device_count() < 4, reason="requires 4 GPUs")) and
ensure torch is imported at top of the test file; apply this to the parametrized
decorator that sets mpi_pool_executor so fixture setup won't fail on smaller
runners.
- Around line 715-725: The current reference path computes rms_norm only over
the local hidden slice (after reshape to [total_tokens, tp_size, local_hidden])
so it misses the cross-rank reduction; change the reference computation to
perform normalization over the full hidden dimension (tp_size * local_hidden)
before slicing back to the per-rank view: reshape input to [total_tokens, -1]
(or compute squared-sum/mean across the combined hidden dimension using
tensor_parallel_size * local_hidden), run rms_norm (or equivalent manual rms
calculation using rms_weights and eps) on that full-hidden tensor to produce a
global ref_output, then reshape to [total_tokens, tensor_parallel_size, -1],
cast to origin_dtype and finally take the slice ref_output[:,
tensor_parallel_rank, :] so the reference includes the cross-rank reduction;
update uses of rms_norm, ref_output, input and rms_weights accordingly.
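The fix the reviewer describes — normalize over the full hidden dimension (tp_size * local_hidden), then slice back to the per-rank view — can be sketched as follows (hypothetical helper names; eps and shapes are illustrative):

```python
import numpy as np

def ref_qk_rms(input_3d, rms_weights, tp_rank, eps=1e-6):
    # input_3d: [total_tokens, tp_size, local_hidden] -- the gathered shards.
    # Normalize over the FULL hidden dimension so the reference includes
    # the cross-rank reduction, then slice out this rank's view.
    t, tp_size, local_hidden = input_3d.shape
    full = input_3d.reshape(t, tp_size * local_hidden)
    var = np.mean(full * full, axis=-1, keepdims=True)
    ref = full / np.sqrt(var + eps) * rms_weights
    return ref.reshape(t, tp_size, local_hidden)[:, tp_rank, :]

def buggy_ref(input_3d, rms_weights, tp_rank, eps=1e-6):
    # The flagged version: normalizes only the local slice, so it
    # misses the variance contribution of the other ranks.
    s = input_3d[:, tp_rank, :]
    w = rms_weights.reshape(input_3d.shape[1], -1)[tp_rank]
    var = np.mean(s * s, axis=-1, keepdims=True)
    return s / np.sqrt(var + eps) * w

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 4, 32)).astype(np.float32)
w = rng.standard_normal(4 * 32).astype(np.float32)
good = ref_qk_rms(x, w, tp_rank=2)
bad = buggy_ref(x, w, tp_rank=2)
assert good.shape == bad.shape == (8, 32)
assert not np.allclose(good, bad)  # local-only stats diverge from global stats
```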

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a69d8bb5-91e6-4f32-815b-f00f4d262d37

📥 Commits

Reviewing files that changed from the base of the PR and between ce71620 and d406d02.

📒 Files selected for processing (8)
  • cpp/tensorrt_llm/kernels/communicationKernels/MiniMaxReduceRMSKernel.cu
  • cpp/tensorrt_llm/kernels/communicationKernels/MiniMaxReduceRMSKernel.h
  • cpp/tensorrt_llm/thop/allreduceOp.cpp
  • tensorrt_llm/_torch/distributed/__init__.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/models/modeling_minimaxm2.py
  • tests/microbenchmarks/minimax_all_reduce.py
  • tests/unittest/_torch/multi_gpu/test_allreduce.py

tensorrt-cicd (Collaborator):

PR_Github #42503 [ run ] completed with state SUCCESS. Commit: d406d02
/LLM/main/L0_MergeRequest_PR pipeline #33248 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

jmydurant (Collaborator, Author):

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator):

PR_Github #42663 [ run ] triggered by Bot. Commit: f995db2 Link to invocation

tensorrt-cicd (Collaborator):

PR_Github #42663 [ run ] completed with state SUCCESS. Commit: f995db2
/LLM/main/L0_MergeRequest_PR pipeline #33372 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@jmydurant jmydurant requested a review from a team as a code owner April 13, 2026 02:28
@jmydurant jmydurant requested a review from hyukn April 13, 2026 02:28
jmydurant (Collaborator, Author):

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator):

PR_Github #42930 [ run ] triggered by Bot. Commit: dc57cc7 Link to invocation

tensorrt-cicd (Collaborator):

PR_Github #42930 [ run ] completed with state SUCCESS. Commit: dc57cc7
/LLM/main/L0_MergeRequest_PR pipeline #33590 completed with status: 'SUCCESS'

CI Report

Link to invocation

hyukn (Collaborator) left a comment:

LGTM. Thanks a lot.

syuoni (Collaborator) left a comment:

LGTM

@jmydurant jmydurant force-pushed the user/mingyangj/minimax_opt_rebase branch from 328d56c to 8cf5272 Compare April 15, 2026 12:40
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
@jmydurant jmydurant force-pushed the user/mingyangj/minimax_opt_rebase branch from d1a546f to d912d1c Compare April 16, 2026 07:31
jmydurant (Collaborator, Author):

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator):

PR_Github #43710 [ run ] triggered by Bot. Commit: d912d1c Link to invocation

jmydurant (Collaborator, Author):

/bot kill

jmydurant (Collaborator, Author):

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator):

PR_Github #43784 [ kill ] triggered by Bot. Commit: d912d1c Link to invocation

tensorrt-cicd (Collaborator):

PR_Github #43785 [ run ] triggered by Bot. Commit: d912d1c Link to invocation

tensorrt-cicd (Collaborator):

PR_Github #43784 [ kill ] completed with state ABORTED. Commit: d912d1c

Link to invocation

tensorrt-cicd (Collaborator):

PR_Github #43710 [ run ] completed with state ABORTED. Commit: d912d1c

Link to invocation

jmydurant (Collaborator, Author):

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator):

PR_Github #43793 [ run ] triggered by Bot. Commit: d912d1c Link to invocation

tensorrt-cicd (Collaborator):

PR_Github #43793 [ run ] completed with state FAILURE. Commit: d912d1c
/LLM/main/L0_MergeRequest_PR pipeline #34271 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

jmydurant (Collaborator, Author):

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator):

PR_Github #43941 [ run ] triggered by Bot. Commit: d912d1c Link to invocation

tensorrt-cicd (Collaborator):

PR_Github #43941 [ run ] completed with state SUCCESS. Commit: d912d1c
/LLM/main/L0_MergeRequest_PR pipeline #34386 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

jmydurant (Collaborator, Author):

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator):

PR_Github #44086 [ run ] triggered by Bot. Commit: d912d1c Link to invocation

tensorrt-cicd (Collaborator):

PR_Github #44086 [ run ] completed with state SUCCESS. Commit: d912d1c
/LLM/main/L0_MergeRequest_PR pipeline #34514 completed with status: 'SUCCESS'

CI Report

Link to invocation

@syuoni syuoni merged commit a56a8d2 into NVIDIA:main Apr 20, 2026
6 of 7 checks passed