[https://nvbugs/5945047][fix] Fix cluster launch enablement for SM120 GPUs in allReduce fusion#13169
Conversation
📝 WalkthroughWalkthroughThe change refines CUDA cluster launch configuration logic in the all-reduce fusion kernel launcher. The SM architecture check is narrowed from Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu (1)
693-700:⚠️ Potential issue | 🔴 CriticalFix critical launch attribute configuration on SM120/SM121: programmatic stream serialization must not be disabled when kernels use cudaGridDependencySynchronize() or cudaTriggerProgrammaticLaunchCompletion().
At line 700, setting
cfg.numAttrs = 0whensupports_clusteris false disablescudaLaunchAttributeProgrammaticStreamSerializationset at lines 693–694. However, the kernels in this file callcudaGridDependencySynchronize()(lines 460, 539) andcudaTriggerProgrammaticLaunchCompletion()(lines 463, 521, 590), which require this attribute to be present in the launch configuration. On SM120/SM121 (where cluster launch is unsupported but programmatic launch is available),cfg.numAttrsshould be 1 (programmatic serialization only) rather than 0.Suggested fix
- cfg.numAttrs = supports_cluster ? 2 : 0; + bool const supports_programmatic_launch = (SM >= 90); + cfg.numAttrs = supports_programmatic_launch ? (supports_cluster ? 2 : 1) : 0;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu` around lines 693 - 700, The launch-attribute setup currently zeroes cfg.numAttrs when supports_cluster is false, which inadvertently drops the required cudaLaunchAttributeProgrammaticStreamSerialization used by kernels that call cudaGridDependencySynchronize() and cudaTriggerProgrammaticLaunchCompletion(); change the logic so cfg.numAttrs is set to 2 when supports_cluster is true and to 1 otherwise, while still only populating the cluster attribute (cudaLaunchAttributeClusterDimension) when supports_cluster is true so the programmatic serialization attribute (cudaLaunchAttributeProgrammaticStreamSerialization) is always passed for SM120/SM121; update the block that assigns attribute[], cfg.attrs and cfg.numAttrs accordingly (look for variables named attribute, cfg, supports_cluster and the attribute ids cudaLaunchAttributeProgrammaticStreamSerialization / cudaLaunchAttributeClusterDimension).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu`:
- Around line 693-700: The launch-attribute setup currently zeroes cfg.numAttrs
when supports_cluster is false, which inadvertently drops the required
cudaLaunchAttributeProgrammaticStreamSerialization used by kernels that call
cudaGridDependencySynchronize() and cudaTriggerProgrammaticLaunchCompletion();
change the logic so cfg.numAttrs is set to 2 when supports_cluster is true and
to 1 otherwise, while still only populating the cluster attribute
(cudaLaunchAttributeClusterDimension) when supports_cluster is true so the
programmatic serialization attribute
(cudaLaunchAttributeProgrammaticStreamSerialization) is always passed for
SM120/SM121; update the block that assigns attribute[], cfg.attrs and
cfg.numAttrs accordingly (look for variables named attribute, cfg,
supports_cluster and the attribute ids
cudaLaunchAttributeProgrammaticStreamSerialization /
cudaLaunchAttributeClusterDimension).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 98a94691-9314-4566-aa8e-7dee0fcec9a9
📒 Files selected for processing (1)
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
|
/bot run |
|
PR_Github #44254 [ run ] triggered by Bot. Commit: |
|
PR_Github #44254 [ run ] completed with state
|
|
/bot run |
|
PR_Github #44338 [ run ] triggered by Bot. Commit: |
|
PR_Github #44338 [ run ] completed with state
|
8146a29 to
9f905f4
Compare
|
/bot run |
|
PR_Github #44910 [ run ] triggered by Bot. Commit: |
|
PR_Github #44910 [ run ] completed with state
|
9f905f4 to
7fcbe29
Compare
|
/bot run |
|
PR_Github #45814 [ run ] triggered by Bot. Commit: |
|
PR_Github #45814 [ run ] completed with state
|
|
/bot run |
|
PR_Github #45859 [ run ] triggered by Bot. Commit: |
|
PR_Github #45859 [ run ] completed with state
|
|
/bot run |
|
PR_Github #45917 [ run ] triggered by Bot. Commit: |
|
PR_Github #45917 [ run ] completed with state
|
7fcbe29 to
7af0f5c
Compare
|
/bot run |
|
PR_Github #46000 [ run ] triggered by Bot. Commit: |
|
PR_Github #46000 [ run ] completed with state
|
|
/bot run |
|
/bot run |
|
PR_Github #47452 [ run ] triggered by Bot. Commit: |
|
PR_Github #47452 [ run ] completed with state
|
d0e0dad to
0c6a75b
Compare
|
/bot run |
|
PR_Github #47572 [ run ] triggered by Bot. Commit: |
|
PR_Github #47572 [ run ] completed with state
|
|
/bot run |
|
PR_Github #47589 [ run ] triggered by Bot. Commit: |
|
PR_Github #47589 [ run ] completed with state
|
0c6a75b to
16d64d1
Compare
|
/bot run |
|
PR_Github #47626 [ run ] triggered by Bot. Commit: |
|
PR_Github #47626 [ run ] completed with state
|
|
/bot run |
|
PR_Github #47668 [ run ] triggered by Bot. Commit: |
|
PR_Github #47668 [ run ] completed with state
|
…e fusion kernels The allreduce fusion kernel launcher used `SM >= 90` to enable CUDA cluster launch, which incorrectly included SM120 (workstation Blackwell / RTX PRO 6000). SM120 does not support cluster launch, causing `cudaErrorInvalidArgument` that surfaced as "RMSNorm failed with error code invalid argument" during Eagle3 inference. Fix the condition to `SM >= 90 && SM < 120`, matching the existing pattern in `is_lamport_supported()` which already correctly excludes SM120. Signed-off-by: Ziyi Xiong <219238287+ziyixiong-nv@users.noreply.github.com>
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
16d64d1 to
42e9227
Compare
|
/bot run |
|
PR_Github #47694 [ run ] triggered by Bot. Commit: |
|
PR_Github #47694 [ run ] completed with state
|
|
/bot run |
|
PR_Github #47911 [ run ] triggered by Bot. Commit: |
|
PR_Github #47911 [ run ] completed with state |
… GPUs in allReduce fusion (NVIDIA#13169) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
Summary
Test plan
Links