[TRTLLM-10048][feat] Fuse the AllGather for expert statistics required by the EPLB. #10885
Conversation
/bot run --disable-fail-fast

/bot kill
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Force-pushed from 6a3e620 to 84fca8c
PR_Github #32934 [ run ] triggered by Bot. Commit:

PR_Github #32935 [ kill ] triggered by Bot. Commit:

PR_Github #32934 [ run ] completed with state

PR_Github #32935 [ kill ] completed with state
📝 Walkthrough

Adds optional Expert Parallelism Load Balancer (EPLB) statistics support to the MoE AlltoAll communication pipeline. Introduces template parameters and pointers for gathered/local stats, and conditionally wires EPLB data through the dispatch/combine kernels while maintaining backward compatibility.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant MoeAlltoAll as MoeAlltoAll<br/>(Python API)
    participant Dispatch as Dispatch Kernel<br/>(GPU)
    participant Combine as Combine Kernel<br/>(GPU)
    participant LoadBalancer as Load Balancer<br/>(EPLB)

    User->>MoeAlltoAll: dispatch(token_selected_experts,<br/>input_payloads,<br/>eplb_local_stats=None)
    alt enable_eplb == True
        MoeAlltoAll->>MoeAlltoAll: Validate eplb_local_stats present
        MoeAlltoAll->>Dispatch: Launch with EPLB_STATS=true<br/>eplb_local_stats, eplb_gathered_stats
    else enable_eplb == False
        MoeAlltoAll->>Dispatch: Launch with EPLB_STATS=false
    end
    Dispatch->>Dispatch: Compute expert routing<br/>Exchange data via AlltoAll
    alt enable_eplb == True
        Dispatch->>Dispatch: Write local EPLB stats<br/>to peer buffers
        Dispatch->>Dispatch: Copy to eplb_gathered_stats
    end
    Dispatch-->>MoeAlltoAll: Returns output_payloads,<br/>eplb_gathered_stats
    alt enable_eplb == True
        MoeAlltoAll->>LoadBalancer: Update statistics with<br/>eplb_gathered_stats
    end
    MoeAlltoAll-->>User: (output_payloads, eplb_gathered_stats)
    User->>MoeAlltoAll: combine(final_hidden_states)
    MoeAlltoAll->>Combine: Launch combine kernel
    Combine->>Combine: Aggregate expert outputs
    Combine-->>MoeAlltoAll: Returns combined output
    MoeAlltoAll-->>User: combined_output
```
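Based on the flow above, a minimal Python usage sketch. The names are taken from the diagram; the exact `MoeAlltoAll` constructor and return signatures in the codebase may differ, and `load_balancer.update(...)` is a hypothetical hook:

```python
# Illustrative sketch only: mirrors the sequence diagram above, not the exact
# MoeAlltoAll API. Constructor arguments and the load_balancer hook are assumed.
a2a = MoeAlltoAll(...)  # assume constructed elsewhere with enable_eplb set

if a2a.enable_eplb:
    # When EPLB stats are enabled, local per-expert counts must be passed in,
    # and the gathered statistics come back alongside the output payloads.
    output_payloads, eplb_gathered_stats = a2a.dispatch(
        token_selected_experts, input_payloads, eplb_local_stats=eplb_local_stats)
    load_balancer.update(eplb_gathered_stats)  # hypothetical load-balancer hook
else:
    output_payloads, _ = a2a.dispatch(
        token_selected_experts, input_payloads, eplb_local_stats=None)

# Expert computation happens between dispatch and combine (omitted here).
combined_output = a2a.combine(final_hidden_states)
```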
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (8)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1)
618-639: Add divisibility guards for num_experts before launch.
`num_experts_per_rank` is now derived from `num_experts / ep_size`; if inputs are invalid, target ranks can go out of range and corrupt memory. Add validation in `moe_a2a_dispatch_launch` before specialization.

🛠️ Suggested validation
```diff
     TLLM_CHECK(params.top_k > 0 && params.top_k <= kMaxTopK);
     TLLM_CHECK(params.ep_size > 0 && params.ep_size <= kMaxRanks);
     TLLM_CHECK(params.local_num_tokens >= 0);
     TLLM_CHECK(params.num_payloads > 0 && params.num_payloads <= kMaxPayloads);
+    TLLM_CHECK(params.num_experts > 0);
+    TLLM_CHECK(params.num_experts % params.ep_size == 0);
```
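As a side note, a small Python illustration (not the kernel code) of why the divisibility guard matters, assuming experts map to ranks in contiguous blocks of `num_experts / ep_size`:

```python
# Illustration only: with a num_experts that is not divisible by ep_size,
# block-wise expert-to-rank mapping can produce an out-of-range target rank.
num_experts = 10  # hypothetical value, not divisible by ep_size
ep_size = 4
num_experts_per_rank = num_experts // ep_size  # integer division -> 2

for expert_id in range(num_experts):
    target_rank = expert_id // num_experts_per_rank
    # Experts 8 and 9 map to rank 4, which does not exist (valid ranks: 0..3);
    # this is the out-of-range case the suggested TLLM_CHECKs rule out up front.
    assert target_rank < ep_size, f"expert {expert_id} -> invalid rank {target_rank}"
```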
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)

1-3: Add NVIDIA SPDX header for compliance.
This source file is missing the required NVIDIA SPDX header with the latest modification year.
📝 Proposed header addition
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import os
```

As per coding guidelines, please add the standard NVIDIA SPDX header with the latest year.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
1-3: Add NVIDIA SPDX header for compliance.
This file is missing the required NVIDIA SPDX header with the latest modification year.
📝 Proposed header addition
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import os
```

As per coding guidelines, please add the standard NVIDIA SPDX header with the latest year.
tensorrt_llm/_torch/distributed/moe_alltoall.py (2)
1-6: Add NVIDIA SPDX header for compliance.
This file is missing the required NVIDIA SPDX header with the latest modification year.
📝 Proposed header addition
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """
 MoE All-to-All Operations
```

As per coding guidelines, please add the standard NVIDIA SPDX header with the latest year.
160-179: Guard workspace reuse against `num_experts`/`enable_eplb` mismatches.
With EPLB now affecting the metainfo layout, reusing `_WORKSPACE` across instances with different `num_experts` or `enable_eplb` can lead to incompatible offsets (especially if the workspace size is overridden). Store and validate these fields when reusing the workspace.

🛠️ Proposed fix

```diff
             MoeAlltoAll._WORKSPACE = {
                 "workspace_size_per_rank": workspace_size_per_rank,
                 "max_num_tokens": self.max_num_tokens,
                 "ep_rank": self.ep_rank,
                 "ep_size": self.ep_size,
+                "num_experts": self.num_experts,
+                "enable_eplb": self.enable_eplb,
                 "mnnvl_mem": mnnvl_mem,
                 "workspace": workspace,
                 "metainfo": metainfo,
             }
         else:
             assert self._WORKSPACE[
                 "workspace_size_per_rank"] == workspace_size_per_rank, "reuse workspace with different workspace_size_per_rank"
             assert self._WORKSPACE[
                 "max_num_tokens"] == self.max_num_tokens, "reuse workspace with different max_num_tokens"
             assert self._WORKSPACE[
                 "ep_rank"] == self.ep_rank, "reuse workspace with different ep_rank"
             assert self._WORKSPACE[
                 "ep_size"] == self.ep_size, "reuse workspace with different ep_size"
+            assert self._WORKSPACE[
+                "num_experts"] == self.num_experts, "reuse workspace with different num_experts"
+            assert self._WORKSPACE[
+                "enable_eplb"] == self.enable_eplb, "reuse workspace with different enable_eplb"
```

Also applies to: 190-197
tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py (2)
1-2: Update SPDX year to reflect latest modification.
The header still says 2025 even though the file was modified in 2026.
📝 Proposed fix
```diff
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

As per coding guidelines, please keep the header year in sync with the latest meaningful modification.
256-352: Fix dispatch phase guard to use `_dispatch_state`.
Line 309 checks `self._state`, which is no longer defined after the `_dispatch_state` refactor. This will raise an `AttributeError` on the first dispatch and also fails to guard double-dispatch.

🐛 Proposed fix

```diff
-        if self._state == "dispatched":
+        if self._dispatch_state.get("phase") == "dispatched":
             raise RuntimeError("dispatch called twice without an intervening combine")
```
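For readers unfamiliar with the guard, a simplified, hypothetical sketch of the phase-tracking pattern the fix relies on (not the actual nvlink_one_sided implementation, which tracks more state):

```python
# Hypothetical minimal illustration of a _dispatch_state phase guard.
class PhaseGuardSketch:
    def __init__(self):
        self._dispatch_state = {"phase": "idle"}

    def dispatch(self):
        # Reject a second dispatch before the matching combine has run.
        if self._dispatch_state.get("phase") == "dispatched":
            raise RuntimeError(
                "dispatch called twice without an intervening combine")
        self._dispatch_state["phase"] = "dispatched"

    def combine(self):
        # Combine closes the cycle and allows the next dispatch.
        self._dispatch_state["phase"] = "idle"
```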
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1)

1-5: Add NVIDIA SPDX header for compliance.
This test file is missing the required NVIDIA SPDX header with the latest modification year.
📝 Proposed header addition
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import pickle
```

As per coding guidelines, please add the standard NVIDIA SPDX header with the latest year.
🤖 Fix all issues with AI agents
In `@cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp`:
- Around line 584-587: Add validation for the numExperts parameter at the start
of moeA2AGetAuxDataSizeOp: ensure numExperts > 0 and fits in an int (e.g., <=
std::numeric_limits<int>::max()); if it fails, throw a std::invalid_argument (or
return a safe sentinel) with a clear message including the invalid numExperts
value, and only then call calculateOffsets/static_cast<int> conversions and
proceed — this check should live in moeA2AGetAuxDataSizeOp before the
calculateOffsets call to prevent bogus sizes.
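The same guard, expressed as a small Python sketch for illustration only (the real fix belongs in the C++ `moeA2AGetAuxDataSizeOp` before `calculateOffsets`, as described above; this helper is hypothetical):

```python
# Hypothetical Python mirror of the described validation: numExperts must be
# positive and fit in an int32 before any offset calculation uses it.
INT32_MAX = 2**31 - 1

def validate_num_experts(num_experts: int) -> int:
    if num_experts <= 0 or num_experts > INT32_MAX:
        raise ValueError(
            f"numExperts must be positive and fit in an int32, got {num_experts}")
    return num_experts
```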
🧹 Nitpick comments (1)
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1)
304-312: Remove duplicate num_experts assignment.
`params.num_experts` is set twice; keep a single assignment to reduce confusion.

♻️ Minimal cleanup

```diff
     params.ep_rank = static_cast<int>(epRank);
-    params.num_experts = static_cast<int>(numExperts);
     params.local_num_tokens = static_cast<int>(localNumTokens);
     params.max_tokens_per_rank = static_cast<int>(runtimeMaxTokensPerRank);
     params.top_k = static_cast<int>(topK);
     params.num_experts = static_cast<int>(numExperts);
```
/bot run --disable-fail-fast

1 similar comment

/bot run --disable-fail-fast

PR_Github #32970 [ run ] triggered by Bot. Commit:
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (outdated, resolved)
PR_Github #32970 [ run ] completed with state
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (resolved)
/bot run --disable-fail-fast

PR_Github #33168 [ run ] triggered by Bot. Commit:

PR_Github #33168 [ run ] completed with state
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
/bot run --disable-fail-fast

PR_Github #33253 [ run ] triggered by Bot. Commit:

PR_Github #33253 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33289 [ run ] triggered by Bot. Commit:
xxi-nv left a comment
LGTM
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (resolved)
PR_Github #33289 [ run ] completed with state

/bot run --reuse-test

PR_Github #33361 [ run ] triggered by Bot. Commit:

PR_Github #33361 [ run ] completed with state

/bot run --reuse-test

PR_Github #33488 [ run ] triggered by Bot. Commit:

PR_Github #33488 [ run ] completed with state

/bot run --reuse-test

PR_Github #33497 [ run ] triggered by Bot. Commit:

PR_Github #33497 [ run ] completed with state

/bot run --reuse-test

PR_Github #33544 [ run ] triggered by Bot. Commit:

PR_Github #33544 [ run ] completed with state
Summary by CodeRabbit
Release Notes
New Features
`get_eplb_gathered_statistics()` method to retrieve collected EPLB statistics.

Enhancements
✏️ Tip: You can customize this high-level summary in your review settings.
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.
skip
`skip --comment COMMENT`

Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.