[TRTLLM-10004][chore] Enable NCCL symmetric zero-copy by default #14472
/bot run --disable-fail-fast
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
Force-pushed from 6095b5e to 9bba6d7.
/bot run --add-multi-gpu --disable-fail-fast
PR_Github #49998 [ run ] triggered by Bot. Commit:
PR_Github #49999 [ run ] triggered by Bot. Commit:
PR_Github #49999 [ run ] completed with state
/bot run --add-multi-gpu --disable-fail-fast
PR_Github #50343 [ run ] triggered by Bot. Commit:
PR_Github #50343 [ run ] completed with state
/bot run --add-multi-gpu --disable-fail-fast
PR_Github #50397 [ run ] triggered by Bot. Commit:
PR_Github #50397 [ run ] completed with state
Description
This PR enables NCCL symmetric zero-copy AllReduce by default for the PyTorch distributed path.
Previously, TLLM_NCCL_SYMMETRIC_ZERO_COPY defaulted to disabled unless explicitly set. This change flips the default so the zero-copy path is active unless explicitly disabled, preserving the existing opt-out behavior via the same environment variable.
Internal E2E throughput sweeps showed improvement on dense FP8 Llama models.
The same sweeps showed no E2E throughput regression on the other measured models.
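To make the default flip concrete, here is a minimal sketch of a default-on flag with an explicit opt-out. It is not the actual TensorRT-LLM implementation: the helper name `symmetric_zero_copy_enabled` is hypothetical, and treating the value `0` as the opt-out is an assumption based on the release-note wording, not on the diff itself.

```python
import os

# Name of the flag as given in the PR description.
ENV_VAR = "TLLM_NCCL_SYMMETRIC_ZERO_COPY"


def symmetric_zero_copy_enabled(environ=None):
    """Hypothetical helper: report whether the zero-copy path would be active.

    Assumption (not confirmed by the PR text): the flag is read from the
    environment and only the literal value "0" disables it.
    """
    env = os.environ if environ is None else environ
    # Unset -> enabled (the new default); "0" -> disabled (explicit opt-out).
    return env.get(ENV_VAR, "1").strip() != "0"
```

Under these assumptions, a deployment that needs the previous behavior would export `TLLM_NCCL_SYMMETRIC_ZERO_COPY=0` before launching, and every other configuration would pick up the zero-copy path without changes.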
Test Coverage
CI has been run with this setting before and will be repeated.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.