[https://nvbugs/6037654][fix] Cap DeepEP low-latency token limit to prevent OOM and illegal memory access #13362
Closed
tensorrt-cicd wants to merge 1 commit into NVIDIA:main from …
Conversation
Cap DeepEP low-latency token limit to prevent OOM and illegal memory access

Cap deep_ep_max_num_tokens to 256 in DeepEP low-latency mode, matching the recommended limit from the DeepEP library. Previously the limit was set to min(max_num_tokens, moe_max_num_tokens), which could be 4096-8192, causing excessive RDMA buffer memory consumption (contributing to OOM) and illegal memory access in the low_latency_combine kernel. With the 256 cap, large batches (e.g. prefill) automatically fall back to AllGatherReduceScatter via is_workload_feasible(), while small decode batches still use the efficient low-latency path.

Also reduce free_gpu_memory_fraction from 0.6 to 0.4 for the attention_dp variant of TestQwen3_235B_A22B::test_fp8 to fit within RTX PRO 6000 Blackwell Server Edition GPU memory, and remove the corresponding test waiver.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
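For reference, a minimal sketch of the capping and fallback behavior described above. The constant, attribute, and method names (_MAX_LOW_LATENCY_TOKENS, deep_ep_max_num_tokens, DeepEPLowLatency, is_workload_feasible) come from this PR, but the surrounding class structure and signatures are illustrative assumptions, not the actual tensorrt_llm source:

```python
# Recommended per-rank token limit from the DeepEP library (per this PR).
_MAX_LOW_LATENCY_TOKENS = 256


class DeepEPLowLatency:
    """Illustrative stand-in for the real DeepEP low-latency backend."""

    def __init__(self, max_num_tokens: int, moe_max_num_tokens: int):
        # Previously: min(max_num_tokens, moe_max_num_tokens), which could be
        # 4096-8192 and over-allocate RDMA buffers. Adding the 256 cap bounds
        # the buffer size regardless of the configured batch limits.
        self.deep_ep_max_num_tokens = min(max_num_tokens, moe_max_num_tokens,
                                          _MAX_LOW_LATENCY_TOKENS)

    def is_workload_feasible(self, num_tokens: int) -> bool:
        # Batches above the cap (e.g. prefill) report infeasible and fall
        # back to AllGatherReduceScatter; small decode batches stay on the
        # low-latency path.
        return num_tokens <= self.deep_ep_max_num_tokens
```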
Contributor
No actionable comments were generated in the recent review. Review configuration: .coderabbit.yaml, profile CHILL. Files selected for processing: 3 (1 with no reviewable changes).
Walkthrough: The changes introduce a fixed low-latency token limit constant in the DeepEP communication module, adjust the KV cache memory configuration for the FP8 accuracy test, and remove a test waiver for Qwen3_235B throughput-latency testing.
Estimated code review effort: 3 (Moderate), ~20 minutes. Pre-merge checks: 4 passed, 1 failed (warning).
Collaborator
Created another PR #13484 to prevent changing the behaviors of other models.
Summary
When max_num_tokens or moe_max_num_tokens exceeded 256, the resulting buffer allocations consumed excessive GPU memory, causing OOM during autotuner warmup on the Qwen3-235B-A22B FP8 test with attention_dp=True and free_gpu_memory_fraction=0.6 on 8×RTX PRO 6000 Blackwell GPUs. The DeepEP kernel is also unsafe beyond 256 tokens per rank.

Capped deep_ep_max_num_tokens at 256 (the DeepEP-recommended limit) by adding _MAX_LOW_LATENCY_TOKENS to the min() computation in DeepEPLowLatency.__init__. This only affects small decode batches, since larger prefill batches already fall back to AllGatherReduceScatter via is_workload_feasible(). Additionally reduced free_gpu_memory_fraction from 0.6 to 0.4 for the attention_dp test variant and removed the waive entry to re-enable the test.

Test plan
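For the test-side change, a minimal sketch of the adjusted KV cache setting, assuming the TensorRT-LLM LLM-API KvCacheConfig; the actual harness wiring for TestQwen3_235B_A22B::test_fp8 is not shown on this page:

```python
from tensorrt_llm.llmapi import KvCacheConfig

# Reduced from 0.6 so the attention_dp variant fits within
# RTX PRO 6000 Blackwell Server Edition GPU memory.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.4)
```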
Links
Summary by CodeRabbit
Performance
Tests