
[https://nvbugs/6037654][fix] Cap DeepEP low-latency token limit to prevent OOM and illegal memory access#13362

Closed
tensorrt-cicd wants to merge 1 commit into NVIDIA:main from tensorrt-cicd:repair-bot-bug6037654

Conversation


tensorrt-cicd (Collaborator) commented Apr 23, 2026

Summary

  • Fix for NVBugs 6037654: [TensorRT-LLM][main]: TestQwen3_235B_A22B::test_fp8 is failing
  • Root cause: DeepEP low-latency mode allocates RDMA buffers that scale with the token count per rank. When max_num_tokens or moe_max_num_tokens exceeded 256, the resulting buffer allocations consumed excessive GPU memory, causing OOM during autotuner warmup on the Qwen3-235B-A22B FP8 test with attention_dp=True and free_gpu_memory_fraction=0.6 on 8×RTX PRO 6000 Blackwell GPUs. The DeepEP kernel is also unsafe beyond 256 tokens per rank.
  • Fix: Capped deep_ep_max_num_tokens at 256 (the DeepEP-recommended limit) by adding _MAX_LOW_LATENCY_TOKENS to the min() computation in DeepEPLowLatency.__init__; a minimal sketch follows this list. This only affects small decode batches, since larger prefill batches already fall back to AllGatherReduceScatter via is_workload_feasible(). Additionally reduced free_gpu_memory_fraction from 0.6 to 0.4 for the attention_dp test variant and removed the waive entry to re-enable the test.
  • Automated fix generated by repair-bot
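
A minimal sketch of the capped computation, assuming a simplified constructor (the constant name, the min() cap, and the class name come from this PR; the signature and surrounding code are illustrative):

```python
# Sketch only: the real DeepEPLowLatency.__init__ carries more state; this
# shows just the capped limit computation described above.
_MAX_LOW_LATENCY_TOKENS = 256  # DeepEP-recommended per-rank token limit


class DeepEPLowLatency:
    def __init__(self, max_num_tokens: int, moe_max_num_tokens: int):
        # Before the fix: min(max_num_tokens, moe_max_num_tokens), which
        # could reach 4096-8192 and over-allocate RDMA buffers.
        self.deep_ep_max_num_tokens = min(
            max_num_tokens,
            moe_max_num_tokens,
            _MAX_LOW_LATENCY_TOKENS,  # the new cap
        )
```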

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Summary by CodeRabbit

  • Performance

    • Capped the low-latency token limit at a sensible default while retaining environment-variable override support.
    • Reduced the KV cache memory fraction in FP8 accuracy tests to fit within available GPU memory.
  • Tests

    • Re-enabled test coverage for FP8 throughput and latency scenarios.

[https://nvbugs/6037654][fix] Cap DeepEP low-latency token limit to prevent OOM and illegal memory access

Cap deep_ep_max_num_tokens to 256 in DeepEP low-latency mode, matching the
recommended limit from the DeepEP library. Previously the limit was set to
min(max_num_tokens, moe_max_num_tokens) which could be 4096-8192, causing
excessive RDMA buffer memory consumption (contributing to OOM) and illegal
memory access in the low_latency_combine kernel.
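
As a rough illustration of why the old limit mattered: the buffers scale linearly with the per-rank token limit. The formula below is an assumption for scale only (hidden size, rank count, and element width are hypothetical defaults), not the actual DeepEP buffer layout:

```python
# Hypothetical back-of-envelope estimate; the real DeepEP RDMA buffer layout
# differs. This only illustrates linear scaling with the per-rank token cap.
def approx_dispatch_buffer_bytes(tokens_per_rank: int,
                                 hidden_size: int = 4096,
                                 num_ranks: int = 8,
                                 bytes_per_elem: int = 2) -> int:
    return tokens_per_rank * hidden_size * num_ranks * bytes_per_elem

for limit in (256, 4096, 8192):
    gib = approx_dispatch_buffer_bytes(limit) / 2**30
    print(f"token limit {limit:>4}: ~{gib:.2f} GiB per buffer")
# token limit  256: ~0.02 GiB per buffer
# token limit 4096: ~0.25 GiB per buffer
# token limit 8192: ~0.50 GiB per buffer
```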

With the 256 cap, large batches (e.g. prefill) automatically fall back to
AllGatherReduceScatter via is_workload_feasible(), while small decode batches
still use the efficient low-latency path.
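
A hedged sketch of that dispatch decision; is_workload_feasible() and the two strategy names come from this commit message, while the function bodies are assumptions:

```python
# Sketch of the fallback decision, not the actual TensorRT-LLM implementation.
_MAX_LOW_LATENCY_TOKENS = 256


def is_workload_feasible(num_tokens_per_rank: int) -> bool:
    # Decode batches stay within the cap; large prefill batches do not.
    return num_tokens_per_rank <= _MAX_LOW_LATENCY_TOKENS


def select_comm_strategy(num_tokens_per_rank: int) -> str:
    if is_workload_feasible(num_tokens_per_rank):
        return "DeepEPLowLatency"      # efficient path for small decode batches
    return "AllGatherReduceScatter"    # automatic fallback for large prefill
```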

Also reduce free_gpu_memory_fraction from 0.6 to 0.4 for the attention_dp
variant of TestQwen3_235B_A22B::test_fp8 to fit within RTX PRO 6000 Blackwell
Server Edition GPU memory, and remove the corresponding test waiver.
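
For reference, the lowered fraction expressed through the LLM API looks roughly like this (a sketch; the actual test wires it through its own fixtures):

```python
from tensorrt_llm.llmapi import KvCacheConfig

# attention_dp variant now reserves 40% (was 60%) of free GPU memory for the
# KV cache, leaving headroom for DeepEP RDMA buffers and autotuner warmup.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.4)
```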

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

coderabbitai Bot commented Apr 23, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 36113bf and ffcd5b9.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

📝 Walkthrough

The changes introduce a fixed low-latency token limit constant in the DeepEP communication module, adjust KV cache memory configuration for FP8 accuracy tests, and remove a test waiver for Qwen3_235B throughput latency testing.

Changes

DeepEP Low-Latency Token Limits
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
Introduces a _MAX_LOW_LATENCY_TOKENS = 256 constant and folds it into the default_limit computation to cap token reservations. An environment override, when set, still takes precedence; otherwise the default limit is now additionally capped at 256 tokens.

Test Configuration Updates
tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/integration/test_lists/waives.txt
Lowers the KV cache free_gpu_memory_fraction from 0.6 to 0.4 when attention_dp is enabled in the FP8 accuracy test, and removes the waiver for the Qwen3_235B throughput-latency test so it runs again.
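
A hedged sketch of the resolution order the walkthrough describes; the environment variable name below is an assumption, as the source only states that an override, when set, still takes precedence:

```python
import os

_MAX_LOW_LATENCY_TOKENS = 256


def resolve_token_limit(max_num_tokens: int, moe_max_num_tokens: int) -> int:
    # Hypothetical env var name, used here only to illustrate precedence.
    override = os.environ.get("TRTLLM_DEEP_EP_TOKEN_LIMIT")
    if override is not None:
        return int(override)  # explicit override wins
    # Otherwise the default limit now also respects the 256-token cap.
    return min(max_num_tokens, moe_max_num_tokens, _MAX_LOW_LATENCY_TOKENS)
```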

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks (4 passed, 1 failed)

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4)
  • Title check — The title clearly identifies the main change: capping the DeepEP low-latency token limit to prevent OOM and illegal memory access, directly matching the core fix in the changeset.
  • Description check — The description includes a clear summary of the root cause, the fix applied, test verification, and relevant links, covering all essential template sections despite some non-critical checklist items being incomplete.
  • Linked Issues check — Skipped: no linked issues were found for this pull request.
  • Out of Scope Changes check — Skipped: no linked issues were found for this pull request.




byshiue (Collaborator) commented Apr 27, 2026

Created another PR #13484 to avoid changing the behavior of other models.

byshiue closed this Apr 27, 2026
