[https://nvbugs/6037654][fix] Cap DeepEP low-latency token limit to prevent OOM and illegal memory access #13362
Closed
tensorrt-cicd wants to merge 1 commit into NVIDIA:main from …
Conversation
Cap DeepEP low-latency token limit to prevent OOM and illegal memory access

Cap deep_ep_max_num_tokens to 256 in DeepEP low-latency mode, matching the recommended limit from the DeepEP library. Previously the limit was set to min(max_num_tokens, moe_max_num_tokens), which could be 4096-8192, causing excessive RDMA buffer memory consumption (contributing to OOM) and illegal memory access in the low_latency_combine kernel. With the 256 cap, large batches (e.g. prefill) automatically fall back to AllGatherReduceScatter via is_workload_feasible(), while small decode batches still use the efficient low-latency path.

Also reduce free_gpu_memory_fraction from 0.6 to 0.4 for the attention_dp variant of TestQwen3_235B_A22B::test_fp8 to fit within RTX PRO 6000 Blackwell Server Edition GPU memory, and remove the corresponding test waiver.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
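For reference, a minimal sketch of the capping and fallback behavior described above. The constant, attribute, and method names (_MAX_LOW_LATENCY_TOKENS, deep_ep_max_num_tokens, DeepEPLowLatency, is_workload_feasible) come from this PR, but the surrounding class structure and signatures are illustrative assumptions, not the actual tensorrt_llm source:

```python
# Recommended per-rank token limit from the DeepEP library (per this PR).
_MAX_LOW_LATENCY_TOKENS = 256


class DeepEPLowLatency:
    """Illustrative stand-in for the real DeepEP low-latency backend."""

    def __init__(self, max_num_tokens: int, moe_max_num_tokens: int):
        # Previously: min(max_num_tokens, moe_max_num_tokens), which could be
        # 4096-8192 and over-allocate RDMA buffers. Adding the 256 cap bounds
        # the buffer size regardless of the configured batch limits.
        self.deep_ep_max_num_tokens = min(max_num_tokens, moe_max_num_tokens,
                                          _MAX_LOW_LATENCY_TOKENS)

    def is_workload_feasible(self, num_tokens: int) -> bool:
        # Batches above the cap (e.g. prefill) report infeasible and fall
        # back to AllGatherReduceScatter; small decode batches stay on the
        # low-latency path.
        return num_tokens <= self.deep_ep_max_num_tokens
```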
Contributor
No actionable comments were generated in the recent review. Review configuration: .coderabbit.yaml, profile CHILL. Files selected for processing: 3 (1 with no reviewable changes).
Walkthrough: The changes introduce a fixed low-latency token limit constant in the DeepEP communication module, adjust the KV cache memory configuration for the FP8 accuracy test, and remove a test waiver for Qwen3_235B throughput-latency testing.
Estimated code review effort: 3 (Moderate), ~20 minutes. Pre-merge checks: 4 passed, 1 failed (warning).
Collaborator
Created another PR #13484 to prevent changing the behaviors of other models.
Summary
When max_num_tokens or moe_max_num_tokens exceeded 256, the resulting buffer allocations consumed excessive GPU memory, causing OOM during autotuner warmup on the Qwen3-235B-A22B FP8 test with attention_dp=True and free_gpu_memory_fraction=0.6 on 8×RTX PRO 6000 Blackwell GPUs. The DeepEP kernel is also unsafe beyond 256 tokens per rank.

Capped deep_ep_max_num_tokens at 256 (the DeepEP-recommended limit) by adding _MAX_LOW_LATENCY_TOKENS to the min() computation in DeepEPLowLatency.__init__. This only affects small decode batches, since larger prefill batches already fall back to AllGatherReduceScatter via is_workload_feasible(). Additionally reduced free_gpu_memory_fraction from 0.6 to 0.4 for the attention_dp test variant and removed the waive entry to re-enable the test.

Test plan
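For the test-side change, a minimal sketch of the adjusted KV cache setting, assuming the TensorRT-LLM LLM-API KvCacheConfig; the actual harness wiring for TestQwen3_235B_A22B::test_fp8 is not shown on this page:

```python
from tensorrt_llm.llmapi import KvCacheConfig

# Reduced from 0.6 so the attention_dp variant fits within
# RTX PRO 6000 Blackwell Server Edition GPU memory.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.4)
```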
Links
Summary by CodeRabbit
Performance
Tests