Fix DeepEP TMA constraint violation for MoE CUDA graph batch sizes #1267
kevalmorabia97 wants to merge 1 commit into main from
Conversation
When running a MoE model on multiple GPUs, the CUDA graph `batch_sizes` list could include values that violate DeepEP's TMA constraint: `(num_ranks * num_max_dispatch_tokens_per_rank) % 4 == 0`.

For example, with `max_batch_size=2` and `ep=2`, `batch_sizes=[1, 2]`. Building a CUDA graph for `batch_size=1` passes `num_max_dispatch_tokens_per_rank=1` to DeepEP, and `(2 * 1) % 4 = 2 != 0` triggers a `RuntimeError`.

Filter out batch sizes that violate the constraint when `ep > 1`. `enable_padding=True` ensures smaller batches are padded up to the next valid size at inference time.

Fixes `test_ptq_mixtral` on 2-GPU machines.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
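For illustration, a minimal standalone sketch of the filter described above (the function name and driver lines are hypothetical; the actual change is the one-line list comprehension in `modelopt/deploy/llm/generate.py`):

```python
# Sketch of the batch-size filter this PR describes (hypothetical standalone
# reproduction, not the real generate.py code).
def valid_cuda_graph_batch_sizes(batch_sizes: list[int], ep: int) -> list[int]:
    """Keep only batch sizes satisfying DeepEP's TMA constraint when ep > 1."""
    if ep <= 1:
        return batch_sizes  # constraint only matters with expert parallelism
    # DeepEP requires (num_ranks * num_max_dispatch_tokens_per_rank) % 4 == 0,
    # and the CUDA graph batch size becomes num_max_dispatch_tokens_per_rank.
    return [b for b in batch_sizes if (ep * b) % 4 == 0]


# With max_batch_size=2 and ep=2: (2*1) % 4 = 2 drops 1, (2*2) % 4 = 0 keeps 2.
print(valid_cuda_graph_batch_sizes([1, 2], ep=2))        # [2]
print(valid_cuda_graph_batch_sizes([1, 2, 4, 8], ep=2))  # [2, 4, 8]
```

For `ep=2` the constraint reduces to even batch sizes, which is why `batch_size=1` is dropped while `enable_padding=True` still serves actual batches of 1 by padding them up to 2.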
📝 Walkthrough

Updates CUDA graph configuration batch-size selection logic in the LLM initialization path. When Mixture-of-Experts (DeepEP) is enabled (`ep > 1`), batch sizes that violate DeepEP's TMA constraint are filtered out before the `CudaGraphConfig` is constructed.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/deploy/llm/generate.py`:
- Around line 136-141: The filtered batch_sizes list can become empty (when ep >
1) causing CudaGraphConfig(...) to raise when it calls max(self.batch_sizes);
capture the original batch_sizes before filtering, then after applying
batch_sizes = [b for b in batch_sizes if (ep * b) % 4 == 0] add a guard: if not
batch_sizes, raise a clear ValueError (or log + bail) that includes ep and the
original batch_sizes so the caller knows why no valid batch sizes exist; ensure
this guard sits immediately before constructing CudaGraphConfig so
CudaGraphConfig(...) never receives an empty list.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 0d5502b0-dea2-4e6b-a20c-b24769f1ed26
📒 Files selected for processing (1)
modelopt/deploy/llm/generate.py
```diff
+     batch_sizes = [b for b in batch_sizes if (ep * b) % 4 == 0]
      cuda_graph_config = CudaGraphConfig(
-         batch_sizes=[2**i for i in range(int((max_batch_size - 1).bit_length()))]
-         + [max_batch_size],
+         batch_sizes=batch_sizes,
          max_batch_size=max_batch_size,
          enable_padding=True,
      )
```
🧩 Analysis chain
🌐 Web query:
In TensorRT-LLM (llmapi.CudaGraphConfig), is batch_sizes=[] valid input? If valid, what exact runtime behavior is expected? If invalid, what exception/error is raised and in which versions?
💡 Result:
No, batch_sizes=[] is not a valid input for llmapi.CudaGraphConfig in TensorRT-LLM. It raises a ValueError during validation in the @model_validator because max(self.batch_sizes) on an empty list fails. The validator sorts the list and computes derived_max = max(self.batch_sizes), which errors on empty input. To disable CUDA graphs or use defaults, pass batch_sizes=None (default) or max_batch_size=0. With batch_sizes=None and max_batch_size=0 (also default), it falls back to generating batch sizes up to 128. Setting max_batch_size=0 with batch_sizes=None likely results in no graphs captured, using eager execution (runtime behavior seen in issues like #4811, #5634 where max_batch_size=0 or cuda_graph_config: null disables graphs). This behavior is from recent versions (post-2025 commits at hash 656091bf), consistent across source code and issues. No version-specific differences found; validation prevents empty lists uniformly.
Citations:
- 1: https://github.com/NVIDIA/TensorRT-LLM/blob/656091bf/tensorrt_llm/llmapi/llm_args.py
- 2: [Bug] Users need to add `cuda_graph_max_batch_size=0` to avoid crash when config from extra-llm-api-config.yml (TensorRT-LLM#4811)
- 3: [fix] Update to properly set cuda graphs in trtllm-bench overrides (TensorRT-LLM#5634)
- 4: Breaking change: perf: [TRTLLM-4662] Enable cuda graph by default (TensorRT-LLM#5480)
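The failure mode the query result describes reduces to calling `max()` on an empty list. A plain-Python stand-in reproduces it (hypothetical `FakeCudaGraphConfig`; TensorRT-LLM's actual validator lives in `tensorrt_llm/llmapi/llm_args.py`):

```python
# Hypothetical stand-in for the derived-max validation described above;
# not TensorRT-LLM's actual code.
class FakeCudaGraphConfig:
    def __init__(self, batch_sizes=None, max_batch_size=0):
        if batch_sizes is not None:
            # max() raises ValueError on an empty sequence -- the crash the
            # review warns about when the TMA filter removes every batch size.
            max_batch_size = max(max_batch_size, max(batch_sizes))
        self.batch_sizes = sorted(batch_sizes) if batch_sizes is not None else None
        self.max_batch_size = max_batch_size


print(FakeCudaGraphConfig(batch_sizes=[4, 2]).max_batch_size)  # 4
try:
    FakeCudaGraphConfig(batch_sizes=[])  # empty after over-aggressive filtering
except ValueError as exc:
    print(f"rejected: {exc}")
```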
🏁 Script executed:

```shell
# First, let's find and examine the file
fd -t f "generate.py" | grep -E "deploy/llm"
```

Repository: NVIDIA/Model-Optimizer
Length of output: 97
🏁 Script executed:

```shell
# Then examine the specific lines mentioned in the review
head -150 modelopt/deploy/llm/generate.py | tail -30
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1401
🏁 Script executed:

```shell
# Get more context around the issue to understand the full function
sed -n '100,160p' modelopt/deploy/llm/generate.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2378
Add a guard to check that batch_sizes is not empty before creating CudaGraphConfig.
The filter at line 136 can produce an empty list when `ep > 1` and no batch sizes satisfy the `(ep * b) % 4 == 0` constraint. Passing `batch_sizes=[]` to `CudaGraphConfig` raises a `ValueError` because its validator calls `max(self.batch_sizes)` on the empty list.
Suggested fix

```diff
  batch_sizes = [b for b in batch_sizes if (ep * b) % 4 == 0]
- cuda_graph_config = CudaGraphConfig(
-     batch_sizes=batch_sizes,
-     max_batch_size=max_batch_size,
-     enable_padding=True,
- )
+ if batch_sizes:
+     cuda_graph_config = CudaGraphConfig(
+         batch_sizes=batch_sizes,
+         max_batch_size=max_batch_size,
+         enable_padding=True,
+     )
+ else:
+     warnings.warn(
+         "No CUDA graph batch sizes satisfy DeepEP TMA constraint "
+         f"(ep={ep}, max_batch_size={max_batch_size}); disabling cuda graphs."
+     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/deploy/llm/generate.py` around lines 136 - 141, The filtered
batch_sizes list can become empty (when ep > 1) causing CudaGraphConfig(...) to
raise when it calls max(self.batch_sizes); capture the original batch_sizes
before filtering, then after applying batch_sizes = [b for b in batch_sizes if
(ep * b) % 4 == 0] add a guard: if not batch_sizes, raise a clear ValueError (or
log + bail) that includes ep and the original batch_sizes so the caller knows
why no valid batch sizes exist; ensure this guard sits immediately before
constructing CudaGraphConfig so CudaGraphConfig(...) never receives an empty
list.
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1267      +/-   ##
==========================================
+ Coverage   75.58%   76.52%   +0.93%
==========================================
  Files         459      459
  Lines       48528    48531       +3
==========================================
+ Hits        36681    37139     +458
+ Misses      11847    11392     -455
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Replaced by #1273
Summary
- `test_ptq_mixtral` was failing on 2-GPU machines with a `RuntimeError` from DeepEP inside TRT-LLM `LLM.__init__`.
- `generate.py` built a `CudaGraphConfig` with `batch_sizes=[1, 2]` for `max_batch_size=2`. Building the CUDA graph for batch size 1 with `ep=2` passed `num_max_dispatch_tokens_per_rank=1` to DeepEP, violating its TMA constraint: `(num_ranks * num_max_dispatch_tokens_per_rank) % 4 == 0` → `(2 * 1) % 4 = 2 ≠ 0`.
- Fix: when `ep > 1`, filter `batch_sizes` to only include values satisfying the constraint. `enable_padding=True` is already set, so smaller actual batches are padded up to the next valid size at runtime.

Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit