
Fix DeepEP TMA constraint violation for MoE CUDA graph batch sizes#1267

Closed
kevalmorabia97 wants to merge 1 commit into main from fix/moe-deep-ep-cuda-graph-batch-size

Conversation

@kevalmorabia97
Collaborator

@kevalmorabia97 kevalmorabia97 commented Apr 15, 2026

Summary

  • test_ptq_mixtral was failing on 2-GPU machines with a RuntimeError from DeepEP inside TRT-LLM
  • Root cause: LLM.__init__ in generate.py built a CudaGraphConfig with batch_sizes=[1, 2] for max_batch_size=2. Building the CUDA graph for batch size 1 with ep=2 passed num_max_dispatch_tokens_per_rank=1 to DeepEP, violating its TMA constraint (num_ranks * num_max_dispatch_tokens_per_rank) % 4 == 0, since (2 * 1) % 4 = 2 ≠ 0
  • Fix: when ep > 1, filter batch_sizes to include only values satisfying the constraint. enable_padding=True is already set, so smaller actual batches are padded up to the next valid size at runtime.
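The filtering described in the fix above can be sketched as follows (hypothetical function name; the actual change lives in modelopt/deploy/llm/generate.py):

```python
def filter_batch_sizes(batch_sizes, ep):
    """Keep only CUDA graph batch sizes that satisfy DeepEP's TMA constraint.

    DeepEP requires (num_ranks * num_max_dispatch_tokens_per_rank) % 4 == 0,
    where num_ranks == ep and num_max_dispatch_tokens_per_rank is the batch
    size being captured. With ep == 1 no filtering is needed.
    """
    if ep <= 1:
        return batch_sizes
    return [b for b in batch_sizes if (ep * b) % 4 == 0]


# The failing configuration from the PR description: ep=2, batch_sizes=[1, 2].
print(filter_batch_sizes([1, 2], ep=2))  # [2] -- batch size 1 is dropped
```

With enable_padding=True, a runtime batch of 1 is then padded up to 2, the smallest surviving captured size.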

Test plan

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor
    • Optimized CUDA graph batch-size configuration for Mixture-of-Experts deployments to improve memory alignment and performance efficiency.

When running a MoE model on multiple GPUs, the CUDA graph batch_sizes
list could include values that violate DeepEP's TMA constraint:
  (num_ranks * num_max_dispatch_tokens_per_rank) % 4 == 0

For example, with max_batch_size=2 and ep=2, batch_sizes=[1, 2].
Building a CUDA graph for batch_size=1 passes num_max_dispatch_tokens_per_rank=1
to DeepEP, and (2 * 1) % 4 = 2 != 0 triggers a RuntimeError.

Filter out batch sizes that violate the constraint when ep > 1.
enable_padding=True ensures smaller batches are padded up to the next
valid size at inference time.

Fixes test_ptq_mixtral on 2-GPU machines.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
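The runtime padding behavior the commit message relies on ("smaller batches are padded up to the next valid size") can be sketched like this (hypothetical helper, not the TRT-LLM implementation):

```python
import bisect


def pick_graph_batch(actual, captured):
    """Return the smallest captured CUDA-graph batch size >= actual.

    Sketch of what enable_padding=True does conceptually: a request batch
    smaller than any captured size is padded up to the next one; if the
    batch exceeds every captured size, no graph is used (returns None).
    """
    captured = sorted(captured)
    i = bisect.bisect_left(captured, actual)
    return captured[i] if i < len(captured) else None


print(pick_graph_batch(1, [2, 4]))  # 2 -- a batch of 1 is padded up to 2
```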
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner April 15, 2026 19:56
@coderabbitai
Contributor

coderabbitai Bot commented Apr 15, 2026

📝 Walkthrough

Walkthrough

Updates CUDA graph configuration batch-size selection logic in the LLM initialization path. When Mixture-of-Experts (DeepEP) is enabled (ep > 1), candidate batch sizes are now filtered to only include values where (ep * b) % 4 == 0.

Changes

  • CUDA Graph Batch-Size Filtering (modelopt/deploy/llm/generate.py): Added conditional filtering of CudaGraphConfig.batch_sizes when ep > 1 to ensure batch sizes satisfy the divisibility constraint (ep * b) % 4 == 0.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The PR title accurately summarizes the main change: fixing a DeepEP/TMA constraint issue specific to MoE CUDA graph batch sizes, which directly matches the core problem and solution described in the PR objectives.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Security Anti-Patterns: ✅ Passed. The pull request modifies batch size filtering logic for CUDA graphs with expert parallelism, containing no unsafe deserialization, remote code execution flags, eval/exec calls, nosec comments, or new dependencies.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@kevalmorabia97 kevalmorabia97 requested a review from cjluo-nv April 15, 2026 19:57
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/deploy/llm/generate.py`:
- Around line 136-141: The filtered batch_sizes list can become empty (when ep >
1) causing CudaGraphConfig(...) to raise when it calls max(self.batch_sizes);
capture the original batch_sizes before filtering, then after applying
batch_sizes = [b for b in batch_sizes if (ep * b) % 4 == 0] add a guard: if not
batch_sizes, raise a clear ValueError (or log + bail) that includes ep and the
original batch_sizes so the caller knows why no valid batch sizes exist; ensure
this guard sits immediately before constructing CudaGraphConfig so
CudaGraphConfig(...) never receives an empty list.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 0d5502b0-dea2-4e6b-a20c-b24769f1ed26

📥 Commits

Reviewing files that changed from the base of the PR and between 361f7e3 and 15e1087.

📒 Files selected for processing (1)
  • modelopt/deploy/llm/generate.py

Comment on lines +136 to 141
+            batch_sizes = [b for b in batch_sizes if (ep * b) % 4 == 0]
             cuda_graph_config = CudaGraphConfig(
-                batch_sizes=[2**i for i in range(int((max_batch_size - 1).bit_length()))]
-                + [max_batch_size],
+                batch_sizes=batch_sizes,
                 max_batch_size=max_batch_size,
                 enable_padding=True,
             )
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

In TensorRT-LLM (llmapi.CudaGraphConfig), is batch_sizes=[] valid input? If valid, what exact runtime behavior is expected? If invalid, what exception/error is raised and in which versions?

💡 Result:

No, batch_sizes=[] is not a valid input for llmapi.CudaGraphConfig in TensorRT-LLM. It raises a ValueError during validation in the @model_validator because max(self.batch_sizes) on an empty list fails. The validator sorts the list and computes derived_max = max(self.batch_sizes), which errors on empty input. To disable CUDA graphs or use defaults, pass batch_sizes=None (default) or max_batch_size=0. With batch_sizes=None and max_batch_size=0 (also default), it falls back to generating batch sizes up to 128. Setting max_batch_size=0 with batch_sizes=None likely results in no graphs captured, using eager execution (runtime behavior seen in issues like #4811, #5634 where max_batch_size=0 or cuda_graph_config: null disables graphs). This behavior is from recent versions (post-2025 commits at hash 656091bf), consistent across source code and issues. No version-specific differences found; validation prevents empty lists uniformly.



🏁 Script executed:

# First, let's find and examine the file
fd -t f "generate.py" | grep -E "deploy/llm"

Repository: NVIDIA/Model-Optimizer



🏁 Script executed:

# Then examine the specific lines mentioned in the review
head -150 modelopt/deploy/llm/generate.py | tail -30



🏁 Script executed:

# Get more context around the issue to understand the full function
sed -n '100,160p' modelopt/deploy/llm/generate.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2378


Add a guard to check that batch_sizes is not empty before creating CudaGraphConfig.

The filter at line 136 can produce an empty list when ep > 1 and no batch sizes satisfy the (ep * b) % 4 == 0 constraint. Passing batch_sizes=[] to CudaGraphConfig raises a ValueError because its validator calls max(self.batch_sizes) on the empty list.
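A runnable sketch of the guard this comment asks for (hypothetical function and message; the actual suggested diff for generate.py follows below):

```python
import warnings


def make_cuda_graph_batch_sizes(batch_sizes, ep, max_batch_size):
    """Apply the DeepEP TMA filter, guarding against an empty result.

    Returns the filtered list, or None (with a warning) when no candidate
    satisfies (ep * b) % 4 == 0, so CudaGraphConfig never receives [].
    """
    if ep > 1:
        batch_sizes = [b for b in batch_sizes if (ep * b) % 4 == 0]
    if not batch_sizes:
        warnings.warn(
            f"No CUDA graph batch sizes satisfy the DeepEP TMA constraint "
            f"(ep={ep}, max_batch_size={max_batch_size}); disabling CUDA graphs."
        )
        return None
    return batch_sizes


print(make_cuda_graph_batch_sizes([1, 2], ep=2, max_batch_size=2))  # [2]
print(make_cuda_graph_batch_sizes([1], ep=2, max_batch_size=1))     # None (warns)
```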

Suggested fix
             batch_sizes = [b for b in batch_sizes if (ep * b) % 4 == 0]
-            cuda_graph_config = CudaGraphConfig(
-                batch_sizes=batch_sizes,
-                max_batch_size=max_batch_size,
-                enable_padding=True,
-            )
+            if batch_sizes:
+                cuda_graph_config = CudaGraphConfig(
+                    batch_sizes=batch_sizes,
+                    max_batch_size=max_batch_size,
+                    enable_padding=True,
+                )
+            else:
+                warnings.warn(
+                    "No CUDA graph batch sizes satisfy DeepEP TMA constraint "
+                    f"(ep={ep}, max_batch_size={max_batch_size}); disabling cuda graphs."
+                )

@github-actions
Contributor

github-actions Bot commented Apr 15, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-16 06:12 UTC

@codecov

codecov Bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.52%. Comparing base (361f7e3) to head (15e1087).
⚠️ Report is 2 commits behind head on main.

  • modelopt/deploy/llm/generate.py: patch coverage 66.66%, 1 line missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1267      +/-   ##
==========================================
+ Coverage   75.58%   76.52%   +0.93%     
==========================================
  Files         459      459              
  Lines       48528    48531       +3     
==========================================
+ Hits        36681    37139     +458     
+ Misses      11847    11392     -455     
  • examples: 41.39% <66.66%> (+11.53%) ⬆️
  • gpu: 59.99% <0.00%> (-0.49%) ⬇️
  • unit: 52.03% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
@kevalmorabia97
Collaborator Author

Replaced by #1273
