Fix DeepEP TMA constraint violation for MoE CUDA graph batch sizes #1267
kevalmorabia97 wants to merge 1 commit into main from
Conversation
When running a MoE model on multiple GPUs, the CUDA graph `batch_sizes` list could include values that violate DeepEP's TMA constraint: `(num_ranks * num_max_dispatch_tokens_per_rank) % 4 == 0`.

For example, with `max_batch_size=2` and `ep=2`, `batch_sizes=[1, 2]`. Building a CUDA graph for `batch_size=1` passes `num_max_dispatch_tokens_per_rank=1` to DeepEP, and `(2 * 1) % 4 = 2 != 0` triggers a `RuntimeError`.

Filter out batch sizes that violate the constraint when `ep > 1`. `enable_padding=True` ensures smaller batches are padded up to the next valid size at inference time.

Fixes `test_ptq_mixtral` on 2-GPU machines.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
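For illustration, a minimal standalone sketch of the filter described above (the function name and driver lines are hypothetical; the actual change is the one-line list comprehension in `modelopt/deploy/llm/generate.py`):

```python
# Sketch of the batch-size filter this PR describes (hypothetical standalone
# reproduction, not the real generate.py code).
def valid_cuda_graph_batch_sizes(batch_sizes: list[int], ep: int) -> list[int]:
    """Keep only batch sizes satisfying DeepEP's TMA constraint when ep > 1."""
    if ep <= 1:
        return batch_sizes  # constraint only matters with expert parallelism
    # DeepEP requires (num_ranks * num_max_dispatch_tokens_per_rank) % 4 == 0,
    # and the CUDA graph batch size becomes num_max_dispatch_tokens_per_rank.
    return [b for b in batch_sizes if (ep * b) % 4 == 0]


# With max_batch_size=2 and ep=2: (2*1) % 4 = 2 drops 1, (2*2) % 4 = 0 keeps 2.
print(valid_cuda_graph_batch_sizes([1, 2], ep=2))        # [2]
print(valid_cuda_graph_batch_sizes([1, 2, 4, 8], ep=2))  # [2, 4, 8]
```

For `ep=2` the constraint reduces to even batch sizes, which is why `batch_size=1` is dropped while `enable_padding=True` still serves actual batches of 1 by padding them up to 2.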
📝 Walkthrough

Updates CUDA graph configuration batch-size selection logic in the LLM initialization path. When Mixture-of-Experts (DeepEP) is enabled (`ep > 1`), batch sizes that violate DeepEP's TMA constraint are filtered out before the `CudaGraphConfig` is constructed.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/deploy/llm/generate.py`:
- Around line 136-141: The filtered batch_sizes list can become empty (when ep >
1) causing CudaGraphConfig(...) to raise when it calls max(self.batch_sizes);
capture the original batch_sizes before filtering, then after applying
batch_sizes = [b for b in batch_sizes if (ep * b) % 4 == 0] add a guard: if not
batch_sizes, raise a clear ValueError (or log + bail) that includes ep and the
original batch_sizes so the caller knows why no valid batch sizes exist; ensure
this guard sits immediately before constructing CudaGraphConfig so
CudaGraphConfig(...) never receives an empty list.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 0d5502b0-dea2-4e6b-a20c-b24769f1ed26
📒 Files selected for processing (1)
modelopt/deploy/llm/generate.py
```diff
+     batch_sizes = [b for b in batch_sizes if (ep * b) % 4 == 0]
      cuda_graph_config = CudaGraphConfig(
-         batch_sizes=[2**i for i in range(int((max_batch_size - 1).bit_length()))]
-         + [max_batch_size],
+         batch_sizes=batch_sizes,
          max_batch_size=max_batch_size,
          enable_padding=True,
      )
```
🧩 Analysis chain
🌐 Web query:
In TensorRT-LLM (llmapi.CudaGraphConfig), is batch_sizes=[] valid input? If valid, what exact runtime behavior is expected? If invalid, what exception/error is raised and in which versions?
💡 Result:
No, batch_sizes=[] is not a valid input for llmapi.CudaGraphConfig in TensorRT-LLM. It raises a ValueError during validation in the @model_validator because max(self.batch_sizes) on an empty list fails. The validator sorts the list and computes derived_max = max(self.batch_sizes), which errors on empty input. To disable CUDA graphs or use defaults, pass batch_sizes=None (default) or max_batch_size=0. With batch_sizes=None and max_batch_size=0 (also default), it falls back to generating batch sizes up to 128. Setting max_batch_size=0 with batch_sizes=None likely results in no graphs captured, using eager execution (runtime behavior seen in issues like #4811, #5634 where max_batch_size=0 or cuda_graph_config: null disables graphs). This behavior is from recent versions (post-2025 commits at hash 656091bf), consistent across source code and issues. No version-specific differences found; validation prevents empty lists uniformly.
Citations:
- 1: https://github.com/NVIDIA/TensorRT-LLM/blob/656091bf/tensorrt_llm/llmapi/llm_args.py
- 2: [Bug] Users need to add `cuda_graph_max_batch_size=0` to avoid crash when config from extra-llm-api-config.yml (TensorRT-LLM#4811)
- 3: [fix] Update to properly set cuda graphs in trtllm-bench overrides (TensorRT-LLM#5634)
- 4: Breaking change: perf: [TRTLLM-4662] Enable cuda graph by default (TensorRT-LLM#5480)
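The failure mode the query result describes reduces to calling `max()` on an empty list. A plain-Python stand-in reproduces it (hypothetical `FakeCudaGraphConfig`; TensorRT-LLM's actual validator lives in `tensorrt_llm/llmapi/llm_args.py`):

```python
# Hypothetical stand-in for the derived-max validation described above;
# not TensorRT-LLM's actual code.
class FakeCudaGraphConfig:
    def __init__(self, batch_sizes=None, max_batch_size=0):
        if batch_sizes is not None:
            # max() raises ValueError on an empty sequence -- the crash the
            # review warns about when the TMA filter removes every batch size.
            max_batch_size = max(max_batch_size, max(batch_sizes))
        self.batch_sizes = sorted(batch_sizes) if batch_sizes is not None else None
        self.max_batch_size = max_batch_size


print(FakeCudaGraphConfig(batch_sizes=[4, 2]).max_batch_size)  # 4
try:
    FakeCudaGraphConfig(batch_sizes=[])  # empty after over-aggressive filtering
except ValueError as exc:
    print(f"rejected: {exc}")
```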
🏁 Script executed:

```shell
# First, let's find and examine the file
fd -t f "generate.py" | grep -E "deploy/llm"
```

Repository: NVIDIA/Model-Optimizer
Length of output: 97
🏁 Script executed:

```shell
# Then examine the specific lines mentioned in the review
head -150 modelopt/deploy/llm/generate.py | tail -30
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1401
🏁 Script executed:

```shell
# Get more context around the issue to understand the full function
sed -n '100,160p' modelopt/deploy/llm/generate.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2378
Add a guard to check that batch_sizes is not empty before creating CudaGraphConfig.
The filter at line 136 can produce an empty list when `ep > 1` and no batch sizes satisfy the `(ep * b) % 4 == 0` constraint. Passing `batch_sizes=[]` to `CudaGraphConfig` raises a `ValueError` because its validator calls `max(self.batch_sizes)` on the empty list.
Suggested fix

```diff
  batch_sizes = [b for b in batch_sizes if (ep * b) % 4 == 0]
- cuda_graph_config = CudaGraphConfig(
-     batch_sizes=batch_sizes,
-     max_batch_size=max_batch_size,
-     enable_padding=True,
- )
+ if batch_sizes:
+     cuda_graph_config = CudaGraphConfig(
+         batch_sizes=batch_sizes,
+         max_batch_size=max_batch_size,
+         enable_padding=True,
+     )
+ else:
+     warnings.warn(
+         "No CUDA graph batch sizes satisfy DeepEP TMA constraint "
+         f"(ep={ep}, max_batch_size={max_batch_size}); disabling cuda graphs."
+     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/deploy/llm/generate.py` around lines 136 - 141, The filtered
batch_sizes list can become empty (when ep > 1) causing CudaGraphConfig(...) to
raise when it calls max(self.batch_sizes); capture the original batch_sizes
before filtering, then after applying batch_sizes = [b for b in batch_sizes if
(ep * b) % 4 == 0] add a guard: if not batch_sizes, raise a clear ValueError (or
log + bail) that includes ep and the original batch_sizes so the caller knows
why no valid batch sizes exist; ensure this guard sits immediately before
constructing CudaGraphConfig so CudaGraphConfig(...) never receives an empty
list.
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1267      +/-   ##
==========================================
+ Coverage   75.58%   76.52%   +0.93%
==========================================
  Files         459      459
  Lines       48528    48531       +3
==========================================
+ Hits        36681    37139     +458
+ Misses      11847    11392     -455
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Replaced by #1273
Summary
- `test_ptq_mixtral` was failing on 2-GPU machines with a `RuntimeError` from DeepEP inside TRT-LLM `LLM.__init__`.
- `generate.py` built a `CudaGraphConfig` with `batch_sizes=[1, 2]` for `max_batch_size=2`. Building the CUDA graph for batch size 1 with `ep=2` passed `num_max_dispatch_tokens_per_rank=1` to DeepEP, violating its TMA constraint: `(num_ranks * num_max_dispatch_tokens_per_rank) % 4 == 0` → `(2 * 1) % 4 = 2 ≠ 0`.
- Fix: when `ep > 1`, filter `batch_sizes` to only include values satisfying the constraint. `enable_padding=True` is already set, so smaller actual batches are padded up to the next valid size at runtime.

Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit