Merged
5 changes: 3 additions & 2 deletions modelopt/deploy/llm/generate.py
@@ -109,12 +109,13 @@ def _find_max_position_embeddings(cfg: dict) -> int | None:
if tp < 1:
tp = torch.cuda.device_count()

# Check if any key in config contains both "num" and "experts"
# Force ep=1 to avoid TRT-LLM DeepEP kernel failures on unsupported GPUs
# (e.g. Blackwell SM 12.0). Expert parallelism can be enabled explicitly
# by the caller when the environment is known to support it.
ep = 1
Comment on lines +112 to 115
Contributor

@coderabbitai (Bot) · Apr 16, 2026


⚠️ Potential issue | 🟠 Major

Comment/behavior mismatch: expert parallelism is now hard-disabled with no caller override.

Line 113 says EP can be enabled explicitly by callers, but this wrapper has no moe_expert_parallel_size input and always sets ep = 1 (propagated at Line 145). That makes EP impossible to enable through this API and can regress MoE multi-GPU setups that expect configurable EP (as shown in examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py and examples/speculative_decoding/collect_hidden_states/compute_hidden_states_trtllm.py).

Suggested fix
 class LLM(TRTLLM):
     def __init__(
         self,
         checkpoint_dir: str | Path,
         tokenizer: "str | Path | None" = None,
         medusa_choices: Any = None,
         tp: int = 0,
+        moe_expert_parallel_size: int | None = None,
         trust_remote_code: bool = False,
         max_seq_len: int = 0,
         max_batch_size: int = 0,
     ):
@@
-        # Force ep=1 to avoid TRT-LLM DeepEP kernel failures on unsupported GPUs
-        # (e.g. Blackwell SM 12.0). Expert parallelism can be enabled explicitly
-        # by the caller when the environment is known to support it.
-        ep = 1
+        # Default to EP=1 to avoid TRT-LLM DeepEP kernel failures on unsupported GPUs.
+        # Allow explicit override when the caller knows the environment supports EP.
+        ep = 1 if moe_expert_parallel_size is None else moe_expert_parallel_size
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/deploy/llm/generate.py` around lines 112 - 115, the code currently
forces ep = 1 unconditionally, preventing callers from enabling expert
parallelism. Modify the wrapper signature (e.g., add a parameter named
moe_expert_parallel_size or similar to the generate function and/or its callers)
with a default of 1, replace the hard-coded ep = 1 assignment with
ep = moe_expert_parallel_size, and propagate that value to where ep is used (the
variable referenced at line ~145). Keep the existing comment about forcing ep=1
as the safe default for unsupported GPUs, but allow callers to set
moe_expert_parallel_size when their environment supports EP.

Collaborator


@cjluo-nv is this important to address?

Contributor


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Collaborator Author


It just means that when TRT-LLM supports it, we will remove the hardcoded EP. Should be OK.

enable_attention_dp = False
for k in config:
if "num" in k and "experts" in k:
ep = torch.cuda.device_count()
enable_attention_dp = True
break
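The detection loop above keys MoE handling off config key names. It can be isolated as a small helper for illustration (hypothetical name; `device_count` stands in for `torch.cuda.device_count()` so the sketch runs without a GPU):

```python
def detect_moe_config(config: dict, device_count: int) -> tuple[int, bool]:
    """Mirror the loop above: if any config key mentions both "num" and
    "experts" (e.g. "num_experts", "num_local_experts"), treat the checkpoint
    as MoE, spread experts across all visible GPUs, and enable attention DP."""
    ep = 1
    enable_attention_dp = False
    for k in config:
        if "num" in k and "experts" in k:
            ep = device_count
            enable_attention_dp = True
            break
    return ep, enable_attention_dp
```

Note this is a substring heuristic on key names, so it matches any checkpoint config that exposes an expert-count field, regardless of the exact key spelling.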
