[#13819][feat] AutoDeploy: Qwen3.5 MoE (VLM) MTP#14641
[#13819][feat] AutoDeploy: Qwen3.5 MoE (VLM) MTP#14641govind-ramnarayan wants to merge 2 commits into
Conversation
fc554b1 to
8a77743
Compare
📝 WalkthroughWalkthroughThis PR adds complete MTP (Mixed Token Prefill) speculative decoding support for Qwen3.5 MoE 35B. The implementation extends the FLA custom op with an extend-path kernel, introduces a new Qwen3.5 MoE Eagle layer for mixed-token drafting, adds configurable target factory wiring in LlmArgs, refactors Eagle export infrastructure, and includes comprehensive unit and integration test coverage validating the end-to-end pipeline. ChangesMTP Eagle One-Model for Qwen3.5 MoE 35B
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py (1)
2248-2278:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
inputs_embeds-only requests are still broken for multimodal/cache paths.This fallback now permits
input_ids=None, but the same method still unconditionally readsinput_idslater for placeholder masks/counts, chunked mRoPE position reconstruction, and themrope_delta_cachedtype cast. A caller that passesinputs_embedswith image/video metadata will still fail on aNoneTypeaccess.Please either keep requiring
input_idswhenever multimodal metadata or mRoPE delta caching is present, or fully decouple those branches from token IDs.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py` around lines 2248 - 2278, The forward method currently allows inputs_embeds without input_ids but later unconditionally accesses input_ids for multimodal/cache logic (places referencing input_ids, image_grid_thw, video_grid_thw, batch_info, compute_mrope_positions, cu_seqlen, and mrope_delta_cache), which causes NoneType errors; fix by adding a guard in forward that raises a clear ValueError when inputs_embeds is provided but any multimodal metadata or mRoPE-delta caching is present (e.g., if image_grid_thw, video_grid_thw, batch_info, or mrope_delta_cache is not None) and require input_ids in that case, or alternatively refactor all downstream code paths that use input_ids (the placeholder masks/counts, chunked mRoPE reconstruction, and dtype cast) to derive the needed shapes/values from inputs_embeds instead; choose the simpler approach of requiring input_ids for multimodal/cache paths and add the explicit check near the top of forward after the inputs_embeds fallback.
🧹 Nitpick comments (1)
tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py (1)
240-273: ⚡ Quick winCover the no-speculative-config guard rails too.
Coverage is still insufficient for
tensorrt_llm/_torch/auto_deploy/llm_args.pyLines 324-330. Please add negative cases intests/unittest/auto_deploy/singlegpu/shim/test_llm_config.pyformodel_factory="eagle_one_model"withoutspeculative_config, and fortarget_model_factoryset without speculative decoding.As per coding guidelines,
tests/**: Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM. Keep feedback actionable: suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py` around lines 240 - 273, Add two negative unit tests to test_llm_config.py to cover the no-speculative-config guard rails in LlmArgs: (1) create a test that constructs LlmArgs with model_factory="eagle_one_model" and no speculative_config and assert it raises ValueError matching "speculative_config" (this exercises the guard in LlmArgs when eagle wrapper is declared without speculative decoding), and (2) create a test that sets target_model_factory (e.g., target_model_factory="AutoModelForImageTextToText") while leaving speculative_config=None and assert it raises ValueError matching "speculative_config" (this covers the guard that prevents declaring a target factory without speculative decoding). Ensure both tests import/instantiate LlmArgs and use pytest.raises with the appropriate match.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.py`:
- Around line 203-205: Run the code formatter (ruff-format) on the file to
normalize the multiline assignment to y_flat (the line assigning
y_flat[extend_start:extend_end] = y_extend.view(num_extend_tokens, HV,
-1).to(y_flat.dtype)); specifically reformat that multiline call so it matches
the project's ruff formatting rules and commit the resulting changes before
merging.
In `@tensorrt_llm/_torch/auto_deploy/models/eagle.py`:
- Around line 384-399: The draft export is left with the default 2D position_ids
contract; pass the same target export shape info into the draft export by
threading target_export_info into the DraftModelExportInfo construction (i.e.,
add target_export_info=target_export_info when creating DraftModelExportInfo
alongside TargetModelExportInfo) so the draft graph is specialized with the same
3D mRoPE position_ids as the target; update the return list where
TargetModelExportInfo and DraftModelExportInfo are created to include this
parameter.
In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`:
- Around line 442-462: The function is_trivial_passthrough_user currently treats
aten view/reshape/etc. as passthrough but omits torch.ops.auto_deploy.view,
causing inconsistent traversal with is_any_view_op; update
is_trivial_passthrough_user to also return True for torch.ops.auto_deploy.view
(add it to the call_function checks alongside
torch.ops.aten.view/reshape/transpose/permute/contiguous or add an explicit
is_op(...) check for torch.ops.auto_deploy.view) so that
collect_terminal_users_through_passthrough() stops at the auto_deploy.view
wrapper consistently.
In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py`:
- Line 1692: Add an explicit return type annotation "-> None" to the test
function definition for test_ir_mtp_gsm8k so its signature becomes def
test_ir_mtp_gsm8k(...) -> None:, matching the project's typing guideline; update
the function declaration (test_ir_mtp_gsm8k) only and ensure formatting matches
surrounding test functions' style.
In
`@tests/unittest/auto_deploy/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py`:
- Around line 94-96: Change the single-request test to also cover a batched
2-request case (or add a new test file named
test_fla_cached_gated_delta_rule_batched.py) so the per-request indexing code
paths in fla_cached_gated_delta_rule are exercised: set num_extend = 2 (and
tokens_per_extend > 1), construct inputs with distinct slot ids for each batch
row, run the same flow that produces intermediate_delta_cache and final output,
and add assertions that each row of intermediate_delta_cache and the
corresponding output map back to the correct request (verifying slot_idx_extend,
recurrent_state_indices behavior and the view(num_extend, tokens_per_extend,
...) reshape semantics). Ensure the test asserts per-row correctness rather than
only overall shapes so the new per-request indexing logic is covered.
In `@tests/unittest/auto_deploy/singlegpu/models/test_qwen3_5_moe.py`:
- Around line 3301-3305: The comparison is building reference logits incorrectly
from model.model.language_model(...).logits; instead call
model.model.language_model(...) to get its last_hidden_state and then convert
that hidden state to logits via the LM head used by the wrapper (use
model.model.lm_head or the appropriate head on model.model) so expected_logits =
lm_head(last_hidden_state) (ensure you match any transposition or
dtype/placement done by wrapper_logits). Update the block referencing
wrapper_logits, model.model.language_model, and model.model.lm_head accordingly.
---
Outside diff comments:
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py`:
- Around line 2248-2278: The forward method currently allows inputs_embeds
without input_ids but later unconditionally accesses input_ids for
multimodal/cache logic (places referencing input_ids, image_grid_thw,
video_grid_thw, batch_info, compute_mrope_positions, cu_seqlen, and
mrope_delta_cache), which causes NoneType errors; fix by adding a guard in
forward that raises a clear ValueError when inputs_embeds is provided but any
multimodal metadata or mRoPE-delta caching is present (e.g., if image_grid_thw,
video_grid_thw, batch_info, or mrope_delta_cache is not None) and require
input_ids in that case, or alternatively refactor all downstream code paths that
use input_ids (the placeholder masks/counts, chunked mRoPE reconstruction, and
dtype cast) to derive the needed shapes/values from inputs_embeds instead;
choose the simpler approach of requiring input_ids for multimodal/cache paths
and add the explicit check near the top of forward after the inputs_embeds
fallback.
---
Nitpick comments:
In `@tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py`:
- Around line 240-273: Add two negative unit tests to test_llm_config.py to
cover the no-speculative-config guard rails in LlmArgs: (1) create a test that
constructs LlmArgs with model_factory="eagle_one_model" and no
speculative_config and assert it raises ValueError matching "speculative_config"
(this exercises the guard in LlmArgs when eagle wrapper is declared without
speculative decoding), and (2) create a test that sets target_model_factory
(e.g., target_model_factory="AutoModelForImageTextToText") while leaving
speculative_config=None and assert it raises ValueError matching
"speculative_config" (this covers the guard that prevents declaring a target
factory without speculative decoding). Ensure both tests import/instantiate
LlmArgs and use pytest.raises with the appropriate match.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fba98229-0ec6-4834-81b0-d069d1fc359b
📒 Files selected for processing (17)
examples/auto_deploy/model_registry/configs/qwen3.5_moe_35b_mtp.yamlexamples/auto_deploy/model_registry/models.yamltensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.pytensorrt_llm/_torch/auto_deploy/llm_args.pytensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.pytensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.pytensorrt_llm/_torch/auto_deploy/models/eagle.pytensorrt_llm/_torch/auto_deploy/models/hf.pytensorrt_llm/_torch/auto_deploy/utils/node_utils.pytests/integration/defs/accuracy/references/gsm8k.yamltests/integration/defs/accuracy/test_llm_api_autodeploy.pytests/integration/test_lists/test-db/l0_dgx_h100.ymltests/unittest/auto_deploy/_utils_test/_model_test_utils.pytests/unittest/auto_deploy/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.pytests/unittest/auto_deploy/singlegpu/models/test_qwen3_5_moe.pytests/unittest/auto_deploy/singlegpu/shim/test_llm_config.pytests/unittest/auto_deploy/singlegpu/smoke/test_ad_speculative_decoding.py
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
e82d086 to
71357c9
Compare
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
|
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" |
|
PR_Github #51098 [ run ] triggered by Bot. Commit: |
|
PR_Github #51098 [ run ] completed with state
|
fixes: #13819
Two parts to this change:
Summary by CodeRabbit
Release Notes
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.