[#13819][feat] AutoDeploy: Qwen3.5 MoE (VLM) MTP by govind-ramnarayan · Pull Request #14641 · NVIDIA/TensorRT-LLM

govind-ramnarayan · 2026-05-27T18:24:22Z

fixes: #13819

Two parts to this change:

- Support VLMs with speculative decoding in AutoDeploy.
- Add modeling code and tests for Qwen3.5 MoE + MTP with the sharding IR.

Summary by CodeRabbit

Release Notes

New Features
- Added support for Qwen3.5 MoE 35B with MTP (Multi-Token Prediction) speculative decoding configuration and deployment.
- Enhanced Eagle drafter model architecture to support additional model variants and improved speculative decoding integration.
Tests
- Added comprehensive test coverage for MTP speculative decoding functionality, accuracy validation, and export workflows.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions).
- If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
- Any new dependencies have been scanned for license and vulnerabilities.
- CODEOWNERS updated if ownership changes.
- Documentation updated as needed.
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
- Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-05-29T00:03:08Z

📝 Walkthrough

This PR adds MTP (Multi-Token Prediction) speculative decoding support for Qwen3.5 MoE 35B. The implementation extends the FLA custom op with an extend-path kernel, introduces a new Qwen3.5 MoE Eagle layer for MTP drafting, adds configurable target factory wiring in LlmArgs, refactors the Eagle export infrastructure, and adds unit and integration tests that cover the end-to-end pipeline.

Changes

MTP Eagle One-Model for Qwen3.5 MoE 35B

Layer / File(s)	Summary
FLA Cached Gated Delta - Extend Path Support `tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.py`	Custom op registration and signature updated to accept optional `intermediate_delta_cache`. Batch accounting refactored to separate prefill/extend/decode sequences and tokens. Extend path validates intermediate buffer, derives recurrent initial states from delta_cache, calls `fused_recurrent_gated_delta_rule_update` with `disable_state_update=True`, and writes per-extend results to intermediate buffer. Cache initialization changes `delta_cache` from StateResourceHandler to SSMResourceHandler and adds SpecSSMResourceHandler for intermediate cache.
Qwen3.5 MoE Model - Eagle Layer & Accessors `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py`	New `Qwen3_5MoeEagleLayer` class fuses inputs_embeds with hidden states via RMSNorm-separate, concatenate, and linear projection; normalizes position_ids to 3D for mRoPE; dispatches through attention and MoE. Model wrappers now expose `get_output_embeddings()` and `get_final_normalization()` accessors. Multimodal wrapper `forward` accepts optional `inputs_embeds` parameter with proper fallback to input_ids. Conditional-generation wrapper forwards `inputs_embeds` through to multimodal call.
LlmArgs - Speculative Config & Target Factory Validation `tensorrt_llm/_torch/auto_deploy/llm_args.py`	Adds `target_model_factory: Optional[str]` field (used only with `model_factory='eagle_one_model'`). New `validate_speculative_model_factory` model validator enforces allowed speculative-config combinations and auto-selects `model_factory` when required. Relaxed `model_factory_exists` validator accepts None target_factory and skips registry checks. New `_requires_eagle_one_model()` helper centralizes Eagle routing logic. `create_factory()` now conditionally passes speculative/sync kwargs only for eagle_one_model, using shared common_kwargs for other factories.
Eagle Modeling - Layer Dispatch & Config Defaults `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py`	Added Qwen3.5 MoE Eagle layer dispatch in `get_eagle_layers()`. New `EagleConfig._drafter_defaults['qwen3_5_moe_text']` with checkpoint conversion mapping for mtp.* state dicts. `EagleDrafterForCausalLM` now includes `Qwen3_5MoeEagleLayer` in `_no_split_modules` and accepts unused HuggingFace kwargs. Enhanced `_filter_kwargs_for_submodule()` handles nested GraphModule targets by unwrapping child graphs and detecting ambiguous cases.
Eagle Factory - Configurable Target Factory & Export `tensorrt_llm/_torch/auto_deploy/models/eagle.py`	`EagleDrafterFactory` extended with `use_inner_text_config` parameter; `_get_model_config()` can optionally extract `text_config` from multimodal configs. `TargetModelExportInfo` refactored to accept configurable `submodule_name` and optional `target_export_info`; dynamic-shape lookup merges constraints from target_export_info; post_process delegates to target_export_info and uses `expose_graph_module_accessor` for binding target embeddings/outputs. `EagleOneModelFactory` adds `target_model_factory` parameter with registry lookup and computed `use_inner_text_config` based on factory class. `get_export_infos()` derives target_export_info and computes nested `target_model.<inner>` paths.
HF Export - expose_graph_module_accessor Helper `tensorrt_llm/_torch/auto_deploy/models/hf.py`	New `expose_graph_module_accessor()` utility binds zero-argument accessors from original modules onto exported GraphModules, recreates submodule hierarchy, and inserts sentinel `torch._assert` nodes to prevent cleanup deletion. `TextModelExportInfo.post_process()` delegated to this helper.
Node Utils - Passthrough Detection Simplification `tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`	Simplified passthrough detection by replacing layered helpers with single `is_trivial_passthrough_user()` predicate for narrow set of layout ops (view, reshape, transpose, permute, getitem, call_method). Removed `allow_dtype_cast` parameters from signatures. Removed `unwrap_input_through_passthrough()` utility and related helpers.
Model Registry - MTP YAML Config & Model Entry `examples/auto_deploy/model_registry/configs/qwen3.5_moe_35b_mtp.yaml`, `examples/auto_deploy/model_registry/models.yaml`, `tests/unittest/auto_deploy/_utils_test/_model_test_utils.py`	New config file defines MTP speculative decoding settings, CUDA graph batch sizes, model factory selection, and transformation/sharding parameters. Model registry entry for Qwen/Qwen3.5-35B-A3B updated to use mtp variant. Test utils add Qwen3.5 MoE to _SMALL_MODEL_CONFIGS with detailed text_config and vision_config parameters.
FLA Custom Op Tests - Extend Path Coverage `tests/unittest/auto_deploy/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py`	New `make_extend_kernel_inputs()` helper constructs extend-request tuples with intermediate_delta_cache. Updated existing test invocation sites to pass None for intermediate_delta_cache. New `test_intermediate_delta_cache()` validates per-prefix state writes and delta_cache preservation. New `test_extend_cuda_graph_capture()` verifies CUDA graph captureability.
Qwen3.5 MoE Unit Tests - MTP Layer & Factory Coverage `tests/unittest/auto_deploy/singlegpu/models/test_qwen3_5_moe.py`	Comprehensive test suite (600+ lines) with factories, dynamic shapes, weight initialization, and manual reference implementations for RMSNorm, attention, MoE, and full Eagle layer. Tests validate MTP config defaults, checkpoint mapping, layer output matching, drafter I/O contract, factory wrapper selection, hidden-state capture, VLM text config handling, one-model wrapper assembly, and strict checkpoint loading.
LlmArgs Config Validation Tests `tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py`	Enhanced speculative config tests with target_model_factory assertions for Eagle3 and MTP modes. New test verifies declared factory preserved as target_model_factory for MTP-Eagle one-model. New negative test ensures wrapper factories require explicit target_model_factory.
Smoke Tests - Eagle Wrapper & MTP End-to-End `tests/unittest/auto_deploy/singlegpu/smoke/test_ad_speculative_decoding.py`	New helper and unit tests for `EagleWrapper._filter_kwargs_for_submodule()` covering direct/nested graph handling and ambiguous rejection. New test validates `EagleOneModelFactory.get_export_infos()` export-info ordering. New `test_qwen3_5_moe_mtp_smoke()` smoke test for VLM target with MTP/Eagle config. Extended existing capture test assertion.
Integration Tests - GSM8K Accuracy & Test Registry `tests/integration/defs/accuracy/references/gsm8k.yaml`, `tests/integration/defs/accuracy/test_llm_api_autodeploy.py`, `tests/integration/test_lists/test-db/l0_dgx_h100.yml`	New GSM8K accuracy reference (94.53) for MTP config. Class constants for extra_acc_spec and min acceptance rate. New `test_ir_mtp_gsm8k()` validates MTP config structure, runs AutoDeployLLM, evaluates GSM8K, and checks acceptance rate. Registered in l0_dgx_h100 post-merge suite.
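
The FLA row in the table above mentions batch accounting that separates prefill, extend, and decode tokens. A minimal sketch of that kind of accounting follows. It assumes the flattened token buffer is laid out as prefill, then extend, then decode; that ordering, and the names `TokenRanges` and `split_token_ranges`, are hypothetical and not taken from `fla_backend_gated_delta.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRanges:
    """Half-open [start, end) token ranges inside one flattened batch."""

    prefill: tuple[int, int]
    extend: tuple[int, int]
    decode: tuple[int, int]


def split_token_ranges(
    num_prefill_tokens: int, num_extend_tokens: int, num_decode_tokens: int
) -> TokenRanges:
    # Assumed layout (not confirmed by the PR): prefill, then extend, then decode.
    if min(num_prefill_tokens, num_extend_tokens, num_decode_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    extend_start = num_prefill_tokens
    extend_end = extend_start + num_extend_tokens
    decode_end = extend_end + num_decode_tokens
    return TokenRanges(
        prefill=(0, extend_start),
        extend=(extend_start, extend_end),
        decode=(extend_end, decode_end),
    )


ranges = split_token_ranges(num_prefill_tokens=7, num_extend_tokens=8, num_decode_tokens=3)
print(ranges.extend)  # (7, 15)
```

With ranges like these, a slice such as `y_flat[extend_start:extend_end]` touches only the extend tokens, which is the property the extend-path write relies on.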

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#14352: Modifies yaml_extra sourcing for AutoDeployLLM in accuracy tests, which the MTP GSM8K test depends on for registry-driven config loading.

Suggested reviewers

suyoggupta
syuoni
2ez4bz
galagam
yechank-nvidia

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 29.52% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	❓ Inconclusive	The PR description is incomplete, showing only the issue reference and brief feature summary without detailed explanation of changes or test coverage details.	Provide comprehensive description explaining the VLM speculative decoding support and Qwen3.5 MoE MTP changes; detail test coverage and architectural decisions.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title '[`#13819`][feat] AutoDeploy: Qwen3.5 MoE (VLM) MTP' is specific and clearly describes the feature—adding AutoDeploy support for Qwen3.5 MoE VLM with MTP speculative decoding.

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py (1)
2248-2278: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

inputs_embeds-only requests are still broken for multimodal/cache paths.

This fallback now permits input_ids=None, but the same method still unconditionally reads input_ids later for placeholder masks/counts, chunked mRoPE position reconstruction, and the mrope_delta_cache dtype cast. A caller that passes inputs_embeds with image/video metadata will still fail on a NoneType access.

Please either keep requiring input_ids whenever multimodal metadata or mRoPE delta caching is present, or fully decouple those branches from token IDs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py` around
lines 2248 - 2278, The forward method currently allows inputs_embeds without
input_ids but later unconditionally accesses input_ids for multimodal/cache
logic (places referencing input_ids, image_grid_thw, video_grid_thw, batch_info,
compute_mrope_positions, cu_seqlen, and mrope_delta_cache), which causes
NoneType errors; fix by adding a guard in forward that raises a clear ValueError
when inputs_embeds is provided but any multimodal metadata or mRoPE-delta
caching is present (e.g., if image_grid_thw, video_grid_thw, batch_info, or
mrope_delta_cache is not None) and require input_ids in that case, or
alternatively refactor all downstream code paths that use input_ids (the
placeholder masks/counts, chunked mRoPE reconstruction, and dtype cast) to
derive the needed shapes/values from inputs_embeds instead; choose the simpler
approach of requiring input_ids for multimodal/cache paths and add the explicit
check near the top of forward after the inputs_embeds fallback.
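
The simpler option described above (require `input_ids` whenever multimodal metadata or mRoPE delta caching is present) could look roughly like the sketch below. The helper name `require_input_ids_for_multimodal` and the error messages are hypothetical; the real `forward` signature and the exact set of arguments to check should be taken from `modeling_qwen3_5_moe.py`.

```python
from typing import Any, Optional


def require_input_ids_for_multimodal(
    input_ids: Optional[Any],
    inputs_embeds: Optional[Any],
    **multimodal_or_cache_inputs: Optional[Any],
) -> None:
    """Reject inputs_embeds-only calls that also carry multimodal or cache inputs."""
    if input_ids is None and inputs_embeds is None:
        raise ValueError("either input_ids or inputs_embeds must be provided")
    if input_ids is not None:
        return  # token IDs are available, so downstream branches can read them
    # Collect the names of all optional inputs that were actually supplied.
    present = sorted(k for k, v in multimodal_or_cache_inputs.items() if v is not None)
    if present:
        raise ValueError(
            "input_ids is required when these inputs are set: " + ", ".join(present)
        )


# Text-only embeddings pass the guard.
require_input_ids_for_multimodal(None, [[0.1, 0.2]])

# Embeddings plus image metadata are rejected with a clear message.
try:
    require_input_ids_for_multimodal(None, [[0.1, 0.2]], image_grid_thw=[[1, 2, 2]])
except ValueError as err:
    print(err)  # input_ids is required when these inputs are set: image_grid_thw
```

Failing early with a named list of offending inputs replaces the later `NoneType` access with an error that points at the caller's mistake.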

🧹 Nitpick comments (1)

tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py (1)
240-273: ⚡ Quick win

Cover the no-speculative-config guard rails too.

Coverage is still insufficient for tensorrt_llm/_torch/auto_deploy/llm_args.py Lines 324-330. Please add negative cases in tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py for model_factory="eagle_one_model" without speculative_config, and for target_model_factory set without speculative decoding.

As per coding guidelines, tests/**: Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM. Keep feedback actionable: suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py` around lines
240 - 273, Add two negative unit tests to test_llm_config.py to cover the
no-speculative-config guard rails in LlmArgs: (1) create a test that constructs
LlmArgs with model_factory="eagle_one_model" and no speculative_config and
assert it raises ValueError matching "speculative_config" (this exercises the
guard in LlmArgs when eagle wrapper is declared without speculative decoding),
and (2) create a test that sets target_model_factory (e.g.,
target_model_factory="AutoModelForImageTextToText") while leaving
speculative_config=None and assert it raises ValueError matching
"speculative_config" (this covers the guard that prevents declaring a target
factory without speculative decoding). Ensure both tests import/instantiate
LlmArgs and use pytest.raises with the appropriate match.
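
The shape of the two requested negative tests can be sketched without importing TensorRT-LLM. `StandInArgs` below is a hypothetical stand-in that only mirrors the two guards as the comment describes them; its error messages are assumptions. The real tests would construct `LlmArgs` and use `pytest.raises(..., match="speculative_config")`; `unittest` is used here only to keep the sketch runnable with the standard library.

```python
import unittest
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandInArgs:
    """Hypothetical stand-in for LlmArgs; mirrors only the two guards under test."""

    model_factory: str = "AutoModelForCausalLM"
    target_model_factory: Optional[str] = None
    speculative_config: Optional[dict] = None

    def __post_init__(self) -> None:
        if self.speculative_config is not None:
            return
        if self.model_factory == "eagle_one_model":
            raise ValueError("eagle_one_model requires speculative_config")
        if self.target_model_factory is not None:
            raise ValueError("target_model_factory requires speculative_config")


class NoSpeculativeConfigGuards(unittest.TestCase):
    def test_eagle_one_model_without_speculative_config(self) -> None:
        with self.assertRaisesRegex(ValueError, "speculative_config"):
            StandInArgs(model_factory="eagle_one_model")

    def test_target_factory_without_speculative_config(self) -> None:
        with self.assertRaisesRegex(ValueError, "speculative_config"):
            StandInArgs(target_model_factory="AutoModelForImageTextToText")


# Run the two cases explicitly instead of unittest.main(), which would exit the process.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NoSpeculativeConfigGuards)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```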

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.py`:
- Around line 203-205: Run the code formatter (ruff-format) on the file to
normalize the multiline assignment to y_flat (the line assigning
y_flat[extend_start:extend_end] = y_extend.view(num_extend_tokens, HV,
-1).to(y_flat.dtype)); specifically reformat that multiline call so it matches
the project's ruff formatting rules and commit the resulting changes before
merging.

In `@tensorrt_llm/_torch/auto_deploy/models/eagle.py`:
- Around line 384-399: The draft export is left with the default 2D position_ids
contract; pass the same target export shape info into the draft export by
threading target_export_info into the DraftModelExportInfo construction (i.e.,
add target_export_info=target_export_info when creating DraftModelExportInfo
alongside TargetModelExportInfo) so the draft graph is specialized with the same
3D mRoPE position_ids as the target; update the return list where
TargetModelExportInfo and DraftModelExportInfo are created to include this
parameter.

In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`:
- Around line 442-462: The function is_trivial_passthrough_user currently treats
aten view/reshape/etc. as passthrough but omits torch.ops.auto_deploy.view,
causing inconsistent traversal with is_any_view_op; update
is_trivial_passthrough_user to also return True for torch.ops.auto_deploy.view
(add it to the call_function checks alongside
torch.ops.aten.view/reshape/transpose/permute/contiguous or add an explicit
is_op(...) check for torch.ops.auto_deploy.view) so that
collect_terminal_users_through_passthrough() stops at the auto_deploy.view
wrapper consistently.

In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py`:
- Line 1692: Add an explicit return type annotation "-> None" to the test
function definition for test_ir_mtp_gsm8k so its signature becomes def
test_ir_mtp_gsm8k(...) -> None:, matching the project's typing guideline; update
the function declaration (test_ir_mtp_gsm8k) only and ensure formatting matches
surrounding test functions' style.

In
`@tests/unittest/auto_deploy/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py`:
- Around line 94-96: Change the single-request test to also cover a batched
2-request case (or add a new test file named
test_fla_cached_gated_delta_rule_batched.py) so the per-request indexing code
paths in fla_cached_gated_delta_rule are exercised: set num_extend = 2 (and
tokens_per_extend > 1), construct inputs with distinct slot ids for each batch
row, run the same flow that produces intermediate_delta_cache and final output,
and add assertions that each row of intermediate_delta_cache and the
corresponding output map back to the correct request (verifying slot_idx_extend,
recurrent_state_indices behavior and the view(num_extend, tokens_per_extend,
...) reshape semantics). Ensure the test asserts per-row correctness rather than
only overall shapes so the new per-request indexing logic is covered.

In `@tests/unittest/auto_deploy/singlegpu/models/test_qwen3_5_moe.py`:
- Around line 3301-3305: The comparison is building reference logits incorrectly
from model.model.language_model(...).logits; instead call
model.model.language_model(...) to get its last_hidden_state and then convert
that hidden state to logits via the LM head used by the wrapper (use
model.model.lm_head or the appropriate head on model.model) so expected_logits =
lm_head(last_hidden_state) (ensure you match any transposition or
dtype/placement done by wrapper_logits). Update the block referencing
wrapper_logits, model.model.language_model, and model.model.lm_head accordingly.
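
The batched-coverage suggestion for `test_fla_cached_gated_delta_rule.py` depends on the row-major semantics of `view(num_extend, tokens_per_extend, ...)`: flat token `i` belongs to request `i // tokens_per_extend` at offset `i % tokens_per_extend`. The plain-Python sketch below illustrates that mapping and why distinct per-row values make per-request assertions meaningful. It does not use torch, and `flat_to_request` is a hypothetical helper, not part of the test file.

```python
def flat_to_request(flat_index: int, tokens_per_extend: int) -> tuple[int, int]:
    """Map a flat token index to (request_row, token_offset) under a row-major view."""
    if tokens_per_extend <= 0:
        raise ValueError("tokens_per_extend must be positive")
    return divmod(flat_index, tokens_per_extend)


num_extend, tokens_per_extend = 2, 3

# Tag each flat token with a value that encodes its request, so rows cannot be confused.
flat = [
    100 * (i // tokens_per_extend) + (i % tokens_per_extend)
    for i in range(num_extend * tokens_per_extend)
]

# Row-major regrouping, equivalent to viewing the flat buffer as (num_extend, tokens_per_extend).
rows = [flat[r * tokens_per_extend : (r + 1) * tokens_per_extend] for r in range(num_extend)]
print(rows)  # [[0, 1, 2], [100, 101, 102]]
```

A shape-only assertion would pass even if the two rows were swapped; comparing each row against values tied to its own request is what catches an indexing mistake.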

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fba98229-0ec6-4834-81b0-d069d1fc359b

📥 Commits

Reviewing files that changed from the base of the PR and between f6ba936 and 8a77743.

📒 Files selected for processing (17)

examples/auto_deploy/model_registry/configs/qwen3.5_moe_35b_mtp.yaml
examples/auto_deploy/model_registry/models.yaml
tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.py
tensorrt_llm/_torch/auto_deploy/llm_args.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py
tensorrt_llm/_torch/auto_deploy/models/eagle.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
tests/integration/defs/accuracy/references/gsm8k.yaml
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
tests/integration/test_lists/test-db/l0_dgx_h100.yml
tests/unittest/auto_deploy/_utils_test/_model_test_utils.py
tests/unittest/auto_deploy/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py
tests/unittest/auto_deploy/singlegpu/models/test_qwen3_5_moe.py
tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py
tests/unittest/auto_deploy/singlegpu/smoke/test_ad_speculative_decoding.py

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>

govind-ramnarayan · 2026-05-29T19:48:52Z

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

tensorrt-cicd · 2026-05-29T19:55:41Z

PR_Github #51098 [ run ] triggered by Bot. Commit: 1b8d508

tensorrt-cicd · 2026-05-29T22:33:50Z

PR_Github #51098 [ run ] completed with state SUCCESS. Commit: 1b8d508
/LLM/main/L0_MergeRequest_PR pipeline #40536 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR.
- If you cannot view the failures, ask the CI triggerer to share details.
- Once fixed, request an NVIDIA team member to trigger CI again.

github-actions Bot assigned govind-ramnarayan May 27, 2026