[#13819][feat] AutoDeploy: Qwen3.5 MoE (VLM) MTP #14641

Open
govind-ramnarayan wants to merge 2 commits into NVIDIA:main from nv-auto-deploy:gramnarayan/qwen3-vlm-mtp

govind-ramnarayan (Collaborator) commented May 27, 2026

fixes: #13819

Two parts to this change:

  1. Support VLMs with speculative decoding in AutoDeploy
  2. Add modeling code + tests for Qwen3.5 MoE + MTP with the sharding IR.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for Qwen3.5 MoE 35B with MTP (Multi-Token Prediction) speculative decoding configuration and deployment.
    • Enhanced Eagle drafter model architecture to support additional model variants and improved speculative decoding integration.
  • Tests

    • Added comprehensive test coverage for MTP speculative decoding functionality, accuracy validation, and export workflows.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

govind-ramnarayan marked this pull request as ready for review May 29, 2026 00:00
govind-ramnarayan requested review from a team as code owners May 29, 2026 00:00
govind-ramnarayan force-pushed the gramnarayan/qwen3-vlm-mtp branch from fc554b1 to 8a77743 May 29, 2026 00:02

coderabbitai Bot (Contributor) commented May 29, 2026

📝 Walkthrough


This PR adds MTP (Multi-Token Prediction) speculative decoding support for Qwen3.5 MoE 35B. The implementation extends the FLA custom op with an extend-path kernel, introduces a new Qwen3.5 MoE Eagle layer for MTP drafting, adds configurable target factory wiring in LlmArgs, refactors the Eagle export infrastructure, and includes unit and integration tests that validate the end-to-end pipeline.

Changes

MTP Eagle One-Model for Qwen3.5 MoE 35B

Layer / File(s) Summary
FLA Cached Gated Delta - Extend Path Support
tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.py
Custom op registration and signature updated to accept optional intermediate_delta_cache. Batch accounting refactored to separate prefill/extend/decode sequences and tokens. Extend path validates intermediate buffer, derives recurrent initial states from delta_cache, calls fused_recurrent_gated_delta_rule_update with disable_state_update=True, and writes per-extend results to intermediate buffer. Cache initialization changes delta_cache from StateResourceHandler to SSMResourceHandler and adds SpecSSMResourceHandler for intermediate cache.
Qwen3.5 MoE Model - Eagle Layer & Accessors
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py
New Qwen3_5MoeEagleLayer class fuses inputs_embeds with hidden states via RMSNorm-separate, concatenate, and linear projection; normalizes position_ids to 3D for mRoPE; dispatches through attention and MoE. Model wrappers now expose get_output_embeddings() and get_final_normalization() accessors. Multimodal wrapper forward accepts optional inputs_embeds parameter with proper fallback to input_ids. Conditional-generation wrapper forwards inputs_embeds through to multimodal call.
LlmArgs - Speculative Config & Target Factory Validation
tensorrt_llm/_torch/auto_deploy/llm_args.py
Adds target_model_factory: Optional[str] field (used only with model_factory='eagle_one_model'). New validate_speculative_model_factory model validator enforces allowed speculative-config combinations and auto-selects model_factory when required. Relaxed model_factory_exists validator accepts None target_factory and skips registry checks. New _requires_eagle_one_model() helper centralizes Eagle routing logic. create_factory() now conditionally passes speculative/sync kwargs only for eagle_one_model, using shared common_kwargs for other factories.
Eagle Modeling - Layer Dispatch & Config Defaults
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py
Added Qwen3.5 MoE Eagle layer dispatch in get_eagle_layers(). New EagleConfig._drafter_defaults['qwen3_5_moe_text'] with checkpoint conversion mapping for mtp.* state dicts. EagleDrafterForCausalLM now includes Qwen3_5MoeEagleLayer in _no_split_modules and accepts unused HuggingFace kwargs. Enhanced _filter_kwargs_for_submodule() handles nested GraphModule targets by unwrapping child graphs and detecting ambiguous cases.
Eagle Factory - Configurable Target Factory & Export
tensorrt_llm/_torch/auto_deploy/models/eagle.py
EagleDrafterFactory extended with use_inner_text_config parameter; _get_model_config() can optionally extract text_config from multimodal configs. TargetModelExportInfo refactored to accept configurable submodule_name and optional target_export_info; dynamic-shape lookup merges constraints from target_export_info; post_process delegates to target_export_info and uses expose_graph_module_accessor for binding target embeddings/outputs. EagleOneModelFactory adds target_model_factory parameter with registry lookup and computed use_inner_text_config based on factory class. get_export_infos() derives target_export_info and computes nested target_model.<inner> paths.
HF Export - expose_graph_module_accessor Helper
tensorrt_llm/_torch/auto_deploy/models/hf.py
New expose_graph_module_accessor() utility binds zero-argument accessors from original modules onto exported GraphModules, recreates submodule hierarchy, and inserts sentinel torch._assert nodes to prevent cleanup deletion. TextModelExportInfo.post_process() delegated to this helper.
Node Utils - Passthrough Detection Simplification
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
Simplified passthrough detection by replacing layered helpers with single is_trivial_passthrough_user() predicate for narrow set of layout ops (view, reshape, transpose, permute, getitem, call_method). Removed allow_dtype_cast parameters from signatures. Removed unwrap_input_through_passthrough() utility and related helpers.
Model Registry - MTP YAML Config & Model Entry
examples/auto_deploy/model_registry/configs/qwen3.5_moe_35b_mtp.yaml, examples/auto_deploy/model_registry/models.yaml, tests/unittest/auto_deploy/_utils_test/_model_test_utils.py
New config file defines MTP speculative decoding settings, CUDA graph batch sizes, model factory selection, and transformation/sharding parameters. Model registry entry for Qwen/Qwen3.5-35B-A3B updated to use mtp variant. Test utils add Qwen3.5 MoE to _SMALL_MODEL_CONFIGS with detailed text_config and vision_config parameters.
FLA Custom Op Tests - Extend Path Coverage
tests/unittest/auto_deploy/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py
New make_extend_kernel_inputs() helper constructs extend-request tuples with intermediate_delta_cache. Updated existing test invocation sites to pass None for intermediate_delta_cache. New test_intermediate_delta_cache() validates per-prefix state writes and delta_cache preservation. New test_extend_cuda_graph_capture() verifies CUDA graph captureability.
Qwen3.5 MoE Unit Tests - MTP Layer & Factory Coverage
tests/unittest/auto_deploy/singlegpu/models/test_qwen3_5_moe.py
Comprehensive test suite (600+ lines) with factories, dynamic shapes, weight initialization, and manual reference implementations for RMSNorm, attention, MoE, and full Eagle layer. Tests validate MTP config defaults, checkpoint mapping, layer output matching, drafter I/O contract, factory wrapper selection, hidden-state capture, VLM text config handling, one-model wrapper assembly, and strict checkpoint loading.
LlmArgs Config Validation Tests
tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py
Enhanced speculative config tests with target_model_factory assertions for Eagle3 and MTP modes. New test verifies declared factory preserved as target_model_factory for MTP-Eagle one-model. New negative test ensures wrapper factories require explicit target_model_factory.
Smoke Tests - Eagle Wrapper & MTP End-to-End
tests/unittest/auto_deploy/singlegpu/smoke/test_ad_speculative_decoding.py
New helper and unit tests for EagleWrapper._filter_kwargs_for_submodule() covering direct/nested graph handling and ambiguous rejection. New test validates EagleOneModelFactory.get_export_infos() export-info ordering. New test_qwen3_5_moe_mtp_smoke() smoke test for VLM target with MTP/Eagle config. Extended existing capture test assertion.
Integration Tests - GSM8K Accuracy & Test Registry
tests/integration/defs/accuracy/references/gsm8k.yaml, tests/integration/defs/accuracy/test_llm_api_autodeploy.py, tests/integration/test_lists/test-db/l0_dgx_h100.yml
New GSM8K accuracy reference (94.53) for MTP config. Class constants for extra_acc_spec and min acceptance rate. New test_ir_mtp_gsm8k() validates MTP config structure, runs AutoDeployLLM, evaluates GSM8K, and checks acceptance rate. Registered in l0_dgx_h100 post-merge suite.
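
As a rough illustration of the acceptance-rate check mentioned in the last row above, the sketch below assumes the rate is defined as accepted draft tokens divided by proposed draft tokens. `StepStats`, `acceptance_rate`, `check_min_acceptance_rate`, and the 0.5 threshold are hypothetical names and values for illustration; they are not taken from the PR, and the test in the PR may compute the metric differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StepStats:
    """Counts for one speculative decoding step (hypothetical record type)."""

    num_draft_tokens: int  # tokens proposed by the drafter in this step
    num_accepted_tokens: int  # proposed tokens the target model kept


def acceptance_rate(steps: Sequence[StepStats]) -> float:
    """Return accepted / proposed draft tokens, aggregated over all steps."""
    proposed = sum(s.num_draft_tokens for s in steps)
    accepted = sum(s.num_accepted_tokens for s in steps)
    if proposed == 0:
        raise ValueError("no draft tokens were proposed")
    return accepted / proposed


def check_min_acceptance_rate(steps: Sequence[StepStats], min_rate: float) -> float:
    """Fail loudly if the aggregate acceptance rate is below min_rate."""
    rate = acceptance_rate(steps)
    if rate < min_rate:
        raise AssertionError(f"acceptance rate {rate:.3f} is below {min_rate:.3f}")
    return rate


# Three steps with 3 draft tokens each; 6 of the 9 proposals are accepted.
steps = [StepStats(3, 3), StepStats(3, 1), StepStats(3, 2)]
rate = check_min_acceptance_rate(steps, min_rate=0.5)
```

Aggregating counts before dividing (rather than averaging per-step ratios) keeps steps with few draft tokens from dominating the result; whether the PR's test does the same is not stated here.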

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#14352: Modifies yaml_extra sourcing for AutoDeployLLM in accuracy tests, which the MTP GSM8K test depends on for registry-driven config loading.

Suggested reviewers

  • suyoggupta
  • syuoni
  • 2ez4bz
  • galagam
  • yechank-nvidia
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 29.52% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The PR description is incomplete, showing only the issue reference and brief feature summary without detailed explanation of changes or test coverage details. | Provide comprehensive description explaining the VLM speculative decoding support and Qwen3.5 MoE MTP changes; detail test coverage and architectural decisions. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title '[#13819][feat] AutoDeploy: Qwen3.5 MoE (VLM) MTP' is specific and clearly describes the feature: adding AutoDeploy support for Qwen3.5 MoE VLM with MTP speculative decoding. |


govind-ramnarayan changed the title from "[None][feat] AutoDeploy: Qwen3.5 MoE (VLM) MTP" to "[#13819][feat] AutoDeploy: Qwen3.5 MoE (VLM) MTP" May 29, 2026

coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py (1)

2248-2278: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

inputs_embeds-only requests are still broken for multimodal/cache paths.

This fallback now permits input_ids=None, but the same method still unconditionally reads input_ids later for placeholder masks/counts, chunked mRoPE position reconstruction, and the mrope_delta_cache dtype cast. A caller that passes inputs_embeds with image/video metadata will still fail on a NoneType access.

Please either keep requiring input_ids whenever multimodal metadata or mRoPE delta caching is present, or fully decouple those branches from token IDs.
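
A minimal sketch of the simpler option suggested above: require input_ids whenever multimodal metadata or the mRoPE delta cache is present. The helper name `require_input_ids_for_multimodal` and its argument list are hypothetical, chosen only to mirror the names used in this comment; the real `forward` signature and error messages may differ.

```python
from typing import Any, Optional


def require_input_ids_for_multimodal(
    input_ids: Optional[Any],
    inputs_embeds: Optional[Any],
    image_grid_thw: Optional[Any] = None,
    video_grid_thw: Optional[Any] = None,
    mrope_delta_cache: Optional[Any] = None,
) -> None:
    """Hypothetical guard: reject inputs_embeds-only calls on paths that read input_ids."""
    if input_ids is None and inputs_embeds is None:
        raise ValueError("either input_ids or inputs_embeds must be provided")

    # These arguments select branches that (per the comment above) still read input_ids.
    needs_token_ids = any(
        arg is not None for arg in (image_grid_thw, video_grid_thw, mrope_delta_cache)
    )
    if input_ids is None and needs_token_ids:
        raise ValueError(
            "input_ids is required when image/video metadata or mrope_delta_cache is provided"
        )


# Text-only drafting with embeddings alone passes the guard.
require_input_ids_for_multimodal(input_ids=None, inputs_embeds=[[0.1, 0.2]])
```

Placing such a check right after the inputs_embeds fallback turns a late NoneType failure into an immediate, descriptive error.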

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py` around
lines 2248 - 2278, The forward method currently allows inputs_embeds without
input_ids but later unconditionally accesses input_ids for multimodal/cache
logic (places referencing input_ids, image_grid_thw, video_grid_thw, batch_info,
compute_mrope_positions, cu_seqlen, and mrope_delta_cache), which causes
NoneType errors; fix by adding a guard in forward that raises a clear ValueError
when inputs_embeds is provided but any multimodal metadata or mRoPE-delta
caching is present (e.g., if image_grid_thw, video_grid_thw, batch_info, or
mrope_delta_cache is not None) and require input_ids in that case, or
alternatively refactor all downstream code paths that use input_ids (the
placeholder masks/counts, chunked mRoPE reconstruction, and dtype cast) to
derive the needed shapes/values from inputs_embeds instead; choose the simpler
approach of requiring input_ids for multimodal/cache paths and add the explicit
check near the top of forward after the inputs_embeds fallback.
🧹 Nitpick comments (1)
tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py (1)

240-273: ⚡ Quick win

Cover the no-speculative-config guard rails too.

Coverage is still insufficient for tensorrt_llm/_torch/auto_deploy/llm_args.py Lines 324-330. Please add negative cases in tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py for model_factory="eagle_one_model" without speculative_config, and for target_model_factory set without speculative decoding.

As per coding guidelines, tests/**: Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM. Keep feedback actionable: suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR.
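
The two negative cases requested above could look roughly like the sketch below. It runs against a stand-in `validate_speculative_args` function, which is hypothetical and is NOT the real LlmArgs validator; it only models the two guard rails named in this comment. In the actual test file the same checks would presumably construct LlmArgs and use pytest.raises.

```python
import re
from typing import Any, Callable, Optional


def validate_speculative_args(
    model_factory: Optional[str],
    target_model_factory: Optional[str],
    speculative_config: Optional[Any],
) -> None:
    """Stand-in for the guard rails described above (hypothetical, not LlmArgs)."""
    if speculative_config is not None:
        return
    if model_factory == "eagle_one_model":
        raise ValueError("model_factory='eagle_one_model' requires speculative_config")
    if target_model_factory is not None:
        raise ValueError("target_model_factory requires speculative_config")


def raises_value_error(fn: Callable[[], None], match: str) -> bool:
    """Return True if fn raises ValueError whose message matches the regex."""
    try:
        fn()
    except ValueError as err:
        return re.search(match, str(err)) is not None
    return False


# Case 1: Eagle wrapper factory declared without speculative decoding.
case_wrapper_without_spec = raises_value_error(
    lambda: validate_speculative_args("eagle_one_model", None, None),
    match="speculative_config",
)

# Case 2: target factory declared without speculative decoding.
case_target_without_spec = raises_value_error(
    lambda: validate_speculative_args(None, "AutoModelForImageTextToText", None),
    match="speculative_config",
)
```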

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py` around lines
240 - 273, Add two negative unit tests to test_llm_config.py to cover the
no-speculative-config guard rails in LlmArgs: (1) create a test that constructs
LlmArgs with model_factory="eagle_one_model" and no speculative_config and
assert it raises ValueError matching "speculative_config" (this exercises the
guard in LlmArgs when eagle wrapper is declared without speculative decoding),
and (2) create a test that sets target_model_factory (e.g.,
target_model_factory="AutoModelForImageTextToText") while leaving
speculative_config=None and assert it raises ValueError matching
"speculative_config" (this covers the guard that prevents declaring a target
factory without speculative decoding). Ensure both tests import/instantiate
LlmArgs and use pytest.raises with the appropriate match.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.py`:
- Around line 203-205: Run the code formatter (ruff-format) on the file to
normalize the multiline assignment to y_flat (the line assigning
y_flat[extend_start:extend_end] = y_extend.view(num_extend_tokens, HV,
-1).to(y_flat.dtype)); specifically reformat that multiline call so it matches
the project's ruff formatting rules and commit the resulting changes before
merging.

In `@tensorrt_llm/_torch/auto_deploy/models/eagle.py`:
- Around line 384-399: The draft export is left with the default 2D position_ids
contract; pass the same target export shape info into the draft export by
threading target_export_info into the DraftModelExportInfo construction (i.e.,
add target_export_info=target_export_info when creating DraftModelExportInfo
alongside TargetModelExportInfo) so the draft graph is specialized with the same
3D mRoPE position_ids as the target; update the return list where
TargetModelExportInfo and DraftModelExportInfo are created to include this
parameter.

In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`:
- Around line 442-462: The function is_trivial_passthrough_user currently treats
aten view/reshape/etc. as passthrough but omits torch.ops.auto_deploy.view,
causing inconsistent traversal with is_any_view_op; update
is_trivial_passthrough_user to also return True for torch.ops.auto_deploy.view
(add it to the call_function checks alongside
torch.ops.aten.view/reshape/transpose/permute/contiguous or add an explicit
is_op(...) check for torch.ops.auto_deploy.view) so that
collect_terminal_users_through_passthrough() stops at the auto_deploy.view
wrapper consistently.

In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py`:
- Line 1692: Add an explicit return type annotation "-> None" to the test
function definition for test_ir_mtp_gsm8k so its signature becomes def
test_ir_mtp_gsm8k(...) -> None:, matching the project's typing guideline; update
the function declaration (test_ir_mtp_gsm8k) only and ensure formatting matches
surrounding test functions' style.

In
`@tests/unittest/auto_deploy/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py`:
- Around line 94-96: Change the single-request test to also cover a batched
2-request case (or add a new test file named
test_fla_cached_gated_delta_rule_batched.py) so the per-request indexing code
paths in fla_cached_gated_delta_rule are exercised: set num_extend = 2 (and
tokens_per_extend > 1), construct inputs with distinct slot ids for each batch
row, run the same flow that produces intermediate_delta_cache and final output,
and add assertions that each row of intermediate_delta_cache and the
corresponding output map back to the correct request (verifying slot_idx_extend,
recurrent_state_indices behavior and the view(num_extend, tokens_per_extend,
...) reshape semantics). Ensure the test asserts per-row correctness rather than
only overall shapes so the new per-request indexing logic is covered.

In `@tests/unittest/auto_deploy/singlegpu/models/test_qwen3_5_moe.py`:
- Around line 3301-3305: The comparison is building reference logits incorrectly
from model.model.language_model(...).logits; instead call
model.model.language_model(...) to get its last_hidden_state and then convert
that hidden state to logits via the LM head used by the wrapper (use
model.model.lm_head or the appropriate head on model.model) so expected_logits =
lm_head(last_hidden_state) (ensure you match any transposition or
dtype/placement done by wrapper_logits). Update the block referencing
wrapper_logits, model.model.language_model, and model.model.lm_head accordingly.
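
For the node_utils.py item above, a toy model of the traversal may help show why the passthrough set should include the auto_deploy view wrapper. The sketch assumes that terminal users are found by walking through layout-only users until a non-passthrough node is reached. `Node`, `PASSTHROUGH_TARGETS`, and `collect_terminal_users` are hypothetical stand-ins written without torch; they do not reproduce the actual torch.fx-based helpers.

```python
from dataclasses import dataclass, field
from typing import List

# Layout-only ops treated as passthrough in this toy model (string stand-ins for op targets).
PASSTHROUGH_TARGETS = {
    "aten.view",
    "aten.reshape",
    "aten.transpose",
    "aten.permute",
    "aten.contiguous",
    "auto_deploy.view",  # the wrapper the review asks to treat consistently
}


@dataclass
class Node:
    """Minimal stand-in for a graph node: a name, an op target, and its users."""

    name: str
    target: str
    users: List["Node"] = field(default_factory=list)


def is_trivial_passthrough_user(node: Node) -> bool:
    """True if the node only changes layout and should be looked through."""
    return node.target in PASSTHROUGH_TARGETS


def collect_terminal_users(node: Node) -> List[Node]:
    """Walk through passthrough users and return the first non-passthrough users."""
    terminals: List[Node] = []
    for user in node.users:
        if is_trivial_passthrough_user(user):
            terminals.extend(collect_terminal_users(user))
        else:
            terminals.append(user)
    return terminals


# x feeds a linear through auto_deploy.view and a softmax through aten.reshape.
linear = Node("linear", "aten.linear")
softmax = Node("softmax", "aten.softmax")
view = Node("view", "auto_deploy.view", users=[linear])
reshape = Node("reshape", "aten.reshape", users=[softmax])
x = Node("x", "placeholder", users=[view, reshape])

terminal_names = [n.name for n in collect_terminal_users(x)]
```

If "auto_deploy.view" were missing from the set, the walk from x would report the view wrapper itself as a terminal user instead of the linear behind it, which is the inconsistency the comment describes.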

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fba98229-0ec6-4834-81b0-d069d1fc359b

📥 Commits

Reviewing files that changed from the base of the PR and between f6ba936 and 8a77743.

📒 Files selected for processing (17)
  • examples/auto_deploy/model_registry/configs/qwen3.5_moe_35b_mtp.yaml
  • examples/auto_deploy/model_registry/models.yaml
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py
  • tensorrt_llm/_torch/auto_deploy/models/eagle.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
  • tests/integration/defs/accuracy/references/gsm8k.yaml
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tests/unittest/auto_deploy/_utils_test/_model_test_utils.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py
  • tests/unittest/auto_deploy/singlegpu/models/test_qwen3_5_moe.py
  • tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py
  • tests/unittest/auto_deploy/singlegpu/smoke/test_ad_speculative_decoding.py

govind-ramnarayan requested a review from a team as a code owner May 29, 2026 17:05
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
govind-ramnarayan force-pushed the gramnarayan/qwen3-vlm-mtp branch from e82d086 to 71357c9 May 29, 2026 19:30
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>

govind-ramnarayan (Collaborator, Author)

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

tensorrt-cicd (Collaborator)

PR_Github #51098 [ run ] triggered by Bot. Commit: 1b8d508 Link to invocation

tensorrt-cicd (Collaborator)

PR_Github #51098 [ run ] completed with state SUCCESS. Commit: 1b8d508
/LLM/main/L0_MergeRequest_PR pipeline #40536 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Development

Successfully merging this pull request may close these issues.

[Feature]: Enabled MTP for qwen3.5

3 participants