
[TRTLLM-10232][feat] Support LoRA adapter for nemotron-h models#12154

Open
Wanli-Jiang wants to merge 3 commits into NVIDIA:main from Wanli-Jiang:user/williamj/support-nemotron-h-lora-t2

Conversation

@Wanli-Jiang
Collaborator

@Wanli-Jiang Wanli-Jiang commented Mar 12, 2026

Features

Support LoRA for the following modules:

Attention Q/K/V/O proj
MoE_shared_expert
Mamba in/out proj
MoE latent_proj

Not supported yet:

MoE route_experts
MoE gate_proj

Summary by CodeRabbit

  • New Features
    • Extended LoRA support to Mamba models (input/output projections) and MoE latent projections.
    • Added LoRA fine-tuning for DeepseekV3 routing gates and Nemotron-H hybrid models with MoE.
    • Introduced --lora_dir CLI flag to the LLM API for simplified LoRA configuration.
    • Added LoRA layer configuration API for improved control in hybrid architectures.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@coderabbitai
Contributor

coderabbitai bot commented Mar 12, 2026

📝 Walkthrough

Walkthrough

This PR extends LoRA support to Mamba and MoE architectures by introducing four new module types (Mamba in/out projections and MoE latent up/down), updating model configuration to track LoRA-applicable layers separately from attention layers, and integrating LoRA parameters throughout Nemotron-H and Mamba2 forward paths.

Changes

Cohort / File(s) Summary
LoRA Module Type Definitions
cpp/include/tensorrt_llm/runtime/loraModule.h, tensorrt_llm/_torch/peft/lora/layer.py
Added four new enum members (kMAMBA_IN_PROJ, kMAMBA_OUT_PROJ, kMOE_LATENT_UP, kMOE_LATENT_DOWN with values 19–22) and updated toModuleType/toModuleName mappings for string conversion. Python implementation also added is_mamba property and refactored eager-forward parameter lookup.
Model Configuration for LoRA
cpp/include/tensorrt_llm/runtime/modelConfig.h, tensorrt_llm/_torch/model_config.py
Added getNbLoraLayers() and setNbLoraLayers() methods to track LoRA-applicable layers separately; includes pipelineParallelism handling and conditional setting for Nemotron hybrid models.
C++ LoRA Runtime Layer Calculation
cpp/tensorrt_llm/runtime/loraCache.cpp, cpp/tensorrt_llm/runtime/loraManager.cpp, cpp/tensorrt_llm/runtime/loraUtils.cpp
Replaced getNbAttentionLayers() calls with getNbLoraLayers() to use the new LoRA-specific layer count for validation and pagination.
C++ LoRA Module Factory
cpp/tensorrt_llm/runtime/loraModule.cpp
Added four new switch cases in createLoraModules for Mamba and MoE module types, each with appropriate in/out size mappings and boundary flags.
Python Bindings
cpp/tensorrt_llm/nanobind/bindings.cpp
Exposed new LoRA module types (MAMBA_IN_PROJ, MAMBA_OUT_PROJ, MOE_LATENT_UP, MOE_LATENT_DOWN) and added num_lora_layers() and set_num_lora_layers() methods to Python ModelConfig API.
DeepseekV3 Gate LoRA Support
tensorrt_llm/_torch/models/modeling_deepseekv3.py
Extended DeepseekV3Gate constructor with lora_config parameter and added gate_lora layer; modified forward to accept lora_params and layer_idx, applying LoRA adjustment to router logits when configured.
Nemotron-H Model LoRA Integration
tensorrt_llm/_torch/models/modeling_nemotron_h.py
Added lora_params parameter to forward methods across MLPLayer, TransformerLayer, NemotronHMOE, NemotronHLayer, NemotronHModel, and related components; created per-layer latent LoRA layers (fc1_latent_lora, fc2_latent_lora) for MoE shared experts when lora_config is present.
Mamba2 Mixer LoRA Integration
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
Added in_proj_lora and out_proj_lora layers with forward path support for lora_params and layer_idx; included NVFP4 fusion compatibility check to disable fused quantization when LoRA is present.
MLP Module Type Customization
tensorrt_llm/_torch/modules/mlp.py
Added lora_up_module_type and lora_down_module_type parameters to MLP constructor with configurable defaults, replacing hardcoded module type references.
LoRA Module Mapping and Management
tensorrt_llm/lora_helper.py, tensorrt_llm/lora_manager.py
Extended moe_h_to_4h and moe_4h_to_h mappings to lists of targets, and added new mappings for mamba_in_proj, mamba_out_proj, moe_latent_up, moe_latent_down; added corresponding entries to LORA_MODULE_IDS dictionary.
Python Example with LoRA
examples/llm-api/quickstart_advanced.py
Added CLI option --lora_dir, modified setup_llm to construct and return LoRA config and request, and updated llm.generate calls to accept lora_request parameter; modified function signatures to propagate LoRA objects through the workflow.
Nemotron-H LoRA Test Suite
tests/unittest/_torch/modules/tests_lora_modules/test_nemotron_h_lora_sanity.py
Added new test classes TestNemotronHLoRASanity and TestNemotronHLoRAModuleMappings with test_load_and_generate_with_lora and test_nemotron_h_modules_have_mappings methods; includes helper for creating dummy LoRA adapters.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • amitz-nv
  • xxi-nv
  • byshiue
  • dongjiyingdjy
  • 2ez4bz
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 28.57%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Description check — ❓ Inconclusive: the PR description lists supported LoRA modules but lacks crucial details on the implementation approach, architectural changes, and rationale. Resolution: expand the description with (1) a detailed explanation of what was changed and why; (2) a high-level architectural overview of the LoRA integration; (3) the specific test cases provided; (4) any breaking changes or migration notes; (5) justification for the unsupported modules.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed: the title clearly and concisely describes the main feature (LoRA adapter support for nemotron-h models) with an appropriate ticket reference and type tag.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can get early access to new features in CodeRabbit.

Enable the early_access setting to enable early access features such as new models, tools, and more.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

855-904: ⚠️ Potential issue | 🟠 Major

The new router-LoRA path is never wired in.

Deepseekv3MoE still instantiates gate_cls(... moe_backend=model_config.moe_backend) on Lines 960-969 without lora_config, and compute_routed_output() still calls self.gate(hidden_states) on Line 1154 without lora_params or layer_idx. As written, self.gate_lora stays None, so this new code never applies a router delta.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_deepseekv3.py` around lines 855 - 904,
The router LoRA is never used because Deepseekv3MoE constructs the gate without
passing lora_config and compute_routed_output calls the gate without lora
params; fix by (1) passing lora_config into the gate when instantiating gate_cls
(update the call that currently does gate_cls(...
moe_backend=model_config.moe_backend) to include lora_config=lora_config) and
(2) update compute_routed_output to call the gate with the LoRA args (replace
self.gate(hidden_states) with self.gate(hidden_states, lora_params, layer_idx))
so DeepseekV3Gate.gate_lora can be created and applied.
tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)

926-940: ⚠️ Potential issue | 🟠 Major

MTP LoRA lookup is ambiguous for repeated sublayer types.

These calls now thread lora_params into every MTP sublayer, but each NemotronHMTPDecoderLayer in this block is constructed with the same layer_idx=self.layer_idx, and LoraLayer lookup is keyed by layer_idx. If mtp_hybrid_override_pattern contains the same LoRA-capable type more than once, those sublayers will alias the same adapter slot instead of getting distinct weights. Please give each MTP sublayer a unique LoRA key/index, or keep MTP out of the LoRA path until that addressing exists.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_nemotron_h.py` around lines 926 - 940,
The MTP sublayers are all constructed/queried with the same layer_idx
(self.layer_idx), causing LoraLayer lookup collisions when
mtp_hybrid_override_pattern repeats types; update the loop in the forward path
that iterates self.pattern_len so each MTP sublayer gets a unique LoRA key—e.g.,
compute a unique_layer_idx using the loop index (like unique_layer_idx =
self.layer_idx * self.pattern_len + i or unique_layer_idx = (self.layer_idx, i)
flattened) and pass that into the NemotronHMTPDecoderLayer call instead of
self.layer_idx (or alternatively omit passing lora_params into MTP sublayers
until unique addressing exists); change references to layer(...,
lora_params=lora_params, ...) to use the computed unique_layer_idx so LoraLayer
lookups are distinct.
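The flattening scheme suggested in this prompt can be sketched as a small Python helper (the helper name is hypothetical; the real change lives in the NemotronHMTPDecoderLayer forward path):

```python
def mtp_lora_layer_idx(base_layer_idx: int, pattern_len: int, sublayer_pos: int) -> int:
    """Flatten (base layer, MTP sublayer position) into one unique LoRA key.

    When mtp_hybrid_override_pattern repeats a LoRA-capable sublayer type,
    each repetition gets a distinct index instead of aliasing one adapter slot.
    """
    return base_layer_idx * pattern_len + sublayer_pos
```

Because the multiplier is the pattern length, the resulting indices for different (layer, sublayer) pairs never collide.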
🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)

45-45: Use a module import for the LoRA symbols.

This new direct symbol import goes against the repo’s Python import rule. Please import the module and reference layer.LoraLayer / layer.LoraModuleType instead.

As per coding guidelines, "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_nemotron_h.py` at line 45, Replace the
direct symbol import from ..peft.lora.layer with a module import and update
usages to keep the namespace; specifically, change the import so you import the
module (e.g., import ..peft.lora.layer as layer or from ..peft.lora import
layer) and then reference layer.LoraLayer and layer.LoraModuleType in
modeling_nemotron_h.py wherever those symbols are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/include/tensorrt_llm/runtime/modelConfig.h`:
- Around line 234-243: getNbLoraLayers currently floor-divides mNbLoraLayers by
pipelineParallelism and ignores pipelineParallelismRank, silently dropping tail
layers on uneven splits; change getNbLoraLayers to be rank-aware: when
mNbLoraLayers > 0 compute base = mNbLoraLayers / pipelineParallelism and rem =
mNbLoraLayers % pipelineParallelism and return base + (pipelineParallelismRank <
rem ? 1 : 0); ensure you still validate pipelineParallelism>0 with
TLLM_CHECK_WITH_INFO and use pipelineParallelismRank to distribute the remainder
(alternatively, if you prefer failing fast instead of distributing, add a
TLLM_CHECK_WITH_INFO that mNbLoraLayers % pipelineParallelism == 0 and error
out), referencing getNbLoraLayers, mNbLoraLayers and pipelineParallelismRank in
your change so loraCache/loraManager receive correct per-rank counts.
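The rank-aware remainder distribution described above can be sketched in Python (a stand-in for the C++ getNbLoraLayers change; the function name is hypothetical):

```python
def lora_layers_for_rank(nb_lora_layers: int, pp_size: int, pp_rank: int) -> int:
    """Per-rank LoRA layer count with the remainder given to the first ranks.

    Computes base + (1 if rank < remainder else 0) so no tail layers are
    silently dropped on uneven pipeline-parallel splits.
    """
    assert pp_size > 0, "pipelineParallelism must be positive"
    base, rem = divmod(nb_lora_layers, pp_size)
    return base + (1 if pp_rank < rem else 0)
```

For example, 7 layers over 3 ranks yields 3/2/2, so the per-rank counts always sum back to the total.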

In `@cpp/tensorrt_llm/runtime/loraModule.cpp`:
- Around line 68-70: The Mamba projections are using the single mlpHidden value
for both in_proj and out_proj shapes which is incorrect; create and use two
separate dimensions (e.g., mambaInProjDim for d_in_proj and mambaOutProjDim for
d_inner) instead of mlpHidden and pass them to the ModuleType constructors: for
ModuleType::kMAMBA_IN_PROJ call modules.emplace_back(t, hidden, mambaInProjDim,
false, true, -1, 0) and for ModuleType::kMAMBA_OUT_PROJ call
modules.emplace_back(t, mambaOutProjDim, hidden, false, true, 1, -1); populate
mambaInProjDim and mambaOutProjDim from the Mamba layer descriptor/config (the
same source that builds d_in_proj and d_inner in mamba2_mixer.py) so packed LoRA
shapes match the runtime descriptors.

In `@examples/llm-api/quickstart_advanced.py`:
- Around line 303-304: The code currently constructs LoraConfig with only
lora_dir (lora_config = LoraConfig(lora_dir=[args.lora_dir]) if args.lora_dir
else None) which relies on the default max_lora_rank=64 and will fail for
adapters whose r > 64; update the logic that builds LoraConfig to determine the
adapter rank automatically (by reading adapter_config.json inside args.lora_dir
and extracting the "r" value) or expose a new CLI flag (e.g., --max_lora_rank)
and pass it as max_lora_rank to LoraConfig; locate the LoraConfig instantiation
and replace it with code that reads adapter_config.json when args.lora_dir is
present, parses the rank, and sets LoraConfig(lora_dir=[args.lora_dir],
max_lora_rank=parsed_rank) (falling back to the flag or default if parsing
fails).
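The suggested fallback logic might look like the sketch below (read_adapter_rank is a hypothetical helper; the "r" key is where PEFT-style adapters record their rank in adapter_config.json):

```python
import json
from pathlib import Path


def read_adapter_rank(lora_dir: str, default_rank: int = 64) -> int:
    """Read the LoRA rank from adapter_config.json, falling back to a default.

    PEFT adapters store the rank under the "r" key; if the file is missing
    or malformed, return the default (e.g. from a --max_lora_rank CLI flag).
    """
    cfg_path = Path(lora_dir) / "adapter_config.json"
    try:
        return int(json.loads(cfg_path.read_text()).get("r", default_rank))
    except (OSError, ValueError):  # missing file or unparsable JSON/rank
        return default_rank
```

The parsed value would then be passed as max_lora_rank when constructing LoraConfig, so adapters with r > 64 load correctly.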

In `@tensorrt_llm/_torch/models/modeling_deepseekv3.py`:
- Line 855: Add a type-only import for LoraConfig so static checkers stop
flagging the annotation; import TYPE_CHECKING from typing (or typing_extensions)
at the top and inside an if TYPE_CHECKING: block add "from peft import
LoraConfig" (or the correct module that defines LoraConfig), leaving the
parameter annotation lora_config: Optional["LoraConfig"] unchanged at runtime.

In `@tensorrt_llm/_torch/modules/mlp.py`:
- Around line 55-57: The LoRA width uses the static config TP size instead of
the effective/sharded TP size when TP is overridden; change the up_lora
construction so its width uses the effective TP size (i.e., the overridden TP if
present) rather than config.mapping.tp_size—compute an effective_tp (e.g.,
self.overridden_tp_size or self.mapping.tp_size/config.mapping.tp_size) and pass
[self.intermediate_size // effective_tp] to LoraLayer for self.up_lora so
x_up_lora matches up_proj’s sharding.

In `@tensorrt_llm/_torch/peft/lora/layer.py`:
- Around line 80-81: The is_moe helper currently omits the new MOE_LATENT_UP and
MOE_LATENT_DOWN constants, so add those two constants to the set/check within
is_moe (the function named is_moe and the enum/constants MOE_LATENT_UP and
MOE_LATENT_DOWN and existing MOE_... members) so that it returns True for those
projection types; update any conditional or membership test inside is_moe to
include MOE_LATENT_UP and MOE_LATENT_DOWN alongside the other MOE_* values to
ensure latent projections are treated as MoE.
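The membership fix described above might look like this standalone sketch (the enum values here are illustrative, not the actual ones in loraModule.h, and only the relevant members are shown):

```python
from enum import IntEnum


class LoraModuleType(IntEnum):
    # Illustrative subset of module types; values are placeholders.
    MOE_H_TO_4H = 9
    MOE_4H_TO_H = 10
    MOE_GATE = 11
    MAMBA_IN_PROJ = 19
    MAMBA_OUT_PROJ = 20
    MOE_LATENT_UP = 21
    MOE_LATENT_DOWN = 22

    @property
    def is_moe(self) -> bool:
        # The latent projections must be listed here so they are treated
        # as MoE modules alongside the existing MOE_* members.
        return self in {
            LoraModuleType.MOE_H_TO_4H,
            LoraModuleType.MOE_4H_TO_H,
            LoraModuleType.MOE_GATE,
            LoraModuleType.MOE_LATENT_UP,
            LoraModuleType.MOE_LATENT_DOWN,
        }

    @property
    def is_mamba(self) -> bool:
        return self in {LoraModuleType.MAMBA_IN_PROJ,
                        LoraModuleType.MAMBA_OUT_PROJ}
```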

In `@tensorrt_llm/lora_helper.py`:
- Around line 56-57: The annotations are incorrect: update any type hints
declared as Dict[str, str] for trtllm->hf mappings to allow lists (e.g.,
Dict[str, Union[str, List[str]]]) so they match
get_default_trtllm_modules_to_hf_modules() which returns values that can be str
or List[str]; specifically change the return/type of use_lora(), the
LoraConfig.trtllm_modules_to_hf_modules attribute, and any related signatures or
type aliases referencing that mapping to use Union[str, List[str]] (or a shared
alias like DictStrOrList) so static checkers reflect the actual data shape.
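The widened annotation and a consumer of it might be sketched as follows (normalize_targets is a hypothetical helper showing why the Union is needed):

```python
from typing import Dict, List, Union

# Values may now be a single HF module name or a list of targets
# (e.g. moe_h_to_4h mapping to several projections).
TrtllmToHfMapping = Dict[str, Union[str, List[str]]]


def normalize_targets(mapping: TrtllmToHfMapping, key: str) -> List[str]:
    """Return the HF target(s) for a TRT-LLM module name as a list."""
    value = mapping.get(key, [])
    return [value] if isinstance(value, str) else list(value)
```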

In
`@tests/unittest/_torch/modules/tests_lora_modules/test_nemotron_h_lora_sanity.py`:
- Around line 53-75: The test currently writes all lora_B.weight tensors as
zeros so LoRA deltas are always zero and the test cannot detect broken LoRA
application; update the test in
tests_lora_modules/test_nemotron_h_lora_sanity.py to (1) set at least some
lora_B.weight tensors to non-zero values (e.g., random small values) instead of
torch.zeros for keys built from proj_dims and attn_layers, and (2) expand the
targeted keys to include the new Mamba/shared-expert/latent LoRA paths (in
addition to "q_proj","k_proj","v_proj","o_proj") so the code paths that apply
those adapter types are exercised; adjust the saved adapter_model.safetensors
generation accordingly and keep the existing assertions.
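A sketch of the suggested fixture, using plain nested lists as a stand-in for torch tensors (make_dummy_lora_tensors is a hypothetical helper):

```python
import random


def make_dummy_lora_tensors(rank, dims):
    """Build non-zero A/B matrices (nested lists) for a dummy LoRA adapter.

    All-zero lora_B weights make the LoRA delta vanish, so a broken
    application path would go undetected; small random values fix that.
    `dims` maps a target module name to its (in_dim, out_dim) pair and can
    include the new Mamba and MoE-latent targets alongside q/k/v/o_proj.
    """
    def mat(rows, cols):
        return [[random.gauss(0.0, 0.01) for _ in range(cols)]
                for _ in range(rows)]

    tensors = {}
    for name, (in_dim, out_dim) in dims.items():
        tensors[f"{name}.lora_A.weight"] = mat(rank, in_dim)
        tensors[f"{name}.lora_B.weight"] = mat(out_dim, rank)
    return tensors
```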


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 604792d5-9e00-495d-8bba-f9819862a862

📥 Commits

Reviewing files that changed from the base of the PR and between adfc542 and 89f0f2d.

📒 Files selected for processing (17)
  • cpp/include/tensorrt_llm/runtime/loraModule.h
  • cpp/include/tensorrt_llm/runtime/modelConfig.h
  • cpp/tensorrt_llm/nanobind/bindings.cpp
  • cpp/tensorrt_llm/runtime/loraCache.cpp
  • cpp/tensorrt_llm/runtime/loraManager.cpp
  • cpp/tensorrt_llm/runtime/loraModule.cpp
  • cpp/tensorrt_llm/runtime/loraUtils.cpp
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/peft/lora/layer.py
  • tensorrt_llm/lora_helper.py
  • tensorrt_llm/lora_manager.py
  • tests/unittest/_torch/modules/tests_lora_modules/test_nemotron_h_lora_sanity.py

@Wanli-Jiang Wanli-Jiang force-pushed the user/williamj/support-nemotron-h-lora-t2 branch 2 times, most recently from e734e9b to a2bbc7f on March 13, 2026 at 06:08
@Wanli-Jiang Wanli-Jiang requested a review from a team as a code owner on March 13, 2026 at 06:08
@Wanli-Jiang Wanli-Jiang force-pushed the user/williamj/support-nemotron-h-lora-t2 branch 2 times, most recently from db63c94 to adb407e on March 13, 2026 at 06:50
@Wanli-Jiang
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38840 [ run ] triggered by Bot. Commit: adb407e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38840 [ run ] completed with state SUCCESS. Commit: adb407e
/LLM/main/L0_MergeRequest_PR pipeline #30150 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Wanli-Jiang Wanli-Jiang force-pushed the user/williamj/support-nemotron-h-lora-t2 branch from adb407e to 3a699fe on March 17, 2026 at 12:07
@Wanli-Jiang
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39404 [ run ] triggered by Bot. Commit: 1a38698 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39404 [ run ] completed with state SUCCESS. Commit: 1a38698
/LLM/main/L0_MergeRequest_PR pipeline #30634 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Wanli-Jiang Wanli-Jiang force-pushed the user/williamj/support-nemotron-h-lora-t2 branch from 1a38698 to 37f6ddf on March 19, 2026 at 04:26
@Wanli-Jiang
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39540 [ run ] triggered by Bot. Commit: 37f6ddf Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39540 [ run ] completed with state SUCCESS. Commit: 37f6ddf
/LLM/main/L0_MergeRequest_PR pipeline #30757 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Wanli-Jiang
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39597 [ run ] triggered by Bot. Commit: 37f6ddf Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39597 [ run ] completed with state SUCCESS. Commit: 37f6ddf
/LLM/main/L0_MergeRequest_PR pipeline #30807 completed with status: 'SUCCESS'

CI Report

Link to invocation

MAMBA_OUT_PROJ = 23 # Mamba output projection

MOE_LATENT_UP = 24 # MoE latent up projection (fc1_latent_proj)
MOE_LATENT_DOWN = 25 # MoE latent down projection (fc2_latent_proj)
@danisereb danisereb Mar 23, 2026

Are we sure that FC1 is up and FC2 is down? Maybe it's the opposite?

AFAIU:
The down projection maps from a higher-dimensional space to a lower-dimensional one.
The up projection maps from a lower-dimensional space back to a higher-dimensional one.
The latent MoE is in the "lower-dimensional space".

Collaborator Author

I renamed them to MOE_LATENT_FC1/FC2, which avoids confusion and matches the naming in the model weights.

// Mamba dimensions: in_proj outputs d_in_proj, out_proj inputs d_inner
// Fall back to mlpHidden if not specified (for backward compatibility)
auto const mambaInProj = mambaInProjSize > 0 ? mambaInProjSize * tpSize : mlpHidden;
auto const mambaInner = mambaInnerSize > 0 ? mambaInnerSize * tpSize : mlpHidden;
@danisereb danisereb Mar 23, 2026

Is it possible for only mambaInProjSize or mambaInnerSize to be non-zero?
I assume that either both are non-zero (nemotron model) or both are zero (non-nemotron model).

I am not sure if an error should be raised if invalid input is given (or alternatively, pass mambaInProjSize and mambaInnerSize as a "tuple"/struct).

I see that existing inputs like hiddenSize, tpSize are also not validated here (they can be zero AFAIU), so validation is probably done at a higher level of the code.

Collaborator Author

I added a check and set mambaInProj = mambaInProjSize * tpSize directly; there is no need to use mlpHidden as a fallback value.
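The check the author describes might look like this Python sketch of the C++ logic (function and parameter names are hypothetical):

```python
def resolve_mamba_dims(mamba_in_proj_size: int, mamba_inner_size: int,
                       tp_size: int) -> tuple:
    """Validate and de-shard the Mamba LoRA dimensions.

    Either both sizes are set (hybrid models like Nemotron-H) or both are
    zero (no Mamba layers); anything else indicates an invalid config.
    """
    if (mamba_in_proj_size > 0) != (mamba_inner_size > 0):
        raise ValueError(
            "mambaInProjSize and mambaInnerSize must both be set or both be zero")
    # Sizes arrive TP-sharded, so multiply back to the full dimensions.
    return mamba_in_proj_size * tp_size, mamba_inner_size * tp_size
```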

(7, 3), # 3 + 2 + 2
(1, 2), # Edge: 1 + 0
(32, 1), # Single rank
],
@danisereb danisereb Mar 23, 2026

Can use the sizes/dims of Nemotron Nano/Super as test cases.
(For other tests too).

Collaborator Author

It is actually added; see the 88-layer case and the other values.

if moe_intermediate is not None and moe_intermediate > 0:
moe_hidden_size = moe_intermediate // mapping.tp_size

# Mamba dimensions for hybrid models (e.g., Nemotron-H)
Collaborator

nit: The dimension resolution block here is getting unwieldy — each new model type adds another ad-hoc getattr + TP-division block. Worth a follow-up to extract a LoraModuleDimensions dataclass or similar. Not blocking this PR since you're following the existing pattern - but could you add a TODO to refactor this more cleanly?

Collaborator Author

Added a TODO comment at the beginning of this code section.


Called from post_load_weights (weights don't exist during __init__).
"""
# Don't use fused NVFP4 output if LoRA is enabled on out_proj.
Collaborator

@venkywonka venkywonka Mar 23, 2026

I don't know anything about Mamba, but I saw that Devin had pointed to a bug here: https://app.devin.ai/review/NVIDIA/TensorRT-LLM/pull/12154#bug-BUG_pr-review-job-d3cde4cb72cb4f4889031597155a32c4_0001

It says that out_proj's LoRA has been guarded against receiving NVFP4 tensors from the fused norm, but in_proj's LoRA hasn't been appropriately guarded for the case where the layernorm outputs NVFP4 tensors. Currently the layernorm will never output an NVFP4 tensor because the hidden size is not supported, but this could surface as a bug later. Worth adding a TODO at least?

Collaborator Author

The bug is there. I added a strong guard: if LoRA is enabled, we fall back to the legacy rms_norm kernel, which produces bf16 input for the LoRA layers.

@Wanli-Jiang Wanli-Jiang force-pushed the user/williamj/support-nemotron-h-lora-t2 branch from 37f6ddf to eb20fca on March 25, 2026 at 05:14
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
@Wanli-Jiang Wanli-Jiang force-pushed the user/williamj/support-nemotron-h-lora-t2 branch from eb20fca to 6b7f23e on March 25, 2026 at 05:33
@Wanli-Jiang
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40254 [ run ] triggered by Bot. Commit: 6b7f23e Link to invocation

Collaborator

@HuiGao-NV HuiGao-NV left a comment

LGTM

@tensorrt-cicd
Collaborator

PR_Github #40254 [ run ] completed with state SUCCESS. Commit: 6b7f23e
/LLM/main/L0_MergeRequest_PR pipeline #31378 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Wanli-Jiang
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40473 [ run ] triggered by Bot. Commit: 6b7f23e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40473 [ run ] completed with state SUCCESS. Commit: 6b7f23e
/LLM/main/L0_MergeRequest_PR pipeline #31561 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
