
[None][fix] Use bf16 for LTX-2 FP4 stage 2 #13244

Merged
chang-l merged 6 commits into NVIDIA:main from yibinl-nvidia:bf16_second_stage on Apr 30, 2026
Conversation

@yibinl-nvidia (Collaborator) commented Apr 20, 2026

Summary by CodeRabbit

  • Improvements
    • Enhanced FP4 quantization handling during LoRA (Low-Rank Adaptation) application in multi-stage visual generation pipelines, improving parameter management and state restoration.
    • Improved logging and timing information for multi-stage processing to provide better visibility into execution flow.

Description

Image Quality

See the linked videos for a BF16 vs. FP4 stage-2 comparison on both dynamic FP4 and static FP4; the BF16 stage 2 output is visually better.
https://drive.google.com/drive/folders/1aPXUjUXioV5UXOaUooDQj0NpdRNHDHDJ?usp=sharing

Perf Summary

Performance comparison for a 10 s video at 1536x1024 resolution:

| Metric | FP4 | BF16 | Delta | Verdict |
| --- | --- | --- | --- | --- |
| Stage 2 denoising (avg) | 15.32 s | 15.81 s | +0.49 s | BF16 3.2% slower |
| Two-stage total (avg) | 82.99 s | 84.72 s | +1.73 s | BF16 2.1% slower |

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@yibinl-nvidia yibinl-nvidia marked this pull request as ready for review April 21, 2026 17:59
@yibinl-nvidia yibinl-nvidia requested a review from a team as a code owner April 21, 2026 17:59
@yibinl-nvidia (Collaborator Author):

/bot run

@coderabbitai (Contributor) Bot commented Apr 21, 2026

📝 Walkthrough

Walkthrough

This change extends LoRA delta application to handle FP4 quantization by detecting dynamic Linear modules with NVFP4LinearMethod, swapping their quant_method to UnquantizedLinearMethod for stage 2 execution, and replacing parameter storage with BF16 tensors instead of requantizing. A complementary restoration function now saves and recovers both tensor and non-tensor state (including quant_method entries) after stage 2 completes.
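For illustration, here is a minimal, hedged sketch of that swap-and-restore flow. The attribute and class roles (quant_method, NVFP4LinearMethod, UnquantizedLinearMethod) follow the walkthrough above, but the function names, the dict of precomputed BF16 weights, and the overall structure are assumptions for this sketch, not the actual pipeline code:

```python
import torch
import torch.nn as nn

def swap_fp4_linears_to_bf16(model: nn.Module,
                             bf16_weights: dict,
                             nvfp4_method_cls: type,
                             unquantized_method_cls: type) -> dict:
    """Swap NVFP4-quantized Linear modules to plain BF16 weights for stage 2.

    ``bf16_weights`` maps module paths to precomputed BF16 tensors (e.g. the
    dequantized base weight with LoRA deltas already applied).  Returns a
    ``saved_state`` dict holding the original packed weights and quant_method
    objects so they can be restored after stage-2 denoising.
    """
    saved_state = {}
    for name, mod in model.named_modules():
        qm = getattr(mod, "quant_method", None)
        if not isinstance(qm, nvfp4_method_cls) or name not in bf16_weights:
            continue
        # Save non-tensor state (quant_method) and the packed FP4 weight tensor.
        saved_state[f"__quant_method__{name}"] = qm
        saved_state[f"{name}.weight"] = mod.weight.data
        # Replace parameter storage with BF16 data and disable quantization.
        mod.weight.data = bf16_weights[name].to(torch.bfloat16)
        mod.quant_method = unquantized_method_cls()
    return saved_state

def restore_fp4_linears(model: nn.Module, saved_state: dict) -> None:
    """Restore both tensor data and quant_method entries saved above."""
    module_dict = dict(model.named_modules())
    prefix = "__quant_method__"
    for key, value in saved_state.items():
        if key.startswith(prefix):
            module_dict[key[len(prefix):]].quant_method = value
        else:
            base, _, param_name = key.rpartition(".")
            getattr(module_dict[base], param_name).data = value
```

In the actual PR, per the summary below, this logic lives in _apply_lora_deltas and _restore_lora_state inside pipeline_ltx2_two_stages.py.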

Changes

Cohort: FP4 Quantization & LoRA Handling
File(s): tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2_two_stages.py
Summary: Extended _apply_lora_deltas with module-path mapping and dynamic FP4 Linear module detection; swaps quant_method to UnquantizedLinearMethod for stage 2 and saves the original state. Modified FP4 "static-packed" LoRA handling to directly replace parameter storage with BF16 tensors. Updated _restore_lora_state to handle both tensor and non-tensor state (including quant_method restoration). Adjusted stage 2 logging to explicitly indicate BF16 weights for FP4 models and timing around denoising.

Sequence Diagram(s)

sequenceDiagram
    participant Client as LoRA Application
    participant Detector as Module Detection
    participant Storage as Parameter Storage
    participant Stage2 as Stage 2 Execution
    participant Restorer as State Restoration

    Client->>Detector: Build module-path mapping
    Detector->>Detector: Scan for NVFP4LinearMethod
    Detector->>Storage: Detect FP4 Linear modules
    Storage->>Storage: Save original quant_method
    Storage->>Storage: Replace with BF16 tensor data
    Storage->>Storage: Swap quant_method to UnquantizedLinearMethod
    
    Storage-->>Stage2: Execute with BF16 weights
    Note over Stage2: Stage 2 processes with BF16<br/>(FP4 quantization disabled)
    
    Stage2-->>Restorer: Denoising complete
    Restorer->>Restorer: Retrieve saved_state
    Restorer->>Storage: Restore original quant_method
    Restorer->>Storage: Restore original parameter data
    Storage-->>Client: State fully recovered

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Description check (⚠️ Warning): The PR description is largely incomplete; it lacks a filled-in Description section explaining the issue and solution, and the Test Coverage section is empty. Resolution: add a clear Description section explaining what problem this solves and why BF16 is used for FP4 stage 2, and list the relevant tests that validate the changes under Test Coverage.

✅ Passed checks (4 passed)

  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Title check (✅ Passed): The title clearly and specifically describes the main change: using bf16 (bfloat16) for FP4 stage 2, which aligns with the code modifications in the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (Contributor) Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2_two_stages.py`:
- Around lines 518-523: The loop that handles names starting with "__quant_method__" silently skips entries when the target module is missing or incompatible. Change the handler so that, for a "__quant_method__" entry, it looks up mod via module_dict (as currently) and then raises an explicit error if mod is None, is not an instance of the expected Linear class, or does not support quant_method, instead of continuing; otherwise set mod.quant_method = data. The error message should mention the offending module path and the expected type (e.g., Linear) so restoration fails fast rather than leaving modules on UnquantizedLinearMethod with mismatched packed weights.
- Around lines 471-483: The code mutates param.data to bf16 before verifying that the parent Linear (module_dict.get(base)) exists, which can leave the tensor in an incompatible layout if the lookup fails. Change the logic to first resolve and validate linear_mod (ensure isinstance(linear_mod, Linear)) and raise an exception or abort if it is not found; only then set param.data = bf16, update saved_state[f"__quant_method__{base}"], and set linear_mod.quant_method = UnquantizedLinearMethod(). Do not keep the current warn-and-continue behavior (logger.warning), because it leaves the model in a broken state. A hedged sketch of both fixes follows below.
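For concreteness, here is a minimal sketch of how both suggestions could be applied, reusing the module_dict, saved_state, Linear, and UnquantizedLinearMethod names from the comments above; the function boundaries and error messages are hypothetical, not the PR's actual code:

```python
def apply_bf16_swap(module_dict, base, param, bf16, saved_state,
                    linear_cls, unquantized_cls):
    # Resolve and validate the parent Linear *before* mutating param.data, so a
    # failed lookup cannot leave the tensor in an incompatible layout.
    linear_mod = module_dict.get(base)
    if linear_mod is None or not isinstance(linear_mod, linear_cls):
        raise RuntimeError(
            f"Cannot swap '{base}' to BF16: expected a {linear_cls.__name__}, "
            f"got {type(linear_mod).__name__}")
    param.data = bf16
    saved_state[f"__quant_method__{base}"] = linear_mod.quant_method
    linear_mod.quant_method = unquantized_cls()

def restore_quant_methods(module_dict, saved_state, linear_cls):
    prefix = "__quant_method__"
    for name, data in saved_state.items():
        if not name.startswith(prefix):
            continue
        path = name[len(prefix):]
        mod = module_dict.get(path)
        if mod is None or not isinstance(mod, linear_cls):
            # Fail fast instead of silently leaving the module on
            # UnquantizedLinearMethod with mismatched packed weights.
            raise RuntimeError(
                f"Cannot restore quant_method for '{path}': module missing or "
                f"not a {linear_cls.__name__}")
        mod.quant_method = data
```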
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: e5920e3c-e051-490d-bbec-07975ddac2ea

📥 Commits

Reviewing files that changed from the base of the PR and between 96bb8b7 and b95bce5.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2_two_stages.py

@tensorrt-cicd (Collaborator):

PR_Github #44784 [ run ] triggered by Bot. Commit: b95bce5 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #44784 [ run ] completed with state SUCCESS. Commit: b95bce5
/LLM/main/L0_MergeRequest_PR pipeline #35138 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yibinl-nvidia (Collaborator Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #44961 [ run ] triggered by Bot. Commit: b95bce5 Link to invocation

@yibinl-nvidia (Collaborator Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #45356 [ run ] triggered by Bot. Commit: 9f39104 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #45356 [ run ] completed with state SUCCESS. Commit: 9f39104
/LLM/main/L0_MergeRequest_PR pipeline #35601 completed with status: 'SUCCESS'

CI Report

Link to invocation

@yibinl-nvidia yibinl-nvidia changed the title [None][fix] Use bf16 for FP4 stage 2 [None][fix] Use bf16 for LTX-2 FP4 stage 2 Apr 24, 2026
@chang-l (Collaborator) left a comment


Thanks Yibin.
Can you also include the perf data implications in the PR description for the record? Also, do we have 2-stage LoRA w/ NVFP4 E2E accuracy protected in CI?

@zhenhuaw-me (Member):

How do we test the 2-stage implementation? For example, we are saying that "FP4 stage 2" accuracy is not good enough, and we switch to BF16 in this PR. How do we prove that this PR improves the accuracy?

@yibinl-nvidia yibinl-nvidia force-pushed the bf16_second_stage branch 3 times, most recently from 244b19a to 94d0223 on April 29, 2026 02:42
@yibinl-nvidia (Collaborator Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #46037 [ run ] triggered by Bot. Commit: 15664e7 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #46037 [ run ] completed with state FAILURE. Commit: 15664e7
/LLM/main/L0_MergeRequest_PR pipeline #36184 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yibinl-nvidia (Collaborator Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #46155 [ run ] triggered by Bot. Commit: 15664e7 Link to invocation

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
@yibinl-nvidia (Collaborator Author):

/bot kill

@tensorrt-cicd (Collaborator):

PR_Github #46160 [ kill ] triggered by Bot. Commit: 258c8a0 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #46155 [ run ] completed with state ABORTED. Commit: 15664e7

Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #46160 [ kill ] completed with state SUCCESS. Commit: 258c8a0
Successfully killed previous jobs for commit 258c8a0

Link to invocation

@yibinl-nvidia (Collaborator Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #46172 [ run ] triggered by Bot. Commit: 258c8a0 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #46172 [ run ] completed with state SUCCESS. Commit: 258c8a0
/LLM/main/L0_MergeRequest_PR pipeline #36292 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yibinl-nvidia (Collaborator Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #46297 [ run ] triggered by Bot. Commit: 258c8a0 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #46297 [ run ] completed with state SUCCESS. Commit: 258c8a0
/LLM/main/L0_MergeRequest_PR pipeline #36398 completed with status: 'SUCCESS'

CI Report

Link to invocation

@chang-l chang-l merged commit f3e34a0 into NVIDIA:main Apr 30, 2026
6 checks passed