
[None][fix] Use bf16 for LTX-2 FP4 stage 2 #13244

Merged
chang-l merged 6 commits into NVIDIA:main from yibinl-nvidia:bf16_second_stage on Apr 30, 2026
Conversation

@yibinl-nvidia (Collaborator) commented Apr 20, 2026

Summary by CodeRabbit

  • Improvements
    • Enhanced FP4 quantization handling during LoRA (Low-Rank Adaptation) application in multi-stage visual generation pipelines, improving parameter management and state restoration.
    • Improved logging and timing information for multi-stage processing to provide better visibility into execution flow.

Description

Image Quality

See the linked videos for a BF16 vs. FP4 stage-2 comparison on both dynamic FP4 and static FP4; the BF16 stage 2 output is visually better.
https://drive.google.com/drive/folders/1aPXUjUXioV5UXOaUooDQj0NpdRNHDHDJ?usp=sharing

Perf Summary

Performance comparison for a 10 s video at 1536x1024 resolution:

| Metric | FP4 | BF16 | Delta | Verdict |
| --- | --- | --- | --- | --- |
| Stage 2 denoising (avg) | 15.32 s | 15.81 s | +0.49 s | BF16 3.2% slower |
| Two-stage total (avg) | 82.99 s | 84.72 s | +1.73 s | BF16 2.1% slower |

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@yibinl-nvidia yibinl-nvidia marked this pull request as ready for review April 21, 2026 17:59
@yibinl-nvidia yibinl-nvidia requested a review from a team as a code owner April 21, 2026 17:59
@yibinl-nvidia (Collaborator Author):

/bot run

@coderabbitai (Contributor) Bot commented Apr 21, 2026

📝 Walkthrough

Walkthrough

This change extends LoRA delta application to handle FP4 quantization by detecting dynamic Linear modules with NVFP4LinearMethod, swapping their quant_method to UnquantizedLinearMethod for stage 2 execution, and replacing parameter storage with BF16 tensors instead of requantizing. A complementary restoration function now saves and recovers both tensor and non-tensor state (including quant_method entries) after stage 2 completes.
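For illustration, here is a minimal, hedged sketch of that swap-and-restore flow. The attribute and class roles (quant_method, NVFP4LinearMethod, UnquantizedLinearMethod) follow the walkthrough above, but the function names, the dict of precomputed BF16 weights, and the overall structure are assumptions for this sketch, not the actual pipeline code:

```python
import torch
import torch.nn as nn

def swap_fp4_linears_to_bf16(model: nn.Module,
                             bf16_weights: dict,
                             nvfp4_method_cls: type,
                             unquantized_method_cls: type) -> dict:
    """Swap NVFP4-quantized Linear modules to plain BF16 weights for stage 2.

    ``bf16_weights`` maps module paths to precomputed BF16 tensors (e.g. the
    dequantized base weight with LoRA deltas already applied).  Returns a
    ``saved_state`` dict holding the original packed weights and quant_method
    objects so they can be restored after stage-2 denoising.
    """
    saved_state = {}
    for name, mod in model.named_modules():
        qm = getattr(mod, "quant_method", None)
        if not isinstance(qm, nvfp4_method_cls) or name not in bf16_weights:
            continue
        # Save non-tensor state (quant_method) and the packed FP4 weight tensor.
        saved_state[f"__quant_method__{name}"] = qm
        saved_state[f"{name}.weight"] = mod.weight.data
        # Replace parameter storage with BF16 data and disable quantization.
        mod.weight.data = bf16_weights[name].to(torch.bfloat16)
        mod.quant_method = unquantized_method_cls()
    return saved_state

def restore_fp4_linears(model: nn.Module, saved_state: dict) -> None:
    """Restore both tensor data and quant_method entries saved above."""
    module_dict = dict(model.named_modules())
    prefix = "__quant_method__"
    for key, value in saved_state.items():
        if key.startswith(prefix):
            module_dict[key[len(prefix):]].quant_method = value
        else:
            base, _, param_name = key.rpartition(".")
            getattr(module_dict[base], param_name).data = value
```

In the actual PR, per the summary below, this logic lives in _apply_lora_deltas and _restore_lora_state inside pipeline_ltx2_two_stages.py.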

Changes

Cohort: FP4 Quantization & LoRA Handling
File(s): tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2_two_stages.py
Summary: Extended _apply_lora_deltas with module-path mapping and dynamic FP4 Linear module detection; swaps quant_method to UnquantizedLinearMethod for stage 2 and saves the original state. Modified FP4 "static-packed" LoRA handling to directly replace parameter storage with BF16 tensors. Updated _restore_lora_state to handle both tensor and non-tensor state (including quant_method restoration). Adjusted stage 2 logging to explicitly indicate BF16 weights for FP4 models and timing around denoising.

Sequence Diagram(s)

sequenceDiagram
    participant Client as LoRA Application
    participant Detector as Module Detection
    participant Storage as Parameter Storage
    participant Stage2 as Stage 2 Execution
    participant Restorer as State Restoration

    Client->>Detector: Build module-path mapping
    Detector->>Detector: Scan for NVFP4LinearMethod
    Detector->>Storage: Detect FP4 Linear modules
    Storage->>Storage: Save original quant_method
    Storage->>Storage: Replace with BF16 tensor data
    Storage->>Storage: Swap quant_method to UnquantizedLinearMethod
    
    Storage-->>Stage2: Execute with BF16 weights
    Note over Stage2: Stage 2 processes with BF16<br/>(FP4 quantization disabled)
    
    Stage2-->>Restorer: Denoising complete
    Restorer->>Restorer: Retrieve saved_state
    Restorer->>Storage: Restore original quant_method
    Restorer->>Storage: Restore original parameter data
    Storage-->>Client: State fully recovered

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Description check (⚠️ Warning): The PR description is largely incomplete; it lacks a filled-in Description section explaining the issue and solution, and the Test Coverage section is empty. Resolution: add a clear Description section explaining what problem this solves and why BF16 is used for FP4 stage 2, and list the relevant tests that validate the changes under Test Coverage.

✅ Passed checks (4 passed)

  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Title check (✅ Passed): The title clearly and specifically describes the main change: using bf16 (bfloat16) for FP4 stage 2, which aligns with the code modifications in the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (Contributor) Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2_two_stages.py`:
- Around lines 518-523: The loop that handles names starting with "__quant_method__" silently skips entries when the target module is missing or incompatible. Change the handler so that, for a "__quant_method__" entry, it looks up mod via module_dict (as currently) and then raises an explicit error if mod is None, is not an instance of the expected Linear class, or does not support quant_method, instead of continuing; otherwise set mod.quant_method = data. The error message should mention the offending module path and the expected type (e.g., Linear) so restoration fails fast rather than leaving modules on UnquantizedLinearMethod with mismatched packed weights.
- Around lines 471-483: The code mutates param.data to bf16 before verifying that the parent Linear (module_dict.get(base)) exists, which can leave the tensor in an incompatible layout if the lookup fails. Change the logic to first resolve and validate linear_mod (ensure isinstance(linear_mod, Linear)) and raise an exception or abort if it is not found; only then set param.data = bf16, update saved_state[f"__quant_method__{base}"], and set linear_mod.quant_method = UnquantizedLinearMethod(). Do not keep the current warn-and-continue behavior (logger.warning), because it leaves the model in a broken state. A hedged sketch of both fixes follows below.
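For concreteness, here is a minimal sketch of how both suggestions could be applied, reusing the module_dict, saved_state, Linear, and UnquantizedLinearMethod names from the comments above; the function boundaries and error messages are hypothetical, not the PR's actual code:

```python
def apply_bf16_swap(module_dict, base, param, bf16, saved_state,
                    linear_cls, unquantized_cls):
    # Resolve and validate the parent Linear *before* mutating param.data, so a
    # failed lookup cannot leave the tensor in an incompatible layout.
    linear_mod = module_dict.get(base)
    if linear_mod is None or not isinstance(linear_mod, linear_cls):
        raise RuntimeError(
            f"Cannot swap '{base}' to BF16: expected a {linear_cls.__name__}, "
            f"got {type(linear_mod).__name__}")
    param.data = bf16
    saved_state[f"__quant_method__{base}"] = linear_mod.quant_method
    linear_mod.quant_method = unquantized_cls()

def restore_quant_methods(module_dict, saved_state, linear_cls):
    prefix = "__quant_method__"
    for name, data in saved_state.items():
        if not name.startswith(prefix):
            continue
        path = name[len(prefix):]
        mod = module_dict.get(path)
        if mod is None or not isinstance(mod, linear_cls):
            # Fail fast instead of silently leaving the module on
            # UnquantizedLinearMethod with mismatched packed weights.
            raise RuntimeError(
                f"Cannot restore quant_method for '{path}': module missing or "
                f"not a {linear_cls.__name__}")
        mod.quant_method = data
```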
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: e5920e3c-e051-490d-bbec-07975ddac2ea

📥 Commits

Reviewing files that changed from the base of the PR and between 96bb8b7 and b95bce5.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2_two_stages.py

@tensorrt-cicd (Collaborator):

PR_Github #44784 [ run ] triggered by Bot. Commit: b95bce5 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #44784 [ run ] completed with state SUCCESS. Commit: b95bce5
/LLM/main/L0_MergeRequest_PR pipeline #35138 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yibinl-nvidia (Collaborator Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #44961 [ run ] triggered by Bot. Commit: b95bce5 Link to invocation

@yibinl-nvidia (Collaborator Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #45356 [ run ] triggered by Bot. Commit: 9f39104 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #45356 [ run ] completed with state SUCCESS. Commit: 9f39104
/LLM/main/L0_MergeRequest_PR pipeline #35601 completed with status: 'SUCCESS'

CI Report

Link to invocation

@yibinl-nvidia yibinl-nvidia changed the title [None][fix] Use bf16 for FP4 stage 2 [None][fix] Use bf16 for LTX-2 FP4 stage 2 Apr 24, 2026
@chang-l (Collaborator) left a comment


Thanks Yibin.
Can you also include the perf data implications in the PR description for the record? Also, do we have 2-stage LoRA w/ NVFP4 E2E accuracy protected in CI?

@zhenhuaw-me (Member):

How do we test the 2-stage implementation? For example, we are saying that "FP4 stage 2" accuracy is not good enough, and we switch to BF16 in this PR. How do we prove that this PR improves the accuracy?

@yibinl-nvidia yibinl-nvidia force-pushed the bf16_second_stage branch 3 times, most recently from 244b19a to 94d0223 on April 29, 2026 02:42
@yibinl-nvidia (Collaborator Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #46037 [ run ] triggered by Bot. Commit: 15664e7 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #46037 [ run ] completed with state FAILURE. Commit: 15664e7
/LLM/main/L0_MergeRequest_PR pipeline #36184 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yibinl-nvidia (Collaborator Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #46155 [ run ] triggered by Bot. Commit: 15664e7 Link to invocation

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
@yibinl-nvidia (Collaborator Author):

/bot kill

@tensorrt-cicd (Collaborator):

PR_Github #46160 [ kill ] triggered by Bot. Commit: 258c8a0 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #46155 [ run ] completed with state ABORTED. Commit: 15664e7

Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #46160 [ kill ] completed with state SUCCESS. Commit: 258c8a0
Successfully killed previous jobs for commit 258c8a0

Link to invocation

@yibinl-nvidia (Collaborator Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #46172 [ run ] triggered by Bot. Commit: 258c8a0 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #46172 [ run ] completed with state SUCCESS. Commit: 258c8a0
/LLM/main/L0_MergeRequest_PR pipeline #36292 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yibinl-nvidia (Collaborator Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #46297 [ run ] triggered by Bot. Commit: 258c8a0 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #46297 [ run ] completed with state SUCCESS. Commit: 258c8a0
/LLM/main/L0_MergeRequest_PR pipeline #36398 completed with status: 'SUCCESS'

CI Report

Link to invocation

@chang-l chang-l merged commit f3e34a0 into NVIDIA:main Apr 30, 2026
6 checks passed