
Add LTX-2.3 text-to-video generation support #402

Merged

copybara-service[bot] merged 1 commit into main from ltx23 on May 12, 2026
Conversation

@prishajain1 (Collaborator) commented May 10, 2026

This PR introduces end-to-end pipeline and model changes to support the LTX-2.3 multi-modal (audio-video) transformer model. It enables integrated text-to-audio-video generation using Gemma-based text conditioning, latent upsamplers, and vocoders.

Key architectural changes

  • Gated Cross-Modal Attention: Introduces a learnable gate (to_gate_logits) applied to all attention operations in the block (Self-Video, Self-Audio, Prompt-Cross, and Modal-Cross).
  • Prompt AdaLN (Noise-Aware Text Conditioning): Adds self.prompt_adaln, which derives the scale and shift parameters for the prompt cross-attention modulation directly from the continuous noise level (sigma).
  • Cross-Timestep Conditioning: When use_cross_timestep is enabled, it swaps the sigma (noise level) values used during the cross-modal attention steps (A2V and V2A).
  • Per-Modality Text Projections (Connectors): Introduces support for Per-Modality Projections (per_modality_projections=True). Instead of a shared feature extractor, it applies per-token RMS normalization to the raw hidden states and passes them through two separate linear projection layers (video_text_proj_in and audio_text_proj_in) before sending them to the respective video and audio connectors.
  • 4-Way Batched Denoising: In addition to CFG, LTX-2.3 introduces two new guidance mechanisms, spatiotemporal guidance (STG) and modality isolation guidance (MIG), requiring a 4-pass execution per denoising step.
  • Stabilizing Multi-Branch Guidance via x0 Space: Guidance deltas are computed and combined in x0 (denoised-sample) space rather than velocity space, which stabilizes the multi-branch guidance.
  • BWE Vocoder: Introduces the Bandwidth Extension (BWE) Vocoder (LTX2VocoderWithBWE).
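As a rough illustration of the gating idea, here is a NumPy sketch (not the PR's JAX/Flax code; the tanh activation and tensor shapes are assumptions, with the logits playing the role of `to_gate_logits`):

```python
import numpy as np

def gated_attention_output(attn_out, gate_logits):
    """Scale an attention output by a learnable gate.

    A tanh of the gate logits yields a per-channel factor in (-1, 1);
    zero-initialized logits start the path fully closed, so a newly added
    attention branch initially contributes nothing to the residual stream.
    """
    return attn_out * np.tanh(gate_logits)

attn_out = np.ones((2, 4, 8))  # (batch, tokens, hidden), toy values
closed = gated_attention_output(attn_out, np.zeros(8))       # gate closed
opened = gated_attention_output(attn_out, np.full(8, 10.0))  # gate ~fully open
```

Zero-initializing the gate is a common trick for adding new attention paths (here Self-Video, Self-Audio, Prompt-Cross, and Modal-Cross) without disturbing pretrained behavior at the start of training.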
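The four-branch combination can be sketched as follows (illustrative NumPy only; it assumes a rectified-flow parameterization where x0 = x_t - sigma * v, and the guidance scales and exact delta formulation are placeholders rather than the PR's actual equations, which also include guidance rescaling):

```python
import numpy as np

def guided_x0(x_t, sigma, v_uncond, v_cond, v_perturb, v_isolated,
              cfg_scale=5.0, stg_scale=1.0, mig_scale=1.0):
    """Combine the 4 denoising passes in x0 (denoised-sample) space.

    Each velocity prediction is first converted to an x0 estimate via
    x0 = x_t - sigma * v (rectified-flow assumption), then guidance deltas
    push the conditional estimate away from the unconditional (CFG),
    perturbed (STG), and modality-isolated (MIG) branches.
    """
    to_x0 = lambda v: x_t - sigma * v
    x0_u, x0_c = to_x0(v_uncond), to_x0(v_cond)
    x0_p, x0_i = to_x0(v_perturb), to_x0(v_isolated)
    return (x0_c
            + (cfg_scale - 1.0) * (x0_c - x0_u)   # classifier-free guidance
            + stg_scale * (x0_c - x0_p)           # spatiotemporal guidance
            + mig_scale * (x0_c - x0_i))          # modality isolation guidance
```

When all four branches agree, every delta vanishes and the result reduces to the conditional x0 estimate, which is what makes the x0-space formulation well behaved.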

Files added/modified

  • ltx2_3_video.yml: New config file for LTX-2.3.
  • vocoder_ltx2.py: Added support for the BWE vocoder.
  • ltx2_pipeline.py: Enabled 4-way sliced batched inference (Uncond, Cond, Perturb, Isolated) and integrated velocity/x0 conversion delta equations with guidance rescaling.
  • transformer_ltx2.py: Propagated modality/perturbation masks to transformer blocks and integrated prompt adaptive layer norms.
  • generate_ltx2.py, pyconfig.py, common_types.py: Added support for LTX-2.3.
  • ltx2_utils.py: Added loading of the new LTX-2.3-specific weights.
  • attention_ltx2.py: Added support for gated attention and perturbed attention.
  • autoencoder_kl_ltx2.py: Added support for different upsample_type values.
  • embeddings_connector_ltx2.py: Added gated attention (gated_attn) support to intermediate transformer block connectors.
  • feature_extractor_ltx2.py: Added support for the per_modality_projections parameter.
  • text_encoders.py: Implemented dual-modality parallel text-connector routing, token-wise RMS scaling, and independent video/audio linear projections.
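The per-modality connector routing can be sketched as below (a NumPy sketch with made-up shapes; the real module uses Flax linear layers, with the weight matrices standing in for video_text_proj_in and audio_text_proj_in):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Per-token RMS normalization over the feature axis."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 16, 64))       # raw text-encoder hidden states
w_video = rng.normal(size=(64, 32)) * 0.02  # stands in for video_text_proj_in
w_audio = rng.normal(size=(64, 32)) * 0.02  # stands in for audio_text_proj_in

normed = rms_norm(hidden)                   # shared per-token normalization
video_feats = normed @ w_video              # routed to the video connector
audio_feats = normed @ w_audio              # routed to the audio connector
```

The shared normalization keeps the two projections on the same scale while still letting each modality learn its own view of the text features.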

Sample outputs

| Configuration | Generation time | Video |
| --- | --- | --- |
| CFG + STG + MIG enabled | 23.4s | Video |
| CFG enabled | 13.4s | Video |
| Upsampler video (using LTX-2.3) | 26.76s | Video |

Component-wise breakdown

| Pipeline step | Duration |
| --- | --- |
| Denoising | 17.2s |
| Text encoding | 4.1s |
| Vocoder | 0.8s |
| Video post-processing | 0.6s |
| Latent processing | 0.5s |
| Video VAE | 0.4s |
| Connectors | 0.1s |
| Audio VAE | 0.1s |
| Preparation | 0.0s |

Tested:

  • scan_diffusion_loop = True and scan_diffusion_loop = False
  • scan_layers = True and scan_layers = False
  • No performance or quality regressions observed in the existing LTX2 pipeline

@prishajain1 prishajain1 requested a review from entrpn as a code owner May 10, 2026 09:28
@prishajain1 prishajain1 marked this pull request as draft May 10, 2026 09:28
@github-actions

🤖 Hi @Perseus14, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions bot left a comment

## 📋 Review Summary

This Pull Request successfully introduces support for LTX-2.3 text-to-video generation. It includes significant updates to the transformer architecture (gated attention, cross-modal modulation) and the denoising pipeline (4-way batched denoising for STG/CFG/MIG). The implementation is high-quality and integrates well with the existing LTX-2 infrastructure.

🔍 General Feedback

  • Redundant Patch File: The scratch_diff.patch file was likely added by mistake and should be removed before merging.
  • Robustness: A few areas in the pipeline (like audio_channels fallback and upsampler parameter inference) could be made more robust to handle different model versions and naming conventions.
  • Optimization: The use of nnx.jit for the vocoder and the optimized sequence length in smoke tests are excellent additions for performance and stability.

Comment threads:
  • src/maxdiffusion/pipelines/ltx2/ltx2_pipeline.py
  • scratch_diff.patch (outdated)
  • src/maxdiffusion/pipelines/ltx2/ltx2_pipeline.py (outdated)
@github-actions

🤖 Hi @prishajain1, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions

🤖 I'm sorry @prishajain1, but I was unable to process your request. Please see the logs for more details.

@github-actions bot left a comment

## 📋 Review Summary

This Pull Request introduces comprehensive support for LTX-2.3 text-to-video generation, including the end-to-end pipeline, model updates, and a new vocoder with bandwidth extension (BWE). The implementation correctly handles complex features like Spatio-Temporal Guidance (STG) and Modality Isolation Guidance (MIG) using a 4-way batched denoising approach in JAX.

🔍 General Feedback

  • STG/MIG Logic: The implementation of the 4-way split denoising logic and the corresponding delta formulations for guidance is impressive and aligns well with the LTX-2.3 technical requirements.
  • Efficiency: Utilizing nnx.scan for the denoising loop ensures optimal performance on TPU/GPU hardware.
  • Redundancy: I identified some redundant initializations and assignments in the transformer and autoencoder models that should be cleaned up.
  • Parameter Initialization: Double-check the usage of nnx.Param with kernel_init, as nnx.Param typically only accepts the data tensor and might ignore additional keyword arguments.

Comment threads:
  • src/maxdiffusion/models/ltx2/transformer_ltx2.py
  • src/maxdiffusion/pipelines/ltx2/ltx2_pipeline.py (outdated)
  • src/maxdiffusion/pipelines/ltx2/ltx2_pipeline.py
  • src/maxdiffusion/models/ltx2/transformer_ltx2.py
  • src/maxdiffusion/pipelines/ltx2/ltx2_pipeline.py
  • src/maxdiffusion/models/ltx2/transformer_ltx2.py
  • src/maxdiffusion/pipelines/ltx2/ltx2_pipeline.py
  • src/maxdiffusion/models/ltx2/transformer_ltx2.py
  • src/maxdiffusion/models/ltx2/autoencoder_kl_ltx2.py (outdated)
  • src/maxdiffusion/models/ltx2/text_encoders/text_encoders_ltx2.py
@Perseus14 Perseus14 self-requested a review May 11, 2026 15:17
@prishajain1 prishajain1 force-pushed the ltx23 branch 2 times, most recently from 1890edb to 6e5961c Compare May 12, 2026 06:21
@prishajain1 prishajain1 marked this pull request as ready for review May 12, 2026 06:22
@Perseus14 (Collaborator) left a comment

Left a few comments. PTAL

Additional Comments

  • Could you test LTX2 in this branch and ensure that there is no regression?
  • Please test with scan_layers true/false as well
  • Please add e2e generation time as well as each component if possible

Comment threads:
  • src/maxdiffusion/configs/ltx2_video.yml (outdated)
  • src/maxdiffusion/configs/ltx2_3_video.yml
  • src/maxdiffusion/models/ltx2/autoencoder_kl_ltx2.py
  • src/maxdiffusion/models/ltx2/ltx2_utils.py
  • src/maxdiffusion/models/ltx2/vocoder_ltx2.py
  • src/maxdiffusion/models/ltx2/vocoder_ltx2.py
  • src/maxdiffusion/models/attention_flax.py (outdated)
  • src/maxdiffusion/pipelines/ltx2/ltx2_pipeline.py (outdated)
  • src/maxdiffusion/pipelines/ltx2/ltx2_pipeline.py
  • src/maxdiffusion/pipelines/ltx2/ltx2_pipeline.py
@github-actions bot left a comment

## 📋 Review Summary

This Pull Request introduces comprehensive support for the LTX-2.3 multi-modal (audio-video) transformer model. It includes key architectural updates such as Gated Cross-Modal Attention, Prompt AdaLN, and a sophisticated Bandwidth Extension (BWE) Vocoder. The implementation is technically sound, highly optimized for JAX/TPU, and follows the project's established modular patterns.

🔍 General Feedback

  • 4-Way Batched Denoising: The integration of Spatiotemporal Guidance (STG) and Modality Isolation Guidance (MIG) via 4-way batching is a major highlight, enabling advanced generation features.
  • Performance: Excellent use of JIT caching for the vocoder and conditional VAE replication to optimize inference latency.
  • Code Quality: The transition to more explicit logic for guidance (using x0 space) improves both readability and correctness compared to standard velocity-based CFG.

Comment threads:
  • src/maxdiffusion/models/ltx2/transformer_ltx2.py
  • src/maxdiffusion/models/ltx2/vocoder_ltx2.py
  • src/maxdiffusion/pipelines/ltx2/ltx2_pipeline.py (outdated)
@prishajain1 (Collaborator, Author)

> Left a few comments. PTAL
>
> Additional Comments
>
> • Could you test LTX2 in this branch and ensure that there is no regression?
> • Please test with scan_layers true/false as well
> • Please add e2e generation time as well as each component if possible

Added the above details for each component in the PR description along with details of what has been tested.

@prishajain1 (Collaborator, Author)

The unit-test failure is unrelated to the changes in this PR.

@copybara-service bot merged commit 54898a3 into main on May 12, 2026
16 of 18 checks passed