Cogvideox #13402

Open
Talmaj wants to merge 15 commits into master from cogvideox

Conversation

@Talmaj (Contributor)

@Talmaj Talmaj commented Apr 14, 2026

This is a prerequisite PR for CORE-38, Netflix's VOID model.
It was written by @kijai and rebased by me. I've tested and used it only with the VOID model, and it works well.

This is a newly opened PR directly within the main repo. I'm closing the initial one: #13351

@Talmaj Talmaj mentioned this pull request Apr 14, 2026

coderabbitai bot commented Apr 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 658fc7ec-13d6-4efe-8e3d-2c688b35c9a2

📥 Commits

Reviewing files that changed from the base of the PR and between 541f26a and 52156ed.

📒 Files selected for processing (4)
  • comfy/ldm/cogvideo/model.py
  • comfy/ldm/cogvideo/vae.py
  • comfy/supported_models.py
  • comfy/text_encoders/cogvideo.py
✅ Files skipped from review due to trivial changes (1)
  • comfy/ldm/cogvideo/vae.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • comfy/supported_models.py
  • comfy/ldm/cogvideo/model.py

📝 Walkthrough


This pull request adds support for CogVideoX, a video generation model. The changes introduce:
  • a new latent format class
  • a 3D transformer-based diffusion model with patch embedding and block-based processing
  • a 3D convolutional VAE with causal convolution support and spatial normalization
  • a new V_PREDICTION_DDPM model type and sampling strategy
  • model auto-detection from state dicts and VAE instantiation logic
  • two new supported model configurations, for text-to-video and image-to-video
  • a T5 tokenizer subclass with a fixed minimum length
The implementation spans model architecture, sampling behavior, configuration detection, and the model registry.
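The new V_PREDICTION_DDPM type builds on the standard v-prediction parameterization. As an illustration of the underlying math only — a sketch, not this PR's actual wiring in comfy/model_sampling.py:

```python
import math

def v_prediction_to_x0(x_t, v, alpha_bar):
    """Recover the denoised sample x0 from a v-prediction model output.

    Standard relation: v = sqrt(alpha_bar)*eps - sqrt(1 - alpha_bar)*x0,
    which inverts to x0 = sqrt(alpha_bar)*x_t - sqrt(1 - alpha_bar)*v.
    """
    return math.sqrt(alpha_bar) * x_t - math.sqrt(1.0 - alpha_bar) * v

# Round-trip check: build x_t and v from a known x0/eps pair, then recover x0.
x0, eps, alpha_bar = 0.5, -1.25, 0.9
x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
v = math.sqrt(alpha_bar) * eps - math.sqrt(1.0 - alpha_bar) * x0
recovered = v_prediction_to_x0(x_t, v, alpha_bar)
```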

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 7.87%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check — ❓ Inconclusive: the title 'Cogvideox' is vague and does not communicate what was changed or why, using only a generic product name without context. Resolution: consider a more descriptive title such as 'Add CogVideoX video diffusion model support'.
✅ Passed checks (1 passed)
  • Description check — ✅ Passed: the description clearly indicates this PR adds CogVideoX model support as a prerequisite for CORE-38 (Netflix's VOID model), providing context and attribution to the original author.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
comfy/ldm/cogvideo/vae_backup.py (1)

1-485: Avoid shipping a second, unused CogVideoX VAE implementation.

comfy/sd.py only wires up comfy.ldm.cogvideo.vae.AutoencoderKLCogVideoX, so this backup copy will drift as soon as the primary path gets fixes. If you want to keep it around, it should at least live behind an explicit dev-only flag or out of the runtime package surface.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/ldm/cogvideo/vae_backup.py` around lines 1 - 485, This file contains an
extra, unused CogVideoX VAE implementation (class AutoencoderKLCogVideoX and
related Encoder3D/Decoder3D, etc.) that duplicates the canonical
comfy.ldm.cogvideo.vae module wired by comfy/sd.py; remove this backup from the
runtime package surface (delete the file) or move it into a dev-only location
(e.g., tests/dev or gated behind an explicit runtime check/ENV flag) so it won't
drift from the primary implementation; ensure no imports reference
comfy.ldm.cogvideo.vae_backup.AutoencoderKLCogVideoX and that comfy/sd.py
continues to import the canonical comfy.ldm.cogvideo.vae.AutoencoderKLCogVideoX.
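A minimal sketch of the dev-only gating the prompt suggests; the flag name COMFY_DEV_VAE_BACKUP is hypothetical, not an existing setting:

```python
import os

def should_load_backup_vae(env=None):
    """Return True only when a developer explicitly opts in, keeping the
    backup VAE implementation off the default runtime import surface."""
    env = os.environ if env is None else env
    return env.get("COMFY_DEV_VAE_BACKUP") == "1"

# A caller would then guard the import:
#   if should_load_backup_vae():
#       from comfy.ldm.cogvideo import vae_backup
```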
comfy/supported_models.py (1)

1813-1818: Nested tokenizer class is functional but unconventional.

The CogVideoXT5Tokenizer nested class inside clip_target() works correctly, but defining a class inside a method is unusual in this codebase. Consider moving it to module level for better discoverability and potential reuse.

That said, this pattern has no functional issues and the min_length=226 correctly matches the transformer's max_text_seq_length parameter (per context snippet 1).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/supported_models.py` around lines 1813 - 1818, Nested class
CogVideoXT5Tokenizer inside clip_target is unconventional; move
CogVideoXT5Tokenizer to module level (define it alongside other tokenizers)
keeping the same initializer signature (embedding_directory=None,
tokenizer_data={}, min_length=226) and base class
comfy.text_encoders.sd3_clip.T5XXLTokenizer, then update the clip_target method
to return supported_models_base.ClipTarget(CogVideoXT5Tokenizer,
comfy.text_encoders.sd3_clip.T5XXLModel) so the method simply references the
module-level class for discoverability and reuse.
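A sketch of the module-level refactor described above. The base class here is a reduced stand-in for comfy.text_encoders.sd3_clip.T5XXLTokenizer, keeping only the attribute the example needs:

```python
class T5XXLTokenizer:
    """Stand-in for comfy.text_encoders.sd3_clip.T5XXLTokenizer."""
    def __init__(self, embedding_directory=None, tokenizer_data={}, min_length=1):
        self.min_length = min_length

class CogVideoXT5Tokenizer(T5XXLTokenizer):
    """Module-level tokenizer: pads prompts to match the transformer's
    max_text_seq_length of 226 tokens."""
    def __init__(self, embedding_directory=None, tokenizer_data={}):
        super().__init__(embedding_directory=embedding_directory,
                         tokenizer_data=tokenizer_data, min_length=226)

# clip_target() would then simply reference the module-level class, e.g.:
#   return supported_models_base.ClipTarget(CogVideoXT5Tokenizer, T5XXLModel)
```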
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/ldm/cogvideo/vae.py`:
- Around line 519-539: The current precompute builds full [B,C,T,H,W] entries in
z_at_res by calling _interpolate_zq(z, target) for every Phase 2 resolution,
which OOMs; instead, expand z along time once (to t_expanded) outside the
resolution loop and then perform spatial interpolation from that
temporally-expanded tensor inside the chunk/decode loop so you only create
per-chunk spatially-interpolated tensors when needed. Concretely: keep a single
temporally-expanded_z (use t_expanded) and remove the eager per-resolution calls
to _interpolate_zq in the remaining_blocks/decoder.up_blocks loop; when
processing each chunk/resolution, call _interpolate_zq on the
temporally-expanded slice for that chunk/target and cache only the
spatially-interpolated result in z_at_res keyed by (h,w) to avoid allocating
full clip-sized [B,C,T,H,W] per resolution.
- Around line 453-465: The encode method currently folds the remainder frames
into the first chunk causing batches larger than num_sample_frames_batch_size;
change the batching so chunks are capped to self.num_sample_frames_batch_size by
computing num_batches = ceil(t / frame_batch) and for each i set start = i *
frame_batch and end = min((i + 1) * frame_batch, t), then call self.encoder(x[:,
:, start:end], conv_cache=conv_cache) appending results and carrying forward
conv_cache; this ensures each chunk size <= num_sample_frames_batch_size and
prevents oversized VRAM spikes while preserving conv_cache behavior in encode.
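The batching suggested in the second inline comment reduces to simple index arithmetic; a minimal sketch of just the chunk computation (the encoder and conv_cache plumbing are omitted):

```python
import math

def frame_chunks(t, frame_batch):
    """Split t frames into (start, end) chunks of at most frame_batch
    frames each, so no batch exceeds num_sample_frames_batch_size."""
    num_batches = math.ceil(t / frame_batch)
    return [(i * frame_batch, min((i + 1) * frame_batch, t))
            for i in range(num_batches)]
```

Each chunk would then be fed to self.encoder(x[:, :, start:end], conv_cache=conv_cache) in turn, carrying conv_cache forward between chunks.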
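For the first inline comment (the z_at_res precompute), the shape of the memory fix can be sketched with a NumPy nearest-neighbour stand-in for _interpolate_zq (the real code would use torch interpolation): expand along time once, then spatially interpolate only small temporal chunks, lazily, per target resolution.

```python
import numpy as np

def nearest_resize(x, t, h, w):
    """Nearest-neighbour resize of a [B, C, T, H, W] array via index maps."""
    B, C, T, H, W = x.shape
    ti = np.arange(t) * T // t
    hi = np.arange(h) * H // h
    wi = np.arange(w) * W // w
    return x[:, :, ti][:, :, :, hi][:, :, :, :, wi]

def chunks_at_resolution(z, t_expanded, h, w, chunk=2):
    """Yield spatially-interpolated chunks lazily instead of precomputing a
    full [B, C, T, h, w] tensor per resolution."""
    # Temporal expansion happens once, outside any resolution loop.
    z_t = nearest_resize(z, t_expanded, z.shape[3], z.shape[4])
    for s in range(0, t_expanded, chunk):
        sl = z_t[:, :, s:s + chunk]
        # Spatial interpolation only on the small per-chunk slice.
        yield nearest_resize(sl, sl.shape[2], h, w)
```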


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2df899fb-7a4b-40ce-ae3b-2acdcbf1ff24

📥 Commits

Reviewing files that changed from the base of the PR and between fed4ac0 and 541f26a.

📒 Files selected for processing (11)
  • comfy/latent_formats.py
  • comfy/ldm/cogvideo/__init__.py
  • comfy/ldm/cogvideo/model.py
  • comfy/ldm/cogvideo/vae.py
  • comfy/ldm/cogvideo/vae_backup.py
  • comfy/model_base.py
  • comfy/model_detection.py
  • comfy/model_sampling.py
  • comfy/sd.py
  • comfy/supported_models.py
  • nodes.py
