Cogvideox #13402

Open
Talmaj wants to merge 15 commits into master from cogvideox

Conversation

@Talmaj (Contributor)

@Talmaj Talmaj commented Apr 14, 2026

This is a prerequisite PR for CORE-38, Netflix's VOID model.
It was written by @kijai and rebased by me. I've tested and used it only with the VOID model, and it works well.

This is a newly opened PR directly within the main repo. I'm closing the initial one: #13351

@Talmaj Talmaj mentioned this pull request Apr 14, 2026

coderabbitai bot commented Apr 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 658fc7ec-13d6-4efe-8e3d-2c688b35c9a2

📥 Commits

Reviewing files that changed from the base of the PR and between 541f26a and 52156ed.

📒 Files selected for processing (4)
  • comfy/ldm/cogvideo/model.py
  • comfy/ldm/cogvideo/vae.py
  • comfy/supported_models.py
  • comfy/text_encoders/cogvideo.py
✅ Files skipped from review due to trivial changes (1)
  • comfy/ldm/cogvideo/vae.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • comfy/supported_models.py
  • comfy/ldm/cogvideo/model.py

📝 Walkthrough


This pull request adds support for CogVideoX, a video generation model. The changes introduce:
  • a new latent format class
  • a 3D transformer-based diffusion model with patch embedding and block-based processing
  • a 3D convolutional VAE with causal convolution support and spatial normalization
  • a new V_PREDICTION_DDPM model type and sampling strategy
  • model auto-detection from state dicts and VAE instantiation logic
  • two new supported model configurations, for text-to-video and image-to-video
  • a T5 tokenizer subclass with a fixed minimum length
The implementation spans model architecture, sampling behavior, configuration detection, and the model registry.
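The new V_PREDICTION_DDPM type builds on the standard v-prediction parameterization. As an illustration of the underlying math only — a sketch, not this PR's actual wiring in comfy/model_sampling.py:

```python
import math

def v_prediction_to_x0(x_t, v, alpha_bar):
    """Recover the denoised sample x0 from a v-prediction model output.

    Standard relation: v = sqrt(alpha_bar)*eps - sqrt(1 - alpha_bar)*x0,
    which inverts to x0 = sqrt(alpha_bar)*x_t - sqrt(1 - alpha_bar)*v.
    """
    return math.sqrt(alpha_bar) * x_t - math.sqrt(1.0 - alpha_bar) * v

# Round-trip check: build x_t and v from a known x0/eps pair, then recover x0.
x0, eps, alpha_bar = 0.5, -1.25, 0.9
x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
v = math.sqrt(alpha_bar) * eps - math.sqrt(1.0 - alpha_bar) * x0
recovered = v_prediction_to_x0(x_t, v, alpha_bar)
```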

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 7.87%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check — ❓ Inconclusive: the title 'Cogvideox' is vague and does not communicate what was changed or why, using only a generic product name without context. Resolution: consider a more descriptive title such as 'Add CogVideoX video diffusion model support'.
✅ Passed checks (1 passed)
  • Description check — ✅ Passed: the description clearly indicates this PR adds CogVideoX model support as a prerequisite for CORE-38 (Netflix's VOID model), providing context and attribution to the original author.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
comfy/ldm/cogvideo/vae_backup.py (1)

1-485: Avoid shipping a second, unused CogVideoX VAE implementation.

comfy/sd.py only wires up comfy.ldm.cogvideo.vae.AutoencoderKLCogVideoX, so this backup copy will drift as soon as the primary path gets fixes. If you want to keep it around, it should at least live behind an explicit dev-only flag or out of the runtime package surface.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/ldm/cogvideo/vae_backup.py` around lines 1 - 485, This file contains an
extra, unused CogVideoX VAE implementation (class AutoencoderKLCogVideoX and
related Encoder3D/Decoder3D, etc.) that duplicates the canonical
comfy.ldm.cogvideo.vae module wired by comfy/sd.py; remove this backup from the
runtime package surface (delete the file) or move it into a dev-only location
(e.g., tests/dev or gated behind an explicit runtime check/ENV flag) so it won't
drift from the primary implementation; ensure no imports reference
comfy.ldm.cogvideo.vae_backup.AutoencoderKLCogVideoX and that comfy/sd.py
continues to import the canonical comfy.ldm.cogvideo.vae.AutoencoderKLCogVideoX.
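A minimal sketch of the dev-only gating the prompt suggests; the flag name COMFY_DEV_VAE_BACKUP is hypothetical, not an existing setting:

```python
import os

def should_load_backup_vae(env=None):
    """Return True only when a developer explicitly opts in, keeping the
    backup VAE implementation off the default runtime import surface."""
    env = os.environ if env is None else env
    return env.get("COMFY_DEV_VAE_BACKUP") == "1"

# A caller would then guard the import:
#   if should_load_backup_vae():
#       from comfy.ldm.cogvideo import vae_backup
```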
comfy/supported_models.py (1)

1813-1818: Nested tokenizer class is functional but unconventional.

The CogVideoXT5Tokenizer nested class inside clip_target() works correctly, but defining a class inside a method is unusual in this codebase. Consider moving it to module level for better discoverability and potential reuse.

That said, this pattern has no functional issues and the min_length=226 correctly matches the transformer's max_text_seq_length parameter (per context snippet 1).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/supported_models.py` around lines 1813 - 1818, Nested class
CogVideoXT5Tokenizer inside clip_target is unconventional; move
CogVideoXT5Tokenizer to module level (define it alongside other tokenizers)
keeping the same initializer signature (embedding_directory=None,
tokenizer_data={}, min_length=226) and base class
comfy.text_encoders.sd3_clip.T5XXLTokenizer, then update the clip_target method
to return supported_models_base.ClipTarget(CogVideoXT5Tokenizer,
comfy.text_encoders.sd3_clip.T5XXLModel) so the method simply references the
module-level class for discoverability and reuse.
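A sketch of the module-level refactor described above. The base class here is a reduced stand-in for comfy.text_encoders.sd3_clip.T5XXLTokenizer, keeping only the attribute the example needs:

```python
class T5XXLTokenizer:
    """Stand-in for comfy.text_encoders.sd3_clip.T5XXLTokenizer."""
    def __init__(self, embedding_directory=None, tokenizer_data={}, min_length=1):
        self.min_length = min_length

class CogVideoXT5Tokenizer(T5XXLTokenizer):
    """Module-level tokenizer: pads prompts to match the transformer's
    max_text_seq_length of 226 tokens."""
    def __init__(self, embedding_directory=None, tokenizer_data={}):
        super().__init__(embedding_directory=embedding_directory,
                         tokenizer_data=tokenizer_data, min_length=226)

# clip_target() would then simply reference the module-level class, e.g.:
#   return supported_models_base.ClipTarget(CogVideoXT5Tokenizer, T5XXLModel)
```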
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/ldm/cogvideo/vae.py`:
- Around line 519-539: The current precompute builds full [B,C,T,H,W] entries in
z_at_res by calling _interpolate_zq(z, target) for every Phase 2 resolution,
which OOMs; instead, expand z along time once (to t_expanded) outside the
resolution loop and then perform spatial interpolation from that
temporally-expanded tensor inside the chunk/decode loop so you only create
per-chunk spatially-interpolated tensors when needed. Concretely: keep a single
temporally-expanded_z (use t_expanded) and remove the eager per-resolution calls
to _interpolate_zq in the remaining_blocks/decoder.up_blocks loop; when
processing each chunk/resolution, call _interpolate_zq on the
temporally-expanded slice for that chunk/target and cache only the
spatially-interpolated result in z_at_res keyed by (h,w) to avoid allocating
full clip-sized [B,C,T,H,W] per resolution.
- Around line 453-465: The encode method currently folds the remainder frames
into the first chunk causing batches larger than num_sample_frames_batch_size;
change the batching so chunks are capped to self.num_sample_frames_batch_size by
computing num_batches = ceil(t / frame_batch) and for each i set start = i *
frame_batch and end = min((i + 1) * frame_batch, t), then call self.encoder(x[:,
:, start:end], conv_cache=conv_cache) appending results and carrying forward
conv_cache; this ensures each chunk size <= num_sample_frames_batch_size and
prevents oversized VRAM spikes while preserving conv_cache behavior in encode.
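The batching suggested in the second inline comment reduces to simple index arithmetic; a minimal sketch of just the chunk computation (the encoder and conv_cache plumbing are omitted):

```python
import math

def frame_chunks(t, frame_batch):
    """Split t frames into (start, end) chunks of at most frame_batch
    frames each, so no batch exceeds num_sample_frames_batch_size."""
    num_batches = math.ceil(t / frame_batch)
    return [(i * frame_batch, min((i + 1) * frame_batch, t))
            for i in range(num_batches)]
```

Each chunk would then be fed to self.encoder(x[:, :, start:end], conv_cache=conv_cache) in turn, carrying conv_cache forward between chunks.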
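For the first inline comment (the z_at_res precompute), the shape of the memory fix can be sketched with a NumPy nearest-neighbour stand-in for _interpolate_zq (the real code would use torch interpolation): expand along time once, then spatially interpolate only small temporal chunks, lazily, per target resolution.

```python
import numpy as np

def nearest_resize(x, t, h, w):
    """Nearest-neighbour resize of a [B, C, T, H, W] array via index maps."""
    B, C, T, H, W = x.shape
    ti = np.arange(t) * T // t
    hi = np.arange(h) * H // h
    wi = np.arange(w) * W // w
    return x[:, :, ti][:, :, :, hi][:, :, :, :, wi]

def chunks_at_resolution(z, t_expanded, h, w, chunk=2):
    """Yield spatially-interpolated chunks lazily instead of precomputing a
    full [B, C, T, h, w] tensor per resolution."""
    # Temporal expansion happens once, outside any resolution loop.
    z_t = nearest_resize(z, t_expanded, z.shape[3], z.shape[4])
    for s in range(0, t_expanded, chunk):
        sl = z_t[:, :, s:s + chunk]
        # Spatial interpolation only on the small per-chunk slice.
        yield nearest_resize(sl, sl.shape[2], h, w)
```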


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2df899fb-7a4b-40ce-ae3b-2acdcbf1ff24

📥 Commits

Reviewing files that changed from the base of the PR and between fed4ac0 and 541f26a.

📒 Files selected for processing (11)
  • comfy/latent_formats.py
  • comfy/ldm/cogvideo/__init__.py
  • comfy/ldm/cogvideo/model.py
  • comfy/ldm/cogvideo/vae.py
  • comfy/ldm/cogvideo/vae_backup.py
  • comfy/model_base.py
  • comfy/model_detection.py
  • comfy/model_sampling.py
  • comfy/sd.py
  • comfy/supported_models.py
  • nodes.py
