Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (4)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 Walkthrough

This pull request adds comprehensive support for CogVideoX, a video generation model. The changes introduce a new latent format class, a 3D transformer-based diffusion model with patch embedding and block-based processing, a 3D convolutional VAE with causal convolution support and spatial normalization, a new V_PREDICTION_DDPM model type and sampling strategy, model auto-detection from state dicts, VAE instantiation logic, two new supported model configurations for text-to-video and image-to-video tasks, and a T5 tokenizer subclass with a fixed minimum length. The implementation spans model architecture, sampling behavior, configuration detection, and model registry changes.

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
---
Actionable comments posted: 2
🧹 Nitpick comments (2)
comfy/ldm/cogvideo/vae_backup.py (1)
1-485: Avoid shipping a second, unused CogVideoX VAE implementation.
comfy/sd.py only wires up comfy.ldm.cogvideo.vae.AutoencoderKLCogVideoX, so this backup copy will drift as soon as the primary path gets fixes. If you want to keep it around, it should at least live behind an explicit dev-only flag or be kept out of the runtime package surface.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/ldm/cogvideo/vae_backup.py` around lines 1-485: this file contains an extra, unused CogVideoX VAE implementation (class AutoencoderKLCogVideoX and related Encoder3D/Decoder3D, etc.) that duplicates the canonical comfy.ldm.cogvideo.vae module wired by comfy/sd.py. Remove this backup from the runtime package surface (delete the file) or move it into a dev-only location (e.g., tests/dev or gated behind an explicit runtime check/ENV flag) so it won't drift from the primary implementation. Ensure no imports reference comfy.ldm.cogvideo.vae_backup.AutoencoderKLCogVideoX and that comfy/sd.py continues to import the canonical comfy.ldm.cogvideo.vae.AutoencoderKLCogVideoX.

comfy/supported_models.py (1)
1813-1818: Nested tokenizer class is functional but unconventional.

The CogVideoXT5Tokenizer nested class inside clip_target() works correctly, but defining a class inside a method is unusual in this codebase. Consider moving it to module level for better discoverability and potential reuse. That said, this pattern has no functional issues, and min_length=226 correctly matches the transformer's max_text_seq_length parameter (per context snippet 1).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/supported_models.py` around lines 1813 - 1818, Nested class CogVideoXT5Tokenizer inside clip_target is unconventional; move CogVideoXT5Tokenizer to module level (define it alongside other tokenizers) keeping the same initializer signature (embedding_directory=None, tokenizer_data={}, min_length=226) and base class comfy.text_encoders.sd3_clip.T5XXLTokenizer, then update the clip_target method to return supported_models_base.ClipTarget(CogVideoXT5Tokenizer, comfy.text_encoders.sd3_clip.T5XXLModel) so the method simply references the module-level class for discoverability and reuse.
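A minimal sketch of the suggested hoist. The T5XXLTokenizer stub below is only a stand-in for comfy.text_encoders.sd3_clip.T5XXLTokenizer so the example is self-contained; the signature mirrors the one stated in the review comment:

```python
# Stand-in for comfy.text_encoders.sd3_clip.T5XXLTokenizer (hypothetical,
# included only so this sketch runs on its own).
class T5XXLTokenizer:
    def __init__(self, embedding_directory=None, tokenizer_data={}, min_length=0):
        self.min_length = min_length

# Module-level version of the tokenizer the review suggests hoisting out of
# clip_target(): same initializer signature, with min_length fixed at 226 to
# match the transformer's max_text_seq_length.
class CogVideoXT5Tokenizer(T5XXLTokenizer):
    def __init__(self, embedding_directory=None, tokenizer_data={}):
        super().__init__(embedding_directory=embedding_directory,
                         tokenizer_data=tokenizer_data, min_length=226)

tok = CogVideoXT5Tokenizer()
```

clip_target() would then simply reference the module-level class instead of defining it inline.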
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/ldm/cogvideo/vae.py`:
- Around line 519-539: The current precompute builds full [B,C,T,H,W] entries in
z_at_res by calling _interpolate_zq(z, target) for every Phase 2 resolution,
which OOMs; instead, expand z along time once (to t_expanded) outside the
resolution loop and then perform spatial interpolation from that
temporally-expanded tensor inside the chunk/decode loop so you only create
per-chunk spatially-interpolated tensors when needed. Concretely: keep a single
temporally expanded z (using t_expanded) and remove the eager per-resolution calls
to _interpolate_zq in the remaining_blocks/decoder.up_blocks loop; when
processing each chunk/resolution, call _interpolate_zq on the
temporally-expanded slice for that chunk/target and cache only the
spatially-interpolated result in z_at_res keyed by (h,w) to avoid allocating
full clip-sized [B,C,T,H,W] per resolution.
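The lazy per-resolution caching described above can be sketched as a small, self-contained pattern. Here fake_interpolate is a hypothetical stand-in for _interpolate_zq that just records which targets were computed, showing that each (h, w) resolution is interpolated only once, on first use:

```python
def make_spatial_cache(interpolate_fn, z_expanded):
    """Lazily interpolate a single temporally expanded latent to spatial
    targets, caching results keyed by (h, w) so each resolution is
    computed at most once."""
    cache = {}
    def get(h, w):
        if (h, w) not in cache:
            cache[(h, w)] = interpolate_fn(z_expanded, (h, w))
        return cache[(h, w)]
    return get

calls = []
def fake_interpolate(z, target):
    calls.append(target)   # record that interpolation work happened
    return (z, target)     # stand-in for an interpolated tensor

get = make_spatial_cache(fake_interpolate, "z_t_expanded")
get(32, 32)
get(32, 32)  # cache hit: no second interpolation for (32, 32)
get(64, 64)
```

The point of the pattern is that only requested resolutions are materialized, instead of eagerly building a full [B, C, T, H, W] tensor per resolution up front.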
- Around line 453-465: The encode method currently folds the remainder frames
into the first chunk causing batches larger than num_sample_frames_batch_size;
change the batching so chunks are capped to self.num_sample_frames_batch_size by
computing num_batches = ceil(t / frame_batch) and for each i set start = i *
frame_batch and end = min((i + 1) * frame_batch, t), then call self.encoder(x[:,
:, start:end], conv_cache=conv_cache) appending results and carrying forward
conv_cache; this ensures each chunk size <= num_sample_frames_batch_size and
prevents oversized VRAM spikes while preserving conv_cache behavior in encode.
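The capped batching arithmetic from the second comment can be sketched as follows; frame_chunks is a hypothetical helper name, not part of the PR:

```python
import math

def frame_chunks(t, frame_batch):
    """Split t frames into (start, end) chunks of at most frame_batch
    frames each, using ceil-based batching so no chunk is oversized."""
    num_batches = math.ceil(t / frame_batch)
    return [(i * frame_batch, min((i + 1) * frame_batch, t))
            for i in range(num_batches)]

# e.g. 49 frames with num_sample_frames_batch_size = 16: the remainder
# frame ends up in a final short chunk rather than inflating the first one.
chunks = frame_chunks(49, 16)
```

Each chunk would then be passed as x[:, :, start:end] to the encoder, carrying conv_cache forward between calls.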
---
Nitpick comments:
In `@comfy/ldm/cogvideo/vae_backup.py`:
- Around line 1-485: This file contains an extra, unused CogVideoX VAE
implementation (class AutoencoderKLCogVideoX and related Encoder3D/Decoder3D,
etc.) that duplicates the canonical comfy.ldm.cogvideo.vae module wired by
comfy/sd.py; remove this backup from the runtime package surface (delete the
file) or move it into a dev-only location (e.g., tests/dev or gated behind an
explicit runtime check/ENV flag) so it won't drift from the primary
implementation; ensure no imports reference
comfy.ldm.cogvideo.vae_backup.AutoencoderKLCogVideoX and that comfy/sd.py
continues to import the canonical comfy.ldm.cogvideo.vae.AutoencoderKLCogVideoX.
In `@comfy/supported_models.py`:
- Around line 1813-1818: Nested class CogVideoXT5Tokenizer inside clip_target is
unconventional; move CogVideoXT5Tokenizer to module level (define it alongside
other tokenizers) keeping the same initializer signature
(embedding_directory=None, tokenizer_data={}, min_length=226) and base class
comfy.text_encoders.sd3_clip.T5XXLTokenizer, then update the clip_target method
to return supported_models_base.ClipTarget(CogVideoXT5Tokenizer,
comfy.text_encoders.sd3_clip.T5XXLModel) so the method simply references the
module-level class for discoverability and reuse.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2df899fb-7a4b-40ce-ae3b-2acdcbf1ff24
📒 Files selected for processing (11)
comfy/latent_formats.py, comfy/ldm/cogvideo/__init__.py, comfy/ldm/cogvideo/model.py, comfy/ldm/cogvideo/vae.py, comfy/ldm/cogvideo/vae_backup.py, comfy/model_base.py, comfy/model_detection.py, comfy/model_sampling.py, comfy/sd.py, comfy/supported_models.py, nodes.py
This is a prerequisite PR for CORE-38, Netflix's VOID model.
It was written by @kijai and rebased by me. I've tested and used it only with the VOID model, and it works well.
This is a newly opened PR directly within the main repo. I'm closing the initial one: #13351