Yao-Chih Lee1,2, Erika Lu1, Sarah Rumbley1, Michal Geyer1,3, Jia-Bin Huang2, Tali Dekel1,3, Forrester Cole1
1Google DeepMind, 2University of Maryland, 3Weizmann Institute of Science
We applied the same fine-tuning strategy used for the original Casper model (video object-effect removal) to public video diffusion models, CogVideoX and Wan2.1, with minimal modifications. However, the fine-tuned public models perform close to, but not on par with, the Lumiere-based Casper. We hope continued development will lead to further performance improvements.
This public reimplementation builds on code and models from aigc-apps/VideoX-Fun. We thank the authors for sharing the code and pretrained inpainting models for CogVideoX and Wan2.1.
- Environment
- Casper (video object effect removal)
- Omnimatte optimization
- Gradio Demo
- Acknowledgments
- Citation
- License
- Tested on python 3.10, CUDA 12.4, torch 2.5.1, diffusers 0.32.2
- Please check requirements.txt for the dependencies
- Install SAM2 by following the instructions.
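
A minimal setup sketch, assuming a fresh conda environment and that SAM2 is installed from the official facebookresearch/sam2 repository; defer to the linked instructions if they differ:

```bash
# Sketch: create an environment matching the tested versions above.
conda create -n omnimatte python=3.10 -y
conda activate omnimatte

# Install the repo dependencies listed in requirements.txt.
pip install -r requirements.txt

# Install SAM2 from source (assumes the official facebookresearch/sam2 repository).
git clone https://github.com/facebookresearch/sam2.git
cd sam2 && pip install -e . && cd ..
```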
We provide several variants based on different model backbones. In addition to downloading our Casper model weights, please also download the pretrained inpainting model from aigc-apps/VideoX-Fun.
| Pretrained inpainting model from VideoX-Fun | Casper model | Description |
|---|---|---|
| CogVideoX-Fun-V1.5-5b-InP | google drive | (Recommended) This model may perform better and faster than the Wan-based Casper models, but still not as well as the Lumiere-based Casper. It was fully fine-tuned from our inpainting model, which was initially fine-tuned from VideoX-Fun's released model. During inference, it processes a temporal window of 85 frames and can handle 197 frames using temporal multidiffusion. The default inference resolution is 384x672 (HxW). |
| Wan2.1-Fun-1.3B-InP (V1.0) | google drive | The model was fully fine-tuned from VideoX-Fun's released inpainting model. During inference, it processes a temporal window of 81 frames and can handle 197 frames using temporal multidiffusion. The default inference resolution is 480x832 (HxW). |
| Wan2.1-Fun-14B-InP (V1.0) | google drive | Due to the large model size, we applied LoRA-based fine-tuning on top of VideoX-Fun's released inpainting model. During inference, it processes a temporal window of 81 frames and can handle 197 frames using temporal multidiffusion. The default inference resolution is 480x832 (HxW). |
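
As a rough sketch of fetching the backbone weights, assuming the VideoX-Fun inpainting checkpoints are hosted under the alibaba-pai organization on Hugging Face (check the VideoX-Fun README for the authoritative locations) and that the Casper weights come from the Google Drive links above:

```bash
# Sketch: download the pretrained inpainting backbone (the repo ID is an assumption;
# verify against the VideoX-Fun documentation).
huggingface-cli download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
  --local-dir models/CogVideoX-Fun-V1.5-5b-InP

# Place the Casper transformer weights from the Google Drive link somewhere convenient, e.g.:
#   models/casper_cogvideox/diffusion_pytorch_model.safetensors
# These two paths are passed to --config.video_model.model_name and
# --config.video_model.transformer_path in the commands below.
```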
- CogVideoX-based (recommended)
  - Takes 1-2 mins on an A100 GPU, and 4-5 mins on an A6000 GPU with 48GB RAM

```bash
python inference/cogvideox_fun/predict_v2v.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="boys-beach,animator-draw" \
  --config.experiment.save_path="CASPER/OUTPUT/DIR" \
  --config.video_model.model_name="PATH/TO/CogVideoX-Fun-V1.5-5b-InP" \
  --config.video_model.transformer_path="PATH/TO/CASPER/TRANSFORMER.safetensors"
```
- Wan2.1-1.3B-based
  - Takes ~10 mins on an A6000 GPU with 48GB RAM

```bash
python inference/wan2.1_fun/predict_v2v.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="boys-beach,animator-draw" \
  --config.experiment.save_path="CASPER/OUTPUT/DIR" \
  --config.video_model.model_name="PATH/TO/Wan2.1-Fun-1.3B-InP" \
  --config.video_model.transformer_path="PATH/TO/CASPER/TRANSFORMER.safetensors"
```
- Wan2.1-14B-based (LoRA)
  - Takes ~55 mins on an A6000 GPU with 64GB RAM

```bash
python inference/wan2.1_fun/predict_v2v.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="boys-beach,animator-draw" \
  --config.experiment.save_path="CASPER/OUTPUT/DIR" \
  --config.video_model.model_name="PATH/TO/Wan2.1-Fun-14B-InP" \
  --config.video_model.lora_path="PATH/TO/CASPER/LORA.safetensors" \
  --config.video_model.lora_weight=1.0 \
  --config.system.gpu_memory_mode="sequential_cpu_offload"
```
- To run your own sequences, please follow the format in `examples/boys-beach` to provide your own input video, video masks, and text prompt in a folder (see the layout sketch below).
- Modify `--config.data.data_rootdir` and `--config.experiment.run_seqs` if needed.
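
The layout below is only an illustration; the file and folder names are hypothetical, so mirror `examples/boys-beach` for the exact structure expected by the data loader:

```bash
# Hypothetical layout for a custom sequence -- mirror examples/boys-beach for the
# exact file and folder names expected by the data loader:
#
#   examples/
#   └── my-sequence/            # sequence name passed to --config.experiment.run_seqs
#       ├── <input video>       # RGB frames or a video file
#       ├── <object masks>      # one binary mask track per object (e.g., from SAM2)
#       └── <text prompt>       # caption describing the scene
#
# Then point the inference script at it:
python inference/cogvideox_fun/predict_v2v.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="my-sequence" \
  --config.experiment.save_path="CASPER/OUTPUT/DIR" \
  --config.video_model.model_name="PATH/TO/CogVideoX-Fun-V1.5-5b-InP" \
  --config.video_model.transformer_path="PATH/TO/CASPER/TRANSFORMER.safetensors"
```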
- Please prepare the training data and the merged json file for the whole dataset by following the instructions in `./datasets`.
- Replace the absolute dataset path for `DATASET_META_NAME` before running the training scripts:
  - CogVideoX Casper: `./scripts/cogvideox_fun/train_casper.sh`
  - Wan2.1-1.3B: `./scripts/wan2.1_fun/train_casper.sh`
  - Wan2.1-14B: `./scripts/wan2.1_fun/train_casper_lora.sh`
- We fine-tuned the public models on 4 H100 GPUs.
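
A minimal launch sketch: the `DATASET_META_NAME` variable name comes from the training scripts above, and the 4-GPU setting mirrors our fine-tuning setup; adjust the device list for your machine.

```bash
# Sketch: after editing DATASET_META_NAME inside the script to the absolute path of your
# merged dataset json (created by following ./datasets), launch training.
# We fine-tuned on 4 H100 GPUs; the device list below is an example.
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/cogvideox_fun/train_casper.sh
```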
```bash
python inference/reconstruct_omnimatte.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="boys-beach,animator-draw" \
  --config.omnimatte.source_video_dir="CASPER/OUTPUT/DIR" \
  --config.experiment.save_path="OMNIMATTE/OUTPUT/DIR" \
  --config.data.sample_size="384x672"
```
- Takes ~8 mins on an A6000 or A5000 GPU with 48GB RAM
- The argument `sample_size` should be the same resolution that you used to run the Casper model (`384x672` for CogVideoX and `480x832` for Wan by default).
- The results may be suboptimal because the optimization operates on output videos that contain artifacts introduced by the VAE of the latent-based DiT models.
- The current implementation processes the objects in a video sequentially rather than in parallel. Speed could be improved by running the individual omnimatte optimizations in parallel on multiple GPUs (see the sketch below).
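
A hedged sketch of one way to use multiple GPUs today: launch one process per sequence, each pinned to its own device. This parallelizes across sequences rather than across objects within a sequence (per-object parallelism would require changes to the script), and it only reuses the flags shown above.

```bash
# Sketch: run each example sequence's omnimatte optimization on its own GPU.
CUDA_VISIBLE_DEVICES=0 python inference/reconstruct_omnimatte.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="boys-beach" \
  --config.omnimatte.source_video_dir="CASPER/OUTPUT/DIR" \
  --config.experiment.save_path="OMNIMATTE/OUTPUT/DIR" \
  --config.data.sample_size="384x672" &

CUDA_VISIBLE_DEVICES=1 python inference/reconstruct_omnimatte.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="animator-draw" \
  --config.omnimatte.source_video_dir="CASPER/OUTPUT/DIR" \
  --config.experiment.save_path="OMNIMATTE/OUTPUT/DIR" \
  --config.data.sample_size="384x672" &

wait  # block until both runs finish
```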
```bash
GRADIO_TEMP_DIR=".tmp" python app.py \
  --transformer_path PATH/TO/COGVIDEOX/CASPER/diffusion_pytorch_model.safetensors
```
- Tested on an A6000 GPU with 48GB RAM
- The object-effect-removal step takes ~1 min per layer (using 4 sampling steps for a faster demo)
- The omnimatte optimization step takes ~8 min per layer
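
If the demo runs on a remote machine, a sketch using Gradio's standard environment variables (these are generic Gradio settings, not flags of this repo, and take effect only if `app.py` does not set the server address explicitly):

```bash
# Sketch: expose the demo on all interfaces and a fixed port when running remotely.
# GRADIO_SERVER_NAME and GRADIO_SERVER_PORT are standard Gradio environment variables.
GRADIO_TEMP_DIR=".tmp" GRADIO_SERVER_NAME="0.0.0.0" GRADIO_SERVER_PORT=7860 python app.py \
  --transformer_path PATH/TO/COGVIDEOX/CASPER/diffusion_pytorch_model.safetensors
```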
We thank the authors of VideoX-Fun, SAM2, and Omnimatte for sharing their code and models, and acknowledge the GPU resources provided by the UMD HPC cluster for training our public video models.
We also appreciate the results from Omnimatte, Omnimatte3D, and OmnimatteRF, as well as the videos on Pexels [1,2,3,4,5,6,7,8,9,10,11,12,13,14], which were used for fine-tuning Casper.
```bibtex
@inproceedings{generative-omnimatte,
  author    = {Lee, Yao-Chih and Lu, Erika and Rumbley, Sarah and Geyer, Michal and Huang, Jia-Bin and Dekel, Tali and Cole, Forrester},
  title     = {Generative Omnimatte: Learning to Decompose Video into Layers},
  booktitle = {CVPR},
  year      = {2025},
}
```
This project is licensed under the Apache-2.0 license.
The CogVideoX-5B transformer is released under the CogVideoX license.