
Generative Omnimatte: Learning to Decompose Video into Layers (CVPR 2025 Highlight)

Yao-Chih Lee¹,², Erika Lu¹, Sarah Rumbley¹, Michal Geyer¹,³, Jia-Bin Huang², Tali Dekel¹,³, Forrester Cole¹

¹Google DeepMind, ²University of Maryland, ³Weizmann Institute of Science


(Demo video: generative-omnimatte-hd.mp4)

❗ This is a public reimplementation of Generative Omnimatte

We applied the same fine-tuning strategy used for the original Casper model (video object-effect removal) to public video diffusion models, CogVideoX and Wan2.1, with minimal modifications. However, the performance of these fine-tuned public models is close to, but does not match, that of the Lumiere-based Casper. We hope continued development will lead to future performance improvements.

This public reimplementation builds on code and models from aigc-apps/VideoX-Fun. We thank the authors for sharing the code and the pretrained inpainting models for CogVideoX and Wan2.1.

Table of Contents

  • Environment
  • Casper (Video Object Effect Removal)
  • Omnimatte Optimization
  • Gradio Demo
  • Acknowledgments
  • Citation
  • License

Environment

  • Tested on python 3.10, CUDA 12.4, torch 2.5.1, diffusers 0.32.2
  • Please check requirements.txt for the dependencies
  • Install SAM2 by following the official SAM2 installation instructions.
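
If you want to sanity-check the environment before running inference, a short script along these lines can help; the version numbers below are the tested ones listed above, not hard requirements.

    # check_env.py -- quick sanity check against the tested environment
    # (Python 3.10, CUDA 12.4, torch 2.5.1, diffusers 0.32.2).
    import sys

    import diffusers
    import torch

    print(f"python    : {sys.version.split()[0]}  (tested: 3.10)")
    print(f"torch     : {torch.__version__}  (tested: 2.5.1)")
    print(f"diffusers : {diffusers.__version__}  (tested: 0.32.2)")
    print(f"CUDA build: {torch.version.cuda}, available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU       : {props.name}, {props.total_memory / 1024**3:.0f} GB VRAM")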

Casper (Video Object Effect Removal)

Model Weights

We provide several variants based on different model backbones. In addition to downloading our Casper model weights, please also download the corresponding pretrained inpainting model listed below from aigc-apps/VideoX-Fun.

  • CogVideoX-Fun-V1.5-5b-InP (recommended): Casper weights on google drive. This model may perform better and faster than the Wan-based Casper models, but still does not match the Lumiere-based Casper. It was fully fine-tuned from our inpainting model, which was itself initially fine-tuned from VideoX-Fun's released model. During inference, it processes a temporal window of 85 frames and can handle 197 frames using temporal multidiffusion. The default inference resolution is 384x672 (HxW).
  • Wan2.1-Fun-1.3B-InP (V1.0): Casper weights on google drive. The model was fully fine-tuned from VideoX-Fun's released inpainting model. During inference, it processes a temporal window of 81 frames and can handle 197 frames using temporal multidiffusion. The default inference resolution is 480x832 (HxW).
  • Wan2.1-Fun-14B-InP (V1.0): Casper weights on google drive. Due to the large model size, we applied LoRA-based fine-tuning on top of VideoX-Fun's released inpainting model. During inference, it processes a temporal window of 81 frames and can handle 197 frames using temporal multidiffusion. The default inference resolution is 480x832 (HxW).
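
For context, "temporal multidiffusion" denoises overlapping temporal windows and averages the predictions on the overlapping frames, which is how an 85- or 81-frame window can cover a 197-frame clip. The sketch below only illustrates this windowing-and-averaging idea; the actual stride and blending weights are defined in the inference scripts and may differ.

    # Illustrative sketch of temporal multidiffusion windowing; not the repo's
    # exact implementation (stride and blending weights may differ).
    import numpy as np

    def temporal_windows(num_frames: int, window: int, stride: int):
        """Overlapping (start, end) windows that cover the whole clip."""
        starts = list(range(0, max(num_frames - window, 0) + 1, stride))
        if starts[-1] + window < num_frames:  # ensure the tail is covered
            starts.append(num_frames - window)
        return [(s, s + window) for s in starts]

    def multidiffusion_merge(frames, window, stride, denoise_fn):
        """frames: (T, H, W, C). Average per-window predictions on overlaps."""
        out = np.zeros(frames.shape, dtype=np.float32)
        weight = np.zeros((len(frames), 1, 1, 1), dtype=np.float32)
        for s, e in temporal_windows(len(frames), window, stride):
            out[s:e] += denoise_fn(frames[s:e])  # one model pass per window
            weight[s:e] += 1.0
        return out / weight

    # A 197-frame clip covered by 85-frame windows with a 56-frame stride:
    print(temporal_windows(197, 85, 56))  # [(0, 85), (56, 141), (112, 197)]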

Run Casper

  • CogVideoX-based (recommended)

    • Takes 1-2 mins on an A100 GPU, and 4-5 mins on an A6000 GPU with 48GB RAM
    python inference/cogvideox_fun/predict_v2v.py \
        --config.data.data_rootdir="examples" \
        --config.experiment.run_seqs="boys-beach,animator-draw" \
        --config.experiment.save_path="CASPER/OUTPUT/DIR" \
        --config.video_model.model_name="PATH/TO/CogVideoX-Fun-V1.5-5b-InP" \
        --config.video_model.transformer_path="PATH/TO/CASPER/TRANSFORMER.safetensors"
    
  • Wan2.1-1.3B-based

    • Takes ~10 mins on an A6000 GPU with 48GB RAM
    python inference/wan2.1_fun/predict_v2v.py \
        --config.data.data_rootdir="examples" \
        --config.experiment.run_seqs="boys-beach,animator-draw" \
        --config.experiment.save_path="CASPER/OUTPUT/DIR" \
        --config.video_model.model_name="PATH/TO/Wan2.1-Fun-1.3B-InP" \
        --config.video_model.transformer_path="PATH/TO/CASPER/TRANSFORMER.safetensors"
    
  • Wan2.1-14B-based (LoRA)

    • Takes ~55 mins on an A6000 GPU with 64GB RAM
    python inference/wan2.1_fun/predict_v2v.py \
        --config.data.data_rootdir="examples" \
        --config.experiment.run_seqs="boys-beach,animator-draw" \
        --config.experiment.save_path="CASPER/OUTPUT/DIR" \
        --config.video_model.model_name="PATH/TO/Wan2.1-Fun-14B-InP" \
        --config.video_model.lora_path="PATH/TO/CASPER/LORA.safetensors" \
        --config.video_model.lora_weight=1.0 \
        --config.system.gpu_memory_mode="sequential_cpu_offload"
    
  • To run your own sequences, please follow the format of examples/boys-beach to provide your own input video, video masks, and text prompt in a folder (a layout-comparison sketch follows this list).

  • Modify --config.data.data_rootdir and --config.experiment.run_seqs if needed.
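
The exact file naming inside a sequence folder is defined by the shipped example rather than documented here, so the simplest check is to mirror examples/boys-beach. The small sketch below compares your folder's structure and file types against the example without hard-coding any expected names:

    # compare_layout.py -- compare a new sequence folder against the reference
    # example so the subfolder layout and file types match. Nothing about the
    # expected naming is hard-coded; the repo's example folder is the reference.
    import sys
    from pathlib import Path

    def summarize(root: Path):
        dirs = sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir())
        exts = sorted({p.suffix for p in root.rglob("*") if p.is_file()})
        return dirs, exts

    reference = Path("examples/boys-beach")
    candidate = Path(sys.argv[1])  # e.g. python compare_layout.py examples/my-sequence

    for name, folder in (("reference", reference), ("yours", candidate)):
        dirs, exts = summarize(folder)
        print(f"{name}: subfolders={dirs} file types={exts}")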

Casper training

  • Please prepare the training data and the merged json file for the whole dataset by following the instructions in ./datasets (a minimal merge sketch follows this list)
  • Replace the absolute dataset path for DATASET_META_NAME before running the training scripts
    • CogVideoX Casper: ./scripts/cogvideox_fun/train_casper.sh
    • Wan2.1-1.3B: ./scripts/wan2.1_fun/train_casper.sh
    • Wan2.1-14B: ./scripts/wan2.1_fun/train_casper_lora.sh
  • We fine-tuned the public models on 4 H100 GPUs
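
The training scripts read a single metadata json for the whole dataset via DATASET_META_NAME. Assuming each per-sequence metadata file produced by the steps in ./datasets is a JSON list of sample entries (check ./datasets for the actual schema; the file names below are hypothetical), merging can be as simple as:

    # merge_metadata.py -- concatenate per-sequence metadata into the single
    # merged json pointed to by DATASET_META_NAME. Assumes each input file is a
    # JSON list of sample entries; the "metadata.json" name is hypothetical.
    import json
    from pathlib import Path

    dataset_root = Path("datasets")  # adjust to your dataset location
    merged = []
    for meta_file in sorted(dataset_root.glob("*/metadata.json")):
        with open(meta_file) as f:
            entries = json.load(f)
        merged.extend(entries if isinstance(entries, list) else [entries])

    out_path = dataset_root / "merged_metadata.json"  # set DATASET_META_NAME to this absolute path
    with open(out_path, "w") as f:
        json.dump(merged, f, indent=2)
    print(f"wrote {len(merged)} entries to {out_path}")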

Omnimatte Optimization

python inference/reconstruct_omnimatte.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="boys-beach,animator-draw" \
  --config.omnimatte.source_video_dir="CASPER/OUTPUT/DIR" \
  --config.experiment.save_path="OMNIMATTE/OUTPUT/DIR" \
  --config.data.sample_size="384x672"
  • Takes ~8 mins on an A6000 or A5000 GPU with 48GB RAM
  • The argument sample_size should match the resolution used when running the Casper model (384x672 for CogVideoX and 480x832 for Wan by default).
  • The results may be suboptimal because the optimization operates on output videos that contain artifacts introduced by the VAE of the latent-based DiT models.
  • The current implementation processes the objects of a video sequentially rather than in parallel. Speed could be improved by running the individual omnimatte optimizations in parallel when multiple GPUs are available (see the sketch below).
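
Until per-object parallelism is implemented, different sequences can at least be optimized on different GPUs with the existing flags. The sketch below pins one reconstruct_omnimatte.py process per GPU via CUDA_VISIBLE_DEVICES; parallelizing the per-object optimization inside a single sequence would require changes to the script itself.

    # parallel_sequences.py -- run the omnimatte optimization for different
    # sequences on different GPUs. This only parallelizes across sequences with
    # the existing --config.experiment.run_seqs flag.
    import os
    import subprocess

    sequences = ["boys-beach", "animator-draw"]  # one sequence per GPU
    procs = []
    for gpu_id, seq in enumerate(sequences):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
        cmd = [
            "python", "inference/reconstruct_omnimatte.py",
            "--config.data.data_rootdir=examples",
            f"--config.experiment.run_seqs={seq}",
            "--config.omnimatte.source_video_dir=CASPER/OUTPUT/DIR",
            "--config.experiment.save_path=OMNIMATTE/OUTPUT/DIR",
            "--config.data.sample_size=384x672",  # match the Casper resolution
        ]
        procs.append(subprocess.Popen(cmd, env=env))

    for p in procs:
        p.wait()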

Gradio Demo

GRADIO_TEMP_DIR=".tmp" python app.py \
  --transformer_path PATH/TO/COGVIDEOX/CASPER/diffusion_pytorch_model.safetensors
  • Tested on an A6000 GPU with 48GB RAM
  • The object-effect-removal step takes ~1 min per layer (4 sampling steps are used for a faster demo)
  • The omnimatte optimization step takes ~8 min per layer

Acknowledgments

We thank the authors of VideoX-Fun, SAM2, and Omnimatte for their shared code and models, and acknowledge the GPU resources provided by the UMD HPC cluster for training our public video models.

We also appreciate the results from Omnimatte, Omnimatte3D, and OmnimatteRF, as well as the videos on Pexels [1,2,3,4,5,6,7,8,9,10,11,12,13,14], which were used for fine-tuning Casper.

Citation

@inproceedings{generative-omnimatte,
  author    = {Lee, Yao-Chih and Lu, Erika and Rumbley, Sarah and Geyer, Michal and Huang, Jia-Bin and Dekel, Tali and Cole, Forrester},
  title     = {Generative Omnimatte: Learning to Decompose Video into Layers},
  booktitle = {CVPR},
  year      = {2025},
}

License

This project is licensed under the Apache-2.0 license.

The CogVideoX-5B transformer is released under the CogVideoX license.
