Yao-Chih Lee1,2, Erika Lu1, Sarah Rumbley1, Michal Geyer1,3, Jia-Bin Huang2, Tali Dekel1,3, Forrester Cole1
1Google DeepMind, 2University of Maryland, 3Weizmann Institute of Science
We applied the same fine-tuning strategy used for the original Casper model (video object-effect removal) to public video diffusion models, CogVideoX and Wan2.1, with minimal modifications. However, the fine-tuned public models perform close to, but not on par with, the Lumiere-based Casper. We hope continued development will lead to further performance improvements.
This public reimplementation builds on code and models from aigc-apps/VideoX-Fun. We thank the authors for sharing the code and pretrained inpainting models for CogVideoX and Wan2.1.
- Environment
- Casper (video object effect removal)
- Omnimatte optimization
- Gradio Demo
- Acknowledgments
- Citation
- License
- Tested on python 3.10, CUDA 12.4, torch 2.5.1, diffusers 0.32.2
- Please check requirements.txt for the dependencies
- Install SAM2 by following the instructions.
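
A minimal setup sketch, assuming a fresh conda environment and that SAM2 is installed from the official facebookresearch/sam2 repository; defer to the linked instructions if they differ:

```bash
# Sketch: create an environment matching the tested versions above.
conda create -n omnimatte python=3.10 -y
conda activate omnimatte

# Install the repo dependencies listed in requirements.txt.
pip install -r requirements.txt

# Install SAM2 from source (assumes the official facebookresearch/sam2 repository).
git clone https://github.com/facebookresearch/sam2.git
cd sam2 && pip install -e . && cd ..
```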
We provide several variants based on different model backbones. In addition to downloading our Casper model weights, please also download the pretrained inpainting model from aigc-apps/VideoX-Fun.
| Pretrained inpainting model from VideoX-Fun | Casper model | Description |
|---|---|---|
| CogVideoX-Fun-V1.5-5b-InP | google drive | (Recommended) This model may perform better and faster than the Wan-based Casper models, but still not as well as the Lumiere-based Casper. It was fully fine-tuned from our inpainting model, which was initially fine-tuned from VideoX-Fun's released model. During inference, it processes a temporal window of 85 frames and can handle 197 frames using temporal multidiffusion. The default inference resolution is 384x672 (HxW). |
| Wan2.1-Fun-1.3B-InP (V1.0) | google drive | The model was fully fine-tuned from VideoX-Fun's released inpainting model. During inference, it processes a temporal window of 81 frames and can handle 197 frames using temporal multidiffusion. The default inference resolution is 480x832 (HxW). |
| Wan2.1-Fun-14B-InP (V1.0) | google drive | Due to the large model size, we applied LoRA-based fine-tuning on top of VideoX-Fun's released inpainting model. During inference, it processes a temporal window of 81 frames and can handle 197 frames using temporal multidiffusion. The default inference resolution is 480x832 (HxW). |
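
As a rough sketch of fetching the backbone weights, assuming the VideoX-Fun inpainting checkpoints are hosted under the alibaba-pai organization on Hugging Face (check the VideoX-Fun README for the authoritative locations) and that the Casper weights come from the Google Drive links above:

```bash
# Sketch: download the pretrained inpainting backbone (the repo ID is an assumption;
# verify against the VideoX-Fun documentation).
huggingface-cli download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
  --local-dir models/CogVideoX-Fun-V1.5-5b-InP

# Place the Casper transformer weights from the Google Drive link somewhere convenient, e.g.:
#   models/casper_cogvideox/diffusion_pytorch_model.safetensors
# These two paths are passed to --config.video_model.model_name and
# --config.video_model.transformer_path in the commands below.
```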
- CogVideoX-based (recommended)
  - Takes 1-2 mins on an A100 GPU, and 4-5 mins on an A6000 GPU with 48GB RAM

```bash
python inference/cogvideox_fun/predict_v2v.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="boys-beach,animator-draw" \
  --config.experiment.save_path="CASPER/OUTPUT/DIR" \
  --config.video_model.model_name="PATH/TO/CogVideoX-Fun-V1.5-5b-InP" \
  --config.video_model.transformer_path="PATH/TO/CASPER/TRANSFORMER.safetensors"
```
- Wan2.1-1.3B-based
  - Takes ~10 mins on an A6000 GPU with 48GB RAM

```bash
python inference/wan2.1_fun/predict_v2v.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="boys-beach,animator-draw" \
  --config.experiment.save_path="CASPER/OUTPUT/DIR" \
  --config.video_model.model_name="PATH/TO/Wan2.1-Fun-1.3B-InP" \
  --config.video_model.transformer_path="PATH/TO/CASPER/TRANSFORMER.safetensors"
```
- Wan2.1-14B-based (LoRA)
  - Takes ~55 mins on an A6000 GPU with 64GB RAM

```bash
python inference/wan2.1_fun/predict_v2v.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="boys-beach,animator-draw" \
  --config.experiment.save_path="CASPER/OUTPUT/DIR" \
  --config.video_model.model_name="PATH/TO/Wan2.1-Fun-14B-InP" \
  --config.video_model.lora_path="PATH/TO/CASPER/LORA.safetensors" \
  --config.video_model.lora_weight=1.0 \
  --config.system.gpu_memory_mode="sequential_cpu_offload"
```
- To run your own sequences, please follow the format in `examples/boys-beach` to provide your own input video, video masks, and text prompt in a folder (see the layout sketch below).
- Modify `--config.data.data_rootdir` and `--config.experiment.run_seqs` if needed.
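
The layout below is only an illustration; the file and folder names are hypothetical, so mirror `examples/boys-beach` for the exact structure expected by the data loader:

```bash
# Hypothetical layout for a custom sequence -- mirror examples/boys-beach for the
# exact file and folder names expected by the data loader:
#
#   examples/
#   └── my-sequence/            # sequence name passed to --config.experiment.run_seqs
#       ├── <input video>       # RGB frames or a video file
#       ├── <object masks>      # one binary mask track per object (e.g., from SAM2)
#       └── <text prompt>       # caption describing the scene
#
# Then point the inference script at it:
python inference/cogvideox_fun/predict_v2v.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="my-sequence" \
  --config.experiment.save_path="CASPER/OUTPUT/DIR" \
  --config.video_model.model_name="PATH/TO/CogVideoX-Fun-V1.5-5b-InP" \
  --config.video_model.transformer_path="PATH/TO/CASPER/TRANSFORMER.safetensors"
```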
- Please prepare the training data and the merged json file for the whole dataset by following the instructions in `./datasets`.
- Replace the absolute dataset path for `DATASET_META_NAME` before running the training scripts:
  - CogVideoX Casper: `./scripts/cogvideox_fun/train_casper.sh`
  - Wan2.1-1.3B: `./scripts/wan2.1_fun/train_casper.sh`
  - Wan2.1-14B: `./scripts/wan2.1_fun/train_casper_lora.sh`
- We fine-tuned the public models on 4 H100 GPUs.
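
A minimal launch sketch: the `DATASET_META_NAME` variable name comes from the training scripts above, and the 4-GPU setting mirrors our fine-tuning setup; adjust the device list for your machine.

```bash
# Sketch: after editing DATASET_META_NAME inside the script to the absolute path of your
# merged dataset json (created by following ./datasets), launch training.
# We fine-tuned on 4 H100 GPUs; the device list below is an example.
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/cogvideox_fun/train_casper.sh
```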
```bash
python inference/reconstruct_omnimatte.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="boys-beach,animator-draw" \
  --config.omnimatte.source_video_dir="CASPER/OUTPUT/DIR" \
  --config.experiment.save_path="OMNIMATTE/OUTPUT/DIR" \
  --config.data.sample_size="384x672"
```
- Takes ~8 mins on an A6000 or A5000 GPU with 48GB RAM
- The argument `sample_size` should be the same resolution that you used to run the Casper model (`384x672` for CogVideoX and `480x832` for Wan by default).
- The results may be suboptimal because the optimization operates on output videos that contain artifacts introduced by the VAE of the latent-based DiT models.
- The current implementation processes the objects in a video sequentially rather than in parallel. Speed could be improved by running the individual omnimatte optimizations in parallel on multiple GPUs (see the sketch below).
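
A hedged sketch of one way to use multiple GPUs today: launch one process per sequence, each pinned to its own device. This parallelizes across sequences rather than across objects within a sequence (per-object parallelism would require changes to the script), and it only reuses the flags shown above.

```bash
# Sketch: run each example sequence's omnimatte optimization on its own GPU.
CUDA_VISIBLE_DEVICES=0 python inference/reconstruct_omnimatte.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="boys-beach" \
  --config.omnimatte.source_video_dir="CASPER/OUTPUT/DIR" \
  --config.experiment.save_path="OMNIMATTE/OUTPUT/DIR" \
  --config.data.sample_size="384x672" &

CUDA_VISIBLE_DEVICES=1 python inference/reconstruct_omnimatte.py \
  --config.data.data_rootdir="examples" \
  --config.experiment.run_seqs="animator-draw" \
  --config.omnimatte.source_video_dir="CASPER/OUTPUT/DIR" \
  --config.experiment.save_path="OMNIMATTE/OUTPUT/DIR" \
  --config.data.sample_size="384x672" &

wait  # block until both runs finish
```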
```bash
GRADIO_TEMP_DIR=".tmp" python app.py \
  --transformer_path PATH/TO/COGVIDEOX/CASPER/diffusion_pytorch_model.safetensors
```
- Tested on an A6000 GPU with 48GB RAM
- The object-effect-removal step takes ~1 min per layer (using 4 sampling steps for a faster demo)
- The omnimatte optimization step takes ~8 min per layer
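
If the demo runs on a remote machine, a sketch using Gradio's standard environment variables (these are generic Gradio settings, not flags of this repo, and take effect only if `app.py` does not set the server address explicitly):

```bash
# Sketch: expose the demo on all interfaces and a fixed port when running remotely.
# GRADIO_SERVER_NAME and GRADIO_SERVER_PORT are standard Gradio environment variables.
GRADIO_TEMP_DIR=".tmp" GRADIO_SERVER_NAME="0.0.0.0" GRADIO_SERVER_PORT=7860 python app.py \
  --transformer_path PATH/TO/COGVIDEOX/CASPER/diffusion_pytorch_model.safetensors
```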
We thank the authors of VideoX-Fun, SAM2, and Omnimatte for sharing their code and models, and acknowledge the GPU resources provided by the UMD HPC cluster for training our public video models.
We also appreciate the results from Omnimatte, Omnimatte3D, and OmnimatteRF, as well as the videos on Pexels [1,2,3,4,5,6,7,8,9,10,11,12,13,14], which were used for fine-tuning Casper.
```bibtex
@inproceedings{generative-omnimatte,
  author    = {Lee, Yao-Chih and Lu, Erika and Rumbley, Sarah and Geyer, Michal and Huang, Jia-Bin and Dekel, Tali and Cole, Forrester},
  title     = {Generative Omnimatte: Learning to Decompose Video into Layers},
  booktitle = {CVPR},
  year      = {2025},
}
```
This project is licensed under the Apache-2.0 license.
The CogVideoX-5B transformer is released under the CogVideoX license.