
Motion Inversion for Video Customization

Customize the motion in your videos with less than 0.5 million parameters and under 10 minutes of training time.



Luozhou Wang, Guibao Shen, Yixun Liang, Xin Tao, Pengfei Wan, Di Zhang, Yijun Li, Yingcong Chen

HKUST(GZ), HKUST, Kuaishou Technology, Adobe Research.

We present a novel approach to motion customization in video generation, addressing the lack of a thorough exploration of motion representation within video generative models. Recognizing the unique challenges posed by video's spatiotemporal nature, our method introduces Motion Embeddings, a set of explicit, temporally coherent one-dimensional embeddings derived from a given video. These embeddings are designed to integrate seamlessly with the temporal transformer modules of video diffusion models, modulating self-attention computations across frames without compromising spatial integrity. Furthermore, we identify the Temporal Discrepancy in video generative models, which refers to variations in how different motion modules process temporal relationships between frames. We leverage this understanding to optimize the integration of our motion embeddings.
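As a rough conceptual sketch (not the repository's implementation), a motion embedding can be thought of as one learnable vector per frame, broadcast over spatial positions and added to the hidden states entering a temporal self-attention block:

import torch
import torch.nn as nn

# Conceptual sketch only: one learnable 1-D embedding per frame, added to the
# tokens seen by a temporal transformer so that it modulates attention across
# frames without altering per-frame spatial content.
class MotionEmbedding(nn.Module):
    def __init__(self, num_frames: int, dim: int):
        super().__init__()
        self.embed = nn.Parameter(torch.zeros(1, num_frames, dim))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch * spatial positions, num_frames, dim)
        return hidden_states + self.embed[:, : hidden_states.shape[1]]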

📰 News

  • [2024.04.03] We released the configuration files and an inference code sample.
  • [2024.04.01] We will soon release the configuration files, inference code, and motion embedding weights. Please stay tuned for updates!
  • [2024.03.31] We have released the project page, arXiv paper, and training code.

🚧 Todo List

  • Released code for the UNet3D model (ZeroScope, ModelScope).
  • Release detailed guidance for training and inference.
  • Release Gradio demo.
  • Release code for the Sora-like model (Open-Sora, Latte).

Contents

  • Installation
  • Training
  • Inference
  • Acknowledgement
  • Citation

Installation

# install torch
pip install torch torchvision

# install diffusers and transformers
pip install diffusers==0.26.3 transformers

Also, xformers is required in this repository. Please check here for detailed installation guidance.

Training

To start training, first download the ZeroScope weights and specify the path in the config file. Then, run the following commands to begin training:

python train.py --config ./configs/train_config.yaml

We provide a sample config file in config.py. Note that for different motion types and editing requirements, the choice of loss function affects the outcome. When only the camera motion of the source video is desired, without retaining information about the objects in the source, we advise using DebiasedHybridLoss. Similarly, when editing objects that undergo significant deformation, DebiasedTemporalLoss is recommended. For straightforward cross-categorical editing, as described in DMT, the BaseLoss function suffices.
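As a quick reference, the mapping below simply restates this guidance; the scenario labels and the dictionary itself are illustrative only and are not part of the repository's configuration schema:

# Illustrative summary of the loss guidance above. The scenario keys are made up
# for this sketch and are not real config options in this repository.
LOSS_CHOICE = {
    "camera_motion_only": "DebiasedHybridLoss",   # transfer camera motion, discard source objects
    "heavy_deformation": "DebiasedTemporalLoss",  # edited objects deform significantly
    "cross_category": "BaseLoss",                 # simple cross-categorical editing (as in DMT)
}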

Inference

After cloning the repository, you can easily load motion embeddings for video generation as follows:

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
from models.unet.motion_embeddings import load_motion_embeddings

# load video generation model
pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

# memory optimization
pipe.enable_vae_slicing()

# load motion embedding
motion_embed = torch.load('path/to/motion_embed.pt')
load_motion_embeddings(pipe.unet, motion_embed)


video_frames = pipe(
    prompt="A knight in armor rides a Segway",
    num_inference_steps=30,
    guidance_scale=12,
    height=320,
    width=576,
    num_frames=24,
    generator=torch.Generator("cuda").manual_seed(42)
).frames[0]

video_path = export_to_video(video_frames)
print(video_path)

Please note that we recommend using a noise initialization strategy for more stable results. This strategy requires a source video as input; click here for more details. You should then pass init_latents to pipe via the latents argument:

video_frames = pipe(..., latents=init_latents).frames[0]  # pass the same arguments as above, plus latents
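As a rough sketch (not necessarily the repository's linked strategy), one simple way to obtain video-shaped starting latents from a source clip, continuing from the pipeline above, is to VAE-encode its frames and noise them to the sampler's initial timestep; the file path below is a placeholder:

import torch
from torchvision.io import read_video

# Minimal sketch only: build init_latents by encoding source frames with the VAE
# and adding noise at the first (noisiest) sampler timestep.
frames, _, _ = read_video("path/to/source_video.mp4", pts_unit="sec")  # (T, H, W, C), uint8
frames = frames[:24].permute(0, 3, 1, 2).float() / 127.5 - 1.0         # (T, C, H, W) in [-1, 1]
frames = torch.nn.functional.interpolate(frames, size=(320, 576))
frames = frames.to("cuda", torch.float16)

with torch.no_grad():
    latents = pipe.vae.encode(frames).latent_dist.sample()
latents = latents * pipe.vae.config.scaling_factor

# reshape to the (batch, channels, frames, height, width) layout used by the UNet3D
latents = latents.unsqueeze(0).permute(0, 2, 1, 3, 4)

# noise the clean latents up to the initial sampler timestep
pipe.scheduler.set_timesteps(30, device="cuda")
noise = torch.randn_like(latents)
init_latents = pipe.scheduler.add_noise(latents, noise, pipe.scheduler.timesteps[:1])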

Acknowledgement

  • MotionDirector: We follow their loss design and their techniques for reducing computational cost.
  • ZeroScope: The pretrained video checkpoint we used in our main paper.
  • AnimateDiff: The pretrained video checkpoint we used in our main paper.
  • Latte: A video generation model with a similar architecture to Sora.
  • Open-Sora: A video generation model with a similar architecture to Sora.

We are grateful for their exceptional work and generous contribution to the open-source community.

Citation

@misc{wang2024motion,
     title={Motion Inversion for Video Customization}, 
     author={Luozhou Wang and Guibao Shen and Yixun Liang and Xin Tao and Pengfei Wan and Di Zhang and Yijun Li and Yingcong Chen},
     year={2024},
     eprint={2403.20193},
     archivePrefix={arXiv},
     primaryClass={cs.CV}
}