SmoothVideo

This repository is the official implementation of Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning.

Setup

This implementation is based on Tune-A-Video.

Requirements

pip install -r requirements.txt

Installing xformers is highly recommended for better memory efficiency and speed on GPUs. To enable xformers, set enable_xformers_memory_efficient_attention=True (the default).

Weights

[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from any text input. The pre-trained Stable Diffusion models can be downloaded from Hugging Face (e.g., Stable Diffusion v1-5).

Usage

Training

To fine-tune the text-to-image diffusion models for text-to-video generation, run this command for the baseline model:

accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml"

Run this command for the baseline model with the proposed smooth loss:

accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml" --smooth_loss

Run this command for the baseline model with the proposed simple smooth loss:

accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml" --smooth_loss --simple_manner

Note: Tuning a 24-frame video usually takes 300 to 500 steps, about 10 to 15 minutes on one A100 GPU. Reduce n_sample_frames if your GPU memory is limited.
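The --smooth_loss flag used above adds a temporal constraint on the predicted noise; the actual formulation is defined in the paper and in train_tuneavideo.py. As a rough illustration only, the sketch below assumes the constraint asks the frame-to-frame change in predicted noise to match the frame-to-frame change in sampled noise. The function names, the flattened per-frame layout, and the mean-squared reduction are all assumptions made for this example, not the repository's code.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one flattened per-frame noise map (assumed layout)


def frame_diffs(frames: Sequence[Frame]) -> List[List[float]]:
    """Element-wise difference between each frame and the one before it."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def smooth_loss_sketch(pred_noise: Sequence[Frame],
                       target_noise: Sequence[Frame]) -> float:
    """Hypothetical smoothness term, NOT the repository's implementation.

    Assumption: the temporal change of the predicted noise should match
    the temporal change of the sampled (target) noise, measured as a
    mean squared error over adjacent-frame differences.
    """
    if len(pred_noise) != len(target_noise):
        raise ValueError("pred_noise and target_noise need the same frame count")
    if len(pred_noise) < 2:
        return 0.0  # a single frame has no temporal neighbour

    pred_d = frame_diffs(pred_noise)
    target_d = frame_diffs(target_noise)
    squared = [
        (p - t) ** 2
        for p_row, t_row in zip(pred_d, target_d)
        for p, t in zip(p_row, t_row)
    ]
    if not squared:
        return 0.0  # frames with no elements contribute nothing
    return sum(squared) / len(squared)
```

Under this assumption, a prediction that differs from the target by the same offset in every frame incurs no penalty, since only differences between neighbouring frames are compared. In training, such a term would be added to the usual noise-prediction loss rather than replace it.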

Inference

Once the training is done, run inference:

from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

pretrained_model_path = "./checkpoints/stable-diffusion-v1-5"
my_model_path = "./outputs/man-skiing"
unet = UNet3DConditionModel.from_pretrained(my_model_path, subfolder='unet', torch_dtype=torch.float16).to('cuda')
pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_vae_slicing()

prompt = "spider man is skiing"
ddim_inv_latent = torch.load(f"{my_model_path}/inv_latents/ddim_latent-500.pt").to(torch.float16)
video = pipe(prompt, latents=ddim_inv_latent, video_length=24, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos

save_videos_grid(video, f"./{prompt}.gif")
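The inference snippet above calls pipe.enable_xformers_memory_efficient_attention() unconditionally, which assumes xformers is installed. A minimal sketch of a guard for environments without it follows; the helper names xformers_available and maybe_enable_xformers are hypothetical and not part of this repository.

```python
import importlib.util


def xformers_available() -> bool:
    """Return True if the optional xformers package can be imported."""
    return importlib.util.find_spec("xformers") is not None


def maybe_enable_xformers(pipe) -> bool:
    """Enable memory-efficient attention only when xformers is installed.

    `pipe` is any object exposing enable_xformers_memory_efficient_attention(),
    such as the pipeline built in the inference snippet. Returns True if the
    call was made, False if it was skipped.
    """
    if not xformers_available():
        return False
    pipe.enable_xformers_memory_efficient_attention()
    return True
```

With this in place, the unconditional call could be replaced by maybe_enable_xformers(pipe), and the rest of the snippet stays unchanged.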

We provide comparisons with different baselines, as follows:

Results

Tune-A-Video

Comparisons to Tune-A-Video.

Input video | Tune-A-Video
Input video | Tune-A-Video + smooth loss
A jeep car is moving on the road | A jeep car is moving on the beach | A jeep car is moving on the snow | A jeep car is moving on the road, cartoon style | A sports car is moving on the road
Input video | Tune-A-Video
Input video | Tune-A-Video + smooth loss
A rabbit is eating a watermelon | A tiger is eating a watermelon | A rabbit is eating an orange | A rabbit is eating a pizza | A puppy is eating an orange
Input video | Tune-A-Video
Input video | Tune-A-Video + smooth loss
A man is skiing | Mickey mouse is skiing on the snow | Spider man is skiing on the beach, cartoon style | Wonder woman, wearing a cowboy hat, is skiing | A man, wearing pink clothes, is skiing at sunset

Make-A-Protagonist

Comparisons to Make-A-Protagonist.

Input video | Make-A-Protagonist | Make-A-Protagonist + smooth loss
A jeep driving down a mountain road | A jeep driving down a mountain road in the rain
A man is playing basketball | A man is playing a basketball on the beach, anime style
A man walking down the street at night | A panda walking down the snowy street
A man walking down the street | Elon Musk walking down the street

ControlVideo

Comparisons to ControlVideo.

Input video | Condition | ControlVideo | ControlVideo + smooth loss
A person is dancing | Pose condition | Michael Jackson is dancing
A person is dancing | Pose condition | A person is dancing, Makoto Shinkai style
A building | Canny edge condition | A wooden building, at night
A girl | HED edge condition | A girl, Krenz Cushart style
A girl | HED edge condition | A girl with rich makeup
Ink diffuses in water | Depth condition | Gentle green ink diffuses in water, beautiful light

Video2Video-zero

Comparisons to training-free methods.

Input video | Instruct Video2Video-zero | Instruct Video2Video-zero + noise constraint | Video InstructPix2Pix | Video InstructPix2Pix + noise constraint
Instruct: Make it animation