SPengLiang/SmoothVideo

SmoothVideo

This repository is the official implementation of Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning.

Setup

This implementation is based on Tune-A-Video.

Requirements

pip install -r requirements.txt

Installing xformers is highly recommended for more efficiency and speed on GPUs. To enable xformers, set enable_xformers_memory_efficient_attention=True (default).
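Before training, it can help to confirm that xformers is actually importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `xformers_available` is hypothetical and not part of this repository.

```python
import importlib.util


def xformers_available() -> bool:
    """Return True if the xformers package can be found in this environment."""
    return importlib.util.find_spec("xformers") is not None


if __name__ == "__main__":
    if xformers_available():
        print("xformers found: memory-efficient attention can stay enabled.")
    else:
        # Assumption: the option named in the README can be switched off
        # when xformers is missing.
        print("xformers not found: consider setting "
              "enable_xformers_memory_efficient_attention=False.")
```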

Weights

[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from Hugging Face (e.g., Stable Diffusion v1-5).
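After downloading, a quick sanity check on the checkpoint folder can catch an incomplete download early. This sketch assumes the weights are stored in the standard diffusers folder layout (a `model_index.json` plus `scheduler`, `text_encoder`, `tokenizer`, `unet`, and `vae` subfolders); the helper name `missing_checkpoint_entries` is hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import List

# Assumption: standard diffusers layout for a Stable Diffusion checkpoint.
REQUIRED_ENTRIES = (
    "model_index.json",
    "scheduler",
    "text_encoder",
    "tokenizer",
    "unet",
    "vae",
)


def missing_checkpoint_entries(checkpoint_dir: str) -> List[str]:
    """List the expected files/folders that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    # Path taken from the inference example in this README.
    missing = missing_checkpoint_entries("./checkpoints/stable-diffusion-v1-5")
    print("checkpoint looks complete" if not missing else f"missing: {missing}")
```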

Usage

Training

To fine-tune the text-to-image diffusion models for text-to-video generation, run this command for the baseline model:

accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml"

Run this command for the baseline model with the proposed smooth loss:

accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml" --smooth_loss

Run this command for the baseline model with the proposed simple smooth loss:

accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml" --smooth_loss --simple_manner
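The three training variants differ only in their flags, so they can be launched from a small script. This is a sketch, not part of the repository: the helper name `build_train_command` is hypothetical, and it assumes `--simple_manner` is only meaningful together with `--smooth_loss`, because the commands above only show the two flags combined.

```python
import shlex
from typing import List


def build_train_command(config: str, smooth_loss: bool = False,
                        simple_manner: bool = False) -> List[str]:
    """Assemble the accelerate command line for one training variant."""
    # Assumption: --simple_manner requires --smooth_loss (see lead-in).
    if simple_manner and not smooth_loss:
        raise ValueError("--simple_manner is only shown together with --smooth_loss")
    command = ["accelerate", "launch", "train_tuneavideo.py", f"--config={config}"]
    if smooth_loss:
        command.append("--smooth_loss")
    if simple_manner:
        command.append("--simple_manner")
    return command


if __name__ == "__main__":
    variants = (
        {},                                            # baseline
        {"smooth_loss": True},                         # smooth loss
        {"smooth_loss": True, "simple_manner": True},  # simple smooth loss
    )
    for flags in variants:
        # Print only; pass the list to subprocess.run(...) to actually launch.
        print(shlex.join(build_train_command("configs/man-skiing.yaml", **flags)))
```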

Note: Tuning a 24-frame video usually takes 300~500 steps, about 10~15 minutes using one A100 GPU. Reduce n_sample_frames if your GPU memory is limited.
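Lowering `n_sample_frames` can be done by editing the YAML config directly, or programmatically as sketched below. The key names are assumptions based on the Tune-A-Video config layout (`train_data.n_sample_frames` and `validation_data.video_length`); check them against your own config file. The helper name `with_fewer_frames` is hypothetical.

```python
import copy


def with_fewer_frames(config: dict, n_frames: int) -> dict:
    """Return a copy of a loaded config that trains on n_frames frames."""
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    # Assumed key: train_data.n_sample_frames (Tune-A-Video layout).
    updated.setdefault("train_data", {})["n_sample_frames"] = n_frames
    # Assumed key: validation_data.video_length, kept in sync when present.
    if "video_length" in updated.get("validation_data", {}):
        updated["validation_data"]["video_length"] = n_frames
    return updated


if __name__ == "__main__":
    example = {
        "train_data": {"n_sample_frames": 24},
        "validation_data": {"video_length": 24},
    }
    print(with_fewer_frames(example, 8))
```

If the frame count is reduced for training, the `video_length` passed at inference would presumably need to match it.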

Inference

Once the training is done, run inference:

from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

# Base Stable Diffusion weights and the output folder of the tuning run.
pretrained_model_path = "./checkpoints/stable-diffusion-v1-5"
my_model_path = "./outputs/man-skiing"

# Load the tuned 3D UNet and plug it into the pipeline in half precision.
unet = UNet3DConditionModel.from_pretrained(my_model_path, subfolder='unet', torch_dtype=torch.float16).to('cuda')
pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_vae_slicing()

# Start sampling from the DDIM-inverted latents of the input video.
prompt = "spider man is skiing"
ddim_inv_latent = torch.load(f"{my_model_path}/inv_latents/ddim_latent-500.pt").to(torch.float16)
video = pipe(prompt, latents=ddim_inv_latent, video_length=24, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos

# Save the generated frames as a GIF named after the prompt.
save_videos_grid(video, f"./{prompt}.gif")
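The example above uses the prompt itself as the output filename. For longer prompts with commas or other punctuation, a filesystem-friendly name may be easier to work with. This is a small sketch; the helper name `prompt_to_filename` is hypothetical and not part of this repository.

```python
import re


def prompt_to_filename(prompt: str, ext: str = "gif") -> str:
    """Turn a text prompt into a lowercase, hyphen-separated filename."""
    # Replace every run of non-alphanumeric characters with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    # Fall back to a generic name if the prompt has no usable characters.
    return f"{slug or 'video'}.{ext}"


if __name__ == "__main__":
    print(prompt_to_filename("Spider man is skiing on the beach, cartoon style"))
```

The result could then be passed to `save_videos_grid` in place of `f"./{prompt}.gif"`.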

We provide comparisons with different baselines, as follows:

Results

Tune-A-Video

Comparisons to Tune-A-Video.

Rows: Tune-A-Video | Tune-A-Video + smooth loss
Input video: A jeep car is moving on the road
Edited prompts: A jeep car is moving on the beach | A jeep car is moving on the snow | A jeep car is moving on the road, cartoon style | A sports car is moving on the road

Rows: Tune-A-Video | Tune-A-Video + smooth loss
Input video: A rabbit is eating a watermelon
Edited prompts: A tiger is eating a watermelon | A rabbit is eating an orange | A rabbit is eating a pizza | A puppy is eating an orange

Rows: Tune-A-Video | Tune-A-Video + smooth loss
Input video: A man is skiing
Edited prompts: Mickey mouse is skiing on the snow | Spider man is skiing on the beach, cartoon style | Wonder woman, wearing a cowboy hat, is skiing | A man, wearing pink clothes, is skiing at sunset

Make-A-Protagonist

Comparisons to Make-A-Protagonist.

Columns: Input video | Make-A-Protagonist | Make-A-Protagonist + smooth loss
Input video: A jeep driving down a mountain road | Edited prompt: A jeep driving down a mountain road in the rain
Input video: A man is playing basketball | Edited prompt: A man is playing a basketball on the beach, anime style
Input video: A man walking down the street at night | Edited prompt: A panda walking down the snowy street
Input video: A man walking down the street | Edited prompt: Elon Musk walking down the street

ControlVideo

Comparisons to ControlVideo.

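Columns: Input video | Condition | ControlVideo | ControlVideo + smooth loss
Input video: A person is dancing | Condition: Pose condition | Edited prompt: Michael Jackson is dancing
Input video: A person is dancing | Condition: Pose condition | Edited prompt: A person is dancing, Makoto Shinkai style
Input video: A building | Condition: Canny edge condition | Edited prompt: A wooden building, at night
Input video: A girl | Condition: HED edge condition | Edited prompt: A girl, Krenz Cushart style
Input video: A girl | Condition: HED edge condition | Edited prompt: A girl with rich makeup
Input video: Ink diffuses in water | Condition: Depth condition | Edited prompt: Gentle green ink diffuses in water, beautiful light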

Video2Video-zero

Comparisons to training-free methods.

Columns: Input video | Instruct Video2Video-zero | Instruct Video2Video-zero + noise constraint | Video InstructPix2Pix | Video InstructPix2Pix + noise constraint
Instruct: Make it animation
