
VideoGPA is a self-supervised framework that enhances 3D consistency in Video Diffusion Models. By leveraging geometry foundation models and DPO, it automatically aligns generated videos with dense structural preferences to eliminate deformation and spatial drift without human annotations.

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

Hongyang Du*1,2 · Junjie Ye*1 · Xiaoyan Cong*2 · Runhao Li1 · Jingcheng Ni2
Aman Agarwal2 · Zeqi Zhou2 · Zekun Li2 · Randall Balestriero2 · Yue Wang1

1Physical SuperIntelligence Lab, University of Southern California
2Department of Computer Science, Brown University
* Equal Contribution

License: MIT

Pipeline

Quick Inference Scripts 🚀

This directory contains simplified command-line scripts for generating videos with CogVideoX models. They are designed for quick testing and let you run inference directly from the terminal, without preparing JSON configuration files.

Both scripts support loading LoRA adapters for customized generation.

📋 Requirements

Make sure your Python version is between 3.10 and 3.12 (inclusive), then install the dependencies:

pip install -r requirements.txt

🔘 Checkpoint Download

Automatic Download (Recommended)

Method 1: Using the Download Script

Run the provided Python script to download checkpoint files:

# Download all checkpoints
python download_ckpt.py all

# Or download specific checkpoints
python download_ckpt.py i2v    # CogVideoX-I2V-5B
python download_ckpt.py t2v    # CogVideoX-5B
python download_ckpt.py t2v15  # CogVideoX1.5-5B

The script will:

  • ✅ Check if files already exist (skip re-downloading)
  • 🚀 Download missing checkpoints with progress bars
  • 📁 Organize files into the correct directory structure
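
For reference, here is a minimal sketch of the same logic, assuming the checkpoint files are hosted on the Hugging Face Hub. The repo ID and file mapping below are hypothetical; download_ckpt.py is the authoritative source.

# Minimal sketch of the download logic. The repo ID and file mapping are
# hypothetical; see download_ckpt.py for the actual implementation.
import sys
from pathlib import Path
from huggingface_hub import hf_hub_download

CHECKPOINTS = {
    "i2v":   "VideoGPA-I2V-lora/adapter_model.safetensors",
    "t2v":   "VideoGPA-T2V-lora/adapter_model.safetensors",
    "t2v15": "VideoGPA-T2V1.5-lora/adapter_model.safetensors",
}

def download(name: str, repo_id: str = "Hongyang-Du/VideoGPA") -> None:
    target = Path("checkpoints") / CHECKPOINTS[name]
    if target.exists():
        print(f"{target} already exists, skipping")  # skip re-downloading
        return
    hf_hub_download(repo_id=repo_id, filename=CHECKPOINTS[name],
                    local_dir="checkpoints")  # preserves the layout below

if __name__ == "__main__":
    choice = sys.argv[1] if len(sys.argv) > 1 else "all"
    for name in (CHECKPOINTS if choice == "all" else [choice]):
        download(name)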

Expected Directory Structure

After successful download, your checkpoints folder should look like:

checkpoints/
├── VideoGPA-I2V-lora/
│   └── adapter_model.safetensors
├── VideoGPA-T2V-lora/
│   └── adapter_model.safetensors
└── VideoGPA-T2V1.5-lora/
    └── adapter_model.safetensors

📝 Available Scripts

1. Text-to-Video Generation (t2v_inference.py)

Generate videos from text prompts using CogVideoX-5B.

Basic Usage:

cd generate
python t2v_inference.py "A cat playing with a ball in a garden"

Advanced Usage:

python t2v_inference.py "A flying drone over a city skyline at sunset" \
    --output_dir ./my_videos \
    --lora_path ./checkpoints/my_lora_adapter \
    --gpu_id 0

Arguments:

  • prompt (required): Text prompt for video generation
  • --output_dir: Directory to save generated videos (default: ./outputs)
  • --lora_path: Path to LoRA adapter weights (optional)
  • --gpu_id: GPU device ID (default: 0)

Output: Videos saved as {prompt}_seed{seed}.mp4
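
If you would rather call the pipeline from Python than through the wrapper script, here is a minimal sketch using the diffusers CogVideoXPipeline. The LoRA path is a placeholder, and t2v_inference.py may differ in its details.

# Minimal programmatic sketch of text-to-video generation via diffusers.
# The LoRA path is a placeholder; t2v_inference.py may differ in details.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("./checkpoints/VideoGPA-T2V-lora")  # optional adapter
pipe.vae.enable_tiling()   # memory optimizations, see Configuration below
pipe.vae.enable_slicing()

video = pipe(
    prompt="A cat playing with a ball in a garden",
    num_inference_steps=50,  # matches NUM_INFERENCE_STEPS below
    guidance_scale=6.0,      # matches GUIDANCE_SCALE below
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
export_to_video(video, "a_cat_playing_seed42.mp4", fps=8)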


2. Image-to-Video Generation (i2v_inference.py)

Generate videos from a static image with text guidance using CogVideoX-5B-I2V.

Basic Usage:

cd generate
python i2v_inference.py "The camera slowly zooms in" ./path/to/image.jpg

Advanced Usage:

python i2v_inference.py "A realistic continuation of the reference scene. Everything must remain completely static: no moving people, no shifting objects, and no dynamic elements. Only the camera is allowed to move. Render physically accurate multi-step camera motion. Camera motion: roll gently to one side, then swing around the room, followed by push forward into the scene." ./image.png \
    --output_dir ./i2v_outputs \
    --lora_path ./checkpoints/i2v_lora \
    --gpu_id 1

Arguments:

  • prompt (required): Text prompt describing motion/scene
  • image_path (required): Path to input image file
  • --output_dir: Directory to save generated videos (default: ./outputs)
  • --lora_path: Path to LoRA adapter weights (optional)
  • --gpu_id: GPU device ID (default: 0)

Output: Videos saved as {image_name}_seed{seed}.mp4
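
The same flow works programmatically through the diffusers CogVideoXImageToVideoPipeline; a minimal sketch follows, where the paths are placeholders and i2v_inference.py may differ in its details.

# Minimal programmatic sketch of image-to-video generation via diffusers.
# Paths are placeholders; i2v_inference.py may differ in details.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("./checkpoints/VideoGPA-I2V-lora")  # optional adapter

video = pipe(
    prompt="The camera slowly zooms in",
    image=load_image("./path/to/image.jpg"),
    num_inference_steps=50,
    guidance_scale=6.0,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
export_to_video(video, "image_seed42.mp4", fps=8)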


⚙️ Configuration

Both scripts include configurable generation parameters:

NUM_INFERENCE_STEPS = 50  # Number of diffusion steps
GUIDANCE_SCALE = 6.0      # Classifier-free guidance scale
SEED = 42                 # Seed for generation

💾 GPU Memory Requirements

  • Minimum VRAM: ~5 GB for the base models when run through diffusers in BF16
  • Memory optimizations (VAE tiling/slicing) are enabled automatically, as sketched below
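
The VAE tiling/slicing mentioned above corresponds to the following standard diffusers calls; whether the scripts also use CPU offload is an assumption.

# Standard diffusers memory knobs. VAE tiling/slicing matches the note
# above; CPU offload is an optional extra, not confirmed in the scripts.
pipe.vae.enable_tiling()         # decode the video in spatial tiles
pipe.vae.enable_slicing()        # decode frames in slices
pipe.enable_model_cpu_offload()  # optional: offload idle submodules to CPU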

🚀 Features

  • Video Quality Assessment: Comprehensive metrics for evaluating video generation quality
  • DPO Training: Direct Preference Optimization for video generation models
  • Multi-Model Support: Compatible with CogVideoX and other video generation models
  • Flexible Pipeline: Easy-to-use inference and training pipelines

📁 Code Structure

VideoGPA/
├── data_prep/      # Data preparation scripts
├── train_dpo/      # DPO training scripts
├── pipelines/      # Inference pipelines
├── metrics/        # Quality assessment metrics
├── vggt/           # Geometry foundation model (VGGT) used for scoring
└── utils/          # Utility functions

🔧 DPO Training (Direct Preference Optimization)

VideoGPA leverages DPO to optimize video generation quality through preference learning. The training pipeline consists of three steps, run after you have generated your videos. Revise the configs as needed:

Step 1: Score Your Generated Videos

python train_dpo/video_scorer.py

Step 2: Encode Videos to Latent Space

# For CogVideoX-I2V-5B
python train_dpo/CogVideoX-I2V-5B_lora/02_encode.py

# For CogVideoX-5B
python train_dpo/CogVideoX-5B_lora/02_encode.py

# For CogVideoX1.5-5B
python train_dpo/CogVideoX1.5-5B_lora/02_encode.py
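
Conceptually, this step pushes each video through the CogVideoX VAE and stores the resulting latents. A minimal sketch follows; the tensor shape and the internals of the 02_encode.py scripts are assumptions.

# Minimal sketch of encoding one video with the CogVideoX VAE. The actual
# 02_encode.py scripts may differ; the tensor shape here is an assumption.
import torch
from diffusers import AutoencoderKLCogVideoX

vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.bfloat16
).to("cuda")
vae.enable_tiling()  # keep VRAM in check for long clips

# video: (batch, channels, frames, height, width), pixel values in [-1, 1]
video = torch.randn(1, 3, 49, 480, 720, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()
    latents = latents * vae.config.scaling_factor  # scale for the denoiser
torch.save(latents.cpu(), "latents.pt")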

Step 3: Run DPO Training

# For CogVideoX-I2V-5B
python train_dpo/CogVideoX-I2V-5B_lora/03_train.py

# For CogVideoX-5B
python train_dpo/CogVideoX-5B_lora/03_train.py

# For CogVideoX1.5-5B
python train_dpo/CogVideoX1.5-5B_lora/03_train.py
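
The objective follows Diffusion-DPO: the LoRA-adapted model should denoise the winner's latents better than the loser's, relative to the frozen reference model. A minimal sketch of the loss term follows, with hypothetical variable names; see the 03_train.py scripts for the actual implementation.

# Minimal sketch of the Diffusion-DPO loss on one winner/loser latent pair.
# err_* are per-sample denoising MSEs; ref_* come from the frozen reference.
# Names are hypothetical; see the 03_train.py scripts for the real code.
import torch.nn.functional as F

def diffusion_dpo_loss(err_w, err_l, ref_err_w, ref_err_l, beta=5000.0):
    model_diff = err_w - err_l        # trained model: winner vs. loser fit
    ref_diff = ref_err_w - ref_err_l  # reference model: same comparison
    inside = -0.5 * beta * (model_diff - ref_diff)
    return -F.logsigmoid(inside).mean()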

Key Features:

  • 🎯 Preference-based learning using winner/loser pairs
  • 🔧 Parameter-efficient fine-tuning with LoRA
  • 📊 Multiple quality metrics support
  • ⚡ Distributed training with PyTorch Lightning
  • 💾 Automatic gradient checkpointing and memory optimization

Data Format: Training requires JSON metadata containing preference pairs: multiple videos generated from the same prompt, each with a quality score. See dataset.py for details; a hypothetical entry is sketched below.
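
For illustration only, a hypothetical entry; the field names are assumptions, and dataset.py defines the actual schema.

{
  "prompt": "A cat playing with a ball in a garden",
  "winner_video": "videos/cat_seed42.mp4",
  "loser_video": "videos/cat_seed7.mp4",
  "winner_score": 0.91,
  "loser_score": 0.47
}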

🙏 Acknowledgements

We would like to express our gratitude to the following projects and researchers:

  • CogVideoX - The foundational state-of-the-art video generation model.
  • PEFT - For the parameter-efficient fine-tuning framework behind our LoRA training.
  • Diffusion DPO - For the innovative Direct Preference Optimization approach in the diffusion latent space.

Thanks to Dawei Liu for the amazing website design!

🌟 Citation

If you find our work helpful, please leave us a star and cite our paper.
