Hongyang Du*1,2 · Junjie Ye*1· Xiaoyan Cong*2 · Runhao Li1 · Jingcheng Ni2
Aman Agarwal2 · Zeqi Zhou2 · Zekun Li2 · Randall Balestriero2 · Yue Wang1
1Physical SuperIntelligence Lab, University of Southern California
2Department of Computer Science, Brown University
* Equal Contribution
This directory contains simplified command-line scripts for generating videos using CogVideoX models. These scripts are designed for quick testing and allow you to run inference directly from the terminal without preparing JSON configuration files.
Both scripts support loading LoRA adapters for customized generation.
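For reference, attaching an adapter to a pipeline can be done with the standard diffusers LoRA loader. Below is a minimal sketch assuming the adapter layout shown in the download section further down; the actual scripts may attach the adapter differently (e.g., by wrapping the transformer with PEFT):

```python
# Sketch of one way to attach a LoRA adapter (the scripts may use PEFT instead).
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.load_lora_weights(
    "./checkpoints/VideoGPA-T2V-lora",        # adapter directory (see download step below)
    weight_name="adapter_model.safetensors",  # assumed adapter file name
)
```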
Please make sure your Python version is between 3.10 and 3.12 (inclusive).
```bash
pip install -r requirements.txt
```

Run the provided Python script to download checkpoint files:
```bash
# Download all checkpoints
python download_ckpt.py all

# Or download specific checkpoints
python download_ckpt.py i2v    # CogVideoX-I2V-5B
python download_ckpt.py t2v    # CogVideoX-5B
python download_ckpt.py t2v15  # CogVideoX1.5-5B
```

The script will:
- ✅ Check if files already exist (skip re-downloading)
- 🚀 Download missing checkpoints with progress bars
- 📁 Organize files into the correct directory structure
After successful download, your checkpoints folder should look like:
```
checkpoints/
├── VideoGPA-I2V-lora/
│   └── adapter_model.safetensors
├── VideoGPA-T2V-lora/
│   └── adapter_model.safetensors
└── VideoGPA-T2V1.5-lora/
    └── adapter_model.safetensors
```
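Roughly, the download helper checks for existing files and fetches the missing ones with progress reporting. A sketch using the huggingface_hub client is shown below; the repo id here is a placeholder, and the real source and file list are defined in download_ckpt.py:

```python
# Rough sketch of a checkpoint download helper; repo id and file list are placeholders.
from pathlib import Path
from huggingface_hub import hf_hub_download

REPO_ID = "your-org/VideoGPA"  # hypothetical; see download_ckpt.py for the real repo
CHECKPOINTS = {
    "i2v": "VideoGPA-I2V-lora/adapter_model.safetensors",
    "t2v": "VideoGPA-T2V-lora/adapter_model.safetensors",
    "t2v15": "VideoGPA-T2V1.5-lora/adapter_model.safetensors",
}

def download(name: str, out_dir: str = "checkpoints") -> None:
    filename = CHECKPOINTS[name]
    if (Path(out_dir) / filename).exists():   # skip files that are already present
        print(f"✅ {filename} already exists")
        return
    hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir=out_dir)
    print(f"🚀 downloaded {filename}")
```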
1. Text-to-Video Generation (t2v_inference.py)
Generate videos from text prompts using CogVideoX-5B.
Basic Usage:
```bash
cd generate
python t2v_inference.py "A cat playing with a ball in a garden"
```

Advanced Usage:
```bash
python t2v_inference.py "A flying drone over a city skyline at sunset" \
    --output_dir ./my_videos \
    --lora_path ./checkpoints/my_lora_adapter \
    --gpu_id 0
```

Arguments:
- `prompt` (required): Text prompt for video generation
- `--output_dir`: Directory to save generated videos (default: `./outputs`)
- `--lora_path`: Path to LoRA adapter weights (optional)
- `--gpu_id`: GPU device ID (default: 0)
Output: Videos are saved as `{prompt}_seed{seed}.mp4`
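Under the hood, both the basic and advanced invocations reduce to a standard diffusers text-to-video call. A minimal sketch assuming the stock CogVideoX API and the default parameters listed later in this README (t2v_inference.py itself may differ in details such as checkpoint paths and output naming):

```python
# Minimal sketch of the text-to-video path using the public diffusers API.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A cat playing with a ball in a garden"
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # keep VRAM usage low

frames = pipe(
    prompt=prompt,
    num_inference_steps=50,  # NUM_INFERENCE_STEPS
    guidance_scale=6.0,      # GUIDANCE_SCALE
    generator=torch.Generator().manual_seed(42),  # SEED
).frames[0]
export_to_video(frames, f"{prompt}_seed42.mp4", fps=8)
```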
2. Image-to-Video Generation (i2v_inference.py)
Generate videos from a static image with text guidance using CogVideoX-5B-I2V.
Basic Usage:
```bash
cd generate
python i2v_inference.py "The camera slowly zooms in" ./path/to/image.jpg
```

Advanced Usage:
```bash
python i2v_inference.py "A realistic continuation of the reference scene. Everything must remain completely static: no moving people, no shifting objects, and no dynamic elements. Only the camera is allowed to move. Render physically accurate multi-step camera motion. Camera motion: roll gently to one side, then swing around the room, followed by push forward into the scene." ./image.png \
    --output_dir ./i2v_outputs \
    --lora_path ./checkpoints/i2v_lora \
    --gpu_id 1
```

Arguments:
- `prompt` (required): Text prompt describing the motion/scene
- `image_path` (required): Path to the input image file
- `--output_dir`: Directory to save generated videos (default: `./outputs`)
- `--lora_path`: Path to LoRA adapter weights (optional)
- `--gpu_id`: GPU device ID (default: 0)
Output: Videos are saved as `{image_name}_seed{seed}.mp4`
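The image-to-video path is analogous. A minimal sketch assuming the stock CogVideoXImageToVideoPipeline (the script itself may differ in details):

```python
# Minimal sketch of the image-to-video path using the public diffusers API.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

image = load_image("./path/to/image.jpg")
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

frames = pipe(
    prompt="The camera slowly zooms in",
    image=image,
    num_inference_steps=50,
    guidance_scale=6.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_video(frames, "image_seed42.mp4", fps=8)
```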
Both scripts include configurable generation parameters:
```python
NUM_INFERENCE_STEPS = 50  # Number of diffusion steps
GUIDANCE_SCALE = 6.0      # Classifier-free guidance scale
SEED = 42                 # Seed for generation
```

- Minimum VRAM: roughly 5 GB for the base models with diffusers in BF16
- Memory optimizations (VAE tiling/slicing) are automatically enabled (sketched below)
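The VAE tiling/slicing mentioned above corresponds to standard diffusers calls. A sketch of what "automatically enabled" roughly amounts to; the exact set of optimizations may differ per script:

```python
# Sketch of the memory optimizations the scripts enable; assumed to map onto
# the standard diffusers calls shown here.
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.vae.enable_slicing()        # decode latent frames in smaller batches
pipe.vae.enable_tiling()         # decode latents tile-by-tile spatially
pipe.enable_model_cpu_offload()  # keep submodules on CPU until needed
```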
- Video Quality Assessment: Comprehensive metrics for evaluating video generation quality
- DPO Training: Direct Preference Optimization for video generation models
- Multi-Model Support: Compatible with CogVideoX and other video generation models
- Flexible Pipeline: Easy-to-use inference and training pipelines
```
VideoGPA/
├── data_prep/   # Data preparation scripts
├── train_dpo/   # DPO training scripts
├── pipelines/   # Inference pipelines
├── metrics/     # Quality assessment metrics
├── vggt/        # Video generation model architecture
└── utils/       # Utility functions
```
VideoGPA leverages DPO to optimize video generation quality through preference learning. Once you have your generated videos, the training pipeline consists of three steps. Revise the configs as needed:
1. Score the generated videos:

```bash
python train_dpo/video_scorer.py
```

2. Encode the data:

```bash
# For CogVideoX-I2V-5B
python train_dpo/CogVideoX-I2V-5B_lora/02_encode.py
# For CogVideoX-5B
python train_dpo/CogVideoX-5B_lora/02_encode.py
# For CogVideoX1.5-5B
python train_dpo/CogVideoX1.5-5B_lora/02_encode.py
```

3. Train with DPO:

```bash
# For CogVideoX-I2V-5B
python train_dpo/CogVideoX-I2V-5B_lora/03_train.py
# For CogVideoX-5B
python train_dpo/CogVideoX-5B_lora/03_train.py
# For CogVideoX1.5-5B
python train_dpo/CogVideoX1.5-5B_lora/03_train.py
```

Key Features:
- 🎯 Preference-based learning using winner/loser pairs (see the sketch after this list)
- 🔧 Parameter-efficient fine-tuning with LoRA
- 📊 Multiple quality metrics support
- ⚡ Distributed training with PyTorch Lightning
- 💾 Automatic gradient checkpointing and memory optimization
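The preference objective follows the Diffusion-DPO formulation cited in the acknowledgments, applied to winner/loser video latents. Below is a minimal sketch of the per-pair loss with hypothetical tensor names; the actual implementation in train_dpo/ may add timestep weighting or other details:

```python
# Minimal sketch of a Diffusion-DPO-style preference loss (hypothetical names).
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(err_win, err_win_ref, err_lose, err_lose_ref, beta=5000.0):
    """Per-pair DPO loss on diffusion denoising errors.

    err_*     : MSE of the trainable (LoRA) model's noise prediction for the
                winner / loser latents at a sampled timestep.
    err_*_ref : the same quantity from the frozen reference model.
    beta      : preference temperature (Diffusion-DPO uses a large value).
    """
    win_diff = err_win - err_win_ref      # improvement over reference on the winner
    lose_diff = err_lose - err_lose_ref   # improvement over reference on the loser
    logits = -(win_diff - lose_diff)      # reward lowering winner error more than loser error
    return -F.logsigmoid(beta * logits).mean()

# Example with dummy per-sample errors (a batch of 2 pairs):
loss = diffusion_dpo_loss(
    err_win=torch.tensor([0.30, 0.25]),
    err_win_ref=torch.tensor([0.35, 0.30]),
    err_lose=torch.tensor([0.33, 0.31]),
    err_lose_ref=torch.tensor([0.34, 0.32]),
)
```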
Data Format: Training requires JSON metadata containing preference pairs, i.e. multiple videos generated from the same prompt with quality scores. See dataset.py for details.
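A hypothetical entry might look like the following; the field names are illustrative only, and the authoritative schema is defined in dataset.py:

```python
# Hypothetical preference-pair entry; the real field names live in dataset.py.
preference_pair = {
    "prompt": "A cat playing with a ball in a garden",
    "winner_video": "videos/cat_seed42.mp4",  # higher quality score
    "loser_video": "videos/cat_seed7.mp4",    # lower score, same prompt
    "winner_score": 0.87,
    "loser_score": 0.41,
}
```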
We would like to express our gratitude to the following projects and researchers:
- CogVideoX - The foundational state-of-the-art video generation model.
- PEFT - For the parameter-efficient fine-tuning framework and LoRA support.
- Diffusion DPO - For the innovative Direct Preference Optimization approach in the diffusion latent space.
Thanks to Dawei Liu for the amazing website design!
If you find our work helpful, please leave us a star and cite our paper.
