Long-Context Autoregressive Video Modeling with Next-Frame Prediction
- 2025-03: Paper and code of FAR are released!
FAR (i.e., Frame AutoRegressive Model) learns to predict continuous frames based on an autoregressive context. Its objective aligns well with video modeling, similar to the next-token prediction in language modeling.
FAR achieves better convergence than video diffusion models with the same continuous latent space.
Unconditional pretraining on UCF-101 achieves state-of-the-art results in both video generation (context frame = 0) and video prediction (context frame ≥ 1) within a single model.
For more details, check out our paper.
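To make the objective concrete, below is a minimal, self-contained PyTorch sketch of frame-level autoregressive denoising. It is not FAR's actual architecture, noise schedule, or loss (those are in the paper and code); it only illustrates predicting a noised continuous frame latent conditioned on the clean latents of earlier frames, the frame-level analogue of next-token prediction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy frame-autoregressive denoiser: the noised latent of the current frame is
# denoised conditioned on the clean latents of all previous frames.
class TinyFrameAR(nn.Module):
    def __init__(self, latent_dim=64, hidden=128):
        super().__init__()
        self.context_encoder = nn.GRU(latent_dim, hidden, batch_first=True)
        self.denoiser = nn.Sequential(
            nn.Linear(latent_dim + hidden + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, noisy_frame, context, t):
        _, h = self.context_encoder(context)        # context: (B, T_ctx, D) clean past latents
        cond = h[-1]                                 # (B, H) summary of the autoregressive context
        inp = torch.cat([noisy_frame, cond, t[:, None]], dim=-1)
        return self.denoiser(inp)                    # predict the added noise

model = TinyFrameAR()
frames = torch.randn(8, 16, 64)                      # (B, T, D) continuous frame latents
idx = 5                                              # predict frame 5 from frames 0..4
x0, noise, t = frames[:, idx], torch.randn(8, 64), torch.rand(8)
xt = (1 - t[:, None]) * x0 + t[:, None] * noise      # toy linear noising schedule
loss = F.mse_loss(model(xt, frames[:, :idx], t), noise)
loss.backward()
```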
We provide the FAR models trained in our paper for reproduction.
We use seeds [0, 2, 4, 6] in evaluation, following the evaluation protocol of Latte:
Model (Config) | #Params | Resolution | Condition | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) |
---|---|---|---|---|---|---|---|
FAR-L | 457 M | 128x128 | ✗ | 280 ± 11.7 | Model-HF | Google Drive | 12.2 |
FAR-L | 457 M | 128x128 | ✓ | 99 ± 5.9 | Model-HF | Google Drive | 12.2 |
FAR-L | 457 M | 256x256 | ✗ | 303 ± 13.5 | Model-HF | Google Drive | 12.7 |
FAR-L | 457 M | 256x256 | ✓ | 113 ± 3.6 | Model-HF | Google Drive | 12.7 |
FAR-XL | 657 M | 256x256 | ✗ | 279 ± 9.2 | Model-HF | Google Drive | 14.6 |
FAR-XL | 657 M | 256x256 | ✓ | 108 ± 4.2 | Model-HF | Google Drive | 14.6 |
We follow the evaluation protocol of MCVD and ExtDM:
Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) |
---|---|---|---|---|---|---|---|---|---|
FAR-B | 130 M | UCF101 | 25.64 | 0.818 | 0.037 | 194.1 | Model-HF | Google Drive | 3.6 |
FAR-B | 130 M | BAIR (c=2, p=28) | 19.40 | 0.819 | 0.049 | 144.3 | Model-HF | Google Drive | 2.6 |
We use seeds [0, 2, 4, 6] in evaluation, following the evaluation protocol of TECO:
Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) |
---|---|---|---|---|---|---|---|---|---|
FAR-B-Long | 150 M | DMLab | 22.3 | 0.687 | 0.104 | 64 | Model-HF | Google Drive | 17.5 |
FAR-M-Long | 280 M | Minecraft | 16.9 | 0.448 | 0.251 | 39 | Model-HF | Google Drive | 18.2 |
```bash
# Setup Conda Environment
conda create -n FAR python=3.10
conda activate FAR

# Install PyTorch
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# Install Other Dependencies
pip install -r requirements.txt
```
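After installation, a quick sanity check (the expected version numbers follow the install command above):

```python
import torch
import torchvision

# Quick environment check; expected versions follow the conda install command above.
print(torch.__version__)          # expected: 2.5.0 (CUDA 12.4 build)
print(torchvision.__version__)    # expected: 0.20.0
print(torch.cuda.is_available())  # should be True on a GPU machine
```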
We have uploaded the datasets used in this paper to Hugging Face for faster download. Please follow the instructions below to prepare them.
```python
from huggingface_hub import snapshot_download

# Map of local dataset folders to their Hugging Face dataset repos.
dataset_url = {
    "ucf101": "guyuchao/UCF101",
    "bair": "guyuchao/BAIR",
    "minecraft": "guyuchao/Minecraft",
    "minecraft_latent": "guyuchao/Minecraft_Latent",
    "dmlab": "guyuchao/DMLab",
    "dmlab_latent": "guyuchao/DMLab_Latent"
}

# Download each dataset repo into datasets/<name>.
for key, url in dataset_url.items():
    snapshot_download(
        repo_id=url,
        repo_type="dataset",
        local_dir=f"datasets/{key}",
        token="input your hf token here"
    )
```
Then, enter each dataset directory and extract the shards:
```bash
find . -name "shard-*.tar" -exec tar -xvf {} \;
```
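If you prefer to do this from Python, the sketch below extracts every shard in place, assuming the `datasets/<name>/shard-*.tar` layout produced by the download loop above:

```python
import glob
import tarfile
from pathlib import Path

# Extract each downloaded shard next to the tar file itself,
# assuming the datasets/<name>/shard-*.tar layout from the download loop above.
for shard in glob.glob("datasets/*/shard-*.tar"):
    shard_path = Path(shard)
    with tarfile.open(shard_path) as tar:
        tar.extractall(path=shard_path.parent)
    print(f"extracted {shard_path}")
```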
We have uploaded the pretrained FAR models to Hugging Face. Please follow the instructions below to download them if you want to evaluate FAR.
```python
from huggingface_hub import snapshot_download

# Download all pretrained FAR checkpoints into experiments/pretrained_models/FAR_Models.
snapshot_download(
    repo_id="guyuchao/FAR_Models",
    repo_type="model",
    local_dir="experiments/pretrained_models/FAR_Models",
    token="input your hf token here"
)
```
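If you only need a single checkpoint, `hf_hub_download` fetches one file instead of the whole snapshot; the filename below is a placeholder, so check the repo's file list for the real names:

```python
from huggingface_hub import hf_hub_download

# Fetch one checkpoint file; "far_l_ucf101.pth" is a hypothetical placeholder,
# replace it with an actual filename from the guyuchao/FAR_Models repo.
ckpt_path = hf_hub_download(
    repo_id="guyuchao/FAR_Models",
    repo_type="model",
    filename="far_l_ucf101.pth",
    local_dir="experiments/pretrained_models/FAR_Models",
    token="input your hf token here"
)
print(ckpt_path)
```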
To train different models, you can run the following command:
```bash
accelerate launch \
  --num_processes 8 \
  --num_machines 1 \
  --main_process_port 19040 \
  train.py \
  -opt train_config.yml
```
- Wandb: Set `use_wandb` to `True` in the config to enable wandb monitoring.
- Periodic Evaluation: Set `val_freq` to control how often evaluation runs during training.
- Auto Resume: Simply rerun the script; the model will find the latest checkpoint to resume from, and the wandb log will resume automatically.
- Efficient Training on Pre-Extracted Latents: Set `use_latent` to `True`, and set `data_list` to the corresponding list of latent paths (see the sketch below).
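A minimal sketch of toggling these options programmatically (assuming PyYAML is available); the key names come from the notes above, but their exact nesting inside `train_config.yml` may differ, and the values are only placeholders:

```python
import yaml  # PyYAML

# Illustrative only: key names are taken from the notes above; their nesting in
# train_config.yml may differ, and the values here are placeholders.
with open("train_config.yml") as f:
    cfg = yaml.safe_load(f)

cfg["use_wandb"] = True                        # enable wandb monitoring
cfg["val_freq"] = 5000                         # evaluate every 5000 iterations (placeholder)
cfg["use_latent"] = True                       # train on pre-extracted latents
cfg["data_list"] = ["datasets/dmlab_latent"]   # placeholder latent path list

with open("my_train_config.yml", "w") as f:
    yaml.safe_dump(cfg, f)
```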
To evaluate a pretrained model, copy the training config and change `pretrain_network: ~` to point to your trained model folder. Then run the following script:
```bash
accelerate launch \
  --num_processes 8 \
  --num_machines 1 \
  --main_process_port 10410 \
  test.py \
  -opt test_config.yml
```
This project is licensed under the MIT License - see the LICENSE file for details.
If our work assists your research, feel free to give us a star ⭐ or cite us using:
```bibtex
@article{gu2025long,
  title={Long-Context Autoregressive Video Modeling with Next-Frame Prediction},
  author={Gu, Yuchao and Mao, Weijia and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2503.19325},
  year={2025}
}
```