
πŸŽ₯ FAR: Frame Autoregressive Model for Both Short- and Long-Context Video Modeling πŸš€


Long-Context Autoregressive Video Modeling with Next-Frame Prediction


πŸ“’ News

  • 2025-03: Paper and Code of FAR are released! πŸŽ‰

🌟 What's the Potential of FAR?

πŸ”₯ Introducing FAR: a new baseline for autoregressive video generation

FAR (i.e., Frame AutoRegressive Model) learns to predict continuous frames from an autoregressive context. Its objective aligns naturally with video modeling, much as next-token prediction aligns with language modeling.
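
To make the analogy concrete, here is a minimal, hypothetical sketch of a teacher-forced next-frame objective. The model interface and the plain MSE loss are illustrative stand-ins; FAR's actual objective is a diffusion loss over continuous frame latents (see the paper for the exact formulation).

import torch
import torch.nn.functional as F

def next_frame_loss(model, latents):
    # latents: (B, T, C, H, W) continuous frame latents of a video clip
    losses = []
    for t in range(1, latents.shape[1]):
        context = latents[:, :t]  # clean preceding frames as context
        target = latents[:, t]    # the next frame to predict
        pred = model(context)     # hypothetical: returns one predicted frame
        losses.append(F.mse_loss(pred, target))
    return torch.stack(losses).mean()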


πŸ”₯ FAR achieves better convergence than video diffusion models with the same continuous latent space

πŸ”₯ FAR leverages clean visual context without additional image-to-video fine-tuning:

Unconditional pretraining on UCF-101 achieves state-of-the-art results in both video generation (context frame = 0) and video prediction (context frame β‰₯ 1) within a single model.

πŸ”₯ FAR supports 16x longer temporal extrapolation at test time

πŸ”₯ FAR supports efficient training on long video sequences with manageable token lengths

πŸ“š For more details, check out our paper.

πŸ‹οΈβ€β™‚οΈ FAR Model Zoo

We provide the trained FAR models from our paper for reproduction.

Video Generation

We use seeds [0, 2, 4, 6] in evaluation, following the evaluation protocol of Latte:

| Model (Config) | #Params | Resolution | Condition | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) |
|----------------|---------|------------|-----------|-----|------------|----------------------|------------------------|
| FAR-L  | 457 M | 128x128 | βœ— | 280 Β± 11.7 | Model-HF | Google Drive | 12.2 |
| FAR-L  | 457 M | 128x128 | βœ“ | 99 Β± 5.9   | Model-HF | Google Drive | 12.2 |
| FAR-L  | 457 M | 256x256 | βœ— | 303 Β± 13.5 | Model-HF | Google Drive | 12.7 |
| FAR-L  | 457 M | 256x256 | βœ“ | 113 Β± 3.6  | Model-HF | Google Drive | 12.7 |
| FAR-XL | 657 M | 256x256 | βœ— | 279 Β± 9.2  | Model-HF | Google Drive | 14.6 |
| FAR-XL | 657 M | 256x256 | βœ“ | 108 Β± 4.2  | Model-HF | Google Drive | 14.6 |

Short-Video Prediction

We follow the evaluation protocol of MCVD and ExtDM:

| Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) |
|----------------|---------|---------|------|------|-------|-----|------------|----------------------|------------------------|
| FAR-B | 130 M | UCF101           | 25.64 | 0.818 | 0.037 | 194.1 | Model-HF | Google Drive | 3.6 |
| FAR-B | 130 M | BAIR (c=2, p=28) | 19.40 | 0.819 | 0.049 | 144.3 | Model-HF | Google Drive | 2.6 |

Long-Video Prediction

We use seeds [0, 2, 4, 6] in evaluation, following the evaluation protocol of TECO:

| Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) |
|----------------|---------|---------|------|------|-------|-----|------------|----------------------|------------------------|
| FAR-B-Long | 150 M | DMLab     | 22.3 | 0.687 | 0.104 | 64 | Model-HF | Google Drive | 17.5 |
| FAR-M-Long | 280 M | Minecraft | 16.9 | 0.448 | 0.251 | 39 | Model-HF | Google Drive | 18.2 |

πŸ”§ Dependencies and Installation

1. Setup Environment:

# Setup Conda Environment
conda create -n FAR python=3.10
conda activate FAR

# Install Pytorch
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# Install Other Dependencies
pip install -r requirements.txt
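
After installation, you can sanity-check the environment (expected outputs assume the exact versions pinned above):

import torch

print(torch.__version__)          # expect 2.5.0
print(torch.cuda.is_available())  # expect True on a CUDA 12.4 machine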

2. Prepare Dataset:

We have uploaded the datasets used in this paper to Hugging Face for faster download. Please follow the instructions below to prepare them.

from huggingface_hub import snapshot_download

dataset_url = {
    "ucf101": "guyuchao/UCF101",
    "bair": "guyuchao/BAIR",
    "minecraft": "guyuchao/Minecraft",
    "minecraft_latent": "guyuchao/Minecraft_Latent",
    "dmlab": "guyuchao/DMLab",
    "dmlab_latent": "guyuchao/DMLab_Latent"
}

for key, url in dataset_url.items():
    snapshot_download(
        repo_id=url,
        repo_type="dataset",
        local_dir=f"datasets/{key}",
        token="input your hf token here"
    )

Then, enter each dataset's directory and extract the shards:

find . -name "shard-*.tar" -exec tar -xvf {} \;
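
Alternatively, a small Python sketch that unpacks every shard under datasets/, assuming the directory layout created by the download script above:

import tarfile
from pathlib import Path

for shard in Path("datasets").rglob("shard-*.tar"):
    with tarfile.open(shard) as tf:
        tf.extractall(shard.parent)  # unpack each shard next to its tar file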

3. Prepare Pretrained Models of FAR:

We have uploaded the pretrained models of FAR to Hugging Face. Please follow the instructions below to download them if you want to evaluate FAR.

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="guyuchao/FAR_Models",
    repo_type="model",
    local_dir="experiments/pretrained_models/FAR_Models",
    token="input your hf token here"
)

πŸš€ Training

To train different models, you can run the following command:

accelerate launch \
    --num_processes 8 \
    --num_machines 1 \
    --main_process_port 19040 \
    train.py \
    -opt train_config.yml

  • Wandb: Set use_wandb to True in the config to enable wandb monitoring.
  • Periodic Evaluation: Set val_freq to control how frequently evaluation runs during training.
  • Auto Resume: Simply rerun the script; training resumes from the latest checkpoint, and the wandb log resumes automatically.
  • Efficient Training on Pre-Extracted Latents: Set use_latent to True and point data_list to the corresponding latent path list (see the sketch after this list).
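
A hypothetical sketch of applying these options by editing a copy of the training config (the key names are taken from this README; check the configs shipped with the repo for the exact schema and nesting):

import yaml  # pip install pyyaml

with open("train_config.yml") as f:
    cfg = yaml.safe_load(f)

cfg["use_wandb"] = True    # enable wandb monitoring
cfg["val_freq"] = 5000     # example value: evaluate every 5k iterations
cfg["use_latent"] = True   # train on pre-extracted latents
# cfg["data_list"] = [...] # point this at the corresponding latent paths

with open("train_config_custom.yml", "w") as f:
    yaml.safe_dump(cfg, f)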

πŸ’» Sampling & Evaluation

To evaluate a pretrained model, copy the corresponding training config, set its pretrain_network: ~ entry to the path of your trained folder, and run the following script:

accelerate launch \
    --num_processes 8 \
    --num_machines 1 \
    --main_process_port 10410 \
    test.py \
    -opt test_config.yml
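
A hypothetical sketch of the config edit described above (the key name comes from this README; the checkpoint path is a placeholder):

import yaml

with open("train_config.yml") as f:
    cfg = yaml.safe_load(f)

cfg["pretrain_network"] = "experiments/your_trained_folder"  # placeholder path

with open("test_config.yml", "w") as f:
    yaml.safe_dump(cfg, f)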

πŸ“œ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ“– Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

@article{gu2025long,
    title={Long-Context Autoregressive Video Modeling with Next-Frame Prediction},
    author={Gu, Yuchao and Mao, Weijia and Shou, Mike Zheng},
    journal={arXiv preprint arXiv:2503.19325},
    year={2025}
}
