Long-Context Autoregressive Video Modeling with Next-Frame Prediction
- 2025-03: Paper and code of FAR are released!
FAR (i.e., Frame AutoRegressive Model) learns to predict continuous frames based on an autoregressive context. Its objective aligns well with video modeling, similar to the next-token prediction in language modeling.
FAR achieves better convergence than video diffusion models with the same continuous latent space.
Unconditional pretraining on UCF-101 achieves state-of-the-art results in both video generation (context frame = 0) and video prediction (context frame ≥ 1) within a single model.
For more details, check out our paper.
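To make the objective concrete, below is a minimal, self-contained PyTorch sketch of frame-level autoregressive denoising. It is not FAR's actual architecture, noise schedule, or loss (those are in the paper and code); it only illustrates predicting a noised continuous frame latent conditioned on the clean latents of earlier frames, the frame-level analogue of next-token prediction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy frame-autoregressive denoiser: the noised latent of the current frame is
# denoised conditioned on the clean latents of all previous frames.
class TinyFrameAR(nn.Module):
    def __init__(self, latent_dim=64, hidden=128):
        super().__init__()
        self.context_encoder = nn.GRU(latent_dim, hidden, batch_first=True)
        self.denoiser = nn.Sequential(
            nn.Linear(latent_dim + hidden + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, noisy_frame, context, t):
        _, h = self.context_encoder(context)        # context: (B, T_ctx, D) clean past latents
        cond = h[-1]                                 # (B, H) summary of the autoregressive context
        inp = torch.cat([noisy_frame, cond, t[:, None]], dim=-1)
        return self.denoiser(inp)                    # predict the added noise

model = TinyFrameAR()
frames = torch.randn(8, 16, 64)                      # (B, T, D) continuous frame latents
idx = 5                                              # predict frame 5 from frames 0..4
x0, noise, t = frames[:, idx], torch.randn(8, 64), torch.rand(8)
xt = (1 - t[:, None]) * x0 + t[:, None] * noise      # toy linear noising schedule
loss = F.mse_loss(model(xt, frames[:, :idx], t), noise)
loss.backward()
```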
We provide the FAR models trained in our paper for reproduction.
We use seeds [0, 2, 4, 6] in evaluation, following the evaluation protocol of Latte:
Model (Config) | #Params | Resolution | Condition | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) |
---|---|---|---|---|---|---|---|
FAR-L | 457 M | 128x128 | ✗ | 280 ± 11.7 | Model-HF | Google Drive | 12.2 |
FAR-L | 457 M | 128x128 | ✓ | 99 ± 5.9 | Model-HF | Google Drive | 12.2 |
FAR-L | 457 M | 256x256 | ✗ | 303 ± 13.5 | Model-HF | Google Drive | 12.7 |
FAR-L | 457 M | 256x256 | ✓ | 113 ± 3.6 | Model-HF | Google Drive | 12.7 |
FAR-XL | 657 M | 256x256 | ✗ | 279 ± 9.2 | Model-HF | Google Drive | 14.6 |
FAR-XL | 657 M | 256x256 | ✓ | 108 ± 4.2 | Model-HF | Google Drive | 14.6 |
We follow the evaluation protocol of MCVD and ExtDM:
Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) |
---|---|---|---|---|---|---|---|---|---|
FAR-B | 130 M | UCF101 | 25.64 | 0.818 | 0.037 | 194.1 | Model-HF | Google Drive | 3.6 |
FAR-B | 130 M | BAIR (c=2, p=28) | 19.40 | 0.819 | 0.049 | 144.3 | Model-HF | Google Drive | 2.6 |
We use seeds [0, 2, 4, 6] in evaluation, following the evaluation protocol of TECO:
Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) |
---|---|---|---|---|---|---|---|---|---|
FAR-B-Long | 150 M | DMLab | 22.3 | 0.687 | 0.104 | 64 | Model-HF | Google Drive | 17.5 |
FAR-M-Long | 280 M | Minecraft | 16.9 | 0.448 | 0.251 | 39 | Model-HF | Google Drive | 18.2 |
```bash
# Setup Conda Environment
conda create -n FAR python=3.10
conda activate FAR

# Install PyTorch
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# Install Other Dependencies
pip install -r requirements.txt
```
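After installation, a quick sanity check (the expected version numbers follow the install command above):

```python
import torch
import torchvision

# Quick environment check; expected versions follow the conda install command above.
print(torch.__version__)          # expected: 2.5.0 (CUDA 12.4 build)
print(torchvision.__version__)    # expected: 0.20.0
print(torch.cuda.is_available())  # should be True on a GPU machine
```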
We have uploaded the datasets used in this paper to Hugging Face for faster download. Please follow the instructions below to prepare them.
```python
from huggingface_hub import snapshot_download

# Map of local dataset folders to their Hugging Face dataset repos.
dataset_url = {
    "ucf101": "guyuchao/UCF101",
    "bair": "guyuchao/BAIR",
    "minecraft": "guyuchao/Minecraft",
    "minecraft_latent": "guyuchao/Minecraft_Latent",
    "dmlab": "guyuchao/DMLab",
    "dmlab_latent": "guyuchao/DMLab_Latent"
}

# Download each dataset repo into datasets/<name>.
for key, url in dataset_url.items():
    snapshot_download(
        repo_id=url,
        repo_type="dataset",
        local_dir=f"datasets/{key}",
        token="input your hf token here"
    )
```
Then, enter each dataset directory and extract the shards:
```bash
find . -name "shard-*.tar" -exec tar -xvf {} \;
```
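If you prefer to do this from Python, the sketch below extracts every shard in place, assuming the `datasets/<name>/shard-*.tar` layout produced by the download loop above:

```python
import glob
import tarfile
from pathlib import Path

# Extract each downloaded shard next to the tar file itself,
# assuming the datasets/<name>/shard-*.tar layout from the download loop above.
for shard in glob.glob("datasets/*/shard-*.tar"):
    shard_path = Path(shard)
    with tarfile.open(shard_path) as tar:
        tar.extractall(path=shard_path.parent)
    print(f"extracted {shard_path}")
```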
We have uploaded the pretrained FAR models to Hugging Face. Please follow the instructions below to download them if you want to evaluate FAR.
```python
from huggingface_hub import snapshot_download

# Download all pretrained FAR checkpoints into experiments/pretrained_models/FAR_Models.
snapshot_download(
    repo_id="guyuchao/FAR_Models",
    repo_type="model",
    local_dir="experiments/pretrained_models/FAR_Models",
    token="input your hf token here"
)
```
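If you only need a single checkpoint, `hf_hub_download` fetches one file instead of the whole snapshot; the filename below is a placeholder, so check the repo's file list for the real names:

```python
from huggingface_hub import hf_hub_download

# Fetch one checkpoint file; "far_l_ucf101.pth" is a hypothetical placeholder,
# replace it with an actual filename from the guyuchao/FAR_Models repo.
ckpt_path = hf_hub_download(
    repo_id="guyuchao/FAR_Models",
    repo_type="model",
    filename="far_l_ucf101.pth",
    local_dir="experiments/pretrained_models/FAR_Models",
    token="input your hf token here"
)
print(ckpt_path)
```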
To train different models, you can run the following command:
```bash
accelerate launch \
  --num_processes 8 \
  --num_machines 1 \
  --main_process_port 19040 \
  train.py \
  -opt train_config.yml
```
- Wandb: Set `use_wandb` to `True` in the config to enable wandb monitoring.
- Periodic Evaluation: Set `val_freq` to control how often evaluation runs during training.
- Auto Resume: Simply rerun the script; the model will find the latest checkpoint to resume from, and the wandb log will resume automatically.
- Efficient Training on Pre-Extracted Latents: Set `use_latent` to `True`, and set `data_list` to the corresponding list of latent paths (see the sketch below).
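A minimal sketch of toggling these options programmatically (assuming PyYAML is available); the key names come from the notes above, but their exact nesting inside `train_config.yml` may differ, and the values are only placeholders:

```python
import yaml  # PyYAML

# Illustrative only: key names are taken from the notes above; their nesting in
# train_config.yml may differ, and the values here are placeholders.
with open("train_config.yml") as f:
    cfg = yaml.safe_load(f)

cfg["use_wandb"] = True                        # enable wandb monitoring
cfg["val_freq"] = 5000                         # evaluate every 5000 iterations (placeholder)
cfg["use_latent"] = True                       # train on pre-extracted latents
cfg["data_list"] = ["datasets/dmlab_latent"]   # placeholder latent path list

with open("my_train_config.yml", "w") as f:
    yaml.safe_dump(cfg, f)
```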
To evaluate a pretrained model, copy the training config and change `pretrain_network: ~` to point to your trained model folder. Then run the following script:
```bash
accelerate launch \
  --num_processes 8 \
  --num_machines 1 \
  --main_process_port 10410 \
  test.py \
  -opt test_config.yml
```
This project is licensed under the MIT License - see the LICENSE file for details.
If our work assists your research, feel free to give us a star ⭐ or cite us using:
```bibtex
@article{gu2025long,
  title={Long-Context Autoregressive Video Modeling with Next-Frame Prediction},
  author={Gu, Yuchao and Mao, Weijia and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2503.19325},
  year={2025}
}
```