Yitong Chen1,2,3, Zuxuan Wu1,2,3, Xipeng Qiu1,2,3, Yu-Gang Jiang1,3
1 Institute of Trustworthy Embodied AI, Fudan University
2 Shanghai Innovation Institute
3 Shanghai Key Laboratory of Multimodal Embodied AI
[Paper CVPR-26]
[Project Page]
- Clone the repository:

```shell
git clone https://github.com/ShareLab-SII/CaTok
cd CaTok
```

- Install dependencies:

```shell
pip install -r requirements.txt
```

Please first download ImageNet to a path of your own, then soft-link it into this repo:

```shell
mkdir -p dataset
ln -s /path/to/imagenet ./dataset/imagenet
```

Expected layout:

```
dataset/
  imagenet/
    train/
    val/
    val256/   # optional, for FID real images
```

The default config uses `./dataset/imagenet/`.

For FID evaluation, it is recommended to preprocess the ImageNet validation images to 256x256 and place them in `./dataset/imagenet/val256` (you can use the script `prepare_imgnet_val.py`).
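The val256 preprocessing boils down to a center crop followed by a resize to 256x256. A minimal sketch of the crop geometry (the helper below is illustrative only, not the repo's `prepare_imgnet_val.py`):

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square crop.

    With Pillow, each validation image would then be cropped with this box
    and resized to 256x256 before being saved into val256/.
    """
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)
```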
Create the pretrained folder:

```shell
mkdir -p pretrained
```

- Download MAR VAE (to `./pretrained/mar-vae-kl16`):

```shell
huggingface-cli download xwen99/mar-vae-kl16 --local-dir ./pretrained/mar-vae-kl16
```

- REPA backbone:
  - Supported presets in code:
    - `dinov2` -> `facebook/dinov2-base`
    - `dinov3` -> `facebook/dinov3-vitb16-pretrain-lvd1689m`
    - `siglip2` -> `google/siglip2-base-patch16-256`
  - The default training config currently uses `repa_encoder: dinov2`. `dinov2` and `siglip2` are downloaded automatically from Hugging Face on first run. `dinov3` is supported by the codebase, but you must download its checkpoint manually and point the config to a local path.

For a local dinov3 checkpoint, set in the config:

```yaml
model:
  params:
    repa_encoder: ./pretrained/dinov3_vitb16_pretrain_lvd1689m-73cec8be.pth
```

Optional manual pre-cache for the HF-based backbones (`dinov2`, `siglip2`):
```shell
python - <<'PY'
from transformers import AutoImageProcessor, AutoModel
models = [
    "facebook/dinov2-base",
    "google/siglip2-base-patch16-256",
]
for m in models:
    AutoImageProcessor.from_pretrained(m)
    AutoModel.from_pretrained(m)
    print(f"{m} cached")
PY
```

To switch the REPA backbone, edit `repa_encoder` in the config, e.g.:
```yaml
repa_encoder: dinov3
```

```yaml
repa_encoder: siglip2
```

or a local file path for dinov3:

```yaml
repa_encoder: ./pretrained/dinov3_vitb16_pretrain_lvd1689m-73cec8be.pth
```
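A minimal sketch of how such a `repa_encoder` value might be resolved, mirroring the preset table above (the function name and dictionary are hypothetical, not the repo's API):

```python
import os

# Preset names and model ids mirrored from the table above.
REPA_PRESETS = {
    "dinov2": "facebook/dinov2-base",
    "dinov3": "facebook/dinov3-vitb16-pretrain-lvd1689m",
    "siglip2": "google/siglip2-base-patch16-256",
}

def resolve_repa_encoder(value: str) -> str:
    """Map a repa_encoder config value to an HF model id or local checkpoint path."""
    if value in REPA_PRESETS:
        return REPA_PRESETS[value]
    if os.path.exists(value):  # local checkpoint, e.g. the dinov3 .pth file
        return value
    raise ValueError(f"Unknown repa_encoder: {value!r}")
```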
- FID stats:
  - The default file is already in the repo: `fid_stats/adm_in256_stats.npz`.
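Stats files of this kind typically store the Inception feature mean and covariance of the reference set (the key names `mu`/`sigma` below are an assumption, not verified against this file). The Fréchet distance between two such stats can be sketched as:

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    s1h = _sqrtm_psd(sigma1)
    covmean = _sqrtm_psd(s1h @ sigma2 @ s1h)  # sqrtm(s1^1/2 @ s2 @ s1^1/2)
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * np.trace(covmean))

# Loading a stats file (key names are an assumption):
# stats = np.load("fid_stats/adm_in256_stats.npz")
# mu, sigma = stats["mu"], stats["sigma"]
```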
The currently released example training config is `configs/catok_b_256.yaml`.

Run training on 8 GPUs:

```shell
bash scripts/train_8gpu.sh configs/catok_b_256.yaml
```

Or directly:

```shell
torchrun --nproc-per-node=8 train_net.py --cfg configs/catok_b_256.yaml
```

Example tokenizer evaluation:

```shell
torchrun --nproc-per-node=8 test_net.py \
    --model ./output/catok_b_256 \
    --step 250000 \
    --cfg configs/catok_b_256.yaml \
    --cfg_value 1.0 \
    --test_num_slots 256 \
    --test_num_steps 25
```

Use `scripts/infer_recon.py` for controllable reconstruction:

- `--cfg`: classifier-free guidance scale
- `--num-tokens`: number of tokens used for reconstruction
- `--start-token`: token start index
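The `--cfg` flag is a classifier-free guidance scale. The usual combination rule it implies can be sketched as follows (this is the generic formula, not code taken from `infer_recon.py`):

```python
def apply_cfg(uncond, cond, cfg_scale):
    """Classifier-free guidance: push the unconditional prediction toward
    the conditional one by a factor of cfg_scale (1.0 = no extra guidance)."""
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]
```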
Example:

```shell
python scripts/infer_recon.py \
    --model-dir ./output/catok_b_256 \
    --config ./configs/catok_b_256.yaml \
    --checkpoint ./output/catok_b_256/models/step250000/custom_checkpoint_1.pkl \
    --image /path/to/your/input_image.webp \
    --cfg 2.0 \
    --num-tokens 256 \
    --start-token 0 \
    --sample-steps 25 \
    --output-dir ./infer_outputs
```

```bibtex
@inproceedings{catok2026,
  title={CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization},
  author={Chen, Yitong and Wu, Zuxuan and Qiu, Xipeng and Jiang, Yu-Gang},
  booktitle={CVPR},
  year={2026}
}
```

This codebase is built on Semanticist and inspired by MeanFlow. Most of the repository refactoring and cleanup was completed by an agent, so if you notice any issues, feel free to reach out.