Yitong Chen1,2,3, Zuxuan Wu1,2,3, Xipeng Qiu1,2,3, Yu-Gang Jiang1,3
1 Institute of Trustworthy Embodied AI, Fudan University
2 Shanghai Innovation Institute
3 Shanghai Key Laboratory of Multimodal Embodied AI
[Paper CVPR-26]
[Project Page]
- Clone the repository:

```shell
git clone https://github.com/ShareLab-SII/CaTok
cd CaTok
```

- Install dependencies:

```shell
pip install -r requirements.txt
```

Please first download ImageNet to a path of your own, then soft-link it into this repo:

```shell
mkdir -p dataset
ln -s /path/to/imagenet ./dataset/imagenet
```

Expected layout:

```
dataset/
  imagenet/
    train/
    val/
    val256/   # optional, for FID real images
```

The default config uses `./dataset/imagenet/`.

For FID evaluation, it is recommended to preprocess the ImageNet validation images to 256x256 and place them in `./dataset/imagenet/val256` (you can use the script `prepare_imgnet_val.py`).
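The val256 preprocessing boils down to a center crop followed by a resize to 256x256. A minimal sketch of the crop geometry (the helper below is illustrative only, not the repo's `prepare_imgnet_val.py`):

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square crop.

    With Pillow, each validation image would then be cropped with this box
    and resized to 256x256 before being saved into val256/.
    """
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)
```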
Create the pretrained folder:

```shell
mkdir -p pretrained
```

- Download MAR VAE (to `./pretrained/mar-vae-kl16`):

```shell
huggingface-cli download xwen99/mar-vae-kl16 --local-dir ./pretrained/mar-vae-kl16
```

- REPA backbone:
  - Supported presets in code:
    - `dinov2` -> `facebook/dinov2-base`
    - `dinov3` -> `facebook/dinov3-vitb16-pretrain-lvd1689m`
    - `siglip2` -> `google/siglip2-base-patch16-256`
  - The default training config currently uses `repa_encoder: dinov2`. `dinov2` and `siglip2` are downloaded automatically from Hugging Face on first run. `dinov3` is supported by the codebase, but you must download its checkpoint manually and point the config to a local path.

For a local dinov3 checkpoint, set in the config:

```yaml
model:
  params:
    repa_encoder: ./pretrained/dinov3_vitb16_pretrain_lvd1689m-73cec8be.pth
```

Optional manual pre-cache for the HF-based backbones (`dinov2`, `siglip2`):
```shell
python - <<'PY'
from transformers import AutoImageProcessor, AutoModel
models = [
    "facebook/dinov2-base",
    "google/siglip2-base-patch16-256",
]
for m in models:
    AutoImageProcessor.from_pretrained(m)
    AutoModel.from_pretrained(m)
    print(f"{m} cached")
PY
```

To switch the REPA backbone, edit `repa_encoder` in the config, e.g.:
```yaml
repa_encoder: dinov3
```

```yaml
repa_encoder: siglip2
```

or a local file path for dinov3:

```yaml
repa_encoder: ./pretrained/dinov3_vitb16_pretrain_lvd1689m-73cec8be.pth
```
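A minimal sketch of how such a `repa_encoder` value might be resolved, mirroring the preset table above (the function name and dictionary are hypothetical, not the repo's API):

```python
import os

# Preset names and model ids mirrored from the table above.
REPA_PRESETS = {
    "dinov2": "facebook/dinov2-base",
    "dinov3": "facebook/dinov3-vitb16-pretrain-lvd1689m",
    "siglip2": "google/siglip2-base-patch16-256",
}

def resolve_repa_encoder(value: str) -> str:
    """Map a repa_encoder config value to an HF model id or local checkpoint path."""
    if value in REPA_PRESETS:
        return REPA_PRESETS[value]
    if os.path.exists(value):  # local checkpoint, e.g. the dinov3 .pth file
        return value
    raise ValueError(f"Unknown repa_encoder: {value!r}")
```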
- FID stats:
  - The default file is already in the repo: `fid_stats/adm_in256_stats.npz`.
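Stats files of this kind typically store the Inception feature mean and covariance of the reference set (the key names `mu`/`sigma` below are an assumption, not verified against this file). The Fréchet distance between two such stats can be sketched as:

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    s1h = _sqrtm_psd(sigma1)
    covmean = _sqrtm_psd(s1h @ sigma2 @ s1h)  # sqrtm(s1^1/2 @ s2 @ s1^1/2)
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * np.trace(covmean))

# Loading a stats file (key names are an assumption):
# stats = np.load("fid_stats/adm_in256_stats.npz")
# mu, sigma = stats["mu"], stats["sigma"]
```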
The currently released example training config is `configs/catok_b_256.yaml`.

Run training on 8 GPUs:

```shell
bash scripts/train_8gpu.sh configs/catok_b_256.yaml
```

Or directly:

```shell
torchrun --nproc-per-node=8 train_net.py --cfg configs/catok_b_256.yaml
```

Example tokenizer evaluation:

```shell
torchrun --nproc-per-node=8 test_net.py \
    --model ./output/catok_b_256 \
    --step 250000 \
    --cfg configs/catok_b_256.yaml \
    --cfg_value 1.0 \
    --test_num_slots 256 \
    --test_num_steps 25
```

Use `scripts/infer_recon.py` for controllable reconstruction:

- `--cfg`: classifier-free guidance scale
- `--num-tokens`: number of tokens used for reconstruction
- `--start-token`: token start index
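The `--cfg` flag is a classifier-free guidance scale. The usual combination rule it implies can be sketched as follows (this is the generic formula, not code taken from `infer_recon.py`):

```python
def apply_cfg(uncond, cond, cfg_scale):
    """Classifier-free guidance: push the unconditional prediction toward
    the conditional one by a factor of cfg_scale (1.0 = no extra guidance)."""
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]
```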
Example:

```shell
python scripts/infer_recon.py \
    --model-dir ./output/catok_b_256 \
    --config ./configs/catok_b_256.yaml \
    --checkpoint ./output/catok_b_256/models/step250000/custom_checkpoint_1.pkl \
    --image /path/to/your/input_image.webp \
    --cfg 2.0 \
    --num-tokens 256 \
    --start-token 0 \
    --sample-steps 25 \
    --output-dir ./infer_outputs
```

```bibtex
@inproceedings{catok2026,
  title={CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization},
  author={Chen, Yitong and Wu, Zuxuan and Qiu, Xipeng and Jiang, Yu-Gang},
  booktitle={CVPR},
  year={2026}
}
```

This codebase is built on Semanticist and inspired by MeanFlow. Most of the repository refactoring and cleanup was completed by an agent, so if you notice any issues, feel free to reach out.