Ming Gui*1,2 Β· Johannes Schusterbauer*1,2 Β· Timy Phan1,2
Felix Krause1,2 Β· Josh Susskind3 Β· Miguel A. Bautista3 Β· Björn Ommer1,2
1CompVis Group @ LMU Munich
2MCML
3Apple
* equal contribution
We introduce Representation Tokenizer (RepTok), a generative framework that encodes each image into a single continuous latent token derived from self-supervised vision transformers. By jointly fine-tuning the semantic [cls]
token with a generative decoder, RepTok achieves faithful reconstructions while preserving the smooth, meaningful structure of the SSL space. This compact one-token formulation enables highly efficient latent-space generative modeling, delivering competitive results even under severely constrained training budgets.
Our approach builds on a pre-trained SSL encoder that is lightly fine-tuned and trained jointly with a generative decoder. We train the decoder with a standard flow matching objective, complemented by a cosine-similarity loss that regularizes the latent representation to remain close to its original smooth and semantically structured space, which is well-suited for generation. Without auxiliary perceptual or adversarial losses, the resulting model is able to faithfully decode the single-token latent representation into the pixel space.
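The sketch below illustrates this training objective under stated assumptions: a flow-matching reconstruction loss on the decoder plus a cosine-similarity regularizer that keeps the fine-tuned latent token close to the frozen SSL encoder's original token. This is not the released implementation; names such as `encoder`, `frozen_encoder`, `decoder`, and the weight `lambda_cos` are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def training_step(x, encoder, frozen_encoder, decoder, lambda_cos=0.1):
    # Single continuous latent token from the (lightly fine-tuned) SSL encoder.
    z = encoder(x)                      # (B, D) -- the adapted [cls] token
    with torch.no_grad():
        z_ref = frozen_encoder(x)       # (B, D) -- original SSL [cls] token

    # Standard flow-matching objective: regress the velocity (x1 - x0) along
    # the straight path x_t = (1 - t) * x0 + t * x1, conditioned on z.
    x1 = x                              # data sample
    x0 = torch.randn_like(x1)           # noise sample
    t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1, 1)
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = decoder(x_t, t.flatten(), z)   # decoder signature is an assumption
    loss_fm = F.mse_loss(v_pred, v_target)

    # Cosine-similarity regularizer: keep the adapted token in the smooth,
    # semantically structured neighborhood of the original SSL space.
    loss_cos = 1.0 - F.cosine_similarity(z, z_ref, dim=-1).mean()

    return loss_fm + lambda_cos * loss_cos
```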
This design enables highly efficient image synthesis training, allowing us to use simple, attention-free architectures such as MLP-Mixers for accelerated ImageNet training. Furthermore, we show that the framework naturally extends to text-to-image (T2I) synthesis: by incorporating cross-attention to integrate textual conditioning, our model achieves competitive zero-shot performance on the COCO benchmark under an extremely constrained training budget.
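As a rough illustration of how a single latent token can drive an attention-free backbone, the block below conditions an MLP-Mixer layer on the token via AdaLN-style scale/shift modulation. The modulation scheme, layer sizes, and names are assumptions for demonstration, not the released architecture.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_patches, dim, cond_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, num_patches * 4), nn.GELU(),
            nn.Linear(num_patches * 4, num_patches))
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(),
            nn.Linear(dim * 4, dim))
        # Scale/shift for both norms, predicted from the single latent token.
        self.modulation = nn.Linear(cond_dim, 4 * dim)

    def forward(self, x, z):
        # x: (B, num_patches, dim); z: (B, cond_dim) -- the single latent token
        s1, b1, s2, b2 = self.modulation(z).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + self.token_mlp(h.transpose(1, 2)).transpose(1, 2)  # token mixing
        h = self.norm2(x) * (1 + s2) + b2
        x = x + self.channel_mlp(h)                                 # channel mixing
        return x
```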
Our approach consistently achieves a substantially lower computational footprint while maintaining competitive performance on ImageNet.
This also extends to the general T2I setting: RepTok reaches SD 1.5 quality at a fraction of the cost of other methods while delivering better generative performance than other efficiency-focused methods.
Our approach augments the pre-trained SSL representations with the additional information needed to encode each image faithfully as a single continuous token, enabling both high-fidelity image reconstruction and synthesis.
We observe smooth transitions not only in semantic content but also in spatial configuration. This indicates that our method successfully integrates low-level spatial information while preserving the properties of the pretrained encoder's latent space, facilitating generation within the learned representation space.
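A minimal sketch of the latent interpolation behind these observations: linearly interpolate between the single-token latents of two images and decode each intermediate token. Here `encoder` and `sample_with_decoder` are hypothetical stand-ins for the RepTok encoder and flow-matching sampler.

```python
import torch

@torch.no_grad()
def interpolate(x_a, x_b, encoder, sample_with_decoder, steps=8):
    z_a, z_b = encoder(x_a), encoder(x_b)           # (1, D) latent tokens
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z = (1 - alpha) * z_a + alpha * z_b          # interpolated latent token
        frames.append(sample_with_decoder(z))        # decode back to pixel space
    return torch.cat(frames, dim=0)
```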
Using our approach, we trained a general T2I model that synthesizes coherent and aesthetically pleasing images on a minimal compute budget.
We are in the process of preparing the public release of the RepTok codebase.
The following items are planned:
- Release pretrained checkpoints
- Provide inference demo
Stay tuned: the code and pretrained models will be released soon!
If you use our work in your research, please cite it with the following BibTeX entry:
@misc{gui2025reptok,
title={Adapting Self-Supervised Representations as a Latent Space for Efficient Generation},
author={Ming Gui and Johannes Schusterbauer and Timy Phan and Felix Krause and Josh Susskind and Miguel Angel Bautista and Björn Ommer},
year={2025},
eprint={2510.14630},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.14630},
}