Ming Gui*1,2 Β· Johannes Schusterbauer*1,2 Β· Timy Phan1,2
Felix Krause1,2 Β· Josh Susskind3 Β· Miguel A. Bautista3 Β· Björn Ommer1,2
1CompVis Group @ LMU Munich
2MCML
3Apple
* equal contribution
We introduce Representation Tokenizer (RepTok), a generative framework that encodes each image into a single continuous latent token derived from self-supervised vision transformers. By jointly fine-tuning the semantic [cls]
token with a generative decoder, RepTok achieves faithful reconstructions while preserving the smooth, meaningful structure of the SSL space. This compact one-token formulation enables highly efficient latent-space generative modeling, delivering competitive results even under severely constrained training budgets.
Our approach builds on a pre-trained SSL encoder that is lightly fine-tuned and trained jointly with a generative decoder. We train the decoder with a standard flow matching objective, complemented by a cosine-similarity loss that regularizes the latent representation to remain close to its original smooth and semantically structured space, which is well-suited for generation. Without auxiliary perceptual or adversarial losses, the resulting model is able to faithfully decode the single-token latent representation into the pixel space.
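The sketch below illustrates this training objective under stated assumptions: a flow-matching reconstruction loss on the decoder plus a cosine-similarity regularizer that keeps the fine-tuned latent token close to the frozen SSL encoder's original token. This is not the released implementation; names such as `encoder`, `frozen_encoder`, `decoder`, and the weight `lambda_cos` are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def training_step(x, encoder, frozen_encoder, decoder, lambda_cos=0.1):
    # Single continuous latent token from the (lightly fine-tuned) SSL encoder.
    z = encoder(x)                      # (B, D) -- the adapted [cls] token
    with torch.no_grad():
        z_ref = frozen_encoder(x)       # (B, D) -- original SSL [cls] token

    # Standard flow-matching objective: regress the velocity (x1 - x0) along
    # the straight path x_t = (1 - t) * x0 + t * x1, conditioned on z.
    x1 = x                              # data sample
    x0 = torch.randn_like(x1)           # noise sample
    t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1, 1)
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = decoder(x_t, t.flatten(), z)   # decoder signature is an assumption
    loss_fm = F.mse_loss(v_pred, v_target)

    # Cosine-similarity regularizer: keep the adapted token in the smooth,
    # semantically structured neighborhood of the original SSL space.
    loss_cos = 1.0 - F.cosine_similarity(z, z_ref, dim=-1).mean()

    return loss_fm + lambda_cos * loss_cos
```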
This design enables highly efficient image synthesis training, allowing us to use simple, attention-free architectures such as MLP-Mixers for accelerated ImageNet training. Furthermore, we show that the framework naturally extends to text-to-image (T2I) synthesis: by incorporating cross-attention to integrate textual conditioning, our model achieves competitive zero-shot performance on the COCO benchmark under an extremely constrained training budget.
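As a rough illustration of how a single latent token can drive an attention-free backbone, the block below conditions an MLP-Mixer layer on the token via AdaLN-style scale/shift modulation. The modulation scheme, layer sizes, and names are assumptions for demonstration, not the released architecture.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_patches, dim, cond_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, num_patches * 4), nn.GELU(),
            nn.Linear(num_patches * 4, num_patches))
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(),
            nn.Linear(dim * 4, dim))
        # Scale/shift for both norms, predicted from the single latent token.
        self.modulation = nn.Linear(cond_dim, 4 * dim)

    def forward(self, x, z):
        # x: (B, num_patches, dim); z: (B, cond_dim) -- the single latent token
        s1, b1, s2, b2 = self.modulation(z).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + self.token_mlp(h.transpose(1, 2)).transpose(1, 2)  # token mixing
        h = self.norm2(x) * (1 + s2) + b2
        x = x + self.channel_mlp(h)                                 # channel mixing
        return x
```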
Our approach consistently achieves a substantially lower computational footprint while maintaining competitive performance on ImageNet.
This also extends to the general T2I setting: RepTok reaches SD 1.5 quality at a fraction of the cost of other methods while delivering better generative performance than other efficiency-focused methods.
Our approach augments the pre-trained SSL representations with the additional information needed to encode each image faithfully as a single continuous token, enabling both high-fidelity image reconstruction and synthesis.
We observe smooth transitions not only in semantic content but also in spatial configuration. This indicates that our method successfully integrates low-level spatial information while preserving the properties of the pretrained encoder's latent space, facilitating generation within the learned representation space.
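A minimal sketch of the latent interpolation behind these observations: linearly interpolate between the single-token latents of two images and decode each intermediate token. Here `encoder` and `sample_with_decoder` are hypothetical stand-ins for the RepTok encoder and flow-matching sampler.

```python
import torch

@torch.no_grad()
def interpolate(x_a, x_b, encoder, sample_with_decoder, steps=8):
    z_a, z_b = encoder(x_a), encoder(x_b)           # (1, D) latent tokens
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z = (1 - alpha) * z_a + alpha * z_b          # interpolated latent token
        frames.append(sample_with_decoder(z))        # decode back to pixel space
    return torch.cat(frames, dim=0)
```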
Using our approach, we trained a general T2I model that synthesizes coherent and aesthetically pleasing images on a minimal compute budget.
We are in the process of preparing the public release of the RepTok codebase.
The following items are planned:
- Release pretrained checkpoints
- Provide inference demo
Stay tuned: the code and pretrained models will be released soon!
If you use our work in your research, please cite it with the following BibTeX entry:
@misc{gui2025reptok,
title={Adapting Self-Supervised Representations as a Latent Space for Efficient Generation},
author={Ming Gui and Johannes Schusterbauer and Timy Phan and Felix Krause and Josh Susskind and Miguel Angel Bautista and Björn Ommer},
year={2025},
eprint={2510.14630},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.14630},
}