🦎 Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Ming Gui*1,2 · Johannes Schusterbauer*1,2 · Timy Phan1,2
Felix Krause1,2 · Josh Susskind3 · Miguel A. Bautista3 · Björn Ommer1,2

1CompVis Group @ LMU Munich    2MCML    3Apple

* equal contribution

Paper: https://arxiv.org/abs/2510.14630

🔥 TL;DR

We introduce Representation Tokenizer (RepTok🦎), a generative framework that encodes each image into a single continuous latent token derived from self-supervised vision transformers. By jointly fine-tuning the semantic [cls] token with a generative decoder, RepTok achieves faithful reconstructions while preserving the smooth, meaningful structure of the SSL space. This compact one-token formulation enables highly efficient latent-space generative modeling, delivering competitive results even under severely constrained training budgets.

📐 Overview

Our approach builds on a pre-trained SSL encoder that is lightly fine-tuned and trained jointly with a generative decoder. We train the decoder with a standard flow matching objective, complemented by a cosine-similarity loss that regularizes the latent representation to remain close to its original smooth and semantically structured space, which is well-suited for generation. Even without auxiliary perceptual or adversarial losses, the resulting model faithfully decodes the single-token latent back into pixel space.

[Figure: RepTok training pipeline]
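To make the training objective concrete, here is a minimal PyTorch-style sketch of the joint loss. It is illustrative only: `encoder`, `frozen_encoder`, `decoder`, and the weight `lambda_cos` are assumed names, not the released API.

```python
import torch
import torch.nn.functional as F

def training_loss(encoder, frozen_encoder, decoder, images, lambda_cos=0.5):
    # Single-token latent from the lightly fine-tuned SSL encoder ([cls] token).
    z = encoder(images)                          # (B, D)
    with torch.no_grad():
        z_ref = frozen_encoder(images)           # original SSL [cls] token, (B, D)

    # Standard flow matching: regress the constant velocity of the
    # straight path from noise (t = 0) to data (t = 1).
    x1 = images
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1, 1)
    xt = (1.0 - t) * x0 + t * x1                 # point on the straight path
    v_target = x1 - x0                           # its constant velocity

    v_pred = decoder(xt, t.flatten(), z)         # decoder conditioned on the token
    loss_fm = F.mse_loss(v_pred, v_target)

    # Cosine-similarity regularizer keeps the fine-tuned token close to
    # the frozen SSL embedding, preserving its smooth, semantic structure.
    loss_cos = 1.0 - F.cosine_similarity(z, z_ref, dim=-1).mean()
    return loss_fm + lambda_cos * loss_cos
```

The flow-matching term trains the decoder as a velocity field, while the cosine term anchors the latent space to the original SSL geometry.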

This design enables highly efficient image synthesis training, allowing us to use simple, attention-free architectures such as MLP-Mixers for accelerated ImageNet training. Furthermore, we show that the framework naturally extends to text-to-image (T2I) synthesis: by incorporating cross-attention to integrate textual conditioning, our model achieves competitive zero-shot performance on the COCO benchmark under an extremely constrained training budget.
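As a rough sketch of how the textual conditioning can be wired in, the block below adds residual cross-attention from decoder features to text-encoder tokens; the layout and names are assumptions, not the actual architecture.

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Residual cross-attention from decoder features to text tokens (illustrative)."""

    def __init__(self, dim, text_dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, kdim=text_dim,
                                          vdim=text_dim, batch_first=True)

    def forward(self, x, text_tokens):
        # x: (B, N, dim) decoder features; text_tokens: (B, T, text_dim).
        attended, _ = self.attn(self.norm(x), text_tokens, text_tokens)
        return x + attended
```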

📈 Results

⏳ Efficiency

Our approach consistently achieves a substantially lower computational footprint while maintaining competitive performance on ImageNet.

[Figure: ImageNet training-efficiency comparison]

This also extends to the general T2I setting: RepTok matches SD 1.5 quality at a fraction of the training cost while outperforming other efficiency-focused methods.

[Figure: T2I training-efficiency comparison]

🌇 Qualitative Reconstructions

Our approach augments the pre-trained SSL representation with the additional information needed to encode an image faithfully as a single continuous token, enabling both high-fidelity reconstruction and synthesis.

[Figure: ImageNet reconstructions]
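For intuition on how a single token is decoded back to pixels, here is a minimal Euler-integration sketch of the learned flow, matching the noise-to-data convention of the loss sketch above; the `decoder` interface, image shape, and step count are assumptions.

```python
import torch

@torch.no_grad()
def decode_token(decoder, z, shape=(1, 3, 256, 256), steps=50):
    x = torch.randn(shape, device=z.device)          # start from pure noise at t = 0
    ts = torch.linspace(0.0, 1.0, steps + 1, device=z.device)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        v = decoder(x, t0.expand(x.size(0)), z)      # predicted velocity at time t0
        x = x + (t1 - t0) * v                        # Euler step toward the data endpoint
    return x
```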

🐯 Interpolation Results

We observe smooth transitions not only in semantic content but also in spatial configuration. This indicates that our method successfully integrates low-level spatial information while preserving the properties of the pretrained encoder's latent space, facilitating generation within the learned representation.

[Figure: Latent interpolations]
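A hypothetical interpolation loop under the same assumed interfaces: encode both images to their single tokens, walk a straight line between them, and decode each point, e.g. with the Euler sampler sketched above.

```python
import torch

@torch.no_grad()
def interpolate(encoder, sample_fn, img_a, img_b, steps=8):
    # sample_fn maps a latent token to an image (e.g. the decode_token sketch above).
    za, zb = encoder(img_a), encoder(img_b)      # single-token latents, (1, D)
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z = torch.lerp(za, zb, alpha.item())     # straight line in the latent space
        frames.append(sample_fn(z))
    return torch.cat(frames)
```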

🔥 T2I Results

Using our approach, we train a general T2I model that synthesizes coherent and aesthetically pleasing images on a minimal compute budget.

[Figure: T2I samples]

🚀 To-Do

We are preparing the public release of the RepTok codebase. The following items are planned:

  • Release pretrained checkpoints
  • Provide an inference demo

Stay tuned: the code and pretrained models will be released soon!

🎓 Citation

If you use our work in your research, please use the following BibTeX entry:

@misc{gui2025reptok,
  title={Adapting Self-Supervised Representations as a Latent Space for Efficient Generation}, 
  author={Ming Gui and Johannes Schusterbauer and Timy Phan and Felix Krause and Josh Susskind and Miguel Angel Bautista and Björn Ommer},
  year={2025},
  eprint={2510.14630},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.14630}, 
}
