We present SSL-R1, a generic self-supervised RL post-training framework that derives intrinsically verifiable rewards directly from input images. SSL-R1 is vision-centric, cost-effective, and scalable, requiring neither human annotation nor supervision from external models.
🤩 Key Properties
👀 For more results, please refer to our paper.
- [04/2026] 🔥 SSL-R1 is released on arXiv.
SSL-R1 is a generic self-supervised RL-based post-training framework. We repurpose five self-supervised tasks widely used in the vision literature as verifiable reward signals within an RLVR framework. The tasks target different aspects of visual information and together provide comprehensive coverage of vision-centric reasoning capabilities.
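The five tasks are not enumerated in this excerpt, so as a hypothetical illustration consider rotation prediction, a standard self-supervised task: the rotation angle applied to the input image is free supervision, and the model's answer can be verified without any human labels or external models. All function names below are our own, not from the SSL-R1 codebase; the image is modeled as a simple 2D grid to keep the sketch self-contained.

```python
import random

def rotate90(grid, times):
    """Rotate a 2D grid (list of rows) counter-clockwise by 90 degrees, `times` times."""
    for _ in range(times % 4):
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid

def make_rotation_task(grid, rng=random):
    """Create one self-supervised example: the ground-truth angle is
    derived purely from the input image, with no human annotation."""
    k = rng.choice([0, 1, 2, 3])          # 0 / 90 / 180 / 270 degrees
    return rotate90(grid, k), 90 * k

def rotation_reward(predicted_deg, true_deg):
    """Intrinsically verifiable binary reward for RLVR-style post-training:
    1.0 iff the model's predicted rotation matches the applied rotation."""
    return 1.0 if predicted_deg % 360 == true_deg % 360 else 0.0
```

Because the reward is a deterministic check against supervision mined from the image itself, it plugs into any RLVR loop (e.g., GRPO-style policy optimization) at no labeling cost.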
We provide qualitative examples comparing the baseline model (Qwen2.5-VL-7B) with SSL-R1 on three types of vision-centric multimodal benchmarks.
If you find this work useful for your research, please consider citing our paper:
@article{xie2026ssl,
  title   = {SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models},
  author  = {Xie, Jiahao and Tonioni, Alessio and Rauschmayr, Nathalie and Tombari, Federico and Schiele, Bernt},
  journal = {arXiv preprint arXiv:2604.20705},
  year    = {2026}
}

