StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners, Yonglong Tian+, N/A, arXiv'23

Jun 16, 2023

Abstract

We investigate the potential of learning visual representations using synthetic images generated by text-to-image models. This is a natural question in light of the excellent performance of such models in generating high-quality images. We consider specifically Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat the real-image counterpart; (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass those learned by SimCLR and CLIP using the same set of text prompts and corresponding real images on large-scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
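For point (1), the classifier-free guidance scale is exposed directly in common Stable Diffusion tooling. Below is a minimal sketch using Hugging Face diffusers; the checkpoint id and the guidance scale of 8.0 are illustrative assumptions, not the paper's exact configuration. `num_images_per_prompt` produces the multiple samples per caption that StableRep later treats as mutual positives.

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint and hyperparameters; the paper's tuned values may differ.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of a golden retriever playing in the snow"
# Multiple images per prompt become positives for each other; the
# classifier-free guidance scale is the knob the abstract says must be tuned.
images = pipe(prompt, num_images_per_prompt=4, guidance_scale=8.0).images
```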
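For point (2), here is a minimal PyTorch sketch of a multi-positive contrastive objective in the spirit the abstract describes: images generated from the same prompt define the positive set, and the loss is the cross-entropy between a uniform distribution over those positives and the softmax over pairwise similarities. The function name, temperature, and batching scheme are assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(embeddings: torch.Tensor,
                                    prompt_ids: torch.Tensor,
                                    temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss where images sharing a text prompt are positives.

    embeddings: (N, D) encoder features; prompt_ids: (N,) prompt index per image.
    Hypothetical sketch of the multi-positive idea, not the paper's exact recipe.
    """
    z = F.normalize(embeddings, dim=1)            # (N, D) unit-norm features
    logits = z @ z.t() / temperature              # (N, N) pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
    # Target distribution: uniform over the other images from the same prompt.
    pos = (prompt_ids[:, None] == prompt_ids[None, :]) & ~self_mask
    target = pos.float()
    target = target / target.sum(dim=1, keepdim=True).clamp(min=1)
    # Cross-entropy between the target distribution and the softmax over logits.
    loss = -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    return loss
```

For a batch built from B prompts with m generated images each, `prompt_ids` would repeat each prompt index m times, giving every image m-1 positives.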