We present a comparative study on how and why contrastive learning (CL) and masked image modeling (MIM) differ in their representations and in their performance on downstream tasks. In particular, we demonstrate that self-supervised Vision Transformers (ViTs) have the following properties: (1) CL trains self-attentions to capture longer-range global patterns than MIM, such as the shape of an object, especially in the later layers of the ViT architecture. This CL property helps ViTs linearly separate images in their representation spaces. However, it also makes the self-attentions collapse into homogeneity for all query tokens and heads. Such homogeneity of self-attention reduces the diversity of representations, worsening scalability and dense prediction performance. (2) CL utilizes the low-frequency signals of the representations, whereas MIM utilizes high-frequency signals. Since low- and high-frequency information respectively represent shapes and textures, CL is more shape-oriented and MIM more texture-oriented. (3) CL plays a crucial role in the later layers, while MIM mainly focuses on the early layers. Based on these analyses, we find that CL and MIM can complement each other and observe that even the simplest harmonization can help leverage the advantages of both methods. The code is available at https://github.com/naver-ai/cl-vs-mim.
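The "simplest harmonization" mentioned in the abstract suggests combining the two self-supervised objectives. Below is a minimal sketch of one such combination: a convex mixture of a contrastive (InfoNCE-style) loss and an MIM reconstruction loss. The function names, the specific loss forms, and the weight `lam` are illustrative assumptions, not the exact recipe from the linked repository.

```python
# Illustrative sketch only: a weighted sum of a CL loss and an MIM loss.
# Loss definitions and `lam` are assumptions, not the paper's exact setup.
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE) loss between two views' embeddings, shape [B, D]."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                 # pairwise similarities [B, B]
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)

def mim_loss(pred, target, mask):
    """MIM reconstruction loss: mean squared error on masked patches only.
    pred/target: [B, N, P] patch tensors; mask: [B, N] with 1 = masked."""
    per_patch = (pred - target).pow(2).mean(dim=-1)    # [B, N]
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

def combined_loss(z1, z2, pred, target, mask, lam=0.5):
    """Harmonized objective: (1 - lam) * CL + lam * MIM."""
    return (1 - lam) * info_nce_loss(z1, z2) + lam * mim_loss(pred, target, mask)

# Toy usage with random tensors standing in for ViT outputs.
B, D, N, P = 8, 256, 196, 768
loss = combined_loss(
    torch.randn(B, D), torch.randn(B, D),              # embeddings of two views
    torch.randn(B, N, P), torch.randn(B, N, P),        # predicted / target patches
    (torch.rand(B, N) < 0.6).float(),                  # random 60% patch mask
)
print(f"combined loss: {loss.item():.4f}")
```

Given finding (3), one could also weight the two losses differently per layer (MIM-heavy early, CL-heavy late), but the sketch above keeps a single global weight for simplicity.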