We present a comparative study on how and why contrastive learning (CL) and masked image modeling (MIM) differ in their representations and in their performance on downstream tasks. In particular, we demonstrate that self-supervised Vision Transformers (ViTs) have the following properties: (1) CL trains self-attentions to capture longer-range global patterns than MIM, such as the shape of an object, especially in the later layers of the ViT architecture. This CL property helps ViTs linearly separate images in their representation spaces. However, it also makes the self-attentions collapse into homogeneity for all query tokens and heads. Such homogeneity of self-attention reduces the diversity of representations, worsening scalability and dense prediction performance. (2) CL utilizes the low-frequency signals of the representations, whereas MIM utilizes high-frequency signals. Since low- and high-frequency information respectively represent shapes and textures, CL is more shape-oriented and MIM more texture-oriented. (3) CL plays a crucial role in the later layers, while MIM mainly focuses on the early layers. Based on these analyses, we find that CL and MIM can complement each other and observe that even the simplest harmonization can help leverage the advantages of both methods. The code is available at https://github.com/naver-ai/cl-vs-mim.
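The "simplest harmonization" mentioned in the abstract suggests combining the two self-supervised objectives. Below is a minimal sketch of one such combination: a convex mixture of a contrastive (InfoNCE-style) loss and an MIM reconstruction loss. The function names, the specific loss forms, and the weight `lam` are illustrative assumptions, not the exact recipe from the linked repository.

```python
# Illustrative sketch only: a weighted sum of a CL loss and an MIM loss.
# Loss definitions and `lam` are assumptions, not the paper's exact setup.
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE) loss between two views' embeddings, shape [B, D]."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                 # pairwise similarities [B, B]
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)

def mim_loss(pred, target, mask):
    """MIM reconstruction loss: mean squared error on masked patches only.
    pred/target: [B, N, P] patch tensors; mask: [B, N] with 1 = masked."""
    per_patch = (pred - target).pow(2).mean(dim=-1)    # [B, N]
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

def combined_loss(z1, z2, pred, target, mask, lam=0.5):
    """Harmonized objective: (1 - lam) * CL + lam * MIM."""
    return (1 - lam) * info_nce_loss(z1, z2) + lam * mim_loss(pred, target, mask)

# Toy usage with random tensors standing in for ViT outputs.
B, D, N, P = 8, 256, 196, 768
loss = combined_loss(
    torch.randn(B, D), torch.randn(B, D),              # embeddings of two views
    torch.randn(B, N, P), torch.randn(B, N, P),        # predicted / target patches
    (torch.rand(B, N) < 0.6).float(),                  # random 60% patch mask
)
print(f"combined loss: {loss.item():.4f}")
```

Given finding (3), one could also weight the two losses differently per layer (MIM-heavy early, CL-heavy late), but the sketch above keeps a single global weight for simplicity.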