ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, Wonjae Kim+, N/A, arXiv'21 #1009

AkihikoWatanabe · 2023-08-22T03:21:06Z

URL

https://arxiv.org/abs/2102.03334

Affiliations

Wonjae Kim, N/A
Bokyung Son, N/A
Ildoo Kim, N/A

Abstract

Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches to VLP heavily rely on image feature extraction processes, most of which involve region supervision (e.g., object detection) and the convolutional architecture (e.g., ResNet). Although disregarded in the literature, we find it problematic in terms of both (1) efficiency/speed, that simply extracting input features requires much more computation than the multimodal interaction steps; and (2) expressive power, as it is upper bounded to the expressive power of the visual embedder and its predefined visual vocabulary. In this paper, we present a minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically simplified to just the same convolution-free manner that we process textual inputs. We show that ViLT is up to tens of times faster than previous VLP models, yet with competitive or better downstream task performance. Our code and pre-trained weights are available at https://github.com/dandelin/vilt.
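The "convolution-free" handling of visual inputs described in the abstract can be illustrated with a small sketch. The code below is not taken from the authors' repository; it is a minimal pure-Python illustration, assuming the image is cut into non-overlapping patches, each patch is flattened and passed through a single linear projection, and the resulting vectors are concatenated with text token embeddings into one sequence. All names (`patchify`, `linear_project`, `build_input_sequence`) and the toy sizes are hypothetical, and position embeddings, modality-type embeddings, and the transformer itself are deliberately omitted.

```python
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    flat.extend(pixel)  # append every channel value of this pixel
            patches.append(flat)
    return patches


def linear_project(vectors, weight, bias):
    """Apply y = xW + b to each vector; weight has shape (in_dim, out_dim)."""
    out_dim = len(bias)
    return [
        [sum(x * w_row[j] for x, w_row in zip(vec, weight)) + bias[j]
         for j in range(out_dim)]
        for vec in vectors
    ]


def build_input_sequence(text_embeddings, image, patch, weight, bias):
    """Concatenate text token embeddings with linearly projected image patches."""
    visual_embeddings = linear_project(patchify(image, patch), weight, bias)
    return text_embeddings + visual_embeddings


# Toy sizes chosen only for illustration (hypothetical, not the paper's settings).
HIDDEN, PATCH, HEIGHT, WIDTH, CHANNELS, NUM_TOKENS = 8, 2, 4, 4, 3, 5
rng = random.Random(0)

image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(WIDTH)]
         for _ in range(HEIGHT)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN)]
          for _ in range(PATCH * PATCH * CHANNELS)]
bias = [0.0] * HIDDEN
text_embeddings = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN)]
                   for _ in range(NUM_TOKENS)]

sequence = build_input_sequence(text_embeddings, image, PATCH, weight, bias)
print(len(sequence), len(sequence[0]))  # 9 8: 5 text tokens + 4 patches, each of width 8
```

The point of the sketch is that the visual side needs no object detector or CNN backbone before the transformer: the only image-specific computation is a reshape and one matrix multiplication, which is where the efficiency argument in the abstract comes from.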

Translation (by gpt-3.5-turbo)

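Vision-and-Language Pre-training (VLP) has improved performance on a variety of joint vision-and-language tasks.
Current VLP approaches depend heavily on image feature extraction, much of which involves region supervision (e.g., object detection) and convolutional architectures (e.g., ResNet).
Although the literature overlooks this, we consider it problematic in two respects: (1) efficiency/speed, because merely extracting input features requires far more computation than the multimodal interaction steps; and (2) expressive power, because it is capped by the expressive power of the visual embedder and its predefined visual vocabulary.
In this paper we propose a minimal VLP model, the Vision-and-Language Transformer (ViLT), which drastically simplifies the processing of visual inputs to the same convolution-free approach used for textual inputs.
We show that ViLT is up to tens of times faster than previous VLP models while delivering competitive or better downstream task performance.
Our code and pre-trained weights are available at https://github.com/dandelin/vilt.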

Summary (by gpt-3.5-turbo)

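Vision-and-Language Pre-training (VLP) approaches have improved performance on vision-and-language tasks, but current methods have problems in both efficiency and expressive power. This work therefore proposes ViLT, a convolution-free Vision-and-Language Transformer. ViLT is fast while achieving competitive performance, and its code and pre-trained weights are available on GitHub.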

AkihikoWatanabe added the Pocket label Aug 22, 2023

AkihikoWatanabe changed the title a ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, Wonjae Kim+, N/A, arXiv'21 Aug 22, 2023
