SVIT: Scaling up Visual Instruction Tuning, Bo Zhao+, N/A, arXiv'23 #792

AkihikoWatanabe · 2023-07-11T11:04:13Z

URL

Thanks to the emergence of foundation models, large language and vision models are integrated to acquire multimodal abilities such as visual captioning, dialogue, and question answering. Although existing multimodal models present impressive performance in visual understanding and reasoning, their limits are still largely under-explored due to the scarcity of high-quality instruction tuning data. To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 3.2 million visual instruction tuning data including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, and 106K detailed image descriptions. Besides the volume, the proposed dataset is also characterized by high quality and rich diversity, and is generated by prompting GPT-4 with the abundant manual annotations of images. We empirically verify that training multimodal models on SVIT can significantly improve multimodal performance in terms of visual perception, reasoning, and planning.

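With the recent emergence of foundation models, large language models and vision models have been integrated, making it possible to acquire multimodal abilities such as visual captioning, dialogue, and question answering.
Existing multimodal models show impressive performance in visual understanding and reasoning, but their limits have not yet been fully explored because high-quality instruction tuning data is scarce.
To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a visual instruction tuning dataset of 3.2 million items (including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, and 106K detailed image descriptions).
The proposed dataset is characterized not only by its volume but also by high quality and rich diversity. It is generated by prompting GPT-4 with abundant manual annotations of images.
We empirically verified that training multimodal models on SVIT significantly improves multimodal performance in visual perception, reasoning, and planning.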

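To improve the capabilities of multimodal models that integrate large language and vision models, a new dataset, SVIT, was constructed. SVIT is a high-quality, diverse visual instruction tuning dataset generated by prompting GPT-4 with manual image annotations, and training multimodal models on it was shown to significantly improve multimodal performance.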
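The abstract says the data is generated by prompting GPT-4 with manual annotations of images, but this note does not record the prompts or the annotation format. The following is only a minimal sketch of what assembling such a text prompt could look like. The `ImageAnnotation` fields, the task names, and the instruction wording are all hypothetical, and the actual model call is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ImageAnnotation:
    """Hypothetical container for one image's manual annotations."""
    image_id: str
    captions: list[str] = field(default_factory=list)
    # Each object is (label, (x, y, width, height)); the format is an assumption.
    objects: list[tuple[str, tuple[int, int, int, int]]] = field(default_factory=list)


# Hypothetical instructions, one per data type named in the abstract.
TASK_INSTRUCTIONS = {
    "conversation": "Write a multi-turn question-answer conversation about the image.",
    "complex_reasoning": "Write question-answer pairs that require reasoning about the image.",
    "detail_description": "Write a detailed description of the image.",
}


def build_prompt(annotation: ImageAnnotation, task: str) -> str:
    """Turn the annotations into a text-only prompt for a language model."""
    if task not in TASK_INSTRUCTIONS:
        raise ValueError(f"unknown task: {task}")
    lines = ["Image annotations:"]
    lines += [f"- caption: {caption}" for caption in annotation.captions]
    lines += [f"- object: {label} at {box}" for label, box in annotation.objects]
    lines.append(TASK_INSTRUCTIONS[task])
    return "\n".join(lines)


if __name__ == "__main__":
    example = ImageAnnotation(
        image_id="img_0001",  # made-up identifier for illustration
        captions=["A dog runs on a beach."],
        objects=[("dog", (40, 60, 120, 90))],
    )
    print(build_prompt(example, "conversation"))
```

The point of the sketch is that the language model never sees pixels in this setup: the image is represented only by its annotations, so the quality of the generated QA pairs depends on how rich those annotations are.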

AkihikoWatanabe added the Pocket label Jul 11, 2023
