Foundational Models Defining a New Era in Vision: A Survey and Outlook, Muhammad Awais+, N/A, arXiv'23 #914

AkihikoWatanabe · 2023-08-08T08:45:12Z

URL

https://arxiv.org/abs/2307.13721

Affiliations

Muhammad Awais, N/A
Muzammal Naseer, N/A
Salman Khan, N/A
Rao Muhammad Anwer, N/A
Hisham Cholakkal, N/A
Mubarak Shah, N/A
Ming-Hsuan Yang, N/A
Fahad Shahbaz Khan, N/A

Abstract

Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundational models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundational models, including typical architecture designs to combine different modalities (vision, text, audio, etc), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns; textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundational models in computer vision, including difficulties in their evaluations and benchmarking, gaps in their real-world understanding, limitations of their contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively. A comprehensive list of foundational models studied in this work is available at https://github.com/awaisrauf/Awesome-CV-Foundational-Models.

Translation (by gpt-3.5-turbo)

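Vision systems are fundamental to seeing and understanding the compositional nature of visual scenes. The complex relations between objects and their locations, and the ambiguities and variations of real-world environments, can be better explained in human language, which is naturally described by grammatical rules and other modalities such as audio and depth. Models trained to connect these modalities, together with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time. These models are called foundation models. The output of such models can be changed through human-provided prompts without retraining: for example, segmenting a particular object with a bounding box, asking questions about an image or video scene, or controlling a robot's behavior through language instructions. This survey provides a comprehensive review of these emerging foundation models. It covers typical architecture designs for combining different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and common prompting patterns (textual, visual, heterogeneous). It also discusses challenges and research directions for foundation models in computer vision, including difficulties in evaluation and benchmarking, gaps in real-world understanding, limitations of contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. In addition, it reviews recent developments in the field comprehensively and systematically. A comprehensive list of the foundation models studied in this work is available at https://github.com/awaisrauf/Awesome-CV-Foundational-Models.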

Summary (by gpt-3.5-turbo)

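This work provides a comprehensive review of foundation models for vision systems. It covers architecture designs for combining different modalities, training objectives, and training datasets, among other topics. It also discusses the evaluation of foundation models, open challenges, and recent developments. A detailed list is available at https://github.com/awaisrauf/Awesome-CV-Foundational-Models.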

AkihikoWatanabe · 2023-08-08T08:45:57Z

A survey of foundation models in CV. It discusses the remaining challenges and research directions.

AkihikoWatanabe added the Pocket label Aug 8, 2023

AkihikoWatanabe changed the title あ Foundational Models Defining a New Era in Vision: A Survey and Outlook, Muhammad Awais+, N/A, arXiv'23 Aug 8, 2023

AkihikoWatanabe added Survey ComputerVision FoundationModel and removed Pocket labels Aug 8, 2023
