Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundational models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating the robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundational models, including typical architecture designs to combine different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns: textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundational models in computer vision, including difficulties in their evaluations and benchmarking, gaps in their real-world understanding, limitations of their contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively. A comprehensive list of foundational models studied in this work is available at https://github.com/awaisrauf/Awesome-CV-Foundational-Models.
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)