Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. Notably, with a generalist LLM-based framework, our model achieves over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The demo will be released based on https://github.com/OpenGVLab/InternGPT, and the code at https://github.com/OpenGVLab/VisionLLM.
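To make the idea of "tasks defined and managed using language instructions" concrete, here is a minimal sketch of how such an instruction-driven interface might look. The function name, prompt template, and output format below are illustrative assumptions, not the actual VisionLLM API.

```python
# Hypothetical sketch: a vision-centric task (here, detection) is expressed
# as a language instruction that an LLM-based decoder could interpret.
# The "<image>" placeholder and the coordinate format are assumptions.

def build_instruction(task: str, classes: list[str]) -> str:
    """Compose a language instruction that defines a vision task on the fly."""
    class_list = ", ".join(classes)
    return (
        f"<image> For each {task} target among [{class_list}], "
        "output the class name and a bounding box as (x1, y1, x2, y2)."
    )

# Changing only the instruction string would redefine the task
# (e.g. swap "detection" for "segmentation" or restrict the class list),
# which is the kind of task-level customization the abstract describes.
prompt = build_instruction("detection", ["person", "car", "dog"])
print(prompt)
```

The point of the sketch is that no model surgery is needed to switch tasks; the instruction text alone carries the task definition.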
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks, Wenhai Wang+, N/A, arXiv'23
May 20, 2023