VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks, Wenhai Wang+, N/A, arXiv'23 #689

AkihikoWatanabe · 2023-05-20T10:48:37Z

URL

https://arxiv.org/abs/2305.11175

Affiliations

Wenhai Wang, N/A
Zhe Chen, N/A
Xiaokang Chen, N/A
Jiannan Wu, N/A
Xizhou Zhu, N/A
Gang Zeng, N/A
Ping Luo, N/A
Tong Lu, N/A
Jie Zhou, N/A
Yu Qiao, N/A
Jifeng Dai, N/A

Abstract

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The demo shall be released based on https://github.com/OpenGVLab/InternGPT. The code shall be released at https://github.com/OpenGVLab/VisionLLM.

Translation (by gpt-3.5-turbo)

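Large language models (LLMs) have impressive zero-shot capability for user-tailored tasks and have markedly accelerated progress toward artificial general intelligence (AGI). This gives them enormous potential across a wide range of applications. In computer vision, however, although many powerful vision foundation models (VFMs) are available, they remain restricted to tasks in a pre-defined form and fall short of the open-ended task capability of LLMs. This work proposes VisionLLM, an LLM-based framework for vision-centric tasks. The framework treats images as a foreign language and aligns vision-centric tasks with language tasks that can be flexibly defined and managed through language instructions, providing a unified perspective on vision and language tasks. An LLM-based decoder then makes appropriate predictions from these instructions to handle open-ended tasks. Extensive experiments show that VisionLLM achieves different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level, with good results. Using a generalist LLM-based framework, the model achieves over 60% mAP on COCO, on par with detection-specific models. The authors hope the model can set a new baseline for generalist vision and language models. The demo will be released based on https://github.com/OpenGVLab/InternGPT. The code will be released at https://github.com/OpenGVLab/VisionLLM.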

Summary (by gpt-3.5-turbo)

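This work proposes VisionLLM, a framework that uses large language models (LLMs) for vision-centric tasks. By handling vision-centric tasks in a unified way with language tasks that can be flexibly defined and managed through language instructions, it provides a unified perspective on vision and language tasks. The method achieves different levels of task customization with good results, and is expected to set a new baseline for generalist vision and language models.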
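The abstract describes vision tasks being defined through language instructions, with predictions produced by an LLM-based decoder. As a rough illustration of that idea only, the sketch below builds a plain-text instruction for a detection task and parses a text reply back into boxes. Everything here is an assumption made for illustration: the template wording, the `<image>` placeholder, the `label: (x1, y1, x2, y2)` reply format, and the names `build_instruction` and `parse_response` are hypothetical and are not taken from the paper or the VisionLLM code.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels


def build_instruction(task: str, classes: list) -> str:
    """Hypothetical template: describe a vision task in plain language."""
    class_list = ", ".join(classes)
    return (
        f"Task: {task}. For the image <image>, report every object from "
        f"[{class_list}] as 'label: (x1, y1, x2, y2)', one per line."
    )


# Assumed reply format: a single-word label, a colon, then four integers.
_REPLY = re.compile(r"(\w+):\s*\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)")


def parse_response(text: str, classes: list) -> list:
    """Turn a text reply into Detection objects, dropping unknown labels."""
    allowed = set(classes)
    detections = []
    for match in _REPLY.finditer(text):
        label = match.group(1)
        if label in allowed:
            box = tuple(int(v) for v in match.groups()[1:])
            detections.append(Detection(label, box))
    return detections


if __name__ == "__main__":
    classes = ["cat", "dog"]
    print(build_instruction("object detection", classes))
    reply = "cat: (10, 20, 110, 220)\ndog: (5, 5, 50, 60)"
    for det in parse_response(reply, classes):
        print(det)
```

Changing the class list or the task description in the instruction changes what is asked for without touching the parsing code, which is the kind of instruction-level customization the abstract refers to.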

AkihikoWatanabe added the Pocket label May 20, 2023

AkihikoWatanabe changed the title あ VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks, Wenhai Wang+, N/A, arXiv'23 May 20, 2023
