Awesome-Multimodal-Large-Language-Models

Our MLLM works

🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page | Paper

A curated list of Multimodal Large Language Models (MLLMs), including datasets, multimodal instruction tuning, multimodal in-context learning, multimodal chain-of-thought, LLM-aided visual reasoning, foundation models, and others. This list is updated continuously. ✨

You are welcome to join our WeChat group for MLLM discussion!

Please add WeChat ID (wmd_rz_ustc) to join the group. 🌟


🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Project Page [Leaderboards] | Paper

Please feel free to open an issue to add new evaluation results or to ask questions about the evaluation. We will update the leaderboards promptly. ✨

Download MME 🌟🌟

The benchmark dataset was collected by Xiamen University and is for academic research only. To obtain it, email guilinli@stu.xmu.edu.cn and follow the requirement below.

Requirement: Using your real name is encouraged for better academic communication. Your email suffix needs to match your affiliation (for example, xx@stu.xmu.edu.cn for Xiamen University); otherwise, please explain why. Please include the information below when sending your application email.

Name: (tell us who you are.)
Affiliation: (the name/url of your university or company)
Job Title: (e.g., professor, PhD student, or researcher)
Email: (your email address)
How to use: (only for non-commercial use)

If you find our projects helpful to your research, please cite the following papers:

@article{yin2023survey,
      title={A Survey on Multimodal Large Language Models}, 
      author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Li, Ke and Sun, Xing and Xu, Tong and Chen, Enhong},
      journal={arXiv preprint arXiv:2306.13549},
      year={2023}
}

@article{fu2023mme,
      title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models}, 
      author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Qiu, Zhenyu and Lin, Wei and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and Ji, Rongrong},
      journal={arXiv preprint arXiv:2306.13394},
      year={2023}
}


Awesome Papers

Multimodal Instruction Tuning

Title Venue Date Code Demo
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding arXiv 2023-06-29 Github Coming soon
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic arXiv 2023-06-27 Github -
Aligning Large Multi-Modal Model with Robust Instruction Tuning arXiv 2023-06-26 Github Demo
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration arXiv 2023-06-15 Github Coming soon
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark arXiv 2023-06-11 Github Demo
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models arXiv 2023-06-08 Github Demo
MIMIC-IT: Multi-Modal In-Context Instruction Tuning arXiv 2023-06-08 Github Demo
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning arXiv 2023-06-07 - -
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding arXiv 2023-06-05 Github Demo
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day arXiv 2023-06-01 Github -
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction arXiv 2023-05-30 Github Demo
ImageBind-LLM: Multi-Modality Instruction Tuning - 2023-05-29 Github Demo
PandaGPT: One Model To Instruction-Follow Them All arXiv 2023-05-25 Github Demo
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst arXiv 2023-05-25 Github -
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models arXiv 2023-05-24 Github Local Demo
DetGPT: Detect What You Need via Reasoning arXiv 2023-05-23 Github Demo
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks arXiv 2023-05-18 Github Demo
Listen, Think, and Understand arXiv 2023-05-18 Github Demo
VisualGLM-6B - 2023-05-17 Github Local Demo
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering arXiv 2023-05-17 Github -
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning arXiv 2023-05-11 Github Local Demo
VideoChat: Chat-Centric Video Understanding arXiv 2023-05-10 Github Demo
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans arXiv 2023-05-08 Github Demo
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages arXiv 2023-05-07 Github -
LMEye: An Interactive Perception Network for Large Language Models arXiv 2023-05-05 Github Local Demo
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model arXiv 2023-04-28 Github Demo
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality arXiv 2023-04-27 Github Demo
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models arXiv 2023-04-20 Github -
Visual Instruction Tuning arXiv 2023-04-17 GitHub Demo
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention arXiv 2023-03-28 Github Demo
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning ACL 2022-12-21 Github -

Multimodal In-Context Learning

Title Venue Date Code Demo
MIMIC-IT: Multi-Modal In-Context Instruction Tuning arXiv 2023-06-08 Github Demo
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models arXiv 2023-04-19 Github Demo
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace arXiv 2023-03-30 Github Demo
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action arXiv 2023-03-20 Github Demo
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering CVPR 2023-03-03 Github -
Visual Programming: Compositional visual reasoning without training CVPR 2022-11-18 Github Local Demo
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA AAAI 2022-06-28 Github -
Flamingo: a Visual Language Model for Few-Shot Learning NeurIPS 2022-04-29 Github Demo
Multimodal Few-Shot Learning with Frozen Language Models NeurIPS 2021-06-25 - -

Multimodal Chain-of-Thought

Title Venue Date Code Demo
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought arXiv 2023-05-24 Github -
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction arXiv 2023-05-23 - -
Caption Anything: Interactive Image Description with Diverse Multimodal Controls arXiv 2023-05-04 Github Demo
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings arXiv 2023-05-03 Coming soon -
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models arXiv 2023-04-19 Github Demo
Chain of Thought Prompt Tuning in Vision Language Models arXiv 2023-04-16 Coming soon -
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action arXiv 2023-03-20 Github Demo
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models arXiv 2023-03-08 Github Demo
Multimodal Chain-of-Thought Reasoning in Language Models arXiv 2023-02-02 Github -
Visual Programming: Compositional visual reasoning without training CVPR 2022-11-18 Github Local Demo
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering NeurIPS 2022-09-20 Github -

LLM-Aided Visual Reasoning

Title Venue Date Code Demo
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models arXiv 2023-06-15 - -
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn arXiv 2023-06-14 Github -
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction arXiv 2023-05-30 Github Demo
Mindstorms in Natural Language-Based Societies of Mind arXiv 2023-05-26 - -
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models arXiv 2023-05-24 Github -
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models arXiv 2023-05-24 Github Local Demo
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation arXiv 2023-05-10 Github -
Caption Anything: Interactive Image Description with Diverse Multimodal Controls arXiv 2023-05-04 Github Demo
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models arXiv 2023-04-19 Github Demo
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace arXiv 2023-03-30 Github Demo
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action arXiv 2023-03-20 Github Demo
ViperGPT: Visual Inference via Python Execution for Reasoning arXiv 2023-03-14 Github Local Demo
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions arXiv 2023-03-12 Github Local Demo
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models arXiv 2023-03-08 Github Demo
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners CVPR 2023-03-03 Github -
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models arXiv 2022-11-28 Github -
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning CVPR 2022-11-21 Github -
Visual Programming: Compositional visual reasoning without training CVPR 2022-11-18 Github Local Demo
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language arXiv 2022-04-01 Github -

Foundation Models

Title Venue Date Code Demo
Kosmos-2: Grounding Multimodal Large Language Models to the World arXiv 2023-06-26 Github -
Transfer Visual Prompt Generator across LLMs arXiv 2023-05-02 Github Demo
GPT-4 Technical Report arXiv 2023-03-15 - -
PaLM-E: An Embodied Multimodal Language Model arXiv 2023-03-06 - Demo
Prismer: A Vision-Language Model with An Ensemble of Experts arXiv 2023-03-04 Github Demo
Language Is Not All You Need: Aligning Perception with Language Models arXiv 2023-02-27 Github -
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models arXiv 2023-01-30 Github Demo
VIMA: General Robot Manipulation with Multimodal Prompts ICML 2022-10-06 Github Local Demo
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge NeurIPS 2022-06-17 Github -
Language Models are General-Purpose Interfaces arXiv 2022-06-13 Github -

Evaluation

Title Venue Date Page
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models arXiv 2023-06-23 Github
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models arXiv 2023-06-15 Github
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark arXiv 2023-06-11 Github
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models arXiv 2023-06-08 Github

Others

Title Venue Date Code Demo
Can Large Pre-trained Models Help Vision Models on Perception Tasks? arXiv 2023-06-01 Coming soon -
Contextual Object Detection with Multimodal Large Language Models arXiv 2023-05-29 Github Demo
Generating Images with Multimodal Language Models arXiv 2023-05-26 Github -
On Evaluating Adversarial Robustness of Large Vision-Language Models arXiv 2023-05-26 Github -
Evaluating Object Hallucination in Large Vision-Language Models arXiv 2023-05-17 Github -
Grounding Language Models to Images for Multimodal Inputs and Outputs ICML 2023-01-31 Github Demo

Awesome Datasets

Datasets of Pre-Training for Alignment

Name Paper Type Modalities
MS-COCO Microsoft COCO: Common Objects in Context Caption Image-Text
SBU Captions Im2Text: Describing Images Using 1 Million Captioned Photographs Caption Image-Text
Conceptual Captions Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning Caption Image-Text
LAION-400M LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs Caption Image-Text
VG Captions Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations Caption Image-Text
Flickr30k Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models Caption Image-Text
AI-Caps AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding Caption Image-Text
Wukong Captions Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark Caption Image-Text
Youku-mPLUG Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks Caption Video-Text
MSR-VTT MSR-VTT: A Large Video Description Dataset for Bridging Video and Language Caption Video-Text
Webvid10M Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval Caption Video-Text
WavCaps WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research Caption Audio-Text
AISHELL-1 AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline ASR Audio-Text
AISHELL-2 AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale ASR Audio-Text
VSDial-CN X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages ASR Image-Audio-Text

Datasets of Multimodal Instruction Tuning

Name Paper Link Notes
LLaVAR LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding Link A visual instruction-tuning dataset for Text-rich Image Understanding
LRV-Instruction Aligning Large Multi-Modal Model with Robust Instruction Tuning Link Visual instruction tuning dataset for addressing the hallucination issue
Macaw-LLM Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration Link A large-scale multi-modal instruction dataset in terms of multi-turn dialogue
LAMM-Dataset LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark Link A comprehensive multi-modal instruction tuning dataset
Video-ChatGPT Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models Link 100K high-quality video instruction dataset
MIMIC-IT MIMIC-IT: Multi-Modal In-Context Instruction Tuning Coming soon Multimodal in-context instruction tuning
M3IT M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning Link Large-scale, broad-coverage multimodal instruction tuning dataset
LLaVA-Med LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day Coming soon A large-scale, broad-coverage biomedical instruction-following dataset
GPT4Tools GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction Link Tool-related instruction datasets
MULTIS ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst Coming soon Multimodal instruction tuning dataset covering 16 multimodal tasks
DetGPT DetGPT: Detect What You Need via Reasoning Link Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs
PMC-VQA PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering Coming soon Large-scale medical visual question-answering dataset
VideoChat VideoChat: Chat-Centric Video Understanding Link Video-centric multimodal instruction dataset
X-LLM X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages Link Chinese multimodal instruction dataset
LMEye LMEye: An Interactive Perception Network for Large Language Models Link A multi-modal instruction-tuning dataset
cc-sbu-align MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models Link Multimodal aligned dataset for improving the model's usability and generation fluency
LLaVA-Instruct-150K Visual Instruction Tuning Link Multimodal instruction-following data generated by GPT
MultiInstruct MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning Link The first multimodal instruction tuning benchmark dataset

Datasets of In-Context Learning

Name Paper Link Notes
MIMIC-IT MIMIC-IT: Multi-Modal In-Context Instruction Tuning Coming soon Multimodal in-context instruction dataset

Datasets of Multimodal Chain-of-Thought

Name Paper Link Notes
EgoCOT EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought Coming soon Large-scale embodied planning dataset
VIP Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction Coming soon An inference-time dataset that can be used to evaluate VideoCOT
ScienceQA Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering Link Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains

Benchmarks for Evaluation

Name Paper Link Notes
MME MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models Link A comprehensive MLLM Evaluation benchmark
LVLM-eHub LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models Link An evaluation platform for MLLMs
LAMM-Benchmark LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark Link A benchmark for evaluating the quantitative performance of MLLMs on various 2D/3D vision tasks
M3Exam M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models Link A multilingual, multimodal, multilevel benchmark for evaluating MLLMs
OwlEval mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality Link Dataset for evaluation on multiple capabilities

Others

Name Paper Link Notes
IMAD IMAD: IMage-Augmented multi-modal Dialogue Link Multimodal dialogue dataset
Video-ChatGPT Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models Link A quantitative evaluation framework for video-based dialogue models
CLEVR-ATVC Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation Link A synthetic multimodal fine-tuning dataset for learning to reject instructions
Fruit-ATVC Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation Link A manually pictured multimodal fine-tuning dataset for learning to reject instructions
InfoSeek Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? Coming soon A VQA dataset that focuses on asking information-seeking questions
