Multimodal Large Language Models (MLLMs) rely on powerful LLMs to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, such case studies hardly reflect the full performance of MLLMs, and a comprehensive evaluation has been lacking. In this paper, we fill this gap by presenting MME, the first MLLM evaluation benchmark. It measures both perception and cognition abilities across a total of 14 subtasks. To avoid data leakage that may arise from the direct use of public datasets for evaluation, all instruction-answer pairs are manually designed. The concise instruction design allows us to compare MLLMs fairly, instead of struggling with prompt engineering. Moreover, with such instructions, we can also easily carry out quantitative statistics. A total of 50+ advanced MLLMs are comprehensively evaluated on MME, which not only suggests that existing MLLMs still have considerable room for improvement, but also reveals potential directions for subsequent model optimization.
🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page | Paper
🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Project Page [This Page] | Paper
The first comprehensive evaluation benchmark for MLLMs. Now the leaderboards include 50+ advanced models, such as Qwen-VL-Max, Gemini Pro, and GPT-4V. ✨
If you want to add your model in our leaderboards, please feel free to email bradyfu24@gmail.com. We will update the leaderboards in time. ✨
Download MME 🌟🌟
The benchmark dataset is collected by Xiamen University for academic research only. You can email yongdongluo@stu.xmu.edu.cn to obtain the dataset, subject to the following requirement.
Requirement: Real names are encouraged for better academic communication. Your email suffix needs to match your affiliation, such as xx@stu.xmu.edu.cn for Xiamen University; otherwise, please explain why. Please include the information below in your application email.
Name: (tell us who you are)
Affiliation: (the name/URL of your university or company)
Job Title: (e.g., professor, PhD student, or researcher)
Email: (your email address)
How to use: (non-commercial use only)
🔥🔥🔥 Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper | Source Code
The first work to correct hallucinations in MLLMs. ✨
🔥🔥🔥 A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
Paper
The first technical report for Gemini vs GPT-4V, totaling 128 pages and completed within one week of the Gemini API opening. 🌟
📑 If you find our projects helpful to your research, please consider citing:
@article{fu2023mme,
title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and others},
journal={arXiv preprint arXiv:2306.13394},
year={2023}
}
@article{fu2024video,
title={Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis},
author={Fu, Chaoyou and Dai, Yuhan and Luo, Yongdong and Li, Lei and Ren, Shuhuai and Zhang, Renrui and Wang, Zihan and Zhou, Chenyu and Shen, Yunhang and Zhang, Mengdan and others},
journal={arXiv preprint arXiv:2405.21075},
year={2024}
}
@article{yin2023survey,
title={A Survey on Multimodal Large Language Models},
author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Li, Ke and Sun, Xing and Xu, Tong and Chen, Enhong},
journal={arXiv preprint arXiv:2306.13549},
year={2023}
}
@article{yin2023woodpecker,
title={Woodpecker: Hallucination Correction for Multimodal Large Language Models},
author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Xu, Tong and Wang, Hao and Sui, Dianbo and Shen, Yunhang and Li, Ke and Sun, Xing and Chen, Enhong},
journal={arXiv preprint arXiv:2310.16045},
year={2023}
}
@article{fu2023challenger,
title={A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise},
author={Fu, Chaoyou and Zhang, Renrui and Lin, Haojia and Wang, Zihan and Gao, Timin and Luo, Yongdong and Huang, Yubo and Zhang, Zhengye and Qiu, Longtian and Ye, Gaoxiang and others},
journal={arXiv preprint arXiv:2312.12436},
year={2023}
}
2024
- [06-06] Thanks to CMRI, JT-VL-Chat-V1.0 is added to MME. 🔥🔥
- [05-27] Thanks to Junbo Cui, MiniCPM-Llama3-V 2.5 joins MME.
- [05-18] Thanks to Chunyu Xie, 360VL is incorporated into MME.
- [04-27] Thanks to Zhe Chen, we welcome a new member InternVL-Chat-V1.5.
- [04-15] Thanks to Junbo Cui, MiniCPM-V-2 is added to MME.
- [04-10] Thanks to Wenqiao Zhang, HyperLLaVA joins our leaderboards.
- [03-14] Thanks to Muyang He, Bunny-3B takes part in MME.
- [02-23] Thanks to Jingyu Liu, ChatTruth-7B is added to MME.
- [02-07] Thanks to TsinghuaNLP, MiniCPM and OmniLMM are incorporated into our leaderboards.
- [02-05] Thanks to Haotian Liu, LLaVA-1.6 is added to MME.
- [02-05] Thanks to Bin Lin, MoE-LLaVA joins MME.
- [02-05] Thanks to Weihan Wang and Wenyi Hong, CogVLM and CogAgent take part in MME.
- [01-25] Thanks to Shijie Wang, we welcome a new member Qwen-VL-Max.
- [01-22] Thanks to Xiaoyi Dong, InternLM-XComposer2-VL joins our leaderboards.
2023
[2023-12]
- [12-31] Thanks to Dian Li, PureMM takes part in our leaderboards (updated on 2024-01-14 and 2024-01-21).
- [12-31] Thanks to Yilin Ma and Min Xu, RBDash is added to MME.
- [12-18] Thanks to Zihan Wang, our leaderboards usher in Gemini Pro.
- [12-18] Thanks to Jinze Bai, the new model Qwen-VL-Plus is added to MME.
- [12-18] Thanks to Junbum Cha, Honeybee joins our leaderboards.
- [12-12] Thanks to Yuliang Liu, Monkey-Chat takes part in MME.
- [12-12] Thanks to Junkun Yuan, we welcome a new member AGILMM.
- [12-01] Thanks to Cheng Wen, BELLE-VL is added to our leaderboards.
- [12-01] Thanks to PCI Research, TransCore-M joins MME.
[2023-11]
- [11-24] Thanks to Xiaoyi Dong, we add ShareGPT4V to our leaderboards.
- [11-24] Thanks to Muyang He, DataOptim joins MME.
- [11-24] Thanks to Zifei Shan, Kanva is added.
- [11-21] Thanks to Junke Wang, LVIS-INSTRUCT4V is added to MME.
- [11-18] Thanks to Zhenbo Luo, our leaderboards welcome a new member CVLM.
- [11-10] Thanks to Qinghao Ye, we add the new model mPLUG-Owl2 to our leaderboards.
- [11-10] Thanks to Zhibin Wang, InfMLLM joins our leaderboards (updated on 2023-12-12).
[2023-10]
- [10-29] Thanks to Jiaming Han, SPHINX is added to our leaderboards.
- [10-23] Thanks to Zihan Wang, who manually evaluated the performance of GPT-4V on our benchmark. Note that GPT-4V refuses to answer questions involving individuals, resulting in a zero score on the Celebrity subtask.
- [10-13] Thanks to Yizhou Zhou, WeMM joins our leaderboards (results updated on 2023-11-10 with a newer model).
- [10-13] Thanks to Junbo Cui, we add Muffin to our leaderboards.
- [10-13] Thanks to Jiaming Han, the results of LLaMA-Adapter V2 have been updated.
- [10-04] Thanks to Haotian Liu, the results of LLaVA have been updated.
[2023-09]
- [09-28] Thanks to Huasong Zhong, Lion is added.
- [09-27] Thanks to Xiaoyi Dong, InternLM-XComposer-VL joins our leaderboards.
- [09-05] Thanks to Jinze Bai, our leaderboards usher in Qwen-VL-Chat.
- [09-01] Thanks to Skywork Multi-Modal Group, Skywork-MM takes part in our leaderboards.
[2023-08]
- [08-28] Thanks to UCSD MLPC, we welcome BLIVA to join our leaderboards.
- [08-28] Thanks to Jianfeng Wang, GIT2 is added to our leaderboards.
- [08-28] Thanks to Yike Yuan and Songyang Zhang, the results of MiniGPT4 have been revised.
- [08-21] Thanks to Haozhe Zhao, MMICL joins our leaderboards (results updated on 2023-09-17 with an upgraded checkpoint).
- [08-13] Thanks to Zhejiang University DCD Lab, our leaderboards incorporate a new member Cheetor.
- [08-08] Thanks to Fuxiao Liu, we add LRV-Instruction to our leaderboards.
[2023-07]
- [07-28] Thanks to Yingzi Ma, his work Octopus has been added to our leaderboards.
- [07-15] Thanks to Jiani Zheng, our leaderboards welcome a new member Lynx.
- [07-12] Thanks to Ao Zhang, his work VPGTrans has been added to our leaderboards.
- [07-09] Thanks to Bo Li, we have updated the evaluation of his work Otter. It now uses the latest model OTTER-Image-MPT7B, which incorporates OpenFlamingo v2 and enhances instruction-following ability.
[2023-06]
- [06-30] Thanks to Renrui Zhang, we have updated the evaluation of his two works, LLaMA-Adapter V2 and ImageBind_LLM. The former is re-evaluated with updated model weights, and the latter is a newly added MLLM.
- [06-30] Thanks to Gen Luo, we have added the evaluation of his work LaVIN.
- [06-30] The results of other models have also been updated by retrieving the answer from the beginning of the generated response rather than from the whole response (see the sketch below). An automated evaluation script for calculating scores has been released!
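To make the extraction rule above concrete, here is a minimal sketch of how a yes/no answer can be parsed from the beginning of a generated response. The function name and the 16-character window are illustrative assumptions; this is not the released evaluation script.

```python
import re

def extract_answer(response: str) -> str:
    # Look only at the beginning of the generated response,
    # as described in the update above.
    head = response.strip().lower()[:16]
    if re.match(r"^\W*yes\b", head):
        return "yes"
    if re.match(r"^\W*no\b", head):
        return "no"
    return "other"  # unparseable answers are counted as wrong

print(extract_answer("Yes, there is a dog in the image."))  # yes
```

Reading only the beginning of the response avoids crediting a model that first answers incorrectly and then happens to mention the right word later on.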
Results of Available Models [Unavailable Version]
Leaderboards of Available Models [Unavailable Version]
Sum of the scores of all perception subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, and OCR. The full score of each subtask is 200, and that of all perception is 2000.
Sum of the scores of all cognition subtasks, including commonsense reasoning, numerical calculation, text translation, and code reasoning. The full score of each subtask is 200, and that of all cognition is 800.
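Consistent with these totals, each subtask in the MME paper is scored as accuracy plus accuracy+: accuracy is computed per question, while accuracy+ requires both questions of an image to be answered correctly, so a subtask's full score is 200. Below is a minimal sketch of that rule; the data layout and names are illustrative assumptions, not the released evaluation script.

```python
from collections import defaultdict

def subtask_score(results):
    # results: list of (image_id, is_correct) pairs; each image
    # contributes two yes/no questions in MME.
    acc = 100.0 * sum(ok for _, ok in results) / len(results)
    by_image = defaultdict(list)
    for image_id, ok in results:
        by_image[image_id].append(ok)
    acc_plus = 100.0 * sum(all(v) for v in by_image.values()) / len(by_image)
    return acc + acc_plus  # full score: 100 + 100 = 200

# Toy example: two images, one fully correct, one half correct.
demo = [("img1", True), ("img1", True), ("img2", True), ("img2", False)]
print(subtask_score(demo))  # accuracy 75.0 + accuracy+ 50.0 = 125.0
```

Summing the ten perception subtask scores gives the perception total (max 2000), and summing the four cognition subtask scores gives the cognition total (max 800).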