Can Multimodal Large Language Models Truly Understand Small Objects?

🌟 Highlights

First-Ever Benchmark: To the best of our knowledge, MLLMs based Small Object Understanding (SOU) tasks are proposed for the first time. A comprehensive benchmark (SOUBench), including relative datasets and baselines, is reported for the specific task. SOUBench fully reveals the shortcomings of current MLLMs in understanding small objects.
Comprehensive Evaluation: We design an effective automatic visual question-answer generation pipeline and introduce a comprehensive SOU-VQA evaluation dataset for small object understanding tasks, with 18,204 pairs and six relevant sub-tasks. Comprehensive experiments and comparisons are conducted in 15 state-of-the-art MLLMs to evaluate the small object understanding capability of MLLMs. Sufficient results reveal that current MLLMs have a weak understanding ability in the proposed tasks, even the best MLLM is still behind Human performance by 23.53%.
Effcitive Fine-tuning: We further construct SOU-Train, a multimodal VQA training dataset with 11,226 fine-grained annotations, to supervise the fine-tuning of the latest MLLM. The result denotes that the SOU-Train can effectively improve the small understanding ability of MLLM in different scenarios. Our research provides a crucial empirical foundation for the enhancement of the small object understanding capabilities of MLLMs.

✅ Step 1: VQA Construction (Take Driving Scenarios as an Example)

Usage

Generate driving scene VQA datasets from COCO annotations.

Files:

json_manager.py: JSON data management
CreateJson_Driving.py: Dataset generator (6 task types)
val.json: Example COCO annotations

Prerequisites:

Place your image dataset in the folder specified in CreateJson_Driving.py (e.g., Images_Driving)
Install dependencies: pip install pycocotools opencv-python

Usage:

Update image path in CreateJson_Driving.py to match your dataset folder
Run: python CreateJson_Driving.py

✅ Step 2: MLLMs evaluation (15 MLLMs)

Updating...(干饭去了)

✅ Step 3: Training

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
Annotations_Driving		Annotations_Driving
1.png		1.png
2.png		2.png
3.png		3.png
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Can Multimodal Large Language Models Truly Understand Small Objects?

🌟 Highlights

✅ Step 1: VQA Construction (Take Driving Scenarios as an Example)

Usage

✅ Step 2: MLLMs evaluation (15 MLLMs)

Updating...(干饭去了)

✅ Step 3: Training

Updating...(干饭去了)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Can Multimodal Large Language Models Truly Understand Small Objects?

🌟 Highlights

✅ Step 1: VQA Construction (Take Driving Scenarios as an Example)

Usage

✅ Step 2: MLLMs evaluation (15 MLLMs)

Updating...(干饭去了)

✅ Step 3: Training

Updating...(干饭去了)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages