Meteor: Mamba-based traversal of rationale for Large Language and Vision Models [ArXiv]
- An online demo of Meteor is now available in 🤗Huggingface Space, thanks to ZeroGPU support (NVIDIA A100) from the Huggingface staff. ⚠️ Warning: input queries are limited in the Space, and several optimization libraries (Causal-Conv1d, Mamba-SSM) cannot be used there, so inference is slower than with this official repository.
- Meteor has been featured in 🤗Huggingface daily papers.
- Meteor is now available in 🤗Huggingface Models: Meteor-Mamba, Meteor-MLM.
- Curated 1.1M Question-Rationale-Answer Triples are now available in 🤗Huggingface Datasets.
- The preprint of Meteor has been uploaded to ArXiv.
Official PyTorch implementation of Mamba-based traversal of rationale (Meteor), which improves performance on numerous vision language benchmarks with an efficient model size. The code is developed from scratch, and I have tried to keep it more readable and simpler than LLaVA, whose code is relatively complex in structure.
The contributions of Meteor can be summarized as follows:
- Curated 1.1M Question-Rationale-Answer Triples.
- Meteor is an efficient 7B model compared with much larger LLVMs.
- Meteor-7B acquires diverse capabilities, thereby showing surprisingly powerful vision language performance.
Open-source LLVMs with Standard Model Size
LLVMs | SQA-IMG | POPE | MME | MMB | MathVista | SEED-IMG | MM-Vet | LLaVA-W |
---|---|---|---|---|---|---|---|---|
Yi-VL-6B | 71.7 | 82.5 | 1915 | 64.2 | 29.7 | 67.5 | 32.1 | 51.9 |
LLaVA-NeXT-7B | 70.1 | 86.5 | 1851 | 69.6 | 34.6 | 70.2 | 43.9 | 72.3 |
MM1-7B | 72.6 | 86.6 | 1858 | 72.3 | 35.9 | 70.9 | 42.1 | - |
Meteor-7B | 88.3 | 88.7 | 2229 | 82.9 | 53.4 | 75.0 | 57.3 | 87.1 |
Open-source LLVMs with Large Model Sizes
LLVMs | AI2D | ChartQA | MME | MMB | MathVista | MM-Vet | LLaVA-W |
---|---|---|---|---|---|---|---|
InternVL1.5-40B | 79.0 | 68.0 | 2175 | 82.2 | 47.7 | 48.9 | - |
InternVL1.5-26B | 80.7 | 83.8 | 2188 | 82.2 | 53.5 | 62.8 | - |
MM1-30B | - | - | 2069 | 75.1 | 39.4 | 48.7 | - |
MiniGemini-34B | - | - | 2105 | 79.6 | 38.9 | 53.0 | - |
MiniGemini-HD-34B | - | - | 2141 | 80.6 | 43.3 | 59.3 | - |
LLaVA-NeXT-8B | 71.6 | 69.5 | 1972 | 72.1 | 37.5 | - | 80.1 |
LLaVA-NeXT-34B | 74.9 | 68.7 | 2030 | 79.3 | 46.0 | 57.4 | 88.8 |
LLaVA-NeXT-72B | 77.4 | 77.0 | 2159 | 80.5 | 46.6 | - | 89.2 |
LLaVA-NeXT-110B | 80.4 | 80.4 | 2201 | 80.5 | 49.0 | - | 90.4 |
Meteor-7B | 77.9 | 74.9 | 2229 | 82.9 | 53.4 | 57.3 | 87.1 |
Closed-source LLVMs
LLVMs | SQA-IMG | AI2D | ChartQA | MME | MMB | MathVista | SEED-IMG | MMStar |
---|---|---|---|---|---|---|---|---|
Qwen-VL-Plus | 71.6 | 75.9 | 78.1 | 2183 | 67.0 | 43.3 | 72.7 | 39.7 |
Gemini-Pro | 80.1 | 73.9 | 74.1 | 1933 | 73.6 | 45.2 | 70.7 | 41.6 |
GPT-4V | 84.6 | 78.2 | 78.5 | 1927 | 77.0 | 49.9 | 69.1 | 46.1 |
Meteor-7B | 88.3 | 77.9 | 74.9 | 2229 | 82.9 | 53.4 | 75.0 | 52.8 |
Run the following commands in order.
bash install
pip install -r requirements.txt
Then run the demo (enjoy Meteor).
python demo.py
(Optional) If you want to build the 📻 Gradio demo yourself, run the following file, or modify it to fit your style.
python app.py
(Optional) If you want to explore the curated question-rationale-answer triples, run or debug the following file.
python check_dataset.py
(Optional) If you want to conduct the vision language evaluation, then you should run the following file.
bash run
Gathered Total: 2130830, 2.1M
------------------------------
* Real-World Image: 755k
* Document & Chart & Diagram & Sign & Symbol: 627k
* Math: 747k
- Math with Vision: 180k
- Math with Text only: 566k
------------------------------
- ShareGPT4V-Caption [without SAM] (91021, 91k)
- ShareGPT4V-Instruction [Without few samples of OCR-VQA] (664703, 664k)
- MiniGemini-Instruction [DocVQA, ChartQA, DVQA, AI2D] (27670, 27k)
- DocDownstream (574268, 574k)
- DocReason (25877, 25k)
- GLLaVA-Align (60252, 60k)
- GLLaVA-QA (117205, 117k)
- MathVision (3040, 3k)
- MathInstruct [TextOnlyDataset] (262040, 262k)
- MathPlus [TextOnlyDataset] (304754, 304k)
Curated Total: 1059382, 1.1M
--------------------------------------------
Real-World Image: 338K
Document & Chart & Diagram & Sign & Symbol: 379K
Math: 342K
Math with Vision: 165K
Math with Text only: 177K
--------------------------------------------
- ShareGPT4V-Caption (72507, 73K)
- ShareGPT4V-Instruction (266072, 266K)
- MiniGemini-Instruction (26885, 27K)
- DocDownstream (298748, 299K)
- DocReason (53065, 53K)
- GLLaVA (162378, 162K)
- MathVision (2992, 3K)
- MathInstruct (81496, 81K)
- MathPlus (95239, 95K)
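The totals in the two listings above can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the per-source counts are copied from the listings, while the dictionary names and the assignment of sources to categories are assumptions based on how the summaries read, not code from this repository.

```python
# Per-source sample counts copied from the "Gathered" and "Curated" listings.
GATHERED = {
    "ShareGPT4V-Caption": 91021,
    "ShareGPT4V-Instruction": 664703,
    "MiniGemini-Instruction": 27670,
    "DocDownstream": 574268,
    "DocReason": 25877,
    "GLLaVA-Align": 60252,
    "GLLaVA-QA": 117205,
    "MathVision": 3040,
    "MathInstruct": 262040,
    "MathPlus": 304754,
}
CURATED = {
    "ShareGPT4V-Caption": 72507,
    "ShareGPT4V-Instruction": 266072,
    "MiniGemini-Instruction": 26885,
    "DocDownstream": 298748,
    "DocReason": 53065,
    "GLLaVA": 162378,
    "MathVision": 2992,
    "MathInstruct": 81496,
    "MathPlus": 95239,
}

# Assumed grouping of the curated sources into the summary categories.
CURATED_CATEGORIES = {
    "Real-World Image": ["ShareGPT4V-Caption", "ShareGPT4V-Instruction"],
    "Document & Chart & Diagram & Sign & Symbol": [
        "MiniGemini-Instruction", "DocDownstream", "DocReason",
    ],
    "Math with Vision": ["GLLaVA", "MathVision"],
    "Math with Text only": ["MathInstruct", "MathPlus"],
}


def category_totals(counts, categories):
    """Sum the per-source counts for every category."""
    return {
        name: sum(counts[source] for source in sources)
        for name, sources in categories.items()
    }


gathered_total = sum(GATHERED.values())
curated_total = sum(CURATED.values())
curated_by_category = category_totals(CURATED, CURATED_CATEGORIES)

print("Gathered total:", gathered_total)
print("Curated total:", curated_total)
for name, total in curated_by_category.items():
    print(f"{name}: {total}")
```

The category sums are exact counts, so they will differ from the rounded "K" figures in the summary by less than a thousand samples.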
We collect the following eight datasets. For MiniGemini, we selectively use data samples only for DocVQA, ChartQA, DVQA, and AI2D. Therefore, you do not need to download all MiniGemini data samples.
- ShareGPT4V [link]
- MiniGemini [link]
- DocDownstream [link]
- DocReason [link]
- GLLaVA [link]
- MathVision [link]
- MathInstruct [link]
- MathPlus [link]
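The MiniGemini selection described above (keeping only DocVQA, ChartQA, DVQA, and AI2D samples) could be sketched as below. This is a hypothetical illustration, not the repository's own preprocessing: it assumes each record in the instruction JSON is a dictionary with an `image` field holding a relative path whose first folder names the sub-dataset (for example `docvqa/images/...`), and the function names are invented for this example.

```python
import json
from pathlib import Path

# Sub-dataset folders named in the text above.
KEPT_SUBSETS = ("docvqa", "chartqa", "dvqa", "ai2d")


def keep_sample(sample, kept=KEPT_SUBSETS):
    """Return True if the sample's image lives under one of the kept folders.

    Assumption: sample["image"] is a relative path such as
    "docvqa/images/0001.png". Records without an image are dropped.
    """
    image = sample.get("image")
    if not image:
        return False
    return Path(image).parts[0] in kept


def filter_minigemini(samples):
    """Keep only the samples that belong to the selected sub-datasets."""
    return [sample for sample in samples if keep_sample(sample)]


def load_and_filter(json_path):
    """Load a JSON list of records from disk and filter it."""
    with open(json_path, encoding="utf-8") as f:
        return filter_minigemini(json.load(f))
```

If the real annotation file uses a different field name or path convention, only `keep_sample` would need to change.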
Gathered Dataset Layout
Meteor_Dataset_Path
├── llava # ShareGPT4V
│ └── llava_pretrain
│ └── images
├── coco # ShareGPT4V
│ └── train2017
├── sam # ShareGPT4V
│ └── images
├── gqa # ShareGPT4V
│ └── images
├── ocr_vqa # ShareGPT4V
│ └── images
├── textvqa # ShareGPT4V
│ └── train_images
├── vg # ShareGPT4V
│ ├── VG_100K
│ └── VG_100K_2
├── share_textvqa # ShareGPT4V
│ └── images
├── web-celebrity # ShareGPT4V
│ └── images
├── web-landmark # ShareGPT4V
│ └── images
├── wikiart # ShareGPT4V
│ └── images
├── docvqa # MiniGemini
│ └── images
├── chartqa # MiniGemini
│ └── train
│ └── images
├── dvqa # MiniGemini
│ └── images
├── ai2d # MiniGemini
│ └── images
├── imgs # DocDownstream & DocReason
│ └── ChartQA
│ └── DUE_Benchmark
│ └── DeepForm
│ └── DocVQA
│ └── InfographicsVQA
│ └── KleisterCharity
│ └── TabFact
│ └── WikiTableQuestions
│ └── TextCaps
│ └── TextVQA
│ └── VisualMRC
├── geo3k # GLLaVA
│ └── train
├── geoqa_plus # GLLaVA
├── images # MathVision
│
├── sharegpt4v_instruct_gpt4-vision_cap100k.json # ShareGPT4V-Caption
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json # ShareGPT4V-Instruction
├── train.jsonl # DocDownstream
├── detailed_explanation.jsonl # DocReason
├── minigemini_instruction.json # MiniGemini-Instruction
├── gllava_align.parquet # GLLaVA-Align
├── gllava_qa.parquet # GLLaVA-QA
├── mathvision.parquet # MathVision
├── MathInstruct.json # MathInstruct
└── mathplus.parquet # MathPlus
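Before training, it can help to confirm that the dataset root matches the layout above. The following is a minimal sketch under stated assumptions: it only checks the top-level folders and annotation files named in the layout (not the nested image folders), and `missing_entries` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path

# Top-level image folders named in the layout above.
TOP_LEVEL_DIRS = [
    "llava", "coco", "sam", "gqa", "ocr_vqa", "textvqa", "vg",
    "share_textvqa", "web-celebrity", "web-landmark", "wikiart",
    "docvqa", "chartqa", "dvqa", "ai2d", "imgs",
    "geo3k", "geoqa_plus", "images",
]

# Annotation files named in the layout above.
ANNOTATION_FILES = [
    "sharegpt4v_instruct_gpt4-vision_cap100k.json",
    "sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
    "train.jsonl",
    "detailed_explanation.jsonl",
    "minigemini_instruction.json",
    "gllava_align.parquet",
    "gllava_qa.parquet",
    "mathvision.parquet",
    "MathInstruct.json",
    "mathplus.parquet",
]


def missing_entries(root):
    """Return the expected folders (with a trailing '/') and files absent under root."""
    root = Path(root)
    missing = [f"{d}/" for d in TOP_LEVEL_DIRS if not (root / d).is_dir()]
    missing += [f for f in ANNOTATION_FILES if not (root / f).is_file()]
    return missing
```

Calling `missing_entries("Meteor_Dataset_Path")` returns an empty list when every listed folder and file is present; otherwise it names what still needs to be downloaded or moved.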
The evaluation datasets are listed below. Once you have downloaded them, place them in a folder that follows the directory layout given after the list.
- Q-Bench [link]
- SQA-IMG [link]
- AI2D [link]
- ChartQA [link]
- SEED [link]
- POPE [link]
- HallusionBench [link]
- MME [link]
- MathVista [link]
- MMB [link]
- MM-Vet [link]
- LLaVA-W [link]
- MMStar [link]
- MathVerse [link]
Evaluation Dataset Directory Layout
Evaluation_Dataset_Path
├── LLVisionQA-QBench # Q-Bench
├── ScienceQA # SQA-IMG
├── ai2d # AI2D
├── chartqa # ChartQA
├── SEED-Bench # SEED-IMG
├── POPE # POPE
├── HallusionBench # HallusionBench
├── MME_Benchmark_release_version # MME
├── MathVista # MathVista
├── MMBench # MMB
├── mm-vet # MM-Vet
├── llava-bench-in-the-wild # LLaVA Bench in the Wild
├── MMStar # MMStar
└── MathVerse # MathVerse
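The mapping between benchmark names and folder names in the layout above can be kept in one lookup table. This is an illustrative sketch only: the folder names come from the layout, while `BENCHMARK_DIRS`, `benchmark_path`, and `available_benchmarks` are hypothetical names that do not come from the evaluation scripts of this repository.

```python
from pathlib import Path

# Benchmark name -> folder name, as listed in the evaluation layout above.
BENCHMARK_DIRS = {
    "Q-Bench": "LLVisionQA-QBench",
    "SQA-IMG": "ScienceQA",
    "AI2D": "ai2d",
    "ChartQA": "chartqa",
    "SEED-IMG": "SEED-Bench",
    "POPE": "POPE",
    "HallusionBench": "HallusionBench",
    "MME": "MME_Benchmark_release_version",
    "MathVista": "MathVista",
    "MMB": "MMBench",
    "MM-Vet": "mm-vet",
    "LLaVA-W": "llava-bench-in-the-wild",
    "MMStar": "MMStar",
    "MathVerse": "MathVerse",
}


def benchmark_path(root, name):
    """Return the folder for a benchmark under the evaluation root."""
    if name not in BENCHMARK_DIRS:
        raise ValueError(f"Unknown benchmark: {name!r}")
    return Path(root) / BENCHMARK_DIRS[name]


def available_benchmarks(root):
    """List the benchmarks whose folder already exists under root."""
    return [name for name in BENCHMARK_DIRS if benchmark_path(root, name).is_dir()]
```

A table like this makes it easy to see which benchmarks are ready before launching a long evaluation run.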