Official repository for the paper "MoVA: Adapting Mixture of Vision Experts to Multimodal Context".
[📖 Paper] [🤗 Huggingface Model]
- [2024.09.26] 🎉 MoVA is accepted to NeurIPS 2024 🎉
- [2024.06.28] 🔥 We release the code and the MoVA-8B model.
- [2024.04.22] 🚀 We release our paper on arXiv.
To alleviate the bias of the CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose MoVA, a powerful and novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism.
MoVA consists of two stages: coarse-grained context-aware expert routing and fine-grained expert fusion with MoV-Adapter.
- Coarse-grained context-aware expert routing: First, MoVA leverages the tool-use capabilities of the LLM to select, from the expert model pool, the vision experts most relevant to the user's image and instruction. Thanks to the strong generalization ability of the LLM, this routing also works for vision experts in open scenarios.
- Fine-grained expert fusion with MoV-Adapter: In the second stage, we enhance the visual representation with a novel MoV-Adapter module in a fine-grained manner. Specifically, we leverage the cross-attention mechanism to extract task-specific knowledge from the representations of the chosen experts. Meanwhile, the dynamic gating network in MoV-Adapter allocates soft weights to each expert's extracted knowledge according to the input image and instruction. The extracted knowledge is then effectively integrated into the foundational representation of the base vision encoder (a minimal sketch of this step is given below).
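For intuition only, here is a minimal PyTorch sketch of the fine-grained fusion step described above, assuming one cross-attention block per expert and a two-layer gating MLP over the pooled image and instruction features. The module name `MoVAdapterSketch`, the shapes, and all hyperparameters are illustrative assumptions, not the actual MoV-Adapter design; please refer to the released code for the real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoVAdapterSketch(nn.Module):
    """Toy fine-grained fusion: base tokens attend to each routed expert's tokens,
    and a gating network mixes the extracted knowledge with soft weights."""

    def __init__(self, dim: int, num_experts: int, num_heads: int = 8):
        super().__init__()
        # One cross-attention block per expert: queries come from the base encoder,
        # keys/values come from that expert's features.
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_experts)]
        )
        # Dynamic gating network: soft weights over experts, conditioned on the
        # pooled image feature and the pooled instruction embedding.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, num_experts),
        )

    def forward(self, base_feats, expert_feats, text_embed):
        # base_feats:   (B, N, D) tokens from the base vision encoder (e.g. CLIP)
        # expert_feats: list of (B, M_i, D) token maps from the routed experts
        # text_embed:   (B, D) pooled instruction embedding
        context = torch.cat([base_feats.mean(dim=1), text_embed], dim=-1)
        weights = F.softmax(self.gate(context), dim=-1)             # (B, num_experts)

        fused = base_feats
        for i, feats in enumerate(expert_feats):
            knowledge, _ = self.cross_attn[i](fused, feats, feats)  # extract expert knowledge
            fused = fused + weights[:, i].view(-1, 1, 1) * knowledge
        return fused


# Example: fuse 576 base tokens with two routed experts (shapes are arbitrary).
adapter = MoVAdapterSketch(dim=1024, num_experts=2)
base = torch.randn(1, 576, 1024)
experts = [torch.randn(1, 256, 1024), torch.randn(1, 1024, 1024)]
text = torch.randn(1, 1024)
print(adapter(base, experts, text).shape)  # torch.Size([1, 576, 1024])
```

The key design point mirrored here is that the gate is conditioned on both the image and the instruction, so the mixture of expert knowledge can change per query rather than being fixed per image.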
MoVA with Vicuna-7B, Llama3-8B, and Hermes-Yi-34B achieves significant performance gains over current state-of-the-art methods on a wide range of challenging benchmarks.
Name | LLM | #Tokens | MME | MMBench | MMBench-CN | QBench (dev) | MathVista | MathVerse | POPE |
---|---|---|---|---|---|---|---|---|---|
MoVA-8B | Llama3-8B | 576 | 1595.8 / 347.5 | 75.3 | 67.7 | 70.8 | 37.7 | 21.4 | 89.3 |
Name | LLM | #Tokens | VQAv2 | GQA | SQA | TextVQA | ChartQA | DocVQA (val) | DocVQA (test) | AI2D |
---|---|---|---|---|---|---|---|---|---|---|
MoVA-8B | Llama3-8B | 576 | 83.5 | 65.2 | 74.7 | 77.1 | 70.5 | 83.8 | 83.4 | 77.0 |
Name | LLM | #Tokens | RefCOCO (val) | RefCOCO (testA) | RefCOCO (testB) | RefCOCO+ (val) | RefCOCO+ (testA) | RefCOCO+ (testB) | RefCOCOg (val) | RefCOCOg (test) |
---|---|---|---|---|---|---|---|---|---|---|
MoVA-8B | Llama3-8B | 576 | 92.18 | 94.75 | 88.24 | 88.45 | 92.21 | 82.82 | 90.05 | 90.23 |
To ensure reproducibility, we evaluate the models with greedy decoding. We do not use beam search, so that inference stays consistent with the real-time outputs of the chat demo.
We follow the evaluation settings of LLaVA. Please see Evaluation.md.
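For reference, greedy decoding corresponds to the standard Hugging Face `generate()` arguments shown below. This snippet is illustrative only: `gpt2` is a placeholder model, and the prompt and lengths are arbitrary, not the MoVA evaluation scripts.

```python
# Illustrative only: requesting greedy decoding (no sampling, no beam search)
# with the standard Hugging Face generate() API. "gpt2" is a placeholder model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Question: what is shown in the image? Answer:", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    do_sample=False,                       # greedy decoding for reproducibility
    num_beams=1,                           # no beam search, matching the chat demo
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```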
We would like to thank the following repos for their great work:
- The codebase of MoVA is built upon LLaVA.
- MoVA incorporates vision encoders from CLIP, DINOv2, Co-DETR, SAM, Pix2Struct, Deplot, Vary, and BiomedCLIP.
If you find MoVA useful for your research and applications, please kindly cite using this BibTeX:
@article{zong2024mova,
title={MoVA: Adapting Mixture of Vision Experts to Multimodal Context},
author={Zong, Zhuofan and Ma, Bingqi and Shen, Dazhong and Song, Guanglu and Shao, Hao and Jiang, Dongzhi and Li, Hongsheng and Liu, Yu},
journal={arXiv preprint arXiv:2404.13046},
year={2024}
}