🏥 Extended Multilingual Multimodal Medical Exam Dataset for Visual Question Answering in Healthcare
The Extended Multilingual Multimodal Medical Exam Dataset (Extended MMMED) is the new, larger release of MMMED for evaluating Vision-Language Models (VLMs) on medical multiple-choice question answering (MCQA) tasks.
Compared to the original benchmark, this extension substantially increases the number of questions and updates the benchmark with 28 tested VLMs (general-purpose, medical-specialized, and closed-source) across Spanish, English, and Italian.
The dataset includes challenging, real-world medical content from Médico Interno Residente (MIR - Spain) and Scuole di Specializzazione in Medicina (SSM - Italy) exam settings, with heterogeneous diagnostic images and clinically grounded questions.
You can access the dataset via Hugging Face. Follow these steps to download it:
```python
from datasets import load_dataset

# Log in first (e.g. via `huggingface-cli login`) to access this dataset
ds = load_dataset("praiselab-picuslab/MMMED", split="extended")
```
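Once loaded, each record can be formatted into a multiple-choice prompt for a VLM. The sketch below is illustrative only: the field names (`Question`, `Answer Options`) are assumptions based on the record structure described further down, not the verified schema — check `ds.features` for the actual field names.

```python
# Sketch: build an MCQA prompt string from one dataset record.
# Field names here are assumed; inspect ds.features for the real schema.

def format_mcqa_prompt(record: dict) -> str:
    """Render a question and its lettered options as a single prompt."""
    lines = [record["Question"]]
    for letter, option in zip("ABCD", record["Answer Options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

# A made-up record, mimicking the MMCQA structure described below
example = {
    "Question": "Which finding is most consistent with the image?",
    "Answer Options": ["Pneumothorax", "Pleural effusion", "Cardiomegaly", "Normal"],
    "Correct Answer": "B",
}
print(format_mcqa_prompt(example))
```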
- Languages: 🇪🇸 Spanish, 🇬🇧 English, 🇮🇹 Italian
- Scale: 955 questions per language (2,865 total samples)
- Medical Content: Questions drawn from Spanish (MIR) and Italian (SSM) residency exam material
- Image Types: Diagnostic medical images (e.g., CT scans, X-rays)
- Categories: 26 medical categories per language
- Multimodal: Each question comes with a medical image 📸
- Benchmarking: 28 VLMs evaluated in multilingual settings
Here is the general workflow for building the MMMED dataset for Vision-Language Model (VLM) evaluation:
The Extended MMMED benchmark contains 955 questions for each language and is organized into 26 medical categories per language. The table below reports updated corpus statistics used in the new study.
| Statistic | 🇪🇸 Spanish | 🇬🇧 English | 🇮🇹 Italian |
|---|---|---|---|
| # Questions | 955 | 955 | 955 |
| # Categories | 26 | 26 | 26 |
| Last Update | 2026 | 2026 | 2026 |
| Avg. Option Length | 4.20 | 3.87 | 4.03 |
| Max. Option Length | 73 | 76 | 74 |
| Total Question Tokens* | 42,262 | 41,327 | 39,716 |
| Avg. Question Length | 43.45 | 40.81 | 40.78 |
| Max. Question Length | 264 | 258 | 254 |
* Token counts are computed with the preprocessing pipeline used in this repository (SpaCy-based analysis notebooks).
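Statistics like those in the table can be reproduced in a few lines. The sketch below uses plain whitespace splitting as a lightweight stand-in for the repository's SpaCy-based pipeline, so its counts will differ slightly from the published figures.

```python
# Sketch: corpus statistics over a list of question strings.
# Whitespace tokenization is an approximation of the SpaCy pipeline
# actually used for the table above.

def question_stats(questions: list[str]) -> dict:
    """Return total, average, and maximum question length in tokens."""
    lengths = [len(q.split()) for q in questions]
    return {
        "total_tokens": sum(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "max_length": max(lengths),
    }

# Two invented questions, for illustration only
stats = question_stats([
    "What diagnosis does the CT scan suggest?",
    "Which treatment is first-line for this finding?",
])
print(stats)  # → {'total_tokens': 14, 'avg_length': 7.0, 'max_length': 7}
```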
Categorization of Image Types in the Extended MMMED Dataset. This figure presents the four main categories of images included in the dataset and their respective distributions.
Each multimodal multiple-choice question-answer (MMCQA) pair integrates the following components:

- Category: $C$
- Question: $Q$
- Image URL: $I$
- Answer Options: $O$
- Correct Answer: 💡
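The components above can be modeled in code, for example as a small dataclass. The field names below are illustrative and do not necessarily match the dataset's column names.

```python
from dataclasses import dataclass

# Sketch: one way to model an MMCQA pair (C, Q, I, O plus the correct
# answer). Field names are illustrative, not the dataset schema.

@dataclass
class MMCQAPair:
    category: str        # C - medical category
    question: str        # Q - question text
    image_url: str       # I - link to the diagnostic image
    options: list        # O - answer options
    correct_answer: str  # letter of the correct option

# A made-up example pair
pair = MMCQAPair(
    category="Radiology",
    question="What does the X-ray show?",
    image_url="https://example.org/image.png",
    options=["Pneumonia", "Atelectasis", "Effusion", "Normal"],
    correct_answer="A",
)
print(pair.category, pair.correct_answer)
```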
Here's an illustrative example of multimodal QA in three languages:
The following table reports architecture details for all tested models.
| Model | Type | Param (B) | Language Model | Vision Model |
|---|---|---|---|---|
| medvlm-r1 | Medical | 2 | Qwen2-2B | QwenViT |
| maira-2 | Medical | 7 | Vicuna-7B-v1.5 | RAD-DINO-MAIRA-2 |
| medgemma-4b-it | Medical | 4 | Gemma-3-4B | MedSigLIP-448 |
| llava-med-v1.5-7b | Medical | 7 | Mistral-7B | CLIP ViT-L/14 |
| chexagent-8b | Medical | 8 | Phi-2-2B | SigLIP-Large |
| medgemma-27b-it | Medical | 27 | Gemma-3-27B | MedSigLIP-448 |
| minicpm-v-2.6 | General | 2.6 | Qwen2-7B | SigLip-400M |
| paligemma-3b-mix-448 | General | 3 | Gemma-2B | SigLIP-So400m/14 |
| paligemma2-3b-mix-448 | General | 3 | Gemma-2-2B | SigLIP-So400m/14 |
| deepseek-vl2-tiny | General | 3 | DeepSeekMoE-3B | SigLIP-400M |
| qwen2.5-vl-3b | General | 3 | Qwen2.5-3B | QwenViT |
| phi-3.5-vision | General | 4 | Phi-3.5 | CLIP ViT-L/14 |
| gemma-3-4b-it | General | 4 | Gemma-3-4B | SigLIP |
| llava-v1.5-7b | General | 7 | Vicuna-7B-v1.5 | CLIP ViT-L/14 |
| deepseek-vl-7b | General | 7 | DeepSeek-LLM-7B | SigLIP + SAM |
| qwen2.5-vl-7b | General | 7 | Qwen2.5-7B | QwenViT |
| qwen2-vl-7b | General | 8 | Qwen2-7B | QwenViT |
| qwen3-vl-8b | General | 8 | Qwen3-8B | QwenViT |
| internvl2.5-8b | General | 8 | InternLM2.5-7B | InternViT-300M |
| paligemma2-10b-mix-448 | General | 10 | Gemma-2-9B | SigLIP-So400m/14 |
| pixtral-12b | General | 12 | Mistral-Nemo-12B | Pixtral ViT |
| gemma-3-27b-it | General | 27 | Gemma-3-27B | SigLIP |
| qwen3-vl-30b | General | 30 | Qwen3-30B | QwenViT |
| qwen2.5-vl-32b | General | 32 | Qwen2.5-32B | QwenViT |
| qwen2.5-vl-72b | General | 72 | Qwen2.5-72B | QwenViT |
| claude-4-sonnet | Closed | Unknown | Closed-Source | Closed-Source |
| gpt-5-mini | Closed | Unknown | Closed-Source | Closed-Source |
| gemini-2.5-flash | Closed | Unknown | Closed-Source | Closed-Source |
The following figure presents the overall multilingual performance trend.
For complete analysis outputs (tables and publication-quality figures), see:

- Analysis/analysis_output/tables/accuracy_table.csv
- Analysis/analysis_output/tables/summary_table.csv
- Analysis/analysis_output/figures/
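A per-language accuracy table like `accuracy_table.csv` boils down to a simple aggregation over (language, correct) prediction records. The sketch below shows that aggregation on made-up data; it is not the repository's actual analysis code.

```python
from collections import defaultdict

# Sketch: aggregate per-language accuracy from evaluation records.
# Input data is invented for illustration.

def accuracy_by_language(results):
    """Map each language code to the fraction of correct predictions."""
    totals, hits = defaultdict(int), defaultdict(int)
    for lang, correct in results:
        totals[lang] += 1
        hits[lang] += int(correct)
    return {lang: hits[lang] / totals[lang] for lang in totals}

results = [("es", True), ("es", False), ("en", True), ("it", True)]
print(accuracy_by_language(results))  # → {'es': 0.5, 'en': 1.0, 'it': 1.0}
```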
Please also cite the original work as follows:
```bibtex
@inproceedings{riccio2025multilingual,
  title={A Multilingual Multimodal Medical Examination Dataset for Visual Question Answering in Healthcare},
  author={Riccio, Giuseppe and Romano, Antonio and Barone, Mariano and Orlando, Gian Marco and Russo, Diego and Postiglione, Marco and La Gatta, Valerio and Moscato, Vincenzo},
  booktitle={2025 IEEE 38th International Symposium on Computer-Based Medical Systems (CBMS)},
  pages={435--440},
  year={2025},
  organization={IEEE Computer Society}
}
```

Dataset Usage: The dataset is intended for academic and research purposes only. It is not recommended for clinical decision-making or commercial use.
👨‍💻 This project was developed by Mariano Barone, Francesco Di Serio, Giuseppe Riccio, Antonio Romano, Vincenzo Moscato, and Marco Postiglione, University of Naples Federico II.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.





