🏥 Extended Multilingual Multimodal Medical Exam Dataset for Visual Question Answering in Healthcare
The Extended Multilingual Multimodal Medical Exam Dataset (Extended MMMED) is the new, larger release of MMMED for evaluating Vision-Language Models (VLMs) on medical multiple-choice question answering (MCQA) tasks.
Compared to the original benchmark, this extension substantially increases the number of questions and updates the benchmark with 28 tested VLMs (general-purpose, medical-specialized, and closed-source) across Spanish, English, and Italian.
The dataset includes challenging, real-world medical content from Médico Interno Residente (MIR - Spain) and Scuole di Specializzazione in Medicina (SSM - Italy) exam settings, with heterogeneous diagnostic images and clinically grounded questions.
You can access the dataset via Hugging Face. Follow these steps to download it:
```python
from datasets import load_dataset

# Log in first (e.g. via `huggingface-cli login`) to access this dataset
ds = load_dataset("praiselab-picuslab/MMMED", split="extended")
```
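Once loaded, each record can be formatted into a multiple-choice prompt for a VLM. The sketch below is illustrative only: the field names (`Question`, `Answer Options`) are assumptions based on the record structure described further down, not the verified schema — check `ds.features` for the actual field names.

```python
# Sketch: build an MCQA prompt string from one dataset record.
# Field names here are assumed; inspect ds.features for the real schema.

def format_mcqa_prompt(record: dict) -> str:
    """Render a question and its lettered options as a single prompt."""
    lines = [record["Question"]]
    for letter, option in zip("ABCD", record["Answer Options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

# A made-up record, mimicking the MMCQA structure described below
example = {
    "Question": "Which finding is most consistent with the image?",
    "Answer Options": ["Pneumothorax", "Pleural effusion", "Cardiomegaly", "Normal"],
    "Correct Answer": "B",
}
print(format_mcqa_prompt(example))
```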
- Languages: 🇪🇸 Spanish, 🇬🇧 English, 🇮🇹 Italian
- Scale: 955 questions per language (2,865 total samples)
- Medical Content: Questions drawn from Spanish (MIR) and Italian (SSM) residency exam material
- Image Types: Diagnostic medical images (e.g., CT scans, X-rays)
- Categories: 26 medical categories per language
- Multimodal: Each question comes with a medical image 📸
- Benchmarking: 28 VLMs evaluated in multilingual settings
Here is the general workflow for building the MMMED dataset for Vision-Language Model (VLM) evaluation:
The Extended MMMED benchmark contains 955 questions for each language and is organized into 26 medical categories per language. The table below reports updated corpus statistics used in the new study.
| Statistic | 🇪🇸 Spanish | 🇬🇧 English | 🇮🇹 Italian |
|---|---|---|---|
| # Questions | 955 | 955 | 955 |
| # Categories | 26 | 26 | 26 |
| Last Update | 2026 | 2026 | 2026 |
| Avg. Option Length | 4.20 | 3.87 | 4.03 |
| Max. Option Length | 73 | 76 | 74 |
| Total Question Tokens* | 42,262 | 41,327 | 39,716 |
| Avg. Question Length | 43.45 | 40.81 | 40.78 |
| Max. Question Length | 264 | 258 | 254 |
* Token counts are computed with the preprocessing pipeline used in this repository (SpaCy-based analysis notebooks).
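Statistics like those in the table can be reproduced in a few lines. The sketch below uses plain whitespace splitting as a lightweight stand-in for the repository's SpaCy-based pipeline, so its counts will differ slightly from the published figures.

```python
# Sketch: corpus statistics over a list of question strings.
# Whitespace tokenization is an approximation of the SpaCy pipeline
# actually used for the table above.

def question_stats(questions: list[str]) -> dict:
    """Return total, average, and maximum question length in tokens."""
    lengths = [len(q.split()) for q in questions]
    return {
        "total_tokens": sum(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "max_length": max(lengths),
    }

# Two invented questions, for illustration only
stats = question_stats([
    "What diagnosis does the CT scan suggest?",
    "Which treatment is first-line for this finding?",
])
print(stats)  # → {'total_tokens': 14, 'avg_length': 7.0, 'max_length': 7}
```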
Categorization of Image Types in the Extended MMMED Dataset. This figure presents the four main categories of images included in the dataset and their respective distributions.
Each multimodal multiple-choice question-answer (MMCQA) pair integrates the following components:

- Category: $C$
- Question: $Q$
- Image URL: $I$
- Answer Options: $O$
- Correct Answer: 💡
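The components above can be modeled in code, for example as a small dataclass. The field names below are illustrative and do not necessarily match the dataset's column names.

```python
from dataclasses import dataclass

# Sketch: one way to model an MMCQA pair (C, Q, I, O plus the correct
# answer). Field names are illustrative, not the dataset schema.

@dataclass
class MMCQAPair:
    category: str        # C - medical category
    question: str        # Q - question text
    image_url: str       # I - link to the diagnostic image
    options: list        # O - answer options
    correct_answer: str  # letter of the correct option

# A made-up example pair
pair = MMCQAPair(
    category="Radiology",
    question="What does the X-ray show?",
    image_url="https://example.org/image.png",
    options=["Pneumonia", "Atelectasis", "Effusion", "Normal"],
    correct_answer="A",
)
print(pair.category, pair.correct_answer)
```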
Here's an illustrative example of multimodal QA in three languages:
The following table reports architecture details for all tested models.
| Model | Type | Param (B) | Language Model | Vision Model |
|---|---|---|---|---|
| medvlm-r1 | Medical | 2 | Qwen2-2B | QwenViT |
| maira-2 | Medical | 7 | Vicuna-7B-v1.5 | RAD-DINO-MAIRA-2 |
| medgemma-4b-it | Medical | 4 | Gemma-3-4B | MedSigLIP-448 |
| llava-med-v1.5-7b | Medical | 7 | Mistral-7B | CLIP ViT-L/14 |
| chexagent-8b | Medical | 8 | Phi-2-2B | SigLIP-Large |
| medgemma-27b-it | Medical | 27 | Gemma-3-27B | MedSigLIP-448 |
| minicpm-v-2.6 | General | 2.6 | Qwen2-7B | SigLip-400M |
| paligemma-3b-mix-448 | General | 3 | Gemma-2B | SigLIP-So400m/14 |
| paligemma2-3b-mix-448 | General | 3 | Gemma-2-2B | SigLIP-So400m/14 |
| deepseek-vl2-tiny | General | 3 | DeepSeekMoE-3B | SigLIP-400M |
| qwen2.5-vl-3b | General | 3 | Qwen2.5-3B | QwenViT |
| phi-3.5-vision | General | 4 | Phi-3.5 | CLIP ViT-L/14 |
| gemma-3-4b-it | General | 4 | Gemma-3-4B | SigLIP |
| llava-v1.5-7b | General | 7 | Vicuna-7B-v1.5 | CLIP ViT-L/14 |
| deepseek-vl-7b | General | 7 | DeepSeek-LLM-7B | SigLIP + SAM |
| qwen2.5-vl-7b | General | 7 | Qwen2.5-7B | QwenViT |
| qwen2-vl-7b | General | 8 | Qwen2-7B | QwenViT |
| qwen3-vl-8b | General | 8 | Qwen3-8B | QwenViT |
| internvl2.5-8b | General | 8 | InternLM2.5-7B | InternViT-300M |
| paligemma2-10b-mix-448 | General | 10 | Gemma-2-9B | SigLIP-So400m/14 |
| pixtral-12b | General | 12 | Mistral-Nemo-12B | Pixtral ViT |
| gemma-3-27b-it | General | 27 | Gemma-3-27B | SigLIP |
| qwen3-vl-30b | General | 30 | Qwen3-30B | QwenViT |
| qwen2.5-vl-32b | General | 32 | Qwen2.5-32B | QwenViT |
| qwen2.5-vl-72b | General | 72 | Qwen2.5-72B | QwenViT |
| claude-4-sonnet | Closed | Unknown | Closed-Source | Closed-Source |
| gpt-5-mini | Closed | Unknown | Closed-Source | Closed-Source |
| gemini-2.5-flash | Closed | Unknown | Closed-Source | Closed-Source |
The following figure presents the overall multilingual performance trend.
For complete analysis outputs (tables and publication-quality figures), see:

- Analysis/analysis_output/tables/accuracy_table.csv
- Analysis/analysis_output/tables/summary_table.csv
- Analysis/analysis_output/figures/
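A per-language accuracy table like `accuracy_table.csv` boils down to a simple aggregation over (language, correct) prediction records. The sketch below shows that aggregation on made-up data; it is not the repository's actual analysis code.

```python
from collections import defaultdict

# Sketch: aggregate per-language accuracy from evaluation records.
# Input data is invented for illustration.

def accuracy_by_language(results):
    """Map each language code to the fraction of correct predictions."""
    totals, hits = defaultdict(int), defaultdict(int)
    for lang, correct in results:
        totals[lang] += 1
        hits[lang] += int(correct)
    return {lang: hits[lang] / totals[lang] for lang in totals}

results = [("es", True), ("es", False), ("en", True), ("it", True)]
print(accuracy_by_language(results))  # → {'es': 0.5, 'en': 1.0, 'it': 1.0}
```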
Please also cite the original work as follows:
```bibtex
@inproceedings{riccio2025multilingual,
  title={A Multilingual Multimodal Medical Examination Dataset for Visual Question Answering in Healthcare},
  author={Riccio, Giuseppe and Romano, Antonio and Barone, Mariano and Orlando, Gian Marco and Russo, Diego and Postiglione, Marco and La Gatta, Valerio and Moscato, Vincenzo},
  booktitle={2025 IEEE 38th International Symposium on Computer-Based Medical Systems (CBMS)},
  pages={435--440},
  year={2025},
  organization={IEEE Computer Society}
}
```

Dataset Usage: The dataset is intended for academic and research purposes only. It is not recommended for clinical decision-making or commercial use.
👨‍💻 This project was developed by Mariano Barone, Francesco Di Serio, Giuseppe Riccio, Antonio Romano, Vincenzo Moscato, and Marco Postiglione, University of Naples Federico II.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.





