A consistency-aware evaluation framework built on top of FunBench, extending it with multi-dimensional consistency analysis for Multimodal Large Language Models (MLLMs) in fundus image interpretation.
This framework extends FunBench with four consistency evaluation dimensions:
- L5 — Cross-task Reasoning Consistency: Detects logical contradictions between L3 lesion recognition and L4 disease diagnosis
- L6 — Description Dependency: Evaluates how much models rely on text descriptions (E-mode2 vs E-mode3)
- L7 — Option Order Robustness: Tests prediction stability when answer options are shuffled (see the sketch after this list)
- L8 — Hierarchical Consistency: Evaluates logical consistency across DR grading granularities
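To make the L7 check concrete, below is a minimal sketch of how an option-order robustness score could be computed from paired predictions. The record fields (question_id, predicted_answer_text) and the function name are illustrative assumptions, not the framework's actual API.

```python
# Sketch only: a model is counted as robust on a question if it selects the
# same answer content under every option ordering. Field names are assumptions.
from collections import defaultdict

def option_order_robustness(records):
    """records: dicts like {"question_id": ..., "predicted_answer_text": ...},
    with the same question appearing once per option ordering."""
    answers = defaultdict(set)
    for r in records:
        answers[r["question_id"]].add(r["predicted_answer_text"])
    stable = sum(1 for texts in answers.values() if len(texts) == 1)
    return stable / max(len(answers), 1)

if __name__ == "__main__":
    demo = [
        {"question_id": "q1", "predicted_answer_text": "drusen"},
        {"question_id": "q1", "predicted_answer_text": "drusen"},
        {"question_id": "q2", "predicted_answer_text": "mild NPDR"},
        {"question_id": "q2", "predicted_answer_text": "moderate NPDR"},
    ]
    print(option_order_robustness(demo))  # 0.5
```

Comparing answer content rather than option letters is what makes the check invariant to shuffling.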
├── evaluate.py # Original FunBench evaluation entry
├── evaluate_all.py # Batch evaluation across all tasks
├── evaluate_cot.py # Chain-of-thought evaluation
├── predict.py # Prediction script for MLLMs
├── predict_cot.py # CoT prediction script
├── predict_hf_api.py # HuggingFace API predictor
├── predict_huatuo_local.py # Local HuatuoGPT predictor
├── predict_qilin_local_cot.py # Local Qilin CoT predictor
├── preprocess.py # Dataset preprocessing
├── preprocess_info.json # Preprocessing configuration
├── evaluation/ # Evaluation modules (L5–L8)
├── cot_evaluation/ # Chain-of-thought evaluation modules
├── FunBench/ # FunBench benchmark data (L1–L4)
└── requirements.txt
pip install -r requirements.txt

FunBench uses 14 public fundus datasets. Download images from the links below and place them under datasets/ (a quick presence check is sketched after the list):
- CFP: IDRiD, DDR, JSIEC, RFMiD, OIA-ODIR, Retinal-Lesions
- OCT: OCTDL, NEH, OCTID, UCSD, RETOUCH
- UWF: TOP
- Multimodal: MMC-AMD, DeepDRiD
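Before running preprocess.py, it may help to confirm the downloads are in place. The snippet below is only a sketch: it assumes each dataset is unpacked into a folder named after it under datasets/, so adjust the names and paths to match how you actually organized the data.

```python
# Sketch only: checks that the expected dataset folders exist under datasets/.
# Folder names are assumed to match the dataset names listed above.
from pathlib import Path

EXPECTED = [
    "IDRiD", "DDR", "JSIEC", "RFMiD", "OIA-ODIR", "Retinal-Lesions",  # CFP
    "OCTDL", "NEH", "OCTID", "UCSD", "RETOUCH",                       # OCT
    "TOP",                                                            # UWF
    "MMC-AMD", "DeepDRiD",                                            # Multimodal
]

root = Path("datasets")
missing = [name for name in EXPECTED if not (root / name).is_dir()]
if missing:
    print("Missing dataset folders:", ", ".join(missing))
else:
    print("All 14 dataset folders found.")
```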
python preprocess.py
python predict.py
python evaluate.py
# or for all tasks:
python evaluate_all.py

This work builds upon FunBench. We sincerely thank the original authors for their contribution:
@inproceedings{miccai25-funbench,
title = {FunBench: Benchmarking Fundus Reading Skills of MLLMs},
author = {Qijie Wei and Kaiheng Qian and Xirong Li},
booktitle = {MICCAI},
year = {2025}
}