A consistency-aware evaluation framework built on top of FunBench, extending it with multi-dimensional consistency analysis for Multimodal Large Language Models (MLLMs) in fundus image interpretation.
This framework extends FunBench with four consistency evaluation dimensions:
- L5 — Cross-task Reasoning Consistency: Detects logical contradictions between L3 lesion recognition and L4 disease diagnosis
- L6 — Description Dependency: Evaluates how much models rely on text descriptions (E-mode2 vs E-mode3)
- L7 — Option Order Robustness: Tests prediction stability when answer options are shuffled (see the sketch after this list)
- L8 — Hierarchical Consistency: Evaluates logical consistency across DR grading granularities
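To make the L7 check concrete, below is a minimal sketch of how an option-order robustness score could be computed from paired predictions. The record fields (question_id, predicted_answer_text) and the function name are illustrative assumptions, not the framework's actual API.

```python
# Sketch only: a model is counted as robust on a question if it selects the
# same answer content under every option ordering. Field names are assumptions.
from collections import defaultdict

def option_order_robustness(records):
    """records: dicts like {"question_id": ..., "predicted_answer_text": ...},
    with the same question appearing once per option ordering."""
    answers = defaultdict(set)
    for r in records:
        answers[r["question_id"]].add(r["predicted_answer_text"])
    stable = sum(1 for texts in answers.values() if len(texts) == 1)
    return stable / max(len(answers), 1)

if __name__ == "__main__":
    demo = [
        {"question_id": "q1", "predicted_answer_text": "drusen"},
        {"question_id": "q1", "predicted_answer_text": "drusen"},
        {"question_id": "q2", "predicted_answer_text": "mild NPDR"},
        {"question_id": "q2", "predicted_answer_text": "moderate NPDR"},
    ]
    print(option_order_robustness(demo))  # 0.5
```

Comparing answer content rather than option letters is what makes the check invariant to shuffling.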
├── evaluate.py # Original FunBench evaluation entry
├── evaluate_all.py # Batch evaluation across all tasks
├── evaluate_cot.py # Chain-of-thought evaluation
├── predict.py # Prediction script for MLLMs
├── predict_cot.py # CoT prediction script
├── predict_hf_api.py # HuggingFace API predictor
├── predict_huatuo_local.py # Local HuatuoGPT predictor
├── predict_qilin_local_cot.py # Local Qilin CoT predictor
├── preprocess.py # Dataset preprocessing
├── preprocess_info.json # Preprocessing configuration
├── evaluation/ # Evaluation modules (L5–L8)
├── cot_evaluation/ # Chain-of-thought evaluation modules
├── FunBench/ # FunBench benchmark data (L1–L4)
└── requirements.txt
pip install -r requirements.txt

FunBench uses 14 public fundus datasets. Download images from the links below and place them under datasets/ (a quick presence check is sketched after the list):
- CFP: IDRiD, DDR, JSIEC, RFMiD, OIA-ODIR, Retinal-Lesions
- OCT: OCTDL, NEH, OCTID, UCSD, RETOUCH
- UWF: TOP
- Multimodal: MMC-AMD, DeepDRiD
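Before running preprocess.py, it may help to confirm the downloads are in place. The snippet below is only a sketch: it assumes each dataset is unpacked into a folder named after it under datasets/, so adjust the names and paths to match how you actually organized the data.

```python
# Sketch only: checks that the expected dataset folders exist under datasets/.
# Folder names are assumed to match the dataset names listed above.
from pathlib import Path

EXPECTED = [
    "IDRiD", "DDR", "JSIEC", "RFMiD", "OIA-ODIR", "Retinal-Lesions",  # CFP
    "OCTDL", "NEH", "OCTID", "UCSD", "RETOUCH",                       # OCT
    "TOP",                                                            # UWF
    "MMC-AMD", "DeepDRiD",                                            # Multimodal
]

root = Path("datasets")
missing = [name for name in EXPECTED if not (root / name).is_dir()]
if missing:
    print("Missing dataset folders:", ", ".join(missing))
else:
    print("All 14 dataset folders found.")
```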
python preprocess.py
python predict.py
python evaluate.py
# or for all tasks:
python evaluate_all.py

This work builds upon FunBench. We sincerely thank the original authors for their contribution:
@inproceedings{miccai25-funbench,
title = {FunBench: Benchmarking Fundus Reading Skills of MLLMs},
author = {Qijie Wei and Kaiheng Qian and Xirong Li},
booktitle = {MICCAI},
year = {2025}
}