Clinical decision-making often involves interpreting medical images (e.g., radiology scans) to reach a diagnosis. Retrieving relevant visual and textual information from the medical literature and hospital records can improve diagnostic accuracy.
CLARE (Clinical LVLM-Aware Retrieval) is a lightweight, LVLM-aware multimodal retriever: it learns to return images and texts that guide the LVLM toward correct predictions.
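At inference time, CLARE-style retrieval augmentation follows the usual retrieve-then-read pattern: embed the query, look up the most relevant image/text passages, and condition the LVLM reader on them. The sketch below is illustrative only; the object and method names (`encode_query`, `search`, `generate`) are hypothetical placeholders, not the repository's actual API.

```python
# Illustrative retrieve-then-read loop (hypothetical API, not the repo's actual interface).
def answer_with_retrieval(question, query_image, retriever, index, lvlm, k=5):
    # 1. Embed the query (image + question) with the LVLM-aware retriever.
    query_emb = retriever.encode_query(image=query_image, text=question)

    # 2. Look up the top-k most similar image/text passages in the prebuilt index.
    passages = index.search(query_emb, top_k=k)

    # 3. Condition the LVLM (the reader) on the retrieved evidence.
    prompt = "\n".join(p["text"] for p in passages) + "\n\nQuestion: " + question
    return lvlm.generate(images=[query_image] + [p["image"] for p in passages],
                         prompt=prompt)
```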
```bash
git clone <Add here>
cd CLARE
export PYTHONPATH=./
python3.9 -m venv clare_env
source clare_env/bin/activate
pip install -r requirements.txt
```

For running the qwen2_vl model, a different environment is needed:
```bash
git clone <Add here>
cd CLARE
export PYTHONPATH=./
python3.9 -m venv clare_qwen_env
source clare_qwen_env/bin/activate
pip install -r requirements_qwen.txt
```

Download the required datasets:
- ROCO: https://github.com/razorx89/roco-dataset
- PMC-OA: https://huggingface.co/datasets/axiong/pmc_oa
- MIMIC-CXR: https://physionet.org/content/mimic-cxr-jpg/ (permission required)
Organize the datasets into JSONL format:
```bash
python preporcess/mimic_cxr_processor.py --path_folder_location /path/to/mimic/dataset --path_save /path/to/save/output
python preporcess/prepare_pmc_oa_for_embeddings.py --home_pmc_oa_project /path/to/project/PYCHARMPROJECTS/PMC_OA/ --home_pmc_oa_images /path/to/images/ --path_save /path/to/save/output
python preporcess/organize_roco_for_embedding.py --home_pmc_oa_project /path/to/project/PYCHARMPROJECTS/PMC_OA/ --home_pmc_oa_images /path/to/images/ --path_save /path/to/save/output
```
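The exact JSONL schema is whatever the preprocessing scripts emit; as a rough illustration, each line is expected to pair an image path with its associated caption or report text. The field names below (`image_path`, `text`, `source`) are assumptions for illustration, not the repository's actual keys.

```python
import json

# Hypothetical example of one JSONL record per image-text pair (field names are assumed).
record = {
    "image_path": "/path/to/images/ROCO_00001.jpg",
    "text": "Chest X-ray showing a right lower lobe consolidation.",
    "source": "roco",
}

with open("roco.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```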
Merge the three JSONL files into one:

```bash
python preporcess/merge_jsonl.py --mimic_file path/to/mimic-cxr.jsonl --roco_file path/to/roco.jsonl --pmc_file path/to/pmc-oa.jsonl --output_file merged_datasets.jsonl
```

The scripts `vqarad_embeddings_pmc_encode_img.sh` and `vqarad_embeddings_pmc_encode_text.sh` encode the image and text embeddings. In each script, update the `--passages` and `--save_index_path` arguments, then run the index creation:
```bash
source run_scripts/create_embeddings/vqarad_embeddings_pmc_encode_img.sh
source run_scripts/create_embeddings/vqarad_embeddings_pmc_encode_text.sh
```
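For intuition, the saved image and text embeddings are later used for nearest-neighbor retrieval. The sketch below assumes the index is simply a matrix of L2-normalized embeddings held in NumPy; the repository's actual index format and search code may differ.

```python
import numpy as np

# Illustrative nearest-neighbor search over a precomputed embedding index
# (assumes embeddings are L2-normalized, so dot product == cosine similarity).
def search_index(query_emb: np.ndarray, passage_embs: np.ndarray, top_k: int = 5):
    scores = passage_embs @ query_emb          # (num_passages,)
    top_ids = np.argsort(-scores)[:top_k]      # indices of the best-scoring passages
    return top_ids, scores[top_ids]

# Usage with random placeholders standing in for real embeddings.
passages = np.random.randn(1000, 512).astype("float32")
passages /= np.linalg.norm(passages, axis=1, keepdims=True)
query = passages[0]                            # a query identical to passage 0
ids, scores = search_index(query, passages)
print(ids[0] == 0, scores[0])                  # passage 0 should rank first
```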
The reader is trained on the LlamaFactory platform; we supply scripts for preparing the data and training in lamafactory_scripts/README.md. Training scripts for each benchmark are available in run_scripts/training_scripts/.
Download the benchmark datasets:
- MedMNIST (breast, retina, and derma): https://medmnist.com/
- BREST: https://physionet.org/content/brazilian-ophthalmological/1.0.0/
- VinDR-PCXR: https://physionet.org/content/vindr-pcxr/1.0.0/
- VQA-RAD: https://huggingface.co/datasets/flaviagiammarino/vqa-rad
- SLAKE-English: https://huggingface.co/datasets/mdwiratathya/SLAKE-vqa-english
- PathVQA: https://huggingface.co/datasets/flaviagiammarino/path-vqa
After downloading the data, follow the instructions in data/README.md (Coming Soon).
Train the multimodal retriever in two stages:
- Text retriever head
- Image retriever head
For each script, update:
- Reader checkpoint path
- Index paths (text and image)
- Checkpoint directory
```bash
source run_scripts/training_retriever_text/<name_benchmark>/<chosen_script>
source run_scripts/training_retriever_image/<name_benchmark>/<chosen_script>
```
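Conceptually, the scripts above train the retriever heads with feedback from the frozen reader. One common way to implement this kind of LVLM-aware supervision is a perplexity-distillation-style objective, where the retriever's score distribution over candidate passages is pulled toward how much each passage helps the reader produce the correct answer. The sketch below only illustrates that idea; the repository's actual loss is defined in its training scripts.

```python
import torch
import torch.nn.functional as F

def lvlm_aware_retriever_loss(retriever_scores: torch.Tensor,
                              reader_log_likelihoods: torch.Tensor,
                              temperature: float = 1.0) -> torch.Tensor:
    """Illustrative reader-aware objective (perplexity-distillation style).

    retriever_scores:       (batch, k) similarity scores for k candidate passages
    reader_log_likelihoods: (batch, k) log-likelihood of the gold answer when the
                            frozen LVLM reader is conditioned on each passage
    """
    # Target: passages that make the reader more confident in the gold answer get more mass.
    target = F.softmax(reader_log_likelihoods / temperature, dim=-1).detach()
    log_probs = F.log_softmax(retriever_scores / temperature, dim=-1)
    # KL(target || retriever) pushes the retriever toward reader-preferred passages.
    return F.kl_div(log_probs, target, reduction="batchmean")
```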
Use calculate_metrics.py to compute evaluation metrics from prediction files:

```bash
python calculate_metrics.py \
    --prediction_file /path/to/predictions.jsonl \
    --classes "0" "1" "2" "3" "4"
```

Arguments:
- `--prediction_file`: path to a JSONL file containing model predictions (required)
- `--classes`: list of class labels for classification tasks (required)

