We present Touchstone, a large-scale medical segmentation benchmark based on 5,195 annotated CT volumes from 76 hospitals for training, and 6,933 CT volumes from 8 additional hospitals for testing. We invite AI inventors to train their models on AbdomenAtlas, and we independently evaluate their algorithms. We have already collaborated with 14 influential research teams, and we are still accepting new submissions.
Note
Training set
- Touchstone 1.0: AbdomenAtlas1.0Mini (N=5,195)
- Touchstone 2.0: AbdomenAtlas1.1Mini (N=9,262)
Test set
- Proprietary JHH dataset (N=5,172)
- Public TotalSegmentator V2 dataset (N=1,228)
- Public DAP Atlas dataset (N=533)
Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?
Pedro R. A. S. Bassi1, Wenxuan Li1, Yucheng Tang2, Fabian Isensee3, ..., Alan Yuille1, Zongwei Zhou1
1Johns Hopkins University, 2NVIDIA, 3DKFZ
NeurIPS 2024
project | paper | code
Aorta - NexToU
Gallbladder - STU-Net-B & MedNeXt
KidneyL - Diff-UNet
KidneyR - Diff-UNet
Liver - MedNeXt
Pancreas - MedNeXt
Postcava - STU-Net-B & MedNeXt
Spleen - nnU-Net ResEncL
Stomach - STU-Net-B & MedNeXt & nnU-Net ResEncL
Each cell in the significance heatmap above indicates a one-sided statistical test. Red indicates that the x-axis AI algorithm is significantly superior to the y-axis algorithm in terms of DSC, for one organ.
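To make the idea of a one-sided paired comparison concrete, here is a minimal sketch using a sign-flip permutation test on per-scan DSC differences. This is an illustrative stand-in, not necessarily the test used to produce the heatmaps; the function name and the sample DSC values are hypothetical.

```python
import random


def one_sided_paired_permutation_test(scores_x, scores_y, n_resamples=10000, seed=0):
    """Estimate a p-value for H1: mean(scores_x - scores_y) > 0.

    Illustrative stand-in only; the benchmark's own test may differ.
    """
    if len(scores_x) != len(scores_y):
        raise ValueError("paired test needs one score per scan for both algorithms")
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(n_resamples):
        # Under H0 the sign of each paired difference is arbitrary, so flip at random.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if flipped >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (at_least_as_large + 1) / (n_resamples + 1)


# Hypothetical per-scan DSC values for two algorithms on the same scans.
dsc_x = [0.96, 0.95, 0.97, 0.94, 0.96, 0.95, 0.97, 0.96]
dsc_y = [0.93, 0.94, 0.95, 0.92, 0.94, 0.93, 0.95, 0.93]
p_value = one_sided_paired_permutation_test(dsc_x, dsc_y)
print(f"p = {p_value:.4f}")
```

A small p-value here supports the claim that the first algorithm scores higher than the second on this organ; swapping the arguments tests the opposite direction.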
We provide DSC and NSD per CT scan for each checkpoint in test sets #2 and #3, and a code tutorial for easy:
- Per-organ performance analysis
- Performance comparison across demographic groups (age, sex, race, scanner, diagnosis, etc.)
- Pair-wise statistical tests and significance heatmaps
- Boxplots
You can easily modify our code to compare your custom model to our checkpoints, or to analyze segmentation performance in custom demographic groups (e.g., Hispanic men aged 20-25).
Code tutorial
Per-sample results are in CSV files inside the folders totalsegmentator_results and dapatlas_results.
File structure
totalsegmentator_results
├── Diff-UNet
│   ├── dsc.csv
│   └── nsd.csv
├── LHU-Net
│   ├── dsc.csv
│   └── nsd.csv
├── MedNeXt
│   ├── dsc.csv
│   └── nsd.csv
└── ...
dapatlas_results
├── Diff-UNet
│   ├── dsc.csv
│   └── nsd.csv
├── LHU-Net
│   ├── dsc.csv
│   └── nsd.csv
├── MedNeXt
│   ├── dsc.csv
│   └── nsd.csv
└── ...
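As a rough sketch of per-organ analysis on one of these per-sample files, the snippet below computes the mean and median score for each organ. It assumes a layout of one row per scan and one column per organ, plus an ID column called `name`; these column names and the inline sample data are hypothetical, so check the actual headers in dsc.csv and nsd.csv before reusing it.

```python
import csv
import io
import statistics


def per_organ_summary(csv_file, id_column="name"):
    """Return {organ: (mean, median)} from a per-sample score table.

    Assumes one row per scan and one column per organ (hypothetical layout).
    """
    scores = {}
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            # Skip the ID column, overflow fields, and empty cells.
            if column is None or column == id_column or not value:
                continue
            scores.setdefault(column, []).append(float(value))
    return {
        organ: (statistics.mean(values), statistics.median(values))
        for organ, values in scores.items()
    }


# Hypothetical stand-in for the contents of a dsc.csv file.
sample = io.StringIO(
    "name,liver,spleen\n"
    "case_001,0.96,0.94\n"
    "case_002,0.94,\n"
    "case_003,0.98,0.90\n"
)
summary = per_organ_summary(sample)
for organ, (mean, median) in summary.items():
    print(f"{organ}: mean={mean:.3f} median={median:.3f}")
```

To run it on a real file, pass an open file handle instead of the in-memory sample, for example `open("totalsegmentator_results/MedNeXt/dsc.csv", newline="")`.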
git clone https://github.com/MrGiovanni/Touchstone
cd Touchstone
conda env create -f environment.yml
source activate touchstone
python -m ipykernel install --user --name touchstone --display-name "touchstone"
cd notebooks
jupyter nbconvert --to notebook --execute --ExecutePreprocessor.kernel_name=touchstone TotalSegmentatorMetadata.ipynb
jupyter nbconvert --to notebook --execute --ExecutePreprocessor.kernel_name=touchstone DAPAtlasMetadata.ipynb
#results: plots are saved inside Touchstone/outputs/plotsTotalSegmentator/ and Touchstone/outputs/plotsDAPAtlas/
cd ../plot
python AggregatedBoxplot.py --stats
#results: Touchstone/outputs/summary_groups.pdf
#statistical significance maps (Appendix D.2.3):
python PlotAllSignificanceMaps.py
python PlotAllSignificanceMaps.py --organs second_half
python PlotAllSignificanceMaps.py --nsd
python PlotAllSignificanceMaps.py --organs second_half --nsd
#results: Touchstone/outputs/heatmaps
cd ../notebooks
jupyter nbconvert --to notebook --execute --ExecutePreprocessor.kernel_name=touchstone GroupAnalysis.ipynb
#results: Touchstone/outputs/box_plots
Define custom demographic groups (e.g., Hispanic men aged 20-25) and compare AI performance on them
The CSV result files in totalsegmentator_results/ and dapatlas_results/ contain per-sample DSC and NSD scores. Rich metadata for each of those samples (sex, age, scanner, diagnosis, ...) is available in metaTotalSeg.csv and 'Clinical Metadata FDG PET_CT Lesions.csv', for TotalSegmentator and DAP Atlas, respectively. The code in TotalSegmentatorMetadata.ipynb and DAPAtlasMetadata.ipynb extracts this metadata into simplified group lists (e.g., a list of all samples representing male patients), and saves these lists in the folders plotsTotalSegmentator/ and plotsDAPAtlas/. You can modify the code to generate custom sample lists (e.g., all men aged 30-35). To compare a set of groups, the filenames of all lists in the set should begin with the same name. For example, comp1_list_a.pt, comp1_list_b.pt, comp1_list_C.pt can represent a set of 3 groups. Then, PlotGroup.py can draw boxplots and perform statistical tests comparing the AI algorithm's results (DSC and NSD) for the samples inside the different custom lists you created. In our example, you just need to specify --group_name comp1 when running PlotGroup.py:
python utils/PlotGroup.py --ckpt_root totalsegmentator_results/ --group_root outputs/plotsTotalSegmentator/ --group_name comp1 --organ liver --stats
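As an illustration of how a custom list such as "all men aged 30-35" might be selected from a metadata table, here is a minimal sketch. The column names (`image_id`, `sex`, `age`), the function name, and the sample rows are hypothetical; the real metadata files may use different headers and encodings. Saving the result in the .pt list format used by the notebooks is not shown.

```python
import csv
import io


def select_samples(metadata_file, sex, min_age, max_age,
                   id_column="image_id", sex_column="sex", age_column="age"):
    """Return IDs of samples matching a sex and an inclusive age range.

    Column names are assumptions; adapt them to the actual metadata file.
    """
    selected = []
    for row in csv.DictReader(metadata_file):
        try:
            age = float(row[age_column])
        except (KeyError, TypeError, ValueError):
            continue  # Skip rows with a missing or non-numeric age.
        row_sex = (row.get(sex_column) or "").strip().lower()
        if row_sex == sex.lower() and min_age <= age <= max_age:
            selected.append(row[id_column])
    return selected


# Hypothetical metadata rows standing in for a real metadata CSV.
metadata = io.StringIO(
    "image_id,sex,age\n"
    "s0001,M,32\n"
    "s0002,F,34\n"
    "s0003,m,41\n"
    "s0004,M,\n"
    "s0005,M,35\n"
)
men_30_35 = select_samples(metadata, sex="m", min_age=30, max_age=35)
print(men_30_35)
```

Each list built this way would then need to be saved under a shared filename prefix (such as comp1) so that PlotGroup.py can treat the lists as one comparison set.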
Please cite the following papers if you find our leaderboard or dataset helpful.
@article{li2024abdomenatlas,
title={AbdomenAtlas: A large-scale, detailed-annotated, \& multi-center dataset for efficient transfer learning and open algorithmic benchmarking},
author={Li, Wenxuan and Qu, Chongyu and Chen, Xiaoxi and Bassi, Pedro RAS and Shi, Yijia and Lai, Yuxiang and Yu, Qian and Xue, Huimin and Chen, Yixiong and Lin, Xiaorui and others},
journal={Medical Image Analysis},
pages={103285},
year={2024},
publisher={Elsevier}
}
@inproceedings{li2024well,
title={How Well Do Supervised Models Transfer to 3D Image Segmentation?},
author={Li, Wenxuan and Yuille, Alan and Zhou, Zongwei},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024}
}
@article{qu2023abdomenatlas,
title={Abdomenatlas-8k: Annotating 8,000 CT volumes for multi-organ segmentation in three weeks},
author={Qu, Chongyu and Zhang, Tiezheng and Qiao, Hualin and Tang, Yucheng and Yuille, Alan L and Zhou, Zongwei and others},
journal={Advances in Neural Information Processing Systems},
volume={36},
year={2023}
}
This work was supported by the Lustgarten Foundation for Pancreatic Cancer Research and the McGovern Foundation. Paper content is covered by patents pending.