Detecting Breast Carcinoma Metastasis on Whole-Slide Images by Partially Subsampled Multiple Instance Learning
This repository provides code to reproduce the results of the following working paper:
Yu, B., Li, X., Zhou, J., and Wang, H. Detecting Breast Carcinoma Metastasis on Whole-Slide Images by Partially Subsampled Multiple Instance Learning, Working Paper.
To apply the proposed PSMIL method directly, you can also use the PSMIL package.
This repository has three parts: simulation studies that validate the finite-sample performance of the proposed estimators, a robustness analysis, and a real data analysis on the CAMELYON16 dataset.
The file tree of the repository is as follows:
├── LICENSE
├── README.md
├── RealDataAnalysis
├── Robustness
├── Simulation
├── datadict.txt
└── requirements.txt
- datadict.txt contains the data dictionary for the database.
- requirements.txt lists the specific Python package versions used in this repository.
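Before running anything, it can help to confirm that the installed packages match the pinned versions. The following is a minimal sketch, assuming requirements.txt uses plain `name==version` lines; the helper names `parse_pins` and `check_pins` are illustrative and not part of this repository.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines; skip blanks, comments, and other specifiers."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue  # assumption: only exact pins are checked
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {name: (wanted, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

Calling `check_pins(parse_pins(open("requirements.txt").read()))` from the repository root returns an empty dictionary when every pinned package is installed at the listed version.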
The following files in the Simulation folder can be used to reproduce the simulation results for the proposed estimators presented in the paper.
├── Simulation
│ ├── para_est_KmeansInsInit.pkl
│ ├── simucode_Study1
│ ├── simucode_Study2
│ ├── simucode_Study3
│ ├── simucode_Study4
- The data file para_est_KmeansInsInit.pkl contains the feature data for the real-data-based simulation. It is not included in this repository because of its size, but it can be obtained here or by contacting the authors via email.
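Once the data file has been downloaded, it can be inspected before running a simulation. This is a minimal sketch using the standard pickle module; the internal structure of the file is not documented here, so `describe` only reports the top-level type, and both function names are illustrative. If the file stores NumPy or PyTorch objects, those packages must be installed for loading to succeed.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a pickle file and return the stored object.

    Only unpickle files from a trusted source: unpickling can run arbitrary code.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; download it first")
    with path.open("rb") as f:
        return pickle.load(f)


def describe(obj):
    """Return a short summary of the object's top-level structure."""
    if isinstance(obj, dict):
        return f"dict with keys: {sorted(map(str, obj))}"
    if isinstance(obj, (list, tuple)):
        return f"{type(obj).__name__} of length {len(obj)}"
    return type(obj).__name__
```

For example, `describe(load_pickle("Simulation/para_est_KmeansInsInit.pkl"))` summarizes the file once it has been placed in the Simulation folder.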
We use Study2 as an example to illustrate the file tree as follows:
├── simucode_Study2
│ ├── estimation.py
│ ├── plot_prop_Study2.R
│ └── simu.py
- estimation.py implements the proposed estimating methods.
- plot_prop_Study2.R contains the R code for generating figures.
- simu.py simulates data and estimates parameters.
The following code files in the Robustness folder can be used to reproduce the simulation results for the robustness analysis presented in the paper. The settings are similar to those in the Simulation folder described above.
├── Robustness
│ ├── Study1_hetePi
│ ├── Study2_SpatialCorr_A
│ ├── Study3_SpatialCorr_X
│ ├── Study4_ConIndep
│ ├── para_est_KmeansInsInit.pkl
The following code files in the RealDataAnalysis folder can be used to reproduce real data analysis results presented in the paper. The file tree is given as follows:
├── RealDataAnalysis
│ ├── LICENSE
│ ├── estimation.py
│ ├── gentrain.py
│ ├── mul_gentest.py
│ ├── pred.py
│ ├── sin_gentest.py
│ ├── train_est.py
│ └── utils.py
The CAMELYON16 dataset can be downloaded from its official website. Please also note that the model file pytorch_model.bin is not included in this repository due to its size. It can be obtained via email or downloaded from Hugging Face.
Below are instructions on how to execute the scripts:
We need to generate the features and labels for the training and testing datasets separately.
python gentrain.py
python mul_gentest.py
Next, we apply the proposed estimators to estimate the unknown parameters.
python train_est.py
Finally, obtain the prediction results.
python pred.py
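The four commands above can also be run in sequence from one script. This is a minimal sketch, assuming the scripts are launched from inside the RealDataAnalysis folder and take no command-line arguments; `run_pipeline` and `STEPS` are illustrative names, not part of this repository.

```python
import subprocess
import sys

# Script order taken from the instructions above.
STEPS = ["gentrain.py", "mul_gentest.py", "train_est.py", "pred.py"]


def run_pipeline(workdir="RealDataAnalysis", dry_run=False):
    """Run each step in order; stop at the first script that fails."""
    commands = [[sys.executable, step] for step in STEPS]
    for cmd in commands:
        print("running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code
            subprocess.run(cmd, cwd=workdir, check=True)
    return commands
```

Calling `run_pipeline(dry_run=True)` prints the commands without executing them, which is a quick way to confirm the order before starting a long run.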
If you have any problems, please feel free to contact Baichen Yu or Prof. Xuetong Li.