A comprehensive computational framework for predicting cell type proportions in colorectal cancer H&E images using multiple foundation models and machine learning approaches.
This repository contains the complete analysis pipeline and scripts for the STPath-COAD project. The framework integrates multiple histopathology foundation models with XGBoost classifiers to predict spatial distributions of cell types in colorectal cancer tissues, validated against spatial transcriptomics data.
- Multi-modal Foundation Models: Integration of 5 state-of-the-art histopathology foundation models (Conch, UNI2-h, ProvGigaPath, Virchow, Virchow2) and 1 baseline model (ResNet50).
- Cell Type Deconvolution: Spatial transcriptomics-guided cell type prediction for 5 major cell populations
- TCGA Analysis: Comprehensive survival analysis and clinical correlation studies
- Cross-validation Framework: Leave-one-individual-out (LOIO) validation strategy
- Visualization Tools: Hexagonal heatmaps, UMAP embeddings, and soft segmentation
This framework evaluates five histopathology foundation models and one baseline CNN (ResNet50):
- Conch: Lu, Ming Y., et al. "A visual-language foundation model for computational pathology." Nature Medicine 30.3 (2024): 863-874.
- UNI2-h: Chen, Richard J., et al. "Towards a general-purpose foundation model for computational pathology." Nature Medicine 30.3 (2024): 850-862.
- ProvGigaPath: Xu, Hanwen, et al. "A whole-slide foundation model for digital pathology from real-world data." Nature 630.8015 (2024): 181-188.
- Virchow: Vorontsov, Eugene, et al. "A foundation model for clinical-grade computational pathology and rare cancers detection." Nature Medicine 30.10 (2024): 2924-2935.
- Virchow2: Zimmermann, Eric, et al. "Virchow2: Scaling self-supervised mixed magnification models in pathology." arXiv preprint arXiv:2408.00738 (2024).
- ResNet50: He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016): 770-778.
The framework predicts proportions for 5 major cell types:
Colorectal Cancer (COAD):
- Cancer Cells - Malignant epithelial cells (colorectal carcinoma, adenoma, serrated-specific)
- Stromal Cells - Fibroblasts and Endothelial cells
- Normal Epithelial Cells - Non-malignant epithelial cells (goblet cells, absorptive colonocytes, enteroendocrine cells, tuft cells)
- T Cells - T lymphocytes (CD4+ and CD8+ T cells)
- pan-APC Cells - B cells and Myeloid cells combined (antigen-presenting cells)
Breast Cancer (BRCA):
- Tumor Cells - Malignant epithelial cells
- Stromal Cells - Cancer-associated fibroblasts and Perivascular-Like cells
- Normal Epithelial Cells - Non-malignant epithelial cells
- T Cells - T lymphocytes (CD4+ and CD8+ T cells)
- pan-APC Cells - B cells and Myeloid cells combined (antigen-presenting cells)
STPath_COAD/
├── Scripts_for_Analysis/ # Scripts for generating manuscript figures
│ ├── Figure2_scRNAseq_data_analysis.py
│ ├── Figure3_UMAP_Contri_RegressOut.py
│ ├── Figure4_Xgboost_comparison.py
│ ├── Figure5_Consistency.py
│ ├── Figure6_Soft_Segmentation.py
│ ├── Figure7_TCGA_COAD_Survival.py
│ ├── FigureS7_TCGA_COAD_Analysis.py
│ ├── FigureS10_expression_prediction.py
│ ├── FigureS11_Cell_Type_Distribution_Analysis.py
│ ├── FigureS13_Compare_BRCA_COAD_Feature_Importance.py
│ ├── Table1_TCGA.py
│ └── Sup_file_S3_Make_Important_Features.py
│
├── Workflow/ # Complete analysis workflow
│ ├── StepA_Data_Preparation/
│ │ ├── CARD_STData_Preparation_*.py
│ │ └── Create_patches_images_*.py
│ │
│ ├── StepB_Cell_Type_Deconvolution/
│ │ ├── CARD_Deconvolution_*.R
│ │ ├── CARD_Results_Validation_*.py
│ │ └── CARD_Results_Vis_Prepare_*.py
│ │
│ ├── StepC_Feature_Extraction_and_Train_Models/
│ │ ├── Precompute_Features_Using_Foundation_Models_COAD.py
│ │ ├── COAD_XGBoost_Prediction.py
│ │ ├── COAD_XGBoost_WithinSample.py
│ │ └── Compare_TIFF_JPG_Features.py
│ │
│ └── StepD_TCGA_Data_Preparation/
│ └── TCGA_COAD_DCM_to_TIFF.py
│
├── Figures/ # Manuscript figures and supplementary figures
├── config_template.yaml # Configuration template
├── README.md # This file
└── LICENSE # License information
- Prepare spatial transcriptomics data for CARD deconvolution
- Extract H&E image patches at matched spatial locations
- Supports Cody, FredHutch, and HEST-1K datasets
- Run CARD deconvolution using single-cell reference data
- Validate deconvolution results against known tissue regions
- Generate visualization of cell type spatial distributions
- Extract features using the 5 foundation models and the ResNet50 baseline
- Train XGBoost models with LOIO cross-validation
- Evaluate model performance and feature importance
- Generate calibrated predictions
- Convert TCGA whole slide images from DCM to TIFF format
- Process TCGA-COAD cohort for validation studies
- Calculate distance metrics between cell types
- Perform survival analysis
Python 3.8+
R 4.0+ (for CARD deconvolution)
Python packages:
- torch
- timm
- conch
- huggingface_hub
- xgboost
- numpy
- pandas
- matplotlib
- seaborn
- scanpy
- lifelines
- scipy
- scikit-learn
- umap-learn
- tqdm
- pillow
- opencv-python
R packages:
- CARD
- Seurat
- Recommended: GPU with 16GB+ VRAM (for foundation model feature extraction)
- Minimum: CPU with 32GB RAM
- Storage: 100GB+ for data, models, and intermediate results
- Spatial Transcriptomics Data: Spot-level gene expression and spatial coordinates
- Single-cell Reference: Annotated scRNA-seq data for CARD deconvolution
- H&E Images: Whole slide images or tissue regions
- TCGA Data (optional): For validation and survival analysis
# Create conda environment
conda create -n stpath python=3.9
conda activate stpath
# Install Python dependencies
pip install torch torchvision torchaudio
pip install timm transformers huggingface_hub
pip install xgboost scikit-learn
pip install numpy pandas matplotlib seaborn
pip install scanpy lifelines
pip install umap-learn tqdm pillow opencv-python
Copy and edit the configuration template:
cp config_template.yaml config.yaml
Edit paths and parameters according to your data location and computing resources.
- Systematic comparison of the 5 foundation models and the ResNet50 baseline (Figure 4)
- Feature importance analysis across models
- Model complementarity assessment
- Cell type distance calculations using weighted minimum distance
- Hard classification based on quantile thresholds
- Hexagonal heatmap visualization
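The quantile-based hard classification can be sketched as follows. This is a minimal illustration rather than the repository's implementation: the helper names `quantile` and `hard_labels`, the 0.75 cutoff, and the example proportions are all assumptions.

```python
def quantile(values, q):
    """Linearly interpolated empirical quantile for q in [0, 1]."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def hard_labels(proportions, q=0.75):
    """Mark a patch as positive for a cell type when its predicted
    proportion reaches the q-th quantile across all patches.
    The default q is a hypothetical choice, not a value from the project."""
    cut = quantile(proportions, q)
    return [p >= cut for p in proportions]


# Hypothetical predicted T cell proportions for five patches
t_cell = [0.05, 0.10, 0.20, 0.40, 0.60]
labels = hard_labels(t_cell, q=0.75)
print(labels)
```

Thresholding on a quantile rather than a fixed proportion keeps the share of positive patches comparable across cell types whose predicted proportions live on different scales.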
- Cox proportional hazards models
- Kaplan-Meier survival curves
- Integration of cell type proportions, spatial metrics, and clinical variables
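To make the Kaplan-Meier step concrete, here is a small pure-Python sketch of the product-limit estimator. It is for illustration only: the function name `kaplan_meier` and the toy follow-up data are assumptions, and in practice the `KaplanMeierFitter` class from `lifelines` (listed in the requirements) would normally be used instead.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time per patient
    events : 1 if the event (death) was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


# Hypothetical follow-up times (months) and event indicators
times = [5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients contribute to the at-risk count up to their last follow-up time but never trigger a drop in the curve.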
- Correlation between histopathology features and gene expression
- Cell type-specific marker gene analysis
- Partial correlation controlling for cell type composition
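The idea behind the partial correlation can be illustrated with the standard first-order formula for a single covariate. This sketch assumes one controlling variable `z` (for example, one cell type proportion); the project's analysis may control for several covariates at once, which would typically be done by regressing them out of both variables first. The function names and toy vectors are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy example: x and y are correlated only because both track z
z = [3, 3, -3, -3]          # shared driver, e.g. a cell type proportion
x = [4, 2, -2, -4]          # e.g. an image feature
y = [4, 2, -4, -2]          # e.g. a gene's expression
raw = pearson(x, y)
adjusted = partial_corr(x, y, z)
print(round(raw, 3), round(adjusted, 3))
```

In this toy case the raw correlation is high while the adjusted one vanishes, which is the pattern expected when an apparent feature-expression association is explained by cell type composition alone.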
- Each individual (patient) is held out once as test set
- Models trained on all other individuals
- Ensures generalization across patients
- Prevents overfitting to individual-specific patterns
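The LOIO scheme described above can be sketched in a few lines. This is an illustrative helper, not the repository's code: the name `loio_splits` and the patient labels are assumptions. With scikit-learn installed, `sklearn.model_selection.LeaveOneGroupOut` produces equivalent splits.

```python
from collections import defaultdict


def loio_splits(individual_ids):
    """Yield (held_out_id, train_indices, test_indices), one fold per individual.

    individual_ids[i] is the individual (patient) that sample i came from.
    """
    groups = defaultdict(list)
    for idx, ind in enumerate(individual_ids):
        groups[ind].append(idx)
    for held_out in sorted(groups):
        test_idx = groups[held_out]
        train_idx = sorted(
            i for ind, idxs in groups.items() if ind != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Hypothetical patch-to-patient assignment
patients = ["P1", "P1", "P2", "P3", "P3", "P3"]
folds = list(loio_splits(patients))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Splitting by individual rather than by patch matters because patches from one patient are strongly correlated; a random patch-level split would leak patient-specific signal into the test set.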
- Quantile-based calibration for improved prediction accuracy
- Cell type-specific outlier handling
- Preservation of rank ordering while adjusting scale
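One way to adjust scale while preserving rank order is quantile mapping, sketched below. This is an assumption about how such a calibration could work, not a description of the repository's procedure: the function name `quantile_map`, the mid-rank quantile convention, and the example values are hypothetical, and outlier handling is omitted.

```python
def quantile_map(preds, reference):
    """Map each prediction to the reference value at the same empirical quantile.

    preds     : raw model predictions for one cell type
    reference : proportions whose distribution the output should follow
                (for example, training-set deconvolution results)
    """
    ref = sorted(reference)
    n = len(preds)
    order = sorted(range(n), key=lambda i: preds[i])
    calibrated = [0.0] * n
    for rank, i in enumerate(order):
        q = (rank + 0.5) / n              # mid-rank quantile, strictly in (0, 1)
        pos = q * (len(ref) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ref) - 1)
        calibrated[i] = ref[lo] + (pos - lo) * (ref[hi] - ref[lo])
    return calibrated


# Hypothetical raw predictions and reference proportions
raw_preds = [0.9, 0.1, 0.5]
reference = [0.0, 0.2, 0.4, 0.6]
calibrated = quantile_map(raw_preds, reference)
print(calibrated)
```

Because the mapping is monotone, a patch that ranked higher before calibration never ranks lower afterwards, so rank-based metrics such as Spearman correlation are unaffected.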
- Top 30% most important features selected per model
- Combined feature set from multiple foundation models
- Reduces dimensionality while maintaining predictive power
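The per-model selection and combination could look like the sketch below. The importance scores, feature names, and the choice to round the 30% cutoff up are illustrative assumptions; the repository's scripts may rank or round differently.

```python
import math


def top_fraction(importances, fraction=0.30):
    """Return the names of the top `fraction` of features by importance.

    importances : dict mapping feature name -> importance score
    The count is rounded up so that at least one feature is always kept
    (an assumption made for this sketch).
    """
    k = max(1, math.ceil(fraction * len(importances)))
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


# Hypothetical importance scores from two models
per_model = {
    "conch": {"conch_0": 0.50, "conch_1": 0.05, "conch_2": 0.30, "conch_3": 0.15},
    "uni2h": {"uni2h_0": 0.20, "uni2h_1": 0.70, "uni2h_2": 0.10},
}

# Combined feature set: the top 30% from each model, concatenated
combined = [name for model in per_model for name in top_fraction(per_model[model])]
print(combined)
```

Selecting within each model before combining keeps every foundation model represented, which a single global ranking would not guarantee.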
- CSV files with patch-level predictions
- JSON files with metadata and overall proportions
- Feature importance scores
- UMAP embeddings colored by various factors
- Hexagonal heatmaps for spatial distributions
- Violin plots for cell type proportions
- Scatter plots for model comparisons
- Kaplan-Meier survival curves
- Cox regression tables
- Model performance metrics (Spearman correlation, MAE, RMSE)
- Feature importance rankings
- Spearman Correlation: Primary metric for proportion prediction
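- Mean Absolute Error (MAE): Average absolute difference between predicted and reference proportions
- Root Mean Squared Error (RMSE): Square root of the mean squared error; penalizes large errors more heavily than MAE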
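For reference, the three metrics can be computed as in this dependency-free sketch. The function names and the toy proportions are assumptions for illustration; in practice `scipy.stats.spearmanr` gives the same correlation.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mae(truth, pred):
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


def rmse(truth, pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


# Hypothetical reference and predicted proportions for four patches
truth = [0.10, 0.40, 0.35, 0.80]
pred = [0.20, 0.30, 0.45, 0.70]
print(spearman(truth, pred), mae(truth, pred), rmse(truth, pred))
```

Spearman correlation depends only on rank order, so it is insensitive to monotone rescaling of the predictions, while MAE and RMSE measure how far the predicted values are from the reference on the original scale.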
- Feature Extraction: ~2-5 minutes per WSI per foundation model (GPU)
- Model Training: ~30-60 minutes per cell type (LOIO)
- Prediction: ~1-3 minutes per WSI
- TCGA Analysis: ~10-20 minutes per sample
- Feature Extraction: 16GB GPU VRAM recommended
- Model Training: 32GB RAM minimum
- Large WSI Processing: 64GB RAM recommended
If you use this code in your research, please cite:
[Citation information to be added upon publication]
- STPath-Software: Standalone prediction tool for clinical use (https://github.com/Sun-lab/STpath-software)
- BRCA Analysis: Breast cancer application scripts (https://github.com/Sun-lab/STpath-BRCA)
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, issues, or collaborations, please open an issue on GitHub or contact the authors.
- Foundation models: Conch, UNI2-h, ProvGigaPath, Virchow, Virchow2
- CARD deconvolution method
- TCGA consortium for data access
- All data contributors and collaborators
Last Updated: Feb 2026