Stack-first benchmarking for out-of-distribution cancer drug response prediction, with staged Evo 2 genotype augmentation
Team: Hanson Wen (Molecular Bio + CS), Nathan Gu (EE + CS), Foster Angus (Applied Math)
This project evaluates a simple question with strict benchmarking discipline:
Do Stack prompt-conditioned representations provide a real out-of-distribution (OOD) lift for perturbation + drug-response prediction, beyond strong baselines and non-ICL single-cell foundation models?
Recent evidence suggests many complex drug-response models underperform or match simpler methods when evaluation is leakage-safe and truly OOD. We are testing whether Stack changes that outcome.
# Clone repo
git clone https://github.com/Hilo-Hilo/Stack-Benchmarking.git
cd Stack-Benchmarking
# Set up Python environment (Python 3.9+)
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
# Install requirements
pip install -r requirements.txt
# Install Stack
git clone https://github.com/ArcInstitute/stack.git
cd stack
pip install -e .
cd ..
See checkpoints/README.md for download instructions. You'll need:
- bc_large.ckpt - Stack-Large model (~2.5 GB)
- basecount_1000per_15000max.pkl - Gene list (~900 KB)
Place your data in data/raw/:
- Single-cell expression: *.h5ad (AnnData format, HGNC gene symbols)
- Drug response: *response*.csv
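As a quick sanity check before launching a run, the two file patterns above can be resolved with the standard library. This is a minimal sketch, not part of the repository: the helper name `find_inputs` is hypothetical, and it only assumes the `data/raw/` layout and the glob patterns listed above.

```python
from pathlib import Path


def find_inputs(raw_dir="data/raw"):
    """Return sorted lists of expression and drug-response files.

    Hypothetical helper: matches the *.h5ad and *response*.csv
    patterns described above, nothing more.
    """
    raw = Path(raw_dir)
    expression = sorted(raw.glob("*.h5ad"))        # single-cell expression
    response = sorted(raw.glob("*response*.csv"))  # drug-response tables
    return expression, response
```

Calling `find_inputs()` from the repository root returns two lists of `Path` objects; an empty list for either pattern suggests the data was placed in the wrong directory or named differently.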
python -m src.main --config configs/default.yaml
Stack-Benchmarking/
├── checkpoints/ # Model weights (download from HuggingFace)
├── configs/ # Experiment configurations
├── data/
│ ├── raw/ # Original datasets
│ └── processed/ # Processed features/embeddings
├── src/
│ ├── data/ # Data loading & preprocessing
│ ├── models/ # Baseline predictors & Stack wrapper
│ └── eval/ # OOD evaluation & metrics
├── notebooks/ # Analysis notebooks
├── papers/ # Reference papers
└── PROPOSAL.md # Original project proposal
- Inputs: pre-treatment query cells + prompt cells (drug, dose, time, tissue)
- Output: predicted post-treatment expression + perturbation-conditioned embeddings
- Score predicted expression with pathway collections (GSVA/ssGSEA)
- Predict AUC/AAC using lightweight models (ElasticNet, XGBoost, MLP)
- Random + cold-drug + cold-cell-line splits
- Tissue/lineage-aware splits where possible
- Fixed-drug and fixed-cell aggregation reporting
- All preprocessing fit inside training folds only
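The cold-drug and cold-cell-line splits, together with the rule that preprocessing is fit inside training folds only, can be sketched in plain Python as follows. This is an illustrative sketch rather than the code in `src/eval/`: the function names, the record fields (`drug`, `cell_line`, `x`), and the round-robin fold assignment are all assumptions made for the example.

```python
import random
from statistics import mean, pstdev


def cold_group_folds(records, key, n_folds=5, seed=0):
    """Yield (train, test) lists so that no value of `key` is in both.

    key="drug" gives a cold-drug split, key="cell_line" a
    cold-cell-line split. Groups are shuffled once, then dealt
    round-robin into folds (an assumption of this sketch).
    """
    groups = sorted({r[key] for r in records})
    random.Random(seed).shuffle(groups)
    for i in range(n_folds):
        held_out = set(groups[i::n_folds])
        train = [r for r in records if r[key] not in held_out]
        test = [r for r in records if r[key] in held_out]
        yield train, test


def fit_scaler(train, feature):
    """Estimate mean and standard deviation from training rows only."""
    values = [r[feature] for r in train]
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 if variance is zero
    return mu, sigma


def apply_scaler(rows, feature, mu, sigma):
    """Standardize `feature` with statistics estimated elsewhere."""
    return [(r[feature] - mu) / sigma for r in rows]
```

Within each fold, `fit_scaler` is called on `train` and the resulting `mu` and `sigma` are passed to `apply_scaler` for both `train` and `test`, so no test-fold statistic reaches the preprocessing step.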
- Raw or pseudo-bulk expression
- Standard drug descriptors (Morgan fingerprints)
- Marginal-effect baselines (mean-drug, mean-cell, mean(drug)+mean(cell))
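The marginal-effect baselines can be written in a few lines, which is part of why they make a useful reference point. The sketch below is an assumption-laden illustration, not the repository's implementation: the function names are hypothetical, unseen drugs or cell lines fall back to the global training mean, and the additive variant subtracts the global mean so that the sum stays on the response scale (the list above only says mean(drug)+mean(cell)).

```python
from collections import defaultdict
from statistics import mean


def fit_marginal_baselines(train):
    """Fit per-drug, per-cell-line, and global means.

    train: list of (drug, cell_line, response) tuples.
    """
    by_drug, by_cell = defaultdict(list), defaultdict(list)
    for drug, cell, y in train:
        by_drug[drug].append(y)
        by_cell[cell].append(y)
    return {
        "global": mean(y for _, _, y in train),
        "drug": {d: mean(ys) for d, ys in by_drug.items()},
        "cell": {c: mean(ys) for c, ys in by_cell.items()},
    }


def predict_marginal(model, drug, cell, kind="additive"):
    """Predict a response with one of the three marginal baselines."""
    g = model["global"]
    drug_mean = model["drug"].get(drug, g)  # cold drug: global mean
    cell_mean = model["cell"].get(cell, g)  # cold cell line: global mean
    if kind == "mean_drug":
        return drug_mean
    if kind == "mean_cell":
        return cell_mean
    if kind == "additive":
        # Assumed form: drug effect + cell effect around the global mean.
        return drug_mean + cell_mean - g
    raise ValueError(f"unknown baseline kind: {kind}")
```

Under a cold-drug split the mean-drug baseline reduces to the global mean for every test pair, so the mean-cell and additive variants give identical predictions there; the same holds with roles swapped for cold-cell-line splits.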
- Python 3.9+
- PyTorch 2.0+
- Scanpy, anndata, scvi-tools
- Stack (arc-stack)
See requirements.txt for full list.
Stack model weights: Arc Research Institute Non-Commercial License
For questions, open an issue or contact the team.