```
├── configs
├── data
│   ├── datasets
│   └── hierarchy
├── notebooks
├── scripts
└── src
```
- `configs/`: Configuration files for the different stages of the pipeline, such as training and prediction settings.
- `data/`: All data-related files, including raw datasets and hierarchy mappings.
  - `datasets/`: Original datasets (lung, ovarian, pancreatic). Each dataset folder includes a README with links to the sources from which the data can be downloaded (an example of inspecting one of these files follows this list).
  - `hierarchy/`: Cell hierarchy definitions and mapping files.
- `notebooks/`: Jupyter notebooks for each step of the data processing and analysis pipeline.
- `scripts/`: Shell scripts for environment setup.
- `src/`: Source code for data modules, models, training, prediction, and utilities.
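The `.h5ad` files referenced throughout this README are AnnData objects. A quick way to inspect one, assuming `anndata` is installed (the lung file name below is illustrative; see each dataset's README for the actual files):

```python
import anndata as ad

# Illustrative path; actual file names are listed in the dataset READMEs.
adata = ad.read_h5ad("data/datasets/lung/lung.h5ad")

print(adata)             # summary: n_obs x n_vars plus obs/var/obsm keys
print(adata.obs.head())  # per-cell metadata, e.g., cell-type annotations
```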
The `notebooks/` directory is organized into sequential steps, each representing a stage of the data analysis pipeline:
```
notebooks
├── step_0_get_datasets
├── step_1_preprocess
├── step_2_embed
├── step_3_merge
├── step_4_create_splits
├── step_5_predict
└── step_6_evaluate
```
- `step_0_get_datasets/`: Notebooks for downloading the original datasets or converting them to `.h5ad` format.
- `step_1_preprocess/`: Preprocessing steps to unify dataset formats and enforce consistent column naming.
  - Input: `.h5ad` files from paths like `datasets/<dataset>/<file>.h5ad`.
  - Output: Unified `.h5ad` files, typically saved in `datasets/rawcounts_sparse_f32/<dataset>.h5ad`.
- `step_2_embed/`: Embeds the data using various models (e.g., Nicheformer, scGPT). Includes notebooks for generating and visualizing embeddings.
  - Input: Preprocessed `.h5ad` files from the previous step (e.g., `datasets/rawcounts_sparse_f32/`).
  - Output: Embedding files saved as `datasets/embedded/<dataset>/X_<method>.npy`.
- `step_3_merge/`: Merges datasets and prepares test sets, including intersection and alignment of data (minimal sketches of both operations follow this list).
  - `merge_intersect.ipynb`
    - Input: Lung, ovarian_discovery, and pancreatic datasets in `datasets/rawcounts_sparse_f32/`.
    - Output: Merged dataset `datasets/common_genes/l-od-p.h5ad` with the gene set intersection of the three datasets.
  - `prepare_testsets.ipynb`: Prepares test sets for scANVI.
    - Input: `datasets/rawcounts_sparse_f32/ovarian_test*.h5ad` and `datasets/rawcounts_sparse_f32/ovarian_val*.h5ad`.
    - Output: Test sets in `datasets/common_genes/`, subsetted to the gene set from `l-od-p.h5ad` (missing genes are zero-filled).
- `step_4_create_splits/`: Creates train-test splits, generating `l-od-p_train.h5ad` and `l-od-p_test.h5ad`. Embeddings from the lung, ovarian_discovery, and pancreatic datasets are also merged and split accordingly, producing `datasets/embedded/l-od-p_<split>/X_<method>.npy` files.
- `step_5_predict/`: Runs scANVI on the processed and embedded data. Outputs predictions in `.csv` format.
- `step_6_evaluate/`: Evaluates model predictions, including metrics and visualizations.
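The gene intersection in `merge_intersect.ipynb` can be sketched with `anndata.concat` using an inner join over genes. This is a minimal sketch, not the notebook's actual code; the exact file names under `datasets/rawcounts_sparse_f32/` and the `"dataset"` batch column are assumptions:

```python
import anndata as ad

# Assumed file names; the inputs are the three preprocessed datasets.
paths = {
    "lung": "datasets/rawcounts_sparse_f32/lung.h5ad",
    "ovarian_discovery": "datasets/rawcounts_sparse_f32/ovarian_discovery.h5ad",
    "pancreatic": "datasets/rawcounts_sparse_f32/pancreatic.h5ad",
}
adatas = {name: ad.read_h5ad(path) for name, path in paths.items()}

# join="inner" keeps only the genes present in all three datasets,
# i.e., the gene-set intersection used for l-od-p.h5ad.
merged = ad.concat(adatas, join="inner", label="dataset")
merged.write_h5ad("datasets/common_genes/l-od-p.h5ad")
```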
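The zero-filling in `prepare_testsets.ipynb` amounts to projecting each test set onto the merged gene set. A minimal sketch under the same assumptions (the test file name is illustrative), using a sparse column-mapping matrix so reference genes missing from the test set stay all-zero:

```python
import anndata as ad
import numpy as np
from scipy import sparse

ref = ad.read_h5ad("datasets/common_genes/l-od-p.h5ad")
test = ad.read_h5ad("datasets/rawcounts_sparse_f32/ovarian_test_1.h5ad")  # illustrative name

# Genes shared between the test set and the merged reference gene set.
common = ref.var_names.intersection(test.var_names)
test_idx = test.var_names.get_indexer(common)
ref_idx = ref.var_names.get_indexer(common)

# Sparse matrix routing each shared test gene to its reference column;
# reference genes absent from the test set remain all-zero (zero-filled).
mapper = sparse.csr_matrix(
    (np.ones(len(common), dtype=np.float32), (test_idx, ref_idx)),
    shape=(test.n_vars, ref.n_vars),
)
aligned = ad.AnnData(X=test.X @ mapper, obs=test.obs.copy(), var=ref.var.copy())
aligned.write_h5ad("datasets/common_genes/ovarian_test_1.h5ad")
```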
The stages are designed to run sequentially, but individual notebooks can also be run independently for specific tasks or analyses, provided the expected input files are in place.
Training and prediction with the hierarchical or flat classifier can be run on datasets with precomputed embeddings using:

```
python src/main.py run_name=<name> stage=<stage> model.variant=<variant> data.dataset.embedding_path=<embedding_path>
```

where:

- `<name>` is a descriptive name for the run
- `<stage>` is either `train` or `predict`
- `<variant>` is the model variant (e.g., `hierarchical`, `flat`)
- `<embedding_path>` is the path to the embedding file (e.g., `datasets/embedded/l-od-p_train/X_nicheformer.npy`; see the sanity-check sketch below)
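The embedding file is a plain NumPy matrix. A quick sanity check, assuming its rows correspond one-to-one to the cells of the matching `.h5ad` split (the split's location in `datasets/common_genes/` is an assumption):

```python
import anndata as ad
import numpy as np

emb = np.load("datasets/embedded/l-od-p_train/X_nicheformer.npy")
adata = ad.read_h5ad("datasets/common_genes/l-od-p_train.h5ad")  # assumed location

# Each embedding row should correspond to one cell in the split.
assert emb.shape[0] == adata.n_obs, (emb.shape, adata.n_obs)
print(f"{emb.shape[0]} cells, {emb.shape[1]}-dimensional embedding")
```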
For more configuration details, see `configs/config.yaml` or `src/utils/config.py`.
For the training stage, additionally specify:

- `data.dataset.labels_path`: Path to the labels file (`.csv` or `.h5ad`)
- (Optional) `data.dataset.labels_column`: Column name for the labels in the labels file
This produces a run directory with a model checkpoint saved in the `outputs/` folder.
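For example, a training run on the Nicheformer train embeddings might look like this; the labels path and the `cell_type` column name are illustrative, so adjust them to where your split and labels actually live:

```
python src/main.py run_name=hier_nicheformer stage=train model.variant=hierarchical \
    data.dataset.embedding_path=datasets/embedded/l-od-p_train/X_nicheformer.npy \
    data.dataset.labels_path=datasets/common_genes/l-od-p_train.h5ad \
    data.dataset.labels_column=cell_type
```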
For the prediction stage, specify:

- `prediction.checkpoint_path`: Path to the trained model checkpoint (`.ckpt`)

This results in a run directory with predictions saved in the `outputs/` folder.
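For example, predicting on the test-split embeddings with the checkpoint from the training run above (the checkpoint path uses the same placeholder convention as the rest of this README, since the exact layout inside `outputs/` depends on the run):

```
python src/main.py run_name=hier_nicheformer_predict stage=predict model.variant=hierarchical \
    data.dataset.embedding_path=datasets/embedded/l-od-p_test/X_nicheformer.npy \
    prediction.checkpoint_path=outputs/<run_name>/<checkpoint>.ckpt
```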