BoevaLab/hierarchical-spatial-classifier

Hierarchical Classifier

Directory Structure

├── configs
├── data
│   ├── datasets
│   └── hierarchy
├── notebooks
├── scripts
└── src

Directory Explanations

  • configs/: Contains configuration files for different stages of the pipeline, such as training and prediction settings.

  • data/: Stores all data-related files, including raw datasets and hierarchy mappings.

    • datasets/: Original datasets (lung, ovarian, pancreatic). Each dataset folder includes a README file with links to the sources where the data can be downloaded.
    • hierarchy/: Contains cell hierarchy definitions and mapping files.
  • notebooks/: Jupyter notebooks for each step of the data processing and analysis pipeline.

  • scripts/: Shell scripts for environment setup.

  • src/: Source code for data modules, models, training, prediction, and utilities.

Notebooks: Pipeline Stages

The notebooks/ directory is organized into sequential steps, each representing a stage in the data analysis pipeline:

notebooks
├── step_0_get_datasets
├── step_1_preprocess
├── step_2_embed
├── step_3_merge
├── step_4_create_splits
├── step_5_predict
└── step_6_evaluate

Stage Descriptions

  • step_0_get_datasets/: Notebooks for downloading or converting the original datasets to .h5ad format.

  • step_1_preprocess/: Preprocessing steps to unify dataset formats and enforce consistent column naming.

    • Input: .h5ad files from paths like datasets/<dataset>/<file>.h5ad.
    • Output: Unified .h5ad files, typically saved in datasets/rawcounts_sparse_f32/<dataset>.h5ad.
  • step_2_embed/: Embeds the data using various models (e.g., Nicheformer, scGPT). Includes notebooks for generating and visualizing embeddings.

    • Input: Preprocessed .h5ad files from the previous step (e.g., datasets/rawcounts_sparse_f32/).
    • Output: Embedding files saved as datasets/embedded/<dataset>/X_<method>.npy.
  • step_3_merge/: Merges datasets and prepares test sets, including intersection and alignment of data.

    • merge_intersect.ipynb

      • Input: Lung, ovarian_discovery, and pancreatic datasets in datasets/rawcounts_sparse_f32/.
      • Output: Merged dataset datasets/common_genes/l-od-p.h5ad with a gene set intersection of the three datasets.
    • prepare_testsets.ipynb: Prepares test sets for scANVI.

      • Input: datasets/rawcounts_sparse_f32/ovarian_test*.h5ad and datasets/rawcounts_sparse_f32/ovarian_val*.h5ad.
      • Output: Test sets in datasets/common_genes/, subsetted to the gene set from l-od-p.h5ad (missing genes are zero-filled).
  • step_4_create_splits/: Creates train-test splits, generating l-od-p_train.h5ad and l-od-p_test.h5ad. Embeddings from lung, ovarian_discovery, and pancreatic datasets are also merged and split accordingly, producing datasets/embedded/l-od-p_<split>/X_<method>.npy files.

  • step_5_predict/: Runs scANVI on the processed and embedded data. Outputs predictions in .csv format.

  • step_6_evaluate/: Evaluates model predictions, including metrics and visualizations.

The stages are designed to run in sequence, but individual notebooks can be run independently for specific tasks or analyses, provided the expected input files are in place.
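The stages hand data to each other through the `X_<method>.npy` convention: row i of an embedding matrix corresponds to cell i of the source `.h5ad` file. A minimal sketch of that pairing (the dataset name, method name, and dimensions below are illustrative, not taken from the repository):

```python
import tempfile
from pathlib import Path

import numpy as np

# Illustrative numbers: 500 cells embedded into a 256-dimensional space.
n_cells, embedding_dim = 500, 256
embedding = np.random.default_rng(0).normal(
    size=(n_cells, embedding_dim)
).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    # Mirrors the datasets/embedded/<dataset>/X_<method>.npy layout.
    out_dir = Path(tmp) / "datasets" / "embedded" / "lung"
    out_dir.mkdir(parents=True)
    np.save(out_dir / "X_nicheformer.npy", embedding)

    loaded = np.load(out_dir / "X_nicheformer.npy")
    # Row i of the embedding must correspond to cell i of the source .h5ad.
    assert loaded.shape == (n_cells, embedding_dim)
```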

Running the Hierarchical Classifier

Training and prediction with the hierarchical or flat classifier can be run on embedded datasets using:

python src/main.py run_name=<name> stage=<stage> model.variant=<variant> data.dataset.embedding_path=<embedding_path>

where:

  • <name> is a descriptive name for the run
  • <stage> is either train or predict
  • <variant> is the model variant (e.g., hierarchical, flat)
  • <embedding_path> is the path to the embedding file (e.g., datasets/embedded/l-od-p_train/X_nicheformer.npy)

For more configuration details, see configs/config.yaml or src/utils/config.py.
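The `key=value` arguments above follow a Hydra-style dotted-override syntax (whether the project actually uses Hydra is an assumption; see configs/config.yaml for the real config layout). A minimal sketch of how such dotted overrides map onto a nested config dict:

```python
def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply Hydra-style dotted key=value overrides to a nested dict."""
    for override in overrides:
        dotted_key, value = override.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            # Descend into (or create) each intermediate sub-dict.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


# Hypothetical defaults; the real ones live in configs/config.yaml.
config = {"model": {"variant": "hierarchical"}, "data": {"dataset": {}}}
apply_overrides(config, [
    "stage=train",
    "model.variant=flat",
    "data.dataset.embedding_path=datasets/embedded/l-od-p_train/X_nicheformer.npy",
])
```

After this, `config["model"]["variant"]` has been overridden to `"flat"` and the embedding path sits under `config["data"]["dataset"]`.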

Training

For the training stage, specify:

  • data.dataset.labels_path: Path to the labels file (.csv or .h5ad)
  • (Optional) data.dataset.labels_column: Column name for labels in the labels file

This produces a run directory with a model checkpoint saved in the outputs/ folder.
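The exact schema of a `.csv` labels file is not documented here; a plausible minimal sketch with pandas (the `cell_id`/`cell_type` column names are assumptions, and `labels_column` stands in for the `data.dataset.labels_column` config value):

```python
import io

import pandas as pd

# Hypothetical labels file: one row per cell, in the same order as the
# embedding matrix. The label column is selected via data.dataset.labels_column.
labels_csv = io.StringIO(
    "cell_id,cell_type\n"
    "cell_0,T cell\n"
    "cell_1,B cell\n"
    "cell_2,T cell\n"
)
labels = pd.read_csv(labels_csv)
labels_column = "cell_type"  # would come from the run config
y = labels[labels_column]
```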

Prediction

For the prediction stage, specify:

  • prediction.checkpoint_path: Path to the trained model checkpoint .ckpt

This results in a run directory with predictions saved in the outputs/ folder.
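The `.csv` predictions from this stage feed into step_6_evaluate. A sketch of a simple evaluation over such a file (the `true_label`/`predicted_label` column names and the toy rows are assumptions, not the repository's actual output schema):

```python
import io

import pandas as pd

# Hypothetical predictions .csv as the predict stage might produce.
predictions = pd.read_csv(io.StringIO(
    "cell_id,true_label,predicted_label\n"
    "cell_0,T cell,T cell\n"
    "cell_1,B cell,T cell\n"
    "cell_2,T cell,T cell\n"
    "cell_3,B cell,B cell\n"
))

# Overall accuracy: fraction of cells whose prediction matches the truth.
accuracy = (predictions["true_label"] == predictions["predicted_label"]).mean()

# Per-class accuracy, grouped by the true label.
per_class = (
    predictions
    .assign(correct=predictions["true_label"] == predictions["predicted_label"])
    .groupby("true_label")["correct"]
    .mean()
)
```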
