```
├── configs
├── data
│   ├── datasets
│   └── hierarchy
├── notebooks
├── scripts
└── src
```
- `configs/`: Configuration files for the different stages of the pipeline, such as training and prediction settings.
- `data/`: All data-related files, including raw datasets and hierarchy mappings.
  - `datasets/`: Original datasets (lung, ovarian, pancreatic). Each dataset folder includes a README with links to the sources from which the data can be downloaded (an example of inspecting one of these files follows this list).
  - `hierarchy/`: Cell hierarchy definitions and mapping files.
- `notebooks/`: Jupyter notebooks for each step of the data processing and analysis pipeline.
- `scripts/`: Shell scripts for environment setup.
- `src/`: Source code for data modules, models, training, prediction, and utilities.
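The `.h5ad` files referenced throughout this README are AnnData objects. A quick way to inspect one, assuming `anndata` is installed (the lung file name below is illustrative; see each dataset's README for the actual files):

```python
import anndata as ad

# Illustrative path; actual file names are listed in the dataset READMEs.
adata = ad.read_h5ad("data/datasets/lung/lung.h5ad")

print(adata)             # summary: n_obs x n_vars plus obs/var/obsm keys
print(adata.obs.head())  # per-cell metadata, e.g., cell-type annotations
```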
The `notebooks/` directory is organized into sequential steps, each representing a stage of the data analysis pipeline:
```
notebooks
├── step_0_get_datasets
├── step_1_preprocess
├── step_2_embed
├── step_3_merge
├── step_4_create_splits
├── step_5_predict
└── step_6_evaluate
```
- `step_0_get_datasets/`: Notebooks for downloading the original datasets or converting them to `.h5ad` format.
- `step_1_preprocess/`: Preprocessing steps to unify dataset formats and enforce consistent column naming.
  - Input: `.h5ad` files from paths like `datasets/<dataset>/<file>.h5ad`.
  - Output: Unified `.h5ad` files, typically saved in `datasets/rawcounts_sparse_f32/<dataset>.h5ad`.
- `step_2_embed/`: Embeds the data using various models (e.g., Nicheformer, scGPT). Includes notebooks for generating and visualizing embeddings.
  - Input: Preprocessed `.h5ad` files from the previous step (e.g., `datasets/rawcounts_sparse_f32/`).
  - Output: Embedding files saved as `datasets/embedded/<dataset>/X_<method>.npy`.
- `step_3_merge/`: Merges datasets and prepares test sets, including intersection and alignment of data (minimal sketches of both operations follow this list).
  - `merge_intersect.ipynb`
    - Input: Lung, ovarian_discovery, and pancreatic datasets in `datasets/rawcounts_sparse_f32/`.
    - Output: Merged dataset `datasets/common_genes/l-od-p.h5ad` with the gene set intersection of the three datasets.
  - `prepare_testsets.ipynb`: Prepares test sets for scANVI.
    - Input: `datasets/rawcounts_sparse_f32/ovarian_test*.h5ad` and `datasets/rawcounts_sparse_f32/ovarian_val*.h5ad`.
    - Output: Test sets in `datasets/common_genes/`, subsetted to the gene set from `l-od-p.h5ad` (missing genes are zero-filled).
- `step_4_create_splits/`: Creates train-test splits, generating `l-od-p_train.h5ad` and `l-od-p_test.h5ad`. Embeddings from the lung, ovarian_discovery, and pancreatic datasets are also merged and split accordingly, producing `datasets/embedded/l-od-p_<split>/X_<method>.npy` files.
- `step_5_predict/`: Runs scANVI on the processed and embedded data. Outputs predictions in `.csv` format.
- `step_6_evaluate/`: Evaluates model predictions, including metrics and visualizations.
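The gene intersection in `merge_intersect.ipynb` can be sketched with `anndata.concat` using an inner join over genes. This is a minimal sketch, not the notebook's actual code; the exact file names under `datasets/rawcounts_sparse_f32/` and the `"dataset"` batch column are assumptions:

```python
import anndata as ad

# Assumed file names; the inputs are the three preprocessed datasets.
paths = {
    "lung": "datasets/rawcounts_sparse_f32/lung.h5ad",
    "ovarian_discovery": "datasets/rawcounts_sparse_f32/ovarian_discovery.h5ad",
    "pancreatic": "datasets/rawcounts_sparse_f32/pancreatic.h5ad",
}
adatas = {name: ad.read_h5ad(path) for name, path in paths.items()}

# join="inner" keeps only the genes present in all three datasets,
# i.e., the gene-set intersection used for l-od-p.h5ad.
merged = ad.concat(adatas, join="inner", label="dataset")
merged.write_h5ad("datasets/common_genes/l-od-p.h5ad")
```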
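The zero-filling in `prepare_testsets.ipynb` amounts to projecting each test set onto the merged gene set. A minimal sketch under the same assumptions (the test file name is illustrative), using a sparse column-mapping matrix so reference genes missing from the test set stay all-zero:

```python
import anndata as ad
import numpy as np
from scipy import sparse

ref = ad.read_h5ad("datasets/common_genes/l-od-p.h5ad")
test = ad.read_h5ad("datasets/rawcounts_sparse_f32/ovarian_test_1.h5ad")  # illustrative name

# Genes shared between the test set and the merged reference gene set.
common = ref.var_names.intersection(test.var_names)
test_idx = test.var_names.get_indexer(common)
ref_idx = ref.var_names.get_indexer(common)

# Sparse matrix routing each shared test gene to its reference column;
# reference genes absent from the test set remain all-zero (zero-filled).
mapper = sparse.csr_matrix(
    (np.ones(len(common), dtype=np.float32), (test_idx, ref_idx)),
    shape=(test.n_vars, ref.n_vars),
)
aligned = ad.AnnData(X=test.X @ mapper, obs=test.obs.copy(), var=ref.var.copy())
aligned.write_h5ad("datasets/common_genes/ovarian_test_1.h5ad")
```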
The stages are designed to run sequentially, but individual notebooks can also be run independently for specific tasks or analyses, provided the expected input files are in place.
Training and prediction with the hierarchical or flat classifier can be run on datasets with precomputed embeddings using:

```
python src/main.py run_name=<name> stage=<stage> model.variant=<variant> data.dataset.embedding_path=<embedding_path>
```

where:

- `<name>` is a descriptive name for the run
- `<stage>` is either `train` or `predict`
- `<variant>` is the model variant (e.g., `hierarchical`, `flat`)
- `<embedding_path>` is the path to the embedding file (e.g., `datasets/embedded/l-od-p_train/X_nicheformer.npy`; see the sanity-check sketch below)
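The embedding file is a plain NumPy matrix. A quick sanity check, assuming its rows correspond one-to-one to the cells of the matching `.h5ad` split (the split's location in `datasets/common_genes/` is an assumption):

```python
import anndata as ad
import numpy as np

emb = np.load("datasets/embedded/l-od-p_train/X_nicheformer.npy")
adata = ad.read_h5ad("datasets/common_genes/l-od-p_train.h5ad")  # assumed location

# Each embedding row should correspond to one cell in the split.
assert emb.shape[0] == adata.n_obs, (emb.shape, adata.n_obs)
print(f"{emb.shape[0]} cells, {emb.shape[1]}-dimensional embedding")
```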
For more configuration details, see `configs/config.yaml` or `src/utils/config.py`.
For the training stage, additionally specify:

- `data.dataset.labels_path`: Path to the labels file (`.csv` or `.h5ad`)
- (Optional) `data.dataset.labels_column`: Column name for the labels in the labels file
This produces a run directory with a model checkpoint saved in the `outputs/` folder.
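For example, a training run on the Nicheformer train embeddings might look like this; the labels path and the `cell_type` column name are illustrative, so adjust them to where your split and labels actually live:

```
python src/main.py run_name=hier_nicheformer stage=train model.variant=hierarchical \
    data.dataset.embedding_path=datasets/embedded/l-od-p_train/X_nicheformer.npy \
    data.dataset.labels_path=datasets/common_genes/l-od-p_train.h5ad \
    data.dataset.labels_column=cell_type
```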
For the prediction stage, specify:

- `prediction.checkpoint_path`: Path to the trained model checkpoint (`.ckpt`)

This results in a run directory with predictions saved in the `outputs/` folder.
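For example, predicting on the test-split embeddings with the checkpoint from the training run above (the checkpoint path uses the same placeholder convention as the rest of this README, since the exact layout inside `outputs/` depends on the run):

```
python src/main.py run_name=hier_nicheformer_predict stage=predict model.variant=hierarchical \
    data.dataset.embedding_path=datasets/embedded/l-od-p_test/X_nicheformer.npy \
    prediction.checkpoint_path=outputs/<run_name>/<checkpoint>.ckpt
```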