This repository enables you to map your own single-cell or single-nucleus data to the Human Lung Cell Atlas.
The Human Lung Cell Atlas (HLCA) is an integration of 46 different datasets of the lung and nose. The HLCA core reference, containing a subset of those 46 datasets, was built by integrating data from 107 individuals and 14 datasets, and then re-annotating all cells in a consensus manner based on originally published labels and inputs from 6 experts.
Mapping your own lung single-cell or single-nucleus data to the HLCA will greatly speed up the analysis of your dataset. Upon mapping we can project consensus labels from the HLCA to your data and, using mapping uncertainty, highlight which cells seem different from healthy cells in the atlas. Depending on your experimental setup, these cells might represent novel cellular identities or cells particularly affected by disease.
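To illustrate the idea behind label projection with uncertainty (a schematic sketch only, not necessarily the exact procedure implemented in the mapping notebook): once reference and query cells share a joint embedding, consensus labels can be transferred with a nearest-neighbor classifier, and the disagreement among a cell's reference neighbors serves as an uncertainty score. The array and variable names below are illustrative.

```python
# Schematic label transfer with an uncertainty score, assuming `ref_emb` and
# `query_emb` are (n_cells x n_latent) arrays from a joint embedding and
# `ref_labels` holds the HLCA consensus annotations (all names hypothetical).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=50, weights="distance")
knn.fit(ref_emb, ref_labels)

probs = knn.predict_proba(query_emb)            # per-cell label probabilities
transferred_labels = knn.classes_[probs.argmax(axis=1)]
uncertainty = 1 - probs.max(axis=1)             # high values: cell unlike the reference

# Cells with high uncertainty may represent novel or disease-affected states
flagged = uncertainty > 0.5
```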
In this repository we provide the code needed to map your own single-cell or single-nucleus RNA-sequencing data to the HLCA reference. Mapping to the HLCA is done with scArches, a transfer-learning tool designed for mapping new data to large references. Note that you'll need access to a GPU to perform the mapping. If you prefer not to map the data on your own computing resources, you can also do automated mapping on FastGenomics.
There are a few things to prepare before starting the mapping. This information is also included in the analysis notebook you'll use.
- Install the right packages. The easiest way to do this is to install our prepared conda environment from ./envs/scarches_mapping_conda_env.yml. After cloning our repository, cd into the envs folder and run `conda env create -f scarches_mapping_conda_env.yml`. If you don't want to use conda, make sure you have scArches v0.3.5, scvi-tools 0.8.1, and torch>=1.3,<=1.8.0 installed.
- You'll need access to a GPU to run this script efficiently; on CPU it will be very slow, if it runs at all. With the provided test data, the script ran in under an hour (about half of which was UMAP calculation) on a node with 4 cores and only 9GB of memory. Bigger datasets will need more.
- Ensure your data is in h5ad format, the standard format used in scanpy.
- Make sure your .X layer has raw counts (i.e. integers, so no normalization and no log-transformation).
- If your dataset contains multiple batches, specify these in .obs['dataset']. Note that we do not recommend using subjects as batches. Rather, treat your dataset as one batch, unless there are subsets that differ in experimental protocol (e.g. 5' versus 3' sequencing, single-nucleus versus single-cell, or 10Xv1 versus 10Xv2). We also do not recommend treating disease status as a batch variable, as this might remove disease-specific effects from the embedding. (See the sanity-check sketch after this list.)
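As a quick way to verify these requirements before running the notebook, the sketch below (file name and dataset label are hypothetical; adjust to your data) checks that .X holds integer counts, sets the dataset column, and confirms that a GPU is visible to torch.

```python
import numpy as np
import scanpy as sc
import torch

adata = sc.read_h5ad("my_data.h5ad")  # hypothetical file name

# .X should contain raw integer counts (no normalization or log-transformation)
X = adata.X[:100].toarray() if hasattr(adata.X, "toarray") else adata.X[:100]
assert np.allclose(X, np.round(X)), ".X does not look like raw counts"

# one batch per experimental protocol; here a single batch is assumed
adata.obs["dataset"] = "my_dataset"

# the mapping needs a GPU to run in reasonable time
print("GPU available:", torch.cuda.is_available())
```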
To do the mapping, clone this repository and follow the steps in the Jupyter notebook "LCA_scArches_mapping_new_data_to_hlca", which you can find in the notebooks folder of this repository.
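The notebook walks through the full workflow (gene matching, reference model download, training, label transfer). For orientation only, here is a minimal sketch of the core surgery step, using the scANVI query interface as exposed by recent scArches/scvi-tools releases; paths, parameters, and the exact calls in the pinned versions may differ, so treat the notebook as the authoritative reference.

```python
import scarches as sca

# `adata_query` is your prepared AnnData; `ref_model_dir` is a local directory
# containing the downloaded HLCA reference model (both names are illustrative).
surgery_model = sca.models.SCANVI.load_query_data(
    adata_query,
    ref_model_dir,
    freeze_dropout=True,  # keep the pretrained reference weights largely frozen
)
surgery_model.train(
    max_epochs=500,
    plan_kwargs={"weight_decay": 0.0},  # avoid shrinking the pretrained weights
)

# joint low-dimensional embedding of your cells in the HLCA latent space
adata_query.obsm["X_scanvi_emb"] = surgery_model.get_latent_representation()
```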
For more information on the metadata categories of the HLCA (e.g. cell type, subject age, sequencing platform), check out our metadata explanations in the docs folder.
HLCA: Sikkema et al., bioRxiv 2022, doi:10.1101/2022.03.10.483747
scArches: Lotfollahi et al., Nature Biotechnology 2021, doi:10.1038/s41587-021-01001-7
If you have questions or run into problems, submit an issue on this GitHub repository.