SELINA-team/SELINA.py

SELINA: Single-cell Assignment using Multiple-Adversarial Domain Adaptation Network with Large-scale References

Introduction

SELINA is a deep learning-based framework for single-cell assignment with multiple references. The algorithm consists of three main steps: cell type balancing, pre-training and fine-tuning. Rare cell types in the reference data are first oversampled using SMOTE (Synthetic Minority Oversampling Technique), and a supervised deep learning model is then trained on the reference data using MADA (Multi-Adversarial Domain Adaptation). An autoencoder is subsequently used to fine-tune the parameters of the pre-trained model. Finally, labels from the reference data are transferred to the query data using the fully trained model. Along with the annotation algorithm, we collected 136 datasets, which were uniformly processed and curated to provide users with comprehensive pre-trained models. SELINA currently has two modes: normal mode for predicting on normal datasets and disease mode for predicting on disease datasets. The architecture and cost function of the disease mode differ slightly from those of the normal mode; please refer to our paper for details.
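To make the cell type balancing step more concrete, the sketch below illustrates the core idea behind SMOTE: a synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbors. This is a simplified, standard-library-only illustration and not SELINA's own implementation. The function name `smote_like_oversample`, its parameters and the toy expression values are all hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified illustration of the SMOTE idea (hypothetical helper,
    not SELINA's implementation).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbor.
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy expression vectors for a rare cell type (made-up numbers).
rare_cells = [[1.0, 0.0, 2.0], [1.5, 0.5, 2.5], [0.5, 0.2, 1.8]]
new_cells = smote_like_oversample(rare_cells, n_new=4)
print(len(new_cells), "synthetic cells generated")
```

Because every synthetic point is a convex combination of two real samples, it always stays within the range spanned by the rare cell type's observed values.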

Workflow

(Figure: SELINA workflow)

Installation

  • SELINA is available for macOS, Linux and Windows and has been tested on Linux with the following dependency packages:
    • python=3.9.7
    • datatable=0.11.1
    • pytorch=1.10.0
    • pandas=1.3.4
    • tqdm=4.62.3
    • h5py=3.4.0
    • pytables=3.6.1
    • scipy=1.6.3
    • r-deldir=1.0_2
    • r-seurat=4.0.3
    • r-hdf5r=1.3.4
    • r-data.table=1.14.0
    • r-optparse=1.7.1
    • imbalanced-learn=0.8.1
    • r-dbscan=1.1.10
    • r-devtools=2.4.3
    • bioconductor-deseq2=1.34.0

The following commands install all the dependency packages at once, except for presto, which can be found at presto. The R package devtools, which is used to install presto, is already included with SELINA, so you do not need to install devtools again. The installation process takes only a few minutes.

conda create -n Selina
conda activate Selina
conda install -c conda-forge -c r -c bioconda -c pfren selina

If your machine has a GPU and you want to use it, additionally run the following command to install cudatoolkit together with matching versions of pytorch and cudnn.

conda install pytorch cudatoolkit cudnn
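After installation, one way to check whether PyTorch can actually see a CUDA device is the short sketch below. It relies only on `torch.cuda.is_available()` from PyTorch; the wrapper function `gpu_available` is a hypothetical helper and not part of SELINA. It returns False when PyTorch is not importable, so it is safe to run in any environment.

```python
def gpu_available():
    """Return True if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return False
    return bool(torch.cuda.is_available())


print("CUDA available:", gpu_available())
```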

Tutorial

Preprocessing of query data

This step normalizes your data, converts the genes to the hg38 version and to gene symbols, and performs dimension reduction and clustering. SELINA supports three input formats: plain, h5 and mtx. The plain format is a gene-by-cell matrix. We provide two example query files, one from normal tissue and one from abnormal tissue, both in plain format, under the normal and disease folders of the demos folder. The command for the plain format is shown below. Preprocessing of the example data can be finished within minutes. Note that this is just an example; we recommend that you use the original count data.

#plain
selina preprocess --format plain --matrix ./query/normal_data/query.txt --directory ./res/

Running the above command will generate four output files:

  • query_res.rds: a seurat object storing the normalized data, dimension reduction and clustering results

  • query_expr.txt: expression matrix of query data with the first column as genes

  • query_cluster.png: UMAP plot with cluster labels

  • query_cluster_DiffGenes.tsv: differentially expressed genes for each cluster
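As an illustration of the gene-by-cell layout described above (genes in the first column, one column per cell), the following sketch reads such a table with Python's standard library. It assumes a tab-delimited file with a header row of cell identifiers; the delimiter, the header layout, the helper name `read_gene_by_cell` and the toy values are assumptions for illustration, so check your own files before relying on them.

```python
import csv
import os
import tempfile


def read_gene_by_cell(path, delimiter="\t"):
    """Return (cell_ids, {gene: [value per cell]}) from a gene-by-cell table.

    Assumes a header row whose first field labels the gene column
    (hypothetical helper, layout is an assumption).
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        cell_ids = header[1:]
        expr = {}
        for row in reader:
            if not row:
                continue  # skip blank lines
            expr[row[0]] = [float(v) for v in row[1:]]
    return cell_ids, expr


# Write a tiny made-up matrix so the example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "toy_expr.txt")
with open(demo_path, "w", newline="") as handle:
    handle.write("Gene\tcell_1\tcell_2\nCD3E\t0.0\t1.2\nMS4A1\t2.5\t0.0\n")

cells, matrix = read_gene_by_cell(demo_path)
print(cells, matrix["CD3E"])
```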

Pre-training of the reference data

In this step you can train a model on your own reference data with the following command. We provide example reference files for the disease and normal conditions in the demos folder. Training on the example data takes only a few minutes.

selina train --path-in ./reference/normal_data --path-out ./res/

Two output files will be generated for the downstream prediction.

  • pre-trained_params.pt: a file containing all parameters of the trained model
  • pre-trained_meta.pkl: a file containing the cell types and genes of the reference data; if the reference is in disease condition, the cell source information is also included.

Predict

Here you can use either our pre-trained models (available on SELINA models) or a model you trained yourself to predict cell types for your query data. The pre-trained models fall into two categories: one for normal datasets and one for disease datasets. Currently the disease models cover only non-small-cell lung carcinoma, type 2 diabetes and Alzheimer's disease, which were used to evaluate the performance of SELINA in our paper.

Because the expression profiles of disease data may deviate substantially from those of normal data, the fine-tuning step is skipped when predicting on disease data. To do this, add --disease to the command, as shown below. Prediction on the example data can be finished within one minute.

selina predict --query-expr ./res/query_expr.txt --model ./res/pre-trained_params.pt --seurat ./res/query_res.rds --path-out ./res/predict/disease --disease

Running this command will output four files:

  • query_predictions.txt: predicted cell type for each cell in the query data

  • query_probability.txt: probability of cells predicted as each of the reference cell types

  • query_pred.png: UMAP plot with cell type labels

  • query_DiffGenes.tsv: differentially expressed genes for each cell type

In disease mode, an extra file is generated:

  • query_cellsources.txt: predicted cell source for each cell in the query data
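A quick way to summarize a per-cell prediction table is to count how many cells were assigned to each label. The sketch below does this for a two-column, tab-delimited table with a header row. That layout, the column names, the helper name `count_cell_types` and the toy labels are assumptions for illustration only; the actual format of query_predictions.txt is not specified here and may differ.

```python
import csv
import io
from collections import Counter


def count_cell_types(handle, delimiter="\t"):
    """Count cells per predicted label from a (cell, label) table.

    Hypothetical helper; assumes a header row and the label in column two.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader)  # skip the assumed header row
    return Counter(row[1] for row in reader if len(row) >= 2)


# Toy in-memory table standing in for a predictions file (made-up labels).
toy = io.StringIO("Cell\tPrediction\nc1\tT cell\nc2\tB cell\nc3\tT cell\n")
counts = count_cell_types(toy)
print(counts.most_common())
```

For a real file, pass an open file handle instead of the `io.StringIO` object, after confirming the delimiter and column order.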

Documentation

For further details on usage, please refer to the SELINA documentation.

Change Log

v0.1

  • Release SELINA with normal and disease modes.

Citation

Pengfei Ren, Xiaoying Shi, Taiwen Li, Chenfei Wang. SELINA: Single-cell Assignment using Multiple-Adversarial Domain Adaptation Network with Large-scale References

Contacts

pfren@tongji.edu.cn
