The power of single-cell RNA sequencing (scRNA-seq) stems from its ability to uncover cell type-dependent phenotypes, which rests on the accuracy of cell type identification. However, resolving cell types within a dataset, and therefore comparing scRNA-seq data across conditions, is challenging due to technical factors such as sparsity, low numbers of cells and batch effects. To address these challenges we developed scID (Single Cell IDentification), which uses the framework of Fisher's Linear Discriminant Analysis to identify transcriptionally related cell types between scRNA-seq datasets. Detailed information on the method and its performance evaluation is given in the publication Boufea et al., iScience 2020. By increasing the power to identify transcriptionally similar cell types across datasets, scID enhances investigators' ability to extract biological insights from scRNA-seq data.
scID classifies cells of a given target dataset based on their transcriptional similarity to given reference clusters in three steps. First, scID extracts cluster-specific gene sets from the reference data and calculates weights (based on Fisher's Linear Discriminant Analysis) that represent their discriminative power to identify the cluster of interest. Next, scID scores all target cells based on their expression of the cluster-specific gene sets. Finally, it identifies equivalent target cells by fitting a mixture of Gaussian distributions to the scores.
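The three steps above can be illustrated with a minimal base R sketch on toy data. This is not scID's internal code: the weight formula (difference of group means divided by the summed within-group variances), the helper names `fisher_weights` and `fit_two_gaussians`, the hand-picked marker genes and the simple EM fit are all assumptions made for illustration.

```r
# Illustrative sketch only; not scID's internal implementation.
set.seed(1)
genes <- paste0("g", 1:4)
markers <- c("g1", "g2")  # assumed cluster-A markers (toy choice)

# Toy reference: 4 genes x 40 cells, markers are higher in cluster "A".
ref_clusters <- rep(c("A", "B"), each = 20)
ref <- matrix(rnorm(4 * 40, mean = 1, sd = 0.3), nrow = 4,
              dimnames = list(genes, paste0("ref", 1:40)))
ref[markers, ref_clusters == "A"] <- ref[markers, ref_clusters == "A"] + 3

# Step 1: Fisher-style weight per gene, i.e. the difference of group means
# divided by the sum of the within-group variances.
fisher_weights <- function(gem, in_cluster) {
  mu_in   <- rowMeans(gem[, in_cluster, drop = FALSE])
  mu_out  <- rowMeans(gem[, !in_cluster, drop = FALSE])
  var_in  <- apply(gem[, in_cluster, drop = FALSE], 1, var)
  var_out <- apply(gem[, !in_cluster, drop = FALSE], 1, var)
  (mu_in - mu_out) / (var_in + var_out + 1e-8)
}
w <- fisher_weights(ref, ref_clusters == "A")

# Toy target: the first 15 of 30 cells resemble cluster "A".
target <- matrix(rnorm(4 * 30, mean = 1, sd = 0.3), nrow = 4,
                 dimnames = list(genes, paste0("cell", 1:30)))
target[markers, 1:15] <- target[markers, 1:15] + 3

# Step 2: score each target cell as the weighted sum of its expression.
scores <- as.vector(crossprod(w, target))
names(scores) <- colnames(target)

# Step 3: fit a two-component Gaussian mixture to the scores with EM and
# return each cell's posterior probability of the high-score component.
fit_two_gaussians <- function(x, n_iter = 100) {
  hi <- x > median(x)  # initialise from the lower and upper halves
  mu <- c(mean(x[!hi]), mean(x[hi]))
  sigma <- c(sd(x[!hi]), sd(x[hi])) + 1e-6
  p <- c(0.5, 0.5)
  for (i in seq_len(n_iter)) {
    # E-step: responsibility of the high-score component for each cell.
    d1 <- p[1] * dnorm(x, mu[1], sigma[1])
    d2 <- p[2] * dnorm(x, mu[2], sigma[2])
    r2 <- d2 / (d1 + d2)
    # M-step: update mixing proportions, means and standard deviations.
    p <- c(mean(1 - r2), mean(r2))
    mu <- c(weighted.mean(x, 1 - r2), weighted.mean(x, r2))
    sigma <- c(sqrt(weighted.mean((x - mu[1])^2, 1 - r2)),
               sqrt(weighted.mean((x - mu[2])^2, r2))) + 1e-6
  }
  r2
}
is_equivalent <- fit_two_gaussians(scores) > 0.5
```

In this toy setting the marker genes receive much larger weights than the uninformative genes, so the scores separate into two groups and the mixture fit splits the target cells accordingly.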
On May 17, 2019, we released a new version of scID (v2.0.0) that uses negative markers together with positive markers to identify equivalent cells. We have seen that this improves classification in the presence of very similar cell types in the dataset.
scID can be installed using the devtools R package:
Given two single-cell RNA-seq gene expression datasets with one of them having known groups of cells (clusters), scID can be used to identify transcriptionally similar cells in the second dataset.
scID_output <- scID::scid_multiclass(target_gem, reference_gem, reference_clusters, ...)
target_gem: An n x m data frame of n genes (rows) in m cells (columns) of the dataset with unknown grouping, where each entry is library-depth or column normalized gene expression. Cell names are expected to be unique.
reference_gem: An N x M data frame of N genes (rows) in M cells (columns) of the dataset with known grouping, where each entry is library-depth or column normalized gene expression.
reference_clusters: A list of cluster labels for the reference cells.
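As a rough illustration of how inputs in this shape might be prepared, the sketch below builds toy count tables, column-normalizes them and creates one cluster label per reference cell. The helpers `make_counts` and `normalise_columns`, the 1e4 scale factor and the random counts are hypothetical. Supplying `reference_clusters` as a vector named by cell is also an assumption, not something the text above confirms.

```r
# Hypothetical toy counts (genes in rows, cells in columns); replace with real data.
set.seed(42)
make_counts <- function(n_genes, n_cells, prefix) {
  m <- matrix(rpois(n_genes * n_cells, lambda = 5), nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0(prefix, seq_len(n_cells))))
  as.data.frame(m)
}
reference_counts <- make_counts(50, 20, "ref_cell")
target_counts    <- make_counts(50, 30, "target_cell")

# Column (library-depth) normalization: divide every cell by its total counts
# and rescale to a common depth. The scale factor is an arbitrary choice.
normalise_columns <- function(counts, scale = 1e4) {
  as.data.frame(sweep(as.matrix(counts), 2, colSums(counts), "/") * scale)
}
reference_gem <- normalise_columns(reference_counts)
target_gem    <- normalise_columns(target_counts)

# One cluster label per reference cell, named by cell so that labels can be
# matched to the columns of reference_gem (assumed input format).
reference_clusters <- setNames(rep(c("A", "B"), each = 10),
                               colnames(reference_gem))

# Sanity checks before calling scID: unique target cell names and
# labels aligned with the reference columns.
stopifnot(!anyDuplicated(colnames(target_gem)))
stopifnot(identical(names(reference_clusters), colnames(reference_gem)))
```

Objects prepared this way have the shapes described for the three arguments of the `scid_multiclass` call shown above. The random toy counts contain no real cluster structure, so they only demonstrate the format.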
scID_output is a list of four objects:
scID_output$labels: A named list of cluster labels for the target cells
scID_output$markers: A data frame of signature genes extracted from the reference clusters
scID_output$weights: A list of the estimated weights for all cluster-specific genes
scID_output$scores: A data frame of scores of target cells (columns) for each reference cluster-specific geneset (rows)
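A short sketch of how the returned list might be inspected follows. The `scID_output` object here is a hand-made stand-in that only mirrors the `labels` and `scores` structure described above. Its values are made up, and the use of cluster names as row names of `scores` is an assumption.

```r
# Hypothetical stand-in for a real result; values are invented for illustration.
scID_output <- list(
  labels = list(cell1 = "A", cell2 = "B", cell3 = "A"),
  scores = data.frame(cell1 = c(0.9, 0.1),
                      cell2 = c(0.2, 0.8),
                      cell3 = c(0.7, 0.3),
                      row.names = c("A", "B"))
)

# Number of target cells assigned to each reference cluster.
label_counts <- table(unlist(scID_output$labels))

# For each target cell (column), the reference cluster (row) with the highest score.
top_cluster <- rownames(scID_output$scores)[apply(scID_output$scores, 2, which.max)]
names(top_cluster) <- colnames(scID_output$scores)
```

Comparing `top_cluster` with the assigned labels is one simple way to spot cells whose label does not match their highest-scoring cluster.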
To report bugs or ask questions, please use the GitHub issue tracker.