ClusterMatch aligns single-cell RNA-sequencing data at the multi-scale cluster level via stable matching
Unsupervised clustering of single-cell RNA sequencing (scRNA-seq) data holds the promise of characterizing known and novel cell types in various biological and clinical contexts. However, multiple sources of variability in the data pose challenges: high dimensionality and high noise, batch effects, and intrinsically multi-scale clustering resolutions. Here, we present ClusterMatch, a stable-matching optimization model that accounts for statistical uncertainty by aligning scRNA-seq data at the cluster level. On the one hand, ClusterMatch leverages the mutual correspondence found by canonical correlation analysis (CCA) and multi-scale Louvain clustering to identify clusters at optimized resolutions. On the other hand, it uses a stable matching framework to align scRNA-seq data in the latent space while maintaining interpretability through overlapping marker gene sets. Through extensive experiments, we demonstrate the efficacy of ClusterMatch in data integration, cell type annotation, and cross-species/timepoint alignment scenarios. Our results show ClusterMatch's ability to use both global and local information in scRNA-seq data, set an appropriate resolution for multi-scale clustering, and offer interpretability through marker genes.
library(devtools)
devtools::install_github("BateerCoder/ClusterMatch")
library(ClusterMatch)  # assumes the package name matches the repository name
dendritic_batch1 <- read.csv(".../data/dendritic/batch1.csv", row.names = 1)
dendritic_batch2 <- read.csv(".../data/dendritic/batch2.csv", row.names = 1)
dendritic_celltype <- read.csv(".../data/dendritic/celltype.csv")
human_dLGN <- read.csv(".../data/dLGN/human_dLGN.csv", row.names = 1)
macaque_dLGN <- read.csv(".../data/dLGN/macaque_dLGN.csv", row.names = 1)
human_celltype <- read.csv(".../data/dLGN/human_celltype.csv")
rownames(human_celltype) <- human_celltype$cells
macaque_celltype <- read.csv(".../data/dLGN/macaque_celltype.csv")
rownames(macaque_celltype) <- macaque_celltype$cells
The input files are formatted as follows:
- Row names: gene symbols.
- Column names: cell IDs.
- All other entries: the expression value (counts or TPM) of a gene in a cell.
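Before running the pipeline, it can help to confirm that a loaded table follows this layout. The sketch below uses base R only; `check_expression_input` is a hypothetical helper written for illustration, not a ClusterMatch function, and the toy values are made up.

```r
# Hypothetical helper (not part of ClusterMatch): returns a character vector
# describing layout problems; an empty vector means none were found.
check_expression_input <- function(expr) {
  stopifnot(is.data.frame(expr) || is.matrix(expr))
  mat <- as.matrix(expr)
  problems <- character(0)
  if (!is.numeric(mat)) {
    problems <- c(problems, "non-numeric expression values")
  }
  if (is.null(rownames(mat)) || anyDuplicated(rownames(mat)) > 0) {
    problems <- c(problems, "missing or duplicated gene symbols")
  }
  if (is.null(colnames(mat)) || anyDuplicated(colnames(mat)) > 0) {
    problems <- c(problems, "missing or duplicated cell IDs")
  }
  if (is.numeric(mat) && any(is.na(mat) | mat < 0)) {
    problems <- c(problems, "negative or missing values")
  }
  problems
}

# Toy input: 3 genes (rows) x 2 cells (columns), made-up counts
toy <- data.frame(cell_A = c(0, 5, 2), cell_B = c(1, 0, 7),
                  row.names = c("GENE1", "GENE2", "GENE3"))
check_expression_input(toy)
```

The same call could be applied to each expression table after `read.csv(..., row.names = 1)`. A table read without `row.names = 1` has no gene symbols as row names, so it would be reported as a problem.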
As an example, we will use two batches of human dendritic cell datasets in sections 2.2, 2.3, and 2.4. Batch 1 consists of 288 cells with three annotated cell types: CD141 DC, double-negative, and plasmacytoid DC. Batch 2 also contains 288 cells with three annotated cell types: CD1C DC, double-negative, and plasmacytoid DC.
We will also use human and macaque dorsal lateral geniculate nucleus (dLGN) datasets as an example in section 2.5. The human dLGN dataset consists of 952 cells, while the macaque dLGN dataset consists of 1,723 cells. The annotations have been refined through classification and manual review, resulting in four cell types: koniocellular (K), magnocellular and parvocellular projection neurons (MP), and two GABAergic cell types (GABA1 and GABA2).
dendritic_res <- ClusterMatch_resolution(dendritic_batch1, dendritic_batch2, ref.norm = FALSE, que.norm = FALSE)
The optimal resolutions of batch1 and batch2
dendritic_res$D1_ref_res
[1] 1.1
dendritic_res$D2_que_res
[1] 1.4
dendritic_matching <- ClusterMatch_matching(dendritic_batch1, dendritic_batch2, ref.res = dendritic_res$D1_ref_res, que.res = dendritic_res$D2_que_res, ref.norm = FALSE, que.norm = FALSE, random_PCC = 1.3)
Matching matrix
dendritic_matching$Matching_matrix
     D1_1 D1_0 D1_2
D2_1    0    2    0
D2_2    0    0    2
D2_0    0    0    0
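The matched cluster pairs can be read off this matrix programmatically. The sketch below rebuilds the printed matrix by hand and lists every non-zero entry. It assumes that any non-zero value marks a matched pair; the meaning of the value 2 is not stated here, so it is simply carried along. Rows (D2) are treated as query clusters and columns (D1) as reference clusters, following the `D1_ref_res` and `D2_que_res` names above.

```r
# Rebuild the matrix printed above (values copied from the output)
m <- matrix(c(0, 2, 0,
              0, 0, 2,
              0, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("D2_1", "D2_2", "D2_0"),
                            c("D1_1", "D1_0", "D1_2")))

# Assumption: any non-zero entry marks a matched pair of clusters
idx <- which(m != 0, arr.ind = TRUE)
matched <- data.frame(query_cluster = rownames(m)[idx[, "row"]],
                      reference_cluster = colnames(m)[idx[, "col"]],
                      value = m[idx],
                      stringsAsFactors = FALSE)
matched

# Clusters whose row or column contains no non-zero entry
unmatched_query <- rownames(m)[rowSums(m != 0) == 0]
unmatched_reference <- colnames(m)[colSums(m != 0) == 0]
unmatched_query
unmatched_reference
```

Under this reading, D2_1 pairs with D1_0 and D2_2 pairs with D1_2, while D2_0 and D1_1 have no partner. This is consistent with each batch containing one cell type that is absent from the other.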
dendritic_integration <- ClusterMatch_integration(dendritic_batch1, dendritic_batch2, ref.res = dendritic_res$D1_ref_res,
que.res = dendritic_res$D2_que_res, ref.norm = FALSE, que.norm = FALSE, random_PCC = 1.3, distance_diff = 3, distance_same = 1)
UMAP visualization colored by the annotated batches
umap_df <- ClusterMatch_UMAP(embedding = dendritic_integration$cell_embedding, celltype = dendritic_celltype)
batch_colour <- c("#E64540", "#3F81BB")
celltype_colour <- c("#E64136", "#5F78A3", "#EDA6C3", "#96C561")
library(ggplot2)
ggplot(umap_df, aes(X1, X2, color = batch)) +
  scale_color_manual(values = batch_colour) +
  geom_point() + theme_bw() +
  theme(panel.grid = element_blank(), plot.title = element_text(hjust = 0.5), text = element_text(size = 20)) +
  labs(x = "UMAP_1", y = "UMAP_2", title = "ClusterMatch")
UMAP visualization colored by the annotated cell types
ggplot(umap_df, aes(X1, X2, color = label)) +
  scale_color_manual(values = celltype_colour) +
  geom_point() + theme_bw() +
  theme(panel.grid = element_blank(), plot.title = element_text(hjust = 0.5), text = element_text(size = 20)) +
  labs(x = "UMAP_1", y = "UMAP_2", title = "ClusterMatch")
dLGN_annotation <- ClusterMatch_annotation(human_dLGN, macaque_dLGN, human_celltype, que.res=2, ref.norm = FALSE, que.norm = FALSE)
table(dLGN_annotation$D2_query$predicted_celltypes==macaque_celltype$celltype)
FALSE  TRUE
   31  1693
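The counts above can be turned into an overall agreement rate, and the same comparison can be broken down by cell type. The sketch below is base R only. The first part uses the two counts printed above. The second part uses made-up toy labels (`truth` and `predicted` are hypothetical stand-ins, not objects returned by ClusterMatch) to show a comparison aligned by cell ID rather than by position, which guards against the two label sets being in different orders.

```r
# Part 1: overall agreement from the counts printed above
agreement <- c("FALSE" = 31, "TRUE" = 1693)
accuracy <- agreement[["TRUE"]] / sum(agreement)
round(accuracy, 3)  # 0.982

# Part 2: toy example with made-up labels, aligned by cell ID
truth <- data.frame(cells = c("c1", "c2", "c3", "c4"),
                    celltype = c("K", "MP", "GABA1", "MP"),
                    stringsAsFactors = FALSE)
rownames(truth) <- truth$cells

# Predicted labels named by cell ID, deliberately in a different order
predicted <- c(c3 = "GABA1", c1 = "K", c4 = "K", c2 = "MP")

# Look up the annotated label of each predicted cell by its ID
truth_aligned <- truth[names(predicted), "celltype"]
toy_accuracy <- mean(predicted == truth_aligned)
toy_accuracy

# Confusion table: rows are predictions, columns are annotations
confusion <- table(predicted = predicted, annotated = truth_aligned)
confusion
```

The confusion table shows which cell types account for the disagreements, which a single TRUE/FALSE count cannot.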