Proportion test statistics and visualization on single cell metadata. A simple package for single cell metadata exploration.
The package can be installed from R software using devtools:
library(devtools)
install_github("OscarGVelasco/lotOfCells")
A pre-print is available at [https://doi.org/10.1101/2024.05.23.595582], which includes details on the statistical tests available in the package and examples of real use on a public lung adenocarcinoma dataset:
Óscar González-Velasco; LotOfCells: data visualization and statistics of single cell metadata. bioRxiv 2024.05.23.595582; doi: https://doi.org/10.1101/2024.05.23.595582
- (May 2024) Version: 0.2.0
- Added an option to use customized colors in waffle, bar, and polar plots (see examples).
- (July 2024) Version: 0.3.0
Single cell sequencing unveils a treasure trove into the biological and molecular characteristics of samples. Here we introduce LotOfCells
: a simple R package designed to explore the intricate landscape of phenotype data within single-cell studies. An archetypal example of such analysis would be to test the differences in proportion between cell-types across conditions, but it can be used in many scenarios e.g: test if a specific cell-type or class proportion is dependent of sequencing date as a quality check.
LotOfCells
is compatible with Seurat
and SingleCellExperiment
objects. You can also directly provide a dataframe with the metadata.
All functions require the following inputs:
scObject
: An object of classSeurat
orSingleCellExperiment
, or a dataframe containing the metadata.main_variable
: Column name containing the main class variable (e.g., condition, treatment, tissue).subtype_variable
: Column name containing the subtype class variable (e.g., cell type, sequencing date, cluster).sample_id
(optional): Column name containing the sample/patient ID variable. If provided, tests will simulate proportion variability per sample, and plots will show each individual.
We will construct a simulated dataset of single-cell metadata consisting of six samples with two conditions (mutant and wild type), including four cell types (A, B, C, D) and different time points simulating treatment (time 0h, 2h, 4h):
# # Data simulation with 2 conditions, 3 time points, and 4 cell-types:
sample1 <- c(rep("CellTypeA",700),rep("CellTypeB",300),rep("CellTypeC",500),rep("CellTypeD",1000))
sample2 <- c(rep("CellTypeA",1700),rep("CellTypeB",350),rep("CellTypeC",550),rep("CellTypeD",800))
sample3 <- c(rep("CellTypeA",1200),rep("CellTypeB",200),rep("CellTypeC",420),rep("CellTypeD",800))
sample4 <- c(rep("CellTypeA",500),rep("CellTypeB",1000),rep("CellTypeC",10),rep("CellTypeD",1200))
sample5 <- c(rep("CellTypeA",550),rep("CellTypeB",990),rep("CellTypeC",10),rep("CellTypeD",1100))
sample6 <- c(rep("CellTypeA",1350),rep("CellTypeB",590),rep("CellTypeC",300),rep("CellTypeD",600))
sample <- c(rep("A",length(sample1)),rep("B",length(sample2)),rep("C",length(sample3)),rep("D",length(sample4)),rep("E",length(sample5)),rep("F",length(sample6)))
times <- c(rep("time 0h",length(sample1)),rep("time 0h",length(sample2)),rep("time 2h",length(sample3)),rep("time 2h",length(sample4)),rep("time 4h",length(sample5)),rep("time 4h",length(sample6)))
cell_type <- c(sample1, sample2, sample3, sample4, sample5, sample6)
meta.data <- data.frame(sample, cell_type, times)
meta.data$condition <- "wt"
meta.data[meta.data$sample %in% c("B","D","F"),]$condition <- "mut"
rownames(meta.data) <- as.character(1:nrow(meta.data))
head(meta.data)
First, let's visualize the data using LotOfCells
. In these functions, you can also specify:
subtype_only
: Visualize only a specific class fromsubtype_variable
. Useful if for example you only want to show the proportions of a specific cell type or subclass.
Barplots are displayed in such an order that the class with the largest average proportion is always at the bottom, making it easier to compare smaller groups at the top.
LotOfCells
can be used with any other combination of variables (different from cell type). Here (Figure D), by switching the subtype_variable
to the time-points we can investigate the contribution of time-points to the main condition, serving as a quality check to understand the weight of some covarites to the target condition (e.g. time points, sequencing dates or different tissues):
# # Test of barplot charts:
# All cells together for every group:
g.A <- bar_chart(meta.data, main_variable = "condition", subtype_variable = "cell_type")
# Barplot for each individual sample:
g.B <- bar_chart(meta.data, main_variable = "condition",subtype_variable = "cell_type",
sample_id = "sample")
# Display One-Class only:
g.C <- bar_chart(meta.data, main_variable = "condition",subtype_variable = "cell_type",
sample_id = "sample", subtype_only = "CellTypeD")
# Distribution of time-points by condition:
g.D <- bar_chart(meta.data, main_variable = "condition",subtype_variable = "times")
ggpubr::ggarrange(g.A, g.B, g.C, g.D, labels = c("A", "B", "C","D"),
ncol=2, nrow=2, common.legend = F)
If we want to visualize the contribution to the global proportion of a specific class (e.g.: how much each sample contribute to each cell type proportion), we can use the contribution = TRUE
flag when setting the sample_id
variable:
# Examine the contribution of each sample to the proportions of cell types per condition:
g.A <- bar_chart(meta.data, main_variable = "condition",subtype_variable = "cell_type",
sample_id = "sample", contribution = TRUE)
# Examine the contribution of each sample to the proportions of cell types per time point:
g.B <- bar_chart(meta.data, main_variable = "times",subtype_variable = "cell_type",
sample_id = "sample", contribution = TRUE)
ggpubr::ggarrange(g.A, g.B, labels = c("A", "B"),
ncol=2, nrow=1, common.legend = F)
To visualize small proportions using waffle plots might be more advisable:
# # Test of Waffles charts:
# All cells together for every group
g.A <- waffle_chart(meta.data, main_variable = "condition", subtype_variable = "cell_type")
# Waffle for each individual sample:
g.B <- waffle_chart(meta.data, main_variable = "condition", subtype_variable = "cell_type",
sample_id = "sample")
# One-Class only:
g.C <- waffle_chart(meta.data, main_variable = "condition",subtype_variable = "cell_type",
sample_id = "sample",subtype_only = "CellTypeD")
# One-Class only all samples together:
g.D <- waffle_chart(meta.data, main_variable = "condition", subtype_variable = "cell_type",
subtype_only = "CellTypeD")
ggpubr::ggarrange(g.B, g.A, g.C, g.D, labels = c("A", "B", "C", "D"),
ncol=2, nrow=2, widths = c(0.6,0.2), align = "hv")
To visualize numeric values (including gene/feature expression) over groups and sub-categories, we can make use of density ridge plots to observe its data distribution. We have to select the variable to use in numerical_variable
.
# # Test of density charts:
# simulate the number of RNA features detected in each cell:
meta.data <- rbind.data.frame(meta.data %>% dplyr::filter(condition == "mut") %>%
mutate(n_features_RNA = abs(rnorm(nrow(.), mean=c(3000, 3500), sd=800))),
meta.data %>% dplyr::filter(condition == "wt") %>%
mutate(n_features_RNA = abs(rnorm(nrow(.), mean=c(2500,1500), sd=400))))
# All cells together for every group
g.A <- density_chart(scObject = meta.data, main_variable = "condition", subtype_variable = "covariable",
numerical_variable = "n_features_RNA")
# Density chart for each individual sample:
g.B <- density_chart(scObject = meta.data, main_variable = "condition", subtype_variable = "covariable",
numerical_variable = "n_features_RNA", sample_id = "sample")
ggpubr::ggarrange(g.A, g.B, labels = c("A", "B"),
ncol=2, nrow=1, common.legend = F)
If scObject
is of class Seurat
or SingleCellExperiment
, we can specify a gene/feature name directly in subtype_variable
to display the counts per class and sub-level. Here we show an example using human pancreas single-cell data (Baron M et al. (2017) using the scRNAseq R package), the gene SST is a marker of pancreatic $\delta cells:
# # Test of density charts:
density_chart(scObject = sc_pancreas_human, main_variable = "donor", subtype_variable = "cell.type",
numerical_variable = "SST")
Colors can be easily changed for bar, waffle, or polar plots using the colors
option. To do this, simply provide a vector of colors to use. If the specified colors are insufficient, a palette will be created using colorRampPalette based on the colors specified. If no colors are specified, the default color palette of LotOfCells will be used.
# # Using different colors:
# We can use RColorBrewer to easily obtain a different palette of colors:
g.A <- bar_chart(meta.data, main_variable = "condition", subtype_variable = "cell_type",
sample_id = "sample", colors = RColorBrewer::brewer.pal(8, "Set3"))
g.B <- waffle_chart(meta.data, main_variable = "condition", subtype_variable = "cell_type",
sample_id = "sample", colors = RColorBrewer::brewer.pal(8, "Pastel2"))
ggpubr::ggarrange(g.A, g.B, labels = c("A", "B"), ncol=2, nrow=1, widths = c(0.4,0.6))
Lastly, this visualization might be interesting to compare raw number of cells per class (e.g.: if a sample has a much larger number of cells, which might affect downstream analysis).
# Test of circle polar plot:
polar_chart(meta.data, main_variable = "condition",subtype_variable = "cell_type", sample_id = "sample")
# Test of polar plot by cell-type:
polar_chart(meta.data, main_variable = "cell_type",subtype_variable = "sample")
For the proportion tests and the Montecarlo simulations that we are going to introduce, we can use parallelization using BiocParallel
:
# Set number of CPUs to use:
BiocParallel::register(BPPARAM = BiocParallel::MulticoreParam(workers = 6))
The main feature of LotOfCells is the testing of differences in proportions, combined with a simulation of a random distribution to evaluate how extreme the observations might be. To compare the proportions of a specific class between two conditions, define the two classes and their order from the main_variable
column. Optionally, the sample_id
column (or another sub-class level) can be set to account for per-sample heterogeneity in the computational simulation.
*NOTE: for significance testing we recommend an
A plot of the differences in proportions is generated, with pink shades representing the standard deviation of the Monte Carlo simulations, and a dataframe containing the statistical results is returned.
# # Test of 2 conditions using montecarlo and differences in percentage
labelOrder <- c("mut","wt")
results.2.conditions <- lotOfCells(scObject = meta.data,
main_variable = "condition",
subtype_variable = "cell_type",
sample_id = "sample",
permutations = 1000,
labelOrder = labelOrder,
parallel = TRUE)
print(results.2.conditions)
To determine if the majority of class proportions change significantly simultaneously between two conditions, we compute a divergence score based on the Kullback-Leibler (KL) divergence. This score helps to understand the global dysregulation in class proportions. The process involves simulating a random distribution to assess how extreme the observed symmetric score is.
Higher divergence scores suggest more significant simultaneous changes in class proportions between the two conditions.
# # Test of entropy for 2 conditions using montecarlo simulation
labelOrder <- c("mut","wt")
results.2.conditions.entropy <- entropyScore(scObject = meta.data,
main_variable = "condition",
subtype_variable = "cell_type",
permutations = 10000,
labelOrder = labelOrder,
parallel = TRUE)
The test of differences in proportions available in LotOfCells can be done with more than 2 conditions, by computing a Kendall rank correlation coefficient. To compare the proportions of a specific class between several conditions we will need to define the classes and the order in which they will be compared from the column specified in main_variable
, the correlation (positive and negative) will be tested following the order defined. Optionally, like introduced above, we can set the sample_id column (or other sub-class level) to include the per-sample heterogeneity in the computational simulation.
# # Test of correlation for SEVERAL conditions using Kendall rank correlation
labelOrder <- c("time 0h","time 2h","time 4h")
results.3.conditions <- lotOfCells(scObject = meta.data,
main_variable = "times",
subtype_variable = "cell_type",
sample_id = "sample",
permutations = 1000,
labelOrder = labelOrder,
parallel = TRUE)
head(results.3.conditions)
dynamics_chart(gammaResults = results.3.conditions)