# CITEseq data analysis

*Author: Lena Boehme, Taghon lab, 2023*

## Denoising and normalisation of ADT data with dsb

ADT data is often very noisy, and antibody detection and specificity can be variable. Denoising corrects the measured signal for background, which substantially improves visualisation and interpretation.

For this we use the dsb package and follow the [suggested workflow](https://cran.rstudio.com/web/packages/dsb/vignettes/end_to_end_workflow.html#step1), which uses cell-free droplets to determine background levels of ambient antibody. Explanations about the approach can be found in the [vignette](https://cran.r-project.org/web/packages/dsb/vignettes/).

## Setup

In [None]:
options(repr.plot.width=12, repr.plot.height=6)

options(scipen=100) #avoid scientific notation of numbers

library(SeuratDisk)
library(Seurat)
library(matrixStats)
library(ggplot2)
library(ggrastr)
library(pheatmap)
library(reshape2)
library(dplyr)
library(tidyr)
library(viridis)
library(stringr)
library(RColorBrewer)
library(ggrepel)
library(Matrix)

library(dsb)

sessionInfo()

In [None]:
pal24 <- colorRampPalette(brewer.pal(12, "Paired"))(24)

In [None]:
datadir_raw <- ''

## Reading in data

To use dsb we need to analyse the unfiltered cellranger output, which still contains empty droplets. These will be used to estimate the background quantities of (unbound) antibody, which can then serve as a correction factor for unspecific staining of cells.
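
As a rough illustration of how empty droplets can serve as a background reference, the sketch below rescales log-transformed cell counts by the mean and standard deviation of the same protein in empty droplets. This only mirrors the first (ambient correction) step described in the dsb documentation; it is not the package's implementation, and the isotype-based per-cell correction is left out. All counts, protein names and the pseudocount value are assumptions made for the example.

```r
# Toy raw ADT counts: rows = proteins, columns = barcodes (all values invented)
set.seed(1)
proteins <- c("CD3", "CD4", "IgG1_isotype")
empty <- matrix(rpois(3 * 200, lambda = c(5, 3, 2)), nrow = 3,
                dimnames = list(proteins, paste0("empty", 1:200)))
cells <- matrix(rpois(3 * 50, lambda = c(200, 40, 3)), nrow = 3,
                dimnames = list(proteins, paste0("cell", 1:50)))

pseudocount <- 10  # assumed value; check the dsb documentation for the package default

log_empty <- log(empty + pseudocount)
log_cells <- log(cells + pseudocount)

# per-protein background estimated from the empty droplets
bg_mean <- rowMeans(log_empty)
bg_sd <- apply(log_empty, 1, sd)

# matrix - vector recycles down the columns, so each row is rescaled
# by the background of its own protein
ambient_corrected <- (log_cells - bg_mean) / bg_sd

round(rowMeans(ambient_corrected), 2)
```

After this rescaling, a value can be read as the number of background standard deviations above the ambient level, so a protein that is truly stained should sit well above an isotype control.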

Which cells to include is normally determined by RNA-based QC measures. In this instance, pre-processing and QC are done for all single-cell data sets together at the Sanger Institute. We can then use the retained cell barcodes from the CITE-seq data to select high-quality cells on which to carry out the denoising.

### Filtered data

Mapped with STARsolo (RNA only), pre-processed at the Sanger, QCed, doublets removed, preliminary annotation based on RNA.

In [None]:
#First need to convert from anndata to seurat format
#Convert("adata_full_rev_2_clean.h5ad", dest = "h5seurat", assay = 'RNA', overwrite = TRUE)

In [None]:
seurObj_filt <- LoadH5Seurat('./adata_full_rev_2_clean.h5seurat', misc = FALSE, meta.data = FALSE) #metadata and misc need to be excluded to prevent an error

In [None]:
seurObj_filt

The meta data is stored in a separate csv file and can be added back to the Seurat object.
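
Because a join on a data frame does not keep row names, the barcodes have to be restored as row names afterwards. A minimal base R sketch of the same idea uses `match()`, which leaves the rows of the object's meta data in place. The barcodes and the `annotation` column are invented for illustration.

```r
# Hypothetical barcodes and annotation column, for illustration only
obj_meta <- data.frame(nCount_RNA = c(1200, 800, 950),
                       row.names = c("GEX1-AAAC", "GEX1-CCGT", "GEX2-GGTA"))
csv_meta <- data.frame(barcode = c("GEX2-GGTA", "GEX1-AAAC"),
                       annotation = c("T cell", "B cell"),
                       stringsAsFactors = FALSE)

# match() returns, for each cell in the object, its row position in the csv
# (NA if the barcode is absent from the csv)
idx <- match(rownames(obj_meta), csv_meta$barcode)
obj_meta$annotation <- csv_meta$annotation[idx]

obj_meta
```

Cells without an entry in the csv end up with `NA`, which makes missing annotations easy to spot.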

In [None]:
meta <- read.csv("./adata_full_rev_2_clean.csv")

In [None]:
seurObj_filt$barcode <- rownames(seurObj_filt@meta.data)
seurObj_filt@meta.data <-  left_join(seurObj_filt@meta.data, meta, join_by('barcode'))
rownames(seurObj_filt@meta.data) <- seurObj_filt$barcode

In [None]:
# seurObj_filt@meta.data %>% head

In [None]:
SaveH5Seurat(seurObj_filt, 'HTSA_RNA_all.h5seurat', overwrite = TRUE)

In [None]:
#seurObj_filt <- LoadH5Seurat('HTSA_RNA_all.h5seurat')

This object represents the entire single cell data set. We only need cells from the CITEseq data.

In [None]:
table(seurObj_filt$study)

In [None]:
seurObj_filt_CITE <- subset(seurObj_filt, study == 'HTSA_Ghent') #represents all CITEseq data

In [None]:
seurObj_filt_CITE

In [None]:
head(seurObj_filt_CITE@meta.data)

In [None]:
SaveH5Seurat(seurObj_filt_CITE, 'HTSA_RNA_CITE.h5seurat', overwrite=TRUE)

In [None]:
#seurObj_filt_CITE <- LoadH5Seurat('HTSA_RNA_CITE.h5seurat')

### Unfiltered data

Mapped with cellranger v7.0.0 (ADT+RNA), unfiltered output including debris and droplets.

In [None]:
#fetch file names
h5_raw <- list.files(path=datadir_raw,
             pattern="\\.h5$",  #regular expression: return only files ending in .h5
             full.names=TRUE) #get full path instead of just filename
h5_raw

The Read10X_h5 command produces a list that contains the RNA and ADT matrices as list elements 1 and 2, respectively. We separate the two modalities and construct two independent lists, which contain either the RNA or the ADT counts for each sample. In addition, we extract the sample ID from the filename so that we can later match the meta data. This is important because the files are listed in lexicographic order, i.e. 10 is sorted before 2.

Note that the barcode prefix in the Sanger data is 'GEX', whereas in the unfiltered data it's 'TT-CITE-'. We need to rename and match those.
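
The sample ID extraction, prefix change and sort order can be checked on a few file names before reading any data. The file names below are hypothetical and only follow the `TT-CITE-<number>_...` pattern assumed in this notebook.

```r
# Hypothetical file names, for illustration only
files <- c("TT-CITE-1_raw_feature_bc_matrix.h5",
           "TT-CITE-2_raw_feature_bc_matrix.h5",
           "TT-CITE-10_raw_feature_bc_matrix.h5")

# sample ID = everything before the first underscore
ids <- vapply(strsplit(basename(files), "_"), `[`, character(1), 1)

# swap the prefix so that the IDs match the Sanger barcodes
ids_gex <- sub("^TT-CITE-", "GEX", ids)
ids_gex

# character sorting places "GEX10" before "GEX2"
sort(ids_gex, method = "radix")
```

Keeping the sample ID as a name on each list element, rather than relying on list position, avoids mix-ups caused by this ordering.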

In [None]:
#read in files

counts_RNA <- list()
counts_ADT <- list()
samples <- c()

for(i in seq_along(h5_raw)){
    counts <- Read10X_h5(h5_raw[[i]]) #produces list of two matrices, 1st is RNA, 2nd is ADT
    counts_RNA[[i]] <- counts[[1]] #add RNA counts as element in list
    counts_ADT[[i]] <- counts[[2]] #add ADT counts as element in list
    name <- str_split(basename(h5_raw[i]),'_')[[1]][1] #extract sample name from filename
    samples[i] <- name #add sample name to sample vector
    names(samples)[i] <- sub('TT-CITE-', 'GEX', name) #change prefix
    names(counts_RNA)[i] <- names(samples)[i] #rename list elements to match sample names
    names(counts_ADT)[i] <- names(samples)[i]
}

In [None]:
samples

Before merging the cells from all samples into a single object, we need to modify the barcodes so that their origin can be distinguished. By default they all end in '-1' and have no sample-specific identifier. We add a prefix corresponding to the sample and remove the '-1' suffix; that way they should match the barcodes in the Sanger-mapped data and allow us to extract corresponding cells from both versions of the data set. At the same time we save the sample origin for each cell into a list, which we can later use as meta data.
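
The renaming can be tried on a couple of barcodes first. The barcodes below are invented 10x-style strings; anchoring the pattern with `$` ensures that only the trailing '-1' is removed.

```r
# Invented 10x-style barcodes, for illustration only
barcodes <- c("AAACCCAAGAAACACT-1", "AAACCCAAGAAACCAT-1")
prefix <- paste0("GEX10", "-")

# add the sample prefix, then strip the '-1' suffix at the end of the barcode
renamed <- sub("-1$", "", paste0(prefix, barcodes))
renamed
```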

In [None]:
samples_list <- list()

for (i in seq_along(counts_RNA)){
    prefix <- paste0(names(samples)[i], '-')  #retrieve new prefix
    colnames(counts_RNA[[i]]) <- paste0(prefix, colnames(counts_RNA[[i]])) # add prefix to cell barcodes
    colnames(counts_RNA[[i]]) <- sub('-1$','', colnames(counts_RNA[[i]])) # remove '-1' suffix (anchored to the end of the barcode)
    colnames(counts_ADT[[i]]) <- paste0(prefix, colnames(counts_ADT[[i]]))
    colnames(counts_ADT[[i]]) <- sub('-1$','', colnames(counts_ADT[[i]]))
    samples_list[[i]] <- rep(names(samples)[i], ncol(counts_RNA[[i]])) # sample origin for each cell
}

Note the lexicographic order: the second library is GEX10, not GEX2.

In [None]:
colnames(counts_RNA[[2]]) %>% head()

Merge all samples per modality into one matrix instead of a list.
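
Since `cbind()` combines matrices by position rather than by row name, it can be worth confirming that all samples list their features in the same order before merging. A small sketch with invented matrices:

```r
# Two invented per-sample matrices with identical feature order
m1 <- matrix(1:4, nrow = 2,
             dimnames = list(c("CD3", "CD4"), c("GEX1-A", "GEX1-B")))
m2 <- matrix(5:8, nrow = 2,
             dimnames = list(c("CD3", "CD4"), c("GEX2-A", "GEX2-B")))
mats <- list(GEX1 = m1, GEX2 = m2)

# check that every matrix has the same row names in the same order
same_rows <- all(vapply(mats,
                        function(m) identical(rownames(m), rownames(mats[[1]])),
                        logical(1)))
stopifnot(same_rows)

# bind all matrices column-wise in a single call
merged <- do.call(cbind, unname(mats))
dim(merged)
```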

In [None]:
counts_RNA_merged <- Reduce(cbind, counts_RNA)
counts_ADT_merged <- Reduce(cbind, counts_ADT)

In [None]:
rm(counts_RNA, counts_ADT) #cleanup

Build seurat object from the RNA matrix.

In [None]:
seurObj_unfilt_CITE <- CreateSeuratObject(counts_RNA_merged)

Some antibody names carry a '.1' suffix of unknown origin, which we remove.

In [None]:
rownames(counts_ADT_merged)

In [None]:
ABs <- rownames(counts_ADT_merged) %>% sub('\\.1$', '', .) #only strip '.1' at the end of the name
ABs

In [None]:
rownames(counts_ADT_merged) <- ABs

Add ADT to seurat object as separate assay.

In [None]:
seurObj_unfilt_CITE[["ADT"]] <- CreateAssayObject(counts = counts_ADT_merged)

Add sample origin for each cell to the meta data.

In [None]:
seurObj_unfilt_CITE$sample <- unlist(samples_list)

In [None]:
seurObj_unfilt_CITE@meta.data %>% tail()

table(seurObj_unfilt_CITE$sample)

### Matching the data

All cells in the filtered (processed) data are also present in the unfiltered data set, which additionally contains several million low-quality cells and droplets.

In [None]:
table(colnames(seurObj_filt_CITE) %in% colnames(seurObj_unfilt_CITE))
table(colnames(seurObj_unfilt_CITE) %in% colnames(seurObj_filt_CITE))

We use the filtered object to annotate cells (vs. droplets) in the unfiltered data.

In [None]:
seurObj_unfilt_CITE$cell <- colnames(seurObj_unfilt_CITE) %in% colnames(seurObj_filt_CITE)

Sanity check: RNA counts should be much higher in cells than in droplets.

In [None]:
options(repr.plot.width=15, repr.plot.height=4)

ggplot(seurObj_unfilt_CITE@meta.data, aes(x=sample, y=log10(nCount_RNA), fill=cell))+
geom_boxplot()+
theme_bw()

Sanity check: Odd samples should be CD3neg, even samples CD3pos.

In [None]:
options(repr.plot.width=12, repr.plot.height=6)
#caution: takes a while due to large data set
VlnPlot(subset(seurObj_unfilt_CITE, cell==TRUE), features = 'CD3', assay = 'ADT', group.by = 'sample', pt.size = 0, log = TRUE)

In [None]:
#save unfiltered object
SaveH5Seurat(seurObj_unfilt_CITE, "./HTSA_CITE_preDSB.h5seurat", overwrite=TRUE)

In [None]:
#seurObj_unfilt_CITE <- LoadH5Seurat("./HTSA_CITE_preDSB.h5seurat")

## Droplet QC

Basic QC has already been carried out for cells; for droplets we need to select an appropriate subset. We first remove all droplets that lack either RNA or ADT reads. All cells are already pre-filtered to have both RNA and ADT information.

In [None]:
table(seurObj_unfilt_CITE$nCount_RNA >0, seurObj_unfilt_CITE$cell)
table(seurObj_unfilt_CITE$nCount_ADT >0, seurObj_unfilt_CITE$cell)

In [None]:
seurObj_unfilt_CITE2 <- subset(seurObj_unfilt_CITE, subset = nCount_RNA > 0 & nCount_ADT > 0 )

In [None]:
dim(seurObj_unfilt_CITE)
dim(seurObj_unfilt_CITE2)

We can also determine the percentage of mitochondrial reads; a high percentage indicates low-viability cells rather than empty droplets.
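
Conceptually, this percentage is the share of each barcode's counts that comes from genes whose names start with 'MT-'. The base R sketch below computes it on an invented count matrix; it is meant to show the arithmetic, not to reproduce the Seurat function.

```r
# Invented counts: rows = genes, columns = barcodes
counts <- matrix(c(10, 0, 5,
                    2, 1, 0,
                    8, 9, 5),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("MT-CO1", "MT-ND1", "CD3E"),
                                 c("bc1", "bc2", "bc3")))

# select mitochondrial genes by name and divide their counts by the total per barcode
mt_rows <- grepl("^MT-", rownames(counts))
percent_mt <- 100 * colSums(counts[mt_rows, , drop = FALSE]) / colSums(counts)
percent_mt
```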

In [None]:
seurObj_unfilt_CITE2[["percent.mt"]] <- PercentageFeatureSet(seurObj_unfilt_CITE2, pattern = "^MT-")

Next, we need to set thresholds for the background library that will be used. For this purpose we can inspect the RNA/ADT counts of droplets and cells (plotted on log scale).

In [None]:
start.time <- Sys.time()

options(repr.plot.width=8, repr.plot.height=20)
#caution: takes very long due to the size of the data set
ggplot(seurObj_unfilt_CITE2@meta.data, aes(x=log10(nCount_ADT), y=log10(nCount_RNA), color=percent.mt))+
geom_point_rast()+ #use ggrastr function to rasterise the points
facet_grid(sample~cell)+
scale_color_viridis()+
theme_bw()

Sys.time() - start.time

We set min/max thresholds on the log10 counts (indicated by lines) to select the droplets used for downstream analyses. Visualising the cell/droplet density shows where most droplets fall in the RNA/ADT count space.

In [None]:
ADT_max <- 3.5
ADT_min <- 1.2
RNA_max <- 2.4
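
Since these thresholds are on a log10 scale, it can help to convert them back to raw counts when judging whether they are sensible. A quick check using the values set above:

```r
# thresholds as set in this notebook (log10 scale)
thresholds_log10 <- c(ADT_min = 1.2, ADT_max = 3.5, RNA_max = 2.4)

# back-transform to raw counts
thresholds_raw <- 10^thresholds_log10
round(thresholds_raw)
```

The selected background droplets therefore have roughly 16 to 3162 ADT counts and fewer than about 251 RNA counts.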

In [None]:
options(repr.plot.width=8, repr.plot.height=20)

ggplot(seurObj_unfilt_CITE2@meta.data, aes(x=log10(nCount_ADT), y=log10(nCount_RNA)))+
geom_hex(bins=100)+ #density representation
geom_hline(yintercept = RNA_max)+
geom_vline(xintercept = ADT_max)+
geom_vline(xintercept = ADT_min)+
facet_grid(sample~cell)+
scale_fill_viridis(limits=c(0,2000))+
theme_bw()

Based on the thresholds we can add an identifier in the meta data and create a reduced seurat object. We also remove 'droplets' with high mitochondrial reads, in case these are indeed partially lysed apoptotic cells, since these will not be a good representation of background antibody levels.

In [None]:
droplets <- subset(seurObj_unfilt_CITE2@meta.data,
                   log10(nCount_RNA) < RNA_max &
                   log10(nCount_ADT) < ADT_max &
                   log10(nCount_ADT) > ADT_min &
                   percent.mt < 5 &
                   cell==FALSE) %>% rownames()
seurObj_unfilt_CITE2$droplet <- rownames(seurObj_unfilt_CITE2@meta.data) %in% droplets

In [None]:
options(repr.plot.width=6, repr.plot.height=4)

ggplot(seurObj_unfilt_CITE2@meta.data, aes(x=log10(nCount_ADT), y=log10(nCount_RNA), colour=droplet))+
geom_point(alpha=0.5, size=0.1)+ #ggplot2 uses 'size' for point size ('pt.size' is a Seurat argument)
geom_hline(yintercept = RNA_max)+
geom_vline(xintercept = ADT_max)+
geom_vline(xintercept = ADT_min)+
facet_grid(~cell)+
theme_bw()

In [None]:
table(seurObj_unfilt_CITE2$droplet)

We can now remove everything that we didn't classify as cell or as droplet. This reduces the size of our data set substantially.

In [None]:
seurObj_unfilt_CITE3 <- subset(seurObj_unfilt_CITE2, subset = droplet==TRUE | cell==TRUE)

In [None]:
table(seurObj_unfilt_CITE3$cell)

In [None]:
#save QCed object
SaveH5Seurat(seurObj_unfilt_CITE3, "./HTSA_CITE_preDSB2.h5seurat", overwrite=TRUE)

In [None]:
#seurObj_unfilt_CITE3 <- LoadH5Seurat("./HTSA_CITE_preDSB2.h5seurat")

## Normalisation and denoising with dsb

In [None]:
isotype <- rownames(seurObj_unfilt_CITE3@assays$ADT@counts)[131:137]
isotype

In [None]:
ADT_counts <- as.matrix(seurObj_unfilt_CITE3@assays$ADT@counts) #convert to a dense matrix once and reuse it
ADT_stats <- data.frame(AB=rownames(ADT_counts),
                      max= rowMaxs(ADT_counts),
                      min= rowMins(ADT_counts),
                      mean= rowMeans(ADT_counts),
                      median= rowMedians(ADT_counts),
                      isotype=rownames(ADT_counts) %in% isotype)

Isotype controls have low detection levels, but so do many proteins.

In [None]:
ADT_stats[order(ADT_stats$max),][1:20,]

In [None]:
options(repr.plot.width=12, repr.plot.height=10)

ggplot(ADT_stats, aes(x=log10(mean), y=log10(max), color=isotype))+
geom_point()+
geom_text_repel(aes(label=AB))+
theme_bw()+
theme(legend.position = 'none')

Comparison of antibody staining levels indicates that most markers are detected at higher levels than the isotype controls. Importantly, many markers will not be expressed in our data set or only in a small subset of cells, which will affect the mean expression.
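
The effect of a small positive subset on the summary statistics can be seen with invented counts for a single marker that is expressed in only 10% of cells:

```r
# Invented counts for one marker: 90 negative cells, 10 strongly positive cells
marker <- c(rep(0, 90), rep(500, 10))

c(mean = mean(marker), median = median(marker), max = max(marker))
```

Here the median is 0 and the mean is 50 although the positive cells have 500 counts each, which is why the maximum is informative alongside the mean and median.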

In [None]:
options(repr.plot.width=12, repr.plot.height=10)

ggplot(ADT_stats, aes(x=log10(mean), y=(median), color=isotype))+
geom_point()+
geom_text_repel(aes(label=AB))+
theme_bw()+
theme(legend.position = 'none')

dsb takes raw ADT count matrices for cells and droplets. In addition, the isotype controls need to be specified. Further parameters can be adjusted, e.g. the pseudocount, the scale factor and the thresholds for quantile clipping to remove outliers. These options were tested and found to be neither suitable nor required for our data set. For information on default settings and parameter options, refer to the [reference manual](https://cran.r-project.org/web/packages/dsb/dsb.pdf).

In [None]:
start.time <- Sys.time()

norm <- DSBNormalizeProtein(cell_protein_matrix = subset(seurObj_unfilt_CITE3, subset=cell==TRUE)@assays$ADT@counts,
                            empty_drop_matrix = subset(seurObj_unfilt_CITE3, subset=droplet==TRUE)@assays$ADT@counts,
                            denoise.counts = TRUE,
                            use.isotype.control = TRUE,
                            isotype.control.name.vec = isotype,
                            return.stats=TRUE,
                            quantile.clipping = FALSE) #if TRUE, the default clipping quantiles are 0.001 and 0.9995
end.time <- Sys.time()
end.time - start.time

dsb returns a normalised, denoised matrix of protein expression for all cells (not droplets). In this matrix, expression levels are corrected for background staining and cell-to-cell variation is reduced. We can save this and the non-normalised ADT data to the filtered Seurat object.

After normalisation, the matrix may contain some cells with very negative values, i.e. very low expression. These normally represent outliers but can hinder visualisation due to automatic axis limits and scaling (see [package details](https://www.rdocumentation.org/packages/dsb/versions/0.2.0)). We therefore set all negative values to 0.
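
Flooring at 0 can also be written in vectorised form with `pmax()`, which keeps the dimensions and dimnames of the input matrix. The sketch below uses an invented matrix and compares the result with an element-wise `ifelse()` applied per column.

```r
# Invented normalised values: rows = proteins, columns = cells
norm_toy <- matrix(c(-2.5, 0.3, 4.1, -0.1), nrow = 2,
                   dimnames = list(c("CD3", "CD4"), c("cell1", "cell2")))

# replace every negative value with 0
floored <- pmax(norm_toy, 0)
floored

# same result as the column-wise ifelse() approach
floored_apply <- apply(norm_toy, 2, function(x) ifelse(x < 0, 0, x))
all(floored == floored_apply)
```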

In [None]:
norm2 <- apply(norm$dsb_normalized_matrix, 2, function(x){ifelse(test = x < 0, yes = 0, no = x)}) 

In [None]:
seurObj_filt_CITE[['ADT']] <- CreateAssayObject(counts=subset(seurObj_unfilt_CITE3, subset=cell==TRUE)@assays$ADT@counts)

In [None]:
seurObj_filt_CITE[['ADTdsb']] <- CreateAssayObject(data=Matrix(norm2, sparse = TRUE)) #matrix is currently dense and is not automatically converted by seurat

We also add the denoised data without removing negatives, which is helpful for FlowJo visualisation.

In [None]:
seurObj_filt_CITE[['ADTdsbneg']] <- CreateAssayObject(data=Matrix(norm$dsb_normalized_matrix, sparse = TRUE)) #matrix is currently dense and is not automatically converted by seurat

dsb provides some protein stats, e.g. the amount of background detected, the mean levels before and after correction, and the SD for all these measures.

In [None]:
stats.df <- cbind(AB=rownames(norm$protein_stats$`raw cell matrix stats`),
                  norm$protein_stats$`raw cell matrix stats`,
                  norm$protein_stats$`dsb normalized matrix stats`,
                  background_mean=norm$protein_stats$background_mean,
                  background_sd=norm$protein_stats$background_sd)

In [None]:
stats.df %>% head

In [None]:
options(repr.plot.width=12, repr.plot.height=10)

ggplot(stats.df, aes(x=cell_mean, y=dsb_mean, color=background_mean))+
geom_point()+
geom_text_repel(aes(label=AB))+
scale_color_viridis()+
#lims(x=c(0,15), y=c(0,15))+
theme_bw()

In [None]:
options(repr.plot.width=12, repr.plot.height=10)

ggplot(stats.df, aes(x=background_sd, y=dsb_sd, color=background_sd))+ #cell sd is the same before/after dsb
geom_point()+
geom_text_repel(aes(label=AB))+
scale_color_viridis()+
#lims(x=c(0,15), y=c(0,15))+
theme_bw()

Per-cell stats on the isotype control staining are also reported. We can save these in the meta data for later inspection, as high isotype values could indicate sticky cells.
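
One possible way to inspect such values later is to average the isotype controls per cell and look at the upper tail. The sketch below uses simulated values with one artificially 'sticky' cell; the isotype names, the simulated data and the 99th percentile cutoff are all assumptions for illustration, not a recommended threshold.

```r
# Simulated per-cell isotype values: rows = isotype controls, columns = cells
set.seed(42)
iso <- matrix(rnorm(3 * 100), nrow = 3,
              dimnames = list(paste0("isotype", 1:3), paste0("cell", 1:100)))
iso[, "cell7"] <- iso[, "cell7"] + 10  # make one cell artificially 'sticky'

# mean isotype signal per cell
iso_mean <- colMeans(iso)

# flag cells above an arbitrary illustrative cutoff
cutoff <- quantile(iso_mean, 0.99)
flagged <- names(iso_mean)[iso_mean > cutoff]
flagged
```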

In [None]:
seurObj_filt_CITE@meta.data <- cbind(seurObj_filt_CITE@meta.data, norm$technical_stats)

In [None]:
seurObj_filt_CITE@meta.data %>% head

In [None]:
stats2.df <- norm$technical_stats %>% data.frame %>% pivot_longer(cols = 1:7, names_to = 'Isotype', values_to = 'Value')

In [None]:
options(repr.plot.width=8, repr.plot.height=5)

ggplot(stats2.df, aes(x=Isotype, y=log(Value)))+
geom_boxplot()+theme_bw()

In [None]:
#save denoised object
SaveH5Seurat(seurObj_filt_CITE, "./HTSA_CITE_DSBdenoised.h5seurat", overwrite=TRUE)