# CITEseq data analysis

*Author: Lena Boehme, Taghon lab, 2023*

## Denoising and normalisation of ADT data

ADT data is often very noisy and antibody detection can be variable, because the properties and concentration of each antibody determine its level of background staining. Protein measurements should therefore be corrected before interpretation and analysis.

For this we use the dsb package by [Mule et al.](https://pubmed.ncbi.nlm.nih.gov/35440536/) and follow the [suggested workflow](https://cran.rstudio.com/web/packages/dsb/vignettes/end_to_end_workflow.html#step1), which uses cell-free droplets to determine background levels of ambient antibody and carries out normalisation based on isotype controls. Explanations about the approach can be found [here](https://cran.r-project.org/web/packages/dsb/vignettes/understanding_dsb.html) and in the [general package documentation](https://www.rdocumentation.org/packages/dsb/versions/1.0.3).

### Setup

In [None]:
#directories
setwd("/home/lenab/Documents/scSeq_analyses/B_TotalThymus_CITEseq/2022_TotalThymus_CITEseq_HTA/objects")

datadir_raw <- '/home/lenab/Documents/scSeq_files/TotalThymus_CITEseq_HTA2/h5_raw'
datadir_filtered <- '/home/lenab/Documents/scSeq_files/TotalThymus_CITEseq_HTA2/h5'

In [None]:
#default plotting settings

options(repr.plot.width=12, repr.plot.height=6)

options(scipen=100) #avoid scientific notation of numbers

In [None]:
#load packages

library(SeuratDisk)
library(Seurat)
library(matrixStats)
library(ggplot2)
library(pheatmap)
library(reshape2)
library(dplyr)
library(tidyr)
library(viridis)
library(stringr)
library(dsb)
library(RColorBrewer)
library(ggrepel)
library(Matrix)

In [None]:
sessionInfo()

In [None]:
#make large palettes for plotting
pal24 <- colorRampPalette(brewer.pal(12, "Paired"))(24)

### Data import

To use dsb we need to analyse the unfiltered cellranger output, which still contains empty droplets. These will be used to estimate the background quantities of (unbound) antibody, which can then serve as a correction factor for background staining of cells.

In [None]:
#Fetch all the file names for the unfiltered h5 files:

h5_raw <- list.files(path=datadir_raw,
             pattern="\\.h5$",  #regular expression: return only files ending in .h5
             full.names=TRUE) #get full path instead of just filename

In [None]:
h5_raw
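
The `pattern` argument of `list.files` is interpreted as a regular expression rather than a glob, so an unescaped `.` matches any character. A minimal sketch with made-up file names (not the actual contents of `datadir_raw`) shows the difference between a loose and an anchored pattern:

```r
# Hypothetical file names, for illustration only
files <- c("CITE1_raw_feature_bc_matrix.h5",
           "CITE1_raw_feature_bc_matrix.h5.bak",
           "notes_oh5_run.txt")

# Unescaped dot: any character followed by "h5", anywhere in the name
loose <- grepl(".h5", files)

# Escaped dot plus end-of-string anchor: only names that end in ".h5"
strict <- grepl("\\.h5$", files)

data.frame(file = files, loose = loose, strict = strict)
```

Only the first file passes the anchored pattern, whereas all three pass the loose one.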

The Read10X_h5 command returns a list that contains the RNA and ADT matrices as elements 1 and 2, respectively. We separate the two modalities and construct two independent lists, which contain either the RNA or the ADT counts for each sample. In addition, we extract the sample ID from the filename so that we can later match the meta data. This is important because the file list is sorted lexicographically, i.e. 10 is sorted before 2.

In [None]:
counts_RNA <- list()
counts_ADT <- list()
samples <- c()

for(i in seq_along(h5_raw)){
    counts <- Read10X_h5(h5_raw[[i]])
    counts_RNA[[i]] <- counts[[1]] #first list element: RNA matrix
    counts_ADT[[i]] <- counts[[2]] #second list element: ADT matrix
    name <- str_split(basename(h5_raw[i]),'_')[[1]][1] #sample ID = filename up to the first underscore
    samples[i] <- name
    names(samples)[i] <- sub('CITE', 'GEX', name) #GEX name is used to match the meta data
    names(counts_RNA)[i] <- names(samples)[i]
    names(counts_ADT)[i] <- names(samples)[i]
}

In [None]:
samples
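
The effect of lexicographic sorting can be reproduced with a few made-up sample IDs. The sketch below is only an illustration (the IDs are hypothetical); it contrasts the lexicographic order with an order based on the numeric part of the ID, which is why elements are best addressed by name rather than by position.

```r
# Hypothetical sample IDs, for illustration only
ids <- c("GEX2", "GEX10", "GEX1", "GEX3")

# Lexicographic order (method = "radix" sorts in the C locale, so it is reproducible)
lexico <- sort(ids, method = "radix")

# Natural order: strip the non-digit prefix and sort by the remaining number
numeric_part <- as.integer(sub("^[^0-9]+", "", ids))
natural <- ids[order(numeric_part)]

lexico   # GEX10 comes directly after GEX1
natural  # GEX10 comes last
```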

Note that there is no sample7, since this was lost to a wetting failure during the experiment.

Before merging the cells from all samples, the barcodes need to be modified with a sample-specific identifier so that their origin can be distinguished. We add a prefix corresponding to the sample and remove the '-1' suffix to match the naming convention in the full scRNA-seq data set. At the same time we save the sample origin for each cell into a list, which we can later use as meta data.

In [None]:
samples_list <- list()

for (i in seq_along(counts_RNA)){
    prefix <- paste0(names(samples)[i], '-')  #retrieve new prefix
    colnames(counts_RNA[[i]]) <- paste0(prefix, colnames(counts_RNA[[i]])) # add prefix to cell barcodes
    colnames(counts_RNA[[i]]) <- sub('-1$','', colnames(counts_RNA[[i]])) # remove only the trailing '-1' suffix
    colnames(counts_ADT[[i]]) <- paste0(prefix, colnames(counts_ADT[[i]]))
    colnames(counts_ADT[[i]]) <- sub('-1$','', colnames(counts_ADT[[i]]))
    samples_list[[i]] <- rep(names(samples)[i], ncol(counts_RNA[[i]])) # one sample label per barcode
}
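
Removing the suffix with a pattern anchored to the end of the string guards against sample names that themselves contain '-1'. The sample name below is hypothetical and chosen to provoke the problem; the sample names in this data set (e.g. GEX10) are not affected.

```r
# Toy barcodes in the cellranger style; the sample name is made up
barcodes <- c("AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1")
sample_id <- "GEX-10"  # contains "-1" on purpose

prefixed <- paste0(sample_id, "-", barcodes)

# Unanchored pattern: also removes the "-1" inside the sample name
unanchored <- gsub("-1", "", prefixed)

# Anchored pattern: removes only the trailing "-1"
anchored <- sub("-1$", "", prefixed)

unanchored
anchored
```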

Note the lexicographic order: the second library is GEX10, not GEX2.

In [None]:
colnames(counts_RNA[[2]]) %>% head()

Merge all samples per modality into one matrix instead of a list.

In [None]:
counts_RNA_merged <- Reduce(cbind, counts_RNA)

In [None]:
counts_ADT_merged <- Reduce(cbind, counts_ADT)

Build seurat object from the RNA matrix.

In [None]:
seurObj <- CreateSeuratObject(counts_RNA_merged)

In [None]:
rownames(counts_ADT_merged)

Some of the antibodies have been extended with a '.1' suffix for unknown reasons. We remove that and then add the ADT assay.

In [None]:
rownames(counts_ADT_merged) <- str_replace(rownames(counts_ADT_merged), '\\.1$', '') #strip '.1' only at the end of the name
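
One plausible source of such a suffix (an assumption, not verified for this data set) is that feature names are made unique on import, which would affect antibodies that share their name with a gene. Base R's `make.unique` shows the effect, and the second half of the sketch shows why the clean-up pattern is best anchored to the end of the name. All feature names below are hypothetical.

```r
# Hypothetical feature list in which the antibody "CD4" repeats the gene name "CD4"
features <- c("CD4", "CD8A", "CD4")
unique_names <- make.unique(features)  # the duplicate receives a ".1" suffix

# Hypothetical ADT names; "HLA.1B" contains ".1" in the middle of the name
adt_names <- c("CD4.1", "CD45RA", "HLA.1B")

unanchored <- gsub("\\.1", "", adt_names)  # also alters "HLA.1B"
anchored <- sub("\\.1$", "", adt_names)    # only strips a trailing ".1"

unique_names
unanchored
anchored
```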

In [None]:
seurObj[["ADT"]] <- CreateAssayObject(counts = counts_ADT_merged)

Add sample origin for each cell to the meta data.

In [None]:
seurObj$sample <- unlist(samples_list)

In [None]:
seurObj@meta.data %>% tail()

This data set contains cells and droplets, for both of which RNA and ADT are measured. Normally, to distinguish cells from droplets we can use the filtered files from cellranger and extract the barcodes. These will correspond to real cells (according to cellranger), whereas the rest is considered background.
In this instance, the filtered data has already been integrated with other RNA libraries and undergone QC. We therefore use the barcodes of the high-quality cells from that object.

In [None]:
#read in QC-ed object
seurObj_clean <- LoadH5Seurat('./HTA2_v10_CITEonly.h5seurat')

Around 136k 'events' in the unfiltered data match the barcodes in the QC-ed data and can thus be labelled as cells.

In [None]:
table(colnames(seurObj) %in% colnames(seurObj_clean))

In [None]:
seurObj$cell <- colnames(seurObj) %in% colnames(seurObj_clean)

The 'cells' defined by this approach have already been filtered for doublets and low-quality cells. Barcodes removed during that filtering are now classified as 'non-cells' along with the empty droplets. We therefore need to carry out QC on the droplets to ensure that they are not contaminated with cells.

### QC of droplets

For both droplets and cells we need to ensure certain quality standards before we can move on to background correction. First, we remove all droplets that have only captured RNA or ADT. For all cells this step has already happened during the cellranger filtering/QC.

In [None]:
seurObj2 <- subset(seurObj, subset = nCount_RNA > 0 & nCount_ADT > 0 )

We can also determine the percentage of mitochondrial reads, which serves as a measure of sub-par viability.

In [None]:
seurObj2[["percent.mt"]] <- PercentageFeatureSet(seurObj2, pattern = "^MT-")
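
Conceptually, this value is the share of a barcode's counts that comes from features matching the pattern. The base-R sketch below reproduces that calculation on a made-up count matrix; it is an illustration of the idea and not Seurat's implementation.

```r
# Toy RNA count matrix (genes x barcodes); all values are made up
counts <- matrix(c(5, 0, 95,
                   20, 30, 50),
                 nrow = 3,
                 dimnames = list(c("MT-CO1", "MT-ND1", "ACTB"),
                                 c("bc1", "bc2")))

# Select mitochondrial genes by name, as with pattern = "^MT-"
mt_genes <- grepl("^MT-", rownames(counts))

# Percentage of counts per barcode that map to mitochondrial genes
percent_mt <- colSums(counts[mt_genes, , drop = FALSE]) / colSums(counts) * 100
percent_mt
```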

We can set min/max thresholds to select droplets that will be used for the downstream analyses. By visualising cell/droplet density, we can determine where most droplets fall on the gene/count spectrum.

In [None]:
ADT_max <- 3.5
ADT_min <- 1.2
RNA_max <- 2.4

In [None]:
options(repr.plot.width=8, repr.plot.height=23)

ggplot(seurObj2@meta.data, aes(x=log10(nCount_ADT), y=log10(nCount_RNA)))+
geom_hex(bins=100)+
geom_hline(yintercept = RNA_max)+
geom_vline(xintercept = ADT_max)+
geom_vline(xintercept = ADT_min)+
facet_grid(sample~cell)+
scale_fill_viridis(limits=c(0,2000))+
theme_bw()

Based on the thresholds we can add an identifier in the meta data and create a reduced seurat object. We also remove 'droplets' with high mitochondrial reads, in case these are in fact partially lysed apoptotic cells, since these would not be a good representation of background antibody levels.

In [None]:
droplets <- subset(seurObj2@meta.data,
                   log10(nCount_RNA) < RNA_max &
                   log10(nCount_ADT) < ADT_max &
                   log10(nCount_ADT) > ADT_min &
                   percent.mt < 5 &
                   cell==FALSE) %>% rownames()

In [None]:
seurObj2$droplet <- rownames(seurObj2@meta.data) %in% droplets
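
Because the thresholds are defined on the log10 scale, it can help to translate them back to raw counts and to check the gating logic on a few made-up barcodes. The sketch below uses the threshold values from above; the meta data table is a toy example.

```r
# Thresholds from the text (log10 scale), stored under separate names
thr <- c(ADT_min = 1.2, ADT_max = 3.5, RNA_max = 2.4)
raw_thr <- round(10^thr)  # equivalent raw counts

# Toy meta data for five barcodes; all values are made up
meta <- data.frame(
  nCount_RNA = c(30, 30, 5000, 40, 60),
  nCount_ADT = c(200, 5, 2000, 300, 150),
  percent.mt = c(1, 0, 3, 20, 2),
  cell       = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  row.names  = paste0("bc", 1:5)
)

# Same conditions as in the droplet selection above
is_droplet <- with(meta,
  log10(nCount_RNA) < thr[["RNA_max"]] &
  log10(nCount_ADT) < thr[["ADT_max"]] &
  log10(nCount_ADT) > thr[["ADT_min"]] &
  percent.mt < 5 &
  !cell)

raw_thr
rownames(meta)[is_droplet]
```

In this toy table bc2 fails the lower ADT threshold, bc3 is a cell and bc4 has too many mitochondrial reads, so only bc1 and bc5 are kept as droplets.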

In [None]:
#Caution: many data points, memory requirements may kill the kernel
options(repr.plot.width=8, repr.plot.height=4)

ggplot(seurObj2@meta.data, aes(x=log10(nCount_ADT), y=log10(nCount_RNA), colour=droplet))+
geom_point()+
geom_hline(yintercept = RNA_max)+
geom_vline(xintercept = ADT_max)+
geom_vline(xintercept = ADT_min)+
facet_grid(~cell)+
theme_bw()

In [None]:
table(seurObj2$droplet)

We can now remove everything that we didn't classify as cell or as droplet. This reduces the size of our data set substantially.

In [None]:
seurObj3 <- subset(seurObj2, subset = droplet==TRUE | cell==TRUE)

In [None]:
options(repr.plot.width=8, repr.plot.height=4)

ggplot(seurObj3@meta.data, aes(x=log10(nCount_ADT), y=log10(nCount_RNA), colour=droplet))+
geom_point()+
geom_hline(yintercept = RNA_max)+
geom_vline(xintercept = ADT_max)+
geom_vline(xintercept = ADT_min)+
facet_grid(~cell)+
theme_bw()

Usually at this step QC on cells would be performed e.g. removal of cells with high/low counts etc. In our case this has already been done.

In [None]:
rm(seurObj, seurObj2) #remove old objects to free up space; seurObj_clean is kept because the normalised data is added to it below

## Normalisation with Dsb

In [None]:
isotype <- rownames(seurObj3@assays$ADT@counts)[131:137] #isotype controls occupy rows 131-137 of the ADT matrix
isotype
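
Selecting the controls by position depends on the row order of the ADT matrix. If the control antibodies share a recognisable name component, they can instead be selected by name. The sketch below assumes that the names contain the word "isotype"; the names shown are hypothetical and the pattern would need to be adapted to the actual panel.

```r
# Hypothetical ADT feature names, for illustration only
adt_names <- c("CD3", "CD4", "Isotype-IgG1", "CD19", "isotype-IgG2a")

# Case-insensitive match on the assumed name component
isotype_by_name <- grep("isotype", adt_names, ignore.case = TRUE, value = TRUE)
isotype_by_name
```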

In [None]:
adt_dense <- as.matrix(seurObj3@assays$ADT@counts) #convert once; matrixStats requires a dense matrix

#stored as ADT_stats so that the ADT_max threshold defined above is not overwritten
ADT_stats <- data.frame(AB=rownames(adt_dense),
                      max= rowMaxs(adt_dense),
                      min= rowMins(adt_dense),
                      mean= rowMeans(adt_dense),
                      isotype=rownames(adt_dense) %in% isotype)

In [None]:
ADT_stats[order(ADT_stats$max),]

In [None]:
options(repr.plot.width=12, repr.plot.height=10)

ggplot(ADT_stats, aes(x=log10(mean), y=log10(max), color=isotype))+
geom_point()+
geom_text_repel(aes(label=AB))+
theme_bw()+
theme(legend.position = 'none')

Comparison of antibody staining levels indicates that most markers are detected at higher levels than the isotype controls. Note that CD14 is detected at very low levels only, suggesting that it won't serve as a good marker even for cells in which it is expressed. Importantly, many markers will not be expressed in our data set or only in a small subset of cells, which lowers their mean expression.

Dsb takes raw ADT count matrices for cells and droplets. In addition, the isotype controls need to be specified. Further parameters can be adjusted, e.g. the pseudocount to be used, the scale factor and the thresholds for quantile clipping to remove outliers. These options were tested and not found to be suitable/required for our dataset. For information on default settings and parameter options refer to the [reference manual](https://cran.r-project.org/web/packages/dsb/dsb.pdf).

In [None]:
matrix_ADT_cells <- subset(seurObj3, subset=cell==TRUE)@assays$ADT@counts
matrix_ADT_backgr <- subset(seurObj3, subset=droplet==TRUE)@assays$ADT@counts

In [None]:
ptm <- proc.time() #measure elapsed time as reference

dsb_norm <- DSBNormalizeProtein(cell_protein_matrix = matrix_ADT_cells,
                            empty_drop_matrix = matrix_ADT_backgr,
                            denoise.counts = TRUE,
                            use.isotype.control = TRUE,
                            isotype.control.name.vec = isotype,
                           #define.pseudocount = TRUE,
                           #pseudocount.use = 1,
                           #scale.factor = 'mean_subtract',
                           #quantile.clipping = TRUE,
                           #quantile.clip = c(0.01, 0.99),
                            return.stats=TRUE)

proc.time() - ptm
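
To make the role of the empty droplets more concrete, the sketch below illustrates the ambient-correction idea on toy data: counts are log-transformed and each protein is rescaled by the mean and standard deviation of its signal in the empty droplets. This is a simplified illustration and not the package implementation. The pseudocount of 10 is an assumption made for the example, and the second step (removing cell-to-cell technical variation with the help of the isotype controls) is not shown.

```r
set.seed(1)
proteins <- c("CD3", "CD19")

# Toy raw counts: empty droplets only carry ambient antibody
empty <- matrix(rpois(2 * 50, lambda = 5), nrow = 2,
                dimnames = list(proteins, paste0("drop", 1:50)))

# Toy raw counts for three cells (columns are filled first)
cells <- matrix(c(300, 4,
                  6, 250,
                  5, 5),
                nrow = 2,
                dimnames = list(proteins, paste0("cell", 1:3)))

pseudocount <- 10  # assumed value for this illustration
log_empty <- log(empty + pseudocount)
log_cells <- log(cells + pseudocount)

# Per-protein background mean and standard deviation from the empty droplets
mu <- rowMeans(log_empty)
sigma <- apply(log_empty, 1, sd)

# Vectors of length nrow are recycled down the rows, i.e. applied per protein
ambient_corrected <- (log_cells - mu) / sigma
round(ambient_corrected, 2)
```

The resulting values express each measurement in standard deviations above the background of that protein, so a cell whose signal matches the ambient level ends up close to 0.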

Dsb returns a normalised denoised matrix of protein expression for all cells (not droplets). If return.stats was set to TRUE, technical and protein stats are also reported and saved in a list.

In [None]:
dsb_norm %>% str

In the expression matrix protein levels are corrected for background staining and cell-to-cell variation is reduced. This can be saved in the seurat object. Importantly, raw and normalised data cannot be added in the same assay (they overwrite each other!) so we create a new assay for the dsb-processed data.
For markers that are expressed on very few cells, the normalised matrix may contain some cells with very negative values i.e. very low expression (see [package details](https://www.rdocumentation.org/packages/dsb/versions/0.2.0)). These normally represent outliers but can hinder visualisation due to automatic axis limits and scaling. We therefore set the minimum to 0 by changing all values below this to 0.

In [None]:
dsb_norm2 <- dsb_norm$dsb_normalized_matrix
dsb_norm2[dsb_norm2 < 0] <- 0 #set all negative values to 0; dimensions and names are unchanged
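
Clamping at zero can be checked on a small example. The sketch below uses a made-up matrix and compares a column-wise `apply`/`ifelse` call with the vectorised `pmax`, which gives the same values and keeps the row and column names of the input.

```r
# Toy normalised matrix (proteins x cells); values are made up
m <- matrix(c(-3.2, 0.5, 4.1, -0.1), nrow = 2,
            dimnames = list(c("CD3", "CD19"), c("cell1", "cell2")))

# Column-wise replacement of negative values
via_apply <- apply(m, 2, function(x) ifelse(x < 0, 0, x))

# Vectorised alternative: element-wise maximum of each value and 0
clipped <- pmax(m, 0)

clipped
```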

In [None]:
seurObj_clean[['ADTdsb']] <- CreateAssayObject(data=Matrix(dsb_norm2, sparse = TRUE)) #the matrix is dense and is not converted automatically by seurat

In [None]:
seurObj_clean[['ADT']] <- CreateAssayObject(counts=matrix_ADT_cells)
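
When an assay is built from a matrix that was derived from a different object, it is worth confirming that the barcodes agree in both content and order with the target object. The sketch below shows one way to do this in base R with toy barcodes; `align_to_cells` is a hypothetical helper name and not part of Seurat or dsb.

```r
# Hypothetical helper: reorder matrix columns to a reference barcode order
align_to_cells <- function(mat, cells) {
  missing <- setdiff(cells, colnames(mat))
  if (length(missing) > 0) {
    stop(length(missing), " cells are missing from the matrix")
  }
  mat[, cells, drop = FALSE]
}

# Toy matrix whose columns are in a different order than the reference
mat <- matrix(1:6, nrow = 2,
              dimnames = list(c("CD3", "CD19"), c("bc2", "bc3", "bc1")))
reference <- c("bc1", "bc2", "bc3")

aligned <- align_to_cells(mat, reference)
aligned
```

In this notebook the equivalent call would be along the lines of `align_to_cells(dsb_norm2, colnames(seurObj_clean))`, assuming that every cell in the QC-ed object is present in the normalised matrix.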

In [None]:
SaveH5Seurat(seurObj_clean, "./HTA2_v10_dsb_denoised.h5seurat", overwrite=TRUE)