# CITEseq data analysis

*Author: Lena Boehme, Taghon lab, 2023*

## Integrated RNA/protein analysis

We use weighted nearest neighbour analysis to generate a UMAP that takes into account both RNA and ADT.

### Setup

In [None]:
setwd("/home/lenab/Documents/scSeq_analyses/B_TotalThymus_CITEseq/2022_TotalThymus_CITEseq_HTA/objects")

In [None]:
#default plotting settings

options(repr.plot.width=12, repr.plot.height=6)

options(scipen=100) #avoid scientific notation of numbers

In [None]:
library(SeuratDisk)
library(Seurat)
library(matrixStats)
library(ggplot2)
library(pheatmap)
library(reshape2)
library(dplyr)
library(tidyr)
library(viridis)
library(RColorBrewer)
library(stringr)
library(batchelor)
library(BiocParallel)
library(BiocNeighbors)

In [None]:
sessionInfo()

In [None]:
pal12 <- colorRampPalette(brewer.pal(12, "Paired"))(12)
pal24 <- colorRampPalette(brewer.pal(12, "Paired"))(24)
pal36 <- colorRampPalette(brewer.pal(12, "Paired"))(36)

### Importing data

We have denoised protein data in a Seurat object, which we can load in.

In [None]:
seurObj_dsb <- LoadH5Seurat("./HTA2_v10_dsb_denoised.h5seurat")

In [None]:
seurObj_dsb@assays

In the meantime, scRNA-seq analyses have flagged around 3000 problematic cells in the data set. We therefore remove these from the CITE-seq data for consistency and quality assurance.

In [None]:
#subset to only CITE-seq cells
seurObj_CITE <- subset(seurObj, subset = cite_w_protein == 1)

In [None]:
rm(seurObj)

In [None]:
table(colnames(seurObj_dsb) %in% colnames(seurObj_CITE))

Instead of subsetting the denoised object, we selectively transfer ADT and denoised ADT data for just the cells that are already in the new CITEseq-only object. This way the total scRNA-seq UMAP, annotations etc. are preserved in the object.

In [None]:
seurObj_CITE[['ADT']] <- CreateAssayObject(counts=seurObj_dsb@assays$ADT@counts[,colnames(seurObj_CITE)])
seurObj_CITE[['ADTdsb']] <- CreateAssayObject(data=seurObj_dsb@assays$ADTdsb@data[,colnames(seurObj_CITE)])
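
The transfer above relies on the fact that indexing a matrix by column names both filters the columns and reorders them to match the target object. A minimal base-R sketch of that behaviour, using a toy matrix whose protein and cell names are made up for illustration (they are not taken from the data set):

```r
# Toy ADT count matrix: 2 proteins x 4 cells (hypothetical names and values)
adt <- matrix(1:8, nrow = 2,
              dimnames = list(c("CD3", "CD4"),
                              c("cellA", "cellB", "cellC", "cellD")))

# Cells present in the target object, in the target object's order
keep <- c("cellC", "cellA")

# Indexing by name fails if a cell is missing, so check first
stopifnot(all(keep %in% colnames(adt)))

# Columns are returned in the order given by `keep`
adt_sub <- adt[, keep, drop = FALSE]
adt_sub
```

Because the columns come back in the order of `keep`, the new assay lines up with the cells of the receiving object without any extra matching step.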

In [None]:
seurObj_CITE <- LoadH5Seurat('./HTA2_v16_CITEonly.h5seurat')

### Integrated analysis 

#### DimRed RNA

Data was previously integrated with scVI, so no PCA for the RNA is available. We first normalise and scale the data, then run a PCA.

In [None]:
seurObj_CITE <- seurObj_CITE  %>%
            NormalizeData(assay = 'RNA') %>%
            FindVariableFeatures(assay = 'RNA') %>%
            ScaleData(assay = 'RNA') %>%
            RunPCA(assay = 'RNA', npcs = 50, reduction.name = 'pca_rna_CITE')

In [None]:
ElbowPlot(seurObj_CITE, reduction = 'pca_rna_CITE', ndims = 50)+labs(title = 'Elbowplot for RNA PCA')

#determining PC cut-off: the PC after the last point where the difference in % SD between two subsequent PCs is >0.1
var_pc <- seurObj_CITE@reductions$pca_rna_CITE@stdev/sum(seurObj_CITE@reductions$pca_rna_CITE@stdev)*100
diffvar_pc <- var_pc[1:(length(var_pc)-1)] - var_pc[2:length(var_pc)] #brackets needed: 1:n-1 is parsed as (1:n)-1
sort(which(diffvar_pc >0.1), decreasing=TRUE)[1]+1
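
The cut-off rule used above can be sketched on a toy vector of standard deviations. This is only an illustration: `pc_cutoff` is a hypothetical helper name and the values are made up, but the arithmetic follows the same steps as the cell above.

```r
# Hypothetical helper: returns the PC after the last drop in % SD that exceeds `threshold`
pc_cutoff <- function(stdev, threshold = 0.1) {
  pct <- stdev / sum(stdev) * 100   # each PC's share of the summed SD, in percent
  drops <- -diff(pct)               # drops[i] = pct[i] - pct[i + 1]
  max(which(drops > threshold)) + 1
}

# Toy standard deviations for 8 PCs (made-up values)
toy_stdev <- c(12, 8, 5, 3, 2, 1.98, 1.97, 1.96)
cutoff <- pc_cutoff(toy_stdev)
cutoff
```

In this toy vector the curve flattens after the fourth drop, so the helper returns 5.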

The data set contains data from several donors and experimental batches, so batch correction is required for the RNA. We use [reducedMNN](https://rdrr.io/github/LTLA/batchelor/man/reducedMNN.html) from the batchelor package for this purpose, the fastMNN variant that operates on precomputed low-dimensional coordinates, here the previously generated PCA. We specify individual libraries as batches, but don't specify the merging order.

In [None]:
ptm <- proc.time()

MNN_rna_CITE <- reducedMNN(seurObj_CITE@reductions$pca_rna_CITE@cell.embeddings,
                 batch=seurObj_CITE$sample, #specify batches
                 #merge.order= unique(seurObj$batch), #batch order can be specified
                 BPPARAM=MulticoreParam(workers=12), #parallelisation
                 BNPARAM=HnswParam())

proc.time() - ptm
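
To make the idea behind the correction more concrete, the following base-R sketch identifies mutual nearest neighbour pairs between two toy batches with a single neighbour per cell. It is a simplified illustration with made-up coordinates, not the batchelor implementation, which uses more neighbours, approximate search and a subsequent correction step.

```r
# Two toy batches in a 2-D reduced space (made-up coordinates)
batch1 <- matrix(c(0, 0,
                   1, 0,
                   0, 1), ncol = 2, byrow = TRUE)
batch2 <- matrix(c(0.1, 0.1,
                   5,   5), ncol = 2, byrow = TRUE)

n1 <- nrow(batch1)
n2 <- nrow(batch2)

# Cross-batch Euclidean distances: rows = batch1 cells, columns = batch2 cells
d_all <- as.matrix(dist(rbind(batch1, batch2)))
d <- d_all[seq_len(n1), n1 + seq_len(n2), drop = FALSE]

nn_1to2 <- unname(apply(d, 1, which.min))  # nearest batch2 cell for each batch1 cell
nn_2to1 <- unname(apply(d, 2, which.min))  # nearest batch1 cell for each batch2 cell

# Cell i (batch1) and cell j (batch2) form a mutual pair if each is the other's nearest neighbour
is_mutual <- nn_2to1[nn_1to2] == seq_len(n1)
mnn_pairs <- cbind(batch1 = which(is_mutual), batch2 = nn_1to2[is_mutual])
mnn_pairs
```

Only the first cell of each batch forms a mutual pair here. The distant second cell of `batch2` has a nearest neighbour in `batch1`, but that relationship is not reciprocated, so it does not contribute to the correction.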

In [None]:
#save corrected PCA in DimRed slot
seurObj_CITE[["mnn_rna_CITE"]] <- CreateDimReducObject(embeddings=MNN_rna_CITE$corrected,
                                        assay="RNA",
                                        key="mnnrnacite_")

In [None]:
seurObj_CITE <- RunUMAP(seurObj_CITE, reduction = 'mnn_rna_CITE', assay='RNA', reduction.name = 'umap_rna_mnn_CITE', dims = 1:15)

In [None]:
options(repr.plot.width=7, repr.plot.height=6)

DimPlot(seurObj_CITE, reduction = 'umap_rna_mnn_CITE', group.by = 'sample', cols = pal12)+labs(title='batch-corrected RNA UMAP (by sample)')

DimPlot(seurObj_CITE, reduction = 'umap_rna_mnn_CITE', group.by = 'donor', cols = pal12)+labs(title='batch-corrected RNA UMAP (by donor)')

Note that even- and odd-numbered samples correspond to different cell subsets and therefore separate in the UMAP.

#### DimRed ADT

We also carry out scaling and PCA for the ADT data. Note that normalisation was already carried out with dsb and should not be performed again. We use all markers as variable features (HVGs), excluding the isotype controls. Batch correction was tested and found to be neither necessary nor suitable for the ADT data, as dsb should already have removed technical variation between cells and samples.

In [None]:
#use all ADT markers as variable features; the skipped rows 131-137 are the isotype controls
VariableFeatures(seurObj_CITE, assay = 'ADTdsb') <- rownames(seurObj_CITE@assays$ADTdsb@data)[c(1:130,138:150)]
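
Hard-coded row indices break silently if the antibody panel or its order changes. As a possible alternative, the controls could be excluded by name. The sketch below assumes that isotype controls carry the string "isotype" in their feature name; this naming convention and the toy feature names are assumptions and need to be checked against the actual panel.

```r
# Toy ADT feature names (hypothetical; the real panel may be named differently)
adt_features <- c("CD3", "CD4", "IgG1-isotype-Ctrl", "CD8", "IgG2a-isotype-Ctrl")

# Flag controls by name pattern instead of by position
is_isotype <- grepl("isotype", adt_features, ignore.case = TRUE)

# Keep every marker that is not an isotype control
adt_hvg <- adt_features[!is_isotype]
adt_hvg
```

The resulting character vector could then be assigned to `VariableFeatures()` in place of the index-based selection.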

In [None]:
seurObj_CITE <- seurObj_CITE  %>%
            ScaleData(assay = 'ADTdsb') %>%
            RunPCA(assay = 'ADTdsb', npcs = 50, reduction.name = 'pca_adt_CITE')

In [None]:
ElbowPlot(seurObj_CITE, reduction = 'pca_adt_CITE', ndims = 50)+labs(title = 'Elbowplot for ADT PCA')

var_pc <- seurObj_CITE@reductions$pca_adt_CITE@stdev/sum(seurObj_CITE@reductions$pca_adt_CITE@stdev)*100
diffvar_pc <- var_pc[1:(length(var_pc)-1)] - var_pc[2:length(var_pc)] #brackets needed: 1:n-1 is parsed as (1:n)-1
#determine last point where difference is >0.1
sort(which(diffvar_pc >0.1), decreasing=TRUE)[1]+1

#### DimRed WNN

We use the [Seurat approach](https://satijalab.org/seurat/reference/findmultimodalneighbors) to find neighbours across the modalities and then generate a UMAP based on the weighted nearest neighbour graph.

In [None]:
seurObj_CITE <- FindMultiModalNeighbors(seurObj_CITE,
                                  reduction.list=list('mnn_rna_CITE', 'pca_adt_CITE'),
                                   dims.list=list(1:15,1:14)) #use PC cut-offs determined previously
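
The principle of weighting the two modalities per cell can be illustrated with a toy example in base R. This is a deliberately simplified sketch: the affinities and weights are made up, whereas Seurat learns the weights from the data and works on k-nearest-neighbour graphs rather than dense matrices.

```r
# Toy affinities between 3 cells in each modality (made-up values)
sim_rna <- matrix(c(1.0, 0.9, 0.1,
                    0.9, 1.0, 0.2,
                    0.1, 0.2, 1.0), nrow = 3, byrow = TRUE)
sim_adt <- matrix(c(1.0, 0.2, 0.8,
                    0.2, 1.0, 0.3,
                    0.8, 0.3, 1.0), nrow = 3, byrow = TRUE)

# Per-cell modality weights (made up); the two weights of each cell sum to 1
w_rna <- c(0.8, 0.5, 0.2)
w_adt <- 1 - w_rna

# Row i of the combined matrix is weighted with the weights of cell i
sim_wnn <- w_rna * sim_rna + w_adt * sim_adt

# Exclude self-matches, then pick the most similar other cell for each cell
diag(sim_wnn) <- -Inf
nearest <- apply(sim_wnn, 1, which.max)
nearest
```

Cell 3 illustrates the effect of the weighting: on RNA alone it is closest to cell 2, but because its ADT weight is high, its weighted nearest neighbour is cell 1.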

In [None]:
seurObj_CITE <- RunUMAP(seurObj_CITE, nn.name = "weighted.nn", reduction.name = "umap_wnn",
                        reduction.key = "wnnUMAP_")

In [None]:
options(repr.plot.width=7, repr.plot.height=6)

DimPlot(seurObj_CITE, reduction = 'umap_wnn', group.by = 'sample', cols = pal12)+labs(title='WNN UMAP (by sample)')

DimPlot(seurObj_CITE, reduction = 'umap_wnn', group.by = 'donor', cols = pal12)+labs(title='WNN UMAP (by donor)')

In [None]:
options(repr.plot.width=12, repr.plot.height=6)

FeaturePlot(seurObj_CITE, reduction = 'umap_wnn', features = c('RNA.weight', 'ADTdsb.weight'), cols = viridis(100), order = TRUE)

We additionally carry out a [supervised PCA (sPCA)](https://www.sciencedirect.com/science/article/pii/S0092867421005833?via%3Dihub) on the WNN graph. This yields an RNA-based PCA that incorporates the maximum variance described by the WNN graph and therefore allows weighted RNA and protein quantification.

In [None]:
seurObj_CITE <- RunSPCA(seurObj_CITE, assay='RNA', graph='wsnn')

In [None]:
SaveH5Seurat(seurObj_CITE, './HTA2_v16_raw.h5seurat', overwrite = TRUE)