# Part 3: Processing CD4 datasets

In this document, we will load the preprocessed CD4 datasets and perform quality control, integration and annotation.

In [None]:
source("diabetes_analysis_v07.R")

# CD4 Initial experiment

In [None]:
plan("multisession")

The initial data are already preprocessed. Initial preprocessing was performed by Juraj Michalik. Briefly, files from CellRanger were loaded into Seurat and split according to hashtags, and VDJ data from 10X immune profiling and Mixcr were added. The code for these analyses can be found at:



To recapitulate the analysis, download the files and put them in the folder `data/initdata`.

In [None]:
paths  <- list.files("../data/initdata/", full.names = T)

In [None]:
paths

Let's load the datasets from the initial experiment (exp 8, 10 and 11). Experiment 8 and experiment 11 W1 only contained CD8 cells, so they are not loaded here.

In [None]:
seu_list  <- future_map(paths[c(2,4)], readRDS)

In [None]:
seu_list[[1]]$hashtags  %>% table

Experiment 10 contained both CD8 and CD4 cells, so we will filter out CD8 cells with hashtag #5.
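The hashtag filter can be illustrated on a small metadata table. This is a minimal sketch in base R; the table `meta` and its values are invented and only stand in for the `hashtags` column of the Seurat metadata.

```r
# Toy metadata table standing in for the Seurat metadata (values are made up).
meta <- data.frame(
  barcode  = paste0("cell", 1:6),
  hashtags = c("H1", "H5", "H2", "H5", "H3", "H1"),
  stringsAsFactors = FALSE
)

# Count cells per hashtag before filtering.
table(meta$hashtags)

# Keep every cell except those carrying the CD8 hashtag H5.
meta_cd4 <- meta[meta$hashtags != "H5", ]
table(meta_cd4$hashtags)
```

Checking the table before and after the filter confirms that only the H5 cells were dropped.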

In [None]:
exp10_cd4  <- subset(seu_list[[1]], hashtags != "H5")

In [None]:
seu_list[[1]] <- exp10_cd4

In [None]:
cd4_prelim  <- scCustomize::Merge_Seurat_List(seu_list)

Let's perform basic preprocessing of the merged initial datasets.

In [None]:
options(future.globals.maxSize = 10000 * 1024^2)
plan("sequential")

In [None]:
DefaultAssay(cd4_prelim)  <- "RNA"
cd4_prelim <- NormalizeData(cd4_prelim, verbose = FALSE)
cd4_prelim <- ScaleData(cd4_prelim, verbose = FALSE)
cd4_prelim <- FindVariableFeatures(cd4_prelim, nfeatures = 1000, verbose = FALSE)
cd4_prelim <- RunPCA(cd4_prelim, npcs = 12, verbose = FALSE)
cd4_prelim <- RunUMAP(cd4_prelim, reduction = "pca", dims = 1:12)

In [None]:
cd4_prelim <- FindNeighbors(cd4_prelim, dims = 1:12)
cd4_prelim <- FindClusters(cd4_prelim, resolution = 1)

In [None]:
DimPlot(cd4_prelim, label = T)

Visualization of some of the basic genes.

In [None]:
options(repr.plot.width = 16, repr.plot.height = 8)
FeaturePlot(cd4_prelim, features = c("MKI67", "LCK", "CD3G", "TYROBP", "CD14", "CD3D", "CD8A"), ncol = 4)

In [None]:
FeaturePlot(cd4_prelim, features = c("SELL", "CCR7", "IL7R", "ITGA4", "CCL5", "IFNG"), ncol = 4)

We can see that clusters 14 and 16 contain contaminating cells (no LCK, no CD3), so we will remove them.
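The visual check on the feature plots can be complemented by a per-cluster summary of marker detection. The following is a minimal sketch in base R on invented data; the table `cells`, the threshold `min_frac` and the flagging rule are assumptions for illustration, not the criteria used in this analysis.

```r
# Toy per-cell table: cluster label plus normalized expression of two T-cell markers
# (all values are invented for illustration).
cells <- data.frame(
  cluster = factor(c(0, 0, 0, 1, 1, 1, 2, 2, 2)),
  LCK     = c(1.2, 0.8, 1.5, 0.9, 1.1, 0.0, 0.0, 0.0, 0.1),
  CD3D    = c(2.0, 1.7, 0.0, 1.4, 1.9, 1.2, 0.0, 0.0, 0.0)
)

# Fraction of cells per cluster with non-zero expression of each marker.
frac_expr <- aggregate(cbind(LCK, CD3D) ~ cluster, data = cells,
                       FUN = function(x) mean(x > 0))
frac_expr

# Hypothetical rule: flag clusters in which both markers are detected
# in fewer than half of the cells.
min_frac <- 0.5
suspect <- as.character(
  frac_expr$cluster[frac_expr$LCK < min_frac & frac_expr$CD3D < min_frac]
)
suspect
```

A summary like this does not replace inspection of the plots, but it makes the choice of clusters easier to document.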

In [None]:
cd4_prelim_filt  <- subset(cd4_prelim, seurat_clusters %in% c(0:13,15))
cd4_prelim_filt <- NormalizeData(cd4_prelim_filt, verbose = FALSE)
cd4_prelim_filt <- ScaleData(cd4_prelim_filt, verbose = FALSE)
cd4_prelim_filt <- FindVariableFeatures(cd4_prelim_filt, nfeatures = 1000, verbose = FALSE)
cd4_prelim_filt <- RunPCA(cd4_prelim_filt, npcs = 12, verbose = FALSE)
cd4_prelim_filt <- RunUMAP(cd4_prelim_filt, reduction = "pca", dims = 1:12)
cd4_prelim_filt <- FindNeighbors(cd4_prelim_filt, dims = 1:12)
cd4_prelim_filt <- FindClusters(cd4_prelim_filt, resolution = 0.5)

In [None]:
DimPlot(cd4_prelim_filt, label = T, label.size = 12)

For the purpose of visualization in Figure S2, where we show naive and nonNaive populations, we will annotate the naive and nonNaive populations.

In [None]:
cd4_prelim_filt@meta.data  <- cd4_prelim_filt@meta.data  %>% 
mutate(naive_or_eff  = if_else(seurat_clusters %in% c(0,1,2,4,6,8),"Naive","NonNaive"))

In [None]:
options(repr.plot.width = 7, repr.plot.height = 5)
DimPlot(cd4_prelim_filt, label = F, label.size = 12, group.by = "naive_or_eff", cols = c("dodgerblue1","indianred2"), 
        raster = TRUE, raster.dpi = c(900,900), pt.size = 5) + ggtheme()
ggsave("../figures/preliminary/cd4_dimplot.svg", width = 16, height = 12, units = "cm", create.dir = TRUE)

In [None]:
cd4_prelim_filt$prelim  <- "Prelim"

We will create a bar chart showing the percentage of naive vs. nonNaive cells.
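The percentages displayed by the filled bar chart can also be computed directly. This is a minimal sketch in base R; the cluster labels in `meta` are invented, while the set of naive clusters is the one used for the initial dataset above.

```r
# Toy metadata with invented cluster assignments.
meta <- data.frame(
  seurat_clusters = c(0, 1, 3, 2, 5, 4, 7, 6, 0, 9)
)

# Clusters annotated as naive in the initial dataset.
naive_clusters <- c(0, 1, 2, 4, 6, 8)
meta$naive_or_eff <- ifelse(meta$seurat_clusters %in% naive_clusters,
                            "Naive", "NonNaive")

# Percentage of cells in each category, i.e. the bar heights of
# geom_bar(position = "fill") multiplied by 100.
pct <- round(100 * prop.table(table(meta$naive_or_eff)), 1)
pct
```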

In [None]:
cd4_prelim_filt@meta.data  %>% 
ggplot(aes(x = prelim, fill = naive_or_eff)) +
  geom_bar(position = "fill") + 
scale_fill_manual(values = c("dodgerblue1","indianred2")) + 
theme_classic()+
ggtheme() + ggtitle("CD4")
ggsave("../figures/preliminary/cd4_barplot.svg", width = 10, height = 12, units = "cm", create.dir = TRUE)

Now we will take advantage of the PTPRC isoform identification by Ideis software. We will visualize the PTPRC/RA and PTPRC/RO isoforms.

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
FeaturePlot(cd4_prelim_filt, features = c("PTPRC-RA"), max.cutoff = 3, 
        raster = TRUE, raster.dpi = c(900,900), pt.size = 4) + ggtheme()
ggsave("../figures/prelim/cd4_ptprc_ra.svg", width = 13, height = 12, units = "cm", create.dir = TRUE)

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
FeaturePlot(cd4_prelim_filt, features = c("PTPRC-RO"), max.cutoff = 2, 
        raster = TRUE, raster.dpi = c(900,900), pt.size = 4) + ggtheme()
ggsave("../figures/prelim/cd4_ptprc_ro.svg", width = 13, height = 12, units = "cm", create.dir = TRUE)

# CD4 Final experiment

We will now load the datasets from the final experiment (exp 16, 18, 19 and 20). CD4 and CD8 cells were processed in separate wells, so we will load only the wells containing CD4 cells.

In [None]:
paths  <- list.files("../data/initdata/", full.names = T)

In [None]:
paths

In [None]:
seu_list  <- future_map(paths[c(5,6,10,12,14,16,18,20)], readRDS)

In [None]:
cd4_final  <- scCustomize::Merge_Seurat_List(seu_list)

We will process the dataset using the same pipeline as for the initial dataset.

In [None]:
options(future.globals.maxSize = 10000 * 1024^2)
plan("sequential")

In [None]:
DefaultAssay(cd4_final)  <- "RNA"
cd4_final <- NormalizeData(cd4_final, verbose = FALSE)
cd4_final <- ScaleData(cd4_final, verbose = FALSE)
cd4_final <- FindVariableFeatures(cd4_final, nfeatures = 1000, verbose = FALSE)
cd4_final <- RunPCA(cd4_final, npcs = 12, verbose = FALSE)
cd4_final <- RunUMAP(cd4_final, reduction = "pca", dims = 1:12)

In [None]:
cd4_final <- FindNeighbors(cd4_final, dims = 1:12)
cd4_final <- FindClusters(cd4_final, resolution = 1)

In [None]:
DimPlot(cd4_final, label = T)

Visualization of some of the basic genes.

In [None]:
options(repr.plot.width = 16, repr.plot.height = 8)
FeaturePlot(cd4_final, features = c("MKI67", "LCK", "CD3G", "TYROBP", "CD14", "CD3D", "CD8A"), ncol = 4)

In [None]:
FeaturePlot(cd4_final, features = c("SELL", "CCR7", "IL7R", "ITGA4", "CCL5", "IFNG"), ncol = 4)

We can see that clusters 12, 14, 15, 16 and 20 contain contaminating cells (no LCK, no CD3), so we will remove them.

In [None]:
cd4_final_filt  <- subset(cd4_final, seurat_clusters %in% c(0:11,13,17:19,21))
cd4_final_filt <- NormalizeData(cd4_final_filt, verbose = FALSE)
cd4_final_filt <- ScaleData(cd4_final_filt, verbose = FALSE)
cd4_final_filt <- FindVariableFeatures(cd4_final_filt, nfeatures = 1000, verbose = FALSE)
cd4_final_filt <- RunPCA(cd4_final_filt, npcs = 12, verbose = FALSE)
cd4_final_filt <- RunUMAP(cd4_final_filt, reduction = "pca", dims = 1:12)
cd4_final_filt <- FindNeighbors(cd4_final_filt, dims = 1:12)

In [None]:
cd4_final_filt <- FindClusters(cd4_final_filt, resolution = 0.8)

In [None]:
DimPlot(cd4_final_filt, label = T, label.size = 12)

In [None]:
FeaturePlot(cd4_final_filt, features = c("SELL", "CCR7", "IL7R", "ITGA4", "CCL5", "IFNG"), ncol = 4)

We will now again analyze the naive vs. nonNaive composition and PTPRC isoforms for Figure S2.

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
FeaturePlot(cd4_final_filt, features = c("PTPRC-RA"), max.cutoff = 2, 
        raster = TRUE, raster.dpi = c(900,900), pt.size = 3) + ggtheme()
ggsave("../figures/final/cd4_ptprc_ra.svg", width = 13, height = 12, units = "cm", create.dir = TRUE)

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
FeaturePlot(cd4_final_filt, features = c("PTPRC-RO"), max.cutoff = 5, 
        raster = TRUE, raster.dpi = c(900,900), pt.size = 3) + ggtheme()
ggsave("../figures/final/cd4_ptprc_ro.svg", width = 13, height = 12, units = "cm", create.dir = TRUE)

In [None]:
cd4_final_filt@meta.data  <- cd4_final_filt@meta.data  %>% 
mutate(naive_or_eff  = if_else(seurat_clusters %in% c(1,2,5,15),"Naive","NonNaive"))

In [None]:
options(repr.plot.width = 7, repr.plot.height = 5)
DimPlot(cd4_final_filt, label = F, label.size = 12, group.by = "naive_or_eff", cols = c("dodgerblue1","indianred2"), 
        raster = TRUE, raster.dpi = c(900,900), pt.size = 4) + ggtheme()
ggsave("../figures/final/cd4_dimplot.svg", width = 16, height = 12, units = "cm", create.dir = TRUE)

In [None]:
cd4_final_filt$final  <- "final"

In [None]:
cd4_final_filt@meta.data  %>% 
ggplot(aes(x = final, fill = naive_or_eff)) +
  geom_bar(position = "fill") + 
scale_fill_manual(values = c("dodgerblue1","indianred2")) + 
theme_classic()+
ggtheme() + ggtitle("CD4")
ggsave("../figures/final/cd4_barplot.svg", width = 10, height = 12, units = "cm", create.dir = TRUE)

# CD4 Initial and final experiments

We will now load and merge all CD4 datasets.

In [None]:
paths  <- list.files("../data/rawdata/", full.names = T)

In [None]:
paths

In [None]:
seu_list  <- future_map(paths[c(2,4,5,6,10,12,14,16,18,20)], readRDS)

In [None]:
seu_list[[1]]$hashtags  %>% table

In [None]:
exp10_cd4  <- subset(seu_list[[1]], hashtags != "H5")

In [None]:
seu_list[[1]] <- exp10_cd4

In [None]:
seu_list[[1]]$Condition  %>% table

In [None]:
cd4_full  <- scCustomize::Merge_Seurat_List(seu_list)

In [None]:
cd4_full

# Add metadata

In [None]:
md_dia  <- read_xlsx("../data/metadata_v07.xlsx")

In [None]:
md_dia  %>% colnames

In [None]:
cd4_full@meta.data  <- cd4_full@meta.data  %>% 
separate(Condition, into = c("Disease", "Time"), remove = F, sep = " ")  %>% 
mutate(Patient_Time = paste(Patient_ID, Time))

cd4_full$Time  <- if_else(is.na(cd4_full$Time), "T0", cd4_full$Time)

cd4_full$Sample_char  <- paste(cd4_full$Patient_ID, 
                                  cd4_full$Disease,
                                  cd4_full$Time,
                                  cd4_full$Age_group,
                                  cd4_full$Sex,
                                  cd4_full$Experiment_ID)

In [None]:
md_seurat  <- cd4_full@meta.data

In [None]:
colnames(md_seurat)

In [None]:
md_joined  <- left_join(md_seurat, md_dia)

In [None]:
cd4_full@meta.data  <- md_joined
rownames(cd4_full@meta.data)  <- colnames(cd4_full)

# Quality control and filtering

## Removing low quality samples

Patient 206 was removed because of low data quality at sorting, which suggests a low quality of the frozen sample.

In [None]:
cd4_full  <- subset(cd4_full, Patient_ID != "206")

Our dataset included three pre-diabetic cases which we did not include in the analysis. We will thus filter to keep only T1D patients and healthy donors. 

In [None]:
cd4_full$Disease  %>% table

In [None]:
cd4_full  <- subset(cd4_full, Disease %in% c("Dia", "Ctrl"))

In [None]:
cd4_full$Disease  %>% table

We sampled three of the healthy donors twice. The second sampling was more comparable to the T1D patients (the donors were sampled in the same environment, and we have the associated routine blood test metadata, which were not available for the first sampling). We will therefore remove the samples from the first sampling and from now on work only with the second-timepoint samples.
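The two steps involved (dropping the first sampling of the re-sampled donors and relabeling their second sampling as the baseline) can be sketched on a toy table. This is a minimal sketch in base R; the rows of `meta`, including patient 105, are invented for illustration.

```r
# Toy metadata; patient 105 is a hypothetical T1D patient with two time points.
meta <- data.frame(
  Patient_ID = c("201", "201", "202", "105", "105"),
  Disease    = c("Ctrl", "Ctrl", "Ctrl", "Dia", "Dia"),
  Time       = c("T0", "T1", "T1", "T0", "T1"),
  stringsAsFactors = FALSE
)
meta$Patient_Time <- paste(meta$Patient_ID, meta$Time)

# Healthy donors that were sampled twice.
resampled <- c("201", "202", "204")

# Step 1: drop the first sampling of the re-sampled donors.
keep <- !(meta$Patient_Time %in% paste(resampled, "T0"))
meta <- meta[keep, ]

# Step 2: relabel their second sampling as the baseline and rebuild derived columns.
is_second <- meta$Patient_Time %in% paste(resampled, "T1")
meta$Time[is_second] <- "T0"
meta$Patient_Time <- paste(meta$Patient_ID, meta$Time)
meta$Condition <- paste(meta$Disease, meta$Time)
meta
```

The order matters: relabeling before filtering would make the two samplings of the same donor indistinguishable.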

In [None]:
cd4_full$Condition  %>% table

In [None]:
cd4_full$is_old_control  <-  ifelse(cd4_full$Patient_Time %in% c("201 T0","202 T0","204 T0"), TRUE,FALSE)

In [None]:
cd4_full$is_old_control  %>% table

In [None]:
cd4_full  <- subset(cd4_full, is_old_control == FALSE)

In [None]:
cd4_full$Condition  %>% table

In [None]:
cd4_full@meta.data  <- cd4_full@meta.data  %>% 
mutate(Time = ifelse(Patient_Time %in% c("201 T1","202 T1","204 T1"), "T0", Time))  %>% 
mutate(Patient_Time = paste(Patient_ID, Time),
       Condition = paste(Disease, Time))


In [None]:
cd4_full$Condition  %>% table

Now let's proceed with the analysis of the full CD4 dataset.

In [None]:
options(future.globals.maxSize = 10000 * 1024^2)
plan("sequential")

In [None]:
DefaultAssay(cd4_full)  <- "RNA"
cd4_full <- NormalizeData(cd4_full, verbose = FALSE)
cd4_full <- ScaleData(cd4_full, verbose = FALSE)
cd4_full <- FindVariableFeatures(cd4_full, nfeatures = 1000, verbose = FALSE)
cd4_full <- RunPCA(cd4_full, npcs = 12, verbose = FALSE)
cd4_full <- RunUMAP(cd4_full, reduction = "pca", dims = 1:12)

In [None]:
cd4_full <- FindNeighbors(cd4_full, dims = 1:12)
cd4_full <- FindClusters(cd4_full, resolution = 1)

In [None]:
DimPlot(cd4_full, label = T)

We will check the canonical markers of T cells and other immune populations to see clusters of contaminating cell types.

In [None]:
options(repr.plot.width = 16, repr.plot.height = 8)
FeaturePlot(cd4_full, features = c("MKI67", "LCK", "CD3G", "TYROBP", "CD14", "CD3D", "CD8A"), ncol = 4)

## Automated annotation of cell types

We will perform automated analysis of cell types using the packages [SingleR](https://bioconductor.org/packages/release/bioc/vignettes/SingleR/inst/doc/SingleR.html) and [Azimuth](https://azimuth.hubmapconsortium.org/). We used two built-in reference datasets from the package celldex: Monaco Immune Dataset and Human Primary Cell Atlas Data, a custom reference of human T-cell types profiled by bulk RNA seq from the paper by Giles et al. [Immunity, 2022](https://www.sciencedirect.com/science/article/pii/S107476132200084X) and the three-level Azimuth annotations.
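The general idea behind reference-based annotation can be illustrated with a toy example: each cell is compared to reference expression profiles and receives the label of the best-correlated profile. The sketch below uses base R only and invented values; it is a heavily simplified illustration and reproduces neither the SingleR procedure (which, among other things, selects marker genes and fine-tunes the labels) nor the `annotate_tcell_data` helper used in this notebook.

```r
# Hypothetical reference: rows are genes, columns are reference cell types.
ref <- matrix(
  c(9, 1, 0,
    8, 2, 1,
    1, 9, 0,
    0, 8, 1,
    1, 0, 9),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("CD3D", "LCK", "CD14", "TYROBP", "MS4A1"),
                  c("T_cell", "Monocyte", "B_cell"))
)

# Hypothetical query: the same genes measured in two cells.
query <- matrix(
  c(7, 0,
    6, 1,
    0, 8,
    1, 9,
    0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(rownames(ref), c("cell_A", "cell_B"))
)

# Spearman correlation of every query cell with every reference profile
# (rows = cells, columns = reference cell types).
scores <- cor(query, ref, method = "spearman")

# Assign each cell the reference label with the highest correlation.
labels <- colnames(ref)[apply(scores, 1, which.max)]
names(labels) <- rownames(scores)
labels
```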

In [None]:
load("../data/ref_wherry_new.RData")

In [None]:
hpca.se  <- celldex::HumanPrimaryCellAtlasData()
mid.se <- celldex::MonacoImmuneData()

In [None]:
cd4_full  <- annotate_tcell_data(cd4_full)

In [None]:
# Only run if you want to save the file
# saveRDS(cd4_full, "../data/processed/L1/cd4_full.rds")

In [None]:
# Only run if starting from here
# cd4_full  <- readRDS("../data/processed/L1/cd4_full.rds")

In [None]:
options(repr.plot.width = 20, repr.plot.height = 7)

DimPlot(cd4_full, raster = F, group.by = "HPCA_single", label = F)
p1  <- DimPlot(cd4_full, raster = F, group.by = "HPCA_single", label = F)

In [None]:
DimPlot(cd4_full, raster = F, group.by = "Monaco_single", label = F)

In [None]:
DimPlot(cd4_full, label = T, raster = F, label.size = 12)

## Removing contaminating cells

In [None]:
# Only run if starting from here
# cd4_full  <- readRDS("../data/processed/L1/cd4_full.rds")

Based on the markers and annotations, we remove clusters 14, 15, 16, 20, 21 and 22. 

In [None]:
cd4_l1_full_filt  <- subset(cd4_full, seurat_clusters %in% c(0:13,17:19))

In [None]:
rm(cd4_full)

In [None]:
DimPlot(cd4_l1_full_filt, label = T, raster = F)

In [None]:
DefaultAssay(cd4_l1_full_filt)  <- "RNA"
cd4_l1_full_filt <- NormalizeData(cd4_l1_full_filt, verbose = FALSE)

cd4_l1_full_filt <- ScaleData(cd4_l1_full_filt, verbose = FALSE)
cd4_l1_full_filt <- FindVariableFeatures(cd4_l1_full_filt, nfeatures = 1000, verbose = FALSE)
cd4_l1_full_filt <- RunPCA(cd4_l1_full_filt, npcs = 12, verbose = FALSE)
cd4_l1_full_filt <- RunUMAP(cd4_l1_full_filt, reduction = "pca", dims = 1:12)

cd4_l1_full_filt <- FindNeighbors(cd4_l1_full_filt, dims = 1:12)
cd4_l1_full_filt <- FindClusters(cd4_l1_full_filt, resolution = 1)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 8)

FeaturePlot(cd4_l1_full_filt, features = c("CD4", "FOXP3", "CD44", "CCL5", "TBX21", "IFNG", "PDCD1", "BCL6"), ncol = 4)

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
DimPlot(cd4_l1_full_filt, label = T)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 8)

FeaturePlot(cd4_l1_full_filt, features = c("SELL", "ZBTB16", "BHLHE40", "FOXP3", "IL2RA", "ZEB2", "ZEB1", "CSF2"), ncol = 4)

In [None]:
cd4_l1_full_filt@meta.data  <- cd4_l1_full_filt@meta.data  %>% separate(Condition, into = c("Disease", "Time"), remove = F, sep = " ")

## Removing dead cells

Next, we are going to check the quality of cells based on read counts, feature counts and percentage of mitochondrial genes.

In [None]:
DimPlot(cd4_l1_full_filt, label = T)

In [None]:
cutoff_nFeature_RNA <- 500
cutoff_percent_mt <- 10
cluster_exclude <- c(11)

In [None]:
p1 <- ggplot(data.frame(nCount_RNA = cd4_l1_full_filt$nCount_RNA,
                  nFeature_RNA = cd4_l1_full_filt$nFeature_RNA,
                  percent_mt = cd4_l1_full_filt$percent.mt,
                  seurat_clusters = cd4_l1_full_filt$seurat_clusters,
                  exclude = ifelse(cd4_l1_full_filt$seurat_clusters %in% cluster_exclude, TRUE, FALSE)), 
       aes(x = seurat_clusters, y = percent_mt)) +
  geom_violin(scale = "width", aes(fill = exclude)) + 
  geom_hline(yintercept = cutoff_percent_mt,
               linewidth = 0.5,
               colour = "red") + 
  ggtitle("Percent mt. cutoff") + 
  theme_classic() +
  scale_fill_manual(values = c("white","red")) +
  theme(panel.background = element_blank(), 
        axis.text.x = element_text(angle = 0, hjust = 1)) +
  annotate(geom = "rect", xmin = min(as.numeric(cd4_l1_full_filt$seurat_clusters))-1, 
           xmax = max(as.numeric(cd4_l1_full_filt$seurat_clusters))+1, 
           ymin=cutoff_percent_mt,ymax=1.1*(max(cd4_l1_full_filt$percent.mt)), fill = "red", alpha = 0.1)

p2 <- ggplot(data.frame(nCount_RNA = cd4_l1_full_filt$nCount_RNA,
                  nFeature_RNA = cd4_l1_full_filt$nFeature_RNA,
                  percent_mt = cd4_l1_full_filt$percent.mt,
                  seurat_clusters = cd4_l1_full_filt$seurat_clusters,
                        exclude = ifelse(cd4_l1_full_filt$seurat_clusters %in% cluster_exclude, TRUE, FALSE)), 
       aes(x = seurat_clusters, y = nFeature_RNA)) +
  geom_violin(scale = "width", aes(fill = exclude)) + 
  geom_hline(yintercept = cutoff_nFeature_RNA,
               linewidth = 0.5,
               colour = "red") + 
  ggtitle("nFeature RNA cutoff") + 
  theme_classic() +
  scale_fill_manual(values = c("white","red")) +
  theme(panel.background = element_blank(), 
        axis.text.x = element_text(angle = 0, hjust = 1)) +
  annotate(geom = "rect", xmin = min(as.numeric(cd4_l1_full_filt$seurat_clusters))-1, 
           xmax = max(as.numeric(cd4_l1_full_filt$seurat_clusters))+1, 
           ymin=0, ymax=cutoff_nFeature_RNA, fill = "red", alpha = 0.1)

In [None]:
options(repr.plot.width = 12, repr.plot.height = 5)
p1 + p2

We will be removing cluster 11 as it consists of low quality cells. In addition, we use the criteria of filtering based on percentage of mitochondrial genes and count of detected genes. 
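The combined effect of the three criteria can be checked on a small table before applying them to the Seurat object. This is a minimal sketch in base R; the table `qc` is invented, while the cutoffs and the excluded cluster are the ones defined above.

```r
# Toy QC table with invented values.
qc <- data.frame(
  cell            = paste0("cell", 1:6),
  seurat_clusters = c(0, 3, 11, 5, 2, 11),
  percent.mt      = c(3.1, 12.4, 4.0, 8.9, 2.2, 15.0),
  nFeature_RNA    = c(1800, 2100, 950, 420, 2500, 300)
)

cutoff_nFeature_RNA <- 500
cutoff_percent_mt <- 10
cluster_exclude <- c(11)

# A cell is removed if it fails any one of the three criteria.
qc$remove <- ifelse(
  qc$seurat_clusters %in% cluster_exclude |
    qc$percent.mt > cutoff_percent_mt |
    qc$nFeature_RNA < cutoff_nFeature_RNA,
  "Remove", "Keep"
)
table(qc$remove)
```

Because the criteria are combined with a logical OR, a cell from a good cluster is still removed when it exceeds the mitochondrial cutoff or has too few detected genes.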

## QC plots of dead and contaminating cell removal for figure

In [None]:
cd4_full  <- readRDS("../data/processed/L1/cd4_full.rds")

In [None]:
DimPlot(cd4_full, label = T)

In [None]:

cutoff_nFeature_RNA <- 500
cutoff_percent_mt <- 10
cluster_exclude <- c(11,14:16,20:22)

In [None]:
options(repr.plot.width = 14, repr.plot.height = 5)

p1 <- ggplot(data.frame(nCount_RNA = cd4_full$nCount_RNA,
                  nFeature_RNA = cd4_full$nFeature_RNA,
                  percent_mt = cd4_full$percent.mt,
                  seurat_clusters = cd4_full$seurat_clusters,
                  exclude = ifelse(cd4_full$seurat_clusters %in% cluster_exclude, TRUE, FALSE)), 
       aes(x = seurat_clusters, y = percent_mt)) +
  geom_violin(scale = "width", aes(fill = exclude)) + 
  geom_hline(yintercept = cutoff_percent_mt,
               linewidth = 0.5,
               colour = "red") + 
  ggtitle("Percent mt. cutoff") + 
  theme_classic() +
  scale_fill_manual(values = c("white","red")) +
  theme(panel.background = element_blank(), 
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  annotate(geom = "rect", xmin = min(as.numeric(cd4_full$seurat_clusters))-1, 
           xmax = max(as.numeric(cd4_full$seurat_clusters))+1, 
           ymin=cutoff_percent_mt,ymax=1.1*(max(cd4_full$percent.mt)), fill = "red", alpha = 0.1) + ggtheme() + NoLegend()

p2 <- ggplot(data.frame(nCount_RNA = cd4_full$nCount_RNA,
                  nFeature_RNA = cd4_full$nFeature_RNA,
                  percent_mt = cd4_full$percent.mt,
                  seurat_clusters = cd4_full$seurat_clusters,
                        exclude = ifelse(cd4_full$seurat_clusters %in% cluster_exclude, TRUE, FALSE)), 
       aes(x = seurat_clusters, y = nFeature_RNA)) +
  geom_violin(scale = "width", aes(fill = exclude)) + 
  geom_hline(yintercept = cutoff_nFeature_RNA,
               linewidth = 0.5,
               colour = "red") + 
  ggtitle("nFeature RNA cutoff") + 
  theme_classic() +
  scale_fill_manual(values = c("white","red")) +
  theme(panel.background = element_blank(), 
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  annotate(geom = "rect", xmin = min(as.numeric(cd4_full$seurat_clusters))-1, 
           xmax = max(as.numeric(cd4_full$seurat_clusters))+1, 
           ymin=0, ymax=cutoff_nFeature_RNA, fill = "red", alpha = 0.1) + ggtheme()



p1 + p2

ggsave("../figures/QC/cd4_QC_plot1.png", width = 9, height = 4)
ggsave("../figures/QC/cd4_QC_plot1.svg", width = 9, height = 4)

In [None]:
options(repr.plot.width = 10, repr.plot.height = 6)

DotPlot(cd4_full, features = rev(c("CD3D","CD8A","CD8B","CD4","LCK","TRAC","CD14","MS4A1"))) + 
ggtheme() +
theme(panel.background = element_blank(), 
      axis.text.x = element_text(angle = 45, hjust = 1)) + coord_flip() +
      scale_size_continuous(range = c(0.2,3))
ggsave("../figures/QC/cd4_QC_plot2.png", width = 6.8, height = 3.7)
ggsave("../figures/QC/cd4_QC_plot2.svg", width = 6.8, height = 3.7)


In [None]:
cd4_full$remove  <- ifelse((cd4_full$seurat_clusters %in% cluster_exclude)  |
cd4_full$percent.mt > cutoff_percent_mt |
cd4_full$nFeature_RNA < cutoff_nFeature_RNA, "Remove", "Keep")

In [None]:
options(repr.plot.width = 5, repr.plot.height = 4)
DimPlot(cd4_full, raster = T, group.by = "remove", cols = c("grey88","red")) + ggtheme()
ggsave("../figures/QC/cd4_QC_plot3.png", width = 5, height = 4)
ggsave("../figures/QC/cd4_QC_plot3.svg", width = 5, height = 4)

In [None]:
DimPlot(cd4_full, raster = T, label = T, label.size = 7) + ggtheme()
ggsave("../figures/QC/cd4_QC_plot4.png", width = 5, height = 4)
ggsave("../figures/QC/cd4_QC_plot4.svg", width = 5, height = 4)

In [None]:
options(repr.plot.width = 60, repr.plot.height = 12)
VlnPlot(cd4_l1_full_filt, features = c( "nFeature_RNA"), 
        ncol = 4, group.by = "Sample_ID", raster = F, pt.size = 0) + NoLegend()

## Processing after QC

We now subset the object based on the previously defined criteria and run new normalization, scaling, variable feature selection, dimensional reduction and clustering. 

In [None]:
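# Note: cluster_exclude is redefined in the figure section above as c(11,14:16,20:22),
# which refers to the clustering of cd4_full. If the notebook is run top to bottom and only
# cluster 11 of the re-clustered cd4_l1_full_filt should be removed, reset it to c(11) first.
cd4_l1_full_filt  <- subset(cd4_l1_full_filt, 
                       ((seurat_clusters %in% cluster_exclude) == F) &
                      percent.mt < cutoff_percent_mt &
                      nFeature_RNA > cutoff_nFeature_RNA)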

In [None]:
cd4_l1_full_filt <- NormalizeData(cd4_l1_full_filt, verbose = FALSE)

cd4_l1_full_filt <- ScaleData(cd4_l1_full_filt, verbose = FALSE)
cd4_l1_full_filt <- FindVariableFeatures(cd4_l1_full_filt, nfeatures = 1000, verbose = FALSE)
cd4_l1_full_filt <- RunPCA(cd4_l1_full_filt, npcs = 12, verbose = FALSE)
cd4_l1_full_filt <- RunUMAP(cd4_l1_full_filt, reduction = "pca", dims = 1:12)

cd4_l1_full_filt <- FindNeighbors(cd4_l1_full_filt, dims = 1:12)
cd4_l1_full_filt <- FindClusters(cd4_l1_full_filt, resolution = 1)

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
DimPlot(cd4_l1_full_filt, label = T)

We can see that the mitochondrial genes as well as counts and features are now well balanced, with the exception of proliferating cells, which have more detected genes and more reads, as expected.

In [None]:
options(repr.plot.width = 20)
VlnPlot(cd4_l1_full_filt, features = c("percent.mt", "percent.rp", "nCount_RNA", "nFeature_RNA"), ncol = 4)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 8)

FeaturePlot(cd4_l1_full_filt, features = c("CD4", "CD8A", "SELL", "CD3D", "CD19", "MS4A1", "TRGV2", "TRDC"), ncol = 4)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 8)

FeaturePlot(cd4_l1_full_filt, features = c("FOXP3", "IL2RA", "PDCD1", "LAG3", "BCL6", "TBX21", "IFNG", "CCL5"), ncol = 4)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 8)

FeaturePlot(cd4_l1_full_filt, features = c("MKI67", "PTPRC-RA", "PTPRC-RO", "KLF2", "CXCR6", "ZBTB16", "CSF2", "IL23R"), ncol = 4)

In [None]:
cd4_l1_full_filt@meta.data  <- cd4_l1_full_filt@meta.data  %>% 
separate(Condition, into = c("Disease", "Time"), remove = F, sep = " ")

In [None]:
cd4_l1_full_filt$Time  <- if_else(is.na(cd4_l1_full_filt$Time), "T0", cd4_l1_full_filt$Time)

In [None]:
cd4_l1_full_filt$Sample_char  <- paste(cd4_l1_full_filt$Patient_ID, 
                                  cd4_l1_full_filt$Disease,
                                  cd4_l1_full_filt$Time,
                                  cd4_l1_full_filt$Age_group,
                                  cd4_l1_full_filt$Sex,
                                  cd4_l1_full_filt$Experiment_ID)

## PCA on samples before integration

To check if there is some apparent batch effect of the experiment, we will check the PCA on the sample level. 
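The sample-level PCA rests on two steps: averaging expression over the cells of each sample and running PCA on the resulting sample profiles. This is a minimal sketch in base R on an invented matrix; it uses plain arithmetic means, which is a simplification and may differ from how `AverageExpression` averages normalized data.

```r
# Toy expression matrix: rows are genes, columns are cells (values invented).
expr <- matrix(
  c(5, 6, 1, 2, 5, 7,
    1, 0, 6, 5, 1, 1,
    3, 3, 3, 4, 9, 8,
    0, 1, 0, 1, 4, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6))
)

# Hypothetical sample assignment of each cell.
sample_of_cell <- c("S1", "S1", "S2", "S2", "S3", "S3")

# Average expression per sample: result is a genes x samples matrix.
pseudobulk <- sapply(
  split(seq_len(ncol(expr)), sample_of_cell),
  function(idx) rowMeans(expr[, idx, drop = FALSE])
)
pseudobulk

# PCA with samples as observations (rows) and genes as variables (columns).
pca <- prcomp(t(pseudobulk), center = TRUE, scale. = FALSE)
pca$x[, 1:2]
```

Coloring the sample coordinates in `pca$x` by experiment, disease or sex then shows which variable drives the main axes of variation.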

In [None]:
cd4_samples  <- AverageExpression(cd4_l1_full_filt, group.by = "Sample_char", return.seurat = T)

In [None]:
cd4_samples  <- FindVariableFeatures(cd4_samples)

In [None]:
cd4_samples  <- RunPCA(cd4_samples)

In [None]:
DimPlot(cd4_samples)

In [None]:
cd4_samples$Sample_char  <- colnames(cd4_samples)

In [None]:
cd4_samples$Sample_char  %>% table

In [None]:
cd4_samples@meta.data  <- cd4_samples@meta.data  %>% separate(Sample_char, 
                                                              into = c("Patient_ID",
                                                                      "Disease",
                                                                      "Time",
                                                                      "Age_group",
                                                                      "Sex",
                                                                      "Exp"), 
                                                             sep = " ",
                                                             remove = F)

In [None]:
options(repr.plot.width = 12, repr.plot.height = 7)
(DimPlot(cd4_samples, group.by = "Exp") + DimPlot(cd4_samples, group.by = "Disease") + DimPlot(cd4_samples, group.by = "Time")) / (DimPlot(cd4_samples, group.by = "Sex") + DimPlot(cd4_samples, group.by = "Age_group") + (DimPlot(cd4_samples, group.by = "Patient_ID") + NoLegend()))

In [None]:
p1  <- (DimPlot(cd4_samples, group.by = "Exp") + DimPlot(cd4_samples, group.by = "Disease") + DimPlot(cd4_samples, group.by = "Time")) / (DimPlot(cd4_samples, group.by = "Sex") + DimPlot(cd4_samples, group.by = "Age_group") + (DimPlot(cd4_samples, group.by = "Patient_ID") + NoLegend()))



We can see a huge batch effect, so integration is needed. 

## STACAS Integration over Experiment

In [None]:
merged.list  <- SplitObject(cd4_l1_full_filt, split.by = "Experiment_ID")

In [None]:
plan("sequential")

In [None]:
# normalize and identify variable features for each dataset independently
merged.list <- lapply(X = merged.list, FUN = function(x) {
    DefaultAssay(x)  <- "RNA"
    x$barcode  <- colnames(x)
    x <- NormalizeData(x)
    x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

library(STACAS)

cd4_l1_full_filt <- Run.STACAS(merged.list, dims = 1:12)
cd4_l1_full_filt <- RunUMAP(cd4_l1_full_filt, dims = 1:12) 

In [None]:
# Visualize

DimPlot(cd4_l1_full_filt, group.by = c("Experiment_ID"))

In [None]:
saveRDS(cd4_l1_full_filt, "../data/processed/L1/cd4_l1_full_filt.rds")

In [None]:
cd4_l1_full_filt  <- readRDS("../data/processed/L1/cd4_l1_full_filt.rds")

## PCA on samples after integration

Now let's re-run the sample-level PCA to see if the batch effect was efficiently corrected. 

In [None]:
cd4_samples  <- AverageExpression(cd4_l1_full_filt, group.by = "Sample_char", return.seurat = T)

In [None]:
cd4_samples  <- FindVariableFeatures(cd4_samples)

In [None]:
cd4_samples  <- RunPCA(cd4_samples)

In [None]:
DimPlot(cd4_samples)

In [None]:
cd4_samples$Sample_char  <- colnames(cd4_samples)

In [None]:
cd4_samples@meta.data  <- cd4_samples@meta.data  %>% separate(Sample_char, 
                                                              into = c("Patient_ID",
                                                                      "Disease",
                                                                      "Time",
                                                                      "Age_group",
                                                                      "Sex",
                                                                      "Exp"), 
                                                             sep = " ",
                                                             remove = F)

In [None]:
options(repr.plot.width = 12, repr.plot.height = 7)
(DimPlot(cd4_samples, group.by = "Exp") + DimPlot(cd4_samples, group.by = "Disease") + DimPlot(cd4_samples, group.by = "Time")) / (DimPlot(cd4_samples, group.by = "Sex") + DimPlot(cd4_samples, group.by = "Age_group") + (DimPlot(cd4_samples, group.by = "Patient_ID") + NoLegend()))

We can see that now the only apparent batch effect is between experiments 10 and 11 vs. 16, 18, 19 and 20. This corresponds to the initial and final experiments and makes sense, as the initial experiment samples contained all cells without enrichment, while the final experiment samples contained cells enriched for non-naive populations. 

## Evaluation of integration

To evaluate the effectiveness of integration, we used kBET ([see Github](https://github.com/theislab/kBET)).
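The principle behind this kind of test is to compare the batch composition in the neighbourhood of each observation with the global batch composition; neighbourhoods that deviate significantly are counted as rejections. The sketch below illustrates this idea in base R on an invented two-dimensional embedding. It is a simplified illustration under stated assumptions (fixed `k`, every point tested once, a chi-squared goodness-of-fit test), not the implementation of the kBET package.

```r
# Toy 2-D embedding with two well-separated batches (coordinates invented).
emb <- rbind(
  cbind(x = c(0.0, 0.1, 0.2, 0.3, 0.4, 0.5), y = c(0.0, 0.2, 0.1, 0.3, 0.0, 0.2)),
  cbind(x = c(5.0, 5.1, 5.2, 5.3, 5.4, 5.5), y = c(5.0, 5.2, 5.1, 5.3, 5.0, 5.2))
)
batch <- factor(rep(c("A", "B"), each = 6))

# Global batch frequencies serve as the expected local composition.
global_freq <- as.numeric(prop.table(table(batch)))

k <- 5                      # neighbourhood size (assumption for this toy example)
d <- as.matrix(dist(emb))   # pairwise Euclidean distances

# For each point, test the batch counts among its k nearest neighbours
# (the point itself included) against the global frequencies.
p_values <- sapply(seq_len(nrow(emb)), function(i) {
  nn <- order(d[i, ])[1:k]
  observed <- table(batch[nn])
  # Expected counts are small here, so the approximation warning is suppressed.
  suppressWarnings(chisq.test(observed, p = global_freq)$p.value)
})

# Fraction of neighbourhoods whose composition differs from the global one.
rejection_rate <- mean(p_values < 0.05)
rejection_rate
```

In this toy layout the batches do not mix at all, so every neighbourhood is rejected; a well-integrated dataset would give a rejection rate closer to zero.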

In [None]:
.libPaths("~/R/x86_64-pc-linux-gnu-library/4.4/")
library(kBET)

In [None]:
cd4_l1_full_filt  <- readRDS("../data/processed/L1/cd4_l1_full_filt.rds")

### Before integration

In [None]:
cd4_samples2  <- AverageExpression(cd4_l1_full_filt, group.by = "Sample_char", return.seurat = T, 
                                   assay = "RNA", slot = "data")
cd4_samples2  <- FindVariableFeatures(cd4_samples2)
cd4_samples2  <- ScaleData(cd4_samples2)
cd4_samples2  <- RunPCA(cd4_samples2)
cd4_samples2$Sample_char  <- colnames(cd4_samples2)

cd4_samples2@meta.data  <- cd4_samples2@meta.data  %>% separate(Sample_char, 
                                                              into = c("Patient_ID",
                                                                      "Disease",
                                                                      "Time",
                                                                      "Age_group",
                                                                      "Sex",
                                                                      "Exp"), 
                                                             sep = " ",
                                                             remove = F)

In [None]:
cd4_samples2@meta.data  <- cd4_samples2@meta.data  %>% 
mutate(Enrichment = if_else(Exp %in% c("Exp08","Exp10","Exp11"), "Initial", "Final")) 
                                                         

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
kbet_result_before <- kBET(
    cd4_samples2@reductions$pca@cell.embeddings,
    batch = cd4_samples2$Exp,
    plot = T,
    k0 = 50, # Neighborhood size - can adjust
    n_repeat = 100, # Number of iterations
    do.pca = FALSE # We already provide PCA
)

### After integration

In [None]:
cd4_samples  <- AverageExpression(cd4_l1_full_filt, group.by = "Sample_char", return.seurat = T, 
                                   assay = "integrated", slot = "data")
cd4_samples  <- FindVariableFeatures(cd4_samples)
cd4_samples  <- ScaleData(cd4_samples)
cd4_samples  <- RunPCA(cd4_samples)
cd4_samples$Sample_char  <- colnames(cd4_samples)

cd4_samples@meta.data  <- cd4_samples@meta.data  %>% separate(Sample_char, 
                                                              into = c("Patient_ID",
                                                                      "Disease",
                                                                      "Time",
                                                                      "Age_group",
                                                                      "Sex",
                                                                      "Exp"), 
                                                             sep = " ",
                                                             remove = F)

In [None]:
cd4_samples@meta.data  <- cd4_samples@meta.data  %>% 
mutate(Enrichment = if_else(Exp %in% c("Exp08","Exp11","Exp10"), "Initial", "Final")) 
                                                         

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
kbet_result_after <- kBET(
    cd4_samples@reductions$pca@cell.embeddings,
    batch = cd4_samples$Exp,
    plot = T,
    k0 = 50, # Neighborhood size - can adjust
    n_repeat = 100, # Number of iterations
    do.pca = FALSE # We already provide PCA
)

In [None]:
kbet_result  <- data.frame(test = paste("test",1:100), 
                           #expected = kbet_result_after$stats$kBET.expected,
                           observed_before = kbet_result_before$stats$kBET.observed,
                           observed_after = kbet_result_after$stats$kBET.observed)  %>% 
pivot_longer(!test, names_to = "tested",values_to = "value")
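
A short numerical summary can complement the boxplot of rejection rates. The sketch below uses made-up numbers in place of `kbet_result_before$stats$kBET.observed` and `kbet_result_after$stats$kBET.observed`; the objects `toy_before`, `toy_after` and `kbet_summary` are hypothetical and only illustrate the shape of the comparison. Lower kBET rejection rates indicate better-mixed batches.

```r
# Toy rejection rates standing in for the kBET output (hypothetical values)
set.seed(1)
toy_before <- runif(100, min = 0.7, max = 1.0)
toy_after  <- runif(100, min = 0.2, max = 0.5)

# Same long layout as kbet_result: one row per repeat and condition
toy_long <- data.frame(
  tested = rep(c("observed_before", "observed_after"), each = 100),
  value  = c(toy_before, toy_after)
)

# Median rejection rate per condition
kbet_summary <- aggregate(value ~ tested, data = toy_long, FUN = median)
print(kbet_summary)
```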


In [None]:
kbet_result  %>% 
ggplot(aes(x = tested, y = value)) +
geom_boxplot(outlier.shape = NA, width = 0.6) +
geom_jitter(width = 0.2, height = 0.05, alpha = 0.4) +
theme_classic() +
ggtheme()

ggsave("../figures/QC/integration_kBET_cd4.png", width = 3, height = 3)
ggsave("../figures/QC/integration_kBET_cd4.svg", width = 3, height = 3)

## PCA distances

### Before integration

In [None]:
(DimPlot(cd4_samples2, group.by = "Exp") + DimPlot(cd4_samples2, group.by = "Enrichment") + DimPlot(cd4_samples2, group.by = "Time")) /
(DimPlot(cd4_samples2, group.by = "Sex") + DimPlot(cd4_samples2, group.by = "Age_group") + (DimPlot(cd4_samples2, group.by = "Patient_ID") + NoLegend()))

In [None]:
(DimPlot(cd4_samples2, group.by = "Exp", pt.size = 2) + ggtheme() + 
(DimPlot(cd4_samples2, group.by = "Enrichment", cols = c("#3d79f3ff","#e6352fff"), pt.size = 2) + ggtheme()))

ggsave("../figures/QC/PCA_dimplot_cd4_before.svg", width = 22, height = 9, units = "cm")

In [None]:
pca_coords <- Embeddings(cd4_samples2, "pca")

All PCs

In [None]:
distance_matrix <- dist(pca_coords, method = "euclidean")

In [None]:
distance_matrix  %>% as.matrix()  %>% as.data.frame()

In [None]:
distance_matrix  <- distance_matrix  %>%  as.matrix()  %>% as.data.frame()  %>% 
rownames_to_column("sample1")  %>%  
pivot_longer(!sample1, names_to = "sample2", values_to = "distance")

In [None]:
distance_matrix

Recode sample numbers to integers to avoid counting pairwise distance twice

In [None]:
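# Create a named vector mapping sample names to integers
# (the row names of the PCA embedding are the sample names; `dm` is not defined at this point)
level_map <- setNames(seq_along(rownames(pca_coords)), rownames(pca_coords))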

# Recode col1 and col2 to numeric using the same mapping
distance_matrix$sn1 <- level_map[as.character(distance_matrix$sample1)]
distance_matrix$sn2 <- level_map[as.character(distance_matrix$sample2)]
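
To see why the integer recoding removes duplicates, here is a minimal sketch on a toy distance matrix in base R. The four samples and their coordinates are made up, and `toy_dm`, `idx` and `toy_unique` are hypothetical names. A full distance matrix contains every pair twice plus the zero-distance self-pairs; keeping only rows where the first index is smaller than the second leaves each unordered pair once.

```r
# Toy coordinates for four hypothetical samples
coords <- matrix(c(0, 0,
                   3, 4,
                   6, 8,
                   0, 5),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("S1", "S2", "S3", "S4"), c("PC_1", "PC_2")))
toy_dm <- as.matrix(dist(coords))

# Long format: one row per ordered pair (16 rows, including self-pairs)
toy_long <- as.data.frame(as.table(toy_dm), stringsAsFactors = FALSE)
colnames(toy_long) <- c("sample1", "sample2", "distance")

# Map each sample name to its position and keep pairs with index1 < index2
idx <- setNames(seq_along(rownames(toy_dm)), rownames(toy_dm))
toy_long$sn1 <- idx[toy_long$sample1]
toy_long$sn2 <- idx[toy_long$sample2]
toy_unique <- toy_long[toy_long$sn1 < toy_long$sn2, ]

nrow(toy_unique)  # number of unordered pairs, choose(4, 2)
```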

Separate sample names to variables, create comparison categories

In [None]:
distance_matrix2  <- distance_matrix  %>% 
dplyr::filter(sn1 < sn2)  %>% 
separate(sample1, into = c("Patient_ID_1", NA,NA,NA,NA,"Exp_1"), sep = " ", remove = F)  %>% 
separate(sample2, into = c("Patient_ID_2", NA,NA,NA,NA,"Exp_2"), sep = " ", remove = F)  %>% 
mutate(Enrichment_1 = if_else(Exp_1 %in% c("Exp08","Exp10","Exp11"), "Initial", "Final"))  %>% 
mutate(Enrichment_2 = if_else(Exp_2 %in% c("Exp08","Exp10","Exp11"), "Initial", "Final"))  %>% 
mutate(Comp1 = paste(Exp_1, Exp_2))  %>% 
mutate(Comp2 = paste(Enrichment_1, Enrichment_2))   %>% 
mutate(Comp3 = if_else(Exp_1 == Exp_2, "Within_exp", 
                       if_else(Enrichment_1 == Enrichment_2 & Enrichment_1 == "Final", "Cross_exp", 
                       if_else(Enrichment_1 == Enrichment_2 & Enrichment_1 == "Initial", "Cross_exp", 
                              "Cross_enrichment" ))))  



Recode experiment numbers to integers to ensure that comparison of Exp10 - Exp11 is the same as Exp11 - Exp10

In [None]:
# Create a named vector for mapping levels to integers
level_map <- setNames(seq_along(levels(factor(distance_matrix2$Exp_1))), levels(factor(distance_matrix2$Exp_1)))

# Recode col1 and col2 to numeric using the same mapping
distance_matrix2$en1 <- level_map[as.character(distance_matrix2$Exp_1)]
distance_matrix2$en2 <- level_map[as.character(distance_matrix2$Exp_2)]

In [None]:
distance_matrix2

In [None]:
distance_matrix2  <- distance_matrix2  %>% 
mutate(Comp1 = if_else(en1<en2, paste(Exp_1, Exp_2), paste(Exp_2,Exp_1))) 
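
An equivalent way to build an order-independent label, shown here only as a sketch, is to sort the two experiment names within each pair with `pmin()` and `pmax()`, which also work on character vectors. The vectors `exp_1` and `exp_2` are toy data, not the notebook objects.

```r
# Toy experiment labels for three sample pairs
exp_1 <- c("Exp10", "Exp11", "Exp16")
exp_2 <- c("Exp11", "Exp10", "Exp16")

# Put the alphabetically smaller experiment first in every pair
comp1 <- paste(pmin(exp_1, exp_2), pmax(exp_1, exp_2))
comp1
```

Both the first and the second pair end up with the same label, so they fall into the same box in the plot.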


Recode experiment numbers to those used in the manuscript.

In [None]:
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp10", replacement = "Batch1")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp11", replacement = "Batch2")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp16", replacement = "Batch3")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp18", replacement = "Batch4")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp19", replacement = "Batch5")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp20", replacement = "Batch6")
distance_matrix2$Comp2  <- gsub(distance_matrix2$Comp2, pattern = "Final Initial", replacement = "Initial Final")

In [None]:
distance_matrix2  %>% 
ggplot(aes(x = Comp1, y = distance)) +
geom_boxplot(outlier.shape = NA, aes(color = Comp2)) +
#geom_violin(scale = "width") +
geom_jitter(alpha = .1, width = 0.2, aes(color = Comp2)) +
facet_grid(cols = vars(Comp3), scales = "free_x", space = "free") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90)) +
ggtheme() +
scale_fill_manual(values = c("#3d79f3ff","#b01ab8ff", "#e6352fff")) +
scale_color_manual(values = c("#3d79f3ff","#b01ab8ff", "#e6352fff")) +
scale_y_continuous(limits = c(0,115), expand = c(0,0))

In [None]:
ggsave("../figures/QC/PCA_dist_cd4_before_50pcs.svg", width = 22, height = 12, units = "cm")
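
The facets of the plot above come from the `Comp3` column. Its nested `if_else()` logic can be summarised in a small helper, sketched here in base R. The function `classify_pair()` is hypothetical and not part of the sourced analysis script; the list of initial experiments is copied from the code above.

```r
# Hypothetical helper mirroring the Comp3 logic
classify_pair <- function(exp1, exp2, initial = c("Exp08", "Exp10", "Exp11")) {
  enr1 <- ifelse(exp1 %in% initial, "Initial", "Final")
  enr2 <- ifelse(exp2 %in% initial, "Initial", "Final")
  ifelse(exp1 == exp2, "Within_exp",
         ifelse(enr1 == enr2, "Cross_exp", "Cross_enrichment"))
}

pair_class <- classify_pair(c("Exp10", "Exp10", "Exp16", "Exp10"),
                            c("Exp10", "Exp11", "Exp18", "Exp16"))
pair_class
```

Pairs from the same experiment are `Within_exp`, pairs from different experiments of the same enrichment type are `Cross_exp`, and everything else is `Cross_enrichment`.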

Just two PCs

In [None]:
pca_coords <- Embeddings(cd4_samples2, "pca")

In [None]:
pca_coords <- pca_coords[, 1:2]
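
Distances computed on the first two PCs can never exceed the distances computed on all PCs, because dropping coordinates only removes non-negative terms from the sum of squares. The sketch below checks this on random toy data; `toy_pca` is a hypothetical stand-in for the sample embedding.

```r
# Toy embedding: 5 samples, 10 principal components (random values)
set.seed(42)
toy_pca <- matrix(rnorm(5 * 10), nrow = 5,
                  dimnames = list(paste0("S", 1:5), paste0("PC_", 1:10)))

d_all <- dist(toy_pca, method = "euclidean")
d_two <- dist(toy_pca[, 1:2], method = "euclidean")

# Every two-PC distance is bounded by the corresponding full distance
all(d_two <= d_all)
```

This is worth keeping in mind when comparing the y-axis of the two-PC plot with the all-PC plot.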

In [None]:
distance_matrix <- dist(pca_coords, method = "euclidean")

In [None]:
distance_matrix  %>% as.matrix()  %>% as.data.frame()

In [None]:
distance_matrix  <- distance_matrix  %>%  as.matrix()  %>% as.data.frame()  %>% 
rownames_to_column("sample1")  %>%  
pivot_longer(!sample1, names_to = "sample2", values_to = "distance")

In [None]:
distance_matrix

Recode sample numbers to integers to avoid counting pairwise distance twice

In [None]:
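# Create a named vector mapping sample names to integers
# (the row names of the PCA embedding are the sample names; `dm` is not defined at this point)
level_map <- setNames(seq_along(rownames(pca_coords)), rownames(pca_coords))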

# Recode col1 and col2 to numeric using the same mapping
distance_matrix$sn1 <- level_map[as.character(distance_matrix$sample1)]
distance_matrix$sn2 <- level_map[as.character(distance_matrix$sample2)]

Separate sample names to variables, create comparison categories

In [None]:
distance_matrix2  <- distance_matrix  %>% 
dplyr::filter(sn1 < sn2)  %>% 
separate(sample1, into = c("Patient_ID_1", NA,NA,NA,NA,"Exp_1"), sep = " ", remove = F)  %>% 
separate(sample2, into = c("Patient_ID_2", NA,NA,NA,NA,"Exp_2"), sep = " ", remove = F)  %>% 
mutate(Enrichment_1 = if_else(Exp_1 %in% c("Exp08","Exp10","Exp11"), "Initial", "Final"))  %>% 
mutate(Enrichment_2 = if_else(Exp_2 %in% c("Exp08","Exp10","Exp11"), "Initial", "Final"))  %>% 
mutate(Comp1 = paste(Exp_1, Exp_2))  %>% 
mutate(Comp2 = paste(Enrichment_1, Enrichment_2))   %>% 
mutate(Comp3 = if_else(Exp_1 == Exp_2, "Within_exp", 
                       if_else(Enrichment_1 == Enrichment_2 & Enrichment_1 == "Final", "Cross_exp", 
                       if_else(Enrichment_1 == Enrichment_2 & Enrichment_1 == "Initial", "Cross_exp", 
                              "Cross_enrichment" ))))  



Recode experiment numbers to integers to ensure that comparison of Exp10 - Exp11 is the same as Exp11 - Exp10

In [None]:
# Create a named vector for mapping levels to integers
level_map <- setNames(seq_along(levels(factor(distance_matrix2$Exp_1))), levels(factor(distance_matrix2$Exp_1)))

# Recode col1 and col2 to numeric using the same mapping
distance_matrix2$en1 <- level_map[as.character(distance_matrix2$Exp_1)]
distance_matrix2$en2 <- level_map[as.character(distance_matrix2$Exp_2)]

In [None]:
distance_matrix2

In [None]:
distance_matrix2  <- distance_matrix2  %>% 
mutate(Comp1 = if_else(en1<en2, paste(Exp_1, Exp_2), paste(Exp_2,Exp_1))) 


Recode experiment numbers to those used in the manuscript.

In [None]:
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp10", replacement = "Batch1")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp11", replacement = "Batch2")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp16", replacement = "Batch3")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp18", replacement = "Batch4")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp19", replacement = "Batch5")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp20", replacement = "Batch6")
distance_matrix2$Comp2  <- gsub(distance_matrix2$Comp2, pattern = "Final Initial", replacement = "Initial Final")
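
The repeated `gsub()` calls can also be driven by a single lookup table. This is only a sketch of an alternative: `batch_map` copies the experiment-to-batch pairs from the code above, while the helper `recode_batches()` is hypothetical.

```r
# Lookup table taken from the gsub() calls above
batch_map <- c(Exp10 = "Batch1", Exp11 = "Batch2", Exp16 = "Batch3",
               Exp18 = "Batch4", Exp19 = "Batch5", Exp20 = "Batch6")

# Hypothetical helper: apply every replacement in the map to a character vector
recode_batches <- function(x, map) {
  for (old in names(map)) {
    x <- gsub(old, map[[old]], x, fixed = TRUE)
  }
  x
}

recoded <- recode_batches(c("Exp10 Exp11", "Exp16 Exp20"), batch_map)
recoded
```

Keeping the mapping in one place makes it easier to reuse the same recoding in all four distance sections.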

In [None]:
distance_matrix2  %>% 
ggplot(aes(x = Comp1, y = distance)) +
geom_boxplot(outlier.shape = NA, aes(color = Comp2)) +
#geom_violin(scale = "width") +
geom_jitter(alpha = .1, width = 0.2, aes(color = Comp2)) +
facet_grid(cols = vars(Comp3), scales = "free_x", space = "free") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90)) +
ggtheme() +
scale_fill_manual(values = c("#3d79f3ff","#b01ab8ff", "#e6352fff")) +
scale_color_manual(values = c("#3d79f3ff","#b01ab8ff", "#e6352fff")) +
scale_y_continuous(limits = c(0,115), expand = c(0,0))

In [None]:
ggsave("../figures/QC/PCA_dist_cd4_before_2pcs.svg", width = 22, height = 12, units = "cm")

### After integration

In [None]:
(DimPlot(cd4_samples, group.by = "Exp") + DimPlot(cd4_samples, group.by = "Enrichment") + DimPlot(cd4_samples, group.by = "Time")) /
(DimPlot(cd4_samples, group.by = "Sex") + DimPlot(cd4_samples, group.by = "Age_group") + (DimPlot(cd4_samples, group.by = "Patient_ID") + NoLegend()))

In [None]:
(DimPlot(cd4_samples, group.by = "Exp", pt.size = 2) + ggtheme() + 
(DimPlot(cd4_samples, group.by = "Enrichment", cols = c("#3d79f3ff","#e6352fff"), pt.size = 2) + ggtheme()))

ggsave("../figures/QC/PCA_dimplot_cd4_after.svg", width = 22, height = 9, units = "cm")

In [None]:
pca_coords <- Embeddings(cd4_samples, "pca")

All PCs

In [None]:
distance_matrix <- dist(pca_coords, method = "euclidean")

In [None]:
distance_matrix

In [None]:
distance_matrix  %>% as.matrix()  %>% as.data.frame()

In [None]:
dm = distance_matrix  %>% as.matrix()  %>% as.data.frame()

table(colnames(dm) == rownames(dm))

In [None]:
distance_matrix  <- distance_matrix  %>%  as.matrix()  %>% as.data.frame()  %>% 
rownames_to_column("sample1")  %>%  
pivot_longer(!sample1, names_to = "sample2", values_to = "distance")

In [None]:
distance_matrix

Recode sample numbers to integers to avoid counting pairwise distance twice

In [None]:
# Create a named vector for mapping levels to integers
level_map <- setNames(seq_along(colnames(dm)), colnames(dm))

# Recode col1 and col2 to numeric using the same mapping
distance_matrix$sn1 <- level_map[as.character(distance_matrix$sample1)]
distance_matrix$sn2 <- level_map[as.character(distance_matrix$sample2)]

Separate sample names to variables, create comparison categories

In [None]:
distance_matrix2  <- distance_matrix  %>% 
dplyr::filter(sn1 < sn2)  %>% 
separate(sample1, into = c("Patient_ID_1", NA,NA,NA,NA,"Exp_1"), sep = " ", remove = F)  %>% 
separate(sample2, into = c("Patient_ID_2", NA,NA,NA,NA,"Exp_2"), sep = " ", remove = F)  %>% 
mutate(Enrichment_1 = if_else(Exp_1 %in% c("Exp08","Exp10","Exp11"), "Initial", "Final"))  %>% 
mutate(Enrichment_2 = if_else(Exp_2 %in% c("Exp08","Exp10","Exp11"), "Initial", "Final"))  %>% 
mutate(Comp1 = paste(Exp_1, Exp_2))  %>% 
mutate(Comp2 = paste(Enrichment_1, Enrichment_2))   %>% 
mutate(Comp3 = if_else(Exp_1 == Exp_2, "Within_exp", 
                       if_else(Enrichment_1 == Enrichment_2 & Enrichment_1 == "Final", "Cross_exp", 
                       if_else(Enrichment_1 == Enrichment_2 & Enrichment_1 == "Initial", "Cross_exp", 
                              "Cross_enrichment" ))))  



Recode experiment numbers to integers to ensure that comparison of Exp10 - Exp11 is the same as Exp11 - Exp10

In [None]:
# Create a named vector for mapping levels to integers
level_map <- setNames(seq_along(levels(factor(distance_matrix2$Exp_1))), levels(factor(distance_matrix2$Exp_1)))

# Recode col1 and col2 to numeric using the same mapping
distance_matrix2$en1 <- level_map[as.character(distance_matrix2$Exp_1)]
distance_matrix2$en2 <- level_map[as.character(distance_matrix2$Exp_2)]

In [None]:
distance_matrix2

In [None]:
distance_matrix2  <- distance_matrix2  %>% 
mutate(Comp1 = if_else(en1<en2, paste(Exp_1, Exp_2), paste(Exp_2,Exp_1))) 


In [None]:
distance_matrix2

Recode experiment numbers to those used in the manuscript.

In [None]:
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp10", replacement = "Batch1")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp11", replacement = "Batch2")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp16", replacement = "Batch3")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp18", replacement = "Batch4")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp19", replacement = "Batch5")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp20", replacement = "Batch6")
distance_matrix2$Comp2  <- gsub(distance_matrix2$Comp2, pattern = "Final Initial", replacement = "Initial Final")

In [None]:
distance_matrix2  %>% 
ggplot(aes(x = Comp1, y = distance)) +
geom_boxplot(outlier.shape = NA, aes(color = Comp2)) +
#geom_violin(scale = "width") +
geom_jitter(alpha = .1, width = 0.2, aes(color = Comp2)) +
facet_grid(cols = vars(Comp3), scales = "free_x", space = "free") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90)) +
ggtheme() +
scale_fill_manual(values = c("#3d79f3ff","#b01ab8ff", "#e6352fff")) +
scale_color_manual(values = c("#3d79f3ff","#b01ab8ff", "#e6352fff")) +
scale_y_continuous(limits = c(0,115), expand = c(0,0))

In [None]:
ggsave("../figures/QC/PCA_dist_cd4_after_50pcs.svg", width = 22, height = 12, units = "cm")

Just two PCs

In [None]:
pca_coords <- Embeddings(cd4_samples, "pca")

In [None]:
pca_coords <- pca_coords[, 1:2]

In [None]:
distance_matrix <- dist(pca_coords, method = "euclidean")

In [None]:
distance_matrix

In [None]:
distance_matrix  %>% as.matrix()  %>% as.data.frame()

In [None]:
dm = distance_matrix  %>% as.matrix()  %>% as.data.frame()

table(colnames(dm) == rownames(dm))

In [None]:
distance_matrix  <- distance_matrix  %>%  as.matrix()  %>% as.data.frame()  %>% 
rownames_to_column("sample1")  %>%  
pivot_longer(!sample1, names_to = "sample2", values_to = "distance")

In [None]:
distance_matrix

Recode sample numbers to integers to avoid counting pairwise distance twice

In [None]:
# Create a named vector for mapping levels to integers
level_map <- setNames(seq_along(colnames(dm)), colnames(dm))

# Recode col1 and col2 to numeric using the same mapping
distance_matrix$sn1 <- level_map[as.character(distance_matrix$sample1)]
distance_matrix$sn2 <- level_map[as.character(distance_matrix$sample2)]

Separate sample names to variables, create comparison categories

In [None]:
distance_matrix2  <- distance_matrix  %>% 
dplyr::filter(sn1 < sn2)  %>% 
separate(sample1, into = c("Patient_ID_1", NA,NA,NA,NA,"Exp_1"), sep = " ", remove = F)  %>% 
separate(sample2, into = c("Patient_ID_2", NA,NA,NA,NA,"Exp_2"), sep = " ", remove = F)  %>% 
mutate(Enrichment_1 = if_else(Exp_1 %in% c("Exp08","Exp10","Exp11"), "Initial", "Final"))  %>% 
mutate(Enrichment_2 = if_else(Exp_2 %in% c("Exp08","Exp10","Exp11"), "Initial", "Final"))  %>% 
mutate(Comp1 = paste(Exp_1, Exp_2))  %>% 
mutate(Comp2 = paste(Enrichment_1, Enrichment_2))   %>% 
mutate(Comp3 = if_else(Exp_1 == Exp_2, "Within_exp", 
                       if_else(Enrichment_1 == Enrichment_2 & Enrichment_1 == "Final", "Cross_exp", 
                       if_else(Enrichment_1 == Enrichment_2 & Enrichment_1 == "Initial", "Cross_exp", 
                              "Cross_enrichment" ))))  



Recode experiment numbers to integers to ensure that comparison of Exp10 - Exp11 is the same as Exp11 - Exp10

In [None]:
# Create a named vector for mapping levels to integers
level_map <- setNames(seq_along(levels(factor(distance_matrix2$Exp_1))), levels(factor(distance_matrix2$Exp_1)))

# Recode col1 and col2 to numeric using the same mapping
distance_matrix2$en1 <- level_map[as.character(distance_matrix2$Exp_1)]
distance_matrix2$en2 <- level_map[as.character(distance_matrix2$Exp_2)]

In [None]:
distance_matrix2

In [None]:
distance_matrix2  <- distance_matrix2  %>% 
mutate(Comp1 = if_else(en1<en2, paste(Exp_1, Exp_2), paste(Exp_2,Exp_1))) 


In [None]:
distance_matrix2

Recode experiment numbers to those used in the manuscript.

In [None]:
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp10", replacement = "Batch1")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp11", replacement = "Batch2")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp16", replacement = "Batch3")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp18", replacement = "Batch4")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp19", replacement = "Batch5")
distance_matrix2$Comp1  <- gsub(distance_matrix2$Comp1, pattern = "Exp20", replacement = "Batch6")
distance_matrix2$Comp2  <- gsub(distance_matrix2$Comp2, pattern = "Final Initial", replacement = "Initial Final")

In [None]:
distance_matrix2  %>% 
ggplot(aes(x = Comp1, y = distance)) +
geom_boxplot(outlier.shape = NA, aes(color = Comp2)) +
#geom_violin(scale = "width") +
geom_jitter(alpha = .1, width = 0.2, aes(color = Comp2)) +
facet_grid(cols = vars(Comp3), scales = "free_x", space = "free") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90)) +
ggtheme() +
scale_fill_manual(values = c("#3d79f3ff","#b01ab8ff", "#e6352fff")) +
scale_color_manual(values = c("#3d79f3ff","#b01ab8ff", "#e6352fff")) +
scale_y_continuous(limits = c(0,115), expand = c(0,0))

In [None]:
ggsave("../figures/QC/PCA_dist_cd4_after_2pcs.svg", width = 22, height = 12, units = "cm")

# Analysis CD4 Level 1

Now we will perform a detailed analysis of the dataset, its main clusters as well as subclusters. The following hierarchy of analysis is used:
* Level 1 - distinguish conventional and unconventional CD4 populations
* Level 2 - main clusters of CD4 T cells 
* Level 3 - further subclustering of main clusters revealing cell states

For each of the levels of classification, we will perform subsetting, new normalization, new scaling, dimensional reduction, integration and annotation. From the obtained clusters, we will create dimplots and calculate frequencies and markers. Clustering is performed on the filtered dataset and on the UMAP dimensional reduction. 

In [None]:
# Only run if you want to start from here
# cd4_l1_full_filt  <- readRDS("../data/processed/L1/cd4_l1_full_filt.rds")

In [None]:
DefaultAssay(cd4_l1_full_filt)  <- "integrated"

In [None]:
cd4_l1_full_filt <- FindNeighbors(cd4_l1_full_filt, dims = 1:12)

In [None]:
cd4_l1_full_filt <- FindClusters(cd4_l1_full_filt, resolution = 0.6)

In [None]:
DimPlot(cd4_l1_full_filt, label = T)

In [None]:
cd4_l1_full_filt@meta.data  <- cd4_l1_full_filt@meta.data  %>% 
mutate(annotations_manual = recode_factor(seurat_clusters, 
                                      "0" = "CD4 T cells",
                                     "1" = "CD4 T cells",
                                     "2" = "CD4 T cells",
                                     "3" = "CD4 T cells",
                                     "4" = "CD4 T cells",
                                     "5" = "CD4 T cells",
                                     "6" = "CD4 T cells",
                                     "7" = "CD4 T cells",
                                     "8" = "CD4 T cells",
                                     "9" = "CD4 T cells",
                                     "10" = "Unconventional T cells",
                                     "11" = "CD4 T cells",
                                     "12" = "CD4 T cells"))

In [None]:
cd4_l1_full_filt@misc$cols_annotations  <- c("#ffd42aff", "#bf14c6ff")

In [None]:
cd4_l1_full_filt@misc$dataset_name  <- "cd4_l1_full_filt"

In [None]:
cd4_l1_full_filt@misc$all_md  <- cd4_l1_full_filt@meta.data  %>% 
                            dplyr::select(Sample_ID, Condition, Condition2, 
                                          Disease, 
                                          Sex, Age, Age_group, Patient_ID, 
                                          Time, Experiment_ID)   %>% unique

In [None]:
options(repr.plot.width = 11, repr.plot.height = 8)
DimPlot(cd4_l1_full_filt, cols = cd4_l1_full_filt@misc$cols_annotations, group.by = "annotations_manual")

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6.5)
save_dimplot_plot(seurat_dataset = cd4_l1_full_filt)

In [None]:
saveRDS(cd4_l1_full_filt, "../data/processed/L1/cd4_l1_full_filt.rds")

### Cluster composition

In [None]:
process_plots_from_dataset(seurat_dataset = cd4_l1_full_filt)

### Save frequencies

In [None]:
df4  <- create_df4(cd4_l1_full_filt)

In [None]:
df4

In [None]:
dir_create("../tables/cd4/")

In [None]:
dir.create("../tables/cd4/markers_annotations/")
dir.create("../tables/cd4/frequencies/")

In [None]:
freq  <- df4  %>% dplyr::select(1:3)
write.csv(freq, "../tables/cd4/frequencies/freq_cd4_l1_full_filt.csv", row.names = FALSE)

### Save markers

In [None]:
Idents(cd4_l1_full_filt)  <- cd4_l1_full_filt$annotations_manual

In [None]:
mrk  <- FindAllMarkers(cd4_l1_full_filt)

In [None]:
mrk  <- rank_score_func(mrk)

In [None]:
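# The directory may already exist from the frequencies step, so suppress the warning
dir.create("../tables/cd4/markers_annotations/", showWarnings = FALSE)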

In [None]:
write.csv(mrk, "../tables/cd4/markers_annotations/mrk_cd4_l1_full_filt.csv", row.names = FALSE)

## CD4 L1 Dorothea

We will use the decoupleR package for pathway activity and transcription factor activity estimation. Please see the GitHub page and the publication for more info.

* [GitHub](https://github.com/saezlab/CollecTRI)
* [Paper](https://academic.oup.com/nar/article/51/20/10934/7318114?login=false)    

In [None]:
net <- get_progeny(organism = 'human', top = 200)
net2 <- decoupleR::get_collectri(organism='human', split_complexes=FALSE)

In [None]:
  data <- cd4_l1_full_filt
# Extract the normalized log-transformed counts
mat <- as.matrix(data@assays$RNA@data)

######## Pathways Progeny #########   
    
# Run wmean
acts <- run_wmean(mat=mat, net=net, .source='source', .target='target',
                  .mor='weight', times = 100, minsize = 5)
  
# Add data to Seurat object

  data[['pathwayswmean']] <- acts %>%
  filter(statistic == 'norm_wmean') %>%
  pivot_wider(id_cols = 'source', names_from = 'condition',
              values_from = 'score') %>%
  column_to_rownames('source') %>%
  Seurat::CreateAssayObject(.)

  # Scale the data
DefaultAssay(object = data) <- "pathwayswmean"

data <- ScaleData(data)
data@assays$pathwayswmean@data <- data@assays$pathwayswmean@scale.data
rownames(data@assays$pathwayswmean@data)

######## CollecTRI ######### 

# Run ULM
acts <- run_ulm(mat=mat, net=net2, .source='source', .target='target',
                .mor='mor', minsize = 5)
  
# Add data to Seurat object
  data[['CollecTRI']] <- acts %>%
  pivot_wider(id_cols = 'source', names_from = 'condition',
              values_from = 'score') %>%
  column_to_rownames('source') %>%
  Seurat::CreateAssayObject(.)

  # Scale the data
DefaultAssay(object = data) <- "CollecTRI"

data <- ScaleData(data)
data@assays$CollecTRI@data <- data@assays$CollecTRI@scale.data
rownames(data@assays$CollecTRI@data)

DefaultAssay(object = data) <- "integrated"
saveRDS(data, paste0("../data/processed/L1/cd4_l1_full_filt.rds"))
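
To give an intuition for the transcription factor scores, here is a minimal base R sketch of the idea behind a univariate linear model (ULM) score, as we understand it: for one cell and one regulator, expression is regressed on the regulator's target weights (zero for non-targets), and the t-value of the slope is taken as the activity. The genes, expression values and weights below are made up, and this is not the decoupleR implementation itself.

```r
# Toy expression for one cell across six genes (hypothetical values)
expr <- c(G1 = 2.0, G2 = 1.5, G3 = 0.1, G4 = 0.0, G5 = 0.3, G6 = 1.8)

# Toy regulon: the TF activates G1, G2 and G6, represses G4;
# genes that are not targets get weight 0
mor <- c(G1 = 1, G2 = 1, G3 = 0, G4 = -1, G5 = 0, G6 = 1)

# Fit expression ~ weight and use the t-value of the slope as the activity score
fit <- lm(expr ~ mor)
ulm_score <- summary(fit)$coefficients["mor", "t value"]
ulm_score
```

A positive score means that activated targets tend to be highly expressed and repressed targets lowly expressed in this cell.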


## Correlation CD45RA/RO with age

In [None]:
cd4_l1_full_filt

In [None]:
cd4_samples3  <-  AverageExpression(cd4_l1_full_filt, group.by = c("Sample_char", "Age"), return.seurat = T, 
                                   assay = "PTPRC")

cd4_samples3@meta.data  <- cd4_samples3@meta.data  %>% separate(Sample_char, 
                                                              into = c("Patient_ID",
                                                                      "Disease",
                                                                      "Time",
                                                                      "Age_group",
                                                                      "Sex",
                                                                      "Exp"), 
                                                             sep = " ",
                                                             remove = F)

cd4_samples3@meta.data  <- cd4_samples3@meta.data  %>% 
mutate(Enrichment = if_else(Exp %in% c("Exp08","Exp10","Exp11"), "Initial", "Final")) 
                                                         

In [None]:
rownames(cd4_samples3)

In [None]:
df_ra_ro = as.data.frame(t(cd4_samples3@assays$PTPRC$data))

In [None]:
df_ra_ro  <- cbind(df_ra_ro, cd4_samples3@meta.data  %>% dplyr::select(Age_group, Age, Enrichment, Sample_char, Disease))

In [None]:
df_ra_ro

### PTPRC-RA

In [None]:
options(repr.plot.width = 8, repr.plot.height = 5)
df_ra_ro  %>% ggplot(aes(x = as.numeric(Age), y = `PTPRC-RA`, color = Enrichment))  +
geom_point() +
ggpubr::stat_cor() +
ggtitle("PTPRC-RA") +
theme_classic() +
ggtheme() 

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
df_ra_ro  %>% 
dplyr::filter(Enrichment == "Final")  %>% 
ggplot(aes(x = as.numeric(Age), y = `PTPRC-RA`))  +
geom_point() +
ggpubr::stat_cor() +
 geom_smooth(method=lm) +
ggtitle("PTPRC-RA") +
theme_classic() +
ggtheme() 
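
The coefficient printed by `ggpubr::stat_cor()` is a Pearson correlation by default, and `geom_smooth(method = lm)` draws an ordinary least-squares line. The same quantities can be computed directly, as in this sketch on made-up data; `toy`, its ages and its `ra` values are hypothetical stand-ins for `df_ra_ro`.

```r
# Toy per-sample table (hypothetical ages and averaged PTPRC-RA values)
toy <- data.frame(
  Age = c("3", "7", "12", "15", "18"),   # stored as character, as in the metadata
  ra  = c(2.9, 2.6, 2.2, 2.0, 1.6)
)
toy$Age_num <- as.numeric(toy$Age)

# Pearson correlation test, matching the default of stat_cor()
ct <- cor.test(toy$Age_num, toy$ra, method = "pearson")

# Least-squares fit, matching geom_smooth(method = lm)
fit <- lm(ra ~ Age_num, data = toy)

c(r = unname(ct$estimate), p = ct$p.value, slope = unname(coef(fit)["Age_num"]))
```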

In [None]:
df_ra_ro  %>% 
dplyr::filter(Enrichment == "Final")  %>% 
group_by(Age_group)  %>% 
ggplot(aes(x = as.numeric(Age_group), y = `PTPRC-RA`, fill = Age_group))  +
ggbeeswarm::geom_beeswarm() +
geom_violin( alpha = .2, scale = "width") +
theme_classic() +
ggtheme() + 
  stat_summary(fun = "mean",
               geom = "crossbar", 
               width = 0.5,
               colour = "black") +
ggtitle("PTPRC-RA")

### PTPRC-RO

In [None]:
options(repr.plot.width = 8, repr.plot.height = 5)
df_ra_ro  %>% ggplot(aes(x = as.numeric(Age), y = `PTPRC-RO`, color = Enrichment))  +
geom_point() +
ggpubr::stat_cor() + 
ggtitle("PTPRC-RO") +
theme_classic() +
ggtheme() 

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
df_ra_ro  %>% 
dplyr::filter(Enrichment == "Final")  %>% 
ggplot(aes(x = as.numeric(Age), y = `PTPRC-RO`))  +
geom_point() +
ggpubr::stat_cor() +
 geom_smooth(method=lm) +
ggtitle("PTPRC-RO")  +
theme_classic() +
ggtheme() 

In [None]:
df_ra_ro  %>% 
dplyr::filter(Enrichment == "Final")  %>% 
group_by(Age_group)  %>% 
ggplot(aes(x = as.numeric(Age_group), y = `PTPRC-RO`, fill = Age_group))  +
ggbeeswarm::geom_beeswarm() +
geom_violin( alpha = .2, scale = "width") +
theme_classic() +
ggtheme() + 
  stat_summary(fun = "mean",
               geom = "crossbar", 
               width = 0.5,
               colour = "black") +
ggtitle("PTPRC-RO")

# Analysis CD4 Level 2: Conventional CD4

In [None]:
cd4_l1_full_filt  <- readRDS("../data/processed/L1/cd4_l1_full_filt.rds")

In [None]:
cd4_l1_full_filt$Patient_Time  %>% table

In [None]:
plan("sequential")

In [None]:
cd4_l1_full_filt

In [None]:
cd4_subcluster  <- subset(cd4_l1_full_filt, annotations_manual == "CD4 T cells")

In [None]:
merged.list  <- SplitObject(cd4_subcluster, split.by = "Experiment_ID")

In [None]:
# normalize and identify variable features for each dataset independently
merged.list <- lapply(X = merged.list, FUN = function(x) {
    DefaultAssay(x)  <- "RNA"
    x$barcode  <- colnames(x)
    x <- NormalizeData(x)
    x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

In [None]:
new_dia_experiment2 <- Run.STACAS(merged.list, dims = 1:10)
new_dia_experiment2 <- RunUMAP(new_dia_experiment2, dims = 1:10) 

In [None]:
cd4_subcluster <- FindNeighbors(new_dia_experiment2, reduction = "pca", dims = 1:10)

In [None]:
DefaultAssay(cd4_subcluster)  <- "integrated"

In [None]:
cd4_subcluster <- FindClusters(cd4_subcluster, resolution = 0.6)

In [None]:
options(repr.plot.width = 8, repr.plot.height = 7)

DimPlot(cd4_subcluster, label = T, raster = T)

### Cluster markers

In [None]:
### Naive

DefaultAssay(cd4_subcluster)  <- "integrated"

mrk  <- FindAllMarkers(cd4_subcluster, only.pos = TRUE)

mrk  <- rank_score_func(mrk)

markers  <- mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 4)  %>% pull(gene)
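
The last line above keeps the top-scoring genes for each cluster. The selection logic is sketched below in base R on a toy marker table; the `score` values are made up (in the notebook they come from `rank_score_func()`), and `top_n_per_cluster()` is a hypothetical helper.

```r
# Toy marker table with two clusters (hypothetical scores)
toy_mrk <- data.frame(
  cluster = c("0", "0", "0", "1", "1", "1"),
  gene    = c("CCR7", "SELL", "LEF1", "FOXP3", "IL2RA", "IKZF2"),
  score   = c(0.9, 0.7, 0.2, 0.8, 0.95, 0.1)
)

# Hypothetical helper: sort by cluster and descending score, keep the first n rows per cluster
top_n_per_cluster <- function(df, n) {
  df <- df[order(df$cluster, -df$score), ]
  keep <- ave(seq_len(nrow(df)), df$cluster, FUN = seq_along) <= n
  df$gene[keep]
}

top_genes <- top_n_per_cluster(toy_mrk, n = 2)
top_genes
```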

In [None]:
mrk

In [None]:
markers2  <- rev(c("SELL", "CCR7", "LEF1", "TCF7", 
              "CXCR5", "IL21", "BCL6",
              
              "ISG15", "IFIT1", "OAS3",
              "FOXP3", "IKZF2","IL2RA",  "TNFRSF18",
              "GATA3", "IL4", "IL13",
             "IL17A", "IL23R", "RORC" , "IL7R",
              "CCL5", 
              "TBX21", "IFNG","ZBTB16",
              "CXCR6", "CXCR3",
              "KLRG1","CX3CR1","CD27","B3GAT1","CD28","FAS",
             "MKI67", "PCNA", "MCM6" ))

avgexp = AverageExpression(cd4_subcluster, features = c(markers, markers2),
                           return.seurat = F, group.by = "seurat_clusters", 
                          assay = "RNA")

avgexp$RNA

In [None]:
options(repr.plot.width = 7, repr.plot.height = 15)
pheatmap(avgexp$RNA, main = "", 
         scale = "row", cluster_cols = T, cluster_rows = T,
        color=colorRampPalette(c("dodgerblue", "grey95", "indianred2"))(50), 
         border_color = "white",
                  fontsize = 9)

### Cluster tree

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
DimPlot(cd4_subcluster, label = T)

In [None]:
cd4_subcluster <- BuildClusterTree(
  cd4_subcluster,
  dims = 1:10,
  reorder = FALSE,
  reorder.numeric = FALSE)

In [None]:
tree <- cd4_subcluster@tools$BuildClusterTree
tree$tip.label <- paste0("Cluster ", tree$tip.label)

In [None]:
#cols  <- c("#f1cc7dff","#e1b82bff","#e1b82bff","#eb7a8bff","#f1a46cff","#e1b82bff",
#  "#fcc9deff", "indianred3", "#fcc9deff", "#e9afafff", "#dc6b2fff", "#7d252aff")

In [None]:
p <- ggtree::ggtree(tree, aes(x, y)) +
  scale_y_reverse() +
  ggtree::geom_tree() +
  ggtree::theme_tree() +
  ggtree::geom_tiplab(offset = 1) +
  ggtree::geom_tippoint(shape = 16, size = 5) +
  coord_cartesian(clip = 'off') +
  theme(plot.margin = unit(c(0,2.5,0,0), 'cm'))

#ggsave('plots/cluster_tree.png', p, height = 4, width = 6)

In [None]:
options(repr.plot.width=3.5, repr.plot.height=3)
p

### Annotations L2

In [None]:
DimPlot(cd4_subcluster, label = T)

In [None]:
cd4_subcluster@meta.data  <- cd4_subcluster@meta.data  %>% 
mutate(annotations_manual = recode_factor(seurat_clusters, "0" = "Tfh",
                                     "1" = "Naive",
                                     "2" = "Th2",
                                     "3" = "Naive",
                                     "4" = "Treg",
                                     "5" = "Naive",
                                     "6" = "Th1Th17",
                                     "7" = "ISAGhi",
                                     "8" = "Treg",
                                     "9" = "Nfkb",
                                     "10" = "Proliferating",
                                     "11" = "Temra"))

In [None]:
cd4_subcluster@misc$cols_annotations  <- c(
       "#f1cc7dff", # Tfh
       "#e1b82bff",  # Naive
       "#eb7a8bff", # Th2
       "#fcc9deff", # Treg
       "indianred3",  # Th1Th17
       "#e9afafff",  # ISAGhi
       "#f1a46cff",  # Nfkb
       "orchid3", # Proliferating
       "#7d252aff") # Temra 
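
The colour vector above has nine entries for twelve clusters because several clusters share one label. `recode_factor()` orders the factor levels by the order of the replacements, so each colour lines up with a label in order of first appearance. A minimal base-R sketch of that ordering, using a hypothetical toy vector `toy_clusters` and only a subset of the mapping:

```r
# Toy cluster assignments (hypothetical; stands in for seurat_clusters)
toy_clusters <- factor(c("0", "1", "3", "4", "8", "1", "5"))

# Cluster-to-label lookup, a subset of the mapping used above
lookup <- c("0" = "Tfh", "1" = "Naive", "3" = "Naive",
            "4" = "Treg", "5" = "Naive", "8" = "Treg")

# Levels follow the order of first appearance in the lookup,
# which mimics the level order that recode_factor() produces
toy_labels <- factor(unname(lookup[as.character(toy_clusters)]),
                     levels = unique(lookup))

levels(toy_labels)  # order in which colours would be assigned
table(toy_labels)   # merged clusters are counted under one label
```

If a cluster is relabelled or the order of the replacements changes, the colour vector has to be reordered to match.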


In [None]:
options(repr.plot.width = 11, repr.plot.height = 9)

DimPlot(cd4_subcluster, label = F, group.by = "annotations_manual", 
        cols = cd4_subcluster@misc$cols_annotations, pt.size = 0.05
     )

### Cluster composition

In [None]:
cd4_subcluster@misc$dataset_name  <- "cd4_subcluster"

In [None]:
cd4_subcluster@misc$all_md  <- cd4_subcluster@meta.data  %>% 
                            dplyr::select(Sample_ID, Condition, Condition2, 
                                          Disease, 
                                          Sex, Age, Age_group, Patient_ID, 
                                          Time, Experiment_ID)   %>% unique

In [None]:
options(repr.plot.width = 11, repr.plot.height = 8)
DimPlot(cd4_subcluster, cols = cd4_subcluster@misc$cols_annotations, group.by = "annotations_manual")

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6.5)
save_dimplot_plot(seurat_dataset = cd4_subcluster)

In [None]:
saveRDS(cd4_subcluster, "../data/processed/L2/cd4_subcluster.rds")

In [None]:
process_plots_from_dataset(seurat_dataset = cd4_subcluster)

### Save frequencies

In [None]:
df4  <- create_df4(cd4_subcluster)

In [None]:
df4

In [None]:
dir_create("../tables/cd4/")

In [None]:
dir.create("../tables/cd4/markers_annotations/")
dir.create("../tables/cd4/frequencies/")

In [None]:
freq  <- df4  %>% dplyr::select(1:3)
write.csv(freq, "../tables/cd4/frequencies/freq_cd4_subcluster.csv", row.names = FALSE)
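
`create_df4()` comes from the sourced helper script and is not shown here, so its exact output is not reproduced. As a rough illustration only, per-sample population frequencies of the kind such a table might hold could be computed as follows; the toy data frame `toy_md` and the output column names are assumptions:

```r
# Hypothetical per-cell metadata: one row per cell
toy_md <- data.frame(
  Sample_ID = c("S1", "S1", "S1", "S1", "S2", "S2"),
  annotations_manual = c("Naive", "Naive", "Treg", "Tfh", "Naive", "Treg"),
  stringsAsFactors = FALSE
)

# Count cells per sample and population
counts <- table(toy_md$Sample_ID, toy_md$annotations_manual)

# Convert counts to within-sample fractions (rows sum to 1)
freq_toy <- as.data.frame(prop.table(counts, margin = 1))
colnames(freq_toy) <- c("Sample_ID", "annotations_manual", "freq")

freq_toy
```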

### Save markers

In [None]:
Idents(cd4_subcluster)  <- cd4_subcluster$annotations_manual

In [None]:
mrk  <- FindAllMarkers(cd4_subcluster)

In [None]:
mrk  <- rank_score_func(mrk)

In [None]:
write.csv(mrk, "../tables/cd4/markers_annotations/mrk_cd4_subcluster.csv", row.names = FALSE)

## Correlation CD45RA/RO with age

In [None]:
# cd4_subcluster  <- readRDS("../../240617_VN_Diabetes_V06/data/processed/L2/cd4_subcluster.rds")

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6.5)

DimPlot(cd4_subcluster, raster = T, group.by = "Experiment_ID", 
       cols = c("salmon","red3", "dodgerblue1","dodgerblue2","dodgerblue3","dodgerblue4"))

In [None]:
cd4_samples3  <-  AverageExpression(cd4_subcluster, group.by = c("Sample_char", "Age"), return.seurat = T, 
                                   assay = "PTPRC")

cd4_samples3@meta.data  <- cd4_samples3@meta.data  %>% separate(Sample_char, 
                                                              into = c("Patient_ID",
                                                                      "Disease",
                                                                      "Time",
                                                                      "Age_group",
                                                                      "Sex",
                                                                      "Exp"), 
                                                             sep = " ",
                                                             remove = F)

cd4_samples3@meta.data  <- cd4_samples3@meta.data  %>% 
mutate(Enrichment = if_else(Exp %in% c("Exp08","Exp10","Exp11"), "Initial", "Final")) 
                                                         

In [None]:
rownames(cd4_samples3)

In [None]:
df_ra_ro = as.data.frame(t(cd4_samples3@assays$PTPRC$data))

In [None]:
df_ra_ro  <- cbind(df_ra_ro, cd4_samples3@meta.data  %>% dplyr::select(Age_group, Age, Enrichment, Sample_char, Disease))

In [None]:
df_ra_ro

### PTPRC-RA

In [None]:
options(repr.plot.width = 8, repr.plot.height = 5)
df_ra_ro  %>% ggplot(aes(x = as.numeric(Age), y = `PTPRC-RA`, color = Enrichment))  +
geom_point() +
ggpubr::stat_cor() +
ggtitle("PTPRC-RA") +
theme_classic() +
ggtheme() 
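
`ggpubr::stat_cor()` prints a correlation coefficient and p-value on the plot, using Pearson correlation unless another method is requested. To get the same kind of statistic as a number, `cor.test()` can be run directly. The sketch below uses a made-up data frame `toy_df`, not the study data:

```r
# Hypothetical per-sample values: age and average marker expression
toy_df <- data.frame(
  Age  = c(3, 5, 8, 10, 12, 15, 17),
  expr = c(2.9, 2.7, 2.4, 2.5, 2.1, 1.9, 1.8)
)

# Pearson correlation test between age and expression
res <- cor.test(toy_df$Age, toy_df$expr, method = "pearson")

res$estimate  # correlation coefficient
res$p.value   # p-value of the test
```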

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
df_ra_ro  %>% 
dplyr::filter(Enrichment == "Final")  %>% 
ggplot(aes(x = as.numeric(Age), y = `PTPRC-RA`))  +
geom_point() +
ggpubr::stat_cor() +
 geom_smooth(method=lm) +
ggtitle("PTPRC-RA") +
theme_classic() +
ggtheme() 

In [None]:
df_ra_ro  %>% 
dplyr::filter(Enrichment == "Final")  %>% 
group_by(Age_group)  %>% 
ggplot(aes(x = as.numeric(Age_group), y = `PTPRC-RA`, fill = Age_group))  +
ggbeeswarm::geom_beeswarm() +
geom_violin( alpha = .2, scale = "width") +
theme_classic() +
ggtheme() + 
  stat_summary(fun = "mean",
               geom = "crossbar", 
               width = 0.5,
               colour = "black") +
ggtitle("PTPRC-RA")

### PTPRC-RO

In [None]:
options(repr.plot.width = 8, repr.plot.height = 5)
df_ra_ro  %>% ggplot(aes(x = as.numeric(Age), y = `PTPRC-RO`, color = Enrichment))  +
geom_point() +
ggpubr::stat_cor() + 
ggtitle("PTPRC-RO") +
theme_classic() +
ggtheme() 

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
df_ra_ro  %>% 
dplyr::filter(Enrichment == "Final")  %>% 
ggplot(aes(x = as.numeric(Age), y = `PTPRC-RO`))  +
geom_point() +
ggpubr::stat_cor() +
 geom_smooth(method=lm) +
ggtitle("PTPRC-RO")  +
theme_classic() +
ggtheme() 

In [None]:
df_ra_ro  %>% 
dplyr::filter(Enrichment == "Final")  %>% 
group_by(Age_group)  %>% 
ggplot(aes(x = as.numeric(Age_group), y = `PTPRC-RO`, fill = Age_group))  +
ggbeeswarm::geom_beeswarm() +
geom_violin( alpha = .2, scale = "width") +
theme_classic() +
ggtheme() + 
  stat_summary(fun = "mean",
               geom = "crossbar", 
               width = 0.5,
               colour = "black") +
ggtitle("PTPRC-RO")

# Analysis CD4 Level 2: Unconventional CD4

In [None]:
# cd4_l1_full_filt  <- readRDS("../data/processed/L1/cd4_l1_full_filt.rds")

In [None]:
cd4_l1_full_filt$annotations_manual  %>% table

In [None]:
cd4_l2_unc  <- subset(cd4_l1_full_filt, annotations_manual == "Unconventional T cells")

In [None]:
cd4_l2_unc@meta.data  <- cd4_l2_unc@meta.data  %>% 
    mutate(Experiment_ID_2 = 
           ifelse(Experiment_ID %in% c("Exp10", "Exp11"), "Exp10_11",Experiment_ID ))

In [None]:
merged.list  <- SplitObject(cd4_l2_unc, split.by = "Experiment_ID_2")

In [None]:
merged.list

In [None]:
# normalize and identify variable features for each dataset independently
merged.list <- lapply(X = merged.list, FUN = function(x) {
    DefaultAssay(x)  <- "RNA"
    x$barcode  <- colnames(x)
    x <- NormalizeData(x)
    x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

In [None]:

stacas_anchors <- FindAnchors.STACAS(merged.list, 
                                     dims = 1:12, 
                                     min.sample.size = 65)
st1 <- SampleTree.STACAS(
  anchorset = stacas_anchors,
  obj.names = names(merged.list)
  )    

new_dia_experiment2 <- IntegrateData.STACAS(stacas_anchors,
                                          sample.tree = st1,
                                          dims=1:12) %>% ScaleData() %>%
  RunPCA(npcs=12) %>% RunUMAP(dims=1:12)

new_dia_experiment2 <- FindNeighbors(new_dia_experiment2, reduction = "pca", dims = 1:12)
new_dia_experiment2 <- FindClusters(new_dia_experiment2, resolution = 0.3)

DimPlot(new_dia_experiment2, label = T)

mrk  <- FindAllMarkers(new_dia_experiment2, logfc.threshold = log(1.5))

write.csv(mrk, paste0("../tables/cd4/markers_cd4_l2_unc.csv"))

saveRDS(new_dia_experiment2, paste0("../data/processed/L2/cd4_l2_unc.rds"))

In [None]:
cd4_l2_unc <- FindNeighbors(new_dia_experiment2, reduction = "pca", dims = 1:12)

In [None]:
DefaultAssay(cd4_l2_unc)  <- "integrated"

In [None]:
cd4_l2_unc <- FindClusters(cd4_l2_unc, resolution = 0.3)

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)

DimPlot(cd4_l2_unc, label = T)

In [None]:
mrk  <- FindAllMarkers(cd4_l2_unc, only.pos = TRUE)

mrk  <- rank_score_func(mrk)

markers  <- mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 4)  %>% pull(gene)

options(repr.plot.width = 16, repr.plot.height = 16)
FeaturePlot(cd4_l2_unc, features = markers,
           min.cutoff = 0, ncol = 4)

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)

DimPlot(cd4_l2_unc, label = T)

In [None]:
mrk  %>% filter(cluster == 2)

In [None]:
cd4_l2_unc@meta.data  <- cd4_l2_unc@meta.data  %>% 
mutate(annotations_manual = recode_factor(seurat_clusters, 
                                         "0" = "Unc1: LGALS1 CRIP2 S100A10",
                                         "1" = "Unc2: CCR7 HCST FYB1",
                                         "2" = "Unc3: IFI44L ISG15 XAF1",
                                         "3" = "Unc4: CD2 TMEM117 SNRNP27"))

In [None]:
cd4_l2_unc@misc$cols_annotations  <- c(
             "#cd634aff", "#deaa87ff", "#a0892cff", "red"
)


In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)

DimPlot(cd4_l2_unc, label = T, group.by = "annotations_manual", 
        cols = cd4_l2_unc@misc$cols_annotations)

In [None]:
plan("multisession")

### Cluster composition

In [None]:
cd4_l2_unc@misc$dataset_name  <- "cd4_l2_unc"

In [None]:
cd4_l2_unc@misc$all_md  <- cd4_l2_unc@meta.data  %>% 
                            dplyr::select(Sample_ID, Condition, Condition2, 
                                          Disease, 
                                          Sex, Age, Age_group, Patient_ID, 
                                          Time, Experiment_ID)   %>% unique

In [None]:
options(repr.plot.width = 11, repr.plot.height = 8)
DimPlot(cd4_l2_unc, cols = cd4_l2_unc@misc$cols_annotations, group.by = "annotations_manual")

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6.5)
save_dimplot_plot(seurat_dataset = cd4_l2_unc)

In [None]:
saveRDS(cd4_l2_unc, "../data/processed/L2/cd4_l2_unc.rds")

In [None]:
process_plots_from_dataset(seurat_dataset = cd4_l2_unc)

### Save frequencies

In [None]:
df4  <- create_df4(cd4_l2_unc)

In [None]:
df4

In [None]:
dir_create("../tables/cd4/")

In [None]:
dir.create("../tables/cd4/markers_annotations/")
dir.create("../tables/cd4/frequencies/")

In [None]:
freq  <- df4  %>% dplyr::select(1:3)
write.csv(freq, "../tables/cd4/frequencies/freq_cd4_l2_unc.csv", row.names = FALSE)

### Save markers

In [None]:
Idents(cd4_l2_unc)  <- cd4_l2_unc$annotations_manual

In [None]:
mrk  <- FindAllMarkers(cd4_l2_unc)

In [None]:
mrk  <- rank_score_func(mrk)

In [None]:
mrk

In [None]:
write.csv(mrk, "../tables/cd4/markers_annotations/mrk_cd4_l2_unc.csv", row.names = FALSE)

# Analysis Level 3

In [None]:
#cd4_l1_full_filt  <- readRDS("../data/processed/L1/cd4_l1_full_filt.rds")

In [None]:
#cd4_subcluster  <- readRDS("../data/processed/L2/cd4_subcluster.rds")

In [None]:
DimPlot(cd4_subcluster, group.by = "annotations_manual")

In [None]:
cd4_l3_naive  <- subset(cd4_subcluster, annotations_manual == "Naive")
cd4_l3_tfh  <- subset(cd4_subcluster, annotations_manual == "Tfh")
cd4_l3_th1_17  <- subset(cd4_subcluster, annotations_manual == "Th1Th17")
cd4_l3_th2  <- subset(cd4_subcluster, annotations_manual == "Th2")
cd4_l3_treg  <- subset(cd4_subcluster, annotations_manual == "Treg")
cd4_l3_isaghi  <- subset(cd4_subcluster, annotations_manual == "ISAGhi")
cd4_l3_nfkb  <- subset(cd4_subcluster, annotations_manual == "Nfkb")
cd4_l3_proliferating  <- subset(cd4_subcluster, annotations_manual == "Proliferating")
cd4_l3_temra  <- subset(cd4_subcluster, annotations_manual == "Temra")

In [None]:
cd4_l3_list  <- list(cd4_l3_naive, cd4_l3_tfh, cd4_l3_th1_17, cd4_l3_nfkb, cd4_l3_th2,
                     cd4_l3_treg, cd4_l3_isaghi, cd4_l3_proliferating, cd4_l3_temra)

In [None]:
names_list  <- c("cd4_l3_naive", "cd4_l3_tfh", "cd4_l3_th1_17", "cd4_l3_nfkb", "cd4_l3_th2",
                     "cd4_l3_treg", "cd4_l3_isaghi", "cd4_l3_proliferating", "cd4_l3_temra")

In [None]:
plan("sequential")

Ensure that each dataset we split into contains at least 100 cells. To this end, experiments 10 and 11 are pooled into a single batch, `Exp10_11`.

In [None]:
for(i in 1:9){
    seurat  <- cd4_l3_list[[i]]
    seurat@meta.data  <- seurat@meta.data  %>% 
    mutate(Experiment_ID_2 = 
           ifelse(Experiment_ID %in% c("Exp10", "Exp11"), "Exp10_11",Experiment_ID ))
    cd4_l3_list[[i]]  <- seurat
}

In [None]:
for(i in 1:9){
    seurat  <- cd4_l3_list[[i]]
    print(seurat$Experiment_ID_2  %>% table)
}
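
The pooling and the size check above can be tried on a toy vector first. This sketch assumes made-up experiment labels and cell counts and an illustrative threshold `min_cells`; it only mirrors the `ifelse()` recoding and the `table()` check, not the integration itself:

```r
# Hypothetical experiment labels, one entry per cell
toy_exp <- c(rep("Exp10", 40), rep("Exp11", 70),
             rep("Exp16", 300), rep("Exp18", 150))

# Pool the two small experiments into one batch
toy_exp2 <- ifelse(toy_exp %in% c("Exp10", "Exp11"), "Exp10_11", toy_exp)

# Number of cells per batch after pooling
sizes <- table(toy_exp2)
sizes

# Batches that would still be too small (illustrative threshold)
min_cells <- 100
names(sizes)[sizes < min_cells]
```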

In [None]:
for(i in 1:9){
    seurat  <- cd4_l3_list[[i]]

    merged.list  <- SplitObject(seurat, split.by = "Experiment_ID_2")

    # normalize and identify variable features for each dataset independently
    merged.list <- lapply(X = merged.list, FUN = function(x) {
        DefaultAssay(x)  <- "RNA"
        x$barcode  <- colnames(x)
        x <- NormalizeData(x)
        x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
    })

    # find integration anchors and build the sample tree
    stacas_anchors <- FindAnchors.STACAS(merged.list,
                                         dims = 1:12,
                                         min.sample.size = 50)
    st1 <- SampleTree.STACAS(
        anchorset = stacas_anchors,
        obj.names = names(merged.list)
    )

    # integrate, then scale, reduce and embed
    new_dia_experiment2 <- IntegrateData.STACAS(stacas_anchors,
                                                sample.tree = st1,
                                                dims = 1:12) %>% ScaleData() %>%
        RunPCA(npcs = 12) %>% RunUMAP(dims = 1:12)

    new_dia_experiment2 <- FindNeighbors(new_dia_experiment2, reduction = "pca", dims = 1:12)
    new_dia_experiment2 <- FindClusters(new_dia_experiment2, resolution = 0.3)

    # plots are not auto-printed inside a for loop, so print explicitly
    print(DimPlot(new_dia_experiment2, label = T))

    mrk  <- FindAllMarkers(new_dia_experiment2, logfc.threshold = log(1.5))

    write.csv(mrk, paste0("../tables/cd4/markers_", names_list[i], ".csv"))

    saveRDS(new_dia_experiment2, paste0("../data/processed/L3/", names_list[i], ".rds"))
}

In [None]:
cd4_l3_naive  <- readRDS("../data/processed/L3/cd4_l3_naive.rds")
cd4_l3_tfh  <- readRDS("../data/processed/L3/cd4_l3_tfh.rds")
cd4_l3_th1th17  <- readRDS("../data/processed/L3/cd4_l3_th1_17.rds")
cd4_l3_nfkb  <- readRDS("../data/processed/L3/cd4_l3_nfkb.rds")
cd4_l3_th2  <- readRDS("../data/processed/L3/cd4_l3_th2.rds")
cd4_l3_treg  <- readRDS("../data/processed/L3/cd4_l3_treg.rds")
cd4_l3_isaghi  <- readRDS("../data/processed/L3/cd4_l3_isaghi.rds")
cd4_l3_proliferating  <- readRDS("../data/processed/L3/cd4_l3_proliferating.rds")
cd4_l3_temra  <- readRDS("../data/processed/L3/cd4_l3_temra.rds")

In [None]:
cd4_l3_list  <- list(cd4_l3_naive, cd4_l3_tfh, cd4_l3_th1th17, cd4_l3_nfkb, cd4_l3_th2,
                     cd4_l3_treg, cd4_l3_isaghi, cd4_l3_proliferating, cd4_l3_temra)

names_list  <- c("cd4_l3_naive", "cd4_l3_tfh", "cd4_l3_th1_17", "cd4_l3_nfkb", "cd4_l3_th2",
                     "cd4_l3_treg", "cd4_l3_isaghi", "cd4_l3_proliferating", "cd4_l3_temra")

In [None]:
for(i in 1:9){
    seurat_dataset  <- cd4_l3_list[[i]]
    seurat_dataset@misc$cols_annotations  <- scales::hue_pal(h.start = 20)(length(levels(factor(seurat_dataset$seurat_clusters))))
    seurat_dataset$annotations_manual  <- paste("Cluster", seurat_dataset$seurat_clusters)
    # the list is unnamed at this point, so take the name from names_list
    seurat_dataset@misc$dataset_name  <- names_list[i]
    seurat_dataset@misc$all_md  <- cd4_l1_full_filt@meta.data  %>% 
                                dplyr::select(Sample_ID, Condition, Condition2, 
                                              Disease, 
                                              Sex, Age, Age_group, Patient_ID, 
                                              Time, Experiment_ID)   %>% unique
    # write the modified object back so the changes are kept
    cd4_l3_list[[i]]  <- seurat_dataset
}

## Annotations level 3

### Naive

In [None]:
DefaultAssay(cd4_l3_naive)  <- "integrated"

In [None]:
options(repr.plot.width = 7, repr.plot.height = 6)
DimPlot(cd4_l3_naive)

In [None]:
cd4_l3_naive <- FindNeighbors(cd4_l3_naive, reduction = "pca", dims = 1:12)
cd4_l3_naive <- FindClusters(cd4_l3_naive, resolution = 0.25)

In [None]:
options(repr.plot.width = 7, repr.plot.height = 6)
DimPlot(cd4_l3_naive, label = T)

In [None]:
mrk  <- FindAllMarkers(cd4_l3_naive, only.pos = TRUE)

In [None]:
mrk  <- rank_score_func(mrk)

In [None]:
mrk  %>% filter(cluster == 1)  %>% mutate(diff.pct = pct.1-pct.2)  %>% arrange(desc(diff.pct))

In [None]:
markers  <- mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 4)  %>% pull(gene)
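
The line above keeps the four highest-scoring genes per cluster. The same selection can be sketched in base R on a hypothetical marker table `toy_mrk`, here keeping two genes per cluster; the `score` column is assumed to be the one added by `rank_score_func()`, whose definition is not shown here:

```r
# Hypothetical marker table: cluster, gene and a ranking score
toy_mrk <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene    = c("A", "B", "C", "D", "E", "F"),
  score   = c(1.0, 3.0, 2.0, 0.5, 0.9, 0.7),
  stringsAsFactors = FALSE
)

# Sort by descending score, then take the first n genes within each cluster
n_top <- 2
ord <- toy_mrk[order(-toy_mrk$score), ]
top_genes <- unlist(
  lapply(split(ord, ord$cluster), function(d) head(d$gene, n_top)),
  use.names = FALSE
)

top_genes
```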

In [None]:
options(repr.plot.width = 16, repr.plot.height = 16)
FeaturePlot(cd4_l3_naive, features = markers,
           min.cutoff = 0, ncol = 4)

In [None]:
cd4_l3_naive@meta.data  <- cd4_l3_naive@meta.data  %>% 
mutate(annotations_manual = recode_factor(seurat_clusters, "0" = "Naive1: AIF1 STMN1 EPHB6",
                                     "1" = "Naive2: ITGA4 PIM1 PCED1B",
                                     "2" = "Naive3: DUSP1 JUN FOS",
                                     "3" = "Naive4: TRDC TRDV1 SOX4"))

In [None]:
options(repr.plot.width = 7, repr.plot.height = 4)
DimPlot(cd4_l3_naive, group.by = "annotations_manual")

In [None]:
saveRDS(cd4_l3_naive, "../data/processed/L3/cd4_l3_naive.rds")

### Tfh

In [None]:
options(repr.plot.width = 7, repr.plot.height = 6)
DimPlot(cd4_l3_tfh)

In [None]:
cd4_l3_tfh <- FindNeighbors(cd4_l3_tfh, reduction = "pca", dims = 1:12)
cd4_l3_tfh <- FindClusters(cd4_l3_tfh, resolution = 0.3)

DimPlot(cd4_l3_tfh, label = T)

In [None]:
mrk  <- FindAllMarkers(cd4_l3_tfh, only.pos = TRUE)

In [None]:
mrk  <- rank_score_func(mrk)

In [None]:
markers  <- mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 4)  %>% pull(gene)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 16)
FeaturePlot(cd4_l3_tfh, features = markers,
           min.cutoff = 0, ncol = 4)

In [None]:
cd4_l3_tfh@meta.data  <- cd4_l3_tfh@meta.data  %>% 
mutate(annotations_manual = recode_factor(seurat_clusters, "0" = "Tfh1: IL7R CXCR4 PTGER2",
                                     "1" = "Tfh2: KLRB1 TIGIT PPP2R5C",
                                         "2" = "Tfh3: CCR7 ID2 RGS10"))

In [None]:
options(repr.plot.width = 7, repr.plot.height = 4)
DimPlot(cd4_l3_tfh, group.by = "annotations_manual")

In [None]:
saveRDS(cd4_l3_tfh, "../data/processed/L3/cd4_l3_tfh.rds")

### Treg

In [None]:
options(repr.plot.width = 7, repr.plot.height = 6)
DimPlot(cd4_l3_treg)

In [None]:
cd4_l3_treg <- FindNeighbors(cd4_l3_treg, reduction = "pca", dims = 1:12)
cd4_l3_treg <- FindClusters(cd4_l3_treg, resolution = 0.3)

In [None]:
options(repr.plot.width = 7, repr.plot.height = 6)
DimPlot(cd4_l3_treg, label = T)

In [None]:
mrk  <- FindAllMarkers(cd4_l3_treg, only.pos = TRUE)

In [None]:
mrk  <- rank_score_func(mrk)

In [None]:
markers  <- mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 4)  %>% pull(gene)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 16)
FeaturePlot(cd4_l3_treg, features = markers,
           min.cutoff = 0, ncol = 4)

In [None]:
DefaultAssay(cd4_l3_treg)  <- "RNA"
options(repr.plot.width = 16, repr.plot.height = 10)
FeaturePlot(cd4_l3_treg, features = c("CCR10","TNFRSF9","CCR7","GZMA","GZMB","GZMH","GZMM",
                                    "STAT3","SOCS3","IL2RA", "GZMK", "CD226"),
           min.cutoff = 0, ncol = 4)
DefaultAssay(cd4_l3_treg)  <- "integrated"

In [None]:
mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 10)

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
DimPlot(cd4_l3_treg, label = T)

In [None]:
cd4_l3_treg@meta.data  <- cd4_l3_treg@meta.data  %>% 
mutate(annotations_manual = recode_factor(seurat_clusters, "0" = "Treg2: FCER1G NOG CXCR4",
                                     "1" = "Treg3: TIGIT NCR3 FCRL3",
                                     "2" = "Treg1: TCF7 IL7R NOSIP",
                                     "3" = "Treg4: HLA-DR CCR10 PI16"))

In [None]:
DimPlot(cd4_l3_treg, group.by = "annotations_manual")

In [None]:
saveRDS(cd4_l3_treg, "../data/processed/L3/cd4_l3_treg.rds")

### Th1_Th17

In [None]:
cd4_l3_th1th17

In [None]:
options(repr.plot.width = 7, repr.plot.height = 6)
DimPlot(cd4_l3_th1th17)

In [None]:
cd4_l3_th1th17 <- FindNeighbors(cd4_l3_th1th17, reduction = "pca", dims = 1:12)
cd4_l3_th1th17 <- FindClusters(cd4_l3_th1th17, resolution = 0.3)

DimPlot(cd4_l3_th1th17, label = T)

In [None]:
mrk  <- FindAllMarkers(cd4_l3_th1th17, only.pos = TRUE)

In [None]:
mrk  <- rank_score_func(mrk)

In [None]:
markers  <- mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 4)  %>% pull(gene)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 16)
FeaturePlot(cd4_l3_th1th17, features = markers,
           min.cutoff = 0, ncol = 4)

In [None]:
mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 10)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 12)
DefaultAssay(cd4_l3_th1th17)  <- "RNA"
FeaturePlot(cd4_l3_th1th17, features = c("CXCR3","CD27","TCF7","CXCR6",
                                         "TBX21","IFNG","IL17A",
                                         "IL23R","EOMES","RORC","TNF","PRF1"),
           min.cutoff = 0, ncol = 4)

In [None]:
markers2  <- mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 4)  %>% pull(gene)

options(repr.plot.width = 16, repr.plot.height = 16)
FeaturePlot(cd4_l3_th1th17, features = markers2,
           min.cutoff = 0, ncol = 4)

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
DimPlot(cd4_l3_th1th17, label = T)

In [None]:
cd4_l3_th1th17@meta.data  <- cd4_l3_th1th17@meta.data  %>% 
mutate(annotations_manual = recode_factor(seurat_clusters, "0" = "Th1Th17_1: CXCR3 CCL5 SELL",
                                     "1" = "Th1Th17_2: TNFRSF4 CTSH CMTM6",
                                     "2" = "Th1Th17_3: NKG7 PRF1 GZMA",
                                     "3" = "Th1Th17_4: GNLY LGALS1 HOPX"))

In [None]:
options(repr.plot.width = 10, repr.plot.height = 5)
DimPlot(cd4_l3_th1th17, group.by = "annotations_manual")

In [None]:
saveRDS(cd4_l3_th1th17, "../data/processed/L3/cd4_l3_th1_17.rds")

### Th2

In [None]:
options(repr.plot.width = 7, repr.plot.height = 6)
DimPlot(cd4_l3_th2)

In [None]:
cd4_l3_th2 <- FindNeighbors(cd4_l3_th2, reduction = "pca", dims = 1:12)
cd4_l3_th2 <- FindClusters(cd4_l3_th2, resolution = 0.3)

DimPlot(cd4_l3_th2, label = T)

In [None]:
mrk  <- FindAllMarkers(cd4_l3_th2, only.pos = TRUE)

In [None]:
mrk  <- rank_score_func(mrk)

In [None]:
markers  <- mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 4)  %>% pull(gene)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 16)
FeaturePlot(cd4_l3_th2, features = markers,
           min.cutoff = 0, ncol = 4)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 4)
DefaultAssay(cd4_l3_th2)  <- "RNA"
FeaturePlot(cd4_l3_th2, features = c("GATA3","IL5","IL13","IL4"),
           min.cutoff = 0, ncol = 4)

In [None]:
mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 10)

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
DimPlot(cd4_l3_th2, label = T)

In [None]:
cd4_l3_th2@meta.data  <- cd4_l3_th2@meta.data  %>% 
mutate(annotations_manual = recode_factor(seurat_clusters, "0" = "Th2_1: CD27 CCR7 TCF7",
                                     "1" = "Th2_2: GATA3 PTGDR2 NEFL",
                                     "2" = "Th2_3: LPAR6 PI16 ALOX5AP",
                                     "3" = "Th2_4: HLA-DR CCR10 CD74"))

In [None]:
saveRDS(cd4_l3_th2, "../data/processed/L3/cd4_l3_th2.rds")

### NfKb

In [None]:
options(repr.plot.width = 5, repr.plot.height = 4)
DimPlot(cd4_l3_nfkb)

In [None]:
DefaultAssay(cd4_l3_nfkb)  <- "integrated"

In [None]:
cd4_l3_nfkb <- FindNeighbors(cd4_l3_nfkb, reduction = "pca", dims = 1:12)
cd4_l3_nfkb <- FindClusters(cd4_l3_nfkb, resolution = 0.4)

DimPlot(cd4_l3_nfkb, label = T)

In [None]:
mrk  <- FindAllMarkers(cd4_l3_nfkb, only.pos = TRUE)

In [None]:
mrk  <- rank_score_func(mrk)

In [None]:
markers  <- mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 4)  %>% pull(gene)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 8)
FeaturePlot(cd4_l3_nfkb, features = markers,
           min.cutoff = 0, ncol = 4)

In [None]:
mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 10)

In [None]:
cd4_l3_nfkb@meta.data  <- cd4_l3_nfkb@meta.data  %>% 
mutate(annotations_manual = recode_factor(seurat_clusters, "0" = "Nfkb_1: PTGER2 LPAR6 TNFSF8",
                                     "1" = "Nfkb_2: BCL2L11 TP53INP1 TNFAIP3"))

In [None]:
options(repr.plot.width = 10, repr.plot.height = 5)
DimPlot(cd4_l3_nfkb, label = T, group.by = "annotations_manual")

In [None]:
saveRDS(cd4_l3_nfkb, "../data/processed/L3/cd4_l3_nfkb.rds")

### ISAGhi

In [None]:
options(repr.plot.width = 7, repr.plot.height = 6)
DimPlot(cd4_l3_isaghi)

In [None]:
DefaultAssay(cd4_l3_isaghi)  <- "integrated"
cd4_l3_isaghi <- FindNeighbors(cd4_l3_isaghi, reduction = "pca", dims = 1:12)

In [None]:
cd4_l3_isaghi <- FindClusters(cd4_l3_isaghi, resolution = 0.3)

DimPlot(cd4_l3_isaghi, label = T)

In [None]:
mrk  <- FindAllMarkers(cd4_l3_isaghi, only.pos = TRUE)

In [None]:
mrk  <- rank_score_func(mrk)

In [None]:
mrk  

In [None]:
markers  <- mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 6)  %>% pull(gene)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 16)
FeaturePlot(cd4_l3_isaghi, features = markers,
           min.cutoff = 0, ncol = 4)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 16)
FeaturePlot(cd4_l3_isaghi, features = c("CCR10","TNFRSF9","CCR7","GZMA","GZMB","GZMH","GZMM",
                                    "STAT3","SOCS3","IL2RA"),
           min.cutoff = 0, ncol = 4)

In [None]:
mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 10)

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
DimPlot(cd4_l3_isaghi, label = T)

In [None]:
cd4_l3_isaghi@meta.data  <- cd4_l3_isaghi@meta.data  %>% 
mutate(annotations_manual = recode_factor(seurat_clusters, "0" = "ISAGhi1: LIMS1 PASK ITGA4",
                                     "1" = "ISAGhi2: CCR7 LEF1 SOX4",
                                     "2" = "ISAGhi3: S100A4 LGALS1 ANXA2",
                                     "3" = "ISAGhi4: ISG15 MT2A ISG20"
                                         ))

In [None]:
DimPlot(cd4_l3_isaghi, group.by = "annotations_manual")

In [None]:
saveRDS(cd4_l3_isaghi, "../data/processed/L3/cd4_l3_isaghi.rds")

### Prolif

In [None]:
options(repr.plot.width = 7, repr.plot.height = 6)
DimPlot(cd4_l3_proliferating)

In [None]:
DefaultAssay(cd4_l3_proliferating)  <- "integrated"
cd4_l3_proliferating <- FindNeighbors(cd4_l3_proliferating, reduction = "pca", dims = 1:12)
cd4_l3_proliferating <- FindClusters(cd4_l3_proliferating, resolution = 0.2)

DimPlot(cd4_l3_proliferating, label = T)

In [None]:
mrk  <- FindAllMarkers(cd4_l3_proliferating, only.pos = TRUE)

In [None]:
mrk  <- rank_score_func(mrk)

In [None]:
markers  <- mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 4)  %>% pull(gene)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 20)
FeaturePlot(cd4_l3_proliferating, features = markers,
           min.cutoff = 0, ncol = 4)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 10)
FeaturePlot(cd4_l3_proliferating, features = c("CCR10","TNFRSF9","CCR7","GZMA","GZMB","GZMH","GZMM","FOXP3",
                                    "STAT3","SOCS3","IL2RA","IL10"),
           min.cutoff = 0, ncol = 4)

In [None]:
mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 10)

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)
DimPlot(cd4_l3_proliferating, label = T)

In [None]:
DefaultAssay(cd4_l3_proliferating)  <- "RNA"
FeaturePlot(cd4_l3_proliferating, features = c("MKI67"),  min.cutoff = 0)

In [None]:
cd4_l3_proliferating@meta.data  <- cd4_l3_proliferating@meta.data  %>% 
mutate(annotations_manual = recode_factor(seurat_clusters, "0" = "Prolif1: CCL5 GZMA CST7",
                                     "1" = "Prolif2: STMN1 TUBA1B MKI67",
                                     "2" = "Prolif3: KLRB1 CCR7 IL6ST",
                                     "3" = "Prolif4: GZMK NUCB2 IL10"))

In [None]:
DimPlot(cd4_l3_proliferating, group.by = "annotations_manual")

In [None]:
saveRDS(cd4_l3_proliferating, "../data/processed/L3/cd4_l3_proliferating.rds")

### Temra

In [None]:
DefaultAssay(cd4_l3_temra)  <- "integrated"
cd4_l3_temra <- FindNeighbors(cd4_l3_temra, reduction = "pca", dims = 1:12)
cd4_l3_temra <- FindClusters(cd4_l3_temra, resolution = 0.4)
options(repr.plot.width = 6, repr.plot.height = 5)

DimPlot(cd4_l3_temra, label = T)

In [None]:
DefaultAssay(cd4_l3_temra)  <- "RNA"

In [None]:
mrk  <- FindAllMarkers(cd4_l3_temra, only.pos = TRUE)

In [None]:
mrk  <- rank_score_func(mrk)

In [None]:
mrk  %>% dplyr::filter(cluster == 0)

In [None]:
markers  <- mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 4)  %>% pull(gene)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 20)
FeaturePlot(cd4_l3_temra, features = markers,
           min.cutoff = 0, ncol = 4)

In [None]:
options(repr.plot.width = 16, repr.plot.height = 16)
FeaturePlot(cd4_l3_temra, features = c("CCR10","TNFRSF9","CCR7","GZMA","GZMB","GZMH","GZMM",
                                    "STAT3","SOCS3","IL2RA", "LTB", "NOSIP", "KLRB1"),
           min.cutoff = 0, ncol = 4)

In [None]:
mrk  %>% arrange(desc(score))  %>% group_by(cluster)  %>% slice_head(n = 10)

In [None]:
cd4_l3_temra@meta.data  <- cd4_l3_temra@meta.data  %>% 
mutate(annotations_manual = recode_factor(seurat_clusters, "0" = "Temra1: IL7R LTB CD40LG",
                                     "1" = "Temra2: GZMB GZMH LGALS1",
                                     "2" = "Temra3: CHI3L2 TIGIT CRTAM", 
                                         "3" = "Temra4: IGFBP7 FES LGALS9B",
                                         "4" = "Temra5: TRDV2 TRGV9 TRDC"))

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)

DimPlot(cd4_l3_temra, group.by = "annotations_manual")

In [None]:
saveRDS(cd4_l3_temra, "../data/processed/L3/cd4_l3_temra.rds")

## Plotting all L3 datasets

In [None]:
cd4_l3_list  <- list(cd4_l3_naive, cd4_l3_tfh, cd4_l3_th1th17, cd4_l3_nfkb, cd4_l3_th2,
                     cd4_l3_treg, cd4_l3_isaghi, cd4_l3_proliferating, cd4_l3_temra)

names(cd4_l3_list)  <- c("cd4_l3_naive", "cd4_l3_tfh", "cd4_l3_th1th17", "cd4_l3_nfkb", "cd4_l3_th2",
                     "cd4_l3_treg", "cd4_l3_isaghi", "cd4_l3_proliferating", "cd4_l3_temra")

In [None]:
for(i in 1:9){
seurat_dataset  <- cd4_l3_list[[i]]
seurat_dataset@misc$cols_annotations  <- scales::hue_pal(h.start = 20) (length(levels(factor(seurat_dataset$annotations_manual))))
seurat_dataset@misc$dataset_name  <- names(cd4_l3_list)[i]
seurat_dataset@misc$all_md  <- cd4_l1_full_filt@meta.data  %>% 
                            dplyr::select(Sample_ID, Condition, Condition2, 
                                          Disease, 
                                          Sex, Age, Age_group, Patient_ID, 
                                          Time, Experiment_ID)   %>% unique



options(repr.plot.width = 8, repr.plot.height = 6.5)
save_dimplot_plot(seurat_dataset = seurat_dataset)
    
options(repr.plot.width=16, repr.plot.height=5)
process_plots_from_dataset(seurat_dataset = seurat_dataset)
df4  <- create_df4(seurat_dataset)
freq  <- df4  %>% dplyr::select(1:3)
write.csv(freq, paste0("../tables/cd4/freq_", names(cd4_l3_list)[i],".csv"), row.names = FALSE)

saveRDS(seurat_dataset, paste0("../data/processed/L3/",names(cd4_l3_list)[i],".rds"))
}

## Population tree

In [None]:
cd4_l3_naive  <- readRDS("../data/processed/L3/cd4_l3_naive.rds")
cd4_l3_tfh  <- readRDS("../data/processed/L3/cd4_l3_tfh.rds")
cd4_l3_nfkb  <- readRDS("../data/processed/L3/cd4_l3_nfkb.rds")
cd4_l3_th2  <- readRDS("../data/processed/L3/cd4_l3_th2.rds")
cd4_l3_treg  <- readRDS("../data/processed/L3/cd4_l3_treg.rds")
cd4_l3_isaghi  <- readRDS("../data/processed/L3/cd4_l3_isaghi.rds")
cd4_l3_proliferating  <- readRDS("../data/processed/L3/cd4_l3_proliferating.rds")
cd4_l3_temra  <- readRDS("../data/processed/L3/cd4_l3_temra.rds")

In [None]:
cd4_l3_th1th17  <- readRDS("../data/processed/L3/cd4_l3_th1th17.rds")

In [None]:
cd4_md  <- cd4_l1_full_filt@meta.data  %>% dplyr::select(barcode, annotations_l1 = annotations_manual)

In [None]:
md_l2  <- rbind(cd4_l2_subcluster@meta.data %>% dplyr::select(barcode, annotations_manual),
               data.frame(barcode = cd4_l2_unc@meta.data$barcode, annotations_manual = "Unconventional"))

In [None]:
md_l2

In [None]:
cd4_md  <- left_join(cd4_md, (md_l2  %>% dplyr::select(barcode, annotations_l2 = annotations_manual)), by = "barcode")

In [None]:
cd4_md  %>% group_by(annotations_l1, annotations_l2)  %>% tally

In [None]:
md_l3  <- rbind(cd4_l3_naive@meta.data %>% dplyr::select(barcode, annotations_manual), 
                cd4_l3_proliferating@meta.data %>% dplyr::select(barcode, annotations_manual), 
                cd4_l3_isaghi@meta.data %>% dplyr::select(barcode, annotations_manual),
                cd4_l3_temra@meta.data %>% dplyr::select(barcode, annotations_manual),
                cd4_l3_tfh@meta.data %>% dplyr::select(barcode, annotations_manual),
                cd4_l3_th1th17@meta.data %>% dplyr::select(barcode, annotations_manual),
                cd4_l3_nfkb@meta.data %>% dplyr::select(barcode, annotations_manual),
                cd4_l3_th2@meta.data %>% dplyr::select(barcode, annotations_manual),
                cd4_l3_treg@meta.data %>% dplyr::select(barcode, annotations_manual),
                cd4_l2_unc@meta.data %>% dplyr::select(barcode, annotations_manual)
                )
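
Because `md_l3` stacks metadata from several objects, a barcode that appears in more than one of them would duplicate rows when the table is joined back to `cd4_md`. Below is a minimal sketch of a check for this on toy data frames. The barcodes and labels are made up, and base R `merge(all.x = TRUE)` stands in for `left_join` to keep the example dependency-free.

```r
# Toy stand-ins for per-dataset metadata (hypothetical barcodes and labels)
md_a <- data.frame(barcode = c("AAAC-1", "AAAG-1"),
                   annotations_manual = c("Naive", "Naive"))
md_b <- data.frame(barcode = c("AAAT-1", "AAAG-1"),
                   annotations_manual = c("Tfh", "Tfh"))

md_stacked <- rbind(md_a, md_b)

# Barcodes that occur in more than one dataset
dup_barcodes <- unique(md_stacked$barcode[duplicated(md_stacked$barcode)])
dup_barcodes

# A left join returns one row per match, so duplicated barcodes inflate the result
toy_cells <- data.frame(barcode = c("AAAC-1", "AAAG-1", "AAAT-1"))
toy_joined <- merge(toy_cells, md_stacked, by = "barcode", all.x = TRUE)
nrow(toy_joined)  # larger than nrow(toy_cells) when duplicates exist
```

If `dup_barcodes` is non-empty on the real tables, the row count after the join no longer matches the number of cells, which would break the later assignment of row names to the metadata.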

In [None]:
cd4_md  <- left_join(cd4_md, (md_l3  %>% dplyr::select(barcode, annotations_l3 = annotations_manual)), by = "barcode")

In [None]:
cd4_md$annotations  <- "CD4"

In [None]:
data  <- cd4_md  %>% dplyr::select(-barcode)  %>% 
    group_by(annotations, annotations_l1, annotations_l2, annotations_l3)  %>% 
    tally()

In [None]:
data

In [None]:
nrow(data)

In [None]:
md_new  <- cd4_md  %>% 
    mutate(annotations_l1 = ifelse(grepl(annotations_l1, pattern = "CD4"), "CD4 T cells",
                                   paste("CD4", annotations_l1)))  %>% 
    mutate(annotations_l2 = paste(annotations_l1, annotations_l2, sep = "---"),
           annotations_l3 = paste(annotations_l2, annotations_l3, sep = "---"))  %>% 
    # paste() writes missing levels as the string "NA"; strip the trailing "---NA" parts
    mutate(annotations_l3 = sub(annotations_l3, pattern = "(---NA)+$", replacement = ""))
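
`paste()` converts missing values to the literal string `"NA"`, so a cell without an L2 or L3 label ends up with `---NA` in its combined label. Here is a small sketch, on made-up labels, of how such suffixes arise and one way to strip them. The vectors `l1`, `l2`, `l3` are hypothetical and only mimic the three annotation columns.

```r
# Hypothetical labels for three cells; NA marks a level that was never annotated
l1 <- c("CD4 A", "CD4 B", "CD4 C")
l2 <- c("A1",    "B1",    NA)
l3 <- c("A1x",   NA,      NA)

# Same two-step pasting as for annotations_l2 and annotations_l3
combined <- paste(paste(l1, l2, sep = "---"), l3, sep = "---")
combined

# Remove one or more trailing "---NA" parts, leaving real labels untouched
cleaned <- sub("(---NA)+$", "", combined)
cleaned
```

Anchoring the pattern at the end of the string avoids touching a genuine label that merely begins with the letters "NA".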

In [None]:
md_new

In [None]:
cd4_l1_full_filt$annotations_l3  <- NULL

In [None]:
cd4_l1_full_filt@meta.data  <- left_join(cd4_l1_full_filt@meta.data, md_new)

In [None]:
rownames(cd4_l1_full_filt@meta.data)  <- colnames(cd4_l1_full_filt)

In [None]:
options(repr.plot.width = 25, repr.plot.height = 15)

DimPlot(cd4_l1_full_filt, group.by = "annotations_l3", label = T, raster = T)

In [None]:
DimPlot(cd4_l1_full_filt, group.by = "annotations_l2", label = T, raster = T)

In [None]:
options(repr.plot.width = 12, repr.plot.height = 6)
DimPlot(cd4_l2_subcluster, group.by = "annotations_manual", label = T, raster = T, cols = cd4_l2_subcluster@misc$cols)

In [None]:
options(repr.plot.width = 12, repr.plot.height = 7)
DimPlot(cd4_l1_full_filt, group.by = "annotations_l2", label = F, 
        raster = T, cols = c(cd4_l2_subcluster@misc$cols[c(6,2,7,8,9,1,5,3,4)],
                              #"#be87e7ff",
                             "dodgerblue1",
                             "#fe60cbff"))

In [None]:
DimPlot(cd4_l1_full_filt, group.by = "annotations_l1", label = T, raster = T)

In [None]:
saveRDS(cd4_l1_full_filt, "../data/processed/L1/cd4_l1_full_filt.rds")

In [None]:
cd4_l1_full_filt  <- readRDS("../data/processed/L1/cd4_l1_full_filt.rds")

In [None]:
cd4_l2_subcluster  <- readRDS("../data/processed/L2/cd4_subcluster.rds")
cd4_l2_unc  <- readRDS("../data/processed/L2/cd4_l2_unc.rds")

## Sankey plot

In [None]:
data

In [None]:
dir.create("../tables/sankey/", showWarnings = FALSE, recursive = TRUE)

In [None]:
write.csv(data, "../tables/sankey/cd4_sankey.csv")
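
The exported table holds one row per combination of annotation levels. Sankey tools usually expect source/target/value links instead, so a conversion along the following lines may be needed downstream. This is only a sketch on toy counts: `make_links` is a hypothetical helper, the labels are made up, and the plotting step itself is not shown here.

```r
# Toy counts in the same shape as the exported table (labels and numbers are made up)
toy_hierarchy <- data.frame(
  annotations    = "CD4",
  annotations_l1 = c("A", "A", "B"),
  annotations_l2 = c("A1", "A2", "B1"),
  n              = c(50, 30, 20)
)

# Sum the counts for one pair of adjacent levels into source/target/value rows
make_links <- function(df, from, to) {
  agg <- aggregate(df$n, by = list(source = df[[from]], target = df[[to]]), FUN = sum)
  names(agg)[3] <- "value"
  agg
}

toy_links <- rbind(
  make_links(toy_hierarchy, "annotations", "annotations_l1"),
  make_links(toy_hierarchy, "annotations_l1", "annotations_l2")
)
toy_links
```

`aggregate()` omits rows whose grouping values are missing, so cells without a deeper annotation simply contribute no link at that level.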

## Table for quantification and Bayes

In [None]:
cd4_l1_full_filt  <- readRDS("../data/processed/L1/cd4_l1_full_filt.rds")

In [None]:
cd4_patient_meta  <- cd4_l1_full_filt@meta.data  %>% 
                            dplyr::select(Sample_ID, Condition, Condition2, 
                                          Disease, 
                                          Sex, Age, Age_group, Patient_ID, 
                                          Time, Experiment_ID)   %>% unique

In [None]:
colnames(cd4_l1_full_filt@meta.data )

In [None]:
df3  <- cd4_l1_full_filt@meta.data %>% 
  group_by(Sample_ID, annotations_l3) %>% 
  summarise(n = n()) %>% 
  unique() %>% 
  ungroup() %>% 
  pivot_wider(names_from = "annotations_l3", values_from = "n", values_fill = 0) 
df4  <- left_join((cd4_l1_full_filt@misc$all_md %>% dplyr::select(Sample_ID) %>% unique), df3)
df4[is.na(df4)] <- 0
df4  <- df4  %>% pivot_longer(!Sample_ID, values_to = "n", names_to = "annotations")

# As we've lost non-grouping variables, let's join them back
md_to_join <- cd4_l1_full_filt@misc$all_md %>% 
  unique()

df4  <- left_join(df4, md_to_join)

In [None]:
df4$Level  <- "L3"

In [None]:
df_l3  <- df4

In [None]:
df_l3

In [None]:
df3  <- cd4_l1_full_filt@meta.data %>% 
  group_by(Sample_ID, annotations_l2) %>% 
  summarise(n = n()) %>% 
  unique() %>% 
  ungroup() %>% 
  pivot_wider(names_from = "annotations_l2", values_from = "n", values_fill = 0) 
df4  <- left_join((cd4_l1_full_filt@misc$all_md %>% dplyr::select(Sample_ID) %>% unique), df3)
df4[is.na(df4)] <- 0
df4  <- df4  %>% pivot_longer(!Sample_ID, values_to = "n", names_to = "annotations")

# As we've lost non-grouping variables, let's join them back
md_to_join <- cd4_l1_full_filt@misc$all_md %>% 
  unique()

df4  <- left_join(df4, md_to_join)
df4$Level  <- "L2"

In [None]:
df_l2  <- df4

In [None]:
df4

In [None]:
df3  <- cd4_l1_full_filt@meta.data %>% 
  group_by(Sample_ID, annotations_l1) %>% 
  summarise(n = n()) %>% 
  unique() %>% 
  ungroup() %>% 
  pivot_wider(names_from = "annotations_l1", values_from = "n", values_fill = 0) 
df4  <- left_join((cd4_l1_full_filt@misc$all_md %>% dplyr::select(Sample_ID) %>% unique), df3)
df4[is.na(df4)] <- 0
df4  <- df4  %>% pivot_longer(!Sample_ID, values_to = "n", names_to = "annotations")

# As we've lost non-grouping variables, let's join them back
md_to_join <- cd4_l1_full_filt@misc$all_md %>% 
  unique()

df4  <- left_join(df4, md_to_join)
df4$Level  <- "L1"

df_l1  <- df4
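
The three cells above repeat the same counting steps for L3, L2 and L1. One way to write this once is sketched below in base R on toy data. `count_per_sample` and its arguments are hypothetical names, not part of the sourced helper script, and the final join with the sample metadata is left out.

```r
# Toy cell-level metadata (sample IDs and labels are made up)
toy_md <- data.frame(
  Sample_ID  = c("S1", "S1", "S1", "S2"),
  annotation = c("A",  "A",  "B",  "A")
)
all_samples <- c("S1", "S2", "S3")  # S3 has no cells at all

# Count cells per sample and annotation; unseen combinations are filled with 0
count_per_sample <- function(md, annotation_col, samples, level) {
  tab <- table(
    Sample_ID   = factor(md$Sample_ID, levels = samples),
    annotations = md[[annotation_col]]
  )
  out <- as.data.frame(tab, responseName = "n", stringsAsFactors = FALSE)
  out$Level <- level
  out
}

toy_level_counts <- count_per_sample(toy_md, "annotation", all_samples, level = "L1")
toy_level_counts
```

Passing the full list of samples as factor levels keeps samples with no cells in the table, which is what the left join from the sample list achieves in the cells above. Note that `table()` drops cells whose annotation is `NA` by default.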

In [None]:
df_l3

In [None]:
df_all_levels  <- rbind(df_l1, df_l2, df_l3)

In [None]:
write.csv(df_all_levels, "../tables/populations_freq/all_levels_counts_with_preliminary_cd4.csv")

In [None]:
df_all_levels

In [None]:
df_all_levels_without_preliminary  <- df_all_levels  %>% dplyr::filter(Experiment_ID %in% c("Exp16", "Exp18", "Exp19", "Exp20"))

In [None]:
write.csv(df_all_levels_without_preliminary, "../tables/populations_freq/all_levels_counts_cd4.csv")

## Adding L3 annotations to full object

In [None]:
md_new  <- cd4_md  %>% 
    transmute(barcode = barcode, 
              annotations_l3 = paste(annotations_l1, annotations_l2, annotations_l3, sep = "---"))  %>% 
    # paste() writes missing levels as the string "NA"; strip the trailing "---NA" parts
    mutate(annotations_l3 = sub(annotations_l3, pattern = "(---NA)+$", replacement = ""))

In [None]:
md_new

In [None]:
cd4_l1_full_filt@meta.data  <- left_join(cd4_l1_full_filt@meta.data, md_new)

In [None]:
rownames(cd4_l1_full_filt@meta.data)  <- colnames(cd4_l1_full_filt)

In [None]:
options(repr.plot.width = 25, repr.plot.height = 15)

DimPlot(cd4_l1_full_filt, group.by = "annotations_l3", label = T, raster = T)

In [None]:
saveRDS(cd4_l1_full_filt, "../data/processed/L1/cd4_l1_full_filt.rds")

In [None]:
cd4_l1_full_filt  <- readRDS("../data/processed/L1/cd4_l1_full_filt.rds")

## Population phylogenetic tree

In [None]:
Idents(cd4_l1_full_filt)  <- cd4_l1_full_filt$annotations_l3

In [None]:
cd4_l1_full_filt <- BuildClusterTree(
  cd4_l1_full_filt,
  dims = 1:12,
  reorder = FALSE,
  reorder.numeric = FALSE
)
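
Roughly, when `dims` is supplied, `BuildClusterTree` averages the cells of each identity class in the reduced space and clusters those averages hierarchically. The base-R sketch below illustrates that idea on a toy embedding. It is not Seurat's actual implementation, and all object names and values are made up.

```r
set.seed(1)

# Toy embedding: 30 cells x 3 PCs, split into three made-up clusters
toy_emb <- matrix(rnorm(90), nrow = 30, dimnames = list(NULL, paste0("PC_", 1:3)))
toy_cluster <- rep(c("A", "B", "C"), each = 10)
toy_emb[toy_cluster == "C", ] <- toy_emb[toy_cluster == "C", ] + 5  # move C away from A and B

# Average position of each cluster in PC space (one row per cluster)
toy_centroids <- rowsum(toy_emb, group = toy_cluster) / as.vector(table(toy_cluster))

# Euclidean distances between the centroids, then hierarchical clustering
toy_hc <- hclust(dist(toy_centroids))
toy_hc$labels

# Cutting the tree into two branches separates C from A and B
cutree(toy_hc, k = 2)
```

The tree therefore reflects similarity of cluster averages in the chosen PCs, not a lineage relationship between populations.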

In [None]:
# Extract the tree stored by BuildClusterTree()
tree <- cd4_l1_full_filt@tools$BuildClusterTree

In [None]:
tree$tip.label

In [None]:
as.character(tree$tip.label)

In [None]:
tree

In [None]:
p <- ggtree::ggtree(tree, aes(x, y)) +
  scale_y_reverse() +
  ggtree::geom_tree() +
  ggtree::theme_tree() +
  ggtree::geom_tiplab(offset = 1) +
  ggtree::geom_tippoint(shape = 16, size = 5) +
  coord_cartesian(clip = 'off') +
  theme(plot.margin = unit(c(0,18,0,0), 'cm'))

#ggsave('plots/cluster_tree.png', p, height = 4, width = 6)

In [None]:
options(repr.plot.width=10, repr.plot.height=12)
p

# Revisions - CD79A expression

In [None]:
cd4_full  <- readRDS("../../240218_VN_Diabetes_V05/data/processed/L1/cd4_full.rds")

In [None]:
DimPlot(cd4_full, label = T)

In [None]:
options(repr.plot.width=16, repr.plot.height=12)
VlnPlot(cd4_full, features = c("CD79A","MS4A1"))

In [None]:
pct_expressing_boxplot  <- function(seurat_object, gene, group.by = "annotations_l2", sample.col = "sample"){
    # Row index of the gene in the RNA assay; stop early if the gene is not found
    rn = which(rownames(seurat_object@assays$RNA) == gene)
    stopifnot(length(rn) == 1)

    # Shared text sizes for the plot
    ggtheme = function() {
        theme(
            axis.text = element_text(size = 20),
            axis.title = element_text(size = 20),
            text = element_text(size = 20, colour = "black"),
            legend.text = element_text(size = 20),
            legend.key.size = unit(10, units = "points")
        )
    }

    # Fraction of cells with at least one count of the gene, per sample and group.
    # The wide/long round trip sets sample-group combinations without cells to 0.
    df = data.frame(grouping_var = seurat_object@meta.data[[group.by]],
                    value = seurat_object@assays$RNA@counts[rn,], 
                    sample = seurat_object@meta.data[[sample.col]])  %>% 
        mutate(expressing = if_else(value > 0, 1, 0))  %>% 
        dplyr::select(-value)  %>% 
        group_by(sample, grouping_var)  %>% 
        summarise(mean_expression = mean(expressing), .groups = "drop")  %>% 
        pivot_wider(names_from = sample, values_from = mean_expression, values_fill = 0)  %>% 
        pivot_longer(!grouping_var, names_to = "sample", values_to = "expressing")

    # Boxplot per group with one jittered point per sample
    plt = ggplot(data = df, aes(x = grouping_var, y = expressing)) +
        geom_boxplot(outlier.shape = NA, aes(fill = grouping_var), alpha = 0.3) + 
        geom_dotplot(binaxis = 'y', stackdir = 'center', dotsize = 0) + 
        geom_jitter(width = 0.1, height = 0.0, size = 2, aes(color = grouping_var)) +
        theme_classic() +
        theme(plot.title = element_text(hjust = 0.5)) +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
        ggtheme() +
        ggtitle(gene) +
        ylab("Fraction of expressing cells") +
        xlab("") + NoLegend()
    return(plt)
}
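
The quantity plotted by this function is the share of cells in each sample and group that have at least one count of the gene. Here is a minimal base-R sketch of that calculation on toy data. The data frame and its values are made up, and `tapply` stands in for the `group_by`/`summarise` steps.

```r
# Toy per-cell table: sample, cluster and raw counts of one gene (values are made up)
toy_expr <- data.frame(
  sample  = c("S1", "S1", "S1", "S2", "S2"),
  cluster = c("0",  "0",  "1",  "0",  "0"),
  counts  = c(0,    2,    1,    0,    0)
)

# The mean of a logical vector is the fraction of TRUE values
toy_frac <- tapply(toy_expr$counts > 0,
                   list(cluster = toy_expr$cluster, sample = toy_expr$sample),
                   mean)

# Combinations with no cells come back as NA; set them to 0 as the function does
toy_frac[is.na(toy_frac)] <- 0
toy_frac
```

Because of the zero-filling, a cluster that is absent from a sample is shown as 0, the same as a cluster that is present but has no expressing cells. This is worth keeping in mind when reading the boxplots for small clusters.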


In [None]:
options(repr.plot.width=7, repr.plot.height=6)
pct_expressing_boxplot(seurat_object = cd4_full, group.by = "seurat_clusters", gene = "CD79A", sample.col = "Sample_ID")

In [None]:
pct_expressing_boxplot(seurat_object = cd4_full, group.by = "seurat_clusters", gene = "CD19", sample.col = "Sample_ID")

In [None]:
pct_expressing_boxplot(seurat_object = cd4_full, group.by = "seurat_clusters", gene = "KLRB1", sample.col = "Sample_ID")

In [None]:
pct_expressing_boxplot(seurat_object = cd4_full, group.by = "seurat_clusters", gene = "ZBTB16", sample.col = "Sample_ID")

In [None]:
pct_expressing_boxplot(seurat_object = cd4_full, group.by = "seurat_clusters", gene = "MS4A1", sample.col = "Sample_ID")

In [None]:
ls()