-
Notifications
You must be signed in to change notification settings - Fork 0
/
3.1_deconvolutionAnalysis.Rmd
176 lines (163 loc) · 6.46 KB
/
3.1_deconvolutionAnalysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
---
title: "RNA-seq Mouse Placenta"
author: "Ha T. H. Vu"
output: html_document
---
```{r setup, include=FALSE}
options(max.print = "75")
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "#>",
fig.path = "Files/",
fig.width = 8,
prompt = FALSE,
tidy = FALSE,
message = FALSE,
warning = TRUE
)
knitr::opts_knit$set(width = 75)
```
Documentation for the analysis of mouse placental RNA-seq data at e7.5, e8.5 and e9.5.
## 1. Deconvolution analysis
To determine if timepoint-specific gene groups could capture different cell populations of the placenta, we carried out deconvolution analysis with `LinSeed` and inferred the cell type profiles. <br>
To run `LinSeed`, we need gene expression levels in TPM (obtained with `tximport`). Moreover, because of the permutation test for significance of genes, the running time can be long - this should be run on HPC. I have presaved an object to reproduce results here.<br>
Here, we load the necessary files, including the expression tables.
```{r, eval = F}
library("linseed", suppressMessages())
t2g <- read.table("Files/t2g.txt", header = T, sep = "\t")
load("Files/e7.5geneLevelTPM.rda")
load("Files/e8.5geneLevelTPM.rda")
load("Files/e9.5geneLevelTPM.rda")
abundance <- dplyr::inner_join(e7.5abundance, e8.5abundance, by = c("ens_gene" = "ens_gene"))
abundance <- dplyr::inner_join(abundance, e9.5abundance, by = c("ens_gene" = "ens_gene"))
rownames(abundance) <- abundance$ens_gene
keep <- colnames(abundance)[2:17]
abundance <- abundance[,keep]
t2g2 <- dplyr::distinct(t2g[,2:3])
t2g2 <- subset(t2g2, t2g2$ens_gene %in% rownames(abundance))
rownames(t2g2) <- t2g2$ens_gene
```
Starting with top 5000 most expressed genes, we identify the most significant ones to carry out deconvolution analysis.
```{r, eval = F}
lo <- LinseedObject$new(abundance, topGenes=5000, annotation = t2g2, geneSymbol = "ens_gene") #use top 5000 most expressed genes
lo$calculatePairwiseLinearity()
lo$calculateSpearmanCorrelation()
lo$calculateSignificanceLevel(10000) #use 100000 times to calculate significance. To reduce running time, reduce this number, but the results will be different
temp <- lo$spearman
names(lo$genes$pvals) <- rownames(temp) #ensure the gene names between objects matched
```
First, we look at the significance plot showing significant genes at p-value level of 0.05.
```{r, eval = F}
lo$significancePlot(0.05)
lo$filterDatasetByPval(0.05)
```
Next, plot the SVD plot to determine the number of cell groups. We can also extract the variance SVD to calculate the exact changes as we increase the dimensions (cell groups).
```{r, eval=F}
lo$svdPlot()
#calculate variance SVD for full data
dataFull <- as.data.frame(lo$exp$full)
dataFull <- dataFull[, grep("norm", colnames(dataFull))]
vars <- svd(dataFull)$d^2
vars <- cumsum(vars / sum(vars))
dfFull <- data.frame(dimension=1:length(vars), variance=vars, filtered=FALSE)
dfFull
#calculate variance SVD for filtered data
dataFiltered <- as.data.frame(lo$exp$filtered)
dataFiltered <- dataFiltered[, grep("norm", colnames(dataFiltered))]
vars <- svd(dataFiltered)$d^2
vars <- cumsum(vars / sum(vars))
dfFiltered <- data.frame(dimension=1:length(vars), variance=vars, filtered=TRUE)
dfFiltered
```
From this plot and the variance, we determine that we have 5 groups.
```{r, eval=F}
set.seed(123)
lo$setCellTypeNumber(5)
lo$project("full") # projecting full dataset
lo$project("filtered")
lo$smartSearchCorners(dataset="filtered", error="norm")
lo$deconvolveByEndpoints()
plotProportions(lo$proportions)
```
The `signature` object will give us the expression levels of significant genes after deconvolution.
```{r, eval=F}
cells <- lo$signatures
```
Next, we select 100 closest genes to the simplex corners to be candidate markers. Since we have 5 corners corresponding to 5 cell groups, each cell group will have 100 candidate markers.
```{r, eval=F}
lo$selectGenes(100)
```
For each cell group, we determine the candidate markers that is closer to the group's corner than any other corner.
```{r, eval=F}
distances <- lo$distances
colnames(distances) <- c(paste0("Group", 1:5))
for (i in 1:5) {
sub <- distances[lo$markers[[i]],]
keep <- c()
for (j in 1:nrow(sub)) {
if (sub[j, i] == min(sub[j,])) {
keep <- c(keep, rownames(sub)[j])
}
}
markers <- unique(t2g[t2g$ens_gene %in% keep, "ext_gene"])
print(markers)
#write.table(markers, paste0("/work/LAS/geetu-lab/hhvu/project1_2/rna-seq/7_deconvolution/cell", i, "markers-minDistance-inTop100.txt"), quote = F, row.names = F)
}
```
We can check if markers are in time-point specific groups.
```{r, eval=F}
e7.5specific <- read.table("/work/LAS/geetu-lab/hhvu/project1_2/rna-seq/4_specificGenes/e7.5specific_ensGenes.txt")
e8.5specific <- read.table("/work/LAS/geetu-lab/hhvu/project1_2/rna-seq/4_specificGenes/e8.5specific_ensGenes.txt")
e9.5specific <- read.table("/work/LAS/geetu-lab/hhvu/project1_2/rna-seq/4_specificGenes/e9.5specific_ensGenes.txt")
for (i in 1:5) {
markers <- read.table(paste0("/work/LAS/geetu-lab/hhvu/project1_2/rna-seq/7_deconvolution/cell", i, "markers-minDistance-inTop100.txt"), header = T, sep = "\t")
print(paste0("Cell group ", i))
print("Number of markers specific to e7.5:")
print(length(unique(t2g[t2g$ens_gene %in% intersect(unique(t2g[t2g$ext_gene %in% markers$x, "ens_gene"]), e7.5specific$V1), "ext_gene"])))
print("Number of markers specific to e8.5:")
print(length(unique(t2g[t2g$ens_gene %in% intersect(unique(t2g[t2g$ext_gene %in% markers$x, "ens_gene"]), e8.5specific$V1), "ext_gene"])))
print("Number of markers specific to e9.5:")
print(length(unique(t2g[t2g$ens_gene %in% intersect(unique(t2g[t2g$ext_gene %in% markers$x, "ens_gene"]), e9.5specific$V1), "ext_gene"])))
}
```
```
[1] "Cell group 1"
[1] "Number of markers specific to e7.5:"
[1] 0
[1] "Number of markers specific to e8.5:"
[1] 1
[1] "Number of markers specific to e9.5:"
[1] 1
[1] "Cell group 2"
[1] "Number of markers specific to e7.5:"
[1] 0
[1] "Number of markers specific to e8.5:"
[1] 45
[1] "Number of markers specific to e9.5:"
[1] 8
[1] "Cell group 3"
[1] "Number of markers specific to e7.5:"
[1] 95
[1] "Number of markers specific to e8.5:"
[1] 4
[1] "Number of markers specific to e9.5:"
[1] 0
[1] "Cell group 4"
[1] "Number of markers specific to e7.5:"
[1] 0
[1] "Number of markers specific to e8.5:"
[1] 10
[1] "Number of markers specific to e9.5:"
[1] 66
[1] "Cell group 5"
[1] "Number of markers specific to e7.5:"
[1] 0
[1] "Number of markers specific to e8.5:"
[1] 1
[1] "Number of markers specific to e9.5:"
[1] 40
```
```{r}
sessionInfo()
```