/
99_Methods.Rmd
227 lines (191 loc) · 9.67 KB
/
99_Methods.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
---
title: "Methods"
output:
workflowr::wflow_html:
code_folding: hide
bibliography:
- "`r here::here('data/packages.bib')`"
- "`r here::here('data/references.bib')`"
---
```{r knitr, include = FALSE}
DOCNAME = "99_Methods"
knitr::opts_chunk$set(autodep = TRUE,
cache = FALSE,
cache.path = paste0("cache/", DOCNAME, "/"),
cache.comments = FALSE,
echo = TRUE,
error = FALSE,
fig.align = "center",
fig.width = 10,
fig.height = 8,
message = FALSE,
warning = FALSE)
```
```{r libaries, cache = FALSE}
# Presentation
library("glue")
library("knitr")
# Paths
library("here")
# Output
library("jsonlite")
# Tidyverse
library("tidyverse")
```
```{r pkg-bib}
write_bib(c("base", "SingleCellExperiment", "cowplot", "workflowr"),
file = here("data/packages.bib"))
```
```{r load}
docs <- list.dirs(here("output"), full.names = FALSE)[-1]
params <- map(docs, function(doc) {
out <- tryCatch(
{
table <- suppressWarnings(
read_json(here(glue("output/{doc}/parameters.json")),
simplifyVector = TRUE)
)
p.list <- table$Value %>% setNames(table$Parameter)
},
error = function(e) {
message(glue("{doc} parameters file not found"))
return(list())
}
)
return(out)
})
names(params) <- str_remove(docs, "[0-9A-Z]+_")
```
Pre-processing
==============
The Cell Ranger pipeline (v1.3.1) was used to perform sample demultiplexing,
barcode processing and single-cell gene counting. Briefly, samples were
demultiplexed to produce a pair of FASTQ files for each sample. Reads containing
sequence information were aligned to the GRCh38 reference genome provided with
Cell Ranger (v1.2.0). Cell barcodes were filtered to remove empty droplets and
PCR duplicates were removed by selecting unique combinations of cell barcodes,
unique molecular identifiers and gene ids with the final results being a gene
expression matrix that was used for further analysis. The three samples in the
first batch of organoids were aggregated using Cell Ranger with no normalisation
and treated as a single dataset.
Quality control
===============
The R statistical programming language
(v`r str_extract(R.version.string, "[0-9\\.]+")`) [@R-base] was used for further
analysis. Count data for each experiment was read into R and used to construct a
SingleCellExperiment object (v`r packageVersion("SingleCellExperiment")`)
[@R-SingleCellExperiment].Gene annotation information was added from BioMart
[@Smedley2015-an] using the biomaRt package (v`r packageVersion("biomaRt")`)
[@Durinck2005-zb] and cells were assigned cell cycle scores using the cyclone
[@Scialdone2015-rp] function in the scran package (v`r packageVersion("scran")`)
[@Lun2016-lp].
The scater package (v`r packageVersion("scater")`) [@McCarthy2017-ql] was used
to produce a series of diagnostic quality control plots. Cells with a high
number of expressed genes (indicating potential doublets) were removed, as were
cells with a high percentage of counts assigned to mitochondrial or ribosomal
genes, or with low expression of the housekeeping genes GAPDH and ACTB.
Genes that had less than two total counts across a dataset, or were expressed in
less than two individual cells, were removed. We also removed genes without an
annotated HGNC symbol.
Following quality control the first organoid dataset consisted of
`r params$Organoid123_QC$n_cells` cells and `r params$Organoid123_QC$n_genes`
genes with a median of `r params$Organoid123_QC$median_genes` genes expressed
per a cell, the fourth organoid had `r params$Organoid4_QC$n_cells` cells and
`r params$Organoid4_QC$n_genes` genes with a median of
`r params$Organoid4_QC$median_genes` genes expressed per cell and the Lindstrom
dataset had `r params$Lindstrom_QC$n_cells` cells and
`r params$Lindstrom_QC$n_genes` genes with a median of
`r params$Lindstrom_QC$median_genes` genes expressed per cell.
Clustering analysis
===================
Organoids
---------
The two organoid datasets were integrated using the alignment method in the
Seurat package (v`r packageVersion("Seurat")`) [@Satija2015-or; @Butler2018-js].
Briefly, highly variable genes were identified in each dataset and those that
were present in both datasets (`r params$Organoids_Integration$var_genes` genes)
were selected. Canonical correlation analysis
[@Hotelling1936-ni; @Hardoon2004-ev] was then performed using the
selected genes and `r params$Organoids_Integration$n_CCs` dimensions that
represent the majority of variation were selected. The final step used dynamic
time warping [@Berndt1994-lh] to align the datasets in the selected subspace.
To perform clustering Seurat constructs a shared nearest neighbour graph of
cells in the aligned subspace and uses Louvain modularity optimisation
[@Blondel2008-ym] to assign cells to clusters. The number of clusters produced
using this method is controlled by a resolution parameter with higher values
giving more clusters. We performed clustering over a range of resolutions from
`r min(params$Organoids_Clustering$resolutions)` to
`r max(params$Organoids_Clustering$resolutions)` in steps of 0.1 and used the
clustree package (v`r packageVersion("clustree")`) to produce clustering trees
[@Zappia2018-lz] showing the expression of known marker genes to select the
appropriate resolution to use. We chose to use a resolution of
`r params$Organoids_Clustering$res` which produced
`r params$Organoids_Clustering$n.clusts` clusters.
Marker genes for each cluster were detected by testing for differential
expression between cells in one cluster and all other cells using a Wilcoxon
rank sum test [@Bauer1972-yf]. To reduce processing time only genes that were
expressed in at least 10 percent of cells in one of these groups were tested. To
identify conserved marker genes a similar process was performed on each dataset
separately and the results combined using the maximum p-value method. We also
tested for within cluster differential expression to identify differences
between cells of the same type in different datasets.
Based on identified marker genes we determined clusters
`r glue_collapse(params$Organoids_Nephron$clusters, last = " and ")` represented
the nephron lineage. The `r params$Organoids_Nephron$n_cells` cells in these
clusters were re-clustered at a resolution of
`r params$Organoids_Nephron$res` resulting in
`r params$Organoids_Nephron$n.clusts` clusters.
We also performed pseudotime trajectory analysis on the nephron cells using
Monocle (v`r packageVersion("monocle")`) [@Trapnell2014-he; @Qiu2017-mq]. The
intersection of the top 100 genes with the greatest absolute foldchange for each
nephron cluster were selected for this analysis, giving a set of
`r params$Organoids_Trajectory$n.genes` genes used to order the cells.
Combined
--------
The combined organoid and human fetal kidney analysis used the procedure
described for the organoid only analysis, but with slightly different parameters.
We identified `r params$Combined_Integration$var_genes` variable genes present
in all three datasets and selected the first
`r params$Combined_Integration$n_CCs` canonical correlation dimensions. For
clustering we chose a resolution of `r params$Combined_Clustering$res` which
produced `r params$Combined_Clustering$n.clusts` clusters. Clusters
`r glue_collapse(params$Combined_Nephron$clusters, sep = ", ", last = " and ")`
were determined to be the nephron lineage and these
`r params$Combined_Nephron$n_cells` cells were re-clustered at a resolution of
`r params$Combined_Nephron$res` producing `r params$Combined_Nephron$n.clusts`
clusters.
We also performed differential expression testing between the two datasets as a
whole which was used to identify a signature of
`r params$Combined_Clustering$n.sig` genes that represent the main differences
between them. To identify cell type specific differences between organoid and
human fetal kidney we performed differential expression testing between cells
within a cluster and removed genes from the overall differential expression
signature. Cluster `r params$Combined_Nephron$hFK_pod_clust` in the combined
nephron analysis was identified as a human fetal kidney specific podocyte
cluster. To investigate the differences we compared expressions in this cluster
to the general podocyte cluster (CN`r params$Combined_Nephron$pod_clust`).
Visualisation and presentation
==============================
Figures shown here were produced using functions in the Seurat, Monocle and
clustree packages. Additional plots and customisations were created using the
ggplot2 (v`r packageVersion("ggplot2")`) [@Wickham2010-zq] and cowplot
(v`r packageVersion("cowplot")`) [@R-cowplot] packages. The analysis project
was managed using the workflowr (v`r packageVersion("workflowr")`)
[@R-workflowr] package which was also used to produce the publicly available
website displaying the analysis code, results and output.
Availability
============
Both organoid datasets are available from GEO accession number GSE114802 and the
Lindstrom fetal kidney dataset is available from GEO accession GSE102596. A
website showing reports produced during analysis including the exact software
versions and parameters used can be accessed at
http://oshlacklab.com/combes-organoid-paper and the analysis code is
available at https://github.com/Oshlack/combes-organoid-paper.
References
==========
<div id="refs"></div>
Session information
-------------------
```{r session-info, cache = FALSE}
devtools::session_info()
```