-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
314 lines (224 loc) · 17.2 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
---
title: "README for patchSeqQC package"
author: "Shreejoy Tripathy"
date: "23/04/2018"
output:
html_document:
keep_md: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(ggplot2)
library(readr)
library(stringr)
library(pheatmap)
library(RColorBrewer)
library(cowplot)
library(magrittr)
devtools::load_all()
```
patchSeqQC
==========
This package includes functions for assessing the quality of single-cell transcriptomic data using cell type-specific marker gene expression, based on our paper, [Data-driven approaches for improving the interpretability of patch-seq data](https://www.biorxiv.org/content/early/2018/04/09/298133). All of the code and data for reproducing the figures from the manuscipt is available at the [patchSeqQCManuscript repository](https://github.com/PavlidisLab/patchSeqQCManuscript).
These functions were initially tailored for data collected using the patch-seq methodology, but are applicable to other single-cell RNAseq datasets. This package is meant as a tutorial to illustrate the approach outlined in our paper, but please contact us with feature requests and bug reports.
- [patchSeqQC](#patchseqqc)
- [Installation](#installation)
- [Usage](#usage)
* [Example patch-seq data](#example-patch-seq-data)
* [Visualizing cell type-specific marker expression](#visualizing-cell-type-specific-marker-expression)
+ [Cell type specific marker genes](#cell-type-specific-marker-genes)
+ [Associating single-cells to standardized cell type names](#associating-single-cells-to-standardized-cell-type-names)
+ [Visualizing expression of multiple cell type markers](#visualizing-expression-of-multiple-cell-type-markers)
* [Comparing marker expression in patch-seq data to dissociated cell reference data](#comparing-marker-expression-in-patch-seq-data-to-dissociated-cell-reference-data)
+ [Loading cell dissociation-based single cell reference dataset](#loading-cell-dissociation-based-single-cell-reference-dataset)
+ [Comparing expression of individual marker genes in patch-seq samples to dissociated cells](#comparing-expression-of-individual-marker-genes-in-patch-seq-samples-to-dissociated-cells)
+ [Summarizing cell type specific marker expression in patch-seq data](#summarizing-cell-type-specific-marker-expression-in-patch-seq-data)
+ [Summarizing cell type specific marker expression in reference data](#summarizing-cell-type-specific-marker-expression-in-reference-data)
* [Computing single-cell quality metrics based on comparison to reference data](#computing-single-cell-quality-metrics-based-on-comparison-to-reference-data)
* [Caveats in approach](#caveats-in-approach)
Installation
===========
Use devtools to install. Additional github packages needs to be installed for it work.
```
devtools::install_github('PavlidisLab/patchSeqQC')
```
This package requires the following packages:
```
install.packages('ggplot2')
install.packages('pheatmap')
install.packages('dplyr')
install.packages('knitr')
```
Usage
=====
Example patch-seq data
----------------------
This package includes sample data from [Cadwell et al. 2016](https://www.nature.com/articles/nbt.3445), which contains patch-seq data from layer 1 inhibitory interneurons and a small number of pyramidal cells.
```cadwell_meta``` includes basic sample metadata about each single-cell sample
```{r show_metadata_df}
data(cadwell_meta)
knitr::kable(head(cadwell_meta, 3))
```
**sample_id:** name of the samples. This needs to correspond to column names in the expression file.
**major_type:** Primary cell type names provided by Cadwell2016
**colors:** Unique color for cell type
**read_count:** Count of number of reads
**ercc_count:** Count of number of reads mapping to ERCC spike-ins
**ercc_pct:** Percent of reads mapping to ERCC spike-ins (relative to all sequenced reads)
```cadwell_expr``` includes gene expression data for these samples (stored as TPM + 1, transcripts per million)
```{r show_expr_mat}
data(cadwell_expr)
knitr::kable(head(cadwell_expr[, 1:7], 3))
```
Visualizing cell type-specific marker expression
-------------------------------
The central idea behind our single-cell quality metrics is that a high-quality single cell should strongly express markers of its own cell type and not express cell-type specific markers of other cell types. For example, we would expect that an cortical inhibitory neuron should strongly express GABAergic markers, like Gad2, Gad1, and not express markers of other cell types, such as markers of microglia like Tyrobp or C1qc.
### Cell type specific marker genes
To obtain cell type specific markers, we make use of mouse cortex-based single cell reference datasets which are comparable to the cell types profiled by in the Cadwell patch-seq dataset. Here, we provide pre-computed gene markers, ```markers```, based on the [Tasic2016](https://www.nature.com/articles/nn.4216) and [Zeisel2015](http://science.sciencemag.org/content/347/6226/1138.full) reference datasets.
When defining the markers, we have made the distinction between “on” and “off” markers. "On" markers, are genes that are highly expressed in the cell type of interest with enriched expression relative to other cell types. The second class, "Off" markers, are expected to be expressed at low levels in a given patch-seq cell type. These are genes that are specifically expressed in a single cell type (e.g., microglia) and, if expressed, are an indicator of possible cellular contamination.
```{r show_markers}
data(markers)
lapply(markers,head, 5)
```
### Associating single-cells to standardized cell type names
To make use of these marker genes, we need to associate each single cell from the patch-seq dataset to one of the "On" marker types listed above. Here, we define a column `contam_type` in the `cadwell_meta` data frame and assign all of the interneurons from the Cadwell dataset to the cell type "Ndnf", and the excitatory cells to the "Pyramidal" type.
```{r}
cadwell_meta$contam_type = 'Ndnf'
cadwell_meta[str_detect(cadwell_meta$major_type, 'Exc'), 'contam_type'] = 'Pyramidal'
knitr::kable(head(cadwell_meta %>% dplyr::select(sample_id, major_type, contam_type)))
```
### Visualizing expression of multiple cell type markers
Below, we use the function ```plotMarkerHeatmap``` to plot the expression of cell type specific markers of the elongated neurogliaform cells (eNGC) and pyramidal cells from the cadwell dataset. We also define the gene expression data frame, ```cadwell_engc_df```, which stores both metadata and gene expression values for the eNGC cells.
```{r plot_marker_heatmaps}
# plot marker expression of cadwell eNGCs
cadwell_engc_samples = cadwell_meta %>%
dplyr::filter(major_type == 'eNGC') %>%
dplyr::pull(sample_id) # get sample ids for eNGC cells
cadwell_engc_df = cadwell_expr[, cadwell_engc_samples]
cadwell_engc_df = cbind(cadwell_meta[cadwell_meta$sample_id %in% cadwell_engc_samples, ],
cadwell_engc_df %>% t()) # combines metadata and expression matrices and transposes it
rownames(cadwell_engc_df) = cadwell_engc_samples
plot_cell_types = c('Ndnf_on', 'Pyramidal', 'Microglia') # define which cell types to plot in heatmap
plot_marker_list = c(markers[plot_cell_types])
cadwell_engc_heatmap = plotMarkerHeatmap(markerlist = plot_marker_list, # named list of lists defining which markers to show
expr_matrix = cadwell_engc_df, # data frame that combines gene expr with metadata, each row is a single-cell
show_legend = T, # show color bar ?
show_cell_labels = T # show sample names in heatmap (defined by rownames of expr_matrix)
)
# plot expression of cadwell pyramidal cells
cadwell_pyr_samples = cadwell_meta %>% dplyr::filter(major_type == 'Exc') %>% dplyr::pull(sample_id)
cadwell_pyr_df = cadwell_expr[, cadwell_pyr_samples]
cadwell_pyr_df = cbind(cadwell_meta[cadwell_meta$sample_id %in% cadwell_pyr_samples, ], cadwell_pyr_df %>% t())
rownames(cadwell_pyr_df) = cadwell_pyr_samples
plot_cell_types = c('Pyramidal_on', 'Inhibitory', 'Microglia')
plot_marker_list = c(markers[plot_cell_types])
cadwell_pyramidal_heatmap = plotMarkerHeatmap(plot_marker_list, cadwell_pyr_df, show_legend = T, show_cell_labels = T )
```
Comparing marker expression in patch-seq data to dissociated cell reference data
--------------------------------------
### Loading cell dissociation-based single cell reference dataset
We've provided a small version of the Tasic2016 single cell cortical reference dataset (with expression of only marker genes), stored in the variable ```aibsExprData``` and quantified as TPM + 1. Metadata for these samples is available in ```aibsExprData```.
```{r aibs_show}
data(aibsExprData) # need to show that we have a field major types and contam_Type in here as well
knitr::kable(head(aibsExprData[1:5, 1:5]))
data(aibsExprMeta)
knitr::kable(head(aibsExprMeta, 4))
aibsExprDataDf = cbind(aibsExprMeta, aibsExprData %>% t()) # combines metadata and expression matrices
rownames(aibsExprDataDf) = aibsExprMeta$sample_id
```
**sample_id:** name of the samples. This needs to correspond to column names in the expression file.
**aibs_type:** Primary cell type names provided in Tasic2016
**norm_sub_types: ** Cell type subtype names
**norm_broad_types: ** Cell type broad names
### Comparing expression of individual marker genes in patch-seq samples to dissociated cells
After loading in the data from the Tasic2016 data, we can compare at the expression of cell type specific markers (for various cell types) to the same markers in patch-seq samples, Below, we use the function `plotPatchSeqVsDissocMarkerExpr` for this purpose, which on the y-axis shows the average expression of markers across `Ndnf` cells in the Tasic 2016 data and the x-axis shows expression of markers in a single patch-seq sample (one marker gene = one dot).
```{r}
cell_types_to_plot = c('Ndnf_on', 'Astrocyte', 'Endothelial', 'Microglia', 'Oligodendrocyte', 'OPC', 'Pyramidal')
plotPatchSeqVsDissocMarkerExpr(patch_seq_df = cadwell_engc_df, # patch-seq data frame that combines gene expr with metadata, each row is a single-cell
dissoc_cell_df = aibsExprDataDf, # dissociated cell data frame that combines gene expr with metadata, each row is a single-cell
contam_type = 'Ndnf', # cell type to plot
plot_cell_sample_ids = c('ERR1156978', 'ERR1156984'), # sample_ids to show
plot_marker_types = cell_types_to_plot, # cell types to show markers for
markers = markers # named list of lists for marker genes
)
```
### Summarizing cell type specific marker expression in patch-seq data
For each cell in the patch-seq dataset, we'll use the cell type specific marker genes to calculate summaries of each cell type's marker expression using the function ```calcContamAllTypes```. Each column is named a major cell type (i.e., a major cell type with defined markers in `markers`) and stores the sum of each cell's expression of each cell type marker (in units log2 (TPM + 1)).
```{r cadwell_contam}
cadwell_contam = calcContamAllTypes(cadwell_engc_df, markers)
cadwell_contam$colors = "indianred1"
cadwell_contam$major_type = cadwell_engc_df$major_type
cadwell_contam$contam_type = "Ndnf_on"
knitr::kable(head(cadwell_contam[1:8], 3))
```
### Summarizing cell type specific marker expression in reference data
We'll use the same function ```calcContamAllTypes``` to calculate summarized marker expression for each cell in the dissociated cell based Tasic2016 reference data, stored in the variable ```aibsExprDataDf```. Please note that it's essential that both the patch-seq data and the reference data are normalized using the same metric, TPM + 1 (transcripts per million) here.
```{r aibs_contam}
aibs_ndnf_contam = calcContamAllTypes(aibsExprDataDf[str_detect(aibsExprDataDf$aibs_type, 'Ndnf'), ], markers)
aibs_ndnf_contam$colors = 'indianred1'
aibs_ndnf_contam$contam_type = 'Ndnf_on'
aibs_pyr_contam = calcContamAllTypes(aibsExprDataDf[str_detect(aibsExprDataDf$aibs_type, 'L'), ], markers)
aibs_pyr_contam$colors = "turquoise"
aibs_pyr_contam$contam_type = 'Pyramidal'
aibs_contam = rbind(aibs_ndnf_contam %>% dplyr::select(Ndnf_on, Pyramidal, colors, contam_type),
aibs_pyr_contam %>% dplyr::select(Ndnf_on, Pyramidal, colors, contam_type)
)
```
We can use the function ```plotPatchSeqDissocCellMarkerExpr``` to plot the summarized marker expression for both the patch-seq and the reference datasets (one dot = one cell).
```{r}
plotPatchSeqDissocCellMarkerExpr(patch_seq_contam_df = cadwell_contam,
dissoc_cell_contam_df = aibs_contam,
on_type = 'Ndnf_on',
off_type = 'Pyramidal')
```
Move this somewhere else
```{r}
aibsExprDataDf$contam_type = aibsExprDataDf$norm_broad_type
aibs_contam_all_broad = calcContamAllTypes(aibsExprDataDf, markers)
aibs_contam_all_broad$contam_type = factor(aibsExprDataDf$contam_type)
aibs_med_exprs_broad = aibs_contam_all_broad %>% dplyr::group_by(contam_type) %>% dplyr::summarize_all(median) %>% as.data.frame()
rownames(aibs_med_exprs_broad) = aibs_med_exprs_broad$contam_type
aibsExprDataDf$contam_type = paste0(aibsExprDataDf$norm_sub_type, '_on')
aibs_contam_all_sub = calcContamAllTypes(aibsExprDataDf, markers)
aibs_contam_all_sub$contam_type = factor(aibsExprDataDf$contam_type)
aibs_med_exprs_sub = aibs_contam_all_sub %>% dplyr::group_by(contam_type) %>% dplyr::summarize_all(median) %>% as.data.frame()
rownames(aibs_med_exprs_sub) = aibs_med_exprs_sub$contam_type
aibs_med_exprs = rbind(aibs_med_exprs_broad, aibs_med_exprs_sub[c('Ndnf_on', 'Sncg_on', 'Pvalb_on', 'Pyramidal_on'), ])
compare_cell_types_inh = setdiff(names(markers), c('Ndnf_on', 'Pvalb_on', 'Sncg_on', 'Pyramidal_on', 'Inhibitory')) %>% make.names()
compare_cell_types_exc = setdiff(names(markers), c('Pyramidal_on', 'Pvalb_on', 'Sncg_on', 'Ndnf_on', 'Pyramidal')) %>% make.names()
exc_cell_types = c('CA1-PYR', 'Exc', 'RS-PYR', 'BS-PYR')
```
Computing single-cell quality metrics based on comparison to reference data
---------------------------------------------------------------------------
We have developed a number of metrics to evaluate the quality of single cell samples by comparison to dissociated cell reference data that can be calculated using the function ```calculatePatchSeqQCMetrics```.
```{r}
cadwell_engc_qc_df = calculatePatchSeqQCMetrics(cadwell_engc_df, comparison_dataset = 'aibs', markers)
knitr::kable(head(cadwell_engc_qc_df %>% dplyr::select(sample_id:quality_score)))
```
**sample_id:** name of the samples.
**major_type:** cell type identities (provided by Cadwell2016)
**contam_type:** cell type identities (normalized to cell type names in `markers`)
**marker_sum:** Summed expression of "On" cell type marker genes (with cell type defined by `contam_type`)
**marker_sum_norm:** Normalized summed expression of "on"-type marker genes, normalized to median expression of same cell type in dissociated-cell reference data
**contam_sum:** Contamination score, defined as the sum of normalized expression across all "off" cell types defined in `compare_cell_types_inh`
**quality_score:** Quality score, defined as the Spearman correlation of marker expression between markers expressed in single cell with mean expression of markers in dissociated cell reference data
This function also outputs normalized expression of each "off"-cell type (defined in `compare_cell_types_inh`) and we can use the function `plotContamHeatmap` to show these (each column is one single cell)
```{r}
knitr::kable(head(cadwell_engc_qc_df %>% dplyr::select(sample_id, Astrocyte:Pyramidal)))
contam_plot_types = factor(c('Astrocyte', 'Endothelial', 'Microglia', 'Oligodendrocyte', 'OPC', 'Pyramidal'), levels = c('Astrocyte', 'Endothelial', 'Microglia', 'Oligodendrocyte', 'OPC', 'Pyramidal'))
contam_mat = getContamMatrixFromPatchSeq(cadwell_engc_qc_df, 'eNGC', contam_show_types = contam_plot_types)
summarized_contam_heatmap = plotContamHeatmap(contam_mat, show_cell_labels = T)
```
We plot the `marker_sum`, `contam_sum` and `quality_score` variables below
```{r}
p1 = ggplot2::ggplot(data = cadwell_engc_qc_df, aes(x = marker_sum, y = contam_sum)) +
geom_point() + geom_smooth(method = "lm", se = F)
p2 = ggplot2::ggplot(data = cadwell_engc_qc_df, aes(x = marker_sum, y = quality_score)) +
geom_point() + geom_smooth(method = "lm", se = F)
plot_grid(p1, p2, nrow = 1)
```
Caveats in approach
-------------------
Our single-cell quality metrics are dependent on the following factors: 1) being able to confidently assign a single-cell's major cell type, ideally without using its transcriptomic data. 2) the availability of high-quality single-cell reference data to use for comparing the expression of observed marker genes to expected marker genes.
We note that there are likely situations where either or both of these do not hold, for example, if the cells profiled have unknown types or if reference data is not available (for example, patch-seq data collected from a brain region without adequate cell-dissociation based single-cell atlases).