-
Notifications
You must be signed in to change notification settings - Fork 2
/
example_adat_workflow.Rmd
350 lines (262 loc) · 11.7 KB
/
example_adat_workflow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
---
title: "Example ADAT Workflow Using SomaScan.db"
package: "`r BiocStyle::pkg_ver('SomaScan.db')`"
output:
BiocStyle::html_document:
toc_float: true
vignette: >
%\VignetteIndexEntry{Example ADAT Workflow Using SomaScan.db}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, crop = NULL)
library(BiocStyle)
library(dplyr)
library(GO.db)
library(SomaDataIO)
library(SomaScan.db)
```
# Introduction
This vignette will illustrate how the `SomaScan.db` package can be used to
extend the results of basic bioinformatic analyses performed on SomaScan data.
In this vignette, we will use tools from `r CRANpkg("SomaDataIO")`, in
conjunction with `SomaScan.db`, to identify and annotate proteins
(and ultimately, genes and their biological pathways) that are associated
with a clinical variable of interest; e.g. `Age`.
This workflow assumes the user has a basic knowledge of the R programming
language. In addition, it is assumed that the user is familiar with the ADAT
format, as well as the documentation in `r CRANpkg("SomaDataIO")` and the
`r Githubpkg("SomaLogic/SomaLogic-Data")` Github repository. For more
information about the ADAT format, as well as a detailed description of the
fields found in an ADAT, please reference the
`r Githubpkg("SomaLogic/SomaLogic-Data")` repository
[README](https://github.com/SomaLogic/SomaLogic-Data).
```{r load-pkgs}
library(dplyr)
library(GO.db)
library(SomaDataIO)
library(SomaScan.db)
```
# Reading an ADAT
The ADAT file format is a SomaLogic-specific, tab-delimited text file designed
to store SomaScan study data. The example ADAT file used for this walkthrough,
`example_data10.adat`, contains a SomaScan 5k study from a set of 10 human
samples. The measurements and identifiers have been altered to protect
personally identifiable information (PII), while retaining the underlying
biological signal as much as possible.
To begin, we will utilize `SomaDataIO::read_adat()` to load the example
ADAT file:
```{r read-adat}
# Retrieve the example ADAT file from the SomaDataIO package
file <- system.file("extdata", "example_data10.adat",
package = "SomaDataIO", mustWork = TRUE
)
# Read in the file
adat <- SomaDataIO::read_adat(file)
adat
```
---------------------
## Data Prep and Exploration
Next, we will perform some preliminary data exploration of our variable
of interest (`Age`):
```{r age-summary}
summary(adat$Age)
```
```{r age-hist, fig.height=3.5, fig.width=5}
hist(adat$Age,
xlab = "Age (years)",
main = "Age Distribution"
)
```
The subjects in this example data set are all adults, with age ranging from
`r min(adat$Age, na.rm = TRUE)` to `r max(adat$Age, na.rm = TRUE)`
years.
However, as suggested by the summary table above, there is 1 patient that is
missing age information:
```{r rm-na}
dplyr::filter(adat, is.na(Sex)) %>%
dplyr::select(SlideId, Age)
```
**Note**: You may notice that the `select` function above has the namespace
(`r CRANpkg("dplyr")`) specified; this is to distinguish between
`dplyr::select()` and `SomaScan.db::select()`, both of which will be used in
this vignette. A more thorough explanation of why this is important can be
found in the "`select` namespace collision" section.
Samples corresponding to this patient will need to be removed before
proceeding:
```{r}
adat <- dplyr::filter(adat, !is.na(Age))
```
As a quick sanity check: this ADAT was generated using the 5k SomaScan assay
menu, which indicates that there should be RFU values for approximately
5,000 analytes.
```{r n-analytes}
analytes <- SomaDataIO::getAnalytes(adat)
head(analytes)
```
```{r}
length(analytes)
```
Indeed, there are `r length(analytes)` analytes in this example ADAT file,
consistent with the assay version.
---------------------
## Note on `SeqId` format
The values obtained in `analytes` above contains the same information as a
SomaLogic `SeqId`, just converted to a more R-friendly format (i.e. the
`seq`-prefix and `"."` delimiters). This format of identifier is
used throughout `SomaDataIO`, and is also compatible with `SomaScan.db`.
Values obtained from `SomaDataIO` (with the `seq`-prefix) are
converted to `SeqIds` automatically when used by methods in `SomaScan.db`.
# Identifying Associations
We can now examine the data to see if any analytes are correlated with the
variable `Age`. First, however, the data be be pre-processed. SomaLogic
generally recommends to log-transform the RFU values in an ADAT prior to
analysis.
```{r log-trans}
log10_adat <- log10(adat)
```
After log-transformation, we can examine the correlations and identify
analytes that are positively correlated with `Age`. Here, correlations will be
calculated using `stats::cor.test()`:
```{r cor-test}
# Calculate correlations for each analyte
cors_df <- lapply(analytes, function(x) {
results <- stats::cor.test(log10_adat$Age, log10_adat[[x]],
method = "spearman", exact = FALSE)
results <- results[c("estimate", "p.value")]
unlist(results)
}) %>% setNames(analytes)
# Bind results together into a single dataframe
cors_df <- dplyr::bind_rows(cors_df, .id = "analytes")
# Isolate top positive correlations
top_posCor <- cors_df %>%
dplyr::filter(p.value < 0.05) %>% # Retain significant cors only
dplyr::filter(estimate.rho >= 0.75) %>% # Retain strong correlations
arrange(desc(estimate.rho))
nrow(top_posCor)
head(top_posCor, 20L)
```
The table above contains analytes that have a strong positive
correlation with `Age`, but this information alone may not enough to derive
meaningful biological insights. Which proteins and genes do these identifiers
correspond to? Are the most correlated proteins functionally related in some
way, perhaps as part of the same biological pathway? These are the types of
questions that `SomaScan.db` is designed to address.
In the next section, we will annotate these data by querying the above
`SeqIds` in `SomaScan.db` to retrieve additional information about their
corresponding genes and gene types.
# Annotating Results
To obtain a list of available annotations, use the `columns` method:
```{r columns}
columns(SomaScan.db)
```
Note that the "PROBEID" column corresponds to the SomaScan `SeqId`, the
central probe for the platform. This identifier ties the other available
annotations (listed in the `columns` output above) to the data found in an
ADAT file. For more information on the ADAT file format and an explanation of
its contents, please reference the
[SomaLogic-Data GitHub repository](https://github.com/SomaLogic/SomaLogic-Data).
For this example, we will retrieve Gene Ontology (GO) identifiers associated
with the `SeqIds` that are positively correlated with `Age`. This will be the
first step in identifying biological processes associated with the protein
targets of these `SeqIds`.
```{r GO-columns}
grep("GO|ONTOLOGY", columns(SomaScan.db), value = TRUE)
```
The `GO` and `GOALL` columns both contain GO pathway identifiers, while
`ONTOLOGY` and `ONTOLOGYALL` contain additional metadata about the
identifiers. For additional information about the values returned by `columns`,
run `help("{COLUMN NAME}")` to get a more detailed description of the listed
options.
---------------------
## Retrieving Annotations
The `select` method can be used to query the annotations to retrieve
information corresponding to SomaScan analytes of interest. But, before we
proceed, a note about the `select` method.
---------------------
### `select` Namespace Collisions
The `select` method is unique to Bioconductor's annotation packages (of which
`SomaScan.db` is one), and should not be confused with the similarly-named
`dplyr::select()`. These two functions perform very different actions on
different classes of objects. If you have the `r CRANpkg("dplyr")` package
([dplyr](https://dplyr.tidyverse.org)) loaded at the same time as
`SomaScan.db` (as we do in this vignette), you may encounter an error like the
one below when using `select`:
```r
Error in UseMethod("select") :
no applicable method for 'select' applied to an object of class "c('SomaDb',
'ChipDb', 'AnnotationDb', 'envRefClass', '.environment', 'refClass',
'environment', 'refObject', 'AssayData')"
```
This error indicates that it is unclear which `select` you are attempting to
use (the `dplyr` version or the `SomaScan.db` version). To remedy this, it
can be helpful to use the `::` operator to specify the package namespace
when calling `select`, as seen in the code chunk below.
---------------------
### Gene Name Annotation
Continuing on with the example, we will retrieve gene names (as symbols) that
correspond to the top 10 analytes associated with `Age`.
```{r top-10}
top10_analytes <- head(top_posCor$analytes, 10L)
anno <- SomaScan.db::select(SomaScan.db,
keys = top10_analytes,
columns = c("SYMBOL", "GENENAME", "GENETYPE"))
anno
```
Remember that `select` retains the order of the original query, so these genes
are still listed in order of most highly associated with `Age`.
---------------------
### GO Annotation
Which pathways or biological processes are these genes associated with? Could
that information help explain why they positively correlated with age?
`SomaScan.db` enables mapping between `SeqIds` and GO identifiers, so let's
take the top 3 genes and add GO annotations to the data frame. But why only 3
genes? We are likely to receive a lot of additional
information if we query the database for GO identifiers, so it can be easier
and cleaner to begin with a small example data set when retrieving additional
annotations.
```{r}
go_anno <- SomaScan.db::select(SomaScan.db,
keys = anno$PROBEID[1:3L],
columns = c("SYMBOL", "GENENAME", "GENETYPE",
"GO", "ONTOLOGY")) %>%
dplyr::filter(ONTOLOGY == "BP")
go_anno
```
As expected when working with GO terms, there were quite a few rows retrieved
by that `select` query (`r nrow(go_anno)` rows of information for only 3
genes). Filtering the results to the biological process ("BP") ontology only
can help reduce the number of GO identifiers that are returned in the query.
This leaves us with a list of GO identifiers, but those identifiers do not
adequately explain much about the term itself. How can we get more
"human-readable" information out of GO?
Luckily, this is easy to do with other annotation tools from Bioconductor. The
`r Biocpkg("GO.db")` annotation package contains annotations describing the
entire Gene Ontology knowledgebase, assembled using data directly from the
[GO website](http://geneontology.org/). We can use the GO IDs identified by
`SomaScan.db` to connect with `r Biocpkg("GO.db")` and retrieve more
information about each GO ID.
```{r}
go_terms <- AnnotationDbi::select(GO.db,
keys = go_anno$GO,
columns = c("GOID", "TERM", "DEFINITION"))
go_terms
```
Now we have _much_ more information about each GO ID! Using these IDs, we can
merge this information back into our `select` results from `SomaScan.db`:
```{r join-go}
final_df <- left_join(go_anno, go_terms, by = c("GO" = "GOID"))
final_df
```
The `SomaScan.db` package can be used to link to numerous other
annotation resources. For a more detailed description of how this can be done
with examples of what resources are available, see the Advanced Usage Examples
vignette.
There is still room for further exploration and extension of these results;
this workflow is meant to be an introduction to how `SomaScan.db` can be
used to build upon and interpret information obtained from a SomaLogic ADAT.
# Session Info
```{r session-info}
sessionInfo()
```