-
Notifications
You must be signed in to change notification settings - Fork 12
/
customizing-dccvalidator.Rmd
334 lines (289 loc) · 15.1 KB
/
customizing-dccvalidator.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
---
title: "Customizing dccvalidator"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{customizing-dccvalidator}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
dccvalidator is intended customizable for different settings, however the
built-in Shiny application is designed to ensure that data conforms to a
specific project organization structure in which data is organized into studies
that are described with a combination of metadata files. These metadata files
document the individuals, specimens, and assay(s) that were performed in the
study. Documenting the data in this way allows researchers to describe the study
at different levels while minimizing repetition. These metadata files can later
be joined on the `individualID` and `specimenID` columns to create a table of
all of the metadata for the study.
For projects like AMP-AD, PsychENCODE, and others that follow the same
structure, see the next section ("Customizing the Shiny application") for how
to configure an instance of the application for the project.
For projects that do not follow the same structure but wish to use some elements
of dccvalidator's data validation capabilities, see "Extending dccvalidator"
below.
## Customizing the Shiny application
The Shiny app uses a [configuration file](https://github.com/Sage-Bionetworks/dccvalidator/blob/master/config.yml)
to set details such as where to store uploaded files, which metadata templates
to validate against, whom to contact with questions, etc.
To create a custom version of the app, you'll need to follow these steps:
1. Create a Synapse project or folder with the appropriate permissions to store
uploaded files. In AMP-AD, we created a folder to which consortium members
have permissions to read and write, but *not* download. Only the curation
team has the ability to download files (in order to assist with debugging).
See this
[example of how to create a project with appropriate permissions](https://github.com/Sage-Bionetworks/dccvalidator/blob/master/inst/app/create_project.R).
1. Fork the [dccvalidator](https://github.com/Sage-Bionetworks/dccvalidator/blob/master/config.yml)
GitHub repository.
1. (For a production version): register a Synapse OAuth client (see [Synapse Docs](https://help.synapse.org/docs/Using-Synapse-as-an-OAuth-Server.2048327904.html#UsingSynapseasanOAuthServer-RegisteringandLinkinganOAuth2.0Client)) and store the `client_name`, `client_id` and `client_secret` in a .Renviron file within the repository. Ensure you are the only person who has access to this file (e.g. `chmod 600 .Renviron` or `chmod 400 .Renviron`).
1. (For running locally only): create a .synapseConfig file in your home directory with your Synapse authentication credentials (see the [synapser docs for an example](https://r-docs.synapse.org/articles/manageSynapseCredentials.html#use-synapseconfig)).
1. Create a new configuration in the `config.yml` file. Note that any values you
do not customize will be inherited from the default configuration. The configuration file must have a `default` configuration.
1. (Optional): create a pull request with your configuration back to the
upstream dccvalidator repository.
1. Within the file [`app.R`](https://github.com/Sage-Bionetworks/dccvalidator/blob/master/app.R),
replace the `"default"` configuration with the name of your new
configuration.
1. Deploy the application as described in the
[Deploying dccvalidator](https://sage-bionetworks.github.io/dccvalidator/articles/deploying-dccvalidator.html)
vignette.
To install the dccvalidator instead of forking the repository:
1. Create an `app.R` file containing the following:
```
library("dccvalidator")
run_app()
```
1. Create a `config.yml` file using the configuration options specified below and name the parameters "default".
```
default:
parent: "syn20400157"
```
1. Create a Synapse project or folder with the appropriate permissions to store
uploaded files. In AMP-AD, we created a folder to which consortium members
have permissions to read and write, but *not* download. Only the curation
team has the ability to download files (in order to assist with debugging).
See this
[example of how to create a project with appropriate permissions](https://github.com/Sage-Bionetworks/dccvalidator/blob/master/inst/app/create_project.R).
1. Create a .synapseConfig file in your home directory with your Synapse
authentication credentials (see the
[synapser docs for an example](https://r-docs.synapse.org/articles/manageSynapseCredentials.html#use-synapseconfig)).
### Configuration options
* `production`: `TRUE` to use production Synapse endpoints; `FALSE` to use
staging endpoints. Used for testing new versions of the client before they are
officially released. For information on endpoints, see [Synapse client documentation](https://python-docs.synapse.org/build/html/index.html?highlight=endpoints#synapseclient.Synapse.setEndpoints).
* `app_url`: The URL for the app when in production. This is required for using
the Synapse OAuth 2.0 client for logging into the application. See
[Deploying dccvalidator](https://sage-bionetworks.github.io/dccvalidator/articles/deploying-dccvalidator.html)
for more information.
* `parent`: The Synapse project or folder where files will be stored
* `path_to_markdown`: Location of an R Markdown document with app instructions.
If you wish to omit instructions, insert `!expr NA`.
* `study_table`: Synapse ID of a table that lists all of the existing studies in
the consortium. It should have a column called `StudyName`
* `samples_table`: Synapse ID of a table that lists the key-value pairs of `study`, `individualID`,
`specimenID`, `assay`, `species` to validate that additions to existing studies retain all IDs
that were previously shared.
* `annotations_table`: Synapse ID of a table that lists allowable annotation
keys and values for the consortium. This should follow the same basic format
as our Synapse Annotations table, e.g. there must be the following columns:
`key`, `value`, and `columnType`. `columnType` options are `STRING`,
`BOOLEAN`, `INTEGER`, `DOUBLE`.
* `annotations_link`: URL to a list or description of allowable annotation
values (so users can read more)
* `templates_link`: URL to the location of metadata templates
* `docs_tab`: These settings configure the documentation tab, including the
option to not have a tab at all.
* `include_tab`: `TRUE` to include the documentation tab in the app, else `FALSE`.
* `include_upload_widget`: `TRUE` to include the widget that allows upload of
documentation, else `FALSE`. `include_tab` must be `TRUE` to have the widget.
* `tab_name`: Desired name of the documentation tab.
* `path_to_docs_markdown`: Location of an R Markdown document with
documentation instructions.
* `teams`: The team(s) a user must be a member of in order to use the app. If
the user is not in any of the teams, they will see a message telling them they
must be added to one of the teams.
* `include_biospecimen_type`: `TRUE` to include specific biospecimen types
within each species, else `FALSE`. This will dictate whether the Specimen Type
panel appears in the validator tab.
* `templates` (including `manifest_template`, `individual_templates`,
`biospecimen_templates`, and `assay_templates`): Location of templates to
use for validation. These may be a Synapse synID or the id of a
Synapse-registered JSON schema. See the
[config file here](https://github.com/Sage-Bionetworks/dccvalidator/blob/master/inst/config.yml)
for an example and see below for more detailed information on `templates`.
* Template reference types:
* synID: Templates accessed via synID should be either .xlsx or .csv
files, where the column names reflect the needed columns in the
template. If the template is an excel file with multiple sheets, the
first sheet will be used.
* JSON schema id: Templates accessed via JSON schema id should be
registered in Synapse. The schema should have a `properties` object
listing all column names (a.k.a. keys) that should be present in the
metadata. The `manifest_template` should have only one template.
* Template types:
* `individual_templates`: Individual templates should be based on species.
Note that the species must also be included in `species_list`.
* `biospecimen_templates`: Biospecimen templates have two options. The first
option is for the biospecimen templates to be based on species only and
`include_biospecimen_type` is set to `FALSE`. Note that the species must
also be included in `species_list`. The second option is for the
biospecimen templates to be be based on both a species and a biospecimen
type (e.g. in vitro, in vivo) and `include_biospecimen_type` is set to
`TRUE`. If the biospecimen templates have specific biospecimen types,
these will be reflected in the validation section for the users to select
the correct type.
* `assay_templates`: Assay templates should be based on assay type. The list
assays will be reflected in the validation section for users to select
the correct type.
* `species_list`: List of possible species in the consortium. These are shown as
options in the validation UI and control which individual template and
biospecimen template the app validates against.
* `complete_columns`: For each metadata file and the manifest, a list of the
columns that must be complete (i.e. not contain any missing values or empty
strings). For metadata files, this should typically include `"individualID"`
and/or `"specimenID"`.
* `contact_email`: Email address linked in footer for users to contact if they
have questions
## Extending dccvalidator
Users who wish to use dccvalidator for projects that do not follow the same
structure as AMP-AD and PsychENCODE can do so by reusing the existing validation
functions in their own R code, implementing new checks as needed, and reusing
Shiny modules from dccvalidator.
### Using validation functions in scripts
Functions to check data for common quality issues are at the core of
dccvalidator. These functions in dccvalidator are all named with the pattern
`check_*()`: `check_annotation_keys()`, `check_annotation_values()`,
`check_cols_empty()`, etc.
See the [function reference](https://sage-bionetworks.github.io/dccvalidator/reference/index.html#section-data-validation-functions) for a complete list.
These functions are used in the dccvalidator Shiny app, but they can also be
used in scripts or reports. Each function takes data as input and returns
results in the form of custom condition objects. The condition objects inherit
from R's `"message"`, `"warning"`, and `"error"` classes. However, rather than
raising a message/warning/error, the check functions in dccvalidator return the
condition objects themselves.
```{r demo-check-function, message = FALSE}
library("dccvalidator")
library("readr")
## Load a sample manifest
manifest <- read_tsv(
system.file("extdata", "test_manifest.txt", package = "dccvalidator")
)
manifest
## Check that required columns are complete in the manifest
result <- check_cols_complete(
manifest,
required_cols = c("path", "parent", "grant")
)
result
```
The condition objects contain several useful pieces of information. There is a
message describing the result of the check, as well as a message describing the
expected outcome. For warnings and errors, the data that caused the warning or
error is included.
```{r see-result-details}
result$message
result$behavior
result$data
```
### Creating new validation functions
While we have functions for many data validation tasks, users may wish to create
their own custom checking functions. The functions `check_pass()`,
`check_warn()`, and `check_fail()` will create condition objects from the
provided arguments. Here is an example of how one could write a custom function
that checks if pH values are within an appropriate range.
```{r custom-function}
dat <- data.frame(pH = c(2, 6, 3, -2, 0, 15))
check_values_ph <- function(data, ph_col = "pH",
success_msg = "All pH values are valid",
fail_msg = "Some pH values are outside the range 0-14",
behavior_msg = "pH values should be between 0-14") {
values <- data[, ph_col, drop = TRUE]
if (all(values >= 0 & values <= 14)) {
check_pass(
msg = success_msg,
behavior = behavior_msg
)
} else {
check_fail(
msg = fail_msg,
behavior = behavior_msg,
data = values[values > 14 | values < 0]
)
}
}
check_values_ph(dat)
```
The `check_condition()` function can make the above a little more concise:
```{r check-condition-example}
check_values_ph <- function(data, ph_col = "pH",
success_msg = "All pH values are valid",
fail_msg = "Some pH values are outside the range 0-14",
behavior_msg = "pH values should be between 0-14") {
values <- na.omit(data[, ph_col, drop = TRUE])
all_valid <- all(values >= 0 & values <= 14)
check_condition(
msg = ifelse(all_valid, success_msg, fail_msg),
behavior = behavior_msg,
data = if (!all_valid) values[values > 14 | values < 0],
type = ifelse(all_valid, "check_pass", "check_fail")
)
}
```
### Reusing app modules
The Shiny app that dccvalidator provides allows users to see the results of data
validation. This module (`results_boxes_server()`/`results_boxes_ui()`) is
exported from dccvalidator and can be reused in other Shiny applications like so:
```{r show-module-demo, eval = FALSE}
library("shiny")
library("shinydashboard")
server <- function(input, output) {
# Load sample data
manifest <- read_tsv(
system.file("extdata", "test_manifest.txt", package = "dccvalidator")
)
biosp <- read_csv(
system.file("extdata", "test_biospecimen.csv", package = "dccvalidator")
)
# Add logic to run the checks you are interested in and store the results in a
# list. Here is an example:
res <- list(
check_cols_complete(
manifest,
c("path", "parent", "grant"),
success_msg = "All required columns present are complete in the manifest",
fail_msg = "Some required columns are incomplete in the manifest"
),
check_cols_complete(
biosp,
"specimenID",
success_msg = "All required columns present are complete in the biospecimen metadata",
fail_msg = "Some required columns are incomplete in the biospecimen metadata"
),
check_specimen_ids_dup(
biosp,
success_msg = "Specimen IDs in the biospecimen metadata file are unique",
fail_msg = "Duplicate specimen IDs found in the biospecimen metadata file"
)
)
# Show results in boxes
callModule(results_boxes_server, "validation_results", res)
}
ui <- function(request) {
dashboardPage(
header = dashboardHeader(),
sidebar = dashboardSidebar(),
body = dashboardBody(
includeCSS(
system.file("app/www/custom.css", package = "dccvalidator")
),
results_boxes_ui("validation_results")
)
)
}
shinyApp(ui, server)
```