---
title: "Command Line Merge Tool"
author: "Stu Field, SomaLogic Operating Co., Inc."
description: >
  A convenient CLI merge tool to add new clinical data
  to 'SomaScan' data.
output:
  rmarkdown::html_vignette:
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Command Line Merge Tool}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
library(SomaDataIO)
library(withr)
Sys.setlocale("LC_COLLATE", "en_US.UTF-8")
knitr::opts_chunk$set(
  echo = TRUE,
  collapse = TRUE,
  comment = "#>"
)
```
# Overview
Occasionally, additional clinical data is obtained _after_ samples
have been submitted to SomaLogic, Inc. or even after 'SomaScan'
results have been delivered.
This requires the new clinical, i.e. non-proteomic, data to be merged
with the 'SomaScan' data into a "new" ADAT prior to analysis.
For this purpose, a command-line interface ("CLI") tool is included
with [SomaDataIO](https://CRAN.R-project.org/package=SomaDataIO)
in the `cli/merge/` directory. It lets you
generate an updated `*.adat` file from the command line without
having to launch an integrated development environment ("IDE"), e.g. `RStudio`.
To use `SomaDataIO`'s exported functionality from _within_ an R session,
please see `merge_clin()`.
----------------
## Setup
The clinical merge tool is an `R script` that comes with an installation
of [SomaDataIO](https://CRAN.R-project.org/package=SomaDataIO):
```{r merge-script}
dir(system.file("cli", "merge", package = "SomaDataIO", mustWork = TRUE))
merge_script <- system.file("cli/merge", "merge_clin.R", package = "SomaDataIO")
merge_script
```
First create a temporary "analysis" directory:
```{r create-dir}
analysis_dir <- tempfile(pattern = "somascan-")
# create directory
dir.create(analysis_dir)
# sanity check
dir.exists(analysis_dir)
# copy merge tool into analysis directory
file.copy(merge_script, to = analysis_dir)
```
## Create Example Data
Let's create some dummy 'SomaScan' data derived from the `example_data`
object from [SomaDataIO](https://CRAN.R-project.org/package=SomaDataIO).
First we reduce its size to 9 samples and 5 proteomic features, and
then write it to a text file in our new analysis directory with `write_adat()`.
This file is the starting point for the clinical
data merge and represents where customers would typically begin an analysis.
```{r save-data}
feats <- withr::with_seed(3, sample(getAnalytes(example_data), 5L))
sub_adat <- dplyr::select(example_data, PlateId, SlideId, Subarray,
                          SampleId, Age, all_of(feats)) |> head(9L)
withr::with_dir(analysis_dir,
  write_adat(sub_adat, file = "ex-data-9.adat")
)
```
Next we create random clinical data with a common key (this is typically
the `SampleId` identifier but it could be any common key).
```{r create-clin-1}
df <- data.frame(SampleId = as.character(seq(1, 9, by = 2)), # common key
                 group    = c("a", "b", "a", "b", "a"),
                 newvar   = withr::with_seed(1, rnorm(5)))
df
# write clinical data to file
withr::with_dir(analysis_dir,
  write.csv(df, file = "clin-data.csv", row.names = FALSE)
)
```
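
Before merging, it can help to confirm how many keys the two tables share. The following is a minimal base-R sketch and not part of the merge tool: `adat_keys` is a hypothetical stand-in for the ADAT's `SampleId` column (assumed here to hold the character values "1" through "9"), and `clin` mirrors the key and grouping columns of the clinical data frame.

```r
# Hypothetical stand-ins: these are NOT read from the real files
adat_keys <- as.character(1:9)   # assumed ADAT `SampleId` values
clin <- data.frame(SampleId = as.character(seq(1, 9, by = 2)),
                   group    = c("a", "b", "a", "b", "a"))

# keys present in both tables
shared <- intersect(adat_keys, clin$SampleId)
# ADAT samples that have no matching clinical record
unmatched <- setdiff(adat_keys, clin$SampleId)
# duplicated clinical keys would make a one-to-one match ambiguous
dup_keys <- clin$SampleId[duplicated(clin$SampleId)]

shared
unmatched
length(dup_keys)
```

Comparing keys as character vectors avoids surprises when one file stores identifiers as numbers and the other as text.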
At this point there are now 3 files in our analysis directory:
```{r ls1}
dir(analysis_dir)
```
1. `merge_clin.R`: the merge script engine itself
1. `clin-data.csv`:
    + new data containing 3 columns:
        + a common key: `SampleId`
        + a new variable with grouping information: `group`
        + a new continuous variable: `newvar`
1. `ex-data-9.adat`:
    + the ADAT we would like to merge into, with 9 samples, 5 'SomaScan'
      proteomic features, and 5 pre-existing variables:
        + `PlateId`, `SlideId`, `Subarray`, `SampleId`, and `Age`
        + __note:__ `PlateId`, `SlideId`, and `Subarray` are key fields common
          to _almost all_ ADATs; removing them could yield unintended results
    + the common key `SampleId` is required
## Merging Clinical Data
The clinical data merge tool runs in most common command-line
shells (`BASH`, `ZSH`, etc.). You must have `R` installed
(and available on your `PATH`) with [SomaDataIO](https://CRAN.R-project.org/package=SomaDataIO)
and its dependencies installed.
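
One way to check these prerequisites from an R session is sketched below. It is an illustration only, using base R; note that the library paths seen by an interactive session may differ from those seen by `Rscript --vanilla`, so a positive result here is a hint rather than a guarantee.

```r
# Is the `Rscript` executable discoverable on the PATH?
rscript_path <- Sys.which("Rscript")
has_rscript  <- nzchar(unname(rscript_path))

# Is the SomaDataIO package installed in the current library paths?
has_pkg <- requireNamespace("SomaDataIO", quietly = TRUE)

c(Rscript = has_rscript, SomaDataIO = has_pkg)
```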
### Arguments
The merge script takes four _ordered_ arguments:

1. path to the original ADAT (`*.adat`) file
1. path to the clinical data (`*.csv`) file
1. common key variable name (e.g. `SampleId`)
1. output file name (`*.adat`) for the new ADAT
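
As an illustration of how ordered arguments like these can be consumed by an R script, here is a minimal sketch. The helper `parse_merge_args()` and the element names `adat`, `csv`, `key`, and `output` are hypothetical; this is not the code used inside `merge_clin.R`.

```r
# Hypothetical helper: NOT the code used inside `merge_clin.R`
parse_merge_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) != 4L) {
    stop("Expected 4 arguments: <adat> <csv> <key> <output>", call. = FALSE)
  }
  # order matters: the position of each argument determines its meaning
  stats::setNames(as.list(args), c("adat", "csv", "key", "output"))
}

# simulate the arguments that would follow the script name on the command line
opts <- parse_merge_args(
  c("ex-data-9.adat", "clin-data.csv", "SampleId", "ex-data-9-merged.adat")
)
opts$key
```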
---------------
### Standard Syntax
The primary syntax applies when the common key has the _same_ variable
name in __both__ files (ADAT and CSV):
```bash
# change directory to the analysis path
cd `r analysis_dir`
# run the Rscript:
# - we recommend using the --vanilla flag
Rscript --vanilla merge_clin.R ex-data-9.adat clin-data.csv SampleId ex-data-9-merged.adat
```
```{r sys-call1, include = FALSE}
withr::with_dir(analysis_dir,
  base::system2(
    "Rscript",
    c("--vanilla",
      "merge_clin.R",
      "ex-data-9.adat",
      "clin-data.csv",
      "SampleId",
      "ex-data-9-merged.adat")
  )
)
```
```{r ls2}
dir(analysis_dir)
```
### Alternative Syntax
In some cases the common key has a _different_ variable name in each file.
This is handled by a modification to argument 3,
which now takes the form `key1=key2`, where `key1`
is the name of the key variable in the `*.adat` file
and `key2` is the name of the key variable in the `*.csv` file.
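
To make the `key1=key2` form concrete, the sketch below shows one way such an argument could be split into its two names. The helper `split_key()` is hypothetical and is not taken from `merge_clin.R`; it only illustrates the convention described above.

```r
# Hypothetical helper: NOT the code used inside `merge_clin.R`
split_key <- function(key) {
  parts <- strsplit(key, "=", fixed = TRUE)[[1L]]
  if (length(parts) == 1L) {
    parts <- c(parts, parts)   # a single name is used for both files
  }
  if (length(parts) != 2L || !all(nzchar(parts))) {
    stop("Key must be `key` or `key1=key2`", call. = FALSE)
  }
  c(adat = parts[1L], csv = parts[2L])
}

split_key("SampleId")          # same name in both files
split_key("SampleId=ClinID")   # different name in each file
```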
To highlight this syntax, first let's create a new clinical
data file with a _different_ variable name, `ClinID`:
```{r create-clin-2}
# rename original `df`
names(df) <- c("ClinID", "letter", "size")
df
# write clinical data to file
withr::with_dir(analysis_dir,
  write.csv(df, file = "clin-data2.csv", row.names = FALSE)
)
```
We can now execute the _same_ merge script at the command line
with a slightly modified syntax:
```bash
Rscript --vanilla merge_clin.R ex-data-9.adat clin-data2.csv SampleId=ClinID ex-data-9-merged2.adat
```
```{r sys-call2, include = FALSE}
withr::with_dir(analysis_dir,
  base::system2(
    "Rscript",
    c("--vanilla",
      "merge_clin.R",
      "ex-data-9.adat",
      "clin-data2.csv",
      "SampleId=ClinID",
      "ex-data-9-merged2.adat")
  )
)
```
```{r ls3}
dir(analysis_dir)
```
## Check Results
Now let's check that the clinical data was merged successfully and
yields the expected `*.adat`:
```{r new-adat}
new <- withr::with_dir(analysis_dir,
  read_adat("ex-data-9-merged2.adat")
)
new
getMeta(new)
getAnalytes(new)
```
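
When reading the merged result, it may help to know what a key-based merge of this shape would be expected to produce. The base-R sketch below assumes the merge behaves like a left join on the key (every ADAT row kept, unmatched rows filled with `NA`); whether the tool is implemented this way is an assumption, and the data frames and `rfu` values are made up for illustration.

```r
# Hypothetical stand-ins for the ADAT sample data and the renamed clinical data
adat_like <- data.frame(SampleId = as.character(1:9),
                        rfu      = seq(100, 900, by = 100))   # made-up values
clin <- data.frame(ClinID = as.character(seq(1, 9, by = 2)),
                   letter = c("a", "b", "a", "b", "a"))

# left join via match(): every ADAT row is kept, in its original order
idx <- match(adat_like$SampleId, clin$ClinID)
merged <- adat_like
merged$letter <- clin$letter[idx]   # NA where no clinical record matches

merged
```

Under that assumption, samples with even-numbered keys have no clinical record and so carry `NA` in the new column.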
## Summary
- Merging newly obtained clinical variables into existing 'SomaScan' ADATs
is easy via the `merge_clin.R` script provided with
[SomaDataIO](https://CRAN.R-project.org/package=SomaDataIO).
- Alternatively, one could use the exported function `merge_clin()`.
- If you run into any trouble please do not hesitate to reach out
to <techsupport@somalogic.com> or
[file an issue](https://github.com/SomaLogic/SomaDataIO/issues/new) on
our [GitHub](https://github.com/SomaLogic/SomaDataIO) repository.
```{r teardown, include = FALSE}
if ( dir.exists(analysis_dir) ) {
  # `recursive = TRUE` is required for unlink() to remove a directory
  unlink(analysis_dir, recursive = TRUE, force = TRUE)
}
```