---
title: "RSS-NET analysis of IBD GWAS summary statistics and NK cell regulatory network"
author: "Xiang Zhu"
date: 2019-08-19
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
---
[Zhu et al (2019)]: xxx
[`script_dir`]: xxx
[simulation example]: xxx
[`ibd2015_nkcell.sbatch`]: xxx
## Overview
Here we describe an end-to-end RSS-NET analysis of
inflammatory bowel disease (IBD) GWAS summary statistics
[(Liu et al, 2015)](https://www.ncbi.nlm.nih.gov/pubmed/26192919)
and a gene regulatory network inferred for natural killer (NK) cells.
This example illustrates the actual data analyses performed in [Zhu et al (2019)][].
To reproduce results of this example,
please use scripts in the directory [`script_dir`][],
and follow the step-by-step guide below.
Before running any script in [`script_dir`][],
please [install](setup.html) RSS-NET.
Since a real genome-wide analysis is conducted here,
this example is more complicated than the previous [simulation example][].
We recommend working through the [simulation example][]
before diving into this real data example.
Note that the working directory here is assumed to be `wdtba`.
Please modify scripts accordingly if a different directory is used.
## Step-by-step illustration
### Download input data files
#### 1. `ibd2015_sumstat.mat`: processed GWAS summary statistics and LD matrix estimates
This file is large (43G) because
it contains an LD matrix of 1.1 million common SNPs.
Please contact me (`xiangzhu[at]stanford.edu`)
if you have trouble accessing this file.
```{r, eval=FALSE, engine='zsh'}
$ md5sum ibd2015_sumstat.mat
ad1763079ee7e46b21722f74e037a230 ibd2015_sumstat.mat
$ du -sh ibd2015_sumstat.mat
43G ibd2015_sumstat.mat
```
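For readers without `md5sum` at hand, the same integrity check can be sketched in Python. This is only an illustration: the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the file is read in chunks because loading 43G into memory at once is unlikely to be practical.

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected.lower()

# Expected digest copied from the md5sum output above.
EXPECTED_MD5 = "ad1763079ee7e46b21722f74e037a230"

# Only run the check if the file has been downloaded to the current directory.
if os.path.exists("ibd2015_sumstat.mat"):
    print(matches_checksum("ibd2015_sumstat.mat", EXPECTED_MD5))
```

The same helper applies to the other input files below, using their respective digests.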
Let's look at the contents of `ibd2015_sumstat.mat`.
```matlab
>> sumstat = matfile('ibd2015_sumstat.mat');
>> sumstat
matlab.io.MatFile
Properties:
Properties.Source: 'ibd2015_sumstat.mat'
Properties.Writable: false
BR: [22x1 cell]
R: [22x1 cell]
SiRiS: [22x1 cell]
betahat: [22x1 cell]
chr: [22x1 cell]
pos: [22x1 cell]
se: [22x1 cell]
```
GWAS summary statistics and LD estimates are stored as
[cell arrays](https://www.mathworks.com/help/matlab/cell-arrays.html).
RSS-NET only uses the following variables:
- `betahat{j,1}`, single-SNP effect size estimates of all SNPs on chromosome `j`;
- `se{j,1}`, standard errors of `betahat{j, 1}`;
- `chr{j,1}` and `pos{j, 1}`, physical positions of these SNPs (GRCh37 build);
- `SiRiS{j,1}`, a [sparse matrix](https://www.mathworks.com/help/matlab/sparse-matrices.html),
defined as `repmat((1./se),1,p) .* R .* repmat((1./se)',p,1)`,
where `R` is the estimated LD matrix of these `p` SNPs.
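
To make the definition of `SiRiS` concrete, here is a minimal Python sketch on toy values. It is not the code used to build `ibd2015_sumstat.mat`. The function name `compute_siris` and the numbers are made up for illustration, and dense lists stand in for the sparse matrix. The `repmat` expression above scales row `i` and column `j` of `R` by `1/se(i)` and `1/se(j)`, so each entry equals `R(i,j) / (se(i) * se(j))`.

```python
def compute_siris(se, R):
    """Return the matrix with entries R[i][j] / (se[i] * se[j])."""
    p = len(se)
    if len(R) != p or any(len(row) != p for row in R):
        raise ValueError("R must be a p-by-p matrix matching the length of se")
    inv_se = [1.0 / s for s in se]
    return [[inv_se[i] * R[i][j] * inv_se[j] for j in range(p)] for i in range(p)]

# Toy standard errors and LD matrix for p = 3 SNPs (illustrative values only).
se = [0.5, 0.25, 0.125]
R = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
]

siris = compute_siris(se, R)
for row in siris:
    print(row)
```

Zero entries of `R` stay zero after scaling, which is why `SiRiS` can be stored as a sparse matrix whenever `R` is sparse.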
#### 2. `ibd2015_snp2gene.mat`: physical distance between SNPs and genes
This file contains the physical distance between
each GWAS SNP and each protein-coding gene, within 1 Mb.
This file corresponds to ${\bf G}_j$ in the RSS-NET model.
```{r, eval=FALSE, engine='zsh'}
$ md5sum ibd2015_snp2gene.mat
7832838e2675e4cf3b85f471fed95554 ibd2015_snp2gene.mat
$ du -sh ibd2015_snp2gene.mat
224M ibd2015_snp2gene.mat
```
In this example, there are 18334 genes and 1081481 SNPs.
```matlab
>> snp2gene = matfile('ibd2015_snp2gene.mat');
>> snp2gene
matlab.io.MatFile
Properties:
Properties.Source: 'ibd2015_snp2gene.mat'
Properties.Writable: false
chr: [1081481x1 int32]
colid: [14126805x1 int32]
numgene: [1x1 int32]
numsnp: [1x1 int32]
pos: [1081481x1 int32]
rowid: [14126805x1 int32]
val: [14126805x1 double]
>> [snp2gene.numgene snp2gene.numsnp]
18334 1081481
```
The SNP-to-gene distance information is stored as
a three-column matrix `[colid rowid val]`,
where `colid` indexes genes, `rowid` indexes SNPs and `val` is the distance in base pairs.
For example, the distance between gene `1` and SNP `6` is `978947` bp.
```matlab
>> colid=snp2gene.colid; rowid=snp2gene.rowid; val=snp2gene.val;
>> [colid(6) rowid(6) val(6)]
1 6 978947
```
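The three-column layout is a coordinate (triplet) encoding of a sparse SNP-by-gene matrix. The Python sketch below shows one way to query such triplets. Only the first triplet `(1, 6, 978947)` comes from the file. The other two triplets and the helper names `triplets_to_lookup` and `genes_within` are hypothetical.

```python
def triplets_to_lookup(colid, rowid, val):
    """Map (snp, gene) pairs to distances; indices stay 1-based as in MATLAB."""
    return {(snp, gene): dist for gene, snp, dist in zip(colid, rowid, val)}

def genes_within(lookup, snp, max_dist):
    """Return the sorted genes whose stored distance to `snp` is at most `max_dist` bp."""
    return sorted(g for (s, g), d in lookup.items() if s == snp and d <= max_dist)

# First triplet is taken from the file; the remaining two are made-up toy entries.
colid = [1, 1, 2]            # gene indices
rowid = [6, 7, 7]            # SNP indices
val = [978947, 500000, 1200] # distances in bp

lookup = triplets_to_lookup(colid, rowid, val)
print(lookup[(6, 1)])                   # distance between SNP 6 and gene 1
print(genes_within(lookup, 7, 100000))  # genes within 100 kb of SNP 7
```

SNP-gene pairs that are absent from the triplets are simply not stored, which is how the file restricts itself to pairs within 1 Mb.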
#### 3. `Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat`: gene regulatory network
This file contains information on gene-to-gene
connections in a given regulatory network.
```{r, eval=FALSE, engine='zsh'}
$ md5sum Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat
35ac724b86f7777d87116cc48166caa2 Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat
$ du -sh Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat
1.7M Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat
```
```matlab
>> gene2gene = matfile('Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat');
>> gene2gene
matlab.io.MatFile
Properties:
Properties.Source: 'Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat'
Properties.Writable: false
colid: [110733x1 int32]
numgene: [1x1 int32]
rowid: [110733x1 int32]
val: [110733x1 double]
```
For implementation convenience, this file contains the trivial case
where each gene is mapped to itself with `val=1`.
```matlab
>> colid=gene2gene.colid; rowid=gene2gene.rowid; val=gene2gene.val;
>> [gene2gene.numgene sum(colid==rowid) unique(val(colid==rowid))]
18334 18334 1
```
For a given network, transcription factors (TFs) are stored in `rowid`
and target genes (TGs) are stored in `colid`.
In this example there are 3105 TGs and 376 TFs.
Among these TFs and TGs, there are 92399 edges.
The edge weights range from 0.61 to 1.
These TF-to-TG connections and edge weights correspond to
$\{{\bf T}_g,v_{gt}\}$ in the RSS-NET model.
```matlab
>> [length(unique(colid(colid ~= rowid))) length(unique(rowid(colid ~= rowid)))]
3105 376
>> [length(colid(colid ~= rowid)) length(rowid(colid ~= rowid))]
92399 92399
>> val_tftg = val(colid ~= rowid);
>> [min(val_tftg) quantile(val_tftg, 0.25) median(val_tftg) quantile(val_tftg, 0.75) max(val_tftg)]
0.6138 0.6324 0.6568 0.6949 1.0000
```
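The MATLAB commands above all rely on the same idea: drop the self-mapping entries (`colid == rowid`) and summarize what remains. A minimal Python sketch of that logic on a toy network follows. The function name `summarize_network` and the toy triplets are hypothetical and are not taken from the NK cell file.

```python
def summarize_network(colid, rowid, val):
    """Count TFs, TGs and edges after removing gene-to-itself entries."""
    # colid holds target genes (TGs) and rowid holds transcription factors (TFs).
    edges = [(tf, tg, w) for tg, tf, w in zip(colid, rowid, val) if tg != tf]
    weights = [w for _, _, w in edges]
    return {
        "num_tf": len({tf for tf, _, _ in edges}),
        "num_tg": len({tg for _, tg, _ in edges}),
        "num_edge": len(edges),
        "min_weight": min(weights),
        "max_weight": max(weights),
    }

# Toy network with 3 genes: three self-mappings plus three TF-to-TG edges.
colid = [1, 2, 3, 2, 3, 3]
rowid = [1, 2, 3, 1, 1, 2]
val = [1.0, 1.0, 1.0, 0.7, 0.65, 1.0]

print(summarize_network(colid, rowid, val))
```

On the real file the same filter leaves 110733 - 18334 = 92399 entries, which matches the number of edges reported above.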
#### 4. `ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat`: SNP-to-network proximity annotation
This file contains a binary annotation for each SNP,
indicating whether the SNP is "near" the given network,
that is, within 100 kb of any network element
(TF, TG or associated regulatory elements).
```{r, eval=FALSE, engine='zsh'}
$ md5sum ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat
d96cd9b32759f954cc37680dc6aeafd8 ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat
$ du -sh ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat
21M ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat
```
In this example, there are 1081481 GWAS SNPs and
382443 of them are near the NK cell network (i.e. `val=1`).
```matlab
>> snp2net = matfile('ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat');
>> snp2net
matlab.io.MatFile
Properties:
Properties.Source: 'ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat'
Properties.Writable: false
chr: [1081481x1 int32]
pos: [1081481x1 int32]
snpid: [1081481x1 int32]
val: [1081481x1 double]
window: [1x1 double]
>> [length(snp2net.val) sum(snp2net.val) snp2net.window]
1081481 382443 100000
>> unique(snp2net.val)'
0 1
```
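As a rough illustration of what a window-based proximity annotation means, the Python sketch below flags SNPs on a single chromosome that fall within `window` bp of any element interval. This is an assumption-laden sketch, not the procedure used to create the file. The `(start, end)` interval representation, the function name `annotate_near_network` and all positions are hypothetical.

```python
def annotate_near_network(snp_pos, elements, window=100_000):
    """Return 1.0 for SNPs within `window` bp of any (start, end) element, else 0.0."""
    flags = []
    for pos in snp_pos:
        near = any(start - window <= pos <= end + window for start, end in elements)
        flags.append(1.0 if near else 0.0)
    return flags

# One hypothetical network element and five hypothetical SNP positions (bp).
elements = [(1_000_000, 1_050_000)]
snp_pos = [850_000, 900_000, 1_020_000, 1_150_000, 1_150_001]

val = annotate_near_network(snp_pos, elements)
print(val)
print(sum(val), "of", len(val), "SNPs are near the network")
```

A genome-wide version would repeat this per chromosome and would likely use sorted intervals rather than the quadratic scan shown here.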
#### 5. `ibd2015_NK_snp2gene_cis.mat`: SNP-to-gene cis regulation
This file contains the SNP-to-gene cis regulation scores
derived from context-matching cis eQTL studies.
This file corresponds to $(c_{jg}-1)$ in the RSS-NET model.
```{r, eval=FALSE, engine='zsh'}
$ md5sum ibd2015_NK_snp2gene_cis.mat
dedad8e25773fad69576dbce0f7d9f93 ibd2015_NK_snp2gene_cis.mat
$ du -sh ibd2015_NK_snp2gene_cis.mat
165M ibd2015_NK_snp2gene_cis.mat
```
```matlab
>> snp2gene_cis = matfile('ibd2015_NK_snp2gene_cis.mat');
>> snp2gene_cis
matlab.io.MatFile
Properties:
Properties.Source: 'ibd2015_NK_snp2gene_cis.mat'
Properties.Writable: false
colid: [10790012x1 int32]
numgene: [1x1 int32]
numsnp: [1x1 int32]
rowid: [10790012x1 int32]
val: [10790012x1 double]
```
In this example, there are 10790012 SNP-gene pairs
with cis regulation scores available,
consisting of 829280 SNPs and 18230 genes.
The cis regulation scores (`val`) range from 0 to 0.76.
```matlab
>> [snp2gene_cis.numsnp length(unique(snp2gene_cis.rowid)) snp2gene_cis.numgene length(unique(snp2gene_cis.colid))]
1081481 829280 18334 18230
>> val=snp2gene_cis.val;
>> [min(val) quantile(val, 0.25) median(val) quantile(val, 0.75) max(val)]
0 0.0226 0.0636 0.1172 0.7622
```
### Run RSS-NET analysis
```{r, eval=FALSE, engine='zsh'}
$ pwd
/Users/xiangzhu/GitHub/rss-net/examples/ibd2015_nkcell
$ tree -f
.
├── ./analysis_template.m
├── ./ibd2015_nkcell.m
└── ./ibd2015_nkcell.sbatch
0 directories, 3 files
```
#### 3. Submit job arrays
For a given GWAS and a given regulatory network,
all RSS-NET analysis tasks are almost identical
and differ only in their hyper-parameter values.
To exploit this, we run one RSS-NET analysis as
a job array with multiple tasks that run in parallel.
To this end, we write a simple sbatch script
[`ibd2015_nkcell.sbatch`][],
and submit it to a cluster with [`Slurm`](https://slurm.schedmd.com/) available.
```{r, eval=FALSE, engine='zsh'}
$ sbatch ibd2015_nkcell.sbatch
```
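The job-array idea can be sketched as a mapping from a task index to one combination of hyper-parameter values. The Python sketch below is illustrative only and is not what the sbatch script linked above does internally. The names `hyper_a`, `hyper_b`, `hyper_c` and their values are placeholders rather than the RSS-NET hyper-parameters. The only real name is `SLURM_ARRAY_TASK_ID`, the environment variable that Slurm sets for each task of a job array.

```python
import itertools
import os

def build_grid(**axes):
    """Return every combination of the given hyper-parameter values as a list of dicts."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in itertools.product(*axes.values())]

def params_for_task(grid, task_id):
    """Return the grid point for a 1-based array task index."""
    if not 1 <= task_id <= len(grid):
        raise IndexError(f"task_id {task_id} is outside 1..{len(grid)}")
    return grid[task_id - 1]

# Placeholder hyper-parameter names and values (assumptions for illustration).
grid = build_grid(hyper_a=[-3.0, -2.0], hyper_b=[0.0, 1.0, 2.0], hyper_c=[0.5, 1.0])

# Slurm sets SLURM_ARRAY_TASK_ID inside each array task; default to 1 elsewhere.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
if 1 <= task_id <= len(grid):
    print(task_id, params_for_task(grid, task_id))
```

With this layout, an array of `len(grid)` tasks covers the whole grid, and each task needs nothing but its own index to find its hyper-parameter values.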
After the submission, multiple tasks should run on different nodes simultaneously.
```{r, eval=FALSE, engine='zsh'}
$ squeue -u xiangzhu
       JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
62554249_107    owners ibd2015_ xiangzhu  R  0:32     1 sh02-17n12
62554249_108    owners ibd2015_ xiangzhu  R  0:32     1 sh02-17n12
62554249_109    owners ibd2015_ xiangzhu  R  0:32     1 sh01-28n08
62554249_110    owners ibd2015_ xiangzhu  R  0:32     1 sh01-17n18
62554249_111    owners ibd2015_ xiangzhu  R  0:32     1 sh01-26n33
62554249_112    owners ibd2015_ xiangzhu  R  0:32     1 sh01-27n30
```
The job array started at 3:04 PM on Mar 2, 2020
and ended at 11:37 PM on the same day.
The resource usage report of one task (`62554249_125`) is shown below.
```
Job ID: 62554249
Array Job ID: 62554249_125
Cluster: sherlock
User/Group: xiangzhu/whwong
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 1-03:08:31
CPU Efficiency: 74.54% of 1-12:24:40 core-walltime
Job Wall-clock time: 04:33:05
Memory Utilized: 26.58 GB
Memory Efficiency: 85.06% of 31.25 GB
```
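The efficiency figures in this report follow from simple arithmetic on the other fields, which is useful when deciding how many cores and how much memory to request. A small Python sketch that recomputes them from the values printed above (the helper name `slurm_duration_to_seconds` is hypothetical):

```python
def slurm_duration_to_seconds(text):
    """Convert a '[D-]HH:MM:SS' duration string into seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

# Values copied from the report above.
cpu_used = slurm_duration_to_seconds("1-03:08:31")
wall_clock = slurm_duration_to_seconds("04:33:05")
cores = 8

core_walltime = wall_clock * cores           # equals 1-12:24:40 in seconds
cpu_efficiency = 100 * cpu_used / core_walltime
memory_efficiency = 100 * 26.58 / 31.25

print(f"CPU efficiency: {cpu_efficiency:.2f}%")        # prints 74.54%
print(f"Memory efficiency: {memory_efficiency:.2f}%")  # prints 85.06%
```

Here the core-walltime is the wall-clock time multiplied by the 8 cores, so a CPU efficiency of about 75% means the task kept roughly 6 of its 8 cores busy on average.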
## More examples