/
paper.Rmd
714 lines (641 loc) · 39.1 KB
/
paper.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
---
title: 'graphsim: An R package for simulating gene expression data from graph structures of biological pathways'
output:
rmarkdown::pdf_document:
fig_crop: no
#keep_md: TRUE
keep_tex: TRUE
fig_caption: yes
tags:
- R
- gene-expression
- simulation
- genomics
- pathway
- network
authors:
- name: S. Thomas Kelly
email: "tom.kelly@postgrad.otago.ac.nz, tom.kelly@riken.jp"
orcid: 0000-0003-3904-6690
affiliation: "1, 2" # (Multiple affiliations must be quoted)
- name: Michael A. Black
email: mik.black@otago.ac.nz
orcid: 0000-0003-1174-6054
affiliation: "1"
affiliations:
- name: "Department of Biochemistry, University of Otago, PO Box 56, Dunedin 9054, New Zealand"
index: 1
- name: "RIKEN Center for Integrative Medical Sciences, Suehiro-cho-1-7-22, Tsurumi Ward, Yokohama, Kanagawa 230-0045, Japan"
index: 2
date: "`r format(Sys.time(), '%d %B %Y')`"
bibliography: paper.bib
header-includes:
- \usepackage{caption}
---
```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", width = 68)
knitr::opts_chunk$set(fig.cap = "", fig.path = "Plot")
knitr::opts_chunk$set(fig.pos = "!htbp")
options(width = 68, cli.unicode = FALSE, cli.width = 68)
#par(mai=c(2.82, 2.82, 0.82, 0.82)-0.82)
par(mar=c(7, 10, 4, 2) + 0.1)
par(bty="o")
#captioner::captioner(prefix = "Fig.")
```
### Summary
Transcriptomic analysis is used to capture the molecular state of a cell
or sample in many biological and medical applications. In addition to
identifying alterations in activity at the level of individual genes,
understanding changes in the gene networks that regulate fundamental
biological mechanisms is also an important objective of molecular
analysis. As a result, databases that describe biological pathways
are increasingly used to assist with the interpretation of results
from large-scale genomics studies. Incorporating information from
biological pathways and gene regulatory networks into a genomic data
analysis is a popular strategy, and there are many methods that provide
this functionality for gene expression data. When developing or comparing
such methods, it is important to gain an accurate assessment of their
performance. Simulation-based validation studies are frequently used
for this.
This necessitates the use of simulated data that correctly accounts for
pathway relationships and correlations. Here we present a versatile
statistical framework to simulate correlated gene expression data from
biological pathways, by sampling from a multivariate normal distribution
derived from a graph structure. This procedure has been released as the
\texttt{graphsim} R package on CRAN and GitHub (\url{https://github.com/TomKellyGenetics/graphsim})
and is compatible with any graph structure that can be described using
the `igraph` package. This package allows the simulation of biological
pathways from a graph structure based on a statistical model of gene expression.
Introduction: inference and modelling of biological networks {#sec:intro}
===============================================================================
Network analysis of molecular biological pathways has the potential to
lead to new insights into biology and medical genetics
[@Barabasi2004; @Hu2016]. Since gene expression profiles capture a
consistent signature of the regulatory state of a cell
[@Perou2000; @Ozsolak2011; @Svensson2018], they can be used to analyse
complex molecular states with genome-scale data. However, biological
pathways are often analysed in a reductionist paradigm as amorphous sets
of genes involved in particular functions, despite the fact that the
relationships defined by pathway structure could further inform gene
expression analyses. In many cases, the pathway relationships are
well-defined, experimentally-validated, and are available in public
databases [@Reactome]. As a result, network analysis techniques can
play an important role in furthering our understanding of biological
pathways and aiding in the interpretation of genomics studies.
Gene networks provide insights into how cells are regulated, by mapping
regulatory interactions between target genes and transcription factors,
enhancers, and sites of epigenetic marks or chromatin structures
[@Barabasi2004; @Yamaguchi2007]. Inference using these regulatory
interactions genomic analysis has the potential to radically
expand the range of candidate biological pathways to be further
explored, or to improve the accuracy of bioinformatics and functional
genomic analysis. A number of methods have been developed to
utilise timecourse gene expression data [@Yamaguchi2007; @Arner2015]
using gene regulatory modules in state-space models and recursive vector
autoregressive models [@Hirose2008; @Shimamura2009]. Various approaches
to gene regulation and networks at the genome-wide scale have led to
novel biological insights [@Arner2015; @Komatsu2013], however, inference
of regulatory networks has thus far primarily relied on experimental
validation or resampling-based approaches to estimate the likelihood
of specific network modules being predicted [@Markowetz2007; @Hawe2019].
Simulating datasets that account for pathway structure are of particular
interest for benchmarking regulatory network inference techniques
and methods being developed for genomics data containing complex biological
interactions [@Schaffter2011; @Saelens2019]. Dynamical models using
differential equations have been employed, such as by GeneNetWeaver
[@Schaffter2011], to generate simulated datasets
specifically for benchmarking gene regulatory network inference techniques.
There is also renewed interest in modelling biological pathways and simulating
data for benchmarking due to the emergence of single-cell genomics
technologies and the growing number of bioinformatics techniques developed
to use this data [@Zappia2017; @Saelens2019]. Packages such as 'splatter'
[@Zappia2017], which uses the gamma-poisson distribution,
have been developed to model single-cell data.
SERGIO [@Dibaeinia2019] and dyngen [@Cannoodt2020] build on
this by adding gene regulatory networks and multimodality
respectively. These methods have been designed based on known
deterministic relationships or synthetic reaction states,
to which stochasticity is then added.
However, it is computationally-intensive to model
these reactions at scale or run many iterations for benchmarking.
In some cases, it is only necessary to model the statistical
variability and "noise" of RNA-Seq data in order to evaluate
methods in the presence of multivariate correlation structures.
There is a need, therefore, for a systematic framework for statistical
modelling and simulation of gene expression data derived from
hypothetical, inferred or known gene networks. Here we present a
package to achieve this, where samples from a multivariate normal
distribution are used to generate normally-distributed log-expression
data, with correlations between genes derived from the structure of the
underlying pathway or gene regulatory network. This methodology enables
simulation of expression profiles that approximate the log-transformed
and normalised data from microarray studies, as well as bulk or single-cell RNA-Seq
experiments. This procedure has been released as the \texttt{graphsim}
package to enable the generation of simulated gene expression datasets
containing pathway relationships from a known underlying network.
These simulated datasets can be used to evaluate various bioinformatics
methodologies, including
statistical and network inference procedures.
Methodology and software {#sec:methods}
===============================================================================
Here we present a procedure to simulate gene expression data with
correlation structure derived from a known graph structure. This
procedure assumes that transcriptomic data have been generated and
follow a log-normal distribution (i.e.,
$log(X_{ij}) \sim MVN({\bf\mu}, \Sigma)$, where ${\bf\mu}$ and $\Sigma$
are the mean vector and variance-covariance matrix respectively, for
gene expression data derived from a biological pathway) after
appropriate normalisation [@Law2014; @Li2015]. Log-normality of gene
expression matches the assumptions of the popular \texttt{limma} package [@limma], which is
often used for the analysis of intensity-based data from gene expression
microarray studies and count-based data from RNA-Seq experiments. This
approach has also been applied for modelling UMI-based count data from
single-cell RNA-Seq experiments in the \texttt{DESCEND} R package [@Wang2018].
In order to simulate transcriptomic data, a pathway is first constructed
as a graph structure, using the \texttt{igraph} R package [@igraph], with the status of
the edge relationships defined (i.e, whether they activate or inhibit
downstream pathway members). [This procedure uses]{style="color: black"}
a graph structure such as that presented in
Figure [1a](#fig:simple_graph:first){reference-type="ref"
reference="fig:simple_graph:first"}. The graph can be defined by an
adjacency matrix, **$A$** (with elements
$A_{ij}$), where $$A_{ij} =
\begin{cases}
1 & \mbox{if genes } i \mbox{ and } j \mbox{ are adjacent} \\
0 & \mbox{otherwise}
\end{cases}$$
A matrix, **$R$**, with elements
[$R_{ij}$]{style="color: black"}, is calculated based on distance (i.e.,
number of edges contained in the shortest path) between nodes, such that
closer nodes are given more weight than more distant nodes, to define
inter-node relationships. A geometrically-decreasing (relative) distance
weighting is used to achieve this:
[ $$R_{ij} =
\begin{cases}
1 & \mbox{if genes } i \mbox{ and } j \mbox{ are adjacent} \\
(\frac{1}{2})^{d_{ij}} & \mbox{if a path can be found between genes } i \mbox{ and } j \\
0 & \mbox{if no path exists between genes } i \mbox{ and } j
\end{cases}$$]{style="color: black"} where $d_{ij}$ is the length of
the shortest path (i.e., minimum number of edges traversed) between
genes (nodes) $i$ and $j$ in graph $G$. Each more distant node is thus
related by $\frac{1}{2}$ compared to the next nearest, as shown in
Figure [2b](#fig:simulation_activating:second){reference-type="ref"
reference="fig:simulation_activating:second"}. An
arithmetically-decreasing (absolute) distance weighting is also
supported in the \texttt{graphsim} R package which implements this procedure: [ $$R_{ij} =
\begin{cases}
1 & \mbox{if genes } i \mbox{ and } j \mbox{ are adjacent} \\
1-\frac{d_{ij}}{diam(G)} & \mbox{if a path can be found between genes } i \mbox{ and } j \\
0 & \mbox{if no path exists between genes } i \mbox{ and } j
\end{cases}$$ ]{style="color: black"}
Assuming a unit variance for each gene, these values can be used to
derive a $\Sigma$ matrix: $$\Sigma_{ij} =
\begin{cases}
1 & \mbox{if } i=j \\
\rho R_{ij} & \mbox{otherwise}
\end{cases}$$ where $\rho$ is the correlation between adjacent nodes.
Thus covariances between adjacent nodes are assigned by a correlation
parameter ($\rho$) and the remaining off-diagonal values in the matrix
are based on scaling these correlations by the geometrically weighted
relationship matrix (or the nearest positive definite matrix for
$\Sigma$ with negative correlations).\
Computing the nearest positive definite matrix is necessary to ensure
that the variance-covariance matrix can be inverted when used as a
parameter in multivariate normal simulations, particularly when negative
correlations are included for inhibitions (as shown below). Matrices
that cannot be inverted occur rarely with biologically plausible
graph structures but this approach allows for the computation of a
plausible correlation matrix when the given graph structure is
incomplete or contains loops. When required, the nearest positive
definite matrix is computed using the `nearPD` function of
the \texttt{Matrix} R package [@Matrix] to perform
Higham's algorithm [@Higham2002] on variance-covariance matrices.
The \texttt{graphsim} package gives a warning when this occurs.
Illustrations {#sec:illustrations}
===============================================================================
Generating a Graph Structure {#sec:plot_graph}
-------------------------------------------------------------------------------
The graph structure in
Figure [1a](#fig:simple_graph:first){reference-type="ref"
reference="fig:simple_graph:first"} was used to simulate correlated gene
expression data by sampling from a multivariate normal distribution
using the \texttt{mvtnorm} R package [@Genz2009; @mvtnorm]. The graph structure
visualisation in
Figure [1](#fig:simple_graph){reference-type="ref"
reference="fig:simple_graph"} was specifically developed for (directed)
\texttt{igraph} objects in and is available in the \texttt{graphsim} package. The
\texttt{plot\_directed} function enables customisation of plot parameters for
each node or edge, and mixed (directed) edge types for indicating
activation or inhibition. These inhibition links (which occur frequently
in biological pathways) are demonstrated in
Figure [1b](#fig:simple_graph:second){reference-type="ref"
reference="fig:simple_graph:second"}.
A graph structure can be generated and plotted using the following
commands in R:
```{r, eval = FALSE}
#install packages required (once per computer)
install.packages("graphsim")
```
```{r, warning=FALSE, results='hide', message=FALSE}
#load required packages (once per R instance)
library("graphsim")
#load packages for examples
library("igraph"); library("gplots"); library("scales")
```
```{r simple_graph_hide, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, eval=FALSE}
#generate graph structure
graph_edges <- rbind(c("A", "C"), c("B", "C"), c("C", "D"), c("D", "E"),
c("D", "F"), c("F", "G"), c("F", "I"), c("H", "I"))
graph <- graph.edgelist(graph_edges, directed = TRUE)
#plot graph structure (Figure 1a)
plot_directed(graph, state ="activating", layout = layout.kamada.kawai,
cex.node = 2, cex.arrow = 4, arrow_clip = 0.2)
#generate parameters for inhibitions for each edge in E(graph)
state <- c(1, 1, -1, 1, 1, 1, 1, -1)
#plot graph structure with inhibitions (Figure 1b)
plot_directed(graph, state=state, layout = layout.kamada.kawai,
cex.node = 2, cex.arrow = 4, arrow_clip = 0.2)
```
```{r simple_graph, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, fig.cap = '\\textbf{Simulated graph structures}. A constructed graph structure used as an example to demonstrate the simulation procedure in Figures 2 and 3. Activating links are denoted by black arrows and inhibiting links by red edges.', echo=FALSE}
#generate graph structure
graph_edges <- rbind(c("A", "C"), c("B", "C"), c("C", "D"), c("D", "E"),
c("D", "F"), c("F", "G"), c("F", "I"), c("H", "I"))
graph <- graph.edgelist(graph_edges, directed = TRUE)
#plot graph structure (Figure 1a)
plot_directed(graph, state ="activating", layout = layout.kamada.kawai,
cex.node = 2, cex.arrow = 4, arrow_clip = 0.2)
mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.05, adj=0.5, cex=1.75)
box()
#generate parameters for inhibitions for each edge in E(graph)
state <- c(1, 1, -1, 1, 1, 1, 1, -1)
#plot graph structure with inhibitions (Figure 1b)
plot_directed(graph, state=state, layout = layout.kamada.kawai,
cex.node = 2, cex.arrow = 4, arrow_clip = 0.2)
mtext(text = "(b) Inhibiting pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75)
box()
```
Generating a Simulated Expression Dataset {#sec:graphsim_demo}
-----------------------------------------
The correlation parameter of $\rho = 0.8$ is used to demonstrate the
inter-correlated datasets using a geometrically-generated relationship
matrix (as used for the example in
Figure [2c](#fig:simulation_activating:third){reference-type="ref"
reference="fig:simulation_activating:third"}). This $\Sigma$ matrix was
then used to sample from a multivariate normal distribution such that
each gene had a mean of $0$, standard deviation $1$, and covariance
within the range $[0,1]$ so that the off-diagonal elements of $\Sigma$
represent correlations. This procedure generated a simulated (continuous
normally-distributed) log-expression profile for each node
(Figure [2e](#fig:simulation_activating:fourth){reference-type="ref"
reference="fig:simulation_activating:fourth"}) with a corresponding
correlation structure (Figure [2d](#fig:simulation_activating:fifth){reference-type="ref"
reference="fig:simulation_activating:fifth"}). The simulated correlation
structure closely resembled the expected correlation structure ($\Sigma$
in Figure [2c](#fig:simulation_activating:third){reference-type="ref"
reference="fig:simulation_activating:third"}) even for the relatively
modest sample size ($N=100$) illustrated in
Figure [2](#fig:simulation_activating){reference-type="ref"
reference="fig:simulation_activating"}. Once a gene expression dataset
comprising multiple pathways has been generated (as in
Figure [2e](#fig:simulation_activating:fourth){reference-type="ref"
reference="fig:simulation_activating:fourth"}), it can then be used to
test procedures designed for analysis of empirical gene expression data
(such as those generated by microarrays or RNA-Seq) that have been
normalised on a log-scale.
The simulated dataset can be generated using the following code:
```{r, include=FALSE, eval=FALSE}
graphics::layout(matrix(c(1:5, 5), nrow = 3, byrow = TRUE))
```
```{r, include = FALSE}
set.seed(9000)
```
```{r simulation_activating_hide, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, eval=FALSE, warning=FALSE}
#plot relationship matrix
heatmap.2(make_distance_graph(graph, absolute = FALSE),
scale = "none", trace = "none", col = colorpanel(50, "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))
#plot sigma matrix
heatmap.2(make_sigma_mat_dist_graph(graph, cor = 0.8, absolute = FALSE),
scale = "none", trace = "none", col = colorpanel(50, "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))
#simulate data
expr <- generate_expression(100, graph, cor = 0.8, mean = 0,
comm = FALSE, dist =TRUE, absolute = FALSE, state = state)
#plot simulated correlations
heatmap.2(cor(t(expr)), scale = "none", trace = "none",
col = colorpanel(50, "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))
#plot simulated expression data
heatmap.2(expr, scale = "none", trace = "none", col = bluered(50),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "")
```
```{r simulation_activating, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, fig.cap = '\\textbf{Simulating expression from a graph structure}. An example of a graph structure (a) that has been used to derive a relationship matrix (b), $\\Sigma$ matrix (c) and correlation structure (d) from the relative distances between the nodes. Non-negative values are coloured white to red from $0$ to $1$ (e). The $\\Sigma$ matrix has been used to generate a simulated expression dataset of 100 samples (coloured blue to red from low to high) via sampling from the multivariate normal distribution. Here genes with closer relationships in the pathway structure show a higher correlation between simulated values.', warning=FALSE, echo=FALSE}
# activating graph
state <- rep(1, length(E(graph)))
plot_directed(graph, state=state, layout = layout.kamada.kawai,
cex.node=2, cex.arrow=4, arrow_clip = 0.2)
mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75)
box()
#plot relationship matrix
heatmap.2(make_distance_graph(graph, absolute = FALSE),
scale = "none", trace = "none", key = FALSE,
col = colorpanel(50, "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))
mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75)
#plot sigma matrix
heatmap.2(make_sigma_mat_dist_graph(graph, cor = 0.8, absolute = FALSE),
scale = "none", trace = "none", key = FALSE,
col = colorpanel(50, "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))
mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75)
#simulate data
expr <- generate_expression(100, graph, cor = 0.8, mean = 0,
comm = FALSE, dist =TRUE, absolute = FALSE, state = state)
#plot simulated correlations
heatmap.2(cor(t(expr)), scale = "none", trace = "none", key = FALSE,
col = colorpanel(50, "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))
mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75)
#plot simulated expression data
heatmap.2(expr, scale = "none", trace = "none", key = FALSE, col = bluered(50),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "")
mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5)
mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5)
mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)
```
```{r, include=FALSE, warning=FALSE}
pdf("Plotsimulation_activating-5.pdf", width = 12.14, height = 6.072)
#plot simulated expression data
heatmap.2(expr, scale = "none", trace = "none", key = FALSE, col = bluered(50),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "", mar = c(6, 6), keysize = 1)
mtext(text = "samples", side=1, line=1.5, at=0.55, adj=0.5, cex=1.5)
mtext(text = "genes", side=4, line=1, at=0.45, adj=0.5, cex=1.5)
mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0.5, adj=0.5, cex=1.75)
dev.off()
#system("sed -i.bak '/Plotsimulation_activating-5/s/\\(.*\\)width=[.]415/\1width=.830/g' paper.md && rm paper.md.bak")
#system("sed -i.bak '/Plotsimulation_activating-5/s/\\(.*\\)width=[.]415/\1width=.830/g' paper.tex && rm paper.tex.bak")
#sed -i '/Plot.*-5/s/\(.*\)width=[.]415/\1width=.830/g' paper.md paper.tex
```
The simulation procedure
(Figure [2](#fig:simulation_activating){reference-type="ref"
reference="fig:simulation_activating"}) can similarly be used for
pathways containing inhibitory links
(Figure [3](#fig:simulation_inhibiting){reference-type="ref"
reference="fig:simulation_inhibiting"}) with several refinements. With
the inhibitory links
(Figure [3a](#fig:simulation_inhibiting:first){reference-type="ref"
reference="fig:simulation_inhibiting:first"}), distances are calculated
in the same manner as before
(Figure [3b](#fig:simulation_inhibiting:second){reference-type="ref"
reference="fig:simulation_inhibiting:second"}) with inhibitions
accounted for by iteratively multiplying downstream nodes by $-1$ to
form modules with negative correlations between them
(Figures [3c
](#fig:simulation_inhibiting:third){reference-type="ref"
reference="fig:simulation_inhibiting:third"}
and [3d](#fig:simulation_inhibiting:fifth){reference-type="ref"
reference="fig:simulation_inhibiting:fifth"}). A multivariate normal
distribution with these negative correlations can be sampled to generate
simulated data
(Figure [3e](#fig:simulation_inhibiting:fourth){reference-type="ref"
reference="fig:simulation_inhibiting:fourth"}).
The following changes are needed to handle inhibitions:
```{r, include = FALSE}
set.seed(9000)
```
```{r simulation_inhibiting_hide, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, eval=FALSE, warning=FALSE}
#generate parameters for inhibitions
state <- c(1, 1, -1, 1, 1, 1, 1, -1)
plot_directed(graph, state=state, layout = layout.kamada.kawai,
cex.node=2, cex.arrow=4, arrow_clip = 0.2)
#plot relationship matrix
heatmap.2(make_distance_graph(graph, absolute = FALSE),
scale = "none", trace = "none", col = colorpanel(50, "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))
#plot sigma matrix
heatmap.2(make_sigma_mat_dist_graph(graph, state, cor = 0.8, absolute = FALSE),
scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))
#simulated data
expr <- generate_expression(100, graph, state, cor = 0.8, mean = 0,
comm = FALSE, dist =TRUE, absolute = FALSE)
#plot simulated correlations
heatmap.2(cor(t(expr)), scale = "none", trace = "none",
col = colorpanel(50, "blue", "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))
#plot simulated expression data
heatmap.2(expr, scale = "none", trace = "none", col = bluered(50),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "")
```
```{r simulation_inhibiting, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, fig.cap = '\\textbf{Simulating expression from graph structure with inhibitions}. An example of a graph structure (a), that has been used to derive a relationship matrix (b), $\\Sigma$ matrix (c), and correlation structure (d), from the relative distances between the nodes (e). These values are coloured blue to red from $-1$ to $1$. This has been used to generate a simulated expression dataset of 100 samples (coloured blue to red from low to high) via sampling from the multivariate normal distribution. Here the inhibitory relationships between genes are reflected in negatively correlated simulated values.', warning=FALSE, echo=FALSE}
#generate parameters for inhibitions
state <- c(1, 1, -1, 1, 1, 1, 1, -1)
plot_directed(graph, state=state, layout = layout.kamada.kawai,
cex.node=2, cex.arrow=4, arrow_clip = 0.2)
mtext(text = "(a) Inhibiting pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75)
box()
#plot relationship matrix
heatmap.2(make_distance_graph(graph, absolute = FALSE),
scale = "none", trace = "none", key = FALSE, col = colorpanel(50, "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))
mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75)
#plot sigma matrix
heatmap.2(make_sigma_mat_dist_graph(graph, state, cor = 0.8, absolute = FALSE),
scale = "none", trace = "none", key = FALSE, col = colorpanel(50, "blue", "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))
mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75)
#simulated data
expr <- generate_expression(100, graph, state, cor = 0.8, mean = 0,
comm = FALSE, dist =TRUE, absolute = FALSE)
#plot simulated correlations
heatmap.2(cor(t(expr)), scale = "none", trace = "none", key = FALSE, col = colorpanel(50, "blue", "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))
mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75)
#plot simulated expression data
heatmap.2(expr, scale = "none", trace = "none", key = FALSE, col = bluered(50),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "")
mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5)
mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5)
mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)
```
```{r, include=FALSE, warning=FALSE}
pdf("Plotsimulation_inhibiting-5.pdf", width = 12.14, height = 6.072)
#plot simulated expression data
heatmap.2(expr, scale = "none", trace = "none", key = FALSE, col = bluered(50),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "", mar = c(6, 6), keysize = 1)
mtext(text = "samples", side=1, line=1.5, at=0.55, adj=0.5, cex=1.5)
mtext(text = "genes", side=4, line=1, at=0.45, adj=0.5, cex=1.5)
mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0.5, adj=0.5, cex=1.75)
dev.off()
#("sed -i.bak '/Plotsimulation_inhibiting-5/s/\\(.*\\)width=[.]415/\1width=.830/g' paper.md && rm paper.md.bak")
#system("sed -i.bak '/Plotsimulation_inhibiting-5/s/\\(.*\\)width=[.]415/\1width=.830/g' paper.tex && rm paper.tex.bak")
#sed -i '/Plot.*-5/s/\(.*\)width=[.]415/\1width=.830/g' paper.md paper.tex
```
The simulation procedure is also demonstrated here
(Figure [4](#fig:simulation_smad){reference-type="ref"
reference="fig:simulation_smad"}) on a pathway structure for a known
biological pathway (Reactome pathway R-HSA-2173789): "TGF-$\beta$ receptor
signaling activates SMADs"
(Figure [4a](#fig:simulation_smad:first){reference-type="ref"
reference="fig:simulation_smad:first"}) derived from the Reactome
database version 52 [@Reactome]. Distances are calculated in the same
manner as before
(Figure [4b](#fig:simulation_smad:second){reference-type="ref"
reference="fig:simulation_smad:second"}) producing blocks of correlated
genes
(Figures [4c](#fig:simulation_smad:third){reference-type="ref"
reference="fig:simulation_smad:third"}
and [4d](#fig:simulation_smad:fifth){reference-type="ref"
reference="fig:simulation_smad:fifth"}). This shows that the
multivariate normal distribution can be sampled to generate simulated
data to represent expression with the complexity of a biological pathway
(Figure [4e](#fig:simulation_smad:fourth){reference-type="ref"
reference="fig:simulation_smad:fourth"}). Here *SMAD7* exhibits
negative correlations with the other SMADs consistent with its
functions as an "inhibitor SMAD" which competitively inhibits *SMAD4*.
We can import the graph structure into R as follows and run simulations as above:
```{r, include = FALSE}
set.seed(9000)
```
```{r simulation_smad_hide, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, eval=FALSE, warning=FALSE}
#import graph from data
graph <- identity(TGFBeta_Smad_graph)
#generate parameters for inhibitions
state <- E(graph)$state
plot_directed(graph, state = state, layout = layout.kamada.kawai,
border.node=alpha("black", 0.75), fill.node="lightblue",
col.arrow = c(alpha("navyblue", 0.25), alpha("red", 0.25))[state],
cex.node = 1.5, cex.label = 0.8, cex.arrow = 2)
#plot relationship matrix
heatmap.2(make_distance_graph(graph, absolute = FALSE),
scale = "none", trace = "none", col = colorpanel(50, "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "")
#plot sigma matrix
heatmap.2(make_sigma_mat_dist_graph(graph, state, cor = 0.8, absolute = FALSE),
scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "")
#simulated data
expr <- generate_expression(100, graph, state, cor = 0.8,
mean = 0,comm = FALSE, dist =TRUE, absolute = FALSE)
#plot simulated correlations
heatmap.2(cor(t(expr)), scale = "none", trace = "none",
col = colorpanel(50, "blue", "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "")
#plot simulated expression data
heatmap.2(expr, scale = "none", trace = "none", col = bluered(50),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "")
```
```{r simulation_smad, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, fig.cap = '\\textbf{Simulating expression from a biological pathway graph structure}. The graph structure (a) of a known biological pathway, "TGF-$\\beta$ receptor signaling activates SMADs" (R-HSA-2173789), was used to derive a relationship matrix (b), $\\Sigma$ matrix (c) and correlation structure (d) from the relative distances between the nodes. These values are coloured blue to red from $-1$ to $1$ (e). This has been used to generate a simulated expression dataset of 100 samples (coloured blue to red from low to high) via sampling from the multivariate normal distribution. Here modules of genes with correlated expression can be clearly discerned.', warning=FALSE, echo=FALSE}
#import graph from data
graph <- identity(TGFBeta_Smad_graph)
#generate parameters for inhibitions
state <- rep(1, length(E(graph))); pathway <- get.edgelist(graph)
state[pathway[,1] %in% c("SMAD6", "SMAD7", "BAMBI", "SMURF1", "SMURF2", "UCHL5",
"USP15", "UBB", "UBC", "PMEPA1", "PPP1CA", "PPP1CB", "PPP1CC", "PPP1R15A")] <- 2
state[is.na(state)] <- 1
plot_directed(graph, state = state, layout = layout.kamada.kawai,
border.node=alpha("black", 0.75), fill.node="lightblue",
col.arrow = c(alpha("navyblue", 0.25), alpha("red", 0.25))[state],
cex.node = 1.5, cex.label = 0.8, cex.arrow = 2,
sub = expression(paste("(a) TFG-", beta, " activates SMADs")), cex.sub = 1.75)
box()
#plot relationship matrix
heatmap.2(make_distance_graph(graph, absolute = FALSE),
scale = "none", trace = "none", key = FALSE, col = colorpanel(50, "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "")
mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75)
#plot sigma matrix
heatmap.2(make_sigma_mat_dist_graph(graph, state, cor = 0.8, absolute = FALSE),
scale = "none", trace = "none", key = FALSE, col = colorpanel(50, "blue", "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "")
mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75)
#simulated data
expr <- generate_expression(100, graph, state, cor = 0.8,
mean = 0,comm = FALSE, dist =TRUE, absolute = FALSE)
#plot simulated correlations
heatmap.2(cor(t(expr)), scale = "none", trace = "none", key = FALSE, col = colorpanel(50, "blue", "white", "red"),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "")
mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75)
#plot simulated expression data
heatmap.2(expr, scale = "none", trace = "none", key = FALSE, col = bluered(50),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "")
mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5)
mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5)
mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)
```
```{r, include=FALSE, warning=FALSE}
pdf("Plotsimulation_smad-5.pdf", width = 12.14, height = 6.072)
#plot simulated expression data
heatmap.2(expr, scale = "none", trace = "none", col = bluered(50),
colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "", mar = c(6, 6), keysize = 1)
mtext(text = "samples", side=1, line=1.5, at=0.55, adj=0.5, cex=1.5)
mtext(text = "genes", side=4, line=1, at=0.45, adj=0.5, cex=1.5)
mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0.5, adj=0.5, cex=1.75)
dev.off()
#system("sed -i.bak '/Plotsimulation_smad-5/s/\\(.*\\)width=[.]415/\1width=.830/g' paper.md && rm paper.md.bak")
#system("sed -i.bak '/Plotsimulation_smad-5/s/\\(.*\\)width=[.]415/\1width=.830/g' paper.tex && rm paper.tex.bak")
#sed -i '/Plot.*-5/s/\(.*\)width=[.]415/\1width=.830/g' paper.md paper.tex
```
These simulated datasets can also be used for simulating gene
expression data within a graph network to test genomic analysis techniques.
Correlation structure can be included in datasets generated
for testing whether true positive genes or samples can be detected
in a sample with the background of complex pathway structure.
Summary and discussion {#sec:summary}
===============================================================================
Biological pathways are of fundamental importance to understanding
molecular biology. In order to translate findings from genomics studies
into real-world applications such as improved healthcare, the roles of
genes must be studied in the context of molecular pathways. Here we
present a statistical framework to simulate gene expression from
biological pathways, and provide the \texttt{graphsim} package in R
to generate these simulated datasets. This approach is versatile and
can be fine-tuned for modelling existing biological pathways or for
testing whether constructed pathways can be detected by other means.
In particular, methods to infer biological pathways and gene regulatory
networks from gene expression data can be tested on simulated datasets
using this framework. The package also enables simulation of complex gene
expression datasets to test how these pathways impact on statistical
analysis of gene expression data using existing methods or novel
statistical methods being developed for gene expression data analysis.
This approach is intended to be applied to bulk gene expression data
but could in principle be adapted to modelling single-cell or
different modalities such as genome-wide epigenetic data.
Computational details {#computational-details .unnumbered .unnumbered}
===============================================================================
Complete examples of code needed to produce the figures in this paper are
available in the Rmarkdown version in the package GitHub repository
(\url{https://github.com/TomKellyGenetics/graphsim}).
Further details are available in the vignettes as well.
The results in this paper were obtained using R 4.0.2 with the \texttt{igraph} 1.2.5
\texttt{Matrix} 1.2-17,
\texttt{matrixcalc} 1.0-3, and \texttt{mvtnorm} 1.1-1 packages.
R itself and all dependent packages
used are available from the Comprehensive Archive Network (CRAN) at
\url{https://CRAN.R-project.org}. The \texttt{graphsim} 1.0.0 package
can be installed from CRAN and the issues can be reported to
the development version on GitHub.
This package is included in the \texttt{igraph.extensions} library on GitHub (\url{https://github.com/TomKellyGenetics/igraph.extensions})
which installs various
tools for \texttt{igraph} analysis. This software is cross-platform and
compatible with installations on Windows, Mac, and Linux operating
systems.
Updates to the package (\texttt{graphsim} 1.0.0) will be released on CRAN.
Acknowledgements {#acknowledgements .unnumbered .unnumbered}
===============================================================================
This package was developed as part of a PhD research project funded by
the Postgraduate Tassell Scholarship in Cancer Research Scholarship
awarded to STK. We thank members of the Laboratory of Professor Satoru
Miyano at the University of Tokyo, Institute for Medical Science,
Professor Seiya Imoto, Associate Professor Rui Yamaguchi, and Dr Paul
Sheridan (Assistant Professor at Hirosaki University,CSO at Tupac Bio)
for helpful discussions in this field. We also thank Professor Parry
Guilford at the University of Otago, Professor Cristin Print at the
University of Auckland, and Dr Erik Arner at the RIKEN Center for
Integrative Medical Sciences for their excellent advice during this
project.
Author Contributions {#contributions .unnumbered .unnumbered}
===============================================================================
S.T.K. and M.A.B. conceived of the presented methodology. S.T.K. developed the theory and performed the computations.
M.A.B. provided guidance throughout the project and gave feedback on the package. All authors discussed the package and contributed to the final manuscript.
# References