Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign up| --- | |
| title: 'graphsim: An R package for simulating gene expression data from graph structures of biological pathways' | |
| output: | |
| rmarkdown::pdf_document: | |
| fig_crop: no | |
| #keep_md: TRUE | |
| keep_tex: TRUE | |
| fig_caption: yes | |
| tags: | |
| - R | |
| - gene-expression | |
| - simulation | |
| - genomics | |
| - pathway | |
| - network | |
| authors: | |
| - name: S. Thomas Kelly | |
| email: "tom.kelly@postgrad.otago.ac.nz, tom.kelly@riken.jp" | |
| orcid: 0000-0003-3904-6690 | |
| affiliation: "1, 2" # (Multiple affiliations must be quoted) | |
| - name: Michael A. Black | |
| email: mik.black@otago.ac.nz | |
| orcid: 0000-0003-1174-6054 | |
| affiliation: "1" | |
| affiliations: | |
| - name: "Department of Biochemistry, University of Otago, PO Box 56, Dunedin 9054, New Zealand" | |
| index: 1 | |
| - name: "RIKEN Center for Integrative Medical Sciences, Suehiro-cho-1-7-22, Tsurumi Ward, Yokohama, Kanagawa 230-0045, Japan" | |
| index: 2 | |
| date: "`r format(Sys.time(), '%d %B %Y')`" | |
| bibliography: paper.bib | |
| header-includes: | |
| - \usepackage{caption} | |
| --- | |
| ```{r, include = FALSE} | |
| knitr::opts_chunk$set(collapse = TRUE, comment = "#>", width = 68) | |
| knitr::opts_chunk$set(fig.cap = "", fig.path = "Plot") | |
| knitr::opts_chunk$set(fig.pos = "!htbp") | |
| options(width = 68, cli.unicode = FALSE, cli.width = 68) | |
| #par(mai=c(2.82, 2.82, 0.82, 0.82)-0.82) | |
| par(mar=c(7, 10, 4, 2) + 0.1) | |
| par(bty="o") | |
| #captioner::captioner(prefix = "Fig.") | |
| ``` | |
| ### Summary | |
| Transcriptomic analysis is used to capture the molecular state of a cell | |
| or sample in many biological and medical applications. In addition to | |
| identifying alterations in activity at the level of individual genes, | |
| understanding changes in the gene networks that regulate fundamental | |
| biological mechanisms is also an important objective of molecular | |
| analysis. As a result, databases that describe biological pathways | |
| are increasingly used to assist with the interpretation of results | |
| from large-scale genomics studies. Incorporating information from | |
| biological pathways and gene regulatory networks into a genomic data | |
| analysis is a popular strategy, and there are many methods that provide | |
| this functionality for gene expression data. When developing or comparing | |
| such methods, it is important to gain an accurate assessment of their | |
| performance. Simulation-based validation studies are frequently used | |
| for this. | |
| This necessitates the use of simulated data that correctly accounts for | |
| pathway relationships and correlations. Here we present a versatile | |
| statistical framework to simulate correlated gene expression data from | |
| biological pathways, by sampling from a multivariate normal distribution | |
| derived from a graph structure. This procedure has been released as the | |
| \texttt{graphsim} R package on CRAN and GitHub (\url{https://github.com/TomKellyGenetics/graphsim}) | |
| and is compatible with any graph structure that can be described using | |
| the `igraph` package. This package allows the simulation of biological | |
| pathways from a graph structure based on a statistical model of gene expression. | |
| Introduction: inference and modelling of biological networks {#sec:intro} | |
| =============================================================================== | |
| Network analysis of molecular biological pathways has the potential to | |
| lead to new insights into biology and medical genetics | |
| [@Barabasi2004; @Hu2016]. Since gene expression profiles capture a | |
| consistent signature of the regulatory state of a cell | |
| [@Perou2000; @Ozsolak2011; @Svensson2018], they can be used to analyse | |
| complex molecular states with genome-scale data. However, biological | |
| pathways are often analysed in a reductionist paradigm as amorphous sets | |
| of genes involved in particular functions, despite the fact that the | |
| relationships defined by pathway structure could further inform gene | |
| expression analyses. In many cases, the pathway relationships are | |
| well-defined, experimentally-validated, and are available in public | |
| databases [@Reactome]. As a result, network analysis techniques can | |
| play an important role in furthering our understanding of biological | |
| pathways and aiding in the interpretation of genomics studies. | |
| Gene networks provide insights into how cells are regulated, by mapping | |
| regulatory interactions between target genes and transcription factors, | |
| enhancers, and sites of epigenetic marks or chromatin structures | |
| [@Barabasi2004; @Yamaguchi2007]. Inference using these regulatory | |
| interactions genomic analysis has the potential to radically | |
| expand the range of candidate biological pathways to be further | |
| explored, or to improve the accuracy of bioinformatics and functional | |
| genomic analysis. A number of methods have been developed to | |
| utilise timecourse gene expression data [@Yamaguchi2007; @Arner2015] | |
| using gene regulatory modules in state-space models and recursive vector | |
| autoregressive models [@Hirose2008; @Shimamura2009]. Various approaches | |
| to gene regulation and networks at the genome-wide scale have led to | |
| novel biological insights [@Arner2015; @Komatsu2013], however, inference | |
| of regulatory networks has thus far primarily relied on experimental | |
| validation or resampling-based approaches to estimate the likelihood | |
| of specific network modules being predicted [@Markowetz2007; @Hawe2019]. | |
| Simulating datasets that account for pathway structure are of particular | |
| interest for benchmarking regulatory network inference techniques | |
| and methods being developed for genomics data containing complex biological | |
| interactions [@Schaffter2011; @Saelens2019]. Dynamical models using | |
| differential equations have been employed, such as by GeneNetWeaver | |
| [@Schaffter2011], to generate simulated datasets | |
| specifically for benchmarking gene regulatory network inference techniques. | |
| There is also renewed interest in modelling biological pathways and simulating | |
| data for benchmarking due to the emergence of single-cell genomics | |
| technologies and the growing number of bioinformatics techniques developed | |
| to use this data [@Zappia2017; @Saelens2019]. Packages such as 'splatter' | |
| [@Zappia2017], which uses the gamma-poisson distribution, | |
| have been developed to model single-cell data. | |
| SERGIO [@Dibaeinia2019] and dyngen [@Cannoodt2020] build on | |
| this by adding gene regulatory networks and multimodality | |
| respectively. These methods have been designed based on known | |
| deterministic relationships or synthetic reaction states, | |
| to which stochasticity is then added. | |
| However, it is computationally-intensive to model | |
| these reactions at scale or run many iterations for benchmarking. | |
| In some cases, it is only necessary to model the statistical | |
| variability and "noise" of RNA-Seq data in order to evaluate | |
| methods in the presence of multivariate correlation structures. | |
| There is a need, therefore, for a systematic framework for statistical | |
| modelling and simulation of gene expression data derived from | |
| hypothetical, inferred or known gene networks. Here we present a | |
| package to achieve this, where samples from a multivariate normal | |
| distribution are used to generate normally-distributed log-expression | |
| data, with correlations between genes derived from the structure of the | |
| underlying pathway or gene regulatory network. This methodology enables | |
| simulation of expression profiles that approximate the log-transformed | |
| and normalised data from microarray studies, as well as bulk or single-cell RNA-Seq | |
| experiments. This procedure has been released as the \texttt{graphsim} | |
| package to enable the generation of simulated gene expression datasets | |
| containing pathway relationships from a known underlying network. | |
| These simulated datasets can be used to evaluate various bioinformatics | |
| methodologies, including | |
| statistical and network inference procedures. | |
| Methodology and software {#sec:methods} | |
| =============================================================================== | |
| Here we present a procedure to simulate gene expression data with | |
| correlation structure derived from a known graph structure. This | |
| procedure assumes that transcriptomic data have been generated and | |
| follow a log-normal distribution (i.e., | |
| $log(X_{ij}) \sim MVN({\bf\mu}, \Sigma)$, where ${\bf\mu}$ and $\Sigma$ | |
| are the mean vector and variance-covariance matrix respectively, for | |
| gene expression data derived from a biological pathway) after | |
| appropriate normalisation [@Law2014; @Li2015]. Log-normality of gene | |
| expression matches the assumptions of the popular \texttt{limma} package [@limma], which is | |
| often used for the analysis of intensity-based data from gene expression | |
| microarray studies and count-based data from RNA-Seq experiments. This | |
| approach has also been applied for modelling UMI-based count data from | |
| single-cell RNA-Seq experiments in the \texttt{DESCEND} R package [@Wang2018]. | |
| In order to simulate transcriptomic data, a pathway is first constructed | |
| as a graph structure, using the \texttt{igraph} R package [@igraph], with the status of | |
| the edge relationships defined (i.e, whether they activate or inhibit | |
| downstream pathway members). [This procedure uses]{style="color: black"} | |
| a graph structure such as that presented in | |
| Figure [1a](#fig:simple_graph:first){reference-type="ref" | |
| reference="fig:simple_graph:first"}. The graph can be defined by an | |
| adjacency matrix, **$A$** (with elements | |
| $A_{ij}$), where $$A_{ij} = | |
| \begin{cases} | |
| 1 & \mbox{if genes } i \mbox{ and } j \mbox{ are adjacent} \\ | |
| 0 & \mbox{otherwise} | |
| \end{cases}$$ | |
| A matrix, **$R$**, with elements | |
| [$R_{ij}$]{style="color: black"}, is calculated based on distance (i.e., | |
| number of edges contained in the shortest path) between nodes, such that | |
| closer nodes are given more weight than more distant nodes, to define | |
| inter-node relationships. A geometrically-decreasing (relative) distance | |
| weighting is used to achieve this: | |
| [ $$R_{ij} = | |
| \begin{cases} | |
| 1 & \mbox{if genes } i \mbox{ and } j \mbox{ are adjacent} \\ | |
| (\frac{1}{2})^{d_{ij}} & \mbox{if a path can be found between genes } i \mbox{ and } j \\ | |
| 0 & \mbox{if no path exists between genes } i \mbox{ and } j | |
| \end{cases}$$]{style="color: black"} where $d_{ij}$ is the length of | |
| the shortest path (i.e., minimum number of edges traversed) between | |
| genes (nodes) $i$ and $j$ in graph $G$. Each more distant node is thus | |
| related by $\frac{1}{2}$ compared to the next nearest, as shown in | |
| Figure [2b](#fig:simulation_activating:second){reference-type="ref" | |
| reference="fig:simulation_activating:second"}. An | |
| arithmetically-decreasing (absolute) distance weighting is also | |
| supported in the \texttt{graphsim} R package which implements this procedure: [ $$R_{ij} = | |
| \begin{cases} | |
| 1 & \mbox{if genes } i \mbox{ and } j \mbox{ are adjacent} \\ | |
| 1-\frac{d_{ij}}{diam(G)} & \mbox{if a path can be found between genes } i \mbox{ and } j \\ | |
| 0 & \mbox{if no path exists between genes } i \mbox{ and } j | |
| \end{cases}$$ ]{style="color: black"} | |
| Assuming a unit variance for each gene, these values can be used to | |
| derive a $\Sigma$ matrix: $$\Sigma_{ij} = | |
| \begin{cases} | |
| 1 & \mbox{if } i=j \\ | |
| \rho R_{ij} & \mbox{otherwise} | |
| \end{cases}$$ where $\rho$ is the correlation between adjacent nodes. | |
| Thus covariances between adjacent nodes are assigned by a correlation | |
| parameter ($\rho$) and the remaining off-diagonal values in the matrix | |
| are based on scaling these correlations by the geometrically weighted | |
| relationship matrix (or the nearest positive definite matrix for | |
| $\Sigma$ with negative correlations).\ | |
| Computing the nearest positive definite matrix is necessary to ensure | |
| that the variance-covariance matrix can be inverted when used as a | |
| parameter in multivariate normal simulations, particularly when negative | |
| correlations are included for inhibitions (as shown below). Matrices | |
| that cannot be inverted occur rarely with biologically plausible | |
| graph structures but this approach allows for the computation of a | |
| plausible correlation matrix when the given graph structure is | |
| incomplete or contains loops. When required, the nearest positive | |
| definite matrix is computed using the `nearPD` function of | |
| the \texttt{Matrix} R package [@Matrix] to perform | |
| Higham's algorithm [@Higham2002] on variance-covariance matrices. | |
| The \texttt{graphsim} package gives a warning when this occurs. | |
| Illustrations {#sec:illustrations} | |
| =============================================================================== | |
| Generating a Graph Structure {#sec:plot_graph} | |
| ------------------------------------------------------------------------------- | |
| The graph structure in | |
| Figure [1a](#fig:simple_graph:first){reference-type="ref" | |
| reference="fig:simple_graph:first"} was used to simulate correlated gene | |
| expression data by sampling from a multivariate normal distribution | |
| using the \texttt{mvtnorm} R package [@Genz2009; @mvtnorm]. The graph structure | |
| visualisation in | |
| Figure [1](#fig:simple_graph){reference-type="ref" | |
| reference="fig:simple_graph"} was specifically developed for (directed) | |
| \texttt{igraph} objects in and is available in the \texttt{graphsim} package. The | |
| \texttt{plot\_directed} function enables customisation of plot parameters for | |
| each node or edge, and mixed (directed) edge types for indicating | |
| activation or inhibition. These inhibition links (which occur frequently | |
| in biological pathways) are demonstrated in | |
| Figure [1b](#fig:simple_graph:second){reference-type="ref" | |
| reference="fig:simple_graph:second"}. | |
| A graph structure can be generated and plotted using the following | |
| commands in R: | |
| ```{r, eval = FALSE} | |
| #install packages required (once per computer) | |
| install.packages("graphsim") | |
| ``` | |
| ```{r, warning=FALSE, results='hide', message=FALSE} | |
| #load required packages (once per R instance) | |
| library("graphsim") | |
| #load packages for examples | |
| library("igraph"); library("gplots"); library("scales") | |
| ``` | |
| ```{r simple_graph_hide, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, eval=FALSE} | |
| #generate graph structure | |
| graph_edges <- rbind(c("A", "C"), c("B", "C"), c("C", "D"), c("D", "E"), | |
| c("D", "F"), c("F", "G"), c("F", "I"), c("H", "I")) | |
| graph <- graph.edgelist(graph_edges, directed = TRUE) | |
| #plot graph structure (Figure 1a) | |
| plot_directed(graph, state ="activating", layout = layout.kamada.kawai, | |
| cex.node = 2, cex.arrow = 4, arrow_clip = 0.2) | |
| #generate parameters for inhibitions for each edge in E(graph) | |
| state <- c(1, 1, -1, 1, 1, 1, 1, -1) | |
| #plot graph structure with inhibitions (Figure 1b) | |
| plot_directed(graph, state=state, layout = layout.kamada.kawai, | |
| cex.node = 2, cex.arrow = 4, arrow_clip = 0.2) | |
| ``` | |
| ```{r simple_graph, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, fig.cap = '\\textbf{Simulated graph structures}. A constructed graph structure used as an example to demonstrate the simulation procedure in Figures 2 and 3. Activating links are denoted by black arrows and inhibiting links by red edges.', echo=FALSE} | |
| #generate graph structure | |
| graph_edges <- rbind(c("A", "C"), c("B", "C"), c("C", "D"), c("D", "E"), | |
| c("D", "F"), c("F", "G"), c("F", "I"), c("H", "I")) | |
| graph <- graph.edgelist(graph_edges, directed = TRUE) | |
| #plot graph structure (Figure 1a) | |
| plot_directed(graph, state ="activating", layout = layout.kamada.kawai, | |
| cex.node = 2, cex.arrow = 4, arrow_clip = 0.2) | |
| mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.05, adj=0.5, cex=1.75) | |
| box() | |
| #generate parameters for inhibitions for each edge in E(graph) | |
| state <- c(1, 1, -1, 1, 1, 1, 1, -1) | |
| #plot graph structure with inhibitions (Figure 1b) | |
| plot_directed(graph, state=state, layout = layout.kamada.kawai, | |
| cex.node = 2, cex.arrow = 4, arrow_clip = 0.2) | |
| mtext(text = "(b) Inhibiting pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) | |
| box() | |
| ``` | |
| Generating a Simulated Expression Dataset {#sec:graphsim_demo} | |
| ----------------------------------------- | |
| The correlation parameter of $\rho = 0.8$ is used to demonstrate the | |
| inter-correlated datasets using a geometrically-generated relationship | |
| matrix (as used for the example in | |
| Figure [2c](#fig:simulation_activating:third){reference-type="ref" | |
| reference="fig:simulation_activating:third"}). This $\Sigma$ matrix was | |
| then used to sample from a multivariate normal distribution such that | |
| each gene had a mean of $0$, standard deviation $1$, and covariance | |
| within the range $[0,1]$ so that the off-diagonal elements of $\Sigma$ | |
| represent correlations. This procedure generated a simulated (continuous | |
| normally-distributed) log-expression profile for each node | |
| (Figure [2e](#fig:simulation_activating:fourth){reference-type="ref" | |
| reference="fig:simulation_activating:fourth"}) with a corresponding | |
| correlation structure (Figure [2d](#fig:simulation_activating:fifth){reference-type="ref" | |
| reference="fig:simulation_activating:fifth"}). The simulated correlation | |
| structure closely resembled the expected correlation structure ($\Sigma$ | |
| in Figure [2c](#fig:simulation_activating:third){reference-type="ref" | |
| reference="fig:simulation_activating:third"}) even for the relatively | |
| modest sample size ($N=100$) illustrated in | |
| Figure [2](#fig:simulation_activating){reference-type="ref" | |
| reference="fig:simulation_activating"}. Once a gene expression dataset | |
| comprising multiple pathways has been generated (as in | |
| Figure [2e](#fig:simulation_activating:fourth){reference-type="ref" | |
| reference="fig:simulation_activating:fourth"}), it can then be used to | |
| test procedures designed for analysis of empirical gene expression data | |
| (such as those generated by microarrays or RNA-Seq) that have been | |
| normalised on a log-scale. | |
| The simulated dataset can be generated using the following code: | |
| ```{r, include=FALSE, eval=FALSE} | |
| graphics::layout(matrix(c(1:5, 5), nrow = 3, byrow = TRUE)) | |
| ``` | |
| ```{r, include = FALSE} | |
| set.seed(9000) | |
| ``` | |
| ```{r simulation_activating_hide, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, eval=FALSE, warning=FALSE} | |
| #plot relationship matrix | |
| heatmap.2(make_distance_graph(graph, absolute = FALSE), | |
| scale = "none", trace = "none", col = colorpanel(50, "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) | |
| #plot sigma matrix | |
| heatmap.2(make_sigma_mat_dist_graph(graph, cor = 0.8, absolute = FALSE), | |
| scale = "none", trace = "none", col = colorpanel(50, "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) | |
| #simulate data | |
| expr <- generate_expression(100, graph, cor = 0.8, mean = 0, | |
| comm = FALSE, dist =TRUE, absolute = FALSE, state = state) | |
| #plot simulated correlations | |
| heatmap.2(cor(t(expr)), scale = "none", trace = "none", | |
| col = colorpanel(50, "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) | |
| #plot simulated expression data | |
| heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "") | |
| ``` | |
| ```{r simulation_activating, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, fig.cap = '\\textbf{Simulating expression from a graph structure}. An example of a graph structure (a) that has been used to derive a relationship matrix (b), $\\Sigma$ matrix (c) and correlation structure (d) from the relative distances between the nodes. Non-negative values are coloured white to red from $0$ to $1$ (e). The $\\Sigma$ matrix has been used to generate a simulated expression dataset of 100 samples (coloured blue to red from low to high) via sampling from the multivariate normal distribution. Here genes with closer relationships in the pathway structure show a higher correlation between simulated values.', warning=FALSE, echo=FALSE} | |
| # activating graph | |
| state <- rep(1, length(E(graph))) | |
| plot_directed(graph, state=state, layout = layout.kamada.kawai, | |
| cex.node=2, cex.arrow=4, arrow_clip = 0.2) | |
| mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) | |
| box() | |
| #plot relationship matrix | |
| heatmap.2(make_distance_graph(graph, absolute = FALSE), | |
| scale = "none", trace = "none", key = FALSE, | |
| col = colorpanel(50, "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) | |
| mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) | |
| #plot sigma matrix | |
| heatmap.2(make_sigma_mat_dist_graph(graph, cor = 0.8, absolute = FALSE), | |
| scale = "none", trace = "none", key = FALSE, | |
| col = colorpanel(50, "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) | |
| mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) | |
| #simulate data | |
| expr <- generate_expression(100, graph, cor = 0.8, mean = 0, | |
| comm = FALSE, dist =TRUE, absolute = FALSE, state = state) | |
| #plot simulated correlations | |
| heatmap.2(cor(t(expr)), scale = "none", trace = "none", key = FALSE, | |
| col = colorpanel(50, "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) | |
| mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) | |
| #plot simulated expression data | |
| heatmap.2(expr, scale = "none", trace = "none", key = FALSE, col = bluered(50), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "") | |
| mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) | |
| mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) | |
| mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75) | |
| ``` | |
| ```{r, include=FALSE, warning=FALSE} | |
| pdf("Plotsimulation_activating-5.pdf", width = 12.14, height = 6.072) | |
| #plot simulated expression data | |
| heatmap.2(expr, scale = "none", trace = "none", key = FALSE, col = bluered(50), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "", mar = c(6, 6), keysize = 1) | |
| mtext(text = "samples", side=1, line=1.5, at=0.55, adj=0.5, cex=1.5) | |
| mtext(text = "genes", side=4, line=1, at=0.45, adj=0.5, cex=1.5) | |
| mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0.5, adj=0.5, cex=1.75) | |
| dev.off() | |
| #system("sed -i.bak '/Plotsimulation_activating-5/s/\\(.*\\)width=[.]415/\1width=.830/g' paper.md && rm paper.md.bak") | |
| #system("sed -i.bak '/Plotsimulation_activating-5/s/\\(.*\\)width=[.]415/\1width=.830/g' paper.tex && rm paper.tex.bak") | |
| #sed -i '/Plot.*-5/s/\(.*\)width=[.]415/\1width=.830/g' paper.md paper.tex | |
| ``` | |
| The simulation procedure | |
| (Figure [2](#fig:simulation_activating){reference-type="ref" | |
| reference="fig:simulation_activating"}) can similarly be used for | |
| pathways containing inhibitory links | |
| (Figure [3](#fig:simulation_inhibiting){reference-type="ref" | |
| reference="fig:simulation_inhibiting"}) with several refinements. With | |
| the inhibitory links | |
| (Figure [3a](#fig:simulation_inhibiting:first){reference-type="ref" | |
| reference="fig:simulation_inhibiting:first"}), distances are calculated | |
| in the same manner as before | |
| (Figure [3b](#fig:simulation_inhibiting:second){reference-type="ref" | |
| reference="fig:simulation_inhibiting:second"}) with inhibitions | |
| accounted for by iteratively multiplying downstream nodes by $-1$ to | |
| form modules with negative correlations between them | |
| (Figures [3c | |
| ](#fig:simulation_inhibiting:third){reference-type="ref" | |
| reference="fig:simulation_inhibiting:third"} | |
| and [3d](#fig:simulation_inhibiting:fifth){reference-type="ref" | |
| reference="fig:simulation_inhibiting:fifth"}). A multivariate normal | |
| distribution with these negative correlations can be sampled to generate | |
| simulated data | |
| (Figure [3e](#fig:simulation_inhibiting:fourth){reference-type="ref" | |
| reference="fig:simulation_inhibiting:fourth"}). | |
| The following changes are needed to handle inhibitions: | |
| ```{r, include = FALSE} | |
| set.seed(9000) | |
| ``` | |
| ```{r simulation_inhibiting_hide, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, eval=FALSE, warning=FALSE} | |
| #generate parameters for inhibitions | |
| state <- c(1, 1, -1, 1, 1, 1, 1, -1) | |
| plot_directed(graph, state=state, layout = layout.kamada.kawai, | |
| cex.node=2, cex.arrow=4, arrow_clip = 0.2) | |
| #plot relationship matrix | |
| heatmap.2(make_distance_graph(graph, absolute = FALSE), | |
| scale = "none", trace = "none", col = colorpanel(50, "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) | |
| #plot sigma matrix | |
| heatmap.2(make_sigma_mat_dist_graph(graph, state, cor = 0.8, absolute = FALSE), | |
| scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) | |
| #simulated data | |
| expr <- generate_expression(100, graph, state, cor = 0.8, mean = 0, | |
| comm = FALSE, dist =TRUE, absolute = FALSE) | |
| #plot simulated correlations | |
| heatmap.2(cor(t(expr)), scale = "none", trace = "none", | |
| col = colorpanel(50, "blue", "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) | |
| #plot simulated expression data | |
| heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "") | |
| ``` | |
| ```{r simulation_inhibiting, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, fig.cap = '\\textbf{Simulating expression from graph structure with inhibitions}. An example of a graph structure (a), that has been used to derive a relationship matrix (b), $\\Sigma$ matrix (c), and correlation structure (d), from the relative distances between the nodes (e). These values are coloured blue to red from $-1$ to $1$. This has been used to generate a simulated expression dataset of 100 samples (coloured blue to red from low to high) via sampling from the multivariate normal distribution. Here the inhibitory relationships between genes are reflected in negatively correlated simulated values.', warning=FALSE, echo=FALSE} | |
| #generate parameters for inhibitions | |
| state <- c(1, 1, -1, 1, 1, 1, 1, -1) | |
| plot_directed(graph, state=state, layout = layout.kamada.kawai, | |
| cex.node=2, cex.arrow=4, arrow_clip = 0.2) | |
| mtext(text = "(a) Inhibiting pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) | |
| box() | |
| #plot relationship matrix | |
| heatmap.2(make_distance_graph(graph, absolute = FALSE), | |
| scale = "none", trace = "none", key = FALSE, col = colorpanel(50, "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) | |
| mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) | |
| #plot sigma matrix | |
| heatmap.2(make_sigma_mat_dist_graph(graph, state, cor = 0.8, absolute = FALSE), | |
| scale = "none", trace = "none", key = FALSE, col = colorpanel(50, "blue", "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) | |
| mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) | |
| #simulated data | |
| expr <- generate_expression(100, graph, state, cor = 0.8, mean = 0, | |
| comm = FALSE, dist =TRUE, absolute = FALSE) | |
| #plot simulated correlations | |
| heatmap.2(cor(t(expr)), scale = "none", trace = "none", key = FALSE, col = colorpanel(50, "blue", "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) | |
| mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) | |
| #plot simulated expression data | |
| heatmap.2(expr, scale = "none", trace = "none", key = FALSE, col = bluered(50), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "") | |
| mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) | |
| mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) | |
| mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75) | |
| ``` | |
| ```{r, include=FALSE, warning=FALSE} | |
| pdf("Plotsimulation_inhibiting-5.pdf", width = 12.14, height = 6.072) | |
| #plot simulated expression data | |
| heatmap.2(expr, scale = "none", trace = "none", key = FALSE, col = bluered(50), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "", mar = c(6, 6), keysize = 1) | |
| mtext(text = "samples", side=1, line=1.5, at=0.55, adj=0.5, cex=1.5) | |
| mtext(text = "genes", side=4, line=1, at=0.45, adj=0.5, cex=1.5) | |
| mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0.5, adj=0.5, cex=1.75) | |
| dev.off() | |
| #("sed -i.bak '/Plotsimulation_inhibiting-5/s/\\(.*\\)width=[.]415/\1width=.830/g' paper.md && rm paper.md.bak") | |
| #system("sed -i.bak '/Plotsimulation_inhibiting-5/s/\\(.*\\)width=[.]415/\1width=.830/g' paper.tex && rm paper.tex.bak") | |
| #sed -i '/Plot.*-5/s/\(.*\)width=[.]415/\1width=.830/g' paper.md paper.tex | |
| ``` | |
| The simulation procedure is also demonstrated here | |
| (Figure [4](#fig:simulation_smad){reference-type="ref" | |
| reference="fig:simulation_smad"}) on a pathway structure for a known | |
| biological pathway (Reactome pathway R-HSA-2173789): "TGF-$\beta$ receptor | |
| signaling activates SMADs" | |
| (Figure [4a](#fig:simulation_smad:first){reference-type="ref" | |
| reference="fig:simulation_smad:first"}) derived from the Reactome | |
| database version 52 [@Reactome]. Distances are calculated in the same | |
| manner as before | |
| (Figure [4b](#fig:simulation_smad:second){reference-type="ref" | |
| reference="fig:simulation_smad:second"}) producing blocks of correlated | |
| genes | |
| (Figures [4c](#fig:simulation_smad:third){reference-type="ref" | |
| reference="fig:simulation_smad:third"} | |
| and [4d](#fig:simulation_smad:fifth){reference-type="ref" | |
| reference="fig:simulation_smad:fifth"}). This shows that the | |
| multivariate normal distribution can be sampled to generate simulated | |
| data to represent expression with the complexity of a biological pathway | |
| (Figure [4e](#fig:simulation_smad:fourth){reference-type="ref" | |
| reference="fig:simulation_smad:fourth"}). Here *SMAD7* exhibits | |
| negative correlations with the other SMADs consistent with its | |
| functions as an "inhibitor SMAD" which competitively inhibits *SMAD4*. | |
| We can import the graph structure into R as follows and run simulations as above: | |
| ```{r, include = FALSE} | |
| set.seed(9000) | |
| ``` | |
| ```{r simulation_smad_hide, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, eval=FALSE, warning=FALSE} | |
| #import graph from data | |
| graph <- identity(TGFBeta_Smad_graph) | |
| #generate parameters for inhibitions | |
| state <- E(graph)$state | |
| plot_directed(graph, state = state, layout = layout.kamada.kawai, | |
| border.node=alpha("black", 0.75), fill.node="lightblue", | |
| col.arrow = c(alpha("navyblue", 0.25), alpha("red", 0.25))[state], | |
| cex.node = 1.5, cex.label = 0.8, cex.arrow = 2) | |
| #plot relationship matrix | |
| heatmap.2(make_distance_graph(graph, absolute = FALSE), | |
| scale = "none", trace = "none", col = colorpanel(50, "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "") | |
| #plot sigma matrix | |
| heatmap.2(make_sigma_mat_dist_graph(graph, state, cor = 0.8, absolute = FALSE), | |
| scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "") | |
| #simulated data | |
| expr <- generate_expression(100, graph, state, cor = 0.8, | |
| mean = 0,comm = FALSE, dist =TRUE, absolute = FALSE) | |
| #plot simulated correlations | |
| heatmap.2(cor(t(expr)), scale = "none", trace = "none", | |
| col = colorpanel(50, "blue", "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "") | |
| #plot simulated expression data | |
| heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "") | |
| ``` | |
| ```{r simulation_smad, fig.align='center', fig.show='hold', fig.width='1.0\\linewidth', fig.height='1.0\\linewidth', out.height='.375\\linewidth', out.width='.375\\linewidth', fig.retina=10, fig.margin = FALSE, fig.ncol = 2, fig.cap = '\\textbf{Simulating expression from a biological pathway graph structure}. The graph structure (a) of a known biological pathway, "TGF-$\\beta$ receptor signaling activates SMADs" (R-HSA-2173789), was used to derive a relationship matrix (b), $\\Sigma$ matrix (c) and correlation structure (d) from the relative distances between the nodes. These values are coloured blue to red from $-1$ to $1$ (e). This has been used to generate a simulated expression dataset of 100 samples (coloured blue to red from low to high) via sampling from the multivariate normal distribution. Here modules of genes with correlated expression can be clearly discerned.', warning=FALSE, echo=FALSE} | |
| #import graph from data | |
| graph <- identity(TGFBeta_Smad_graph) | |
| #generate parameters for inhibitions | |
| state <- rep(1, length(E(graph))); pathway <- get.edgelist(graph) | |
| state[pathway[,1] %in% c("SMAD6", "SMAD7", "BAMBI", "SMURF1", "SMURF2", "UCHL5", | |
| "USP15", "UBB", "UBC", "PMEPA1", "PPP1CA", "PPP1CB", "PPP1CC", "PPP1R15A")] <- 2 | |
| state[is.na(state)] <- 1 | |
| plot_directed(graph, state = state, layout = layout.kamada.kawai, | |
| border.node=alpha("black", 0.75), fill.node="lightblue", | |
| col.arrow = c(alpha("navyblue", 0.25), alpha("red", 0.25))[state], | |
| cex.node = 1.5, cex.label = 0.8, cex.arrow = 2, | |
| sub = expression(paste("(a) TFG-", beta, " activates SMADs")), cex.sub = 1.75) | |
| box() | |
| #plot relationship matrix | |
| heatmap.2(make_distance_graph(graph, absolute = FALSE), | |
| scale = "none", trace = "none", key = FALSE, col = colorpanel(50, "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "") | |
| mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) | |
| #plot sigma matrix | |
| heatmap.2(make_sigma_mat_dist_graph(graph, state, cor = 0.8, absolute = FALSE), | |
| scale = "none", trace = "none", key = FALSE, col = colorpanel(50, "blue", "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "") | |
| mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) | |
| #simulated data | |
| expr <- generate_expression(100, graph, state, cor = 0.8, | |
| mean = 0,comm = FALSE, dist =TRUE, absolute = FALSE) | |
| #plot simulated correlations | |
| heatmap.2(cor(t(expr)), scale = "none", trace = "none", key = FALSE, col = colorpanel(50, "blue", "white", "red"), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "") | |
| mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) | |
| #plot simulated expression data | |
| heatmap.2(expr, scale = "none", trace = "none", key = FALSE, col = bluered(50), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "") | |
| mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) | |
| mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) | |
| mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75) | |
| ``` | |
| ```{r, include=FALSE, warning=FALSE} | |
| pdf("Plotsimulation_smad-5.pdf", width = 12.14, height = 6.072) | |
| #plot simulated expression data | |
| heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), | |
| colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)), labCol = "", mar = c(6, 6), keysize = 1) | |
| mtext(text = "samples", side=1, line=1.5, at=0.55, adj=0.5, cex=1.5) | |
| mtext(text = "genes", side=4, line=1, at=0.45, adj=0.5, cex=1.5) | |
| mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0.5, adj=0.5, cex=1.75) | |
| dev.off() | |
| #system("sed -i.bak '/Plotsimulation_smad-5/s/\\(.*\\)width=[.]415/\1width=.830/g' paper.md && rm paper.md.bak") | |
| #system("sed -i.bak '/Plotsimulation_smad-5/s/\\(.*\\)width=[.]415/\1width=.830/g' paper.tex && rm paper.tex.bak") | |
| #sed -i '/Plot.*-5/s/\(.*\)width=[.]415/\1width=.830/g' paper.md paper.tex | |
| ``` | |
| These simulated datasets can also be used for simulating gene | |
| expression data within a graph network to test genomic analysis techniques. | |
| Correlation structure can be included in datasets generated | |
| for testing whether true positive genes or samples can be detected | |
| in a sample with the background of complex pathway structure. | |
| Summary and discussion {#sec:summary} | |
| =============================================================================== | |
| Biological pathways are of fundamental importance to understanding | |
| molecular biology. In order to translate findings from genomics studies | |
| into real-world applications such as improved healthcare, the roles of | |
| genes must be studied in the context of molecular pathways. Here we | |
| present a statistical framework to simulate gene expression from | |
| biological pathways, and provide the \texttt{graphsim} package in R | |
| to generate these simulated datasets. This approach is versatile and | |
| can be fine-tuned for modelling existing biological pathways or for | |
| testing whether constructed pathways can be detected by other means. | |
| In particular, methods to infer biological pathways and gene regulatory | |
| networks from gene expression data can be tested on simulated datasets | |
| using this framework. The package also enables simulation of complex gene | |
| expression datasets to test how these pathways impact on statistical | |
| analysis of gene expression data using existing methods or novel | |
| statistical methods being developed for gene expression data analysis. | |
| This approach is intended to be applied to bulk gene expression data | |
| but could in principle be adapted to modelling single-cell or | |
| different modalities such as genome-wide epigenetic data. | |
| Computational details {#computational-details .unnumbered .unnumbered} | |
| =============================================================================== | |
| Complete examples of code needed to produce the figures in this paper are | |
| available in the Rmarkdown version in the package GitHub repository | |
| (\url{https://github.com/TomKellyGenetics/graphsim}). | |
| Further details are available in the vignettes as well. | |
| The results in this paper were obtained using R 4.0.2 with the \texttt{igraph} 1.2.5 | |
| \texttt{Matrix} 1.2-17, | |
| \texttt{matrixcalc} 1.0-3, and \texttt{mvtnorm} 1.1-1 packages. | |
| R itself and all dependent packages | |
| used are available from the Comprehensive Archive Network (CRAN) at | |
| \url{https://CRAN.R-project.org}. The \texttt{graphsim} 1.0.0 package | |
| can be installed from CRAN and the issues can be reported to | |
| the development version on GitHub. | |
| This package is included in the \texttt{igraph.extensions} library on GitHub (\url{https://github.com/TomKellyGenetics/igraph.extensions}) | |
| which installs various | |
| tools for \texttt{igraph} analysis. This software is cross-platform and | |
| compatible with installations on Windows, Mac, and Linux operating | |
| systems. | |
| Updates to the package (\texttt{graphsim} 1.0.0) will be released on CRAN. | |
| Acknowledgements {#acknowledgements .unnumbered .unnumbered} | |
| =============================================================================== | |
| This package was developed as part of a PhD research project funded by | |
| the Postgraduate Tassell Scholarship in Cancer Research Scholarship | |
| awarded to STK. We thank members of the Laboratory of Professor Satoru | |
| Miyano at the University of Tokyo, Institute for Medical Science, | |
| Professor Seiya Imoto, Associate Professor Rui Yamaguchi, and Dr Paul | |
| Sheridan (Assistant Professor at Hirosaki University,CSO at Tupac Bio) | |
| for helpful discussions in this field. We also thank Professor Parry | |
| Guilford at the University of Otago, Professor Cristin Print at the | |
| University of Auckland, and Dr Erik Arner at the RIKEN Center for | |
| Integrative Medical Sciences for their excellent advice during this | |
| project. | |
| Author Contributions {#contributions .unnumbered .unnumbered} | |
| =============================================================================== | |
| S.T.K. and M.A.B. conceived of the presented methodology. S.T.K. developed the theory and performed the computations. | |
| M.A.B. provided guidance throughout the project and gave feedback on the package. All authors discussed the package and contributed to the final manuscript. | |
| # References |