-
Notifications
You must be signed in to change notification settings - Fork 1
/
GEM_GSA.Rmd
794 lines (612 loc) · 37.2 KB
/
GEM_GSA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
---
title: "GEM-based gene set analysis"
output:
html_document:
toc: true
toc_float:
collapsed: false
fig_width: 10
editor_options:
chunk_output_type: console
---
<style type="text/css">
blockquote {
background: #ECF8FF;
border-left: 10px solid #3989CB;
margin: 1.5em 10px;
padding: 0.5em 10px;
font-size: 14px;
}
h1 {
font-size: 25px;
margin-top: 1.5cm;
margin-bottom: 0.5cm;
}
h2 {
font-size: 18px;
margin-top: 1cm;
margin-bottom: 0.5cm;
}
h3 {
font-size: 14px;
margin-top: 1cm;
margin-bottom: 0.5cm;
}
table.answer, td.answer {
border: 0px;
background: #BCE0C0;
padding: 10px;
width: 100%;
}
div.answer { display: true;}
</style>
```{r setup, include=F}
knitr::opts_chunk$set(echo=T, tidy=F, eval=T, cache=F)
qn <- sn <- 0
```
Introduction
============
Gene set analysis (GSA), or alternatively gene set enrichment analysis (GSEA), is a commonly used approach for interpreting omics data, whereby genes are aggregated into groups (gene sets) on the basis of their shared biological or functional properties. The aim of GSA is to interpret biological phenomena (the "how"), and from that, generate a hypothesis for the underlying mechanism (the "why"). These hypotheses can then be validated in replication studies and other independent laboratory experiments.
Genome-scale metabolic models (GEMs) are a computational representation of the chemical reactions that comprise the metabolic network of an organism or biological system. A valuable feature of GEMs is their connectivity information, such as gene-reaction, reaction-metabolite, and reaction-pathway associations. In this exercise, we will demonstrate how this connectivity information can be used to interpret transcriptomic data in the context of metabolism. Specifically, gene-metabolite and gene-pathway connections derived from [Human-GEM](https://github.com/SysBioChalmers/Human-GEM) will be used to identify regions of the human metabolic network that are significantly associated with trancriptional changes in tumors.
For this exercise we will use publically available RNA-Seq data from [The Cancer Genome Atlas](https://portal.gdc.cancer.gov/) (TCGA) database, specifically from the [Breast Invasive Carcinoma](https://portal.gdc.cancer.gov/projects/TCGA-BRCA) (BRCA) project.
</br>
```{r, echo=FALSE, fig.cap='*Image from US National Cancer Institute GDC Portal homepage (portal.gdc.cancer.gov)*', out.width = '85%', fig.align='center'}
knitr::include_graphics('img/tcga_gdc_banner.png')
```
Setup option
=====
**In r/rmarkdown run the following code**
`r sn<-sn+1; paste0(sn,"\\. ")` If your docker installation fails, navigate to the directory where "renv.lock" file is located and run following code chunk.
```{r, eval=FALSE}
install.packages("renv")
renv::restore("renv.lock")
```
</br>
**In R Studio**
`r sn<-sn+1; paste0(sn,"\\. ")` Load the R packages that will be used in this exercise
```{r, message=FALSE, warning=FALSE, results='hide'}
# downloading and processing TCGA data
library(TCGAbiolinks)
library(SummarizedExperiment)
# differential expression anaylsis
library(edgeR)
# gene set (enrichment) analysis
library(fgsea)
library(piano)
# plotting
library(viridisLite)
library(ggplot2)
library(gridExtra)
library(pheatmap)
library(RColorBrewer)
# misc
library(tibble)
library(org.Hs.eg.db)
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Specify the location of the main tutorial directory
```{r}
main_dir = './session_gsa'
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Create a `TCGA_data/` subdirectory within `data/` (if it does not yet exist)
```{r}
tcga_dir <- file.path(main_dir, 'data', 'TCGA_data')
if (!dir.exists(tcga_dir)) { dir.create(tcga_dir) }
```
Retrieve RNA-Seq data
=====================
> **Note:** This and the **Differential expression analysis** section below are **optional**! If the GDC or Ensembl database is slow/unavailable, or you are comfortable with differential expression analysis, you can skip ahead to the **Gene set analysis** section below. The processed count matrix and differential expression analysis result files are already available in the `data/TCGA_data/` directory.
```{r, echo=F}
RunAll <- F # in case we don't want to re-run the TCGA data retrieval and DE analysis (quite slow)
```
`r sn<-sn+1; paste0(sn,"\\. ")` Query the TCGA database (hosted by GDC). Specifically, we are querying for RNA-Seq gene counts associated with the breast cancer project, and filtering to include only sample types `Primary Tumor` and `Solid Tissue Normal`.
```{r, message=F, warning=F, eval=RunAll}
query <- GDCquery(project='TCGA-BRCA', # breast cancer
data.category='Transcriptome Profiling',
data.type='Gene Expression Quantification',
experimental.strategy='RNA-Seq',
workflow.type='HTSeq - Counts',
sample.type=c('Primary Tumor', 'Solid Tissue Normal'))
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` For simplicity, identify subject IDs that are associated with exactly 1 normal and 1 tumor sample
```{r, eval=RunAll}
# Identify subjects with paired data
q <- getResults(query)
paired_subjects <- intersect(q$cases.submitter_id[q$sample_type == 'Primary Tumor'],
q$cases.submitter_id[q$sample_type == 'Solid Tissue Normal'])
# extract barcodes associated with those subjects
keep <- unlist(lapply(paired_subjects, function(x) sum(q$cases.submitter_id %in% x) == 2))
barcodes <- q$cases[q$cases.submitter_id %in% paired_subjects[keep]]
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Update our query, now filtering samples to include only those in our list of barcodes
```{r, message=F, warning=F, eval=RunAll}
query <- GDCquery(project='TCGA-BRCA',
data.category='Transcriptome Profiling',
data.type='Gene Expression Quantification',
experimental.strategy='RNA-Seq',
workflow.type='HTSeq - Counts',
sample.type=c('Primary Tumor', 'Solid Tissue Normal'),
barcode=barcodes) # specify sample barcodes
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Download count data. The data will be stored in separate folders for every sample, with very non-human-friendly names (take a look into the directory to see for yourself...).
```{r, message=F, warning=F, results='hide', eval=RunAll}
GDCdownload(query, directory=tcga_dir)
```
</br>
> **Note**: Even though we specify the directory `tcga_dir` where we want the files to be downloaded, it will still place a `MANIFEST.txt` file in your current working directory. You can just delete that file.
`r sn<-sn+1; paste0(sn,"\\. ")` Load the RNA-Seq gene counts from the downloaded files.
```{r, message=F, warning=F, results='hide', eval=RunAll}
# loaded as a "SummarizedExperiment" object
TCGAdata <- GDCprepare(query, directory=tcga_dir)
# If that fails (i.e., Ensembl unavaiable)
#TCGAdata <- readRDS(file.path(tcga_dir, 'TCGAdata.rds'))
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Further filter the data to only include subjects with the Luminal B subtype (`LumB`) of breast cancer
```{r, message=F, warning=F, results='hide', eval=RunAll}
meta <- colData(TCGAdata)
keep_subjects <- meta$patient[meta$paper_BRCA_Subtype_PAM50 %in% 'LumB']
TCGAdata <- TCGAdata[, meta$patient %in% keep_subjects]
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Extract the relevant fields from the `TCGAdata` SummarizedExperiment object
```{r, message=F, warning=F, results='hide', eval=RunAll}
counts <- assay(TCGAdata)
gene_symbols <- rowData(TCGAdata)$external_gene_name
subject_id <- as.factor(TCGAdata$patient)
# We explicitly state the ordering of the levels for sample_type as healthy -> diseased so that
# it uses this order for the differential expression analysis below.
sample_type <- factor(TCGAdata$sample_type, levels=c('Solid Tissue Normal', 'Primary Tumor'))
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Clear some intermediate variables to free some memory
```{r, eval=RunAll}
rm(TCGAdata, query, q)
invisible(gc())
```
Differential expression analysis
================================
> **Note**: The differential expression (DE) analysis here will be performed using the **edgeR** package. Since the focus of this workshop is not on DE analysis, we don't provide much detail surrounding the commands below. However, you can find a lot of examples and detail in the very helpful edgeR [User's Guide](https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf). Another excellent tool for DE analysis is **DESeq2**, which also has a fantastic [User Guide](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).
`r sn<-sn+1; paste0(sn,"\\. ")` Create DGEList object from the gene counts matrix
```{r, eval=RunAll}
y <- DGEList(counts=counts)
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Generate a design matrix such that samples are paired by subject ID
```{r, eval=RunAll}
design <- model.matrix(~subject_id + sample_type)
rownames(design) <- colnames(y)
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Filter low-count genes
```{r, eval=RunAll}
keep <- filterByExpr(y, design=design, min.count=10)
counts <- counts[keep, ]
y <- y[keep, , keep.lib.sizes=F]
gene_symbols <- gene_symbols[keep]
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Calculate normalization factors, and estimate dispersion
```{r, eval=RunAll}
y <- calcNormFactors(y)
y <- estimateDisp(y, design) # calc time is ~1 minute
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Fit a generalized linear model (GLM) to the data and test for DE genes
```{r, eval=RunAll}
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit)
topTags(qlf) # view top DE genes
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Extract results table and add FDR-adjusted p-values
```{r, eval=RunAll}
res.table <- qlf$table
res.table$FDR <- p.adjust(res.table$PValue, 'BH') # BH = Benjamini-Hochberg method
res.table <- cbind(gene_symbols, res.table) %>% rownames_to_column(var='gene')
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` View the top DE genes again, this time with the gene symbols added to the table
```{r, eval=RunAll}
res.table[order(res.table$FDR)[1:10], c('gene_symbols', 'logFC', 'logCPM', 'PValue', 'FDR')]
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Export filtered count matrix and DE results to a plain text file
```{r, eval=RunAll}
colnames(counts) <- paste(c('normal', 'tumor')[sample_type], as.numeric(subject_id), sep='.')
counts <- rownames_to_column(as.data.frame(counts), var='gene')
write.table(counts, file=file.path(tcga_dir, 'count_matrix.txt'), quote=F, sep='\t', row.names=F)
write.table(res.table, file=file.path(tcga_dir, 'DE_results.txt'), quote=F, sep='\t', row.names=F)
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Also calculate median counts per million (CPM) for each sample type, and export to a plain text file
```{r, eval=RunAll}
median_cpm <- data.frame(Normal=apply(cpm(y)[, sample_type == 'Solid Tissue Normal'], 1, median),
Tumor=apply(cpm(y)[, sample_type == 'Primary Tumor'], 1, median))
median_cpm <- median_cpm %>% rownames_to_column(var='gene')
write.table(median_cpm, file=file.path(tcga_dir, 'median_cpm.txt'), quote=F, sep='\t', row.names=F)
```
Gene set analysis
=================
There are many different methods for performing gene set analysis (GSA), but Piano is convenient in that it implements many of these variants in a single package (see more info [here](https://varemo.github.io/piano/)). In this example, we will use one of the gene set collections (GSCs) we extracted from Human-GEM to explore which areas of metabolism are enriched in significant expression changes between normal and tumor samples.
`r sn<-sn+1; paste0(sn,"\\. ")` Load DE results (if not already loaded)
```{r}
de.res <- read.delim("./data/TCGA_data/DE_results.txt", row.names=1)
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Load one of the GEM-derived gene set collections
```{r}
#gsc <- loadGSC(file.path(main_dir, 'data', 'gene_set_collections', 'HumanGEM_subsystem_GSC.gmt'))
gsc <- piano::loadGSC("./data/gene_set_collections/HumanGEM_subsystem_GSC.gmt")
# gsc <- loadGSC(file.path(main_dir, 'data', 'gene_set_collections', 'HumanGEM_metabolite_GSC.gmt'))
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` View a summary of the GSC to make sure that it was properly loaded, and to view some basic stats about the collection
```{r}
gsc
```
</br>
Overrepresentation analysis
---------------------------
We will first use Piano to perform an overrepresentation analysis - i.e., given a list of "interesting" genes, which gene sets (if any) contain more of these genes than one would expect due to random chance? This overrepresentation or enrichment effect can be quantified using the hypergeometric test.
`r sn<-sn+1; paste0(sn,"\\. ")` Generate a list of "interesting" genes. In this case, we (arbitrarily) select genes that are differentially expressed with an FDR < 0.01 and an absolute log2(fold-change) > 2.
```{r}
selected_genes = rownames(de.res)[de.res$FDR < 0.01 & abs(de.res$logFC) > 2]
length(selected_genes) # see how many genes are in our list
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Define the background or "universe" of genes from which we selected our list of genes. Here, we define the universe as all genes in the DE results and the GSC.
```{r}
all_genes <- union(rownames(de.res), unlist(gsc$gsc))
length(all_genes)
```
</br>
> **Note:** It may be better if the "universe" includes *all* genes that were detected in the RNA-Seq dataset. Since we filtered out some of the low-count genes before performing our DE analysis, our universe may be somewhat incomplete. It is good practice to run additional analyses with variations in parameters such as the "gene universe" to ensure that the results are robust to such changes.
`r sn<-sn+1; paste0(sn,"\\. ")` Run a hypergeometric test with our selected list of genes using the Piano `runGSAhyper` function
```{r}
res <- runGSAhyper(genes=selected_genes, universe=all_genes, gsc=gsc, gsSizeLim=c(10, 500))
```
</br>
> **Question `r qn<-qn+1;qn`:** The `gsSizeLim` parameter sets the limits on the number of genes allowed in a gene set. Why would we want to filter out gene sets that contain very few or very many genes?
<div class="answer"><table class="answer"><tr><td class="answer">
**Answer:**
Very small gene sets (e.g., those with fewer than 5 or 10 genes) are generally uninformative and often lack the statistical power to effectively quantify the significance of their enrichment. Very large gene sets (containing several hundreds or thousands of genes) are usually so broad that it is difficult to interpret their results, and some algorithms are slowed when having to work with large gene sets.
</td></tr></table><br></div>
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` View the top 10 most enriched gene sets and their stats
```{r}
res_table <- res$resTab[order(res$pvalues), ] # sort by p-value
knitr::kable(res_table[1:10, ])
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Visualize the results as an interactive network using the `networkPlot2` function. Gene sets are represented as nodes that are sized based on the number of genes in the set, and colored according to the enrichment p-value (log-transformed). Nodes are connected if the gene sets share a specified fraction of genes (`overlap`), with thicker edges indicating more overlap.
```{r, warning=F, fig.width=7}
networkPlot2(res,
class="non",
adjusted=F, # too few points when using FDR-adjusted p-values
significance=0.05,
overlap=0.4,
lay='2',
labelSize=16,
ncharLabel=50,
scoreColors=rev(magma(50))) # darker = more significant
```
</br>
Reporter metabolite/pathway analysis
------------------------------------
For the overrepresentation analysis, we evaluated the enrichment of a list of genes among different gene sets. However, if we want to use the information from *all* of the genes in our DE results, we can perform a GSA using one of the methods available in Piano.
`r sn<-sn+1; paste0(sn,"\\. ")` Extract p-values and log2fold-changes from DE results dataframe
```{r}
pval <- de.res$PValue %>% setNames(rownames(de.res))
logfc <- de.res$logFC %>% setNames(rownames(de.res))
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Run a gene set analysis (GSA) using the "Reporter" statistic. To understand the meaning of each parameter, view the help information for the `runGSA` function.
```{r, message=F, results='hide'}
gsaRes <- runGSA(geneLevelStats=pval,
directions=logfc,
geneSetStat='reporter',
signifMethod='geneSampling',
gsc=gsc,
gsSizeLim=c(10, 500),
nPerm=1000) # 10,000+ is better, but slower
```
```{r, echo=F}
gsaRes # print the calculation details summary
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` View the results in a tabular format
```{r}
gsaRes_table <- GSAsummaryTable(gsaRes)
```
```{r, eval=F}
View(gsaRes_table)
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Visualize the top most significant gene sets using a heatmap.
> **Note:** The `GSAheatmap` function often has a `figure margins too large` error when using the default `colorkey=TRUE`. If you get this error, **reset the plot window using the `dev.off()` command**. You can then try to resize your plot window to a larger size before re-running the `GSAheatmap` function, or turn the key off (`colorkey=FALSE`), or skip this command and continue with the custom heatmap generation in the next step.
```{r, fig.width=4, fig.height=6.5}
# include top 3 gene sets of each directional category, and print the
# gene set rank (lower = more significant) in the heatmap cells
GSAheatmap(gsaRes, cutoff=3, adjusted=T, cellnote='rank', colorkey=F,
cex=1, ncharLabel=50)
```
</br></br>
> **Note:** There are many repeated numbers in the heatmap (e.g., rank 1 is repeated very often). This is because these gene sets have identical p-values, and therefore have equal ranks. We could re-run the GSA with more permutations (`nPerm`) to obtain a higher resolution on the p-values and potentially begin to separate some of these gene sets, but it will come at the cost of increased computation time.
Another option is to generate our own custom heatmap. Although the setup is a bit more involved, it allows for more flexibility and customization.
`r sn<-sn+1; paste0(sn,"\\. ")` Extract the adjusted p-value columns from the GSA results structure, and filter out rows that do not have at least one FDR-adjusted p-value < 0.05
```{r}
table_sel <- gsaRes_table[, c('p adj (dist.dir.dn)', 'p adj (mix.dir.dn)', 'p adj (non-dir.)',
'p adj (mix.dir.up)', 'p adj (dist.dir.up)')]
table_sel[is.na(table_sel)] <- 1 # set NA p-values equal to 1
rownames(table_sel) <- gsaRes_table$Name
table_sel <- table_sel[apply(table_sel, 1, min) < 0.05, ]
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Generate the heatmap using the `pheatmap` function
```{r, fig.width=6, fig.height=6.5}
pheatmap(-log10(table_sel), # log-transform p-values
scale='none',
color=rev(magma(100)),
cluster_cols=F,
clustering_distance_rows='euclidean',
clustering_method='ward.D2',
breaks=seq(0, 2, len=100),
border_color='black',
angle_col=90)
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` We can also visualize the results as an interactive network
```{r, fig.width=7}
# define custom palette where blue = expression decrease and red = expression increase
pal <- colorRampPalette(brewer.pal(11, 'RdBu'))
networkPlot2(gsaRes,
class='distinct',
direction='both',
significance=0.05,
overlap=0.1,
physics=F, # physics=T can be slow/problematic for large networks
lay='2',
edgeWidth=c(5,20),
ncharLabel=25,
scoreColors=pal(20))
```
</br>
Yet _another_ way to view the results of a GSA (and the gene set collection itself) is through the Piano `exploreGSAresults` function, which uses rShiny to enable interactive exploration of your GSA results. This function will open up an interactive session in your internet browser. Simply close the tab/window when you're finished to continue working in your R session.
`r sn<-sn+1; paste0(sn,"\\. ")` Explore the results interactively using a local rShiny app
```{r, eval=F}
exploreGSAres(gsaRes)
```
</br>
Gene set enrichment analysis
----------------------------
One of the gene set analysis algorithms included in the Piano package is named (perhaps confusingly) gene set enrichment analysis (GSEA), and is described in [Subramanian et al. *PNAS* 2005](https://www.pnas.org/content/102/43/15545.abstract).
GSEA is one of the more commonly used gene set analysis algorithms. Although the method can be called from Piano's `runGSA` function (`geneSetStat='gsea'` or `geneSetStat='fgsea'`), we will use functions from the `fgsea` package directly because they offer some additional features.
Similar to the reporter analysis above, GSEA uses information from *all* genes in the dataset, not just a list of interesting or significant genes.
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` We will use the same DE results from text file loaded previously.
```{r}
#de.res <- read.delim(file.path(tcga_dir, 'DE_results.txt'), row.names=1)
```
</br>
Unlike the get set analysis above using `runGSA`, we cannot input the differential expression fold-changes and p-values separately. We instead need to create a combined gene score that accounts for both. In this example, we will generate a signed Z-score such that significantly decreased genes have a very negative score, significantly increased genes have a very positive score, and genes exhibiting non-significant expression changes have a score near zero.
> **Note:** This is just one approach for generating gene scores - there is no single "correct" approach, as it will depend on the question being asked and the type/behavior of data being used. Often it is good to try a few different variations to determine what impact the selected scoring method has on your results.
`r sn<-sn+1; paste0(sn,"\\. ")` Generate Z-scores from the p-values
```{r}
gene_scores <- qnorm(0.5*de.res$PValue, lower.tail=F)
gene_scores[de.res$logFC < 0] <- -gene_scores[de.res$logFC < 0]
names(gene_scores) <- rownames(de.res)
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Load gene set collection file.
> **Note:** The `fgsea` package uses a different format than Piano for the GSC object, so we need to re-load the file using the package's `gmtPathways` function.
```{r}
# Option 1: GSC based on GEM subsystems
gsc_fgsea <- fgsea::gmtPathways("./data/gene_set_collections/HumanGEM_subsystem_GSC.gmt")
# Option 2: GSC based on GEM metabolites
# gsc_fgsea <- gmtPathways("./data/gene_set_collections/HumanGEM_metabolite_GSC.gmt"))
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Run gene set enrichment analysis (GSEA)
```{r}
# fgseaRes <- fgsea(gsc_fgsea, gene_scores, nperm=10000, minSize=5, maxSize=500) # significance estimated by gene shuffling
fgseaRes <- fgseaMultilevel(gsc_fgsea, gene_scores, minSize=5, maxSize=500, eps=0) # significance estimated by MC approach
fgseaRes <- fgseaRes[order(pval), ] # sort by gene set significance
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Create an enrichment plot for the top 3 "positive" and top 3 "negative" gene sets. Gene sets with a *positive* **Enrichment Score** (ES) represent those that are enriched with genes that are *increased* in abundance in tumor samples relative to the healthy controls, whereas those with a *negative* ES are enriched with genes that have *decreased* in abundance.
```{r, fig.height=6}
# get top 3 positive and negative gene sets
gs_select <- c(head(fgseaRes$pathway[fgseaRes$NES > 0], 3),
head(fgseaRes$pathway[fgseaRes$NES < 0], 3))
# iterate through gene sets and create an enrichment plot for each
p <- list()
for (gs in gs_select) {
p[[gs]] <- plotEnrichment(gsc_fgsea[[gs]], gene_scores) + ggtitle(gs)
}
do.call(grid.arrange, c(p, list(ncol=3)))
```
</br></br>
`r sn<-sn+1; paste0(sn,"\\. ")` For contrast, create an enrichment plot for one of the non-significant gene sets
```{r, fig.width=3, fig.height=3}
ns_pathway <- tail(fgseaRes$pathway[fgseaRes$size > 50], 1) # select one with >50 genes
plotEnrichment(gsc_fgsea[[ns_pathway]], gene_scores) + ggtitle(ns_pathway)
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` View the genes comprising the **Leading Edge** of the top few gene sets. The **Leading Edge** is the subset of genes in the gene set that are primarily responsible for it being identified as "significant".
```{r, message=F}
# select the leading edges for the same (top 3 positive and negative) gene sets
leadingEdge <- fgseaRes$leadingEdge %>% setNames(fgseaRes$pathway)
edge_select <- leadingEdge[gs_select]
# convert the Ensembl IDs to gene symbols
eid2sym <- AnnotationDbi::select(org.Hs.eg.db, unique(unlist(edge_select)), 'SYMBOL', 'ENSEMBL')
edge_select <- lapply(edge_select, function(x) sort(unique(eid2sym$SYMBOL[eid2sym$ENSEMBL %in% x])))
# print leading edge genes for top 3 positive gene sets
edge_select[1:3]
# print leading edge genes for top 3 negative gene sets
edge_select[4:6]
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Create a table plot of the top 10 positive and negative gene sets
```{r, fig.height=6.5}
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(gsc_fgsea[topPathways], gene_scores, fgseaRes, gseaParam=0.5)
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` For GSCs with gene sets that are subsets of each other or are largely overlapping (very common with gene-metabolite associations), create a table using only independent gene sets
```{r}
# determine which pathways are largely a subset of another pathway
collapsedPathways <- collapsePathways(fgseaRes[order(pval)][padj < 0.05], gsc_fgsea, gene_scores)
# see if any pathways are significantly overlapping
head(collapsedPathways$parentPathways[!is.na(collapsedPathways$parentPathways)])
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` View a table plot of significant gene sets, excluding the larger "parent pathways"
```{r, fig.height=5}
mainPathways <- fgseaRes[pathway %in% collapsedPathways$mainPathways][order(-NES), pathway]
plotGseaTable(gsc_fgsea[mainPathways], gene_scores, fgseaRes, gseaParam=0.5)
```
</br>
Revisiting gene-level data
==========================
It is always good to go back to the original gene-level data when interpreting GSA results. Here we will use a heatmap to visualize gene-level data for selected gene-sets.
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Load the RNA-Seq gene count matrix
```{r}
counts <- read.table("./data/TCGA_data/count_matrix.txt", row.names=1, header=T)
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Subset the count matrix to include only genes in a specific gene set. We will choose an example gene set that appeared as significant in many of the results above: **Serotonin and melatonin biosynthesis**.
```{r}
cs <- counts[row.names(counts) %in% gsc$gsc$`Serotonin and melatonin biosynthesis`, ]
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Convert row names from Ensembl IDs to gene symbols for easier interpretation
```{r, message=F}
eid2sym <- AnnotationDbi::select(org.Hs.eg.db, row.names(cs), 'SYMBOL', 'ENSEMBL')
# if the symbol is missing (NA), just keep the Ensembl ID
eid2sym$SYMBOL[is.na(eid2sym$SYMBOL)] <- eid2sym$ENSEMBL[is.na(eid2sym$SYMBOL)]
row.names(cs) <- make.unique(eid2sym$SYMBOL[match(row.names(cs), eid2sym$ENSEMBL)])
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Prepare gene (row) and sample (column) annotation information that will be included with the heatmap
```{r}
# row annotation: highlight genes in the "leading edge" from the GSEA analysis above
LE_genes <- eid2sym$SYMBOL[match(leadingEdge$`Serotonin and melatonin biosynthesis`, eid2sym$ENSEMBL)]
rowAnn <- data.frame(LeadingEdge=ifelse(row.names(cs) %in% LE_genes, 'True', 'False'),
row.names=row.names(cs))
# also indicate which genes were significantly differentially expressed
DE_down_genes <- de.res$gene_symbols[de.res$FDR < 0.01 & de.res$logFC < 0]
DE_up_genes <- de.res$gene_symbols[de.res$FDR < 0.01 & de.res$logFC > 0]
rowAnn$Direction <- ifelse(row.names(cs) %in% DE_down_genes, 'Decrease', 'ns')
rowAnn$Direction[row.names(cs) %in% DE_up_genes] <- 'Increase'
# column annotation: color by sample type (Normal or Tumor)
colAnn <- data.frame(Sample=ifelse(grepl('normal', colnames(cs)), 'Normal', 'Tumor'))
row.names(colAnn) <- colnames(cs)
# specify annotation colors
ann_colors <- list(LeadingEdge = c(True='black', False='gray80'),
Direction = c(Decrease='#2166AC', ns='white', Increase='#B2182B'),
Sample = c(Normal='#defcc2', Tumor='#634901'))
```
</br>
Generate the heatmap, adding the annotation information as colorbars beside each axis
```{r, fig.height=9}
# generate heatmap
pheatmap(log2(cs+1),
color=rev(magma(100)),
breaks=seq(-2, 2, len=100),
border_color='black',
scale='row',
annotation_row=rowAnn,
annotation_col=colAnn,
annotation_colors=ann_colors,
labels_col=rep('',ncol(cs))) # omit column labels
```
</br>
> **Note:** We scaled the rows (genes) so the colors will reflect the difference for each gene individually, which means that the values are **not** comparable across genes.
Do the gene-level data shown in the heatmap seem reasonable given that this gene set was identified as significantly enriched in genes that decreased in expression? And regarding the subset of genes annotated as part of the "Leading Edge" identified by GSEA - does it make sense that these genes are more "responsible" for the gene set being identified as significant than some of the other genes?
Exploration using Metabolic Atlas
=================================
When analyzing data in the context of metabolism, it can be helpful to visualize the genes and reaction pathways and how they are connected/distributed. **[Metabolic Atlas](https://www.metabolicatlas.org/)** is an online resource that faciliates the exploration of GEMs by providing metabolic network contents in tabular, map, or network format.
To demonstrate some of its functions, we will use Metabolic Atlas to explore some of the results that we found interesting/significant in the above analyses.
Data overlay on 2D maps
-----------------------
Metabolic Atlas contains manually drawn 2D maps of metabolic compartments and subsystems (pathways) corresponding to those in the Human-GEM model. The maps allow users to overlay pre-loaded mRNA expression data from the [Human Protein Atlas](https://www.proteinatlas.org/), as well as user-provided data.
To demonstrate the data overlay function, we will use the median gene CPM values generated during the initial retrieval of the TCGA data (`median_cpm.txt`). Note that more detailed information on using the data overlay function can be found in the Metabolic Atlas [Documentation](https://www.metabolicatlas.org/documentation#Data-overlay).
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Navigate to the [Map Viewer for Human-GEM](https://www.metabolicatlas.org/explore/map-viewer/human1) on Metabolic Atlas.
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` From the **Subsystems** menu on the left panel, choose one of subsystems that we found as significant in our analysis above - `Serotonin and melatonin biosynthesis`.
</br>
```{r, echo=FALSE, fig.align='center', out.width='50%'}
knitr::include_graphics('img/MA_subystem_menu.png')
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Click the **DATA OVERLAY** bar on the right side of the window to expand the overlay menu.
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Near the top of the overlay menu, select **Choose a file**, and use the file browser to navigate to the `session_gems/data/TCGA_data/` subdirectory and select the `median_cpm.txt` file that should be there.
</br>
You should now see the file name `median_cpm.txt` in green below the button, indicating that the file was successfully loaded and in the correct format.
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` In the **Data 1** and **Data 2 (for comparison)** boxes, select `Normal` and `Tumor`, respectively, from the **uploaded data** drop-down menus (the second menu in each box). The sidebar should now look as follows:
</br>
```{r, echo=FALSE, fig.align='center', out.width='30%'}
knitr::include_graphics('img/MA_overlay_sidebar.png')
```
</br>
The log fold-change in median CPM should now be overlaid onto the gene labels of the map as a color, ranging from blue (decrease) to red (increase). It is usually easier to view the colors if the subsystem highlighting is removed, by toggling the **Show/Hide subsystem** button:
```{r, echo=FALSE, fig.align='center'}
knitr::include_graphics('img/MA_button_hide_subsys.png')
```
</br>
To make the map view larger, you can collapse the **DATA OVERLAY** side panel by clicking its green bar, or use the **full screen** button:
```{r, echo=FALSE, fig.align='center'}
knitr::include_graphics('img/MA_button_fullscreen.png')
```
</br>
The map should look something like this:
</br>
```{r, echo=FALSE, fig.align='center'}
knitr::include_graphics('img/serotonin_pathway.svg')
```
</br>
Overall, the pathway shows many blue gene labels, representing an expression decrease in tumor relative to normal samples, which is consistent with the GSA results. There are also a few genes with increased expression, which was also visible in the gene-level heatmap. The advantage of the pathway representation is that we can now easily see which parts of the pathway appear to be coordinated expression changes, and which are more "mixed" (increases and decreases).
Furthermore, selecting a gene, metabolite, or reaction on the map will highlight all appearances of that component on the map, as well as provide a brief summary on the left side bar. Following the **GEM Browser** link below the short summary will take the user to the more detailed page for the selected component.
Interaction partners
--------------------
As discussed above, a key benefit of a GEM is the interaction information that it contains. The metabolite-gene interactions were, for example, used to extract a metabolite-based gene set collection from the Human-GEM. This same type of information can be viewed on Metabolic Atlas using the **Interaction Partners** tool.
To demonstrate, we will view for example one of the metabolites that is identified as significantly associated with gene expression increases when using the metabolite-gene GSC (`HumanGEM_metabolite_GSC.gmt`): 4-hydroxyproline (named `trans-4-hydroxy-L-proline` in the model/GSC)
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Navigate to the [**Interaction Partners**](https://www.metabolicatlas.org/explore/interaction/human1) tool on Metabolic Atlas, and enter `trans-4-hydroxy-L-proline` in the search bar. Select the top entry in the drop-down menu that appears below the search bar (the `Cytosol` form of the metabolite)
> **Note:** As in Human-GEM, Metabolic Atlas contains metabolites that are repeated in different compartments; e.g., H2O[c] and H2O[m] represents water in the cytoplasm and in the mitochondria, respectively. Although identical in composition, these metabolites are assigned different IDs in the model and different pages/entries in Metabolic Atlas to maintain the spatial organization of metabolism among different organelles.
The interaction partners graph for 4-hydroxyproline will open, showing the genes and other metabolites with which the metabolite is associated based on shared reactions. From here, one can right-click on a gene or other metabolite to expand or re-center on its interaction partners.
</br>
```{r, echo=FALSE, fig.align='center', out.width='85%'}
knitr::include_graphics('img/proline_interaction_partners.png')
```
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` To view 4-hydroxyproline in a pathway context, navigate to the **GEM Browser** for the metabolite by selecting the `trans-4-hydroxy-L-proline` node at the center of the graph, and clicking the **GEM Browser** link to the right of the graph window.
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` Once in the `trans-4-hydroxy-L-proline` GEM Browser, select one of the 2D pathway map links shown on the **Map Viewer** right-side panel: `Cytosol (part5)` or `Arginine and proline metabolism`.
</br>
`r sn<-sn+1; paste0(sn,"\\. ")` If our uploaded data (from `medium_cpm.txt`) is no longer loaded, simply re-load the file using the **DATA OVERLAY** side panel and again select `Normal` and `Tumor` for **Data 1** and **Data 2**.
</br>
We can now view the trans-4-hydroxy-L-proline metabolite within its reaction pathway, where it is clear that many of the genes encoding the surrounding reactions exhibit increased expression in tumor relative to normal tissue.
</br>
```{r, echo=FALSE, fig.align='center', out.width='60%'}
knitr::include_graphics('img/arginine_and_proline_metabolism.svg')
```
</br>
If viewing the subsystem map for `Arginine and Proline Metabolism`, you may notice that many of the other genes involved in this subsystem exhibit a decreased expression in tumor samples. This highlights the difference in information that can be extracted from data when using metabolite-gene compared to subsystem-gene associations. Specifically, that although the Arginine and Proline Metabolism subsystem was not identified as significantly enriched in expression changes when using the subsystem GSC, applying the metabolite GSC revealed a significantly upregulated subnetwork of this pathway surrounding the 4-hydroxyproline metabolite.
Session info
============
This page was generated using the following R session:
```{r, echo=F}
sessionInfo()
```