# `pathview` Tutorial

(work in progress)

### Intro

* This is the `R` version of the `pathview` webapp. 
* `pathview` colours nodes on KEGG network diagrams, based on input gene and/or compound data.

### Review
* **Pros**: `pathview` grants access to sanitized `KEGG` pathways. That's a *very* big pro.
* **Cons**: More than the pros, unfortunately:
    * I'm *still* not sure whether the input should be logFC or abundance values. Playing around with very simple 2-class examples suggests that the choice doesn't make a difference.
    * `GAGE` (Generally Applicable Gene-set Enrichment, the pathway enrichment method behind the automatic pathway selection) has questionable efficacy. It appears to need many features: a test dataset with ~70 features yielded output pathways, but no output `.tsv` table with the associated q-values and test statistics.
* Suggested usage: use the **joint pathway (enrichment) analysis** module on `MetaboAnalyst` to retrieve perturbed pathways, then visualize these with `pathview`. 

In [2]:
# Load library and example datasets
library("pacman")

pacman::p_load("pathview", "gage", "tidyverse")
data(gse16873.d)
# Load human pathways data
data(paths.hsa)
# load demo pathway-related data, including 3 pathway ids and related plotting params
# this is a named list (R's closest equivalent to a dictionary)
data(demo.paths)

## Start

* Visualize input data onto a selected pathway, in this case `hsa04110` ("*Cell Cycle*").
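* Data are a numeric matrix with genes in rows (Entrez IDs as row names) and six DCIS samples in columns.
* It's not clear exactly what the input values are. They are centred near zero, and the `GAGE` section below builds `gse16873.d` as DCIS minus HN differences, so they look like per-pair log-ratios rather than raw abundances.
* Writes out a bunch of `.png` and `.xml` files to the working directory.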
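
As a rough sketch of the input shape: `gene.data` can be a named numeric vector (one sample) or a numeric matrix whose row names are gene IDs (several samples). The toy matrix below is a made-up stand-in for `gse16873.d`; its values and the name `toy.d` are illustrative assumptions, not data from the package.

```r
# Toy stand-in for gse16873.d: rows are Entrez gene IDs, columns are samples.
# Values are illustrative only.
toy.d <- matrix(
  c(-0.31,  0.42,  0.20,   # column 1 (DCIS_1)
    -0.15, -0.33,  0.04),  # column 2 (DCIS_2)
  ncol = 2,
  dimnames = list(c("10000", "10001", "10002"), c("DCIS_1", "DCIS_2"))
)

# Single-sample input: taking one column keeps the gene IDs as names
one.sample <- toy.d[, 1]
print(one.sample)

# Quick sanity checks before handing data to pathview()
stopifnot(is.numeric(toy.d), !is.null(rownames(toy.d)))
```

Checking that the row names survive subsetting matters because the IDs are the only link between the data and the pathway nodes.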

In [8]:
head(gse16873.d)

| Entrez ID | DCIS_1 | DCIS_2 | DCIS_3 | DCIS_4 | DCIS_5 | DCIS_6 |
|---|---|---|---|---|---|---|
| 10000 | -0.3076448 | -0.14722769 | -0.023784808 | -0.07056193 | -0.001323087 | -0.15026813 |
| 10001 | 0.41586805 | -0.33477259 | -0.513136907 | -0.16653712 | 0.111122223 | 0.13400734 |
| 10002 | 0.19854925 | 0.03789588 | 0.341865341 | -0.0852742 | 0.767559264 | 0.15828609 |
| 10003 | -0.23155297 | -0.09659311 | -0.104727283 | -0.04801404 | -0.208056443 | 0.03344448 |
| 100048912 | -0.04490724 | -0.05203146 | 0.036390376 | 0.04807823 | 0.027205816 | 0.05444739 |
| 10004 | -0.08756237 | -0.05027725 | 0.001821133 | 0.03023835 | 0.008034394 | -0.06860749 |


In [4]:
# Generate viz for only 1 column, gse16873.d[, 1]
# Generate a single image file
pv.out <- pathview(gene.data = gse16873.d[, 1], 
                   pathway.id = "04110",
                   species = "hsa", 
                   out.suffix = "gse16873")

Info: Downloading xml files for hsa04110, 1/1 pathways..
Info: Downloading png files for hsa04110, 1/1 pathways..
'select()' returned 1:1 mapping between keys and columns
Info: Working in directory /Users/don/Documents/my_vignettes
Info: Writing image file hsa04110.gse16873.png


In [9]:
i <- 1
pv.out <- pathview(gene.data = gse16873.d[, 1], 
                   pathway.id = demo.paths$sel.paths[i],
                   species = "hsa", 
                   out.suffix = "gse16873",
                   kegg.native = T)
list.files(pattern="hsa04110", full.names=T)

'select()' returned 1:1 mapping between keys and columns
Info: Working in directory /Users/don/Documents/my_vignettes
Info: Writing image file hsa04110.gse16873.png
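
Judging from the log lines above, the native KEGG image is named `<species><pathway.id>.<out.suffix>.png`. The sketch below rebuilds that name so the file can be checked for after a run. `expected_png()` is a hypothetical helper written for this tutorial, not a `pathview` function, and the naming rule is inferred from this one log message only.

```r
# Hypothetical helper (not part of pathview): rebuild the image file name
# seen in the log output, e.g. "hsa04110.gse16873.png".
expected_png <- function(species, pathway.id, out.suffix) {
  paste0(species, pathway.id, ".", out.suffix, ".png")
}

png.file <- expected_png("hsa", "04110", "gse16873")
print(png.file)

# TRUE only if pathview() has already written the file to the working directory
file.exists(png.file)
```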


In [10]:
str(pv.out)

List of 2
 $ plot.data.gene:'data.frame':	92 obs. of  10 variables:
  ..$ kegg.names: chr [1:92] "1029" "51343" "4171" "4998" ...
  ..$ labels    : chr [1:92] "CDKN2A" "FZR1" "MCM2" "ORC1" ...
  ..$ all.mapped: chr [1:92] "1029" "51343" "4171,4172,4173,4174,4175,4176" "4998,4999,5000,5001,23594,23595" ...
  ..$ type      : chr [1:92] "gene" "gene" "gene" "gene" ...
  ..$ x         : num [1:92] 532 919 553 494 919 919 188 432 123 77 ...
  ..$ y         : num [1:92] 124 536 556 556 297 519 519 191 704 687 ...
  ..$ width     : num [1:92] 46 46 46 46 46 46 46 46 46 46 ...
  ..$ height    : num [1:92] 17 17 17 17 17 17 17 17 17 17 ...
  ..$ mol.data  : num [1:92] 0.129 -0.404 -0.42 0.986 1.181 ...
  ..$ mol.col   : Factor w/ 10 levels "#00FF00","#30EF30",..: 5 3 3 9 9 9 9 9 5 6 ...
 $ plot.data.cpd : NULL


In [11]:
head(pv.out$plot.data.gene)

| | kegg.names | labels | all.mapped | type | x | y | width | height | mol.data | mol.col |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 4 | 1029 | CDKN2A | 1029 | gene | 532 | 124 | 46 | 17 | 0.1291987 | #BEBEBE |
| 5 | 51343 | FZR1 | 51343 | gene | 919 | 536 | 46 | 17 | -0.4043256 | #5FDF5F |
| 6 | 4171 | MCM2 | 4171,4172,4173,4174,4175,4176 | gene | 553 | 556 | 46 | 17 | -0.4202181 | #5FDF5F |
| 7 | 4998 | ORC1 | 4998,4999,5000,5001,23594,23595 | gene | 494 | 556 | 46 | 17 | 0.9864873 | #FF0000 |
| 8 | 996 | CDC27 | 996,8697,8881,10393,25847,29882,51433 | gene | 919 | 297 | 46 | 17 | 1.1811525 | #FF0000 |
| 9 | 996 | CDC27 | 996,8697,8881,10393,25847,29882,51433 | gene | 919 | 519 | 46 | 17 | 1.1811525 | #FF0000 |
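
The `mol.col` column suggests that `mol.data` values are clipped to a fixed range and then binned into a small number of colours: ORC1 (0.99) and CDC27 (1.18) share `#FF0000`, and `str(pv.out)` reports a factor with only 10 levels. The sketch below imitates that behaviour in base R. It is an approximation and not `pathview`'s internal code. It assumes a limit of 1 and 10 bins, and `bin_values()` and `palette10` are names invented here.

```r
# Approximation only: NOT pathview's internal code.
# Assumes a gene limit of 1 and 10 colour bins.
bin_values <- function(x, limit = 1, bins = 10) {
  clipped <- pmax(pmin(x, limit), -limit)  # squash values outside [-limit, limit]
  breaks <- seq(-limit, limit, length.out = bins + 1)
  cut(clipped, breaks = breaks, labels = FALSE, include.lowest = TRUE)
}

# mol.data values taken from the table above
mol.data <- c(CDKN2A = 0.1291987, FZR1 = -0.4043256,
              ORC1 = 0.9864873, CDC27 = 1.1811525)
bin.idx <- bin_values(mol.data)

# Green-gray-red ramp; the exact hex codes may differ from pathview's palette
palette10 <- colorRampPalette(c("green", "gray", "red"))(10)
data.frame(value = mol.data, bin = bin.idx, colour = palette10[bin.idx])
```

Under these assumptions ORC1 and CDC27 land in the same top bin, which matches the shared colour in the table.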


In [None]:
pv.out <- pathview(gene.data = gse16873.d[, 1], 
                   pathway.id = demo.paths$sel.paths[i],
                   species = "hsa", 
                   out.suffix = "gse16873.2layer", 
                   kegg.native = T,
                   same.layer = F)

## Integrating Cpd and Gene Data

### Compound and gene data

* Visualize gene and compound data jointly onto the output plots.
* Gene input is the same `gse16873.d` matrix as before; compound input is simulated with `sim.mol.data()`.

In [14]:
# simulate cpd data
sim.cpd.data = sim.mol.data(mol.type="cpd", nmol=3000)
data(cpd.simtypes)

In [13]:
# specify which pathway to retrieve
i <- 3
print(demo.paths$sel.paths[i])

[1] "00640"


In [None]:
pv.out <- suppressWarnings(pathview(gene.data = gse16873.d[, 1], 
                   cpd.data = sim.cpd.data,
                   pathway.id = demo.paths$sel.paths[i], 
                   species = "hsa", 
                   out.suffix = "gse16873.cpd",
                   keys.align = "y", 
                   kegg.native = T, 
                   key.pos = demo.paths$kpos1[i]))

In [None]:
head(pv.out$plot.data.cpd)

In [None]:
pv.out <- suppressWarnings(pathview(gene.data = gse16873.d[, 1], 
                   cpd.data = sim.cpd.data, 
                   pathway.id = demo.paths$sel.paths[i], 
                   species = "hsa", 
                   out.suffix = "gse16873.cpd",
                   keys.align = "y", 
                   kegg.native = F, 
                   key.pos = demo.paths$kpos2[i],
                   sign.pos = demo.paths$spos[i], 
                   cpd.lab.offset = demo.paths$offs[i]))

### Multiple states or samples

In [None]:
# simulate compound data with multiple replicate samples
set.seed(10)
sim.cpd.data2 = matrix(sample(sim.cpd.data, 18000,
                              replace = T), ncol = 6)
rownames(sim.cpd.data2) = names(sim.cpd.data)
colnames(sim.cpd.data2) = paste("exp", 1:6, sep = "")
head(sim.cpd.data2, 3)

In [None]:
# KEGG view
pv.out <- suppressWarnings(pathview(gene.data = gse16873.d[, 1:3],
                                    cpd.data = sim.cpd.data2[, 1:2], 
                                    pathway.id = demo.paths$sel.paths[i],
                                    species = "hsa", 
                                    out.suffix = "gse16873.cpd.3-2s", 
                                    keys.align = "y",
                                    kegg.native = T, 
                                    match.data = F, 
                                    multi.state = T, 
                                    same.layer = T))

In [None]:
# KEGG view with data match
pv.out <- suppressWarnings(pathview(gene.data = gse16873.d[, 1:3],
                                    cpd.data = sim.cpd.data2[, 1:2], 
                                    pathway.id = demo.paths$sel.paths[i],
                                    species = "hsa", 
                                    out.suffix = "gse16873.cpd.3-2s.match",
                                    keys.align = "y", 
                                    kegg.native = T, 
                                    match.data = T, 
                                    multi.state = T,
                                    same.layer = T))

In [None]:
# graphviz view
pv.out <- suppressWarnings(pathview(gene.data = gse16873.d[, 1:3],
                                    cpd.data = sim.cpd.data2[, 1:2], 
                                    pathway.id = demo.paths$sel.paths[i],
                                    species = "hsa", 
                                    out.suffix = "gse16873.cpd.3-2s", 
                                    keys.align = "y",
                                    kegg.native = F, 
                                    match.data = F, 
                                    multi.state = T, 
                                    same.layer = T,
                                    key.pos = demo.paths$kpos2[i], 
                                    sign.pos = demo.paths$spos[i]))

In [None]:
# plot samples/states separately
# Doesn't seem to print out images well
pv.out <- suppressWarnings(pathview(gene.data = gse16873.d[, 1:3],
                                    cpd.data = sim.cpd.data2[, 1:2], 
                                    pathway.id = demo.paths$sel.paths[i],
                                    species = "hsa", 
                                    out.suffix = "gse16873.cpd.3-2s", 
                                    keys.align = "y",
                                    kegg.native = T, 
                                    match.data = F, 
                                    multi.state = F, 
                                    same.layer = T))

In [None]:
# KEGG layer with 2 views. Loses the original KEGG gene labels (or EC numbers)
pv.out <- suppressWarnings(pathview(gene.data = gse16873.d[, 1:3],
                                    cpd.data = sim.cpd.data2[, 1:2], 
                                    pathway.id = demo.paths$sel.paths[i],
                                    species = "hsa", 
                                    out.suffix = "gse16873.cpd.3-2s.2layer",
                                    keys.align = "y", 
                                    kegg.native = T, 
                                    match.data = F, 
                                    multi.state = T,
                                    same.layer = F))

## Feat. `GAGE`

In [None]:
# Load some datasets
data(gse16873)
hn <- grep('HN', colnames(gse16873), ignore.case = TRUE) # indices of HN samples in colnames
dcis <- grep('DCIS', colnames(gse16873), ignore.case = TRUE) # indices of DCIS samples in colnames
data(kegg.gs)
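
To make the `grep()` index logic concrete, here is a minimal sketch on a toy matrix. The column names imitate an assumed interleaved HN/DCIS layout, and all values and the `toy.*` names are made up. It also shows the paired subtraction (sample minus reference) that this section relies on. If the values are on a log scale, these differences are log-ratios.

```r
# Toy expression matrix; column names mimic an HN/DCIS pairing (values made up)
toy.expr <- matrix(
  c(7.0, 5.0,   # HN_1
    7.5, 4.0,   # DCIS_1
    6.0, 5.5,   # HN_2
    6.8, 5.1),  # DCIS_2
  nrow = 2,
  dimnames = list(c("10000", "10001"),
                  c("HN_1", "DCIS_1", "HN_2", "DCIS_2"))
)

hn   <- grep("HN",   colnames(toy.expr), ignore.case = TRUE)  # reference columns
dcis <- grep("DCIS", colnames(toy.expr), ignore.case = TRUE)  # sample columns

# Paired differences: each DCIS column minus its matching HN column
toy.d <- toy.expr[, dcis] - toy.expr[, hn]
print(toy.d)
```

The subtraction only pairs samples correctly if `hn` and `dcis` list the columns in the same patient order, so it is worth printing both index vectors before trusting the result.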

In [None]:
# pw analysis with gage, gene data only
gse16873.kegg.p <- gage(gse16873, 
                        gsets = kegg.gs, 
                        ref = hn, 
                        samp = dcis)

In [None]:
#prepare the differential expression data
gse16873.d <- gagePrep(gse16873, ref = hn, samp = dcis)

#equivalently, you can do simple subtraction for paired samples
gse16873.d <- gse16873[,dcis]-gse16873[,hn]

#select significant pathways and extract their IDs
sel <- gse16873.kegg.p$greater[, "q.val"] < 0.1 & !is.na(gse16873.kegg.p$greater[,"q.val"])

path.ids <- rownames(gse16873.kegg.p$greater)[sel]
path.ids2 <- substr(path.ids[c(1, 2, 7)], 1, 8) # Grab paths with indices 1, 2 and 7
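
The selection step can be tried without running `gage()` at all. The sketch below uses a made-up results matrix in place of `gse16873.kegg.p$greater`. The q-values are invented and the last row is a placeholder pathway that does not exist in KEGG. It shows why the `!is.na()` guard is needed and how `substr()` isolates the pathway ID, assuming row names of the form `"hsaNNNNN Pathway name"`.

```r
# Toy stand-in for gse16873.kegg.p$greater (all values made up)
toy.greater <- matrix(
  c(0.001, 0.01,
    0.020, 0.08,
    0.400, 0.60,
    NA,    NA),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("hsa04110 Cell cycle", "hsa03030 DNA replication",
      "hsa00640 Propanoate metabolism", "hsa00000 Placeholder"),
    c("p.val", "q.val")
  )
)

# Keep rows with q < 0.1; the !is.na() guard turns NA comparisons into FALSE
sel <- toy.greater[, "q.val"] < 0.1 & !is.na(toy.greater[, "q.val"])
path.ids <- rownames(toy.greater)[sel]

# The first 8 characters hold the KEGG ID ("hsa" + 5 digits)
substr(path.ids, 1, 8)
```

Without the guard, `NA < 0.1` evaluates to `NA`, and indexing row names with an `NA` logical yields `NA` entries that would then be passed on as pathway IDs.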

In [None]:
#pathview visualization
pv.out.list <- sapply(path.ids2, function(pid) pathview(gene.data = gse16873.d[,1:2], 
                                                        pathway.id = pid, 
                                                        species = "hsa"))

In [None]:
x <- as_tibble(gse16873.kegg.p$greater, rownames = "pw_name") %>% drop_na() %>% filter(q.val<0.1)
