<div style="max-width:1200px"><img src="assets/mgnify_banner.png" width="100%"></div>

<img src="assets/mgnify_logo.png" width="200px">

# Pathways Visualization

In this notebook we demonstrate how the MGnifyR tool can be used to fetch functional annotation results generated by the MGnify metagenomic analysis. We then show how to generate pathway visualizations using [Pathview](https://bioconductor.org/packages/release/bioc/html/pathview.html) in R.

[MGnifyR](http://github.com/beadyallen/mgnifyr) is a library that provides a set of tools for easily accessing and processing MGnify data in R, making queries to MGnify databases through the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/v1/). 
The benefit of MGnifyR is that data can either be fetched in TSV format or be combined directly into a phyloseq object to run an analysis in a custom workflow.

This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.


## Contents

- [Part 1. Drawing from one metagenomic analysis](#part1)
    - [1.1. Fetching data from MGnify](#part1_1)
    - [1.2. Selecting the most complete pathways](#part1_2)
    - [1.3. Ready to draw!](#part1_3)
- [Part 2. Comparing groups of samples](#part2)
    - [2.1. Fetching data from MGnify](#part2_1)
    - [2.2. Generating differential abundance count tables](#part2_2)
    - [2.3. Plotting pathways](#part2_3)
- [Part 3. Plotting KOs and metabolites for one sample](#part3)
    - [3.1. Extracting KOs from input tables](#part3_1)
    - [3.2. Loading and formatting metabolites data](#part3_2)
    - [3.3. Drawing!](#part3_3)
- [References](#refs)

In [None]:
# Loading libraries:
suppressMessages({
    library(pathview)
    library(MGnifyR)
    library(IRdisplay)
    library(data.table)
    library(dplyr)
    library(tidyjson)
    library(KEGGREST)
    library(ALDEx2)
})
    
#display_markdown(file = 'assets/mgnifyr_help.md')

In [None]:
# Setting tables and figures size to display (these will be reset later):
options(repr.matrix.max.cols=150, repr.matrix.max.rows=500)
options(repr.plot.width=4, repr.plot.height=4)

In [None]:
# Setting up functions
collect_pathways <- function(ids_list) {
    pathways = list()
    for (id in ids_list) { 
        current_pathway = as.list(keggLink("pathway", id))
        for (index in grep("map", current_pathway)) {
            # Strip the leading "path:" prefix from the KEGG link
            clean_id = sub("^path:", "", current_pathway[[index]])
            # Discarding chemical structure (map010XX), global (map011XX), and overview (map012XX) maps
            prefix = substring(clean_id, 1, 6)
            if (!(prefix %in% c("map010", "map011", "map012"))) {
                pathways = append(pathways, clean_id)
            }
        }
    }
    return(pathways)
}

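As a rough illustration of the filter inside `collect_pathways`, the sketch below applies the same prefix rule to a hand-written vector instead of a live `keggLink` response. The `example_links` values are illustrative placeholders, not data fetched from KEGG.

```r
# Hypothetical keggLink()-style values of the form "path:<id>".
# These IDs are illustrative and were not fetched from KEGG.
example_links <- c("path:map00010", "path:map01100", "path:map01200",
                   "path:map00620", "path:map01057")

# Strip the "path:" prefix, then drop maps whose ID starts with an excluded prefix.
example_ids <- sub("^path:", "", example_links)
excluded_prefixes <- c("map010", "map011", "map012")
example_kept <- example_ids[!(substring(example_ids, 1, 6) %in% excluded_prefixes)]
print(example_kept)
```

Only the IDs outside the three excluded prefix groups remain, which is the set the function would return for this input.
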
## Part 1. Drawing from one metagenomic analysis <a id='part1'/>

For Parts 1 and 2 of this notebook, we will use MGnify results generated for two studies: 
1. Metagenomes of bacteria colonizing the gut of Apis mellifera and Apis cerana from Japan ([MGYS00006180](https://www.ebi.ac.uk/metagenomics/studies/MGYS00006180#overview)).
2. Gut microbiota of Switzerland honeybees ([MGYS00006178](https://www.ebi.ac.uk/metagenomics/studies/MGYS00006178#overview)). 

The original analysis, which focused on viral communities, is described in the associated [publication](https://www.pnas.org/doi/full/10.1073/pnas.2000228117).

### 1.1. Fetching data from MGnify <a id='part1_1'/>

1. Setting up the client object and retrieving the analysis accession list.

In [None]:
# Create the MGnify client object for this session (API responses are cached on disk)
mg = mgnify_client(usecache = TRUE, cache_dir = 'assets/.mgnify_cache')

In [None]:
all_accessions = mgnify_analyses_from_studies(mg, c('MGYS00006180','MGYS00006178'))
metadata = mgnify_get_analyses_metadata(mg, all_accessions)

2. Use the first accession to fetch the KEGG ortholog (KO) count table for one analysis from the MGnify API, then convert it from JSON to a matrix.

In [None]:
accession = head(all_accessions, 1)
ko_loc = paste0('analyses/',accession,'/kegg-orthologs')

In [None]:
ko_json = mgnify_retrieve_json(mg, path = ko_loc)
ko_data = as.data.frame(ko_json %>% spread_all)[ , c("attributes.accession", "attributes.count")]
ko_data = data.frame(ko_data, row.names=1)
colnames(ko_data)[1] = 'counts'
ko_matrix = data.matrix(ko_data)

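The reshaping step can be tried offline on a small stand-in table. In this sketch the accessions and counts in `example_ko` are made up; the point is only the resulting shape: a one-column numeric matrix whose row names are KO IDs.

```r
# Hypothetical KO accessions and counts standing in for the API response.
example_ko <- data.frame(
    accession = c("K00001", "K00002", "K00003"),
    count = c(12, 0, 5),
    stringsAsFactors = FALSE
)

# Use the first column as row names, as in the cell above.
example_ko <- data.frame(example_ko, row.names = 1)
colnames(example_ko)[1] <- "counts"

# data.matrix() converts the data frame into a numeric matrix.
example_matrix <- data.matrix(example_ko)
print(example_matrix)
```
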
3. Fetch the module completeness table and keep only the modules that are 100% complete.

In [None]:
comp_loc = paste0('analyses/',accession,'/kegg-modules')
ko_comp_json = mgnify_retrieve_json(mg, path = comp_loc)
ko_comp = as.data.frame(ko_comp_json %>% spread_all)
modules = ko_comp[ko_comp$attributes.completeness == 100,][, c("attributes.accession")]
#modules

4. Collecting the pathways for each module using [KEGGREST](https://www.bioconductor.org/packages/devel/bioc/vignettes/KEGGREST/inst/doc/KEGGREST-vignette.html). This step takes about 2 minutes to run.

In [None]:
pathways = collect_pathways(modules)

5. Counting the number of modules we have per pathway.

In [None]:
our_pathways_counts = list()
for (path_element in pathways) {
    if (path_element %in% names(our_pathways_counts)) {
        new_value = our_pathways_counts[[path_element]] + 1
        our_pathways_counts[path_element] = new_value       
    } else {
        our_pathways_counts[path_element] = 1 
    }
}

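The tally built by the loop above can also be expressed with base R's `table()`. The sketch below is one possible alternative, not a replacement for the cell, and `example_pathways` is a hypothetical input.

```r
# Hypothetical pathway IDs, one entry per (module, pathway) link.
example_pathways <- list("map00010", "map00620", "map00010", "map00020", "map00010")

# table() tallies how many times each pathway ID occurs.
example_counts <- table(unlist(example_pathways))
print(example_counts)

# Individual counts can be looked up by pathway ID.
print(example_counts[["map00010"]])
```
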
6. Counting the number of modules expected in each pathway.

In [None]:
u_pathways = unique(pathways)
exp_pathways_counts = list()
for (path in u_pathways) {
    mod_count = length(as.list(keggLink("module", path)))
    exp_pathways_counts[path] = mod_count 
}
#length(our_pathways_counts)
#length(exp_pathways_counts)

### 1.2. Selecting the most complete pathways <a id='part1_2'/>

In [None]:
to_draw = list()
for (pathway in names(our_pathways_counts)) {
    our_value = our_pathways_counts[[pathway]]
    exp_value = exp_pathways_counts[[pathway]]
    ratio =  our_value / exp_value
    if (ratio == 1) {
        nude_id =  gsub("map", "", pathway)
        to_draw = append(to_draw, nude_id)   
    }
}
#to_draw

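To make the selection rule concrete, here is a minimal sketch with made-up counts: a pathway is kept only when the number of complete modules observed equals the number of modules KEGG lists for it. The names `observed` and `expected` are hypothetical stand-ins for `our_pathways_counts` and `exp_pathways_counts`.

```r
# Hypothetical counts: complete modules observed vs. modules listed per pathway.
observed <- c(map00010 = 3, map00620 = 1, map00020 = 2)
expected <- c(map00010 = 3, map00620 = 4, map00020 = 2)

# Align by name, then keep pathways where every expected module is complete.
ratio <- observed / expected[names(observed)]
complete <- names(ratio)[ratio == 1]

# The notebook passes the 5-digit ID to pathview, so drop the "map" prefix.
example_to_draw <- sub("^map", "", complete)
print(example_to_draw)
```

Relaxing the condition (for example `ratio >= 0.5`) would select partially covered pathways as well; that threshold is an illustration, not a recommendation from the source.
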
### 1.3. Ready to draw! <a id='part1_3'/>

In [None]:
for (p in to_draw) {
    pathview(gene.data = ko_matrix, 
             species = "ko", 
             pathway.id = p, 
             bins=c(2, 2), 
             both.dirs = FALSE, 
             limit = c(1,1), 
             mid = c("#ffffff" , "#ffffff"), 
             high = c("#7206b5" , "#7206b5")
    )
}

1. Cleaning the working directory.

In [None]:
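# Create the output directory (and any missing parent directories) if needed
dir.create("output_plots/single_sample", recursive = TRUE, showWarnings = FALSE)

# list.files() takes a regular expression, so escape the dot and anchor the end
file.copy(from=list.files(pattern="pathview\\.png$"), to="./output_plots/single_sample/", overwrite = TRUE)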

In [None]:
png_files = list.files(path = ".", pattern = "\\.png$")
xml_files = list.files(path = ".", pattern = "\\.xml$")
files = c(png_files, xml_files)
unlink(files)

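The copy-then-clean routine can be rehearsed safely in a temporary directory. The sketch below uses hypothetical file names of the kind pathview writes and relies only on base R; note that `list.files()` interprets `pattern` as a regular expression rather than a shell glob.

```r
# Work inside a temporary directory so no real files are touched.
work_dir <- file.path(tempdir(), "pathview_demo")
dir.create(file.path(work_dir, "output_plots", "single_sample"),
           recursive = TRUE, showWarnings = FALSE)

# Hypothetical files of the kind left in the working directory after plotting.
demo_files <- c("ko00010.pathview.png", "ko00010.png", "ko00010.xml", "notes.txt")
file.create(file.path(work_dir, demo_files))

# Copy only the rendered maps: escape the dot and anchor the end of the name.
rendered <- list.files(work_dir, pattern = "pathview\\.png$", full.names = TRUE)
file.copy(rendered, file.path(work_dir, "output_plots", "single_sample"), overwrite = TRUE)

# Remove the PNG and XML files from the top level only.
leftovers <- list.files(work_dir, pattern = "\\.(png|xml)$", full.names = TRUE)
unlink(leftovers)

remaining <- list.files(work_dir)
print(remaining)
```
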
## Part 2. Comparing groups of samples <a id='part2'/>

### 2.1. Fetching data from MGnify <a id='part2_1'/>

1. Generating condition labels.

In [None]:
accession_alias = subset(metadata, select = c('analysis_accession', 'study_attributes.accession'))

In [None]:
cond_list = list()
for (study_id in accession_alias$'study_attributes.accession') {
    if (study_id == 'MGYS00006180') {
        cond_list = append(cond_list , 'Japan')
    } else {
        cond_list = append(cond_list , 'Switzerland')
    }
}
accession_alias$condition = cond_list

In [None]:
table(unlist(accession_alias$condition))

2. Download and integrate the KO count tables. This step takes about 7 minutes to complete.

In [None]:
samples_list = accession_alias$'analysis_accession'
list_of_dfs = list()
for (accession in samples_list) {
    ko_loc = paste0('analyses/',accession,'/kegg-orthologs')
    ko_json = mgnify_retrieve_json(mg, path = ko_loc)
    ko_data = as.data.frame(ko_json %>% spread_all)[ , c("attributes.accession", "attributes.count")]
    colnames(ko_data) = c('ko_id', accession)
    list_of_dfs = append(list_of_dfs, list(ko_data)) 
}

In [None]:
# Full outer join of all per-analysis tables on the KO identifier
integrated_df = Reduce(function(x, y) merge(x, y, by = "ko_id", all = TRUE), list_of_dfs)

row.names(integrated_df) = integrated_df$ko_id
integrated_df$ko_id = NULL
integrated_df[is.na(integrated_df)] = 0
integrated_df = integrated_df %>% select_if(is.numeric) %>% select_if(~ sum(. != 0) > 0)

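One way to express a full outer join across a list of tables is `Reduce()` with `merge()`. The sketch below runs on three tiny hypothetical tables (the `ANALYSIS_*` column names are placeholders for real accessions) and shows how KOs missing from an analysis end up as zero counts.

```r
# Hypothetical per-analysis KO tables; the accession names are placeholders.
df_a <- data.frame(ko_id = c("K00001", "K00002"), ANALYSIS_A = c(10, 3))
df_b <- data.frame(ko_id = c("K00002", "K00003"), ANALYSIS_B = c(7, 1))
df_c <- data.frame(ko_id = c("K00001", "K00003"), ANALYSIS_C = c(0, 4))

# Full outer join on ko_id across all tables.
example_merged <- Reduce(function(x, y) merge(x, y, by = "ko_id", all = TRUE),
                         list(df_a, df_b, df_c))

# KOs missing from an analysis come back as NA; treat them as zero counts.
rownames(example_merged) <- example_merged$ko_id
example_merged$ko_id <- NULL
example_merged[is.na(example_merged)] <- 0
print(example_merged)
```
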
3. Reformatting the condition labels to match the column order of the KO data frame.

In [None]:
sorted_conds = list()
for (sample in colnames(integrated_df)) {
    match = accession_alias[accession_alias$analysis_accession %in% sample,]$condition
    cond = paste(match, collapse = "")
    sorted_conds = append(sorted_conds, cond)    
}
vector_conds = unlist(sorted_conds)

In [None]:
table(vector_conds)

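The reordering above can also be written with `match()`, which returns the position of each column name in the lookup table. This is a minimal sketch on hypothetical accessions and labels, offered as an alternative formulation rather than the notebook's own method.

```r
# Hypothetical lookup table of analysis accession -> condition.
example_alias <- data.frame(
    analysis_accession = c("ANALYSIS_A", "ANALYSIS_B", "ANALYSIS_C"),
    condition = c("Japan", "Switzerland", "Japan"),
    stringsAsFactors = FALSE
)

# Column order of a count table, deliberately different from the lookup order.
example_columns <- c("ANALYSIS_C", "ANALYSIS_A", "ANALYSIS_B")

# match() gives, for each column name, its row position in the lookup table.
positions <- match(example_columns, example_alias$analysis_accession)
example_conds <- example_alias$condition[positions]
print(example_conds)
```

Keeping the label vector in the same order as the count table columns matters because the conditions are passed to the test positionally.
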
### 2.2. Generating differential abundance count tables <a id='part2_2'/>

1. We use [ALDEx2](https://www.bioconductor.org/packages/devel/bioc/vignettes/ALDEx2/inst/doc/ALDEx2_vignette.html) to test for differential abundance between the two conditions. This step takes about 2 minutes to run.

In [None]:
x.all = aldex(integrated_df, 
              vector_conds, 
              mc.samples=128, 
              test="t", 
              effect=TRUE, 
              include.sample.summary=FALSE, 
              denom="all", 
              verbose=FALSE
        )

1. Saving the KOs and their effect sizes in a matrix, to be used as the differential abundance values in the plot.

In [None]:
ko_matrix = data.matrix(subset(x.all, select = c('effect')))

2. Plotting effect and difference versus P-value.

In [None]:
options(repr.plot.width=10, repr.plot.height=8)

par(mfrow=c(1,2))
plot(x.all$effect, x.all$we.ep, log="y", cex=0.7, col=rgb(0,0,1,0.2),
  pch=19, xlab="Effect size", ylab="P value", main="Effect size plot")
points(x.all$effect, x.all$we.eBH, cex=0.7, col=rgb(1,0,0,0.2),
  pch=19)
abline(h=0.05, lty=2, col="grey")
legend(15,1, legend=c("P value", "BH-adjusted"), pch=19, col=c("blue", "red"))

plot(x.all$diff.btw, x.all$we.ep, log="y", cex=0.7, col=rgb(0,0,1,0.2),
  pch=19, xlab="Difference", ylab="P value", main="Volcano plot")
points(x.all$diff.btw, x.all$we.eBH, cex=0.7, col=rgb(1,0,0,0.2),
  pch=19)
abline(h=0.05, lty=2, col="grey")

3. Reporting features detected by either Welch's t-test or the Wilcoxon test (blue) or by both tests (red).

In [None]:
options(repr.plot.width=8, repr.plot.height=8)

found.by.all <- which(x.all$we.eBH < 0.05 & x.all$wi.eBH < 0.05)
found.by.one <- which(x.all$we.eBH < 0.05 | x.all$wi.eBH < 0.05)

plot(x.all$diff.win, x.all$diff.btw, pch=19, cex=1, col=rgb(0,0,0,0.3),
 xlab="Dispersion", ylab="Difference")
points(x.all$diff.win[found.by.one], x.all$diff.btw[found.by.one], pch=19,
 cex=1, col=rgb(0,0,1,0.5))
points(x.all$diff.win[found.by.all], x.all$diff.btw[found.by.all], pch=19,
 cex=1, col=rgb(1,0,0,1))
abline(0,1,lty=2)
abline(0,-1,lty=2)


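The "either test" versus "both tests" selection can be illustrated with plain Benjamini-Hochberg adjustment in base R. This is a simplified sketch on made-up p-values: ALDEx2 reports expected values over its Monte Carlo instances, which the snippet does not attempt to reproduce.

```r
# Hypothetical raw p-values from two tests on the same five features.
p_welch    <- c(0.001, 0.020, 0.300, 0.004, 0.700)
p_wilcoxon <- c(0.002, 0.200, 0.400, 0.003, 0.800)

# Benjamini-Hochberg adjustment, as provided by base R.
q_welch    <- p.adjust(p_welch, method = "BH")
q_wilcoxon <- p.adjust(p_wilcoxon, method = "BH")

# Features passing in both tests, and features passing in at least one.
by_both   <- which(q_welch < 0.05 & q_wilcoxon < 0.05)
by_either <- which(q_welch < 0.05 | q_wilcoxon < 0.05)
print(by_both)
print(by_either)
```

Every feature in `by_both` is also in `by_either`, which is why the plot draws the "either" points first and overlays the "both" points on top.
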
4. Find the pathways with differentially abundant KOs. This step takes about 5 minutes to run.

In [None]:
kos_list = list()
for (index in found.by.one){
    current_ko = rownames(x.all)[index]
    kos_list = append(kos_list, current_ko)
}
pathways = collect_pathways(kos_list)

5. Select the top 3 pathways with the highest number of significant KOs.

In [None]:
pathways_counts = list()
for (path_element in pathways) {
    if (path_element %in% names(pathways_counts)) {
        new_value = pathways_counts[[path_element]] + 1
        pathways_counts[path_element] = new_value       
    } else {
        pathways_counts[path_element] = 1 
    }
}

In [None]:
top_to_plot = names(tail(pathways_counts[order(unlist(pathways_counts))], 3))
top_to_plot

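The top-3 selection relies on `order()` sorting in ascending order, so `tail()` returns the largest counts last. A small sketch with hypothetical counts:

```r
# Hypothetical number of significant KOs per pathway.
example_path_counts <- list(map00010 = 4, map00620 = 9, map00020 = 2,
                            map00190 = 7, map00230 = 5)

# order() sorts ascending, so tail() picks the n largest counts.
ordered_counts <- example_path_counts[order(unlist(example_path_counts))]
example_top <- names(tail(ordered_counts, 3))
print(example_top)
```
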
### 2.3. Plotting pathways <a id='part2_3'/>

In [None]:
for (p in top_to_plot) {
    nude_id =  gsub("map", "", p)
    pathview(gene.data = ko_matrix, 
             species = "ko", 
             pathway.id = nude_id, 
             both.dirs = TRUE, 
             low = c("#e69e03", "#e69e03"),  
             mid = c("#c9c9c9" , "#c9c9c9"), 
             high = c("#02b3ad" , "#02b3ad")
    )
}

1. Cleaning the working directory.

In [None]:
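# Create the output directory (and any missing parent directories) if needed
dir.create("output_plots/comparative", recursive = TRUE, showWarnings = FALSE)

# list.files() takes a regular expression, so escape the dot and anchor the end
file.copy(from=list.files(pattern="pathview\\.png$"), to="./output_plots/comparative/", overwrite = TRUE)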

In [None]:
png_files = list.files(path = ".", pattern = "\\.png$")
xml_files = list.files(path = ".", pattern = "\\.xml$")
files = c(png_files, xml_files)
unlink(files)

## Part 3. Plotting KOs and metabolites for one sample <a id='part3'/>

For this exercise, we use supplementary tables from the original publication by [Franzosa et al.](https://doi.org/10.1038/s41564-018-0306-4), and the processed tables for this publication available at [The Curated Gut Microbiome Metabolome Data Resource](https://github.com/borenstein-lab/microbiome-metabolome-curated-data). We will use presence/absence data for a single sample to illustrate how to generate metabolic pathway maps with compounds, as normalizing metabolite data is not straightforward.

### 3.1. Extracting KOs from input tables <a id='part3_1'/>

1. Loading the abundance data for the enzyme annotations (EC numbers).

In [None]:
enzymes_data = read.table("supp_tables/enzymes.tsv", header = TRUE, sep = "\t") 

2. Cleaning the IDs and keeping data for one sample (PRISM.7122).

In [None]:
# Keep only the EC number (the text before ": ") for each enzyme
enzymes_data$ec_id = gsub(": .*", "", enzymes_data$Enzyme)

In [None]:
enzymes_df = subset(enzymes_data, select = c('ec_id', 'PRISM.7122'))
row.names(enzymes_df) = enzymes_df$ec_id
enzymes_df$ec_id = NULL
enzymes_clean = subset(enzymes_df, enzymes_df[,1] > 0)

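The ID cleaning and presence filter can be checked on a toy table. In this sketch the rows, the `SAMPLE_X` column, and the abundance values are all made up; only the "EC number: description" layout mirrors what the cells above assume.

```r
# Hypothetical rows in the style "EC number: description"; values are made up.
example_enzymes <- data.frame(
    Enzyme = c("1.1.1.1: Alcohol dehydrogenase",
               "2.7.1.1: Hexokinase",
               "3.2.1.23: Beta-galactosidase"),
    SAMPLE_X = c(0.002, 0, 0.010),
    stringsAsFactors = FALSE
)

# Keep only the EC number (everything before the first ": ").
example_enzymes$ec_id <- sub(": .*", "", example_enzymes$Enzyme)

# Index by EC number and keep enzymes with non-zero abundance in the sample.
example_df <- data.frame(SAMPLE_X = example_enzymes$SAMPLE_X,
                         row.names = example_enzymes$ec_id)
example_present <- subset(example_df, example_df[, 1] > 0)
print(rownames(example_present))
```
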
3. Finding the corresponding KOs for each EC number. This step takes about 8 minutes to run.

In [None]:
kos_presence = list()
for( ec_number in rownames(enzymes_clean) ){ 
    current_kos = as.list(names(as.list(keggFind("ko", ec_number))))
    kos_presence = append(kos_presence, current_kos) 
}

In [None]:
kos_vector = unlist(gsub('ko:', '', kos_presence))

### 3.2. Loading and formatting metabolites data <a id='part3_2'/>

In [None]:
compound_data = read.table("supp_tables/mtb.tsv", header = TRUE, row.names = 1, sep = "\t") 

In [None]:
compound_data = data.frame(t(compound_data))

1. Subsetting table for one sample (PRISM.7122).

In [None]:
compound_df = subset(compound_data, select = c('PRISM.7122'))
compound_clean = subset(compound_df, compound_df[,1] > 0)
row.names(compound_clean) = gsub("\\.\\..*", "", row.names(compound_clean))
row.names(compound_clean) = gsub("\\.", "-", row.names(compound_clean))

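The two substitutions above undo the name mangling that R applies when special characters appear in table headers. The sketch below assumes, hypothetically, that a feature name looks like `cluster ID..label` with dots standing in for the original hyphens; the example names are invented to show the pattern.

```r
# Hypothetical feature names after R has replaced special characters with dots.
example_names <- c("C18.neg_Cluster_0001..NA", "HILIC.pos_Cluster_0420..carnitine")

# Drop everything from the first double dot onwards (the compound label) ...
example_clean <- sub("\\.\\..*", "", example_names)
# ... then turn the remaining single dots back into hyphens.
example_clean <- gsub("\\.", "-", example_clean)
print(example_clean)
```
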
2. Loading mapping data to transform cluster ID into KEGG compound ID.

In [None]:
mapping_data = read.table("supp_tables/mtb.map.tsv", header = TRUE, row.names = 1, sep = "\t") 

In [None]:
compound_map = subset(mapping_data, select = c('KEGG'))
compound_map <- na.omit(compound_map)
row.names(compound_map) = gsub(": .*", "", row.names(compound_map))

In [None]:
cpd_names = list()
for (name in rownames(compound_clean)) {
    if (name %in% rownames(compound_map)) {
        compound = compound_map[row.names(compound_map) == name, "KEGG"]
        cpd_names = append(cpd_names, compound)   
    }
}

In [None]:
cpd_vector = unlist(cpd_names)

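The lookup loop above can be condensed into row indexing by name. This sketch uses a hypothetical mapping table (the cluster-to-compound pairings are invented for illustration) and keeps only detected clusters that have a KEGG annotation.

```r
# Hypothetical mapping from cluster ID to KEGG compound ID; NA means unannotated.
example_map <- data.frame(
    KEGG = c("C00253", NA, "C00318"),
    row.names = c("C18-neg_Cluster_0001", "C18-neg_Cluster_0002", "HILIC-pos_Cluster_0420"),
    stringsAsFactors = FALSE
)
example_map <- na.omit(example_map)

# Cluster IDs detected in the sample, including some without a KEGG annotation.
detected <- c("HILIC-pos_Cluster_0420", "C18-neg_Cluster_0002",
              "C18-neg_Cluster_0001", "C8-pos_Cluster_0007")

# Keep detected clusters that have a mapping and look up their compound IDs.
mapped <- detected[detected %in% rownames(example_map)]
example_cpds <- example_map[mapped, "KEGG"]
print(example_cpds)
```
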
3. Finding pathways with annotated compounds. This step takes about 4 minutes to run.

In [None]:
pathways = collect_pathways(cpd_names)

4. Selecting the 3 pathways with the most compounds to draw.

In [None]:
pathways_counts = list()
for (path_element in pathways) {
    if (path_element %in% names(pathways_counts)) {
        new_value = pathways_counts[[path_element]] + 1
        pathways_counts[path_element] = new_value       
    } else {
        pathways_counts[path_element] = 1 
    }
}

In [None]:
top_to_plot = names(tail(pathways_counts[order(unlist(pathways_counts))], 3))

In [None]:
top_to_plot

### 3.3. Drawing! <a id='part3_3'/>

In [None]:
for (p in top_to_plot) {
    nude_id =  gsub("map", "", p)
    pathview(gene.data = kos_vector, 
             cpd.data = cpd_vector, 
             species = "ko", 
             pathway.id = nude_id, 
             bins=c(2, 2), 
             both.dirs = FALSE, 
             limit = c(1,1), 
             mid = c("#c9c9c9" , "#c9c9c9"), 
             high = c("#02b3ad" , "#d67e03")
    )
}

1. Cleaning the working directory.

In [None]:
# Create the output directory (and any missing parent directories) if needed
dir.create("output_plots/metabolites", recursive = TRUE, showWarnings = FALSE)

In [None]:
# list.files() takes a regular expression, so escape the dot and anchor the end
file.copy(from=list.files(pattern="pathview\\.png$"), to="./output_plots/metabolites/", overwrite = TRUE)

png_files = list.files(path = ".", pattern = "\\.png$")
xml_files = list.files(path = ".", pattern = "\\.xml$")
files = c(png_files, xml_files)
unlink(files)


In [None]:
#?pathview

### References <a id='refs'/></a>

