<div style="max-width:1200px"><img src="../_resources/mgnify_banner.png" width="100%"></div>

<img src="../_resources/mgnify_logo.png" width="200px">

# Pathways Visualization

In this notebook we demonstrate how the MGnifyR tool can be used to fetch functional annotation results generated by the MGnify metagenomic analysis pipelines. We then show how to generate pathway visualizations using [Pathview](https://bioconductor.org/packages/release/bioc/html/pathview.html) in R.

[MGnifyR](http://github.com/beadyallen/mgnifyr) is a library that provides a set of tools for easily accessing and processing MGnify data in R, making queries to MGnify databases through the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/v1/). 
The benefit of MGnifyR is that data can either be fetched in TSV format or be combined directly into a phyloseq object to run an analysis in a custom workflow.

This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.


## Contents

- [Introduction](#intro)
    - [Minimal example 1. Fetching KEGG orthologs and modules tables from MGnify](#mgnifyr)
    - [Minimal example 2. Getting pathway-related information from the KEGGREST API](#keggrest)
- [Part 1. Drawing presence/absence KOs for one metagenomic sample](#part1)
    - [1.1. Fetching data from MGnify](#part1_1)
    - [1.2. Selecting the most complete pathways](#part1_2)
    - [1.3. Ready to draw!](#part1_3)
- [Part 2. Comparing groups of samples, drawing KOs abundance](#part2)
    - [2.1. Fetching KO tables from MGnify](#part2_1)
    - [2.2. Generating differential abundance count tables](#part2_2)
    - [2.3. Selecting the pathways templates to draw](#part2_3)
    - [2.4. Plotting pathways](#part2_4)
- [Part 3. Plotting presence/absence of KOs and metabolites for one metagenomic sample](#part3)
    - [3.1. Extracting KOs from input tables](#part3_1)
    - [3.2. Loading and formatting metabolites data](#part3_2)
    - [3.3. Drawing!](#part3_3)
- [References](#refs)

In [None]:
# Loading libraries:
suppressMessages({
    library(ALDEx2)
    library(data.table)
    library(dplyr)
    library(IRdisplay)
    library(KEGGREST)
    library(MGnifyR)   
    library(pathview)
    library(tidyjson)
})
    
#display_markdown(file = 'assets/mgnifyr_help.md')

In [None]:
# Setting tables and figures size to display (these will be reset later):
options(repr.matrix.max.cols=150, repr.matrix.max.rows=500)
options(repr.plot.width=4, repr.plot.height=4)

In [None]:
# Setting up functions
# Collects the reference pathway maps ("mapXXXXX") linked to each KEGG ID
# (module, KO or compound) in ids_list. Returns a list with one entry per
# ID-pathway link, so a pathway appears once for every ID that maps to it.
collect_pathways <- function(ids_list) {
    pathways = list()
    for (id in ids_list) { 
        current_pathway = as.list(keggLink("pathway", id))
        for (index in grep("map", current_pathway)) {        
            clean_id = sub("^path:", "", current_pathway[[index]])
            # Discarding chemical structure (map010XX), global (map011XX), and overview (map012XX) maps
            prefix = substring(clean_id, 1, 6)
            if (!(prefix %in% c("map010", "map011", "map012"))) {
                pathways = append(pathways, clean_id)
            }
        }
    }
    return(pathways)
}

In [None]:
# Create your session mgnify_client object
mg = mgnify_client(usecache = TRUE, cache_dir = '/home/jovyan/.mgnify_cache')

## Introduction <a id='intro'/>

The goal of this notebook is to demonstrate how to create KEGG pathway maps to visualize metabolic potential and metabolite production in metagenomic samples. We will use metabolic pathways annotated at the gene level by assigning a KEGG Orthology (KO) to putative protein sequences. These results are generated through the [MGnify v5.0 pipeline](https://doi.org/10.1093/nar/gkac1080) for metagenomic assemblies, as shown in the workflow schema below. We will also use the completeness estimation of KEGG modules available in the [MGnify web portal](https://www.ebi.ac.uk/metagenomics). Note that module completeness is determined by the number of essential steps present; therefore, even at 100% completeness there can still be gaps.

The [KEGG (Kyoto Encyclopedia of Genes and Genomes) database](https://www.genome.jp/kegg/) is a collection of biologically-oriented data, including genetic and metabolic pathways, diseases and drugs, protein-protein interactions, and gene expression. We use the KEGG database to connect genes with biological information; it also provides pathway maps as a resource for systems biology.

KEGG modules are clusters of related genes that are involved in a specific biological process or pathway. KEGG module IDs start with 'M' followed by 5 digits.
Pathways are sets of interconnected biochemical reactions that form a chain leading from an initial reactant to a final product. The IDs of manually drawn reference pathways start with 'map' followed by 5 digits. KEGG pathways provide a high-level overview of the major metabolic pathways in an organism, while KEGG modules provide a more detailed view of the genes and reactions involved in a specific pathway. For a clearer display of results, this notebook does not use chemical structure (map010XX), global (map011XX), or overview (map012XX) maps as templates.
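
The ID conventions above can be checked with a few regular expressions. The following is a minimal sketch in base R; the function name `classify_kegg_id` and the example IDs are illustrative assumptions and are not part of MGnifyR or KEGGREST.

```r
# Classify KEGG identifiers by the prefixes described above.
# `classify_kegg_id` is a hypothetical helper written for this illustration.
classify_kegg_id <- function(id) {
    if (grepl("^M[0-9]{5}$", id)) {
        "module"
    } else if (grepl("^K[0-9]{5}$", id)) {
        "orthology (KO)"
    } else if (grepl("^map[0-9]{5}$", id)) {
        # Templates this notebook skips: map010XX, map011XX and map012XX
        if (grepl("^map01[0-2]", id)) "pathway (skipped)" else "pathway (usable template)"
    } else {
        "other"
    }
}

# Example inputs chosen only to exercise each branch
ids <- c("M00567", "K00399", "map00680", "map01100", "map01200")
kinds <- vapply(ids, classify_kegg_id, character(1))
print(kinds)
```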

<img src="https://www.ebi.ac.uk/metagenomics/static/5e55649e459d5f26ee6c.png" width="800px">


The following sections of this introduction give two minimal examples of the main functions we will use to fetch and format the input tables for KEGG pathway visualization.

### Minimal example 1. Fetching KEGG orthologs and modules tables from MGnify <a id='mgnifyr'/>


`MGnifyR` has pre-built functions to retrieve data from MGnify databases. The `mgnify_retrieve_json` function can be used to access results that are not available in tabular format. In this notebook we fetch the pathway annotation and pathway completeness tables generated by the latest version of the MGnify analysis pipeline in JSON format and reformat them into dataframes so that the data are easy to manipulate.

1. Fetching the KO counts table for one sample with the analysis accession `MGYA00636312`:

In [None]:
ko_json = mgnify_retrieve_json(mg, path = 'analyses/MGYA00636312/kegg-orthologs')
ko_data = as.data.frame(ko_json %>% spread_all)[ , c("attributes.accession", "attributes.count")]

In [None]:
head(ko_data, 3)

2. And we can also fetch the modules completeness table for the same sample using the following code:

In [None]:
ko_comp_json = mgnify_retrieve_json(mg, path = 'analyses/MGYA00636312/kegg-modules')
ko_comp = as.data.frame(ko_comp_json %>% spread_all)[ , c("id", "attributes.completeness")]

In [None]:
head(ko_comp, 3)

### Minimal example 2. Getting pathway-related information from the KEGGREST API <a id='keggrest'/>

[KEGGREST](https://www.bioconductor.org/packages/devel/bioc/vignettes/KEGGREST/inst/doc/KEGGREST-vignette.html) is a powerful tool with multiple functions to access and use data from KEGG databases. In this notebook we use `KEGGREST` to map between IDs in different databases and to determine the pathways associated with a module, orthologue, or compound. To display the list of KEGG databases you can query, run the `listDatabases()` command.

1. Using the `keggLink` function to find the IDs of the pathways we can use as templates to draw the module of methane production from CO2 (M00567)

In [None]:
as.list(keggLink("pathway", 'M00567'))

2. Using `keggFind` to find all the KOs associated with the methyl-coenzyme M reductase (EC 2.8.4.1)

In [None]:
as.list(keggFind("ko", '2.8.4.1'))
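
Both calls return named character vectors whose names and values carry database prefixes (`md:`, `path:`, `ko:`), which is why later cells strip them with `gsub`. The sketch below reproduces that clean-up offline on hand-written stand-ins. The vectors `linked` and `found` only imitate the shape of the KEGGREST output and are assumptions, not results fetched from KEGG.

```r
# Toy stand-in for the shape of keggLink("pathway", "M00567"):
# names hold the query ID, values hold the linked pathway IDs.
linked <- c("md:M00567" = "path:map00680",
            "md:M00567" = "path:map01100",
            "md:M00567" = "path:map01200")

# Toy stand-in for the shape of keggFind("ko", "2.8.4.1"):
# names hold the KO IDs, values hold free-text descriptions.
found <- c("ko:K00399" = "methyl-coenzyme M reductase alpha subunit",
           "ko:K00401" = "methyl-coenzyme M reductase beta subunit",
           "ko:K00402" = "methyl-coenzyme M reductase gamma subunit")

# Stripping the database prefixes leaves plain IDs
pathway_ids <- sub("^path:", "", unname(linked))
ko_ids <- sub("^ko:", "", names(found))

print(pathway_ids)
print(ko_ids)
```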

## Part 1. Drawing presence/absence KOs for one metagenomic sample <a id='part1'/>

For Parts 1 and 2 of this notebook, we will use MGnify results generated for two studies: 
1. Metagenomes of bacteria colonizing the gut of Apis mellifera and Apis cerana from Japan ([MGYS00006180](https://www.ebi.ac.uk/metagenomics/studies/MGYS00006180#overview)).
2. Gut microbiota of Switzerland honeybees ([MGYS00006178](https://www.ebi.ac.uk/metagenomics/studies/MGYS00006178#overview)). 

The original analysis, based on viral communities, can be found in the [Bonilla-Rosso et al.](https://doi.org/10.1073/pnas.2000228117) publication.


### 1.1. Fetching data from MGnify <a id='part1_1'/>

1. Fetching the analysis accession list using the study accessions. 

In [None]:
all_accessions = mgnify_analyses_from_studies(mg, c('MGYS00006180','MGYS00006178'))
all_metadata = mgnify_get_analyses_metadata(mg, all_accessions)

2. Keeping just the first analysis accession to fetch the KEGG orthologs count table from the MGnify API and transform it from JSON to a matrix.

In [None]:
accession = head(all_accessions, 1)
ko_loc = paste0('analyses/',accession,'/kegg-orthologs')

In [None]:
ko_json = mgnify_retrieve_json(mg, path = ko_loc)
ko_data = as.data.frame(ko_json %>% spread_all)[ , c("attributes.accession", "attributes.count")]
ko_data = data.frame(ko_data, row.names=1)
colnames(ko_data)[1] = 'counts'
ko_matrix = data.matrix(ko_data)

In [None]:
head(ko_matrix, 3)
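
Pathview expects gene data as a numeric matrix (or vector) whose row names are the KO IDs, which is what the reshaping above produces. As a minimal offline sketch, the same three steps can be applied to a small hand-made table; the values in `ko_toy` are invented for illustration.

```r
# Hand-made table with the same two columns selected from the MGnify JSON
ko_toy <- data.frame(
    attributes.accession = c("K00001", "K00002", "K00003"),
    attributes.count = c(12, 0, 7),
    stringsAsFactors = FALSE
)

# Column 1 becomes the row names and is dropped from the columns
ko_toy <- data.frame(ko_toy, row.names = 1)
colnames(ko_toy)[1] <- "counts"

# Numeric matrix with KO IDs as row names
ko_toy_matrix <- data.matrix(ko_toy)
print(ko_toy_matrix)
```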

3. Fetching the modules completeness table and filtering out modules with completeness < 100%.

In [None]:
comp_loc = paste0('analyses/',accession,'/kegg-modules')
ko_comp_json = mgnify_retrieve_json(mg, path = comp_loc)
ko_comp = as.data.frame(ko_comp_json %>% spread_all)
modules = ko_comp[ko_comp$attributes.completeness == 100,][, c("attributes.accession")]

In [None]:
head(modules)
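
The 100% cut-off used above is strict. If it leaves too few modules for a given sample, a lower threshold could be tried. The sketch below shows both filters on an invented table; `comp_toy` and the value of `threshold` are assumptions chosen for illustration, not recommendations.

```r
# Invented completeness table with the same column names as the MGnify output
comp_toy <- data.frame(
    attributes.accession = c("M00001", "M00002", "M00003"),
    attributes.completeness = c(100, 75.5, 100),
    stringsAsFactors = FALSE
)

# Strict filter, as used in this notebook
complete <- comp_toy$attributes.accession[comp_toy$attributes.completeness == 100]

# Relaxed filter with a hypothetical threshold
threshold <- 75
relaxed <- comp_toy$attributes.accession[comp_toy$attributes.completeness >= threshold]

print(complete)
print(relaxed)
```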

### 1.2. Selecting the most complete pathways <a id='part1_2'/>

1. Now we need to collect the list of template pathways where these complete modules can be drawn. This step takes less than 1 minute to run.

In [None]:
md_pathways = collect_pathways(modules)

In [None]:
head(md_pathways)

2. In order to draw the most complete pathways maps, we will use the list of templates we obtained in the previous step and select only pathways having all their constituent modules.

In [None]:
# Counting the number of modules we have in each pathway
our_pathways_counts = list()
for (path_element in md_pathways) {
    if (path_element %in% names(our_pathways_counts)) {
        new_value = our_pathways_counts[[path_element]] + 1
        our_pathways_counts[path_element] = new_value       
    } else {
        our_pathways_counts[path_element] = 1 
    }
}

# Counting the number of modules expected in each pathway
u_pathways = unique(md_pathways)
exp_pathways_counts = list()
for (path in u_pathways) {
    mod_count = length(as.list(keggLink("module", path)))
    exp_pathways_counts[path] = mod_count 
}

# Selecting the pathways having all their constituent modules. We remove the 'map' prefix as pathview doesn't like it
to_draw = list()
for (pathway in names(our_pathways_counts)) {
    our_value = our_pathways_counts[[pathway]]
    exp_value = exp_pathways_counts[[pathway]]
    ratio =  our_value / exp_value
    if (ratio == 1) {
        nude_id =  gsub("map", "", pathway)
        to_draw = append(to_draw, nude_id)   
    }
}

In [None]:
to_draw
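
The selection logic compares, for each pathway, the number of complete modules observed in the sample with the number of modules KEGG links to that pathway. A compact sketch of the same comparison using `table()` is shown below. The expected totals in `expected_toy` are hypothetical and stand in for the `keggLink("module", path)` look-ups.

```r
# One entry per (complete module, pathway) link, as returned by collect_pathways
md_pathways_toy <- c("map00010", "map00020", "map00010", "map00620")

# Hypothetical number of modules KEGG links to each pathway
expected_toy <- c(map00010 = 2, map00020 = 3, map00620 = 1)

# Observed number of complete modules per pathway
observed <- table(md_pathways_toy)

# Fraction of each pathway's modules that are complete in the sample
ratio <- as.vector(observed) / expected_toy[names(observed)]

# Keep fully covered pathways and drop the 'map' prefix for Pathview
to_draw_toy <- sub("^map", "", names(ratio)[ratio == 1])
print(to_draw_toy)
```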

### 1.3. Ready to draw! <a id='part1_3'/>

1. As we are plotting absence/presence, we set the number of bins = 2, the scale in one direction, and use 1 as limit.

In [None]:
for (p in to_draw) {
    pathview(gene.data = ko_matrix, 
             species = "ko", 
             pathway.id = p, 
             bins=c(2, 2), 
             both.dirs = FALSE, 
             limit = c(1,1), 
             mid = c("#ffffff" , "#ffffff"), 
             high = c("#02b3ad" , "#02b3ad")
    )
}

2. Cleaning the working directory.

In [None]:
# Creating the output directory (and any missing parent directory)
dir.create("output_plots/single_sample", recursive = TRUE, showWarnings = FALSE)

# Copying the Pathview figures to the output directory
file.copy(from = list.files(pattern = "pathview\\.png$"), to = "./output_plots/single_sample/", overwrite = TRUE)

# Removing the PNG and XML files left in the working directory
png_files = list.files(path = ".", pattern = "\\.png$")
xml_files = list.files(path = ".", pattern = "\\.xml$")
files = c(png_files, xml_files)
unlink(files)

3. This is one example of the Pathview outputs. The rest of the generated figures are stored in the `output_plots/single_sample/` directory. You can explore the files by clicking on them in the left-side panel.

<div style="max-width:800px"><img src="output_plots/single_sample/ko00010.pathview.png" width="100%"></div>

## Part 2. Comparing groups of samples, drawing KOs abundance <a id='part2'/>

In this part of the notebook we use the whole list of accessions for both studies (`MGYS00006180` and `MGYS00006178`). After integrating the KO tables, we run [ALDEx2](https://www.bioconductor.org/packages/devel/bioc/vignettes/ALDEx2/inst/doc/ALDEx2_vignette.html) to find differentially abundant KOs between honeybees from Japan and Switzerland, and we use the `effect` value as the scale to plot the pathways with the highest number of differentially abundant KOs. Note that the steps that fetch KO tables and query `KEGGREST` can take several minutes to run.

### 2.1. Fetching KO tables from MGnify <a id='part2_1'/>

1. Generating condition labels.

In [None]:
accession_alias = subset(all_metadata, select = c('analysis_accession', 'study_attributes.accession'))

In [None]:
cond_list = list()
for (study_id in accession_alias$'study_attributes.accession') {
    if (study_id == 'MGYS00006180') {
        cond_list = append(cond_list , 'Japan')
    } else {
        cond_list = append(cond_list , 'Switzerland')
    }
}
accession_alias$condition = cond_list

In [None]:
table(unlist(accession_alias$condition))
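
The labelling loop above assigns 'Switzerland' to every study that is not `MGYS00006180`. An alternative sketch uses a named lookup vector, which makes the mapping explicit and yields `NA` for any unexpected study ID. The `study_ids` vector is a toy input, not the real metadata.

```r
# Explicit study-to-condition mapping (study accessions taken from this notebook)
labels <- c(MGYS00006180 = "Japan", MGYS00006178 = "Switzerland")

# Toy input standing in for accession_alias$study_attributes.accession
study_ids <- c("MGYS00006180", "MGYS00006178", "MGYS00006180")

# Vectorised lookup; unknown study IDs would become NA instead of 'Switzerland'
condition <- unname(labels[study_ids])
print(table(condition))
```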

2. Downloading and integrating the KO count tables. This step takes about 8 minutes to complete.

In [None]:
samples_list = accession_alias$'analysis_accession'
list_of_dfs = list()
for (accession in samples_list) {
    ko_loc = paste0('analyses/',accession,'/kegg-orthologs')
    ko_json = mgnify_retrieve_json(mg, path = ko_loc)
    ko_data = as.data.frame(ko_json %>% spread_all)[ , c("attributes.accession", "attributes.count")]
    colnames(ko_data) = c('ko_id', accession)
    list_of_dfs = append(list_of_dfs, list(ko_data)) 
}

In [None]:
# Outer-joining all per-sample tables on the KO id column.
# Starting the merge from an empty data.frame would drop the first table,
# so the tables are merged pairwise with Reduce instead.
integrated_df = Reduce(function(x, y) merge(x, y, by = 'ko_id', all = TRUE), list_of_dfs)

# Using the KO id column as row names
row.names(integrated_df) = integrated_df$ko_id
integrated_df$ko_id = NULL

# Converting NA to zero 
integrated_df[is.na(integrated_df)] = 0

# Discarding samples whose KO counts are all zero
integrated_df = integrated_df %>% select_if(is.numeric) %>% select_if(~ sum(. != 0) > 0)

In [None]:
head(integrated_df, c(3, 2))
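
Integrating the per-sample tables is a full outer join on the KO ID: a KO missing from one sample appears as `NA` after the merge and is then set to zero. The following sketch shows this on three invented tables. The sample names `MGYA_A`, `MGYA_B` and `MGYA_C` are placeholders, not real accessions.

```r
# Three invented per-sample tables with partially overlapping KOs
df_a <- data.frame(ko_id = c("K00001", "K00002"), MGYA_A = c(5, 3))
df_b <- data.frame(ko_id = c("K00002", "K00003"), MGYA_B = c(8, 1))
df_c <- data.frame(ko_id = c("K00001", "K00003"), MGYA_C = c(2, 4))

# Pairwise full outer join on ko_id
merged <- Reduce(function(x, y) merge(x, y, by = "ko_id", all = TRUE),
                 list(df_a, df_b, df_c))

# KO IDs become row names; missing counts become zero
rownames(merged) <- merged$ko_id
merged$ko_id <- NULL
merged[is.na(merged)] <- 0
print(merged)
```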

3. Reformatting the condition labels to match the column order of the KO dataframe.

In [None]:
sorted_conds = list()
for (sample in colnames(integrated_df)) {
    match = accession_alias[accession_alias$analysis_accession %in% sample,]$condition
    cond = paste(match, collapse = "")
    sorted_conds = append(sorted_conds, cond)    
}
vector_conds = unlist(sorted_conds)

In [None]:
table(vector_conds)

### 2.2. Generating differential abundance count tables<a id='part2_2'/>

1. Running ALDEx2. This step takes about 2 minutes to run.

In [None]:
x.all = aldex(integrated_df, 
              vector_conds, 
              mc.samples=128, 
              test="t", 
              effect=TRUE, 
              include.sample.summary=FALSE, 
              denom="all", 
              verbose=FALSE
        )

In [None]:
head(x.all, 3)

1. The `effect` column in the above output (`x.all` table) contains the ALDEx2 effect size: the median, across Monte Carlo instances, of the between-group difference (`diff.btw`) divided by the larger within-group dispersion (`diff.win`). A positive effect indicates that the KO is relatively more abundant in the second group than in the reference group, while a negative effect indicates the opposite. In this example, the Japan group is used as the reference. We keep the KO IDs and their effect values in a separate matrix to use for plotting.

In [None]:
ko_matrix = data.matrix(subset(x.all, select = c('effect')))

In [None]:
head(ko_matrix, 3)
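
ALDEx2 compares groups on centred log-ratio (clr) values computed from Monte Carlo instances of the counts. With `denom="all"`, the reference is the geometric mean of all features in a sample. The sketch below is a deliberately simplified, deterministic illustration of a clr transform and a between-group difference on invented counts. It is not the ALDEx2 computation (no Dirichlet sampling, no dispersion term), and the sample names are placeholders.

```r
# Invented counts: 3 KOs (rows) x 4 samples (columns); J = Japan, S = Switzerland
counts <- matrix(c(10, 200, 30,
                   12, 180, 25,
                   40,  90, 35,
                   45, 100, 30),
                 nrow = 3,
                 dimnames = list(c("K00001", "K00002", "K00003"),
                                 c("J1", "J2", "S1", "S2")))

# clr per sample: log2 count minus the mean log2 count of that sample
clr <- apply(counts, 2, function(x) log2(x) - mean(log2(x)))

# Simplified between-group difference: Switzerland minus Japan (the reference)
diff_btw_toy <- rowMeans(clr[, c("S1", "S2")]) - rowMeans(clr[, c("J1", "J2")])
print(round(diff_btw_toy, 2))
```

In this toy table, K00001 gains relative abundance in the Switzerland samples and K00002 loses it, so their differences have opposite signs.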

2. Plotting effect and difference (`diff.btw`) versus P-value. The threshold line indicates P-value = 0.05.

In [None]:
options(repr.plot.width=10, repr.plot.height=8)

par(mfrow=c(1,2))
plot(x.all$effect, x.all$we.ep, log="y", cex=0.7, col=rgb(0,0,1,0.2),
  pch=19, xlab="Effect size", ylab="P value", main="Effect size plot")
points(x.all$effect, x.all$we.eBH, cex=0.7, col=rgb(1,0,0,0.2),
  pch=19)
abline(h=0.05, lty=2, col="grey")
legend(15,1, legend=c("P value", "BH-adjusted"), pch=19, col=c("blue", "red"))

plot(x.all$diff.btw, x.all$we.ep, log="y", cex=0.7, col=rgb(0,0,1,0.2),
  pch=19, xlab="Difference", ylab="P value", main="Volcano plot")
points(x.all$diff.btw, x.all$we.eBH, cex=0.7, col=rgb(1,0,0,0.2),
  pch=19)
abline(h=0.05, lty=2, col="grey")

3. Reporting features detected by either the Welch's t-test or the Wilcoxon test (blue) or by both tests (red).

In [None]:
options(repr.plot.width=8, repr.plot.height=8)

found.by.all <- which(x.all$we.eBH < 0.05 & x.all$wi.eBH < 0.05)
found.by.one <- which(x.all$we.eBH < 0.05 | x.all$wi.eBH < 0.05)

plot(x.all$diff.win, x.all$diff.btw, pch=19, cex=1, col=rgb(0,0,0,0.3),
 xlab="Dispersion", ylab="Difference")
points(x.all$diff.win[found.by.one], x.all$diff.btw[found.by.one], pch=19,
 cex=1, col=rgb(0,0,1,0.5))
points(x.all$diff.win[found.by.all], x.all$diff.btw[found.by.all], pch=19,
 cex=1, col=rgb(1,0,0,1))
abline(0,1,lty=2)
abline(0,-1,lty=2)
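
Note that `found.by.one` is the union of the two tests, so it also contains every feature in `found.by.all`. The red points are simply drawn on top of the blue ones. A small sketch on invented adjusted P-values makes the two index sets explicit; the table `res` is a toy stand-in for `x.all`.

```r
# Invented BH-adjusted P-values for four KOs
res <- data.frame(we.eBH = c(0.01, 0.20, 0.03, 0.60),
                  wi.eBH = c(0.02, 0.04, 0.30, 0.70),
                  row.names = c("K00001", "K00002", "K00003", "K00004"))

# Significant in both tests (intersection)
by_both <- which(res$we.eBH < 0.05 & res$wi.eBH < 0.05)

# Significant in at least one test (union)
by_either <- which(res$we.eBH < 0.05 | res$wi.eBH < 0.05)

print(rownames(res)[by_both])
print(rownames(res)[by_either])
```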


### 2.3. Selecting the pathways templates to draw<a id='part2_3'/>

1. We will use the union of the KOs detected by either test to find the pathways with differentially abundant KOs. This step takes about 3 minutes to run.

In [None]:
kos_list = list()
for (index in found.by.one){
    current_ko = rownames(x.all)[index]
    kos_list = append(kos_list, current_ko)
}

In [None]:
ko_pathways = collect_pathways(kos_list)

In [None]:
head(ko_pathways)

2. Select the top 3 pathways with the highest number of significant KOs.

In [None]:
pathways_counts = list()
for (path_element in ko_pathways) {
    if (path_element %in% names(pathways_counts)) {
        new_value = pathways_counts[[path_element]] + 1
        pathways_counts[path_element] = new_value       
    } else {
        pathways_counts[path_element] = 1 
    }
}

In [None]:
top_to_plot = names(tail(pathways_counts[order(unlist(pathways_counts))], 3))
top_to_plot
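
Counting how often each pathway appears and keeping the most frequent ones can also be written with `table()` and `sort()`. The sketch below is an equivalent formulation on an invented vector of pathway IDs, keeping the top 2 instead of the top 3 to keep the toy input small.

```r
# Invented pathway hits: one entry per (significant KO, pathway) link
ko_pathways_toy <- c("map00270", "map00010", "map00270",
                     "map00620", "map00010", "map00270")

# Number of significant KOs per pathway, most frequent first
hits <- sort(table(ko_pathways_toy), decreasing = TRUE)

# Names of the two pathways with the most hits
top2 <- names(head(hits, 2))
print(top2)
```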

### 2.4. Plotting pathways<a id='part2_4'/>

1. Effect values can be negative or positive, so we plot in both directions. We keep Pathview's default `limit` of 1, so the colour scale spans -1 to 1 and larger absolute effects are drawn with the end colours.

In [None]:
for (p in top_to_plot) {
    nude_id =  gsub("map", "", p)
    pathview(gene.data = ko_matrix, 
             species = "ko", 
             pathway.id = nude_id, 
             both.dirs = TRUE, 
             low = c("#bd066b", "#bd066b"),  
             mid = c("#c9c9c9" , "#c9c9c9"), 
             high = c("#02b3ad" , "#02b3ad")
    )
}

2. Cleaning the working directory.

In [None]:
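# Creating the output directory (and any missing parent directory)
dir.create("output_plots/comparative", recursive = TRUE, showWarnings = FALSE)

# Copying the Pathview figures to the output directory
file.copy(from = list.files(pattern = "pathview\\.png$"), to = "./output_plots/comparative/", overwrite = TRUE)

# Removing the PNG and XML files left in the working directory
png_files = list.files(path = ".", pattern = "\\.png$")
xml_files = list.files(path = ".", pattern = "\\.xml$")
files = c(png_files, xml_files)
unlink(files)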

3. This is one example of the Pathview outputs. The rest of the generated figures are stored in the `output_plots/comparative/` directory. You can explore the files by clicking on them in the left-side panel.

<div style="max-width:1200px"><img src="output_plots/comparative/ko00270.pathview.png" width="100%"></div>

## Part 3. Plotting presence/absence of KOs and metabolites for one metagenomic sample <a id='part3'/>

We are using supplementary and processed tables from the [Franzosa et al.](https://doi.org/10.1038/s41564-018-0306-4) publication, available at [The Curated Gut Microbiome Metabolome Data Resource](https://doi.org/10.1038/s41522-022-00345-5). As normalizing metabolite data is not straightforward, this section only uses presence/absence data from one sample to demonstrate how to create metabolic pathway maps using metabolite and KO tables as input.

### 3.1. Extracting KOs from input tables <a id='part3_1'/>

1. We need a list of KOs as input to draw pathways. As there are no KO count tables available for this data, we generate a KO table from the enzyme (EC number) identifiers in `Supplementary Dataset 6: Per-subject microbial enzyme relative abundance profiles`.

In [None]:
enzymes_data = read.table("/home/jovyan/supp_tables/enzymes.tsv", header = TRUE, sep = "\t") 

In [None]:
head(enzymes_data, c(3,3))

2. Cleaning the enzyme IDs to keep only the EC number, and keeping data for only one sample (PRISM.7122).

In [None]:
# Keeping the text before the first ': ' of each enzyme label (the EC number)
EC_ids = gsub(": .*", "", enzymes_data$Enzyme)

# Keeping data for PRISM.7122 sample
enzymes_df = subset(enzymes_data, select = c('PRISM.7122'))

# Using the clean EC number column  as row names
row.names(enzymes_df) = EC_ids

# Discarding enzymes with abundance = 0
enzymes_clean = subset(enzymes_df, enzymes_df[,1] > 0)

head(enzymes_clean, 3)
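
The clean-up above relies on each enzyme label starting with the EC number followed by `: ` and a description. A minimal sketch of the same two steps on invented labels and abundances is shown below; the labels and values in `enz_toy` are assumptions about the table layout, written only for illustration.

```r
# Invented enzyme table: label column plus one sample column
enz_toy <- data.frame(
    Enzyme = c("1.1.1.1: Alcohol dehydrogenase",
               "2.8.4.1: Coenzyme-B sulfoethylthiotransferase",
               "3.2.1.23: Beta-galactosidase"),
    PRISM.7122 = c(0.002, 0, 0.010),
    stringsAsFactors = FALSE
)

# Everything from the first ': ' onwards is removed, leaving the EC number
ec_numbers <- sub(": .*", "", enz_toy$Enzyme)

# Keeping only the EC numbers detected in the sample (abundance > 0)
ec_present <- ec_numbers[enz_toy$PRISM.7122 > 0]
print(ec_present)
```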

3. Finding the corresponding KOs for each EC number. As we want to plot presence/absence, we can give Pathview a vector of KO IDs instead of an abundance matrix. This step takes about 4 minutes to complete.


In [None]:
kos_presence = list()
for( ec_number in rownames(enzymes_clean) ){ 
    current_kos = as.list(names(as.list(keggFind("ko", ec_number))))
    kos_presence = append(kos_presence, current_kos) 
}

kos_vector = unlist(gsub('ko:', '', kos_presence))

In [None]:
head(kos_vector)

### 3.2. Loading and formatting metabolites data <a id='part3_2'/>

1. We are using the metabolites table available at [The Curated Gut Microbiome Metabolome Data Resource GitHub repo](https://github.com/borenstein-lab/microbiome-metabolome-curated-data), generated from `Supplementary Dataset 2: Per-subject metabolite relative abundance profiles` of the [original publication](https://doi.org/10.1038/s41564-018-0306-4) and subsetted for sample PRISM.7122.

In [None]:
compound_data = read.table("/home/jovyan/supp_tables/mtb.tsv", header = TRUE, row.names = 1, sep = "\t") 

# Transposing the table
compound_data = data.frame(t(compound_data))

# Keeping data for sample PRISM.7122 only
compound_df = subset(compound_data, select = c('PRISM.7122'))

# Discarding rows with abundance = 0
compound_clean = subset(compound_df, compound_df[,1] > 0)

# Formatting row names to be mapped to compound names
row.names(compound_clean) = gsub("\\.\\..*", "", row.names(compound_clean))
row.names(compound_clean) = gsub("\\.", "-", row.names(compound_clean))

In [None]:
head(compound_clean, c(3,3))

2. Now we have to transform the cluster IDs into KEGG compound IDs. This information is not available in the [original publication](https://doi.org/10.1038/s41564-018-0306-4). Fortunately, a mapping table is available in [The Curated Gut Microbiome Metabolome Data Resource GitHub repo](https://github.com/borenstein-lab/microbiome-metabolome-curated-data). Here we load and format the mapping data to transform cluster IDs into KEGG compound IDs.

In [None]:
mapping_data = read.table("/home/jovyan/supp_tables/mtb.map.tsv", header = TRUE, row.names = 1, sep = "\t", quote = '#') 

In [None]:
head(mapping_data, 3)

In [None]:
# Reducing the table to keep only row names and KEGG compound ID
compound_map = subset(mapping_data, select = c('KEGG'))

# Removing rows with NA in the compound ID column
compound_map = na.omit(compound_map)

# Formatting the row names to be correctly mapped on the abundance table 'compound_clean'
row.names(compound_map) = gsub(": .*", "", row.names(compound_map))

head(compound_map, 3)

3. Generating the compounds list (as vector) to plot as presence/absence.

In [None]:
cpd_names = list()
for (name in rownames(compound_clean)) {
    if (name %in% rownames(compound_map)) {
        compound = compound_map[row.names(compound_map) == name, "KEGG"]
        cpd_names = append(cpd_names, compound)   
    }
}
cpd_vector = unlist(cpd_names)

In [None]:
head(cpd_vector)
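
The loop above is a lookup from cluster names to KEGG compound IDs that skips clusters without a mapping. The same lookup can be sketched with `match()`. The cluster names and the pairing with compound IDs below are hypothetical and only imitate the layout of `compound_clean` and `compound_map`.

```r
# Invented abundance table: clusters detected in the sample
abund_toy <- data.frame(PRISM.7122 = c(0.2, 0.1, 0.4),
                        row.names = c("cluster_a", "cluster_b", "cluster_c"))

# Invented mapping table: only some clusters have a KEGG compound ID
map_toy <- data.frame(KEGG = c("C00022", "C00186"),
                      row.names = c("cluster_a", "cluster_c"),
                      stringsAsFactors = FALSE)

# Position of each detected cluster in the mapping table (NA if unmapped)
idx <- match(rownames(abund_toy), rownames(map_toy))

# Compound IDs for the clusters that have a mapping
cpd_toy <- map_toy$KEGG[idx[!is.na(idx)]]
print(cpd_toy)
```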

4. Finding pathways with annotated compounds. This step takes about 2 minutes to run.

In [None]:
cpd_pathways = collect_pathways(cpd_names)

In [None]:
head(cpd_pathways)

5. Selecting the top 3 pathways with the largest number of compounds to draw.

In [None]:
pathways_counts = list()
for (path_element in cpd_pathways) {
    if (path_element %in% names(pathways_counts)) {
        new_value = pathways_counts[[path_element]] + 1
        pathways_counts[path_element] = new_value       
    } else {
        pathways_counts[path_element] = 1 
    }
}

top_to_plot = names(tail(pathways_counts[order(unlist(pathways_counts))], 3))

In [None]:
top_to_plot

### 3.3. Drawing! <a id='part3_3'/>

1. As we are plotting absence/presence, we set the number of bins = 2, the scale in one direction, and use 1 as limit.

In [None]:
for (p in top_to_plot) {
    nude_id =  gsub("map", "", p)
    pathview(gene.data = kos_vector, 
             cpd.data = cpd_vector, 
             species = "ko", 
             pathway.id = nude_id, 
             bins=c(2, 2), 
             both.dirs = FALSE, 
             limit = c(1,1), 
             mid = c("#c9c9c9" , "#c9c9c9"), 
             high = c("#02b3ad" , "#d67e03")
    )
}

2. Cleaning the working directory.

In [None]:
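# Creating the output directory (and any missing parent directory)
dir.create("output_plots/metabolites", recursive = TRUE, showWarnings = FALSE)

# Copying the Pathview figures to the output directory
file.copy(from = list.files(pattern = "pathview\\.png$"), to = "./output_plots/metabolites/", overwrite = TRUE)

# Removing the PNG and XML files left in the working directory
png_files = list.files(path = ".", pattern = "\\.png$")
xml_files = list.files(path = ".", pattern = "\\.xml$")
files = c(png_files, xml_files)
unlink(files)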


3. This is one example of the Pathview outputs. The rest of the generated figures are stored in the `output_plots/metabolites/` directory. You can explore the files by clicking on them in the left-side panel.

<div style="max-width:1400px"><img src="output_plots/metabolites/ko05230.pathview.png" width="100%"></div>

In [None]:
?pathview

### References <a id='refs'/>

#### Datasets and databases papers:
Honeybee datasets (used in parts 1 and 2 of this notebook):
https://doi.org/10.1073/pnas.2000228117

Metabolites dataset (used in part 3):
https://doi.org/10.1038/s41564-018-0306-4

The Curated Gut Microbiome Metabolome Data Resource (used in part 3):
https://doi.org/10.1038/s41522-022-00345-5

MGnify pipeline:
https://doi.org/10.1093/nar/gkac1080

KEGG database:
https://doi.org/10.1093/nar/gkw1092


#### R libraries:
 - `library(ALDEx2)` Gloor GB, Macklaim JM, Fernandes AD (2016). Displaying Variation in Large Datasets: a Visual Summary of Effect Sizes. Journal of Computational and Graphical Statistics, 2016 http://doi.org/10.1080/10618600.2015.1131161. R package version 1.30.0.

 - `library(data.table)` Matt Dowle and Arun Srinivasan (2023). data.table: Extension of `data.frame`. R package version 1.14.8. https://CRAN.R-project.org/package=data.table    

 - `library(dplyr)` Hadley Wickham, Romain François, Lionel Henry, Kirill Müller and Davis Vaughan (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.2. https://CRAN.R-project.org/package=dplyr

 - `library(IRdisplay)` Thomas Kluyver, Philipp Angerer and Jan Schulz (NA). IRdisplay: 'Jupyter' Display Machinery. R package version 1.1. https://github.com/IRkernel/IRdisplay

 - `library(KEGGREST)` Dan Tenenbaum and Bioconductor Package Maintainer (2021). KEGGREST: Client-side REST access to the Kyoto Encyclopedia of Genes and Genomes (KEGG). R package version 1.38.0.

 - `library(MGnifyR)` Ben Allen (2022). MGnifyR: R interface to EBI MGnify metagenomics resource. R package version 0.1.0.

 - `library(pathview)` Luo, W. and Brouwer C., Pathview: an R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics, 2013, 29(14): 1830-1831, doi: 10.1093/bioinformatics/btt285. R package version 1.38.0.

 - `library(tidyjson)` Jeremy Stanley and Cole Arendt (2023). tidyjson: Tidy Complex 'JSON'. R package version 0.3.2. https://CRAN.R-project.org/package=tidyjson
  

#### Going deeper:

If you want to learn more about using MGnifyR, you can follow the online tutorial available here:
https://www.ebi.ac.uk/training/online/courses/metagenomics-bioinformatics/mgnifyr/

KEGGREST documentation with multiple examples:
https://www.bioconductor.org/packages/devel/bioc/vignettes/KEGGREST/inst/doc/KEGGREST-vignette.html

Pathview user manual:
https://www.bioconductor.org/packages/release/bioc/vignettes/pathview/inst/doc/pathview.pdf

