<div style="max-width:1200px"><img src="../_resources/mgnify_banner.png" width="100%"></div>

<img src="../_resources/mgnify_logo.png" width="200px">

[MGnifyR](http://github.com/beadyallen/mgnifyr) is a library that provides a set of tools for easily accessing and processing MGnify data in R, making queries to MGnify databases through the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/v1/). 
The benefit of MGnifyR is that data can either be fetched in tsv format or be directly combined in a phyloseq object to run an analysis in a custom workflow.

This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.


# Pathways Visualisation

This notebook was generated for the 2nd Biohackathon of the ELIXIR Food and Nutrition project. We use the DIME study results ([MGYS00006076](https://www.ebi.ac.uk/metagenomics/studies/MGYS00006076#overview)) on MGnify as an example of how to fetch and process metagenomic data.  

We aim to demonstrate how the MGnifyR tool can be used to fetch functional annotation results generated through the MGnify metagenomic analysis pipelines. We then use the metadata file from the [Phenotype database](https://dashin.eu/interventionstudies/) (not publicly available) to group the samples and run a comparative metagenomics analysis that finds differentially abundant genes using [ALDEx2](https://www.bioconductor.org/packages/devel/bioc/vignettes/ALDEx2/inst/doc/ALDEx2_vignette.html). Finally, we generate the pathway visualisations using [Pathview](https://bioconductor.org/packages/release/bioc/html/pathview.html) in R.


## Contents

- [Introduction](#intro)
    - [Minimal example 1. Fetching KEGG orthologs and modules tables from MGnify](#min_mgnifyr)
    - [Minimal example 2. Accessing KEGG pathway information using the KEGGREST API](#min_keggrest)
- [Part 1. Comparative analysis of microbiota response to low and high bioactive diets](#part1)
    - [1.1. Fetching metadata tables of DIME project from MGnify](#part1_1)
    - [1.2. Reading and formatting the DIME Phenotype table](#part1_2)
    - [1.3. Fetching KO tables from MGnify](#part1_3)
    - [1.4. Add the Start group label to the KO integrated table](#part1_4)
    - [1.5. Generating differential abundance count tables](#part1_5)
    - [1.6. Selecting pathways to draw](#part1_6)
    - [1.7. Drawing pathways!](#part1_7)
- [References](#refs)

In [None]:
# Loading libraries:
suppressMessages({
    library(ALDEx2)
    library(data.table)
    library(dplyr)
    library(IRdisplay)
    library(KEGGREST)
    library(MGnifyR)   
    library(pathview)
    library(tidyjson)
})
    
#display_markdown(file = 'assets/mgnifyr_help.md')

In [None]:
# Setting tables and figures size to display (these will be reset later):
options(repr.matrix.max.cols=150, repr.matrix.max.rows=500)
options(repr.plot.width=4, repr.plot.height=4)

In [None]:
# Setting up functions
# Collect the reference pathway maps ('map' IDs) linked to each KEGG ID in ids_list
collect_pathways <- function(ids_list) {
    pathways = list()
    # Discarding chemical structure (map010XX), global (map011XX), and overview (map012XX) maps
    excluded_prefixes = c("map010", "map011", "map012")
    for (id in ids_list) { 
        current_pathway = as.list(keggLink("pathway", id))
        for (index in grep("map", current_pathway)) {        
            # Strip the leading 'path:' prefix returned by keggLink
            clean_id = sub("^path:", "", current_pathway[[index]])
            if (!(substring(clean_id, 1, 6) %in% excluded_prefixes)) {
                pathways = append(pathways, clean_id)
            }
        }
    }
    return(pathways)
}
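
The filtering rule inside `collect_pathways` can be tried without querying KEGG. The sketch below is only an illustration: the `toy_linked` vector imitates the shape of a `keggLink("pathway", ...)` result (values with a `path:` prefix), its contents are made up, and `keep_reference_maps` is a hypothetical helper that is not used elsewhere in the notebook.

```r
# Made-up stand-in for a keggLink("pathway", <KO id>) result
toy_linked <- c("path:map00680", "path:ko00680", "path:map01100", "path:map01200")

# Hypothetical helper: keep 'map' IDs, drop map010XX, map011XX and map012XX
keep_reference_maps <- function(linked) {
    ids <- sub("^path:", "", linked[grepl("map", linked)])
    excluded <- c("map010", "map011", "map012")
    ids[!(substr(ids, 1, 6) %in% excluded)]
}

toy_kept <- keep_reference_maps(toy_linked)
print(toy_kept)
```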

In [None]:
# Create your session mgnify_client object
mg = mgnify_client(usecache = TRUE, cache_dir = '/home/jovyan/.mgnify_cache')

## Introduction <a id='intro'/>

The goal of this notebook is to demonstrate how to create KEGG pathway maps to visualise metabolic potential and metabolite production in metagenomic samples. We will use metabolic pathways annotated at the gene level by assigning a KEGG Orthology (KO) to putative protein sequences. These results are generated through the [MGnify v5.0 pipeline](https://doi.org/10.1093/nar/gkac1080) for metagenomic assemblies, as shown in the workflow schema below. We will also use the completeness estimation of KEGG modules available in the [MGnify web portal](https://www.ebi.ac.uk/metagenomics). Note that module completeness is determined by the shortest possible path that contains the essential steps of the pathway; therefore, even at 100% completeness there can still be gaps.

The [KEGG (Kyoto Encyclopedia of Genes and Genomes) database](https://www.genome.jp/kegg/) is a collection of biological databases, including genetic and metabolic pathways, diseases and drugs, protein-protein interactions, and gene expression. We use the KEGG database to make connections between genes and their biological functions, and define pathways containing the functions to be used as a resource for systems biology.  

Some key concepts we will use in this notebook are:  

1. KEGG Orthologs (KO): Orthologs are genes in different species that evolved from a common ancestral gene and typically retain similar functions. KEGG Orthologs are a set of manually curated genes and proteins grouped into categories based on their functional similarities, particularly their roles in biological pathways and molecular functions. KEGG Ortholog IDs are unique and always start with the letter 'K' (uppercase) followed by five digits.  

2. KEGG Modules: Modules are clusters of related genes that are involved in a specific biological process or pathway. KEGG module IDs start with 'M' followed by five digits.  

3. KEGG Pathways: Pathways are sets of interconnected biochemical reactions that form a chain leading from an initial reactant to a final product. KEGG pathways provide a high-level overview of the major metabolic pathways in an organism, while KEGG modules provide a more detailed view of the genes and reactions involved in a specific pathway. The IDs of manually drawn reference pathways start with the word 'map' followed by five digits. Another pathway prefix you will see in the notebook is 'ko', used for reference pathways that highlight KOs.  

For a clearer display of results, in this notebook we do not use chemical structure (map010XX), global (map011XX), or overview (map012XX) maps as templates.
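
The ID conventions described above can be written as regular expressions. This is a minimal sketch based only on the prefixes listed in this section; `classify_kegg_id` is a hypothetical helper, not part of KEGGREST or MGnifyR, and KEGG uses other prefixes that it does not cover.

```r
# Hypothetical helper: label an ID using the prefixes described in this notebook
classify_kegg_id <- function(id) {
    if (grepl("^K[0-9]{5}$", id)) return("ortholog")
    if (grepl("^M[0-9]{5}$", id)) return("module")
    if (grepl("^(map|ko)[0-9]{5}$", id)) return("pathway")
    "unknown"
}

toy_ids <- c("K00399", "M00567", "map00680", "ko00680", "k00399")
toy_kinds <- vapply(toy_ids, classify_kegg_id, character(1))
print(toy_kinds)
```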

<figure>
  <img src="https://www.ebi.ac.uk/metagenomics/static/5e55649e459d5f26ee6c.png" alt="MGnify assembly analysis pipeline v5.0 workflow schema" width="800px">
  <figcaption>MGnify assembly analysis pipeline v5.0</figcaption>
</figure>

The following sections of this introduction give two minimal examples of the main functions we will use to fetch and format the input tables for KEGG pathway visualisation.

### Minimal example 1. Fetching KEGG orthologs and modules tables from MGnify <a id='min_mgnifyr'/>


`MGnifyR` has pre-built functions to retrieve data from MGnify databases. The `mgnify_retrieve_json` function can be used to access results that are not available in tabular format. In this notebook we will fetch the pathway annotation and pathway completeness tables generated by the latest version of the MGnify analysis pipeline in JSON format and reformat them into dataframes to make the data easier to manipulate. 

1. Example of how to fetch the KO counts table for one sample with the analysis accession `MGYA00636312`:

In [None]:
ko_json = mgnify_retrieve_json(mg, path = 'analyses/MGYA00636312/kegg-orthologs')
ko_data = as.data.frame(ko_json %>% spread_all)[ , c("attributes.accession", "attributes.count")]

In [None]:
head(ko_data, 3)

2. And we can also fetch the modules completeness table for the same sample using the following code:

In [None]:
ko_comp_json = mgnify_retrieve_json(mg, path = 'analyses/MGYA00636312/kegg-modules')
ko_comp = as.data.frame(ko_comp_json %>% spread_all)[ , c("id", "attributes.completeness")]

In [None]:
head(ko_comp, 3)
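
To make the flattening step concrete without calling the API, the sketch below builds the same kind of two-column table from a nested R list using base R only. The record layout (an `id` plus an `attributes` element holding `accession` and `count`) is an assumption inferred from the column names selected above, and the values are made up.

```r
# Assumed record layout, inferred from the column names used above (values are made up)
toy_records <- list(
    list(id = "K00001", attributes = list(accession = "K00001", count = 12)),
    list(id = "K00002", attributes = list(accession = "K00002", count = 3))
)

# Pull one field per record into a column
toy_ko_table <- data.frame(
    ko_id = vapply(toy_records, function(r) r$attributes$accession, character(1)),
    count = vapply(toy_records, function(r) r$attributes$count, numeric(1)),
    stringsAsFactors = FALSE
)
print(toy_ko_table)
```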

### Minimal example 2. Accessing KEGG pathway information using the KEGGREST API <a id='min_keggrest'/>

[KEGGREST](https://www.bioconductor.org/packages/devel/bioc/vignettes/KEGGREST/inst/doc/KEGGREST-vignette.html) is a powerful tool with multiple functions to access and use data from KEGG databases. We can use `KEGGREST` to map between IDs in different databases and to find pathways associated with a module, ortholog, or compound. To display the list of KEGG databases we can query, run the `listDatabases()` command.

1. With the `keggLink` function we can find the IDs of the pathways to use as templates for drawing the module of methane production from CO2 (M00567). We find that module M00567 is present in 4 possible pathways.

In [None]:
as.list(keggLink("pathway", 'M00567'))

2. Using `keggFind` to find all the KOs associated with methyl-coenzyme M reductase (EC 2.8.4.1):

In [None]:
as.list(keggFind("ko", '2.8.4.1'))

## Part 1. Comparative analysis of microbiota response to low and high bioactive diets <a id='part1'/>

In this part of the notebook we will use all the samples for the DIME study `MGYS00006076`. The workflow includes the following steps:  

1. Fetch data from MGnify.  
2. Generate a simple phenotype table with the SubjectID and the Start group.
3. Download KO tables from MGnify and integrate the results into a single table.
4. Add the phenotype group label to the integrated KO table. We assume that the numerical section of SubjectID in the phenotype database corresponds to the prefix of the sample-name in MGnify.
5. Run [ALDEx2](https://www.bioconductor.org/packages/devel/bioc/vignettes/ALDEx2/inst/doc/ALDEx2_vignette.html) to find differentially abundant genes with KO annotation between the low and high bioactive diets.
6. Use the `effect` as scale to plot the pathways with the highest number of differentially abundant KOs. 
7. Draw pathways!

Note that the steps involving fetching KO tables and `KEGGREST` queries can take a few minutes to run.

Phenotype database id:  
Metadata for 20 individuals named under SubjectID as `DIME_[001-027]`  

MGnify results table id:  
MGnify accessions for 100 samples named under sample_sample-name as `[01-27]-[MP|V|v][1-4]`  
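
The naming assumption above (the numeric prefix of the MGnify sample name corresponds to the numeric part of SubjectID) can be sketched as follows. The sample names are made up to follow the pattern described, and the strict regular expression is one possible reading of that pattern.

```r
# Made-up sample names following the pattern described above
toy_sample_names <- c("01-MP1", "12-V3", "27-v4")

# Check the names against one possible reading of the pattern
toy_valid <- grepl("^[0-9]{2}-(MP|V|v)[1-4]$", toy_sample_names)

# Take the numeric prefix and zero-pad it to three digits
toy_subject_id <- sprintf("DIME_%03d", as.integer(sub("-.*", "", toy_sample_names)))
print(data.frame(sample = toy_sample_names, valid = toy_valid, SubjectID = toy_subject_id))
```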


### 1.1. Fetching metadata tables of DIME project from MGnify <a id='part1_1'/>

1. Fetching the list of analysis accessions using the study accession. This step takes about 6 minutes to run, as we are fetching the MGnify metadata of 100 samples.

In [None]:
all_accessions = mgnify_analyses_from_studies(mg, 'MGYS00006076')
all_metadata = mgnify_get_analyses_metadata(mg, all_accessions)

In [None]:
write.table(all_metadata, file = 'dime_mgnify_metadata.txt', sep = "\t", quote = FALSE, row.names = FALSE)

### 1.2. Reading and formatting the DIME Phenotype table <a id='part1_2'/>

1. The file `Dietary Bioactives and Microbiome Diversity (DIME)_SimpleTox.xls` was downloaded from the Phenotype database and exported as csv to `dime_phenotype_metadata.txt`.

In [None]:
phenotype_data = read.table("dime_phenotype_metadata.txt", header = TRUE, sep = ",")

In [None]:
start_group = unique(subset(phenotype_data, select = c('SubjectID', 'Start.group')))

In [None]:
head(start_group)

In [None]:
table(start_group$Start.group)

### 1.3. Fetching KO tables from MGnify <a id='part1_3'/>

1. Download and integrate KO counts tables. We will discard any sample having zero genes with KO annotation. This step takes several minutes to complete.

In [None]:
# Saving KO tables as a list of dataframes
list_of_dfs = list()
for (accession in all_accessions) {
    ko_loc = paste0('analyses/',accession,'/kegg-orthologs')
    ko_json = mgnify_retrieve_json(mg, path = ko_loc)
    ko_data = as.data.frame(ko_json %>% spread_all)[ , c("attributes.accession", "attributes.count")]
    colnames(ko_data) = c('ko_id', accession)
    list_of_dfs = append(list_of_dfs, list(ko_data)) 
}

In [None]:
# Integrating all dataframes into a single dataframe
# Reduce() starts from the first table, so its counts are kept
# (merging into an empty data.frame would discard the rows of the first table)
integrated_df = Reduce(function(x, y) merge(x, y, by = 'ko_id', all = TRUE), list_of_dfs)

## Cleaning the integrated table
# Using the KO id column as row names
row.names(integrated_df) = integrated_df$ko_id
integrated_df$ko_id = NULL

# Converting NA to zero 
integrated_df[is.na(integrated_df)] = 0

# Discarding samples in which all KO counts are zero
integrated_df = integrated_df %>% select_if(is.numeric) %>% select_if(~ sum(. != 0) > 0)
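
The integration step is an outer join on the KO identifier, with missing counts set to zero. A toy sketch with made-up counts and hypothetical accessions (`MGYA_A`, `MGYA_B`, `MGYA_C`) shows the shape of the result:

```r
# Made-up per-sample KO tables; the accession names are hypothetical
toy_a <- data.frame(ko_id = c("K00001", "K00002"), MGYA_A = c(5, 2))
toy_b <- data.frame(ko_id = c("K00002", "K00003"), MGYA_B = c(7, 1))
toy_c <- data.frame(ko_id = c("K00001"), MGYA_C = c(4))

# Outer join on ko_id across all tables
toy_merged <- Reduce(function(x, y) merge(x, y, by = "ko_id", all = TRUE),
                     list(toy_a, toy_b, toy_c))

# KO ids become row names; KOs absent from a sample get a count of zero
rownames(toy_merged) <- toy_merged$ko_id
toy_merged$ko_id <- NULL
toy_merged[is.na(toy_merged)] <- 0
print(toy_merged)
```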

### 1.4. Add the Start group label to the KO integrated table <a id='part1_4'/>

1. Reformatting the MGnify sample id to match the phenotype subject ID

In [None]:
magnify_subject = all_metadata %>% select('sample_sample-name')
head(magnify_subject)

In [None]:
magnify_subject <- magnify_subject %>%
  mutate(`sample_sample-name` = paste("DIME_0", sub("-.*", "", `sample_sample-name`), sep = ""))

In [None]:
magnify_subject$mgnify_accession <- rownames(magnify_subject)
rownames(magnify_subject) <- NULL

In [None]:
head(magnify_subject)

2. Generating the condition table based on the sample order in the KOs integrated table

In [None]:
ids_table = merge(start_group, magnify_subject, by.x = "SubjectID", by.y = "sample_sample-name", all.x = TRUE)

In [None]:
head(ids_table)

In [None]:
sorted_conds = list()
for (sample in colnames(integrated_df)) {
    # Start group of the subject this analysis accession belongs to
    group = ids_table[ids_table$mgnify_accession %in% sample,]$Start.group
    cond = paste(group, collapse = "")
    sorted_conds = append(sorted_conds, cond)    
}
vector_conds = unlist(sorted_conds)

In [None]:
table(vector_conds)
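
The condition vector must follow the column order of the count table. A compact way to check this alignment is `match()`, sketched here on made-up data; the accession names and the `High`/`Low` labels are hypothetical and may differ from the real `Start.group` values.

```r
# Made-up lookup table, deliberately in a different order from the count columns
toy_ids <- data.frame(
    mgnify_accession = c("MGYA_B", "MGYA_A", "MGYA_C"),
    Start.group = c("High", "Low", "High"),
    stringsAsFactors = FALSE
)
toy_count_columns <- c("MGYA_A", "MGYA_B", "MGYA_C")

# One label per count column, in column order; NA would flag an unmatched sample
toy_conds <- toy_ids$Start.group[match(toy_count_columns, toy_ids$mgnify_accession)]
print(toy_conds)
```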

### 1.5. Generating differential abundance count tables <a id='part1_5'/>

1. Running the ALDEx2 step. It takes about 5 minutes to complete.

In [None]:
x.all = aldex(integrated_df, 
              vector_conds, 
              mc.samples=128, 
              test="t", 
              effect=TRUE, 
              include.sample.summary=FALSE, 
              denom="all", 
              verbose=FALSE
        )

In [None]:
head(x.all, 3)

2. The `effect` column in the above output (`x.all` table) contains the median effect size, that is, the between-group difference in centred log-ratio values (`diff.btw`) scaled by the largest within-group dispersion (`diff.win`). Its sign gives the direction of the difference between the two groups, and its magnitude shows how large that difference is relative to the within-group variation. We will now generate the KO-effect matrix for plotting.
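
ALDEx2 reports its differences on centred log-ratio (clr) values. As a rough illustration only, the sketch below applies a plain clr transform to made-up counts and takes a simple between-group difference of means. It is not the ALDEx2 procedure, which also draws Monte Carlo instances and summarises across them; the pseudocount of 0.5 and the group labels `A` and `B` are assumptions made here.

```r
# clr of one sample: log2 values centred on their mean (pseudocount avoids log of zero)
toy_clr <- function(counts, pseudo = 0.5) {
    logged <- log2(counts + pseudo)
    logged - mean(logged)
}

# Made-up counts: 3 KOs (rows) in 4 samples (columns)
toy_counts <- matrix(c(10, 20, 70,
                       12, 18, 70,
                       40, 20, 40,
                       38, 22, 40),
                     nrow = 3,
                     dimnames = list(c("K1", "K2", "K3"), c("s1", "s2", "s3", "s4")))
toy_group <- c("A", "A", "B", "B")

toy_clr_values <- apply(toy_counts, 2, toy_clr)

# Simple between-group difference per KO (group B minus group A)
toy_diff <- rowMeans(toy_clr_values[, toy_group == "B"]) -
            rowMeans(toy_clr_values[, toy_group == "A"])
print(round(toy_diff, 2))
```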

In [None]:
ko_matrix = data.matrix(subset(x.all, select = c('effect')))

In [None]:
head(ko_matrix, 3)

3. Plotting the effect size (`effect`) and difference (`diff.btw`) to show an overview of differentially abundant functions. Two types of P-value from Welch's t-test are shown: blue marks show the expected P-value and red marks show the expected Benjamini-Hochberg corrected P-value. The dashed threshold line indicates P-value = 0.05.

In [None]:
options(repr.plot.width=10, repr.plot.height=8)

# Effect size plot
par(mfrow=c(1,2))
plot(x.all$effect, 
    x.all$we.ep, 
    log="y", 
    cex=0.7, 
    col=rgb(0,0,1,0.2),  # Blue marks for expected P value of Welch’s t test
    pch=19, 
    xlab="Effect size", 
    ylab="P value", 
    main="Effect size plot")
points(x.all$effect, 
    x.all$we.eBH, 
    cex=0.7, 
    col=rgb(1,0,0,0.2), # Red marks for expected Benjamini-Hochberg corrected P value of Welch’s t test
    pch=19)
abline(h=0.05, lty=2, col="grey")
legend(-0.5,0.0005, legend=c("P value", "BH-adjusted"), pch=19, col=c("blue", "red"))

# Volcano plot
plot(x.all$diff.btw, 
    x.all$we.ep, 
    log="y", 
    cex=0.7, 
    col=rgb(0,0,1,0.2), # Blue marks for expected P value of Welch’s t test
    pch=19, 
    xlab="Difference", 
    ylab="P value", 
    main="Volcano plot")
points(x.all$diff.btw, 
    x.all$we.eBH, 
    cex=0.7, 
    col=rgb(1,0,0,0.2), # Red marks for expected Benjamini-Hochberg corrected P value of Welch’s t test
    pch=19)
abline(h=0.05, lty=2, col="grey")
legend(-2,0.0005, legend=c("P value", "BH-adjusted"), pch=19, col=c("blue", "red"))
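
To see why the red (corrected) points sit above the blue ones, the Benjamini-Hochberg adjustment can be tried on a few made-up P-values with base R's `p.adjust`. This only illustrates the correction itself; ALDEx2 reports expected values across its Monte Carlo instances rather than a single adjustment like this one.

```r
# Made-up raw P-values
toy_p <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg adjustment
toy_bh <- p.adjust(toy_p, method = "BH")

print(data.frame(p = toy_p, bh = toy_bh,
                 raw_significant = toy_p < 0.05,
                 bh_significant = toy_bh < 0.05))
```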

4. Reporting features detected by either the Welch's or the Wilcoxon test (blue) or by both tests (red).

In [None]:
options(repr.plot.width=8, repr.plot.height=8)

found.by.all <- which(x.all$we.eBH < 0.05 & x.all$wi.eBH < 0.05)
found.by.one <- which(x.all$we.eBH < 0.05 | x.all$wi.eBH < 0.05)

plot(x.all$diff.win, x.all$diff.btw, pch=19, cex=1, col=rgb(0,0,0,0.3),
 xlab="Dispersion", ylab="Difference")
points(x.all$diff.win[found.by.one], x.all$diff.btw[found.by.one], pch=19,
 cex=1, col=rgb(0,0,1,0.5))
points(x.all$diff.win[found.by.all], x.all$diff.btw[found.by.all], pch=19,
 cex=1, col=rgb(1,0,0,1))
abline(0,1,lty=2)
abline(0,-1,lty=2)
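
The two index vectors above are an intersection and a union of the same thresholds, so every feature found by both tests is also in the "either" set and is simply drawn over in red. A toy sketch with made-up adjusted P-values:

```r
# Made-up BH-adjusted P-values for four features
toy_we <- c(0.01, 0.20, 0.03, 0.50)   # Welch's t-test
toy_wi <- c(0.02, 0.04, 0.30, 0.60)   # Wilcoxon test

toy_both <- which(toy_we < 0.05 & toy_wi < 0.05)     # significant in both tests
toy_either <- which(toy_we < 0.05 | toy_wi < 0.05)   # significant in at least one test
toy_only_one <- setdiff(toy_either, toy_both)        # significant in exactly one test

print(list(both = toy_both, either = toy_either, only_one = toy_only_one))
```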


### 1.6. Selecting pathways to draw<a id='part1_6'/>

1. We will use the union of both testing methods (Welch's or Wilcoxon) to find the pathways with differentially abundant KOs.

In [None]:
# KO ids (row names of the ALDEx2 output) found by at least one test
kos_list = rownames(x.all)[found.by.one]

In [None]:
ko_pathways = collect_pathways(kos_list)

In [None]:
head(ko_pathways)

2. Finding the pathways with the highest number of significant KOs.

In [None]:
pathways_counts = list()
for (path_element in ko_pathways) {
    if (path_element %in% names(pathways_counts)) {
        new_value = pathways_counts[[path_element]] + 1
        pathways_counts[path_element] = new_value       
    } else {
        pathways_counts[path_element] = 1 
    }
}

In [None]:
top_to_plot = names(tail(pathways_counts[order(unlist(pathways_counts))], 10))
top_to_plot
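
The counting and ranking steps can also be expressed with `table()` and `sort()`. This is an alternative sketch on made-up pathway IDs, not the code used above:

```r
# Made-up list of pathway IDs, one entry per significant KO that maps to the pathway
toy_pathways <- c("map00680", "map00010", "map00680", "map00620", "map00010", "map00680")

# Count occurrences and rank from most to least frequent
toy_counts_table <- sort(table(toy_pathways), decreasing = TRUE)
toy_top <- names(head(toy_counts_table, 2))

print(toy_counts_table)
print(toy_top)
```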

### 1.7. Drawing pathways!<a id='part1_7'/>

1. Effect sizes can be negative or positive, so we colour both directions when plotting (`both.dirs = TRUE`). By default, Pathview maps values onto a colour scale from -1 to 1 (the `limit` argument).

In [None]:
for (p in top_to_plot) {
    nude_id =  gsub("map", "", p)
    pathview(gene.data = ko_matrix, 
        species = "ko", 
        pathway.id = nude_id, 
        both.dirs = TRUE, 
        low = c("#bd066b", "#bd066b"),  
        mid = c("#c9c9c9" , "#c9c9c9"), 
        high = c("#02b3ad" , "#02b3ad")
    )
}

2. Cleaning the working directory.

In [None]:
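# Create the output directory (and its parent) if it does not exist yet
dir.create("output_plots/dime_comp", recursive = TRUE, showWarnings = FALSE)

# list.files() takes a regular expression, not a shell glob
file.copy(from = list.files(pattern = "pathview\\.png$"), to = "./output_plots/dime_comp/", overwrite = TRUE)

# Removing the png and xml files left in the working directory
png_files = list.files(path = ".", pattern = "\\.png$")
xml_files = list.files(path = ".", pattern = "\\.xml$")
files = c(png_files, xml_files)
unlink(files)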
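
`list.files()` interprets `pattern` as a regular expression rather than a shell-style glob. The sketch below compares an anchored regular expression with the output of `glob2rx()` on made-up file names, without touching the file system:

```r
# Made-up file names
toy_files <- c("ko00680.pathview.png", "ko00680.png", "ko00680.xml", "notes_png.txt")

# Anchored regular expressions for the two kinds of file handled in this step
toy_png_regex <- "\\.png$"
toy_pathview_regex <- "\\.pathview\\.png$"

# glob2rx() converts a shell-style glob into an equivalent regular expression
toy_from_glob <- utils::glob2rx("*.png")

print(toy_files[grepl(toy_png_regex, toy_files)])
print(toy_files[grepl(toy_pathview_regex, toy_files)])
print(toy_from_glob)
```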

3. This is one example of the plots generated by Pathview. The rest of the outputs are stored in the `output_plots/dime_comp/` directory.

In [None]:
display_png(file='./output_plots/dime_comp/ko00680.pathview.png')

In [None]:
# Get pathview help
?pathview

### References <a id='refs'/>

#### Datasets and databases papers:
MGnify pipeline:
https://doi.org/10.1093/nar/gkac1080

KEGG database:
https://doi.org/10.1093/nar/gkw1092

#### R libraries:
 - `library(ALDEx2)` Gloor GB, Macklaim JM, Fernandes AD (2016). Displaying Variation in Large Datasets: a Visual Summary of Effect Sizes. Journal of Computational and Graphical Statistics, 2016 http://doi.org/10.1080/10618600.2015.1131161. R package version 1.30.0.

 - `library(data.table)` Matt Dowle and Arun Srinivasan (2023). data.table: Extension of `data.frame`. R package version 1.14.8. https://CRAN.R-project.org/package=data.table    

 - `library(dplyr)` Hadley Wickham, Romain François, Lionel Henry, Kirill Müller and Davis Vaughan (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.2. https://CRAN.R-project.org/package=dplyr

 - `library(IRdisplay)` Thomas Kluyver, Philipp Angerer and Jan Schulz (NA). IRdisplay: 'Jupyter' Display Machinery. R package version 1.1. https://github.com/IRkernel/IRdisplay

 - `library(KEGGREST)` Dan Tenenbaum and Bioconductor Package Maintainer (2021). KEGGREST: Client-side REST access to the Kyoto Encyclopedia of Genes and Genomes (KEGG). R package version 1.38.0.

 - `library(MGnifyR)` Ben Allen (2022). MGnifyR: R interface to EBI MGnify metagenomics resource. R package version 0.1.0.

 - `library(pathview)` Luo, W. and Brouwer C., Pathview: an R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics, 2013, 29(14): 1830-1831, doi: 10.1093/bioinformatics/btt285. R package version 1.38.0.

 - `library(tidyjson)` Jeremy Stanley and Cole Arendt (2023). tidyjson: Tidy Complex 'JSON'. R package version 0.3.2. https://CRAN.R-project.org/package=tidyjson
  

#### Going deeper:

If you want to learn more about using MGnifyR, you can follow the online tutorial available here:
https://www.ebi.ac.uk/training/online/courses/metagenomics-bioinformatics/mgnifyr/

KEGGREST documentation with multiple examples:
https://www.bioconductor.org/packages/devel/bioc/vignettes/KEGGREST/inst/doc/KEGGREST-vignette.html

Pathview user manual:
https://www.bioconductor.org/packages/release/bioc/vignettes/pathview/inst/doc/pathview.pdf

