<img src="../_resources/mgnify_logo.png" width="200px">

# Cross-Study analysis, using MGnifyR

The [MGnify API](https://www.ebi.ac.uk/metagenomics/api/v1) returns data and relationships as JSON. 
[MGnifyR](https://github.com/beadyallen/MGnifyR) is a package to help you read MGnify data into your R analyses.

**This example shows you how to perform an analysis across multiple Studies. It compares taxonomic diversity between samples from two locations, each covered by a different Study.**

You can also explore the API using the [Browsable API interface in your web browser](https://www.ebi.ac.uk/metagenomics/api/v1).

This is an interactive code notebook (a Jupyter Notebook).
To run this code, click into each cell and press the ▶ button in the top toolbar, or press `shift+enter`.

---

In [43]:
library(IRdisplay)
display_markdown(file = '../_resources/mgnifyr_help.md')

# Help with MGnifyR

MGnifyR is an R package that provides a convenient way for R users to access data from [the MGnify API](https://www.ebi.ac.uk/metagenomics/api/).

Detailed help for each function is available in R using the standard `?function_name` command (e.g. typing `?mgnify_query` will bring up built-in help for the `mgnify_query` command).

A vignette is available containing a reasonably verbose overview of the main functionality. 
This can be read either within R with the `vignette("MGnifyR")` command, or [in the development repository](https://htmlpreview.github.io/?https://github.com/beadyallen/MGnifyR/blob/master/doc/MGnifyR.html).

## MGnifyR Command cheat sheet

The following list of key functions should give a starting point for finding relevant documentation.

- `mgnify_client()` : Create the client object required for all other functions.
- `mgnify_query()` : Search the whole MGnify database.
- `mgnify_analyses_from_xxx()` : Convert xxx accessions to analysis accessions, where xxx is either `samples` or `studies`.
- `mgnify_get_analyses_metadata()` : Retrieve all study, sample and analysis metadata for given analyses.
- `mgnify_get_analyses_phyloseq()` : Convert abundance, taxonomic, and sample metadata into a single phyloseq object.
- `mgnify_get_analyses_results()` : Get functional annotation results for a set of analyses.
- `mgnify_download()` : Download raw results files from MGnify.
- `mgnify_retrieve_json()` : Low level API access helper function.


Load libraries:

In [None]:
library(vegan)
library(ggplot2)
library(phyloseq)
library(MGnifyR)

# Create the client; cached API responses are stored under /tmp/mgnify_cache
mg <- mgnify_client(usecache = TRUE, cache_dir = '/tmp/mgnify_cache')

# Example: compare taxonomic abundances of two soil studies
*This example is based on an [MGnify workshop exercise](https://beadyallen.github.io/MGnifyR/Exercises.html) created by [Ben Allen](https://github.com/beadyallen) (the author of MGnifyR).*

## Fetch the MGnify Analyses accessions for each of two Studies
(one with samples from Malaysia, one with samples from Panama)

In [None]:
panama <- mgnify_analyses_from_studies(mg, 'MGYS00003920')
malaysia <- mgnify_analyses_from_studies(mg, 'MGYS00003918')

Join the Analyses accession lists

In [None]:
accessions <- c(panama, malaysia)
sprintf('There are %d accessions between the studies', length(accessions))

Fetch metadata for all of the Analyses from the MGnify API

In [None]:
metadata <- mgnify_get_analyses_metadata(mg, accessions)
head(metadata)

## Taxonomic analysis
First, build a [phyloseq](https://joey711.github.io/phyloseq/) object that combines the abundance, taxonomic and sample data for these Analyses

In [None]:
ps <- mgnify_get_analyses_phyloseq(mg, metadata$analysis_accession)

#### Filter out low-abundance samples
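Filter out samples with low total abundances, keeping only those that fall within the main body of the distribution. This is particularly important given the normalisation approach we're about to use ([`phyloseq`'s slightly controversial `rarefy_even_depth`](https://www.rdocumentation.org/packages/phyloseq/versions/1.16.2/topics/rarefy_even_depth)): by default it subsamples every sample down to the size of the smallest one, so a few very small samples would force most of the data to be thrown away.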

We make a histogram and note that samples with abundances $\lt 10^3$ i.e. $\lt 1000$ seem to be outliers.
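
To make the cutoff concrete, here is a minimal base R sketch on made-up totals. The vector `toy_sums` and its values are invented for illustration and stand in for what `sample_sums(ps)` would return; they are not taken from either Study.

```r
# Hypothetical per-sample totals; these numbers are invented for illustration.
toy_sums <- c(s1 = 120, s2 = 850, s3 = 15000, s4 = 42000, s5 = 98000)

# On a log10 scale, the cutoff of 1000 sits at 3.
log_sums <- log10(toy_sums)

# Keep only samples strictly above the cutoff.
keep <- toy_sums > 1000
toy_good <- toy_sums[keep]

print(round(log_sums, 2))
print(names(toy_good))
```

The same logical comparison is what drives the `subset_samples` call in the cells that follow.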

In [None]:
hist(log10(sample_sums(ps)), breaks = 50)

In [None]:
ps_good <- subset_samples(ps, sample_sums(ps) > 1000)
hist(log10(sample_sums(ps_good)), breaks = 50)

#### Estimate richness

Use `estimate_richness` to calculate various diversity measures for each analysis.

To read the documentation on this method, enter `?estimate_richness` in a Code Cell and run it.
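
For intuition about two of the measures it reports, the following base R sketch computes observed richness and a Shannon index by hand for a single invented sample. The counts in `toy_counts` are hypothetical, and the natural-log form of the Shannon index is assumed here; check `?estimate_richness` for the exact definitions phyloseq uses.

```r
# Hypothetical taxon counts for a single sample (invented numbers).
toy_counts <- c(taxonA = 50, taxonB = 30, taxonC = 20, taxonD = 0)

# Observed richness: the number of taxa with a non-zero count.
observed <- sum(toy_counts > 0)

# Shannon index: -sum(p * log(p)) over the taxa that are present,
# where p is each taxon's share of the sample total.
p <- toy_counts[toy_counts > 0] / sum(toy_counts)
shannon <- -sum(p * log(p))

print(observed)
print(shannon)
```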

In [None]:
#  ?estimate_richness

In [None]:
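# Calculate the diversity measures for every analysis that passed the filter
diversity <- estimate_richness(ps_good)
head(diversity)

# Show the analysis (or analyses) with the highest observed richness
diversity[diversity$Observed == max(diversity$Observed),]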

#### Normalise the data by rarefaction
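
As a rough picture of what rarefying to an even depth means, the base R sketch below subsamples each column of an invented count matrix down to the smallest sample total. It is a simplified illustration, not phyloseq's implementation: `toy_counts` and the helper `rarefy_column` are hypothetical, reads are drawn without replacement, and `rarefy_even_depth` has further options (for example a `replace` argument and trimming of empty taxa) that are ignored here.

```r
set.seed(1)  # fixed seed so the random subsample is repeatable

# Hypothetical taxa-by-sample count matrix (invented numbers).
toy_counts <- matrix(
  c(60, 30, 10,
    500, 300, 200),
  nrow = 3,
  dimnames = list(c("taxonA", "taxonB", "taxonC"), c("shallow", "deep"))
)

# Target depth: the smallest sample total.
depth <- min(colSums(toy_counts))

# Subsample one column of counts down to `depth` reads without replacement.
rarefy_column <- function(counts, depth) {
  reads <- rep(seq_along(counts), times = counts)  # one entry per read
  picked <- sample(reads, size = depth)            # draw without replacement
  tabulate(picked, nbins = length(counts))         # count reads per taxon
}

toy_rare <- apply(toy_counts, 2, rarefy_column, depth = depth)
rownames(toy_rare) <- rownames(toy_counts)

print(colSums(toy_rare))
```

After subsampling, every column has the same total, so richness estimates are no longer driven by differences in sequencing depth.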

In [None]:
ps_rare <- rarefy_even_depth(ps_good)
div_rare <- estimate_richness(ps_rare)

In [None]:
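# Join diversity values to metadata on row names (by = 0);
# all.y = FALSE drops metadata rows that have no diversity values
merged_df <- merge(div_rare, metadata, by = 0, all.y = FALSE)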
head(merged_df)
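
The `by = 0` argument above tells `merge` to join on row names rather than on a named column. A minimal sketch with two invented data frames shows the effect; `toy_div`, `toy_meta` and their values are hypothetical.

```r
# Hypothetical diversity table, keyed by row name (invented values).
toy_div <- data.frame(Observed = c(120, 95), row.names = c("A1", "A2"))

# Hypothetical metadata table with one extra row that has no diversity values.
toy_meta <- data.frame(
  location = c("Panama", "Malaysia", "Panama"),
  row.names = c("A1", "A2", "A3")
)

# by = 0 joins on row names; all.y = FALSE drops metadata rows with no match.
toy_merged <- merge(toy_div, toy_meta, by = 0, all.y = FALSE)

print(toy_merged)
```

The row names end up in a new `Row.names` column of the merged data frame, and the unmatched metadata row `A3` is dropped.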

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)
ggplot(merged_df, aes(x = `sample_geo-loc-name`, y = Observed)) +
  geom_boxplot() +
  theme(text = element_text(size = 20))