# MGnify tutorial
This tutorial is a self-guided journey through some of the features of the [MGnify](https://www.ebi.ac.uk/metagenomics) platform.
We will look at how to search for, browse, and download different kinds of datasets derived from metagenomic samples.

This tutorial is a JupyterLite notebook, which means there is a Python interpreter running in your web browser. This requires an up-to-date browser. The Python that runs here is a bit different to "normal" Python, so keep that in mind if you're trying to write your own code here.

Most of the notebook is instructions and quiz questions, but there is also a bit of Python code if you're interested in how to access the data programmatically. Knowing and using Python is not necessary to follow most of this tutorial.

<div style="background: #d0debb; padding: 16px; border-radius: 4px; margin: 8px">
To use this notebook, click the <b>Run</b> menu at the top, and then <b>Run all cells</b>.
</div>

You can also run individual cells using the Play icon at the top, or press Shift+Enter in any cell to run it.

For help, contact `sandyr@ebi.ac.uk`

---

# Part 1: Exploring MGnify Studies
## Task 1.1: Search for a MGnify study

Go to the [MGnify website](https://www.ebi.ac.uk/metagenomics), and **use the "Text Search" feature to search for studies of marine sediment collected using an "ROV" (that's a _remotely operated vehicle_) from the Environmental > Aquatic > Marine > Sediment biome.**

_Hint: the use of an ROV is only mentioned in the study description, there is no specific metadata field for this._

In [2]:
%pip install -q ipywidgets

In [3]:
from utils.quiz import check_answer, _encode_answer, question

question(
    "What is the MGnify Study accession (MGYS) of the first study that matches those parameters?",
    "MGYSxxx",
    [b'bWd5czAwMDA1ODUx']
)


## Task 1.2: Find metadata for a study

Follow the link on the study result above, to reach the detail page for that study on MGnify.



In [4]:
from utils.quiz import check_answer, _encode_answer, question

question(
    "Off the coast of which continent were most of the study's samples collected?",
    "",
    [b'YWZyaWNh'],
    ["Africa", "Antarctica", "Asia", "Europe", "North America", "Oceania", "South America"]
)


Not all metadata are neatly organised in structured fields!

Very often, the information you're interested in is only mentioned e.g. in the "Methods" section of a publication.

**Use the "metadata from Europe PMC Annotations" feature to discover published mentions of metadata about this study.**

In [5]:
from utils.quiz import check_answer, _encode_answer, question

question(
    "What is the deepest sampling *site* mentioned in the publication annotations?",
    "",
    [b'OTMwMCAtIDEwMDEwIG0='],
    ["1000 m", "2560 m", "7720 - 8085 m", "9300 - 10010 m"]
)


## Task 1.3: Finding common taxa in a study

MGnify produces "Analysis summaries" for each Study. These summarise the taxonomic annotations over all of the sequencing runs (or assemblies) in the study. They are useful for getting a quick overview of the count of taxa present in each analysed sample.

For a study of amplicon sequences, sometimes there will be multiple analysis summaries originating from different databases (e.g. SSU, LSU, ITSone).

**For the study [MGYS00005851](https://www.ebi.ac.uk/metagenomics/studies/MGYS00005851#analysis), find the most common Bacterial phyla using the SSU taxonomic summary.**

You could do this by downloading a TSV file and analysing it using something like Excel or R.

Or, you can use the following Python code for a programmatic approach.

In [13]:
import json
from js import fetch
import pandas as pd
from io import StringIO

In [15]:
# Download the TSV file programmatically:
res = await fetch('https://www.ebi.ac.uk/metagenomics/api/v1/studies/MGYS00005851/pipelines/5.0/file/ERP116704_phylum_taxonomy_abundances_SSU_v5.0.tsv')
text = await res.text()
# in a "normal" Python environment, this could be e.g. res = requests.get("https://..."); text = res.text
# here we use js.fetch since we're running Python in the browser.

# Read the TSV into a pandas dataframe
# (despite the "itsone" variable name, this file is the SSU taxonomic summary)
itsone_taxonomies = pd.read_csv(StringIO(text), sep='\t')
itsone_taxonomies

Unnamed: 0,superkingdom,kingdom,phylum,ERR5008478_MERGED_FASTQ,ERR3684824_MERGED_FASTQ,ERR3684808_MERGED_FASTQ,ERR3630748_MERGED_FASTQ,ERR3630676_MERGED_FASTQ,ERR3630672_MERGED_FASTQ,ERR3630669_MERGED_FASTQ,...,ERR5007880_MERGED_FASTQ,ERR3684836_MERGED_FASTQ,ERR3684832_MERGED_FASTQ,ERR3684828_MERGED_FASTQ,ERR3684820_MERGED_FASTQ,ERR3684817_MERGED_FASTQ,ERR3684816_MERGED_FASTQ,ERR3684814_MERGED_FASTQ,ERR3684812_MERGED_FASTQ,ERR3684811_MERGED_FASTQ
0,Archaea,Unassigned,Candidatus_Aenigmarchaeota,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Archaea,Unassigned,Candidatus_Altiarchaeota,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Archaea,Unassigned,Candidatus_Bathyarchaeota,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Archaea,Unassigned,Candidatus_Diapherotrites,1940,1,0,0,0,364,1413,...,1025,0,0,0,0,0,0,0,0,0
4,Archaea,Unassigned,Candidatus_Heimdallarchaeota,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
197,Eukaryota,Unassigned,Unassigned,0,0,22,0,69163,8,40,...,19,0,0,15182,0,16,94043,0,0,0
198,Eukaryota,Viridiplantae,Chlorophyta,0,0,0,0,57,0,0,...,0,0,0,0,0,195,0,0,0,0
199,Eukaryota,Viridiplantae,Streptophyta,0,0,0,0,711,0,0,...,0,0,0,52,0,0,8995,1,120,1
200,Eukaryota,Viridiplantae,Unassigned,0,0,1,0,1287,0,1,...,0,0,0,189,0,0,32958,0,0,0
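
Outside JupyterLite, `js.fetch` is not available. As a rough sketch (not part of the original notebook), the same download could be done with the Python standard library alone. `download_text` and `parse_tsv` are hypothetical helper names chosen for illustration; the URL is the one used in the cell above.

```python
import csv
import io
import urllib.request

# Same summary file as in the notebook cell above.
TSV_URL = (
    "https://www.ebi.ac.uk/metagenomics/api/v1/studies/MGYS00005851/"
    "pipelines/5.0/file/ERP116704_phylum_taxonomy_abundances_SSU_v5.0.tsv"
)


def download_text(url, timeout=60):
    """Fetch a URL and return its body decoded as UTF-8 text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_tsv(text):
    """Parse tab-separated text into a list of dicts, one per data row.

    All values are returned as strings; convert counts with int() as needed.
    """
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))
```

For example, `rows = parse_tsv(download_text(TSV_URL))` should give one dictionary per taxonomy row. The downloaded text can also be passed to `pd.read_csv(StringIO(text), sep='\t')` exactly as in the notebook cell.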


In [16]:
# Sum the taxonomy counts across all runs in the study
itsone_taxonomies['study_total'] = itsone_taxonomies.filter(like="ERR").sum(axis=1)

# Limit to phyla in the bacteria superkingdom
# (set_index returns a new dataframe, which avoids modifying a filtered slice in place)
itsone_bacteria = itsone_taxonomies[itsone_taxonomies.superkingdom == 'Bacteria'].set_index('phylum')

# Show just the most prevalent phyla
itsone_bacteria.sort_values(by="study_total", ascending=False).head()

phylum,superkingdom,kingdom,ERR5008478_MERGED_FASTQ,ERR3684824_MERGED_FASTQ,ERR3684808_MERGED_FASTQ,ERR3630748_MERGED_FASTQ,ERR3630676_MERGED_FASTQ,ERR3630672_MERGED_FASTQ,ERR3630669_MERGED_FASTQ,ERR3630659_MERGED_FASTQ,...,ERR3684836_MERGED_FASTQ,ERR3684832_MERGED_FASTQ,ERR3684828_MERGED_FASTQ,ERR3684820_MERGED_FASTQ,ERR3684817_MERGED_FASTQ,ERR3684816_MERGED_FASTQ,ERR3684814_MERGED_FASTQ,ERR3684812_MERGED_FASTQ,ERR3684811_MERGED_FASTQ,study_total
Proteobacteria,Bacteria,Unassigned,37,33776,16122,0,0,213496,174258,0,...,0,0,0,0,0,422,139642,882782,0,114774122
Planctomycetes,Bacteria,Unassigned,20,3,303,0,0,55005,53880,0,...,0,0,0,0,0,0,1,6,0,42091107
Chloroflexi,Bacteria,Unassigned,0,1,10,0,0,16231,10801,1,...,0,0,0,0,0,0,3,7,0,20230213
Bacteroidetes,Bacteria,Unassigned,11,17,1332,0,0,17983,30774,0,...,0,0,0,0,0,0,4,19,0,19931577
Acidobacteria,Bacteria,Unassigned,8,4,9,2,1,49711,37510,0,...,0,0,0,0,0,0,4,17,0,17238080


## Task 1.4: Think about the caveats...

**Was what we just did (summing bacterial taxonomic counts across all samples of the study) a good idea?**

Things to consider: (Hint: Check the study page's publication annotations from Schauberger _et al._)
1. Were all of the sequencing runs targeting the 16S SSU region?
2. Were the primers used likely to sequence the same taxa across all samples?

<details>
<summary>Tell me</summary>
The publication mentions that all samples were prepared with primers targeting the 16S rRNA gene, so the SSU taxonomic analysis summary is fair to use across the study. However, it also mentions that some samples were prepared with primers specific to archaea, so there is likely a bias towards different taxa in different samples. This means our basic summation across all samples is likely not an unbiased measurement.
</details>
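
A related point, offered here as an assumption rather than something the publication states: a raw sum also lets runs with many reads dominate the total. One common way to reduce that effect is to convert each run's counts to fractions before comparing phyla. The sketch below is not part of the original notebook; `relative_abundance` is a hypothetical helper, and it does not correct for the primer bias described above.

```python
import pandas as pd


def relative_abundance(taxonomies: pd.DataFrame, run_prefix: str = "ERR") -> pd.DataFrame:
    """Convert per-run counts into per-run fractions (each run column sums to 1).

    Run columns are selected by a substring match on `run_prefix`, as in the
    notebook's own `filter(like="ERR")` call. Runs with zero total reads
    come out as NaN rather than raising a division error.
    """
    counts = taxonomies.filter(like=run_prefix)
    totals = counts.sum(axis=0)
    # Replace zero totals with NaN so that empty runs yield NaN fractions.
    return counts.div(totals.where(totals > 0), axis="columns")
```

With the dataframe from Task 1.3, `relative_abundance(itsone_taxonomies).mean(axis=1)` would give the mean fraction of each row across runs, which could then be sorted in place of `study_total`.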

---

# Part 2: Exploring MGnify Analyses

## Task 2.1: Search for a MGnify analysis

Let's say we're particularly interested in the Chloroflexi phylum (also known as Chloroflexota), the third most prevalent Bacterial phylum in the study we were looking at in Part 1.

Go to the [MGnify website](https://www.ebi.ac.uk/metagenomics/search/analyses), and **use the "Text Search" feature to search for _Sample analyses_ that contain Organisms in the Bacteria > Chloroflexi lineage and were collected from the Environmental > Aquatic > Marine > Sediment biome. Limit the search to samples analysed with MGnify's pipeline version 5.0 (the latest).**

In [6]:
from utils.quiz import check_answer, _encode_answer, question

question(
    "Look at the 'Experiment type' filter on the left of the search results: there are some <b>assembly</b> analyses.<br/>Why might we want to look into these next?",
    "",
    [b'dG8gZGlzY292ZXIgZnVuY3Rpb25hbCBhbm5vdGF0aW9ucw=='],
    ["To browse complete genomes", "To get more taxonomic diversity", "To discover functional annotations"]
)


## Task 2.2: Find annotations within an analysis
**Limit your search to just those Assembly analyses.**

There are more filters we could use here, like location, or filtering to find analyses where a specific GO or InterPro functional annotation has been found on the assembled contigs.

For now, **scroll through the list to find MGYA00594115, and click it to open the analysis page.**

In [13]:
from utils.quiz import check_answer, _encode_answer, question

question(
    "According to the charts in the <b>Taxonomic analysis</b> section's SSU results, what fraction of the phylum composition is assigned to Chloroflexi?",
    "x.yy %",
    [b'NC40OSAl', b'NC40OSU=', b'NC40OQ=='],
)


Now turn to the functional analysis information, in particular **the Pathways/Systems section**, which shows functional annotations of the metagenomic assembly, based on comparing the protein-coding sequences found in the assembly's contigs to databases of protein function.

In [26]:
from utils.quiz import check_answer, _encode_answer, question

question(
    "What is the first 'complete' KEGG Module for this assembly analysis?<br/>(Complete means all of the KEGG Orthologs in that module are present in the assembly.)",
    "Mxxxxx",
    [b'bTAwMDAx', b'MDAwMDE='],
)


## Task 2.3: Think about what this means...

**Does this mean that the KEGG Module you just found is necessarily a function performed by organisms within the taxa we originally searched for (Chloroflexi)?**

_Consider that the functional annotations are on contigs from the entire assembly..._

---

# Part 3: Metagenome-Assembled Genomes

To find functional annotations for a specific species, we can use metagenome-assembled genomes (MAGs): draft genomes made by taxonomically binning assembled contigs.
[MGnify Genomes](https://www.ebi.ac.uk/metagenomics/browse/genomes) is MGnify's resource for MAGs, where MAGs derived from various biomes are organised into species-level clusters and annotated with a suite of functional annotation tools.

## Task 3.1: Find Chloroflexi MAGs

**Go to the [MGnify Genomes website](https://www.ebi.ac.uk/metagenomics/browse/genomes)**.

Given the biome we started this journey from, **which of the MGnify Genomes catalogues would you guess we're most likely to find MAGs from the Chloroflexi phylum in?**

Let's check. **Go to the "All genomes" list, and use the Filter to search for `Chloroflex`** (since this will catch lineages using either the `Chloroflexi` or `Chloroflexota` naming scheme).

In [19]:
from utils.quiz import check_answer, _encode_answer, question

question(
    "Which catalogue are most of the MAGs found in?",
    "",
    [b'bWFyaW5l'],
    ["Marine", "Human Oral", "Zebrafish fecal"]
)


## Task 3.2: Check functional annotations of a MAG

From the list, **open the MAG for the species `Casp-Chloro-G4`**.

In [30]:
from utils.quiz import check_answer, _encode_answer, question

question(
    "What is the most prevalent KEGG Module for this genome?",
    "Mxxxxx",
    [b'bTAwMTc4', b'MDAxNzg='],
)


_Note that there are "Download" files available for both the assembly analysis we looked at before and this genome analysis, which include full listings of the KEGG annotations, amongst others. You could download these files in the same way as the study summary TSV file we analysed in Part 1, and compare them programmatically._
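
As a minimal sketch of such a comparison (not part of the original notebook), the code below extracts KEGG Module accessions from the text of two downloaded files and reports which are shared. It assumes only that accessions appear in the files in the usual KEGG form, `M` followed by five digits; the exact column layout of the MGnify download files is not assumed, and `module_accessions` and `compare_modules` are hypothetical helper names.

```python
import re

# KEGG Module accessions look like "M00001": the letter M plus five digits.
MODULE_PATTERN = re.compile(r"\bM\d{5}\b")


def module_accessions(text):
    """Return the set of KEGG Module accessions mentioned anywhere in the text."""
    return set(MODULE_PATTERN.findall(text))


def compare_modules(assembly_text, genome_text):
    """Compare module accessions found in two annotation files.

    Returns sorted lists of accessions that are shared, found only in the
    assembly file, or found only in the genome file.
    """
    assembly = module_accessions(assembly_text)
    genome = module_accessions(genome_text)
    return {
        "shared": sorted(assembly & genome),
        "assembly_only": sorted(assembly - genome),
        "genome_only": sorted(genome - assembly),
    }
```

Note that this only checks whether a module is mentioned at all. If the files also report completeness per module, a more careful comparison would take that into account.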

## Task 3.3: Cross-catalogue searching
Looking at all the data you have available for this MAG, and exploring the other features available in the MGnify Genomes resource, can you work out how you might search for genomes similar to this one in other catalogues?

<details>
<summary>Tell me</summary>
The <a href="https://www.ebi.ac.uk/metagenomics/browse/genomes?browse-by=mag-search">"MAG search" feature</a> uses <a href="https://sourmash.readthedocs.io/en/latest/">Sourmash</a> to search genomes against genomes, based on sequence similarity. You could download the FASTA file of the MAG MGYG000297014, from its Downloads tab, and then upload this as the query genome to the MAG Search, selecting all catalogues as the target. At the time of writing, this search will only match the genome we downloaded from the marine catalogue.
</details>
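
To build some intuition for what "sequence similarity" can mean in a genome-against-genome search, here is a deliberately simplified sketch. It scores two sequences by the Jaccard similarity of their k-mer sets. This is an illustration only and is not Sourmash's implementation: Sourmash works on compact sketches of hashed k-mers rather than full k-mer sets, and this toy version ignores reverse complements. The function names and the default `k` are assumptions made for the example.

```python
def kmers(sequence, k=21):
    """Return the set of all overlapping substrings of length k."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def sequence_similarity(seq_a, seq_b, k=21):
    """Jaccard similarity between the k-mer sets of two sequences."""
    return jaccard(kmers(seq_a, k), kmers(seq_b, k))
```

Identical sequences score 1.0 and sequences with no k-mers in common score 0.0. Real genomes contain millions of k-mers, which is why tools like Sourmash compare small sketches instead of the full sets.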

---

# Further reading:

### Latest MGnify publication

> Richardson L, Allen B, Baldi G, et al. [MGnify: the microbiome sequence data analysis resource in 2023.](https://europepmc.org/article/PMC/PMC9825492) Nucleic Acids Research. 2023 Jan;51(D1):D753-D759. DOI: 10.1093/nar/gkac1080. PMID: 36477304; PMCID: PMC9825492.

### MGnify Genomes publication
> Gurbich TA, Almeida A, Beracochea M, et al. [MGnify Genomes: A Resource for Biome-specific Microbial Genome Catalogues.](https://europepmc.org/article/MED/36806692) Journal of Molecular Biology. 2023 Jul;435(14):168016. DOI: 10.1016/j.jmb.2023.168016. PMID: 36806692; PMCID: PMC10318097.


### Documentation
> [docs.mgnify.org](https://docs.mgnify.org)

### MGnify training courses
> [MGnify on EBI Training](https://www.ebi.ac.uk/training/search-results?query=mgnify)

### Updates
> [@MGnifyDB](https://twitter.com/mgnifyDB)