# Metagenomics Bioinformatics Course - EBI MGnify 2021
## MGnify Genomes resource - Metagenomic Assembled Genomes Catalogues - Practical exercise

### Aims
In this exercise, we will learn how to use the [Genomes resource within MGnify](https://www.ebi.ac.uk/metagenomics/browse#genomes).

- Discover the available data on the MGnify website
- Use two search mechanisms (search by _gene_ and search by _genome_)
- Learn how to use the MGnify API to fetch data using scripts or analysis notebooks
- Use the _genome_ search mechanism via the API, to compare your own MAGs against a MGnify catalogue and see whether they are novel

### How this works
This file is a [Jupyter Notebook](https://jupyter.org). 
It has instructions, and also code cells. The code cells are connected to Python, and you can run all of the code in a cell by pressing the Play (▶) icon in the top bar, or by pressing `shift + return`.
The code libraries you should need are already installed.

# Import packages

[pandas](https://pandas.pydata.org/docs/reference/index.html#api) is a data analysis library with a huge list of features. It is very good at holding and manipulating table data.

In [3]:
import pandas as pd

[jsonapi-client](https://pypi.org/project/jsonapi-client/) is a library to get formatted data from web services into Python code.

In [4]:
from jsonapi_client import Session as APISession

# The MGnify API
## Core concepts
An [API](https://en.wikipedia.org/wiki/API "Application programming interface") is how your scripts (e.g. Python or R) can talk to the MGnify database.

The MGnify API uses [JSON](https://en.wikipedia.org/wiki/JSON "JavaScript Object Notation") to transfer data in a systematic way. This is human-readable and computer-readable.

The particular format we use is a standard called [JSON:API](https://jsonapi.org). 
There is a Python package ([`jsonapi_client`](https://pypi.org/project/jsonapi-client/)) to make consuming this data easy. We're using it here.
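
To make the JSON:API shape concrete, here is a minimal sketch that parses a hand-written document using only Python's built-in `json` module. The document is an assumption for illustration: its values are taken from the Super Studies list shown later in this notebook, and a real API response contains more fields than this.

```python
import json

# A hand-written document in the JSON:API shape (illustrative only).
# A real MGnify response carries more fields than are shown here.
raw = """
{
  "data": [
    {
      "type": "super-studies",
      "id": "1",
      "attributes": {"title": "Tara Oceans", "url-slug": "tara-oceans"}
    },
    {
      "type": "super-studies",
      "id": "2",
      "attributes": {"title": "Earth Microbiome Project",
                     "url-slug": "earth-microbiome-project"}
    }
  ]
}
"""

document = json.loads(raw)

# Every resource carries a "type", an "id", and a nested "attributes" object.
titles = [resource["attributes"]["title"] for resource in document["data"]]
print(titles)
```

Because every resource follows the same `type` / `id` / `attributes` layout, a generic client such as `jsonapi_client` can handle any endpoint without endpoint-specific parsing code.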

The MGnify API has a "browsable interface", which is a human-friendly way of exploring the API. The URLs for the browsable API are exactly the same as you'd use in a script or code; but when you open those URLs in a browser you see a nice interface. Find it here: [https://www.ebi.ac.uk/metagenomics/api/v1/](https://www.ebi.ac.uk/metagenomics/api/v1/).

The MGnify API is "paginated", i.e. when you list some data you are given it in multiple pages. This is because there can sometimes be thousands of results. Thankfully `jsonapi_client` handles this for us.
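
The idea behind pagination can be sketched without touching the network. In the snippet below, `FAKE_PAGES`, `fetch_page` and `iterate_all` are hypothetical names invented for this illustration; they simulate a server that returns results one page at a time. This is not how `jsonapi_client` is implemented, only a picture of the kind of loop it saves us from writing.

```python
# Simulated server: three "pages" of results. FAKE_PAGES and fetch_page are
# made up for this illustration and never touch the network.
FAKE_PAGES = {
    1: {"data": ["genome-a", "genome-b"], "next": 2},
    2: {"data": ["genome-c", "genome-d"], "next": 3},
    3: {"data": ["genome-e"], "next": None},
}


def fetch_page(page_number):
    """Stand-in for one HTTP request that returns a single page."""
    return FAKE_PAGES[page_number]


def iterate_all():
    """Yield every item from every page, following "next" until it runs out."""
    page_number = 1
    while page_number is not None:
        page = fetch_page(page_number)
        yield from page["data"]
        page_number = page["next"]


all_items = list(iterate_all())
print(all_items)
```

The caller sees one continuous stream of items and never has to think about page boundaries.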

## Example
The MGnify website has a list of ["Super Studies"](https://www.ebi.ac.uk/metagenomics/browse) (collections of studies that together represent major research efforts or collaborations).

What the website actually shows is the data from an API endpoint (i.e. a specific resource within the API) that lists those. It's here: [api/v1/super-studies](https://www.ebi.ac.uk/metagenomics/api/v1/super-studies). Have a look.

Here is an example of some Python code, using two popular packages that let us write a short tidy piece of code:

**Click into the next cell, and press `shift + return` (or click the ▶ icon on the menubar at the top) to run it.**

In [9]:
endpoint = 'super-studies'

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    resources = map(lambda r: r.json, mgnify.iterate(endpoint))
    resources = pd.json_normalize(resources)
    resources.to_csv(f"{endpoint}.csv")
resources

Unnamed: 0,type,id,attributes.super-study-id,attributes.title,attributes.url-slug,attributes.description,attributes.image-url,attributes.biomes-count
0,super-studies,1,1,Tara Oceans,tara-oceans,The Tara Oceans expedition (Karsenti et al. 20...,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",0
1,super-studies,2,2,Earth Microbiome Project,earth-microbiome-project,The Earth Microbiome Project is now available ...,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",0
2,super-studies,3,3,NASA GeneLab Microbiome (MANGO),nasa-genelab-microbiome-mango,Project MANGO provides access to the microbiom...,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",0
3,super-studies,4,4,HoloFood,holofood,Holistic approach to improve the efficiency of...,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",0
4,super-studies,5,5,Malaspina,malaspina,The Malaspina circumnavigation expedition was ...,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",0
5,super-studies,6,6,AtlantECO,atlanteco,The EU-funded AtlantECO project aims to develo...,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",0
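
The dotted column names in the table above, such as `attributes.title`, come from flattening nested JSON into a single row. The following is a hand-rolled sketch of that idea, not the pandas implementation; `flatten` is a hypothetical helper written for this illustration, and the record is a trimmed-down copy of the first row.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one dict whose keys are dot-joined paths."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# A trimmed-down version of one Super Study resource (illustrative only).
record = {
    "type": "super-studies",
    "id": "1",
    "attributes": {"title": "Tara Oceans", "url-slug": "tara-oceans"},
}

row = flatten(record)
print(list(row))
```

Each flattened record becomes one table row, and each dotted path becomes one column.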


## Line by line explanation

```python
### The packages were already imported, but if you wanted to use this snippet on its own as a script you would import them like this:
from jsonapi_client import Session as APISession
import pandas as pd
###


endpoint = 'super-studies'
# An "endpoint" is the specific resource within the API which we want to get data from. 
# It is a URL relative to the "server base URL" of the API, which for MGnify is https://www.ebi.ac.uk/metagenomics/api/v1.
# You can find the endpoints in the API Docs https://www.ebi.ac.uk/metagenomics/api/docs/ 

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    # Calling `APISession` opens a connection to the MGnify API that can be used multiple times.
    # The `with...as mgnify` syntax is a Python "context manager".
    # Everything inside the `with...` block (i.e. indented below it) can use the `APISession` which we've called `mgnify` here. 
    # When the `with` block closes (the indentation stops), the connection to the API is nicely cleaned up for us.
    
    resources = map(lambda r: r.json, mgnify.iterate(endpoint))
    # `map` applies a function to every element of an iterable - so do something to each thing in a list.
    # Remember we said the API is paginated? 
    # `mgnify.iterate(endpoint)` is a very helpful function that loops over as many pages of results as there are.
    # `lambda r: r.json` is grabbing the JSON attribute from each Super Study returned from the API.
    # All together, this makes `resources` be a bunch of JSON representations we could loop through, each containing the data of a Super Study.
    
    resources = pd.json_normalize(resources)
    # `pd` is the de-facto shorthand for the `pandas` package - you'll see it anywhere people are using pandas.
    # The `json_normalize` function takes "nested" data and does its best to turn it into a table.
    # You can throw quite strange-looking data at it and it usually does something sensible.
    
    resources.to_csv(f"{endpoint}.csv")
    # Pandas has a built-in way of writing CSV (or TSV, etc) files, which is helpful for getting data into other tools.
    # This writes the table-ified Super Study list to a file called `super-studies.csv`.
    
resources
# In a Jupyter notebook, you can just write a variable name in a cell (or the last line of a long cell), and it will print it.
# Jupyter knows how to display Pandas tables (actually called "DataFrames", because they are More Than Just Tables ™) in a pretty way.
```
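
The `with ... as` behaviour described in the comments above can be seen in isolation with a small sketch. `fake_session` is a hypothetical stand-in invented here; it records what happens instead of opening a real connection, and it makes no claim about how `APISession` works internally.

```python
from contextlib import contextmanager

events = []  # records what happens, in order


@contextmanager
def fake_session(base_url):
    """Hypothetical stand-in for a session: open, hand over, always close."""
    events.append(f"open {base_url}")
    try:
        yield {"base_url": base_url}
    finally:
        # Runs when the `with` block ends, even if the block raised an error.
        events.append("close")


with fake_session("https://www.ebi.ac.uk/metagenomics/api/v1") as session:
    events.append("use")

print(events)
```

The "close" step always runs last, which is why code that needs the connection has to sit inside the indented block.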


# Task 1 - list Genome Catalogues
**In the cell below, complete the Python code to fetch the list of [Genome Catalogues from the MGnify API](https://www.ebi.ac.uk/metagenomics/api/v1/genome-catalogues), and show them in a table.**

(Note that there may only be one catalogue in the list right now; that is correct.)

In [10]:
# Complete this code

endpoint = 

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    catalogues = 
    
    
    
    
catalogues

## Solution
Unhide these cells to see a solution

In [17]:
endpoint = 'genome-catalogues'

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    catalogues = map(lambda r: r.json, mgnify.iterate(endpoint))
    catalogues = pd.json_normalize(catalogues)
catalogues

Unnamed: 0,type,id,attributes.name,attributes.description,attributes.protein-catalogue-name,attributes.protein-catalogue-description,attributes.genome-count,attributes.version,attributes.last-update,relationships.biome.data.id,relationships.biome.data.type
0,genome-catalogues,human-gut-v1-0,Unified Human Gastrointestinal Genome (UHGG) 1.0,"4,644 species-level prokaryotic genomes corres...",The Unified Human Gastrointestinal Protein cat...,The Unified Human Gastrointestinal Protein (UH...,4644,1.0,2021-09-30T11:27:14.126429,root:Host-associated:Human:Digestive system:La...,biomes


# Task 2 - list Genomes
Each catalogue contains a much larger list of Genomes.
**In the cell below, complete the Python code to fetch the list of [Genomes from the MGnify API](https://www.ebi.ac.uk/metagenomics/api/v1/genome-catalogues), and show them in a table.**

(Note that there are quite a lot of pages of data, so this will take a minute to run)
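
Since an endpoint is a path relative to the server base URL, the nested endpoint for this task can be previewed as a plain string before any request is made. This is a small sketch: the catalogue id is hard-coded from the Task 1 table rather than read from `catalogues`, and `full_url` is a name introduced here for illustration.

```python
base_url = "https://www.ebi.ac.uk/metagenomics/api/v1"
catalogue_id = "human-gut-v1-0"  # hard-coded here; taken from the Task 1 table

# The f-string substitutes the value of catalogue_id where {catalogue_id} appears.
endpoint = f"genome-catalogues/{catalogue_id}/genomes"

# Joining the base URL and the endpoint gives an address you can open in a browser.
full_url = f"{base_url}/{endpoint}"
print(full_url)
```

Opening the printed address in a browser shows the same data in the browsable API interface.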

In [None]:
catalogue_id = catalogues.id[0]
endpoint = f'genome-catalogues/{catalogue_id}/genomes'  # a Python f-string inserts the value of a variable into the string where that variable name appears inside {..}

with           as mgnify:
    genomes = 

    
    
    
genomes

## Solution
Unhide these cells to see a solution

In [27]:
catalogue_id = catalogues.id[0]
endpoint = f'genome-catalogues/{catalogue_id}/genomes'  # a Python f-string inserts the value of a variable into the string where that variable name appears inside {..}

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    genomes = map(lambda r: r.json, mgnify.iterate(endpoint))
    genomes = pd.json_normalize(genomes)
genomes

Unnamed: 0,type,id,attributes.genome-id,attributes.geographic-range,attributes.geographic-origin,attributes.accession,attributes.ena-sample-accession,attributes.ena-study-accession,attributes.length,attributes.num-contigs,...,relationships.biome.data.type,relationships.catalogue.data.id,relationships.catalogue.data.type,attributes.ncbi-genome-accession,attributes.ncbi-sample-accession,attributes.ncbi-study-accession,attributes.cmseq,attributes.ena-genome-accession,attributes.img-genome-accession,attributes.patric-genome-accession
0,genomes,MGYG-HGUT-00001,648,"[North America, Europe]",Europe,MGYG-HGUT-00001,ERS370061,ERP105624,3219614,137,...,biomes,human-gut-v1-0,genome-catalogues,,,,,,,
1,genomes,MGYG-HGUT-00002,383,"[North America, Europe, Asia, Oceania, South A...",Europe,MGYG-HGUT-00002,ERS370064,ERP105624,4433090,100,...,biomes,human-gut-v1-0,genome-catalogues,,,,,,,
2,genomes,MGYG-HGUT-00003,2319,"[North America, Europe, Asia, Oceania, South A...",Europe,MGYG-HGUT-00003,ERS370070,ERP105624,3229507,35,...,biomes,human-gut-v1-0,genome-catalogues,,,,,,,
3,genomes,MGYG-HGUT-00004,4150,"[North America, Europe, Asia]",Europe,MGYG-HGUT-00004,ERS370072,ERP105624,3698872,105,...,biomes,human-gut-v1-0,genome-catalogues,,,,,,,
4,genomes,MGYG-HGUT-00005,2855,"[North America, Europe]",Europe,MGYG-HGUT-00005,ERS417217,ERP105624,3930422,32,...,biomes,human-gut-v1-0,genome-catalogues,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4639,genomes,MGYG-HGUT-04640,498,"[Europe, Asia]",Europe,MGYG-HGUT-04640,ERS436824,ERP005534,2596429,276,...,biomes,human-gut-v1-0,genome-catalogues,,,,1.09,,,
4640,genomes,MGYG-HGUT-04641,995,[],Europe,MGYG-HGUT-04641,ERS436802,ERP005534,1772088,174,...,biomes,human-gut-v1-0,genome-catalogues,,,,0.20,,,
4641,genomes,MGYG-HGUT-04642,464,"[North America, Europe, Asia]",Europe,MGYG-HGUT-04642,ERS436800,ERP005534,2001820,23,...,biomes,human-gut-v1-0,genome-catalogues,,,,0.05,,,
4642,genomes,MGYG-HGUT-04643,1136,"[Africa, Asia, Europe, North America, Oceania,...",Europe,MGYG-HGUT-04643,ERS436826,ERP005534,1975847,11,...,biomes,human-gut-v1-0,genome-catalogues,,,,0.05,,,
