# Metagenomics Bioinformatics Course - EBI MGnify 2021
## MGnify Services and API - Practical exercise

### Aims
In this exercise, we will learn how to use the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/v1).

- Discover the available data on the MGnify website
- Learn how to use the MGnify API to fetch data using scripts or analysis notebooks

### How this works
This file is a [Jupyter Notebook](https://jupyter.org). 
It has instructions, and also code cells. The code cells are connected to Python, and you can run all of the code in a cell by pressing the Play (▶) icon in the top bar, or by pressing `shift + return`.
The code libraries you should need are already installed.

# Import packages

[pandas](https://pandas.pydata.org/docs/reference/index.html#api) is a data analysis library with a huge list of features. It is very good at holding and manipulating table data.

In [None]:
import pandas as pd

[jsonapi-client](https://pypi.org/project/jsonapi-client/) is a library to get formatted data from web services into Python code.

In [None]:
from jsonapi_client import Session as APISession
from jsonapi_client import Modifier

# The MGnify API
## Core concepts
An [API](https://en.wikipedia.org/wiki/API "Application programming interface") is how your scripts (e.g. Python or R) can talk to the MGnify database.

The MGnify API uses [JSON](https://en.wikipedia.org/wiki/JSON "JavaScript Object Notation") to transfer data in a systematic way. This format is both human-readable and computer-readable.

The particular format we use is a standard called [JSON:API](https://jsonapi.org). 
There is a Python package ([`jsonapi_client`](https://pypi.org/project/jsonapi-client/)) to make consuming this data easy. We're using it here.
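
To make the JSON:API shape more concrete, here is a minimal sketch of a single resource document parsed with Python's standard `json` module. The identifier, attribute values and `example.org` link below are made up for illustration and are not real MGnify output; only the `data`, `type`, `id`, `attributes` and `links` members come from the JSON:API standard.

```python
import json

# A hypothetical JSON:API document (illustrative values, not real MGnify data).
raw = """
{
  "data": {
    "type": "studies",
    "id": "STUDY_EXAMPLE",
    "attributes": {"study-name": "An example study", "samples-count": 3},
    "links": {"self": "https://example.org/api/v1/studies/STUDY_EXAMPLE"}
  }
}
"""

document = json.loads(raw)

# In JSON:API, the resource itself lives under the top-level "data" member.
resource = document["data"]
resource_type = resource["type"]
resource_id = resource["id"]
attributes = resource["attributes"]

print(resource_type, resource_id)
print(attributes["study-name"])
```

Packages like `jsonapi_client` exist so that you rarely need to dig through this structure by hand.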

The MGnify API has a "browsable interface", which is a human-friendly way of exploring the API. The URLs for the browsable API are exactly the same as you'd use in a script or code; but when you open those URLs in a browser you see a nice interface. Find it here: [https://www.ebi.ac.uk/metagenomics/api/v1/](https://www.ebi.ac.uk/metagenomics/api/v1/).

The MGnify API is "paginated", i.e. when you list some data you are given it in multiple pages. This is because there can sometimes be thousands of results. Thankfully `jsonapi_client` handles this for us.
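
As a rough illustration of what "handling pagination" means, the sketch below walks through a fake, in-memory set of pages by following each page's `next` link. The page names, the `FAKE_PAGES` data, and the `fetch_page` and `iterate_all` helpers are all hypothetical. This is not how `jsonapi_client` is implemented internally; it only shows the general idea.

```python
# Hypothetical in-memory "API": each page holds some data and a link to the next page.
FAKE_PAGES = {
    "page1": {"data": [{"id": "A"}, {"id": "B"}], "links": {"next": "page2"}},
    "page2": {"data": [{"id": "C"}, {"id": "D"}], "links": {"next": "page3"}},
    "page3": {"data": [{"id": "E"}], "links": {"next": None}},
}


def fetch_page(url):
    """Stand-in for an HTTP request; returns one page of results."""
    return FAKE_PAGES[url]


def iterate_all(first_url):
    """Yield every item from every page, following `links.next` until it runs out."""
    url = first_url
    while url is not None:
        page = fetch_page(url)
        yield from page["data"]
        url = page["links"]["next"]


all_ids = [item["id"] for item in iterate_all("page1")]
print(all_ids)
```

The caller only sees one continuous stream of items, which is the convenience a paginating client gives you.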

## Example
The MGnify website has a list of ["Studies"](https://www.ebi.ac.uk/metagenomics/browse).

What the website actually shows is the data from an API endpoint (i.e. a specific resource within the API) that lists those studies. It's here: [api/v1/studies](https://www.ebi.ac.uk/metagenomics/api/v1/studies). Have a look.
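
Because an endpoint is just a path relative to the API's base URL, the full address can be assembled with simple string handling. The sketch below uses a hypothetical helper called `endpoint_url` (not part of any library); the accession is the one used in the task later in this notebook.

```python
BASE_URL = "https://www.ebi.ac.uk/metagenomics/api/v1"


def endpoint_url(endpoint, accession=None):
    """Hypothetical helper: join the base URL, an endpoint and an optional accession."""
    parts = [BASE_URL, endpoint]
    if accession is not None:
        parts.append(accession)
    return "/".join(parts)


studies_url = endpoint_url("studies")
one_study_url = endpoint_url("studies", "MGYS00002045")

print(studies_url)
print(one_study_url)
```

Opening either URL in a browser shows the browsable interface for that endpoint.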

Here is a short, tidy piece of Python code that uses the two packages we imported above (`jsonapi_client` and `pandas`):

**Click into the next cell, and press `shift + return` (or click the ▶ icon on the menubar at the top) to run it.**

In [None]:
endpoint = "studies"

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    biome_filter = Modifier("lineage=root:Host-associated:Algae")
    resources = map(lambda r: r.json, mgnify.iterate(endpoint, filter=biome_filter))
    resources = pd.json_normalize(resources)
    resources.to_csv(f"{endpoint}.csv")
resources

## Line by line explanation

```python
### The packages were already imported, but if you wanted to use this snippet on its own as a script you would import them like this:
from jsonapi_client import Session as APISession
from jsonapi_client import Modifier
import pandas as pd
###


endpoint = 'studies'
# An "endpoint" is the specific resource within the API which we want to get data from. 
# It is a URL relative to the "server base URL" of the API, which for MGnify is https://www.ebi.ac.uk/metagenomics/api/v1.
# You can find the endpoints in the API Docs https://www.ebi.ac.uk/metagenomics/api/docs/ 

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    # Calling `APISession` opens a connection to the MGnify API that can be reused for multiple requests.
    # The `with ... as mgnify` syntax is a Python "context manager".
    # Everything inside the `with` block (i.e. indented below it) can use the `APISession`, which we've called `mgnify` here.
    # When the `with` block ends (the indentation stops), the connection to the API is nicely cleaned up for us.
    
    # Using a Modifier, we can filter the results from the API. 
    # The biome_filter will add the "lineage=XXX" to the query sent to the API
    # This will be used by the API to filter the studies in the response by the biome specified in "lineage"
    biome_filter = Modifier("lineage=root:Host-associated:Algae")

    resources = map(lambda r: r.json, mgnify.iterate(endpoint, filter=biome_filter))
    # `map` applies a function to every element of an iterable, i.e. it does something to each thing in a list.
    # Remember we said the API is paginated?
    # `mgnify.iterate(endpoint, filter=biome_filter)` is a very helpful function that loops over as many pages of filtered results as there are.
    # `lambda r: r.json` grabs the JSON attribute from each Study returned from the API.
    # All together, this makes `resources` a collection of JSON representations we could loop through, each containing the data of a Study.
    
    resources = pd.json_normalize(resources)
    # `pd` is the de-facto shorthand for the `pandas` package - you'll see it anywhere people are using pandas.
    # The `json_normalize` function takes "nested" data and does its best to turn it into a table.
    # You can throw quite strange-looking data at it and it usually does something sensible.
    
    resources.to_csv(f"{endpoint}.csv")
    # Pandas has a built-in way of writing CSV (or TSV, etc) files, which is helpful for getting data into other tools.
    # This writes the table-ified Study list to a file called `studies.csv`.
    
resources
# In a Jupyter notebook, you can just write a variable name in a cell (or the last line of a long cell), and it will print it.
# Jupyter knows how to display Pandas tables (actually called "DataFrames", because they are More Than Just Tables ™) in a pretty way.
```
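
To see what `json_normalize` does without calling the API, here is a small offline sketch. The two records below are invented for illustration and only loosely mimic the nested shape of API results; calling `to_csv` without a file path returns the CSV text instead of writing a file.

```python
import pandas as pd

# Made-up nested records (not real MGnify data).
records = [
    {"id": "STUDY_1", "attributes": {"study-name": "First", "samples-count": 2}},
    {"id": "STUDY_2", "attributes": {"study-name": "Second", "samples-count": 5}},
]

# Nested keys are flattened into dotted column names such as "attributes.study-name".
table = pd.json_normalize(records)
print(table.columns.tolist())

# With no path given, `to_csv` returns the CSV content as a string.
csv_text = table.to_csv(index=False)
print(csv_text)
```

Each nested attribute becomes its own column, which is why the resulting table is easy to filter, sort or export.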


# Task - Get a study from the API
**In the cell below, complete the Python code to fetch the study _MGYS00002045_ (see the [Study MGYS00002045 MGnify API endpoint](https://www.ebi.ac.uk/metagenomics/api/v1/studies/MGYS00002045)), and show the study data in a table.**


In [None]:
# Complete this code
resource = 
accession = 

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as    :
    study = mgnify.get(resource, accession).resource
    study = 
study

## Solution
Unhide these cells to see a solution

In [None]:
resource = "studies"
accession = "MGYS00002045"

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    study = mgnify.get(resource, accession).resource
    study = pd.json_normalize(study.json)
study