<div style="max-width:1200px"><img src="../_resources/mgnify_banner.png" width="100%"></div>

<img src="../_resources/mgnify_logo.png" width="200px">

**This notebook is part of the practical session for the Metagenomics Bioinformatics at MGnify course (2023)**

This is an interactive code notebook (a Jupyter Notebook).
To run this code, click into each cell and press the ▶ button in the top toolbar, or press `shift+enter`.

# Comparing a set of MAGs (Metagenome-Assembled Genomes) against MGnify's [MAG Catalogues](https://www.ebi.ac.uk/metagenomics/browse/genomes)

In this practical, we will use the [sourmash library](https://sourmash.readthedocs.io/en/latest/) to compare a "novel" catalogue of MAGs against the existing catalogues of MAGs on MGnify.

At small scale, this can be done [on the MGnify website](https://www.ebi.ac.uk/metagenomics/browse/genomes?browse-by=mag-search).

But what about at large scale, with hundreds of newly assembled MAGs? Or in a pipeline?

For that, we can use sourmash locally to produce "sketches" of our query MAGs, and the MGnify API to compare those against the catalogues.

In [None]:
import sourmash  # the library for computing sketches of dna sequences
from Bio import SeqIO  # the library for dealing with FASTA files

# standard libraries for finding and manipulating files and directories:
import glob
from os import mkdir
import shutil
from pathlib import PurePath

import time

# libraries for dealing with APIs and web requests:
from jsonapi_client import Session as APISession
from jsonapi_client import Modifier as APIModifier
import requests

# the quintessential python data tables manipulation and plotting libraries:
import pandas as pd
import matplotlib.pyplot as plt

### Get the paths to our query MAGs

If you completed the **MAG Generation** practical session of this course, you will have some of your own MAGs to use.

To use them:

In [None]:
try:
    mkdir('query-mags')
except FileExistsError:
    pass

**Now open the `query-mags` directory in the left pane of Jupyter Lab, and drag-and-drop your MAGs into it.**

If you didn't complete the MAG generation practical, you can use some example data.

Uncomment the code in the following cell to use the example data instead.

In [None]:
# shutil.copytree('../example-data/genomes', './query-mags', ignore=shutil.ignore_patterns('*.parquet'), dirs_exist_ok=True)

Now – however we got them – let's list the MAGs we will query with:

In [None]:
query_mags = glob.glob('query-mags/*.fa')

query_mags

### Compute a sourmash sketch for each MAG

Create a "sketch" for each MAG using [Sourmash](https://sourmash.readthedocs.io/en/latest/index.html#sourmash-in-brief).

A sketch is stored in a signature, which we will use for searching. The signature is essentially a collection of hashes, well suited to calculating the containment of your MAGs within the catalogue's MAGs.
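
To make "containment" concrete, here is a toy, standard-library-only sketch of the idea. It is not how sourmash is implemented: sourmash uses its own hash function and canonical k-mers, whereas this illustration assumes a `blake2b` hash, a tiny `ksize`, and made-up sequences. The function names `kmer_hashes` and `containment` are hypothetical.

```python
import hashlib


def kmer_hashes(sequence, ksize=5, scaled=1):
    """Hash every k-mer of `sequence`, keeping roughly 1/scaled of the hashes."""
    max_hash = 2**64 // scaled  # only hashes below this threshold are kept
    kept = set()
    for start in range(len(sequence) - ksize + 1):
        kmer = sequence[start:start + ksize]
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        if value < max_hash:
            kept.add(value)
    return kept


def containment(query_hashes, reference_hashes):
    """Fraction of the query's hashes that are also present in the reference."""
    if not query_hashes:
        return 0.0
    return len(query_hashes & reference_hashes) / len(query_hashes)


# Made-up sequences: the query is an exact substring of the reference.
reference = "ACGTTGCAAGGCTTAGCCATGACGT"
query = "TGCAAGGCTTAGCC"

query_in_reference = containment(kmer_hashes(query), kmer_hashes(reference))
reference_in_query = containment(kmer_hashes(reference), kmer_hashes(query))
```

Containment is asymmetric: every k-mer of the query occurs in the reference, but not the other way round. A larger `scaled` value keeps fewer hashes, trading precision for a smaller sketch.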

<span style="background-color: #d0debb; color: #000; padding: 32px; font-weight: 800">Complete the following piece of code</span>

In [None]:
for mag in ____________:

    # The sourmash parameters are chosen to match those used within MGnify
    sketch = sourmash.MinHash(n=0, ksize=31, scaled=1000)
    
    # A fasta file may have multiple records in it. Add them all to the sourmash signature.
    for index, record in enumerate(SeqIO.parse(mag, 'fasta')):
        sketch.add_sequence(str(______.seq))
        
    # Save the sourmash sketch as a "signature" file
    signature = sourmash.SourmashSignature(sketch, name=record.name)
    with open(f'query-mags/{PurePath(PurePath(mag).name).stem}.sig', 'wt') as fp:
        sourmash.save_signatures([signature], fp)

You can unhide the following cell (click the •••) for a solution

In [None]:
for mag in query_mags:

    # The sourmash parameters are chosen to match those used within MGnify
    sketch = sourmash.MinHash(n=0, ksize=31, scaled=1000)
    
    # A fasta file may have multiple records in it. Add them all to the sourmash signature.
    for index, record in enumerate(SeqIO.parse(mag, 'fasta')):
        sketch.add_sequence(str(record.seq))
        
    # Save the sourmash sketch as a "signature" file
    signature = sourmash.SourmashSignature(sketch, name=record.name)
    with open(f'query-mags/{PurePath(PurePath(mag).name).stem}.sig', 'wt') as fp:
        sourmash.save_signatures([signature], fp)

All being well, we now have some `*.sig` signature files in the `query-mags` directory:

In [None]:
glob.glob('query-mags/*.sig')

### Fetch all of the catalogue IDs currently available on MGnify

Next, we need to know which catalogues we can search against.

We could figure this out by hand on the MGnify website, but again the API makes it programmatically accessible.

To fetch the catalogue IDs from the MGnify API, use the following endpoint: `https://www.ebi.ac.uk/metagenomics/api/v1/genome-catalogues`.  

In [None]:
catalogues_endpoint = "genome-catalogues"

In [None]:
with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    catalogues = map(lambda r: r.json, mgnify.iterate(catalogues_endpoint))
    catalogues = pd.json_normalize(catalogues)
catalogues
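
`mgnify.iterate` walks through every page of the listing for us. As a rough sketch of what such pagination involves, the loop below assumes the common JSON:API convention of a `links.next` URL on each page. The helper `iterate_pages` and the fake two-page response are hypothetical and stand in for real HTTP calls.

```python
def iterate_pages(fetch_page, first_url):
    """Yield every record from a paginated JSON:API-style listing."""
    url = first_url
    while url:
        page = fetch_page(url)  # expected to return the decoded JSON body
        yield from page.get("data", [])
        url = (page.get("links") or {}).get("next")  # None on the last page


# Hypothetical two-page response used in place of a real HTTP call.
fake_pages = {
    "page-1": {
        "data": [{"id": "catalogue-a"}, {"id": "catalogue-b"}],
        "links": {"next": "page-2"},
    },
    "page-2": {"data": [{"id": "catalogue-c"}], "links": {"next": None}},
}

records = list(iterate_pages(fake_pages.__getitem__, "page-1"))
record_ids = [record["id"] for record in records]
```

Passing the page-fetching function in as an argument keeps the loop testable without network access.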

<span style="background-color: #d0debb; color: #000; padding: 32px; font-weight: 800">
    Complete the following piece of code. We want to create a `list` of the catalogue `id`s.
</span>

In [None]:
catalogue_ids = ____________________

catalogue_ids

You can unhide the following cell (click the •••) for a solution

In [None]:
catalogue_ids = list(catalogues['id'])
catalogue_ids

### Submit a search job to the MGnify API

To submit a job to the MGnify API, use the following endpoint: `https://www.ebi.ac.uk/metagenomics/api/v1/genomes-search/gather`.  
Data will be sent to the API; this is called "POST"ing data in the API world.  
This part of the API is quite specialized and is not a formal JSON:API, so the `requests` Python package is used to communicate with it.

In [None]:
endpoint = 'https://www.ebi.ac.uk/metagenomics/api/v1/genomes-search/gather'

<span style="background-color: #d0debb; color: #000; padding: 32px; font-weight: 800">Complete the following piece of code</span>

In [None]:
# Create a list of file uploads, and attach them to the API request

## First, open each .sig file for reading (in binary mode) and collect the file handles in a list:
sketch_file_pointers = [open(sig_file, 'rb') for sig_file in ____.____('__________/_.___')]

## Next, make a list of tuples of the signature contents to send to the API:
sketch_uploads = [('file_uploaded', sketch_file_pointer) for ___________________ in ____________________]

# Send the API request - it specifies which catalogue IDs to search against and attaches all of the sketch files.
submitted_job_response = requests.post(endpoint, data={'mag_catalogues': _____________}, files=______________)

assert submitted_job_response.status_code == 200

submitted_job = submitted_job_response.json()

submitted_job

You can unhide the following cell (click the •••) for a solution

In [None]:
# Create a list of file uploads, and attach them to the API request

## First, open each .sig file for reading (in binary mode) and collect the file handles in a list:
sketch_file_pointers = [open(sig_file, 'rb') for sig_file in glob.glob('query-mags/*.sig')]

## Next, make a list of tuples of the signature contents to send to the API:
sketch_uploads = [('file_uploaded', sketch_file_pointer) for sketch_file_pointer in sketch_file_pointers]

# Send the API request - it specifies which catalogue IDs to search against and attaches all of the sketch files.
submitted_job_response = requests.post(endpoint, data={'mag_catalogues': catalogue_ids}, files=sketch_uploads)

assert submitted_job_response.status_code == 200

submitted_job = submitted_job_response.json()

submitted_job
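
The cell above leaves the signature files open after the request has been sent. One possible way to close them reliably is `contextlib.ExitStack`, sketched below with throwaway placeholder files. The helper `build_uploads` is hypothetical, and the actual POST is shown only as a comment.

```python
import tempfile
from contextlib import ExitStack
from pathlib import Path


def build_uploads(stack, sig_paths, field_name="file_uploaded"):
    """Open each signature file and return (field name, file handle) tuples."""
    return [(field_name, stack.enter_context(open(path, "rb"))) for path in sig_paths]


# Demonstration with throwaway files standing in for real .sig files.
with tempfile.TemporaryDirectory() as tmp_dir:
    sig_paths = []
    for name in ("mag1.sig", "mag2.sig"):
        path = Path(tmp_dir) / name
        path.write_text("[]")  # placeholder content, not a real signature
        sig_paths.append(path)

    with ExitStack() as stack:
        uploads = build_uploads(stack, sig_paths)
        # The POST would go here, while the handles are still open, e.g.
        # requests.post(endpoint, data={'mag_catalogues': catalogue_ids}, files=uploads)
        open_during_request = [not handle.closed for _, handle in uploads]

    # Leaving the ExitStack block closes every handle it was given.
    closed_afterwards = [handle.closed for _, handle in uploads]
```

With hundreds of MAGs, closing the handles promptly avoids running into the operating system's limit on open files.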

### Wait for the results to be ready

As you can see in the printed `submitted_job` above, a `status_URL` was returned in the response from submitting the job via the API. Since the job is in a queue, this `status_URL` must be polled until our job has completed.  
Below is an example that checks every 2 seconds until ALL of the jobs are finished. The interval can easily be changed (to 10 seconds in the example below) by setting a different sleep value:
```python
time.sleep(10)
```

In [None]:
job_done = False
while not job_done:
    print('Checking status...')
    # The status_URL is another API endpoint that's unique for the submitted search job
    query_result = requests.get(submitted_job['data']['status_URL'])

    if not query_result.ok:
        # The request itself failed (HTTP error status), so try again shortly
        print('Status request failed. Will try again in 2 seconds')
        time.sleep(2)
        continue

    queries_status = {sig['job_id']: sig['status'] for sig in query_result.json()['data']['signatures']}
    job_done = all(status == 'SUCCESS' for status in queries_status.values())

    if not job_done:
        print(f'Still waiting for jobs to complete. Current status of jobs: {queries_status}')
        print('Will check again in 2 seconds')
        time.sleep(2)

print('Job done!')
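
In a pipeline, a polling loop usually also needs a time limit so that a stuck job cannot block forever. The sketch below is one possible shape for that. The helper `wait_for_jobs`, its parameters, and the `'PENDING'` placeholder status are assumptions made for illustration; only `'SUCCESS'` comes from this notebook.

```python
import time


def wait_for_jobs(fetch_statuses, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll until every job reports 'SUCCESS', or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        statuses = fetch_statuses()  # mapping of job id -> status string
        if statuses and all(status == "SUCCESS" for status in statuses.values()):
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Jobs still not finished: {statuses}")
        sleep(interval)


# Stand-in for the real status request: two polls pending, then success.
# 'PENDING' is a placeholder; only 'SUCCESS' is taken from the notebook.
responses = iter([
    {"job-1": "PENDING", "job-2": "PENDING"},
    {"job-1": "SUCCESS", "job-2": "PENDING"},
    {"job-1": "SUCCESS", "job-2": "SUCCESS"},
])
naps = []  # records the requested sleeps instead of actually waiting
final_statuses = wait_for_jobs(lambda: next(responses), sleep=naps.append)
```

In the notebook, `fetch_statuses` would wrap the `requests.get` call on the `status_URL` and return the job-id-to-status dictionary.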

`query_result` contains the results of the query.   
The results can be viewed as JSON (try entering `query_result.json()`), or as a Pandas dataframe:

<span style="background-color: #d0debb; color: #000; padding: 32px; font-weight: 800">Complete the following piece of code. We want to create a Pandas dataframe from the JSON data at the JSON path `data.signatures` in the response.</span>

In [None]:
query_result_df = pd.json__________(query_result.json()['____']['__________'])

query_result_df

You can unhide the following cell (click the •••) for a solution

In [None]:
query_result_df = pd.json_normalize(query_result.json()['data']['signatures'])
query_result_df
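
`pd.json_normalize` flattens nested JSON objects into columns with dotted names, which is where a column name such as `result.match` comes from. The standard-library sketch below mimics that flattening for nested dictionaries only. The function `flatten` and all the values in the example record are invented for illustration.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical record; the field values are invented for illustration.
example = {
    "job_id": "example-job",
    "filename": "mag1.sig",
    "status": "SUCCESS",
    "catalogue": "example-catalogue",
    "result": {"match": "MGYG-EXAMPLE"},
}
flat_example = flatten(example)
```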

<span style="background-color: #d0debb; color: #000; padding: 32px; font-weight: 800">Why does the table contain this number of results?</span>

### Are any of our MAGs found in biomes other than the human gut?

<span style="background-color: #d0debb; color: #000; padding: 32px; font-weight: 800">Complete the following piece of code</span>

Use the pandas `dropna` method on the `query_result_df` dataframe to drop rows which have a `NaN` in the `result.match` column only.

In [None]:
matches = _______________.______(subset=['______._____'])

matches

You can unhide the following cell (click the •••) for a solution

In [None]:
matches = query_result_df.dropna(subset=['result.match'])

matches

<span style="background-color: #d0debb; color: #000; padding: 32px; font-weight: 800">Complete the following piece of code</span>

Use the pandas `hist` method on the `matches` dataframe's `catalogue` column to show a histogram of the number of matches per catalogue:

In [None]:
matches._________.____()
plt.xlabel('catalogue')
plt.ylabel('number of matches')

You can unhide the following cell (click the •••) for a solution

In [None]:
matches.catalogue.hist()
plt.xlabel('catalogue')
plt.ylabel('number of matches')
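
The histogram simply counts how many matches fall into each catalogue. To cross-check the bar heights without plotting, the same tally can be made with `collections.Counter`, shown here on invented rows. On the real dataframe, `matches.catalogue.value_counts()` should give the equivalent numbers.

```python
from collections import Counter

# Hypothetical match rows; file and catalogue names are invented for illustration.
example_matches = [
    {"filename": "mag1.sig", "catalogue": "catalogue-a"},
    {"filename": "mag1.sig", "catalogue": "catalogue-b"},
    {"filename": "mag2.sig", "catalogue": "catalogue-a"},
]

# Count how many rows mention each catalogue.
matches_per_catalogue = Counter(row["catalogue"] for row in example_matches)
```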

### Bonus round

Use the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/v1) to fetch the details of each matching genome (MGYG). For each query MAG, print the continents (the `geographic_origin` attribute) in which its matching genomes were found.

In [None]:
matching_continents = {}

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    for _, genome in matches.iterrows():
        genome_detail = mgnify.get('genomes', genome['result.match'])
        found_in = genome_detail.resource.geographic_origin
        matching_continents.setdefault(genome.filename, []).append(found_in)
        # print(matching_continents)
for query_mag, continents in matching_continents.items():
    print(f'Query MAG {query_mag} matched genomes found in {", ".join(continents)}')