<div style="max-width:1200px"><img src="../_resources/mgnify_banner.png" width="100%"></div>

<img src="../_resources/mgnify_logo.png" width="200px">

# Mapping samples from the [AtlantECO Super Study](https://www.ebi.ac.uk/metagenomics/super-studies/atlanteco)
### ... using the MGnify API and an interactive map widget

The [MGnify API](https://www.ebi.ac.uk/metagenomics/api/v1) returns JSON data. The `jsonapi_client` package can help you load this data into Python, e.g. into a Pandas dataframe.

**This example shows you how to load the AtlantECO Super Study's data from the MGnify API and display it on an interactive world map.**

You can find all of the other "API endpoints" using the [Browsable API interface in your web browser](https://www.ebi.ac.uk/metagenomics/api/v1). The URL you see in the browsable API is exactly the same as the one you can use in this code.

This is an interactive code notebook (a Jupyter Notebook).
To run this code, click into each cell and press the ▶ button in the top toolbar, or press `shift+enter`.

**Content:**
- Fetch AtlantECO studies data
- Show the studies' samples on an interactive world map
- Check the presence/absence of functional annotation terms
    - GO term
    - InterPro
    - Biosynthetic Gene Clusters (BGC)

---

## Fetch all [AtlantECO](https://www.ebi.ac.uk/metagenomics/super-studies/atlanteco) studies
A Super Study is a collection of MGnify Studies originating from a major project. AtlantECO is one such project, aiming to develop and apply a novel, unifying framework that provides knowledge-based resources for a better understanding and management of the Atlantic Ocean and its ecosystem services.

Fetch the Super Study's Studies from the MGnify API, into a [Pandas dataframe](https://pandas.pydata.org/docs/user_guide/index.html):

In [75]:
import pandas as pd
from jsonapi_client import Session, Modifier

atlanteco_endpoint = 'super-studies/atlanteco/flagship-studies'
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    studies = map(lambda r: r.json, mgnify.iterate(atlanteco_endpoint))
    studies = pd.json_normalize(studies)
studies[:5]

Unnamed: 0,type,id,attributes.bioproject,attributes.accession,attributes.samples-count,attributes.is-private,attributes.last-update,attributes.secondary-accession,attributes.centre-name,attributes.study-abstract,attributes.study-name,attributes.data-origination,relationships.biomes.data
0,studies,MGYS00006613,PRJEB40759,MGYS00006613,58,False,2024-03-01T18:29:37,ERP124426,Ocean Sampling Day Consortium,Ocean Sampling Day was initiated by the EU-fun...,18S rRNA amplicon sequencing from the Ocean Sa...,SUBMITTED,"[{'id': 'root:Environmental:Aquatic:Marine', '..."
1,studies,MGYS00006612,PRJEB40763,MGYS00006612,48,False,2024-03-01T18:14:18,ERP124432,Ocean Sampling Day Consortium,Ocean Sampling Day was initiated by the EU-fun...,18S rRNA amplicon sequencing from the Ocean Sa...,SUBMITTED,"[{'id': 'root:Environmental:Aquatic:Marine', '..."
2,studies,MGYS00006611,PRJEB55999,MGYS00006611,63,False,2024-03-01T18:01:09,ERP140920,Ocean Sampling Day Consortium,Ocean Sampling Day was initiated by the EU-fun...,18S rRNA amplicon sequencing from the Ocean Sa...,SUBMITTED,"[{'id': 'root:Environmental:Aquatic:Marine', '..."
3,studies,MGYS00006610,PRJEB56005,MGYS00006610,50,False,2024-03-01T17:44:36,ERP140926,Ocean Sampling Day Consortium,Ocean Sampling Day was initiated by the EU-fun...,18S rRNA amplicon sequencing from the Ocean Sa...,SUBMITTED,"[{'id': 'root:Environmental:Aquatic:Marine', '..."
4,studies,MGYS00006609,PRJEB9737,MGYS00006609,193,False,2024-03-01T17:30:57,ERP010877,GSC,Analysis of 18S DNA in Tara Oceans Polar Circl...,Amplicon sequencing of Tara Oceans Polar Circl...,SUBMITTED,"[{'id': 'root:Environmental:Aquatic:Marine', '..."
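
The dotted column names in the table above (for example `attributes.bioproject`) come from `pd.json_normalize`, which flattens nested JSON into one column per leaf value. The following is a minimal offline sketch of that flattening. The record is hypothetical and only mimics the shape of an API response; its values are placeholders, not real MGnify data.

```python
import pandas as pd

# Hypothetical record shaped like one item of an API response.
# The identifiers and counts are illustrative placeholders only.
record = {
    "type": "studies",
    "id": "EXAMPLE_STUDY",
    "attributes": {"bioproject": "EXAMPLE_PROJECT", "samples-count": 3},
}

# Nested keys are joined with "." to form flat column names.
flat = pd.json_normalize([record])
print(sorted(flat.columns))
```

Because the nesting is encoded in the column name, a field can be selected directly, e.g. `flat["attributes.bioproject"]`.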


## Show the studies' samples on a map

We can fetch the Samples for each Study, and concatenate them all into one Dataframe.
Each sample has geolocation data in its `attributes` - this is what we need to build a map.

Fetching data for all samples takes time, so **we show samples from the PRJEB46727 study only.** This study contains assembly data: https://www.ebi.ac.uk/metagenomics/studies/MGYS00005810#overview.

In [76]:
substudy = studies[studies['attributes.bioproject'] == 'PRJEB46727']
studies_samples = []

with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    for idx, study in substudy.iterrows():
        print(f"fetching {study.id} samples")
        samples = map(lambda r: r.json, mgnify.iterate(f'studies/{study.id}/samples?page_size=1000'))
        samples = pd.json_normalize(samples)
        samples = pd.DataFrame(data={
            'accession': samples['id'],
            'sample_id': samples['id'],
            'study': study.id, 
            'lon': samples['attributes.longitude'],
            'lat': samples['attributes.latitude'],
            'color': "#FF0000",
        })
        samples.set_index('accession', inplace=True)
        studies_samples.append(samples)
studies_samples = pd.concat(studies_samples)

fetching MGYS00005810 samples


In [77]:
print(f"fetched {len(studies_samples)} samples")

studies_samples.head()

fetched 83 samples


accession,sample_id,study,lon,lat,color
SRS836773,SRS836773,MGYS00005810,-50.6225,-0.1569,#FF0000
SRS567311,SRS567311,MGYS00005810,-53.0,7.29,#FF0000
SRS567886,SRS567886,MGYS00005810,-54.42,10.68,#FF0000
SRS580500,SRS580500,MGYS00005810,-52.22,12.41,#FF0000
SRS580495,SRS580495,MGYS00005810,-54.51,10.29,#FF0000
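
Map plotting needs numeric, in-range coordinates. As a hedged sketch, assuming that some samples in other studies might lack usable coordinates (the rows above all look complete), a small helper could clean the table before it is passed to the map. The function name `clean_coordinates` and the toy rows are hypothetical.

```python
import pandas as pd

def clean_coordinates(samples, lon_col="lon", lat_col="lat"):
    """Return a copy holding only rows with numeric, in-range coordinates."""
    cleaned = samples.copy()
    # Non-numeric values become NaN instead of raising an error.
    cleaned[lon_col] = pd.to_numeric(cleaned[lon_col], errors="coerce")
    cleaned[lat_col] = pd.to_numeric(cleaned[lat_col], errors="coerce")
    cleaned = cleaned.dropna(subset=[lon_col, lat_col])
    in_range = cleaned[lon_col].between(-180, 180) & cleaned[lat_col].between(-90, 90)
    return cleaned[in_range]

# Toy data: one valid row, one missing longitude, one out-of-range longitude.
toy = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "lon": [-50.6, None, 200.0],
    "lat": [-0.15, 7.29, 10.0],
})
kept = clean_coordinates(toy)
print(list(kept["sample_id"]))
```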


In [78]:
import leafmap
m = leafmap.Map(center=(0, 0), zoom=2)
m.add_points_from_xy(
    studies_samples,
    x='lon', 
    y='lat', 
    popup=["study", "sample_id"], 
    color_column='color',
    add_legend=False
)
m

Map(center=[0, 0], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', 'zoom_out_text'…

## Check functional annotation terms presence
Let's check whether a specific identifier is present in each sample. 

We will work with the MGnify analyses (`MGYA`s) corresponding to the chosen samples, filtered by:
- pipeline version: 5.0
- experiment type: assembly
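
These filters are sent to the API as query parameters. The sketch below builds that query string with the standard library, so that unusual characters in a value get escaped. The helper name `build_analysis_filter` is hypothetical; the parameter names are the ones this notebook uses in its filter.

```python
from urllib.parse import urlencode

def build_analysis_filter(sample_accession, pipeline_version="5.0", experiment_type="assembly"):
    """Build the query string used to filter analyses for one sample."""
    params = {
        "pipeline_version": pipeline_version,
        "sample_accession": sample_accession,
        "experiment_type": experiment_type,
    }
    # urlencode keeps the dict order and escapes special characters.
    return urlencode(params)

query = build_analysis_filter("SRS836773")
print(query)
```

The resulting string has the same form as the one wrapped in `Modifier(...)` in this notebook.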

This example processes **just the first 10 samples** (again, because the full dataset takes a while to fetch).
First, get the analyses for each sample.

In [79]:
analyses = []
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    for idx, sample in studies_samples[:10].iterrows():
        print(f"processing {sample.sample_id}")
        filtering = Modifier(f"pipeline_version=5.0&sample_accession={sample.sample_id}&experiment_type=assembly")
        analysis = map(lambda r: r.json, mgnify.iterate('analyses', filter=filtering))
        analysis = pd.json_normalize(analysis)
        analyses.append(analysis)
analyses = pd.concat(analyses)
analyses[:5]

processing SRS836773
processing SRS567311
processing SRS567886
processing SRS580500
processing SRS580495
processing SRS565748
processing SRS1776191
processing SRS581965
processing SRS837379
processing SRS1776201


Unnamed: 0,type,id,attributes.analysis-status,attributes.accession,attributes.experiment-type,attributes.analysis-summary,attributes.pipeline-version,attributes.is-private,attributes.last-update,attributes.complete-time,attributes.instrument-platform,attributes.instrument-model,relationships.assembly.data.id,relationships.assembly.data.type,relationships.study.data.id,relationships.study.data.type,relationships.sample.data.id,relationships.sample.data.type
0,analysis-jobs,MGYA00593142,completed,MGYA00593142,assembly,"[{'key': 'Submitted nucleotide sequences', 'va...",5.0,False,2024-01-29T15:29:19.757516,2021-12-06T19:31:17,ILLUMINA,Illumina HiSeq 2500,ERZ2945023,assemblies,MGYS00005810,studies,SRS836773,samples
0,analysis-jobs,MGYA00589840,completed,MGYA00589840,assembly,"[{'key': 'Submitted nucleotide sequences', 'va...",5.0,False,2024-01-29T15:29:19.757516,2021-10-22T17:25:45,ILLUMINA,Illumina Genome Analyzer IIx,ERZ2945090,assemblies,MGYS00005810,studies,SRS567311,samples
0,analysis-jobs,MGYA00589562,completed,MGYA00589562,assembly,"[{'key': 'Submitted nucleotide sequences', 'va...",5.0,False,2024-01-29T15:29:19.757516,2021-10-21T08:23:57,ILLUMINA,Illumina Genome Analyzer IIx,ERZ2945101,assemblies,MGYS00005810,studies,SRS567886,samples
0,analysis-jobs,MGYA00589561,completed,MGYA00589561,assembly,"[{'key': 'Submitted nucleotide sequences', 'va...",5.0,False,2024-01-29T15:29:19.757516,2021-10-21T08:16:10,ILLUMINA,Illumina Genome Analyzer IIx,ERZ2944669,assemblies,MGYS00005810,studies,SRS580500,samples
0,analysis-jobs,MGYA00589560,completed,MGYA00589560,assembly,"[{'key': 'Submitted nucleotide sequences', 'va...",5.0,False,2024-01-29T15:29:19.757516,2021-10-21T08:13:12,ILLUMINA,Illumina Genome Analyzer IIx,ERZ2944798,assemblies,MGYS00005810,studies,SRS580495,samples
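
Every row in the table above has index `0`, because `pd.concat` keeps the original index of each per-sample dataframe. This does not affect the steps in this notebook, but label-based lookups such as `analyses.loc[0]` would return every row. A minimal sketch of the difference, using placeholder dataframes rather than real analyses:

```python
import pandas as pd

# Two one-row frames, standing in for the per-sample results.
frames = [pd.DataFrame({"id": ["A1"]}), pd.DataFrame({"id": ["A2"]})]

repeated = pd.concat(frames)                   # keeps each frame's own index
unique = pd.concat(frames, ignore_index=True)  # renumbers rows from 0

print(list(repeated.index))  # [0, 0]
print(list(unique.index))    # [0, 1]
```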


Define functions:
- `identify_existance` checks each analysis for the presence/absence of an identifier. It adds a column to the dataframe holding a colour: blue if the identifier was found and red if not.
- `show_on_map` plots the results on the world map. It joins the analyses and samples tables so that the geolocation data and the identifier presence data sit together, then builds a new sub-DataFrame with a subset of the fields and adds it to the map.

In [80]:
def identify_existance(input_analyses, identifier, term):
    """Add a colour column named `identifier`: blue if `term` is annotated in the analysis, red if not."""
    data = []
    with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
        for idx, mgya in input_analyses.iterrows():
            print(f"processing {mgya.id}")
            analysis_identifier = map(lambda r: r.json, mgnify.iterate(f'analyses/{mgya.id}/{identifier}'))
            analysis_identifier = pd.json_normalize(analysis_identifier)
            # An analysis without annotations may give an empty dataframe with no `id` column
            found_ids = set(analysis_identifier.get("id", []))
            data.append("#0000FF" if term in found_ids else "#FF0000")
    presented = data.count("#0000FF")
    print(f"Presented {presented} of {identifier} {term}")
    # Note: the colour column is inserted into `input_analyses` in place
    input_analyses.insert(2, identifier, data, allow_duplicates=True)
    return input_analyses

def show_on_map(input_analyses, studies_samples, identifier):
    df = input_analyses.join(studies_samples.set_index('sample_id'), on='relationships.sample.data.id')
    df2 = df[[identifier, 'lon', 'lat', 'study', 'attributes.accession', 'relationships.study.data.id', 'relationships.sample.data.id', 'relationships.assembly.data.id']].copy()
    df2 = df2.set_index("study")
    df2 = df2.rename(columns={"attributes.accession": "analysis_ID", 
                              'relationships.study.data.id': "study_ID",
                              'relationships.sample.data.id': "sample_ID", 
                              'relationships.assembly.data.id': "assembly_ID"
                             })
    m = leafmap.Map(center=(0, 0), zoom=2)
    m.add_points_from_xy(df2, 
                         x='lon', 
                         y='lat', 
                         popup=["study_ID", "sample_ID", "assembly_ID", "analysis_ID"],
                        color_column=identifier, add_legend=False)
    return m
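
The join inside `show_on_map` can be tried offline on toy data. This is a sketch with hypothetical identifiers and coordinates; it only illustrates how `DataFrame.join` matches a column of the analyses table against the index of the samples table.

```python
import pandas as pd

# Toy tables with placeholder identifiers (not real MGnify accessions).
toy_samples = pd.DataFrame({
    "sample_id": ["S1", "S2"],
    "lon": [-50.0, -53.0],
    "lat": [0.0, 7.0],
})
toy_analyses = pd.DataFrame({
    "id": ["A1", "A2"],
    "relationships.sample.data.id": ["S2", "S1"],
})

# `on` names the column in the left table; the right table is matched by its index.
joined = toy_analyses.join(
    toy_samples.set_index("sample_id"),
    on="relationships.sample.data.id",
)
print(joined)
```

`join` defaults to a left join, so an analysis whose sample is missing from the samples table would be kept with `NaN` coordinates rather than dropped.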

## GO term
This example uses the GO term for biotin transport, [GO:0015878](http://www.candidagenome.org/cgi-bin/GO/go.pl?goid=15878).

Other GO identifiers are available on the MGnify API.

In [81]:
identifier = "go-terms"
go_term = 'GO:0015878'
go_analyses = analyses.copy()  # copy, so that `analyses` itself is not modified
go_data = identify_existance(go_analyses, identifier, go_term)
map_vis = show_on_map(go_data, studies_samples, identifier)
map_vis

processing MGYA00593142
processing MGYA00589840
processing MGYA00589562
processing MGYA00589561
processing MGYA00589560
processing MGYA00589559
processing MGYA00589558
processing MGYA00589557
processing MGYA00589556
processing MGYA00589555
Presented 9 of go-terms GO:0015878


Map(center=[0, 0], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', 'zoom_out_text'…
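
To look at several terms at once, the same membership test can be arranged as a presence/absence table. The sketch below assumes the annotation identifiers for each analysis have already been fetched into a dictionary. The function name `presence_matrix` and all identifiers in the toy input are hypothetical.

```python
import pandas as pd

def presence_matrix(annotations, terms):
    """Rows are analyses, columns are terms, values are True if the term is present."""
    rows = {
        analysis: {term: term in ids for term in terms}
        for analysis, ids in annotations.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy input: analysis name -> set of annotation identifiers (placeholders).
toy_annotations = {
    "ANALYSIS_1": {"TERM_A", "TERM_B"},
    "ANALYSIS_2": {"TERM_B"},
}
matrix = presence_matrix(toy_annotations, ["TERM_A", "TERM_B"])
print(matrix)
print(matrix.sum())  # number of analyses containing each term
```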

## InterPro entry
This example uses the InterPro entry [IPR001650](https://www.ebi.ac.uk/interpro/entry/InterPro/IPR001650): Helicase, C-terminal domain-like.

Other InterPro identifiers are available on the MGnify API.

In [86]:
identifier = "interpro-identifiers"
ips_term = 'IPR001650'
ips_analyses = analyses.copy()  # copy, so that `analyses` itself is not modified
ips_data = identify_existance(ips_analyses, identifier, ips_term)
map_vis = show_on_map(ips_data, studies_samples, identifier)
map_vis

processing MGYA00593142
processing MGYA00589840
processing MGYA00589562
processing MGYA00589561
processing MGYA00589560
processing MGYA00589559
processing MGYA00589558
processing MGYA00589557
processing MGYA00589556
processing MGYA00589555
Presented 0 of interpro-identifiers GO:0015878


Map(center=[0, 0], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', 'zoom_out_text'…