<div style="max-width:1200px"><img src="../_resources/mgnify_banner.png" width="100%"></div>

<img src="../_resources/mgnify_logo.png" width="200px">

# Mapping samples from the [AtlantECO Super Study](https://www.ebi.ac.uk/metagenomics/super-studies/atlanteco)
### ... using the MGnify API and an interactive map widget

The [MGnify API](https://www.ebi.ac.uk/metagenomics/api/v1) returns JSON data. The `jsonapi_client` package can help you load this data into Python, e.g. into a Pandas dataframe.

**This example shows you how to load a MGnify Super Study's data from the MGnify API and display it on an interactive world map.**

You can find all of the other "API endpoints" using the [Browsable API interface in your web browser](https://www.ebi.ac.uk/metagenomics/api/v1). The URL you see in the browsable API is exactly the same as the one you can use in this code.

This is an interactive code notebook (a Jupyter Notebook).
To run this code, click into each cell and press the ▶ button in the top toolbar, or press `shift+enter`.

---

## Fetch all [AtlantECO](https://www.ebi.ac.uk/metagenomics/super-studies/atlanteco) studies
A Super Study is a collection of MGnify Studies originating from a major project. AtlantECO is one such project, aiming to develop and apply a novel, unifying framework that provides knowledge-based resources for a better understanding and management of the Atlantic Ocean and its ecosystem services.

Fetch the Super Study's Studies from the MGnify API, into a [Pandas dataframe](https://pandas.pydata.org/docs/user_guide/index.html):

In [1]:
import pandas as pd
from jsonapi_client import Session, Modifier

atlanteco_endpoint = 'super-studies/atlanteco/flagship-studies'
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    studies = map(lambda r: r.json, mgnify.iterate(atlanteco_endpoint))
    studies = pd.json_normalize(studies)
studies[:5]

,type,id,attributes.samples-count,attributes.accession,attributes.bioproject,attributes.is-private,attributes.secondary-accession,attributes.centre-name,attributes.study-abstract,attributes.study-name,attributes.data-origination,attributes.last-update,relationships.biomes.data
0,studies,MGYS00006074,75,MGYS00006074,PRJEB51168,False,ERP135767,EMG,The Third Party Annotation (TPA) assembly was ...,EMG produced TPA metagenomics assembly of PRJE...,SUBMITTED,2022-10-21T19:56:09,"[{'id': 'root:Environmental:Aquatic:Marine', '..."
1,studies,MGYS00006072,130,MGYS00006072,PRJEB54918,False,ERP139784,EMG,The Third Party Annotation (TPA) assembly was ...,EMG produced TPA metagenomics assembly of PRJN...,SUBMITTED,2022-10-13T14:03:52,[{'id': 'root:Environmental:Aquatic:Marine:Oce...
2,studies,MGYS00006034,971,MGYS00006034,PRJEB50181,False,ERP134737,EMG,The Third Party Annotation (TPA) assembly was ...,EMG produced TPA metagenomics assembly of PRJN...,SUBMITTED,2022-10-10T12:22:12,"[{'id': 'root:Environmental:Aquatic:Marine', '..."
3,studies,MGYS00006061,3,MGYS00006061,PRJEB52804,False,ERP137544,EMG,The Third Party Annotation (TPA) assembly was ...,EMG produced TPA metagenomics assembly of PRJE...,SUBMITTED,2022-09-05T14:38:08,"[{'id': 'root:Environmental:Aquatic:Marine', '..."
4,studies,MGYS00006058,41,MGYS00006058,PRJEB51989,False,ERP136659,EMG,The Third Party Annotation (TPA) assembly was ...,EMG produced TPA metagenomics assembly of PRJE...,SUBMITTED,2022-08-27T19:19:53,"[{'id': 'root:Environmental:Aquatic:Marine', '..."
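
The dotted column names in the table above (for example `attributes.accession`) come from `pd.json_normalize`, which flattens nested JSON objects into one column per leaf value. Here is a minimal offline sketch of that behaviour. The records are made-up placeholders shaped like the API's JSON, not real MGnify data.

```python
import pandas as pd

# Hypothetical records shaped like JSON:API resources (all values are made up).
records = [
    {"type": "studies", "id": "STUDY_A",
     "attributes": {"samples-count": 3, "accession": "STUDY_A"}},
    {"type": "studies", "id": "STUDY_B",
     "attributes": {"samples-count": 7, "accession": "STUDY_B"}},
]

# Nested dictionaries become dot-separated column names.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

Because every nested field gets its own column, you can select values with ordinary column access such as `flat['attributes.samples-count']`.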


## Show the studies' samples on a map

We can fetch the Samples for each Study and concatenate them all into one DataFrame.
Each sample has geolocation data in its `attributes`: this is what we need to build a map.

It takes time to fetch data for all samples, so **let's show samples from the first 6 studies only.** 

In [10]:
studies_samples = []

with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    for idx, study in studies[:6].iterrows():
        print(f"fetching {study.id} samples")
        samples = map(lambda r: r.json, mgnify.iterate(f'studies/{study.id}/samples'))
        samples = pd.json_normalize(samples)
        samples = pd.DataFrame(data={
            'accession': samples['id'],
            'sample_id': samples['id'],
            'study': study.id, 
            'lon': samples['attributes.longitude'],
            'lat': samples['attributes.latitude'],
            'color': "#FF0000",
        })
        samples.set_index('accession', inplace=True)
        studies_samples.append(samples)
studies_samples = pd.concat(studies_samples)

fetching MGYS00006074 samples
fetching MGYS00006072 samples
fetching MGYS00006034 samples
fetching MGYS00006061 samples
fetching MGYS00006058 samples
fetching MGYS00006057 samples


In [4]:
studies_samples[:5]

accession,sample_id,study,lon,lat,color
ERS492820,ERS492820,MGYS00006074,-140.5216,-9.1504,#FF0000
ERS477953,ERS477953,MGYS00006074,1.9478,37.0541,#FF0000
ERS488658,ERS488658,MGYS00006074,39.875,18.3967,#FF0000
ERS490548,ERS490548,MGYS00006074,-35.1803,-20.9354,#FF0000
ERS488147,ERS488147,MGYS00006074,5.9422,39.0609,#FF0000
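
Before plotting, it can be worth checking that every row has usable coordinates. We have not checked whether any AtlantECO samples lack them, so treat the following as an optional sketch. It uses a made-up table (the names `toy_samples` and `mappable` are hypothetical) and keeps only rows whose longitude and latitude are numeric and within the valid ranges.

```python
import pandas as pd

# Hypothetical sample table; the IDs and coordinates are made up.
toy_samples = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "lon": [-140.52, None, "39.875", 200.0],
    "lat": [-9.15, 37.05, "18.3967", 10.0],
})

# Convert to numbers; anything unparseable or missing becomes NaN.
lon = pd.to_numeric(toy_samples["lon"], errors="coerce")
lat = pd.to_numeric(toy_samples["lat"], errors="coerce")

# NaN never satisfies between(), so missing values are excluded too.
valid = lon.between(-180, 180) & lat.between(-90, 90)

# Keep the valid rows and store the coordinates as numbers.
mappable = toy_samples[valid].assign(lon=lon[valid], lat=lat[valid])
print(mappable)
```

The same three steps could be applied to `studies_samples` before passing it to the map.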


In [5]:
import leafmap
m = leafmap.Map(center=(0, 0), zoom=2)
m.add_points_from_xy(studies_samples, 
                     x='lon', 
                     y='lat', 
                     popup=["study", "sample_id"], 
                     color_column='color',
                     add_legend=False)
m

Map(center=[0, 0], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', 'zoom_out_text'…

## Check GO term presence
Let's check whether a specific identifier is present in each sample. This example uses the GO term `GO:0015878`, but other identifier types are available from the MGnify API.

We will work with the MGnify analyses (`MGYA`s) corresponding to the chosen samples. We filter the analyses by:
- pipeline version: 5.0
- experiment type: assembly
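
These filters are sent to the API as URL query parameters. The sketch below builds such a query string with the standard library's `urlencode`, which also takes care of escaping. The helper name `build_filter_query` is hypothetical and not part of `jsonapi_client`; the parameter names are the ones used in this notebook.

```python
from urllib.parse import urlencode


def build_filter_query(sample_accession, pipeline_version="5.0",
                       experiment_type="assembly"):
    """Return a query string for filtering analyses (hypothetical helper)."""
    return urlencode({
        "pipeline_version": pipeline_version,
        "sample_accession": sample_accession,
        "experiment_type": experiment_type,
    })


query = build_filter_query("ERS492820")
print(query)
```

The resulting string has the same form as the one passed to `Modifier` in this notebook.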

This example processes **just the first 10 samples** (again, because the full dataset takes a while to fetch).
First, get the analyses for each sample.

In [11]:
analyses = []
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    for idx, sample in studies_samples[:10].iterrows():
        print(f"processing {sample.sample_id}")
        filtering = Modifier(f"pipeline_version=5.0&sample_accession={sample.sample_id}&experiment_type=assembly")
        analysis = map(lambda r: r.json, mgnify.iterate('analyses', filter=filtering))
        analysis = pd.json_normalize(analysis)
        analyses.append(analysis)
analyses = pd.concat(analyses)
analyses[:5]

processing ERS492820
processing ERS477953
processing ERS488658
processing ERS490548
processing ERS488147
processing ERS490665
processing ERS490382
processing ERS492932
processing ERS477931
processing ERS488723


,type,id,attributes.pipeline-version,attributes.experiment-type,attributes.analysis-summary,attributes.analysis-status,attributes.accession,attributes.is-private,attributes.complete-time,attributes.instrument-platform,attributes.instrument-model,relationships.study.data.id,relationships.study.data.type,relationships.sample.data.id,relationships.sample.data.type,relationships.assembly.data.id,relationships.assembly.data.type
0,analysis-jobs,MGYA00615698,5.0,assembly,"[{'key': 'Submitted nucleotide sequences', 'va...",completed,MGYA00615698,False,2022-10-21T19:56:12,ILLUMINA,Illumina HiSeq 2000,MGYS00006074,studies,ERS492820,samples,ERZ8733628,assemblies
0,analysis-jobs,MGYA00615631,5.0,assembly,"[{'key': 'Submitted nucleotide sequences', 'va...",completed,MGYA00615631,False,2022-10-16T16:11:02,ILLUMINA,Illumina HiSeq 2000,MGYS00006074,studies,ERS477953,samples,ERZ6778699,assemblies
1,analysis-jobs,MGYA00615679,5.0,assembly,"[{'key': 'Submitted nucleotide sequences', 'va...",completed,MGYA00615679,False,2022-10-17T13:44:04,LS454,454 GS FLX Titanium,MGYS00006074,studies,ERS477953,samples,ERZ5402186,assemblies
2,analysis-jobs,MGYA00615697,5.0,assembly,"[{'key': 'Submitted nucleotide sequences', 'va...",completed,MGYA00615697,False,2022-10-21T19:54:50,LS454,454 GS FLX Titanium,MGYS00006074,studies,ERS477953,samples,ERZ5402156,assemblies
0,analysis-jobs,MGYA00615696,5.0,assembly,"[{'key': 'Submitted nucleotide sequences', 'va...",completed,MGYA00615696,False,2022-10-21T19:37:54,ILLUMINA,Illumina HiSeq 2000,MGYS00006074,studies,ERS488658,samples,ERZ5402542,assemblies


Next, check each analysis for the presence or absence of the GO term. We add a column to the dataframe holding a colour: blue if the GO term was found and red if not.

In [12]:
identifier = "go-terms"
go_term = 'GO:0015878'
go_data = []
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    for idx, mgya in analyses.iterrows():
        print(f"processing {mgya.id}")
        analysis_identifier = map(lambda r: r.json, mgnify.iterate(f'analyses/{mgya.id}/{identifier}'))
        analysis_identifier = pd.json_normalize(analysis_identifier)
        # .get() avoids an AttributeError when an analysis has no annotations (no 'id' column)
        go_data.append("#0000FF" if go_term in list(analysis_identifier.get('id', [])) else "#FF0000")
analyses.insert(2, identifier, go_data, True)

processing MGYA00615698
processing MGYA00615631
processing MGYA00615679
processing MGYA00615697
processing MGYA00615696
processing MGYA00615695
processing MGYA00615673
processing MGYA00615694
processing MGYA00615693
processing MGYA00615692
processing MGYA00615674
processing MGYA00615658
processing MGYA00615684
processing MGYA00615691
processing MGYA00615689
processing MGYA00615690


Join the analyses and sample tables to have geolocation data and identifier presence data together.

We'll create a new sub-DataFrame with a subset of the fields and add them to the map.

In [13]:
df = analyses.join(studies_samples.set_index('sample_id'), on='relationships.sample.data.id')
df2 = df[[identifier, 'lon', 'lat', 'study', 'attributes.accession', 'relationships.study.data.id', 'relationships.sample.data.id', 'relationships.assembly.data.id']].copy()
df2 = df2.set_index("study")
df2 = df2.rename(columns={"attributes.accession": "analysis_ID", 
                          'relationships.study.data.id': "study_ID",
                          'relationships.sample.data.id': "sample_ID", 
                          'relationships.assembly.data.id': "assembly_ID"
                         })
m = leafmap.Map(center=(0, 0), zoom=2)
m.add_points_from_xy(df2, 
                     x='lon', 
                     y='lat', 
                     popup=["study_ID", "sample_ID", "assembly_ID", "analysis_ID"],
                     color_column=identifier,
                     add_legend=False)
m

Map(center=[0, 0], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', 'zoom_out_text'…
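
The join in the last cell matches a column of the analyses table against the index of the samples table. If that pattern is unfamiliar, here is a toy sketch with made-up data (all names and values are hypothetical). Note that a sample with several analyses contributes its coordinates to each of them.

```python
import pandas as pd

# Hypothetical analyses: two of them belong to the same sample.
toy_analyses = pd.DataFrame({
    "analysis": ["A1", "A2", "A3"],
    "sample": ["S1", "S2", "S2"],
})

# Hypothetical samples, indexed by sample ID.
toy_samples = pd.DataFrame({
    "sample_id": ["S1", "S2"],
    "lon": [10.0, -20.0],
    "lat": [5.0, 40.0],
}).set_index("sample_id")

# `on` names the left-hand column to look up in the right-hand index.
joined = toy_analyses.join(toy_samples, on="sample")
print(joined)
```

This is why the map can show more than one point at the same location: each analysis becomes its own row.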