# How to use the EBI Metagenomics API

The EMG REST API (https://www.ebi.ac.uk/metagenomics/api/latest/) provides an easy-to-use set of top-level resources, such as studies, samples, runs, experiment-types, biomes and annotations, that let users access metagenomics data in JSON, a lightweight syntax for storing and exchanging data. Retrieving data is as simple as sending an HTTP request. Each response is a JSON object that contains the resource type, the associated object identifier (id) and its attributes. Where appropriate, relationships and links to other resources are provided.
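As a minimal sketch of that request/response cycle, the snippet below uses only the Python standard library. The helper names (`resource_url`, `fetch_json`, `summarise`) and the hand-written sample document are illustrative assumptions, not part of the API; the document shape assumes the JSON:API convention of a `data` member holding `type`, `id`, `attributes` and `relationships`.

```python
import json
from urllib.request import urlopen

API_BASE = 'https://www.ebi.ac.uk/metagenomics/api/latest/'


def resource_url(resource, identifier=None):
    """Build the URL of a resource listing or of a single resource."""
    url = API_BASE + resource
    return url + '/' + identifier if identifier else url


def fetch_json(url, timeout=30):
    """Send a GET request and decode the JSON body (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


def summarise(document):
    """Return (type, id, attribute names, relationship names) of one resource."""
    data = document['data']
    return (
        data['type'],
        data['id'],
        sorted(data.get('attributes', {})),
        sorted(data.get('relationships', {})),
    )


# Hand-written stand-in for a response, so the example runs offline.
# The attribute and relationship names are assumptions for illustration.
example = {
    'data': {
        'type': 'studies',
        'id': 'ERP009004',
        'attributes': {'study-name': 'Hydrocarbon Metagenomics Project'},
        'relationships': {'biomes': {}, 'samples': {}},
    }
}
print(summarise(example))
# Against the live API:
# summarise(fetch_json(resource_url('studies', 'ERP009004')))
```

Keeping the network call in its own function means the parsing logic can be exercised without contacting the server.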

We have utilised an interactive documentation framework (Swagger UI) to visualise and simplify interaction with the API’s resources via an HTML interface. Detailed explanations of the purpose of all resources, along with many examples, are provided to guide end-users. Documentation on how to use the endpoints is available at https://www.ebi.ac.uk/metagenomics/api/docs/.

# Browse API

### Task 1

Find marine studies

Answer:
1. https://www.ebi.ac.uk/metagenomics/api/latest/studies?lineage=root%3AEnvironmental%3AAquatic%3AMarine
2. https://www.ebi.ac.uk/metagenomics/api/latest/biomes/root:Environmental:Aquatic:Marine/studies

The same two forms work for samples:
1. https://www.ebi.ac.uk/metagenomics/api/latest/samples?lineage=root%3AEnvironmental%3AAquatic%3AMarine
2. https://www.ebi.ac.uk/metagenomics/api/latest/biomes/root:Environmental:Aquatic:Marine/samples
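The two answers differ only in where the biome lineage goes: as a percent-encoded `lineage` query parameter, or as a path segment under `biomes`. The sketch below rebuilds both forms; `filtered_url` and `nested_url` are hypothetical helper names introduced here for illustration.

```python
from urllib.parse import urlencode

API_BASE = 'https://www.ebi.ac.uk/metagenomics/api/latest/'


def filtered_url(resource, lineage):
    """Filter a top-level resource by biome lineage (query-string form)."""
    # urlencode percent-encodes the colons in the lineage as %3A.
    return API_BASE + resource + '?' + urlencode({'lineage': lineage})


def nested_url(resource, lineage):
    """Reach the same resource through the biome it belongs to (path form)."""
    return API_BASE + 'biomes/' + lineage + '/' + resource


marine = 'root:Environmental:Aquatic:Marine'
for resource in ('studies', 'samples'):
    print(filtered_url(resource, marine))
    print(nested_url(resource, marine))
```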


### Task 2

Find oceanic metagenomic samples taken from latitude >= 70° (N)

Answer:
1. https://www.ebi.ac.uk/metagenomics/api/latest/samples?experiment_type=metagenomic&lineage=root%3AEnvironmental%3AAquatic%3AMarine%3AOceanic&latitude_gte=70

2. https://www.ebi.ac.uk/metagenomics/api/latest/experiment-types/metagenomic/samples?experiment_type=&biome_name=&lineage=root%3AEnvironmental%3AAquatic%3AMarine%3AOceanic&geo_loc_name=&latitude_gte=70
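The second URL carries several parameters with no value (`experiment_type=`, `biome_name=`, `geo_loc_name=`). Assuming the server ignores empty filters, which the equivalence of the two answers suggests but the source does not state, they can be stripped before sharing or scripting the URL. The helper name `drop_empty_params` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_empty_params(url):
    """Remove query parameters that have no value, keeping the rest in order."""
    parts = urlsplit(url)
    # parse_qsl skips blank values unless keep_blank_values=True is passed.
    pairs = parse_qsl(parts.query)
    return urlunsplit(parts._replace(query=urlencode(pairs)))


url = ('https://www.ebi.ac.uk/metagenomics/api/latest/'
       'experiment-types/metagenomic/samples'
       '?experiment_type=&biome_name='
       '&lineage=root%3AEnvironmental%3AAquatic%3AMarine%3AOceanic'
       '&geo_loc_name=&latitude_gte=70')
print(drop_empty_params(url))
```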

# Write scripts

### Import Python modules

In [1]:
from pandas import DataFrame

try:
    from urllib import urlencode
except ImportError:
    from urllib.parse import urlencode

In [2]:
from jsonapi_client import Session, Filter

API_BASE = 'https://www.ebi.ac.uk/metagenomics/api/latest/'

### Get study

Get study: https://www.ebi.ac.uk/metagenomics/api/latest/studies/ERP009004

In [3]:
with Session(API_BASE) as s:
    study = s.get('studies', 'ERP009004').resource
    print('Study name:', study.study_name)
    print('Study abstract:', study.study_abstract)
    for biome in study.biomes:
        print('Biome:', biome.biome_name, biome.lineage)

Study name: Hydrocarbon Metagenomics Project
Study abstract: Metagenomics for Greener Production and Extraction of Hydrocarbon Energy:
Creating Opportunities for Enhanced Recovery with Reduced Environmental Impact
Biome: Freshwater root:Environmental:Aquatic:Freshwater
Biome: Marine root:Environmental:Aquatic:Marine
Biome: Soil root:Environmental:Terrestrial:Soil


### List samples with biomes for the given study

Get study: https://www.ebi.ac.uk/metagenomics/api/latest/studies/ERP001736

List samples: https://www.ebi.ac.uk/metagenomics/api/latest/studies/ERP001736/samples


Fetch samples for the given study accession: https://www.ebi.ac.uk/metagenomics/api/latest/samples?study_accession=ERP001736


In [4]:
df = DataFrame(columns=('sample name', 'lineage', 'biome', 'feature', 'material'))
df.index.name = 'accession'

with Session(API_BASE) as s:
    params = {
        'study_accession': 'ERP001736',
        'page_size': 100,
    }
    f = Filter(urlencode(params))
    for sample in s.iterate('samples', f):
        df.loc[sample.accession] = [
            sample.sample_name,
            sample.biome.id,
            sample.environment_biome,
            sample.environment_feature,
            sample.environment_material
        ]
df

accession,sample name,lineage,biome,feature,material
ERS488919,TARA_20100318T1133Z_039_EVENT_PUMP_P_D_(25 m)_...,root:Environmental:Aquatic:Marine:Oceanic,marine biome (ENVO:00000447),deep chlorophyll maximum layer (ENVO:xxxxxxxx),"saline water (ENVO:00002010), including plankt..."
ERS478017,TARA_20091215T1041Z_030_EVENT_PUMP_P_S_(5-7m)_...,root:Environmental:Aquatic:Marine:Oceanic,marine biome (ENVO:00000447),surface water (ENVO:00002042) layer,"saline water (ENVO:00002010), including plankt..."
ERS491463,TARA_20110312T1937Z_093_Combined-EVENTS_CAST_M...,root:Environmental:Aquatic:Marine:Oceanic,marine biome (ENVO:00000447),deep chlorophyll maximum layer (ENVO:xxxxxxxx),"""saline water (ENVO:00002010), including plank..."
ERS490542,TARA_20101016T0955Z_076_EVENT_PUMP_P_S_(5 m)_B...,root:Environmental:Aquatic:Marine:Oceanic,marine biome (ENVO:00000447),surface water layer (ENVO:00002042),"""saline water (ENVO:00002010), including plank..."
ERS490691,TARA_20101104T1816Z_078_Combined-EVENTS_CAST_M...,root:Environmental:Aquatic:Marine:Oceanic,marine biome (ENVO:00000447),deep chlorophyll maximum layer (ENVO:xxxxxxxx),"""saline water (ENVO:00002010), including plank..."
ERS490597,TARA_20101016T1700Z_076_Combined-EVENTS_CAST_M...,root:Environmental:Aquatic:Marine:Oceanic,marine biome (ENVO:00000447),deep chlorophyll maximum layer (ENVO:xxxxxxxx),"""saline water (ENVO:00002010), including plank..."
ERS489315,TARA_20100419T0756Z_048_EVENT_PUMP_P_S_(5 m)_B...,root:Environmental:Aquatic:Marine:Oceanic,marine biome (ENVO:00000447),surface water layer (ENVO:00002042),"""saline water (ENVO:00002010), including plank..."
ERS491095,TARA_20110106T1936Z_085_Combined-EVENTS_CAST_M...,root:Environmental:Aquatic:Marine:Oceanic,marine biome (ENVO:00000447),deep chlorophyll maximum layer (ENVO:xxxxxxxx),"""saline water (ENVO:00002010), including plank..."
ERS492778,TARA_20110801T1755Z_123_Combined-EVENTS_CAST_M...,root:Environmental:Aquatic:Marine:Oceanic,marine biome (ENVO:00000447),marine epipelagic mixed layer (ENVO:xxxxxxxxx),"""saline water (ENVO:00002010), including plank..."
ERS493460,TARA_20111019T1928Z_133_Combined-EVENTS_CAST_M...,root:Environmental:Aquatic:Marine:Oceanic,marine biome (ENVO:00000447),mesopelagic zone (ENVO:00000213),"""saline water (ENVO:00002010), including plank..."
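`s.iterate` walks through every page of the listing, so the loop above never deals with paging itself. A hand-rolled equivalent is sketched below, assuming the listing follows JSON:API pagination, where `links.next` holds the URL of the following page and is absent or null on the last one. The function name `iterate_resources` and the two in-memory pages are made up for the example.

```python
def iterate_resources(first_url, fetch):
    """Yield every item of a paginated listing.

    `fetch` is any callable that maps a URL to a decoded JSON document.
    """
    url = first_url
    while url:
        document = fetch(url)
        for item in document.get('data', []):
            yield item
        # Stop when there is no `links` member or no `next` link.
        url = (document.get('links') or {}).get('next')


# Two hand-written pages stand in for the server so the sketch runs offline.
pages = {
    'page-1': {'data': [{'id': 'ERS488919'}, {'id': 'ERS478017'}],
               'links': {'next': 'page-2'}},
    'page-2': {'data': [{'id': 'ERS491463'}],
               'links': {'next': None}},
}
accessions = [item['id'] for item in iterate_resources('page-1', pages.get)]
print(accessions)
```

Passing the fetch function in as an argument keeps the paging logic independent of the HTTP library used.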


### List samples with biomes and metadata for the given study

Samples for the given study accession: https://www.ebi.ac.uk/metagenomics/api/latest/samples?study_accession=ERP001736


In [5]:
def get_metadata(metadata, key):
    import html
    for m in metadata:
        if m['key'].lower() == key.lower():
            value = m['value']
            unit = html.unescape(m['unit']) if m['unit'] else ""
            return "{value} {unit}".format(value=value, unit=unit)
    return None

depth_label = 'geographic location (depth)'
temp_label = 'temperature'
df = DataFrame(columns=('sample name', 'biome', 'temperature', 'depth', 'longitude', 'latitude'))
df.index.name = 'accession'

with Session(API_BASE) as s:
    params = {
        'study_accession': 'ERP001736',
        'include': 'biome',
        'page_size': 100,
    }
    f = Filter(urlencode(params))
    for sample in s.iterate('samples', f):
        df.loc[sample.accession] = [
            sample.sample_name, sample.biome.id,
            get_metadata(sample.sample_metadata, temp_label),
            get_metadata(sample.sample_metadata, depth_label),
            sample.longitude, sample.latitude
        ]
df

accession,sample name,biome,temperature,depth,longitude,latitude
ERS488919,TARA_20100318T1133Z_039_EVENT_PUMP_P_D_(25 m)_...,root:Environmental:Aquatic:Marine:Oceanic,26.812225 °C,25 m,66.4727,18.5839
ERS478017,TARA_20091215T1041Z_030_EVENT_PUMP_P_S_(5-7m)_...,root:Environmental:Aquatic:Marine:Oceanic,20.460612 °C,5 m,32.8980,33.9179
ERS491463,TARA_20110312T1937Z_093_Combined-EVENTS_CAST_M...,root:Environmental:Aquatic:Marine:Oceanic,16.40115 °C,35 m,-73.0537,-33.9116
ERS490542,TARA_20101016T0955Z_076_EVENT_PUMP_P_S_(5 m)_B...,root:Environmental:Aquatic:Marine:Oceanic,23.349542 °C,5 m,-35.1803,-20.9354
ERS490691,TARA_20101104T1816Z_078_Combined-EVENTS_CAST_M...,root:Environmental:Aquatic:Marine:Oceanic,19.30925 °C,120 m,-43.2705,-30.1484
ERS490597,TARA_20101016T1700Z_076_Combined-EVENTS_CAST_M...,root:Environmental:Aquatic:Marine:Oceanic,21.643283 °C,150 m,-35.3498,-21.0292
ERS489315,TARA_20100419T0756Z_048_EVENT_PUMP_P_S_(5 m)_B...,root:Environmental:Aquatic:Marine:Oceanic,29.818233 °C,5 m,66.4228,-9.3921
ERS491095,TARA_20110106T1936Z_085_Combined-EVENTS_CAST_M...,root:Environmental:Aquatic:Marine:Oceanic,-0.78154 °C,90 m,-49.2139,-62.2231
ERS492778,TARA_20110801T1755Z_123_Combined-EVENTS_CAST_M...,root:Environmental:Aquatic:Marine:Oceanic,22.1151 °C,150 m,-140.2845,-8.9109
ERS493460,TARA_20111019T1928Z_133_Combined-EVENTS_CAST_M...,root:Environmental:Aquatic:Marine:Oceanic,4.936634 °C,650 m,-127.7268,35.2698
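`get_metadata` returns a display string such as `26.812225 °C`, which is convenient for a table but cannot be sorted or compared numerically. A possible variant, sketched here under the name `get_metadata_value` (not part of the original notebook), returns the value and the unit separately. The sample entries are hand-written in the key/value/unit shape used above, and the HTML-escaped unit is an assumption about how the degree sign is encoded.

```python
import html


def get_metadata_value(metadata, key):
    """Return (value, unit) for a metadata key, with value as float if possible."""
    for m in metadata:
        if m['key'].lower() == key.lower():
            unit = html.unescape(m['unit']) if m['unit'] else ''
            try:
                value = float(m['value'])
            except (TypeError, ValueError):
                value = m['value']  # keep non-numeric values unchanged
            return value, unit
    return None, ''


# Hand-written entries; the '&deg;C' encoding is an assumption.
sample_metadata = [
    {'key': 'temperature', 'value': '26.812225', 'unit': '&deg;C'},
    {'key': 'geographic location (depth)', 'value': '25', 'unit': 'm'},
    {'key': 'environmental package', 'value': 'water', 'unit': None},
]
print(get_metadata_value(sample_metadata, 'Temperature'))
print(get_metadata_value(sample_metadata, 'environmental package'))
```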


### List runs

Get sample: https://www.ebi.ac.uk/metagenomics/api/latest/samples/ERS1871412

List runs: https://www.ebi.ac.uk/metagenomics/api/latest/samples/ERS1871412/runs

In [6]:
df = DataFrame(columns=('instrument platform', 'instrument model', 'analysis pipeline'))
df.index.name = 'accession'

with Session(API_BASE) as s:
    sample = s.get('samples', 'ERS1871412').resource
    for run in sample.runs:
        df.loc[run.accession] = [
            run.instrument_platform, run.instrument_model,
            ", ".join([p.release_version for p in run.pipelines])
        ]

df

accession,instrument platform,instrument model,analysis pipeline
ERR2098598,ILLUMINA,Illumina HiSeq 2500,"4.0, 4.1"
ERR2098545,ILLUMINA,Illumina HiSeq 2500,"4.0, 4.1"
ERR2098490,ILLUMINA,Illumina MiSeq,"4.0, 4.1"
ERR2098489,ILLUMINA,Illumina MiSeq,"4.0, 4.1"
ERR2098440,ILLUMINA,Illumina HiSeq 2500,"4.0, 4.1"
ERR2098396,ILLUMINA,Illumina HiSeq 4000,"4.0, 4.1"


### List sample metadata

Get sample with metadata: https://www.ebi.ac.uk/metagenomics/api/latest/samples/ERS488919

In [7]:
def format_unit(unit):
    import html
    return html.unescape(unit) if unit else ""

df = DataFrame(columns=('metadata key', 'value', 'unit'))

with Session(API_BASE) as s:
    sample = s.get('samples', 'ERS488919').resource
    print(sample.sample_name, sample.accession)

    # One row per metadata entry; format_unit already handles empty units.
    for i, m in enumerate(sample.sample_metadata):
        df.loc[i] = [
            m['key'], m['value'],
            format_unit(m['unit'])
        ]

df

TARA_20100318T1133Z_039_EVENT_PUMP_P_D_(25 m)_BACT_NUC-DNA(100L)_W1.6-20_TARA_B100000105 ERS488919


,metadata key,value,unit
0,temperature,26.812225,°C
1,project name,Tara Oceans expedition (2009-2013),
2,geographic location (depth),25,m
3,environmental package,water,
4,instrument model,Illumina HiSeq 2000,
5,ENA checklist,ENA TARA (ERC000030),
6,latitude end,18.5679,DD
7,longitude end,66.4581,DD
8,marine region,,
9,protocol label,BACT_NUC-DNA(100L)_W1.6-20,


### List organisms

Organisms: https://www.ebi.ac.uk/metagenomics/api/latest/runs/ERR598955/pipelines/2.0/taxonomy

In [8]:
df = DataFrame(columns=('parent', 'domain', 'rank', 'reads'))
df.index.name = 'Organism'

with Session(API_BASE) as s:
    run = s.get('runs', 'ERR598955').resource
    for a in run.analyses:
        for ann in a.taxonomy:
            df.loc[ann.name] = [
                ann.parent, ann.domain, ann.rank, ann.count
            ]
df.sort_values('reads', ascending=False)

Organism,parent,domain,rank,reads
Pelagibacteraceae,Rickettsiales,Bacteria,family,44464
CandidatusPortiera,Halomonadaceae,Bacteria,genus,16514
Unusigned,,,,12088
Alphaproteobacteria,Proteobacteria,Bacteria,class,11596
Prochlorococcus,Synechococcaceae,Bacteria,genus,10445
Rhodobacteraceae,Rhodobacterales,Bacteria,family,5267
Rickettsiales,Alphaproteobacteria,Bacteria,order,5222
Synechococcus,Synechococcaceae,Bacteria,genus,5143
Flavobacteriaceae,Flavobacteriales,Bacteria,family,4375
OCS155,Acidimicrobiales,Bacteria,family,2382


### List functional annotations

Gene Ontology (GO) terms derived from InterPro matches: https://www.ebi.ac.uk/metagenomics/api/latest/runs/ERR598955/pipelines/2.0/go-slim

In [9]:
df = DataFrame(columns=('category', 'description', 'annotation counts'))
df.index.name = 'GO term'

with Session(API_BASE) as s:
    run = s.get('runs', 'ERR598955').resource
    for a in run.analyses:
        for ann in a.go_slim:
            df.loc[ann.accession] = [
                ann.lineage, ann.description, ann.count
            ]
df

GO term,category,description,annotation counts
GO:0030031,biological_process,cell projection assembly,103
GO:0071554,biological_process,cell wall organization or biogenesis,2341
GO:0016043,biological_process,cellular component organization,411559
GO:0051301,biological_process,cell division,165234
GO:0016049,biological_process,cell growth,7
GO:0048870,biological_process,cell motility,63785
GO:0032196,biological_process,transposition,6526
GO:0006811,biological_process,ion transport,935393
GO:0009306,biological_process,protein secretion,87068
GO:0006810,biological_process,transport,3174897


### List marine metagenomic samples collected at temperatures between 1°C and 5°C

List samples: https://www.ebi.ac.uk/metagenomics/api/latest/biomes/root:Environmental:Aquatic:Marine/samples?experiment_type=metagenomic&metadata_key=temperature&metadata_value_gte=1&metadata_value_lte=5

In [10]:
def get_metadata(metadata, key='temperature'):
    import html
    for m in metadata:
        if m['key'].lower() == key.lower():
            value = m['value']
            unit = html.unescape(m['unit']) if m['unit'] else ""
            return "{value} {unit}".format(value=value, unit=unit)
    return None

depth_label = 'geographic location (depth)'
temp_label = 'temperature'
df = DataFrame(columns=('sample name', 'biome', 'temperature', 'depth', 'location', 'latitude'))
df.index.name = 'accession'

with Session(API_BASE) as s:
    params = {
        'experiment_type': 'metagenomic',
        'metadata_key': 'temperature',
        'metadata_value_gte': 1,
        'metadata_value_lte': 5,
        'latitude_gte': 0,  # extra filter, not in the URL above: latitude >= 0 only
        'include': 'biome',
    }
    f = Filter(urlencode(params))
    for sample in s.iterate('biomes/root:Environmental:Aquatic:Marine/samples', f):
        df.loc[sample.accession] = [
            sample.sample_name, sample.biome.id,
            get_metadata(sample.sample_metadata, temp_label),
            get_metadata(sample.sample_metadata, depth_label),
            sample.geo_loc_name, sample.latitude
        ]
df

accession,sample name,biome,temperature,depth,location,latitude
SRS954949,LMO_120314,root:Environmental:Aquatic:Marine,3.2,,Baltic Sea,56.9309
SRS954959,LMO_120328,root:Environmental:Aquatic:Marine,3.7,,Baltic Sea,56.9309
SRS954978,LMO_120403,root:Environmental:Aquatic:Marine,3.1,,Baltic Sea,56.9309
SRS954975,LMO_120416,root:Environmental:Aquatic:Marine,4.0,,Baltic Sea,56.9309
SRS954974,LMO_120419,root:Environmental:Aquatic:Marine,4.4,,Baltic Sea,56.9309
SRS954976,LMO_121220,root:Environmental:Aquatic:Marine,4.7,,Baltic Sea,56.9309
SRS940674,LMO_120322,root:Environmental:Aquatic:Marine,3.0,,Baltic Sea,56.9309
SRS954972,LMO_120423,root:Environmental:Aquatic:Marine,4.1,,Baltic Sea,56.9309
SRS954970,LMO_120507,root:Environmental:Aquatic:Marine,5.6,,Baltic Sea,56.9309
SRS981327,BP_381106-1,root:Environmental:Aquatic:Marine,4.4889,,USA: Gulf of Mexico,28.7051


### Export to CSV

Get study: https://www.ebi.ac.uk/metagenomics/api/latest/studies/ERP005831

In [11]:
import csv

# newline="" lets the csv module control line endings (avoids blank rows on Windows).
with open("output.csv", "w", newline="") as csvfile:
    with Session(API_BASE) as s:
        fieldnames = ['study', 'sample', 'biome', 'lineage', 'longitude', 'latitude']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        study = s.get('studies', 'ERP005831').resource
        for sample in study.samples:
            biome = sample.biome
            row = {
                'study': study.accession,
                'sample': sample.accession,
                'biome': biome.biome_name,
                'lineage': biome.lineage,
                'longitude': sample.longitude,
                'latitude': sample.latitude
            }
            writer.writerow(row)

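from pandas import read_csv

# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0
# keeps the first column (study) as the index.
df = read_csv('output.csv', index_col=0)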
df

study,sample,biome,lineage,longitude,latitude
MGYS00000389,ERS456668,Sediment,root:Environmental:Aquatic:Freshwater:Lentic:S...,-1.56,52.38
MGYS00000389,ERS456669,Agricultural,root:Environmental:Terrestrial:Soil:Loam:Agric...,-1.61,52.19
