# Python code for interacting with the EGA public metadata API

This notebook contains code for interacting with the EGA (European Genome-phenome Archive) public metadata API. Use the code as you like and feel free to adapt it to your own needs!

If you want to learn more about the public metadata API, you can find [extensive documentation](https://ega-archive.org/discovery/metadata/public-metadata-api/) on the EGA website.


## 1. Import required packages

Let's start by importing the third-party packages [pandas](https://pandas.pydata.org) and [requests](https://requests.readthedocs.io/en/latest/):


In [None]:
import pandas as pd
import requests

## 2. Create a client

Next, we will create a client that allows us to make requests to the API endpoints and handles the responses. The client is contained within a single class called `EGAClient`. This class has two methods for fetching data: `get_entity()` and `get_related_entities()`. These methods are generic in the sense that they can be used to fetch different types of metadata objects from EGA (for example datasets, studies, runs and policies).

In [None]:
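class EGAClient:
    """Minimal client for the EGA public metadata API."""

    def __init__(self, base_url='https://metadata.ega-archive.org',
                 timeout=30):
        self.base_url = base_url
        # Seconds to wait for a response. requests applies no timeout by
        # default; 30 is an arbitrary choice that you may want to adjust.
        self.timeout = timeout

    def _get(self, endpoint, params=None):
        """Send a GET request and return the parsed JSON response."""
        url = f'{self.base_url}/{endpoint}'
        response = requests.get(url, params=params, timeout=self.timeout)
        # Raise requests.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
        return response.json()

    def get_entity(self, entity_type, accession_id=None, limit=None,
                   offset=None):
        """Fetch entities of a type, or a single one if an ID is given."""
        params = {}
        endpoint = entity_type
        if accession_id:
            endpoint += f'/{accession_id}'
        if limit is not None:
            params['limit'] = limit
        if offset is not None:
            params['offset'] = offset
        return self._get(endpoint, params=params)

    def get_related_entities(self, entity_type, related_entity_type,
                             accession_id, limit=None, offset=None):
        """Fetch entities of one type that are linked to a given entity."""
        if not accession_id:
            # Without an ID the request would silently list every entity
            # of entity_type and ignore related_entity_type.
            raise ValueError('accession_id is required')
        params = {}
        endpoint = f'{entity_type}/{accession_id}/{related_entity_type}'
        if limit is not None:
            params['limit'] = limit
        if offset is not None:
            params['offset'] = offset
        return self._get(endpoint, params=params)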
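
To make the endpoint layout concrete, the sketch below rebuilds the URLs that the two methods would request, without sending anything over the network. The helper `build_url` is hypothetical and not part of `EGAClient`; it only mirrors the string handling in the class and uses the standard library's `urlencode` as a stand-in for the query-string encoding that `requests` performs.

```python
from urllib.parse import urlencode

BASE_URL = 'https://metadata.ega-archive.org'  # same default as EGAClient


def build_url(entity_type, accession_id=None, related_entity_type=None,
              limit=None, offset=None):
    """Hypothetical helper: compose the URL a client call would request."""
    parts = [entity_type]
    if accession_id:
        parts.append(accession_id)
        if related_entity_type:
            parts.append(related_entity_type)
    # Only include paging parameters that were actually set.
    params = {}
    if limit is not None:
        params['limit'] = limit
    if offset is not None:
        params['offset'] = offset
    path = '/'.join(parts)
    url = f'{BASE_URL}/{path}'
    return f'{url}?{urlencode(params)}' if params else url


print(build_url('studies', 'EGAS50000000209'))
print(build_url('datasets', 'EGAD50000000298', 'samples', limit=10, offset=2))
```

The first call corresponds to `get_entity()` with an accession ID, the second to `get_related_entities()` with paging parameters.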


The two methods `get_entity()` and `get_related_entities()` return the parsed JSON response (Python dictionaries and lists), which can be difficult for a human to read. Let's therefore define a small function that transforms the JSON data into a pandas DataFrame.

In [None]:
def as_dataframe(data_json):
    """Turn JSON into a pandas dataframe"""
    return pd.json_normalize(data_json)

The function above contains only a single line of code and is therefore not strictly needed. However, if we want to customize the transformation of the JSON data further, we can use this function as a starting point. Also, the name `as_dataframe` is arguably more readable in our context than `json_normalize`.
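
As one possible customization, the sketch below adds an optional `columns` argument that keeps only selected fields. The function name `as_dataframe_subset`, the `columns` parameter, and the record fields in the example are illustrative assumptions; the fields that the API actually returns may differ.

```python
import pandas as pd


def as_dataframe_subset(data_json, columns=None):
    """Flatten JSON into a DataFrame, optionally keeping only some columns."""
    df = pd.json_normalize(data_json)
    if columns is not None:
        # Keep the requested columns that exist; skip missing ones.
        df = df[[col for col in columns if col in df.columns]]
    return df


# Made-up records for illustration (not real EGA output).
records = [
    {'accession_id': 'EXAMPLE1', 'title': 'First', 'extra': {'note': 'a'}},
    {'accession_id': 'EXAMPLE2', 'title': 'Second', 'extra': {'note': 'b'}},
]
print(as_dataframe_subset(records, columns=['accession_id', 'extra.note']))
```

Nested fields are flattened into dotted column names such as `extra.note`, which is why the selection uses that form.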

Let's now try to use the client!

## 3. Use the client

To use the client, we first need to create an instance of the `EGAClient` class.

In [None]:
client = EGAClient()

Once we have the client instance, we can go on and fetch metadata from the API. Below are some examples that you may want to adapt to your own needs.

### Information about a single study

In [None]:
study_id = 'EGAS50000000209'
study = client.get_entity('studies', accession_id=study_id)
as_dataframe(study)

If the study ID is left out of the API call, all studies in EGA will be returned. Try not to request more information than you need, since the computational burden on the server may be high. A good practice is to use the `limit` keyword argument to specify the maximum number of records to be returned (more on this later). Also try to avoid making more calls to the API than necessary.
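
One way to avoid repeating identical calls within a notebook session is to keep results in a small in-memory cache. The sketch below is an illustration only: `CachedFetcher` is a hypothetical wrapper, and `fake_get_entity` stands in for `client.get_entity` so that the example runs without network access.

```python
class CachedFetcher:
    """Hypothetical wrapper that remembers the results of earlier calls."""

    def __init__(self, fetch):
        self._fetch = fetch  # for example client.get_entity
        self._cache = {}

    def __call__(self, *args, **kwargs):
        # Identical positional and keyword arguments map to the same key.
        # All arguments must be hashable (strings and integers are).
        key = (args, tuple(sorted(kwargs.items())))
        if key not in self._cache:
            self._cache[key] = self._fetch(*args, **kwargs)
        return self._cache[key]


calls = []


def fake_get_entity(entity_type, accession_id=None, limit=None, offset=None):
    """Stand-in for client.get_entity; records each simulated request."""
    calls.append((entity_type, accession_id, limit, offset))
    return [{'entity_type': entity_type, 'limit': limit}]


get_entity_cached = CachedFetcher(fake_get_entity)
get_entity_cached('studies', limit=5)
get_entity_cached('studies', limit=5)  # served from the cache
print(len(calls))  # 1
```

With the real client, `CachedFetcher(client.get_entity)` could be used in the same way. Note that cached results do not reflect later changes on the server.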

### Information about datasets in a study

In [None]:
study_id = 'EGAS50000000209'
datasets = client.get_related_entities(entity_type='studies', related_entity_type='datasets',
                                       accession_id=study_id)
as_dataframe(datasets)

### Information about samples associated with a dataset

In [None]:
dataset_id = 'EGAD50000000298'
samples = client.get_related_entities(entity_type='datasets', related_entity_type='samples',
                                      accession_id=dataset_id)
as_dataframe(samples)

We can control the number of records that are returned by using the keyword argument `limit`:

In [None]:
dataset_id = 'EGAD50000000298'
samples = client.get_related_entities(entity_type='datasets', related_entity_type='samples', 
                                      accession_id=dataset_id, limit=10)
as_dataframe(samples)

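We can also use the `offset` keyword argument to specify the number of records we want to skip before selecting records. For example, `limit=10` combined with `offset=2` skips the first two records and returns up to the next ten.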

In [None]:
dataset_id = 'EGAD50000000298'
samples = client.get_related_entities(entity_type='datasets', related_entity_type='samples', 
                                      accession_id=dataset_id, limit=10, offset=2)
as_dataframe(samples)
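
Combining `limit` and `offset` makes it possible to walk through a large result set page by page. The sketch below is a minimal illustration under two assumptions that are not confirmed above: that the endpoint returns a list of records, and that a page shorter than `limit` marks the end of the results. `iter_records` and `fake_fetch` are hypothetical names; `fake_fetch` replaces the real API call so that the example runs offline.

```python
def iter_records(fetch_page, page_size=100):
    """Yield records one by one, requesting page_size records per call."""
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        yield from page
        # Assumption: a short (or empty) page means there is nothing left.
        if len(page) < page_size:
            break
        offset += page_size


# Made-up data standing in for the API response.
FAKE_SAMPLES = [{'id': i} for i in range(25)]


def fake_fetch(limit, offset):
    """Stand-in for an API call that honours limit and offset."""
    return FAKE_SAMPLES[offset:offset + limit]


records = list(iter_records(fake_fetch, page_size=10))
print(len(records))  # 25
```

With the real client, `fetch_page` could be `lambda limit, offset: client.get_related_entities('datasets', 'samples', dataset_id, limit=limit, offset=offset)`. Choose a moderate page size to keep the load on the server low.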