# Last step of the "HoloFood Data Portal Tutorial"
Previous steps do not require any coding and can be followed [in the tutorial](https://ebi-metagenomics.github.io/holofood-database/tutorial.html).

## Objective 7: Use Python to analyse data from the API

- Use the [API](http://holofooddataportaldev-env.eba-jwzhg3z2.eu-west-1.elasticbeanstalk.com/api) to fetch a list of HoloFood samples as a Pandas dataframe (the [API docs will be very helpful](https://ebi-metagenomics.github.io/holofood-database/api.html#from-python))
    - Only fetch samples from project `PRJEB41657`
    - Only fetch samples from Trial A Tank 1 (which happen to have sample titles starting `SA01`)
    - Only fetch samples which have a non-empty value for metadata variable `host length` (that's the length of the fish the sample came from)
- Make two histograms showing the distribution of `host length` metadata values at the timepoints `trial timepoint` = 0 days and 60 days
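The three filters in the list above end up as query parameters on the samples endpoint. As a minimal sketch, the query string for one page could be built as follows, using only the standard library. The parameter names (`page`, `project`, `title`, `require_metadata_value`) are the ones used in this tutorial's solution code; the helper name `build_samples_url` is hypothetical, and the sketch assumes the API accepts `+` as an encoded space.

```python
from urllib.parse import urlencode

# Base endpoint taken from this tutorial.
samples_endpoint_base = (
    'http://holofooddataportaldev-env.eba-jwzhg3z2.eu-west-1'
    '.elasticbeanstalk.com/api/samples'
)


def build_samples_url(page):
    """Return the samples list URL for one page of filtered results.

    Hypothetical helper: it only builds the URL, it does not call the API.
    """
    query = {
        'page': page,
        'project': 'PRJEB41657',
        'title': 'SA01',
        # urlencode escapes the space in 'host length' for us.
        'require_metadata_value': 'host length',
    }
    return f'{samples_endpoint_base}?{urlencode(query)}'


print(build_samples_url(1))
```

Passing the same dictionary to `requests.get(samples_endpoint_base, params=query)` lets `requests` do the encoding instead.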

Here's a starting point with the libraries and base API endpoint you need:

In [None]:
import requests
import pandas as pd
import matplotlib.pyplot as plt

samples_endpoint_base = 'http://holofooddataportaldev-env.eba-jwzhg3z2.eu-west-1.elasticbeanstalk.com/api/samples'

Here is some incomplete code for you to fill in (there is a full solution below, too):

In [None]:
page = 1
while page:
    samples_page = requests.get(         FILL IN THE API ENDPOINT AND QUERY PARAMETERS            ).json()
    samples_page_df = pd.json_normalize(            FILL IN SOME CODE TO GET ITEMS FROM THE QUERY RESPONSE                )
    
    if page == 1:
        samples_df = samples_page_df
    else:
        samples_df = pd.concat(
            [
                samples_df,
                samples_page_df
            ],
            ignore_index=True  # keep one unique row index across pages
        )
    
    page += 1
    if len(samples_df) >= samples_page['count']:
        page = False

        
def get_host_length_and_timepoint(sample):
    sample_detail = requests.get(             FILL IN THE SAMPLE DETAIL API ENDPOINT              ).json()
    metadata = sample_detail['structured_metadata']
    host_length =          WRITE SOME CODE TO GET THE host length METADATA
    timepoint =      WRITE SOME CODE TO GET THE trial timepoint METADATA MARKER
    return host_length['measurement'], timepoint['measurement']

metadata = samples_df.apply(
    get_host_length_and_timepoint, 
    axis='columns', 
    result_type='expand'
).rename(
    columns={
        0: 'host_length_cm', 
        1: 'trial_timepoint_days'
    }
)
# Join the metadata columns onto the sample rows (column-wise, not row-wise)
trial_samples = pd.concat(
    [
        samples_df,
        metadata
    ],
    axis='columns'
)
trial_samples.groupby('trial_timepoint_days').host_length_cm.hist(legend=True, bins=5, alpha=0.5)
plt.xlabel('Host length / cm');
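As a hint for the two metadata placeholders in `get_host_length_and_timepoint`, here is a minimal sketch of picking one entry out of a list of metadata by its marker name. It assumes each entry looks like `{'marker': {'name': ...}, 'measurement': ...}`, which is the shape the code above relies on. The helper name `find_measurement` is hypothetical and the example values are invented, not real API output.

```python
def find_measurement(structured_metadata, marker_name, default=None):
    """Return the measurement of the first metadatum with the given marker name.

    Falls back to `default` when no metadatum matches, instead of raising.
    """
    match = next(
        (
            metadatum
            for metadatum in structured_metadata
            if metadatum['marker']['name'] == marker_name
        ),
        None,
    )
    return default if match is None else match['measurement']


# Invented example data, shaped like a sample's 'structured_metadata' list.
example_metadata = [
    {'marker': {'name': 'host length'}, 'measurement': '10.5'},
    {'marker': {'name': 'trial timepoint'}, 'measurement': '0'},
]

print(find_measurement(example_metadata, 'host length'))
print(find_measurement(example_metadata, 'not a real marker'))
```

Giving `next` a fallback value avoids a `StopIteration` error if a sample lacks the marker you ask for.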

Here is the complete code in case you get stuck.
Click the ••• to unhide the cell.

In [None]:
page = 1

while page:
    samples_page = requests.get(
        f'{samples_endpoint_base}?{page=}&project=PRJEB41657&title=SA01&require_metadata_value=host length'
    ).json()
    samples_page_df = pd.json_normalize(
        samples_page['items']
    )
    
    if page == 1:
        samples_df = samples_page_df
    else:
        samples_df = pd.concat(
            [
                samples_df,
                samples_page_df
            ],
            ignore_index=True  # keep one unique row index across pages
        )
    
    page += 1
    if len(samples_df) >= samples_page['count']:
        page = False

def get_host_length_and_timepoint(sample):
    # Fetch the sample's detail record and pull out the two metadata values we need
    sample_detail = requests.get(
        f'{samples_endpoint_base}/{sample.accession}'
    ).json()
    metadata = sample_detail['structured_metadata']
    host_length = next(
        metadatum 
        for metadatum in metadata 
        if metadatum['marker']['name'] == 'host length'
    )
    timepoint = next(
        metadatum 
        for metadatum in metadata 
        if metadatum['marker']['name'] == 'trial timepoint'
    )
    return host_length['measurement'], timepoint['measurement']

metadata = samples_df.apply(
    get_host_length_and_timepoint, 
    axis='columns', 
    result_type='expand'
).rename(
    columns={
        0: 'host_length_cm', 
        1: 'trial_timepoint_days'
    }
)
# Join the metadata columns onto the sample rows (column-wise, not row-wise)
trial_samples = pd.concat(
    [
        samples_df,
        metadata
    ],
    axis='columns'
)
trial_samples.groupby('trial_timepoint_days').host_length_cm.hist(legend=True, bins=5, alpha=0.5)
plt.xlabel('Host length / cm');
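The paging loop above grows the dataframe one page at a time. An alternative sketch, under the same assumption that each response carries `items` and a total `count`, collects all items in a plain list first. To keep it runnable without network access, the API is replaced here by a made-up stand-in (`fake_fetch_page`, with invented accessions); `fetch_all_items` is likewise a hypothetical helper, not part of the HoloFood API.

```python
def fetch_all_items(fetch_page):
    """Collect the items from every page.

    `fetch_page(page)` must return a dict with 'items' (a list) and
    'count' (the total number of items across all pages).
    """
    items = []
    page = 1
    while True:
        response = fetch_page(page)
        items.extend(response['items'])
        # Stop once everything is collected, or if a page comes back empty.
        if len(items) >= response['count'] or not response['items']:
            break
        page += 1
    return items


# Made-up stand-in for the API: five fake samples, served two per page.
FAKE_ITEMS = [{'accession': f'FAKE{i}'} for i in range(5)]


def fake_fetch_page(page, page_size=2):
    start = (page - 1) * page_size
    return {
        'items': FAKE_ITEMS[start:start + page_size],
        'count': len(FAKE_ITEMS),
    }


all_items = fetch_all_items(fake_fetch_page)
print(len(all_items))
```

With the real API, `fetch_page` would wrap the `requests.get(...).json()` call, and `pd.json_normalize(all_items)` would then build the dataframe in a single step.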