# Searching & Downloading SenNet Datasets with Python: Using `requests` with the Parameterized Search API and `pandas`

This notebook demonstrates how to search the SenNet data repository (https://data.sennetconsortium.org) for datasets matching particular search criteria. **In this example, GeoMx (NGS) datasets from mouse tissue are filtered by age.**

Searches utilize the **SenNet Parameterized Search API** (https://docs.sennetconsortium.org/param-search/).

Searched datasets can be downloaded utilizing the **SenNet Command Line Transfer** tool (https://docs.sennetconsortium.org/libraries/clt/).

## I. Entity Provenance:

Before searching datasets, understand the structure of SenNet entities and how they are related to each other.

![Dataset Provenance in SenNet uploads](https://docs.sennetconsortium.org/imgs/provenance-ui.png)

(From right to left) Each **dataset** (green) is derived from a **sample** (pink), which may take the form of a *block*, a *section*, or a *suspension*. Some datasets may be derived directly from other primary datasets. Likewise, samples may also be derived from other samples. For example, a tissue section or cell suspension is generated from a tissue block, which comes from a particular *organ*. All are associated with a **source** (yellow), which is the donor of a sample used to generate a dataset or datasets. Sources for SenNet data may be *Human* or *Mouse*. Datasets, samples, and sources are all **entities**. Entities may have *ancestors* to their left on the provenance graph, and *descendants* to their right.

## II. Entity Metadata:

Different types of entities contain different types of metadata. Queryable fields for each entity can be found in the documentation below:

- Source Metadata Fields: https://docs.sennetconsortium.org/param-search/schema-source.html
- Sample Metadata Fields: https://docs.sennetconsortium.org/param-search/schema-sample.html
- Dataset Metadata Fields: https://docs.sennetconsortium.org/param-search/schema-dataset.html

## III. Entity Identification:

There are three primary ways to identify an entity (source, sample, or dataset) within SenNet Data:

- **SenNet ID** (e.g. SNT123.ABCD.456): a unique human-readable identifier. All entities have a SenNet ID.
- **UUID** (e.g. 0123456789abcdef0123456789abcdef): a unique computer-readable string identifier of 32 hexadecimal characters. All SenNet IDs have a corresponding UUID, but not all UUIDs necessarily have a SenNet ID.
- **DOI** (e.g. doi:10.1234/SNT.123.ABCD.456): a permanent Digital Object Identifier. DOIs pertain to datasets and collections, but typically are not associated with other entity types (source, samples).

## IV. Using the API:

There are two pieces to finding datasets programmatically through the SenNet Search API: the **request** and the **response**.

The request is a RESTful URL starting with 'https://search.api.sennetconsortium.org/param-search/' followed by the entity type being searched (*sources*, *samples*, or *datasets*), a '?', and any query parameters (derived from entity metadata and annotations) "ANDed" together.

For example: if looking for SenNet datasets:

    https://search.api.sennetconsortium.org/param-search/datasets

Generated by **NanoString's Digital Spatial Profiler** platform:

    https://search.api.sennetconsortium.org/param-search/datasets?dataset_type=GeoMx (NGS)
    
The URL above is used as the request to the API endpoint.
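
The example URL above contains a space and parentheses, which are escaped when a request is actually sent. The following is a minimal sketch of that escaping using only the standard library (`urllib.parse`). It assumes the API accepts ordinary query-string encoding, and the variable names are illustrative only.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

base_url = 'https://search.api.sennetconsortium.org/param-search/datasets'
params = {'dataset_type': 'GeoMx (NGS)'}

# urlencode() escapes characters that are not URL-safe (here the space and parentheses)
query_string = urlencode(params)
full_url = f"{base_url}?{query_string}"
print(full_url)

# Decoding the query string recovers the original parameter values
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```

Passing a `params` dictionary to `requests.get()` performs this kind of encoding automatically, so the URL rarely needs to be assembled by hand.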

The response is what is sent back. In most cases, a successful response will be a JSON array containing all entities matching the query. In Python, you can use the `requests` library to send a request and parse the response: 

In [1]:
import requests

Using the `requests.get()` method and specifying the URL, retrieve the datasets that match the query. Clean up the input by separating the API URL from the parameters used for the search:

In [2]:
url = 'https://search.api.sennetconsortium.org/param-search/datasets'

params = {
    'dataset_type': 'GeoMx (NGS)'
}

datasets = requests.get(url, params=params)

The **Response Code** associated with the query indicates the type of response received:

- Response code `200`: At least 1 match was found using the request criteria. If the request contains `produce-clt-manifest=true`, a list of matching SenNet IDs is returned. If `produce-clt-manifest` is omitted from the query or set to `false`, a JSON array is returned.
- Response code `303`: The response exceeds 10 MB, so a URL pointing to the results in an S3 bucket is returned rather than a JSON array. Use the response as a new URL (e.g. if `response = requests.get(<url>)` yields a `response.status_code` of `303`, try `new_response = requests.get(response.text)`).
- Response code `404`: There are no entities matching the submitted request.
- Response code `504`: The query timed out (maximum of 30 seconds). The request may need to be refined to return fewer results.

The status code of the request above confirms that matches were found and returned:

In [3]:
print(f"Status code returned: {datasets.status_code}")
print(f"Number of datasets returned: {len(datasets.json())}")

Status code returned: 200
Number of datasets returned: 259
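
The response codes listed above could be handled in a single helper. The sketch below is one possible approach, not part of the SenNet tooling: `resolve_results` is a hypothetical name, it assumes `produce-clt-manifest` is not set (so a `200` response carries JSON), and it relies on the `303` body containing the follow-up URL as described above. Stand-in response objects are used so the example runs without a network call.

```python
from types import SimpleNamespace

def resolve_results(response, get):
    """Return the matching entities for a param-search response.

    `response` needs `.status_code`, `.text`, and `.json()`;
    `get` is a callable such as `requests.get`.
    """
    if response.status_code == 200:
        return response.json()            # JSON array of matching entities
    if response.status_code == 303:
        return get(response.text).json()  # Body holds the URL of the large result
    if response.status_code == 404:
        return []                         # No entities matched
    raise RuntimeError(f"Search failed with status code {response.status_code}")


# Stand-in responses (hypothetical values) so the sketch runs offline
large_result = SimpleNamespace(status_code=200, text='',
                               json=lambda: [{'sennet_id': 'SNT123.ABCD.456'}])
redirect = SimpleNamespace(status_code=303, text='https://example.org/large-result.json',
                           json=lambda: None)
fake_get = lambda url: large_result

print(resolve_results(redirect, fake_get))
```

With real requests, the call would look like `resolve_results(requests.get(url, params=params), requests.get)`.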


The `.json()` method of the returned response produces a list of Python dictionaries, one for each dataset. Use zero-based list indexing to pick out an individual dataset, then list its top-level metadata fields with the `.keys()` method:

In [4]:
datasets.json()[0].keys()

dict_keys(['assigned_to_group_name', 'cedar_mapped_metadata', 'contacts', 'contains_human_genetic_sequences', 'contributors', 'created_by_user_displayname', 'created_by_user_email', 'created_by_user_sub', 'created_timestamp', 'creation_action', 'data_access_level', 'dataset_category', 'dataset_type', 'dataset_type_hierarchy', 'description', 'display_subtype', 'doi_url', 'entity_type', 'files', 'group_name', 'group_uuid', 'has_metadata', 'has_pipeline_message', 'has_qa_published_derived_dataset', 'has_rui_information', 'has_visualization', 'ingest_id', 'ingest_metadata', 'ingest_task', 'last_modified_timestamp', 'last_modified_user_displayname', 'last_modified_user_email', 'last_modified_user_sub', 'last_touch', 'metadata', 'origin_samples', 'provider_info', 'published_timestamp', 'published_user_displayname', 'published_user_email', 'published_user_sub', 'registered_doi', 'run_id', 'sennet_id', 'source_samples', 'sources', 'status', 'status_history', 'title', 'upload', 'uuid'])

Notice that, although this is a **dataset** entity, you can access the attributes of the corresponding **samples** and **sources** used to generate the dataset in this way.
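
As a small illustration of that nesting, the sketch below uses a hypothetical, heavily trimmed dataset record. The identifier and values are invented for the example; only the key names (`sources`, `source_type`, `metadata`) follow the fields used later in this notebook.

```python
# Hypothetical, heavily trimmed dataset record; real records contain many more fields
dataset = {
    'sennet_id': 'SNT123.ABCD.456',
    'dataset_type': 'GeoMx (NGS)',
    'sources': [
        {'source_type': 'Mouse', 'metadata': {'diet': 'Teklad 2918'}}
    ],
}

# 'sources' is a list, so index into it before reading a source attribute
first_source = dataset['sources'][0]
print(first_source['source_type'])
print(first_source['metadata']['diet'])
```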

## V. Producing a Manifest File for Downloading Datasets

If `produce-clt-manifest=true` is included in the search parameters when issuing a request, the JSON-formatted response will not be returned. Instead, a list of SenNet IDs matching the search criteria will be received. This list can be written to a `.txt` file for the SenNet Command Line Transfer tool to download associated datasets (more at the end of this notebook).

In [5]:
datasets_manifest = requests.get(f"{url}?dataset_type=GeoMx (NGS)&produce-clt-manifest=true")

datasets_manifest.text

'SNT223.KLZQ.387 /\nSNT376.ZWQZ.935 /\nSNT892.BZKZ.473 /\nSNT485.PWFH.836 /\nSNT759.PPXR.947 /\nSNT554.GSZZ.774 /\nSNT695.CQGZ.986 /\nSNT378.FVTK.835 /\nSNT338.JCHN.546 /\nSNT975.GQNN.445 /\nSNT774.KWRJ.434 /\nSNT929.VBDM.424 /\nSNT462.JWSK.475 /\nSNT876.CQNR.795 /\nSNT989.QQVK.635 /\nSNT296.QXKF.366 /\nSNT659.VRRQ.598 /\nSNT495.MXRW.957 /\nSNT838.GQDW.299 /\nSNT697.PKMQ.454 /\nSNT576.VHDP.798 /\nSNT374.NGRH.375 /\nSNT535.NKGT.399 /\nSNT373.WJRD.775 /\nSNT637.RXZX.234 /\nSNT358.WRNR.834 /\nSNT323.GBLJ.398 /\nSNT233.TCJM.975 /\nSNT324.NZHW.456 /\nSNT928.CHQG.865 /\nSNT865.TCGF.336 /\nSNT775.ZKTC.533 /\nSNT729.SPKF.435 /\nSNT465.MPLS.333 /\nSNT568.XXMX.375 /\nSNT749.ZBSQ.385 /\nSNT879.CJKF.763 /\nSNT343.GPNQ.576 /\nSNT847.RZSJ.262 /\nSNT466.GZHC.332 /\nSNT285.JRPD.867 /\nSNT596.DPXS.725 /\nSNT355.SMZD.769 /\nSNT373.WSKC.988 /\nSNT938.NXSB.957 /\nSNT626.TTHB.542 /\nSNT444.VVNM.333 /\nSNT453.VNVH.222 /\nSNT239.LJWX.534 /\nSNT322.JWDS.547 /\nSNT769.QGMQ.447 /\nSNT362.GWPH.597 /\nSNT652.CWBR
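
Each line of the manifest text pairs a SenNet ID with a lone `/`. As a minimal sketch, the IDs can be recovered from that text with plain string handling. Only the first three entries of the output above are used here, and `manifest_ids` is an illustrative name.

```python
# First three entries of the manifest text shown above
manifest_text = 'SNT223.KLZQ.387 /\nSNT376.ZWQZ.935 /\nSNT892.BZKZ.473 /'

# Each line is "<SenNet ID> /"; keep only the ID before the space
manifest_ids = [line.split()[0] for line in manifest_text.splitlines() if line.strip()]
print(manifest_ids)
print(len(manifest_ids))
```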

## Subsetting GeoMx (NGS) Datasets by Source Species

To search for datasets based on data and source attributes, `produce-clt-manifest` should be excluded.

Some dataset attributes describe upload metrics and other information that may not be of interest from a biological perspective. Using the `source_type` value of the first entry in each dataset's `sources` list, split the datasets into those of human origin and those of mouse origin:

In [6]:
all_datasets = datasets.json() # Parse the JSON response once and reuse it

hs_datasets = [] # Initialize an empty list for human datasets
mm_datasets = [] # Initialize an empty list for mouse datasets

for dataset in all_datasets: # For every dataset in all GeoMx (NGS) datasets
    source_type = dataset['sources'][0]['source_type'] # Look up the source type of the first source
    if source_type == 'Human': # If the source type is human
        hs_datasets.append(dataset) # Add it to the human dataset list
    elif source_type == 'Mouse': # If the source type is mouse
        mm_datasets.append(dataset) # Add it to the mouse dataset list

Filtering shows there are 212 mouse GeoMx (NGS) datasets and 47 human GeoMx (NGS) datasets.

In [7]:
print(f"Number of mouse datasets: {len(mm_datasets)}")
print(f"Number of human datasets: {len(hs_datasets)}")

Number of mouse datasets: 212
Number of human datasets: 47
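
The same tally can be sketched in a single pass with `collections.Counter`, which is handy for checking which source types are present before filtering. The records below are hypothetical stand-ins for the parsed response and include only the fields used here.

```python
from collections import Counter

# Hypothetical stand-in for the parsed response; IDs and values are invented
records = [
    {'sennet_id': 'SNT000.AAAA.001', 'sources': [{'source_type': 'Mouse'}]},
    {'sennet_id': 'SNT000.AAAA.002', 'sources': [{'source_type': 'Human'}]},
    {'sennet_id': 'SNT000.AAAA.003', 'sources': [{'source_type': 'Mouse'}]},
]

# Count datasets by the source type of their first source
counts = Counter(record['sources'][0]['source_type'] for record in records)
print(counts['Mouse'], counts['Human'])
```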


To query the filtered datasets further, use `pandas` to combine the **source** metadata of returned entities with the **dataset** metadata.

In [8]:
import pandas as pd

## Filtering Mouse GeoMx (NGS) Datasets by Age using `pandas`

Start with mouse data, which have a straightforward source metadata schema. First, extract the SenNet IDs and UUIDs of all the datasets and put them into a new dictionary, then put other metadata into their respective dictionaries. Finally, merge the dictionary objects and use them to create a `pandas` dataframe.

In [9]:
mm_datasets[0].keys()

dict_keys(['assigned_to_group_name', 'cedar_mapped_metadata', 'contacts', 'contains_human_genetic_sequences', 'contributors', 'created_by_user_displayname', 'created_by_user_email', 'created_by_user_sub', 'created_timestamp', 'creation_action', 'data_access_level', 'dataset_category', 'dataset_type', 'dataset_type_hierarchy', 'description', 'display_subtype', 'doi_url', 'entity_type', 'files', 'group_name', 'group_uuid', 'has_metadata', 'has_pipeline_message', 'has_qa_published_derived_dataset', 'has_rui_information', 'has_visualization', 'ingest_id', 'ingest_metadata', 'ingest_task', 'last_modified_timestamp', 'last_modified_user_displayname', 'last_modified_user_email', 'last_modified_user_sub', 'last_touch', 'metadata', 'origin_samples', 'provider_info', 'published_timestamp', 'published_user_displayname', 'published_user_email', 'published_user_sub', 'registered_doi', 'run_id', 'sennet_id', 'source_samples', 'sources', 'status', 'status_history', 'title', 'upload', 'uuid'])

In [10]:
mm_ids = {'sennet_id': [], 'uuid': []} # Initialize a dictionary containing different ID types as the keys and empty lists as values
mm_source_meta = {} # Initialize an empty dictionary for source metadata
mm_assay_meta = {} # Initialize an empty dictionary for assay metadata

for i in range(0, len(mm_datasets)): # For every dataset in the Mouse-filtered datasets
    
    mm_ids['sennet_id'].append(mm_datasets[i]['sennet_id']) # Add the dataset's SenNet ID to the list of SenNet IDs
    
    mm_ids['uuid'].append(mm_datasets[i]['uuid']) # Add the dataset's UUID to the list of UUIDs
    
    for source_field in mm_datasets[i]['sources'][0]['metadata']: # For each source metadata field
        if source_field not in mm_source_meta: # If the field hasn't already been added
            mm_source_meta[source_field] = [] # Add the field
        mm_source_meta[source_field].append(mm_datasets[i]['sources'][0]['metadata'][source_field]) # Then add the value to the list

    for assay_field in mm_datasets[i]['metadata']: # For each assay metadata field
        if assay_field not in mm_assay_meta: # If the field hasn't already been added
            mm_assay_meta[assay_field] = [] # Add the field
        mm_assay_meta[assay_field].append(mm_datasets[i]['metadata'][assay_field]) # Then add the value to the list
    
mm_dict = mm_ids | mm_source_meta | mm_assay_meta # Merge all the dictionary objects

mm_dataframe = pd.DataFrame(mm_dict) # Create a pandas dataframe from the merged object

The resulting dataframe contains all 212 GeoMx (NGS) datasets generated using mouse source samples.

In [11]:
mm_dataframe

Unnamed: 0,sennet_id,uuid,bedding,cage_enhancements,date_of_birth_or_fertilization,date_of_death,diet,euthanization_method,is_deceased,is_embryo,...,segment_id,source_storage_duration_unit,source_storage_duration_value,target_retrieval_incubation_temperature,target_retrieval_incubation_time_unit,target_retrieval_incubation_time_value,targeted_entity_id,targeted_entity_label,time_since_acquisition_instrument_calibration_unit,time_since_acquisition_instrument_calibration_value
0,SNT223.KLZQ.387,f8d8f85dd4f38dc90dcb1176db43a3d5,Corn cob bedding,Other,2021-05-25 0:00:00,2023-09-06 9:00:00,Teklad 2918,Carbon dioxide inhalant euthanasia,Yes,No,...,d9fb767e-bb64-4153-8b7b-221029cc32b1,day,2,100,minute,20,,Zone 2,,
1,SNT376.ZWQZ.935,40e88991b0f179fe0c75fd165bc14636,Corn cob bedding,Other,6/12/2021 0:00,12/28/2022 9:00,Teklad 2918,Carbon dioxide inhalant euthanasia,Yes,No,...,af498d42-de94-4699-9a69-acc65871f605,day,2,100,minute,20,,Zone 2,,
2,SNT892.BZKZ.473,abdb55c41e00736f156dec155fec66a6,Corn cob bedding,Other,8/10/2022 0:00,11/29/2022 9:00,Teklad 2918,Carbon dioxide inhalant euthanasia,Yes,No,...,1130e56a-1575-4c70-9e42-4896edb63576,day,2,100,minute,20,,Zone 3,,
3,SNT485.PWFH.836,f583346d521788e894c0c883018b0fa0,Corn cob bedding,Other,4/29/2021 0:00,11/29/2023 14:30,Teklad 2918,Carbon dioxide inhalant euthanasia,Yes,No,...,5b2bc97d-cc69-470e-a051-a817efac9ce9,day,2,100,minute,20,,Zone 1 - CD45-rich,,
4,SNT695.CQGZ.986,e2210007b5d01362c60d5fbf801f06a8,Corn cob bedding,Other,5/6/2021 0:00,12/1/2023 9:00,Teklad 2918,Carbon dioxide inhalant euthanasia,Yes,No,...,1212ea1f-8677-4964-b6ab-b282f0bbf3c8,day,2,100,minute,20,,Zone 3,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
207,SNT249.ZJXV.744,cacc45025cff0c1531ff8cb1fcc3b57a,Corn cob bedding,Other,6/12/2021 0:00,12/28/2022 9:00,Teklad 2918,Carbon dioxide inhalant euthanasia,Yes,No,...,bacac692-3b70-45a1-ae87-d3d2e4ef397e,day,2,100,minute,20,,Zone 1,,
208,SNT933.JMNF.579,045a3a624f8aefd5ebe04a784fc92b26,Corn cob bedding,Other,5/6/2021 0:00,9/8/2023 9:00,Teklad 2918,Carbon dioxide inhalant euthanasia,Yes,No,...,6b4b63fb-d481-42e7-99d8-d60706b6e229,day,2,100,minute,20,,Zone 3,,
209,SNT855.BVBV.374,202daa234ef9344bea9a46c9a6b65134,Corn cob bedding,Other,6/12/2021 0:00,12/28/2022 9:00,Teklad 2918,Carbon dioxide inhalant euthanasia,Yes,No,...,9db8271c-51ab-482a-bfd3-027a9d217f80,day,2,100,minute,20,,Zone 3,,
210,SNT523.PPZL.282,f27bd6456f85860658625abd21cebae6,Corn cob bedding,Other,10/22/2020 0:00,11/29/2022 9:00,Teklad 2918,Carbon dioxide inhalant euthanasia,Yes,No,...,0455857b-2fd8-4ca2-8ac3-f79adb28f84e,day,2,100,minute,20,,Zone 3,,
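
The column-building loop in cell 10 appends a value only when a field is present, so it relies on every dataset reporting the same metadata fields. If some records lacked a field, the lists would end up with different lengths and could not be assembled into a dataframe. The sketch below shows one way to guard against that by filling gaps with `None`. The records are hypothetical, using field names and values seen in the table above.

```python
# Hypothetical source metadata for three datasets; the second lacks 'diet'
source_metadata = [
    {'bedding': 'Corn cob bedding', 'diet': 'Teklad 2918'},
    {'bedding': 'Corn cob bedding'},
    {'bedding': 'Corn cob bedding', 'diet': 'Teklad 2918'},
]

# Collect every field name seen in any record, preserving first-seen order
fields = list(dict.fromkeys(field for record in source_metadata for field in record))

# Build one equal-length column per field, using None where a record lacks the field
columns = {field: [record.get(field) for record in source_metadata] for field in fields}
print(columns)
```

Because every list in `columns` has one entry per record, the dictionary can be passed to `pd.DataFrame` directly.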


Date of birth and date of death are tracked in the metadata for mouse datasets. However, to filter datasets by age (e.g. in months), age must be calculated from these dates. For this, the `dateutil.parser` module can be used.

In [12]:
from dateutil.parser import parse

Parsing the date-time information with this package helps ensure consistent, regular formatting.

In [13]:
dob_list = list(map(lambda x: parse(x), mm_dataframe.date_of_birth_or_fertilization)) # Parse every item in the list and put in a new list
dod_list = list(map(lambda x: parse(x), mm_dataframe.date_of_death)) # Parse every item in the list and put in a new list
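
By default, `dateutil` reads an ambiguous date such as `6/12/2021` month-first. Where that assumption should be explicit, a stricter alternative is to list the expected formats and fail on anything else. The sketch below uses only the standard library. The two formats are those visible in the table above, `parse_timestamp` is a hypothetical helper, and treating the slash-separated dates as month/day/year is an assumption.

```python
from datetime import datetime

# Formats observed in the table above; month/day/year order is an assumption
KNOWN_FORMATS = ('%Y-%m-%d %H:%M:%S', '%m/%d/%Y %H:%M')

def parse_timestamp(text):
    """Try each known format in turn and fail loudly if none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

print(parse_timestamp('2021-05-25 0:00:00'))
print(parse_timestamp('6/12/2021 0:00'))
```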

With formats made consistent, DOB and DOD information can be used to calculate age:

In [14]:
ages_months = [] # Initialize an empty list of ages

for i in range(0, len(mm_dataframe)): # For every mouse dataset
    age_months = (dod_list[i] - dob_list[i]).days / (365 / 12) # Calculate the age in days and convert to months
    ages_months.append(age_months) # Add the calculated age to the list

mm_dataframe['age_months'] = ages_months # Then turn the list into a pandas series in a column of the dataframe
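
As a worked example of this calculation, the sketch below applies the same arithmetic to the birth and death timestamps shown in row 0 of the table above. The variable names are illustrative.

```python
from datetime import datetime

# Birth and death timestamps from row 0 of the table above
date_of_birth = datetime(2021, 5, 25, 0, 0)
date_of_death = datetime(2023, 9, 6, 9, 0)

age_days = (date_of_death - date_of_birth).days  # Whole days; the partial day is dropped
days_per_month = 365 / 12                        # Average month length, ignoring leap years
age_months = age_days / days_per_month

print(age_days)
print(round(age_months, 1))
```

Because a month is approximated as 365 / 12 days, ages that fall very close to a cutoff may shift slightly compared with a calendar-based calculation.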

The updated dataframe can be queried by age cutoffs:

In [15]:
mm_12mo = mm_dataframe[mm_dataframe.age_months >= 12] # Subset the dataframe on the calculated column.
print(f"{len(mm_12mo)} GeoMx (NGS) datasets are from mice 12 months old or more.")

133 GeoMx (NGS) datasets are from mice 12 months old or more.


Using this age-filtered data, generate a new list of SenNet IDs in a similar fashion to producing the manifest information using `produce-clt-manifest=true`. The function below (`make_manifest_text`) does just that.

In [16]:
def make_manifest_text(dataframe):

    try:
        columns = dataframe.columns # Make sure the input has a .columns attribute. Otherwise: error.
    except AttributeError as e:
        print(f"{e}: input must be a pandas dataframe with named columns.")
        raise

    if 'sennet_id' not in columns: # Make sure the input has a column called 'sennet_id'. Otherwise: error.
        raise KeyError("'sennet_id' not found in dataframe.columns.")

    # Follow every SenNet ID with a space (" ") and a forward slash ("/")
    entries = [f"{sennet_id} /" for sennet_id in dataframe.sennet_id]

    # Separate entries with a new line ("\n"); no new line is added after the last ID
    manifest = '\n'.join(entries)

    return manifest

Using the function on the filtered dataframe containing GeoMx (NGS) datasets from mice 12 months or older pulls all SenNet IDs from these datasets:

In [17]:
manifest_12mo = make_manifest_text(mm_12mo)

manifest_12mo

'SNT223.KLZQ.387 /\nSNT376.ZWQZ.935 /\nSNT485.PWFH.836 /\nSNT695.CQGZ.986 /\nSNT338.JCHN.546 /\nSNT774.KWRJ.434 /\nSNT929.VBDM.424 /\nSNT462.JWSK.475 /\nSNT659.VRRQ.598 /\nSNT495.MXRW.957 /\nSNT838.GQDW.299 /\nSNT697.PKMQ.454 /\nSNT535.NKGT.399 /\nSNT373.WJRD.775 /\nSNT637.RXZX.234 /\nSNT323.GBLJ.398 /\nSNT865.TCGF.336 /\nSNT775.ZKTC.533 /\nSNT465.MPLS.333 /\nSNT879.CJKF.763 /\nSNT343.GPNQ.576 /\nSNT847.RZSJ.262 /\nSNT466.GZHC.332 /\nSNT285.JRPD.867 /\nSNT596.DPXS.725 /\nSNT373.WSKC.988 /\nSNT938.NXSB.957 /\nSNT626.TTHB.542 /\nSNT444.VVNM.333 /\nSNT239.LJWX.534 /\nSNT322.JWDS.547 /\nSNT769.QGMQ.447 /\nSNT362.GWPH.597 /\nSNT997.FSSQ.597 /\nSNT959.ZDVC.364 /\nSNT564.CWMM.869 /\nSNT227.RTCS.994 /\nSNT478.JHGM.667 /\nSNT675.TJRJ.326 /\nSNT697.SFNV.549 /\nSNT963.BGJX.956 /\nSNT844.HQXP.342 /\nSNT529.FMJT.478 /\nSNT394.WDZV.227 /\nSNT667.SFSP.634 /\nSNT969.LCGK.896 /\nSNT334.SNRH.457 /\nSNT263.CNKN.464 /\nSNT325.WRCG.238 /\nSNT246.KSXM.486 /\nSNT932.WNDG.774 /\nSNT375.JVJM.768 /\nSNT434.GXFW

These SenNet IDs can be written to a text file using the code below with comment hashes removed:

In [18]:
# with open("manifest.txt", "w") as file:
#     file.write(manifest_12mo)

## Downloading datasets using the manifest text file and `sennet-clt`

The SenNet command line tool (https://docs.sennetconsortium.org/libraries/clt/) can be installed from PyPI:
    
    pip install atlas-clt

This tool first requires users to be authenticated in Globus. Credentials may be institutional, Google, ORCID, GitHub, or another form of Globus ID. Prompt the login screen:

    sennet-clt login

Then use the browser window to authenticate an account. Check the current login status:

    sennet-clt whoami

And logout when finished:

    sennet-clt logout

The command line transfer tool allows datasets to be downloaded in bulk using a manifest text file:

    sennet-clt transfer </path/to/manifest.txt> --destination </path/to/downloads/folder>

Manifest files in which each entry is followed by a lone '/' will download all available files for each given SenNet ID or entity ID.

### ***Use with caution: downloading all files for hundreds of datasets may require <u>several terabytes of open drive space</u>.***