# IEDB Query API (IQ-API) - Use Case 1G
**Goal**: Search for information related to a branch of the NCBI taxonomy, using Dengue virus and all Flaviviruses as examples.

The goal of this use case is to query for epitopes arising from a single branch of the NCBI taxonomy, for example extracting all viral epitopes or all epitopes related to Dengue virus.  The approach outlined here can be applied to any table where the *source_organism_iri_search* field exists.

For more information on the expressive syntax of PostgREST, refer to [this document](https://postgrest.org/en/stable/api.html#).  For more details on the tables that are part of the API, refer to [the swagger documentation](http://query-api.iedb.org/docs/swagger/).

---

First, let's import required modules, set some globals, and define a function to print the corresponding CURL command for each request.  I've tried to include that CURL command for each example so that you can copy/paste it into your terminal.  You may want to pipe the output to a tool like 'jq' to have it render neatly.

In [64]:
import requests
import json
import time
import pandas as pd
from io import StringIO

base_uri='https://query-api.iedb.org'

# function to print the CURL command given a request
def print_curl_cmd(req):
    url = req.url
    print("curl -X 'GET' '" + url + "'")

This may or may not have resulted in a warning about lzma compression.  That can be safely ignored...

## The IRI search fields

Before we get started, we need to understand the fields that have 'iri_search' in their names, for example *source_organism_iri_search* or *host_organism_iri_search*.  All organism-related fields in the IEDB make use of the [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) and reference the organism in question with its NCBI Taxonomy ID prefaced by the string 'NCBITaxon'.  For example, Dengue virus has the [NCBI Taxonomy ID 12637](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=12637) and would be referenced as 'NCBITaxon:12637'.  In the context of the NCBI Taxonomy, its position looks like:

```
Organism (NCBITaxon:1)
  Viruses (NCBITaxon:10239)
    Riboviria (NCBITaxon:2559587)
      Orthornavirae (NCBITaxon:2732396)
        Kitrinoviricota (NCBITaxon:2732406)
          Flasuviricetes (NCBITaxon:2732462)
            Amarillovirales (NCBITaxon:2732545)
              Flaviviridae (NCBITaxon:11050)
                Flavivirus (NCBITaxon:11051)
                  Dengue virus (NCBITaxon:12637)
```

To make use of this hierarchy, the organism taxonomy ID as well as **many of its ancestor taxonomy IDs** are encoded into these *organism_iri_search* fields.  The *source_organism_iri_search* field of an epitope record from Dengue virus looks like:

```json
"source_organism_iri_search": [
    "NCBITaxon:1",
    "NCBITaxon:10239",
    "NCBITaxon:11050",
    "NCBITaxon:11051",
    "NCBITaxon:12637",
    "OBI:0100026"
]
```

Note that several intermediate taxonomy IDs are missing: the ranks between the Viruses (superkingdom) and Flaviviridae (family) levels are skipped.  You will also note an additional entry for [OBI:0100026](http://purl.obolibrary.org/obo/OBI_0100026), which corresponds to the 'organism' term in the [Ontology of Biomedical Investigations (OBI)](http://obi-ontology.org/).
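To make the consequence of the skipped ranks concrete, here is a minimal client-side sketch that checks whether a record falls under a given taxon.  The helper `is_under_taxon` is a hypothetical name introduced for illustration and is not part of the API; the record is the example field shown above.

```python
# Example field copied from the Dengue virus record shown above.
record = {
    "source_organism_iri_search": [
        "NCBITaxon:1",
        "NCBITaxon:10239",
        "NCBITaxon:11050",
        "NCBITaxon:11051",
        "NCBITaxon:12637",
        "OBI:0100026",
    ]
}


def is_under_taxon(rec, taxon_curie, field="source_organism_iri_search"):
    """Return True if taxon_curie is listed in the record's iri_search field.

    Hypothetical helper for illustration only.
    """
    return taxon_curie in (rec.get(field) or [])


# Flavivirus is encoded in the field, so the record is found under it.
print(is_under_taxon(record, "NCBITaxon:11051"))
# Amarillovirales is one of the skipped intermediate ranks, so it is not found.
print(is_under_taxon(record, "NCBITaxon:2732545"))
```

Based on this example record, a search on one of the skipped intermediate IDs would not be expected to match it, so it is safest to query on an ID that actually appears in the field.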

Other *iri_search* fields exist with values from taxonomies other than the NCBI.  For example:

* *mhc_allele_iri_search* encodes values from the [MHC Restriction Ontology](http://www.obofoundry.org/ontology/mro.html)
* *disease_iri_search* uses the [Human Disease Ontology](http://www.obofoundry.org/ontology/doid.html)

The above list is not exhaustive.  For a detailed list of the ontologies employed and the CURIEs used to refer to them, please consult the [curie_map](https://query-api.iedb.org/curie_map) table.
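As a rough illustration of what a CURIE stands for, the sketch below expands a CURIE into a full IRI using a small hard-coded prefix map.  The map and the function `expand_curie` are assumptions made for this example: the OBI prefix is taken from the link above, the NCBITaxon prefix is assumed to follow the same OBO PURL pattern, and the authoritative mapping remains the curie_map table.

```python
# Illustrative prefix map only (an assumption for this sketch);
# consult the curie_map table for the authoritative mapping.
PREFIXES = {
    "OBI": "http://purl.obolibrary.org/obo/OBI_",
    "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand 'PREFIX:ID' into a full IRI using the given prefix map."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError("not a CURIE: " + repr(curie))
    if prefix not in prefixes:
        raise KeyError("unknown CURIE prefix: " + prefix)
    return prefixes[prefix] + local_id


print(expand_curie("OBI:0100026"))
print(expand_curie("NCBITaxon:12637"))
```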


## Query for Dengue epitopes

So, in order to compose a query for all records below the 'Dengue virus' node in the taxonomy, all that is needed is to query on the *source_organism_iri_search* field for the presence of 'NCBITaxon:12637'.  Since this field is an array, we will need to use the [cs operator](https://postgrest.org/en/stable/api.html#operators) for 'contains'.  We also restrict the output to the small subset of fields that are relevant to our search with the [select operator](https://postgrest.org/en/stable/api.html#vertical-filtering-columns):


In [65]:
search_params={ 'source_organism_iri_search': 'cs.{"NCBITaxon:12637"}',
                'select': 'structure_iri,structure_description,curated_source_antigens,iedb_assay_ids,qualitative_measures,source_organism_names,source_organism_iri_search',
              }
table_name='epitope_search'
full_url=base_uri + '/' + table_name
result = requests.get(full_url, params=search_params)
print_curl_cmd(result)

curl -X 'GET' 'https://query-api.iedb.org/epitope_search?source_organism_iri_search=cs.%7B%22NCBITaxon%3A12637%22%7D&select=structure_iri%2Cstructure_description%2Ccurated_source_antigens%2Ciedb_assay_ids%2Cqualitative_measures%2Csource_organism_names%2Csource_organism_iri_search'
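The percent-encoded filter in the curl command can look opaque, so here is a small sketch of how the 'contains' filter value is built and encoded, using only the standard library.  The helper `contains_filter` is a hypothetical convenience function, not something provided by the API or by PostgREST.

```python
from urllib.parse import urlencode


def contains_filter(*curies):
    """Build a PostgREST 'cs' (contains) filter value for an array column.

    Hypothetical helper: each value is double-quoted and the list is
    wrapped in braces, e.g. cs.{"NCBITaxon:12637"}.
    """
    quoted = ",".join('"' + c + '"' for c in curies)
    return "cs.{" + quoted + "}"


params = {"source_organism_iri_search": contains_filter("NCBITaxon:12637")}
query_string = urlencode(params)
print(query_string)
# → source_organism_iri_search=cs.%7B%22NCBITaxon%3A12637%22%7D
```

The braces, double quotes, and colon are percent-encoded as `%7B`/`%7D`, `%22`, and `%3A`, which matches the filter portion of the curl command printed above.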


OK we have the result...now let's have a look.  **Note**: We are only printing the first record here to get a sense of what is returned.

In [66]:
print(json.dumps(result.json()[:1], indent=4))

[
    {
        "structure_iri": "IEDB_EPITOPE:184885",
        "structure_description": "MPLVMAWRT",
        "curated_source_antigens": [
            {
                "accession": "AFP27208.1",
                "name": "polyprotein",
                "iri": "GENPEPT:AFP27208.1",
                "starting_position": 1284,
                "ending_position": 1292,
                "source_organism_name": "Dengue virus 4 (dengue type 4 virus DEN4)",
                "source_organism_iri": "NCBITaxon:11070"
            },
            {
                "accession": "Q58HT7.1",
                "name": "Genome polyprotein",
                "iri": "GENPEPT:Q58HT7.1",
                "starting_position": 1284,
                "ending_position": 1292,
                "source_organism_name": "Dengue virus 4 Philippines/H241/1956",
                "source_organism_iri": "NCBITaxon:408686"
            }
        ],
        "iedb_assay_ids": [
            1892066,
            1892367,
            209483

We can also load the output into a data frame.

In [67]:
df = pd.json_normalize(result.json())
df

Unnamed: 0,structure_iri,structure_description,curated_source_antigens,iedb_assay_ids,qualitative_measures,source_organism_names,source_organism_iri_search
0,IEDB_EPITOPE:184885,MPLVMAWRT,"[{'accession': 'AFP27208.1', 'name': 'polyprot...","[1892066, 1892367, 2094831]","[Negative, Positive-Intermediate, Positive-Low]","[Dengue virus 4 (dengue type 4 virus DEN4), De...","[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
1,IEDB_EPITOPE:184886,MPPLRFLGE,"[{'accession': 'Q58HT7.1', 'name': 'Genome pol...",[1890378],[Positive-Low],[Dengue virus 4 Philippines/H241/1956 (Dengue ...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
2,IEDB_EPITOPE:184887,MPPLRFLGED,"[{'accession': 'Q58HT7.1', 'name': 'Genome pol...",[1890926],[Negative],[Dengue virus 4 Philippines/H241/1956 (Dengue ...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
3,IEDB_EPITOPE:184888,MPSMKRFRKE,"[{'accession': 'Q6YMS3.1', 'name': 'Genome pol...","[1890909, 1892044, 1892345]",[Negative],[Dengue virus 3 Martinique/1243/1999 (Dengue v...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
4,IEDB_EPITOPE:184889,MPSMKRFRR,"[{'accession': 'P07564.2', 'name': 'Genome pol...",[1889155],[Positive-High],[Dengue virus 2 Jamaica/1409/1983 (Dengue viru...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
...,...,...,...,...,...,...,...
9995,IEDB_EPITOPE:184875,MMILPAALA,"[{'accession': 'AGO63991.1', 'name': 'polyprot...","[1886177, 2094705]","[Negative, Positive-High]","[Dengue virus, Dengue virus 3 Sri Lanka/1266/2...","[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
9996,IEDB_EPITOPE:184876,MMILPAALAF,"[{'accession': 'Q6YMS4.1', 'name': 'Genome pol...","[1887303, 1887582, 1890697, 1890989, 1892424]","[Negative, Positive-High, Positive-Intermediat...",[Dengue virus 3 Sri Lanka/1266/2000 (Dengue vi...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
9997,IEDB_EPITOPE:184877,MMLKLLTEF,"[{'accession': 'P27909.2', 'name': 'Genome pol...","[1887112, 1888769, 1890488]",[Positive-High],[Dengue virus 1 Brazil/97-11/1997 (Dengue viru...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
9998,IEDB_EPITOPE:184878,MMLVAPSYGM,"[{'accession': 'Q58HT7.1', 'name': 'Genome pol...","[1886397, 1893426, 1893427]","[Negative, Positive-High, Positive-Intermediate]",[Dengue virus 4 Philippines/H241/1956 (Dengue ...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."


So it looks as though exactly 10,000 records were returned.  This should raise suspicion as 10,000 is the default limit for a 'page' of results.  If you receive 10,000 results it indicates that you are only getting the first page back and you will need to pull the rest of the results in subsequent calls.  This is described further in the [IQ-API help material](https://help.iedb.org/hc/en-us/articles/4402872882189).

### Retrieving more than 10,000 records

We will need to retrieve the rest of the records by increasing the 'offset' parameter until we have pulled the complete dataset.  We add a pause of 2 seconds between calls in order to be a good citizen:

In [68]:
search_params['offset'] = 0
while result.json() != []:
    time.sleep(2)  # pause between calls to avoid overloading the server
    search_params['offset'] += 10000
    print('offset: ' + str(search_params['offset']))
    result = requests.get(full_url, params=search_params)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, pd.json_normalize(result.json())])
print('Done!')

offset: 10000
offset: 20000
Done!
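The same paging logic can be wrapped in a reusable function.  The sketch below is one possible approach rather than the notebook's method: `fetch_all` and its `fetch_page` argument are hypothetical names, and it assumes the server returns full pages of `page_size` rows until the last one, which lets it stop as soon as a short page arrives and skip the final empty request.

```python
import time

PAGE_SIZE = 10000  # default page size described above


def fetch_all(fetch_page, page_size=PAGE_SIZE, pause_seconds=2.0):
    """Collect the rows from every page into one list.

    fetch_page is a hypothetical callable that takes an offset and returns
    the list of rows for that page.
    """
    rows = []
    offset = 0
    while True:
        page = fetch_page(offset)
        rows.extend(page)
        if len(page) < page_size:
            break  # a short (or empty) page means there is nothing left
        offset += page_size
        time.sleep(pause_seconds)  # be a good citizen between calls
    return rows


# Offline demonstration with fake data and a small page size.
fake_rows = list(range(25))


def fake_fetch(offset):
    return fake_rows[offset:offset + 10]


all_rows = fetch_all(fake_fetch, page_size=10, pause_seconds=0)
print(len(all_rows))
```

Against the API, `fetch_page` could be something like `lambda off: requests.get(full_url, params={**search_params, 'offset': off}).json()`.  If the server's page limit were ever lower than the assumed `page_size`, this version would stop early, so the empty-page check used above is the more conservative choice.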


Let's take another look at our data frame, now that it's complete.

In [69]:
df

Unnamed: 0,structure_iri,structure_description,curated_source_antigens,iedb_assay_ids,qualitative_measures,source_organism_names,source_organism_iri_search
0,IEDB_EPITOPE:184885,MPLVMAWRT,"[{'accession': 'AFP27208.1', 'name': 'polyprot...","[1892066, 1892367, 2094831]","[Negative, Positive-Intermediate, Positive-Low]","[Dengue virus 4 (dengue type 4 virus DEN4), De...","[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
1,IEDB_EPITOPE:184886,MPPLRFLGE,"[{'accession': 'Q58HT7.1', 'name': 'Genome pol...",[1890378],[Positive-Low],[Dengue virus 4 Philippines/H241/1956 (Dengue ...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
2,IEDB_EPITOPE:184887,MPPLRFLGED,"[{'accession': 'Q58HT7.1', 'name': 'Genome pol...",[1890926],[Negative],[Dengue virus 4 Philippines/H241/1956 (Dengue ...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
3,IEDB_EPITOPE:184888,MPSMKRFRKE,"[{'accession': 'Q6YMS3.1', 'name': 'Genome pol...","[1890909, 1892044, 1892345]",[Negative],[Dengue virus 3 Martinique/1243/1999 (Dengue v...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
4,IEDB_EPITOPE:184889,MPSMKRFRR,"[{'accession': 'P07564.2', 'name': 'Genome pol...",[1889155],[Positive-High],[Dengue virus 2 Jamaica/1409/1983 (Dengue viru...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
...,...,...,...,...,...,...,...
9999,IEDB_EPITOPE:184880,MNRRKRSVT,"[{'accession': 'AGT63075.1', 'name': 'polyprot...","[1890182, 2096814]","[Positive, Positive-High]",[Dengue virus 1 Brazil/97-11/1997 (Dengue viru...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
0,IEDB_EPITOPE:184817,MGYWIESQK,"[{'accession': 'Q6YMS3.1', 'name': 'Genome pol...","[1886658, 1889441]","[Positive-High, Positive-Intermediate]",[Dengue virus 3 Martinique/1243/1999 (Dengue v...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
1,IEDB_EPITOPE:184818,MGYWIESSK,"[{'accession': 'Q58HT7.1', 'name': 'Genome pol...",[1886710],[Positive-Intermediate],[Dengue virus 4 Philippines/H241/1956 (Dengue ...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
2,IEDB_EPITOPE:184819,MIAGVFFTF,"[{'accession': 'ACO06174.1', 'name': 'polyprot...","[1886361, 1887209, 1887477, 1888890, 1890590, ...","[Positive, Positive-High, Positive-Intermediate]","[Dengue virus 3 (Dengue virus serotype 3), Den...","[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."


Looks like there are 10,004 Dengue epitope records, which matches what we obtain through the web interface as of August 27, 2021.

## Query for Flavivirus epitopes

What if we now want to search one level higher in the taxonomy, for epitopes from all of Flavivirus?  Simply update the query to use the NCBI Taxonomy ID of Flavivirus in the *source_organism_iri_search* field and repeat:

In [70]:
search_params={ 'source_organism_iri_search': 'cs.{"NCBITaxon:11051"}',
                'select': 'structure_iri,structure_description,curated_source_antigens,iedb_assay_ids,qualitative_measures,source_organism_names,source_organism_iri_search',
                'offset': 0,
              }
table_name='epitope_search'
full_url=base_uri + '/' + table_name

# get the first 10K results and load them into a data frame
result = requests.get(full_url, params=search_params)
df_flavi = pd.json_normalize(result.json())

# keep requesting pages until an empty page comes back
while result.json() != []:
    time.sleep(2)
    search_params['offset'] += 10000
    print('offset: ' + str(search_params['offset']))
    result = requests.get(full_url, params=search_params)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df_flavi = pd.concat([df_flavi, pd.json_normalize(result.json())])
print('Done!')

df_flavi

offset: 10000
offset: 20000
Done!


Unnamed: 0,structure_iri,structure_description,curated_source_antigens,iedb_assay_ids,qualitative_measures,source_organism_names,source_organism_iri_search
0,IEDB_EPITOPE:184821,MIAGVLFTF,"[{'accession': 'Q5UB51.1', 'name': 'Genome pol...","[1886479, 1887306, 1887588, 1887774, 1889007, ...","[Positive-High, Positive-Intermediate, Positiv...",[Dengue virus 3 Singapore/8120/1995 (Dengue vi...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
1,IEDB_EPITOPE:184822,MIAGVLFTFV,"[{'accession': 'Q5UB51.1', 'name': 'Genome pol...","[1886180, 1886480, 1889840, 1893580, 1893581]","[Positive-High, Positive-Intermediate]",[Dengue virus 3 China/80-2/1980 (Dengue virus ...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
2,IEDB_EPITOPE:184823,MIEEVMRSR,"[{'accession': 'P27909.2', 'name': 'Genome pol...",[1889062],[Positive-Intermediate],[Dengue virus 1 Brazil/97-11/1997 (Dengue viru...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
3,IEDB_EPITOPE:184824,MIEEVMRSRW,"[{'accession': 'P27909.2', 'name': 'Genome pol...",[1892174],[Positive-Intermediate],[Dengue virus 1 Brazil/97-11/1997 (Dengue viru...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
4,IEDB_EPITOPE:184825,MIIMDEAHF,"[{'accession': 'P27909.2', 'name': 'Genome pol...",[1890498],[Positive-High],[Dengue virus 1 Brazil/97-11/1997 (Dengue viru...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
...,...,...,...,...,...,...,...
6804,IEDB_EPITOPE:952376,SHSTRKLQTRSQTWLESR,"[{'accession': 'AHZ13508.1', 'name': 'polyprot...",[6278839],[Negative],[Zika virus ZIKV/H. sapiens/FrenchPolynesia/10...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
6805,IEDB_EPITOPE:952377,SIGGVFTSVGKLVHQIFG,"[{'accession': 'BAJ72472.1', 'name': 'polyprot...",[6279046],[Negative],[Dengue virus 1 (dengue type 1 D1 virus)],"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
6806,IEDB_EPITOPE:952379,SIQPENLEYRIMLSVHGS,"[{'accession': 'AHZ13508.1', 'name': 'polyprot...",[6278866],[Negative],[Zika virus ZIKV/H. sapiens/FrenchPolynesia/10...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."
6807,IEDB_EPITOPE:952380,SKKETRCGTGVFVYNDVE,"[{'accession': 'AHZ13508.1', 'name': 'polyprot...",[6278916],[Negative],[Zika virus ZIKV/H. sapiens/FrenchPolynesia/10...,"[NCBITaxon:1, NCBITaxon:10239, NCBITaxon:11050..."


OK now we have 16,809 epitopes, which should include all of the Dengue epitopes.
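One way to check that claim is to compare the epitope identifiers from the two result sets.  The sketch below uses a hypothetical helper, `missing_from`, on a few identifiers taken from the output above; with the real data frames you could pass `df['structure_iri']` and `df_flavi['structure_iri']` instead.

```python
def missing_from(subset_ids, superset_ids):
    """Return the IDs in subset_ids that do not appear in superset_ids.

    Hypothetical helper: an empty result means the first collection is
    fully contained in the second.
    """
    return set(subset_ids) - set(superset_ids)


# Small illustrative samples taken from the output shown above.
dengue_ids = ["IEDB_EPITOPE:184885", "IEDB_EPITOPE:184886"]
flavi_ids = ["IEDB_EPITOPE:184885", "IEDB_EPITOPE:184886", "IEDB_EPITOPE:952376"]

print(missing_from(dengue_ids, flavi_ids))  # empty set: every Dengue ID is present
print(missing_from(flavi_ids, dengue_ids))  # the Zika epitope is Flavivirus-only
```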