# IEDB Query API (IQ-API) - Use Case 1G
**Goal**: Search for information related to a branch of the NCBI taxonomy, using Dengue virus and all Flaviviruses as examples.

This use case queries for epitopes arising from a single branch of the NCBI taxonomy, for example all viral epitopes or all epitopes related to Dengue virus.  The approach outlined here applies to every table that has a *source_organism_iri_search* field.

For more information on the expressive syntax of PostgREST, refer to [this document](https://postgrest.org/en/stable/api.html#).  For more details on the tables that are part of the API, refer to [the Swagger documentation](http://query-api.iedb.org/docs/swagger/).

---

First, let's import required modules, set some globals, and define a function to print the corresponding CURL command for each request.  I've tried to include that CURL command for each example so that you can copy/paste it into your terminal.  You may want to pipe the output to a tool like 'jq' to have it render neatly.

In [None]:
import requests
import json
import time
import pandas as pd
from io import StringIO

base_uri='https://query-api.iedb.org'

# function to print the curl command for a given request/response object
def print_curl_cmd(req):
    url = req.url
    print("curl -X 'GET' '" + url + "'")

This may or may not have resulted in a warning about lzma compression.  That can be safely ignored...

## The IRI search fields

Before we get started, we need to understand the fields that have 'iri_search' in their names, such as *source_organism_iri_search* or *host_organism_iri_search*.  All organism-related fields in the IEDB make use of the [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) and reference the organism in question with its NCBI Taxonomy ID prefaced by the string 'NCBITaxon'.  For example, Dengue virus has the [NCBI Taxonomy ID 12637](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=12637) and would be referenced as 'NCBITaxon:12637'.  In the context of the NCBI Taxonomy, its position looks like:

```
Organism (NCBITaxon:1)
  Viruses (NCBITaxon:10239)
    Riboviria (NCBITaxon:2559587)
      Orthornavirae (NCBITaxon:2732396)
        Kitrinoviricota (NCBITaxon:2732406)
          Flasuviricetes (NCBITaxon:2732462)
            Amarillovirales (NCBITaxon:2732545)
              Flaviviridae (NCBITaxon:11050)
                Flavivirus (NCBITaxon:11051)
                  Dengue virus (NCBITaxon:12637)
```

To make use of this hierarchy, the organism taxonomy ID as well as **many of its ancestor taxonomy IDs** are encoded into these *organism_iri_search* fields.  The *source_organism_iri_search* field of an epitope record from Dengue virus looks like:

```json
"source_organism_iri_search": [
    "NCBITaxon:1",
    "NCBITaxon:10239",
    "NCBITaxon:11050",
    "NCBITaxon:11051",
    "NCBITaxon:12637",
    "OBI:0100026"
]
```

Note that several intermediate taxonomy IDs are missing: the levels between Viruses (superkingdom) and Flaviviridae (family) are skipped.  You will also note an additional entry for [OBI:0100026](http://purl.obolibrary.org/obo/OBI_0100026), which corresponds to the 'organism' term in the [Ontology of Biomedical Investigations (OBI)](http://obi-ontology.org/).
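
To make the gap concrete, the short sketch below compares the full Dengue virus lineage with the example field value and lists the ancestors that are not encoded.  Both lists are copied from this page; the variable names are only illustrative.

```python
# Full lineage of Dengue virus, copied from the taxonomy listing on this page
lineage = [
    ("Organism", "NCBITaxon:1"),
    ("Viruses", "NCBITaxon:10239"),
    ("Riboviria", "NCBITaxon:2559587"),
    ("Orthornavirae", "NCBITaxon:2732396"),
    ("Kitrinoviricota", "NCBITaxon:2732406"),
    ("Flasuviricetes", "NCBITaxon:2732462"),
    ("Amarillovirales", "NCBITaxon:2732545"),
    ("Flaviviridae", "NCBITaxon:11050"),
    ("Flavivirus", "NCBITaxon:11051"),
    ("Dengue virus", "NCBITaxon:12637"),
]

# Example field value of a Dengue virus epitope record, copied from this page
source_organism_iri_search = [
    "NCBITaxon:1",
    "NCBITaxon:10239",
    "NCBITaxon:11050",
    "NCBITaxon:11051",
    "NCBITaxon:12637",
    "OBI:0100026",
]

# Ancestors in the lineage that are absent from the search field
indexed = set(source_organism_iri_search)
missing = [(name, curie) for name, curie in lineage if curie not in indexed]

for name, curie in missing:
    print(f"not encoded: {name} ({curie})")
```

Judging by this example, a 'contains' query on one of the skipped IDs (such as 'NCBITaxon:2732545') would not match the record, so it is safer to query on a level that is actually present in the field.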

Other *iri_search* fields exist with values from taxonomies other than the NCBI.  For example:

* *mhc_allele_iri_search* encodes values from the [MHC Restriction Ontology](http://www.obofoundry.org/ontology/mro.html)
* *disease_iri_search* uses the [Human Disease Ontology](http://www.obofoundry.org/ontology/doid.html)

The above list is not exhaustive.  For a detailed list of the ontologies employed and the CURIEs used to refer to them, please consult the [curie_map](https://query-api.iedb.org/curie_map) table.
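
As a rough illustration of how a CURIE relates to a full IRI, the sketch below expands a prefix into a base IRI.  The two prefix entries are assumptions based on the OBI link shown earlier and the usual OBO PURL pattern; the curie_map table remains the authoritative source, and `expand_curie` is a hypothetical helper, not part of the API.

```python
# Assumed prefix-to-IRI mapping for illustration only;
# the authoritative mapping lives in the curie_map table.
ASSUMED_PREFIXES = {
    "OBI": "http://purl.obolibrary.org/obo/OBI_",
    "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
}


def expand_curie(curie, prefixes=ASSUMED_PREFIXES):
    """Expand a CURIE such as 'OBI:0100026' into a full IRI."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown CURIE prefix in {curie!r}")
    return prefixes[prefix] + local_id


print(expand_curie("OBI:0100026"))
print(expand_curie("NCBITaxon:12637"))
```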


## Query for Dengue epitopes

To query for all records below the 'Dengue virus' node in the taxonomy, we only need to test the *source_organism_iri_search* field for the presence of 'NCBITaxon:12637'.  Since this field is an array, we use the [cs operator](https://postgrest.org/en/stable/api.html#operators) for 'contains'.  We also restrict the output to the small subset of fields relevant to our search with the [select parameter](https://postgrest.org/en/stable/api.html#vertical-filtering-columns):


In [None]:
search_params={ 'source_organism_iri_search': 'cs.{"NCBITaxon:12637"}',
                'select': 'structure_iri,structure_description,curated_source_antigens,iedb_assay_ids,qualitative_measures,source_organism_names,source_organism_iri_search',
                'order': 'structure_iri',
              }
table_name='epitope_search'
full_url=base_uri + '/' + table_name
result = requests.get(full_url, params=search_params)
print_curl_cmd(result)
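
To see how such a filter ends up in the query string without sending a request, the sketch below builds a URL with the standard library only.  The shortened 'select' list is just for readability, and the exact percent-encoding produced by `requests` may differ slightly from what `urlencode` produces here.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

base_uri = "https://query-api.iedb.org"

# Same filter as in the query on this page, with a shortened 'select' list
params = {
    "source_organism_iri_search": 'cs.{"NCBITaxon:12637"}',
    "select": "structure_iri,structure_description",
    "order": "structure_iri",
}

# Percent-encode the parameters and attach them to the table endpoint
url = base_uri + "/epitope_search?" + urlencode(params)
print(url)

# Decoding the query string recovers the original filter expressions
decoded = parse_qs(urlsplit(url).query)
print(decoded["source_organism_iri_search"][0])
```

The braces and quotes of the array literal are percent-encoded in the URL, which is why the printed curl command looks less readable than the Python dictionary.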

OK we have the result...now let's have a look.  **Note**: We are only printing the first record here to get a sense of what is returned.

In [None]:
print(json.dumps(result.json()[:1], indent=4))

We can also load the output into a data frame.

In [None]:
df = pd.json_normalize(result.json())
df

So it looks as though exactly 10,000 records were returned.  This should raise suspicion as 10,000 is the default limit for a 'page' of results.  If you receive 10,000 results it indicates that you are only getting the first page back and you will need to pull the rest of the results in subsequent calls.  This is described further in the [IQ-API help material](https://help.iedb.org/hc/en-us/articles/4402872882189).

### Retrieving more than 10,000 records

We will need to retrieve the rest of the records by increasing the 'offset' parameter until we have pulled the complete dataset.  **WARNING:** Whenever requesting a result with multiple pages, **it is *essential* to add an 'order' keyword**.  Without it, the pages may be inconsistent between calls.

We also add a pause of 2 seconds between calls in order to be a good citizen:

In [None]:
search_params['offset'] = 0
while result.json() != []:
    time.sleep(2)
    search_params['offset'] += 10000
    print('offset: ' + str(search_params['offset']))
    result = requests.get(full_url, params=search_params)
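    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, pd.json_normalize(result.json())], ignore_index=True)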
print('Done!')
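
The paging logic can also be factored into a small reusable helper.  The sketch below is one possible shape, not part of the IQ-API: `fetch_all` and `fake_fetch` are hypothetical names, and the fake page function stands in for a real call so the example runs offline.  It stops as soon as a page comes back shorter than the page size, which assumes the server always fills a page when more rows exist.

```python
def fetch_all(fetch_page, page_size=10000):
    """Collect rows page by page until a short (or empty) page is returned.

    fetch_page(offset) is a caller-supplied function returning a list of rows.
    """
    rows = []
    offset = 0
    while True:
        page = fetch_page(offset)
        rows.extend(page)
        if len(page) < page_size:
            # A short page means there is nothing left to fetch
            break
        offset += page_size
    return rows


# Stand-in for the API: 25 fake rows served 10 at a time
fake_rows = list(range(25))
calls = []


def fake_fetch(offset, size=10):
    calls.append(offset)  # record which offsets were requested
    return fake_rows[offset:offset + size]


all_rows = fetch_all(fake_fetch, page_size=10)
print(len(all_rows), calls)
```

In real use, `fetch_page` would wrap the `requests.get` call with the 'offset' parameter set, keep the 'order' parameter in place, and pause between calls.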

Let's take another look at our data frame, now that it's complete.

In [None]:
df

Looks like there are 10,004 Dengue epitope records, which matches what we obtain through the web interface as of August 27, 2021.

## Query for Flavivirus epitopes

What if we now want to search one level higher in the taxonomy, for epitopes from all of Flavivirus?  Simply update the query to use the NCBI Taxonomy ID of Flavivirus in the *source_organism_iri_search* field and repeat:

In [None]:
search_params={ 'source_organism_iri_search': 'cs.{"NCBITaxon:11051"}',
                'select': 'structure_iri,structure_description,curated_source_antigens,iedb_assay_ids,qualitative_measures,source_organism_names,source_organism_iri_search',
                'offset': 0,
                'order': 'structure_iri'
              }
table_name='epitope_search'
full_url=base_uri + '/' + table_name

# get the first 10K results and load them into a data frame
result = requests.get(full_url, params=search_params)
df_flavi = pd.json_normalize(result.json())

while result.json() != []:
    time.sleep(2)
    search_params['offset'] += 10000
    print('offset: ' + str(search_params['offset']))
    result = requests.get(full_url, params=search_params)
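    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df_flavi = pd.concat([df_flavi, pd.json_normalize(result.json())], ignore_index=True)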
print('Done!')

df_flavi

OK now we have 16,809 epitopes, which should include all of the Dengue epitopes.

## Limiting the results to T cell epitopes

What if we want to limit the output to peptides that were positive in at least one T cell assay (i.e., T cell epitopes)?  We can perform essentially the same search against the **tcell_search** table, restricting it to assays that do not have 'Negative' as their qualitative value.

In [None]:
search_params={ 'source_organism_iri_search': 'cs.{"NCBITaxon:11051"}',
                'select': 'tcell_iri,structure_iri,linear_sequence,curated_source_antigen,qualitative_measure,assay_names,source_organism_name',
                'qualitative_measure': 'not.eq.Negative',
                'offset': 0,
                'order': 'tcell_iri'
              }
table_name='tcell_search'
full_url=base_uri + '/' + table_name

# get the first 10K results and load them into a data frame
result = requests.get(full_url, params=search_params)
df_tcell = pd.json_normalize(result.json())

while result.json() != []:
    time.sleep(2)
    search_params['offset'] += 10000
    print('offset: ' + str(search_params['offset']))
    result = requests.get(full_url, params=search_params)
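    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df_tcell = pd.concat([df_tcell, pd.json_normalize(result.json())], ignore_index=True)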
print('Done!')

df_tcell



10,217 positive T cell assays for Flavivirus.  That matches the web results for the IEDB as of August 27, 2021.  Note, however, that since we queried against the T cell table, the rows are unique by assay rather than epitope. So we will need to extract the unique epitopes.

In [None]:
df_tcell.drop_duplicates(subset = ['linear_sequence'])

4,927 sequences from Flavivirus that were positive in at least 1 T cell assay, which matches the current data in the IEDB.
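
For readers who prefer to work on the raw JSON rows rather than a data frame, the same de-duplication can be sketched in plain Python.  The rows below are made up for illustration (the identifiers and sequences are not real IEDB records), and `unique_by` is a hypothetical helper that keeps the first row seen for each value, the same default behaviour as `drop_duplicates`.

```python
# Made-up assay rows: two assays share the same peptide sequence
assay_rows = [
    {"tcell_iri": "assay-1", "linear_sequence": "AAAAAAAAA"},
    {"tcell_iri": "assay-2", "linear_sequence": "AAAAAAAAA"},
    {"tcell_iri": "assay-3", "linear_sequence": "CCCCCCCCC"},
]


def unique_by(rows, key):
    """Return the first row encountered for each distinct value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            unique.append(row)
    return unique


unique_rows = unique_by(assay_rows, "linear_sequence")
print(len(assay_rows), "assays ->", len(unique_rows), "unique sequences")
```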

### An alternative approach using resource embeddings

**NOTE:**  This example has not been fully vetted and takes a *very* long time to run, likely due to inefficient joins.  It will be further worked out in the future.

We can also make use of [resource embeddings](https://postgrest.org/en/stable/api.html#resource-embedding) to get at this same information.  Here, we start with the epitope_search table, limit it to those with tcell_ids, then join it to the tcell_search table where the qualitative value is not 'Negative'.  Since results are returned for ALL epitope records with T cell assays, they need to be further processed to remove the epitopes without associated 'positive' T cell records.

In [None]:
search_params={ 'source_organism_iri_search': 'cs.{"NCBITaxon:11051"}',
                'tcell_ids': 'not.is.null',
                'tcell_search.qualitative_measure': 'not.eq.Negative',
                'select': 'structure_id,tcell_ids,tcell_search(tcell_iri,qualitative_measure)',
                'offset': 0,
                'order': 'structure_id'
              }
table_name='epitope_search'
full_url=base_uri + '/' + table_name

# get the first 10K results and load them into a data frame
result = requests.get(full_url, params=search_params)
df_tcell_from_epitope = pd.json_normalize(result.json())

while result.json() != []:
    time.sleep(2)
    search_params['offset'] += 10000
    print('offset: ' + str(search_params['offset']))
    result = requests.get(full_url, params=search_params)
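    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df_tcell_from_epitope = pd.concat([df_tcell_from_epitope, pd.json_normalize(result.json())], ignore_index=True)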
print('Done!')

df_tcell_from_epitope


Now we filter out all of the rows where no corresponding, positive T cell assays were found.

In [None]:
df_tcell_from_epitope[df_tcell_from_epitope['tcell_search'].apply(lambda x : len(x) > 0)]
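
The same filter can be applied directly to the JSON records before building a data frame.  The sketch below assumes each record carries its embedded rows as a list under the 'tcell_search' key, as selected in the query; the records and their values are invented for illustration.

```python
# Invented records shaped like the embedded result:
# each epitope carries a (possibly empty) list of matching T cell rows
records = [
    {"structure_id": 1, "tcell_ids": [11, 12],
     "tcell_search": [{"tcell_iri": "assay-11", "qualitative_measure": "Positive"}]},
    {"structure_id": 2, "tcell_ids": [21],
     "tcell_search": []},
    {"structure_id": 3, "tcell_ids": [31],
     "tcell_search": [{"tcell_iri": "assay-31", "qualitative_measure": "Positive"}]},
]

# Keep only epitopes with at least one embedded (non-negative) T cell row
with_positive = [rec for rec in records if rec["tcell_search"]]

print([rec["structure_id"] for rec in with_positive])
```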