# Using ROBOKOP's expand service

The most basic functionality in answering questions is to start with an entity and find other connected entities.
In this context, an entity is defined by a curie-formatted identifier.
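
A curie has the form `prefix:local_id`, as in the `MONDO:0019391` identifier used later in this notebook. The helper below is a minimal sketch for taking one apart; `split_curie` is a hypothetical name chosen for illustration and is not part of ROBOKOP.

```python
def split_curie(curie):
    """Split 'PREFIX:local_id' into a (prefix, local_id) tuple."""
    # partition() splits on the first colon only.
    prefix, sep, local_id = curie.partition(':')
    if not sep or not prefix or not local_id:
        raise ValueError(f'not a curie: {curie!r}')
    return prefix, local_id

print(split_curie('MONDO:0019391'))
```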

ROBOKOP's expand service performs this function.  The user provides an identifier, its biolink-model type, and the type of entities they want returned.  ROBOKOP will call out to any sources that it is aware of that can answer the particular question.  If multiple services can provide the information, ROBOKOP will call all of them.  It will then rank the results based on literature co-occurrence data.

In [1]:
import requests
import json
import pandas as pd

The following Python function shows how to call the ROBOKOP expand service.  For the moment, let's focus only on the arguments `type1`, `identifier`, and `type2`.  The expand service is called when a user has an `identifier` of `type1` and wants to know what entities of `type2` are connected to it.

In [2]:
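def expand(type1,identifier,type2,rebuild=None,csv=None,predicate=None):
    """Return the type2 entities connected to identifier (of type1), or [] on failure."""
    url=f'http://robokop.renci.org:80/api/simple/expand/{type1}/{identifier}/{type2}'
    # Optional query parameters, each covered in its own section of this notebook.
    params = {'rebuild': rebuild,
              'csv'    : csv,
              'predicate': predicate}
    # Only send the options the caller actually set.
    params = { k:v for k,v in params.items() if v is not None }
    response = requests.get(url,params=params)
    print( f'Return Status: {response.status_code}' )
    if response.status_code == 200:
        return response.json()
    # Any non-200 status yields an empty result.
    return []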
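
To see what request the function above sends without touching the network, the URL can be assembled by hand. This is a minimal sketch using only the standard library; `build_expand_url` is a hypothetical helper, and its output only approximates what `requests` produces from the same `params` dictionary.

```python
from urllib.parse import urlencode

BASE = 'http://robokop.renci.org:80/api/simple/expand'

def build_expand_url(type1, identifier, type2, **options):
    """Build the expand URL, dropping options that were left as None."""
    # Mirror the filtering step in expand(): unset options are not sent.
    params = {k: v for k, v in options.items() if v is not None}
    url = f'{BASE}/{type1}/{identifier}/{type2}'
    return f'{url}?{urlencode(params)}' if params else url

print(build_expand_url('disease', 'MONDO:0019391', 'phenotypic_feature'))
print(build_expand_url('disease', 'MONDO:0019391', 'phenotypic_feature',
                       csv=True, rebuild=None))
```

The second call shows that `rebuild=None` never reaches the query string, so only `csv` is sent.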

## Basic Usage

In this example, we have the `disease` Fanconi Anemia defined by the curie identifier `MONDO:0019391`.  We want to know the `phenotypic_feature`s that are associated with it.  We can call the function above like this:

In [3]:
fanconi_phenotypes = expand('disease', 'MONDO:0019391', 'phenotypic_feature')

Return Status: 200


## Service Output

The result that comes back is JSON in the KG-standard format.

Importantly, results are ranked using ROBOKOP's standard ranking algorithm, which looks at literature co-occurrence based on the `omnicorp` repository.

In [4]:
print(json.dumps(fanconi_phenotypes,indent=4))

{
    "answers": [
        {
            "id": null,
            "answerset": null,
            "natural_answer": null,
            "nodes": [
                {
                    "id": "MONDO:0019391",
                    "name": "Fanconi anemia",
                    "equivalent_identifiers": [
                        "MONDO:0019391",
                        "MEDDRA:10055206",
                        "ORPHANET:84",
                        "MEDDRA:10016218",
                        "DOID:13636",
                        "UMLS:C0015625",
                        "MESH:D005199"
                    ],
                    "type": "disease",
                    "omnicorp_article_count": 4009
                },
                {
                    "id": "HP:0005528",
                    "equivalent_identifiers": [
                        "HP:0005528",
                        "UMLS:C1855710",
                        "MEDDRA:10065553"
                    ],
                    "name": "Bone ma

This output has plenty of information, but for display purposes, it's sometimes easier to tabularize with the following function:

In [5]:
def parse_answer(returnanswer):
    nodes = [answer['nodes'][1] for answer in returnanswer['answers']]
    edges = [answer['edges'][0] for answer in returnanswer['answers']]
    answers = [ {"result_id": node["id"], 
                 "result_name": node["name"], 
                 "relation": edge["relation_label"],
                 "source": edge['edge_source']}
              for node,edge in zip(nodes,edges)]
    return pd.DataFrame(answers)

In [6]:
fanconi_pheno_frame = parse_answer(fanconi_phenotypes)
fanconi_pheno_frame

| | relation | result_id | result_name | source |
|---|----------|-----------|-------------|--------|
| 0 | has phenotype | HP:0005528 | Bone marrow hypocellularity | biolink.disease_get_phenotype |
| 1 | has phenotype | HP:0004810 | Congenital hypoplastic anemia | biolink.disease_get_phenotype |
| 2 | has phenotype | HP:0001908 | Hypoplastic anemia | biolink.disease_get_phenotype |
| 3 | has phenotype | HP:0010972 | Anemia of inadequate production | biolink.disease_get_phenotype |
| 4 | has phenotype | HP:0003974 | Absent radius | biolink.disease_get_phenotype |
| 5 | has phenotype | HP:0004820 | Acute myelomonocytic leukemia | biolink.disease_get_phenotype |
| 6 | has phenotype | HP:0000953 | Hyperpigmentation of the skin | biolink.disease_get_phenotype |
| 7 | has phenotype | HP:0001972 | Macrocytic anemia | biolink.disease_get_phenotype |
| 8 | has phenotype | HP:0001876 | Pancytopenia | biolink.disease_get_phenotype |
| 9 | has phenotype | HP:0001000 | Abnormality of skin pigmentation | biolink.disease_get_phenotype |


In this case, all of the results come from biolink's disease-to-phenotype function.  As mentioned above, results here are ranked by their literature co-occurrence with the query term.
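
For readers who would rather not depend on pandas, the same flattening can be sketched with plain dictionaries. The `sample` below is hand-copied from the output shown in this notebook (one answer, with only the fields that `parse_answer` reads); it is not a live response, and `answer_rows` is a hypothetical name.

```python
# One answer, reduced to the fields used for tabulation.
sample = {
    'answers': [
        {
            'nodes': [
                {'id': 'MONDO:0019391', 'name': 'Fanconi anemia'},
                {'id': 'HP:0005528', 'name': 'Bone marrow hypocellularity'},
            ],
            'edges': [
                {'relation_label': 'has phenotype',
                 'edge_source': 'biolink.disease_get_phenotype'},
            ],
        }
    ]
}

def answer_rows(returnanswer):
    """Flatten each answer into one dict per result."""
    rows = []
    for answer in returnanswer.get('answers', []):
        node = answer['nodes'][1]   # nodes[0] is the query entity
        edge = answer['edges'][0]   # the edge linking query and result
        rows.append({'result_id': node['id'],
                     'result_name': node['name'],
                     'relation': edge['relation_label'],
                     'source': edge['edge_source']})
    return rows

for row in answer_rows(sample):
    print(row)
```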

If the caller doesn't want to dig around in a JSON response, they can also ask for a csv-style list:

In [7]:
fanconi_phenotypes_csv = expand('disease', 'MONDO:0019391', 'phenotypic_feature',csv=True)

Return Status: 200


In [8]:
fanconi_phenotypes_csv

['Bone marrow hypocellularity(HP:0005528)',
 'Congenital hypoplastic anemia(HP:0004810)',
 'Hypoplastic anemia(HP:0001908)',
 'Anemia of inadequate production(HP:0010972)',
 'Absent radius(HP:0003974)',
 'Acute myelomonocytic leukemia(HP:0004820)',
 'Hyperpigmentation of the skin(HP:0000953)',
 'Macrocytic anemia(HP:0001972)',
 'Pancytopenia(HP:0001876)',
 'Abnormality of skin pigmentation(HP:0001000)',
 'Congenital thrombocytopenia(HP:0001905)',
 'Anal atresia(HP:0002023)',
 'Abnormality of DNA repair(HP:0003254)',
 'Cafe-au-lait spot(HP:0000957)',
 'Amegakaryocytic thrombocytopenia(HP:0004859)',
 'Abnormal vertebral morphology(HP:0003468)',
 'Absent thumb(HP:0009777)',
 'Hypoplasia of the radius(HP:0002984)',
 'Severe combined immunodeficiency(HP:0004430)',
 'Abnormality of the kidney(HP:0000077)',
 'Transient erythroblastopenia(HP:0005510)',
 'Microphthalmia(HP:0000568)',
 'Abnormality of the thumb(HP:0001172)',
 'Growth hormone deficiency(HP:0000824)',
 'Refractory anemia(HP:000550
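
Each entry in the csv-style list packs a name and a curie into one string. A minimal sketch for separating them follows, assuming every entry has the `name(curie)` shape seen in the output above; `split_csv_entry` is a hypothetical helper, not part of the service.

```python
def split_csv_entry(entry):
    """Split 'Some name(PREFIX:id)' into (name, curie)."""
    # Split on the last '(' so names containing parentheses survive.
    name, sep, rest = entry.rpartition('(')
    if not sep or not rest.endswith(')'):
        raise ValueError(f'unexpected entry format: {entry!r}')
    return name, rest[:-1]

print(split_csv_entry('Bone marrow hypocellularity(HP:0005528)'))
```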

## Curie inputs and synonymization

ROBOKOP will perform identifier translations when it can.  This means that for most common input types, there is a range of curie prefixes that will work without any extra effort from the user.

For example, Fanconi Anemia is identified as `MONDO:0019391`, which is ROBOKOP's preferred identifier, but that is equivalent to `DOID:13636`, `Orphanet:84`, `NCIT:C62505`, `UMLS:C0015625`, `MeSH:D005199`, and `MedDRA:10055206`.  We can see that calling expand with any of these inputs will produce the same results:

In [9]:
equivalents=['MedDRA:10055206','DOID:13636','UMLS:C0015625','Orphanet:84','NCIT:C62505','MeSH:D005199']
for equivalent_id in equivalents:
    e_result = expand('disease', equivalent_id, 'phenotypic_feature',csv=True)
    print(equivalent_id, len(e_result), e_result[0])

Return Status: 200
MedDRA:10055206 186 Bone marrow hypocellularity(HP:0005528)
Return Status: 200
DOID:13636 186 Bone marrow hypocellularity(HP:0005528)
Return Status: 200
UMLS:C0015625 186 Bone marrow hypocellularity(HP:0005528)
Return Status: 200
Orphanet:84 186 Bone marrow hypocellularity(HP:0005528)
Return Status: 200
NCIT:C62505 186 Bone marrow hypocellularity(HP:0005528)
Return Status: 200
MeSH:D005199 186 Bone marrow hypocellularity(HP:0005528)
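
The loop above compares only the result count and the first entry. A stricter check compares the full lists, as in the sketch below. Here `same_results` is a hypothetical helper that takes the expand function as an argument, and `fake_expand` is a stand-in so the sketch runs offline; in the notebook the real `expand` would be passed instead. The comparison assumes the service returns results in a stable order.

```python
def same_results(expand_fn, type1, identifiers, type2):
    """True if every identifier yields exactly the same csv-style list."""
    results = [expand_fn(type1, curie, type2, csv=True) for curie in identifiers]
    return all(r == results[0] for r in results)

# Stand-in for the notebook's expand(): always returns the same list.
def fake_expand(type1, identifier, type2, csv=None):
    return ['Bone marrow hypocellularity(HP:0005528)']

print(same_results(fake_expand, 'disease',
                   ['DOID:13636', 'UMLS:C0015625'], 'phenotypic_feature'))
```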


## Query Types

The `type1` and `type2` arguments are chosen from the [biolink-model](https://biolink.github.io/biolink-model/).  While any type in the model is potentially acceptable, only some types are exposed via ROBOKOP.  The current list of acceptable types is:

* disease_or_phenotypic_feature
    * **phenotypic_feature**
    * **disease**
       * genetic_condition

* **gene**

* **anatomical_entity**
    * **cell**
    * gross_anatomical_structure
    * cellular_component

* **biological_process_or_activity**
    * biological_process
        * pathway
    * molecular_activity

* **chemical_substance**
    * metabolite
    * drug
    
ROBOKOP understands the hierarchical nature of these relationships and can figure out which services to call at a different level of the hierarchy.  For instance, suppose an adverse events service returns a mix of diseases and phenotypic features, but the caller only wants diseases.  The user can then ask for diseases, and the service will be called and its results automatically filtered.  On the other hand, if a user is willing to accept, say, either diseases or phenotypic features for a query, then any function that returns both types, or either one, will automatically get called.
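
The filtering idea can be illustrated with a small sketch. The parent table below is transcribed from the list above; the `is_a` and `filter_by_type` helpers and the two sample nodes are hypothetical, and this is not ROBOKOP's actual implementation.

```python
# child type -> parent type, following the hierarchy listed above
TYPE_PARENT = {
    'phenotypic_feature': 'disease_or_phenotypic_feature',
    'disease': 'disease_or_phenotypic_feature',
    'genetic_condition': 'disease',
    'cell': 'anatomical_entity',
    'gross_anatomical_structure': 'anatomical_entity',
    'cellular_component': 'anatomical_entity',
    'biological_process': 'biological_process_or_activity',
    'pathway': 'biological_process',
    'molecular_activity': 'biological_process_or_activity',
    'metabolite': 'chemical_substance',
    'drug': 'chemical_substance',
}

def is_a(node_type, wanted):
    """True if node_type equals wanted or descends from it."""
    while node_type is not None:
        if node_type == wanted:
            return True
        node_type = TYPE_PARENT.get(node_type)  # None once past a root type
    return False

def filter_by_type(nodes, wanted):
    """Keep only the nodes whose type is acceptable for the request."""
    return [n for n in nodes if is_a(n['type'], wanted)]

# Hand-written sample of a mixed return
mixed = [{'id': 'MONDO:0019391', 'type': 'disease'},
         {'id': 'HP:0005528', 'type': 'phenotypic_feature'}]
print(filter_by_type(mixed, 'disease'))
print(filter_by_type(mixed, 'disease_or_phenotypic_feature'))
```

Asking for `disease` keeps only the first node, while asking for the broader `disease_or_phenotypic_feature` keeps both.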

Because of ROBOKOP's caching (see below), some types will return more quickly than others.  These types are in **bold** above.

Note that `genetic_condition` above is not part of the biolink model, but is an additional type that descends from disease in ROBOKOP.

As an example, we'll call expand twice with the same gene, once asking for a disease, and once asking for a genetic condition. There are no translator services that return only genetic condition, but the service knows how to recognize genetic conditions and returns only those diseases that are genetic conditions:

In [10]:
NPC1_diseases = set(expand('gene', 'HGNC:7897', 'disease',csv=True,rebuild=True)) #HGNC:7897 = "NPC1"
NPC1_genetic_conditions = set(expand('gene', 'HGNC:7897', 'genetic_condition',csv=True,rebuild=True))

Return Status: 200
Return Status: 200


In [11]:
print('There are',len(NPC1_genetic_conditions),'Genetic Conditions associated with NPC1')
print('There are',len(NPC1_diseases),'Diseases associated with NPC1')
print(len(NPC1_diseases.intersection(NPC1_genetic_conditions)),'of these are in common. In other words, everything in the genetic condition list is also in the disease list')

There are 8 Genetic Conditions associated with NPC1
There are 24 Diseases associated with NPC1
8 of these are in common. In other words, everything in the genetic condition list is also in the disease list


## Caching and Rebuilding

ROBOKOP maintains a cache of results.  The cache is built both opportunistically (including the results of all previous queries) and proactively (pre-loading data that is expected to be heavily used).  By default, expand only looks in its cache.  If a result has not been previously cached, then this call will not return anything (and may return a status code of 500).

If a user wants to force the service to look beyond its local cache, they can send the parameter `rebuild=True`, as seen in the NPC1 examples above.

If a user wants to be sure to retrieve all relevant data, they should use `rebuild=True`, but this comes at the expense of performance.  In order to increase performance without sacrificing reliability, certain type pairs are preloaded into the cache.  For these pairs, there will be no difference in results between calling `rebuild=True` and `rebuild=False`, but calling with `rebuild=True` will be noticeably slower.
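
One way to balance the two modes is to try the cache first and rebuild only when nothing comes back. The sketch below assumes the `expand` function from this notebook, which returns `[]` on any non-200 status; `expand_with_fallback` is a hypothetical wrapper, and `fake_expand` is a stand-in with placeholder data so the example runs offline.

```python
def expand_with_fallback(expand_fn, type1, identifier, type2, **kwargs):
    """Try the cached lookup first; retry with rebuild=True if it is empty."""
    kwargs.pop('rebuild', None)          # this wrapper controls rebuild itself
    cached = expand_fn(type1, identifier, type2, **kwargs)
    if cached:
        return cached
    return expand_fn(type1, identifier, type2, rebuild=True, **kwargs)

# Stand-in for expand(): empty unless rebuild is requested.
calls = []
def fake_expand(type1, identifier, type2, rebuild=None, csv=None):
    calls.append(rebuild)
    return ['placeholder result(EX:1)'] if rebuild else []

result = expand_with_fallback(fake_expand, 'gene', 'HGNC:7897', 'disease', csv=True)
print(result, calls)
```

Note that an empty cached answer and a cache miss look the same to this wrapper, so it may rebuild in cases where the cache was already complete.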

Certain pairs of types are preloaded into ROBOKOP's cache, so there is no point in using rebuild for them. The following list will be updated as the preloaded list is modified.  Note that with the data loaded, it doesn't matter which type is the query and which is the result.  That is, if a row in this table specifies `disease` and `phenotypic_feature`, then there is no reason to use rebuild for `type1='disease' type2='phenotypic_feature'` or `type1='phenotypic_feature' type2='disease'`.

| type (either order) | type (either order) |
|------|------|
| disease | phenotypic_feature |
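
Because the order of the two types doesn't matter, a client-side check for preloaded pairs can store each pair as a `frozenset`. This is a minimal sketch: `PRELOADED` is copied by hand from the table above and would need updating whenever that table changes, and `is_preloaded` is a hypothetical helper.

```python
# Each preloaded pair is stored without regard to order.
PRELOADED = {frozenset(('disease', 'phenotypic_feature'))}

def is_preloaded(type1, type2):
    """True if rebuild is unnecessary for this pair of types."""
    return frozenset((type1, type2)) in PRELOADED

print(is_preloaded('disease', 'phenotypic_feature'))
print(is_preloaded('phenotypic_feature', 'disease'))
print(is_preloaded('gene', 'disease'))
```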

## Specifying Predicates

The responses can be filtered by a predicate, which should also come from the biolink model.  As we saw above, the disease/phenotype call returned only the `has_phenotype` predicate, so using that predicate should return all of the results, and using anything else should return no edges:

(TODO: Include a more interesting example here...)

In [12]:
for predicate in ['has_phenotype','something_else']:
    fanconi_phenotypes = expand('disease', 'MONDO:0019391', 'phenotypic_feature', predicate=predicate, csv=True)
    print(f'predicate "{predicate}" returned {len(fanconi_phenotypes)} results')

Return Status: 200
predicate "has_phenotype" returned 186 results
Return Status: 500
predicate "something_else" returned 0 results
