In [2]:
import requests
import json

This notebook provides a deeper dive into the TRAPI options for querying the ROBOKOP KG. It assumes you are familiar with the concepts covered in `HelloRobokop.ipynb`.

TRAPI Documentation: https://github.com/NCATSTranslator/ReasonerAPI

Most TRAPI documents contain a `message` key.  Within that `message` are a `query_graph` denoting the user query,
a `knowledge_graph` consisting of the union of all nodes and edges that match the `query_graph` pattern, and a list of `results` that bind `query_graph` elements to `knowledge_graph` elements.
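
As a minimal sketch of that three-part layout, the helper below counts the elements in each part of a TRAPI document. The function name `summarize_message` and the toy document are assumptions made for illustration; the identifiers in it are placeholders, not real ROBOKOP output.

```python
def summarize_message(trapi_doc):
    """Count the elements in each of the three parts of a TRAPI message."""
    message = trapi_doc.get("message", {})
    query_graph = message.get("query_graph") or {}
    knowledge_graph = message.get("knowledge_graph") or {}
    results = message.get("results") or []
    return {
        "query_nodes": len(query_graph.get("nodes", {})),
        "query_edges": len(query_graph.get("edges", {})),
        "kg_nodes": len(knowledge_graph.get("nodes", {})),
        "kg_edges": len(knowledge_graph.get("edges", {})),
        "results": len(results),
    }


# Toy document with placeholder identifiers (hypothetical, for illustration only).
toy_doc = {
    "message": {
        "query_graph": {
            "nodes": {"n00": {"ids": ["EX:1"]}, "n01": {}},
            "edges": {"e00": {"subject": "n00", "object": "n01"}},
        },
        "knowledge_graph": {
            "nodes": {"EX:1": {"name": "a"}, "EX:2": {"name": "b"}},
            "edges": {
                "k1": {
                    "subject": "EX:1",
                    "object": "EX:2",
                    "predicate": "biolink:related_to",
                }
            },
        },
        "results": [
            {
                "node_bindings": {"n00": [{"id": "EX:1"}], "n01": [{"id": "EX:2"}]},
                "edge_bindings": {"e00": [{"id": "k1"}]},
            }
        ],
    }
}

print(summarize_message(toy_doc))
```

A message that carries only a `query_graph` reports zero for the knowledge graph and result counts.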

The following message contains only a `query_graph`.  This query graph consists of 3 nodes connected together in a line.   Two of the nodes (`n00` and `n02`) have specified identifiers, while the middle node of the line does not.  Rather the middle node has a list of `categories` that are acceptable.

Researchers who start from a `name` rather than an identifier can use the Name Resolver tool to look up the identifiers for the nodes.  This is illustrated in the "HelloRobokop_TRAPI_multiple_IDs" notebook.

This query asks: "Find me a Biological Process or Activity, a Gene, or a Pathway that is related to both `PUBCHEM.COMPOUND:644073` (Buprenorphine) and `HP:0001337` (Tremor)."

In [3]:
query = {
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n00",
                    "object": "n01",
                    "predicates": ["biolink:related_to"]
                },
                "e01": {
                    "subject": "n01",
                    "object": "n02",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n00": {
                    "ids": ["PUBCHEM.COMPOUND:644073"],
                    "categories": ["biolink:ChemicalEntity"]
                },
                "n01": {
                    "categories": ["biolink:BiologicalProcessOrActivity", "biolink:Gene", "biolink:Pathway"]
                },
                "n02": {
                    "ids": ["HP:0001337"],
                    "categories": ["biolink:DiseaseOrPhenotypicFeature"]
                }
            }
        }
    }
}
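
Because the query graph is a simple line of nodes, it can also be generated rather than typed by hand. The sketch below is one way to do that; `build_linear_query` is a hypothetical helper, not part of TRAPI or ROBOKOP, and it assumes the same `nNN`/`eNN` key naming and a single shared predicate as in the query above.

```python
def build_linear_query(node_specs, predicate="biolink:related_to"):
    """Chain node specs into a linear TRAPI query graph: n00 - n01 - n02 ..."""
    nodes = {}
    edges = {}
    for i, spec in enumerate(node_specs):
        nodes[f"n{i:02d}"] = dict(spec)
        if i > 0:
            # Connect each node to the one before it.
            edges[f"e{i - 1:02d}"] = {
                "subject": f"n{i - 1:02d}",
                "object": f"n{i:02d}",
                "predicates": [predicate],
            }
    return {"message": {"query_graph": {"edges": edges, "nodes": nodes}}}


# Same three nodes as the hand-written query above.
rebuilt = build_linear_query([
    {"ids": ["PUBCHEM.COMPOUND:644073"], "categories": ["biolink:ChemicalEntity"]},
    {"categories": ["biolink:BiologicalProcessOrActivity", "biolink:Gene", "biolink:Pathway"]},
    {"ids": ["HP:0001337"], "categories": ["biolink:DiseaseOrPhenotypicFeature"]},
])
print(sorted(rebuilt["message"]["query_graph"]["edges"]))
```

Zero-padding the keys (`n00`, `n01`, ...) keeps them in path order when they are sorted as strings, which the export code later in this notebook relies on.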


This query can be sent to various components of Translator as needed.  It can be sent directly to the ROBOKOP KG hosted in the Automat system like this:

In [4]:
robokop_submit_url = "http://automat-u24.apps.renci.org/robokopkg/1.3/query"
response = requests.post(robokop_submit_url, json=query)
print(response.status_code)
# Keep the result count; it is reused when the edges are summarized later on.
number_pathway_results = len(response.json()['message']['results'])
print(number_pathway_results)

200
7


In [5]:
import pprint
pp = pprint.PrettyPrinter(indent=5)

The response in JSON form is a Python dictionary with three main keys: `message`, `log_level`, and `workflow`.  The `message` property contains the `query_graph`, `knowledge_graph`, and `results` from the query.

In [6]:
print(response.json().keys())
print(response.json()['message'].keys())

dict_keys(['message', 'log_level', 'workflow'])
dict_keys(['query_graph', 'knowledge_graph', 'results'])


The `results` property contains the pathways resulting from the query message. Each pathway is organized into `edge_bindings` and `node_bindings`, which hold the matches for the edges and nodes specified in the query message.  Results are defined using the node and edge identifiers. The attributes for those nodes and edges (including the names) are available via the `knowledge_graph` component of the `message` section of the response.

In [7]:
pp.pprint(response.json()['message']['results'])

[    {    'edge_bindings': {    'e00': [    {    'attributes': None,
                                                 'id': '88245379'},
                                            {    'attributes': None,
                                                 'id': '79325668'},
                                            {    'attributes': None,
                                                 'id': '8608859'},
                                            {    'attributes': None,
                                                 'id': '113499113'}],
                                'e01': [    {    'attributes': None,
                                                 'id': '76822934'}]},
          'node_bindings': {    'n00': [    {    'attributes': None,
                                                 'id': 'PUBCHEM.COMPOUND:644073',
                                                 'qnode_id': 'PUBCHEM.COMPOUND:644073',
                                                 'query_id': None}],
    

The `knowledge_graph` contains information about each of the Nodes and Edges found in `results`.  Examples of a Node and an Edge are shown below.

In [8]:
pp.pprint(response.json()['message']['knowledge_graph'].keys())

dict_keys(['nodes', 'edges'])


Information returned for each Node includes the concept ID (key), biolink categories, the name/label, attributes, the value type, and others.  Note that each entry under the `nodes` level is itemized in dictionary format, with the key corresponding to the identifier used to define the `Result`.  The content for one Node is shown below.

In [9]:
next(iter( response.json()['message']['knowledge_graph']['nodes'].items() ))

('HP:0012164',
 {'categories': ['biolink:BiologicalEntity',
   'biolink:NamedThing',
   'biolink:DiseaseOrPhenotypicFeature',
   'biolink:Entity',
   'biolink:PhenotypicFeature',
   'biolink:ThingWithTaxon'],
  'name': 'Asterixis',
  'attributes': [{'attribute_type_id': 'biolink:same_as',
    'value': ['MEDDRA:10057580',
     'UMLS:C0232766',
     'NCIT:C86048',
     'SNOMEDCT:32838008',
     'HP:0012164',
     'MEDDRA:10003547'],
    'value_type_id': 'metatype:uriorcurie',
    'original_attribute_name': 'equivalent_identifiers',
    'value_url': None,
    'attribute_source': None,
    'description': None,
    'attributes': None},
   {'attribute_type_id': 'dct:description',
    'value': 'A clinical sign indicating a lapse of posture and is usually manifest by a bilateral flapping tremor at the wrist, metacarpophalangeal, and hip joints.',
    'value_type_id': 'EDAM:data_0006',
    'original_attribute_name': 'description',
    'value_url': None,
    'attribute_source': None,
    'descri
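
Node attributes arrive as a list of dictionaries, so looking one up means scanning that list. The sketch below shows one way to do it; `get_attribute_values` is a hypothetical helper, and the sample node is an abbreviated copy of the Asterixis entry shown above.

```python
def get_attribute_values(element, original_name):
    """Collect the values of every attribute with the given original_attribute_name."""
    values = []
    for attribute in element.get("attributes") or []:
        if attribute.get("original_attribute_name") == original_name:
            value = attribute.get("value")
            # Attribute values may be a single item or a list; normalize to a list.
            values.extend(value if isinstance(value, list) else [value])
    return values


# Abbreviated copy of the node printed above.
node = {
    "categories": ["biolink:DiseaseOrPhenotypicFeature", "biolink:PhenotypicFeature"],
    "name": "Asterixis",
    "attributes": [
        {
            "attribute_type_id": "biolink:same_as",
            "value": [
                "MEDDRA:10057580",
                "UMLS:C0232766",
                "NCIT:C86048",
                "SNOMEDCT:32838008",
                "HP:0012164",
                "MEDDRA:10003547",
            ],
            "value_type_id": "metatype:uriorcurie",
            "original_attribute_name": "equivalent_identifiers",
        },
    ],
}

print(get_attribute_values(node, "equivalent_identifiers"))
```

The `or []` guard matters because `attributes` can be `None` in TRAPI responses, as the binding entries in the `results` output show.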

Information returned for each Edge includes the edge ID (key), the subject's concept ID, the object's concept ID, the edge's predicate, any qualifiers, and attributes.  Note that each entry under the `edges` level is itemized in dictionary format, with the key corresponding to the identifier used to define the `Result`.  The content for one Edge is shown below.

In [10]:
next(iter( response.json()['message']['knowledge_graph']['edges'].items() ))

('79325668',
 {'subject': 'PUBCHEM.COMPOUND:644073',
  'object': 'NCBIGene:1565',
  'predicate': 'biolink:regulates',
  'qualifiers': [{'qualifier_type_id': 'biolink:object_direction_qualifier',
    'qualifier_value': 'downregulated'}],
  'attributes': [{'attribute_type_id': 'biolink:Attribute',
    'value': '0.99969625',
    'value_type_id': 'EDAM:data_0006',
    'original_attribute_name': 'biolink:tmkp_confidence_score',
    'value_url': None,
    'attribute_source': None,
    'description': None,
    'attributes': None},
   {'attribute_type_id': 'biolink:publications',
    'value': ['PMID:12756210'],
    'value_type_id': 'EDAM:data_0006',
    'original_attribute_name': 'publications',
    'value_url': None,
    'attribute_source': None,
    'description': None,
    'attributes': None},
   {'attribute_type_id': 'biolink:Attribute',
    'value': ['tmkp:12699a64cae70b20411935b5e5028d220ccbe4f7a0ad13a81978026cb43111bf'],
    'value_type_id': 'EDAM:data_0006',
    'original_attribute_nam
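
An edge entry can be condensed into its triple, qualifiers, and supporting publications. The following is a sketch under the assumption that publications are carried in attributes whose `attribute_type_id` is `biolink:publications`, as in the output above; `describe_edge` is a hypothetical helper and the sample edge is an abbreviated copy of that output.

```python
def describe_edge(edge):
    """Summarize one knowledge-graph edge as triple, qualifiers, and publications."""
    qualifiers = {
        q["qualifier_type_id"]: q["qualifier_value"]
        for q in edge.get("qualifiers") or []
    }
    publications = []
    for attribute in edge.get("attributes") or []:
        if attribute.get("attribute_type_id") == "biolink:publications":
            publications.extend(attribute.get("value") or [])
    return {
        "triple": (edge["subject"], edge["predicate"], edge["object"]),
        "qualifiers": qualifiers,
        "publications": publications,
    }


# Abbreviated copy of the edge printed above.
edge = {
    "subject": "PUBCHEM.COMPOUND:644073",
    "object": "NCBIGene:1565",
    "predicate": "biolink:regulates",
    "qualifiers": [
        {
            "qualifier_type_id": "biolink:object_direction_qualifier",
            "qualifier_value": "downregulated",
        }
    ],
    "attributes": [
        {
            "attribute_type_id": "biolink:publications",
            "value": ["PMID:12756210"],
            "original_attribute_name": "publications",
        }
    ],
}

print(describe_edge(edge))
```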

The following sections show how the `knowledge_graph` can be used to annotate the `results`, which serve to bind the components of the `knowledge_graph` to the original `query_graph` used to find the `knowledge_graph`. A single result is shown below to highlight the features we'll use for the subsequent summaries. The `node_bindings` link the components of the `query_graph` to the identifiers of the nodes within the `knowledge_graph`. The keys for these bindings correspond to the identifiers used to define the `query_graph` above. For this example, nodes are defined with an "n" followed by a number corresponding to the order of the node in the `query_graph`. The `edge_bindings` have identifiers defined with "e" plus a number specifying the order. In both cases, the `id` field gives the matching identifier in the `knowledge_graph`. The `result` includes a score field, which is not used when directly querying ROBOKOP using TRAPI. We'll show the score field below when we cover the Aragorn API for querying ROBOKOP.

In [11]:
# Illustrating the structure of each pathway result from the message property
pp.pprint(response.json()['message']['results'][0])

{    'edge_bindings': {    'e00': [    {'attributes': None, 'id': '88245379'},
                                       {'attributes': None, 'id': '79325668'},
                                       {'attributes': None, 'id': '8608859'},
                                       {'attributes': None, 'id': '113499113'}],
                           'e01': [{'attributes': None, 'id': '76822934'}]},
     'node_bindings': {    'n00': [    {    'attributes': None,
                                            'id': 'PUBCHEM.COMPOUND:644073',
                                            'qnode_id': 'PUBCHEM.COMPOUND:644073',
                                            'query_id': None}],
                           'n01': [    {    'attributes': None,
                                            'id': 'NCBIGene:1565',
                                            'query_id': None}],
                           'n02': [    {    'attributes': None,
                                            'id': 'HP:02000
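
The lookup from a binding `id` into the `knowledge_graph` can be sketched in a few lines. `label_result` is a hypothetical helper, and the toy result and knowledge graph below are hand-built from identifiers and names that appear elsewhere in this notebook rather than copied from a live response.

```python
def label_result(result, knowledge_graph):
    """Map each query-graph node key to the names of its bound knowledge-graph nodes."""
    labelled = {}
    for qnode_key, bindings in result["node_bindings"].items():
        labelled[qnode_key] = [
            # Fall back to the identifier when a node has no name.
            knowledge_graph["nodes"][binding["id"]].get("name", binding["id"])
            for binding in bindings
        ]
    return labelled


# Hand-built toy data reusing identifiers and names from this notebook.
toy_kg = {
    "nodes": {
        "PUBCHEM.COMPOUND:644073": {"name": "Buprenorphine"},
        "NCBIGene:1565": {"name": "CYP2D6"},
        "HP:0001337": {"name": "Tremor"},
    },
    "edges": {},
}
toy_result = {
    "node_bindings": {
        "n00": [{"id": "PUBCHEM.COMPOUND:644073"}],
        "n01": [{"id": "NCBIGene:1565"}],
        "n02": [{"id": "HP:0001337"}],
    },
    "edge_bindings": {},
}

print(label_result(toy_result, toy_kg))
```

Each binding is a list, so a query node could in principle be bound to more than one knowledge-graph node; returning a list of names keeps that possibility visible instead of silently taking the first entry.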

To export the results from our query, we create a directory with today's date to hold the output files and then convert our output dictionaries into a set of tables, which can be easily exported.

In [12]:
from datetime import datetime
from pathlib import Path

now = datetime.now()
dt_string = now.strftime("%Y-%m-%d_%H%M%S")
write_dir = Path("output/TRAPI",str(dt_string))
write_dir.mkdir(parents=True, exist_ok=True)

The code below writes out nodes from our `results`, but NOT the edges connecting the nodes.

In [13]:
import pandas as pd
import os

kg = response.json()['message']['knowledge_graph']
cols = []
for node in sorted(response.json()['message']['results'][0]['node_bindings'].keys()):
    cols.append(node)
    cols.append(node + '_name')
results_df = pd.DataFrame(columns = cols)

results_list = []
for result in response.json()['message']['results']:
    result_dict = {}
    for node in sorted(result['node_bindings'].keys()):
        node_id = result['node_bindings'][node][0]['id']
        result_dict[node] = node_id
        result_dict[node + '_name'] = kg['nodes'][node_id]['name']

    results_list.append(pd.DataFrame([result_dict]))
results_df = pd.concat(results_list)
display(results_df)
results_df.to_csv(os.path.join(write_dir,'results_TRAPI.csv'), index=False)

combined_node_list = ["_".join([row[1].replace(" ", "_"), row[3].replace(" ", "_"), row[5].replace(" ", "_")]) for row in results_df[cols].to_numpy()]
pp.pprint(combined_node_list)

Unnamed: 0,n00,n00_name,n01,n01_name,n02,n02_name
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:1565,CYP2D6,HP:0200085,Limb tremor
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:1565,CYP2D6,HP:0002345,Action tremor
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:1565,CYP2D6,HP:0025387,Pill-rolling tremor
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:1565,CYP2D6,HP:0001337,Tremor
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:1565,CYP2D6,HP:0002322,Resting tremor
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:1565,CYP2D6,HP:0002174,Postural tremor
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:4988,OPRM1,HP:0012164,Asterixis


[    'Buprenorphine_CYP2D6_Limb_tremor',
     'Buprenorphine_CYP2D6_Action_tremor',
     'Buprenorphine_CYP2D6_Pill-rolling_tremor',
     'Buprenorphine_CYP2D6_Tremor',
     'Buprenorphine_CYP2D6_Resting_tremor',
     'Buprenorphine_CYP2D6_Postural_tremor',
     'Buprenorphine_OPRM1_Asterixis']


The edge IDs are then retrieved from the `results` and used to find the corresponding `predicate` label in the `knowledge_graph`. Two nodes can have more than one edge connecting them because each edge represents a distinct type of association derived from a single data source. Because of this, the number of rows in this report may differ from the set of results above. For an answer with three nodes such as this one, there will be at least one edge between nodes 1 and 2 as well as nodes 2 and 3. The direction of the edge connecting two nodes can differ as well, such as for "Buprenorphine" and "CYP2D6" below. Sometimes a single association type will be derived from multiple data sources. In the code below, we print a single edge for each association type and include the count of the number of data sources where that association was found. The following writes out all unique edges for each of the results in the format of `subject` -> `predicate` -> `object` and exports the information into a single file per `result`.

In [14]:
from collections import Counter
import json

# Parse the response once instead of re-parsing it for every lookup.
message = response.json()['message']
kg_nodes = message['knowledge_graph']['nodes']
kg_edges = message['knowledge_graph']['edges']

for i in range(number_pathway_results):
    print(f"Pathway result: {combined_node_list[i]}")
    edge_bindings = message['results'][i]['edge_bindings']

    string_out_list = []
    for edge_name, edge_list in edge_bindings.items():
        for binding in edge_list:
            edge = kg_edges[binding['id']]
            subject_name = kg_nodes[edge['subject']]['name']
            # Named object_name to avoid shadowing the built-in `object`.
            object_name = kg_nodes[edge['object']]['name']
            string_out_list.append(f"{subject_name} -> {edge['predicate']} -> {object_name}")

    # Count identical subject -> predicate -> object strings.
    string_out_dict = dict(Counter(string_out_list))
    pp.pprint(string_out_dict)
    print("")

    with open(os.path.join(write_dir, combined_node_list[i] + ".txt"), 'w') as convert_file:
        convert_file.write(json.dumps(string_out_dict))
        

Pathway result: Buprenorphine_CYP2D6_Limb_tremor
{    'Buprenorphine -> biolink:affects -> CYP2D6': 1,
     'Buprenorphine -> biolink:directly_physically_interacts_with -> CYP2D6': 1,
     'Buprenorphine -> biolink:regulates -> CYP2D6': 1,
     'CYP2D6 -> biolink:affects -> Buprenorphine': 1,
     'CYP2D6 -> biolink:genetic_association -> Limb tremor': 1}

Pathway result: Buprenorphine_CYP2D6_Action_tremor
{    'Buprenorphine -> biolink:affects -> CYP2D6': 1,
     'Buprenorphine -> biolink:directly_physically_interacts_with -> CYP2D6': 1,
     'Buprenorphine -> biolink:regulates -> CYP2D6': 1,
     'CYP2D6 -> biolink:affects -> Buprenorphine': 1,
     'CYP2D6 -> biolink:genetic_association -> Action tremor': 1}

Pathway result: Buprenorphine_CYP2D6_Pill-rolling_tremor
{    'Buprenorphine -> biolink:affects -> CYP2D6': 1,
     'Buprenorphine -> biolink:directly_physically_interacts_with -> CYP2D6': 1,
     'Buprenorphine -> biolink:regulates -> CYP2D6': 1,
     'CYP2D6 -> biolink:affect

We see that the connection between Buprenorphine and CYP2D6 consists of three edge types: `biolink:affects`, `biolink:directly_physically_interacts_with`, and `biolink:regulates`. The `biolink:affects` relationship appears in both directions, with one edge starting at Buprenorphine (`Buprenorphine -> biolink:affects -> CYP2D6`) and one starting at CYP2D6 (`CYP2D6 -> biolink:affects -> Buprenorphine`). These four edges are repeated for all results that include Buprenorphine and CYP2D6 as the first two nodes; those results differ only in the type of tremor found for the third node. There are three edges connecting Buprenorphine and OPRM1 (`biolink:affects`, `biolink:directly_physically_interacts_with`, `biolink:related_to`). Of note, the `biolink:affects` edge was identified in 4 different knowledge sources, as indicated by the count reported after the colon.

## Aragorn

The results above are plain database matches, with no scores or other additions.  You can instead send the TRAPI query to the ROBOKOP application through Aragorn, rather than directly to the graph:

In [15]:
ara_robokop_submit_url = "https://aragorn-u24.apps.renci.org/robokop/query"
response_ara = requests.post(ara_robokop_submit_url,json=query)
print(response_ara.status_code)
print(len(response_ara.json()['message']['results']))

200
7


In [16]:
pp.pprint(response_ara.json()['message']['results'][0].keys())
pp.pprint(response_ara.json()['message']['results'][0]['node_bindings'].keys())
pp.pprint(response_ara.json()['message']['results'][0]['edge_bindings'].keys())
pp.pprint(response_ara.json()['message']['results'][0]['score'])

dict_keys(['node_bindings', 'edge_bindings', 'score'])
dict_keys(['n02', 'n01', 'n00'])
dict_keys(['e00', 'e01', 's2', 's14', 's15'])
0.24054207686854312
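
Since each Aragorn result carries a `score`, results can be ranked explicitly rather than relying on the order in which they arrive. The sketch below assumes that a higher score means a stronger result and that a score may be missing or `None`; `rank_results` is a hypothetical helper and the toy scores are made up for illustration.

```python
def rank_results(results):
    """Return results sorted by descending score; unscored results go last."""
    return sorted(
        results,
        # First key pushes missing scores to the end; second sorts high to low.
        key=lambda r: (r.get("score") is None, -(r.get("score") or 0.0)),
    )


# Toy results with made-up scores (illustrative only).
toy_results = [
    {"label": "b", "score": 0.13},
    {"label": "c", "score": None},
    {"label": "a", "score": 0.24},
]

for result in rank_results(toy_results):
    print(result["label"], result["score"])
```

`sorted` returns a new list, so the original response is left untouched.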


The following assumes that the node names will sort in the correct order, which is the case with the default naming conventions. This exports the results showing the nodes and the score assigned to each result.

In [17]:
kg = response_ara.json()['message']['knowledge_graph']
results_ara = response_ara.json()['message']['results']
cols = []
for node in sorted(results_ara[0]['node_bindings'].keys()):
    cols.append(node)
    cols.append(node + '_name')

# Collect one dict per result, then build the DataFrame in a single step.
# DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0.
rows = []
for result in results_ara:
    result_dict = {}
    for node in result['node_bindings'].keys():
        node_id = result['node_bindings'][node][0]['id']
        result_dict[node] = node_id
        result_dict[node + '_name'] = kg['nodes'][node_id]['name']
    result_dict['score'] = result['score']
    rows.append(result_dict)

results_df = pd.DataFrame(rows, columns=cols + ['score'])
print(results_df)
results_df.to_csv(os.path.join(write_dir,'results_TRAPI_aragorn.csv'))


                       n00       n00_name            n01 n01_name         n02  \
0  PUBCHEM.COMPOUND:644073  Buprenorphine  NCBIGene:1565   CYP2D6  HP:0001337   
1  PUBCHEM.COMPOUND:644073  Buprenorphine  NCBIGene:4988    OPRM1  HP:0012164   
2  PUBCHEM.COMPOUND:644073  Buprenorphine  NCBIGene:1565   CYP2D6  HP:0002322   
3  PUBCHEM.COMPOUND:644073  Buprenorphine  NCBIGene:1565   CYP2D6  HP:0002345   
4  PUBCHEM.COMPOUND:644073  Buprenorphine  NCBIGene:1565   CYP2D6  HP:0200085   
5  PUBCHEM.COMPOUND:644073  Buprenorphine  NCBIGene:1565   CYP2D6  HP:0002174   
6  PUBCHEM.COMPOUND:644073  Buprenorphine  NCBIGene:1565   CYP2D6  HP:0025387   

              n02_name     score  
0               Tremor  0.240542  
1            Asterixis  0.131692  
2       Resting tremor  0.130222  
3        Action tremor  0.129512  
4          Limb tremor  0.129109  
5      Postural tremor  0.129109  
6  Pill-rolling tremor  0.129109  
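
The export above depends on `sorted()` putting the node keys in path order. That holds for zero-padded keys such as `n00`, `n01`, `n02`, but plain string sorting would misplace unpadded keys (for example `n10` before `n2`). A minimal sketch of a natural sort key that handles both cases follows; `natural_key` is a hypothetical helper, not part of the notebook's existing code.

```python
import re


def natural_key(node_key):
    """Split 'n10' into ['n', 10, ''] so numeric parts compare as integers."""
    return [
        int(part) if part.isdigit() else part
        for part in re.split(r"(\d+)", node_key)
    ]


keys = ["n10", "n2", "n1"]
print(sorted(keys))                   # ['n1', 'n10', 'n2']
print(sorted(keys, key=natural_key))  # ['n1', 'n2', 'n10']
```

Passing `key=natural_key` to the `sorted()` calls that build the column list would make the column order independent of how the query-graph keys are padded.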




In [18]:
pp.pprint(kg['edges'])

{    '112610254': {    'attributes': [    {    'attribute_source': 'infores:automat-robokop',
                                               'attribute_type_id': 'biolink:aggregator_knowledge_source',
                                               'original_attribute_name': 'biolink:aggregator_knowledge_source',
                                               'value': [    'infores:automat-robokop'],
                                               'value_type_id': 'biolink:InformationResource'},
                                          {    'attribute_type_id': 'dct:description',
                                               'original_attribute_name': 'description',
                                               'value': 'decreases molecular '
                                                        'interaction with',
                                               'value_type_id': 'EDAM:data_0006'},
                                          {    'attribute_type_id': 'biolink:has_numeri