In [1]:
# Parameter inputs
trapi_submit_url = "http://robokop-automat.apps.renci.org/robokopkg/1.4/query"
CURIE_buprenorphine_PubChem = "PUBCHEM.COMPOUND:644073"
CURIE_tremor_HP = "HP:0001337"

# Initializing directory to write
from datetime import datetime
from pathlib import Path

now = datetime.now()
dt_string = now.strftime("%Y-%m-%d_%H%M%S")
write_dir = Path("output/TRAPI",str(dt_string))
write_dir.mkdir(parents=True, exist_ok=True)

In [2]:
import requests
import json

This notebook provides a deeper dive into the TRAPI options for querying the ROBOKOP KG. It assumes you are familiar with the concepts covered in `HelloRobokop.ipynb`.

TRAPI Documentation: https://github.com/NCATSTranslator/ReasonerAPI

Most TRAPI documents contain a `message` key.  Within that `message` are a `query_graph` denoting the user query,
a `knowledge_graph` consisting of the union of all nodes and edges that match the `query_graph` pattern, and a list of `results` that bind `query_graph` elements to `knowledge_graph` elements.

The following message contains only a `query_graph`. This query graph consists of three nodes connected in a line. The two end nodes (`n00` and `n02`) have specified identifiers, while the middle node (`n01`) does not. Instead, the middle node has a list of acceptable `categories`.

A researcher who is starting from a name rather than an identifier can use the Name Resolver tool to look up the identifiers for the nodes. This is illustrated in the `HelloRobokop_TRAPI_multiple_IDs` notebook.

A researcher who wants to use ARA functionality for features such as scoring can find more information in the `HelloRobokop_ARA` notebook.

This query asks: "Find me a Biological Process or Activity, or a Gene, or a Pathway that is related to both `PUBCHEM.COMPOUND:644073` (Buprenorphine) and `HP:0001337` (Tremor)."

In [3]:
query = {
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n00",
                    "object": "n01",
                    "predicates": ["biolink:related_to"]
                },
                "e01": {
                    "subject": "n01",
                    "object": "n02",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n00": {
                    "ids": [CURIE_buprenorphine_PubChem],
                    "categories": ["biolink:ChemicalEntity"]
                },
                "n01": {
                    "categories": ["biolink:BiologicalProcessOrActivity", "biolink:Gene", "biolink:Pathway"]
                },
                "n02": {
                    "ids": [CURIE_tremor_HP],
                    "categories": ["biolink:DiseaseOrPhenotypicFeature"]
                }
            }
        }
    }
}
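
The same three-node query can also be assembled programmatically, which may help when the start and end identifiers change between runs. The sketch below is only an illustration: `build_two_hop_query` and `example_query` are hypothetical names that are not part of TRAPI or ROBOKOP, and the function simply reproduces the dictionary layout shown above.

```python
def build_two_hop_query(start_id, start_category, middle_categories,
                        end_id, end_category,
                        predicate="biolink:related_to"):
    """Return a TRAPI message for start -> middle -> end (hypothetical helper)."""
    return {
        "message": {
            "query_graph": {
                "edges": {
                    "e00": {"subject": "n00", "object": "n01", "predicates": [predicate]},
                    "e01": {"subject": "n01", "object": "n02", "predicates": [predicate]},
                },
                "nodes": {
                    # End nodes are pinned to identifiers
                    "n00": {"ids": [start_id], "categories": [start_category]},
                    # The middle node is constrained by category only
                    "n01": {"categories": list(middle_categories)},
                    "n02": {"ids": [end_id], "categories": [end_category]},
                },
            }
        }
    }


example_query = build_two_hop_query(
    "PUBCHEM.COMPOUND:644073", "biolink:ChemicalEntity",
    ["biolink:BiologicalProcessOrActivity", "biolink:Gene", "biolink:Pathway"],
    "HP:0001337", "biolink:DiseaseOrPhenotypicFeature",
)
```

With the same identifiers and categories, `example_query` has the same content as the hand-written `query` dictionary.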


This query can be sent directly to the ROBOKOP KG hosted in the Automat system like this:

In [4]:
response = requests.post(trapi_submit_url, json=query)
print(response.status_code)
# Number of results (pathways) returned; reused in later cells
number_pathway_results = len(response.json()['message']['results'])
print(number_pathway_results)

200
7


In [5]:
import pprint
pp = pprint.PrettyPrinter(indent=5)

The response in JSON form is a Python dictionary with three main keys: `message`, `log_level`, and `workflow`. The `message` property contains the `query_graph`, `knowledge_graph`, and `results` from the query, along with `auxiliary_graphs`.

In [6]:
print(response.json().keys())
print(response.json()['message'].keys())

query_out = response.json()['message']['query_graph']
kg = response.json()['message']['knowledge_graph']
results = response.json()['message']['results']

dict_keys(['message', 'log_level', 'workflow'])
dict_keys(['query_graph', 'knowledge_graph', 'results', 'auxiliary_graphs'])
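
When a query matches nothing, some of these parts may be missing or null. A minimal sketch of a defensive way to pull the three parts out of a parsed response is shown here. `split_message` is a hypothetical helper, and `demo_response` is made-up data that only mimics the key layout printed above.

```python
def split_message(response_json):
    """Return (query_graph, knowledge_graph, results) from a parsed TRAPI response.

    Hypothetical helper: missing or null parts are replaced with empty
    containers so that later code can iterate without extra checks.
    """
    message = response_json.get("message") or {}
    query_graph = message.get("query_graph") or {}
    knowledge_graph = message.get("knowledge_graph") or {"nodes": {}, "edges": {}}
    results = message.get("results") or []
    return query_graph, knowledge_graph, results


# Made-up response with the same top-level keys as the real one
demo_response = {
    "message": {
        "query_graph": {"nodes": {}, "edges": {}},
        "knowledge_graph": None,
        "results": None,
        "auxiliary_graphs": None,
    },
    "log_level": None,
    "workflow": None,
}

demo_qg, demo_kg, demo_results = split_message(demo_response)
print(len(demo_results), sorted(demo_kg))
```

In the notebook, the same call could be made as `split_message(response.json())`.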


The submitted query graph is returned, with additional fields, in the `query_graph` property of the `message`.

The code below prints the submitted query graph and the returned query graph so that they can be compared. The returned query graph includes additional attributes such as `knowledge_type`, `attribute_constraints`, and `qualifier_constraints`. These attributes can be set explicitly when submitting to Automat to filter the results further. Because nothing was specified for these attributes in the submitted query graph, their values in the returned query graph are blank.
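
As a complement to printing both graphs, the added fields can be listed programmatically. The sketch below walks two nested dictionaries and reports keys that appear only in the second one. `added_keys` is a hypothetical helper, and the `demo_submitted` / `demo_returned` dictionaries are made-up stand-ins whose values are not taken from a real response.

```python
def added_keys(submitted, returned, path=""):
    """List dotted paths of keys present in `returned` but not in `submitted`."""
    added = []
    for key, value in returned.items():
        here = f"{path}.{key}" if path else key
        if key not in submitted:
            added.append(here)
        elif isinstance(value, dict) and isinstance(submitted[key], dict):
            # Recurse only where both sides are dictionaries
            added.extend(added_keys(submitted[key], value, here))
    return added


# Made-up example: one edge before and after a round trip
demo_submitted = {"edges": {"e00": {"subject": "n00", "object": "n01"}}}
demo_returned = {"edges": {"e00": {"subject": "n00", "object": "n01",
                                   "knowledge_type": None,
                                   "attribute_constraints": [],
                                   "qualifier_constraints": []}}}

print(added_keys(demo_submitted, demo_returned))
```

In the notebook, `added_keys(query["message"]["query_graph"], query_out)` would list the fields that were added to the submitted query graph.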

In [7]:
edges = ["e00", "e01"]
nodes = ["n00", "n01", "n02"]

print(nodes)
print(edges)
pp.pprint(query)
pp.pprint(query_out)

['n00', 'n01', 'n02']
['e00', 'e01']
{    'message': {    'query_graph': {    'edges': {    'e00': {    'object': 'n01',
                                                                   'predicates': [    'biolink:related_to'],
                                                                   'subject': 'n00'},
                                                       'e01': {    'object': 'n02',
                                                                   'predicates': [    'biolink:related_to'],
                                                                   'subject': 'n01'}},
                                         'nodes': {    'n00': {    'categories': [    'biolink:ChemicalEntity'],
                                                                   'ids': [    'PUBCHEM.COMPOUND:644073']},
                                                       'n01': {    'categories': [    'biolink:BiologicalProcessOrActivity',
                                                          

The `results` property contains the pathways that match the query message. Each pathway is organized into `node_bindings` and `edge_bindings` (the latter nested under `analyses`), with one entry for each node and edge specified in the query message. The bindings refer to nodes and edges by identifier only. The attributes for those nodes and edges (including the names) are available via the `knowledge_graph` component of the `message` section of the response.
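
To make the relationship between bindings and the `knowledge_graph` concrete, here is a minimal sketch that looks up the name of every node bound in one result. `node_names_for_result` is a hypothetical helper, and `demo_result` / `demo_kg` are trimmed-down stand-ins that keep only the fields the lookup needs.

```python
def node_names_for_result(result, knowledge_graph):
    """Map each query-graph node key to the names of the nodes bound to it."""
    names = {}
    for qnode, bindings in result["node_bindings"].items():
        names[qnode] = [knowledge_graph["nodes"][b["id"]]["name"] for b in bindings]
    return names


# Trimmed-down stand-ins for one result and the knowledge graph
demo_result = {
    "node_bindings": {
        "n00": [{"id": "PUBCHEM.COMPOUND:644073"}],
        "n01": [{"id": "NCBIGene:1565"}],
        "n02": [{"id": "HP:0001337"}],
    }
}
demo_kg = {
    "nodes": {
        "PUBCHEM.COMPOUND:644073": {"name": "Buprenorphine"},
        "NCBIGene:1565": {"name": "CYP2D6"},
        "HP:0001337": {"name": "Tremor"},
    }
}

print(node_names_for_result(demo_result, demo_kg))
```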

In [8]:
pp.pprint(results)

[    {    'analyses': [    {    'attributes': None,
                                'edge_bindings': {    'e00': [    {    'attributes': None,
                                                                       'id': '92137489'},
                                                                  {    'attributes': None,
                                                                       'id': '118472855'},
                                                                  {    'attributes': None,
                                                                       'id': '8156627'},
                                                                  {    'attributes': None,
                                                                       'id': '8966609'}],
                                                      'e01': [    {    'attributes': None,
                                                                       'id': '117227756'}]},
                                'resourc

The `knowledge_graph` contains information about each of the Nodes and Edges found in `results`. An example of a Node and of an Edge is shown below.

In [9]:
pp.pprint(kg.keys())

dict_keys(['nodes', 'edges'])


Information returned for each Node includes the concept ID (key), Biolink categories, the name/label, attributes, the value type, and others. Note that each entry under the `nodes` level is a dictionary item whose key is the identifier referenced by the `results`. The content for one Node is shown below.

In [10]:
next(iter( kg['nodes'].items() ))

('HP:0012164',
 {'categories': ['biolink:ThingWithTaxon',
   'biolink:Entity',
   'biolink:PhenotypicFeature',
   'biolink:NamedThing',
   'biolink:DiseaseOrPhenotypicFeature',
   'biolink:BiologicalEntity'],
  'name': 'Asterixis',
  'attributes': [{'attribute_type_id': 'dct:description',
    'value': 'A clinical sign indicating a lapse of posture and is usually manifest by a bilateral flapping tremor at the wrist, metacarpophalangeal, and hip joints.',
    'value_type_id': 'EDAM:data_0006',
    'original_attribute_name': 'description',
    'value_url': None,
    'attribute_source': None,
    'description': None,
    'attributes': None},
   {'attribute_type_id': 'biolink:Attribute',
    'value': 100.0,
    'value_type_id': 'EDAM:data_0006',
    'original_attribute_name': 'information_content',
    'value_url': None,
    'attribute_source': None,
    'description': None,
    'attributes': None},
   {'attribute_type_id': 'biolink:same_as',
    'value': ['SNOMEDCT:32838008',
     'MEDDRA:
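
Because `attributes` is a list rather than a dictionary, reading one attribute means scanning the list. The sketch below does this by `original_attribute_name`. `get_attribute` is a hypothetical helper, and `demo_node` is a shortened stand-in for the node printed above (the description text is abbreviated).

```python
def get_attribute(entity, original_name, default=None):
    """Return the value of the first attribute with a matching original_attribute_name."""
    # `attributes` may be None, so fall back to an empty list
    for attribute in entity.get("attributes") or []:
        if attribute.get("original_attribute_name") == original_name:
            return attribute.get("value")
    return default


# Shortened stand-in for a knowledge graph node
demo_node = {
    "name": "Asterixis",
    "attributes": [
        {"attribute_type_id": "dct:description",
         "original_attribute_name": "description",
         "value": "shortened description text"},
        {"attribute_type_id": "biolink:Attribute",
         "original_attribute_name": "information_content",
         "value": 100.0},
    ],
}

print(get_attribute(demo_node, "information_content"))
print(get_attribute(demo_node, "not_present", default="n/a"))
```

The same helper works for edges, since edge attributes use the same layout.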

Information returned for each Edge includes the edge ID (key), the subject's concept ID, the object's concept ID, the edge's predicate, any qualifiers, and attributes. Note that each entry under the `edges` level is a dictionary item whose key is the identifier referenced by the `results`. The content for one Edge is shown below.

In [11]:
next(iter( kg['edges'].items() ))

('39304169',
 {'subject': 'NCBIGene:1565',
  'object': 'HP:0002174',
  'predicate': 'biolink:genetically_associated_with',
  'sources': [{'resource_id': 'infores:disgenet',
    'resource_role': 'primary_knowledge_source',
    'upstream_resource_ids': None,
    'source_record_urls': None},
   {'resource_id': 'infores:pharos',
    'resource_role': 'aggregator_knowledge_source',
    'upstream_resource_ids': ['infores:disgenet'],
    'source_record_urls': None},
   {'resource_id': 'infores:automat-robokop',
    'resource_role': 'aggregator_knowledge_source',
    'upstream_resource_ids': ['infores:pharos'],
    'source_record_urls': None}],
  'qualifiers': None,
  'attributes': [{'attribute_type_id': 'biolink:Attribute',
    'value': 0.3,
    'value_type_id': 'EDAM:data_0006',
    'original_attribute_name': 'score',
    'value_url': None,
    'attribute_source': None,
    'description': None,
    'attributes': None}]})
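
The `sources` list records where an edge came from and which resources passed it along. A minimal sketch for grouping the `resource_id` values by `resource_role` follows. `sources_by_role` is a hypothetical helper, and `demo_edge` copies only the `sources` fields needed from the edge printed above.

```python
def sources_by_role(edge):
    """Group the resource_ids of an edge by their resource_role."""
    grouped = {}
    for source in edge.get("sources") or []:
        grouped.setdefault(source["resource_role"], []).append(source["resource_id"])
    return grouped


# Reduced copy of the edge shown above
demo_edge = {
    "subject": "NCBIGene:1565",
    "object": "HP:0002174",
    "predicate": "biolink:genetically_associated_with",
    "sources": [
        {"resource_id": "infores:disgenet",
         "resource_role": "primary_knowledge_source"},
        {"resource_id": "infores:pharos",
         "resource_role": "aggregator_knowledge_source"},
        {"resource_id": "infores:automat-robokop",
         "resource_role": "aggregator_knowledge_source"},
    ],
}

print(sources_by_role(demo_edge))
```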

The following sections show how the `knowledge_graph` can be used to annotate the `results`, which bind the components of the `knowledge_graph` to the original `query_graph`. A single result is shown below to highlight the features we'll use for the subsequent summaries.

The `node_bindings` link the nodes of the `query_graph` to the identifiers of the nodes within the `knowledge_graph`. The keys for these bindings are the identifiers used to define the `query_graph` above. In this example, nodes are named with an "n" followed by a number giving the order of the node in the `query_graph`, and the `edge_bindings` use an "e" plus a number in the same way. In both cases, the `id` field holds the matching identifier in the `knowledge_graph`.

The `result` includes a score field, which is not used when directly querying ROBOKOP using TRAPI. The score field is covered in the `HelloRobokop_ARA.ipynb` notebook, which queries the ROBOKOP knowledge graph through the Autonomous Relay Agent (ARA) methodology developed as part of the NIH Biomedical Data Translator.

In [12]:
# Illustrating the structure of each pathway result from the message property
pp.pprint(results[0])

{    'analyses': [    {    'attributes': None,
                           'edge_bindings': {    'e00': [    {    'attributes': None,
                                                                  'id': '92137489'},
                                                             {    'attributes': None,
                                                                  'id': '118472855'},
                                                             {    'attributes': None,
                                                                  'id': '8156627'},
                                                             {    'attributes': None,
                                                                  'id': '8966609'}],
                                                 'e01': [    {    'attributes': None,
                                                                  'id': '117227756'}]},
                           'resource_id': 'infores:automat-robokop',
                          

To export the results from our query, we use the timestamped output directory created in the first cell (`write_dir`) and convert our output dictionaries into a set of tables, which can be easily exported.

The code below writes out nodes from our `results`, but NOT the edges connecting the nodes.

In [13]:
import pandas as pd
import os

# One column for each query-graph node ID and one for its name
cols = []
for node in sorted(results[0]['node_bindings'].keys()):
    cols.append(node)
    cols.append(node + '_name')

# Build one single-row DataFrame per result, using the first binding of each node
results_list = []
for result in results:
    result_dict = {}
    for node in sorted(result['node_bindings'].keys()):
        node_id = result['node_bindings'][node][0]['id']
        result_dict[node] = node_id
        result_dict[node + '_name'] = kg['nodes'][node_id]['name']

    results_list.append(pd.DataFrame([result_dict]))
results_df = pd.concat(results_list)
display(results_df)
results_df.to_csv(os.path.join(write_dir, 'results_TRAPI.csv'), index=False)

# Join the node names of each result into one label, with spaces replaced by underscores
name_cols = [col for col in cols if col.endswith('_name')]
combined_node_list = ["_".join(name.replace(" ", "_") for name in row) for row in results_df[name_cols].to_numpy()]
pp.pprint(combined_node_list)

Unnamed: 0,n00,n00_name,n01,n01_name,n02,n02_name
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:1565,CYP2D6,HP:0002322,Resting tremor
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:1565,CYP2D6,HP:0002174,Postural tremor
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:1565,CYP2D6,HP:0002345,Action tremor
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:4988,OPRM1,HP:0012164,Asterixis
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:1565,CYP2D6,HP:0001337,Tremor
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:1565,CYP2D6,HP:0200085,Limb tremor
0,PUBCHEM.COMPOUND:644073,Buprenorphine,NCBIGene:1565,CYP2D6,HP:0025387,Pill-rolling tremor


[    'Buprenorphine_CYP2D6_Resting_tremor',
     'Buprenorphine_CYP2D6_Postural_tremor',
     'Buprenorphine_CYP2D6_Action_tremor',
     'Buprenorphine_OPRM1_Asterixis',
     'Buprenorphine_CYP2D6_Tremor',
     'Buprenorphine_CYP2D6_Limb_tremor',
     'Buprenorphine_CYP2D6_Pill-rolling_tremor']
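
These labels are reused as file names in a later cell. The names in this example are safe, but node names in other queries could contain characters such as `/` that are not valid in a file name. A minimal sketch of one way to guard against that is shown here. `safe_filename` is a hypothetical helper, and the set of allowed characters is an assumption that can be widened (for example, it replaces non-ASCII letters).

```python
import re


def safe_filename(label, replacement="_"):
    """Replace runs of characters other than letters, digits, '.', '_' and '-'."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", replacement, label)
    # Trim leading/trailing replacement characters; never return an empty name
    return cleaned.strip(replacement) or "unnamed"


print(safe_filename("Buprenorphine_CYP2D6_Pill-rolling_tremor"))
print(safe_filename("alpha/beta gamma"))
```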


The edge IDs are then retrieved from the `results` and used to find the corresponding `predicate` label in the `knowledge_graph`. Two nodes can have more than one edge connecting them because each edge represents a distinct type of association derived from a single data source. Because of this, the number of rows in this report may differ from the number of results above. For an answer with three nodes such as this one, there will be at least one edge between the first and second nodes and at least one between the second and third. The direction of the edge connecting two nodes can differ as well, such as for "Buprenorphine" and "CYP2D6" below.

Sometimes a single association type will be derived from multiple data sources. The code below prints a single line for each association type and includes the count of the number of data sources where that association was found. It writes out all unique edges for each of the results in the format `subject` -> `predicate` -> `object` and exports the information into a single file per `result`.

In [16]:
from collections import Counter
import json

for i in range(number_pathway_results):
    print(f"Pathway result: {combined_node_list[i]}")
    edge_bindings = results[i]['analyses'][0]['edge_bindings']

    # Build one "subject -> predicate -> object" string per bound edge
    string_out_list = []
    for edge_name, edge_list in edge_bindings.items():
        for binding in edge_list:
            edge = kg['edges'][binding['id']]
            subject_name = kg['nodes'][edge['subject']]['name']
            object_name = kg['nodes'][edge['object']]['name']
            string_out_list.append(f"{subject_name} -> {edge['predicate']} -> {object_name}")

    # Count how many edges share each subject/predicate/object combination
    string_out_dict = dict(Counter(string_out_list))
    pp.pprint(string_out_dict)
    print("")

    # Write the counts for this result to its own file
    with open(os.path.join(write_dir, combined_node_list[i] + ".txt"), 'w') as convert_file:
        convert_file.write(json.dumps(string_out_dict))
        

Pathway result: Buprenorphine_CYP2D6_Resting_tremor
{    'Buprenorphine -> biolink:affects -> CYP2D6': 2,
     'Buprenorphine -> biolink:directly_physically_interacts_with -> CYP2D6': 1,
     'CYP2D6 -> biolink:affects -> Buprenorphine': 1,
     'CYP2D6 -> biolink:genetically_associated_with -> Resting tremor': 1}

Pathway result: Buprenorphine_CYP2D6_Postural_tremor
{    'Buprenorphine -> biolink:affects -> CYP2D6': 2,
     'Buprenorphine -> biolink:directly_physically_interacts_with -> CYP2D6': 1,
     'CYP2D6 -> biolink:affects -> Buprenorphine': 1,
     'CYP2D6 -> biolink:genetically_associated_with -> Postural tremor': 1}

Pathway result: Buprenorphine_CYP2D6_Action_tremor
{    'Buprenorphine -> biolink:affects -> CYP2D6': 2,
     'Buprenorphine -> biolink:directly_physically_interacts_with -> CYP2D6': 1,
     'CYP2D6 -> biolink:affects -> Buprenorphine': 1,
     'CYP2D6 -> biolink:genetically_associated_with -> Action tremor': 1}

Pathway result: Buprenorphine_OPRM1_Asterixis
{  

In the output above, the connection between Buprenorphine and CYP2D6 consists of two edge types, `biolink:affects` and `biolink:directly_physically_interacts_with`. The `biolink:affects` relationship appears in both directions, with edges starting with Buprenorphine (`Buprenorphine -> biolink:affects -> CYP2D6`) and one starting with CYP2D6 (`CYP2D6 -> biolink:affects -> Buprenorphine`). These edges are repeated for all results that include Buprenorphine and CYP2D6 as the first two nodes. Those results differ in the type of tremor found for the third node. There are three edge types connecting Buprenorphine and OPRM1 (`biolink:affects`, `biolink:directly_physically_interacts_with`, `biolink:related_to`). Of note, `biolink:affects` was identified in 4 different knowledge sources, as indicated by the count reported after the colon.
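
The counts above are counts of edges per `subject -> predicate -> object` string. To see which sources stand behind each count, the primary knowledge sources can be collected per string instead. The following is a minimal sketch under that assumption. `primary_sources_per_triple` is a hypothetical helper, and the `demo_kg` edges and the `infores:source_a` / `infores:source_b` identifiers are made up for illustration.

```python
from collections import defaultdict


def primary_sources_per_triple(edge_ids, knowledge_graph):
    """Map 'subject -> predicate -> object' to its set of primary knowledge sources."""
    triples = defaultdict(set)
    for edge_id in edge_ids:
        edge = knowledge_graph["edges"][edge_id]
        subject_name = knowledge_graph["nodes"][edge["subject"]]["name"]
        object_name = knowledge_graph["nodes"][edge["object"]]["name"]
        label = f"{subject_name} -> {edge['predicate']} -> {object_name}"
        found = triples[label]  # creates the entry even if no primary source is listed
        for source in edge.get("sources") or []:
            if source.get("resource_role") == "primary_knowledge_source":
                found.add(source["resource_id"])
    return dict(triples)


# Made-up knowledge graph: two edges with the same triple but different sources
demo_kg = {
    "nodes": {"C1": {"name": "Buprenorphine"}, "G1": {"name": "CYP2D6"}},
    "edges": {
        "1": {"subject": "C1", "object": "G1", "predicate": "biolink:affects",
              "sources": [{"resource_id": "infores:source_a",
                           "resource_role": "primary_knowledge_source"}]},
        "2": {"subject": "C1", "object": "G1", "predicate": "biolink:affects",
              "sources": [{"resource_id": "infores:source_b",
                           "resource_role": "primary_knowledge_source"}]},
    },
}

demo_sources = primary_sources_per_triple(["1", "2"], demo_kg)
for label, found in demo_sources.items():
    print(label, sorted(found))
```

In the notebook, the edge IDs for one result could be gathered from `results[i]['analyses'][0]['edge_bindings']` and passed in together with `kg`.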