In [None]:
# Parameter inputs
aragorn_submit_url = "https://aragorn-u24.apps.renci.org/robokop/query"
input_search_string = 'ppara'
output_search_string = 'liver fibrosis'

# Initializing directory to write
from datetime import datetime
from pathlib import Path

now = datetime.now()
dt_string = now.strftime("%Y-%m-%d_%H%M%S")
write_dir = Path("output/TRAPI",str(dt_string))
write_dir.mkdir(parents=True, exist_ok=True)

In [None]:
import requests
import json
import pprint
pp = pprint.PrettyPrinter(indent=5)

This notebook provides a deeper dive into the TRAPI options for querying the ROBOKOP KG using the ARA functionality. It assumes you are familiar with the concepts covered in `HelloRobokop.ipynb`. It follows the same format and process as `HelloRobokop_TRAPI.ipynb`, except that it queries the ROBOKOP knowledge graph through an API based on the Autonomous Relay Agent (ARA) methodology developed as part of the NIH Biomedical Data Translator, which returns a score for each result.

TRAPI Documentation: https://github.com/NCATSTranslator/ReasonerAPI

Most TRAPI documents contain a `message` key.  Within that `message` are a `query_graph` denoting the user query,
a `knowledge_graph` consisting of the union of all nodes and edges that match the `query_graph` pattern, and a list of `results` that bind `query_graph` elements to `knowledge_graph` elements.
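
As a rough illustration of that layout, the sketch below builds an empty TRAPI-style message as a Python dictionary. The identifier `n0` and the placeholder values are assumptions made for illustration, not output from ROBOKOP; the TRAPI documentation linked above remains the authoritative schema.

```python
# Hypothetical skeleton of a TRAPI message; identifiers are placeholders.
trapi_document = {
    "message": {
        # The pattern the user is asking about
        "query_graph": {
            "nodes": {"n0": {"categories": ["biolink:Gene"]}},
            "edges": {},
        },
        # Union of all nodes and edges that matched the pattern
        "knowledge_graph": {"nodes": {}, "edges": {}},
        # Each result binds query_graph elements to knowledge_graph elements
        "results": [],
    }
}

message_keys = sorted(trapi_document["message"].keys())
print(message_keys)
```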

The message submitted in this notebook contains only a `query_graph`. This query graph consists of three nodes connected in a line. Two of the nodes (`n00` and `n02`) have specified identifiers, while the middle node (`n01`) does not. Instead, the middle node has a list of acceptable `categories`.

A researcher who starts from a name and wants to use TRAPI can use the Name Resolver tool to get the identifiers for the nodes. This is illustrated in the `HelloRobokop_TRAPI_multiple_IDs.ipynb` notebook.
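
Search strings such as `liver fibrosis` contain spaces, so it can help to percent-encode the lookup parameters rather than paste them into the URL. The sketch below only builds the URL and makes no network call; the helper name `build_lookup_url` is hypothetical, and the parameter names are the ones used by the lookup cells in this notebook.

```python
from urllib.parse import urlencode

# Base URL taken from the lookup cells in this notebook
NAME_RESOLVER_LOOKUP = "https://name-resolution-sri.renci.org/lookup"

def build_lookup_url(search_string, offset=0, limit=10):
    """Return a lookup URL with the query parameters percent-encoded."""
    params = {"string": search_string, "offset": offset, "limit": limit}
    return f"{NAME_RESOLVER_LOOKUP}?{urlencode(params)}"

lookup_url = build_lookup_url("liver fibrosis")
print(lookup_url)
```

Passing the same dictionary through the `params` argument of `requests` achieves the same encoding without building the URL by hand.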

This query asks: "Find me a Biological Entity that is related to both a gene or gene product matching `ppara` (for example `NCBIGene:5465`, PPARA) and a disease or phenotypic feature matching `liver fibrosis` (for example `HP:0001395`, Hepatic fibrosis)."

In [None]:
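# Look up candidate identifiers for the input search string with the Name Resolver.
# A separate variable name avoids shadowing the TRAPI `results` used in later cells.
lookup_response = requests.post(
    'https://name-resolution-sri.renci.org/lookup',
    params={'string': input_search_string, 'offset': 0, 'limit': 10},
)
results_json = lookup_response.json()
#print(json.dumps(results_json,indent=4))
input_node_id_list = list(results_json.keys())
print(input_node_id_list)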


['UniProtKB:P37230', 'UniProtKB:Q07869', 'UniProtKB:Q95N78', 'PR:000013056', 'UniProtKB:P23204', 'NCBIGene:19013', 'NCBIGene:25747', 'NCBIGene:5465', 'NCBIGene:557714', 'NCBIGene:30755', 'NCBIGene:563298', 'UMLS:C0166415', 'NCBIGene:10891', 'NCBIGene:133522', 'UMLS:C2984537', 'NCBIGene:400931', 'PR:000040325', 'UMLS:C1868415', 'MESH:C000630914', 'MESH:C000634429', 'UMLS:C5226508', 'UMLS:C5197094', 'UMLS:C5417797', 'REACT:R-SSC-400204', 'REACT:R-BTA-400204', 'UMLS:C1518805', 'REACT:R-DRE-400204', 'REACT:R-HSA-400204', 'REACT:R-MMU-400204', 'REACT:R-BTA-9734475', 'REACT:R-SSC-9734475', 'REACT:R-HSA-879724', 'REACT:R-DRE-9734475', 'REACT:R-HSA-9734475', 'REACT:R-CFA-400204', 'REACT:R-MMU-9734475', 'REACT:R-RNO-400204', 'REACT:R-XTR-400204', 'REACT:R-CFA-9734475', 'REACT:R-RNO-9734475', 'REACT:R-XTR-9734475', 'REACT:R-DME-400204', 'REACT:R-HSA-1989781', 'REACT:R-BTA-4341070', 'REACT:R-DME-9734475', 'REACT:R-BTA-400143', 'REACT:R-DRE-400143', 'REACT:R-DRE-4341070', 'REACT:R-MMU-400143', 'RE

In [None]:
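# Look up candidate identifiers for the output search string with the Name Resolver.
# Passing `params` lets requests encode the space in the search string.
lookup_response = requests.post(
    'https://name-resolution-sri.renci.org/lookup',
    params={'string': output_search_string, 'offset': 0, 'limit': 10},
)
results_json = lookup_response.json()
#print(json.dumps(results_json,indent=4))
output_node_id_list = list(results_json.keys())
print(output_node_id_list)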

['HP:0001395', 'UMLS:C4227681', 'UMLS:C4034373', 'UMLS:C5189427', 'UMLS:C0544816', 'MONDO:0100430', 'MONDO:0018840', 'UMLS:C1397317', 'UMLS:C4068302', 'UMLS:C4481250', 'UMLS:C2827436', 'UMLS:C4321337', 'UMLS:C4695229', 'UMLS:C0494791', 'UMLS:C0400961', 'UMLS:C3864238', 'UMLS:C1960658', 'UMLS:C4695228', 'UMLS:C5563662', 'UMLS:C1407032', 'UMLS:C0400925', 'UMLS:C4749320', 'UMLS:C5548949', 'UMLS:C4533463', 'UMLS:C5689517', 'UMLS:C5689516', 'UMLS:C4722044', 'UMLS:C1856310', 'UMLS:C5439238', 'UMLS:C4722043', 'UMLS:C5548946', 'UMLS:C4533767', 'UMLS:C3277942', 'UMLS:C1385044', 'UMLS:C4070891', 'UMLS:C3873179', 'UMLS:C4070890', 'UMLS:C1954436', 'UMLS:C4070622', 'UMLS:C4036765', 'UMLS:C5215514', 'UMLS:C0451713', 'UMLS:C5686432', 'UMLS:C3275636', 'UMLS:C4750548', 'UMLS:C5549445', 'UMLS:C5549441', 'UMLS:C2184113', 'UMLS:C5190480', 'UMLS:C5171263', 'UMLS:C5171261', 'UMLS:C5171262', 'UMLS:C2751577', 'UMLS:C1869017', 'UMLS:C4030819', 'UMLS:C2749679', 'UMLS:C5697513', 'UMLS:C4732266', 'UMLS:C3869480',

In [None]:
query={
    "message": {
      "query_graph": {
        "edges": {
          "e00": {
            "subject": "n00",
              "object": "n01",
          "predicates":["biolink:related_to"]
          },
          "e01": {
            "subject": "n01",
              "object": "n02",
          "predicates":["biolink:related_to"]
          }
        },
        "nodes": {
          "n00": {
            "ids": input_node_id_list, #['NCBIGene:5465'], #
            "categories": ["biolink:GeneOrGeneProduct"]
          },
          "n01": {
              "categories": ["biolink:BiologicalEntity"]
          },
          "n02": {
            "ids": output_node_id_list, #["HP:0001395"],
            "categories": ["biolink:DiseaseOrPhenotypicFeature"]
          }
        }
      }
    }
  }


This query can be sent to the ROBOKOP KG through the ARAGORN endpoint defined above like this:

In [None]:
response = requests.post(aragorn_submit_url,json=query)
print(response.status_code)
# Number of results returned; reused later when writing one file per result
number_pathway_results = len(response.json()['message']['results'])
print(number_pathway_results)

200
201


The response in JSON form is a Python dictionary with four top-level keys: `message`, `logs`, `status`, and `pid`. The `message` property contains the `query_graph`, `knowledge_graph`, and `results` from the query.
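
A response may not always carry every part of the message, for example if a query fails or returns nothing. The sketch below shows one defensive way to pull the three parts out of a parsed response; it assumes that missing or null parts should be treated as empty. The helper `split_message` and the mock response are hypothetical, not ARAGORN output.

```python
def split_message(response_json):
    """Return (query_graph, knowledge_graph, results), tolerating missing keys."""
    message = response_json.get("message") or {}
    query_graph = message.get("query_graph") or {}
    knowledge_graph = message.get("knowledge_graph") or {"nodes": {}, "edges": {}}
    results = message.get("results") or []
    return query_graph, knowledge_graph, results

# Mock response for illustration only
mock_response = {
    "message": {"query_graph": {"nodes": {}, "edges": {}}, "results": None},
    "status": "Success",
}
mock_qg, mock_kg, mock_results = split_message(mock_response)
print(len(mock_results), sorted(mock_kg.keys()))
```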

In [10]:
print(response.json().keys())
print(response.json()['message'].keys())

query_out = response.json()['message']['query_graph']
kg = response.json()['message']['knowledge_graph']
results = response.json()['message']['results']

dict_keys(['message', 'logs', 'status', 'pid'])
dict_keys(['query_graph', 'knowledge_graph', 'results'])


The submitted query graph is returned, with additional fields, in the `query_graph` property of `message`.

The code below pretty-prints the full response, including the returned query graph. The returned query graph includes additional attributes such as `knowledge_type`, `attribute_constraints`, and `qualifier_constraints`. These attributes can be explicitly defined in the submitted query so that further filtering can be done. Because nothing was specified for these attributes in the submitted query graph, their values in the returned query graph are blank.
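
One way to see exactly which fields were added is to compare the keys of each node and edge in the two query graphs. The sketch below does this on toy graphs; the helper `added_fields` is hypothetical, and the toy values (including `is_set`) are assumptions for illustration rather than actual ARAGORN output.

```python
def added_fields(submitted_qg, returned_qg):
    """Map each node/edge key to the fields present only in the returned graph."""
    added = {}
    for section in ("nodes", "edges"):
        for key, returned_item in returned_qg.get(section, {}).items():
            submitted_item = submitted_qg.get(section, {}).get(key, {})
            new_keys = sorted(set(returned_item) - set(submitted_item))
            if new_keys:
                added[key] = new_keys
    return added

# Toy graphs for illustration; not actual ARAGORN output
submitted = {"nodes": {"n00": {"ids": ["NCBIGene:5465"]}},
             "edges": {"e00": {"subject": "n00", "object": "n01"}}}
returned = {"nodes": {"n00": {"ids": ["NCBIGene:5465"], "is_set": False}},
            "edges": {"e00": {"subject": "n00", "object": "n01",
                              "knowledge_type": None,
                              "attribute_constraints": [],
                              "qualifier_constraints": []}}}
field_diff = added_fields(submitted, returned)
print(field_diff)
```

In this notebook the same helper could be called as `added_fields(query['message']['query_graph'], query_out)`.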

In [12]:
edges = ["e00", "e01"]
nodes = ["n00", "n01", "n02"]

#print(nodes)
#print(edges)
#pp.pprint(query)
pp.pprint(response.json())
#pp.pprint(results)
#pp.pprint(kg.keys())
#next(iter( kg['nodes'].items() ))
#next(iter( kg['edges'].items() ))
#pp.pprint(results[0])

{    'logs': [    {    'level': 'INFO',
                       'message': 'pid: 4b34dc26ee5f',
                       'timestamp': '2023-09-05T19:42:30.489462'}],
     'message': {    'knowledge_graph': {    'edges': {    '088f1ea8-6cc2-4d1b-a011-836c83421245': {    'attributes': [    {    'attribute_type_id': 'biolink:has_numeric_value',
                                                                                                                                'original_attribute_name': 'weight',
                                                                                                                                'value': 1,
                                                                                                                                'value_type_id': 'EDAM:data_1669'},
                                                                                                                           {    'attribute_type_id': 'biolink:has_count',
                    




The `results` property contains the pathways returned for the query message. Each pathway is organized into `edge_bindings` and `node_bindings`, which hold the matches for the edges and nodes specified in the query message. Results are defined using the node and edge identifiers. The attributes for those nodes and edges (including the names) are available via the `knowledge_graph` component of the `message` section of the response.

The code below writes out nodes from our `results`, but NOT the edges connecting the nodes.

In [8]:
import pandas as pd
import os

# One column for each node ID and one for its name, in sorted node order
cols = []
for node in sorted(results[0]['node_bindings'].keys()):
    cols.append(node)
    cols.append(node + '_name')

# Build one row per result from the first binding of each query node
results_list = []
for result in results:
    result_dict = {}
    for node in sorted(result['node_bindings'].keys()):
        node_id = result['node_bindings'][node][0]['id']
        result_dict[node] = node_id
        result_dict[node + '_name'] = kg['nodes'][node_id]['name']
    results_list.append(result_dict)
results_df = pd.DataFrame(results_list, columns=cols)
display(results_df)
# results_df.to_csv(os.path.join(write_dir,'results_TRAPI.csv'), index=False)

combined_node_list = ["_".join([row[1].replace(" ", "_"), row[3].replace(" ", "_"), row[5].replace(" ", "_")]) for row in results_df[cols].to_numpy()]
pp.pprint(combined_node_list)


The edge IDs are then retrieved from the `results` and used to find the corresponding `predicate` label in the `knowledge_graph`. Two nodes can have more than one edge connecting them because each edge represents a distinct type of association derived from a single data source. Because of this, the number of rows in this report may differ from the set of results above. For an answer with three nodes such as this one, there will be at least one edge between nodes 1 and 2 as well as between nodes 2 and 3. The direction of the edge connecting two nodes can differ as well, such as for "PPARA" and "SMAD3" below. Sometimes a single association type will be derived from multiple data sources. In the code below, we print a single edge for each association type and include the count of the number of data sources where that association was found. The code writes out all unique edges for each of the results in the format `subject` -> `predicate` -> `object` and exports the information into a single file per `result`.
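
The counting step can be sketched on its own with `collections.Counter`. The toy edges below are made up for illustration (in the notebook the edges come from `kg['edges']`); two of them share the same subject, predicate, and object, so that triple is reported once with a count of 2.

```python
from collections import Counter

# Toy edges for illustration; real edges come from the knowledge graph
toy_edges = [
    {"subject": "A", "predicate": "biolink:affects", "object": "B"},
    {"subject": "A", "predicate": "biolink:affects", "object": "B"},
    {"subject": "B", "predicate": "biolink:affects", "object": "A"},
]

# One "subject -> predicate -> object" string per edge
triples = [f"{e['subject']} -> {e['predicate']} -> {e['object']}" for e in toy_edges]

# Collapse duplicates and keep how often each triple occurred
triple_counts = dict(Counter(triples))
print(triple_counts)
```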

In [10]:
from collections import Counter
import json
import pprint
pp = pprint.PrettyPrinter(indent=5)

for i in range(number_pathway_results):
    print(f"Pathway result: {combined_node_list[i]}")
    edge_bindings = results[i]['edge_bindings']

    # Build one "subject -> predicate -> object" string per bound edge
    string_out_list = []
    for edge_name, edge_list in edge_bindings.items():
        for edge_id in (x['id'] for x in edge_list):
            edge = kg['edges'][edge_id]
            subject_name = kg['nodes'][edge['subject']]['name']
            # `object_name` avoids shadowing the built-in `object`
            object_name = kg['nodes'][edge['object']]['name']
            string_out_list.append(f"{subject_name} -> {edge['predicate']} -> {object_name}")

    # Count how many bound edges share each subject/predicate/object triple
    string_out_dict = dict(Counter(string_out_list))
    pp.pprint(string_out_dict)
    print("")

    # Results with identical node names share a file name, so later ones overwrite earlier ones
    with open(os.path.join(write_dir, combined_node_list[i] + ".txt"), 'w') as convert_file:
        convert_file.write(json.dumps(string_out_dict))


Pathway result: PPARA_SMAD3_Hepatic_fibrosis
{    'Hepatic fibrosis -> biolink:occurs_together_in_literature_with -> PPARA': 1,
     'Hepatic fibrosis -> biolink:occurs_together_in_literature_with -> SMAD3': 1,
     'PPARA -> biolink:regulates -> SMAD3': 1,
     'SMAD3 -> biolink:genetic_association -> Hepatic fibrosis': 1,
     'SMAD3 -> biolink:occurs_together_in_literature_with -> PPARA': 1,
     'SMAD3 -> biolink:regulates -> PPARA': 1}

Pathway result: PPARA_SMAD3_Hepatic_fibrosis
{    'Hepatic fibrosis -> biolink:occurs_together_in_literature_with -> PPARA': 1,
     'Hepatic fibrosis -> biolink:occurs_together_in_literature_with -> SMAD3': 1,
     'PPARA -> biolink:regulates -> SMAD3': 1,
     'SMAD3 -> biolink:genetic_association -> Hepatic fibrosis': 1,
     'SMAD3 -> biolink:occurs_together_in_literature_with -> PPARA': 1,
     'SMAD3 -> biolink:regulates -> PPARA': 1}

Pathway result: PPARA_TGFB1_Hepatic_fibrosis
{    'Hepatic fibrosis -> biolink:occurs_together_in_literature

We see that the connection between PPARA and SMAD3 consists of two edge types, `biolink:regulates` and `biolink:occurs_together_in_literature_with`. The `biolink:regulates` relationship is bidirectional, with one edge starting with PPARA (`PPARA -> biolink:regulates -> SMAD3`) and one starting with SMAD3 (`SMAD3 -> biolink:regulates -> PPARA`). These three edges are repeated for all results that include PPARA and SMAD3 as the first two nodes. SMAD3 and Hepatic fibrosis are likewise connected by two edge types (`biolink:genetic_association` and `biolink:occurs_together_in_literature_with`), and one edge links Hepatic fibrosis directly to PPARA. The number after the colon is the count for that association; in the results shown above, each count is 1.

In [11]:
pp.pprint(results[0].keys())
pp.pprint(results[0]['node_bindings'].keys())
pp.pprint(results[0]['edge_bindings'].keys())
pp.pprint(results[0]['score'])
#pp.pprint(kg['edges'])

dict_keys(['node_bindings', 'edge_bindings', 'score'])
dict_keys(['n02', 'n01', 'n00'])
dict_keys(['e00', 'e01', 's7', 's67', 's68'])
0.4262633578538942


In [12]:
aragorn_result_summaries = []
for r in results:
    rs = f"Score={round(r['score'], 3)}: "
    for j, node in enumerate(nodes):
        node_id = r['node_bindings'][node][0]['id']
        node_name = kg['nodes'][node_id]['name']
        rs += f"{node_name} ({node_id})"
        # Every node except the last is followed by the edge to the next node
        if j < len(edges):
            edge_id = r['edge_bindings'][edges[j]][0]['id']
            edge_name = kg['edges'][edge_id]['predicate']
            rs += f"--{edge_name}-->"
    aragorn_result_summaries.append(rs)

In [13]:
for rs in aragorn_result_summaries:
    print(rs)

Score=0.426: PPARA (NCBIGene:5465)--biolink:regulates-->SMAD3 (NCBIGene:4088)--biolink:genetic_association-->Hepatic fibrosis (HP:0001395)
Score=0.426: PPARA (NCBIGene:5465)--biolink:regulates-->SMAD3 (NCBIGene:4088)--biolink:genetic_association-->Hepatic fibrosis (HP:0001395)
Score=0.422: PPARA (NCBIGene:5465)--biolink:regulates-->TGFB1 (NCBIGene:7040)--biolink:genetic_association-->Hepatic fibrosis (HP:0001395)
Score=0.422: PPARA (NCBIGene:5465)--biolink:regulates-->TGFB1 (NCBIGene:7040)--biolink:genetic_association-->Hepatic fibrosis (HP:0001395)
Score=0.415: PPARA (NCBIGene:5465)--biolink:regulates-->CCL2 (NCBIGene:6347)--biolink:genetic_association-->Hepatic fibrosis (HP:0001395)
Score=0.415: PPARA (NCBIGene:5465)--biolink:regulates-->CCL2 (NCBIGene:6347)--biolink:genetic_association-->Hepatic fibrosis (HP:0001395)
Score=0.406: PPARA (NCBIGene:5465)--biolink:directly_physically_interacts_with-->ALB (NCBIGene:213)--biolink:genetic_association-->Hepatic fibrosis (HP:0001395)
Score=0

The following assumes that the node names will sort in the correct order, which is the case with the default naming conventions. This exports the results showing the nodes and the score assigned to each result.
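
To make that assumption concrete, the sketch below compares plain `sorted()` on zero-padded names like those used here with names that are not padded. The `natural_key` helper is a hypothetical fallback for the unpadded case and is not needed for the default naming convention.

```python
import re

def natural_key(name):
    """Split a name such as 'n10' into text and integer parts for sorting."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]

default_names = ["n02", "n00", "n01"]
unpadded_names = ["n10", "n2", "n1"]

sorted_default = sorted(default_names)                    # zero-padded: plain sort is fine
sorted_unpadded = sorted(unpadded_names)                  # lexicographic: 'n10' lands before 'n2'
sorted_natural = sorted(unpadded_names, key=natural_key)  # numeric-aware ordering

print(sorted_default)
print(sorted_unpadded)
print(sorted_natural)
```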

In [16]:
# Node ID and name columns in sorted node order, followed by the result score
cols = []
for node in sorted(results[0]['node_bindings'].keys()):
    cols.append(node)
    cols.append(node + '_name')
cols.append('score')

results_dict_list = []
for result in results:
    result_dict = {}
    for node in result['node_bindings'].keys():
        node_id = result['node_bindings'][node][0]['id']
        result_dict[node] = node_id
        result_dict[node + '_name'] = kg['nodes'][node_id]['name']
    result_dict['score'] = result['score']
    results_dict_list.append(result_dict)

# Build the frame in one step instead of concatenating onto an empty DataFrame
results_df = pd.DataFrame(results_dict_list, columns=cols)
print(results_df.shape)
results_df.to_csv(os.path.join(write_dir,'results_TRAPI_aragorn.csv'))


(136, 7)
