## Working with CTKG in Deep Graph Library (DGL)

**Please see requirements.txt for dependency versions.**

This notebook provides an example of building a heterograph from CTKG in DGL, along with some example queries on the DGL heterograph. For more information about using DGL, please refer to https://www.dgl.ai/
This notebook builds on the notebook from DRKG: https://github.com/gnn4dr/DRKG

In [2]:
import pandas as pd
import numpy as np
import dgl
import sys
sys.path.insert(1, '../utils')
from utils import download_and_extract
#download_and_extract()
ctkg_file = '../rawdata/ctkg.tsv'
# The file has no header row; each row is a (source, relation, destination) triple.
df = pd.read_csv(ctkg_file, sep="\t", header=None)
triplets = df.values.tolist()

Assign an ID to each node (entity). Build a dictionary keyed by node type; each value is itself a dictionary mapping a node to its integer ID.

In [4]:
entity_dictionary = {}
def insert_entry(entry, ent_type, dic):
    # Give each unseen entry the next free integer ID within its node type.
    if ent_type not in dic:
        dic[ent_type] = {}
    ent_n_id = len(dic[ent_type])
    if entry not in dic[ent_type]:
        dic[ent_type][entry] = ent_n_id
    return dic

for triple in triplets:
    # Each entity string begins with its node type, followed by '::'.
    src = triple[0]
    split_src = src.split('::')
    src_type = split_src[0]
    dest = triple[2]
    split_dest = dest.split('::')
    dest_type = split_dest[0]
    insert_entry(src, src_type, entity_dictionary)
    insert_entry(dest, dest_type, entity_dictionary)
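
To make the resulting structure concrete, here is a minimal sketch of the same idea on three made-up triples. The entity and relation names are hypothetical and not taken from CTKG, and the helper name build_entity_dictionary is introduced only for this illustration. It assumes, as the code above does, that every entity string starts with its node type followed by '::'.

```python
def build_entity_dictionary(triplets):
    """Map each node type to a {entity: integer ID} dictionary."""
    entity_dictionary = {}
    for src, _relation, dest in triplets:
        for entity in (src, dest):
            ent_type = entity.split('::')[0]
            ids = entity_dictionary.setdefault(ent_type, {})
            # IDs are assigned per node type, in order of first appearance.
            if entity not in ids:
                ids[entity] = len(ids)
    return entity_dictionary


# Hypothetical toy triples, for illustration only.
toy_triplets = [
    ['Drug::d1', 'drug::study', 'Study::s1'],
    ['Drug::d2', 'drug::study', 'Study::s1'],
    ['Condition::c1', 'condition::study', 'Study::s1'],
]
toy_entities = build_entity_dictionary(toy_triplets)
print(toy_entities)
```

Note that IDs restart at 0 for every node type, so the same integer can refer to different nodes depending on the type it is looked up under.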

Create a dictionary of relations: each key is a (source node type, relation, destination node type) tuple, and each value is the list of (source node ID, destination node ID) tuples for that relation.

In [5]:
edge_dictionary={}
for triple in triplets:
    src = triple[0]
    split_src = src.split('::')
    src_type = split_src[0]
    dest = triple[2]
    split_dest = dest.split('::')
    dest_type = split_dest[0]
    
    src_int_id = entity_dictionary[src_type][src]
    dest_int_id = entity_dictionary[dest_type][dest]
    
    pair = (src_int_id, dest_int_id)
    etype = (src_type, triple[1], dest_type)
    # Start a new list the first time an edge type is seen, then append.
    edge_dictionary.setdefault(etype, []).append(pair)
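
As a rough illustration of the shape of this dictionary, the sketch below groups three made-up triples by edge type. The triples and the toy_ids mapping are hypothetical, not CTKG data; collections.defaultdict is used here as one possible way to write the grouping step.

```python
from collections import defaultdict

# Hypothetical toy data, for illustration only.
toy_triplets = [
    ['Drug::d1', 'drug::study', 'Study::s1'],
    ['Drug::d2', 'drug::study', 'Study::s1'],
    ['Condition::c1', 'condition::study', 'Study::s1'],
]
# Per-type integer IDs, as produced by the node-ID step.
toy_ids = {
    'Drug': {'Drug::d1': 0, 'Drug::d2': 1},
    'Study': {'Study::s1': 0},
    'Condition': {'Condition::c1': 0},
}

grouped = defaultdict(list)
for src, relation, dest in toy_triplets:
    src_type = src.split('::')[0]
    dest_type = dest.split('::')[0]
    pair = (toy_ids[src_type][src], toy_ids[dest_type][dest])
    # The key is the full (source type, relation, destination type) tuple.
    grouped[(src_type, relation, dest_type)].append(pair)

toy_edges = dict(grouped)
print(toy_edges)
```

Both (0, 0) pairs in this toy result refer to different source nodes: one is a Drug and the other a Condition, which is why the node types have to be part of the key.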

## Create a DGL heterograph using the dictionary of relations

In [6]:
graph = dgl.heterograph(edge_dictionary)
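
The list-of-pairs form above is what this notebook passes to dgl.heterograph. DGL's documentation also describes edge data given as separate source and destination ID arrays per relation. Whether one form or the other is preferred may depend on the installed DGL version, so treat the following as a plain-Python sketch of the reshaping only; pairs_to_columns is a hypothetical helper and the toy data is made up.

```python
def pairs_to_columns(edge_dictionary):
    """Turn {etype: [(src, dst), ...]} into {etype: ([src, ...], [dst, ...])}."""
    columns = {}
    for etype, pairs in edge_dictionary.items():
        src_ids = [s for s, _ in pairs]
        dst_ids = [d for _, d in pairs]
        columns[etype] = (src_ids, dst_ids)
    return columns


# Hypothetical toy edge dictionary, for illustration only.
toy_edges = {
    ('Drug', 'drug::study', 'Study'): [(0, 0), (1, 0)],
    ('Condition', 'condition::study', 'Study'): [(0, 0)],
}
toy_columns = pairs_to_columns(toy_edges)
print(toy_columns)
```

Passing the reshaped dictionary to DGL is deliberately left out here and should be checked against the version pinned in requirements.txt.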

## Print the statistics of the created graph

Number of nodes for each node-type

In [7]:
total_nodes = 0
for ntype in graph.ntypes:
    print(ntype, '\t', graph.number_of_nodes(ntype))
    total_nodes += graph.number_of_nodes(ntype)
print("Graph contains {} nodes from {} node-types.".format(total_nodes, len(graph.ntypes)))

AdverseEvent 	 18546
BaselineGroup 	 27068
BaselineRecord 	 315533
ClusterOutcome 	 200
Condition 	 1394
DropGroup 	 22272
DropRecord 	 123627
Drug 	 2548
EventGroup 	 22725
Method 	 907
Organ 	 27
Outcome 	 88386
OutcomeAnalysis 	 107294
OutcomeGroup 	 32499
OutcomeMeasure 	 690626
Period 	 34330
StandardOutcome 	 492
Study 	 8210
Graph contains 1496684 nodes from 18 node-types.


Number of edges for each relation (edge-type)

In [8]:
total_edges = 0
for etype in graph.etypes:
    print(etype, '\t', graph.number_of_edges(etype))
    total_edges += graph.number_of_edges(etype)
print("Graph contains {} edges from {} edge-types.".format(total_edges, len(graph.etypes)))

adverseevent::eventgroup 	 966450
adverseevent::organ 	 18546
baselinegroup::baselinerecord 	 315533
baselinegroup::study 	 27068
baselinerecord::baselinegroup 	 315533
clusteroutcome::outcome 	 88244
condition::study 	 17259
dropgroup::period 	 34330
dropgroup::study 	 22272
droprecord::period 	 123627
drug::eventgroup 	 31527
studiedDrug::study 	 20982
usedDrug::study 	 3992
eventgroup::adverseevent 	 966450
eventgroup::drug 	 31527
eventgroup::study 	 22725
method::outcomeanalysis 	 91463
organ::adverseevent 	 18546
outcome::clusteroutcome 	 88244
outcome::outcomeanalysis 	 107294
outcome::outcomemeasure 	 690626
outcome::standardoutcome 	 58819
outcome::study 	 88386
outcomeanalysis::method 	 91463
outcomeanalysis::outcome 	 107294
outcomeanalysis::outcomegroup 	 209314
outcomegroup::outcomeanalysis 	 209314
outcomegroup::outcomemeasure 	 690541
outcomegroup::study 	 32499
outcomemeasure::outcome 	 690626
outcomemeasure::outcomegroup 	 690541
period::dropgroup 	 34330
period::dropr

Printing the graph itself ("print(graph)") also shows a summary of the node and edge counts.

In [9]:
print(graph)

Graph(num_nodes={'AdverseEvent': 18546, 'BaselineGroup': 27068, 'BaselineRecord': 315533, 'ClusterOutcome': 200, 'Condition': 1394, 'DropGroup': 22272, 'DropRecord': 123627, 'Drug': 2548, 'EventGroup': 22725, 'Method': 907, 'Organ': 27, 'Outcome': 88386, 'OutcomeAnalysis': 107294, 'OutcomeGroup': 32499, 'OutcomeMeasure': 690626, 'Period': 34330, 'StandardOutcome': 492, 'Study': 8210},
      num_edges={('AdverseEvent', 'adverseevent::eventgroup', 'EventGroup'): 966450, ('AdverseEvent', 'adverseevent::organ', 'Organ'): 18546, ('BaselineGroup', 'baselinegroup::baselinerecord', 'BaselineRecord'): 315533, ('BaselineGroup', 'baselinegroup::study', 'Study'): 27068, ('BaselineRecord', 'baselinerecord::baselinegroup', 'BaselineGroup'): 315533, ('ClusterOutcome', 'clusteroutcome::outcome', 'Outcome'): 88244, ('Condition', 'condition::study', 'Study'): 17259, ('DropGroup', 'dropgroup::period', 'Period'): 34330, ('DropGroup', 'dropgroup::study', 'Study'): 22272, ('DropRecord', 'droprecord::period'