# Walkthrough for Surfacing Key RDF-Triples for Knowledge Graph Expansion
This notebook outlines the basic process for identifying the RDF-triples from which an observed knowledge graph might best be expanded toward a "complete" knowledge graph. The selection is based on comparing the observed graph with an emulated graph.

First, we set up the required imports and arguments for the demonstration. 


In [1]:
import json
import networkx as nx
import numpy as np
import os

from multivac.get_kg_query_params import build_network, read_txt
from calculate_network_change import (build_comparison_metrics,
                                      generate_node_changes,
                                      generate_result_lists, get_items)

In [2]:
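ofile = "data/train2id.txt"      # observed triples, stored as index rows
nfile = "new_multivac_test.txt"  # emulated triples to compare against
kg_dir = "data"                  # directory holding the triple and ID-mapping files
measure = 'eigenvector'          # centrality measure to compare across graphs
num_results = 100                # number of top-scoring triples to return
out = None                       # optional output path (unused in this walkthrough)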

Next, we load the RDF-triples from our knowledge graph as a numpy array of indices and remove duplicate rows. We also load dictionaries for our entities and relations that map those indices back to the original text.

In [3]:
triples = read_txt(os.path.join(kg_dir, 'train2id.txt'))
triples = np.array(triples).astype(int)
triples = np.unique(triples, axis=0)

ents = get_items(os.path.join(kg_dir, 'entity2id.txt'))
rels = get_items(os.path.join(kg_dir, 'relation2id.txt'))
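
As a small illustration of the de-duplication step, the sketch below applies `np.unique` with `axis=0` to a toy array. The toy values are made up for the example; only the (head, tail, relation) column order is taken from how the triples are unpacked later in this notebook.

```python
import numpy as np

# Toy triples as (head, tail, relation) index rows; the second row is a duplicate.
toy = np.array([[0, 1, 0],
                [0, 1, 0],
                [2, 0, 1]])

# axis=0 treats each row as one item, so duplicate triples collapse to one.
# The rows also come back sorted, so the original file order is not preserved.
unique_toy = np.unique(toy, axis=0)
print(unique_toy.shape)
```

Because the rows are re-sorted, any row position used afterwards refers to the de-duplicated array, not to a line number in the source file.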


We then read in the new file from the emulated graph for comparison, and build both our observed and emulated knowledge graphs as networks using `networkx`.



In [4]:
# read in new file for comparison
new = read_txt(nfile)
new = np.array(new).astype(int)

# create networks
neto = build_network(triples)
netn = build_network(new)
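
The internals of `build_network` are not shown in this notebook. As a rough sketch of what such a step might look like, the hypothetical helper below (`build_network_sketch`, not part of `multivac`) adds one directed edge per triple, assuming (head, tail, relation) column order and that the relation is kept as an edge attribute.

```python
import networkx as nx
import numpy as np


def build_network_sketch(triples):
    """Hypothetical stand-in for build_network: one directed edge per triple."""
    graph = nx.DiGraph()
    for head, tail, rel in triples:
        # Cast to plain ints so node labels do not depend on numpy types.
        graph.add_edge(int(head), int(tail), relation=int(rel))
    return graph


toy = np.array([[0, 1, 0], [1, 2, 1], [2, 0, 0]])
toy_net = build_network_sketch(toy)
print(toy_net.number_of_nodes(), toy_net.number_of_edges())
```

Whether the real function builds a directed or undirected graph, or keeps relation labels at all, is an assumption here; the choice affects which centrality measures are meaningful.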


Next, for every node in the observed network, we calculate the chosen centrality measure in both the observed and emulated graphs and return the values as a Python dictionary. We then compute each node's change in centrality when moving from the observed to the emulated graph, and order the results by which nodes exhibit the largest increase in centrality.


In [5]:
net = build_comparison_metrics(neto, netn, measure)
result = generate_node_changes(net)
result = dict(sorted(result.items(), key=lambda item: item[1]))

# generate results of interest
gains = generate_result_lists(result, len(result), 'top')
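
To make the comparison concrete, here is a minimal sketch of the idea on two toy graphs, using `networkx` directly rather than the notebook's helper functions. The graphs, the use of undirected edges, and the variable names are all assumptions for illustration; the helpers above may compute or order the values differently.

```python
import networkx as nx

# Toy observed graph: a triangle (0, 1, 2) with node 3 hanging off node 2.
observed = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])
# Toy emulated graph: same edges, plus new links that make node 3 a hub.
emulated = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3), (3, 0), (3, 1), (3, 4)])

cent_obs = nx.eigenvector_centrality(observed, max_iter=1000)
cent_emu = nx.eigenvector_centrality(emulated, max_iter=1000)

# Only nodes present in the observed graph are scored; a node missing from
# the emulated graph is treated as having zero centrality there.
changes = {node: cent_emu.get(node, 0.0) - cent_obs[node]
           for node in observed.nodes}

# Largest increase first.
ranked = dict(sorted(changes.items(), key=lambda item: item[1], reverse=True))
print(next(iter(ranked)))
```

In this toy case node 3 gains the most centrality, which marks it as a point where the emulated graph suggests the observed graph is under-connected.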


Ultimately, we want to know which RDF-triples matter most for expanding the knowledge graph, not just which individual entity nodes do. To get this information, we score each triple by summing the scores of its head and tail entity nodes, then select the top-scoring RDF-triples to return.

In [6]:
trip_scores = np.zeros(triples.shape[0])

for i, trip in enumerate(triples):
    head, tail, _ = trip
    trip_scores[i] = gains.get(head, 0) + gains.get(tail, 0)

# sort triple positions by descending score and keep the top num_results
idxs = trip_scores.argsort()[::-1]
top = triples[idxs[:num_results]]
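
The scoring rule can be checked by hand on a toy example. The sketch below uses made-up gains and triples, with the same (head, tail, relation) column order as above; an entity with no recorded gain contributes zero.

```python
import numpy as np

toy_triples = np.array([[0, 1, 0],
                        [1, 2, 1],
                        [2, 3, 0]])
toy_gains = {0: 0.10, 1: 0.30, 2: 0.05}  # node 3 has no recorded gain

# Score of a triple = gain of its head entity + gain of its tail entity.
scores = np.array([toy_gains.get(int(h), 0.0) + toy_gains.get(int(t), 0.0)
                   for h, t, _ in toy_triples])

# Positions sorted by descending score, then the two best triples.
order = np.argsort(scores)[::-1]
top_k = toy_triples[order[:2]]
print(top_k.tolist())
```

Note that the relation index plays no part in the score under this rule; two triples that share the same head and tail entities receive identical scores.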


Finally, we convert our top RDF-triples from numeric indices back to the original text for review. This can either be written out to a JSON file or returned directly as a Python dictionary object containing the entity and relation IDs, the score and the text for each identified triple.

In [7]:
results = {}

for i, trip in enumerate(top):
    triple_id = int(idxs[i])
    h, t, r = trip
    score = trip_scores[triple_id]
    label = " ".join([ents[h], rels[r], ents[t]])
    # store the full (head, tail, relation) index triple, not just the tail
    results[triple_id] = {'idxs': trip.tolist(), 'label': label, 'score': score}

for res in results.values():
    print("Score ({}):{}".format(res['score'], res['label']))

Score (0.0007605004881929781):influenza peak | influenza peaks be is increase
Score (0.0007266007786290531):local centrality higher larger than that | higher than that | lower than that
Score (0.0006923816430531934):simulation study | short simulation study in simulation study | short simulation study
Score (0.0006829411897315367):hong kong | hong kong sar of university hong kong | hong kong sar
Score (0.0006780930042352758):animal | animal 's phenotype | animal shelters | animal hosts become because of animal | animal 's phenotype | animal shelters | animal hosts
Score (0.000663488824713562):disease surveillance face challenges | logistical challenges
Score (0.0006520597275414555):new values | new value differ from nominal values
Score (0.0006287673753062097):exploitation be only strategy
Score (0.0006216245024218481):grid cells | 900 grid cells with asterisk
Score (0.0006173229661101804):doctor travel to hong kong | hong kong sar
Score (0.0006123284535794816):animal | animal 's pheno
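
The notebook stops at printing, so as a sketch of the JSON option mentioned above, the example below writes a results dictionary of the same shape to a file. The helper `to_serializable`, the file name `top_triples.json`, and the toy entry are all hypothetical; the conversion step is there because `json` cannot serialize numpy integers or arrays directly.

```python
import json
import os
import tempfile

import numpy as np

# Toy entry shaped like the notebook's results, with numpy types left in.
toy_results = {
    np.int64(7): {'idxs': np.array([0, 1, 0]),
                  'label': 'a rel b',
                  'score': np.float64(0.4)},
}


def to_serializable(results):
    """Convert numpy keys and values to plain Python types for json.dump."""
    return {
        int(key): {'idxs': [int(i) for i in val['idxs']],
                   'label': val['label'],
                   'score': float(val['score'])}
        for key, val in results.items()
    }


out_path = os.path.join(tempfile.mkdtemp(), 'top_triples.json')
with open(out_path, 'w') as fh:
    json.dump(to_serializable(toy_results), fh, indent=2)

with open(out_path) as fh:
    loaded = json.load(fh)
print(loaded)
```

JSON object keys are always strings, so the triple IDs come back as `'7'` rather than `7` when the file is read again.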