# Walkthrough for Knowledge Graph Testing
This notebook outlines the basic process of submitting queries to MULTIVAC's semantic knowledge graph. 
First, we set up the required imports and arguments for the test. The whole process can also be run in one step from the command line:<br><br>
`python3 map_queries.py -d data -o models -g models/glove.42B.300d -r model -m transe`<br><br>
(`threshold` and `num_top_rel` default to 0.1 and 10 respectively, but can also be set at the command line with the flags `-t` and `-n`.)
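
As a rough illustration of how the flags above could map onto the `args_dict` keys used later in this notebook, here is a minimal `argparse` sketch. The short flags and the two defaults come from the text above; the long option names are assumptions that simply mirror the `args_dict` keys, and `build_parser` is a hypothetical helper, not the actual parser in `map_queries.py`.

```python
import argparse

def build_parser():
    # Hypothetical reconstruction: short flags are taken from the command
    # shown above; long names are assumed to match the args_dict keys.
    parser = argparse.ArgumentParser(description="Sketch of map_queries.py arguments")
    parser.add_argument("-d", "--dir", required=True)      # data directory
    parser.add_argument("-o", "--out", required=True)      # model output directory
    parser.add_argument("-g", "--glove", required=True)    # GloVe vectors path
    parser.add_argument("-r", "--run", required=True)      # run mode
    parser.add_argument("-m", "--model", required=True)    # embedding model name
    parser.add_argument("-t", "--threshold", type=float, default=0.1)
    parser.add_argument("-n", "--num_top_rel", type=int, default=10)
    return parser

# Parse the same arguments as the example command, passed as a list
example_args = vars(build_parser().parse_args(
    ["-d", "data", "-o", "models", "-g", "models/glove.42B.300d",
     "-r", "model", "-m", "transe"]))
print(example_args)
```

Leaving out `-t` and `-n` falls back to the defaults, which is why the dictionary built by hand in the next cells lists them explicitly.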

In [1]:
from multivac.src.rdf_graph.map_queries import *
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

In [2]:
args_dict = {'dir': 'data',
             'out': 'models',
             'glove': 'models/glove.42B.300d',
             'run': 'model',
             'model': 'transe',
             'threshold': 0.1,
             'num_top_rel': 10}

args_dict['search'] = '/Users/ben_ryan/Documents/DARPA_ASKE/Phase_II/openke_experiments/query_tester.csv'

Next, we load the previously trained knowledge graph embedding model. This embedding model allows us to assign probabilities to missing nodes or relationships that submitted queries propose for the knowledge graph. Here we use TransE, an approach that models relationships as translations operating on the low-dimensional embeddings of entities.
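
To make the translation idea concrete, the sketch below scores a triple by the distance between `head + relation` and `tail`, so a smaller value means a more plausible triple. It uses plain Python lists and invented three-dimensional vectors; `transe_score` is a hypothetical helper for illustration and is not the scoring code used by the model loaded in the next cell.

```python
def transe_score(head, relation, tail):
    """L1 distance between (head + relation) and tail; lower is more plausible."""
    return sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))

# Toy embeddings, invented for illustration only
h = [1.0, 0.0, 2.0]
r = [0.5, 1.0, -1.0]
t_good = [1.5, 1.0, 1.0]   # exactly h + r, so the distance is zero
t_bad = [-2.0, 3.0, 0.0]   # far from h + r

score_good = transe_score(h, r, t_good)
score_bad = transe_score(h, r, t_bad)
print(score_good, score_bad)
```

Ranking candidate tails (or heads, or relations) by this distance is what lets a translation-based model propose completions for a partial triple.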

In [3]:
con = config.Config()
con.set_in_path(args_dict['dir']+os.path.sep)
con.set_work_threads(8)
con.set_dimension(100)
con.set_test_link_prediction(True)
con.set_test_triple_classification(True)

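# Locate the most recently saved model in the output directory. The saved
# file names follow the pattern 'model.vec.<timestamp>.tf', so the
# timestamp is the third dot-separated field.
files = glob.glob(os.path.join(args_dict['out'], '*tf*'))
times = list(set([f.split('.')[2] for f in files]))
ifile = max([datetime.strptime(x, '%d%b%Y-%H:%M:%S') for x in times]).strftime('%d%b%Y-%H:%M:%S')
con.set_import_files(os.path.join(args_dict['out'], 'model.vec.{}.tf'.format(ifile)))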

con.init()
kem = set_model_choice(args_dict['model'])
con.set_model(kem)


files = [x for x in os.listdir(con.in_path) if '2id' in x]
rel_file = get_newest_file(con.in_path, files, 'relation')
ent_file = get_newest_file(con.in_path, files, 'entity')
trn_file = get_newest_file(con.in_path, files, 'train')

entities = pd.read_csv(ent_file, sep='\t', 
                       names=["Ent","Id"], skiprows=1)
relations = pd.read_csv(rel_file, sep='\t', 
                        names=["Rel","Id"], skiprows=1)
train = pd.read_csv(trn_file, sep='\t', 
                    names=["Head","Tail","Relation"], skiprows=1)

We then set up our NLP toolset: a wrapper around the Stanford CoreNLP parsing engine and a GloVe embedding model. We use the large-scale, pre-trained GloVe embedding model because submitted questions may be open-domain.

In [4]:
annots =  "tokenize ssplit pos depparse natlog openie ner coref",
props  = {"openie.triple.strict": "true",
          "openie.openie.resolve_coref": "true"}

parser = StanfordParser(annots=annots, props=props)

glove_vocab, glove_emb = load_word_vectors(args_dict['glove'])


==> File found, loading to memory


Finally, we load the query file and parse the queries into their semantic components. These are then matched to the knowledge graph as complete Subject-Relation-Object triples, or as partial triples with imputed portions. Complete triples return one result each, while partials will return up to `num_top_rel` results each.
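
The role of `num_top_rel` and `threshold` can be pictured with the small sketch below. It assumes that higher scores are better and that `threshold` acts as a minimum score; the real `get_answers` may apply these parameters differently. `select_answers` and the candidate tuples are hypothetical and exist only for illustration.

```python
def select_answers(candidates, num_top_rel=10, threshold=0.1):
    """Keep at most num_top_rel candidates whose score meets the threshold.

    ASSUMPTION: each candidate is (subject, relation, object, score), higher
    scores are better, and threshold is a minimum acceptable score.
    """
    kept = [c for c in candidates if c[3] >= threshold]
    kept.sort(key=lambda c: c[3], reverse=True)
    return kept[:num_top_rel]

# Invented candidates for a partial triple with a missing object
candidates = [
    ("virus", "spreads via", "surfaces", 0.12),
    ("virus", "spreads via", "droplets", 0.31),
    ("virus", "spreads via", "water", 0.04),
]

top = select_answers(candidates, num_top_rel=2, threshold=0.1)
for answer in top:
    print(answer)
```

With the defaults of 10 and 0.1, the third candidate would still be dropped because its score falls below the threshold.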

In [5]:
queries = pd.read_csv(args_dict['search'])
for q in queries.Query:
    print(q)
    
parse = lambda z: stanford_parse(parser, z, sub_rdfs=True).get_rdfs(use_tokens=False, 
                                                                    how='longest')
triples = queries.Query.apply(parse)
results = triples.apply(lambda x: get_answers(con, x, 
                                              glove_vocab, 
                                              glove_emb, 
                                              entities, 
                                              relations,
                                              args_dict['num_top_rel'], 
                                              args_dict['threshold']))

What role do asymptomatically infected individuals play in transmission dynamics?
How are power converters traded at Tosche Station?


In [6]:
for result in results:
    print(result)

[('infected individuals', 'should play', 'critical role', 0.1093660369515419)]
[('power grids', 'trade', 'station | meteorological station', 0.12902237474918365)]
