# Knowledge Graph Embeddings using the Biomedical Wikidata knowledge graph

One way to mine KGs and fill in their missing information is to use Knowledge Graph Embedding (KGE) techniques. To apply KGE methods to KG data, you can use an existing framework such as DGL-KE, which lets you train on your own data with a choice of KGE models and adjustable training parameters. For the Assignment 2 solution, we will use the DGL-KE package (https://github.com/awslabs/dgl-ke) to embed the biomedical subgraph of the Wikidata KG, generating vector space representations of its entities and relations.

Let's start by installing the DGL-KE package and its requirements to set up the environment in which the package can run.

In [None]:
!sudo pip3 install dgl-cu101==0.4.3.post2
!git clone  https://github.com/awslabs/dgl-ke.git
!pushd dgl-ke;cd python;sudo python3 setup.py install;
!pip3 uninstall torch -y
!sudo pip3 install torch==1.5.0+cu101 torchvision==0.6.0+cu101
!sudo pip3 install ogb

# Knowledge Graph Embeddings with DGL-KE
 
The provided dataset already contains train and test sets, so we can skip the data preparation step that would create the train, test and valid sets. To generate KG embeddings without an evaluation step, DGL-KE needs only the training set. Optionally, we can add test and valid sets to measure model performance with the ranking metrics (MRR and Hit@K), which gives an idea of the quality of the resulting embedding model.
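
As a rough illustration of the two ranking metrics, the sketch below computes MRR and Hit@K from a list of ranks. The function names and the example ranks are made up for this illustration and are not part of DGL-KE; each rank is assumed to be the 1-based position of the correct entity among all scored candidates.

```python
def mrr(ranks):
    """Mean reciprocal rank: the average of 1/rank over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hit_at_k(ranks, k):
    """Fraction of test triples whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct tail entity for four test triples.
ranks = [1, 2, 4, 20]
print("MRR:", mrr(ranks))              # average of 1, 1/2, 1/4 and 1/20
print("Hit@10:", hit_at_k(ranks, 10))  # three of the four ranks are <= 10
```

Both metrics lie between 0 and 1, and higher values indicate that correct entities tend to appear nearer the top of the ranking.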

DGL-KE provides a set of commands to run the training, evaluation and prediction steps for KGE. Each command accepts command line parameters with different options. For example, the model_name parameter selects one of the {TransE, TransE_l1, TransE_l2, TransR, RESCAL, DistMult, ComplEx, RotatE, SimplE} embedding models for training.
Detailed explanations of the DGL-KE training command line parameters are given below:
*   ***dglke_train*** trains KG embeddings on CPUs or GPUs in a single machine and saves the trained node embeddings and relation embeddings on disks.
*   ***--model_name*** {TransE, TransE_l1, TransE_l2, TransR, RESCAL, DistMult, ComplEx, RotatE, SimplE} The models provided by DGL-KE.
*   ***--data_path DATA_PATH*** The path of the directory where DGL-KE loads knowledge graph data.
*   ***--dataset DATA_SET*** The name of the knowledge graph stored under data_path. If it is one of the builtin knowledge graphs such as FB15k, FB15k-237, wn18, wn18rr, and Freebase, DGL-KE will automatically download the knowledge graph and keep it under data_path.
*   ***--format FORMAT*** The format of the dataset. For builtin knowledge graphs, the format is determined automatically. For users' own knowledge graphs, it needs to be raw_udd_{htr} or udd_{htr}. raw_udd_ indicates that the user's data uses raw IDs for entities and relations, and udd_ indicates that the user's data uses KGE IDs. {htr} indicates the location of the head entity, tail entity and relation in a triplet. For example, htr means the head entity is the first element in the triplet, the tail entity is the second element and the relation is the last element.
*   ***--data_files [DATA_FILES ...]*** A list of data file names. This is required when users train KGE on their own datasets. If the format is raw_udd_{htr}, users need to provide train_file [valid_file] [test_file]. If the format is udd_{htr}, users need to provide entity_file relation_file train_file [valid_file] [test_file]. In both cases, valid_file and test_file are optional.
*   ***--neg_sample_size NEG_SAMPLE_SIZE*** The number of negative samples we use for each positive sample in the training.
*   ***--hidden_dim HIDDEN_DIM*** The embedding size of relations and entities.
*   ***-g GAMMA*** or ***--gamma GAMMA*** The margin value in the score function. It is used by TransX and RotatE.
*   ***--max_step MAX_STEP*** The maximal number of steps to train the model in a single process. A step trains the model with a batch of data. In the case of multiprocessing training, the total number of training steps is MAX_STEP * NUM_PROC.
*   ***--lr LR*** The learning rate. DGL-KE uses Adagrad to optimize the model parameters.
*   ***--batch_size BATCH_SIZE*** The batch size for training.

For more command explanations: https://github.com/awslabs/dgl-ke/tree/master/docs/source 
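
To make the column-order suffix of the --format parameter concrete, here is a minimal sketch of how a line in a triple file could be read according to that suffix. The helper parse_triple and the identifiers in the example are hypothetical and only illustrate the convention; this is not how DGL-KE itself parses files.

```python
def parse_triple(line, fmt="raw_udd_hrt", sep="\t"):
    """Split one line of a triple file and return (head, relation, tail).

    The last part of fmt gives the column order, e.g. 'hrt' means
    head, relation, tail and 'htr' means head, tail, relation.
    """
    order = fmt.split("_")[-1]
    if sorted(order) != ["h", "r", "t"]:
        raise ValueError("unexpected column order: " + order)
    cols = line.rstrip("\n").split(sep)
    # Map each role letter to the column found at its position.
    by_role = {role: cols[i] for i, role in enumerate(order)}
    return by_role["h"], by_role["r"], by_role["t"]


# Hypothetical identifiers, for illustration only.
print(parse_triple("disease_1\ttreated_by\tdrug_1", "raw_udd_hrt"))
print(parse_triple("disease_1\tdrug_1\ttreated_by", "raw_udd_htr"))
```

Both calls return the same (head, relation, tail) tuple, because the format string tells the reader where each element sits in the line.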


In [None]:
%%time
!DGLBACKEND=pytorch dglke_train --model_name SimplE --dataset wikibio --batch_size 1000 --log_interval 100 \
--neg_sample_size 200 --regularization_coef=1e-9 --hidden_dim 400 --gamma 19.9 \
--lr 0.25 --batch_size_eval 16 --gpu 0 -adv --max_step 25000  --save_path /content \
--data_path /content --format raw_udd_hrt --data_files /content/biomedical_kg_train.txt --neg_sample_size_eval 10000 


# KG Completion / Inference
 
The DGL-KE package accepts candidate head (h), relation (r) and tail (t) elements as files with the .list extension in order to estimate how likely missing triples in the KG are. All combinations of these elements are treated as candidate triples, and a score is calculated for each one. The score assigned to a candidate triple depends on the embedding model chosen. For example, in TransE the learnt embedding of (h+r) should be close to the learnt embedding of t if the triple is positive (correct), and this distance increases for negative (false) triples. TransE measures the distance between (h+r) and t with the L1 or L2 norm. To obtain more robust predictions, we recommend generating prediction scores with several embedding methods and selecting the best model based on an evaluation dataset.
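
The TransE intuition can be sketched with NumPy as follows. The embeddings are toy values made up for illustration, and the function transe_distance is a hypothetical helper rather than a DGL-KE API. The score is taken here as gamma minus the distance, a common way to write the TransE score function, so that smaller distances give higher scores.

```python
import numpy as np


def transe_distance(h, r, t, norm=1):
    """Distance between (h + r) and t; norm=1 gives L1, norm=2 gives L2."""
    return np.linalg.norm(h + r - t, ord=norm)


# Toy 3-dimensional embeddings, made up for illustration.
h = np.array([0.25, 0.5, 0.0])
r = np.array([0.25, 0.0, 0.5])
t_good = np.array([0.5, 0.5, 0.5])   # equal to h + r, so the distance is 0
t_bad = np.array([-0.5, 0.0, 1.5])   # far away from h + r

gamma = 19.9  # margin, the same value passed as --gamma in the training command
for name, t in [("plausible", t_good), ("implausible", t_bad)]:
    d = transe_distance(h, r, t, norm=1)
    print(name, "distance:", d, "score:", gamma - d)
```

The plausible tail receives the higher score because its embedding coincides with h + r, while the implausible tail is penalised by its larger distance.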

For the KG completion task, we have provided different test triples for each group. As mentioned above, we will generate the head file from the unique subjects of the triples in the test group file. As described in the Assignment 2 instructions, we would like to find possible new disease-drug links with the relation "drug or therapy used for treatment (P2176)", so our relation file will contain the P2176 relation. To score each drug, we will generate the tail file from all unique drugs in the KG training set.

In [None]:
import pandas as pd

drug_rel = "<http://www.wikidata.org/prop/direct/P2176>"
cols = ['subject', 'predicate', 'object', '_']
df = pd.read_csv("/content/test_group1.txt", names=cols, delimiter="\t")
df_all = pd.read_csv("/content/biomedical_kg_train.txt", names=cols, delimiter="\t")

# tail: all unique drugs, i.e. objects of P2176 triples in the training set
# regex=False matches the IRI literally instead of treating it as a regular expression
df_d = df_all[df_all['predicate'].str.contains(drug_rel, regex=False)]
df_t = df_d['object'].unique()

# head: unique subjects of the test group triples
df_h = df['subject'].unique()
with open("/content/head.list", "w") as fl:
    for h in df_h:
        fl.write(h + "\n")

# rel: unique predicates of the test group triples
df_r = df['predicate'].unique()
with open("/content/rel.list", "w") as fl:
    for r in df_r:
        fl.write(r + "\n")

# tail: one drug per line
with open("/content/tail.list", "w") as fl:
    for t in df_t:
        fl.write(t + "\n")

For the KG completion task, DGL-KE provides the dglke_predict command, which evaluates all combinations of the elements in head.list, rel.list and tail.list. According to their distance, it assigns a score to each possible triple combination. Besides the head, rel and tail list files, we must also provide the trained model and the entity and relation mapping files. Since we want to calculate the Hit@10 metric of the predictions, we limit the output to the top 10 results.

In [None]:
!DGLBACKEND=pytorch dglke_predict --model_path /content/SimplE_wikibio_0 --format 'h_r_t' --gpu 0 --entity_mfile entities.tsv --rel_mfile relations.tsv --data_files head.list rel.list tail.list --raw_data --score_func logsigmoid --topK 10 --exec_mode 'batch_head'


In [None]:
!cat result.tsv

We can min-max normalize the scores in the results to the range 0 to 1 to obtain confidence scores.

In [None]:
import numpy as np
new_rows = []
df_pred = pd.read_csv("/content/result.tsv", names=['subject', 'predicate', 'object', 'score'], delimiter="\t")
# The first row of result.tsv is a header, so skip it before converting to float
all_scores = df_pred["score"][1:].astype(float)
# Compute the bounds once instead of on every iteration
min_score, max_score = np.min(all_scores), np.max(all_scores)
with open("/content/result.tsv", "r") as fl:
    next(fl)  # skip the header row
    for r in fl:
        splitted = r.rstrip("\n").split("\t")
        # Min-max normalisation of the raw score
        normalised = (float(splitted[3]) - min_score) / (max_score - min_score)
        new_rows.append("{}\t{}\t{}\t{}\n".format(splitted[0], splitted[1], splitted[2], normalised))
with open("/content/normalised_result.tsv", "w") as fl:
    for row in new_rows:
        fl.write(row)  # each row already ends with a newline

!cat normalised_result.tsv

## Hit@10 Score Table
To calculate the Hit@10 metric of the top 10 predictions, we check each triple in our test group for a match among the top 10 predicted results. If a match is found, the Hit@10 score for that triple is 1; otherwise it is 0. Repeating this for every triple in the test group and averaging the per-triple scores gives the final Hit@10 score.
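
The procedure described above can be sketched as a small function. The name hit_at_k_score and the identifiers in the example are hypothetical; in this notebook the test triples would come from the test group file and the predictions from the top 10 rows of result.tsv.

```python
def hit_at_k_score(test_triples, predicted_triples):
    """Average of per-triple hits: 1 if a test triple is among the predictions, else 0."""
    predicted = set(predicted_triples)
    hits = [1 if triple in predicted else 0 for triple in test_triples]
    return sum(hits) / len(hits)


# Hypothetical identifiers, for illustration only.
test_triples = [("disease_1", "P2176", "drug_a"),
                ("disease_2", "P2176", "drug_b")]
top10 = [("disease_1", "P2176", "drug_a"),
         ("disease_1", "P2176", "drug_c"),
         ("disease_2", "P2176", "drug_d")]
print(hit_at_k_score(test_triples, top10))  # one of the two test triples is matched
```

With two test triples, the only possible scores are 0, 0.5 and 1, which is why small test groups produce coarse values in a score table.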

Test Group | TransE | ComplEx | SimplE
--- | --- | --- | --- 
Test Group 1 | **0.5** |**0**|**0.5**
Test Group 2 | **0.5** |**0.5** |**0**
Test Group 3 | **0.33** |**0.33**|**0.33**
Test Group 4 | **0** |**0**|**0**
Test Group 5 | **0** |**0** |**0**
Test Group 6 | **0** |**0**|**0**
Test Group 7 | **0** |**0**|**0**
Test Group 8 | **0** |**0**|**0**
Test Group 9 | **0** |**0**|**0**
Test Group 10 | **0** |**0**|**0**

The models that we selected produced different Hit@10 scores across the test groups. These scores might be improved by choosing other embedding models that are a better fit for the training data.
