<center><h1> ComplEx - Variable Negative Training Samples </h1></center>

# Open Graph Benchmark Library (OGBL) - BioKG

<b>Graph</b>: The ogbl-biokg dataset is a Knowledge Graph (KG), which we created using data from a large number of biomedical data repositories. It contains 5 types of entities: diseases (10,687 nodes), proteins (17,499 nodes), drugs (10,533 nodes), side effects (9,969 nodes), and protein functions (45,085 nodes). There are 51 types of directed relations connecting two types of entities, including 39 kinds of drug-drug interactions and 8 kinds of protein-protein interactions, as well as drug-protein, drug-side effect, and function-function relations. All relations are modeled as directed edges, among which the relations connecting the same entity types (e.g., protein-protein, drug-drug, function-function) are always symmetric, i.e., the edges are bi-directional.

This dataset is relevant to both biomedical and fundamental ML research. On the biomedical side, the dataset allows us to get better insights into human biology and generate predictions that can guide downstream biomedical research. On the fundamental ML side, the dataset presents challenges in handling a noisy, incomplete KG with possibly contradictory observations. This is because the ogbl-biokg dataset involves heterogeneous interactions that span from the molecular scale (e.g., protein-protein interactions within a cell) to whole populations (e.g., reports of unwanted side effects experienced by patients in a particular country). Further, triplets in the KG come from sources with a variety of confidence levels, including experimental readouts, human-curated annotations, and automatically extracted metadata.

<b>Prediction task</b>: The task is to predict new triplets given the training triplets. The evaluation protocol is exactly the same as ogbl-wikikg2, except that here we only consider ranking against entities of the same type. For instance, when corrupting head entities of the protein type, we only consider negative protein entities.
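
As a rough illustration of this ranking protocol, the sketch below computes the reciprocal rank of a positive triplet against the scores of its (same-type) negatives and averages it over several triplets to obtain a mean reciprocal rank (MRR). The function names and toy scores are hypothetical, and ties are assumed to be resolved optimistically; OGB's own `Evaluator` remains the reference for reported numbers and may handle ties differently.

```python
def reciprocal_rank(pos_score, neg_scores):
    """Reciprocal rank of the positive among its negatives (higher score = better).

    Assumption: ties are resolved optimistically, i.e. only strictly
    higher-scoring negatives push the positive down the ranking.
    """
    rank = 1 + sum(1 for s in neg_scores if s > pos_score)
    return 1.0 / rank


def mean_reciprocal_rank(pos_scores, neg_score_lists):
    """Average reciprocal rank over all (positive, negatives) pairs."""
    rr = [reciprocal_rank(p, negs) for p, negs in zip(pos_scores, neg_score_lists)]
    return sum(rr) / len(rr)


# Toy scores, made up for illustration only
pos_scores = [3.0, 0.5]
neg_score_lists = [[-0.7, 1.2, 2.9], [0.9, 0.1, 0.7]]
mrr = mean_reciprocal_rank(pos_scores, neg_score_lists)
```

In the toy data, the first positive outranks all of its negatives while the second is beaten by two of them, so the two reciprocal ranks are averaged into a single score.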

<b>Dataset splitting</b>: For this dataset, we adopt a random split. While splitting the triplets according to time is an attractive alternative, we note that it is incredibly challenging to obtain accurate information as to when individual experiments and observations underlying the triplets were made. We strive to provide additional dataset splits in future versions of the OGB.



# Installing libraries (OGB & TorchKGE)

The OGB (Open Graph Benchmark) library is used in this notebook to retrieve the knowledge graph and to evaluate our model.<br/>
The TorchKGE library is used to create our model.

In [1]:
# Installing OGB to download the dataset
!pip install ogb

# Getting TorchKGE library
!git clone https://github.com/torchkge-team/torchkge.git
!mv /content/torchkge /content/torchkge_repo
!mv /content/torchkge_repo/torchkge /content/torchkge
!pip install -r /content/torchkge_repo/requirements_dev.txt

Collecting ogb
  Downloading https://files.pythonhosted.org/packages/d2/c5/20b1e4a5ff90ead06139ce1c2362474b97bb3a73ee0166eb37f2d3eb0dba/ogb-1.3.1-py3-none-any.whl (67kB)
Collecting outdated>=0.2.0
  Downloading https://files.pythonhosted.org/packages/fd/f6/95588d496e518355c33b389222c99069b1c6f2c046be64f400072fdc7cda/outdated-0.2.1-py3-none-any.whl
Collecting littleutils
  Downloading https://files.pythonhosted.org/packages/4e/b1/bb4e06f010947d67349f863b6a2ad71577f85590180a935f60543f622652/littleutils-0.2.2.tar.gz
Bui

# Imports

In [2]:
# General Purpose imports
import os
import sys
import json
import time
import tqdm
from copy import copy

# Math / Data structures imports
import math
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

# PyTorch imports
import torch                                                # PyTorch
from torch import cuda                                      # CUDA device
from torch.optim import Adam                                # Adam optimizer
from torch.utils.data import DataLoader

# OGB imports
from ogb.linkproppred import Evaluator                      # Evaluates model
from ogb.linkproppred.dataset import LinkPropPredDataset    # Loads KG 

# TorchKGE imports
import torchkge
from torchkge.utils.data import DataLoader as KGEDataLoader  # Loads batches
from torchkge.utils.losses import MarginLoss                # Loss function
from torchkge.models.bilinear import ComplExModel           # KGE Model
from torchkge.data_structures import KnowledgeGraph         # KG object
from torchkge.sampling import BernoulliNegativeSampler      # Corrupts triplets

# Accessing data

In [3]:
dataset = LinkPropPredDataset(name = 'ogbl-biokg')

Downloading http://snap.stanford.edu/ogb/data/linkproppred/biokg.zip


Downloaded 0.90 GB: 100%|██████████| 920/920 [13:28<00:00,  1.14it/s]


Extracting dataset/biokg.zip
Loading necessary files...
This might take a while.


100%|██████████| 1/1 [00:00<00:00, 2647.92it/s]

Processing graphs...
Saving...





In [4]:
# Getting the dataset train-val-test split
split_set = dataset.get_edge_split()
train_set, valid_set, test_set = split_set["train"], split_set["valid"], split_set["test"]

# Re-indexing the dataset from (type, index) pairs to a single global index


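def type_and_id_2_new_id(typ, local_id):
  """Maps a (type, per-type index) pair to a single global entity index."""
  # Each type gets a contiguous block, offset by the sizes of the preceding types:
  # disease (10,687), protein (17,499), drug (10,533), side effect (9,969), function (45,085)
  base = 0

  if 'dis' in typ:
    base = 0
  elif 'prot' in typ:
    base = 10_687
  elif 'drug' in typ:
    base = 10_687 + 17_499
  elif 'effe' in typ:
    base = 10_687 + 17_499 + 10_533
  elif 'fun' in typ:
    base = 10_687 + 17_499 + 10_533 + 9_969

  return base + local_id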

def trainset_to_newset(train_set):
  tmp_set = []

  for ht, h, r, tt, t in zip(train_set['head_type'], train_set['head'], train_set['relation'], train_set['tail_type'], train_set['tail']):
    head_id = type_and_id_2_new_id(ht, h)
    tail_id = type_and_id_2_new_id(tt, t)
  
    tmp_set.append([head_id, r, tail_id])
  
  return pd.DataFrame(np.array(tmp_set), columns=['head', 'relation', 'tail'])

def valset_to_newset(val_set):
  tmp_set = []

  for ht, h, nhs, r, tt, t, nts in zip(val_set['head_type'], val_set['head'], val_set['head_neg'], val_set['relation'], val_set['tail_type'], val_set['tail'], val_set['tail_neg']):
    h_id = type_and_id_2_new_id(ht, h)
    t_id = type_and_id_2_new_id(tt, t)

    nh_ids, nt_ids = [], []
    for nh, nt in zip(nhs, nts):
      nh_id = type_and_id_2_new_id(ht, nh)
      nt_id = type_and_id_2_new_id(tt, nt)

      nh_ids.append(nh_id)
      nt_ids.append(nt_id)

    tmp_set.append([h_id, nh_ids, r, t_id, nt_ids])
  
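  # Build the frame from the list itself: rows mix ints and lists, and np.array() on such ragged rows may fail on recent NumPy versions
  return pd.DataFrame(tmp_set, columns=['head', 'head_neg', 'relation', 'tail', 'tail_neg'])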
  
train_set = trainset_to_newset(train_set)
valid_set = valset_to_newset(valid_set)
test_set = valset_to_newset(test_set)



          head  relation   tail
0         1718         0  13894
1         4903         0  24349
2         5480         0  26686
3         3148         0  17934
4        10300         0  26889
...        ...       ...    ...
4762673  24553        50  22370
4762674  21512        50  11344
4762675  23345        50  22686
4762676  22715        50  13169
4762677  12783        50  12780

[4762678 rows x 3 columns]
Number of distinct entities:  93773
Number of distinct relations:  51
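
The entity total printed above agrees with the sum of the per-type node counts given in the dataset description. As a minimal sketch, the offsets hard-coded in the re-indexing function can also be derived from those counts, together with an inverse lookup from a global index back to its type. The type names, constants, and helper functions below are illustrative assumptions rather than part of OGB or TorchKGE.

```python
from bisect import bisect_right
from itertools import accumulate

# Node counts from the dataset description, in the order assumed by the
# re-indexing code above. The type names are assumptions for illustration.
ENTITY_COUNTS = [
    ("disease", 10_687),
    ("protein", 17_499),
    ("drug", 10_533),
    ("sideeffect", 9_969),
    ("function", 45_085),
]

names = [name for name, _ in ENTITY_COUNTS]
sizes = [size for _, size in ENTITY_COUNTS]

# Start offset of each type's block: the running total of the preceding sizes
starts = [0] + list(accumulate(sizes))[:-1]
OFFSETS = dict(zip(names, starts))
TOTAL_ENTITIES = sum(sizes)


def to_global_id(entity_type, local_id):
    """Map a (type, per-type index) pair to a single global index."""
    return OFFSETS[entity_type] + local_id


def to_type_and_local_id(global_id):
    """Inverse mapping: recover (type, per-type index) from a global index."""
    pos = bisect_right(starts, global_id) - 1
    return names[pos], global_id - starts[pos]
```

For instance, the tail of the first training triplet (global index 13894) maps back to the protein block under this scheme, while its head (1718) lies in the disease block.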


# Loading the model

In [7]:
from google.colab import drive

# Mounting drive to store the best weights
drive.mount('/content/drive')

Mounted at /content/drive


In [33]:
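# The checkpoint holds the whole pickled model object, so torch.load returns the model itself
model = torch.load('/content/drive/MyDrive/best_model_weights.pth')

# Move the model to the GPU when one is available
if cuda.is_available():
  model.cuda()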

# Real Fact Evaluation

In this example test, we go through all training facts (h, t, r), corrupt them into (h', t', r), and count how often the true fact scores higher than its corrupted version, i.e. f(h, t, r) > f(h', t', r). The resulting fraction acts as a rough training accuracy.
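
The counting step can be sketched as follows, assuming the scores of the true facts and of their corrupted counterparts have already been collected into two equally long lists. The helper name and the toy scores are hypothetical; in the notebook the scores would come from the model.

```python
def pairwise_accuracy(pos_scores, neg_scores):
    """Fraction of facts whose true score strictly exceeds the corrupted score."""
    if len(pos_scores) != len(neg_scores):
        raise ValueError("pos_scores and neg_scores must have the same length")
    if not pos_scores:
        raise ValueError("at least one score pair is required")
    wins = sum(1 for p, n in zip(pos_scores, neg_scores) if p > n)
    return wins / len(pos_scores)


# Toy scores, made up for illustration only
pos = [3.07, 1.2, -0.4, 0.9]
neg = [-0.74, 1.5, -1.0, 0.9]
acc = pairwise_accuracy(pos, neg)
```

A tie counts as a miss here, which is one possible convention; a non-strict comparison would be an equally defensible choice.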

In [10]:
df_train

Unnamed: 0   head  relation   tail
         0   1718         0  13894
         1   4903         0  24349
         2   5480         0  26686
         3   3148         0  17934
         4  10300         0  26889
       ...    ...       ...    ...
   4762673  24553        50  22370
   4762674  21512        50  11344
   4762675  23345        50  22686
   4762676  22715        50  13169


# Checking that the first triplet in the training set scores higher than a corrupted triplet

In [58]:

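# Use the GPU when available, consistent with how the model was loaded
device = torch.device('cuda' if cuda.is_available() else 'cpu')

# First training triplet: (head=1718, relation=0, tail=13894)
head = torch.LongTensor([1718]).to(device)
tail = torch.LongTensor([13894]).to(device)
relations = torch.LongTensor([0]).to(device)

# Corrupted triplet: same head, tail replaced by entity 1
negative_heads = torch.LongTensor([1718]).to(device)
negative_tails = torch.LongTensor([1]).to(device)

# Argument order as used in this notebook; other TorchKGE releases may expect a different order
y_pos, y_neg = model(
    head,
    tail,
    negative_heads,
    negative_tails,
    relations
)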

In [59]:
y_pos.item(), y_neg.item()

(3.067988395690918, -0.7393243908882141)
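
Note that the corrupted tail used in this check (entity 1) falls in the disease index range, whereas the true tail (13894) lies in the protein range, so the comparison is easier than the ogbl-biokg protocol, which only ranks against entities of the same type. A minimal sketch of drawing a type-consistent negative is given below; the type names, range table, and helper functions are illustrative assumptions built from the node counts in the dataset description. The sampler only excludes the true entity itself, not other entities that may also form true triplets.

```python
import random

# Half-open global index ranges [start, end) per entity type, following the
# offsets used in the re-indexing step. Type names are assumptions.
TYPE_RANGES = [
    ("disease", 0, 10_687),
    ("protein", 10_687, 28_186),
    ("drug", 28_186, 38_719),
    ("sideeffect", 38_719, 48_688),
    ("function", 48_688, 93_773),
]


def type_range(global_id):
    """Return (type name, start, end) of the block containing global_id."""
    for name, start, end in TYPE_RANGES:
        if start <= global_id < end:
            return name, start, end
    raise ValueError(f"global_id {global_id} is outside every entity range")


def sample_same_type_negative(true_id, rng):
    """Draw a uniform random entity of the same type as true_id, excluding true_id."""
    _, start, end = type_range(true_id)
    # Sample from a range that is one element shorter, then shift past true_id
    candidate = rng.randrange(start, end - 1)
    if candidate >= true_id:
        candidate += 1
    return candidate


rng = random.Random(0)  # fixed seed for reproducibility
negative_tail = sample_same_type_negative(13894, rng)
```

The sampled index could then replace the hard-coded `1` in `negative_tails` to make the check consistent with the same-type ranking rule.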