# Installation

Let's first import and install the required packages.

In [2]:
#@title Setup the device

use_gpu = True #@param {type:"boolean"}

if use_gpu:
  device = 'cuda:0'
else:
  device = 'cpu'
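If the runtime has no GPU attached, selecting `'cuda:0'` would typically fail later, when tensors are moved to the device. A minimal sketch of a more defensive choice follows. `pick_device` is a hypothetical helper, not part of the notebook, and it assumes PyTorch may or may not be importable at this point:

```python
def pick_device(use_gpu: bool) -> str:
    """Return 'cuda:0' only when it is requested and actually available."""
    if not use_gpu:
        return 'cpu'
    try:
        # Imported lazily so the helper also works before the import cell runs.
        import torch
    except ImportError:
        return 'cpu'
    return 'cuda:0' if torch.cuda.is_available() else 'cpu'


device = pick_device(use_gpu=True)
print(device)
```

With this variant the rest of the notebook can keep using `device` unchanged.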

In [3]:
#@title Import basic modules

import torch
from google.colab import output
import torch.nn.functional as F


In [4]:
#@title Install dependency

!pip install torch-scatter -f https://data.pyg.org/whl/torch-1.13.0+cu116.html
!pip install pickle5 pybind11
output.clear()

Next we can install the smore package. Because it contains highly optimized C++/CUDA code, it needs to be compiled first. Below are two options for installing it. We will go with the first option, which installs a pre-built wheel (`.whl`).

If this doesn't work (e.g., the machine is not compatible), you can try the second option and build from source, though this is much slower.
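The pre-built wheel used below is tagged `cp38-cp38-linux_x86_64`, i.e. CPython 3.8 on 64-bit Linux. A rough sketch of a compatibility check, using only the standard library, could look like this. `prebuilt_wheel_matches` is a hypothetical helper, and it only compares the interpreter version and platform, not CUDA or PyTorch versions:

```python
import platform
import sys


def prebuilt_wheel_matches(version=None, system=None, machine=None):
    """Check the running interpreter against the cp38 / linux_x86_64 wheel tag."""
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return version == (3, 8) and system == 'Linux' and machine == 'x86_64'


# True suggests Option 1 may work; False suggests trying Option 2 instead.
print(prebuilt_wheel_matches())
```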

In [5]:
#@title Option 1: Install smore from pre-built whl
!pip install https://snap.stanford.edu/logtutorial/smore-0.0.0-cp38-cp38-linux_x86_64.whl
output.clear()

In [None]:
#@title Option 2: Install smore from source (If you really want to do so, you can uncomment the code below to run)
# !pip install git+https://github.com/google-research/smore@wikikgv2
# output.clear()

Now we are ready to import **smore**.

In [6]:
import smore

# Loading a small KG: FB15k

First let's download a small KG from the web and unzip it.

In [7]:
!wget "https://snap.stanford.edu/logtutorial/FB15k-log.zip"
!unzip FB15k-log.zip
output.clear()

The **smore** package comes with a knowledge graph class for conveniently reading, writing, and querying large KGs. To use it, we first import the class **KGMem**, a shared-memory object that holds the KG's data.

In [8]:
from smore.cpp_sampler.sampler_clib import KGMem

The function below loads the pre-processed binary file into `kgmem`, and then creates a KG object that references this memory segment.

**Note:** KGMem creates a KG without copying the memory, so the KGMem object must be kept alive for the lifetime of the created KG.

In [9]:
def load_kg(kg_name):
  kgmem = KGMem(dtype='uint32')
  kgmem.load('%s/train_bidir.bin' % kg_name)
  kg = kgmem.create_kg()
  print('This KG has %d entities, %d edges on %d relations' % (
      kg.num_ent, kg.num_edges, kg.num_rel))
  return kgmem, kg

kgmem, kg = load_kg('FB15k-log')

This KG has 14951 entities, 966284 edges on 2690 relations
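Because the KG only references memory owned by the KGMem object, it can help to keep the two together so that neither is dropped by accident. A minimal sketch of that idea follows. `KGHandle` is a hypothetical container, not part of smore, and plain `object()` stand-ins are used so the snippet runs without smore installed:

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class KGHandle:
    """Hypothetical bundle that keeps the memory owner and its KG view together."""
    kgmem: Any  # owns the underlying memory
    kg: Any     # references kgmem's memory without copying it


# Stand-ins for the real objects; in the notebook these come from load_kg().
handle = KGHandle(kgmem=object(), kg=object())
print(handle.kg is not handle.kgmem)
```

In the notebook this would be used as `handle = KGHandle(*load_kg('FB15k-log'))`, so that holding `handle` is enough to keep both objects alive.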


Next we can use an example tuple of (head_entity, relation, tail_entity) to illustrate the query result in this KG.

In [10]:
head_entity = 834
tail_entity = 546
relation = 40

print(f'KG has edge {head_entity} --{relation}--> {tail_entity} ?',
      kg.has_forward_edge(head_entity, relation, tail_entity))

print(f'KG has edge {tail_entity} <--{relation}-- {head_entity} ?',
      kg.has_backward_edge(tail_entity, relation, head_entity))

KG has edge 834 --40--> 546 ? True
KG has edge 546 <--40-- 834 ? True
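To make the two lookups concrete, here is a toy, pure-Python stand-in for the edge queries above. `ToyKG` is hypothetical and only mimics the call signatures used in the notebook. It assumes the backward index is keyed by the tail entity, which may differ from how smore stores edges internally:

```python
from collections import defaultdict


class ToyKG:
    """Tiny in-memory KG with forward and backward adjacency indexes."""

    def __init__(self, triples):
        self.forward = defaultdict(set)   # (head, relation) -> set of tails
        self.backward = defaultdict(set)  # (tail, relation) -> set of heads
        for head, relation, tail in triples:
            self.forward[(head, relation)].add(tail)
            self.backward[(tail, relation)].add(head)

    def has_forward_edge(self, head, relation, tail):
        return tail in self.forward.get((head, relation), ())

    def has_backward_edge(self, tail, relation, head):
        return head in self.backward.get((tail, relation), ())


toy = ToyKG([(834, 40, 546)])
print(toy.has_forward_edge(834, 40, 546))   # True
print(toy.has_backward_edge(546, 40, 834))  # True
print(toy.has_forward_edge(546, 40, 834))   # False: the edge is directed
```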


# Sampling queries from KG

First let's import the sampler, which is responsible for generating positive/negative samples from the knowledge graph according to different query templates.

In [11]:
from smore.cpp_sampler.online_sampler import OnlineSampler
import time
from tqdm import tqdm

Then let's specify the hyperparameters of the sampler, and try a naive one first.

This basic sampler exhaustively explores the knowledge graph to obtain the positive and negative examples for each query template.

In [12]:
query_template = '2p' #@param {type:"string"}
num_negative_samples = 256 #@param {type:"integer"}
sampler_type = 'naive' #@param {type:"string"}
search_bandwidth = 14951 #@param {type:"integer"}
max_intermediate_entities = 14951 #@param {type:"integer"}

sample_mode = (search_bandwidth, 0, 'u', 'u', max_intermediate_entities)

sampler = OnlineSampler(kg, [query_template], num_negative_samples, sample_mode,
                        [1.0], sampler_type, share_negative=True,
                        same_in_batch=True, num_threads=1)


We can see it runs at a reasonable speed on this small KG.


In [13]:
batch_size = 10 #@param {type:"integer"}
num_batches = 10000 #@param {type:"integer"}

def test_sampler_speed(batch_gen):
  cur_time = time.time()
  for _ in tqdm(range(num_batches)):
    next(batch_gen)
  print('\n')
  total_time = time.time() - cur_time
  ns = batch_size * num_batches
  print(f'The {sampler_type} sampler takes {total_time} seconds for {ns} samples')

batch_gen = sampler.batch_generator(batch_size)
test_sampler_speed(batch_gen)

100%|██████████| 10000/10000 [00:07<00:00, 1373.19it/s]



The naive sampler takes 7.295474290847778 seconds for 100000 samples





Let's use a more efficient sampler, whose theoretical cost is supposed to be the square root of the naive one.

In [14]:
sampler_type = 'sqrt'
sampler = OnlineSampler(kg, [query_template], num_negative_samples, sample_mode,
                        [1.0], sampler_type, share_negative=True,
                        same_in_batch=True, num_threads=1)
batch_gen = sampler.batch_generator(batch_size)
test_sampler_speed(batch_gen)

100%|██████████| 10000/10000 [00:04<00:00, 2305.13it/s]



The sqrt sampler takes 4.350490093231201 seconds for 100000 samples





We can see it indeed runs faster with the improved sampling technique, but the gain is not significant on small KGs.
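The two timings printed above can be turned into throughput and a speedup factor with a few lines of arithmetic. The numbers below are copied from this particular run and will vary between machines:

```python
num_samples = 100_000  # batch_size * num_batches from the timing cell

naive_seconds = 7.295474290847778  # measured with sampler_type = 'naive'
sqrt_seconds = 4.350490093231201   # measured with sampler_type = 'sqrt'

naive_rate = num_samples / naive_seconds
sqrt_rate = num_samples / sqrt_seconds
speedup = naive_seconds / sqrt_seconds

print(f'naive: {naive_rate:.0f} samples/s')
print(f'sqrt:  {sqrt_rate:.0f} samples/s')
print(f'speedup: {speedup:.2f}x')
```

On FB15k this works out to a speedup of roughly 1.68x.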

If you are interested, you can try larger KGs.

In [29]:
#@title Optional: run the above study on larger KGs.

# !wget "https://snap.stanford.edu/logtutorial/ogbl-wikikg2-log.zip"
# !unzip ogbl-wikikg2-log.zip
# output.clear()
# kgmem, kg = load_kg('ogbl-wikikg2-log')

This KG has 2500604 entities, 32218364 edges on 1070 relations


Before taking a look at the model interface, let's first understand the samples returned by the sampler:

In [15]:
positive_sample, negative_sample, is_negative_mat, subsampling_weight, batch_queries, query_structures = next(batch_gen)

`batch_queries` is a batch of instantiations of the query template `2p`.

In this case each query is a tuple of (head_entity, relation-1, relation-2).

In [16]:
batch_queries[0]

tensor([13240,   342,   522])
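A `2p` query like the one above asks for all entities reachable from the head entity by following relation-1 and then relation-2. The sketch below answers such a query on a toy triple list. `answer_2p` and the toy triples are illustrative only and are not how smore computes answers internally:

```python
from collections import defaultdict


def answer_2p(triples, head, rel1, rel2):
    """Return entities reachable via head --rel1--> x --rel2--> answer."""
    out = defaultdict(set)  # (head, relation) -> set of tails
    for h, r, t in triples:
        out[(h, r)].add(t)

    answers = set()
    for intermediate in out.get((head, rel1), ()):
        answers |= out.get((intermediate, rel2), set())
    return answers


triples = [(0, 10, 1), (0, 10, 2), (1, 11, 3), (2, 11, 4), (2, 12, 5)]
print(sorted(answer_2p(triples, head=0, rel1=10, rel2=11)))  # [3, 4]
```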

`is_negative_mat` is a matrix of size (batch_size, num_negative_samples), where entry (i, j) indicates whether the j-th negative sample is a true negative example for query i.
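One plausible reason such a mask is needed: when negatives are shared across the batch (`share_negative=True`), a candidate drawn as a negative may actually be a correct answer for some of the queries. The sketch below builds a mask of this shape from plain Python sets. `build_is_negative_mat` is hypothetical, and the convention that 1 marks a true negative and 0 marks an actual answer is an assumption:

```python
def build_is_negative_mat(answer_sets, shared_negatives):
    """Row i, column j is 1.0 if shared_negatives[j] is not an answer of query i."""
    return [
        [0.0 if candidate in answers else 1.0 for candidate in shared_negatives]
        for answers in answer_sets
    ]


answer_sets = [{3, 4}, {7}]   # true answers of two toy queries
shared_negatives = [3, 5, 7]  # candidates shared by the whole batch

mask = build_is_negative_mat(answer_sets, shared_negatives)
print(mask)  # [[0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
```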

# Build new models

A new model inherits from the generic KGReasoning class.

In [17]:
from smore.models.kg_reasoning import KGReasoning
from smore.common.embedding.sparse_embed import SparseEmbedding


class BasicReasoning(KGReasoning):

  def __init__(self, nentity, nrelation, hidden_dim, gamma,
               batch_size, test_batch_size=1, sparse_embeddings=None,
               sparse_device='gpu', use_cuda=False, query_name_dict=None,
               logit_impl='native'):
    super(BasicReasoning, self).__init__(nentity=nentity, nrelation=nrelation, hidden_dim=hidden_dim,
                                        gamma=gamma, optim_mode=None, batch_size=batch_size, test_batch_size=test_batch_size,
                                        sparse_embeddings=sparse_embeddings, sparse_device=sparse_device, use_cuda=use_cuda,
                                        query_name_dict=query_name_dict, logit_impl=logit_impl)

    self.geo = 'basic'  # name of this module
    self.entity_embedding = SparseEmbedding(nentity, self.entity_dim)  # suppose we need one embedding per each entity
    self.num_embedding_component = 1  # and number of components in an embedding representation
    self.init_params()  # finally let's initialize all the parameters

  def relation_projection(self, cur_embedding, relation_ids):
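    '''
    Project the current query embedding through a relation by adding the
    relation embedding (a translation in embedding space)
    Args:
        cur_embedding: a list holding the current embedding components
        relation_ids: indices of the relations to apply
    '''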
    relation_embedding = self.relation_embedding(relation_ids).unsqueeze(1)
    return [cur_embedding[0] + relation_embedding]

  def retrieve_embedding(self, entity_ids):
    '''
    Retrieve the entity embeddings given the entity indices
    Args:
        entity_ids: a list of entities indices
    '''
    embedding = self.entity_embedding(entity_ids)
    return [embedding.unsqueeze(1)] # [num_queries, 1, embedding_dim]

  def native_cal_logit(self, entity_embedding, entity_feat, query_embedding):
    assert entity_feat is None
    distance = entity_embedding.unsqueeze(1) - query_embedding[0]
    logit = self.gamma - torch.norm(distance, p=1, dim=-1)
    logit = torch.max(logit, dim=1)[0]
    return logit
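The scoring logic of `BasicReasoning` can be traced by hand on tiny vectors: each relation projection adds a relation embedding, and the logit is `gamma` minus the L1 distance between the candidate entity and the projected query. The pure-Python sketch below mirrors that arithmetic with made-up 2-dimensional embeddings. `project` and `l1_logit` are illustrative helpers, not smore functions:

```python
def project(embedding, relation_embedding):
    """Translate an embedding by a relation embedding (element-wise sum)."""
    return [x + r for x, r in zip(embedding, relation_embedding)]


def l1_logit(gamma, query_embedding, entity_embedding):
    """gamma minus the L1 distance between a candidate entity and the query."""
    distance = sum(abs(e - q) for e, q in zip(entity_embedding, query_embedding))
    return gamma - distance


head = [0.5, -1.0]
rel1 = [0.25, 0.5]
rel2 = [0.25, 0.0]

# A 2p query applies two projections in sequence.
query = project(project(head, rel1), rel2)  # [1.0, -0.5]

print(l1_logit(1.0, query, [1.0, -0.5]))  # 1.0, candidate coincides with the query
print(l1_logit(1.0, query, [0.0, 0.5]))   # -1.0, candidate is at L1 distance 2
```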


In [18]:
model = BasicReasoning(nentity=kg.num_ent, nrelation=kg.num_rel,
                       hidden_dim=128, gamma=1.0, batch_size=batch_size,
                       query_name_dict={ ('e', ('r', 'r')): '2p'})
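The key `('e', ('r', 'r'))` encodes the query structure as a nested tuple: one anchor entity followed by two relation hops. A small sketch of how path structures of other lengths could be generated follows. `path_structure` is a hypothetical helper, and the `1p` and `3p` entries are assumed by analogy with the `2p` entry used in the notebook:

```python
def path_structure(num_hops):
    """Build the nested-tuple structure for a path query with num_hops relations."""
    if num_hops < 1:
        raise ValueError('a path query needs at least one relation')
    return ('e', ('r',) * num_hops)


query_name_dict = {path_structure(k): f'{k}p' for k in (1, 2, 3)}
print(query_name_dict[('e', ('r', 'r'))])  # 2p
```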

In [19]:
positive_logit, negative_logit, _ = model(positive_sample,
                                          negative_sample,
                                          query_structures[0],
                                          batch_queries,
                                          device=device)

In [20]:
negative_logit = negative_logit * is_negative_mat
negative_score = F.logsigmoid(-negative_logit).mean(dim=1)
positive_score = F.logsigmoid(positive_logit).squeeze(dim=1)

positive_sample_loss = -torch.mean(positive_score)
negative_sample_loss = -torch.mean(negative_score)
loss = positive_sample_loss + negative_sample_loss
print(loss)

tensor(2.1076, grad_fn=<AddBackward0>)
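The loss above is a negative-sampling objective: it rewards high logits for positive samples and low logits for negative ones. The sketch below reproduces the same formula with the standard library only, which makes it easy to check by hand. `logsigmoid` and `sampling_loss` are illustrative re-implementations, not the PyTorch functions:

```python
import math


def logsigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def sampling_loss(positive_logits, negative_logits):
    """-mean(logsigmoid(pos)) - mean over queries of mean(logsigmoid(-neg))."""
    positive_scores = [logsigmoid(p) for p in positive_logits]
    negative_scores = [
        sum(logsigmoid(-n) for n in row) / len(row) for row in negative_logits
    ]
    positive_loss = -sum(positive_scores) / len(positive_scores)
    negative_loss = -sum(negative_scores) / len(negative_scores)
    return positive_loss + negative_loss


# With all logits at zero, each term equals log(2).
loss = sampling_loss([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
print(loss)
```

Note that multiplying a logit by a mask value of 0 sets it to 0, so under this formula such an entry still contributes log 2 to the negative term rather than being excluded from the mean.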
