## Importing Modules

In [5]:
import tensorflow as tf
import ampligraph
import numpy as np
import pandas as pd

In [10]:
from ampligraph.datasets import load_fb15k_237, load_wn18rr, load_yago3_10
from ampligraph.evaluation import train_test_split_no_unseen, evaluate_performance, mr_score, mrr_score, hits_at_n_score

In [7]:
from ampligraph.discovery import query_topn, discover_facts, find_clusters
from ampligraph.latent_features import TransE, ComplEx, HolE, DistMult, ConvE, ConvKB
from ampligraph.utils import save_model, restore_model

In [9]:
print("Ampligraph version : {}".format(ampligraph.__version__))

Ampligraph version : 1.4.0


In [8]:
print("Tensorflow version: {}".format(tf.__version__))

Tensorflow version: 1.15.5


# Loading the KG (Knowledge Graph) dataset

A standard KG called **Freebase-15k-237** will be loaded. AmpliGraph can also load other KGs, CSV files, N-Triples files, etc. through its datasets API: https://docs.ampligraph.org/en/1.4.0/ampligraph.datasets.html

* FB15k-237 dataset: Freebase was a knowledge base and the ontology behind Google's semantic search feature (the Knowledge Graph), which is the backend for Google search results that include structured answers to queries instead of a series of links to external resources. It held 1.9 billion triples in RDF (Resource Description Framework) format. Google acquired it in 2010. In 2016 it was closed and its data was migrated to Wikidata. FB15k-237 is a link prediction dataset created from FB15k. While FB15k consists of 1,345 relations, 14,951 entities, and 592,213 triples, many of its triples are inverses of one another, which causes leakage from the training split into the testing and validation splits. FB15k-237 was created by Toutanova and Chen (2015) to ensure that the testing and evaluation datasets do not have inverse relation test leakage. In summary, the FB15k-237 dataset contains 310,079 triples with 14,505 entities and 237 relation types.

https://paperswithcode.com/dataset/fb15k-237

* WN18RR dataset: WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations and 40,943 entities. However, many test triples are obtained by inverting triples from the training set. The WN18RR dataset was therefore created to ensure that the evaluation dataset does not have inverse relation test leakage. In summary, the WN18RR dataset contains 93,003 triples with 40,943 entities and 11 relation types.

https://paperswithcode.com/dataset/wn18rr

* YAGO3-10: YAGO3-10 is a benchmark dataset for knowledge base completion. It is a subset of YAGO3 (which itself is an extension of YAGO) that contains entities associated with at least ten different relations. In total, YAGO3-10 has 123,182 entities and 37 relations, and most of the triples describe attributes of persons such as citizenship, gender, and profession.

https://paperswithcode.com/dataset/yago3-10

* DBpedia: It extracts factual information from Wikipedia pages, allowing users to find answers to questions where the information is spread across multiple Wikipedia articles. Data is accessed using an SQL-like query language for RDF called SPARQL.

https://www.dbpedia.org/

* Wikidata : 

https://developer.ibm.com/articles/use-wikidata-in-ai-and-cognitive-applications-pt1/

https://developer.ibm.com/articles/use-wikidata-in-ai-and-cognitive-applications-pt2/
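
The inverse relation leakage mentioned for FB15k and WN18 can be illustrated with a small sketch. The triples and the helper name `reversed_pair_leaks` below are made up for illustration and are not part of AmpliGraph; the check simply flags test triples whose subject and object appear, in swapped order, in some training triple.

```python
# Toy triples; entity and relation names are hypothetical.
train = [
    ("paris", "capital_of", "france"),
    ("alice", "parent_of", "bob"),
    ("acme", "located_in", "berlin"),
]
test = [
    ("france", "has_capital", "paris"),  # inverse of a training triple
    ("bob", "child_of", "alice"),        # inverse of a training triple
    ("carol", "works_for", "acme"),      # no reversed counterpart in training
]


def reversed_pair_leaks(train_triples, test_triples):
    """Return test triples whose (object, subject) pair occurs as (subject, object) in training."""
    train_pairs = {(s, o) for s, _, o in train_triples}
    return [t for t in test_triples if (t[2], t[0]) in train_pairs]


leaks = reversed_pair_leaks(train, test)
print(leaks)
```

A model can answer the flagged test triples by memorising the reversed training triple, which is the kind of shortcut FB15k-237 and WN18RR were built to remove.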

For this exercise, we have remapped the IDs of FB15k-237 and created a CSV file containing human-readable names instead of IDs.

In [11]:
URL = 'https://ampgraphenc.s3-eu-west-1.amazonaws.com/datasets/freebase-237-merged-and-remapped.csv'
dataset = pd.read_csv(URL, header = None)
dataset.columns = ['subject', 'predicate', 'object']
dataset.head(3)

| | subject | predicate | object |
|---|---|---|---|
| 0 | queens college, city university of new york | /education/educational_institution/students_gr... | carol leifer |
| 1 | digital equipment corporation | /business/business_operation/industry | computer hardware |
| 2 | /m/0drtv8 | /award/award_ceremony/awards_presented./award/... | laurence mark |


One example:

['academy award for best writing adapted screenplay',
 '/award/award_category/nominees./award/award_nomination/nominated_for',
 'the graduate']

In [12]:
print('Total triples in the KG: ', dataset.shape)

Total triples in the KG:  (310079, 3)


# Creating training, validation and test splits

We will use train_test_split_no_unseen(). This API generates train/test splits such that the test and validation sets contain only triples whose entities are "seen" during training.

In [17]:
# Validation set of size 500

test_train, X_valid = train_test_split_no_unseen(dataset.values, 500, seed = 0)

# Test set of size 1000 from the remaining triples
X_train, X_test = train_test_split_no_unseen(test_train, 1000, seed = 0)

print('Total triples: ', dataset.shape)
print('Size of train: ', X_train.shape)
print('Size of valid: ', X_valid.shape)
print('Size of test: ', X_test.shape)

Total triples:  (310079, 3)
Size of train:  (308579, 3)
Size of valid:  (500, 3)
Size of test:  (1000, 3)
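
The "no unseen entities" property can be checked by hand. The sketch below is not AmpliGraph code: `unseen_entities` is a hypothetical helper, and the toy arrays only mimic the (subject, predicate, object) layout used above.

```python
import numpy as np


def unseen_entities(train, held_out):
    """Entities (subjects or objects) in held_out that never occur in train."""
    seen = set(train[:, 0]) | set(train[:, 2])
    candidates = set(held_out[:, 0]) | set(held_out[:, 2])
    return candidates - seen


# Toy data with columns (subject, predicate, object).
toy_train = np.array([["a", "r1", "b"], ["b", "r2", "c"]])
toy_ok = np.array([["a", "r2", "c"]])    # every entity also occurs in toy_train
toy_bad = np.array([["a", "r1", "z"]])   # "z" never occurs in toy_train

print(unseen_entities(toy_train, toy_ok))
print(unseen_entities(toy_train, toy_bad))
```

Applied to the arrays from the cell above, `unseen_entities(X_train, X_test)` and `unseen_entities(X_train, X_valid)` would be expected to come back empty if the split behaves as described.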


# Model Training

Knowledge graph embeddings are learned by training a neural architecture over a graph. In the training phase, a loss function **L** is optimized. It includes a scoring function **f_m(t)**, a model-specific function that assigns a score to a triple **t = (sub, pred, obj)**.

https://docs.ampligraph.org/en/latest/ampligraph.latent_features.html

a) **TransE** :  
It uses simple vector algebra to score triples and has a very low number of trainable parameters compared to most models. The scoring function computes the similarity between the subject embedding translated by the predicate embedding and the object embedding, using the L1 or L2 norm ||.||:

$$f = -\lVert s + p - o \rVert_n$$
                            
Translating Embeddings for Modeling Multi-relational Data: https://proceedings.neurips.cc/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf
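
The scoring function above can be written in a few lines of NumPy. This is a minimal sketch of the formula only, not AmpliGraph's implementation; the function name `transe_score` and the example vectors are assumptions made for illustration.

```python
import numpy as np


def transe_score(e_s, e_p, e_o, norm=1):
    """f = -||s + p - o||_n for one triple or a batch (last axis = embedding dimension)."""
    return -np.linalg.norm(e_s + e_p - e_o, ord=norm, axis=-1)


# Hypothetical 3-dimensional embeddings.
s = np.array([0.1, 0.2, 0.3])
p = np.array([0.4, 0.0, -0.1])
o_good = s + p                        # object exactly where the translation lands
o_bad = np.array([-1.0, 1.0, 0.5])    # object far from s + p

print(transe_score(s, p, o_good))           # L1 norm
print(transe_score(s, p, o_bad))            # L1 norm
print(transe_score(s, p, o_bad, norm=2))    # L2 norm
```

The score is never positive: the closer the translated subject `s + p` lies to the object `o`, the closer the score gets to zero, so plausible triples should receive higher scores than implausible ones.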


In [20]:
model = TransE(k = 150,                                                          # embedding size
              epochs = 100,                                                      # num epochs
              batches_count = 10,                                                # num batches
              eta = 1,                                                           # num of corruptions to generate during training
              loss = 'pairwise', loss_params = {'margin': 1},                    # loss type and its hyperparameters
              initializer = 'xavier', initializer_params = {'uniform': False},
              regularizer = 'LP', regularizer_params = {'lambda': 0.001, 'p': 3},
              optimizer = 'adam', optimizer_params = {'lr': 0.001},
              seed = 0, verbose = True)
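
To make the `loss = 'pairwise'` and `margin` settings more concrete, here is a minimal NumPy sketch of a max-margin pairwise loss. It assumes one corruption per positive triple (as with `eta = 1`) and sums over pairs. The function name and the scores are hypothetical, and whether AmpliGraph reduces the terms in exactly this way is not verified here.

```python
import numpy as np


def pairwise_margin_loss(pos_scores, neg_scores, margin=1.0):
    """Sum of max(0, margin + f(corrupted) - f(positive)) over paired triples."""
    return np.sum(np.maximum(0.0, margin + neg_scores - pos_scores))


# Hypothetical model scores: each positive triple is paired with one corruption.
pos = np.array([-0.2, -0.5, -1.0])
neg = np.array([-2.0, -1.0, -0.9])

loss = pairwise_margin_loss(pos, neg, margin=1.0)
print(loss)
```

A pair contributes nothing once the positive triple outscores its corruption by at least the margin, so training effort concentrates on pairs the model still ranks too closely or in the wrong order.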

In [22]:
model.fit(X_train)

save_model(model, 'TransE-small.pkl')

Average TransE Loss:   0.013576: 100%|████████████████████████████████████████████| 100/100 [02:31<00:00,  1.52s/epoch]
