# 1.0 Knowledge Graph Embeddings Introduction

Word embeddings aim at capturing the meaning of words based on very large corpora; however, there are decades of experience and approaches that have tried to capture this meaning by structuring knowledge into semantic nets, ontologies and graphs. 

|         | Neural           | Symbolic  |
| ------------- |-------------| -----|
| **representation**      | vectors | symbols (URIs) |
| **input**               | large corpora   | human editors (Knowledge engineers) |
| **interpretability**      | linked to model and training dataset      |   requires understanding of schema  |
| **alignability**    | parallel (annotated) corpora | heuristics + manual |
| **composability** | combine vectors | merge graphs | 
| **extensibility**   | fixed vocabulary | need to know how to link new nodes |
| **certainty**        | fuzzy | exact |
| **debugability**  | 'fix' training data? | edit graph |

In recent years, many new approaches have been proposed to derive 'neural' representations for existing knowledge graphs. Think of this as trying to capture the knowledge encoded in the KG to make it easier to use this in deep learning models.

 - [TransE (2013)](http://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data.pdf): tries to assign an embedding to each node and relation so that $h + r$ is close to $t$, where $h$ and $t$ are nodes in the graph and $r$ is an edge. In the RDF world, this is simply an RDF triple where $h$ is the subject, $r$ is the property and $t$ is the object.
 - [HolE (2016)](http://arxiv.org/abs/1510.04935): like TransE, learns embeddings for nodes and relations, but uses a different operator (circular correlation) to represent pairs of entities.
 - [RDF2Vec (2016)](https://ub-madoc.bib.uni-mannheim.de/41307/1/Ristoski_RDF2Vec.pdf): applies word2vec to random walks on an RDF graph (essentially paths or sequences of nodes in the graph).
 - [Graph convolutions (2018)](http://arxiv.org/abs/1703.06103): applies convolutional operations on graphs to learn the embeddings.
 - [Neural message passing (2018)](https://arxiv.org/abs/1704.01212): merges two strands of research on KG embeddings: recurrent and convolutional approaches.
 
For more background: [Nickel, M., Murphy, K., Tresp, V., & Gabrilovich, E. (2016). A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1), 11–33. https://doi.org/10.1109/JPROC.2015.2483592](http://www.dbs.ifi.lmu.de/~tresp/papers/1503.00759v3.pdf) provides a good overview (up to 2016).
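
To make the scoring ideas behind TransE and HolE from the list above more concrete, here is a minimal pure-Python sketch. It is not taken from any of the linked implementations: the function names and the toy 3-dimensional vectors are made up for illustration, and real libraries use vectorised (and, for circular correlation, FFT-based) code instead.

```python
import math

def transe_score(h, r, t):
    """Negative L2 distance between h + r and t (higher means more plausible)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def circular_correlation(a, b):
    """Circular correlation: result[k] = sum_i a[i] * b[(i + k) mod d]."""
    d = len(a)
    return [sum(a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)]

def hole_score(h, r, t):
    """Sigmoid of the dot product between r and the circular correlation of h and t."""
    corr = circular_correlation(h, t)
    x = sum(ri * ci for ri, ci in zip(r, corr))
    return 1.0 / (1.0 + math.exp(-x))

# Hand-picked toy vectors (hypothetical, for illustration only).
h = [1.0, 0.0, 0.0]
r = [0.0, 1.0, 0.0]
t = [1.0, 1.0, 0.0]        # equals h + r, so TransE considers the triple plausible
t_wrong = [0.0, 0.0, 1.0]  # far from h + r

print("TransE, matching tail:  ", transe_score(h, r, t))
print("TransE, mismatched tail:", transe_score(h, r, t_wrong))
print("HolE score:             ", hole_score(h, r, t))
```

In both cases the model is trained so that triples present in the graph score higher than corrupted ones; only the way the head, relation and tail vectors are combined differs.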

Steps:
  
  0. Choose (or implement) a KG embedding algorithm
  1. Convert the KG into the format required by the embedding algorithm
  2. Execute the training
  3. Evaluate/inspect results

# 2.0 Install algorithms and import dataset

Choose embedding algorithm: HolE

We will use an [existing implementation of the `HolE` algorithm available on GitHub](https://github.com/mnick/holographic-embeddings). 


### 2.1.1 Install `scikit-kge` package

The `holographic-embeddings` repo is actually just a wrapper around `scikit-kge` or [SKGE](https://github.com/mnick/scikit-kge), a library that implements a few KG embedding algorithms. First, we need to install `scikit-kge` as a library in our environment. Execute the following cells to clone the repository and install the library.

In [None]:
# make sure we are in the right folder to perform the git clone
%cd /content/
!git clone https://github.com/hybridNLP2018/scikit-kge

/content
Cloning into 'scikit-kge'...
remote: Enumerating objects: 116, done.
remote: Total 116 (delta 0), reused 0 (delta 0), pack-reused 116
Receiving objects: 100% (116/116), 25.32 KiB | 8.44 MiB/s, done.
Resolving deltas: 100% (51/51), done.


In [None]:
%cd scikit-kge
# install nose, a dependency needed to correctly build scikit-kge in the Colaboratory environment
!pip install nose
# now build a source distribution for the project
!python setup.py sdist
# which we can install in the local environment using pip, the Python package manager
!pip install dist/scikit-kge-0.1.tar.gz
%cd /content

/content/scikit-kge
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nose
  Downloading nose-1.3.7-py3-none-any.whl (154 kB)
     |████████████████████████████████| 154 kB 23.5 MB/s
Installing collected packages: nose
Successfully installed nose-1.3.7
setuptools module not found.
Install setuptools if you want to enable 'python setup.py develop'.
  import imp
running sdist
running egg_info
`build_src` is being run, this may lead to missing
files in your sdist!  You want to use distutils.sdist
instead of the setuptools version:

    from distutils.command.sdist import sdist
    cmdclass={'sdist': sdist}"

See numpy's setup.py or gh-7131 for details.
  cmd_obj.run()
running build_src
build_src
creating scikit_kge.egg-info
writing scikit_kge.egg-info/PKG-INFO
writing dependency_links to scikit_kge.egg-info/dependency_links.txt
writing top-level names t

### 2.1.2 Install holographic algorithm

Install and inspect the `holographic-embeddings` repo.

Now that `skge` is installed in this environment, we are ready to clone the [holographic-embeddings](https://github.com/mnick/holographic-embeddings) repository, which will enable us to train `HolE` embeddings.

In [None]:
# let's go back to the main /content folder and clone the HolE repo
%cd /content/
!git clone https://github.com/mnick/holographic-embeddings

/content
Cloning into 'holographic-embeddings'...
remote: Enumerating objects: 37, done.
remote: Total 37 (delta 0), reused 0 (delta 0), pack-reused 37
Unpacking objects: 100% (37/37), done.


Training arguments

In [None]:
%less holographic-embeddings/run_hole_wn18.sh

You should see a section at the bottom of the screen with the contents of the `run_hole_wn18.sh` file. The main command is:

```
python kg/run_hole.py --fin data/wn18.bin \
       --test-all 50 --nb 100 --me 500 \
       --margin 0.2 --lr 0.1 --ncomp 150
```

which is just executing the `kg/run_hole.py` script on the input data `data/wn18.bin` and passing various arguments to control how to train and produce the embeddings:

  * `me`: states the number of epochs to train for (i.e. number of times to go through the input dataset)
  * `ncomp`: specifies the dimension of the embeddings, each embedding will be a vector of 150 dimensions
  * `nb`: number of batches
  * `test-all`: specifies how often to run validation of the intermediate embeddings. In this case, every 50 epochs.
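
The `--margin` argument is not described in the list above. As a rough sketch, and assuming the trainer uses a pairwise margin ranking objective (the trainer name `PairwiseStochasticTrainer` in the training log suggests this, but it is not confirmed here), a "violation" can be read as a true triple whose score does not beat the score of a corrupted triple by at least the margin. The function name and the toy scores below are hypothetical.

```python
def count_violations(pos_scores, neg_scores, margin):
    """Count pairs where the corrupted triple scores within `margin` of the true one."""
    return sum(1 for p, n in zip(pos_scores, neg_scores) if n + margin > p)

# Toy scores for three (true triple, corrupted triple) pairs; values are made up.
pos_scores = [0.9, 0.5, 0.3]
neg_scores = [0.1, 0.4, 0.6]

print(count_violations(pos_scores, neg_scores, margin=0.2))
print(count_violations(pos_scores, neg_scores, margin=0.0))
```

Under this reading, a larger margin demands a wider gap between true and corrupted triples, so more pairs count as violations and contribute to the update.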

# 3.0 Convert our dataset to required input
SKGE requires a graph to be represented as a serialized python dictionary with the following structure:
  * `relations`: a list of relation names (the named edges in the graph)
  * `entities`: a list of entity names (the nodes in the graph)
  * `train_subs`: a list of triples of the form `(head_id, tail_id, rel_id)`, where `head_id` and `tail_id` refer to the index in the `entities` list and `rel_id` refers to the index in the `relations` list. This is the list of triples that will be used to train the embeddings.
  * `valid_subs`: a list of triples of the same form as `train_subs`. These are used to validate the embeddings during training (and thus to tune hyperparameters).
  * `test_subs`: a list of triples of the same form as `train_subs`. These are used to test the learned embeddings.

To generate the input dictionary and serialise it, we need to:
  * create our lists of entities and relations
  * derive a map from entity and relation names to ids
  * generate the triples
  * split the triples into training, validation and test subsets
  * write the python dict to a serialised file

The cell below does not repeat these steps; it only points `inputs` at a `data.bin` file that was prepared in advance in this format:

In [None]:
# data.bin was created beforehand (by the notebook author) in the format described above
inputs = '/content/holographic-embeddings/data/data.bin'
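
The conversion steps listed above can be sketched as follows. This is a minimal illustration, not the code that produced `data.bin`: the function name `build_skge_input`, the toy triples, the 80/10/10 split and the output file name `toy_graph.bin` are all assumptions made for the example.

```python
import os
import pickle
import random
import tempfile

def build_skge_input(named_triples, valid_frac=0.1, test_frac=0.1, seed=0):
    """Build the input dict from a list of (head_name, relation_name, tail_name)."""
    # Lists of entity and relation names, plus name -> id maps
    entities = sorted({name for h, _, t in named_triples for name in (h, t)})
    relations = sorted({r for _, r, _ in named_triples})
    ent_id = {name: i for i, name in enumerate(entities)}
    rel_id = {name: i for i, name in enumerate(relations)}

    # Triples in the (head_id, tail_id, rel_id) order described above
    triples = [(ent_id[h], ent_id[t], rel_id[r]) for h, r, t in named_triples]

    # Shuffle reproducibly, then split into validation, test and training subsets
    random.Random(seed).shuffle(triples)
    n_valid = int(len(triples) * valid_frac)
    n_test = int(len(triples) * test_frac)
    return {
        'entities': entities,
        'relations': relations,
        'valid_subs': triples[:n_valid],
        'test_subs': triples[n_valid:n_valid + n_test],
        'train_subs': triples[n_valid + n_test:],
    }

# Hypothetical toy graph
toy_triples = [
    ('cat', 'hypernym', 'feline'),
    ('feline', 'hypernym', 'carnivore'),
    ('dog', 'hypernym', 'canine'),
    ('canine', 'hypernym', 'carnivore'),
    ('carnivore', 'hypernym', 'mammal'),
    ('mammal', 'hypernym', 'animal'),
    ('cat', 'similar_to', 'dog'),
    ('paw', 'part_of', 'cat'),
    ('paw', 'part_of', 'dog'),
    ('tail', 'part_of', 'cat'),
]
toy_data = build_skge_input(toy_triples)

# Serialise the dict with pickle (hypothetical file name in the temp directory)
toy_path = os.path.join(tempfile.gettempdir(), 'toy_graph.bin')
with open(toy_path, 'wb') as fout:
    pickle.dump(toy_data, fout)

print(len(toy_data['entities']), 'entities,', len(toy_data['relations']), 'relations')
print({k: len(v) for k, v in toy_data.items() if k.endswith('_subs')})
```

A purely random split like this one can place an entity in the validation or test subset that never appears in the training triples; for a real dataset it is worth checking for that before training.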

# 4.0 Learn the embeddings
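Now, we will use the WordNet 3.0 dataset to learn embeddings for both synsets and lemmas. Since this is fairly slow, we only train for 2 epochs, which can take up to 10 minutes. (In the exercises at the end of this notebook, we provide a link to download pre-computed embeddings which have been trained for 500 epochs.) Note that the cell below, as written, reads the custom `data.bin` file set in `inputs` and sets `num_epochs=50`; adjust both to match the dataset you want to train on.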

In [None]:
outputs='/content/output_embeddings.bin'
holE_dim=150
num_epochs=50
num_batches=100
lr=0.08
!python /content/holographic-embeddings/kg/run_hole.py --fin {inputs} --fout {outputs} \
  --nb {num_batches} --me {num_epochs} --margin 0.2 --lr {lr} --ncomp {holE_dim}

INFO:EX-KG:[ 10] VALID: MRR = 0.91/0.95, Mean Rank = 28.82/28.64, Hits@10 = 96.28/96.62
DEBUG:EX-KG:FMRR valid = 0.947005, best = -1.000000
INFO:EX-KG:[ 10] TEST: MRR = 0.92/0.96, Mean Rank = 23.61/23.42, Hits@10 = 97.28/97.83
INFO:EX-KG:[ 11] time = 0s, violations = 125
INFO:EX-KG:[ 12] time = 0s, violations = 98
INFO:EX-KG:[ 13] time = 0s, violations = 92
INFO:EX-KG:[ 14] time = 0s, violations = 99
INFO:EX-KG:[ 15] time = 0s, violations = 96
INFO:EX-KG:[ 16] time = 0s, violations = 86
INFO:EX-KG:[ 17] time = 0s, violations = 52
INFO:EX-KG:[ 18] time = 0s, violations = 72
INFO:EX-KG:[ 19] time = 0s, violations = 61
INFO:EX-KG:[ 20] time = 0s, violations = 69
INFO:EX-KG:[ 20] VALID: MRR = 0.92/0.96, Mean Rank = 28.00/27.88, Hits@10 = 96.62/96.62
DEBUG:EX-KG:FMRR valid = 0.957281, best = 0.947005
INFO:EX-KG:[ 20] TEST: MRR = 0.93/0.97, Mean Rank = 18.82/18.64, Hits@10 = 97.83/97.83
INFO:EX-KG:[ 21] time = 0s, violations = 69
INFO:EX-KG:[ 22] time = 0s, violations = 62
INFO:EX-KG:[ 23] t

For a 2-epoch run on WordNet 3.0, the output should look similar to:
```
INFO:EX-KG:Fitting model HolE with trainer PairwiseStochasticTrainer and parameters Namespace(afs='sigmoid', fin='/content/holographic-embeddings/data/wn30.bin', fout='/content/wn30_holE_2e.bin', init='nunif', lr=0.1, margin=0.2, me=2, mode='rank', nb=100, ncomp=150, ne=1, no_pairwise=False, rparam=0, sampler='random-mode', test_all=10)
INFO:EX-KG:[  1] time = 120s, violations = 773683
INFO:EX-KG:[  2] time = 73s, violations = 334894
INFO:EX-KG:[  2] time = 73s, violations = 334894
INFO:EX-KG:[  2] VALID: MRR = 0.11/0.12, Mean Rank = 90012.28/90006.14, Hits@10 = 15.02/15.12
DEBUG:EX-KG:FMRR valid = 0.122450, best = -1.000000
INFO:EX-KG:[  2] TEST: MRR = 0.11/0.12, Mean Rank = 95344.42/95335.96, Hits@10 = 15.74/15.74
```
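
The `VALID` and `TEST` lines report three standard link-prediction metrics. The two numbers in each pair appear to be raw and filtered variants (the `FMRR` debug line matches the second MRR value). As a minimal sketch, assuming each test triple has already been assigned the rank of its correct entity among all candidates, the metrics can be computed as below; the function name and the toy ranks are hypothetical.

```python
def ranking_metrics(ranks, k=10):
    """Return (MRR, mean rank, Hits@k in percent) for a list of 1-based ranks."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n                      # mean reciprocal rank
    mean_rank = sum(ranks) / n                                 # average position
    hits_at_k = 100.0 * sum(1 for r in ranks if r <= k) / n    # share ranked in the top k
    return mrr, mean_rank, hits_at_k

# Toy ranks of the correct entity for four test triples (made up)
toy_ranks = [1, 2, 4, 20]
mrr, mean_rank, hits10 = ranking_metrics(toy_ranks)
print(f"MRR = {mrr:.2f}, Mean Rank = {mean_rank:.2f}, Hits@10 = {hits10:.2f}")
```

Higher MRR and Hits@10 and a lower mean rank indicate better embeddings, which is the trend visible between epochs in the logs above.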

# 5.0 Inspect resulting embeddings
The output file is again a pickled serialisation of a python dictionary. It contains the `model` itself, and results for the test and validation runs as well as execution times.

In [None]:
import pickle
with open(outputs, 'rb') as fin:
    hole_model = pickle.load(fin)
print(type(hole_model), len(hole_model))
for k in hole_model:
    print(k, type(hole_model[k]))

We are interested in the model itself. The entity embeddings are stored in parameter `E`, which is essentially an $n_e \times d$ matrix, where $n_e$ is the number of entities and $d$ is the dimension of each vector.

In [None]:
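import pickle
import torch

model = hole_model['model']
E = model.params['E']  # entity embedding matrix, one row per entity

# Reload the input data to recover the entity names (same order as the rows of E)
with open('/content/holographic-embeddings/data/data.bin', 'rb') as fin:
    data = pickle.load(fin)
entities = data['entities']

# Map each entity name to its embedding as a plain Python list
embeddings = {name: vec.tolist() for name, vec in zip(entities, E)}

# Save the name -> vector dictionary for later use
torch.save(embeddings, 'embeddings.pt')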
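
One simple way to inspect a name-to-vector dictionary like `embeddings` is to look up the nearest neighbours of an entity by cosine similarity. The sketch below uses a small hand-made dictionary so that it runs on its own; the helper names `cosine` and `nearest` and the toy vectors are hypothetical, and with real data you would pass the `embeddings` dictionary built above instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(embeddings, query, top_k=3):
    """Return the top_k (name, similarity) pairs most similar to `query`."""
    q = embeddings[query]
    scored = [(name, cosine(q, vec)) for name, vec in embeddings.items() if name != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy stand-in for the real embeddings dictionary (values are made up)
toy_embeddings = {
    'cat': [0.9, 0.1, 0.0],
    'dog': [0.8, 0.2, 0.1],
    'car': [0.0, 0.1, 0.9],
}
neighbours = nearest(toy_embeddings, 'cat', top_k=2)
for name, sim in neighbours:
    print(f"{name}: {sim:.3f}")
```

If related entities in your graph do not show up as close neighbours, that is a hint to train for more epochs or to revisit the input triples.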