# Knowledge Graph Embeddings

Word embeddings aim at capturing the meaning of words based on very large corpora; however, there are decades of experience and approaches that have tried to capture this meaning by structuring knowledge into semantic nets, ontologies and graphs. 

|         | Neural           | Symbolic  |
| ------------- |-------------| -----|
| **representation**      | vectors | symbols (URIs) |
| **input**               | large corpora   | human editors (Knowledge engineers) |
| **interpretability**      | linked to model and training dataset      |   requires understanding of schema  |
| **alignability**    | parallel (annotated) corpora | heuristics + manual |
| **composability** | combine vectors | merge graphs | 
| **extensibility**   | fixed vocabulary | need to know how to link new nodes |
| **certainty**        | fuzzy | exact |
| **debuggability**  | 'fix' training data? | edit graph |

In recent years, many new approaches have been proposed to derive 'neural' representations for existing knowledge graphs. Think of this as trying to capture the knowledge encoded in the KG to make it easier to use this in deep learning models.

 - [TransE (2013)](http://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data.pdf): tries to assign an embedding to nodes and relations so that $h + r$ is close to $t$, where $h$ and $t$ are nodes in the graph and $r$ is an edge. In the RDF world, this is simply an RDF triple where $h$ is the subject, $r$ is the property and $t$ is the object of the triple.
 - [HolE (2016)](http://arxiv.org/abs/1510.04935): a variant of TransE that uses a different operator (circular correlation) to represent pairs of entities.
 - [RDF2Vec (2016)](https://ub-madoc.bib.uni-mannheim.de/41307/1/Ristoski_RDF2Vec.pdf): applies word2vec to random walks on an RDF graph (essentially paths or sequences of nodes in the graph).
 - [Graph convolutions (2018)](http://arxiv.org/abs/1703.06103): applies convolutional operations on graphs to learn the embeddings.
 - [Neural message passing (2017)](https://arxiv.org/abs/1704.01212): merges two strands of research on KG embeddings: recurrent and convolutional approaches.
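
To make the TransE idea concrete, here is a minimal NumPy sketch of its scoring function. The toy 2-dimensional vectors and the name `transe_score` are illustrative assumptions, not taken from any library; real embeddings are learned during training rather than set by hand.

```python
import numpy as np

def transe_score(h, r, t):
    """Negative Euclidean distance between h + r and t (higher means more plausible)."""
    return -np.linalg.norm(h + r - t)

# Hand-set toy embeddings (assumed values, for illustration only).
h = np.array([1.0, 0.0])         # head entity
r = np.array([0.0, 1.0])         # relation
t_true = np.array([1.0, 1.0])    # tail that satisfies h + r = t exactly
t_false = np.array([-1.0, 0.0])  # unrelated tail

print(transe_score(h, r, t_true))   # best possible score: distance zero
print(transe_score(h, r, t_false))  # lower score: h + r is far from this tail
```

Training then amounts to adjusting the vectors so that observed triples score higher than corrupted ones.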
 
For more background: [Nickel, M., Murphy, K., Tresp, V., & Gabrilovich, E. (2016). A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1), 11–33. https://doi.org/10.1109/JPROC.2015.2483592](http://www.dbs.ifi.lmu.de/~tresp/papers/1503.00759v3.pdf) provides a good overview (up to 2016).

# Creating embeddings for WordNet

In this section, we go through the steps of generating word and concept embeddings using WordNet, a lexico-semantic knowledge graph.
  
  0. Choose (or implement) a KG embedding algorithm
  1. Convert the KG into the format required by the embedding algorithm
  2. Execute the training
  3. Evaluate/inspect results

## Choose embedding algorithm: HolE

We will use an [existing implementation of the `HolE` algorithm available on GitHub](https://github.com/mnick/holographic-embeddings). 
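
As background on what such an implementation learns, HolE combines the head and tail embeddings with circular correlation and then takes the dot product with the relation embedding. The sketch below illustrates that operator in NumPy; the function names are ours and are not part of the `scikit-kge` API, and the random vectors stand in for learned embeddings.

```python
import numpy as np

def circular_correlation(a, b):
    """Circular correlation via FFT: result[k] = sum_i a[i] * b[(i + k) mod d]."""
    return np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)).real

def circular_correlation_naive(a, b):
    """The same operator written as explicit sums, used only as a cross-check."""
    d = len(a)
    return np.array([sum(a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)])

def hole_score(h, r, t):
    """Sigmoid of the relation vector dotted with the correlated (head, tail) pair."""
    return 1.0 / (1.0 + np.exp(-np.dot(r, circular_correlation(h, t))))

# Random stand-ins for learned embeddings (assumption: dimension 8 for brevity).
h, r, t = np.random.RandomState(0).randn(3, 8)
print(np.allclose(circular_correlation(h, t), circular_correlation_naive(h, t)))
print(hole_score(h, r, t))
```

Unlike the addition used by TransE, circular correlation is not commutative, so swapping head and tail generally changes the score.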

### Install `scikit-kge`

The `holographic-embeddings` repo is actually just a wrapper around `scikit-kge` or [SKGE](https://github.com/mnick/scikit-kge), a library that implements a few KG embedding algorithms. First, we need to install `scikit-kge` as a library in our environment. Execute the following cells to clone the repository and install the library.

In [0]:
# make sure we are in the right folder to perform the git clone
%cd /content/
!git clone https://github.com/hybridNLP2018/scikit-kge

In [0]:
%cd scikit-kge
# install a dependency of scikit-kge on the colaboratory environment, needed to correctly build scikit-kge
!pip install nose
# now build a source distribution for the project
!python setup.py sdist

Executing the previous cell should produce a lot of output as the project is built. Towards the end you should see something like:

```
Writing scikit-kge-0.1/setup.cfg
creating dist
Creating tar archive
```

This should have created a `tar.gz` file in the `dist` subfolder:

In [0]:
!ls dist/

which we can install in the local environment using `pip`, the Python package manager.

In [0]:
!pip install dist/scikit-kge-0.1.tar.gz
%cd /content

### Install and inspect `holographic_embeddings` repo
Now that `skge` is installed on this environment, we are ready to clone the [holographic-embeddings](https://github.com/mnick/holographic-embeddings) repository, which will enable us to train `HolE` embeddings.

In [0]:
# let's go back to the main /content folder and clone the HolE repo
%cd /content/
!git clone https://github.com/mnick/holographic-embeddings

If you want, you can browse the contents of this repo on GitHub, or execute the following to see how you can start training embeddings for the WN18 knowledge graph (a benchmark subset of WordNet with 18 relation types). In the following sections we'll go into more detail about how to train embeddings, so there is no need to actually execute this training just yet.

In [0]:
%less holographic-embeddings/run_hole_wn18.sh

You should see a section at the bottom of the screen with the contents of the `run_hole_wn18.sh` file. The main execution is:

```
python kg/run_hole.py --fin data/wn18.bin \
       --test-all 50 --nb 100 --me 500 \
       --margin 0.2 --lr 0.1 --ncomp 150
```

which is just executing the `kg/run_hole.py` script on the input data `data/wn18.bin` and passing various arguments to control how to train and produce the embeddings:

  * `me`: the maximum number of epochs to train for (i.e. the number of times to go through the input dataset)
  * `ncomp`: the dimension of the embeddings; here, each embedding will be a vector of 150 dimensions
  * `nb`: the number of batches each epoch is split into
  * `test-all`: how often to validate the intermediate embeddings; in this case, every 50 epochs
  * `margin` and `lr`: the margin of the pairwise ranking loss and the learning rate
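
As a rough back-of-the-envelope check of what these settings imply, the following sketch works through the arithmetic. It assumes that every epoch is split into equally sized batches and that validation runs exactly every `test-all` epochs; the figure of 141,442 training triples is the size of the training split in `wn18.bin`, as reported when the file is inspected later in this notebook.

```python
import math

n_train = 141442   # training triples in wn18.bin
n_batches = 100    # --nb
max_epochs = 500   # --me
test_every = 50    # --test-all

# Assumption: batches are of (nearly) equal size, so the last one may be smaller.
batch_size = math.ceil(n_train / n_batches)
total_batches = n_batches * max_epochs
n_validations = max_epochs // test_every

print('triples per batch:', batch_size)
print('batches processed over the full run:', total_batches)
print('validation runs:', n_validations)
```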

## Convert WordNet KG to required input
### KG Input format required by SKGE
SKGE requires a graph to be represented as a serialized python dictionary with the following structure:
  * `relations`: a list of relation names (the named edges in the graph)
  * `entities`:  a list of entity names (the nodes in the graph), 
  * `train_subs`: a list of triples of the form `(head_id, tail_id, rel_id)`, where `head_id` and `tail_id` refer to the index in the `entities` list and `rel_id` refers to the index in the `relations` list. This is the list of triples that will be used to train the embeddings.
  * `valid_subs`: a list of triples of the same form as `train_subs`. These are used to validate the embeddings during training (and thus to tune hyperparameters).
  * `test_subs`: a list of triples of the same form as `train_subs`. These are used to test the learned embeddings.
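
The structure above can be illustrated with a hand-built toy graph. This is only a sketch: the three entities, the two relations and the helper `decode` are made up for illustration and are not part of SKGE.

```python
import pickle

# A toy graph in the dictionary layout described above (illustrative values only).
toy_kg = {
    'entities': ['dog.n.01', 'canine.n.02', 'dog'],
    'relations': ['hypernym', 'lemma'],
    'train_subs': [(0, 1, 0), (0, 2, 1)],  # (head_id, tail_id, rel_id)
    'valid_subs': [],
    'test_subs': [],
}

def decode(kg, triple):
    """Translate an id triple back into (head name, relation name, tail name)."""
    head_id, tail_id, rel_id = triple
    return kg['entities'][head_id], kg['relations'][rel_id], kg['entities'][tail_id]

for triple in toy_kg['train_subs']:
    print(decode(toy_kg, triple))

# Serialising and deserialising with pickle leaves the structure unchanged.
restored = pickle.loads(pickle.dumps(toy_kg))
```

Note the order of the ids: the tail comes second and the relation last, which is easy to get wrong when coming from subject-predicate-object notation.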

The `holographic-embeddings` GitHub repo comes with an example input file, `data/wn18.bin`, which contains the WN18 benchmark subset of WordNet. In the following executable cell, we show how to read and inspect this data:

In [0]:
import pickle
import os

with open('holographic-embeddings/data/wn18.bin', 'rb') as fin:
  wn18_data = pickle.load(fin)

for k in wn18_data:
  print(k, type(wn18_data[k]), len(wn18_data[k]), wn18_data[k][-3:])

The expected output should be similar to:

```
relations <class 'list'> 18 ['_synset_domain_region_of', '_verb_group', '_similar_to']
train_subs <class 'list'> 141442 [(5395, 37068, 9), (5439, 35322, 11), (28914, 1188, 10)]
entities <class 'list'> 40943 ['01164618', '02371344', '03788703']
test_subs <class 'list'> 5000 [(17206, 33576, 0), (1179, 11861, 0), (30287, 1443, 1)]
valid_subs <class 'list'> 5000 [(351, 25434, 0), (3951, 2114, 7), (756, 14490, 0)]
```
This shows that WN18 is represented as a graph of 40943 nodes (which we assume correspond to synsets) interlinked using 18 relation types (hence the name). The full set of triples has been split into 141K triples for training, and 5K triples each for testing and validation.

### Converting WordNet 3.0 into the required input format
WN18 is a bit dated and covers only part of WordNet; besides, it will be useful to have experience converting your own KG into the required input format. Hence, rather than simply reusing the `wn18.bin` input file, we will generate our own directly from the [NLTK WordNet API](http://www.nltk.org/howto/wordnet.html).

First we need to download WordNet:

In [0]:
import nltk
nltk.download('wordnet')

#### Explore WordNet API
Now that we have the KG, we can use the WordNet API to explore the graph. Refer to the [howto doc](http://www.nltk.org/howto/wordnet.html) for a more in-depth overview; here we only show a few methods that will be needed to generate our input file.

In [0]:
from nltk.corpus import wordnet as wn

The main nodes in WordNet are called synsets (synonym sets). These correspond roughly to *concepts*. You can find all the synsets related to a word like this:

In [0]:
wn.synsets('dog')

The output from the cell above shows how synsets are identified by the NLTK WordNet API. They have the form `<main-lemma>.<POS-code>.<sense-number>`. As far as we are aware, this is a format chosen by the implementors of the NLTK WordNet API and other APIs may choose diverging ways to refer to synsets. 
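
If you need the individual parts of such an identifier, a small helper is enough. The following is a sketch of our own (the function `parse_synset_name` is not part of NLTK); it splits from the right in case the lemma itself contains a dot.

```python
def parse_synset_name(name):
    """Split '<main-lemma>.<POS-code>.<sense-number>' into its three parts."""
    lemma, pos, sense = name.rsplit('.', 2)  # split from the right: lemmas may contain dots
    return lemma, pos, int(sense)

print(parse_synset_name('dog.n.01'))
print(parse_synset_name('adaxial.a.01'))
```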

You can get a list of all the synsets as follows (we only show the first 5):

In [0]:
for synset in list(wn.all_synsets())[:5]:
    print(synset.name())

Similarly, you can also get a list of all the lemma names (again we only show 5):

In [0]:
for lemma in list(wn.all_lemma_names())[5000:5005]:
    print(lemma)

For a given synset, you can find related synsets or lemmas by calling the function for each relation type. Below we provide three examples for the first sense of the adjective *adaxial*. In the first example, we see that this synset belongs to the `topic domain` `biology.n.01`, which is again a synset. In the second example, we see that it has two lemmas, which are relative to the synset. In the third example, we retrieve the lemmas in a form that is not relative to the synset, which is the one we will use later on.

In [0]:
wn.synset('adaxial.a.01').topic_domains()

In [0]:
wn.synset('adaxial.a.01').lemmas()

In [0]:
wn.synset('adaxial.a.01').lemma_names()

#### Entities and relations to include

The main nodes in WordNet are the synsets; however, lemmas can also be considered to be nodes in the graph. Hence, you need to decide which nodes to include. Since we are interested in capturing as much information as can be provided by WordNet, we will include both synsets and lemmas.

WordNet defines a large number of relations between synsets and lemmas. Again, you can decide to include all or just some of these. One particularity of WordNet is that many relations are defined twice: e.g. hypernym and hyponym are the same relation read in opposite directions. Since the inverse provides no additional information, we only include such relations once. The following cell defines all the relations we will take into account. We represent these as Python dictionaries, where the keys are the names of the relations and the values are functions that accept a `head` entity and produce a list of `tail` entities for that specific relation:

In [0]:
syn_relations = {
    'hyponym': lambda syn: syn.hyponyms(), 
    'instance_hyponym': lambda syn: syn.instance_hyponyms(),  
    'member_meronym': lambda syn: syn.member_meronyms(),
    'has_part': lambda syn: syn.part_meronyms(), 
    'topic_domain': lambda syn: syn.topic_domains(), 
    'usage_domain': lambda syn: syn.usage_domains(), 
    '_member_of_domain_region': lambda syn: syn.region_domains(),
    'attribute': lambda syn: syn.attributes(),
    'entailment': lambda syn: syn.entailments(),
    'cause': lambda syn: syn.causes(),
    'also_see': lambda syn: syn.also_sees(),
    'verb_group': lambda syn: syn.verb_groups(),
    'similar_to': lambda syn: syn.similar_tos()
}
lem_relations = {
    'antonym': lambda lem: lem.antonyms(),
    'derivationally_related_form': lambda lem: lem.derivationally_related_forms(),
    'pertainym': lambda lem: lem.pertainyms()
}

syn2lem_relations = {
    'lemma': lambda syn: syn.lemma_names()
}
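
Leaving out the inverse relations loses nothing, because they can be regenerated from the forward triples whenever they are needed. The sketch below shows this on a single named triple; the helper `invert_triples` and the `inverse_names` mapping are hypothetical and are not used elsewhere in this notebook.

```python
def invert_triples(triples, inverse_names):
    """For each (head, rel, tail) with a known inverse, derive (tail, inverse_rel, head)."""
    return [(t, inverse_names[r], h) for h, r, t in triples if r in inverse_names]

# One forward triple, written with names instead of ids for readability.
forward = [('canine.n.02', 'hyponym', 'dog.n.01')]
inverse_names = {'hyponym': 'hypernym'}  # assumed mapping, for illustration

print(invert_triples(forward, inverse_names))
```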

#### Triple generation

We are now ready to generate triples by using the WordNet API. Recall that `skge` requires triples of the form `(head_id, tail_id, rel_id)`, hence we will need some way of mapping entity (synset and lemma) names and relation types to unique ids. We therefore assume we will have an `entity_id_map` and a `rel_id_map`, which will map the entity name (or relation type) to an id. The following two cells implement functions which will iterate through the synsets and relations to generate the triples:

In [0]:
def generate_syn_triples(entity_id_map, rel_id_map):
  result = []
  for synset in list(wn.all_synsets()):
    h_id = entity_id_map.get(synset.name())
    if h_id is None:
      print('No entity id for ', synset)
      continue
    for synrel, srfn in syn_relations.items():
      r_id = rel_id_map.get(synrel)
      if r_id is None:
        print('No rel id for', synrel)
        continue
      for obj in srfn(synset):  
        t_id = entity_id_map.get(obj.name())
        if t_id is None:
          print('No entity id for object', obj)
          continue
        result.append((h_id, t_id, r_id))
    
    for rel, fn in syn2lem_relations.items():
      r_id = rel_id_map.get(rel)
      if r_id is None:
        print('No rel id for', rel)
        continue
      for obj in fn(synset):
        lem = obj.lower()
        t_id = entity_id_map.get(lem)
        if t_id is None:
          print('No entity id for object', obj, 'lowercased:', lem)
          continue
        result.append((h_id, t_id, r_id))
  return result

In [0]:
def generate_lem_triples(entity_id_map, rel_id_map):
  result = []
  for lemma in list(wn.all_lemma_names()):
    h_id = entity_id_map.get(lemma)
    if h_id is None:
      print('No entity id for lemma', lemma)
      continue
    _lems = wn.lemmas(lemma)
    for lemrel, lrfn in lem_relations.items():
      r_id = rel_id_map.get(lemrel)
      if r_id is None:
        print('No rel id for ', lemrel)
        continue
      for _lem in _lems:
        for obj in lrfn(_lem):
          t_id = entity_id_map.get(obj.name().lower())
          if t_id is None:
            print('No entity id for obj lemma', obj, obj.name())
            continue
          result.append((h_id, t_id, r_id))
  return result

#### Putting it all together
Now that we have methods for generating lists of triples, we can generate the input dictionary and serialise it. We need to:
  * create our lists of entities and relations, 
  * derive a map from entity and relation names to ids
  * generate the triples
  * split the triples into training, validation and test subsets
  * write the python dict to a serialised file
  
We implement this in the following method:

In [0]:
import random # for shuffling list of triples
      
def wnet30_holE_bin(out):
  """Creates a skge-compatible bin file for training HolE embeddings based on WordNet 3.0"""
  synsets = [synset.name() for synset in wn.all_synsets()]
  lemmas = [lemma for lemma in wn.all_lemma_names()]
  entities = list(synsets + list(set(lemmas)))
  print('Found %s synsets, %s lemmas, hence %s entities' % (len(synsets), len(lemmas), len(entities)))
  entity_id_map = {ent_name: id for id, ent_name in enumerate(entities)}
  n_entity = len(entity_id_map)
    
  print("N_ENTITY: %d" % n_entity)
    
  relations = list( list(syn_relations.keys()) + list(lem_relations.keys()) + list(syn2lem_relations.keys()))
  relation_id_map = {rel_name: id for id, rel_name in enumerate(relations)}
  n_rel = len(relation_id_map)
    
  print("N_REL: %d" % n_rel)
  print('relations', relation_id_map)
    
  syn_triples = generate_syn_triples(entity_id_map, relation_id_map)
  print("Syn2syn relations", len(syn_triples))
  lem_triples = generate_lem_triples(entity_id_map, relation_id_map)
  print("Lem2lem relations", len(lem_triples))
  all_triples = syn_triples + lem_triples
  print("All triples", len(all_triples))
  random.shuffle(all_triples)
    
  test_triple = all_triples[:500]
  valid_triple = all_triples[500:1000]
  train_triple = all_triples[1000:]
        
  to_pickle = {
      "entities": entities,
      "relations": relations,
      "train_subs": train_triple,
      "test_subs": test_triple,
      "valid_subs": valid_triple
  }
    
  with open(out, 'wb') as handle:
    pickle.dump(to_pickle, handle, protocol=pickle.HIGHEST_PROTOCOL)
        
  print("wrote to %s" % out)
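
The function above shuffles the triples without fixing a random seed, so the train, validation and test subsets differ from run to run. If you need a repeatable split, one option is sketched below. The helper `split_triples` is hypothetical; it also removes duplicate triples first, on the assumption that the same triple should not appear in more than one subset.

```python
import random

def split_triples(triples, n_test=500, n_valid=500, seed=42):
    """Deduplicate, shuffle with a fixed seed, and split into train/valid/test lists."""
    unique = sorted(set(triples))   # sorting makes the result independent of input order
    rng = random.Random(seed)       # private generator, leaves the global state untouched
    rng.shuffle(unique)
    test = unique[:n_test]
    valid = unique[n_test:n_test + n_valid]
    train = unique[n_test + n_valid:]
    return train, valid, test

# Toy data: 20 distinct triples plus one duplicate.
toy_triples = [(i, i + 1, 0) for i in range(20)] + [(0, 1, 0)]
train, valid, test = split_triples(toy_triples, n_test=3, n_valid=3, seed=0)
print(len(train), len(valid), len(test))
```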

#### Generate `wn30.bin`
Now we are ready to generate the  `wn30.bin` file which we can feed to the `HolE` algorithm implementation.

In [0]:
out_bin='/content/holographic-embeddings/data/wn30.bin'
wnet30_holE_bin(out_bin)

Notice that the resulting dataset now contains 265K entities, compared to 41K in WN18 (to be fair, only 118K of the entities are synsets).

## Learn the embeddings
Now, we will use the WordNet 3.0 dataset to learn embeddings for both synsets and lemmas. Since this is fairly slow, we only train for 2 epochs, which can take up to 10 minutes. (In the exercises at the end of this notebook, we provide a link to download pre-computed embeddings which have been trained for 500 epochs.)

In [0]:
wn30_holE_out='/content/wn30_holE_2e.bin'
holE_dim=150
num_epochs=2
!python /content/holographic-embeddings/kg/run_hole.py --fin {out_bin} --fout {wn30_holE_out} \
  --nb 100 --me {num_epochs} --margin 0.2 --lr 0.1 --ncomp {holE_dim}

The output should look similar to:
```
INFO:EX-KG:Fitting model HolE with trainer PairwiseStochasticTrainer and parameters Namespace(afs='sigmoid', fin='/content/holographic-embeddings/data/wn30.bin', fout='/content/wn30_holE_2e.bin', init='nunif', lr=0.1, margin=0.2, me=2, mode='rank', nb=100, ncomp=150, ne=1, no_pairwise=False, rparam=0, sampler='random-mode', test_all=10)
INFO:EX-KG:[  1] time = 120s, violations = 773683
INFO:EX-KG:[  2] time = 73s, violations = 334894
INFO:EX-KG:[  2] time = 73s, violations = 334894
INFO:EX-KG:[  2] VALID: MRR = 0.11/0.12, Mean Rank = 90012.28/90006.14, Hits@10 = 15.02/15.12
DEBUG:EX-KG:FMRR valid = 0.122450, best = -1.000000
INFO:EX-KG:[  2] TEST: MRR = 0.11/0.12, Mean Rank = 95344.42/95335.96, Hits@10 = 15.74/15.74
```
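
The log reports three ranking metrics: MRR (mean reciprocal rank), mean rank and Hits@10. Each is computed from the rank that the model assigns to the correct entity when it scores all candidate entities for a triple. The sketch below shows the standard definitions on made-up ranks; it is not the `skge` evaluation code. The two values printed per metric are presumably raw and filtered variants (the `FMRR` in the debug line suggests "filtered MRR"), but that is our reading of the log.

```python
def ranking_metrics(ranks, k=10):
    """Compute MRR, mean rank and Hits@k (as a percentage) from 1-based ranks."""
    n = len(ranks)
    mrr = sum(1.0 / rank for rank in ranks) / n
    mean_rank = sum(ranks) / n
    hits_at_k = 100.0 * sum(1 for rank in ranks if rank <= k) / n
    return mrr, mean_rank, hits_at_k

# Made-up ranks of the correct entity for four test triples.
example_ranks = [1, 2, 10, 100]
mrr, mean_rank, hits_at_10 = ranking_metrics(example_ranks)
print('MRR = %.4f, Mean Rank = %.2f, Hits@10 = %.2f' % (mrr, mean_rank, hits_at_10))
```

A few very badly ranked triples are enough to push the mean rank into the tens of thousands, which is why MRR and Hits@10 are usually the more informative numbers.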

## Inspect resulting embeddings

Now that we have trained the model, we can retrieve the embeddings for the entities and inspect them. 

### `skge` output file format
The output file is again a pickled serialisation of a Python dictionary. It contains the `model` itself, the results for the test and validation runs, and the execution times.

In [0]:
with open(wn30_holE_out, 'rb') as fin:
    hole_model = pickle.load(fin)
print(type(hole_model), len(hole_model))
for k in hole_model:
    print(k, type(hole_model[k]))

We are interested in the model itself, which is an instance of the `skge.hole.HolE` class and has various parameters. The entity embeddings are stored in parameter `E`, which is essentially an $n_e \times d$ matrix, where $n_e$ is the number of entities and $d$ is the dimension of each vector.

In [0]:
model = hole_model['model']
E = model.params['E']
print(type(E), E.shape)
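
Row `i` of this matrix holds the embedding of the `i`-th entry in the `entities` list of the input file, so looking up a vector by name only requires a name-to-index map. The following sketch uses a tiny made-up matrix in place of the trained `E`; the names `toy_entities`, `toy_E` and `embedding_for` are illustrative.

```python
import numpy as np

# Stand-ins for the real entity list and embedding matrix (3 entities, 4 dimensions).
toy_entities = ['dog.n.01', 'canine.n.02', 'dog']
toy_E = np.arange(12, dtype=float).reshape(3, 4)

# Map each entity name to its row index once, instead of calling list.index repeatedly.
index_of = {name: i for i, name in enumerate(toy_entities)}

def embedding_for(name):
    """Return the embedding row for the given entity name."""
    return toy_E[index_of[name]]

print(embedding_for('dog'))
```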

### Converting embeddings to more inspectable format
Unfortunately, `skge` does not provide methods for exploring the embedding space (KG embedding libraries are more geared towards prediction of relations). So, we will convert the embeddings into a format that is easier to explore. We first convert them into a pair of files for the vectors and the vocabulary, and we will then use the `swivel` library to explore the results.

We first read the list of entities. This is our **vocabulary** (i.e. the names of synsets and lemmas for which we have embeddings).

In [0]:
with open('/content/holographic-embeddings/data/wn30.bin', 'rb') as fin:
  wn30_data = pickle.load(fin)
entities = wn30_data['entities']
len(entities)

Next, we generate a vocab file and a `tsv` file where each line contains the word and a list of $d$ numbers.

In [0]:
vec_file = '/content/wn30_holE_2e.tsv'
vocab_file = '/content/wn30_holE_2e.vocab.txt'

with open(vocab_file, 'w', encoding='utf_8') as f:
  for i, w in enumerate(entities):
    word = w.strip()
    print(word, file=f)
    
with open(vec_file, 'w', encoding='utf_8') as f:
  for i, w in enumerate(entities):
    word = w.strip()
    embedding = E[i]
    print('\t'.join([word] + [str(x) for x in embedding]), file=f)
!wc -l {vec_file}

Now that we have these files, we can use `swivel`, which we already used in the first notebook, to inspect the embeddings.

#### Download tutorial materials and `swivel` (if necessary)
Download `swivel`. You can skip this step if you already executed the first notebook of this tutorial in this environment.

In [0]:
%cd /content
!git clone https://github.com/HybridNLP2018/tutorial

Use the  `swivel/text2bin` script to convert the `tsv` embeddings into `swivel`'s binary format.

In [0]:
vecbin = '/content/wn30_holE_2e.tsv.bin'
!python /content/tutorial/scripts/swivel/text2bin.py --vocab={vocab_file} --output={vecbin} \
        {vec_file}

Next, we can load the vectors using `swivel`'s `Vecs` class, which provides easy inspection of neighbors.

In [0]:
from tutorial.scripts.swivel import vecs
vectors = vecs.Vecs(vocab_file, vecbin)
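
Conceptually, a neighbour query ranks all entities by their cosine similarity to the query vector. The NumPy sketch below illustrates the idea on three made-up 2-dimensional vectors; it is not `swivel`'s actual implementation, and the function `cosine_neighbors` is our own.

```python
import numpy as np

def cosine_neighbors(query, names, matrix, k=3):
    """Return the k entries most similar to `query` as (cosine similarity, name) pairs."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)  # normalise each row
    sims = unit @ unit[names.index(query)]                         # dot product = cosine
    top = np.argsort(-sims)[:k]                                    # highest similarity first
    return [(float(sims[i]), names[i]) for i in top]

# Made-up vocabulary and vectors, for illustration only.
toy_names = ['a', 'b', 'c']
toy_matrix = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

for sim, name in cosine_neighbors('a', toy_names, toy_matrix):
    print('%.4f\t%s' % (sim, name))
```

The query itself always comes out on top with similarity 1, which is why the first row of each neighbour table is the query entity.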

#### Inspect a few example lemmas and synsets

In [0]:
import pandas as pd
pd.DataFrame(vectors.k_neighbors('california'))

In [0]:
wn.synsets('california')

In [0]:
pd.DataFrame(vectors.k_neighbors('california.n.01'))

In [0]:
pd.DataFrame(vectors.k_neighbors('conference'))

In [0]:
pd.DataFrame(vectors.k_neighbors('semantic'))

In [0]:
pd.DataFrame(vectors.k_neighbors('semantic.a.01'))

As you can see, the embeddings do not look very good at the moment. In part, this is because we only trained the model for 2 epochs. We have pre-calculated a set of HolE embeddings for 500 epochs, which you can download and inspect as part of an optional exercise below. Results for these are much better:

|    cosine sim    | entity  |
| ------------- |-------------|
| 1.0000 | lem_california |
| 0.4676 | lem_golden_state |
| 0.4327 | lem_ca |
| 0.4004 | lem_californian |
| 0.3838 | lem_calif. |
| 0.3500 | lem_fade |
| 0.3419 | lem_keystone_state |
| 0.3375 | wn31_antilles.n.01 |
| 0.3356 | wn31_austronesia.n.01 |
| 0.3340 | wn31_overbalance.v.02 |

For the synset for california, we also see 'sensible' results:

|    cosine sim    | entity  |
| ------------- |-------------|
| 1.0000 | wn31_california.n.01 |
| 0.4909 | wn31_nevada.n.01 |
| 0.4673 | wn31_arizona.n.01 |
| 0.4593 | wn31_tennessee.n.01 |
| 0.4587 | wn31_new_hampshire.n.01 |
| 0.4555 | wn31_sierra_nevada.n.02 |
| 0.4073 | wn31_georgia.n.01 |
| 0.4048 | wn31_west_virginia.n.01 |
| 0.3991 | wn31_north_carolina.n.01 |
| 0.3977 | wn31_virginia.n.01 |

One thing to notice here is that all of the top 10 closely related entities for `california.n.01` are also synsets. Similarly for lemma `california`, the most closely related entities are also lemmas, although some synsets also made it into the top 10 neighbours. This may indicate a tendency of `HolE` to keep lemmas close to other lemmas and synsets close to other synsets. In general, choices about how nodes in the KG are related will affect how their embeddings are interrelated.
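
This tendency can be quantified by counting how many neighbours have the same type as the query. The sketch below does this for the `lem_california` neighbours listed in the first table above, relying on the `lem_` and `wn31_` prefixes to tell lemmas and synsets apart; the helper names are our own.

```python
def entity_type(name):
    """Classify an entity by its prefix: 'lem_' marks lemmas, anything else is a synset."""
    return 'lemma' if name.startswith('lem_') else 'synset'

def same_type_fraction(query, neighbors):
    """Fraction of neighbours that have the same type (lemma or synset) as the query."""
    same = sum(1 for n in neighbors if entity_type(n) == entity_type(query))
    return same / len(neighbors)

# The nine neighbours of lem_california from the table above (query itself excluded).
lemma_neighbors = [
    'lem_golden_state', 'lem_ca', 'lem_californian', 'lem_calif.', 'lem_fade',
    'lem_keystone_state', 'wn31_antilles.n.01', 'wn31_austronesia.n.01',
    'wn31_overbalance.v.02',
]
print(same_type_fraction('lem_california', lemma_neighbors))
```

Running the same count over many queries would give a more reliable picture than two hand-picked examples.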

# Conclusion and exercises

In this notebook we provided an overview of recent knowledge graph embedding approaches and showed how to use existing implementations to generate word and concept embeddings for WordNet 3.0. 

## Exercise: train embeddings on your own KG
If you have a KG of your own, you can adapt the code shown above to generate a graph representation as expected by `skge` and you can train your embeddings in this way. Popular KGs are Freebase and DBpedia.
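
As a starting point, the following sketch converts a list of named `(subject, predicate, object)` triples into the dictionary layout that `skge` expects. The function `triples_to_skge` and the two DBpedia-style example triples are illustrative assumptions; for brevity, all triples go into `train_subs`, so you would still need to set aside validation and test subsets.

```python
def triples_to_skge(named_triples):
    """Build the skge input dictionary from (subject, predicate, object) name triples."""
    entities = sorted({x for s, _, o in named_triples for x in (s, o)})
    relations = sorted({p for _, p, _ in named_triples})
    ent_id = {name: i for i, name in enumerate(entities)}
    rel_id = {name: i for i, name in enumerate(relations)}
    # skge expects (head_id, tail_id, rel_id), i.e. the relation comes last.
    subs = [(ent_id[s], ent_id[o], rel_id[p]) for s, p, o in named_triples]
    return {
        'entities': entities,
        'relations': relations,
        'train_subs': subs,
        'valid_subs': [],
        'test_subs': [],
    }

kg = triples_to_skge([
    ('dbr:Madrid', 'dbo:country', 'dbr:Spain'),
    ('dbr:Barcelona', 'dbo:country', 'dbr:Spain'),
])
print(kg['entities'])
print(kg['train_subs'])
```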

## Exercise: inspect embeddings for pre-calculated WordNet 3.0
We have used code similar to that shown above to train embeddings for 500 epochs using HolE. You can execute the following cells to download and explore these embeddings. The embeddings are about 142MB, so downloading them may take a few minutes.

In [0]:
!mkdir /content/vec/
%cd /content/vec/
!wget https://zenodo.org/record/1446214/files/wn-en-3.0-HolE-500e-150d.tar.gz
!tar -xzf wn-en-3.0-HolE-500e-150d.tar.gz

In [0]:
%ls /content/vec

The downloaded tar contains a `tsv.bin` and a `vocab` file like the one we created above. We can use it to load the vectors using `swivel`'s `Vecs`:

In [0]:
vocab_file = '/content/vec/wn-en-3.1-HolE-500e.vocab.txt'
vecbin = '/content/vec/wn-en-3.1-HolE-500e.tsv.bin'
wnHolE = vecs.Vecs(vocab_file, vecbin)

Now you are ready to start exploring. The only thing to notice is that we have added the prefix `lem_` to all lemmas and `wn31_` to all synsets, as shown in the following examples:

In [0]:
pd.DataFrame(wnHolE.k_neighbors('lem_california'))

In [0]:
pd.DataFrame(wnHolE.k_neighbors('wn31_california.n.01'))