# KG curation, interlinking and multilinguality

In this notebook we look at how embeddings can be used in curation of Knowledge Graphs, in particular in tasks such as graph completion and alignment.

## KG completion

Knowledge Graph completion is the task of predicting whether an existing, incomplete graph should gain an edge between two specific nodes. For example, in DBpedia, you may want to generate new links between pages and categories.

Although embeddings for KGs are more suitable for this kind of task, word (and cross-modal) embeddings can also provide valuable input.
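
As a rough illustration of how embeddings can feed into completion, candidate links can be ranked by the cosine similarity between the embeddings of their endpoints. The sketch below is only a similarity heuristic, not a trained link-prediction model; the node names, the 3-dimensional vectors and the helper `rank_candidate_links` are all made up for the example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings; real ones would come from a trained model.
emb = {
    'Madrid': np.array([0.9, 0.1, 0.0]),
    'Capitals_in_Europe': np.array([0.8, 0.2, 0.1]),
    'Chemical_elements': np.array([0.0, 0.1, 0.9]),
}

def rank_candidate_links(node, candidates, emb):
    """Sort candidate target nodes by similarity to `node`, best first."""
    scored = [(c, cosine(emb[node], emb[c])) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_candidate_links(
    'Madrid', ['Capitals_in_Europe', 'Chemical_elements'], emb)
for candidate, score in ranking:
    print(candidate, round(score, 3))
```

In practice the ranked candidates would still need a threshold or a human curator before any link is added to the graph.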

## Multilingual KG alignment

If you have multiple KGs that need to be aligned, you may be able to use *embedding alignment techniques*.

## Linear alignment
The most straightforward alignment between two embedding spaces can be achieved by using a *translation matrix*, as shown in (Mikolov et al, 2013). Basically, a translation matrix W is such that **z=Wx**, where z is a vector belonging to the target vector space and x is the equivalent in the source.

To calculate the translation matrix, you need a **dictionary** that provides mappings for a subset of your vocabularies.

You can then use standard linear algebra routines to compute the pseudo-inverse and, from it, the translation matrix.
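
To make the pseudo-inverse step concrete, here is a minimal sketch on synthetic data. It assumes vectors are stored as rows (as in the rest of this notebook), so the matrix being estimated is the transpose of the $W$ in **z=Wx**. The sizes, the random data and the names `W_true`, `W_rows` and `W_lstsq` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim = 50, 4                     # toy sizes, chosen for illustration

X = rng.normal(size=(n_pairs, dim))      # source vectors, one per row
W_true = rng.normal(size=(dim, dim))     # hypothetical "true" linear map
Z = X @ W_true.T                         # row i is W_true applied to X[i]

# Least-squares estimate in row form: Z ≈ X @ W_rows, with W_rows = W.T
W_rows = np.linalg.pinv(X) @ Z

# The dedicated least-squares solver should give the same matrix
W_lstsq, *_ = np.linalg.lstsq(X, Z, rcond=None)

print(np.allclose(W_rows, W_true.T))
print(np.allclose(W_rows, W_lstsq))
```

With noise-free synthetic pairs the map is recovered up to floating-point error; with real embeddings the dictionary pairs only hold approximately, so the result is a best fit rather than an exact solution.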

For best results, it is recommended to use parallel corpora (so that the same words are encoded in similar ways) or very large corpora.

In the following example, we use pre-generated embeddings for the most frequent 5K lemmas in the *United Nations parallel corpus*. 

We first load the vectors. 

In [0]:
%cd /content
!git clone https://github.com/HybridNLP2018/tutorial.git
from tutorial.scripts.swivel import vecs
import os
import pandas as pd
import numpy as np
from IPython.display import display

/content
Cloning into 'tutorial'...
remote: Enumerating objects: 592, done.
remote: Total 592 (delta 0), reused 0 (delta 0), pack-reused 592
Receiving objects: 100% (592/592), 47.53 MiB | 39.32 MiB/s, done.
Resolving deltas: 100% (337/337), done.


In [0]:
en_path = '/content/tutorial/datasamples/UNv1.0/en_lemma_5k/'
es_path = '/content/tutorial/datasamples/UNv1.0/es_lemma_5k/'
en_vecs = vecs.Vecs(en_path + 'vocab.txt', 
            en_path + 'vecs.bin')
es_vecs = vecs.Vecs(es_path + 'vocab.txt',
            es_path + 'vecs.bin')

Opening vector with expected size 5000 from file /content/tutorial/datasamples/UNv1.0/en_lemma_5k/vocab.txt
vocab size 5000 (unique 5000)
read rows
Opening vector with expected size 5000 from file /content/tutorial/datasamples/UNv1.0/es_lemma_5k/vocab.txt
vocab size 5000 (unique 5000)
read rows


Let's check a couple of words in each embedding space as we have done in previous notebooks:

In [0]:
import pandas as pd
pd.DataFrame(en_vecs.k_neighbors('knowledge'))

Unnamed: 0,cosim,word
0,1.0,knowledge
1,0.631812,skill
2,0.603642,know-how
3,0.574704,sharing
4,0.537305,information
5,0.536732,learning
6,0.534542,innovation
7,0.533146,technology
8,0.53126,understanding
9,0.513664,science


In [0]:
pd.DataFrame(es_vecs.k_neighbors('conocimiento'))

Unnamed: 0,cosim,word
0,1.0,conocimiento
1,0.780866,conocimientos
2,0.603392,aptitud
3,0.586549,comprensión
4,0.557678,intercambio
5,0.537809,capacidad
6,0.526911,difusión
7,0.525315,científico
8,0.521962,información
9,0.516031,fomentar


Besides the embeddings for English and Spanish, we also provide a **dictionary** that was generated automatically to map 1K English lemmas into Spanish.

In [0]:
%ls /content/tutorial/datasamples/UNv1.0/
en2es_dict_path = '/content/tutorial/datasamples/UNv1.0/en2es-lemma-dict-1k.txt'
!head -n 5 {en2es_dict_path}

en2es-lemma-dict-1k.txt  en_lemma_5k/  es_lemma_5k/
be:ser
by:por conducto de
report:informe
state:estado
country:estado


Let's load the dictionary into a python object.

In [0]:
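def load_dict(path, invert=False):
    """Read `key:value` lines into a dict; with invert=True, map value -> key."""
    result = {}
    with open(path, 'r', encoding='utf-8') as lines:
        for line in lines:
            # split on the first colon only, so a colon in the value cannot break parsing
            key, val = line.rstrip('\n').split(':', 1)
            if invert:
                # later lines overwrite earlier ones that share the same value
                result[val] = key
            else:
                result[key] = val
    return result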

In [0]:
en2es = load_dict(en2es_dict_path)
es2en = load_dict(en2es_dict_path, invert=True)
len(en2es), len(es2en)

(1000, 882)

We can see from the reported numbers that some English lemmas were mapped to the same Spanish lemma.

Let's inspect some of the entries in the dictionary:

In [0]:
# use `start`/`end` rather than `min`/`max` to avoid shadowing the builtins
start = 5
end = start + 5
for en in list(en2es)[start:end]:
    print(en, '->', en2es[en])
print('')
for es in list(es2en)[start:end]:
    print(es, '->', es2en[es])

also -> también
provide -> proporcionar
all -> todo
development -> intensificación
other -> otro

proporcionar -> supply
todo -> all
intensificación -> development
otro -> another
programar -> programme
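
The output above shows `provide -> proporcionar` in one direction but `proporcionar -> supply` in the other. This is a side effect of inverting a many-to-one dictionary: when several English lemmas share a Spanish translation, only the last one read survives. It is also why the inverted dictionary is smaller than the original. A minimal sketch, using three of the entries printed from the head of the dictionary file (the variable names are made up for the example):

```python
# Entries taken from the head of the dictionary file shown earlier
en2es_sample = {'state': 'estado', 'country': 'estado', 'report': 'informe'}

es2en_sample = {}
for en, es in en2es_sample.items():
    es2en_sample[es] = en        # a later entry overwrites an earlier one

print(len(en2es_sample), len(es2en_sample))
print(es2en_sample['estado'])
```

Here `estado` ends up mapped to `country`, and `state` no longer appears in the inverted dictionary.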


In order to create the translation matrix, we need to create two **aligned** matrices:
  - $M_{en}$ will contain $n$ English embeddings from the dictionary
  - $M_{es}$ will contain $n$ Spanish embeddings from the dictionary
  
However, since the dictionary was generated automatically, it may be the case that some of the entries in the dictionary are not in the English or Spanish vocabularies. We only need the `id`s in the respective `vecs`:

In [0]:
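en_dict_ids = []
es_dict_ids = []
es_dict_voc = []
for es in es2en:
    # .get() returns None for lemmas missing from the vocabulary
    es_id = es_vecs.word_to_idx.get(es)
    en_id = en_vecs.word_to_idx.get(es2en[es])
    # NB: this truthiness test also skips a lemma whose id is 0;
    # `is not None` would keep it (and could change the count below)
    if en_id and es_id:
        es_dict_voc.append(es)
        en_dict_ids.append(en_id)
        es_dict_ids.append(es_id)
print(len(en_dict_ids), len(es_dict_ids))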

477 477


Starting from the 1K dictionary entries ($882$ once inverted), only $477$ pairs have both lemmas in the English and the Spanish `vecs`. In order to verify that the translation works, we split these into $450$ pairs that we will use to calculate the translation matrix and keep the remaining $27$ for testing:

In [0]:
train_en_dict_ids = en_dict_ids[:450]
train_es_dict_ids = es_dict_ids[:450]
test_en_ids = en_dict_ids[450:] 
test_es_ids = es_dict_ids[450:]
print(len(train_en_dict_ids), len(test_en_ids))

450 27


Before calculating the translation matrix, let's verify that we need one. We choose 3 example words:
  - *conocimientos* and *proporcionar* are in the training set,
  - *tema* is in the test set.
  
For each word, we get:
 - the nearest English neighbors of the Spanish vector, looked up directly in the English vector space
 - the nearest English neighbors of the English translation according to the dictionary

In [0]:
es_examples = ['conocimientos', 'proporcionar', 'tema']
k = 5
for es in es_examples:
    print(es, '->', es2en[es])
    print('top k for Spanish vector in English vector space:')
    # untranslated Spanish vector, queried directly against the English space
    df1 = pd.DataFrame(en_vecs.k_neighbors(es_vecs.lookup(es), k=k, result_key_suffix='_es_vec'))
    print('top k for English translation in English vector space:')
    # reference: neighbors of the dictionary translation
    df2 = pd.DataFrame(en_vecs.k_neighbors(es2en[es], k=k, result_key_suffix='_en'))
    # show both neighbor lists side by side
    df3 = pd.concat([df1, df2], axis=1)
    display(df3)
    print('')

conocimientos -> knowledge
top k for Spanish vector in English vector space:
top k for English translation in English vector space:


Unnamed: 0,cosim_es_vec,word_es_vec,cosim_en,word_en
0,0.195447,jewish,1.0,knowledge
1,0.194971,concept,0.631812,skill
2,0.185432,once,0.603642,know-how
3,0.183663,theme,0.574704,sharing
4,0.183211,cross,0.537305,information
5,0.175961,sister,0.536732,learning
6,0.175771,saudi,0.534542,innovation
7,0.172918,business,0.533146,technology
8,0.169612,united kingdom,0.53126,understanding
9,0.165519,pronounce,0.513664,science



proporcionar -> supply
top k for Spanish vector in English vector space:
top k for English translation in English vector space:


Unnamed: 0,cosim_es_vec,word_es_vec,cosim_en,word_en
0,0.222659,candidate,1.0,supply
1,0.19556,arrest warrant,0.748887,supplies
2,0.187525,king,0.542984,spare part
3,0.185336,trading,0.537591,purchase
4,0.183837,selection,0.500451,fuel
5,0.183204,select,0.499277,medical
6,0.179192,commit,0.483074,transportation
7,0.179038,pool,0.482907,ration
8,0.17421,rule,0.48097,service
9,0.172928,business plan,0.470521,shortage



tema -> theme
top k for Spanish vector in English vector space:
top k for English translation in English vector space:


Unnamed: 0,cosim_es_vec,word_es_vec,cosim_en,word_en
0,0.204613,accumulate,1.0,theme
1,0.202783,per cent,0.695908,topic
2,0.190149,wood,0.636073,panel discussion
3,0.179899,go on,0.63466,thematic
4,0.174604,than,0.612211,cross-cutting
5,0.174476,ten,0.565744,sustainable development
6,0.171583,accumulation,0.562757,round table
7,0.167805,correctly,0.547525,discussion
8,0.167107,vision,0.541553,high-level
9,0.166883,scene,0.536971,focus





Clearly, simply using the Spanish vector in the English space does not work. Let's get the matrices:

In [0]:
m_en = en_vecs.vecs[train_en_dict_ids]
m_es = es_vecs.vecs[train_es_dict_ids]
print(m_en.shape, m_es.shape)

(450, 300) (450, 300)


As expected, we get two matrices of $450 \times 300$, since embeddings are of dimension $300$ and we have $450$ training examples. Now, we can calculate the translation matrix and define a method for linearly translating a point in the Spanish embedding space into a point in the English embedding space.

In [0]:
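# least-squares solution of m_es · tm ≈ m_en (vectors are stored as rows)
tm_es2en = np.linalg.pinv(m_es).dot(m_en)
def es_vec_to_en_vec(es_vec):
    # map a Spanish row vector (or a matrix of them) into the English space
    return np.dot(es_vec, tm_es2en)
print(tm_es2en.shape)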

(300, 300)


As we can see, the translation matrix is just a $300 \times 300$ matrix.
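
In matrix form, the cell above solves a least-squares problem. Writing $T$ for `tm_es2en` and $M_{es}^{+}$ for the pseudo-inverse of $M_{es}$ (both symbols are introduced here for illustration):

$$
T = M_{es}^{+} M_{en}, \qquad T \in \arg\min_{T'} \lVert M_{es} T' - M_{en} \rVert_F^2
$$

There are $450$ training pairs but only $300$ unknowns per output dimension, so in general the system has no exact solution and $T$ is a best fit. Because the vectors are stored as rows, a Spanish vector $x$ is mapped as $x^\top T$; in the column convention **z=Wx** used earlier, this corresponds to $W = T^\top$.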

Now that we have the translation matrix, let's inspect the example words to see how it performs:

In [0]:
k = 5
for es in es_examples:
    print(es, '->', es2en[es])
    print('\t%s: Spanish vector for "%s" in English vector space' % ('es_vec', es))
    df1 = pd.DataFrame(en_vecs.k_neighbors(es_vecs.lookup(es), k=k, result_key_suffix='_es_vec'))
    print('\t%s: English vector for "%s" in English vector space' % ('en', es2en[es]))
    df2 = pd.DataFrame(en_vecs.k_neighbors(es2en[es], k=k, result_key_suffix='_en'))
    print('\t%s: Spanish vector for "%s" *mapped* to English vector space using tm_es2en' % ('tm_es_vec', es))
    # same query as df1, but after applying the translation matrix
    df3 = pd.DataFrame(en_vecs.k_neighbors(es_vec_to_en_vec(es_vecs.lookup(es)), k=k, result_key_suffix='_tm_es_vec'))
    df4 = pd.concat([df1, df2, df3], axis=1)
    display(df4)
    print('')

conocimientos -> knowledge
	es_vec: Spanish vector for "conocimientos" in English vector space
	en: English vector for "knowledge" in English vector space
	tm_es_vec: Spanish vector for "conocimientos" *mapped* to English vector space using tm_es2en


Unnamed: 0,cosim_es_vec,word_es_vec,cosim_en,word_en,cosim_tm_es_vec,word_tm_es_vec
0,0.195447,jewish,1.0,knowledge,0.894568,knowledge
1,0.194971,concept,0.631812,skill,0.652778,skill
2,0.185432,once,0.603642,know-how,0.597379,know-how
3,0.183663,theme,0.574704,sharing,0.581842,technology
4,0.183211,cross,0.537305,information,0.568328,capacity
5,0.175961,sister,0.536732,learning,0.566806,information
6,0.175771,saudi,0.534542,innovation,0.564072,technical
7,0.172918,business,0.533146,technology,0.563809,sharing
8,0.169612,united kingdom,0.53126,understanding,0.551107,scientific
9,0.165519,pronounce,0.513664,science,0.545087,training



proporcionar -> supply
	es_vec: Spanish vector for "proporcionar" in English vector space
	en: English vector for "supply" in English vector space
	tm_es_vec: Spanish vector for "proporcionar" *mapped* to English vector space using tm_es2en


Unnamed: 0,cosim_es_vec,word_es_vec,cosim_en,word_en,cosim_tm_es_vec,word_tm_es_vec
0,0.222659,candidate,1.0,supply,0.742566,supply
1,0.19556,arrest warrant,0.748887,supplies,0.577888,supplies
2,0.187525,king,0.542984,spare part,0.442933,food
3,0.185336,trading,0.537591,purchase,0.427956,provision
4,0.183837,selection,0.500451,fuel,0.423716,provide
5,0.183204,select,0.499277,medical,0.419686,service
6,0.179192,commit,0.483074,transportation,0.419077,purchase
7,0.179038,pool,0.482907,ration,0.41131,medical
8,0.17421,rule,0.48097,service,0.411244,spare part
9,0.172928,business plan,0.470521,shortage,0.401046,transportation



tema -> theme
	es_vec: Spanish vector for "tema" in English vector space
	en: English vector for "theme" in English vector space
	tm_es_vec: Spanish vector for "tema" *mapped* to English vector space using tm_es2en


Unnamed: 0,cosim_es_vec,word_es_vec,cosim_en,word_en,cosim_tm_es_vec,word_tm_es_vec
0,0.204613,accumulate,1.0,theme,0.772404,theme
1,0.202783,per cent,0.695908,topic,0.610194,topic
2,0.190149,wood,0.636073,panel discussion,0.602442,session
3,0.179899,go on,0.63466,thematic,0.585061,discussion
4,0.174604,than,0.612211,cross-cutting,0.576319,thematic
5,0.174476,ten,0.565744,sustainable development,0.550427,agenda
6,0.171583,accumulation,0.562757,round table,0.549161,high-level
7,0.167805,correctly,0.547525,discussion,0.539559,panel discussion
8,0.167107,vision,0.541553,high-level,0.538621,meeting
9,0.166883,scene,0.536971,focus,0.533962,discuss
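
The tables above only inspect three words by eye. The held-out test pairs make it possible to put a number on the alignment, for instance precision@k: the fraction of test words whose dictionary translation is among the $k$ nearest neighbors of the mapped vector. The function below is a sketch of that metric, not part of the tutorial code; `precision_at_k` and the toy data used to exercise it are assumptions.

```python
import numpy as np

def precision_at_k(mapped, target_vecs, gold_ids, k=1):
    """Fraction of rows of `mapped` whose gold target id is among the
    k nearest rows of `target_vecs` by cosine similarity."""
    mapped_n = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    target_n = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    sims = mapped_n @ target_n.T                    # (n_queries, n_targets)
    top_k = np.argsort(-sims, axis=1)[:, :k]        # ids of the k best targets
    hits = [gold in row for gold, row in zip(gold_ids, top_k)]
    return float(np.mean(hits))

# Toy check: a near-perfect mapping should retrieve every gold target
rng = np.random.default_rng(0)
target = rng.normal(size=(20, 5))
gold = np.array([3, 7, 11])
mapped = target[gold] + 0.01 * rng.normal(size=(3, 5))
score = precision_at_k(mapped, target, gold, k=1)
print(score)
```

With the notebook's variables, a call along the lines of `precision_at_k(es_vec_to_en_vec(es_vecs.vecs[test_es_ids]), en_vecs.vecs, test_en_ids, k=5)` should apply, although that call has not been run here and $27$ test pairs give only a coarse estimate.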





## Non-linear alignment

The linear alignment seems to work OK for this set of embeddings. In our experience, when dealing with larger vocabularies (and vocabularies mixing lemmas and concepts), this approach does not scale well, since the only parameters available to the model are those of the $d \times d$ translation matrix.

For such cases it is possible to follow the same approach, but instead of deriving a pseudo-inverse matrix, we train a neural network to learn a non-linear translation function. The non-linearities can be introduced by using activation functions such as ReLUs.
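
To give a feel for what such a network looks like, the sketch below trains a one-hidden-layer ReLU network with plain gradient descent on synthetic vector pairs. It is not the architecture from the paper referenced in this section: the data, layer sizes, learning rate and number of steps are arbitrary assumptions, and a real setup would more likely use a deep learning framework and a held-out validation set.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim, hidden_dim = 200, 8, 32     # toy sizes (assumptions)

# Toy "source" vectors and a hypothetical non-linear relation to the "target" space
X = rng.normal(size=(n_pairs, dim))
Z = np.tanh(X @ rng.normal(size=(dim, dim)))

# Parameters of the network: x -> ReLU(x W1 + b1) W2 + b2
W1 = rng.normal(scale=0.1, size=(dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, dim))
b2 = np.zeros(dim)

def forward(X):
    """Return the hidden activations and the predicted target vectors."""
    hidden = np.maximum(0.0, X @ W1 + b1)
    return hidden, hidden @ W2 + b2

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

initial_loss = mse(forward(X)[1], Z)

learning_rate = 0.1
for step in range(1000):
    hidden, pred = forward(X)
    # gradient of the mean squared error with respect to the predictions
    grad_pred = 2.0 * (pred - Z) / pred.size
    grad_W2 = hidden.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    # back-propagate through the ReLU: no gradient where the unit was inactive
    grad_hidden = grad_pred @ W2.T
    grad_hidden[hidden <= 0.0] = 0.0
    grad_W1 = X.T @ grad_hidden
    grad_b1 = grad_hidden.sum(axis=0)
    # full-batch gradient descent update
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

final_loss = mse(forward(X)[1], Z)
print(initial_loss, final_loss)
```

The training pairs play the same role as the dictionary pairs in the linear case; only the function being fitted changes.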

See [Towards a Vecsigrafo: Portable Semantics in Knowledge-based Text Analytics](https://pdfs.semanticscholar.org/b0d6/197940d8f1a5fa0d7474bd9a94bd9e44a0ee.pdf) for more details.

## Example: Cross-modal embeddings

In [Thoma, S., Rettinger, A., & Both, F. (2017). Towards Holistic Concept Representations: Embedding Relational Knowledge, Visual Attributes, and Distributional Word Semantics. In International Semantic Web Conference. Vienna, Austria.](https://pdfs.semanticscholar.org/413e/b0b519ac18ec86aaa290c86553291fae7ea2.pdf), a cross-modal embedding is generated for 1538 concepts.

![Cross-modal embeddings](https://github.com/HybridNLP2018/tutorial/blob/master/images/cross-modal-embedding.PNG?raw=1)

As part of their evaluations, the authors studied the problem of entity-type prediction (a subtask of KG completion), using a subgraph of DBpedia that provided coverage for the 1538 concepts. Their results were:

![TriM1538 entity-type prediction results](https://github.com/HybridNLP2018/tutorial/blob/master/images/TriM1538-entity-type-pred-results.PNG?raw=1)

The results show a clear improvement when using multi-modal embeddings, compared to just using the KG embeddings.

In [notebook 08](https://colab.research.google.com/github/HybridNLP2018/tutorial/blob/master/08_scientific_information_management.ipynb) of this tutorial you will see another possible way of exploiting cross-modality.