# CSKG embeddings

This notebook compares graph and text embeddings computed over CSKG.

**Graph embeddings** The graph embeddings have been computed by the command:

`python embeddings/embedding_click.py -i input/kgtk_framenet.tsv -o output/kgtk_framenet`

using the `embeddings/embedding_click.py` script in this repository.

We are currently integrating this function into the CSKG package.

**Text embeddings** The text embeddings were computed by using the KGTK `text-embedding` command as follows:
```
TBA!
```

In [3]:
from gensim.models import KeyedVectors, Word2Vec
import h5py
from annoy import AnnoyIndex

In [4]:
# Dimension of the embeddings - choose one of 100, 300, 400
dim=400

In [5]:
tsv_filename='../output/embeddings/entity_embedding_%d.tsv' % dim
#filename='../output/embeddings/embeddings_all_0.v100.h5'

In [6]:
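# Build an Annoy index over the entity embeddings (Euclidean distance)
t = AnnoyIndex(dim, 'euclidean')  # dim = length of each item vector to be indexed
node2id = {}
id2node = {}
with open(tsv_filename, 'r') as f:
    for i, line in enumerate(f):
        # Each line is assumed to hold a node identifier followed by its float values
        node, *data = line.split()
        t.add_item(i, [float(d) for d in data])
        node2id[node] = i
        id2node[i] = node
t.build(10)  # 10 trees
t.save('complex_%d.ann' % dim)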

True

In [7]:
u = AnnoyIndex(dim,'euclidean')
u.load('complex_%d.ann' % dim) # super fast, will just mmap the file


True
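
An Annoy index stores only integer item ids and their vectors, so the `node2id` and `id2node` dictionaries are not part of the saved `.ann` file. A minimal sketch, assuming the index should be reusable in a fresh session, is to store the mapping next to it as JSON. The helper names `save_mapping` and `load_mapping` are hypothetical and not part of this repository.

```python
import json

def save_mapping(id2node, path):
    """Write the id -> node mapping as a JSON list, ordered by id."""
    # Assumes ids are the consecutive integers 0..n-1, as in the indexing loop above.
    nodes = [id2node[i] for i in range(len(id2node))]
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(nodes, f, ensure_ascii=False)

def load_mapping(path):
    """Rebuild both dictionaries from the JSON list written by save_mapping."""
    with open(path, 'r', encoding='utf-8') as f:
        nodes = json.load(f)
    id2node = dict(enumerate(nodes))
    node2id = {node: i for i, node in id2node.items()}
    return node2id, id2node
```

With this in place, one could call `save_mapping(id2node, 'complex_%d.json' % dim)` after `t.save(...)` and `load_mapping(...)` after `u.load(...)`; the JSON file name here is only an example.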

In [20]:
try_this='/c/en/man'
try_id=node2id[try_this]

In [21]:
try_id

222090

In [22]:
[id2node[i] for i in u.get_nns_by_item(try_id, 10)]

['/c/en/man',
 '/c/en/stavnsbånd/n',
 '/c/en/moneyman',
 '/c/en/satyromaniac/n',
 '/c/en/youth/n',
 '/c/en/plainclotheswoman/n',
 '/c/en/ladies_auxiliary/n',
 '/c/en/man_child/n',
 '/c/en/girl/n',
 '/c/en/heemraad/n']
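
Annoy returns approximate nearest neighbours, so for a small sample it can be useful to compare its output against an exact search. The following is a minimal sketch in plain Python, assuming the embeddings of interest fit in memory as a dict from node to vector. `exact_neighbors` is a hypothetical helper, and the toy vectors are made up purely for illustration.

```python
import heapq
import math

def exact_neighbors(vectors, query, k=10):
    """Return the k nodes closest to `query` by Euclidean distance (exact, linear scan)."""
    q = vectors[query]
    # The query node itself is included, mirroring the Annoy result above.
    return heapq.nsmallest(k, vectors, key=lambda node: math.dist(q, vectors[node]))

# Made-up 2-dimensional vectors, not taken from the CSKG embeddings
toy = {
    'a': [0.0, 0.0],
    'b': [1.0, 0.0],
    'c': [0.0, 3.0],
    'd': [5.0, 5.0],
}
print(exact_neighbors(toy, 'a', k=3))  # ['a', 'b', 'c']
```

A linear scan like this is too slow for the full graph, but it gives a reference ranking against which the approximate neighbours for a handful of nodes can be checked.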