# Exercise 09: Word2Vec example

The data file `terms.tsv` has 10K entries, a subset of a **much** larger file.
These entries are the keyphrases from 843 unique documents.
Realistically, a *Word2Vec* model needs many more documents before its results begin to make sense.

Even so, this is enough to show how to call the functions from [gensim](https://radimrehurek.com/gensim/models/word2vec.html).

In [None]:
import csv
import gensim
import logging
import sys

model_file = "model.dat"
term_path = "terms.tsv"

Load the parsed keyphrases into a list called `sentences`, where each "sentence" is the list of keyphrases from one document.

In [None]:
sentences = []
sent = []
last_doc = None

with open(term_path) as f:
    for term, doc, rank in csv.reader(f, delimiter="\t"):
        rank = float(rank)

        if doc != last_doc:
            if last_doc:
                sentences.append(sent)
                sent = []

            last_doc = doc

        sent.append(term)

    # handle the dangling last element
    sentences.append(sent)

print(len(sentences))
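
The grouping step above can also be sketched with `itertools.groupby`, assuming (as the loop does) that all rows for one document are adjacent in the file. The sample rows and the `group_terms` helper below are hypothetical, invented only to show the shape of the result; they are not taken from `terms.tsv`.

```python
import csv
import io
from itertools import groupby

# Hypothetical sample in the same three-column layout: term, doc, rank
SAMPLE_TSV = (
    "machine learning\tdoc1\t0.9\n"
    "neural network\tdoc1\t0.7\n"
    "graph database\tdoc2\t0.8\n"
    "query planner\tdoc2\t0.5\n"
    "data pipeline\tdoc3\t0.6\n"
)

def group_terms(lines):
    """Return one list of terms per document, assuming rows are ordered by document."""
    reader = csv.reader(lines, delimiter="\t")
    # groupby only merges *adjacent* rows that share the same doc value
    return [
        [term for term, _doc, _rank in rows]
        for _, rows in groupby(reader, key=lambda row: row[1])
    ]

sample_sentences = group_terms(io.StringIO(SAMPLE_TSV))
print(sample_sentences)
```

Each inner list plays the role of one "sentence". If the rows for a document were not contiguous, both this sketch and the loop above would split that document into several "sentences".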

Set up logging, which `gensim` uses to report progress and errors, then train a *Word2Vec* model on the list of "sentences" and save it to the `model.dat` file.

In [None]:
FORMAT = "%(asctime)s : %(levelname)s : %(message)s"
logging.basicConfig(format=FORMAT, level=logging.ERROR)

model = gensim.models.Word2Vec(sentences, min_count=1)
model.save(model_file)
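
A format string for the standard `logging` module is built from `%(name)s` placeholders, each of which must end in a conversion character such as `s`. The following is a minimal sketch of how such a string and a level threshold behave. The logger name `format_demo` and the in-memory stream are assumptions made for this demonstration, and `%(asctime)s` is left out so the output is reproducible.

```python
import io
import logging

# Hypothetical logger name and in-memory stream, used only for this demonstration
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s : %(message)s"))

demo_log = logging.getLogger("format_demo")
demo_log.setLevel(logging.ERROR)
demo_log.addHandler(handler)
demo_log.propagate = False  # keep the demo out of the root logger's output

demo_log.info("below the ERROR threshold, so dropped")
demo_log.error("training failed")

print(stream.getvalue(), end="")
```

With the threshold at `logging.ERROR`, informational messages are discarded. Lowering it to `logging.INFO` would let them through.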

If you need to load a trained model, use:
`model = gensim.models.Word2Vec.load(model_file)`

In [None]:
%sx ls -lth model.dat terms.tsv

Here's a helper function that queries the resulting model for "neighbor" keyphrases:

In [None]:
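def get_synset(model, query, topn=10):
    try:
        # query the KeyedVectors in model.wv; newer gensim releases no longer expose most_similar on the model itself
        return sorted(model.wv.most_similar(positive=[query], topn=topn), key=lambda x: x[1], reverse=True)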
    except KeyError:
        return []
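
The scores returned by `most_similar` are cosine similarities between word vectors. The following is a minimal pure-Python sketch of that ranking idea, not `gensim`'s implementation. The `cosine` and `nearest` helpers and the two-dimensional `toy_vectors` are hypothetical, invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(vectors, query, topn=10):
    """Rank every other key in `vectors` by cosine similarity to `query`."""
    if query not in vectors:
        return []  # mirrors the empty result for an unknown query
    scored = [
        (term, cosine(vectors[query], vec))
        for term, vec in vectors.items()
        if term != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-d vectors; a trained model uses many more dimensions
toy_vectors = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "car": [0.0, 1.0],
}

print(nearest(toy_vectors, "cat", topn=2))
```

Here "kitten" ranks above "car" because its vector points in nearly the same direction as the one for "cat".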

Now we can query the model interactively through a mini REPL:

In [None]:
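NUM_RESULTS = 10

while True:
    try:
        query = input("\nquery? ").strip()
    except (EOFError, KeyboardInterrupt):
        break  # Ctrl-D or an interrupted kernel ends the loop
    if not query:
        break  # so does an empty query

    synset = get_synset(model, query, topn=NUM_RESULTS)
    if synset:
        print("most similar to", query, ":", synset)
    else:
        # get_synset returns [] for terms missing from the vocabulary
        print("not found")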