In [1]:
import pandas as pd
import numpy as np
import joblib
from nltk.tokenize import word_tokenize
import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

#### Load processed abstracts

In [2]:
abstracts_prepro = pd.read_csv("./interm/processed_abstracts.csv")
print( "%d preprocessed abstracts" % (len(abstracts_prepro)) )

4691 preprocessed abstracts


### Build a Word Embedding

To select the combination (*k*, $\alpha$, $l_1$), we will use a *topic coherence* measure called TC-W2V. This measure relies on a *word embedding* model built from our corpus, so in this step we use the *Gensim* implementation of Word2Vec to train a model on our collection of abstracts.
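As a rough illustration of how TC-W2V uses the embedding, the sketch below scores a topic by the mean pairwise cosine similarity between the vectors of its top terms, and a model by the mean over its topics. The function names and the toy two-dimensional vectors are assumptions made for this example; in the notebook the vectors would come from the trained Word2Vec model, and terms missing from the vocabulary would need separate handling.

```python
from itertools import combinations
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def topic_coherence(top_terms, vectors):
    """Mean pairwise cosine similarity between the top terms of one topic."""
    pairs = list(combinations(top_terms, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

def model_coherence(topics, vectors):
    """Mean topic coherence over all topics of a model."""
    return sum(topic_coherence(t, vectors) for t in topics) / len(topics)

# Toy vectors (hypothetical, for illustration only)
toy_vectors = {
    "gene": [1.0, 0.0],
    "protein": [1.0, 0.0],
    "market": [0.0, 1.0],
}

coherent = topic_coherence(["gene", "protein"], toy_vectors)  # identical directions
mixed = topic_coherence(["gene", "market"], toy_vectors)      # orthogonal directions
overall = model_coherence([["gene", "protein"], ["gene", "market"]], toy_vectors)
```

A higher score means the top terms of a topic sit closer together in the embedding space, which is what makes the measure usable for comparing parameter combinations.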

First, we need to define a function that tokenizes each abstract, producing a list of token lists in the form that Gensim's Word2Vec implementation consumes:

In [3]:
def list_word2vec(data_samples):
    # Tokenize each abstract; the result is one list of tokens per document
    liste_mots_abstracts = []
    for sample in data_samples:
        tokens = word_tokenize(str(sample))
        liste_mots_abstracts.append(tokens)
    return liste_mots_abstracts

In [4]:
data_fit = list_word2vec(abstracts_prepro['abstracts_prepro'].to_list())
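One thing worth checking at this stage is missing values: pandas reads an empty CSV cell as `NaN`, and `str()` turns that into the literal token `nan`. The following is a minimal sketch of a variant that skips such entries. The name `tokenize_documents` is hypothetical, and a plain whitespace split stands in for `word_tokenize` so the example runs without NLTK data.

```python
import math

def tokenize_documents(docs):
    """Return one token list per document, skipping missing entries."""
    sentences = []
    for doc in docs:
        # An empty CSV cell arrives as float('nan'); skip it instead of tokenizing "nan"
        if doc is None or (isinstance(doc, float) and math.isnan(doc)):
            continue
        # Whitespace split is a stand-in for word_tokenize (assumption for this sketch)
        sentences.append(str(doc).split())
    return sentences

sentences = tokenize_documents(["topic model coherence", float("nan"), "word embedding"])
print(sentences)  # [['topic', 'model', 'coherence'], ['word', 'embedding']]
```

Whether any abstracts are actually empty in this dataset is not known from the notebook; the check is cheap either way.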

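Now build a Word2Vec model from all documents in the input file using *Gensim*. Note that the call below leaves `sg` at its default of 0, which trains a CBOW model; pass `sg=1` to train a skip-gram model instead.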

In [5]:
# Reduce Python's built-in hash to a 32-bit unsigned integer for use as Gensim's hashfxn
def myhash(obj):
    return hash(obj) % (2 ** 32)
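A caveat on reproducibility: in Python 3 the built-in `hash` of a string is salted per process unless the `PYTHONHASHSEED` environment variable is set, so `myhash` can return different values from one session to the next. The sketch below shows one possible alternative based on `zlib.crc32`, which gives the same value in every process. The name `stable_hash` is hypothetical, and the sketch assumes Gensim passes strings to `hashfxn`.

```python
import zlib

def stable_hash(text):
    """Hash a string to a 32-bit unsigned integer, identically in every Python process."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF

# Same input always gives the same value, and the value fits in 32 bits
same = stable_hash("abstract") == stable_hash("abstract")
in_range = 0 <= stable_hash("abstract") < 2 ** 32
```

It could be passed as `hashfxn=stable_hash` in place of `myhash`; setting `PYTHONHASHSEED` before starting the kernel is the other common route.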

In [6]:
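# 500-dimensional vectors; min_count=47 drops words that occur fewer than 47 times in the corpus
# (about 1% of the 4691 abstracts, mirroring the threshold used for the TF-IDF selection).
# `size` is the Gensim 3.x argument name; Gensim >= 4.0 calls it `vector_size`.
w2v_model = Word2Vec(data_fit, size=500, min_count=47, workers=1, seed=1511, hashfxn=myhash)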

The Word2Vec vocabulary is larger than the TF-IDF vocabulary because it does not exclude very frequent words (i.e., those that appear in more than 95$\%$ of the abstracts).

In [7]:
# Gensim 3.x API; in Gensim >= 4.0 use len(w2v_model.wv.key_to_index) instead
print( "Model has %d terms" % len(w2v_model.wv.vocab) )

Model has 1533 terms
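The difference in vocabulary size noted above can be reproduced on a toy corpus. This sketch contrasts a Word2Vec-style filter (a lower bound on total occurrences) with a TF-IDF-style filter (a lower bound on document frequency plus an upper bound on the share of documents). The function names, the parameter names `min_df` and `max_df`, and the toy documents are assumptions for illustration; the actual TF-IDF settings live elsewhere in the project.

```python
from collections import Counter

def word2vec_style_vocab(docs, min_count):
    """Keep terms whose total number of occurrences is at least min_count."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {t for t, c in counts.items() if c >= min_count}

def tfidf_style_vocab(docs, min_df, max_df):
    """Keep terms found in at least min_df documents and at most a max_df share of them."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    return {t for t, c in doc_freq.items() if c >= min_df and c / n_docs <= max_df}

# Toy corpus (hypothetical): "study" appears in every document
docs = [
    ["study", "gene"],
    ["study", "protein"],
    ["study", "gene", "protein"],
    ["study", "market"],
]

w2v_vocab = word2vec_style_vocab(docs, min_count=2)
tfidf_vocab = tfidf_style_vocab(docs, min_df=2, max_df=0.95)
print(sorted(w2v_vocab - tfidf_vocab))  # ['study']
```

The lower bounds also differ in kind: a word repeated many times within a few abstracts can pass a total-count threshold while failing a document-frequency threshold.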


#### Save trained word vectors

In [8]:
word_vectors = w2v_model.wv

In [9]:
word_vectors.save("./interm/word2vec.wordvectors")