[![](https://kaggle.com/static/images/open-in-kaggle.svg){fig-align="right"}](https://www.kaggle.com/code/lucapapariello/gensim-share-vocabulary-across-models)

# Introduction

Deep learning models based on the transformer architecture have taken the NLP world by storm in the last few years, achieving state-of-the-art results in several areas. An obvious example of this success is provided by the tremendous growth of the [Hugging Face](https://huggingface.co/) ecosystem, which provides access to a plethora of pre-trained models in a very user-friendly way. 

However, we believe that models based on (static) word embeddings still have their place in the "transformer era". Some reasons why this might be the case are the following: 

- Transformer-based models are usually much bigger (i.e. more parameters) than "standard" models.
- Transformer models are not renowned for their (inference) speed&mdash;this is related to the previous point.
- Models based on word embeddings still provide a solid baseline.

[Gensim](https://radimrehurek.com/gensim/) is a great library for word embeddings and a few other NLP tasks, especially if you want to train the models yourself. There might be cases where you would like to train two NLP models and have them "speak the same language", i.e. share the *same* vocabulary. For the sake of concreteness, let's say these two models are [LSI](https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing) and [word2vec](https://en.wikipedia.org/wiki/Word2vec). The "standard" way of doing this, however, requires the preparation of *two* vocabularies, one for each model. In this post, we'll show how to avoid this by transferring the vocabulary of the LSI model to the word2vec model.

# LSI and word2vec the "standard" way

We will now build these two models following the "standard" procedure found in the respective Gensim documentation. In what follows, we will work with the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/), a collection of about 20,000 newsgroup documents grouped into 20 classes. The details are not important for our discussion. The dataset can easily be obtained with the `sklearn.datasets.fetch_20newsgroups` [function](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html) of the scikit-learn library, which downloads and caches it.

In [1]:
#| include: false
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    
if IN_COLAB:
    !pip install nltk==3.2.4 -q
    !pip install scikit-learn==0.23.2 -q
    !pip install gensim==4.0.1 -q
    !pip install smart-open==5.1.0 -q

In [1]:
#| include: false
import re
import multiprocessing
from pathlib import Path

import nltk
from nltk.corpus import stopwords
from nltk import sent_tokenize, word_tokenize
from sklearn.datasets import fetch_20newsgroups
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, LsiModel, Word2Vec
from smart_open import open as sopen



In [3]:
#| output: false
data, _ = fetch_20newsgroups(
    shuffle=True, random_state=123, remove=('headers', 'footers', 'quotes'), return_X_y=True
)

In [None]:
#| include: false
nltk.download('stopwords')
nltk.download('punkt')

In [4]:
#| include: false
stop_words = stopwords.words('english')


def clean_text(txt: str):
    '''Clean and lower case text.'''
    txt = re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())
    txt = re.sub(r"\b\d+\b", "", txt).strip()
    return txt


def tokenizer(txt: str):
    '''Custom tokenizer.'''
    tokens = []
    
    # split strings into sentences
    for sent in sent_tokenize(txt, language='english'):
        # split each sentence into tokens and apply text cleaning 
        for word in word_tokenize(clean_text(sent), language='english'):
            if len(word) < 2:  # remove short words
                continue
            if word in stop_words:  # remove stop words
                continue
            tokens.append(word)

    return tokens

## LSI model

The first step in [building an LSI model](https://radimrehurek.com/gensim/models/lsimodel.html) is to create a dictionary, which maps words to integer ids. This is easily achieved through the `Dictionary` class, to which we have to pass tokenised documents:

In [5]:
tokenized_data = [tokenizer(doc) for doc in data]
dct = Dictionary(tokenized_data)

In [6]:
#| include: false
# Remove words that appear fewer than 2 times
rare_ids = [tokenid for tokenid, wordfreq in dct.cfs.items() if wordfreq < 2]
# Drop tokens
dct.filter_tokens(rare_ids)

With the help of the dictionary we can then build our corpus using the `.doc2bow()` method. This returns documents in a bag-of-words (BoW) representation. We could proceed with it, but a [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) representation is preferable, for which we can use the `TfidfModel` class.

In [7]:
corpus = [dct.doc2bow(line) for line in tokenized_data]
tfidf_model = TfidfModel(corpus, id2word=dct)
tfidf_matrix = tfidf_model[corpus]

We have everything we need to build our LSI model, which is conveniently done by the `LsiModel` class. Without further motivating this arbitrary choice, we set the number of latent dimensions to 200.

In [8]:
%%time

dim_lsi = 200  # Topic number (latent dimension)
lsi_model = LsiModel(corpus=tfidf_matrix, id2word=dct, num_topics=dim_lsi)

CPU times: user 12.6 s, sys: 625 ms, total: 13.2 s
Wall time: 10.5 s


We now have an LSI model ready to be used! Let's move on to word2vec.

## word2vec model

The quickest way to train a [word2vec model](https://radimrehurek.com/gensim/models/word2vec.html) is through the `Word2Vec` class.

In [9]:
#| code-fold: true
dim_w2v = dim_lsi  # Dimensionality of word vectors
alpha = 0.025  # Initial learning rate
alpha_min = 0.0001  # Drop learning rate to this value
wnd = 5        # Window size (max. distance to predicted word)
mincount = 2   # Word frequency lower bound
sample = 1e-5  # Threshold for downsampling
sg = 1         # Index 1 => Skip-Gram algo.
ngt = 10       # No. noisy words for negative sampling
epochs = 5     # No. epochs for training
cpus = multiprocessing.cpu_count()  # Tot. no. of CPUs
threads = max(1, cpus - 1)  # Use this number of threads for training (at least one)

In [10]:
%%time

w2v_model = Word2Vec(
    sentences=tokenized_data, vector_size=dim_w2v, alpha=alpha, min_alpha=alpha_min, window=wnd, 
    min_count=mincount, sample=sample, sg=sg, negative=ngt, epochs=epochs, workers=threads
)

CPU times: user 1min 31s, sys: 1.19 s, total: 1min 32s
Wall time: 53.7 s


Let's double-check the number of words present in each of our two models:

In [11]:
#| code-fold: true
print('Size of LSI vocab.:', len(dct.keys()))
print('Size of w2v vocab.:', len(w2v_model.wv.key_to_index.keys()))

Size of LSI vocab.: 42439
Size of w2v vocab.: 42439


We've hence managed to build an LSI and a word2vec model whose vocabularies contain exactly the same words. The sizes agree because both pipelines drop words that occur fewer than twice: the dictionary filter on one side, `min_count=2` on the other. However, this came at an unnecessarily high price, and we'll shortly see why. When we create a new instance of the `Word2Vec` class with a corpus, the following happens behind the scenes. First, a quick sanity check of the corpus is performed; then the vocabulary is built using the `.build_vocab()` method; lastly, the `.train()` method is executed, which trains the model. In the second step, a new vocabulary is built from scratch by scanning the whole corpus, even though we have already done exactly that for the LSI model. With small datasets this procedure might be acceptable, but when the corpus is very large, optimising this step can save a lot of time!

# LSI and word2vec the fast way

We will now see how we can build these models avoiding the above issue. To do that, we must split the creation of the model into three steps. We start by instantiating the model, but without passing it a corpus, i.e. leaving it uninitialised.

In [12]:
w2v_model = Word2Vec(
    vector_size=dim_w2v, alpha=alpha, min_alpha=alpha_min, window=wnd, 
    min_count=mincount, sample=sample, sg=sg, negative=ngt, workers=threads
)

Thankfully, Gensim offers an easy shortcut for the second step: one can build the vocabulary from a dictionary of *word frequencies* instead of from a sequence of sentences, as `.build_vocab()` does by default. This is done with the `.build_vocab_from_freq()` method, which requires a word-to-frequency mapping. The latter can be obtained from the LSI dictionary, specifically from the `dct.cfs` attribute, which maps token ids to collection frequencies; we only need to translate the ids back into words.

In [13]:
%%time

# Step 2: borrow LSI vocab.
word_freq = {dct[k]: v for k,v in dct.cfs.items()}
w2v_model.build_vocab_from_freq(word_freq)
# Step 3: train model
num_samples = dct.num_docs
w2v_model.train(tokenized_data, total_examples=num_samples, epochs=epochs)

CPU times: user 1min 29s, sys: 812 ms, total: 1min 30s
Wall time: 37.1 s


(2609400, 6277620)

We have successfully created the word2vec model by borrowing the LSI vocabulary. This allowed us to skip an unnecessary pass over the corpus and hence to avoid wasting resources.

Note that in this case the speed-up is barely observable, owing to the very small size of the dataset (about 15 MB). However, it becomes considerable when working with huge datasets. The dataset on which I base these conclusions exceeds 50 GB, and this second approach saved me *several hours*!

# Data streaming [optional]

This is a bonus section for those who have endured until this point. The motivation behind this post was to avoid unnecessary calculations, which makes particular sense when dealing with very large datasets. Very large datasets will most likely *not* fit in memory, yet in the code above we have loaded everything into RAM. Ouch!

Gensim models are smart enough to accept *iterables* that stream the input data directly from disk. In this way, our corpus can be arbitrarily large (limited only by the size of our hard drive). Here we repeat the steps of the sections above, restructuring our code to take advantage of *data streaming*. We assume the corpus to be stored in a single text file (`20news.txt`), which we obtain by writing the `data` list to disk.

:::{.callout-note}

Loading a corpus into memory and then dumping it into a file obviously doesn't make much sense; we're doing this only to work with the same data as before. You will not need this step as you will be starting directly from some data stored in a (potentially very large) file.

:::

In [14]:
#| include: false

def write_to_file(txt, out_file):
    '''Write text corpus to file (line by line).'''
    with open(out_file, 'w') as f:  # overwrite, so re-running does not duplicate the corpus
        for line in txt:
            f.write(line + '\n')
            
file_out = Path('20news.txt')

write_to_file(data, file_out)

## LSI model

As we've seen before, the first step is to create a dictionary. Previously we passed a list to `Dictionary`; now we pass it a *generator*:

In [15]:
curpus_path = Path('20news.txt')
dct = Dictionary((tokenizer(line) for line in open(curpus_path)))

In [16]:
#| include: false
# Remove words that appear fewer than 2 times
rare_ids = [tokenid for tokenid, wordfreq in dct.cfs.items() if wordfreq < 2]
# Drop tokens
dct.filter_tokens(rare_ids)

Step two consists of creating a corpus and switching to a TF-IDF representation. Here is where things change a bit: we need to define an iterable that yields documents in BoW representation, which is what the `Corpus` class below does.

In [17]:
class Corpus:
    '''Iterable that yields BoW representations of documents.'''
    
    def __init__(self, curpus_path, dct_object):
        self.curpus_path = curpus_path
        self.dct_object = dct_object
        
    def __iter__(self):
        for line in sopen(self.curpus_path):
            yield self.dct_object.doc2bow(tokenizer(line))

We then use it to create our streamed corpus, which can be passed to `TfidfModel`. This time we don't store the TF-IDF representation in a separate variable; `tfidf_model[corpus]` is applied lazily, document by document, when the LSI model iterates over it.

In [18]:
corpus = Corpus(curpus_path, dct)
tfidf_model = TfidfModel(corpus, id2word=dct)

We are now ready to build our LSI model:

In [19]:
%%time

lsi_model = LsiModel(corpus=tfidf_model[corpus], id2word=dct, num_topics=dim_lsi)

CPU times: user 3min 58s, sys: 11 s, total: 4min 9s
Wall time: 3min 17s


:::{.callout-important}

We need to use restartable iterables, i.e. objects whose `__iter__` method returns a fresh iterator, and not generators. A generator is itself an iterator: after we have exhausted it once, there is no more data available. In contrast, such iterables create a new iterator *every time* they are looped over. This is exactly what we need when creating a model: we need to be able to iterate over a dataset more than once.

:::
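
The difference is easy to see in plain Python. The snippet below is a small sketch with made-up names (`token_generator`, `TokenIterable`) and an in-memory list standing in for the corpus file.

```python
def token_generator(lines):
    """Generator function: the returned generator can be consumed only once."""
    for line in lines:
        yield line.split()


class TokenIterable:
    """Restartable iterable: every loop calls __iter__ and starts afresh."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()


lines = ["first document", "second document"]

gen = token_generator(lines)
gen_pass_1 = list(gen)
gen_pass_2 = list(gen)  # the generator is already exhausted

it = TokenIterable(lines)
it_pass_1 = list(it)
it_pass_2 = list(it)  # a fresh iterator is created

print(len(gen_pass_1), len(gen_pass_2))  # 2 0
print(len(it_pass_1), len(it_pass_2))    # 2 2
```

A generator is fine when a single pass suffices, as in the `Dictionary` construction above. Training, on the other hand, goes over the data once per epoch, so a generator would silently yield nothing after the first pass.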

## word2vec model

Similar to what we did with the LSI model, we need to define an iterable that yields tokenized documents. This is provided by the `CorpusW2V` class below.

In [20]:
class CorpusW2V:
    '''Iterable that yields sentences (lists of str).'''

    def __init__(self, curpus_path):
        self.curpus_path = curpus_path

    def __iter__(self):
        for line in sopen(self.curpus_path):
            yield tokenizer(line)

corpus_w2v = CorpusW2V(curpus_path)

The rest follows exactly as above, with the only difference that now the `.train()` method receives an instance of the `CorpusW2V` class instead of a list (see `tokenized_data` above).

In [21]:
%%time

w2v_model = Word2Vec(
    vector_size=dim_w2v, alpha=alpha, min_alpha=alpha_min, window=wnd, 
    min_count=mincount, sample=sample, sg=sg, negative=ngt, workers=threads
)
# Borrow LSI vocab.
word_freq = {dct[k]: v for k,v in dct.cfs.items()}
w2v_model.build_vocab_from_freq(word_freq)
# Train model
num_samples = dct.num_docs
w2v_model.train(corpus_w2v, total_examples=num_samples, epochs=epochs)

CPU times: user 5min 52s, sys: 4.78 s, total: 5min 56s
Wall time: 6min 31s


(2608743, 6277620)

In [22]:
#| include: false
curpus_path.unlink()

We conclude by noting that this approach based on data streaming is certainly *slower* than when we load everything into memory. However, it allows us to process arbitrarily large datasets. One can't have it all, as they say.

# Acknowledgements and references

We must thank the Gensim community, and in particular Austen Mack-Crane, on whose suggestions the section [LSI and word2vec the fast way](#LSI-and-word2vec-the-fast-way) is based. The section [Data streaming](#Data-streaming-[optional]), in turn, takes inspiration from the [Gensim documentation](https://radimrehurek.com/gensim/) and from Radim Řehůřek's [blog post](https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/) about data streaming in Python.