# Training a word embedding model from scratch

| Author | Last update |
|:------ |:----------- |
| Hauke Licht (https://github.com/haukelicht) | 2023-09-26 |

This notebook illustrates how to use `gensim` to train a word2vec model from scratch on original data.

## Setup

In [77]:
# # run this code if 'swifter' package is not installed
# !pip install swifter

In [78]:
import os
import gensim
import numpy as np
import pandas as pd
import swifter # <== `pip install swifter` (if not yet done)

In [79]:
data_path = os.path.join('..', 'data', 'corpora', 'gbr_commons')
os.makedirs(data_path, exist_ok=True)

## Load the data

I have prepared a corpus of sentence-splitted speeches from the UK *House of Commons* (lower house chamber of the parliament).
The file is too big to be uploaded on Github.
So you need to download it if you have not yet done so:

In [129]:
fp = os.path.join(data_path, 'gbr_commons_speech_sentences_tokenized.tsv.gzip')
if not os.path.exists(fp):
    print('downloading the corpus ... might take 3-10 minutes (depending on your internet connection)')
    url = 'https://www.dropbox.com/scl/fi/wkxj7k2uiy0935dmbjp34/gbr_commons_speech_sentences_tokenized.tsv.gzip?rlkey=urjdpz0vgbymllzugzsh0m2ld&dl=1'
    corp = pd.read_csv(url, sep='\t', compression='gzip')
    corp.to_csv(fp, sep='\t', compression='gzip', index=False)
    # keep only the first 100K sentences in the corpus
    corp = corp.iloc[:100_000]
else:
    # load only the first 100K sentences in the corpus
    corp = pd.read_csv(fp, sep='\t', compression='gzip', nrows=100_000)

corp = corp[~corp.text_tokenized.isna()]

In [140]:
import re

re.sub('\d+[.,]\d', '<NUM>', 'Hello world, I\'m 33 years old')
# note: maybe also add '<OOV>' = out of vocabulary /but then set `min_count=1`!)

"Hello world, I'm <NUM> years old"

In [130]:
corp.head()

Unnamed: 0,text_id,text_tokenized
0,uk.org.publicwhip/debate/1970-01-19a.1.5_0_0,On a point of order
1,uk.org.publicwhip/debate/1970-01-19a.1.5_0_1,Is it true Mr Speaker that your Private Secret...
2,uk.org.publicwhip/debate/1970-01-19a.1.5_0_2,If that be true then it is a matter of deep re...
3,uk.org.publicwhip/debate/1970-01-19a.1.5_0_3,Will you Mr Speaker be kind enough to convey o...
4,uk.org.publicwhip/debate/1970-01-19a.1.7_0_0,On behalf of hon and right hon Members on this...


The data file maps sentences to sentence IDs.
The sentences have been preprocessed and tokenized into words and then concatenated with white spaces.
So to get a sentences words, we can split at the white space:

In [133]:
corp.text_tokenized[1]

'Is it true Mr Speaker that your Private Secretary Sir Francis Reid died suddenly over the weekend'

## How to train (illustration)

To train a word2vec model on a corpus, we need to

1. create a new model instance `model = gensim.models.Word2Vec(...)`. Here we pass all model *hyper-parameters* like the embedding dimension and window size, the algorithm we want to use (Skip-gram or CBOW), etc.
2. build the model's vocabulary on our corpus `model.build_vocab(...)`
3. train the model by calling `model.train(...)`

### Train on a corpus (in memory)

All you need as **input** for these steps is a list of sentences split into tokens ("words").
That is, you input data should be prepared as follows:

```python
[
    ['Words', 'in', 'sentence', 'one'],
    ['Some', 'more', 'words', 'in', 'sentence', 'two'],
    ...
]
```

Let's split all our German *Bundestag* speech sentences into words:


In [137]:
# split pre-tokeniezd sentences into words
sentences = corp.text_tokenized.swifter.apply(lambda x: x.split(' ')).to_list()

Pandas Apply:   0%|          | 0/99997 [00:00<?, ?it/s]

In [138]:
# print first 4 sentences
sentences[:4]

[['On', 'a', 'point', 'of', 'order'],
 ['Is',
  'it',
  'true',
  'Mr',
  'Speaker',
  'that',
  'your',
  'Private',
  'Secretary',
  'Sir',
  'Francis',
  'Reid',
  'died',
  'suddenly',
  'over',
  'the',
  'weekend'],
 ['If',
  'that',
  'be',
  'true',
  'then',
  'it',
  'is',
  'a',
  'matter',
  'of',
  'deep',
  'regret',
  'to',
  'all',
  'Members',
  'on',
  'both',
  'sides',
  'of',
  'the',
  'House',
  'who',
  'shared',
  'deep',
  'friendships',
  'with',
  'Sir',
  'Francis'],
 ['Will',
  'you',
  'Mr',
  'Speaker',
  'be',
  'kind',
  'enough',
  'to',
  'convey',
  'our',
  'deep',
  'sympathy',
  'to',
  'the',
  'members',
  'of',
  'his',
  'family']]

Now we are ready to create a `Word2Vec` model instance.
The following model hyper-parameters are relevant for our purposes:

- `vector_size`: the number of dimensions $d$ of th word embedding matrix
- `window`: the number of context words to use for modeling 
- `min_count`: the minimum number of times a word needs to occur in the corpus to be included in the vocabulary
- `sg`: set to 1 to use the Skip-Gram algorithm, otherwise uses CBOW algorithm
- `hs`: set to 1 to use the hierarchical softmax
- `negative`: number of negative samples to use per focus word when computing the loss
- `epochs`: number of times to iterate over corpus (all sentences)
- `workers`: number of CPU cores to use for training parallelization

In [141]:
import gensim

# create a new model instance
model = gensim.models.Word2Vec(
    vector_size=10, # <= super low dimensionality for testing
    window=5,
    min_count=5,
    sg=1,
    hs=1,
    negative=10, # <== if negative = 2*window, data is "balanced"
    epochs=5, # <= set very low for testing
    workers=10 # <= 10 because I have 10 cores on my machine
)

In [144]:
model.epochs

5

Next, we build the vocabulary:

In [142]:
# build the vocabulary from the list of sentences
model.build_vocab(sentences)

In [143]:
model.corpus_count

99997

Now we can train the model:

In [145]:
# train the model on the list of sentences
model.train(
    sentences, 
    total_examples=model.corpus_count, 
    epochs=model.epochs
)

(7100711, 10061760)

In [172]:
type(model), type(model.wv)

(gensim.models.word2vec.Word2Vec, gensim.models.keyedvectors.KeyedVectors)

In [171]:
model.wv.vector_size

10

Let's compute similarities on a number of hand-picked word pairs to see whether the model has learned sensible word vectors.

*Note:* This is more of a "face validity" check and no full-fledged validation

In [165]:
# evaluate
pairs = [
    ('Sir', 'Madam'),
    ('Pound', 'Euro'),
    ('Pound', 'Dollar'),
    ('Government', 'law'),
    ('Government', 'bill'),
    ('Government', 'propoasal'),
    ('Opposition', 'law'),
    ('Opposition', 'bill'),
    ('Opposition', 'proposal'),
    ('bill', 'proposal'),
    ('Labour', 'Government'),
    ('Labour', 'Opposition'),
    ('Conservatives', 'Government'),
    ('Conservatives', 'Opposition'),
    ('shadow', 'Minister'),
    ('Opposition', 'bench'),
    ('chair', 'bench'),
    ('Chair', 'Speaker'),
    ('hon', 'Member'),
    ('honourable', 'Member'),
    ('hon', 'Friend'),
    ('Friend', 'Member')
]
for pair in pairs:
    if all(p in model.wv.key_to_index for p in pair):
        print(f'"{pair[0]}" - "{pair[1]}": {model.wv.similarity(pair[0], pair[1]):.3f}')

"Government" - "law": 0.426
"Government" - "bill": 0.357
"Opposition" - "law": 0.332
"Opposition" - "bill": 0.204
"Opposition" - "proposal": 0.723
"bill" - "proposal": 0.430
"Labour" - "Government": 0.785
"Labour" - "Opposition": 0.770
"Conservatives" - "Government": 0.661
"Conservatives" - "Opposition": 0.742
"shadow" - "Minister": 0.684
"Opposition" - "bench": 0.820
"chair" - "bench": 0.729
"Chair" - "Speaker": 0.954
"hon" - "Member": 0.515
"honourable" - "Member": 0.257
"hon" - "Friend": 0.632
"Friend" - "Member": 0.884


In [173]:
'Minister' in model.wv 

True

### Train with a corpus read from disk

The above code works just fine.
But it has a computational bottleneck!
To iterate over the list of sentences, we need to have them all loaded into our working memory (RAM).
If our corpus is very large -- which it should be to enable learning of reliable word embeddings --, this might be too burdensome for your computer.

So instead of loading the entire corpus in the RAW, we can iteratively read batches of sentences from a data file that exists somewhere on the hard drive ("disk") of your computer.
For this, we need an "iterable" class that implements "yields" sentences in our corpus one at a time.

In the next code cell, we define such a class.
This class reads the sentences from a (ZIP-ed) CSV file and splits sentences at white spaces before yielding them.


In [174]:
import gzip
from tqdm.auto import tqdm

# get corpus sentences
class SentenceCorpus(object):
    """Iterable class for reading a corpus from a file.
    
    Parameters
    ----------
    filepath : str
        Path to the corpus file.
    sep : str, optional
        Separator for splitting the lines into tokens. Default: ','
    compressed : bool, optional
        Whether the file is compressed. Default: False
    skip : int, optional
        Number of lines to skip at the beginning of the file. Default: 1
    nrows : int, optional
        Number of lines to read from the file. Default: None (read all lines)
    
    Yields
    ------
    list of str

    """
    # note: I've already added a lot of functionality here (e.g. reading from compressed files, skip rows, limit the number of rows to read).
    #       Hence, the actual code is a bit more complex than what you'd need in some use cases.
    def __init__(self, filepath, sep='\t', compressed=False, skip=1, nrows=None):
        self.filepath = filepath
        self.sep = sep
        self.compressed = compressed
        self.skip = skip if skip else 0
        self.nrows = nrows if nrows else np.inf
        if skip and skip > 0:
            self.nrows += skip
        with gzip.open(self.filepath, 'rt', encoding='utf-8') if self.compressed else open(self.filepath, 'r', encoding='utf-8') as file:
            corpus_size = len(file.readlines())
        if nrows is not None and nrows < corpus_size:
            corpus_size = nrows
        else:
            nrows = corpus_size
        self.corpus_size = nrows
    
    def __len__(self):
        return self.corpus_size
    
    def __iter__(self):
        # open file
        with gzip.open(self.filepath, 'rt', encoding='utf-8') if self.compressed else open(self.filepath, 'r', encoding='utf-8') as file:
            # compute how many rows to read in total
            max_ = self.nrows if self.skip <= 0 else self.nrows+self.skip
            # for each line in file
            for i, line in enumerate(file):
                # skip if you want to skipp first `skip` lines
                if i <= self.skip:
                    continue
                # stop if you have read max number of lines
                if max_ and i >= max_:
                    break
                yield line.strip().split(self.sep)[1].split(' ')
                # previsouly : 
                # now: skip if word splitting raises error
                try:
                    # try to split text into words (assuming that the text is in the second column)
                    # note: 
                    #   - the `self.sep` says what character you want to use for spliting the line into columns
                    #   - the `[1]` just says that you'll assume that the text is in the second column
                    words = line.strip().split(self.sep)[1].split(' ')
                except:
                    continue
                yield words

Let's read the some 100,000 sentences from our corpus as above:

In [175]:
fp = os.path.join(data_path, 'gbr_commons_speech_sentences_tokenized.tsv.gzip')
sentences = SentenceCorpus(filepath=fp, compressed=True, nrows=100_000)
# note: loading it takes time because it's an iterable

In [177]:
len(sentences)

100000

In [178]:
model = gensim.models.Word2Vec(
    vector_size=10, # <= super low dimensionality for testing
    window=5,
    min_count=5,
    sg=1,
    hs=1,
    negative=5,
    epochs=5, # <= set very low for testing
    workers=10
)

In [179]:
model.build_vocab(sentences)

In [180]:
model.train(
    sentences, 
    total_examples=model.corpus_count, 
    epochs=model.epochs
)

(7101496, 10061800)

Let's again compute similarities on our hand-picked word pairs:

In [181]:
for pair in pairs:
    if all(p in model.wv.key_to_index for p in pair):
        print(f'"{pair[0]}" - "{pair[1]}": {model.wv.similarity(pair[0], pair[1]):.3f}')

"Government" - "law": 0.377
"Government" - "bill": 0.298
"Opposition" - "law": 0.354
"Opposition" - "bill": 0.113
"Opposition" - "proposal": 0.765
"bill" - "proposal": 0.396
"Labour" - "Government": 0.772
"Labour" - "Opposition": 0.721
"Conservatives" - "Government": 0.665
"Conservatives" - "Opposition": 0.722
"shadow" - "Minister": 0.727
"Opposition" - "bench": 0.693
"chair" - "bench": 0.736
"Chair" - "Speaker": 0.946
"hon" - "Member": 0.463
"honourable" - "Member": 0.172
"hon" - "Friend": 0.585
"Friend" - "Member": 0.871


## Train, for real

In [182]:
fp = os.path.join(data_path, 'gbr_commons_speech_sentences_tokenized.tsv.gzip')
sentences = SentenceCorpus(filepath=fp, compressed=True)
# number of sentences in the corpus
len(sentences)

26290960

In [183]:
# number of tokens in the corpus
sum([len(text) for text in sentences])

519617649

In [184]:
model = gensim.models.Word2Vec(
    vector_size=100, # <= low dimensionality for testing
    window=5,
    min_count=25, # <= words should appear at least 25 times in the corpus
    sg=1,
    hs=1,
    negative=5,
    epochs=10, # <= set very low for testing
    workers=10,
    sorted_vocab=1,
)

In [185]:
model.build_vocab(sentences) # <== takes long because corpus is large 

Because training is going to take a while, we'll add a "callback" function that prints the loss after each epoch

*source:* https://stackoverflow.com/a/54891714

In [100]:
import time
import gensim

class callback(gensim.models.callbacks.CallbackAny2Vec):
    '''Callback to print loss after each epoch.'''

    def __init__(self):
        self.epoch = 0
        # record the start time of the training
    
    def on_epoch_begin(self, model):
        print(f'Epoch #{self.epoch} start ...', end=' ')
        self.ts = time.time()

    def on_epoch_end(self, model):
        # compute the time delta of the epoch relative to self.ts
        delta = time.time() - self.ts
        # print the delta time and loss after each epoch
        loss = model.get_latest_training_loss()
        print(f'Epoch #{self.epoch} took {int(delta // 60)}:{int(delta % 60)}m; loss = {loss}')
        self.epoch += 1

In [101]:
model.train(
    sentences, 
    total_examples=model.corpus_count, 
    epochs=model.epochs,
    # below the thing you need to adapt to use our custom callback that prints progress
    compute_loss=True, 
    callbacks=[callback()]
)

# note: the 'loss' values printed below are increasing =(
#       This is because gensim's reporting is buggy (https://github.com/RaRe-Technologies/gensim/pull/2135)
#       So don't make hard jdugements about 'convergence' 
#        based on the loss values printed below

Epoch #0 start ... Epoch #0 took 10:2m; loss = 77463592.0
Epoch #1 start ... Epoch #1 took 10:2m; loss = 90601976.0
Epoch #2 start ... Epoch #2 took 9:57m; loss = 103585816.0
Epoch #3 start ... Epoch #3 took 10:9m; loss = 116231952.0
Epoch #4 start ... Epoch #4 took 10:8m; loss = 128471592.0
Epoch #5 start ... Epoch #5 took 10:10m; loss = 134217728.0
Epoch #6 start ... Epoch #6 took 10:21m; loss = 134217728.0
Epoch #7 start ... Epoch #7 took 11:4m; loss = 134217728.0
Epoch #8 start ... Epoch #8 took 11:4m; loss = 134217728.0
Epoch #9 start ... Epoch #9 took 11:23m; loss = 134217728.0


(3791416743, 5196176490)

In [119]:
# Let's again compute similarities on our hand-picked word pairs:
for pair in pairs:
    if all(p in model.wv.key_to_index for p in pair):
        print(f'"{pair[0]}" - "{pair[1]}": {model.wv.similarity(pair[0], pair[1]):.3f}')

"Sir" - "Madam": 0.262
"Pound" - "Euro": 0.250
"Pound" - "Dollar": 0.291
"Government" - "law": 0.351
"Government" - "bill": 0.232
"Opposition" - "law": 0.128
"Opposition" - "bill": 0.089
"Opposition" - "proposal": 0.313
"bill" - "proposal": 0.362
"Labour" - "government": 0.638
"Labour" - "opposition": 0.531
"Conservatives" - "government": 0.560
"Conservatives" - "opposition": 0.501


## Saving the model and it's vectors

In [186]:
# store the model's word vectors in a file
models_path = data_path.replace('corpora', 'models')
print(models_path)
fp = os.path.join(models_path, 'gbr_commons_word2vec_w5_d100')
fp

../data/models/gbr_commons


'../data/models/gbr_commons/gbr_commons_word2vec_w5_d100'

With a trained model at hand, you have two options for saving it for re-use.
You can only save the word vectors (as `KeyedVectors`).
Or you can save the full model (which inlcudes the word vectors but has more capabilities).

The following table compares the **pros and cons** of these approaches ([source](https://radimrehurek.com/gensim/models/keyedvectors.html#how-to-obtain-word-vectors)):

<table class="docutils align-default">
<colgroup>
<col style="width: 24%">
<col style="width: 12%">
<col style="width: 11%">
<col style="width: 54%">
</colgroup>
<tbody>
<tr class="row-odd">
<td><p><em>capability</em></p></td>
<td><p><em>KeyedVectors</em></p></td>
<td><p><em>full model</em></p></td>
<td><p><em>note</em></p></td>
</tr>
<tr class="row-even"><td><p>continue training vectors</p></td>
<td><p>❌</p></td>
<td><p>✅</p></td>
<td><p>You need the full model to train or update vectors.</p></td>
</tr>
<tr class="row-odd"><td><p>smaller objects</p></td>
<td><p>✅</p></td>
<td><p>❌</p></td>
<td><p>KeyedVectors are smaller and need less RAM, because they
don’t need to store the model state that enables training.</p></td>
</tr>
<tr class="row-even"><td><p>save/load from native
fasttext/word2vec format</p></td>
<td><p>✅</p></td>
<td><p>❌</p></td>
<td><p>Vectors exported by the Facebook and Google tools
do not support further training, but you can still load
them into KeyedVectors.</p></td>
</tr>
<tr class="row-odd"><td><p>append new vectors</p></td>
<td><p>✅</p></td>
<td><p>✅</p></td>
<td><p>Add new-vector entries to the mapping dynamically.</p></td>
</tr>
<tr class="row-even"><td><p>concurrency</p></td>
<td><p>✅</p></td>
<td><p>✅</p></td>
<td><p>Thread-safe, allows concurrent vector queries.</p></td>
</tr>
<tr class="row-odd"><td><p>shared RAM</p></td>
<td><p>✅</p></td>
<td><p>✅</p></td>
<td><p>Multiple processes can re-use the same data, keeping only
a single copy in RAM using
<a class="reference external" href="https://en.wikipedia.org/wiki/Mmap">mmap</a>.</p></td>
</tr>
<tr class="row-even"><td><p>fast load</p></td>
<td><p>✅</p></td>
<td><p>✅</p></td>
<td><p>Supports <a class="reference external" href="https://en.wikipedia.org/wiki/Mmap">mmap</a>
to load data from disk instantaneously.</p></td>
</tr>
</tbody>
</table>

#### Saving the full model

Just call the save `method()` on your word2vec `mode` object

**_Note:_** It's customary to name the file you write the model to with the file extension '.model' (we will use another one when we just save the word vectors) 


In [121]:
model.save(fp+'.model')

It's then easy to reload the model:

In [122]:
from gensim.models import Word2Vec
tmp = Word2Vec.load(fp+'.model')
type(tmp), type(tmp.wv)

(gensim.models.word2vec.Word2Vec, gensim.models.keyedvectors.KeyedVectors)

#### Saving only the word vectors

Call the save `method()` on your word2vec `mode`'s `wv` attribute (a `KeyedVectors` object recording the model's word vectors):

**_Note:_** It's customary to name the file you write the model to with the file extension '.kv' 🥝


In [123]:
model.wv.save(fp+'.kv')

In [124]:
from gensim.models import KeyedVectors
tmp = KeyedVectors.load(fp+'.kv')
type(tmp)

gensim.models.keyedvectors.KeyedVectors

#### Save in other standard formats

Since Word2Vec has been implemented in many other languages (e.g., C and R), there is a standardized file format for storing word vectors.
Saving word embeddings in this format allows "interoperatbility" between languages. 
So for example, you could train your word embedding model in Python but compute with the embeddings in R.

In [126]:
model.wv.save_word2vec_format(fp+'.vectors')

The above code write a large text file.
We can read lines from it.


**_Note:_** To make the file smaller, set `binary=True` when calling `save_word2vec_format()`.

In [127]:
# inspect the file content (first two lines only) in the Terminal
!head -n 3 ../data/models/gbr_commons/gbr_commons_word2vec_w5_d100.vectors
# note: on Windows, use 'more /n 3' instead of 'head -n 3' and backward instead of forward slashes

79338 100
the 0.101355255 0.14979516 -0.016551731 -0.15822083 0.2165262 -0.025575511 -0.027467923 -0.27066264 0.06362151 0.013817692 -0.27832943 0.08193668 -0.07645614 0.053718105 0.23482281 -0.30853987 -0.03073675 -0.06710661 0.18313242 0.21347052 -0.21600038 -0.14045033 0.056353867 0.064991266 -0.0686162 -0.16323717 -0.21860068 -0.10157401 0.024227735 0.1470622 0.13034672 0.17344618 0.017403167 0.13086256 -0.023156911 0.06504448 -0.3455694 0.24614073 -0.09477032 -0.05096083 0.2906817 -0.064221926 -0.09442973 0.016494839 -0.12768437 0.24630766 0.24658822 0.17062102 -0.1795416 -0.07855744 0.22618705 0.3084336 -0.0024432726 -0.14292161 0.028521495 -0.17187124 0.20169954 0.037252676 -0.13526437 -0.13781741 -0.18336569 0.062085446 -0.2638672 0.097755834 0.00085297757 -0.12780148 -0.03433464 -0.03774455 -0.1729083 -0.00498778 0.19696833 0.00024347208 -0.3031193 0.24918249 0.20757464 -0.16696502 -0.1293286 -0.20655353 -0.34562206 -0.3210666 0.1323354 -0.18413182 -0.23619245 0.07494202 -0.18