# Train a word2vec model with Gensim

In this example, we'll use Gensim to train a word2vec model on the text of Leo Tolstoy's novel War and Peace. We'll use `nltk` to split the novel into sentences and Gensim's own preprocessing to split each sentence into words.

Let's start with the imports:

In [1]:
import logging
import pprint  # beautify prints

import gensim
import nltk
# nltk.download('punkt')  # Uncomment this to download the punkt tokenizer

logging.basicConfig(level=logging.INFO)

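Next, let's define and instantiate the `TokenizedSentences` class, which splits the text into sentences with the `nltk` sentence tokenizer and then tokenizes each sentence into lowercase words with Gensim's `simple_preprocess`. It is a class with an `__iter__` method rather than a one-shot generator because Gensim iterates over the corpus several times (once to build the vocabulary, then once per training epoch):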

In [2]:
class TokenizedSentences:
    """Split text to sentences and tokenize them"""

    def __init__(self, filename: str):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, encoding='utf-8') as f:  # assumes a UTF-8 text file
            corpus = f.read()

        raw_sentences = nltk.tokenize.sent_tokenize(corpus)
        for sentence in raw_sentences:
            if len(sentence) > 0:
                yield gensim.utils.simple_preprocess(sentence, min_len=2, max_len=15)


sentences = TokenizedSentences('war_and_peace.txt')
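
To make the word-level step more concrete, here is a minimal standard-library sketch of the kind of filtering `simple_preprocess` applies with `min_len=2` and `max_len=15`. The function `approx_simple_preprocess` is a hypothetical stand-in written for illustration, not Gensim's code, and it may differ from the real function in edge cases:

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough approximation of gensim.utils.simple_preprocess (illustration only)."""
    # Lowercase, then keep runs of letters/underscores; digits and punctuation
    # act as separators and are discarded.
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop tokens outside the length bounds and tokens starting with an underscore.
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]


print(approx_simple_preprocess("I am a soldier of 1812!"))
# → ['am', 'soldier', 'of']
```

The one-letter words and the number are dropped, which is why the token counts in the training log are lower than a plain whitespace split of the novel would give.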

Next, we'll instantiate and train the word2vec model:

In [3]:
model = gensim.models.Word2Vec(
    sentences=sentences,
    sg=1,  # 0 for CBOW and 1 for Skip-gram
    size=100,  # size of the embedding vector (renamed `vector_size` in Gensim 4.0+)
    window=5,  # the size of the context window
    negative=5,  # number of negative samples drawn per positive example
    min_count=5,  # minimal word occurrences to include
    iter=5,  # number of epochs (renamed `epochs` in Gensim 4.0+)
)

INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #10000, processed 154865 words, keeping 9561 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #20000, processed 329560 words, keeping 13951 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #30000, processed 507242 words, keeping 16748 word types
INFO:gensim.models.word2vec:collected 17433 word types from a corpus of 551017 raw words and 32040 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 6538 unique words (37% of original 17433, drops 10895)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 531623 word corpus (96% of original 551017, drops 19394)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 17433 items
INFO:gensim.models.word2vec:sample=0.001 downsample
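
The log shows `min_count=5` keeping 6538 of the 17433 word types while still covering 96% of the running words. The sketch below illustrates, on a made-up toy corpus, the two ideas behind the `min_count` and `window` parameters: pruning rare words and forming (center, context) pairs for skip-gram. The helper names `prune_vocab` and `skipgram_pairs` are hypothetical, and Gensim's actual training loop differs (for instance, it also downsamples frequent words, as the `sample=0.001` log line indicates), so treat this as a conceptual sketch only:

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Count words and keep only those seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


def skipgram_pairs(sentence, window):
    """Yield (center, context) pairs using a fixed window on each side."""
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                yield center, sentence[j]


# Toy corpus, invented for illustration.
toy = [["the", "prince", "smiled"], ["the", "prince", "left"], ["the", "war"]]

print(prune_vocab(toy, min_count=2))
# → {'the': 3, 'prince': 2}

print(list(skipgram_pairs(["the", "prince", "smiled"], window=1)))
# → [('the', 'prince'), ('prince', 'the'), ('prince', 'smiled'), ('smiled', 'prince')]
```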

To see the result of the training, we'll query the model for the most similar words to `mother`:

In [4]:
print("Words most similar to 'mother':")
pprint.pprint(model.wv.most_similar(positive='mother', topn=5))

INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors


Words most similar to 'mother':
[('brother', 0.8962684869766235),
 ('daughter', 0.8957958221435547),
 ('father', 0.8954352140426636),
 ('sister', 0.8898148536682129),
 ('husband', 0.876844048500061)]
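
The scores in the output are cosine similarities between word vectors, so values close to 1 mean the vectors point in nearly the same direction. As a minimal sketch of that ranking, the example below uses hypothetical 3-dimensional vectors made up for illustration (the trained model uses 100 dimensions) and a hand-written `cosine_similarity` helper rather than Gensim itself:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, not taken from the trained model.
toy_vectors = {
    "mother": [1.0, 2.0, 0.0],
    "father": [2.0, 4.0, 0.0],  # same direction as "mother"
    "cannon": [0.0, 0.0, 3.0],  # orthogonal to "mother"
}

query = toy_vectors["mother"]
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in toy_vectors.items() if word != "mother"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)  # "father" ranks first (similarity near 1), "cannon" last (0)
```

Note that the vector length does not matter here: `father` is twice as long as `mother` but still scores close to 1 because only the direction is compared.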


We can also query the model with a combination of words. Let's try `woman` + `king` to see whether `queen` appears in the result:

In [5]:
print("Words most similar to 'woman' and 'king':")
pprint.pprint(model.wv.most_similar(positive=['woman', 'king'], topn=5))

Words most similar to 'woman' and 'king':
[('admirable', 0.9163172245025635),
 ('heiress', 0.9129906296730042),
 ('queen', 0.9082918167114258),
 ('providence', 0.9049242734909058),
 ('creature', 0.903453528881073)]


Indeed, `queen` is among the most similar words, in third place. However, other words, such as `creature`, are not relevant. Perhaps we should train the model on a larger dataset.
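
For intuition about how a multi-word query is scored, here is a minimal sketch that averages the unit-length vectors of the query words and ranks the remaining words by cosine similarity. Gensim's `most_similar` works roughly along these lines, but this is not its implementation: the helpers `unit`, `combine`, and `toy_most_similar` and the 2-dimensional vectors are all hypothetical, made up for illustration. The sketch also shows the more commonly cited analogy form, `king` - `man` + `woman`, which `most_similar` expresses through its `negative` argument:

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def combine(vectors, positive, negative=()):
    """Average the unit vectors of positive words minus those of negative words."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for words, sign in ((positive, 1.0), (negative, -1.0)):
        for word in words:
            for k, x in enumerate(unit(vectors[word])):
                total[k] += sign * x
    count = len(positive) + len(negative)
    return [x / count for x in total]


def toy_most_similar(vectors, positive, negative=(), topn=3):
    """Rank words (excluding the query words) by cosine similarity to the query."""
    query = unit(combine(vectors, positive, negative))
    exclude = set(positive) | set(negative)
    scores = [
        (word, sum(a * b for a, b in zip(query, unit(vec))))
        for word, vec in vectors.items() if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical vectors: first axis ~ "royalty", second axis ~ "gender".
toy = {
    "king": [1.0, -1.0],
    "queen": [1.0, 1.0],
    "man": [0.0, -1.0],
    "woman": [0.0, 1.0],
    "soldier": [0.2, -1.0],
}

print(toy_most_similar(toy, positive=["woman", "king"]))
print(toy_most_similar(toy, positive=["woman", "king"], negative=["man"]))
```

In this toy space `queen` comes out on top for both queries. On the real model, a query such as `model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)` could be tried as well, though with a corpus of this size the result is not guaranteed to improve.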