In [1]:
%matplotlib inline


Word2Vec Model
==============

Introduces Gensim's Word2Vec model and demonstrates its use on the `Lee Evaluation Corpus
<https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf>`_.



In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In case you missed the buzz, Word2Vec is a widely used algorithm based on neural
networks, commonly referred to as "deep learning" (though word2vec itself is rather shallow).
Using large amounts of unannotated plain text, word2vec learns relationships
between words automatically. The output is a set of vectors, one per word,
with remarkable linear relationships that allow us to do things like:

* vec("king") - vec("man") + vec("woman") =~ vec("queen")
* vec("Montreal Canadiens") - vec("Montreal") + vec("Toronto") =~ vec("Toronto Maple Leafs").

Word2vec is very useful in `automatic text tagging
<https://github.com/RaRe-Technologies/movie-plots-by-genre>`_\ , recommender
systems and machine translation.

This tutorial:

#. Introduces ``Word2Vec`` as an improvement over traditional bag-of-words
#. Shows off a demo of ``Word2Vec`` using a pre-trained model
#. Demonstrates training a new model from your own data
#. Demonstrates loading and saving models
#. Introduces several training parameters and demonstrates their effect
#. Discusses memory requirements
#. Visualizes Word2Vec embeddings by applying dimensionality reduction

Review: Bag-of-words
--------------------

.. Note:: Feel free to skip these review sections if you're already familiar with the models.

You may be familiar with the `bag-of-words model
<https://en.wikipedia.org/wiki/Bag-of-words_model>`_ from the
`core_concepts_vector` section.
This model transforms each document to a fixed-length vector of integers.
For example, given the sentences:

- ``John likes to watch movies. Mary likes movies too.``
- ``John also likes to watch football games. Mary hates football.``

The model outputs the vectors:

- ``[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]``
- ``[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]``

Each vector has 11 elements, where each element counts the number of times a
particular word occurred in the document.
The order of elements is arbitrary.
In the example above, the order of the elements corresponds to the words:
``["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"]``.
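
The counting step can be sketched in a few lines of plain Python. The
``tokenize`` and ``bag_of_words`` helpers are illustrative names, not part of
Gensim, and the tokenizer is an assumption that simply keeps runs of letters:

```python
import re
from collections import Counter

# Word order taken from the example above.
VOCABULARY = ["John", "likes", "to", "watch", "movies", "Mary",
              "too", "also", "football", "games", "hates"]


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def bag_of_words(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games. Mary hates football."
print(bag_of_words(doc1))
print(bag_of_words(doc2))
```

Running it reproduces the two vectors listed above, one count per vocabulary word.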

Bag-of-words models are surprisingly effective, but have several weaknesses.

First, they lose all information about word order: "John likes Mary" and
"Mary likes John" correspond to identical vectors. One remedy is the bag of
`n-grams <https://en.wikipedia.org/wiki/N-gram>`__
model, which represents documents as fixed-length vectors over word phrases
of length n. This captures local word order, but suffers from data
sparsity and high dimensionality.
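
To make the word-order point concrete, here is a small sketch using only the
standard library. The helper names are made up for this illustration:

```python
from collections import Counter


def unigram_counts(tokens):
    """Bag-of-words: count single words, ignoring order."""
    return Counter(tokens)


def bigram_counts(tokens):
    """Bag of 2-grams: count adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))


a = "John likes Mary".split()
b = "Mary likes John".split()

print(unigram_counts(a) == unigram_counts(b))  # True: word order is lost
print(bigram_counts(a) == bigram_counts(b))    # False: bigrams keep local order
```

Note that the number of possible bigrams grows roughly with the square of the
vocabulary size, which is where the sparsity and dimensionality problems come from.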

Second, the model does not attempt to learn the meaning of the underlying
words, and as a consequence, the distance between vectors doesn't always
reflect the difference in meaning.  The ``Word2Vec`` model addresses this
second problem.

Introducing: the ``Word2Vec`` Model
-----------------------------------

``Word2Vec`` is a more recent model that embeds words in a lower-dimensional
vector space using a shallow neural network. The result is a set of
word-vectors where vectors close together in vector space have similar
meanings based on context, and word-vectors distant to each other have
differing meanings. For example, ``strong`` and ``powerful`` would be close
together and ``strong`` and ``Paris`` would be relatively far.

There are two versions of this model, and the :py:class:`~gensim.models.word2vec.Word2Vec`
class implements them both:

1. Skip-grams (SG)
2. Continuous-bag-of-words (CBOW)

.. Important::
  Don't let the implementation details below scare you.
  They're advanced material: if it's too much, then move on to the next section.

The `Word2Vec Skip-gram <http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model>`__
model, for example, takes in pairs (word1, word2) generated by moving a
window across text data, and trains a 1-hidden-layer neural network on a
synthetic task: given an input word, predict a probability distribution over
the words that appear near it. A virtual `one-hot
<https://en.wikipedia.org/wiki/One-hot>`__ encoding of words
goes through a 'projection layer' to the hidden layer; these projection
weights are later interpreted as the word embeddings. So if the hidden layer
has 300 neurons, this network will give us 300-dimensional word embeddings.
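
The pair generation step can be sketched as follows. This is a simplified
illustration, not Gensim's implementation: ``skipgram_pairs`` is a
hypothetical helper, the window is fixed, and no frequent-word subsampling is
applied:

```python
def skipgram_pairs(tokens, window=2):
    """Return (input word, nearby word) pairs for every position in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["the", "quick", "brown", "fox"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

Each pair is one training example: the first word is the network input and
the second is a word the network should assign high probability to.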

Continuous-bag-of-words Word2vec is very similar to the skip-gram model. It
is also a 1-hidden-layer neural network. The synthetic training task now uses
the average of multiple input context words, rather than a single word as in
skip-gram, to predict the center word. Again, the projection weights that
turn one-hot words into averageable vectors, of the same width as the hidden
layer, are interpreted as the word embeddings.
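
A rough sketch of how a CBOW training example is formed, again in plain
Python. The helper names and the 2-dimensional projection weights are
invented for illustration; real models learn these weights during training:

```python
def cbow_examples(tokens, window=2):
    """Return (context words, center word) training examples."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, center))
    return examples


def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy 2-dimensional projection weights, one row per word (made-up numbers).
projection = {
    "the": [1.0, 0.0],
    "quick": [0.0, 1.0],
    "brown": [1.0, 1.0],
    "fox": [3.0, 1.0],
}

context, center = cbow_examples(["the", "quick", "brown", "fox"], window=1)[1]
hidden = average_vectors([projection[word] for word in context])
print(context, "->", center, hidden)
```

Looking up a row of ``projection`` is equivalent to multiplying a one-hot
vector by the projection matrix, which is why the one-hot encoding is only
"virtual".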




Word2Vec Demo
-------------

To see what ``Word2Vec`` can do, let's download a pre-trained model and play
around with it. We will fetch the Word2Vec model trained on part of the
Google News dataset, covering approximately 3 million words and phrases. Such
a model can take hours to train, but since it's already available,
downloading and loading it with Gensim takes minutes.

.. Important::
  The model is approximately 2GB, so you'll need a decent network connection
  to proceed.  Otherwise, skip ahead to the "Training Your Own Model" section
  below.

You may also check out an `online word2vec demo
<http://radimrehurek.com/2014/02/word2vec-tutorial/#app>`_ where you can try
this vector algebra for yourself. That demo runs ``word2vec`` on the
**entire** Google News dataset, of **about 100 billion words**.




In [3]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

2025-10-17 14:04:50,008 : INFO : loading projection weights from /home/PE/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
2025-10-17 14:05:35,350 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from /home/PE/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz', 'binary': True, 'encoding': 'utf8', 'datetime': '2025-10-17T14:05:35.350876', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'load_word2vec_format'}


A common operation is to retrieve the vocabulary of a model. That is trivial:



In [4]:
for index, word in enumerate(wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(wv.index_to_key)} is {word}")

word #0/3000000 is </s>
word #1/3000000 is in
word #2/3000000 is for
word #3/3000000 is that
word #4/3000000 is is
word #5/3000000 is on
word #6/3000000 is ##
word #7/3000000 is The
word #8/3000000 is with
word #9/3000000 is said


We can easily obtain vectors for terms the model is familiar with:




In [5]:
vec_king = wv['king']

Unfortunately, the model is unable to infer vectors for unfamiliar words.
This is one limitation of Word2Vec: if this limitation matters to you, check
out the FastText model.




In [6]:
try:
    vec_cameroon = wv['cameroon']
except KeyError:
    print("The word 'cameroon' does not appear in this model")

The word 'cameroon' does not appear in this model


Moving on, ``Word2Vec`` supports several word similarity tasks out of the
box.  You can see how the similarity intuitively decreases as the words get
less and less similar.




In [7]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

'car'	'minivan'	0.69
'car'	'bicycle'	0.54
'car'	'airplane'	0.42
'car'	'cereal'	0.14
'car'	'communism'	0.06
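
The scores returned by ``similarity`` are cosine similarities between the two
word vectors. For intuition, the same quantity can be computed by hand; the
function below is a standalone sketch operating on plain lists:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # vectors pointing the same way
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Values near 1 mean the vectors point in nearly the same direction, and values
near 0 mean they are close to orthogonal.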


Print the 5 words most similar to "car" and "minivan" taken together:



In [8]:
print(wv.most_similar(positive=['car', 'minivan'], topn=5))

[('SUV', 0.8532192707061768), ('vehicle', 0.8175783753395081), ('pickup_truck', 0.7763688564300537), ('Jeep', 0.7567334175109863), ('Ford_Explorer', 0.7565720081329346)]


Which of the below does not belong in the sequence?



In [9]:
print(wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))

car


Training Your Own Model
-----------------------

To start, you'll need some data for training the model. For the following
examples, we'll use the `Lee Evaluation Corpus
<https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf>`_
(which you `already have
<https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor>`_
if you've installed Gensim).

This corpus is small enough to fit entirely in memory, but we'll implement a
memory-friendly iterator that reads it line-by-line to demonstrate how you
would handle a larger corpus.




In [10]:
from gensim.test.utils import datapath
from gensim import utils

class MyCorpus:
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        # open the file in a context manager so it is closed after each pass
        with open(corpus_path) as fin:
            for line in fin:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)

2025-10-17 14:05:58,249 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2025-10-17 14:05:58,249 : INFO : built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)
2025-10-17 14:05:58,250 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)", 'datetime': '2025-10-17T14:05:58.250227', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'created'}


If we wanted to do any custom preprocessing (e.g. decode a non-standard
encoding, lowercase, remove numbers, extract named entities), all of it can
be done inside the ``MyCorpus`` iterator, and ``word2vec`` doesn’t need to
know. All that is required is that the input yields one sentence (a list of
UTF-8 words) after another.
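
As a sketch of that idea, the class below lowercases each line and drops
anything that isn't a run of letters. The class name, the regular expression
and the tiny temporary corpus are all assumptions made for this example:

```python
import os
import re
import tempfile


class PreprocessedCorpus:
    """Restartable iterable: re-reads the file each time it is iterated."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as fin:
            for line in fin:
                # Custom preprocessing: lowercase, keep alphabetic tokens only.
                tokens = re.findall(r"[a-z]+", line.lower())
                if tokens:
                    yield tokens


# Tiny stand-in corpus (made-up text), one document per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("The 3 quick foxes\nJumped over 2 lazy dogs\n")
    corpus_path = tmp.name

corpus = PreprocessedCorpus(corpus_path)
first_pass = list(corpus)
second_pass = list(corpus)  # works again because __iter__ reopens the file
print(first_pass)
os.remove(corpus_path)
```

Implementing ``__iter__`` on a class, rather than passing a one-shot
generator, matters because training reads the corpus more than once.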

Let's go ahead and train a model on our corpus.  Don't worry about the
training parameters much for now; we'll revisit them later.




In [11]:
import gensim.models

sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)

# for s in sentences:
#     print(s)
#     raise SystemExit

2025-10-17 14:05:58,699 : INFO : collecting all words and their counts
2025-10-17 14:05:58,700 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2025-10-17 14:05:58,762 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
2025-10-17 14:05:58,763 : INFO : Creating a fresh vocabulary
2025-10-17 14:05:58,768 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1750 unique words (25.07% of original 6981, drops 5231)', 'datetime': '2025-10-17T14:05:58.768180', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:05:58,769 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 49335 word corpus (84.84% of original 58152, drops 8817)', 'datetime': '2025-10-17T14:05:58.769085', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform'

Once we have our model, we can use it in the same way as in the demo above.

The main part of the model is ``model.wv``\ , where "wv" stands for "word vectors".




In [12]:
vec_king = model.wv['king']

Retrieving the vocabulary works the same way:



In [13]:
# Use model.wv (the model we just trained), not the pre-trained wv from the demo.
for index, word in enumerate(model.wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(model.wv.index_to_key)} is {word}")


Storing and loading models
--------------------------

You'll notice that training non-trivial models can take time.  Once you've
trained your model and it works as expected, you can save it to disk.  That
way, you don't have to spend time training it all over again later.

You can store/load models using the standard gensim methods:




In [14]:
import tempfile

with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:
    temporary_filepath = tmp.name
    model.save(temporary_filepath)
    #
    # The model is now safely stored in the filepath.
    # You can copy it to other machines, share it with others, etc.
    #
    # To load a saved model:
    #
    new_model = gensim.models.Word2Vec.load(temporary_filepath)

2025-10-17 14:06:00,272 : INFO : Word2Vec lifecycle event {'fname_or_handle': '/tmp/gensim-model-jfcd632k', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2025-10-17T14:06:00.272168', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'saving'}
2025-10-17 14:06:00,272 : INFO : not storing attribute cum_table
2025-10-17 14:06:00,274 : INFO : saved /tmp/gensim-model-jfcd632k
2025-10-17 14:06:00,275 : INFO : loading Word2Vec object from /tmp/gensim-model-jfcd632k
2025-10-17 14:06:00,453 : INFO : loading wv recursively from /tmp/gensim-model-jfcd632k.wv.* with mmap=None
2025-10-17 14:06:00,454 : INFO : setting ignored attribute cum_table to None
2025-10-17 14:06:00,464 : INFO : Word2Vec lifecycle event {'fname': '/tmp/gensim-model-jfcd632k', 'datetime': '2025-10-17T14:06:00.464941', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC

These methods use pickle internally, optionally ``mmap``\ 'ing the model's internal
large NumPy matrices into virtual memory directly from disk files, for
inter-process memory sharing.

In addition, you can load models created by the original C tool, both using
its text and binary formats::

  model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)
  # using gzipped/bz2 input works too, no need to unzip
  model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)




Training Parameters
-------------------

``Word2Vec`` accepts several parameters that affect both training speed and quality.

min_count
---------

``min_count`` is for pruning the internal dictionary. Words that appear only
once or twice in a billion-word corpus are probably uninteresting typos and
garbage. In addition, there’s not enough data to make any meaningful training
on those words, so it’s best to ignore them:

The default value of ``min_count`` is 5.



In [15]:
model = gensim.models.Word2Vec(sentences, min_count=10)

2025-10-17 14:06:00,837 : INFO : collecting all words and their counts
2025-10-17 14:06:00,838 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2025-10-17 14:06:00,916 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
2025-10-17 14:06:00,917 : INFO : Creating a fresh vocabulary
2025-10-17 14:06:00,920 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=10 retains 889 unique words (12.73% of original 6981, drops 6092)', 'datetime': '2025-10-17T14:06:00.920731', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:06:00,921 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=10 leaves 43776 word corpus (75.28% of original 58152, drops 14376)', 'datetime': '2025-10-17T14:06:00.921224', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platfor
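
Conceptually, the pruning amounts to counting every word and keeping those
that reach the threshold. The following is a simplified sketch with a
hypothetical ``prune_vocabulary`` helper and toy sentences, not Gensim's own
vocabulary code:

```python
from collections import Counter


def prune_vocabulary(sentences, min_count=5):
    """Keep only the words that occur at least min_count times overall."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]
print(prune_vocabulary(toy_sentences, min_count=2))
```

Here "ran" and "dog" occur only once, so they are dropped when ``min_count=2``.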

vector_size
-----------

``vector_size`` is the number of dimensions (N) of the N-dimensional space that
gensim Word2Vec maps the words onto.

Larger ``vector_size`` values require more training data, but can lead to better (more
accurate) models. Reasonable values are in the tens to hundreds.




In [16]:
# The default value of vector_size is 100.
model = gensim.models.Word2Vec(sentences, vector_size=200)

2025-10-17 14:06:01,357 : INFO : collecting all words and their counts
2025-10-17 14:06:01,360 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2025-10-17 14:06:01,469 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
2025-10-17 14:06:01,469 : INFO : Creating a fresh vocabulary
2025-10-17 14:06:01,478 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1750 unique words (25.07% of original 6981, drops 5231)', 'datetime': '2025-10-17T14:06:01.478168', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:06:01,478 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 49335 word corpus (84.84% of original 58152, drops 8817)', 'datetime': '2025-10-17T14:06:01.478852', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform'

workers
-------

``workers``, the last of the major parameters (full list `here
<http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec>`_),
is for training parallelization, to speed up training:




In [17]:
# The default value of workers is 3.
model = gensim.models.Word2Vec(sentences, workers=4)

2025-10-17 14:06:01,988 : INFO : collecting all words and their counts
2025-10-17 14:06:01,990 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2025-10-17 14:06:02,092 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
2025-10-17 14:06:02,093 : INFO : Creating a fresh vocabulary
2025-10-17 14:06:02,101 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1750 unique words (25.07% of original 6981, drops 5231)', 'datetime': '2025-10-17T14:06:02.101511', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:06:02,102 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 49335 word corpus (84.84% of original 58152, drops 8817)', 'datetime': '2025-10-17T14:06:02.102183', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform'

The ``workers`` parameter only has an effect if you have `Cython
<http://cython.org/>`_ installed. Without Cython, you’ll only be able to use
one core because of the `GIL
<https://wiki.python.org/moin/GlobalInterpreterLock>`_ (and ``word2vec``
training will be `miserably slow
<http://rare-technologies.com/word2vec-in-python-part-two-optimizing/>`_\ ).




Memory
------

At its core, ``word2vec`` model parameters are stored as matrices (NumPy
arrays). Each array is **#vocabulary** (controlled by the ``min_count`` parameter)
times **vector size** (the ``vector_size`` parameter) of floats (single precision aka 4 bytes).

Three such matrices are held in RAM (work is underway to reduce that number
to two, or even one). So if your input contains 100,000 unique words, and you
asked for ``vector_size=200``\ , the model will require approx.
``100,000*200*4*3 bytes = ~229MB``.

There’s a little extra memory needed for storing the vocabulary tree (100,000 words would
take a few megabytes), but unless your words are extremely loooong strings, memory
footprint will be dominated by the three matrices above.
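
The estimate above is easy to reproduce for other vocabulary and vector
sizes. ``estimate_matrix_bytes`` is a hypothetical helper that only encodes
the arithmetic from this section (three float32 matrices) and ignores the
vocabulary overhead:

```python
def estimate_matrix_bytes(vocab_size, vector_size, n_matrices=3, bytes_per_float=4):
    """Approximate RAM taken by the model's weight matrices, in bytes."""
    return vocab_size * vector_size * bytes_per_float * n_matrices


total = estimate_matrix_bytes(100_000, 200)
print(total, "bytes")
print(round(total / 1024 ** 2), "MiB")
```

For 100,000 words and ``vector_size=200`` this gives 240,000,000 bytes, which
is about 229 when expressed in units of 1024 * 1024 bytes.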




Evaluating
----------

``Word2Vec`` training is an unsupervised task, so there’s no good way to
objectively evaluate the result. Evaluation depends on your end application.

Google has released their testing set of about 20,000 syntactic and semantic
test examples, following the “A is to B as C is to D” task. It is provided in
the 'datasets' folder.

For example, a syntactic analogy of comparative type is ``bad:worse;good:?``.
There are a total of 9 types of syntactic comparisons in the dataset, such as
plural nouns and nouns of opposite meaning.

The semantic questions contain five types of semantic analogies, such as
capital cities (``Paris:France;Tokyo:?``) or family members
(``brother:sister;dad:?``).
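
To show what answering such a question involves, here is a minimal sketch of
the vector-offset idea on made-up 2-dimensional vectors. It is not Gensim's
implementation; the vectors, ``solve_analogy`` and ``analogy_accuracy`` are
all assumptions for illustration:

```python
import math

# Made-up vectors: the first axis loosely means "royalty", the second "gender".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.2, 1.0],
    "woman": [0.2, -1.0],
    "apple": [-1.0, 0.1],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def solve_analogy(a, b, c, vectors):
    """Answer "a is to b as c is to ?" with the word nearest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(questions, vectors):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(a, b, c, vectors) == d for a, b, c, d in questions)
    return correct / len(questions)


questions = [("man", "king", "woman", "queen"), ("king", "man", "queen", "woman")]
print(solve_analogy("man", "king", "woman", toy_vectors))
print(analogy_accuracy(questions, toy_vectors))
```

The question words themselves are excluded from the candidates, since
otherwise one of them would often be the nearest neighbour of the target.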




Gensim supports the same evaluation set, in exactly the same format:




In [18]:
model.wv.evaluate_word_analogies(datapath('questions-words.txt'))

2025-10-17 14:06:02,807 : INFO : Evaluating word analogies for top 300000 words in the model on /home/PE/Documents/small_Word2Vec/.venv3.11/lib/python3.11/site-packages/gensim/test/test_data/questions-words.txt
2025-10-17 14:06:02,826 : INFO : capital-common-countries: 0.0% (0/6)
2025-10-17 14:06:02,969 : INFO : capital-world: 0.0% (0/2)
2025-10-17 14:06:03,077 : INFO : family: 0.0% (0/6)
2025-10-17 14:06:03,107 : INFO : gram3-comparative: 0.0% (0/20)
2025-10-17 14:06:03,121 : INFO : gram4-superlative: 0.0% (0/12)
2025-10-17 14:06:03,139 : INFO : gram5-present-participle: 0.0% (0/20)
2025-10-17 14:06:03,170 : INFO : gram6-nationality-adjective: 0.0% (0/30)
2025-10-17 14:06:03,231 : INFO : gram7-past-tense: 0.0% (0/20)
2025-10-17 14:06:03,256 : INFO : gram8-plural: 0.0% (0/30)
2025-10-17 14:06:03,263 : INFO : Quadruplets with out-of-vocabulary words: 99.3%
2025-10-17 14:06:03,266 : INFO : NB: analogies containing OOV words were skipped from evaluation! To change this behavior, use "dumm

(0.0,
 [{'section': 'capital-common-countries',
   'correct': [],
   'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
    ('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'),
    ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),
    ('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'),
    ('PARIS', 'FRANCE', 'CANBERRA', 'AUSTRALIA'),
    ('PARIS', 'FRANCE', 'KABUL', 'AFGHANISTAN')]},
  {'section': 'capital-world',
   'correct': [],
   'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
    ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE')]},
  {'section': 'currency', 'correct': [], 'incorrect': []},
  {'section': 'city-in-state', 'correct': [], 'incorrect': []},
  {'section': 'family',
   'correct': [],
   'incorrect': [('HE', 'SHE', 'HIS', 'HER'),
    ('HE', 'SHE', 'MAN', 'WOMAN'),
    ('HIS', 'HER', 'MAN', 'WOMAN'),
    ('HIS', 'HER', 'HE', 'SHE'),
    ('MAN', 'WOMAN', 'HE', 'SHE'),
    ('MAN', 'WOMAN', 'HIS', 'HER')]},
  {'section': 'gram1-adjective-to-adverb', 'correct': [

This ``evaluate_word_analogies`` method takes an `optional parameter
<http://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.evaluate_word_analogies>`_
``restrict_vocab`` which limits which test examples are to be considered.




In the December 2016 release of Gensim we added a better way to evaluate semantic similarity.

By default it uses the academic dataset WS-353, but you can create a dataset
specific to your business based on it. It contains word pairs together with
human-assigned similarity judgments. It measures the relatedness or
co-occurrence of two words. For example, 'coast' and 'shore' are very similar
as they appear in the same context. At the same time 'clothes' and 'closet'
are less similar because they are related but not interchangeable.




In [19]:
model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))

2025-10-17 14:06:03,512 : INFO : Skipping line #2 with OOV words: love	sex	6.77
2025-10-17 14:06:03,513 : INFO : Skipping line #3 with OOV words: tiger	cat	7.35
2025-10-17 14:06:03,513 : INFO : Skipping line #4 with OOV words: tiger	tiger	10.00
2025-10-17 14:06:03,514 : INFO : Skipping line #5 with OOV words: book	paper	7.46
2025-10-17 14:06:03,514 : INFO : Skipping line #6 with OOV words: computer	keyboard	7.62
2025-10-17 14:06:03,514 : INFO : Skipping line #7 with OOV words: computer	internet	7.58
2025-10-17 14:06:03,515 : INFO : Skipping line #9 with OOV words: train	car	6.31
2025-10-17 14:06:03,515 : INFO : Skipping line #10 with OOV words: telephone	communication	7.50
2025-10-17 14:06:03,516 : INFO : Skipping line #14 with OOV words: bread	butter	6.19
2025-10-17 14:06:03,516 : INFO : Skipping line #15 with OOV words: cucumber	potato	5.92
2025-10-17 14:06:03,517 : INFO : Skipping line #16 with OOV words: doctor	nurse	7.00
2025-10-17 14:06:03,517 : INFO : Skipping line #18 with OOV 

(PearsonRResult(statistic=0.2121643451542885, pvalue=0.10364502314397497),
 SignificanceResult(statistic=0.14820269908504785, pvalue=0.2584431957742359),
 83.0028328611898)

.. Important::
  Good performance on Google's or WS-353 test set doesn’t mean word2vec will
  work well in your application, or vice versa. It’s always best to evaluate
  directly on your intended task. For an example of how to use word2vec in a
  classifier pipeline, see this `tutorial
  <https://github.com/RaRe-Technologies/movie-plots-by-genre>`_.




Online training / Resuming training
-----------------------------------

Advanced users can load a model and continue training it with more sentences
and `new vocabulary words <online_w2v_tutorial.ipynb>`_:




In [20]:
model = gensim.models.Word2Vec.load(temporary_filepath)
more_sentences = [
    ['Advanced', 'users', 'can', 'load', 'a', 'model',
     'and', 'continue', 'training', 'it', 'with', 'more', 'sentences'],
]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)

# cleaning up temporary file
import os
os.remove(temporary_filepath)

2025-10-17 14:06:04,396 : INFO : loading Word2Vec object from /tmp/gensim-model-jfcd632k
2025-10-17 14:06:04,400 : INFO : loading wv recursively from /tmp/gensim-model-jfcd632k.wv.* with mmap=None
2025-10-17 14:06:04,401 : INFO : setting ignored attribute cum_table to None
2025-10-17 14:06:04,427 : INFO : Word2Vec lifecycle event {'fname': '/tmp/gensim-model-jfcd632k', 'datetime': '2025-10-17T14:06:04.427864', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'loaded'}
2025-10-17 14:06:04,429 : INFO : collecting all words and their counts
2025-10-17 14:06:04,429 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2025-10-17 14:06:04,430 : INFO : collected 13 word types from a corpus of 13 raw words and 1 sentences
2025-10-17 14:06:04,430 : INFO : Updating model with new vocabulary
2025-10-17 14:06:04,443 : INFO : Word2Vec lifecycle event {'msg': 'added 0 new u

You may need to tweak the ``total_words`` parameter to ``train()``,
depending on what learning rate decay you want to simulate.
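
For intuition about what is being simulated, the sketch below assumes the
learning rate drops linearly from ``alpha`` to ``min_alpha`` as training
progresses. ``linear_alpha`` is a hypothetical helper, and its default values
are chosen to mirror Gensim's documented defaults for those two parameters:

```python
def linear_alpha(progress, start_alpha=0.025, min_alpha=0.0001):
    """Learning rate after a given fraction (0.0 to 1.0) of training is done."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    return start_alpha - (start_alpha - min_alpha) * progress


for words_done, total_words in [(0, 1000), (500, 1000), (1000, 1000)]:
    print(words_done, linear_alpha(words_done / total_words))
```

Under this assumption, overstating the total makes the rate decay more slowly
over the sentences you actually supply, and understating it makes the rate
reach its minimum sooner.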

Note that it’s not possible to resume training with models generated by the C
tool and loaded via ``KeyedVectors.load_word2vec_format()``. You can still use them for
querying/similarity, but information vital for training (the vocab tree) is
missing there.




Training Loss Computation
-------------------------

The parameter ``compute_loss`` can be used to toggle computation of loss
while training the Word2Vec model. The computed loss is stored in the model
attribute ``running_training_loss`` and can be retrieved using the function
``get_latest_training_loss`` as follows:




In [21]:
# instantiating and training the Word2Vec model
model_with_loss = gensim.models.Word2Vec(
    sentences,
    min_count=1,
    compute_loss=True,
    hs=0,
    sg=1,
    seed=42,
)

# getting the training loss value
training_loss = model_with_loss.get_latest_training_loss()
print(training_loss)

2025-10-17 14:06:04,956 : INFO : collecting all words and their counts
2025-10-17 14:06:04,957 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2025-10-17 14:06:05,019 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
2025-10-17 14:06:05,020 : INFO : Creating a fresh vocabulary
2025-10-17 14:06:05,034 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 6981 unique words (100.00% of original 6981, drops 0)', 'datetime': '2025-10-17T14:06:05.034742', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:06:05,035 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 58152 word corpus (100.00% of original 58152, drops 0)', 'datetime': '2025-10-17T14:06:05.035325', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'L

1356487.375


Benchmarks
----------

Let's run some benchmarks to see the effect of the training loss computation code
on training time.

We'll use the following data for the benchmarks:

#. Lee Background corpus: included in gensim's test data
#. Text8 corpus.  To demonstrate the effect of corpus size, we'll look at the
   first 1MB, 10MB, 50MB of the corpus, as well as the entire thing.




In [22]:
import io
import os

import gensim.models.word2vec
import gensim.downloader as api
import smart_open


def head(path, size):
    with smart_open.open(path) as fin:
        return io.StringIO(fin.read(size))


def generate_input_data():
    lee_path = datapath('lee_background.cor')
    ls = gensim.models.word2vec.LineSentence(lee_path)
    ls.name = '25kB'
    yield ls

    text8_path = api.load('text8').fn
    labels = ('1MB', '10MB', '50MB', '100MB')
    sizes = (1024 ** 2, 10 * 1024 ** 2, 50 * 1024 ** 2, 100 * 1024 ** 2)
    for label, size in zip(labels, sizes):
        ls = gensim.models.word2vec.LineSentence(head(text8_path, size))
        ls.name = label
        yield ls


input_data = list(generate_input_data())

We now compare the training time taken for different combinations of input
data and model training parameters like ``hs`` and ``sg``.

For each combination, we repeat the test several times to obtain the mean and
standard deviation of the test duration.
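
The repeat-and-summarize pattern can be isolated into a small helper. This is
a standalone sketch using only the standard library; ``time_repeated`` is a
hypothetical name, and the workload being timed is a placeholder rather than
model training:

```python
import statistics
import time


def time_repeated(func, repeats=3):
    """Run func several times; return (mean, population std) of durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.pstdev(durations)


# Placeholder workload standing in for a training run.
mean_s, std_s = time_repeated(lambda: sum(range(100_000)))
print(f"mean={mean_s:.6f}s std={std_s:.6f}s")
```

``statistics.pstdev`` computes the population standard deviation, which is
also what ``np.std`` returns by default.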




In [23]:
# Temporarily reduce logging verbosity
logging.root.level = logging.ERROR

import time
import numpy as np
import pandas as pd

train_time_values = []
seed_val = 42
sg_values = [0, 1]
hs_values = [0, 1]

fast = True
if fast:
    input_data_subset = input_data[:3]
else:
    input_data_subset = input_data


for data in input_data_subset:
    for sg_val in sg_values:
        for hs_val in hs_values:
            for loss_flag in [True, False]:
                time_taken_list = []
                for i in range(3):
                    start_time = time.time()
                    w2v_model = gensim.models.Word2Vec(
                        data,
                        compute_loss=loss_flag,
                        sg=sg_val,
                        hs=hs_val,
                        seed=seed_val,
                    )
                    time_taken_list.append(time.time() - start_time)

                time_taken_list = np.array(time_taken_list)
                time_mean = np.mean(time_taken_list)
                time_std = np.std(time_taken_list)

                model_result = {
                    'train_data': data.name,
                    'compute_loss': loss_flag,
                    'sg': sg_val,
                    'hs': hs_val,
                    'train_time_mean': time_mean,
                    'train_time_std': time_std,
                }
                print("Word2vec model #%i: %s" % (len(train_time_values), model_result))
                train_time_values.append(model_result)

train_times_table = pd.DataFrame(train_time_values)
train_times_table = train_times_table.sort_values(
    by=['train_data', 'sg', 'hs', 'compute_loss'],
    ascending=[False, False, True, False],
)
print(train_times_table)

2025-10-17 14:06:15,283 : INFO : collecting all words and their counts
2025-10-17 14:06:15,284 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2025-10-17 14:06:15,304 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences
2025-10-17 14:06:15,305 : INFO : Creating a fresh vocabulary
2025-10-17 14:06:15,312 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-10-17T14:06:15.312941', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:06:15,313 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-10-17T14:06:15.313687', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platfo

Word2vec model #0: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 0.20878203709920248, 'train_time_std': 0.009089137925210865}


2025-10-17 14:06:16,117 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences
2025-10-17 14:06:16,117 : INFO : Creating a fresh vocabulary
2025-10-17 14:06:16,124 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-10-17T14:06:16.124235', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:06:16,125 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-10-17T14:06:16.125764', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:06:16,133 : INFO : deleting the raw counts dictionary of 10781 items
2025-

Word2vec model #1: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 0.19549846649169922, 'train_time_std': 0.00937267917370574}


2025-10-17 14:06:16,726 : INFO : EPOCH 2: training on 59890 raw words (32654 effective words) took 0.0s, 696542 effective words/s
2025-10-17 14:06:16,772 : INFO : EPOCH 3: training on 59890 raw words (32582 effective words) took 0.0s, 742094 effective words/s
2025-10-17 14:06:16,812 : INFO : EPOCH 4: training on 59890 raw words (32561 effective words) took 0.0s, 831652 effective words/s
2025-10-17 14:06:16,813 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162881 effective words) took 0.2s, 712576 effective words/s', 'datetime': '2025-10-17T14:06:16.813515', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'train'}
2025-10-17 14:06:16,813 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec<vocab=1762, vector_size=100, alpha=0.025>', 'datetime': '2025-10-17T14:06:16.813933', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]

Word2vec model #2: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 0.3348703384399414, 'train_time_std': 0.013008925941800175}


2025-10-17 14:06:17,735 : INFO : EPOCH 0: training on 59890 raw words (32543 effective words) took 0.1s, 550342 effective words/s
2025-10-17 14:06:17,789 : INFO : EPOCH 1: training on 59890 raw words (32552 effective words) took 0.1s, 626928 effective words/s
2025-10-17 14:06:17,840 : INFO : EPOCH 2: training on 59890 raw words (32630 effective words) took 0.0s, 660436 effective words/s
2025-10-17 14:06:17,884 : INFO : EPOCH 3: training on 59890 raw words (32560 effective words) took 0.0s, 772602 effective words/s
2025-10-17 14:06:17,931 : INFO : EPOCH 4: training on 59890 raw words (32583 effective words) took 0.0s, 714649 effective words/s
2025-10-17 14:06:17,931 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162868 effective words) took 0.3s, 635673 effective words/s', 'datetime': '2025-10-17T14:06:17.931786', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'ev

Word2vec model #3: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 0.3885192076365153, 'train_time_std': 0.030623544473732296}


2025-10-17 14:06:18,871 : INFO : EPOCH 1: training on 59890 raw words (32676 effective words) took 0.1s, 461568 effective words/s
2025-10-17 14:06:18,938 : INFO : EPOCH 2: training on 59890 raw words (32621 effective words) took 0.1s, 501928 effective words/s
2025-10-17 14:06:19,013 : INFO : EPOCH 3: training on 59890 raw words (32698 effective words) took 0.1s, 447644 effective words/s
2025-10-17 14:06:19,088 : INFO : EPOCH 4: training on 59890 raw words (32561 effective words) took 0.1s, 447913 effective words/s
2025-10-17 14:06:19,089 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (163224 effective words) took 0.4s, 457245 effective words/s', 'datetime': '2025-10-17T14:06:19.089022', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'train'}
2025-10-17 14:06:19,089 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec<vocab=1762, vector_size=100, alpha=0

Word2vec model #4: {'train_data': '25kB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 0.4079549312591553, 'train_time_std': 0.011481673547809684}


2025-10-17 14:06:20,148 : INFO : EPOCH 2: training on 59890 raw words (32591 effective words) took 0.1s, 489712 effective words/s
2025-10-17 14:06:20,222 : INFO : EPOCH 3: training on 59890 raw words (32763 effective words) took 0.1s, 455409 effective words/s
2025-10-17 14:06:20,292 : INFO : EPOCH 4: training on 59890 raw words (32621 effective words) took 0.1s, 481195 effective words/s
2025-10-17 14:06:20,292 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (163223 effective words) took 0.4s, 463533 effective words/s', 'datetime': '2025-10-17T14:06:20.292749', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'train'}
2025-10-17 14:06:20,293 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec<vocab=1762, vector_size=100, alpha=0.025>', 'datetime': '2025-10-17T14:06:20.293610', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]

Word2vec model #5: {'train_data': '25kB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 0.40966296195983887, 'train_time_std': 0.020924562955893372}


2025-10-17 14:06:21,341 : INFO : EPOCH 0: training on 59890 raw words (32668 effective words) took 0.1s, 242497 effective words/s
2025-10-17 14:06:21,487 : INFO : EPOCH 1: training on 59890 raw words (32566 effective words) took 0.1s, 226058 effective words/s
2025-10-17 14:06:21,619 : INFO : EPOCH 2: training on 59890 raw words (32623 effective words) took 0.1s, 249334 effective words/s
2025-10-17 14:06:21,760 : INFO : EPOCH 3: training on 59890 raw words (32636 effective words) took 0.1s, 235311 effective words/s
2025-10-17 14:06:21,896 : INFO : EPOCH 4: training on 59890 raw words (32662 effective words) took 0.1s, 242129 effective words/s
2025-10-17 14:06:21,897 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (163155 effective words) took 0.7s, 235994 effective words/s', 'datetime': '2025-10-17T14:06:21.897238', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'ev

Word2vec model #6: {'train_data': '25kB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 0.8081325689951578, 'train_time_std': 0.040626869392516037}


2025-10-17 14:06:23,773 : INFO : EPOCH 0: training on 59890 raw words (32668 effective words) took 0.1s, 227935 effective words/s
2025-10-17 14:06:23,920 : INFO : EPOCH 1: training on 59890 raw words (32670 effective words) took 0.1s, 223091 effective words/s
2025-10-17 14:06:24,052 : INFO : EPOCH 2: training on 59890 raw words (32552 effective words) took 0.1s, 250414 effective words/s
2025-10-17 14:06:24,191 : INFO : EPOCH 3: training on 59890 raw words (32664 effective words) took 0.1s, 238018 effective words/s
2025-10-17 14:06:24,331 : INFO : EPOCH 4: training on 59890 raw words (32644 effective words) took 0.1s, 236484 effective words/s
2025-10-17 14:06:24,332 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (163198 effective words) took 0.7s, 231978 effective words/s', 'datetime': '2025-10-17T14:06:24.332183', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'ev

Word2vec model #7: {'train_data': '25kB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 0.8055164813995361, 'train_time_std': 0.0255577470170492}


2025-10-17 14:06:26,246 : INFO : EPOCH 1: training on 175599 raw words (110105 effective words) took 0.1s, 1464169 effective words/s
2025-10-17 14:06:26,328 : INFO : EPOCH 2: training on 175599 raw words (110141 effective words) took 0.1s, 1601596 effective words/s
2025-10-17 14:06:26,411 : INFO : EPOCH 3: training on 175599 raw words (110408 effective words) took 0.1s, 1497611 effective words/s
2025-10-17 14:06:26,494 : INFO : EPOCH 4: training on 175599 raw words (110320 effective words) took 0.1s, 1512013 effective words/s
2025-10-17 14:06:26,494 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550968 effective words) took 0.4s, 1344397 effective words/s', 'datetime': '2025-10-17T14:06:26.494865', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'train'}
2025-10-17 14:06:26,495 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec<vocab=4125, vector_size

Word2vec model #8: {'train_data': '1MB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 0.5571568012237549, 'train_time_std': 0.038710054112509454}


2025-10-17 14:06:27,838 : INFO : EPOCH 0: training on 175599 raw words (110135 effective words) took 0.1s, 1409103 effective words/s
2025-10-17 14:06:27,923 : INFO : EPOCH 1: training on 175599 raw words (110411 effective words) took 0.1s, 1534562 effective words/s
2025-10-17 14:06:28,000 : INFO : EPOCH 2: training on 175599 raw words (110207 effective words) took 0.1s, 1669336 effective words/s
2025-10-17 14:06:28,072 : INFO : EPOCH 3: training on 175599 raw words (110104 effective words) took 0.1s, 1771107 effective words/s
2025-10-17 14:06:28,147 : INFO : EPOCH 4: training on 175599 raw words (110399 effective words) took 0.1s, 1763645 effective words/s
2025-10-17 14:06:28,148 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (551256 effective words) took 0.4s, 1382395 effective words/s', 'datetime': '2025-10-17T14:06:28.148603', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with

Word2vec model #9: {'train_data': '1MB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 0.5303853352864584, 'train_time_std': 0.03153330569643895}


2025-10-17 14:06:29,564 : INFO : EPOCH 0: training on 175599 raw words (109994 effective words) took 0.1s, 800108 effective words/s
2025-10-17 14:06:29,722 : INFO : EPOCH 1: training on 175599 raw words (110240 effective words) took 0.1s, 745350 effective words/s
2025-10-17 14:06:29,862 : INFO : EPOCH 2: training on 175599 raw words (110006 effective words) took 0.1s, 848273 effective words/s
2025-10-17 14:06:30,004 : INFO : EPOCH 3: training on 175599 raw words (110409 effective words) took 0.1s, 834441 effective words/s
2025-10-17 14:06:30,152 : INFO : EPOCH 4: training on 175599 raw words (110151 effective words) took 0.1s, 801047 effective words/s
2025-10-17 14:06:30,153 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550800 effective words) took 0.7s, 746902 effective words/s', 'datetime': '2025-10-17T14:06:30.153428', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc

Word2vec model #10: {'train_data': '1MB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 0.9319492975870768, 'train_time_std': 0.008903690813385734}


2025-10-17 14:06:32,232 : INFO : resetting layer weights
2025-10-17 14:06:32,234 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-10-17T14:06:32.234633', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'build_vocab'}
2025-10-17 14:06:32,235 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-10-17T14:06:32.235511', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'train'}
2025-10-17 14:06:32,382 : INFO : EPOCH 0: training on 175599 raw words (110075 effective words) took 0.1s, 807059 effective words/s
2025-10-17 14:06:32,534 : INFO : EPOCH 1: training on 175599 raw words (110135 effective wo

Word2vec model #11: {'train_data': '1MB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 0.9427781105041504, 'train_time_std': 0.011839644536183214}


2025-10-17 14:06:35,226 : INFO : EPOCH 0: training on 175599 raw words (110087 effective words) took 0.2s, 453983 effective words/s
2025-10-17 14:06:35,462 : INFO : EPOCH 1: training on 175599 raw words (110309 effective words) took 0.2s, 487847 effective words/s
2025-10-17 14:06:35,695 : INFO : EPOCH 2: training on 175599 raw words (110151 effective words) took 0.2s, 494808 effective words/s
2025-10-17 14:06:35,914 : INFO : EPOCH 3: training on 175599 raw words (110251 effective words) took 0.2s, 526094 effective words/s
2025-10-17 14:06:36,158 : INFO : EPOCH 4: training on 175599 raw words (110252 effective words) took 0.2s, 473519 effective words/s
2025-10-17 14:06:36,159 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (551050 effective words) took 1.2s, 465075 effective words/s', 'datetime': '2025-10-17T14:06:36.159377', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc

Word2vec model #12: {'train_data': '1MB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 1.3096357186635335, 'train_time_std': 0.012777475442690794}


2025-10-17 14:06:39,114 : INFO : EPOCH 0: training on 175599 raw words (110344 effective words) took 0.2s, 525810 effective words/s
2025-10-17 14:06:39,344 : INFO : EPOCH 1: training on 175599 raw words (110313 effective words) took 0.2s, 505501 effective words/s
2025-10-17 14:06:39,580 : INFO : EPOCH 2: training on 175599 raw words (110382 effective words) took 0.2s, 489497 effective words/s
2025-10-17 14:06:39,817 : INFO : EPOCH 3: training on 175599 raw words (110250 effective words) took 0.2s, 486312 effective words/s
2025-10-17 14:06:40,039 : INFO : EPOCH 4: training on 175599 raw words (110454 effective words) took 0.2s, 524708 effective words/s
2025-10-17 14:06:40,040 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (551743 effective words) took 1.1s, 481577 effective words/s', 'datetime': '2025-10-17T14:06:40.040380', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc

Word2vec model #13: {'train_data': '1MB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 1.2585927645365398, 'train_time_std': 0.024879737614813165}


2025-10-17 14:06:42,810 : INFO : built huffman tree with maximum node depth 15
2025-10-17 14:06:42,836 : INFO : estimated required memory for 4125 words and 100 dimensions: 7837500 bytes
2025-10-17 14:06:42,836 : INFO : resetting layer weights
2025-10-17 14:06:42,838 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-10-17T14:06:42.838654', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'build_vocab'}
2025-10-17 14:06:42,839 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-10-17T14:06:42.839765', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'train'}
2025-10-17 14:06:43,361 : INFO : EPO

Word2vec model #14: {'train_data': '1MB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 2.765448013941447, 'train_time_std': 0.10220001702041183}


2025-10-17 14:06:51,108 : INFO : built huffman tree with maximum node depth 15
2025-10-17 14:06:51,137 : INFO : estimated required memory for 4125 words and 100 dimensions: 7837500 bytes
2025-10-17 14:06:51,137 : INFO : resetting layer weights
2025-10-17 14:06:51,139 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-10-17T14:06:51.139685', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'build_vocab'}
2025-10-17 14:06:51,140 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-10-17T14:06:51.140503', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'train'}
2025-10-17 14:06:51,617 : INFO : EPO

Word2vec model #15: {'train_data': '1MB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 2.7010581493377686, 'train_time_std': 0.07374958487184224}


2025-10-17 14:06:59,426 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2025-10-17 14:06:59,426 : INFO : Creating a fresh vocabulary
2025-10-17 14:06:59,494 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-10-17T14:06:59.494623', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:06:59,495 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-10-17T14:06:59.495226', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:06:59,583 : INFO : deleting the raw counts dictionary of 73167 ite

Word2vec model #16: {'train_data': '10MB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 6.148722728093465, 'train_time_std': 0.07312011559651219}


2025-10-17 14:07:17,863 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2025-10-17 14:07:17,863 : INFO : Creating a fresh vocabulary
2025-10-17 14:07:17,927 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-10-17T14:07:17.927948', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:07:17,928 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-10-17T14:07:17.928557', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:07:18,014 : INFO : deleting the raw counts dictionary of 73167 ite

Word2vec model #17: {'train_data': '10MB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 5.517084995905559, 'train_time_std': 0.27259832355268065}


2025-10-17 14:07:34,439 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2025-10-17 14:07:34,440 : INFO : Creating a fresh vocabulary
2025-10-17 14:07:34,501 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-10-17T14:07:34.501776', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:07:34,502 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-10-17T14:07:34.502341', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:07:34,723 : INFO : deleting the raw counts dictionary of 73167 ite

Word2vec model #18: {'train_data': '10MB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 11.313425143559774, 'train_time_std': 0.22267623597938954}


2025-10-17 14:08:08,328 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2025-10-17 14:08:08,328 : INFO : Creating a fresh vocabulary
2025-10-17 14:08:08,392 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-10-17T14:08:08.392104', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:08:08,392 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-10-17T14:08:08.392910', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:08:08,485 : INFO : deleting the raw counts dictionary of 73167 ite

Word2vec model #19: {'train_data': '10MB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 10.859464486440023, 'train_time_std': 0.525564304551739}


2025-10-17 14:08:40,904 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2025-10-17 14:08:40,904 : INFO : Creating a fresh vocabulary
2025-10-17 14:08:40,964 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-10-17T14:08:40.964121', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:08:40,964 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-10-17T14:08:40.964706', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:08:41,047 : INFO : deleting the raw counts dictionary of 73167 ite

Word2vec model #20: {'train_data': '10MB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 15.386495033899942, 'train_time_std': 0.24339972917902924}


2025-10-17 14:09:27,087 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2025-10-17 14:09:27,088 : INFO : Creating a fresh vocabulary
2025-10-17 14:09:27,145 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-10-17T14:09:27.145629', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:09:27,146 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-10-17T14:09:27.146129', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:09:27,230 : INFO : deleting the raw counts dictionary of 73167 ite

Word2vec model #21: {'train_data': '10MB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 16.83411478996277, 'train_time_std': 0.1844806263072886}


2025-10-17 14:10:17,582 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2025-10-17 14:10:17,582 : INFO : Creating a fresh vocabulary
2025-10-17 14:10:17,645 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-10-17T14:10:17.645802', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:10:17,646 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-10-17T14:10:17.646306', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:10:17,781 : INFO : deleting the raw counts dictionary of 73167 ite

Word2vec model #22: {'train_data': '10MB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 32.31974673271179, 'train_time_std': 1.4650705962983597}


2025-10-17 14:11:54,481 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2025-10-17 14:11:54,481 : INFO : Creating a fresh vocabulary
2025-10-17 14:11:54,561 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-10-17T14:11:54.561414', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:11:54,562 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-10-17T14:11:54.561987', 'gensim': '4.3.3', 'python': '3.11.14 (main, Oct 10 2025, 10:21:20) [GCC 14.2.0]', 'platform': 'Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41', 'event': 'prepare_vocab'}
2025-10-17 14:11:54,710 : INFO : deleting the raw counts dictionary of 73167 ite

Word2vec model #23: {'train_data': '10MB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 32.82801310221354, 'train_time_std': 1.547465817587142}
   train_data  compute_loss  sg  hs  train_time_mean  train_time_std
4        25kB          True   1   0         0.407955        0.011482
5        25kB         False   1   0         0.409663        0.020925
6        25kB          True   1   1         0.808133        0.040627
7        25kB         False   1   1         0.805516        0.025558
0        25kB          True   0   0         0.208782        0.009089
1        25kB         False   0   0         0.195498        0.009373
2        25kB          True   0   1         0.334870        0.013009
3        25kB         False   0   1         0.388519        0.030624
12        1MB          True   1   0         1.309636        0.012777
13        1MB         False   1   0         1.258593        0.024880
14        1MB          True   1   1         2.765448        0.102200
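15        1MB         False   1   1         2.701058        0.073750
8         1MB          True   0   0         0.557157        0.038710
9         1MB         False   0   0         0.530385        0.031533
10        1MB          True   0   1         0.931949        0.008904
11        1MB         False   0   1         0.942778        0.011840
20       10MB          True   1   0        15.386495        0.243400
21       10MB         False   1   0        16.834115        0.184481
22       10MB          True   1   1        32.319747        1.465071
23       10MB         False   1   1        32.828013        1.547466
16       10MB          True   0   0         6.148723        0.073120
17       10MB         False   0   0         5.517085        0.272598
18       10MB          True   0   1        11.313425        0.222676
19       10MB         False   0   1        10.859464        0.525564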
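
To make the effect of each parameter easier to see, the timings can be turned
into ratios. The sketch below copies the mean training times for the 10MB
sample with ``compute_loss=True`` from the ``Word2vec model #16``, ``#18``,
``#20`` and ``#22`` lines printed above (rounded to three decimals) and
compares each configuration with the fastest one. The dictionary layout and
variable names are illustrative only.

```python
# Mean training time in seconds for the 10MB sample with compute_loss=True,
# keyed by (sg, hs); values copied from the printed results above.
mean_times = {
    (0, 0): 6.149,
    (0, 1): 11.313,
    (1, 0): 15.386,
    (1, 1): 32.320,
}

baseline = mean_times[(0, 0)]  # sg=0 (CBOW) without hierarchical softmax
slowdown = {cfg: t / baseline for cfg, t in mean_times.items()}

for (sg, hs), ratio in sorted(slowdown.items()):
    print("sg=%d hs=%d: %.2fx the baseline time" % (sg, hs, ratio))
```

On this particular run, skip-gram (``sg=1``) took roughly 2.5 times as long as
CBOW, and enabling hierarchical softmax roughly doubled the training time in
either mode. The exact ratios will vary with hardware and corpus.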

Visualising Word Embeddings
---------------------------

The word embeddings learned by the model can be visualised by reducing the
dimensionality of the word vectors to two dimensions using t-SNE.

Visualisations can help reveal semantic and syntactic trends in the data.

Examples:

* Semantic: words like cat, dog and cow tend to lie close together.
* Syntactic: words like run and running, or cut and cutting, lie close together.

Vector relations like vKing - vMan = vQueen - vWoman can also be observed.

.. Important::
  The model used for the visualisation is trained on a small corpus. Thus
  some of the relations might not be so clear.
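
The vector relation mentioned above can be illustrated with a toy example. The
sketch below uses hand-picked two-dimensional vectors (an assumption made
purely for illustration; they are not produced by any model) and looks up the
word closest to vec("king") - vec("man") + vec("woman") by cosine similarity.
The names ``toy_vectors``, ``cosine`` and ``analogy`` are hypothetical.

```python
import math

# Hand-picked 2-D toy vectors for illustration only; real Word2Vec embeddings
# are learned from data and typically have 100 or more dimensions.
toy_vectors = {
    "king": (0.9, 0.8),
    "queen": (0.9, 0.2),
    "man": (0.1, 0.8),
    "woman": (0.1, 0.2),
    "apple": (0.5, -0.9),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(
        x - y + z
        for x, y, z in zip(toy_vectors[a], toy_vectors[b], toy_vectors[c])
    )
    candidates = [w for w in toy_vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(toy_vectors[w], target))


result = analogy("king", "man", "woman")
print(result)  # queen
```

With a trained Gensim model, the equivalent query is
``model.wv.most_similar(positive=['king', 'woman'], negative=['man'])``,
provided all three words are in the model's vocabulary.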




In [24]:
from sklearn.decomposition import IncrementalPCA    # initial reduction
from sklearn.manifold import TSNE                   # final reduction
import numpy as np                                  # array handling


def reduce_dimensions(model):
    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    # extract the words & their vectors, as numpy arrays
    vectors = np.asarray(model.wv.vectors)
    labels = np.asarray(model.wv.index_to_key)  # fixed-width numpy strings

    # reduce using t-SNE
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels


x_vals, y_vals, labels = reduce_dimensions(model)

def plot_with_plotly(x_vals, y_vals, labels, plot_in_notebook=True):
    from plotly.offline import init_notebook_mode, iplot, plot
    import plotly.graph_objs as go

    trace = go.Scatter(x=x_vals, y=y_vals, mode='text', text=labels)
    data = [trace]

    if plot_in_notebook:
        init_notebook_mode(connected=True)
        iplot(data, filename='word-embedding-plot')
    else:
        plot(data, filename='word-embedding-plot.html')


def plot_with_matplotlib(x_vals, y_vals, labels):
    import matplotlib.pyplot as plt
    import random

    random.seed(0)

    plt.figure(figsize=(12, 12))
    plt.scatter(x_vals, y_vals)

    # Label a random subsample of up to 25 data points
    # (fewer if the vocabulary is smaller than 25 words)
    indices = list(range(len(labels)))
    selected_indices = random.sample(indices, min(25, len(indices)))
    for i in selected_indices:
        plt.annotate(labels[i], (x_vals[i], y_vals[i]))

try:
    get_ipython()
except Exception:
    plot_function = plot_with_matplotlib
else:
    plot_function = plot_with_plotly

plot_function(x_vals, y_vals, labels)

Conclusion
----------

In this tutorial we learned how to train word2vec models on your own data
and how to evaluate them. We hope you will find this popular tool
useful in your machine learning tasks!

Links
-----

- API docs: :py:mod:`gensim.models.word2vec`
- `Original C toolkit and word2vec papers by Google <https://code.google.com/archive/p/word2vec/>`_.


