
Add `evaluate_word_analogies` (will replace `accuracy`) method for `gensim.models.KeyedVectors` #1935

Merged
merged 13 commits into RaRe-Technologies:develop on Apr 3, 2018

Conversation

4 participants
@akutuzov
Contributor

akutuzov commented Feb 26, 2018

The `accuracy` function evaluates the performance of word2vec models on the analogy task. The `restrict_vocab` parameter defines which part of the model vocabulary will be used for evaluation. The previous default was the 30,000 most frequent words (analogy questions containing words beyond this threshold are simply skipped). It does make sense to use some kind of limit here, as the evaluation running time depends on the size of the vocabulary used.
However, 30,000 is a very small value, with typical models nowadays featuring hundreds of thousands or even millions of words in their vocabularies. This leads to unrealistically high evaluation scores, calculated only on small parts of the test set and the model.
Therefore, I suggest increasing the default value of `restrict_vocab` 10-fold, up to 300,000. This will be more in line with the typical vocabulary size of contemporary word embedding models, and will also be consistent with the default `restrict_vocab` value for the `evaluate_word_pairs` function.

Note that although the original C word2vec does mention 30,000 as a good threshold value for analogy evaluation, the default behavior of its `compute-accuracy` executable is still not to use any threshold (i.e., to evaluate on the whole vocabulary).
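To make the skipping behavior concrete, here is a minimal sketch of how a `restrict_vocab`-style cutoff silently drops analogy questions. The helper name and toy data are hypothetical, not gensim code; the only assumption is that the vocabulary is ordered from most to least frequent, as in gensim models.

```python
# Sketch of restrict_vocab-style filtering (hypothetical helper, not gensim code).
# `vocab` is assumed to be ordered from most to least frequent.

def split_questions(questions, vocab, restrict_vocab):
    """Split analogy quadruplets into (evaluated, skipped) under a vocab cutoff."""
    allowed = set(vocab[:restrict_vocab])
    evaluated, skipped = [], []
    for quad in questions:
        (evaluated if all(w in allowed for w in quad) else skipped).append(quad)
    return evaluated, skipped

vocab = ["king", "queen", "man", "woman", "prince", "duchess"]
questions = [
    ("man", "woman", "king", "queen"),      # all four words are high-frequency
    ("man", "woman", "prince", "duchess"),  # tail words fall outside the cutoff
]
evaluated, skipped = split_questions(questions, vocab, restrict_vocab=4)
assert len(evaluated) == 1 and len(skipped) == 1
```

With a cutoff covering the whole vocabulary, nothing is skipped; a tight cutoff silently discards every question touching the tail, which is exactly the effect discussed above.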

@menshikh-iv


Member

menshikh-iv commented Feb 26, 2018

Hello @akutuzov,
I understand your point, and it sounds reasonable to me, but it looks like this change breaks backward compatibility (in the case when the user didn't specify the `restrict_vocab` parameter). On the other hand, if the user gets unrealistic scores with the default, that looks like a bug to me and should be fixed ASAP.

Why is this important (I mean, why should we change the default value if a user can specify 300,000 or any other value manually)?

CC: @piskvorky @gojomo @manneshiva

@akutuzov


Contributor

akutuzov commented Feb 26, 2018

Well, backwards compatibility would not be broken: just the evaluation scores would be different (but much more realistic in most cases). Of course, this should be highlighted in the changelog.

Changing the default value is important, because otherwise people get over-inflated evaluation scores. For example, suppose one uses the current default of 30,000 and evaluates on the pretty much standard Google Analogies test set. The semantic part of this test set contains about 12,000 quadruplets. But with the 30,000 threshold, about half of the test set will be silently skipped (with the Google News model, 5,678 out of 12,280 questions are skipped). As a result, the models are evaluated only on the questions containing high-frequency words, which is of course easier. Moreover, the candidates for the answers are also selected only from the words within the threshold. All this makes such evaluation scores highly unreliable and dependent on word frequency fluctuations in the training corpora.
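For the record, the skip ratio implied by the numbers quoted above (5,678 skipped out of 12,280 semantic questions for the Google News model) works out like this:

```python
# Share of semantic questions silently skipped for the Google News model
# under the old default restrict_vocab=30000 (numbers quoted above).
skipped, total = 5678, 12280
skip_ratio = skipped / total
print(f"{skip_ratio:.1%} of semantic questions skipped")  # prints "46.2% ..."
```

So "about half" is no exaggeration: nearly every second semantic question never enters the score.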

The 300 000 threshold which I suggest will at least cover most 'normal words' of any natural language and make the evaluation scores for different models more comparable. It can of course be set to something else: 100 000 or 400 000 if you like, just that the order should be hundreds of thousands, not tens of thousands.

Finally, increasing the threshold will make Gensim-produced evaluation scores closer to the scores produced by the original word2vec `compute-accuracy` script by default.

@gojomo


Member

gojomo commented Feb 26, 2018

If making this change, people using the same data & same eval-method will, after an upgrade, get 'worse' scores - which is likely to cause alarm, confusion, and support requests. (It's not quite 'backward compaTIbility' that's being broken, but 'backward compaRAbility'.)

But the case for being more realistic, and especially matching the original word2vec.c compute-accuracy script behavior is reasonable.

I'd suggest adding a new, fixed method that's more directly analogous to the `compute-accuracy` of word2vec.c, using the 'right' default there, and suggesting it for future use. The older accuracy() could be marked deprecated. (The existing accuracy()/log_accuracy() split, and strange deep-nested-dictionary-of-sections reporting of accuracy(), might also be cleaned up at the same time to be more useful/descriptive for the common cases of just reporting on overall correct, or just syntactic/semantic correct.)

@akutuzov


Contributor

akutuzov commented Feb 26, 2018

Maybe adding a new, correct method and marking `accuracy` as deprecated is indeed a good idea. `accuracy` is a bad name for the analogy task evaluation anyway :-)

In fact, this silent skipping of OOV questions is wrong even if the threshold is very permissive (or even if there is no threshold at all). The current implementation (and the original word2vec implementation as well) makes it possible to get high scores on the Google Analogies set even if your model's vocabulary contains only 10 words, for example. If these 10 words cover at least 1 question from the test set and produce the correct answer, the method will output a score of 100%. I think that the fair way to evaluate is to punish models for lacking words from the test set. This is what the `dummy4unknown` parameter in the `evaluate_word_pairs` method does: if True, predictions of 0.0 are produced for OOV word pairs, making it impossible to achieve high scores with only a handful of known words. I think the same parameter should be added to the analogies evaluation method.
I even think that in the future this parameter should be set to True by default both for evaluate_word_pairs and for analogies evaluation (whatever its name will be), as it really ensures fair comparison of models. But this is a radical change not inherited from the original word2vec, so I would really like to hear what others think.
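A toy sketch of the semantics being argued for here (hypothetical code, not gensim's implementation): with `dummy4unknown` off, OOV questions vanish from the denominator, so a 4-word "model" can score 100%; with it on, they count as failures.

```python
# Toy scorer illustrating the dummy4unknown semantics described above
# (hypothetical helper, not gensim's implementation).

def analogy_accuracy(questions, known_words, solve, dummy4unknown=False):
    correct = total = 0
    for a, b, c, expected in questions:
        if not {a, b, c, expected} <= known_words:
            if dummy4unknown:
                total += 1  # OOV question counts as answered incorrectly
            continue        # otherwise it is silently skipped
        total += 1
        correct += (solve(a, b, c) == expected)
    return correct / total if total else 0.0

known = {"man", "woman", "king", "queen"}  # a tiny 4-word "model"
questions = [
    ("man", "woman", "king", "queen"),
    ("boy", "girl", "brother", "sister"),  # entirely OOV for this model
]
solve = lambda a, b, c: "queen"  # stand-in for a real analogy solver

assert analogy_accuracy(questions, known, solve) == 1.0                    # 100%!
assert analogy_accuracy(questions, known, solve, dummy4unknown=True) == 0.5
```

The first assert shows the inflation problem: a model knowing four words aces the benchmark; the second shows how zero-scoring OOV questions restores a fair denominator.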

@piskvorky


Member

piskvorky commented Feb 27, 2018

That evaluation process was selected precisely to match the C tool behaviour 1:1. Did something change in compute-accuracy?

Otherwise I'm -1 on diverging from the original at this point, at least under the established name. That's just confusing.

@akutuzov


Contributor

akutuzov commented Feb 27, 2018

As I've said above, the default `compute-accuracy` behavior is to use threshold=0 (that is, no threshold at all). I quote directly from the comment in its source code:

    threshold is used to reduce vocabulary of the model for fast approximate evaluation (0 = off, otherwise typical value is 30000)

30,000 is only suggested; by default the evaluation is done on the full model vocabulary.
Thus, the current Gensim default setting does not mimic the C tool behavior anyway.

@piskvorky


Member

piskvorky commented Feb 28, 2018

Alright, thanks. That means compute-accuracy must have changed in the meantime.

If that's the case, mimicking its current behaviour is OK.

@menshikh-iv


Member

menshikh-iv commented Feb 28, 2018

@piskvorky @gojomo So, bottom line: can I merge this one?

@akutuzov


Contributor

akutuzov commented Feb 28, 2018

Note that it doesn't make sense to precisely mimic the `compute-accuracy` behavior (i.e. no threshold) anyway. The reason is that if a typical user tries to run analogy evaluation on the full vocabulary of, say, Google's 3-million-word model, it will run for ages on commodity hardware.
That's why we probably have to strike some balance between too-large and too-small evaluation vocabularies. My suggested default value of 300,000 is an attempt at such a balance.

@piskvorky


Member

piskvorky commented Feb 28, 2018

I had a look and compute-accuracy recommends 30,000.

Mikolov's word2vec paper also used 30,000.

If that is the established standard, I'm -1 on deviating at this point.

One option would be to change the default to match the C tool default ("off"). Another to create a new, non-conflicting evaluation method/process.

@akutuzov


Contributor

akutuzov commented Feb 28, 2018

@piskvorky yes, `compute-accuracy` does recommend 30,000, but it uses no threshold by default, as I've already mentioned.
Mikolov's paper evaluates both on a 30,000-limited vocabulary and on the full vocabulary, see Tables 2 and 4.
It is difficult to tell what the 'established standard' is here. Here is a quote from the highly influential 2014 paper by Baroni et al.:

'because of the way the task is framed, performance also depends on the size of the vocabulary to be searched: Mikolov et al. (2013a) pick the nearest neighbour among vectors for 1M words, Mikolov et al. (2013c) among 700K words, and we among 300K words.'

I would say there is unfortunately no established standard here and most users will simply run evaluation script with default parameters.

Overall, I support the suggestion of @gojomo to implement a new evaluation method and mark the current `accuracy` method as deprecated. This new method can be called something like `evaluate_word_analogies` to be more consistent with `evaluate_word_pairs`; it will feature a higher default threshold and the `dummy4unknown` parameter (which can probably be set to True by default at some point in the future). It would also be great for it to support other methods of computing analogies besides the currently used 3CosAdd: for example, 3CosMul and set-based methods.

If everyone agrees to this plan, I can start implementing this new method.
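For readers unfamiliar with the two objectives just mentioned, here is an illustrative numpy sketch of 3CosAdd (used by the current code) versus 3CosMul (Levy & Goldberg, 2014). The function, vocabulary, and toy vectors are invented for illustration; this is not the proposed gensim code.

```python
import numpy as np

def solve_analogy(vectors, words, a, b, c, method="3CosAdd", eps=1e-6):
    """Return d for 'a is to b as c is to d' (illustrative, not gensim's API)."""
    W = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    idx = {w: i for i, w in enumerate(words)}
    sim = lambda w: W @ W[idx[w]]            # cosine of every word with w
    if method == "3CosAdd":
        scores = sim(b) + sim(c) - sim(a)
    else:  # 3CosMul; shift cosines into [0, 1] before multiplying/dividing
        pos = lambda s: (s + 1.0) / 2.0
        scores = pos(sim(b)) * pos(sim(c)) / (pos(sim(a)) + eps)
    for w in (a, b, c):                      # the input words can't be the answer
        scores[idx[w]] = -np.inf
    return words[int(np.argmax(scores))]

words = ["king", "queen", "man", "woman", "prince"]
vectors = np.array([
    [0.9, 0.1, 0.8],    # king   ~ royal + male
    [0.9, 0.1, -0.8],   # queen  ~ royal + female
    [0.1, 0.9, 0.8],    # man    ~ common + male
    [0.1, 0.9, -0.8],   # woman  ~ common + female
    [0.85, 0.15, 0.75], # prince ~ royal + male (distractor)
])
assert solve_analogy(vectors, words, "man", "woman", "king") == "queen"
assert solve_analogy(vectors, words, "man", "woman", "king", method="3CosMul") == "queen"
```

3CosMul tends to be more robust when one of the three cosine terms dominates, which is why it is worth offering alongside 3CosAdd.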

@piskvorky


Member

piskvorky commented Feb 28, 2018

Sounds good to me, thanks for investigating @akutuzov .

@gojomo


Member

gojomo commented Mar 1, 2018

Despite the word2vec.c precedent, calling this just accuracy() has always bugged me, and tends to mislead casual users to think this score is some ‘true north’ of vector quality, the one thing that makes one-set-of-parameters/vectors better than others. And, its name/prominence/ease-of-use forestalls consideration of the limitations of this evaluation, and the tradeoffs in using more/fewer of the most-frequent words.

For example, a plausible non-running-time reason for clipping the analogies evaluation to more-frequent words is that the ‘long tail’ of words includes many that might crowd-out the ‘right’ answer, without being wholly ‘wrong’. They may be near-synonyms of the ‘best’ answer, or just be idiosyncratically-placed because of their few training examples. But they still help on most tasks, even if they hurt analogies! And you might not want to discard them before training - because they still have other value, and perhaps even help improve the ‘fat head’ words. (For example, model60kvocab.analogies_score(restrict_vocab=30000) may score better than model30kvocab.analogies_score(), and be a fairer comparison, that’s also better correlated with word-vector-quality from the top to the bottom for more other tasks.)

@menshikh-iv


Member

menshikh-iv commented Mar 8, 2018

@akutuzov will you make the changes in the current PR or in a new one? If in a new one, please close the current PR.

@akutuzov


Contributor

akutuzov commented Mar 8, 2018

@menshikh-iv I think I will work in this PR.

akutuzov added some commits Apr 1, 2018

New word analogies method
New method `evaluate_word_analogies` to solve word analogies. Implements a more sensible frequency threshold and the `dummy4unknown` parameter. Also works about two times faster than the previous `accuracy` method, which is now deprecated.
@akutuzov


Contributor

akutuzov commented Apr 1, 2018

OK, so as discussed before, I implemented a new evaluate_word_analogies method. In comparison to the accuracy method, it:

  1. Defaults to 300,000 top frequent words from the model vocabulary, instead of 30,000.
  2. Outputs the percentage of OOV quadruplets, and implements the dummy4unknown parameter, which makes OOV quadruplets yield zero accuracy when set to True.
  3. Runs about 2 times faster, as it calculates similarities only for the top 5 nearest neighbors, not for the whole vocabulary.
  4. Has slightly updated docs and logging.
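The speed-up in point 3 comes from asking for only the top few neighbours instead of fully ranking the whole vocabulary. The core trick can be sketched with numpy's `argpartition` (this is an illustration of the idea, not the actual patch):

```python
import numpy as np

def top_k_indices(scores, k=5):
    """Indices of the k highest scores, without sorting the whole array."""
    top = np.argpartition(scores, -k)[-k:]    # O(n) partial selection
    return top[np.argsort(-scores[top])]      # fully sort only those k

rng = np.random.default_rng(0)
scores = rng.random(100_000)                  # similarity of a query to every vocab word
best5 = top_k_indices(scores, k=5)
assert list(best5) == list(np.argsort(-scores)[:5])  # matches a full sort
```

Partial selection is O(n) versus O(n log n) for a full sort, which matters when the evaluation repeats this step for every quadruplet over a large vocabulary.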

I marked the accuracy method as deprecated, but I am not sure I did it the right way, so I ask @menshikh-iv to double-check this.

@menshikh-iv

Great work @akutuzov 👍

CC: @gojomo please have a look too

@@ -859,6 +967,7 @@ def log_accuracy(section):
section['section'], 100.0 * correct / (correct + incorrect), correct, correct + incorrect
)
@deprecated("Method will be removed in 4.0.0, use self.evaluate_word_analogies() instead")


@menshikh-iv

menshikh-iv Apr 2, 2018

Member

This is the correct way, all fine 👍

@@ -850,6 +851,113 @@ def n_similarity(self, ws1, ws2):
v2 = [self[word] for word in ws2]
return dot(matutils.unitvec(array(v1).mean(axis=0)), matutils.unitvec(array(v2).mean(axis=0)))
@staticmethod
def log_evaluate_word_analogies(section):


@menshikh-iv

menshikh-iv Apr 2, 2018

Member

Maybe better hide this method (with _)?


@akutuzov

akutuzov Apr 2, 2018

Contributor

What exactly do you mean? Or maybe you can point to some example of such hiding in the existing Gensim code?


@menshikh-iv

menshikh-iv Apr 3, 2018

Member

I mean, why not `_log_evaluate_word_analogies`? I'm asking because this method looks like a helper for `evaluate_word_analogies`, nothing more.


@akutuzov

akutuzov Apr 3, 2018

Contributor

Done.

def evaluate_word_analogies(self, analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False):
"""
Compute performance of the model on an analogy test set


@akutuzov

akutuzov Apr 2, 2018

Contributor

Done.

def evaluate_word_analogies(self, analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False):
"""
Compute performance of the model on an analogy test set
(see https://aclweb.org/aclwiki/Analogy_(State_of_the_art)).


@menshikh-iv

menshikh-iv Apr 2, 2018

Member

This should be rendered as a link; it should look like:

 `Analogy (State of the art) <https://aclweb.org/aclwiki/Analogy_(State_of_the_art)>`_


@akutuzov

akutuzov Apr 2, 2018

Contributor

Done

Compute performance of the model on an analogy test set
(see https://aclweb.org/aclwiki/Analogy_(State_of_the_art)).
`analogies` is a filename where lines are 4-tuples of words, split into sections by ": SECTION NAME" lines.
See questions-words.txt in


@menshikh-iv

menshikh-iv Apr 2, 2018

Member

This file is also provided in the current repo: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/questions-words.txt and it is part of the gensim package, i.e. the path on the local machine can be retrieved as

    from gensim.test.utils import datapath
    datapath("questions-words.txt")

No need to download the source code of the C version to look at this file.


@akutuzov

akutuzov Apr 2, 2018

Contributor

Done.

oov_ratio = float(oov) / line_no * 100
logger.info('Quadruplets with out-of-vocabulary words: %.1f%%', oov_ratio)
if not dummy4unknown:
logger.info('NB: analogies containing OOV words were skipped from evaluation! '


@menshikh-iv

menshikh-iv Apr 2, 2018

Member

nitpick: please use hanging indents (instead of vertical)
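For reference, the two layouts in question look like this (an illustration of the style nitpick only; the message text is paraphrased, not the patched code):

```python
# Vertical indent: continuation aligned under the opening quote.
msg_vertical = ('Quadruplets with out-of-vocabulary words were skipped; '
                'set dummy4unknown=True to score them as errors.')

# Hanging indent: continuation indented one level from the statement start.
msg_hanging = (
    'Quadruplets with out-of-vocabulary words were skipped; '
    'set dummy4unknown=True to score them as errors.'
)

assert msg_vertical == msg_hanging  # only the layout differs
```

Hanging indents survive renames of the variable or function without re-aligning every continuation line, which is why many style guides prefer them.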


@akutuzov

akutuzov Apr 2, 2018

Contributor

Done.

akutuzov added some commits Apr 2, 2018

Parameters
----------
`analogies` is a filename where lines are 4-tuples of words,


@menshikh-iv

menshikh-iv Apr 3, 2018

Member

Should be

parameter_1 : type_1
    Description_1.
parameter_2 : type_2
    Description_2.
...

example

"""Convert `glove_input_file` in GloVe format to word2vec format and write it to `word2vec_output_file`.
Parameters
----------
glove_input_file : str
Path to file in GloVe format.
word2vec_output_file: str
Path to output file.
Returns
-------
(int, int)
Number of vectors (lines) of input file and its dimension.
"""


@akutuzov

akutuzov Apr 3, 2018

Contributor

Done.

with out-of-vocabulary words. Otherwise (default False), these
tuples are skipped entirely and not used in the evaluation.
References


@menshikh-iv

menshikh-iv Apr 3, 2018

Member

Please don't use a References section (this will cause problems in the future, "thanks" to the autosummary sphinx plugin); add it simply as a link with a description (as I mentioned before in #1935 (comment)).


@akutuzov

akutuzov Apr 3, 2018

Contributor

Done.

akutuzov added some commits Apr 3, 2018

@akutuzov


Contributor

akutuzov commented Apr 3, 2018

@menshikh-iv Is everything OK now?
(The Travis build for Python 3.6 seems to get stuck on test_ldamodel.py, don't know why.)

@menshikh-iv


Member

menshikh-iv commented Apr 3, 2018

@akutuzov I fixed the Travis problem. Looks good to me (I'll slightly clean up the docstring & merge it soon), thanks for your work @akutuzov 👍

@menshikh-iv menshikh-iv added the RFM label Apr 3, 2018

@menshikh-iv menshikh-iv changed the title from Increased default restrict_vocab in accuracy to Add `evaluate_word_analogies` (will replace `accuracy`) method for `gensim.models.KeyedVectors` Apr 3, 2018

@menshikh-iv menshikh-iv merged commit 49e6abd into RaRe-Technologies:develop Apr 3, 2018

3 checks passed

ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed