
Add `evaluate_word_analogies` (will replace `accuracy`) method for `gensim.models.KeyedVectors` #1935

Merged
merged 13 commits into RaRe-Technologies:develop on Apr 3, 2018

Conversation

4 participants
@akutuzov
Contributor

akutuzov commented Feb 26, 2018

The `accuracy` function evaluates the performance of word2vec models on the analogy task. The `restrict_vocab` parameter defines which part of the model vocabulary will be used for evaluation. The previous default was the 30,000 most frequent words (analogy questions containing words beyond this threshold are simply skipped). It does make sense to use some kind of limit here, as the evaluation running time depends on the size of the vocabulary used.
However, 30,000 is a very small value, with typical models nowadays featuring hundreds of thousands or even millions of words in their vocabularies. This leads to unrealistically high evaluation scores, calculated only on small parts of the test set and the model.
Therefore, I suggest increasing the default value of `restrict_vocab` 10-fold, up to 300,000. This will be more in line with the typical vocabulary size of contemporary word embedding models, and will also be consistent with the default `restrict_vocab` value for the `evaluate_word_pairs` function.

Note that although the original C word2vec does mention 30,000 as a good threshold value for analogy evaluation, the default behavior of its `compute-accuracy` executable is still not to use any threshold (i.e., to evaluate on the whole vocabulary).
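To make the skipping behavior concrete, here is a minimal sketch of how a `restrict_vocab`-style cutoff silently drops analogy questions. The helper name and toy data are hypothetical, not gensim code; the only assumption is that the vocabulary is ordered from most to least frequent, as in gensim models.

```python
# Sketch of restrict_vocab-style filtering (hypothetical helper, not gensim code).
# `vocab` is assumed to be ordered from most to least frequent.

def split_questions(questions, vocab, restrict_vocab):
    """Split analogy quadruplets into (evaluated, skipped) under a vocab cutoff."""
    allowed = set(vocab[:restrict_vocab])
    evaluated, skipped = [], []
    for quad in questions:
        (evaluated if all(w in allowed for w in quad) else skipped).append(quad)
    return evaluated, skipped

vocab = ["king", "queen", "man", "woman", "prince", "duchess"]
questions = [
    ("man", "woman", "king", "queen"),      # all four words are high-frequency
    ("man", "woman", "prince", "duchess"),  # tail words fall outside the cutoff
]
evaluated, skipped = split_questions(questions, vocab, restrict_vocab=4)
assert len(evaluated) == 1 and len(skipped) == 1
```

With a cutoff covering the whole vocabulary, nothing is skipped; a tight cutoff silently discards every question touching the tail, which is exactly the effect discussed above.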

@menshikh-iv


Member

menshikh-iv commented Feb 26, 2018

Hello @akutuzov,
I understand your point, and it sounds reasonable to me, but it looks like this change breaks backward compatibility (in the case when the user didn't specify the `restrict_vocab` parameter). On the other hand, if the user gets unrealistic scores with the default, that looks like a bug to me and should be fixed ASAP.

Why is this important (I mean, why should we change the default value if a user can specify 300,000 or any other value manually)?

CC: @piskvorky @gojomo @manneshiva

@akutuzov


Contributor

akutuzov commented Feb 26, 2018

Well, backwards compatibility would not be broken: just the evaluation scores would be different (but much more realistic in most cases). Of course, this should be highlighted in the changelog.

Changing the default value is important, because otherwise people get over-inflated evaluation scores. For example, suppose one uses the current default of 30,000 and evaluates on the pretty much standard Google Analogies test set. The semantic part of this test set contains about 12,000 quadruplets. But with the 30,000 threshold, about half of the test set will be silently skipped (with the Google News model, 5,678 out of 12,280 questions are skipped). As a result, the models are evaluated only on the questions containing high-frequency words, which is of course easier. Moreover, the candidates for the answers are also selected only from the words within the threshold. All this makes such evaluation scores highly unreliable and dependent on word frequency fluctuations in the training corpora.
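For the record, the skip ratio implied by the numbers quoted above (5,678 skipped out of 12,280 semantic questions for the Google News model) works out like this:

```python
# Share of semantic questions silently skipped for the Google News model
# under the old default restrict_vocab=30000 (numbers quoted above).
skipped, total = 5678, 12280
skip_ratio = skipped / total
print(f"{skip_ratio:.1%} of semantic questions skipped")  # prints "46.2% ..."
```

So "about half" is no exaggeration: nearly every second semantic question never enters the score.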

The 300 000 threshold which I suggest will at least cover most 'normal words' of any natural language and make the evaluation scores for different models more comparable. It can of course be set to something else: 100 000 or 400 000 if you like, just that the order should be hundreds of thousands, not tens of thousands.

Finally, increasing the threshold will make Gensim-produced evaluation scores closer to the scores produced by the original word2vec `compute-accuracy` script by default.

@gojomo


Member

gojomo commented Feb 26, 2018

If making this change, people using the same data & same eval-method will, after an upgrade, get 'worse' scores - which is likely to cause alarm, confusion, and support requests. (It's not quite 'backward compaTIbility' that's being broken, but 'backward compaRAbility'.)

But the case for being more realistic, and especially matching the original word2vec.c compute-accuracy script behavior is reasonable.

I'd suggest adding a new, fixed method that's more directly analogous to the `compute-accuracy` of word2vec.c, using the 'right' default there, and suggesting it for future use. The older accuracy() could be marked deprecated. (The existing accuracy()/log_accuracy() split, and strange deep-nested-dictionary-of-sections reporting of accuracy(), might also be cleaned up at the same time to be more useful/descriptive for the common cases of just reporting on overall correct, or just syntactic/semantic correct.)

@akutuzov


Contributor

akutuzov commented Feb 26, 2018

Maybe adding a new, correct method and marking `accuracy` as deprecated is indeed a good idea. `accuracy` is a bad name for the analogy task evaluation anyway :-)

In fact, this silent skipping of OOV questions is wrong even if the threshold is very permissive (or even if there is no threshold at all). The current implementation (and the original word2vec implementation as well) makes it possible to get high scores on the Google Analogies set even if your model's vocabulary contains only 10 words, for example. If these 10 words cover at least 1 question from the test set and produce the correct answer, the method will output a score of 100%. I think that the fair way to evaluate is to punish models for lacking words from the test set. This is what the `dummy4unknown` parameter in the `evaluate_word_pairs` method does: if True, predictions of 0.0 are produced for OOV word pairs, making it impossible to achieve high scores with only a handful of known words. I think the same parameter should be added to the analogies evaluation method.
I even think that in the future this parameter should be set to True by default both for evaluate_word_pairs and for analogies evaluation (whatever its name will be), as it really ensures fair comparison of models. But this is a radical change not inherited from the original word2vec, so I would really like to hear what others think.
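A toy sketch of the semantics being argued for here (hypothetical code, not gensim's implementation): with `dummy4unknown` off, OOV questions vanish from the denominator, so a 4-word "model" can score 100%; with it on, they count as failures.

```python
# Toy scorer illustrating the dummy4unknown semantics described above
# (hypothetical helper, not gensim's implementation).

def analogy_accuracy(questions, known_words, solve, dummy4unknown=False):
    correct = total = 0
    for a, b, c, expected in questions:
        if not {a, b, c, expected} <= known_words:
            if dummy4unknown:
                total += 1  # OOV question counts as answered incorrectly
            continue        # otherwise it is silently skipped
        total += 1
        correct += (solve(a, b, c) == expected)
    return correct / total if total else 0.0

known = {"man", "woman", "king", "queen"}  # a tiny 4-word "model"
questions = [
    ("man", "woman", "king", "queen"),
    ("boy", "girl", "brother", "sister"),  # entirely OOV for this model
]
solve = lambda a, b, c: "queen"  # stand-in for a real analogy solver

assert analogy_accuracy(questions, known, solve) == 1.0                    # 100%!
assert analogy_accuracy(questions, known, solve, dummy4unknown=True) == 0.5
```

The first assert shows the inflation problem: a model knowing four words aces the benchmark; the second shows how zero-scoring OOV questions restores a fair denominator.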

@piskvorky


Member

piskvorky commented Feb 27, 2018

That evaluation process was selected precisely to match the C tool behaviour 1:1. Did something change in compute-accuracy?

Otherwise I'm -1 on diverging from the original at this point, at least under the established name. That's just confusing.

@akutuzov


Contributor

akutuzov commented Feb 27, 2018

As I've said above, the default `compute-accuracy` behavior is to use threshold=0 (that is, no threshold at all). I quote directly from the comment in its source code:

    threshold is used to reduce vocabulary of the model for fast approximate evaluation (0 = off, otherwise typical value is 30000)

30,000 is only suggested; by default the evaluation is done on the full model vocabulary.
Thus, the current Gensim default setting does not mimic the C tool behavior anyway.

@piskvorky


Member

piskvorky commented Feb 28, 2018

Alright, thanks. That means compute-accuracy must have changed in the meantime.

If that's the case, mimicking its current behaviour is OK.

@menshikh-iv


Member

menshikh-iv commented Feb 28, 2018

@piskvorky @gojomo So, bottom line: can I merge this one?

@akutuzov


Contributor

akutuzov commented Feb 28, 2018

Note that it doesn't make sense to precisely mimic the `compute-accuracy` behavior (i.e. no threshold) anyway. The reason is that if a typical user tries to run analogy evaluation on the full vocabulary of, say, Google's 3-million-word model, it will run for ages on commodity hardware.
That's why we probably have to strike some balance between too-large and too-small evaluation vocabularies. My suggested default value of 300,000 is an attempt at such a balance.

@piskvorky


Member

piskvorky commented Feb 28, 2018

I had a look and compute-accuracy recommends 30,000.

Mikolov's word2vec paper also used 30,000.

If that is the established standard, I'm -1 on deviating at this point.

One option would be to change the default to match the C tool default ("off"). Another to create a new, non-conflicting evaluation method/process.

@akutuzov


Contributor

akutuzov commented Feb 28, 2018

@piskvorky yes, `compute-accuracy` does recommend 30,000, but it uses no threshold by default, as I've already mentioned.
Mikolov's paper evaluates both on a 30,000-limited vocabulary and on the full vocabulary, see Tables 2 and 4.
It is difficult to tell what the 'established standard' is here. Here is a quote from the highly influential 2014 paper by Baroni et al.:

'because of the way the task is framed, performance also depends on the size of the vocabulary to be searched: Mikolov et al. (2013a) pick the nearest neighbour among vectors for 1M words, Mikolov et al. (2013c) among 700K words, and we among 300K words.'

I would say there is unfortunately no established standard here and most users will simply run evaluation script with default parameters.

Overall, I support the suggestion of @gojomo to implement a new evaluation method and mark the current `accuracy` method as deprecated. This new method can be called something like `evaluate_word_analogies` to be more consistent with `evaluate_word_pairs`; it will feature a higher default threshold and the `dummy4unknown` parameter (which can probably be set to True by default at some point in the future). It would also be great for it to support other methods of computing analogies besides the currently used 3CosAdd: for example, 3CosMul and set-based methods.

If everyone agrees to this plan, I can start implementing this new method.
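For readers unfamiliar with the two objectives just mentioned, here is an illustrative numpy sketch of 3CosAdd (used by the current code) versus 3CosMul (Levy & Goldberg, 2014). The function, vocabulary, and toy vectors are invented for illustration; this is not the proposed gensim code.

```python
import numpy as np

def solve_analogy(vectors, words, a, b, c, method="3CosAdd", eps=1e-6):
    """Return d for 'a is to b as c is to d' (illustrative, not gensim's API)."""
    W = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    idx = {w: i for i, w in enumerate(words)}
    sim = lambda w: W @ W[idx[w]]            # cosine of every word with w
    if method == "3CosAdd":
        scores = sim(b) + sim(c) - sim(a)
    else:  # 3CosMul; shift cosines into [0, 1] before multiplying/dividing
        pos = lambda s: (s + 1.0) / 2.0
        scores = pos(sim(b)) * pos(sim(c)) / (pos(sim(a)) + eps)
    for w in (a, b, c):                      # the input words can't be the answer
        scores[idx[w]] = -np.inf
    return words[int(np.argmax(scores))]

words = ["king", "queen", "man", "woman", "prince"]
vectors = np.array([
    [0.9, 0.1, 0.8],    # king   ~ royal + male
    [0.9, 0.1, -0.8],   # queen  ~ royal + female
    [0.1, 0.9, 0.8],    # man    ~ common + male
    [0.1, 0.9, -0.8],   # woman  ~ common + female
    [0.85, 0.15, 0.75], # prince ~ royal + male (distractor)
])
assert solve_analogy(vectors, words, "man", "woman", "king") == "queen"
assert solve_analogy(vectors, words, "man", "woman", "king", method="3CosMul") == "queen"
```

3CosMul tends to be more robust when one of the three cosine terms dominates, which is why it is worth offering alongside 3CosAdd.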

@piskvorky


Member

piskvorky commented Feb 28, 2018

Sounds good to me, thanks for investigating @akutuzov .

@gojomo


Member

gojomo commented Mar 1, 2018

Despite the word2vec.c precedent, calling this just accuracy() has always bugged me, and tends to mislead casual users to think this score is some ‘true north’ of vector quality, the one thing that makes one-set-of-parameters/vectors better than others. And, its name/prominence/ease-of-use forestalls consideration of the limitations of this evaluation, and the tradeoffs in using more/fewer of the most-frequent words.

For example, a plausible non-running-time reason for clipping the analogies evaluation to more-frequent words is that the ‘long tail’ of words includes many that might crowd-out the ‘right’ answer, without being wholly ‘wrong’. They may be near-synonyms of the ‘best’ answer, or just be idiosyncratically-placed because of their few training examples. But they still help on most tasks, even if they hurt analogies! And you might not want to discard them before training - because they still have other value, and perhaps even help improve the ‘fat head’ words. (For example, model60kvocab.analogies_score(restrict_vocab=30000) may score better than model30kvocab.analogies_score(), and be a fairer comparison, that’s also better correlated with word-vector-quality from the top to the bottom for more other tasks.)

@menshikh-iv


Member

menshikh-iv commented Mar 8, 2018

@akutuzov will you make the changes in the current PR or in a new one? If in a new one, please close the current PR.

@akutuzov


Contributor

akutuzov commented Mar 8, 2018

@menshikh-iv I think I will work in this PR.

akutuzov added some commits Apr 1, 2018

New word analogies method
New method `evaluate_word_analogies` to solve word analogies. Implements a more sensible frequency threshold and the `dummy4unknown` parameter. Also works about two times faster than the previous `accuracy` method, which is now deprecated.
@akutuzov


Contributor

akutuzov commented Apr 1, 2018

OK, so as discussed before, I implemented a new evaluate_word_analogies method. In comparison to the accuracy method, it:

  1. Defaults to 300,000 top frequent words from the model vocabulary, instead of 30,000.
  2. Outputs the percentage of OOV quadruplets, and implements the dummy4unknown parameter, which makes OOV quadruplets yield zero accuracy when set to True.
  3. Runs about 2 times faster, as it calculates similarities only for the top 5 nearest neighbors, not for the whole vocabulary.
  4. Has slightly updated docs and logging.
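The speed-up in point 3 comes from asking for only the top few neighbours instead of fully ranking the whole vocabulary. The core trick can be sketched with numpy's `argpartition` (this is an illustration of the idea, not the actual patch):

```python
import numpy as np

def top_k_indices(scores, k=5):
    """Indices of the k highest scores, without sorting the whole array."""
    top = np.argpartition(scores, -k)[-k:]    # O(n) partial selection
    return top[np.argsort(-scores[top])]      # fully sort only those k

rng = np.random.default_rng(0)
scores = rng.random(100_000)                  # similarity of a query to every vocab word
best5 = top_k_indices(scores, k=5)
assert list(best5) == list(np.argsort(-scores)[:5])  # matches a full sort
```

Partial selection is O(n) versus O(n log n) for a full sort, which matters when the evaluation repeats this step for every quadruplet over a large vocabulary.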

I marked the accuracy method as deprecated, but I am not sure I did it the right way, so I ask @menshikh-iv to double-check this.

@menshikh-iv

Great work @akutuzov 👍

CC: @gojomo please have a look too

@@ -859,6 +967,7 @@ def log_accuracy(section):
section['section'], 100.0 * correct / (correct + incorrect), correct, correct + incorrect
)
@deprecated("Method will be removed in 4.0.0, use self.evaluate_word_analogies() instead")


@menshikh-iv

menshikh-iv Apr 2, 2018

Member

This is the correct way, all fine 👍

@@ -850,6 +851,113 @@ def n_similarity(self, ws1, ws2):
v2 = [self[word] for word in ws2]
return dot(matutils.unitvec(array(v1).mean(axis=0)), matutils.unitvec(array(v2).mean(axis=0)))
@staticmethod
def log_evaluate_word_analogies(section):


@menshikh-iv

menshikh-iv Apr 2, 2018

Member

Maybe better hide this method (with _)?


@akutuzov

akutuzov Apr 2, 2018

Contributor

What exactly do you mean? Or maybe you can point to some example of such hiding in the existing Gensim code?


@menshikh-iv

menshikh-iv Apr 3, 2018

Member

I mean, why not `_log_evaluate_word_analogies`? I'm asking because this method looks like a helper for `evaluate_word_analogies`, nothing more.


@akutuzov

akutuzov Apr 3, 2018

Contributor

Done.

def evaluate_word_analogies(self, analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False):
"""
Compute performance of the model on an analogy test set


@akutuzov

akutuzov Apr 2, 2018

Contributor

Done.

def evaluate_word_analogies(self, analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False):
"""
Compute performance of the model on an analogy test set
(see https://aclweb.org/aclwiki/Analogy_(State_of_the_art)).


@menshikh-iv

menshikh-iv Apr 2, 2018

Member

This should be rendered as a link; it should look like:

 `Analogy (State of the art) <https://aclweb.org/aclwiki/Analogy_(State_of_the_art)>`_


@akutuzov

akutuzov Apr 2, 2018

Contributor

Done

Compute performance of the model on an analogy test set
(see https://aclweb.org/aclwiki/Analogy_(State_of_the_art)).
`analogies` is a filename where lines are 4-tuples of words, split into sections by ": SECTION NAME" lines.
See questions-words.txt in


@menshikh-iv

menshikh-iv Apr 2, 2018

Member

This file is also provided in the current repo: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/questions-words.txt and it is part of the gensim package, i.e. the path on the local machine can be retrieved as

    from gensim.test.utils import datapath
    datapath("questions-words.txt")

No need to download the source code of the C version to look at this file.


@akutuzov

akutuzov Apr 2, 2018

Contributor

Done.

oov_ratio = float(oov) / line_no * 100
logger.info('Quadruplets with out-of-vocabulary words: %.1f%%', oov_ratio)
if not dummy4unknown:
logger.info('NB: analogies containing OOV words were skipped from evaluation! '


@menshikh-iv

menshikh-iv Apr 2, 2018

Member

nitpick: please use hanging indents (instead of vertical)
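For reference, the two layouts in question look like this (an illustration of the style nitpick only; the message text is paraphrased, not the patched code):

```python
# Vertical indent: continuation aligned under the opening quote.
msg_vertical = ('Quadruplets with out-of-vocabulary words were skipped; '
                'set dummy4unknown=True to score them as errors.')

# Hanging indent: continuation indented one level from the statement start.
msg_hanging = (
    'Quadruplets with out-of-vocabulary words were skipped; '
    'set dummy4unknown=True to score them as errors.'
)

assert msg_vertical == msg_hanging  # only the layout differs
```

Hanging indents survive renames of the variable or function without re-aligning every continuation line, which is why many style guides prefer them.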


@akutuzov

akutuzov Apr 2, 2018

Contributor

Done.

akutuzov added some commits Apr 2, 2018

Parameters
----------
`analogies` is a filename where lines are 4-tuples of words,


@menshikh-iv

menshikh-iv Apr 3, 2018

Member

Should be

parameter_1 : type_1
    Description_1.
parameter_2 : type_2
    Description_2.
...

example

"""Convert `glove_input_file` in GloVe format to word2vec format and write it to `word2vec_output_file`.
Parameters
----------
glove_input_file : str
Path to file in GloVe format.
word2vec_output_file: str
Path to output file.
Returns
-------
(int, int)
Number of vectors (lines) of input file and its dimension.
"""


@akutuzov

akutuzov Apr 3, 2018

Contributor

Done.

with out-of-vocabulary words. Otherwise (default False), these
tuples are skipped entirely and not used in the evaluation.
References


@menshikh-iv

menshikh-iv Apr 3, 2018

Member

Please don't use a References section (this will cause problems in the future, "thanks" to the autosummary sphinx plugin); add it simply as a link with a description (as I mentioned before in #1935 (comment)).


@akutuzov

akutuzov Apr 3, 2018

Contributor

Done.

akutuzov added some commits Apr 3, 2018

@akutuzov


Contributor

akutuzov commented Apr 3, 2018

@menshikh-iv Is everything OK now?
(The Travis build for Python 3.6 seems to get stuck on test_ldamodel.py, don't know why.)

@menshikh-iv


Member

menshikh-iv commented Apr 3, 2018

@akutuzov I fixed the Travis problem. Looks good to me (I'll slightly clean up the docstring & merge it soon), thanks for your work @akutuzov 👍

@menshikh-iv menshikh-iv added the RFM label Apr 3, 2018

@menshikh-iv menshikh-iv changed the title from Increased default restrict_vocab in accuracy to Add `evaluate_word_analogies` (will replace `accuracy`) method for `gensim.models.KeyedVectors` Apr 3, 2018

@menshikh-iv menshikh-iv merged commit 49e6abd into RaRe-Technologies:develop Apr 3, 2018

3 checks passed

ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed