Fix WordEmbeddingsKeyedVectors.most_similar #2356

Witiko · 2019-01-25T18:10:04Z

This PR fixes two issues discovered in #2105:

The gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar method returns an unexpected result (a numeric array rather than an empty list) when topn=0.
The example code for gensim.similarities.SoftCosineSimilarity needs fixing.

piskvorky · 2019-01-25T18:54:07Z

Thanks a lot @Witiko ! Does this fix the reported issue fully, or are there still lingering/unresolved issues?
I remember reading something about a problem with backward compatibility.

Witiko · 2019-01-25T19:09:34Z

@piskvorky This PR fixes the new issue reported today in #2105 with no apparent backwards-compatibility problems. This PR does not fix the originally reported issue in #2105, because the poster never got back to us and we were unable to reproduce their issue. Nevertheless, that issue is long past its expiry date, since the code that it refers to is no longer even part of Gensim. Therefore, this PR should close #2105.

There is something that should be discussed before a merge. Before to this PR, WordEmbeddingsKeyedVectors.most_similar would return an array of similarities for all terms in a vocabulary when topn=0. This behavior was undocumented:

>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.test.utils import common_texts
>>> 
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> model.wv.most_similar(positive=['computer'], topn=2)
[('response', 0.38100379705429077), ('minors', 0.3752439618110657)]
>>> model.wv.most_similar(positive=['computer'], topn=0)
array([-0.1180886 ,  0.32174808, -0.02938104, -0.21145007,  0.37524396,
       -0.23777878,  0.99999994, -0.01436211,  0.36708638, -0.09770551,
        0.05963777,  0.3810038 ], dtype=float32)

This PR enforces the following documented behavior:

>>> model.wv.most_similar(positive=['computer'], topn=0)
[]

However, the unit tests reveal that the undocumented behavior is expected:

================================== FAILURES ===================================
______________ TestEuclideanKeyedVectors.test_most_similar_topn _______________
self = <gensim.test.test_keyedvectors.TestEuclideanKeyedVectors testMethod=test_most_similar_topn>
    def test_most_similar_topn(self):
        """Test most_similar returns correct results when `topn` is specified."""
        self.assertEqual(len(self.vectors.most_similar('war', topn=5)), 5)
        self.assertEqual(len(self.vectors.most_similar('war', topn=10)), 10)
    
        predicted = self.vectors.most_similar('war', topn=None)
>       self.assertEqual(len(predicted), len(self.vectors.vocab))
E       AssertionError: 0 != 2747
predicted  = []
self       = <gensim.test.test_keyedvectors.TestEuclideanKeyedVectors testMethod=test_most_similar_topn>
gensim\test\test_keyedvectors.py:107: AssertionError
============================== warnings summary ===============================
== 1 failed, 948 passed, 94 skipped, 74 warnings, 3 rerun in 1070.41 seconds ==
ERROR: InvocationError for command 'C:\\projects\\gensim-431bq\\.tox\\py27-win\\Scripts\\pytest.EXE gensim/test' (exited with code 1)
py27-win finish: runtests after 1085.62 seconds
py27-win start: run-test-post 
py27-win finish: run-test-post after 0.00 seconds
___________________________________ summary ___________________________________
ERROR:   py27-win: commands failed
Command exited with code 1

This PR removes the unit test. A more conservative approach would be to leave WordEmbeddingsKeyedVectors.most_similar alone, document the behavior, and patch the caller to special-case topn=0. Someone with more insight into the code base should probably decide this.

piskvorky · 2019-01-25T19:26:54Z

That's weird. Why would anybody call most_similar with topn=0? How can that happen? Error in user code?

The topn=None case makes some sense (user is asking for all results; 0 probably falls under None, my guess would be the check is only if topn: instead of if topn is None:). I don't know if it's worth keeping around, why this special case was added. Are there any comments?

The backward compatibility issue I was talking about is here:
https://groups.google.com/forum/#!topic/gensim/Z4nHzwCuUl0

Witiko · 2019-01-25T19:35:19Z

We are calling most_similar with topn=0 when constructing a similarity matrix column that already contains the maximum number of nonzero elements. This can happen when symmetric=True.

The “backward compatibility issue” is related to 3f04940. Before Gensim 3.7, most_similar would also be called with topn=0, but the caller code was more defensive and would return an empty slice regardless of what most_similar would return. In Gensim 3.7, the code is less defensive and expects most_similar to return an empty list when topn=0 as documented. To summarize, this PR fixes a new issue that appeared in Gensim 3.7. Backwards compatibility is retained.

Witiko · 2019-01-25T19:43:20Z

According to the unit tests, no other code relies on the undocumented behavior. There are no comments in the code.

Witiko · 2019-01-25T20:18:34Z

To solve the dilemma, I added back the support for the undocumented topn=None behavior in 06781e2. It is still undocumented, but at least we are not breaking anyone's code unnecessarily.

tvrbanec · 2019-01-25T20:50:53Z

I am still confused.
I wanted to apply fix in my /usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py

First I saw that the line numbers are different: it is 1401, not as patch declares 1370.

Then I saw another difference: what is supposed to be a patch, there is already! And is still making an error:

    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)  # construct similarity matrix
  File "/usr/local/lib/python2.7/dist-packages/gensim/similarities/termsim.py", line 234, in __init__
    for term, similarity in index.most_similar(t1, num_rows)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 1401, in most_similar
    for t2, similarity in most_similar:
TypeError: 'numpy.float32' object is not iterable

Witiko · 2019-01-25T21:17:58Z

@tvrbanec In your case, applying 01c5336 should be enough:

diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
index ff79212c..ad9c90c7 100644
--- a/gensim/models/keyedvectors.py
+++ b/gensim/models/keyedvectors.py
@@ -511,6 +511,9 @@ class WordEmbeddingsKeyedVectors(BaseKeyedVectors):
             Sequence of (word, similarity).
 
         """
+        if topn < 1:
+            return []
+
         if positive is None:
             positive = []
         if negative is None:
@@ -550,8 +553,6 @@ class WordEmbeddingsKeyedVectors(BaseKeyedVectors):
 
         limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
         dists = dot(limited, mean)
-        if not topn:
-            return dists
         best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)
         # ignore (don't return) words from the input
         result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]

Line numbers may differ, since you are using the release version of Gensim, whereas the patches are against the development branch of Gensim. Any luck?

tvrbanec · 2019-01-25T21:45:15Z

OK. I tried. First and second patch wasn't big deal. But third, which is probably the key one can't be done. There is no part that should be changed:
if topn < 1:
and there are no surroundings of code that needs to be insert.

I have two options now:

To give up until the official patch version (3.7.1?) or the new version (3.8) is released. I hope you will not wait long?
Switch to the developer version. This is not possible because I test on three completely different machines, on one I have only VI editor. Too much effort and time.

Witiko · 2019-01-25T21:50:42Z

There is no part that should be changed:
if topn < 1:
and there are no surroundings of code that needs to be insert.

The if topn < 1: part is added by the first patch (01c5336), are you sure you applied it correctly?

However, just applying the single patch above (manually adding three lines and deleting two lines) should do the trick. No need to apply the entire pull request. 🙂

tvrbanec · 2019-01-25T22:13:42Z

After applying patch above.

    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)  # construct similarity matrix
  File "/usr/local/lib/python2.7/dist-packages/gensim/similarities/termsim.py", line 234, in __init__
    for term, similarity in index.most_similar(t1, num_rows)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 1403, in most_similar
    most_similar = self.keyedvectors.most_similar(positive=[t1], topn=topn, **self.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 550, in most_similar
    raise ValueError("cannot compute similarity with no input")
ValueError: cannot compute similarity with no input

Witiko · 2019-01-25T22:25:56Z

I cannot reproduce your issue. Here is what I did:

I installed Gensim 3.7.0 using Python 2.7:

$ git clone https://github.com/RaRe-Technologies/gensim.git
$ cd gensim
$ git checkout 3.7.0
$ pip install .

I applied the patch:

$ git apply
diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
index ff79212c..ad9c90c7 100644
--- a/gensim/models/keyedvectors.py
+++ b/gensim/models/keyedvectors.py
@@ -511,6 +511,9 @@ class WordEmbeddingsKeyedVectors(BaseKeyedVectors):
             Sequence of (word, similarity).
 
         """
+        if topn < 1:
+            return []
+
         if positive is None:
             positive = []
         if negative is None:
@@ -550,8 +553,6 @@ class WordEmbeddingsKeyedVectors(BaseKeyedVectors):
 
         limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
         dists = dot(limited, mean)
-        if not topn:
-            return dists
         best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)
         # ignore (don't return) words from the input
         result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]

Then, I ran your code:

>>> from gensim.corpora import Dictionary
>>> from gensim.models.keyedvectors import WordEmbeddingSimilarityIndex
>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.similarities import SparseTermSimilarityMatrix
>>> 
>>> corpus = [['how', 'exile', 'selfie', ...]]  # from <https://pastebin.com/at7idpCV>
>>> model = Word2Vec(corpus, size=20, min_count=1)
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv)
>>> dictionary = Dictionary(corpus)
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
>>> similarity_matrix
<gensim.similarities.termsim.SparseTermSimilarityMatrix object at 0x7eff8d0d4e10>

Does the above procedure produce a working setup for you?

tvrbanec · 2019-01-25T22:47:24Z

Hm.

    from gensim import corpora
  File "/usr/local/lib/python2.7/dist-packages/gensim/__init__.py", line 5, in <module>
    from gensim import parsing, corpora, matutils, interfaces, models, similarities, summarization, utils  # noqa:F401
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/__init__.py", line 6, in <module>
    from .indexedcorpus import IndexedCorpus  # noqa:F401 must appear before the other classes
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/indexedcorpus.py", line 15, in <module>
    from gensim import interfaces, utils
  File "/usr/local/lib/python2.7/dist-packages/gensim/interfaces.py", line 21, in <module>
    from gensim import utils, matutils
  File "/usr/local/lib/python2.7/dist-packages/gensim/matutils.py", line 1082, in <module>
    from gensim._matutils import logsumexp, mean_absolute_difference, dirichlet_expectation
  File "__init__.pxd", line 918, in init gensim._matutils
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

tvrbanec · 2019-01-25T22:49:46Z

Never mind. I'll go back to 3.6. Thanks for time and energy. I just don't have it more. I hope new release will come soon.

Witiko · 2019-01-25T23:50:50Z

If you are installing to the same python environment as before, you may need to use pip install --force-reinstall --ignore-installed . to force the reinstall. I am sorry about the broken feature. Hopefully, you will regain your appetite for tinkering with Gensim in a couple of days. 🙂

Witiko · 2019-01-27T01:09:02Z

This PR should be good to merge.

menshikh-iv · 2019-01-27T05:14:25Z

Thanks for the fast fix @Witiko

tvrbanec · 2019-02-02T14:13:28Z

It seems to me that everything is fine in 3.7.1.

Witiko · 2019-02-02T19:02:47Z

I am glad to hear that. Then #2105 should be closed.

Witiko changed the title ~~Fix gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar~~ Fix WordEmbeddingsKeyedVectors.most_similar Jan 25, 2019

Witiko mentioned this pull request Jan 25, 2019

AssertionError: sparse documents must not contain any explicit zero entries and the similarity matrix S must satisfy x^T * S * x > 0 for any nonzero bag-of-words vector x. #2105

Closed

Witiko force-pushed the fix-most_similar branch 2 times, most recently from cade7bd to 4463d31 Compare January 25, 2019 18:34

Witiko closed this Jan 25, 2019

Witiko reopened this Jan 25, 2019

Witiko added 2 commits January 25, 2019 20:14

Make WordEmbeddingsKeyedVectors.most_similar(..topn=O) return empty list

01c5336

Fix the example code for SoftCosineSimilarity

f4a2d00

Witiko force-pushed the fix-most_similar branch from 4463d31 to f4a2d00 Compare January 25, 2019 19:14

Witiko force-pushed the fix-most_similar branch from 5275ccb to 916013b Compare January 25, 2019 20:17

Retain undocumented WordEmbeddingsKeyedVectors.most_similar(..topn=None)

06781e2

Witiko force-pushed the fix-most_similar branch from 916013b to 06781e2 Compare January 25, 2019 20:48

menshikh-iv merged commit 0bc2ca6 into piskvorky:develop Jan 27, 2019

Witiko mentioned this pull request Jan 28, 2019

Fix the example code for SparseTermSimilarityMatrix #2359

Merged

piskvorky mentioned this pull request Apr 20, 2019

accuracy function for analogy test error #2455

Open

Witiko mentioned this pull request Apr 23, 2019

Fix WordEmbeddingsKeyedVectors.most_similar #2461

Merged

magiob mentioned this pull request May 17, 2019

SparseTermSimilarityMatrix - TypeError: 'numpy.float32' object is not iterable #2496

Closed

Witiko mentioned this pull request May 17, 2019

Make most_similar accept topn of any integer type #2497

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix WordEmbeddingsKeyedVectors.most_similar #2356

Fix WordEmbeddingsKeyedVectors.most_similar #2356

Witiko commented Jan 25, 2019 •

edited

Loading

piskvorky commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

piskvorky commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

tvrbanec commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

tvrbanec commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

tvrbanec commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

tvrbanec commented Jan 25, 2019

tvrbanec commented Jan 25, 2019

Witiko commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 27, 2019

menshikh-iv commented Jan 27, 2019

tvrbanec commented Feb 2, 2019

Witiko commented Feb 2, 2019

Fix WordEmbeddingsKeyedVectors.most_similar #2356

Fix WordEmbeddingsKeyedVectors.most_similar #2356

Conversation

Witiko commented Jan 25, 2019 • edited Loading

piskvorky commented Jan 25, 2019 • edited Loading

Witiko commented Jan 25, 2019 • edited Loading

piskvorky commented Jan 25, 2019 • edited Loading

Witiko commented Jan 25, 2019 • edited Loading

Witiko commented Jan 25, 2019 • edited Loading

Witiko commented Jan 25, 2019 • edited Loading

tvrbanec commented Jan 25, 2019 • edited Loading

Witiko commented Jan 25, 2019 • edited Loading

tvrbanec commented Jan 25, 2019 • edited Loading

Witiko commented Jan 25, 2019 • edited Loading

tvrbanec commented Jan 25, 2019 • edited Loading

Witiko commented Jan 25, 2019 • edited Loading

tvrbanec commented Jan 25, 2019

tvrbanec commented Jan 25, 2019

Witiko commented Jan 25, 2019 • edited Loading

Witiko commented Jan 27, 2019

menshikh-iv commented Jan 27, 2019

tvrbanec commented Feb 2, 2019

Witiko commented Feb 2, 2019

Witiko commented Jan 25, 2019 •

edited

Loading

piskvorky commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

piskvorky commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

tvrbanec commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

tvrbanec commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

tvrbanec commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading

Witiko commented Jan 25, 2019 •

edited

Loading