Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix negative sampling floating-point error for gensim.models.PoincareModel. Fix #1917 #1959

Merged

Conversation

jayantj
Copy link
Contributor

@jayantj jayantj commented Mar 7, 2018

Fixes #1917

The bug was resulting from the fact that the probability of each node for negative sampling is subject to floating-point error (even with float128), which increases with increasing number of nodes - which explains why we hadn't encountered it before training on wikipedia.

Since the individual probabilities are subject to some error, the cumulative sum can be != 1 (an error of around 1e-10 in the case of en wikipedia), and thus np.searchsorted for a randomly chosen value in [0.0, 1.0) can return last index of cumsum table + 1 when the random value is greater than the last value in the cumsum table.

To avoid this, I'm multiplying the randomly chosen values in [0.0, 1.0) by the last value in the cumulative sum table to ensure it lies within the cumsum range (extremely close to [0.0, 1.0] in most cases)

This is rather hard to write a simple test for, so I'm skipping it. Let me know what you think @menshikh-iv

@menshikh-iv
Copy link
Contributor

@jayantj I agree about the test (let's skip it because this can be reproduced only with the really large model, that we have no chance to test in CI service).

The really interesting reason, nice catch!

@@ -202,8 +202,12 @@ def _get_candidate_negatives(self):
"""

if self._negatives_buffer.num_items() < self.negative:
# Note: np.random.choice much slower than random.sample for large populations, possible bottleneck
uniform_numbers = self._np_random.random_sample(self._negatives_buffer_size)
# self._node_probabilities_cumsum sometimes doesn't have 1 as the last value due to floating point error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be a good idea to add a link to the issue directly to comment

@jayantj jayantj force-pushed the poincare_negative_sampling_bugfix branch from 633520b to fea4ba5 Compare March 8, 2018 08:24
@menshikh-iv menshikh-iv changed the title [MRG] Fixes bug in negative sampling due to floating point error Fix negative sampling floating-point error for gensim.models.PoincareModel. Fix #1917 Mar 9, 2018
@jayantj
Copy link
Contributor Author

jayantj commented Mar 14, 2018

I investigated the failing test - it seems rather strange and looks to have to do with pickle. The poincare model saved in py3.5 wasn't being loaded correctly in py2.7, with this error -

In [1]: from gensim.models.poincare import PoincareModel

In [2]: PoincareModel.load('gensim/test/test_data/poincare_test_3.4.0')
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-2-1d8d6b315418> in <module>()
----> 1 PoincareModel.load('gensim/test/test_data/poincare_test_3.4.0')

/home/jayant/Projects/gensim/gensim/gensim/models/poincare.pyc in load(cls, *args, **kwargs)
    344     def load(cls, *args, **kwargs):
    345         """Load model from disk, inherited from :class:`~gensim.utils.SaveLoad`."""
--> 346         model = super(PoincareModel, cls).load(*args, **kwargs)
    347         model._init_node_probabilities()
    348         return model

/home/jayant/Projects/gensim/gensim/gensim/utils.pyc in load(cls, fname, mmap)
    423         compress, subname = SaveLoad._adapt_by_suffix(fname)
    424 
--> 425         obj = unpickle(fname)
    426         obj._load_specials(fname, mmap, compress, subname)
    427         logger.info("loaded %s", fname)

/home/jayant/Projects/gensim/gensim/gensim/utils.pyc in unpickle(fname)
   1332             return _pickle.load(f, encoding='latin1')
   1333         else:
-> 1334             return _pickle.loads(f.read())
   1335 
   1336 

UnpicklingError: state is not a dictionary

However, it works fine when I use pickle instead of cPickle -

In [3]: import pickle, cPickle

In [4]: pickle.load(open('gensim/test/test_data/poincare_test_3.4.0', 'rb'))
Out[4]: <gensim.models.poincare.PoincareModel at 0x7f8e8da57dd0>

In [5]: cPickle.load(open('gensim/test/test_data/poincare_test_3.4.0', 'rb'))
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-5-d18194368f07> in <module>()
----> 1 cPickle.load(open('gensim/test/test_data/poincare_test_3.4.0', 'rb'))

UnpicklingError: state is not a dictionary

For now, I've trained the model used in the tests with py2.7, which seems to work fine across all python versions. We should probably investigate this in a subsequent issue.

Attaching the model here -

poincare_test_3.4.0.zip
gensim-3.4 was used to train this model

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Mar 14, 2018

@jayantj aha, really suspicious behavior (and I also think that this problem is wider than only with Poincare).
So, what're our next actions? Merge current PR & raise issue about the pickle?

UPD: We discuss it, current PR is done, @jayantj will raise an issue later.

@menshikh-iv menshikh-iv merged commit e8264bd into piskvorky:develop Mar 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Poincare training bug
2 participants