
Make token2id mapping reproducible #1715

Merged (5 commits into piskvorky:develop on Nov 23, 2017)

Conversation

@formi23 (Contributor) commented Nov 14, 2017

This change modifies the attributes token2id and dfs of the gensim.corpora.Dictionary class to be OrderedDicts rather than dicts. This ensures a deterministic order of elements across consecutive executions, regardless of Python version.

Between Python versions 3.3 and 3.6, the order of iteration for sets and dictionaries is randomised before each execution by default (see link). In prior versions, hash randomisation is disabled by default (see link). As of 3.6, the order of set and dictionary iteration reflects the order in which elements were added, but this is an implementation detail not to be relied upon (see link). The changes in this PR resolve this issue by ensuring that the mapping behaviour of gensim.corpora.Dictionary is consistent across all modern versions of Python and between executions.
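The underlying mechanism can be seen without gensim at all; a minimal sketch (the printed order is process-dependent under hash randomisation):

# Run this twice under Python 3.3-3.5: string hashing is randomised per
# process by default, so the set's iteration order (and hence the printed
# list) typically differs between runs unless PYTHONHASHSEED is fixed.
print(list({"graph", "trees", "minors", "survey"}))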

Note that unit testing this functionality is not possible because of the quirks of the various Python versions: the problem only manifests between executions, which undermines the reproducibility of experiments using gensim.corpora.Dictionary. A minimal working example that demonstrates the problem follows (compare the output across runs under a Python version that uses hash seed randomisation, such as 3.5):

import gensim.corpora
import sys

print(sys.version)


def extract_words(documents, stoplist):
    return [[word for word in document.lower().split()
             if word not in stoplist]
            for document in documents]


def main():
    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees"]

    documents_add = ["The intersection graph of paths in trees",
                     "Graph minors IV Widths of trees and well quasi ordering",
                     "Graph minors A survey"]

    stoplist = set('for a of the and to in'.split())

    texts = extract_words(documents, stoplist)
    texts_add = extract_words(documents_add, stoplist)

    dictionary = gensim.corpora.Dictionary(texts)
    dictionary.add_documents(texts_add)

    dictionary.filter_extremes(no_below=1, no_above=0.9)
    dictionary.compactify()

    dictionary.save("dict_test.dict")
    # load() is a classmethod returning a new instance, so rebind the name
    dictionary = gensim.corpora.Dictionary.load("dict_test.dict")

    result = dictionary.iteritems()
    result = sorted(result, key=lambda x: x[0])

    print(result[:5])
    print(result[-5:])


main()

Example output:

$ python3.5 ...
Python 3.5.2 (...)
[(0, 'generation'), (1, 'well'), (2, 'intersection'), (3, 'human'), (4, 'response')]
[(30, 'graph'), (31, 'eps'), (32, 'applications'), (33, 'engineering'), (34, 'lab')]

$ python3.5 ...
Python 3.5.2 (...)
[(0, 'computer'), (1, 'paths'), (2, 'testing'), (3, 'machine'), (4, 'minors')]
[(30, 'time'), (31, 'abc'), (32, 'unordered'), (33, 'ordering'), (34, 'relation')]

@menshikh-iv (Contributor)

Hi @formi23, thanks for the PR, but I don't think this is a bug.
We know the order can be arbitrary, but this does not affect the use of corpora.Dictionary, so I don't think we need this change.

WDYT @piskvorky @gojomo ?

@formi23 (Contributor, Author) commented Nov 16, 2017

Hi @menshikh-iv,
I wasn't suggesting this is a bug, rather an overlooked detail that affects the reproducibility of results produced by whatever model is used in conjunction with gensim's Dictionary.
Using OrderedDict eliminates this discrepancy, and even ensures that the results of runs are exactly the same across Python 2.7 through 3.6.

@menshikh-iv (Contributor)

@formi23 but it does not matter how the words are numbered (for any algorithm); the only difference is in the final indices. If you have two otherwise identical models with differently numbered dictionaries, the models will produce exactly the same results (up to the numbering of words).

Also, if you save your model in Python 2 and load it in Python 3 (and vice versa), you'll receive exactly the same indices in the dictionary.

@piskvorky (Owner)

The cross-Python reproducibility is not critical, but nice to have.

What price do we pay for it, @formi23? What is the impact of this PR on performance and memory?

@formi23 (Contributor, Author) commented Nov 17, 2017

Since memory and performance are affected by dataset size, Python version (e.g. the implementation of dict varies by version) and use case, I don't think it's feasible to come up with representative benchmarks.

OrderedDict wraps a dict instance and additionally maintains a linked list of keys, which is what preserves the order. The memory usage of that key list is presumably O(n) (it certainly appears to be, from the source). Most of the methods on OrderedDict have performance identical to dict, as can also be seen from the source.
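As a rough illustration of the container overhead, a minimal sketch (sys.getsizeof reports only the container itself, not the shared keys and values, and exact numbers vary by Python version):

import sys
from collections import OrderedDict

words = ["human", "interface", "computer", "survey", "user", "system"]
plain = {w: i for i, w in enumerate(words)}
ordered = OrderedDict((w, i) for i, w in enumerate(words))

# The OrderedDict is larger because it additionally tracks insertion order.
print(sys.getsizeof(plain), sys.getsizeof(ordered))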

I just want to clarify that the intent of this PR is not strictly to introduce cross-version reproducibility, but to introduce determinism. The use of OrderedDict is incidental and simply removes the non-determinism introduced by using dict to assign indices to words. Sorting words before assigning indices would achieve the same outcome, with obviously different performance and memory characteristics.
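For concreteness, a minimal sketch of that sorted-based alternative (new_tokens is a hypothetical stand-in for the unseen words of a corpus pass, not the actual PR code):

new_tokens = {"graph", "trees", "minors", "survey"}

token2id = {}
for token in sorted(new_tokens):  # fixed order regardless of hash seed
    token2id[token] = len(token2id)

print(token2id)  # identical mapping on every run and every Python version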

The behaviour of gensim.corpora.Dictionary in Python 3.6 is reproducible only due to the current dict implementation, which may change in the future.

@menshikh-iv, the difference is between algorithmically guaranteeing the same results and depending on circumstance. Using Dictionary in conjunction with LDA in Python [3.3-3.6) produces different results due to the document vectors being different in each execution. As an example, see the results of LDA modeling using Python 3.5.2 and 3.6.3 for two consecutive executions with and without the changes in this PR.

@menshikh-iv (Contributor) commented Nov 17, 2017

@formi23 nice description 💣 !

Can you make a small benchmark of dictionary construction / filtering / doc2bow usage with and without these changes (memory and time)?

@formi23 (Contributor, Author) commented Nov 17, 2017

@menshikh-iv do you have an example of such benchmarks?

@@ -149,6 +149,7 @@ def doc2bow(self, document, allow_update=False, return_missing=False):
token2id = self.token2id
if allow_update or return_missing:
missing = {w: freq for w, freq in iteritems(counter) if w not in token2id}
missing = OrderedDict(sorted(missing.items(), key=lambda x: (x[1], x[0])))
Contributor

Is it really necessary to sort items explicitly (everywhere)? If yes, please use iteritems instead of .items().

Contributor Author

This is the essential step because it introduces determinism in the assignment of ids and therefore consistency in the word-id mapping between executions (for a given dataset). However, OrderedDict can be omitted - see my comment regarding benchmarks for the implications of only using sorted and dict.

@menshikh-iv (Contributor) commented Nov 17, 2017

@formi23 no, but it's very simple.

You need to measure the time/memory needed to fill up, filter, and apply the Dictionary with the old and new code versions, and post the results here.

@formi23 (Contributor, Author) commented Nov 19, 2017

I'm familiar with benchmarking, I just don't see the value in comparing the performance before and after the changes on an arbitrary dataset. The algorithmic complexity implications of the changes are clear from the implementation of OrderedDict and that is far more indicative of time and memory performance than would be a single comparison for specific inputs.

If you insist, the following scatter plot shows the execution time for three configurations using both Python 3.5 (p35_*) and 3.6 (p36_*) on a dataset containing 245625 words (27970 unique):

  • Unmodified gensim.corpora.Dictionary (no suffix)
  • Using OrderedDict to remove non-determinism, i.e. this PR's changes (suffix od)
  • Using the sorted function to ensure determinism but omitting OrderedDict (suffix sl)

Four methods of Dictionary are tested:

  • Dictionary initialisation (i)
  • filter_extremes (f)
  • compactify (c)
  • doc2bow (applied to each of the 2525 documents in the dataset) (d)

So, for example, p36_od_c indicates the execution times for the Dictionary.compactify method using the changes in this PR and Python 3.6. 100 executions were timed for each configuration.

[figure: scatter plot of execution times for each configuration and method]

The sorted approach results in consistent word-to-integer mappings but does not ensure that the iteration order of Dictionary is consistent. Its performance presumably suffers compared to the OrderedDict approach because OrderedDict maintains the initial sorting, making subsequent sorts much faster. The memory required for Dictionary's token2id attribute approximately doubles when using OrderedDict (325480 vs 147560 bytes). This is consistent with the space and time complexities of the operations and data structures.
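A benchmark of this kind can be reproduced along these lines (a minimal sketch, not the exact harness used here; texts is a small stand-in for the tokenised corpus):

import timeit

from gensim.corpora import Dictionary

texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system", "response", "time"]] * 500

# Time 100 constructions of the dictionary; filter_extremes and doc2bow
# calls can be wrapped in the same way.
elapsed = timeit.timeit(lambda: Dictionary(texts), number=100)
print("construction: %.3f s / 100 runs" % elapsed)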

@piskvorky (Owner) commented Nov 19, 2017

Thanks for the clarification @formi23. @menshikh-iv is still struggling with English; please excuse him if he sounded too rude or bossy. It wasn't intentional.

Doubling the memory sounds like a really high price, for a nice-to-have feature. Where exactly did you hit the non-reproducibility problem yourself, how severe was it?

Is there a more frugal way to achieve the same effect?

@formi23 (Contributor, Author) commented Nov 20, 2017

@piskvorky, the memory concerns are understandable. At the same time, I consider full reproducibility essential for the nature of the work gensim is used for. Furthermore, this non-determinism undermines the random state control functionality of any model implementation it is used in conjunction with.

If the memory cost of OrderedDict is too severe, then consider the alternative that only uses sorted (implementation is here - see lines 150, 247-255, 273, 275 - UPDATE: actually, only lines 151 and 269 changed). It's comparable to the current implementation of Dictionary in terms of memory, but extends execution time slightly (as shown in the chart above - see the yellow plots).

It's possible that faster or more memory efficient approaches may exist to achieve consistent word-id mappings but that's outside the scope of this PR.

In principle, the problem is that the ids for tokens are assigned in the order of dictionary iteration, which changes between the executions in Python versions 3.3-3.6. I posted the results of LDA topic modeling here. You can see the difference between two executions with Python 3.5. This is a toy example, but the effect is more pronounced in a dataset that I unfortunately don't have permission to share in which I first encountered the problem.

@menshikh-iv (Contributor) commented Nov 20, 2017

@formi23 thanks for the detailed benchmark. Doubling memory is unacceptable for us, so please try to use only sorted (or .sort, which as I remember sorts in place, in contrast to sorted).

@formi23 (Contributor, Author) commented Nov 21, 2017

@menshikh-iv, I made the changes. Regarding sort vs sorted, they seem to be essentially the same in this use case:
sorted is given a generator, from which it creates a list; that list is then sorted (via its sort method) and returned. With sort, one explicitly creates a list, which is then likewise sorted in place.
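A small illustration of the equivalence (illustrative values):

counts = [("graph", 2), ("trees", 3), ("minors", 1)]

# sorted() materialises the generator into a list and sorts that list:
via_sorted = sorted((x for x in counts), key=lambda x: x[0])

# Building the list explicitly and calling .sort() does the same work in place:
via_sort = list(counts)
via_sort.sort(key=lambda x: x[0])

assert via_sorted == via_sort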

@menshikh-iv (Contributor) left a comment

Let's check again that this works as expected; LGTM from my side.

@@ -148,9 +148,9 @@ def doc2bow(self, document, allow_update=False, return_missing=False):

token2id = self.token2id
if allow_update or return_missing:
missing = {w: freq for w, freq in iteritems(counter) if w not in token2id}
missing = sorted((x for x in iteritems(counter) if x[0] not in token2id), key=lambda x: (x[1], x[0]))
@menshikh-iv (Contributor)

Sorting by (freq, w): are you sure that this is correct?

@formi23 (Contributor, Author) commented Nov 21, 2017

@menshikh-iv, that's a good point. Either way works and ensures reproducibility. One could assume that sorting integers is faster than sorting strings, but the tie-breaking for words with the same frequency seems to introduce overhead (see the figure, where freq indicates frequency-word sorting and string indicates word-frequency sorting; the i, f, c, and d methods are as in the previous example).
[figure: execution-time comparison of frequency-first (freq) vs word-first (string) sort keys]

Since the words in token2id are guaranteed to be unique but the frequencies are not (and will likely overlap, given the patterns of natural language), sorting by token then frequency is probably the better choice, though this will vary on a dataset-by-dataset basis. What's your preference?
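The two candidate keys side by side (illustrative counts):

counts = [("graph", 2), ("trees", 2), ("minors", 1)]

# (freq, word): ties on frequency (here "graph"/"trees") fall through to a
# string comparison anyway.
print(sorted(counts, key=lambda x: (x[1], x[0])))
# [('minors', 1), ('graph', 2), ('trees', 2)]

# (word,): tokens are unique, so a single string key never ties.
print(sorted(counts, key=lambda x: x[0]))
# [('graph', 2), ('minors', 1), ('trees', 2)]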

@menshikh-iv (Contributor) commented Nov 22, 2017

IMO (w, freq) is better than (freq, w), just because of the uniqueness of the tokens. For a bigger corpus the situation gets even worse for (freq, w): frequencies obey Zipf's law, so there are more and more duplicate frequencies in the "tail" of the distribution.

@formi23 (Contributor, Author)

Made the changes.

@menshikh-iv (Contributor)

I checked manually, and the mapping now looks reproducible across different Python versions. Nice work @formi23 👍 🔥

@@ -148,7 +148,7 @@ def doc2bow(self, document, allow_update=False, return_missing=False):

token2id = self.token2id
if allow_update or return_missing:
missing = sorted((x for x in iteritems(counter) if x[0] not in token2id), key=lambda x: (x[1], x[0]))
missing = sorted((x for x in iteritems(counter) if x[0] not in token2id), key=lambda x: x[0])
@piskvorky (Owner)

sort works lexicographically by default; no need for the key parameter at all.
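Since the tokens are unique, the default tuple comparison never reaches the frequency, so the two spellings agree (a small check):

pairs = [("user", 3), ("graph", 2), ("trees", 2)]

# Default tuple ordering compares the word first; with unique tokens the
# frequency is never consulted, so the key can be dropped entirely.
assert sorted(pairs) == sorted(pairs, key=lambda x: x[0])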

@menshikh-iv (Contributor) commented Nov 23, 2017

Congrats on your first contribution @formi23 🔥
I'm very impressed by your thorough investigation; maybe you would be interested in our student incubator?

@menshikh-iv merged commit 7fabdbd into piskvorky:develop on Nov 23, 2017
@piskvorky (Owner) commented Nov 23, 2017

Impressive indeed. Incubator or not, we'd like to keep you around @formi23 :)

Do you have an appetite for another open source project with gensim?

VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this pull request Nov 26, 2017
…tionary (piskvorky#1715)

* Make token2id mapping reproducible

* Use iteritems instead of items

* Use sorted for deterministic token2id mapping

* Replace frequency-token sort with token sort

* remove key from sort
@formi23 (Contributor, Author) commented Nov 28, 2017

@piskvorky Thanks, happy I got to contribute to the library I'm using. I'll keep an eye out for PR opportunities as they become apparent in my work.
