
Make token2id mapping reproducible #1715

Merged (5 commits into piskvorky:develop on Nov 23, 2017)

Conversation

@formi23 (Contributor) commented Nov 14, 2017

This change modifies the attributes token2id and dfs of the gensim.corpora.Dictionary class to be OrderedDicts rather than dicts. This ensures a deterministic order of elements across consecutive executions, regardless of Python version.

Between Python versions 3.3 and 3.6, the order of iteration for sets and dictionaries is randomised before each execution by default (see link). In prior versions, hash randomisation is disabled by default (see link). As of 3.6, the order of set and dictionary iteration reflects the order in which elements were added, but this is an implementation detail not to be relied upon (see link). The changes in this PR resolve this issue by ensuring that the mapping behaviour of gensim.corpora.Dictionary is consistent across all modern versions of Python and between executions.
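The underlying mechanism can be seen without gensim at all; a minimal sketch (the printed order is process-dependent under hash randomisation):

# Run this twice under Python 3.3-3.5: string hashing is randomised per
# process by default, so the set's iteration order (and hence the printed
# list) typically differs between runs unless PYTHONHASHSEED is fixed.
print(list({"graph", "trees", "minors", "survey"}))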

Note that unit testing this functionality is not possible because of the quirks of the various Python versions: the problem only manifests between executions, which undermines the reproducibility of experiments using gensim.corpora.Dictionary. A minimal working example that demonstrates the problem follows (compare the output across runs under a Python version that uses hash seed randomisation, such as 3.5):

import gensim.corpora
import sys

print(sys.version)


def extract_words(documents, stoplist):
    return [[word for word in document.lower().split()
             if word not in stoplist]
            for document in documents]


def main():
    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees"]

    documents_add = ["The intersection graph of paths in trees",
                     "Graph minors IV Widths of trees and well quasi ordering",
                     "Graph minors A survey"]

    stoplist = set('for a of the and to in'.split())

    texts = extract_words(documents, stoplist)
    texts_add = extract_words(documents_add, stoplist)

    dictionary = gensim.corpora.Dictionary(texts)
    dictionary.add_documents(texts_add)

    dictionary.filter_extremes(no_below=1, no_above=0.9)
    dictionary.compactify()

    dictionary.save("dict_test.dict")
    # load() is a classmethod returning a new instance, so rebind the name
    dictionary = gensim.corpora.Dictionary.load("dict_test.dict")

    result = dictionary.iteritems()
    result = sorted(result, key=lambda x: x[0])

    print(result[:5])
    print(result[-5:])


main()

Example output:

$ python3.5 ...
Python 3.5.2 (...)
[(0, 'generation'), (1, 'well'), (2, 'intersection'), (3, 'human'), (4, 'response')]
[(30, 'graph'), (31, 'eps'), (32, 'applications'), (33, 'engineering'), (34, 'lab')]

$ python3.5 ...
Python 3.5.2 (...)
[(0, 'computer'), (1, 'paths'), (2, 'testing'), (3, 'machine'), (4, 'minors')]
[(30, 'time'), (31, 'abc'), (32, 'unordered'), (33, 'ordering'), (34, 'relation')]

@menshikh-iv (Contributor)

Hi @formi23, thanks for the PR, but I don't think this is a bug.
We know the order can be arbitrary, but this does not affect the use of corpora.Dictionary, so I don't think we need this change.

WDYT @piskvorky @gojomo ?

@formi23 (Contributor, Author) commented Nov 16, 2017

Hi @menshikh-iv,
I wasn't suggesting this is a bug, rather an overlooked detail that affects the reproducibility of results produced by whatever model is used in conjunction with gensim's Dictionary.
Using OrderedDict eliminates this discrepancy, and even ensures that the results of runs are exactly the same across Python 2.7 through 3.6.

@menshikh-iv (Contributor)

@formi23 but it does not matter how the words are numbered (for any algorithm); the only difference is in the final indices. If you have two otherwise identical models with differently numbered dictionaries, the models will produce exactly the same results (up to the numbering of words).

Also, if you save your model in Python 2 and load it in Python 3 (and vice versa), you'll receive exactly the same indices in the dictionary.

@piskvorky (Owner)

The cross-Python reproducibility is not critical, but nice to have.

What price do we pay for it, @formi23? What is the impact of this PR on performance and memory?

@formi23 (Contributor, Author) commented Nov 17, 2017

Since memory and performance are affected by dataset size, Python version (e.g. the implementation of dict varies by version) and use case, I don't think it's feasible to come up with representative benchmarks.

OrderedDict wraps a dict instance and additionally maintains a linked list of keys, which is what preserves the order. The memory usage of that key list is presumably O(n) (it certainly appears to be, from the source). Most of the methods on OrderedDict have performance identical to dict, as can also be seen from the source.
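As a rough illustration of the container overhead, a minimal sketch (sys.getsizeof reports only the container itself, not the shared keys and values, and exact numbers vary by Python version):

import sys
from collections import OrderedDict

words = ["human", "interface", "computer", "survey", "user", "system"]
plain = {w: i for i, w in enumerate(words)}
ordered = OrderedDict((w, i) for i, w in enumerate(words))

# The OrderedDict is larger because it additionally tracks insertion order.
print(sys.getsizeof(plain), sys.getsizeof(ordered))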

I just want to clarify that the intent of this PR is not strictly to introduce cross-version reproducibility, but to introduce determinism. The use of OrderedDict is incidental and simply removes the non-determinism introduced by using dict to assign indices to words. Sorting words before assigning indices would achieve the same outcome, with obviously different performance and memory characteristics.
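For concreteness, a minimal sketch of that sorted-based alternative (new_tokens is a hypothetical stand-in for the unseen words of a corpus pass, not the actual PR code):

new_tokens = {"graph", "trees", "minors", "survey"}

token2id = {}
for token in sorted(new_tokens):  # fixed order regardless of hash seed
    token2id[token] = len(token2id)

print(token2id)  # identical mapping on every run and every Python version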

The behaviour of gensim.corpora.Dictionary in Python 3.6 is reproducible only due to the current dict implementation, which may change in the future.

@menshikh-iv, the difference is between algorithmically guaranteeing the same results and depending on circumstance. Using Dictionary in conjunction with LDA in Python [3.3-3.6) produces different results due to the document vectors being different in each execution. As an example, see the results of LDA modeling using Python 3.5.2 and 3.6.3 for two consecutive executions with and without the changes in this PR.

@menshikh-iv (Contributor) commented Nov 17, 2017

@formi23 nice description 💣 !

Can you make a small benchmark of dictionary construction / filtering / doc2bow usage with and without these changes (memory and time)?

@formi23 (Contributor, Author) commented Nov 17, 2017

@menshikh-iv do you have an example of such benchmarks?

@@ -149,6 +149,7 @@ def doc2bow(self, document, allow_update=False, return_missing=False):
token2id = self.token2id
if allow_update or return_missing:
missing = {w: freq for w, freq in iteritems(counter) if w not in token2id}
missing = OrderedDict(sorted(missing.items(), key=lambda x: (x[1], x[0])))
Contributor

Is it really necessary to sort items explicitly (everywhere)? If yes, please use iteritems instead of .items().

Contributor Author

This is the essential step because it introduces determinism in the assignment of ids and therefore consistency in the word-id mapping between executions (for a given dataset). However, OrderedDict can be omitted - see my comment regarding benchmarks for the implications of only using sorted and dict.

@menshikh-iv (Contributor) commented Nov 17, 2017

@formi23 no, but it's very simple.

You need to measure the time/memory needed to fill up, filter, and apply the Dictionary with the old and new code versions, and post the results here.

@formi23 (Contributor, Author) commented Nov 19, 2017

I'm familiar with benchmarking, I just don't see the value in comparing the performance before and after the changes on an arbitrary dataset. The algorithmic complexity implications of the changes are clear from the implementation of OrderedDict and that is far more indicative of time and memory performance than would be a single comparison for specific inputs.

If you insist, the following scatter plot shows the execution time for three configurations using both Python 3.5 (p35_*) and 3.6 (p36_*) on a dataset containing 245625 words (27970 unique):

  • Unmodified gensim.corpora.Dictionary (no suffix)
  • Using OrderedDict to remove non-determinism, i.e. this PR's changes (suffix od)
  • Using the sorted function to ensure determinism but omitting OrderedDict (suffix sl)

Four methods of Dictionary are tested:

  • Dictionary initialisation (i)
  • filter_extremes (f)
  • compactify (c)
  • doc2bow (applied to each of the 2525 documents in the dataset) (d)

So, for example, p36_od_c indicates the execution times for the Dictionary.compactify method using the changes in this PR and Python 3.6. 100 executions were timed for each configuration.

[figure: scatter plot of execution times for each configuration and method]

The sorted approach results in consistent word-to-integer mappings but does not ensure that the iteration order of Dictionary is consistent. Its performance presumably suffers compared to the OrderedDict approach because OrderedDict maintains the initial sorting, making subsequent sorts much faster. The memory required for Dictionary's token2id attribute approximately doubles when using OrderedDict (325480 vs 147560 bytes). This is consistent with the space and time complexities of the operations and data structures.
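A benchmark of this kind can be reproduced along these lines (a minimal sketch, not the exact harness used here; texts is a small stand-in for the tokenised corpus):

import timeit

from gensim.corpora import Dictionary

texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system", "response", "time"]] * 500

# Time 100 constructions of the dictionary; filter_extremes and doc2bow
# calls can be wrapped in the same way.
elapsed = timeit.timeit(lambda: Dictionary(texts), number=100)
print("construction: %.3f s / 100 runs" % elapsed)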

@piskvorky (Owner) commented Nov 19, 2017

Thanks for the clarification @formi23. @menshikh-iv is still struggling with English; please excuse him if he sounded too rude or bossy. It wasn't intentional.

Doubling the memory sounds like a really high price, for a nice-to-have feature. Where exactly did you hit the non-reproducibility problem yourself, how severe was it?

Is there a more frugal way to achieve the same effect?

@formi23 (Contributor, Author) commented Nov 20, 2017

@piskvorky, the memory concerns are understandable. At the same time, I consider full reproducibility essential for the nature of the work gensim is used for. Furthermore, this non-determinism undermines the random state control functionality of any model implementation it is used in conjunction with.

If the memory cost of OrderedDict is too severe, then consider the alternative that only uses sorted (implementation is here - see lines 150, 247-255, 273, 275 - UPDATE: actually, only lines 151 and 269 changed). It's comparable to the current implementation of Dictionary in terms of memory, but extends execution time slightly (as shown in the chart above - see the yellow plots).

It's possible that faster or more memory efficient approaches may exist to achieve consistent word-id mappings but that's outside the scope of this PR.

In principle, the problem is that the ids for tokens are assigned in the order of dictionary iteration, which changes between the executions in Python versions 3.3-3.6. I posted the results of LDA topic modeling here. You can see the difference between two executions with Python 3.5. This is a toy example, but the effect is more pronounced in a dataset that I unfortunately don't have permission to share in which I first encountered the problem.

@menshikh-iv (Contributor) commented Nov 20, 2017

@formi23 thanks for the detailed benchmark. Doubling memory is unacceptable for us, so please try to use only sorted (or .sort, which as I remember sorts in place, in contrast to sorted).

@formi23 (Contributor, Author) commented Nov 21, 2017

@menshikh-iv, I made the changes. Regarding sort vs sorted, they seem to be essentially the same in this use case:
sorted is given a generator, from which it creates a list; that list is then sorted (via its sort method) and returned. With sort, one explicitly creates a list, which is then likewise sorted in place.
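A small illustration of the equivalence (illustrative values):

counts = [("graph", 2), ("trees", 3), ("minors", 1)]

# sorted() materialises the generator into a list and sorts that list:
via_sorted = sorted((x for x in counts), key=lambda x: x[0])

# Building the list explicitly and calling .sort() does the same work in place:
via_sort = list(counts)
via_sort.sort(key=lambda x: x[0])

assert via_sorted == via_sort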

@menshikh-iv (Contributor) left a comment

Let's check again that this works as expected; LGTM from my side.

@@ -148,9 +148,9 @@ def doc2bow(self, document, allow_update=False, return_missing=False):

token2id = self.token2id
if allow_update or return_missing:
missing = {w: freq for w, freq in iteritems(counter) if w not in token2id}
missing = sorted((x for x in iteritems(counter) if x[0] not in token2id), key=lambda x: (x[1], x[0]))
@menshikh-iv (Contributor)

Sorting by (freq, w): are you sure that this is correct?

@formi23 (Contributor, Author) commented Nov 21, 2017

@menshikh-iv, that's a good point. Either way works and ensures reproducibility. One could assume that sorting integers is faster than sorting strings, but the tie-breaking for words with the same frequency seems to introduce overhead (see the figure, where freq indicates frequency-word sorting and string indicates word-frequency sorting; the i, f, c, and d methods are as in the previous example).
[figure: execution-time comparison of frequency-first (freq) vs word-first (string) sort keys]

Since the words in token2id are guaranteed to be unique but the frequencies are not (and will likely overlap, given the patterns of natural language), sorting by token then frequency is probably the better choice, though this will vary on a dataset-by-dataset basis. What's your preference?
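The two candidate keys side by side (illustrative counts):

counts = [("graph", 2), ("trees", 2), ("minors", 1)]

# (freq, word): ties on frequency (here "graph"/"trees") fall through to a
# string comparison anyway.
print(sorted(counts, key=lambda x: (x[1], x[0])))
# [('minors', 1), ('graph', 2), ('trees', 2)]

# (word,): tokens are unique, so a single string key never ties.
print(sorted(counts, key=lambda x: x[0]))
# [('graph', 2), ('minors', 1), ('trees', 2)]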

@menshikh-iv (Contributor) commented Nov 22, 2017

IMO (w, freq) is better than (freq, w), just because of the uniqueness of the tokens. For a bigger corpus the situation gets even worse for (freq, w): frequencies obey Zipf's law, so there are more and more duplicate frequencies in the "tail" of the distribution.

@formi23 (Contributor, Author)

Made the changes.

@menshikh-iv (Contributor)

I checked manually, and the mapping now looks reproducible across different Python versions. Nice work @formi23 👍 🔥

@@ -148,7 +148,7 @@ def doc2bow(self, document, allow_update=False, return_missing=False):

token2id = self.token2id
if allow_update or return_missing:
missing = sorted((x for x in iteritems(counter) if x[0] not in token2id), key=lambda x: (x[1], x[0]))
missing = sorted((x for x in iteritems(counter) if x[0] not in token2id), key=lambda x: x[0])
@piskvorky (Owner)

sort works lexicographically by default; no need for the key parameter at all.
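Since the tokens are unique, the default tuple comparison never reaches the frequency, so the two spellings agree (a small check):

pairs = [("user", 3), ("graph", 2), ("trees", 2)]

# Default tuple ordering compares the word first; with unique tokens the
# frequency is never consulted, so the key can be dropped entirely.
assert sorted(pairs) == sorted(pairs, key=lambda x: x[0])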

@menshikh-iv (Contributor) commented Nov 23, 2017

Congrats on your first contribution @formi23 🔥
I'm very impressed by your thorough investigation; maybe you would be interested in our student incubator?

@menshikh-iv merged commit 7fabdbd into piskvorky:develop on Nov 23, 2017
@piskvorky (Owner) commented Nov 23, 2017

Impressive indeed. Incubator or not, we'd like to keep you around @formi23 :)

Do you have an appetite for another open source project with gensim?

VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this pull request Nov 26, 2017
…tionary (piskvorky#1715)

* Make token2id mapping reproducible

* Use iteritems instead of items

* Use sorted for deterministic token2id mapping

* Replace frequency-token sort with token sort

* remove key from sort
@formi23 (Contributor, Author) commented Nov 28, 2017

@piskvorky Thanks, happy I got to contribute to the library I'm using. I'll keep an eye out for PR opportunities as they become apparent in my work.
