
Fix memory consumption in AuthorTopicModel #2122

Merged: 3 commits into RaRe-Technologies:develop, Jul 13, 2018

3 participants
@philipphager (Contributor) commented Jul 7, 2018

@menshikh-iv
I submitted this PR to address the memory consumption issue seen when using the AuthorTopicModel, as also reported in #1947. We had to fix this issue for a research project, and you may be interested in using this simple fix in the main project. The original code concatenates the document lists of all authors once per author and then removes duplicates (yielding every unique document id in the corpus); it can be reduced to the single loop below, and the results did not change.

This fix reduced the memory consumption on our project with ≈400,000 docs from 32 GB to 2 GB for the entire duration of the training.

# Collect all documents of authors.
for doc_ids in self.author2doc.values():
    train_corpus_idx.extend(doc_ids)
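To illustrate why the change helps, here is a minimal, self-contained sketch. The toy `author2doc` mapping is a stand-in (not from gensim), and the "old" variant reconstructs the nested-loop pattern discussed later in this thread, so treat it as an assumption about the original code rather than a verbatim copy.

```python
# Hypothetical toy mapping: author -> list of document ids.
author2doc = {"alice": [0, 2], "bob": [1, 2], "carol": [0, 3]}

# Old approach (quadratic): for every author, the doc ids of *all*
# authors are concatenated again, then deduplicated at the end.
old_idx = []
for _ in author2doc:                     # outer loop over authors
    for doc_ids in author2doc.values():  # inner loop over all authors again
        old_idx.extend(doc_ids)
old_idx = list(set(old_idx))

# Fixed approach: a single pass over the values is enough.
new_idx = []
for doc_ids in author2doc.values():
    new_idx.extend(doc_ids)
new_idx = list(set(new_idx))

# Both produce the same set of unique document ids.
assert set(old_idx) == set(new_idx)
```

The intermediate list in the old variant grows with (number of authors) × (total doc ids), which matches the memory blow-up described above.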

@piskvorky (Member) commented Jul 8, 2018
Why keep train_corpus_idx as a list in the first place? It looks like it's converted to a set right below, so maybe it's better to make it a set from the start?

@philipphager (Contributor, Author) commented Jul 8, 2018
That is true, indeed. I would propose something like this:

# Train on all documents of authors in input_corpus.
train_corpus_idx = set()
# Collect all documents of authors.
for doc_ids in self.author2doc.values():
    train_corpus_idx.update(doc_ids)

train_corpus_idx = list(train_corpus_idx)

@piskvorky (Member) commented Jul 8, 2018
I'm not familiar with the algo, but for the sake of reproducibility, I'd replace list(train_corpus_idx) with sorted(train_corpus_idx). That will remove randomness from the ordering of values, while still producing a list.
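Putting both suggestions together, the final shape of the fix would look roughly like the sketch below. The toy `author2doc` mapping is hypothetical; the set/`sorted` combination is exactly what the two comments above propose.

```python
# Hypothetical toy mapping: author -> list of document ids.
author2doc = {"alice": [3, 1], "bob": [2, 1]}

# Collect all documents of authors into a set to deduplicate as we go.
train_corpus_idx = set()
for doc_ids in author2doc.values():
    train_corpus_idx.update(doc_ids)

# Sort for a deterministic ordering, while still producing a list.
train_corpus_idx = sorted(train_corpus_idx)
# train_corpus_idx == [1, 2, 3]
```

Using `sorted()` instead of `list()` costs an extra O(n log n) pass but makes the training order reproducible across runs, since Python set iteration order is not guaranteed.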

@piskvorky (Member) commented Jul 8, 2018

Nice! The original code looks strange indeed -- @olavurmortensen any particular reason for that nested quadratic loop?

@philipphager (Contributor, Author) commented Jul 9, 2018

I updated the loop to use a set and to sort the resulting list. At least on my current dataset this is also a little faster :), and it reduces the temporary memory overhead even further.

@olavurmortensen (Contributor) commented Jul 10, 2018

I can't imagine there was a reason for that nested loop; it must have just slipped my mind. I only tested scalability with respect to running time. If I had tested memory consumption as well, I would have caught this.

There is some material in my thesis on the asymptotic complexity of the memory consumption (pdf, section 2.4.2.6); the algorithm doesn't scale terribly well in that respect. The empirical results showed that running time scaled as the theory predicted, but, as I said, I didn't test memory consumption.

I'm glad this problem was caught, hopefully it fixes the issues people are having. Thanks @philipphager.

@philipphager (Contributor, Author) commented Jul 11, 2018

No problem! Also, awesome job on the implementation of the AuthorTopicModel, @olavurmortensen; the API is an absolute pleasure to work with 👍! And thank you both for being so responsive on this PR :)

@piskvorky (Member) commented Jul 11, 2018

Thanks @philipphager. What happens next: we'll wait for @menshikh-iv to come back from holiday so he can review and merge this 👍

@piskvorky (Member) commented Jul 13, 2018

Actually, let me merge right away. The fix is simple enough that hopefully @menshikh-iv won't be angry :)

@piskvorky piskvorky merged commit 96444a7 into RaRe-Technologies:develop Jul 13, 2018

3 checks passed:
- ci/circleci: Your tests passed on CircleCI!
- continuous-integration/appveyor/pr: AppVeyor build succeeded
- continuous-integration/travis-ci/pr: The Travis CI build passed