Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn the user if the corpus for LDA/LSI is plain list #1392

Closed
wants to merge 4 commits into from

Conversation

rasto2211
Copy link
Contributor

Encourage the user to use the memory-friendly generator interface
presented in the tutorial on corpus streaming. See the issue #74 .

I have decided to move the warning ldamodel.update and lsimodel.add_documents since both of the methods are called by constructors.

Encourage the user to use the memory-friendly generator interface
presented in the tutorial on corpus streaming. See the issue piskvorky#74.
Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small note/question from me; a proper review of the concept/architecture of this change is up to @menshikh-iv .

@@ -579,6 +579,15 @@ def update(self, corpus, chunksize=None, decay=None, offset=None,
if gamma_threshold is None:
gamma_threshold = self.gamma_threshold

if isinstance(corpus, list):
logger.warning(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this warning message look like when actually emitted?

The spacing around line breaks and punctuation doesn't seem right.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this

taking plain list as a corpus for LDA;we strongly recommend you to use corpus streaming instead of lists;LDA is processing the corpus in batches andit only loads one batch of documents into memory at once;please take a look at the tutorial on corpus streaming: https://radimrehurek.com/gensim/tut1.html#corpus-streaming-one-document-at-a-time

Copy link
Contributor Author

@rasto2211 rasto2211 Jun 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added spaces.

@@ -348,6 +348,15 @@ def add_documents(self, corpus, chunksize=None, decay=None):
"""
logger.info("updating model with new documents")

if isinstance(corpus, list):
logger.warning(
"taking plain list as a corpus for LSI;"
Copy link
Contributor

@menshikh-iv menshikh-iv Jun 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make a few fixes:

  • Use capital letters/dots as a start/end of the sentences.
  • Be careful that your words do not stick together, for example
print(
"hello"
"world"
)

the output is helloworld

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that the logging messages formatted as proper English sentences look better. I have just seen that the rest of the code in this file is always starting the logging message with lower case letter and ends with semicolon and I thought that it's the "gensim code style".

Fixed the problem with words sticking together.

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing!

Still needs an architectural review by @menshikh-iv , to see whether this style of solution is what we want (consistent, comprehensive).

"Taking plain list as a corpus for LDA.\n"
"We strongly recommend you to use corpus streaming instead of lists. "
"LDA is processing the corpus in batches and "
"it only loads one batch of documents into memory at once.\n"
Copy link
Owner

@piskvorky piskvorky Jun 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 on newlines in log lines. Makes the log harder to process and parse (missing timestamps, broken line format).

If you must output multiple lines, output them individually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I removed the newlines from log messages and added spaces instead of them.

@rasto2211 rasto2211 changed the title Warn user if the corpus for LDA/LSI is plain list Warn the user if the corpus for LDA/LSI is plain list Jun 9, 2017
@menshikh-iv
Copy link
Contributor

I think that this change will only annoy our users, we discussed this with @piskvorky and he agrees with me. For this reason, I close this PR, sorry @rasto2211.

@piskvorky
Copy link
Owner

Let me add that checking errors and sanity checks are still very desirable (incorrect input, potential issues etc).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants