-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Warn the user if the corpus for LDA/LSI is plain list #1392
Conversation
Encourage the user to use the memory-friendly generator interface presented in the tutorial on corpus streaming. See the issue piskvorky#74.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a small note/question from me; a proper review of the concept/architecture of this change is up to @menshikh-iv .
@@ -579,6 +579,15 @@ def update(self, corpus, chunksize=None, decay=None, offset=None, | |||
if gamma_threshold is None: | |||
gamma_threshold = self.gamma_threshold | |||
|
|||
if isinstance(corpus, list): | |||
logger.warning( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this warning message look like when actually emitted?
The spacing around line breaks and punctuation doesn't seem right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this
taking plain list as a corpus for LDA;we strongly recommend you to use corpus streaming instead of lists;LDA is processing the corpus in batches andit only loads one batch of documents into memory at once;please take a look at the tutorial on corpus streaming: https://radimrehurek.com/gensim/tut1.html#corpus-streaming-one-document-at-a-time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added spaces.
gensim/models/lsimodel.py
Outdated
@@ -348,6 +348,15 @@ def add_documents(self, corpus, chunksize=None, decay=None): | |||
""" | |||
logger.info("updating model with new documents") | |||
|
|||
if isinstance(corpus, list): | |||
logger.warning( | |||
"taking plain list as a corpus for LSI;" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make a few fixes:
- Use capital letters/dots as a start/end of the sentences.
- Be careful that your words do not stick together, for example
print(
"hello"
"world"
)
the output is helloworld
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that the logging messages formatted as proper English sentences look better. I have just seen that the rest of the code in this file is always starting the logging message with lower case letter and ends with semicolon and I thought that it's the "gensim code style".
Fixed the problem with words sticking together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing!
Still needs an architectural review by @menshikh-iv , to see whether this style of solution is what we want (consistent, comprehensive).
gensim/models/ldamodel.py
Outdated
"Taking plain list as a corpus for LDA.\n" | ||
"We strongly recommend you to use corpus streaming instead of lists. " | ||
"LDA is processing the corpus in batches and " | ||
"it only loads one batch of documents into memory at once.\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 on newlines in log lines. Makes the log harder to process and parse (missing timestamps, broken line format).
If you must output multiple lines, output them individually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I removed the newlines from log messages and added spaces instead of them.
I think that this change will only annoy our users, we discussed this with @piskvorky and he agrees with me. For this reason, I close this PR, sorry @rasto2211. |
Let me add that checking errors and sanity checks are still very desirable (incorrect input, potential issues etc). |
Encourage the user to use the memory-friendly generator interface
presented in the tutorial on corpus streaming. See the issue #74 .
I have decided to move the warning ldamodel.update and lsimodel.add_documents since both of the methods are called by constructors.