Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding type check for corpus_file argument #2469

Merged
merged 16 commits into from
May 5, 2019

Conversation

saraswatmks
Copy link
Contributor

@saraswatmks saraswatmks commented Apr 29, 2019

Add a type check for corpus_file argument and raise type error if the type of corpus_file is not a string. Fixes #2460.


# Check the type of corpus_file
if not isinstance(corpus_file, string_types):
raise TypeError("Parameter corpus_file of train() must be a string (path to a file).")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also show what the received parameter is instead?

Having concrete error messages helps avoid confusion.

@piskvorky piskvorky changed the title Fix #2460: Adding type check for corpus_file argument Adding type check for corpus_file argument Apr 30, 2019
@@ -794,6 +794,11 @@ def train(self, documents=None, corpus_file=None, total_examples=None, total_wor

"""
kwargs = {}

# Check the type of corpus_file
if not isinstance(corpus_file, string_types):
Copy link
Collaborator

@mpenkov mpenkov Apr 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corpus_file may legitimately be None, when documents is not None. This is the reason why some of the unit tests fail. Please have a look at Travis CI.

So, if you go ahead with your proposed check, you need something like:

Suggested change
if not isinstance(corpus_file, string_types):
if corpus_file is not None and not isinstance(corpus_file, string_types):

Also, please add a unit test that stresses your new functionality (pass in a non-string corpus_file, expect a TypeError raised).

Next, what is the benefit of raising TypeError here? What happens with the existing code if we do not raise TypeError? What kind of exception does the user see?

I think if we go ahead with this kind of parameter checking, we should do it properly:

  1. Ensure that one of documents or corpus_file is None (they cannot both be non-None)
  2. Ensure that one of documents or corpus_file is not None (they cannot both be None)
  3. If documents is not None, then it must be an iterable
  4. If corpus_file is not None, then it must be a string

Finally, if we go ahead with this, we should apply this consistently everywhere, not just doc2vec (e.g. fasttext has similar issues).

My question is: is it worth it? @piskvorky

Copy link
Owner

@piskvorky piskvorky Apr 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on the tests. I edited the PR description for context (link to the original issue).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piskvorky @mpenkov Thanks for the detailed feedback. Working on it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I'd say: while checking whether it's a string may help a bit, if there's checking, it'd make sense to ensure things like illegal or missing paths also generate meaningful error messages. It might be possible to just leverage some existing path-checking/file-existence-checking method, or wrap the failing method(s) in a handler that catches errors and shows a good-enough error message like "Problem with corpus_file value X".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gojomo so should I check corpus_file for being a valid file path instead of checking for string_type ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, what covers the most cases with the most straightforward code? A path-test, with the right catches/error-message, might "catch N birds with one stone", pointing the user in the right direction to understand their parameter error with fewer conditionals/lines-of-tests. Or it might complicate things compared to the simple proper-type test. But the motivating case for this change - that users following slightly-older examples would get a confusing error message, due to a new corpus_file parameter` – will actually already have been handled by the simple "one or the other but not both" test. So, even the string test isn't strictly required to address the motivating case. My pref would be: do the simplest thing that resolves the motivating case, for sure. If there's a easy/clear bit of extra checking that makes also makes sense, consider it as well.

# Check if both documents and corpus_file are not None
if corpus_file is not None and documents is not None:
raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be tempted to combine these two cases with an XOR: if (corpus_file is None) ^ (documents is None):.

Copy link
Contributor Author

@saraswatmks saraswatmks May 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way for readability. Also, I noticed True ^ True is False. If I do it the xor way, this case will get skipped.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it'd actually need to be if not ((corpus_file is None) ^ (documents is None)): - but then essentially one error message, roughly "supply one or the other but not both", would be fine for either mistake.

Copy link
Owner

@piskvorky piskvorky May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong -1 on this: hard to read and reason about. Please stick to simpler (and easier to maintain) constructs .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clearest approach would be not to have a method with such delicate XOR one-but-not-the-other positional parameters - use separate methods with distinct names! But failing that, a one-line XOR, whose next line is an error message with a plain-english description of the constraint, seems pretty simple to me - just accurately reflecting the design choice already made.


# Check if documents is not None and iterable but not string type
if documents is not None and isinstance(documents, Iterable) and isinstance(documents, string_types):
raise TypeError("Documents must be an iterable of list and not a string type.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the test of is-Iterable and is-string_types is likely redundant, as all strings are iterable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's also important to not go overboard with the checking. For example, you're making sure it's not a string here. Why not take that further and make sure it's not an int? A float? A file buffer? And so on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gojomo I was not sure how to go about with handling this. documents should be an iterable
but not a string since like you said strings are iterable. I think I got it now, so instead of checking for both, checking only for string type would do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supplying a (path) string as documents might sufficiently reveal itself to be an error in other ways - like training that completes almost instantly, with little to no vocabulary. Separately, I'm not sure that all objects that would work because they are effectively "iterable" will actually test positive as being isinstance() Iterable.

# Check the type of corpus_file
if corpus_file is not None and not isinstance(corpus_file, string_types):
raise TypeError("""Parameter corpus_file of train() must be a
string (path to an existing file) got %s instead.""" % corpus_file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since None will already fail the is-string_types test, the check is not None is redundant.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of a multiline string is poor here. Your exception message contains two lines, and a lot of unnecessary spaces.

I think you also don't need to mention the function name: it'll be obvious from the stack trace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpenkov noob question: should I resolve the conversations or the person who reviews the PR does it ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense for the original reviewer to do it, once they're happy their comments have been addressed.

@saraswatmks
Copy link
Contributor Author

@mpenkov @gojomo Kindly check my latest commit. I've made the changes as suggested. For now, we test four things before training would begin (as suggested):

  1. Check if both corpus_file and documents are None.
  2. Check if both corpus_file and documents are provided.
  3. Check if corpus_file is not a string type.
  4. Check if documents is not Iterable.

Explanation point 4: I check only for being iterable to handle cases when someone would pass integer, float, bool value to documents. In case someone passes string, like @gojomo mentioned above, the training will finish super fast with almost no vocabulary.

I think these four cases cover most plausible situations which users might encounter.

Copy link
Collaborator

@mpenkov mpenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that you've made changes to doc2vec only. What about the other modules that suffer from the exact same problem, e.g. fasttext?

raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.")

# Check if corpus_file is string type
if documents is None and not isinstance(corpus_file, string_types):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If corpus_file must be a valid path to a file, then why not explicitly check for that, instead of checking the type?

Suggested change
if documents is None and not isinstance(corpus_file, string_types):
if documents is None and not os.path.isfile(corpus_file):

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that throw an exception for incorrect (non-string) corpus_file? It's an interesting idea for a try-catch block though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could wrap it in a _is_valid_corpus(corpus_file) function for malformed input.

Given that this input validation will be happening in mulitple places, it's probably not a bad idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piskvorky @mpenkov I've kept it simple. I used os.path.isfile as suggested. It returns False at all type of invalid inputs including non-string / float or whatever trash value a user would pass.

Also, to do this for fasttext, I think it will be better:

  1. To create a separate PR
  2. Wait until the this PR is finalised so that I can follow the same code approach there are well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm -1 on creating separate PRs for this, for several reasons:

  1. The changes are tiny. They span several lines of code.
  2. The changes will be identical for each model (e.g. doc2vec, fasttext, etc)
  3. PR overhead (review, merge, release, document) is not worth the effort.
  4. It's salami-slicing.

For your second point, sure, I think it's a good idea to wait until people come to a consensus before expanding your approach to the other models.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpenkov Could you please review the latest changes and let me know is we are good to go or something else needs to be done ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Left you some nitpicky comments.

In the future, please consider ticking "allow commits from maintainers" when creating a PR on github. That allows the reviewers of your PR to push changes to your branch, potentially shortcutting some of the back-and-forth for minor issues like typos, formatting, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpenkov thanks for the suggestion. I believe we are solid now on this PR. I will now make these changes in fasttext module as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please go ahead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpenkov I've done the changes in fasttext module as well. Please have a look.


# Check if both documents and corpus_file are not None
if corpus_file is not None and documents is not None:
raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the above.

Suggested change
raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.")
raise TypeError("Both corpus_file and documents may not be provided at the same time")


# Check if both documents and corpus_file are None
if corpus_file is None and documents is None:
raise TypeError("Either one of corpus_file or documents value must be provided.")
Copy link
Collaborator

@mpenkov mpenkov May 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to include a period. Please apply this to other messages as well.

Suggested change
raise TypeError("Either one of corpus_file or documents value must be provided.")
raise TypeError("Either one of corpus_file or documents value must be provided")

I tried committing these nitpicks to your branch myself, but I don't have permissions.


# Check if corpus_file is string type
if documents is None and not os.path.isfile(corpus_file):
raise TypeError("Parameter corpus_file must be a valid path to a file, got %s instead." % corpus_file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise TypeError("Parameter corpus_file must be a valid path to a file, got %s instead." % corpus_file)
raise TypeError("Parameter corpus_file must be a valid path to a file, got %r instead." % corpus_file)

Better than %s, because corpus_file may contain spaces.

@mpenkov mpenkov added the bugfix label May 4, 2019
@@ -901,6 +903,23 @@ def train(self, sentences=None, corpus_file=None, total_examples=None, total_wor
>>> model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

"""

# Check if both sentences and corpus_file are None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove these comments. They add no value to the source code.

Copy link
Collaborator

@mpenkov mpenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@mpenkov
Copy link
Collaborator

mpenkov commented May 5, 2019

It looks like all comments have been addressed, so I'm merging this in.

@saraswatmks Congrats on your first contribution to gensim 🥇 Thank you!

@mpenkov mpenkov merged commit 40792c6 into piskvorky:develop May 5, 2019
@saraswatmks saraswatmks deleted the fix_2460 branch May 6, 2019 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants