Adding type check for corpus_file argument #2469

saraswatmks · 2019-04-29T23:31:40Z

Add a type check for the corpus_file argument and raise a TypeError if corpus_file is not a string. Fixes #2460.

piskvorky · 2019-04-30T06:20:23Z

gensim/models/doc2vec.py

+
+        # Check the type of corpus_file
+        if not isinstance(corpus_file, string_types):
+            raise TypeError("Parameter corpus_file of train() must be a string (path to a file).")


Can you also show what the received parameter is instead?

Having concrete error messages helps avoid confusion.

mpenkov · 2019-04-30T06:29:33Z

gensim/models/doc2vec.py

@@ -794,6 +794,11 @@ def train(self, documents=None, corpus_file=None, total_examples=None, total_wor

        """
        kwargs = {}
+
+        # Check the type of corpus_file
+        if not isinstance(corpus_file, string_types):


corpus_file may legitimately be None, when documents is not None. This is the reason why some of the unit tests fail. Please have a look at Travis CI.

So, if you go ahead with your proposed check, you need something like:

Suggested change

if not isinstance(corpus_file, string_types):

if corpus_file is not None and not isinstance(corpus_file, string_types):

Also, please add a unit test that stresses your new functionality (pass in a non-string corpus_file, expect a TypeError raised).
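A minimal sketch of the kind of unit test being asked for. The `train` function below is a hypothetical stand-in, not gensim's real method, and it uses `str` where the PR uses `string_types`; only the shape of the test is the point.

```python
import unittest


def train(documents=None, corpus_file=None):
    """Hypothetical stand-in for a model's train(); only the type check is shown."""
    if corpus_file is not None and not isinstance(corpus_file, str):
        raise TypeError("corpus_file must be a string, got %r instead" % (corpus_file,))
    return "trained"


class TestCorpusFileType(unittest.TestCase):
    def test_non_string_corpus_file_raises(self):
        # A plausible mistake: passing the documents as corpus_file.
        with self.assertRaises(TypeError):
            train(corpus_file=[["some", "tokens"]])

    def test_none_corpus_file_is_allowed(self):
        # corpus_file may legitimately be None when documents is given.
        self.assertEqual(train(documents=[["some", "tokens"]]), "trained")
```

The second test guards against the regression described above, where a bare `isinstance` check rejects `corpus_file=None`.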

Next, what is the benefit of raising TypeError here? What happens with the existing code if we do not raise TypeError? What kind of exception does the user see?

I think if we go ahead with this kind of parameter checking, we should do it properly:

Ensure that one of documents or corpus_file is None (they cannot both be non-None)

Ensure that one of documents or corpus_file is not None (they cannot both be None)

If documents is not None, then it must be an iterable

If corpus_file is not None, then it must be a string

Finally, if we go ahead with this, we should apply this consistently everywhere, not just doc2vec (e.g. fasttext has similar issues).
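For illustration, the four checks listed above could be gathered into one helper that every model shares. This is only a sketch: `check_train_inputs` is a hypothetical name, and `str` stands in for `string_types`.

```python
from collections.abc import Iterable


def check_train_inputs(documents=None, corpus_file=None):
    """Hypothetical helper: validate the documents/corpus_file pair."""
    # Both missing: nothing to train on.
    if documents is None and corpus_file is None:
        raise TypeError("Either one of corpus_file or documents value must be provided")
    # Both given: ambiguous input.
    if documents is not None and corpus_file is not None:
        raise TypeError("Both corpus_file and documents may not be provided at the same time")
    # Exactly one is set from here on; check its type.
    if documents is not None and not isinstance(documents, Iterable):
        raise TypeError("documents must be an iterable, got %r instead" % (documents,))
    if corpus_file is not None and not isinstance(corpus_file, str):
        raise TypeError("corpus_file must be a string, got %r instead" % (corpus_file,))
```

Calling the same helper from each model's `train()` would keep the behaviour consistent across doc2vec, fasttext and the rest.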

My question is: is it worth it? @piskvorky

Agreed on the tests. I edited the PR description for context (link to the original issue).

@piskvorky @mpenkov Thanks for the detailed feedback. Working on it.

Also I'd say: while checking whether it's a string may help a bit, if there's checking, it'd make sense to ensure things like illegal or missing paths also generate meaningful error messages. It might be possible to just leverage some existing path-checking/file-existence-checking method, or wrap the failing method(s) in a handler that catches errors and shows a good-enough error message like "Problem with corpus_file value X".

@gojomo so should I check corpus_file for being a valid file path instead of checking for string_type ?

I'm not sure, what covers the most cases with the most straightforward code? A path-test, with the right catches/error-message, might "catch N birds with one stone", pointing the user in the right direction to understand their parameter error with fewer conditionals/lines-of-tests. Or it might complicate things compared to the simple proper-type test. But the motivating case for this change (that users following slightly-older examples would get a confusing error message, due to a new corpus_file parameter) will actually already have been handled by the simple "one or the other but not both" test. So, even the string test isn't strictly required to address the motivating case. My pref would be: do the simplest thing that resolves the motivating case, for sure. If there's an easy/clear bit of extra checking that also makes sense, consider it as well.

gojomo · 2019-05-01T03:12:43Z

gensim/models/doc2vec.py

+        # Check if both documents and corpus_file are not None
+        if corpus_file is not None and documents is not None:
+            raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.")
+


I'd be tempted to combine these two cases with an XOR: if (corpus_file is None) ^ (documents is None):.

I did it this way for readability. Also, I noticed True ^ True is False. If I do it the xor way, this case will get skipped.

Yes, it'd actually need to be if not ((corpus_file is None) ^ (documents is None)): - but then essentially one error message, roughly "supply one or the other but not both", would be fine for either mistake.

Strong -1 on this: hard to read and reason about. Please stick to simpler (and easier to maintain) constructs.

The clearest approach would be not to have a method with such delicate XOR one-but-not-the-other positional parameters - use separate methods with distinct names! But failing that, a one-line XOR, whose next line is an error message with a plain-english description of the constraint, seems pretty simple to me - just accurately reflecting the design choice already made.
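To make the XOR variant concrete, here is a small sketch (the function name `exactly_one_given` is hypothetical). The expression is True only when exactly one argument is supplied, so a single negated test covers both the "neither" and the "both" mistakes.

```python
def exactly_one_given(documents, corpus_file):
    """Return True when exactly one of the two arguments is not None."""
    return (documents is None) ^ (corpus_file is None)


def check(documents=None, corpus_file=None):
    # One condition, one message, for either mistake.
    if not exactly_one_given(documents, corpus_file):
        raise TypeError("Supply either documents or corpus_file, but not both")


# Truth table for the four combinations:
#   neither given  -> False (error)
#   only documents -> True
#   only corpus    -> True
#   both given     -> False (error)
```

Whether this reads better than two explicit `if` statements is exactly the disagreement in this thread; the behaviour is the same either way.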

gojomo · 2019-05-01T03:13:43Z

gensim/models/doc2vec.py

+
+        # Check if documents is not None and iterable but not string type
+        if documents is not None and isinstance(documents, Iterable) and isinstance(documents, string_types):
+            raise TypeError("Documents must be an iterable of list and not a string type.")


I believe the test of is-Iterable and is-string_types is likely redundant, as all strings are iterable.

I think it's also important to not go overboard with the checking. For example, you're making sure it's not a string here. Why not take that further and make sure it's not an int? A float? A file buffer? And so on.

@gojomo I was not sure how to handle this. documents should be an iterable but not a string since, like you said, strings are iterable. I think I got it now: instead of checking for both, checking only for the string type would do.

Supplying a (path) string as documents might sufficiently reveal itself to be an error in other ways - like training that completes almost instantly, with little to no vocabulary. Separately, I'm not sure that all objects that would work because they are effectively "iterable" will actually test positive as being isinstance() Iterable.
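Both points can be checked quickly. The `LegacySequence` class below is a made-up example of an object that works in a `for` loop through the older `__getitem__` protocol yet does not pass the `isinstance(..., Iterable)` test.

```python
from collections.abc import Iterable


class LegacySequence:
    """Iterable only through __getitem__ (it defines no __iter__)."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        # IndexError from the list ends the iteration.
        return self._items[index]


docs = LegacySequence([["a", "b"], ["c"]])

string_is_iterable = isinstance("corpus.txt", Iterable)  # True: str defines __iter__
legacy_is_iterable = isinstance(docs, Iterable)          # False: no __iter__ to detect
legacy_items = list(docs)                                # iteration still works
```

So an `isinstance` test against `Iterable` both lets strings through and can reject objects that would have trained fine.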

gojomo · 2019-05-01T03:16:44Z

gensim/models/doc2vec.py

+        # Check the type of corpus_file
+        if corpus_file is not None and not isinstance(corpus_file, string_types):
+            raise TypeError("""Parameter corpus_file of train() must be a
+                             string (path to an existing file) got %s instead.""" % corpus_file)


Since None will already fail the is-string_types test, the check is not None is redundant.

The use of a multiline string is poor here. Your exception message contains two lines, and a lot of unnecessary spaces.

I think you also don't need to mention the function name: it'll be obvious from the stack trace.
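A short sketch of the problem with the triple-quoted message, and one common alternative using adjacent string literals. The value `42` is just a sample bad argument.

```python
corpus_file = 42  # sample invalid value

# Triple-quoted: the newline and the indentation become part of the message.
multiline = """Parameter corpus_file must be a
               string (path to an existing file) got %r instead""" % corpus_file

# Adjacent literals are joined into one string before % is applied.
single_line = (
    "Parameter corpus_file must be a string (path to an existing file), "
    "got %r instead" % corpus_file
)
```

The second form stays within the line-length limit in the source while producing a one-line message.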

@mpenkov noob question: should I resolve the conversations, or does the person who reviews the PR do it?

I think it makes sense for the original reviewer to do it, once they're happy their comments have been addressed.

saraswatmks · 2019-05-02T03:11:42Z

@mpenkov @gojomo Kindly check my latest commit. I've made the changes as suggested. For now, we test four things before training would begin (as suggested):

1. Check if both corpus_file and documents are None.
2. Check if both corpus_file and documents are provided.
3. Check if corpus_file is not a string type.
4. Check if documents is not Iterable.

Explanation of point 4: I check only for being iterable, to handle cases where someone passes an integer, float, or bool value to documents. If someone passes a string, as @gojomo mentioned above, the training will finish super fast with almost no vocabulary.

I think these four cases cover most plausible situations which users might encounter.

mpenkov

I notice that you've made changes to doc2vec only. What about the other modules that suffer from the exact same problem, e.g. fasttext?

mpenkov · 2019-05-02T07:51:09Z

gensim/models/doc2vec.py

+            raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.")
+
+        # Check if corpus_file is string type
+        if documents is None and not isinstance(corpus_file, string_types):


If corpus_file must be a valid path to a file, then why not explicitly check for that, instead of checking the type?

Suggested change

if documents is None and not isinstance(corpus_file, string_types):

if documents is None and not os.path.isfile(corpus_file):

Wouldn't that throw an exception for incorrect (non-string) corpus_file? It's an interesting idea for a try-catch block though.

We could wrap it in a _is_valid_corpus(corpus_file) function for malformed input.

Given that this input validation will be happening in multiple places, it's probably not a bad idea.

@piskvorky @mpenkov I've kept it simple. I used os.path.isfile as suggested. It returns False for all types of invalid input, including non-string / float or whatever trash value a user would pass.

Also, to do this for fasttext, I think it will be better:

To create a separate PR

Wait until this PR is finalised so that I can follow the same code approach there as well.

I'm -1 on creating separate PRs for this, for several reasons:

The changes are tiny. They span several lines of code.

The changes will be identical for each model (e.g. doc2vec, fasttext, etc)

PR overhead (review, merge, release, document) is not worth the effort.

It's salami-slicing.

For your second point, sure, I think it's a good idea to wait until people come to a consensus before expanding your approach to the other models.

@mpenkov Could you please review the latest changes and let me know if we are good to go or if something else needs to be done?

Sure. Left you some nitpicky comments.

In the future, please consider ticking "allow commits from maintainers" when creating a PR on github. That allows the reviewers of your PR to push changes to your branch, potentially shortcutting some of the back-and-forth for minor issues like typos, formatting, etc.

@mpenkov thanks for the suggestion. I believe we are solid now on this PR. I will now make these changes in fasttext module as well.

Yes, please go ahead.

@mpenkov I've done the changes in fasttext module as well. Please have a look.

mpenkov · 2019-05-04T02:44:46Z

gensim/models/doc2vec.py

+
+        # Check if both documents and corpus_file are not None
+        if corpus_file is not None and documents is not None:
+            raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.")


For consistency with the above.

Suggested change

raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.")

raise TypeError("Both corpus_file and documents may not be provided at the same time")

mpenkov · 2019-05-04T02:45:19Z

gensim/models/doc2vec.py

+
+        # Check if both documents and corpus_file are None
+        if corpus_file is None and documents is None:
+            raise TypeError("Either one of corpus_file or documents value must be provided.")


No need to include a period. Please apply this to other messages as well.

Suggested change

raise TypeError("Either one of corpus_file or documents value must be provided.")

raise TypeError("Either one of corpus_file or documents value must be provided")

I tried committing these nitpicks to your branch myself, but I don't have permissions.

mpenkov · 2019-05-04T02:54:26Z

gensim/models/doc2vec.py

+
+        # Check if corpus_file is string type
+        if documents is None and not os.path.isfile(corpus_file):
+            raise TypeError("Parameter corpus_file must be a valid path to a file, got %s instead." % corpus_file)


Suggested change

raise TypeError("Parameter corpus_file must be a valid path to a file, got %s instead." % corpus_file)

raise TypeError("Parameter corpus_file must be a valid path to a file, got %r instead." % corpus_file)

Better than %s, because corpus_file may contain spaces.
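A quick illustration of the difference, using a made-up path with an embedded and a trailing space:

```python
corpus_file = "data/my corpus.txt "  # note the trailing space

with_s = "got %s instead" % corpus_file
with_r = "got %r instead" % corpus_file

print(with_s)  # got data/my corpus.txt  instead
print(with_r)  # got 'data/my corpus.txt ' instead
```

With `%r` the quotes delimit the value, so stray whitespace is visible, and non-string values such as `None` or `42` are shown as what they are.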

mpenkov · 2019-05-05T00:18:35Z

gensim/models/fasttext.py

@@ -901,6 +903,23 @@ def train(self, sentences=None, corpus_file=None, total_examples=None, total_wor
            >>> model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

        """
+
+        # Check if both sentences and corpus_file are None


Please remove these comments. They add no value to the source code.

mpenkov

Looks good to me

mpenkov · 2019-05-05T13:48:15Z

It looks like all comments have been addressed, so I'm merging this in.

@saraswatmks Congrats on your first contribution to gensim 🥇 Thank you!

adding type check for corpus_file argument

40989ba

piskvorky requested changes Apr 30, 2019

piskvorky changed the title ~~Fix #2460: Adding type check for corpus_file argument~~ Adding type check for corpus_file argument Apr 30, 2019

mpenkov requested changes Apr 30, 2019

saraswatmks added 4 commits April 30, 2019 20:33

fixes to handle different typeerror in train parameters, adding unittests

ad76b83

adding doc2vec with more typeerror checks

e1f32a2

fixing lint errors

a9eeddf

removing f-string use

6639089

gojomo reviewed May 1, 2019

saraswatmks added 6 commits May 1, 2019 08:07

fixes as suggested

2ca51ca

remove unused imports

6458aea

using xor as suggested

bc9dce6

minor fixes

97d5619

only check for iterable

cdaff9f

minor fix - 2

b046292

mpenkov requested changes May 2, 2019

checking corpus_file path, removing xor

298e8f0

mpenkov reviewed May 4, 2019

mpenkov added the bugfix label May 4, 2019

saraswatmks added 3 commits May 4, 2019 07:48

fixing nitpiks

4fb01f4

parameters check in fasttext module

07a7d5d

extra space fix

8821b92

mpenkov reviewed May 5, 2019

remove comments

bfc4360

mpenkov approved these changes May 5, 2019

mpenkov merged commit 40792c6 into piskvorky:develop May 5, 2019

saraswatmks deleted the fix_2460 branch May 6, 2019 18:58
