Improve FastText documentation #2353
Conversation
mpenkov commented Jan 25, 2019 (edited)
- Updated fasttext module docstring to include better examples, design documentation
- Improved tutorial Jupyter notebook
doctest complains about vocabulary and training continuation
gensim/models/fasttext.py
Outdated
... ['intelligence'],
... ['artificial', 'intelligence', 'system']
... ]
>>> model.train(sentences, total_examples=len(sentences), epochs=model.epochs)
Where is this `model.epochs` coming from? The model instantiation above shows no such variable.
Yes, the `epochs` parameter is optional, so it's not included during instantiation.
Unfortunately, there is also some confusion about its name: the FastText constructor uses `iter` to specify the number of epochs, whereas the superclass uses the proper name `epochs`.
The presence of the `epochs` parameter to the `train` function (which seems to override the one set in the constructor) also complicates matters.
Hm. If it's optional, let's not use it in `train`. Or if we use it in `train`, let's instantiate it explicitly. This neither-here-nor-there example is confusing ("where does this value come from?").
Regarding `iter` / `epochs` -- can you please rename it to `epochs`, consistently? I remember some discussion around this (cc @menshikh-iv @gojomo), but can't imagine why we'd want both. At most we could support `iter` for a while as an alias, but with a clear deprecation warning.
This is a perfect opportunity to clean up some of the API mess, rather than piling on.
I agree regarding the cleanup. My preference would be to leave epochs/iter out of the constructor. The model doesn't need that parameter until training time.
Models in Gensim generally allow the `trained_model = Constructor(params_including_training_params)` pattern, so breaking that could be confusing to existing users (and a big backward-incompatible change).
I'm not totally opposed though, especially if we still allow ctr params for a while with "deprecated" warnings. The API needs a clean-up, and now is a good time.
Not a big priority though, and the documentation examples can already promote the "instantiate, then train" two-step pattern.
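For reference, that two-step pattern would look roughly like this (a sketch against the current API; the toy corpus and parameter values are made up):

>>> from gensim.models import FastText
>>>
>>> sentences = [['artificial', 'intelligence'], ['machine', 'learning']]
>>> model = FastText(size=4, window=3, min_count=1)  # no corpus passed to the constructor
>>> model.build_vocab(sentences)  # explicit vocabulary scan
>>> model.train(sentences, total_examples=len(sentences), epochs=10)  # explicit training step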
It's not just training parameters you need to include in the constructor. It's also parameters for vocabulary creation. So you're managing at least 3 sets of separate parameters, 2 of which are duplicated by other methods of the class.
Yes, we should promote them as separate steps in docs. Question is, do we deprecate (certainly not remove) them from ctr?
I understand your motivation in not removing them (backward compatibility). Unfortunately, the current mess won't go away until we remove things like this.
I think the first step should be to deprecate them. After a while, we can remove them, perhaps in time for a major release.
If we want a one-liner way to instantiate and train, we can always write a pure function and promote that. That should make it easier for users to cut over to the cleaner API.
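E.g., a minimal sketch of such a helper (hypothetical; the name and defaults are made up):

def train_fasttext(sentences, epochs=5, **kwargs):
    """Instantiate, build the vocabulary, and train, all in one call (illustrative only)."""
    model = FastText(**kwargs)
    model.build_vocab(sentences)
    model.train(sentences, total_examples=len(sentences), epochs=epochs)
    return model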
Yes, deprecation is what I suggest.
gensim/models/fasttext.py
Outdated
... ]
>>> 'rocket' in model.wv
False
>>> model.train(new_sentences, total_examples=len(sentences), epochs=model.epochs)
Why `total_examples=len(sentences)`? Even if correct, it looks strange… would be my first question as a user.
Let's provide the answer here in the docs: why `total_examples` is needed, and what its role is.
It's correct. I agree that it's confusing. The docstring for the train function attempts to clarify the situation.
Personally, I think if neither total_examples nor total_words is specified, we should try to determine sensible defaults by looking at e.g. len(sentences). WDYT @menshikh-iv ?
Are you sure? I read the linked docs and still don't get why it's not `len(new_sentences)`.
Please include some top-level intuition here: a short sentence on why this parameter is mandatory and what its value should be, because it looks really strange and superfluous. +1 for sensible defaults.
For `train()` to manage `alpha` correctly, and show meaningful progress estimates, it needs a good estimate of the size of the supplied corpus -- even when the corpus (as an iterable) may not self-report its length. In the typical case where the same corpus was just surveyed for its vocabulary, this value should be handy. (In the Word2Vec case, the count from the vocab-scan is cached inside the model for later consultation -- unsure if the FT paths do this.) In the case where other/new data is being supplied, the caller should supply the right counts for the current data.
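For intuition, the alpha management needs a progress estimate along these lines (a rough sketch, not gensim's actual code):

def effective_alpha(start_alpha, min_alpha, examples_seen, total_examples):
    # Linearly decay the learning rate as training progresses through the
    # corpus; an unreliable total_examples skews both this schedule and the
    # progress logging.
    progress = min(1.0, examples_seen / float(total_examples))
    return start_alpha - (start_alpha - min_alpha) * progress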
@gojomo If the corpus does self-report its length though, should we use that instead? If yes, which should we do:
- total_examples = len(corpus)
- total_words = len(corpus)
If the corpus does not self-report its length, then we could raise an exception with a helpful message. WDYT?
If it's able to self-report its length, in count of texts, then yes, that would work as `total_examples`. But because streaming a corpus from an iterable is the motivating case for this interface, relying on that inside the method seems inappropriate to me. Leaving it the responsibility of the caller seems fine to me -- and if they're lucky enough to have a corpus that reports its own length, they can supply it easily.
Also, while we're talking about simplifying the API, what do you think about removing the sentences and corpus_file parameters from the constructor? Currently, we have an inconsistency: in the constructor, we just pass sentences/corpus_file without total_examples and total_words parameters. In the train function, we include those additional parameters.
Instead of passing sentences in the constructor, the user can pass them in separately via the train function.
Pros:
- Simpler constructor. There will be fewer parameters.
- An easier-to-understand, more consistent API.
Cons:
- Getting a trained model now takes two steps instead of one (instantiate, train). Not sure how much of a con this is given that the train function is more powerful than the constructor anyway - it has parameters that the constructor doesn't.
@menshikh-iv @piskvorky @gojomo What do you think?
The same thing also applies to the callbacks parameters.
> But because streaming a corpus from an iterable is the motivating case for this interface, relying on that inside the method seems inappropriate to me.
@gojomo Why do you think it is inappropriate? We could do something like this:
if total_examples or total_words:
    pass  # nothing to do here
elif sentences and hasattr(sentences, '__len__'):  # could also check for callable, if necessary
    total_examples = len(sentences)
elif data_corpus and hasattr(data_corpus, '__len__'):
    total_examples = len(data_corpus)
else:
    raise ValueError(
        'unable to infer total_examples or total_words from the training source, '
        'please pass one of them explicitly'
    )
It looks ugly, but it allows the user to do something like:
model.train(sentences)
instead of
model.train(sentences, total_examples=len(sentences))
I feel the former is more Pythonic.
Finally, I think having two separate keyword parameters for the input is confusing for the user. In my opinion, it would look a lot simpler if we unified the two parameters, and dealt with untangling them in the implementation.
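E.g., a hypothetical dispatch helper (the name and the exact rules are made up):

def _normalize_corpus(corpus):
    # A string is treated as a path on disk (the corpus_file case);
    # anything else is assumed to be an iterable of tokenized sentences.
    if isinstance(corpus, str):
        return {'corpus_file': corpus}
    return {'sentences': corpus}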
@mpenkov yes, I'd consider that "sensible" defaults. Thanks.
Agreed on unifying `iter`/`epochs`. How would that work? Keep both as acceptable input (internally unify to one), no deprecations? Deprecate one? Which one?
@piskvorky OK. I think it's worth dealing with API refactoring in a separate PR, for two reasons:
- Such changes risk introducing regressions, and I'd rather not have them together with the relatively safe documentation changes.
- We've already spent much effort refactoring FastText, and it may be prudent to wait for things to stabilize (e.g. fix introduced regressions) before proceeding.
To answer your question, I think it makes sense to deprecate `iter` in the constructor. It's a poor name for a parameter, for three reasons:
- It masks the built-in `iter` function.
- It isn't obvious.
- It's inconsistent: we use `epochs` everywhere else outside of the constructor.
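A sketch of how the deprecation could work (illustrative only, not the final implementation):

import warnings

def _resolve_epochs(epochs=5, iter=None):
    # Map the deprecated `iter` alias onto `epochs`. Note that `iter` as a
    # parameter name shadows the built-in function, which is part of the problem.
    if iter is not None:
        warnings.warn(
            "the `iter` parameter is deprecated, use `epochs` instead",
            DeprecationWarning,
        )
        return iter
    return epochs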
Good start overall. I note several issues with rendering:
1. :py:class:`Model` should be :class:`~gensim.models._fasttext_bin.Model` -- please always use the full path for any reference (class/method/function).
2. :py:mod:`gensim.models.fasttext` should be :mod:`gensim.models.fasttext`, and :py:class:`FastTextVocab` should be :class:`~gensim.models.FastTextVocab`.
Also, you can always check how the page rendered: just click "Details" near CircleCI and open the "Artifacts" tab, which contains all the rendered HTML, like https://circleci.com/gh/RaRe-Technologies/gensim/2108#artifacts/containers/0
@mpenkov looks good, thanks! Do the docstring examples now cover all the confusion we saw from users (mailing list, github)? Any important workflows left undocumented / under-documented?
It's hard for me to claim we've resolved all the confusion, because I haven't thoroughly read all the feedback from the users. Nevertheless, this PR addresses (or begins to address, owing to the ongoing discussions above) several key problems that we identified in the task description:
@menshikh-iv Have I missed anything?
@piskvorky LGTM, wdyt?
gensim/models/fasttext.py
Outdated
>>> total_words = model3.corpus_total_words  # number of words in the corpus
>>> model3.train(corpus_file=corpus_file, total_examples=total_examples, total_words=total_words, epochs=5)

The model needs the `total_examples` and `total_words` parameters in order to
Does the model really need both? IIRC, it only needs one of those, no?
You're right, I updated the example.
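Presumably keeping just one of the two, along these lines (assuming `total_words` is the one the corpus_file path needs):

>>> total_words = model3.corpus_total_words  # number of words in the corpus
>>> model3.train(corpus_file=corpus_file, total_words=total_words, epochs=5)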
gensim/models/fasttext.py
Outdated
... def __iter__(self):
...     with open(datapath('crime-and-punishment.txt')) as fin:
...         for line in fin:
...             yield line.lower().strip().split(" ")
Just `split()` (or we'll be left with newlines and all sorts of unicode "whitespace").
Also, I'd be careful with such examples, since people love to copy-paste (and this is really not a good way to tokenize -- even `gensim.utils.tokenize` is better).
OK, I've improved the example with `gensim.utils.tokenize`.
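Presumably along these lines (a sketch; the exact wording in the docstring may differ):

>>> from gensim import utils
>>>
>>> class MyIter(object):  # subclass object explicitly, for Python 2 compatibility
...     def __iter__(self):
...         with open(datapath('crime-and-punishment.txt')) as fin:
...             for line in fin:
...                 yield list(utils.tokenize(line, lowercase=True))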
gensim/models/fasttext.py
Outdated
.. sourcecode:: pycon

>>> class MyIter:
`MyIter(object)` (we still support Python 2).
OK, done.
This looks great! I wish all PRs were this meticulous (and useful). I have some things I'd want to add to the docs, but it's nitpicking at this point. Probably best if we merge and I'll go over it myself here and there. Nothing critical.
gensim/models/fasttext.py
Outdated
@@ -769,9 +940,17 @@ def __contains__(self, word):
def load_fasttext_format(cls, model_file, encoding='utf8', full_model=True):
    """Load the input-hidden weight matrix from Facebook's native fasttext `.bin` and `.vec` output files.

    By default, this function loads the full model.
    A full model allows continuing training with more data, but also consumes more RAM and takes longer to load.
    If you do not need to continue training and only wish to work with the already-trained embeddings, use `partial=False` for faster loading and to save RAM.
Too long line (should be <120) + `full_model=False`, not `partial=False`. @mpenkov
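i.e., wrapped to fit, something like (my wording):

If you do not need to continue training and only wish to work with the
already-trained embeddings, use `full_model=False` for faster loading and
to save RAM.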
Awesome work, great @mpenkov 🔥