
[MRG] Optimize Native unsupervised FastText #1742

Merged — 24 commits, Dec 7, 2017

Conversation

@manneshiva (Contributor) commented Nov 27, 2017

Optimizes and speeds up the native FastText pure-Python implementation by Cythonizing it.

@@ -9,77 +9,83 @@
from gensim.models.word2vec import Word2Vec, train_sg_pair, train_cbow_pair
from gensim.models.wrappers.fasttext import FastTextKeyedVectors
from gensim.models.wrappers.fasttext import FastText as Ft_Wrapper, compute_ngrams, ft_hash
from gensim import matutils
Contributor:
unused import

from gensim.models.fasttext_inner import FAST_VERSION, MAX_WORDS_IN_BATCH

except ImportError:
# failed... fall back to plain numpy (20-80x slower training than the above)
Contributor:
worth logging a warning?
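One way to implement the suggestion, sketched as a minimal example (the `-1` sentinel follows the fallback convention word2vec already uses for its slow path):

```python
import logging

logger = logging.getLogger(__name__)

try:
    # fast Cython path (module name taken from this PR)
    from gensim.models.fasttext_inner import FAST_VERSION
except ImportError:
    # failed... fall back to plain numpy (20-80x slower training),
    # but tell the user about it instead of degrading silently
    logger.warning(
        "Cython extension gensim.models.fasttext_inner not available; "
        "falling back to the slow pure-Python implementation."
    )
    FAST_VERSION = -1  # sentinel for "no fast version", as in word2vec
```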

@@ -89,12 +95,10 @@ def __init__(self, sentences=None, sg=0, hs=0, size=100, alpha=0.025, window=5,
if self.word_ngrams <= 1 and self.max_n == 0:
self.bucket = 0

super(FastText, self).__init__(
sentences=sentences, size=size, alpha=alpha, window=window, min_count=min_count,
super(FastText, self).__init__(sentences=sentences, size=size, alpha=alpha, window=window, min_count=min_count,
Contributor:
PEP8: Arguments on first line forbidden when not using vertical alignment.
https://www.python.org/dev/peps/pep-0008/#indentation
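For illustration, both PEP8-compliant styles on a toy function (names and defaults are made up, not from the PR):

```python
# Vertical alignment (allowed): continuation lines line up with the
# first argument on the opening line.
def train_aligned(sentences=None, size=100, alpha=0.025,
                  window=5, min_count=5):
    return size

# Hanging indent (also allowed): nothing after the opening parenthesis,
# arguments on their own indented lines.
def train_hanging(
        sentences=None, size=100, alpha=0.025, window=5, min_count=5):
    return size
```

What PEP8 forbids is the hybrid: arguments on the opening line followed by continuation lines that are not vertically aligned with them.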

sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=MAX_WORDS_IN_BATCH):
def __init__(
self, sentences=None, sg=0, hs=0, size=100, alpha=0.025, window=5, min_count=5,
max_vocab_size=None, word_ngrams=1, loss='ns', sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
Contributor:
The loss parameter is unused.

"You cannot do an online vocabulary-update of a model which has no prior vocabulary. "
"First build the vocabulary of your model with a corpus before doing an online update."
)
raise RuntimeError("You cannot do an online vocabulary-update of a model which has no prior vocabulary. "
Contributor:
incorrect indentation

@@ -245,4 +234,4 @@ def load_fasttext_format(cls, *args, **kwargs):

def save(self, *args, **kwargs):
kwargs['ignore'] = kwargs.get('ignore', ['syn0norm', 'syn0_vocab_norm', 'syn0_ngrams_norm'])
super(FastText, self).save(*args, **kwargs)
super(FastText, self).save(*args, **kwargs)
Contributor:
PEP8: No newline at the end of file.

-1.0 / self.vector_size, 1.0 / self.vector_size,
(len(self.wv.vocab) - self.old_vocab_len, self.vector_size)
)
new_vocab_rows = rand_obj.uniform(-1.0 / self.vector_size, 1.0 / self.vector_size, (len(self.wv.vocab) - self.old_vocab_len, self.vector_size))
Contributor:
What's the point of reformatting this and the other lines? It makes the lines too long and the code is harder to read.

return 2

FAST_VERSION = init() # initialize the module
MAX_WORDS_IN_BATCH = MAX_SENTENCE_LEN
Contributor:
No newline at the end of file.

Contributor Author:
Some weird bug is causing the newline at the end of the .pyx files not to be reflected, even though my local commit shows it. The same happens with word2vec_inner.pyx.

# coding: utf-8

import cython
import numpy as np
Contributor:
I'm new to Cython, but isn't this redundant given the cimport below?

Contributor Author:
The import gives access to numpy's Python functions, while the cimport lets us use its C-level modules.


DEF MAX_SENTENCE_LEN = 10000
DEF MAX_SUBWORDS = 1000
from word2vec import FAST_VERSION
Contributor:
What exactly is the point of this import?

for m in range(j, k):
if m == i:
continue
else:
Contributor:
else redundant. Better to leave it out. You save one level of indentation.
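The suggested simplification, sketched on a toy version of the loop (function names are made up for illustration):

```python
def neighbors_with_else(i, j, k):
    out = []
    for m in range(j, k):
        if m == i:
            continue
        else:              # redundant: `continue` already skipped ahead
            out.append(m)
    return out

def neighbors_flat(i, j, k):
    out = []
    for m in range(j, k):
        if m == i:
            continue
        out.append(m)      # same behavior, one level of indentation saved
    return out
```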

our_saxpy(&size, &ONEF, &syn0_ngrams[subwords_idx[m][d] * size], &ONE, neu1, &ONE)

if count > (<REAL_t>0.5):
inv_count = ONEF/count
Contributor:
Please put spaces between operators. Here and elsewhere.

for m in range(j,k):
if m == i:
continue
else:
Contributor:
else redundant

for m in range(j, k):
if m == i:
continue
else:
Contributor:
else redundant

if hs:
fast_sentence_sg_hs(points[j], codes[j], codelens[j], syn0_vocab, syn0_ngrams, syn1, size, subwords_idx[i], subwords_idx_len[i], _alpha, work, l1, word_locks_vocab, word_locks_ngrams)
if negative:
next_random = fast_sentence_sg_neg(negative, cum_table, cum_table_len, syn0_vocab, syn0_ngrams, syn1neg, size, indexes[j], subwords_idx[i], subwords_idx_len[i], _alpha, work, l1, next_random, word_locks_vocab, word_locks_ngrams)
Contributor:
I don't have a hard limit for line length, but this is way too long. IMO it's a good idea to keep lines within 100 chars. Going a little over may be OK where a line is difficult to break, but that's clearly not the case here.

if count > (<REAL_t>0.5):
inv_count = ONEF/count
if cbow_mean:
sscal(&size, &inv_count, neu1, &ONE) # (does this need BLAS-variants like saxpy?)
Contributor:
I can see these comments and the TODOs are copied over from word2vec_inner. Is now a good time to address them?

from gensim.models.fasttext_inner import train_batch_sg, train_batch_cbow
from gensim.models.fasttext_inner import FAST_VERSION, MAX_WORDS_IN_BATCH

except ImportError:
Contributor:
The problem with this solution is that it makes it impossible to test the Python native implementation. There's a number of ways you could fix this. The first thing that comes into my mind is this:

  • add another module fasttext_native or fasttext_legacy and define the native version of train_batch_sg, train_batch_cbow, FAST_VERSION, MAX_WORDS_IN_BATCH there
  • import as follows:
try:
    from gensim.models.fasttext_inner import train_batch_sg, train_batch_cbow, FAST_VERSION, MAX_WORDS_IN_BATCH
except ImportError:
    # log warning
    from gensim.models.fasttext_native import train_batch_sg, train_batch_cbow, FAST_VERSION, MAX_WORDS_IN_BATCH
  • add two new params to FastText.__init__() for train_batch_sg and train_batch_cbow functions. The default values would be the imported functions.

To test the native/legacy implementation, you would then be able to initialize FastText as follows:

from gensim.models.fasttext import FastText
from gensim.models.fasttext_native import MAX_WORDS_IN_BATCH, train_batch_sg, train_batch_cbow

ft = FastText(..., batch_words=MAX_WORDS_IN_BATCH, train_batch_sg=train_batch_sg, train_batch_cbow=train_batch_cbow)
# test ft

I don't insist on this particular solution. There may be more elegant ways. My main concern is that the native implementation should also be testable and covered by unit tests (just parametrize existing unit tests to test both the native and Cython implementation).
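The injection idea above can be sketched in miniature (all names here are hypothetical stand-ins, not the PR's actual code): the model takes its batch-training callable as a constructor parameter, defaulting to whichever implementation imported successfully, so tests can pass in the pure-Python version explicitly.

```python
# Stand-ins for the Cython and pure-Python implementations.
def train_batch_sg_fast(model, sentences, alpha):
    return "fast"

def train_batch_sg_native(model, sentences, alpha):
    return "native"

class ToyFastText:
    def __init__(self, train_batch_sg=train_batch_sg_fast):
        # Default is the fast path; tests may inject the native one.
        self._train_batch_sg = train_batch_sg

    def train(self, sentences, alpha=0.025):
        return self._train_batch_sg(self, sentences, alpha)

default_model = ToyFastText()                                  # fast path
legacy_model = ToyFastText(train_batch_sg=train_batch_sg_native)  # injected
```

This keeps both code paths reachable from the same unit tests by parametrizing over the injected function.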

Owner:
Native Python implementations will be removed in the next refactor. We're not going to support them any more, now that @menshikh-iv can build good wheels.

Contributor:
I hope this will be true @piskvorky; we need to distribute our wheels for several releases in a row first, and if all goes well we can remove the Python parts (but this process is not fast, so I suggest not rushing it).

Contributor (@menshikh-iv, Nov 29, 2017):

Owner (@piskvorky, Nov 29, 2017):

What I mean is there's no point introducing new constructs for testing both versions, when we know we only want one version.

This module allows training a word embedding from a training corpus with the additional ability
to obtain word vectors for out-of-vocabulary words.

For a tutorial on gensim's native fasttext, refer the noteboook --
Contributor:
refer TO the notebook

@@ -1,5 +1,27 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2013 Radim Rehurek <me@radimrehurek.com>
Owner:
That doesn't seem right -- and FastText didn't even exist yet in 2013 :)

@manneshiva You can use this block:

# Authors: Shiva Manne <s.manne@rare-technologies.com>
# Copyright (C) 2017 RaRe Technologies s.r.o.

@janpom (Contributor) commented Dec 4, 2017:

Looks good to me.

Contributor @menshikh-iv left a comment:

Please be accurate with docstrings; we want clear, consistent (numpy-style) docstrings across the whole codebase.


def train_batch_cbow(model, sentences, alpha, work=None, neu1=None):
"""
Update CBOW model by training on a sequence of sentences.
Contributor Author:

Sure, will make the changes.

negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000,
trim_rule=None, batch_words=MAX_WORDS_IN_BATCH):
"""
Initialize the model from an iterable of `sentences`. Each sentence is a
Contributor:
please add examples of usage (small executable pieces of code)

@manneshiva changed the title [WIP] Optimize Native unsupervised FastText [MRG] Optimize Native unsupervised FastText Dec 5, 2017
@manneshiva changed the title [MRG] Optimize Native unsupervised FastText [WIP] Optimize Native unsupervised FastText Dec 5, 2017
@manneshiva changed the title [WIP] Optimize Native unsupervised FastText [MRG] Optimize Native unsupervised FastText Dec 5, 2017
Contributor @menshikh-iv left a comment:

Great work @manneshiva 💣 !!

Please fix docstring format and I'll merge this.

max_vocab_size=None, word_ngrams=1, loss='ns', sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0, min_n=3, max_n=6,
sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=MAX_WORDS_IN_BATCH):
"""Class for training, using and evaluating word representations learned using method described in
Contributor:
Please use numpy-style format for docstrings (everywhere).
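For reference, a numpy-style version of the `train_batch_cbow` docstring quoted earlier might look like this (the signature is from the PR; the parameter and return descriptions are plausible assumptions, not the merged text):

```python
def train_batch_cbow(model, sentences, alpha, work=None, neu1=None):
    """Update the CBOW model by training on a sequence of sentences.

    Parameters
    ----------
    model : FastText
        The model instance to train.
    sentences : iterable of list of str
        The corpus used to train the model.
    alpha : float
        The current learning rate.
    work : numpy.ndarray, optional
        Private working memory for the worker thread.
    neu1 : numpy.ndarray, optional
        Private working memory for the worker thread.

    Returns
    -------
    int
        Number of words in the sentences actually used for training.

    """
    return 0  # placeholder body for illustration only
```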

@menshikh-iv merged commit d2cb79c into piskvorky:develop Dec 7, 2017