
[MRG] Optimize Native unsupervised FastText #1742

Merged — 24 commits, Dec 7, 2017

Conversation

@manneshiva (Contributor) commented Nov 27, 2017

Optimizes and speeds up the native FastText pure-Python implementation by Cythonizing it.

@@ -9,77 +9,83 @@
from gensim.models.word2vec import Word2Vec, train_sg_pair, train_cbow_pair
from gensim.models.wrappers.fasttext import FastTextKeyedVectors
from gensim.models.wrappers.fasttext import FastText as Ft_Wrapper, compute_ngrams, ft_hash
from gensim import matutils
Contributor:
unused import

from gensim.models.fasttext_inner import FAST_VERSION, MAX_WORDS_IN_BATCH

except ImportError:
# failed... fall back to plain numpy (20-80x slower training than the above)
Contributor:
worth logging a warning?
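One way to implement the suggestion, sketched as a minimal example (the `-1` sentinel follows the fallback convention word2vec already uses for its slow path):

```python
import logging

logger = logging.getLogger(__name__)

try:
    # fast Cython path (module name taken from this PR)
    from gensim.models.fasttext_inner import FAST_VERSION
except ImportError:
    # failed... fall back to plain numpy (20-80x slower training),
    # but tell the user about it instead of degrading silently
    logger.warning(
        "Cython extension gensim.models.fasttext_inner not available; "
        "falling back to the slow pure-Python implementation."
    )
    FAST_VERSION = -1  # sentinel for "no fast version", as in word2vec
```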

@@ -89,12 +95,10 @@ def __init__(self, sentences=None, sg=0, hs=0, size=100, alpha=0.025, window=5,
if self.word_ngrams <= 1 and self.max_n == 0:
self.bucket = 0

super(FastText, self).__init__(
sentences=sentences, size=size, alpha=alpha, window=window, min_count=min_count,
super(FastText, self).__init__(sentences=sentences, size=size, alpha=alpha, window=window, min_count=min_count,
Contributor:
PEP8: Arguments on first line forbidden when not using vertical alignment.
https://www.python.org/dev/peps/pep-0008/#indentation
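For illustration, both PEP8-compliant styles on a toy function (names and defaults are made up, not from the PR):

```python
# Vertical alignment (allowed): continuation lines line up with the
# first argument on the opening line.
def train_aligned(sentences=None, size=100, alpha=0.025,
                  window=5, min_count=5):
    return size

# Hanging indent (also allowed): nothing after the opening parenthesis,
# arguments on their own indented lines.
def train_hanging(
        sentences=None, size=100, alpha=0.025, window=5, min_count=5):
    return size
```

What PEP8 forbids is the hybrid: arguments on the opening line followed by continuation lines that are not vertically aligned with them.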

sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=MAX_WORDS_IN_BATCH):
def __init__(
self, sentences=None, sg=0, hs=0, size=100, alpha=0.025, window=5, min_count=5,
max_vocab_size=None, word_ngrams=1, loss='ns', sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
Contributor:
The loss parameter is unused.

"You cannot do an online vocabulary-update of a model which has no prior vocabulary. "
"First build the vocabulary of your model with a corpus before doing an online update."
)
raise RuntimeError("You cannot do an online vocabulary-update of a model which has no prior vocabulary. "
Contributor:
incorrect indentation

@@ -245,4 +234,4 @@ def load_fasttext_format(cls, *args, **kwargs):

def save(self, *args, **kwargs):
kwargs['ignore'] = kwargs.get('ignore', ['syn0norm', 'syn0_vocab_norm', 'syn0_ngrams_norm'])
super(FastText, self).save(*args, **kwargs)
super(FastText, self).save(*args, **kwargs)
Contributor:
PEP8: No newline at the end of file.

-1.0 / self.vector_size, 1.0 / self.vector_size,
(len(self.wv.vocab) - self.old_vocab_len, self.vector_size)
)
new_vocab_rows = rand_obj.uniform(-1.0 / self.vector_size, 1.0 / self.vector_size, (len(self.wv.vocab) - self.old_vocab_len, self.vector_size))
Contributor:
What's the point of reformatting this and the other lines? It makes the lines too long and the code is harder to read.

return 2

FAST_VERSION = init() # initialize the module
MAX_WORDS_IN_BATCH = MAX_SENTENCE_LEN
Contributor:
No newline at the end of file.

Contributor Author:
Some weird bug is causing the newline at the end of the .pyx files not to be reflected, even though my local commit shows it. The same happens with word2vec_inner.pyx.

# coding: utf-8

import cython
import numpy as np
Contributor:
I'm new to Cython, but isn't this redundant given the cimport below?

Contributor Author:
The import gives access to numpy's Python functions, while the cimport lets us use its C-level modules.


DEF MAX_SENTENCE_LEN = 10000
DEF MAX_SUBWORDS = 1000
from word2vec import FAST_VERSION
Contributor:
What exactly is the point of this import?

for m in range(j, k):
if m == i:
continue
else:
Contributor:
else redundant. Better to leave it out. You save one level of indentation.
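The suggested simplification, sketched on a toy version of the loop (function names are made up for illustration):

```python
def neighbors_with_else(i, j, k):
    out = []
    for m in range(j, k):
        if m == i:
            continue
        else:              # redundant: `continue` already skipped ahead
            out.append(m)
    return out

def neighbors_flat(i, j, k):
    out = []
    for m in range(j, k):
        if m == i:
            continue
        out.append(m)      # same behavior, one level of indentation saved
    return out
```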

our_saxpy(&size, &ONEF, &syn0_ngrams[subwords_idx[m][d] * size], &ONE, neu1, &ONE)

if count > (<REAL_t>0.5):
inv_count = ONEF/count
Contributor:
Please put spaces between operators. Here and elsewhere.

for m in range(j,k):
if m == i:
continue
else:
Contributor:
else redundant

for m in range(j, k):
if m == i:
continue
else:
Contributor:
else redundant

if hs:
fast_sentence_sg_hs(points[j], codes[j], codelens[j], syn0_vocab, syn0_ngrams, syn1, size, subwords_idx[i], subwords_idx_len[i], _alpha, work, l1, word_locks_vocab, word_locks_ngrams)
if negative:
next_random = fast_sentence_sg_neg(negative, cum_table, cum_table_len, syn0_vocab, syn0_ngrams, syn1neg, size, indexes[j], subwords_idx[i], subwords_idx_len[i], _alpha, work, l1, next_random, word_locks_vocab, word_locks_ngrams)
Contributor:
I don't have a hard limit for line length, but this is way too long. IMO it's a good idea to keep lines within 100 chars. Going a little over may be OK where a line is difficult to break, but that's clearly not the case here.

if count > (<REAL_t>0.5):
inv_count = ONEF/count
if cbow_mean:
sscal(&size, &inv_count, neu1, &ONE) # (does this need BLAS-variants like saxpy?)
Contributor:
I can see these comments and the TODOs are copied over from word2vec_inner. Is now a good time to address them?

from gensim.models.fasttext_inner import train_batch_sg, train_batch_cbow
from gensim.models.fasttext_inner import FAST_VERSION, MAX_WORDS_IN_BATCH

except ImportError:
Contributor:
The problem with this solution is that it makes it impossible to test the Python native implementation. There's a number of ways you could fix this. The first thing that comes into my mind is this:

  • add another module fasttext_native or fasttext_legacy and define the native version of train_batch_sg, train_batch_cbow, FAST_VERSION, MAX_WORDS_IN_BATCH there
  • import as follows:
try:
    from gensim.models.fasttext_inner import train_batch_sg, train_batch_cbow, FAST_VERSION, MAX_WORDS_IN_BATCH
except ImportError:
    # log warning
    from gensim.models.fasttext_native import train_batch_sg, train_batch_cbow, FAST_VERSION, MAX_WORDS_IN_BATCH
  • add two new params to FastText.__init__() for train_batch_sg and train_batch_cbow functions. The default values would be the imported functions.

To test the native/legacy implementation, you would then be able to initialize FastText as follows:

from gensim.models.fasttext import FastText
from gensim.models.fasttext_native import MAX_WORDS_IN_BATCH, train_batch_sg, train_batch_cbow

ft = FastText(..., batch_words=MAX_WORDS_IN_BATCH, train_batch_sg=train_batch_sg, train_batch_cbow=train_batch_cbow)
# test ft

I don't insist on this particular solution. There may be more elegant ways. My main concern is that the native implementation should also be testable and covered by unit tests (just parametrize existing unit tests to test both the native and Cython implementation).
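The injection idea above can be sketched in miniature (all names here are hypothetical stand-ins, not the PR's actual code): the model takes its batch-training callable as a constructor parameter, defaulting to whichever implementation imported successfully, so tests can pass in the pure-Python version explicitly.

```python
# Stand-ins for the Cython and pure-Python implementations.
def train_batch_sg_fast(model, sentences, alpha):
    return "fast"

def train_batch_sg_native(model, sentences, alpha):
    return "native"

class ToyFastText:
    def __init__(self, train_batch_sg=train_batch_sg_fast):
        # Default is the fast path; tests may inject the native one.
        self._train_batch_sg = train_batch_sg

    def train(self, sentences, alpha=0.025):
        return self._train_batch_sg(self, sentences, alpha)

default_model = ToyFastText()                                  # fast path
legacy_model = ToyFastText(train_batch_sg=train_batch_sg_native)  # injected
```

This keeps both code paths reachable from the same unit tests by parametrizing over the injected function.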

Owner:
Native Python implementations will be removed in the next refactor. We're not going to support them any more, now that @menshikh-iv can build good wheels.

Contributor:
I hope this will be true @piskvorky; we need to distribute our wheels for several releases in a row first, and if all goes well we can remove the Python parts (but this process is not fast, so I suggest not rushing it).

Contributor (@menshikh-iv, Nov 29, 2017):

Owner (@piskvorky, Nov 29, 2017):

What I mean is there's no point introducing new constructs for testing both versions, when we know we only want one version.

This module allows training a word embedding from a training corpus with the additional ability
to obtain word vectors for out-of-vocabulary words.

For a tutorial on gensim's native fasttext, refer the noteboook --
Contributor:
refer TO the notebook

@@ -1,5 +1,27 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2013 Radim Rehurek <me@radimrehurek.com>
Owner:
That doesn't seem right -- and FastText didn't even exist yet in 2013 :)

@manneshiva You can use this block:

# Authors: Shiva Manne <s.manne@rare-technologies.com>
# Copyright (C) 2017 RaRe Technologies s.r.o.

@janpom (Contributor) commented Dec 4, 2017:

Looks good to me.

Contributor @menshikh-iv left a comment:

Please be accurate with docstrings; we want clear, consistent (numpy-style) docstrings across the whole codebase.


def train_batch_cbow(model, sentences, alpha, work=None, neu1=None):
"""
Update CBOW model by training on a sequence of sentences.
Contributor Author:

Sure, will make the changes.

negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000,
trim_rule=None, batch_words=MAX_WORDS_IN_BATCH):
"""
Initialize the model from an iterable of `sentences`. Each sentence is a
Contributor:
please add examples of usage (small executable pieces of code)

@manneshiva changed the title [WIP] Optimize Native unsupervised FastText [MRG] Optimize Native unsupervised FastText Dec 5, 2017
@manneshiva changed the title [MRG] Optimize Native unsupervised FastText [WIP] Optimize Native unsupervised FastText Dec 5, 2017
@manneshiva changed the title [WIP] Optimize Native unsupervised FastText [MRG] Optimize Native unsupervised FastText Dec 5, 2017
Contributor @menshikh-iv left a comment:

Great work @manneshiva 💣 !!

Please fix docstring format and I'll merge this.

max_vocab_size=None, word_ngrams=1, loss='ns', sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0, min_n=3, max_n=6,
sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=MAX_WORDS_IN_BATCH):
"""Class for training, using and evaluating word representations learned using method described in
Contributor:
Please use numpy-style format for docstrings (everywhere).
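For reference, a numpy-style version of the `train_batch_cbow` docstring quoted earlier might look like this (the signature is from the PR; the parameter and return descriptions are plausible assumptions, not the merged text):

```python
def train_batch_cbow(model, sentences, alpha, work=None, neu1=None):
    """Update the CBOW model by training on a sequence of sentences.

    Parameters
    ----------
    model : FastText
        The model instance to train.
    sentences : iterable of list of str
        The corpus used to train the model.
    alpha : float
        The current learning rate.
    work : numpy.ndarray, optional
        Private working memory for the worker thread.
    neu1 : numpy.ndarray, optional
        Private working memory for the worker thread.

    Returns
    -------
    int
        Number of words in the sentences actually used for training.

    """
    return 0  # placeholder body for illustration only
```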

@menshikh-iv merged commit d2cb79c into piskvorky:develop Dec 7, 2017