Re-design "*2vec" implementations #1777

Merged
119 commits merged into piskvorky:develop on Feb 1, 2018

Conversation

@manneshiva (Contributor) commented Dec 11, 2017

This PR aims to improve the current "*2vec" implementation design (Doc2Vec, Word2Vec, FastText, Sent2Vec, Poincare), making it more modular and ensuring maximum code re-use.
Link summarizing this PR.

@manneshiva (Contributor, Author) commented Dec 11, 2017

The first commit contains the initial design to help facilitate discussion. It outlines our present thoughts about the design and is in no way complete. @janpom, image below for your reference:
[image: img_20171211_142132059]

@gojomo (Collaborator) commented Dec 13, 2017

Simply extracting/clarifying the existing implementation could offer value, but it is also less likely to clear the way for all-new algorithmic variations or optimizations. For example, a very generic "2Vec" model might not even have the idea of a 'vocabulary', or a step within itself that builds a vocabulary... as opposed to just taking corpora or configuration prepped by other objects.

That is, this sort of incremental refactoring starting from the current design might not offer the same reuse/efficiency/flexibility benefits for #1623 that a from-scratch design could give.
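
To make the distinction concrete, here is a minimal sketch of that kind of decomposition (all names below are hypothetical, not the gensim API): vocabulary preparation lives in its own object, and the model merely consumes whatever prepped inputs it is handed.

# Hypothetical sketch only: a vocabulary built outside the model ...
class Vocabulary(object):
    def __init__(self, corpus, min_count=5):
        self.counts = {}
        for sentence in corpus:
            for token in sentence:
                self.counts[token] = self.counts.get(token, 0) + 1
        self.index = {token: i for i, token in enumerate(
            t for t, c in sorted(self.counts.items()) if c >= min_count)}

# ... and a generic model that only takes already-prepped objects/configuration.
class GenericAny2Vec(object):
    def __init__(self, vocabulary, vector_size=100):
        self.vocabulary = vocabulary
        self.vector_size = vector_size

vocab = Vocabulary([["hello", "world"], ["hello", "gensim"]], min_count=1)
model = GenericAny2Vec(vocab)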

@manneshiva (Contributor, Author) commented:

@gojomo Agreed. These first few commits were only meant to facilitate discussion in our meetings. The idea is to sketch the design in a top-down fashion: first fix the public APIs that are to be exposed to users, then refactor everything else accordingly. This will involve designing a few classes from scratch; at the same time, we will try to reuse existing classes/functions by making them more generic. The current intention is to maintain backward compatibility (in terms of end-user APIs) unless some design change makes it absolutely necessary to break those APIs. The scope of this PR may not cover every point in #1623, but it is definitely a big step in that direction and will make it easier to incorporate many of the listed points in the future.
Current status: we have finalized the public interfaces. We also discussed providing callback functionality in a way similar to Keras. You should be able to see these commits soon. Your review/comments would be helpful.
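
For readers unfamiliar with the Keras-style hooks being discussed, a rough sketch of such a callback interface might look like the following (hypothetical names; the interface that eventually landed may differ):

class TrainingCallback(object):
    """Base class: subclasses override only the hooks they care about."""
    def on_epoch_begin(self, model):
        pass

    def on_epoch_end(self, model):
        pass

class EpochLogger(TrainingCallback):
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        self.epoch += 1
        print("finished epoch %d" % self.epoch)

# The model would then invoke every registered callback at the matching point
# of training, e.g. calling cb.on_epoch_end(self) after each epoch.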

@gojomo (Collaborator) commented Dec 14, 2017

I'll keep an eye out for reviewable updates. I would suggest starting from a long, weakly-prioritized list of potential new functionality, then culling that to the things that can be usefully merged. (That is, work more bottom-up from the specifics of novel functions/combinations.) Starting from top-level, end-user-visible interfaces, and maintaining compatibility with the (somewhat clunky) older interfaces, risks severely limiting what new breakthrough capabilities can be delivered. (If they were easy to fit into the existing API, they could be added incrementally, rather than via a new design.)

logger.debug("worker exiting, processed %i jobs", jobs_processed)

def _job_producer(self, data_iterator, job_queue, cur_epoch=0, total_examples=None, total_words=None):
"""Fill jobs queue using the input `data_iterator`."""

Contributor:
indented extra level


@menshikh-iv (Contributor) left a comment:
Looks good to me.


trained_word_count = 0
raw_word_count = 0
start = default_timer() - 0.00001

Contributor:
What is this magic constant for?
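
For what it's worth, the offset is presumably just a guard against a zero elapsed time when the effective words/s rate is computed later (an assumption; the diff itself does not explain it). Roughly:

from timeit import default_timer

start = default_timer() - 0.00001  # assumed: tiny head start so elapsed is never exactly 0
trained_word_count = 1000          # example value
elapsed = default_timer() - start
rate = trained_word_count / elapsed  # no ZeroDivisionError even if training finished "instantly"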

# Log overall time
total_elapsed = default_timer() - start
logger.info(
"training on a %i raw words (%i effective words) took %.1fs, %.0f effective words/s",

Contributor:
This is always "words"? Probably "entities" is better for the general case.
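
One way to generalize the message would be to parameterize the logged noun, for example (a hypothetical sketch, not the merged code):

import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

entity_name = "entities"  # e.g. "words" for Word2Vec, "documents" for Doc2Vec
logger.info(
    "training on %i raw %s (%i effective %s) took %.1fs",
    10000, entity_name, 8000, entity_name, 1.5,
)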


@classmethod
def load(cls, fname_or_handle, **kwargs):
    model = super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)

Contributor:
return super(...)
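
That is, assuming nothing else needs to run after the superclass call, the suggestion is presumably to return it directly rather than binding it to a local variable first (a sketch of the change to the snippet above):

@classmethod
def load(cls, fname_or_handle, **kwargs):
    # assumes no further setup is needed on the loaded model
    return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)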

@janpom (Contributor) commented Dec 20, 2017

Some thoughts about the parameter naming. Using a generic term such as entity (meaning word or sentence or document) is potentially confusing for the user. Also, I see little value in keeping the API exactly the same for all algorithms. Don't get me wrong. There's certainly value in keeping the API consistent so that once the user learns how to use Word2Vec, it will be easy for them to use Doc2Vec, but it's not necessary (and probably not even desirable) that the public API is exactly the same including parameter names.

For example, I consider the following:

doc2vec = Doc2Vec(data)
print(doc2vec.kv.similarity(document1=mydoc1, document2=mydoc2))

word2vec = Word2Vec(data)
print(word2vec.kv.similarity(word1="foo", word2="bar"))

better than:

doc2vec = Doc2Vec(data)
print(doc2vec.kv.similarity(entity1=mydoc1, entity2=mydoc2))

word2vec = Word2Vec(data)
print(word2vec.kv.similarity(entity1="foo", entity2="bar"))

Should the different parameter naming be a problem for code reuse, we can always keep the same parameter names in internal methods and define public methods as thin wrappers. For example:

class KeyedVectorsBase(object):
    def _similarity(self, entity1, entity2):
        # calculate and return similarity
        pass

class Word2VecKeyedVectors(KeyedVectorsBase):
    def similarity(self, word1, word2):
        return self._similarity(word1, word2)

class Doc2VecKeyedVectors(KeyedVectorsBase):
    def similarity(self, document1, document2):
        return self._similarity(document1, document2)

    return model.trainables.alpha


def _update_job_params(model, job_params, progress, cur_epoch):

Contributor:

There was probably some misunderstanding about "functions preferred over methods". They are preferred, but only where it makes sense. Where the function parameter is the model, it makes more sense to keep it as a method, especially if the function modifies the model data, as is the case here.

Out of the functions in this module, I would only keep _raw_word_count() as a function. The other ones sound more like methods to me.

Should _raw_word_count() be the only function left in this module, it's not even worth having the module, especially if the function is a one-liner.

Also, if the functions are to be called from other modules, they shouldn't be private (no leading underscore).
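
To illustrate the distinction being made, a contrast of the two styles (class and parameter names below are hypothetical, for illustration only):

# As a module-level function, the model is just another parameter ...
def update_job_params(model, job_params, progress, cur_epoch):
    model.alpha = job_params["next_alpha"]  # 'next_alpha' is a made-up key
    return model.alpha

# ... whereas as a method, the model owns and mutates its own state.
class Word2VecLike(object):
    def __init__(self, alpha=0.025):
        self.alpha = alpha

    def _update_job_params(self, job_params, progress, cur_epoch):
        self.alpha = job_params["next_alpha"]
        return self.alpha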

@menshikh-iv merged commit 916e423 into piskvorky:develop on Feb 1, 2018
@menshikh-iv mentioned this pull request on Feb 5, 2018
sj29-innovate pushed a commit to sj29-innovate/gensim that referenced this pull request Feb 21, 2018
* first design draft

* adds public interfaces

* adds VocabItem and cleans BaseKeyedVectors

* adds explicit parameters

* implements `train` and adds `Callback` functionality

* refactors `train`, adds classes for vocabulary building and trainable weights.

* changes function parameters

* fixes minor errors

* starts refactoring `Word2Vec` based on new design

* removes `build_vocab_from_freq`, corrects `reset_from`

* changes attribute names

* adds saving/loading from word2vec format

* refactors/renames variables based on new design

* fixes **not** storing normalized vectors and recalculable tables

* replaces `syn0` with `vectors`, adds `estimate_memory`

* fixes indents

* starts `FastText` refactoring based on new design

* refactors to call common methods from `word2vec_utils`, removes deprecated methods

* refactors `FastText`

* adds common methods in `word2vec_utils`

* refactors keyedvectors for FT & W2V by creating a common base class

* creates a common base class for Word2Vec and FastText

* deletes word2vec_utils.py

* extracts logging to separate methods

* corrects alpha decay, modifies `_get_thread_working_mem` to support doc2vec

* refactors doc2vec initialization and training

* minor fixes to support doc2vec

* corrects parameter setting while calling `train`

* deletes `callbacks`, fixes alpha setting and degradation from `train`

* adds post training methods and keyedvectors for docvecs

* extracts common methods as functions, discards unnecessary function call

* shifts adding null word from trainables to vocab class

* unifies variable naming

* moves corpus_count from vocabulary to model attribute

* refactors test cases and corrects failing cases

* removes old import

* fixes errors

* creates separate class for callbacks, adds saving and loss capturing callbacks

* refactors poincare keyedvectors base and related changes

* extracts save/load_word2vec_format as functions to avoid code repetition for word2vec and poincare

* removes model initialization to None

* shifts cum_tables, make_cum_table & create_binary_tree from trainables to vocabulary

* adds fasttext test cases

* adds doc strings for public APIs for D2V, W2V & FT

* adds docstrings for keyedvectors

* resolves failing test cases

* updates cython generated .c files

* corrects error statement when failing to import FAST VERSION

* improves logging

* deletes fasttext wrapper

* fixes PEP8 long lines error

* fixes non-any2vec failing test cases

* deletes testing pure python any2vec implementations from tox

* fixes test_similarities failing test cases

* fixes PEP8 errors

* fixes python3 failing test cases

* renames syn0 to vectors in keras integration test

* fixes annoy notebook failure

* adds property aliases for backward compatibility

* adds properties and methods for backward compatibility

* removes trainables save

* minor changes to test cases

* shifts epoch saver callback to an example in docstring

* adds deleters for syn1 & syn1neg

* deprecates old KeyedVectors in favour of Word2VecKeyedVectors

* reverts word2vec_pre_kv_py2 saved models to original

* adds deprecated models and dependent python files

* adds unit tests for loading old models

* imports deprecated in model.__init__

* removes .wv.most_similar calls

* adds code to support loading old models

* adds cython auto generated .c files

* fixes PEP8 failures & fetching attributes from pre_kv word2vec models

* fixes num_ngram_vectors

* fixes estimate_memory, shifts BaseKeyedVectors to keyedvectors.py

* fixes review comments -- typos, indents, adding deprecated. No design change.

* fixes PEP8

* shifts *KeyedVectors to keyedvectors.py

* de-duplicates data between keyedvectors, vocabulary, trainables and removes data copying

* fixes failing cases

* removes unused vocabulary parameter from methods

* removes base classes for vocabulary & trainables, cleans code

* removes build_vocab from BaseAny2VecModel

* fixes vector size for doc2vec

* Fix typo in classname

* remove docs for fasttext wrapper

* update docstrings for callback

* Fix documentation build

* light cleanup for docstrings

* renames private util_any2vec functions

* adds deprecated warning for attributes

* adds deprecated warnings.warn for old doc2vec parameters

* shifts any2vec callback under gensim/models

* adds pure python implementations

* fixes PEP8 errors

* changes build_vocab method signature

* fixes vocabulary trimming error

* fixes long line

* removes deprecated/utils

* adds old_saveload to deprecated

* removes unused import

* returns fasttext wrapper

* adds alias iter setter

* fixes fasttext load error

* ignores PEP8 unused import

* Return fasttext wrapper rst

* Add rst for deprecated stuff

* Add all needed deprecations, upd *.rst.

* add description for deprecated package

* add missing import + return env var to tox config

* drop useless import

* adds num_ngrams_vectors property

* reverts to calling old attributes in all tests

* fixes PEP8