Re-design "*2vec" implementations #1777

Merged
119 commits merged into piskvorky:develop on Feb 1, 2018

Conversation

@manneshiva (Contributor) commented Dec 11, 2017

This PR aims to improve the current "*2vec" implementation design (Doc2Vec, Word2Vec, FastText, Sent2Vec, Poincare), making it more modular and ensuring maximum code re-use.
Link summarizing this PR.

@manneshiva (Contributor, Author) commented Dec 11, 2017

The first commit contains the initial design to help facilitate discussion. It outlines our present thoughts about the design and is in no way complete. @janpom, image below for your reference:
[image: img_20171211_142132059]

@gojomo (Collaborator) commented Dec 13, 2017

Simply extracting/clarifying the existing implementation could offer value, but it is also less likely to clear the way for all-new algorithmic variations or optimizations. For example, a very generic "2Vec" model might not even have the idea of a 'vocabulary', or a step within itself that builds a vocabulary... as opposed to just taking corpora or configuration prepped by other objects.

That is, this sort of incremental refactoring starting from the current design might not offer the same reuse/efficiency/flexibility benefits for #1623 that a from-scratch design could give.
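
To make the distinction concrete, here is a minimal sketch of that kind of decomposition (all names below are hypothetical, not the gensim API): vocabulary preparation lives in its own object, and the model merely consumes whatever prepped inputs it is handed.

# Hypothetical sketch only: a vocabulary built outside the model ...
class Vocabulary(object):
    def __init__(self, corpus, min_count=5):
        self.counts = {}
        for sentence in corpus:
            for token in sentence:
                self.counts[token] = self.counts.get(token, 0) + 1
        self.index = {token: i for i, token in enumerate(
            t for t, c in sorted(self.counts.items()) if c >= min_count)}

# ... and a generic model that only takes already-prepped objects/configuration.
class GenericAny2Vec(object):
    def __init__(self, vocabulary, vector_size=100):
        self.vocabulary = vocabulary
        self.vector_size = vector_size

vocab = Vocabulary([["hello", "world"], ["hello", "gensim"]], min_count=1)
model = GenericAny2Vec(vocab)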

@manneshiva (Contributor, Author) commented:

@gojomo Agreed. These first few commits were only meant to facilitate discussion in our meetings. The idea is to sketch the design in a top-down fashion: first fix the public APIs that are to be exposed to users, then refactor everything else accordingly. This will involve designing a few classes from scratch; at the same time, we will try to reuse existing classes/functions by making them more generic. The current intention is to maintain backward compatibility (in terms of end-user APIs) unless some design change makes it absolutely necessary to break those APIs. The scope of this PR may not cover every point in #1623, but it is definitely a big step in that direction and will make it easier to incorporate many of the listed points in the future.
Current status: we have finalized the public interfaces. We also discussed providing callback functionality in a way similar to Keras. You should be able to see these commits soon. Your review/comments would be helpful.
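
For readers unfamiliar with the Keras-style hooks being discussed, a rough sketch of such a callback interface might look like the following (hypothetical names; the interface that eventually landed may differ):

class TrainingCallback(object):
    """Base class: subclasses override only the hooks they care about."""
    def on_epoch_begin(self, model):
        pass

    def on_epoch_end(self, model):
        pass

class EpochLogger(TrainingCallback):
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        self.epoch += 1
        print("finished epoch %d" % self.epoch)

# The model would then invoke every registered callback at the matching point
# of training, e.g. calling cb.on_epoch_end(self) after each epoch.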

@gojomo (Collaborator) commented Dec 14, 2017

I'll keep an eye out for reviewable updates. I would suggest starting from a long, weakly-prioritized list of potential new functionality, then culling that to the things that can be usefully merged. (That is, work more bottom-up from the specifics of novel functions/combinations.) Starting from top-level, end-user-visible interfaces, and maintaining compatibility with the (somewhat clunky) older interfaces, risks severely limiting what new breakthrough capabilities can be delivered. (If they were easy to fit into the existing API, they could be added incrementally, rather than via a new design.)

logger.debug("worker exiting, processed %i jobs", jobs_processed)

def _job_producer(self, data_iterator, job_queue, cur_epoch=0, total_examples=None, total_words=None):
"""Fill jobs queue using the input `data_iterator`."""

Contributor:
indented extra level


@menshikh-iv (Contributor) left a comment:
Looks good to me.


trained_word_count = 0
raw_word_count = 0
start = default_timer() - 0.00001

Contributor:
What is this magic constant for?
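
For what it's worth, the offset is presumably just a guard against a zero elapsed time when the effective words/s rate is computed later (an assumption; the diff itself does not explain it). Roughly:

from timeit import default_timer

start = default_timer() - 0.00001  # assumed: tiny head start so elapsed is never exactly 0
trained_word_count = 1000          # example value
elapsed = default_timer() - start
rate = trained_word_count / elapsed  # no ZeroDivisionError even if training finished "instantly"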

# Log overall time
total_elapsed = default_timer() - start
logger.info(
"training on a %i raw words (%i effective words) took %.1fs, %.0f effective words/s",

Contributor:
This is always "words"? Probably "entities" is better for the general case.
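
One way to generalize the message would be to parameterize the logged noun, for example (a hypothetical sketch, not the merged code):

import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

entity_name = "entities"  # e.g. "words" for Word2Vec, "documents" for Doc2Vec
logger.info(
    "training on %i raw %s (%i effective %s) took %.1fs",
    10000, entity_name, 8000, entity_name, 1.5,
)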


@classmethod
def load(cls, fname_or_handle, **kwargs):
    model = super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)

Contributor:
return super(...)
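
That is, assuming nothing else needs to run after the superclass call, the suggestion is presumably to return it directly rather than binding it to a local variable first (a sketch of the change to the snippet above):

@classmethod
def load(cls, fname_or_handle, **kwargs):
    # assumes no further setup is needed on the loaded model
    return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)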

@janpom (Contributor) commented Dec 20, 2017

Some thoughts about the parameter naming. Using a generic term such as entity (meaning word or sentence or document) is potentially confusing for the user. Also, I see little value in keeping the API exactly the same for all algorithms. Don't get me wrong. There's certainly value in keeping the API consistent so that once the user learns how to use Word2Vec, it will be easy for them to use Doc2Vec, but it's not necessary (and probably not even desirable) that the public API is exactly the same including parameter names.

For example, I consider the following:

doc2vec = Doc2Vec(data)
print(doc2vec.kv.similarity(document1=mydoc1, document2=mydoc2))

word2vec = Word2Vec(data)
print(word2vec.kv.similarity(word1="foo", word2="bar"))

better than:

doc2vec = Doc2Vec(data)
print(doc2vec.kv.similarity(entity1=mydoc1, entity2=mydoc2))

word2vec = Word2Vec(data)
print(word2vec.kv.similarity(entity1="foo", entity2="bar"))

Should the different parameter naming be a problem for code reuse, we can always keep the same parameter names in internal methods and define public methods as thin wrappers. For example:

class KeyedVectorsBase(object):
    def _similarity(self, entity1, entity2):
        # calculate and return similarity
        pass

class Word2VecKeyedVectors(KeyedVectorsBase):
    def similarity(self, word1, word2):
        return self._similarity(word1, word2)

class Doc2VecKeyedVectors(KeyedVectorsBase):
    def similarity(self, document1, document2):
        return self._similarity(document1, document2)

    return model.trainables.alpha


def _update_job_params(model, job_params, progress, cur_epoch):

Contributor:

There was probably some misunderstanding about "functions preferred over methods". They are preferred, but only where it makes sense. Where the function parameter is the model, it makes more sense to keep it as a method, especially if the function modifies the model data, as is the case here.

Out of the functions in this module, I would only keep _raw_word_count() as a function. The other ones sound more like methods to me.

Should _raw_word_count() be the only function left in this module, it's not even worth having the module, especially if the function is a one-liner.

Also, if the functions are to be called from other modules, they shouldn't be private (no leading underscore).
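
To illustrate the distinction being made, a contrast of the two styles (class and parameter names below are hypothetical, for illustration only):

# As a module-level function, the model is just another parameter ...
def update_job_params(model, job_params, progress, cur_epoch):
    model.alpha = job_params["next_alpha"]  # 'next_alpha' is a made-up key
    return model.alpha

# ... whereas as a method, the model owns and mutates its own state.
class Word2VecLike(object):
    def __init__(self, alpha=0.025):
        self.alpha = alpha

    def _update_job_params(self, job_params, progress, cur_epoch):
        self.alpha = job_params["next_alpha"]
        return self.alpha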

@menshikh-iv merged commit 916e423 into piskvorky:develop on Feb 1, 2018
@menshikh-iv mentioned this pull request on Feb 5, 2018
sj29-innovate pushed a commit to sj29-innovate/gensim that referenced this pull request Feb 21, 2018
* first design draft

* adds public interfaces

* adds VocabItem and cleans BaseKeyedVectors

* adds explicit parameters

* implements `train` and adds `Callback` functionality

* refactors `train`, adds classes for vocabulary building and trainable weights.

* changes function parameters

* fixes minor errors

* starts refactoring `Word2Vec` based on new design

* removes `build_vocab_from_freq`, corrects `reset_from`

* changes attribute names

* adds saving/loading from word2vec format

* refactors/renames variables based on new design

* fixes **not** storing normalized vectors and recalculable tables

* replaces `syn0` with `vectors`, adds `estimate_memory`

* fixes indents

* starts `FastText` refactoring based on new design

* refactors to call common methods from `word2vec_utils`, removes deprecated methods

* refactors `FastText`

* adds common methods in `word2vec_utils`

* refactors keyedvectors for FT & W2V by creating a common base class

* creates a common base class for Word2Vec and FastText

* deletes word2vec_utils.py

* extracts logging to separate methods

* corrects alpha decay, modifies `_get_thread_working_mem` to support doc2vec

* refactors doc2vec initialization and training

* minor fixes to support doc2vec

* corrects parameter setting while calling `train`

* deletes `callbacks`, fixes alpha setting and degradation from `train`

* adds post training methods and keyedvectors for docvecs

* extracts common methods as functions, discards unnecessary function call

* shifts adding null word from trainables to vocab class

* unifies variable naming

* moves corpus_count from vocabulary to model attribute

* refactors test cases and corrects failing cases

* removes old import

* fixes errors

* creates separate class for callbacks, adds saving and loss capturing callbacks

* refactors poincare keyedvectors base and related changes

* extracts save/load_word2vec_format as functions to avoid code repetition for word2vec and poincare

* removes model initialization to None

* shifts cum_tables, make_cum_table & create_binary_tree from trainables to vocabulary

* adds fasttext test cases

* adds doc strings for public APIs for D2V, W2V & FT

* adds docstrings for keyedvectors

* resolves failing test cases

* updates cython generated .c files

* corrects error statement when failing to import FAST VERSION

* improves logging

* deletes fasttext wrapper

* fixes PEP8 long lines error

* fixes non-any2vec failing test cases

* deletes testing pure python any2vec implementations from tox

* fixes test_similarities failing test cases

* fixes PEP8 errors

* fixes python3 failing test cases

* renames syn0 to vectors in keras integration test

* fixes annoy notebook failure

* adds property aliases for backward compatibility

* adds properties and methods for backward compatibility

* removes trainables save

* minor changes to test cases

* shifts epoch saver callback to an example in docstring

* adds deleters for syn1 & syn1neg

* deprecates old KeyedVectors in favour of Word2VecKeyedVectors

* reverts word2vec_pre_kv_py2 saved models to original

* adds deprecated models and dependent python files

* adds unit tests for loading old models

* imports deprecated in model.__init__

* removes .wv.most_similar calls

* adds code to support loading old models

* adds cython auto generated .c files

* fixes PEP8 failures & fetching attributes from pre_kv word2vec models

* fixes num_ngram_vectors

* fixes estimate_memory, shifts BaseKeyedVectors to keyedvectors.py

* fixes review comments -- typos, indents, adding deprecated. No design change.

* fixes PEP8

* shifts *KeyedVectors to keyedvectors.py

* de-duplicates data between keyedvectors, vocabulary, trainables and removes data copying

* fixes failing cases

* removes unused vocabulary parameter from methods

* removes base classes for vocabulary & trainables, cleans code

* removes build_vocab from BaseAny2VecModel

* fixes vector size for doc2vec

* Fix typo in classname

* remove docs for fasttext wrapper

* update docstrings for callback

* Fix documentation build

* light cleanup for docstrings

* renames private util_any2vec functions

* adds deprecated warning for attributes

* adds deprecated warnings.warn for old doc2vec parameters

* shifts any2vec callback under gensim/models

* adds pure python implementations

* fixes PEP8 errors

* changes build_vocab method signature

* fixes vocabulary trimming error

* fixes long line

* removes deprecated/utils

* adds old_saveload to deprecated

* removes unused import

* returns fasttext wrapper

* adds alias iter setter

* fixes fasttext load error

* ignores PEP8 unused import

* Return fasttext wrapper rst

* Add rst for deprecated stuff

* Add all needed deprecations, upd *.rst.

* add description for deprecated package

* add missing import + return env var to tox config

* drop useless import

* adds num_ngrams_vectors property

* reverts to calling old attributes in all tests

* fixes PEP8