Allow initialization with max_final_vocab in lieu of min_count for gensim.models.Word2Vec. Fix #465 (#1915)

Conversation
@gojomo please review
In addition to Gordon's suggestion - #465 (comment)
gensim/models/word2vec.py
Outdated
@@ -425,7 +425,8 @@ class Word2Vec(BaseWordEmbeddingsModel):
     def __init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
                  max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
                  sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
-                 trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=()):
+                 trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=(),
+                 use_max_vocab=False, max_vocab=None):
Should it be implemented only for word2vec (or for other *2vec models too)?
CC: @gojomo
gensim/models/word2vec.py
Outdated
        self.max_vocab_size = max_vocab_size
        self.min_count = min_count
        self.sample = sample
        self.sorted_vocab = sorted_vocab
        self.null_word = null_word
        self.cum_table = None  # for negative sampling
        self.raw_vocab = None
        self.use_max_vocab = use_max_vocab
There's a problem with backward compatibility, here and above: when you add a new attribute, you should modify the load function for the case when a user loads an old model (without this attribute) with new code (with the new attribute).
Is this where I should make changes?
try:
return super(Word2Vec, cls).load(*args, **kwargs)
except AttributeError:
logger.info('Model saved using code from earlier Gensim Version. Re-loading old model in a compatible way.')
from gensim.models.deprecated.word2vec import load_old_word2vec
return load_old_word2vec(*args, **kwargs)
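The backward-compatibility concern raised above usually comes down to defaulting any attribute an old pickle lacks after loading. Here is a minimal, self-contained sketch of that idea (not gensim's actual load code; the class and helper names are hypothetical):

```python
# Sketch of the backward-compatibility pattern (hypothetical names, not gensim code):
# after loading an old model, default any attribute that newer code expects.

class OldModel:
    """Stands in for a model saved by an earlier release, before the
    new attribute existed."""
    def __init__(self):
        self.min_count = 5  # old models only carried the pre-existing fields


def patch_loaded_model(model, new_attrs):
    """Set defaults for attributes missing from models saved by older code."""
    for name, default in new_attrs.items():
        if not hasattr(model, name):
            setattr(model, name, default)
    return model


model = patch_loaded_model(OldModel(), {"max_final_vocab": None})
print(model.max_final_vocab)  # None
```

The existing attributes of the old model are left untouched; only the missing ones get a safe default, so new code can rely on the attribute always being present.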
gensim/models/word2vec.py
Outdated
@@ -1131,14 +1134,17 @@ def __iter__(self):


 class Word2VecVocab(utils.SaveLoad):
-    def __init__(self, max_vocab_size=None, min_count=5, sample=1e-3, sorted_vocab=True, null_word=0):
+    def __init__(self, max_vocab_size=None, min_count=5, sample=1e-3, sorted_vocab=True, null_word=0,
+                 use_max_vocab=False, max_vocab=None):
No need to add 2 parameters; max_vocab alone is enough.
gensim/models/word2vec.py
Outdated
        if self.max_vocab is not None:
            import operator

            sorted_vocab = sorted(self.raw_vocab.items(), key=operator.itemgetter(1), reverse=True)
This might be clearer if only sorting the keys, and using a lambda – as is already done in sibling method sort_vocab().
gensim/models/word2vec.py
Outdated
            calc_min_count = 0

            for item in sorted_vocab:
                curr_count += item[1]
Each word only counts as 1 in final-vocabulary size, so its actual occurrence count shouldn't be part of any tallying. (If max_vocab=10, you just need to throw out all words with the same or fewer occurrences as the 11th word, sorted_vocab[10].)
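To make this concrete, here is a toy illustration (data and variable names invented for this example, not gensim code) of deriving a count threshold from the word just outside the cap:

```python
# Toy illustration (hypothetical data): with max_vocab=3, every word with the
# same or fewer occurrences than the 4th-ranked word must be discarded.
raw_vocab = {"the": 50, "cat": 20, "sat": 20, "mat": 20, "on": 5, "a": 5}
max_vocab = 3

ranked = sorted(raw_vocab, key=lambda w: raw_vocab[w], reverse=True)
if max_vocab < len(ranked):
    # threshold: one more than the count of the first word outside the cap
    calc_min_count = raw_vocab[ranked[max_vocab]] + 1
else:
    calc_min_count = 1  # cap larger than vocabulary: nothing to trim

kept = [w for w in ranked if raw_vocab[w] >= calc_min_count]
print(calc_min_count)  # 21
print(kept)            # ['the']
```

Note how ties interact with the cap: "cat", "sat" and "mat" all share the count of the word just outside the cap, so all of them fall below the threshold and the surviving vocabulary can end up smaller than max_vocab.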
Oh! I took max_vocab to mean the maximum number of words, whereas it is the maximum number of unique words. I should've realised this sooner!
I thought it would make sense for the user to choose a maximum total number of words they'd like. (Would that be a good idea as another parameter/option for the user?)
gensim/models/word2vec.py
Outdated
                    calc_min_count = item[1]
                else:
                    break
            min_count = calc_min_count
This clobbers any other min_count provided – rather than respecting both min_count and max_vocab if both are supplied. As per my comment in #465: "If both a min_count and max_vocab are specified, they should both be satisfied - which in practice would mean whichever implies the higher min_count."
prepare_vocab() logging & return-value should provide the same visibility into this parameter's effects (including in a dry_run) as is available for min_count.
I have changed the lines to:
if calc_min_count > min_count:
min_count = calc_min_count
"prepare_vocab() logging & return-value should provide the same visibility into this parameter's effects (including in a dry_run) as is available for min_count."

Do you mean I should add comments describing max_vocab, and logging describing the outcome of the max_vocab processing?
This commit moves the code to … The code calculates the … Use the following code to test it: …
varying the … and …
As can be seen, although … Working under the constraint of using …

I have made most of the changes you suggested. When … when …
At higher values of …

If … Also, the logging should indicate the effect of …

@gojomo The problem with … I implemented a simple check for this and now the highest … The previous examples with the same code and logging: …
gensim/models/word2vec.py
Outdated
@@ -1216,12 +1216,20 @@ def prepare_vocab(self, hs, negative, wv, update=False, keep_raw_vocab=False, tr
         sorted_vocab = sorted(sorted_vocab_list, key=lambda word: word[1], reverse=True)
I would still suggest the sorted list be words (keys) only, with counts retrieved via dict-lookups. Constantly accessing the 2nd-item-of-a-tuple, via [1], is less clear about intent. That is:
sorted_vocab = sorted(self.raw_vocab.keys(), key=lambda word: self.raw_vocab[word], reverse=True)
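A small standalone comparison of the two sorting styles (toy data, names invented for illustration) shows why the keys-only version reads better downstream:

```python
# Toy data (hypothetical) contrasting the two sorting approaches.
raw_vocab = {"apple": 4, "banana": 9, "cherry": 2}

# items-based sort forces [0]/[1] tuple indexing in all later code
by_items = sorted(raw_vocab.items(), key=lambda item: item[1], reverse=True)

# keys-only sort keeps later code readable: raw_vocab[word] states the intent
by_keys = sorted(raw_vocab.keys(), key=lambda word: raw_vocab[word], reverse=True)

print(by_keys)                   # ['banana', 'apple', 'cherry']
print([w for w, _ in by_items])  # same order
```

Both produce the same ranking; the difference is purely in how later code retrieves a word's count.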
gensim/models/word2vec.py
Outdated
@@ -1216,12 +1216,20 @@ def prepare_vocab(self, hs, negative, wv, update=False, keep_raw_vocab=False, tr
         sorted_vocab = sorted(sorted_vocab_list, key=lambda word: word[1], reverse=True)

+        if self.max_vocab < len(sorted_vocab):
+            calc_min_count = sorted_vocab[self.max_vocab][1]
+            if sorted_vocab[self.max_vocab][1] != sorted_vocab[self.max_vocab - 1][1]:
There's no need for this if-branch; calc_min_count = self.raw_vocab[sorted_vocab[self.max_vocab] + 1] will always set the threshold to the exact level necessary to eliminate the words at max_vocab and later ranks.
I think you meant: calc_min_count = self.raw_vocab[sorted_vocab[self.max_vocab]] + 1

@gojomo I am sorry for this taking so long and my code being inefficient/unreadable. I am still learning.
No worries! The progress has been good, and I believe the functionality is now correct. So the focus now should be docs/unit-testing/clarity. Specifically: (1) a clear doc-comment explanation of … For maximal clarity-of-the-code, it may also help to draw a bigger distinction in variable names between the user-specified … and …
@gojomo I have introduced … Also added tests.
The travis-ci seems to be failing because of a time out.
@aneesh-joshi I re-ran the test; this happens when one of the tests runs more than 10 minutes (or Travis is stuck).
(The PR title was revised during review: "Allow initialization with max_vocab in lieu of min_count" became "Allow initialization with max_vocab in lieu of min_count for gensim.models.Word2Vec. Fix #465", and finally "Allow initialization with max_final_vocab in lieu of min_count for gensim.models.Word2Vec. Fix #465".)
@gojomo please review the changes. @menshikh-iv I loaded the models both in my version and the pypi version, and the results were the same.
@aneesh-joshi try to load an old model & call …
Hey @menshikh-iv, I will try to explain why: whenever the …

In that function: … This model already has … Thus, when the old model is loaded and this change takes place, … I tested this theory by adding …

This resulted in: …

This was further corroborated by my tests, which were unable to cause any error when loading gensim models of versions …

I will make a commit to remove the check for … Hopefully, the PR will now be merge ready. What do you think, @gojomo?
ping @menshikh-iv
Looks slightly strange to me, I'll check it manually later (to be fully sure).
            calc_min_count = self.raw_vocab[sorted_vocab[self.max_final_vocab]] + 1

        self.effective_min_count = max(calc_min_count, min_count)
        logger.info("max_final_vocab=%d and min_count=%d resulted in calc_min_count=%d, effective_min_count=%d",
I would put this outside the max_final_vocab branch, so effective_min_count is logged the same way even in the simple case of max_final_vocab unset.
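Putting the pieces together, a hedged sketch (assumed shapes and names, not gensim's exact code) of computing effective_min_count so that both parameters are honored, with the log line outside the branch so the simple case is visible too:

```python
# Sketch (hypothetical helper, not gensim's exact implementation) of combining
# max_final_vocab and min_count into one effective threshold, logged in all cases.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def effective_min_count(raw_vocab, min_count=5, max_final_vocab=None):
    calc_min_count = 1  # no cap-derived floor unless max_final_vocab bites
    if max_final_vocab is not None:
        ranked = sorted(raw_vocab, key=lambda w: raw_vocab[w], reverse=True)
        if max_final_vocab < len(ranked):
            # one more than the count of the first word outside the cap
            calc_min_count = raw_vocab[ranked[max_final_vocab]] + 1
    result = max(calc_min_count, min_count)
    # logged unconditionally, even when max_final_vocab is unset
    logger.info(
        "max_final_vocab=%s and min_count=%d resulted in calc_min_count=%d, "
        "effective_min_count=%d", max_final_vocab, min_count, calc_min_count, result,
    )
    return result


print(effective_min_count({"a": 10, "b": 3, "c": 1}, min_count=2, max_final_vocab=1))  # 4
```

Because the final threshold is max(calc_min_count, min_count), whichever constraint implies the higher minimum wins, which is exactly the behavior requested in #465.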
@menshikh-iv was this comment addressed?
If a model was saved from gensim 3.3, what …
Nice catch, @gojomo
I was finally able to generate the error with the 3.3 models and fix it! As for …

This isn't entirely possible as …
@aneesh-joshi good job! You added a backward-compatibility change, but I don't see the needed test (one that fails without this change); please add it too.
Hi @menshikh-iv, the problem is, I cannot add a simple straightforward test for the backward compatibility using something like: …

since this will have no relevance to old models (3.1 and 3.2), since: … The code I wrote: …

comes into effect only for models made in … Thus, the only way I see of adding a test for the above code would be to include a model trained in … If that's ok, I will proceed. What do you think, @gojomo?
Yes, a (tiniest-possible) model that was saved from gensim 3.3.0 would need to be included as test material to be sure models from that version load properly.
@menshikh-iv
@aneesh-joshi LGTM 👍, if @gojomo has no more suggestions, I'll merge it (please let me know, Gordon).
ping @gojomo
Looks good to me! @aneesh-joshi thanks for your persistence!
This is a rough implementation of the feature described in Issue #465.

Test code: …

Output: …