Allow initialization with max_final_vocab in lieu of min_count for gensim.models.Word2Vec. Fix #465 (#1915)

Conversation
@gojomo please review
In addition to Gordon's suggestion - #465 (comment)
gensim/models/word2vec.py
Outdated
@@ -425,7 +425,8 @@ class Word2Vec(BaseWordEmbeddingsModel):
     def __init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
                  max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
                  sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
-                 trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=()):
+                 trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=(),
+                 use_max_vocab=False, max_vocab=None):
Should it be implemented only for word2vec (or for other *2vec models too)?
CC: @gojomo
gensim/models/word2vec.py
Outdated
        self.max_vocab_size = max_vocab_size
        self.min_count = min_count
        self.sample = sample
        self.sorted_vocab = sorted_vocab
        self.null_word = null_word
        self.cum_table = None  # for negative sampling
        self.raw_vocab = None
        self.use_max_vocab = use_max_vocab
There's a problem with backward compatibility, here and above: when you add a new attribute, you should modify the load function for the case when a user loads an old model (without this attribute) with new code (with the new attribute).
Is this where I should make changes?
try:
return super(Word2Vec, cls).load(*args, **kwargs)
except AttributeError:
logger.info('Model saved using code from earlier Gensim Version. Re-loading old model in a compatible way.')
from gensim.models.deprecated.word2vec import load_old_word2vec
return load_old_word2vec(*args, **kwargs)
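The backward-compatibility concern raised above usually comes down to defaulting any attribute an old pickle lacks after loading. Here is a minimal, self-contained sketch of that idea (not gensim's actual load code; the class and helper names are hypothetical):

```python
# Sketch of the backward-compatibility pattern (hypothetical names, not gensim code):
# after loading an old model, default any attribute that newer code expects.

class OldModel:
    """Stands in for a model saved by an earlier release, before the
    new attribute existed."""
    def __init__(self):
        self.min_count = 5  # old models only carried the pre-existing fields


def patch_loaded_model(model, new_attrs):
    """Set defaults for attributes missing from models saved by older code."""
    for name, default in new_attrs.items():
        if not hasattr(model, name):
            setattr(model, name, default)
    return model


model = patch_loaded_model(OldModel(), {"max_final_vocab": None})
print(model.max_final_vocab)  # None
```

The existing attributes of the old model are left untouched; only the missing ones get a safe default, so new code can rely on the attribute always being present.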
gensim/models/word2vec.py
Outdated
@@ -1131,14 +1134,17 @@ def __iter__(self):


 class Word2VecVocab(utils.SaveLoad):
-    def __init__(self, max_vocab_size=None, min_count=5, sample=1e-3, sorted_vocab=True, null_word=0):
+    def __init__(self, max_vocab_size=None, min_count=5, sample=1e-3, sorted_vocab=True, null_word=0,
+                 use_max_vocab=False, max_vocab=None):
No need to add 2 parameters; max_vocab alone is enough.
gensim/models/word2vec.py
Outdated
        if self.max_vocab is not None:
            import operator

            sorted_vocab = sorted(self.raw_vocab.items(), key=operator.itemgetter(1), reverse=True)
This might be clearer if only sorting the keys, and using a lambda – as is already done in sibling method sort_vocab().
gensim/models/word2vec.py
Outdated
            calc_min_count = 0

            for item in sorted_vocab:
                curr_count += item[1]
Each word only counts as 1 in final-vocabulary size, so its actual occurrence count shouldn't be part of any tallying. (If max_vocab=10, you just need to throw out all words with the same or fewer occurrences as the 11th word, sorted_vocab[10].)
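To make this concrete, here is a toy illustration (data and variable names invented for this example, not gensim code) of deriving a count threshold from the word just outside the cap:

```python
# Toy illustration (hypothetical data): with max_vocab=3, every word with the
# same or fewer occurrences than the 4th-ranked word must be discarded.
raw_vocab = {"the": 50, "cat": 20, "sat": 20, "mat": 20, "on": 5, "a": 5}
max_vocab = 3

ranked = sorted(raw_vocab, key=lambda w: raw_vocab[w], reverse=True)
if max_vocab < len(ranked):
    # threshold: one more than the count of the first word outside the cap
    calc_min_count = raw_vocab[ranked[max_vocab]] + 1
else:
    calc_min_count = 1  # cap larger than vocabulary: nothing to trim

kept = [w for w in ranked if raw_vocab[w] >= calc_min_count]
print(calc_min_count)  # 21
print(kept)            # ['the']
```

Note how ties interact with the cap: "cat", "sat" and "mat" all share the count of the word just outside the cap, so all of them fall below the threshold and the surviving vocabulary can end up smaller than max_vocab.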
Oh! I took max_vocab to mean the maximum number of words, whereas it is the maximum number of unique words. I should've realised this sooner!
I thought it would make sense for the user to choose a maximum total number of words they'd like. (Would that be a good idea as another parameter/option for the user?)
gensim/models/word2vec.py
Outdated
                    calc_min_count = item[1]
                else:
                    break
            min_count = calc_min_count
This clobbers any other min_count provided – rather than respecting both min_count and max_vocab if both are supplied. As per my comment in #465: "If both a min_count and max_vocab are specified, they should both be satisfied - which in practice would mean whichever implies the higher min_count."
prepare_vocab() logging & return-value should provide the same visibility into this parameter's effects (including in a dry_run) as is available for min_count.
I have changed the lines to:
if calc_min_count > min_count:
min_count = calc_min_count
"prepare_vocab() logging & return-value should provide the same visibility into this parameter's effects (including in a dry_run) as is available for min_count."

Do you mean I should add comments describing max_vocab, and logging describing the outcome of the max_vocab processing?
This commit moves the code to … The code calculates the … Use the following code to test it: …
varying the … and …
As can be seen, although … Working under the constraint of using …

I have made most of the changes you suggested. When … when …
At higher values of …

If … Also, the logging should indicate the effect of …

@gojomo The problem with … I implemented a simple check for this and now the highest … The previous examples with the same code and logging: …
gensim/models/word2vec.py
Outdated
@@ -1216,12 +1216,20 @@ def prepare_vocab(self, hs, negative, wv, update=False, keep_raw_vocab=False, tr
         sorted_vocab = sorted(sorted_vocab_list, key=lambda word: word[1], reverse=True)
I would still suggest the sorted list be words (keys) only, with counts retrieved via dict-lookups. Constantly accessing the 2nd-item-of-a-tuple, via [1], is less clear about intent. That is:
sorted_vocab = sorted(self.raw_vocab.keys(), key=lambda word: self.raw_vocab[word], reverse=True)
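A small standalone comparison of the two sorting styles (toy data, names invented for illustration) shows why the keys-only version reads better downstream:

```python
# Toy data (hypothetical) contrasting the two sorting approaches.
raw_vocab = {"apple": 4, "banana": 9, "cherry": 2}

# items-based sort forces [0]/[1] tuple indexing in all later code
by_items = sorted(raw_vocab.items(), key=lambda item: item[1], reverse=True)

# keys-only sort keeps later code readable: raw_vocab[word] states the intent
by_keys = sorted(raw_vocab.keys(), key=lambda word: raw_vocab[word], reverse=True)

print(by_keys)                   # ['banana', 'apple', 'cherry']
print([w for w, _ in by_items])  # same order
```

Both produce the same ranking; the difference is purely in how later code retrieves a word's count.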
gensim/models/word2vec.py
Outdated
@@ -1216,12 +1216,20 @@ def prepare_vocab(self, hs, negative, wv, update=False, keep_raw_vocab=False, tr
         sorted_vocab = sorted(sorted_vocab_list, key=lambda word: word[1], reverse=True)

+        if self.max_vocab < len(sorted_vocab):
+            calc_min_count = sorted_vocab[self.max_vocab][1]
+            if sorted_vocab[self.max_vocab][1] != sorted_vocab[self.max_vocab - 1][1]:
There's no need for this if-branch; calc_min_count = self.raw_vocab[sorted_vocab[self.max_vocab] + 1] will always set the threshold to the exact level necessary to eliminate the words at max_vocab and later ranks.
I think you meant: calc_min_count = self.raw_vocab[sorted_vocab[self.max_vocab]] + 1

@gojomo I am sorry for this taking so long and my code being inefficient/unreadable. I am still learning.
No worries! The progress has been good, and I believe the functionality is now correct. So the focus now should be docs/unit-testing/clarity. Specifically: (1) a clear doc-comment explanation of … For maximal clarity-of-the-code, it may also help to draw a bigger distinction in variable names between the user-specified … and …
@gojomo I have introduced … Also added tests.
The travis-ci seems to be failing because of a time out.
@aneesh-joshi I re-ran the test; this happens when one of the tests runs more than 10 minutes (or Travis is stuck).
(The PR title was revised during review: "Allow initialization with max_vocab in lieu of min_count" became "Allow initialization with max_vocab in lieu of min_count for gensim.models.Word2Vec. Fix #465", and finally "Allow initialization with max_final_vocab in lieu of min_count for gensim.models.Word2Vec. Fix #465".)
@gojomo please review the changes. @menshikh-iv I loaded the models both in my version and the pypi version, and the results were the same.
@aneesh-joshi try to load an old model & call …
Hey @menshikh-iv, I will try to explain why: whenever the …

In that function: … This model already has … Thus, when the old model is loaded and this change takes place, … I tested this theory by adding …

This resulted in: …

This was further corroborated by my tests, which were unable to cause any error when loading gensim models of versions …

I will make a commit to remove the check for … Hopefully, the PR will now be merge ready. What do you think, @gojomo?
ping @menshikh-iv
Looks slightly strange to me, I'll check it manually later (to be fully sure).
            calc_min_count = self.raw_vocab[sorted_vocab[self.max_final_vocab]] + 1

        self.effective_min_count = max(calc_min_count, min_count)
        logger.info("max_final_vocab=%d and min_count=%d resulted in calc_min_count=%d, effective_min_count=%d",
I would put this outside the max_final_vocab branch, so effective_min_count is logged the same way even in the simple case of max_final_vocab unset.
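Putting the pieces together, a hedged sketch (assumed shapes and names, not gensim's exact code) of computing effective_min_count so that both parameters are honored, with the log line outside the branch so the simple case is visible too:

```python
# Sketch (hypothetical helper, not gensim's exact implementation) of combining
# max_final_vocab and min_count into one effective threshold, logged in all cases.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def effective_min_count(raw_vocab, min_count=5, max_final_vocab=None):
    calc_min_count = 1  # no cap-derived floor unless max_final_vocab bites
    if max_final_vocab is not None:
        ranked = sorted(raw_vocab, key=lambda w: raw_vocab[w], reverse=True)
        if max_final_vocab < len(ranked):
            # one more than the count of the first word outside the cap
            calc_min_count = raw_vocab[ranked[max_final_vocab]] + 1
    result = max(calc_min_count, min_count)
    # logged unconditionally, even when max_final_vocab is unset
    logger.info(
        "max_final_vocab=%s and min_count=%d resulted in calc_min_count=%d, "
        "effective_min_count=%d", max_final_vocab, min_count, calc_min_count, result,
    )
    return result


print(effective_min_count({"a": 10, "b": 3, "c": 1}, min_count=2, max_final_vocab=1))  # 4
```

Because the final threshold is max(calc_min_count, min_count), whichever constraint implies the higher minimum wins, which is exactly the behavior requested in #465.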
@menshikh-iv was this comment addressed?
If a model was saved from gensim 3.3, what …
Nice catch, @gojomo
I was finally able to generate the error with the 3.3 models and fix it! As for …

This isn't entirely possible as …
@aneesh-joshi good job! You added a backward-compatibility change, but I don't see the needed test (one that fails without this change); please add it too.
Hi @menshikh-iv, the problem is, I cannot add a simple straightforward test for the backward compatibility using something like: …

since this will have no relevance to old models (3.1 and 3.2), since: … The code I wrote: …

comes into effect only for models made in … Thus, the only way I see of adding a test for the above code would be to include a model trained in … If that's ok, I will proceed. What do you think, @gojomo?
Yes, a (tiniest-possible) model that was saved from gensim 3.3.0 would need to be included as test material to be sure models from that version load properly.
@menshikh-iv
@aneesh-joshi LGTM 👍, if @gojomo has no more suggestions, I'll merge it (please let me know, Gordon).
ping @gojomo
Looks good to me! @aneesh-joshi thanks for your persistence!
This is a rough implementation of the feature described in Issue #465.

Test code: …

Output: …