Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load_word2vec_format(): new parameter to skip init_sims() call #545

Merged
merged 5 commits into from Nov 25, 2015

Conversation

svenkreiss
Copy link
Contributor

In certain use cases (custom doc2vec-type computations) only unnormalized vectors are used. The init_sims() call at the end of load_word2vec_format takes a lot of memory (even with norm_only=True) and is unnecessary in this scenario. This PR allows to skip the call which improves performance.

@piskvorky
Copy link
Owner

Sounds useful and clean, +1. A unit test for this new parameter would be useful.

Let's wait for @gojomo review & then merge.

@piskvorky
Copy link
Owner

Also @svenkreiss , can you commit a brief description of this change in CHANGELOG.txt?

@svenkreiss
Copy link
Contributor Author

@piskvorky thanks for comments. Done.

@piskvorky
Copy link
Owner

Perfect, thanks a lot @svenkreiss !

piskvorky added a commit that referenced this pull request Nov 25, 2015
load_word2vec_format(): new parameter to skip init_sims() call
@piskvorky piskvorky merged commit 5535fcf into piskvorky:develop Nov 25, 2015
@gojomo
Copy link
Collaborator

gojomo commented Nov 25, 2015

Looks fine, but another more radical simplification that reduces parameters & codepaths would also be worth considering: just don't do any automatic norming in the load_word2vec_format(). (That is, remove the init_sims() call rather than make it switchable.)

Then the (syn0) result of a load is really just a load, not a load-and-do-other stuff. This would also be roughly consistently with the syn0 state after native gensim training: you have the raw vectors, what you do with them next is up to your explicit further steps.

If the user starts making similarity calls, syn0norm would be automatically backfilled, as with natively-trained vectors.... but someone who doesn't need that would just choose not to trigger it.

If they instead want to convert to a compact, normed-only model, they'd call init_sims(norm_only=True) themself – just as if they'd trained the vectors with gensim (rather than just loaded).

@piskvorky
Copy link
Owner

Oh yes, that makes sense too. I actually like @gojomo 's option better -- simpler is better.

@svenkreiss
Copy link
Contributor Author

I also agree with @gojomo. I can prepare a PR that removes the init_sims and norm_only parameters next week.

This would be api backwards incompatible and change the default behavior of this function.

@piskvorky
Copy link
Owner

Yes, we'll need a prominent warning in CHANGELOG :)

Thanks again @svenkreiss , you're really helpful!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants