load_word2vec_format(): new parameter to skip init_sims() call #545

Merged
piskvorky merged 5 commits into RaRe-Technologies:develop from svenkreiss:skip-l2-norm-calc on Nov 25, 2015

3 participants

svenkreiss commented Nov 24, 2015

In certain use cases (custom doc2vec-type computations), only the unnormalized vectors are used. The init_sims() call at the end of load_word2vec_format() takes a lot of memory (even with norm_only=True) and is unnecessary in this scenario. This PR makes it possible to skip that call, which improves performance.
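
To make the trade-off concrete, here is a minimal, self-contained sketch of a loader with a switch for the normalization step. It is a toy in pure Python, not gensim code: `load_vectors`, `unit_normalize`, and the `normalize` flag are hypothetical names chosen for illustration, and the input format (one word followed by its values per line) is an assumption.

```python
import math


def unit_normalize(vec):
    """Return an L2-normalized copy of vec; zero vectors are copied unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def load_vectors(lines, normalize=True):
    """Toy loader for 'word v1 v2 ...' text lines.

    `normalize` is a hypothetical stand-in for the switch discussed in this PR;
    it is not gensim's actual signature.
    """
    raw = {}
    for line in lines:
        word, *values = line.split()
        raw[word] = [float(v) for v in values]
    # The normalized table is a second full-size copy, built only on request.
    normed = {w: unit_normalize(v) for w, v in raw.items()} if normalize else None
    return raw, normed


lines = ["cat 3.0 4.0", "dog 1.0 0.0"]
raw, normed = load_vectors(lines, normalize=False)  # raw vectors only
raw_full, normed_full = load_vectors(lines)         # raw plus normalized copy
```

With `normalize=False` the second table is never allocated, which is the saving this PR is after when only raw vectors are needed.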

piskvorky commented Nov 25, 2015

Sounds useful and clean, +1. A unit test for this new parameter would be useful.

Let's wait for @gojomo's review and then merge.

piskvorky commented Nov 25, 2015

Also @svenkreiss , can you commit a brief description of this change in CHANGELOG.txt?

svenkreiss commented Nov 25, 2015

@piskvorky thanks for the comments. Done.

piskvorky commented Nov 25, 2015

Perfect, thanks a lot @svenkreiss !

piskvorky added a commit that referenced this pull request Nov 25, 2015

Merge pull request #545 from svenkreiss/skip-l2-norm-calc
load_word2vec_format(): new parameter to skip init_sims() call

@piskvorky piskvorky merged commit 5535fcf into RaRe-Technologies:develop Nov 25, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

gojomo commented Nov 25, 2015

Looks fine, but another, more radical simplification that reduces parameters & codepaths would also be worth considering: just don't do any automatic norming in load_word2vec_format(). (That is, remove the init_sims() call rather than make it switchable.)

Then the (syn0) result of a load is really just a load, not a load-and-do-other-stuff. This would also be roughly consistent with the syn0 state after native gensim training: you have the raw vectors, and what you do with them next is up to your explicit further steps.

If the user starts making similarity calls, syn0norm would be automatically backfilled, as with natively-trained vectors... but someone who doesn't need that would just choose not to trigger it.

If they instead want to convert to a compact, normed-only model, they'd call init_sims(norm_only=True) themselves – just as if they'd trained the vectors with gensim (rather than just loaded them).
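
The pattern described above (load raw vectors only, backfill the normalized ones on first use, compact on explicit request) can be sketched as a small stand-alone class. This is an illustration under stated assumptions, not gensim's implementation: `ToyVectors` is hypothetical, the attribute names `syn0` and `syn0norm` are borrowed from the comment, and the `replace` flag is this sketch's own name for the "normed-only" option.

```python
import math


class ToyVectors:
    """Hypothetical miniature of the load-then-norm-on-demand pattern."""

    def __init__(self, vectors):
        # Loading stores the raw vectors and nothing else.
        self.syn0 = {word: list(vec) for word, vec in vectors.items()}
        self.syn0norm = None  # backfilled lazily

    def init_sims(self, replace=False):
        """Compute unit-length vectors once; replace=True drops the raw table."""
        if self.syn0norm is None:
            normed = {}
            for word, vec in self.syn0.items():
                norm = math.sqrt(sum(x * x for x in vec))
                normed[word] = [x / norm for x in vec] if norm else list(vec)
            self.syn0norm = normed
        if replace:
            # Keep a single table: the raw vectors are no longer retained.
            self.syn0 = self.syn0norm

    def similarity(self, w1, w2):
        """Cosine similarity; triggers the lazy normalization if needed."""
        self.init_sims()
        return sum(a * b for a, b in zip(self.syn0norm[w1], self.syn0norm[w2]))


model = ToyVectors({"a": [3.0, 4.0], "b": [4.0, 3.0]})
untouched = model.syn0norm is None      # nothing normalized right after "load"
sim = model.similarity("a", "b")        # first similarity call backfills syn0norm
raw_kept = model.syn0["a"] == [3.0, 4.0]
model.init_sims(replace=True)           # explicit opt-in to the compact form
```

A caller who only ever reads `syn0` never pays for the normalized copy, while similarity queries still work without any extra setup.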

piskvorky commented Nov 26, 2015

Oh yes, that makes sense too. I actually like @gojomo 's option better -- simpler is better.

svenkreiss commented Nov 26, 2015

I also agree with @gojomo. I can prepare a PR next week that removes the init_sims and norm_only parameters.

This would be a backwards-incompatible API change and would alter the default behavior of this function.

piskvorky commented Nov 26, 2015

Yes, we'll need a prominent warning in CHANGELOG :)

Thanks again @svenkreiss , you're really helpful!

@svenkreiss svenkreiss deleted the svenkreiss:skip-l2-norm-calc branch Nov 30, 2015
