
Word2Vec model to dict; Adding to the word2vec to production pipeline #1269

Closed
shubhvachher opened this issue Apr 10, 2017 · 4 comments
Labels
difficulty easy, feature, good first issue

Comments

@shubhvachher
Contributor

shubhvachher commented Apr 10, 2017

Many users deploy their trained word2vec model in production to fetch most_similar words on the fly, for example for the words in a user's query or for the words in whole documents. In such settings, querying the word2vec model becomes very cumbersome and is often the slowest step in the pipeline. [1]

What I propose is a model_to_dict method, to be run right at the end of the word2vec pipeline. Before production, it would compute and store the most similar words for every word in the trained vocabulary.

The most similar words could be restricted to a custom user list as in #1229, and we could let the user define a custom preprocessing function that every most-similar word is passed through before being stored. Since the result is a dict, query time is minimal, which is great for this purpose! And since the dict stores only words, its size should be comparable to a multiple of the vocabulary size. [2]

At the end of this, we expect a dictionary whose keys are the word2vec vocabulary and whose values are the most_similar words for each key. Vocabulary words whose most_similar list comes back empty will not be stored in the dict; this will happen a lot when a custom results list and a pass function for result words are applied on top of the most_similar cutoff. (A rough sketch follows after the footnotes.)

[1] This is because we always compute the cosine distances from the query word to every word in the vocabulary before returning the topn most similar words. I can't think of a better way to do that yet.
[2] Albeit a large multiple if users do not provide a proper preprocessing pass function or use a small similarity cutoff. Maybe we can warn them about this.
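
Roughly, a minimal sketch of what I have in mind is below. It assumes `model` is an already-trained gensim Word2Vec model; the function name precompute_most_similars and the word_filter parameter are illustrative only, not an existing gensim API.

def precompute_most_similars(model, topn=10, word_filter=None):
    """Precompute most_similar results for every word in the vocabulary.

    word_filter, if given, is a user-supplied predicate applied to each
    candidate word before it is stored (the "pass function" above).
    """
    most_similars = {}
    for word in model.wv.index2word:  # iterate over the trained vocabulary
        similars = model.wv.most_similar(word, topn=topn)
        if word_filter is not None:
            similars = [(w, sim) for w, sim in similars if word_filter(w)]
        if similars:  # words whose result list ends up empty are not stored
            most_similars[word] = similars
    return most_similars

At query time the lookup is then a plain dict access, e.g. precompute_most_similars(model, topn=5).get("banana", []), instead of a full pass over the vocabulary.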

@gojomo
Collaborator

gojomo commented Apr 11, 2017

I can see this being useful. However, it could take a lot of time/memory to compute. And, it seems like a 1-liner:

most_similars_precalc = {k : model.wv.most_similar(k) for k in model.wv.index2word}

(The variants would be slightly different if working with some subset of the vocabulary.)
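
For instance, one such variant restricted to a subset might look like the sketch below, where words_of_interest is an assumed, user-supplied list of words:

most_similars_precalc = {
    w: model.wv.most_similar(w) for w in words_of_interest if w in model.wv.vocab
}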

So, this might be more appropriate as some examples in one of the documentation notebooks (with proper caveats about the time/memory cost of the every-word calculations).

@tmylk
Contributor

tmylk commented May 2, 2017

Can you add this to a notebook, @shubhvachher?

menshikh-iv added the difficulty easy, feature, and test before incubator labels on Oct 2, 2017
menshikh-iv added the good first issue label and removed the test before incubator label on Oct 16, 2017
@kakshay21
Contributor

kakshay21 commented Dec 10, 2017

@menshikh-iv @tmylk @gojomo
"albeit a large multiple if users do not provide proper preprocessing pass function or have a small similarity cutoff. Maybe we can give them a warning about the same."
Is that warning required in the Jupyter notebook?
Does the method need to take an additional parameter for it as well?
