[WIP] Added method to restrict vocab of Word2Vec most similar search #481

Closed
wants to merge 6 commits into
base: develop
from

7 participants
@jimgoo

jimgoo commented Oct 14, 2015

I've added a method to gensim.models.word2vec.py:

def most_similar_in_list(self, positive=[], negative=[], topn=10, restrict_vocab=None):

which allows restrict_vocab to be a list containing words to restrict the search over.

For example, these are the top 10 most similar results using the original most_similar method:

from gensim.models import Word2Vec
from nltk.corpus import brown

model = Word2Vec(brown.sents())
model.most_similar('vector')
[(u'V', 0.9310650825500488),
 (u'Q', 0.9126538634300232),
 (u'**zg', 0.9065112471580505),
 (u'null', 0.9064960479736328),
 (u'subspace', 0.9033916592597961),
 (u'intersection', 0.8994101881980896),
 (u'T', 0.8956964015960693),
 (u'staining', 0.8929149508476257),
 (u'secant', 0.8926326036453247),
 (u'concentration', 0.883994460105896)]

And we can restrict the search to a list of words with the new most_similar_in_list method:

model.most_similar_in_list('vector', restrict_vocab=['subspace', 'intersection', 'secant'])
[(u'subspace', 0.9033915996551514),
 (u'intersection', 0.8994101881980896),
 (u'secant', 0.8926326036453247)]

Passing an integer for restrict_vocab behaves the same as in the original method:

model.most_similar('vector', restrict_vocab=3) == model.most_similar_in_list('vector', restrict_vocab=3)
True

For large vocabularies, there is some benefit to reducing the number of rows in the limited matrix when you're only interested in a subset of words:

dists = dot(limited, mean) 

The number of rows is len(restrict_vocab) rather than the total number of words in the vocab.
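
To make the row-count point concrete, here is a minimal NumPy sketch. It is not the PR's code: the toy vocabulary, the random vectors, and the names word_to_row, vectors and restrict are assumptions for illustration. It only shows that selecting rows before the dot product yields one similarity per restricted word.

```python
import numpy as np

# Toy stand-ins (assumptions): a 6-word vocabulary with unit-normalized 4-d vectors.
rng = np.random.default_rng(0)
vocab = ['vector', 'subspace', 'intersection', 'secant', 'staining', 'null']
word_to_row = {word: i for i, word in enumerate(vocab)}
vectors = rng.normal(size=(len(vocab), 4))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

mean = vectors[word_to_row['vector']]      # query vector, already unit length
restrict = ['subspace', 'intersection', 'secant']

# Select only the rows for the restricted words before taking the dot product.
rows = [word_to_row[w] for w in restrict]
limited = vectors[rows]                    # shape: (len(restrict), 4)
dists = np.dot(limited, mean)              # one similarity per restricted word

# Pair each word with its similarity, highest first.
ranked = sorted(zip(restrict, dists.tolist()), key=lambda pair: -pair[1])
```

The similarities are the same values a full-vocabulary dot product would give for those rows; only the amount of work differs.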

@piskvorky

piskvorky Oct 15, 2015

Member

Nice, that sounds really useful!

I'm thinking this should even be the "default" (promoted) way of using the restrict_vocab parameter: you pass in a list of words you care about. The "int" version is kind of opaque and non-explicit, I like this "list of words" better. Same goes for the model.accuracy() method.

It's also easy to simulate "int" by means of "list of words", but not the other way round, so "list of words" is more flexible.

We probably don't need an extra most_similar_in_list method though; is there a reason not to add this functionality directly to most_similar?
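
The "int can be simulated by a list, but not the other way round" point can be sketched as follows. This assumes the vocabulary is stored as a list ordered from most to least frequent, which is what the integer form relies on; index_to_word and top_n_words are hypothetical names, not gensim attributes.

```python
def top_n_words(index_to_word, n):
    """Return the first n words of a frequency-ordered vocabulary list."""
    return list(index_to_word[:n])

# Hypothetical vocabulary, ordered from most to least frequent.
index_to_word = ['the', 'of', 'and', 'vector', 'subspace', 'secant']

# The integer form is just a prefix of the ordered vocabulary.
restrict_as_int = 3
restrict_as_list = top_n_words(index_to_word, restrict_as_int)

# An arbitrary word list is generally not a prefix, so no single
# integer can express it.
arbitrary = ['subspace', 'secant']
is_prefix = arbitrary == index_to_word[:len(arbitrary)]
```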

@jimgoo

jimgoo Oct 15, 2015

Cool, if you like it then there is no reason to have both methods. I made it separate so I could test against the original and make sure they matched. I'll rename most_similar_in_list to most_similar and commit the change.

@tmylk

tmylk Oct 15, 2015

Contributor

@jimgoo Speaking of testing, could you please add some tests for the new feature to /test/test_word2vec.py#L175

@tmylk

tmylk Jan 10, 2016

Contributor

@jimgoo Please fix the Python 3 syntax issues, and add a CHANGELOG entry and a test.
Then it can go into the January Gensim release!

@tmylk

tmylk Jan 23, 2016

Contributor

Hey @jimgoo, please push another commit to trigger the Travis build. Ignore the AppVeyor test failures for now; we are working to fix them. I would expect the Travis Unix tests to be green after the next commit.

@atran

atran Mar 1, 2016

+1 @jimgoo, I think it's just a print statement.

@hitochan777

hitochan777 Sep 2, 2016

It seems that this PR has not been incorporated into the latest gensim. Is there any update?

@gojomo

gojomo Sep 7, 2016

Member

Note that the current implementation is very inefficient when used with large lists of eligible words. (It converts the argument to a set and then to a list. It then calculates all distances and does linear probes against the restricted list to test whether each word is in it.)

Also, the PR seems to include other unrelated code for other features.

I would suggest splitting such functionality off into a different method, perhaps most_similar_among(), and refactoring to truly limit the collection-duplication and distance-calculations. (This could also avoid the clunky type-based overloading of the restrict_vocab parameter.)
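
A rough sketch of what such a separate method might look like, written as a standalone function. This is an illustration, not gensim's API and not the PR's code: the signature and the word_to_row and unit_vectors arguments are assumptions, and the vectors are assumed to be L2-normalized already. It resolves candidates through dict lookups instead of linear probes and computes similarities for the candidate rows only.

```python
import numpy as np

def most_similar_among(query, candidates, word_to_row, unit_vectors, topn=10):
    """Rank only `candidates` by cosine similarity to `query`.

    Hypothetical helper: `unit_vectors` is assumed to hold one L2-normalized
    row per vocabulary word, indexed through the `word_to_row` dict.
    """
    # Keep known, distinct candidates in input order; dict and set
    # membership tests are O(1), so no linear probing is needed.
    seen = set()
    words = []
    for word in candidates:
        if word in word_to_row and word != query and word not in seen:
            seen.add(word)
            words.append(word)
    if not words:
        return []
    rows = np.fromiter((word_to_row[w] for w in words), dtype=np.intp,
                       count=len(words))
    # Similarities are computed for the candidate rows only.
    sims = unit_vectors[rows] @ unit_vectors[word_to_row[query]]
    order = np.argsort(-sims)[:topn]
    return [(words[i], float(sims[i])) for i in order]

# Toy data (assumed) to exercise the function.
vocab = ['vector', 'subspace', 'intersection', 'secant', 'staining']
word_to_row = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(1)
unit_vectors = rng.normal(size=(len(vocab), 8))
unit_vectors /= np.linalg.norm(unit_vectors, axis=1, keepdims=True)

result = most_similar_among('vector', ['subspace', 'secant', 'missing'],
                            word_to_row, unit_vectors)
```

Words missing from the vocabulary are silently skipped here; a real implementation might prefer to raise or warn instead.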

@tmylk tmylk changed the title from Added method to restrict vocab of Word2Vec most similar search to [WIP] Added method to restrict vocab of Word2Vec most similar search Oct 4, 2016

@shubhvachher

shubhvachher Mar 22, 2017

Contributor

I had built this out for a recent project. Should I complete this issue?

@tmylk

tmylk May 2, 2017

Contributor

Duplicate of #1229
