[WIP] Added method to restrict vocab of Word2Vec most similar search #481
I've added a method to `Word2Vec` that restricts the most-similar search to a given list of words:

```python
def most_similar_in_list(self, positive=[], negative=[], topn=10, restrict_vocab=None):
```
For example, these are the top 10 most similar results using the original `most_similar` method:

```python
from gensim.models import Word2Vec
from nltk.corpus import brown

model = Word2Vec(brown.sents())
model.most_similar('vector')
```
And we can restrict the search to a list of words with the new method:

```python
model.most_similar_in_list('vector', restrict_vocab=['subspace', 'intersection', 'secant'])
```
Passing an integer for `restrict_vocab` preserves the behaviour of the original `most_similar`:

```python
model.most_similar('vector', restrict_vocab=3) == model.most_similar_in_list('vector', restrict_vocab=3)
```
For large vocabularies, there is some benefit to reducing the number of rows in the vector matrix before computing

```python
dists = dot(limited, mean)
```

The number of rows drops from the full vocabulary size to the length of the restricted word list.
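The row reduction above can be sketched in plain NumPy. This is a minimal illustration, not the PR's actual code; the names `syn0norm` and `word_indices` are assumptions standing in for gensim's internal unit-normalized vector matrix and the row indices of the eligible words.

```python
import numpy as np

def restricted_dists(syn0norm, word_indices, mean):
    # Hypothetical sketch (names are assumptions, not the PR's code):
    # compute similarities only for the rows of the eligible words.
    #   syn0norm     : (vocab_size, dim) unit-normalized word vectors
    #   word_indices : row indices of the eligible words
    #   mean         : (dim,) unit-normalized query vector
    limited = syn0norm[word_indices]   # far fewer rows than the full vocabulary
    return np.dot(limited, mean)       # one similarity per eligible word
```

The dot product then costs time proportional to the restricted list, not the whole vocabulary.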
Nice, that sounds really useful!
I'm thinking this should even be the "default" (promoted) way of using `restrict_vocab`.
It's also easy to simulate the "int" behaviour by means of a "list of words", but not the other way round, so "list of words" is more flexible.
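The equivalence can be sketched with a small helper. This is a hypothetical illustration, not code from the PR; it relies only on the fact that gensim's `index2word` lists words in descending frequency order.

```python
def int_to_word_list(index2word, restrict_vocab):
    # Hypothetical helper (not in the PR): since index2word is ordered by
    # descending frequency, an integer restrict_vocab=N is equivalent to
    # passing the first N words explicitly.
    if isinstance(restrict_vocab, int):
        return index2word[:restrict_vocab]
    return restrict_vocab  # already an explicit list of words
```

The reverse direction has no such shortcut: an arbitrary word list generally isn't a frequency prefix, which is why the list form is strictly more flexible.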
We probably don't need an extra method for this, then.
Note that the current implementation is very inefficient when used with large lists of eligible words. (It takes the argument, converts it to a set and then a list, calculates all distances, and then linearly probes the restricted list to test whether every word is in it.)
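A more efficient shape for the computation might look like the following. This is only a sketch of the approach the critique implies (map eligible words to rows up front, score only those rows, rank within them); the parameter names are assumptions, not gensim's API.

```python
import numpy as np

def most_similar_restricted(syn0norm, vocab_index, index2word, mean, words, topn=10):
    # Sketch of the more efficient approach (all names here are assumptions):
    # map the eligible words to row indices once, compute distances only for
    # those rows, and rank within that small array -- no full-vocabulary
    # distance pass, no linear probing afterwards.
    #   vocab_index : dict mapping word -> row index
    #   index2word  : list mapping row index -> word
    indices = np.array([vocab_index[w] for w in words if w in vocab_index])
    dists = np.dot(syn0norm[indices], mean)    # one dot product per eligible word
    best = np.argsort(dists)[::-1][:topn]      # rank only the eligible words
    return [(index2word[indices[i]], float(dists[i])) for i in best]
```

This does O(len(words)) work instead of O(vocab_size) distance computations followed by repeated membership probes.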
Also, the PR seems to include other unrelated code for other features.
I would suggest splitting such functionality off into a different method.