Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "most_similar_to_given" method for KeyedVectors #1582

Merged
merged 5 commits into from
Oct 17, 2017

Conversation

TheMathMajor
Copy link
Contributor

Added a function to find the most similar word in a given list to a given word.

@@ -0,0 +1,6 @@
[codestyle]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is it? Please remove all non-relevant files (all from .spyproject folder).

@@ -617,6 +618,22 @@ def similarity(self, w1, w2):

"""
return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))

def most_similar_to_given(self, w1, word_list):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reformat your docstring according to google-style


Example::

>>> trained_model.most_similar_to_given('music', ['water', 'sound', 'backpack', 'mouse'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you this @gojomo, it's a useful feature?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If done efficiently it makes sense. If the common case is there's a stable subset against which some projects are doing all of their similarity-searches, the need might be best met by a new method for creating a subset KeyedVectors, with just the words-of-interest. Overlaps with closed-as-idle PR #1229.

@menshikh-iv menshikh-iv changed the title Added New Function Add "most_similar_to_given" function for KeyedVectors Sep 14, 2017
@menshikh-iv menshikh-iv changed the title Add "most_similar_to_given" function for KeyedVectors Add "most_similar_to_given" method for KeyedVectors Sep 14, 2017
@menshikh-iv
Copy link
Contributor

@TheMathMajor Also, please fix PEP8 issues (look at travis log)

@gojomo
Copy link
Collaborator

gojomo commented Sep 14, 2017

There's another nearly-complete implementation of similar functionality by @shubhvachher in closed-as-idle PR #1229.

@menshikh-iv
Copy link
Contributor

Ping @TheMathMajor, what's a status of this PR?

@TheMathMajor
Copy link
Contributor Author

Hi, thanks for the feedback, I have made the committed the changes requested.

@menshikh-iv
Copy link
Contributor

Thanks @TheMathMajor LGTM
@gojomo should I merge this change (or you have any suggestions)?

@gojomo
Copy link
Collaborator

gojomo commented Oct 16, 2017

There's no need for "deprecated" forwarding-method in Word2Vec if this is a brand-new feature on KeyedVectors that no one has ever learned to call on Word2Vec.

Perhaps the method should have a test, but as a simple 1-liner composed of other well-tested methods, maybe not.

But that highlights another difference with the earlier #1229 – while that PR had a lot of code-duplication, it did try to do the similarity calculations with array math, and thus might be noticeably faster with long word-lists. If main goal is performance, that approach may have been better; if goal is simply providing a convenience/clarity/example-method, this idiomatic 1-liner is better.

@menshikh-iv
Copy link
Contributor

There's no need for "deprecated" forwarding-method in Word2Vec if this is a brand-new feature on KeyedVectors that no one has ever learned to call on Word2Vec.

Agree, I'll remove it from Word2Vec

For this method, I think clean one-liner is better (IMO we no need performance here)

@gojomo
Copy link
Collaborator

gojomo commented Oct 16, 2017

I think the people who brought this up in #1229 and #481 were concerned about performance in their projects, which is why they pursued that path. But seems OK to add the clear 1-liner to start, and see who uses-and-needs-better, thus perhaps contributing a faster alternative.

@menshikh-iv
Copy link
Contributor

Thanks @TheMathMajor, congratz with the first contribution:1st_place_medal:

@menshikh-iv menshikh-iv merged commit 2690289 into piskvorky:develop Oct 17, 2017
@TheMathMajor
Copy link
Contributor Author

Thanks a lot for the suggestions and guiding me through my first contribution!

horpto pushed a commit to horpto/gensim that referenced this pull request Oct 28, 2017
* finished adding 2 new functions

* imported argmax to word2vec

* reformatted

* remove `most_similar_to_given` from w2v class

* Fix PEP8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants