Add "most_similar_to_given" method for KeyedVectors #1582

TheMathMajor · 2017-09-11T08:43:41Z

Added a function to find the most similar word in a given list to a given word.

menshikh-iv · 2017-09-14T09:25:39Z

.spyproject/codestyle.ini

@@ -0,0 +1,6 @@
+[codestyle]


What is this? Please remove all irrelevant files (everything in the .spyproject folder).

menshikh-iv · 2017-09-14T09:28:27Z

gensim/models/keyedvectors.py

@@ -617,6 +618,22 @@ def similarity(self, w1, w2):

        """
        return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
+
+    def most_similar_to_given(self, w1, word_list):


Please reformat your docstring to follow the Google style.

menshikh-iv · 2017-09-14T09:29:29Z

gensim/models/keyedvectors.py

+
+        Example::
+
+          >>> trained_model.most_similar_to_given('music', ['water', 'sound', 'backpack', 'mouse'])


What do you think, @gojomo, is this a useful feature?

If done efficiently it makes sense. If the common case is there's a stable subset against which some projects are doing all of their similarity-searches, the need might be best met by a new method for creating a subset KeyedVectors, with just the words-of-interest. Overlaps with closed-as-idle PR #1229.

menshikh-iv · 2017-09-14T09:32:08Z

@TheMathMajor Also, please fix the PEP8 issues (see the Travis log).

gojomo · 2017-09-14T16:31:55Z

There's another nearly-complete implementation of similar functionality by @shubhvachher in closed-as-idle PR #1229.

menshikh-iv · 2017-09-25T10:30:15Z

Ping @TheMathMajor, what's the status of this PR?

TheMathMajor · 2017-10-03T17:56:57Z

Hi, thanks for the feedback. I have committed the requested changes.

menshikh-iv · 2017-10-16T12:58:18Z

Thanks @TheMathMajor, LGTM.
@gojomo should I merge this change (or do you have any suggestions)?

gojomo · 2017-10-16T17:28:57Z

There's no need for "deprecated" forwarding-method in Word2Vec if this is a brand-new feature on KeyedVectors that no one has ever learned to call on Word2Vec.

Perhaps the method should have a test, but as a simple 1-liner composed of other well-tested methods, maybe not.

But that highlights another difference with the earlier #1229 – while that PR had a lot of code-duplication, it did try to do the similarity calculations with array math, and thus might be noticeably faster with long word-lists. If main goal is performance, that approach may have been better; if goal is simply providing a convenience/clarity/example-method, this idiomatic 1-liner is better.
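To make the "idiomatic 1-liner" idea concrete, here is a minimal sketch of a method composed from an existing pairwise similarity function. It is not the PR's code: the toy `vectors` dictionary and its values are hypothetical stand-ins for a trained model, and the standalone `similarity` function only mirrors the cosine computation shown in the diff context above.

```python
import math

# Toy word vectors standing in for a trained model (hypothetical values).
vectors = {
    "music": [0.9, 0.1, 0.0],
    "sound": [0.8, 0.2, 0.1],
    "water": [0.1, 0.9, 0.2],
    "backpack": [0.0, 0.3, 0.9],
}


def similarity(w1, w2):
    """Cosine similarity between the vectors of two words."""
    v1, v2 = vectors[w1], vectors[w2]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def most_similar_to_given(w1, word_list):
    """Return the word in word_list most similar to w1.

    Makes one similarity() call per candidate, so cost grows
    linearly with the length of word_list.
    """
    return max(word_list, key=lambda word: similarity(w1, word))


best = most_similar_to_given("music", ["water", "sound", "backpack"])
print(best)
```

The appeal of this form is that it adds no new math of its own; its correctness rests entirely on the already-tested pairwise similarity.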

menshikh-iv · 2017-10-16T17:43:04Z

> There's no need for "deprecated" forwarding-method in Word2Vec if this is a brand-new feature on KeyedVectors that no one has ever learned to call on Word2Vec.

Agree, I'll remove it from Word2Vec

For this method, I think a clean one-liner is better (IMO we don't need performance here).

gojomo · 2017-10-16T23:20:51Z

I think the people who brought this up in #1229 and #481 were concerned about performance in their projects, which is why they pursued that path. But seems OK to add the clear 1-liner to start, and see who uses-and-needs-better, thus perhaps contributing a faster alternative.
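For anyone who later needs the faster alternative, the array-math approach might look roughly like the sketch below. This is an illustration only, not the code from #1229: the `words` list, the `matrix` values, and the function name `most_similar_to_given_vectorized` are all hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical vocabulary and embedding matrix (one row per word).
words = ["music", "sound", "water", "backpack"]
matrix = np.array(
    [
        [0.9, 0.1, 0.0],
        [0.8, 0.2, 0.1],
        [0.1, 0.9, 0.2],
        [0.0, 0.3, 0.9],
    ],
    dtype=float,
)
index = {word: i for i, word in enumerate(words)}

# Normalise every row once so that a dot product equals cosine similarity.
unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


def most_similar_to_given_vectorized(w1, word_list):
    """Return the word in word_list most similar to w1.

    Scores all candidates with a single matrix-vector product
    instead of one similarity call per candidate.
    """
    rows = [index[word] for word in word_list]
    scores = unit[rows] @ unit[index[w1]]
    return word_list[int(np.argmax(scores))]


print(most_similar_to_given_vectorized("music", ["water", "sound", "backpack"]))
```

The trade-off is the one described above: the vectorized form needs direct access to the underlying vector array, while the one-liner only needs the public pairwise similarity call.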

menshikh-iv · 2017-10-17T05:50:55Z

Thanks @TheMathMajor, congrats on your first contribution :1st_place_medal:

TheMathMajor · 2017-10-17T08:13:48Z

Thanks a lot for the suggestions and guiding me through my first contribution!

* finished adding 2 new functions
* imported argmax to word2vec
* reformatted
* remove `most_similar_to_given` from w2v class
* Fix PEP8

TheMathMajor added 2 commits September 11, 2017 01:33

finished adding 2 new functions

1b4676b

imported argmax to word2vec

e1c3448

menshikh-iv suggested changes Sep 14, 2017

menshikh-iv changed the title ~~Added New Function~~ Add "most_similar_to_given" function for KeyedVectors Sep 14, 2017

menshikh-iv changed the title ~~Add "most_similar_to_given" function for KeyedVectors~~ Add "most_similar_to_given" method for KeyedVectors Sep 14, 2017

reformatted

219fa95

remove most_similar_to_given from w2v class

1cb72c4

menshikh-iv approved these changes Oct 16, 2017

Fix PEP8

e7834a4

menshikh-iv merged commit 2690289 into piskvorky:develop Oct 17, 2017
