Phrases __getitem__() method does not respect chosen scoring function #1533

SergeyDidenko · 2017-08-15T11:22:19Z

Description

The Phrases __getitem__() method does not respect the chosen scoring function; it has the 'default' scoring built in. So when a Phrases object is constructed with scoring='npmi', it returns wrong results.

See Phrases export_phrases() for correct implementation, i.e. score = scoring_function(count_a, count_b, count_ab) instead of score = (pab - min_count) / pa / pb * len(vocab)
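To show why the hard-coded formula matters, here is a minimal sketch of the two scores computed from the same counts. The function names, argument names, and counts are made up for illustration and are not gensim's actual signatures. The NPMI formula is the standard normalized pointwise mutual information, assumed here to be computed from counts divided by the total corpus word count.

```python
import math


def default_score(count_a, count_b, count_ab, len_vocab, min_count):
    """The 'default' formula quoted above, written with raw counts."""
    return (count_ab - min_count) / count_a / count_b * len_vocab


def npmi_score(count_a, count_b, count_ab, corpus_word_count):
    """Standard NPMI; the result always lies in [-1, 1]."""
    pa = count_a / corpus_word_count
    pb = count_b / corpus_word_count
    pab = count_ab / corpus_word_count
    return math.log(pab / (pa * pb)) / -math.log(pab)


# Hypothetical counts, chosen only to illustrate the difference in scale.
counts = dict(count_a=100, count_b=200, count_ab=50)

d = default_score(len_vocab=10000, min_count=5, **counts)   # roughly 22.5
n = npmi_score(corpus_word_count=1000000, **counts)         # roughly 0.79

threshold = 0.9  # a plausible threshold on the NPMI scale
print("default score passes threshold:", d > threshold)  # True
print("npmi score passes threshold:", n > threshold)     # False
```

The two scores live on different scales, so a threshold picked for NPMI accepts far too many bigrams when it is compared against the default score, which is consistent with the wrong results described above.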

For the time being, it's better to construct and use a Phraser object, i.e.
phraser = Phraser(phrases); res = phraser[sentences] instead of res = phrases[sentences]


piskvorky · 2017-08-15T11:50:29Z

You're right, looks like a bug, thanks for reporting. Since you already have the solution, can you open a PR with the fix?

CC @michaelwsherman re. #1464.

SergeyDidenko · 2017-08-15T13:32:45Z

Sorry, I don't have the fix yet. I just use Phraser.

michaelwsherman · 2017-08-15T13:38:52Z

Good find--I totally forgot about __getitem__ when I implemented this. I can get a fix together; it should be fairly straightforward. It will be a few weeks though, and I assume this isn't a showstopper since Phraser still works.

@sergeididenko -- are you using npmi? That makes me happy.

SergeyDidenko · 2017-08-15T15:29:46Z

@michaelwsherman Yes, I prefer PMI scores over the alternatives. Can't say yet if it's better for my current project though.

michaelwsherman · 2017-09-06T18:27:24Z

Fix in PR #1573.

@piskvorky

…iskvorky#1573)
* initial commit of fixes in comments of piskvorky#1423
* removed unnecessary space in logger
* added support for custom Phrases scorers
* fixed Phrases.__getitem__ to support pluggable scoring piskvorky#1533
* travisCI style fixes
* fixed __next__() to next() for python 3 compatibility
* misc fixes
* spacing fixes for style
* custom scorer support in sklearn api
* Phrases scikit interface tests for pluggable scoring
* missing line breaks
* style, clarity, and robustness fixes requested by @piskvorky
* check in Phrases init to make sure scorer is pickleable
* backwards scoring compatibility when loading a Phrases class
* removal of pickle testing objects in Phrases init
* switched to six for python 2/3 compatibility
* fix docstring

piskvorky added the bug label Aug 15, 2017

michaelwsherman pushed a commit to bloomberg/gensim that referenced this issue Sep 6, 2017

fixed Phrases.__getitem__ to support pluggable scoring piskvorky#1533

32b66bd

michaelwsherman mentioned this issue Sep 6, 2017

1533 fix and 1464 1423 comments #1573

Merged

menshikh-iv closed this as completed in a5872fa Oct 24, 2017
