Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phrases __getitem__() method does not respect chosen scoring function #1533

Closed
SergeyDidenko opened this issue Aug 15, 2017 · 5 comments
Closed
Labels
bug Issue described a bug

Comments

@SergeyDidenko
Copy link

SergeyDidenko commented Aug 15, 2017

Description

Phrases __getitem__() method does not respect chosen scoring function, it has 'default' scoring builtin. So when Phrases object is constructed with scoring='npmi' it returns wrong results.

See Phrases export_phrases() for correct implementation, i.e. score = scoring_function(count_a, count_b, count_ab) instead of score = (pab - min_count) / pa / pb * len(vocab)

For a while it's better to construct and use Phraser object, i.e.
phraser = Phraser[phrases]; res = phraser[sentences] instead of res = phrases[sentences]

@piskvorky
Copy link
Owner

piskvorky commented Aug 15, 2017

You're right, looks like a bug, thanks for reporting. Since you already have the solution, can you open a PR with the fix?

CC @michaelwsherman re. #1464.

@piskvorky piskvorky added the bug Issue described a bug label Aug 15, 2017
@SergeyDidenko
Copy link
Author

Sorry, I don't have the fix yet. I just use Phraser.

@michaelwsherman
Copy link
Contributor

Good find--i totally forgot about getitem when I implemented this. I can get a fix together, should be fairly straightforward. Will be a few weeks though, and I assume this isn't a showstopper since Phraser still works.

@sergeididenko -- are you using npmi? That makes me happy.

@SergeyDidenko
Copy link
Author

@michaelwsherman Yes, I prefer MPI scores over alternatives. Can't say yet if it's better for my current project though.

@michaelwsherman
Copy link
Contributor

Fix in PR #1573 .

horpto pushed a commit to horpto/gensim that referenced this issue Oct 28, 2017
…iskvorky#1573)

* initial commit of fixes in comments of piskvorky#1423

* removed unnecessary space in logger

* added support for custom Phrases scorers

* fixed Phrases.__getitem__ to support pluggable scoring piskvorky#1533

* travisCI style fixes

* fixed __next__() to next() for python 3 compatibilyt

* misc fixes

* spacing fixes for style

* custom scorer support in sklearn api

* Phrases scikit interface tests for pluggable scoring

* missing line breaks

* style, clarity, and robustness fixes requested by @piskvorky

* check in Phrases init to make sure scorer is pickleable

* backwards scoring compatibility when loading a Phrases class

* removal of pickle testing objects in Phrases init

* switched to six for python 2/3 compatibility

* fix docstring
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug
Projects
None yet
Development

No branches or pull requests

3 participants