
models.Phrases multiple scoring methods (#1363) #1464

Merged

Conversation

michaelwsherman
Contributor

@michaelwsherman michaelwsherman commented Jul 5, 2017

First attempt at alternative scoring methods, with tests, based on issue #1363

Currently uses a string to specify the scoring. That can be changed, but I'd like some feedback first.

Different scoring methods require different inputs; for example, the default scoring requires the min_count setting and the length of the vocabulary, while npmi requires a count of all the words in the corpus (which is now counted in learn_vocab, and presumably slows learn_vocab down a tiny bit).

My main argument against pluggable scoring functions is that since different scoring methods require different corpus-based or settings-based constants, all pluggable functions would have to take all of these parameters even if they weren't used. Further, if you wanted a scoring function that used a different corpus-based constant, you'd have to implement the determination/counting of that constant somewhere like learn_vocab. New corpus-based constants could also cause backwards compatibility issues if they become part of the standard scoring function call (although there might be a clever way around this by requiring all scoring functions to inherit from a parent function, which could then be updated to fill in missing parameters in existing scoring functions as new ones are created). See the remarks in the code for more about how pluggable functions might work.

My second argument against pluggable functions is that the current implementation as a string setting is more straightforward, and that it might be overkill to create pluggable scoring functionality that may never or rarely get used. And even if there is demand for more scoring options later, after a few other ad-hoc scoring functions are implemented, the best design for pluggable scoring functions will be more obvious. Backwards compatibility could easily be kept by creating a new scoring_function input to Phrases that overrides scorer.

It's very possible I'm overthinking this and pluggable scoring isn't so complicated. I'm happy to implement something if the feeling is there's a clear best way to do it.
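For concreteness, here is a minimal standalone sketch of the two scorers under discussion. The default formula matches the original_scorer added in this PR; the NPMI formula is the standard definition of normalized pointwise mutual information (this is an illustration, not the PR's exact code):

from math import log

# Mikolov-style scoring (the default): needs min_count and the vocabulary size.
def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

# Normalized PMI: needs the total number of words in the corpus instead.
def npmi_scorer(worda_count, wordb_count, bigram_count, corpus_word_count):
    pa = worda_count / float(corpus_word_count)
    pb = wordb_count / float(corpus_word_count)
    pab = bigram_count / float(corpus_word_count)
    return log(pab / (pa * pb)) / -log(pab)

Because each scorer depends on a different constant (len_vocab and min_count versus corpus_word_count), a single pluggable interface would have to pass all of them to every scorer.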

Michael Sherman added 10 commits June 30, 2017 16:19
now with a scoring parameter to initialize a Phrases object, defaults to
the mikolov paper scoring, but also switchable to 'npmi', normalized
pointwise mutual information

moved scoring calculation to call a function, scoring functions are now
top level functions in models.Phrases that are called when calculating
scores in models.Phrases.export_phrases
fixed some bugs with the pluggable scoring that were causing tests to
fail.
count_a = float(vocab[word_a])
count_b = float(vocab[word_b])
count_ab = float(vocab[bigram_word])
score = scoring_function(count_a, count_b, count_ab)
Contributor Author

A pluggable scoring function would have to be called with all corpus constants and Phrases settings used in any scoring function. Right now that would look like:
score = scoring_function(count_a, count_b, count_ab, min_count, len_vocab, corpus_word_count).
And the call would grow as the universe of variables considered by all scoring functions grows.

Owner

@piskvorky piskvorky Jul 21, 2017

I think that's still preferable. This string-passing seems inflexible.

We could support some common use-cases by passing a string, but the code underneath should simply translate that string into a scoring_function and work with that underneath. Custom scoring_functions should be supported IMO.

In other words, we could support both string and callable as param. If string, gensim converts that to a known callable (for easy-to-use common cases).
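A rough sketch of the dual string-or-callable parameter proposed here, assuming the original_scorer and npmi_scorer sketched earlier are in scope (hypothetical helper, not the code that was eventually merged):

from functools import partial

def resolve_scoring(scoring, len_vocab, min_count, corpus_word_count):
    # A custom callable is used as-is.
    if callable(scoring):
        return scoring
    # Known strings are translated into callables for the easy-to-use common cases.
    if scoring == 'default':
        return partial(original_scorer, len_vocab=float(len_vocab), min_count=float(min_count))
    if scoring == 'npmi':
        return partial(npmi_scorer, corpus_word_count=corpus_word_count)
    raise ValueError('unknown scoring method %r' % scoring)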

Contributor Author

I will make this change, hopefully before the end of the week, and make it part of a PR.

@menshikh-iv
Contributor

Looks good, thank you @michaelwsherman 👍

@menshikh-iv menshikh-iv merged commit 5f54b60 into piskvorky:develop Jul 20, 2017
Owner

@piskvorky piskvorky left a comment

I see this was already merged, but some changes are necessary.

total vocabulary size.
`threshold` represents a score threshold for forming the phrases (higher means
fewer phrases). A phrase of words `a` followed by `b` is accepted if the score of the
phrase is greater than threshold. see the `scoring' setting
Owner

Capitalize first word in sentence, end in full stop.
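For context on how threshold and the new scoring parameter interact, a typical call looks roughly like this (illustrative toy data; note that NPMI scores are bounded to [-1, 1], so it needs a threshold in that range):

from gensim.models.phrases import Phrases

sentences = [
    ['machine', 'learning', 'is', 'fun'],
    ['machine', 'learning', 'with', 'gensim'],
]

# Default scoring: a higher threshold means fewer phrases are formed.
bigram_default = Phrases(sentences, min_count=1, threshold=10.0, scoring='default')

# NPMI scoring: scores fall in [-1, 1], so use a threshold in that range.
bigram_npmi = Phrases(sentences, min_count=1, threshold=0.5, scoring='npmi')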

@@ -197,8 +221,10 @@ def add_vocab(self, sentences):
# directly, but gives the new sentences a fighting chance to collect
# sufficient counts, before being pruned out by the (large) accummulated
# counts collected in previous learn_vocab runs.
min_reduce, vocab = self.learn_vocab(sentences, self.max_vocab_size, self.delimiter, self.progress_per)
min_reduce, vocab, total_words = \
self.learn_vocab(sentences, self.max_vocab_size, self.delimiter, self.progress_per)
Owner

Code style: bad indentation (unneeded line break).

Contributor Author

What's the number of columns we cap at? I thought it was 100, which I believe this exceeded.

Owner

@piskvorky piskvorky Jul 25, 2017

There's no hard limit; if the line becomes hard to read, we break it.

If the break would be even harder to read than the original (for semantic/visual/clarity reasons), we don't break it.

Line continuations are indented at one extra level (4 spaces to the right).
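As a concrete illustration of that convention (a made-up standalone example in the style of the code in this PR, reusing the original_scorer sketched earlier and toy values for vocab and min_count):

from functools import partial

vocab = {'machine': 120, 'learning': 80, 'machine_learning': 40}
min_count = 5

# The open parenthesis already continues the line, so no backslash is needed,
# and the continuation is indented one extra level (4 spaces to the right):
scoring_function = partial(
    original_scorer, len_vocab=float(len(vocab)), min_count=float(min_count))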


if scoring == 'default':
scoring_function = \
partial(self.original_scorer, len_vocab=float(len(vocab)), min_count=float(min_count))
Owner

Indentation (unneeded line break).

partial(self.original_scorer, len_vocab=float(len(vocab)), min_count=float(min_count))
elif scoring == 'npmi':
scoring_function = \
partial(self.npmi_scorer, corpus_word_count=corpus_word_count)
Owner

Indentation (unneeded line break).


if as_tuples:
yield ((word_a, word_b), score)
else:
yield (out_delimiter.join((word_a, word_b)), score)
last_bigram = True
continue
last_bigram = False
last_bigram = False
Owner

Is this on purpose? What is this change about?

Contributor Author

Yes, this is on purpose. It matches up with line 277: if that test fails, we have to set last_bigram to False. This positioning sets it to False always; the only time it gets set to True is at line 293, when a passing bigram is found.

Owner

Aha, so this is a bug fix at the same time. Thanks! CC @menshikh-iv

    # len_vocab and min_count set so functools.partial works
    @staticmethod
    def original_scorer(worda_count, wordb_count, bigram_count, len_vocab=0.0, min_count=0.0):
        return (bigram_count - min_count) / worda_count / wordb_count * len_vocab
Owner

Beware of integer divisions - this code is brittle.

Contributor Author

I didn't fix this in PR #1573. Rather, I just cast everything before calling the scoring method in Phrases and Phraser. I think that's the better place to do the casting, since it then fixes the problem for all custom scorers as well.

Of course, I can do the casting in the scoring methods as well. Let me know if you still think I need it here and in npmi_scorer and I'll update PR #1573. It's extra steps, but I'd assume the performance hit is infinitesimal.
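A small standalone illustration of why casting at the call site protects custom scorers too (hypothetical scorer and counts, not project code):

# A naively written custom scorer with no casting of its own:
def naive_custom_scorer(worda_count, wordb_count, bigram_count):
    return bigram_count / (worda_count * wordb_count)

vocab = {'machine': 120, 'learning': 80, 'machine_learning': 40}

# Casting the counts once at the call site means even this scorer avoids
# Python 2 integer division:
count_a = float(vocab['machine'])
count_b = float(vocab['learning'])
count_ab = float(vocab['machine_learning'])
score = naive_custom_scorer(count_a, count_b, count_ab)  # float on both Python 2 and 3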

    # normalized PMI, requires corpus size
    @staticmethod
    def npmi_scorer(worda_count, wordb_count, bigram_count, corpus_word_count=0.0):
        pa = worda_count / corpus_word_count
Owner

@piskvorky piskvorky Jul 21, 2017

Is this meant to be an integer or float division? (ditto below)
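For reference, with plain Python 2 integers that division truncates (for example, 80 / 120 == 0), which would zero out the probabilities. Two common ways to force float division (an illustration of the concern, not necessarily the fix adopted in PR #1573):

# Option 1: make the whole module use true division (Python 2).
from __future__ import division

# Option 2: cast explicitly, so the intent is visible at the division itself.
worda_count, corpus_word_count = 80, 1000
pa = worda_count / float(corpus_word_count)  # 0.08, not 0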

@michaelwsherman
Contributor Author

@piskvorky Thank you for your comments. I'll go through these and respond with some specifics sometime next week; I'm on vacation now.

@piskvorky
Owner

Sounds good. Enjoy your vacation :)

@michaelwsherman
Contributor Author

Question @piskvorky: what's the best way to make these changes (and the changes to the other PR)? Submit another PR? Or is there a way to update this PR even though it has already been merged?

@menshikh-iv
Contributor

menshikh-iv commented Jul 25, 2017

@michaelwsherman Please submit as another PR.

@michaelwsherman
Contributor Author

I haven't forgotten about this, I'm just really swamped at work right now. Do you have an expected date for the next release? I'll do my best to get these fixes (and the fixes from #1423) into a new PR before then.

Sorry.

@menshikh-iv
Contributor

menshikh-iv commented Aug 7, 2017

No problem @michaelwsherman, I think the next release will be in the first week of September.

@michaelwsherman
Contributor Author

Fixes for @piskvorky's comments are in PR #1573.
