Log probabilities for discriminative models ( issue from StereoSet repo ) #3

Closed
Lynx1820 opened this issue May 3, 2022 · 3 comments

Lynx1820 commented May 3, 2022

Hi all, the StereoSet paper says that you compute the average log probability of the attribute terms for BERT/RoBERTa, but the code here seems to take the average of the probabilities instead. Is this intentional?

Collaborator

ncmeade commented May 3, 2022

Hi @Lynx1820,

Thank you for raising this issue! Indeed, this does look like an inconsistency between the StereoSet paper and code. As you state, the code does compute the average token probability of the attribute words. Computing the average log token probability should not change the reported results though because log is a monotonically increasing function.

Hope this helps. Happy to answer any additional questions.

Author

Lynx1820 commented May 4, 2022

I think you can get conflicting results when you take the average, since log is nonlinear. Say you have token probabilities [7.2118e-07, 5.8076e-07] for sentence A and [1.3232e-06, 2.2212e-07] for sentence B. Averaging the probabilities gives 6.509703e-07 for A and 7.726641e-07 for B, so B has the higher value. But averaging the logs gives (ln(7.2118e-07) + ln(5.8076e-07))/2 = -14.25 for A and (ln(1.3232e-06) + ln(2.2212e-07))/2 = -14.43 for B, so A has the higher value.
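
The ranking flip in this example can be checked with a few lines of Python. This is only a sketch: the helper names `mean_prob` and `mean_log_prob` are illustrative and are not taken from the StereoSet code.

```python
import math

# Per-token probabilities from the example above.
probs_a = [7.2118e-07, 5.8076e-07]
probs_b = [1.3232e-06, 2.2212e-07]


def mean_prob(probs):
    """Arithmetic mean of the token probabilities."""
    return sum(probs) / len(probs)


def mean_log_prob(probs):
    """Arithmetic mean of the natural-log token probabilities."""
    return sum(math.log(p) for p in probs) / len(probs)


# Which sentence each scoring method prefers.
prefers_b_by_mean_prob = mean_prob(probs_b) > mean_prob(probs_a)
prefers_a_by_mean_log = mean_log_prob(probs_a) > mean_log_prob(probs_b)

print("mean prob:    ", mean_prob(probs_a), mean_prob(probs_b))
print("mean log prob:", mean_log_prob(probs_a), mean_log_prob(probs_b))
print("rankings disagree:", prefers_b_by_mean_prob and prefers_a_by_mean_log)
```

Both comparisons come out true for these inputs, so the two scoring methods pick different sentences.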

Collaborator

ncmeade commented May 5, 2022

Yes, you're correct. Thank you for pointing that out! Intuitively, I would expect both scoring methods (average log probability and average probability) to produce similar scores in aggregate across StereoSet, since many attribute words consist of a single token. However, I'll need to re-run the models reported in the StereoSet paper to verify. For practical purposes, both methods are sensible scoring techniques.
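
One way to see why single-token attribute words are unaffected while multi-token ones can be ranked differently: for a word with token probabilities $p_1, \dots, p_n$ (notation assumed here, not taken from the paper), the average log probability is the log of the geometric mean, which by the AM-GM inequality never exceeds the log of the arithmetic mean:

$$
\frac{1}{n}\sum_{i=1}^{n} \log p_i
= \log\Bigl(\prod_{i=1}^{n} p_i\Bigr)^{1/n}
\le \log\Bigl(\frac{1}{n}\sum_{i=1}^{n} p_i\Bigr)
$$

For $n = 1$ the two sides are equal, so the two scores rank single-token words identically. For $n > 1$ the geometric mean is pulled down more strongly by one very unlikely token, which is how the rankings can diverge.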
