
Understanding adequacy metric #18

Closed
cegersdoerfer opened this issue Jun 21, 2022 · 2 comments

@cegersdoerfer

Hi, I have been using the filters file from this repo to experiment with evaluating some paraphrases I created with various models, but I noticed that the adequacy score gives some unexpected results, so I was wondering if you could tell me more about how it was trained.
I noticed that if the paraphrase and the original are exactly the same, the adequacy is quite low (around 0.7–0.8), while a paraphrase that is shorter or longer than the original generally gets a much higher score. For example, with the original "I need to buy a house in the neighborhood":

  • Paraphrase "I need to buy a house" (shorter) scores 0.98.
  • Paraphrase "I need to buy a house in the neighborhood where I want to live" (longer) scores even higher, 0.99.
  • Paraphrase "I need to buy a house in the neighborhood" (the exact same sentence as the original) scores 0.7, and the same sentence with a period at the end scores 0.8.
This makes me think that the adequacy model takes into account how much the new sentence differs from the original, in addition to how well its meaning is preserved.
Since the README states that adequacy measures whether or not the paraphrase preserves the meaning of the original, it is confusing to me that using the same sentence as both the original and the paraphrase does not get a high score. Could you clarify?
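
For context, here is a rough sketch of how I am comparing the pairs above. `adequacy_score` is a hypothetical stand-in for whatever callable the filters file exposes for scoring an (original, paraphrase) pair; I'm assuming it returns a float in [0, 1].

```python
# Rough sketch of the comparison described above. `adequacy_score` is a
# hypothetical stand-in for the repo's adequacy filter call; it is assumed
# to take (original, paraphrase) and return a float in [0, 1].
from typing import Callable

def compare_paraphrases(adequacy_score: Callable[[str, str], float]) -> None:
    original = "I need to buy a house in the neighborhood"
    candidates = {
        "shorter paraphrase": "I need to buy a house",
        "longer paraphrase": "I need to buy a house in the neighborhood where I want to live",
        "identical sentence": "I need to buy a house in the neighborhood",
        "identical + period": "I need to buy a house in the neighborhood.",
    }
    for label, paraphrase in candidates.items():
        # Observed scores were roughly 0.98, 0.99, 0.7, and 0.8 respectively.
        print(f"{label:>20}: {adequacy_score(original, paraphrase):.2f}")
```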

@PrithivirajDamodaran (Owner)

  • The model is framed as a sentence-pair regression task. By design it checks whether the core intent is preserved, and even if the intent changes slightly the model is sensitive enough to catch it. Look at example 1 in the screenshot below.
  • Don't conflate this with other sentence-pair regression tasks like STS, STS2, or STS5. The strange behavior you saw was due to a bug: the raw logits coming out of the model were being used as scores. Fixed it (see the sketch after this list). With or without a period you shouldn't see a different score, or at most a negligible difference.
  • Now it should work. Look at example 2 (yours) in the screenshot below.
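
To illustrate the raw-logits issue, here is a minimal sketch, assuming the adequacy model is a Hugging Face sequence-classification cross-encoder applied to (original, paraphrase) pairs. The checkpoint name and the class ordering below are placeholders, not the repo's actual values.

```python
# Minimal sketch of the raw-logits bug vs. the normalized score, assuming the
# adequacy model is a Hugging Face sequence-classification cross-encoder over
# (original, paraphrase) pairs. MODEL_NAME and the class index are placeholders,
# not the actual checkpoint used by this repo.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "some-org/some-adequacy-model"  # placeholder checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def adequacy_score(original: str, paraphrase: str) -> float:
    # Encode the two sentences as a single pair, as in sentence-pair regression.
    inputs = tokenizer(original, paraphrase, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # raw, unbounded scores
    # Buggy behavior: returning a raw logit directly. Logits are not probabilities,
    # so an identical pair can look "less adequate" than a changed one.
    # Fixed behavior: normalize first, e.g. softmax over the classes.
    probs = torch.softmax(logits, dim=-1)
    return probs[0, -1].item()  # assumes the last class means "meaning preserved"
```

With the normalized score, an identical (original, paraphrase) pair should land near the top of the scale, and a trailing period should make little to no difference.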

[Screenshot, 2022-06-25: examples 1 and 2 referenced above]

@cegersdoerfer (Author)

Thank you @PrithivirajDamodaran for clearing that up and fixing the bug!
