
Generate Yes-or-No Question from Statement #126

Merged (73 commits) on Sep 30, 2021

Conversation

@boyleconnor (Contributor)

No description provided.

@boyleconnor boyleconnor marked this pull request as draft July 14, 2021 00:11
@boyleconnor (Contributor, Author)

On GitHub CI, I am getting some weird behavior from spaCy's en_core_web_sm model, which is causing issues downstream. I left a debugging breadcrumb (see previous link), and it looks like the tokenization is off (e.g. "isn't" isn't getting split into "is" and "n't", as it should) and "isn't" is getting tagged as a PROPN (proper noun) by the POS tagger.

TL;DR: spaCy's NLP model seems to have suddenly become incapable of splitting up English contractions (e.g. "isn't"), and the single token comprising the contraction then apparently gets tagged as a proper noun. This bug occurs in the CI run, but I can't replicate it locally.
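For context, the expected behavior can be sketched without spaCy itself; this hypothetical helper mimics the split that the tokenizer's suffix rules normally produce for English negative contractions:

```python
import re

# Hypothetical helper mirroring the split spaCy's English tokenizer
# normally performs on negative contractions: "isn't" -> ["is", "n't"].
# This is an illustration of the expected output, not spaCy's actual code.
def split_contraction(token):
    match = re.fullmatch(r"(\w+)(n't)", token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]

print(split_contraction("isn't"))  # ["is", "n't"]
print(split_contraction("right"))  # ["right"]
```

On CI, the model instead leaves "isn't" as one token, which is what breaks the downstream POS-based logic.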

@AbinayaM02 (Collaborator)

> TL;DR: Spacy's NLP model seems to suddenly have become incapable of splitting up English contractions (e.g. "isn't"), and the one token comprising this contraction then gets read as a proper noun, apparently. This bug occurs in the CI run, but I can't seem to replicate it locally.

This behavior is caused by the issue explained in #209. I have also suggested a possible fix there. Let me know if you need more information.

@AbinayaM02 AbinayaM02 self-requested a review September 6, 2021 09:28
@boyleconnor (Contributor, Author)

@AbinayaM02 thank you for the bugfix/advice re: the Spacy tokenizer.

@AbinayaM02 (Collaborator) commented Sep 20, 2021

> @AbinayaM02 thank you for the bugfix/advice re: the Spacy tokenizer.

You're welcome. By the way, please add your transformation to the corresponding dictionary in test/mapper.py (under map_transformation["light"], since yours is a light transformation) so that the workflow picks up your transformation for testing. You may need to fetch and merge the upstream main branch into your branch to see the test/mapper.py file.
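For illustration, the registration step might look roughly like this (the dictionary shape and the entry names are assumptions, not the actual contents of test/mapper.py):

```python
# Sketch of registering a transformation in test/mapper.py.
# "butter_fingers_perturbation" and "yes_no_question" are hypothetical
# module names used only to show where the new entry goes.
map_transformation = {
    "light": [
        "butter_fingers_perturbation",  # hypothetical existing entry
        "yes_no_question",              # this PR's transformation
    ],
    "heavy": [],
}

assert "yes_no_question" in map_transformation["light"]
```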

@boyleconnor (Contributor, Author)

Thanks for your help, @AbinayaM02. Unfortunately, I am having trouble running the evaluation script (see issue #300). Could someone help by running it locally and adding the results, or by resolving the issue?

@AbinayaM02 (Collaborator)

> Thanks for your help, @AbinayaM02 . Unfortunately, I am having trouble running the evaluation script (see issue #300). Could someone help to run it locally and add the results? Or resolve the issue?

The issue is with the torchtext version in diverse_paraphrase's requirements.txt. Pin torchtext to version 0.9.1. Once that's done, also download en_core_web_sm using the command:

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz

You should then be able to run the evaluation :) [I tested diverse paraphrase after pinning the torchtext version; it works.]
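Concretely, the pin described above would be a single line in diverse_paraphrase's requirements.txt (any other entries in that file are unchanged and omitted here):

```
torchtext==0.9.1
```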

@kaustubhdhole (Collaborator)

I think this PR looks great! You might also want to add keywords and the robustness evaluation to make it stronger. Besides, it would be really interesting to see whether tasks like Question-Answering actually benefit or suffer from such a conversion. Also nudging @AbinayaM02 here.

@AbinayaM02 (Collaborator) left a comment:


Hi @cascadianblue: Thank you for an interesting transformation. I have fixed the pytest failure.

Here are my general observations:

  1. The transformation is novel and interesting, and it has not been implemented already.
  2. Test cases are added.
  3. New requirements are specified.
  4. The README is updated with all the necessary information.

I see that the docstrings don't have parameter explanations. Please add them wherever they are missing and we're good to go. Also, add the evaluation result. Approving the PR!

Edit:
@kaustubhdhole: I'm merging this PR since the build is failing on the main branch and the fix for that issue is part of this PR. Docstrings and evaluation can be added by @cascadianblue later as a separate PR.
