Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Character duplication #184

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

marco-digio
Copy link
Contributor

No description provided.

@kaustubhdhole
Copy link
Collaborator

Hi @marco-digio thank you very much for your changes. I would suggest combining character_duplication, underscore_trick into one. Also, please mention in the README about other previous PRs (created as well as merged) which are similar to your PR.

@marco-digio
Copy link
Contributor Author

Hi @marco-digio thank you very much for your changes. I would suggest combining character_duplication, underscore_trick into one. Also, please mention in the README about other previous PRs (created as well as merged) which are similar to your PR.

Hi @kaustubhdhole thank you. I am sorry but I messed up a bit with git. I accidentally included an old commit in the new branch. Now it is fixed by removing the underscore_trick to the character_duplication pull request, since I already did a separate underscore_trick pull request.
However, if you believe that they should be merged into a single transformation, I will fix it, but I think that they should remain separate since they affect the texts in quite different ways.

Copy link
Contributor

@uyaseen uyaseen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @marco-digio, I am a reviewer assigned to this PR. Overall, everything looks good, you just need to include keywords for the transformation as explained here. I will provide the final feedback after you have added the keywords.

Few minor comments:

  1. There is a typo in the last line of README.md ("perfornamce")
  2. If this transformation was proposed in any existing work then please include the relevant citation.

@marco-digio
Copy link
Contributor Author

Hi @marco-digio, I am a reviewer assigned to this PR. Overall, everything looks good, you just need to include keywords for the transformation as explained here. I will provide the final feedback after you have added the keywords.

Few minor comments:

  1. There is a typo in the last line of README.md ("perfornamce")
  2. If this transformation was proposed in any existing work then please include the relevant citation.

Thank you @uyaseen for the feedback. I have inserted the keywords now in 22a16e7 and I have fixed the README typo in 97866a7 .
To the best of my knowledge there is no existing work proposing this transformation. However, if I have missed something, I will surely include it as relevant citation.

Copy link
Contributor

@uyaseen uyaseen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marco-digio thanks for making the changes.

Here's my general review:

Clarity: The README clearly explains the transformation
Correctness: All checks have passed
Interface: The interface seems correct

Adding New Libraries: No new libraries were added
Test Cases: 5 test cases added
Evaluating Robustness: Robustness evaluation is not yet conducted

"sentence": "Andrew finally returned the French book to Chris that I bought last week"
},
"outputs": [{
"sentence": "Anndrew ffinnallly returrned thee French book too Chhris that I bought last week"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Triple duplication in the same word doesn't seem like a typical situation. I would suggest adding some rules to limit the generation of such unlikely human input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here triple duplication happens just because one of the two ‘l’ chars in the word “finally” was duplicated, obtaining the same letter 3 times in total.
I am not sure how likely is this in real data with respect to duplication of characters that appears once in the word.
However I believe that trained models should be able to process words like “ffinallly” in the similar way as “finally”, since humans can easily understand the meaning of the word with this kind of typo.

"sentence": "Alice in Wonderland is a 2010 American live-action/animated dark fantasy adventure film"
},
"outputs": [{
"sentence": "Allice inn WWondderland is a 200110 American livve-aaction/animated dark fanntasy adventure film"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same is for the double letter in the beginning or a 6-figure number, which should represent a year. Please consider adding some rules to change that behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the same reason as before, I disagree about the double letter in the beginning, but I agree with you about not duplicating digits. I have added a rule to exclude digits from duplication in eb09bbc. Thank you for the suggestion

from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider adding doc strings, comments and error handling logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a brief doc string in eb09bbc. I believe that the code is simple enough to understand everything without the need of more comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the description of your arguments, using the doc string convetion:
`def complex(real=0.0, imag=0.0):
"""Form a complex number.

Keyword arguments:
real -- the real part (default 0.0)
imag -- the imaginary part (default 0.0)
"""
if imag == 0.0 and real == 0.0:
    return complex_zero
...`

as stated in the official doc string convention for Python: https://www.python.org/dev/peps/pep-0257/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget about error handling logic - what happens if the user enters the illegal value for some of the parameters? Will he receive a human-readable message, pointing out what he/she did wrong or a generic Python error log, when the wrong parameter will break the code?

tasks = [
TaskType.TEXT_CLASSIFICATION,
TaskType.TEXT_TO_TEXT_GENERATION,
TaskType.TEXT_TAGGING,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How the TaskType.TEXT_TAGGING is relevant to this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for spotting this, you are completely right and I have removed it in eb09bbc

@asnota
Copy link
Contributor

asnota commented Sep 28, 2021

Thank you for the contribution. Could you please explain the value or your PR compared to the [transformatio]https://github.com/GEM-benchmark/NL-Augmenter/tree/main/transformations/butter_fingers_perturbation), which addresses the same typo issue?

@marco-digio
Copy link
Contributor Author

Thank you for the contribution. Could you please explain the value or your PR compared to the [transformatio]https://github.com/GEM-benchmark/NL-Augmenter/tree/main/transformations/butter_fingers_perturbation), which addresses the same typo issue?

The transformation is similar, because it adds noise similar to typos. However, Butter Fingers Perturbation swap two characters, while this PR (character duplication) duplicate a character.
Example:
Original sentence: "benchmark"
Butter Fingers Perturbation (possible) output: "benchnark" ("m" and "n" are close in the english keyboard)
Character Duplication (possible) output: "benchmmark"

I hope that this clarifies your doubts @asnota

from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the description of your arguments, using the doc string convetion:
`def complex(real=0.0, imag=0.0):
"""Form a complex number.

Keyword arguments:
real -- the real part (default 0.0)
imag -- the imaginary part (default 0.0)
"""
if imag == 0.0 and real == 0.0:
    return complex_zero
...`

as stated in the official doc string convention for Python: https://www.python.org/dev/peps/pep-0257/

from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget about error handling logic - what happens if the user enters the illegal value for some of the parameters? Will he receive a human-readable message, pointing out what he/she did wrong or a generic Python error log, when the wrong parameter will break the code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants