Character duplication #184

Open
wants to merge 7 commits into
base: main
Changes from 3 commits
21 changes: 21 additions & 0 deletions transformations/character_duplication/README.md
@@ -0,0 +1,21 @@
# Character Duplication
This perturbation adds noise to any type of text source (sentence, paragraph, etc.), imitating the noise that arises from keyboard typos and common spelling errors.

Author name: Marco Di Giovanni
Author email: marco.digiovanni@polimi.it
Author Affiliation: Politecnico di Milano and University of Bologna



## What type of a transformation is this?
This transformation acts as a perturbation to test robustness.
A few letters picked at random are duplicated.
The generated texts remain highly similar to the source sentences.

## What tasks does it intend to benefit?
- This perturbation would benefit all tasks that take a sentence/paragraph/document as input, such as text classification, text generation, etc.
- The generated texts mimic typing mistakes.

## What are the limitations of this transformation?
- This transformation is not capable of generating linguistically diverse text.
- This transformation will mainly affect the performance of token/word-level models, while character-level models should be much more robust.
1 change: 1 addition & 0 deletions transformations/character_duplication/__init__.py
@@ -0,0 +1 @@
from .transformation import *
50 changes: 50 additions & 0 deletions transformations/character_duplication/test.json
@@ -0,0 +1,50 @@
{
  "type": "character_duplication",
  "test_cases": [
    {
      "class": "CharacterDuplication",
      "inputs": {
        "sentence": "Andrew finally returned the French book to Chris that I bought last week"
      },
      "outputs": [{
        "sentence": "Anndrew ffinnallly returrned thee French book too Chhris that I bought last week"
Contributor

Triple duplication in the same word doesn't seem like a typical situation. I would suggest adding some rules to limit the generation of such unlikely human input.

Contributor Author

Here the triple letter appears only because one of the two ‘l’ characters in “finally” was duplicated, giving the same letter three times in total.
I am not sure how likely this is in real data compared with the duplication of a character that appears only once in the word.
However, I believe trained models should be able to process words like “ffinallly” in a similar way to “finally”, since humans can easily understand the meaning of a word with this kind of typo.
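
To make the point above concrete, here is a minimal sketch of how doubling one character of an existing double letter yields a run of three. The helper names `duplicate_at` and `longest_run` are hypothetical and are not part of this PR.

```python
from itertools import groupby


def duplicate_at(word: str, index: int) -> str:
    """Return `word` with the character at `index` doubled."""
    return word[:index] + word[index] + word[index:]


def longest_run(word: str) -> int:
    """Length of the longest run of identical consecutive characters."""
    return max((len(list(group)) for _, group in groupby(word)), default=0)


word = "finally"
perturbed = duplicate_at(word, word.index("l"))  # double the first 'l'
print(perturbed)               # finallly
print(longest_run(word))       # 2
print(longest_run(perturbed))  # 3
```

A check such as `longest_run` is one possible way to detect (and, if desired, reject) outputs with triple letters, should such a rule be wanted.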

      }]
    },
    {
      "class": "CharacterDuplication",
      "inputs": {
        "sentence": "Sentences with gapping, such as Paul likes coffee and Mary tea, lack an overt predicate to indicate the relation between two or more arguments."
      },
      "outputs": [{
        "sentence": "Seentencees witth gappiing, succhh as Paul likess cooffee and Mary tea, lackk an overt predicate ttoo indiicate tthe relation between two orr moree arrguuments."
      }]
    },
    {
      "class": "CharacterDuplication",
      "inputs": {
        "sentence": "Alice in Wonderland is a 2010 American live-action/animated dark fantasy adventure film"
      },
      "outputs": [{
        "sentence": "Allice inn WWondderland is a 200110 American livve-aaction/animated dark fanntasy adventure film"
Contributor

The same goes for the doubled letter at the beginning of a word and for the six-digit number that should represent a year. Please consider adding some rules to change that behaviour.

Contributor Author

For the same reason as before, I disagree about the doubled letter at the beginning, but I agree with you about not duplicating digits. I have added a rule to exclude digits from duplication in eb09bbc. Thank you for the suggestion.
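
The change made in eb09bbc is not shown in this view, so the following is only a sketch of what a digit-exclusion rule could look like, not the actual commit. The function name `duplicate_skip_digits` and the use of a local `random.Random` instance are assumptions.

```python
import random


def duplicate_skip_digits(text, prob=0.1, seed=42):
    """Duplicate each non-digit character with probability `prob`.

    Hypothetical variant of `duplicate`: digits are always copied once.
    """
    rng = random.Random(seed)  # local generator, leaves global state untouched
    out = []
    for ch in text:
        out.append(ch)
        # Only non-digit characters are candidates for duplication.
        if not ch.isdigit() and rng.random() <= prob:
            out.append(ch)
    return "".join(out)


# With prob=1.0 every non-digit character is doubled and digits are kept as-is.
print(duplicate_skip_digits("a1b2", prob=1.0))  # aa1bb2
```

With a rule of this kind, a year such as "2010" would pass through unchanged regardless of `prob`.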

      }]
    },
    {
      "class": "CharacterDuplication",
      "inputs": {
        "sentence": "Ujjal Dev Dosanjh served as 33rd Premier of British Columbia from 2000 to 2001"
      },
      "outputs": [{
        "sentence": "Ujjjal Deev Dossanjh seerved ass 33rd Premier oof BBritish Columbia from 20000 to 2001"
      }]
    },
    {
      "class": "CharacterDuplication",
      "inputs": {
        "sentence": "Neuroplasticity is a continuous processing allowing short-term, medium-term, and long-term remodeling of the neuronosynaptic organization."
      },
      "outputs": [{
        "sentence": "Neeuroplaastticiity is aa continnuuous processingg alllowing short-term, mediium-term, and long-terrmm remoodelingg of the neuronosynaptic orrganizzatiionn."
      }]
    }
  ]
}
44 changes: 44 additions & 0 deletions transformations/character_duplication/transformation.py
@@ -0,0 +1,44 @@
import itertools
import random

from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType


Contributor

Please consider adding doc strings, comments and error handling logic.

Contributor Author

I added a brief docstring in eb09bbc. I believe the code is simple enough to be understood without further comments.

Contributor

Please add descriptions of your arguments, using the docstring convention:
`def complex(real=0.0, imag=0.0):
    """Form a complex number.

    Keyword arguments:
    real -- the real part (default 0.0)
    imag -- the imaginary part (default 0.0)
    """
    if imag == 0.0 and real == 0.0:
        return complex_zero
    ...`

as stated in the official docstring convention for Python: https://www.python.org/dev/peps/pep-0257/
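
Applied to the `duplicate` function from this PR, the requested convention might look roughly like the sketch below. The function body is the one in the diff; the wording of the argument descriptions is an assumption and not taken from the author's commits.

```python
import itertools
import random


def duplicate(text, prob=0.1, seed=42, max_outputs=1):
    """Duplicate randomly chosen characters of a text.

    Arguments:
    text -- the input string to perturb
    prob -- probability of duplicating each character (default 0.1)
    seed -- seed for the random number generator (default 42)
    max_outputs -- number of perturbed strings to return (default 1)

    Returns a list of `max_outputs` perturbed strings.
    """
    random.seed(seed)

    original_text = list(text)
    perturbed_texts = []
    for _ in itertools.repeat(None, max_outputs):
        # Each character becomes [c] or, with probability `prob`, [c, c].
        perturbed_text = [
            [letter] if random.random() > prob else [letter, letter]
            for letter in original_text
        ]
        # Flatten the list of lists back into a single character sequence.
        perturbed_text = [
            letter for sublist in perturbed_text for letter in sublist
        ]
        perturbed_texts.append("".join(perturbed_text))
    return perturbed_texts
```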

Contributor

Don't forget error handling: what happens if the user passes an illegal value for one of the parameters? Will they get a human-readable message pointing out what they did wrong, or a generic Python traceback when the bad parameter breaks the code?
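
One way to address this is to validate the parameters up front and raise errors with descriptive messages. The sketch below is an illustration only: the helper name `validate_duplicate_params` and the accepted ranges (`prob` in [0, 1], `max_outputs` at least 1) are assumptions, not something the PR defines.

```python
def validate_duplicate_params(prob, max_outputs):
    """Raise a descriptive error if a parameter has an illegal value."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(prob, bool) or not isinstance(prob, (int, float)):
        raise TypeError(f"prob must be a number between 0 and 1, got {prob!r}")
    if not 0.0 <= prob <= 1.0:
        raise ValueError(f"prob must be between 0 and 1, got {prob}")
    if isinstance(max_outputs, bool) or not isinstance(max_outputs, int):
        raise TypeError(f"max_outputs must be an integer, got {max_outputs!r}")
    if max_outputs < 1:
        raise ValueError(f"max_outputs must be at least 1, got {max_outputs}")


# Example: an out-of-range probability produces a readable message.
try:
    validate_duplicate_params(prob=1.5, max_outputs=1)
except ValueError as err:
    print(err)  # prob must be between 0 and 1, got 1.5
```

Such a helper could be called at the start of `duplicate` or in `CharacterDuplication.__init__`, so that a bad value fails early rather than silently producing unexpected output.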

def duplicate(text, prob=0.1, seed=42, max_outputs=1):
    random.seed(seed)

    original_text = list(text)
    perturbed_texts = []
    for _ in itertools.repeat(None, max_outputs):
        perturbed_text = [
            [letter] if random.random() > prob else [letter, letter]
            for letter in original_text
        ]
        perturbed_text = [
            letter for sublist in perturbed_text for letter in sublist
        ]
        perturbed_texts.append("".join(perturbed_text))
    return perturbed_texts


class CharacterDuplication(SentenceOperation):
    tasks = [
        TaskType.TEXT_CLASSIFICATION,
        TaskType.TEXT_TO_TEXT_GENERATION,
        TaskType.TEXT_TAGGING,
Contributor

How is TaskType.TEXT_TAGGING relevant to this PR?

Contributor Author

Thank you for spotting this. You are completely right, and I have removed it in eb09bbc.

    ]
    languages = ["All"]

    def __init__(self, seed=42, max_outputs=1, prob=0.1):
        super().__init__(seed, max_outputs=max_outputs)
        self.prob = prob

    def generate(self, sentence: str):
        perturbed_texts = duplicate(
            text=sentence,
            prob=self.prob,
            seed=self.seed,
            max_outputs=self.max_outputs,
        )
        return perturbed_texts