Character duplication #184

Open
wants to merge 7 commits into
base: main
21 changes: 21 additions & 0 deletions transformations/character_duplication/README.md
@@ -0,0 +1,21 @@
# Character Duplication
This perturbation adds noise to all types of text sources (sentence, paragraph, etc.), imitating the noise that keyboard typos introduce in the form of common spelling errors.

Author name: Marco Di Giovanni
Author email: marco.digiovanni@polimi.it
Author Affiliation: Politecnico di Milano and University of Bologna



## What type of a transformation is this?
This transformation acts like a perturbation to test robustness.
A few letters, picked at random, are duplicated.
Generated transformations display high similarity to the source sentences.

## What tasks does it intend to benefit?
- This perturbation would benefit all tasks which have a sentence/paragraph/document as input like text classification, text generation, etc.
- The generated texts mimic typing mistakes.

## What are the limitations of this transformation?
- This transformation is not capable of generating linguistically diverse text.
- This transformation will mainly affect the performance of token/word-level models, while character-level models should be much more robust.
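
To illustrate the last point, here is a small sketch that is not part of the transformation itself. The toy vocabulary and the helper names `char_bigrams` and `bigram_overlap` are assumptions made only for this example. A word-level lookup treats a perturbed token as unknown, while most of its character bigrams are still shared with the original word.

```python
def char_bigrams(word):
    """Return the set of character bigrams in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def bigram_overlap(a, b):
    """Jaccard overlap between the character bigram sets of two words."""
    x, y = char_bigrams(a), char_bigrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


# Toy word-level vocabulary (illustrative only, not from this PR).
vocab = {"finally", "returned", "book"}

original, perturbed = "finally", "ffinallly"
print(perturbed in vocab)  # word-level lookup misses the perturbed token
print(bigram_overlap(original, perturbed))  # most bigrams are shared
```

This only hints at why character-level models may cope better; it is not a measurement of any real model.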
1 change: 1 addition & 0 deletions transformations/character_duplication/__init__.py
@@ -0,0 +1 @@
from .transformation import *
50 changes: 50 additions & 0 deletions transformations/character_duplication/test.json
@@ -0,0 +1,50 @@
{
"type": "character_duplication",
"test_cases": [
{
"class": "CharacterDuplication",
"inputs": {
"sentence": "Andrew finally returned the French book to Chris that I bought last week"
},
"outputs": [{
"sentence": "Anndrew ffinnallly returrned thee French book too Chhris that I bought last week"
Contributor

Triple duplication in the same word doesn't seem like a typical situation. I would suggest adding some rules to limit the generation of such unlikely human input.

Contributor Author

Here the triple occurs only because one of the two ‘l’ characters in the word “finally” was duplicated, giving the same letter three times in total.
I am not sure how likely this is in real data compared with the duplication of characters that appear only once in the word.
However, I believe that trained models should be able to process words like “ffinallly” in a similar way to “finally”, since humans can easily understand the meaning of the word with this kind of typo.
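
For reference, one way the reviewer's suggestion could be explored is sketched below. It is only a sketch: the function name `duplicate_capped` and the parameter `max_per_word` are hypothetical and not part of this PR. It caps the number of duplications per whitespace-separated word and skips characters that already sit next to an identical character, so runs such as "lll" cannot be produced.

```python
import random


def duplicate_capped(text, prob=0.1, seed=42, max_per_word=1):
    """Hypothetical variant: duplicate characters with a per-word cap.

    Not part of the PR; `max_per_word` is an illustrative parameter.
    A character is skipped if it is a digit, whitespace, or already
    adjacent to an identical character (to avoid runs such as "lll").
    """
    rng = random.Random(seed)  # local RNG, leaves the global state alone
    out = []
    used = 0  # duplications made so far in the current word
    for i, ch in enumerate(text):
        if ch.isspace():
            used = 0  # a new word starts after whitespace
            out.append(ch)
            continue
        prev_same = i > 0 and text[i - 1] == ch
        next_same = i + 1 < len(text) and text[i + 1] == ch
        eligible = (
            not ch.isdigit()
            and used < max_per_word
            and not prev_same
            and not next_same
        )
        # The RNG is consulted only for eligible characters.
        if eligible and rng.random() <= prob:
            out.extend([ch, ch])
            used += 1
        else:
            out.append(ch)
    return "".join(out)
```

Whether such a cap matches real typing behaviour is an open question; the sketch only shows that the rule is cheap to add if it turns out to be wanted.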

}]
},
{
"class": "CharacterDuplication",
"inputs": {
"sentence": "Sentences with gapping, such as Paul likes coffee and Mary tea, lack an overt predicate to indicate the relation between two or more arguments."
},
"outputs": [{
"sentence": "Seentencees witth gappiing, succhh as Paul likess cooffee and Mary tea, lackk an overt predicate ttoo indiicate tthe relation between two orr moree arrguuments."
}]
},
{
"class": "CharacterDuplication",
"inputs": {
"sentence": "Alice in Wonderland is a 2010 American live-action/animated dark fantasy adventure film"
},
"outputs": [{
"sentence": "Allice inn WWondderland is a 2010 AAmmerican live-acctioon/animated dark fantasyy adventure film"
}]
},
{
"class": "CharacterDuplication",
"inputs": {
"sentence": "Ujjal Dev Dosanjh served as 33rd Premier of British Columbia from 2000 to 2001"
},
"outputs": [{
"sentence": "Ujjjal Deev Dossanjh seerved ass 33rd Premier of Briitish Columbia from 2000 to 2001"
}]
},
{
"class": "CharacterDuplication",
"inputs": {
"sentence": "Neuroplasticity is a continuous processing allowing short-term, medium-term, and long-term remodeling of the neuronosynaptic organization."
},
"outputs": [{
"sentence": "Neeuroplaastticiity is aa continnuuous processingg alllowing short-term, mediium-term, and long-terrmm remoodelingg of the neuronosynaptic orrganizzatiionn."
}]
}
]
}
56 changes: 56 additions & 0 deletions transformations/character_duplication/transformation.py
@@ -0,0 +1,56 @@
import random

from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType


Contributor

Please consider adding doc strings, comments and error handling logic.

Contributor Author

I added a brief doc string in eb09bbc. I believe that the code is simple enough to understand everything without the need of more comments.

Contributor

Please add a description of your arguments, using the docstring convention:

```python
def complex(real=0.0, imag=0.0):
    """Form a complex number.

    Keyword arguments:
    real -- the real part (default 0.0)
    imag -- the imaginary part (default 0.0)
    """
    if imag == 0.0 and real == 0.0:
        return complex_zero
    ...
```

as stated in the official docstring convention for Python: https://www.python.org/dev/peps/pep-0257/

Contributor

Don't forget about error handling logic: what happens if the user passes an illegal value for one of the parameters? Will they receive a human-readable message pointing out what they did wrong, or a generic Python error when the bad parameter breaks the code?
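
A minimal sketch of what such checks might look like, assuming the `text`, `prob`, `seed`, and `max_outputs` parameters used in this PR. The helper name `validate_duplication_args`, the accepted ranges, and the error messages are hypothetical and would need to be agreed with the author.

```python
def validate_duplication_args(text, prob, seed, max_outputs):
    """Raise a readable error for illegal arguments (hypothetical helper).

    Keyword arguments:
    text -- the string to perturb
    prob -- per-character duplication probability, expected in [0, 1]
    seed -- integer seed for the random number generator
    max_outputs -- number of perturbed strings to return, at least 1
    """
    if not isinstance(text, str):
        raise TypeError(f"text must be a str, got {type(text).__name__}")
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(prob, bool) or not isinstance(prob, (int, float)):
        raise TypeError(f"prob must be a number, got {type(prob).__name__}")
    # This comparison is also False for NaN, which is therefore rejected.
    if not 0.0 <= prob <= 1.0:
        raise ValueError(f"prob must be between 0 and 1, got {prob}")
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError(f"seed must be an int, got {type(seed).__name__}")
    if isinstance(max_outputs, bool) or not isinstance(max_outputs, int):
        raise TypeError(
            f"max_outputs must be an int, got {type(max_outputs).__name__}"
        )
    if max_outputs < 1:
        raise ValueError(f"max_outputs must be at least 1, got {max_outputs}")
```

Calling a helper like this at the top of `duplicate` would turn a bad argument into a message that names the offending parameter.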

def duplicate(text, prob=0.1, seed=42, max_outputs=1):
    """
    This function duplicates random chars (not digits) in the text string, with specified probability. It returns a list of different perturbed strings, whose length is specified by max_outputs.
    """
    random.seed(seed)

    original_text = list(text)
    perturbed_texts = []
    for _ in range(max_outputs):
        perturbed_text = [
            [letter]
            if letter.isdigit() or random.random() > prob
            else [letter, letter]
            for letter in original_text
        ]
        perturbed_text = [
            letter for sublist in perturbed_text for letter in sublist
        ]
        perturbed_texts.append("".join(perturbed_text))
    return perturbed_texts


class CharacterDuplication(SentenceOperation):
    tasks = [
        TaskType.TEXT_CLASSIFICATION,
        TaskType.TEXT_TO_TEXT_GENERATION,
    ]
    languages = ["All"]
    keywords = [
        "morphological",
        "noise",
        "rule-based",
        "highly-meaning-preserving",
        "high-precision",
        "high-coverage",
        "high-generations",
    ]

    def __init__(self, seed=42, max_outputs=1, prob=0.1):
        super().__init__(seed, max_outputs=max_outputs)
        self.prob = prob

    def generate(self, sentence: str):
        perturbed_texts = duplicate(
            text=sentence,
            prob=self.prob,
            seed=self.seed,
            max_outputs=self.max_outputs,
        )
        return perturbed_texts