Corrupt an input text to test NLP models' robustness.
For details refer to https://nlp-demo.readthedocs.io
```
pip install wild-nlp
```
Altogether, we defined and implemented 11 aspects of text corruption.
Randomly removes articles or swaps them for wrong ones.
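The idea behind article corruption can be sketched in a few lines of plain Python. This is an illustrative toy, not wildnlp's implementation; the function name and the 50/50 remove-or-swap split are assumptions.

```python
import random

ARTICLES = ("a", "an", "the")

def corrupt_articles(text, rng=None):
    """Randomly drop each article or swap it for a different (wrong) one.

    A toy sketch of the aspect described above, not the library's code.
    """
    rng = rng or random.Random()
    out = []
    for token in text.split():
        if token.lower() in ARTICLES:
            if rng.random() < 0.5:
                continue  # remove the article entirely
            # swap it for a different article
            token = rng.choice([a for a in ARTICLES if a != token.lower()])
        out.append(token)
    return " ".join(out)
```

All non-article words pass through untouched, so only the articles carry the corruption.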
Converts numbers into words. Handles floating-point numbers as well.
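A toy sketch of number-to-word conversion that spells numbers digit by digit, reading "." as "point". The library's actual conversion may well produce full number names instead; the digit-wise scheme below is an assumption for illustration only.

```python
# Hypothetical digit-wise converter, not wildnlp's implementation.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    ".": "point",  # lets the same table handle floating-point numbers
}

def digits_to_words(token):
    """Spell a numeric token out digit by digit; leave other tokens alone."""
    if token and all(ch in DIGIT_WORDS for ch in token):
        return " ".join(DIGIT_WORDS[ch] for ch in token)
    return token

def convert_numbers(text):
    return " ".join(digits_to_words(tok) for tok in text.split())
```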
Misspells words appearing in the Wikipedia list of:
- commonly misspelled English words
Randomly adds or removes specified punctuation marks.
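Punctuation corruption can be sketched as two random processes in one pass: dropping existing marks and occasionally inserting spurious ones. The function name, probabilities, and insertion rule below are assumptions, not the library's behaviour.

```python
import random

def corrupt_punctuation(text, marks=",.", p=0.3, rng=None):
    """Randomly drop the given punctuation marks and insert spurious ones.

    A sketch of the idea only; wildnlp's parameters may differ.
    """
    rng = rng or random.Random()
    out = []
    for ch in text:
        if ch in marks and rng.random() < p:
            continue  # remove this punctuation mark
        out.append(ch)
        if ch.isalpha() and rng.random() < p / 10:
            out.append(rng.choice(marks))  # insert a spurious mark
    return "".join(out)
```

Letters are never touched, so the corruption stays confined to punctuation.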
Simulates errors made while writing on a QWERTY-type keyboard.
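A minimal sketch of how keyboard-error simulation can work: each character has a small chance of being replaced by a neighbouring key. The neighbour table is a deliberately partial assumption, not the library's key map.

```python
import random

# Partial map of QWERTY neighbours (illustrative assumption; extend as needed).
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "o": "iklp", "n": "bhjm", "t": "rfgy", "i": "ujko",
}

def qwerty_typos(text, prob=0.1, rng=None):
    """Replace some characters with a neighbouring key to simulate typos."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        neighbours = QWERTY_NEIGHBOURS.get(ch.lower())
        if neighbours and rng.random() < prob:
            out.append(rng.choice(neighbours))
        else:
            out.append(ch)
    return "".join(out)
```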
Randomly removes:
- characters from words or
- white spaces from sentences
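Both removal modes can be sketched with one function; the `target` parameter and probabilities below are assumptions for illustration, not wildnlp's interface.

```python
import random

def remove_chars(text, target="chars", prob=0.1, rng=None):
    """Randomly remove characters from words, or white spaces from sentences.

    A sketch of the two removal modes described above, not the library's code.
    """
    rng = rng or random.Random()
    if target == "whitespace":
        # drop some spaces, keep all other characters
        return "".join(ch for ch in text if not (ch == " " and rng.random() < prob))
    # drop some non-space characters, keep all spaces
    return "".join(ch for ch in text if ch == " " or rng.random() >= prob)
```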
Replaces a random single character with, for example, an asterisk in:
- negative or
- positive words from the Opinion Lexicon.
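Masking sentiment-bearing words can be sketched as follows; the tiny `NEGATIVE_WORDS` set merely stands in for the Opinion Lexicon, and the function name is an assumption.

```python
import random

# Tiny stand-in for the Opinion Lexicon word lists (illustrative assumption).
NEGATIVE_WORDS = {"bad", "awful", "boring"}

def mask_words(text, lexicon=NEGATIVE_WORDS, mask="*", rng=None):
    """Replace one random character of every lexicon word with the mask symbol."""
    rng = rng or random.Random()
    out = []
    for token in text.split():
        if token.lower() in lexicon:
            i = rng.randrange(len(token))
            token = token[:i] + mask + token[i + 1:]
        out.append(token)
    return " ".join(out)
```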
Randomly swaps two characters within a word, excluding punctuation.
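One way to sketch the swap is to exchange two adjacent inner characters; the real aspect may pick any two characters, so treat the details below as assumptions.

```python
import random

def swap_chars(word, rng=None):
    """Swap two adjacent characters inside a word.

    Words shorter than 4 characters are left alone; the first and last
    characters are never moved, so the word stays recognisable.
    A sketch only, not wildnlp's implementation.
    """
    rng = rng or random.Random()
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]
```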
Randomly changes characters according to a chosen dictionary; the default, 'ocr', simulates simple OCR errors.
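Dictionary-driven substitution can be sketched as a per-character lookup. `OCR_CONFUSIONS` below is an illustrative subset of common OCR mix-ups, not the library's actual 'ocr' dictionary.

```python
import random

# A few common OCR confusions (illustrative assumption, not the full dictionary).
OCR_CONFUSIONS = {"l": "1i", "o": "0", "0": "o", "e": "c", "s": "5"}

def ocr_errors(text, prob=0.2, rng=None):
    """Replace some characters with look-alikes from the confusion dictionary."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        substitutes = OCR_CONFUSIONS.get(ch)
        if substitutes and rng.random() < prob:
            out.append(rng.choice(substitutes))
        else:
            out.append(ch)
    return "".join(out)
```

Swapping in a different dictionary changes the error model without touching the loop.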
Randomly adds or removes white spaces (listed as a parameter).
Sub string: randomly adds a substring to simulate more complex signs.
All aspects can be chained together with the wildnlp.aspects.utils.compose function.
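Functionally, chaining aspects behaves like ordinary left-to-right function composition, which can be sketched as below. This is a reimplementation for illustration, not wildnlp's own `compose`.

```python
def compose(*fns):
    """Chain callables so they are applied in the order given.

    A sketch of the composition behaviour, not the library's code.
    """
    def composed(value):
        for fn in fns:
            value = fn(value)
        return value
    return composed
```

Because application order matters, `compose(a, b)` runs `a` first and then `b` on its output.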
Aspects can be applied to any text. Below is the list of datasets for which we have already implemented processing pipelines.
The CoNLL-2003 shared task data for language-independent named entity recognition.
The IMDB dataset containing movie reviews for sentiment analysis. The dataset consists of 50,000 reviews of two classes: negative and positive.
The SNLI dataset supporting the task of natural language inference.
The SQuAD dataset for the Machine Comprehension problem.
```python
from wildnlp.aspects.dummy import Reverser, PigLatin
from wildnlp.aspects.utils import compose
from wildnlp.datasets import SampleDataset

# Create a dataset object and load the dataset
dataset = SampleDataset()
dataset.load()

# Create a composed corruptor function.
# Functions will be applied in the same order they appear.
composed = compose(Reverser(), PigLatin())

# Apply the function to the dataset
modified = dataset.apply(composed)
```