# Text Augmentation using nlpaug

## Introduction

More data can help train models to be more general, with less overfitting. One convenient way to generate additional data is simply to transform the given data into something slightly different, such that it still represents the assigned labels. As well as helping during training, augmentation can also be used when running inference (Test Time Augmentation or TTA).

For an image-based challenge, flips, rotations, etc. can be used to generate a new image that still presents the same class of object as the original. But how do we do this for an NLP challenge?

There is a wide variety of approaches, from character and word substitutions all the way to translating the text from the source language to another and then back, to get a sentence with (hopefully) the same meaning but a different structure.

For more info, vad13irt has already posted a great survey here: https://www.kaggle.com/c/feedback-prize-2021/discussion/295277

For this notebook, I want to focus on what the application of some of these methods actually looks like for the feedback-prize-2021 dataset. In particular, I'll be looking at methods that can be used while preserving the discourse_start and discourse_end annotations given in train.csv.

# The nlpaug Package

I have chosen the nlpaug package, as it seems to have everything I could want to experiment with. I could just have used 'pip install nlpaug', but I've installed it from a dataset to allow this notebook to run with internet turned off.

Documentation is available at: https://nlpaug.readthedocs.io/en/latest/

In [1]:
# Install and import nlpaug package.
!cp -r ../input/nlpaug-from-github/nlpaug-master ./
!pip install nlpaug-master/
!rm -r nlpaug-master

import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.word.context_word_embs as nawcwe
import nlpaug.augmenter.word.word_embs as nawwe
import nlpaug.augmenter.word.spelling as naws

Processing ./nlpaug-master
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: nlpaug
  Building wheel for nlpaug (setup.py) ... done
  Created wheel for nlpaug: filename=nlpaug-1.1.10-py3-none-any.whl size=406197 sha256=2e3e9438ceff88d58b14cddcb350200dbc6f8f8bf31c3bea08ce515fbe9788ac
  Stored in directory: /root/.cache/pip/wheels/43/64/85/ce1afc6a0b63f139f70ea6945d5deebcebed4a875cb186adc8
Successfully built nlpaug
Installing collected packages: nlpaug
Successfully installed nlpaug-1.1.10


# First Attempt

Let's start by taking an arbitrarily selected training text and augmenting it. For this I've picked a very simple augmentation, KeyboardAug, that uses the adjacency of keys on the keyboard to simulate typos. Alongside the original and augmented text, I'm printing the length of the text run through split(). If that changes, the discourse_start and discourse_end annotations given in train.csv will no longer be valid.

In [2]:
from colorama import Fore
from pathlib import Path

# Set some seeds. We want randomness for real usage, but for this tutorial, determinism helps explain some examples.
import numpy as np
np.random.seed(1000)
import random
random.seed(1000)

base_path = Path("../input/feedback-prize-2021/train")

with open(base_path / "3FF2F530D590.txt") as f:
    sample = f.read()
    
print(f"Original: {len(sample)}\n{sample}")

aug = nac.KeyboardAug()
augmented_text = aug.augment(sample)
print(f"\nKeyboard augmentation: {len(augmented_text)}\n{augmented_text}")

Original: 1105
Dear Senator,

I favor keeping the Electoral College in the arguement. One of the reasons I feel that way is that it is harder for someone who is runnig for president to win. To win they would need to win over the votes of most of the small states. Or win over the votes over some of the small states and some of the big states.  So it would need someone who is smart or at least somewhat smart.

Another one of my reasons for going on this side of the arguement is not only do they have to win over the electoral votes they have to win over the poularity votes. The next reason I have is the Electoral College requires the people running for president they have appeal to all regions and not just the west because they were born there or something like that.

Now there are a few bad things about the Electoral College. Like one is the fact that a tie is posible.

Another is that it's out dated. Lastly each party picks a slate of electors to vote for the party's nominee and it's po

Not great, several issues are obvious:
1. The length of the split text has changed as the original had additional white space for formatting.
2. It is hard to see all the differences without highlighting.
3. The augmentation adds digits and special characters, which are unlikely to have been present.
4. The augmentation can change many characters in one word, making the new word too far from the original.
5. The augmented text adds spaces around apostrophes, increasing the split length.

Let's solve these by:
1. Stripping the original text.
2. Adding a diff viewer to highlight the differences.
3. Setting arguments to the augmentation method (include_numeric=False, include_special_char=False).
4. Setting arguments to the augmentation method (aug_char_max=1).
5. Post-processing the augmented text with: replace(" ' ", "'")

In [3]:
# Replace the original text with a version that has extra whitespace removed.
# split() already discards all surrounding whitespace, so no per-word strip() is needed.
sample = " ".join(sample.split())

In [4]:
def print_and_highlight_diff(orig_text, new_texts):
    """ A simple diff viewer for augmented texts. """
    orig_split = orig_text.split()
    print(f"Original: {len(orig_split)}\n{orig_text}\n")
    for new_text in new_texts:
        print(f"Augmented: {len(new_text.split())}")
        for i, word in enumerate(new_text.split()):
            if i < len(orig_split) and word == orig_split[i]:
                print(word, end=" ")
            else:
                print(Fore.RED + word + Fore.RESET, end=" ")
        print()
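Since the whole point is to keep the word count stable, it may be worth checking that automatically rather than by eye. The sketch below is one way to do it; `same_word_count` and `keep_aligned` are hypothetical helper names, not part of nlpaug, and the candidate strings are made up.

```python
from typing import List


def same_word_count(orig_text: str, new_text: str) -> bool:
    """True when both texts split() into the same number of words."""
    return len(orig_text.split()) == len(new_text.split())


def keep_aligned(orig_text: str, candidates: List[str]) -> List[str]:
    """Drop augmented candidates whose word count differs from the original."""
    return [c for c in candidates if same_word_count(orig_text, c)]


orig = "most of the small states"
candidates = [
    "most of the smalk states",        # one typo, same word count
    "most of the pocket size states",  # one word became two
]
print(keep_aligned(orig, candidates))
```

Filtering like this only guards the word count; it says nothing about whether the augmented text still means the same thing.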

# KeyboardAug

Let's try again with no digits, no special characters and only one augmented character per word.

In [5]:
aug = nac.KeyboardAug(include_numeric=False, include_special_char=False, aug_char_max=1, aug_word_p=0.05)
augmented_texts = aug.augment(sample, n=3)
augmented_texts = [x.replace(" ' ", "'") for x in augmented_texts]
print_and_highlight_diff(sample, augmented_texts)

Original: 208
Dear Senator, I favor keeping the Electoral College in the arguement. One of the reasons I feel that way is that it is harder for someone who is runnig for president to win. To win they would need to win over the votes of most of the small states. Or win over the votes over some of the small states and some of the big states. So it would need someone who is smart or at least somewhat smart. Another one of my reasons for going on this side of the arguement is not only do they have to win over the electoral votes they have to win over the poularity votes. The next reason I have is the Electoral College requires the people running for president they have appeal to all regions and not just the west because they were born there or something like that. Now there are a few bad things about the Electoral College. Like one is the fact that a tie is posible. Another is that it's out dated. Lastly each party picks a slate of electors to vote for the party's nominee and it's posible 

That looks usable, but some of the misspellings are fairly severe.
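For intuition about what a keyboard-adjacency augmentation does, here is a toy version that swaps at most one letter per word for a neighbouring key. It is not nlpaug's implementation: the `NEIGHBOURS` table is a hypothetical, partial QWERTY map covering only a few letters, and `keyboard_typo` is a name made up for this sketch.

```python
import random

# Hypothetical, partial QWERTY neighbour map (letters only); not nlpaug's table.
NEIGHBOURS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "r": "edft", "s": "awedxz", "t": "rfgy",
}


def keyboard_typo(word: str, rng: random.Random) -> str:
    """Replace at most one character of `word` with a neighbouring key."""
    positions = [i for i, ch in enumerate(word) if ch.lower() in NEIGHBOURS]
    if not positions:
        return word  # nothing in this word is covered by the toy map
    i = rng.choice(positions)
    replacement = rng.choice(NEIGHBOURS[word[i].lower()])
    if word[i].isupper():
        replacement = replacement.upper()  # keep the original casing
    return word[:i] + replacement + word[i + 1:]


rng = random.Random(1000)
print([keyboard_typo(w, rng) for w in "Dear Senator, I favor keeping".split()])
```

Because each word keeps its length and no whitespace is introduced, this kind of substitution cannot change the split() count.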

# SpellingAug

If KeyboardAug creates typos that are too unnatural, let's try SpellingAug, which uses a database of common misspellings.

In [6]:
aug = naw.SpellingAug()
augmented_texts = aug.augment(sample, n=3)
augmented_texts = [x.replace(" ' ", "'") for x in augmented_texts]
print_and_highlight_diff(sample, augmented_texts)

Original: 208
Dear Senator, I favor keeping the Electoral College in the arguement. One of the reasons I feel that way is that it is harder for someone who is runnig for president to win. To win they would need to win over the votes of most of the small states. Or win over the votes over some of the small states and some of the big states. So it would need someone who is smart or at least somewhat smart. Another one of my reasons for going on this side of the arguement is not only do they have to win over the electoral votes they have to win over the poularity votes. The next reason I have is the Electoral College requires the people running for president they have appeal to all regions and not just the west because they were born there or something like that. Now there are a few bad things about the Electoral College. Like one is the fact that a tie is posible. Another is that it's out dated. Lastly each party picks a slate of electors to vote for the party's nominee and it's posible 

That looks usable.

# SynonymAug

Now for SynonymAug, which replaces some words with synonyms.

In [7]:
aug = naw.SynonymAug()
augmented_texts = aug.augment(sample, n=3)
augmented_texts = [x.replace(" ' ", "'") for x in augmented_texts]
print_and_highlight_diff(sample, augmented_texts)

Original: 208
Dear Senator, I favor keeping the Electoral College in the arguement. One of the reasons I feel that way is that it is harder for someone who is runnig for president to win. To win they would need to win over the votes of most of the small states. Or win over the votes over some of the small states and some of the big states. So it would need someone who is smart or at least somewhat smart. Another one of my reasons for going on this side of the arguement is not only do they have to win over the electoral votes they have to win over the poularity votes. The next reason I have is the Electoral College requires the people running for president they have appeal to all regions and not just the west because they were born there or something like that. Now there are a few bad things about the Electoral College. Like one is the fact that a tie is posible. Another is that it's out dated. Lastly each party picks a slate of electors to vote for the party's nominee and it's posible 

Oh dear, the synonym augmentation is not a 1-to-1 swap at the word level. E.g. small -> pocket size. Therefore this cannot be used while maintaining the discourse annotations.
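If synonym replacement were still wanted, one possible workaround would be to accept only single-word synonyms. The sketch below shows the idea with a hand-written `SYNONYMS` table, which is purely hypothetical (SynonymAug draws on WordNet by default); `one_to_one_synonyms` is likewise a made-up helper, and I have not tried this with nlpaug itself.

```python
import random

# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "small": ["pocket size", "little", "minor"],
    "big": ["large", "great"],
    "smart": ["clever", "bright"],
}


def one_to_one_synonyms(text: str, rng: random.Random, p: float = 0.5) -> str:
    """Swap words for synonyms, skipping any synonym that is more than one word."""
    out = []
    for word in text.split():
        # Keep only candidates that split() into exactly one token.
        candidates = [s for s in SYNONYMS.get(word.lower(), []) if len(s.split()) == 1]
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)


rng = random.Random(1000)
text = "some of the small states and some of the big states"
augmented = one_to_one_synonyms(text, rng)
print(augmented)
print(len(text.split()), len(augmented.split()))
```

This toy lookup ignores punctuation attached to words (e.g. "states."), so a real version would need to handle that as well.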

# WordEmbsAug

WordEmbsAug uses word embeddings to find a similar word for augmentation. It is used here with a local copy of the GloVe model.

In [8]:
aug = nawwe.WordEmbsAug(model_type='glove', model_path='../input/glove-embeddings/glove.6B.300d.txt')
augmented_texts = aug.augment(sample, n=3)
augmented_texts = [x.replace(" ' ", "'") for x in augmented_texts]
print_and_highlight_diff(sample, augmented_texts)

Original: 208
Dear Senator, I favor keeping the Electoral College in the arguement. One of the reasons I feel that way is that it is harder for someone who is runnig for president to win. To win they would need to win over the votes of most of the small states. Or win over the votes over some of the small states and some of the big states. So it would need someone who is smart or at least somewhat smart. Another one of my reasons for going on this side of the arguement is not only do they have to win over the electoral votes they have to win over the poularity votes. The next reason I have is the Electoral College requires the people running for president they have appeal to all regions and not just the west because they were born there or something like that. Now there are a few bad things about the Electoral College. Like one is the fact that a tie is posible. Another is that it's out dated. Lastly each party picks a slate of electors to vote for the party's nominee and it's posible 

Some interesting swaps there! For example, mubarak == president only in a very specific place and time!
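Swaps like that make more sense once you remember that "similar" here just means "nearby in embedding space". The sketch below shows a nearest-neighbour lookup by cosine similarity. The three-dimensional vectors in `EMBEDDINGS` are invented for illustration (real GloVe vectors here have 300 dimensions), and `most_similar` is a hypothetical helper, not the nlpaug API.

```python
import math
from typing import Dict, List

# Made-up 3-d vectors for illustration only.
EMBEDDINGS: Dict[str, List[float]] = {
    "president": [0.9, 0.1, 0.3],
    "leader": [0.8, 0.2, 0.4],
    "senator": [0.7, 0.3, 0.1],
    "banana": [0.0, 0.9, 0.1],
}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word: str, top_k: int = 2) -> List[str]:
    """Return the top_k other words ranked by cosine similarity to `word`."""
    target = EMBEDDINGS[word]
    scored = [(cosine(target, vec), other)
              for other, vec in EMBEDDINGS.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_k]]


print(most_similar("president"))
```

Static embeddings place words close together when they appear in similar contexts in the training corpus, which is presumably how a specific president's name can end up as a neighbour of the word "president".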

# ContextualWordEmbsAug

This is similar to WordEmbsAug, but uses more powerful contextual word embeddings. Here it is used with BERT.

In [9]:
aug = nawcwe.ContextualWordEmbsAug(model_path='../input/huggingface-bert-variants/bert-base-cased/bert-base-cased')
augmented_texts = aug.augment(sample, n=3)
augmented_texts = [x.replace(" ' ", "'") for x in augmented_texts]
print_and_highlight_diff(sample, augmented_texts)

Original: 208
Dear Senator, I favor keeping the Electoral College in the arguement. One of the reasons I feel that way is that it is harder for someone who is runnig for president to win. To win they would need to win over the votes of most of the small states. Or win over the votes over some of the small states and some of the big states. So it would need someone who is smart or at least somewhat smart. Another one of my reasons for going on this side of the arguement is not only do they have to win over the electoral votes they have to win over the poularity votes. The next reason I have is the Electoral College requires the people running for president they have appeal to all regions and not just the west because they were born there or something like that. Now there are a few bad things about the Electoral College. Like one is the fact that a tie is posible. Another is that it's out dated. Lastly each party picks a slate of electors to vote for the party's nominee and it's posible 