# Text Augmentation

In general, more training data leads to better model performance. However, annotating a large amount of training data is often costly and time-consuming. Data augmentation, which generates additional training examples from existing ones, can therefore help improve model performance.

In natural language processing (NLP), augmenting text is difficult because language is complex: even a small change in wording can alter the meaning of a sentence.

The [nlpaug](https://github.com/makcedward/nlpaug) module implements a number of text augmentation algorithms that may improve the performance of NLP models. This tutorial introduces several text augmentation functions in the nlpaug module.

References and recommended readings:
* [nlpaug GitHub site](https://github.com/makcedward/nlpaug)
* [Data Augmentation in NLP: Introduction to Text Augmentation](https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28)

## Install and import modules

In [None]:
# Install the most recent version of gensim.
# Otherwise, you may get the following error when running naw.WordEmbsAug():
# 'Word2VecKeyedVectors' object has no attribute 'index_to_key'
# see: https://stackoverflow.com/questions/71032760/word2veckeyedvectors-object-has-no-attribute-index-to-key
!pip install --upgrade gensim --quiet

In [None]:
# Import gensim.
# Note: You will need to restart the runtime in order to import the most recent version of gensim
import gensim
print(gensim.__version__)

4.2.0


In [None]:
# Install the transformers module in order to use their base models (e.g., BERT)
!pip install transformers --quiet

In [None]:
# Import transformers
import transformers


In [None]:
# Install the tokenizer needed by the back translation model
!pip install sacremoses --quiet

In [None]:
# Import the tokenizer
import sacremoses

In [None]:
# Install the nlpaug module
!pip install nlpaug --quiet

In [None]:
# Import the nlpaug module and its methods
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc
from nlpaug.util import Action

## Download models

In [None]:
# Download the pretrained models to the current directory
from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir = '.')
# Possible values are 'wiki-news-300d-1M', 'wiki-news-300d-1M-subword', 'crawl-300d-2M' and 'crawl-300d-2M-subword'
DownloadUtil.download_fasttext(dest_dir = '.', model_name = 'crawl-300d-2M')
# Possible values are 'glove.6B', 'glove.42B.300d', 'glove.840B.300d' and 'glove.twitter.27B'
DownloadUtil.download_glove(dest_dir = '.', model_name = 'glove.6B')

Downloading...
From: https://drive.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM
To: /content/GoogleNews-vectors-negative300.bin.gz
100%|██████████| 1.65G/1.65G [00:16<00:00, 97.8MB/s]


## Example text

In [None]:
# Define an example text
text = """
  Is daily coffee consumption good for our health? 
  I guess it is reasonable to believe so, but it may also depend on how much you drink.
  """

## Option 1: Substitute or insert words randomly using word embedding similarity

In [None]:
# Initialize the augmenter with model "word2vec"
aug = naw.WordEmbsAug(
  # You can choose from "word2vec", "glove", or "fasttext" 
  model_type = 'word2vec', 
  model_path = 'GoogleNews-vectors-negative300.bin',
  # You may also choose "insert"
  action = "substitute")

# Augment the text
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:

  Is daily coffee consumption good for our health? 
  I guess it is reasonable to believe so, but it may also depend on how much you drink.
  
Augmented Text:
["Is daily coffee unprocessed_grains good depriving our Ajmal_Pardes? hadn'tI guess it'sa revolves_around particularized_suspicion to believe AMÉLIE_MAURESMO, but it may also adversely_affects on how much you drink."]


In [None]:
# Initialize the augmenter with model "fasttext"
aug = naw.WordEmbsAug(
  # You can choose from "word2vec", "glove", or "fasttext" 
  model_type = 'fasttext', 
  # Note: check the download directory for the exact model file name
  model_path = 'crawl-300d-2M.vec',
  # You may also choose "insert"
  action = "substitute")

# Augment the text
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:

  Is daily coffee consumption good for our health? 
  I guess it is reasonable to believe so, but it may also depend on how much you drink.
  
Augmented Text:
['1.Is nearly-daily coffee consumption good for our hеаlth? I surmize it is equitable to believe so, but it mght acutally depending on you--how much you drink.']


In [None]:
# Initialize the augmenter with model "glove"
aug = naw.WordEmbsAug(
  # You can choose from "word2vec", "glove", or "fasttext" 
  model_type = 'glove', 
  # Note: check the download directory for the exact model file name
  model_path = 'glove.6B.300d.txt',
  # You may also choose "insert"
  action = "substitute")

# Augment the text
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:

  Is daily coffee consumption good for our health? 
  I guess it is reasonable to believe so, but it may also depend on how much you drink.
  
Augmented Text:
['Is daily coffee goods good for our health? I guess it can reasonable continue believe so, go it if use depend on how some anybody ate.']
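To make the mechanism behind this option more concrete, the sketch below mimics embedding-based substitution with a tiny hand-made vector table. The vectors, the vocabulary, and the helper names (`TOY_VECTORS`, `nearest_neighbors`, `substitute_words`) are hypothetical and are not part of nlpaug. The real augmenter draws on the full pretrained vocabulary, which is one reason it can return noisy tokens like those in the outputs above.

```python
import math
import random

# Hypothetical 3-dimensional "embeddings"; the pretrained models above use 300 dimensions.
TOY_VECTORS = {
    "coffee": (0.9, 0.1, 0.0),
    "tea": (0.85, 0.15, 0.05),
    "espresso": (0.95, 0.05, 0.0),
    "health": (0.0, 0.9, 0.1),
    "wellbeing": (0.05, 0.85, 0.15),
    "drink": (0.6, 0.0, 0.6),
    "sip": (0.55, 0.05, 0.65),
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbors(word, top_k=2):
    """Return the top_k vocabulary words most similar to `word`, excluding itself."""
    target = TOY_VECTORS[word]
    scored = [(cosine(target, vec), other)
              for other, vec in TOY_VECTORS.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_k]]

def substitute_words(text, aug_p=0.3, top_k=2, seed=0):
    """Replace each in-vocabulary word with a random near neighbor with probability aug_p."""
    rng = random.Random(seed)
    out = []
    for token in text.split():
        if token in TOY_VECTORS and rng.random() < aug_p:
            out.append(rng.choice(nearest_neighbors(token, top_k)))
        else:
            out.append(token)
    return " ".join(out)

print(nearest_neighbors("coffee"))
print(substitute_words("coffee is good for health", aug_p=1.0))
```

Restricting the choice to the closest neighbors is what keeps substitutions loosely on topic. Widening `top_k` gives more variety but also more drift in meaning.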


## Option 2: Substitute or insert words using contextual word embeddings

In [None]:
# Substitute words using contextual word embeddings (BERT, DistilBERT, RoBERTa, or XLNet)
aug = naw.ContextualWordEmbsAug(
  # Other models include 'distilbert-base-uncased', 'roberta-base', etc.
  model_path = 'bert-base-uncased', 
  # You can also choose "insert"
  action = "substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:

  Is daily coffee consumption good for our health? 
  I guess it is reasonable to believe so, but it may also depend on how much you drink.
  
Augmented Text:
['is daily coffee as good for that waitress? i guess it appears reasonable to use her, but it may certainly depend on have much you serve.']
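For BERT-style models, contextual substitution relies on masked language modeling: a word is hidden and the model proposes a replacement that fits the surrounding words. The sketch below illustrates only the masking half of that idea in plain Python. The helper `mask_one_word` is hypothetical and is not how nlpaug is implemented; the prediction step would be done by the language model.

```python
import random

def mask_one_word(text, mask_token="[MASK]", seed=0):
    """Hide one randomly chosen word; return the masked text and the hidden word."""
    rng = random.Random(seed)
    tokens = text.split()
    idx = rng.randrange(len(tokens))
    hidden = tokens[idx]
    tokens[idx] = mask_token
    return " ".join(tokens), hidden

masked, hidden = mask_one_word("Is daily coffee consumption good for our health?")
print(masked)
print("hidden word:", hidden)
```

Because the model sees only the context and not the hidden word, its suggestion can be fluent yet change the meaning, as with "waitress" replacing "health" in the output above.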


## Option 3: Substitute words with synonyms

In [None]:
# Substitute words with WordNet synonyms
aug = naw.SynonymAug(aug_src = 'wordnet')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Original:

  Is daily coffee consumption good for our health? 
  I guess it is reasonable to believe so, but it may also depend on how much you drink.
  
Augmented Text:
['Is casual coffee consumption skillful for our health? I guess it is reasonable to think so, simply information technology may also calculate on how much you drink.']


In [None]:
# Substitute words with WordNet synonyms.
# You can optionally set the maximum number of words to replace with synonyms.
aug = naw.SynonymAug(aug_src = 'wordnet', aug_max = 3)
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:

  Is daily coffee consumption good for our health? 
  I guess it is reasonable to believe so, but it may also depend on how much you drink.
  
Augmented Text:
['Is daily coffee consumption good for our health? I guess information technology is reasonable to believe so, but it may also reckon on how much you drink.']
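The effect of a cap such as `aug_max` can be sketched with a toy synonym table. Everything in the block below (`TOY_SYNONYMS`, `synonym_substitute`) is hypothetical and stands in for WordNet; the real augmenter also takes other settings into account, such as `aug_p`, so treat this as an illustration of the idea rather than of nlpaug's internals.

```python
import random

# Hypothetical synonym table standing in for WordNet.
TOY_SYNONYMS = {
    "daily": ["everyday"],
    "good": ["beneficial"],
    "believe": ["think"],
    "depend": ["rely", "hinge"],
    "drink": ["consume"],
}

def synonym_substitute(text, aug_max=3, seed=0):
    """Replace at most `aug_max` words that have an entry in TOY_SYNONYMS."""
    rng = random.Random(seed)
    tokens = text.split()
    # Positions of words that can be replaced at all.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in TOY_SYNONYMS]
    # Pick a random subset no larger than the cap.
    chosen = rng.sample(candidates, min(aug_max, len(candidates)))
    for i in chosen:
        tokens[i] = rng.choice(TOY_SYNONYMS[tokens[i].lower()])
    return " ".join(tokens)

original = "it may also depend on how much you drink daily"
augmented = synonym_substitute(original, aug_max=2)
changed = sum(a != b for a, b in zip(original.split(), augmented.split()))
print(augmented)
print("words changed:", changed)
```

A small cap keeps the augmented sentence close to the original, which helps when a substitution picks the wrong word sense, as with "information technology" replacing "it" above.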


## Option 4: Paraphrase text using back translation

In [None]:
# Use back translation augmenter
back_translation_aug = naw.BackTranslationAug(
    from_model_name = 'facebook/wmt19-en-de', 
    to_model_name = 'facebook/wmt19-de-en'
)
back_translation_aug.augment(text)

['Is daily coffee consumption good for our health? I think it is reasonable to believe so, but it can also depend on how much you drink.']
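As the outputs in this tutorial show, the augmenters return a list of strings, and an augmented sentence can end up very close to the original. A minimal sketch of how such output might be folded back into a labeled training set is shown below. The helper `expand_dataset`, the stand-in `fake_augment`, and the example labels are all hypothetical; in practice you would pass something like `aug.augment` as the augmentation function.

```python
def expand_dataset(examples, augment_fn):
    """Append augmented copies of (text, label) pairs, skipping unchanged or repeated text."""
    expanded = list(examples)
    seen = {text.strip() for text, _ in examples}
    for text, label in examples:
        for candidate in augment_fn(text):
            candidate = candidate.strip()
            if candidate and candidate not in seen:
                seen.add(candidate)
                # The augmented text inherits the label of its source example.
                expanded.append((candidate, label))
    return expanded

# Hypothetical stand-in for an augmenter: returns a list of strings.
def fake_augment(text):
    return [text.replace("good", "beneficial"), text]

train = [("coffee is good", "positive"), ("too much coffee is bad", "negative")]
expanded = expand_dataset(train, fake_augment)
for row in expanded:
    print(row)
```

Reusing the source label assumes the augmentation preserves meaning, so it is worth spot-checking augmented examples before training on them.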