## Data Augmentation using NLPaug

This notebook demonstrates the usage of a character augmenter and a word augmenter. nlpaug also offers other types, such as augmenters for sentences, audio, and spectrogram inputs.

In [1]:
# To install only the requirements of this notebook, run this cell

# ===========================

!pip install numpy==1.19.5
!pip install nlpaug==0.0.14
!pip install wget==3.2
!pip install matplotlib==3.2.2
!pip install requests==2.23.0

# ===========================

Collecting nlpaug==0.0.14
  Downloading https://files.pythonhosted.org/packages/1f/6c/ca85b6bd29926561229e8c9f677c36c65db9ef1947bfc175e6641bc82ace/nlpaug-0.0.14-py3-none-any.whl (101kB)
     |████████████████████████████████| 102kB 4.4MB/s
Installing collected packages: nlpaug
Successfully installed nlpaug-0.0.14
Collecting wget==3.2
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp37-none-any.whl size=9675 sha256=53699b93bdf14ec28540dbe53e9978af5099ce729b974b2c43ec5f5f12e13c59
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [3]:
# This will be the base text which we will be using throughout this notebook
text="The quick brown fox jumps over the lazy dog ."

In [4]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action
import os
!git clone https://github.com/makcedward/nlpaug.git
os.environ["MODEL_DIR"] = 'nlpaug/model/'

Cloning into 'nlpaug'...
remote: Enumerating objects: 5078, done.
remote: Counting objects: 100% (605/605), done.
remote: Compressing objects: 100% (387/387), done.
remote: Total 5078 (delta 426), reused 358 (delta 215), pack-reused 4473
Receiving objects: 100% (5078/5078), 3.17 MiB | 16.74 MiB/s, done.
Resolving deltas: 100% (3588/3588), done.


### Augmentation at the Character Level


1.   OCR Augmenter: To read text from an image, we need an OCR (optical character recognition) model. The extracted text may contain recognition errors, such as '0' instead of 'o' or '2' instead of 'z'. This augmenter simulates such errors.
2.   Keyboard Augmenter: Typos are fairly common while typing or texting. This augmenter simulates them by substituting characters in words with ones from nearby keys on the keyboard.
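
The idea behind both augmenters can be sketched without the library: walk over the text and, with some probability, swap a character for a look-alike (OCR) or for a neighbouring key (keyboard). The snippet below is only a minimal sketch under stated assumptions. The tables `OCR_CONFUSIONS` and `KEYBOARD_NEIGHBOURS` and the helper `substitute_chars` are hypothetical names written by hand for illustration; they are not the mappings or the code that nlpaug uses.

```python
import random

# Hypothetical, hand-written confusion tables (NOT nlpaug's own mappings).
OCR_CONFUSIONS = {"o": ["0"], "z": ["2"], "l": ["1"], "s": ["5"], "g": ["9"]}
KEYBOARD_NEIGHBOURS = {"q": ["w", "a"], "e": ["w", "r"], "o": ["i", "p"], "l": ["k"]}


def substitute_chars(text, table, p=0.3, seed=None):
    """Replace each character that has an entry in `table` with probability `p`."""
    rng = random.Random(seed)  # local generator so results are reproducible
    out = []
    for ch in text:
        candidates = table.get(ch.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)


sample = "The quick brown fox jumps over the lazy dog ."
ocr_like = substitute_chars(sample, OCR_CONFUSIONS, p=0.5, seed=0)
keyboard_like = substitute_chars(sample, KEYBOARD_NEIGHBOURS, p=0.5, seed=0)
print(ocr_like)
print(keyboard_like)
```

Passing a different table is all it takes to switch between the two error styles, which mirrors how the cells below only differ in the augmenter class they instantiate.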



In [5]:
# OCR augmenter
# import nlpaug.augmenter.char as nac

aug = nac.OcrAug()  
augmented_texts = aug.augment(text, n=3) # specifying n=3 gives us only 3 augmented versions of the sentence.

print("Original:")
print(text)

print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The quick brown fox jumps 0ver the lazy d0g .', 'The quicr brown fox jump8 ovek the lazy dog .', 'The qoick brown fux jumps over the lazy do9 .']


In [6]:
# Keyboard Augmenter
# import nlpaug.augmenter.char as nac


aug = nac.KeyboardAug()
augmented_text = aug.augment(text, n=3) # specifying n=3 gives us only 3 augmented versions of the sentence.

print("Original:")
print(text)

print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The @uick broen fox jumps over the lazy dog .', 'The quick brown fox jumps ovee the ?azy dog .', 'The quick brown fox jumps oveT the lazy dog .']


There are other types of character augmenters too. Their details are available in the links mentioned at the beginning of this notebook.

### Augmentation at the Word Level

Augmentation is important at the word level as well. Here we first introduce spelling mistakes from a dictionary, and then use word2vec to insert or substitute similar words.

**Spelling augmenter**


In [7]:
# Downloading the required txt file
import wget

if not os.path.exists("spelling_en.txt"):
    wget.download("https://raw.githubusercontent.com/makcedward/nlpaug/5238e0be734841b69651d2043df535d78a8cc594/nlpaug/res/word/spelling/spelling_en.txt")
else:
    print("File already exists")

In [8]:
# Substitute words using a dictionary of common spelling mistakes
aug = naw.SpellingAug('spelling_en.txt')
augmented_texts = aug.augment(text)
print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
Te quick brown fox jumps over hthe lazy djg .
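
The lookup-and-replace idea behind a dictionary-based spelling augmenter can be sketched in a few lines. This is a minimal sketch, assuming a tiny in-memory dictionary: `MISSPELLINGS` and `misspell` are hypothetical names, and the entries are made up for illustration rather than taken from `spelling_en.txt`.

```python
import random

# Hypothetical misspelling dictionary; the entries are illustrative only.
MISSPELLINGS = {
    "the": ["teh", "hte"],
    "quick": ["quik", "qiuck"],
    "lazy": ["lazzy", "lasy"],
}


def misspell(text, dictionary, p=0.3, seed=None):
    """Swap each known word for one of its listed misspellings with probability `p`."""
    rng = random.Random(seed)  # local generator so results are reproducible
    result = []
    for token in text.split():
        options = dictionary.get(token.lower())
        if options and rng.random() < p:
            result.append(rng.choice(options))
        else:
            result.append(token)
    return " ".join(result)


sample = "The quick brown fox jumps over the lazy dog ."
noisy = misspell(sample, MISSPELLINGS, p=0.5, seed=42)
print(noisy)
```

Words without a dictionary entry are left untouched, so the coverage of the dictionary bounds how much noise this approach can add.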


**Word embeddings augmenter**

In [9]:
import gzip
import shutil

gn_vec_path = "GoogleNews-vectors-negative300.bin"
if not os.path.exists("GoogleNews-vectors-negative300.bin"):
    if not os.path.exists("../Ch3/GoogleNews-vectors-negative300.bin"):
        # Downloading the required model
        if not os.path.exists("../Ch3/GoogleNews-vectors-negative300.bin.gz"):
            if not os.path.exists("GoogleNews-vectors-negative300.bin.gz"):
                wget.download("https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz")
            gn_vec_zip_path = "GoogleNews-vectors-negative300.bin.gz"
        else:
            gn_vec_zip_path = "../Ch3/GoogleNews-vectors-negative300.bin.gz"
        # Extracting the required model
        with gzip.open(gn_vec_zip_path, 'rb') as f_in:
            with open(gn_vec_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
    else:
        gn_vec_path = "../Ch3/" + gn_vec_path

print(f"Model at {gn_vec_path}")

Model at GoogleNews-vectors-negative300.bin


Insert words at random positions, chosen by word embedding similarity

In [10]:
# model_type: word2vec, glove or fasttext
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path=gn_vec_path,
    action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The Carillion quick brown Emergent fox jumps over the BY lazy dog .


Substitute word by word2vec similarity


In [11]:
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path=gn_vec_path,
    action="substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
His quick brown fox jumps morethan the whiny dog .
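
To make "similarity" concrete, here is a minimal sketch of nearest-neighbour substitution using cosine similarity. It assumes a toy vocabulary of 3-dimensional vectors: `TOY_VECTORS`, `cosine`, `most_similar` and `substitute_similar` are hypothetical names, and the numbers are invented for illustration (real word2vec vectors have 300 dimensions). Unlike the varied outputs above, the sketch always takes the single closest word, so its result is deterministic.

```python
import math

# Hypothetical 3-dimensional "embeddings", invented for illustration.
TOY_VECTORS = {
    "quick": [0.9, 0.1, 0.0],
    "fast": [0.8, 0.2, 0.1],
    "lazy": [0.0, 0.9, 0.2],
    "idle": [0.1, 0.8, 0.3],
    "dog": [0.1, 0.2, 0.9],
    "puppy": [0.2, 0.3, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the vocabulary word (other than `word`) with the highest cosine similarity."""
    target = vectors[word]
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(target, vectors[w]))


def substitute_similar(text, vectors):
    """Replace every token that has a vector with its nearest neighbour."""
    return " ".join(
        most_similar(tok, vectors) if tok in vectors else tok
        for tok in text.split()
    )


augmented = substitute_similar("The quick brown fox jumps over the lazy dog .", TOY_VECTORS)
print(augmented)
```

Tokens that are missing from the vocabulary pass through unchanged, which is why only three words of the sentence are replaced here.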


nlpaug offers many more features. Visit the GitHub repo and the documentation for further details.