## Data Augmentation using NLPaug

This notebook demonstrates the usage of character and word augmenters. nlpaug also offers other types, such as augmenters for sentences, audio, and spectrogram inputs. All of these and more are described in the nlpaug GitHub repo and docs.

In [1]:
!pip install numpy
!pip install nlpaug
!pip install wget
!pip install matplotlib
!pip install requests

In [2]:
# This will be the base text which we will be using throughout this notebook
text="The quick brown fox jumps over the lazy dog ."

In [3]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action
import os

### Augmentation at the Character Level

1 . **OCR Augmenter**: To read textual data from an image, we need an OCR (optical character recognition) model. The extracted text may contain recognition errors, such as '0' instead of 'o' or '2' instead of 'z'. This augmenter simulates such errors.

2 . **Keyboard Augmenter**: Typos are fairly common while typing or texting. This augmenter simulates them by substituting characters in words with ones from nearby keys on a keyboard.
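
The substitution idea behind both augmenters can be illustrated with a small pure-Python sketch. This is not how nlpaug implements them: the two lookup tables and the `substitute_chars` helper are hypothetical and exist only for illustration, whereas nlpaug ships its own, much larger mappings.

```python
import random

# Hypothetical look-alike table (assumption, not nlpaug's actual mapping).
OCR_CONFUSIONS = {"o": ["0"], "l": ["1"], "z": ["2"], "g": ["9"], "s": ["5"]}

# Hypothetical, partial table of neighbouring QWERTY keys (assumption).
KEYBOARD_NEIGHBOURS = {
    "q": ["w", "a"],
    "u": ["y", "i", "j"],
    "o": ["i", "p", "l"],
    "e": ["w", "r", "d"],
}


def substitute_chars(text, table, prob=0.3, seed=None):
    """Replace each character that has an entry in `table` with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        candidates = table.get(ch.lower())
        if candidates and rng.random() < prob:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)


sample = "The quick brown fox jumps over the lazy dog ."

# prob=1.0 replaces every character that has an entry in the table.
ocr_like = substitute_chars(sample, OCR_CONFUSIONS, prob=1.0, seed=0)
# prob=0.5 replaces roughly half of the eligible characters.
typo_like = substitute_chars(sample, KEYBOARD_NEIGHBOURS, prob=0.5, seed=0)

print(ocr_like)
print(typo_like)
```

The only difference between the two simulated error types is the lookup table, which is why a single helper covers both in this sketch.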

In [4]:
aug = nac.OcrAug()  
augmented_texts = aug.augment(text, n=3)  # n=3 requests three augmented versions of the sentence

print("Original:")
print(text)

print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The quick brown fox jumps over the lazy d09.', 'The quick brown fox jumps over the 1a2y dog.', 'The quick brown f0x jumps over the lazy dog.']


In [5]:
aug = nac.KeyboardAug()
augmented_text = aug.augment(text, n=3)  # n=3 requests three augmented versions of the sentence

print("Original:")
print(text)

print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The qJivk brown fox jumOW ovd# the lazy dog.', 'The qukfk b$owH fox jumps over the ?xzy dog.', 'The quick NEown fox jumps ov@E the >Qzy dog.']


nlpaug provides other character augmenters too, such as `RandomCharAug`, which randomly inserts, substitutes, swaps, or deletes characters.

### Augmentation at the Word Level

Augmentation is important at the word level as well. Here we use a spelling-mistake dictionary to substitute words, and word2vec to insert or substitute similar words.

#### 1 - Spelling augmenter

In [6]:
# Download the spelling-mistake dictionary if it is not already present
import wget

if not os.path.exists("spelling_en.txt"):
    wget.download("https://raw.githubusercontent.com/makcedward/nlpaug/5238e0be734841b69651d2043df535d78a8cc594/nlpaug/res/word/spelling/spelling_en.txt")
else:
    print("File already exists")

In [7]:
# Substitute words using a dictionary of common spelling mistakes
aug = naw.SpellingAug('spelling_en.txt')
augmented_texts = aug.augment(text)
print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The qchick brown fox jumps over tne lazy djg.']
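
To make the mechanism concrete, here is a minimal sketch of dictionary-based misspelling in plain Python. It is not nlpaug's implementation and does not read `spelling_en.txt`: the `MISSPELLINGS` table and the `misspell` helper are hypothetical, with made-up entries chosen only for illustration.

```python
import random

# Hypothetical misspelling dictionary (assumption; entries are made up).
MISSPELLINGS = {
    "quick": ["quik", "qiuck"],
    "brown": ["broun"],
    "lazy": ["lazzy", "lasy"],
}


def misspell(text, table, prob=0.3, seed=None):
    """Swap each word that has an entry in `table` for a misspelling with probability `prob`."""
    rng = random.Random(seed)
    words = []
    for word in text.split():
        options = table.get(word.lower())
        if options and rng.random() < prob:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)


sample = "The quick brown fox jumps over the lazy dog ."
noisy = misspell(sample, MISSPELLINGS, prob=1.0, seed=1)
print(noisy)
```

Words without a dictionary entry are left untouched, so the amount of noise depends on both `prob` and the coverage of the dictionary.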


#### 2 - Word embeddings augmenter

In [10]:
# Path to the pre-trained GoogleNews word2vec vectors; the file must already be present locally
gn_vec_path = "GoogleNews-vectors-negative300.bin.gz"

Insert words at random positions, chosen by word-embedding similarity

In [11]:
# model_type: word2vec, glove or fasttext
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path=gn_vec_path,
    action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['Master The quick brown dwelling fox jumps over the lazy Pend dog.']


Substitute words with similar words according to word2vec

In [12]:
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path=gn_vec_path,
    action="substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The quick brown fox grooming_petting cu_zn_ore particular lazy dog.']
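
The idea of substituting by embedding similarity can be sketched without the large GoogleNews file. The example below is a toy under stated assumptions: the three-dimensional `TOY_VECTORS` are hand-made (real word2vec vectors have 300 dimensions), and the `nearest` and `substitute_similar` helpers are hypothetical. Unlike nlpaug, which picks words and candidates at random, this sketch always replaces every in-vocabulary word with its single nearest neighbour.

```python
import math

# Hand-made toy vectors (assumption; not taken from any trained model).
TOY_VECTORS = {
    "quick": (0.9, 0.1, 0.0),
    "fast": (0.85, 0.15, 0.05),
    "lazy": (0.1, 0.9, 0.0),
    "idle": (0.15, 0.85, 0.1),
    "dog": (0.0, 0.1, 0.9),
    "puppy": (0.05, 0.2, 0.85),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(word, vectors):
    """Return the most similar other word, or None if `word` is out of vocabulary."""
    if word not in vectors:
        return None
    target = vectors[word]
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(target, vectors[w]))


def substitute_similar(text, vectors):
    """Replace every in-vocabulary word with its nearest neighbour."""
    return " ".join(nearest(w, vectors) or w for w in text.split())


sample = "The quick brown fox jumps over the lazy dog ."
print(substitute_similar(sample, TOY_VECTORS))
# prints: The fast brown fox jumps over the idle puppy .
```

With a real embedding model the nearest neighbours are not always synonyms, which helps explain odd substitutions such as those in the output above.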


nlpaug offers many more features. Visit the GitHub repo and documentation for further details.