# Data Augmentation Tutorial

- Gensim
- Transformers
- nlpaug
- TextAttack 🐙
- Back Translation with MarianMT from HuggingFace
- AugLy
- TextAugment
- GPT-3 by OpenAI

In [14]:
text = '''Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.'''
print(text)

Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.


---

# Gensim

## Word-Embeddings Substitution

In [9]:
import gensim.downloader as api

model = api.load('glove-twitter-25')  
model.most_similar('pizza', topn=5)

[('pasta', 0.9345267415046692),
 ('chocolate', 0.9283044338226318),
 ('sushi', 0.9267594218254089),
 ('starbucks', 0.9123665690422058),
 ('nutella', 0.8918704986572266)]
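
To turn nearest neighbours into an augmentation, each word can be swapped for one of its neighbours with some probability. The sketch below is a minimal illustration under stated assumptions: `substitute_with_neighbors` is a hypothetical helper, and a small hard-coded lookup stands in for `model.most_similar` so that the code runs without downloading the embeddings. The `pizza` entries are rounded from the output above; the `dish` entries are made up.

```python
import random

# Toy stand-in for model.most_similar(word, topn=k): word -> [(neighbor, similarity), ...]
# 'pizza' values are rounded from the output above; 'dish' values are invented for illustration.
TOY_NEIGHBORS = {
    "pizza": [("pasta", 0.93), ("chocolate", 0.93), ("sushi", 0.93)],
    "dish": [("meal", 0.90), ("plate", 0.88)],
}

def substitute_with_neighbors(sentence, neighbors, p=0.5, seed=None):
    """Replace each word that has neighbors with a random neighbor, with probability p."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        candidates = neighbors.get(word.lower(), [])
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates)[0])  # keep the word, drop the score
        else:
            out.append(word)
    return " ".join(out)

print(substitute_with_neighbors("pizza is a dish of Italian origin", TOY_NEIGHBORS, p=1.0, seed=0))
```

With the Gensim model loaded, the lookup table could be replaced by calls to `model.most_similar(word, topn=k)`. As the output above shows, neighbours in a small Twitter-trained embedding space are related words rather than strict synonyms, so the substitutions may change the meaning.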

---

# Transformers

> Works with TensorFlow environment in SageMaker

In [None]:
!pip install transformers

In [1]:
from transformers import pipeline

nlp = pipeline('fill-mask')

nlp('Pizza is a dish of <mask> origin')

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
All model checkpoint layers were used when initializing TFRobertaForMaskedLM.

All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at distilroberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


[{'score': 0.08497830480337143,
  'token': 24381,
  'token_str': ' dubious',
  'sequence': 'Pizza is a dish of dubious origin'},
 {'score': 0.07489389181137085,
  'token': 3108,
  'token_str': ' Italian',
  'sequence': 'Pizza is a dish of Italian origin'},
 {'score': 0.036742500960826874,
  'token': 16820,
  'token_str': ' culinary',
  'sequence': 'Pizza is a dish of culinary origin'},
 {'score': 0.03182853385806084,
  'token': 1362,
  'token_str': ' Indian',
  'sequence': 'Pizza is a dish of Indian origin'},
 {'score': 0.029324421659111977,
  'token': 14083,
  'token_str': ' humble',
  'sequence': 'Pizza is a dish of humble origin'}]
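
One way to use these predictions for augmentation is to mask each word in turn and keep only the filled-in sentences that differ from the original and score above a threshold. The sketch below assumes two hypothetical helpers, `mask_each_word` and `select_augmentations`, and an arbitrary threshold of 0.03; the sample predictions are a subset of the output above with rounded scores.

```python
def mask_each_word(sentence, mask_token="<mask>"):
    """Return one masked copy of the sentence per word position."""
    words = sentence.split()
    variants = []
    for i in range(len(words)):
        masked = words[:i] + [mask_token] + words[i + 1:]
        variants.append(" ".join(masked))
    return variants

def select_augmentations(predictions, original, min_score=0.03):
    """Keep predicted sequences that differ from the original and reach min_score."""
    return [
        p["sequence"]
        for p in predictions
        if p["score"] >= min_score and p["sequence"] != original
    ]

# Subset of the fill-mask output above, scores rounded (same dictionary keys).
sample_predictions = [
    {"score": 0.085, "token_str": " dubious", "sequence": "Pizza is a dish of dubious origin"},
    {"score": 0.075, "token_str": " Italian", "sequence": "Pizza is a dish of Italian origin"},
    {"score": 0.037, "token_str": " culinary", "sequence": "Pizza is a dish of culinary origin"},
    {"score": 0.029, "token_str": " humble", "sequence": "Pizza is a dish of humble origin"},
]

print(mask_each_word("Pizza is a dish"))
print(select_augmentations(sample_predictions, "Pizza is a dish of Italian origin"))
```

The mask token depends on the model: `<mask>` matches the `distilroberta-base` default used here, while BERT-style models use `[MASK]`.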

In [2]:
nlp('Pizza is a dish of <mask> origin consisting of a <mask> round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients')

[[{'score': 0.18357928097248077,
   'token': 3108,
   'token_str': ' Italian',
   'sequence': '<s>Pizza is a dish of Italian origin consisting of a<mask> round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients</s>'},
  {'score': 0.07059834897518158,
   'token': 11965,
   'token_str': ' Mediterranean',
   'sequence': '<s>Pizza is a dish of Mediterranean origin consisting of a<mask> round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients</s>'},
  {'score': 0.0653436928987503,
   'token': 14083,
   'token_str': ' humble',
   'sequence': '<s>Pizza is a dish of humble origin consisting of a<mask> round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients</s>'},
  {'score': 0.03337021917104721,
   'token': 16820,
   'token_str': ' culinary',
   'sequence': '<s>Pizza is a dish of culinary origin consisting of a<mask> round, fla

In [3]:
nlp('Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with <mask>, <mask>, and often various other <mask>')

[[{'score': 0.16325989365577698,
   'token': 7134,
   'token_str': ' cheese',
   'sequence': '<s>Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with cheese,<mask>, and often various other<mask></s>'},
  {'score': 0.08153437077999115,
   'token': 18553,
   'token_str': ' tomatoes',
   'sequence': '<s>Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes,<mask>, and often various other<mask></s>'},
  {'score': 0.0786515399813652,
   'token': 20406,
   'token_str': ' tomato',
   'sequence': '<s>Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomato,<mask>, and often various other<mask></s>'},
  {'score': 0.060847487300634384,
   'token': 21568,
   'token_str': ' onions',
   'sequence': '<s>Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based

---

# nlpaug

In [None]:
!pip install numpy requests nlpaug

In [4]:
!pip install "torch>=1.6.0" "transformers>=4.11.3" sentencepiece

In [5]:
!pip install "nltk>=3.4.5"

In [7]:
import os
os.environ["MODEL_DIR"] = '../model'

In [12]:
model_dir = os.environ["MODEL_DIR"]

In [19]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

# Character Augmenter

Character-level augmentation perturbs individual characters. Two typical scenarios are image-to-text pipelines and chatbots. To recognize text in an image we need an optical character recognition (OCR) model, but OCR introduces errors such as confusing "o" with "0". `OcrAug` simulates these errors to augment the data. In chatbots, typos still occur even though most applications come with spelling correction, so `KeyboardAug` simulates this kind of error.
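
Both ideas reduce to replacing a character with a plausible look-alike or a neighbouring key. The following is a minimal sketch, not nlpaug's implementation: `substitute_chars` is a hypothetical helper, and both lookup tables are deliberately tiny, hand-written assumptions.

```python
import random

# Hypothetical, deliberately small confusion tables; real augmenters use larger ones.
OCR_CONFUSIONS = {"o": "0", "l": "1", "i": "1", "s": "5", "b": "6", "g": "9"}
KEYBOARD_NEIGHBORS = {"a": "qwsz", "e": "wrsd", "i": "ujko", "o": "iklp", "s": "awedxz"}

def substitute_chars(text, table, p=0.1, seed=None):
    """Replace each character listed in `table` with one of its alternatives, with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        options = table.get(ch.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))  # pick one alternative character
        else:
            out.append(ch)
    return "".join(out)

sample = "Pizza is a dish of Italian origin"
print(substitute_chars(sample, OCR_CONFUSIONS, p=0.3, seed=1))
print(substitute_chars(sample, KEYBOARD_NEIGHBORS, p=0.3, seed=1))
```

The probability `p` controls how noisy the result is; setting it too high makes the text unreadable, which is rarely useful as training data.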

## OCR Augmenter
### Substitute character by pre-defined OCR error

In [20]:
aug = nac.OcrAug()
augmented_texts = aug.augment(text, n=3)
print("Original:")
print(text)
print("\nAugmented Texts:")
print(augmented_texts)

Original:
Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.

Augmented Texts:
['Pizza is a dish 0f Italian origin consisting of a usually round, flat 6a8e of leavened wheat - based dough topped with tomatoes, cheese, and uften various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc. ), which is then baked at a hi9h temperature, traditionally in a wood - fired oven.', 'Pizza is a dish of Italian origin consisting uf a osoa1ly koond, flat base of leavened wheat - based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables

## Keyboard Augmenter
### Substitute character by keyboard distance

In [34]:
aug = nac.KeyboardAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("\nAugmented Text:")
print(augmented_text)

Original:
Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.

Augmented Text:
Pizza is a diCh of Italian origin consisting of a usually round, flat base of ldavrned wheat - based dough toppex with tomatoes, cheese, and often various other ingredients (such as various typWs of sQusags, anchlcies, ,ushrLoms, onions, olivWs, vegetables, meat, ham, etc. ), which is then baked at a high tW<oerature, traditionally in a wood - fir#d oven.


## Random Augmenter
### Insert character randomly

In [32]:
aug = nac.RandomCharAug(action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("\nAugmented Text:")
print(augmented_text)

Original:
The quick brown fox jumped over the lazy dog

Augmented Text:
The quXick brown fox jumped over the laozy dog


### Substitute character randomly

In [23]:
aug = nac.RandomCharAug(action="substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("\nAugmented Text:")
print(augmented_text)

Original:
Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.

Augmented Text:
['Pizza is a dish of Italian orcgbn consisting of a usually r0u&d, flat E+se of leavened wherk - bas%d dough topped with tomatoes, cheese, and often various other ingredients (such as ve(ioup types of sausage, ancX4eies, mus_rFom_, onions, olEEes, vegetables, meat, ham, etc. ), which is then baFe% at a high temperature, traditionally in a wood - fired oven.']


### Swap character randomly

In [24]:
aug = nac.RandomCharAug(action="swap")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("\nAugmented Text:")
print(augmented_text)

Original:
Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.

Augmented Text:
['Pizza is a dish of Italian origin ocnsisitgn of a usually round, flat abes of leavened wheat - sbaed oduhg topped with tomatoes, cheese, and foetn various other ingredients (such as various tpyse of sausage, anchovies, mushrooms, onions, olives, egvetbales, meat, ham, etc. ), whhic is then baked at a high temperature, tarditoinlayl in a wood - fired vone.']


### Delete character randomly

In [26]:
aug = nac.RandomCharAug(action="delete")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("\nAugmented Text:")
print(augmented_text)

Original:
Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.

Augmented Text:
['zza is a dish of Italian origin consisting of a suly round, flat ba of leavened wheat - based dough topped with tomatoes, cheese, and oft various other ingredients (uh as vaus types of sausage, anchovies, mushrooms, onions, oliv, vegeaes, at, ham, etc. ), which is then baked at a high erature, traditionally in a wood - fired oven.']


# Word Augmenter

Besides character-level augmentation, word-level augmentation is important as well. We make use of word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), fastText (Joulin et al., 2016), BERT (Devlin et al., 2018) and WordNet to insert and substitute similar words. `Word2vecAug`, `GloVeAug` and `FasttextAug` use word embeddings to find the group of words most similar to the original word and replace it. `BertAug`, on the other hand, uses a language model to predict a possible target word. `WordNetAug` looks up a group of similar words in WordNet.

## Spelling Augmenter
### Substitute word by spelling mistake words dictionary

In [28]:
aug = naw.SpellingAug()
augmented_texts = aug.augment(text, n=3)
print("Original:")
print(text)
print("\nAugmented Texts:")
print(augmented_texts)

Original:
Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.

Augmented Texts:
['Pizza wis a dish of Italyan origin consisting of a usually round, flat based of leavened wheat - basead dough topped witc tomatoes, cheese, and often variuos other ingredients (such is varoius types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, ecc. ), which is then baked at a high temperature, traditionally in an wood - fired oven.', "Pizza is a dish of Italin origin consisting oof a usually rond, flat base of leavened wheat - basead dough topped with tomatoes, cheese, and ofthen various other ingredients (such as varios types lf sausage, anchovies, mushrooms, union, olives, vegetab

## Synonym Augmenter
### Substitute word by WordNet's synonym

In [24]:
aug = naw.SynonymAug(aug_src='wordnet')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("\nAugmented Text:")
print(augmented_text)

Original:
Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.

Augmented Text:
Pizza is a bag of Italian origin consisting of a ordinarily round, flat base of leavened wheat - based dough topped with love apple, cheese, and often various other ingredients (such as various types of sausage, anchovy, mushrooms, onions, olives, vegetable, kernel, ham, etc. ), which is and so baked at a high temperature, traditionally in a wood - fired oven.


## Antonym Augmenter
### Substitute word by antonym

In [26]:
aug = naw.AntonymAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("\nAugmented Text:")
print(augmented_text)

Original:
Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.

Augmented Text:
Pizza differ a dish of Italian origin consisting of a remarkably round, natural base of unleavened wheat - based dough topped with tomatoes, cheese, and rarely various same ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc. ), which differ then baked at a low temperature, traditionally in a wood - hire oven.


## Random Word Augmenter
### Swap word randomly

In [35]:
aug = naw.RandomWordAug(action="swap")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("\nAugmented Text:")
print(augmented_text)

Original:
Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.

Augmented Text:
Is pizza a dish Italian of origin consisting of a usually round, flat base of wheat leavened - based dough topped, with tomatoes cheese, and often various other (ingredients such various as types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat ham, , etc. ), is which baked then at a high temperature, traditionally in a wood - fired oven.


### Delete word randomly

In [36]:
aug = naw.RandomWordAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("\nAugmented Text:")
print(augmented_text)

Original:
Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.

Augmented Text:
Pizza is a dish of Italian origin consisting a usually, flat base of leavened wheat - dough with tomatoes, cheese, and often various ingredients (such various of sausage, , mushrooms, onions, olives, , meat, , etc. ), which is then baked at a high temperature, traditionally in a wood - fired oven.


### Delete a set of contiguous words randomly

In [39]:
aug = naw.RandomWordAug(action='crop')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("\nAugmented Text:")
print(augmented_text)

Original:
Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.

Augmented Text:
Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat - based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham a high temperature, traditionally in a wood - fired oven.


---

# TextAttack 🐙

In [None]:
!pip install textattack --upgrade

In [None]:
!pip install tensorflow-text

## WordNetAugmenter
`WordNetAugmenter` augments text by replacing words with synonyms provided by WordNet.

In [6]:
import textattack
from textattack.augmentation import WordNetAugmenter

wordnet_aug = WordNetAugmenter()
wordnet_aug.augment(text)

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


['Pizza is a dish of Italian origin consisting of a usually rhythm, matte meanspirited of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as respective types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a gamy temperature, traditionally in a wood-fired oven.']

## EmbeddingAugmenter 
`EmbeddingAugmenter` augments text by replacing words with their neighbors in the counter-fitted embedding space, subject to the constraint that their cosine similarity is at least 0.8.

In [None]:
from textattack.augmentation import EmbeddingAugmenter
embed_aug = EmbeddingAugmenter()
embed_aug.augment(text)
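
The cosine-similarity constraint described above can be illustrated in plain Python. This is a sketch of the idea only, not TextAttack's code: `similar_enough` is a hypothetical helper and the three-dimensional vectors are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similar_enough(word, embeddings, min_cos_sim=0.8):
    """Return the words whose cosine similarity with `word` is at least min_cos_sim."""
    target = embeddings[word]
    return sorted(
        other
        for other, vec in embeddings.items()
        if other != word and cosine_similarity(target, vec) >= min_cos_sim
    )

# Made-up 3-dimensional vectors, for illustration only.
toy_embeddings = {
    "round": [1.0, 0.2, 0.0],
    "circular": [0.9, 0.3, 0.1],
    "flat": [0.1, 1.0, 0.2],
}

print(similar_enough("round", toy_embeddings))  # ['circular']
```

Only candidates that pass the threshold are eligible replacements, which keeps substitutions close in meaning to the original word.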

## CharSwapAugmenter 
`CharSwapAugmenter` augments text by substituting, deleting, inserting, and swapping adjacent characters.

In [3]:
from textattack.augmentation import CharSwapAugmenter
charswap_aug = CharSwapAugmenter()
charswap_aug.augment(text)

['iPzza is a dish of Italian origin consisting of a usually round, flat gbase of leavened heat-based dough topped with tomatoes, cheese, and often various other ingredients (such as variou types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperaeture, traditionally in a wood-fired oven.']

In [6]:
# Import transformations, constraints, and the Augmenter
from textattack.transformations import WordSwapRandomCharacterDeletion
from textattack.transformations import WordSwapQWERTY
from textattack.transformations import CompositeTransformation

from textattack.constraints.pre_transformation import RepeatModification
from textattack.constraints.pre_transformation import StopwordModification

from textattack.augmentation import Augmenter

In [17]:

# Set up transformation using CompositeTransformation()
transformation = CompositeTransformation([WordSwapRandomCharacterDeletion(), WordSwapQWERTY()])
# Set up constraints
constraints = [RepeatModification(), StopwordModification()]
# Create augmenter with specified parameters
augmenter = Augmenter(transformation=transformation, constraints=constraints, pct_words_to_swap=0.5, transformations_per_example=10)
s = 'Pizza is a dish of Italian origin consisting of a usually round.'
# Augment!
augmenter.augment(s)

['Piza is a dih of Itallan origin conslsting of a usualky roun.',
 'Piza is a dish of Italiah origi consising of a sually roun.',
 'Pizz is a fish of Ialian ofigin cnsisting of a usjally round.',
 'Pizza is a xish of Ialian otigin conisting of a ushally rounx.',
 'Pizza is a xish of Itlian oriin fonsisting of a usuaily roun.',
 'Plzza is a didh of Itakian origin consistiny of a sually rpund.',
 'Puzza is a dsh of Itaian oigin consistkng of a usualiy round.',
 'Pzza is a djsh of Italiqn oriin consistibg of a usully round.',
 'izza is a dish of Italin orign conssting of a usully roun.',
 'izza is a fish of Itwlian oritin consisting of a jsually roud.']

## DeletionAugmenter
`DeletionAugmenter` creates new text by deleting words from the original text.

In [3]:
from textattack.augmentation import DeletionAugmenter

deletion_aug = DeletionAugmenter()
deletion_aug.augment(text)

['Pizza is a dish of Italian origin consisting of a usually round, flat base of wheat-based dough topped tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives,,, ham,.), which is then baked at a high temperature, traditionally in a wood-fired oven.']

## EasyDataAugmenter technique

This augments the text with a combination of different methods, such as:
- Randomly swapping the positions of the words in the sentence.
- Randomly removing words from the sentence.
- Randomly inserting a random synonym of a random word at a random location.
- Randomly replacing words with their synonyms.

In [4]:
from textattack.augmentation import EasyDataAugmenter

eda_aug = EasyDataAugmenter()
eda_aug.augment(text)

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


['Pizza is a bag of Italian origin consisting of a usually around, 2-dimensional alkali of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, veg, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.',
 'Pizza is a dish of Italian origin consisting of a usually round, flat at base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, At onions, olives, vegetables, meat, ham, etc.), which live is then baked at a high temperature, traditionally in a wood-fired oven.',
 'is a dish of Italian origin consisting of a usually round, flat base of wheat-based dough topped with tomatoes, cheese, and often various other ingredients ( as various types of sausage,, mushrooms, onions, olives, vegetables,, ham, etc.), which is then baked at a high
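
The four operations listed above can be sketched in plain Python. This is a simplified illustration, not TextAttack's implementation: the function names are hypothetical and a tiny hand-written thesaurus stands in for WordNet.

```python
import random

# Hypothetical mini-thesaurus standing in for WordNet.
TOY_SYNONYMS = {"dish": ["meal", "plate"], "round": ["circular"], "flat": ["level"]}

def random_swap(words, rng):
    """Swap the words at two randomly chosen positions."""
    words = words[:]
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, rng, p=0.1):
    """Drop each word with probability p, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

def random_insertion(words, rng, synonyms):
    """Insert a synonym of a random known word at a random position."""
    words = words[:]
    known = [w for w in words if w in synonyms]
    if known:
        new_word = rng.choice(synonyms[rng.choice(known)])
        words.insert(rng.randrange(len(words) + 1), new_word)
    return words

def synonym_replacement(words, rng, synonyms):
    """Replace one random known word with one of its synonyms."""
    words = words[:]
    positions = [i for i, w in enumerate(words) if w in synonyms]
    if positions:
        i = rng.choice(positions)
        words[i] = rng.choice(synonyms[words[i]])
    return words

rng = random.Random(0)
tokens = "pizza is a dish with a round flat base".split()
for op in (random_swap, random_deletion):
    print(" ".join(op(tokens, rng)))
for op in (random_insertion, synonym_replacement):
    print(" ".join(op(tokens, rng, TOY_SYNONYMS)))
```

Each function returns a new list and leaves its input untouched, so the operations can be applied independently to produce several variants of one sentence.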

---

# Back Translation with MarianMT from HuggingFace

> MarianMTModel requires the PyTorch library

In [None]:
!pip install transformers==4.1.1 sentencepiece==0.1.94
!pip install mosestokenizer==1.1.0

In [4]:
from transformers import MarianMTModel, MarianTokenizer

Next, we initialize a model that translates from English to Romance languages. A single model can translate into any of the supported Romance languages.

In [5]:
target_model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
target_tokenizer = MarianTokenizer.from_pretrained(target_model_name)
target_model = MarianMTModel.from_pretrained(target_model_name)

Similarly, we can initialize a model that translates from Romance languages back to English.

In [6]:
en_model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
en_tokenizer = MarianTokenizer.from_pretrained(en_model_name)
en_model = MarianMTModel.from_pretrained(en_model_name)

Next, we write a helper function that translates a batch of texts, given the machine translation model, its tokenizer, and the target Romance language.

In [7]:
def translate(texts, model, tokenizer, language="fr"):
    # Prepare the text data into appropriate format for the model
    template = lambda text: f"{text}" if language == "en" else f">>{language}<< {text}"
    src_texts = [template(text) for text in texts]

    # Tokenize the texts and return PyTorch tensors for model.generate()
    encoded = tokenizer.prepare_seq2seq_batch(src_texts, return_tensors="pt")
    
    # Generate translation using model
    translated = model.generate(**encoded)

    # Convert the generated tokens indices back into text
    translated_texts = tokenizer.batch_decode(translated, skip_special_tokens=True)
    
    return translated_texts

Next, we will prepare a function to use the above `translate()` function to perform back translation.

In [8]:
def back_translate(texts, source_lang="en", target_lang="fr"):
    # Translate from source to target language
    target_texts = translate(texts, target_model, target_tokenizer,
                             language=target_lang)

    # Translate from target language back to source language
    back_translated_texts = translate(target_texts, en_model, en_tokenizer,
                                      language=source_lang)
    
    return back_translated_texts

Now, we can perform data augmentation using back-translation from English to Spanish on a list of sentences as shown below.

In [None]:
en_texts = ['This is so cool', 'I hated the food', 'They were very helpful']

aug_texts = back_translate(en_texts, source_lang="en", target_lang="es")
print(aug_texts)
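
Back-translation can return a sentence unchanged, especially for short inputs, and such results add no variety to the training set. A possible post-processing step is sketched below; `keep_new_paraphrases` is a hypothetical helper, and the candidate strings are made up rather than real model output.

```python
def keep_new_paraphrases(originals, candidates):
    """Pair originals with back-translations, dropping results that match
    any original or an earlier result (ignoring case and extra spaces)."""
    def normalize(s):
        return " ".join(s.lower().split())

    seen = {normalize(o) for o in originals}
    pairs = []
    for original, candidate in zip(originals, candidates):
        key = normalize(candidate)
        if key not in seen:
            seen.add(key)
            pairs.append((original, candidate))
    return pairs

# Made-up strings standing in for back_translate() output.
originals = ["This is so cool", "I hated the food"]
candidates = ["This is so cool", "I hated food"]
print(keep_new_paraphrases(originals, candidates))
```

In the notebook, `candidates` would be the list returned by `back_translate(en_texts, ...)`.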

---
# AugLy

- [GitHub - AugLy](https://github.com/facebookresearch/AugLy)
- [AugLy’s documentation](https://augly.readthedocs.io/en/latest/index.html)

In [None]:
!pip install augly[text]

In [14]:
import augly.text as textaugs

# Define input text
input_text = "Hello, world! How are you today?"
input_text

'Hello, world! How are you today?'

In [25]:
# Now we can apply various augmentations!
print(textaugs.simulate_typos(text))

Pizza is a dish of Italizn origin consisying of a usally round, fat bzse of leavened w^heat-based dough topped with tomatoes, cheese, andd oftenly various other ingredients (such as avrious types of sausage, anchovies, ushrooms, onions, olives, vegitables, meta, ahm, ect.), which hs thn bak(d ast a yhigh temperature, traditionaly in a wood-fired ven.


In [27]:
# You can evaluate the fairness of your model by swapping gender in text
# inputs & evaluating the performance!
gendered_text = "She has two brothers, but she always wanted a sister"
aug = textaugs.SwapGenderedWords(aug_word_p=1.0)
print(aug(gendered_text))

He has two sisters, but he always wanted a brother
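
The idea in the comment above, comparing model predictions before and after a gender swap, can be sketched without AugLy. Everything here is an illustrative assumption: the word pairs are a small hand-written list, `flip_rate` is a hypothetical metric, and `toy_predict` is a deliberately biased stand-in for a real classifier.

```python
# Hypothetical word pairs; AugLy ships its own mapping.
GENDER_PAIRS = {
    "she": "he", "he": "she",
    "sister": "brother", "brother": "sister",
    "sisters": "brothers", "brothers": "sisters",
}

def swap_gender(text_in):
    """Swap each listed gendered word for its counterpart (lower-cased, whitespace tokens)."""
    return " ".join(GENDER_PAIRS.get(w, w) for w in text_in.lower().split())

def flip_rate(predict, texts):
    """Fraction of texts whose predicted label changes after the gender swap."""
    flips = sum(predict(t.lower()) != predict(swap_gender(t)) for t in texts)
    return flips / len(texts)

def toy_predict(t):
    """Deliberately biased toy classifier, used only to produce a non-zero flip rate."""
    return "positive" if "she" in t.split() else "negative"

texts = ["She has two brothers", "The food was great"]
print(flip_rate(toy_predict, texts))  # 0.5
```

A flip rate near zero suggests the predictions do not depend on the gendered words that were swapped; a high rate points to inputs worth inspecting.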


In [24]:
aug = textaugs.Contractions(aug_p=1.0)
print(aug(input_text))

["Hello, world! How're you today?"]


In [31]:
augmented_punctuation = textaugs.insert_punctuation_chars(
    text,
    granularity="all",
    cadence=5.0,
    vary_chars=True,
)

print(augmented_punctuation)

["Pizza- is a. dish? of I:talia,n ori,gin c;onsis-ting ,of a ...usual-ly ro-und, !flat .base ?of le!avene.d whe?at-ba!sed d,ough !toppe'd wit,h tom...atoes?, che-ese, ;and o.ften -vario:us ot.her i.ngred.ients? (suc'h as ;vario...us ty.pes o!f sau.sage,? anch!ovies., mus!hroom.s, on.ions,: oliv,es, v...egeta.bles,, meat!, ham,, etc?.), w!hich ,is th,en ba-ked a,t a h-igh t-emper,ature:, tra'ditio'nally... in a. wood.-fire!d ove...n."]


In [33]:
print(textaugs.change_case(text))

['PIZZA IS a dish OF italian ORIGIN CONSISTING of A USUALLY Round, flat BASE of Leavened wheat-based Dough topped with tomatoes, Cheese, and Often Various OTHER ingredients (such As VARIOUS Types OF SAUSAGE, ANCHOVIES, MUSHROOMS, Onions, OLIVES, vegetables, Meat, Ham, Etc.), WHICH is THEN BAKED at A HIGH TEMPERATURE, Traditionally In a wood-fired Oven.']


In [37]:
print(textaugs.insert_whitespace_chars(text))

['P\x0ci\x0cz\x0cz\x0ca\x0c \x0ci\x0cs\x0c \x0ca\x0c \x0cd\x0ci\x0cs\x0ch\x0c \x0co\x0cf\x0c \x0cI\x0ct\x0ca\x0cl\x0ci\x0ca\x0cn\x0c \x0co\x0cr\x0ci\x0cg\x0ci\x0cn\x0c \x0cc\x0co\x0cn\x0cs\x0ci\x0cs\x0ct\x0ci\x0cn\x0cg\x0c \x0co\x0cf\x0c \x0ca\x0c \x0cu\x0cs\x0cu\x0ca\x0cl\x0cl\x0cy\x0c \x0cr\x0co\x0cu\x0cn\x0cd\x0c,\x0c \x0cf\x0cl\x0ca\x0ct\x0c \x0cb\x0ca\x0cs\x0ce\x0c \x0co\x0cf\x0c \x0cl\x0ce\x0ca\x0cv\x0ce\x0cn\x0ce\x0cd\x0c \x0cw\x0ch\x0ce\x0ca\x0ct\x0c-\x0cb\x0ca\x0cs\x0ce\x0cd\x0c \x0cd\x0co\x0cu\x0cg\x0ch\x0c \x0ct\x0co\x0cp\x0cp\x0ce\x0cd\x0c \x0cw\x0ci\x0ct\x0ch\x0c \x0ct\x0co\x0cm\x0ca\x0ct\x0co\x0ce\x0cs\x0c,\x0c \x0cc\x0ch\x0ce\x0ce\x0cs\x0ce\x0c,\x0c \x0ca\x0cn\x0cd\x0c \x0co\x0cf\x0ct\x0ce\x0cn\x0c \x0cv\x0ca\x0cr\x0ci\x0co\x0cu\x0cs\x0c \x0co\x0ct\x0ch\x0ce\x0cr\x0c \x0ci\x0cn\x0cg\x0cr\x0ce\x0cd\x0ci\x0ce\x0cn\x0ct\x0cs\x0c \x0c(\x0cs\x0cu\x0cc\x0ch\x0c \x0ca\x0cs\x0c \x0cv\x0ca\x0cr\x0ci\x0co\x0cu\x0cs\x0c \x0ct\x0cy\x0cp\x0ce\x0cs\x0c \x0co\x0cf\x0c \x0cs\x0ca\x0cu\x

In [38]:
print(textaugs.insert_zero_width_chars(text))

['P\u2064i\u2064z\u2064z\u2064a\u2064 \u2064i\u2064s\u2064 \u2064a\u2064 \u2064d\u2064i\u2064s\u2064h\u2064 \u2064o\u2064f\u2064 \u2064I\u2064t\u2064a\u2064l\u2064i\u2064a\u2064n\u2064 \u2064o\u2064r\u2064i\u2064g\u2064i\u2064n\u2064 \u2064c\u2064o\u2064n\u2064s\u2064i\u2064s\u2064t\u2064i\u2064n\u2064g\u2064 \u2064o\u2064f\u2064 \u2064a\u2064 \u2064u\u2064s\u2064u\u2064a\u2064l\u2064l\u2064y\u2064 \u2064r\u2064o\u2064u\u2064n\u2064d\u2064,\u2064 \u2064f\u2064l\u2064a\u2064t\u2064 \u2064b\u2064a\u2064s\u2064e\u2064 \u2064o\u2064f\u2064 \u2064l\u2064e\u2064a\u2064v\u2064e\u2064n\u2064e\u2064d\u2064 \u2064w\u2064h\u2064e\u2064a\u2064t\u2064-\u2064b\u2064a\u2064s\u2064e\u2064d\u2064 \u2064d\u2064o\u2064u\u2064g\u2064h\u2064 \u2064t\u2064o\u2064p\u2064p\u2064e\u2064d\u2064 \u2064w\u2064i\u2064t\u2064h\u2064 \u2064t\u2064o\u2064m\u2064a\u2064t\u2064o\u2064e\u2064s\u2064,\u2064 \u2064c\u2064h\u2064e\u2064e\u2064s\u2064e\u2064,\u2064 \u2064a\u2064n\u2064d\u2064 \u2064o\u2064f\u2064t\u2064e\u2

In [39]:
print(textaugs.merge_words(text))

Pizzais a dish ofItalian originconsisting of a usually round, flat base ofleavened wheat-based dough topped withtomatoes, cheese, and often various other ingredients (suchas various typesof sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a hightemperature, traditionally in a wood-firedoven.


In [40]:
print(textaugs.replace_bidirectional(text))

['\u202e.nevo derif-doow a ni yllanoitidart ,erutarepmet hgih a ta dekab neht si hcihw ,).cte ,mah ,taem ,selbategev ,sevilo ,snoino ,smoorhsum ,seivohcna ,egasuas fo sepyt suoirav sa hcus( stneidergni rehto suoirav netfo dna ,eseehc ,seotamot htiw deppot hguod desab-taehw denevael fo esab talf ,dnuor yllausu a fo gnitsisnoc nigiro nailatI fo hsid a si azziP\u202c']


In [41]:
print(textaugs.replace_fun_fonts(text))

P̳i̳z̳z̳a̳ i̳s̳ a̳ d̳i̳s̳h̳ o̳f̳ I̳t̳a̳l̳i̳a̳n̳ o̳r̳i̳g̳i̳n̳ c̳o̳n̳s̳i̳s̳t̳i̳n̳g̳ o̳f̳ a̳ u̳s̳u̳a̳l̳l̳y̳ r̳o̳u̳n̳d̳ ,̳ f̳l̳a̳t̳ b̳a̳s̳e̳ o̳f̳ l̳e̳a̳v̳e̳n̳e̳d̳ w̳h̳e̳a̳t̳ -̳ b̳a̳s̳e̳d̳ d̳o̳u̳g̳h̳ t̳o̳p̳p̳e̳d̳ w̳i̳t̳h̳ t̳o̳m̳a̳t̳o̳e̳s̳ ,̳ c̳h̳e̳e̳s̳e̳ ,̳ a̳n̳d̳ o̳f̳t̳e̳n̳ v̳a̳r̳i̳o̳u̳s̳ o̳t̳h̳e̳r̳ i̳n̳g̳r̳e̳d̳i̳e̳n̳t̳s̳ (̳ s̳u̳c̳h̳ a̳s̳ v̳a̳r̳i̳o̳u̳s̳ t̳y̳p̳e̳s̳ o̳f̳ s̳a̳u̳s̳a̳g̳e̳ ,̳ a̳n̳c̳h̳o̳v̳i̳e̳s̳ ,̳ m̳u̳s̳h̳r̳o̳o̳m̳s̳ ,̳ o̳n̳i̳o̳n̳s̳ ,̳ o̳l̳i̳v̳e̳s̳ ,̳ v̳e̳g̳e̳t̳a̳b̳l̳e̳s̳ ,̳ m̳e̳a̳t̳ ,̳ h̳a̳m̳ ,̳ e̳t̳c̳.̳ )̳ ,̳ w̳h̳i̳c̳h̳ i̳s̳ t̳h̳e̳n̳ b̳a̳k̳e̳d̳ a̳t̳ a̳ h̳i̳g̳h̳ t̳e̳m̳p̳e̳r̳a̳t̳u̳r̳e̳ ,̳ t̳r̳a̳d̳i̳t̳i̳o̳n̳a̳l̳l̳y̳ i̳n̳ a̳ w̳o̳o̳d̳ -̳ f̳i̳r̳e̳d̳ o̳v̳e̳n̳.̳


In [42]:
print(textaugs.replace_similar_chars(text))

Pizza 1s a dish Df Italian orig7n consisting of a usually round, flat bas3 of leavened wheat-based |)ough topped with tom@t[]es, cheese, and often var|ous ot)-(er ingredients (su[h 4s various types of sausage, anchovies, mushrooms, onions, olives, vegetables, m3at, ham, etc.), which 1s then baked a7 a |-|igh temperature, tradit!0na|_ly i^ a VVood-fired ove^.


In [43]:
print(textaugs.replace_similar_unicode_chars(text))

PiⓏza Ⓘs a dish oƒ Italian origin Ćonsϊstin₲ of a usuαlḽy round, flat base of leáveneð wheat-based douᎶh topped with tomatoes, cheese, and oᚩten vᎯrÏous otheᚱ ingreᚧiĒñts (such as various types o₣ sauşaᶃe, anchovies, mushrooms, onΐons, olive§, vegetables, ᗰeat, Ћam, eŤc.), whᎥch i₰ then baked at a high temperature, traditionally in a Ꮚood-fired ovÊn.


In [45]:
print(textaugs.replace_upside_down(text))

˙uǝʌo pǝɹᴉɟ-pooʍ ɐ uᴉ ʎllɐuoᴉʇᴉpɐɹʇ 'ǝɹnʇɐɹǝdɯǝʇ ɥɓᴉɥ ɐ ʇɐ pǝʞɐq uǝɥʇ sᴉ ɥɔᴉɥʍ ')˙ɔʇǝ 'ɯɐɥ 'ʇɐǝɯ 'sǝlqɐʇǝɓǝʌ 'sǝʌᴉlo 'suoᴉuo 'sɯooɹɥsnɯ 'sǝᴉʌoɥɔuɐ 'ǝɓɐsnɐs ɟo sǝdʎʇ snoᴉɹɐʌ sɐ ɥɔns( sʇuǝᴉpǝɹɓuᴉ ɹǝɥʇo snoᴉɹɐʌ uǝʇɟo puɐ 'ǝsǝǝɥɔ 'sǝoʇɐɯoʇ ɥʇᴉʍ pǝddoʇ ɥɓnop pǝsɐq-ʇɐǝɥʍ pǝuǝʌɐǝl ɟo ǝsɐq ʇɐlɟ 'punoɹ ʎllɐnsn ɐ ɟo ɓuᴉʇsᴉsuoɔ uᴉɓᴉɹo uɐᴉlɐʇI ɟo ɥsᴉp ɐ sᴉ ɐzzᴉԀ


In [52]:
print(textaugs.simulate_typos(text))

Pizza i5s a dish of Italian origin conissting of a usualy round, flat base of leavebed wheat-based dough top(ed iwth tomatos, chevese, and often carious otehr ingredients (such sa va^rious types oa sausage, ancbovies, mushoroms, onions, loives, vegetables, meat, ham, etc.), which is then baked ast a hihg tempertaure, tradionally in a wood-fired ovfen.


In [53]:
print(textaugs.split_words(text))

Pizza is a dis h of Ital ian origin consisting of a usually round, fl at b ase of leavened wh eat-ba sed doug h top ped wi th tomatoes, ch eese, and often various other ingredi ents (s uch as variou s types of sausage, ancho vies, mushrooms, onion s, ol ives, vegetables, mea t, ham, etc.), which is th en baked at a hig h tempera ture, traditionally in a w ood-fire d oven.
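
Both `split_words` (above) and `merge_words` (earlier in this section) leave the characters untouched and only move the spaces. The sketch below shows that behaviour with the standard library. The function names, the probability `p`, and the minimum word length are assumptions made for illustration. AugLy's own sampling logic is not reproduced here.

```python
import random


def split_words_sketch(s: str, p: float = 0.3, seed=None) -> str:
    """Insert one space inside a word with probability p."""
    rng = random.Random(seed)
    out = []
    for word in s.split(" "):
        if len(word) > 3 and rng.random() < p:
            cut = rng.randint(1, len(word) - 1)  # cut strictly inside the word
            word = word[:cut] + " " + word[cut:]
        out.append(word)
    return " ".join(out)


def merge_words_sketch(s: str, p: float = 0.3, seed=None) -> str:
    """Glue a word onto the previous one with probability p."""
    rng = random.Random(seed)
    words = s.split(" ")
    out = [words[0]]
    for word in words[1:]:
        if rng.random() < p:
            out[-1] += word
        else:
            out.append(word)
    return " ".join(out)


sample = "Pizza is a dish of Italian origin"
print(split_words_sketch(sample, seed=0))
print(merge_words_sketch(sample, seed=0))
```

Because only whitespace changes, removing all spaces from the input and from the output gives the same string, which makes a convenient sanity check.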


---

# TextAugment

TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of NLTK, Gensim, and TextBlob and plays nicely with them.

- [GitHub - TextAugment](https://github.com/dsfsi/textaugment)

In [None]:
!pip install textaugment

In [None]:
!pip install numpy nltk gensim textblob googletrans 

In [None]:
!pip install --upgrade gensim

In [60]:
import nltk
import gensim

In [89]:
nltk.download(['wordnet','punkt','averaged_perceptron_tagger'])

# averaged_perceptron_tagger: NLTK's default part-of-speech tagger model.
# A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

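Import the TextAugment augmenters. `Word2vec` needs a pre-trained embedding model loaded with gensim (done further below); `Translate` only needs a source and a target language.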

In [81]:
from textaugment import Word2vec
from textaugment import Wordnet
from textaugment import Translate

In [88]:
src = "en"  # source language of the sentence
to = "fr"   # target language
#to = "es"  # alternative target language

t = Translate(src=src, to=to)
print(text, '\n')
t.augment(text)

Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven. 



'pizza is a dish of italian origin made up of a flat base and a lefa wheat dough, garnished with tomatoes, cheese and often other ingredients (such as various types of sausages, anchovies, anchovies, anchovies, anchovies, anchovies, anchovies, anchovies, anchovies mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then cooked at high temperature, traditionally in a wood oven.'
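
The result above is a paraphrase obtained by back translation: the sentence goes from the source language to the target language and back again, and the round trip changes some of the wording. The sketch below shows only that round-trip structure. `translate_fn` is a hypothetical stand-in for a real translation service, and the toy word table exists solely so the example runs offline. It is not how TextAugment is implemented.

```python
from typing import Callable

# Signature assumed for illustration: (sentence, source_lang, target_lang) -> str
TranslateFn = Callable[[str, str, str], str]


def back_translate(sentence: str, src: str, pivot: str, translate_fn: TranslateFn) -> str:
    """Translate src -> pivot -> src and return the round-tripped sentence."""
    pivot_text = translate_fn(sentence, src, pivot)
    return translate_fn(pivot_text, pivot, src)


# Toy word-level "translator" so the sketch runs without network access.
TOY_TABLE = {
    ("en", "fr"): {"baked": "cuit", "oven": "four"},
    ("fr", "en"): {"cuit": "cooked", "four": "oven"},
}


def toy_translate(sentence: str, src: str, dest: str) -> str:
    table = TOY_TABLE[(src, dest)]
    return " ".join(table.get(word, word) for word in sentence.split())


print(back_translate("baked in a wood-fired oven", "en", "fr", toy_translate))
```

The toy table maps "cuit" back to "cooked" rather than "baked", mirroring the baked/cooked change visible in the real output.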

## Load Fasttext Embeddings

fastText provides pre-trained word vectors trained on English web crawl and Wikipedia data, which you can find [here](https://fasttext.cc/docs/en/english-vectors.html), as well as pre-trained models for 157 languages, which you can find [here](https://fasttext.cc/docs/en/crawl-vectors.html).

In [91]:
# Download the FastText embeddings in the language of your choice
!wget "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz"

--2023-02-10 21:03:19--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 104.22.74.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4503593528 (4.2G) [application/octet-stream]
Saving to: ‘cc.en.300.bin.gz’


2023-02-10 21:05:30 (33.4 MB/s) - ‘cc.en.300.bin.gz’ saved [4503593528/4503593528]



In [3]:
!pip install gensim

Collecting gensim
  Using cached gensim-4.2.0-cp36-cp36m-linux_x86_64.whl
Installing collected packages: gensim
Successfully installed gensim-4.2.0


In [None]:
# Load the pre-trained fastText model downloaded above
import gensim

# To load from another location, replace the file name with the path to your model file
model = gensim.models.fasttext.load_facebook_model('cc.en.300.bin.gz')

In [None]:
from textaugment import Word2vec
t = Word2vec(model = model.wv)
output = t.augment('The stories are good')
print(output)
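
Conceptually, embedding-based augmentation swaps some words for near neighbours in the vector space. The sketch below assumes a lookup callable shaped like gensim's `most_similar`, returning a list of `(word, score)` pairs. The neighbour table and its scores are made up so the example runs without a 4.2 GB model, and `substitute_words` is a hypothetical helper, not TextAugment's implementation.

```python
import random

# Made-up neighbours and scores, standing in for a real embedding model.
TOY_NEIGHBOURS = {
    "good": [("great", 0.90), ("nice", 0.88)],
    "stories": [("tales", 0.85)],
}


def toy_most_similar(word, topn=5):
    """Return up to topn (word, score) pairs, like gensim's most_similar."""
    return TOY_NEIGHBOURS.get(word, [])[:topn]


def substitute_words(sentence, most_similar, p=0.5, seed=None):
    """Replace each word that has neighbours with probability p."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        neighbours = most_similar(word.lower())
        if neighbours and rng.random() < p:
            word = rng.choice(neighbours)[0]  # keep the word, drop the score
        out.append(word)
    return " ".join(out)


print(substitute_words("The stories are good", toy_most_similar, p=1.0, seed=0))
```

With a real model, `toy_most_similar` could be replaced by `model.wv.most_similar`, at the cost of loading the full embeddings.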

---

# GPT-3 by OpenAI

In [None]:
!pip install openai

In [11]:
import openai

In [12]:
# Add your key here (avoid committing a real key with the notebook)
openai.api_key = ''
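
A key typed directly into a notebook is easy to leak when the notebook is shared. One common alternative, sketched here, is to read it from an environment variable. `OPENAI_API_KEY` is the variable name conventionally used with the `openai` package, and `load_api_key` is a hypothetical helper.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in the environment, or fail with a clear message."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage (assumes the variable is set in the shell or notebook environment):
# openai.api_key = load_api_key()
```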

In [36]:
# Prompts

# Change ingredients
#prompt = 'Change ingredients in this sentence'
#prompt = 'Change ingredients in this sentence to seafood'
#prompt = 'Change ingredients in this sentence to make pepperoni pizza'


# Rephrase
#prompt = 'Rephrase this sentence'
prompt = 'Rephrase and change ingredients in this sentence'


# Summary
#prompt = 'Make a summary of this sentence'


# Replace with synonyms
#prompt = 'Replace words in this sentence with synonyms'


# Augment
#prompt = 'Augment this sentence'
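
Each instruction above is combined with `text` in the next cell through the template `"{}:\n\n{}\n\n"`. To compare several instructions side by side, the prompts can be built in one pass. The sketch below makes no API call. `build_prompt` and `INSTRUCTIONS` are hypothetical names, and the helper strips any trailing colon so the template does not double it.

```python
TEMPLATE = "{}:\n\n{}\n\n"  # same template as the completion call

INSTRUCTIONS = [
    "Rephrase this sentence",
    "Rephrase and change ingredients in this sentence",
    "Make a summary of this sentence",
]


def build_prompt(instruction: str, passage: str) -> str:
    """Format one instruction and the passage into a completion prompt."""
    return TEMPLATE.format(instruction.rstrip(": "), passage.strip())


sample = "Pizza is a dish of Italian origin."
prompts = [build_prompt(instruction, sample) for instruction in INSTRUCTIONS]
print(prompts[0])
```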

In [37]:
# Models:
# `text-curie-001`
# `text-babbage-001`
# `text-ada-001`
# `text-davinci-003`

response_extract = openai.Completion.create(
    model = "text-davinci-003",
    prompt = "{}:\n\n{}\n\n".format(prompt, text),
    #suffix = '\n\n', 
    max_tokens = 1500,
    temperature = 0.7,
    #top_p = 1.0,
    n = 3, 
    #stream = True,
    #logprobs = 3,
    #echo = True,
    #stop = None,
    presence_penalty = 1,  #0.0
    frequency_penalty = 1, #0.8
    best_of = 4,
    #logit_bias = {16971: -100, 198: -100, 49802: -100, 6099: -100, 220: -100, 6099: -100} #remove word beer
)

print(text)
print('\n\n', prompt, ':')
response_extract
#response_extract['choices'][0]['text']

Pizza is a dish of Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven.


 Rephrase and change ingredients in this sentence :


<OpenAIObject text_completion id=cmpl-6m5Bx5ksE7lERR26CgQdbhvR5hmLb at 0x7fc263d48bf8> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": "\nPizza, an Italian dish characterized by a flat, round base of yeast-risen wheat dough topped with tomatoes and cheese plus additional toppings such as sausage, anchovies, mushrooms, onions, olives, vegetables, meat or ham is traditionally cooked at a high temperature in a wood-fired oven."
    },
    {
      "finish_reason": "stop",
      "index": 1,
      "logprobs": null,
      "text": "\nPizza is an Italian favorite that typically consists of a flat, round base made with leavened wheat-based dough, topped with tomatoes and cheese plus other ingredients like different types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat or ham. The pie is then baked at a high temperature in a wood-fired oven."
    },
    {
      "finish_reason": "stop",
      "index": 2,
      "