Data augmentation is used less often for text than in computer vision. It is especially difficult for low-resource languages because their corpora are not as diverse and they lack the vector-space overlap with higher-resource languages. The goal below was to follow an example from an arXiv paper (Data Augmentation for Low-Resource Neural Machine Translation, https://arxiv.org/pdf/1705.00440.pdf), which employs a method similar to those used in computer vision. The authors call it 'translation data augmentation' (TDA). It augments the training data by altering existing sentences in the parallel corpus. The study proposes a weaker notion of label preservation that allows both source and target sentences to be altered at the same time, as long as they remain translations of each other. The paper augmented only low-frequency words; the code below includes the entire set of words.
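
To illustrate the restriction to low-frequency words mentioned above, here is a minimal sketch of how a rare-word vocabulary might be built and used to keep only augmentations that introduce a rare word. The names `tokenize`, `rare_words`, `introduces_rare_word`, the `max_count` threshold, the regex tokenizer, and the toy corpus are all assumptions made for illustration. This is not the paper's procedure and it is not used in the cells that follow.

```python
from collections import Counter
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", sentence.lower())


def rare_words(corpus, max_count=1):
    """Return the set of words that occur at most `max_count` times in the corpus."""
    counts = Counter(tok for sentence in corpus for tok in tokenize(sentence))
    return {word for word, count in counts.items() if count <= max_count}


def introduces_rare_word(original, augmented, rare):
    """True if the augmented sentence contains a rare word absent from the original."""
    new_tokens = set(tokenize(augmented)) - set(tokenize(original))
    return bool(new_tokens & rare)


# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "the journalist found a job",
    "the journalist left the job market",
    "a family found the market",
]

rare = rare_words(toy_corpus, max_count=1)
print(sorted(rare))
print(introduces_rare_word(toy_corpus[0], "the journalist found a family job", rare))
```

On a real corpus the threshold would need tuning, and a proper tokenizer would likely replace the regular expression.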

In [None]:
!pip install googletrans==3.1.0a0
!pip install nlpaug
!pip install transformers
%load_ext google.colab.data_table

In [80]:
import googletrans
import nlpaug
import nlpaug.augmenter.word as naw
from googletrans import Translator
import pandas as pd
import numpy as np
print(googletrans.LANGUAGES)

The google.colab.data_table extension is already loaded. To reload it, use:
  %reload_ext google.colab.data_table
{'af': 'afrikaans', 'sq': 'albanian', 'am': 'amharic', 'ar': 'arabic', 'hy': 'armenian', 'az': 'azerbaijani', 'eu': 'basque', 'be': 'belarusian', 'bn': 'bengali', 'bs': 'bosnian', 'bg': 'bulgarian', 'ca': 'catalan', 'ceb': 'cebuano', 'ny': 'chichewa', 'zh-cn': 'chinese (simplified)', 'zh-tw': 'chinese (traditional)', 'co': 'corsican', 'hr': 'croatian', 'cs': 'czech', 'da': 'danish', 'nl': 'dutch', 'en': 'english', 'eo': 'esperanto', 'et': 'estonian', 'tl': 'filipino', 'fi': 'finnish', 'fr': 'french', 'fy': 'frisian', 'gl': 'galician', 'ka': 'georgian', 'de': 'german', 'el': 'greek', 'gu': 'gujarati', 'ht': 'haitian creole', 'ha': 'hausa', 'haw': 'hawaiian', 'iw': 'hebrew', 'he': 'hebrew', 'hi': 'hindi', 'hmn': 'hmong', 'hu': 'hungarian', 'is': 'icelandic', 'ig': 'igbo', 'id': 'indonesian', 'ga': 'irish', 'it': 'italian', 'ja': 'japanese', 'jw': 'javanese', 'kn': 'kannada', 

In [81]:
translator = Translator()

In [105]:
# One test sentence from our Google Translate data.
test_sentence = "One had to resign from the job market where people get job opportunities, but for Zodiak video journalist in the northern region Angela Saidi found a family opportunity."

In [98]:
# 'ny' is the code for Chichewa in googletrans
Chichewa_trans = translator.translate(test_sentence, src = 'en', dest = 'ny')

In [99]:
print(Chichewa_trans.src)
print(Chichewa_trans.dest)
print("Original Sentence in English: ", Chichewa_trans.origin)
print("English translated to Chichewa: ", Chichewa_trans.text)
print(Chichewa_trans.pronunciation)

en
ny
Original Sentence in English:  One had to resign from the job market where people get job opportunities, but for Zodiak video journalist in the northern region Angela Saidi found a family opportunity.
English translated to Chichewa:  Mmodzi adayenera kusiya ntchito yomwe anthu amapeza mwayi wa ntchito, koma kwa mtolankhani wa kanema wa Zodiak kudera la kumpoto Angela Saidi adapeza mwayi wabanja.
One had to resign from the job market where people get job opportunities, but for Zodiak video journalist in the northern region Angela Saidi found a family opportunity.


In [116]:
# A DataFrame to hold the original and augmented English sentences and their Chichewa translations
augment_sents = pd.DataFrame(columns=['English; original sentence and 10 Augmentations', 'Chichewa translations'])

In [117]:
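# Use a BERT-style masked language model (DistilBERT) to insert
# contextually plausible words into the test sentence.

TOPK = 20       # sample from the 20 most likely tokens (default=100)
ACT = 'insert'  # alternative: 'substitute'

aug_bert = naw.ContextualWordEmbsAug(
    model_path='distilbert-base-uncased',
    device='cuda',  # requires a GPU runtime; use 'cpu' otherwise
    action=ACT, top_k=TOPK)

# Row 0 holds the original sentence; the augmentations follow.
augmented_temp = [test_sentence]

for _ in range(10):
    augmented_text = aug_bert.augment(test_sentence)
    # Some nlpaug releases return a list; keep a plain string either way.
    if isinstance(augmented_text, list):
        augmented_text = augmented_text[0]
    augmented_temp.append(augmented_text)
augment_sents['English; original sentence and 10 Augmentations'] = augmented_temp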

In [118]:
# Translate the original and augmented sentences to Chichewa
translated_temp = []
for sentence in augmented_temp:
    result = translator.translate(sentence, src='en', dest='ny')
    translated_temp.append(result.text)
augment_sents['Chichewa translations'] = translated_temp
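
The translation loop above calls the web service once per sentence and stops at the first failure. A more defensive variant might look like the sketch below, which takes the translation call as a plain function so it can be tried offline. The helper names `translate_all` and `_try_translate`, the `retries` and `pause` parameters, and the stand-in `fake_translate` are assumptions made for illustration, not part of googletrans.

```python
import time


def _try_translate(sentence, translate_fn, retries, pause):
    """Call translate_fn, retrying on any exception; return None if all attempts fail."""
    for attempt in range(retries + 1):
        try:
            return translate_fn(sentence)
        except Exception:
            if attempt < retries:
                time.sleep(pause)
    return None


def translate_all(sentences, translate_fn, retries=2, pause=0.0):
    """Translate each sentence once, reusing the result for repeated sentences."""
    cache = {}
    results = []
    for sentence in sentences:
        if sentence not in cache:
            cache[sentence] = _try_translate(sentence, translate_fn, retries, pause)
        results.append(cache[sentence])
    return results


def fake_translate(sentence):
    """Stand-in for a real translator so the sketch runs without network access."""
    return sentence.upper()


demo = translate_all(
    ["one had to resign", "one had to resign", "no one resigned"],
    fake_translate,
)
print(demo)
```

In the notebook, the stand-in could be replaced with something like `lambda s: translator.translate(s, src='en', dest='ny').text`, and sentences that come back as `None` could be dropped before building the DataFrame.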

In [119]:
# The dataframe holding the English augmentations and their Chichewa translations:
augment_sents

Unnamed: 0,English; original sentence and 10 Augmentations,Chichewa translations
0,One had to resign from the job market where pe...,Mmodzi adayenera kusiya ntchito yomwe anthu am...
1,one had refused to resign from all the ukraini...,m'modzi anali atakana kusiya ntchito kumsika w...
2,one had planned to constantly resign from the ...,m'modzi adakonza zosiya kusiya ntchito komwe a...
3,one editor had to resign altogether from contr...,mkonzi m'modzi adayenera kusiya ntchito yoyang...
4,only one had to resign from the job market set...,m'modzi yekha ndiye adasiya ntchito pomwe anth...
5,one had to practically resign from the job sea...,Mmodzi adayenera kusiya ntchito yofunafuna ntc...
6,sometimes one had to resign from the job marke...,Nthawi zina munthu amayenera kusiya ntchito po...
7,no one had to resign altogether from the job c...,palibe amene adasiya ntchito pa msika wolenga ...
8,suppose one had to resign from the employment ...,tiyerekeze kuti wina akuyenera kusiya ntchito ...
9,recently one journalist had opted to resign fr...,posachedwapa mtolankhani wina adasankha kusiya...
