# NLP Augmentation (Offline)

**In computer vision problems, there is a virtual infinitude of techniques you can use to augment your images ranging from simple techniques like randomly flipping images to blending images together with CutMix or MixUp. In natural language processing, it is not as easy to come up with similar augmentation strategies; we must be a little more creative**

**The first idea I had was to randomly replace words with their synonyms or to randomly add word synonyms to the sequence, but then I saw [this kernel](https://www.kaggle.com/jpmiller/augmenting-data-with-translations) which is based on [this discussion thread](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/48038) and realized we can do better: we can use translation to augment our data and do several things:**

1. We can experiment and see if training our model on one language is better/worse than training on multiple languages
2. We can change the distribution of languages in our dataset, perhaps translating sentences to low-resource languages like Swahili and Urdu
3. We can randomly translate sentences to another language and then translate them back to the original like so:

![](https://amitness.com/images/backtranslation-en-fr.png)

*Image from [@amitness](https://www.kaggle.com/amitness)*

**We can also apply this augmentation in two ways: offline augmentation or online augmentation. In the first, we augment the data once, before we feed it to the model, which increases our dataset size. This is preferable for smaller datasets where we are not worried about training taking too long. When you can't afford an increase in size, you resort to online augmentation, where we augment the data on the fly every epoch. We will use offline augmentation in this commit (for more, see [this discussion](https://www.kaggle.com/c/datasciencebowl/discussion/12597))**
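
**To make the difference concrete, here is a minimal sketch of the two modes. `toy_augment` is a hypothetical stand-in for back-translation (it only swaps two adjacent words so the example runs without a translation API), and `offline_augment` and `online_batches` are illustrative names, not part of this notebook's pipeline:**

```python
import random

def toy_augment(text, rng):
    """Hypothetical stand-in for back-translation: swap two adjacent words."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def offline_augment(dataset, rng):
    """Augment once, before training: the dataset doubles in size."""
    return dataset + [toy_augment(text, rng) for text in dataset]

def online_batches(dataset, n_epochs, rng, prob=0.5):
    """Augment on the fly: each epoch sees fresh variants, size unchanged."""
    for _ in range(n_epochs):
        yield [toy_augment(t, rng) if rng.random() < prob else t for t in dataset]

rng = random.Random(0)
data = ["the cat sat on the mat", "dogs chase cats"]

offline = offline_augment(data, rng)
epochs = list(online_batches(data, n_epochs=3, rng=rng))

print(len(offline))             # 4: originals plus one augmented copy each
print([len(e) for e in epochs]) # [2, 2, 2]: same size every epoch
```

**The offline version pays the augmentation cost once and stores the result, which is why it suits a slow augmentation like translation.**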

**For now, we will only translate to languages currently present in our dataset, but translating to languages outside of our dataset might give us better performance. Please note that some of these language codes are slightly different within the `googletrans` Python API. See [here](https://py-googletrans.readthedocs.io/en/latest/) for more**

In [1]:
GEN_BACK_TR = True

GEN_UPSAMPLE = False

GEN_EN_ONLY = False

In [2]:
#python basics
from matplotlib import pyplot as plt
import math, os, re, time
import numpy as np, pandas as pd, seaborn as sns

#nlp augmentation
!pip install --quiet googletrans
from googletrans import Translator

#model evaluation
from sklearn.model_selection import train_test_split, StratifiedKFold

#for fast parallel processing
from dask import bag, diagnostics



In [3]:
def back_translate(sequence, PROB = 1):
    languages = ['en', 'fr', 'th', 'tr', 'ur', 'ru', 'bg', 'de', 'ar', 'zh-cn', 'hi',
                 'sw', 'vi', 'es', 'el']
    
    #instantiate translator
    translator = Translator()
    
    #store original language so we can convert back
    org_lang = translator.detect(sequence).lang
    
    #randomly choose a language other than the original
    #(compare strings with !=, since `is not` tests object identity)
    random_lang = np.random.choice([lang for lang in languages if lang != org_lang])
    
    if org_lang in languages:
        #translate to new language
        translated = translator.translate(sequence, dest = random_lang).text
        #translate back to original language
        translated_back = translator.translate(translated, dest = org_lang).text
    
        #apply with certain probability
        if np.random.uniform(0, 1) <= PROB:
            output_sequence = translated_back
        else:
            output_sequence = sequence
            
    #if detected language not in our list of languages, do nothing
    else:
        output_sequence = sequence
    
    return output_sequence

#check performance
for i in range(5):
    output = back_translate('I genuinely have no idea what the output of this sequence of words will be')
    print(output)

I really have no idea what the output of this word string will be
I really have no idea what the outcome of this series of words will be
I really have no idea what this sequence of words will be like
I really have no idea what the outcome of this string of words is
I really don't know what the results of this word sequence will be.


In [4]:
#applies the function defined above in parallel with Dask
def back_translate_parallel(dataset):
    prem_bag = bag.from_sequence(dataset['premise'].tolist()).map(back_translate)
    hyp_bag =  bag.from_sequence(dataset['hypothesis'].tolist()).map(back_translate)
    
    with diagnostics.ProgressBar():
        prems = prem_bag.compute()
        hyps = hyp_bag.compute()

    #pair premises and hypothesis
    dataset[['premise', 'hypothesis']] = list(zip(prems, hyps))
    
    return dataset

In [5]:
twice_train_aug = pd.read_csv('../input/contradictorywatsontwicetranslatedaug/twice_translated_aug_train.csv')
twice_test_aug = pd.read_csv('../input/contradictorywatsontwicetranslatedaug/twice_translated_aug_test.csv')

In [6]:
#now we apply translation augmentation
if GEN_BACK_TR:
    train_thrice_aug = twice_train_aug.pipe(back_translate_parallel)
    test_thrice_aug = twice_test_aug.pipe(back_translate_parallel)
    
    #index = False avoids writing an extra 'Unnamed: 0' column
    train_thrice_aug.to_csv('thrice_translation_aug_train.csv', index = False)
    test_thrice_aug.to_csv('thrice_translation_aug_test.csv', index = False)

[########################################] | 100% Completed | 29min  0.5s
[########################################] | 100% Completed | 25min  3.1s
[########################################] | 100% Completed | 12min 43.1s
[########################################] | 100% Completed | 11min  1.2s


# For TTA

**I have already created datasets where each premise/hypothesis is separately mapped to a random language and back. One is translated a single time and can be found [here](https://www.kaggle.com/tuckerarrants/contradictorywatsontranslationaug); the other is translated twice and can be found [here](https://www.kaggle.com/tuckerarrants/contradictorywatsontwicetranslatedaug).**

**Note that we can augment the test set as well and use test time augmentation (TTA) where we make separate predictions on the original sequences and the augmented sequences, and then use the average of the predictions for our final predictions:**
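
**A minimal sketch of that averaging step, assuming the model outputs one row of class probabilities per test example. The arrays below are made-up numbers, and `tta_average` is an illustrative helper, not something defined elsewhere in this notebook:**

```python
import numpy as np

def tta_average(prob_sets):
    """Average class probabilities predicted on several views of the test set.

    prob_sets: list of arrays, each of shape (n_examples, n_classes).
    """
    stacked = np.stack(prob_sets, axis=0)  # (n_views, n_examples, n_classes)
    return stacked.mean(axis=0)

# made-up probabilities for 2 examples and 3 classes
probs_original = np.array([[0.7, 0.2, 0.1],
                           [0.2, 0.5, 0.3]])
probs_augmented = np.array([[0.5, 0.4, 0.1],
                            [0.1, 0.3, 0.6]])

avg = tta_average([probs_original, probs_augmented])
final_labels = avg.argmax(axis=1)
print(final_labels)  # [0 2]
```

**Note that the second example changes from class 1 to class 2 once the augmented view is averaged in, which is exactly the kind of correction TTA is meant to allow.**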

In [7]:
#offline loading
train = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")
test = pd.read_csv("../input/contradictory-my-dear-watson/test.csv")

train_aug = pd.read_csv("../input/contradictorywatsontwicetranslatedaug/translation_aug_train.csv")
test_aug = pd.read_csv("../input/contradictorywatsontwicetranslatedaug/translation_aug_test.csv")

train_twice_aug = pd.read_csv("../input/contradictorywatsontwicetranslatedaug/twice_translated_aug_train.csv")
test_twice_aug = pd.read_csv("../input/contradictorywatsontwicetranslatedaug/twice_translated_aug_test.csv")

#view original
print(train.shape)
train.head()

(12120, 6)


Unnamed: 0,id,premise,hypothesis,lang_abv,language,label
0,5130fd2cb5,and these comments were considered in formulat...,The rules developed in the interim were put to...,en,English,0
1,5b72532a0b,These are issues that we wrestle with in pract...,Practice groups are not permitted to work on t...,en,English,2
2,3931fbe82a,Des petites choses comme celles-là font une di...,J'essayais d'accomplir quelque chose.,fr,French,0
3,5622f0c60b,you know they can't really defend themselves l...,They can't defend themselves because of their ...,en,English,0
4,86aaa48b45,ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด...,เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร,th,Thai,1


In [8]:
#view first aug
print(train_aug.shape)
train_aug.head()

(12120, 6)


Unnamed: 0,id,premise,hypothesis,lang_abv,language,label
0,5130fd2cb5,and these ideas were considered in the develop...,The rules developed in the interim were taken ...,en,English,0
1,5b72532a0b,These are the challenges we face in practice w...,Practice groups are not permitted to work on t...,en,English,2
2,3931fbe82a,Ces petites choses font une grande différence ...,J'essaye d'accomplir quelque chose.,fr,French,0
3,5622f0c60b,Do you know that they can't really defend them...,They cannot protect themselves because of age.,en,English,0
4,86aaa48b45,เล่นตามบทบาทด้วย โอกาสในการแสดงและเล่นบทบาทหลา...,เด็ก ๆ สามารถเห็นได้ว่าชาติพันธุ์ต่างๆมีความแต...,th,Thai,1


In [9]:
#view second aug
print(train_twice_aug.shape)
train_twice_aug.head()

(12120, 6)


Unnamed: 0,id,premise,hypothesis,lang_abv,language,label
0,5130fd2cb5,and these ideas were taken into consideration ...,Interim rules developed in conjunction with th...,en,English,0
1,5b72532a0b,These are the challenges we face in practice w...,Practice groups are not allowed to deal with t...,en,English,2
2,3931fbe82a,Ces petites choses font une grande différence ...,J'essaye d'accomplir quelque chose.,fr,French,0
3,5622f0c60b,Do you know that they can't really defend them...,They cannot protect themselves because of age.,en,English,0
4,86aaa48b45,การเล่นบทบาทสมมติยังสามารถช่วยให้เกิดโอกาสในกา...,เด็ก ๆ สามารถเห็นโครงสร้างทางชาติพันธุ์ที่แตกต...,th,Thai,1


In [10]:
#view third aug
print(train_thrice_aug.shape)
train_thrice_aug.head()

(12120, 6)


Unnamed: 0,id,premise,hypothesis,lang_abv,language,label
0,5130fd2cb5,and these ideas were taken into account in the...,Provisional rules have been developed in conne...,en,English,0
1,5b72532a0b,“These are the challenges we face in practice ...,Practice groups are not allowed to deal with t...,en,English,2
2,3931fbe82a,Ces petites choses font une grande différence ...,J'essaye de réaliser quelque chose.,fr,French,0
3,5622f0c60b,Did you know they can't really defend themselv...,They cannot protect themselves due to their age.,en,English,0
4,86aaa48b45,การสวมบทบาทยังสามารถช่วยให้โอกาสในการทำงานและก...,เด็กสามารถเห็นโครงสร้างทางชาติพันธุ์ที่แตกต่างกัน,th,Thai,1


**Wonderful! We can see that this translation procedure consistently alters our sentences without much information loss. Feel free to use these datasets for your own experiments and see how they improve your model's performance**

# For Oversampling

**Now we will use the same idea to create training datasets consisting only of underrepresented languages, like Urdu and Swahili. The only difference here is that we translate every sentence to one fixed target language instead of a random one, and we do not translate back to the original language:**

In [11]:
#check most undersampled languages in training dataset
train['language'].value_counts()

English       6870
Chinese        411
Arabic         401
French         390
Swahili        385
Urdu           381
Vietnamese     379
Russian        376
Hindi          374
Greek          372
Thai           371
Spanish        366
German         351
Turkish        351
Bulgarian      342
Name: language, dtype: int64

In [12]:
#check most undersampled languages in test dataset
test['language'].value_counts()

English       2945
Spanish        175
Swahili        172
Russian        172
Urdu           168
Greek          168
Turkish        167
Thai           164
Arabic         159
French         157
German         152
Chinese        151
Bulgarian      150
Hindi          150
Vietnamese     145
Name: language, dtype: int64
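
**Ranking the test-set counts above makes the least represented languages explicit. This is only a sketch: the counts are copied by hand from the output above into a `pd.Series` so that it runs on its own, and `n_targets` is an illustrative parameter:**

```python
import pandas as pd

# test-set language counts, copied from the value_counts() output above
test_counts = pd.Series({
    "English": 2945, "Spanish": 175, "Swahili": 172, "Russian": 172,
    "Urdu": 168, "Greek": 168, "Turkish": 167, "Thai": 164,
    "Arabic": 159, "French": 157, "German": 152, "Chinese": 151,
    "Bulgarian": 150, "Hindi": 150, "Vietnamese": 145,
})

n_targets = 3  # how many languages to upsample (an assumption)
targets = test_counts.nsmallest(n_targets).index.tolist()
print(sorted(targets))
```

**With the real data, `test['language'].value_counts().nsmallest(n_targets)` would give the same ranking without hard-coding the numbers.**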

**We have a choice to make here: do we translate based on language prevalence in the training dataset, on prevalence in the test dataset, or on the languages that XLM-R was trained on? I will base my translation on the test set languages, so for now I will create Vietnamese, Hindi, and Bulgarian training datasets, the three least common languages in the test set:**

In [13]:
def translation(sequence, lang):
    
    #instantiate translator
    translator = Translator()
    
    org_lang = translator.detect(sequence).lang
    
    #compare strings with !=, since `is not` tests object identity
    if lang != org_lang:
        #translate to the target language (no back-translation here)
        translated = translator.translate(sequence, dest = lang).text
        
    else:
        translated = sequence
    
    return translated

def translation_parallel(dataset, lang):
    prem_bag = bag.from_sequence(dataset['premise'].tolist()).map(lambda x: translation(x, lang = lang))
    hyp_bag =  bag.from_sequence(dataset['hypothesis'].tolist()).map(lambda x: translation(x, lang = lang))
    
    with diagnostics.ProgressBar():
        prems = prem_bag.compute()
        hyps = hyp_bag.compute()

    #pair premises and hypothesis
    dataset[['premise', 'hypothesis']] = list(zip(prems, hyps))
    
    return dataset

In [14]:
#translate to Vietnamese
prem_bag_vi = bag.from_sequence(train['premise'].tolist()).map(lambda x: translation(x, lang = 'vi'))
hyp_bag_vi =  bag.from_sequence(train['hypothesis'].tolist()).map(lambda x: translation(x, lang = 'vi'))

#translate to Hindi
prem_bag_hi = bag.from_sequence(train['premise'].tolist()).map(lambda x: translation(x, lang = 'hi'))
hyp_bag_hi =  bag.from_sequence(train['hypothesis'].tolist()).map(lambda x: translation(x, lang = 'hi'))

#translate to Bulgarian
prem_bag_bg = bag.from_sequence(train['premise'].tolist()).map(lambda x: translation(x, lang = 'bg'))
hyp_bag_bg =  bag.from_sequence(train['hypothesis'].tolist()).map(lambda x: translation(x, lang = 'bg'))

#and compute
if GEN_UPSAMPLE:
    with diagnostics.ProgressBar():
        print('Translating train to Vietnamese...')
        prems_vi = prem_bag_vi.compute()
        hyps_vi = hyp_bag_vi.compute()
        print('Done'); print('')
    
        print('Translating train to Hindi...')
        prems_hi = prem_bag_hi.compute()
        hyps_hi = hyp_bag_hi.compute()
        print('Done'); print('')
    
        print('Translating train to Bulgarian...')
        prems_bg = prem_bag_bg.compute()
        hyps_bg = hyp_bag_bg.compute()
        print('Done')
        
else:
    train_vi = pd.read_csv("../input/contradictorytranslatedtrain/train_vi.csv")
    train_hi = pd.read_csv("../input/contradictorytranslatedtrain/train_hi.csv")
    train_bg = pd.read_csv("../input/contradictorytranslatedtrain/train_bg.csv")

In [15]:
if GEN_UPSAMPLE:
    #copy so the original train frame is not overwritten
    train_vi = train.copy()
    train_vi[['premise', 'hypothesis']] = list(zip(prems_vi, hyps_vi))
    train_vi[['lang_abv', 'language']] = ['vi', 'Vietnamese']
    train_vi.to_csv('train_vi.csv', index = False)
train_vi.head()

Unnamed: 0,id,premise,hypothesis,lang_abv,language,label
0,5130fd2cb5,và những nhận xét này đã được xem xét trong vi...,Các quy tắc được phát triển trong thời gian tạ...,vi,Vietnamese,0
1,5b72532a0b,"Bà nói, đây là những vấn đề mà chúng tôi phải ...",Các nhóm thực hành không được phép làm việc về...,vi,Vietnamese,2
2,3931fbe82a,Những điều nhỏ nhặt như thế này tạo ra sự khác...,Tôi đã cố gắng hoàn thành một cái gì đó.,vi,Vietnamese,0
3,5622f0c60b,bạn biết họ không thể thực sự tự vệ như ai đó ...,Họ không thể tự vệ vì tuổi của họ.,vi,Vietnamese,0
4,86aaa48b45,Trong vai trò là tốt Cơ hội thể hiện và đóng n...,Trẻ em có thể thấy các nhóm dân tộc khác nhau ...,vi,Vietnamese,1


In [16]:
if GEN_UPSAMPLE:
    #copy so the original train frame is not overwritten
    train_hi = train.copy()
    train_hi[['premise', 'hypothesis']] = list(zip(prems_hi, hyps_hi))
    train_hi[['lang_abv', 'language']] = ['hi', 'Hindi']
    train_hi.to_csv('train_hi.csv', index = False)
train_hi.head()

Unnamed: 0,id,premise,hypothesis,lang_abv,language,label
0,5130fd2cb5,और इन टिप्पणियों को अंतरिम नियम बनाने पर विचार...,अंतरिम में विकसित नियमों को इन टिप्पणियों को ध...,hi,Hindi,0
1,5b72532a0b,"उन्होंने कहा कि ये ऐसे मुद्दे हैं, जिन पर हम क...",अभ्यास समूहों को इन मुद्दों पर काम करने की अनु...,hi,Hindi,2
2,3931fbe82a,इस तरह की छोटी-छोटी चीजें जो मैं करने की कोशिश...,मैं कुछ पूरा करने की कोशिश कर रहा था।,hi,Hindi,0
3,5622f0c60b,तुम्हें पता है कि वे वास्तव में खुद का बचाव नह...,वे अपनी उम्र के कारण खुद का बचाव नहीं कर सकते।,hi,Hindi,0
4,86aaa48b45,भूमिका में भी एक साथ कई भूमिकाओं को व्यक्त करन...,बच्चे देख सकते हैं कि विभिन्न जातीय समूह कैसे ...,hi,Hindi,1


In [17]:
if GEN_UPSAMPLE:
    #copy so the original train frame is not overwritten
    train_bg = train.copy()
    train_bg[['premise', 'hypothesis']] = list(zip(prems_bg, hyps_bg))
    train_bg[['lang_abv', 'language']] = ['bg', 'Bulgarian']
    train_bg.to_csv('train_bg.csv', index = False)
train_bg.head()

Unnamed: 0,id,premise,hypothesis,lang_abv,language,label
0,5130fd2cb5,и тези коментари бяха взети предвид при формул...,"Правилата, разработени във временното време, б...",bg,Bulgarian,0
1,5b72532a0b,"Това са проблеми, с които се борим в практичес...",Практическите групи не могат да работят по тез...,bg,Bulgarian,2
2,3931fbe82a,Малки неща като тези имат огромна разлика в то...,Опитвах се да постигна нещо.,bg,Bulgarian,0
3,5622f0c60b,"знаете, че не могат наистина да се защитят кат...",Те не могат да се защитят поради възрастта си.,bg,Bulgarian,0
4,86aaa48b45,В ролята също Възможността да изразяват и игра...,Децата могат да видят как са различни етническ...,bg,Bulgarian,1


# English Only

In [18]:
#translate to English
prem_bag_en = bag.from_sequence(train['premise'].tolist()).map(lambda x: translation(x, lang = 'en'))
hyp_bag_en =  bag.from_sequence(train['hypothesis'].tolist()).map(lambda x: translation(x, lang = 'en'))

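if GEN_EN_ONLY:
    #compute the translations before using them
    with diagnostics.ProgressBar():
        print('Translating train to English...')
        prems_en = prem_bag_en.compute()
        hyps_en = hyp_bag_en.compute()
        print('Done')

    #copy so the original train frame is not overwritten
    train_en = train.copy()
    train_en[['premise', 'hypothesis']] = list(zip(prems_en, hyps_en))
    train_en[['lang_abv', 'language']] = ['en', 'English']
    train_en.to_csv('train_en.csv', index = False)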

else:
    train_en = pd.read_csv("../input/contradictorytranslatedtrain/train_en.csv")
    
#sanity check
train_en.head()

Unnamed: 0,id,premise,hypothesis,lang_abv,language,label
0,5130fd2cb5,and these comments were considered in formulat...,The rules developed in the interim were put to...,en,English,0
1,5b72532a0b,These are issues that we wrestle with in pract...,Practice groups are not permitted to work on t...,en,English,2
2,3931fbe82a,Little things like these make a huge differenc...,I was trying to accomplish something.,en,English,0
3,5622f0c60b,you know they can't really defend themselves l...,They can't defend themselves because of their ...,en,English,0
4,86aaa48b45,In role playing as well Opportunities to expre...,Children can see how different ethnic groups are.,en,English,1


**The English translated dataset can now be found [here](https://www.kaggle.com/tuckerarrants/contradictorytranslatedtrain)**