# **DATA AUGMENTATION**

**A quick overview of a couple of data augmentation techniques.**

In [42]:
!pip install nlpaug
!pip install transformers

import random
import numpy as np
import pandas as pd
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold,StratifiedKFold

from transformers import (AutoModel, AutoTokenizer, 
                          AutoModelForSequenceClassification,get_constant_schedule_with_warmup)

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
import re     # library for regular expression operations
import string # for string operations
import collections
import gensim
from gensim import parsing
# gensim.summarization was removed in gensim 4.0, so this import needs gensim<4
from gensim.summarization.textcleaner import split_sentences

import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Load the dataset from Kaggle.

In [4]:
!pip install -q kaggle 

In [6]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"alessioborriero","key":"<redacted>"}'}

In [7]:
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [12]:
!chmod 600 ~/.kaggle/kaggle.json 

In [22]:
!kaggle competitions download -c commonlitreadabilityprize -f train.csv

train.csv.zip: Skipping, found more recently modified local copy (use --force to force download)


In [25]:
ls

kaggle.json   sample_submission.csv  train/
sample_data/  test.csv               train.csv.zip


In [23]:
!mkdir train

mkdir: cannot create directory ‘train’: File exists


In [26]:
!unzip train.csv.zip -d train

Archive:  train.csv.zip
  inflating: train/train.csv         


In [35]:
train_data = pd.read_csv('train/train.csv')
test_data = pd.read_csv('test.csv')
sample = pd.read_csv('sample_submission.csv')

In [36]:
train_data.head()

|   | id | url_legal | license | excerpt | target | standard_error |
|---|---|---|---|---|---|---|
| 0 | c12129c31 | | | When the young people returned to the ballroom... | -0.340259 | 0.464009 |
| 1 | 85aa80a4c | | | All through dinner time, Mrs. Fayre was somewh... | -0.315372 | 0.480805 |
| 2 | b69ac6792 | | | As Roger had predicted, the snow departed as q... | -0.580118 | 0.476676 |
| 3 | dd1000b26 | | | And outside before the palace a great garden w... | -1.054013 | 0.450007 |
| 4 | 37c1b32fb | | | Once upon a time there were Three Bears who li... | 0.247197 | 0.510845 |


## **Augmentation using word substitution (NLPAUG)**

Let's start with an example. The following code generates a "clone" excerpt which differs from the original by a few words in each sentence.

In [37]:
text=train_data.excerpt[0] #an example

In [46]:
split_text=split_sentences(text) #split the text into sentences; split_sentences returns a list of strings
len(split_text)

11
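
Note that `gensim.summarization` was removed in Gensim 4.0, so `split_sentences` can only be imported on older releases. A minimal stand-in, assuming a plain punctuation rule is good enough for these excerpts, might look like the sketch below. `split_sentences_simple` is a hypothetical helper, not part of gensim or nlpaug.

```python
import re

def split_sentences_simple(text):
    """Hypothetical helper: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

example = "The snow departed quickly. Was it spring? Yes!"
print(split_sentences_simple(example))
# ['The snow departed quickly.', 'Was it spring?', 'Yes!']
```

This rule also splits after abbreviations such as "Mrs.", so the sentence count may differ from the one reported above. `nltk.tokenize.sent_tokenize` is a more careful alternative.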

In the following lines we use the nlpaug library, which provides a set of tools useful for data augmentation. In this section we use an augmenter that implements word substitution with BERT (Bidirectional Encoder Representations from Transformers), a pre-trained architecture developed by Google.
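
To get an intuition for what word substitution with a masked language model does, here is a rough sketch: one word is replaced by the `[MASK]` token and a predictor proposes a word for that position. This is not nlpaug's actual implementation. `substitute_one_word` and `toy_predict` are hypothetical names, and `toy_predict` is a fixed stand-in for BERT.

```python
import random

def substitute_one_word(sentence, predict, rng=None, mask_token="[MASK]"):
    """Mask one randomly chosen word and replace it with the predictor's proposal."""
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        return sentence
    position = rng.randrange(len(words))
    masked = words.copy()
    masked[position] = mask_token          # what the language model would see
    words[position] = predict(masked, position)
    return " ".join(words)

def toy_predict(masked_words, position):
    # Stand-in for BERT: a real model would score the whole vocabulary
    # given the context around the masked position.
    return "princess"

print(substitute_one_word("the young people returned", toy_predict, random.Random(0)))
```

A real predictor picks words that fit the surrounding context, which is why the augmented sentences stay fluent even when their meaning drifts.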

In [45]:
aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action="substitute")


In [47]:
augmented_text = aug.augment(split_text[0])
print("Original:")
print(split_text[0])
print("Augmented Text:")
print(augmented_text)

Original:
When the young people returned to the ballroom, it presented a decidedly changed appearance.
Augmented Text:
when the young princess returned to the ballroom, it showed an drastically changed appearance.


Note: this approach produces different substitutions on every call, so we can run it several times on the same text to build a larger augmented dataset.

In [49]:
augmented_text = aug.augment(split_text[0])
print("Original:")
print(split_text[0])
print("Augmented Text:")
print(augmented_text)

Original:
When the young people returned to the ballroom, it presented a decidedly changed appearance.
Augmented Text:
until the young people returned to that ballroom, it assumed a decidedly cleaner appearance.


Before applying the procedure above to the whole dataset, we have to decide which target to assign to each newly generated excerpt. The easiest way is to draw it from a Gaussian distribution whose mean is the target of the original excerpt and whose standard deviation is the standard error given by the dataset.

In [50]:
import random #used to draw the target of the new excerpt from a Gaussian distribution

In [51]:
#example of generating a new target
#note: the spread is narrowed to sigma/4 here, while the full loop below uses sigma
mu=train_data.target[0]
sigma=train_data.standard_error[0]
random.gauss(mu, sigma/4)

-0.37720769134094184
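
The draw above can be wrapped in a small helper with a seeded generator, so that the augmented targets can be regenerated exactly. This is only a sketch. `sample_target` and its `scale` argument are hypothetical; `scale` controls how much of the standard error is used as noise.

```python
import random

def sample_target(mu, sigma, scale=1.0, rng=None):
    """Draw a target from a Gaussian with mean mu and standard deviation scale * sigma."""
    rng = rng or random
    return rng.gauss(mu, scale * sigma)

rng = random.Random(42)            # fixed seed for reproducibility
mu, sigma = -0.340259, 0.464009    # target and standard_error of the first excerpt

draws = [sample_target(mu, sigma, scale=0.25, rng=rng) for _ in range(10_000)]
print(sum(draws) / len(draws))     # stays close to mu

print(sample_target(mu, sigma, scale=0.0))  # scale=0 keeps the original target unchanged
```

A small `scale` keeps the new targets close to the original label, while `scale=1.0` uses the full standard error reported in the dataset.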

# **Question: is there a better way to assign targets to the new excerpts?**

Now just apply the augmentation to the whole dataset ;)

In [56]:
#new rows are appended to the dataframe with DataFrame.loc
df_len=len(train_data) #number of original samples, fixed before appending
it=5 #number of new texts we want to generate from the same sample
counter=0 #counts how many rows we have added to the dataframe
for i in range(df_len):
    text=train_data.excerpt[i]
    splitted_text=split_sentences(text) #splitted_text is a list of sentences
    mu=train_data.target[i]
    sigma=train_data.standard_error[i]
    for k in range(it):
        #augment each sentence separately, then merge them into a single string
        transformed_sentences=[aug.augment(sentence) for sentence in splitted_text]
        transformed_text = ' '.join(transformed_sentences)

        #generating target for the new text
        new_target=random.gauss(mu, sigma)

        #id, url_legal and license are unknown for generated rows
        new_index=train_data.index.max() + 1 #next free row label
        train_data.loc[new_index]=[np.nan,np.nan,np.nan,transformed_text,new_target,sigma]
        counter+=1



KeyboardInterrupt: ignored

In [None]:
len(train_data)

**Unfortunately, this code is too slow to perform a real data augmentation in an acceptable amount of time... how can we speed it up?**
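
One possible direction, sketched here under assumptions rather than measured: pass all the sentences of an excerpt to the augmenter in a single call so the model can process them as a batch (nlpaug's `augment` is assumed to accept a list of strings), and collect the new rows in a plain list instead of growing the dataframe one row at a time. `augment_rows`, `augment_batch` and `split` are hypothetical names. The stand-in augmenter below only upper-cases text so the sketch runs without any model.

```python
import random

def augment_rows(rows, augment_batch, split, n_new=1, seed=0):
    """Build new rows from dicts with 'excerpt', 'target' and 'standard_error' keys."""
    rng = random.Random(seed)
    new_rows = []
    for row in rows:
        sentences = split(row["excerpt"])
        for _ in range(n_new):
            # One call per excerpt instead of one call per sentence.
            augmented = augment_batch(sentences)
            new_rows.append({
                "excerpt": " ".join(augmented),
                "target": rng.gauss(row["target"], row["standard_error"]),
                "standard_error": row["standard_error"],
            })
    return new_rows

# Stand-ins for the real augmenter and sentence splitter.
fake_augment = lambda sentences: [s.upper() for s in sentences]
fake_split = lambda text: text.split(". ")

rows = [{"excerpt": "a b. c d.", "target": 1.0, "standard_error": 0.0}]
new_rows = augment_rows(rows, fake_augment, fake_split, n_new=2)
print(len(new_rows), new_rows[0]["excerpt"])
```

The resulting list of dicts can be turned into a dataframe with `pd.DataFrame(new_rows)` and appended to the training data once with `pd.concat`. Running the model on a GPU is likely to matter more than either of these changes.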

## **Augmentation using back translation (NLPAUG)**

With back translation we generate new text by translating a corpus twice: forward into another language and then back into the original one. The nlpaug library provides a simple implementation of this technique.
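
The idea can be illustrated with a toy round trip. The word-by-word dictionaries below are invented for the example and stand in for real translation models. Because two English words map to the same German word, the text that comes back is a paraphrase rather than a copy.

```python
# Toy, hand-made dictionaries (assumptions for illustration only).
EN_DE = {"quick": "schnell", "fast": "schnell"}
DE_EN = {"schnell": "fast"}

def translate(text, table):
    """Word-by-word lookup; unknown words are left unchanged."""
    return " ".join(table.get(word, word) for word in text.split())

def back_translate(text, forward, backward):
    """Translate forward and then back again."""
    return backward(forward(text))

result = back_translate(
    "the quick train",
    forward=lambda t: translate(t, EN_DE),
    backward=lambda t: translate(t, DE_EN),
)
print(result)  # the fast train
```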

In [57]:
import nlpaug.augmenter.word as naw

In [60]:
!pip install fairseq #library required to download the back-translation models

Collecting fairseq
  Downloading https://files.pythonhosted.org/packages/15/ab/92c6efb05ffdfe16fbdc9e463229d9af8c3b74dc943ed4b4857a87b223c2/fairseq-0.10.2-cp37-cp37m-manylinux1_x86_64.whl (1.7MB)
Collecting hydra-core
  Downloading https://files.pythonhosted.org/packages/52/e3/fbd70dd0d3ce4d1d75c22d56c0c9f895cfa7ed6587a9ffb821d6812d6a60/hydra_core-1.0.6-py3-none-any.whl (123kB)
Collecting sacrebleu>=1.4.12
  Downloading https://files.pythonhosted.org/packages/7e/57/0c7ca4e31a126189dab99c19951910bd081dea5bbd25f24b77107750eae7/sacrebleu-1.5.1-py3-none-any.whl (54kB)
Collecting dataclasses
  Downloading https://files.pythonhosted.org/packages/26/2f/1095cdc2868052dd1e64520f7c0d5c8c550ad297e944e641dbf1ffbb9a5d/dataclasses-0.6-py3-none-any.whl
Collecting antlr4-python3-runtime==4.8
  Downloading

In [None]:
back_translation_aug = naw.BackTranslationAug(
    from_model_name='transformer.wmt19.en-de', 
    to_model_name='transformer.wmt19.de-en'
)

Downloading: "https://github.com/pytorch/fairseq/archive/master.zip" to /root/.cache/torch/hub/master.zip
100%|██████████| 11946275315/11946275315 [08:28<00:00, 23497361.66B/s]


In [58]:
text=train_data.excerpt[0] #an example
split_text=split_sentences(text) #split the text into sentences; split_sentences returns a list of strings
len(split_text)

11

In [None]:
augmented_text = back_translation_aug.augment(split_text[0])
print("Original:")
print(split_text[0])
print("Augmented Text:")
print(augmented_text)

To create the augmented dataset we use the same procedure as before.

In [None]:
#new rows are appended to the dataframe with DataFrame.loc
df_len=len(train_data) #number of samples, fixed before appending
it=1 #number of new texts we want to generate from the same sample
counter=0 #counts how many rows we have added to the dataframe
for i in range(df_len):
    text=train_data.excerpt[i]
    splitted_text=split_sentences(text) #splitted_text is a list of sentences
    mu=train_data.target[i]
    sigma=train_data.standard_error[i]
    for k in range(it):
        #back-translate each sentence separately, then merge them into a single string
        transformed_sentences=[back_translation_aug.augment(sentence) for sentence in splitted_text]
        transformed_text = ' '.join(transformed_sentences)

        #generating target for the new text
        new_target=random.gauss(mu, sigma)

        #id, url_legal and license are unknown for generated rows
        new_index=train_data.index.max() + 1 #next free row label
        train_data.loc[new_index]=[np.nan,np.nan,np.nan,transformed_text,new_target,sigma]
        print(train_data.excerpt[new_index])
        counter+=1