 # Augmentations in NLP

Data augmentation techniques in NLP show substantial improvements on datasets with fewer than 500 observations, as illustrated by the original paper considered here, EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks.

https://arxiv.org/abs/1901.11196




In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/tweet-sentiment-extraction/sample_submission.csv
/kaggle/input/tweet-sentiment-extraction/train.csv
/kaggle/input/tweet-sentiment-extraction/test.csv


# ***Simple Data Augmentation Techniques are:***
1. SR : Synonym Replacement 
2. RD : Random Deletion
3. RS : Random Swap
4. RI : Random Insertion



In [2]:
data = pd.read_csv('../input/tweet-sentiment-extraction/train.csv')

In [3]:
data.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [4]:
list_to_drop = ['textID','selected_text','sentiment']
data.drop(list_to_drop,axis=1,inplace=True)

In [5]:
data.head()

Unnamed: 0,text
0,"I`d have responded, if I were going"
1,Sooo SAD I will miss you here in San Diego!!!
2,my boss is bullying me...
3,what interview! leave me alone
4,"Sons of ****, why couldn`t they put them on t..."


In [6]:
print(f"Total number of examples to be used is : {len(data)}")

Total number of examples to be used is : 27481


# 1. Synonym Replacement (SR)

Synonym replacement is a technique in which we replace a word with one of its synonyms.

To identify relevant synonyms we use WordNet.

The get_synonyms function returns a pre-processed list of synonyms for a given word.

The synonym_replacement function then replaces up to n words that are not stop words with one of their synonyms.

In [7]:
from nltk.corpus import stopwords

# Requires the NLTK stopwords corpus; run nltk.download('stopwords') once if it is missing.
stop_words = list(stopwords.words('english'))
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [8]:
import random
from nltk.corpus import wordnet


In [9]:

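def get_synonyms(word):
    """Return a cleaned list of WordNet synonyms for `word`, excluding the word itself."""
    synonyms = set()

    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            # WordNet joins multi-word lemmas with underscores, e.g. "hot_dog"
            synonym = lemma.name().replace("_", " ").replace("-", " ").lower()
            # keep only lowercase letters and spaces
            synonym = "".join([char for char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm'])
            synonyms.add(synonym)

    # discard() does nothing when the word is not in the set
    synonyms.discard(word)

    return list(synonyms)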
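
The cleaning step inside get_synonyms can be tried without WordNet. The sketch below is only an illustration: `clean_lemma` is a hypothetical helper name, and the input strings are made-up examples written in the underscore style that WordNet lemma names use.

```python
import string

# Characters that survive cleaning: lowercase ASCII letters and the space.
ALLOWED = set(string.ascii_lowercase + " ")


def clean_lemma(name):
    """Normalise a lemma name the same way get_synonyms does (hypothetical helper)."""
    text = name.replace("_", " ").replace("-", " ").lower()
    return "".join(ch for ch in text if ch in ALLOWED)


# Made-up inputs for illustration, not looked up in WordNet.
examples = ["Brown_University", "jump-start", "hot_dog", "don't"]
cleaned = [clean_lemma(name) for name in examples]
print(cleaned)
```

Note that multi-word lemmas stay multi-word after cleaning, so a single replaced word can turn into two or more words in the augmented sentence.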

In [10]:
def synonym_replacement(words, n):
    """Replace up to n non-stop-words of the sentence `words` with a WordNet synonym."""
    words = words.split()
    new_words = words.copy()
    # candidate words: unique tokens that are not stop words, in random order
    random_word_list = list(set([word for word in words if word not in stop_words]))
    random.shuffle(random_word_list)
    num_replaced = 0

    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)

        if len(synonyms) >= 1:
            synonym = random.choice(synonyms)
            # replace every occurrence of the chosen word
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1

        if num_replaced >= n: #only replace up to n words
            break
    sentence = ' '.join(new_words)
    return sentence

In [11]:
print(f" Example of Synonym Replacement: {synonym_replacement('The quick brown fox jumps over the lazy dog',4)}")

 Example of Synonym Replacement: The spry brown university fox jumpstart over the lazy detent


To get a larger diversity of sentences, we could try replacing 1, 2, 3, ... words in the given sentence.
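
One way to pick the number of words to change is to scale it with sentence length, which is how the EDA paper parameterises its operations (n = alpha * sentence length). The sketch below assumes that convention; the name `num_changes` and the default `alpha=0.1` are illustrative choices, not part of this notebook.

```python
def num_changes(sentence, alpha=0.1):
    """Number of words to change: alpha times the word count, but at least 1."""
    length = len(sentence.split())
    return max(1, int(alpha * length))


tweet = "the free fillin` app on my ipod is fun, im addicted"
for alpha in (0.1, 0.3, 0.5):
    print(alpha, num_changes(tweet, alpha))
```

The resulting value could be passed as the `n` argument of synonym_replacement, so longer tweets receive more replacements than short ones.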

Now let's take an example from our dataset and augment it so that we create 3 additional sentences per tweet.

In [12]:
trial_sent = data['text'][25]
print(trial_sent)


the free fillin` app on my ipod is fun, im addicted


In [13]:
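# Create 3 augmented sentences per tweet.
# Note: range(3) passes n = 0, 1, 2. synonym_replacement checks its limit only
# after a replacement has been made, so n = 0 can still replace one word.

for n in range(3):
    print(f" Example of Synonym Replacement: {synonym_replacement(trial_sent,n)}")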

 Example of Synonym Replacement: the free fillin` app on my ipod is fun, im addict
 Example of Synonym Replacement: the innocent fillin` app on my ipod is fun, im addicted
 Example of Synonym Replacement: the relinquish fillin` app on my ipod is fun, im addict


Now we are able to augment this data :)

You can store the augmented sentences in new columns under the same textID in our tweet sentiment dataset.
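
A minimal sketch of keeping augmented text tied to its textID is shown below. It uses plain dictionaries rather than the DataFrame, stores each augmented sentence as an extra row with a `source` label, and uses two trivial stand-in augmenters so that it runs without WordNet. The names `augment_rows`, `reverse_words`, and `drop_last_word` are hypothetical; in the notebook, the stand-ins would be swapped for functions such as synonym_replacement.

```python
def augment_rows(rows, augmenters):
    """Return the original rows plus one extra row per augmenter, sharing the textID."""
    out = []
    for row in rows:
        out.append({"textID": row["textID"], "text": row["text"], "source": "original"})
        for name, fn in augmenters.items():
            out.append({"textID": row["textID"], "text": fn(row["text"]), "source": name})
    return out


# Stand-in augmenters for illustration only.
def reverse_words(text):
    return " ".join(reversed(text.split()))


def drop_last_word(text):
    return " ".join(text.split()[:-1])


rows = [{"textID": "088c60f138", "text": "my boss is bullying me..."}]
augmented = augment_rows(rows, {"reversed": reverse_words, "shortened": drop_last_word})
for row in augmented:
    print(row)
```

Keeping the textID on every row makes it possible to keep an original tweet and its augmented copies in the same train or validation split.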

# 2. Random Deletion (RD)

In Random Deletion, each word is deleted independently with probability p: for every word we draw a uniform random number between 0 and 1 and delete the word if that number does not exceed the pre-defined threshold p. This removes a random subset of the words in the sentence.



In [14]:
def random_deletion(words, p):

    words = words.split()

    # nothing to delete if there is at most one word; return a string like every other path
    if len(words) <= 1:
        return ' '.join(words)

    # randomly delete words with probability p
    new_words = []
    for word in words:
        r = random.uniform(0, 1)
        if r > p:
            new_words.append(word)

    # if you end up deleting all words, just return a random word (as a string)
    if len(new_words) == 0:
        return random.choice(words)

    sentence = ' '.join(new_words)

    return sentence

Let's test this augmentation with our trial_sent.

In [15]:
print(random_deletion(trial_sent,0.2))
print(random_deletion(trial_sent,0.3))
print(random_deletion(trial_sent,0.4))

the free fillin` app on my is fun, addicted
free fillin` app on my ipod is im addicted
the free on my ipod is fun, im


This could help reduce overfitting and may help improve model accuracy.
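
To get a feel for how much text survives a given threshold, the sketch below simulates the keep/delete decision many times and reports the average fraction of words kept, which should be close to 1 - p. The function name `keep_fraction`, the trial count, and the seed are illustrative assumptions, and the sketch ignores the fallback that keeps one word when everything is deleted.

```python
import random


def keep_fraction(num_words, p, trials=2000, seed=0):
    """Estimate the share of words kept when each is deleted with probability p."""
    rng = random.Random(seed)
    kept = 0
    for _ in range(trials):
        # same rule as random_deletion: a word is kept when the draw exceeds p
        kept += sum(1 for _ in range(num_words) if rng.uniform(0, 1) > p)
    return kept / (num_words * trials)


for p in (0.2, 0.3, 0.4):
    print(p, round(keep_fraction(11, p), 2))
```

For an 11-word tweet, p = 0.4 removes between four and five words on average, which is already a large share of a short text.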


# 3. Random Swap (RS)

In Random Swap, we randomly swap the order of two words in a sentence.


In [16]:
def swap_word(new_words):
    # swapping needs at least two words
    if len(new_words) < 2:
        return new_words
    random_idx_1 = random.randint(0, len(new_words)-1)
    random_idx_2 = random_idx_1
    counter = 0
    # retry a few times to draw a second index that differs from the first
    while random_idx_2 == random_idx_1:
        random_idx_2 = random.randint(0, len(new_words)-1)
        counter += 1
        if counter > 3:
            return new_words

    new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1]
    return new_words

In [17]:
def random_swap(words, n):    
    words = words.split()
    new_words = words.copy()
    # n is the number of swaps to perform; each swap exchanges two words
    for _ in range(n):
        new_words = swap_word(new_words)
        
    sentence = ' '.join(new_words)    
    return sentence

In [18]:
print(random_swap(trial_sent,1))
print(random_swap(trial_sent,2))
print(random_swap(trial_sent,3))

the free addicted app on my ipod is fun, im fillin`
fun, free fillin` app on my ipod is im the addicted
free app fillin` the on addicted ipod is fun, im my


Random swapping can help make our models more robust and may in turn help with text classification.

A high number of swaps may degrade the model.

There is a high chance of losing the semantics of the sentence, so be careful when using this augmentation.
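
The sketch below illustrates why the number of swaps matters: swapping never changes which words are present, but the count of words sitting in a different position grows with n. It is a variant of the swap above that draws two distinct indices directly with `random.Random.sample`; the names `swap_n` and `moved_positions` and the seed are assumptions made for illustration.

```python
import random
from collections import Counter


def swap_n(words, n, rng):
    """Return a copy of `words` after n swaps of two distinct positions."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def moved_positions(original, shuffled):
    """Count positions whose word differs from the original sentence."""
    return sum(a != b for a, b in zip(original, shuffled))


rng = random.Random(0)
original = "the free fillin` app on my ipod is fun, im addicted".split()
for n in (1, 3, 6):
    swapped = swap_n(original, n, rng)
    # the bag of words is unchanged; only the order differs
    assert Counter(swapped) == Counter(original)
    print(n, moved_positions(original, swapped), " ".join(swapped))
```

A bag-of-words model sees no difference between these sentences, so the operation mainly affects models that are sensitive to word order.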



# 4. Random Insertion (RI)
Finally, in Random Insertion, we randomly insert synonyms of a word at a random position.

Data augmentation operations should not change the true label of a sentence, as that would introduce unnecessary noise into the data. Inserting a synonym of a word in a sentence, opposed to a random word, is more likely to be relevant to the context and retain the original label of the sentence.

In [19]:
def random_insertion(words, n):    
    words = words.split()
    new_words = words.copy()    
    for _ in range(n):
        add_word(new_words)        
    sentence = ' '.join(new_words)
    return sentence

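def add_word(new_words):
    # nothing to draw a synonym from
    if not new_words:
        return
    synonyms = []
    counter = 0

    # try up to 10 random words until one of them has at least one synonym
    while len(synonyms) < 1:
        if counter >= 10:
            return
        random_word = new_words[random.randint(0, len(new_words)-1)]
        synonyms = get_synonyms(random_word)
        counter += 1
    # pick one of the synonyms at random and insert it at a random position
    random_synonym = random.choice(synonyms)
    random_idx = random.randint(0, len(new_words)-1)
    new_words.insert(random_idx, random_synonym)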

In [20]:
print(random_insertion(trial_sent,1))
print(random_insertion(trial_sent,2))
print(random_insertion(trial_sent,3))

the free fillin` app on my addict ipod is fun, im addicted
the complimentary free fillin` app on my ipod along is fun, im addicted
the free along fillin` app addict on my ipod along is fun, im addicted
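
Random insertion can also be tried offline with a small hand-written synonym table in place of WordNet. The sketch below is a simplified variant under that assumption: `TOY_SYNONYMS` and `insert_synonyms` are hypothetical names, and the table entries are illustrative rather than taken from WordNet.

```python
import random

# Hand-written synonym table for illustration only.
TOY_SYNONYMS = {
    "free": ["complimentary", "gratis"],
    "app": ["application"],
    "addicted": ["hooked"],
}


def insert_synonyms(sentence, n, synonyms, rng):
    """Insert n synonyms of randomly chosen words at random positions."""
    words = sentence.split()
    candidates = [w for w in words if synonyms.get(w)]
    if not candidates:
        return sentence
    for _ in range(n):
        source = rng.choice(candidates)
        new_word = rng.choice(synonyms[source])
        # randint is inclusive, so the new word may also land at the very end
        words.insert(rng.randint(0, len(words)), new_word)
    return " ".join(words)


rng = random.Random(0)
tweet = "the free fillin` app on my ipod is fun, im addicted"
print(insert_synonyms(tweet, 2, TOY_SYNONYMS, rng))
```

All original words stay in place and in order, which is why this operation tends to disturb the label less than deletion or swapping.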


In [21]:
def aug(sent,n,p):
    print(f" Original Sentence : {sent}")
    print(f" SR Augmented Sentence : {synonym_replacement(sent,n)}")
    print(f" RD Augmented Sentence : {random_deletion(sent,p)}")
    print(f" RS Augmented Sentence : {random_swap(sent,n)}")
    print(f" RI Augmented Sentence : {random_insertion(sent,n)}")

In [22]:
aug(trial_sent,4,0.3)

 Original Sentence : the free fillin` app on my ipod is fun, im addicted
 SR Augmented Sentence : the disembarrass fillin` app on my ipod is fun, im hook
 RD Augmented Sentence : the free app on my ipod fun, im addicted
 RS Augmented Sentence : on free fillin` ipod is my the app fun, im addicted
 RI Augmented Sentence : the free fillin` app on gratis addict my ipod is complimentary make up fun, im addicted
