
# NLP (DATA AUGMENTATION)

## Data Augmentation Techniques
>>
This notebook covers four simple data augmentation techniques for text:
>>
* SR: synonym replacement
* RD: random deletion
* RS: random swap
* RI: random insertion

## Import Necessary Libraries

In [25]:
import pandas as pd
import random
import nltk
from nltk.corpus import wordnet, stopwords

nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Import the Dataset

In [26]:
# read the whole file and close it again once we are done
with open("food.txt") as f:
    text = f.read()
text



## Functions to get Synonyms and Replace them Randomly

> ##### **Synonym replacement (SR)**
Synonym replacement is a technique in which we replace a word with one of its synonyms. We use WordNet, a large lexical database of English, to look up candidate synonyms.

In [27]:
stop_words = stopwords.words('english')

def get_synonyms(word):
    """
    Get synonyms of a word
    """
    synonyms = set()
    
    for syn in wordnet.synsets(word): 
        for l in syn.lemmas(): 
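            synonym = l.name().replace("_", " ").replace("-", " ").lower()
            # keep only lowercase letters and spaces
            synonym = "".join([char for char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm']).strip()
            # skip lemmas that end up empty after filtering (e.g. purely numeric ones)
            if synonym:
                synonyms.add(synonym)
    
    # a word is not its own synonym; compare in lowercase because synonyms are lowercased
    synonyms.discard(word.lower())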
    
    return list(synonyms)

# get_synonyms (above) looks up the synonyms of a given word and pre-processes them.
# synonym_replacement (below) then replaces randomly chosen words in the text with one of their synonyms.

def synonym_replacement(words, n):
    
    words = words.split()
    
    new_words = words.copy()
    # candidate words: distinct words that are not stop words (compared in lowercase)
    random_word_list = list(set([word for word in words if word.lower() not in stop_words]))
    random.shuffle(random_word_list)
    num_replaced = 0
    
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        
        if len(synonyms) >= 1:
            synonym = random.choice(list(synonyms))
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1
        
        if num_replaced >= n: #only replace up to n words
            break

    sentence = ' '.join(new_words)

    return sentence

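> `get_synonyms` looks up a word's synonyms in WordNet and normalizes them (lowercase, letters and spaces only). `synonym_replacement` then substitutes a randomly chosen synonym for every occurrence of a selected word.

> We randomly select up to n distinct non-stop-words and replace each with one of its synonyms. Because the function takes a string and returns a string, it can be passed to `apply` on a text column of a data frame.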
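
To make the replacement step concrete without downloading WordNet, here is a minimal sketch of the same idea. The names `TOY_SYNONYMS`, `TOY_STOP_WORDS`, and `toy_synonym_replacement` are hypothetical, and the hand-written synonym table is only a stand-in for the WordNet lookup, so treat it as an illustration rather than the notebook's implementation.

```python
import random

# Hypothetical stand-in for WordNet: a tiny hand-written synonym table.
TOY_SYNONYMS = {
    "acute": ["severe", "sharp"],
    "partner": ["collaborator", "associate"],
    "result": ["outcome"],
}
TOY_STOP_WORDS = {"the", "is", "of", "a", "an", "by"}


def toy_synonym_replacement(sentence, n, rng):
    """Replace up to n distinct non-stop-words with a random synonym."""
    words = sentence.split()
    # sorted() gives a reproducible starting order before shuffling
    candidates = sorted({w for w in words if w.lower() not in TOY_STOP_WORDS})
    rng.shuffle(candidates)

    replacements = {}
    for word in candidates:
        if len(replacements) >= n:  # only replace up to n distinct words
            break
        options = TOY_SYNONYMS.get(word.lower(), [])
        if options:
            replacements[word] = rng.choice(options)

    # every occurrence of a selected word receives the same synonym
    return " ".join(replacements.get(w, w) for w in words)


rng = random.Random(0)  # fixed seed (arbitrary) so the sketch is repeatable
sentence = "the result of acute food insecurity is an acute crisis"
augmented = toy_synonym_replacement(sentence, n=1, rng=rng)
print(augmented)
```

With `n=1`, exactly one distinct word is replaced; if that word is "acute", both of its occurrences change together, which mirrors how `synonym_replacement` rewrites every match of the selected word.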


In [71]:
syn_replace = synonym_replacement(text, 500)
syn_replace[:200]

'The Global Report on Food crisis (GRFC) 2020 is the result of a joint, consensus-based assessment of keen food insecurity situations around the world by  collaborator organizations. At one hundred thi'

In [68]:
text[:200]

'The Global Report on Food Crises (GRFC) 2020 is the result of\na joint, consensus-based assessment of acute food insecurity\nsituations around the world by 16 partner organizations.\nAt 135 million, the '

## Random Deletion (RD)
>>
In Random Deletion, each word is considered independently: we draw a uniform random number between 0 and 1 and delete the word if that number does not exceed a pre-defined threshold p. On average, a fraction p of the words in the sentence is removed.

In [40]:
def random_deletion(words, p):

    words = words.split()
    
    #obviously, if there's only one word, don't delete it (return a string, like every other branch)
    if len(words) <= 1:
        return ' '.join(words)

    #randomly delete words with probability p
    new_words = []
    for word in words:
        r = random.uniform(0, 1)
        if r > p:
            new_words.append(word)

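    #if you end up deleting all words, just return a random word (as a string)
    if len(new_words) == 0:
        rand_int = random.randint(0, len(words)-1)
        return words[rand_int]

    sentence = ' '.join(new_words)
    
    return sentence

In [64]:
rand_del = random_deletion(text, 5)
rand_del

'they'

> Here `p = 5` is larger than any number `random.uniform(0, 1)` can produce, so every word is deleted and the fallback returns a single random word. Because `p` is a probability, pass a value between 0 and 1 (for example 0.1) to delete only a fraction of the words.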
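
To see how the threshold controls how much text survives, the following sketch applies deletion to 1000 placeholder words for several values of p. The helper `delete_words`, the range check on `p`, and the seed are assumptions made for this illustration; they are not part of the notebook's `random_deletion`.

```python
import random


def delete_words(sentence, p, rng):
    """Drop each word independently with probability p (0 <= p <= 1)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p is a probability and must lie in [0, 1]")
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    # rng.random() is uniform on [0, 1), so a word is kept with probability 1 - p
    kept = [w for w in words if rng.random() >= p]
    # fall back to one random word rather than returning an empty string
    return " ".join(kept) if kept else rng.choice(words)


rng = random.Random(42)  # fixed seed, chosen arbitrarily
sample = " ".join(f"w{i}" for i in range(1000))  # 1000 placeholder words
for p in (0.0, 0.1, 0.5, 1.0):
    survivors = len(delete_words(sample, p, rng).split())
    print(f"p={p}: kept {survivors} of 1000 words")
```

Rejecting values outside [0, 1] is one possible design choice: it turns a call such as `p=5` into an explicit error instead of silently deleting the whole text.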

## Random Swap (RS)
>>
In Random Swap, we pick two words in the sentence at random and swap their positions, repeating this n times.

In [45]:
def swap_word(new_words):
    
    random_idx_1 = random.randint(0, len(new_words)-1)
    random_idx_2 = random_idx_1
    counter = 0
    
    while random_idx_2 == random_idx_1:
        random_idx_2 = random.randint(0, len(new_words)-1)
        counter += 1
        
        if counter > 3:
            return new_words
    
    new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1] 
    return new_words

def random_swap(words, n):
    
    words = words.split()
    new_words = words.copy()
    
    for _ in range(n):
        new_words = swap_word(new_words)
        
    sentence = ' '.join(new_words)
    
    return sentence

In [62]:
swapped_text = random_swap(text, 500)
swapped_text[:150]

'The and security on Food Crises (GRFC) 2020 is the result of a joint, la assessment of acute food insecurity situations around the world by 16 partner'

In [61]:
text[:150]

'The Global Report on Food Crises (GRFC) 2020 is the result of\na joint, consensus-based assessment of acute food insecurity\nsituations around the world'

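> With n = 500 swaps, the output is not what we want: words are moved far from their original context (for example, "Global Report" became "and security"), so the sentence loses its original meaning. A smaller n keeps more of the original word order intact.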
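
The effect of the number of swaps can be measured directly. The sketch below compares a swap count scaled to the text length with a fixed count of 500, on 100 placeholder words. The helpers `swap_n` and `moved_fraction` and the rate `alpha = 0.1` are assumptions for illustration, not values taken from this notebook.

```python
import random
from collections import Counter


def swap_n(sentence, n, rng):
    """Swap two distinct random positions, n times."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)  # two distinct indices
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


def moved_fraction(before, after):
    """Fraction of positions whose word differs between the two texts."""
    pairs = list(zip(before.split(), after.split()))
    return sum(a != b for a, b in pairs) / len(pairs)


rng = random.Random(7)  # fixed seed, chosen arbitrarily
sentence = " ".join(f"w{i}" for i in range(100))  # 100 distinct placeholder words
alpha = 0.1  # assumed swap rate per word
n_scaled = max(1, int(alpha * len(sentence.split())))

light = swap_n(sentence, n_scaled, rng)
heavy = swap_n(sentence, 500, rng)
print(f"n={n_scaled}: {moved_fraction(sentence, light):.0%} of positions changed")
print(f"n=500: {moved_fraction(sentence, heavy):.0%} of positions changed")

# swapping only reorders words; the multiset of words is unchanged
print(Counter(light.split()) == Counter(sentence.split()))
```

Each swap touches at most two positions, so n swaps can change at most 2n positions. Scaling n to the length of the text is one way to keep the perturbation mild.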

## Random Insertion (RI)
>>
Finally, in Random Insertion, we pick a random word, look up its synonyms, and insert one of them at a random position in the sentence. This is repeated n times.

In [57]:
def random_insertion(words, n):
    
    words = words.split()
    new_words = words.copy()
    
    for _ in range(n):
        add_word(new_words)
        
    sentence = ' '.join(new_words)
    return sentence

def add_word(new_words):
    
    synonyms = []
    counter = 0
    
    while len(synonyms) < 1:
        random_word = new_words[random.randint(0, len(new_words)-1)]
        synonyms = get_synonyms(random_word)
        counter += 1
        if counter >= 10:
            return
        
    # pick the synonym at random rather than always taking the first entry
    random_synonym = random.choice(synonyms)
    random_idx = random.randint(0, len(new_words)-1)
    new_words.insert(random_idx, random_synonym)

In [60]:
rand_insert = random_insertion(text, 500)
rand_insert[:150]

'The Global Report on unstableness Food Crises (GRFC) 2020 is the result of a joint, consensus-based assessment of acute food tear down insecurity situ'
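
As a check on what insertion does and does not change, here is a small sketch that runs without WordNet. `TOY_SYNONYMS` and `insert_synonyms` are hypothetical names, and the hand-written table stands in for `get_synonyms`. Unlike `add_word`, this version also allows a synonym to be appended at the very end of the sentence.

```python
import random
from collections import Counter

# Hypothetical stand-in for WordNet lookups
TOY_SYNONYMS = {
    "food": ["nutrient", "nourishment"],
    "crisis": ["emergency"],
    "acute": ["severe"],
}


def insert_synonyms(sentence, n, rng, max_tries=10):
    """Insert up to n synonyms of randomly chosen words at random positions."""
    words = sentence.split()
    if not words:
        return sentence
    for _ in range(n):
        # try a few random words; skip this insertion if none has a synonym
        for _attempt in range(max_tries):
            options = TOY_SYNONYMS.get(rng.choice(words).lower(), [])
            if options:
                # randint is inclusive, so len(words) means "append at the end"
                words.insert(rng.randint(0, len(words)), rng.choice(options))
                break
    return " ".join(words)


rng = random.Random(3)  # fixed seed, chosen arbitrarily
sentence = "acute food crisis in the region"
augmented = insert_synonyms(sentence, 3, rng)
print(augmented)

# the added words are exactly the difference between the two word counts
added = Counter(augmented.split()) - Counter(sentence.split())
print(sorted(added.elements()))
```

The original words keep their relative order, and the text grows by at most n words, which is why insertion tends to distort a sentence less than a large number of swaps does.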