# Introduction:

## Data Augmentation for NLP

This notebook implements several text data augmentation techniques to increase the diversity of the training data. Data augmentation in NLP creates additional training examples synthetically from existing ones, which can improve model generalization without requiring more labeled data.

### Theory of Data Augmentation in NLP

Data augmentation is widely used in Natural Language Processing (NLP) to improve model performance by introducing controlled variation into the training data. A model trained on augmented text sees a wider range of phrasings and surface forms and tends to become more robust to them. Augmentation can be especially useful in tasks such as text classification, machine translation, and question answering.

### Techniques Used in This Notebook:

1. ### Synonym Replacement
   - **Description**: This technique replaces words in the text with their synonyms, aiming to maintain the meaning of the sentence while increasing lexical diversity.
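   - **Method**: For each word in the sentence, we look up its synsets in the WordNet lexical database and use the first lemma of the first synset as the replacement. This lemma can be the original word itself, in which case the word is effectively unchanged. If WordNet has no entry for the word, the word remains unchanged.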
   - **Example**:
     - Original: "The quick brown fox"
     - Synonym Replaced: "The quick brown trick"
   
2. ### Random Insertion
   - **Description**: A synonym of a randomly chosen word from the sentence is inserted at a random position. This adds variation in sentence length and structure, which can make the model more robust to redundant or unexpected words.
   - **Method**: A random word is chosen from the sentence, and one of its synonyms is inserted at a random position.
   - **Example**:
     - Original: "The quick brown fox"
     - Random Insertion: "The quick brown fox fox"

3. ### Random Deletion
   - **Description**: Words in the sentence are randomly removed with a set probability `p`. This trains the model to handle missing words or incomplete data, simulating noisy input.
   - **Method**: Each word is removed independently with probability `p` (default: 0.2). If every word is deleted, one randomly chosen word is retained so that the sentence is not empty. Single-word sentences are returned unchanged.
   - **Example**:
     - Original: "The quick brown fox"
     - Random Deletion: "The quick fox"

4. ### Random Shuffling
   - **Description**: The order of the words in the sentence is randomly shuffled, altering the sentence structure while keeping the same words. This helps the model learn to handle differently ordered inputs.
   - **Method**: Words in the sentence are shuffled randomly.
   - **Example**:
     - Original: "The quick brown fox"
     - Shuffled: "brown fox quick The"

5. ### Noise Addition
   - **Description**: This technique adds random keyboard noise, simulating typographical errors made by humans. This helps train models that deal with human input, such as auto-correction systems.
   - **Method**: Using the `nlpaug` library's `KeyboardAug` augmenter, random keyboard errors are introduced into the text (e.g., mistyping 'e' as 'r').
   - **Example**:
     - Original: "The quick brown fox"
     - Noise Added: "The q7uick brown fox"
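
To make the idea behind keyboard noise concrete, the following is a minimal sketch that does not use `nlpaug`. The `NEIGHBOURS` map and the `keyboard_noise` function are hypothetical and deliberately partial; they only illustrate substituting a character with an adjacent key and are not how `KeyboardAug` is implemented.

```python
import random

# Hypothetical, partial QWERTY neighbour map for illustration only;
# nlpaug ships its own, more complete keyboard model.
NEIGHBOURS = {
    "q": "wa", "w": "qes", "e": "wrd", "r": "etf", "t": "ryg",
    "a": "qsz", "s": "awdx", "d": "sefc", "o": "ipl", "n": "bmh",
    "u": "yij", "i": "uok", "c": "xvd", "k": "jli", "b": "vnh",
}

def keyboard_noise(text, n_errors=1, rng=None):
    """Replace up to n_errors characters with an adjacent key."""
    if rng is None:
        rng = random.Random()
    chars = list(text)
    # Only characters with known neighbours can be "mistyped".
    candidates = [i for i, c in enumerate(chars) if c.lower() in NEIGHBOURS]
    for i in rng.sample(candidates, min(n_errors, len(candidates))):
        # Letter case is not preserved in this sketch.
        chars[i] = rng.choice(NEIGHBOURS[chars[i].lower()])
    return "".join(chars)

# A seeded generator makes the noise reproducible between runs.
noisy = keyboard_noise("The quick brown fox", n_errors=2, rng=random.Random(0))
print(noisy)
```

Passing an explicit `random.Random` instance keeps the example reproducible without touching the global random state.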

### Application of Augmentations

The code below applies each of the following augmentations to every phrase in the dataset:
- Synonym Replacement
- Random Insertion
- Random Deletion
- Random Shuffling
- Noise Addition

Each original phrase therefore yields five augmented versions, each paired with the original `prompt`. This added variety is intended to make the model more robust and less prone to overfitting.
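
Some augmentations can return the input unchanged (for example, when no synonym is found), which would add exact duplicates to the dataset. The sketch below shows one possible way to expand a phrase into distinct rows only. The `expand` helper, the stand-in augmenters, and the `"animal"` prompt value are hypothetical and chosen so the example runs without WordNet or `nlpaug`; the notebook's own loop appends all five versions without this check.

```python
def expand(phrase, prompt, augmenters):
    """Return [variant, prompt] rows for each distinct augmented variant."""
    rows = []
    seen = {phrase}  # treat the original phrase as already present
    for augment in augmenters:
        variant = augment(phrase)
        # Skip empty results and anything already produced.
        if variant and variant not in seen:
            seen.add(variant)
            rows.append([variant, prompt])
    return rows

# Stand-in augmenters (hypothetical), deterministic for clarity.
augmenters = [
    lambda s: s,                              # no-op, like a failed synonym lookup
    lambda s: " ".join(reversed(s.split())),  # fixed reordering
    lambda s: " ".join(s.split()[:-1]),       # drop the last word
]

rows = expand("The quick brown fox", "animal", augmenters)
for row in rows:
    print(row)
```

Here the no-op augmenter contributes nothing, so only two rows are produced from three augmenters.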

### Benefits of Augmentation:
- **Generalization**: The model is exposed to diverse examples, which helps it generalize better to unseen data.
- **Error Resilience**: Augmentations like noise addition and random deletion simulate real-world data noise, making the model more resilient to input errors.
- **Lexical Diversity**: Synonym replacement and random insertion increase lexical diversity, improving the model's ability to handle paraphrased inputs.


# Code:

In [None]:
!pip install nlpaug

In [None]:
import random
import nltk
from nltk.corpus import wordnet
import nlpaug.augmenter.char as nac
import pandas as pd

In [None]:
# Download WordNet for synonym replacement
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
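# The CSV path is blank in the source; set it to the input dataset
df = pd.read_csv('')

# Keep only the columns used for augmentation
df = df[["phrase","prompt"]]
df.head()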

In [None]:
# Initialize a noise augmenter for noise addition
aug = nac.KeyboardAug()

In [None]:
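# Function for synonym replacement
def synonym_replacement(text):
    words = text.split()
    new_words = []
    for word in words:
        synsets = wordnet.synsets(word)
        if synsets:
            # First lemma of the first synset; this can be the word itself.
            # WordNet joins multi-word lemmas with underscores.
            synonym = synsets[0].lemmas()[0].name().replace('_', ' ')
            new_words.append(synonym)
        else:
            # No WordNet entry: keep the original word
            new_words.append(word)
    return ' '.join(new_words)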

# Function for random insertion
def random_insertion(text, n=1):
    words = text.split()
    # Only words with a WordNet entry can supply a synonym. Choosing from this
    # list avoids an endless loop when no word in the text has one.
    candidates = [w for w in words if wordnet.synsets(w)]
    if not candidates:
        return text
    for _ in range(n):
        word = random.choice(candidates)
        synonym = wordnet.synsets(word)[0].lemmas()[0].name().replace('_', ' ')
        insert_pos = random.randint(0, len(words))
        words.insert(insert_pos, synonym)
    return ' '.join(words)

# Function for random deletion
def random_deletion(text, p=0.2):
    words = text.split()
    # Nothing sensible to delete from empty or single-word text
    if len(words) <= 1:
        return text

    new_words = []
    for word in words:
        if random.uniform(0, 1) > p:
            new_words.append(word)
    return ' '.join(new_words) if new_words else random.choice(words)

# Function for shuffling
def random_shuffling(text):
    words = text.split()
    random.shuffle(words)
    return ' '.join(words)

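# Function for noise addition
def noise_addition(text):
    # Recent nlpaug releases return a list of strings from augment();
    # unwrap it so each DataFrame cell holds a plain string.
    result = aug.augment(text)
    if isinstance(result, list):
        return result[0] if result else text
    return result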
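
The `synonym_replacement` function above takes the first lemma of the first synset, which can be the input word itself. One possible alternative is to pick the first candidate that actually differs from the word. The sketch below illustrates that selection rule with a toy lookup table standing in for WordNet; `TOY_LEMMAS`, `pick_synonym`, and `synonym_replacement_skip_self` are hypothetical names, and the table contents are made up for illustration.

```python
# Toy lookup standing in for WordNet (hypothetical data): each word maps to
# candidate lemma names in the order a lookup might return them.
TOY_LEMMAS = {
    "quick": ["quick", "speedy", "fast"],
    "fox": ["fox", "dodger"],
    "hot": ["hot", "red_hot"],
}

def pick_synonym(word, lemmas_for=TOY_LEMMAS.get):
    """Return the first candidate that differs from word, else word itself."""
    for lemma in lemmas_for(word.lower()) or []:
        # Multi-word lemmas are joined with underscores; restore the spaces.
        candidate = lemma.replace("_", " ")
        if candidate.lower() != word.lower():
            return candidate
    return word

def synonym_replacement_skip_self(text):
    return " ".join(pick_synonym(w) for w in text.split())

print(synonym_replacement_skip_self("The quick brown fox"))
# → The speedy brown dodger
```

With WordNet, `lemmas_for` could be replaced by a function that collects `lemma.name()` values across `wordnet.synsets(word)`; whether the resulting substitutions preserve meaning still depends on the word sense, which this sketch does not attempt to resolve.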


In [None]:
# Apply augmentations to the dataset
augmented_data = []

for index, row in df.iterrows():
    phrase = row['phrase']
    prompt = row['prompt']

    # Synonym Replacement
    synonym_replaced = synonym_replacement(phrase)

    # Random Insertion
    random_inserted = random_insertion(phrase)

    # Random Deletion
    random_deleted = random_deletion(phrase)

    # Shuffling
    shuffled = random_shuffling(phrase)

    # Noise Addition
    noisy = noise_addition(phrase)

    # Add the augmented data to the list
    augmented_data.append([synonym_replaced, prompt])
    augmented_data.append([random_inserted, prompt])
    augmented_data.append([random_deleted, prompt])
    augmented_data.append([shuffled, prompt])
    augmented_data.append([noisy, prompt])

In [None]:
# Convert augmented data to a DataFrame
augmented_df = pd.DataFrame(augmented_data, columns=['phrase', 'prompt'])

# Combine original and augmented data into a DataFrame
final_df = pd.concat([df, augmented_df], ignore_index=True)

# Save the augmented dataset to a new CSV file
final_df.to_csv('augmented_dataset.csv', index=False)

# Shuffle the DataFrame
final_df = final_df.sample(frac=1).reset_index(drop=True)

# Save the shuffled dataset to a new CSV file
final_df.to_csv('shuffled_augmented_dataset.csv', index=False)

print("Data augmentation completed and saved to 'augmented_dataset.csv'")

Data augmentation completed and saved to 'augmented_dataset.csv'
