Text augmentation can increase the diversity of your training data and improve the performance of NLP models. Here are some examples using different libraries for text augmentation:
### 1. **NLTK**

The Natural Language Toolkit (NLTK) is a popular library for NLP. Here's how to perform synonym replacement using NLTK's WordNet interface:


In [5]:
import random
import nltk
from nltk.corpus import wordnet

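# WordNet supplies the synonyms; Punkt supplies the tokenizer models.
# NLTK 3.9 and later look for 'punkt_tab' instead of 'punkt'; download that
# resource if word_tokenize raises a LookupError.
nltk.download('wordnet')
nltk.download('punkt')

def synonym_replacement(text):
    """Replace every token that has a WordNet entry with a randomly chosen synonym."""
    words = nltk.word_tokenize(text)
    new_words = words.copy()

    for i, word in enumerate(words):
        # Each synset is one sense of the word; tokens without an entry are left alone
        synsets = wordnet.synsets(word)
        if synsets:
            # Take the first lemma of a random sense; multiword lemmas keep
            # their underscore (e.g. "sit_down")
            synonym = random.choice(synsets).lemmas()[0].name()
            if synonym != word:
                new_words[i] = synonym

    # Tokens are joined with spaces, so punctuation ends up separated from the preceding word
    return ' '.join(new_words)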

# Example usage
text = "The cat sat on the mat."
augmented_text = synonym_replacement(text)
print(augmented_text)


The kat sit_down on the mat .


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
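
The sample output above shows two side effects of this function: it replaces every token that has a WordNet entry, and it picks a random sense, so "cat" can become "kat" and "sat" can become "sit_down". A more conservative variant might replace only a limited number of words and clean up multiword lemmas. The following is a minimal sketch of that idea, not part of NLTK: the function name `replace_n_synonyms`, the `get_synonyms` callback, and the toy lookup table are all hypothetical. In practice `get_synonyms` could wrap `wordnet.synsets`.

```python
import random

def replace_n_synonyms(words, get_synonyms, n=1, rng=None):
    """Replace up to n words with a synonym returned by get_synonyms(word).

    get_synonyms is a hypothetical callback that returns a list of candidate
    strings for a word (an empty list when it knows none).
    """
    if rng is None:
        rng = random.Random()
    new_words = list(words)

    # Visit positions in random order so replacements are spread over the sentence
    positions = list(range(len(words)))
    rng.shuffle(positions)

    replaced = 0
    for i in positions:
        if replaced >= n:
            break
        # Turn WordNet-style multiword lemmas ("sit_down") into plain text
        candidates = [s.replace("_", " ") for s in get_synonyms(words[i])]
        # Drop candidates that are identical to the original word
        candidates = [c for c in candidates if c.lower() != words[i].lower()]
        if candidates:
            new_words[i] = rng.choice(candidates)
            replaced += 1
    return new_words

# Toy lookup table standing in for WordNet (illustrative data only)
TOY_SYNONYMS = {"sat": ["sit_down", "sat"], "mat": ["rug"]}

words = ["The", "cat", "sat", "on", "the", "mat", "."]
result = replace_n_synonyms(
    words, lambda w: TOY_SYNONYMS.get(w, []), n=1, rng=random.Random(0)
)
print(" ".join(result))
```

Passing the synonym source as a callback keeps the replacement logic independent of WordNet, so you can filter by part of speech or skip stop words inside the callback without touching the loop.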



### 2. **TextAugment**

The TextAugment library provides several text augmentation techniques. Its `EDA` (Easy Data Augmentation) class includes synonym replacement and random insertion. Here's how to use it:


In [8]:
!pip install textaugment



In [13]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [14]:
from textaugment import EDA

text = "The cat sat on the mat."

# Create an EDA instance
eda = EDA()

# Using synonym replacement
augmented_text = eda.synonym_replacement(text)
print(augmented_text)

# Using random insertion
augmented_text_insertion = eda.random_insertion(text)
print(augmented_text_insertion)

The cat pose on the mat.
The cat sat pose on the mat.
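
In the second output line above, a synonym of "sat" ("pose") was inserted next to the original word instead of replacing it. To make that step concrete, here is a minimal pure-Python sketch of random insertion as the EDA technique describes it: pick a word that has a synonym, then insert that synonym at a random position. This is not TextAugment's implementation; `random_insertion_sketch`, the `get_synonyms` callback, and the toy table are hypothetical names used only for illustration.

```python
import random

def random_insertion_sketch(words, get_synonyms, n=1, rng=None):
    """Insert n synonyms of randomly chosen words at random positions.

    get_synonyms is a hypothetical callback returning a list of synonyms
    for a word (an empty list when it knows none).
    """
    if rng is None:
        rng = random.Random()
    new_words = list(words)

    # Only words with at least one synonym can supply an insertion
    donors = [w for w in words if get_synonyms(w)]
    if not donors:
        return new_words

    for _ in range(n):
        donor = rng.choice(donors)
        synonym = rng.choice(get_synonyms(donor))
        # randint is inclusive on both ends, so the synonym may also be appended
        position = rng.randint(0, len(new_words))
        new_words.insert(position, synonym)
    return new_words

# Toy lookup table standing in for a real synonym source (illustrative data only)
TOY_SYNONYMS = {"sat": ["pose"]}

words = ["The", "cat", "sat", "on", "the", "mat", "."]
augmented = random_insertion_sketch(
    words, lambda w: TOY_SYNONYMS.get(w, []), n=1, rng=random.Random(0)
)
print(" ".join(augmented))
```

Because the original words are all kept, random insertion lengthens the sentence by exactly `n` tokens, which makes it a milder perturbation than replacement.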



### 3. **Back Translation Using Transformers**

You can also perform back translation using the Hugging Face Transformers library. This involves translating text to another language and then back to the original language, which often yields a paraphrase of the input.


In [23]:
!pip install transformers torch sentencepiece



In [26]:
from transformers import MarianMTModel, MarianTokenizer

def back_translate(text, src_lang="en", mid_lang="fr"):
    """Translate text from src_lang to mid_lang and back again."""
    # Load the forward model (src_lang -> mid_lang)
    model_name = f"Helsinki-NLP/opus-mt-{src_lang}-{mid_lang}"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # Translate to the intermediate language, then decode the token IDs to a
    # string so the reverse tokenizer can re-encode it
    translated = model.generate(**tokenizer(text, return_tensors="pt", padding=True))
    decoded_translated = tokenizer.decode(translated[0], skip_special_tokens=True)

    # Load the reverse model (mid_lang -> src_lang)
    back_model_name = f"Helsinki-NLP/opus-mt-{mid_lang}-{src_lang}"
    back_tokenizer = MarianTokenizer.from_pretrained(back_model_name)
    back_model = MarianMTModel.from_pretrained(back_model_name)

    # Translate back to the source language
    back_translated = back_model.generate(**back_tokenizer(decoded_translated, return_tensors="pt", padding=True))

    return back_tokenizer.decode(back_translated[0], skip_special_tokens=True)

# Example usage
text = "The cat sat on the mat."
augmented_text = back_translate(text)
print("Back Translated:", augmented_text)



Back Translated: The cat was sitting on the carpet.
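
The `back_translate` function above loads both models on every call, which becomes slow when you augment many sentences. A minimal sketch of one way around this, assuming the same Helsinki-NLP checkpoints: cache each tokenizer/model pair with `functools.lru_cache` and translate a list of sentences per call. The helper names `marian_name`, `load_marian`, `translate`, and `back_translate_batch` are hypothetical, and the `transformers` import sits inside the loader so that nothing is downloaded until the first translation.

```python
from functools import lru_cache

def marian_name(src, tgt):
    """Build the checkpoint name used in the example above."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"

@lru_cache(maxsize=None)
def load_marian(src, tgt):
    """Load a tokenizer/model pair once per language direction."""
    # Imported lazily so the heavy dependency is only needed when translating
    from transformers import MarianMTModel, MarianTokenizer
    name = marian_name(src, tgt)
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

def translate(texts, src, tgt):
    """Translate a list of sentences in one padded batch."""
    if not texts:
        return []
    tokenizer, model = load_marian(src, tgt)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

def back_translate_batch(texts, src_lang="en", mid_lang="fr"):
    """Round-trip a list of sentences through mid_lang."""
    intermediate = translate(texts, src_lang, mid_lang)
    return translate(intermediate, mid_lang, src_lang)

# Example usage (downloads the models on first call):
# sentences = ["The cat sat on the mat.", "The dog slept by the door."]
# print(back_translate_batch(sentences))
```

Batching also lets the model process several sentences per `generate` call, and changing `mid_lang` gives different paraphrases of the same input, provided a checkpoint exists for that language pair in both directions.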
