<a href="https://colab.research.google.com/github/Clearbox-AI/Marileni_Sinioraki_Thesis/blob/start-testing-tools/synthetic_data_tools_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Synthetic textual data

Synthetic textual data is text generated by algorithms, AI models, or other computational methods rather than written by humans. It is designed to mimic the patterns, structure, or style of real-world text, and it is often used to train machine learning models, test systems, or augment datasets when real data is scarce or sensitive.

Benefits:

1. **Privacy and Security**: Reduces the risk of exposing real personal data in a breach.
2. **Data Augmentation**: Expands small datasets for machine learning.
3. **Flexibility**: Makes it possible to create specific or rare scenarios on demand.
4. **Cost-effective**: Often cheaper than collecting real-world data.
5. **Regulatory Compliance**: Helps navigate strict data protection laws.
6. **Model Robustness**: Can help AI models generalize better.
7. **Rapid Prototyping**: Enables quick testing without real data.
8. **Controlled Experimentation**: Lets you simulate specific conditions.
9. **Access to Data**: Provides an alternative when real data isn't available.

# **Data Augmentation Libraries**

## **TextAttack**

In [None]:
!pip install --upgrade textattack

In [2]:
from textattack.augmentation import Augmenter, WordNetAugmenter
from textattack.transformations import WordSwapMaskedLM

text = "The food was absolutely delicious, and the service was attentive, making it one of the best dining experiences I've had recently."

# WordNet-based augmenter: swaps words for WordNet synonyms.
wordnet_aug = WordNetAugmenter()

# Masked-language-model word swap (loads bert-base-uncased by default).
# `method` can be "bae" (the default) or "bert-attack".
bert_aug = WordSwapMaskedLM(method="bert-attack")

# Wrap the transformation in an Augmenter so it can be applied to raw text.
bert_augmenter = Augmenter(transformation=bert_aug)

# Each call returns a list of augmented strings.
print("WordNet:", wordnet_aug.augment(text))
print("BERT:", bert_augmenter.augment(text))

WordNet: ["The food was dead delicious, and the service was attentive, making it one of the best dining know I've had recently."]
BERT: ["The food was absolutely marvelous, and the service was attentive, making it one of the best dining experiences I've had recently."]
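It can be hard to spot by eye what each augmenter changed in a long sentence. The sketch below is one possible way to list the word-level substitutions, using only the standard library's `difflib`. The helper name `changed_words` and the simple regex tokenizer are assumptions made for illustration; they are not part of TextAttack.

```python
import difflib
import re
from typing import List, Tuple


def changed_words(original: str, augmented: str) -> List[Tuple[str, str]]:
    """Return (original_span, augmented_span) pairs for each word-level edit."""
    # Simple tokenizer (an assumption): runs of word characters and apostrophes.
    a = re.findall(r"[\w']+", original)
    b = re.findall(r"[\w']+", augmented)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


original = "The food was absolutely delicious, and the service was attentive, making it one of the best dining experiences I've had recently."
# Augmented sentences copied from the output above.
wordnet_out = "The food was dead delicious, and the service was attentive, making it one of the best dining know I've had recently."
bert_out = "The food was absolutely marvelous, and the service was attentive, making it one of the best dining experiences I've had recently."

print("WordNet:", changed_words(original, wordnet_out))
print("BERT:", changed_words(original, bert_out))
```

On the outputs shown above, this reports that the WordNet augmenter replaced "absolutely" with "dead" and "experiences" with "know", while the BERT-based swap replaced "delicious" with "marvelous". The WordNet substitutions change the meaning of the sentence, which is worth checking before using augmented text for training.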


## **AugLy**

In [None]:
!pip install augly[all]  # Installs for text, image, video, audio

In [1]:
from augly import text as textaugs

text = "The food was absolutely delicious, and the service was attentive, making it one of the best dining experiences I've had recently."

# Augmentation functions to apply; each takes its input through the `texts` argument.
augs = [
    textaugs.insert_punctuation_chars,
    textaugs.simulate_typos,
    # textaugs.replace_similar_chars,
]

# Apply each augmentation to the same input and label the result with the function name.
for aug in augs:
    augmented_text = aug(texts=text)
    print(f"{aug.__name__}: {augmented_text}")


insert_punctuation_chars: ["T!h!e! !f!o!o!d! !w!a!s! !a!b!s!o!l!u!t!e!l!y! !d!e!l!i!c!i!o!u!s!,! !a!n!d! !t!h!e! !s!e!r!v!i!c!e! !w!a!s! !a!t!t!e!n!t!i!v!e!,! !m!a!k!i!n!g! !i!t! !o!n!e! !o!f! !t!h!e! !b!e!s!t! !d!i!n!i!n!g! !e!x!p!e!r!i!e!n!c!e!s! !I!'!v!e! !h!a!d! !r!e!c!e!n!t!l!y!."]
simulate_typos: Tghe food was absolutly delicsious, and the service ws attentive, making it noone of the best dining experiences I' ev had rwcently.
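
The punctuation output above looks like a single character inserted between every pair of adjacent characters. The following is a minimal pure-Python sketch that reproduces that pattern, to make the transformation easier to reason about. The function `insert_char_between` and its `every` parameter are hypothetical names for this illustration; this is not AugLy's implementation, and AugLy's own parameters may behave differently.

```python
def insert_char_between(text: str, char: str = "!", every: int = 1) -> str:
    """Insert `char` after every `every` characters, but not after the last chunk."""
    if every < 1:
        raise ValueError("every must be >= 1")
    # Split the text into fixed-size chunks, then join them with the inserted character.
    chunks = [text[i:i + every] for i in range(0, len(text), every)]
    return char.join(chunks)


print(insert_char_between("The food"))           # T!h!e! !f!o!o!d
print(insert_char_between("The food", every=2))  # Th!e !fo!od
```

With `every=1` this matches the shape of the first output line. A larger spacing produces a lighter perturbation that may be closer to the kind of noise seen in real user text.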
