<a href="https://colab.research.google.com/github/Clearbox-AI/Marileni_Sinioraki_Thesis/blob/start-testing-tools/synthetic_data_tools_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Synthetic textual data

Synthetic textual data is text generated by algorithms, AI models, or other computational methods rather than written by humans. It is designed to mimic the patterns, structure, or style of real-world text, and is commonly used to train machine learning models, test systems, or augment datasets when real data is scarce or sensitive.
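
As a minimal illustration of text "generated by algorithms", the sketch below fills a sentence template with randomly chosen words. The template and the word lists are hypothetical and chosen only for this example; the tools compared later in this notebook use far richer methods.

```python
import random

# Hypothetical slot values for illustration; not taken from any dataset.
SLOTS = {
    "dish": ["pasta", "soup", "salad"],
    "quality": ["delicious", "bland", "overcooked"],
    "service": ["attentive", "slow", "friendly"],
}
TEMPLATE = "The {dish} was {quality}, and the service was {service}."


def generate_reviews(n, seed=0):
    """Return n synthetic review sentences filled from SLOTS."""
    rng = random.Random(seed)  # seeded so repeated runs give the same text
    return [
        TEMPLATE.format(**{slot: rng.choice(values) for slot, values in SLOTS.items()})
        for _ in range(n)
    ]


reviews = generate_reviews(3)
for review in reviews:
    print(review)
```

Fixing the seed keeps the output repeatable, which makes it easier to compare runs across tools.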

Benefits:

1. **Privacy and Security**: Reduces the exposure of real personal data to breaches.
2. **Data Augmentation**: Expands datasets for machine learning.
3. **Flexibility**: Create specific or rare scenarios.
4. **Cost-effective**: Often cheaper than real-world data collection.
5. **Regulatory Compliance**: Helps navigate strict data protection laws.
6. **Model Robustness**: Can help AI models generalize better.
7. **Rapid Prototyping**: Enables quick testing without real data.
8. **Controlled Experimentation**: Simulate specific conditions.
9. **Access to Data**: Alternative when real data isn't available.

# **Data Augmentation Libraries**

## **TextAttack**

In [2]:
!pip install textattack


In [4]:
from textattack.augmentation import WordNetAugmenter, Augmenter
from textattack.transformations import WordSwapMaskedLM  # Import correct class for BERT

text = "The food was absolutely delicious, and the service was attentive, making it one of the best dining experiences I've had recently."

# Synonym replacement based on WordNet
wordnet_aug = WordNetAugmenter()
print("WordNet:", wordnet_aug.augment(text))

# Optional: BERT-based replacement with a masked language model.
# WordSwapMaskedLM accepts method="bae" or method="bert-attack".
# bert_swap = WordSwapMaskedLM(method="bert-attack")
# bert_augmenter = Augmenter(transformation=bert_swap)
# print("BERT:", bert_augmenter.augment(text))

WordNet: ["The food was absolutely delicious, and the service was attentive, create it one of the secure dining experiences I've had recently."]


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
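
In the output above, "making" became "create" and "best" became "secure", which suggests the synonym lookup does not take the surrounding context into account. The sketch below shows the general idea of synonym swapping with a small hand-written table. It is an illustration only, not TextAttack's implementation; `SYNONYMS` and `swap_synonyms` are hypothetical names.

```python
import random
import re

# Hypothetical hand-written synonym table; a WordNet lookup would supply these in practice.
SYNONYMS = {
    "delicious": ["tasty", "delectable"],
    "attentive": ["considerate", "thoughtful"],
    "best": ["finest", "top"],
}


def swap_synonyms(sentence, max_swaps=1, seed=0):
    """Replace up to max_swaps words that have an entry in SYNONYMS."""
    rng = random.Random(seed)
    # Split into word and non-word tokens so punctuation and spacing are preserved.
    tokens = re.findall(r"\w+|\W+", sentence)
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(max_swaps, len(candidates))):
        tokens[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return "".join(tokens)


original = "The food was absolutely delicious, and the service was attentive."
augmented = swap_synonyms(original, max_swaps=2)
print(augmented)
```

Because the table maps each word to synonyms regardless of context, this sketch has the same weakness as the output above: a word with several senses can be replaced by a synonym of the wrong sense.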


## **AugLy**

In [5]:
!pip install augly[all]


In [1]:
from augly import text as textaugs

text = "The food was absolutely delicious, and the service was attentive, making it one of the best dining experiences I've had recently."

# Augmentations to apply; each function takes the text as its first argument
augs = [
    textaugs.insert_punctuation_chars,
    # textaugs.replace_synonyms,
    textaugs.simulate_typos,
    # textaugs.replace_word_similar_char,
]

# Apply each augmentation and label the result with the function's name
for aug in augs:
    augmented_text = aug(text)
    print(f"{aug.__name__}: {augmented_text}")


insert_punctuation_chars: ["T...h...e... ...f...o...o...d... ...w...a...s... ...a...b...s...o...l...u...t...e...l...y... ...d...e...l...i...c...i...o...u...s...,... ...a...n...d... ...t...h...e... ...s...e...r...v...i...c...e... ...w...a...s... ...a...t...t...e...n...t...i...v...e...,... ...m...a...k...i...n...g... ...i...t... ...o...n...e... ...o...f... ...t...h...e... ...b...e...s...t... ...d...i...n...i...n...g... ...e...x...p...e...r...i...e...n...c...e...s... ...I...'...v...e... ...h...a...d... ...r...e...c...e...n...t...l...y...."]
simulate_typos: Thge food was absolutely dleicious, and the service was attentive, amking ti noone of the brst dining experiences I' ev had recently.
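
One way to compare how strongly each tool alters the input is to score the augmented text against the original. The sketch below uses `difflib.SequenceMatcher` from the standard library on the typo output shown above. The choice of metric is an assumption made for illustration; other similarity measures would work as well.

```python
from difflib import SequenceMatcher

original = "The food was absolutely delicious, and the service was attentive, making it one of the best dining experiences I've had recently."
typos = "Thge food was absolutely dleicious, and the service was attentive, amking ti noone of the brst dining experiences I' ev had recently."


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the two sequences are identical."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Character level: compares the raw strings.
char_score = similarity(original, typos)
# Word level: compares the lists of whitespace-separated words.
word_score = similarity(original.split(), typos.split())

print(f"character-level similarity: {char_score:.2f}")
print(f"word-level similarity: {word_score:.2f}")
```

A typo counts as a small change at character level but as a whole-word mismatch at word level, so reporting both scores gives a fuller picture of what an augmentation did.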


## **LangChain**

In [9]:
!pip install -U langchain langchain_experimental openai langchain-openai




In [36]:
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_experimental.tabular_synthetic_data.openai import create_openai_data_generator
from langchain_experimental.tabular_synthetic_data.prompts import SYNTHETIC_FEW_SHOT_SUFFIX, SYNTHETIC_FEW_SHOT_PREFIX
from langchain_openai import ChatOpenAI

In [37]:
from pydantic import BaseModel, ConfigDict

class MedicalBilling(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)
    bill_amount: float
    patient_id: int
    patient_name: str
    diagnosis_code: str
    procedure_code: str
    total_charge: float
    insurance_claim_amount: float


# Few-shot examples; each string is built by concatenation so that it
# contains no stray newlines or indentation.
examples = [
    {"example": (
        "Patient ID: 123456, Patient Name: John Doe, Diagnosis Code: J20.9, "
        "Procedure Code: 99203, Total Charge: $500, Insurance Claim Amount: $350"
    )},
    {"example": (
        "Patient ID: 789012, Patient Name: Johnson Smith, Diagnosis Code: M54.5, "
        "Procedure Code: 99213, Total Charge: $150, Insurance Claim Amount: $120"
    )},
    {"example": (
        "Patient ID: 345678, Patient Name: Emily Stone, Diagnosis Code: E11.9, "
        "Procedure Code: 99214, Total Charge: $300, Insurance Claim Amount: $250"
    )},
]

In [39]:
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")


prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)
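
To make the structure of a few-shot prompt concrete, the sketch below assembles one by hand in plain Python: a prefix, the examples, and a suffix that carries the `subject` and `extra` variables. The `PREFIX` and `SUFFIX` strings are placeholders written for this illustration and will differ from the actual `SYNTHETIC_FEW_SHOT_PREFIX` and `SYNTHETIC_FEW_SHOT_SUFFIX` values. The blank-line separator is also an assumption.

```python
# Placeholder strings for illustration; not the library's actual prefix and suffix.
PREFIX = "Here are some examples of {subject} records:"
SUFFIX = "Now generate one more {subject} record. Extra instructions: {extra}"

sample_examples = [
    "Patient ID: 123456, Patient Name: John Doe, Diagnosis Code: J20.9",
    "Patient ID: 789012, Patient Name: Johnson Smith, Diagnosis Code: M54.5",
]


def build_few_shot_prompt(subject, extra, examples, separator="\n\n"):
    """Join prefix, examples, and suffix into a single prompt string."""
    parts = [PREFIX.format(subject=subject)]
    parts.extend(examples)
    parts.append(SUFFIX.format(subject=subject, extra=extra))
    return separator.join(parts)


prompt_text = build_few_shot_prompt(
    subject="medical_billing",
    extra="the name must be chosen at random.",
    examples=sample_examples,
)
print(prompt_text)
```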

In [None]:
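# Instantiate the chat model. ChatOpenAI reads the API key from the
# OPENAI_API_KEY environment variable, so set it before running this cell.
llm = ChatOpenAI(temperature=1)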

synthetic_data_generator = create_openai_data_generator(
    output_schema=MedicalBilling,
    llm=llm,
    prompt=prompt_template,
)


synthetic_results = synthetic_data_generator.generate(
    subject="medical_billing",
    extra="the name must be chosen at random. Make it something you wouldn't normally choose.",
    runs=1,
)


synthetic_results

In [None]:
import pandas as pd

# Create a list of dictionaries from the objects
synthetic_data = []
for item in synthetic_results:
    synthetic_data.append({
        'patient_id': item.patient_id,
        'patient_name': item.patient_name,
        'diagnosis_code': item.diagnosis_code,
        'procedure_code': item.procedure_code,
        'total_charge': item.total_charge,
        'insurance_claim_amount': item.insurance_claim_amount
    })

# Create a Pandas DataFrame from the list of dictionaries
synthetic_df = pd.DataFrame(synthetic_data)

# Display the DataFrame
print(type(synthetic_df))
synthetic_df
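
Generated records can look plausible and still be malformed, so a quick sanity check before using them may be worthwhile. The sketch below is one possible check, written against plain dictionaries with the same keys as the rows built above. The patterns only test the loose shape of the codes seen in the few-shot examples, and the rule that a claim should not exceed the total charge is an assumption made for this illustration. `validate_record` is a hypothetical helper, not part of LangChain.

```python
import re

# Loose shape checks only: a letter, a digit, an alphanumeric, then an optional
# dotted extension (e.g. J20.9), and a five-digit procedure code (e.g. 99203).
DIAGNOSIS_PATTERN = re.compile(r"[A-Z]\d[0-9A-Z](\.[0-9A-Z]{1,4})?")
PROCEDURE_PATTERN = re.compile(r"\d{5}")


def validate_record(record):
    """Return a list of problems found in one record; empty means it passed."""
    problems = []
    if not DIAGNOSIS_PATTERN.fullmatch(record["diagnosis_code"]):
        problems.append("diagnosis_code has an unexpected format")
    if not PROCEDURE_PATTERN.fullmatch(record["procedure_code"]):
        problems.append("procedure_code has an unexpected format")
    # Assumed business rule: the insurance claim should not exceed the charge.
    if record["insurance_claim_amount"] > record["total_charge"]:
        problems.append("insurance_claim_amount exceeds total_charge")
    return problems


good = {"diagnosis_code": "J20.9", "procedure_code": "99203",
        "total_charge": 500.0, "insurance_claim_amount": 350.0}
bad = {"diagnosis_code": "20J", "procedure_code": "9920",
       "total_charge": 100.0, "insurance_claim_amount": 150.0}

print(validate_record(good))
print(validate_record(bad))
```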