## Training Data Generation (Augmented Text Data)

In [1]:
#!pip install nlpaug
#!pip install nltk

Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl.metadata (14 kB)
Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
Installing collected packages: nlpaug
Successfully installed nlpaug-1.1.11





In [228]:
# Importing useful dependencies
import re
import io
import nltk
import torch
import boto3
import random
import open_clip
import numpy as np
from typing import List

# Download the corpus
nltk.download('wordnet')
from nltk.corpus import wordnet

# Set a seed for reproducibility
SEED = 10721
random.seed(SEED) 
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SakuraSnow\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
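
Because notebook cells can be re-run in any order, the seed only guarantees identical augmentations if it is set again right before the random calls are made. The sketch below illustrates this with Python's `random` module alone; `sample_with_seed` is a hypothetical helper and not part of the notebook.

```python
import random

SEED = 10721  # same value as the notebook's SEED

def sample_with_seed(seed: int, k: int = 3) -> list:
    """Re-seed the global generator, then draw k random floats."""
    random.seed(seed)
    return [random.random() for _ in range(k)]

first = sample_with_seed(SEED)
second = sample_with_seed(SEED)

# Re-seeding with the same value reproduces the same sequence
print(first == second)
```

Calling `random.seed(SEED)` once at the top of an augmentation run, rather than only in the import cell, would make that run repeatable regardless of which cells were executed before it.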


In [205]:
# Setup S3 client for MinIO (MinIO implements Amazon S3 API)
s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000", # MinIO API endpoint
    aws_access_key_id="minioadmin", # User name
    aws_secret_access_key="minioadmin", # Password
)
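
Hard-coded credentials are convenient for a local MinIO instance, but they could also be read from the environment. A minimal sketch, assuming hypothetical variable names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, and `MINIO_SECRET_KEY` (none of these appear in the original setup); the defaults mirror the values used above.

```python
import os
from typing import Dict

def minio_client_kwargs() -> Dict[str, str]:
    """Build keyword arguments for boto3.client("s3", ...) from the environment.

    The environment variable names are assumptions made for this sketch.
    """
    return {
        "endpoint_url": os.environ.get("MINIO_ENDPOINT", "http://127.0.0.1:9000"),
        "aws_access_key_id": os.environ.get("MINIO_ACCESS_KEY", "minioadmin"),
        "aws_secret_access_key": os.environ.get("MINIO_SECRET_KEY", "minioadmin"),
    }

# Usage (requires boto3 and a running MinIO server):
# s3 = boto3.client("s3", **minio_client_kwargs())
```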


In [4]:
# We create a new bucket in MinIO to store our augmented training data

# List existing buckets
buckets = [b["Name"] for b in s3.list_buckets()["Buckets"]]

# Function that given a name, creates a bucket
def createBucket(name, list_buckets):
    if name in list_buckets:
        print(f"Bucket '{name}' already exists!")
    else:
        s3.create_bucket(Bucket=name)
        print(f"Created bucket: {name}")

# Create a bucket named training-data-construction-zone
createBucket("training-data-construction-zone", buckets)
# Folder-like prefix for the text-augmented training data (a zero-byte object whose key ends in "/")
s3.put_object(Bucket="training-data-construction-zone", Key="text_augmented-training-data/")

Bucket 'training-data-construction-zone' already exists!


{'ResponseMetadata': {'RequestId': '187A5C03F502FBA5',
  'HostId': 'dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'accept-ranges': 'bytes',
   'content-length': '0',
   'etag': '"d41d8cd98f00b204e9800998ecf8427e"',
   'server': 'MinIO',
   'strict-transport-security': 'max-age=31536000; includeSubDomains',
   'vary': 'Origin, Accept-Encoding',
   'x-amz-checksum-crc32': 'AAAAAA==',
   'x-amz-checksum-type': 'FULL_OBJECT',
   'x-amz-id-2': 'dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8',
   'x-amz-request-id': '187A5C03F502FBA5',
   'x-content-type-options': 'nosniff',
   'x-ratelimit-limit': '2101',
   'x-ratelimit-remaining': '2101',
   'x-xss-protection': '1; mode=block',
   'date': 'Sat, 22 Nov 2025 14:56:16 GMT'},
  'RetryAttempts': 0},
 'ETag': '"d41d8cd98f00b204e9800998ecf8427e"',
 'ChecksumCRC32': 'AAAAAA==',
 'ChecksumType': 'FULL_OBJECT'}

In this notebook, we apply data augmentation to increase the diversity of the text in our training set. We implement four techniques: **random word deletion**, **random word swap**, **random spelling error**, and **random synonym replacement**.

- **Random word deletion** removes each token from the text independently with probability `p`.
- **Random word swap** keeps every token but exchanges the positions of randomly chosen pairs; the number of swaps is a fixed fraction (`ratio`) of the token count, with at least one swap.
- **Random spelling error** corrupts each alphabetic token with probability `p` by deleting, substituting, or duplicating one character.
- **Random synonym replacement** selects a fixed fraction (`ratio`) of the token count among alphabetic tokens longer than two characters and replaces each selected token with a synonym from the WordNet corpus when one exists.
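
In the implementation later in this notebook, deletion and spelling errors act on each token independently with probability `p`, whereas swap and synonym replacement perform a fixed number of operations derived from a `ratio`. As a rough sketch of how these two parameterizations scale with text length (the helper names are hypothetical, and the defaults are the ones used by the notebook's functions):

```python
def expected_changes(n_tokens: int, p: float = 0.1) -> float:
    """Expected number of tokens affected when each token is hit with probability p."""
    return p * n_tokens

def n_operations(n_tokens: int, ratio: float = 0.05) -> int:
    """Number of operations for a ratio-based method (always at least one)."""
    return max(1, int(ratio * n_tokens))

for n in (5, 93, 400):
    print(n, round(expected_changes(n), 1), n_operations(n))
```

With the default values, a 5-token text still receives one swap or replacement, while a 93-token text receives four. For spelling errors the expected count is an upper bound, since only alphabetic tokens are eligible.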

In [93]:
# We use a sample description generated with ChatGPT as an example
samp_text = "Hatsune Miku is a virtual pop star and vocaloid software persona created by Crypton Future Media. Represented as a 16-year-old girl with long turquoise twin-tails, she 'sings' by synthesizing voices from the Vocaloid engine, allowing producers to create original songs. Since her debut in 2007, Miku has gained a massive global following, performing in live concerts as a hologram and appearing in video games, merchandise, and collaborations, making her a symbol of digital music culture."
samp_text

"Hatsune Miku is a virtual pop star and vocaloid software persona created by Crypton Future Media. Represented as a 16-year-old girl with long turquoise twin-tails, she 'sings' by synthesizing voices from the Vocaloid engine, allowing producers to create original songs. Since her debut in 2007, Miku has gained a massive global following, performing in live concerts as a hologram and appearing in video games, merchandise, and collaborations, making her a symbol of digital music culture."

In [94]:
# A simple tokenizer to split the text into a list of tokens (words in this case)
def simple_tokenize(text: str) -> List[str]:
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

# Test the simple tokenizer of words
simple_tokenize(samp_text)

['Hatsune',
 'Miku',
 'is',
 'a',
 'virtual',
 'pop',
 'star',
 'and',
 'vocaloid',
 'software',
 'persona',
 'created',
 'by',
 'Crypton',
 'Future',
 'Media',
 '.',
 'Represented',
 'as',
 'a',
 '16',
 '-',
 'year',
 '-',
 'old',
 'girl',
 'with',
 'long',
 'turquoise',
 'twin',
 '-',
 'tails',
 ',',
 'she',
 "'",
 'sings',
 "'",
 'by',
 'synthesizing',
 'voices',
 'from',
 'the',
 'Vocaloid',
 'engine',
 ',',
 'allowing',
 'producers',
 'to',
 'create',
 'original',
 'songs',
 '.',
 'Since',
 'her',
 'debut',
 'in',
 '2007',
 ',',
 'Miku',
 'has',
 'gained',
 'a',
 'massive',
 'global',
 'following',
 ',',
 'performing',
 'in',
 'live',
 'concerts',
 'as',
 'a',
 'hologram',
 'and',
 'appearing',
 'in',
 'video',
 'games',
 ',',
 'merchandise',
 ',',
 'and',
 'collaborations',
 ',',
 'making',
 'her',
 'a',
 'symbol',
 'of',
 'digital',
 'music',
 'culture',
 '.']
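
Since the tokenizer emits punctuation as separate tokens, joining tokens with single spaces produces text such as `Media .` or `twin - tails`. If tighter output is preferred, a small detokenizer could re-attach the most common cases. This is only a sketch: `detokenize` is a hypothetical helper that is not used elsewhere in the notebook, it leaves quote characters untouched because their direction is ambiguous, and it would also glue a free-standing dash to its neighbours.

```python
import re
from typing import List

def detokenize(tokens: List[str]) -> str:
    """Join tokens with spaces, then tidy spacing around common punctuation."""
    text = " ".join(tokens)
    # Remove the space before closing punctuation: "Media ." -> "Media."
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    # Re-attach hyphenated compounds: "twin - tails" -> "twin-tails"
    text = re.sub(r"\s*-\s*", "-", text)
    return text

print(detokenize(["long", "turquoise", "twin", "-", "tails", ",", "she"]))
```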

In the following cell, we define the four methods described above.

In [86]:
###############################
##### Random word deletion ####
###############################

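# Delete each token independently with probability p.
def random_deletion(tokens: List[str], p: float = 0.1) -> List[str]:
    # Nothing to delete from an empty or single-token input
    if len(tokens) <= 1:
        return tokens

    # Keep a token with probability 1 - p
    kept = [t for t in tokens if random.random() >= p]
    # Never return an empty text: fall back to one randomly chosen token
    if not kept:
        kept.append(random.choice(tokens))
    return kept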

###########################
##### Random word swap ####
###########################

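# Randomly swap a small portion of tokens (at least one swap).
def random_swap(tokens: List[str], ratio: float = 0.05) -> List[str]:
    n = len(tokens)
    if n < 2:
        return tokens

    tokens = list(tokens)  # work on a copy so the caller's list is not mutated
    n_swaps = max(1, int(ratio * n))

    for _ in range(n_swaps):
        # Pick two distinct positions and exchange their tokens
        i, j = random.sample(range(n), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens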

################################
##### Random spelling error ####
################################

# Introduce a simple spelling error in a single word
def corrupt_word(word: str) -> str:
    if len(word) == 0:
        return word

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    op = random.choice(["delete", "substitute", "duplicate"])

    if op == "delete" and len(word) > 1:
        pos = random.randrange(len(word))
        return word[:pos] + word[pos+1:]

    if op == "substitute":
        pos = random.randrange(len(word))
        new_char = random.choice(ALPHABET)
        return word[:pos] + new_char + word[pos+1:]

    if op == "duplicate":
        pos = random.randrange(len(word))
        return word[:pos] + word[pos] + word[pos:]

    return word

# For each alphabetical token, apply a spelling error with probability p.
def random_spelling_error(tokens: List[str], p: float = 0.1) -> List[str]:
    new_tokens = []
    for t in tokens:
        if t.isalpha() and random.random() < p:
            new_tokens.append(corrupt_word(t))
        else:
            new_tokens.append(t)
    return new_tokens

######################################
##### Random synonym replacement #####
######################################

# Collect synonyms from WordNet
def get_synonyms(word: str) -> List[str]:
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            lemma_name = lemma.name().replace("_", " ")
            if lemma_name.lower() != word.lower():
                synonyms.add(lemma_name)
    return list(synonyms)

# Randomly choose a small portion of tokens and replace them with synonyms.
def random_synonym_replacement(tokens: List[str], ratio: float = 0.05) -> List[str]:
    n = len(tokens)
    candidate_indices = [
        i for i, t in enumerate(tokens)
        if t.isalpha() and len(t) > 2 # skip punctuation & very short tokens
    ]
    if not candidate_indices:
        return tokens

    tokens = list(tokens)  # work on a copy so the caller's list is not mutated
    # Cap the count so random.sample never asks for more indices than exist
    n_replacements = min(len(candidate_indices), max(1, int(ratio * n)))
    indices = random.sample(candidate_indices, n_replacements)

    for idx in indices:
        word = tokens[idx]
        syns = get_synonyms(word)
        if syns:  # tokens without a WordNet synonym are left unchanged
            tokens[idx] = random.choice(syns)
    return tokens

In [100]:
# Test Random word deletion method
" ".join(random_deletion(simple_tokenize(samp_text)))

"Hatsune Miku is a virtual pop star and vocaloid software persona created Crypton Future Media . Represented as 16 - year - old girl with long turquoise twin - tails , she ' sings ' by synthesizing voices from the Vocaloid engine , producers to create original songs . Since her debut in 2007 , has gained a massive following , performing in live concerts a hologram and appearing video games , merchandise and collaborations , making her a symbol of digital ."

In [104]:
# Test Random word swap method
" ".join(random_swap(simple_tokenize(samp_text)))

"Hatsune hologram is a virtual pop star , vocaloid software persona created by Crypton Future Media . Represented as a and - year - old girl with long turquoise twin - tails , she ' sings ' by synthesizing voices from the Vocaloid engine , allowing producers to create original songs . Since her debut in 2007 and Miku has gained a massive global following , performing in live concerts video a Miku 16 appearing in as games , merchandise , and collaborations , making her a symbol of digital music culture ."

In [102]:
# Test Random spelling error method
" ".join(random_spelling_error(simple_tokenize(samp_text)))

"Hatsune kiku is a virtual pop star annd vocaloid software persona created by Crypton Future Mxdia . Represented as a 16 - year - old girl with long turquoise twin - tails , she ' sings ' by synthesizing voices from the Vocaloid engine , allowing producers to create original songs . Since hgr debut in 2007 , Miku has gainek a massive global following , performing in live concerts as a hologram and appearing in video games , merchandise , and collaborations , maing her a symbol of digital music culture ."

In [168]:
# Test Random synonym replacement method
" ".join(random_synonym_replacement(simple_tokenize(samp_text)))

"Hatsune Miku is a virtual pop star and vocaloid software role created by Crypton Future Media . Represented as a 16 - year - old girl with long turquoise twin - tails , she ' sings ' by synthesizing voices from the Vocaloid engine , allowing producers to create original songs . Since her debut in 2007 , Miku has gained a massive global following , performing in exist concerts as a hologram and appearing in TV games , merchandise , and collaborations , making her a symbol of digital music culture ."

Now that we have inspected the output of each method, the following cells create four augmented text files (one per method) for each text file in the baseline training data. The corresponding image is unchanged, but we copy it under the same name prefix as each augmented text so that every text keeps a matching image.
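
The naming scheme can be sketched on its own: each source object yields one key for the unchanged copy plus one key per augmentation, tagged `rwd`, `rws`, `rse`, and `rsr` (the tags used by the notebook's augmentation function). The helper name `augmented_keys` is hypothetical and only illustrates how the keys are derived.

```python
from typing import Dict

# Tags: random word deletion, word swap, spelling error, synonym replacement
AUG_TAGS = ("rwd", "rws", "rse", "rsr")

def augmented_keys(key: str, dest_prefix: str = "text_augmented-training-data/") -> Dict[str, str]:
    """Map a source key to the destination keys of its original and augmented copies."""
    filename = key.split("/")[-1]  # drop the top-level folder
    keys = {"original": dest_prefix + filename}
    for tag in AUG_TAGS:
        keys[tag] = f"{dest_prefix}{tag}_{filename}"
    return keys

for name, dest in augmented_keys("baseline-training-data/text_000001.txt").items():
    print(name, "->", dest)
```

Because image and text objects pass through the same mapping, an augmented text and its image copy always share the same tag.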

In [184]:
# Retrieve a text object from the bucket and decode it as UTF-8
def get_text(bucket, key):
    resp = s3.get_object(Bucket=bucket, Key=key)
    body = resp["Body"].read()
    text = body.decode("utf-8")
    return text

In [229]:
# This function generates augmented data for our descriptions in the baseline training data
def text_augmentation(src_bucket, dest_bucket, dest_prefix="text_augmented-training-data/"):
    
    paginator = s3.get_paginator("list_objects_v2") # It returns objects in pages and not all at once.
    for page in paginator.paginate(Bucket=src_bucket, Prefix="baseline-training-data/"):

        for obj in page.get("Contents", []):
            key = obj["Key"]

            if obj['Size'] == 0 and key.endswith("/"): # skip the folder itself
                continue

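            # File name without the top-level folder
            filename = key.split("/")[1]

            # Keys for the four augmented variants (one tag per method)
            key_1 = dest_prefix + "rwd_" + filename # random word deletion
            key_2 = dest_prefix + "rws_" + filename # random word swap
            key_3 = dest_prefix + "rse_" + filename # random spelling error
            key_4 = dest_prefix + "rsr_" + filename # random synonym replacement

            # Key for the unchanged copy of the original text or image file
            new_key = dest_prefix + filename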

            if "image" in key:

                # Copy the image without its top-level folder, once per variant, so each text keeps a matching image
                copy_source_image = {"Bucket": src_bucket, "Key": key}
                s3.copy_object(Bucket=dest_bucket, Key=new_key, CopySource=copy_source_image) # Original image
                s3.copy_object(Bucket=dest_bucket, Key=key_1, CopySource=copy_source_image) # Image for the first augmented text
                s3.copy_object(Bucket=dest_bucket, Key=key_2, CopySource=copy_source_image) # Image for the second augmented text
                s3.copy_object(Bucket=dest_bucket, Key=key_3, CopySource=copy_source_image) # Image for the third augmented text
                s3.copy_object(Bucket=dest_bucket, Key=key_4, CopySource=copy_source_image) # Image for the fourth augmented text
                
            elif "text" in key:

                # Get the description
                description = get_text(src_bucket, key)

                # Get augmented descriptions
                str1 = " ".join(random_deletion(simple_tokenize(description)))
                str2 = " ".join(random_swap(simple_tokenize(description)))
                str3 = " ".join(random_spelling_error(simple_tokenize(description)))
                str4 = " ".join(random_synonym_replacement(simple_tokenize(description)))

                # Copy objects without top-level folder and rename them
                copy_source_text = {"Bucket": src_bucket, "Key": key}
                s3.copy_object(Bucket=dest_bucket, Key=new_key, CopySource=copy_source_text)
                s3.put_object(Bucket=dest_bucket, Key=key_1, Body=io.BytesIO(str1.encode("utf-8")),ContentType="text/plain")
                s3.put_object(Bucket=dest_bucket, Key=key_2, Body=io.BytesIO(str2.encode("utf-8")),ContentType="text/plain")
                s3.put_object(Bucket=dest_bucket, Key=key_3, Body=io.BytesIO(str3.encode("utf-8")),ContentType="text/plain")
                s3.put_object(Bucket=dest_bucket, Key=key_4, Body=io.BytesIO(str4.encode("utf-8")),ContentType="text/plain")

                print(f"✅ Augmented data for #{key.split('/')[1]} created successfully.")

    print("✅ All augmented text data have been successfully uploaded.")

In [230]:
# Create augmented text data
text_augmentation(src_bucket = "training-data-construction-zone", dest_bucket = "training-data-construction-zone")

✅ Augmented data for #text_000001.txt created successfully.
✅ Augmented data for #text_000002.txt created successfully.
✅ Augmented data for #text_000003.txt created successfully.
✅ Augmented data for #text_000004.txt created successfully.
✅ Augmented data for #text_000005.txt created successfully.
✅ Augmented data for #text_000006.txt created successfully.
✅ Augmented data for #text_000007.txt created successfully.
✅ Augmented data for #text_000008.txt created successfully.
✅ Augmented data for #text_000009.txt created successfully.
✅ Augmented data for #text_000010.txt created successfully.
✅ Augmented data for #text_000011.txt created successfully.
✅ Augmented data for #text_000012.txt created successfully.
✅ Augmented data for #text_000013.txt created successfully.
✅ Augmented data for #text_000014.txt created successfully.
✅ Augmented data for #text_000015.txt created successfully.
✅ Augmented data for #text_000016.txt created successfully.
✅ Augmented data for #text_000017.txt cr