# Synonym Replacement for Data Augmentation (ISL-CLSRT)

This notebook demonstrates **synonym replacement** using **NLTK WordNet** to generate augmented text data from gloss sentences. The technique is aimed at low-resource sign language translation pipelines, where annotated gloss data is scarce.


In [None]:
# !pip install nltk pandas

In [None]:
import pandas as pd
import nltk
from nltk.corpus import wordnet as wn
from itertools import chain
import random

nltk.download('punkt')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/IETGenAI-SLT/Chapter 4/isl_train_meta_cleaned.csv')
df[['cleaned_gloss']].head()


| | cleaned_gloss |
|---|---|
| 0 | MAKE DIFFERENCE |
| 1 | TELL TRUTH |
| 2 | FAVOUR |
| 3 | WORRY |
| 4 | ABUSE |


## Synonym Replacement Function

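This section defines the `synonym_replacement` function, which takes a sentence and a replacement probability as input. Each whitespace-separated token is considered independently: with the given probability it is replaced by a randomly chosen WordNet synonym, if one is available, and otherwise it is kept. Because the decision is made per token, the number of replaced words varies from sentence to sentence, and some sentences come back unchanged. This process creates variations of the original sentences for data augmentation.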


In [None]:
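def synonym_replacement(sentence, replacement_prob=0.3):
    """Replace each token with a random WordNet synonym with probability replacement_prob."""
    tokens = sentence.split()
    new_tokens = []

    for word in tokens:
        # Each token is considered independently for replacement.
        if random.random() < replacement_prob:
            # Collect the lemma names of every synset of the lowercased token.
            synsets = wn.synsets(word.lower())
            lemmas = set(chain.from_iterable(s.lemma_names() for s in synsets))
            # Glosses are uppercase; drop the original word from the candidates.
            lemmas = [lemma.upper() for lemma in lemmas if lemma.upper() != word.upper()]
            if lemmas:
                new_tokens.append(random.choice(lemmas))
            else:
                # WordNet offers no alternative: keep the token unchanged.
                new_tokens.append(word)
        else:
            new_tokens.append(word)

    return ' '.join(new_tokens)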
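
The per-token logic above can also be sketched with the synonym source passed in as an argument, which makes it possible to try the function without WordNet and to reproduce a run with a fixed seed. This is only an illustrative variant, not part of the notebook: `replace_with_lookup`, `toy_lookup`, and `TOY_SYNONYMS` are hypothetical names, and the toy dictionary is made-up sample data. The sketch also skips lemma names containing an underscore, since WordNet writes multi-word lemmas that way and they would otherwise end up as a single gloss token.

```python
import random


def replace_with_lookup(sentence, lookup, replacement_prob=0.3, seed=None):
    """Per-token synonym replacement with an injectable synonym source.

    `lookup` is a hypothetical callable: lowercase word -> iterable of lemma names.
    """
    rng = random.Random(seed)  # local RNG, so a fixed seed reproduces the output
    new_tokens = []
    for word in sentence.split():
        if rng.random() < replacement_prob:
            # Uppercase, drop the original word and multi-word lemmas,
            # and sort so the candidate order is stable across runs.
            candidates = sorted(
                {
                    lemma.upper()
                    for lemma in lookup(word.lower())
                    if "_" not in lemma and lemma.upper() != word.upper()
                }
            )
            if candidates:
                new_tokens.append(rng.choice(candidates))
                continue
        # Not selected, or no usable synonym: keep the token.
        new_tokens.append(word)
    return " ".join(new_tokens)


# Toy lookup standing in for WordNet (illustrative data only).
TOY_SYNONYMS = {
    "difference": ["difference", "divergence", "departure"],
    "tell": ["tell", "state", "tell_apart"],
}


def toy_lookup(word):
    return TOY_SYNONYMS.get(word, [])


print(replace_with_lookup("MAKE DIFFERENCE", toy_lookup, replacement_prob=1.0, seed=0))
```

With the notebook's imports in place, a WordNet-backed lookup could be passed instead, for example `lambda w: chain.from_iterable(s.lemma_names() for s in wn.synsets(w))`.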


In [None]:
df['augmented_gloss'] = df['cleaned_gloss'].astype(str).apply(lambda x: synonym_replacement(x, 0.3))
df[['cleaned_gloss', 'augmented_gloss']].head()

| | cleaned_gloss | augmented_gloss |
|---|---|---|
| 0 | MAKE DIFFERENCE | MAKE DIVERGENCE |
| 1 | TELL TRUTH | TELL TRUTH |
| 2 | FAVOUR | FAVOUR |
| 3 | WORRY | WORRY |
| 4 | ABUSE | ABUSE |
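
Only one of the five rows displayed above differs from its original, which is expected when each token is replaced with probability 0.3 and most glosses are one or two words long. A quick way to quantify this is to measure the share of rows that actually changed. The helper below is a minimal sketch; `changed_fraction` is a hypothetical name, not something defined in the notebook.

```python
def changed_fraction(originals, augmented):
    """Return the share of pairs whose augmented text differs from the original."""
    pairs = list(zip(originals, augmented))
    if not pairs:
        return 0.0
    changed = sum(1 for before, after in pairs if before != after)
    return changed / len(pairs)


# The five rows displayed above.
originals = ["MAKE DIFFERENCE", "TELL TRUTH", "FAVOUR", "WORRY", "ABUSE"]
augmented = ["MAKE DIVERGENCE", "TELL TRUTH", "FAVOUR", "WORRY", "ABUSE"]
print(changed_fraction(originals, augmented))  # 0.2
```

On the full DataFrame, the same helper could be called as `changed_fraction(df['cleaned_gloss'].astype(str), df['augmented_gloss'])`.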


In [None]:
df.to_csv('/content/drive/MyDrive/IETGenAI-SLT/Chapter 4/isl_train_meta_augmented.csv', index=False)
print("Augmented data saved to isl_train_meta_augmented.csv")


Augmented data saved to isl_train_meta_augmented.csv


### Summary

Synonym replacement, as demonstrated in this notebook, is a straightforward way to generate **diverse training samples** from small datasets. At a replacement probability of 0.3, many short glosses come back unchanged (four of the five rows shown above), so the amount of new data depends on that probability and on gloss length. Used with that in mind, the technique can help sign language translation models generalize when data resources are limited.
