# Back-Translation for Data Augmentation (ISL_CLSRT)

This notebook demonstrates **back-translation** for text augmentation using pretrained **MarianMT** models (offline inference) via the Hugging Face Transformers library. Back-translation is a data augmentation technique in which a sentence is translated from a source language into a target language and then translated back into the source language. The round trip often yields a paraphrase of the original sentence, which introduces linguistic variation.

For low-resource sign language datasets, where obtaining large amounts of parallel text (sign language gloss to written language) can be challenging, back-translation is particularly useful. By applying back-translation to existing gloss sentences, we can generate diverse paraphrases. These augmented sentences can then be used to train more robust sign language translation models, improving their ability to handle different linguistic expressions of the same meaning.

Here's a simplified illustration of the back-translation process:

Original Sentence (English Gloss)
        ↓
Translate (English to German)
        ↓
Intermediate Translation (German)
        ↓
Translate Back (German to English)
        ↓
Back-Translated Sentence (English Gloss - Paraphrased)

This process helps create a larger and more varied dataset, which is crucial for training effective machine learning models, especially in domains with limited data like sign language translation.
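The round trip in the illustration can be sketched independently of any particular model. The snippet below is a minimal illustration rather than this notebook's implementation: `round_trip`, `toy_en_de`, and `toy_de_en` are hypothetical names, and the two toy dictionaries only stand in for real translation models.

```python
from typing import Callable, Tuple

def round_trip(sentence: str,
               forward: Callable[[str], str],
               backward: Callable[[str], str]) -> Tuple[str, str]:
    """Translate into the pivot language and back; return (pivot, paraphrase)."""
    pivot = forward(sentence)
    return pivot, backward(pivot)

# Toy word-level "translators" (hypothetical stand-ins for real MT models).
EN_DE = {"tell": "sagen", "truth": "wahrheit"}
DE_EN = {"sagen": "say", "wahrheit": "truth"}

def toy_en_de(text: str) -> str:
    return " ".join(EN_DE.get(word, word) for word in text.lower().split())

def toy_de_en(text: str) -> str:
    return " ".join(DE_EN.get(word, word) for word in text.lower().split())

pivot, paraphrase = round_trip("TELL TRUTH", toy_en_de, toy_de_en)
print(pivot)       # sagen wahrheit
print(paraphrase)  # say truth
```

Because the forward and backward mappings are not exact inverses, the sentence that comes back differs from the one that went in, which is the effect back-translation relies on.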

In [None]:
pip install transformers sentencepiece pandas



In [None]:
import pandas as pd
from transformers import MarianMTModel, MarianTokenizer


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/IETGenAI-SLT/Chapter 4/isl_train_meta_cleaned.csv')
sentences = df['cleaned_gloss'].tolist()
df[['cleaned_gloss']].head()


Unnamed: 0,cleaned_gloss
0,MAKE DIFFERENCE
1,TELL TRUTH
2,FAVOUR
3,WORRY
4,ABUSE


## Back-Translation Model Setup

I used the **English → German → English** pipeline for back-translation using pretrained MarianMT models. This language pair is a common choice for back-translation for several reasons:

*   **Model Availability and Quality:** There are high-quality, well-trained, and readily available pretrained models for English-German translation (and vice-versa) on platforms like Hugging Face. These models, often trained on large datasets, provide a strong foundation for effective translation.
*   **Linguistic Differences:** English and German have significant grammatical and structural differences, which can lead to more varied paraphrases during the back-translation process compared to language pairs that are very similar. This variation is beneficial for data augmentation.
*   **Computational Efficiency:** The MarianMT models for this language pair are generally efficient for deployment and inference, making the back-translation process practical.

While English-German-English is a common and effective choice, other language pairs could also be used for back-translation. The selection of an alternative language pair would depend on factors such as:

*   **Availability of High-Quality Translation Models:** The availability and performance of pretrained models for the desired language pair are crucial.
*   **Linguistic Diversity:** Choosing an intermediate language with different linguistic characteristics from the source language can lead to more diverse paraphrases.
*   **Computational Resources:** The size and complexity of the translation models for the chosen language pair can impact the computational resources required.

For this demonstration, the English-German-English pipeline provides a good balance of model quality, linguistic diversity, and computational practicality.
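Trying a different pivot language mostly comes down to choosing a different pair of checkpoints. The sketch below assumes the `Helsinki-NLP/opus-mt-{src}-{tgt}` naming pattern that the two model names used in this notebook follow. The helper `marian_pair_names` is hypothetical, and not every language pair is guaranteed to be published under that pattern, so check that both checkpoints exist on the Hub before relying on it.

```python
from typing import Tuple

def marian_pair_names(pivot: str, source: str = "en") -> Tuple[str, str]:
    """Build (forward, backward) checkpoint names for a pivot language code.

    Assumes the 'Helsinki-NLP/opus-mt-{src}-{tgt}' naming pattern; the
    resulting names are not checked against the Hub here.
    """
    forward = f"Helsinki-NLP/opus-mt-{source}-{pivot}"
    backward = f"Helsinki-NLP/opus-mt-{pivot}-{source}"
    return forward, backward

print(marian_pair_names("de"))
# ('Helsinki-NLP/opus-mt-en-de', 'Helsinki-NLP/opus-mt-de-en')
```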


In [None]:
src_model_name = 'Helsinki-NLP/opus-mt-en-de'
tgt_model_name = 'Helsinki-NLP/opus-mt-de-en'

src_tokenizer = MarianTokenizer.from_pretrained(src_model_name)
src_model = MarianMTModel.from_pretrained(src_model_name)

tgt_tokenizer = MarianTokenizer.from_pretrained(tgt_model_name)
tgt_model = MarianMTModel.from_pretrained(tgt_model_name)



In [None]:
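def translate(text, tokenizer, model):
    # Tokenize with the tokenizer's __call__ method (prepare_seq2seq_batch is deprecated)
    batch = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    # Decode the generated token IDs back into a plain string
    tgt_text = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
    return tgt_text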


In [None]:
def back_translate(text):
    german = translate(text, src_tokenizer, src_model)
    back_translated = translate(german, tgt_tokenizer, tgt_model)
    return german, back_translated
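Translating one sentence per call is fine for a 10-row sample but becomes slow on a full dataset. A minimal sketch of a batched variant follows. The names `chunks`, `translate_batch`, and the default `batch_size` are assumptions for illustration, not part of the original notebook. The function expects a tokenizer and model like the ones loaded above.

```python
from typing import Iterator, List

def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def translate_batch(texts: List[str], tokenizer, model, batch_size: int = 16) -> List[str]:
    """Translate a list of sentences in batches, preserving input order."""
    outputs: List[str] = []
    for chunk in chunks(texts, batch_size):
        # Pad to the longest sentence in the batch; truncate overlong inputs.
        batch = tokenizer(chunk, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**batch)
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return outputs
```

With the objects defined in this notebook, a call would look like `german = translate_batch(sentences, src_tokenizer, src_model)`, followed by a second call with `tgt_tokenizer` and `tgt_model` for the return direction.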

In [None]:
df_sample = df.sample(10, random_state=42).copy()
df_sample['cleaned_gloss'] = df_sample['cleaned_gloss'].fillna('')
df_sample[['german', 'back_translated']] = df_sample['cleaned_gloss'].apply(lambda x: pd.Series(back_translate(x)))
display(df_sample[['cleaned_gloss', 'german', 'back_translated']])


Unnamed: 0,cleaned_gloss,german,back_translated
361,GOOD,WAHRSCHEINLICHKEIT,LIKELIHOOD
73,NICE CHATTING,NICE CHATING,NICE CHATING
374,GOT HURT,GUT HURT,GOOD CURRENCY
155,,Der Präsident,The President
104,CAME TRAIN,KAME TRAIN,KAME TRAIN
394,NEED MEDICINE TAKE ONE,NOTWENDIGES MEDIZIN ERWACHET,NEEDED MEDICALLY AWAKES
377,SPEAK SOFTLY,SPRACHEN SOFTLICH,LANGUAGES SOFTLY
124,CAME TRAIN,KAME TRAIN,KAME TRAIN
68,,Der Präsident,The President
450,CAME TRAIN,KAME TRAIN,KAME TRAIN
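The sample output suggests that some filtering is needed before the results are used as training data. For instance, the two empty glosses were still translated and came back as `The President`. The sketch below shows one possible pre- and post-filter. The function names and the rules themselves are assumptions, not the notebook's method, and the first and third example rows are made up for illustration.

```python
def is_translatable(gloss: str) -> bool:
    """Skip empty or whitespace-only glosses before translation."""
    return bool(gloss and gloss.strip())

def is_useful_paraphrase(original: str, back_translated: str) -> bool:
    """Keep a back-translation only if it is non-empty and differs from the
    original after case and whitespace normalization."""
    a = " ".join(original.lower().split())
    b = " ".join(back_translated.lower().split())
    return bool(b) and a != b

# (original gloss, back-translation); rows 1 and 3 are hypothetical.
rows = [
    ("TELL TRUTH", "TELL THE TRUTH"),
    ("", "The President"),
    ("WORRY", "worry"),
]
kept = [(o, b) for o, b in rows if is_translatable(o) and is_useful_paraphrase(o, b)]
print(kept)  # [('TELL TRUTH', 'TELL THE TRUTH')]
```

These checks only remove trivially unusable pairs. They do not judge whether a paraphrase preserves the meaning of the gloss, so mistranslations such as `GOOD` becoming `LIKELIHOOD` would still need manual review or a similarity check.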


In [None]:
df_sample.to_csv('isl_back_translated_sample.csv', index=False)
print("Back-translated sample saved to isl_back_translated_sample.csv")

Back-translated sample saved to isl_back_translated_sample.csv


### Summary and Next Steps

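Back-translation is a simple method for generating **paraphrased gloss sentences**, which can increase data diversity for downstream sign language translation tasks. A larger and more varied dataset can improve the **robustness and generalization** of sign language translation models, making them more capable of handling diverse linguistic expressions. In the 10-row sample above, however, several uppercase glosses came back nearly unchanged or mistranslated (for example, `GOOD` became `LIKELIHOOD`), and empty glosses produced `The President`, so the output should be reviewed and filtered before it is added to the training data.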

#### Potential Next Steps

The augmented data generated through back-translation can be utilized in several ways to advance sign language translation research:

*   **Train and Fine-Tune Models:** The augmented dataset can be used to train new sign language translation models from scratch or to fine-tune existing models. This may improve translation accuracy and fluency, although the size of any gain has to be measured.
*   **Evaluate Model Performance:** It is important to systematically evaluate the impact of data augmentation on model performance. This can be done by comparing a model trained on the original dataset with one trained on the augmented dataset, using standard evaluation metrics.
*   **Explore Other Augmentation Techniques:** Back-translation is just one of many data augmentation techniques. Other methods, such as synonym replacement, random insertion, or deletion, could also be explored in combination with back-translation to further increase data diversity.
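As an illustration of the last point, token-level random deletion can be sketched in a few lines. This is a minimal sketch under stated assumptions: `random_deletion`, the deletion probability `p`, and the optional `rng` argument are hypothetical choices, not something this notebook defines or evaluates.

```python
import random
from typing import List, Optional

def random_deletion(tokens: List[str], p: float = 0.2,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Drop each token independently with probability `p`."""
    rng = rng or random.Random()
    if len(tokens) <= 1:
        # Nothing sensible to delete from a one-word gloss.
        return list(tokens)
    kept = [token for token in tokens if rng.random() >= p]
    # Never return an empty sentence: fall back to one randomly chosen token.
    return kept if kept else [rng.choice(tokens)]

gloss = "NEED MEDICINE TAKE ONE".split()
print(random_deletion(gloss, p=0.3, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the augmentation reproducible, which helps when comparing models trained with and without the augmented data.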

In [None]:
import pandas as pd
from transformers import MarianMTModel, MarianTokenizer
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/IETGenAI-SLT/Chapter 4/isl_train_meta_cleaned.csv')
sentences = df['cleaned_gloss'].tolist()

# Display the head of the cleaned_gloss column
display(df[['cleaned_gloss']].head())

# Load both translation models once, so they are not reloaded for every sentence
src_model_name = 'Helsinki-NLP/opus-mt-en-de'
tgt_model_name = 'Helsinki-NLP/opus-mt-de-en'

src_tokenizer = MarianTokenizer.from_pretrained(src_model_name)
src_model = MarianMTModel.from_pretrained(src_model_name)

tgt_tokenizer = MarianTokenizer.from_pretrained(tgt_model_name)
tgt_model = MarianMTModel.from_pretrained(tgt_model_name)

# Define translation function
def translate(text, tokenizer, model):
    # Use the tokenizer's __call__ method; prepare_seq2seq_batch is deprecated
    batch = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    tgt_text = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
    return tgt_text

# Define back-translation function (English -> German -> English)
def back_translate(text):
    german = translate(text, src_tokenizer, src_model)
    back_translated = translate(german, tgt_tokenizer, tgt_model)
    return german, back_translated

# Sample the dataframe and apply back-translation
df_sample = df.sample(10, random_state=42).copy()
df_sample['cleaned_gloss'] = df_sample['cleaned_gloss'].fillna('')
df_sample[['german', 'back_translated']] = df_sample['cleaned_gloss'].apply(lambda x: pd.Series(back_translate(x)))

# Display the results
display(df_sample[['cleaned_gloss', 'german', 'back_translated']])

# Save the back-translated sample to a CSV file
df_sample.to_csv('isl_back_translated_sample.csv', index=False)
print("Back-translated sample saved to isl_back_translated_sample.csv")