## Shubham Shekhar Jha
## CS6320 - NLP - HW5
## Text Translation (English <-> Hindi)

Check CUDA availability and use device

In [1]:
import torch
import warnings

# Ignore warnings to declutter the output
warnings.filterwarnings('ignore')

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

Load the test dataset for pre-trained model evaluation

In [2]:
from datasets import load_dataset

en_hi_dataset = load_dataset("cfilt/iitb-english-hindi")
print(en_hi_dataset)

# Separate instances into different lists, because they'll be tokenized & processed separately
en_test = []
hi_test = []
for translation_pair in en_hi_dataset["test"]["translation"]:
    en_test.append(translation_pair["en"])
    hi_test.append(translation_pair["hi"])

en_train = []
hi_train = []
max_instances = 20000
# Choose only 20k training instances to avoid kernel crashes
for i, translation_pair in enumerate(en_hi_dataset["train"]["translation"]):
    if i >= max_instances:
        break
    en_train.append(translation_pair["en"])
    hi_train.append(translation_pair["hi"])

en_validation = []
hi_validation = []
for translation_pair in en_hi_dataset["validation"]["translation"]:
    en_validation.append(translation_pair["en"])
    hi_validation.append(translation_pair["hi"])

print(len(en_train))
print(len(en_validation))
print(len(en_test))

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 1659083
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
})
20000
520
2507


## English to Hindi Translation 

Load English to Hindi Pre-trained translation model

In [6]:
from transformers import MarianTokenizer, MarianMTModel
from accelerate import init_empty_weights

init_empty_weights()
en_hi_model_name = "Helsinki-NLP/opus-mt-en-hi"
en_hi_tokenizer = MarianTokenizer.from_pretrained(en_hi_model_name)
en_hi_model = MarianMTModel.from_pretrained(en_hi_model_name).to(device)

Test the Pre-trained model for English to Hindi translation

In [8]:
batch_size = 4

tokenized_input = en_hi_tokenizer(
    en_test, return_tensors="pt", padding=True, truncation=True
)
input_ids = tokenized_input["input_ids"].to(device)
attention_mask = tokenized_input["attention_mask"].to(device)

# Split the input tensors into batches to handle VRAM and memory issues
input_ids_batches = input_ids.split(batch_size)
attention_mask_batches = attention_mask.split(batch_size)

output_ids = []
for input_ids_batch, attention_mask_batch in zip(
    input_ids_batches, attention_mask_batches
):
    output = en_hi_model.generate(
        input_ids_batch, attention_mask=attention_mask_batch, use_cache=False
    )
    output_ids.extend(output)

# Decode the output IDs to get the text sequences
output_texts = [
    en_hi_tokenizer.decode(output_id, skip_special_tokens=True)
    for output_id in output_ids
]

In [9]:
import sacrebleu

# evaluate the model through the BLEU score
total_bleu_score = 0.0
for i in range(len(output_texts)):
    bleu = sacrebleu.sentence_bleu(output_texts[i], [hi_test[i]]).score
    total_bleu_score += bleu

average_bleu_score = total_bleu_score / len(output_texts)
print(f"Average BLEU score: {average_bleu_score}")

Average BLEU score: 9.856293456579714


## Hindi to English Translation

Load Hindi to English Pre-trained translation model

In [3]:
from transformers import MarianTokenizer, MarianMTModel
from accelerate import init_empty_weights

init_empty_weights()

hi_en_model_name = "Helsinki-NLP/opus-mt-hi-en"
hi_en_tokenizer = MarianTokenizer.from_pretrained(hi_en_model_name)
hi_en_model = MarianMTModel.from_pretrained(hi_en_model_name).to(device)

Test the Pre-trained model for Hindi to English translation

In [4]:
batch_size = 4

tokenized_input = hi_en_tokenizer(
    hi_test, return_tensors="pt", padding=True, truncation=True
)
input_ids = tokenized_input["input_ids"].to(device)
attention_mask = tokenized_input["attention_mask"].to(device)

# Split the input tensors into batches
input_ids_batches = input_ids.split(batch_size)
attention_mask_batches = attention_mask.split(batch_size)

output_ids = []
for input_ids_batch, attention_mask_batch in zip(
    input_ids_batches, attention_mask_batches
):
    # Generate the output for the current batch
    output = hi_en_model.generate(
        input_ids_batch, attention_mask=attention_mask_batch, use_cache=False
    )
    output_ids.extend(output)

# Decode the output IDs to get the text sequences
output_texts = [
    hi_en_tokenizer.decode(output_id, skip_special_tokens=True)
    for output_id in output_ids
]

In [5]:
import sacrebleu

# evaluate the model through the BLEU score
total_bleu_score = 0.0
for i in range(len(output_texts)):
    bleu = sacrebleu.sentence_bleu(output_texts[i], [en_test[i]]).score
    total_bleu_score += bleu

average_bleu_score = total_bleu_score / len(output_texts)
print(f"Average BLEU score: {average_bleu_score}")

Average BLEU score: 13.266520410866555


## Helper Functions for Fine-tuning the model

In [3]:
from datasets import Dataset
import pandas as pd


def get_finetuning_dataset(src_texts, tgt_texts, src_lang, tgt_lang, tokenizer):
    data = []
    for src_text, tgt_text in zip(src_texts, tgt_texts):
        model_inputs = tokenizer(
            src_text, max_length=128, truncation=True, padding=True, return_tensors="pt"
        )
        model_inputs["translation"] = {src_lang: src_text, tgt_lang: tgt_text}

        with tokenizer.as_target_tokenizer():
            labels = tokenizer(tgt_text, max_length=128, truncation=True, padding=True)
            model_inputs["labels"] = labels["input_ids"]

        if model_inputs["attention_mask"] is None:
            model_inputs["attention_mask"] = torch.tensor(
                [0] * len(model_inputs["input_ids"])
            )
        else:
            model_inputs["attention_mask"] = model_inputs["attention_mask"].squeeze()

        model_inputs["input_ids"] = model_inputs["input_ids"].squeeze()
        model_inputs["attention_mask"] = model_inputs["attention_mask"].tolist()

        data.append(model_inputs)

    return Dataset.from_dict(pd.DataFrame(data))

In [4]:
import numpy as np
import sacrebleu


def compute_metrics(tokenizer, eval_preds):
    preds, labels = eval_preds

    if isinstance(preds, tuple):
        preds = preds[0]
    elif isinstance(preds, np.ndarray):
        preds = preds.tolist()
    elif isinstance(preds, list) and not isinstance(preds[0], list):
        preds = [preds]

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    decoded_preds = [str(pred) for pred in decoded_preds]
    decoded_labels = [str(label) for label in decoded_labels]

    total_bleu_score = 0.0
    for i in range(len(decoded_preds)):
        bleu = sacrebleu.sentence_bleu(decoded_preds[i], [decoded_labels[i]]).score
        total_bleu_score += bleu

    average_bleu_score = total_bleu_score / len(decoded_preds)
    result = {"bleu": average_bleu_score}

    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
    ]

    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}

    return result

## Fine-tuning English to Hindi Model

Fine-tune the English to Hindi Model and use the best version on test dataset

In [5]:
from transformers import MarianTokenizer

en_hi_model_name = "Helsinki-NLP/opus-mt-en-hi"
en_hi_tokenizer = MarianTokenizer.from_pretrained(en_hi_model_name)

train_dataset = get_finetuning_dataset(en_train, hi_train, "en", "hi", en_hi_tokenizer)
validation_dataset = get_finetuning_dataset(
    en_validation, hi_validation, "en", "hi", en_hi_tokenizer
)

train_dataset

Dataset({
    features: ['attention_mask', 'input_ids', 'labels', 'translation'],
    num_rows: 20000
})

In [6]:
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    MarianMTModel,
)
from accelerate import init_empty_weights

init_empty_weights()
en_hi_model = MarianMTModel.from_pretrained(en_hi_model_name).to(device)
model_args = Seq2SeqTrainingArguments(
    "helsinki-nlp-finetuned-en-hi",
    evaluation_strategy="epoch",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.02,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
)

data_collator = DataCollatorForSeq2Seq(en_hi_tokenizer, model=en_hi_model)

trainer = Seq2SeqTrainer(
    en_hi_model,
    model_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    data_collator=data_collator,
    tokenizer=en_hi_tokenizer,
    compute_metrics=lambda eval_preds: compute_metrics(en_hi_tokenizer, eval_preds),
)

trainer.train()

2024-04-19 01:12:45.498832: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.9615,8.818621,1.0349,1.0
2,0.6217,9.512757,1.734,1.0
3,0.441,11.311492,1.3778,1.0
4,0.3433,11.372407,1.4074,1.0
5,0.288,11.708949,1.2765,1.0
6,0.2286,11.394603,1.2261,1.0
7,0.1628,11.853123,0.904,1.0
8,0.1111,11.68035,1.2899,1.0
9,0.0892,11.7592,1.0742,1.0
10,0.0794,11.511595,1.241,1.0


Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[61949]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[61949]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[61949]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[61949]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[61949]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[61949]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[61949]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[61949]], 'forced_eos_token_id': 0}


TrainOutput(global_step=50000, training_loss=0.3761871636581421, metrics={'train_runtime': 4887.3752, 'train_samples_per_second': 40.922, 'train_steps_per_second': 10.23, 'total_flos': 518564474781696.0, 'train_loss': 0.3761871636581421, 'epoch': 10.0})

In [7]:
test_dataset = get_finetuning_dataset(en_test, hi_test, "en", "hi", en_hi_tokenizer)
test_results = trainer.predict(test_dataset)

print("BLEU score on test dataset after Fine-tuning:", test_results.metrics["test_bleu"])

BLEU score on test dataset after Fine-tuning: 1.1926


## Fine-tuning Hindi to English Model

Fine-tune the Hindi to English Model and use the best version on test dataset

In [5]:
from transformers import MarianTokenizer

hi_en_model_name = "Helsinki-NLP/opus-mt-hi-en"
hi_en_tokenizer = MarianTokenizer.from_pretrained(hi_en_model_name)

train_dataset = get_finetuning_dataset(hi_train, en_train, "hi", "en", hi_en_tokenizer)
validation_dataset = get_finetuning_dataset(
    hi_validation, en_validation, "hi", "en", hi_en_tokenizer
)

train_dataset

Dataset({
    features: ['attention_mask', 'input_ids', 'labels', 'translation'],
    num_rows: 20000
})

In [8]:
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    MarianMTModel,
)
from accelerate import init_empty_weights

init_empty_weights()
hi_en_model = MarianMTModel.from_pretrained(hi_en_model_name).to(device)
model_args = Seq2SeqTrainingArguments(
    "helsinki-nlp-finetuned-hi-en",
    evaluation_strategy="epoch",
    learning_rate=3e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.03,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
)

data_collator = DataCollatorForSeq2Seq(hi_en_tokenizer, model=hi_en_model)

trainer = Seq2SeqTrainer(
    hi_en_model,
    model_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    data_collator=data_collator,
    tokenizer=hi_en_tokenizer,
    compute_metrics=lambda eval_preds: compute_metrics(hi_en_tokenizer, eval_preds),
)

trainer.train()

Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.6966,9.938084,2.0495,1.0
2,1.0195,10.921923,2.0672,1.0
3,0.7118,11.066596,2.2464,1.0
4,0.587,11.605906,2.425,1.0
5,0.4814,12.410396,2.1429,1.0
6,0.3661,12.381299,2.177,1.0
7,0.2821,13.130521,2.3236,1.0
8,0.2029,12.842032,2.2736,1.0
9,0.1816,12.837738,2.1727,1.0
10,0.1347,12.800934,2.1955,1.0


Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[61126]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[61126]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[61126]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[61126]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[61126]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[61126]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[61126]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[61126]], 'forced_eos_token_id': 0}


TrainOutput(global_step=50000, training_loss=0.6559607733917237, metrics={'train_runtime': 3568.1088, 'train_samples_per_second': 56.052, 'train_steps_per_second': 14.013, 'total_flos': 601887715098624.0, 'train_loss': 0.6559607733917237, 'epoch': 10.0})

In [11]:
test_dataset = get_finetuning_dataset(hi_test, en_test, "hi", "en", hi_en_tokenizer)
test_results = trainer.predict(test_dataset)

print("BLEU score on test dataset after Fine-tuning:", test_results.metrics["test_bleu"])

BLEU score on test dataset after Fine-tuning: 1.8749


## Analysis

I chose the translation task and my dataset is [IIT Bombay Eng-Hin dataset on HuggingFace](https://huggingface.co/datasets/cfilt/iitb-english-hindi). I have performed English to Hindi and Hindi to English translation in this notebook.

Firstly, I used the [Helsinki-NLP/opus-mt-en-hi](https://huggingface.co/Helsinki-NLP/opus-mt-en-hi) pretrained model for **English to Hindi** translation and tested it on the above test dataset. The pretrained model performs reasonably well. I evalauted the pretrained model using the standard metric for translation tasks, i.e the BLEU score. The pretrained model has a BLEU score of 9.86 which is close to 10, although the score is poor on this test dataset, I would consider this score to be reasonbale because the pretrained model benchmarks have scores 6.9, 9.9, 16.1 on newsdev2014.eng.hin, newstest2014-hien.eng.hin, Tatoeba-test.eng.hin datasets respectively.

Secondly, I used the [Helsinki-NLP/opus-mt-hi-en](https://huggingface.co/Helsinki-NLP/opus-mt-hi-en) pretrained model for **Hindi to English** translation and tested it on the above test dataset. This pretrained model performs better as it resulted in a BLEU score of 13.2. This score is also reasonable because the pretrained model benchmarks have scores 9.1, 13.6, 40.4 on newsdev2014.hi.en, newstest2014-hien.hi.en, Tatoeba.hi.en datasets respectively.

Then, I finetuned both these models by changing the hyperparameters and training the model on a subset of my train dataset (20,000 instances). Both these models were trained for 10 epochs, and the best version was chosen as the final model. We then test both these models with my tet dataset. Both these models perform fairly poorly. Both the finetuned models for Eng-Hin and Hin-Eng translation have a score of 1.2 and 1.8 respectively.

We could maybe enhance the models' performance by training for more epoch and searching across multiple values for hyperparameters, but it took a very long time to train each of these models. Also, we only have 20k instances for training because bigger train datasets were leading to kernel crashes. Although, the train dataset contains 1.6 million sentence pairs, I tried training with 50,000, 100,000 but it lead to kernel crash everytime. So, if we had the resources to train on the full dataset, we have gotten a better performing model.

Reference: [Medium article](https://medium.com/@notsokarda/fine-tuning-a-transformer-model-for-neural-machine-translation-c604a24d3376). Some of the code is from this medium article.