<a href="https://colab.research.google.com/github/ShaanK2408/CS421_Assignment2/blob/main/translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to install them.

In [2]:
!pip install transformers datasets evaluate sacrebleu torchtext



In [3]:
from tqdm.auto import tqdm

## Q1: Dataset Preparation (5 points)

In [4]:
from datasets import load_dataset

We use the ```load_dataset()``` function to download the dataset. Replace the dummy arguments to download the wmt14 dataset for fr-en translation as provided here: https://huggingface.co/datasets/wmt/wmt14

In [5]:
dataset = load_dataset("wmt14", "fr-en", split='train[:15000]')
dataset

Dataset({
    features: ['translation'],
    num_rows: 15000
})

Now, we split the dataset into training and testing splits. This is done using the ```train_test_split``` function. Replace the dummy arguments with appropriate parameters.

In [6]:
split_datasets = dataset.train_test_split(train_size=0.8, seed=42)
split_datasets

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 12000
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 3000
    })
})

Define the test dataset as follows:

In [7]:
test_dataset = split_datasets["test"]
test_dataset

Dataset({
    features: ['translation'],
    num_rows: 3000
})

Now, follow the same process to split the train dataset into training and validation splits.

In [8]:
split_to_val = split_datasets["train"].train_test_split(train_size=0.875, seed=42)
train_dataset = split_to_val["train"]
eval_dataset = split_to_val["test"]
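The two fractional splits compose, so it is worth checking the resulting sizes by hand. The sketch below only reproduces the arithmetic; `split_sizes` is a name made up for illustration and is not part of 🤗 Datasets, whose exact rounding rules may differ for sizes that do not divide evenly.

```python
def split_sizes(n_rows, train_fraction):
    """Return (train_rows, held_out_rows) for a fractional split of n_rows."""
    n_train = round(n_rows * train_fraction)
    return n_train, n_rows - n_train

# 15,000 rows -> 80% train+validation, 20% test
train_val_rows, test_rows = split_sizes(15000, 0.8)
# 12,000 rows -> 87.5% train, 12.5% validation
train_rows, val_rows = split_sizes(train_val_rows, 0.875)

print(train_rows, val_rows, test_rows)  # 10500 1500 3000
```

In other words, the overall proportions are 70% training, 10% validation, and 20% test.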

## Q2 Prepare for training RNNs (10)
In this part, you are required to define the tokenizers for English and French, tokenize the data, and define the dataloaders.

Choose and initialize the tokenizer

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased") # CHOOSE AN APPROPRIATE MULTILINGUAL MODEL such as https://huggingface.co/google-bert/bert-base-multilingual-cased

You will need to create a pytorch dataset to process the tokens in the required format. Complete the implementation of the dataset.

In [10]:
from torch.utils.data import Dataset

class TranslationDataset(Dataset):
    def __init__(self, dataset, input_size, output_size):
        source_texts = [text["translation"]["fr"] for text in dataset]
        target_texts = [text["translation"]["en"] for text in dataset]
        self.source_sentences = tokenizer(source_texts, padding='max_length', truncation=True, return_tensors="pt")["input_ids"]
        self.target_sentences = tokenizer(target_texts, padding='max_length', truncation=True, return_tensors="pt")["input_ids"]
        self.input_size = input_size
        self.output_size = output_size

    def __len__(self):
        return len(self.source_sentences)

    def __getitem__(self, idx):
        return self.source_sentences[idx], self.target_sentences[idx]

Get the vocab size from the tokenizer

In [12]:
vocab_size = tokenizer.vocab_size # This size is used somewhere in the model, think.

Initialize the datasets

In [13]:
train_dataset_rnn = TranslationDataset(train_dataset, vocab_size, vocab_size)
eval_dataset_rnn = TranslationDataset(eval_dataset, vocab_size, vocab_size)
test_dataset_rnn = TranslationDataset(test_dataset, vocab_size, vocab_size)

Initialize and define the dataloaders

In [14]:
#Instantiate the DataLoaders
from torch.utils.data import DataLoader
BATCH_SIZE = 4
train_dataloader = DataLoader(train_dataset_rnn, batch_size=BATCH_SIZE, shuffle=True)
eval_dataloader = DataLoader(eval_dataset_rnn, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(test_dataset_rnn, batch_size=BATCH_SIZE)
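`len(train_dataloader)` is used later to compute the total number of training steps, so it helps to know what it will be. With the default `drop_last=False`, a `DataLoader` yields one batch per `batch_size` examples plus a final partial batch if needed. The helper below is a hypothetical stand-in that only does this arithmetic, using the row counts implied by the splits above.

```python
import math

def batches_per_epoch(n_examples, batch_size, drop_last=False):
    """Number of batches produced from n_examples at the given batch size."""
    if drop_last:
        return n_examples // batch_size
    return math.ceil(n_examples / batch_size)

train_batches = batches_per_epoch(10500, 4)
eval_batches = batches_per_epoch(1500, 4)
test_batches = batches_per_epoch(3000, 4)
print(train_batches, eval_batches, test_batches)  # 2625 375 750
```

Five epochs over 2625 training batches give 13125 optimizer steps in total.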

## Q3: Implementing RNNs (10)
Define the RNN model as an encoder-decoder RNN for the task of translation in the cell below. You may refer to: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

In [15]:
import torch
import torch.nn as nn

In [16]:
class Seq2SeqRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()

        # Embedding layer and RNN for source language (encoder)
        self.encoder_embedding = nn.Embedding(input_size, hidden_size)
        self.encoder_rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)

        # Embedding layer and RNN for target language (decoder)
        self.decoder_embedding = nn.Embedding(output_size, hidden_size)
        self.decoder_rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)

        # Fully connected layer to map decoder hidden states to target vocab
        self.fc_out = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        source_tokens, target_tokens = x

        # Embeds source tokens and passes it through encoder RNN
        embedded_source = self.encoder_embedding(source_tokens)
        _, hidden = self.encoder_rnn(embedded_source)

        # Embeds target tokens and passes it through decoder RNN
        embedded_target = self.decoder_embedding(target_tokens)
        output, _ = self.decoder_rnn(embedded_target, hidden)

        output = self.fc_out(output)
        return output

In [17]:
model = Seq2SeqRNN(input_size = vocab_size, hidden_size= 64, output_size= vocab_size)
model

Seq2SeqRNN(
  (encoder_embedding): Embedding(119547, 64)
  (encoder_rnn): RNN(64, 64, batch_first=True)
  (decoder_embedding): Embedding(119547, 64)
  (decoder_rnn): RNN(64, 64, batch_first=True)
  (fc_out): Linear(in_features=64, out_features=119547, bias=True)
)
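The decoder in `forward` is driven by the target tokens themselves (teacher forcing). In the usual setup, the sequence fed to the decoder and the sequence it is scored against are offset by one position, so that each step predicts the next token rather than the one it was just given. A minimal sketch on a plain Python list, with token ids chosen only for illustration:

```python
def teacher_forcing_pair(target_ids):
    """Split one target sequence into (decoder_input, labels), offset by one."""
    decoder_input = target_ids[:-1]  # what the decoder reads at each step
    labels = target_ids[1:]          # what it should predict at each step
    return decoder_input, labels

# Illustrative ids: 101 = start token, 102 = end token, 0 = padding
target_ids = [101, 7, 8, 9, 102, 0, 0]
decoder_input, labels = teacher_forcing_pair(target_ids)
print(decoder_input)  # [101, 7, 8, 9, 102, 0]
print(labels)         # [7, 8, 9, 102, 0, 0]
```

Without the offset, the decoder can reach a low loss simply by copying its input, which says nothing about translation quality.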

## Q4: Training RNNs (15)
In this question, you will define the hyperparameters, loss and optimizer for training. You will then implement a custom training loop.

In [18]:
if torch.cuda.is_available():
    model = model.cuda()

Define the optimizer and the loss function

In [19]:
from torch.optim import Adam
from torch.nn import CrossEntropyLoss

num_train_epochs = 5
num_training_steps = num_train_epochs * len(train_dataloader)
criterion = CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
optimizer = Adam(model.parameters(), lr=0.001)
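Passing `ignore_index=tokenizer.pad_token_id` means padded positions do not contribute to the loss. The pure-Python sketch below illustrates that idea on toy numbers; it is not PyTorch's implementation, and the function name and values are assumptions made for the example.

```python
import math

def masked_cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target != ignore_index."""
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[target]  # -log softmax(scores)[target]
        count += 1
    return total / count if count else 0.0

# Toy example: vocabulary of 3 tokens, pad id 0
logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [5.0, 0.0, 0.0]]
targets = [1, 2, 0]  # the last position is padding
loss = masked_cross_entropy(logits, targets, ignore_index=0)
print(round(loss, 4))
```

Changing the logits at the padded position leaves the result unchanged, because the average is taken over the two real tokens only.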

Write the training loop

In [None]:
from tqdm import tqdm
progress_bar = tqdm(total=num_training_steps, desc="Training Progress")

for epoch in range(num_train_epochs):
    # Training Phase
    model.train()
    total_loss = 0
    for batch_src, batch_tgt in train_dataloader:
        ## Complete the training loop

        # Move data to GPU
        if torch.cuda.is_available():
            batch_src = batch_src.cuda()
            batch_tgt = batch_tgt.cuda()

        # Reset gradients
        optimizer.zero_grad()

        # Forward pass with teacher forcing: the decoder reads tokens 0..T-2
        # and is trained to predict tokens 1..T-1 (the next token at each step)
        decoder_input = batch_tgt[:, :-1]
        labels = batch_tgt[:, 1:]
        output = model((batch_src, decoder_input))

        # Reshape for CrossEntropyLoss: [batch_size * seq_len, vocab_size]
        output_reshaped = output.reshape(-1, output.size(-1))
        target_reshaped = labels.reshape(-1)

        # Compute loss
        loss = criterion(output_reshaped, target_reshaped)

        # Backward propagation
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

        progress_bar.update(1)

    # Evaluation Phase
    model.eval()

    total_batches = 0
    total_eval_loss = 0.0 # Added in order to accumulate eval loss

    for batch_src, batch_tgt in eval_dataloader:
        ### Complete the evaluation phase

        with torch.no_grad():
            if torch.cuda.is_available():
                batch_src = batch_src.cuda()
                batch_tgt = batch_tgt.cuda()

            # Forward pass (same input/label offset as in training)
            decoder_input = batch_tgt[:, :-1]
            labels = batch_tgt[:, 1:]
            output = model((batch_src, decoder_input))
            output_reshaped = output.reshape(-1, output.size(-1))
            target_reshaped = labels.reshape(-1)

            # Compute loss
            eval_loss = criterion(output_reshaped, target_reshaped)
            total_eval_loss += eval_loss.item()
            total_batches += 1

    avg_loss = total_eval_loss / total_batches if total_batches > 0 else 0
    print(f"Epoch {epoch}: Average Eval Loss: {avg_loss:.4f}")

Training Progress:   2%|▏         | 199/13125 [00:17<18:32, 11.62it/s]

## Q5: Evaluating RNNs for Machine Translation (5)

Implement the calculation of BLEU-1,2,3,4 scores using the ```sacrebleu``` library for the test dataset.

In [None]:
from sacrebleu import corpus_bleu

model.eval()
bleu1, bleu2, bleu3, bleu4 = None, None, None, None

predictions = []
references = []

for batch_src, batch_tgt in test_dataloader:
    # Complete the testing loop
    # Move data to GPU if available
    if torch.cuda.is_available():
        batch_src = batch_src.cuda()
        batch_tgt = batch_tgt.cuda()

    # Generate predictions (teacher-forced: the decoder is fed the reference tokens, so scores are optimistic)
    with torch.no_grad():
        output = model((batch_src, batch_tgt))

    # Convert output tokens to predicted sentences
    predicted_tokens = output.argmax(dim=-1)  # Get the token with the highest probability
    predicted_sentences = tokenizer.batch_decode(predicted_tokens, skip_special_tokens=True)

    # Convert target tokens to reference sentences
    reference_sentences = tokenizer.batch_decode(batch_tgt, skip_special_tokens=True)

    # Append predictions and references to their respective lists
    predictions.extend(predicted_sentences)
    references.extend(reference_sentences)

# sacrebleu expects a list of reference streams, each as long as the list of predictions
bleu_score = corpus_bleu(predictions, [references])

# Extract the 1- to 4-gram precisions (in percent), reported here as BLEU-1 to BLEU-4
bleu1 = bleu_score.precisions[0]
bleu2 = bleu_score.precisions[1]
bleu3 = bleu_score.precisions[2]
bleu4 = bleu_score.precisions[3]

print("BLEU-1: ", bleu1)
print("BLEU-2: ", bleu2)
print("BLEU-3: ", bleu3)
print("BLEU-4: ", bleu4)
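The entries of `bleu_score.precisions` are clipped n-gram precisions, while the combined BLEU score is available as `bleu_score.score`. To make the quantity concrete, here is a minimal sketch of clipped n-gram precision for a single hypothesis and a single reference. It uses whitespace tokens and made-up sentences; sacrebleu applies its own tokenization and sums counts over the whole corpus, so this is an illustration of the idea rather than a reimplementation.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of one hypothesis against one reference."""
    hyp_counts = ngram_counts(hypothesis, n)
    ref_counts = ngram_counts(reference, n)
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0

hypothesis = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
precisions = [modified_precision(hypothesis, reference, n) for n in range(1, 5)]
print(precisions)
```

For this pair the precisions are 5/6, 3/5, 1/4, and 0 for n = 1 to 4, which shows how quickly higher-order matches drop after a single wrong word.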

Congratulations! You can now work with RNNs for the task of Machine Translation!

## Q6: Prepare for training transformers (10)

In this part, we cover the initial setup required before training a transformer, including data preprocessing and setting up data collators and loaders.

Ensure you have loaded the dataset!

In [1]:
dataset

NameError: name 'dataset' is not defined

We will begin by tokenizing the data. Based on your model selection, load the appropriate tokenizer. We are using models from AutoModelForSeq2SeqLM in this assignment. You can check out all the available models here: https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM

In [27]:
from transformers import AutoTokenizer

checkpoint = "t5-small" #Select a model of your choice
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

We need to tokenize both the inputs and the targets, so we use `preprocess_function()` to generate tokenized model inputs and labels. Ensure you use truncation and padding! The max length will be 128.

In [28]:
##Implement the preprocess function
def preprocess_function(examples):
    inputs = [example["fr"] for example in examples["translation"]]
    targets = [example["en"] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, truncation=True, padding="max_length", max_length=128) # Tokenize inputs and targets into model inputs (input_ids, attention_mask, labels)
    return model_inputs
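One detail worth checking: with `padding="max_length"`, the tokenized targets are padded too, so `labels` contains pad token ids. PyTorch's `CrossEntropyLoss` skips the label value -100 by default, and a common convention is to replace padding in the labels with that value so it does not count toward the loss. A minimal sketch on plain lists, where `mask_padding` and the sample ids are assumptions for illustration:

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]

# Hypothetical padded labels with pad id 0
labels = [[37, 12, 1, 0, 0], [5, 1, 0, 0, 0]]
masked = mask_padding(labels, pad_token_id=0)
print(masked)
# [[37, 12, 1, -100, -100], [5, 1, -100, -100, -100]]
```

The same transformation could be applied to `model_inputs["labels"]` inside the preprocessing step if padded positions should be excluded from the loss.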

In [29]:
tokenized_train_data = train_dataset.map(preprocess_function, batched=True)

In [31]:
tokenized_val_data = eval_dataset.map(preprocess_function, batched=True)

We remove the 'translation' column because it is not required for training. Columns other than those created by preprocess_function may also lead to errors during training, since the model cannot tell which inputs it should use.

In [32]:
tokenized_train_data = tokenized_train_data.remove_columns(train_dataset.column_names)
tokenized_val_data = tokenized_val_data.remove_columns(eval_dataset.column_names)

In [33]:
tokenized_train_data.set_format("torch")
tokenized_val_data.set_format("torch")

To construct batches of training data for model training, we require collators that set the properties for the batches and data loaders that generate the batches.

In [34]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer) # INSTANTIATE THE COLLATOR (the seq2seq model is only loaded in Q7, so `model` is not passed here)

In [35]:
#Instantiate the DataLoader for training and evaluation data

from torch.utils.data import DataLoader

train_dataloader = DataLoader(tokenized_train_data, batch_size=4, shuffle=True, collate_fn=data_collator)
eval_dataloader = DataLoader(tokenized_val_data, batch_size=4, collate_fn=data_collator)
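To make the collator's role concrete, the sketch below shows roughly what a seq2seq collator does when examples have not been padded in advance: it pads each field to the longest sequence in the batch and uses -100 for label padding. The function names are hypothetical and the code works on plain lists rather than tensors. In this notebook the examples are already padded to 128 tokens, so the real collator has little left to pad.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

def collate(features, pad_token_id, label_pad_id=-100):
    """Turn a list of per-example dicts into one dict of equally long lists."""
    input_ids = pad_batch([f["input_ids"] for f in features], pad_token_id)
    # 1 marks a real token, 0 marks padding added by this function
    attention_mask = pad_batch([[1] * len(f["input_ids"]) for f in features], 0)
    labels = pad_batch([f["labels"] for f in features], label_pad_id)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

features = [
    {"input_ids": [5, 6, 7], "labels": [9, 1]},
    {"input_ids": [8], "labels": [4, 3, 1]},
]
batch = collate(features, pad_token_id=0)
print(batch["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
print(batch["labels"])          # [[9, 1, -100], [4, 3, 1]]
```

Padding per batch instead of to a fixed maximum keeps batches of short sentences small, which is the main reason to leave padding to the collator.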

## Q7) Choosing & Loading the Model (5)

Choose a pre-trained transformer model that you will use for fine-tuning on the translation dataset.

In [36]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

## Q8) Training the Transformer Model

Now that we have our data tokenized and batched and the model chosen, we can begin training. To do so, we must set up the right hyperparameters and then implement the training loop to train our model!

For training, we require an optimizer and a scheduler to manage the learning rate during training. Let's set them up before our training loop.

In [37]:
from torch.optim import AdamW
from transformers import get_scheduler

num_train_epochs = 3
num_training_steps = num_train_epochs * len(train_dataloader)

optimizer = AdamW(model.parameters(), lr=0.00005)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
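The `"linear"` schedule scales the base learning rate by a factor that ramps up during warmup and then decays linearly to zero at the last training step. The function below is a plain-Python sketch of that rule for illustration; the name `linear_schedule_lr` and the step count of 1000 are assumptions, not values from this notebook.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

total_steps = 1000  # illustrative; the notebook uses num_train_epochs * len(train_dataloader)
for step in (0, 250, 500, 1000):
    print(step, linear_schedule_lr(step, 5e-5, 0, total_steps))
```

With `num_warmup_steps=0`, as configured above, training starts at the full learning rate of 5e-5, reaches half of it at the midpoint, and ends at zero.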

Finally, we are here!

In the training loop, you will run a forward pass, compute the loss, compute the gradients, and then update the weights. (Don't forget to set the gradients to zero!)

During the eval phase we simply do a forward pass and compute the loss!

In [None]:
from tqdm.auto import tqdm


progress_bar = tqdm(total=num_training_steps, desc="Training Progress")

for epoch in range(num_train_epochs):
    # Training Phase
    model.train()
    for batch in train_dataloader:

        batch = {k: v.to(model.device) for k, v in batch.items()}

        # Reset gradients
        optimizer.zero_grad()

        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()

        ## Complete the training loop

        progress_bar.update(1)

    # Evaluation Phase
    model.eval()
    total_loss = 0
    total_batches = 0

    for batch in eval_dataloader:
        batch = {k: v.to(model.device) for k, v in batch.items()}

        with torch.no_grad():
            outputs = model(**batch)
            loss = outputs.loss
            total_loss += loss.item()
            total_batches += 1

        ### Complete the evaluation phase

    avg_loss = total_loss / total_batches if total_batches > 0 else 0
    print(f"Epoch {epoch}: Average Eval Loss: {avg_loss:.4f}")

Congratulations on completing the training! Now don't forget to save your model and the tokenizer.

In [None]:
# Save model and tokenizer
output_dir = SET_OUTPUT_DIR
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

## Q9) Evaluating Transformer for Machine Translation

We will now test our trained model and analyze its performance using BLEU-1, 2, 3, 4 scores from the sacrebleu library. You will create a task evaluator for translation, load and process the test dataset, and compute the results on an existing trained model.

Below we load a model trained for French-to-English translation. You can read more about it here: https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-fr-en

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
checkpoint = "Helsinki-NLP/opus-mt-tc-big-fr-en"

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Initialize an evaluator for the translation task

In [None]:
## Load Evaluator for translation
from evaluate import evaluator
task_evaluator = evaluator("translation")

We need to give our test dataset separate input and target columns. To do this, we use split_translations to split the 'translation' column into two columns, 'en' and 'fr'.

In [None]:
#  Implement the split function
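def split_translations(example):
    # Each row stores both languages under the 'translation' key
    en_text = example["translation"]["en"]
    fr_text = example["translation"]["fr"]
    example['en'] = en_text
    example['fr'] = fr_text
    return example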

In [None]:
test_data = test_dataset.map(split_translations)

You can now go ahead and compute the results by appropriately setting up the task_evaluator.compute()

In [None]:
results = task_evaluator.compute(
    model_or_pipeline=model,
    tokenizer=tokenizer,  # needed because a model object, not a pipeline, is passed
    data=test_data,
    metric="sacrebleu",
    input_column="fr",
    label_column="en",
)

In [None]:
print(results)

## Q10) Inferencing on Transformers

Let's check how good this trained model's translation skills are. You can try it with a few French sentences and see how well it translates.

To do so, we will set up a pipeline using the existing trained model.


Loading the tokenizer and model for the pipeline

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
checkpoint = "Helsinki-NLP/opus-mt-tc-big-fr-en"

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Setup the pipeline for translation using your model and tokenizer. You can read about pipelines here: https://huggingface.co/docs/transformers/en/main_classes/pipelines

In [None]:
from transformers import pipeline
# Instantiate a pipeline for translation using the model and tokenizer
translator = pipeline("translation", model=model, tokenizer=tokenizer)

Translate the given sentence using the pipeline

In [None]:
input_text = "REPLACE WITH A SENTENCE IN FRENCH."
translation_result = translator(input_text)

In [None]:
print(translation_result)