If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Uncomment the following cell and run it.

In [None]:
# !pip install transformers datasets evaluate sacrebleu torchtext

In [None]:
from tqdm.auto import tqdm

## Q1: Dataset Preparation (5 points)

In [None]:
from datasets import load_dataset

We use the ```load_dataset()``` function to download the dataset. Replace the placeholder arguments to download the WMT14 dataset for French-English (fr-en) translation, as described here: https://huggingface.co/datasets/wmt/wmt14

In [None]:
dataset = load_dataset(REPLACE_WITH_DATASET_NAME, REPLACE_WITH_LANGUAGE_PAIR, split='train[:15000]')
dataset

Now, we split the dataset into training and testing splits. This is done using the ```train_test_split``` function. Replace the dummy arguments with appropriate parameters.

In [None]:
split_datasets = dataset.train_test_split(train_size=REPLACE_WITH_TRAIN_SIZE, seed=REPLACE_WITH_SEED)
split_datasets


Define the test dataset as follows:

In [None]:
test_dataset = split_datasets["test"]
test_dataset

Now, follow the same process to split the train dataset into training and validation splits.

In [None]:
split_to_val = YOUR_CODE_HERE
train_dataset = YOUR_CODE_HERE
eval_dataset = YOUR_CODE_HERE

## Q2: Prepare for training RNNs (10)
In this part, you are required to define the tokenizers for English and French, tokenize the data, and define the dataloaders.

Choose and initialize the tokenizer

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(REPLACE_WITH_MODEL_NAME) # CHOOSE AN APPROPRIATE MULTILINGUAL MODEL such as https://huggingface.co/google-bert/bert-base-multilingual-cased

You will need to create a PyTorch dataset to process the tokens in the required format. Complete the implementation of the dataset.

In [None]:
from torch.utils.data import Dataset

class TranslationDataset(Dataset):
    def __init__(self, dataset, input_size, output_size):
        source_texts = [text["translation"][FILL_LANGUAGE] for text in dataset]
        target_texts = [text["translation"][FILL_LANGUAGE] for text in dataset]
        self.source_sentences = tokenizer(FILL, padding='max_length', truncation=True, return_tensors="pt")["input_ids"]
        self.target_sentences = tokenizer(FILL, padding='max_length', truncation=True, return_tensors="pt")["input_ids"]
        self.input_size = input_size
        self.output_size = output_size

    def __len__(self):
        return len(self.source_sentences)

    def __getitem__(self, idx):
        return self.source_sentences[idx], self.target_sentences[idx]
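
The ```padding='max_length'``` and ```truncation=True``` arguments make every tokenized sentence the same length, which is what lets ```return_tensors="pt"``` stack them into a single tensor. The following is a pure-Python sketch of that idea only; `pad_or_truncate` is a hypothetical helper, the pad id 0 is an assumption, and real tokenizers additionally keep special tokens (such as end markers) when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return a list of exactly max_length ids (hypothetical helper)."""
    clipped = token_ids[:max_length]                      # truncation
    return clipped + [pad_id] * (max_length - len(clipped))  # padding

short = pad_or_truncate([101, 7, 8, 102], max_length=6)
long = pad_or_truncate(list(range(10)), max_length=6)
print(short)  # [101, 7, 8, 102, 0, 0]
print(long)   # [0, 1, 2, 3, 4, 5]
```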

Get the vocab size from the tokenizer

In [None]:
vocab_size = tokenizer.vocab_size # This size is used somewhere in the model, think.

Initialize the datasets

In [None]:
train_dataset_rnn = TranslationDataset(DATASET_SPLIT, vocab_size, vocab_size)
eval_dataset_rnn = TranslationDataset(DATASET_SPLIT, vocab_size, vocab_size)
test_dataset_rnn = TranslationDataset(DATASET_SPLIT, vocab_size, vocab_size)

Initialize and define the dataloaders

In [None]:
#Instantiate the DataLoaders
from torch.utils.data import DataLoader
BATCH_SIZE = CHOSEN_BATCH_SIZE
train_dataloader = DataLoader(FILL, batch_size=BATCH_SIZE, shuffle=True)
eval_dataloader = DataLoader(FILL, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(FILL, batch_size=BATCH_SIZE)
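
To see what a ```DataLoader``` yields for a dataset that returns (source, target) pairs, here is a small sketch on random toy tensors. The sizes (10 pairs, 6 token ids each, batch size 4) and the `toy_*` names are assumptions for illustration; the real datasets come from ```TranslationDataset```.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Assumption: 10 toy sentence pairs, each already padded to 6 token ids.
toy_src = torch.randint(0, 50, (10, 6))
toy_tgt = torch.randint(0, 50, (10, 6))
toy_loader = DataLoader(TensorDataset(toy_src, toy_tgt), batch_size=4, shuffle=True)

# Each iteration yields one (source batch, target batch) tuple.
batch_shapes = [(src.shape, tgt.shape) for src, tgt in toy_loader]
for src_shape, tgt_shape in batch_shapes:
    print(tuple(src_shape), tuple(tgt_shape))
```

With 10 items and a batch size of 4, the last batch holds only the 2 leftover pairs, so code inside the loop should not assume a fixed batch dimension.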

## Q3: Implementing RNNs (10)
Define the RNN model as an encoder-decoder RNN for the task of translation in the cell below. You may refer to: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

In [None]:
import torch
import torch.nn as nn

In [None]:
class Seq2SeqRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # YOUR CODE HERE

    def forward(self, x):
        # YOUR CODE HERE
        return output
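
For orientation, the encoder-decoder idea can be sketched as below. This is one possible shape of such a model, not the expected solution: `TinySeq2Seq` is a hypothetical class, it uses GRU layers and teacher forcing as assumptions, and its `forward(src, tgt)` signature differs from the `forward(x)` skeleton above.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Hypothetical GRU encoder-decoder that decodes with teacher forcing."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.src_embedding = nn.Embedding(input_size, hidden_size)
        self.tgt_embedding = nn.Embedding(output_size, hidden_size)
        self.encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, src, tgt):
        # Encode the source; keep only the final hidden state as context.
        _, hidden = self.encoder(self.src_embedding(src))
        # Decode conditioned on that context, feeding the gold target tokens.
        decoded, _ = self.decoder(self.tgt_embedding(tgt), hidden)
        return self.out(decoded)  # (batch, tgt_len, output_size)

toy_model = TinySeq2Seq(input_size=50, hidden_size=16, output_size=50)
logits = toy_model(torch.randint(0, 50, (4, 6)), torch.randint(0, 50, (4, 5)))
print(logits.shape)
```

The output holds one score per vocabulary entry for every target position, which is why the vocabulary size appears as both the embedding size and the size of the final linear layer.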

In [None]:
model = Seq2SeqRNN(input_size = FILL, hidden_size= FILL, output_size= FILL)
model

## Q4: Training RNNs (15)
In this question, you will define the hyperparameters, loss and optimizer for training. You will then implement a custom training loop.

In [None]:
if torch.cuda.is_available():
    model = model.cuda()

Define the optimizer and the loss function

In [None]:
from torch.optim import IMPORT_OPTIMIZER
from torch.nn import IMPORT_LOSS_FUNCTION

num_train_epochs = NUM_EPOCHS
num_training_steps = num_train_epochs * len(train_dataloader)
criterion = YOUR_CODE_HERE  # your loss function
optimizer = YOUR_CODE_HERE  # your optimizer

Write the training loop

In [None]:
from tqdm import tqdm
progress_bar = tqdm(total=num_training_steps, desc="Training Progress")

for epoch in range(num_train_epochs):
    # Training Phase
    model.train()
    total_loss = 0
    for batch_src, batch_tgt in train_dataloader:
        ## Complete the training loop

        progress_bar.update(1)

    # Evaluation Phase
    model.eval()

    total_batches = 0

    for batch_src, batch_tgt in eval_dataloader:


        ### Complete the evaluation phase

    avg_loss = None
    print(f"Epoch {epoch}: Average Eval Loss: {avg_loss:.4f}")
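
The two phases of such a loop can be sketched on a toy model as follows. Everything here is an assumption made for illustration: the tiny per-token classifier standing in for the translation model, the vocabulary of 20 ids, the pad id 0 passed to `ignore_index`, and the Adam optimizer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Assumption: a toy per-token classifier stands in for the translation model.
toy_model = nn.Sequential(nn.Embedding(20, 8), nn.Linear(8, 20))
criterion = nn.CrossEntropyLoss(ignore_index=0)  # assumes pad id 0
optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2)

src = torch.randint(1, 20, (4, 5))
tgt = torch.randint(1, 20, (4, 5))

# One training step.
toy_model.train()
optimizer.zero_grad()                 # clear gradients from the previous step
logits = toy_model(src)               # (batch, seq_len, vocab)
loss = criterion(logits.reshape(-1, 20), tgt.reshape(-1))
loss.backward()                       # compute gradients
optimizer.step()                      # update the weights

# Evaluation: no gradients, average the loss over batches.
toy_model.eval()
total_loss, total_batches = 0.0, 0
with torch.no_grad():
    for _ in range(3):                # pretend there are three eval batches
        eval_logits = toy_model(src)
        total_loss += criterion(eval_logits.reshape(-1, 20), tgt.reshape(-1)).item()
        total_batches += 1
avg_loss = total_loss / total_batches
print(f"Average eval loss: {avg_loss:.4f}")
```

Flattening the logits to (batch * seq_len, vocab) and the targets to (batch * seq_len) is what lets a per-token cross-entropy loss be applied to whole sequences.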

## Q5: Evaluating RNNs for Machine Translation (5)

Implement the calculation of BLEU-1,2,3,4 scores using the ```sacrebleu``` library for the test dataset.

In [None]:
model.eval()
bleu1, bleu2, bleu3, bleu4 = None, None, None, None
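device = next(model.parameters()).device  # a plain nn.Module has no .device attribute
for batch_src, batch_tgt in test_dataloader:
    # The RNN dataloader yields (source, target) tuples, not dictionaries
    batch_src, batch_tgt = batch_src.to(device), batch_tgt.to(device)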
    # Complete the testing loop

print("BLEU-1: ", bleu1)
print("BLEU-2: ", bleu2)
print("BLEU-3: ", bleu3)
print("BLEU-4: ", bleu4)
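
As background on what the four scores measure: BLEU-n combines clipped n-gram precisions for orders 1 through n (via a geometric mean and a brevity penalty). The sketch below computes only those per-order precisions for a single tokenized sentence pair; `ngram_precision` is a hand-rolled, hypothetical helper and not ```sacrebleu```'s implementation, which also handles tokenization and corpus-level aggregation.

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision of one tokenized hypothesis against one reference."""
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram counts at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total

hyp_tokens = "the cat sat on the mat".split()
ref_tokens = "the cat is on the mat".split()
precisions = [ngram_precision(hyp_tokens, ref_tokens, n) for n in range(1, 5)]
print(precisions)
```

A single wrong word lowers the unigram precision only slightly but breaks every higher-order n-gram that spans it, which is why BLEU-4 is typically much lower than BLEU-1.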

Congratulations! You can now work with RNNs for the task of Machine Translation!

## Q6: Prepare for training transformers (10)

In this part, we cover the setup required before training a transformer, including data preprocessing and setting up data collators and data loaders.

Ensure you have loaded the dataset!

In [None]:
dataset

We will begin by tokenizing the data. Based on your model selection, load the appropriate tokenizer. We are using models from AutoModelForSeq2SeqLM in this assignment. You can check out all the available models here: https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM

In [None]:
from transformers import AutoTokenizer

checkpoint = "" #Select a model of your choice
tokenizer = AutoTokenizer.from_pretrained(REPLACE_WITH_CHECKPOINT)

We need to tokenize both the inputs and the targets, so we use ```preprocess_function()``` to generate tokenized model inputs and targets. Make sure you use truncation and padding, with a maximum length of 128.

In [None]:
##Implement the preprocess function
def preprocess_function(examples):
    inputs = [example[SET_RIGHT_LANG] for example in examples["translation"]]
    targets = [example[SET_RIGHT_LANG] for example in examples["translation"]]
    model_inputs = tokenizer() # Call the tokenizer to generate the model inputs
    return model_inputs

In [None]:
tokenized_train_data = train_dataset.map(preprocess_function, batched=True)

In [None]:
tokenized_val_data = eval_dataset.map(preprocess_function, batched=True)

We remove the 'translation' column because it is not needed for training. Keeping columns other than those created by ```preprocess_function``` can also cause errors during training, since the model may be passed inputs it does not expect.

In [None]:
tokenized_train_data = tokenized_train_data.remove_columns(train_dataset.column_names)
tokenized_val_data = tokenized_val_data.remove_columns(eval_dataset.column_names)

In [None]:
tokenized_train_data.set_format("torch")
tokenized_val_data.set_format("torch")

To construct batches of training data for model training, we require collators that set the properties for the batches and data loaders that generate the batches.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq() #INSTANTIATE THE COLLATOR
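
One job of a seq2seq collator is to pad the label sequences in a batch to a common length. ```DataCollatorForSeq2Seq``` pads labels with -100 by default, the value that PyTorch's cross-entropy loss ignores by default, so padded positions do not contribute to the loss. The sketch below shows just that padding step in plain Python; `pad_labels` is a hypothetical helper and the token ids are made up.

```python
def pad_labels(label_lists, pad_value=-100):
    """Right-pad every label sequence to the longest one in the batch."""
    longest = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (longest - len(labels)) for labels in label_lists]

batch_labels = pad_labels([[5, 6, 7, 1], [8, 1]])
print(batch_labels)  # [[5, 6, 7, 1], [8, 1, -100, -100]]
```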

In [None]:
#Instantiate the DataLoader for training and evaluation data

from torch.utils.data import DataLoader

train_dataloader = DataLoader(FILL, batch_size=32, shuffle=True)
eval_dataloader = DataLoader(FILL, batch_size=32)

## Q7: Choosing & Loading the Model (5)

Choose a pre-trained transformer model that you will use for fine-tuning on the translation dataset

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(REPLACE_WITH_CHECKPOINT)

## Q8: Training the Transformer Model

Now that our data is tokenized and batched and the model is chosen, we can begin training. To do so, we must set up the right hyperparameters and then implement the training loop to train our model!

For training, we require an optimizer and a scheduler to manage the learning rate during training. Let's set them up before the training loop.

In [None]:
from torch.optim import AdamW
from transformers import get_scheduler

num_train_epochs = NUM_EPOCHS
num_training_steps = NUM_STEPS

optimizer = SETUP_Adam_OPTIMIZER
lr_scheduler = SETUP_SCHEDULER
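
To make the scheduler's role concrete, here is a pure-Python sketch of a linear schedule with optional warmup, assuming that is the schedule type chosen (the notebook leaves the choice open). `linear_lr_factor` is a hypothetical helper, and the 3 epochs and 100 batches per epoch are made-up numbers; the total step count is the number of epochs times the number of batches per epoch.

```python
def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Ramp up from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 to 0 by the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Assumptions: 3 epochs over 100 batches per epoch, no warmup.
total_steps = 3 * 100
factors = [linear_lr_factor(s, 0, total_steps) for s in (0, 150, 300)]
print(factors)  # [1.0, 0.5, 0.0]
```

Because the decay depends on the total number of steps, the scheduler needs to be stepped once per batch, right after the optimizer step.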

Finally, we are here!

In the training phase of the loop, you will run a forward pass, compute the loss, compute the gradients, and then update the weights. (Don't forget to set the gradients to zero!)

During the eval phase we simply do a forward pass and compute the loss!

In [None]:
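import torch
from tqdm.auto import tqdm

# Run on the GPU when one is available; the loops below move each batch to this device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

progress_bar = tqdm(total=num_training_steps, desc="Training Progress")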

for epoch in range(num_train_epochs):
    # Training Phase
    model.train()
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        ## Complete the training loop

        progress_bar.update(1)

    # Evaluation Phase
    model.eval()
    total_loss = 0
    total_batches = 0

    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        ### Complete the evaluation phase

    avg_loss = None
    print(f"Epoch {epoch}: Average Eval Loss: {avg_loss:.4f}")

Congratulations on completing the training! Now don't forget to save your model and the tokenizer.

In [None]:
# Save model and tokenizer
output_dir = SET_OUTPUT_DIR
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

## Q9: Evaluating Transformer for Machine Translation

We will now test our trained model and analyze its performance using BLEU-1, 2, 3, 4 scores from the sacrebleu library. You will create a task evaluator for translation, load and process the test dataset, and compute the results on an existing trained model.

Below we load a model trained for french to english translation. You can read more about it here: https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-fr-en

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
checkpoint = "Helsinki-NLP/opus-mt-tc-big-fr-en"

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Initialize an evaluator for the translation task

In [None]:
## Load Evaluator for translation
from evaluate import evaluator
task_evaluator = None

We need to restructure the test dataset so that it has separate input and target columns. We will therefore use ```split_translations``` to split the 'translation' column into two columns, 'en' and 'fr'.

In [None]:
#  Implement the split function
def split_translations(example):
    en_text = example[FILL][FILL]
    fr_text = example[FILL][FILL]
    example['en'] = FILL
    example['fr'] = FILL
    return example

In [None]:
test_data = test_dataset.map(split_translations)

You can now go ahead and compute the results by appropriately setting up the task_evaluator.compute()

In [None]:
results = task_evaluator.compute(
    model_or_pipeline= MODEL,
    data= DATA,
    metric=METRIC,
    input_column=COLUMN,
    label_column=COLUMN,
)

In [None]:
print(results)

## Q10: Inferencing on Transformers

Let's check how good this trained model's translations are. You can try it on a few French sentences and see how well it translates them.

To do so, we will set up a pipeline using the existing trained model.


Loading the tokenizer and model for the pipeline

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
checkpoint = "Helsinki-NLP/opus-mt-tc-big-fr-en"

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Setup the pipeline for translation using your model and tokenizer. You can read about pipelines here: https://huggingface.co/docs/transformers/en/main_classes/pipelines

In [None]:
from transformers import pipeline
# Instantiate a pipeline for translation using the model and tokenizer.
# A separate name is used so the imported pipeline() function is not overwritten.
translator = None

Translate the given sentence using the pipeline

In [None]:
input_text = "REPLACE WITH A SENTENCE IN FRENCH."
translation_result = translator(REPLACE_WITH_TEXT_TO_TRANSLATE)

In [None]:
print(translation_result)