If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets, along with the other libraries listed in the cell below. Uncomment the following cell and run it.

In [1]:
# !pip install transformers datasets evaluate sacrebleu torchtext

In [2]:
from tqdm.auto import tqdm

## Q1: Dataset Preparation (5 points)

In [3]:
from datasets import load_dataset

We use the ```load_dataset()``` function to download the dataset. Replace the dummy arguments to download the wmt14 dataset for fr-en translation as provided here: https://huggingface.co/datasets/wmt/wmt14

In [4]:
dataset = load_dataset("wmt/wmt14",  "fr-en", split='train[:15000]')
dataset

Dataset({
    features: ['translation'],
    num_rows: 15000
})

Now, we split the dataset into training and testing splits. This is done using the ```train_test_split``` function. Replace the dummy arguments with appropriate parameters.

In [5]:
split_datasets = dataset.train_test_split(train_size=0.8, seed=42)
split_datasets


DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 12000
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 3000
    })
})

Define the test dataset as follows:

In [6]:
test_dataset = split_datasets["test"]
test_dataset

Dataset({
    features: ['translation'],
    num_rows: 3000
})

Now, follow the same process to split the train dataset into training and validation splits.

In [7]:
split_to_val = split_datasets["train"].train_test_split(train_size=0.9, seed=42)
train_dataset = split_to_val["train"]
eval_dataset = split_to_val["test"]
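
As a quick sanity check on the two splits above, the row counts can be worked out in plain Python. The helper `split_sizes` is hypothetical (it is not part of 🤗 Datasets) and only mirrors the 80/20 and 90/10 fractions used in the cells above.

```python
def split_sizes(n_rows, train_percent):
    """Return (train_rows, held_out_rows) for an integer train percentage."""
    train_rows = n_rows * train_percent // 100
    return train_rows, n_rows - train_rows

# 15,000 rows -> 80% train+validation, 20% test
train_val_rows, test_rows = split_sizes(15000, 80)
# the 80% portion -> 90% train, 10% validation
train_rows, val_rows = split_sizes(train_val_rows, 90)

print(train_rows, val_rows, test_rows)  # 10800 1200 3000
```

So the three datasets hold 10,800 training, 1,200 validation, and 3,000 test sentence pairs.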

## Q2: Prepare for training RNNs (10)
In this part, you are required to define the tokenizer for English and French, tokenize the data, and define the dataloaders.

Choose and initialize the tokenizer

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-cased") # CHOOSE AN APPROPRIATE MULTILINGUAL MODEL such as https://huggingface.co/google-bert/bert-base-multilingual-cased



You will need to create a PyTorch dataset to process the tokens in the required format. Complete the implementation of the dataset.

In [9]:
from torch.utils.data import Dataset

class TranslationDataset(Dataset):
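    def __init__(self, dataset, input_size, output_size):
        # French is the source language and English is the target (fr -> en)
        source_texts = [text["translation"]["fr"] for text in dataset]
        target_texts = [text["translation"]["en"] for text in dataset]
        # No max_length is passed, so padding='max_length' pads (and truncation cuts)
        # every sentence to the tokenizer's model_max_length
        self.source_sentences = tokenizer(source_texts, padding='max_length', truncation=True, return_tensors="pt")["input_ids"]
        self.target_sentences = tokenizer(target_texts, padding='max_length', truncation=True, return_tensors="pt")["input_ids"]
        # Vocabulary sizes are stored for reference; the dataset itself does not use them
        self.input_size = input_size
        self.output_size = output_size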

    def __len__(self):
        return len(self.source_sentences)

    def __getitem__(self, idx):
        return self.source_sentences[idx], self.target_sentences[idx]
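
To make the effect of `padding='max_length'` together with `truncation=True` concrete, here is a minimal pure-Python sketch of what happens to each list of token ids. The helper `pad_or_truncate` and the token ids are made up for illustration; in the dataset above the real work is done inside the tokenizer.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, or right-pad it with pad_id up to max_length."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

short = pad_or_truncate([101, 7592, 102], max_length=6)
long = pad_or_truncate([101, 5, 6, 7, 8, 9, 10, 102], max_length=6)

print(short)  # [101, 7592, 102, 0, 0, 0]
print(long)   # [101, 5, 6, 7, 8, 9]
```

Every sequence ends up with the same length, which is what allows the sentences to be stacked into one tensor. Real tokenizers also keep special tokens such as the closing `[SEP]` when truncating, which this sketch ignores.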

Initialize the datasets

Get the vocab size from the tokenizer

In [10]:
vocab_size = tokenizer.vocab_size # This size is used somewhere in the model, think.

In [11]:
train_dataset_rnn = TranslationDataset(train_dataset, vocab_size, vocab_size)
eval_dataset_rnn = TranslationDataset(eval_dataset, vocab_size, vocab_size)
test_dataset_rnn = TranslationDataset(test_dataset, vocab_size, vocab_size)

Initialize and define the dataloaders

In [12]:
#Instantiate the DataLoaders
from torch.utils.data import DataLoader
BATCH_SIZE = 8
train_dataloader = DataLoader(train_dataset_rnn, batch_size=BATCH_SIZE, shuffle=True)
eval_dataloader = DataLoader(eval_dataset_rnn, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(test_dataset_rnn, batch_size=BATCH_SIZE)
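
To see what such a DataLoader yields, the following sketch batches random stand-in token ids. The tensors, the sizes (20 pairs of length 6), and the name `toy_loader` are assumptions chosen for illustration, not the real WMT14 data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 20 fake (source, target) pairs, each already padded to 6 token ids
fake_sources = torch.randint(1, 100, (20, 6))
fake_targets = torch.randint(1, 100, (20, 6))
toy_loader = DataLoader(TensorDataset(fake_sources, fake_targets), batch_size=8)

# Each batch is a (source, target) pair of shape [batch_size, seq_len]
batch_shapes = [(src.shape, tgt.shape) for src, tgt in toy_loader]
print(len(toy_loader))   # 3 batches: 8 + 8 + 4 rows
print(batch_shapes[-1])  # the last batch holds only the 4 leftover rows
```

Because the dataset returns `(source, target)` tuples, each batch unpacks into two tensors, which is the form the training loop further down iterates over.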

## Q3: Implementing RNNs (10)
Define the RNN model as an encoder-decoder RNN for the task of translation in the cell below. You may refer to: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

In [13]:
import torch
import torch.nn as nn

In [14]:
class Seq2SeqRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        
        # Embedding layer for input tokens
        self.embedding = nn.Embedding(input_size, hidden_size)
        
        # Encoder RNN (using GRU for better performance than basic RNN)
        self.encoder_rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        
        # Decoder RNN
        self.decoder_rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        
        # Final linear layer to predict output tokens
        self.out = nn.Linear(hidden_size, output_size)
        
        # Dropout for regularization
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        # x shape: [batch_size, seq_len]
        batch_size = x.size(0)
        
        # Initialize hidden state
        hidden = self._init_hidden(batch_size, x.device)
        
        # Embedding
        embedded = self.dropout(self.embedding(x))  # [batch_size, seq_len, hidden_size]
        
        # Encoder
        _, hidden = self.encoder_rnn(embedded, hidden)
        
        # A typical decoder starts from the embedding of a start-of-sequence (<SOS>) token.
        # This implementation instead feeds a zero vector as the first decoder input.
        decoder_input = torch.zeros(batch_size, 1, self.hidden_size, device=x.device)
        
        # We'll collect outputs for each time step
        outputs = []
        
        # Get sequence length from input
        max_length = x.size(1)
        
        # Decoder loop - one step at a time
        for t in range(max_length):
            # Pass through decoder
            output, hidden = self.decoder_rnn(decoder_input, hidden)
            
            # Apply linear layer to get unnormalized scores (logits) over the vocabulary
            output = self.out(output)
            
            # Save output
            outputs.append(output)
            
            # Feed the embedding of the greedy (argmax) prediction back in as the next input;
            # teacher forcing would feed the actual target token here instead
            decoder_input = self.embedding(output.argmax(2))
        
        # Stack outputs into a single tensor [batch_size, seq_len, output_size]
        outputs = torch.cat(outputs, dim=1)
        
        return outputs
    
    def _init_hidden(self, batch_size, device):
        # Initialize hidden state with zeros
        return torch.zeros(1, batch_size, self.hidden_size, device=device)

In [15]:
model = Seq2SeqRNN(input_size=vocab_size, hidden_size=256, output_size=vocab_size)
model

Seq2SeqRNN(
  (embedding): Embedding(119547, 256)
  (encoder_rnn): GRU(256, 256, batch_first=True)
  (decoder_rnn): GRU(256, 256, batch_first=True)
  (out): Linear(in_features=256, out_features=119547, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
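
The decoder above always feeds its own prediction back in. For comparison, here is a minimal sketch of the teacher-forcing variant mentioned in the code comment, where the decoder reads the gold target tokens. The class name `TinyTeacherForcedSeq2Seq`, the toy sizes, and the random token ids are assumptions for illustration; this is not the model defined above.

```python
import torch
import torch.nn as nn

class TinyTeacherForcedSeq2Seq(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, src, tgt):
        # Encode the source; keep only the final hidden state
        _, hidden = self.encoder(self.embedding(src))
        # Teacher forcing: feed gold tokens tgt[:, :-1] in one call,
        # so position t is trained to predict tgt[:, t + 1]
        decoder_out, _ = self.decoder(self.embedding(tgt[:, :-1]), hidden)
        return self.out(decoder_out)  # [batch, tgt_len - 1, vocab_size]

torch.manual_seed(0)
toy = TinyTeacherForcedSeq2Seq(vocab_size=50, hidden_size=16)
src = torch.randint(1, 50, (4, 7))   # 4 source sentences of 7 tokens
tgt = torch.randint(1, 50, (4, 9))   # 4 target sentences of 9 tokens
logits = toy(src, tgt)
print(logits.shape)  # torch.Size([4, 8, 50])
```

With teacher forcing the whole target sequence passes through the decoder in a single call, so no Python loop is needed during training. The loss would then compare `logits` against `tgt[:, 1:]`.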

## Q4: Training RNNs (15)
In this question, you will define the hyperparameters, loss and optimizer for training. You will then implement a custom training loop.

In [16]:
if torch.cuda.is_available():
    model = model.cuda()

Define the optimizer and the loss function.

In [17]:
from torch.optim import Adam
from torch.nn import CrossEntropyLoss

num_train_epochs = 5
num_training_steps = num_train_epochs * len(train_dataloader)
# Ignore padded positions in the loss; for this BERT tokenizer the pad token id is 0
criterion = CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
optimizer = Adam(model.parameters(), lr=0.001)  # Standard learning rate for Adam
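
The effect of `ignore_index` can be checked on a toy example: positions whose target equals the pad id drop out of the average entirely. The logits and targets below are made-up values, and `PAD_ID = 0` is assumed to match the pad token id used above.

```python
import torch
from torch.nn import CrossEntropyLoss

torch.manual_seed(0)
PAD_ID = 0
toy_logits = torch.randn(5, 10)                       # 5 positions, vocabulary of 10
toy_targets = torch.tensor([3, 7, 1, PAD_ID, PAD_ID])  # last two positions are padding

# Loss that skips padded positions
masked_loss = CrossEntropyLoss(ignore_index=PAD_ID)(toy_logits, toy_targets)
# Same loss computed by hand on the three real positions only
manual_loss = CrossEntropyLoss()(toy_logits[:3], toy_targets[:3])

print(torch.allclose(masked_loss, manual_loss))  # True
```

Without `ignore_index`, the many padded positions in each batch would dominate the loss and push the model toward predicting padding.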


Write the training loop

In [18]:
from tqdm import tqdm
progress_bar = tqdm(total=num_training_steps, desc="Training Progress")

for epoch in range(num_train_epochs):
    # Training Phase
    model.train()
    total_loss = 0
    for batch_src, batch_tgt in train_dataloader:
        # Move tensors to GPU if available
        if torch.cuda.is_available():
            batch_src = batch_src.cuda()
            batch_tgt = batch_tgt.cuda()
            
        # Zero gradients
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(batch_src)
        
        # Reshape outputs and targets for loss calculation
        # outputs: [batch_size, seq_len, vocab_size]
        # targets: [batch_size, seq_len]
        outputs = outputs.view(-1, outputs.size(-1))
        targets = batch_tgt.view(-1)
        
        # Calculate loss
        loss = criterion(outputs, targets)
        
        # Backward pass and optimize
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        progress_bar.update(1)
    
    avg_train_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch}: Average Train Loss: {avg_train_loss:.4f}")
    
    # Evaluation Phase
    model.eval()
    total_eval_loss = 0
    total_batches = 0
    
    with torch.no_grad():
        for batch_src, batch_tgt in eval_dataloader:
            # Move tensors to GPU if available
            if torch.cuda.is_available():
                batch_src = batch_src.cuda()
                batch_tgt = batch_tgt.cuda()
            
            # Forward pass
            outputs = model(batch_src)
            
            # Reshape outputs and targets for loss calculation
            outputs = outputs.view(-1, outputs.size(-1))
            targets = batch_tgt.view(-1)
            
            # Calculate loss
            loss = criterion(outputs, targets)
            total_eval_loss += loss.item()
            total_batches += 1
    
    avg_loss = total_eval_loss / total_batches
    print(f"Epoch {epoch}: Average Eval Loss: {avg_loss:.4f}")

Training Progress:   0%|          | 0/6750 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.82 GiB. GPU 0 has a total capacity of 5.79 GiB of which 1.56 GiB is free. Including non-PyTorch memory, this process has 4.10 GiB memory in use. Of the allocated memory 3.96 GiB is allocated by PyTorch, and 41.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
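
The 1.82 GiB in the error message is consistent with a single float32 logits tensor of shape [batch_size, seq_len, vocab_size], if every sequence is padded to 512 tokens. The value 512 is an assumption here, based on the usual `model_max_length` of BERT tokenizers; the other numbers come from the cells above.

```python
batch_size = 8          # BATCH_SIZE from the DataLoader cell
seq_len = 512           # assumed padded length (no max_length was passed to the tokenizer)
vocab_size = 119547     # out_features of the final Linear layer
bytes_per_float32 = 4

logits_bytes = batch_size * seq_len * vocab_size * bytes_per_float32
logits_gib = logits_bytes / 2**30
print(f"{logits_gib:.2f} GiB")  # 1.82 GiB

# The same tensor if sequences were capped at 128 tokens instead
shorter_gib = batch_size * 128 * vocab_size * bytes_per_float32 / 2**30
print(f"{shorter_gib:.2f} GiB")  # 0.46 GiB
```

Capping the padded length (for example by passing `max_length=128` to the tokenizer) or lowering the batch size shrinks this tensor proportionally. That is probably the most direct way to fit the model on a GPU of this size.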

## Q5: Evaluating RNNs for Machine Translation (5)

Implement the calculation of BLEU-1,2,3,4 scores using the ```sacrebleu``` library for the test dataset.

In [None]:
import sacrebleu
from sacrebleu.metrics import BLEU


model.eval()
bleu1, bleu2, bleu3, bleu4 = None, None, None, None
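device = next(model.parameters()).device  # nn.Module has no .device attribute
for batch_src, batch_tgt in test_dataloader:
    # TranslationDataset yields (source_ids, target_ids) tuples, not dicts
    batch_src, batch_tgt = batch_src.to(device), batch_tgt.to(device)
    # Complete the testing loop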

print("BLEU-1: ", bleu1)
print("BLEU-2: ", bleu2)
print("BLEU-3: ", bleu3)
print("BLEU-4: ", bleu4)
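
BLEU-n is built from clipped n-gram precisions of orders 1 to n. The sketch below computes only those precisions for one sentence pair, to show what the four scores measure. It is a simplified illustration with hypothetical helper names and made-up sentences; it does not reproduce sacrebleu's tokenization, smoothing, or brevity penalty.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hypothesis, reference, n):
    """Clipped precision: a hypothesis n-gram counts at most as often as it occurs in the reference."""
    hyp_counts = ngram_counts(hypothesis.split(), n)
    ref_counts = ngram_counts(reference.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total

hypothesis = "the cat sat on the mat"
reference = "the cat is on the mat"
precisions = [modified_precision(hypothesis, reference, n) for n in range(1, 5)]
print(precisions)  # 5/6 of unigrams, 3/5 of bigrams, 1/4 of trigrams, 0 of 4-grams match
```

A single wrong word already removes every matching 4-gram here, which is why higher-order BLEU scores fall off quickly for weak translations.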

Congratulations! You can now work with RNNs for the task of Machine Translation!

## Q6: Prepare for training transformers (10)

In this part, we cover the initial setup required before training a transformer, including data preprocessing and setting up the data collator and data loaders.

Ensure you have loaded the dataset!

In [None]:
dataset

We will begin by tokenizing the data. Based on your model selection, load the appropriate tokenizer. We are using models from AutoModelForSeq2SeqLM in this assignment. You can check out all the available models here: https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM

In [None]:
from transformers import AutoTokenizer

checkpoint = "" #Select a model of your choice
tokenizer = AutoTokenizer.from_pretrained(REPLACE_WITH_CHECKPOINT)

We will need to tokenize both our inputs and outputs. Thus we make use of the preprocess_function() function to generate tokenized model inputs and targets. Ensure you use truncation and padding! The max length will be 128.

In [None]:
##Implement the preprocess function
def preprocess_function(examples):
    inputs = [example[SET_RIGHT_LANG] for example in examples["translation"]]
    targets = [example[SET_RIGHT_LANG] for example in examples["translation"]]
    model_inputs = tokenizer() #Call the tokenizer to generate the model inputs and labels
    return model_inputs
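
One possible shape for such a function is sketched below. The factory `build_preprocess_fn` is a hypothetical helper, and the sketch assumes a recent 🤗 Transformers version whose tokenizers accept the `text_target` argument. The language keys follow the fr-to-en direction used in this notebook.

```python
def build_preprocess_fn(tokenizer, source_lang="fr", target_lang="en", max_length=128):
    """Return a function suitable for Dataset.map(..., batched=True)."""
    def preprocess(examples):
        # examples["translation"] is a list of {"fr": ..., "en": ...} dicts
        inputs = [pair[source_lang] for pair in examples["translation"]]
        targets = [pair[target_lang] for pair in examples["translation"]]
        # text_target asks the tokenizer to also encode the targets as `labels`
        return tokenizer(
            inputs,
            text_target=targets,
            max_length=max_length,
            truncation=True,
            padding="max_length",
        )
    return preprocess
```

It would be used as `train_dataset.map(build_preprocess_fn(tokenizer), batched=True)`. Note that with `padding="max_length"` the padded label positions hold ordinary pad ids, so the loss is also computed on them unless they are masked.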

In [None]:
tokenized_train_data = train_dataset.map(preprocess_function, batched=True)

In [None]:
tokenized_val_data = eval_dataset.map(preprocess_function, batched=True)

We remove the 'translation' column, as we do not require it for training. Keeping columns other than those created by preprocess_function often leads to errors during training, because the batching code and the model may not know what to do with them.

In [None]:
tokenized_train_data = tokenized_train_data.remove_columns(train_dataset.column_names)
tokenized_val_data = tokenized_val_data.remove_columns(eval_dataset.column_names)

In [None]:
tokenized_train_data.set_format("torch")
tokenized_val_data.set_format("torch")

To construct batches of training data for model training, we require collators that set the properties for the batches and data loaders that generate the batches.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq() #INSTANTIATE THE COLLATOR
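
To illustrate what a seq2seq collator does with a list of examples, here is a simplified pure-Python stand-in. The function `pad_seq2seq_batch` and the token ids are hypothetical; the real collator additionally returns tensors and can prepare decoder inputs when given the model. The label padding value of -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
def pad_seq2seq_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids and labels to the longest example in the batch."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_inputs - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_labels - len(f["labels"])))
    return batch

features = [
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [5], "labels": [8, 9, 10]},
]
batch = pad_seq2seq_batch(features)
print(batch["input_ids"])       # [[5, 6, 7], [5, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
print(batch["labels"])          # [[8, 9, -100], [8, 9, 10]]
```

Padding per batch rather than to a fixed global length keeps batches of short sentences small.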

In [None]:
#Instantiate the DataLoader for training and evaluation data

from torch.utils.data import DataLoader

train_dataloader = DataLoader(tokenized_train_data, batch_size=32, shuffle=True, collate_fn=data_collator)
eval_dataloader = DataLoader(tokenized_val_data, batch_size=32, collate_fn=data_collator)

## Q7: Choosing & Loading the Model (5)

Choose a pre-trained transformer model that you will use for fine-tuning on the translation dataset

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(REPLACE_WITH_CHECKPOINT)

## Q8: Training the Transformer Model

Now that we have our data tokenized and batched and the model chosen, we can begin training. To do so, we must set up the right hyperparameters and then implement the training loop to train our model!

For training, we require an optimizer and a scheduler to manage the learning rate during training. Let's set them up before our training loop.

In [None]:
from torch.optim import AdamW
from transformers import get_scheduler

num_train_epochs = NUM_EPOCHS
num_training_steps = NUM_STEPS

optimizer = SETUP_Adam_OPTIMIZER
lr_scheduler = SETUP_SCHEDULER
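
To show what a scheduler does to the learning rate, here is a plain-Python sketch of a linear schedule with optional warmup, the kind commonly selected by passing `"linear"` to `get_scheduler`. The function `linear_lr_factor`, the base learning rate of 5e-5, and the 1,000 total steps are illustrative assumptions, not values prescribed by this assignment.

```python
def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 5e-5
total_steps = 1000
for step in (0, 500, 1000):
    print(step, base_lr * linear_lr_factor(step, 0, total_steps))
```

With no warmup the rate starts at its full value, halves by the midpoint, and reaches zero on the last step. This is why `num_training_steps` has to be the number of epochs times the number of batches per epoch.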

Finally, we are here!

In the training loop, you will run a forward pass, compute the loss, compute the gradients, and then update the weights. (Don't forget to reset the gradients to zero!)

During the eval phase we simply do a forward pass and compute the loss!

In [None]:
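from tqdm.auto import tqdm

# `device` is used in the loops below; pick the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

progress_bar = tqdm(total=num_training_steps, desc="Training Progress")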

for epoch in range(num_train_epochs):
    # Training Phase
    model.train()
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        ## Complete the training loop

        progress_bar.update(1)

    # Evaluation Phase
    model.eval()
    total_loss = 0
    total_batches = 0

    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        ### Complete the evaluation phase

    avg_loss = None
    print(f"Epoch {epoch}: Average Eval Loss: {avg_loss:.4f}")
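
The order of operations described above can be seen on a toy regression problem. Everything in this sketch (the linear model, the random data, the SGD optimizer, 50 steps) is an assumption chosen to keep the example small; it is not the transformer fine-tuning loop itself, and the learning-rate scheduler step is left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(4, 1)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Synthetic data: targets are an exact linear function of the inputs
inputs = torch.randn(64, 4)
targets = inputs @ torch.tensor([[1.0], [-2.0], [0.5], [3.0]])

train_losses = []
toy_model.train()
for step in range(50):
    toy_optimizer.zero_grad()                   # clear gradients from the previous step
    loss = loss_fn(toy_model(inputs), targets)  # forward pass and loss
    loss.backward()                             # compute gradients
    toy_optimizer.step()                        # update the weights
    train_losses.append(loss.item())

toy_model.eval()
with torch.no_grad():                           # evaluation: forward pass only
    eval_loss = loss_fn(toy_model(inputs), targets).item()

print(f"first train loss {train_losses[0]:.4f}, eval loss {eval_loss:.4f}")
```

In the fine-tuning loop, the forward pass of a 🤗 seq2seq model given `labels` typically returns the loss directly, and the scheduler is stepped right after the optimizer.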

Congratulations on completing the training! Now don't forget to save your model and the tokenizer.

In [None]:
# Save model and tokenizer
output_dir = SET_OUTPUT_DIR
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

## Q9: Evaluating Transformer for Machine Translation

We will now test our trained model and analyze its performance using BLEU-1, 2, 3, 4 scores from the sacrebleu library. You will create a task evaluator for translation, load and process the test dataset, and compute the results on an existing trained model.

Below, we load a model trained for French-to-English translation. You can read more about it here: https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-fr-en

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
checkpoint = "Helsinki-NLP/opus-mt-tc-big-fr-en"

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Initialize an evaluator for the translation task.

In [None]:
## Load Evaluator for translation
from evaluate import evaluator
task_evaluator = None

We need to restructure our test dataset so that it has separate input and target columns. We will use split_translations to split the translation column into two columns, 'en' and 'fr'.

In [None]:
#  Implement the split function
def split_translations(example):
    en_text = example['translation']['en']
    fr_text = example['translation']['fr']
    example['en'] = en_text
    example['fr'] = fr_text
    return example

In [None]:
test_data = test_dataset.map(split_translations)

You can now go ahead and compute the results by appropriately setting up the task_evaluator.compute()

In [None]:
results = task_evaluator.compute(
    model_or_pipeline= MODEL,
    data= DATA,
    metric=METRIC,
    input_column=COLUMN,
    label_column=COLUMN,
)

In [None]:
print(results)

## Q10: Inferencing on Transformers

Let's see how well this trained model translates. You can try it with a few French sentences and judge the results.

To do so, we will set up a pipeline using the existing trained model.


Loading the tokenizer and model for the pipeline

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
checkpoint = "Helsinki-NLP/opus-mt-tc-big-fr-en"

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Set up the pipeline for translation using your model and tokenizer. You can read about pipelines here: https://huggingface.co/docs/transformers/en/main_classes/pipelines

In [None]:
from transformers import pipeline
# Instantiate a pipeline for translation using the model and tokenizer
# (stored as `translator` so the imported `pipeline` function is not overwritten)
translator = None

Translate the given sentence using the pipeline

In [None]:
input_text = "REPLACE WITH A SENTENCE IN FRENCH."
translation_result = translator(REPLACE_WITH_TEXT_TO_TRANSLATE)

In [None]:
print(translation_result)