## Architecture Overview

> MarianMT

MarianMT is an encoder-decoder Transformer architecture designed specifically for machine translation.

Its main building blocks are described below.

1. Encoder - The encoder processes the source-language input and captures its contextual information. It consists of multiple layers of self-attention and feed-forward networks. The self-attention mechanism lets the model weigh the importance of different words in the source sentence while capturing the dependencies between them.

2. Decoder - The decoder generates the target-language translation from the encoded source representation. It also consists of multiple layers of self-attention and feed-forward networks. In addition to self-attention, each decoder layer uses a second attention mechanism, encoder-decoder attention, which allows the model to focus on the relevant parts of the source sentence while generating the translation.

3. Cross-Attention - Cross-attention is another name for the encoder-decoder attention described above and is a key component of MarianMT's architecture. It enables the decoder to attend to the encoded representations of the source sentence while generating each target token. By attending to different parts of the source sentence, the model aligns source and target words effectively.

4. Positional Encoding - Self-attention on its own is insensitive to word order, so both the encoder and the decoder add positional encodings to their input embeddings. This gives the model access to the order of the words, which is crucial for translation.
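
The building blocks above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration with made-up sizes and random vectors, not MarianMT's actual implementation: the helper names (`attention`, `sinusoidal_positions`) are hypothetical, there are no learned projections, and the library may arrange the sine and cosine components of the position table differently.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention for a single head (no masking)."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)   # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ values, weights

def sinusoidal_positions(n_positions, d_model):
    """Sinusoidal position table (interleaved sine/cosine layout, d_model even)."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    table = np.zeros((n_positions, d_model))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

rng = np.random.default_rng(0)
d_model = 8  # hypothetical, tiny embedding size
# Random "embeddings" plus position information: 5 source tokens, 3 target tokens.
source = rng.normal(size=(5, d_model)) + sinusoidal_positions(5, d_model)
target = rng.normal(size=(3, d_model)) + sinusoidal_positions(3, d_model)

# Self-attention: queries, keys and values all come from the same sequence.
encoded, self_weights = attention(source, source, source)
# Cross-attention: queries come from the decoder, keys and values from the encoder.
cross_out, cross_weights = attention(target, encoded, encoded)

print(self_weights.shape, cross_weights.shape)
```

Each row of `cross_weights` is a distribution over the five source positions, which is the sense in which the decoder "focuses" on parts of the source sentence.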

> Comparison with other Transformer architectures

1. BERT - BERT is a pretrained encoder-only model used primarily for natural language understanding tasks such as sentiment analysis, whereas MarianMT is an encoder-decoder model trained specifically for machine translation.

2. GPT - GPT is a decoder-only model that reads its context left to right and generates text token by token, whereas MarianMT pairs a bidirectional encoder with an autoregressive decoder to translate sentences from one language to another.
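
One way to picture the difference is through attention masks. The sketch below, with hypothetical helper names, builds the two mask patterns involved: a full (bidirectional) mask of the kind used by BERT-style encoders, and a causal mask of the kind used by GPT-style decoders. An encoder-decoder model such as MarianMT uses the first pattern in its encoder and the second in its decoder self-attention.

```python
import numpy as np

def bidirectional_mask(n):
    """Full mask: every position may attend to every other position."""
    return np.ones((n, n), dtype=bool)

def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((n, n), dtype=bool))

n_tokens = 4
print("encoder-style (bidirectional):")
print(bidirectional_mask(n_tokens).astype(int))
print("decoder-style (causal):")
print(causal_mask(n_tokens).astype(int))
```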


In [1]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import MarianMTModel, MarianTokenizer
import numpy as np



This cell imports the required libraries and modules: `torch`, `torch.utils.data`, `transformers`, and `numpy`. Together they set up the environment for working with PyTorch and the Hugging Face Transformers library.


In [2]:
class TranslationDataset(Dataset):
    def __init__(self, dataset_path):
        self.hypotheses_cols_path = dataset_path + '/deen_nt2021_bleurt_0p2/hypotheses_cols.tsv'
        self.hypotheses_rows_path = dataset_path + '/deen_nt2021_bleurt_0p2/hypotheses_rows.tsv'
        self.scores_path = dataset_path + '/deen_nt2021_bleurt_0p2/scores.npy'
        self.tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-de-en')
        self.hypotheses_cols = []
        self.hypotheses_rows = []
        self.scores = []
        self.load_data()

    def load_data(self):
        with open(self.hypotheses_cols_path, 'r', encoding='utf-8') as f:
            self.hypotheses_cols = f.read().splitlines()
        with open(self.hypotheses_rows_path, 'r', encoding='utf-8') as f:
            self.hypotheses_rows = f.read().splitlines()
        self.scores = np.load(self.scores_path)

    def __len__(self):
        return min(len(self.hypotheses_cols), len(self.hypotheses_rows), len(self.scores))

    def __getitem__(self, idx):
        source_text = self.hypotheses_cols[idx]
        target_text = self.hypotheses_rows[idx]
        score = self.scores[idx]
        source_inputs = self.tokenizer.encode(source_text, padding='max_length', truncation=True, max_length=128,
                                              return_tensors='pt')
        target_inputs = self.tokenizer.encode(target_text, padding='max_length', truncation=True, max_length=128,
                                              return_tensors='pt')
        return {
            'source_inputs': source_inputs.squeeze(),
            'target_inputs': target_inputs.squeeze(),
            'score': score
        }


The code defines a custom PyTorch dataset called `TranslationDataset` for working with machine translation data. Its functionality can be broken down into the following steps:

- The `__init__` method initializes the dataset by setting the paths to the input files (hypotheses_cols.tsv, hypotheses_rows.tsv, and scores.npy) and the MarianTokenizer for the specific translation model.
- The `load_data` method reads the contents of the input files into the corresponding variables (`hypotheses_cols`, `hypotheses_rows`, and `scores`).
- The `__len__` method returns the length of the dataset, which is the minimum length among the three lists (`hypotheses_cols`, `hypotheses_rows`, and `scores`).
- The `__getitem__` method is called when an item from the dataset is requested by index (`idx`). It retrieves the corresponding source text, target text, and score. Then, it encodes the source and target texts using the tokenizer, applying padding and truncation as necessary. Finally, it returns a dictionary containing the source inputs, target inputs, and score.
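
The padding and truncation step can be illustrated without the tokenizer. The sketch below is a simplified stand-in with a hypothetical helper name, `pad_or_truncate`; it ignores special tokens such as end-of-sequence, which the real tokenizer also handles. The padding ID 58100 is the value that fills the tail of each row in the batch printed later in this notebook; the other IDs are only illustrative.

```python
PAD_ID = 58100  # padding ID as it appears in the printed batch

def pad_or_truncate(token_ids, max_length=128, pad_id=PAD_ID):
    """Return exactly `max_length` IDs: cut long inputs, right-pad short ones."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))

short_ids = pad_or_truncate([448, 173, 3034], max_length=8)
long_ids = pad_or_truncate(range(200), max_length=8)
print(short_ids)
print(long_ids)
```

Because every sample ends up with the same length, the samples can later be stacked into a single batch tensor.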


In [3]:
def collate_fn(batch):
    source_inputs = torch.stack([item['source_inputs'] for item in batch])
    target_inputs = torch.stack([item['target_inputs'] for item in batch])
    scores = torch.tensor([item['score'] for item in batch])
    return {
        'source_inputs': source_inputs,
        'target_inputs': target_inputs,
        'score': scores
    }


The `collate_fn` function is a custom collate function used for batching in the DataLoader. It takes a list of individual data samples (`batch`) and combines them into batch tensors for efficient processing. Its functionality can be broken down into the following steps:

- It retrieves the 'source_inputs', 'target_inputs', and 'score' from each item in the batch using list comprehensions.
- It uses `torch.stack` to stack the 'source_inputs' and 'target_inputs' tensors along a new dimension, creating a batch tensor for both inputs.
- It converts the list of scores to a tensor using `torch.tensor`.
- Finally, it returns a dictionary containing the batched 'source_inputs', 'target_inputs', and 'score'.
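
The effect of stacking can be shown on small arrays. The sketch below uses NumPy's `np.stack` as a stand-in for `torch.stack`, since both add a new leading dimension; the samples are hypothetical, and each score is simplified to a single number.

```python
import numpy as np

# Three hypothetical samples, already padded to the same length (6 IDs each).
samples = [
    {"source_inputs": np.array([448, 173, 3034, 58100, 58100, 58100]), "score": 0.97},
    {"source_inputs": np.array([448, 309, 1908, 2284, 58100, 58100]), "score": 0.83},
    {"source_inputs": np.array([448, 129, 2206, 6448, 8917, 58100]), "score": 0.26},
]

# Stacking adds a new leading batch axis: three (6,) arrays become one (3, 6) array.
source_inputs = np.stack([item["source_inputs"] for item in samples])
scores = np.array([item["score"] for item in samples])
print(source_inputs.shape, scores.shape)

# Stacking fails when samples differ in length, which is why padding matters.
try:
    np.stack([np.array([1, 2, 3]), np.array([1, 2])])
    stack_failed = False
except ValueError:
    stack_failed = True
print("ragged stack failed:", stack_failed)
```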


In [4]:
dataset_path = '/kaggle/input/machine-translation-mbr-with-neural-metrics/de-en/newstest2021'  # Replace with the actual path to the dataset
dataset = TranslationDataset(dataset_path)
dataloader = DataLoader(dataset, batch_size=32, collate_fn=collate_fn, shuffle=True)




This cell initializes the dataset with the `TranslationDataset` class, which takes `dataset_path` as its argument.

A `DataLoader` is then created from `dataset`. With `batch_size=32`, the DataLoader yields batches of 32 samples at a time. The custom `collate_fn` defined earlier is passed through the parameter of the same name and is used to collate the samples into batches. Finally, `shuffle=True` reshuffles the samples at the start of every epoch.
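
The batching behaviour can be mimicked in plain Python. The sketch below is an illustration only: `iterate_batches` is a hypothetical helper, and the dataset size of 1000 is made up. It shows that the last batch is smaller when the dataset size is not a multiple of the batch size, which is also what a `DataLoader` does with its default `drop_last=False`.

```python
import math
import random

def iterate_batches(n_samples, batch_size=32, shuffle=True, seed=0):
    """Yield lists of sample indices, one list per batch."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

batches = list(iterate_batches(1000))
# 1000 samples in batches of 32: 31 full batches plus one batch of 8.
print(len(batches), len(batches[0]), len(batches[-1]))
```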


In [5]:
for batch in dataloader:
    print(batch)
    break

{'source_inputs': tensor([[  448,   173,  3034,  ..., 58100, 58100, 58100],
        [  448,   309,  1908,  ..., 58100, 58100, 58100],
        [  448,   309,  2284,  ..., 58100, 58100, 58100],
        ...,
        [  448,   129,  2206,  ..., 58100, 58100, 58100],
        [  448,  6448,  8917,  ..., 58100, 58100, 58100],
        [  448,   300,  2734,  ..., 58100, 58100, 58100]]), 'target_inputs': tensor([[  448,   173,  3034,  ..., 58100, 58100, 58100],
        [  448,   309,  1908,  ..., 58100, 58100, 58100],
        [  448,   309,  2284,  ..., 58100, 58100, 58100],
        ...,
        [  448,   129,  2206,  ..., 58100, 58100, 58100],
        [  448,  6448,  8917,  ..., 58100, 58100, 58100],
        [  448,   300,  2734,  ..., 58100, 58100, 58100]]), 'score': tensor([[[0.9688, 0.8281, 0.8750,  ..., 0.2637, 0.2480, 0.3594],
         [0.8047, 0.9609, 0.8672,  ..., 0.2461, 0.2539, 0.3613],
         [0.8672, 0.8906, 0.9688,  ..., 0.2637, 0.2490, 0.3574],
         ...,
         [0.2148, 0.2



This cell performs a sanity check of the dataloader by printing its first batch.

In [6]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_name = 'Helsinki-NLP/opus-mt-de-en'
model = MarianMTModel.from_pretrained(model_name).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)



In this code, the device is set based on the availability of CUDA. If CUDA is available, the device is set to `'cuda'`, otherwise it is set to `'cpu'`.

The `model_name` variable is set to `'Helsinki-NLP/opus-mt-de-en'`, which is the pre-trained model name for the Marian machine translation model that translates German to English.

The `MarianMTModel` is then initialized using `from_pretrained()` with the `model_name` and moved to the specified device using `.to(device)`.

An optimizer is created using `torch.optim.Adam` and the parameters of the `model` are passed to it. The learning rate is set to `1e-4`.
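
To give a feel for what the learning rate controls, the sketch below writes out a single Adam update in NumPy. It is a simplified illustration with a hypothetical `adam_step` helper and made-up parameter and gradient values; the `beta1`, `beta2` and `eps` values are the defaults of `torch.optim.Adam`.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

param = np.array([1.0, -2.0, 0.5])
grad = np.array([0.3, -40.0, 0.001])  # very different gradient magnitudes
new_param, m, v = adam_step(param, grad, m=np.zeros(3), v=np.zeros(3), t=1)
print(param - new_param)
```

On the first step every parameter moves by roughly `lr` in the direction opposite to its gradient, regardless of the gradient's magnitude, so `lr=1e-4` bounds the initial step size per parameter.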


# Training

In [7]:
for epoch in range(1):
    for step,batch in enumerate(dataloader):
        source_inputs = batch['source_inputs'].to(device)
        target_inputs = batch['target_inputs'].to(device)
        scores = batch['score'].to(device)

        optimizer.zero_grad()
        outputs = model(source_inputs, decoder_input_ids=target_inputs, return_dict=True)
        logits = outputs.logits.flatten()
        
        # Reshape scores to match the size of logits
        scores = scores.view(-1)

        # Resize logits to match the size of scores
        logits = logits[:scores.size(0)]

        # Convert logits and scores to Float dtype
        logits = logits.float()
        scores = scores.float()

        loss = torch.nn.functional.mse_loss(logits, scores)
        loss.backward()
        optimizer.step()
        
        print("Step-{}, Loss-{}".format(step,loss.item()))


Step-0, Loss-5.724803924560547
Step-1, Loss-4.6054253578186035
Step-2, Loss-26.402923583984375
Step-3, Loss-0.773452639579773
Step-4, Loss-2.676593780517578
Step-5, Loss-2.299866199493408
Step-6, Loss-0.7236250638961792
Step-7, Loss-0.5419917702674866
Step-8, Loss-0.9050765037536621
Step-9, Loss-0.765993058681488
Step-10, Loss-0.4023821949958801
Step-11, Loss-0.25546500086784363
Step-12, Loss-0.3499484062194824
Step-13, Loss-0.4514180123806
Step-14, Loss-0.3846874535083771
Step-15, Loss-0.22641226649284363
Step-16, Loss-0.1575862020254135
Step-17, Loss-0.21602994203567505
Step-18, Loss-0.29133087396621704
Step-19, Loss-0.2708299160003662
Step-20, Loss-0.15109673142433167
Step-21, Loss-0.09491079300642014
Step-22, Loss-0.11346182227134705
Step-23, Loss-0.18739387392997742
Step-24, Loss-0.17765815556049347
Step-25, Loss-0.11436118930578232
Step-26, Loss-0.060721106827259064
Step-27, Loss-0.07766149938106537
Step-28, Loss-0.10315463691949844
Step-29, Loss-0.10392075031995773
Step-30, Loss
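
The loss computation in the training cell can be traced on small arrays. The sketch below uses NumPy with made-up shapes and values to show which entries survive `flatten()` followed by slicing, and how the mean squared error is then computed; it mirrors those steps only and does not involve the model.

```python
import numpy as np

batch_size, target_len, vocab_size = 2, 3, 5   # hypothetical, tiny sizes
logits = np.arange(batch_size * target_len * vocab_size, dtype=np.float32)
logits = logits.reshape(batch_size, target_len, vocab_size)
scores = np.array([0.9, 0.8, 0.3, 0.2], dtype=np.float32)  # hypothetical scores

flat = logits.flatten()        # shape (30,), row-major order
kept = flat[:scores.size]      # only the first 4 values survive the slice

# In this example all kept values come from sample 0, position 0.
print(kept, logits[0, 0, :scores.size])

mse = float(np.mean((kept - scores) ** 2))  # mean squared error
print(mse)
```

Because flattening is row-major, the slice keeps the leading entries of the logits tensor; which entries those are depends on the relative sizes of the two tensors.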

# Inference

In [8]:
# Define the German text
german_text = "Guten Tag!"


In [9]:
# Load the tokenizer
model_name = 'Helsinki-NLP/opus-mt-de-en'

tokenizer = MarianTokenizer.from_pretrained(model_name)


In [10]:
# Tokenize the German text
inputs = tokenizer.encode(german_text, return_tensors='pt')


This cell tokenizes the German text stored in the variable `german_text` with the `tokenizer.encode()` method.

The `return_tensors='pt'` parameter specifies that the encoded tokens should be returned as PyTorch tensors. The resulting tokenized representation of the German text is stored in the variable `inputs`.

In [11]:
# Perform inference
outputs = model.generate(inputs.to(model.device))




The tokenized German text stored in `inputs` is passed to the `model.generate()` method to perform inference.

The `generate()` method performs sequence generation: it takes the input token IDs and produces the token IDs of the translated output sequence using the translation model loaded earlier.

The resulting translated output sequence is stored in the variable `outputs`.
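
Conceptually, generation is an autoregressive loop that produces one token at a time. The sketch below is a toy illustration under stated assumptions: `toy_next_token` is a hypothetical stand-in for a model forward pass, `EOS_ID` is a made-up end-of-sequence ID, and the loop is greedy, whereas the real `generate()` method also supports other strategies such as beam search.

```python
EOS_ID = 0  # hypothetical end-of-sequence ID for this toy example

def toy_next_token(source_ids, generated_ids):
    """Stand-in for a model forward pass: echo the source, then emit EOS."""
    position = len(generated_ids)
    return source_ids[position] if position < len(source_ids) else EOS_ID

def greedy_generate(source_ids, next_token=toy_next_token, max_new_tokens=10):
    """Append one token at a time until EOS or the length limit is reached."""
    generated = []
    for _ in range(max_new_tokens):
        token = next_token(source_ids, generated)
        generated.append(token)
        if token == EOS_ID:
            break
    return generated

full = greedy_generate([7, 8, 9])
cut_off = greedy_generate([7, 8, 9], max_new_tokens=2)
print(full)
print(cut_off)
```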

In [12]:
# Decode the English translation
english_translation = tokenizer.decode(outputs[0], skip_special_tokens=True)


The translated output sequence stored in the variable `outputs` is decoded using the `tokenizer.decode()` method. The `decode()` method takes the tensor of token IDs (`outputs[0]`) and converts it back to text.

The `skip_special_tokens=True` argument is used to exclude any special tokens, such as padding or end-of-sequence tokens, from the decoded text. This ensures that only the meaningful translated text is extracted.

The resulting English translation is stored in the variable `english_translation`.
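
The effect of `skip_special_tokens` can be illustrated with a toy vocabulary. Everything in the sketch below is hypothetical: the IDs, the vocabulary and the `toy_decode` helper are made up, and the real tokenizer additionally merges subword pieces back into words instead of joining tokens with spaces.

```python
# Hypothetical toy vocabulary, for illustration only.
ID_TO_TOKEN = {0: "</s>", 1: "<pad>", 2: "Hello", 3: "."}
SPECIAL_IDS = {0, 1}

def toy_decode(token_ids, skip_special_tokens=True):
    """Map IDs back to tokens, optionally dropping the special ones."""
    tokens = [
        ID_TO_TOKEN[i]
        for i in token_ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return " ".join(tokens)

clean = toy_decode([1, 2, 3, 0])
raw = toy_decode([1, 2, 3, 0], skip_special_tokens=False)
print(clean)
print(raw)
```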

In [13]:
# Print the translated text
print("German Text: ", german_text)
print("English Translation: ", english_translation)


German Text:  Guten Tag!
English Translation:  Hello.


This cell displays the German input and its English translation.