## Assignment 3: Transformers for translation 🙊


Have you ever wondered how applications like Google Translate or language translation features in social media platforms work? Behind these impressive technologies are sophisticated machine learning models that can understand and translate text between different languages. One of the most powerful and groundbreaking models used for this purpose is the Transformer model.

In this assignment, you will step into the shoes of an AI researcher and engineer to create your own Transformer model for translating text from English to French. This journey will not only enhance your understanding of machine learning and deep learning but also give you hands-on experience with state-of-the-art techniques in natural language processing.

Let's start by installing the libraries we need.

In [None]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install bert_score
!pip install rouge_score


For this assignment we are using the IWSLT2017 dataset (read more about it [here](https://huggingface.co/datasets/IWSLT/iwslt2017)). This dataset, which is available on Hugging Face, is a good fit for our machine translation task.

In [None]:
from datasets import load_dataset

dataset = load_dataset("IWSLT/iwslt2017",'iwslt2017-en-fr')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/18.5k [00:00<?, ?B/s]

iwslt2017.py:   0%|          | 0.00/8.17k [00:00<?, ?B/s]

The repository for IWSLT/iwslt2017 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/IWSLT/iwslt2017.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


en-fr.zip:   0%|          | 0.00/27.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/232825 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8597 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/890 [00:00<?, ? examples/s]

Just to get an idea, let's take a quick peek at what our dataset looks like.

In [None]:
dataset['train']['translation'][0]

{'en': "Thank you so much, Chris. And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.",
 'fr': "Merci beaucoup, Chris. C'est vraiment un honneur de pouvoir venir sur cette scène une deuxième fois. Je suis très reconnaissant."}

Since we don't want training to take 8 hours, let's trim our dataset a bit. This might hurt performance, so feel free to use the complete dataset if you have the computing power.

SUGGESTION: start with a small dataset to debug your code and increase it gradually (the same principle applies to the number of epochs, batch size, test set size...).

In [None]:
# trim_dataset = dataset['train']['translation'][:100000]

In [None]:
trim_dataset = dataset['train']['translation'][:20000]

### Preprocessing


As in our previous assignments, preprocessing is an essential part of any NLP task.

In [None]:
import string
import re

def preprocess_data(text):
  """ Method to clean text from noise and standardize it across both languages.
      The preprocessing includes lowercasing, replacing newlines and punctuation with spaces, and dropping every token that is not purely alphabetic (for example, numbers).
  Arguments
  ---------
  text : String
     Text to clean
  Returns
  -------
  text : String
      Cleaned text
  """

  text = text.lower() #make everything lower case
  text = text.replace('\n',' ') #replace \n characters with spaces
  text = re.sub(r'[^\w\s]', ' ', text) #replace punctuation and special characters with spaces
  text = ' '.join([word for word in text.split(" ") if word.isalpha()]) #keep only alphabetic tokens (drops numbers and empty strings)

  return text
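
To see what these cleaning steps do in practice, here is a minimal standalone sketch that repeats them on a made-up sentence. The function name `clean` and the sample string are ours, chosen only for illustration. Note that apostrophes become spaces (so "it's" turns into "it s") and that any token containing a digit is dropped entirely.

```python
import re

def clean(text):
    # Same four steps as preprocess_data, repeated here so the snippet runs on its own.
    text = text.lower()
    text = text.replace('\n', ' ')
    text = re.sub(r'[^\w\s]', ' ', text)
    return ' '.join(word for word in text.split(' ') if word.isalpha())

sample = "It's 2 GREAT days,\nisn't it?"
print(clean(sample))  # it s great days isn t it
```

Whether losing numbers and casing is acceptable depends on the task; for translation it simplifies the vocabulary at the cost of some information.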


For an easier training structure, it is useful to format our training and validation sets. The following function should help with this.

In [None]:
def create_dataset(dataset,source_lang,target_lang):
  """ Method to create a cleaned dataset of (source, target) pairs.
  Arguments
  ---------
  dataset : List of Dict
     Translation entries from the dataset, each mapping a language code to its text
  source_lang : String
     Source language
  target_lang : String
     Target language
  Returns
  -------
  new_dataset : List of Tuple of String
      Cleaned source and target text in format (source, target)
  """
  new_dataset=[]

  #TODO: iterate through dataset extract source and target dataset and preprocess them creating a new clean dataset with the correct format
  for text in dataset:
    source_text = preprocess_data(text.get(source_lang))
    target_text = preprocess_data(text.get(target_lang))
    new_dataset.append((source_text, target_text))

  return new_dataset

training_set=create_dataset(trim_dataset,'en','fr')
validation_set=create_dataset(dataset['validation']['translation'],'en','fr')
test_set=create_dataset(dataset['test']['translation'],'en','fr')

In [None]:
training_set[0]

('thank you so much chris and it s truly a great honor to have the opportunity to come to this stage twice i m extremely grateful',
 'merci beaucoup chris c est vraiment un honneur de pouvoir venir sur cette scène une deuxième fois je suis très reconnaissant')

## Transformer Model


### Model Creation


Now that our data is ready, we can get started. Let's start by creating our Sequence to Sequence Transformer model.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward, dropout):
        super(TransformerModel, self).__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model) # Embedding layer for source language
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model) # Embedding layer for target language
        # Transformer model with its attributes (see the PyTorch documentation); batch_first is set to True
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )
        self.fc = nn.Linear(d_model, tgt_vocab_size) # Last linear layer

    def positional_encoding(self, d_model, maxlen = 500): #if error, 1000
        """Method to create a positional encoding buffer.
        Arguments
        ---------
        d_model: int
            Embedding size
        maxlen: int
            Maximum sequence length
        Returns
        -------
        PE: Tensor
            Positional encoding buffer
        """
        pos = torch.arange(0, maxlen).unsqueeze(1)
        denominator = 10000 ** (torch.arange(0, d_model, 2) / d_model)

        PE = torch.zeros((maxlen, d_model))
        PE[:, 0::2] = torch.sin(pos / denominator) # Sine on the even embedding dimensions
        PE[:, 1::2] = torch.cos(pos / denominator) # Cosine on the odd embedding dimensions

        PE = PE.unsqueeze(0)  # Add batch dimension
        return PE


    def forward(self, src, tgt, src_mask=None, tgt_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None):
        """Method to forward a batch of data through the model."""
        #pass source and target through their embedding layers
        src = self.src_embedding(src)
        tgt = self.tgt_embedding(tgt)

        positional_encoding = self.positional_encoding(self.transformer.d_model).to(src.device) #get positional encoding and move it to device

        #get src_emb and tgt_emb by adding the positional encoding
        src_emb = src + positional_encoding[:,:src.shape[1], :]
        tgt_emb = tgt + positional_encoding[:,:tgt.shape[1], :]

        #pass src, tgt and all masks through the transformer; the source padding mask is reused for the encoder memory
        output = self.transformer(
            src_emb, tgt_emb,
            src_mask=src_mask, tgt_mask=tgt_mask, memory_mask=None,
            src_key_padding_mask=src_key_padding_mask,
            tgt_key_padding_mask=tgt_key_padding_mask,
            memory_key_padding_mask=src_key_padding_mask,
        )

        #pass output through the linear layer
        output = self.fc(output)
        return output

    def encode(self, src, src_mask):
        """Method to encode a batch of data through the transformer model."""
        src = self.src_embedding(src) #pass src through embedding layer
        positional_encoding = self.positional_encoding(self.transformer.d_model).to(src.device) #create positional encoding
        src_emb = src + positional_encoding[:, :src.shape[1], :] #get src_emb
        return self.transformer.encoder(src_emb, src_mask) #pass src_emb through transformer encoder (look pytorch documentation)


    def decode(self, tgt, memory, tgt_mask):
        """Method to decode a batch of data through the transformer model."""
        tgt = self.tgt_embedding(tgt) #pass tgt through embedding layer
        positional_encoding = self.positional_encoding(self.transformer.d_model).to(tgt.device) #create positional encoding
        tgt_emb = tgt + positional_encoding[:, :tgt.shape[1], :] #get tgt_emb
        return self.transformer.decoder(tgt_emb, memory, tgt_mask) #pass tgt_emb through transformer decoder (look pytorch documentation)
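
The `positional_encoding` method fills even embedding dimensions with a sine and odd ones with a cosine of `pos / 10000^(i / d_model)`, where `i` is the even dimension index. As a rough illustration, the same table can be built with plain Python for a tiny size. The helper name `sinusoidal_pe` and the sizes below are assumptions made for this sketch, not part of the assignment.

```python
import math

def sinusoidal_pe(maxlen, d_model):
    """Return a maxlen x d_model table of sinusoidal positional encodings."""
    table = [[0.0] * d_model for _ in range(maxlen)]
    for pos in range(maxlen):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)          # even embedding dimension
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)  # odd embedding dimension
    return table

pe = sinusoidal_pe(maxlen=4, d_model=4)
print(pe[0])  # position 0: sin(0)=0 and cos(0)=1 alternate -> [0.0, 1.0, 0.0, 1.0]
print(pe[1])  # position 1: low dimensions change quickly, high dimensions slowly
```

One design remark: the class above rebuilds this table on every call. Since it never changes, it could also be computed once in `__init__` and stored with `register_buffer`, which would be a possible optimization rather than a required change.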


Now that our model is ready, we still need some methods that will come in handy during training.

In [None]:
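def create_padding_mask(seq, pad_idx=1):
  """ Method to create a padding mask based on given sequence.
  Arguments
  ---------
  seq : Tensor
     Batch of token ids to create the padding mask for
  pad_idx : int
     Id of the padding token. The default of 1 is the pad id of the XLM-RoBERTa tokenizer used below (id 0 is its <s> token)
  Returns
  -------
  mask : Tensor
      Float padding mask: -inf at padding positions and 0 elsewhere
  """
  #additive float mask in the same format as the triangular mask: -inf hides a position from attention
  return torch.zeros(seq.shape, device=seq.device).masked_fill(seq == pad_idx, float('-inf'))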

def create_triu_mask(sz):
  """ Method to create a square causal (triangular) mask. This is used for the tgt mask in the Transformer model to avoid looking ahead.
  Arguments
  ---------
  sz : int
     Length of the target sequence
  Returns
  -------
  mask : Tensor
      Float mask of shape sz x sz with 0 on and below the diagonal and -inf above it
  """
  #upper triangle (including the diagonal) of ones, of size sz x sz
  mask = torch.triu(torch.ones(sz, sz), diagonal=0)
  #transpose so the ones sit on and below the diagonal, and cast to float type
  mask = mask.transpose(0, 1).float()
  #PyTorch adds a float mask to the attention scores, so hidden positions must be -inf and visible ones 0
  mask = mask.masked_fill(mask == 0, float('-inf'))  # future positions (above the diagonal) become -inf
  mask = mask.masked_fill(mask == 1, float(0))  # visible positions (on and below the diagonal) become 0

  return mask

def tokenize_batch(source, targets,tokenizer):
  """ Method to tokenize a batch of data given a tokenizer.
  Arguments
  ---------
  source : List of String
     Source text
  targets : List of String
     Target text
  tokenizer : Tokenizer
     Tokenizer to use for tokenization
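  Returns
  -------
  tokenized_source : Tensor
      Token ids of the source text, padded or truncated to 120 tokens
  tokenized_targets : Tensor
      Token ids of the target text, padded or truncated to 120 tokens
  """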

  tokenized_source = tokenizer(source, padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  tokenized_targets = tokenizer(targets,  padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  return tokenized_source['input_ids'], tokenized_targets['input_ids']


In [None]:
create_triu_mask(5)

tensor([[0., -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0.]])
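
Why `-inf` rather than `0` for hidden positions? The mask is added to the attention scores before the softmax, and `exp(-inf)` is `0`, so a hidden position ends up with exactly zero attention weight. The sketch below shows this on a single row using only the standard library; `masked_softmax` and the score values are made up for illustration.

```python
import math

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores after adding an additive mask."""
    shifted = [s + m for s, m in zip(scores, mask_row)]
    top = max(shifted)
    exps = [math.exp(v - top) for v in shifted]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

NEG_INF = float('-inf')
# Second row of the 5x5 mask above: positions 0 and 1 are visible, the rest are hidden.
weights = masked_softmax([0.5, 0.5, 2.0, 3.0, 1.0], [0.0, 0.0, NEG_INF, NEG_INF, NEG_INF])
print(weights)  # [0.5, 0.5, 0.0, 0.0, 0.0]
```

Even though the hidden positions had the largest raw scores, they receive no weight, which is what prevents the decoder from looking ahead.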

### Training


In [None]:
from transformers import AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer=AutoTokenizer.from_pretrained('FacebookAI/xlm-roberta-base')
PAD_IDX = tokenizer.pad_token_id #for padding
BOS_IDX = tokenizer.bos_token_id #for beginning of sentence
EOS_IDX = tokenizer.eos_token_id #for end of sentence

model = TransformerModel(tokenizer.vocab_size, tokenizer.vocab_size, 512, 8, 3, 3, 256, 0.1).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

train_loader = torch.utils.data.DataLoader(training_set, batch_size=16, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=16, shuffle=False)

In [None]:
# for src, tgt in tqdm(train_loader):
#   src, tgt = tokenize_batch(src, tgt, tokenizer)
#   print(src)
#   print((src.size(0), src.size(1))) #0: batch, 1:seq
#   break

In [None]:
from torch.utils.data import DataLoader
from tqdm import tqdm

def train_epoch(model,train_loader,tokenizer):
    model.train()
    losses = 0

    for src, tgt in tqdm(train_loader):
        src, tgt = tokenize_batch(src, tgt, tokenizer)
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:,:-1] #decoder input: target without its last token

        src_mask = torch.zeros((src.size(1), src.size(1))).to(device) #create src_mask: a matrix of 0s of shape Sequence x Sequence, so nothing is hidden (see https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)
        tgt_mask = create_triu_mask(tgt_input.size(1)).to(device) #create triangular mask for target

        src_padding_mask = create_padding_mask(src).to(device) #create padding mask for src
        tgt_padding_mask = create_padding_mask(tgt_input).to(device) #create padding mask for tgt

        logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask) #pass it through model

        optimizer.zero_grad()

        tgt_out = tgt[:,1:] #expected output: target shifted one position to the left
        loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(train_loader) #average loss per batch


def evaluate(model,val_dataloader ):
    model.eval()
    losses = 0
    with torch.no_grad():
      for src, tgt in tqdm(val_dataloader):
          src, tgt = tokenize_batch(src, tgt, tokenizer)
          src = src.to(device)
          tgt = tgt.to(device)

          tgt_input = tgt[:,:-1] #decoder input: target without its last token

          #do the same as in train_epoch
          src_mask = torch.zeros((src.size(1), src.size(1))).to(device) #create src_mask: a matrix of 0s of shape Sequence x Sequence, so nothing is hidden (see https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)
          tgt_mask = create_triu_mask(tgt_input.size(1)).to(device) #create triangular mask for target

          src_padding_mask = create_padding_mask(src).to(device) #create padding mask for src
          tgt_padding_mask = create_padding_mask(tgt_input).to(device) #create padding mask for tgt

          logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask) #pass it through model

          tgt_out = tgt[:,1:] #expected output: target shifted one position to the left
          loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
          losses += loss.item()

    return losses / len(val_dataloader) #average loss per batch
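
Both functions rely on the same shift: the decoder is fed `tgt[:, :-1]` and is asked to predict `tgt[:, 1:]`, so at every step it sees the tokens so far and must produce the next one (teacher forcing). The toy example below uses plain lists instead of tensors. The token ids are invented, with `0`, `1` and `2` standing in for the BOS, padding and EOS ids.

```python
BOS, PAD, EOS = 0, 1, 2  # assumed special-token ids for this toy example

tgt = [
    [BOS, 11, 12, 13, EOS, PAD],
    [BOS, 21, 22, EOS, PAD, PAD],
]

tgt_input = [row[:-1] for row in tgt]  # what the decoder sees
tgt_out = [row[1:] for row in tgt]     # what it must predict at each step

for seen, wanted in zip(tgt_input[0], tgt_out[0]):
    print(f"decoder input {seen:>2} -> target {wanted:>2}")

# Targets equal to PAD are the ones a loss with ignore_index=PAD would skip.
counted = [[tok != PAD for tok in row] for row in tgt_out]
print(counted[1])  # [True, True, True, False, False]
```

This is why the loss is built with `ignore_index=PAD_IDX`: padded target positions carry no information and should not contribute to the average.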

Now we can start training! Keep in mind that this code is computationally demanding. It has been set to 10 epochs (which can take up to 6-8 hours), but feel free to change this value depending on your resources. In this case, the more epochs you can run, the better 😀

In [None]:
def train(model, epochs, train_loader, validation_loader):
  for epoch in range(1, epochs+1):
    train_loss = train_epoch(model, train_loader, tokenizer)
    val_loss = evaluate(model, validation_loader)
    print(f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}")

  #save model
  torch.save(model.state_dict(), "trained_model.pth")
  print("Model saved as 'trained_model.pth'")

train(model, 10, train_loader,validation_loader)

100%|██████████| 1250/1250 [13:20<00:00,  1.56it/s]
100%|██████████| 56/56 [00:10<00:00,  5.60it/s]


Epoch: 1, Train loss: 6.199, Val loss: 5.597


100%|██████████| 1250/1250 [13:27<00:00,  1.55it/s]
100%|██████████| 56/56 [00:09<00:00,  5.68it/s]


Epoch: 2, Train loss: 5.108, Val loss: 5.206


100%|██████████| 1250/1250 [13:27<00:00,  1.55it/s]
100%|██████████| 56/56 [00:09<00:00,  5.64it/s]


Epoch: 3, Train loss: 4.701, Val loss: 4.965


100%|██████████| 1250/1250 [13:28<00:00,  1.55it/s]
100%|██████████| 56/56 [00:09<00:00,  5.62it/s]


Epoch: 4, Train loss: 4.366, Val loss: 4.769


100%|██████████| 1250/1250 [13:28<00:00,  1.55it/s]
100%|██████████| 56/56 [00:09<00:00,  5.63it/s]


Epoch: 5, Train loss: 4.070, Val loss: 4.593


100%|██████████| 1250/1250 [13:28<00:00,  1.55it/s]
100%|██████████| 56/56 [00:09<00:00,  5.62it/s]


Epoch: 6, Train loss: 3.809, Val loss: 4.466


100%|██████████| 1250/1250 [13:28<00:00,  1.55it/s]
100%|██████████| 56/56 [00:09<00:00,  5.61it/s]


Epoch: 7, Train loss: 3.571, Val loss: 4.391


100%|██████████| 1250/1250 [13:27<00:00,  1.55it/s]
100%|██████████| 56/56 [00:09<00:00,  5.61it/s]


Epoch: 8, Train loss: 3.351, Val loss: 4.307


100%|██████████| 1250/1250 [13:27<00:00,  1.55it/s]
100%|██████████| 56/56 [00:09<00:00,  5.63it/s]


Epoch: 9, Train loss: 3.145, Val loss: 4.277


100%|██████████| 1250/1250 [13:29<00:00,  1.54it/s]
100%|██████████| 56/56 [00:09<00:00,  5.62it/s]


Epoch: 10, Train loss: 2.956, Val loss: 4.260
Model saved as 'trained_model.pth'
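
The step counts in the progress bars follow directly from the dataset and batch sizes, and together with the logged speed they give a rough time budget before you launch a longer run. A small back-of-the-envelope sketch (the speed is read off the log above and will differ on other hardware):

```python
import math

train_examples, val_examples, batch_size = 20_000, 890, 16

train_steps = math.ceil(train_examples / batch_size)  # 1250, as in the progress bar
val_steps = math.ceil(val_examples / batch_size)      # 56

its_per_second = 1.55  # taken from the training log; hardware dependent
minutes_per_epoch = train_steps / its_per_second / 60
print(train_steps, val_steps, round(minutes_per_epoch, 1))  # 1250 56 13.4
```

Scaling the training set up by some factor scales the time per epoch by roughly the same factor, which is a good reason to debug on a small slice first.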


### Testing


In this assignment, we will use three different evaluation metrics to measure our model's test performance: [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore), [METEOR](https://huggingface.co/spaces/evaluate-metric/meteor) and [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge). Please consult their Hugging Face documentation to learn how to use them.
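
To build some intuition for what an overlap-based metric measures, here is a deliberately simplified sketch of unigram precision, recall and F1, which is the idea behind ROUGE-1. It is not the library implementation: the real metric applies its own tokenization and offers further variants, and the function name and example sentences below are invented for illustration.

```python
from collections import Counter

def unigram_overlap(prediction, reference):
    """Precision, recall and F1 over shared unigrams (a simplified ROUGE-1)."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped matches
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = unigram_overlap("le chat noir dort", "le chat dort")
print(precision, recall, round(f1, 3))  # 0.75 1.0 0.857
```

Overlap metrics reward exact word matches only, whereas BERTScore compares contextual embeddings, so the two can disagree on paraphrases.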

In [None]:
# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = TransformerModel(tokenizer.vocab_size, tokenizer.vocab_size, 512, 8, 3, 3, 256, 0.1)
state_dict = torch.load("trained_model.pth", map_location=device)
model.load_state_dict(state_dict)
model = model.to(device)



In [None]:
from evaluate import load
bertscore = load("bertscore")
rouge = load('rouge')
meteor = load('meteor')

Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.02k [00:00<?, ?B/s]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Implement greedy decode as seen in class in the NLG slides.

In [None]:
# function to generate output sequence using greedy algorithm
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(device)
    src_mask = src_mask.to(device)
    memory = model.encode(src, src_mask) #pass src through encoder once and reuse the result at every step
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device) #start the output with the BOS token

    for i in range(max_len-1):
        tgt_mask = create_triu_mask(ys.size(1)).to(device) #create triangular mask
        out = model.decode(ys, memory, tgt_mask) #pass through decoder
        logits = model.fc(out[:, -1]) #vocabulary scores for the last position only

        _, next_word = torch.max(logits, dim=1) #greedy choice: the token with the highest score
        next_word = next_word.item()

        ys = torch.cat([ys,torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=1) #append the chosen token
        if next_word == EOS_IDX:
            break
    return ys

def translate(model: torch.nn.Module, src_sentence: str, tokenizer):
    model.eval()
    src, _ = tokenize_batch(src_sentence, "", tokenizer)
    src = src.to(device)
    num_tokens = src.shape[1]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.float).to(device)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len = int(num_tokens * 1.2), start_symbol=BOS_IDX).flatten()
    return tokenizer.decode(tgt_tokens, skip_special_tokens=True)
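
The control flow of greedy decoding is easier to see without the model in the way. In the sketch below a hand-written lookup table stands in for the decoder: at each step we pick the highest-scoring next token, append it, and stop at the end symbol or at `max_len`. The table, its scores and the function name are invented; a real model conditions on the whole prefix, not only on the last token.

```python
def toy_greedy_decode(next_token_scores, start, end, max_len):
    """Greedy decoding over a lookup table that stands in for the model."""
    sequence = [start]
    for _ in range(max_len - 1):
        scores = next_token_scores[sequence[-1]]  # scores for the next token
        best = max(scores, key=scores.get)        # arg-max = greedy choice
        sequence.append(best)
        if best == end:
            break
    return sequence

table = {
    "<s>": {"il": 0.6, "elle": 0.4},
    "il": {"dort": 0.7, "mange": 0.3},
    "dort": {"</s>": 0.9, "bien": 0.1},
}
decoded = toy_greedy_decode(table, "<s>", "</s>", max_len=10)
print(decoded)  # ['<s>', 'il', 'dort', '</s>']
```

Because each choice is locally optimal and never revisited, greedy decoding can get stuck repeating a word, which is one plausible reading of the repeated tokens in the sample translations.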

In [None]:
print(translate(model, "Hello how are you today",tokenizer))

comment sont aujourd hui


In [None]:
print(translate(model, "several years ago here at ted peter skillman introduced a design challenge called the marshmallow challenge",tokenizer))
test_set[0]

il y a ans à ted il y a des mots de design il s appelle une conception appelée appelée appelée le design


('several years ago here at ted peter skillman introduced a design challenge called the marshmallow challenge',
 'il y a plusieurs années ici à ted peter skillman a présenté une épreuve de conception appelée l épreuve du marshmallow')

In [None]:
import numpy as np
# you can also trim test_loader
def test(test_loader, model, tokenizer, device, max_length=200):
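  """Method to test our model using the BERTScore and METEOR metrics.
  Arguments
  ---------
  test_loader: iterable of (source, target) pairs
    Test set; here the list built by create_dataset is passed directly
  model: nn.Module
    trained Machine Translation model
  tokenizer: Tokenizer
    Tokenizer used to encode the source and decode the prediction
  device: torch.device
    Currently unused; translate relies on the global device
  max_length: int
    Currently unused
  Returns
  -------
  Tuple of float
    Average BERTScore precision, recall and F1, and average METEOR score
  """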
  precision = 0
  recall = 0
  f1 = 0
  meteor_metric = 0
  for src, target in test_loader:
    #Use translate method to evaluate our model
    prediction = translate(model, src, tokenizer)

    results_bert = bertscore.compute(predictions=[prediction], references=[target], lang="fr")
    results_meteor = meteor.compute(predictions=[prediction], references=[target])
    precision += results_bert["precision"][0] #get precision of results_bert
    recall += results_bert["recall"][0] #get recall of results_bert
    f1 += results_bert["f1"][0] #get f1 of results_bert
    meteor_metric+= results_meteor["meteor"] #get meteor metric of results_meteor
  return precision / len(test_loader), recall / len(test_loader), f1 / len(test_loader), meteor_metric / len(test_loader)

test(test_set, model, tokenizer, device)

(0.7562843798010297,
 0.7349983566553632,
 0.7452303648542717,
 0.3140614734935857)

## Results
avg precision = 75.63% \
avg recall = 73.50% \
avg f1 = 74.52% \
avg meteor = 31.41%

## Let's experiment!

1. Play with a hyperparameter of your choice to measure its effect on the translation.

2. Compare the results of your model with the performance of using the T5 pretrained model. This [tutorial](https://huggingface.co/docs/transformers/en/tasks/translation) on using T5 for machine translation might come in handy.

In [None]:
from transformers import AutoTokenizer, DataCollatorForSeq2Seq
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import Dataset


## T5 model

In [None]:
checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [None]:
train_data = [{"input_text": source, "target_text": target} for source, target in training_set]
val_data = [{"input_text": source, "target_text": target} for source, target in validation_set]
# test_data = [{"input_text": source, "target_text": target} for source, target in test_set]

def tokenize_data(data):
    inputs = tokenizer(data["input_text"], padding="max_length", max_length=120, truncation=True)
    labels = tokenizer(data["target_text"], padding="max_length", max_length=120, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

train_dataset = Dataset.from_list(train_data)
val_dataset = Dataset.from_list(val_data)
print(train_dataset[0])

tokenized_train = train_dataset.map(tokenize_data, batched=True, remove_columns=["input_text", "target_text"])
tokenized_eval = val_dataset.map(tokenize_data, batched=True, remove_columns=["input_text", "target_text"])
print(tokenized_train[0])

{'input_text': 'thank you so much chris and it s truly a great honor to have the opportunity to come to this stage twice i m extremely grateful', 'target_text': 'merci beaucoup chris c est vraiment un honneur de pouvoir venir sur cette scène une deuxième fois je suis très reconnaissant'}


Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/890 [00:00<?, ? examples/s]

{'input_ids': [2763, 25, 78, 231, 3, 524, 52, 159, 11, 34, 3, 7, 1892, 3, 9, 248, 3610, 12, 43, 8, 1004, 12, 369, 12, 48, 1726, 4394, 3, 23, 3, 51, 2033, 7335, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': [12947, 3933, 3, 524, 52, 159, 3, 75, 259, 7179, 73, 3, 23984, 20, 5969, 3, 7394, 244, 3, 922, 12739, 245, 3, 13438, 2529, 528, 3448, 1264, 18695, 29, 9, 10692, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
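
Two details of this data preparation are worth experimenting with. First, the linked tutorial prepends the task prefix `translate English to French: ` to every input, which the cell above does not do. Second, the labels here are padded with the pad id (`0`, as the output shows), so padded positions count toward the loss unless they are replaced by `-100`, the value the loss ignores. The sketch below shows both adjustments on plain Python data; the helper names are hypothetical, and the numbers reported in this notebook were obtained without these changes.

```python
PREFIX = "translate English to French: "  # task prefix used in the Hugging Face tutorial
T5_PAD_ID = 0                              # pad id visible in the tokenized example above
IGNORE_INDEX = -100                        # label value skipped by the loss

def add_prefix(pairs, prefix=PREFIX):
    """Turn (source, target) pairs into dicts whose input carries the task prefix."""
    return [{"input_text": prefix + src, "target_text": tgt} for src, tgt in pairs]

def mask_label_padding(label_ids, pad_id=T5_PAD_ID):
    """Replace padding ids in the labels so they do not contribute to the loss."""
    return [[tok if tok != pad_id else IGNORE_INDEX for tok in row] for row in label_ids]

rows = add_prefix([("thank you", "merci")])
labels = mask_label_padding([[12947, 1, 0, 0]])
print(rows[0]["input_text"])  # translate English to French: thank you
print(labels)                 # [[12947, 1, -100, -100]]
```

If you try this, expect the reported validation loss to change, since it would then be averaged over real tokens only.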

In [None]:
import numpy as np
from evaluate import load

metric = load("meteor")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"meteor": result["meteor"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

training_args = Seq2SeqTrainingArguments(
    output_dir="my_output_dir",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True, #change to bf16=True for XPU
    push_to_hub=False,
    logging_dir="./logs",
    logging_strategy="epoch",
    disable_tqdm=False,
    report_to="none",  # Disable W&B logging
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Downloading builder script:   0%|          | 0.00/7.02k [00:00<?, ?B/s]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


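| Epoch | Training Loss | Validation Loss | Meteor | Gen Len |
|-------|---------------|-----------------|--------|---------|
| 1 | 1.1097 | 0.375388 | 0.3201 | 17.1045 |
| 2 | 0.4104 | 0.359485 | 0.3469 | 17.2618 |
| 3 | 0.3901 | 0.351992 | 0.3548 | 17.2753 |
| 4 | 0.3799 | 0.348248 | 0.3591 | 17.2764 |
| 5 | 0.3722 | 0.345859 | 0.3606 | 17.282 |
| 6 | 0.3672 | 0.343877 | 0.3625 | 17.2742 |
| 7 | 0.3633 | 0.342815 | 0.3644 | 17.264 |
| 8 | 0.3611 | 0.341536 | 0.3651 | 17.2584 |
| 9 | 0.3592 | 0.341338 | 0.3659 | 17.2528 |
| 10 | 0.3581 | 0.341097 | 0.3654 | 17.2449 |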




TrainOutput(global_step=12500, training_loss=0.4471142529296875, metrics={'train_runtime': 2338.74, 'train_samples_per_second': 85.516, 'train_steps_per_second': 5.345, 'total_flos': 6344146944000000.0, 'train_loss': 0.4471142529296875, 'epoch': 10.0})

In [None]:
trainer.save_model("/model_dir")
tokenizer.save_pretrained("/model_dir")

('/model_dir/tokenizer_config.json',
 '/model_dir/special_tokens_map.json',
 '/model_dir/spiece.model',
 '/model_dir/added_tokens.json',
 '/model_dir/tokenizer.json')

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("/model_dir")
tokenizer = AutoTokenizer.from_pretrained("/model_dir")

In [None]:
from tqdm import tqdm
import torch

bertscore = load("bertscore")
meteor = load("meteor")

def translate(model, src, tokenizer, max_length=120):
    model.eval()
    inputs = tokenizer(src, return_tensors="pt", padding=True, truncation=True, max_length=max_length).to(model.device)
    with torch.no_grad():
        outputs = model.generate(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=max_length)
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translated_text

precision, recall, f1, meteor_metric = 0, 0, 0, 0
for src, target in tqdm(test_set, desc="Evaluating"):
    prediction = translate(model, src, tokenizer)

    results_bert = bertscore.compute(predictions=[prediction], references=[target], lang="fr")
    results_meteor = meteor.compute(predictions=[prediction], references=[target])
    precision += results_bert["precision"][0]
    recall += results_bert["recall"][0]
    f1 += results_bert["f1"][0]
    meteor_metric += results_meteor["meteor"]

print(precision / len(test_set), recall / len(test_set), f1 / len(test_set), meteor_metric / len(test_set))

Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Evaluating:   0%|          | 0/8597 [00:00<?, ?it/s]

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/714M [00:00<?, ?B/s]

Evaluating: 100%|██████████| 8597/8597 [1:54:00<00:00,  1.26it/s]

0.8658592592835135 0.8652147244919007 0.8653120232610158 0.5985146849368382





## Results
avg precision = 86.59% \
avg recall = 86.52% \
avg f1 = 86.53% \
avg meteor = 59.85%