# Fine-Tuning an NMT System

Today, fine-tuning is one of the main strategies for adapting a model (be it an LLM or a specialized model) to a particular domain or language. It makes it possible to adapt a generic model to a particular vocabulary, turn of phrase or syntactic construction. The aim of this lab is twofold:
* we want to see how a translation system can be fine-tuned;
* we want to identify the criteria on which fine-tuning can adapt a model.

In this lab, we consider translation between French and English using the mBART-50 model (`facebook/mbart-large-50-many-to-many-mmt`). The Hugging Face library provides all the tools needed to fine-tune a model (see, for instance, https://huggingface.co/docs/transformers/training or https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation.py).

## Testing whether the model can learn to translate new words

In [None]:
!pip install transformers
!pip install accelerate -U
!pip install sentencepiece
!pip install datasets

import random
import torch
from transformers import MBart50Tokenizer, MBartForConditionalGeneration

In [24]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt").to(device)
tokenizer = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="fr_XX", tgt_lang="en_XX")


The steps are as follows:

1. Generate 100 French (or French-like) words with their English (or English-like) translations.

2. Insert each word at position 2 in French sentences, and its translation at the same position in the corresponding English sentences, for a total of 1000 sentence pairs.

3. Create a test_set of 100 sentences (one occurrence of each generated word) and a train_set of 900 sentences. Each example in the dataset has 3 keys (input_ids, attention_mask, labels), where input_ids is the encoded French sentence (PyTorch tensor) and labels is the encoded English sentence.

4. Fine-tune the model so that it learns to translate the generated French words into their generated English translations. After fine-tuning, the proportion of correctly predicted target words goes from 4% (before fine-tuning) to 96%, which indicates that the model has learnt the new word pairs very well.

1. Generate a list of 100 French words and translate them into English (e.g. by randomly swapping the letters of existing words, or by concatenating subword units). It is essential to ensure that the generated words do not appear in the vocabulary and that they are segmented into several sub-tokens.

In [1]:
from google.colab import drive
drive.mount("content/")

Mounted at content/


In [3]:
import pandas as pd
df = pd.read_csv("/content/content/MyDrive/Multilingual_NLP/Lab_6/Celine270test.csv", sep='\t')
df = df[df['source'] != '"']
#Reset indices in df
df = df.reset_index(drop=True)
df

Unnamed: 0,source,target
0,"La sienne, c'est vrai, elle se présentait un p...","Theirs was true, it looked a little better, bu..."
1,Moi j'ai fait ça tout de suite très mal.,I did it very badly right away.
2,Lui c'était par un cargo qu'il était arrivé.,He had arrived by a cargo ship.
3,"Ma façon, c'était pas beaucoup.",My way was not much.
4,"On peut pas tout faire!... Moi, c'est l'apéro ...",You can't do everything!... I prefer the apert...
...,...,...
165,"Je sais, Paul, où il est allé.","I know, Paul, where he went."
166,"Je lui ai demandé, ces pommes-là, à qui il ava...","I asked him, these apples, to whom he intended..."
167,"Je lui ai demandé à qui, ces pommes-là, il ava...","I asked him who, those apples, he intended to ..."
168,"Je lui ai demandé, ces pommes-là, où il les av...","I asked him, these apples, where he found them."


In [20]:
def generate_rotations(word, n):
  rotations = [word]  # Start with the original word

  # Build up to n - 1 rotations (n words in total, counting the original)
  for i in range(1, min(n, len(word))):
    # Create a new word by moving the last i characters to the front
    rotated_word = word[-i:] + word[:-i]
    rotations.append(rotated_word)

  return rotations

generate_rotations("cat", 2)

['cat', 'tca']
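
Rotation is only one way of producing pseudo-words. The other option mentioned in the instructions, randomly swapping letters, could be sketched as follows. The function name `swap_letters` and its parameters (`n_swaps`, `seed`, `max_tries`) are assumptions made for this illustration; they are not part of the lab code.

```python
import random

def swap_letters(word, n_swaps=2, seed=None, max_tries=20):
    """Return a pseudo-word built by swapping random pairs of letters in `word`."""
    rng = random.Random(seed)
    if len(set(word)) < 2:
        return word  # a single repeated letter cannot be permuted into a new word
    for attempt in range(max_tries):
        letters = list(word)
        for _ in range(n_swaps):
            # Pick two distinct positions and exchange their letters
            i, j = rng.sample(range(len(letters)), 2)
            letters[i], letters[j] = letters[j], letters[i]
        candidate = "".join(letters)
        if candidate != word:  # retry if the swaps cancelled each other out
            return candidate
    return word

pseudo = swap_letters("chaton", seed=0)
print(pseudo)
```

As with the rotations, each pseudo-word produced this way would still need to be checked with `tokenizer.tokenize` to confirm that it is split into at least two sub-tokens.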

In [26]:
def make_verif(df, word_list, tokenized_list):
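  # Ensure the words don't already occur in the source sentences of the corpus
  for word in word_list:
    # regex=False: match the word literally rather than as a regular expression
    if df['source'].str.contains(word, regex=False).any():
        print(f'Word "{word}" already in vocabulary.')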

  # Check that they are tokenized in several subtokens
  for word in tokenized_list:
    if len(word) < 2:
      print(f"{word} has only one token")

make_verif(df, ["chat"], [tokenizer.tokenize("chat")])

Word "chat" already in vocabulary.
['▁chat'] has only one token


In [7]:
from itertools import chain
fr_list = ["chaton", "tabourets", "ânesse", "ventriloque", "luminosité", "écriture", "grammaire", "chevalet", "poudre", "édredon", "volet", "dictionnaire", "profondeur", "briquet", "armée"]
en_list = ["kitten", "stools", "donkey", "ventriloquist", "brightness", "handwriting", "grammar", "easel", "dust", "quilt", "shutter", " dictionary", "depth", "lighter", "army"]

# Generate rotations of each word (as many per pair as the shorter word has letters), keeping at most 100
shuffled_fr_list = list(chain.from_iterable([generate_rotations(fr_word, len(en_word)) for fr_word, en_word in zip(fr_list, en_list)]))[:100]
print("French list: ", len(shuffled_fr_list))

shuffled_en_list = list(chain.from_iterable([generate_rotations(en_word, len(fr_word)) for fr_word, en_word in zip(fr_list, en_list)]))[:100]
print("English list: ", len(shuffled_en_list))

# Tokenize each word to make sure there are at least two subtokens
tokenized_fr = [tokenizer.tokenize(word) for word in shuffled_fr_list]
tokenized_en = [tokenizer.tokenize(word) for word in shuffled_en_list]

# Make verifications
make_verif(df, fr_list, tokenized_fr)
make_verif(df, en_list, tokenized_en)

French list:  100
English list:  100


2. Insert each generated pair of words into 10 sentences and their translations. This gives a test corpus of 100 sentences containing the 100 new words and a training corpus of 900 sentences.

In [8]:
import random

# get 10 different sentence patterns randomly selected
random_indices = random.sample(range(len(df)), 10)
selected_src = [df.at[id, 'source'] for id in random_indices]
selected_tgt = [df.at[id, 'target'] for id in random_indices]

# insert the generated words into these sentences
modified_src = []
words = []
modified_tgt = []
trads = []

for word, trad in zip(shuffled_fr_list, shuffled_en_list):
    for src, tgt in zip(selected_src, selected_tgt):
        # Split the sentence into tokens and insert the word after the first 2 tokens
        # (to avoid issues due to future truncation).
        # Note: `src_tokens[1:]` repeats the second token after the inserted word,
        # as the output below shows; `src_tokens[2:]` would avoid the duplication.
        src_tokens = src.split()
        modified_src.append(' '.join(src_tokens[:2] + [word] + src_tokens[1:]))
        # Save the added words for the later evaluation
        words.append(word)

        # Same insertion on the target side (the second token is repeated here too)
        tgt_tokens = tgt.split()
        modified_tgt.append(' '.join(tgt_tokens[:2] + [trad] + tgt_tokens[1:]))
        # Save the added translations for the later evaluation
        trads.append(trad)

print(len(modified_src))
print(len(modified_tgt))

for src, tgt in zip(modified_src[:5], modified_tgt[:5]):
  print(src)
  print(tgt)
  print()

1000
1000
"D'abord il chaton il est survenu du chômage, de la crise dans les environs et ça c'est le plus mauvais.
"First there kitten there was unemployment, the crisis in the vicinity and that's the worst.

Je lui chaton lui ai demandé où, ces pommes-là, il les avait trouvées.
I asked kitten asked him where he had found those apples.

"Trahir, qu'on chaton qu'on dit, c'est vite dit.
"Training, as kitten as one says, is quickly said.

Qui c'est chaton c'est qui le payera?
Who will kitten will pay for it?

Libre, je chaton je ne le suis à peu près jamais.
Free, I'm kitten I'm almost never.
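
In the sentences printed above, the second token appears twice (for example "il chaton il"), because the cell resumes from index 1 after the inserted word. A minimal sketch of an insertion that does not repeat any token is given below. The helper name `insert_at` and the two example sentences are assumptions made for illustration only.

```python
def insert_at(sentence, word, position=2):
    """Insert `word` after the first `position` whitespace-separated tokens."""
    tokens = sentence.split()
    # Resume from `position` (not `position - 1`) so that no token is duplicated
    return " ".join(tokens[:position] + [word] + tokens[position:])

# Illustrative sentence pair (not taken from the corpus)
src = "Je lui ai demandé où il les avait trouvées."
tgt = "I asked him where he had found them."

print(insert_at(src, "chaton"))  # Je lui chaton ai demandé où il les avait trouvées.
print(insert_at(tgt, "kitten"))  # I asked kitten him where he had found them.
```

The inserted word still lands at position 2 on both sides, so the evaluation that looks for the target word in the translation would not need to change.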



### Datasets

In [13]:
# Testing the unzip function
train_data = [(1, 'a', 'X'), (2, 'b', 'Y'), (3, 'c', 'Z')]
numbers, letters, symbols = zip(*train_data)
numbers

(1, 2, 3)

In [14]:
from datasets import Dataset

def create_dataset(sources, targets):
    dataset = []
    tokenized_sources = tokenizer(sources, padding=True, truncation=True, return_tensors="pt")
    tokenized_targets = tokenizer(targets, padding=True, truncation=True, return_tensors="pt")

    # Move tensors to the same device as the model
    tokenized_sources = {key: value.to(device) for key, value in tokenized_sources.items()}
    tokenized_targets = {key: value.to(device) for key, value in tokenized_targets.items()}

    for source, target in zip(tokenized_sources["input_ids"], tokenized_targets["input_ids"]):
        dataset.append({
            "input_ids": source,
            # 1 for real tokens, 0 for padding positions
            "attention_mask": [1 if token != tokenizer.pad_token_id else 0 for token in source],
            "labels": target,
        })

    return Dataset.from_list(dataset)

# Create the combined modified dataset
combined_data = list(zip(modified_src, modified_tgt))

# Keep the first of the 10 sentences built for each generated word for the test dataset
test_data = [elt for i, elt in enumerate(combined_data) if i%10 ==0]
test_src, test_tgt = zip(*test_data)

# Separate the remaining data for the training corpus
train_data = [elt for i, elt in enumerate(combined_data) if i%10 != 0]
train_src, train_tgt = zip(*train_data)

# Use the create_dataset function to create datasets
test_dataset = create_dataset(test_src, test_tgt)
train_dataset = create_dataset(train_src, train_tgt)

# Print the lengths of the datasets
print("Test Dataset Length:", len(test_dataset))
print("Training Corpus Length:", len(train_dataset))

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Test Dataset Length: 100
Training Corpus Length: 900
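
In `create_dataset`, the padded positions of `labels` keep the padding id (1, the value tested when building the attention mask), so they contribute to the training loss. A common convention with Hugging Face sequence-to-sequence models is to replace them with -100, which PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists; the helper name `mask_padding` and the token ids in the example are made up for illustration.

```python
PAD_ID = 1           # padding id used for the attention mask in create_dataset
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_padding(label_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace padding ids in each label sequence so the loss skips them."""
    return [
        [ignore_index if tok == pad_id else tok for tok in seq]
        for seq in label_ids
    ]

# Two padded label sequences with made-up token ids
labels = [[5, 87, 2, 1, 1],
          [5, 87, 99, 6, 2]]

print(mask_padding(labels))
```

This is only a sketch of the convention, not a change applied to the results reported in this notebook.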


In [15]:
print(train_tgt)

('Even he, kitten he, the eagle to his Joséphine! The fire on the train, it is the case to say it against and against everything.', 'He slept, kitten slept, the cat.', '"Well, you kitten you know a lot about it, Curé, about young people who are like this?... I don\'t know!...', '"To be kitten be alone is to train to die.', 'He was kitten was still a comedian.', "From the kitten the sun, that's for sure, there was always the same, as if you were to hear a large boiler always in full shape and then, below, still the sun and these senseless trees....", 'Theirs was kitten was true, it looked a little better, but not much better.', 'This team, kitten team, the amateurs have often suffered because of it.', 'Have a kitten a seat... I\'m not going well either..." -- I\'m just making a little round, which I replied, to give me a hold.', 'Even he, nkitte he, the eagle to his Joséphine! The fire on the train, it is the case to say it against and against everything.', 'He slept, nkitte slept, the 

In [16]:
test_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 100
})

### Translation

3. Are the new words correctly translated?

In [17]:
import csv
from tqdm import tqdm

# Function to translate sentences using the MBart model
def translate_sentences(model, sentences, file_name):
    translated_sentences = []

    with open(file_name, "w", newline="", encoding="utf-8") as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow(['Source', 'Translation'])

        for src in tqdm(sentences, desc="Translating"):
            encoded = tokenizer(src, return_tensors="pt")
            encoded =  {key: value.to(device) for key, value in encoded.items()}
            # Generate the translation of this sentence, forcing English as the target language
            # (optional arguments: max_length=1024, num_beams=4, length_penalty=2.0, early_stopping=True)
            generated = model.generate(**encoded, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
            # batch_decode returns a list, so `trad` is a one-element list of strings
            trad = tokenizer.batch_decode(generated, skip_special_tokens=True)

            # Write the translations to the CSV file
            csv_writer.writerow([src, trad])
            translated_sentences.append(trad)

    return translated_sentences

In [None]:
# Translate train dataset
train_src_sentences = train_dataset["input_ids"]
train_src_sentences = tokenizer.batch_decode(train_src_sentences, skip_special_tokens=True)
train_translated = translate_sentences(model, train_src_sentences, "/content/content/MyDrive/Multilingual_NLP/Lab_6/train_translations.csv")

for i in range(5):
    print(f"Source: {train_src_sentences[i]}")
    print(f"Translation: {train_translated[i]}\n")

Translating: 100%|██████████| 900/900 [11:54<00:00,  1.26it/s]

Source: Cette remarque qu'elle lui faisait, en douce, c'était pour chaton moi.
Translation: ['The gentle remark she made to him was for me.']

Source: "je les reconnais, ces airs-là moi, c'est les mêmes qu'on jouait à Détroit chez chaton Molly...
Translation: ['"I recognize them, these airs-that\'s me, they\'re the same ones that were played in Detroit at Molly\'s cat...']

Source: "Son médecin préféré à elle, c'est chaton Frolichon.
Translation: ['"Her favorite doctor is Frolichon.']

Source: "On ne sait pas ce que c'est que de revenir et d'attendre quelque chose tant qu'on n'a pas observé ce que peuvent attendre et revenir les pauvres qui espèrent une chaton pension.
Translation: ['"We don\'t know what it is to come back and wait for something until we see what the poor can expect and come back and hope for a little pension.']

Source: "Ce qui gênerait c'est plutôt leur odeur de poussière, qui vous retient par le bout du chaton nez.
Translation: ['"What would bother you is rather the




In [17]:
def evaluate_prediction(target_trads, translated):
  correctly_predicted = 0
  # Iterate over the translated sentences and the target words
  for i, (target, trad) in enumerate(zip(target_trads, translated)):
    # Print a portion of them to check
    if 20<= i <= 40:
      print("Target: ", target, "Trad: ", trad)
    # Increment the counter if the target word is in the translation
    if target in trad:
      correctly_predicted += 1

  print(f"{correctly_predicted/len(target_trads) * 100}% correctly predicted.")
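
Because `evaluate_prediction` uses a substring test (`target in trad`), a target such as `quistventrilo` is counted as correct when the model outputs the longer string `quistventriloqu`. A stricter, whole-word variant could look like the sketch below. The function names `contains_word` and `word_accuracy` are assumptions for this illustration, and the example translations are shortened from the outputs shown in this notebook.

```python
import re

def contains_word(target, translation):
    """True if `target` occurs in `translation` as a whole word."""
    # Lookarounds reject matches that are glued to other word characters
    pattern = r"(?<!\w)" + re.escape(target) + r"(?!\w)"
    return re.search(pattern, translation) is not None

def word_accuracy(targets, translations):
    """Percentage of translations that contain their target as a whole word."""
    hits = sum(contains_word(t, s) for t, s in zip(targets, translations))
    return 100 * hits / len(targets)

targets = ["enkitt", "quistventrilo"]
translations = ["To be enkitt, enkitt to the large cemetery",
                "To be quistventriloqu was going to the large ighterl"]

print("quistventrilo" in translations[1])     # substring test accepts the near miss
print(contains_word("quistventrilo", translations[1]))
print(word_accuracy(targets, translations))
```

The whole-word score can only be equal to or lower than the substring score, so it gives a more conservative view of how well the new words are learnt.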

In [19]:
# Retrieve all translations
train_translated = []
with open("/content/content/MyDrive/Multilingual_NLP/Lab_6/train_translations.csv", 'r', newline="", encoding="utf-8") as f:
  reader = csv.DictReader(f)
  for row in reader:
    train_translated.append(row['Translation'])


# Get the list of the target words
train_generated_trads = [elt for i, elt in enumerate(trads) if i%10 != 0]

# Evaluate the translations compared to the target sentences
evaluate_prediction(train_generated_trads, train_translated)

Target:  enkitt Trad:  ['"Her favorite doctor is onchat Frolichon.']
Target:  enkitt Trad:  ['"We don\'t know what it is to come back and wait for something until we see what the poor can expect and come back and hope for a pension.']
Target:  enkitt Trad:  ['"What would bother you is rather their smell of dust, which pulls you off the end of your nose.']
Target:  enkitt Trad:  ['"Life is that, a light that ends in a night\'s shopping.']
Target:  enkitt Trad:  ['Have a seat... I\'m not going well either..." -- I\'m just going to make a little tour, which I replied, to give me an onchat container.']
Target:  enkitt Trad:  ['"The main thing for me was that she listened to my advice well and that she took on the most important part of it.']
Target:  enkitt Trad:  ['All of this is a regular purchase.']
Target:  tenkit Trad:  ['This remark she made to him, gently, was for toncha me.']
Target:  tenkit Trad:  ['"I recognize them, these airs-that\'s me, they\'re the same ones that were played 

### Fine-tuning

4. Fine-tune the model on the training corpus. Are the words translated correctly after this step?

In [18]:
from transformers import Trainer, TrainingArguments
# load_metric is deprecated in recent versions of datasets; evaluate.load is its replacement
from datasets import load_metric
import numpy as np

metric = load_metric('accuracy')

def compute_metrics(p):
    # p unpacks into the model outputs and the reference label ids
    predictions, labels = p
    labels = labels.flatten()
    # predictions[0] holds the logits: keep the most likely token id at each position
    predictions = np.argmax(predictions[0], axis=-1).flatten()
    # Token-level accuracy over all positions, padding included
    return metric.compute(predictions=predictions, references=labels)
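
Since `compute_metrics` flattens every position, padded label positions are counted in the accuracy. One way to restrict the score to real tokens is sketched below in plain Python. The function name `masked_token_accuracy` and the token ids are assumptions for illustration; the padding id 1 is the value used for the attention mask earlier in this notebook.

```python
def masked_token_accuracy(predicted_ids, label_ids, pad_id=1):
    """Token-level accuracy computed only where the label is not padding."""
    correct = 0
    total = 0
    for pred_seq, label_seq in zip(predicted_ids, label_ids):
        for pred, label in zip(pred_seq, label_seq):
            if label == pad_id:
                continue  # skip padded positions
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0

# Made-up token ids: the first label sequence ends with one padding position
preds = [[5, 7, 2, 1], [5, 9, 8, 2]]
labels = [[5, 7, 2, 1], [5, 9, 4, 2]]

print(masked_token_accuracy(preds, labels))  # 6 correct out of 7 non-padding tokens
```

Token-level accuracy remains a coarse signal for translation quality; the word-level check performed by `evaluate_prediction` is the measure that answers the question asked in this lab.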


In [None]:
import os
os.environ["TF_GPU_ALLOC_MEM_SOFT_LIMIT"] = "0"

In [19]:
torch.cuda.empty_cache()
batch_size = 2

training_args = TrainingArguments(
    output_dir="/content/content/MyDrive/Multilingual_NLP/Lab_6",
    evaluation_strategy="epoch",
    save_strategy='epoch',
    learning_rate=0.0001,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.0407,2.50623,0.544182
2,0.0103,2.820661,0.504978
3,0.0001,3.823429,0.419415
4,0.0,3.80412,0.437772
5,0.0,3.850548,0.435283


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].


TrainOutput(global_step=2250, training_loss=0.059193138509833564, metrics={'train_runtime': 1463.9289, 'train_samples_per_second': 3.074, 'train_steps_per_second': 1.537, 'total_flos': 571411169280000.0, 'train_loss': 0.059193138509833564, 'epoch': 5.0})
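
The `global_step=2250` reported above can be checked by hand from the training configuration. The sketch below assumes a single device and no gradient accumulation (neither is set explicitly in `TrainingArguments` in this notebook); the helper name `expected_steps` is hypothetical.

```python
import math

def expected_steps(n_examples, batch_size, epochs):
    """Number of optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs

# 900 training sentences, batch size 2, 5 epochs
print(expected_steps(900, 2, 5))  # 2250
```

With 450 steps per epoch, each of the 5 evaluation rows in the table corresponds to one full pass over the 900 training sentences.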

In [20]:
# Translate test dataset
# Load the fine-tuned model
fine_tuned_model = trainer.model

test_src_sentences = test_dataset["input_ids"]
test_src_sentences = tokenizer.batch_decode(test_src_sentences, skip_special_tokens=True)
test_translated = translate_sentences(fine_tuned_model, test_src_sentences, "/content/content/MyDrive/Multilingual_NLP/Lab_6/test_translations.csv")

# Print translations for a few examples
for i in range(5):
    print(f"Source: {test_src_sentences[i]}")
    print(f"Translation: {test_translated[i]}\n")

Translating: 100%|██████████| 100/100 [02:52<00:00,  1.72s/it]

Source: "Elle s'en chaton s'en irait au grand cimetière d'à côté d'abord la tante, où les morts c'est comme une foule qui attend.
Translation: ['"To be kitten was on the train to the large ttershu be alone, where the people die as if it were to be like a fougle who slept.']

Source: "Elle s'en nchato s'en irait au grand cimetière d'à côté d'abord la tante, où les morts c'est comme une foule qui attend.
Translation: ['"To be nkitte slept on the large ighterl slept on the large ighterl slept on the large ighterl, which was on the side of the catt, where the diet is like a fougle who waits.']

Source: "Elle s'en onchat s'en irait au grand cimetière d'à côté d'abord la tante, où les morts c'est comme une foule qui attend.
Translation: ['"To be enkitt, enkitt to the large cemetery of the catt, where the diet is like a fougle who waits.']

Source: "Elle s'en toncha s'en irait au grand cimetière d'à côté d'abord la tante, où les morts c'est comme une foule qui attend.
Translation: ['"To be te




In [18]:
import csv
# Retrieve all translations
test_translated = []
with open("/content/content/MyDrive/Multilingual_NLP/Lab_6/test_translations.csv", 'r', newline="", encoding="utf-8") as f:
  reader = csv.DictReader(f)
  for row in reader:
    test_translated.append(row['Translation'])

# Get the list of the target words
test_generated_trads = [elt for i, elt in enumerate(trads) if i%10 == 0]

# Evaluate the translations compared to the target sentences
evaluate_prediction(test_generated_trads, test_translated)

Target:  stventriloqui Trad:  ['"To be stventriloqui was on the train to the large ighterl was on the large ighterl the large boiler of the cat, where the deaths are like a fougle who waits.']
Target:  istventriloqu Trad:  ['"To be istventriloqu was going to the large ighterl was going to the large cemetery of the cat, where the diet is like a fougle who waits.']
Target:  uistventriloq Trad:  ['"To be uistventriloq be slept on the large ttergle slept on the large ttergle to the large cemeter of the cat, where the deaths are like a fougle who waits.']
Target:  quistventrilo Trad:  ['"To be quistventriloqu was going to the large ighterl was slept to the large cemetery of the cat, where the deaths are like a fougle who waits.']
Target:  oquistventril Trad:  ['"To be oquistventril slept on the large tterril slept on the large tterril slept on the large tterril, where the deaths are like a fougle who waits.']
Target:  loquistventri Trad:  ['"To be loquistventril slept on the large ighterl s