## Transformers for translation 🙊


Have you ever wondered how applications like Google Translate or language translation features in social media platforms work? Behind these impressive technologies are sophisticated machine learning models that can understand and translate text between different languages. One of the most powerful and groundbreaking models used for this purpose is the Transformer model.

In this assignment, you will step into the shoes of an AI researcher and engineer to create your own Transformer model for translating text from English to French. This journey will not only enhance your understanding of machine learning and deep learning but also give you hands-on experience with state-of-the-art techniques in natural language processing.

Let's start by installing the libraries we need.

In [None]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install bert_score
!pip install rouge_score


For this assignment we are using the IWSLT2017 dataset (read more about it [here](https://huggingface.co/datasets/IWSLT/iwslt2017)). This dataset is readily available on the Hugging Face Hub and is a good fit for our machine translation task.

In [None]:
from datasets import load_dataset

dataset = load_dataset("IWSLT/iwslt2017",'iwslt2017-en-fr')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/18.5k [00:00<?, ?B/s]

iwslt2017.py:   0%|          | 0.00/8.17k [00:00<?, ?B/s]

The repository for IWSLT/iwslt2017 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/IWSLT/iwslt2017.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


en-fr.zip:   0%|          | 0.00/27.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/232825 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8597 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/890 [00:00<?, ? examples/s]

To get an idea of the data, let's take a quick peek at what our dataset looks like.

In [None]:
dataset['train']['translation'][0]

{'en': "Thank you so much, Chris. And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.",
 'fr': "Merci beaucoup, Chris. C'est vraiment un honneur de pouvoir venir sur cette scène une deuxième fois. Je suis très reconnaissant."}

Since we don't want to spend 8 hours training, let's trim our dataset a bit. This might lead to underperformance, so feel free to use the complete dataset if you have the computing power.

SUGGESTION: start with a small dataset to debug your code and increase it gradually (the same principle applies to the number of epochs, batch size, test set size...).

In [None]:
# Trimming the dataset to a smaller size for debugging and faster prototyping
trimmed_dataset = dataset['train']['translation'][:100000]

# Print a sample to verify
print(trimmed_dataset[0])

{'en': "Thank you so much, Chris. And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.", 'fr': "Merci beaucoup, Chris. C'est vraiment un honneur de pouvoir venir sur cette scène une deuxième fois. Je suis très reconnaissant."}
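
One way to follow the suggestion above is to loop over progressively larger prefixes of the data while debugging. The sketch below is only an illustration: `growing_subsets` and `toy_pairs` are hypothetical names, and the toy list stands in for `dataset['train']['translation']` so that the snippet runs without downloading anything.

```python
def growing_subsets(pairs, sizes):
    """Yield progressively larger prefixes of `pairs`, one per requested size."""
    for n in sizes:
        # Never ask for more examples than are available
        yield pairs[:min(n, len(pairs))]

# Toy stand-in for the real list of {'en': ..., 'fr': ...} dictionaries
toy_pairs = [{"en": f"sentence {i}", "fr": f"phrase {i}"} for i in range(1000)]

for subset in growing_subsets(toy_pairs, sizes=(10, 100, 5000)):
    # In a real run, this is where a short training/evaluation pass would go
    print(f"debugging with {len(subset)} examples")
```

Once the pipeline runs end to end on the smallest subset, the same loop can be pointed at larger sizes.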


### Preprocessing


As in our previous assignments, preprocessing is an essential part of any NLP task.

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import re
from nltk.corpus import stopwords

def preprocess_data(text):
    """ Clean a sentence by removing noise and standardizing its form.
        The preprocessing converts to lowercase, removes punctuation, drops
        tokens that are not purely alphabetic, and removes English stopwords.
    Arguments
    ---------
    text : String
        Text to clean
    Returns
    -------
    text : String
        Cleaned text
    """

    # Make everything lower case
    text = text.lower()

    # Remove newline characters
    text = text.replace('\n', ' ')

    # Replace punctuation and special characters with spaces
    text = re.sub(r'[^\w\s]', ' ', text)

    # Keep only purely alphabetic tokens (drops numbers and empty strings)
    text = ' '.join([word for word in text.split(" ") if word.isalpha()])

    # Remove stopwords; note that the English list is applied to every input,
    # including the French target sentences
    stop_words = set(stopwords.words('english'))
    text = ' '.join([word for word in text.split() if word not in stop_words])

    return text
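
To see what each cleaning step does, the sketch below applies the same regex and filtering logic to a single sentence and returns every intermediate result. It is an illustration only: `clean_steps` is a hypothetical helper, and `TOY_STOPWORDS` is a tiny hand-written set used in place of the NLTK list so that the snippet runs without any download.

```python
import re

# Hypothetical mini stopword list, standing in for stopwords.words('english')
TOY_STOPWORDS = {"and", "it", "s", "a", "to", "the", "i", "m", "you", "so"}

def clean_steps(text, stop_words=TOY_STOPWORDS):
    """Return the text after each of the four cleaning stages."""
    lowered = text.lower().replace("\n", " ")
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    alpha_only = " ".join(w for w in no_punct.split(" ") if w.isalpha())
    no_stop = " ".join(w for w in alpha_only.split() if w not in stop_words)
    return lowered, no_punct, alpha_only, no_stop

steps = clean_steps("It's truly a great honor, Chris; I came here 2 times!")
labels = ("lowercase", "no punctuation", "letters only", "no stopwords")
for label, value in zip(labels, steps):
    print(f"{label:>15}: {value}")
```

Note how the contraction `It's` is split into `it` and `s` once the apostrophe is replaced by a space, and how the number `2` disappears in the letters-only step.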


For an easier training structure, it is useful to format our training and validation sets. The following function should help with this.

In [None]:
import re
from nltk.corpus import stopwords
def create_dataset(dataset, source_lang, target_lang):
    """
    Method to create a dataset from a list of text.

    Arguments
    ---------
    dataset : List of Dict
        List of dictionary objects with source and target text
    source_lang : String
        Source language key in the dataset
    target_lang : String
        Target language key in the dataset

    Returns
    -------
    new_dataset : List of Tuples
        Cleaned source and target text in format (source, target)
    """
    new_dataset = []
    for example in dataset:
        # Extract source and target text
        source_text = example.get(source_lang, "")
        target_text = example.get(target_lang, "")

        # Preprocess source and target text
        clean_source = preprocess_data(source_text)
        clean_target = preprocess_data(target_text)

        # Append to the dataset
        new_dataset.append((clean_source, clean_target))
    return new_dataset

# Applying the preprocessing and formatting the training, validation, and test sets
training_set = create_dataset(trimmed_dataset, 'en', 'fr')
validation_set = create_dataset(dataset['validation']['translation'], 'en', 'fr')
test_set = create_dataset(dataset['test']['translation'], 'en', 'fr')
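
Because the cleaning step can strip every token from a very short sentence, some pairs may end up with an empty source or target. A possible follow-up, which is not part of the original notebook, is to drop such pairs before training. The helper name `drop_empty_pairs` and the toy data are assumptions for illustration.

```python
def drop_empty_pairs(pairs):
    """Keep only (source, target) tuples where both sides contain text."""
    return [(src, tgt) for src, tgt in pairs if src.strip() and tgt.strip()]

# Toy (source, target) tuples in the format returned by create_dataset
toy_set = [
    ("marshmallow top", "le marshmallow doit être placé au sommet"),
    ("", "merci"),          # source became empty after cleaning
    ("thank", "   "),       # target contains only whitespace
]

filtered = drop_empty_pairs(toy_set)
print(f"kept {len(filtered)} of {len(toy_set)} pairs")
```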

# **T5**

In [None]:
import torch
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
# Download nltk data
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
def preprocess_t5_input_for_training(dataset, source_lang="English", target_lang="French"):
    """
    Prepare input sentences for T5 training.
    Args:
        dataset: List of tuples (source_sentence, target_sentence)
        source_lang: Name of the source language.
        target_lang: Name of the target language.

    Returns:
        List of preprocessed sentences and corresponding targets.
    """
    inputs = [f"translate {source_lang} to {target_lang}: {src}" for src, tgt in dataset]
    targets = [tgt for src, tgt in dataset]
    return inputs, targets

# Preprocessing training, validation, and test data
train_inputs, train_targets = preprocess_t5_input_for_training(training_set)
val_inputs, val_targets = preprocess_t5_input_for_training(validation_set)
test_inputs, test_targets = preprocess_t5_input_for_training(test_set)
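
To make the input format concrete, here is a minimal sketch of the same prefixing logic applied to one toy pair. `add_t5_prefix` is a hypothetical stand-in for the function above, written out so that the snippet is self-contained.

```python
def add_t5_prefix(pairs, source_lang="English", target_lang="French"):
    """Prepend the T5 task prefix to every source sentence."""
    prefix = f"translate {source_lang} to {target_lang}: "
    inputs = [prefix + src for src, _ in pairs]
    targets = [tgt for _, tgt in pairs]
    return inputs, targets

# One toy (source, target) tuple in the format produced by create_dataset
toy_pairs = [("marshmallow top", "le marshmallow doit être placé au sommet")]

inputs, targets = add_t5_prefix(toy_pairs)
print(inputs[0])   # translate English to French: marshmallow top
print(targets[0])  # le marshmallow doit être placé au sommet
```

The prefix is how a T5 checkpoint is told which task to perform, so the source text passed to the model should carry it.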


In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the T5 model and tokenizer
model_name = "t5-small"  # or "t5-base" for a larger model
t5_model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
t5_tokenizer = T5Tokenizer.from_pretrained(model_name)

print("T5 model and tokenizer loaded successfully!")

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


T5 model and tokenizer loaded successfully!


In [None]:
# Function to translate sentences using T5
def translate_with_t5(t5_model, t5_tokenizer, sentences, device, max_length=200):
    """
    Translate a list of sentences using a pretrained T5 model.
    Args:
        t5_model: Pretrained T5 model.
        t5_tokenizer: Tokenizer for the T5 model.
        sentences: List of input sentences.
        device: Device to perform computation (CPU/GPU).
        max_length: Maximum length for the output sequence.

    Returns:
        List of translated sentences.
    """
    # Preprocess sentences for T5
    inputs = t5_tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=max_length).to(device)

    # Generate translations
    outputs = t5_model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, max_length=max_length)

    # Decode the outputs
    translations = t5_tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return translations
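
Since `translate_with_t5` accepts a list of sentences, a long test set can be translated in fixed-size chunks to keep memory use bounded. The sketch below shows one way this might be done. `iter_batches`, `translate_in_batches`, and `fake_translate` are hypothetical names, and the stand-in translator simply upper-cases its input so that the snippet runs without a model.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def translate_in_batches(sentences, translate_fn, batch_size=8):
    """Apply `translate_fn` (list of str -> list of str) batch by batch."""
    translations = []
    for batch in iter_batches(sentences, batch_size):
        translations.extend(translate_fn(batch))
    return translations

def fake_translate(batch):
    """Stand-in for a real model call: upper-cases each sentence."""
    return [sentence.upper() for sentence in batch]

result = translate_in_batches(["a b", "c d", "e f"], fake_translate, batch_size=2)
print(result)  # ['A B', 'C D', 'E F']
```

With the objects defined in this notebook, `translate_fn` could be something like `lambda batch: translate_with_t5(t5_model, t5_tokenizer, batch, device)`.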


In [None]:
from bert_score import score as bert_score  # For BERTScore
from nltk.translate.meteor_score import single_meteor_score
# Function to evaluate the translation model using BERTScore and METEOR
def evaluate_translation_model(model, tokenizer, test_sentences, reference_sentences, device, is_t5=False, max_length=200):
    generated_translations = []
    meteor_metric = 0

    # Translate sentences
    for src_sentence in test_sentences:
        if is_t5:
            inputs = tokenizer(src_sentence, return_tensors="pt", truncation=True, padding=True, max_length=max_length).to(device)
            outputs = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, max_length=max_length)
            translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
        else:
            translation = translate(model, src_sentence, tokenizer)
        generated_translations.append(translation)

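    # Compute BERTScore
    # NOTE: lang="en" selects BERTScore's default English model, although the
    # references are French; lang="fr" may be a better match.
    P, R, F1 = bert_score(generated_translations, reference_sentences, lang="en")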
    precision = P.mean().item()
    recall = R.mean().item()
    f1 = F1.mean().item()

    # Compute METEOR score
    for ref, hyp in zip(reference_sentences, generated_translations):
        meteor_metric += single_meteor_score(ref.split(), hyp.split())

    return generated_translations, precision, recall, f1, meteor_metric / len(reference_sentences)

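# Perform translation and evaluate the model on the first 10 test pairs
# NOTE: these sources come from test_set, so they lack the
# "translate English to French: " prefix that test_inputs carries.
test_src_sentences = [src for src, _ in test_set[:10]]
test_ref_sentences = [tgt for _, tgt in test_set[:10]]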

# Get the translations and evaluation metrics
generated_translations, precision, recall, f1, meteor_metric = evaluate_translation_model(
    model=t5_model,
    tokenizer=t5_tokenizer,
    test_sentences=test_src_sentences,
    reference_sentences=test_ref_sentences,
    device=device,
    is_t5=True
)

# Display sample translations
print("Sample Translations:")
for i in range(5):  # Display first 5 translations
    print(f"Source: {test_src_sentences[i]}")
    print(f"Generated Translation: {generated_translations[i]}")
    print(f"Reference Translation: {test_ref_sentences[i]}\n")

# Display evaluation metrics
print(f"T5 Model - Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}, METEOR: {meteor_metric:.4f}")


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sample Translations:
Source: several years ago ted peter skillman introduced design challenge called marshmallow challenge
Generated Translation: ted peter skillman introduced design challenge called marshmallow challenge
Reference Translation: il plusieurs années ici à ted peter skillman présenté une épreuve de conception appelée l épreuve du marshmallow

Source: idea pretty simple teams four build tallest free standing structure sticks spaghetti one yard tape one yard string marshmallow
Generated Translation: teams four build tallest free standing structure sticks spaghetti one yard tape one yard string marshmallow
Reference Translation: et l idée est plutôt simple des équipes de quatre personnes doivent bâtir la plus haute structure tenant debout avec spaghettis un mètre de ruban collant un mètre de ficelle et un marshmallow

Source: marshmallow top
Generated Translation: marshmallow top
Reference Translation: le marshmallow doit être placé au sommet

Source: though seems really sim

In [None]:
from bert_score import score as bert_score  # For BERTScore
from evaluate import load as load_metric  # For loading ROUGE
from nltk.translate.meteor_score import single_meteor_score

# Load ROUGE metric
rouge = load_metric('rouge')

# Function to evaluate the translation model using BERTScore, ROUGE, and METEOR
def evaluate_translation_model(model, tokenizer, test_sentences, reference_sentences, device, is_t5=False, max_length=200):
    generated_translations = []
    meteor_metric = 0

    # Translate sentences
    for src_sentence in test_sentences:
        if is_t5:
            inputs = tokenizer(src_sentence, return_tensors="pt", truncation=True, padding=True, max_length=max_length).to(device)
            outputs = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, max_length=max_length)
            translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
        else:
            translation = translate(model, src_sentence, tokenizer)
        generated_translations.append(translation)

    # Compute BERTScore (lang="en" picks the English model; the references are French)
    P, R, F1 = bert_score(generated_translations, reference_sentences, lang="en")
    precision = P.mean().item()
    recall = R.mean().item()
    f1 = F1.mean().item()

    # Compute ROUGE score
    rouge_scores = rouge.compute(predictions=generated_translations, references=reference_sentences)

    # Compute METEOR score
    for ref, hyp in zip(reference_sentences, generated_translations):
        meteor_metric += single_meteor_score(ref.split(), hyp.split())

    return generated_translations, precision, recall, f1, meteor_metric / len(reference_sentences), rouge_scores

# Perform translation and evaluate the model on the first 10 test pairs
# (sources taken from test_set, i.e. without the T5 task prefix)
test_src_sentences = [src for src, _ in test_set[:10]]
test_ref_sentences = [tgt for _, tgt in test_set[:10]]

# Get the translations and evaluation metrics
generated_translations, precision, recall, f1, meteor_metric, rouge_scores = evaluate_translation_model(
    model=t5_model,
    tokenizer=t5_tokenizer,
    test_sentences=test_src_sentences,
    reference_sentences=test_ref_sentences,
    device=device,
    is_t5=True
)

# Display sample translations
print("Sample Translations:")
for i in range(5):  # Display first 5 translations
    print(f"Source: {test_src_sentences[i]}")
    print(f"Generated Translation: {generated_translations[i]}")
    print(f"Reference Translation: {test_ref_sentences[i]}\n")

# Display evaluation metrics
print(f"T5 Model - Precision (BERTScore): {precision:.4f}, Recall (BERTScore): {recall:.4f}, F1 (BERTScore): {f1:.4f}, METEOR: {meteor_metric:.4f}")
print(f"ROUGE Scores: {rouge_scores}")


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sample Translations:
Source: several years ago ted peter skillman introduced design challenge called marshmallow challenge
Generated Translation: ted peter skillman introduced design challenge called marshmallow challenge
Reference Translation: il plusieurs années ici à ted peter skillman présenté une épreuve de conception appelée l épreuve du marshmallow

Source: idea pretty simple teams four build tallest free standing structure sticks spaghetti one yard tape one yard string marshmallow
Generated Translation: teams four build tallest free standing structure sticks spaghetti one yard tape one yard string marshmallow
Reference Translation: et l idée est plutôt simple des équipes de quatre personnes doivent bâtir la plus haute structure tenant debout avec spaghettis un mètre de ruban collant un mètre de ficelle et un marshmallow

Source: marshmallow top
Generated Translation: marshmallow top
Reference Translation: le marshmallow doit être placé au sommet

Source: though seems really sim

In [None]:
# Calculate the average F1 score from the available ROUGE scores
avg_f1_score = (rouge_scores['rouge1'] +
                rouge_scores['rouge2'] +
                rouge_scores['rougeL']) / 3

print(f"Average ROUGE F1 Score: {avg_f1_score:.4f}")



Average ROUGE F1 Score: 0.0636


In [None]:
print(rouge_scores)


{'rouge1': 0.0876293103448276, 'rouge2': 0.014814814814814814, 'rougeL': 0.08831432137466619, 'rougeLsum': 0.08703486009520492}
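
To build intuition for what a `rouge1` value measures, the following sketch computes a unigram-overlap F-measure for a single prediction and reference. It is a simplification under stated assumptions: `rouge1_f1` is a hypothetical helper that tokenizes on whitespace only, so its numbers will not necessarily match the `rouge` metric loaded above, which uses its own tokenizer.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

pred = "marshmallow top"
ref = "le marshmallow doit être placé au sommet"
print(round(rouge1_f1(pred, ref), 4))  # 0.2222
```

Here only `marshmallow` is shared, giving a precision of 1/2, a recall of 1/7, and an F1 of 2/9. An untranslated English output therefore scores low against a French reference, in line with the low values printed above.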


In [None]:
avg_f1_score = (rouge_scores['rouge1'] +
                rouge_scores['rouge2'] +
                rouge_scores['rougeL'] +
                rouge_scores['rougeLsum']) / 4

print(f"Average ROUGE F1 Score (including ROUGE-Lsum): {avg_f1_score:.4f}")


Average ROUGE F1 Score (including ROUGE-Lsum): 0.0694
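
The two averages above can also be computed with one small helper instead of spelling out each key. This is a sketch only: `mean_score` is a hypothetical name, and the dictionary simply copies the values printed earlier so that the snippet is self-contained.

```python
# Values copied from the printed rouge_scores above
rouge_scores = {
    "rouge1": 0.0876293103448276,
    "rouge2": 0.014814814814814814,
    "rougeL": 0.08831432137466619,
    "rougeLsum": 0.08703486009520492,
}

def mean_score(scores, keys=None):
    """Average the selected entries of `scores` (all entries by default)."""
    keys = list(scores) if keys is None else list(keys)
    return sum(scores[k] for k in keys) / len(keys)

avg_three = mean_score(rouge_scores, ["rouge1", "rouge2", "rougeL"])
avg_all = mean_score(rouge_scores)
print(f"Average of ROUGE-1/2/L: {avg_three:.4f}")   # 0.0636
print(f"Average of all four:    {avg_all:.4f}")     # 0.0694
```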
