In [1]:
!pip install sentencepiece transformers==4.33 datasets sacremoses evaluate sacrebleu accelerate -q



# Translation

Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework for returning some output from an input, like translation or summarization. Translation systems are most commonly used to translate text between languages, but they can also be used for speech, or for some combination of the two such as text-to-speech or speech-to-text.

This guide will show you how to:

1. Finetune [NLLB-200](https://huggingface.co/facebook/nllb-200-distilled-600M) on the Dyula-French [`uvci/Koumankan_mt_dyu_fr`](https://huggingface.co/datasets/uvci/Koumankan_mt_dyu_fr) dataset to translate Dyula text to French.
2. Use your finetuned model for inference.

<Tip>
The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[BART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bart), [BigBird-Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bigbird_pegasus), [Blenderbot](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/blenderbot), [BlenderbotSmall](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/blenderbot-small), [Encoder decoder](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/encoder-decoder), [FairSeq Machine-Translation](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/fsmt), [GPTSAN-japanese](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gptsan-japanese), [LED](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/led), [LongT5](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/longt5), [M2M100](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/m2m_100), [Marian](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/marian), [mBART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mbart), [MT5](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mt5), [MVP](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mvp), [NLLB](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nllb), [NLLB-MOE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nllb-moe), [Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/pegasus), [PEGASUS-X](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/pegasus_x), [PLBart](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/plbart), [ProphetNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/prophetnet), [SwitchTransformers](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/switch_transformers), 
[T5](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/t5), [XLM-ProphetNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-prophetnet)

<!--End of the generated tip-->

</Tip>

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

In [2]:
from huggingface_hub import notebook_login

notebook_login()


## Load the Dyula-French dataset

Start by loading the Dyula-French [`uvci/Koumankan_mt_dyu_fr`](https://huggingface.co/datasets/uvci/Koumankan_mt_dyu_fr) dataset with the 🤗 Datasets library:

In [3]:
from datasets import load_dataset

dataset = load_dataset("uvci/Koumankan_mt_dyu_fr")


In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['ID', 'translation'],
        num_rows: 8065
    })
    validation: Dataset({
        features: ['ID', 'translation'],
        num_rows: 1471
    })
    test: Dataset({
        features: ['ID', 'translation'],
        num_rows: 1393
    })
})

In [5]:
SRC_LANG = "dyu"
TRG_LANG = "fr"
CHARS_TO_REMOVE_REGEX = r'[!"&(),\-./:;=?+\n\[\]]'

In [6]:
# Clean data (lowercase and punctuation)
import re

def remove_special_characters(text):
    text = re.sub(CHARS_TO_REMOVE_REGEX, " ", text.lower())
    return text.strip()

def clean_text(batch):
    # process source text
    batch["translation"][SRC_LANG] = remove_special_characters(batch["translation"][SRC_LANG])
    # process target text
    batch["translation"][TRG_LANG] = remove_special_characters(batch["translation"][TRG_LANG])
    return batch

dataset = dataset.map(clean_text)
dataset


DatasetDict({
    train: Dataset({
        features: ['ID', 'translation'],
        num_rows: 8065
    })
    validation: Dataset({
        features: ['ID', 'translation'],
        num_rows: 1471
    })
    test: Dataset({
        features: ['ID', 'translation'],
        num_rows: 1393
    })
})
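
To see what the cleaning step does to a sentence, here is a minimal, self-contained sketch. It uses the same set of characters as the cleaning cell, written as a raw string; the sample sentences are hypothetical and not taken from the dataset. Note that each removed character becomes a space, so runs of spaces can remain inside the text.

```python
import re

# Same set of characters as in the cleaning cell, written as a raw string
CHARS_TO_REMOVE_REGEX = r'[!"&(),\-./:;=?+\n\[\]]'

def remove_special_characters(text):
    # Lowercase, replace each listed character with a space, trim the ends
    return re.sub(CHARS_TO_REMOVE_REGEX, " ", text.lower()).strip()

# Hypothetical sample sentences, for illustration only
samples = ["Bonjour, le monde!", "Il boit de l'eau."]
cleaned = [remove_special_characters(s) for s in samples]
print(cleaned)
```

The straight apostrophe is not in the character set, so it is preserved.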

Then take a look at an example:

In [7]:
dataset["train"][0]

{'ID': 'ID_18897661270129',
 'translation': {'dyu': 'a bi ji min na', 'fr': 'il boit de l’eau'}}

`translation`: a Dyula (`dyu`) sentence and its French (`fr`) translation. Each example also has an `ID` field.

## Preprocess

The next step is to load the NLLB tokenizer to process the Dyula-French language pairs. The `src_lang` and `tgt_lang` arguments take NLLB language codes:

In [8]:
from transformers import AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="dyu_Latn", tgt_lang="fra_Latn")




The preprocessing function you want to create needs to:

1. Extract the source (Dyula) and target (French) sentences from each `translation` entry. Unlike T5, NLLB does not need a task prefix; the `src_lang` and `tgt_lang` codes given to the tokenizer identify the languages.
2. Tokenize the inputs and the targets in a single call by passing the targets through the `text_target` argument, so the tokenized targets are returned as `labels`.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

In [9]:
source_lang = "dyu"
target_lang = "fr"
# prefix = "translate Dyula to French: "


def preprocess_function(examples):
    inputs = [example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
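
The shape of the data passing through `preprocess_function` can be sketched without downloading a tokenizer. In the sketch below, `toy_tokenize` and `toy_preprocess` are hypothetical stand-ins: the toy tokenizer simply splits on whitespace, whereas the real tokenizer returns integer token ids and adds special tokens. The point is only to show that a batched call receives a list of dicts under `translation` and returns parallel lists.

```python
# Toy stand-in for a real tokenizer (an assumption for illustration only):
# it splits on whitespace and keeps at most max_length tokens.
def toy_tokenize(text, max_length):
    return text.split()[:max_length]

def toy_preprocess(examples, source_lang="dyu", target_lang="fr", max_length=4):
    # With batched=True, examples["translation"] is a list of dicts, one per row
    inputs = [ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    return {
        "input_ids": [toy_tokenize(t, max_length) for t in inputs],
        "labels": [toy_tokenize(t, max_length) for t in targets],
    }

# One-row batch built from the example shown earlier
batch = {"translation": [{"dyu": "a bi ji min na", "fr": "il boit de l’eau"}]}
out = toy_preprocess(batch)
print(out)
```

With `max_length=4`, the five-word source is truncated to four tokens while the four-word target is left unchanged.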

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

In [10]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)


Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [11]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
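
Dynamic padding can be illustrated in plain Python. The sketch below is not the collator's implementation; `pad_batch` is a hypothetical helper and `PAD_ID = 1` is an assumed pad token id. It pads inputs to the longest input in the batch and pads labels with `-100`, the value that the loss function ignores.

```python
PAD_ID = 1           # assumed pad token id, for illustration only
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss

def pad_batch(features):
    # Pad to the longest sequence in this batch, not in the whole dataset
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_in - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [PAD_ID] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - len(f["labels"])))
    return batch

# Hypothetical token ids
features = [
    {"input_ids": [10, 11, 12], "labels": [20, 21]},
    {"input_ids": [13], "labels": [22, 23, 24, 25]},
]
batch = pad_batch(features)
print(batch)
```

Because each batch is padded only to its own longest sequence, short batches stay short, which is where the efficiency gain comes from.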

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [12]:
import evaluate

metric = evaluate.load("sacrebleu")


Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the SacreBLEU score:

In [13]:
import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
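
Two steps inside `compute_metrics` are easy to miss: labels padded with `-100` must be mapped back to a real token id before decoding, and `gen_len` counts the non-padding tokens in each prediction. A plain-Python sketch of both steps follows; the helper names, the token ids, and `PAD_ID = 1` are assumptions made for illustration.

```python
PAD_ID = 1  # assumed pad token id, for illustration only

def unmask_labels(labels, pad_id=PAD_ID):
    # Plain-Python equivalent of np.where(labels != -100, labels, pad_id)
    return [[tok if tok != -100 else pad_id for tok in row] for row in labels]

def mean_generated_length(preds, pad_id=PAD_ID):
    # Count non-padding tokens per prediction, then average over the batch
    lengths = [sum(tok != pad_id for tok in row) for row in preds]
    return sum(lengths) / len(lengths)

# Hypothetical token ids
labels = [[20, 21, -100, -100], [22, 23, 24, 25]]
preds = [[30, 31, 32, 1], [33, 34, 1, 1]]

restored = unmask_labels(labels)
gen_len = mean_generated_length(preds)
print(restored, gen_len)
```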

## Train

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load NLLB with [AutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM):

In [24]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)



In [23]:
# # Freeze the encoder
# for param in model.model.encoder.parameters():
#     param.requires_grad = False

# for name, param in model.model.encoder.named_parameters():
#     if param.requires_grad:
#         print(f"{name} is trainable")
#     else:
#         print(f"{name} is frozen")

At this point, only three steps remain:

1. Define your training hyperparameters in [Seq2SeqTrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). Every 500 steps (`eval_steps` and `save_steps`), the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) evaluates the SacreBLEU metric and saves a training checkpoint.
2. Pass the training arguments to [Seq2SeqTrainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.
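
To relate step counts to epochs for this dataset, here is a small worked calculation. It uses the train split size shown earlier and the batch size, evaluation interval, and epoch count from the training arguments in this notebook, and it assumes a single device with no gradient accumulation.

```python
import math

num_train_rows = 8065    # size of the train split
batch_size = 16          # per_device_train_batch_size
eval_steps = 500         # evaluation and checkpoint interval
num_train_epochs = 100

# Assumes one device and no gradient accumulation
steps_per_epoch = math.ceil(num_train_rows / batch_size)
max_steps = steps_per_epoch * num_train_epochs
epochs_between_evals = eval_steps / steps_per_epoch

print(steps_per_epoch, max_steps, round(epochs_between_evals, 2))
```

Under these assumptions an evaluation every 500 steps happens roughly once per epoch.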

In [25]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, EarlyStoppingCallback

class BestModelCallback(EarlyStoppingCallback):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.best_score = float('-inf')
        self.best_model_path = None

    def on_evaluate(self, args, state, control, metrics, **kwargs):
        if metrics['eval_bleu'] > self.best_score:
            self.best_score = metrics['eval_bleu']
            self.best_model_path = f"{args.output_dir}/checkpoint-{state.global_step}"
        return super().on_evaluate(args, state, control, metrics, **kwargs)

    def on_train_end(self, args, state, control, **kwargs):
        if self.best_model_path:
            args.push_to_hub_model_id = f"{args.push_to_hub_model_id}-best-bleu"
            control.should_save = True
            control.should_training_stop = True
            state.best_model_checkpoint = self.best_model_path
        return control

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb-en-fr_v3.0",
    evaluation_strategy="steps",
    eval_steps=500,  # Evaluate every 500 steps
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=100,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
    metric_for_best_model="bleu",
    load_best_model_at_end=True,
    save_strategy="steps",
    save_steps=500,  # Save checkpoint every 500 steps
)

best_model_callback = BestModelCallback(
    early_stopping_patience=5,
    early_stopping_threshold=0.01,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[best_model_callback],
)
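
The callback above is configured with `early_stopping_patience=5` and `early_stopping_threshold=0.01`. The following is a simplified sketch of that kind of rule, not the library's implementation: `stopping_index` is a hypothetical helper that tracks the best score itself, and the scores are illustrative values.

```python
def stopping_index(scores, patience=5, threshold=0.01):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    bad_evals = 0
    for i, score in enumerate(scores):
        if best is None or score - best > threshold:
            best = score    # meaningful improvement: remember it, reset the counter
            bad_evals = 0
        else:
            bad_evals += 1  # no improvement beyond the threshold
        if bad_evals >= patience:
            return i
    return None

# Illustrative BLEU scores: three improving evaluations, then a plateau
scores = [11.9, 12.3, 12.5, 12.1, 11.9, 12.1, 12.4, 12.1]
stop_at = stopping_index(scores)
print(stop_at)
```

Here the best score is reached at the third evaluation, and the rule stops after five further evaluations without a sufficient improvement.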



In [26]:
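from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.optim import AdamW

# Note: this optimizer and scheduler are standalone objects. They are not passed
# to the trainer (through its `optimizers` argument), so the trainer keeps using
# its own internally created optimizer and learning-rate schedule.
optimizer = AdamW(model.parameters(), lr=training_args.learning_rate)

total_steps = 1000  # number of scheduler steps to go from the initial LR to eta_min
scheduler = CosineAnnealingLR(
    optimizer,
    T_max=total_steps,
    eta_min=0,  # Minimum learning rate
)

# Advance the standalone cosine schedule by one step
def update_lr():
    scheduler.step()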
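
For reference, the learning rate that a cosine annealing schedule produces can be computed in closed form. The sketch below uses the values from this notebook (initial rate `2e-5`, `T_max=1000`, `eta_min=0`); `cosine_annealing_lr` is a hypothetical helper written for illustration, not a PyTorch function.

```python
import math

def cosine_annealing_lr(step, base_lr=2e-5, t_max=1000, eta_min=0.0):
    # Closed form: base_lr at step 0, eta_min at step t_max
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2

for step in (0, 250, 500, 750, 1000):
    print(step, cosine_annealing_lr(step))
```

The rate starts at the initial value, passes through half of it at the midpoint, and reaches `eta_min` after `T_max` steps.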

In [27]:
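# Training loop: each call to trainer.train() runs until early stopping triggers
# or num_train_epochs is reached, starting from the model's current weights.
for epoch in range(10):
    trainer.train()
    update_lr()  # steps the standalone scheduler only; the trainer's schedule is unaffected
    trainer.evaluate()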

Step,Training Loss,Validation Loss,Bleu,Gen Len
500,2.7699,2.336852,5.5365,12.5908
1000,2.3912,2.247035,6.5616,12.2719
1500,2.193,2.199405,7.2918,12.2495
2000,2.0275,2.185447,9.6889,11.1516
2500,1.8865,2.170183,10.1187,11.0258
3000,1.7582,2.169933,10.6676,10.913
3500,1.6495,2.18759,10.6358,11.0421
4000,1.5526,2.202447,10.7504,11.0503
4500,1.4433,2.218714,11.173,10.8341
5000,1.3478,2.258301,11.1824,10.777


Step,Training Loss,Validation Loss,Bleu,Gen Len
500,0.9828,2.417294,11.9531,11.3549
1000,0.9203,2.461453,12.2965,10.9388
1500,0.8515,2.500116,12.4851,11.0489
2000,0.791,2.554486,12.125,11.2019
2500,0.7293,2.559799,11.8699,11.1094
3000,0.683,2.602786,12.0653,11.0564
3500,0.621,2.653757,12.3735,11.2413
4000,0.5819,2.693492,12.0635,11.102


Step,Training Loss,Validation Loss,Bleu,Gen Len
500,0.788,2.53744,12.1462,11.1285
1000,0.7342,2.567884,12.396,11.0455
1500,0.675,2.596845,12.2156,10.9905
2000,0.6255,2.64678,12.5651,10.9952
2500,0.5703,2.67942,12.1756,11.0211
3000,0.5274,2.734358,12.5785,11.1693
3500,0.4853,2.753398,12.3316,11.1251
4000,0.4449,2.79476,12.1942,10.9293
4500,0.4111,2.833006,12.2451,10.9687
5000,0.3729,2.861239,12.358,11.2114


Step,Training Loss,Validation Loss,Bleu,Gen Len
500,0.49,2.757056,12.0094,11.1999
1000,0.447,2.793879,12.211,11.0537
1500,0.4111,2.813129,12.4421,11.1897
2000,0.3713,2.856413,11.8289,11.1067
2500,0.34,2.878772,12.2884,11.07
3000,0.3065,2.943583,12.1028,11.3481
3500,0.2912,2.949262,12.4195,11.1679
4000,0.2591,2.994941,12.1839,11.0829


Step,Training Loss,Validation Loss,Bleu,Gen Len
500,0.3736,2.885897,12.2275,11.2019
1000,0.3406,2.888389,12.1354,11.1788
1500,0.3133,2.926798,12.3562,11.191
2000,0.2845,2.965553,12.5169,10.9946
2500,0.2593,2.972893,12.4455,11.119
3000,0.2343,3.0247,12.0926,11.0625
3500,0.211,3.044395,11.8817,11.2141
4000,0.1943,3.088335,12.2032,11.2753
4500,0.179,3.095911,12.4283,11.1298


Step,Training Loss,Validation Loss,Bleu,Gen Len
500,0.262,2.991434,12.3472,11.1815
1000,0.2357,3.021219,12.0739,11.0945
1500,0.2139,3.060061,12.2243,11.1251
2000,0.1931,3.073891,11.7267,11.3929
2500,0.179,3.097089,12.0558,11.0564
3000,0.1605,3.143509,12.1789,11.2216


Step,Training Loss,Validation Loss,Bleu,Gen Len
500,0.235,3.052827,11.9769,11.2923
1000,0.2148,3.052038,12.0908,11.0571
1500,0.192,3.080964,11.412,11.3018
2000,0.1763,3.127833,12.0214,11.0775
2500,0.1601,3.151974,12.0919,11.17
3000,0.1452,3.164142,11.8177,11.2808
3500,0.1345,3.194954,12.079,11.2373


Step,Training Loss,Validation Loss,Bleu,Gen Len
500,0.1468,3.179209,12.2469,11.2121
1000,0.1358,3.198141,12.1656,11.1285
1500,0.1208,3.215169,12.3452,11.2223
2000,0.1106,3.236603,12.2157,11.1455
2500,0.1042,3.261365,12.1937,11.2876
3000,0.0945,3.269589,12.3344,11.14
3500,0.0859,3.306982,11.777,11.0544
4000,0.0789,3.330329,11.9529,11.1387


Step,Training Loss,Validation Loss,Bleu,Gen Len
500,0.1121,3.233445,12.2759,11.1693
1000,0.1023,3.248928,12.2009,11.2141
1500,0.0937,3.280941,12.2005,11.2094
2000,0.0874,3.316128,12.2561,11.1224
2500,0.0807,3.33095,12.1845,11.2407
3000,0.0745,3.336632,12.3376,11.0653
3500,0.0675,3.373309,12.3834,11.2189
4000,0.0632,3.395747,12.692,11.2005
4500,0.0584,3.392285,12.2998,11.1183
5000,0.0551,3.427745,12.3764,11.1672


Step,Training Loss,Validation Loss,Bleu,Gen Len
500,0.0611,3.40584,12.324,11.0877
1000,0.057,3.449677,12.2765,11.0585
1500,0.053,3.442086,12.1308,11.2114
2000,0.0503,3.47507,12.3683,11.1774
2500,0.0464,3.481982,12.5828,11.1638
3000,0.0421,3.503484,12.8392,11.0802
3500,0.0384,3.503821,12.7273,11.3379
4000,0.0391,3.538073,12.392,11.2305
4500,0.0356,3.539459,12.3458,11.2597
5000,0.036,3.588484,12.7932,10.9816


In [28]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/Adeptschneider/nllb-en-fr_v3.0/commit/a8c64f878b4f9515ae21f0bb72c09f75934d0960', commit_message='End of training', commit_description='', oid='a8c64f878b4f9515ae21f0bb72c09f75934d0960', pr_url=None, pr_revision=None, pr_num=None)

<Tip>

For a more in-depth example of how to finetune a model for translation, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb).

</Tip>

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with some text you'd like to translate to another language. Unlike T5, NLLB does not need a task prefix; the source and target languages are identified by language codes. The model was finetuned on lowercased Dyula text with punctuation removed, so format your input the same way:

In [None]:
text = "i tɔgɔ bi cogodɔ"

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for translation with your model, and pass your text to it:

In [None]:
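from transformers import pipeline

# Load the finetuned model from the training output directory
# (you can also use the name of your repository on the Hub)
translator = pipeline("translation", model="nllb-en-fr_v3.0", src_lang="dyu_Latn", tgt_lang="fra_Latn")
translator(text)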

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return the `input_ids` as PyTorch tensors:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nllb-en-fr_v3.0", src_lang="dyu_Latn", tgt_lang="fra_Latn")
inputs = tokenizer(text, return_tensors="pt").input_ids

Use the [generate()](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to create the translation. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](https://huggingface.co/docs/transformers/main/en/tasks/../main_classes/text_generation) API.

In [None]:
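from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("nllb-en-fr_v3.0")
# NLLB starts its output with the target language code (French here)
french_id = tokenizer.convert_tokens_to_ids("fra_Latn")
outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95, forced_bos_token_id=french_id)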

Decode the generated token ids back into text:

In [None]:
tokenizer.decode(outputs[0], skip_special_tokens=True)
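
The `generate()` call above samples with `top_k` and `top_p`. As a rough illustration of what these two parameters do, the sketch below filters a toy next-token distribution. It is a simplified sketch and not the library's implementation, which operates on logits; `filter_top_k_top_p` and the toy probabilities are hypothetical.

```python
def filter_top_k_top_p(probs, top_k=30, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative (renormalized) probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / k_total
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Toy next-token distribution, for illustration only
toy_probs = {"il": 0.5, "elle": 0.3, "on": 0.15, "je": 0.04, "tu": 0.01}
filtered = filter_top_k_top_p(toy_probs, top_k=4, top_p=0.9)
print(filtered)
```

Sampling then draws the next token from the filtered distribution, so unlikely tokens are never chosen while some variety is kept.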