# Translation

Let’s now dive into translation. This is another sequence-to-sequence task, which means it’s a problem that can be formulated as going from one sequence to another. In that sense the problem is pretty close to summarization, and you could adapt what we will see here to other sequence-to-sequence problems such as:

**Style transfer:** Creating a model that translates texts written in a certain style to another (e.g., formal to casual or Shakespearean English to modern English)

**Generative question answering:** Creating a model that generates answers to questions, given a context

In this section, we will fine-tune a Marian model pretrained to translate from English to French (since a lot of Hugging Face employees speak both those languages) on the KDE4 dataset, which is a dataset of localized files for the KDE apps. The model we will use has been pretrained on a large corpus of French and English texts taken from the Opus dataset, which actually contains the KDE4 dataset. But even if the pretrained model we use has seen that data during its pretraining, we will see that we can get a better version of it after fine-tuning.



## 1. Preparing the data


### 1.1. The KDE4 dataset


In [1]:
!pip install transformers torch sentencepiece sacremoses huggingface_hub evaluate



In [2]:
from datasets import load_dataset

raw_datasets = load_dataset(
    'parquet',
    data_files='https://huggingface.co/datasets/Helsinki-NLP/kde4/resolve/refs%2Fconvert%2Fparquet/en-fr/train/0000.parquet'
)




If you want to work with a different pair of languages, you can specify them by their codes. A total of 92 languages are available for this dataset; you can see them all by expanding the language tags on its dataset card.

In [3]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 210173
    })
})

We have 210,173 pairs of sentences, but in one single split, so we will need to create our own validation set.

In [4]:
split_dataset = raw_datasets["train"].train_test_split(train_size=0.9, seed=42)
split_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 189155
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 21018
    })
})

In [5]:
split_dataset["validation"] = split_dataset.pop("test")
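To make the split concrete, here is a minimal pure-Python sketch of what a seeded 90/10 split does: shuffle the row indices with a fixed seed, then cut the list in two. The helper name `split_indices` is hypothetical, and the shuffling and rounding used inside `train_test_split()` may differ; only the idea carries over.

```python
import random

def split_indices(num_rows, train_size=0.9, seed=42):
    """Shuffle row indices with a fixed seed and cut them into two parts."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)    # same seed -> same split on every run
    num_train = int(num_rows * train_size)  # rounding down is an assumption
    return indices[:num_train], indices[num_train:]

train_idx, valid_idx = split_indices(210173)
print(len(train_idx), len(valid_idx))
```

With 210,173 rows this yields 189,155 training indices and 21,018 validation indices, the same sizes as the split shown above, and no row ends up in both parts.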

In [6]:
split_dataset["train"][0]

{'id': '118328',
 'translation': {'en': 'Show Pathfinder Lander Image',
  'fr': "Afficher l'image de Pathfinder LanderImage/ info menu item (should be translated)"}}

We get a dictionary with two sentences in the pair of languages we requested. One particularity of this dataset full of technical computer science terms is that they are all fully translated into French. However, French engineers leave most computer science-specific words in English when they talk. Here, for instance, the word “threads” might well appear in a French sentence, especially in a technical conversation; but in this dataset it has been translated into the more correct “fils de discussion.” The pretrained model we use, which has been pretrained on a larger corpus of French and English sentences, takes the easier option of leaving the word as is:



In [7]:
# from transformers import pipeline

checkpoint = "Helsinki-NLP/opus-mt-en-fr"
# translator = pipeline("translation_en_to_fr", model=checkpoint)

# translator("Default to expanded threads.")

In [8]:
# translator(
#     "Unable to import %1 using the OFX importer plugin. This file is not the correct format."
# )

It will be interesting to see if our fine-tuned model picks up on those particularities of the dataset (spoiler alert: it will).



### 1.2. Processing the data

You should know the drill by now: the texts all need to be converted into sets of token IDs so the model can make sense of them. For this task, we’ll need to tokenize both the inputs and the targets. Our first task is to create our `tokenizer` object. As noted earlier, we’ll be using a Marian English to French pretrained model. If you are trying this code with another pair of languages, make sure to adapt the model checkpoint. The Helsinki-NLP organization provides more than a thousand models in multiple languages.



In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint, return_tensors='pt')

💡 If you are using a multilingual tokenizer such as mBART, mBART-50, or M2M100, you will need to set the language codes of your inputs and targets in the tokenizer by setting `tokenizer.src_lang` and `tokenizer.tgt_lang` to the right values.

The preparation of our data is pretty straightforward. There’s just one thing to remember: you need to ensure that the tokenizer processes the targets in the output language (here, French). You can do this by passing the targets to the `text_target` argument of the tokenizer’s `__call__` method.



In [10]:
en_sentence = split_dataset["train"][1]["translation"]["en"]
fr_sentence = split_dataset["train"][1]["translation"]["fr"]

inputs = tokenizer(en_sentence, text_target=fr_sentence)
inputs

{'input_ids': [34014, 0], 'attention_mask': [1, 1], 'labels': [39433, 1547, 0]}

As we can see, the output contains the input IDs associated with the English sentence, while the IDs associated with the French one are stored in the `labels` field. If you forget to indicate that you are tokenizing labels, they will be tokenized by the input tokenizer, which in the case of a Marian model is not going to go well at all. Converting the label IDs back to tokens shows that here the French sentence was split into sensible target-language pieces:



In [11]:
tokenizer.convert_ids_to_tokens(inputs["labels"])

['▁Édit', 'eur', '</s>']
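The effect of using the wrong vocabulary can be illustrated with a toy example. The sketch below is not how SentencePiece works internally (it uses a simple greedy longest-match rule), and both piece inventories are invented for illustration; it only shows why a French word falls apart when segmented with pieces learned from English.

```python
def segment(word, pieces):
    """Greedy longest-match segmentation; unknown characters become single pieces."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])  # no known piece starts here: emit one character
            i += 1
    return out

# Toy piece inventories, invented for illustration only.
english_pieces = {"edit", "or", "ed", "it", "e", "u", "r"}
french_pieces = {"édit", "eur", "é", "dit"}

word = "éditeur"
print(segment(word, french_pieces))   # target-side pieces
print(segment(word, english_pieces))  # source-side pieces
```

With the French pieces the word splits into two units, while the English pieces break it into six fragments that carry far less meaning for the decoder.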

Since `inputs` is a dictionary with our usual keys (input IDs, attention mask, etc.), the last step is to define the preprocessing function we will apply on the datasets:



In [12]:
max_length = 128

def preprocess_function(examples):
    # This function takes a batch of examples as input
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs

💡 If you are using a T5 model (more specifically, one of the `t5-xxx` checkpoints), the model will expect the text inputs to have a prefix indicating the task at hand, such as `translate English to French:`.



In [13]:
tokenized_dataset = split_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=split_dataset["train"].column_names
)
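For a T5 checkpoint, the task prefix would have to be added to the English inputs before tokenization. The following is a minimal sketch of that step only, assuming the prefix convention documented for the original T5 checkpoints; the helper `add_prefix` is a hypothetical name, and the tokenizer call is left out so the snippet runs on its own.

```python
prefix = "translate English to French: "  # T5-style task prefix

def add_prefix(examples):
    """Return the English inputs of a batch with the task prefix prepended."""
    return [prefix + ex["en"] for ex in examples["translation"]]

# A one-example batch in the same layout as the dataset's "translation" column.
batch = {"translation": [{"en": "Default to expanded threads."}]}
print(add_prefix(batch))
```

Inside `preprocess_function`, the returned list would take the place of `inputs`; the targets stay unchanged.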


Now that the data has been preprocessed, we are ready to fine-tune our pretrained model!



## 2. Fine-tuning the model with the Trainer API


The actual code using the `Trainer` will be the same as before, with just one little change: we use a `Seq2SeqTrainer` here, which is a subclass of `Trainer` that will allow us to properly deal with the evaluation, using the `generate()` method to predict outputs from the inputs. We’ll dive into that in more detail when we talk about the metric computation.

First things first, we need an actual model to fine-tune. We’ll use the usual `AutoModel` API:



In [14]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)


Note that this time we are using a model that was trained on a translation task and can actually be used already, so there is no warning about missing weights or newly initialized ones.



### 2.1. Data collation

We’ll need a data collator to deal with the padding for dynamic batching. We can’t just use a `DataCollatorWithPadding` in this case, because that only pads the inputs (input IDs, attention mask, and token type IDs). Our labels should also be padded to the maximum length encountered in the labels. And, as mentioned previously, the padding value used to pad the labels should be `-100` and not the padding token of the tokenizer, to make sure those padded values are ignored in the loss computation.



This is all done by a `DataCollatorForSeq2Seq`. Like the `DataCollatorWithPadding`, it takes the tokenizer used to preprocess the inputs, but it also takes the model. This is because this data collator will also be responsible for preparing the decoder input IDs, which are shifted versions of the labels with a special token at the beginning. Since this shift is done slightly differently for different architectures, the `DataCollatorForSeq2Seq` needs to know the `model` object:



In [15]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)


To test this on a few samples, we just call it on a list of examples from our tokenized training set:



In [16]:
batch = data_collator([tokenized_dataset["train"][i] for i in range(1, 3)])
batch.keys()


KeysView({'input_ids': tensor([[34014,     0, 59513],
        [13331,  3280,     0]]), 'attention_mask': tensor([[1, 1, 0],
        [1, 1, 1]]), 'labels': tensor([[39433,  1547,     0,  -100,  -100],
        [24251,    14,     6,  4930,     0]]), 'decoder_input_ids': tensor([[59513, 39433,  1547,     0, 59513],
        [59513, 24251,    14,     6,  4930]])})

We can check our labels have been padded to the maximum length of the batch, using `-100`:



In [17]:
batch["labels"]

tensor([[39433,  1547,     0,  -100,  -100],
        [24251,    14,     6,  4930,     0]])

In [18]:
batch["decoder_input_ids"]

tensor([[59513, 39433,  1547,     0, 59513],
        [59513, 24251,    14,     6,  4930]])

In [19]:
tokenizer.batch_decode(batch["decoder_input_ids"])

['<pad>  Éditeur</s> <pad>', "<pad>  Supprimer l' entrée"]
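The two transformations visible in this batch, padding the labels with `-100` and shifting them right to build the decoder inputs, can be sketched in plain Python. The helper names `pad_labels` and `shift_right` are hypothetical, not the library's functions, and the start token is taken from the output above, where every row of `decoder_input_ids` begins with ID 59513 (`<pad>`).

```python
def pad_labels(label_lists, pad_value=-100):
    """Pad every label list to the longest one in the batch."""
    longest = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (longest - len(labels)) for labels in label_lists]

def shift_right(labels, start_id, pad_id):
    """Prepend the start token, drop the last label, and replace -100 by the pad ID."""
    shifted = [start_id] + labels[:-1]
    return [pad_id if tok == -100 else tok for tok in shifted]

pad_id = 59513    # ID that decodes to <pad> in the batch above
start_id = 59513  # the decoder inputs above start with this same ID

labels = pad_labels([[39433, 1547, 0], [24251, 14, 6, 4930, 0]])
decoder_input_ids = [shift_right(row, start_id, pad_id) for row in labels]
print(labels)
print(decoder_input_ids)
```

Run on the two label sequences from the batch, this reproduces the `labels` and `decoder_input_ids` tensors printed above as nested lists.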

We will pass this `data_collator` along to the `Seq2SeqTrainer`. Next, let’s have a look at the metric.



### 2.2. Metrics

The feature that `Seq2SeqTrainer` adds to its superclass `Trainer` is the ability to use the `generate()` method during evaluation or prediction. During training, the model will use the `decoder_input_ids` with an attention mask ensuring it does not use the tokens after the token it’s trying to predict, to speed up training. During inference we won’t be able to use those since we won’t have labels, so it’s a good idea to evaluate our model with the same setup.
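The attention mask mentioned here is usually called a causal mask. As a rough illustration (a plain-Python sketch, not the tensor code the model actually runs), position `i` is only allowed to look at positions `j <= i`:

```python
def causal_mask(length):
    """mask[i][j] is 1 when position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]

for row in causal_mask(4):
    print(row)
```

Because every position has its own row in the mask, all target positions can be trained in a single forward pass, which is where the speed-up comes from.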



The decoder performs inference by predicting tokens one by one, something that’s implemented behind the scenes in 🤗 Transformers by the `generate()` method. The `Seq2SeqTrainer` will let us use that method for evaluation if we set `predict_with_generate=True`.
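The token-by-token process can be sketched as a simple greedy loop. This is only an illustration of the idea: `generate()` supports other strategies such as beam search, and `toy_next_token` below is a hypothetical stand-in for a model that simply replays a fixed answer.

```python
def greedy_decode(next_token_fn, start_id, eos_id, max_length):
    """Generate one token at a time until EOS is produced or max_length is reached."""
    tokens = [start_id]
    while len(tokens) < max_length:
        next_id = next_token_fn(tokens)  # the model's most likely next token
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Hypothetical stand-in for a model: replays a fixed answer, then keeps emitting EOS.
canned = [39433, 1547, 0]

def toy_next_token(tokens):
    position = len(tokens) - 1
    return canned[position] if position < len(canned) else 0

print(greedy_decode(toy_next_token, start_id=59513, eos_id=0, max_length=128))
```

Each step feeds everything generated so far back into the model, which is why generation is much slower than the single masked forward pass used during training.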

The traditional metric used for translation is the BLEU score, introduced in a 2002 article by Kishore Papineni et al. The BLEU score evaluates how close the translations are to their labels. It does not measure the intelligibility or grammatical correctness of the model’s generated outputs, but uses statistical rules to ensure that all the words in the generated outputs also appear in the targets. In addition, there are rules that penalize repetitions of the same words if they are not also repeated in the targets (to avoid the model outputting sentences like `"the the the the the"`) and output sentences that are shorter than those in the targets (to avoid the model outputting sentences like `"the"`).



One weakness with BLEU is that it expects the text to already be tokenized, which makes it difficult to compare scores between models that use different tokenizers. So instead, the most commonly used metric for benchmarking translation models today is SacreBLEU, which addresses this weakness (and others) by standardizing the tokenization step. To use this metric, we first need to install the SacreBLEU library:



In [20]:
!pip install sacrebleu



We can then load it via `evaluate.load()`:

In [21]:
import evaluate

metric = evaluate.load('sacrebleu')


This metric will take texts as inputs and targets. It is designed to accept several acceptable targets, as there are often multiple acceptable translations of the same sentence. The dataset we’re using only provides one, but it’s not uncommon in NLP to find datasets that give several sentences as labels. So, the predictions should be a list of sentences, but the references should be a list of lists of sentences.

Let’s try an example:



In [22]:
predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]

references =[
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 46.750469682990186,
 'counts': [11, 6, 4, 3],
 'totals': [12, 11, 10, 9],
 'precisions': [91.66666666666667,
  54.54545454545455,
  40.0,
  33.333333333333336],
 'bp': 0.9200444146293233,
 'sys_len': 12,
 'ref_len': 13}
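The fields in this output can be reproduced by hand. The sketch below is a simplified, single-reference BLEU with no smoothing and a regex tokenizer that is only an approximation of SacreBLEU's default tokenization; for this particular sentence pair it arrives at the same counts and score.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Simplified tokenization: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(prediction, reference, max_n=4):
    """Sentence-level BLEU with one reference and no smoothing."""
    pred, ref = tokenize(prediction), tokenize(reference)
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # a real implementation would typically smooth here instead
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: punish predictions that are shorter than the reference.
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

score = simple_bleu(
    "This plugin lets you translate web pages between several languages automatically.",
    "This plugin allows you to automatically translate web pages between several languages.",
)
print(round(score, 2))
```

The four n-gram precisions are 11/12, 6/11, 4/10, and 3/9, their geometric mean is multiplied by the brevity penalty exp(1 - 13/12), and the result rounds to 46.75.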

On the other hand, the two bad types of predictions that often come out of translation models (lots of repetitions or too short) get rather bad BLEU scores. Here is the repetitive case:

In [23]:
predictions = [
    "This plugin this this this."
]
references =[
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 5.594422941553801,
 'counts': [3, 1, 0, 0],
 'totals': [6, 5, 4, 3],
 'precisions': [50.0, 20.0, 12.5, 8.333333333333334],
 'bp': 0.31140322391459774,
 'sys_len': 6,
 'ref_len': 13}
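Notice that the 3-gram and 4-gram counts are zero, yet the reported precisions are 12.5 and 8.33 rather than 0. That pattern is consistent with an exponential smoothing scheme in which each successive zero count is replaced by a pseudo-count that halves every time. The sketch below assumes that scheme (it is inferred from the numbers above, not taken from the library's source) and recomputes the score from the reported counts.

```python
import math

def bleu_from_counts(counts, totals, sys_len, ref_len):
    """BLEU from n-gram match counts, with exponential smoothing for zero counts."""
    smooth = 1.0
    log_sum = 0.0
    for matches, total in zip(counts, totals):
        if matches == 0:
            smooth *= 2  # each further zero count halves the pseudo-count
            precision = 1.0 / (smooth * total)
        else:
            precision = matches / total
        log_sum += math.log(precision)
    # Brevity penalty for predictions shorter than the reference.
    bp = 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)
    return 100 * bp * math.exp(log_sum / len(counts))

score = bleu_from_counts([3, 1, 0, 0], [6, 5, 4, 3], sys_len=6, ref_len=13)
print(round(score, 2))
```

Both penalties are visible here: the repeated words earn no higher-order matches, and the brevity penalty exp(1 - 13/6) of about 0.31 cuts the score further, giving roughly 5.59.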

The score can go from 0 to 100, and higher is better.



To get from the model outputs to texts the metric can use, we will use the `tokenizer.batch_decode()` method. We just have to clean up all the `-100`s in the labels (the tokenizer will automatically do the same for the padding token):



In [24]:
import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in labels as we cannot decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some post-processing techniques
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"BLEU": result['score']}
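To see why the `-100` values must be replaced before decoding, here is a toy decoder in plain Python. The vocabulary contains only the four IDs printed earlier in this section, and `decode` is a hypothetical simplification of what `batch_decode(..., skip_special_tokens=True)` does for a single sequence.

```python
# Toy vocabulary built from the IDs printed earlier in this section.
id_to_token = {39433: "▁Édit", 1547: "eur", 0: "</s>", 59513: "<pad>"}
special_tokens = {"</s>", "<pad>"}
pad_token_id = 59513

def decode(ids):
    """Map IDs to tokens, drop special tokens, and turn '▁' into spaces."""
    tokens = [id_to_token[i] for i in ids]
    text = "".join(t for t in tokens if t not in special_tokens)
    return text.replace("▁", " ").strip()

labels = [39433, 1547, 0, -100, -100]
# -100 is not a valid token ID, so swap it for the pad ID before decoding.
cleaned = [tok if tok != -100 else pad_token_id for tok in labels]
print(decode(cleaned))
```

Once the `-100` entries have become pad IDs, they are dropped together with the other special tokens, leaving only the text of the label.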

Now that this is done, we are ready to fine-tune our model!



### 2.3. Fine-tuning the model

In [25]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
token = user_secrets.get_secret("HF_TOKEN")

login(token)

Once this is done, we can define our `Seq2SeqTrainingArguments`. Like for the `Trainer`, we use a subclass of `TrainingArguments` that contains a few more fields:



In [26]:
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-fr",
    eval_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    push_to_hub=True,
    fp16=True
)

Apart from the usual hyperparameters (like learning rate, number of epochs, batch size, and some weight decay), here are a few changes compared to what we saw in the previous sections:

* We don’t set any regular evaluation, as evaluation takes a while; we will just evaluate our model once before training and after.

* We set `fp16=True`, which speeds up training on modern GPUs.

* We set `predict_with_generate=True`, as discussed above.

* We use `push_to_hub=True` to upload the model to the Hub at the end of each epoch.

Finally, we just pass everything to the `Seq2SeqTrainer`:



In [27]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=args,
    model=model,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics
)

Before training, we’ll first look at the score our model gets, to double-check that we’re not making things worse with our fine-tuning. This command will take a bit of time, so you can grab a coffee while it executes:



In [28]:
trainer.evaluate(max_length=max_length)

{'eval_loss': 1.697411060333252,
 'eval_model_preparation_time': 0.003,
 'eval_BLEU': 39.40455746526643,
 'eval_runtime': 1211.732,
 'eval_samples_per_second': 17.345,
 'eval_steps_per_second': 0.272}

Next is the training, which will also take a bit of time:



In [29]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.


| Step | Training Loss |
|------|---------------|
| 500  | 1.377714 |
| 1000 | 1.235522 |
| 1500 | 1.158086 |
| 2000 | 1.123041 |
| 2500 | 1.08975  |
| 3000 | 1.066188 |
| 3500 | 1.041008 |
| 4000 | 1.033669 |
| 4500 | 0.998418 |
| 5000 | 1.01576  |


TrainOutput(global_step=17736, training_loss=0.9248666578335003, metrics={'train_runtime': 4207.2361, 'train_samples_per_second': 134.878, 'train_steps_per_second': 4.216, 'total_flos': 1.1304295115194368e+16, 'train_loss': 0.9248666578335003, 'epoch': 3.0})
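The `global_step` value in this output can be checked with a little arithmetic. The sketch assumes a single device and no gradient accumulation, so that one optimizer step corresponds to one batch of 32 examples.

```python
import math

num_examples = 189155  # training rows after the 90/10 split
batch_size = 32        # per_device_train_batch_size
num_epochs = 3         # num_train_epochs

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

This gives 5,912 steps per epoch and 17,736 steps in total, matching the reported `global_step=17736`.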

Note that while the training happens, each time the model is saved (here, every epoch) it is uploaded to the Hub in the background. This way, you will be able to resume your training on another machine if necessary.

Once training is done, we evaluate our model again. Hopefully we will see some improvement in the BLEU score!



In [30]:
trainer.evaluate(max_length=max_length)

{'eval_loss': 0.8564431667327881,
 'eval_model_preparation_time': 0.003,
 'eval_BLEU': 53.236574213217466,
 'eval_runtime': 1268.6305,
 'eval_samples_per_second': 16.567,
 'eval_steps_per_second': 0.259,
 'epoch': 3.0}

Finally, we use the `push_to_hub()` method to make sure we upload the latest version of the model. The `Trainer` also drafts a model card with all the evaluation results and uploads it. This model card contains metadata that helps the Model Hub pick the widget for the inference demo. Usually, there is no need to say anything as it can infer the right widget from the model class, but in this case, the same model class can be used for all kinds of sequence-to-sequence problems, so we specify it’s a translation model:



In [31]:
trainer.push_to_hub(tags='translation', commit_message='Training_finish!')


CommitInfo(commit_url='https://huggingface.co/arraypowerplay/marian-finetuned-kde4-en-to-fr/commit/74a8c0795d52814f25731b1e3e90769eaecddf58', commit_message='Training_finish!', commit_description='', oid='74a8c0795d52814f25731b1e3e90769eaecddf58', pr_url=None, repo_url=RepoUrl('https://huggingface.co/arraypowerplay/marian-finetuned-kde4-en-to-fr', endpoint='https://huggingface.co', repo_type='model', repo_id='arraypowerplay/marian-finetuned-kde4-en-to-fr'), pr_revision=None, pr_num=None)

## 3. A custom training loop

Let’s now take a look at the full training loop, so you can easily customize the parts you need.

### 3.1. Prepare everything for training

You’ve seen all of this a few times now, so we’ll go through the code quite quickly. First we’ll build the `DataLoaders` from our datasets, after setting the datasets to the `"torch"` format so we get PyTorch tensors:



In [32]:
import torch

tokenized_dataset.set_format('torch')
train_dataloader = torch.utils.data.DataLoader(
    tokenized_dataset["train"],
    shuffle=True,
    batch_size=8,
    collate_fn=data_collator
)
eval_dataloader = torch.utils.data.DataLoader(
    tokenized_dataset["validation"],
    shuffle=False,
    batch_size=8,
    collate_fn=data_collator
)

In [33]:
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)




In [34]:
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

Once we have all those objects, we can send them to the `accelerator.prepare()` method. Remember that if you want to train on TPUs in a Colab notebook, you will need to move all of this code into a training function, and you shouldn’t execute any notebook cell that instantiates an `Accelerator`.



In [35]:
from accelerate import Accelerator

accelerator = Accelerator()

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [36]:
from transformers import get_scheduler

num_train_epochs = 3
num_training_steps = num_train_epochs * len(train_dataloader)

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
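The `"linear"` schedule can be pictured as a learning rate that (after an optional warmup) decays in a straight line to zero by the last training step. The function below is a plain-Python sketch of that shape, not the library's implementation; `linear_lr` is a hypothetical name and the step count of 1000 is chosen only for illustration.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

total = 1000  # hypothetical number of training steps
for step in (0, 250, 500, 1000):
    print(step, linear_lr(step, base_lr=2e-5, num_training_steps=total))
```

With `num_warmup_steps=0`, as in the cell above, training starts at the full learning rate of `2e-5`, reaches half of it at the midpoint, and ends at zero.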

In [37]:
from huggingface_hub import get_full_repo_name, HfApi, create_repo

model_name = "marian-finetuned-kde4-en-to-fr-accelerate"
repo_name = get_full_repo_name(model_name)
create_repo(repo_name, exist_ok=True)
api = HfApi()

In [38]:
output_dir = "marian-finetuned-kde4-en-to-fr-accelerate"

We can now upload anything we save in `output_dir`. This will help us upload the intermediate models at the end of each epoch.



### 3.2. Training loop

We are now ready to write the full training loop. To simplify its evaluation part, we define this `postprocess()` function that takes predictions and labels and converts them to the lists of strings our metric object will expect:



In [None]:
def postprocess(predictions, labels):
    # This function converts predicted and label tensors into the lists of
    # strings that our metric expects
    # Convert from tensors to NumPy arrays
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Applying stripping from both left and right
    decoded_predictions = [pre.strip() for pre in decoded_predictions]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_predictions, decoded_labels

The first thing to note is that we use the `generate()` method to compute predictions, but this is a method on our base model, not the wrapped model 🤗 Accelerate created in the `prepare()` method. That’s why we unwrap the model first, then call this method.

The second thing is that, like with token classification, two processes may have padded the inputs and labels to different shapes, so we use `accelerator.pad_across_processes()` to make the predictions and labels the same shape before calling the `gather()` method. If we don’t do this, the evaluation will either error out or hang forever.
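The need for padding before gathering can be illustrated without any distributed setup. In the sketch below, two Python lists stand in for the prediction tensors held by two processes; `pad_across` and `gather` are hypothetical helpers that mimic the idea, not the 🤗 Accelerate functions themselves.

```python
def pad_across(batches, pad_index):
    """Pad every row in every process's batch to the longest row seen anywhere."""
    longest = max(len(row) for batch in batches for row in batch)
    return [[row + [pad_index] * (longest - len(row)) for row in batch] for batch in batches]

def gather(batches):
    """Concatenate the per-process batches along the batch dimension."""
    return [row for batch in batches for row in batch]

# Two hypothetical processes whose predictions have different lengths.
process_0 = [[59513, 39433, 1547, 0]]
process_1 = [[59513, 24251, 14, 6, 4930, 0]]

padded = pad_across([process_0, process_1], pad_index=59513)
print(gather(padded))
```

After padding, every row has the same length, so the rows can be stacked into one rectangular tensor; without that step the shapes would not line up across processes.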


In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    
    model.eval()
    for batch in eval_dataloader:
        with torch.no_grad():
            predictions = accelerator.unwrap_model(model).generate(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=max_length
            )
        labels = batch["labels"]
        predictions = accelerator.pad_across_processes(
            predictions, 
            dim=1, 
            pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(
            labels,
            dim=1,
            pad_index=-100
        )
        preds_gathered = accelerator.gather(predictions)
        labels_gathered = accelerator.gather(labels)
        decoded_preds, decoded_labels = postprocess(preds_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

    result = metric.compute()
    print(f"epoch {epoch + 1}, bleu: {result['score']:.3f}") 

    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        api.upload_folder(
            folder_path=output_dir,
            repo_id=repo_name,
            commit_message=f"Training progress in epoch {epoch + 1}."
        )