In [1]:
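import os
# "!export" runs in a throwaway subshell, so set the proxy in the kernel's own environment instead.
os.environ["http_proxy"] = "http://proxy.vmware.com:3128"
os.environ["https_proxy"] = "http://proxy.vmware.com:3128"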

# Lab4

### Fine-tuning a model on a translation task
In this notebook, we will see how to fine-tune one of the 🤗 Transformers models for a translation task. We will use the WMT dataset, a machine translation dataset composed of a collection of various sources, including news commentaries and parliament proceedings.

## Load the dataset

In [20]:
import datasets
dataset_lists = datasets.list_datasets()
print(dataset_lists[0:10])

['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc', 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue', 'ajgt_twitter_ar', 'allegro_reviews']


In [4]:
raw_datasets = datasets.load_dataset("wmt16", "ro-en")
print(raw_datasets)

Reusing dataset wmt16 (/root/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/9e0038fe4cc117bd474d2774032cc133e355146ed0a47021b2040ca9db4645c0)



DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 610320
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1999
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 1999
    })
})


In [5]:
# To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
    
show_random_elements(raw_datasets["train"])

| | translation |
|---|---|
| 0 | {'en': 'Turkey's second most visited museum after Topkapi Palace, the Mevlana Museum in Konya, Anatolia, is undergoing its largest-ever restoration, and the first comprehensive restoration since 1926.', 'ro': 'Al doilea muzeu ca număr de vizitatori din Turcia după Palatul Topkapi, Muzeul Mevlana din Konya, Anatolia, trece prin cea mai amplă restaurare a sa, aceasta fiind prima sa restaurare completă din 1926.'} |
| 1 | {'en': 'The tribunal wants Serbia to extradite four more indictees, including Bosnian Serb wartime commander Ratko Mladic and Bosnian Serb wartime leader Radovan Karadzic.', 'ro': 'Tribunalul doreşte ca Serbia să extrădeze încă patru inculpaţi, printre care se numără comandantul sârb bosniac din timpul războiului Ratko Mladic şi liderul sârb bosniac Radovan Karadzic.'} |
| 2 | {'en': 'At the end of the month, parties will be convened to start formal negotiations about police restructuring, Ashdown said.', 'ro': 'La sfârşitul lunii, părţile vor fi convocate pentru lansarea negocierilor oficiale privitoare la restructurarea poliţiei, a afirmat Ashdown.'} |
| 3 | {'en': '(Sustained applause)', 'ro': '(Aplauze prelungite)'} |
| 4 | {'en': 'Let us have good discussions, then, but let us also play hard ball and let us, I beg, secure the energy supplies of our Member States.', 'ro': 'Haideți așadar să purtăm discuții fructuoase, dar haideți să jucăm dur și să asigurăm, vă rog, aprovizionarea cu energie a statelor membre.'} |


In [6]:
# get the metric we need to use for evaluation
from datasets import load_metric
metric = load_metric("sacrebleu")
print(metric)

Using the latest cached version of the module from /root/.cache/huggingface/modules/datasets_modules/metrics/sacrebleu/31e1673407d8789b8f5ddfd979948f6a1de0a6d691426d55fa74a35ffb0c1bdf (last modified on Mon Jul  4 08:03:55 2022) since it couldn't be found locally at sacrebleu, or remotely on the Hugging Face Hub.


Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
Produces BLEU scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions (`list` of `str`): list of translations to score. Each translation should be tokenized into a list of tokens.
    references (`list` of `list` of `str`): A list of lists of references. The contents of the first sub-list are the references for the first prediction, the contents of the second sub-list are for the second prediction, etc. Note that there must be the same number of references for each prediction (i.e. all sub-lists must be of the same length).
    smooth_method (`str`): The smoothing method to use, defaults to `'exp'`. Possible values are:
        - `'none'`: no smoothing
        - `'floor'`: increment zero counts
        - `'add-k'`: increment num/deno

In [7]:
# You can call its compute method with your predictions and labels, which need to be lists of decoded strings (a list of lists for the labels):
fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
metric.compute(predictions=fake_preds, references=fake_labels)


{'score': 0.0,
 'counts': [4, 2, 0, 0],
 'totals': [4, 2, 0, 0],
 'precisions': [100.0, 100.0, 0.0, 0.0],
 'bp': 1.0,
 'sys_len': 4,
 'ref_len': 4}
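
A score of 0.0 can look surprising when both predictions match their references exactly. BLEU combines n-gram precisions for orders 1 through 4, and two-word sentences contain no 3-grams or 4-grams, so the higher-order precisions are zero and pull the combined score down to zero. The sketch below reproduces the `counts` and `totals` rows with plain Python. It assumes whitespace tokenization and a single reference per prediction, and `ngram_counter` and `ngram_stats` are hypothetical helpers written for illustration, not part of sacrebleu.

```python
from collections import Counter


def ngram_counter(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_stats(predictions, references, max_order=4):
    """Return (matched n-gram counts, total predicted n-grams) per order."""
    counts = [0] * max_order
    totals = [0] * max_order
    for prediction, refs in zip(predictions, references):
        pred_tokens = prediction.split()   # assumption: whitespace tokenization
        ref_tokens = refs[0].split()       # assumption: one reference per prediction
        for n in range(1, max_order + 1):
            pred_ngrams = ngram_counter(pred_tokens, n)
            ref_ngrams = ngram_counter(ref_tokens, n)
            totals[n - 1] += sum(pred_ngrams.values())
            # Counter intersection keeps the clipped (minimum) count per n-gram.
            counts[n - 1] += sum((pred_ngrams & ref_ngrams).values())
    return counts, totals


counts, totals = ngram_stats(
    ["hello there", "general kenobi"],
    [["hello there"], ["general kenobi"]],
)
print(counts, totals)
```

With longer sentences, as in the WMT validation set, all four orders have non-zero totals and the score becomes informative.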

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers Tokenizer, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [8]:
from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-en-ro"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)





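If you use an mBART checkpoint, you need to set the source and target languages on the tokenizer so the texts are preprocessed properly. The checkpoint loaded above is Helsinki-NLP/opus-mt-en-ro, not an mBART model, so the check below leaves the tokenizer unchanged. Consult the mBART language codes if you are using this notebook on a different pair of languages.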

In [9]:
if "mbart" in model_checkpoint:
    tokenizer.src_lang = "en-XX"
    tokenizer.tgt_lang = "ro-RO"

In [10]:
# You can call this tokenizer directly on one sentence or, as here, on a list of sentences:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[125, 778, 3, 63, 141, 9191, 23, 0], [187, 32, 716, 9191, 2, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
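
To make the structure of that output concrete, here is a minimal sketch of what a tokenizer call produces: one ID per token, a trailing end-of-sequence ID, and an attention mask of ones. Both sequences in the output above end with ID 0, which is presumably the end-of-sequence token of this checkpoint. The vocabulary, the IDs, and the `toy_encode` helper below are made up for illustration; the real tokenizer uses subword units rather than whitespace splitting.

```python
# Toy vocabulary: these IDs are invented and do not match the real checkpoint.
TOY_VOCAB = {"</s>": 0, "<unk>": 1, "hello": 2, "this": 3, "is": 4, "a": 5, "sentence": 6}


def toy_encode(text):
    """Map whitespace tokens to IDs, append EOS, and build the attention mask."""
    tokens = text.lower().split()
    ids = [TOY_VOCAB.get(token, TOY_VOCAB["<unk>"]) for token in tokens]
    ids.append(TOY_VOCAB["</s>"])  # mark the end of the sequence
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


encoded = toy_encode("Hello this is a sentence")
unknown = toy_encode("Hello world")  # "world" is not in the toy vocabulary
print(encoded)
print(unknown)
```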

In [11]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "translate English to Romanian: "
else:
    prefix = ""

print(prefix)




We can then write the function that will preprocess our samples. We just feed them to the tokenizer with the argument truncation=True. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. Padding is dealt with later (in a data collator), so we pad examples to the longest length in the batch and not in the whole dataset.

In [12]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "ro"

def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

preprocess_function(raw_datasets['train'][:2])
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
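
The split between truncation at preprocessing time and padding at batch time can be sketched with plain lists. The token IDs, `PAD_ID`, and the helpers `truncate` and `pad_batch` below are hypothetical. The value -100 for label padding is the default used by `DataCollatorForSeq2Seq`, and positions with that value are ignored by the loss. The slicing here is a simplification: a real tokenizer also keeps the end-of-sequence token when it truncates.

```python
PAD_ID = 0           # assumed padding ID for the inputs
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss


def truncate(ids, max_length):
    """Keep at most max_length tokens (simplified: no special-token handling)."""
    return ids[:max_length]


def pad_batch(sequences, pad_value):
    """Pad every sequence to the longest length found in this batch only."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


# Two made-up examples: the first input is longer than max_length.
inputs = [[11, 12, 13, 14, 15, 16], [21, 22]]
labels = [[31, 32, 33], [41]]
max_length = 4

inputs = [truncate(seq, max_length) for seq in inputs]
batch = {
    "input_ids": pad_batch(inputs, PAD_ID),
    "attention_mask": pad_batch([[1] * len(seq) for seq in inputs], 0),
    "labels": pad_batch(labels, LABEL_PAD_ID),
}
print(batch)
```

Because padding depends only on the longest sequence in each batch, short batches stay short, which saves computation compared with padding everything to a dataset-wide maximum.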




## Fine-tuning the model
Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the AutoModelForSeq2SeqLM class. Like with the tokenizer, the from_pretrained method will download and cache the model for us.



In [15]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-{source_lang}-to-{target_lang}",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,  # set fp16=False when training on CPU; keep fp16=True on a GPU
    #push_to_hub=True,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

The last thing to define for our Seq2SeqTrainer is how to compute the metrics from the predictions. We need to define a function for this, which will just use the metric we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts:

In [16]:
import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
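
The two array operations in `compute_metrics` can be illustrated without NumPy or a tokenizer. The sketch below swaps the -100 placeholders back to a padding ID so the labels could be decoded, then computes the average number of non-padding tokens per prediction, which is what `gen_len` reports. The IDs and the value of `PAD_ID` are hypothetical, and the helper names are not part of any library.

```python
PAD_ID = 0  # hypothetical padding ID; the real one comes from tokenizer.pad_token_id


def restore_padding(label_rows, pad_id, ignore_index=-100):
    """Replace the loss-ignore marker with the padding ID so rows can be decoded."""
    return [[pad_id if token == ignore_index else token for token in row]
            for row in label_rows]


def mean_generated_length(pred_rows, pad_id):
    """Average count of non-padding tokens per predicted sequence."""
    lengths = [sum(1 for token in row if token != pad_id) for row in pred_rows]
    return sum(lengths) / len(lengths)


labels = [[5, 6, 7, -100, -100], [8, 9, -100, -100, -100]]
preds = [[5, 6, 7, 0, 0], [8, 9, 3, 4, 0]]

clean_labels = restore_padding(labels, PAD_ID)
gen_len = mean_generated_length(preds, PAD_ID)
print(clean_labels, gen_len)
```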


Then we just need to pass all of this along with our datasets to the Seq2SeqTrainer:

In [17]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Using cuda_amp half precision backend
The following columns in the training set don't have a corresponding argument in `MarianMTModel.forward` and have been ignored: translation. If translation are not expected by `MarianMTModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 610320
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 38145


| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | 0.7439 | 1.289592 | 28.0148 | 34.1091 |


Saving model checkpoint to opus-mt-en-ro-finetuned-en-to-ro/checkpoint-500
Configuration saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-500/config.json
Model weights saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-500/pytorch_model.bin
tokenizer config file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-500/tokenizer_config.json
Special tokens file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-500/special_tokens_map.json
Saving model checkpoint to opus-mt-en-ro-finetuned-en-to-ro/checkpoint-1000
Configuration saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-1000/config.json
Model weights saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-1000/tokenizer_config.json
Special tokens file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to opus-mt-en-ro-finetuned-en-to-ro/checkpoint-1500
Configuration saved i

Model weights saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-8000/pytorch_model.bin
tokenizer config file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-8000/tokenizer_config.json
Special tokens file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-8000/special_tokens_map.json
Deleting older checkpoint [opus-mt-en-ro-finetuned-en-to-ro/checkpoint-6500] due to args.save_total_limit
Saving model checkpoint to opus-mt-en-ro-finetuned-en-to-ro/checkpoint-8500
Configuration saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-8500/config.json
Model weights saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-8500/pytorch_model.bin
tokenizer config file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-8500/tokenizer_config.json
Special tokens file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-8500/special_tokens_map.json
Deleting older checkpoint [opus-mt-en-ro-finetuned-en-to-ro/checkpoint-7000] due to args.save_total_limit
Saving model checkpoint to opus-mt-en-ro

Deleting older checkpoint [opus-mt-en-ro-finetuned-en-to-ro/checkpoint-13500] due to args.save_total_limit
Saving model checkpoint to opus-mt-en-ro-finetuned-en-to-ro/checkpoint-15500
Configuration saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-15500/config.json
Model weights saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-15500/pytorch_model.bin
tokenizer config file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-15500/tokenizer_config.json
Special tokens file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-15500/special_tokens_map.json
Deleting older checkpoint [opus-mt-en-ro-finetuned-en-to-ro/checkpoint-14000] due to args.save_total_limit
Saving model checkpoint to opus-mt-en-ro-finetuned-en-to-ro/checkpoint-16000
Configuration saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-16000/config.json
Model weights saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-16000/pytorch_model.bin
tokenizer config file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint

Model weights saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-22500/pytorch_model.bin
tokenizer config file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-22500/tokenizer_config.json
Special tokens file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-22500/special_tokens_map.json
Deleting older checkpoint [opus-mt-en-ro-finetuned-en-to-ro/checkpoint-21000] due to args.save_total_limit
Saving model checkpoint to opus-mt-en-ro-finetuned-en-to-ro/checkpoint-23000
Configuration saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-23000/config.json
Model weights saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-23000/pytorch_model.bin
tokenizer config file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-23000/tokenizer_config.json
Special tokens file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-23000/special_tokens_map.json
Deleting older checkpoint [opus-mt-en-ro-finetuned-en-to-ro/checkpoint-21500] due to args.save_total_limit
Saving model checkpoint to opu

Deleting older checkpoint [opus-mt-en-ro-finetuned-en-to-ro/checkpoint-28000] due to args.save_total_limit
Saving model checkpoint to opus-mt-en-ro-finetuned-en-to-ro/checkpoint-30000
Configuration saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-30000/config.json
Model weights saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-30000/pytorch_model.bin
tokenizer config file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-30000/tokenizer_config.json
Special tokens file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-30000/special_tokens_map.json
Deleting older checkpoint [opus-mt-en-ro-finetuned-en-to-ro/checkpoint-28500] due to args.save_total_limit
Saving model checkpoint to opus-mt-en-ro-finetuned-en-to-ro/checkpoint-30500
Configuration saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-30500/config.json
Model weights saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-30500/pytorch_model.bin
tokenizer config file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint

Model weights saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-37000/pytorch_model.bin
tokenizer config file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-37000/tokenizer_config.json
Special tokens file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-37000/special_tokens_map.json
Deleting older checkpoint [opus-mt-en-ro-finetuned-en-to-ro/checkpoint-35500] due to args.save_total_limit
Saving model checkpoint to opus-mt-en-ro-finetuned-en-to-ro/checkpoint-37500
Configuration saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-37500/config.json
Model weights saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-37500/pytorch_model.bin
tokenizer config file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-37500/tokenizer_config.json
Special tokens file saved in opus-mt-en-ro-finetuned-en-to-ro/checkpoint-37500/special_tokens_map.json
Deleting older checkpoint [opus-mt-en-ro-finetuned-en-to-ro/checkpoint-36000] due to args.save_total_limit
Saving model checkpoint to opu

TrainOutput(global_step=38145, training_loss=0.7716991539066468, metrics={'train_runtime': 4043.8441, 'train_samples_per_second': 150.926, 'train_steps_per_second': 9.433, 'total_flos': 1.0093708497125376e+16, 'train_loss': 0.7716991539066468, 'epoch': 1.0})
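
The figures in the training log are consistent with each other, which is a quick sanity check worth doing after a run. The sketch below recomputes the number of optimization steps and the throughput from values printed above. It assumes a single device, one epoch, and no gradient accumulation, all of which match the log.

```python
import math

# Values copied from the training log and TrainOutput above.
num_examples = 610320
batch_size = 16
train_runtime = 4043.8441  # seconds

# One optimizer step per batch; the last partial batch (if any) still counts.
total_steps = math.ceil(num_examples / batch_size)
samples_per_second = num_examples / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)                    # 38145
print(round(samples_per_second, 3))
print(round(steps_per_second, 3))
print(round(train_runtime / 60, 1))   # runtime in minutes
```

The step count matches `global_step=38145`, and the two rates agree with `train_samples_per_second` and `train_steps_per_second` in the output.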