# Practical machine learning and deep learning. Lab 5
## Competition
No competition for today



# Fine-tuning a model on a translation task
Today we will fine-tune the T5 (Text-To-Text Transfer Transformer) [model](https://github.com/google-research/t5x) on a translation task. For this purpose we will use [HuggingFace transformers](https://huggingface.co/docs/transformers/index) and the [WMT16](https://huggingface.co/datasets/wmt16) dataset.

In [2]:
# installing huggingface libraries for dataset, models and metrics
%pip install datasets transformers[sentencepiece] sacrebleu

%pip install numpy==1.24.3

Note: you may need to restart the kernel to use updated packages.



In [3]:
# Necessary imports
import warnings

from datasets import load_dataset, load_metric
import transformers
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

warnings.filterwarnings('ignore')

## Selecting the model
For this example we use the smallest transformer in the T5 family, `t5-small`, as the model checkpoint. Other pre-trained models can be found [here](https://huggingface.co/docs/transformers/model_doc/t5#:~:text=T5%20comes%20in%20different%20sizes%3A).

In [4]:
# selecting model checkpoint
model_checkpoint = "t5-small"

## Loading the dataset

In [5]:
# setting random seed for transformers library
transformers.set_seed(42)

# Load the WMT16 dataset
raw_datasets = load_dataset("wmt16", "de-en")

# Load the BLEU metric
metric = load_metric("sacrebleu")

## Dataset
The dataset downloaded from HuggingFace is a `DatasetDict`. It contains the keys `["train", "validation", "test"]`, which represent the dataset splits.

In [6]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 4548885
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 2169
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2999
    })
})

In [7]:
# samples from train dataset
raw_datasets["train"][:5]

{'translation': [{'de': 'Wiederaufnahme der Sitzungsperiode',
   'en': 'Resumption of the session'},
  {'de': 'Ich erkläre die am Freitag, dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.',
   'en': 'I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.'},
  {'de': 'Wie Sie feststellen konnten, ist der gefürchtete "Millenium-Bug " nicht eingetreten. Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden.',
   'en': "Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful."},
  {'de': 'Im Parlament besteht der Wunsch na

## Metric
[Sacrebleu](https://huggingface.co/spaces/evaluate-metric/sacrebleu) computes:
- `score`: BLEU score
- `counts`: list of counts of correct n-grams
- `totals`: list of counts of total n-grams
- `precisions`: list of precisions
- `bp`: Brevity penalty
- `sys_len`: cumulative system length
- `ref_len`: cumulative reference length

The main metric is [BLEU score](https://en.wikipedia.org/wiki/BLEU). BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. The BLEU score measures the similarity of the machine-translated text to a set of high quality reference translations.

The BLEU metric is calculated using [n-grams](https://en.wikipedia.org/wiki/N-gram).
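As a rough illustration of what "calculated using n-grams" means, the sketch below counts how many n-grams of a single prediction also occur in its reference, with each count clipped to the number of occurrences in the reference. It assumes plain whitespace tokenization, and the helper names `ngrams` and `clipped_ngram_counts` are hypothetical. Sacrebleu applies its own tokenizer and aggregates these counts over the whole corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_counts(prediction, reference, n):
    """Return (matched, total) n-gram counts for one prediction/reference pair.

    Assumption: whitespace tokenization, which is simpler than what sacrebleu does.
    """
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # An n-gram is credited at most as often as it appears in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return matched, total


prediction = "the cat sat on the mat"
reference = "the cat is on the mat"
for n in (1, 2):
    matched, total = clipped_ngram_counts(prediction, reference, n)
    print(f"{n}-grams: {matched}/{total} matched")
# 1-grams: 5/6 matched
# 2-grams: 3/5 matched
```

The ratio `matched / total` for each n is the n-gram precision that appears (as a percentage) in the `precisions` field of the metric.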

In [8]:
# how to use sacrebleu and its purpose
metric

Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
Produces BLEU scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions (`list` of `str`): list of translations to score. Each translation should be tokenized into a list of tokens.
    references (`list` of `list` of `str`): A list of lists of references. The contents of the first sub-list are the references for the first prediction, the contents of the second sub-list are for the second prediction, etc. Note that there must be the same number of references for each prediction (i.e. all sub-lists must be of the same length).
    smooth_method (`str`): The smoothing method to use, defaults to `'exp'`. Possible values are:
        - `'none'`: no smoothing
        - `'floor'`: increment zero counts
        - `'add-k'`: increment num/deno

In [9]:
fake_preds = ["hello there", "general kenobi", "Can I get an A"]
fake_labels = [["hello there"], ["general kenobi"], ['Can I get a C']]
metric.compute(predictions=fake_preds, references=fake_labels)

{'score': 45.59274666224604,
 'counts': [7, 4, 1, 0],
 'totals': [9, 6, 3, 2],
 'precisions': [77.77777777777777,
  66.66666666666667,
  33.333333333333336,
  25.0],
 'bp': 1.0,
 'sys_len': 9,
 'ref_len': 9}
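To see how the fields of this result relate to each other, the following sketch recomputes `score` from the reported `precisions` and `bp`, assuming the usual BLEU definition: the brevity penalty times the geometric mean of the n-gram precisions. The function name `bleu_from_stats` is hypothetical and not part of sacrebleu.

```python
import math


def bleu_from_stats(precisions, bp):
    """Combine n-gram precisions (given in percent) and a brevity penalty into BLEU."""
    # Geometric mean, computed in log space.
    log_mean = sum(math.log(p / 100.0) for p in precisions) / len(precisions)
    return 100.0 * bp * math.exp(log_mean)


# Values copied from the metric output above.
precisions = [77.77777777777777, 66.66666666666667, 33.333333333333336, 25.0]
score = bleu_from_stats(precisions, bp=1.0)
print(round(score, 2))  # 45.59
```

Note that the 4-gram precision is 25.0 even though `counts` reports no matching 4-grams. This is presumably the effect of the default `'exp'` smoothing, which replaces a zero count with a small positive value so that the geometric mean does not collapse to zero.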

## Preprocessing the data
As usual, we need to preprocess the data and tokenize it before passing it to the model.

In [10]:
from transformers import AutoTokenizer

# we will use AutoTokenizer for this purpose
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [11]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [12]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

In [13]:
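# prefix for model input
# note: there is no trailing space, so the prefix is joined directly to the sentence;
# the T5 checkpoints were pre-trained with "translate English to German: "
prefix = "translate English to Deutsch:"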

In [14]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "de"

def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
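The text-handling part of `preprocess_function` can be tried without a tokenizer. The sketch below assumes a toy batch in the shape that `map(..., batched=True)` passes in: one key, `translation`, holding a list of language-pair dicts. `build_text_pairs` is a hypothetical helper that mirrors the two list comprehensions above, and the second sentence pair is made up for illustration.

```python
prefix = "translate English to Deutsch:"

# Toy batch in the batched layout: a dict whose values are lists.
batch = {
    "translation": [
        {"de": "Wiederaufnahme der Sitzungsperiode", "en": "Resumption of the session"},
        {"de": "Guten Morgen", "en": "Good morning"},  # made-up example pair
    ]
}


def build_text_pairs(examples, prefix, source_lang="en", target_lang="de"):
    """Return the prefixed source strings and the raw target strings."""
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    return inputs, targets


inputs, targets = build_text_pairs(batch, prefix)
print(inputs[0])   # translate English to Deutsch:Resumption of the session
print(targets[0])  # Wiederaufnahme der Sitzungsperiode
```

Because `prefix` has no trailing space, the prefix and the sentence are joined without a separator. This is consistent with the token ids shown in the preprocessing output.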

In [15]:
# example of preprocessing
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[13959, 1566, 12, 3, 18609, 10, 1649, 4078, 102, 1575, 13, 8, 2363, 1], [13959, 1566, 12, 3, 18609, 10, 196, 15884, 4258, 26, 8, 2363, 13, 8, 1611, 12876, 19181, 1211, 29, 15, 26, 30, 1701, 1003, 1882, 5247, 6, 11, 27, 133, 114, 728, 541, 12, 1663, 25, 3, 9, 1095, 126, 215, 16, 8, 897, 24, 25, 2994, 3, 9, 8714, 15723, 1059, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[15158, 24860, 74, 11216, 425, 7, 4267, 32, 221, 1], [1674, 3, 49, 20635, 15, 67, 183, 17874, 6, 340, 11030, 17900, 1199, 5702, 1559, 15, 11216, 425, 7, 4267, 32, 221, 93, 3, 30604, 29, 13636, 7, 218, 1403, 3019, 7026, 6, 3, 25084, 2587, 18794, 7, 3532, 7756, 15, 674, 9242, 11621, 64, 3, 11950, 15, 6, 3, 26, 7118, 292, 11878, 16849, 8827, 5, 1]]}

In [16]:
# for example purposes we crop the dataset: the first 5000 samples for train
# and the first 500 each for validation and test
# (a new DatasetDict is built so that raw_datasets is left unmodified)
cropped_datasets = datasets.DatasetDict({
    'train': raw_datasets['train'].select(range(5000)),
    'validation': raw_datasets['validation'].select(range(500)),
    'test': raw_datasets['test'].select(range(500)),
})
tokenized_datasets = cropped_datasets.map(preprocess_function, batched=True)
tokenized_datasets['train'][0]

{'translation': {'de': 'Wiederaufnahme der Sitzungsperiode',
  'en': 'Resumption of the session'},
 'input_ids': [13959,
  1566,
  12,
  3,
  18609,
  10,
  1649,
  4078,
  102,
  1575,
  13,
  8,
  2363,
  1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [15158, 24860, 74, 11216, 425, 7, 4267, 32, 221, 1]}

## Fine-tuning the model

In [17]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
# load the pretrained model from the checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [18]:
# defining the parameters for training
batch_size = 32
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-{source_lang}-to-{target_lang}",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
    report_to='tensorboard',
)
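These settings determine how many optimizer steps the run takes and how the learning rate evolves. The sketch below works through the arithmetic under several assumptions that are not stated in the notebook: training on a single device, no gradient accumulation, and the default linear learning-rate schedule without warmup. The function name `linear_decay_lr` is hypothetical.

```python
import math

num_train_samples = 5000  # size of the cropped training split
batch_size = 32
num_train_epochs = 10
base_lr = 2e-5

# The last, smaller batch of an epoch still counts as one step.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_train_epochs


def linear_decay_lr(step, base_lr=base_lr, total_steps=total_steps):
    """Learning rate after `step` optimizer steps, assuming linear decay to zero."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


print(steps_per_epoch, total_steps)  # 157 1570
print(linear_decay_lr(500))
```

Under these assumptions the total agrees with the `global_step=1570` reported by `trainer.train()`, and the value at step 500 can be compared with the first `learning_rate` entry in the training log (epoch 3.18, which is about 500 / 157).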

In [19]:
# instead of writing a collate_fn function we will use DataCollatorForSeq2Seq,
# which similarly implements batch creation for training
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
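To give an idea of what such a collator has to do, here is a simplified list-based sketch of dynamic padding. It assumes T5's padding id of 0 for inputs and the value -100 for label padding, which the loss ignores. `pad_batch` is a hypothetical stand-in, not the library implementation, and the label ids are arbitrary. The real collator also returns tensors and, when given `model`, can prepare decoder inputs, among other things.

```python
PAD_TOKEN_ID = 0     # assumed padding id for inputs (T5 uses 0)
LABEL_PAD_ID = -100  # label positions with this id are ignored by the loss


def pad_batch(features):
    """Pad every sequence in a batch to the longest one in that batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_input = max_input - len(f["input_ids"])
        n_label = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_input)
        # Padded positions get attention mask 0 so the model skips them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_input)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * n_label)
    return batch


features = [
    {"input_ids": [8774, 6, 48, 80, 7142, 55, 1], "attention_mask": [1] * 7,
     "labels": [11, 12, 1]},          # arbitrary label ids
    {"input_ids": [100, 19, 430, 7142, 5, 1], "attention_mask": [1] * 6,
     "labels": [21, 22, 23, 24, 1]},  # arbitrary label ids
]
batch = pad_batch(features)
print(batch["input_ids"][1])  # [100, 19, 430, 7142, 5, 1, 0]
print(batch["labels"][0])     # [11, 12, 1, -100, -100]
```

Padding per batch rather than to a global maximum keeps short batches small, which is why the tokenizer calls in `preprocess_function` truncate but do not pad.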

In [20]:
import numpy as np

# simple postprocessing for text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

# compute metrics function to pass to trainer
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
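Two steps in `compute_metrics` are easy to miss: replacing the -100 label padding before decoding, and measuring the generated length by counting non-padding tokens. The list-based sketch below mirrors both, assuming a padding id of 0. The helper names and the token ids are hypothetical.

```python
PAD_TOKEN_ID = 0     # assumed padding id (T5 uses 0)
IGNORE_INDEX = -100  # label padding value that cannot be decoded


def restore_padding(label_ids):
    """Swap -100 for the pad id so the tokenizer can decode the labels."""
    return [[PAD_TOKEN_ID if tok == IGNORE_INDEX else tok for tok in row]
            for row in label_ids]


def mean_generated_length(pred_ids):
    """Average number of non-padding tokens per generated sequence."""
    lengths = [sum(1 for tok in row if tok != PAD_TOKEN_ID) for row in pred_ids]
    return sum(lengths) / len(lengths)


labels = [[15158, 24860, 1, -100, -100], [1674, 3, 49, 20635, 1]]
preds = [[0, 15158, 1, 0, 0], [0, 1674, 3, 49, 1]]

print(restore_padding(labels)[0])    # [15158, 24860, 1, 0, 0]
print(mean_generated_length(preds))  # 3.0
```

In this sketch the leading 0 in each prediction stands for the decoder start token, which shares the pad id and is therefore not counted in the length.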

In [21]:
# instead of writing a training loop we will use Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [22]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 1.3392888307571411, 'eval_bleu': 10.9899, 'eval_gen_len': 17.642, 'eval_runtime': 4.8328, 'eval_samples_per_second': 103.461, 'eval_steps_per_second': 3.311, 'epoch': 1.0}


{'eval_loss': 1.3392494916915894, 'eval_bleu': 10.8806, 'eval_gen_len': 17.65, 'eval_runtime': 4.7989, 'eval_samples_per_second': 104.19, 'eval_steps_per_second': 3.334, 'epoch': 2.0}


{'eval_loss': 1.3395845890045166, 'eval_bleu': 10.8545, 'eval_gen_len': 17.602, 'eval_runtime': 5.1505, 'eval_samples_per_second': 97.079, 'eval_steps_per_second': 3.107, 'epoch': 3.0}
{'loss': 1.3208, 'learning_rate': 1.3630573248407644e-05, 'epoch': 3.18}


{'eval_loss': 1.3398561477661133, 'eval_bleu': 10.7283, 'eval_gen_len': 17.618, 'eval_runtime': 7.5093, 'eval_samples_per_second': 66.584, 'eval_steps_per_second': 2.131, 'epoch': 4.0}


{'eval_loss': 1.3402308225631714, 'eval_bleu': 10.7102, 'eval_gen_len': 17.604, 'eval_runtime': 4.4612, 'eval_samples_per_second': 112.078, 'eval_steps_per_second': 3.587, 'epoch': 5.0}


{'eval_loss': 1.3409031629562378, 'eval_bleu': 10.7356, 'eval_gen_len': 17.598, 'eval_runtime': 4.8697, 'eval_samples_per_second': 102.677, 'eval_steps_per_second': 3.286, 'epoch': 6.0}
{'loss': 1.2741, 'learning_rate': 7.261146496815287e-06, 'epoch': 6.37}


{'eval_loss': 1.341385841369629, 'eval_bleu': 10.7952, 'eval_gen_len': 17.596, 'eval_runtime': 4.6651, 'eval_samples_per_second': 107.178, 'eval_steps_per_second': 3.43, 'epoch': 7.0}


{'eval_loss': 1.3417431116104126, 'eval_bleu': 10.7706, 'eval_gen_len': 17.59, 'eval_runtime': 6.8932, 'eval_samples_per_second': 72.535, 'eval_steps_per_second': 2.321, 'epoch': 8.0}


{'eval_loss': 1.341884970664978, 'eval_bleu': 10.8015, 'eval_gen_len': 17.584, 'eval_runtime': 4.5162, 'eval_samples_per_second': 110.711, 'eval_steps_per_second': 3.543, 'epoch': 9.0}
{'loss': 1.2634, 'learning_rate': 9.044585987261147e-07, 'epoch': 9.55}


{'eval_loss': 1.342007040977478, 'eval_bleu': 10.8123, 'eval_gen_len': 17.584, 'eval_runtime': 4.6832, 'eval_samples_per_second': 106.765, 'eval_steps_per_second': 3.416, 'epoch': 10.0}
{'train_runtime': 328.8671, 'train_samples_per_second': 152.037, 'train_steps_per_second': 4.774, 'train_loss': 1.2848550687170333, 'epoch': 10.0}


TrainOutput(global_step=1570, training_loss=1.2848550687170333, metrics={'train_runtime': 328.8671, 'train_samples_per_second': 152.037, 'train_steps_per_second': 4.774, 'train_loss': 1.2848550687170333, 'epoch': 10.0})

In [23]:
# saving model
trainer.save_model('best')

In [24]:
# load the saved model and run inference with it
model = AutoModelForSeq2SeqLM.from_pretrained('best')
model.eval()
model.config.use_cache = False

In [25]:
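def translate(model, inference_request, tokenizer=tokenizer):
    # tokenize the request and generate with the model's default generation settings
    input_ids = tokenizer(inference_request, return_tensors="pt").input_ids
    outputs = model.generate(input_ids=input_ids)
    # decode the first (and only) generated sequence back to text
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))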

In [26]:
inference_request = prefix + 'Why did it take so long to train the model?'
translate(model, inference_request, tokenizer)

Warum hat es so lange dauern, das Modell zu trainieren?


In [27]:
inference_request = prefix + "My name is Wolfgang and I live in Berlin"
translate(model, inference_request, tokenizer)

Mein Name ist Wolfgang und ich lebe in Berlin.


In [28]:
inference_request = prefix + "Your assignment is hard. Start it today"
translate(model, inference_request, tokenizer)

Ihre Aufgabe ist schwierig, beginnen Sie heute
