# Fine-tuning

Fine-tuning is the transfer-learning step in which the parameters of a model pre-trained on a large dataset are further updated by continuing training on a smaller dataset (see [Kevin Murphy's book](https://probml.github.io/pml-book/book1.html) Section 19.2 for further details). The main motivation is to adapt a model pre-trained on a large amount of data to a specific task, obtaining better performance than would be achieved by training only on the small task-specific dataset.

In [None]:
!pip install datasets evaluate transformers accelerate peft bitsandbytes
!pip install sacrebleu
!pip install huggingface_hub

In this notebook, we fine-tune on a dataset that is already available in the [Datasets repository](https://huggingface.co/datasets) from Hugging Face. The [Datasets library](https://huggingface.co/docs/datasets) also makes it easy to access and load other datasets; for example, you can load your own dataset by following [this tutorial](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

More precisely, we are going to explain how to fine-tune the [Llama2 model](https://huggingface.co/docs/transformers/model_doc/llama2) on the [Europarl-ST dataset](https://huggingface.co/datasets/tj-solergibert/Europarl-ST), using only the [subset of Europarl-ST that contains the text data for machine translation (MT) from English](https://huggingface.co/datasets/tj-solergibert/Europarl-ST-processed-mt-en).

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("tj-solergibert/Europarl-ST-processed-mt-en")

print(raw_datasets)



DatasetDict({
    train: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang'],
        num_rows: 602605
    })
    test: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang'],
        num_rows: 86170
    })
    valid: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang'],
        num_rows: 81968
    })
})


As shown, Europarl-ST already comes with a pre-defined partition into the three conventional sets: training, validation and test. Each set has three columns: the source sentences (source_text), the target sentences (dest_text) and the target language (dest_lang).

Let's take a closer look at the features of the training set:

In [2]:
raw_datasets["train"].features

{'source_text': Value('string'),
 'dest_text': Value('string'),
 'dest_lang': ClassLabel(names=['de', 'en', 'es', 'fr', 'it', 'nl', 'pl', 'pt', 'ro'])}

As you can see, the possible target languages are German, English, Spanish, French, Italian, Dutch, Polish, Portuguese and Romanian.

Let us take a look at the translations of the first two English sentences:

In [3]:
raw_datasets["train"][:14]["source_text"]

['Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'During this period, two problems have essentially arisen: the definition of the scope of the exemption and the impossibility of recovering VAT incurred in ord

In [4]:
raw_datasets["train"][:14]["dest_text"]

['Seit 1977 wurden die meisten Finanzdienstleistungen, einschließlich Versicherungen und Verwaltung von Investmentfunds, von der Mehrwertsteuer ausgenommen.',
 'La mayoría de los servicios financieros, incluidos los seguros y la gestión de fondos de inversión, están exentos de IVA desde 1977.',
 'Depuis 1977, la plupart des services financiers, dont les assurances et la gestion des fonds de placement, ne sont pas tenus d ’ appliquer une TVA.',
 'Dal 1997 la maggior parte dei servizi finanziari, compresi i servizi assicurativi e la gestione di fondi di investimento, sono esenti da IVA.',
 'Sinds 1977 zijn de meeste financiële diensten, met inbegrip van verzekeringen en het beheer van beleggingsfondsen, vrijgesteld van btw.',
 'większość usług finansowych, w tym usług w zakresie ubezpieczeń i zarządzania funduszami inwestycyjnymi, była zwolniona z opodatkowania podatkiem VAT.',
 'Desde 1977 que a maioria dos serviços financeiros, incluindo os seguros e a gestão de fundos de investimento,

In [5]:
raw_datasets["train"][:14]["dest_lang"]

[0, 2, 3, 4, 5, 6, 7, 0, 2, 3, 4, 5, 6, 7]

As shown, each English sentence is repeated for each of the seven target languages (0: 'de', 2: 'es', 3: 'fr', 4: 'it', 5: 'nl', 6: 'pl', 7: 'pt').
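
The integer codes can be mapped back to language codes through the list of names of the ClassLabel feature. The following is a minimal, self-contained sketch: both lists are copied by hand from the outputs above instead of being read from the dataset, and the variable names are only illustrative.

```python
# Language names copied from the ClassLabel feature printed above
names = ['de', 'en', 'es', 'fr', 'it', 'nl', 'pl', 'pt', 'ro']

# dest_lang ids of the first 14 training rows, as printed above
dest_lang_ids = [0, 2, 3, 4, 5, 6, 7, 0, 2, 3, 4, 5, 6, 7]

# Map each integer id to its language code
dest_langs = [names[i] for i in dest_lang_ids]
print(dest_langs[:7])

# Language codes that do not occur among these rows
missing = sorted(set(names) - set(dest_langs))
print(missing)
```

With the loaded dataset, the same lookup is available through the int2str() method of the ClassLabel feature, for example raw_datasets["train"].features["dest_lang"].int2str(2).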

The Llama2 model is a pretrained Large Language Model (LLM) ready to tackle several NLP tasks, one of them being translation from English into Spanish. Let us filter Europarl-ST to keep only the English-to-Spanish pairs using a simple [lambda function](https://realpython.com/python-lambda/) with the [Dataset.filter() function](https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.filter).

In [6]:
lang="es"
lang_id = raw_datasets["train"].features["dest_lang"].names.index(lang)
raw_datasets = raw_datasets.filter(lambda x: x["dest_lang"] == lang_id)

More precisely, we are going to use the Llama-2 checkpoint [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) to run our experiments, for which you need to accept the LLAMA 2 COMMUNITY LICENSE AGREEMENT. Processing your request may take some time, so please do it in advance.

Log in to Hugging Face to be granted access to Llama2 with 7B parameters:

In [7]:
from huggingface_hub import login

login(token="HERE YOUR HUGGINGFACE TOKEN")

We can apply the tokenizer to any dataset, taking advantage of the fact that Hugging Face Datasets are [Apache Arrow](https://arrow.apache.org) files stored on disk, so only the samples you ask for are loaded in memory.

To keep the data as a dataset, we will use the [Dataset.map() function](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset.map). This also gives us some extra flexibility if we need more preprocessing than just tokenization. The map() method works by applying a function to each element of the dataset.

In our case, each sample pair is going to be preprocessed according to the needs of the model that is to be fine-tuned. In the case of Llama2, it is recommended to explicitly state a task prompt for each source sentence:

In [8]:
from transformers import AutoTokenizer

max_tok_length = 16
checkpoint = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint, token=True,  # token replaces the deprecated use_auth_token argument
    padding=True,
    pad_to_multiple_of=8,
    truncation=True,
    max_length=max_tok_length,
    padding_side='left',
    )
tokenizer.pad_token = tokenizer.eos_token



In [9]:
def preprocess_function(sample):
    model_inputs = tokenizer(
        sample["source_text"], 
        text_target = sample["dest_text"],
        )
    return model_inputs

The Datasets library applies this processing by adding new fields to the datasets, one for each key in the dictionary returned by the preprocessing function, that is, *input_ids*, *attention_mask* and *labels*. We can check what preprocess_function is doing with a small sample:

In [10]:
sample = raw_datasets["train"].select(range(2))
model_input = preprocess_function({
    "source_text": list(sample["source_text"]),
    "dest_text": list(sample["dest_text"]),
})
print(model_input)

{'input_ids': [[1, 4001, 29871, 29896, 29929, 29955, 29955, 29892, 1556, 18161, 5786, 29892, 3704, 1663, 18541, 322, 13258, 358, 5220, 10643, 29892, 505, 1063, 429, 3456, 515, 478, 1299, 29889], [1, 7133, 445, 3785, 29892, 1023, 4828, 505, 13674, 564, 7674, 29901, 278, 5023, 310, 278, 6874, 310, 278, 11875, 683, 322, 278, 7275, 29879, 4127, 310, 9792, 292, 478, 1299, 297, 2764, 1127, 297, 1797, 304, 3867, 429, 3456, 5786, 29892, 6820, 14451, 304, 278, 27791, 265, 310, 7934, 478, 1299, 29889]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[1, 997, 26960, 316, 1232, 3348, 19382, 22347, 9672, 29892, 13654, 4396, 1232, 7025, 1883, 343, 425, 7737, 3175, 316, 6299, 359, 316, 297, 874, 3175, 29892, 21654, 429, 296, 359, 316, 6599, 29909, 5125, 29871, 29896, 299

In [11]:
for sample in model_input['input_ids']:
    print(tokenizer.convert_ids_to_tokens(sample))

['<s>', '▁Since', '▁', '1', '9', '7', '7', ',', '▁most', '▁financial', '▁services', ',', '▁including', '▁ins', 'urance', '▁and', '▁invest', 'ment', '▁fund', '▁management', ',', '▁have', '▁been', '▁ex', 'empt', '▁from', '▁V', 'AT', '.']
['<s>', '▁During', '▁this', '▁period', ',', '▁two', '▁problems', '▁have', '▁essentially', '▁ar', 'isen', ':', '▁the', '▁definition', '▁of', '▁the', '▁scope', '▁of', '▁the', '▁exem', 'ption', '▁and', '▁the', '▁impos', 's', 'ibility', '▁of', '▁recover', 'ing', '▁V', 'AT', '▁in', 'cur', 'red', '▁in', '▁order', '▁to', '▁provide', '▁ex', 'empt', '▁services', ',', '▁giving', '▁rise', '▁to', '▁the', '▁phenomen', 'on', '▁of', '▁hidden', '▁V', 'AT', '.']


We can recover the source text by applying the [batch_decode](https://huggingface.co/docs/transformers/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode) method of the tokenizer:

In [12]:
tokenizer.batch_decode(model_input['input_ids'])

['<s> Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 '<s> During this period, two problems have essentially arisen: the definition of the scope of the exemption and the impossibility of recovering VAT incurred in order to provide exempt services, giving rise to the phenomenon of hidden VAT.']

Now, we can apply the preprocess_function to the raw datasets (training, validation and test):

In [13]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

We are going to filter the tokenized datasets by maximum number of tokens in source and target language:

In [14]:
tokenized_datasets = tokenized_datasets.filter(lambda x: len(x["input_ids"]) <= max_tok_length and len(x["labels"]) <= max_tok_length , desc=f"Discarding source and target sentences with more than {max_tok_length} tokens")

We can take a quick look at the length histogram in the source language:

In [15]:
dic = {}
for sample in tokenized_datasets['train']:
    sample_length = len(sample['input_ids'])
    if sample_length not in dic:
        dic[sample_length] = 1
    else:
        dic[sample_length] += 1 

for i in range(1,max_tok_length+1):
    if i in dic:
        print(f"{i:>2} {dic[i]:>3}")

 3   6
 4  64
 5  79
 6 304
 7 455
 8 568
 9 704
10 703
11 629
12 545
13 370
14 200
15 135
16  68


Checking a sample after filtering by maximum number of tokens:

In [16]:
for sample in tokenized_datasets['train'].select(range(5)):
    print(sample['input_ids'])
    print(sample['attention_mask'])
    print(sample['labels'])

[1, 3237, 7178, 29892, 591, 2609, 12522, 1749, 5076, 304, 445, 29889]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 922, 30046, 272, 28828, 29892, 694, 13279, 7681, 274, 3127, 279, 1232, 288, 14736, 29889]
[1, 1334, 817, 304, 4337, 7113, 7824, 2428, 4924, 29889]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 11389, 712, 1029, 4096, 279, 17926, 1185, 2428, 1730, 3175, 14721, 29874, 29889]
[1, 450, 24161, 411, 16762, 471, 6200, 1407, 10676, 29889]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 319, 15255, 6079, 29892, 425, 24161, 378, 16762, 3576, 12287, 15258, 29889]
[1, 512, 445, 3390, 2086, 29892, 591, 526, 10223, 292, 363, 278, 5434, 29889]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 18247, 427, 831, 29877, 29892, 4697, 14054, 29892, 707, 14054, 447, 27182, 3105, 2192, 29889]
[1, 2193, 338, 451, 1855, 29311, 3381, 29889]
[1, 1, 1, 1, 1, 1, 1, 1]
[1, 382, 578, 694, 831, 425, 1120, 29463, 983, 29311, 22919, 29889]


In [17]:
import torch

src = "en"
tgt = lang
task_prefix = f"Translate from {src} to {tgt}:\n"
s = ""

prefix_tok_len = len(tokenizer.encode(f"{task_prefix}{src}: {s} = {tgt}: "))
max_tok_len = prefix_tok_len
# Adding 2 for new line in target sentence and eos_token_id token
max_tok_len += 2 * max_tok_length + 2


def preprocess4training_function(sample):
    
    sample_size = len(sample["source_text"])

    # Creating the prompt with the task description for each source sentence
    inputs  = [f"{task_prefix}{src}: {s} = {tgt}: " for s in sample["source_text"]]

    # Appending new line after each sample in the batch
    targets = [f"{s}\n" for s in sample["dest_text"]]

    # Applying the Llama2 tokenizer to the inputs and targets 
    # to obtain "input_ids" (token_ids) and "attention mask" 
    model_inputs = tokenizer(inputs)
    labels = tokenizer(targets)
    
    # Each input is extended with its target followed by the eos token
    # Each target is prepended with as many ignore-index ids (-100) as the original input length
    # After this step, input and target (label) have the same length
    # Attention mask is all 1s
    for i in range(sample_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i] + [tokenizer.eos_token_id]
        model_inputs["input_ids"][i] = sample_input_ids + label_input_ids
        labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids
        model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i])

    # Each input is applied left padding up to max_tok_len
    # Attention mask is 0 for padding
    # Each target (label) is left filled with special token id (-100)
    # Finally inputs, attention_mask and targets (labels) are truncated to max_tok_len
    for i in range(sample_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i]
        model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * (
            max_tok_len - len(sample_input_ids)
        ) + sample_input_ids
        model_inputs["attention_mask"][i] = [0] * (max_tok_len - len(sample_input_ids)) + model_inputs[
            "attention_mask"
        ][i]
        labels["input_ids"][i] = [-100] * (max_tok_len - len(sample_input_ids)) + label_input_ids
        model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_tok_len])
        model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_tok_len])
        labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_tok_len])
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


We can check what the preprocess4training_function is doing:

In [18]:
sample = tokenized_datasets['train'].select(range(2))
model_input = preprocess4training_function(sample)
print(model_input)
print(tokenizer.batch_decode(model_input.input_ids))

{'input_ids': [tensor([    2,     2,     2,     2,     2,     2,     1,  4103,  9632,   515,
          427,   304,   831, 29901,    13,   264, 29901,  3237,  7178, 29892,
          591,  2609, 12522,  1749,  5076,   304,   445, 29889,   353,   831,
        29901, 29871,     1,   922, 30046,   272, 28828, 29892,   694, 13279,
         7681,   274,  3127,   279,  1232,   288, 14736, 29889,    13,     2]), tensor([    2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
            1,  4103,  9632,   515,   427,   304,   831, 29901,    13,   264,
        29901,  1334,   817,   304,  4337,  7113,  7824,  2428,  4924, 29889,
          353,   831, 29901, 29871,     1, 11389,   712,  1029,  4096,   279,
        17926,  1185,  2428,  1730,  3175, 14721, 29874, 29889,    13,     2])], 'attention_mask': [tensor([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1]), tenso

We need to replace -100 with a valid token id (here the padding token id) to apply batch_decode:

In [19]:
import numpy as np
for i in range(len(model_input['labels'])):
  print(tokenizer.batch_decode([np.where(model_input['labels'][i] < 0, tokenizer.pad_token_id, model_input['labels'][i])]))

['</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s><s> Señor Presidente, no podemos cerrar los ojos.\n</s>']
['</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s><s> Hay que avanzar hacia una supervisión europea.\n</s>']
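
The padding and masking scheme used by preprocess4training_function can be reproduced on toy token ids without loading the tokenizer. The sketch below is illustrative only: build_example and the ids are made up, with 2 standing in for both the padding and the end-of-sequence id, as in this notebook.

```python
IGNORE_INDEX = -100  # label value that the loss function ignores


def build_example(prompt_ids, target_ids, eos_id, pad_id, max_len):
    """Concatenate prompt and target, mask the prompt in the labels, left-pad."""
    input_ids = prompt_ids + target_ids + [eos_id]
    # Only the target (and the eos token) contributes to the loss
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids + [eos_id]
    attention_mask = [1] * len(input_ids)

    # Left padding: pad ids in the inputs, 0 in the mask, -100 in the labels
    n_pad = max(0, max_len - len(input_ids))
    input_ids = [pad_id] * n_pad + input_ids
    attention_mask = [0] * n_pad + attention_mask
    labels = [IGNORE_INDEX] * n_pad + labels

    # Truncate everything to max_len
    return {
        "input_ids": input_ids[:max_len],
        "attention_mask": attention_mask[:max_len],
        "labels": labels[:max_len],
    }


example = build_example([1, 10, 11], [20, 21], eos_id=2, pad_id=2, max_len=8)
for key, value in example.items():
    print(key, value)
```

Every position that belongs to the padding or to the prompt carries -100 in the labels, so only the target tokens and the final end-of-sequence token are meant to be scored.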


In the case of the test set, we just preprocess the inputs (source sentences):

In [20]:
def preprocess4test_function(sample):
    inputs = [f"{task_prefix}{src}: {s} = {tgt}: " for s in sample["source_text"]]
    model_inputs = tokenizer(inputs,padding=True,)
    return model_inputs

We can check what the preprocess4test_function is doing:

In [21]:
sample = tokenized_datasets['train'].select(range(2))
model_input = preprocess4test_function(sample)
print(model_input)
print(tokenizer.batch_decode(model_input.input_ids))

{'input_ids': [[1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 3237, 7178, 29892, 591, 2609, 12522, 1749, 5076, 304, 445, 29889, 353, 831, 29901, 29871], [2, 2, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 1334, 817, 304, 4337, 7113, 7824, 2428, 4924, 29889, 353, 831, 29901, 29871]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
['<s> Translate from en to es:\nen: Mr President, we cannot shut our eyes to this. = es: ', '</s></s><s> Translate from en to es:\nen: We need to move towards European supervision. = es: ']


Preprocessing train and dev sets:

In [22]:
preprocessed_train_dataset = tokenized_datasets['train'].map(preprocess4training_function, batched=True)
preprocessed_dev_dataset = tokenized_datasets['valid'].map(preprocess4training_function, batched=True)

Map: 100%|██████████| 617/617 [00:00<00:00, 13023.10 examples/s]


In [23]:
for sample in preprocessed_train_dataset.select(range(5)):
    print(sample['input_ids'])
    print(sample['attention_mask'])
    print(sample['labels'])

[2, 2, 2, 2, 2, 2, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 3237, 7178, 29892, 591, 2609, 12522, 1749, 5076, 304, 445, 29889, 353, 831, 29901, 29871, 1, 922, 30046, 272, 28828, 29892, 694, 13279, 7681, 274, 3127, 279, 1232, 288, 14736, 29889, 13, 2]
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 1, 922, 30046, 272, 28828, 29892, 694, 13279, 7681, 274, 3127, 279, 1232, 288, 14736, 29889, 13, 2]
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 1334, 817, 304, 4337, 7113, 7824, 2428, 4924, 29889, 353, 831, 29901, 29871, 1, 11389, 712, 1029, 4096, 279, 17926, 1185, 2428, 1730, 3175, 14721, 29874, 29889, 13, 2]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1

Preprocessing test set:

In [24]:
preprocessed_test_dataset = tokenized_datasets['test'].map(preprocess4test_function, batched=True)

In [25]:
for sample in preprocessed_test_dataset.select(range(5)):
    print(sample['input_ids'])
    print(sample['attention_mask'])
    print(sample['labels'])

[2, 2, 2, 2, 2, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 1938, 591, 864, 304, 26054, 895, 278, 2791, 1691, 29973, 353, 831, 29901, 29871]
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 18613, 2182, 7884, 26054, 15356, 1232, 16856, 2255, 29973]
[2, 2, 2, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 2398, 29892, 445, 947, 451, 2099, 766, 29885, 424, 1847, 963, 29889, 353, 831, 29901, 29871]
[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 28608, 831, 29877, 694, 28711, 553, 29885, 424, 295, 279, 5409, 29889]
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 306, 674, 1286, 1369, 29889, 353, 831, 29901, 29871]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 319, 15255, 16296, 712, 3710, 347, 2502, 29889]
[2, 2, 2, 2, 2, 2, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 45

[bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/index) is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4 bits and enable many other options by configuring the BitsAndBytesConfig class. For example, you can:

<ul>
<li>set load_in_4bit=True to quantize the model to 4 bits when you load it</li>
<li>set bnb_4bit_quant_type="nf4" to use a special 4-bit data type for weights initialized from a normal distribution</li>
<li>set bnb_4bit_use_double_quant=True to use a nested quantization scheme to quantize the already quantized weights</li>
<li>set bnb_4bit_compute_dtype=torch.bfloat16 to use bfloat16 for faster computation</li>
</ul>


In [26]:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
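
To get a feeling for what 4-bit quantization saves, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It is only a rough sketch: the parameter count is an approximation for Llama-2-7B (about 6.74 billion), and the estimate ignores quantization constants, activations, optimizer state and any layers kept in higher precision.

```python
# Rough weight-memory estimate; the parameter count is an assumption (~6.74e9)
n_params = 6_738_415_616

bytes_bf16 = n_params * 2    # 16 bits = 2 bytes per weight
bytes_4bit = n_params * 0.5  # 4 bits = 0.5 bytes per weight

print(f"bf16 : {bytes_bf16 / 1e9:.1f} GB")
print(f"4-bit: {bytes_4bit / 1e9:.1f} GB")
print(f"ratio: {bytes_bf16 / bytes_4bit:.0f}x")
```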

Pass the quantization_config to the from_pretrained method.

In [27]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    token=True,
    quantization_config=quantization_config,
    dtype=torch.bfloat16,
)

Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00,  5.10s/it]


Next, you should call the prepare_model_for_kbit_training() function to preprocess the quantized model for training.

In [28]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False, gradient_checkpointing_kwargs={'use_reentrant':False})

[LoRA (Low-Rank Adaptation of Large Language Models)](https://huggingface.co/docs/peft/task_guides/lora_based_methods) is a [parameter-efficient fine-tuning (PEFT)](https://huggingface.co/docs/peft/index) technique that significantly reduces the number of trainable parameters. It works by inserting a smaller number of new weights into the model and only these are trained. This makes training with LoRA much faster, memory-efficient, and produces smaller model weights (a few hundred MBs), which are easier to store and share.

Each PEFT method is defined by a PeftConfig class that stores all the important parameters for building a PeftModel. For example, to train with LoRA, load and create a LoraConfig class and specify the following parameters:

<ul>
<li>task_type: the task to train for (causal language modeling in this case)</li>
<li>r: the dimension of the low-rank matrices</li>
<li>lora_alpha: the scaling factor for the low-rank matrices</li>
<li>target_modules: the set of modules whose parameters are adapted (not set below, so the PEFT default for the architecture is used)</li>
<li>lora_dropout: the dropout probability of the LoRA layers</li>
</ul>

In [29]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    inference_mode=False,
)

Once LoRA and the quantization are set up, create a quantized PeftModel with the get_peft_model() function. It takes a quantized model and the LoraConfig containing the parameters for how to configure a model for training with LoRA.

In [30]:
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243
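
The number of trainable parameters reported above can be cross-checked by hand. The sketch below assumes the Llama-2-7B dimensions (hidden size 4096, 32 layers) and assumes that, since target_modules is not set, LoRA is applied to two square projections per layer (the query and value projections); this assumption is consistent with the reported count but is not stated in the output itself.

```python
# Assumed Llama-2-7B dimensions and adapted modules (see the text above)
hidden_size = 4096
num_layers = 32
adapted_per_layer = 2  # assumed: query and value projections
r = 16                 # LoRA rank from the LoraConfig

# Each adapted (hidden x hidden) projection gets two low-rank matrices:
# A of shape (r x hidden) and B of shape (hidden x r)
params_per_module = r * hidden_size + hidden_size * r
trainable = params_per_module * adapted_per_layer * num_layers
print(f"{trainable:,}")

# Share of all parameters, using the total reported by PEFT
all_params = 6_746_804_224
percentage = f"{100 * trainable / all_params:.4f}"
print(percentage)
```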


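The function responsible for putting together the samples of a batch is called a collate function. Here we use DataCollatorForLanguageModeling with mlm=False, which selects causal rather than masked language modeling: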

In [31]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, pad_to_multiple_of=8)

## Training

The first step before we can define our [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer) is to define a [TrainingArguments class](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments) that will contain all the hyperparameters the Trainer will use for training and evaluation. The only compulsory argument is the directory where the trained model and the intermediate checkpoints will be saved. The remaining arguments can be set following the recommendations of the model developers:

In [32]:
from transformers import TrainingArguments

batch_size = 4
gradient_accumulation_steps = 8
model_name = checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-en-to-es",
    eval_strategy = "epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=3,
    warmup_steps=100,
    optim="adamw_bnb_8bit",
    prediction_loss_only=True,
    gradient_accumulation_steps = gradient_accumulation_steps,
    bf16=True,
    bf16_full_eval=True,
    group_by_length=True,
)

Once we have our model, we can define a Trainer by passing it all the objects constructed up to now: the model, the training arguments (args), the training and validation datasets, the tokenizer and the data collator:

In [33]:
from transformers import Trainer

trainer = Trainer(
    lora_model,
    args,
    train_dataset=preprocessed_train_dataset,
    eval_dataset=preprocessed_dev_dataset,
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument
    data_collator=data_collator,
)




To fine-tune the model on our dataset, we just have to call the [train() function](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.train) of our Trainer. Note that the [wandb library](https://docs.wandb.ai/guides) is used for logging, which requires a [wandb account and login](https://docs.wandb.ai/guides/integrations/huggingface/).

In [34]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.
The model is already on multiple devices. Skipping the move to device specified in `args`.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 0.892714        |
| 2     | No log        | 0.877258        |
| 3     | No log        | 0.874974        |


TrainOutput(global_step=453, training_loss=1.1506974176065812, metrics={'train_runtime': 1799.9671, 'train_samples_per_second': 8.05, 'train_steps_per_second': 0.252, 'total_flos': 3.220961853505536e+16, 'train_loss': 1.1506974176065812, 'epoch': 3.0})
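
The reported global_step can be cross-checked from the training hyperparameters. The sketch below takes the number of training samples from the length histogram printed earlier and assumes a single data-parallel process, so that one optimizer update consumes batch_size times gradient_accumulation_steps samples.

```python
import math

# Counts from the source-length histogram printed earlier (lengths 3 to 16)
counts = [6, 64, 79, 304, 455, 568, 704, 703, 629, 545, 370, 200, 135, 68]
num_train_samples = sum(counts)

batch_size = 4
gradient_accumulation_steps = 8
num_train_epochs = 3

# Mini-batches per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(num_train_samples / batch_size)
updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_updates = updates_per_epoch * num_train_epochs

print(num_train_samples, batches_per_epoch, updates_per_epoch, total_updates)
```

Assuming the default logging_steps of 500, this also explains the "No log" entries: the run finishes before the first training-loss log would be written.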

## Inference

At inference time, it is recommended to use [generate()](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationMixin.generate). This method takes the tokenized input and auto-regressively generates the output tokens. Check out [this blog post](https://huggingface.co/blog/how-to-generate) to know all the details about generating text with Transformers.

Let us first load the default inference parameters of Llama-2: 

In [35]:
from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained(
    checkpoint,
)

print(generation_config)

GenerationConfig {
  "bos_token_id": 1,
  "do_sample": true,
  "eos_token_id": 2,
  "max_length": 4096,
  "pad_token_id": 0,
  "temperature": 0.6,
  "top_p": 0.9
}



As observed, the default search strategy for Llama-2 is Top-p with probability 0.9 and temperature 0.6 ($0<T<1$ amplifies output probability differences and makes output more deterministic). [The search strategy can be selected](https://huggingface.co/docs/transformers/en/generation_strategies) at inference time. 
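
The effect of the temperature and of Top-p can be illustrated on a toy next-token distribution. This is a simplified sketch in plain Python, not the implementation used by Transformers: the logits are made up, and top_p_filter keeps the smallest set of most probable tokens whose cumulative probability reaches top_p before renormalizing.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized probabilities


logits = [2.0, 1.0, 0.5, -1.0]       # made-up scores for four tokens
plain = softmax(logits, 1.0)         # T = 1 leaves the distribution unchanged
sharp = softmax(logits, 0.6)         # T = 0.6 sharpens it
kept = top_p_filter(sharp, 0.9)      # tokens that survive Top-p with p = 0.9

print([round(p, 3) for p in plain])
print([round(p, 3) for p in sharp])
print(sorted(kept))
```

With T = 0.6 the most probable token gains probability mass with respect to T = 1, and sampling is then restricted to the tokens kept by the Top-p filter.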

First, the test set is divided into small batches to reduce GPU memory consumption:

In [36]:
test_batch_size = 4
batch_tokenized_test = preprocessed_test_dataset.batch(test_batch_size)

Batching examples: 100%|██████████| 678/678 [00:00<00:00, 5593.63 examples/s]


Batches are provided to the [generate()](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationMixin.generate) together with inference parameters to define the search strategy. In this case, num_beams = 1 and do_sample = False means greedy search. 

In [37]:
number_of_batches = len(batch_tokenized_test["input_ids"])
output_sequences = []
for i in range(number_of_batches):
    with torch.no_grad():
        output_batch = lora_model.generate(
            generation_config=generation_config, 
            input_ids=torch.tensor(batch_tokenized_test["input_ids"][i]).cuda(), 
            attention_mask=torch.tensor(batch_tokenized_test["attention_mask"][i]).cuda(), 
            max_length = max_tok_len, 
            num_beams=1, 
            do_sample=False,)
    output_sequences.extend(output_batch)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


## Evaluation

The output of the model is automatically evaluated against the reference translations. For this purpose, we use the [Evaluate library](https://huggingface.co/docs/evaluate), which includes the definition of generic and task-specific metrics. In our case, we use the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu), or to be more precise, [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu).

In [38]:
from evaluate import load

metric = load("sacrebleu")

The example below performs a basic post-processing to decode the predictions and extract the translation:

In [39]:
import re

def compute_metrics(sample, output_sequences):
    inputs = [f"{task_prefix}{src}: {s} = {tgt}: "  for s in sample["source_text"]]
    preds = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
    print(inputs)
    print(preds)
    for i, (input,pred) in enumerate(zip(inputs,preds)):
      pred = re.search(r'^.*\n',pred.removeprefix(input).lstrip())
      if pred is not None:
        preds[i] = pred.group()[:-1]
      else:
        preds[i] = ""
    print(sample["source_text"])
    print(sample["dest_text"])
    print(preds)
    result = metric.compute(predictions=preds, references=sample["dest_text"])
    result = {"bleu": result["score"]}
    return result
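
The extraction step inside compute_metrics can be tried in isolation on a decoded string. In the sketch below, extract_translation is a hypothetical helper that wraps the same regular expression, and the decoded text is a made-up example rather than real model output. It needs Python 3.9 or later because of str.removeprefix().

```python
import re


def extract_translation(prompt, decoded):
    """Return the first generated line after the prompt, or "" if there is none."""
    match = re.search(r'^.*\n', decoded.removeprefix(prompt).lstrip())
    # Drop the trailing newline captured by the pattern
    return match.group()[:-1] if match is not None else ""


prompt = "Translate from en to es:\nen: I will now start. = es: "

# Made-up generation: a translation, a new line, then unwanted extra text
decoded = prompt + "Voy a empezar ahora.\nen: Thank you."
translation = extract_translation(prompt, decoded)
print(repr(translation))

# Without a new line after the translation nothing is extracted
empty = extract_translation(prompt, prompt + "Voy a empezar ahora.")
print(repr(empty))
```

The second call shows why the new line appended to each target during training matters: the post-processing relies on it to find the end of the translation.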

In [40]:
result = compute_metrics(preprocessed_test_dataset,output_sequences)
print(f'BLEU score: {result["bleu"]}')

['Translate from en to es:\nen: Do we want to liberalise the markets? = es: ', 'Translate from en to es:\nen: However, this does not mean dismantling them. = es: ', 'Translate from en to es:\nen: I will now start. = es: ', 'Translate from en to es:\nen: The international community cannot remain impassive. = es: ', 'Translate from en to es:\nen: Nonetheless, there must be a balanced outcome. = es: ', 'Translate from en to es:\nen: We now know what he wanted it for. = es: ', 'Translate from en to es:\nen: Secondly, some of your suggestions are inadvisable. = es: ', 'Translate from en to es:\nen: This would not do. = es: ', 'Translate from en to es:\nen: I cannot, therefore, agree with you on this point. = es: ', 'Translate from en to es:\nen: It can already be seen, Mr Cappato. = es: ', 'Translate from en to es:\nen: Thank you for your generosity, Mr President. = es: ', 'Translate from en to es:\nen: The ideal solution would be to use physical means. = es: ', 'Translate from en to es:\ne