<a href="https://colab.research.google.com/github/Priscilla97/llm-rag-foundations/blob/main/02_fine_tuning/1_Preprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Processing the data (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [None]:
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

# Loading a dataset from the Hub
Of course, just training the model on two sentences is not going to yield very good results. To get better results, you will need to prepare a bigger dataset.

We will use as an example the MRPC (Microsoft Research Paraphrase Corpus) dataset, introduced in a paper by William B. Dolan and Chris Brockett.

The dataset consists of 5,801 pairs of sentences, with a label indicating if they are paraphrases or not (i.e., if both sentences mean the same thing).

Dataset Documentation: https://huggingface.co/docs/datasets/index

The 🤗 Datasets library provides a very simple command to download and cache a dataset on the Hub. We can download the MRPC dataset like this:

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

As you can see, we get a **DatasetDict** object which contains the *training set, the validation set, and the test set*.

Each of those contains several **columns** (sentence1, sentence2, label, and idx) and a variable **number of rows**, which are the number of elements in each set (so, there are 3,668 pairs of sentences in the training set, 408 in the validation set, and 1,725 in the test set).

*This command downloads and caches the dataset, by default in ~/.cache/huggingface/datasets. Recall from Chapter 2 that you can customize your cache folder by setting the HF_HOME environment variable.*

We can access each pair of sentences in our raw_datasets object by indexing, like with a dictionary:

In [None]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

We can see the **labels are already integers**, so we won't have to do any preprocessing there.

To know which integer corresponds to which label, we can inspect the **features** of our raw_train_dataset. This will tell us the **type of each column:**

In [None]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'idx': Value(dtype='int32', id=None)}

Behind the scenes, label is of type ClassLabel, and the mapping of integers to label names is stored in its names attribute: 0 corresponds to not_equivalent, and 1 corresponds to equivalent.
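The mapping is positional: the integer label is simply the index of the name in that list. Here is a minimal plain-Python sketch of the lookup, using the two names printed above. The helpers `int2str` and `str2int` are defined here for illustration only and do not call the 🤗 Datasets library.

```python
# Names copied from the ClassLabel feature printed above.
names = ["not_equivalent", "equivalent"]


def int2str(label_id: int) -> str:
    """Return the label name stored at position label_id."""
    return names[label_id]


def str2int(label_name: str) -> int:
    """Return the integer ID (list position) of a label name."""
    return names.index(label_name)


print(int2str(1))                 # equivalent
print(str2int("not_equivalent"))  # 0
```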

# Preprocessing a dataset
To preprocess the dataset, we need to **convert the text to numbers** the model can make sense of. This is done with a **tokenizer**.

Tokenizer guide: https://huggingface.co/docs/transformers/main/en/tokenizer_summary

We can feed the tokenizer one sentence or a list of sentences, so we can directly tokenize all the first sentences and all the second sentences of each pair like this:

In [None]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

However, we can't just pass two sequences to the model and get a prediction of whether the two sentences are paraphrases or not. We need to handle the **two sequences as a pair**, and apply the appropriate preprocessing. Fortunately, the tokenizer can also take a pair of sequences and prepare it the way our BERT model expects.

**token_type_ids** tells the model which part of the input is the first sentence and which is the second sentence.

In [None]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{ 
  'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102],
  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}

If we decode the IDs inside input_ids back to words:

In [None]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']

So we see the model expects the inputs to be of the form [CLS] sentence1 [SEP] sentence2 [SEP] when there are two sentences. Aligning this with the token_type_ids gives us:

['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]'] <br>
[      0,      0,    0,     0,       0,          0,   0,       0,      1,    1,     1,        1,     1,   1,       1]

As you can see, the parts of the input corresponding to [CLS] sentence1 [SEP] all have a token type ID of 0, while the other parts, corresponding to sentence2 [SEP], all have a token type ID of 1.
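To make the layout concrete, here is a small plain-Python sketch that rebuilds the same pair structure from two lists of already-split tokens. The function `build_pair` is a hypothetical helper written for this illustration. It is not how the tokenizer is implemented, and it skips the actual subword tokenization.

```python
def build_pair(tokens_a, tokens_b):
    """Join two token lists as [CLS] A [SEP] B [SEP] and derive token type IDs."""
    first = ["[CLS]"] + tokens_a + ["[SEP]"]  # segment 0
    second = tokens_b + ["[SEP]"]             # segment 1
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
)

# Print each token next to its segment ID.
for token, type_id in zip(tokens, token_type_ids):
    print(f"{token:>10} {type_id}")
```

The resulting `token_type_ids` list has eight 0s followed by seven 1s, matching the tokenizer output shown earlier.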

Now that we have seen how our tokenizer can deal with one pair of sentences, we can use it to tokenize our **whole dataset:** we can feed the tokenizer a list of pairs of sentences by giving it the list of first sentences, then the list of second sentences.

This is also compatible with the padding and truncation options we saw in Chapter 2. So, one way to preprocess the training dataset is:

In [None]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

This works well, but it has the **disadvantage** of returning a dictionary (with the keys input_ids, attention_mask, and token_type_ids, and values that are lists of lists).

It will also only work if you have **enough RAM** to store your whole dataset during the tokenization.

To keep the data as a dataset, we will use the **Dataset.map()** method.

This also allows us some extra *flexibility*, if we need more preprocessing done than just tokenization.

The **map()** method works by applying a function on each element of the dataset, so let's define a function that tokenizes our inputs:

In [None]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys input_ids, attention_mask, and token_type_ids.

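Because the tokenizer also accepts lists of sentences, the same function works when each key holds a list of values. This lets us pass **batched=True** in our call to map(), which will greatly *speed up the tokenization.*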

Note that we've left the padding argument out in our tokenization function for now. This is because **padding all the samples to the maximum length is not efficient**: it's better to pad the samples when we're building a **batch**, as then we only need to *pad to the maximum length in that batch*. This can **save a lot of time and processing power** when the inputs have very variable lengths!

Here is how we apply the **tokenization function on all our datasets** at once. We're using **batched=True** in our call to map so the function is applied to multiple elements of our dataset at once, and not on each element separately. This allows for faster preprocessing.

The way the 🤗 Datasets library applies this processing is by adding new fields to the datasets, one for each key in the dictionary returned by the preprocessing function:

In [None]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 408
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 1725
    })
})

You can even use **multiprocessing** when applying your preprocessing function with map() by passing along a **num_proc** argument.

Our **tokenize_function** returns a **dictionary** with the keys input_ids, attention_mask, and token_type_ids, so those three fields are added to all splits of our dataset.

*Note that we could also have changed existing fields if our preprocessing function returned a new value for an existing key in the dataset to which we applied map().*
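The behavior described above (new keys become new columns, existing keys are overwritten, and the function sees several rows at once) can be illustrated with a toy re-implementation. This is only a sketch on a plain dictionary of lists: `map_batched` and `toy_function` are hypothetical names, and the real Dataset.map() works on Arrow tables and does far more.

```python
def map_batched(columns, function, batch_size=2):
    """Toy batched map over a column-oriented dataset (a dict of equal-length lists)."""
    num_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # The function receives a slice of every column, not a single row.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    result.update(new_columns)  # new keys are added, existing keys are replaced
    return result


def toy_function(batch):
    return {
        "num_words": [len(s.split()) for s in batch["sentence1"]],  # new column
        "sentence1": [s.lower() for s in batch["sentence1"]],       # replaces a column
    }


toy = {
    "sentence1": ["Hello World", "One Two Three", "Dynamic Padding"],
    "label": [1, 0, 1],
}
mapped = map_batched(toy, toy_function, batch_size=2)
print(mapped["num_words"])  # [2, 3, 2]
print(mapped["sentence1"])  # ['hello world', 'one two three', 'dynamic padding']
```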

The last thing we will need to do is **pad** all the examples to the length of the longest element when we batch elements together, a technique we refer to as **dynamic padding**.

# Dynamic Padding

The function that is responsible for *putting together samples inside a batch* is called a **collate function**.

- It's an argument you can pass when you build a DataLoader,
- the default being a function that will just convert your samples to PyTorch tensors and concatenate them.

**This won't be possible in our case since the inputs we have won't all be of the same size.**

To batch them anyway, we have to **define a collate function** that *will apply the correct amount of padding to the items of the dataset* we want to batch together.
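As a rough idea of what such a collate function has to do, here is a simplified plain-Python sketch. The name `pad_collate` is hypothetical. It assumes right padding with a pad token ID of 0 (the ID of [PAD] in bert-base-uncased), and it returns lists rather than tensors to stay dependency-free.

```python
def pad_collate(samples, pad_token_id=0):
    """Pad every sample to the longest input_ids in this batch (right padding)."""
    max_len = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for s in samples:
        n_real = len(s["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(s["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        batch["labels"].append(s["label"])
    return batch


samples = [
    {"input_ids": [101, 2023, 102], "label": 1},
    {"input_ids": [101, 2023, 2003, 1996, 102], "label": 0},
]
batch = pad_collate(samples)
print(batch["input_ids"])       # [[101, 2023, 102, 0, 0], [101, 2023, 2003, 1996, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

A production collator also has to handle token_type_ids, left-padding models, and tensor conversion, which is why a ready-made one is preferable.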

The 🤗 Transformers library provides us with such a function via **DataCollatorWithPadding**. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need:

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

To test it, let's grab a few samples from our training set that we would like to batch together.

Here, we remove the columns idx, sentence1, and sentence2, as they won't be needed and contain strings (and we can't create tensors with strings), and have a look at the lengths of each entry in the batch:

In [None]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

No surprise, we get samples of varying length, from 32 to 67.

Dynamic padding means the samples in this batch should all be padded to a length of 67, the maximum length inside the batch.

Without dynamic padding, all of the samples would have to be padded to the maximum length in the **whole dataset,** or the maximum length the model can accept. Let's double-check that our data_collator is dynamically padding the batch properly:

In [None]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 67]),
 'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'labels': torch.Size([8])}
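To get a feel for how much padding this saves, we can do the arithmetic on the eight lengths printed earlier. The sketch below compares padding to the batch maximum (67) with padding to a fixed length of 512. That value is an assumption: it is the usual maximum sequence length for bert-base-uncased and is not taken from this notebook's output.

```python
lengths = [50, 59, 47, 67, 59, 50, 62, 32]  # sample lengths from the batch above
model_max_length = 512  # assumption: usual limit for bert-base-uncased

real_tokens = sum(lengths)                       # tokens that carry content
dynamic_total = len(lengths) * max(lengths)      # every row padded to 67
fixed_total = len(lengths) * model_max_length    # every row padded to 512

print(dynamic_total - real_tokens)  # 110 padding tokens with dynamic padding
print(fixed_total - real_tokens)    # 3670 padding tokens with fixed-length padding
```

For this batch, dynamic padding processes 536 token positions instead of 4,096.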