# Processing the data

Here is how we would train a sequence classifier on one batch:

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequence, padding=True, truncation=True, return_tensors='pt')

BertForSequenceClassification LOAD REPORT from: bert-base-cased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
classifier.bias                            | MISSING    | 
classifier.weight                          | MISSING    | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


In [2]:
batch

{'input_ids': tensor([[  101,   146,   112,  1396,  1151,  2613,  1111,   170, 20164, 10932,
          2271,  7954,  1736,  1139,  2006,  1297,   119,   102],
        [  101,  1188,  1736,  1110,  6929,   106,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

In [4]:
batch["input_ids"].shape

torch.Size([2, 18])

In [5]:
batch["labels"] = torch.tensor([1, 1])
optimizer = torch.optim.AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Of course, just training the model on two sentences is not going to yield very good results. To get better results, you will need to prepare a bigger dataset.



In this section we will use as an example the MRPC (Microsoft Research Paraphrase Corpus) dataset, introduced in a paper by William B. Dolan and Chris Brockett. The dataset consists of 5,801 pairs of sentences, with a label indicating whether they are paraphrases or not.

## 1. Loading a dataset from the Hub

In [6]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

As you can see, we get a `DatasetDict` object which contains the training set, the validation set, and the test set. Each of those contains several columns (`sentence1`, `sentence2`, `label`, and `idx`) and a number of rows that differs per split: there are 3,668 pairs of sentences in the training set, 408 in the validation set, and 1,725 in the test set.



We can access each pair of sentences in our `raw_datasets` object by indexing, like with a dictionary:



In [8]:
raw_train_datasets = raw_datasets["train"]
raw_train_datasets[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [9]:
raw_train_datasets[:3]

{'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
  "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
  'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .'],
 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
  "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .",
  "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale ."],
 'label': [1, 0, 1],
 'idx': [0, 1, 2]}
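
Note that a slice comes back column-oriented: a single dictionary whose values are lists, rather than a list of per-example dictionaries. If row-wise records are more convenient, they can be rebuilt with plain Python. The sketch below works on a hand-shortened stand-in for the slice above; the `batch` literal and the `columns_to_rows` helper are illustrative and not part of 🤗 Datasets:

```python
def columns_to_rows(columns):
    """Turn a dict of equal-length lists into a list of per-example dicts."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*(columns[k] for k in keys))]

# Shortened stand-in for raw_train_datasets[:3] (sentences abbreviated by hand).
batch = {
    "sentence1": ["Amrozi accused ...", "Yucaipa owned ...", "They had published ..."],
    "label": [1, 0, 1],
    "idx": [0, 1, 2],
}

rows = columns_to_rows(batch)
for row in rows:
    print(row["idx"], row["label"], row["sentence1"])
```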

We can see the labels are already integers, so we won't have to do any preprocessing there. To know which integer corresponds to which label, we can inspect the `features` of our `raw_train_datasets`. This will tell us the type of each column:



In [12]:
print(raw_train_datasets.features)
print(raw_train_datasets.num_rows)

{'sentence1': Value('string'), 'sentence2': Value('string'), 'label': ClassLabel(names=['not_equivalent', 'equivalent']), 'idx': Value('int32')}
3668


Behind the scenes, `label` is of type `ClassLabel`, and the mapping of integers to label names is stored in its `names` attribute. `0` corresponds to `not_equivalent`, and `1` corresponds to `equivalent`.
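
When predicted integers later need to be turned back into readable labels, the same list of names can be inverted into lookup tables. The sketch below hard-codes the names shown in the output above so that it runs on its own; in the notebook the list would come from `raw_train_datasets.features["label"].names`. The variable names `id2label` and `label2id` are just a convention chosen here.

```python
# Names copied from the ClassLabel printed above; the position in the list is the integer label.
names = ["not_equivalent", "equivalent"]

id2label = dict(enumerate(names))
label2id = {name: i for i, name in id2label.items()}

print(id2label[1])
print(label2id["not_equivalent"])
```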

## 2. Preprocessing a dataset

To preprocess the dataset, we need to convert the text to numbers the model can make sense of. As you saw in the previous chapter, this is done with a tokenizer. We can feed the tokenizer one sentence or a list of sentences, so we can directly tokenize all the first sentences and all the second sentences of each pair like this:



In [17]:
tokenized_1 = tokenizer(list(raw_train_datasets["sentence1"]))
tokenized_2 = tokenizer(list(raw_train_datasets["sentence2"]))

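Tokenizing the two columns separately, however, does not give the model a pair to compare. The tokenizer also accepts two lists of sentences and encodes them as pairs, so one way to preprocess the training dataset is: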

In [None]:
tokenized_dataset = tokenizer(
    list(raw_train_datasets["sentence1"]),
    list(raw_train_datasets["sentence2"]),
    padding=True,
    truncation=True,
    return_tensors='pt'
)

This works well, but only if you have enough RAM to store your whole dataset during tokenization (whereas the datasets from the 🤗 Datasets library are Apache Arrow files stored on disk, so you only keep the samples you ask for loaded in memory).

Instead, we will use the `Dataset.map()` method. This also gives us some extra flexibility, as `map()` works by applying a function to each element of the dataset, so let's define a function that tokenizes our inputs:



In [30]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

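Because the tokenizer accepts lists of sentence pairs, this function also works when `example` holds several samples at once (each key mapping to a list of sentences). That lets us use the option `batched=True` in our call to `map()`, which will greatly speed up the tokenization.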
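
To make the difference concrete: with `batched=True`, the function passed to `map()` receives a dictionary of lists covering many examples (up to 1,000 per call by default) rather than a single example. The pure-Python sketch below imitates that calling pattern on a tiny in-memory table; `map_batched` and `fake_tokenize` are hypothetical stand-ins, not the 🤗 Datasets implementation.

```python
def map_batched(columns, function, batch_size=1000):
    """Apply `function` to slices of a dict of lists and add the columns it returns."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # Each call sees a dict of lists, one list per column, for this slice of rows.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}

def fake_tokenize(batch):
    """Stand-in for a tokenizer: counts whitespace-separated words in each sentence pair."""
    return {
        "num_words": [
            len(s1.split()) + len(s2.split())
            for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
        ]
    }

table = {
    "sentence1": ["a b c", "d e", "f"],
    "sentence2": ["g h", "i", "j k l m"],
}
mapped = map_batched(table, fake_tokenize, batch_size=2)
print(mapped["num_words"])
```

The function is called twice here (two rows, then one row), and the returned lists are stitched back together as a new column next to the existing ones.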

Note that we've left the padding argument out in our tokenization function for now. This is because padding all the samples to the maximum length is not efficient: it's better to pad the samples when we're building a batch, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths!



Here is how we apply the tokenization function on all our datasets at once. We're using `batched=True` in our call to `map()` so the function is applied to multiple elements of our dataset at once, and not on each element separately. This allows for faster preprocessing.



In [31]:
tokenized_dataset = raw_datasets.map(tokenize_function, batched=True)
tokenized_dataset


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
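
The three added columns (`input_ids`, `token_type_ids`, `attention_mask`) are what the tokenizer returns for each pair. For intuition, BERT-style tokenizers join a pair into a single sequence of the form `[CLS] sentence1 [SEP] sentence2 [SEP]`, and `token_type_ids` marks which segment each token belongs to. The toy function below mimics that layout using whitespace splitting in place of real WordPiece tokenization; `encode_pair` is a hypothetical helper, not a library function.

```python
def encode_pair(sentence1, sentence2):
    """Toy imitation of BERT's pair layout; splits on whitespace instead of WordPiece."""
    first = ["[CLS]"] + sentence1.split() + ["[SEP]"]
    second = sentence2.split() + ["[SEP]"]
    return {
        "tokens": first + second,
        # 0 for [CLS], the first sentence and its [SEP]; 1 for the second sentence and final [SEP].
        "token_type_ids": [0] * len(first) + [1] * len(second),
        # No padding yet, so every position is a real token.
        "attention_mask": [1] * (len(first) + len(second)),
    }

encoded = encode_pair("This is the first sentence .", "This is the second one .")
for token, type_id in zip(encoded["tokens"], encoded["token_type_ids"]):
    print(f"{token:10} {type_id}")
```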

You can even use multiprocessing when applying your preprocessing function with `map()` by passing along a `num_proc` argument. We didn't do this here because the 🤗 Tokenizers library already uses multiple threads to tokenize our samples faster, but if you are not using a fast tokenizer backed by this library, this could speed up your preprocessing.



### Dynamic Padding

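The function that is responsible for putting together samples inside a batch is called a collate function. It's an argument you can pass when you build a `DataLoader`, the default being a function that will just convert your samples to PyTorch tensors and concatenate them. That default won't work here, because our inputs are not all the same length: we deliberately postponed padding so that it can be applied per batch, and only as much as needed.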
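
As a rough picture of what a padding collate function has to do, the sketch below pads the token-level fields of a batch to the longest sequence in that batch, using plain Python lists. It is a simplified stand-in rather than any library's implementation: `pad_batch` is a hypothetical helper, right-side padding and a pad token id of 0 are assumptions (0 matches the padded `input_ids` shown earlier for `bert-base-cased`), and a real collate function would also convert the result to tensors.

```python
def pad_batch(features, pad_token_id=0):
    """Pad token-level fields of a list of examples to the longest example in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    # Padding value per field: the pad token for ids, 0 for segment ids and the mask.
    pad_values = {"input_ids": pad_token_id, "token_type_ids": 0, "attention_mask": 0}
    batch = {}
    for key in features[0]:
        if key in pad_values:
            batch[key] = [f[key] + [pad_values[key]] * (max_len - len(f[key])) for f in features]
        else:
            batch[key] = [f[key] for f in features]  # e.g. labels: one value per example
    return batch

features = [
    {"input_ids": [101, 1188, 102], "attention_mask": [1, 1, 1], "label": 1},
    {"input_ids": [101, 1188, 1736, 1110, 102], "attention_mask": [1, 1, 1, 1, 1], "label": 0},
]
padded = pad_batch(features)
print(padded["input_ids"])
print(padded["attention_mask"])
```

The shorter example is extended with two pad ids, and its attention mask gets matching zeros so the model can ignore those positions.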

To do this in practice, we have to define a collate function that will apply the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the 🤗 Transformers library provides us with such a function via `DataCollatorWithPadding`. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need:

In [32]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

To test this new toy, let's grab a few samples from our training set that we would like to batch together. Here, we remove the columns `idx`, `sentence1`, and `sentence2` as they won't be needed and contain strings (and we can't create tensors with strings) and have a look at the lengths of each entry in the batch:



In [33]:
samples = tokenized_dataset["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
print(samples)
[len(x) for x in samples["input_ids"]]

{'label': [1, 0, 1, 0, 1, 1, 0, 1], 'input_ids': [[101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102], [101, 10684, 2599, 9717, 1161, 2205, 11288, 1377, 112, 188, 1196, 4147, 1103, 4129, 1106, 19770, 2787, 1107, 1772, 1111, 109, 123, 119, 126, 3775, 119, 102, 10684, 2599, 9717, 1161, 3306, 11288, 1377, 112, 188, 1107, 1876, 1111, 109, 5691, 1495, 1550, 1105, 1962, 1122, 1106, 19770, 2787, 1111, 109, 122, 119, 129, 3775, 1107, 1772, 119, 102], [101, 1220, 1125, 1502, 1126, 16355, 1113, 1103, 4639, 1113, 1340, 1275, 117, 4733, 1103, 6527, 1111, 4688, 117, 1119, 1896, 119, 102, 1212, 1340, 1275, 117, 1103, 2062, 112, 188, 5032, 1125, 1502, 1126, 16355, 1113, 1103, 4639, 117, 4733, 1103, 16454, 1111, 4688, 119, 102], [101, 5596, 5347, 19297

[52, 59, 47, 69, 60, 50, 66, 32]

No surprise, we get samples of varying length, from 32 to 69. Dynamic padding means the samples in this batch should all be padded to a length of 69, the maximum length inside the batch. Without dynamic padding, all of the samples would have to be padded to the maximum length in the whole dataset, or the maximum length the model can accept. Let's double-check that our `data_collator` is dynamically padding the batch properly:



In [34]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 69]),
 'token_type_ids': torch.Size([8, 69]),
 'attention_mask': torch.Size([8, 69]),
 'labels': torch.Size([8])}

Looking good! Now that we've gone from raw text to batches our model can deal with, we're ready to fine-tune it!
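
To put a rough number on the saving, the eight lengths printed above can be compared under two padding strategies: padding to the longest sample in the batch, versus padding to 512, the maximum sequence length BERT accepts. This is plain arithmetic on the lengths from the output; nothing here calls the library, and `padding_tokens` is a helper name chosen for this sketch.

```python
lengths = [52, 59, 47, 69, 60, 50, 66, 32]  # from the batch inspected above

def padding_tokens(lengths, target_length):
    """Number of pad tokens needed to bring every sequence up to target_length."""
    return sum(target_length - n for n in lengths)

dynamic = padding_tokens(lengths, max(lengths))  # pad to the batch maximum (69)
fixed = padding_tokens(lengths, 512)             # pad to the model maximum

print(dynamic, fixed)
```

For this batch, dynamic padding adds 117 pad tokens, against 3,661 when every sample is padded to 512.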



## 3. Key Takeaways

1. Use `batched=True` with `Dataset.map()` for significantly faster preprocessing

2. Dynamic padding with `DataCollatorWithPadding` is more efficient than fixed-length padding

3. Always preprocess your data to match what your model expects (numerical tensors, correct column names)