
In [None]:
%pip install torch transformers datasets

In [None]:
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
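# Tokenize both sentences into one padded batch of PyTorch tensors
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
# Hard-coded labels for the two sentences, so the model can compute a loss
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
# When labels are passed, the model output includes the loss
loss = model(**batch).loss
loss.backward()
optimizer.step()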

To load datasets we can use the *datasets* library. A powerful feature of this library is that it is built on Apache Arrow, an open-source framework for columnar, in-memory data. Even if the dataset is several GB in size, you can access individual examples without loading everything into RAM, thanks to Arrow's memory-mapping.

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
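
To make the memory-mapping idea concrete, here is a minimal sketch that uses only Python's standard mmap module. It is an analogy, not Arrow itself and not how the *datasets* library is implemented: the fixed-width record layout and the helper names write_records and read_record are assumptions made up for this illustration.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout for this sketch: one little-endian 64-bit integer per record
RECORD = struct.Struct("<q")


def write_records(path, n):
    """Write n fixed-width records (the values 0..n-1) to a file on disk."""
    with open(path, "wb") as f:
        for i in range(n):
            f.write(RECORD.pack(i))


def read_record(mm, index):
    """Read one record from the memory-mapped file by computing its byte offset."""
    (value,) = RECORD.unpack_from(mm, index * RECORD.size)
    return value


path = os.path.join(tempfile.mkdtemp(), "records.bin")
write_records(path, 100_000)

with open(path, "rb") as f:
    # Map the file into memory; the OS pages in only the parts we touch
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first = read_record(mm, 0)
        middle = read_record(mm, 54_321)
        last = read_record(mm, 99_999)
```

Each lookup touches only the bytes of the requested record, which is the same general reason an Arrow-backed dataset can return raw_datasets["train"][0] without reading the whole file.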

We can access each split of our raw_datasets object by key, like with a dictionary, and then each pair of sentences by index:

In [None]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

To understand what 'label': 1 refers to, we can inspect the features of the dataset:

In [None]:
raw_train_dataset.features

We can see that 1 means equivalent and 0 means not_equivalent. We can now tokenize the dataset by calling the tokenizer on it directly. However, this only works if we have enough RAM to hold the whole dataset in memory during tokenization.

In [None]:
# Tokenize only the first sentence of every pair
sentences_1 = raw_datasets["train"]["sentence1"]
tokenized_sentences_1 = tokenizer(list(sentences_1))

In [None]:
tokenized_dataset = tokenizer(
    list(raw_datasets["train"]["sentence1"]),
    list(raw_datasets["train"]["sentence2"]),
    padding=True,
    truncation=True,
)

To tokenize more efficiently and keep the data as a dataset, we will use the Dataset.map() method. This also gives us extra flexibility if we need more preprocessing than just tokenization. The map() method works by applying a function to each element of the dataset, so let’s define a function that tokenizes our inputs:

In [None]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

Here is how we apply the tokenization function on all our datasets at once. We’re using batched=True in our call to map so the function is applied to multiple elements of our dataset at once, and not on each element separately. This allows for faster preprocessing.

In [None]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets
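
To see what batched=True changes from the function's point of view, here is a toy, pure-Python stand-in for map(). It is only a sketch: toy_map and make_counter are hypothetical helpers written for this illustration, not part of the 🤗 Datasets API, and the three rows are made-up sentences. The default of 1000 mirrors the default batch_size of Dataset.map().

```python
def toy_map(rows, fn, batched=False, batch_size=1000):
    """Toy stand-in for Dataset.map() over a list of row dicts (hypothetical helper)."""
    out = []
    if not batched:
        # fn receives one row at a time: {"sentence1": "..."}
        for row in rows:
            out.append({**row, **fn(row)})
        return out
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # fn receives a dict of lists: {"sentence1": ["...", "...", ...]}
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for i, row in enumerate(chunk):
            out.append({**row, **{k: v[i] for k, v in new_columns.items()}})
    return out


rows = [
    {"sentence1": "The cat sat .", "sentence2": "A cat was sitting ."},
    {"sentence1": "Stocks rose on Monday .", "sentence2": "Shares fell ."},
    {"sentence1": "He left early .", "sentence2": "He went home early ."},
]

calls = {"single": 0, "batched": 0}


def make_counter(kind):
    """Build a function that counts words and records how often it is called."""
    def count_words(example):
        calls[kind] += 1
        first = example["sentence1"]
        if isinstance(first, list):  # batched: a list of strings
            return {"n_words": [len(s.split()) for s in first]}
        return {"n_words": len(first.split())}  # unbatched: a single string
    return count_words


one_by_one = toy_map(rows, make_counter("single"))
in_batches = toy_map(rows, make_counter("batched"), batched=True)
```

Both calls produce the same rows, but the batched version invokes the function once for the whole chunk instead of once per row. This is why tokenize_function works unchanged in both modes: the tokenizer accepts either a single string or a list of strings.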

To perform dynamic padding in practice, that is, padding each batch only up to the length of its longest sample, we have to define a collate function that will apply the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the 🤗 Transformers library provides us with such a function via DataCollatorWithPadding. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need:

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

To test this new toy, let’s grab a few samples from our training set that we would like to batch together. Here, we remove the columns idx, sentence1, and sentence2 as they won’t be needed and contain strings (and we can’t create tensors with strings) and have a look at the lengths of each entry in the batch:

In [None]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

No surprise, we get samples of varying length, from 32 to 67. Dynamic padding means the samples in this batch should all be padded to a length of 67, the maximum length inside the batch. Without dynamic padding, all of the samples would have to be padded to the maximum length in the whole dataset, or the maximum length the model can accept. Let’s double-check that our data_collator is dynamically padding the batch properly:

In [None]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}
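
As a rough picture of what the collator does to a batch, here is a minimal pure-Python sketch of dynamic padding. It is not how DataCollatorWithPadding is implemented: pad_batch is a hypothetical helper, the token ids are illustrative, and it assumes right-side padding with a pad id of 0 and plain lists instead of tensors.

```python
def pad_batch(features, pad_id=0):
    """Pad every sequence to the longest one in this batch (hypothetical helper)."""
    max_len = max(len(ids) for ids in features["input_ids"])
    padded = {"input_ids": [], "attention_mask": []}
    for ids in features["input_ids"]:
        n_pad = max_len - len(ids)
        # Real tokens keep mask 1; padding positions get mask 0
        padded["input_ids"].append(ids + [pad_id] * n_pad)
        padded["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return padded


# Three illustrative sequences of lengths 3, 5 and 2
toy = {"input_ids": [[101, 7592, 102], [101, 7592, 2088, 999, 102], [101, 102]]}
padded = pad_batch(toy)
shapes = {k: (len(v), len(v[0])) for k, v in padded.items()}
```

Every sequence ends up with the length of the longest one in this particular batch (5 here), so a different batch can have a different padded length.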

Key Takeaways:

*   Use batched=True with Dataset.map() for significantly faster preprocessing
*   Dynamic padding with DataCollatorWithPadding is more efficient than fixed-length padding
*   Always preprocess your data to match what your model expects (numerical tensors, correct column names)
*   The 🤗 Datasets library provides powerful tools for efficient data processing at scale
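
The efficiency claim about dynamic padding can be checked with simple arithmetic. The sketch below counts padding tokens for one batch of eight samples under both strategies. The lengths are illustrative values chosen to span 32 to 67 (not the actual output of the notebook), padding_tokens is a hypothetical helper, and 512 is used as the fixed length because it is the maximum sequence length of bert-base-uncased.

```python
# Illustrative sample lengths for one batch (assumed values, min 32 and max 67)
lengths = [50, 59, 47, 67, 43, 61, 32, 55]


def padding_tokens(lengths, target_len):
    """Count how many padding tokens are needed to bring every sample to target_len."""
    return sum(target_len - n for n in lengths)


# Dynamic padding: pad to the longest sample in this batch
dynamic_pad = padding_tokens(lengths, max(lengths))

# Fixed-length padding: pad to the model's maximum length
fixed_pad = padding_tokens(lengths, 512)

print(f"dynamic: {dynamic_pad} padding tokens, fixed: {fixed_pad} padding tokens")
```

With these assumed lengths, the batch carries 414 real tokens, so dynamic padding adds 122 padding tokens while padding to 512 adds 3682.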







In [None]:
def clean_notebook(path_in, path_out):
    import nbformat
    nb = nbformat.read(path_in, as_version=nbformat.NO_CONVERT)
    nb["metadata"].pop("widgets", None)
    nbformat.write(nb, path_out)

# Example usage:
clean_notebook("/content/03_Data_Processing.ipynb", "/content/03_Data_Processing_clean.ipynb")

