# Processing the data (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]


This is how we trained a sequence classifier on one batch in PyTorch previously.

In [2]:
import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated and missing from recent releases
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training the model on just two sentences won't yield good results. To do better, we need a larger dataset.

We will use the MRPC (Microsoft Research Paraphrase Corpus) dataset.
- Introduced in a paper by William B. Dolan and Chris Brockett, it consists of 5,801 sentence pairs.
- Each pair carries a label indicating whether the two sentences are paraphrases (i.e., whether both convey the same meaning).


## Loading the dataset from the Hub
The Hugging Face Hub hosts not only models but also datasets in many languages. MRPC is one of the 10 datasets that make up the GLUE benchmark, which measures the performance of ML models across 10 different text classification tasks. The 🤗 Datasets library provides a simple command to download and cache a dataset from the Hub.

In [3]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

The result is a DatasetDict object containing the training set, the validation set, and the test set. Each set has the columns 'sentence1', 'sentence2', 'label', and 'idx', and a different number of rows: 3,668 sentence pairs in the training set, 408 in the validation set, and 1,725 in the test set.

By default, this command downloads and caches the dataset in the ~/.cache/huggingface/datasets directory. We can customize the cache folder by setting the HF_HOME environment variable.
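
As a small sketch of the cache customization described above: the variable should be set before the `datasets` library is imported, since the cache location is resolved when the library loads. The directory name used here is a hypothetical placeholder, not something this notebook defines.

```python
import os

# Hypothetical cache location; replace it with a directory that suits your setup.
custom_cache_root = os.path.join(os.path.expanduser("~"), "hf_cache_demo")

# Set HF_HOME before importing `datasets`, so the library sees it
# when it works out its cache paths.
os.environ["HF_HOME"] = custom_cache_root

# With this setting, downloaded datasets are expected under <HF_HOME>/datasets.
expected_datasets_cache = os.path.join(custom_cache_root, "datasets")
print(expected_datasets_cache)
```

In a notebook, this cell would need to run before the cell that imports `load_dataset`.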

We can access each pair of sentences in the 'raw_datasets' object by indexing, as with a dictionary:


In [4]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

The labels are already integers, so no preprocessing is needed there. To find out which integer corresponds to which label, we can inspect the features of our 'raw_train_dataset', which also tells us the type of each column:


In [5]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

Internally, the 'label' column is of type ClassLabel, and the mapping from integers to label names is stored in its 'names' list. In this dataset, 0 corresponds to 'not_equivalent' and 1 corresponds to 'equivalent'.
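
The mapping can be sketched in plain Python: the position of a name in the list is its integer label. The helper functions below are hypothetical and written only for illustration; the ClassLabel feature itself exposes `int2str()` and `str2int()` methods for the same purpose.

```python
# Label names in the order reported by the ClassLabel feature above.
label_names = ["not_equivalent", "equivalent"]


def int_to_name(label_id):
    """Return the label name stored at position `label_id`."""
    return label_names[label_id]


def name_to_int(name):
    """Return the integer id of a label name."""
    return label_names.index(name)


print(int_to_name(1))
print(name_to_int("not_equivalent"))
```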


## Preprocessing Dataset
To process the dataset, text must be converted into numerical representations that the model can understand. This can be achieved using a tokenizer, either by feeding it a single sentence or a pair of sentences. To demonstrate:

In [6]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

However, tokenizing the two sentences separately and passing them to the model won't tell us whether they are paraphrases. We need to treat the two sequences as a pair and apply the appropriate preprocessing. The tokenizer can take a pair of sequences and prepare it the way our BERT model expects:

In [7]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In this context, the token_type_ids are crucial. They signify which part of the input represents the first sentence and which represents the second sentence. Decoding the IDs back to words results in:

In [9]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']

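As the decoded tokens show, the model expects inputs of the form [CLS] sentence1 [SEP] sentence2 [SEP] when given two sentences. Matching this against the token_type_ids above, the positions of [CLS] sentence1 [SEP] have a token type ID of 0, and those of sentence2 [SEP] have a token type ID of 1.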

As mentioned earlier, when using a different checkpoint, token_type_ids might not be included in the tokenized inputs (such as when using a DistilBERT model). They are only returned if the model has been trained to utilize them.

In BERT's case, it is pre-trained with token type IDs and an additional objective called next sentence prediction, which involves predicting if the second sentence follows the first. This task aims to model the relationship between pairs of sentences.
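
To make the pair format concrete, here is a minimal sketch that assembles the special tokens and token type IDs by hand from two already-split token lists. The function `build_pair_inputs` is a hypothetical helper, not part of the tokenizer API; the real tokenizer also handles subword splitting and ID lookup.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Assemble a BERT-style sentence pair from two token lists."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], the first sentence, and its [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


first = ["this", "is", "the", "first", "sentence", "."]
second = ["this", "is", "the", "second", "one", "."]

tokens, token_type_ids = build_pair_inputs(first, second)
print(tokens)
print(token_type_ids)
```

The resulting lists line up with the decoded tokens and the token_type_ids printed by the tokenizer for the same two sentences.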

For processing the entire dataset using the same tokenizer, we can feed it lists of first and second sentences:

In [10]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

This works, but it returns a plain dictionary (with the keys input_ids, attention_mask, and token_type_ids) and requires enough RAM to hold the whole tokenized dataset in memory. To keep the data as a dataset, we will use the Dataset.map() method, which applies a function to the elements of the dataset. The function below takes a dictionary, like the items of our dataset, and returns a new dictionary. It also works when each key holds a list of sentences, which lets map() process many elements per call:

In [11]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

Passing batched=True to Dataset.map() applies the function to many elements at once rather than one at a time, which speeds up tokenization considerably. The library adds a new field to the datasets for each key in the dictionary returned by the preprocessing function.

In [14]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

To speed up preprocessing further, map() accepts a num_proc argument that enables multiprocessing. We didn't use it here because the 🤗 Tokenizers library already uses multiple threads to tokenize samples faster. If you are not using a fast tokenizer backed by that library, num_proc could speed up your preprocessing significantly.

In our case, the tokenize_function generates a dictionary with input_ids, attention_mask, and token_type_ids. These fields are consequently added to all partitions of our dataset. If necessary, we could have also altered existing fields by returning new values for pre-existing keys during the map() application to our dataset.
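
The batched behavior can be sketched in plain Python. This is a toy stand-in under simplifying assumptions, not how 🤗 Datasets is implemented: `batched_map` and `count_words` are hypothetical names, and the word count plays the role of the tokenizer.

```python
def batched_map(columns, function, batch_size=2):
    """Apply `function` to slices of a column dictionary and merge its output.

    `columns` maps column names to equally sized lists. `function` receives a
    dictionary of list slices and returns a dictionary of new or updated columns.
    """
    num_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            if start == 0:
                # A returned key either adds a column or replaces an existing one.
                result[name] = []
            result[name].extend(values)
    return result


def count_words(batch):
    # Stand-in for a tokenizer: returns one new value per row in the batch.
    return {
        "num_words": [
            len(a.split()) + len(b.split())
            for a, b in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


toy = {
    "sentence1": ["a b c", "d e", "f"],
    "sentence2": ["g h", "i", "j k l m"],
}
mapped = batched_map(toy, count_words)
print(mapped["num_words"])
```

The original columns are kept and the returned key becomes a new column, which mirrors how input_ids, attention_mask, and token_type_ids appear next to sentence1 and sentence2 above.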

The last step is to pad the examples in each batch to the length of the longest element in that batch, a technique known as **dynamic padding**.

## Dynamic Padding

The function that assembles samples into a batch is called a collate function. It is an argument we can pass when building a DataLoader; the default one converts the samples to PyTorch tensors and stacks them (recursively, if the elements are lists, tuples, or dictionaries). That default won't work here because our inputs vary in length. We deliberately postponed padding so that it can be applied per batch, padding only up to the longest sample in that batch rather than to a single global length. This speeds up training considerably, although it can cause problems on TPUs, which prefer fixed shapes even when that means extra padding.

In practice, we need a collate function that pads the items of each batch appropriately. The 🤗 Transformers library provides the DataCollatorWithPadding class for exactly this. It takes a tokenizer when instantiated (to know which padding token to use and whether the model expects padding on the left or on the right) and handles the rest:

In [12]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

To experiment with this functionality, let's select several samples from our training set that we wish to combine into a batch. We will exclude the columns idx, sentence1, and sentence2 as they are unnecessary and contain strings (tensors can't be created from strings). Then, we will examine the lengths of each entry within the batch:

In [15]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

As expected, the selected samples exhibit varying lengths, ranging from 32 to 67. Dynamic padding ensures that all samples within this batch are padded to the length of 67, which is the maximum length within this batch. Without dynamic padding, all samples would need padding to the maximum length across the entire dataset or the maximum length the model can handle. Let's verify that our data_collator is implementing dynamic padding correctly:

In [None]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 67]),
 'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'labels': torch.Size([8])}
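
For intuition, the padding step can be sketched in plain Python. The function `pad_batch` is a hypothetical helper, not the library's implementation; it assumes right-side padding with a padding ID of 0, and the comparison assumes a fixed maximum length of 512.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Dummy sequences with the same lengths as the eight samples above.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
sequences = [[1] * n for n in lengths]

input_ids, attention_mask = pad_batch(sequences)
padded_len = len(input_ids[0])
print(padded_len)

# Number of padding tokens added: per-batch padding vs. a fixed length of 512.
dynamic_padding = sum(padded_len - n for n in lengths)
fixed_padding = sum(512 - n for n in lengths)
print(dynamic_padding, fixed_padding)
```

Comparing the two counts shows how much padding is avoided on this batch when padding to the batch maximum instead of a fixed length.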

Everything seems in order! Now that we've transformed raw text into batches suitable for the model, we're set to proceed with fine-tuning!


# Fine-tuning a model with the Trainer API

🤗 Transformers offers a Trainer class tailored for fine-tuning pretrained models on our dataset.


In [16]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Training

1. **Define TrainingArguments:** Create a `TrainingArguments` object. Here we only pass the directory where model checkpoints are saved; the defaults for the other hyperparameters usually work well for a basic fine-tuning.


In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

2. **Model Definition:** Use `AutoModelForSequenceClassification` with two labels:


In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


3. **Trainer Initialization:** Instantiate a `Trainer` with the model, the `training_args`, the training and validation datasets, the `data_collator`, and the tokenizer:



In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,  # recent Transformers releases name this argument `processing_class`
)


4. **Start Training:** Call the train() method to initiate fine-tuning:


In [None]:
trainer.train()

## Evaluation

To create a meaningful `compute_metrics()` function, which computes evaluation metrics, we'll need predictions from the model:


In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)

The predictions include logits for each dataset element. To transform these logits into predictions, get the index with the maximum value on the second axis:


In [None]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

Now, let's build the `compute_metrics()` function using the metrics associated with the MRPC dataset, loaded through the 🤗 Evaluate library:


In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542}
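
To show what these two numbers measure, here is a plain-Python sketch that turns logits into predictions and computes accuracy and F1 by hand. The logits and labels are made up for illustration, the helper names are hypothetical, and F1 is computed for the positive class (label 1), which is an assumption about how the metric treats this binary task.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy and F1 (for the `positive` class) over two equal-length lists."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up logits for five examples (two scores per example) and their labels.
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4], [-0.5, 0.9]]
toy_labels = [1, 0, 0, 1, 1]

toy_preds = [argmax(row) for row in toy_logits]
toy_scores = accuracy_and_f1(toy_preds, toy_labels)
print(toy_preds)
print(toy_scores)
```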

Finally, define a new Trainer with this `compute_metrics()` function for metric reporting:

In [None]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
# Note: recent Transformers releases rename `evaluation_strategy` to `eval_strategy`.
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

This time, it'll display the validation loss and metrics at each epoch's end. The exact accuracy/F1 score might differ slightly due to the model's random head initialization but should be in a similar range.

The Trainer supports multi-GPU/TPU training, mixed-precision training, and more options.

In [None]:
trainer.train()

This concludes the introduction to fine-tuning using the Trainer API in 🤗 Transformers.


# A full training

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl


### **Manual Training Loop for Model Fine-Tuning**

To replicate the fine-tuning process without using the Trainer class, let's recap the necessary steps assuming you've completed the data processing earlier:


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

**Prepare for Training**

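1. **Post-process the datasets:** Remove the columns the model does not expect, rename `label` to `labels` (the argument name the model expects), and set the format so the datasets return PyTorch tensors: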



In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

In [None]:
["attention_mask", "input_ids", "labels", "token_type_ids"]


2. **Create DataLoaders:**



In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

Inspect a batch:


In [None]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 65]),
 'input_ids': torch.Size([8, 65]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 65])}

3. **Instantiate Model:**



In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Pass the batch to the model to check that everything works:

In [None]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.5441, grad_fn=<NllLossBackward>) torch.Size([8, 2])


4. **Optimizer and Scheduler Initialization:**


In [None]:
from torch.optim import AdamW  # transformers.AdamW is deprecated and missing from recent releases

optimizer = AdamW(model.parameters(), lr=5e-5)

In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

1377
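
The step count and the shape of the schedule can be checked with a few lines of plain Python. This is a sketch of a linear warmup-and-decay rule written from the description above, not the library's code; `linear_lr` is a hypothetical helper.

```python
import math

num_train_examples = 3668  # size of the MRPC training split
batch_size = 8
num_epochs = 3
base_lr = 5e-5

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
num_training_steps = num_epochs * steps_per_epoch
print(steps_per_epoch, num_training_steps)


def linear_lr(step, num_warmup_steps=0):
    """Learning rate after `step` scheduler updates under linear warmup and decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


print(linear_lr(0), linear_lr(num_training_steps // 2), linear_lr(num_training_steps))
```

With no warmup, the learning rate starts at its base value and decreases linearly to 0 at the last training step.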

**Training Loop**

Use a GPU if one is available, and fall back to the CPU otherwise:


In [None]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')


Run the training loop:



In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
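
The order of calls in the loop (backward, optimizer step, then zeroing the gradients) matters because gradients accumulate across backward passes. The toy sketch below imitates that cycle in plain Python on a one-parameter model with a hand-written gradient. `ToyParameter` and the helper functions are hypothetical stand-ins, not PyTorch APIs.

```python
class ToyParameter:
    """A scalar parameter whose gradient accumulates across backward calls."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0


def backward(param, x, y):
    """Accumulate the derivative of (w * x - y) ** 2 with respect to w."""
    param.grad += 2 * (param.value * x - y) * x


def step(param, lr):
    """Plain gradient-descent update using the accumulated gradient."""
    param.value -= lr * param.grad


def zero_grad(param):
    """Reset the gradient so the next backward call starts from zero."""
    param.grad = 0.0


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2 * x
w = ToyParameter(0.0)
for epoch in range(50):
    for x, y in data:
        backward(w, x, y)
        step(w, lr=0.05)
        zero_grad(w)

print(round(w.value, 4))
```

Dropping the `zero_grad` call would make each update reuse gradients from earlier samples, which is why the real loop resets them after every optimizer step.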


**Evaluation Loop**

Accumulate predictions batch by batch with `add_batch()`, then compute the final metrics:



In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.8431372549019608, 'f1': 0.8907849829351535}


**Supercharge Training with 🤗 Accelerate**

To leverage distributed training with 🤗 Accelerate:


In [None]:
from accelerate import Accelerator
from torch.optim import AdamW  # transformers.AdamW is deprecated and missing from recent releases
from transformers import AutoModelForSequenceClassification, get_scheduler

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)


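To launch this from a notebook, the code above has to be wrapped in a function, here called training_function, which is then passed to notebook_launcher:


In [None]:
from accelerate import notebook_launcher

# training_function is not defined in this notebook: it stands for a function
# whose body is the Accelerate training code from the previous cell.
notebook_launcher(training_function)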