## Train a sequence classifier 
Here is how we would train a sequence classifier on one batch in PyTorch:

In [1]:
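import torch
# Note: transformers.AdamW is deprecated and no longer available in recent
# Transformers releases; torch.optim.AdamW is the replacement (its default
# eps and weight_decay differ from the values printed further down)
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification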

In [2]:
# Checkpoint
checkpoint = "bert-base-uncased"

In [3]:
# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [5]:
# initialize model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
# data
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]

In [7]:
# creating a batch
batch = tokenizer(
    sequences,
    padding=True,
    truncation=True,
    return_tensors="pt"
)

In [10]:
from pprint import pprint
print(type(batch))
pprint(batch, compact=True,sort_dicts=False)

<class 'transformers.tokenization_utils_base.BatchEncoding'>
{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2023,  2607,  2003,  6429,   999,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [11]:
# creating labels
batch['labels'] = torch.tensor([1,1])
pprint(batch, compact=True,sort_dicts=False)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2023,  2607,  2003,  6429,   999,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'labels': tensor([1, 1])}


In [14]:
model.parameters()

<generator object Module.parameters at 0x13231aea0>

In [15]:
# initialize optimizer
optimizer = AdamW(model.parameters())
optimizer



AdamW (
Parameter Group 0
    betas: (0.9, 0.999)
    correct_bias: True
    eps: 1e-06
    lr: 0.001
    weight_decay: 0.0
)

In [17]:
# output
output = model(**batch)
output

SequenceClassifierOutput(loss=tensor(0.7365, grad_fn=<NllLossBackward0>), logits=tensor([[-0.1779, -0.2608],
        [-0.2002, -0.2873]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [18]:
# loss
loss = output.loss
loss

tensor(0.7365, grad_fn=<NllLossBackward0>)

In [19]:
# backprop
loss.backward()

In [20]:
# optimizer step
optimizer.step()
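In a full training loop, the forward pass, `backward()`, and `optimizer.step()` repeat for every batch, and the gradients have to be cleared between steps. The following is a minimal sketch of that loop, not the notebook's actual training code: it assumes a tiny `nn.Linear` stand-in model and random data (both hypothetical, chosen to avoid downloading a checkpoint) and uses `torch.optim.AdamW`.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the classifier: 4 input features, 2 classes.
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Random toy batch (assumption: 8 samples with binary labels).
features = torch.randn(8, 4)
labels = torch.randint(0, 2, (8,))

losses = []
for step in range(20):
    optimizer.zero_grad()            # clear gradients left over from the previous step
    logits = model(features)         # forward pass
    loss = loss_fn(logits, labels)   # compute the loss
    loss.backward()                  # backprop
    optimizer.step()                 # update the weights
    losses.append(loss.item())
```

Without `zero_grad()`, PyTorch accumulates gradients across calls to `backward()`, so each update would mix in gradients from earlier batches.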

Training the model on two sentences is not going to yield very good results. To get better results, you will need to prepare a bigger dataset.

In this section we will use as an example the MRPC (Microsoft Research Paraphrase Corpus) dataset, introduced in a paper by William B. Dolan and Chris Brockett. 

The dataset consists of 5,801 pairs of sentences, with a label indicating if they are paraphrases or not (i.e., if both sentences mean the same thing). 

It’s a small dataset, so it’s easy to experiment with training on it.

## Datasets
The HF Hub contains multiple datasets in lots of different languages. You can browse the datasets [here](https://huggingface.co/datasets).

Let’s focus on the MRPC dataset! This is one of the 10 datasets composing the GLUE benchmark, which is an academic benchmark that is used to measure the performance of ML models across 10 different text classification tasks.

The Datasets library provides a very simple command to download and cache a dataset on the Hub. We can download the MRPC dataset like this:

In [21]:
from datasets import load_dataset

In [23]:
raw_datasets = load_dataset("glue", "mrpc")

In [24]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

As you can see, we get a DatasetDict object which contains the training set, the validation set, and the test set. Each of those contains several columns (sentence1, sentence2, label, and idx) and a variable number of rows, which are the number of elements in each set (so, there are 3,668 pairs of sentences in the training set, 408 in the validation set, and 1,725 in the test set).

This command downloads and caches the dataset, by default in ~/.cache/huggingface/datasets. You can customize your cache folder by setting the HF_HOME environment variable.

We can access each pair of sentences in our raw_datasets object by indexing, like with a dictionary:

In [25]:
raw_train_dataset = raw_datasets['train']
raw_train_dataset

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})

In [26]:
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

We can see the labels are already integers, so we won’t have to do any preprocessing there. 

To know which integer corresponds to which label, we can inspect the features of our raw_train_dataset. 

This will tell us the type of each column:

In [27]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

Behind the scenes, label is of type ClassLabel, and the mapping of integers to label names is stored in its `names` attribute. 0 corresponds to `not_equivalent`, and 1 corresponds to `equivalent`.
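As a plain-Python illustration of that mapping, the position of each name in the list is its integer label. The two lookup dictionaries below are illustrative helpers, not part of the Datasets API (`ClassLabel` itself provides `int2str()` and `str2int()` for the same purpose).

```python
# The names list as printed in the features above.
names = ["not_equivalent", "equivalent"]

# Integer label -> name, and name -> integer label.
int2str = dict(enumerate(names))
str2int = {name: index for index, name in enumerate(names)}

# Decode the labels of the two training examples shown in this section.
first_example_label = int2str[1]   # raw_train_dataset[0] has label 1
other_example_label = int2str[0]   # raw_train_dataset[14] has label 0
```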

In [28]:
raw_train_dataset[14]

{'sentence1': 'Gyorgy Heizler , head of the local disaster unit , said the coach was carrying 38 passengers .',
 'sentence2': 'The head of the local disaster unit , Gyorgy Heizler , said the coach driver had failed to heed red stop lights .',
 'label': 0,
 'idx': 15}

In [29]:
raw_valid_dataset = raw_datasets['validation']
raw_valid_dataset

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 408
})

In [30]:
raw_valid_dataset[86]

{'sentence1': 'He was arrested Friday night at an Alpharetta seafood restaurant while dining with his wife , singer Whitney Houston .',
 'sentence2': 'He was arrested again Friday night at an Alpharetta restaurant where he was having dinner with his wife .',
 'label': 1,
 'idx': 796}

In [31]:
raw_train_dataset[:5]

{'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
  "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
  'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
  'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
  'The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .'],
 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
  "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .",
  "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .",
  'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at 

To preprocess the dataset, we need to convert the text to numbers the model can make sense of, which is done with a tokenizer.

We can feed the tokenizer one sentence or a list of sentences, so we can directly tokenize all the first sentences and all the second sentences of each pair like this:

In [32]:
checkpoint = "bert-base-uncased"

In [33]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [35]:
raw_train_dataset

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})

In [41]:
len(raw_train_dataset['sentence1']), len(raw_train_dataset['sentence2'])

(3668, 3668)

In [42]:
tokenized_sentences_1 = tokenizer(
    raw_train_dataset["sentence1"]
)
tokenized_sentences_2 = tokenizer(
    raw_train_dataset["sentence2"]
)

However, we can’t just pass two sequences to the model and get a prediction of whether the two sentences are paraphrases or not. 

We need to handle the two sequences as a pair, and apply the appropriate preprocessing. 

Fortunately, the tokenizer can also take a pair of sequences and prepare it the way our BERT model expects:

In [190]:
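# example: two independent sequences tokenized as one batch (not as a pair)
# Note: the IDs printed by the next few cells match the bert-base-cased
# vocabulary (see the cell execution numbers), not bert-base-uncased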
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing"
]

In [188]:
batch = tokenizer(
    sequences,
    padding=True,
    truncation=True,
    return_tensors="pt"
)

In [189]:
pprint(batch, compact=True, sort_dicts=False)

{'input_ids': tensor([[  101,   146,   112,  1396,  1151,  2613,  1111,   170, 20164, 10932,
          2271,  7954,  1736,  1139,  2006,  1297,   119,   102],
        [  101,  1188,  1736,  1110,  6929,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [191]:
# example of the tokenizer taking a pair of sentences
pprint(tokenizer(
    "My name is Sylvain.",
    "I work at HuggingFace."
), compact=True, sort_dicts=False)

{'input_ids': [101, 1422, 1271, 1110, 156, 7777, 2497, 1394, 119, 102, 146,
               1250, 1120, 20164, 10932, 2271, 7954, 119, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [192]:
# example with multiple sentence pairs
pprint(tokenizer(
    ["My name is Sylvain.", "Going to cinema."],
    ["I work at HuggingFace.", "This movie is great."]
), compact=True, sort_dicts=False)

{'input_ids': [[101, 1422, 1271, 1110, 156, 7777, 2497, 1394, 119, 102, 146,
                1250, 1120, 20164, 10932, 2271, 7954, 119, 102],
               [101, 11099, 1106, 7678, 119, 102, 1188, 2523, 1110, 1632, 119,
                102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


`token_type_ids` is what tells the model which part of the input is the first sentence and which is the second sentence.

## Try it out!

Take element 15 of the training set (here, the example with `idx` 15, which sits at index 14) and tokenize the two sentences separately and as a pair. What’s the difference between the two results?

In [48]:
raw_train_dataset[14]

{'sentence1': 'Gyorgy Heizler , head of the local disaster unit , said the coach was carrying 38 passengers .',
 'sentence2': 'The head of the local disaster unit , Gyorgy Heizler , said the coach driver had failed to heed red stop lights .',
 'label': 0,
 'idx': 15}

### Separate

In [49]:
sent_1 = raw_train_dataset[14]['sentence1']
sent_1

'Gyorgy Heizler , head of the local disaster unit , said the coach was carrying 38 passengers .'

In [51]:
sent_2 = raw_train_dataset[14]['sentence2']
sent_2

'The head of the local disaster unit , Gyorgy Heizler , said the coach driver had failed to heed red stop lights .'

In [56]:
type(sent_1)

str

In [54]:
sent1_tok = tokenizer(sent_1)
pprint(sent1_tok, compact=True,sort_dicts=False)

{'input_ids': [101, 1043, 7677, 22637, 2002, 10993, 3917, 1010, 2132, 1997,
               1996, 2334, 7071, 3131, 1010, 2056, 1996, 2873, 2001, 4755, 4229,
               5467, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1]}


In [61]:
len(sent1_tok['input_ids'])

24

In [55]:
sent2_tok = tokenizer(sent_2)
pprint(sent2_tok, compact=True,sort_dicts=False)

{'input_ids': [101, 1996, 2132, 1997, 1996, 2334, 7071, 3131, 1010, 1043, 7677,
               22637, 2002, 10993, 3917, 1010, 2056, 1996, 2873, 4062, 2018,
               3478, 2000, 18235, 2094, 2417, 2644, 4597, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [59]:
len(sent2_tok['input_ids'])

30

### As a pair

In [57]:
paired_sent_tok = tokenizer(
    sent_1,
    sent_2
)
pprint(paired_sent_tok, compact=True,sort_dicts=False)

{'input_ids': [101, 1043, 7677, 22637, 2002, 10993, 3917, 1010, 2132, 1997,
               1996, 2334, 7071, 3131, 1010, 2056, 1996, 2873, 2001, 4755, 4229,
               5467, 1012, 102, 1996, 2132, 1997, 1996, 2334, 7071, 3131, 1010,
               1043, 7677, 22637, 2002, 10993, 3917, 1010, 2056, 1996, 2873,
               4062, 2018, 3478, 2000, 18235, 2094, 2417, 2644, 4597, 1012,
               102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [62]:
len(paired_sent_tok['input_ids'])

53
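The three lengths above are consistent with how the special tokens are added: each separately tokenized sentence carries its own `[CLS]` and `[SEP]`, while the pair shares a single `[CLS]` and uses one `[SEP]` after each sentence. The pair also gets `token_type_ids` of 1 for the second sentence, which the separate encodings cannot express. A quick check of the arithmetic, using only the lengths printed above:

```python
# Lengths printed above for the two separately tokenized sentences.
len_sent1 = 24
len_sent2 = 30

# Each separate encoding includes one [CLS] and one [SEP].
content1 = len_sent1 - 2
content2 = len_sent2 - 2

# Pair layout: [CLS] sentence1 [SEP] sentence2 [SEP]
pair_len = 1 + content1 + 1 + content2 + 1
print(pair_len)  # 53
```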

## Decoding

In [63]:
# sending the tokenizer two sequences at once
inputs = tokenizer(
    "This is the first sentence.", 
    "This is the second one."
)
pprint(inputs, compact=True, sort_dicts=False)

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996,
               2117, 2028, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [65]:
# first we need to convert ids to tokens
pprint(tokenizer.convert_ids_to_tokens(inputs['input_ids']), compact=True)

['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is',
 'the', 'second', 'one', '.', '[SEP]']


So we see the model expects the inputs to be of the form [CLS] sentence1 [SEP] sentence2 [SEP] when there are two sentences. 

Aligning this with the `token_type_ids` gives us:

In [66]:
pprint(tokenizer.convert_ids_to_tokens(inputs['input_ids']), compact=True)
pprint(inputs['token_type_ids'], compact=True)

['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is',
 'the', 'second', 'one', '.', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]


As you can see, the parts of the input corresponding to [CLS] sentence1 [SEP] all have a token type ID of 0, while the other parts, corresponding to sentence2 [SEP], all have a token type ID of 1.
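The layout can be reproduced by hand. This is a simplified sketch working on token strings, with a hypothetical helper `build_pair`; it is not the tokenizer's actual implementation.

```python
def build_pair(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] and the matching token type IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS], sentence A, and the first [SEP] get type 0;
    # sentence B and the final [SEP] get type 1.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
)
print(tokens)
print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```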

Note that if you select a different checkpoint, you won’t necessarily have the token_type_ids in your tokenized inputs (for instance, they’re not returned if you use a DistilBERT model). 

They are only returned when the model will know what to do with them, because it has seen them during its pretraining.

Here, BERT is pretrained with token type IDs, and on top of the masked language modeling objective, it has an additional objective called next sentence prediction. The goal with this task is to model the relationship between pairs of sentences.

With next sentence prediction, the model is provided pairs of sentences (with randomly masked tokens) and asked to predict whether the second sentence follows the first. To make the task non-trivial, half of the time the sentences follow each other in the original document they were extracted from, and the other half of the time the two sentences come from two different documents.

In general, you don’t need to worry about whether or not there are token_type_ids in your tokenized inputs: as long as you use the same checkpoint for the tokenizer and the model, everything will be fine as the tokenizer knows what to provide to its model.


## Fixed Padding

To keep the data as a dataset, we will use the Dataset.map() method. This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The map() method works by applying a function on each element of the dataset. 

Let’s define a function that tokenizes our inputs. This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys input_ids, attention_mask, and token_type_ids. 

Now that we have seen how our tokenizer can deal with one pair of sentences, we can use it to tokenize our whole dataset: we can feed the tokenizer a list of pairs of sentences by giving it the list of first sentences, then the list of second sentences. This is also compatible with the padding and truncation options. So, one way to preprocess the training dataset is:

In [161]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [162]:
# fixed padding example
raw_dataset = load_dataset("glue", "mrpc")
checkpoint = "bert-base-cased"

In [163]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [164]:
def tokenize_function(example):
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128
    )
  

In [165]:
# Tokenize one example at a time (passing batched=True, shown later, is faster)
tokenized_datasets = raw_dataset.map(tokenize_function)

In [151]:
from pprint import pprint
pprint(tokenized_datasets.column_names, compact=True)

{'test': ['sentence1', 'sentence2', 'label', 'idx', 'input_ids',
          'token_type_ids', 'attention_mask'],
 'train': ['sentence1', 'sentence2', 'label', 'idx', 'input_ids',
           'token_type_ids', 'attention_mask'],
 'validation': ['sentence1', 'sentence2', 'label', 'idx', 'input_ids',
                'token_type_ids', 'attention_mask']}


In [152]:
# prepare for training

# Once tokenization is done, we no longer need the
# idx, sentence1, and sentence2 columns, so we drop them
tokenized_datasets = tokenized_datasets.remove_columns(
    ["idx", "sentence1", "sentence2"]
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [153]:
# rename the label column to labels,
# which is the argument name the model expects
tokenized_datasets = tokenized_datasets.rename_column(
    "label", "labels"
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [154]:
## finally selecting the output tensor format
tokenized_datasets = tokenized_datasets.with_format(
    "torch"
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [144]:
len(tokenized_datasets['train'])

3668

In [155]:
## We can also create a sample dataset
## We'll create a sample dataset from the training dataset with 100 samples
small_train_dataset = tokenized_datasets["train"].select(range(100))

In [156]:
small_train_dataset

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 100
})

Calling the tokenizer directly on the full lists of sentences, as we did earlier with tokenizer(raw_train_dataset["sentence1"]), also works, but it has the disadvantage of returning a dictionary (with our keys, input_ids, attention_mask, and token_type_ids, and values that are lists of lists) rather than a dataset.

It will also only work if you have enough RAM to store your whole dataset during the tokenization (whereas the datasets from the Datasets library are Apache Arrow files stored on the disk, so you only keep the samples you ask for loaded in memory). This is why we used Dataset.map() above.

Note that tokenize_function also works if the example dictionary contains several samples (each key as a list of sentences), since the tokenizer works on lists of pairs of sentences, as seen before.

This lets us use the option batched=True in our call to map(), so the function is applied to multiple elements of our dataset at once and not on each element separately, which greatly speeds up the tokenization. Below, we apply the tokenization function to all our datasets at once in this way.

The tokenizer is backed by a tokenizer written in Rust from the 🤗 Tokenizers library. This tokenizer can be very fast, but only if we give it lots of inputs at once.
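To make the difference concrete, here is a simplified plain-Python sketch of what batching changes: the function receives a dictionary of column lists and is called once per chunk instead of once per row. The helpers `to_columns` and `map_batched` and the toy `count_words` function are hypothetical illustrations, not the Datasets implementation.

```python
def to_columns(rows):
    """Turn a list of row dicts into one dict of column lists."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def map_batched(rows, fn, batch_size=2):
    """Call fn once per chunk of rows and merge the new columns back into each row."""
    out_rows = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        new_columns = fn(to_columns(chunk))
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_columns.items()})
            out_rows.append(merged)
    return out_rows


calls = []  # records how many samples each call received


def count_words(batch):
    # batch["sentence1"] is a list of strings, one per sample in the chunk.
    calls.append(len(batch["sentence1"]))
    return {"n_words": [len(s.split()) for s in batch["sentence1"]]}


rows = [{"sentence1": "a b c"}, {"sentence1": "d e"}, {"sentence1": "f"}]
result = map_batched(rows, count_words, batch_size=2)
print(calls)   # [2, 1]
print(result)
```

Three rows are processed in two calls here; with a fast tokenizer, fewer and larger calls are what make batched tokenization quicker.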

In [166]:
# fixed padding example
raw_dataset = load_dataset("glue", "mrpc")
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128
    )
    
tokenized_datasets = raw_dataset.map(tokenize_function, batched=True)

tokenized_datasets = tokenized_datasets.remove_columns(
    ["idx", "sentence1", "sentence2"]
)

tokenized_datasets = tokenized_datasets.rename_column(
    "label", "labels"
)

tokenized_datasets = tokenized_datasets.with_format(
    "torch"
)

In [167]:
# passing this to a pytorch dataloader
from torch.utils.data import DataLoader

In [168]:
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    batch_size=16,
    shuffle=True
)

In [169]:
for step, batch in enumerate(train_dataloader):
    print(batch["input_ids"].shape)
    if step > 5:
        break

torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])


Each batch shown has shape `torch.Size([batch_size, max_length])`: a fixed batch size of 16 and a fixed length of 128.

## Dynamic Padding

The function that is responsible for putting together samples inside a batch is called a collate function. It’s an argument you can pass when you build a DataLoader, the default being a function that will just convert your samples to PyTorch tensors and concatenate them (recursively if your elements are lists, tuples, or dictionaries). 

This default won’t work if we tokenize without padding, since the inputs won’t all be of the same size. We deliberately postpone the padding so that it is applied only as necessary on each batch, which avoids over-long inputs with a lot of padding. This will speed up training by quite a bit, but note that if you’re training on a TPU it can cause problems: TPUs prefer fixed shapes, even when that requires extra padding.

To do this in practice, we have to define a collate function that will apply the correct amount of padding to the items of the dataset we want to batch together. 

Fortunately, the Transformers library provides us with such a function via DataCollatorWithPadding. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need.

In [179]:
# No fixed padding
def tokenize_function(example):
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        truncation=True,
    )

Note that we’ve left the padding argument out in our tokenization function for now. This is because padding all the samples to the maximum length is not efficient: it’s better to pad the samples when we’re building a batch, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths!
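A rough way to see the saving is to count padding tokens under both strategies. The sketch below uses made-up sequence lengths (an assumption for illustration, not measurements from MRPC) and the fixed `max_length=128` from the earlier example.

```python
# Hypothetical token counts for six tokenized sentence pairs.
lengths = [53, 41, 67, 38, 72, 45]
batch_size = 3
max_length = 128  # fixed padding target used in the earlier example

# Fixed padding: every sample is padded up to max_length.
fixed_pad_tokens = sum(max_length - n for n in lengths)

# Dynamic padding: each batch is padded only up to its own longest sample.
dynamic_pad_tokens = 0
batch_widths = []
for start in range(0, len(lengths), batch_size):
    batch = lengths[start:start + batch_size]
    width = max(batch)
    batch_widths.append(width)
    dynamic_pad_tokens += sum(width - n for n in batch)

print(batch_widths)                          # [67, 72]
print(fixed_pad_tokens, dynamic_pad_tokens)  # 452 101
```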

In [180]:
tokenized_datasets = raw_dataset.map(tokenize_function, batched=True)

The way the Datasets library applies this processing is by adding new fields to the datasets, one for each key in the dictionary returned by the preprocessing function.

You can even use multiprocessing when applying your preprocessing function with map() by passing along a num_proc argument. We didn’t do this here because the Tokenizers library already uses multiple threads to tokenize our samples faster, but if you are not using a fast tokenizer backed by this library, this could speed up your preprocessing.

Our tokenize_function returns a dictionary with the keys input_ids, attention_mask, and token_type_ids, so those three fields are added to all splits of our dataset.

Note that we could also have changed existing fields if our preprocessing function returned a new value for an existing key in the dataset to which we applied map().

The last thing we will need to do is pad all the examples to the length of the longest element when we batch elements together — a technique we refer to as dynamic padding.
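Conceptually, a padding collate function does something like the following. This is a simplified plain-Python sketch with a hypothetical `pad_batch` helper and made-up token IDs; it assumes right padding with pad token ID 0 and returns lists rather than tensors, so it is an illustration rather than the library's implementation.

```python
def pad_batch(features, pad_token_id=0):
    """Right-pad every sample to the longest input_ids in this batch."""
    width = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        missing = width - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * missing)
        batch["token_type_ids"].append(f["token_type_ids"] + [0] * missing)
        # Padding positions get attention_mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * missing)
        batch["labels"].append(f["labels"])
    return batch


# Two unpadded samples of different lengths (token IDs are made up).
features = [
    {"input_ids": [101, 11, 12, 13, 102, 21, 22, 102],
     "token_type_ids": [0, 0, 0, 0, 0, 1, 1, 1],
     "attention_mask": [1] * 8,
     "labels": 1},
    {"input_ids": [101, 11, 102, 21, 102],
     "token_type_ids": [0, 0, 0, 1, 1],
     "attention_mask": [1] * 5,
     "labels": 0},
]

batch = pad_batch(features)
print([len(ids) for ids in batch["input_ids"]])  # [8, 8]
```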

In [181]:
from transformers import DataCollatorWithPadding

In [182]:
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer
)

In [183]:
data_collator

DataCollatorWithPadding(tokenizer=BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
), padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')

Let’s prepare the training set so that its samples can be batched together.

Here, we remove the columns idx, sentence1, and sentence2, as they won’t be needed and contain strings (and we can’t create tensors with strings), rename label to labels, and switch the format to PyTorch tensors. We then pass the collator to a DataLoader and have a look at the shape of each batch:

In [184]:
tokenized_datasets = tokenized_datasets.remove_columns(
    ["idx", "sentence1", "sentence2"]
)

tokenized_datasets = tokenized_datasets.rename_column(
    "label", "labels"
)

tokenized_datasets = tokenized_datasets.with_format(
    "torch"
)

In [185]:
# creating dataloader
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator
)

In [186]:
for step, batch in enumerate(train_dataloader):
    print(batch["input_ids"].shape)
    if step > 5:
        break

torch.Size([16, 72])
torch.Size([16, 67])
torch.Size([16, 67])
torch.Size([16, 78])
torch.Size([16, 86])
torch.Size([16, 74])
torch.Size([16, 72])


We can see a fixed batch size of 16 but a different padded length in each batch, set by the longest sample in that batch. This speeds up training on CPU and GPU, but not necessarily on TPU, since TPUs prefer fixed shapes.