In the section on batching inputs together,
we grouped inputs of different lengths into the same batch by adding padding tokens until they all had the same length. This was done with max_length=128 and padding='max_length'.

Fixed padding guarantees that every batch has the same shape.
The downside is extra computation: every input shorter than max_length is padded up to it,
so many batches end up with columns that contain only pad tokens.
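
To make that cost concrete, here is a minimal sketch in plain Python with no tokenizer involved. The helper `pad_fraction_fixed` and the token counts in `lengths` are made up for illustration; they are not measurements from a real dataset.

```python
def pad_fraction_fixed(lengths, max_length=128):
    """Fraction of positions that hold pad tokens when every input
    is padded (or truncated) to max_length."""
    # Inputs longer than max_length are truncated, so they add no padding.
    real = sum(min(n, max_length) for n in lengths)
    total = len(lengths) * max_length
    return (total - real) / total

# Hypothetical token counts for a batch of four inputs.
lengths = [32, 48, 64, 112]
print(f"{pad_fraction_fixed(lengths):.0%} of the batch is padding")
```

With these assumed lengths, half of the positions in the batch are pad tokens.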


# What is Dynamic Padding

To avoid this, another strategy is to pad at batching time, only up to the longest sentence inside each batch.
This way, batches composed of short inputs are smaller than the batch that contains the longest sentence in the dataset.
This usually gives a noticeable speedup on CPU and GPU.
The downside is that batches then have different shapes,
which slows down training on accelerators like TPUs, which prefer fixed shapes.
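
The idea can be sketched in a few lines of plain Python. This is only an illustration of padding to the longest input in a batch, not the library's implementation: `pad_to_longest` is a hypothetical helper, `pad_id=0` is an assumed pad id, and the token ids are made up.

```python
def pad_to_longest(batch, pad_id=0):
    """Pad each list of token ids to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

# Made-up token ids: two batches whose longest inputs differ in length.
short_batch = [[101, 7, 102], [101, 7, 8, 9, 102]]
long_batch = [[101, 7, 8, 9, 10, 11, 12, 102], [101, 102]]

for batch in (short_batch, long_batch):
    ids, mask = pad_to_longest(batch)
    print(len(ids), "x", len(ids[0]))
```

The two batches come out as 2 x 5 and 2 x 8: each one is only as wide as its own longest input, which is also why the shapes differ from batch to batch.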

Let's see how the two padding strategies differ:

### Fixed Padding:

In [8]:
from transformers import AutoTokenizer

checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example['sentence1'], example['sentence2'],
                     padding='max_length',
                     truncation=True,
                     max_length=128,
                     return_tensors='pt')

In [9]:
from datasets import load_dataset

raw_datasets = load_dataset('glue', 'mrpc')
print(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})


In [10]:
raw_datasets.map(tokenize_function)

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [11]:
raw_datasets.map(tokenize_function, batched=True)

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [15]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['idx','sentence1','sentence2'])
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')
tokenized_datasets = tokenized_datasets.with_format('torch')

In [16]:
'''Pass the padded dataset to a PyTorch DataLoader:'''

from torch.utils.data import DataLoader

train_dataloader = DataLoader(tokenized_datasets['train'], batch_size=16, shuffle=True)

for step, batch in enumerate(train_dataloader):
    print(batch['input_ids'].shape)
    if step > 5:
        break

torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])


### Dynamic padding:

In [18]:
'''To apply dynamic padding,
we defer the padding to batch preparation,
so we remove the padding arguments from tokenize_function'''

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset('glue', 'mrpc')
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Note: return_tensors is dropped along with the padding,
# since unpadded sequences of unequal length cannot be stacked into one tensor
def tokenize_function(example):
    return tokenizer(example['sentence1'], example['sentence2'],
                     truncation=True)
    
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['idx','sentence1','sentence2'])
tokenized_datasets = tokenized_datasets.rename_column('label','labels')
tokenized_datasets = tokenized_datasets.with_format('torch')



In [19]:
'''Then we apply a dynamic padding using the data collator'''

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(
    tokenized_datasets['train'], batch_size=16, shuffle=True, collate_fn=data_collator
)

for step, batch in enumerate(train_dataloader):
    print(batch['input_ids'].shape)
    if step > 5:
        break

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


torch.Size([16, 72])
torch.Size([16, 74])
torch.Size([16, 78])
torch.Size([16, 81])
torch.Size([16, 76])
torch.Size([16, 76])
torch.Size([16, 76])
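
As a rough comparison, the shapes printed above can be turned into a count of token positions per run. This is a back-of-the-envelope sketch: position count is only a proxy for compute, not a measured speedup, and because the dataloader shuffles, the dynamic lengths will differ from run to run.

```python
# Sequence lengths of the seven batches printed in each run above.
fixed_lengths = [128] * 7
dynamic_lengths = [72, 74, 78, 81, 76, 76, 76]
batch_size = 16

# Total token positions the model has to process across the seven batches.
fixed_positions = batch_size * sum(fixed_lengths)
dynamic_positions = batch_size * sum(dynamic_lengths)

print(fixed_positions)    # 14336
print(dynamic_positions)  # 8528
print(f"{dynamic_positions / fixed_positions:.1%}")  # 59.5%
```

For these seven batches, dynamic padding feeds the model roughly 60% of the positions that fixed padding does.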


In [20]:
'''Observe that the padding is now dynamic:
the batches have different lengths,
all well below the 128 we saw with fixed padding.
Dynamic padding is generally faster on CPUs and GPUs,
but switch back to fixed padding if you are using TPUs'''
