# Data collators

Data collators take a list of individual dataset examples and assemble them into a batch for model training. In NLP tasks, input sequences vary in length, but a batch must be a rectangular tensor, so every sequence in it needs the same length. Depending on the collator, it handles **padding** (bringing the sequences in a batch to a common length), **masking** (keeping the attention mask in step with the padding so the model ignores padding tokens), and **batching** (stacking multiple examples into tensors). Some collators also apply task-specific preprocessing, such as random token masking for masked language modeling. Without data collators, these steps would have to be written by hand, which is error-prone. Padding per batch rather than to one global maximum length also reduces the memory and computation spent on padding tokens.

[DataCollator documentation](https://huggingface.co/docs/transformers/en/main_classes/data_collator)

[Padding & truncation documentation](https://huggingface.co/docs/transformers/en/pad_truncation)
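To make the padding, masking, and batching steps concrete, here is a minimal pure-Python sketch. It is not the library's implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and a pad ID of 0 is assumed.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the longest one and build matching attention masks.

    Hypothetical helper for illustration only; pad_id=0 is an assumption.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of different lengths
batch = pad_batch([[101, 7592, 102], [101, 2023, 2003, 1996, 102]])
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 2023, 2003, 1996, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The collators in the sections that follow do the same kind of work on tensors and delegate the padding details to the tokenizer.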



In [1]:
# !pip install -U numpy

## DefaultDataCollator

https://huggingface.co/docs/transformers/en/main_classes/data_collator#transformers.DefaultDataCollator

`DefaultDataCollator` is the simplest collator. It does not pad or otherwise preprocess its inputs: it takes a list of dict-like examples and stacks each field into a tensor, so every example must already have the same length. Keys named `label` or `label_ids` are collected under `labels`, and string fields (such as the raw `text` column) are skipped. It is a good fit when sequences were padded or truncated to a common length at tokenization time, as in the example below, where the tokenizer, not the collator, pads the sequences and produces the **attention mask**. If the examples in a batch can have different lengths, use `DataCollatorWithPadding` instead.

* **Manual Batching**: We manually select the examples to include in a batch (`examples[:2]`).
* **Padding at tokenization time**: The tokenizer is called with `padding=True` inside a batched `map`, so all three examples are padded to the longest one (10 tokens here) before they reach `DefaultDataCollator`, which only stacks them.
* **Scalable**: You can extend this approach to process larger datasets by splitting them into chunks (manual batching).
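Note that the padding in this section comes from the tokenizer; the collator itself only stacks fields. The pure-Python sketch below illustrates that stacking step under stated assumptions: `stack_features` is a hypothetical helper, not the library's code, and it works on plain lists instead of tensors.

```python
def stack_features(features):
    """Group per-example dicts into one dict of lists, without padding.

    Hypothetical helper mimicking a plain stacking collator:
    - string fields are skipped,
    - `label` is collected under `labels`,
    - every other field must have the same length in all examples.
    """
    batch = {}
    for key in features[0]:
        values = [f[key] for f in features]
        if isinstance(values[0], str):
            continue  # raw text cannot be stacked into a tensor
        if key == "label":
            batch["labels"] = values
            continue
        lengths = {len(v) for v in values}
        if len(lengths) > 1:
            raise ValueError(f"'{key}' has unequal lengths {sorted(lengths)}; pad first")
        batch[key] = values
    return batch


# Equal-length examples stack cleanly (token IDs are made up)
same_length = [
    {"text": "a", "input_ids": [101, 7592, 102], "label": 0},
    {"text": "b", "input_ids": [101, 2023, 102], "label": 1},
]
batch = stack_features(same_length)
print(batch)

# Ragged examples cannot be stacked without padding first
ragged = [{"input_ids": [101, 102]}, {"input_ids": [101, 2023, 102]}]
try:
    stack_features(ragged)
    raised = False
except ValueError as err:
    raised = True
    print("Error:", err)
```

The second call fails because nothing in a plain stacking step can reconcile sequences of different lengths, which is the gap `DataCollatorWithPadding` fills.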

In [35]:
from transformers import DefaultDataCollator, AutoTokenizer
from datasets import Dataset

# Example data
data = [
    {"text": "Hello, how are you?"},
    {"text": "This is the longest sequence in the batch"},
    {"text": "Shortest one"}
]

# Convert to Hugging Face Dataset
dataset = Dataset.from_list(data)


In [36]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=10)

# Tokenize dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)



In [37]:
# Initialize DefaultDataCollator
# It stacks already-padded examples into tensors; it does not pad them itself
data_collator = DefaultDataCollator()

# Convert the dataset into a list of examples
examples = [tokenized_dataset[i] for i in range(len(tokenized_dataset))]

# Manually create a batch (e.g., first two examples)
# Batch size = 2
batch = data_collator(examples[:2])

# Print the collated batch
print(batch)
print(batch['input_ids'].size())

{'input_ids': tensor([[  101,  7592,  1010,  2129,  2024,  2017,  1029,   102,     0,     0],
        [  101,  2023,  2003,  1996,  6493,  5537,  1999,  1996, 14108,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
torch.Size([2, 10])


## DataCollatorWithPadding

https://huggingface.co/docs/transformers/en/main_classes/data_collator#transformers.DataCollatorWithPadding

`DataCollatorWithPadding` dynamically pads the tokenized sequences in a batch to the length of the longest sequence in that batch. It suits tasks such as **text classification**, **sentiment analysis**, and **question answering**, where input sequences differ in length. Because padding goes only as far as the longest sequence in each batch rather than a global maximum, less memory and computation are spent on padding tokens, which improves **training efficiency**. Padding is delegated to the tokenizer's `pad` method, which also extends the **attention mask** with zeros over the padded positions so the model attends only to actual tokens. Per-token labels are not padded to match, so for token-level tasks like **named entity recognition (NER)** use `DataCollatorForTokenClassification` instead.

1. **Dynamic Padding**:
* Automatically pads sequences to the length of the longest sequence in the batch.
* Reduces the amount of padding compared to fixed-length padding, saving memory and computation.

2. **Tokenization-Aware**:
* Uses the tokenizer's padding strategies, ensuring consistency with special tokens (e.g., [PAD]).

3. **Task-Agnostic**:
* Can be used for a wide range of NLP tasks, including classification, question answering, and more.

4. **Integration**:
* Designed to be used directly with PyTorch's DataLoader or manually for smaller datasets.

In [2]:
from transformers import DataCollatorWithPadding, AutoTokenizer

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example data
data = [
    {"text": "Hello, how are you?"},
    {"text": "This is the longest sequence in the batch"},
    {"text": "Shortest one"}
]

print("======================= Tokenize individual samples with padding = max_length =======")
# Tokenize with padding="max_length": every example is padded to a fixed length of 100,
# regardless of how many tokens it actually contains. Creates a list of dicts.
tokenized_data = [tokenizer(d["text"], truncation=True, padding="max_length", max_length=100) for d in data]
for i, dat in enumerate(tokenized_data):
    print(data[i],'   len[input_ids] = ', len(dat['input_ids']))

print("======================== Tokenize without padding ======")
tokenized_data = [tokenizer(d["text"], truncation=True) for d in data]

for i, dat in enumerate(tokenized_data):
    print(data[i],'   len[input_ids] = ', len(dat['input_ids']))

print("======================== Dynamic padding with DataCollatorWithPadding ======")

# Use DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


# Prepare a batch
batch = data_collator(tokenized_data)

# Print the padded input IDs
print(batch['input_ids'])

# In a batch all input_ids are packed in a single tensor of equal length
# [x, y]  x = Size of the batch, y = size of the ['input_ids']
# Padding length = Size of the longest sequence in the batch
print(batch['input_ids'].size())

{'text': 'Hello, how are you?'}    len[input_ids] =  100
{'text': 'This is the longest sequence in the batch'}    len[input_ids] =  100
{'text': 'Shortest one'}    len[input_ids] =  100
{'text': 'Hello, how are you?'}    len[input_ids] =  8
{'text': 'This is the longest sequence in the batch'}    len[input_ids] =  10
{'text': 'Shortest one'}    len[input_ids] =  4
tensor([[  101,  7592,  1010,  2129,  2024,  2017,  1029,   102,     0,     0],
        [  101,  2023,  2003,  1996,  6493,  5537,  1999,  1996, 14108,   102],
        [  101, 20047,  2028,   102,     0,     0,     0,     0,     0,     0]])
torch.Size([3, 10])
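The saving from dynamic padding can be quantified from the lengths printed above (8, 10, and 4 tokens). The short calculation below is a sketch using only those numbers; `padding_stats` is a hypothetical helper.

```python
lengths = [8, 10, 4]      # real token counts printed above for the three examples
fixed_max_length = 100    # padding="max_length", max_length=100


def padding_stats(lengths, padded_length):
    """Return (total tokens in the batch, how many of them are padding)."""
    total = padded_length * len(lengths)
    pad = total - sum(lengths)
    return total, pad


fixed_total, fixed_pad = padding_stats(lengths, fixed_max_length)
dynamic_total, dynamic_pad = padding_stats(lengths, max(lengths))

print(f"fixed:   {fixed_pad} of {fixed_total} tokens are padding")      # 278 of 300
print(f"dynamic: {dynamic_pad} of {dynamic_total} tokens are padding")  # 8 of 30
```

With a fixed length of 100, almost the entire batch is padding; padding to the longest sequence in the batch shrinks the tensor from 300 to 30 token positions.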


## DataCollatorForSeq2Seq

https://huggingface.co/docs/transformers/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq

`DataCollatorForSeq2Seq` is designed for sequence-to-sequence tasks such as **machine translation** and **summarization**, using encoder-decoder models like T5, BART, or MarianMT, where each example has both an **input sequence** (the source text) and a **target sequence** (the labels). The collator pads the inputs to the longest input in the batch and, separately, pads the labels to the longest label sequence in the batch. Inputs are padded with the tokenizer's pad token and come with an **attention mask**; labels are padded with `-100` (the default `label_pad_token_id`), a value the loss function ignores. If a `model` is passed and it defines `prepare_decoder_input_ids_from_labels`, the collator also adds `decoder_input_ids`, built by shifting the labels one position to the right.

1. **Dynamic Padding**:

* Automatically pads sequences to the maximum length in the batch, reducing unnecessary padding compared to padding to a fixed length.
* Input sequences (e.g., source texts) and target sequences (e.g., labels) are padded independently, so `input_ids` and `labels` can end up with different lengths (11 and 7 in the output below).

2. **Supports Labels**:

* Specially designed for Seq2Seq tasks where both input and target sequences are needed.
* The collator will handle padding for both input_ids and labels (target sequences).

3. **Masking for Padding**:

* Creates an attention_mask tensor for the inputs, set to 1 for actual tokens and 0 for padding tokens.
* Pads labels with `-100` rather than building a mask for them, so padded label positions do not contribute to the loss calculation.

4. **Efficient for Seq2Seq Models**:

* Specifically tailored for sequence-to-sequence models like T5, BART, or MarianMT, which require both source and target sequences.
* It makes batch preparation easier and more efficient for such models.
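In the collated batch, padded label positions are filled with `-100`, and the loss skips them (PyTorch's cross-entropy loss uses `ignore_index=-100` by default). The pure-Python sketch below shows the effect with made-up probabilities; `mean_nll` is a hypothetical helper, not the loss the model actually runs.

```python
import math

IGNORE_INDEX = -100


def mean_nll(token_probs, labels, ignore_index=IGNORE_INDEX):
    """Average negative log-likelihood over positions whose label is not ignored.

    token_probs[i] maps token IDs to the probability assigned at position i.
    """
    losses = [
        -math.log(probs[label])
        for probs, label in zip(token_probs, labels)
        if label != ignore_index
    ]
    return sum(losses) / len(losses)


labels = [5257, 3664, IGNORE_INDEX]              # last position is label padding
probs_a = [{5257: 0.5}, {3664: 0.25}, {0: 0.9}]  # made-up model probabilities
probs_b = [{5257: 0.5}, {3664: 0.25}, {0: 0.1}]  # differs only at the padded position

loss_a = mean_nll(probs_a, labels)
loss_b = mean_nll(probs_b, labels)
print(loss_a, loss_b)  # identical: the padded position is never scored
```

Whatever the model predicts at a padded label position has no influence on the loss, which is why labels need no separate mask.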

In [3]:
# Load a tokenizer (T5, used here for translation)
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Source & Target text
data = [
    {"source_text": "Translate English to French: How are you?", "target_text": "Comment ça va?"},
    {"source_text": "Translate English to French: I love programming.", "target_text": "J'aime programmer."}
]

# Convert to Hugging Face Dataset
dataset = Dataset.from_list(data)

In [4]:
# Tokenization function
def preprocess_function(examples):
    # Tokenize input (source) and target (label) texts
    model_inputs = tokenizer(examples["source_text"], max_length=50, truncation=True)
    labels = tokenizer(examples["target_text"], max_length=50, truncation=True)

    # Add labels to the model inputs
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply tokenization
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Remove the raw text columns before collating. The collator tries to convert every
# remaining column to a tensor, and string columns cannot be converted, which raises:
# ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`source_text` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
tokenized_dataset = tokenized_dataset.remove_columns(['source_text', 'target_text'])


In [6]:
from transformers import DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM


model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Use DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
)

# Example: Prepare a batch
batch = data_collator([tokenized_dataset[i] for i in range(len(data))])

print(batch)
# Check the output
for key, value in batch.items():
    print(f"{key}: {value.shape if hasattr(value, 'shape') else value}")


{'input_ids': tensor([[30355,    15,  1566,    12,  2379,    10,   571,    33,    25,    58,
             1],
        [30355,    15,  1566,    12,  2379,    10,    27,   333,  6020,     5,
             1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[5257, 3664,  409,   58,    1, -100, -100],
        [ 446,   31, 9595, 2486,   52,    5,    1]]), 'decoder_input_ids': tensor([[   0, 5257, 3664,  409,   58,    1,    0],
        [   0,  446,   31, 9595, 2486,   52,    5]])}
input_ids: torch.Size([2, 11])
attention_mask: torch.Size([2, 11])
labels: torch.Size([2, 7])
decoder_input_ids: torch.Size([2, 7])
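The `labels` and `decoder_input_ids` tensors in the output above can be reproduced by hand, which helps show what the collator adds. This is a pure-Python sketch, not the library's code: `pad_labels` and `shift_right` are hypothetical helpers, and the start and pad token IDs of 0 are assumptions read off the printed output for `t5-small`.

```python
def pad_labels(label_seqs, label_pad_token_id=-100):
    """Pad label sequences to the longest one with the ignore value."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [label_pad_token_id] * (max_len - len(seq)) for seq in label_seqs]


def shift_right(labels, decoder_start_token_id=0, pad_token_id=0, label_pad_token_id=-100):
    """Prepend the start token, drop the last label, and replace -100 with the pad token."""
    shifted = [decoder_start_token_id] + labels[:-1]
    return [pad_token_id if tok == label_pad_token_id else tok for tok in shifted]


# Unpadded label IDs, taken from the printed batch above
raw_labels = [[5257, 3664, 409, 58, 1], [446, 31, 9595, 2486, 52, 5, 1]]

labels = pad_labels(raw_labels)
decoder_input_ids = [shift_right(row) for row in labels]

print(labels)
# [[5257, 3664, 409, 58, 1, -100, -100], [446, 31, 9595, 2486, 52, 5, 1]]
print(decoder_input_ids)
# [[0, 5257, 3664, 409, 58, 1, 0], [0, 446, 31, 9595, 2486, 52, 5]]
```

The decoder therefore sees each target token one step late, and at every position it is trained to predict the label at that same index.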


## DataCollatorForLanguageModeling

https://huggingface.co/docs/transformers/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling

`DataCollatorForLanguageModeling` prepares batches for language model training. With `mlm=True` it supports **Masked Language Modeling (MLM)**, as used by models like BERT: a fraction of the tokens (`mlm_probability`) is selected at random, and the model must predict the original tokens at those positions. With `mlm=False` it supports **Causal Language Modeling (CLM)**, as used by GPT-style **text generation** models that predict the next token in a sequence; in that mode the labels are a copy of the input IDs with padding positions set to `-100`. In both modes the collator pads each batch dynamically to its longest sequence, which reduces memory usage when handling variable-length sequences.

1. **Supports Both MLM and CLM**:
* Handles both masked (`mlm=True`) and causal (`mlm=False`) language modeling tasks, making it versatile.

2. **Dynamic Padding**:
* Pads each batch only to its longest sequence, which reduces unnecessary computation and memory usage.

3. **Attention Masking**:
* When examples are passed as tokenizer output dicts, the attention mask is padded along with the inputs, so padding tokens are ignored in the attention mechanism.

4. **Prepares Tensors for Training**:
* The collator returns the inputs and labels (plus attention masks when available) in the correct format for training language models.
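For the causal case (`mlm=False`), the labels are essentially the padded inputs with padding positions replaced by `-100`; the shift by one position for next-token prediction happens inside the model. A minimal pure-Python sketch of that idea follows. `causal_lm_batch` is a hypothetical helper, and the token IDs and pad ID are made up.

```python
def causal_lm_batch(sequences, pad_token_id):
    """Pad sequences to the longest one and derive labels for causal LM training."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
    # Labels copy the inputs; every position holding the pad ID is ignored by the loss
    labels = [[-100 if tok == pad_token_id else tok for tok in row] for row in input_ids]
    return {"input_ids": input_ids, "labels": labels}


batch = causal_lm_batch([[11, 12, 13, 14], [21, 22]], pad_token_id=0)
print(batch["input_ids"])  # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(batch["labels"])     # [[11, 12, 13, 14], [21, 22, -100, -100]]
```

One consequence of matching on the pad ID is that a real token sharing that ID is also ignored in this sketch, which matters when a tokenizer reuses another special token as its pad token.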

In [29]:
from transformers import DataCollatorForLanguageModeling, AutoTokenizer

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example data
data = [
    "This is a sample sentence.",
    "Another example for masked language modeling."
]

# Tokenize data
tokenized_data = tokenizer(data, truncation=True, max_length=10, padding=True, return_tensors="pt")

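# Use DataCollatorForLanguageModeling
# mlm_probability=0.15: each non-special token has a 15% chance of being selected.
# By default, a selected token is replaced by [MASK] 80% of the time, by a random
# token 10% of the time, and left unchanged 10% of the time.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)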

# Prepare a batch
batch = data_collator([tokenized_data["input_ids"][i] for i in range(len(data))])

# Print masked inputs
# input_ids: the tokenized sentences with selected tokens replaced by [MASK] (ID 103),
#            replaced by a random token, or occasionally left unchanged.
# labels: the original token IDs at the selected positions; all other positions are
#         set to -100 (ignored by the loss function).
print("Input IDs with masking:\n", batch["input_ids"])
print("Labels:\n", batch['labels'])


Input IDs with masking:
 tensor([[ 101, 2023, 2003, 1037,  103, 6251, 1012,  102,    0],
        [ 101,  103, 2742, 2005,  103, 2653,  103,  103,  102]])
Labels:
 tensor([[ -100,  -100,  -100,  -100,  7099,  -100,  1012,  -100,  -100],
        [ -100,  2178,  -100,  -100, 16520,  -100, 11643,  1012,  -100]])
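The masking procedure behind this output can be sketched in pure Python. This is an illustration under stated assumptions, not the library's implementation: `mask_tokens` is a hypothetical helper, the 80/10/10 split follows the usual BERT recipe, and the special-token set, mask ID 103, and vocabulary size 30522 are assumed values for `bert-base-uncased`.

```python
import random


def mask_tokens(input_ids, special_ids, mask_id, vocab_size, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence, BERT-style.

    Selected positions keep their original ID in `labels`; all others get -100.
    """
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if tok in special_ids or rng.random() >= mlm_probability:
            continue  # not selected: label stays -100, input stays unchanged
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


# First sentence from the output above, before masking ([CLS]=101, [SEP]=102, [PAD]=0)
ids = [101, 2023, 2003, 1037, 7099, 6251, 1012, 102, 0]
special = {101, 102, 0}
masked, labels = mask_tokens(ids, special, mask_id=103, vocab_size=30522,
                             mlm_probability=0.5, rng=random.Random(0))
print(masked)
print(labels)
```

The "keep the original token" branch explains why, in the first row of the output above, token 1012 has a label even though it still appears unmasked in `input_ids`.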
