# Data collators

Data collators turn a list of individual examples into a single batch that a model can consume. In NLP tasks, tokenized sequences vary in length, but the tensors in a batch must share one shape. Data collators therefore handle **padding** (bringing sequences to a common length), **masking** (marking padding tokens so they are ignored during training), and **batching** (stacking multiple examples into tensors). Managing these steps by hand is tedious and error-prone. Collators automate them, and those that pad per batch also avoid spending memory and compute on unnecessary padding.
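To make padding and masking concrete, here is a minimal plain-Python sketch of what a padding collator does for one batch. It is a hand-written illustration, not the `transformers` implementation; the function name `collate` and the padding id `0` are assumptions chosen for the example.

```python
# Illustrative only: a hand-written collate function, not the transformers code.
PAD_ID = 0  # assumed padding token id


def collate(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest one in the batch and build the mask."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # masking
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = collate([[101, 7592, 102], [101, 2023, 2003, 1996, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Both lists now have the same row length, so they could be stacked into a tensor; the mask marks which positions hold real tokens.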

[DataCollator documentation](https://huggingface.co/docs/transformers/en/main_classes/data_collator)

[Padding & truncation documentation](https://huggingface.co/docs/transformers/en/pad_truncation)

### Google Colab
* Install the required packages; uncomment the line in the cell below

In [None]:
# !pip install datasets transformers

## DefaultDataCollator

https://huggingface.co/docs/transformers/en/main_classes/data_collator#transformers.DefaultDataCollator

A very simple data collator that collates batches of dict-like objects, with special handling for keys named:

* `label`: handles a single value (int or float) per object
* `label_ids`: handles a list of values per object

It does not do any additional preprocessing: property names of the input object are used as the corresponding inputs to the model.

**Note:**
* The `label` and `label_ids` columns are optional. When one is present, it is returned under the key `labels`; the example below has neither and still collates without error.
* The collator does not pad, so every sequence in a batch must already have the same length.
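The key handling described above can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not the library code: the function name `collate_like_default` is hypothetical, and the real collator additionally converts each column to a tensor.

```python
# Illustrative sketch of the `label` / `label_ids` handling; not the library code.
def collate_like_default(features):
    first = features[0]
    batch = {}
    if first.get("label") is not None:
        # One value per example, returned under the key "labels".
        batch["labels"] = [f["label"] for f in features]
    elif first.get("label_ids") is not None:
        # A list of values per example, also returned under "labels".
        batch["labels"] = [f["label_ids"] for f in features]
    for key, value in first.items():
        # Assumption: string columns (e.g. raw text) are skipped because
        # they cannot be turned into tensors.
        if key not in ("label", "label_ids") and value is not None and not isinstance(value, str):
            batch[key] = [f[key] for f in features]
    return batch


features = [
    {"text": "good", "input_ids": [101, 2204, 102], "label": 1},
    {"text": "bad", "input_ids": [101, 2919, 102], "label": 0},
]
collated = collate_like_default(features)
print(collated)
```

Every other key is passed through under its own name, which is why the column names must match the argument names the model expects.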

In [1]:
from transformers import DefaultDataCollator, AutoTokenizer
from datasets import Dataset

# Example data
data = [
    {"text": "Hello, how are you?"},
    {"text": "This is the longest sequence in the batch"},
    {"text": "Shortest one"}
]

# Convert to Hugging Face Dataset
dataset = Dataset.from_list(data)


In [2]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenization function
# With padding="max_length", every sequence is padded to a fixed max_length,
# usually chosen from the length of the longest sentence in the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=20)

# Tokenize dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

print("Size of input_ids array : ", len(tokenized_dataset['input_ids'][0]))

Size of input_ids array :  20


In [3]:
# Initialize DefaultDataCollator
# Note: the default data collator takes no tokenizer and does no padding, so it raises an error
# if the sequences are not already tokenized and padded to the same length.
# Here the sequences were tokenized and padded to max_length in the previous cell.
data_collator = DefaultDataCollator()

# Convert the dataset into a list of examples
examples = [tokenized_dataset[i] for i in range(len(tokenized_dataset))]

# Manually create a batch from the first two examples (batch size = 2)
# input_ids keep the fixed length of 20 set by max_length; nothing is adjusted per batch
batch = data_collator(examples[:2])

# Print the collated batch
print(batch)
print("===================")
print(batch['input_ids'].size())

{'input_ids': tensor([[  101,  7592,  1010,  2129,  2024,  2017,  1029,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  2023,  2003,  1996,  6493,  5537,  1999,  1996, 14108,   102,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
torch.Size([2, 20])


## DataCollatorWithPadding

https://huggingface.co/docs/transformers/en/main_classes/data_collator#transformers.DataCollatorWithPadding

`DataCollatorWithPadding` dynamically pads the tokenized sequences in a batch to the length of the longest sequence in that batch. Because padding is decided per batch rather than fixed in advance, it keeps padding to a minimum, which reduces memory use and wasted computation when inputs vary in length. It also pads the **attention mask** with zeros, so the model attends only to real tokens. This makes it a good fit for sequence-level tasks such as **text classification**, **sentiment analysis**, and **question answering**. Tasks with per-token labels, such as **named entity recognition (NER)**, also need their labels padded; `DataCollatorForTokenClassification` covers that case.
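A quick back-of-the-envelope calculation shows how much padding the dynamic approach saves. The token counts (8, 10, and 4) and the fixed `max_length=100` are taken from the example in this section; the variable names are only illustrative.

```python
# Token counts of the three example sentences (from this section's output).
lengths = [8, 10, 4]
fixed_max_length = 100  # padding="max_length" with max_length=100

real_tokens = sum(lengths)
fixed_padding = fixed_max_length * len(lengths) - real_tokens
dynamic_padding = max(lengths) * len(lengths) - real_tokens

print("padding tokens with fixed max_length:", fixed_padding)
print("padding tokens with dynamic padding: ", dynamic_padding)
```

With fixed padding the batch holds 278 padding tokens for 22 real ones; with dynamic padding it holds 8.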

### Batch size
Batch size is controlled by the Trainer or by other classes such as the PyTorch DataLoader. For simplicity, this example passes the entire dataset as a single batch.
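When a loader splits the dataset into several batches, the collator runs once per batch, so each batch can end up with a different padded length. The sketch below imitates that in plain Python; `iter_batches` and `pad_batch` are hypothetical helpers standing in for a data loader and a padding collator.

```python
def iter_batches(examples, batch_size):
    """Yield consecutive slices of `examples`, like a non-shuffling data loader."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


def pad_batch(batch, pad_id=0):
    """Pad every sequence in one batch to that batch's longest sequence."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


# Three dummy sequences with the same lengths as the example sentences.
examples = [[1] * 8, [1] * 10, [1] * 4]

padded_lengths = [len(pad_batch(b)[0]) for b in iter_batches(examples, batch_size=2)]
print(padded_lengths)
```

With `batch_size=2`, the first batch is padded to length 10 and the second, which contains only the shortest sequence, stays at length 4.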

In [4]:
from transformers import DataCollatorWithPadding, AutoTokenizer

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example data
data = [
    {"text": "Hello, how are you?"},
    {"text": "This is the longest sequence in the batch"},
    {"text": "Shortest one"}
]

print("======================= Tokenize individual sample with padding = max_length  =======")
# Tokenize the data with padding set to max_length - creates a list of dicts
# Assume the longest sequence in the data is length = 100
tokenized_data = [tokenizer(d["text"], truncation=True, padding="max_length", max_length=100) for d in data]
for i, dat in enumerate(tokenized_data):
    print(data[i],'   len[input_ids] = ', len(dat['input_ids']))

print("======================== Tokenize without padding i.e., padding=False ======")
tokenized_data = [tokenizer(d["text"], truncation=True) for d in data]

for i, dat in enumerate(tokenized_data):
    print(data[i],'   len[input_ids] = ', len(dat['input_ids']))

print("======================== Dynamic padding with DataCollatorWithPadding ======")

# Use DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


# Prepare a batch
batch = data_collator(tokenized_data)

# Print the padded input_ids of the batch
print(batch['input_ids'])

# In a batch all input_ids are packed in a single tensor of equal length
# [x, y]  x = Size of the batch, y = size of the ['input_ids']
# Padding length = Size of the longest sequence in the batch
print(batch['input_ids'].size())

{'text': 'Hello, how are you?'}    len[input_ids] =  100
{'text': 'This is the longest sequence in the batch'}    len[input_ids] =  100
{'text': 'Shortest one'}    len[input_ids] =  100
{'text': 'Hello, how are you?'}    len[input_ids] =  8
{'text': 'This is the longest sequence in the batch'}    len[input_ids] =  10
{'text': 'Shortest one'}    len[input_ids] =  4
tensor([[  101,  7592,  1010,  2129,  2024,  2017,  1029,   102,     0,     0],
        [  101,  2023,  2003,  1996,  6493,  5537,  1999,  1996, 14108,   102],
        [  101, 20047,  2028,   102,     0,     0,     0,     0,     0,     0]])
torch.Size([3, 10])


## DataCollatorForSeq2Seq

https://huggingface.co/docs/transformers/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq

`DataCollatorForSeq2Seq` is designed for sequence-to-sequence tasks such as **machine translation** and **summarization**, with models like T5, BART, or MarianMT that need both **input sequences** (source texts) and **target sequences** (labels). It dynamically pads the inputs to the longest input in the batch and, separately, the labels to the longest label in the batch. Inputs get an **attention mask**; labels are padded with `-100` (the default `label_pad_token_id`) so that padded positions are ignored by the loss. When a `model` is passed and it provides `prepare_decoder_input_ids_from_labels`, the collator also adds `decoder_input_ids` built from the labels.

1. **Dynamic Padding**:

* Automatically pads sequences to the maximum length in the batch, reducing unnecessary padding compared to padding to a fixed length.
* Pads input sequences (e.g., source texts) and target sequences (e.g., labels) independently, each to its own longest sequence in the batch.

2. **Supports Labels**:

* Specially designed for Seq2Seq tasks where both input and target sequences are needed.
* The collator handles padding for both `input_ids` and `labels` (target sequences).

3. **Masking for Padding**:

* Returns an `attention_mask` for the inputs (1 for actual tokens, 0 for padding tokens) so the model ignores padded input positions.
* Pads `labels` with `-100` instead of the pad token, so padded label positions do not contribute to the loss calculation.

4. **Efficient for Seq2Seq Models**:

* Specifically tailored for sequence-to-sequence models like T5, BART, or MarianMT, which require both source and target sequences.
* It makes batch preparation easier and more efficient for such models.

**Note:**

Labels can have a different length than the input sequence; each is padded to its own batch maximum.
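The separate treatment of inputs and labels can be sketched in plain Python. This is an illustration rather than the library code: `collate_seq2seq` is a hypothetical helper, the token ids are made up, and the pad id `0` is an assumption (it matches T5's pad token).

```python
PAD_ID = 0           # assumed input padding id
LABEL_PAD_ID = -100  # label padding value that the loss ignores


def pad_to_longest(seqs, fill):
    """Right-pad every sequence with `fill` up to the longest sequence."""
    longest = max(len(s) for s in seqs)
    return [s + [fill] * (longest - len(s)) for s in seqs]


def collate_seq2seq(features):
    lengths = [len(f["input_ids"]) for f in features]
    longest = max(lengths)
    return {
        # Inputs are padded with the pad token id.
        "input_ids": pad_to_longest([f["input_ids"] for f in features], PAD_ID),
        # The mask is 1 for real input tokens and 0 for padding.
        "attention_mask": [[1] * n + [0] * (longest - n) for n in lengths],
        # Labels are padded independently, to their own maximum length.
        "labels": pad_to_longest([f["labels"] for f in features], LABEL_PAD_ID),
    }


features = [
    {"input_ids": [13, 8, 1], "labels": [20, 1]},
    {"input_ids": [13, 8, 27, 5, 1], "labels": [30, 31, 32, 1]},
]
seq2seq_batch = collate_seq2seq(features)
for key, value in seq2seq_batch.items():
    print(key, value)
```

Here the inputs are padded to length 5 and the labels to length 4, which is why the two tensors in a real batch can have different widths.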

In [None]:
# Load a tokenizer (e.g., T5 for translation)
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Source & Target text
data = [
    {"source_text": "Translate English to French: How are you?", "target_text": "Comment ça va?"},
    {"source_text": "Translate English to French: I love programming.", "target_text": "J'aime programmer."}
]

# Convert to Hugging Face Dataset
dataset = Dataset.from_list(data)

In [None]:
# Tokenization function
def preprocess_function(examples):
    # Tokenize input (source) and target (label) texts; no padding here, the collator pads per batch
    model_inputs = tokenizer(examples["source_text"], max_length=50, truncation=True)
    labels = tokenizer(examples["target_text"], max_length=50, truncation=True)

    # Add labels to the model inputs
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply tokenization
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Remove the raw text columns before collating: the collator cannot convert strings to tensors.
# Leaving them in raises the following error:
# ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`source_text` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
tokenized_dataset = tokenized_dataset.remove_columns(['source_text', 'target_text'])

In [None]:
from transformers import DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM

# Optional for the data collator: when given, the model is used to build decoder_input_ids from the labels
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Use DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
)

# Example: Prepare a batch from every example in the dataset
batch = data_collator([tokenized_dataset[i] for i in range(len(data))])

print(batch)

# Check the output
for key, value in batch.items():
    print(f"{key}: {value.shape if hasattr(value, 'shape') else value}")


## DataCollatorForLanguageModeling

https://huggingface.co/docs/transformers/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling

`DataCollatorForLanguageModeling` prepares batches for language model training. With **Masked Language Modeling (MLM)**, used by models like BERT, a percentage of tokens are masked and the model must predict them. With **Causal Language Modeling (CLM)**, used by models like GPT for **text generation**, the labels are a copy of the inputs and the model predicts the next token in the sequence. In both modes the collator pads each batch dynamically, which saves memory and computation when sequences vary in length.

1. **Supports Both MLM and CLM**:
* Handles both masked and causal language modeling tasks, making it versatile.
  
2. **Dynamic Padding**:
* Efficient padding that reduces unnecessary computation and memory usage.

3. **Attention Masking**:
* When the features are tokenizer outputs (dicts), the padded batch includes an `attention_mask`, so padding tokens are ignored in the attention mechanism.

4. **Prepares Tensors for Training**:
* The collator returns `input_ids` and `labels` (plus `attention_mask` for dict features) in the correct format for training language models.

**Note:**

* Set `mlm=False` for Causal Language Modeling (CLM), i.e., an NLP task in which the model predicts the next token.
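For the causal case, the labels are essentially a copy of the inputs with padding positions replaced by `-100`. The plain-Python sketch below illustrates this; `clm_labels` is a hypothetical helper, the pad id `0` and the token ids are assumptions, and the shift between inputs and targets is left to the model.

```python
PAD_ID = 0  # assumed padding token id


def clm_labels(input_ids, pad_id=PAD_ID):
    """Copy the inputs and replace padding positions with -100 (ignored by the loss)."""
    return [[-100 if tok == pad_id else tok for tok in row] for row in input_ids]


padded_inputs = [
    [5, 17, 42, 9],
    [5, 23, 0, 0],  # second sequence was padded to the batch length
]
labels = clm_labels(padded_inputs)
print(labels)
```

No tokens are masked in the inputs here; only the padded label positions are excluded from the loss.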

In [None]:
from transformers import DataCollatorForLanguageModeling, AutoTokenizer

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example data
data = [
    "This is a sample sentence.",
    "Another example for masked language modeling."
]

# Tokenize data
tokenized_data = tokenizer(data, truncation=True, max_length=10, padding=True, return_tensors="pt")

# Use DataCollatorForLanguageModeling
# mlm_probability=0.15: each non-special token has a 15% chance of being selected for prediction
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Prepare a batch
batch = data_collator([tokenized_data["input_ids"][i] for i in range(len(data))])

# Print masked inputs
# input_ids: Contains the tokenized sentences with some tokens replaced by [MASK] (or randomized).
# labels: Contains the original token IDs, with masked positions preserved and all other positions set to -100 (ignored by the loss function).
print("Input IDs with masking:\n", batch["input_ids"])
print("Labels:\n", batch['labels'])
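
The masking step described in the comments above can be sketched in plain Python. This is an illustration under stated assumptions, not the library code: `mask_tokens` is a hypothetical helper, the ids for `[PAD]`, `[CLS]`, `[SEP]`, and `[MASK]` and the vocabulary size are assumed to be those of `bert-base-uncased`, and the 80/10/10 split is the commonly documented default.

```python
import random

MASK_ID, VOCAB_SIZE = 103, 30522  # assumed [MASK] id and vocabulary size
SPECIAL_IDS = {0, 101, 102}       # assumed [PAD], [CLS], [SEP] ids


def mask_tokens(input_ids, mlm_probability=0.15, rng=random):
    """Return (masked inputs, labels) for one sequence of token ids."""
    inputs, labels = list(input_ids), [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if tok in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue          # not selected: input unchanged, label stays -100
        labels[i] = tok       # selected: the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


rng = random.Random(0)  # seeded so that repeated runs give the same result
ids = [101, 2023, 2003, 1037, 7099, 6251, 1012, 102, 0, 0]
# A high probability is used here only so that a short sequence shows some masking.
masked, mlm_labels = mask_tokens(ids, mlm_probability=0.5, rng=rng)
print("inputs:", masked)
print("labels:", mlm_labels)
```

Positions with label `-100` are left untouched in the inputs and ignored by the loss; every other position keeps the original token id as its label.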
