## Intro
Let us load and work with a dataset from the Hugging Face Hub: MRPC, one of the tasks in the GLUE benchmark.

In [11]:
!pip install datasets

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets



DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

Now let us inspect one example from each split:

In [12]:
raw_datasets["train"][15]

{'sentence1': 'Rudder was most recently senior vice president for the Developer & Platform Evangelism Business .',
 'sentence2': 'Senior Vice President Eric Rudder , formerly head of the Developer and Platform Evangelism unit , will lead the new entity .',
 'label': 0,
 'idx': 16}

In [13]:
raw_datasets["validation"][87]

{'sentence1': 'However , EPA officials would not confirm the 20 percent figure .',
 'sentence2': 'Only in the past few weeks have officials settled on the 20 percent figure .',
 'label': 0,
 'idx': 812}

In [14]:
raw_datasets["test"][59]

{'sentence1': 'Another body was pulled from the water on Thursday and two seen floating down the river could not be retrieved due to the strong currents , local reporters said .',
 'sentence2': 'Two more bodies were seen floating down the river on Thursday , but could not be retrieved due to the strong currents , local reporters said .',
 'label': 1,
 'idx': 59}

### Tokenize

To tokenize the entire dataset, we could do the following:

In [15]:
# Tokenize
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

In [16]:
# More general tokenization
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

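However, this approach has drawbacks: it returns a plain dictionary of lists rather than a `Dataset`, it requires the whole tokenized dataset to fit in memory at once, and `padding=True` pads every example to the longest sequence in the entire split. To avoid this, we can use the `.map` method, which applies a function to every split and, with `batched=True`, passes many examples to that function at once so the tokenizer can process them together. To use `.map`, we need to write a function that specifies how the tokenization should be done.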

In [17]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
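
To make the role of `batched=True` more concrete, here is a minimal pure-Python sketch of what a batched map does. It is not the `datasets` implementation: `toy_tokenizer` and `toy_batched_map` are hypothetical stand-ins, and the whitespace "tokenizer" is only an assumption for illustration. The point is that the mapped function receives a dictionary of lists (several rows at once) and returns new columns with one entry per row.

```python
def toy_tokenizer(first, second):
    """Hypothetical stand-in for a real tokenizer: whitespace-split sentence pairs."""
    input_ids = [a.split() + ["[SEP]"] + b.split() for a, b in zip(first, second)]
    return {"input_ids": input_ids}


def toy_batched_map(columns, fn, batch_size=2):
    """Apply fn to slices of a column-oriented table and append the new columns."""
    num_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, num_rows, batch_size):
        # Each call sees a dict of lists, like a function passed to a batched map.
        batch = {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }
        for name, values in fn(batch).items():
            result.setdefault(name, []).extend(values)
    return result


# Made-up rows, not taken from MRPC.
toy_columns = {
    "sentence1": ["a b", "c", "d e f"],
    "sentence2": ["x", "y z", "w"],
}
toy_result = toy_batched_map(
    toy_columns,
    lambda batch: toy_tokenizer(batch["sentence1"], batch["sentence2"]),
)
print(toy_result["input_ids"])
```

The original columns are kept and the new `input_ids` column is added next to them, which mirrors how the tokenizer outputs appear alongside `sentence1`, `sentence2`, `label`, and `idx` in the output above.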

+ Our `tokenize_function` returns a dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`, so those three fields are added to all splits of our dataset.

+ No padding is applied at this stage. This is intentional: padding each batch to its own longest sequence adds fewer padding tokens than padding every example to the longest sequence in the dataset, so it is more efficient.

+ The padding is applied per batch by a data collator (`DataCollatorWithPadding`) when batches are assembled. This is called ***dynamic padding***.

In [20]:
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# The collator pads each batch to its longest sequence and returns PyTorch tensors
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Drop the columns the model does not use; the raw strings cannot be turned into tensors
train_dataset = tokenized_datasets["train"].remove_columns(["idx", "sentence1", "sentence2"])

# Set batch size
batch_size = 8

# Create the training DataLoader with dynamic padding
train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator
)
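
To see why padding per batch helps, the following is a small pure-Python sketch that counts the padding tokens added under both strategies. It does not use the collator itself: `pad_batch`, `count_padding`, and the token-id sequences are made up for illustration, and a pad id of 0 is an assumption.

```python
def pad_batch(sequences, pad_id=0, length=None):
    """Pad every sequence to `length`, or to the longest sequence in the batch."""
    target = length if length is not None else max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (target - len(seq)) for seq in sequences]


def count_padding(sequences, batch_size, global_length=None):
    """Count the pad tokens added when the data is padded batch by batch."""
    total = 0
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        padded = pad_batch(batch, length=global_length)
        total += sum(len(p) - len(s) for p, s in zip(padded, batch))
    return total


# Made-up token-id sequences of lengths 3, 4, 9, and 2.
toy_sequences = [[1, 2, 3], [4, 5, 6, 7], list(range(1, 10)), [8, 9]]

# Static padding: every sequence is padded to the longest one overall.
longest = max(len(seq) for seq in toy_sequences)
static_pads = count_padding(toy_sequences, batch_size=2, global_length=longest)

# Dynamic padding: each batch is padded only to its own longest sequence.
dynamic_pads = count_padding(toy_sequences, batch_size=2)

print(static_pads, dynamic_pads)
```

In this toy case the dynamic strategy adds fewer pad tokens because the first batch contains only short sequences. How much is saved in practice depends on how sequence lengths are distributed across batches.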