<a href="https://colab.research.google.com/github/SSRavipati/LLM-course/blob/main/chapter_2/Data_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Processing the data**

In [None]:
!pip install --upgrade datasets fsspec

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

In [None]:
raw_train_dataset = raw_datasets["train"]
print(raw_train_dataset[0])
# Only the last expression in a cell is displayed, so print intermediate values explicitly.
print(raw_train_dataset.features)
print(raw_train_dataset[:5])
raw_test_dataset = raw_datasets["test"]
print(raw_test_dataset[0])
raw_validation_dataset = raw_datasets["validation"]
raw_validation_dataset[0]

In [None]:
print(raw_train_dataset[15]['label'])
print(raw_validation_dataset[87]['label'])

In [None]:
import transformers

In [None]:
from transformers import AutoTokenizer
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_sentence1 = raw_train_dataset[15]['sentence1']
raw_sentence2 = raw_train_dataset[15]['sentence2']
print(raw_sentence1)
print(raw_sentence2)
inputs1 = tokenizer(raw_sentence1)
inputs2 = tokenizer(raw_sentence2)
print(inputs1)
print(inputs2)

inputs = tokenizer(raw_sentence1, raw_sentence2)
print(inputs)
# The output contains token_type_ids because BERT was pretrained on sentence pairs:
# they mark which tokens belong to the first sentence (0) and which to the second (1).
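
As a rough illustration of what `token_type_ids` encode, the sketch below rebuilds them by hand for a sentence pair laid out the way BERT expects, `[CLS] sentence1 [SEP] sentence2 [SEP]`. The helper name `build_pair_encoding` and the toy tokens are hypothetical and not part of any library; the real tokenizer does this internally and also converts tokens to ids.

```python
def build_pair_encoding(tokens_a, tokens_b):
    """Hypothetical helper: lay out a sentence pair BERT-style and mark its segments."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], the first sentence, and its [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair_encoding(["this", "is", "first"], ["second", "one"])
for tok, seg in zip(tokens, token_type_ids):
    print(f"{tok:>8} -> {seg}")
```

Comparing this with `print(inputs)` for the real pair should show the same pattern of zeros followed by ones.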

In [None]:
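# Tokenizing a whole split in one call works, but it pads every pair to the
# longest one in the split and keeps the full result in memory as Python lists.
tokenized_dataset = tokenizer(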
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)


In [None]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

In [None]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

## **Dynamic Padding**



*   In `tokenize_function` we deliberately left out padding, so the tokenized sequences still have different lengths. (`batched=True` only means that `map` passes several examples to the function at once.)
*   With dynamic padding, each batch is padded only to the length of its longest sequence, rather than to the longest sequence in the whole dataset.
*   *This saves computation on CPUs and GPUs, but NOT on TPUs, where all batches should have the same shape.*
*   The function responsible for putting samples together inside a batch is called a collate function. It’s an argument you can pass when you build a `DataLoader`.
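
The idea behind such a collate function can be sketched in plain Python. This is only a minimal illustration, not the library's implementation: the name `pad_batch` and the toy ids are hypothetical, and the padding id is assumed to be 0.

```python
def pad_batch(features, pad_token_id=0):
    """Hypothetical collate function: pad every sample to the longest one in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch


# Two toy samples of different lengths (made-up ids).
features = [
    {"input_ids": [101, 7592, 102]},
    {"input_ids": [101, 7592, 2088, 999, 102]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padding length is computed per batch, a batch of short sentences stays short; a function of this shape is what a `DataLoader` would receive as its `collate_fn` argument.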



To do this in practice, we have to define a collate function that applies the correct amount of padding to the items of the dataset we want to batch together. The Transformers library provides such a function via `DataCollatorWithPadding`.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
print(samples)
[len(x) for x in samples["input_ids"]]

In [None]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

In [None]:
raw_datasets2 = load_dataset("glue", "sst2")
raw_datasets2

In [None]:
raw_datasets2["train"][0]

In [None]:
def tokenize_function2(example):
    return tokenizer(example["sentence"], truncation=True)

In [None]:
tokenized_datasets2 = raw_datasets2.map(tokenize_function2, batched=True)
tokenized_datasets2

In [None]:
# Batching and dynamic padding
from transformers import DataCollatorWithPadding
collator = DataCollatorWithPadding(tokenizer=tokenizer)

samples2 = tokenized_datasets2["train"][:3]
samples2 = {k: v for k, v in samples2.items() if k not in ["idx", "sentence"]}
[len(x) for x in samples2["input_ids"]]

In [None]:
batch = collator(samples2)
{k: v.shape for k, v in batch.items()}