# Processing The Data

To access a dataset we will use the `datasets` library, and for this example we will use the *MRPC* (*Microsoft Research Paraphrase Corpus*) dataset. It consists of 5,801 pairs of sentences, each with a label indicating whether the two sentences are paraphrases (i.e., whether they mean the same thing). It is also one of the 10 datasets composing the *GLUE* benchmark, an academic benchmark used to measure the performance of ML models across 10 different text classification tasks. We can therefore use this dataset to fine-tune a *BERT* model (`bert-base-uncased`) to classify paraphrases.

In [None]:
from datasets import load_dataset

# download the mrpc dataset
raw_datasets = load_dataset(
    path="glue",
    name="mrpc"
)

In [None]:
raw_datasets

As we can see, we get a `DatasetDict` object containing three datasets: one for training, one for validation and one for testing. Each of them has '*sentence1*', '*sentence2*', '*label*' and '*idx*' as its columns, and there are *3668* rows in the training dataset, *408* rows in the validation dataset and *1725* rows in the testing dataset.

To access an individual example, select a split and index into it:

In [None]:
raw_train_dataset = raw_datasets['train']
raw_train_dataset[0]

We see that the label column has an integer value of 1. To see what it corresponds to, we can inspect the dataset's features:

In [None]:
raw_train_dataset.features

As we can see, the label column is of type *ClassLabel*: **0** corresponds to **not_equivalent**, and **1** corresponds to **equivalent**.

In [None]:
from pprint import pprint

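# plain print for the captions, pprint for the nested example dictionaries
print("Data at index 15 in the training dataset:")
pprint(raw_train_dataset[15])
print("\nData at index 800 in the training dataset:")
pprint(raw_train_dataset[800])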

> Note: As we can see above, the index used when accessing the data (e.g. `raw_train_dataset[15]`) is not always the same as the *idx* value. This could be because of how the full dataset was split into train/validation/test datasets.

# Preprocessing The Data

Each example in our data is a pair of sequences that the model has to process together for classification. That means the tokeniser has to convert the two sequences into tokens as a pair. Fortunately, the tokeniser handles this for us: we simply have to pass both sequences together:

In [None]:
from transformers import AutoTokenizer

checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

inputs = tokenizer("This is the first sentence.", "This is the second one.")
pprint(inputs)

This time the tokeniser returns an additional feature, *token_type_ids*, which tells the model which part of the input is the first sentence and which is the second. Converting the IDs back to tokens and lining them up with the *token_type_ids* makes this visible:

In [None]:
print(tokenizer.convert_ids_to_tokens(inputs['input_ids']))

```python
['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
[      0,      0,    0,     0,       0,          0,   0,       0,      1,    1,     1,        1,     1,   1,       1]
```
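
As a rough illustration of how these segment IDs line up with the tokens, the pure-Python sketch below rebuilds the same layout by hand. The helper `build_pair_inputs` is hypothetical (it is not part of the Transformers API), and it assumes the sentences are already split into tokens; the real tokeniser does all of this internally.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Hypothetical helper: mimic the [CLS] A [SEP] B [SEP] layout shown above."""
    first = ["[CLS]"] + tokens_a + ["[SEP]"]   # segment 0, including its [SEP]
    second = tokens_b + ["[SEP]"]              # segment 1, including the final [SEP]
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair_inputs(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
)

# Print each token next to the segment it belongs to.
for token, type_id in zip(tokens, token_type_ids):
    print(f"{token:>10} -> {type_id}")
```

Everything up to and including the first `[SEP]` gets type ID 0, and the second sentence plus the final `[SEP]` gets type ID 1.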

But as we saw earlier, the training dataset alone has 3668 examples. Even though that is not very big, passing the whole dataset directly to the tokeniser is not good practice, because everything has to be held in memory at once, which can easily cause *RAM Out-Of-Memory* issues on larger datasets. Also, passing only the sequences to the tokeniser returns only `input_ids`, `attention_mask`, and `token_type_ids`, so we lose other important information from our original dataset, such as `label`. Therefore, we will use the built-in `map()` method from `datasets`. The `map()` method works by applying a function to each element of the dataset, so let's define a function that tokenises our inputs:

In [None]:
def tokeniser_function(data):
    return tokenizer(data['sentence1'], data["sentence2"], truncation=True)

This function takes a dictionary (like an item of our dataset) and returns a new dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`. It also works when each key holds a list of sentences, since the tokeniser accepts lists of sentence pairs. That lets us use the option `batched=True` in our call to `map()`, which passes multiple elements of the dataset to `tokeniser_function()` at once instead of one element at a time. This greatly speeds up tokenisation, because the tokeniser is only really fast when it is given lots of inputs at once.
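
To make the batched behaviour concrete, here is a minimal pure-Python sketch of what a batched map does conceptually. `toy_map` and `count_words` are hypothetical stand-ins written for illustration, not the actual `datasets` implementation; the point is only that the function receives a dictionary of lists and returns a dictionary of lists.

```python
def toy_map(columns, function, batch_size=2):
    """Hypothetical stand-in for a batched map over a dict of columns."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # Each call receives a dict whose values are lists, not single items.
        batch = {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Original columns are kept; returned keys are added (or overwritten).
    return {**columns, **new_columns}


def count_words(batch):
    # Works on a whole batch: one output entry per input row.
    return {"num_words": [len(s.split()) for s in batch["sentence1"]]}


data = {
    "sentence1": ["a b c", "d e", "f", "g h i j", "k l"],
    "label": [1, 0, 1, 0, 1],
}
result = toy_map(data, count_words, batch_size=2)
print(result["num_words"])
print(sorted(result))
```

The existing `label` column survives because the function's output is merged into the original columns rather than replacing them.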

Furthermore, we can also see that we've left the `padding` parameter out in our tokenisation function for now. This is because padding all the samples to the maximum length is not efficient: it’s better to pad the samples when we’re building a batch, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths!
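
The saving can be estimated with a few lines of plain Python. The sequence lengths below are made-up illustrative values, not measurements from MRPC, and `padding_tokens` is a hypothetical helper.

```python
def padding_tokens(lengths, target):
    """Number of pad tokens needed to bring every sequence up to `target`."""
    return sum(target - n for n in lengths)


# Made-up sequence lengths, for illustration only.
lengths = [50, 59, 47, 67, 12, 15, 9, 14]
batch_size = 4

# Strategy 1: pad everything to the longest sequence in the whole dataset.
static_cost = padding_tokens(lengths, max(lengths))

# Strategy 2: pad each batch only to the longest sequence in that batch.
dynamic_cost = 0
for start in range(0, len(lengths), batch_size):
    batch_lengths = lengths[start:start + batch_size]
    dynamic_cost += padding_tokens(batch_lengths, max(batch_lengths))

print(f"pad tokens with dataset-wide padding: {static_cost}")
print(f"pad tokens with per-batch padding:    {dynamic_cost}")
```

With these toy numbers the second batch contains only short sequences, so padding it to its own maximum avoids most of the wasted tokens.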

In [None]:
tokenizer_datasets = raw_datasets.map(tokeniser_function, batched=True)
tokenizer_datasets

As we can see, we haven't lost any columns from our data, and new columns have been added. (Note that we could also have changed existing fields if our preprocessing function had returned a new value for an existing key in the dataset to which we applied `map()`.)

The processing was also really quick. You can make it even faster by passing the `num_proc` argument to `map()`, which enables multiprocessing, but since the fast tokeniser (backed by the Tokenizers library) already uses multiple threads, there is little to gain from it here.

# Dynamic Padding

What we now need to do is pad all the examples to the length of the longest element when we batch them together, before passing the input to the model. This technique is referred to as **dynamic padding**.

Even though this way of padding speeds things up when utilising a CPU or GPU, that is not always the case with an accelerator such as a TPU, because TPUs prefer fixed shapes, even when that requires extra padding.
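
For the fixed-shape case, every example is padded (or truncated) to the same length regardless of the batch it ends up in. The sketch below illustrates the idea in plain Python; `pad_to_fixed_length` is a hypothetical helper, and the pad token ID of 0 and right-side padding are assumptions. With a real tokeniser the same effect is usually obtained by passing `padding="max_length"` together with `max_length`.

```python
def pad_to_fixed_length(input_ids, max_length, pad_token_id=0):
    """Hypothetical helper: truncate or right-pad one sequence to max_length."""
    # Naive truncation: a real tokeniser would also keep the special tokens.
    ids = input_ids[:max_length]
    num_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * num_pad  # 0 marks padding
    return ids + [pad_token_id] * num_pad, attention_mask


short_ids, short_mask = pad_to_fixed_length([101, 2023, 102], max_length=6)
long_ids, long_mask = pad_to_fixed_length(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Both outputs have exactly six entries, which is what gives every batch the same shape.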

!["Dynamic Padding"](data/chapter_3/dynamic_padding.png "Dynamic Padding")

Here we will use a function that is responsible for putting together the samples inside a batch, known as a **collate function**. The *collate function* will also apply the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the Transformers library provides us with such a function via `DataCollatorWithPadding`. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Let's try it on a subset of our dataset and treat it as a single batch:

In [None]:
samples = tokenizer_datasets["train"][:10]
# filter out the unnecessary columns
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

Here we can see that the sequences inside this batch have different lengths. To pad them all to the maximum length in this particular batch (which is 67), we simply need to pass the samples to the `data_collator`:

In [None]:
batch = data_collator(samples)
pprint({k: v.shape for k, v in batch.items()})
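
To show roughly what a padding collate function does, here is a minimal pure-Python sketch. `pad_collate` is a hypothetical function written for illustration, not the `DataCollatorWithPadding` implementation: it assumes right-side padding, a pad token ID of 0, and plain lists instead of tensors, and the token IDs in the example are arbitrary.

```python
def pad_collate(features, pad_token_id=0):
    """Hypothetical collate function: pad every sample to the batch maximum."""
    max_len = max(len(f["input_ids"]) for f in features)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        num_pad = max_len - len(f["input_ids"])
        padded["input_ids"].append(f["input_ids"] + [pad_token_id] * num_pad)
        # 1 for real tokens, 0 for padding positions.
        padded["attention_mask"].append([1] * len(f["input_ids"]) + [0] * num_pad)
        # The label is stored under "labels" in this sketch.
        padded["labels"].append(f["label"])
    return padded


# Arbitrary token IDs, for illustration only.
features = [
    {"input_ids": [101, 2023, 102], "label": 1},
    {"input_ids": [101, 2023, 2003, 1037, 102], "label": 0},
]
padded_batch = pad_collate(features)

for name, values in padded_batch.items():
    print(name, values)
```

The shorter sample is extended to five entries, the length of the longest sample in this batch, and its attention mask marks the two added positions with zeros.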

### Example
Let's try to preprocess the **GLUE SST-2** dataset:

!["The GLUE SST-2 Dataset"](data/chapter_3/glue_sst2.png "The GLUE SST-2 Dataset")

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset(
    path='glue',
    name='sst2'
)
raw_datasets

As we can see, there is only one sentence per example, so the `tokenisation_function()` only needs to handle a single sentence per example.

In [None]:
from transformers import AutoTokenizer

checkpoint = 'bert-base-uncased'

tokeniser = AutoTokenizer.from_pretrained(checkpoint)

def tokenisation_function(data):
    return tokeniser(data['sentence'], truncation=True)

tokenised_datasets = raw_datasets.map(tokenisation_function, batched=True)
tokenised_datasets

Now let's clean the data by removing the unnecessary columns, renaming the *label* column to *labels*, and finally setting the output format to PyTorch tensors.

In [None]:
tokenised_datasets = tokenised_datasets.remove_columns(
    column_names=['idx', 'sentence']
)
tokenised_datasets = tokenised_datasets.rename_column(
    original_column_name='label',
    new_column_name='labels'
)
tokenised_datasets = tokenised_datasets.with_format("torch")

Now, to apply padding with the `data_collator` on a per-batch basis, we will use the `DataLoader` class from the `torch.utils.data` module, which creates batches from the given data and passes each one to the **collate function**.

In [None]:
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokeniser)
train_dataloader = DataLoader(
    tokenised_datasets['train'],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator
)

for step, batch in enumerate(train_dataloader):
    print(batch["input_ids"].shape)
    if step > 5:
        break