# Transformer-based Natural Language Processing
## Introduction to 🤗 Transformers
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/texttechnologylab/WiSe23-TFb-NLP/blob/master/assignment.ipynb)

### Installing necessary packages (i.e. if on Colab)

In [52]:
# Colab:
!pip install torch datasets tokenizers transformers

# Other:
# % pip install torch datasets tokenizers transformers



### Premise

This notebook will guide you through the process of fine-tuning a transformer model using the [🤗 Transformers](https://huggingface.co/docs/transformers/index) library.

First, we need to select a task and a suitable dataset. Here, we will use the [Textual Entailment or Natural Language Inference](https://cims.nyu.edu/~sbowman/multinli/) task as an example. A suitable dataset can be found in the [GLUE repository on the 🤗 Hub](https://huggingface.co/datasets/glue). The whole MNLI dataset is far too big for this exercise, so we will only use a slice of it.

We can load a slice of the MNLI subset of the GLUE dataset using [🤗 Datasets](https://huggingface.co/docs/datasets/index) as follows:

In [53]:
from datasets import load_dataset

mnli_dataset = load_dataset("glue", "mnli", split={"train": "train[:10%]", "validation": "validation_matched", "test": "test_matched"})
print(mnli_dataset)

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 39270
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9815
    })
    test: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9796
    })
})


As we can see above, the dataset is already split into train, validation (development) and test splits.
Each row contains four fields, but we only need to focus on the premise, the hypothesis and the label.

The textual entailment task requires us to recognize, given two text fragments, whether the meaning of one text is entailed by (*can be inferred from*) the other text.

In this example, we will use a BERT-family model. With BERT, we formulate the entailment task as a simple classification task by concatenating the premise and hypothesis and training a classifier on the representation of the first token (the `[CLS]` token) of the input sequence:

```
"[CLS] This is the premise, i.e. a text that means something. [SEP] This is the hypothesis, i.e. what we may be able to infer [SEP]"
```

But let's first take a look at the dataset.

In [54]:
print(mnli_dataset['train'][:2])
print(mnli_dataset['validation'][:2])
print(mnli_dataset['test'][:2])

{'premise': ['Conceptually cream skimming has two basic dimensions - product and geography.', 'you know during the season and i guess at at your level uh you lose them to the next level if if they decide to recall the the parent team the Braves decide to call to recall a guy from triple A then a double A guy goes up to replace him and a single A guy goes up to replace him'], 'hypothesis': ['Product and geography are what make cream skimming work. ', 'You lose the things to the following level if the people recall.'], 'label': [1, 0], 'idx': [0, 1]}
{'premise': ['The new rights are nice enough', 'This site includes a list of all award winners and a searchable database of Government Executive articles.'], 'hypothesis': ['Everyone really likes the newest benefits ', 'The Government Executive articles housed on the website are not able to be searched.'], 'label': [1, 2], 'idx': [0, 1]}
{'premise': ['Hierbas, ans seco, ans dulce, and frigola are just a few names worth keeping a look-out for

As we see above, the `test_matched` split contains **unlabeled** samples, but we can ignore that for now.
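To make the integer labels in the printed samples easier to read, the following minimal sketch maps them to class names. It assumes the label order used by the GLUE `mnli` configuration on the 🤗 Hub (0 = entailment, 1 = neutral, 2 = contradiction) and that unlabeled test rows carry the placeholder `-1`; the helper `label_name` is a hypothetical name introduced only for this illustration.

```python
# Assumed label order of the GLUE "mnli" configuration; with the dataset loaded,
# it can be checked via mnli_dataset["train"].features["label"].names
MNLI_LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}


def label_name(label_id: int) -> str:
    """Return a readable class name; anything outside 0..2 counts as unlabeled."""
    return MNLI_LABELS.get(label_id, "unlabeled")


# First training sample, copied from the output above
sample = {
    "premise": "Conceptually cream skimming has two basic dimensions - product and geography.",
    "hypothesis": "Product and geography are what make cream skimming work. ",
    "label": 1,
}

print(label_name(sample["label"]))
print(label_name(-1))  # placeholder label assumed for test_matched rows
```

Under this mapping, the first training sample is a *neutral* pair: the hypothesis is neither implied nor contradicted by the premise.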

Let's construct the sentences as we outlined above.

*Note:* The `[CLS]` and the final `[SEP]` token will be added by BERT's tokenizer, so we omit them here.

In [None]:
def tokenize_mnli_data(examples):
    # Use the tokenizer to process the premise and hypothesis text pairs
    return tokenizer(
        examples['premise'],
        examples['hypothesis'],
        truncation=True,
        padding=True # Add padding for batching efficiency
    )

In [58]:
prepared_dataset = mnli_dataset.map(
    # Join premise and hypothesis with an explicit [SEP] and keep the label unchanged
    lambda sample: {'input': f"{sample['premise']} [SEP] {sample['hypothesis']}",
                    'label': sample['label']},
    remove_columns=['premise', 'hypothesis', 'idx']
)
print(prepared_dataset['train'][:2])

{'label': [1, 0], 'input': ['Conceptually cream skimming has two basic dimensions - product and geography. [SEP] Product and geography are what make cream skimming work. ', 'you know during the season and i guess at at your level uh you lose them to the next level if if they decide to recall the the parent team the Braves decide to call to recall a guy from triple A then a double A guy goes up to replace him and a single A guy goes up to replace him [SEP] You lose the things to the following level if the people recall.']}


In [59]:
print(prepared_dataset["train"][0])

{'label': 1, 'input': 'Conceptually cream skimming has two basic dimensions - product and geography. [SEP] Product and geography are what make cream skimming work. '}


*Hint:* It is also possible to use the BERT tokenizer directly to construct the samples as shown above, skipping this preparation step entirely!

### Loading Pre-Trained Models

Now we need to load a pre-trained [BERT](https://github.com/google-research/bert) model. You should use a subclass of [AutoModel](https://huggingface.co/docs/transformers/main/en/autoclass_tutorial).

Viable pre-trained BERT models include:

<table>
<tr>
    <th>Model</th><th>Reference</th>
</tr>
<tr>
    <td><a href="https://huggingface.co/bert-base-uncased">bert-base-uncased</a></td>
    <td rowspan="6"><a href="https://aclanthology.org/N19-1423/">Devlin et al., 2019</a></td>
</tr>
<tr><td><a href="https://huggingface.co/bert-base-cased">bert-base-cased</a></td></tr>
<tr><td><a href="https://huggingface.co/bert-large-uncased">bert-large-uncased</a></td></tr>
<tr><td><a href="https://huggingface.co/bert-large-cased">bert-large-cased</a></td></tr>
<tr><td><a href="https://huggingface.co/bert-large-uncased-whole-word-masking">bert-large-uncased-whole-word-masking</a></td></tr>
<tr><td><a href="https://huggingface.co/bert-large-cased-whole-word-masking">bert-large-cased-whole-word-masking</a></td></tr>
<tr><td colspan="2"></td></tr>
<tr>
    <td><a href="https://huggingface.co/prajjwal1/bert-tiny">prajjwal1/bert-tiny</a></td>
    <td rowspan="4"><a href="https://arxiv.org/abs/1908.08962">Turc et al., 2019</a>; <a href="https://arxiv.org/abs/2110.01518">Bhargava et al., 2021</a></td>
</tr>
<tr><td><a href="https://huggingface.co/prajjwal1/bert-mini">prajjwal1/bert-mini</a></td></tr>
<tr><td><a href="https://huggingface.co/prajjwal1/bert-small">prajjwal1/bert-small</a></td></tr>
<tr><td><a href="https://huggingface.co/prajjwal1/bert-medium">prajjwal1/bert-medium</a></td></tr>
</table>


#### Load and instantiate a model for the textual entailment task

In [56]:
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

# MNLI has three classes (entailment, neutral, contradiction); the default config assumes two
config = AutoConfig.from_pretrained("google-bert/bert-base-cased", num_labels=3)
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", config=config)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now we can use the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class for easy training. You can follow the tutorial from [the official documentation](https://huggingface.co/docs/transformers/quicktour#trainer-a-pytorch-optimized-training-loop).

#### Write the training procedure

In [57]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    train_dataset=prepared_dataset['train'],
    eval_dataset=prepared_dataset['validation'],
    tokenizer=tokenizer,

    # TODO
)

# TODO

trainer.train()

# The test split is unlabeled, so we evaluate on the labeled validation split instead
evaluation_result = trainer.evaluate(prepared_dataset["validation"])
print(evaluation_result)
# TODO: evaluation

  trainer = Trainer(


ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label']
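
The `ValueError` above indicates that the `Trainer` expects tokenized samples containing `input_ids`, so the dataset has to be passed through the tokenizer first (for example by mapping `tokenize_mnli_data` over `mnli_dataset`). For the evaluation TODO, one possible approach is an accuracy metric passed to the `Trainer` via its `compute_metrics` argument. The sketch below is an assumption about how this could look, not the intended solution; the toy logits and labels are made up for illustration.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy from (logits, labels), the pair the Trainer hands to compute_metrics."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per sample
    return {"accuracy": float((predictions == labels).mean())}


# Made-up logits for three samples and three classes
toy_logits = np.array([[2.0, 0.1, -1.0],
                       [0.2, 0.3, 1.5],
                       [0.0, 1.0, 0.5]])
toy_labels = np.array([0, 2, 0])

print(compute_metrics((toy_logits, toy_labels)))
```

With a tokenized dataset in place, the function could be supplied as `Trainer(..., compute_metrics=compute_metrics)`, so that `trainer.evaluate()` reports accuracy alongside the loss.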

### Custom Training
While the `Trainer` class is very convenient, a regular training loop can be easier to adapt if you need to run custom procedures during training.

We do need to do our own tokenization, though.

In [None]:
def encode_pt(batch: dict):
    return tokenizer(
        batch['input'],
        add_special_tokens=True,
        return_token_type_ids=False,
        return_attention_mask=False,
        padding=False,
        truncation=True,
    )


pt_dataset = prepared_dataset.map(encode_pt)
print(pt_dataset['train'][:2])

However, in a manual training loop, we will want to make use of PyTorch's DataLoaders, which require some extra care to collate batches with samples of different lengths.

#### Implement `custom_collate`:
- Pad and stack the `input_ids` in a tensor.
- Stack the labels in a tensor of type `long`.

In [None]:
import torch
from torch.nn.utils.rnn import pad_sequence


def custom_collate(batch: list[dict]) -> tuple[torch.Tensor, torch.Tensor]:
    input_ids = ...
    label = ...
    return input_ids, label
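
As a reference point for the two steps listed above, here is one possible sketch of such a collate function. It is named `pad_collate` (a hypothetical name, to keep it apart from the `custom_collate` stub) and assumes that each sample is a dict with an `input_ids` list and an integer `label`, and that the padding id is 0, which matches BERT's `[PAD]` token. The toy batch is made up for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_collate(batch: list[dict], pad_token_id: int = 0) -> tuple[torch.Tensor, torch.Tensor]:
    # Pad every sequence to the longest one in the batch and stack them row-wise
    input_ids = pad_sequence(
        [torch.tensor(sample["input_ids"], dtype=torch.long) for sample in batch],
        batch_first=True,
        padding_value=pad_token_id,  # assumption: 0 is the [PAD] id
    )
    # Stack the labels into a tensor of type long
    labels = torch.tensor([sample["label"] for sample in batch], dtype=torch.long)
    return input_ids, labels


# Made-up batch with two sequences of different lengths
toy_batch = [
    {"input_ids": [101, 7, 8, 102], "label": 1},
    {"input_ids": [101, 9, 102], "label": 0},
]
ids, labels = pad_collate(toy_batch)
print(ids)
print(labels)
```

Because `encode_pt` does not return attention masks, a mask can be derived later from the padded tensor, for example as `ids != pad_token_id`.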

#### Write the training and evaluation loops

In [None]:
from tqdm.notebook import tqdm, trange
from torch.utils.data import DataLoader
from torch.optim import *  # TODO
from torch.nn import *  # TODO

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

criterion = ...  # TODO
optimizer = ...  # TODO
num_epochs = ...  # TODO
batch_size = ...  # TODO

train_dataloader = DataLoader(pt_dataset['train'], batch_size=batch_size, shuffle=True, collate_fn=custom_collate)
dev_dataloader = DataLoader(pt_dataset['validation'], batch_size=batch_size, shuffle=False, collate_fn=custom_collate)

model.to(device)
for epoch in trange(num_epochs, position=0):
    model.train()
    for batch in tqdm(train_dataloader, position=1, leave=False):
        ...  # TODO

    model.eval()
    for batch in tqdm(dev_dataloader, position=1, leave=False):
        ...  # TODO
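
The loop skeleton above could be filled in along the following lines. This is a sketch under stated assumptions, not the reference solution: it uses `CrossEntropyLoss` and `AdamW` as one common choice, derives the attention mask from the padding id 0, and replaces BERT with a small hypothetical `ToyClassifier` on made-up data so that the example runs offline in a few seconds. All names introduced here (`ToyClassifier`, `collate`, `run_training`, `toy_data`) are illustrative.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.optim import AdamW
from torch.utils.data import DataLoader

PAD_ID = 0  # assumption: BERT's [PAD] token has id 0


class ToyClassifier(nn.Module):
    """Hypothetical stand-in for the BERT classifier (keeps the sketch offline)."""

    def __init__(self, vocab_size=32, hidden=16, num_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden, padding_idx=PAD_ID)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        mask = attention_mask.unsqueeze(-1).float()
        # Mean-pool the embeddings of non-padding tokens, then classify
        pooled = (self.embed(input_ids) * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)


def collate(batch):
    ids = pad_sequence([torch.tensor(s["input_ids"]) for s in batch],
                       batch_first=True, padding_value=PAD_ID)
    return ids, torch.tensor([s["label"] for s in batch], dtype=torch.long)


def run_training(model, train_loader, dev_loader, num_epochs=2, lr=1e-3):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = AdamW(model.parameters(), lr=lr)
    dev_accuracies = []
    for _ in range(num_epochs):
        model.train()
        for input_ids, labels in train_loader:
            input_ids, labels = input_ids.to(device), labels.to(device)
            attention_mask = (input_ids != PAD_ID).long()
            optimizer.zero_grad()
            # With a Hugging Face model, the logits would be read from model(...).logits
            logits = model(input_ids, attention_mask)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()

        model.eval()
        correct, total = 0, 0
        with torch.no_grad():  # no gradients needed during evaluation
            for input_ids, labels in dev_loader:
                input_ids, labels = input_ids.to(device), labels.to(device)
                attention_mask = (input_ids != PAD_ID).long()
                predictions = model(input_ids, attention_mask).argmax(dim=-1)
                correct += (predictions == labels).sum().item()
                total += labels.numel()
        dev_accuracies.append(correct / total)
    return dev_accuracies


torch.manual_seed(0)
# Made-up samples with sequence lengths 1 to 5 and labels 0 to 2
toy_data = [{"input_ids": list(range(1, 2 + i % 5)), "label": i % 3} for i in range(24)]
train_loader = DataLoader(toy_data[:18], batch_size=6, shuffle=True, collate_fn=collate)
dev_loader = DataLoader(toy_data[18:], batch_size=6, shuffle=False, collate_fn=collate)
dev_accuracies = run_training(ToyClassifier(), train_loader, dev_loader)
print(dev_accuracies)
```

For the actual assignment, the toy model and data would be swapped for the BERT model and the `pt_dataset` loaders defined above; the learning rate, batch size and number of epochs shown here are placeholders rather than tuned values.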