# Lab L3X: BERT for Natural Language Inference

One of the main selling points of pre-trained language models is that they can be applied to a wide spectrum of different tasks in natural language processing. In this lab you will test this by fine-tuning a pre-trained BERT model on a benchmark task in natural language inference.

To do this lab, you will need a computer with GPU support.

In [None]:
import torch

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

## The data

The data for this lab is the [SNLI corpus](https://nlp.stanford.edu/projects/snli/), a collection of 570k human-written English sentence pairs, based on image captions and manually labeled with the labels *Entailment*, *Contradiction*, and *Neutral*. Consider the following sentence pair as an example:

* Sentence 1: A soccer game with multiple males playing.
* Sentence 2: Some men are playing a sport.

This pair is labeled with *Entailment*, because sentence&nbsp;2 is logically entailed (implied) by sentence&nbsp;1: if sentence&nbsp;1 is true, then sentence&nbsp;2 is true, too. The following sentence pair, on the other hand, is labeled with *Contradiction*, because the two sentences cannot both be true at the same time.

* Sentence 1: A black race car starts up in front of a crowd of people.
* Sentence 2: A man is driving down a lonely road.

For detailed information about the corpus and how it was constructed, refer to [Bowman et al. (2015)](https://www.aclweb.org/anthology/D15-1075/).

We provide a custom [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) class for this lab:

In [None]:
from torch.utils.data import Dataset

class SNLIDataset(Dataset):

    def __init__(self, filename, max_size=None):
        super().__init__()
        self.xs = []
        self.ys = []
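        with open(filename, encoding='utf-8') as source:
            for i, line in enumerate(source):
                if max_size and i >= max_size:
                    break
                # Each line holds three tab-separated fields: sentence 1, sentence 2, gold label
                sentence1, sentence2, gold_label = line.rstrip().split('\t')
                self.xs.append((sentence1, sentence2))
                # Map the label name to its integer index (0-2)
                self.ys.append(['contradiction', 'entailment', 'neutral'].index(gold_label))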

    def __getitem__(self, idx):
        return self.xs[idx], self.ys[idx]

    def __len__(self):
        return len(self.xs)

We load the training portion and the test portion of the dataset. For starters, we only load the first 1k sentence pairs from the training data. You will later need to increase the maximal size.

In [None]:
train_dataset = SNLIDataset('snli_1.0_train_preprocessed.txt', max_size=1000)
test_dataset = SNLIDataset('snli_1.0_test_preprocessed.txt')

The cell below shows an example from the training data. The labels *Contradiction*, *Entailment*, and *Neutral* are mapped to the integers 0–2:

In [None]:
train_dataset[120]
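
To read an example together with its label name, the integer can be mapped back through the same label order that `SNLIDataset` uses. The sketch below is only an illustration: `LABELS` and `describe` are hypothetical helpers that are not part of the lab code, and the item is written by hand in the same shape as `train_dataset[i]`.

```python
# Hypothetical helper: same label order as in SNLIDataset.__init__
LABELS = ['contradiction', 'entailment', 'neutral']

def describe(example):
    """Format one ((sentence1, sentence2), label) item for printing."""
    (sentence1, sentence2), label = example
    return f'{LABELS[label]}: {sentence1} => {sentence2}'

# A hand-made item in the same shape as train_dataset[i] (lowercased like the lab data)
example = (('a soccer game with multiple males playing.',
            'some men are playing a sport.'), 1)
print(describe(example))
```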

## The problem

Your task in this lab is to fine-tune a pre-trained BERT model on the SNLI training data, and evaluate the performance of the fine-tuned model on the test data. Pre-trained BERT models and standard architectures are available in the [Hugging Face Transformers library](https://huggingface.co/transformers/model_doc/bert.html). You will need to read the relevant parts of the documentation of that library.

In [None]:
# Uncomment the next line to install the transformers library:
# !pip install transformers

You will need two classes from the Transformers library:

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification

A `BertTokenizer` is in charge of preparing the inputs to a BERT model. This involves tokenising the text into word pieces and encoding these word pieces as integers from the vocabulary. The `BertForSequenceClassification` architecture extends the basic BERT architecture with a linear layer on top of the pooled output, which is computed from the representation of the special `[CLS]` token. You should instantiate both classes with the pre-trained `bert-base-uncased` model. (We have preprocessed the data for this lab by lowercasing.)
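
For sentence pairs, BERT expects both sentences in a single sequence of the form `[CLS] sentence 1 [SEP] sentence 2 [SEP]`, together with segment ids (called `token_type_ids` in the Transformers library) that mark which sentence each position belongs to. The toy sketch below shows only this layout. It splits on whitespace instead of using word pieces, does no integer encoding, and `toy_encode_pair` is a hypothetical name, not a library function.

```python
def toy_encode_pair(sentence1, sentence2):
    """Lay out a sentence pair the way BERT expects, using whitespace tokens.

    Assumption: whitespace splitting stands in for the real word-piece
    tokeniser, and no integer encoding is performed.
    """
    tokens1 = sentence1.split()
    tokens2 = sentence2.split()
    tokens = ['[CLS]'] + tokens1 + ['[SEP]'] + tokens2 + ['[SEP]']
    # Segment 0 covers [CLS], sentence 1 and the first [SEP]; segment 1 covers the rest
    token_type_ids = [0] * (len(tokens1) + 2) + [1] * (len(tokens2) + 1)
    return tokens, token_type_ids

tokens, token_type_ids = toy_encode_pair(
    'a soccer game with multiple males playing.',
    'some men are playing a sport.')
for token, segment in zip(tokens, token_type_ids):
    print(segment, token)
```

With the real tokeniser, calling `tokenizer(sentence1, sentence2)` should return a dictionary-like object with the fields `input_ids`, `token_type_ids`, and `attention_mask`; check the library documentation for the available options.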

Here is the basic recipe for this lab:

1. Use the `BertTokenizer` to convert the data into a tensorised form.
2. Train a `BertForSequenceClassification` model on the tensorised data.
3. Evaluate the trained model by computing its accuracy on the test data.
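
Step 3 boils down to comparing the highest-scoring class for each sentence pair with its gold label. Here is a minimal sketch in plain Python, assuming that the model outputs have already been converted to lists of per-class scores. The function name `accuracy` and the example numbers are illustrative only.

```python
def accuracy(scores, gold):
    """Fraction of items whose highest-scoring class equals the gold label.

    scores: one list of per-class scores (e.g. logits) per item
    gold:   one integer label per item
    """
    if len(scores) != len(gold):
        raise ValueError('scores and gold must have the same length')
    # Index of the largest score in each row (argmax)
    predicted = [max(range(len(row)), key=row.__getitem__) for row in scores]
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Made-up scores for three items over the classes 0-2
scores = [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05], [0.2, 0.3, 0.5]]
gold = [1, 0, 1]
print(accuracy(scores, gold))
```

In a PyTorch evaluation loop, the same argmax would typically be taken with `logits.argmax(dim=-1)` on the model output, inside a `torch.no_grad()` block.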

Submit your final notebook. Include a short (ca. 150&nbsp;words) report about your experience. Compare your results to those of [Bowman et al. (2015)](https://www.aclweb.org/anthology/D15-1075/).

**⚠️ Your submitted notebook must contain output demonstrating a higher accuracy than the best model of Bowman et al. (2015).**

#### 💡 Tips

* You can simplify things by using a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) with a suitable `collate_fn`.
* Train for 1&nbsp;epoch using a batch size of 32 and a learning rate of 1e-5.
* You will need to train on approximately 40k instances to reach the performance goal.
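
One way to follow the first tip is a `collate_fn` that tokenises a whole batch at once, so that padding is applied per batch. The sketch below is one possible shape, not the reference solution. `make_collate_fn` is a hypothetical helper, and the tokenizer arguments (`padding=True`, `truncation=True`, `return_tensors='pt'`) are assumptions that you should check against the Transformers documentation.

```python
def make_collate_fn(tokenizer):
    """Build a collate_fn for batches of ((sentence1, sentence2), label) items."""
    def collate(batch):
        pairs, labels = zip(*batch)
        sentences1, sentences2 = zip(*pairs)
        # Tokenise both sides together; padding is to the longest pair in the batch
        encoded = tokenizer(list(sentences1), list(sentences2),
                            padding=True, truncation=True, return_tensors='pt')
        # Labels are returned as a plain list; convert with torch.tensor(...) when training
        return encoded, list(labels)
    return collate

# Intended use (not executed here; needs torch and transformers):
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
#                     collate_fn=make_collate_fn(tokenizer))
```

With 40k training instances and a batch size of 32, one epoch corresponds to 40,000 / 32 = 1,250 parameter updates.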

In [None]:
# TODO: Your code here

*TODO: Your report here*