# A Visual Guide to Using BERT for the First Time



<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-sentence-classification.png" />

In this notebook, we will use a pre-trained deep learning model to process some text. We will then use the output of that model to classify the text. The text is a list of sentences from film reviews, and we will classify each sentence as speaking either "positively" or "negatively" about its subject.

## Models: Sentence Sentiment Classification
Our goal is to create a model that takes a sentence (just like the ones in our dataset) and produces either 1 (indicating the sentence carries a positive sentiment) or a 0 (indicating the sentence carries a negative sentiment). We can think of it as looking like this:

<img src="https://jalammar.github.io/images/distilBERT/sentiment-classifier-1.png" />

Under the hood, the model is actually made up of two models.

* DistilBERT processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
* The next model, a basic Logistic Regression model from scikit-learn, will take in the result of DistilBERT’s processing and classify the sentence as either positive or negative (1 or 0, respectively).

The data we pass between the two models is a vector of size 768. We can think of this vector as an embedding for the sentence that we can use for classification.


<img src="https://jalammar.github.io/images/distilBERT/distilbert-bert-sentiment-classifier.png" />

## Dataset
The dataset we will use in this example is [SST2](https://nlp.stanford.edu/sentiment/index.html), which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):


<table class="features-table">
  <tr>
    <th class="mdc-text-light-green-600">
    sentence
    </th>
    <th class="mdc-text-purple-600">
    label
    </th>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      apparently reassembled from the cutting room floor of any given daytime soap
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      they presume their audience won't sit still for a sociology lesson
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      this is a visually stunning rumination on love , memory , history and the war between art and commerce
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      jonathan parker 's bartleby should have been the be all end all of the modern office anomie films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
</table>

## Installing the transformers library
Let's start by installing the Hugging Face transformers library so we can load our deep learning NLP model.

In [None]:
!pip install transformers

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

import torch
import torch.nn as nn
import torch.nn.functional as F

import transformers

In [None]:
%load_ext tensorboard

In [None]:
import datetime

def get_datetime():
    return datetime.datetime.now().isoformat(sep='_', timespec='milliseconds').replace(':', '-')

## Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [None]:
data_path = Path('SST2')

if not data_path.exists():
    data_path.mkdir(parents=True)
    for filename in ['train.tsv', 'dev.tsv', 'test.tsv']:
        !wget -q https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/{filename} -O {data_path / filename}
        assert (data_path / filename).exists()

df_train = pd.read_csv(data_path / 'train.tsv', delimiter='\t', header=None, names=['sentence', 'sentiment'])
df_valid = pd.read_csv(data_path / 'dev.tsv', delimiter='\t', header=None, names=['sentence', 'sentiment'])
df_test = pd.read_csv(data_path / 'test.tsv', delimiter='\t', header=None, names=['sentence', 'sentiment'])
df_train

For performance reasons, we'll only use the first 2,000 sentences from the training set.

In [None]:
batch_1 = df_train[:2000]

We can ask pandas how many sentences are labeled as "positive" (value 1) and how many are labeled as "negative" (value 0).

In [None]:
batch_1['sentiment'].value_counts()

## Loading the Pre-trained BERT model
Let's now load a pre-trained model: DistilBERT by default, or BERT if you uncomment the corresponding lines.

In [None]:
# For DistilBERT:
model_class = transformers.DistilBertModel
tokenizer_class = transformers.DistilBertTokenizer
pretrained_model_name = 'distilbert-base-uncased'

## Want BERT instead of distilBERT? Uncomment the following lines:
# model_class = transformers.BertModel
# tokenizer_class = transformers.BertTokenizer
# pretrained_model_name = 'bert-base-uncased'

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_model_name)
model = model_class.from_pretrained(pretrained_model_name)

In [None]:
!nvidia-smi

In [None]:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model = model.to(device)

Right now, the variable `model` holds a pretrained DistilBERT model, a version of BERT that is smaller, much faster, and requires a lot less memory.

## Model #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with.

In [None]:
tokenized = batch_1['sentence'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
tokenized

In [None]:
type(tokenized.iloc[0])

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />

### Padding
After tokenization, `tokenized` is a list of sentences (stored as a pandas Series), where each sentence is represented as a list of token ids. We want BERT to process our examples all at once (as one batch). It's just faster that way. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array, rather than a list of lists (of different lengths).
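
To make the idea concrete, here is a minimal sketch of one possible way to pad, run on made-up token-id lists rather than the real `tokenized` Series. The `toy_*` names and the ids are assumptions for illustration only.

```python
import numpy as np

# Toy token-id lists standing in for `tokenized` (the ids are made up for illustration).
toy_tokenized = [[101, 7592, 102], [101, 2023, 2003, 2307, 102], [101, 102]]

# Length of the longest sequence.
toy_max_len = max(len(ids) for ids in toy_tokenized)

# Right-pad every sequence with zeros up to toy_max_len, then stack into one 2-d array.
toy_padded = np.array([ids + [0] * (toy_max_len - len(ids)) for ids in toy_tokenized])

print(toy_padded.shape)  # (3, 5)
```

Every row now has the same length, so the whole batch fits in a single rectangular array.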

In [None]:
# Compute maximum number of tokens across all tokenized sentences...
max_len = <YOUR CODE>

# ... and use it to construct a single np.array with padding. Use 0 as the padding value.
padded = <YOUR CODE>

# NB: there is also https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html
# which is a more idiomatic way to do the same thing for torch.Tensor.

Our dataset is now in the `padded` variable. We can view its dimensions below:

In [None]:
padded.shape

### Masking
If we directly send `padded` to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what `attention_mask` is:

In [None]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

## Model #1: And Now, Deep Learning!
Now that we have our model and inputs ready, let's run our model!

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" />

Calling `model()` runs our sentences through BERT. The results of the processing are stored in `last_hidden_states`.

In [None]:
input_ids = torch.tensor(padded, device=device)
attention_mask = torch.tensor(attention_mask, device=device)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Let's slice only the part of the output that we need. That is the output corresponding to the first token of each sentence. For sentence classification, BERT adds a special token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

We'll save those in the `features` variable, as they'll serve as the features for our logistic regression model.
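
The slicing itself can be illustrated on a random array with the same layout as the model output, `(batch_size, sequence_length, hidden_size)`. This is only a sketch: the array below is random stand-in data, not real BERT output, and the `toy_*` names are hypothetical.

```python
import numpy as np

# Stand-in for the model output: (batch_size, sequence_length, hidden_size).
batch_size, seq_len, hidden_size = 4, 10, 768
toy_hidden_states = np.random.rand(batch_size, seq_len, hidden_size)

# Keep every sentence (:), only the first token position (0), and all hidden units (:).
toy_features = toy_hidden_states[:, 0, :]

print(toy_features.shape)  # (4, 768)
```

Each row of `toy_features` is the 768-dimensional vector at position 0 of the corresponding sentence, which is where the `[CLS]` token sits.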

In [None]:
last_hidden_states[0].shape

In [None]:
features = last_hidden_states[0][:,0,:].cpu().numpy()

The labels indicating which sentences are positive and which are negative now go into the `labels` variable.

In [None]:
labels = batch_1['sentiment']

## Model #2: Train/Test Split
Let's now split our dataset into a training set and a testing set (even though we're using 2,000 sentences from the SST2 training set).

In [None]:
features_train, features_test, labels_train, labels_test = train_test_split(features, labels)

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />

### [Bonus] Grid Search for Parameters
We can dive into logistic regression directly with the scikit-learn default parameters, but sometimes it's worth searching for the best value of the C parameter, which is the inverse of the regularization strength (smaller values mean stronger regularization).

In [None]:
# parameters = {'C': np.linspace(0.0001, 100, 20)}
# grid_search = GridSearchCV(LogisticRegression(), parameters)
# grid_search.fit(features_train, labels_train)

# print('best parameters: ', grid_search.best_params_)
# print('best scores: ', grid_search.best_score_)

We now train the LogisticRegression model. If you've chosen to do the grid search, you can plug the value of C into the model declaration (e.g. `LogisticRegression(C=5.2)`).
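
As a rough sketch of what fitting looks like, the example below trains on synthetic features rather than the real DistilBERT embeddings. The data, the `toy_*` names, and the choices `C=1.0` and `max_iter=1000` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the sentence embeddings and labels (not real SST2 data).
rng = np.random.default_rng(0)
toy_labels = rng.integers(0, 2, size=200)
# Shift the features of the positive class so the two classes are easy to separate.
toy_features = rng.normal(size=(200, 16)) + 3.0 * toy_labels[:, None]

# C is the inverse regularization strength; max_iter is raised to help convergence.
toy_clf = LogisticRegression(C=1.0, max_iter=1000)
toy_clf.fit(toy_features, toy_labels)

# Mean accuracy on the data the classifier was fitted on.
toy_score = toy_clf.score(toy_features, toy_labels)
print(toy_score)
```

The same `fit` / `score` pattern applies to `features_train` and `labels_train`; the score on this toy data says nothing about SST2.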

In [None]:
lr_clf = <YOUR CODE>

<YOUR CODE>

<img src="https://jalammar.github.io/images/distilBERT/bert-training-logistic-regression.png" />

## Evaluating Model #2
So how well does our model do in classifying sentences? One way to find out is to check its accuracy on the testing set:

In [None]:
lr_clf.score(features_test, labels_test)

How good is this score? What can we compare it against? Let's first look at a dummy classifier:

In [None]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, features_train, labels_train)
print("Dummy classifier score: %0.3f (± %0.2f)" % (scores.mean(), scores.std() * 2))
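
To see where such a baseline score comes from, consider a classifier that always predicts the most frequent label (in scikit-learn this corresponds to `strategy='most_frequent'`). Its accuracy is simply the share of that label in the data. A minimal sketch on made-up labels:

```python
from collections import Counter

# Made-up labels for illustration: 1 = positive, 0 = negative.
toy_labels = [1, 0, 1, 1, 0, 1, 0, 1]

# A majority-class baseline always predicts the most frequent label...
majority_label, majority_count = Counter(toy_labels).most_common(1)[0]

# ...so its accuracy equals that label's share of the data.
baseline_accuracy = majority_count / len(toy_labels)

print(majority_label, baseline_accuracy)  # 1 0.625
```

On a roughly balanced dataset this baseline lands near 0.5, which is the number a real classifier has to beat.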

So our model clearly does better than a dummy classifier. Can we do better with larger-scale fine-tuning?

# Larger-scale fine-tuning

Calling `tokenizer()` instead of `tokenizer.encode()` returns a dictionary with `input_ids` and `attention_mask`, so you don't have to compute them manually:

In [None]:
encodings_train = tokenizer(df_train['sentence'].tolist(), truncation=True, padding=True)
encodings_valid = tokenizer(df_valid['sentence'].tolist(), truncation=True, padding=True)
encodings_test = tokenizer(df_test['sentence'].tolist(), truncation=True, padding=True)

In [None]:
encodings_valid.keys()

In [None]:
print(encodings_valid['input_ids'][0])
print(encodings_valid['attention_mask'][0])

We are also going to save targets as Python lists for future use:

In [None]:
labels_train = df_train['sentiment'].tolist()
labels_valid = df_valid['sentiment'].tolist()
labels_test = df_test['sentiment'].tolist()

In [None]:
len(encodings_train['input_ids'])

In [None]:
labels_valid[:20]

Now our goal is to implement a `torch.utils.data.Dataset` subclass that will provide an interface to our dataset.

In [None]:
class SST2Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        """Return a dict whose keys are all keys from self.encodings plus 'labels',
        and the values are torch.Tensors."""
        item = {key: torch.tensor(value[idx]) for key, value in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

dataset_train = SST2Dataset(encodings_train, labels_train)
dataset_valid = SST2Dataset(encodings_valid, labels_valid)
dataset_test = SST2Dataset(encodings_test, labels_test)

batch_size = 16
dataloader_train = torch.utils.data.DataLoader(dataset_train, batch_size=batch_size, shuffle=True)
dataloader_valid = torch.utils.data.DataLoader(dataset_valid, batch_size=batch_size)
dataloader_test = torch.utils.data.DataLoader(dataset_test, batch_size=batch_size)

In [None]:
dataset_valid[0]

In [None]:
assert set(dataset_valid[0].keys()) == {'attention_mask', 'input_ids', 'labels'}
assert all(tensor.dtype == torch.int64 for tensor in dataset_valid[0].values())

Next, we are going to implement a wrapper model that will contain:

* an instance of DistilBERT (or BERT, if you prefer);
* a classifier head.

The classifier head will take embeddings for the `[CLS]` token as input, exactly as before (and hence its input will be 768-dimensional). We will experiment with the following architecture:

* Linear layer from 768 to 768 units
* ReLU
* Dropout with probability of zeroing equal to 0.2
* Linear layer from 768 to 1 unit (since we are doing binary classification)
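
The list above could be expressed, for example, as an `nn.Sequential`. This is a sketch under assumptions: the size 768 is hard-coded, the input is random data standing in for `[CLS]` embeddings, and the `toy_*` names are hypothetical.

```python
import torch
import torch.nn as nn

hidden_size = 768  # assumed embedding size, matching the text above

# Linear -> ReLU -> Dropout(0.2) -> Linear, in the order listed above.
toy_head = nn.Sequential(
    nn.Linear(hidden_size, hidden_size),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(hidden_size, 1),
)

# A batch of 4 fake [CLS] embeddings.
toy_cls = torch.randn(4, hidden_size)
toy_logits = toy_head(toy_cls).squeeze(-1)  # drop the trailing size-1 dimension

print(toy_logits.shape)  # torch.Size([4])
```

Squeezing the last dimension gives one logit per sentence, which is the shape a binary loss on logits expects.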

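Note: for DistilBERT, the number 768 is stored in the model config as `dim` (e.g. `model.config.dim`); for BERT, the corresponding attribute is `hidden_size`.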

Note 2: this architecture is already implemented in `transformers.DistilBertForSequenceClassification`. Some links:

* [Finetuning tutorial](https://huggingface.co/transformers/custom_datasets.html)
* [DistilBertForSequenceClassification docs](https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification)
* [DistilBertForSequenceClassification source code](https://huggingface.co/transformers/_modules/transformers/models/distilbert/modeling_distilbert.html#DistilBertForSequenceClassification)

This model can be instantiated with

```python
model_for_sequence_classification = transformers.DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
```

In [None]:
class ModelForSequenceClassification(nn.Module):
    def __init__(self, disable_feature_extractor_grad=True):
        super().__init__()

        # Recreate the model just in case
        self.feature_extractor = model_class.from_pretrained(pretrained_model_name)

        self.classifier_head = <YOUR CODE>

        if disable_feature_extractor_grad:
            # Disable DistilBERT parameter gradients here
            <YOUR CODE>
    
    def forward(self, input_ids, attention_mask):
        # Run feature extractor, pass its output to the classifier head and squeeze its output
        <YOUR CODE>
        return logits

model_for_sequence_classification = ModelForSequenceClassification()
model_for_sequence_classification = model_for_sequence_classification.to(device)

assert model_for_sequence_classification(torch.tensor([[101]], device=device), torch.tensor([[1]], device=device)).shape == (1,)
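
One common way to disable gradients for part of a model is to set `requires_grad = False` on its parameters. The sketch below shows this on two tiny linear layers standing in for the feature extractor and the classifier head. The `toy_*` modules are hypothetical and far smaller than DistilBERT.

```python
import torch.nn as nn

# Small stand-ins for the pretrained feature extractor and the classifier head.
toy_extractor = nn.Linear(8, 8)
toy_head = nn.Linear(8, 1)

# Freeze the extractor: its parameters will not receive gradients.
for param in toy_extractor.parameters():
    param.requires_grad = False

# Count the parameters that are still trainable (only the head: 8 weights + 1 bias).
num_trainable = sum(
    p.numel()
    for module in (toy_extractor, toy_head)
    for p in module.parameters()
    if p.requires_grad
)
print(num_trainable)  # 9
```

With the extractor frozen, an optimizer step only changes the head, which keeps training cheap.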

Loss function:

In [None]:
def compute_loss(logits, labels):
    labels = labels.type(torch.float32)
    # What is the correct loss function in our case?
    return <YOUR CODE>
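
For a model that outputs a single raw logit per example, one common choice is binary cross-entropy computed directly on logits (in PyTorch, `torch.nn.functional.binary_cross_entropy_with_logits`). The sketch below spells out the formula for one example in plain Python; `toy_bce_with_logits` is a hypothetical helper written for illustration, not part of any library.

```python
import math

def toy_bce_with_logits(logit, label):
    """Binary cross-entropy for one raw logit and a 0/1 label, in a numerically stable form."""
    # Equivalent to -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# With a logit of 0 the predicted probability is 0.5, so the loss is log(2).
print(toy_bce_with_logits(0.0, 1.0))
```

The loss shrinks as the logit moves in the direction of the correct label and grows when the model is confidently wrong.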

In [None]:
from tqdm.notebook import tqdm
from torch.utils.tensorboard import SummaryWriter


def train(model, dataloader_train, dataloader_valid, tb_dir, tb_tag=None, num_epochs=3):
    model.feature_extractor.eval()

    opt = torch.optim.Adam(model.parameters())

    if tb_tag is None:
        tb_run_name = get_datetime()
    else:
        tb_run_name = f'{get_datetime()}_{tb_tag}'

    with SummaryWriter(log_dir=str(tb_dir / tb_run_name)) as writer:
        train_step = 0
        for epoch in range(num_epochs):
            model.classifier_head.train()
            for batch in tqdm(dataloader_train, desc=f'Epoch {epoch} | Train'):
                # Move everything to device...
                <YOUR CODE>

                # Perform a forward pass...
                logits = <YOUR CODE>
                loss = <YOUR CODE>

                # Do an optimization step...
                <YOUR CODE>

                # Log results.
                writer.add_scalar('train/loss', loss.item(), train_step)
                writer.add_scalar('train/accuracy', ((logits >= 0) == (labels == 1)).cpu().numpy().mean(), train_step)

                train_step += dataloader_train.batch_size

            model.classifier_head.eval()
            with torch.no_grad():
                valid_losses = []
                valid_accuracies = []
                for batch in tqdm(dataloader_valid, desc=f'Epoch {epoch} | Valid'):
                    # Move everything to device...
                    <YOUR CODE>

                    # Perform a forward pass...
                    logits = <YOUR CODE>
                    loss = <YOUR CODE>

                    # Log results.
                    valid_losses.append(loss.item())
                    valid_accuracies.extend(((logits >= 0) == (labels == 1)).cpu().numpy())

                writer.add_scalar('valid/loss', np.mean(valid_losses), train_step)
                writer.add_scalar('valid/accuracy', np.mean(valid_accuracies), train_step)
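
The forward pass and optimization step inside such a loop typically follow a fixed pattern: compute logits and loss, clear old gradients, backpropagate, update. The sketch below shows that pattern on a toy linear model and a single random batch. The model, data, learning rate, and choice of loss are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy model and batch standing in for the classifier and one dataloader batch.
toy_model = nn.Linear(4, 1)
toy_opt = torch.optim.Adam(toy_model.parameters(), lr=0.1)
toy_inputs = torch.randn(16, 4)
toy_labels = (toy_inputs[:, 0] > 0).long()

losses = []
for _ in range(20):
    logits = toy_model(toy_inputs).squeeze(-1)
    loss = F.binary_cross_entropy_with_logits(logits, toy_labels.float())

    toy_opt.zero_grad()  # clear gradients from the previous step
    loss.backward()      # backpropagate
    toy_opt.step()       # update parameters

    losses.append(loss.item())

# A logit >= 0 corresponds to a predicted probability >= 0.5, i.e. class 1.
accuracy = ((logits >= 0) == (toy_labels == 1)).float().mean().item()
print(losses[0], losses[-1], accuracy)
```

The accuracy expression mirrors the one logged in `train`: thresholding the logit at 0 is the same as thresholding the sigmoid probability at 0.5.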

In [None]:
tb_dir = Path('tb_logs')

In [None]:
%tensorboard --port 6006 --logdir $tb_dir

In [None]:
train(model_for_sequence_classification, dataloader_train, dataloader_valid, tb_dir, tb_tag='finetune')

In [None]:
inputs = tokenizer('this is complete and utter garbage', truncation=True, padding=True)
inputs

In [None]:
inputs = {key: torch.tensor(value, device=device)[np.newaxis] for (key, value) in inputs.items()}
inputs

In [None]:
model_for_sequence_classification(**inputs)

In [None]:
inputs = tokenizer('this is complete and utter miracle', truncation=True, padding=True)
inputs = {key: torch.tensor(value, device=device)[np.newaxis] for (key, value) in inputs.items()}
model_for_sequence_classification(**inputs)

But how does our model compare against the best models?

## Proper SST2 scores
For reference, the [highest accuracy score](https://paperswithcode.com/sota/sentiment-analysis-on-sst-2-binary) for this dataset is currently **97.5**. DistilBERT can be trained to improve its score on this task – a process called **fine-tuning**, which updates BERT’s weights so that it performs better on this sentence classification task (which we can call the downstream task). The fine-tuned DistilBERT turns out to achieve an accuracy score of **91.3**. The BERT Large model achieves **93.1**.

And that’s it! That’s a good first contact with BERT. The next step would be to head over to the documentation and try your hand at [fine-tuning](https://huggingface.co/transformers/examples.html#glue). You can also go back and switch from DistilBERT to BERT and see how that works.

# Acknowledgements

This notebook is based on the notebook from [this article](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/) and extended with an example of larger-scale fine-tuning.