## BERT overview

This notebook shows how to feed text data into a pre-trained BERT model.

In [1]:
from typing import (
    Tuple,
    List
)

import numpy as np

import torch

from transformers import (
    BertModel,
    BertTokenizer
)

## Load BERT tokenizer

In [2]:
# There is a huge list of pre-trained models at https://huggingface.co/models
# You can also pre-train your own and upload it!
pretrained_model_name = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(pretrained_model_name)

## Sentence example

In [3]:
sample_text = 'The black dog is sleeping on the couch'
sample_token_ids = tokenizer.encode(sample_text, add_special_tokens=True)

In [4]:
sample_token_ids

[101, 1109, 1602, 3676, 1110, 5575, 1113, 1103, 5943, 102]

What if we truncate the sentence?

In [5]:
tokenizer.encode(sample_text,
                 truncation=True,
                 max_length=5,
                 add_special_tokens=True)

[101, 1109, 1602, 3676, 102]

What if the expected sequence length is longer than the sentence?

In [6]:
tokenizer.encode(sample_text,
                 truncation=True,
                 max_length=15,
                 padding='max_length',
                 add_special_tokens=True)

[101, 1109, 1602, 3676, 1110, 5575, 1113, 1103, 5943, 102, 0, 0, 0, 0, 0]
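
The two outputs above can be reproduced by hand, which makes it clearer what `truncation` and `padding='max_length'` do. This is only a sketch, not the tokenizer's actual implementation: `fit_to_length` is a hypothetical helper that always pads, and the special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are read off the outputs shown above for `bert-base-cased`. It also builds an attention mask marking which positions hold real tokens.

```python
from typing import List, Tuple

# Special-token IDs as seen in the outputs above (assumed for bert-base-cased)
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def fit_to_length(content_ids: List[int], max_length: int) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: truncate, add [CLS]/[SEP], then pad to max_length.

    Assumes max_length >= 2 so that both special tokens fit.
    """
    # Reserve two positions for [CLS] and [SEP]
    kept = content_ids[:max_length - 2]
    token_ids = [CLS_ID] + kept + [SEP_ID]
    # 1 marks a real token, 0 marks padding
    attention_mask = [1] * len(token_ids)
    n_pad = max_length - len(token_ids)
    return token_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


# Token IDs of the sample sentence without the special tokens
content_ids = [1109, 1602, 3676, 1110, 5575, 1113, 1103, 5943]

short_ids, short_mask = fit_to_length(content_ids, max_length=5)
padded_ids, padded_mask = fit_to_length(content_ids, max_length=15)
print(short_ids)
print(padded_ids)
print(padded_mask)
```

Note that truncation drops tokens from the end of the sentence but keeps the closing `[SEP]`, which is why the truncated output still ends in 102.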

## Feeding data into BERT

We will inspect what BERT outputs.

Let's first load the model.

In [7]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = BertModel.from_pretrained(pretrained_model_name,
                                  output_hidden_states=True).to(device)

In [8]:
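tokens_tensor = torch.tensor(sample_token_ids).unsqueeze(0).to(device)
# Index the output: newer transformers return a dict-like object, not a tuple
outputs = model(tokens_tensor)
last_hidden_state, pooled, hidden_states = outputs[0], outputs[1], outputs[2]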

For each layer we get a token-level representation. `last_hidden_state` is the output of the final layer, so it matches `hidden_states[-1]`.

In [9]:
print(last_hidden_state)
print(hidden_states[-1])

tensor([[[ 0.1536,  0.0562,  0.1367,  ..., -0.0989,  0.3940,  0.1002],
         [-0.4726, -0.1651,  0.1575,  ...,  0.6513,  0.1141,  0.5288],
         [-0.0889, -0.3289, -0.1701,  ...,  0.6400,  0.1992,  0.5205],
         ...,
         [-0.1692, -0.4604,  0.2934,  ...,  0.1403,  0.2538,  0.3727],
         [ 0.3087, -0.3969, -0.0714,  ..., -0.3647,  0.7529, -0.1209],
         [ 0.9193,  0.1145,  0.0871,  ..., -0.0265,  0.4273,  0.0366]]],
       grad_fn=<NativeLayerNormBackward>)
tensor([[[ 0.1536,  0.0562,  0.1367,  ..., -0.0989,  0.3940,  0.1002],
         [-0.4726, -0.1651,  0.1575,  ...,  0.6513,  0.1141,  0.5288],
         [-0.0889, -0.3289, -0.1701,  ...,  0.6400,  0.1992,  0.5205],
         ...,
         [-0.1692, -0.4604,  0.2934,  ...,  0.1403,  0.2538,  0.3727],
         [ 0.3087, -0.3969, -0.0714,  ..., -0.3647,  0.7529, -0.1209],
         [ 0.9193,  0.1145,  0.0871,  ..., -0.0265,  0.4273,  0.0366]]],
       grad_fn=<NativeLayerNormBackward>)


In [10]:
print(len(hidden_states))  # 12 layers + embedding

13


In [11]:
print(hidden_states[5].shape)  # N x N_tokens x H

torch.Size([1, 10, 768])


In [12]:
# Last layer hidden-state of the first token of the sequence (classification token)
# further processed by a Linear layer and a Tanh activation function. 
print(pooled[0, :10])

tensor([-0.7002,  0.4506,  0.9999, -0.9920,  0.9600,  0.8650,  0.9825, -0.9862,
        -0.9753, -0.5934], grad_fn=<SliceBackward>)
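
As the comment in the cell above says, the pooled output is the last-layer vector of the first token passed through a Linear layer and a Tanh. A rough NumPy sketch of that computation follows. The hidden state and the parameters `W` and `b` are random stand-ins (their shapes are an assumption based on the hidden size of 768), so the numbers will not match the model's.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 768

# Stand-in for last_hidden_state with shape (N, N_tokens, H); random, not real BERT output
last_hidden = rng.normal(size=(1, 10, hidden_size)).astype(np.float32)
# Stand-ins for the pooler's Linear layer parameters (assumed shapes: H x H and H)
W = rng.normal(scale=0.02, size=(hidden_size, hidden_size)).astype(np.float32)
b = np.zeros(hidden_size, dtype=np.float32)

cls_vector = last_hidden[:, 0, :]              # first token of each sequence: (N, H)
pooled_sketch = np.tanh(cls_vector @ W.T + b)  # Linear, then Tanh: (N, H)

print(pooled_sketch.shape)
```

The Tanh is why every value printed from `pooled` lies between -1 and 1.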


We can use different strategies to build a sentence embedding from token-level embeddings. Some examples:

- Average the token vectors of the last hidden layer.
- Average the token vectors of the second-to-last hidden layer.
- Average the last 4 hidden layers, then average over tokens.

Let's use the last approach in this example.

In [13]:
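def summarize_hidden_states(hidden_states: Tuple[torch.Tensor, ...], n_last_layers: int = 4) -> torch.Tensor:
    # Stack the last layers into (n_last_layers, N, N_tokens, H) and average them
    layers_avg = torch.mean(
        torch.stack(hidden_states[-n_last_layers:]),
        dim=0
    )
    # Average over the token dimension: (N, N_tokens, H) -> (N, H)
    tokens_avg = torch.mean(layers_avg, dim=1)
    return tokens_avg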

summarize_hidden_states(hidden_states).shape

torch.Size([1, 768])
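
For comparison, the other two strategies listed above differ only in which layer they average. Here is a sketch in NumPy that uses random arrays as stand-ins for the 13 hidden states, so the values are meaningless and only the shapes matter. `mean_of_layer` is a hypothetical helper introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for hidden_states: 13 arrays of shape (N, N_tokens, H)
fake_hidden_states = tuple(rng.normal(size=(1, 10, 768)) for _ in range(13))


def mean_of_layer(hidden_states, layer: int) -> np.ndarray:
    """Hypothetical helper: average the token vectors of one layer -> (N, H)."""
    return hidden_states[layer].mean(axis=1)


last_layer_avg = mean_of_layer(fake_hidden_states, -1)
second_to_last_avg = mean_of_layer(fake_hidden_states, -2)
# Same computation as summarize_hidden_states, written with NumPy
last_four_avg = np.stack(fake_hidden_states[-4:]).mean(axis=0).mean(axis=1)

print(last_layer_avg.shape, second_to_last_avg.shape, last_four_avg.shape)
```

All three strategies produce one vector of size H per sentence, so they can be swapped without changing the downstream code.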

In [14]:
def sentence_embedding(token_ids: List[int]) -> np.ndarray:
    with torch.no_grad():
        tokens_tensor = torch.tensor(token_ids).unsqueeze(0).to(device)
        hidden_states = model(tokens_tensor)[2]  # index works for tuple and dict-like outputs
        embedding_tensor = summarize_hidden_states(hidden_states)
        return embedding_tensor.cpu().numpy().squeeze(0)

sentence_embedding(sample_token_ids).shape

(768,)
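
Note that `sentence_embedding` averages over every position it is given. If the token IDs were padded (as with `padding='max_length'` earlier), the `[PAD]` positions would be averaged in as well. One common remedy is to weight the average by an attention mask. Below is a NumPy sketch with made-up values; `masked_mean` is a hypothetical helper, not part of transformers.

```python
import numpy as np


def masked_mean(token_vectors: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Hypothetical helper: mean over tokens, ignoring positions where mask == 0.

    token_vectors: (N, N_tokens, H), attention_mask: (N, N_tokens) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(token_vectors.dtype)  # (N, N_tokens, 1)
    summed = (token_vectors * mask).sum(axis=1)                    # (N, H)
    counts = np.clip(mask.sum(axis=1), 1, None)                    # avoid division by zero
    return summed / counts


# Two real tokens followed by one padding position (toy values, H = 2)
vectors = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

masked = masked_mean(vectors, mask)
print(masked)                # [[2. 3.]]
print(vectors.mean(axis=1))  # plain mean is pulled towards the padding vector
```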

Using this embedding we can perform several tasks such as classification or clustering.

## Sentence embedding as feature

In [15]:
sentences = [
    'there is a dog sleeping on the garden',
    'there was a puppy resting at the back of the house',
    'around 85% of households in the UK have a dog in it',
    'european prime ministers agreed on a plan to tackle the coronavirus crisis'
]

embeddings = np.stack([
    sentence_embedding(tokenizer.encode(sentence, add_special_tokens=True))
    for sentence in sentences
])

# Pairwise dot products (not length-normalized); zero out self-similarity
similarities = np.dot(embeddings, embeddings.T)
np.fill_diagonal(similarities, 0)
similarities

array([[  0.    , 313.4347, 291.3957, 288.2623],
       [313.4347,   0.    , 288.2033, 281.8941],
       [291.3957, 288.2033,   0.    , 283.4817],
       [288.2623, 281.8941, 283.4817,   0.    ]], dtype=float32)
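
The values above are raw dot products, so they grow with the length of the embedding vectors and are hard to compare across pairs. Dividing by the vector norms gives cosine similarity, which is bounded between -1 and 1. The sketch below uses small made-up vectors standing in for `embeddings`; `cosine_similarity_matrix` is a hypothetical helper.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Hypothetical helper: pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


# Toy stand-in for the (n_sentences, 768) embeddings matrix
toy_embeddings = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as row 0, different length
    [0.0, 3.0],   # orthogonal to rows 0 and 1
])

cos_sim = cosine_similarity_matrix(toy_embeddings)
np.fill_diagonal(cos_sim, 0)
print(cos_sim)
```

With the real data, the call would be `cosine_similarity_matrix(embeddings)`.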