In [1]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

In [2]:
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)

In [3]:
text = "time flies like an arrow?"
show(model, "bert", tokenizer, text, display_mode = "light", layer = 0, head = 8)

### Self-attention

Tokenize the text : We first extract the input IDs. For simplicity, we exclude the special tokens [CLS] and [SEP].

In [4]:
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

tensor([[ 2051, 10029,  2066,  2019,  8612,  1029]])

Dense Embeddings : Next we create a dense embedding layer. Each token ID maps to a dense vector, unlike a one-hot encoding, which is sparse.

In [5]:
from torch import nn
from transformers import AutoConfig

In [6]:
config = AutoConfig.from_pretrained(model_ckpt) # to load the config json associated with this checkpoint.
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)

In [7]:
token_emb

Embedding(30522, 768)

Note: The token embeddings at this point are independent of their context. This means that homonyms (words that have the same spelling but different meanings), like "flies" in the example above, have the same representation. The role of the subsequent attention layers is to mix these token embeddings to disambiguate and inform the representation of each token with the content of its context.
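To make the context independence concrete, here is a minimal sketch with a toy lookup table. The vocabulary size, embedding size, and token IDs are made up for illustration (ID 7 stands in for "flies"); they are not BERT's real values.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy lookup table; sizes are illustrative, not BERT's real ones
toy_emb = nn.Embedding(num_embeddings=100, embedding_dim=8)

# Hypothetical token IDs: ID 7 plays the role of "flies" in both sentences
sentence_a_ids = torch.tensor([[3, 7, 12, 5, 20]])   # e.g. "time flies like an arrow"
sentence_b_ids = torch.tensor([[41, 7, 12, 9, 33]])  # e.g. a second sentence that also contains "flies"

flies_a = toy_emb(sentence_a_ids)[0, 1]
flies_b = toy_emb(sentence_b_ids)[0, 1]

# The lookup depends only on the ID, so both vectors are identical
same_vector = torch.equal(flies_a, flies_b)
print(same_vector)
```

Whatever surrounds the token, the lookup returns the same row of the table, which is why a later mixing step such as attention is needed.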

Now that we have our lookup table, we can generate the embeddings by feeding in the input IDs.

In [8]:
inputs_embeds = token_emb(inputs.input_ids)

In [9]:
inputs_embeds.size()

torch.Size([1, 6, 768])

This gives us a tensor of shape [batch_size, seq_len, hidden_dim]

Now we will create the query, key and value vectors and calculate the attention scores using the dot product as the similarity function.

In [10]:
import torch
from math import sqrt

In [11]:
query = key = value = inputs_embeds

In [12]:
query.shape

torch.Size([1, 6, 768])

In [13]:
key.shape

torch.Size([1, 6, 768])

In [14]:
value.shape

torch.Size([1, 6, 768])

In [15]:
query

tensor([[[ 0.9318,  1.6099, -0.1687,  ..., -0.3008,  0.8344, -2.2563],
         [-0.8510, -0.3199,  0.1611,  ...,  0.2529, -1.1908,  0.9522],
         [-2.2926, -1.0937, -0.7750,  ..., -0.6255,  0.7043, -0.1264],
         [-1.0818,  0.3545,  0.0723,  ...,  1.4363, -0.1546,  0.5226],
         [-0.6923, -0.4183, -0.1789,  ..., -1.3920, -0.3620,  0.0047],
         [-0.3842,  0.3586,  0.2184,  ...,  1.0970, -1.2693, -1.2640]]],
       grad_fn=<EmbeddingBackward0>)

In [16]:
dim_k = key.size(-1)
dim_k

768

In [17]:
scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
scores.size()

torch.Size([1, 6, 6])

This creates a 6 x 6 matrix of attention scores per sample in the batch. In scaled dot-product attention, the dot products are divided by the square root of the key dimension (here sqrt(768)) so that we don't get too many large values during training, which can cause the softmax we apply next to saturate.

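torch.bmm() : performs a batch matrix-matrix product. It simplifies the computation of the attention scores when the query and key tensors have the shape [batch_size, seq_len, hidden_dim], because the product is computed for every sequence in the batch independently. Transposing the last two dimensions of the key gives [batch_size, hidden_dim, seq_len], so the result has shape [batch_size, seq_len, seq_len].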
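The effect of the scaling can be checked numerically. The following is a rough sketch that uses random unit-variance vectors as stand-ins for real query and key vectors (an assumption made only for illustration) and compares the softmax with and without the division by sqrt(dim_k).

```python
import torch
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)

dim_k = 768
# Random stand-ins for query/key vectors with unit-variance components
q = torch.randn(1, 6, dim_k)
k = torch.randn(1, 6, dim_k)

raw_scores = torch.bmm(q, k.transpose(1, 2))   # [1, 6, 6], spread grows like sqrt(dim_k)
scaled_scores = raw_scores / sqrt(dim_k)       # spread brought back to roughly 1

raw_weights = F.softmax(raw_scores, dim=-1)
scaled_weights = F.softmax(scaled_scores, dim=-1)

# Largest weight in each row: a value close to 1 means the softmax has saturated
raw_peak = raw_weights.max(dim=-1).values.mean().item()
scaled_peak = scaled_weights.max(dim=-1).values.mean().item()
print(f"std raw={raw_scores.std().item():.1f}, scaled={scaled_scores.std().item():.2f}")
print(f"mean peak weight raw={raw_peak:.3f}, scaled={scaled_peak:.3f}")
```

Without scaling, almost all of the weight in each row tends to land on a single position, which leaves very small gradients for the other positions.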

Let's apply softmax now -

In [18]:
import torch.nn.functional as F

In [19]:
weights = F.softmax(scores, dim = -1)
weights.sum(dim = -1)

tensor([[1., 1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)

The final step is to multiply the attention weights by the values

In [20]:
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

torch.Size([1, 6, 768])

These are all the steps needed to build a simplified form of self-attention. Note that the whole process is just two matrix multiplications and a softmax, i.e. a fancy form of averaging.

So, let's write a self-attention function so that we can reuse it
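The "averaging" view can be verified directly. This sketch uses a small random tensor in place of `inputs_embeds` (the size 16 is an arbitrary choice) and checks that each output vector is a weighted average of the value vectors.

```python
import torch
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)

x = torch.randn(1, 6, 16)  # toy stand-in for inputs_embeds
scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(x.size(-1))
weights = F.softmax(scores, dim=-1)
outputs = torch.bmm(weights, x)

# Same result written as an explicit weighted sum for the first token
first_token = sum(weights[0, 0, j] * x[0, j] for j in range(x.size(1)))
matches = torch.allclose(outputs[0, 0], first_token, atol=1e-5)

# A weighted average cannot leave the range spanned by the value vectors
lower = x.min(dim=1).values.unsqueeze(1)
upper = x.max(dim=1).values.unsqueeze(1)
within_range = bool(((outputs >= lower - 1e-5) & (outputs <= upper + 1e-5)).all())
print(matches, within_range)
```

Because the weights are non-negative and sum to one, every output coordinate stays between the smallest and largest value seen at that coordinate across the sequence.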

In [21]:
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim = -1)
    return torch.bmm(weights, value)

Our attention mechanism with equal query and key vectors will assign a very large score to identical words in the context, and in particular to the current word itself: the dot product of a vector with itself is its squared norm, which is usually far larger than its dot product with a different vector. But in practice, the meaning of a word is better informed by complementary words in the context than by identical words. For example, "flies" is better defined by incorporating information from "time" and "arrow" than by another mention of "flies".
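A quick numerical check of this self-similarity effect, assuming randomly initialised embeddings with unit-variance components (as `nn.Embedding` produces by default): the scaled self-score is then roughly ||x||² / sqrt(d), which is close to sqrt(d), while scores between different tokens stay around 1 in magnitude.

```python
import torch
from math import sqrt

torch.manual_seed(0)

dim = 768
x = torch.randn(1, 6, dim)  # stand-in for freshly initialised embeddings
scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(dim)

diagonal = scores[0].diagonal()                              # each token scored against itself
off_diagonal = scores[0][~torch.eye(6, dtype=torch.bool)]    # each token scored against the others

print(diagonal.mean().item())            # expected to be in the vicinity of sqrt(768), about 27.7
print(off_diagonal.abs().mean().item())  # expected to be of order 1
```

After the softmax, such a dominant diagonal means that each token attends almost entirely to itself, which is the motivation for learning separate query and key projections.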

Let's allow the model to create a different set of vectors for the query, key and value of a token by using three different linear projections to project our initial token vector into three different spaces.

Multi-headed attention: In practice, the self-attention layer applies three independent linear transformations to each embedding to generate the query, key and value vectors. These transformations project the embeddings, and each projection carries its own learnable parameters, which allows the self-attention layer to focus on different semantic aspects of the sequence.

It is also beneficial to have multiple sets of linear projections, each one representing a so-called attention head. This results in multi-headed attention. More than one attention head is needed because the softmax of one head tends to focus mostly on one aspect of similarity. Having several heads allows the model to focus on several aspects at once. For example, one head can focus on subject-verb interaction, whereas another finds nearby adjectives.

We don't handcraft these relationships into the model; they are fully learned from the data. (This is similar to how multiple filters in a CNN learn to detect different features, such as faces or car wheels.)

Implement Single attention head

In [22]:
class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)
        
    def forward(self, hidden_state):
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
        return attn_outputs

Here we have initialized three independent linear layers that apply matrix multiplication to the embedding vectors to produce tensors of shape [batch_size, seq_len, head_dim], where head_dim is the number of dimensions we are projecting into. For example, BERT has 12 attention heads, so the dimension of each head is 768 / 12 = 64.

Now that we have a single attention head, we can concatenate the outputs of several such heads to implement the full multi-headed attention layer.

In [23]:
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        ) # Holds submodules in a list
        self.output_linear = nn.Linear(embed_dim, embed_dim)
        
    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim = -1)
        x = self.output_linear(x)
        return x

Notice that the concatenated output from the attention heads is also fed through a final linear layer to produce an output tensor of shape [batch_size, seq_len, hidden_dim] that is suitable for the feed-forward network downstream.
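Looping over a list of heads is easy to read but runs one small matrix product per head. A common alternative, sketched below under the assumption that all heads share the same size, uses one projection each for query, key and value and splits the result into heads by reshaping. The class name `BatchedMultiHeadAttention` is hypothetical and not part of the notebook.

```python
import torch
from torch import nn
import torch.nn.functional as F
from math import sqrt

class BatchedMultiHeadAttention(nn.Module):
    """Hypothetical variant: one projection per Q/K/V, heads split by reshaping."""
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q = nn.Linear(embed_dim, embed_dim)
        self.k = nn.Linear(embed_dim, embed_dim)
        self.v = nn.Linear(embed_dim, embed_dim)
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def _split_heads(self, x):
        # [batch, seq, embed] -> [batch, heads, seq, head_dim]
        batch, seq_len, _ = x.shape
        return x.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, hidden_state):
        q = self._split_heads(self.q(hidden_state))
        k = self._split_heads(self.k(hidden_state))
        v = self._split_heads(self.v(hidden_state))
        # Scaled dot-product attention for all heads at once
        scores = q @ k.transpose(-2, -1) / sqrt(self.head_dim)
        weights = F.softmax(scores, dim=-1)
        context = weights @ v  # [batch, heads, seq, head_dim]
        batch, _, seq_len, _ = context.shape
        # Concatenate the heads back into [batch, seq, embed]
        context = context.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.output_linear(context)

batched_attn = BatchedMultiHeadAttention(embed_dim=768, num_heads=12)
batched_out = batched_attn(torch.randn(1, 6, 768))
print(batched_out.shape)
```

The output shape matches the list-based layer; the trade-off is slightly denser code in exchange for fewer, larger matrix multiplications.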

In [24]:
config

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.11.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [25]:
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)
attn_output.size()

torch.Size([1, 6, 768])

So the multi-headed attention layer returns a tensor of the expected shape.

Let's see how attention differs for two different uses of the word "flies" -

In [26]:
from bertviz import head_view
from transformers import AutoModel

In [27]:
model = AutoModel.from_pretrained(model_ckpt, output_attentions = True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [28]:
sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

In [29]:
viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])

#### Position-wise feed-forward networks
The feed-forward sublayer in the encoder and decoder is just a simple two-layer fully connected neural network. Instead of processing the whole sequence of embeddings as a single vector, it processes each embedding independently. For this reason, it is referred to as a position-wise feed-forward layer.

In [30]:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        
    def forward(self, x):
        x = self.linear_1(x)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

In [31]:
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_outputs)
ff_outputs.size()

torch.Size([1, 6, 768])
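To see what "position-wise" means in practice, here is a small check with a toy two-layer network standing in for `FeedForward` (smaller sizes, and dropout left out so the outputs are deterministic). Feeding the whole sequence at once gives the same result as feeding each position separately.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Small stand-in for the FeedForward module (no dropout, toy sizes)
ffn = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))

x = torch.randn(1, 6, 16)   # [batch_size, seq_len, hidden_dim]
full_pass = ffn(x)          # whole sequence at once

# Feed each position through the same network separately, then stack the results
per_position = torch.stack([ffn(x[:, i, :]) for i in range(x.size(1))], dim=1)

position_wise = torch.allclose(full_pass, per_position, atol=1e-5)
print(position_wise)
```

This works because `nn.Linear` acts only on the last dimension, so no information flows between positions inside the feed-forward sublayer.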

#### Adding Layer Normalization
The Transformer architecture makes use of Layer Normalization and Skip Connections.
* Layer Normalization : It normalizes each input in the batch (here, each token's hidden vector) to have zero mean and unit variance.
* Skip Connections : They pass a tensor to the next layer of the model without processing and add it to the processed tensor.
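As a small illustration of the normalization step, the sketch below applies a freshly initialised `nn.LayerNorm` to a deliberately shifted and scaled random tensor (the shift and scale are arbitrary) and inspects the statistics of each token's hidden vector.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden = torch.randn(1, 6, 768) * 3.0 + 5.0   # deliberately shifted and scaled
layer_norm = nn.LayerNorm(768)                # weight starts at 1, bias at 0

normed = layer_norm(hidden)
per_token_mean = normed.mean(dim=-1)                  # one value per token
per_token_var = normed.var(dim=-1, unbiased=False)    # one value per token
print(per_token_mean)
print(per_token_var)
```

The statistics are computed over the last dimension only, so each token is normalized on its own, independently of the other tokens and of the batch.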

There are two choices for where to place the layer normalization -

1. Post Layer Normalization : It places the layer normalization in between the skip connections, that is, after each sublayer's output has been added to its input. This is tricky to train from scratch as the gradients can diverge, so it is usually combined with Learning Rate Warm-up, where the learning rate is gradually increased from a small value to some maximum value during training. This setup was used in the original Transformer paper.

2. Pre Layer Normalization : This is the most common arrangement found in the literature. It places layer normalization within the span of the skip connections, that is, on the input of each sublayer. It tends to be much more stable during training and usually does not require any learning rate warm-up.

We will use pre layer normalization here.

In [32]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
        
    def forward(self, x):
        # Apply layer normalization and then copy input into query, key, value
        hidden_state = self.layer_norm_1(x)
        # apply attention with a skip connection 
        x = x + self.attention(hidden_state)
        # apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

In [38]:
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()

(torch.Size([1, 6, 768]), torch.Size([1, 6, 768]))
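For comparison, the two placements can be written as small wrappers around an arbitrary sublayer. This is only a sketch: the class names `PostLNBlock` and `PreLNBlock` are hypothetical, and a plain `nn.Linear` stands in for the attention or feed-forward sublayer.

```python
import torch
from torch import nn

class PostLNBlock(nn.Module):
    """Hypothetical post-LN wrapper: normalize after adding the skip connection."""
    def __init__(self, hidden_size, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return self.layer_norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Pre-LN wrapper, the arrangement used in TransformerEncoderLayer."""
    def __init__(self, hidden_size, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return x + self.sublayer(self.layer_norm(x))

x = torch.randn(1, 6, 768)
post_out = PostLNBlock(768, nn.Linear(768, 768))(x)
pre_out = PreLNBlock(768, nn.Linear(768, 768))(x)
print(post_out.shape, pre_out.shape)
```

In the pre-LN form the input `x` reaches the output through an untouched addition, which is one commonly cited reason for its more stable gradients.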

#### Positional Embeddings

The idea is to augment the token embeddings with a position-dependent pattern of values arranged in a vector. If the pattern is characteristic for each position, the attention heads and the feed-forward layers in each stack can learn to incorporate positional information into their transformations.

There are several ways to do that. One of the most popular is to use a learnable pattern, especially when the pre-training dataset is sufficiently large. This works exactly the same way as the token embeddings, but uses the position index instead of the token ID as input.

With that approach, an efficient way of encoding the positions of tokens is learned during pre-training.

Let's create a custom Embeddings module that combines a token embedding layer, which projects the input_ids to a dense hidden state, with a positional embedding that does the same for position_ids. The resulting embedding is simply the sum of both embeddings.

In [39]:
config.max_position_embeddings

512

In [40]:
class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps = 1e-12)
        self.dropout = nn.Dropout()
        
    def forward(self, input_ids):
        # Create position IDs for input sequence
        seq_length = input_ids.size(1)
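        # Build position IDs on the same device as the input so the module also works on GPU
        position_ids = torch.arange(seq_length, dtype = torch.long, device = input_ids.device).unsqueeze(0)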
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        # combine token and position embeddings -
        embeddings = token_embeddings + position_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [41]:
embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()

torch.Size([1, 6, 768])

Here, the embedding layer creates a single, dense embedding for each token. 

We used learnable position embeddings (preferred when you have a lot of data). There are some alternatives -

* Absolute Positional Representations : Transformer models can use static patterns consisting of modulated sine and cosine signals to encode the positions of the tokens. This works especially well when large volumes of data are not available.

* Relative Positional Representations : Although absolute positions are important, one can argue that when computing an embedding, the surrounding tokens are most important. Relative positional representations follow that intuition and encode the relative positions between tokens. This cannot be set up by just introducing a new relative embedding layer at the beginning, since the relative embedding changes for each token depending on where in the sequence we are attending to it from. Instead, the attention mechanism itself is modified with additional terms that take the relative position between tokens into account. Models such as DeBERTa use such representations.

Note: By combining the idea of absolute and relative positional representations, rotary position embeddings achieve excellent results on many tasks. GPT-J and GPT-NeoX are examples of models with rotary position embeddings.
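Of the alternatives listed above, the static sine and cosine pattern is the simplest to write down. The sketch below follows the formulation from the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the function name is hypothetical and the table is not used elsewhere in this notebook.

```python
import math
import torch

def sinusoidal_position_encoding(max_len, hidden_size):
    """Fixed sine/cosine table of shape [max_len, hidden_size]; hidden_size must be even."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [max_len, 1]
    # Frequencies 1 / 10000^(2i / hidden_size), one per pair of dimensions
    div_term = torch.exp(
        torch.arange(0, hidden_size, 2, dtype=torch.float32)
        * (-math.log(10000.0) / hidden_size)
    )
    table = torch.zeros(max_len, hidden_size)
    table[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    table[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return table

pos_table = sinusoidal_position_encoding(max_len=512, hidden_size=768)
print(pos_table.shape)
```

Such a table has no trainable parameters; it would be added to the token embeddings in the same place where the learned `position_embeddings` are added.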

Now let's put all of this together by building the full transformer encoder combining the embeddings with the encoder layers

In [42]:
config.num_hidden_layers

12

In [43]:
class TransformerEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerEncoderLayer(config) for _ in range(config.num_hidden_layers)])
        
    def forward(self, x):
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        return x

In [44]:
encoder = TransformerEncoder(config)

In [45]:
inputs

{'input_ids': tensor([[ 2051, 10029,  2066,  2019,  8612,  1029]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

In [47]:
encoder(inputs.input_ids).size()

torch.Size([1, 6, 768])

We now have a hidden state for each token in the batch. Let's recap the architecture visually -

![Transformer Encoder Decoder](/images/transformer-encoder-decoder.png "Transformer Encoder Decoder")

![Encoder](/images/encoder-zoom.png "Encoder")

![Layer Norm](/images/layer-norm "Layer Norm")

#### Adding a Classification Head

Transformer models are usually divided into a -

* task-independent body
* task-specific head

So far in this example notebook we have built the body. Now we wish to build a classifier, so we need to attach a classification head to that body.

We have a hidden state for each token, but we only need to make one prediction. There are several approaches to do this. Traditionally, the first token in such models is used for the prediction, and we can attach a dropout and a linear layer to make a classification prediction. (In this notebook the inputs were tokenized without special tokens, so position 0 holds the first word rather than [CLS]; the shapes are the same either way.)

The following class extends the existing encoder for sequence classification.

In [53]:
class TransformerForSequenceClassification(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        
    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # select the hidden state of [CLS] token
        x = self.dropout(x)
        x = self.classifier(x)
        return x

In [54]:
# Let's predict three classes
config.num_labels = 3

In [55]:
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()

torch.Size([1, 3])

We got what we were looking for. For each example in that batch we get the unnormalized logits for each class in the output.
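Using the first token is only one of the pooling options mentioned above. Another common choice is to average the hidden states of all real tokens. The helper below is a hypothetical sketch, not part of the classifier defined in this notebook; it assumes an `attention_mask` of ones and zeros like the one returned by the tokenizer.

```python
import torch

def mean_pool(hidden_states, attention_mask):
    """Average token hidden states, ignoring positions where attention_mask is 0."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # [batch, seq, 1]
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts

hidden_states = torch.randn(2, 6, 768)               # stand-in for encoder output
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0, 0]])  # second sequence has 3 padding positions
pooled = mean_pool(hidden_states, attention_mask)
print(pooled.shape)
```

The pooled tensor has shape [batch_size, hidden_dim], so it could be passed to the same dropout and linear layers as the first-token representation.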

#### Decoder
The decoder has two attention sublayers -
1. Masked multi-head self-attention layer : Ensures that the tokens we generate at each timestep are based only on the past outputs and the current token being predicted. Without this, the decoder could cheat during training by simply copying the target translations; masking the inputs ensures the task is not trivial.

2. Encoder-decoder attention layer : Performs multi-head attention over the output key and value vectors of the encoder stack, with the intermediate representations of the decoder acting as queries. This way the encoder-decoder attention layer learns how to relate tokens from two different sequences, such as two different languages. The decoder has access to the encoder keys and values in each block.

In [56]:
seq_len = inputs.input_ids.size(-1)
seq_len

6

In [58]:
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
mask[0]

tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])

In [60]:
scores.masked_fill(mask == 0, -float("inf"))

tensor([[[27.9977,    -inf,    -inf,    -inf,    -inf,    -inf],
         [-1.4371, 29.2613,    -inf,    -inf,    -inf,    -inf],
         [ 0.2097,  1.5528, 29.4429,    -inf,    -inf,    -inf],
         [-1.1537, -1.1900, -0.7217, 29.4185,    -inf,    -inf],
         [-1.3176, -1.5176,  0.0700,  0.1529, 23.7813,    -inf],
         [-0.7471,  1.5617,  2.3382,  0.0765, -0.5057, 30.6999]]],
       grad_fn=<MaskedFillBackward0>)
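Filling the upper triangle with negative infinity is what makes the future positions disappear after the softmax, since exp(-inf) is 0. The following sketch checks this on random stand-in scores (not the `scores` tensor computed earlier).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len = 6
toy_scores = torch.randn(1, seq_len, seq_len)   # stand-in for attention scores
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)

masked_scores = toy_scores.masked_fill(mask == 0, float("-inf"))
masked_weights = F.softmax(masked_scores, dim=-1)

# Total weight placed on future positions (entries above the diagonal)
future_weight = masked_weights[0].triu(diagonal=1).sum().item()
row_sums = masked_weights.sum(dim=-1)
print(future_weight)
print(row_sums)
```

Each row still sums to one, so the weight that would have gone to future tokens is redistributed over the current and past tokens; the first token can only attend to itself.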

In [61]:
def scaled_dot_product_attention(query, key, value, mask = None):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim = -1)
    return weights.bmm(value)
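The encoder-decoder attention layer described earlier uses the same computation, but with queries taken from the decoder and keys and values taken from the encoder output. Here is a minimal sketch that leaves out the learned projections and the multiple heads; the function name `cross_attention` and the sequence lengths are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from math import sqrt

def cross_attention(decoder_states, encoder_states):
    """Queries come from the decoder; keys and values come from the encoder output."""
    dim_k = decoder_states.size(-1)
    scores = torch.bmm(decoder_states, encoder_states.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)   # [batch, tgt_len, src_len]
    return torch.bmm(weights, encoder_states), weights

encoder_states = torch.randn(1, 6, 768)   # e.g. a source sentence with 6 tokens
decoder_states = torch.randn(1, 4, 768)   # e.g. a target prefix with 4 tokens
context, cross_weights = cross_attention(decoder_states, encoder_states)
print(context.shape, cross_weights.shape)
```

The weight matrix is no longer square: each of the 4 target positions gets a distribution over the 6 source positions, and no causal mask is needed because the whole source sequence is already known.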

There are three main architectures for transformer models -
* Encoders
* Decoders
* Encoder-Decoders