In [28]:
!pip install bertviz

In [29]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)

text = "time flies like an arrow"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)

In [30]:
# tokenise the text
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False) 
# exclude [CLS] and [SEP] to keep things simple (add_special_tokens=False)
inputs.input_ids

In [31]:
from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)
# create dense embeddings; all contain a non-zero value
token_emb = nn.Embedding(config.vocab_size, config.hidden_size) 
token_emb

Each input ID will be mapped to one of 30,522 embedding vectors in nn.Embedding, with size 768. 

The AutoConfig also stores additional metadata, such as label names used to format the model's predictions.

Note: Token embeddings at this point are independent of their context. Subsequent attention layers will mix these token embeddings to disambiguate and inform the representation of each token with the content of its context.
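
A small check of this context-independence, as a sketch: the vocabulary size, embedding size and token IDs below are made-up illustration values, not BERT's.

```python
import torch
from torch import nn

torch.manual_seed(0)

# toy sizes chosen for illustration only (BERT uses 30,522 x 768)
toy_emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

# token ID 7 appears in two different "sentences" at different positions
sent_a = torch.tensor([[7, 2, 3]])
sent_b = torch.tensor([[5, 1, 7]])

vec_a = toy_emb(sent_a)[0, 0]  # embedding of ID 7 in sentence A
vec_b = toy_emb(sent_b)[0, 2]  # embedding of ID 7 in sentence B

# the lookup ignores the neighbouring tokens, so both vectors are identical
same_vector = torch.equal(vec_a, vec_b)
print(same_vector)
```

Any difference between the two occurrences of a token therefore has to come from the attention layers that follow.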

In [32]:
# generate embeddings by feeding in input ids
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size() # [batch_size, seq_len, hidden_dim]

In [33]:
# create query, key and value vectors and calculate attention scores using dot product as similarity fn
import torch
from math import sqrt

query = key = value = inputs_embeds
dim_k = key.size(-1) # 768
print(dim_k)

# key.transpose(1, 2) swaps dimensions 1 and 2, giving the key tensor shape [batch_size, hidden_dim, seq_len]
# torch.bmm performs a batch matrix multiplication: each matrix in the first batch is multiplied
# with the corresponding matrix in the second batch, collecting all query-key dot products
# in a [seq_len, seq_len] matrix per sample
scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k) 

scores.size() # 5 x 5 attention scores per sample in the batch

Later, the query, key and value vectors are obtained by applying independent weight matrices $W_q$, $W_k$, $W_v$ to the embeddings; here they are kept equal for simplicity.

In scaled dot-product attention, the dot products are divided by the square root of the key dimension ($\sqrt{d_k}$, here $\sqrt{768}$) so that we don't get too many large numbers during training that can cause the softmax we apply next to saturate.
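
To see why the scaling matters, here is a minimal sketch. It assumes random unit-variance vectors as stand-ins for queries and keys (real embeddings will not be exactly unit-variance), so the numbers are only indicative.

```python
import torch
from math import sqrt

torch.manual_seed(0)

dim_k = 768
# random unit-variance vectors stand in for queries and keys (an assumption)
q = torch.randn(10_000, dim_k)
k = torch.randn(10_000, dim_k)

raw = (q * k).sum(dim=-1)   # unscaled dot products
scaled = raw / sqrt(dim_k)  # scaled as in scaled dot-product attention

print(raw.std().item())     # roughly sqrt(768)
print(scaled.std().item())  # roughly 1

# a softmax over large-magnitude scores concentrates on a single entry
peaked = torch.softmax(raw[:5], dim=-1).max().item()
smooth = torch.softmax(scaled[:5], dim=-1).max().item()
print(peaked, smooth)
```

The spread of the unscaled dot products grows with the dimension, while the scaled ones stay near unit spread, which keeps the softmax away from its saturated regime.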

In [34]:
import torch.nn.functional as F

# apply softmax
weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1) # each row should add up to 1

In [35]:
# finally, multiply attention weights by values
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

In [36]:
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1) # dimension of the query/key vectors
    scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

If query and key are equal, a very large score will be assigned to identical words. In practice, however, the meaning of a word is better informed by complementary words in the context than by identical words. How can we promote this behaviour?

Allow the model to create a different set of vectors for the query, key and value of a token by using three different linear projections to project the initial token vector into three different spaces.
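
A small sketch of the effect described above, using random vectors as stand-ins for token embeddings (an assumption made purely for illustration). With query equal to key, every token scores highest against itself; separate projections remove that built-in preference.

```python
import torch
from torch import nn

torch.manual_seed(0)

seq_len, embed_dim = 5, 768
x = torch.randn(1, seq_len, embed_dim)  # stand-in for token embeddings

# query == key: a token's score with itself is its squared norm (about embed_dim),
# which dwarfs its scores with the other random tokens
same_scores = torch.bmm(x, x.transpose(1, 2))
self_is_max = same_scores.argmax(dim=-1)  # top-scoring key index for each query

# separate learnable projections: the diagonal is no longer privileged by construction
to_q = nn.Linear(embed_dim, embed_dim)
to_k = nn.Linear(embed_dim, embed_dim)
proj_scores = torch.bmm(to_q(x), to_k(x).transpose(1, 2))

print(self_is_max)        # each position picks itself
print(proj_scores.shape)  # same [1, 5, 5] layout, but now learnable
```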

**Multi-headed attention**

In practice, self-attention applies three independent linear transformations to each embedding to generate query, key and value vectors. These project the embeddings and each projection carries its own set of learnable parameters, allowing the self-attention layer to focus on different semantic aspects of the sequence.

Each of these sets of linear projections is called an *attention head*, and a layer with several of them is a *multi-headed attention layer*. The softmax of one head tends to focus on one aspect of similarity, so multiple heads allow the model to focus on several aspects at once, e.g. subject-verb interaction, or finding adjectives. These relationships are not hand-crafted; the model learns them from the data. There is a resemblance to filters in convolutional neural networks, where one filter can be responsible for detecting faces and another for finding the wheels of cars.
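
As a rough sanity check on the cost of splitting attention into heads, the sketch below assumes BERT-base sizes (hidden size 768, 12 heads) and counts the parameters of the query projections only. The helper `count_params` is a hypothetical name introduced for this example.

```python
from torch import nn

embed_dim, num_heads = 768, 12     # BERT-base sizes
head_dim = embed_dim // num_heads  # 64 dimensions per head

def count_params(module):
    # total number of learnable values (weights and biases)
    return sum(p.numel() for p in module.parameters())

# one head's query projection: embed_dim -> head_dim
per_head = count_params(nn.Linear(embed_dim, head_dim))
# a single full-width projection: embed_dim -> embed_dim
full_width = count_params(nn.Linear(embed_dim, embed_dim))

print(head_dim)                             # 64
print(num_heads * per_head == full_width)   # True
```

Under these assumptions, twelve narrow heads cost the same number of projection parameters as one wide head, so the extra heads come without extra parameters in the projections.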

In [37]:
# code up a single attention head

class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        # three independent linear layers
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)
    
    def forward(self, hidden_state):
        # each linear layer applies a matmul to the embedding vectors, producing tensors of shape
        # [batch_size, seq_len, head_dim], where head_dim is the number of dimensions we project into
        # in practice, head_dim is chosen so that embed_dim is a multiple of it, keeping the computation
        # across all heads constant. E.g. BERT has 12 heads, so the dimension of each head is 768/12=64
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state)
        )
        return attn_outputs

In [38]:
# concatenate outputs of each attention head to get full multi-head attention layer
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)
        
    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        # final linear layer
        # output tensor of shape [batch_size, seq_len, hidden_dim] suitable for feed-forward network downstream
        x = self.output_linear(x) 
        return x

In [39]:
# test with pre-loaded BERT config from prior
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)
attn_output.size() # works!

In [40]:
# use BertViz again to visualise attention for two different uses of the word "flies"
from bertviz import head_view
from transformers import AutoModel

model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])

The visualisation shows the token whose embedding gets updated (left) with every word attended to (right). Line intensity indicates the strength of the attention weights, with dark lines close to 1 and faint lines close to 0.

One thing we see is that attention weights are strongest between words that belong to the same sentence. We can also see how flies associates with arrow and time in sentence A, and with fruit and banana in sentence B, showing how the model is able to distinguish the use of flies as a verb or a noun depending on the context!

**Feed-Forward Layer**

A two-layer fully connected neural network. It processes each embedding independently rather than the whole sequence as a single vector, so it is often referred to as a position-wise feed-forward layer; it can also be viewed as a one-dimensional convolution with a kernel size of one. A rule of thumb from the literature is for the hidden size of the first layer to be four times the size of the embeddings, and GELU is the most commonly used activation function.
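
The equivalence with a kernel-size-one convolution can be checked directly. This is a sketch with toy sizes (hidden size 8, inner size 32, chosen only to respect the four-times rule of thumb), not the sizes used by BERT.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden, inner = 8, 32          # toy sizes; inner = 4 * hidden
x = torch.randn(2, 5, hidden)  # [batch_size, seq_len, hidden_dim]

linear = nn.Linear(hidden, inner)
conv = nn.Conv1d(hidden, inner, kernel_size=1)

# copy the linear weights into the convolution so both compute the same map
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))  # [inner, hidden] -> [inner, hidden, 1]
    conv.bias.copy_(linear.bias)

out_linear = linear(x)
# Conv1d expects [batch, channels, length], so swap the last two axes and swap back
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)
matches = torch.allclose(out_linear, out_conv, atol=1e-5)

# position-wise: shuffling the tokens just shuffles the outputs in the same way
perm = torch.tensor([4, 2, 0, 3, 1])
shuffles = torch.allclose(linear(x[:, perm]), out_linear[:, perm], atol=1e-5)
print(matches, shuffles)
```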

In [41]:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        
    def forward(self, x):
        x = self.linear_1(x)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

In [42]:
# note: a feed-forward layer such as nn.Linear is usually applied to a tensor of shape (batch_size, input_dim),
# where it acts on each element of the batch dimension independently
# this holds for every dimension except the last one, so we can pass a tensor of shape (batch_size, seq_len, hidden_dim)
# and the layer is applied to all token embeddings of the batch and sequence independently
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_outputs)
ff_outputs.size() # have all ingredients to make a transformer encoder layer!

**Layer Normalisation**

Normalises each input in the batch to have zero mean and unit variance. Skip connections pass a tensor to the next layer of the model without processing and add it to the processed tensor. For placing the normalisation, there are two main choices from the literature:

-  *Post layer normalisation*:
    Places layer normalisation in between the skip connections (after multi-headed attention). This is tricky to train from scratch as gradients can diverge, so a technique known as learning rate warm-up is often used, where the learning rate is gradually increased from a small value to some maximum during training.
-  *Pre layer normalisation*:
    The most common arrangement; places layer normalisation within the span of the skip connection (i.e. inside the skip connection, before attention). It tends to be more stable during training and usually does not require any learning rate warm-up.

In both arrangements, one skip connection wraps the attention sublayer and another wraps the feed-forward sublayer.
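
The two arrangements differ only in where the normalisation sits relative to the skip connection. A minimal sketch, where `PostLNBlock` and `PreLNBlock` are hypothetical names and a plain linear layer stands in for the attention or feed-forward sublayer:

```python
import torch
from torch import nn

class PostLNBlock(nn.Module):
    """Post layer norm: normalise after adding the skip connection."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Pre layer norm: normalise inside the skip connection, before the sublayer."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

torch.manual_seed(0)
dim = 16
x = torch.randn(1, 5, dim)
# a plain linear layer stands in for attention or the feed-forward network
post_out = PostLNBlock(dim, nn.Linear(dim, dim))(x)
pre_out = PreLNBlock(dim, nn.Linear(dim, dim))(x)
print(post_out.shape, pre_out.shape)
```

In the pre-norm variant the input `x` reaches the output through an unnormalised identity path, which is one commonly given intuition for its more stable training.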

In [43]:
# use second arrangement
class TransformerEncodingLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
        
    def forward(self, x):
        # Apply layer norm, then copy input into query, key, value
        hidden_state = self.layer_norm_1(x)
        # Apply attention with skip-connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

In [44]:
# test with input embeddings
encoder_layer = TransformerEncodingLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()

Woohoo! First transformer encoder layer from scratch!!! 

Note: there is a caveat with the way we set up the encoder layers: they have no notion of token order. Since multi-headed attention is effectively a fancy weighted sum, the information on token position is lost.
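
This can be checked on a stripped-down attention function. The sketch below uses random stand-in embeddings and keeps query, key and value equal; the function name `attention` and the permutation are illustrative choices.

```python
import torch
import torch.nn.functional as F
from math import sqrt

def attention(x):
    # self-attention with query = key = value = x and no positional information
    scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(x.size(-1))
    return torch.bmm(F.softmax(scores, dim=-1), x)

torch.manual_seed(0)
x = torch.randn(1, 5, 16)
perm = torch.tensor([3, 0, 4, 1, 2])  # an arbitrary reordering of the tokens

out = attention(x)
out_shuffled = attention(x[:, perm])

# reordering the input only reorders the output rows; each token's vector is unchanged
order_blind = torch.allclose(out_shuffled, out[:, perm], atol=1e-5)
print(order_blind)
```

Each token ends up with the same output vector regardless of where it sits in the sequence, so word order carries no signal unless it is injected separately.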

**Positional Embeddings**

Idea: augment the token embeddings with a position-dependent pattern of values arranged in a vector. If the pattern is characteristic for each position, the attention heads and feed-forward layers can learn to incorporate positional information into their transformations.

There are several ways to achieve this; one popular approach is to use a learnable pattern, especially when the pretraining dataset is sufficiently large. This works the same way as the token embeddings, but uses the position index instead of the token ID as input. That way, an efficient way of encoding the token positions is learned during pretraining.

Create a custom `Embeddings` module that combines a token embedding layer, which projects the `input_ids` to a dense hidden state, with a positional embedding layer that does the same for `position_ids`. The resulting embedding is simply the sum of both embeddings.

In [45]:
class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()
        
    def forward(self, input_ids):
        # create position IDs for input sequence
        seq_length = input_ids.size(1)
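        # build the position IDs on the same device as input_ids so the module also works on GPU
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device).unsqueeze(0)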
        # create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        # combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [46]:
embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()

So embedding layer now creates a single, dense embedding for each token! 

While learnable position embeddings are easy to implement and widely used, there are some alternatives:

-  *Absolute positional representations*:
    Transformers can use static patterns of modulated sine and cosine signals to encode the positions of the tokens. This works especially well when there are not large volumes of data available.
-  *Relative positional representations*: 
    Encode the relative positions between tokens. This cannot be set up by just introducing a new relative embedding layer at the beginning, since the relative embedding changes for each token depending on where in the sequence we are attending to it from. Instead, the attention mechanism is modified with additional terms that take the relative position between tokens into account. Models like DeBERTa use such representations.
    
By combining the idea of absolute and relative positional representations, rotary position embeddings achieve excellent results on many tasks; GPT-J is one example of a model with rotary position embeddings.
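
For the static sine and cosine option, a minimal sketch of such a position table is given here. The base of 10000 follows the original Transformer paper; the function name `sinusoidal_positions` and the sizes are illustrative assumptions.

```python
import torch
from math import log

def sinusoidal_positions(max_len, dim):
    """Fixed sine/cosine position table of shape [max_len, dim] (dim must be even)."""
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # [max_len, 1]
    # frequencies form a geometric progression across the embedding dimensions
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float) * (-log(10000.0) / dim))
    table = torch.zeros(max_len, dim)
    table[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    table[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return table

pos_table = sinusoidal_positions(max_len=512, dim=768)
print(pos_table.shape)
```

Such a table has no learnable parameters; it would simply be added to the token embeddings in place of a learned position embedding.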

In [47]:
# putting it all together; full transformer encoder combining embeddings with encoder layers
class TransformerEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerEncodingLayer(config) for _ in range(config.num_hidden_layers)])
        
    def forward(self, x):
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        return x

In [48]:
encoder = TransformerEncoder(config)
encoder(inputs.input_ids).size()

We get a hidden state for each token in the batch! This output format makes the architecture very flexible, and it can be adapted for various applications such as predicting missing tokens in masked language modeling or predicting the start and end position of an answer in question answering.

**Adding a Classification Head**

How to build a classification head like the one we used in Chapter 2?

Transformer models are typically divided into a task-independent body and a task-specific head; this pattern emerges again in Chapter 4 when reviewing Transformer design patterns. So far we have the body, which provides a hidden state for each token, but we only need to make one prediction. Traditionally, the first token in such models is used for the prediction, and we can attach a dropout and a linear layer to make the classification prediction.

In [49]:
class TransformerForSequenceClassification(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        
    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # select hidden state of the first token ([CLS] when special tokens are added)
        x = self.dropout(x)
        x = self.classifier(x)
        return x

In [50]:
# we need to define how many classes we wish to predict before initialising our model
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size() 

So that's how we can combine the encoder with a task-specific head. Now, we can turn our attention to the decoder.

**The Decoder**

Difference between decoder and encoder is that the decoder has two attention sublayers:

- *Masked multi-head self-attention layer*: Ensures the tokens we generate at each timestep are based only on the past outputs and the current token being predicted. Without this, the decoder could cheat during training by copying the target translations; masking the inputs ensures the task is not trivial.
- *Encoder-decoder attention layer*: Performs multi-head attention over output key and value vectors of the encoder stack, with intermediate representations of decoder acting as queries. So the encoder-decoder attention layer learns how to relate tokens from two different sequences, such as two different languages. The decoder has to access the encoder keys and values in each block.
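
A minimal sketch of a single encoder-decoder attention head follows. `CrossAttentionHead` is a hypothetical class written for illustration, and the sequence lengths and sizes are arbitrary; the point is only where the queries, keys and values come from.

```python
import torch
import torch.nn.functional as F
from math import sqrt
from torch import nn

class CrossAttentionHead(nn.Module):
    """Hypothetical single head: queries from the decoder, keys/values from the encoder."""
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, decoder_state, encoder_output):
        query = self.q(decoder_state)   # [batch, tgt_len, head_dim]
        key = self.k(encoder_output)    # [batch, src_len, head_dim]
        value = self.v(encoder_output)  # [batch, src_len, head_dim]
        scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(query.size(-1))
        weights = F.softmax(scores, dim=-1)  # [batch, tgt_len, src_len]
        return torch.bmm(weights, value), weights

torch.manual_seed(0)
encoder_output = torch.randn(1, 7, 32)  # 7 source tokens
decoder_state = torch.randn(1, 4, 32)   # 4 target tokens generated so far
head = CrossAttentionHead(embed_dim=32, head_dim=8)
cross_out, cross_weights = head(decoder_state, encoder_output)
print(cross_out.shape, cross_weights.shape)
```

The weight matrix is rectangular (target length by source length), which is how the layer relates tokens from two different sequences.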

In [51]:
# masked self-attention: introduce a mask matrix with ones on and below the diagonal and zeros above
seq_len = inputs.input_ids.size(-1)
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0) # lower triangular matrix
print(mask.shape)
mask[0]

In [52]:
# use Tensor.masked_fill() to prevent each attention head from peeking at future tokens
# replace the scores at masked (zero) positions with negative infinity; this guarantees that the
# corresponding attention weights are all zero once we take the softmax over the scores, as e^-inf=0
scores.masked_fill(mask==0, -float("inf")) # so only focus on bottom left

In [27]:
# include masked behaviour in scaled dot-product attention earlier
def scaled_dot_product_attention(query, key, value, mask=None):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
    if mask is not None:
        scores = scores.masked_fill(mask==0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights.bmm(value)
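
To make the effect of the mask visible, here is a self-contained sketch that follows the same steps but returns the attention weights instead of the output. The helper name `causal_attention_weights` and the random input are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from math import sqrt

def causal_attention_weights(x):
    # same steps as above, but returning the weights so the mask's effect is visible
    seq_len = x.size(1)
    mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
    scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(x.size(-1))
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1)

torch.manual_seed(0)
x = torch.randn(1, 5, 16)  # random stand-in for token embeddings
causal_weights = causal_attention_weights(x)

# everything above the diagonal is exactly zero, and rows still sum to one
future_weight = torch.triu(causal_weights[0], diagonal=1).sum().item()
print(future_weight)
print(causal_weights[0, 0])  # the first token can only attend to itself
```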

## Transformer Architectures

Three main branches of the transformer family tree (this does not cover all the different subtypes), with representative models:
- BERT (encoder)
- GPT (decoder)
- T5/BigBird (encoder-decoder)

### Encoder Branch

The first encoder-only transformer model was BERT; at the time it outperformed all state-of-the-art (SOTA) models on the GLUE benchmark, which measures natural language understanding (NLU). BERT and its variants:
- *BERT*: Pretrained with two objectives: predicting masked tokens in text (masked language modeling, MLM) and determining whether one text passage is likely to follow another (next sentence prediction, NSP).
- *DistilBERT*: Applies a knowledge distillation technique; achieves 97% of BERT's performance while using 40% less memory and being 60% faster.
- *RoBERTa*: Further improves performance by modifying the pretraining scheme. RoBERTa is trained longer and drops the NSP task.
- *XLM*: Explores different pretraining objectives, including autoregressive language modeling, and introduces translation language modeling, achieving SOTA on several NLU benchmarks.
- *ALBERT*: First, decouples the token embedding dimension from the hidden dimension, allowing the embedding dimension to be small, which saves parameters. Second, all layers share the same parameters, which decreases the number of effective parameters. Finally, the NSP objective is replaced with sentence-order prediction: whether or not two consecutive sentences have been swapped. As a result it uses even fewer parameters.
- *DeBERTa*: Each token is represented as two vectors, one for content and the other for relative position, so self-attention can better model the dependency of nearby token pairs. Absolute position is still important, so it is added just before the softmax layer of the token decoding head. First model to beat the human baseline on the SuperGLUE benchmark.

### Decoder Branch

Decoder models are exceptionally good at predicting the next word in a sequence, so they are used for text generation tasks. Progress has been fueled by using larger datasets and scaling language models to larger and larger sizes.
- *GPT*: Combines a novel and efficient transformer decoder architecture with transfer learning. Pretrained by predicting the next word based on the previous ones; trained on BookCorpus, it achieved great results on downstream tasks such as classification.
- *GPT-2*: An upscaled GPT that can produce longer sequences of text.
- *CTRL*: Adds "control tokens" to the beginning of the sequence, allowing the style of the generated text to be controlled, which enables diverse generation.
- *GPT-3*: Upscales GPT-2 by a factor of 100 to 175 billion parameters! Motivated by an analysis of language models at different scales, which found power-law relationships between compute, dataset size and performance. Can generate realistic text passages and exhibits few-shot learning capabilities, e.g. translating text to code.
- *GPT-Neo/GPT-J-6B*: Trained by EleutherAI, a collective of researchers who aim to re-create and release GPT-3 scale models. These models are similar in size to models in the GPT series and are competitive with OpenAI's models.

### Encoder-Decoder Branch

There are several encoder-decoder variants of the transformer architecture that have novel applications across NLU and NLG (natural language generation).
- *T5*: Text-to-text tasks. All tasks are framed as sequence-to-sequence tasks, where adopting an encoder-decoder architecture is natural. The decoder must generate the label as normal text instead of a class. Uses the original Transformer architecture and is pretrained on the large crawled C4 dataset with masked language modelling, as well as on the SuperGLUE tasks by translating all of them to text-to-text tasks. The largest model, with 11 billion parameters, yielded SOTA results.
- *BART*: Combines the pretraining procedures of BERT and GPT within an encoder-decoder architecture. Input sequences undergo one of several possible transformations, from simple masking to sentence permutation, token deletion and document rotation. The modified inputs are passed through the encoder, and the decoder must reconstruct the original text. This makes the model more flexible, so it can be used for both NLU and NLG, achieving SOTA on both.
- *M2M-100*: Translation model that can translate between any of 100 languages, allowing high-quality translation of rare and underrepresented languages. The model uses prefix tokens (similar to the special [CLS] token) to indicate the source and target language.
- *BigBird*: Overcomes the maximum context size imposed by the quadratic memory scaling of attention. BigBird uses a sparse form of attention that scales linearly, raising the limit from 512 tokens to 4,096. Useful in cases with long dependencies such as text summarization.
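
To get a feel for quadratic versus linear scaling, the sketch below counts attention scores for full attention and for a simple banded pattern. The banded mask is a simplified stand-in and not BigBird's actual sparse pattern, and the window size of 64 is an arbitrary assumption.

```python
def full_attention_entries(n):
    # every token attends to every token: n * n scores
    return n * n

def banded_attention_entries(n, window):
    # each token attends only to neighbours at most `window` positions away
    return sum(min(i, window) + min(n - 1 - i, window) + 1 for i in range(n))

for n in (512, 4096):
    print(n, full_attention_entries(n), banded_attention_entries(n, window=64))

# going from 512 to 4,096 tokens multiplies the full attention matrix by 64,
# while the banded count grows roughly in proportion to the sequence length
full_growth = full_attention_entries(4096) // full_attention_entries(512)
banded_growth = banded_attention_entries(4096, 64) / banded_attention_entries(512, 64)
print(full_growth, banded_growth)
```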

All the above models have pretrained checkpoints and can be fine-tuned to a use case with HuggingFace Transformers.

## Conclusion
In this chapter we dived into transformers and self-attention, and added the necessary parts to build a transformer encoder:
- Embedding layers for tokens and positional information
- A feed-forward layer to complement the attention heads
- A classification head on top of the model body to make predictions

We also looked at the decoder side of the transformer architecture and reviewed the most important model architectures. As a next step we go beyond simple classification and build a multilingual named entity recognition (NER) model!