**Initialization**
- I put these three lines at the top of every notebook. The first two enable automatic reloading of imported modules, so edits to project code are picked up without restarting the kernel. The third renders Matplotlib plots inline in the notebook.

In [1]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Downloading Libraries and Dependencies**
- I install (if needed) and import all the libraries and dependencies required for the project in a single cell.

In [17]:
#@ IMPORTING MODULES: UNCOMMENT BELOW:
# !pip install transformers[sentencepiece]
# !pip install bertviz
from transformers import AutoTokenizer
from transformers import AutoConfig
from transformers import AutoModel
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show
from bertviz import head_view
import torch
from torch import nn
import torch.nn.functional as F
from math import sqrt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#@ IGNORING WARNINGS: 
import warnings
warnings.filterwarnings("ignore")

**Note:**
- In an **encoder**-only transformer architecture, the numerical representation computed for a given token depends on both the left context (the tokens before it) and the right context (the tokens after it). This is called **bidirectional attention**.
- In a **decoder**-only transformer architecture, the numerical representation computed for a given token depends only on the left context. This is called **causal** or **autoregressive attention**.
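
The difference between the two attention patterns can be sketched with a mask over the attention scores. This is a minimal illustration on random toy scores, not the way any particular model implements it; the variable names (`causal_mask`, `masked_scores`) are my own.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len = 4
scores = torch.randn(1, seq_len, seq_len)            # Toy attention scores: (batch, query, key).

# Encoder-style (bidirectional): every token attends to every position.
bidirectional = F.softmax(scores, dim=-1)

# Decoder-style (causal): position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
causal = F.softmax(masked_scores, dim=-1)

print(bidirectional[0])                              # Dense matrix of weights.
print(causal[0])                                     # Zeros above the diagonal.
```

Setting the masked scores to negative infinity before the softmax gives those positions a weight of exactly zero, while each row still sums to 1.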

**The Encoder**

In [5]:
#@ VISUALIZING SCALED DOT PRODUCT ATTENTION:
model_ckpt = "bert-base-uncased"                            # Initializing model checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)       # Initializing bert tokenizer.
model = BertModel.from_pretrained(model_ckpt)               # Initializing pretrained bert model. 
text = "time flies like an arrow."                          # Initializing a text.
show(model, "bert", tokenizer, text, display_mode="light",
     layer=0, head=8)                                       # Inspecting bert model.

<IPython.core.display.Javascript object>

In [6]:
#@ INITIALIZING TOKENIZATION:
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)         # Initializing tokenization.
inputs.input_ids                                                                # Inspecting inputs. 

tensor([[ 2051, 10029,  2066,  2019,  8612,  1012]])

In [7]:
#@ INITIALIZING EMBEDDINGS: 
config = AutoConfig.from_pretrained(model_ckpt)                                 # Initializing configurations.
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)                 # Initializing embedding layers. 
token_emb                                                                       # Inspection.

Embedding(30522, 768)

In [8]:
#@ INITIALIZING TOKEN EMBEDDINGS:
input_embeds = token_emb(inputs.input_ids)                                      # Initializing token embeddings.
input_embeds.size()                                                             # Inspection.

torch.Size([1, 6, 768])
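
An embedding layer is a lookup table: each token id selects one row of the weight matrix. The sketch below checks this on a small, randomly initialized layer; the sizes (10 ids, 4 dimensions) and the ids themselves are arbitrary stand-ins for the real vocabulary.

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_emb = nn.Embedding(10, 4)                        # Toy table: 10 ids, 4 dimensions each.
ids = torch.tensor([[2, 5, 5, 9]])                   # (batch, seq_len), arbitrary ids.

looked_up = toy_emb(ids)                             # Forward pass of the embedding layer.
indexed = toy_emb.weight[ids]                        # Plain row indexing into the weight matrix.

print(looked_up.shape)                               # torch.Size([1, 4, 4])
print(torch.equal(looked_up, indexed))               # True
```

Because the lookup ignores position and context, repeated ids (here the two 5s) get identical vectors at this stage.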

In [9]:
#@ INITIALIZING QKV VECTORS:
query = key = value = input_embeds                                  # No learned projections yet: Q, K and V are the embeddings.
dim_k = key.size(-1)                                                # Key dimension used for scaling.
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)         # Scaled dot-product attention scores.
scores.size()                                                       # Inspection.

torch.Size([1, 6, 6])
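
The division by `sqrt(dim_k)` keeps the scores in a range where the softmax does not saturate. A rough way to see this, assuming independent unit-variance vectors (an idealization, not the actual BERT embeddings), is to compare the spread of raw and scaled dot products:

```python
import torch
from math import sqrt

torch.manual_seed(0)
dim_k = 768
q = torch.randn(10_000, dim_k)                       # Random unit-variance "queries".
k = torch.randn(10_000, dim_k)                       # Random unit-variance "keys".

raw = (q * k).sum(dim=-1)                            # One dot product per row.
scaled = raw / sqrt(dim_k)                           # Same scaling as in the cell above.

print(f"raw std:    {raw.std().item():.2f}")         # Close to sqrt(768).
print(f"scaled std: {scaled.std().item():.2f}")      # Close to 1.
```

Under this assumption the raw dot products have a standard deviation near `sqrt(dim_k)`, so without scaling the softmax would put almost all of its weight on a single position.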

In [10]:
#@ IMPLEMENTATION OF SOFTMAX LAYER:
weights = F.softmax(scores, dim=-1)                                 # Normalizing scores into attention weights.
weights.sum(dim=-1)                                                 # Sanity check: each row sums to 1 (not displayed).
attn_outputs = torch.bmm(weights, value)                            # Weighted sum of the value vectors.
attn_outputs.shape                                                  # Inspection. 

torch.Size([1, 6, 768])

**Scaled Dot Product Attention**

In [11]:
#@ INITIALIZING SCALED DOT-PRODUCT ATTENTION:
def scaled_dot_product_attention(query, key, value):                # Defining function. 
    dim_k = query.size(-1)                                          # Initializing dimensions. 
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)    # Initializing attention scores. 
    weights = F.softmax(scores, dim=-1)                             # Implementation of softmax layer.
    return torch.bmm(weights, value)                                # Batched matrix multiplication.
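
As a quick usage check, the function can be called on random tensors and, where available, compared with PyTorch's built-in `F.scaled_dot_product_attention` (added in PyTorch 2.0, hence the `hasattr` guard). The input `x` is a random stand-in, and the function is repeated here only so the block runs on its own.

```python
import torch
import torch.nn.functional as F
from math import sqrt

def scaled_dot_product_attention(query, key, value):
    # Same computation as the notebook's function, repeated for a standalone run.
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

torch.manual_seed(0)
x = torch.randn(1, 6, 64)                            # (batch, seq_len, head_dim), random stand-in.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)                                     # torch.Size([1, 6, 64])

# Optional cross-check against the built-in, if this PyTorch version provides it.
if hasattr(F, "scaled_dot_product_attention"):
    ref = F.scaled_dot_product_attention(x, x, x)
    print(torch.allclose(out, ref, atol=1e-5))
```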

**Multi-Headed Attention**

In [13]:
#@ IMPLEMENTATION OF SINGLE ATTENTION HEAD: 
class AttentionHead(nn.Module):                                             # Defining attention head.
    def __init__(self, embed_dim, head_dim):                                # Constructor function. 
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)                             # Initializing query.
        self.k = nn.Linear(embed_dim, head_dim)                             # Initializing key.
        self.v = nn.Linear(embed_dim, head_dim)                             # Initializing value. 
    
    def forward(self, hidden_state):                                        # Forward propagation function. 
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state),self.k(hidden_state),self.v(hidden_state)
        )                                                                   # Initializing attention outputs. 
        return attn_outputs                                                 # Getting attention outputs. 

In [14]:
#@ INITIALIZING MULTI-HEADED ATTENTION LAYERS:
class MultiHeadAttention(nn.Module):                                        # Defining class. 
    def __init__(self, config):                                             # Constructor function. 
        super().__init__()
        embed_dim = config.hidden_size                                      # Initializing embedding dimensions. 
        num_heads = config.num_attention_heads                              # Initializing attention heads. 
        head_dim = embed_dim // num_heads                                   # Initializing head dimensions. 
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]  # Implementation of attention head. 
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)                # Output linear layer.
    
    def forward(self, hidden_state):                                        # Forward propagation function. 
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)        # Concatenating head outputs along the feature dimension.
        x = self.output_linear(x)                                           # Mixing the heads with the output linear layer.
        return x

In [15]:
#@ IMPLEMENTATION OF MULTI HEADED ATTENTION LAYER:
multihead_attn = MultiHeadAttention(config)                                 # Initialization.
attn_output = multihead_attn(input_embeds)                                  # Implementation of multi headed attention. 
attn_output.size()                                                          # Inspection.

torch.Size([1, 6, 768])
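
With `hidden_size=768` and 12 heads, each head works in `768 // 12 = 64` dimensions, and concatenating the 12 outputs restores the 768-dimensional shape. The per-head `ModuleList` above is easy to read; a common alternative computes all heads in one batched operation. The sketch below is one possible version under that assumption. `FusedMultiHeadAttention` is a hypothetical name, and it is not weight-compatible with the class above.

```python
import torch
from torch import nn
import torch.nn.functional as F
from math import sqrt

class FusedMultiHeadAttention(nn.Module):            # Hypothetical name, not from the notebook.
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)            # Q, K, V for all heads at once.
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_state):
        batch, seq_len, embed_dim = hidden_state.shape
        qkv = self.qkv(hidden_state)                              # (batch, seq, 3 * embed)
        qkv = qkv.view(batch, seq_len, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                      # Each: (batch, heads, seq, head_dim)
        scores = q @ k.transpose(-2, -1) / sqrt(self.head_dim)    # (batch, heads, seq, seq)
        weights = F.softmax(scores, dim=-1)
        x = (weights @ v).transpose(1, 2)                         # (batch, seq, heads, head_dim)
        x = x.reshape(batch, seq_len, embed_dim)                  # Concatenating the heads.
        return self.output_linear(x)

torch.manual_seed(0)
attn = FusedMultiHeadAttention(embed_dim=768, num_heads=12)
out = attn(torch.randn(1, 6, 768))
print(attn.head_dim)                                 # 64
print(out.shape)                                     # torch.Size([1, 6, 768])
```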

In [19]:
#@ VISUALIZING ATTENTION:
model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)       # Initializing pretrained model.
sentence_a = "time flies like an arrow"                                     # Initializing text example.
sentence_b = "fruit flies like a banana"                                    # Initializing text example.
viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")         # Tokenization.
attention = model(**viz_inputs).attentions                                  # Generating attention. 
sentence_b_start = (viz_inputs.token_type_ids==0).sum(dim=1)                # Index of the first token of sentence_b.
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])           # Generating tokens.  
head_view(attention, tokens, sentence_b_start, heads=[8])                   # Visualization.

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<IPython.core.display.Javascript object>
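
The `sentence_b_start` line counts the tokens whose `token_type_ids` are 0, which is the index where the second sentence begins. The sketch below reproduces that arithmetic on a hand-written token list, so no model download is needed; the exact token split is my assumption of what the tokenizer returns for this sentence pair.

```python
import torch

# Hand-written stand-in for the tokenizer output (assumed split).
tokens = ["[CLS]", "time", "flies", "like", "an", "arrow", "[SEP]",
          "fruit", "flies", "like", "a", "banana", "[SEP]"]
token_type_ids = torch.tensor([[0] * 7 + [1] * 6])   # 0 = sentence A segment, 1 = sentence B segment.

# Number of segment-0 tokens equals the index of the first segment-1 token.
sentence_b_start = (token_type_ids == 0).sum(dim=1)
print(sentence_b_start)                              # tensor([7])
print(tokens[sentence_b_start.item()])               # fruit
```

The first `[SEP]` belongs to segment 0, so the count includes `[CLS]`, the five words of the first sentence, and that separator.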