# BERT implemented from scratch

In this file, I implemented BERT from scratch, focusing on the core concepts of transformer architecture, attention mechanisms, and matrix operations. BERT (Bidirectional Encoder Representations from Transformers) is a pivotal model in NLP that uses a bidirectional approach to understand the context of words in a sentence.

<br>

I will also demonstrate how to load tensors directly from the pre-trained BERT model provided by Hugging Face. Before running this file, you need to download the weights. Here is the official link to download the weights: [Hugging Face BERT](https://huggingface.co/bert-base-uncased)

<div>
    <img src="images/all-steps.png" width="800"/>
</div>

## Model Initialization and Input Processing

### Tokenization
BERT uses a WordPiece tokenizer, which I won't be implementing here. For a closely related subword scheme (BPE), Andrej Karpathy has a neat implementation that you might find useful.
<br>
You can check out his implementation at this link: https://github.com/karpathy/minbpe
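
To give a feel for what the BERT tokenizer does to a single word, here is a minimal sketch of WordPiece-style greedy longest-match-first splitting. The vocabulary `toy_vocab` and the function name `wordpiece_tokenize` are made up for illustration; the real `bert-base-uncased` vocabulary has 30,522 entries and the real tokenizer also handles lowercasing and punctuation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT one.
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("unplayable", toy_vocab))
```

With this toy vocabulary, "playing" splits into `play` + `##ing`, and a word with no matching pieces falls back to `[UNK]`.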

In [2]:
# Load Model
import torch
from transformers import BertForMaskedLM, BertTokenizer

model_path="bert-base-uncased"

tokenizer=BertTokenizer.from_pretrained(model_path)

In [3]:
tokenizer.encode("hello world!")

[101, 7592, 2088, 999, 102]

In [4]:
tokenizer.decode(tokenizer.encode("hello world!"))

'[CLS] hello world! [SEP]'

### Loading and Inspecting Model Configuration and Weights

Here, the model configuration and weights are loaded from a pretrained model file. This step ensures that the necessary parameters are ready for the forward pass.

Normally, reading a model file depends on how the model classes are written and the variable names within them. 
However, since we are implementing BERT from scratch, we will read the file one tensor at a time, carefully examining the embedding layers, attention heads, and feed-forward networks that make up BERT’s architecture.

In [5]:
# 1. Load the pre-trained model using transformers

# Load model configuration and weights
model = BertForMaskedLM.from_pretrained(model_path, torch_dtype=torch.float32)

Some weights of the model checkpoint at /home/jlpanc/anaconda3/envs/codef/STA/bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
model

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

In [7]:
model = model.state_dict()

In [8]:
model.keys()

odict_keys(['bert.embeddings.word_embeddings.weight', 'bert.embeddings.position_embeddings.weight', 'bert.embeddings.token_type_embeddings.weight', 'bert.embeddings.LayerNorm.weight', 'bert.embeddings.LayerNorm.bias', 'bert.encoder.layer.0.attention.self.query.weight', 'bert.encoder.layer.0.attention.self.query.bias', 'bert.encoder.layer.0.attention.self.key.weight', 'bert.encoder.layer.0.attention.self.key.bias', 'bert.encoder.layer.0.attention.self.value.weight', 'bert.encoder.layer.0.attention.self.value.bias', 'bert.encoder.layer.0.attention.output.dense.weight', 'bert.encoder.layer.0.attention.output.dense.bias', 'bert.encoder.layer.0.attention.output.LayerNorm.weight', 'bert.encoder.layer.0.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.intermediate.dense.weight', 'bert.encoder.layer.0.intermediate.dense.bias', 'bert.encoder.layer.0.output.dense.weight', 'bert.encoder.layer.0.output.dense.bias', 'bert.encoder.layer.0.output.LayerNorm.weight', 'bert.encoder.layer.0.output

In [9]:
# float32
model['bert.embeddings.word_embeddings.weight'].shape

torch.Size([30522, 768])

In [10]:
import json
with open("/home/jlpanc/anaconda3/envs/codef/STA/bert-base-uncased/config.json", "r") as f:
    config = json.load(f)
config

{'architectures': ['BertForMaskedLM'],
 'attention_probs_dropout_prob': 0.1,
 'gradient_checkpointing': False,
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'layer_norm_eps': 1e-12,
 'max_position_embeddings': 512,
 'model_type': 'bert',
 'num_attention_heads': 12,
 'num_hidden_layers': 12,
 'pad_token_id': 0,
 'position_embedding_type': 'absolute',
 'transformers_version': '4.6.0.dev0',
 'type_vocab_size': 2,
 'use_cache': True,
 'vocab_size': 30522}

### Extracting Model Parameters from Config
We use this config to infer details about the model like:

1. The model has 12 transformer layers.
2. Each multi-head attention block has 12 heads.
3. The vocabulary size is 30,522 tokens.
4. The hidden size is 768 dimensions.
5. The intermediate size (for the feed-forward layers) is 3072 dimensions.
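
A couple of derived quantities follow directly from these numbers. The sketch below copies the relevant values from the config shown above into a small dictionary (`cfg` is just a local name for this example) and computes the per-head dimension and the feed-forward expansion factor.

```python
# Values copied from the config printed above.
cfg = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
}

# Each head works on an equal slice of the hidden dimension.
head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]

# The feed-forward block widens the hidden size by this factor.
ffn_expansion = cfg["intermediate_size"] // cfg["hidden_size"]

print("head_dim:", head_dim)
print("ffn_expansion:", ffn_expansion)
```

These two numbers (64 and 4) are the ones hard-coded later when the hidden dimension is split into heads and when the feed-forward layer expands to 3072.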

In [None]:
dim = config["hidden_size"]
n_layers = config["num_hidden_layers"]
n_heads = config["num_attention_heads"]
vocab_size = config["vocab_size"]
norm_eps = config["layer_norm_eps"]
eps = config["layer_norm_eps"]

### Converting text to tokens
Here, we use the `BertTokenizer` loaded earlier from Hugging Face `transformers` as the tokenizer.
<div>
    <img src="images/embedding_layers.png" width="600"/>
</div>

In [12]:
# 1. Tokenize the prompt into token IDs using the BERT tokenizer.
prompt = "When in Rome, do as the [MASK] do."
input_ids = tokenizer.encode(prompt, return_tensors='pt')
print("input_ids: ", input_ids)
# 2. Create a tensor of token type IDs (all zeros, as we have only one sentence).
token_type_ids = torch.zeros_like(input_ids)
print("token_type_ids: ", token_type_ids)
# 3. Drop the batch dimension to get a 1-D tensor of token IDs (plain indexing avoids the copy-construct warning from torch.tensor).
input_ids = input_ids[0]
# 4. Calculate and print the number of tokens in the input prompt.
q_len = len(input_ids)
print("q_len: ", q_len)

input_ids:  tensor([[ 101, 2043, 1999, 4199, 1010, 2079, 2004, 1996,  103, 2079, 1012,  102]])
token_type_ids:  tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
q_len:  12




### Generating Token Embeddings
In this section, we convert input tokens into their corresponding embeddings using BERT’s pre-trained embedding layers. While BERT uses inbuilt neural network modules for this process, it’s crucial to understand the transformation.

So, our [12x1] tokens are now transformed into [12x768], i.e., 12 embeddings (one for each token) of length 768.

Note: Keep track of the tensor shapes throughout the process; it makes understanding the entire architecture much easier.
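
BERT's input embedding is the element-wise sum of three lookups: word, position, and segment (token type). The sketch below shows that sum in plain Python with a hidden size of 4 instead of 768. All table values and the name `embed` are made up for illustration.

```python
hidden = 4  # toy hidden size (BERT uses 768)

# Made-up lookup tables: one row per token ID, position, and segment.
word_table = [[0.1 * (i + j) for j in range(hidden)] for i in range(6)]
position_table = [[0.01 * p] * hidden for p in range(8)]
segment_table = [[0.0] * hidden, [1.0] * hidden]


def embed(token_ids, segment_ids):
    """Sum word, position, and segment embeddings for each token."""
    rows = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows.append([
            w + p + s
            for w, p, s in zip(word_table[tok], position_table[pos], segment_table[seg])
        ])
    return rows


out = embed([2, 5, 3], [0, 0, 0])
print(len(out), len(out[0]))  # number of tokens, hidden size
```

Three token IDs go in and three vectors of length `hidden` come out, which mirrors the [12] to [12x768] change in the real model.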

<div>
    <img src="images/embeddings_output.png" width="600"/>
</div>

In [15]:
# 1. Load the word embeddings from the pre-trained BERT model
embedding_layer = torch.nn.Embedding.from_pretrained(model['bert.embeddings.word_embeddings.weight'])
# 2. Load the position embeddings from the pre-trained BERT model
position_embeddings = torch.nn.Embedding.from_pretrained(model['bert.embeddings.position_embeddings.weight'])
# 3. Load the segment (token type) embeddings from the pre-trained BERT model
segment_embeddings = torch.nn.Embedding.from_pretrained(model['bert.embeddings.token_type_embeddings.weight'])
# 4. Look up the word embedding for each token ID
token_embeddings_unnormalized = embedding_layer(input_ids)

In [14]:
token_embeddings_unnormalized

tensor([[ 0.0136, -0.0265, -0.0235,  ...,  0.0087,  0.0071,  0.0151],
        [-0.0463,  0.0259, -0.0240,  ..., -0.0416,  0.0289,  0.0316],
        [-0.0278,  0.0039, -0.0035,  ...,  0.0078, -0.0306,  0.0221],
        ...,
        [ 0.0066, -0.0533,  0.0057,  ..., -0.0140, -0.0522, -0.0088],
        [-0.0207, -0.0020, -0.0118,  ...,  0.0128,  0.0200,  0.0259],
        [-0.0145, -0.0100,  0.0060,  ..., -0.0250,  0.0046, -0.0015]])

In [16]:
token_embeddings_unnormalized.shape

torch.Size([12, 768])

In [21]:
position_ids = torch.arange(q_len, dtype=torch.long).unsqueeze(0)
position_embeds = position_embeddings(position_ids)
position_embeds.shape

torch.Size([1, 12, 768])

In [25]:
segment_embeds = segment_embeddings(token_type_ids)
segment_embeds.shape

torch.Size([1, 12, 768])

In [26]:
embeddings_unnormalized = token_embeddings_unnormalized+position_embeds+segment_embeds
embeddings_unnormalized.shape

torch.Size([1, 12, 768])

In [27]:
embeddings_unnormalized

tensor([[[ 0.0316, -0.0411, -0.0564,  ...,  0.0021,  0.0044,  0.0219],
         [-0.0381,  0.0392, -0.0397,  ..., -0.0193,  0.0553,  0.0177],
         [-0.0387,  0.0129, -0.0114,  ...,  0.0161, -0.0153,  0.0062],
         ...,
         [ 0.0135, -0.0457, -0.0071,  ...,  0.0045, -0.0567, -0.0130],
         [-0.0299,  0.0069, -0.0199,  ...,  0.0310,  0.0295,  0.0295],
         [-0.0253,  0.0052, -0.0068,  ..., -0.0129,  0.0153, -0.0015]]])

### Applying Layer Normalization
In BERT, Layer Normalization is applied after the embedding layers and again after each attention and feed-forward sub-layer.
Note: This step does not change the shape of the tensor; it just normalizes the values across the hidden dimension.

### Things to keep in mind:
1. Layer normalization rescales each token's vector to zero mean and unit variance across the hidden dimension (before the learned scale and shift), which helps stabilize training.
2. A small epsilon value (from the configuration) is added for numerical stability, preventing division by zero.
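
Written out for a single token vector $x$ of length $d$ (here $d = 768$), with learned scale $\gamma$, learned shift $\beta$, and the config's $\epsilon = 10^{-12}$, the operation is:

$$
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$

Note that the variance uses the biased estimate (division by $d$, not $d - 1$).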


### Building the First Layer of the Transformer

#### Normalization
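BERT uses post-layer normalization: the summed embeddings are normalized once before the first transformer layer, and each attention and feed-forward sub-layer output is normalized again after its residual connection.

Here I read `bert.embeddings.LayerNorm` from the model dictionary, since this normalization belongs to the embedding block rather than to `encoder.layer.0`.
After applying the normalization, the tensor shape remains the same as the input, [12x768] (with a leading batch dimension of 1), but the values are now normalized.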

<div>
    <img src="images/layer_norm.png" width="600"/>
</div>

In [29]:
def layer_norm(x, gamma, beta, eps=1e-12):
    """
    Parameters:
    x (torch.Tensor): Input tensor with shape (..., num_features); normalization runs over the last dimension
    gamma (torch.Tensor): Scale parameter, with the same shape as the last dimension of x
    beta (torch.Tensor): Shift parameter, with the same shape as the last dimension of x
    eps (float): A small constant for numerical stability (to prevent division by zero)
    
    Returns:
    torch.Tensor: The output after applying LayerNorm
    """
    # Calculate the mean and variance
    mean = x.mean(dim=-1, keepdim=True)
    variance = x.var(dim=-1, keepdim=True, unbiased=False)
    
    # Normalization
    x_normalized = (x - mean) / torch.sqrt(variance + eps)
    
    # Scaling and Shifting
    out = gamma * x_normalized + beta
    
    return out

#### Applying the normalization
Using the LayerNorm weights extracted from the model dictionary, we normalize the summed embeddings.

The output tensor still has shape [12x768], but its values are now normalized.

<div>
    <img src="images/embeding_norm1.png" width="500"/>
</div>


In [30]:
# Initialize LayerNorm and Dropout
# layer_norm = torch.nn.LayerNorm(768, eps=1e-12)
# layer_norm.weight.data = model['bert.embeddings.LayerNorm.weight']
# layer_norm.bias.data = model['bert.embeddings.LayerNorm.bias']

# Perform Layer Normalization
normalized_embeddings = layer_norm(embeddings_unnormalized, model['bert.embeddings.LayerNorm.weight'], model['bert.embeddings.LayerNorm.bias'], eps=1e-12)

# Apply Dropout
# final_embeddings = dropout(normalized_embeddings)

# Print results
print("Embeddings after LayerNorm:", normalized_embeddings)
# print("Final Embeddings:", final_embeddings)

Embeddings after LayerNorm: tensor([[[ 0.1686, -0.2858, -0.3261,  ..., -0.0276,  0.0383,  0.1640],
         [-0.4807,  0.8188, -0.4227,  ..., -0.1415,  1.1064,  0.5430],
         [-0.4992,  0.3891,  0.0274,  ...,  0.3810, -0.0397,  0.3439],
         ...,
         [ 0.5077, -0.5198,  0.1786,  ...,  0.2883, -0.6555,  0.0927],
         [-0.3585,  0.2777, -0.1210,  ...,  0.5949,  0.6856,  0.7453],
         [-0.4771,  0.0871, -0.0770,  ..., -0.2191,  0.3020,  0.0196]]])


In [31]:
normalized_embeddings.shape

torch.Size([1, 12, 768])

## Self-Attention Mechanism
This section dives into how the self-attention mechanism is implemented in BERT, including the computation of Query, Key, and Value matrices, and how they contribute to generating attention scores.

In [32]:
# Extract the query, key, and value weights for the first attention layer from the BERT model
q_layer0 = model["bert.encoder.layer.0.attention.self.query.weight"]
k_layer0 = model["bert.encoder.layer.0.attention.self.key.weight"]
v_layer0 = model["bert.encoder.layer.0.attention.self.value.weight"]
# Extract the biases for query, key, and value in the first attention layer
q_layer0_bias = model['bert.encoder.layer.0.attention.self.query.bias']
k_layer0_bias = model['bert.encoder.layer.0.attention.self.key.bias']
v_layer0_bias = model['bert.encoder.layer.0.attention.self.value.bias']

In [33]:
# Compute the query, key, and value states by multiplying the normalized embeddings with the respective weights and adding the bias
query_states = torch.matmul(normalized_embeddings, q_layer0.T)+q_layer0_bias
key_states = torch.matmul(normalized_embeddings, k_layer0.T)+k_layer0_bias
value_states = torch.matmul(normalized_embeddings, v_layer0.T)+v_layer0_bias

In [34]:
# Define a function to reshape and permute the states for multi-head attention
def transpose_for_scores(x):
    new_x_shape = x.size()[:-1] + (12, 64) # (num_attention_heads, attention_head_size)
    x = x.view(new_x_shape) # Reshape the tensor to separate attention heads
    return x.permute(0, 2, 1, 3) # Permute the tensor dimensions for attention computation

## Explanation of multi-head attention
Multi-head attention is a core component of the BERT architecture, enabling the model to focus on different parts of the input sequence simultaneously. Instead of calculating attention just once, multi-head attention performs the attention mechanism multiple times in parallel, each time using different learned linear projections of the query, key, and value vectors. This process creates multiple “heads” of attention, with each head focusing on different aspects of the input.

In BERT, each attention head operates independently, capturing diverse relationships between tokens by attending to different positions in the sequence. After computing attention for each head, the results are concatenated and passed through a final linear transformation to produce the output. This approach enriches BERT’s ability to understand context by integrating information from multiple attention perspectives, making it highly effective in natural language processing tasks.
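
The head split itself is only a reshape: each head receives a contiguous slice of the hidden dimension. Below is a plain-Python sketch of that split and its inverse, using nested lists and toy sizes. `split_heads` and `merge_heads` are names chosen for this example; in the notebook the same thing is done with `view` and `permute` on tensors.

```python
def split_heads(x, n_heads):
    """Turn [seq_len][hidden] into [n_heads][seq_len][head_dim]."""
    head_dim = len(x[0]) // n_heads
    return [
        [row[h * head_dim:(h + 1) * head_dim] for row in x]
        for h in range(n_heads)
    ]


def merge_heads(heads):
    """Inverse of split_heads: concatenate the per-head slices per token."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]


# Toy input: 2 tokens, hidden size 8, split into 2 heads of size 4.
x = [list(range(0, 8)), list(range(8, 16))]
heads = split_heads(x, n_heads=2)
print(heads[1][0])              # second head's slice of the first token
print(merge_heads(heads) == x)  # merging restores the original layout
```

For BERT base the same operation turns [12 tokens x 768] into 12 heads of [12 tokens x 64].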

In [37]:
# Apply the transpose_for_scores function to the query, key, and value states
transposed_query_states = transpose_for_scores(query_states)
transposed_key_states = transpose_for_scores(key_states)
transposed_value_states = transpose_for_scores(value_states)

In [39]:
print('transposed_query_states shape: ',transposed_query_states.shape)
print('transposed_key_states shape: ',transposed_key_states.shape)
print('transposed_value_states shape: ',transposed_value_states.shape)

transposed_query_states shape:  torch.Size([1, 12, 12, 64])
transposed_key_states shape:  torch.Size([1, 12, 12, 64])
transposed_value_states shape:  torch.Size([1, 12, 12, 64])


### Calculating Attention Score
We then multiply the query and key matrices, which is the core step of self-attention:

- Performing this operation yields a score that maps the relationship between each token and every other token in the sequence.
- This score indicates how well each token’s query aligns with every other token’s key.
- For each head, the resulting attention score matrix has a shape of [12x12], where 12 is the number of tokens in the input sequence. Across all 12 heads, the scores have shape [1, 12, 12, 12].
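
For a single head, the steps above can be sketched in plain Python: compute the scaled query-key scores, turn each row into weights with a softmax, and use the weights to mix the value vectors. The inputs are toy values, and the helper names `softmax` and `attention` are just for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are [seq_len][head_dim]."""
    d = len(q[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q_row, k_row)) / math.sqrt(d) for k_row in k]
        for q_row in q
    ]
    weights = [softmax(row) for row in scores]
    out = [
        [sum(w * v_row[j] for w, v_row in zip(w_row, v)) for j in range(len(v[0]))]
        for w_row in weights
    ]
    return weights, out


# Toy example: 3 tokens, head size 2.
q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, out = attention(q, k, v)
print(len(weights), len(weights[0]))  # seq_len x seq_len weight matrix
print(len(out), len(out[0]))          # seq_len x head_dim output
```

Each row of `weights` sums to 1, and no causal mask is applied, so every token can attend to every other token in both directions.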


#### Attention Implemented from Scratch

Let's load the attention heads of the first layer of the transformer.

<div>
    <img src="images/qkv.png" width="600"/>
</div>

### Loading the Query, Key, Value, and Output Matrices

When we load the query, key, value, and output matrices from the model, we notice that their shapes are:
- Query: [768x768]
- Key: [768x768]
- Value: [768x768]
- Output: [768x768]

At first glance, this might seem unusual because ideally, we would want separate q, k, v, and o matrices for each attention head individually.

However, BERT’s authors bundled these matrices together for efficiency, which allows the attention head calculations to run in parallel.

Now, let's unwrap these bundled matrices to access each attention head individually. 
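
As a sanity check on the bundling described above, the sketch below shows on toy sizes that projecting with the full weight matrix and then slicing out one head's features gives the same result as projecting with only that head's rows of the weight matrix. It assumes the PyTorch `Linear` layout, where the weight has shape [out_features, in_features]; all values are made up.

```python
def linear(x, w, b):
    """y = x @ w.T + b for x: [seq][in], w: [out][in], b: [out]."""
    return [
        [sum(xi * wi for xi, wi in zip(row, w_row)) + b_j for w_row, b_j in zip(w, b)]
        for row in x
    ]


hidden, n_heads = 4, 2          # toy sizes (BERT base: 768 and 12)
head_dim = hidden // n_heads

# Made-up bundled query weight and bias.
w_q = [[0.1 * (i * hidden + j) for j in range(hidden)] for i in range(hidden)]
b_q = [0.5, -0.5, 1.0, 0.0]
x = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 0.5, 2.0]]

full = linear(x, w_q, b_q)      # all heads at once

matches = []
for h in range(n_heads):
    rows = slice(h * head_dim, (h + 1) * head_dim)
    per_head = linear(x, w_q[rows], b_q[rows])  # only this head's rows
    sliced = [row[rows] for row in full]        # this head's output features
    matches.append(all(
        abs(a - b) < 1e-12
        for r1, r2 in zip(per_head, sliced)
        for a, b in zip(r1, r2)
    ))

print(matches)
```

In other words, head `h` corresponds to rows `h * 64` to `(h + 1) * 64` of each [768x768] weight matrix, so one large matrix multiply computes all heads together.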

In [40]:
# Compute the attention output using scaled dot-product attention (built-in function)
attn_output = torch.nn.functional.scaled_dot_product_attention(
    transposed_query_states,
    transposed_key_states,
    transposed_value_states,
    attn_mask=None,
    dropout_p=0.0,
    # BERT attends in both directions, so no causal mask is applied.
    is_causal=False,
)

In [41]:
# Merge the heads back: [1, 12 heads, 12 tokens, 64] -> [1, 12 tokens, 768]
attn_output = attn_output.transpose(1, 2).reshape(1, 12, 768)

In [45]:
attn_output.shape

torch.Size([1, 12, 768])

In [42]:
# Extract the output dense layer weights and biases for the first attention layer
o_layer0 = model["bert.encoder.layer.0.attention.output.dense.weight"]
o_layer0_bias = model['bert.encoder.layer.0.attention.output.dense.bias']
# Compute the final attention output states by applying the dense layer transformation
ouput_states = torch.matmul(attn_output, o_layer0.T)+o_layer0_bias

In [43]:
ouput_states

tensor([[[ 0.0469,  0.1639, -0.0603,  ..., -0.1527,  0.0335,  0.1802],
         [ 0.1023,  0.3846, -0.0411,  ..., -0.5442, -0.0701, -0.0260],
         [-0.0404,  0.2899,  0.1311,  ..., -0.2009, -0.2455, -0.0298],
         ...,
         [ 0.2090,  0.3050,  0.0050,  ..., -0.2973, -0.1212, -0.0459],
         [ 0.0831, -0.0208, -0.1554,  ..., -0.4028, -0.2934, -0.0784],
         [ 0.3431, -0.1718,  0.0026,  ..., -0.3566, -0.2113,  0.0350]]])

In [44]:
ouput_states.shape

torch.Size([1, 12, 768])

In [46]:
# Extract the LayerNorm weights and biases for the first attention layer's output
layer_0_output_norm_weight = model['bert.encoder.layer.0.attention.output.LayerNorm.weight']
layer_0_output_norm_bias = model['bert.encoder.layer.0.attention.output.LayerNorm.bias']
# Apply LayerNorm to the output states combined with the input embeddings
attention_output = torch.nn.functional.layer_norm(ouput_states+normalized_embeddings, [ouput_states.size(-1)], layer_0_output_norm_weight, layer_0_output_norm_bias, eps=1e-12)

In [48]:
attention_output.shape

torch.Size([1, 12, 768])

## Upsampling
In BERT's feed-forward block, "upsampling" simply means that the first dense layer expands each token's representation from the hidden size (768) to the intermediate size (3072). Unlike the upsampling used in image models (bilinear interpolation, transposed convolution), nothing is interpolated: it is a learned linear projection into a wider space, which gives the non-linearity that follows more dimensions to work with before the representation is projected back down to 768.

## Feed-Forward Neural Network
After the self-attention mechanism, each transformer layer in BERT applies a feed-forward neural network to further process the embeddings. This step is crucial for introducing non-linearity and enabling the model to learn complex representations from the input data.
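
To get a sense of scale, the parameter count of one feed-forward block can be worked out from the sizes in the config (768 and 3072). The variable names below are chosen for this example.

```python
hidden_size = 768
intermediate_size = 3072

# First dense layer: 768 -> 3072 (weights plus biases).
expand_params = hidden_size * intermediate_size + intermediate_size
# Second dense layer: 3072 -> 768 (weights plus biases).
project_params = intermediate_size * hidden_size + hidden_size

ffn_params = expand_params + project_params
print(expand_params, project_params, ffn_params)
```

That comes to about 4.7 million parameters per layer in the feed-forward block alone, before counting attention or LayerNorm.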

In [49]:
# Extract the weights and biases for the intermediate dense layer of the first transformer layer
layer_0_intermediate_dense_weight = model['bert.encoder.layer.0.intermediate.dense.weight']
layer_0_intermediate_dense_bias = model['bert.encoder.layer.0.intermediate.dense.bias']
# Compute the output of the intermediate dense layer
intermediate_dense_output = torch.matmul(attention_output, layer_0_intermediate_dense_weight.T)+layer_0_intermediate_dense_bias
intermediate_dense_output

tensor([[[-3.1348, -2.2779, -3.1132,  ..., -2.4296, -3.1535, -2.9147],
         [-1.7857, -1.4517, -3.6339,  ..., -1.9001, -2.6653,  0.6086],
         [-3.0365, -0.8983, -4.2322,  ..., -2.2538, -2.9196, -2.1696],
         ...,
         [-1.7389, -2.1817, -4.1068,  ..., -1.2588, -1.4242, -1.0926],
         [-2.9177, -1.5600, -3.8377,  ..., -1.5096, -2.2996, -1.7339],
         [-3.7864, -2.3012, -3.4133,  ..., -2.6182, -2.6048, -3.1074]]])

In [50]:
intermediate_dense_output.shape

torch.Size([1, 12, 3072])

## Activation Function
After the intermediate dense layer computes its output, an activation function is applied to introduce non-linearity into the model. BERT uses GELU (Gaussian Error Linear Unit), which weights each input by the standard normal CDF: GELU(x) = x·Φ(x). Unlike ReLU (Rectified Linear Unit), which zeroes every negative input, GELU is smooth and lets small negative values through, as the output below shows.
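
For a single number, exact GELU can be written with the error function from Python's standard library. This is a sketch for intuition, with ReLU alongside for comparison; the function names are chosen for this example.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    return max(0.0, x)


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  relu={relu(x):+.4f}")
```

Negative inputs come out small and negative rather than exactly zero, and large positive inputs pass through almost unchanged.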

In [51]:
# Apply the GELU activation function to the intermediate dense output
intermediate_output = torch.nn.functional.gelu(intermediate_dense_output)
intermediate_output

tensor([[[-2.6957e-03, -2.5889e-02, -2.8811e-03,  ..., -1.8364e-02,
          -2.5440e-03, -5.1889e-03],
         [-6.6208e-02, -1.0640e-01, -5.0749e-04,  ..., -5.4553e-02,
          -1.0250e-02,  4.4340e-01],
         [-3.6334e-03, -1.6575e-01, -4.8812e-05,  ..., -2.7281e-02,
          -5.1157e-03, -3.2584e-02],
         ...,
         [-7.1338e-02, -3.1777e-02, -8.2369e-05,  ..., -1.3097e-01,
          -1.0994e-01, -1.5000e-01],
         [-5.1445e-03, -9.2632e-02, -2.3813e-04,  ..., -9.8983e-02,
          -2.4687e-02, -7.1898e-02],
         [-2.8956e-04, -2.4598e-02, -1.0957e-03,  ..., -1.1571e-02,
          -1.1973e-02, -2.9320e-03]]])

In [52]:
intermediate_output.shape

torch.Size([1, 12, 3072])

## Downsampling
Here, "downsampling" refers to the second dense layer of the feed-forward block, which projects each token's representation from the intermediate size (3072) back to the hidden size (768). Unlike pooling or strided convolutions in image models, the sequence length is unchanged; only the feature dimension shrinks, so the output can be added back to the layer's input through the residual connection and passed on to the next layer.

In [53]:
layer_0_intermediate_dense_weight.shape

torch.Size([3072, 768])

In [54]:
# Extract the weights and biases for the output dense layer of the first transformer layer
layer_0_output_dense_weight = model['bert.encoder.layer.0.output.dense.weight']
layer_0_output_dense_bias = model['bert.encoder.layer.0.output.dense.bias']
# Compute the final dense layer output for the first transformer layer
output_dense = torch.matmul(intermediate_output, layer_0_output_dense_weight.T)+layer_0_output_dense_bias
output_dense.shape

torch.Size([1, 12, 768])

In [56]:
# Extract the LayerNorm weights and biases for the output of the first transformer layer
layer_0_output_layernorm_weight = model['bert.encoder.layer.0.output.LayerNorm.weight']
layer_0_output_layernorm_bias = model['bert.encoder.layer.0.output.LayerNorm.bias']
# Apply LayerNorm to the sum of the dense layer output and the original attention output
layer_norm(output_dense+attention_output, layer_0_output_layernorm_weight, layer_0_output_layernorm_bias, eps=1e-12).shape

torch.Size([1, 12, 768])

### Forward Pass Through BERT’s 12 Transformer Layers
Next, we run a forward pass through all 12 layers of the BERT model. 

Each layer applies multi-head self-attention to capture token relationships, adds the result to its input and applies Layer Normalization, then runs the feed-forward network (dense, GELU, dense) followed by a second residual connection and Layer Normalization.

The output from each layer is passed to the next, and after processing all 12 layers, we obtain a final contextualized embedding ready for downstream tasks like classification or question answering.
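
The control flow of one layer can be sketched in plain Python for a single token vector. This is a simplification under stated assumptions: the two sub-layers are stand-in functions rather than real attention and feed-forward networks, and the LayerNorm here omits the learned scale and shift. It only illustrates the "add, then normalize" ordering.

```python
import math


def layer_norm(x, eps=1e-12):
    """LayerNorm without learned scale/shift (a simplification for this sketch)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer):
    """Post-LN residual block: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def encoder_layer(x, attention, feed_forward):
    x = add_and_norm(x, attention)      # attention sub-layer
    return add_and_norm(x, feed_forward)  # feed-forward sub-layer


# Stand-in sub-layers (hypothetical, for illustration only).
attention_stub = lambda v: [0.5 * a for a in v]
ffn_stub = lambda v: [a * a for a in v]

out = encoder_layer([1.0, 2.0, 3.0, 4.0], attention_stub, ffn_stub)
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out_mean, out_var)
```

Whatever the sub-layers do, each layer's output ends up normalized, which is why stacking 12 of them keeps the activations in a stable range.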

## god, everything all at once
<div>
    <img src="images/12self_attention.png" width="600px"/>
</div>
yep, this is it. everything we did before, all at once, for every single layer.
<br>

### have fun reading :)

In [57]:
# Initialize with the normalized embeddings from the input layer
layer_embedding_norm = normalized_embeddings
# Iterate through all 12 transformer layers in BERT
for layer in range(12):
    # Extract query, key, and value weights for the current layer's attention mechanism
    q_layer = model[f"bert.encoder.layer.{layer}.attention.self.query.weight"]
    k_layer = model[f"bert.encoder.layer.{layer}.attention.self.key.weight"]
    v_layer = model[f"bert.encoder.layer.{layer}.attention.self.value.weight"]
    
    # Extract biases for the query, key, and value
    q_layer_bias = model[f'bert.encoder.layer.{layer}.attention.self.query.bias']
    k_layer_bias = model[f'bert.encoder.layer.{layer}.attention.self.key.bias']
    v_layer_bias = model[f'bert.encoder.layer.{layer}.attention.self.value.bias']
    
    # Compute query, key, and value states by applying the respective weights and biases
    query_states = torch.matmul(layer_embedding_norm, q_layer.T)+q_layer_bias
    key_states = torch.matmul(layer_embedding_norm, k_layer.T)+k_layer_bias
    value_states = torch.matmul(layer_embedding_norm, v_layer.T)+v_layer_bias
    
    # Transpose and reshape states for multi-head attention computation
    t_query_states = transpose_for_scores(query_states)
    t_key_states = transpose_for_scores(key_states)
    t_value_states = transpose_for_scores(value_states)
    
    # Compute the attention output using scaled dot-product attention
    attn_output = torch.nn.functional.scaled_dot_product_attention(
        t_query_states,
        t_key_states,
        t_value_states,
        attn_mask=None,
        dropout_p= 0.0,
        # BERT is bidirectional, so no causal mask is used.
        is_causal= False,
    )
    attn_output = attn_output.transpose(1, 2).reshape(1, 12, 768)
    
    # Extract the dense layer weights and biases for the attention output transformation
    o_layer = model[f"bert.encoder.layer.{layer}.attention.output.dense.weight"]
    o_layer_bias = model[f'bert.encoder.layer.{layer}.attention.output.dense.bias']
    ouput_states = torch.matmul(attn_output, o_layer.T)+o_layer_bias
    
    # Extract LayerNorm weights and biases for the attention output
    layer_output_norm_weight = model[f'bert.encoder.layer.{layer}.attention.output.LayerNorm.weight']
    layer_output_norm_bias = model[f'bert.encoder.layer.{layer}.attention.output.LayerNorm.bias']
    # Residual connection followed by LayerNorm
    attention_output = layer_norm(ouput_states+layer_embedding_norm, layer_output_norm_weight, layer_output_norm_bias, eps=1e-12)

    # Compute intermediate dense layer output
    layer_intermediate_dense_weight = model[f'bert.encoder.layer.{layer}.intermediate.dense.weight']
    layer_intermediate_dense_bias = model[f'bert.encoder.layer.{layer}.intermediate.dense.bias']
    intermediate_dense_output = torch.matmul(attention_output, layer_intermediate_dense_weight.T)+layer_intermediate_dense_bias
    
    # Apply GELU activation function
    intermediate_output = torch.nn.functional.gelu(intermediate_dense_output)
    
    # Compute output dense layer transformation
    layer_output_dense_weight = model[f'bert.encoder.layer.{layer}.output.dense.weight']
    layer_output_dense_bias = model[f'bert.encoder.layer.{layer}.output.dense.bias']
    output_dense = torch.matmul(intermediate_output, layer_output_dense_weight.T)+layer_output_dense_bias
    
    # Residual connection and final LayerNorm for the current layer
    layer_output_layernorm_weight = model[f'bert.encoder.layer.{layer}.output.LayerNorm.weight']
    layer_output_layernorm_bias = model[f'bert.encoder.layer.{layer}.output.LayerNorm.bias']
    layer_embedding_norm = layer_norm(output_dense+attention_output, layer_output_layernorm_weight, layer_output_layernorm_bias, eps=1e-12)


# Final output after all 12 layers of the transformer
layer_embedding_norm

tensor([[[ 0.0698,  0.3857, -0.0319,  ..., -0.1526,  0.0446,  0.2253],
         [-0.5065,  0.7490,  0.5782,  ..., -0.7123, -0.4653, -0.7150],
         [-0.0510, -0.6486, -0.1926,  ..., -0.1028,  0.0320,  0.1524],
         ...,
         [ 0.8641,  0.3875,  0.5311,  ...,  0.2063,  0.1742,  0.1639],
         [-0.4663, -0.5197, -0.2011,  ...,  0.6513,  0.0255, -0.1058],
         [ 0.7993,  0.2471, -0.0632,  ...,  0.1179, -0.7094, -0.0537]]])

In [58]:
layer_embedding_norm.shape

torch.Size([1, 12, 768])

In [59]:
# Extract weights and biases for the dense layer in the prediction head
layer_intermediate_dense_weight = model['cls.predictions.transform.dense.weight']
layer_intermediate_dense_bias = model['cls.predictions.transform.dense.bias']

# Compute the output of the dense layer in the prediction head
prediction_output = torch.matmul(layer_embedding_norm, layer_intermediate_dense_weight.T)+layer_intermediate_dense_bias
prediction_output

tensor([[[-0.9308,  0.5810, -1.5488,  ..., -0.3969, -0.2049, -0.7886],
         [-0.4524,  1.4434,  0.7572,  ...,  0.8591,  1.1784,  1.1476],
         [ 0.1031,  0.8686,  0.9038,  ...,  1.2142,  0.3628,  1.5648],
         ...,
         [ 1.0442, -0.6966,  1.6925,  ...,  1.5027,  0.0384,  2.1948],
         [-0.2648, -0.0293,  0.6970,  ...,  0.6369,  1.0992,  1.0511],
         [ 0.1265, -1.1419,  0.2915,  ..., -0.2913,  0.0908,  0.3325]]])

In [60]:
prediction_output.shape

torch.Size([1, 12, 768])

In [61]:
# Apply GELU activation to the prediction output
prediction_act_dense_output = torch.nn.functional.gelu(prediction_output)

# Extract LayerNorm weights and biases for the prediction output
prediction_layernorm_weight = model['cls.predictions.transform.LayerNorm.weight']
prediction_layernorm_bias = model['cls.predictions.transform.LayerNorm.bias']
# Apply LayerNorm to the activated dense output
prediction_dense_output = layer_norm(prediction_act_dense_output, prediction_layernorm_weight, prediction_layernorm_bias, eps=1e-12)
prediction_dense_output

tensor([[[-0.6974,  1.0231, -0.0350,  ..., -0.3549, -0.0176,  0.0247],
         [-4.3757,  1.5879, -1.3361,  ..., -1.1796,  0.3573,  0.4541],
         [-3.5006, -0.2887, -0.2252,  ...,  0.9825, -2.4059,  3.3587],
         ...,
         [-0.4326, -4.0172,  2.9924,  ...,  1.8936, -3.3367,  5.7582],
         [-2.9313, -2.1162, -0.0482,  ..., -0.5154,  1.6312,  1.7295],
         [-0.8187, -2.2804,  0.7585,  ..., -2.2679, -0.4951,  1.3419]]])

In [62]:
prediction_dense_output.shape

torch.Size([1, 12, 768])

In [63]:
# Extract decoder weights and biases for final prediction
decoder_weight = model['cls.predictions.decoder.weight'] # shape [vocab_size, hidden_size]
decoder_bias = model['cls.predictions.decoder.bias'] # shape [vocab_size]

# (Disabled) Add the separate 'cls.predictions.bias' tensor if it is present
# if 'cls.predictions.bias' in model:
#     decoder_bias += model['cls.predictions.bias']

# Project the final output to the vocabulary space to compute logits
logits = torch.matmul(prediction_dense_output, decoder_weight.T) + decoder_bias

In [64]:
logits.shape

torch.Size([1, 12, 30522])

In [65]:
# Get the predicted tokens (indices into the vocabulary)
predictions = torch.argmax(logits, dim=-1)

In [66]:
predictions.shape

torch.Size([1, 12])

In [67]:
input_ids = tokenizer.encode(prompt, return_tensors='pt')

In [68]:
input_ids

tensor([[ 101, 2043, 1999, 4199, 1010, 2079, 2004, 1996,  103, 2079, 1012,  102]])

## Let's go!

In [69]:
tokenizer.decode(predictions[0])

'. when in rome, do as the romans do..'
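
The decoded string above contains a prediction for every position, including `[CLS]` and `[SEP]`, but only the `[MASK]` position is the one the model was actually asked to fill. Here is a small sketch of picking out just that position. It uses the `input_ids` printed earlier (where 103 is the `[MASK]` token ID) and a made-up logits table over a hypothetical 5-word vocabulary in place of the real [12x30522] logits.

```python
MASK_TOKEN_ID = 103  # ID of [MASK], as seen in the input_ids printed earlier
input_ids = [101, 2043, 1999, 4199, 1010, 2079, 2004, 1996, 103, 2079, 1012, 102]

# Find every masked position in the input.
mask_positions = [i for i, tok in enumerate(input_ids) if tok == MASK_TOKEN_ID]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


# Made-up logits: one row per token over a toy 5-word vocabulary.
toy_logits = [[0.0] * 5 for _ in input_ids]
toy_logits[8] = [0.1, 2.5, -1.0, 0.3, 0.0]

# Keep only the predictions at masked positions.
predicted = {pos: argmax(toy_logits[pos]) for pos in mask_positions}
print(mask_positions, predicted)
```

With the real tensors, the same idea amounts to indexing `predictions[0]` at the mask position and decoding that single ID.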

# thank you

This is the end. Hopefully you enjoyed reading it!
