# BERT implemented from scratch

In this file, I implemented BERT from scratch, focusing on the core concepts of transformer architecture, attention mechanisms, and matrix operations. BERT (Bidirectional Encoder Representations from Transformers) is a pivotal model in NLP that uses a bidirectional approach to understand the context of words in a sentence.

<br>

I will also demonstrate how to load tensors directly from the pre-trained BERT model provided by Hugging Face. Before running this file, you need to download the weights. Here is the official link to download the weights: [Hugging Face BERT](https://huggingface.co/bert-base-uncased)

<div>
    <img src="images/all-steps.png" width="800"/>
</div>

In [None]:
!conda install pytorch torchvision torchaudio cpuonly -c pytorch


In [None]:
conda install -c conda-forge transformers

## Model Initialization and Input Processing

### Tokenization
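BERT uses a WordPiece tokenizer, which we load from Hugging Face rather than implementing here. If you want to see how a subword tokenizer is built, Andrej Karpathy has a neat implementation of the closely related BPE algorithm.
<br>
You can check out his implementation at this link: https://github.com/karpathy/minbpe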
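To give a feel for what WordPiece does at inference time, here is a minimal sketch of its greedy longest-match-first splitting. This is not the Hugging Face implementation: the function name `wordpiece_tokenize` and the tiny vocabulary are made up for illustration, and the real tokenizer also handles lowercasing, punctuation splitting, and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"un", "##aff", "##able", "hello", "world", "!"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("hello", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

The `##` prefix is how the vocabulary distinguishes a piece that continues a word from one that starts a word.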

In [None]:
# Load Model
import torch
from transformers import BertForMaskedLM, BertTokenizer

model_path="bert-base-uncased"

tokenizer=BertTokenizer.from_pretrained(model_path)

In [None]:
tokenizer.encode("hello world!")

In [None]:
tokenizer.decode(tokenizer.encode("hello world!"))

### Loading and Inspecting Model Configuration and Weights

Here, the model configuration and weights are loaded from a pretrained model file. This step ensures that the necessary parameters are ready for the forward pass.

Normally, reading a model file depends on how the model classes are written and the variable names within them. 
However, since we are implementing BERT from scratch, we will read the file one tensor at a time, carefully examining the embedding layers, attention heads, and feed-forward networks that make up BERT’s architecture.

In [None]:
# 1. Load the pretrained model with transformers

# Load the model configuration and weights
model = BertForMaskedLM.from_pretrained(model_path, torch_dtype=torch.float32)

In [None]:
model

In [None]:
model = model.state_dict()

In [None]:
model.keys()

In [None]:
# float32
model['bert.embeddings.word_embeddings.weight'].shape

In [None]:
import json
with open("/home/jlpanc/anaconda3/envs/codef/STA/bert-base-uncased/config.json", "r") as f:
    config = json.load(f)
config

### Extracting Model Parameters from Config
We use this config to infer details about the model like:

1. The model has 12 transformer layers.
2. Each multi-head attention block has 12 heads.
3. The vocabulary size is 30,522 tokens.
4. The hidden size is 768 dimensions.
5. The intermediate size (for the feed-forward layers) is 3072 dimensions.
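A few useful numbers follow directly from these values. The sketch below just does the arithmetic; the variable names are my own, and the parameter counts cover only the pieces named in the comments, not the whole model.

```python
# Values copied from the list above (bert-base-uncased).
hidden_size = 768
num_heads = 12
vocab_size = 30522
intermediate_size = 3072

# Each attention head works on an equal slice of the hidden dimension.
head_dim = hidden_size // num_heads

# The word embedding table has one 768-dimensional row per vocabulary entry.
word_embedding_params = vocab_size * hidden_size

# Query, key, value, and output projections: four 768x768 weights, each with a bias.
attention_params_per_layer = 4 * (hidden_size * hidden_size + hidden_size)

# Two dense layers: 768 -> 3072 and 3072 -> 768, each with a bias.
ffn_params_per_layer = (hidden_size * intermediate_size + intermediate_size) + (
    intermediate_size * hidden_size + hidden_size
)

print(head_dim, word_embedding_params, attention_params_per_layer, ffn_params_per_layer)
```

Note that the feed-forward block holds roughly twice as many parameters as the attention block in each layer.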

In [None]:
dim = config["hidden_size"]
n_layers = config["num_hidden_layers"]
n_heads = config["num_attention_heads"]
vocab_size = config["vocab_size"]
norm_eps = config["layer_norm_eps"]
eps = config["layer_norm_eps"]

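### Converting text to tokens
Here, we use the `BertTokenizer` loaded earlier (a WordPiece tokenizer) to turn the prompt into token IDs. It also adds the special `[CLS]` and `[SEP]` tokens at the start and end of the sequence.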
<div>
    <img src="images/embedding_layers.png" width="600"/>
</div>

In [None]:
# 1. Tokenize the prompt into token IDs using the BERT tokenizer.
prompt = "When in Rome, do as the [MASK] do."
input_ids = tokenizer.encode(prompt, return_tensors='pt')  # shape [1, seq_len], already a tensor
print("input_ids: ", input_ids)
# 2. Create a tensor of token type IDs (all zeros, as we have only one sentence).
token_type_ids = torch.zeros_like(input_ids)
print("token_type_ids: ", token_type_ids)
# 3. Count the tokens in the prompt (dimension 1; dimension 0 is the batch).
q_len = input_ids.size(1)
print("q_len: ", q_len)

### Generating Token Embeddings
In this section, we convert input tokens into their corresponding embeddings using BERT’s pre-trained embedding tables. BERT sums three embeddings for every token: a word embedding, a position embedding, and a segment (token type) embedding.

So, our [12x1] tokens are now transformed into [12x768], i.e., 12 embeddings (one for each token, including `[CLS]` and `[SEP]`) of length 768. With the batch dimension, the tensor shape is [1x12x768].

Note: Keep track of the tensor shapes throughout the process; it makes understanding the entire architecture much easier.
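Here is a small sketch of the embedding step with toy sizes, assuming nothing beyond plain PyTorch. The table sizes and token IDs are made up; the point is that an embedding lookup is just row indexing, and that BERT's input embedding is the sum of three such lookups.

```python
import torch

torch.manual_seed(0)
vocab_size, max_positions, num_segments, hidden = 100, 16, 2, 8  # toy sizes, not BERT's

word_table = torch.randn(vocab_size, hidden)
position_table = torch.randn(max_positions, hidden)
segment_table = torch.randn(num_segments, hidden)

input_ids = torch.tensor([[5, 17, 42, 3]])                     # [batch=1, seq_len=4]
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)    # [[0, 1, 2, 3]]
token_type_ids = torch.zeros_like(input_ids)                   # single sentence: all segment 0

# An embedding lookup is plain row indexing into the table.
embeddings = word_table[input_ids] + position_table[position_ids] + segment_table[token_type_ids]
print(embeddings.shape)  # torch.Size([1, 4, 8])
```

With the real weights, the same pattern gives [1x12x768] for our prompt.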

<div>
    <img src="images/embeddings_output.png" width="600"/>
</div>

In [None]:
# 1. Load the word embeddings from the pre-trained BERT model
embedding_layer = torch.nn.Embedding.from_pretrained(model['bert.embeddings.word_embeddings.weight'])
# 2. Load the position embeddings from the pre-trained BERT model
position_embeddings = torch.nn.Embedding.from_pretrained(model['bert.embeddings.position_embeddings.weight'])
# 3. Load the segment (token type) embeddings from the pre-trained BERT model
segment_embeddings = torch.nn.Embedding.from_pretrained(model['bert.embeddings.token_type_embeddings.weight'])
token_embeddings_unnormalized = embedding_layer(input_ids)
token_embeddings_unnormalized.shape

In [None]:
token_embeddings_unnormalized

In [None]:
seq_length = input_ids.size(1)
position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
position_embeds = position_embeddings(position_ids)
position_embeds.shape

In [None]:
segment_embeds = segment_embeddings(token_type_ids)
segment_embeds.shape

In [None]:
embeddings_unnormalized = token_embeddings_unnormalized+position_embeds+segment_embeds
embeddings_unnormalized.shape

In [None]:
embeddings_unnormalized

### Applying Layer Normalization
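In BERT, we apply Layer Normalization after the embedding layers and after each attention and feed-forward sub-layer.
Note: This step does not change the shape of the tensor; it just normalizes the values across the hidden dimension.

### Things to keep in mind:
1. Layer normalization standardizes each token's vector to zero mean and unit variance across its 768 features, then applies a learned scale (gamma) and shift (beta). This keeps activations in a stable range from layer to layer.
2. A small epsilon value (`layer_norm_eps` from the configuration, 1e-12 for this model) is added to the variance for numerical stability, preventing division by zero.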
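As a sanity check on the two points above, the sketch below normalizes a random toy tensor by hand and compares the result with PyTorch's built-in `torch.nn.functional.layer_norm`. The tensor sizes and the random gamma and beta are stand-ins, not BERT's weights.

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 5, 8)   # toy [batch, tokens, hidden]
gamma = torch.randn(8)     # stand-in for the learned scale
beta = torch.randn(8)      # stand-in for the learned shift
eps = 1e-12

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance, as LayerNorm uses
standardized = (x - mean) / torch.sqrt(var + eps)
manual = gamma * standardized + beta

reference = torch.nn.functional.layer_norm(x, (8,), gamma, beta, eps=eps)
print(manual.shape, torch.allclose(manual, reference, atol=1e-5))
```

The shape is unchanged, and before the scale and shift every token vector has mean close to 0.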


### Building the First Layer of the Transformer

#### Normalization
BERT uses a post-LayerNorm design: the summed embeddings are normalized once before entering the encoder, and each transformer layer normalizes the output of its attention and feed-forward sub-layers.

The first normalization uses the `bert.embeddings.LayerNorm` weights from the model dictionary; the weights under `encoder.layer.0` come into play once we reach the first transformer layer.
After applying the normalization, the tensor shape remains the same as the input, [12x768], but the values are now normalized.

<div>
    <img src="images/layer_norm.png" width="600"/>
</div>

In [None]:
def layer_norm(x, gamma, beta, eps=1e-12):
    """
    Args:
    x (torch.Tensor): input tensor of shape (..., num_features)
    gamma (torch.Tensor): scale parameter, same size as the last dimension of x
    beta (torch.Tensor): shift parameter, same size as the last dimension of x
    eps (float): small constant for numerical stability (prevents division by zero)
    
    Returns:
    torch.Tensor: the output after LayerNorm
    """
    # Compute the mean and variance over the last dimension
    mean = x.mean(dim=-1, keepdim=True)
    variance = x.var(dim=-1, keepdim=True, unbiased=False)
    
    # Standardize
    x_normalized = (x - mean) / torch.sqrt(variance + eps)
    
    # Scale and shift
    out = gamma * x_normalized + beta
    
    return out

#### Normalizing the embeddings
We now apply this function to the summed embeddings, using the `bert.embeddings.LayerNorm` weight and bias extracted from the model dictionary.

The output tensor still has shape [12x768], but its values are normalized.

<div>
    <img src="images/embeding_norm1.png" width="500"/>
</div>


In [None]:
# Initialize LayerNorm and Dropout
# layer_norm = torch.nn.LayerNorm(768, eps=1e-12)
# layer_norm.weight.data = model['bert.embeddings.LayerNorm.weight']
# layer_norm.bias.data = model['bert.embeddings.LayerNorm.bias']

# Perform Layer Normalization
normalized_embeddings = layer_norm(embeddings_unnormalized, model['bert.embeddings.LayerNorm.weight'], model['bert.embeddings.LayerNorm.bias'], eps=1e-12)

# Apply Dropout
# final_embeddings = dropout(normalized_embeddings)

# Print results
print("Embeddings after LayerNorm:", normalized_embeddings)
# print("Final Embeddings:", final_embeddings)

## Self-Attention Mechanism
This section dives into how the self-attention mechanism is implemented in BERT, including the computation of Query, Key, and Value matrices, and how they contribute to generating attention scores.

In [None]:
# Extract the query, key, and value weights for the first attention layer from the BERT model
q_layer0 = model["bert.encoder.layer.0.attention.self.query.weight"]
k_layer0 = model["bert.encoder.layer.0.attention.self.key.weight"]
v_layer0 = model["bert.encoder.layer.0.attention.self.value.weight"]
# Extract the biases for query, key, and value in the first attention layer
q_layer0_bias = model['bert.encoder.layer.0.attention.self.query.bias']
k_layer0_bias = model['bert.encoder.layer.0.attention.self.key.bias']
v_layer0_bias = model['bert.encoder.layer.0.attention.self.value.bias']

In [None]:
# Compute the query, key, and value states by multiplying the normalized embeddings with the respective weights and adding the bias
query_states = torch.matmul(normalized_embeddings, q_layer0.T)+q_layer0_bias
key_states = torch.matmul(normalized_embeddings, k_layer0.T)+k_layer0_bias
value_states = torch.matmul(normalized_embeddings, v_layer0.T)+v_layer0_bias

In [None]:
# Define a function to reshape and permute the states for multi-head attention
def transpose_for_scores(x):
    new_x_shape = x.size()[:-1] + (12, 64) # (num_attention_heads, attention_head_size)
    x = x.view(new_x_shape) # Reshape the tensor to separate attention heads
    return x.permute(0, 2, 1, 3) # Permute the tensor dimensions for attention computation

In [None]:
# Apply the transpose_for_scores function to the query, key, and value states
t_query_states = transpose_for_scores(query_states)
t_key_states = transpose_for_scores(key_states)
t_value_states = transpose_for_scores(value_states)

### Calculating Attention Score
We then multiply the query and key matrices, the core step of self-attention:

- Performing this operation yields a score for the relationship between each token and every other token in the sequence.
- This score indicates how well each token’s query aligns with every other token’s key. The scores are divided by the square root of the head dimension (√64 = 8) and passed through a softmax so that each row sums to 1.
- The resulting attention score matrix (referred to as qk_per_token) has a shape of [12x12] for one head, where 12 is the number of tokens in the input sequence. There is one such matrix per head.
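The score computation can be written out by hand and checked against PyTorch's built-in `scaled_dot_product_attention` (available from PyTorch 2.0). This is a sketch on random tensors with BERT-like shapes, not the model's real queries and keys.

```python
import math
import torch

torch.manual_seed(0)
num_heads, seq_len, head_dim = 12, 12, 64
q = torch.randn(1, num_heads, seq_len, head_dim)
k = torch.randn(1, num_heads, seq_len, head_dim)
v = torch.randn(1, num_heads, seq_len, head_dim)

# Query-key dot products, scaled by sqrt(head_dim): one [12, 12] matrix per head.
scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(head_dim)
# Softmax over the key dimension turns each row into attention weights.
weights = torch.softmax(scores, dim=-1)
# Weighted sum of the value vectors.
manual = torch.matmul(weights, v)

builtin = torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(scores.shape, torch.allclose(manual, builtin, atol=1e-4))
```

No mask is applied in either version, which matches BERT's bidirectional attention on a single unpadded sentence.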


#### Attention Implemented from Scratch

Let's load the attention heads of the first layer of the transformer.

<div>
    <img src="images/qkv.png" width="600"/>
</div>

### Loading the Query, Key, Value, and Output Matrices

When we load the query, key, value, and output matrices from the model, we notice that their shapes are:
- Query: [768x768]
- Key: [768x768]
- Value: [768x768]
- Output: [768x768]

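At first glance, this might seem unusual because ideally, we would want separate q, k, v, and o matrices for each attention head individually.

However, BERT’s implementation bundles the heads into one matrix per projection for efficiency: a single matrix multiplication computes all 12 heads in parallel. Each head owns a contiguous block of 64 output rows (768 / 12 = 64).

The bundled projections were already split into heads by `transpose_for_scores` above; the next cell computes the attention for all heads at once with PyTorch's built-in `scaled_dot_product_attention`.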
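To make the bundling concrete, here is a sketch that pulls one head out of a bundled projection by slicing the weight matrix, and checks that it matches the view-and-permute route. The weights are random stand-ins with the same shapes as `query.weight` and `query.bias`; the variable names are my own.

```python
import torch

torch.manual_seed(0)
hidden, num_heads = 768, 12
head_dim = hidden // num_heads
x = torch.randn(1, 12, hidden)             # stand-in for the normalized embeddings
w_q = torch.randn(hidden, hidden) * 0.02   # stand-in for query.weight, laid out as [out, in]
b_q = torch.randn(hidden) * 0.02           # stand-in for query.bias

# Bundled: one big matmul, then split the last dimension into heads.
bundled = (x @ w_q.T + b_q).view(1, 12, num_heads, head_dim).permute(0, 2, 1, 3)

# Unbundled: head h owns rows h*64 up to (h+1)*64 of the weight matrix.
head = 3
w_q_head = w_q[head * head_dim:(head + 1) * head_dim]   # [64, 768]
b_q_head = b_q[head * head_dim:(head + 1) * head_dim]   # [64]
per_head = x @ w_q_head.T + b_q_head                    # [1, 12, 64]

print(per_head.shape, torch.allclose(bundled[:, head], per_head, atol=1e-5))
```

Both routes give the same [12x64] query block for the chosen head; the bundled route simply computes all heads in one multiplication.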

In [None]:
# Compute the attention output using scaled dot-product attention (built-in function)
attn_output = torch.nn.functional.scaled_dot_product_attention(
    t_query_states,
    t_key_states,
    t_value_states,
    attn_mask=None,   # single unpadded sentence, so no padding mask is needed
    dropout_p=0.0,    # no dropout at inference time
    is_causal=False,  # BERT is bidirectional: every token attends to every other token
)

In [None]:
# Reshape and transpose the attention output to match expected dimensions
attn_output = attn_output.transpose(1, 2).reshape(1, 12, 768)

In [None]:
# Extract the output dense layer weights and biases for the first attention layer
o_layer0 = model["bert.encoder.layer.0.attention.output.dense.weight"]
o_layer0_bias = model['bert.encoder.layer.0.attention.output.dense.bias']
# Compute the final attention output states by applying the dense layer transformation
ouput_states = torch.matmul(attn_output, o_layer0.T)+o_layer0_bias

In [None]:
ouput_states

In [None]:
# Extract the LayerNorm weights and biases for the first attention layer's output
layer_0_output_norm_weight = model['bert.encoder.layer.0.attention.output.LayerNorm.weight']
layer_0_output_norm_bias = model['bert.encoder.layer.0.attention.output.LayerNorm.bias']
# Apply LayerNorm to the output states combined with the input embeddings (residual connection)
attention_output = torch.nn.functional.layer_norm(ouput_states+normalized_embeddings, [ouput_states.size(-1)], layer_0_output_norm_weight, layer_0_output_norm_bias, eps=1e-12)

In [None]:
attention_output

## Upsampling
In this notebook, upsampling refers loosely to the expansion step inside the feed-forward block: the first dense layer projects each token vector from 768 to 3072 dimensions (the intermediate size from the config), and the second dense layer projects it back to 768. The term is borrowed from image processing, where bilinear interpolation or transposed convolution increase spatial resolution; no interpolation happens here, only a learned linear projection into a wider space. The wider intermediate representation gives the non-linearity more room to work with before the result is compressed back to the hidden size.

## Feed-Forward Neural Network
After the self-attention mechanism, each transformer layer in BERT applies a feed-forward neural network to further process the embeddings. This step is crucial for introducing non-linearity and enabling the model to learn complex representations from the input data.
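The shapes involved are easy to lose track of, so here is a sketch of the feed-forward block on random stand-in weights. The names `w1`, `b1`, `w2`, `b2` are my own shorthand for the intermediate and output dense layers.

```python
import torch

torch.manual_seed(0)
hidden, intermediate = 768, 3072
x = torch.randn(1, 12, hidden)                  # stand-in for the attention output
w1 = torch.randn(intermediate, hidden) * 0.02   # plays the role of intermediate.dense.weight
b1 = torch.zeros(intermediate)
w2 = torch.randn(hidden, intermediate) * 0.02   # plays the role of output.dense.weight
b2 = torch.zeros(hidden)

expanded = torch.nn.functional.gelu(x @ w1.T + b1)   # [1, 12, 3072]
projected = expanded @ w2.T + b2                     # [1, 12, 768]
print(expanded.shape, projected.shape)

# The block is position-wise: running one token on its own gives the same vector.
single = torch.nn.functional.gelu(x[0, 5] @ w1.T + b1) @ w2.T + b2
print(torch.allclose(single, projected[0, 5], atol=1e-4))
```

Tokens only exchange information in the attention step; the feed-forward block transforms each token independently.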

In [None]:
# Extract the weights and biases for the intermediate dense layer of the first transformer layer
layer_0_intermediate_dense_weight = model['bert.encoder.layer.0.intermediate.dense.weight']
layer_0_intermediate_dense_bias = model['bert.encoder.layer.0.intermediate.dense.bias']
# Compute the output of the intermediate dense layer
intermediate_dense_output = torch.matmul(attention_output, layer_0_intermediate_dense_weight.T)+layer_0_intermediate_dense_bias
intermediate_dense_output

In [None]:
intermediate_dense_output.shape

### Activation Function
After the intermediate dense layer computes its output, an activation function is applied to introduce non-linearity into the model. BERT uses GELU (Gaussian Error Linear Unit), defined as GELU(x) = x · Φ(x), where Φ is the cumulative distribution function of the standard normal distribution. Unlike ReLU (Rectified Linear Unit), which sets every negative input to zero, GELU is smooth and lets small negative values through, scaled down by Φ(x).
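For intuition, here is the exact GELU written with nothing but the standard library, next to ReLU. It uses the identity Φ(x) = 0.5 · (1 + erf(x / √2)); the helper names are my own.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    """ReLU for comparison: negative inputs become zero."""
    return max(0.0, x)


for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={value:+.1f}  gelu={gelu(value):+.4f}  relu={relu(value):+.4f}")
```

For large positive inputs the two functions nearly agree; the difference shows up around zero, where GELU bends smoothly and dips slightly below zero.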

In [None]:
# Apply the GELU activation function to the intermediate dense output
intermediate_output = torch.nn.functional.gelu(intermediate_dense_output)
intermediate_output

In [None]:
intermediate_output.shape

In [None]:
layer_0_intermediate_dense_weight.shape

In [None]:
# Extract the weights and biases for the output dense layer of the first transformer layer
layer_0_output_dense_weight = model['bert.encoder.layer.0.output.dense.weight']
layer_0_output_dense_bias = model['bert.encoder.layer.0.output.dense.bias']
# Compute the final dense layer output for the first transformer layer
output_dense = torch.matmul(intermediate_output, layer_0_output_dense_weight.T)+layer_0_output_dense_bias
output_dense

In [None]:
# Extract the LayerNorm weights and biases for the output of the first transformer layer
layer_0_output_layernorm_weight = model['bert.encoder.layer.0.output.LayerNorm.weight']
layer_0_output_layernorm_bias = model['bert.encoder.layer.0.output.LayerNorm.bias']
# Apply LayerNorm to the sum of the dense layer output and the original attention output
layer_norm(output_dense+attention_output, layer_0_output_layernorm_weight, layer_0_output_layernorm_bias, eps=1e-12)

### Forward Pass Through BERT’s 12 Transformer Layers
Next, we run a forward pass through all 12 layers of the BERT model. 

Each layer applies multi-head self-attention to capture token relationships, followed by a dense layer with GELU activation for non-linearity, and then Layer Normalization to stabilize the output. 

The output from each layer is passed to the next, and after processing all 12 layers, we obtain a final contextualized embedding ready for downstream tasks like classification or question answering.
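Before the full loop, it may help to see one layer as a single function. This is a compact sketch with toy sizes and random weights; the dictionary keys (`wq`, `ln1_w`, and so on) are hypothetical short names of mine, not keys from the BERT state dict, and reusing the same weights for every layer is only there to show the chaining.

```python
import math
import torch
import torch.nn.functional as F


def encoder_layer(x, p, num_heads):
    """One post-LayerNorm encoder layer; `p` maps hypothetical short names to weights."""
    batch, seq_len, hidden = x.shape
    head_dim = hidden // num_heads

    def split_heads(t):
        return t.view(batch, seq_len, num_heads, head_dim).permute(0, 2, 1, 3)

    q = split_heads(x @ p["wq"].T + p["bq"])
    k = split_heads(x @ p["wk"].T + p["bk"])
    v = split_heads(x @ p["wv"].T + p["bv"])
    weights = torch.softmax(q @ k.transpose(-1, -2) / math.sqrt(head_dim), dim=-1)
    context = (weights @ v).permute(0, 2, 1, 3).reshape(batch, seq_len, hidden)

    # Attention output projection, residual connection, LayerNorm.
    x = F.layer_norm(context @ p["wo"].T + p["bo"] + x, (hidden,), p["ln1_w"], p["ln1_b"], eps=1e-12)
    # Feed-forward block, residual connection, LayerNorm.
    ffn = F.gelu(x @ p["w1"].T + p["b1"]) @ p["w2"].T + p["b2"]
    return F.layer_norm(ffn + x, (hidden,), p["ln2_w"], p["ln2_b"], eps=1e-12)


torch.manual_seed(0)
hidden, inter, heads = 32, 64, 4   # toy sizes, not BERT's
p = {name: torch.randn(hidden, hidden) * 0.1 for name in ("wq", "wk", "wv", "wo")}
p.update({name: torch.zeros(hidden) for name in ("bq", "bk", "bv", "bo", "b2", "ln1_b", "ln2_b")})
p.update({"ln1_w": torch.ones(hidden), "ln2_w": torch.ones(hidden)})
p.update({"w1": torch.randn(inter, hidden) * 0.1, "b1": torch.zeros(inter), "w2": torch.randn(hidden, inter) * 0.1})

x = torch.randn(1, 6, hidden)
for _ in range(3):   # feed each layer's output into the next
    x = encoder_layer(x, p, heads)
print(x.shape)
```

The shape going in equals the shape coming out, which is what lets us stack 12 of these.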

## god, everything all at once
<div>
    <img src="images/12self_attention.png" width="600px"/>
</div>
yep, this is it. everything we did before, all at once, for every single layer.
<br>

### have fun reading :)

In [None]:
# Initialize with the normalized embeddings from the input layer
layer_embedding_norm = normalized_embeddings
# Iterate through all 12 transformer layers in BERT
for layer in range(12):
    # Extract query, key, and value weights for the current layer's attention mechanism
    q_layer = model[f"bert.encoder.layer.{layer}.attention.self.query.weight"]
    k_layer = model[f"bert.encoder.layer.{layer}.attention.self.key.weight"]
    v_layer = model[f"bert.encoder.layer.{layer}.attention.self.value.weight"]
    
    # Extract biases for the query, key, and value
    q_layer_bias = model[f'bert.encoder.layer.{layer}.attention.self.query.bias']
    k_layer_bias = model[f'bert.encoder.layer.{layer}.attention.self.key.bias']
    v_layer_bias = model[f'bert.encoder.layer.{layer}.attention.self.value.bias']
    
    # Compute query, key, and value states by applying the respective weights and biases
    query_states = torch.matmul(layer_embedding_norm, q_layer.T)+q_layer_bias
    key_states = torch.matmul(layer_embedding_norm, k_layer.T)+k_layer_bias
    value_states = torch.matmul(layer_embedding_norm, v_layer.T)+v_layer_bias
    
    # Transpose and reshape states for multi-head attention computation
    t_query_states = transpose_for_scores(query_states)
    t_key_states = transpose_for_scores(key_states)
    t_value_states = transpose_for_scores(value_states)
    
    # Compute the attention output using scaled dot-product attention
    attn_output = torch.nn.functional.scaled_dot_product_attention(
        t_query_states,
        t_key_states,
        t_value_states,
        attn_mask=None,   # single unpadded sentence, so no padding mask is needed
        dropout_p=0.0,    # no dropout at inference time
        is_causal=False,  # BERT is bidirectional: every token attends to every other token
    )
    attn_output = attn_output.transpose(1, 2).reshape(1, 12, 768)
    
    # Extract the dense layer weights and biases for the attention output transformation
    o_layer = model[f"bert.encoder.layer.{layer}.attention.output.dense.weight"]
    o_layer_bias = model[f'bert.encoder.layer.{layer}.attention.output.dense.bias']
    ouput_states = torch.matmul(attn_output, o_layer.T)+o_layer_bias
    
    # Extract LayerNorm weights and biases for the attention output
    layer_output_norm_weight = model[f'bert.encoder.layer.{layer}.attention.output.LayerNorm.weight']
    layer_output_norm_bias = model[f'bert.encoder.layer.{layer}.attention.output.LayerNorm.bias']
    attention_output11 = layer_norm(ouput_states+layer_embedding_norm, layer_output_norm_weight, layer_output_norm_bias, eps=1e-12)

    # Compute intermediate dense layer output
    layer_intermediate_dense_weight = model[f'bert.encoder.layer.{layer}.intermediate.dense.weight']
    layer_intermediate_dense_bias = model[f'bert.encoder.layer.{layer}.intermediate.dense.bias']
    intermediate_dense_output = torch.matmul(attention_output11, layer_intermediate_dense_weight.T)+layer_intermediate_dense_bias
    
    # Apply GELU activation function
    intermediate_output = torch.nn.functional.gelu(intermediate_dense_output)
    
    # Compute output dense layer transformation
    layer_output_dense_weight = model[f'bert.encoder.layer.{layer}.output.dense.weight']
    layer_output_dense_bias = model[f'bert.encoder.layer.{layer}.output.dense.bias']
    output_dense = torch.matmul(intermediate_output, layer_output_dense_weight.T)+layer_output_dense_bias
    
    # Apply final LayerNorm for the current layer
    layer_output_layernorm_weight = model[f'bert.encoder.layer.{layer}.output.LayerNorm.weight']
    layer_output_layernorm_bias = model[f'bert.encoder.layer.{layer}.output.LayerNorm.bias']
    layer_embedding_norm = layer_norm(output_dense+attention_output11, layer_output_layernorm_weight, layer_output_layernorm_bias, eps=1e-12)


# Final output after all 12 layers of the transformer
layer_embedding_norm

In [None]:
layer_embedding_norm.shape

In [None]:
# Extract weights and biases for the dense layer in the prediction head
layer_intermediate_dense_weight = model['cls.predictions.transform.dense.weight']
layer_intermediate_dense_bias = model['cls.predictions.transform.dense.bias']

# Compute the output of the dense layer in the prediction head
prediction_output = torch.matmul(layer_embedding_norm, layer_intermediate_dense_weight.T)+layer_intermediate_dense_bias
prediction_output

In [None]:
# Apply GELU activation to the prediction output
prediction_act_dense_output = torch.nn.functional.gelu(prediction_output)

# Extract LayerNorm weights and biases for the prediction output
prediction_layernorm_weight = model['cls.predictions.transform.LayerNorm.weight']
prediction_layernorm_bias = model['cls.predictions.transform.LayerNorm.bias']
# Apply LayerNorm to the activated dense output
prediction_dense_output = layer_norm(prediction_act_dense_output, prediction_layernorm_weight, prediction_layernorm_bias, eps=1e-12)
prediction_dense_output

In [None]:
# List the prediction-head tensors stored in the state dict
[key for key in model.keys() if key.startswith('cls.predictions')]

In [None]:
# Extract decoder weights and biases for final prediction
decoder_weight = model['cls.predictions.decoder.weight'] # shape [vocab_size, hidden_size]
decoder_bias = model['cls.predictions.decoder.bias'] # shape [vocab_size]

# 'cls.predictions.bias' is not added separately: in the Hugging Face implementation
# it is tied to the decoder bias, so adding it would count the same bias twice.
# if 'cls.predictions.bias' in model:
#     decoder_bias += model['cls.predictions.bias']

# Project the final output to the vocabulary space to compute logits
logits = torch.matmul(prediction_dense_output, decoder_weight.T) + decoder_bias

In [None]:
logits

In [None]:
# Get the predicted tokens (indices into the vocabulary)
predictions = torch.argmax(logits, dim=-1)

In [None]:
input_ids = tokenizer.encode(prompt, return_tensors='pt')

In [None]:
tokenizer.decode(predictions[0])

In [None]:
predictions

### we now multiply the query weights with the token embeddings, to receive a query for each token
Here you can see the resulting shape is [12x768]: we have 12 tokens, and for each token there is a query of length 768, which splits into 12 heads of length 64 each.

### keys (almost the same as queries)

#### Here are the key points to remember:

- Key vectors are also of dimension 768 per token, matching the token embeddings, and are split into 12 heads of 64 dimensions each.
- Each attention head has its own slice of the key weights; BERT does not share key weights across heads.
- Keys and queries receive no separate positional adjustment. Positional information enters once, through the position embeddings added to the token embeddings at the input.

### at this stage we now have both the queries and the keys for each token

Each of the queries and keys is now of shape [12x768].


### attention
After calculating the attention scores and applying softmax, the attention output is obtained by multiplying the attention weights with the value vectors. For a single head, the resulting output has a shape of [12x64]: one 64-dimensional vector per token.

### multi-head attention
We now have the attention output for the first head in the first layer.

Next, we run the same calculation for each attention head in the first layer, so that all heads contribute to the final attention output.


We now have the attention outputs for all 12 heads in the first layer. Next, we concatenate them into a single matrix of size [12x768], where 12 is the number of tokens and 768 = 12 × 64 is the combined dimension from all heads.
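The per-head loop and the concatenation can be sketched on random tensors as follows. The last lines check that looping over heads and concatenating gives the same result as the batched computation followed by a transpose and reshape. Shapes mirror BERT's, but the values are random stand-ins.

```python
import torch

torch.manual_seed(0)
num_heads, seq_len, head_dim = 12, 12, 64
q = torch.randn(num_heads, seq_len, head_dim)
k = torch.randn(num_heads, seq_len, head_dim)
v = torch.randn(num_heads, seq_len, head_dim)

head_outputs = []
for head in range(num_heads):
    scores = q[head] @ k[head].T / head_dim ** 0.5   # [12, 12]
    weights = torch.softmax(scores, dim=-1)          # no causal mask: BERT is bidirectional
    head_outputs.append(weights @ v[head])           # [12, 64]

concatenated = torch.cat(head_outputs, dim=-1)       # [12, 768]

# Same computation batched over heads, then merged with transpose + reshape.
batched = torch.softmax(q @ k.transpose(-1, -2) / head_dim ** 0.5, dim=-1) @ v
merged = batched.transpose(0, 1).reshape(seq_len, num_heads * head_dim)
print(concatenated.shape, torch.allclose(concatenated, merged, atol=1e-4))
```

The loop is easier to read; the batched form is what you would use in practice.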

We’re almost at the finish line! :)

### weight matrix, one of the final steps
In BERT, after the self-attention mechanism within a transformer layer, the concatenated attention output is multiplied by the output weight matrix (`attention.output.dense`) and its bias is added.

We now have the change in the embedding value after attention. It is added to the layer's input (the residual connection), and the sum is normalized with LayerNorm.

### we normalize and then run a feed forward neural network through the result
After normalization, BERT feeds the result into a feed-forward neural network. This step introduces non-linearity and transforms each token's vector independently of the others.

### loading the ff weights and implementing the feed forward network
In BERT, the feed-forward network consists of two linear transformations with a GELU activation between them: 768 to 3072 dimensions, then back to 768. Its output is again added to its input and normalized with LayerNorm.

### WE FINALLY HAVE NEW EDITED EMBEDDINGS FOR EACH TOKEN AFTER THE FIRST LAYER
After processing, the embeddings now carry more contextual information. With each additional transformer layer, BERT encodes more complex queries, eventually producing an embedding that captures deep semantic information for each token in the sequence.

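In [None]:
# Feed-forward weights of the first layer (BERT has two dense layers, each with a bias)
w1 = model["bert.encoder.layer.0.intermediate.dense.weight"]
b1 = model["bert.encoder.layer.0.intermediate.dense.bias"]
w2 = model["bert.encoder.layer.0.output.dense.weight"]
b2 = model["bert.encoder.layer.0.output.dense.bias"]

In [None]:
# Layer-0 output: feed-forward, residual connection, then LayerNorm
output_after_feedforward = torch.matmul(torch.nn.functional.gelu(torch.matmul(attention_output, w1.T) + b1), w2.T) + b2
layer_0_embedding = layer_norm(attention_output + output_after_feedforward, model["bert.encoder.layer.0.output.LayerNorm.weight"], model["bert.encoder.layer.0.output.LayerNorm.bias"], eps=1e-12)
layer_0_embedding.shape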

In [None]:
# Per-head version of the forward pass, written out explicitly for all layers
final_embedding = normalized_embeddings[0]  # drop the batch dimension: [12, 768]
head_dim = dim // n_heads
for layer in range(n_layers):
    prefix = f"bert.encoder.layer.{layer}"
    qkv_attention_store = []
    # Split the bundled [768, 768] projections into one [64, 768] block per head
    q_layer = model[f"{prefix}.attention.self.query.weight"].view(n_heads, head_dim, dim)
    k_layer = model[f"{prefix}.attention.self.key.weight"].view(n_heads, head_dim, dim)
    v_layer = model[f"{prefix}.attention.self.value.weight"].view(n_heads, head_dim, dim)
    q_bias = model[f"{prefix}.attention.self.query.bias"].view(n_heads, head_dim)
    k_bias = model[f"{prefix}.attention.self.key.bias"].view(n_heads, head_dim)
    v_bias = model[f"{prefix}.attention.self.value.bias"].view(n_heads, head_dim)
    for head in range(n_heads):
        q_per_token = torch.matmul(final_embedding, q_layer[head].T) + q_bias[head]
        k_per_token = torch.matmul(final_embedding, k_layer[head].T) + k_bias[head]
        v_per_token = torch.matmul(final_embedding, v_layer[head].T) + v_bias[head]
        # Scaled dot-product scores; BERT is bidirectional, so there is no causal mask
        qk_per_token = torch.matmul(q_per_token, k_per_token.T) / head_dim**0.5
        qk_per_token_after_softmax = torch.nn.functional.softmax(qk_per_token, dim=-1)
        qkv_attention = torch.matmul(qk_per_token_after_softmax, v_per_token)
        qkv_attention_store.append(qkv_attention)

    # Concatenate the heads, apply the output projection, add the residual, and normalize
    stacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)
    w_layer = model[f"{prefix}.attention.output.dense.weight"]
    embedding_delta = torch.matmul(stacked_qkv_attention, w_layer.T) + model[f"{prefix}.attention.output.dense.bias"]
    embedding_after_edit = layer_norm(final_embedding + embedding_delta, model[f"{prefix}.attention.output.LayerNorm.weight"], model[f"{prefix}.attention.output.LayerNorm.bias"], eps=eps)
    # Feed-forward network: dense -> GELU -> dense, then residual and LayerNorm
    w1 = model[f"{prefix}.intermediate.dense.weight"]
    b1 = model[f"{prefix}.intermediate.dense.bias"]
    w2 = model[f"{prefix}.output.dense.weight"]
    b2 = model[f"{prefix}.output.dense.bias"]
    output_after_feedforward = torch.matmul(torch.nn.functional.gelu(torch.matmul(embedding_after_edit, w1.T) + b1), w2.T) + b2
    final_embedding = layer_norm(embedding_after_edit + output_after_feedforward, model[f"{prefix}.output.LayerNorm.weight"], model[f"{prefix}.output.LayerNorm.bias"], eps=eps)

### Final Embedding Representation
After passing through all transformer layers, BERT generates a final embedding that encapsulates the contextual information for each token. This embedding represents the model’s best understanding of the input, making it well-equipped to predict the masked token.

The shape of the embedding remains consistent with the original token embeddings, where the first dimension is the number of tokens and the second dimension is the embedding size.

In [None]:
# BERT applies no extra normalization here: every encoder layer already ends with LayerNorm
final_embedding.shape

## Final Prediction and Output
Once all transformer layers have processed the input, BERT converts the final embeddings into predictions by mapping them to the model’s vocabulary space.

### Decoding the Embeddings into Token Predictions
We use the masked language modeling head (a dense layer with GELU and LayerNorm, followed by the decoder projection) to translate the final embeddings into token predictions. BERT is not autoregressive, so it does not generate the next word in a sentence; it fills in the tokens hidden behind `[MASK]`.

### Using the Embedding of the [MASK] Token to Predict the Missing Word
In BERT, the prediction for a masked word is derived from the final embedding at the position of the `[MASK]` token, not from the last token in the sequence. The logits produced represent the model’s scores across the vocabulary, allowing it to select the most likely token for that position.
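To show what happens to the logits at the masked position, here is a stdlib-only sketch that turns a handful of scores into probabilities and ranks the candidates. The four-word vocabulary and the logit values are invented for illustration; they are not output from the model.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy vocabulary and logits for the [MASK] position, for illustration only.
toy_vocab = ["romans", "greeks", "locals", "cats"]
mask_logits = [4.0, 1.0, 2.5, -1.0]

probs = softmax(mask_logits)
ranked = sorted(zip(toy_vocab, probs), key=lambda pair: pair[1], reverse=True)
for token, prob in ranked[:3]:
    print(f"{token}: {prob:.3f}")
```

Taking the `argmax` of the logits, as the notebook does, is the same as picking the first entry of this ranking.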

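In [None]:
model["cls.predictions.decoder.weight"].shape

In [None]:
# Logits at the [MASK] position, computed from the prediction head output above
mask_index = (input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
logits = torch.matmul(prediction_dense_output[0, mask_index], decoder_weight.T) + decoder_bias
logits.shape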

IM HYPING YOU UP, this is the last cell of code, hopefully you had fun :)

In [None]:
next_token = torch.argmax(logits, dim=-1)
next_token

## lets go！

In [None]:
tokenizer.decode([next_token.item()])

In [None]:
[next_token.item()]

# thank you

This is the end. Hopefully you enjoyed reading it!
