**FIRST LAYER**

1. Introduction to BERT and the First Layer of Embeddings
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer model designed to understand the context of words in a sentence bidirectionally. It processes the entire sentence at once (both left-to-right and right-to-left) to capture rich context.

The first layer in BERT is where the model starts processing input text. Each word or token in the input is transformed into an embedding that contains information about its meaning and context within the sentence. BERT uses three types of embeddings:

Token Embeddings: Represent individual words or tokens.
Position Embeddings: Encode the position of each token in the sentence.
Segment Embeddings: Indicate which sentence the token belongs to (useful in tasks involving sentence pairs).


2. How BERT Embeddings Work
Each token in the input sentence is converted into an embedding vector in three steps:

Tokenization: The input sentence is split into smaller units called "tokens" using BERT's tokenizer.
Embedding Layer: The tokens are then converted into three types of embeddings:
Token Embeddings: The meaning of the token itself.
Position Embeddings: The position of the token in the sentence.
Segment Embeddings: Which sentence the token belongs to.


3. The Components of BERT's Embedding Layer (First Layer)
Token Embeddings
What it is: Each token in the input sentence is represented as a high-dimensional vector (usually 768 dimensions for bert-base-uncased).
How it's obtained: The tokenizer converts each word or subword into a token. Each token is mapped to a specific embedding in a large lookup table.
Example: For the sentence "BERT is powerful", the tokenizer might split "BERT" into the token [BERT] and "is" into [is], each having a corresponding embedding.
Position Embeddings
What it is: Since transformers don't have a sense of the position of words (as opposed to RNNs), position embeddings are added to provide this information.
How it's obtained: Each position in the sentence (i.e., token index) is assigned a unique vector. These position embeddings are learned during pre-training.
Example: The position of "BERT" would have a different position embedding compared to "is" based on their location in the sentence.
Segment Embeddings
What it is: Segment embeddings are used to differentiate between two sentences in tasks like question answering.
How it's obtained: BERT uses 0 for the first sentence and 1 for the second sentence (if any).
Example: In tasks like Sentence A: "What is BERT?" and Sentence B: "It is a transformer model," the first sentence would have segment embeddings of [0] and the second one would have [1].


5. Self-Attention Mechanism in the First Layer
The first layer in BERT utilizes the self-attention mechanism to process the embeddings.

Self-Attention: Each token attends to every other token in the input to decide how much weight (importance) each token should receive when forming its representation. This happens at every layer of BERT.
Why it's important: Self-attention helps BERT capture contextual relationships between words (e.g., "BERT is a model" vs "BERT is a transformer model").
In the First Layer: The first layer’s self-attention learns how each token should interact with others to better represent the sentence structure and context.
5. Output of the First Layer
Token Representation: After passing through the first layer, each token has a representation that is context-sensitive, i.e., the embedding of "BERT" will change depending on surrounding words.
Shape of Output: The output shape for each token embedding in the first layer is (batch_size, sequence_length, hidden_size), where:
batch_size is the number of sentences in the input batch.
sequence_length is the number of tokens in the sentence after tokenization.
hidden_size is typically 768 in bert-base.


6. Methods Involved in BERT's First Layer
Step 1: Tokenization
Method: BertTokenizer.from_pretrained('bert-base-uncased')
Functionality: Converts raw text into tokens that BERT understands. It uses WordPiece tokenization, breaking words into smaller subwords.
Step 2: Embedding Layer
Method: The embeddings are created using the embedding lookup table.
embedding_tokens: For each token, retrieve its corresponding embedding.
embedding_positions: Each token's position gets a position embedding.
embedding_segments: For sentence segmentation, retrieve the segment embedding.
Step 3: Self-Attention
Method: torch.matmul (Matrix multiplication)
First, compute the Query, Key, and Value matrices based on the input embeddings.
Then compute the attention scores by taking the dot product of the Query and Key matrices, followed by a softmax function to get attention weights.
Finally, use these attention weights to compute the output for each token.
Step 4: Combine Embeddings
Method: The token embeddings, position embeddings, and segment embeddings are combined element-wise.
The combined embedding is then passed through the first self-attention layer.

7. Key Takeaways for the First Layer
Token Embeddings: These embeddings represent the meaning of individual tokens.
Position Embeddings: Help provide context for each token's position in the sequence.
Segment Embeddings: Used to differentiate between sentences in tasks like sentence pair classification.
Self-Attention: The core mechanism that allows BERT to understand the relationships between tokens in the context of the entire sentence.


In [1]:
import torch
from transformers import BertTokenizer, BertModel

# Initialize the tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

# Define the paragraphs
paragraphs = [
    "BERT is a transformer model developed by Google. It has revolutionized NLP by introducing bidirectional context and allowing pre-training on vast datasets.",
    "Padding is used to ensure that all inputs in a batch have the same length. This simplifies batch processing in deep learning.",
    "When dealing with paragraphs, BERT can only process up to 512 tokens at a time. Therefore, long paragraphs may need to be truncated."
]

# Tokenize and encode the paragraphs
inputs = tokenizer(paragraphs, return_tensors="pt", padding=True, truncation=True, max_length=128)

# Forward pass through the model
with torch.no_grad():
    outputs = model(**inputs)
    hidden_states = outputs.hidden_states  # List of hidden states for each layer, including embeddings

# Token Embeddings (retrieved from hidden_states[0] which is the output of the embedding layer)
token_embeddings = hidden_states[0]
print("Token Embeddings for first paragraph:\n", token_embeddings[0])
print("Shape of Token Embeddings:", token_embeddings[0].shape)
print()

# Position Embeddings: we need to expand it to match the batch dimension
position_embeddings = model.embeddings.position_embeddings.weight[:inputs['input_ids'].shape[1]]
position_embeddings = position_embeddings.unsqueeze(0).expand(inputs['input_ids'].size(0), -1, -1)
print("Expanded Position Embeddings:\n", position_embeddings)
print("Shape of Expanded Position Embeddings:", position_embeddings.shape)
print()

# Segment Embeddings
segment_ids = inputs['token_type_ids']
segment_embeddings = model.embeddings.token_type_embeddings(segment_ids)
print("Segment Embeddings for first paragraph:\n", segment_embeddings[0])
print("Shape of Segment Embeddings:", segment_embeddings[0].shape)
print()

# Combined Embedding Output for each paragraph
combined_embedding_output = token_embeddings + position_embeddings + segment_embeddings
print("Combined Token + Position + Segment Embeddings for the batch:\n", combined_embedding_output)
print("Shape of Combined Embedding Output:", combined_embedding_output.shape)


Token Embeddings for first paragraph:
 tensor([[ 0.1686, -0.2858, -0.3261,  ..., -0.0276,  0.0383,  0.1640],
        [ 0.9295,  0.5056,  0.4276,  ...,  0.9511,  0.7785, -0.2679],
        [-0.6270, -0.0633, -0.3143,  ...,  0.3427,  0.4636,  0.4594],
        ...,
        [-0.4845, -0.4881,  0.7400,  ..., -0.3568, -0.2392,  0.1933],
        [-0.0812,  0.1353,  0.0899,  ...,  0.0494,  0.7483,  0.5275],
        [-0.0908, -0.2099,  0.0628,  ..., -0.7465,  0.4288, -0.2265]])
Shape of Token Embeddings: torch.Size([36, 768])

Expanded Position Embeddings:
 tensor([[[ 1.7505e-02, -2.5631e-02, -3.6642e-02,  ...,  3.3437e-05,
           6.8312e-04,  1.5441e-02],
         [ 7.7580e-03,  2.2613e-03, -1.9444e-02,  ...,  2.8910e-02,
           2.9753e-02, -5.3247e-03],
         [-1.1287e-02, -1.9644e-03, -1.1573e-02,  ...,  1.4908e-02,
           1.8741e-02, -7.3140e-03],
         ...,
         [ 6.4262e-03, -2.1805e-02, -1.0427e-03,  ..., -1.7709e-02,
           1.8698e-03, -6.5411e-03],
         [ 6

The discrepancy in the dimensions is because the token embeddings and segment embeddings are processed in batches (one for each paragraph), resulting in a shape of [batch_size, sequence_length, hidden_size], which is [3, 36, 768] in your case. However, position embeddings are shared across the entire batch and only depend on the sequence length, so they have a shape of [sequence_length, hidden_size], which is [36, 768]

position embeddings have a fixed length matching the maximum token length per sentence (e.g., 512), while token and segment embeddings vary by batch size.

**SECOND LAYER**

1. Self-Attention
Self-attention is the key mechanism in BERT, and it computes how each token should attend to every other token in the sequence. In the second layer, BERT applies multi-head self-attention to create contextualized representations of each token, considering the entire input sequence.

Q (Query), K (Key), V (Value) Calculation: For each token, three vectors are computed:

Query (Q): Represents the token’s perspective (what it looks for in other tokens).
Key (K): Represents the information available to other tokens.
Value (V): Contains the actual information that will be passed along.
These are generated by multiplying the token embeddings with learned weight matrices.

Attention Scores Calculation: The attention scores between tokens are calculated by performing a dot product between the query vector of one token and the key vector of all other tokens. This gives an idea of how much attention one token should give to others.


Scaled Dot-Product: The result of the dot product is then scaled by the square root of the key dimension to prevent excessively large values that could skew the attention.

Softmax: A softmax function is applied to normalize the attention scores, ensuring that they sum to 1, making them interpretable as probabilities.

Weighted Sum of Values (V): Finally, the weighted sum of the value vectors is computed, where the weights are the attention scores. This gives a new context-aware representation for each token.

2. Multi-Head Attention
Instead of using a single set of attention weights, BERT uses multiple heads for self-attention. This allows the model to focus on different parts of the sentence using different perspectives at the same time. After each attention head processes the sequence, the results are concatenated and passed through a linear layer to combine the information.

3. Add & Norm
Once self-attention has been applied, the result is added to the input (called a residual connection), and layer normalization is performed to stabilize training and help with convergence.

Summary of Steps in the Second Layer:
Input: Token embeddings (with position and segment embeddings added).
Self-Attention: Compute Q, K, V for each token and perform the attention mechanism to get new token representations.
Multi-Head Attention: Apply attention in parallel using multiple heads, then concatenate the results.
Residual Connection: Add the original token embeddings to the output of the self-attention mechanism.
Layer Normalization: Normalize the result to ensure stable training.

In [2]:
inputs = tokenizer(paragraphs, return_tensors="pt", padding=True, truncation=True, max_length=128)
# Get the model outputs
outputs = model(**inputs)

# Extract the token embeddings from the last hidden state
token_embeddings = outputs.last_hidden_state
# Define linear layers to create Q, K, V matrices
hidden_size = token_embeddings.size(-1)
W_Q = torch.nn.Linear(hidden_size, hidden_size)
W_K = torch.nn.Linear(hidden_size, hidden_size)
W_V = torch.nn.Linear(hidden_size, hidden_size)

# Apply Q, K, V transformations to the token embeddings for each token in the batch
Q = W_Q(token_embeddings)
K = W_K(token_embeddings)
V = W_V(token_embeddings)


In [3]:
import torch.nn.functional as F

# Calculate the attention scores (dot product of Q and K, scaled by sqrt of hidden size)
attention_scores = torch.matmul(Q, K.transpose(-1, -2)) / (hidden_size ** 0.5)
# Apply softmax to get attention probabilities
attention_probs = F.softmax(attention_scores, dim=-1)
# Compute the weighted sum of V vectors based on the attention probabilities
attention_output = torch.matmul(attention_probs, V)


In [4]:
print("Query")
print(Q)
print("Length og query is: ",len(Q[0]))
print(Q[0].size())
print()
print("Value")
print(V)
print("Length of value is: ",len(V[0]))
print()
print("Key")
print(K)
print("length of key is",len(K[0]))
print("Attention Output for each paragraph:")
print(attention_output)


Query
tensor([[[-0.5497,  0.1242,  0.1225,  ..., -0.2947, -0.6243, -0.4016],
         [ 0.0134,  0.3883,  0.3186,  ..., -0.0808, -0.4024, -0.2399],
         [ 0.0097, -0.2053,  0.4934,  ..., -0.3806, -0.1386, -0.2752],
         ...,
         [-0.0224,  0.0101,  0.3008,  ..., -0.2930, -0.1133, -0.2517],
         [ 0.3751,  0.0544,  0.1858,  ..., -0.1016, -0.0328, -0.0596],
         [-0.1022, -0.3221, -0.3149,  ..., -0.1627, -0.1107, -0.2446]],

        [[-0.8369, -0.1108,  0.1554,  ..., -0.0823, -0.3242, -0.2340],
         [-0.2528,  0.0139, -0.5795,  ..., -0.1524, -0.2834, -0.8862],
         [-0.1286, -0.3708,  0.3216,  ..., -0.3282,  0.0915, -0.3603],
         ...,
         [-0.1671,  0.0651,  0.0740,  ...,  0.0999, -0.1939, -0.0031],
         [-0.0544,  0.3365,  0.2446,  ..., -0.0227, -0.3401,  0.0894],
         [-0.1258,  0.0091,  0.0557,  ...,  0.1007, -0.2979,  0.0954]],

        [[-0.6766,  0.1456,  0.1801,  ..., -0.1271, -0.2301, -0.2722],
         [-0.5991,  0.1168,  0.0158,  .

**Layer 3**: Early Contextual Understanding
Self-Attention Mechanism: At this stage, BERT starts capturing inter-token relationships in a sentence. The model can attend to any other token in the sequence, meaning each word can "look" at other words in the sentence to understand their relevance.
Key Transformation: Each token now has a deeper understanding of its surrounding tokens. For example, "cat" in the sentence "The cat sat on the mat" attends to "sat" and "mat" to better understand its role in the sentence.
Output: The output of this layer will contain more informative representations where each token’s meaning is enriched with information from the surrounding context.

**Layer 4**: Focus on Syntactic Relations
Self-Attention Refined: By Layer 4, the model further refines its ability to capture the syntax of the sentence. While previous layers might have captured basic relationships, Layer 4 emphasizes understanding grammatical structures (such as subject-verb-object) and syntactic dependencies.
Output: The contextual understanding of tokens is now deeply grounded in their syntactic roles, which is useful for tasks that require understanding sentence structure or parts-of-speech.


**Layer 5**: Enhanced Semantic Understanding
Bidirectional Attention: The model continues to learn from both directions (left-to-right and right-to-left) to get a more holistic representation of each token.
Abstract Meaning: At this point, Layer 5 starts capturing higher-level meanings and abstract relationships. For example, it learns that "sat" in "The cat sat on the mat" has a semantic connection to actions and objects in the sentence.
Output: The embeddings at this layer are more semantically aware and can distinguish meaning based on context, handling ambiguity better than previous layers.

**Layer 6**: Deepening Contextual Understanding
Complex Token Interactions: Layer 6 learns more about how different tokens in the sequence relate to each other. The self-attention mechanism here enables the model to capture more long-range dependencies, meaning the relationship between tokens in different parts of the sentence becomes clearer.
Output: The representations are further refined, considering both local context (nearby words) and global context (farther apart words), which makes the model more robust for understanding complex sentences.

**Layer 7**: Understanding Disambiguation
Polysemy Handling: Layer 7 helps BERT handle polysemy (words with multiple meanings). The model understands that the word "bank" in "river bank" is different from "bank" in "savings bank."
Contextualization: BERT’s self-attention mechanism enables it to learn that the meaning of words can change depending on surrounding words. This layer focuses on disambiguating words based on context.
Output: Layer 7 ensures that token representations now contain the correct sense of words based on their usage in the sentence.

**Layer 8**: Improving Sentence-Level Understanding
Sentence Relationships: At this layer, BERT starts capturing sentence-level relationships, learning how sentences within a paragraph or document connect. For example, it may learn that "but" in one sentence negates or contrasts with a statement in a previous sentence.
Cross-Sentence Attention: Layer 8’s self-attention mechanism enables it to focus on relationships between distant tokens across sentences, improving its ability to process documents or paragraphs.
Output: This representation is now suitable for tasks that require understanding relationships between sentences, such as question-answering or sentence similarity.

**Layer 9**: Handling Long-Term Dependencies
Long-Term Context: Layer 9 captures even longer-term dependencies between tokens that are far apart in the text. This is especially useful for understanding narrative or discourse-level information, where the relationship between words or phrases can span many sentences.
Memory of Context: The model improves its ability to remember important contextual information over long stretches of text, which helps in tasks like document classification or summarization.
Output: The output at this layer contains more abstract and deep contextual representations that are important for understanding long passages or complex texts.

**Layer 10**: Contextual Sensitivity
Enhanced Sensitivity to Context: This layer further increases the model’s sensitivity to both local and global context. It now has a fine-grained understanding of how the meaning of a word or sentence can change depending on the larger context, including previous and future sentences.
Discourse-Level Understanding: Layer 10 starts to focus on discourse-level features, such as coherence between different sections of a text. For instance, understanding how an introductory sentence relates to a conclusion.
Output: The embeddings produced by this layer are highly contextual and capable of dealing with the complexities of language involving discourse markers like “however” or “therefore.”

**Layer 11**: Deep Semantic Representation
Abstract Semantic Features: By this stage, BERT is able to capture very abstract semantic features of the text. Layer 11 synthesizes all previous layers' information and begins to create a more holistic understanding of the text, including nuances like sarcasm, irony, or emotion.
Task-Specific Focus: Depending on the task at hand (e.g., classification or question-answering), Layer 11 adjusts the representation to emphasize relevant aspects of the text.
Output: The output is semantically rich and deeply informed by context, making it suitable for a wide variety of NLP tasks.


**Layer 12**: Final Representation
Top-Level Understanding: Layer 12 represents the final, fully contextualized representation of each token, considering every preceding layer's transformations. This layer produces the final embeddings for downstream tasks.
Rich, Deep Embeddings: This layer’s output is ready to be used for various NLP tasks, including classification, summarization, translation, question-answering, etc.
Output: The embeddings here are rich, incorporating all learned information and capable of understanding not only word meanings but also relationships across sentences and documents.

In [8]:
import torch
from transformers import BertTokenizer, BertModel

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

paragraphs = [
    "BERT is a transformer model developed by Google. It has revolutionized NLP by introducing bidirectional context.",
    "Padding is used to ensure that all inputs in a batch have the same length. This simplifies batch processing in deep learning.",
    "When dealing with paragraphs, BERT can only process up to 512 tokens at a time."
]

# Tokenize the paragraphs and get model inputs
inputs = tokenizer(paragraphs, return_tensors="pt", padding=True, truncation=True, max_length=128)

# Get the model outputs (including hidden states for each layer)
outputs = model(**inputs)

# Extract the hidden states for each layer (including the last hidden state)
hidden_states = outputs.hidden_states

# Print the number of layers in BERT
num_layers = len(hidden_states) 
print(f"Total number of layers: {num_layers}")

# Access hidden states for Layers 3 to 12
for layer_idx in range(3, 13):  
    layer_output = hidden_states[layer_idx]
    print(f"Layer {layer_idx + 1} output shape: {layer_output.shape}")
    
    print(layer_output)
    print(len(layer_output))
    print(layer_output.shape)


Total number of layers: 13
Layer 4 output shape: torch.Size([3, 29, 768])
tensor([[[ 7.5136e-02, -3.6864e-01, -1.9043e-01,  ...,  3.6594e-01,
           1.9765e-01,  1.4205e-01],
         [ 1.0977e+00, -4.3360e-01,  7.1244e-01,  ...,  3.1043e-01,
           5.7346e-01, -8.7117e-01],
         [-1.1883e+00, -6.1086e-01, -9.5882e-02,  ...,  4.1781e-01,
           3.9787e-01,  5.5537e-01],
         ...,
         [ 1.3817e-01, -4.3395e-01,  3.7636e-01,  ...,  3.2012e-01,
          -1.4817e-01, -7.4781e-02],
         [-2.8239e-01, -3.2095e-01,  3.7810e-01,  ...,  3.9485e-01,
          -2.0048e-01, -2.2469e-01],
         [-5.8460e-02, -3.4482e-01,  4.7770e-01,  ...,  3.4661e-01,
          -3.8280e-01, -3.2190e-01]],

        [[ 5.9574e-04, -3.6950e-01, -1.5953e-01,  ...,  2.3233e-01,
           3.0059e-01,  1.5014e-01],
         [ 5.1878e-01, -8.9722e-01,  1.1747e+00,  ...,  6.4556e-01,
          -1.1184e+00, -8.2523e-01],
         [ 1.7718e-01, -1.3090e+00,  6.0979e-01,  ..., -8.8948e-02,
  