# 3. Implementing the Transformer Core

This notebook contains three practical exercises designed to solidify your understanding of the Transformer architecture and introduce you to the Hugging Face ecosystem.

**Check out the Attention Function**

## 3.1 Scaled Dot-Product Attention Implementation

This section implements **Scaled Dot-Product Attention**, the *core operation* inside the Transformer architecture (used in models like BERT, GPT, etc.).

---

### 💡 First: What’s “Attention”?

Think of **attention** as a way for a model to *focus on relevant words* when processing a sentence.

For example, when translating:

> “The cat sat on the mat.”

If the model is currently working on the word **“sat”**, it should pay more *attention* to **“cat”** (the subject) than to “mat”.

That “focus” process is handled mathematically by **attention**.

---

### ⚙️ The Inputs

We have three important components:

| Symbol | Name  | Role                                                          |
| ------ | ----- | ------------------------------------------------------------- |
| **Q**  | Query | What we’re looking for (e.g., "which words relate to ‘sat’?") |
| **K**  | Key   | What each word represents (like an address or label)          |
| **V**  | Value | The actual word information or meaning we’ll retrieve         |

Each word in a sentence is represented as a vector (a list of numbers), and for each, we compute **Q**, **K**, and **V** by multiplying by learned weight matrices.
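
As a minimal sketch of that projection step, assuming random stand-ins for both the word embeddings and the learned matrices (the names `X`, `W_q`, `W_k`, `W_v` and all sizes are illustrative and not taken from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

seq_len, d_model, d_k = 3, 8, 4          # illustrative sizes
X = rng.normal(size=(seq_len, d_model))  # one embedding vector per word

# Stand-ins for the learned projection matrices (random here, trained in practice)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q  # queries, shape (seq_len, d_k)
K = X @ W_k  # keys,    shape (seq_len, d_k)
V = X @ W_v  # values,  shape (seq_len, d_k)

print(Q.shape, K.shape, V.shape)
```

All three matrices come from the same input `X`, which is what makes this *self*-attention; only the projection matrices differ.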

---

### 🧮 Step-by-Step Process of Scaled Dot-Product Attention

#### **Step 1. Compute raw attention scores**

We measure how much each Query relates to every Key:

$$
\text{scores} = Q K^T
$$

This gives a *similarity score*: higher means the Query “pays more attention” to that Key.

---

#### **Step 2. Scale the scores**

If the key dimension $d_k$ is large, the dot products can grow large in magnitude, which pushes softmax into saturated regions where gradients become very small.

So, we scale down by the square root of the dimension of the keys:

$$
\text{scaled scores} = \frac{Q K^T}{\sqrt{d_k}}
$$

where $d_k$ is the dimension (length) of the key vector.
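
The effect of the scaling can be checked empirically. The sketch below assumes query and key components drawn independently with mean 0 and variance 1 (an idealization, not a property of any particular model); under that assumption the variance of a raw dot product grows roughly like $d_k$, while the scaled version stays near 1. The sample count and the chosen dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5_000

variances = {}  # d_k -> (variance of raw dot products, variance after scaling)
for d_k in (4, 64, 256):
    q = rng.normal(size=(n_samples, d_k))  # components with mean 0, variance 1
    k = rng.normal(size=(n_samples, d_k))
    dots = np.sum(q * k, axis=1)           # one dot product per sample
    variances[d_k] = (dots.var(), (dots / np.sqrt(d_k)).var())
    print(f"d_k={d_k:3d}  raw variance={variances[d_k][0]:7.2f}  "
          f"scaled variance={variances[d_k][1]:.2f}")
```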

---

#### **Step 3. Apply Softmax**

We convert the scores into probabilities:

$$
\text{attention weights} = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)
$$

This makes the weights in each row sum to 1, so they represent *how much attention* each word gets.

---

#### **Step 4. Weight the Values**

Finally, we multiply the attention weights by the Values:

$$
\text{output} = \text{attention weights} \times V
$$

This gives us a *weighted combination of the word meanings*: words the model attends to more will contribute more to the output.

---

### 🧠 Putting It All Together (Formula)

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V
$$
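
The formula maps almost line for line onto a small function. This is a minimal NumPy sketch rather than a production implementation: the function name `scaled_dot_product_attention` and the choice to return the weights alongside the output are illustrative, and masking and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(d_k)  # (..., n_queries, n_keys)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))
```

A quick sanity check on the design: if every Key is identical, every Query scores them equally, so the weights become uniform and each output row is the plain average of the Value rows.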

---

### 🪄 Intuitive Summary

| Step | What Happens                | Example                               |
| ---- | --------------------------- | ------------------------------------- |
| 1    | Compare Query with all Keys | “How related is ‘sat’ to each word?”  |
| 2    | Scale the scores            | Keep values stable                    |
| 3    | Softmax to get weights      | Convert similarities into percentages |
| 4    | Weighted sum of Values      | Focus more on related words           |

---

### 🧩 In Transformers

Each **Self-Attention** layer does this for every word in parallel, so every word becomes aware of all others in the sentence.

Then, multiple such attentions (called **Multi-Head Attention**) are combined to let the model focus on *different relationships simultaneously* (like syntax, meaning, etc.).
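
To illustrate how several heads run side by side, here is a minimal NumPy sketch of multi-head self-attention. It assumes the common arrangement in which the model dimension is split evenly across heads and the concatenated heads pass through an output projection; the output projection `W_o`, the function name, and all sizes are assumptions for illustration, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                    # one attention map per head
    heads = weights @ V                                   # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 3, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)
```

Each head produces its own `(seq_len, seq_len)` attention map, which is what lets different heads specialize in different relationships.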

In [None]:
import numpy as np

# Step 1️⃣ - Create sample Query (Q), Key (K), and Value (V) matrices
# Suppose we have 3 words in a sentence, each represented by a 4-dimensional vector
Q = np.random.rand(3, 4)  # (number_of_words, d_k)
K = np.random.rand(3, 4)
V = np.random.rand(3, 4)
# Q.ndim

# Step 2️⃣ - Compute raw attention scores: Q × K^T
scores = np.dot(Q, K.T)  # shape: (3, 3)
# print("Raw Scores (Q x K^T):\n", scores)

# Step 3️⃣ - Scale the scores by sqrt(d_k)
d_k = K.shape[-1]
# print("\nd_k:", d_k)
scaled_scores = scores / np.sqrt(d_k)
print("\nScaled Scores:\n", scaled_scores)

# Step 4️⃣ - Apply Softmax to get attention weights
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))  # numerical stability
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

attention_weights = softmax(scaled_scores)
print("\nAttention Weights (after softmax):\n", attention_weights)

# Step 5️⃣ - Multiply attention weights by Values (V)
output = np.dot(attention_weights, V)
print("\nOutput (Weighted sum of V):\n", output)


Scaled Scores:
 [[0.58711767 0.41642551 0.70251316]
 [0.60428757 0.3900428  0.72479625]
 [0.59964182 0.3768067  0.71002685]]

Attention Weights (after softmax):
 [[0.33722283 0.28430618 0.37847099]
 [0.34068981 0.27498805 0.38432214]
 [0.34282328 0.27434323 0.38283349]]

Output (Weighted sum of V):
 [[0.42345302 0.27156022 0.34457865 0.37448412]
 [0.42580949 0.2649947  0.34185753 0.37916138]
 [0.42731504 0.26444115 0.34263962 0.37777634]]


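Interpretation:
Across all rows, the third Key (column 3) gets the highest attention share, so every Query draws most heavily on the third Value. Because Q, K, and V here are random numbers rather than learned projections of real words, this pattern carries no semantic meaning; with non-negative inputs from `np.random.rand`, a Key with larger entries tends to score highest for every Query.
In a trained model, the same pattern would mean that all words focus more on word₃ for meaning.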
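
Since the inputs in this cell are random, it is worth checking how much of the pattern depends on the particular draw. The sketch below repeats the computation for a few seeds and reports the average attention each Key receives; the seed range and the helper variable `top_keys` are arbitrary choices for illustration, and the Key that comes out on top may differ from seed to seed.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

top_keys = []
for seed in range(5):
    rng = np.random.default_rng(seed)
    Q, K = rng.random((3, 4)), rng.random((3, 4))  # uniform values in [0, 1)
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    share = weights.mean(axis=0)                   # average attention each Key receives
    top_keys.append(int(np.argmax(share)))
    print(f"seed={seed}  share per key={np.round(share, 3)}  top key index={top_keys[-1]}")
```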

In [None]:
import torch
import math

In [None]:
# Set tensor dimension (d_k) - determines the scaling factor
d_k = 4

# 1. Create dummy Q, K, V tensors (Batch size=1, Sequence length=3, d_k=4)
# In a real model, Q, K, and V are projections of the input embedding.
Q = torch.randn(1, 3, d_k)
K = torch.randn(1, 3, d_k)
V = torch.randn(1, 3, d_k)

print("--- Step 1: Query (Q) and Key (K) Shape ---")
print(f"Q shape: {Q.shape}, K shape: {K.shape}")
print("Q", Q)
print("K", K)
print("V", V)

--- Step 1: Query (Q) and Key (K) Shape ---
Q shape: torch.Size([1, 3, 4]), K shape: torch.Size([1, 3, 4])
Q tensor([[[-0.1748,  0.4172,  0.2035,  0.2766],
         [-0.0418, -0.2249, -0.0461,  0.2272],
         [-0.4191, -1.8049, -0.4794, -0.7844]]])
K tensor([[[-0.5081,  0.5085,  0.1379, -1.0159],
         [ 0.7072, -0.9508, -0.8331,  0.4412],
         [ 1.0048,  1.2164, -1.7924, -0.1342]]])
V tensor([[[-0.4106, -0.6584, -0.2627, -0.1248],
         [ 0.5591, -0.2125,  0.5781,  1.8891],
         [ 1.0720,  1.3338, -0.6086,  1.2277]]])


In [None]:
# 2. Calculate Q * K^T (Dot Product)
# torch.matmul performs matrix multiplication. K.transpose(-2, -1) transposes the last two dimensions.
scores = torch.matmul(Q, K.transpose(-2, -1))

print("\n--- Step 2: Attention Scores (Q * K^T) ---")
print(f"Scores shape: {scores.shape}")
print("Scores (raw similarity measure):")
print(scores)


--- Step 2: Attention Scores (Q * K^T) ---
Scores shape: torch.Size([1, 3, 3])
Scores (raw similarity measure):
tensor([[[ 0.0481, -0.5679, -0.0702],
         [-0.3302,  0.3229, -0.2635],
         [ 0.0259,  1.4731, -1.6520]]])


In [None]:
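# 3. Scale the scores
# Divide by the square root of d_k (the scaling factor).
# Note: this overwrites `scores`, so re-running this cell scales the values again;
# re-run the Step 2 cell first to get consistent results.
scores = scores / math.sqrt(d_k)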

print("\n--- Step 3: Scaled Scores ---")
print("Scaled Scores:")
print(scores)



--- Step 3: Scaled Scores ---
Scaled Scores:
tensor([[[-0.4903,  0.0669,  0.2154],
         [ 0.0511, -0.4743, -0.2591],
         [ 0.4357, -0.5256, -0.2156]]])


In [None]:
# 4. Apply Softmax to get Attention Weights
# The weights determine how much attention each word gives to every other word.
attention_weights = torch.softmax(scores, dim=-1)

print("\n--- Step 4: Attention Weights (Sum to 1) ---")
print(f"Attention Weights shape: {attention_weights.shape}")
print(f"Weights for the first token sum: {attention_weights[0, 0, :].sum().item():.4f}")
print("Attention Weights (Probabilities):")
print(attention_weights)



--- Step 4: Attention Weights (Sum to 1) ---
Attention Weights shape: torch.Size([1, 3, 3])
Weights for the first token sum: 1.0000
Attention Weights (Probabilities):
tensor([[[0.2096, 0.3659, 0.4245],
         [0.4302, 0.2544, 0.3155],
         [0.5253, 0.2009, 0.2739]]])


In [None]:
# 5. Multiply weights by Value (V) to get the final output
# This creates the final, context-aware representation for each token.
output = torch.matmul(attention_weights, V)

print("\n--- Step 5: Final Output (Context-Weighted V) ---")
print(f"Output shape: {output.shape}")
print("Output:")
print(output)



--- Step 5: Final Output (Context-Weighted V) ---
Output shape: torch.Size([1, 3, 4])
Output:
tensor([[[ 0.2257, -0.1308, -0.4276,  0.9686],
         [ 0.3692, -0.2423, -0.5842,  1.4215],
         [ 0.4288, -0.2882, -0.6476,  1.6196]]])


## 3.2: Building Core Components (Residuals and Normalization)

1. **Residual connection:** “Add the original input back to the output of a layer.” Residuals act as highways that let gradients flow easily through layers.

2. **Layer Normalization (LayerNorm):** After the residual is added, the values can drift in scale from layer to layer. Layer Normalization counters this by rescaling each token's feature vector to zero mean and unit variance (followed by a learnable scale and shift).
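
Both ideas can be written down compactly. The following is a short derivation sketch, assuming a generic sub-layer $F$ (attention or feed-forward) and a single token vector $x \in \mathbb{R}^{d}$; the symbols $F$, $\gamma$, $\beta$, and $\epsilon$ are notation introduced here for illustration.

For the residual connection, differentiating the output with respect to the input gives an identity term in addition to the sub-layer's own Jacobian:

$$
y = x + F(x) \quad\Rightarrow\quad \frac{\partial y}{\partial x} = I + \frac{\partial F}{\partial x}
$$

The identity term $I$ is the "highway": gradients reach earlier layers through it even when $\partial F / \partial x$ is small.

For Layer Normalization, the statistics are computed over the $d$ features of each token:

$$
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
\text{LayerNorm}(x)_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
$$

Here $\gamma$ and $\beta$ are the learnable scale and shift, and $\epsilon$ is a small constant that avoids division by zero.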

In [None]:
import torch
import torch.nn as nn

In [None]:
# 1. Simulate the input and the output of an imaginary Transformer sub-layer
input_tensor = torch.randn(1, 3, d_k) # Batch=1, Sequence=3, Embedding=4
sub_layer_output = torch.randn(1, 3, d_k) # Output from the Attention or Feed Forward layer

print("--- Step 1: Residual Connection ---")
print(f"Input mean: {input_tensor.mean():.4f}")
print(f"Sub-Layer Output mean: {sub_layer_output.mean():.4f}")

# The Residual Connection adds the input back to the sub-layer output
residual_output = input_tensor + sub_layer_output

print(f"Residual Output mean (Input + Output): {residual_output.mean():.4f}")
print("This mechanism allows gradient flow to 'skip' the layer (Residual Connection).\n")

--- Step 1: Residual Connection ---
Input mean: 0.0067
Sub-Layer Output mean: -0.0004
Residual Output mean (Input + Output): 0.0063
This mechanism allows gradient flow to 'skip' the layer (Residual Connection).



In [None]:
# 2. Apply Layer Normalization
# LayerNorm normalizes across the features (the last dimension, d_k=4 in this case)
# Note: LayerNorm requires the dimension size to initialize (4)
layer_norm = nn.LayerNorm(d_k)

# Layer Normalization is applied *after* the residual connection
normalized_output = layer_norm(residual_output)

print("--- Step 2: Layer Normalization ---")
print("Layer Norm stabilizes the output by ensuring a consistent distribution.")
print(f"Output mean after LayerNorm: {normalized_output.mean():.4f}")
print(f"Output standard deviation after LayerNorm: {normalized_output.std():.4f}")
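# Per token, the mean is ~0 and the (biased) std is ~1. The printed std is computed over
# all 12 elements with Bessel's correction, so it is about sqrt(12/11) ≈ 1.0445 rather than 1.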


--- Step 2: Layer Normalization ---
Layer Norm stabilizes the output by ensuring a consistent distribution.
Output mean after LayerNorm: 0.0000
Output standard deviation after LayerNorm: 1.0445
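
The printed standard deviation of 1.0445 can be reproduced by hand. The sketch below is a simplified NumPy re-implementation of the normalization, assuming the learnable scale and shift are at their initial values of 1 and 0 (so they are omitted); the helper name `layer_norm` and the random inputs are illustrative only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension using the biased variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3, 4))             # stand-in for the sub-layer input
sublayer_out = rng.normal(size=(1, 3, 4))  # stand-in for the sub-layer output
y = layer_norm(x + sublayer_out)           # residual connection, then normalization

per_token_std = y.std(axis=-1)  # biased std, one value per token
global_std = y.std(ddof=1)      # unbiased std over all 12 elements
print(per_token_std, global_std, np.sqrt(12 / 11))
```

Each token has a biased standard deviation of about 1, so the 12 normalized values have a sum of squares of about 12; dividing by 11 instead of 12 gives a variance of about 12/11, whose square root is roughly 1.0445.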


## 3.3: Hugging Face Ecosystem: Pre-trained Model Inference

This section uses the Hugging Face `transformers` library to load a small pre-trained model for Masked Language Modeling (MLM), a task where the model predicts a masked word in a sentence. This demonstrates the power of the Auto classes.

In [None]:
!pip install transformers



In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

### AutoModelForMaskedLM
A class that loads a pre-trained model with a Masked Language Modeling (MLM) head, that is, a model that predicts missing words in a sentence.

For example:
“The cat sat on the [MASK].”
The model predicts → “mat”.

`AutoModelForMaskedLM` automatically picks the correct architecture (e.g., BERT, RoBERTa, DistilBERT...) based on the model name you load.


### AutoTokenizer
The tokenizer converts text into numbers (token IDs) that the model understands, and vice versa.

### Pipeline
A high-level shortcut that hides all the setup (tokenizer + model + decoding)
and lets you perform tasks with one simple line.
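
To make the text-to-numbers step concrete, here is a deliberately tiny toy tokenizer. It is not how the Hugging Face tokenizers work internally (BERT-style tokenizers use WordPiece subword splitting and a vocabulary of tens of thousands of entries); the vocabulary, the `encode` and `decode` helpers, and the whitespace splitting are all hypothetical and exist only to show the round trip between text and IDs.

```python
# Toy vocabulary; the special tokens mimic the ones BERT-style models use
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "the", "cat", "sat", "on", "mat", "."]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def encode(text):
    # Lowercase, split the final period off as its own token, split on whitespace
    words = text.lower().replace(".", " .").split()
    tokens = ["[CLS]"] + ["[MASK]" if w == "[mask]" else w for w in words] + ["[SEP]"]
    # Unknown words fall back to the [UNK] id
    return [token_to_id.get(tok, token_to_id["[UNK]"]) for tok in tokens]

def decode(ids):
    return " ".join(vocab[i] for i in ids)

ids = encode("The cat sat on the [MASK].")
print(ids)
print(decode(ids))
```

The model only ever sees the list of IDs; the position holding the `[MASK]` ID is where it produces a prediction.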

In [None]:
# 1. Define the model name (DistilBERT is a fast, small Transformer)
MODEL_NAME = "distilbert-base-uncased"

In [None]:
# 2. Use the Auto classes to load the model and tokenizer
# AutoTokenizer loads the vocabulary and preprocessing rules
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# AutoModelForMaskedLM loads the architecture and pre-trained weights
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

print(f"Successfully loaded model and tokenizer for: {MODEL_NAME}\n")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Successfully loaded model and tokenizer for: distilbert-base-uncased



In [None]:
# 3. Use the high-level 'pipeline' for quick inference (Masked Language Modeling)
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

text_input = "The attention mechanism is the [MASK] of the Transformer."

# Run the prediction
results = fill_mask(text_input, top_k=5)

print(f"--- Predicting Mask for: '{text_input}' ---")
for i, result in enumerate(results):
    # 'token_str' is the predicted word
    print(f"{i+1}. Predicted token: '{result['token_str']}' (Score: {result['score']:.4f})")
    print(f"   Full sentence: {result['sequence']}")

Device set to use cpu


--- Predicting Mask for: 'The attention mechanism is the [MASK] of the Transformer.' ---
1. Predicted token: 'output' (Score: 0.1020)
   Full sentence: the attention mechanism is the output of the transformer.
2. Predicted token: 'inverse' (Score: 0.0475)
   Full sentence: the attention mechanism is the inverse of the transformer.
3. Predicted token: 'action' (Score: 0.0390)
   Full sentence: the attention mechanism is the action of the transformer.
4. Predicted token: 'function' (Score: 0.0377)
   Full sentence: the attention mechanism is the function of the transformer.
5. Predicted token: 'behavior' (Score: 0.0173)
   Full sentence: the attention mechanism is the behavior of the transformer.


In [None]:
# Load pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Build the pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Try it out
result = fill_mask("Data scientist are the building [MASK] of AI models.")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [None]:
# print(type(result))
for r in result:
  print(f"{r['sequence']}, has a score of {r['score']}")

data scientist the building block of ai models., has a score of 0.40205028653144836
data scientist the building blocks of ai models., has a score of 0.2717633843421936
data scientist the building model of ai models., has a score of 0.08002013713121414
data scientist the building up of ai models., has a score of 0.03662237152457237
data scientist the building designer of ai models., has a score of 0.019568249583244324


**The score = model’s confidence**

These numbers are probabilities (softmax scores) from the model’s output layer.

They represent how confident the model is that this word is the correct replacement.

So, for the output above:

0.4021 = ~40% confidence that “block” is correct.

0.2718 = ~27% confidence that “blocks” is correct.

The remaining candidates are much less likely (about 8% for “model” and under 4% for “up” and “designer”), meaning the model considers them poor fits.
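
The way such scores arise can be sketched with a toy example: the model emits one raw score (logit) per vocabulary entry at the masked position, softmax turns those into probabilities, and the top few are reported. The five-word vocabulary and the logit values below are hypothetical numbers chosen for illustration, not outputs of any real model.

```python
import numpy as np

vocab = ["block", "blocks", "model", "up", "designer"]  # toy candidate list
logits = np.array([3.0, 2.6, 1.4, 0.6, 0.0])            # hypothetical raw scores at [MASK]

# Softmax: exponentiate (shifted for stability) and normalize to sum to 1
probs = np.exp(logits - logits.max())
probs /= probs.sum()

top_k = 3
top_idx = np.argsort(probs)[::-1][:top_k]  # indices of the highest probabilities
for rank, idx in enumerate(top_idx, start=1):
    print(f"{rank}. {vocab[idx]} (score: {probs[idx]:.4f})")
```

Because the probabilities are normalized over the whole vocabulary, a score is best read relative to the other candidates rather than as an absolute measure of correctness.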

# Assignment:

Create five sentences with a [MASK] token (you can make them about Generative AI, Natural Language Processing, or daily life).

For each sentence:

- Run it through the pipeline.

- Print the top 3 predictions and their scores.

- Briefly interpret the results (why the top prediction makes sense).

Thank you!

In [None]:
!pip install torch
!pip install transformers



In [None]:
from transformers import pipeline

# Load BERT fill-mask pipeline
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

Device set to use cpu


In [None]:
sentences = [
    "Generative AI is transforming the way we [MASK] content.",
    "I always start my day with a cup of [MASK].",
    "Natural Language Processing helps computers understand [MASK].",
    "The student used a [MASK] model to generate text.",
    "Python is the most popular language for [MASK] learning."
]

# Loop through sentences and display predictions
for i, sentence in enumerate(sentences):
    print(f"{i+1}: {sentence}")
    results = fill_mask(sentence)
    for rank, result in enumerate(results[:3], 1):
        token = result["token_str"]
        score = round(result["score"], 4)
        print(f"  Top {rank}: {token} (score: {score})")
    print(f"  → Interpretation: '{results[0]['token_str']}' fits best in context.")

1: Generative AI is transforming the way we [MASK] content.
  Top 1: understand (score: 0.1402)
  Top 2: perceive (score: 0.1103)
  Top 3: view (score: 0.0857)
  → Interpretation: 'understand' fits best in context.
2: I always start my day with a cup of [MASK].
  Top 1: coffee (score: 0.8125)
  Top 2: tea (score: 0.1553)
  Top 3: water (score: 0.0077)
  → Interpretation: 'coffee' fits best in context.
3: Natural Language Processing helps computers understand [MASK].
  Top 1: language (score: 0.2233)
  Top 2: speech (score: 0.1297)
  Top 3: languages (score: 0.0717)
  → Interpretation: 'language' fits best in context.
4: The student used a [MASK] model to generate text.
  Top 1: computer (score: 0.1328)
  Top 2: mathematical (score: 0.0978)
  Top 3: linear (score: 0.0215)
  → Interpretation: 'computer' fits best in context.
5: Python is the most popular language for [MASK] learning.
  Top 1: machine (score: 0.729)
  Top 2: distance (score: 0.0269)
  Top 3: language (score: 0.022