# **Jupyter Notebook: DistilBERT CPU FLOPs calculation**

## **Block 1: Load Model**
This block is responsible for:
- **Loading the `DistilBERT` pre-trained model**
- **Loading the `DistilBERT` Tokenizer**
- **Ensuring the model runs on CPU**
- **Reading model configuration information**
- **Printing model details (number of parameters, layers, hidden size, etc.)**

In [1]:
### Block 1: Load Model ###
import torch
import time
from transformers import DistilBertTokenizer, DistilBertModel

# Print PyTorch version
print(f"Using PyTorch version: {torch.__version__}")

# Load the pre-trained model and tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

# Ensure the model runs on CPU
device = torch.device("cpu")
model.to(device)

# Read model configuration information
config_dict = model.config.to_dict()

print("\n===== DistilBERT Model Info =====")
print(f"Number of Parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Number of Transformer Layers: {config_dict['n_layers']}")
print(f"Hidden Size: {config_dict['dim']}")
print(f"FFN Hidden Size (hidden_dim): {config_dict['hidden_dim']}")
print(f"Number of Attention Heads: {config_dict['n_heads']}")

print(f"Embedding Layer dtype: {model.embeddings.word_embeddings.weight.dtype}")  # Print data type of embedding layer
print(f"Transformer Block 0 FFN dtype: {model.transformer.layer[0].ffn.lin1.weight.dtype}")  # Print data type of FFN layer in the first transformer block
print(f"Transformer Block 1 Self-Attention dtype: {model.transformer.layer[1].attention.q_lin.weight.dtype}")  # Print data type of self-attention layer in the second transformer block

Using PyTorch version: 2.5.1

===== DistilBERT Model Info =====
Number of Parameters: 66,362,880
Number of Transformer Layers: 6
Hidden Size: 768
FFN Hidden Size (hidden_dim): 3072
Number of Attention Heads: 12
Embedding Layer dtype: torch.float32
Transformer Block 0 FFN dtype: torch.float32
Transformer Block 1 Self-Attention dtype: torch.float32
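
The reported parameter count can be cross-checked by hand from the configuration. The sketch below is a minimal reconstruction, assuming the standard `distilbert-base-uncased` values `vocab_size = 30522` and `max_position_embeddings = 512` (neither is printed above) and that every linear layer and LayerNorm carries a bias. The function name `distilbert_param_count` is illustrative, not part of any library.

```python
def distilbert_param_count(vocab_size=30522, max_positions=512, dim=768,
                           hidden_dim=3072, n_layers=6):
    """Count parameters of the DistilBERT encoder (no task head)."""
    layer_norm = 2 * dim  # Scale and shift vectors

    # Embeddings: word table, position table, one LayerNorm
    embeddings = vocab_size * dim + max_positions * dim + layer_norm

    # Attention: Q, K, V, and output projections, each dim x dim plus bias
    attention = 4 * (dim * dim + dim)

    # FFN: dim -> hidden_dim -> dim, each linear layer with bias
    ffn = (dim * hidden_dim + hidden_dim) + (hidden_dim * dim + dim)

    # Each Transformer layer has two LayerNorms (after attention, after FFN)
    per_layer = attention + ffn + 2 * layer_norm

    return embeddings + n_layers * per_layer


print(f"{distilbert_param_count():,}")  # 66,362,880
```

Roughly a third of the parameters sit in the embedding tables, which matters later: a table lookup contributes almost nothing to the FLOPs count.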


## **Block 2: Tokenization**

- **Converting input text into tokens**
- **Printing the tokenized results**
- **Checking `input_ids` and `attention_mask`**

In [2]:
### Block 2: Tokenization ###
# Define input text
text = "I love deep learning. It’s very interesting"

# Perform tokenization
inputs = tokenizer(text, return_tensors="pt")

# Print tokenized results
print("\n===== Tokenization Results =====")
print(f"Input Text: {text}")
# tokenize() returns word pieces only; the tokenizer call above also adds [CLS] and [SEP]
print(f"Tokenized Tokens: {tokenizer.tokenize(text)}")
print(f"Tokens Length: {inputs.input_ids.size()[1]}")
print(f"Token IDs: {inputs['input_ids']}")
print(f"Attention Mask: {inputs['attention_mask']}")


===== Tokenization Results =====
Input Text: I love deep learning. It’s very interesting
Tokenized Tokens: ['i', 'love', 'deep', 'learning', '.', 'it', '’', 's', 'very', 'interesting']
Tokens Length: 12
Token IDs: tensor([[ 101, 1045, 2293, 2784, 4083, 1012, 2009, 1521, 1055, 2200, 5875,  102]])
Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
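
The output lists 10 word pieces but a sequence length of 12. The difference comes from the special tokens `[CLS]` (ID 101) and `[SEP]` (ID 102) that wrap the sequence in `input_ids`. A small sketch of that bookkeeping, with the token list copied from the output above:

```python
# Word-piece tokens as printed by tokenizer.tokenize(text) above
tokens = ['i', 'love', 'deep', 'learning', '.', 'it', '’', 's', 'very', 'interesting']

# The tokenizer call adds [CLS] at the start and [SEP] at the end
full_sequence = ["[CLS]"] + tokens + ["[SEP]"]

seq_length = len(full_sequence)
print(len(tokens), seq_length)  # 10 12
```

All later FLOPs formulas use the full length of 12, since the special tokens pass through every layer like any other token.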


## **Block 3: Embedding Calculation**

- **Convert token IDs into 768-dimensional embedding vectors**
- **Check the shape of the `Embedding` output**
- **Compute `Embedding` FLOPs (computational cost)**

In [3]:
### Block 3: Compute Token Embedding ###
# Get the model's embedding layer
embedding_layer = model.embeddings.word_embeddings

# Compute embeddings for token IDs
token_embeddings = embedding_layer(inputs["input_ids"])

# Print results
print("\n===== Token Embedding Results =====")
print(f"Token IDs Shape: {inputs['input_ids'].shape}")
print(f"Embedding Output Shape: {token_embeddings.shape}")  # (batch_size, seq_len, hidden_dim)

# Compute FLOPs
seq_length = token_embeddings.shape[1]
hidden_dim = token_embeddings.shape[2]
embedding_flops = seq_length * hidden_dim

print(f"Embedding FLOPs: {embedding_flops:,} FLOPs")


===== Token Embedding Results =====
Token IDs Shape: torch.Size([1, 12])
Embedding Output Shape: torch.Size([1, 12, 768])
Embedding FLOPs: 9,216 FLOPs


## **Block 4: Self-Attention**

- **Calculate Query (Q), Key (K), Value (V) Projection**
- **Calculate Multi-Head `QK^T` (Attention Scores)**
- **Calculate Multi-Head `Softmax(QK^T) V` (Weighted Attention)**
- **Calculate Output (O) Projection**
- **Sum Self-Attention FLOPs**

In [4]:
### Block 4: Multi-Head Self-Attention Calculation ###
import torch.nn.functional as F

# Extract the first layer of DistilBERT Transformer
transformer_layer = model.transformer.layer[0]

# Get parameters from the model configuration
num_heads = model.config.n_heads  # 12 Attention heads
hidden_dim = model.config.dim  # 768
d_k = hidden_dim // num_heads  # Calculate the dimension of each Attention head

# Get Self-Attention weights
W_Q = transformer_layer.attention.q_lin.weight
W_K = transformer_layer.attention.k_lin.weight
W_V = transformer_layer.attention.v_lin.weight
W_O = transformer_layer.attention.out_lin.weight  # Final projection of MHA

# Calculate MatMul FLOPs (Multiply)
flops_qkv_mul = 3 * seq_length * hidden_dim * hidden_dim  # Q, K, V projections
flops_qk_t_mul = num_heads * seq_length * seq_length * d_k  # QK^T
flops_attn_v_mul = num_heads * seq_length * seq_length * d_k  # Softmax(QK^T) * V
flops_output_proj_mul = seq_length * hidden_dim * hidden_dim  # W_O projection

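# Calculate MatMul FLOPs (Add): a length-n dot product needs n - 1 additions; bias additions are not counted
flops_qkv_add = 3 * seq_length * hidden_dim * (hidden_dim - 1)  # Q, K, V projections
flops_qk_t_add = num_heads * seq_length * seq_length * (d_k - 1)  # QK^T
# Approximation: Softmax(QK^T) * V sums over seq_length terms, so the exact
# count would be num_heads * seq_length * d_k * (seq_length - 1)
flops_attn_v_add = num_heads * seq_length * seq_length * (d_k - 1)  # Softmax(QK^T) * V
flops_output_proj_add = seq_length * hidden_dim * (hidden_dim - 1)  # W_O projection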

# Calculate total MatMul FLOPs for Attention
flops_qkv = flops_qkv_mul + flops_qkv_add
flops_qk_t = flops_qk_t_mul + flops_qk_t_add
flops_attn_v = flops_attn_v_mul + flops_attn_v_add
flops_output_proj = flops_output_proj_mul + flops_output_proj_add

# Calculate Softmax FLOPs
softmax_exp = num_heads * seq_length * seq_length  # Exponential calculation
softmax_exp_add = num_heads * seq_length * seq_length  # Add the Exponential results
softmax_div = num_heads * seq_length * seq_length  # Division by sum of Exponential
flops_softmax = softmax_exp + softmax_exp_add + softmax_div # Total Softmax FLOPs

# Total FLOPs calculation
total_mha_flops = flops_qkv + flops_qk_t + flops_attn_v + flops_output_proj + flops_softmax

print("\n===== Multi-Head Self-Attention Calculation Results =====")
print(f"Multi-Head Self-Attention FLOPs: {total_mha_flops / 1e6:.2f} MFLOPs")


===== Multi-Head Self-Attention Calculation Results =====
Multi-Head Self-Attention FLOPs: 57.03 MFLOPs
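
The formulas in Block 4 can be folded into a single function to see how the attention cost grows with sequence length. This is a minimal sketch that mirrors the notebook's counting convention exactly (including its treatment of the `Softmax(QK^T) V` additions). The function name `mha_flops` and the extra sequence lengths are illustrative assumptions.

```python
def mha_flops(seq_length, hidden_dim=768, num_heads=12):
    """FLOPs of one multi-head self-attention block, counted as in Block 4."""
    d_k = hidden_dim // num_heads

    # Q, K, V, and output projections: hidden_dim multiplies and
    # hidden_dim - 1 adds per output element, four projections in total
    projections = 4 * seq_length * hidden_dim * (2 * hidden_dim - 1)

    # QK^T and Softmax(QK^T) V: d_k multiplies and d_k - 1 adds per score
    attention = 2 * num_heads * seq_length * seq_length * (2 * d_k - 1)

    # Softmax: one exp, one add, and one divide per score
    softmax = 3 * num_heads * seq_length * seq_length

    return projections + attention + softmax


for n in (12, 128, 512):
    print(f"seq_length={n:4d}: {mha_flops(n) / 1e6:10.2f} MFLOPs")
```

The projection term is linear in `seq_length`, while the attention and softmax terms are quadratic, so the projections dominate for short inputs such as the 12-token example here.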


## **Block 5: Feed Forward Network (FFN) Calculation**

- **Compute `FFN(x) = GELU(xW1 + b1) W2 + b2`**
- **Check the shape of the `FFN` output**
- **Count `FFN` FLOPs (computational cost)**

In [5]:
### Block 5: Calculate Feed Forward Network (FFN) ###
# Get FFN weights
W1 = transformer_layer.ffn.lin1.weight
W2 = transformer_layer.ffn.lin2.weight

intermediate_size = model.config.hidden_dim  # Internal FFN dimension (3072)

# Calculate FLOPs
# First Linear Layer (768 -> 3072)
flops_ffn1_mul = seq_length * hidden_dim * intermediate_size  # MatMul
flops_ffn1_add = seq_length * intermediate_size * (hidden_dim - 1)  # Add
flops_ffn1 = flops_ffn1_mul + flops_ffn1_add

# GELU Activation
flops_gelu = seq_length * intermediate_size * 4  # GELU computation (approximately 4 FLOPs per element)

# Second Linear Layer (3072 -> 768)
flops_ffn2_mul = seq_length * intermediate_size * hidden_dim  # MatMul
flops_ffn2_add = seq_length * hidden_dim * (intermediate_size - 1)  # Add
flops_ffn2 = flops_ffn2_mul + flops_ffn2_add

# Total FLOPs calculation
total_ffn_flops = flops_ffn1 + flops_gelu + flops_ffn2

print("\n===== Feed Forward Network (FFN) Calculation Results =====")
print(f"Feed Forward Network (FFN) FLOPs: {total_ffn_flops / 1e6:.2f} MFLOPs")


===== Feed Forward Network (FFN) Calculation Results =====
Feed Forward Network (FFN) FLOPs: 113.35 MFLOPs
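
The same counting can be written as a reusable function. This is a sketch under the notebook's own assumptions: bias additions are ignored and GELU is charged roughly 4 FLOPs per element. The name `ffn_flops` and the parameter `gelu_flops_per_element` are hypothetical.

```python
def ffn_flops(seq_length, hidden_dim=768, intermediate_size=3072,
              gelu_flops_per_element=4):
    """FLOPs of one feed-forward block, counted as in Block 5."""
    # First linear layer (hidden_dim -> intermediate_size)
    lin1 = seq_length * intermediate_size * (2 * hidden_dim - 1)

    # GELU activation on every intermediate element (approximate cost)
    gelu = seq_length * intermediate_size * gelu_flops_per_element

    # Second linear layer (intermediate_size -> hidden_dim)
    lin2 = seq_length * hidden_dim * (2 * intermediate_size - 1)

    return lin1 + gelu + lin2


print(f"{ffn_flops(12) / 1e6:.2f} MFLOPs")  # 113.35 MFLOPs
```

Every term is linear in `seq_length`, so doubling the input length doubles the FFN cost.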


## **Block 6: Layer Normalization & Residual Connection**

- **Compute `LayerNorm(x + residual)`**
- **Check the shape of the `LayerNorm` output**
- **Count `LayerNorm` FLOPs (computational cost)**

In [6]:
### Block 6: Compute LayerNorm and Residual Connection FLOPs ###
# Compute LayerNorm FLOPs (2 LayerNorms per layer, 6147 FLOPs per token)
# Note: 6147 = 8 * 768 + 3, so this constant is tied to hidden_dim = 768
flops_layernorm = 2 * seq_length * 6147

# Compute Residual Connection FLOPs (2 times, 768 FLOPs per token)
flops_residual = 2 * seq_length * hidden_dim

# Total FLOPs
total_layernorm_residual_flops = flops_layernorm + flops_residual

print("\n===== LayerNorm & Residual Calculation Results =====")
print(f"LayerNorm & Residual FLOPs: {total_layernorm_residual_flops / 1e6:.2f} MFLOPs")


===== LayerNorm & Residual Calculation Results =====
LayerNorm & Residual FLOPs: 0.17 MFLOPs
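
The notebook does not say how the 6147 FLOPs per token were derived. The sketch below shows one hypothetical accounting that happens to reproduce the figure for a hidden size of 768. The breakdown is an assumption made for illustration, not the author's stated method.

```python
def layernorm_flops_per_token(d):
    """One possible per-token LayerNorm operation count (assumed breakdown)."""
    mean = (d - 1) + 1              # d - 1 adds, 1 divide
    variance = d + d + (d - 1) + 1  # d subtracts, d squares, d - 1 adds, 1 divide
    inv_std = 3                     # Add epsilon, square root, reciprocal
    normalize = d + d               # d subtracts, d multiplies by inv_std
    affine = d + d                  # d multiplies (scale), d adds (shift)
    return mean + variance + inv_std + normalize + affine  # 8 * d + 3


print(layernorm_flops_per_token(768))  # 6147
```

Under this accounting, the cost is `8 * d + 3` per token, which stays negligible next to the matrix multiplications in the attention and FFN blocks.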


## **Block 7: Transformer Layer Overall Computation Statistics**
This block is responsible for:
- **Summarizing the total computation of Self-Attention, FFN, and LayerNorm**
- **Calculating the FLOPs of a Transformer layer**

In [7]:
### Block 7: Transformer Layer Totals ###
# Get the model's embedding layer
embedding_layer = model.embeddings.word_embeddings

# Compute embeddings for token IDs
token_embeddings = embedding_layer(inputs["input_ids"])

# Print results
print("\n===== Token Embedding Results =====")
print(f"Token IDs Shape: {inputs['input_ids'].shape}")
print(f"Embedding Output Shape: {token_embeddings.shape}")  # (batch_size, seq_len, hidden_dim)

# Compute FLOPs
seq_length = token_embeddings.shape[1]
hidden_dim = token_embeddings.shape[2]
embedding_flops = seq_length * hidden_dim  # Embedding FLOPs (incurred once per forward pass)

# Compute per-layer FLOPs (MHA, FFN, LayerNorm, Residual)
total_distilbert_flops = total_mha_flops + total_ffn_flops + total_layernorm_residual_flops

# Compute whole-model FLOPs: every Transformer layer plus a single embedding lookup
total_model_flops = config_dict['n_layers'] * total_distilbert_flops + embedding_flops

print(f"Total FLOPs Per-Layer ≈ {total_distilbert_flops / 1e6:.3f} MFLOPs")
print(f"Total FLOPs Model ≈ {total_model_flops / 1e9:.3f} GFLOPs")


===== Token Embedding Results =====
Token IDs Shape: torch.Size([1, 12])
Embedding Output Shape: torch.Size([1, 12, 768])
Total FLOPs Per-Layer ≈ 170.544 MFLOPs
Total FLOPs Model ≈ 1.023 GFLOPs
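
To see where a layer's compute goes, the component totals reported in Blocks 4 to 6 can be turned into shares. This is a minimal sketch using the rounded MFLOPs values printed above, so the resulting percentages are approximate.

```python
# Rounded per-layer component totals in MFLOPs, copied from Blocks 4, 5, and 6
components = {
    "MHA": 57.03,
    "FFN": 113.35,
    "LayerNorm + Residual": 0.17,
}

layer_total = sum(components.values())

# Fraction of the per-layer FLOPs attributed to each component
shares = {name: value / layer_total for name, value in components.items()}

for name, share in shares.items():
    print(f"{name:22s} {share:6.1%}")
```

For this 12-token input the FFN accounts for about two thirds of the per-layer FLOPs and attention for about one third.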


## **Block 8: CPU Direct Execution Time**
- **Measure CPU execution time**
- **Calculate `DistilBERT` FLOPs per second on the CPU**

In [9]:
import time
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

device = torch.device("cpu")
model.to(device)

text = "I love deep learning. It’s very interesting"
inputs = tokenizer(text, return_tensors="pt")

# Run the full DistilBERT inference again and measure execution time
start_time = time.time()
with torch.no_grad():
    outputs = model(**inputs)
end_time = time.time()

execution_time = end_time - start_time

# total_distilbert_flops is the per-layer estimate from Block 7; scale it by the layer count
model_flops = total_distilbert_flops * model.config.n_layers

print("\n===== CPU Execution Time Statistics (FP32) =====")
print(f"Total Execution Time: {execution_time:.6f} sec")
print(f"FLOPs per Second (GFLOPs/s): {model_flops / execution_time / 1e9:.2f} GFLOPs/s")


===== CPU Execution Time Statistics (FP32) =====
Total Execution Time: 0.029101 sec
FLOPs per Second (GFLOPs/s): 35.16 GFLOPs/s
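
A single timed run can include one-time overhead such as memory allocation and thread start-up, so the measured throughput may vary noticeably between runs. The sketch below shows one common way to stabilize the measurement: discard a few warm-up runs and take the median of several timed runs. The helper names `median_runtime` and `gflops_per_second` are hypothetical, and `workload` is a stand-in for the model's forward pass.

```python
import statistics
import time


def median_runtime(fn, warmup=3, repeats=10):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # Warm-up runs are executed but not timed

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def gflops_per_second(total_flops, seconds):
    """Convert a FLOPs count and a runtime into GFLOPs/s."""
    return total_flops / seconds / 1e9


def workload():
    # Stand-in computation; not related to DistilBERT
    return sum(i * i for i in range(10_000))


elapsed = median_runtime(workload)
print(f"Median time: {elapsed:.6f} sec")
```

In the notebook, `fn` could be a small function that runs `model(**inputs)` inside `torch.no_grad()`, and the whole-model FLOPs estimate would be passed to `gflops_per_second` together with the median time.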
