## BERT (Bidirectional Encoder Representations from Transformers)

<img src="Screenshot 2025-05-09 at 01.27.33.png" width="500" />

- What Attention is (you already have the intuition 🔥)

- How Attention is computed (Query-Key-Value plus a small worked calculation → what we are doing now)

- Self-Attention across every word in a sentence (our next step)

- Multi-Head Attention (why run several attention heads side by side?)

- Positional Encoding (how does BERT know the word order?) -> Learned Positional Embedding.

- Encoder Stack (BERT = a stack of encoders, like a layer cake)

- Training Objective of BERT (Masked Language Model + Next Sentence Prediction)

- Fine-Tuning (how BERT is used for tasks such as QA, sentiment analysis, etc.)

# SCALED DOT PRODUCT ATTENTION

# FULL STACK ENCODER

# PRETRAINING 

## MLM (MASKED LANGUAGE MODELLING)


## NSP (NEXT SENTENCE PREDICTION)

# FINE-TUNING

## Text Classification

In [None]:
import torch
from datasets import load_dataset
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    TrainingArguments,
    Trainer
)

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load pretrained model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)  # Move model to appropriate device

# Load dataset
dataset = load_dataset("imdb", split="train[:2000]")
dataset = dataset.train_test_split(test_size=0.2)
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

# Tokenize function: pad/truncate every review to a fixed length
def tokenize_function(examples):
    # No return_tensors here: datasets.map stores plain lists, and the
    # Trainer's default collator converts each batch to tensors
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512,  # BERT's maximum sequence length
    )

# Process dataset to have the right format and data types
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]  # Remove text column which won't be needed
)

# Convert label column to make it compatible with the model
tokenized_datasets = tokenized_datasets.map(
    lambda examples: {"labels": examples["label"]},
    remove_columns=["label"]  # Remove original label column
)

# Check the structure of our processed dataset
print("\nSample from processed dataset:")
sample = tokenized_datasets["train"][0]
print(f"Type: {type(sample)}")
print(f"Keys: {sample.keys()}")
print(f"Example item: {sample}")

# Define training arguments
# Check the TrainingArguments available parameters
from inspect import signature
print("Available parameters for TrainingArguments:", signature(TrainingArguments))

training_args = TrainingArguments(
    output_dir="./results",
    # Try with no evaluation strategy parameter first
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    # Simplified arguments to minimize potential issues
    logging_dir="./logs"
)

# Define trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

# Train model
print("\nStarting training...")
trainer.train()

# Evaluate model
print("\nEvaluating model...")
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")

# Save model
model_path = "./imdb-bert-classifier"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
print(f"\nModel saved to {model_path}")

## SQuAD 

In [None]:
from transformers import BertTokenizerFast, BertForQuestionAnswering
from transformers import Trainer, TrainingArguments
from datasets import load_dataset


model_name = "bert-base-uncased"

tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)


dataset = load_dataset("squad", split="train[:1000]")
dataset = dataset.train_test_split(test_size=0.2)


def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    # Do not strip the contexts: answer_start indexes into the raw context string
    contexts = examples["context"]

    inputs = tokenizer(
        questions,
        contexts,
        max_length=384,
        truncation="only_second",
        padding="max_length",
        return_offsets_mapping=True,
    )

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(inputs["offset_mapping"]):
        answer = examples["answers"][i]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])

        sequence_ids = inputs.sequence_ids(i)

        # First and last token index that belong to the context (sequence id 1)
        context_start = sequence_ids.index(1)
        context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

        # If truncation cut the answer off, point both labels at [CLS] (index 0)
        if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
            continue

        # Append exactly one start and one end label per example
        idx = context_start
        while idx <= context_end and offsets[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)

        idx = context_end
        while idx >= context_start and offsets[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    inputs.pop("offset_mapping")  # only needed for labelling, not a model input
    return inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)


training_args = TrainingArguments(
    output_dir="./qa-results",
    evaluation_strategy="epoch",  # newer transformers releases call this eval_strategy
    learning_rate=3e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
)


trainer.train()

# Output

## HIDDEN STATE (HIDDEN OUTPUT)

## POOLER OUTPUT ([CLS])

| Output        | Shape                                        | Use                                                                                      |
| :------------ | :------------------------------------------- | :--------------------------------------------------------------------------------------- |
| Hidden States | `(batch_size, sequence_length, hidden_size)` | Every token gets a contextual representation. Suited to per-token tasks (NER, QA).       |
| Pooler Output | `(batch_size, hidden_size)`                  | A representation of the whole sequence, derived from `[CLS]`. Suited to sentence classification (positive/negative, entailment, etc.). |
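
To make the two shapes concrete, here is a minimal NumPy sketch. It uses random stand-in values instead of a real model, and the names `hidden_states`, `w_pool` and `b_pool` are illustrative assumptions. It follows the usual BERT pooler recipe: take the hidden state at the `[CLS]` position (index 0), apply one dense layer, then `tanh`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden_size = 2, 6, 8

# Stand-in for the encoder output: one contextual vector per token
hidden_states = rng.standard_normal((batch_size, seq_len, hidden_size))

# Hypothetical pooler parameters (random here, learned in a real model)
w_pool = rng.standard_normal((hidden_size, hidden_size)) * 0.1
b_pool = np.zeros(hidden_size)

# The pooler reads only the first token ([CLS]) of every sequence
cls_vector = hidden_states[:, 0, :]                  # (batch_size, hidden_size)
pooler_output = np.tanh(cls_vector @ w_pool + b_pool)

print(hidden_states.shape)   # (2, 6, 8)
print(pooler_output.shape)   # (2, 8)
```

Per-token tasks read `hidden_states` directly, while a sentence classifier would put its output layer on top of `pooler_output`.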


# HYPERPARAMETER

| Hyperparameter | Typical for BERT | Why                        |
| -------------- | ---------------- | -------------------------- |
| Learning Rate  | 2e-5 or 3e-5     | Keeps weight updates gentle |
| Batch Size     | 8 or 16          | Saves memory               |
| Epochs         | 2-4              | Enough for adaptation      |


# QA MANUALLY

- Feed in the input ➔ get logits (start_logits, end_logits).

- Find the highest-scoring index in start_logits ➔ that is the start of the answer.

- Find the highest-scoring index in end_logits ➔ that is the end of the answer.

- Slice the tokens from start to end ➔ decode them into the answer text.



In [2]:
from transformers import BertTokenizerFast, BertForQuestionAnswering
import torch

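# Load model
# Note: bert-base-uncased ships without a trained QA head (see the warning in the output),
# so the predicted span is essentially random until the model is fine-tuned on SQuAD.
# If end_index lands before start_index, the slice below is empty.
model_name = "bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)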

# Input
context = "The capital city of France is Paris."
question = "What is the capital of France?"

# Tokenize
inputs = tokenizer(question, context, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits

# Get the highest scoring start and end token positions
start_index = torch.argmax(start_logits)
end_index = torch.argmax(end_logits)

# Decode the answer
answer_ids = inputs["input_ids"][0][start_index : end_index + 1]
answer = tokenizer.decode(answer_ids)

print(f"Predicted Answer: {answer}")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Predicted Answer: 
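
Taking the two argmax values independently can put the end before the start, which gives an empty answer like the one above. A common remedy is to search for the best pair with `start <= end`. The sketch below is one hedged way to do that on plain Python lists; `best_span` and `max_answer_len` are illustrative names, not part of the transformers API.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximising start_logits[s] + end_logits[e] with s <= e."""
    best_score = float("-inf")
    best = (0, 0)
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Toy logits: independent argmax would give start=3, end=1 (an empty slice)
start_logits = [0.1, 0.2, 1.5, 2.0]
end_logits = [0.0, 3.0, 0.5, 1.0]
print(best_span(start_logits, end_logits))  # (1, 1)
```

With real model outputs, the lists could come from `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`. This only fixes span validity; an untrained QA head will still give meaningless answers.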


# MLM (MASKED LANGUAGE MODELLING)

In [3]:
from transformers import BertTokenizerFast, BertForMaskedLM
import torch

# Load model
model_name = "bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Text with mask
text = "The capital of [MASK] is Paris."

# Tokenize
inputs = tokenizer(text, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits

# Index of [MASK]
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]

# Predicted token
predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)

print(f"Predicted Token: {predicted_token}")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Predicted Token: france


# MODEL : MINI BERT MANUALLY

## ARCHITECTURE

<img src="attachment:4e443c08-d5ca-4537-988c-c9a6cb29be46.png" width="500" />


## INPUT DATA

## TOKENIZATION

- Turn every word into a token (an integer ID) using the vocabulary

- BERT uses WordPiece. For example:

- Sentence: "cat eating fish"
- Tokenized → [11, 25, 38]
- Vocab → {"cat": 11, "eating": 25, "fish": 38}
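
WordPiece can also split a word into sub-word pieces when the whole word is not in the vocabulary. The sketch below shows the greedy longest-match-first idea on a toy vocabulary. The vocabulary, the IDs and the function name `wordpiece_tokenize` are made up for illustration, and here "eating" is deliberately left out of the vocabulary so that it gets split.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (IDs are made up for illustration)
vocab = {"[UNK]": 0, "cat": 11, "eat": 25, "##ing": 26, "fish": 38}

tokens = []
for word in "cat eating fish".split():
    tokens.extend(wordpiece_tokenize(word, vocab))
ids = [vocab[t] for t in tokens]

print(tokens)  # ['cat', 'eat', '##ing', 'fish']
print(ids)     # [11, 25, 26, 38]
```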


## POSITIONAL EMBEDDINGS

- Every token becomes a vector that combines its word representation and its position.

- Steps:

- Look up an n-dimensional word embedding for every token.

- Look up a positional embedding for every position and add it to the word embedding.

- Example:

- Word embedding [128-d]

- Position embedding [128-d]

- Word embedding + position embedding → one [128-d] vector for every token.



## STACK ENCODER

#### MULTIHEAD ATTENTION

- Multi-head means every word gets n perspectives on what its context is.

##### SCALED DOT PRODUCT ATTENTION / SELF ATTENTION

- Every token looks at every other token and updates its representation in proportion to how strongly they are related.


- Steps:

- Input vector → linear projections to Q, K, V.

- Attention weights: softmax(QKᵀ / √d).

- Multiply the weights by V.

- Concatenate with the other heads.

- Output: a new hidden state that is "smarter", i.e. it carries more context.
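
Written out as equations, the steps above take the following standard form. The symbols are assumptions for notation only: $X$ is the matrix of input vectors, $W^{Q}, W^{K}, W^{V}, W^{O}$ are learned projection matrices, $d_k$ is the dimension of each head, and $h$ is the number of heads.

```latex
\[
Q = X W^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V}
\]
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
\[
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
\]
```

The softmax is taken row by row, so each token's weights over all tokens sum to 1.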


#### RESIDUAL AND NORMALIZATION

- Preserve the earlier information + stabilize training.
- Add the input to the Self-Attention output ➔ Residual.
- Normalize the result ➔ LayerNorm.
- attention_output = LayerNorm(input + attention(input))


#### FEED FORWARD

- Refine each token's representation individually.

- Process:

- expand → ReLU → shrink.

- Selects the important features per token.

- Details:

- Dense layer 1: expand the dimension (e.g. 128 ➔ 512).

- ReLU: zeroes out negative activations so that only the useful features pass through (the original BERT uses GELU here).

- Dense layer 2: back to the original dimension (512 ➔ 128).
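
As a formula, the expand, activate, shrink pipeline can be written as below. The notation is an assumption for illustration: $x$ is one token's vector, $\phi$ is the activation (ReLU in these notes, GELU in the original BERT), and with the example sizes above $d_{\text{model}} = 128$ and $d_{\text{ff}} = 512$.

```latex
\[
\mathrm{FFN}(x) = \phi(x W_1 + b_1)\, W_2 + b_2,
\qquad
W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}},\quad
W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}
\]
```

The same weights are applied to every token position independently.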




#### RESIDUAL AND NORMALIZATION

- Same pattern as after attention, now wrapped around the feed-forward block.
- Add the FFN input to the FFN output ➔ Residual.
- Normalize the result ➔ LayerNorm.
- attention_output = LayerNorm(input + attention(input))
- ffn_output = LayerNorm(attention_output + FFN(attention_output))


## ANOTHER ENCODER BLOCK

- input = ffn_output

- ....

-  ......

- Repeat 12x (BERT-base) to 24x (BERT-large)

## OUTPUT 


### HIDDEN STATE

- Hidden States	(batch_size, sequence_length, hidden_size)
- Every token gets a contextual representation.
- Used for per-token tasks (NER, QA).


### POOLED STATE

- Pooler Output (batch_size, hidden_size)
- A representation of the whole sequence.
- Used for sentence classification (positive/negative, entailment, etc.).

---

## PSEUDOCODE IMPLEMENTATION

### INPUT
    ## input text or corpus
    
    c = "x y z ...."

### TOKEN

    # text into number by id or index
    
    t = [x_id, y_id, z_id, ...... , n_id]

### POSITIONAL EMBEDDING

    ## encode each token as embedding + position

    for each token in t:

        embedding vector = word embedding matrix [token_id] -> vector representing the token
        positional vector = positional embedding matrix [position] -> vector encoding where the token sits in the sequence
        input_vector = embedding vector + positional vector

    ## result:
    
    vectors  = [input_vector_1, input_vector_2, input_vector_3, ......., input_vector_n]




### STACK ENCODER - 2 LAYER

    ## for each encoder layer :

        ## option A: multi-head self-attention (h heads, d_head = d_model / h)

        for each head 1, 2, 3, ..., h :
            for each token in t :
                Q, K, V = linear projections of the input vector (one set of weights per head)

            for each pair of tokens (i, j) :
                attention score[i][j] = dot product (Q[i], K[j]) / sqrt (d_head)

            for each token i :
                normalize attention score[i] with softmax across all j
                head output[i] = sum over j (softmax_score[i][j] * V[j])

        attention output = linear(concatenate(head output 1, 2, 3, ..., h))


        ## option B: single-head self-attention

        for each token in t :
            Q, K, V = linear projections of the input vector

        for each pair of tokens (i, j) :
            attention score[i][j] = dot product (Q[i], K[j]) / sqrt (d_model = input vectors.shape[-1])

        for each token i :
            normalize attention score[i] with softmax across all j
            attention output[i] = sum over j (softmax_score[i][j] * V[j])


        ## residual connection

        attention output = layer_norm(input + attention output)


        ## feed forward

        for each token in attention output :
            - expand the dimension (d_model -> d_ff, e.g. 4x) : h1 = linear(attention output)
            - keep only the useful activations for that token : a = relu(h1)
            - compress back to the original dimension (d_ff -> d_model) : h2 = linear(a)


        ## residual connection

        output = layer_norm(attention output + h2)


        ## Output

        output becomes the input vector of the next encoder layer; repeat the process from linear projection to output.


### OUTPUT - STATE

    ## output hidden state
    
    Return:
    - Output hidden vectors for each token
    
    ## output pooled state
    
    Return:
    - [CLS] token output for classification tasks (if needed)


---

## SIMPLE PYTHON IMPLEMENTATION

### STRUCTURE


| Puzzle | Module          | Contents                                                                                     |
| :----: | :-------------- | :------------------------------------------------------------------------------------------- |
|    1   | Basic Functions | - Matrix multiplication (manual) <br> - Softmax Activation Function (manual) <br> - Layer normalization (manual) |
|    2   | Embedding       | - Word Embedding Lookup table <br> - Positional Encoding Lookup table                                     |
|    3   | Self Attention  | - Linear projection (Q, K, V) of input <br> - Attention score calculation <br> - Output V             |
|    4   | Feed Forward    | - Linear 1 (expand), ReLU, Linear 2 (compress again)                                                |
|    5   | Encoder Layer   | - Residual connection + Norm after Attention <br> - Residual connection + Norm after FFN |


### BASIC FUNCTION

In [6]:
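# Matrix multiplication helper

import numpy as np

def matmul(x, y):
    # np.matmul treats leading axes as batch dimensions, which is what the
    # attention code needs; np.dot would combine the batch axes of both inputs
    return np.matmul(x, y)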

In [7]:
# Softmax Activation Function

def softmax(x, axis=-1):
    x_max = np.max(x, axis=axis, keepdims=True)
    e_x = np.exp(x - x_max)
    
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

In [8]:
# Normalization Layer Function

def norm(x):
    
    eps = 1e-6
    avg = np.mean(x, axis=-1, keepdims=True)
    std = np.std(x, axis=-1, keepdims=True)

    return (x - avg)/(std + eps)
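
The `norm` function above only standardizes each token vector. Layer normalization as used in Transformer models usually also has a learnable scale and shift per feature. A minimal sketch of that variant follows; the class name `LayerNorm` and the attribute names `gamma` and `beta` are illustrative, and no training loop is included, so the parameters simply stay at their initial values.

```python
import numpy as np

class LayerNorm:
    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)   # learnable scale, initialised to 1
        self.beta = np.zeros(d_model)   # learnable shift, initialised to 0
        self.eps = eps

    def forward(self, x):
        # Statistics are computed over the feature axis of each token
        mean = np.mean(x, axis=-1, keepdims=True)
        var = np.var(x, axis=-1, keepdims=True)
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 10.0, 10.0, 50.0]])
out = LayerNorm(d_model=4).forward(x)
print(out.mean(axis=-1))  # close to 0 for every row
print(out.std(axis=-1))   # close to 1 for every row
```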

### EMBEDDING LAYER

In [9]:
# Word Embedding Lookup Table

class wordEmbedding:
    def __init__(self, vocab_size, d_model):
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.embedding_table = np.random.rand(vocab_size, d_model)*0.01

    def forward(self, token_id):
        return self.embedding_table[token_id]

In [10]:
# Positional Encoding Lookup table

class positionalEncoding:
    def __init__(self, seq_len, d_model):
        self.seq_len = seq_len
        self.d_model = d_model

        position = np.arange(seq_len)[:, np.newaxis]
        div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))

        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(position * div_term)
        pe[:, 1::2] = np.cos(position * div_term)

        self.embedding_table = pe

    def forward(self, position_token_id):
        return self.embedding_table[position_token_id]

In [11]:
# Example..

vocab_size = 99
d_model = 10
max_seq = 5

word_embed = wordEmbedding(vocab_size, d_model)
position_embed = positionalEncoding(max_seq, d_model)

corpus = [['i', 'like', 'cats'], ['he', 'hates', 'dogs']]

tokens = np.array([
    [5, 8, 20],   # Sentence 1
    [6, 9, 25]    # Sentence 2
])  # shape: (2, 3)

positions = np.array([0, 1, 2])  # for all sentences, same positions


word_vecs = word_embed.forward(tokens)
pos_vecs = position_embed.forward(positions)

final_input = word_vecs + pos_vecs
print(final_input.shape)

(2, 3, 10)
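
The `positionalEncoding` class above uses the fixed sinusoidal scheme, whereas BERT uses a learned positional embedding, as noted in the outline at the top. A hedged sketch of the learned variant is just another lookup table with one trainable row per position. The class name is illustrative, and the 0.02 scale mirrors BERT's usual initializer range; nothing here is trained.

```python
import numpy as np

class LearnedPositionalEmbedding:
    def __init__(self, max_seq_len, d_model, seed=0):
        rng = np.random.default_rng(seed)
        # One trainable row per position; random at start, updated by training
        self.embedding_table = rng.standard_normal((max_seq_len, d_model)) * 0.02

    def forward(self, position_ids):
        return self.embedding_table[position_ids]

pos_embed = LearnedPositionalEmbedding(max_seq_len=5, d_model=10)
positions = np.array([0, 1, 2])
pos_vecs = pos_embed.forward(positions)
print(pos_vecs.shape)  # (3, 10)
```

It can be added to the word vectors in the same way as the sinusoidal version, since the output shape is identical.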


### SCALED DOT PRODUCT ATTENTION

In [21]:
# Linear projection (Q, K, V)
# Attention score calculation
# Weighted sum of V

class selfAttention:
    def __init__(self, d_model):
        self.d_model = d_model

        scale = np.sqrt(2.0 / d_model)
        self.w_q = np.random.rand(d_model, d_model) * scale
        self.w_k = np.random.rand(d_model, d_model) * scale
        self.w_v = np.random.rand(d_model, d_model) * scale

    def forward(self, x):
        # np.matmul broadcasts over the batch axis; np.dot on 3-D arrays does not
        q = np.matmul(x, self.w_q)
        k = np.matmul(x, self.w_k)
        v = np.matmul(x, self.w_v)

        # swap the last two axes: (batch, seq_len, d_model) -> (batch, d_model, seq_len)
        k_t = np.transpose(k, (0, 2, 1))

        attn_scores = np.matmul(q, k_t) / np.sqrt(self.d_model)
        attn_probs = softmax(attn_scores, axis=-1)
        attn_output = np.matmul(attn_probs, v)

        return attn_output

In [10]:
# Example: 3 words, 10 dimensions

x = np.random.randn(1, 3, 10)

self_attn = selfAttention(d_model=10)

out = self_attn.forward(x)

print(out.shape)  # (1, 3, 10)

(1, 3, 10)


### MULTIHEAD ATTENTION

In [29]:
# Multihead Attention

class multiheadAttention():
    def __init__(self, d_model, num_heads):
        self.d_model = d_model
        self.num_heads = num_heads
        self.heads_d = d_model // num_heads

        assert d_model % num_heads == 0

        # weights for Q, K, V and the output projection
        scale = np.sqrt(2.0 / d_model)
        self.w_q = np.random.rand(d_model, d_model) * scale
        self.w_k = np.random.rand(d_model, d_model) * scale
        self.w_v = np.random.rand(d_model, d_model) * scale
        self.w_o = np.random.rand(d_model, d_model) * scale

    def forward(self, x):
        batch_size, seq_len, _ = x.shape

        # Q, K, V
        q = np.matmul(x, self.w_q)
        k = np.matmul(x, self.w_k)
        v = np.matmul(x, self.w_v)

        # split into multiple head
        q = q.reshape(batch_size, seq_len, self.num_heads, self.heads_d)
        k = k.reshape(batch_size, seq_len, self.num_heads, self.heads_d)
        v = v.reshape(batch_size, seq_len, self.num_heads, self.heads_d)

        # transpose
        q = q.transpose(0,2,1,3)
        k = k.transpose(0,2,1,3)
        v = v.transpose(0,2,1,3)

        k_t = np.transpose(k, (0, 1, 3, 2))

        # self attention per head
        attn_scores = np.matmul(q, k_t) / np.sqrt(self.heads_d)
        attn_probs = softmax(attn_scores, axis=-1)
        attn_output = np.matmul(attn_probs, v)

        # concatenate all heads into one
        attn_output = attn_output.transpose(0,2,1,3)
        attn_output = attn_output.reshape(batch_size, seq_len, self.d_model)

        # linear projection 
        output = np.matmul(attn_output, self.w_o)

        return output

In [57]:
# Example: 2 sentences, 4 words each, d_model = 8, 4 heads (4 context views per word)

batch_size = 2
seq_len = 4
d_model = 8
num_heads = 4

x = np.random.randn(batch_size, seq_len, d_model)  # (2, 4, 8)

mha = multiheadAttention(d_model, num_heads)

output = mha.forward(x)

print(f"input : {x}")
print()
print(f"shape of input : {x.shape}")
print()

batch_size, seq_len, d_model = x.shape

print(f"batch_size = sentences : {batch_size}")
print(f"seq_len = words : {seq_len}")
print(f"d_model = dimention : {d_model}")


expected_shape = (batch_size, seq_len, d_model)
actual_shape = output.shape

print()
print(f"output : {output}")
print()
print(f"expected_shape : {expected_shape}")
print()

print(f"actual_shape : {actual_shape}")

input : [[[-0.77281009 -0.26525163  0.49480591 -0.71234687  1.55256265
    0.03180343  0.97385523 -1.85170586]
  [-0.57396265 -0.72822389 -1.92776488  1.28525226  0.62958119
    0.74690941  0.74743178  0.4630312 ]
  [ 0.27997036  0.23979904 -0.09229853 -0.70767435 -0.09800739
   -0.34597603  1.06333414  0.25265971]
  [-1.82387414  0.24315024 -0.23913714  0.07299795 -1.43867199
    0.2218916   0.79692588 -2.53512674]]

 [[-0.99488304  0.85132085  0.08817293  1.058072    0.68034666
   -0.88408008 -0.6120395  -0.78602775]
  [ 0.68272199  0.24199436  0.38193769 -0.50565454 -0.26785863
    1.48656754 -1.38883187 -0.58510711]
  [-0.64977469  0.68919681  1.23141812  0.39152976 -1.00440959
   -0.97901076 -1.48389504 -0.96477045]
  [ 0.32549657  1.18308707 -0.0607689   1.34716348  0.15218075
    0.48560845  0.78811011 -0.05491841]]]

shape of input : (2, 4, 8)

batch_size = sentences : 2
seq_len = words : 4
d_model = dimention : 8

output : [[[-0.00044888 -0.00099372 -0.00042065 -0.00070045 -0.

### FFN

In [23]:
# Feed Forward Neural Network

class feedForward:
    def __init__(self, d_model, d_ff):
        scale1 = np.sqrt(2.0 / d_model)
        scale2 = np.sqrt(2.0 / d_ff)
        
        self.w1 = np.random.rand(d_model, d_ff) * scale1
        self.b1 = np.zeros((d_ff,))
        self.w2 = np.random.rand(d_ff, d_model) * scale2
        self.b2 = np.zeros((d_model,))


    def gelu(self, x):
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * np.power(x, 3))))
        

    def forward(self, x):
        # expand -> GELU -> compress
        x_expanded = np.matmul(x, self.w1) + self.b1
        x_act = self.gelu(x_expanded)
        output = np.matmul(x_act, self.w2) + self.b2

        return output

In [83]:
# === Test Code ===
if __name__ == "__main__":
    d_model = 4
    d_ff = 64
    batch_size = 2
    seq_len = 3

    # Create dummy input (batch_size, seq_len, d_model)
    x = np.random.rand(batch_size, seq_len, d_model)

    # Initialize the feed-forward network
    ff = feedForward(d_model, d_ff)

    # Run the forward pass token by token
    out = np.array([[ff.forward(token) for token in sample] for sample in x])
    
    print()
    print("Input shape:", x.shape)
    print("Output shape:", out.shape)
    print("Output:\n", out)


Input shape: (2, 3, 4)
Output shape: (2, 3, 4)
Output:
 [[[24.93228925 25.01672739 28.52430536 27.89781689]
  [25.15489125 24.22581311 27.73751544 26.67556067]
  [36.99291892 36.90875791 41.84860775 40.34837315]]

 [[26.06460181 25.73991014 29.46731452 28.66470778]
  [35.34297198 34.76383522 39.95592629 39.16061508]
  [23.08314383 22.70280699 25.94482227 25.20208626]]]


### STACK ENCODER - 3 LAYER

In [35]:
class bertBlock:
    def __init__(self, d_model, num_heads, d_ff):
        self.mha = multiheadAttention(d_model, num_heads)
        self.hh = feedForward(d_model, d_ff)

    def forward(self, x):
        # multihead self attention
        x_mha = self.mha.forward(x)

        # add and norm
        x_norm1 = norm(x + x_mha)

        # feed forward
        x_ffn = self.hh.forward(x_norm1) 

        # add and norm
        x_norm2 = norm(x_norm1 + x_ffn)

        return x_norm2

In [39]:
class bertModel:
    def __init__(self, vocab_size, seq_len, d_model, num_heads, d_ff, n_stacks):
        self.word_embedding = wordEmbedding(vocab_size, d_model)
        self.positional_embedding = positionalEncoding(seq_len, d_model)
        self.bert_blocks = [bertBlock(d_model, num_heads, d_ff) for _ in range(n_stacks)]

    def forward(self, token_id, position_token_id=None):
        batch_size, seq_len = token_id.shape

        if position_token_id is None:
            position_token_id = np.tile(np.arange(seq_len), (batch_size, 1))

        word_embedding = self.word_embedding.forward(token_id)
        positional_embedding = self.positional_embedding.forward(position_token_id)

        x = word_embedding + positional_embedding

        for block in self.bert_blocks:
            x = block.forward(x)

        return x

In [40]:
# Example usage
if __name__ == "__main__":
    # Hyperparameters
    vocab_size = 3000
    seq_len = 512
    d_model = 768
    num_heads = 12
    d_ff = 3072
    n_stacks = 12
    
    # Create model instance
    model = bertModel(vocab_size, seq_len, d_model, num_heads, d_ff, n_stacks)
    
    # Example input (batch_size=2, seq_len=6)
    token_id = np.array([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]])
    
    # Forward pass
    output = model.forward(token_id)
    print(f"Output shape: {output.shape}")  # Should be (2, 6, 768)

Output shape: (2, 6, 768)


---

| Step | Description                          | Status |
| :--: | :----------------------------------- | :----: |
|   1  | Build the Mini-BERT stack            |    ✅   |
|   2  | Pretraining (Masked LM + NSP)        |   🔜   |
|   3  | Fine-tuning on a specific task       |   🔜   |
|   4  | Create a dummy dataset for practice  |   🔜   |
|   5  | Build god-tier mindset & intuition   |   🔜   |


# Pretraining (Masked LM + NSP)

CORE IDEA :

- Input: tokens with some positions masked + a pair of sentences
- Target 1: fill in the masked words
- Target 2: does the second sentence follow the first?

HOW? :

- Tokenize the sentences ➔ token IDs

- Add [CLS] at the start and [SEP] after each sentence

- Add Positional Encoding as usual

- Randomly pick tokens to replace with [MASK] (about 15% of tokens)

- Feed them into the Mini-BERT stack (our model)

- Output 1: prediction of the covered tokens

- Output 2: prediction of the NSP label (IsNext / NotNext)
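
The input-building steps above can be sketched in a few lines of plain Python. The function name `build_pair_input` is illustrative. The segment IDs (0 for the first sentence, 1 for the second) are something the original BERT adds as a third embedding; the Mini-BERT in these notes does not have them, so treat that part as an assumption about a possible extension.

```python
def build_pair_input(tokens_a, tokens_b):
    """Join two tokenised sentences the way BERT's pretraining input is laid out."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_pair_input(
    ["i", "went", "to", "the", "market"],
    ["i", "bought", "fruit"],
)
print(tokens)
print(segment_ids)   # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(position_ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Masking would then be applied to `tokens` before converting them to IDs.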




| Task                              | Goal                                   | In plain terms                                                  |
| :-------------------------------- | :------------------------------------- | :-------------------------------------------------------------- |
| 1. Masked Language Model (MLM)    | Learn to fill in missing words         | Guess the covered word                                          |
| 2. Next Sentence Prediction (NSP) | Learn relationships between sentences  | Guess whether the second sentence follows the first or is unrelated |


# MLM : MASKED LANGUAGE MODEL


## INTUITION

- Learn to fill in missing words: cover a few words in the sentence
- Then ask BERT to guess the covered words
- Original sentence:
- "I eat rice at the stall."

- After masking:
- "I [MASK] rice at the [MASK]."

- BERT's task:
- Guess [MASK] = "eat", [MASK] = "stall"


## PROCESS

1. Input :

- c = ['cats play in parks']

- t = ['cats', 'play', 'in', 'parks']


2. Special Tokens :

- ['[CLS]', 'cats', 'play', 'in', 'parks', '[SEP]']


3. Mask about 15% of the input tokens :

- ['[CLS]', 'cats', '[MASK]', 'in', 'parks', '[SEP]']

4. Pretrain the model on this input :

- Input : ['[CLS]', 'cats', '[MASK]', 'in', 'parks', '[SEP]']
  
- Embedding (token embedding + positional embedding),
  
- Encoder stack (MHA ➔ AddNorm ➔ FFN ➔ AddNorm),

- Output: a tensor of contextual representations for all tokens.


## PSEUDOCODE

    # pretraining bert for mlm
    initialize bert model with random weights

    def apply_mask(tokens):
        for i in range(len(tokens)):
            if tokens[i] is not a special token and random() < 0.15:
                label[i] = tokens[i]          # remember the original token
                r = random()                  # fresh draw for the 80/10/10 split
                if r < 0.8:
                    tokens[i] = [MASK]
                elif r < 0.9:
                    tokens[i] = random_token()
                else:
                    tokens[i] = tokens[i]     # keep the token unchanged
            else:
                label[i] = [ignore]

        return tokens, label

    for each epoch:
        for each batch in training data:
            # 1. tokenize
            input_tokens = tokenize(batch)

            # 2. masking
            masked_input, labels = apply_mask(input_tokens)

            # 3. forward pass through bert
            output = bert_model(masked_input)

            # 4. loss on the masked positions only
            loss = cross_entropy(output[mask positions], labels[mask positions])

            # 5. backpropagation and parameter update
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
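
The masking step of the pseudocode can be made runnable with the standard library alone. This is a minimal sketch under a few assumptions: tokens are plain strings, `None` stands in for the "ignore" label, special tokens are never masked, and the names `apply_mask`, `mask_prob` and the toy `vocab` are illustrative. The example call raises `mask_prob` to 0.5 only so that a six-token sentence is likely to show some masking.

```python
import random

def apply_mask(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) following the 80/10/10 rule."""
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]"}
    masked, labels = list(tokens), [None] * len(tokens)  # None = ignored in the loss
    for i, token in enumerate(tokens):
        if token in special or rng.random() >= mask_prob:
            continue
        labels[i] = token                # the model must recover the original
        roll = rng.random()              # a fresh draw decides how to corrupt
        if roll < 0.8:
            masked[i] = "[MASK]"
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)
        # else: keep the token unchanged
    return masked, labels

vocab = ["cats", "dogs", "play", "sleep", "in", "parks"]
tokens = ["[CLS]", "cats", "play", "in", "parks", "[SEP]"]
masked, labels = apply_mask(tokens, vocab, mask_prob=0.5, seed=0)
print(masked)
print(labels)
```

Positions whose label is `None` contribute nothing to the loss; only the selected positions are scored against their original tokens.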

---

## NEXT SENTENCE PREDICTION (NSP)

- Given two sentences, BERT has to guess:

- Does B follow A? (A ➔ B)

- Or is it unrelated? (A ➔ random)

- Sentence 1: "I went to the market."
- Sentence 2: "I bought some fruit."
- ==> Label: IsNext (follows)

- Sentence 1: "I went to the market."
- Sentence 2: "The full moon is very beautiful."
- ==> Label: NotNext (random)
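
A small sketch of how such training pairs could be generated from an ordered list of sentences is shown below. It is a simplification: the original BERT setup draws the random sentence from a different document, while this version picks from the same list and needs at least three sentences. The function name `make_nsp_pairs` is illustrative.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, label) triples from an ordered document."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that really follows
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # Negative example: any sentence except the current one and its true successor
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), "NotNext"))
    return pairs

sentences = [
    "I went to the market.",
    "I bought some fruit.",
    "Then I walked home.",
    "The full moon is very beautiful.",
]
for sentence_a, sentence_b, label in make_nsp_pairs(sentences):
    print(f"{label}: {sentence_a} -> {sentence_b}")
```

Each pair would then be joined as `[CLS] A [SEP] B [SEP]`, and the label is predicted from the `[CLS]` representation.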
