# Pretraining (NSP : NEXT SENTENCE PREDICTION)

## PROCESS OVERVIEW



| Step | Penjelasan                           | Status |
| :--: | :----------------------------------- | :----: |
|   1  | Bangun Mini-BERT Stack               |    ✅   |
|   2  | Pretraining (Masked LM + NSP)        |   NOW   |
|   3  | Fine-tuning ke task spesifik         |   🔜   |
|   4  | Buat dataset dummy buat latihan      |   🔜   |
|   5  | Build mindset & intuition level dewa |   🔜   |

---

TUJUAN :

- Bukan cuma paham hubungan kata di kalimat, tapi juga hubungan antar kalimat
- Apakah kalimat kedua nyambung?

HOW? :

- Kalimat berhubungan -> Label NSP = IsNext
- Kalimat tidak berhubungan -> Label NSP = NotNext

- [CLS] Kalimat 1 [SEP] Kalimat 2 [SEP]
- Token [CLS] itu direpresentasikan jadi 1 vektor lewat encoder.
- Vektor [CLS] ini dipakai buat prediksi IsNext atau NotNext lewat 1 layer klasifikasi.

STEPS :
1. Input :
- c = ['kucing bermain di taman']
- t = ['kucing', 'bermain', 'di', 'taman']
2. Special Token :
- ['[CLS]', 'kucing', 'bermain', 'di', 'taman', '[SEP]']
3. Ambil Output vektor [CLS]
4. Linear Layer untuk klasifikasi IsNext/NotNext
5. Loss Function
6. Optimizer
7. Training

| Misi                              | Tujuan                         | Gampangnya                                     |
| :-------------------------------- | :----------------------------- | :--------------------------------------------- |
| 1. Masked Language Model (MLM)    | Belajar isi kata yang hilang   | Tebak kata yang ketutupan                      |
| 2. Next Sentence Prediction (NSP) | Belajar hubungan antar kalimat | Tebak apakah kalimat kedua nyambung atau ngaco |


---

# NSP : NEXT SENTENCE PREDICTION

![](https://amitness.com/posts/images/bert-nsp.png)


## INTUITION

- cls merepresentasikan kalimat dan hubungan antar kalimat



## PROCESS

1. Input :

- s1 = ['kucing bermain di taman']

- s2 = ['kucing makan nasi']


2. Special Token :

- ['[CLS]', 'kucing', 'bermain', 'di', 'taman', '[SEP]', 'kucing', 'makan', 'nasi', '[SEP]']


3. Pretrain Model with this Approach :

- Input : ['[CLS]', 'kucing', 'bermain', 'di', 'taman', '[SEP]', 'kucing', 'makan', 'nasi', '[SEP']

- Embedding (token embedding + positional embedding),
  
- Stack Encoder stack (MHA ➔ AddNorm ➔ FFN ➔ AddNorm),

- inti : train cls untuk memprediksi apakah kalimat kedua merupakan kalimat berikutnya dari kalimat pertama atau tidak

## PSEUDOCODE

    # pretraining bert for mlm
    initialize bert model with random weight

    def apply mask (tokens):
        for i in range (len token):
            if random < 0.15:
                if random < 0.8:
                    tokens[i] = [mask]
                elif random < 0.9:
                    token[i] = random_token()
                else:
                    token[i] = token[i]
                lebel[i] = original token
            else:
                label[i] = [ignore]

        return tokens, label
    
    for each epoch:
        for each batch in training data :
        # 1. tokenize
        input token = tokenize(batch)

        # 2. masking
        mask input, label = apply mask (input token)

        # 3. feed forward bert
        output = bertmodel(mask input)

        # 4. training, loss 
        loss = cross entropy(output[mask position], labels[mask position])

        # 5. backpropagation or update parameter
        loss.backward()
        optimizer.step()
        optimizer.zero grad()


## EXAMPLE

1. s1 = "I love pussy",

2. s2 : "langit berwarna biru hari ini",

3. Embedding :

['[CLS]', 'i', 'love', 'pussy', '[SEP]', 'langit', 'berwarna', 'biru', 'hari', 'ini', '[SEP]']

- [1, 4, 5, 6, 2, 7, 8, 9, 10, 11, 2] 

4. BERT Model :

- MHA ➔ AddNorm

- FFN ➔ AddNorm

- logits + softmax -> cls
- cls = [0.48, 0.45] 
- cls[0] = berhubungan
- cls[1] = tidak berhubungan

---


## PRETRAIN BERT MODEL NSP

src = https://www.101ai.net/text/bert


## Original From BERT

In [224]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

s1 = "aku suka makan nasi"
s2 = "nasi itu enak"

emb1 = model.encode(s1, convert_to_tensor=True)
emb2 = model.encode(s2, convert_to_tensor=True)

score = util.cos_sim(emb1, emb2).item()
percentage = score * 100

print(f"Similarity: {percentage:.2f}%")

modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.51k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Similarity: 56.18%


## Breakdown

### Data

In [1]:
import numpy as np

In [228]:
sentences = [
    "I love pussy",
    "langit berwarna biru hari ini",
    "Langit berwarna biru hari ini",
    "Dia sedang belajar matematika"
]

sentences[0], sentences[1]

('I love pussy', 'langit berwarna biru hari ini')

### Tokens

In [191]:
tokens = [word for sentence in sentences for word in sentence.lower().split()]

vocab = {
    "[PAD]": 0,
    "[CLS]": 1,
    "[SEP]": 2,
    "[MASK]": 3,
}

for i, w in enumerate(tokens, start=len(vocab)):
    vocab[w] = i

print(vocab)

{'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3, 'i': 4, 'love': 5, 'pussy': 6, 'langit': 12, 'berwarna': 13, 'biru': 14, 'hari': 15, 'ini': 16, 'dia': 17, 'sedang': 18, 'belajar': 19, 'matematika': 20}


### Tokens ID

In [192]:
t2i = {token : idx for idx, token in enumerate(vocab)}
i2t = {idx : token for token, idx in t2i.items()}

print(t2i)
print()
print(i2t)
print()
print(i2t[1])

{'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3, 'i': 4, 'love': 5, 'pussy': 6, 'langit': 7, 'berwarna': 8, 'biru': 9, 'hari': 10, 'ini': 11, 'dia': 12, 'sedang': 13, 'belajar': 14, 'matematika': 15}

{0: '[PAD]', 1: '[CLS]', 2: '[SEP]', 3: '[MASK]', 4: 'i', 5: 'love', 6: 'pussy', 7: 'langit', 8: 'berwarna', 9: 'biru', 10: 'hari', 11: 'ini', 12: 'dia', 13: 'sedang', 14: 'belajar', 15: 'matematika'}

[CLS]


### Sentences + Its ID

In [193]:
s1 = sentences[0].lower().split()
s2 = sentences[1].lower().split()
print(s1, s2)
print()

tokens = [i2t[1]] + s1 + [i2t[2]] + s2 + [i2t[2]]
ids = [t2i[tokens] for tokens in tokens]

print(tokens)
print()
print(ids, len(ids))

['i', 'love', 'pussy'] ['langit', 'berwarna', 'biru', 'hari', 'ini']

['[CLS]', 'i', 'love', 'pussy', '[SEP]', 'langit', 'berwarna', 'biru', 'hari', 'ini', '[SEP]']

[1, 4, 5, 6, 2, 7, 8, 9, 10, 11, 2] 11


### Embedding

In [194]:
dim = 4

np.random.seed(42)
embedding_matrix = np.random.randn(len(t2i), dim)  # dummy emb
input_emb = np.array([embedding_matrix[i] for i in ids])  # shape: (seq_len, dim)

print(input_emb, input_emb.shape)

[[-0.234 -0.234  1.579  0.767]
 [-1.013  0.314 -0.908 -1.412]
 [ 1.466 -0.226  0.068 -1.425]
 [-0.544  0.111 -1.151  0.376]
 [-0.469  0.543 -0.463 -0.466]
 [-0.601 -0.292 -0.602  1.852]
 [-0.013 -1.058  0.823 -1.221]
 [ 0.209 -1.96  -1.328  0.197]
 [ 0.738  0.171 -0.116 -0.301]
 [-1.479 -0.72  -0.461  1.057]
 [-0.469  0.543 -0.463 -0.466]] (11, 4)


In [195]:
for token, idx in zip(tokens, ids):
    emb = embedding_matrix[idx]
    print(f"Token: {token:<10} | ID: {idx:<3} | Embedding: {emb}")

Token: [CLS]      | ID: 1   | Embedding: [-0.234 -0.234  1.579  0.767]
Token: i          | ID: 4   | Embedding: [-1.013  0.314 -0.908 -1.412]
Token: love       | ID: 5   | Embedding: [ 1.466 -0.226  0.068 -1.425]
Token: pussy      | ID: 6   | Embedding: [-0.544  0.111 -1.151  0.376]
Token: [SEP]      | ID: 2   | Embedding: [-0.469  0.543 -0.463 -0.466]
Token: langit     | ID: 7   | Embedding: [-0.601 -0.292 -0.602  1.852]
Token: berwarna   | ID: 8   | Embedding: [-0.013 -1.058  0.823 -1.221]
Token: biru       | ID: 9   | Embedding: [ 0.209 -1.96  -1.328  0.197]
Token: hari       | ID: 10  | Embedding: [ 0.738  0.171 -0.116 -0.301]
Token: ini        | ID: 11  | Embedding: [-1.479 -0.72  -0.461  1.057]
Token: [SEP]      | ID: 2   | Embedding: [-0.469  0.543 -0.463 -0.466]


### Positional Encoding

In [196]:
max_len = len(t2i)

pos = np.arange(max_len)[:, np.newaxis]
i = np.arange(dim)[np.newaxis, :]

angle_rates = 1 / np.power(10000, (2*(i//2))/dim)
angle_rads = pos * angle_rates

angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

positional_encoding = angle_rads

# PE + EMBEDDING
pe = input_emb + positional_encoding[:len(ids)]

In [197]:
for pos, (token, idx) in enumerate(zip(tokens, ids)):
    emb = embedding_matrix[idx]
    pes = positional_encoding[pos]
    result = emb + pes
    print(f"Token: {token:<10} | Pos: {pos:<2} | ID: {idx:<3} | Emb + PE: {result}")

Token: [CLS]      | Pos: 0  | ID: 1   | Emb + PE: [-0.234  0.766  1.579  1.767]
Token: i          | Pos: 1  | ID: 4   | Emb + PE: [-0.171  0.855 -0.898 -0.412]
Token: love       | Pos: 2  | ID: 5   | Emb + PE: [ 2.375 -0.642  0.088 -0.425]
Token: pussy      | Pos: 3  | ID: 6   | Emb + PE: [-0.403 -0.879 -1.121  1.375]
Token: [SEP]      | Pos: 4  | ID: 2   | Emb + PE: [-1.226 -0.111 -0.423  0.533]
Token: langit     | Pos: 5  | ID: 7   | Emb + PE: [-1.56  -0.008 -0.552  2.851]
Token: berwarna   | Pos: 6  | ID: 8   | Emb + PE: [-0.293 -0.098  0.883 -0.223]
Token: biru       | Pos: 7  | ID: 9   | Emb + PE: [ 0.866 -1.206 -1.258  1.194]
Token: hari       | Pos: 8  | ID: 10  | Emb + PE: [ 1.728  0.026 -0.036  0.696]
Token: ini        | Pos: 9  | ID: 11  | Emb + PE: [-1.066 -1.631 -0.371  2.053]
Token: [SEP]      | Pos: 10 | ID: 2   | Emb + PE: [-1.013 -0.297 -0.364  0.529]


In [198]:
pe

array([[-0.234,  0.766,  1.579,  1.767],
       [-0.171,  0.855, -0.898, -0.412],
       [ 2.375, -0.642,  0.088, -0.425],
       [-0.403, -0.879, -1.121,  1.375],
       [-1.226, -0.111, -0.423,  0.533],
       [-1.56 , -0.008, -0.552,  2.851],
       [-0.293, -0.098,  0.883, -0.223],
       [ 0.866, -1.206, -1.258,  1.194],
       [ 1.728,  0.026, -0.036,  0.696],
       [-1.066, -1.631, -0.371,  2.053],
       [-1.013, -0.297, -0.364,  0.529]])

### Multihead Multiscale Dot Product Attention

In [199]:
def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

num_heads = 2
seq_len = pe.shape[0]
assert dim%num_heads == 0
depth = dim//num_heads

np.random.seed(0)
wq = np.random.rand(dim, dim)
wk = np.random.rand(dim, dim)
wv = np.random.rand(dim, dim)
wo = np.random.rand(dim, dim)

q = pe @ wq
k = pe @ wk
v = pe @ wv

q = q.reshape((seq_len, num_heads, depth)).transpose(1,0,2)
k = k.reshape((seq_len, num_heads, depth)).transpose(1,0,2)
v = v.reshape((seq_len, num_heads, depth)).transpose(1,0,2)

dk = q.shape[-1]
scores = np.matmul(q, k.transpose(0,2,1)) / np.sqrt(dk)

if 'mask' in locals() and mask is not None:
    scores += (mask * -1e9)

attn_weights = softmax(scores, axis=-1)
output = np.matmul(attn_weights, v)

output = output.transpose(1,0,2).reshape((seq_len, dim))
output = output @ wo

In [200]:
def print_tensor(name, tensor):
    print(f"{name}.shape: {tensor.shape}")
    print(f"{name}:\n{tensor}\n")

print("==== SHAPES ====")
print(f"q.shape           : {q.shape}")
print(f"k.shape           : {k.shape}")
print(f"v.shape           : {v.shape}")
print(f"scores.shape      : {scores.shape}")
print(f"attn_weights.shape: {attn_weights.shape}")
print(f"output.shape      : {output.shape}")
print()

print("==== VALUES (truncated if too long) ====")
np.set_printoptions(precision=3, suppress=True, linewidth=120)

print_tensor("q", q)
print_tensor("k", k)
print_tensor("v", v)
print_tensor("scores", scores)
print_tensor("attn_weights", attn_weights)
print_tensor("output", output)

==== SHAPES ====
q.shape           : (2, 11, 2)
k.shape           : (2, 11, 2)
v.shape           : (2, 11, 2)
scores.shape      : (2, 11, 11)
attn_weights.shape: (2, 11, 11)
output.shape      : (11, 4)

==== VALUES (truncated if too long) ====
q.shape: (2, 11, 2)
q:
[[[ 2.722  2.569]
  [-0.832 -0.297]
  [ 0.874  0.924]
  [-0.893 -0.013]
  [-0.825 -0.617]
  [ 0.229  1.307]
  [ 0.522 -0.14 ]
  [-0.57   0.464]
  [ 1.32   1.883]
  [-0.467 -0.058]
  [-0.732 -0.566]]

 [[ 1.57   1.545]
  [-0.47   0.158]
  [ 1.19   0.731]
  [-1.418 -1.477]
  [-1.085 -0.945]
  [-1.178 -0.9  ]
  [ 0.464  0.201]
  [-0.917 -1.165]
  [ 1.074  1.006]
  [-1.504 -2.053]
  [-0.991 -0.963]]]

k.shape: (2, 11, 2)
k:
[[[ 1.854  2.161]
  [ 0.511 -0.205]
  [-0.792  1.344]
  [-0.283 -1.185]
  [ 0.095 -1.16 ]
  [ 1.383 -0.476]
  [-0.113  0.151]
  [-0.688 -0.553]
  [ 0.419  1.725]
  [-0.59  -1.577]
  [-0.077 -1.094]]

 [[ 0.865  3.254]
  [ 0.023 -0.65 ]
  [ 1.452  1.319]
  [-0.516 -1.031]
  [-0.925 -1.141]
  [-0.542  0.323]
 

In [201]:
print(f"{'Token':<12} | {'ID':<3} | {'Before Attn':<40} | {'After Attn':<40}")
print("-" * 110)

for i, (token, idx) in enumerate(zip(tokens, ids)):
    bev = pe[i]
    attn = output[i]
    bev_str = np.array2string(bev, precision=2, separator=', ', suppress_small=True, max_line_width=40)
    attn_str = np.array2string(attn, precision=2, separator=', ', suppress_small=True, max_line_width=40)
    print(f"{token:<12} | {idx:<3} | {bev_str:<40} | {attn_str:<40}")

Token        | ID  | Before Attn                              | After Attn                              
--------------------------------------------------------------------------------------------------------------
[CLS]        | 1   | [-0.23,  0.77,  1.58,  1.77]             | [3.71, 1.37, 2.75, 1.67]                
i            | 4   | [-0.17,  0.85, -0.9 , -0.41]             | [-0.23, -0.09, -0.29, -0.1 ]            
love         | 5   | [ 2.37, -0.64,  0.09, -0.42]             | [2.41, 0.89, 1.88, 1.1 ]                
pussy        | 6   | [-0.4 , -0.88, -1.12,  1.38]             | [-0.5 , -0.24, -0.83, -0.24]            
[SEP]        | 2   | [-1.23, -0.11, -0.42,  0.53]             | [-0.73, -0.3 , -0.89, -0.33]            
langit       | 7   | [-1.56, -0.01, -0.55,  2.85]             | [1.2 , 0.35, 0.19, 0.51]                
berwarna     | 8   | [-0.29, -0.1 ,  0.88, -0.22]             | [0.56, 0.22, 0.48, 0.28]                
biru         | 9   | [ 0.87, -1.21, -1.26,  1.19]

### Residual Connection + Normalization

In [202]:
def norm(x, eps=1e-6):
    mean = np.mean(x, axis=-1, keepdims=True)
    std = np.std(x, axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

residual_connection = norm(pe + output)

In [203]:
for i, (token, idx) in enumerate(zip(tokens, ids)):
    res = residual_connection[i]
    print(f"Token: {token:<10} | ID: {idx:<5} | After Attn + Residual: {res}")

Token: [CLS]      | ID: 1     | After Attn + Residual: [ 0.174 -1.544  1.255  0.115]
Token: i          | ID: 4     | After Attn + Residual: [-0.097  1.563 -1.218 -0.248]
Token: love       | ID: 5     | After Attn + Residual: [ 1.618 -0.942  0.027 -0.703]
Token: pussy      | ID: 6     | After Attn + Residual: [-0.175 -0.359 -1.093  1.627]
Token: [SEP]      | ID: 2     | After Attn + Residual: [-1.314  0.554 -0.536  1.296]
Token: langit     | ID: 7     | After Attn + Residual: [-0.718 -0.264 -0.72   1.702]
Token: berwarna   | ID: 8     | After Attn + Residual: [-0.356 -0.617  1.715 -0.742]
Token: biru       | ID: 9     | After Attn + Residual: [ 0.858 -0.794 -1.179  1.115]
Token: hari       | ID: 10    | After Attn + Residual: [ 1.646 -1.049 -0.224 -0.374]
Token: ini        | ID: 11    | After Attn + Residual: [-0.575 -0.793 -0.342  1.71 ]
Token: [SEP]      | ID: 2     | After Attn + Residual: [-1.226  0.338 -0.554  1.441]


### FFN

In [204]:
# ukuran layer di BERT (base)
ffn_hidden = 3072  # bisa ubah sesuai

In [205]:
# weight FFN
w1 = np.random.rand(dim, ffn_hidden)
b1 = np.random.rand(ffn_hidden)
w2 = np.random.rand(ffn_hidden, dim)
b2 = np.random.rand(dim)

# FFN
ffn_intermediate = np.maximum(0, residual_connection @ w1 + b1)  # ReLU
ffn_output = ffn_intermediate @ w2 + b2

In [206]:
residual_connection2 = residual_connection + ffn_output
norm2 = (residual_connection2 - residual_connection2.mean(axis=-1, keepdims=True)) / (residual_connection2.std(axis=-1, keepdims=True) + 1e-6)

In [207]:
encoder_output = norm2

In [208]:
cls_vector = encoder_output[0]  # karena [CLS] di awal

In [209]:
nsp_head = np.random.rand(dim, 2)  # dua kelas: is_next / not_next
logits = cls_vector @ nsp_head

In [210]:
nsp_probs = softmax(logits)

### Output

In [211]:
print("CLS vector:", cls_vector)
print("Logits:", logits)
print("NSP Probabilities:", nsp_probs)

CLS vector: [ 0.831 -0.099 -1.597  0.865]
Logits: [ 0.846 -0.474]
NSP Probabilities: [0.789 0.211]


### Probability of Next Sentence

In [212]:
percentages = (nsp_probs * 100)
print(f"Kalimat lanjut: {percentages[0]:.1f}%")
print(f"Bukan lanjut: {percentages[1]:.1f}%")

Kalimat lanjut: 78.9%
Bukan lanjut: 21.1%


## Full Program For Pretraining BERT Model NSP

In [229]:
import numpy as np
import random

def data_for_nsp(corpus_list):
    """
    Process list of sentences for NSP task
    Args:
        corpus_list: List of sentences to process
    Returns:
        vocab: List of unique tokens in the corpus
    """
    # Split each sentence into tokens and flatten
    all_tokens = []
    for sentence in corpus_list:
        all_tokens.extend(sentence.split())
    
    # Create unique vocabulary
    vocab = list(set(all_tokens))
    # Add special tokens
    vocab.extend(['[CLS]', '[SEP]', '[MASK]'])
    return vocab

def tokenizer(vocab):
    """Create mapping between tokens and indices"""
    token2idx = {}
    idx2token = {}
    for idx, token in enumerate(vocab):
        token2idx[token] = idx
        idx2token[idx] = token
    return token2idx, idx2token

def create_nsp_example(corpus_list, is_next=True):
    """
    Create an example for the NSP task
    Args:
        corpus_list: List of sentences to sample from
        is_next: If True, sentences are consecutive; if False, random pair
    Returns:
        tokens: List of tokens for the NSP example
        nsp_label: 1 if sentences are consecutive, 0 if not
    """
    if len(corpus_list) < 2:
        raise ValueError("Need at least 2 sentences for NSP task")
    
    # Select random sentence as the first sentence
    idx = random.randint(0, len(corpus_list) - 2)
    sentence_a = corpus_list[idx].split()
    
    if is_next:
        # If is_next is True, take the next sentence
        sentence_b = corpus_list[idx + 1].split()
        nsp_label = 1
    else:
        # If is_next is False, randomly select a sentence that's not the next one
        random_idx = random.randint(0, len(corpus_list) - 1)
        while random_idx == idx + 1 or random_idx == idx:
            random_idx = random.randint(0, len(corpus_list) - 1)
        sentence_b = corpus_list[random_idx].split()
        nsp_label = 0
    
    # Construct the complete sequence with special tokens
    tokens = ['[CLS]'] + sentence_a + ['[SEP]'] + sentence_b + ['[SEP]']
    
    # Create segment IDs (0 for first sentence, 1 for second sentence)
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    
    return tokens, segment_ids, nsp_label

def apply_mlm(tokens, mask_prob=0.15):
    """
    Apply masking for the MLM task
    Args:
        tokens: List of tokens
        mask_prob: Probability of masking a token
    Returns:
        masked_tokens: Tokens with some replaced by [MASK]
        mlm_labels: Original tokens for masked positions, -1 for unmasked
    """
    masked_tokens = tokens.copy()
    mlm_labels = [-1] * len(tokens)
    
    # Don't mask special tokens [CLS] and [SEP]
    maskable_indices = [i for i, token in enumerate(tokens) 
                        if token not in ['[CLS]', '[SEP]']]
    
    # Number of tokens to mask
    n_to_mask = max(1, int(mask_prob * len(maskable_indices)))
    
    # Randomly select indices to mask
    mask_indices = random.sample(maskable_indices, n_to_mask)
    
    for idx in mask_indices:
        # Store the original token as the label
        mlm_labels[idx] = tokens[idx]
        
        rand = random.random()
        if rand < 0.8:
            # 80% of the time, replace with [MASK]
            masked_tokens[idx] = '[MASK]'
        elif rand < 0.9:
            # 10% of the time, replace with random word
            masked_tokens[idx] = random.choice(tokens)
        # 10% of the time, keep the original word
    
    return masked_tokens, mlm_labels

def token_ids(tokens, token2idx):
    """Convert tokens to token IDs"""
    token_ids = []
    for token in tokens:
        token_ids.append(token2idx.get(token, token2idx.get('[UNK]', 0)))
    return token_ids, len(token_ids)

def embedding(scale, d_model, vocab_size):
    """Create token embedding matrix"""
    np.random.seed(42)
    embed_matrix = np.random.rand(vocab_size, d_model) * scale
    return embed_matrix

def segment_embedding(scale, d_model, n_segments=2):
    """Create segment embedding matrix"""
    np.random.seed(43)
    segment_embed_matrix = np.random.rand(n_segments, d_model) * scale
    return segment_embed_matrix

def position_embedding(max_seq_len, d_model):
    """Create position embedding matrix"""
    pos = np.arange(max_seq_len).reshape(max_seq_len, 1)
    i = np.arange(d_model)
    angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
    angle_rads = pos * angle_rates
    pe = np.zeros((max_seq_len, d_model))
    pe[:, 0::2] = np.sin(angle_rads[:, 0::2])
    pe[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
    return pe

def embedding_output(token_embed, segment_embed, segment_ids, position_embed, seq_len):
    """Combine token, segment, and position embeddings"""
    token_output = token_embed[:seq_len]
    segment_output = np.array([segment_embed[segment_id] for segment_id in segment_ids[:seq_len]])
    position_output = position_embed[:seq_len]
    
    return token_output + segment_output + position_output

def attention_weights(d_model):
    """Initialize attention weights"""
    Wq = np.random.randn(d_model, d_model) * 0.01
    Wk = np.random.randn(d_model, d_model) * 0.01
    Wv = np.random.randn(d_model, d_model) * 0.01

    return Wq, Wk, Wv

def attention_output(x, Wq, Wk, Wv):
    """Compute self-attention"""
    Q = x @ Wq
    K = x @ Wk
    V = x @ Wv
    d_k = Q.shape[-1]
    scaled = np.matmul(Q, K.transpose()) / np.sqrt(d_k)
    softmax = np.exp(scaled) / np.sum(np.exp(scaled), axis=-1, keepdims=True)

    return np.matmul(softmax, V)

def add_and_norm(x, attention_output):
    """Add and normalize"""
    eps = 1e-6
    avg = np.mean(x, axis=-1, keepdims=True)
    std = np.std(x, axis=-1, keepdims=True)
    norm = (x - avg) / (std + eps)
    norm_output = attention_output + norm
    return norm_output
    
def ffn_weights(d_model, d_ff):
    """Initialize feed-forward network weights"""
    W1 = np.random.randn(d_model, d_ff) * 0.01
    W2 = np.random.randn(d_ff, d_model) * 0.01
    return W1, W2

def ffn_output(norm_output, W1, W2):
    """Feed-forward network computation"""
    ffn_output = np.matmul(norm_output, W1)
    relu = np.maximum(0, ffn_output)
    ffn_output = np.matmul(relu, W2)
    return ffn_output

def initialize_model(d_model, d_ff, vocab_size, n_segments=2):
    """Initialize all model parameters"""
    Wq, Wk, Wv = attention_weights(d_model)
    W1, W2 = ffn_weights(d_model, d_ff)
    token_embed_matrix = embedding(0.01, d_model, vocab_size)
    segment_embed_matrix = segment_embedding(0.01, d_model, n_segments)
    
    # MLM output layer
    W_mlm_output = np.random.randn(d_model, vocab_size) * 0.01
    b_mlm_output = np.zeros(vocab_size)
    
    # NSP output layer
    W_nsp_output = np.random.randn(d_model, 2) * 0.01  # 2 classes for NSP
    b_nsp_output = np.zeros(2)
    
    return {
        'Wq': Wq, 'Wk': Wk, 'Wv': Wv,
        'W1': W1, 'W2': W2,
        'token_embed_matrix': token_embed_matrix,
        'segment_embed_matrix': segment_embed_matrix,
        'W_mlm_output': W_mlm_output,
        'b_mlm_output': b_mlm_output,
        'W_nsp_output': W_nsp_output,
        'b_nsp_output': b_nsp_output
    }

def forward_pass(token_ids_input, segment_ids, seq_len, model_params, d_model, vocab_size):
    """
    Forward pass through the BERT model
    Returns MLM and NSP outputs
    """
    # One-hot encode the tokens
    input_matrix = np.zeros((seq_len, vocab_size))
    for i, token_id in enumerate(token_ids_input[:seq_len]):
        input_matrix[i, token_id] = 1
    
    # Get embeddings
    token_embeds = input_matrix @ model_params['token_embed_matrix']
    pos_embeds = position_embedding(seq_len, d_model)
    embed_output = embedding_output(
        token_embeds, 
        model_params['segment_embed_matrix'], 
        segment_ids[:seq_len], 
        pos_embeds, 
        seq_len
    )

    # Self-attention block
    attn_output = attention_output(embed_output, model_params['Wq'], model_params['Wk'], model_params['Wv'])
    norm_output = add_and_norm(embed_output, attn_output)
    
    # Feed-forward block
    ffn_out = ffn_output(norm_output, model_params['W1'], model_params['W2'])
    ffn_norm_output = add_and_norm(norm_output, ffn_out)

    # MLM head: predict token for each position
    mlm_logits = ffn_norm_output @ model_params['W_mlm_output'] + model_params['b_mlm_output']
    mlm_probs = np.exp(mlm_logits) / np.sum(np.exp(mlm_logits), axis=-1, keepdims=True)
    
    # NSP head: use [CLS] token representation for next sentence prediction
    cls_output = ffn_norm_output[0]  # [CLS] token is at the beginning
    nsp_logits = cls_output @ model_params['W_nsp_output'] + model_params['b_nsp_output']
    nsp_probs = np.exp(nsp_logits) / np.sum(np.exp(nsp_logits))

    return mlm_probs, nsp_probs, ffn_norm_output

def compute_loss(mlm_probs, mlm_labels, nsp_probs, nsp_label, token2idx):
    """
    Compute combined loss for MLM and NSP tasks
    Args:
        mlm_probs: Predicted probabilities for masked tokens
        mlm_labels: Original tokens for masked positions, -1 for unmasked
        nsp_probs: Predicted probabilities for NSP
        nsp_label: True label for NSP (0 or 1)
    """
    # MLM loss: only for masked positions
    mlm_loss = 0
    mask_count = 0
    
    for i, label in enumerate(mlm_labels):
        if label != -1:  # If this position was masked
            label_id = token2idx.get(label, 0)
            mlm_loss -= np.log(mlm_probs[i, label_id] + 1e-10)
            mask_count += 1
    
    if mask_count > 0:
        mlm_loss /= mask_count
    
    # NSP loss
    nsp_loss = -np.log(nsp_probs[nsp_label] + 1e-10)
    
    # Combined loss
    total_loss = mlm_loss + nsp_loss
    
    return total_loss, mlm_loss, nsp_loss

def backward_propagation(loss, mlm_probs, mlm_labels, nsp_probs, nsp_label, ffn_norm_output, 
                         model_params, token_ids_input, segment_ids, token2idx, lr=0.01):
    """Basic backward propagation to update model parameters"""
    # Gradients for MLM output layer
    d_W_mlm_output = np.zeros_like(model_params['W_mlm_output'])
    d_b_mlm_output = np.zeros_like(model_params['b_mlm_output'])
    
    # Compute gradients for masked positions
    for i, label in enumerate(mlm_labels):
        if label != -1:  # If this position was masked
            label_id = token2idx.get(label, 0)
            d_probs = np.zeros_like(mlm_probs[i])
            d_probs[label_id] = -1.0 / (mlm_probs[i, label_id] + 1e-10)
            d_logits = mlm_probs[i] * d_probs
            
            d_W_mlm_output += np.outer(ffn_norm_output[i], d_logits)
            d_b_mlm_output += d_logits
    
    # Gradients for NSP output layer
    d_W_nsp_output = np.zeros_like(model_params['W_nsp_output'])
    d_b_nsp_output = np.zeros_like(model_params['b_nsp_output'])
    
    d_nsp_probs = np.zeros_like(nsp_probs)
    d_nsp_probs[nsp_label] = -1.0 / (nsp_probs[nsp_label] + 1e-10)
    d_nsp_logits = nsp_probs * d_nsp_probs
    
    d_W_nsp_output += np.outer(ffn_norm_output[0], d_nsp_logits)  # [CLS] token
    d_b_nsp_output += d_nsp_logits
    
    # Update parameters
    model_params['W_mlm_output'] -= lr * d_W_mlm_output
    model_params['b_mlm_output'] -= lr * d_b_mlm_output
    model_params['W_nsp_output'] -= lr * d_W_nsp_output
    model_params['b_nsp_output'] -= lr * d_b_nsp_output

    return model_params

def train_bert(corpus_list, n_epoch=1000):
    """
    Train BERT with both MLM and NSP objectives
    Args:
        corpus_list: List of sentences
        n_epoch: Number of training epochs
    """
    # Model dimensions
    d_model = 16
    d_ff = 64
    
    # Process corpus
    vocab = data_for_nsp(corpus_list)
    vocab_size = len(vocab)
    token2idx, idx2token = tokenizer(vocab)
    
    # Initialize model
    model_params = initialize_model(d_model, d_ff, vocab_size)
    
    # Training loop
    losses = {"total": [], "mlm": [], "nsp": []}
    
    for epoch in range(n_epoch):
        # Create training example
        is_next = random.choice([True, False])
        tokens, segment_ids, nsp_label = create_nsp_example(corpus_list, is_next)
        
        # Apply masking for MLM
        masked_tokens, mlm_labels = apply_mlm(tokens)
        
        # Convert to token IDs
        token_ids_input, seq_len = token_ids(masked_tokens, token2idx)
        
        # Forward pass
        mlm_probs, nsp_probs, ffn_norm_output = forward_pass(
            token_ids_input, segment_ids, seq_len, model_params, d_model, vocab_size
        )
        
        # Calculate loss
        total_loss, mlm_loss, nsp_loss = compute_loss(
            mlm_probs, mlm_labels, nsp_probs, nsp_label, token2idx
        )
        
        losses["total"].append(total_loss)
        losses["mlm"].append(mlm_loss)
        losses["nsp"].append(nsp_loss)
        
        # Backward pass
        model_params = backward_propagation(
            total_loss, mlm_probs, mlm_labels, nsp_probs, nsp_label, ffn_norm_output,
            model_params, token_ids_input, segment_ids, token2idx, lr=0.01
        )
        
        # Logging
        if (epoch+1) % 100 == 0:
            print(f"Epoch {epoch+1}/{n_epoch}")
            print(f"Total Loss: {total_loss:.4f}, MLM Loss: {mlm_loss:.4f}, NSP Loss: {nsp_loss:.4f}")
            
            # MLM accuracy check: predict random masked token
            mask_indices = [i for i, label in enumerate(mlm_labels) if label != -1]
            if mask_indices:
                random_mask_idx = random.choice(mask_indices)
                predicted_token_id = np.argmax(mlm_probs[random_mask_idx])
                print(f"MLM sample - Predicted: {idx2token[predicted_token_id]}, Target: {mlm_labels[random_mask_idx]}")
            
            # NSP accuracy check
            predicted_nsp = np.argmax(nsp_probs)
            print(f"NSP - Predicted: {predicted_nsp}, Target: {nsp_label}")
            print()
        
    # Final evaluation
    correct_nsp = 0
    total_tests = 100
    
    print("\nFinal Evaluation:")
    for _ in range(total_tests):
        is_next = random.choice([True, False])
        tokens, segment_ids, nsp_label = create_nsp_example(corpus_list, is_next)
        masked_tokens, mlm_labels = apply_mlm(tokens)
        token_ids_input, seq_len = token_ids(masked_tokens, token2idx)
        
        mlm_probs, nsp_probs, _ = forward_pass(
            token_ids_input, segment_ids, seq_len, model_params, d_model, vocab_size
        )
        
        predicted_nsp = np.argmax(nsp_probs)
        if predicted_nsp == nsp_label:
            correct_nsp += 1
    
    print(f"NSP Accuracy: {correct_nsp/total_tests:.2%}")
    
    return model_params, losses

In [230]:
# Example usage
if __name__ == "__main__":
    corpus_list = [
        "the cat sat on the mat",
        "it was happy there",
        "dogs chase cats around",
        "the weather is nice today",
        "i like to read books",
        "programming is fun to learn",
        "artificial intelligence is advancing quickly",
        "neural networks can solve many problems"
    ]
    
    model_params, losses = train_bert(corpus_list, n_epoch=1000)

Epoch 100/1000
Total Loss: 7.9209, MLM Loss: 3.9613, NSP Loss: 3.9596
MLM sample - Predicted: the, Target: i
NSP - Predicted: 1, Target: 0

Epoch 200/1000
Total Loss: 4.3164, MLM Loss: 4.2971, NSP Loss: 0.0193
MLM sample - Predicted: is, Target: networks
NSP - Predicted: 1, Target: 1

Epoch 300/1000
Total Loss: 9.2947, MLM Loss: 4.3278, NSP Loss: 4.9669
MLM sample - Predicted: is, Target: many
NSP - Predicted: 1, Target: 0

Epoch 400/1000
Total Loss: 9.8398, MLM Loss: 3.5188, NSP Loss: 6.3210
MLM sample - Predicted: is, Target: to
NSP - Predicted: 1, Target: 0

Epoch 500/1000
Total Loss: 5.2079, MLM Loss: 5.2009, NSP Loss: 0.0070
MLM sample - Predicted: is, Target: fun
NSP - Predicted: 1, Target: 1

Epoch 600/1000
Total Loss: 5.1462, MLM Loss: 5.1460, NSP Loss: 0.0002
MLM sample - Predicted: is, Target: around
NSP - Predicted: 1, Target: 1

Epoch 700/1000
Total Loss: 7.7294, MLM Loss: 0.0504, NSP Loss: 7.6790
MLM sample - Predicted: is, Target: is
NSP - Predicted: 1, Target: 0

Epoch 8

| Step | Penjelasan                           | Status |
| :--: | :----------------------------------- | :----: |
|   1  | Bangun Mini-BERT Stack               |    ✅   |
|   2  | Pretraining (Masked LM + NSP)        |   ✅    |
|   3  | Fine-tuning ke task spesifik         |   🔜   |
|   4  | Buat dataset dummy buat latihan      |   🔜   |
|   5  | Build mindset & intuition level dewa |   🔜   |