# Pretraining (Masked LM + NSP)

## PROCESS OVERVIEW



| Step | Description                            | Status |
| :--: | :------------------------------------- | :----: |
|   1  | Build the Mini-BERT stack              |   ✅   |
|   2  | Pretraining (Masked LM + NSP)          |   NOW  |
|   3  | Fine-tuning on a specific task         |   🔜   |
|   4  | Create a dummy dataset for practice    |   🔜   |
|   5  | Build mindset & expert-level intuition |   🔜   |

---

CORE IDEA:

- Input: a pair of sentences in which some tokens are masked
- Target 1: the original tokens at the masked positions
- Target 2: whether the second sentence actually follows the first

HOW?:

- Tokenize the sentences ➔ token IDs

- Add [CLS] at the start and [SEP] between the sentences

- Add positional encoding as usual

- Randomly choose tokens to replace with [MASK] (about 15% of the tokens)

- Feed the result into the Mini-BERT stack (our model)

- Output 1: prediction of the masked tokens

- Output 2: prediction of the NSP label (IsNext / NotNext)
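
The input-preparation steps above can be sketched roughly as follows. This is only a minimal illustration: `build_pair_input` is a hypothetical helper (not part of the notebook's code), whitespace splitting stands in for a real tokenizer, and the segment IDs use 0 for the first sentence and 1 for the second.

```python
def build_pair_input(sentence_a, sentence_b):
    """Hypothetical helper: join two sentences into one BERT-style input."""
    tokens_a = sentence_a.lower().split()   # assumption: whitespace tokenizer
    tokens_b = sentence_b.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_pair_input("kucing bermain", "di taman")
print(tokens)
print(segment_ids)
print(position_ids)
```

A real pipeline would then map each token to its ID in the vocabulary before masking.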




| Task                              | Goal                                      | In plain terms                                            |
| :-------------------------------- | :---------------------------------------- | :-------------------------------------------------------- |
| 1. Masked Language Model (MLM)    | Learn to fill in missing words            | Guess the hidden word                                     |
| 2. Next Sentence Prediction (NSP) | Learn the relationship between sentences  | Guess whether the second sentence follows or is unrelated |


---

# MLM: MASKED LANGUAGE MODEL


## INTUITION

- Hide some of the words in a sentence so the model learns to fill them in.
- BERT's job is to guess the hidden words.
- Original sentence:
- "Saya makan nasi di warung." (Indonesian: "I eat rice at the food stall.")

- After masking:
- "Saya [MASK] nasi di [MASK]."

- BERT's task:
- Predict first [MASK] = "makan" (eat), second [MASK] = "warung" (food stall)


## PROCESS

1. Input:

- c = ['kucing bermain di taman'] (raw sentence, Indonesian for "the cat plays in the park")

- t = ['kucing', 'bermain', 'di', 'taman'] (tokens)


2. Special tokens:

- ['[CLS]', 'kucing', 'bermain', 'di', 'taman', '[SEP]']


3. Mask about 15% of the input tokens (one token here, for illustration):

- ['[CLS]', 'kucing', '[MASK]', 'di', 'taman', '[SEP]']

4. Pretrain the model on this input:

- Input: ['[CLS]', 'kucing', '[MASK]', 'di', 'taman', '[SEP]']
  
- Embedding (token embedding + positional embedding)
  
- Encoder stack (MHA ➔ AddNorm ➔ FFN ➔ AddNorm)

- Output: a tensor with one representation vector per token.
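
The embedding step can be illustrated with a small NumPy sketch. The vocabulary, the toy sizes (`d_model = 4`, `max_len = 16`), and the randomly initialized lookup tables are all assumptions made for this example; a real model would learn both tables during pretraining.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes, chosen only for illustration.
vocab = {"[CLS]": 0, "[SEP]": 1, "[MASK]": 2, "kucing": 3, "bermain": 4, "di": 5, "taman": 6}
d_model, max_len = 4, 16

token_embedding = rng.normal(size=(len(vocab), d_model))    # one row per vocab entry
position_embedding = rng.normal(size=(max_len, d_model))    # one row per position

tokens = ["[CLS]", "kucing", "[MASK]", "di", "taman", "[SEP]"]
token_ids = np.array([vocab[t] for t in tokens])
position_ids = np.arange(len(tokens))

# Each token's input vector is its token embedding plus its position embedding.
x = token_embedding[token_ids] + position_embedding[position_ids]
print(x.shape)  # (6, 4): one d_model-sized vector per token
```

This `x` is what the encoder stack would receive as input.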


## PSEUDOCODE

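    # Pretraining BERT for MLM (Python-style pseudocode).
    # tokenize, bert_model, optimizer, cross_entropy, random_token,
    # training_data, and num_epochs are placeholders, not a real API.
    import random

    MASK, IGNORE = "[MASK]", "[IGNORE]"

    def apply_mask(tokens):
        tokens = list(tokens)
        labels = [IGNORE] * len(tokens)       # unselected positions are ignored by the loss
        for i in range(len(tokens)):
            if random.random() < 0.15:        # select about 15% of the tokens
                labels[i] = tokens[i]         # label = original token
                r = random.random()           # one draw decides the replacement
                if r < 0.8:
                    tokens[i] = MASK          # 80%: replace with [MASK]
                elif r < 0.9:
                    tokens[i] = random_token()  # 10%: replace with a random token
                # remaining 10%: keep the original token
        return tokens, labels

    # bert_model starts from randomly initialized weights
    for epoch in range(num_epochs):
        for batch in training_data:
            # 1. tokenize
            input_tokens = tokenize(batch)

            # 2. masking
            masked_input, labels = apply_mask(input_tokens)

            # 3. forward pass through BERT
            output = bert_model(masked_input)

            # 4. loss on the selected (masked) positions only
            loss = cross_entropy(output, labels, ignore=IGNORE)

            # 5. backpropagation and parameter update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()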


## EXAMPLE

1. Input: ['[CLS]', 'singa', 'berlari', 'cepat', '[SEP]'] (Indonesian for "the lion runs fast")

2. Mask: ['[CLS]', 'singa', '[MASK]', 'cepat', '[SEP]']

3. Embedding:

- [CLS]:  [0.1, 0.2]

- singa:  [0.5, 0.4]

- [MASK]: [0.0, 0.0]  (placeholder because the word is unknown; in practice [MASK] has its own learned embedding)

- cepat:  [0.3, 0.7]

- [SEP]:  [0.1, 0.2]


4. BERT Model:

- MHA ➔ AddNorm

- FFN ➔ AddNorm

- Output at [MASK]: [0.48, 0.45]


5. Loss:

- Vocab:
{
  'singa':  [0.5, 0.4],
  'berlari': [0.48, 0.45],
  'cepat': [0.3, 0.7],
  'makan': [0.7, 0.2]
}

- Similarity (dot product of the [MASK] output with each vocab vector)

- to 'singa' ➔ 0.48×0.5 + 0.45×0.4 = 0.24 + 0.18 = 0.42

- to 'berlari' ➔ 0.48×0.48 + 0.45×0.45 = 0.2304 + 0.2025 = 0.4329

- to 'cepat' ➔ 0.48×0.3 + 0.45×0.7 = 0.144 + 0.315 = 0.459

- to 'makan' ➔ 0.48×0.7 + 0.45×0.2 = 0.336 + 0.09 = 0.426
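
To turn these similarity scores into a loss, one common approach is a softmax over the vocabulary followed by cross-entropy against the original word. The sketch below reuses the toy numbers from this example; treating the target as 'berlari' (the word that was masked) and using a plain dot product as the logit are assumptions of this illustration.

```python
import numpy as np

vocab = {
    "singa": [0.5, 0.4],
    "berlari": [0.48, 0.45],
    "cepat": [0.3, 0.7],
    "makan": [0.7, 0.2],
}
words = list(vocab)
E = np.array([vocab[w] for w in words])   # [vocab_size, d_model]
h_mask = np.array([0.48, 0.45])           # encoder output at the [MASK] position

scores = E @ h_mask                       # dot-product similarity to every vocab word
probs = np.exp(scores - scores.max())     # subtract the max for numerical stability
probs /= probs.sum()                      # softmax over the vocabulary

target = words.index("berlari")           # the original (masked) word
loss = -np.log(probs[target])             # cross-entropy for this single position

predicted = words[int(np.argmax(scores))]
for w, s, p in zip(words, scores, probs):
    print(f"{w:8s} score={s:.4f} prob={p:.4f}")
print("predicted:", predicted, "| loss:", round(float(loss), 3))
```

With these toy numbers the highest score goes to 'cepat' (0.459), not the true word 'berlari' (0.4329), because a raw dot product also rewards vectors with a larger norm. The cross-entropy loss is what penalizes this and pushes the model toward 'berlari' during training.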

## PYTHON CODE IMPLEMENTATION OF PRETRAIN BERT MODEL

Source: https://www.101ai.net/text/bert



### 1. Initialize Pretrain Model

    import numpy as np
    from Bert_Model import BertModel

    class BertPretrainingModel:
        def __init__(self, bert_model, vocab_size):
            self.bert_model = bert_model
            self.d_model = bert_model.d_model
            self.vocab_size = vocab_size

            # Scale random weights down by sqrt(d_model) to keep activations small
            scale = 1.0 / np.sqrt(self.d_model)

            ## MLM head: dense (d_model -> d_model), GELU, decoder (d_model -> vocab_size)
            self.mlm_dense = np.random.randn(self.d_model, self.d_model) * scale
            self.mlm_bias = np.zeros((self.d_model,))
            self.mlm_decoder = np.random.randn(self.d_model, self.vocab_size) * scale
            self.mlm_decoder_bias = np.zeros((self.vocab_size,))

            ## NSP head: binary classification on the [CLS] vector
            self.nsp_dense = np.random.randn(self.d_model, 2) * scale
            self.nsp_bias = np.zeros((2,))

        def gelu(self, x):
            # tanh approximation of GELU
            return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * np.power(x, 3))))

        def forward(self, token_id=None, position_id=None, attention_mask=None):
            """
            Forward pass for pretraining.

            Args:
                token_id: [batch_size, seq_len] token IDs
                position_id: [batch_size, seq_len] position IDs
                attention_mask: [batch_size, seq_len] 1 for tokens to attend to, 0 for padding
                    (accepted but not used yet; segment IDs are not passed to the encoder either)

            Returns:
                mlm_logits: [batch_size, seq_len, vocab_size] MLM logits
                nsp_logits: [batch_size, 2] NSP logits
            """

            # Encoder output: [batch_size, seq_len, d_model]
            bert_output = self.bert_model.forward(token_id, position_id)

            # MLM task: every position gets a score for every vocabulary entry
            mlm_hidden = np.matmul(bert_output, self.mlm_dense) + self.mlm_bias
            mlm_hidden = self.gelu(mlm_hidden)
            mlm_logits = np.matmul(mlm_hidden, self.mlm_decoder) + self.mlm_decoder_bias

            # NSP task: classify from the [CLS] token (position 0)
            cls_output = bert_output[:, 0, :]
            nsp_logits = np.matmul(cls_output, self.nsp_dense) + self.nsp_bias

            return mlm_logits, nsp_logits
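
As a quick sanity check on the tensor shapes, here is a standalone sketch of the two heads. It does not use `BertModel`; a random array stands in for the encoder output, the toy sizes are arbitrary, and the dense layer is assumed to map `d_model` to `d_model` before the decoder projects to the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, d_model, vocab_size = 2, 6, 8, 50   # toy sizes (assumptions)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


# Stand-in for the encoder output; a real run would use the BERT stack instead.
bert_output = rng.normal(size=(batch_size, seq_len, d_model))

scale = 1.0 / np.sqrt(d_model)
mlm_dense = rng.normal(size=(d_model, d_model)) * scale
mlm_decoder = rng.normal(size=(d_model, vocab_size)) * scale
nsp_dense = rng.normal(size=(d_model, 2)) * scale

# MLM head: one score per vocabulary entry, for every token position.
mlm_logits = gelu(bert_output @ mlm_dense) @ mlm_decoder
# NSP head: uses only the [CLS] vector at position 0.
nsp_logits = bert_output[:, 0, :] @ nsp_dense

print(mlm_logits.shape)  # (2, 6, 50)
print(nsp_logits.shape)  # (2, 2)
```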

### 2. Pretrain BERT Model

    def mlm_data(tokens, mask_prob=0.15):
        """
        Create masked input and labels for masked language modeling.

        Args:
            tokens: [batch_size, seq_len] Token IDs
            mask_prob: Probability of selecting a token for prediction

        Returns:
            masked_tokens: [batch_size, seq_len] Masked token IDs
            mlm_labels: [batch_size, seq_len] Labels (-1 for unselected tokens, original token ID for selected)
        """

        masked_tokens = tokens.copy()
        mlm_labels = np.full_like(tokens, -1)  # -1 marks positions that are not predicted

        # Select roughly mask_prob of all positions
        prob_matrix = np.random.random(tokens.shape)
        mask_indices = prob_matrix < mask_prob

        # Never select special tokens. These IDs match the bert-base-uncased vocabulary
        # (0 = [PAD], 101 = [CLS], 102 = [SEP]); replace with your special token IDs.
        special_tokens = (tokens == 0) | (tokens == 101) | (tokens == 102)
        mask_indices &= ~special_tokens

        # Labels hold the original token at every selected position
        mlm_labels[mask_indices] = tokens[mask_indices]

        # 80% of the selected positions: replace with [MASK]
        indices_mask = (np.random.random(tokens.shape) < 0.8) & mask_indices
        masked_tokens[indices_mask] = 103  # [MASK] token ID

        # 10% of the selected positions (half of the remaining 20%): replace with a random token
        indices_random = (np.random.random(tokens.shape) < 0.5) & mask_indices & ~indices_mask
        random_words = np.random.randint(1, 30522, size=tokens.shape)
        masked_tokens[indices_random] = random_words[indices_random]

        # Remaining 10% of the selected positions keep the original token
        # (they are still predicted because their label is set)
        return masked_tokens, mlm_labels
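
The labels produced above use -1 for positions that should not contribute to the loss. A minimal sketch of a matching loss function is shown below; `masked_cross_entropy` is a hypothetical helper written for this illustration, and the random logits only stand in for real model output.

```python
import numpy as np


def masked_cross_entropy(logits, labels, ignore_index=-1):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    keep = labels != ignore_index                 # [batch, seq_len] boolean mask
    if not keep.any():
        return 0.0                                # nothing was selected in this batch
    selected = logits[keep]                       # [n_selected, vocab_size]
    selected = selected - selected.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = selected - np.log(np.exp(selected).sum(axis=-1, keepdims=True))
    targets = labels[keep]                        # original token IDs
    return float(-log_probs[np.arange(len(targets)), targets].mean())


rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 5, 10))          # toy [batch, seq_len, vocab_size]
labels = np.array([[-1, 7, -1, -1, 3]])       # only positions 1 and 4 were selected
print(masked_cross_entropy(logits, labels))
```

Averaging only over the selected positions keeps the loss scale independent of how many tokens happened to be masked in a batch.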


    def nsp_data(text, tokenizer, max_len=512, batch_size=32):
        """
        Build sentence pairs and IsNext / NotNext labels for NSP (not implemented yet).
        """

## NEXT SENTENCE PREDICTION (NSP)

- Given two sentences, BERT has to guess:

- Does B follow A? (A ➔ B)

- Or is B unrelated? (A ➔ random)

- Sentence 1: "Saya pergi ke pasar." ("I went to the market.")
- Sentence 2: "Saya membeli buah." ("I bought fruit.")
- ==> Label: IsNext (coherent)

- Sentence 1: "Saya pergi ke pasar."
- Sentence 2: "Bulan purnama sangat indah." ("The full moon is very beautiful.")
- ==> Label: NotNext (random)
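
One way to generate such pairs from an ordered list of sentences is sketched below. `make_nsp_pairs` is a hypothetical helper, not the notebook's `nsp_data`; the 50/50 split between true and random continuations follows the usual BERT recipe, while the label encoding (1 = IsNext, 0 = NotNext) is an assumption of this example.

```python
import random


def make_nsp_pairs(sentences, seed=0):
    """Hypothetical helper: build (sentence_a, sentence_b, label) triples.

    label 1 = IsNext (b really follows a), label 0 = NotNext (b is random).
    Needs at least 3 sentences so a random non-adjacent sentence always exists.
    """
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that actually comes next.
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative example: any sentence except a itself and the true next one.
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            pairs.append((sentences[i], sentences[rng.choice(candidates)], 0))
    return pairs


corpus = [
    "Saya pergi ke pasar.",
    "Saya membeli buah.",
    "Bulan purnama sangat indah.",
    "Kucing bermain di taman.",
]
for a, b, label in make_nsp_pairs(corpus):
    print(label, "|", a, "->", b)
```

Each pair would then be joined as `[CLS] A [SEP] B [SEP]` before being fed to the model, with the label used as the target for the NSP head.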
