# MODEL: MINI BERT BUILT MANUALLY




## PROCESS OVERVIEW

| Step | Description                          | Status |
| :--: | :----------------------------------- | :----: |
|   1  | Build the Mini-BERT stack            |   Now   |
|   2  | Pretraining (Masked LM + NSP)        |   🔜   |
|   3  | Fine-tuning on a specific task       |   🔜   |
|   4  | Build a dummy dataset for practice   |   🔜   |
|   5  | Build expert-level mindset & intuition |   🔜   |

---

## ARCHITECTURE

<img src="https://miro.medium.com/v2/resize:fit:1162/1*vG_xN7a9HuLCU05U5IznPQ.png" width="500" />


## INPUT DATA

## TOKENIZATION

- Convert every word (or subword) into a token: an integer ID looked up in the vocabulary.

- BERT uses WordPiece. For example:

- Sentence: "cat eating fish"
- Tokenized → [11, 25, 38]
- Vocab → {"cat": 11, "eating": 25, "fish": 38}
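
The lookup above can be sketched as a greedy longest-match split, which is the general idea behind WordPiece. This is a minimal sketch, not BERT's actual tokenizer: the vocabulary, the `##s` piece, its ID 40, and the `[UNK]` ID 0 are all hypothetical values chosen for the demo.

```python
# Hypothetical toy vocabulary; the IDs for "##s" and "[UNK]" are made up for this sketch.
vocab = {"[UNK]": 0, "cat": 11, "eating": 25, "fish": 38, "##s": 40}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(sentence, vocab):
    """Split on whitespace, then map every piece to its integer ID."""
    return [vocab[p] for word in sentence.split() for p in wordpiece(word, vocab)]

print(encode("cat eating fish", vocab))   # whole words found in the vocab
print(encode("cats eating fish", vocab))  # "cats" splits into "cat" + "##s"
```

A word that is not in the vocabulary is broken into known pieces where possible, which is why the vocabulary can stay small.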


## POSITIONAL EMBEDDINGS

- Every token is turned into a vector that combines the token's meaning with its position.

- Steps:

- Word embedding: look up an n-dimensional vector for every token in the word embedding table.

- Positional encoding: look up a vector for every position in the positional table and add it to the word embedding.

- Example:

- Word embedding [128-d]

- Position embedding [128-d]

- Word embedding + position embedding → one [128-d] vector for every token.
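
As a rough illustration of the two lookups and the sum, here is a minimal NumPy sketch. The table sizes (`vocab_size = 50`, `max_len = 16`) and the random values are assumptions for the demo; only the 128-d width and the token IDs come from the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, max_len, d_model = 50, 16, 128   # assumed demo sizes
word_table = rng.normal(scale=0.02, size=(vocab_size, d_model))  # one row per token ID
pos_table = rng.normal(scale=0.02, size=(max_len, d_model))      # one row per position

token_ids = np.array([11, 25, 38])       # "cat eating fish"
positions = np.arange(len(token_ids))    # [0, 1, 2]

# Row lookup in each table, then an element-wise sum.
x = word_table[token_ids] + pos_table[positions]
print(x.shape)  # (3, 128)
```

Because the position vector is added, the same token ID produces a different input vector at a different position.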



## STACK ENCODER

#### MULTIHEAD ATTENTION

- Multi-head means every token gets n different perspectives on its context, one per head.

##### SCALED DOT PRODUCT ATTENTION / SELF ATTENTION

- Every token looks at every other token and updates its representation in proportion to how strongly they are related.


- Steps:

- Input vectors → linear projections to Q, K, V.

- Attention weights: softmax(QKᵀ / √d), where d is the dimension of each key vector.

- Multiply the weights × V.

- Concatenate with the other heads.

- Output: a new hidden state that is "smarter", i.e. it carries more context.
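
The steps above can be sketched for a single head and a single sentence. This is a minimal sketch under stated assumptions: the sizes (3 tokens, 4 dimensions) are arbitrary and the random matrices stand in for trained projection weights.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 3, 4                  # 3 tokens, 4-d vectors (demo sizes)
x = rng.normal(size=(seq_len, d))

# Linear projections to Q, K, V (random weights stand in for trained ones).
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

weights = softmax(q @ k.T / np.sqrt(d))  # (seq_len, seq_len); row i = how token i attends to every token
out = weights @ v                        # (seq_len, d); weighted mix of the value vectors

print(weights.sum(axis=-1))  # every row sums to 1
print(out.shape)
```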


#### RESIDUAL AND NORMALIZATION

- Keep the old information + stabilize training.
- Add the input to the self-attention output ➔ Residual.
- Normalize the result ➔ LayerNorm.
- attention_output = LayerNorm(input + attention(input))
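
A small sketch of the add-and-norm step, assuming a LayerNorm without learnable scale and shift. `toy_attention` is a hypothetical stand-in for the attention sublayer, used only so the example runs on its own.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize every token vector (last axis) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by LayerNorm: LayerNorm(x + sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))             # 3 tokens, 8-d
w = rng.normal(size=(8, 8))
toy_attention = lambda h: h @ w         # hypothetical stand-in for attention(input)

attention_output = add_and_norm(x, toy_attention)
print(attention_output.mean(axis=-1))   # close to 0 for every token
print(attention_output.std(axis=-1))    # close to 1 for every token
```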


#### FEED FORWARD

- Refine each token's representation individually.

- Process:

- expand → ReLU → shrink.

- Selects the important features per token.

- Details:

- Dense layer 1: expand the dimension (e.g. 128 ➔ 512).

- ReLU: zeroes out negative activations, so only the features that fire are passed on (the Python implementation later in this document uses GELU, as BERT does).

- Dense layer 2: project back to the original dimension (512 ➔ 128).
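
The expand → ReLU → shrink pipeline can be sketched with the 128 ➔ 512 ➔ 128 sizes from the notes. The weights here are random placeholders, and the `sqrt(2 / fan_in)` scale is an assumption for the demo rather than a prescribed initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 128, 512

w1 = rng.normal(scale=np.sqrt(2.0 / d_model), size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(scale=np.sqrt(2.0 / d_ff), size=(d_ff, d_model))
b2 = np.zeros(d_model)

def ffn(x):
    h = x @ w1 + b1          # dense layer 1: expand 128 -> 512
    h = np.maximum(h, 0.0)   # ReLU: negative activations become 0
    return h @ w2 + b2       # dense layer 2: shrink 512 -> 128

x = rng.normal(size=(3, d_model))   # 3 tokens
y = ffn(x)
print(y.shape)  # (3, 128)
```

The network is applied per token: running it on one token alone gives the same vector as that token's row in the batched result.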




#### RESIDUAL AND NORMALIZATION

- Keep the old information + stabilize training.
- Add the FFN input to the FFN output ➔ Residual.
- Normalize the result ➔ LayerNorm.
- attention_output = LayerNorm(input + attention(input))
- ffn_output = LayerNorm(attention_output + FFN(attention_output))


## ANOTHER ENCODER BLOCK

- input = ffn_output

- ....

-  ......

- Repeat 12-24× (12 blocks in BERT-base, 24 in BERT-large).
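
Stacking is just a loop in which each block's output becomes the next block's input. In this sketch `ToyLayer` is a hypothetical stand-in for a full encoder block (a single linear map with add-and-norm), used only to show the data flow and that the shape is preserved.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

class ToyLayer:
    """Hypothetical stand-in for one encoder block."""
    def __init__(self, d_model, rng):
        self.w = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))

    def forward(self, x):
        return layer_norm(x + x @ self.w)

rng = np.random.default_rng(0)
n_layers, d_model = 12, 16
layers = [ToyLayer(d_model, rng) for _ in range(n_layers)]

hidden = rng.normal(size=(2, 5, d_model))   # (batch_size, seq_len, d_model)
for layer in layers:
    hidden = layer.forward(hidden)          # output of one block is the input of the next

print(hidden.shape)  # (2, 5, 16)
```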

## OUTPUT 


### HIDDEN STATE

- Hidden states: (batch_size, sequence_length, hidden_size)
- Every token has a contextual representation.
- Used for per-token tasks (NER, QA).


### POOLED STATE

- Pooler output: (batch_size, hidden_size)
- Represents the whole sentence.
- Used for sentence classification (positive/negative, entailment, etc.).
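
A sketch of how the pooled state can be derived from the hidden states: take the hidden state of the first token and pass it through a dense layer with tanh, as BERT's pooler does. It assumes the first token of every sequence is `[CLS]`; the sizes and the pooler weights `w_pool` and `b_pool` are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden_size = 2, 6, 8   # demo sizes

# Stand-in for the encoder output: one contextual vector per token.
hidden_states = rng.normal(size=(batch_size, seq_len, hidden_size))

# Hypothetical pooler weights; in a trained model these are learned.
w_pool = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_pool = np.zeros(hidden_size)

cls_state = hidden_states[:, 0, :]              # (batch_size, hidden_size), assumes [CLS] is token 0
pooled = np.tanh(cls_state @ w_pool + b_pool)   # (batch_size, hidden_size)

print(hidden_states.shape)  # (2, 6, 8)
print(pooled.shape)         # (2, 8)
```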

---

## PSEUDOCODE IMPLEMENTATION

### INPUT
    ## input text or corpus
    
    c = "x y z ...."

### TOKEN

    # text into number by id or index
    
    t = [x_id, y_id, z_id, ...... , n_id]

### POSITIONAL EMBEDDING

    ## map each token to a word embedding plus a positional embedding
    
    for each token in t:
    
        embedding_vector = word_embedding_matrix[token_id]         -> vector for the token itself
        positional_vector = positional_embedding_matrix[position]  -> vector that encodes the token's position in the sequence
        input_vector = embedding_vector + positional_vector

    ## result:
    
    vectors  = [input_vector_1, input_vector_2, input_vector_3, ......., input_vector_n]




### STACK ENCODER - 2 LAYER

    ## for each encoder layer :

        ## if multi-head self attention (heads 1, 2, 3, ..., n) :
    
        for each token in t :
            Q, K, V for heads 1, 2, 3, ..., n = linear projections of the input vectors
        
        for each pair of tokens (i, j) :
            attention score for heads 1, 2, 3, ..., n = dot product (Q[i], K[j]) / sqrt (d_head = d_model / n)

        for each token i :
            normalize the attention scores over j with softmax
            attention output for heads 1, 2, 3, ..., n = sum over j (softmax_score[i][j] * V[j])

        concatenate the n head outputs, then apply the output linear projection

            
        ## if single-head self attention block :

        for each token in t :
            Q, K, V = linear projections of the input vectors
        
        for each pair of tokens (i, j) :
            attention score = dot product (Q[i], K[j]) / sqrt (d_model = size of the last axis of the input vectors)

        for each token i :
            normalize the attention scores over j with softmax
            attention output = sum over j (softmax_score[i][j] * V[j])


        ## residual connection 
    
        attention output = layer_norm(input + attention output)


        ## feed forward
        
        for each token in attention output :
            - expand the dimension (e.g. 4x) : h1 = linear(attention output)
            - keep only the important information in that token : a = relu(h1)
            - compress back to the original dimension : h2 = linear(a)


        ## residual connection
    
        output = layer_norm(attention output + h2)

    
        ## Output
    
        output becomes the input vector of the next encoder layer; repeat the process from the linear projection to the output.


### OUTPUT - STATE

    ## output hidden state
    
    Return:
    - Output hidden vectors for each token
    
    ## output pooled state
    
    Return:
    - [CLS] token output for classification tasks (if needed)


---

## SIMPLE PYTHON IMPLEMENTATION

### STRUCTURE


| Puzzle | Module          | Contents                                                                                     |
| :----: | :-------------- | :------------------------------------------------------------------------------------------- |
|    1   | Basic Functions | - Matrix multiplication (manual) <br> - Softmax Activation Function (manual) <br> - Layer normalization (manual) |
|    2   | Embedding       | - Word Embedding Lookup table <br> - Positional Encoding Lookup table                                     |
|    3   | Self Attention  | - Linear projection (Q, K, V) of input <br> - Attention score calculation <br> - Output V             |
|    4   | Feed Forward    | - Linear 1 (expand), ReLU, Linear 2 (compress again)                                                |
|    5   | Encoder Layer   | - Residual connection + Norm after Attention <br> - Residual connection + Norm after FFN |


### BASIC FUNCTION

In [1]:
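# Matrix Multiplication Function

import numpy as np

def matmul(x, y):
    # np.matmul batches over the leading axes. np.dot does not: for 3-D inputs
    # it turns (1, 3, 10) x (1, 10, 3) into a (1, 3, 1, 3) array.
    return np.matmul(x, y)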

In [2]:
# Softmax Activation Function

def softmax(x, axis=-1):
    # Subtracting the max leaves the result unchanged but keeps exp() from overflowing.
    x_max = np.max(x, axis=axis, keepdims=True)
    e_x = np.exp(x - x_max)
    
    return e_x / np.sum(e_x, axis=axis, keepdims=True)
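
To see why the max is subtracted, the sketch below compares the function above with a naive version on large scores. `naive_softmax` is a hypothetical helper written only for this comparison.

```python
import numpy as np

def softmax(x, axis=-1):
    x_max = np.max(x, axis=axis, keepdims=True)
    e_x = np.exp(x - x_max)
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def naive_softmax(x, axis=-1):
    # Same formula without the max shift (for comparison only).
    e_x = np.exp(x)
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

scores = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(scores)   # exp(1000) overflows to inf, and inf / inf is nan

stable = softmax(scores)            # same as softmax([0, 1, 2]) after shifting by the max

print(naive)
print(stable)
```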

In [3]:
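# Layer Normalization Function (without learnable scale and shift)

def norm(x):
    # Normalize every token vector (last axis) to zero mean and unit variance.
    eps = 1e-6  # guards against division by zero
    avg = np.mean(x, axis=-1, keepdims=True)
    std = np.std(x, axis=-1, keepdims=True)

    return (x - avg)/(std + eps)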

### EMBEDDING

In [4]:
# Word Embedding Lookup Table

class wordEmbedding:
    def __init__(self, vocab_size, d_model):
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.embedding_table = np.random.rand(vocab_size, d_model)*0.01

    def forward(self, token_id):
        return self.embedding_table[token_id]

In [5]:
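# Positional Encoding Lookup Table
# Fixed sinusoidal encoding from the original Transformer; BERT itself learns its position embeddings.

class positionalEncoding:
    def __init__(self, seq_len, d_model):
        # d_model must be even so the sin and cos columns pair up
        self.seq_len = seq_len
        self.d_model = d_model

        position = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)
        div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))  # (d_model / 2,)

        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(position * div_term)  # even columns
        pe[:, 1::2] = np.cos(position * div_term)  # odd columns

        self.embedding_table = pe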

    def forward(self, position_token_id):
        return self.embedding_table[position_token_id]
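
The `div_term` trick in the class above is a vectorized form of the sinusoidal formula PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model)). The sketch below recomputes the table entry by entry and checks that both forms agree; the sizes are the ones used in the embedding example.

```python
import numpy as np

seq_len, d_model = 5, 10

# Vectorized form: exp(2i * -ln(10000) / d_model) == 10000 ** (-2i / d_model)
position = np.arange(seq_len)[:, np.newaxis]
div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(position * div_term)
pe[:, 1::2] = np.cos(position * div_term)

# Direct form of the formula, one entry at a time.
pe_direct = np.zeros((seq_len, d_model))
for pos in range(seq_len):
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe_direct[pos, 2 * i] = np.sin(angle)
        pe_direct[pos, 2 * i + 1] = np.cos(angle)

print(np.allclose(pe, pe_direct))
print(pe[0])  # position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
```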

In [12]:
# EMBEDDING EXAMPLE

vocab_size = 100
d_model = 10
max_seq = 5

word_embed = wordEmbedding(vocab_size, d_model)
position_embed = positionalEncoding(max_seq, d_model)

corpus = [['i', 'like', 'cats'], ['he', 'hates', 'dogs']]

tokens = np.array([
    [5, 8, 20],   # Sentence 1
    [6, 9, 25]    # Sentence 2
])  # shape: (2, 3)

positions = np.array([0, 1, 2])  # for all sentences, same positions


word_vecs = word_embed.forward(tokens)
pos_vecs = position_embed.forward(positions)

final_input = word_vecs + pos_vecs
print(final_input.shape)

(2, 3, 10)


### SCALED DOT PRODUCT ATTENTION

In [8]:
# Linear projection (Q, K, V)
# Attention score calculation
# Output V

class selfAttention:
    def __init__(self, d_model):
        self.d_model = d_model

        scale = np.sqrt(2.0 / d_model)
        self.w_q = np.random.rand(d_model, d_model) * scale
        self.w_k = np.random.rand(d_model, d_model) * scale
        self.w_v = np.random.rand(d_model, d_model) * scale

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        # np.matmul batches over the leading axis; np.dot would add extra axes here.
        q = np.matmul(x, self.w_q)
        k = np.matmul(x, self.w_k)
        v = np.matmul(x, self.w_v)

        k_t = np.transpose(k, (0, 2, 1))  # (batch, d_model, seq_len)

        attn_scores = np.matmul(q, k_t)/np.sqrt(self.d_model)  # (batch, seq_len, seq_len)
        attn_probs = softmax(attn_scores, axis=-1)
        attn_output = np.matmul(attn_probs, v)                 # (batch, seq_len, d_model)

        return attn_output

In [11]:
# SCALED DOT PRODUCT ATTENTION EXAMPLE : 3 WORDS, 10 DIMENSIONS

x = np.random.randn(1, 3, 10)

self_attn = selfAttention(d_model=10)

out = self_attn.forward(x)

print(out.shape)

(1, 3, 10)


### MULTIHEAD ATTENTION

In [13]:
# Multihead Attention

class multiheadAttention():
    def __init__(self, d_model, num_heads):
        self.d_model = d_model
        self.num_heads = num_heads
        self.heads_d = d_model // num_heads

        assert d_model % num_heads == 0

        # weights for Q, K, V and the output projection
        scale = np.sqrt(2.0 / d_model)
        self.w_q = np.random.rand(d_model, d_model) * scale
        self.w_k = np.random.rand(d_model, d_model) * scale
        self.w_v = np.random.rand(d_model, d_model) * scale
        self.w_o = np.random.rand(d_model, d_model) * scale

    def forward(self, x):
        batch_size, seq_len, _ = x.shape

        # Q, K, V
        q = np.matmul(x, self.w_q)
        k = np.matmul(x, self.w_k)
        v = np.matmul(x, self.w_v)

        # split into multiple head
        q = q.reshape(batch_size, seq_len, self.num_heads, self.heads_d)
        k = k.reshape(batch_size, seq_len, self.num_heads, self.heads_d)
        v = v.reshape(batch_size, seq_len, self.num_heads, self.heads_d)

        # move the head axis forward: (batch, num_heads, seq_len, heads_d)
        q = q.transpose(0,2,1,3)
        k = k.transpose(0,2,1,3)
        v = v.transpose(0,2,1,3)

        k_t = np.transpose(k, (0, 1, 3, 2))

        # self attention per head
        attn_scores = np.matmul(q, k_t) / np.sqrt(self.heads_d)
        attn_probs = softmax(attn_scores, axis=-1)
        attn_output = np.matmul(attn_probs, v)

        # concatenate all heads back into one vector per token
        attn_output = attn_output.transpose(0,2,1,3)
        attn_output = attn_output.reshape(batch_size, seq_len, self.d_model)

        # linear projection 
        output = np.matmul(attn_output, self.w_o)

        return output
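
The reshape and transpose calls in the class above only rearrange numbers; they do not mix them. A small sketch with the sizes from the multi-head example shows which slice of a token each head sees, and that merging the heads restores the original layout.

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 4, 8, 4
heads_d = d_model // num_heads   # 2 features per head

x = np.arange(batch_size * seq_len * d_model, dtype=float)
x = x.reshape(batch_size, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, heads_d)
heads = x.reshape(batch_size, seq_len, num_heads, heads_d).transpose(0, 2, 1, 3)

# Head h of token t holds features [h * heads_d : (h + 1) * heads_d] of that token.
print(heads[0, 1, 2], x[0, 2, heads_d:2 * heads_d])

# Merge: undo the transpose, then flatten (num_heads, heads_d) back into d_model.
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)
print(np.array_equal(merged, x))
```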

In [14]:
# MULTIHEAD EXAMPLE : 2 sentences, 4 words, d_model = 8, num_heads = 4 (4 contexts per word)

batch_size = 2
seq_len = 4
d_model = 8
num_heads = 4

x = np.random.randn(batch_size, seq_len, d_model)  # (2, 4, 8)

mha = multiheadAttention(d_model, num_heads)

output = mha.forward(x)

print(f"input : {x}")
print()
print(f"shape of input : {x.shape}")
print()

batch_size, seq_len, d_model = x.shape

print(f"batch_size = sentences : {batch_size}")
print(f"seq_len = words : {seq_len}")
print(f"d_model = dimension : {d_model}")


expected_shape = (batch_size, seq_len, d_model)
actual_shape = output.shape

print()
print(f"output : {output}")
print()
print(f"expected_shape : {expected_shape}")
print()

print(f"actual_shape : {actual_shape}")

input : [[[ 0.36672737 -0.20878601 -0.93312864  0.55049618  0.61309127
    0.59738567  0.97634232  0.46390877]
  [ 1.54528626 -1.18411235 -0.50366323  0.55005004 -0.86921243
    1.8964261  -1.41468836 -0.15893344]
  [-2.1778084   0.19036908 -0.02444217  1.12669448 -0.41743285
    0.76707763  0.87854491  0.32731391]
  [-0.0439196   1.21966191 -0.31403431 -0.87529854  1.49784223
   -0.04638034  0.2728852  -0.75194285]]

 [[-0.54282411 -0.53131012  0.04423507 -0.76264487  0.31901532
    0.4806683   0.16869729  1.01144944]
  [-3.65499794 -0.03117673 -1.67538619  0.06720851  1.30975063
   -0.60280177  0.55697473 -0.72618966]
  [ 0.38107161  0.59713013  0.06308     0.26364099  1.51547201
    2.52786876  0.05574878 -1.2881275 ]
  [ 1.12629148  0.62244861 -0.3588689   0.53430038  0.33699067
    1.57690903 -0.27540016  0.14639957]]]

shape of input : (2, 4, 8)

batch_size = sentences : 2
seq_len = words : 4
d_model = dimension : 8

output : [[[ 0.39168056  0.58393131  0.67286276  0.47773785  0.

### FEED FORWARD NEURAL NETWORKS (FFN)

In [15]:
# Feed Forward Neural Network

class feedForward:
    def __init__(self, d_model, d_ff):
        scale1 = np.sqrt(2.0 / d_model)
        scale2 = np.sqrt(2.0 / d_ff)
        
        self.w1 = np.random.rand(d_model, d_ff) * scale1
        self.b1 = np.zeros((d_ff,))
        self.w2 = np.random.rand(d_ff, d_model) * scale2
        self.b2 = np.zeros((d_model,))


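    def gelu(self, x):
        # tanh approximation of GELU, the activation BERT uses in place of ReLU
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * np.power(x, 3))))
        

    def forward(self, x):
        x_expanded = np.matmul(x, self.w1) + self.b1   # expand: (..., d_model) -> (..., d_ff)
        x_act = self.gelu(x_expanded)
        output = np.matmul(x_act, self.w2) + self.b2   # compress: (..., d_ff) -> (..., d_model)

        return output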
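
The `gelu` method above uses the tanh approximation. As a rough check, the sketch below compares it with the exact definition GELU(x) = x · Φ(x), where Φ is the standard normal CDF, and with ReLU. The helper names `gelu_tanh`, `gelu_exact` and `relu` are chosen for this sketch only.

```python
import math
import numpy as np

def gelu_tanh(x):
    # Same tanh approximation as the gelu method above.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * np.power(x, 3))))

def gelu_exact(x):
    # Exact GELU via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1 + erf(x / math.sqrt(2)))

def relu(x):
    return np.maximum(x, 0.0)

xs = np.linspace(-4.0, 4.0, 81)
max_gap = np.max(np.abs(gelu_tanh(xs) - gelu_exact(xs)))
print(f"largest gap between approximation and exact GELU: {max_gap:.6f}")

# Unlike ReLU, GELU lets small negative inputs through with a small negative value.
print(relu(np.array([-1.0])), gelu_tanh(np.array([-1.0])))
```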

In [17]:
# FFN EXAMPLE

d_model = 4
d_ff = 64
batch_size = 2
seq_len = 3

# dummy input (batch_size, seq_len, d_model)
x = np.random.rand(batch_size, seq_len, d_model)

# Initialize the feed-forward network
ff = feedForward(d_model, d_ff)

# Forward pass token by token; ff.forward(x) on the whole batch computes the same values
out = np.array([[ff.forward(token) for token in sample] for sample in x])
    
print()
print("Input shape:", x.shape)
print("Output shape:", out.shape)
print("Output:\n", out)


Input shape: (2, 3, 4)
Output shape: (2, 3, 4)
Output:
 [[[4.20066293 4.05456212 4.08290323 3.76315567]
  [2.23218866 2.17856967 2.21050946 2.08578322]
  [4.49221826 4.23657331 4.28514247 3.9508369 ]]

 [[1.1587879  1.06382888 1.06665058 0.97644287]
  [2.81090316 2.74204152 2.77760206 2.52129341]
  [3.87867987 3.65292627 3.68881106 3.43366131]]]


### MULTI-STACKS ENCODER

In [19]:
# STACK / BLOCK BERT ENCODER

class bertBlock:
    def __init__(self, d_model, num_heads, d_ff):
        self.mha = multiheadAttention(d_model, num_heads)
        self.ffn = feedForward(d_model, d_ff)

    def forward(self, x):
        # multi-head self attention
        x_mha = self.mha.forward(x)

        # add and norm (residual around the attention sublayer)
        x_norm1 = norm(x + x_mha)

        # feed forward
        x_ffn = self.ffn.forward(x_norm1)

        # add and norm (residual around the feed-forward sublayer)
        x_norm2 = norm(x_norm1 + x_ffn)

        return x_norm2

In [20]:
# MULTI STACKS / BLOCK BERT ENCODER MODEL

class bertModel:
    def __init__(self, vocab_size, seq_len, d_model, num_heads, d_ff, n_stacks):
        self.word_embedding = wordEmbedding(vocab_size, d_model)
        self.positional_embedding = positionalEncoding(seq_len, d_model)
        self.bert_blocks = [bertBlock(d_model, num_heads, d_ff) for _ in range(n_stacks)]

    def forward(self, token_id, position_token_id=None):
        batch_size, seq_len = token_id.shape

        if position_token_id is None:
            position_token_id = np.tile(np.arange(seq_len), (batch_size, 1))

        word_embedding = self.word_embedding.forward(token_id)
        positional_embedding = self.positional_embedding.forward(position_token_id)

        x = word_embedding + positional_embedding

        for block in self.bert_blocks:
            x = block.forward(x)

        return x

---

### MODEL USAGE EXAMPLE

In [22]:
# Hyperparameters
vocab_size = 3000
seq_len = 512
d_model = 768
num_heads = 12
d_ff = 3072
n_stacks = 12
    
# Create model instance
model = bertModel(vocab_size, seq_len, d_model, num_heads, d_ff, n_stacks)
    
# Example input (batch_size=2, seq_len=6)
token_id = np.array([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]])
    
# Forward pass
output = model.forward(token_id)
print(f"Output shape: {output.shape}")

Output shape: (2, 6, 768)
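
For a sense of scale, the sketch below counts the weights that `bertModel` holds with the hyperparameters above. `count_parameters` is a hypothetical helper; it assumes the layout of the classes in this document (four bias-free attention matrices, two FFN matrices with biases, a fixed sinusoidal position table and a parameter-free `norm`).

```python
def count_parameters(vocab_size, d_model, d_ff, n_stacks):
    """Count the weights of the mini model; the fixed position table is not a parameter."""
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model            # w_q, w_k, w_v, w_o (no biases)
    ffn = 2 * d_model * d_ff + d_ff + d_model    # w1, w2, b1, b2
    per_block = attention + ffn
    return {
        "embedding": embedding,
        "per_block": per_block,
        "total": embedding + n_stacks * per_block,
    }

counts = count_parameters(vocab_size=3000, d_model=768, d_ff=3072, n_stacks=12)
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
```

The total is below BERT-base's roughly 110M parameters, mainly because the vocabulary here is much smaller and the position table and normalization carry no weights.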


---

## PROCESS OVERVIEW

| Step | Description                          | Status |
| :--: | :----------------------------------- | :----: |
|   1  | Build the Mini-BERT stack            |    ✅   |
|   2  | Pretraining (Masked LM + NSP)        |   🔜   |
|   3  | Fine-tuning on a specific task       |   🔜   |
|   4  | Build a dummy dataset for practice   |   🔜   |
|   5  | Build expert-level mindset & intuition |   🔜   |
