<a href="https://colab.research.google.com/github/Shubhamd13/NLP/blob/main/4_1_Pre_Training_Student_Copy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 01: Masked Language Modeling

Let's say we have the following two sentences:

$\text{Target Text: "I am a student of UC Riverside"}$

$\text{Input Text: "I am a <MASK> of UC Riverside"}$

Our goal is to train an encoder using masked language modeling objective. During training, we will feed the input text (with a masked word) into the encoder and see what it predicts at the masked position. Then, we will compare the prediction to the correct word ("student") to calculate the training loss.

To keep things simple, we will use a basic encoder architecture that follows this flow:

$
\text{Input Embedding} \rightarrow \text{Self-Attention} \rightarrow \text{Linear Projection to Vocabulary} \rightarrow \text{Loss Calculation}
$

We are skipping some of the usual components like MLP layers, residual connections, and layer normalization for simplicity.


## 1. Vocabulary, Input and Target token ids (Do not change)

In [None]:
import numpy as np
import math
import random
np.set_printoptions(precision=2, suppress=True)

words = ["staff", "I", "student", "<EOS>", "UC", "am", "a", "<MASK>", "Riverside", "of"]
vocab = {word: idx for idx, word in enumerate(words)}

target_tokens = ["I", "am", "a", "student", "of", "UC", "Riverside"]
input_tokens =  ["I", "am", "a", "<MASK>", "of", "UC", "Riverside"]

target_ids = [vocab[t] for t in target_tokens] # get target token ids from vocabulary
input_ids = [vocab[t] for t in input_tokens]   # get input token ids from vocabulary

print("Target Ids: ", target_ids)
print("Input Ids:  ", input_ids)

# Find the position of <MASK> token
mask_position = input_tokens.index("<MASK>")

# We only need to calculate loss for the target tokens at masked position, we do not want to calculate loss for other tokens and so we set -100
target_ids = [-100 if i != mask_position else tid for i, tid in enumerate(target_ids)]

Target Ids:  [1, 5, 6, 2, 9, 4, 8]
Input Ids:   [1, 5, 6, 7, 9, 4, 8]


## 2. Input Embedding Matrix (Do not change)

In [None]:
embedding_matrix = np.array([
    [0.1, 0.2, 0.3],     # I
    [0.4, 0.3, 0.1],     # am
    [1, 1, 1.],          # a
    [0., 0., 0.],        # <MASK>
    [0.1, 0.1, 1.0],     # of
    [0., 0., 0.],        # UC
    [0.5, 0.5, 0.5],     # Riverside
])

print(embedding_matrix.shape)

(7, 3)


## 3. Self Attention Calculation (Do not change)

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Fo simplicity, we assume -

$Q = Embed_{matrix} * 1.25$

$K = Embed_{matrix} * 1.88$

$V = Embed_{matrix} * 1.03$

In [None]:
# Calculate Q, K, V
Q = embedding_matrix * 1.25
K = embedding_matrix * 1.88
V = embedding_matrix * 1.03

# Calculate QK^T
attention_scores = Q @ K.T  # shape: (7, 7)

# Perform scaling: find the dimension of keys in the Key matrix and divide scores by the square root of the key dimension.
dk = K.shape[1]
scaled_attention_scores = attention_scores / np.sqrt(dk)

# Compute Softmax. Softmax converts raw attention scores into probabilities.
attention_weights = np.exp(scaled_attention_scores)
attention_weights /= np.sum(attention_weights, axis=1, keepdims=True)

# Multiply attention weights with value matrix
self_attention_output = attention_weights @ V
self_attention_output

## 4. Linear Projection to Vocabulary (Do not change)

For simplicity, here we assume that the encoder output is the self attention output.

Assume we have a weight matrix for the linear layer: $𝑊_{out} ∈ 𝑅^{3×10}$ → here, embedding size = 3 and vocab size = 10

$logits = \text{self attention}_{output} \; (dot \; product) \; W_{out} $


In [None]:
# Here, we use the encoder output to get logits for each word in the vocabulary
# We do this by using a linear layer.
# For each token, output will be logits over all the words in the vocabulary, since we have 10 words in the vocabulary, output shape will be the vocabulary size for each token
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

A = embedding_matrix.shape[1]
B = len(vocab)

V_W = np.random.randn(A, B) # weight matrix for the linear layer
logits = np.dot(self_attention_output, V_W)

print("Logits:\n", logits)

## 5. Calculate Cross Entropy Loss

The formula for binary cross-entropy loss is:

$\mathcal{L} = - \left[ y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y}) \right]$

Here,

$y$ = true label (either 0 or 1)

$\hat{y}$ =  model’s predicted probability

In our example, since "student" is the target word, we treat it like a positive class — so we set $y=1$.

So the loss becomes -

$\mathcal{L} = - \log(\hat{y})$

In [None]:
def cross_entropy_loss(logits, target_ids, ignore_index=-100):
    logits = np.array(logits)          # shape: [seq_len, vocab_size]
    target_ids = np.array(target_ids)  # shape: [seq_len]

    # Step 1: Apply softmax to get probabilities
    exp_logits = np.exp(logits)
    probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)

    # Step 2: Select probabilities of the target tokens (only where target != ignore_index)
    loss_values = []
    for i, target in enumerate(target_ids):
        if target == ignore_index:
            continue

        temp_prob = probs[i]

        ###<-- Write code here (Q4)
        temp_idx =                ## find the maximum probability index
        ###


        print("Target Word: ", words[target])
        print("Predicted Word: ", words[temp_idx])

        # Step 3: Calculate Loss
        p = temp_prob[target] # probability of the target token id


        ###<-- Write code here (Q5)
        loss =
        ###


        loss_values.append(loss)

    return sum(loss_values)

loss = cross_entropy_loss(logits, target_ids, ignore_index=-100)
print("Masked LM Loss:", loss)

# Part 02: Next Word Prediction

Let's assume that we have the following token sequences:

$\text{Input Tokens:  ["I", "am", "a", "student", "of", "UC", "Riverside"]}$

$\text{Target Tokens: ["am", "a", "student", "of", "UC", "Riverside", "<EOS>"]}$

Our goal is to train a decoder using the next word prediction (also called causal language modeling) objective. During training, we feed the input tokens into the decoder one by one, and at each step, the model tries to predict the next token in the sequence.

For example:

Given input $\text{"I"}$, the model should predict $\text{"am"}$.

Given $\text{"I am"}$, it should predict $\text{"a"}$ and so on.

The final target is the special token $\text{<EOS>}$ that marks the end of the sentence.

To keep things simple, we will use a basic decoder architecture that follows this flow:

$\text{Input Embedding → Masked Self-Attention → Linear Projection to Vocabulary → Loss Calculation}$


We apply a causal mask (lower triangular) in the self-attention layer to make sure the model only attends to previous tokens and not future ones. For simplicity, we are skipping components like MLP layers, residual connections, and layer normalization steps.

## 1. Vocabulary, Input and Target Token IDs (Do not change)

In [None]:
import numpy as np
import math
import random
np.set_printoptions(precision=2, suppress=True)

# Vocabulary construction
words = ["staff", "I", "student", "<EOS>", "UC", "am", "a", "<MASK>", "Riverside", "of"]
vocab = {word: idx for idx, word in enumerate(words)}
vocab_size = len(vocab)

input_tokens =  ["I", "am", "a", "student", "of", "UC", "Riverside"]
target_tokens = ["am", "a", "student", "of", "UC", "Riverside", "<EOS>"]

input_ids = [vocab[t] for t in input_tokens]
target_ids = [vocab[t] for t in target_tokens]

print("Input Ids:  ", input_ids)
print("Target Ids: ", target_ids)

seq_len = len(input_ids)

Input Ids:   [1, 5, 6, 2, 9, 4, 8]
Target Ids:  [5, 6, 2, 9, 4, 8, 3]


## 2. Input Embedding Matrix (Do not change)

In [None]:
embedding_matrix = np.array([
    [1., 0., 0.],     # I
    [0., 1., 0.],     # am
    [0., 0., 1.],     # a
    [1., 1., 0.],     # student
    [1., 0., 1.],     # of
    [0., 1., 1.],     # UC
    [0.5, 0.5, 0.5],  # Riverside
])

print(embedding_matrix.shape)

(7, 3)


## 3. Self Attention Calculation with Masking

$\textbf{Step 1. Query, Key, and Value Computation:}$

$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$

$\textbf{Step 2. Attention Scores:}$

$\text{Attention Scores} = \frac{Q K^T}{\sqrt{d}}$

$\textbf{Step 3. Causal Masking:}$

To enforce causal attention (only attending to past and current tokens), we apply a causal mask \( M \) to the attention scores matrix. The causal mask is defined as:

$ M_{ij} =
\begin{cases}
-\infty & \text{if } j > i \\
0 & \text{otherwise}
\end{cases}
$

The attention scores are updated by adding the mask:

$\text{Masked Attention Scores} = \text{Attention Scores} + M$

$\textbf{Step 4. Softmax:}$

$\text{Attention Weights} = \text{Softmax}(\text{Masked Attention Scores})$

$\textbf{Step 5. Weighted Sum of Values:}$

$\text{Attention Output} = \text{Attention Weights} \cdot V$

$\textbf{Final Formula:}$

$
\text{Attention Output} = \text{Softmax}\left( \frac{Q K^T}{\sqrt{d}} + M \right) \cdot V$


In [21]:
# Step 1: Calculate Q, K, V
Q = embedding_matrix * 1.25
K = embedding_matrix * 1.88
V = embedding_matrix * 1.03

# Step 2: Calculate Attention Scores
attn_scores = Q @ K.T
dk = K.shape[1]
scaled_attention_scores = attention_scores / np.sqrt(dk)
print("Attention Scores Without Causal Masking:\n\n", np.array(scaled_attention_scores))
print("="*50)

# Step 3a: Create Causal Mask only attend to past & current
causal_mask = []
for i in range(seq_len):
    row = []
    for j in range(seq_len):
        if j > i:
            row.append(float('-inf'))  # block future tokens
        else:
            ###<--- Write code here  (Q7: allow current & past tokens)
            row.append(0.0)

            ###

    causal_mask.append(row)
causal_mask = np.array(causal_mask)

print("Attention Mask:\n\n", np.array(causal_mask))
print("="*50)

# Step 3b: Apply Causal Mask to Attention Scores
masked_attn_scores = scaled_attention_scores + causal_mask
print("Attention Scores With Causal Masking:\n\n", np.array(masked_attn_scores))
print("="*50)

# Step 4: Softmax
def softmax(x):
    exp_logits = np.exp(x)
    probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
    return probs

attn_weights = softmax(masked_attn_scores)

# Step 5: Weighted Sum of Values
attention_output = attn_weights @ V
print("Final Attention Output:\n\n", np.array(attention_output))

Attention Scores Without Causal Masking:

 [[0.19 0.18 0.81 0.   0.45 0.   0.41]
 [0.18 0.35 1.09 0.   0.23 0.   0.54]
 [0.81 1.09 4.07 0.   1.63 0.   2.04]
 [0.   0.   0.   0.   0.   0.   0.  ]
 [0.45 0.23 1.63 0.   1.38 0.   0.81]
 [0.   0.   0.   0.   0.   0.   0.  ]
 [0.41 0.54 2.04 0.   0.81 0.   1.02]]
Attention Mask:

 [[  0. -inf -inf -inf -inf -inf -inf]
 [  0.   0. -inf -inf -inf -inf -inf]
 [  0.   0.   0. -inf -inf -inf -inf]
 [  0.   0.   0.   0. -inf -inf -inf]
 [  0.   0.   0.   0.   0. -inf -inf]
 [  0.   0.   0.   0.   0.   0. -inf]
 [  0.   0.   0.   0.   0.   0.   0.]]
Attention Scores With Causal Masking:

 [[0.19 -inf -inf -inf -inf -inf -inf]
 [0.18 0.35 -inf -inf -inf -inf -inf]
 [0.81 1.09 4.07 -inf -inf -inf -inf]
 [0.   0.   0.   0.   -inf -inf -inf]
 [0.45 0.23 1.63 0.   1.38 -inf -inf]
 [0.   0.   0.   0.   0.   0.   -inf]
 [0.41 0.54 2.04 0.   0.81 0.   1.02]]
Final Attention Output:

 [[1.03 0.   0.  ]
 [0.47 0.56 0.  ]
 [0.04 0.05 0.95]
 [0.52 0.52 0.26]


## 4. Linear Projection to Vocabulary (Do not change)

For simplicity, here we assume that the decoder output is the self attention output.

Assume we have a weight matrix for the linear layer: $𝑊_{out} ∈ 𝑅^{3×10}$ → here, embedding size = 3 and vocab size = 10

$logits = \text{self attention}_{output} \; (dot \; product) \; W_{out} $


In [None]:
# Here, we use the decoder output to get logits for each word in the vocabulary
# We do this by using a linear layer.
# For each token, output will be logits over all the words in the vocabulary, since we have 10 words in the vocabulary, output shape will be the vocabulary size for each token
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

A = embedding_matrix.shape[1]
B = len(vocab)

V_W = np.random.randn(A, B) # weight matrix for the linear layer
logits = np.dot(self_attention_output, V_W)

print("Logits:\n", logits)

## 5. Calculate Cross Entropy Loss

For each token position, the loss is -

$\mathcal{L} = - \log(\hat{y})$

In [22]:
def cross_entropy(pred_logits, target_index):
    exp_logits = np.exp(pred_logits)
    probs = exp_logits / np.sum(exp_logits, keepdims=True)

    ###<--- Write code here (Q9)
    log_prob = -np.log(probs[target_index])                 ## -log(probability of target index)
    ###

    return log_prob

losses = []
for i in range(seq_len):
    loss = cross_entropy(logits[i], target_ids[i])
    print(f"Token: '{words[input_ids[i]]}' → Target: '{words[target_ids[i]]}', Loss: {round(loss, 4)}")
    losses.append(loss)

total_loss = sum(losses) / len(losses)
print("\nTotal Loss:", round(total_loss, 4))

Token: 'I' → Target: 'am', Loss: 2.5084
Token: 'am' → Target: 'a', Loss: 2.6275
Token: 'a' → Target: 'student', Loss: 1.6803
Token: 'student' → Target: 'of', Loss: 2.6212
Token: 'of' → Target: 'UC', Loss: 3.5866
Token: 'UC' → Target: 'Riverside', Loss: 2.9099
Token: 'Riverside' → Target: '<EOS>', Loss: 3.5235

Total Loss: 2.7796
