### Translation Transformer Model:
- Following 
    - <a href="https://nlp.seas.harvard.edu/2018/04/03/attention.html" target="_blank">The Annotated Transformer</a>
    - <a href="https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853/" target="_blank">Transformers Explained Visually</a>

In [95]:
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax
import copy
import math
import os
import pandas as pd
import time
import numpy as np
import altair as alt        # for data visualization
from torch.optim.lr_scheduler import LambdaLR
# Use LambdaLR to control how the learning rate changes during training, for example to:
#   - warm up the learning rate for the first few epochs
#   - decay it over time
#   - use a custom schedule

PRINT = False
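
To illustrate the kind of function `LambdaLR` expects, here is a minimal sketch of a warmup-then-decay multiplier in the style of the schedule used by The Annotated Transformer. The name `rate` and the values `model_size=512` and `warmup=4000` are assumptions for illustration, not settings taken from this notebook.

```python
def rate(step, model_size=512, factor=1.0, warmup=4000):
    """Learning-rate multiplier: linear warmup, then inverse-square-root decay."""
    step = max(step, 1)  # guard against step 0, where 0 ** -0.5 would fail
    return factor * model_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The multiplier rises during warmup, peaks at step == warmup, then decays.
for s in (1, 1000, 4000, 16000):
    print(s, rate(s))

# With PyTorch, the function would be passed as lr_lambda, e.g.:
# scheduler = LambdaLR(optimizer, lr_lambda=lambda step: rate(step))
```

`LambdaLR` multiplies the optimizer's base learning rate by the returned value, so the base rate is typically set to 1.0 when the function returns the full learning rate as it does here.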

### Data Preparation

### Model Architecture
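- An encoder-decoder structure is the standard choice for translation: the source sentence is encoded once, and the target sentence is generated from that encoding.  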

- The encoder maps 
    - an input sequence of symbol representations $(x_1, ..., x_n)$ to 
    - a sequence of continuous representations $z = (z_1, ..., z_n)$.
- Given $z$, the decoder
    - generates an output sequence $(y_1, ..., y_m)$ of symbols one element at a time.
    - is auto-regressive: at each step it consumes the previously generated symbols as additional input when generating the next.

- NOTE:
    - The linear layer in MHA is identical to any other linear layer; what makes it “attention” is what you do with those projected Q, K, V matrices afterward:

    - **Splitting data across Attention heads:**
        - The important thing to understand is that this is a logical split only.  

        - The Query, Key, and Value are not physically split into separate matrices, one for each Attention head. A single data matrix is used for each of the Query, Key, and Value, with logically separate sections of the matrix for each Attention head. Similarly, there are not separate Linear layers, one for each Attention head. All the Attention heads share the same Linear layer but simply operate on their 'own' logical section of the data matrix: 
            - reshape to split the single matrix into per-head chunks
            - transpose so that the head axis comes before the sequence axis
            - for each head, perform scaled dot-product attention: scores, softmax weighting, then a weighted sum of the values.  
            - concatenate the per-head outputs back into one matrix
            - The attention weights of a single head and of multiple heads differ, so we genuinely get different attention patterns when splitting into multiple heads rather than using one big head (examples further below).  
            - Splitting logically across multiple heads means the separate sections of the embedding can learn different aspects of the meaning of each word. 
            - In $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\Bigl(\tfrac{QK^T}{\sqrt{d_k}}\Bigr)\,V$, the weights (the softmax of $QK^T/\sqrt{d_k}$) change from head to head, because each head computes them from its own slice of $Q$ and $K$. 
            - This allows the transformer to capture richer interpretations of the sequence. For example, the word 'men' carries both gender (male vs. female) and cardinality (singular vs. plural).

        - Think of it as ‘stacking together’ the separate layer weights for each head.  

        - The computations for all heads can therefore be done in a single matrix operation rather than N separate operations.   

        - This makes the computations more efficient and keeps the model simple because fewer Linear layers are required, while still achieving the power of the independent Attention heads.   

    - During training, we provide both the encoder and decoder inputs, and the model produces all the predicted outputs at once using teacher forcing.  

    - During inference, the encoder runs only once because the encoder input never changes. The decoder starts with the `<start>` symbol; each predicted token is appended to the decoder input and fed back into the decoder for the next step.  

    - The encoder's input and output have the same shape (the output becomes the input of the encoder-decoder cross-attention): [Batch, Sequence1, Dimension] --> [Batch, Sequence1, Dimension]  

    - The decoder's input and the output of its first self-attention layer have the same shape: [Batch, Sequence2, Dimension] --> [Batch, Sequence2, Dimension]

    - In the encoder–decoder “cross‐attention” sub-layer of the Transformer decoder, the three inputs are:
        - Query $Q\in\mathbb{R}^{B\times T_\text{dec}\times D}$: the decoder’s own hidden states (after self-attention + residual).
        - Key $K\in\mathbb{R}^{B\times T_\text{enc}\times D}$: the encoder’s final hidden states.
        - Value $V\in\mathbb{R}^{B\times T_\text{enc}\times D}$: again the encoder’s final hidden states.
        - $B$ = batch size
        - $T_\text{dec}$ = decoder sequence length
        - $T_\text{enc}$ = encoder sequence length
        - $D$ = model dimension  

    - The final output of the decoder has shape [Batch, Sequence2, Dimension], the same as the decoder's input.  
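
The "logical split" described in the notes can be checked numerically. The sketch below uses NumPy with random data (the names `x`, `W_q`, and the small sizes are assumptions for illustration) and compares one shared projection followed by a reshape against a separate projection per head built from column slices of the same weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, h = 3, 4, 2
d_k = d_model // h

x = rng.standard_normal((seq_len, d_model))    # one sequence, no batch axis
W_q = rng.standard_normal((d_model, d_model))  # one shared projection for all heads

# (a) One big matmul, then a logical split: (seq_len, d_model) -> (h, seq_len, d_k)
q_all = (x @ W_q).reshape(seq_len, h, d_k).transpose(1, 0, 2)

# (b) A separate, smaller projection per head, using column slices of W_q
q_per_head = np.stack([x @ W_q[:, i * d_k:(i + 1) * d_k] for i in range(h)])

print(np.allclose(q_all, q_per_head))
```

Both routes give the same per-head queries, which is why a single `nn.Linear(d_model, d_model)` is enough. Note that `nn.Linear` stores its weight transposed, so there the per-head slices are rows of `weight` rather than columns.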




In [32]:
class EncoderDecoder(nn.Module):
    '''A standard Encoder-Decoder architecture. Base for this and many other models.'''
     
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        # python 3
        # super().__init__()
        # python 2
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator
    
    # Two masks are needed because the encoder and decoder inputs are masked for different reasons.
    def forward(self, src, tgt, src_mask, tgt_mask):
        # src_mask (on the encoder input) hides padding tokens, which are added so that all sequences in a batch have the same length.
        # tgt_mask (on the decoder input) hides padding and also stops each position from attending to later positions,
        # so the model cannot cheat by seeing the word it is supposed to predict (a lower-triangular mask).
        # NOTE: this calls the encode and decode methods defined below; decode takes its arguments in a different order (memory, src_mask, tgt, tgt_mask).
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)
    
    def encode(self, src, src_mask):
        if PRINT:
            print(f'in EncoderDecoder before embedding(Encoder): src: {src.shape}; src_mask: {src_mask.shape}')
        return self.encoder(self.src_embed(src), src_mask)
    
    def decode(self, memory, src_mask, tgt, tgt_mask):
        if PRINT:
            print(f'in EncoderDecoder before embedding(Decoder): tgt: {tgt.shape}; tgt_mask: {tgt_mask.shape}')
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

In [33]:
class Generator(nn.Module):
    ''' Standard linear + softmax generation step '''
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)
    
    def forward(self, x):
        return log_softmax(self.proj(x), dim=-1)

In [34]:
def clones(module, N):
    # Produce N identical layers
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


def attention(query, key, value, mask=None, dropout=None):
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = torch.softmax(scores, dim=-1)

    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn


def subsequent_mask(size):
    attn_shape = (1, size, size)
    # torch.triu(..., diagonal=k) keeps the upper-triangular part of the matrix:
    #   diagonal=0 includes the main diagonal,
    #   diagonal=1 starts one step above the main diagonal,
    #   diagonal=-1 starts one step below the main diagonal.
    # With diagonal=1 and size=5 the result is
    # [[[0, 1, 1, 1, 1],
    #   [0, 0, 1, 1, 1],
    #   [0, 0, 0, 1, 1],
    #   [0, 0, 0, 0, 1],
    #   [0, 0, 0, 0, 0]]]
    # .type(torch.uint8) stores this as integers: 0 marks an allowed position, 1 marks a future (masked) position.
    # This function returns the inverse (True = allowed), and attention() later applies it with
    #   scores = scores.masked_fill(mask == 0, -1e9)
    # which replaces the scores at masked positions with a large negative number, so softmax gives them a weight of ~0.
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(torch.uint8)

    # Return a boolean mask tensor where each element is True if the corresponding element in subsequent_mask is 0, and False if it’s 1.
    # tensor([[[0, 1, 1],
    #          [0, 0, 1],
    #          [0, 0, 0]]], dtype=torch.uint8)
    # would return
    # tensor([[[ True, False, False],
    #          [ True,  True, False],
    #          [ True,  True,  True]]])
    return subsequent_mask == 0

In [49]:
class MultiHeadedAttention(nn.Module):
    def __init__(self, h:int, d_model:int, dropout = 0.1):
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        # NOTE: 4 linear layers are Q,K,V, and OUTPUT projection
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(dropout)

    def forward(self, query:torch.Tensor, key:torch.Tensor, value:torch.Tensor, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        if PRINT:
            print('In MHA:')
            for i, layer in enumerate(self.linears):
                print(f"Layer {i}: in_features = {layer.in_features}, out_features = {layer.out_features}")
                print(f'layer {i} weights: ', layer.weight)
        
        if PRINT:
            print(f'QKV shape BEFORE split: {query.shape}, {key.shape}, {value.shape}')
        # MEAT AND POTATOES!!! 
        # self.linears is a list of four nn.Linear(d_model, d_model) layers; here we only consume the first three in the comprehension (for Q, K, V)
        # zip pairs them up as
        # 1. (l = W^Q, x = query)
        # 2. (l = W^K, x = key)
        # 3. (l = W^V, x = value)
        # l(x): Applies the linear projection
        # .view reshapes each projection from (batch_size, seq_len, d_model) --> (batch_size, seq_len, num_heads, d_k):
        #   nbatches = batch_size.
        #   -1 tells PyTorch to infer that dimension; it becomes seq_len.
        #   self.h = number of heads.
        #   self.d_k = d_model // num_heads.
        # .transpose(1, 2) swaps dimensions 1 and 2 so that the head axis comes before the sequence axis:
        # from (batch_size, seq_len, num_heads, d_k) --> (batch_size, num_heads, seq_len, d_k)
        query, key, value = [
            l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2) for l, x in zip(self.linears, (query, key, value))
        ]
        if PRINT:
            print(f'QKV shape AFTER split: {query.shape}, {key.shape}, {value.shape}')
            print(f'query: {query}')
            print(f'key: {key}')
            print(f'value: {value}')

        # .view() + .transpose() carves one matrix into h separate “heads” of size d_k each
        # each head’s chunk is used to compute its own attention scores
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)

        x = (x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k))

        del query
        del key
        del value
        mha_return_val = self.linears[-1](x)
        if PRINT:
            print('MHA return val shape:', mha_return_val.shape)
            print(f'MHA return val:', mha_return_val)
        return mha_return_val

### Understand the Multi-Head Attention:


- Say seq_len = 3 and head_dim = 3. The attention-weight matrix is seq_len x seq_len, so 3 x 3 here:  

- $\begin{bmatrix} 0.7 & 0.2 & 0.1 \\ 0.8 & 0.1 & 0.1 \\ 0.3 & 0.3 & 0.4 \end{bmatrix}$

- Each row of this matrix says: “How much attention this token pays to all tokens in the sequence (including itself).”  
- Row 0 → Token 0 pays 70% attention to itself, 20% to token 1, 10% to token 2
- Row 1 → Token 1 pays 80% attention to token 0, 10% to itself, 10% to token 2
- Each row sums to 1, like a probability distribution over all tokens in the sequence.
- When the weights are multiplied by the Value matrix, each weight scales the corresponding token's value vector: token 1 has weight 0.8 on token 0, so 0.8 multiplies token 0's row of V, and the output for token 1 is the weighted sum of all rows of V. 
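
The last point can be made concrete with a small NumPy sketch. The weight matrix is the one shown above; the value matrix `V` is a made-up example chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

weights = np.array([[0.7, 0.2, 0.1],
                    [0.8, 0.1, 0.1],
                    [0.3, 0.3, 0.4]])
V = np.array([[1.0, 0.0],    # value vector of token 0 (hypothetical)
              [0.0, 1.0],    # value vector of token 1 (hypothetical)
              [1.0, 1.0]])   # value vector of token 2 (hypothetical)

out = weights @ V

# Row 1 of the output is a weighted mix of all three value vectors,
# using row 1 of the weight matrix as the mixing coefficients.
manual_row1 = 0.8 * V[0] + 0.1 * V[1] + 0.1 * V[2]
print(out[1], manual_row1)
```

Both printed vectors are approximately `[0.9, 0.2]`: token 1's output is dominated by token 0's value vector because that is where most of its attention goes.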

In [15]:
import torch
import math

# Define Q, K
Q = torch.tensor([[[1.,2.,3.,4.],
                   [5.,6.,7.,8.]]])  # (1, 2, 4)
K = torch.tensor([[[2.,1.,0.,-1.],
                   [4.,3.,2.,1.]]])  # (1, 2, 4)

# 1) Single-head (d_model=4)
scores_single = torch.matmul(Q, K.transpose(-2,-1)) / math.sqrt(4)
attn_single  = torch.softmax(scores_single, dim=-1)

print("Single-head raw scores:\n", scores_single)
print("Single-head softmax:\n", attn_single)

# 2) Multi-head (h=2, d_k=2)
d_model, h = 4, 2
d_k = d_model // h

# reshape & transpose into heads
Qh = Q.view(1, 2, h, d_k).transpose(1,2)  # (1, 2, 2, 2)->(1, h, 2, d_k)
Kh = K.view(1, 2, h, d_k).transpose(1,2)

scores_heads = torch.matmul(Qh, Kh.transpose(-2,-1)) / math.sqrt(d_k)
attn_heads  = torch.softmax(scores_heads, dim=-1)

for i in range(h):
    print(f"\nHead {i} raw scores:\n{scores_heads[0,i]}")
    print(f"Head {i} softmax:\n{attn_heads[0,i]}")

Single-head raw scores:
 tensor([[[ 0., 10.],
         [ 4., 30.]]])
Single-head softmax:
 tensor([[[4.5398e-05, 9.9995e-01],
         [5.1091e-12, 1.0000e+00]]])

Head 0 raw scores:
tensor([[ 2.8284,  7.0711],
        [11.3137, 26.8701]])
Head 0 softmax:
tensor([[1.4166e-02, 9.8583e-01],
        [1.7537e-07, 1.0000e+00]])

Head 1 raw scores:
tensor([[-2.8284,  7.0711],
        [-5.6569, 15.5563]])
Head 1 softmax:
tensor([[5.0197e-05, 9.9995e-01],
        [6.1266e-10, 1.0000e+00]])


In [16]:
import numpy as np
# Set seed for reproducibility
np.random.seed(42)

# ========== 1) Clear Q, K, V ==========
Q = np.array([[
    [1, 2, 3, 4],
    [5, 6, 7, 8]
]])  # shape (1,2,4)
K = np.array([[
    [2, 1, 0, -1],
    [4, 3, 2, 1]
]])  # shape (1,2,4)
V = np.array([[
    [1, 0, 1, 0],
    [0, 1, 0, 1]
]])  # shape (1,2,4)

# ========== 2) Head dimensions ==========
d_model = 4
d_k     = 2
d_v     = 2
h = 2

# ========== 3) Helpers ==========
def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    scores  = np.matmul(q, k.transpose(0,2,1)) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return np.matmul(weights, v)

# ========== 4) Single-head projection weights ==========
W_Q0 = np.random.randn(d_model, h*d_k)
W_K0 = np.random.randn(d_model, h*d_k)
W_V0 = np.random.randn(d_model, h*d_v)

# Project Q, K, V and attend once
Q0 = Q @ W_Q0
K0 = K @ W_K0
V0 = V @ W_V0
out_single = scaled_dot_product_attention(Q0, K0, V0)
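out_single_t  = torch.from_numpy(out_single).float() 
# NOTE: this softmax is applied to the attention output, not to the scores,
# so attn_single is not the attention-weight matrix despite its name.
attn_single  = torch.softmax(out_single_t, dim=-1)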

# ========== 5) Split the projected Q0, K0, V0 into h heads ==========
Qh = Q0.reshape(2, h, d_k).transpose(1, 0, 2)  # (2, 2, 2)->(h, 2, d_k)
Kh = K0.reshape(2, h, d_k).transpose(1, 0, 2)
Vh = V0.reshape(2, h, d_k).transpose(1, 0, 2)
head_out = scaled_dot_product_attention(Qh, Kh, Vh)
head_out_t  = torch.from_numpy(head_out).float() 
attn_head  = torch.softmax(head_out_t, dim=-1)

# Head 1
W_Q1 = W_Q0[:, :d_k]
print("W_Q1: \n", W_Q1)
W_K1 = W_K0[:, :d_k]
print("W_K1: \n", W_K1)
W_V1 = W_V0[:, :d_v]
print("W_V1: \n", W_V1)
Q1, K1, V1 = Q @ W_Q1, K @ W_K1, V @ W_V1
out_head1  = scaled_dot_product_attention(Q1, K1, V1)
out_head1_t  = torch.from_numpy(out_head1).float() 
attn_head1  = torch.softmax(out_head1_t, dim=-1)

# Head 2
W_Q2 = W_Q0[:, d_k:]
print("W_Q2: \n", W_Q2)
W_K2 = W_K0[:, d_k:]
print("W_K2: \n", W_K2)
W_V2 = W_V0[:, d_v:]
print("W_V2: \n", W_V2)
Q2, K2, V2 = Q @ W_Q2, K @ W_K2, V @ W_V2
out_head2  = scaled_dot_product_attention(Q2, K2, V2)
out_head2_t  = torch.from_numpy(out_head2).float() 
attn_head2  = torch.softmax(out_head2_t, dim=-1)

# ========== 6) Print ==========
print("Single-Head output:\n", attn_single, "\n")
print("Multi-Head output:\n", attn_head, "\n")
print("Head 1 output:\n", attn_head1, "\n")
print("Head 2 output:\n", attn_head2, "\n")

W_Q1: 
 [[ 0.49671415 -0.1382643 ]
 [-0.23415337 -0.23413696]
 [-0.46947439  0.54256004]
 [ 0.24196227 -1.91328024]]
W_K1: 
 [[-1.01283112  0.31424733]
 [ 1.46564877 -0.2257763 ]
 [-0.54438272  0.11092259]
 [-0.60063869 -0.29169375]]
W_V1: 
 [[-0.01349722 -1.05771093]
 [ 0.2088636  -1.95967012]
 [ 0.73846658  0.17136828]
 [-1.47852199 -0.71984421]]
W_Q2: 
 [[ 0.64768854  1.52302986]
 [ 1.57921282  0.76743473]
 [-0.46341769 -0.46572975]
 [-1.72491783 -0.56228753]]
W_K2: 
 [[-0.90802408 -1.4123037 ]
 [ 0.0675282  -1.42474819]
 [-1.15099358  0.37569802]
 [-0.60170661  1.85227818]]
W_V2: 
 [[ 0.82254491 -1.22084365]
 [-1.32818605  0.19686124]
 [-0.11564828 -0.3011037 ]
 [-0.46063877  1.05712223]]
Single-Head output:
 tensor([[[0.0699, 0.0171, 0.0416, 0.8715],
         [0.0699, 0.0171, 0.0416, 0.8715]]]) 

Multi-Head output:
 tensor([[[0.8106, 0.1894],
         [0.8074, 0.1926]],

        [[0.0455, 0.9545],
         [0.0455, 0.9545]]]) 

Head 1 output:
 tensor([[[0.8106, 0.1894],
         [

In [71]:
# multi head attention example
# head=2; d_model=4
Q_t = torch.from_numpy(Q).float()
print("Q_t: \n", Q_t)
K_t = torch.from_numpy(K).float()
print("K_t: \n", K_t)
V_t = torch.from_numpy(V).float()
print("V_t: \n", V_t)
mha = MultiHeadedAttention(2, 4)
mha_output = mha(Q_t, K_t, V_t)

Q_t: 
 tensor([[[1., 2., 3., 4.],
         [5., 6., 7., 8.]]])
K_t: 
 tensor([[[ 2.,  1.,  0., -1.],
         [ 4.,  3.,  2.,  1.]]])
V_t: 
 tensor([[[1., 0., 1., 0.],
         [0., 1., 0., 1.]]])
In MHA:
Layer 0: in_features = 4, out_features = 4
layer 0 weights:  Parameter containing:
tensor([[ 0.1958, -0.3091,  0.0778,  0.1922],
        [ 0.3054,  0.3775,  0.1262,  0.3985],
        [-0.4576,  0.1862, -0.0031,  0.2265],
        [-0.2400,  0.4219,  0.2924,  0.0081]], requires_grad=True)
Layer 1: in_features = 4, out_features = 4
layer 1 weights:  Parameter containing:
tensor([[ 0.1958, -0.3091,  0.0778,  0.1922],
        [ 0.3054,  0.3775,  0.1262,  0.3985],
        [-0.4576,  0.1862, -0.0031,  0.2265],
        [-0.2400,  0.4219,  0.2924,  0.0081]], requires_grad=True)
Layer 2: in_features = 4, out_features = 4
layer 2 weights:  Parameter containing:
tensor([[ 0.1958, -0.3091,  0.0778,  0.1922],
        [ 0.3054,  0.3775,  0.1262,  0.3985],
        [-0.4576,  0.1862, -0.0031,  0.2265]

In [36]:
class LayerNorm(nn.Module):
    '''Construct a layernorm module (See citation for details).'''
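    # eps is a small constant added to the standard deviation so the division never hits zero
    # (for example when all features of a token are equal).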
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps
    
    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

### Normalization:

- Any technique that re-scales intermediate activations (the outputs of the network’s neurons) so that they have (typically) zero mean and unit variance.
    - zero mean: 
        - mean(average): $\mu = \frac{1}{n}\sum_{i=1}^n x_i$
        - shift data so that the average $\mu$ becomes 0.
    - unit variance: 
        - The variance measures how spread-out the values are around the mean: $\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2$
        - The standard deviation is $\sigma = \sqrt{\sigma^2}$
        - “Unit variance” means we scale our data so that $\sigma^2 = 1$ (equivalently, $\sigma = 1$):
            - on average, the squared deviation of a value from the mean is exactly one.  
            - no single dimension can dominate purely by having a larger numeric scale.  
            - it makes gradient updates more uniform across parameters, which stabilizes and speeds up training.  
    
    - Standardization:
        - To transform an arbitrary dataset (or activations) into zero mean and unit variance, we compute $\hat x_i = \frac{x_i - \mu}{\sigma}$
        - After this operation, the new values $\{\hat x_i\}$ satisfy $\frac1n\sum_i \hat x_i = 0$ and $\frac1n\sum_i (\hat x_i - 0)^2 = 1$
    - Benefits: 
        - Zero mean recenters your data around 0, so positive and negative signals balance out.
        - Unit variance ensures no feature “blows up” the scale—every dimension contributes similarly.
        - Stabilizing the learning dynamics by keeping activations in a predictable range
        - Smoothing the loss surface, which often speeds up convergence and allows for larger learning rates

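- Batch Norm: normalizes each feature across the examples in a batch, so its statistics depend on the batch.

- Layer Norm: normalizes across the features of each token, independently of the rest of the batch; this is what the `LayerNorm` class above does.

- Post Norm: applies the normalization after the residual addition, `norm(x + sublayer(x))`, as in the original Transformer paper.

- Pre Norm: applies the normalization to the sublayer input, `x + sublayer(norm(x))`; this is what `SublayerConnection` below does.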
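
The standardization formula in this section can be verified numerically. This is a minimal NumPy sketch on made-up data; it normalizes along the last axis, which is the axis the `LayerNorm` class above uses.

```python
import numpy as np

x = np.array([[2.0, 4.0, 6.0, 8.0],
              [1.0, 1.0, 3.0, 7.0]])   # (tokens, features), illustrative values

mu = x.mean(axis=-1, keepdims=True)
sigma = x.std(axis=-1, keepdims=True)  # population std (divides by n), as in the formula
x_hat = (x - mu) / sigma

print(x_hat.mean(axis=-1))             # close to 0 for every row
print((x_hat ** 2).mean(axis=-1))      # close to 1 for every row
```

One caveat: `torch.Tensor.std` uses the unbiased estimator (dividing by n - 1) by default, so the `LayerNorm` class above produces a variance slightly below one under the definition used here. The difference is small for large feature sizes.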

In [37]:
class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
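        # NOTE: sublayer is a callable: either a lambda wrapping multi-head attention (in the encoder and decoder layers) or the feed-forward module.
        # This is Pre-Layer-Norm: the input is normalized before the sublayer and the residual is added afterwards.
        # It is generally more stable when training very deep models; post-norm is more prone to vanishing gradients.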
        return x + self.dropout(sublayer(self.norm(x)))

In [38]:
class EncoderLayer(nn.Module):
    def __init__(self, size, self_attn:MultiHeadedAttention, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

In [39]:
class Encoder(nn.Module):
    def __init__(self, layer:EncoderLayer, N):
        super(Encoder, self).__init__()
        # The encoder is composed of a stack of N identical layers.
        # repeat the entire Encoder layer (multi-head attention, feedforward, layer_norm...) N times.
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
        
    def forward(self, x, mask):
        if PRINT:
            print(f'in Encoder after embedding: src:{x.shape}; src_mask: {mask.shape}')
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)


### encoder-decoder attention:
- the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.
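
A shape-level sketch of this cross-attention, in NumPy with random data. It deliberately leaves out the learned projections and the split into heads, and the sizes `B`, `T_dec`, `T_enc`, `D` are arbitrary assumptions; the point is only to show which sequence length ends up where.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

B, T_dec, T_enc, D = 2, 3, 5, 4
rng = np.random.default_rng(0)
decoder_states = rng.standard_normal((B, T_dec, D))  # source of the queries
memory = rng.standard_normal((B, T_enc, D))          # encoder output: keys and values

scores = decoder_states @ memory.transpose(0, 2, 1) / np.sqrt(D)  # (B, T_dec, T_enc)
weights = softmax(scores, axis=-1)                                # each row sums to 1
out = weights @ memory                                            # (B, T_dec, D)

print(scores.shape, out.shape)  # (2, 3, 5) (2, 3, 4)
```

The output keeps the decoder's sequence length, while the encoder's length only appears in the intermediate score matrix.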

### encoder attention & decoder attention:
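- All of the keys, values and queries come from the same place (the output of the previous layer).
- In the encoder, each position can attend to all positions in the previous layer; in the decoder, `tgt_mask` restricts each position to positions up to and including itself.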

In [40]:
class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn:MultiHeadedAttention, src_attn:MultiHeadedAttention, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)
    
    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

In [41]:
class Decoder(nn.Module):
    def __init__(self, layer:DecoderLayer, N):
        super(Decoder, self).__init__()
        # repeat the entire Decoder layer (2 multi-head attention, feedforward, layer_norm...) N times.
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        if PRINT:
            print(f'in Decoder after embedding: tgt: {x.shape}; encoder_output:{memory.shape}; encoder_output_mask: {src_mask.shape}; tgt_mask:{tgt_mask.shape}')
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

In [42]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        if PRINT:
            print(f'in FF: input shape: {x.shape}')
        ff_val = self.w_2(self.dropout(self.w_1(x).relu()))
        if PRINT:
            print(f'in FF: output shape: {ff_val.shape}')
        return ff_val

### What does Embedding layer do:

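- This embedding layer maps each token index to a d_model-dimensional vector.
- `self.lut(x)` looks up the embeddings for the token indices x; the result is then scaled by the constant factor $\sqrt{d_{model}}$. 
- The scaling balances magnitudes rather than normalizing: freshly initialized embedding values are small, and multiplying by $\sqrt{d_{model}}$ brings them to a scale comparable with the positional encodings (values in $[-1, 1]$) that are added next. This is a commonly given explanation; the original paper does not state a reason.  

- This is separate from the $\frac{1}{\sqrt{d_k}}$ factor inside dot-product attention, which keeps the scores from growing with $d_k$. Without that factor, the softmax in attention could saturate and produce very small gradients, especially at the beginning of training.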
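
A minimal sketch of the lookup-and-scale step, using a NumPy array as a stand-in for the `nn.Embedding` weight. The uniform initialization range imitates Xavier-uniform, which `make_model` applies to all weight matrices; the token IDs and sizes are illustrative assumptions.

```python
import math
import numpy as np

vocab, d_model = 11, 512
rng = np.random.default_rng(0)

# Stand-in for the embedding table: small values, similar in scale to Xavier-uniform init
bound = math.sqrt(6.0 / (vocab + d_model))
lut = rng.uniform(-bound, bound, size=(vocab, d_model))

tokens = np.array([[1, 2, 3]])         # (batch_size, seq_len)
emb = lut[tokens]                      # (batch_size, seq_len, d_model)
scaled = emb * math.sqrt(d_model)

print(emb.shape)
print(np.abs(emb).mean(), np.abs(scaled).mean())
```

Under these assumptions the unscaled entries are far smaller than 1 in magnitude, while the scaled ones are of order 1, which is the range of the sinusoidal positional encodings.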

In [43]:
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        # This is the look up table, retrieving vectors using token IDs.
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        # (batch_size, seq_len) -> (batch_size, seq_len, d_model)
        return self.lut(x) * math.sqrt(self.d_model)

### Why Positional Encoding:
- Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. 
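
One way to inject that position information is the fixed sinusoidal encoding, $PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. The NumPy sketch below builds the table for a small, arbitrary `max_len` and `d_model` (the function name `sinusoidal_pe` is an assumption for illustration, and `d_model` is assumed to be even).

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Return a (max_len, d_model) table of sinusoidal position encodings."""
    position = np.arange(max_len)[:, None]                  # (max_len, 1)
    # 10000^(-2i/d_model) for i = 0 .. d_model/2 - 1, computed in log space
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)               # even columns
    pe[:, 1::2] = np.cos(position * div_term)               # odd columns
    return pe

pe = sinusoidal_pe(max_len=50, d_model=8)
print(pe.shape)
print(pe[0])   # position 0: sin(0) = 0 and cos(0) = 1, alternating
```

Each position gets a distinct row, and the table is simply added to the scaled embeddings, so it needs no trainable parameters.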

In [44]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)

In [45]:
def make_model(src_vocab, tgt_vocab, N=6, d_model= 512, d_ff=2048, h=8, dropout=0.1):
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab)
    )
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    
    return model

### Testing the model

In [51]:
def inference_test():
    test_model = make_model(11, 11, 2)
    # In PyTorch, every neural network(subclasses of nn.Module) has two of the most important “modes”: training (.train()) and evaluation (.eval()).
    test_model.eval()
    # This constructs a tensor of 64-bit integers with shape (1, 10):
    #   1 is the batch size (one sequence).
    #   10 is the sequence length (ten tokens, with IDs 1 through 10).
    # The model feeds this through an nn.Embedding to turn each integer into a vector.
    src = torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    print(f'src: {src}')
    # create 1s of shape (1, 1, 10), mask value of 1 means to keep.
    src_mask = torch.ones(1, 1, 10)
    print(f'src_mask: {src_mask}')

    # the mask tells the attention mechanism which source positions to consider. All-ones means “fully attend to every token.”
    memory = test_model.encode(src, src_mask)
    print(f'memory: {memory}')
    # torch.zeros(1, 1) creates a new tensor of shape (1, 1) filled with 0.0; by default this is a floating-point tensor (dtype=torch.float32) on the CPU.
    # .type_as(src.data):
    #   - looks up the tensor type (dtype and device) of the tensor you pass in, here src.data,
    #   - casts (and, if needed, moves) the zeros tensor so it matches that type.
    #   - If src is a LongTensor on the GPU, ys becomes a LongTensor on the GPU; if src is a FloatTensor on the CPU, ys ends up the same, and so on.
    # ys holds the decoder input generated so far; it starts with token ID 0, used here as the start symbol.
    ys = torch.zeros(1, 1).type_as(src.data)
    print(f'ground truth: {ys}')

    for i in range(9):
        out = test_model.decode(memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data))
        print(f'loop {i} out: {out.shape}')
        # out is decoder’s hidden‐state tensor of shape (batch_size, seq_len, d_model).
        # out[:, -1] grabs the last time-step for each item in the batch, giving a tensor of shape (batch_size, d_model).
        # test_model.generator is typically a nn.Linear(d_model, vocab_size) (often followed by log-softmax), so prob ends up as (batch_size, vocab_size) containing the (log-)probabilities of every possible next token.
        prob = test_model.generator(out[:, -1])
        print(f'loop{i} prob: {prob}')
        
        # torch.max(..., dim=1) returns two things:
        #   - the maximum values (which we discard into _), and
        #   - the indices of those maxima along dim=1 (the vocabulary dimension).
        # So next_word is a tensor of shape (batch_size,) containing the predicted token IDs.
        _, next_word = torch.max(prob, dim=1)
        print(f'loop {i} next word: {next_word}')
        # .data[0] grabs the first element of the tensor as a zero-dim tensor.
        # Note: in modern PyTorch you’d usually write next_word = next_word.item(), which returns a plain Python number (this works because the batch size is 1).
        next_word = next_word.data[0]
        print(f'loop {i} next word: {next_word}')
        ys = torch.cat([ys, torch.empty(1, 1).type_as(src.data).fill_(next_word)], dim=1)
        print(f'loop {i} ground truth: {ys}')
    
    print("Example Untrained Model Prediction:", ys)

In [10]:
x = torch.randn(2, 3, 5)   # batch_size=2, seq_len=3, d_model=5
y = x[:, -1]               # index -1 selects the last time step of each sequence -> shape (2, 5)
print(x)
print(y.shape)
print(y)

tensor([[[ 0.0751, -1.3743, -0.6627, -1.3150, -0.1900],
         [ 0.6102,  0.5823,  0.9035,  0.3814,  0.5704],
         [-0.3351,  0.3250,  0.0986, -1.7480,  1.2289]],

        [[ 0.2573,  1.1233, -0.7100, -0.6179,  0.6549],
         [-0.0216, -0.2573,  0.0491, -0.1773, -1.9397],
         [-0.0931,  0.4979,  0.6578,  0.3862, -0.7730]]])
torch.Size([2, 5])
tensor([[-0.3351,  0.3250,  0.0986, -1.7480,  1.2289],
        [-0.0931,  0.4979,  0.6578,  0.3862, -0.7730]])


In [52]:
for _ in range (1):
    inference_test()

src: tensor([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]])
src_mask: tensor([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]])
memory: tensor([[[ 0.7866,  1.2446, -0.5330,  ..., -1.0366,  1.2295, -0.7082],
         [ 1.4092,  1.7337, -0.0820,  ...,  0.8748, -0.1855, -0.0618],
         [ 2.3700,  0.7851,  0.8545,  ..., -0.5111, -0.1885, -0.4453],
         ...,
         [ 1.0476,  0.7221, -0.1984,  ..., -0.8055, -0.9070, -0.1119],
         [ 1.5334,  1.2891, -1.2536,  ..., -0.3821, -0.7405,  1.7064],
         [ 1.7038,  0.5228,  1.3222,  ..., -0.2475,  0.2649,  0.6735]]],
       grad_fn=<AddBackward0>)
ground truth: tensor([[0]])
loop 0 out: torch.Size([1, 1, 512])
loop0 prob: tensor([[-3.5683, -3.5945, -3.4997, -3.4052, -1.9056, -1.9189, -1.6448, -2.0978,
         -1.5085, -3.1557, -5.1526]], grad_fn=<LogSoftmaxBackward0>)
loop 0 next word: tensor([8])
loop 0 next word: 8
loop 0 ground truth: tensor([[0, 8]])
loop 1 out: torch.Size([1, 2, 512])
loop1 prob: tensor([[-4.1903, -4.1803, -2.2895, -3

# Training

#### tools to train a standard encoder-decoder model
- Batch object: holds src and target sentences for training, construct masks
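
A small NumPy sketch of what such a batch object has to compute for the target side: the shifted decoder input and labels used for teacher forcing, and a mask that combines padding with the no-peeking constraint. The token IDs are made up, and `pad = 2` is assumed to match the default used by the `Batch` class below.

```python
import numpy as np

pad = 2
tgt = np.array([[1, 5, 7, 3, 2, 2]])   # one sentence, padded with pad=2 at the end

tgt_in = tgt[:, :-1]   # decoder input: everything except the last token
tgt_y = tgt[:, 1:]     # labels: the same sequence shifted left by one

T = tgt_in.shape[-1]
pad_mask = (tgt_in != pad)[:, None, :]             # (batch, 1, T): True for real tokens
causal = np.tril(np.ones((1, T, T), dtype=bool))   # True on and below the diagonal
tgt_mask = pad_mask & causal                       # (batch, T, T)

print(tgt_mask[0].astype(int))
print("ntokens:", int((tgt_y != pad).sum()))
```

Row `i` of `tgt_mask` marks the positions that decoder position `i` may attend to: earlier positions and itself, minus any padding. `ntokens` counts the non-padding labels, which is what the loss is normalized by.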

In [83]:
class Batch:
    def __init__(self, src, tgt=None, pad=2, device='cpu'):
        self.src = src
        # Build a mask over src to hide padding tokens
        # src = torch.LongTensor([[5, 7, 2, 3],
        #                         [4, 2, 2, 1]])
        # pad = 2
        # mask = (src != pad)
        # mask is now:
        # tensor([[ True,  True, False,  True],
        #         [ True, False, False,  True]])
        # .unsqueeze(-2) adds a dimension so src_mask is shape (batch_size, 1, src_len).
        self.src_mask = (src != pad).unsqueeze(-2).to(device)
        if tgt is not None:
            self.tgt = tgt[:, :-1].to(device)
            self.tgt_y = tgt[:, 1:].to(device)
            self.tgt_mask = self.make_std_mask(self.tgt, pad).to(device)
            self.ntokens = (self.tgt_y != pad).data.sum().to(device)
            
    # A static method belongs to the class rather than to an instance of the class.
    # class MyClass:
    #     @staticmethod
    #     def greet(name):    # No self or cls parameter
    #         return f"Hello, {name}!"
    # # Usage:
    # MyClass.greet("Alice")  # No need to create an instance
    @staticmethod
    def make_std_mask(tgt, pad):
        tgt_mask = (tgt != pad).unsqueeze(-2)
        tgt_mask = tgt_mask & subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data)
        return tgt_mask
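
To make the target-side bookkeeping concrete, here is a minimal sketch of what `Batch` computes for one padded sentence. It assumes `pad = 0`, and `causal_mask` is a hypothetical stand-in for the notebook's `subsequent_mask`, written with `torch.tril` so the example runs on its own:

```python
import torch

def causal_mask(size):
    # Hypothetical stand-in for subsequent_mask:
    # True where position i may attend to position j (j <= i).
    return torch.tril(torch.ones(1, size, size, dtype=torch.bool))

pad = 0
tgt = torch.tensor([[1, 5, 6, 0, 0]])      # one sentence, two padding tokens

tgt_in = tgt[:, :-1]    # decoder input:   [1, 5, 6, 0]
tgt_y = tgt[:, 1:]      # expected output: [5, 6, 0, 0]

pad_mask = (tgt_in != pad).unsqueeze(-2)            # shape (1, 1, 4)
tgt_mask = pad_mask & causal_mask(tgt_in.size(-1))  # shape (1, 4, 4)
ntokens = (tgt_y != pad).sum()                      # non-padding targets

print(tgt_mask.int())
print(ntokens.item())
```

The padding mask broadcasts over the rows of the causal mask, so every query position is blocked from both future positions and padding positions.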

#### generic training and scoring function to keep track of loss. 
- We pass in a generic loss compute function that also handles parameter updates. 

In [84]:
class TrainState:
    """Track number of steps, examples, and tokens processed"""
    
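    # This is not the same as using __init__ or @dataclass: these are class attributes that act as defaults.
    # An assignment through an instance (e.g. train_state.step += 1) creates an instance attribute that shadows the class default, so each object can still hold its own values.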
    step: int = 0           # Steps in the current epoch
    accum_step: int = 0     # Number of gradient accumulation steps
    samples: int = 0        # total number of examples used
    tokens: int = 0         # total number of tokens processed
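
A small sketch of how such class-level defaults behave when updated through an instance (`Counter` is a hypothetical class used only for illustration):

```python
class Counter:
    step: int = 0   # class attribute used as a default


a = Counter()
b = Counter()
a.step += 1         # reads Counter.step (0), then binds a.step = 1 on the instance

print(a.step, b.step, Counter.step)
assert (a.step, b.step, Counter.step) == (1, 0, 0)
```

Only `a` sees the update; `b` and the class itself still report the default.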

In [85]:
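def run_epoch(data_iter, model, loss_compute, optimizer, scheduler, mode="train", accum_iter=1, train_state=TrainState()):
    """Run one epoch; return (average loss per token, train_state)."""
    # NOTE: the default TrainState() is created once at definition time and shared by every call that omits train_state.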
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    n_accum = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(batch.src, batch.tgt, batch.src_mask, batch.tgt_mask)
        loss, loss_node = loss_compute(out, batch.tgt_y, batch.ntokens)
        # loss_node = loss_node / accum_iter
        if mode == "train" or mode == "train+log":
            loss_node.backward()
            train_state.step += 1
            train_state.samples += batch.src.shape[0]
            train_state.tokens += batch.ntokens
            if i % accum_iter == 0:
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)
                n_accum += 1
                train_state.accum_step += 1
            scheduler.step()

        total_loss += loss
        total_tokens += batch.ntokens
        tokens += batch.ntokens
        if i % 40 == 1 and (mode == "train" or mode == "train+log"):
            lr = optimizer.param_groups[0]["lr"]
            elapsed = time.time() - start
            print(
                (
                    "Epoch Step: %6d | Accumulation Step: %3d | Loss: %6.2f "
                    + "| Tokens / Sec: %7.1f | Learning Rate: %6.1e"
                )
                % (i, n_accum, loss / batch.ntokens, tokens / elapsed, lr)
            )
            start = time.time()
            tokens = 0
        del loss
        del loss_node
    return total_loss / total_tokens, train_state
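
The `accum_iter` argument controls gradient accumulation. The sketch below illustrates the general idea on a toy regression model; it is not a copy of `run_epoch`. It makes two assumptions of its own: the loss is divided by `accum_iter` (the corresponding line in `run_epoch` is commented out), and the update fires on `(i + 1) % accum_iter == 0` rather than `i % accum_iter == 0`, so the first update happens only after a full group of batches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_iter = 2
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]

n_updates = 0
optimizer.zero_grad(set_to_none=True)
for i, (x, y) in enumerate(batches):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_iter).backward()       # gradients add up across backward() calls
    if (i + 1) % accum_iter == 0:        # update once every accum_iter batches
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        n_updates += 1

print(n_updates)
```

With four batches and `accum_iter = 2`, the optimizer takes two steps, each based on the averaged gradient of two batches.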



In [86]:
def rate(step, model_size, factor, warmup):
    """
    Learning-rate factor for the warmup schedule.
    Step 0 is treated as step 1 because LambdaLR calls this function with step=0, and 0 ** (-0.5) would raise a ZeroDivisionError.
    """
    if step == 0:
        step = 1
    return factor * (model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5)))
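
As a quick check of the schedule's shape, the sketch below repeats the `rate` formula so it runs on its own and evaluates it before, at, and after the warmup boundary. The two branches of `min` are equal at `step == warmup`, which is where the rate peaks at $d_{model}^{-0.5} \cdot warmup^{-0.5}$.

```python
import math

def rate(step, model_size, factor, warmup):
    # Same formula as the cell above, repeated so this sketch is self-contained.
    if step == 0:
        step = 1
    return factor * (model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5)))

warmup = 4000
before = rate(warmup // 2, 512, 1.0, warmup)   # halfway through warmup (linear part)
peak = rate(warmup, 512, 1.0, warmup)          # both branches of min() agree here
after = rate(warmup * 4, 512, 1.0, warmup)     # inverse-square-root decay

print(f"before={before:.2e} peak={peak:.2e} after={after:.2e}")
assert before < peak and after < peak
assert math.isclose(peak, (512 * warmup) ** -0.5)
```

For `model_size=512` and `warmup=4000` the peak is roughly 7e-4; halving the step count during warmup, or quadrupling it afterwards, both halve the rate.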


In [None]:
def example_learning_schedule():
    opts = [
        [512, 1, 4000],  # example 1
        [512, 1, 8000],  # example 2
        [256, 1, 4000],  # example 3
    ]

    dummy_model = torch.nn.Linear(1, 1)
    learning_rates = []

    # we have 3 examples in opts list.
    for idx, example in enumerate(opts):
        # Create a fresh optimizer and scheduler for each example.
        # Adam (Adaptive Moment Estimation): an algorithm used to update the parameters (weights) of a neural network during training
        optimizer = torch.optim.Adam(dummy_model.parameters(), lr=1, betas=(0.9, 0.98), eps=1e-9)
        # Use LambdaLR to control how the learning rate changes during training, for example to:
        #   - warm up the learning rate for the first few steps
        #   - decay it over time
        #   - use a custom schedule
        lr_scheduler = LambdaLR(optimizer=optimizer, lr_lambda=lambda step: rate(step, *example))
        tmp = []
        # take 20K dummy training steps, save the learning rate at each step
        for step in range(20000):
            tmp.append(optimizer.param_groups[0]["lr"])
            optimizer.step()
            lr_scheduler.step()
        learning_rates.append(tmp)

    learning_rates = torch.tensor(learning_rates)

    # Enable altair to handle more than 5000 rows
    alt.data_transformers.disable_max_rows()

    opts_data = pd.concat(
        [
            pd.DataFrame(
                {
                    "Learning Rate": learning_rates[warmup_idx, :],
                    "model_size:warmup": ["512:4000", "512:8000", "256:4000"][
                        warmup_idx
                    ],
                    "step": range(20000),
                }
            )
            for warmup_idx in [0, 1, 2]
        ]
    )

    return (
        alt.Chart(opts_data)
        .mark_line()
        .properties(width=600)
        .encode(x="step", y="Learning Rate", color="model_size:warmup:N")
        .interactive()
    )


# example_learning_schedule()
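
For a smaller view of what `LambdaLR` does, the sketch below uses a hypothetical schedule (linear warm-up over 5 steps, then constant) instead of `rate`. The learning rate at each step is the optimizer's base `lr` multiplied by the value returned from `lr_lambda`:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)   # base LR of 1.0

# Hypothetical schedule: linear warm-up over 5 steps, then constant.
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / 5))

lrs = []
for _ in range(8):
    lrs.append(optimizer.param_groups[0]["lr"])   # LR used for this step
    optimizer.step()                              # no gradients yet, so this is a no-op
    scheduler.step()                              # advance the schedule by one step

print(lrs)
```

Because the base learning rate is 1.0, the recorded values are exactly the multipliers, which is the same trick `example_learning_schedule` uses with `lr=1`.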


In [None]:
class LabelSmoothing(nn.Module):
    '''
    Implement label smoothing:
    Normally in classification, we use one-hot vectors as targets, where the correct class has a probability of 1 and all others 0. 
    Label smoothing softens this, assigning the correct class a probability slightly less than 1 and distributing the rest over the incorrect classes. 
    This helps avoid overfitting and makes the model less certain in its predictions, which tends to improve performance.
    
    size: number of classes (vocabulary size).
    padding_idx: index of the padding token (which we should ignore)
    smoothing: the smoothing factor (e.g., 0.1 means 90% confidence on correct class, 10% distributed among others)
    '''

    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        # Use Kullback-Leibler Divergence as the loss, which measures how one probability distribution diverges from another
        self.criterion = nn.KLDivLoss(reduction="sum")
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None

    # x: predicted log-probabilities from the model (apply log_softmax before passing them in).
    # target: true class indices (as integers).
    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        
        # filling the distribution with a small probability for each class (excluding padding and correct class)
        true_dist.fill_(self.smoothing / (self.size - 2))
        
        # Overwrite the correct class with the confidence value (e.g., 0.9 if smoothing=0.1)
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
       
        # Set padding class probability to zero
        true_dist[:, self.padding_idx] = 0
        
        # If any target labels are padding tokens, zero out their whole row so they are ignored in the loss calculation
        mask = torch.nonzero(target.data == self.padding_idx)
        # torch.nonzero always returns a 2-D tensor, so test for emptiness rather than dimensionality
        if mask.numel() > 0:
            true_dist.index_fill_(0, mask.squeeze(-1), 0.0)
        
        # Store the true distribution and compute the KL-divergence loss against the prediction x
        self.true_dist = true_dist
        return self.criterion(x, true_dist.clone().detach())
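
To see the smoothed targets themselves, the sketch below rebuilds the `true_dist` computation outside the module for a tiny, made-up setting (`size=5`, `padding_idx=0`, `smoothing=0.4`, and three targets of which the last is padding). It follows the same steps as `forward` but uses boolean indexing for the padding rows, which is a simplification of my own:

```python
import torch

size, padding_idx, smoothing = 5, 0, 0.4
target = torch.tensor([2, 1, 0])          # last target is the padding token

# Spread the smoothing mass over all classes except padding and the true class.
true_dist = torch.full((target.size(0), size), smoothing / (size - 2))
# Put the remaining confidence on the true class.
true_dist.scatter_(1, target.unsqueeze(1), 1.0 - smoothing)
# The padding class never receives probability mass.
true_dist[:, padding_idx] = 0.0
# Rows whose target is padding are zeroed out entirely.
true_dist[target == padding_idx] = 0.0

print(true_dist)
print(true_dist.sum(dim=1))
```

Each non-padding row sums to 1 (0.6 on the true class plus three shares of 0.4 / 3), while the padding row sums to 0 and therefore contributes nothing to the loss.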

### Synthetic Data Example

In [88]:
def data_gen(V, batch_size, nbatches, device="cpu"):
    "Generate random data for a src-tgt copy task."
    for i in range(nbatches):
        data = torch.randint(1, V, size=(batch_size, 10)).to(device)
        data[:, 0] = 1
        src = data.requires_grad_(False).clone().detach()
        tgt = data.requires_grad_(False).clone().detach()
        yield Batch(src, tgt, 0, device=device)

In [89]:
class SimpleLossCompute:
    "A simple loss compute and train function."

    def __init__(self, generator, criterion):
        self.generator = generator
        self.criterion = criterion

    def __call__(self, x, y, norm):
        x = self.generator(x)
        sloss = (self.criterion(x.contiguous().view(-1, x.size(-1)), y.contiguous().view(-1)) / norm)
        
        return sloss.data * norm, sloss
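
`SimpleLossCompute` flattens the generator output to `(batch * seq_len, vocab)` and hands it to the criterion. As a sanity check on that criterion, the sketch below (with made-up shapes and a fixed seed) shows that `nn.KLDivLoss(reduction="sum")` against one-hot targets, which is what label smoothing with `smoothing=0.0` produces when no padding is involved, gives the same number as the summed negative log-likelihood:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
log_probs = torch.log_softmax(torch.randn(3, 4), dim=-1)   # (tokens, vocab)
target = torch.tensor([0, 2, 3])

# One-hot target distribution, i.e. no smoothing.
one_hot = torch.zeros(3, 4).scatter_(1, target.unsqueeze(1), 1.0)

kl = nn.KLDivLoss(reduction="sum")(log_probs, one_hot)
nll = nn.NLLLoss(reduction="sum")(log_probs, target)

print(kl.item(), nll.item())
assert torch.allclose(kl, nll)
```

Entries where the target probability is 0 contribute nothing to the KL sum, so only the log-probability of the true class remains.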

In [93]:
def greedy_decode(model, src, src_mask, max_len, start_symbol, device="cpu"):
    ''' Predicts a translation using greedy decoding for simplicity '''
    
    memory = model.encode(src, src_mask)
    ys = torch.zeros(1, 1).fill_(start_symbol).type_as(src.data).to(device)
    for i in range(max_len - 1):
        out = model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat(
            [ys, torch.zeros(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )
    return ys
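
The decoding loop can be tried without a trained model. In the sketch below, `toy_log_probs` is a hypothetical stand-in for `model.decode` followed by `model.generator`: it always puts all probability mass on "last token + 1". The loop itself mirrors the structure of `greedy_decode`: score the next token, take the argmax, append it, repeat.

```python
import torch

V = 6  # hypothetical vocabulary size

def toy_log_probs(ys):
    # Hypothetical stand-in for model.decode + model.generator:
    # puts all probability mass on (last token + 1) mod V.
    logits = torch.full((ys.size(0), V), -1e9)
    logits[torch.arange(ys.size(0)), (ys[:, -1] + 1) % V] = 0.0
    return torch.log_softmax(logits, dim=-1)

ys = torch.tensor([[0]])                     # start symbol
for _ in range(4):
    log_probs = toy_log_probs(ys)            # (batch, V) scores for the next token
    next_word = log_probs.argmax(dim=-1)     # greedy: pick the single best token
    ys = torch.cat([ys, next_word.unsqueeze(1)], dim=1)

print(ys)
```

Greedy decoding commits to the best token at every step and never revisits earlier choices, which keeps the loop simple at the cost of possibly missing a better overall sequence.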

In [None]:
RUN_EXAMPLES = True
class DummyOptimizer(torch.optim.Optimizer):
    def __init__(self):
        # Skip Optimizer.__init__: this stub only needs param_groups so the learning rate can be logged.
        self.param_groups = [{"lr": 0}]

    def step(self):
        pass

    def zero_grad(self, set_to_none=False):
        pass


class DummyScheduler:
    def step(self):
        pass
        
        
def execute_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        fn(*args)
         

# Train the simple copy task.
def example_simple_model():
    device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")  # ①
   
    V = 11    # number of classes
    criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0).to(device)  # ②
    
    model = make_model(V, V, N=2).to(device)  # ③
    optimizer = torch.optim.Adam(model.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9)
    
    # LambdaLR lets you define your own learning‐rate schedule by supplying a function that 
    # maps the current step (or epoch) → a multiplicative factor for the base LR.
    
    # - step: the current update count (incremented each time scheduler.step() is called; here, once per minibatch).
    # - model_size: the model's hidden dimension (often called d_model in transformer code), read here from the source embedding's d_model attribute.
    # - factor: a constant multiplier (1.0 means no extra scaling).
    # - warmup: number of steps over which the LR increases linearly before it starts to decay.
    lr_scheduler = LambdaLR(optimizer=optimizer, lr_lambda=lambda step: rate(step, model_size=model.src_embed[0].d_model, factor=1.0, warmup=400))

    # number of sentences (or examples) per minibatch
    batch_size = 80
    # Training
    for epoch in range(20):
        model.train()  # Puts the model into training mode (enables dropout; LayerNorm behaves the same in both modes).
        run_epoch(
            data_gen(V, batch_size, 20, device),  # ④
            model,
            SimpleLossCompute(model.generator, criterion),
            optimizer,
            lr_scheduler,
            mode="train",
        )
        model.eval()    # Disables dropout and other training-only behaviors.
        run_epoch(
            data_gen(V, batch_size, 5, device),  # ⑤
            model,
            SimpleLossCompute(model.generator, criterion),
            # Stubs that do nothing on .step().
            # This lets us call the same run_epoch API without accidentally updating weights or LRs.
            DummyOptimizer(),
            DummyScheduler(),
            # run_epoch sees this and skips backpropagation entirely (only sums/returns loss)
            mode="eval",
        )[0]

    # Inference / Greedy Decoding
    model.eval()
    src = torch.LongTensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]).to(device)  # ⑥
    max_len = src.shape[1]
    src_mask = torch.ones(1, 1, max_len).to(device)  # ⑦
    print(greedy_decode(model, src, src_mask, max_len=max_len, start_symbol=0, device=device))

execute_example(example_simple_model)

Epoch Step:      1 | Accumulation Step:   2 | Loss:   3.06 | Tokens / Sec:  3464.4 | Learning Rate: 5.5e-06
Epoch Step:      1 | Accumulation Step:   2 | Loss:   2.06 | Tokens / Sec:  8909.1 | Learning Rate: 6.1e-05
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.70 | Tokens / Sec:  8031.9 | Learning Rate: 1.2e-04
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.41 | Tokens / Sec:  8653.8 | Learning Rate: 1.7e-04
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.95 | Tokens / Sec:  8506.8 | Learning Rate: 2.3e-04
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.57 | Tokens / Sec:  8817.3 | Learning Rate: 2.8e-04
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.33 | Tokens / Sec:  8554.3 | Learning Rate: 3.4e-04
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.28 | Tokens / Sec:  8499.6 | Learning Rate: 3.9e-04
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.15 | Tokens / Sec:  8629.4 | Learning Rate: 4.5e-04
Epoch Step:      1 | Accumul