**PyTorch implementation of the Transformer from scratch: a demo**

Note: Much of the code and many of the figures are borrowed from:
- https://github.com/BoXiaolei/MyTransformer_pytorch/blob/main/MyTransformer.ipynb
- https://nn.labml.ai/index.html 
- https://github.com/hyunwoongko/transformer 
- https://jalammar.github.io/illustrated-transformer/
- https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/ 

In [7]:
import torch
import torch.nn as nn
import math 
import torch.utils.data as data
import numpy as np

<div>
<img src="images/transformer.png" width="800"/>
</div>

# Positional Encoding

# Mask

## Padding Mask

In the Transformer architecture, a **padding mask** is used to handle **sequences of different lengths**. Here's why it's important:

- **Variable Sequence Lengths**: In many natural language processing tasks, input sequences (like sentences) vary in length. However, neural networks, including Transformers, typically require inputs of a fixed size. To manage this, shorter sequences are often padded with special tokens (like [PAD]) to match the length of the longest sequence in a batch.

- **Ignoring Padding Tokens**: Padding tokens are placeholders, not data, so the model must not treat them as meaningful input. The padding mask makes the model ignore them during both training and inference: the attention scores at padded key positions are set to a large negative value before the softmax, so those positions receive (almost) zero attention weight.

- **Attention Mechanism Accuracy**: Transformers use an attention mechanism to weigh the importance of different parts of the input sequence. Without a padding mask, attention might assign weight to padding tokens, leading to less accurate or less meaningful outputs.

- **Preventing Data Leakage**: In some tasks, such as language modeling, the amount and position of padding could otherwise give the model a signal unrelated to the actual content. Masking helps ensure that predictions are based only on real tokens, not on artificial padding.

**In summary**, the padding mask lets the Transformer handle variable-length input sequences and keeps the attention mechanism focused on the meaningful parts of the input, which improves the quality and relevance of the model's outputs.

- When is this calculated mask used?

Multiplying the query by the transpose of the key gives an attention score matrix of shape (len_q, len_k), and the mask returned by this function is applied to that matrix. The element in row i, column j is the attention score of the i-th token of the q sequence toward the j-th token of the k sequence. Row i therefore holds the attention of one query token over all key tokens, and column j holds the attention of all query tokens toward the j-th key token. No query token should attend to a padding token, so the columns that correspond to padding in k are set to True.
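To make this concrete, here is a minimal sketch of the mask being applied to a score matrix. The toy scores (all zeros) and the token ids are made up for illustration; padding is assumed to have id 0, as elsewhere in this notebook.

```python
import torch

# Toy attention scores for one sentence: 3 query tokens x 4 key tokens.
scores = torch.zeros(1, 3, 4)
# Assume the last two key tokens are padding (id 0).
seq_k = torch.tensor([[5, 7, 0, 0]])
pad_mask = seq_k.eq(0).unsqueeze(1).expand(1, 3, 4)  # [1, len_q, len_k]

# Masked positions get a large negative score, so softmax sends them to 0.
masked_scores = scores.masked_fill(pad_mask, -1e9)
attn = torch.softmax(masked_scores, dim=-1)
print(attn[0])  # each row is [0.5, 0.5, 0.0, 0.0]
```

Every query row spreads its attention over the two real key tokens only; the padded columns receive no weight.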

- Why is only the padding position of k masked, and not that of q? (i.e., why is it that only the last few columns of the return matrix of this function are True, and not the last few rows as well?)

Strictly speaking, a padding token should neither receive attention nor pay attention to others, and the attention that padding computes toward other tokens is meaningless. Here we take a shortcut, which is safe for two reasons. First, the attention of a padded position in q toward the tokens in k is never used, because we never use a padding token to predict the next word. Second, however its vector representation is updated, it does not affect the computation for the other tokens in q, so we leave it alone. Padding in k is different: if it is not masked, it will absorb attention from the tokens in q for no reason, which biases what the model learns.
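The second point can be checked numerically. The sketch below is an illustration under assumptions, not part of the model: the helper `attend`, the tensor sizes, and the choice of position 2 as padding are all hypothetical. It overwrites the padded query vector and confirms that the outputs for the real tokens do not change within one attention step.

```python
import torch

torch.manual_seed(0)
d_k = 4
Q = torch.randn(1, 3, d_k)  # position 2 is assumed to be padding
K = torch.randn(1, 3, d_k)
V = torch.randn(1, 3, d_k)
# Only the key side is masked: column 2 is True for every query row.
key_pad = torch.tensor([[False, False, True]]).unsqueeze(1).expand(1, 3, 3)

def attend(Q, K, V, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = Q @ K.transpose(-1, -2) / d_k ** 0.5
    scores = scores.masked_fill(mask, -1e9)
    return torch.softmax(scores, dim=-1) @ V

out = attend(Q, K, V, key_pad)

# Overwrite the padded query vector with arbitrary values.
Q2 = Q.clone()
Q2[:, 2] = 100.0
out2 = attend(Q2, K, V, key_pad)

# The rows that belong to real tokens are unchanged.
print(torch.allclose(out[:, :2], out2[:, :2]))  # True
```

Each output row depends only on its own query vector, so whatever sits in the padded query row cannot leak into the other rows.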

In [3]:
# The input sentences vary in length, so we pad them with the placeholder 'P' to the length of the longest sentence.
# These placeholders carry no meaning, and we mark their positions with True.
# The function below returns a Boolean tensor indicating whether each position is a placeholder.
# Return: tensor [batch_size, len_q, len_k], True means the position is a placeholder

def get_attn_pad_mask(seq_q, seq_k):
    batch_size, len_q = seq_q.size()
    _,          len_k = seq_k.size()
    # seq_k.eq(0) is True where the token id is 0 (padding) and False elsewhere.
    # unsqueeze(1) adds a dimension so the mask can be shared across query positions.
    pad_attn_mask = seq_k.eq(0).unsqueeze(1)  # pad_attn_mask: [batch_size, 1, len_k]

    # Every query position needs the same key mask, so the second dimension is expanded to len_q.
    # expand() does not copy memory: it returns a view that repeats the same underlying data,
    # so writing through the view would also change the original tensor.
    # That is fine here because the mask is never modified.
    # Result: batch_size matrices of shape len_q x len_k. Entry (i, j) is True when the attention of
    # the i-th query token to the j-th key token is meaningless, i.e. the j-th key token is padding.
    return pad_attn_mask.expand(batch_size, len_q, len_k)  # return: [batch_size, len_q, len_k]

In [4]:
# test for get_attn_pad_mask
seq_q = torch.tensor([[1,2,3,0,0],[3,4,5,6,0],[2,3,0,0,0]])
seq_k = torch.tensor([[1,2,3,4,5],[1,2,0,0,0],[1,2,3,0,0]])
print(get_attn_pad_mask(seq_q, seq_k))

tensor([[[False, False, False, False, False],
         [False, False, False, False, False],
         [False, False, False, False, False],
         [False, False, False, False, False],
         [False, False, False, False, False]],

        [[False, False,  True,  True,  True],
         [False, False,  True,  True,  True],
         [False, False,  True,  True,  True],
         [False, False,  True,  True,  True],
         [False, False,  True,  True,  True]],

        [[False, False, False,  True,  True],
         [False, False, False,  True,  True],
         [False, False, False,  True,  True],
         [False, False, False,  True,  True],
         [False, False, False,  True,  True]]])
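The comment about `expand` not copying memory can be verified directly. This is a small sketch with a made-up key sequence; it compares `expand` (a view) with `repeat` (a real copy).

```python
import torch

seq_k = torch.tensor([[1, 2, 0, 0]])
base = seq_k.eq(0).unsqueeze(1)   # [1, 1, 4]
view = base.expand(1, 3, 4)       # [1, 3, 4], no copy

# The expanded dimension has stride 0: all three rows read the same memory.
print(view.stride()[1])                    # 0
print(view.data_ptr() == base.data_ptr())  # True: same underlying storage

# repeat() allocates new memory instead.
copy = base.repeat(1, 3, 1)
print(copy.data_ptr() == base.data_ptr())  # False
```

Because the rows of the expanded mask alias each other, it should be treated as read-only; call `.clone()` first if an independent, writable mask is needed.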


## Subsequence Mask

This mask is used in the first "Masked Multi-Head Self-Attention" module of the Transformer decoder. Its goal is to prevent the model from seeing future inputs.

Take the example below:

Assume a sentence with 5 tokens is the decoder input. The entry in row i, column j denotes the attention of the i-th token to the j-th token.

The i-th token (row) can only see itself and the tokens before it; the tokens after it are filtered out. So 1 means filtered, and 0 means kept.

<div>
<img src="images/subsequenceMask.png" width="500"/>
</div>

In [8]:
# To prevent positions from attending to subsequent positions
def get_attn_subsequence_mask(seq):
    """seq: [batch_size, tgt_len]"""
    attn_shape = [seq.size(0), seq.size(1), seq.size(1)]
    # np.triu returns a copy of the array with the elements below the k-th diagonal zeroed,
    # i.e. an upper triangular matrix; k is the offset relative to the main diagonal.
    # k=1 excludes the main diagonal, so only strictly future positions are marked with 1.
    subsequence_mask = np.triu(np.ones(attn_shape), k=1)
    # The mask contains only 0 and 1, so uint8 (byte) is used to save memory.
    # Note: recent PyTorch versions deprecate or reject uint8 masks in masked_fill_, so convert
    # this mask to bool (e.g. with .bool() or torch.gt(mask, 0)) before applying it to the scores.
    subsequence_mask = torch.from_numpy(subsequence_mask).byte()
    return subsequence_mask  # return: [batch_size, tgt_len, tgt_len]

In [9]:
# test for get_attn_subsequence_mask
seq = torch.tensor([[1,2,3,0,0],[1,2,3,4,0]])
print(get_attn_subsequence_mask(seq))

tensor([[[0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1],
         [0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0]],

        [[0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1],
         [0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0]]], dtype=torch.uint8)
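As an alternative sketch, the same mask can be built without the NumPy round-trip and returned as a boolean tensor, which is what `masked_fill` expects. The helper name `causal_mask` is hypothetical. The example also shows one way the decoder could combine the subsequence mask with the padding mask, assuming padding id 0.

```python
import torch

def causal_mask(seq):
    """seq: [batch_size, tgt_len] -> bool mask [batch_size, tgt_len, tgt_len]."""
    batch_size, tgt_len = seq.shape
    ones = torch.ones(tgt_len, tgt_len, device=seq.device)
    # diagonal=1 keeps only the entries strictly above the main diagonal.
    upper = torch.triu(ones, diagonal=1).bool()
    return upper.unsqueeze(0).expand(batch_size, -1, -1)

seq = torch.tensor([[1, 2, 3, 0, 0], [1, 2, 3, 4, 0]])
future = causal_mask(seq)                                 # True = future position
pad = seq.eq(0).unsqueeze(1).expand(-1, seq.size(1), -1)  # True = padded key

# A position is blocked if it lies in the future OR is padding.
dec_self_attn_mask = future | pad
print(dec_self_attn_mask[0].int())
```

For the first sentence, rows 3 and 4 end in `1, 1` because the last two key tokens are padding, even though position 3 would be visible under the causal rule alone.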


# Scaled Dot Product Attention

<div>
<img src="images/ScaledDotProductAttention.png" width="500"/>
</div>
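For reference, the computation in the figure can be written as a single formula. The additive mask matrix $M$ is notation introduced here for illustration; the code in this notebook gets the same effect by filling masked scores with a large negative constant (-1e9) instead of $-\infty$.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V,
\qquad
M_{ij} =
\begin{cases}
-\infty & \text{if position } (i, j) \text{ is masked} \\
0 & \text{otherwise}
\end{cases}
$$

Here $Q \in \mathbb{R}^{\text{len}_q \times d_k}$, $K \in \mathbb{R}^{\text{len}_k \times d_k}$ and $V \in \mathbb{R}^{\text{len}_k \times d_v}$ for a single head, and the softmax is taken over each row (the key dimension).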

In [None]:
class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, Q, K, V, attn_mask=None):
        '''
        Q: [batch_size, n_heads, len_q, d_k]
        K: [batch_size, n_heads, len_k, d_k]
        V: [batch_size, n_heads, len_v(=len_k), d_v]
        attn_mask: bool, [batch_size, n_heads, len_q, len_k] (or broadcastable to it); True = masked
        Two types of attention:
        1) self attention: Q, K and V come from the same sequence, so len_q == len_k
        2) cross attention: K and V are the encoder's output, so they share one length,
           while len_q (the decoder length) may differ from len_k
        '''
        d_k = Q.size(-1)
        # 1) compute the attention scores QK^T / sqrt(d_k)
        scores = torch.matmul(Q, K.transpose(-1, -2)) / math.sqrt(d_k)  # scores: [batch_size, n_heads, len_q, len_k]
        # 2) mask operation (optional) and softmax
        # True positions in the mask are replaced with -1e9, so softmax gives them ~0 weight
        if attn_mask is not None:
            scores = scores.masked_fill(attn_mask, -1e9)
        attn = torch.softmax(scores, dim=-1)  # attn: [batch_size, n_heads, len_q, len_k]
        # 3) use the attention weights to weigh the values V
        context = torch.matmul(attn, V)  # context: [batch_size, n_heads, len_q, d_v]
        # Each dimension (d_1 ... d_v) of context is updated from the current token's attention
        # over all tokens within that dimension (column). The dimensions are independent of each
        # other, so the computation for one column simply extends to many columns.
        #
        # The returned context still represents batch_size sentences. The 512-dim word vector of
        # each token is split into 8 parts, one per head; each head computes 512/8 = 64 dimensions
        # for the whole sentence, and the parts are concatenated column-wise afterwards.
        return context  # context: [batch_size, n_heads, len_q, d_v]
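The question of whether `len_q` and `len_k` may differ can be settled with a quick shape check. The sketch below is a standalone functional version of the same computation, not the class itself; the sizes and the source token ids are made up, and the mask is given shape `[batch_size, 1, 1, len_k]` so that it broadcasts over heads and query positions.

```python
import math
import torch

batch_size, n_heads, len_q, len_k, d_k, d_v = 2, 8, 6, 5, 64, 64
Q = torch.randn(batch_size, n_heads, len_q, d_k)  # decoder side
K = torch.randn(batch_size, n_heads, len_k, d_k)  # encoder output
V = torch.randn(batch_size, n_heads, len_k, d_v)  # encoder output

# Assumed source tokens, with 0 as padding.
src = torch.tensor([[1, 2, 3, 4, 0], [1, 2, 3, 5, 0]])
attn_mask = src.eq(0)[:, None, None, :]           # [batch_size, 1, 1, len_k]

scores = Q @ K.transpose(-1, -2) / math.sqrt(d_k)  # [batch_size, n_heads, len_q, len_k]
scores = scores.masked_fill(attn_mask, -1e9)
attn = torch.softmax(scores, dim=-1)
context = attn @ V                                 # [batch_size, n_heads, len_q, d_v]

print(scores.shape)   # torch.Size([2, 8, 6, 5])
print(context.shape)  # torch.Size([2, 8, 6, 64])
```

In cross attention the score matrix is rectangular (`len_q` x `len_k`), and the output always has one row per query token.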

# MultiHeadAttention
<div>
<img src="images/MultiHeadAttention.png" width="300"/>
</div>

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, dropout_rate):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.dropout_rate = dropout_rate
        self.head_dim = d_model // n_heads
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.scaled_dot_product_attention = ScaledDotProductAttention()
        self.W_O = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, Q, K, V, attn_mask):
        '''
        Q: [batch_size, len_q, d_model]
        K: [batch_size, len_k, d_model]
        V: [batch_size, len_v, d_model] 
        attn_mask: [batch_size, len_q, len_k], True = masked position
        '''
        batch_size, len_q, d_model = Q.shape
        batch_size, len_k, d_model = K.shape
        batch_size, len_v, d_model = V.shape
        # 1) linear projection
        Q = self.W_Q(Q)
        K = self.W_K(K)
        V = self.W_V(V)
        # 2) split by heads
        # [batch_size, len_q, d_model] -> [batch_size, len_q, n_heads, head_dim]
        Q = Q.reshape(batch_size, len_q, self.n_heads, self.head_dim)
        K = K.reshape(batch_size, len_k, self.n_heads, self.head_dim)
        V = V.reshape(batch_size, len_v, self.n_heads, self.head_dim)
        # 3) transpose for attention dot product
        # [batch_size, len_q, n_heads, head_dim] -> [batch_size, n_heads, len_q, head_dim]
        Q = Q.transpose(1, 2)
        K = K.transpose(1, 2)
        V = V.transpose(1, 2)
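        # 4) attention
        # Add a head dimension so the mask broadcasts over all heads:
        # [batch_size, len_q, len_k] -> [batch_size, 1, len_q, len_k]
        if attn_mask is not None:
            attn_mask = attn_mask.unsqueeze(1)
        context = self.scaled_dot_product_attention(Q, K, V, attn_mask)
        # context: [batch_size, n_heads, len_q, head_dim]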
        # 5) concat heads
        # method 1:
        output = context.transpose(1, 2).reshape(batch_size, len_q, self.d_model)
        # output: [batch_size, len_q, d_model]
        # method 2:
        # output = torch.cat([context[:,i,:,:] for i in range(self.n_heads)], dim=-1)
        # output: [batch_size, len_q, d_model]
        # 6) final linear projection of the concatenated heads
        output = self.W_O(output)
        return output # output: [batch_size, len_q, d_model]
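The split and concat steps are easy to get wrong, so here is a small sanity check with made-up sizes. It shows that splitting into heads and merging back is lossless, that the two merge methods mentioned in the comments agree, and that reshaping straight to `[batch, n_heads, seq, head_dim]` without the transpose gives a different (incorrect) layout.

```python
import torch

batch_size, seq_len, n_heads, head_dim = 2, 5, 8, 64
d_model = n_heads * head_dim
x = torch.randn(batch_size, seq_len, d_model)

# Split: [batch, seq, d_model] -> [batch, n_heads, seq, head_dim]
heads = x.reshape(batch_size, seq_len, n_heads, head_dim).transpose(1, 2)

# Merge, method 1: transpose back and flatten the last two dimensions.
merged_1 = heads.transpose(1, 2).reshape(batch_size, seq_len, d_model)
# Merge, method 2: concatenate the per-head slices along the feature axis.
merged_2 = torch.cat([heads[:, i] for i in range(n_heads)], dim=-1)
print(torch.equal(merged_1, x), torch.equal(merged_2, x))  # True True

# Pitfall: reshaping directly mixes tokens and heads.
wrong = x.reshape(batch_size, n_heads, seq_len, head_dim)
print(torch.equal(wrong, heads))  # False
```

Method 1 is usually preferred because it avoids a Python loop over heads.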

# Feed-Forward Networks

In [None]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout_rate):
        super(PositionwiseFeedForward, self).__init__()
        self.d_model = d_model
        self.d_ff = d_ff
        self.W_1 = nn.Linear(d_model, d_ff)
        self.W_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout_rate)
        self.relu = nn.ReLU()
    def forward(self, x):
        '''
        x: [batch_size, seq_len, d_model]
        '''
        output = self.relu(self.W_1(x))
        output = self.W_2(output)
        
        return output
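"Position-wise" means the same two-layer network is applied to every token independently. The sketch below illustrates this with a stand-in `nn.Sequential` (not the class above, and with small made-up sizes): running the whole sequence at once gives the same result as running each position separately.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32
# Stand-in for the feed-forward block: Linear -> ReLU -> Linear.
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)  # [batch_size, seq_len, d_model]
full = ffn(x)                   # all positions at once
# Apply the same network to each position separately and stack the results.
per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

print(full.shape)  # torch.Size([2, 5, 8])
print(torch.allclose(full, per_position, atol=1e-5))  # True
```

`nn.Linear` acts on the last dimension only, which is why no explicit loop over positions is needed in the actual module.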

# Encoder

## Encoder Layer

## Encoder

# Decoder

## Decoder Layer

## Decoder

# Transformer

- Encoder
- Decoder
- Dense

# Demo of Training

- ## Dataset Preparation 

In [None]:
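# S: start-of-sequence token
# E: end-of-sequence token
# P: padding, a placeholder that pads the current sequence to the length of the longest sequence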
sentence = [
    # enc_input   dec_input    dec_output
    ['ich mochte ein bier P','S i want a beer .', 'i want a beer . E'],
    ['ich mochte ein cola P','S i want a coke .', 'i want a coke . E'],
]

# Vocabularies; padding is represented by index 0
# Source vocabulary
src_vocab = {'P':0, 'ich':1,'mochte':2,'ein':3,'bier':4,'cola':5}
src_vocab_size = len(src_vocab) # 6
# Target vocabulary (including special tokens)
tgt_vocab = {'P':0,'i':1,'want':2,'a':3,'beer':4,'coke':5,'S':6,'E':7,'.':8}
# Reverse mapping, idx -> word
idx2word = {v:k for k,v in tgt_vocab.items()}
tgt_vocab_size = len(tgt_vocab) # 9

src_len = 5 # maximum length of the input sequence enc_input, i.e. the token count of the longest sentence
tgt_len = 6 # maximum length of the output sequences dec_input/dec_output

In [None]:
# This function converts the raw input sequences into token indices
def make_data(sentence):
    enc_inputs, dec_inputs, dec_outputs = [],[],[]
    for i in range(len(sentence)):
        enc_input = [src_vocab[word] for word in sentence[i][0].split()]
        dec_input = [tgt_vocab[word] for word in sentence[i][1].split()]
        dec_output = [tgt_vocab[word] for word in sentence[i][2].split()]
        
        enc_inputs.append(enc_input)
        dec_inputs.append(dec_input)
        dec_outputs.append(dec_output)
        
    # LongTensor is dedicated to storing (64-bit) integers, while Tensor can hold floats, integers, bools and other types
    return torch.LongTensor(enc_inputs),torch.LongTensor(dec_inputs),torch.LongTensor(dec_outputs)

enc_inputs, dec_inputs, dec_outputs = make_data(sentence)

print(' enc_inputs: \n', enc_inputs)  # enc_inputs: [2,5]
print(' dec_inputs: \n', dec_inputs)  # dec_inputs: [2,6]
print(' dec_outputs: \n', dec_outputs) # dec_outputs: [2,6]
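The relationship between `dec_input` and `dec_output` is worth spelling out: the target is the decoder input shifted left by one token, so at every step the model sees a prefix and is trained to predict the next token (teacher forcing). The sketch below reuses the target vocabulary and first sentence pair from above in a standalone form.

```python
tgt_vocab = {'P': 0, 'i': 1, 'want': 2, 'a': 3, 'beer': 4, 'coke': 5, 'S': 6, 'E': 7, '.': 8}
idx2word = {i: w for w, i in tgt_vocab.items()}

dec_input = [tgt_vocab[w] for w in 'S i want a beer .'.split()]
dec_output = [tgt_vocab[w] for w in 'i want a beer . E'.split()]
print(dec_input)   # [6, 1, 2, 3, 4, 8]
print(dec_output)  # [1, 2, 3, 4, 8, 7]

# The target is the decoder input shifted left by one position.
assert dec_input[1:] == dec_output[:-1]

# At step t the decoder has seen dec_input[:t + 1] and should predict dec_output[t].
for t in range(len(dec_input)):
    prefix = ' '.join(idx2word[i] for i in dec_input[:t + 1])
    print(f'{prefix!r} -> {idx2word[dec_output[t]]!r}')
```

The first line printed by the loop is `'S' -> 'i'` and the last is `'S i want a beer .' -> 'E'`.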

In [None]:
# Load the data with a Dataset
class MyDataSet(data.Dataset):
    def __init__(self,enc_inputs, dec_inputs, dec_outputs):
        super(MyDataSet,self).__init__()
        self.enc_inputs = enc_inputs
        self.dec_inputs = dec_inputs
        self.dec_outputs = dec_outputs
        
    def __len__(self):
        # enc_inputs.shape is [2,5] (see above), so this returns 2
        return self.enc_inputs.shape[0] 
    
    # For a given idx, return one triple: enc_input, dec_input, dec_output
    def __getitem__(self, idx):
        return self.enc_inputs[idx], self.dec_inputs[idx], self.dec_outputs[idx]

# Build the DataLoader
loader = data.DataLoader(dataset=MyDataSet(enc_inputs,dec_inputs, dec_outputs),batch_size=2,shuffle=True)
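To see what the loader yields, here is a standalone sketch. It hard-codes the index tensors that `make_data` produces for the two toy sentences and uses the built-in `TensorDataset` as a stand-in for `MyDataSet`; both choices are assumptions made to keep the example self-contained.

```python
import torch
import torch.utils.data as data

enc_inputs = torch.tensor([[1, 2, 3, 4, 0], [1, 2, 3, 5, 0]])
dec_inputs = torch.tensor([[6, 1, 2, 3, 4, 8], [6, 1, 2, 3, 5, 8]])
dec_outputs = torch.tensor([[1, 2, 3, 4, 8, 7], [1, 2, 3, 5, 8, 7]])

# TensorDataset returns one (enc_input, dec_input, dec_output) triple per index.
dataset = data.TensorDataset(enc_inputs, dec_inputs, dec_outputs)
loader = data.DataLoader(dataset, batch_size=2, shuffle=True)

for enc_batch, dec_in_batch, dec_out_batch in loader:
    # One batch holds both sentences: [2, 5], [2, 6], [2, 6]
    print(enc_batch.shape, dec_in_batch.shape, dec_out_batch.shape)
```

With only two samples and `batch_size=2`, each epoch consists of a single batch; `shuffle=True` only changes the order of the two rows.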

- ## Model Training