# Transformers for NLP

- Encoder-decoder: vanilla Transformer, BART, T0/T5
- Encoder-only: BERT
- Decoder-only: GPT-*


1. Encoder-only models are pre-trained to predict masked words in a sentence (masked language modeling). This architecture is used when we need to encode a sequence of input tokens into contextual representations, one vector per token, which can be pooled into a fixed-length representation. Typical downstream tasks are text classification, named-entity recognition, and extractive question answering.
2. Decoder-only and encoder-decoder architectures solve tasks where we need to predict the next token or a sequence of tokens. A decoder-only model generates output tokens conditioned on the preceding tokens; an encoder-decoder model generates them conditioned on an encoded input sequence, which suits sequence-to-sequence tasks such as translation and summarization.
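
One way to see the difference between the two families is to look at which positions each token may attend to. The sketch below is an illustration rather than any particular model's implementation, and `visible_positions` is a hypothetical helper name. Encoder-style attention lets every token see the whole sequence, while decoder-style (causal) attention restricts each token to itself and earlier positions.

```python
def visible_positions(seq_len: int, causal: bool) -> list[list[int]]:
    """For each query position i, list the key positions j it may attend to."""
    return [
        [j for j in range(seq_len) if not causal or j <= i]
        for i in range(seq_len)
    ]


# Encoder-style (bidirectional): every token sees the full sequence.
bidirectional = visible_positions(4, causal=False)
# Decoder-style (causal): token i sees only positions 0..i.
causal = visible_positions(4, causal=True)

for i, (b, c) in enumerate(zip(bidirectional, causal)):
    print(f"token {i}: bidirectional={b} causal={c}")
```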

## Calculation of Attention

- Symbols
    - $X \in \mathbb{R}^{n\times d}$ is the input sequence with length $n$ and hidden dimension $d$.
    - $W^Q \in \mathbb{R}^{d\times d}$ is the query projection matrix.
    - $W^K \in \mathbb{R}^{d\times d}$ is the key projection matrix.
    - $W^V \in \mathbb{R}^{d\times d}$ is the value projection matrix.


1. Obtain $Q$, $K$, $V$ through linear projection.

    - $Q = XW^Q \in \mathbb{R}^{n \times d}$
    - $K = XW^K \in \mathbb{R}^{n \times d}$
    - $V = XW^V \in \mathbb{R}^{n \times d}$

2. Compute Attention Score.
    - $m = QK^T \in \mathbb{R}^{n \times n}$; each entry $m_{ij}$ is the attention score between the $i$-th and the $j$-th word.
    - $\tilde{m} = \text{softmax}(m/\sqrt{d})$, applied row-wise; dividing by $\sqrt{d}$ keeps the scores in a moderate range, and the softmax normalizes them so that every entry is positive and each ROW adds up to 1.
    - $Z = \tilde{m} V \in \mathbb{R}^{n \times d}$ sums up the weighted value vectors.
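
The three steps above can be traced on a small random input. This is a minimal sketch, assuming the scaling is applied inside the softmax as in standard scaled dot-product attention; the sizes `n = 4` and `d = 8` and the randomly initialized projection matrices are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
n, d = 4, 8                      # sequence length and hidden dimension (arbitrary)

X = torch.randn(n, d)            # input sequence
W_q = torch.randn(d, d)          # query projection
W_k = torch.randn(d, d)          # key projection
W_v = torch.randn(d, d)          # value projection

# 1. Linear projections, each of shape (n, d).
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# 2. Scaled scores, then a row-wise softmax.
m = Q @ K.T                                    # (n, n)
m_tilde = torch.softmax(m / d ** 0.5, dim=-1)  # each row sums to 1

# 3. Weighted sum of the value vectors.
Z = m_tilde @ V                                # (n, d)

print(m_tilde.sum(dim=-1))
print(Z.shape)
```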

In [1]:
import torch as t
import torch.nn as nn
from torch import Tensor

Attention Head

In [2]:
class AttentionHead(nn.Module):
    """One head of causal self-attention. Note: this implementation is not very efficient.
    """
    def __init__(self, head_size:int,num_embed:int,block_size:int) -> None:
        super().__init__()
        
        self.key = nn.Linear(num_embed,head_size,bias=False)
        self.query = nn.Linear(num_embed,head_size,bias=False)
        self.value = nn.Linear(num_embed,head_size,bias=False)
        
        self.register_buffer("tril",t.tril(t.ones(block_size,block_size)))
    
    def forward(self,x:Tensor)->Tensor:
        B,T,C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # attention scores, scaled by the square root of the head dimension
        wei :Tensor= q@k.transpose(-2,-1)*k.shape[-1]**-0.5
        # the triangular matrix is used to mask the future positions
        wei = wei.masked_fill(self.tril[:T,:T]==0,float("-inf"))
        wei = t.nn.functional.softmax(wei,dim=-1)
        v= self.value(x)   # (B, T, head_size)
        # weighted sum of the value vectors
        out = wei@v 
        return out
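
The masking step in `forward` can be seen in isolation on a tiny score matrix. In this sketch the scores are all zeros (an arbitrary choice), so after masking each row spreads its weight uniformly over the positions it is allowed to see.

```python
import torch

T = 4
scores = torch.zeros(T, T)                 # placeholder attention scores
tril = torch.tril(torch.ones(T, T))        # lower-triangular mask

# Future positions (above the diagonal) get -inf, so softmax assigns them 0.
masked = scores.masked_fill(tril == 0, float("-inf"))
weights = torch.softmax(masked, dim=-1)

# Row i holds 1/(i+1) in columns 0..i and 0 elsewhere.
print(weights)
```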

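Multi-head attention, which consists of several `AttentionHead` modules run in parallel; their outputs are concatenated and passed through a linear projection.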

In [3]:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads:int, head_size:int, num_embed:int, block_size:int) -> None:
        super().__init__()
        
        self.heads = nn.ModuleList(
            [
                AttentionHead(head_size,num_embed,block_size)
                for _ in range(num_heads)
            ]
        )
        # the concatenated heads have num_heads*head_size features
        self.proj = nn.Linear(num_heads*head_size,num_embed)
    
    def forward(self,x:Tensor)->Tensor:
        out = t.cat([h.forward(x) for h in self.heads],dim=-1)
        out=self.proj(out)
        return out
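
Since the `AttentionHead` docstring notes that the per-head implementation is not very efficient, here is a sketch of a common alternative that computes all heads in one batched matrix multiplication. It is an illustration under stated assumptions, not the notebook's implementation: the class name `BatchedCausalSelfAttention`, the fused `qkv` projection, and the requirement that `num_embed` be divisible by `num_heads` are choices made for this example.

```python
import torch
import torch.nn as nn
from torch import Tensor


class BatchedCausalSelfAttention(nn.Module):
    """Causal multi-head self-attention with all heads computed at once."""

    def __init__(self, num_heads: int, num_embed: int, block_size: int) -> None:
        super().__init__()
        assert num_embed % num_heads == 0, "num_embed must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_size = num_embed // num_heads
        # One linear layer produces queries, keys and values for every head.
        self.qkv = nn.Linear(num_embed, 3 * num_embed, bias=False)
        self.proj = nn.Linear(num_embed, num_embed)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x: Tensor) -> Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)             # each (B, T, C)
        # Split the channel dimension into heads: (B, num_heads, T, head_size).
        q = q.view(B, T, self.num_heads, self.head_size).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_size).transpose(1, 2)

        # Scaled scores with the causal mask applied before the softmax.
        wei = q @ k.transpose(-2, -1) * self.head_size ** -0.5   # (B, H, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = torch.softmax(wei, dim=-1)

        out = wei @ v                                       # (B, H, T, head_size)
        out = out.transpose(1, 2).contiguous().view(B, T, C)  # merge the heads
        return self.proj(out)


attn = BatchedCausalSelfAttention(num_heads=4, num_embed=32, block_size=16)
x = torch.randn(2, 10, 32)
y = attn(x)
print(y.shape)
```

The output keeps the input shape `[batch, length, num_embed]`, so a module like this could stand in wherever `MultiHeadAttention` is used, although its parameters are laid out differently.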

Feed-forward neural network, applied to each position independently.

In [4]:
class FeedForwardNN(nn.Module):
    def __init__(self, num_embed:int,mlp_ratio:int) -> None:
        super().__init__()
        
        self.net = nn.Sequential(
            nn.Linear(num_embed,num_embed*mlp_ratio),
            nn.ReLU(),
            nn.Linear(mlp_ratio*num_embed,num_embed),
        )
    def forward(self,x:Tensor)->Tensor:
        return self.net.forward(x)
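
To get a feel for the size of this block, the parameter count of its two linear layers can be worked out by hand. The sketch below uses `num_embed = 768` and `mlp_ratio = 4`, the values used for training later in this notebook; `ffn_param_count` is a hypothetical helper name.

```python
def ffn_param_count(num_embed: int, mlp_ratio: int) -> int:
    """Weights plus biases of Linear(d, r*d) followed by Linear(r*d, d)."""
    hidden = num_embed * mlp_ratio
    first = num_embed * hidden + hidden      # weight + bias of the expansion layer
    second = hidden * num_embed + num_embed  # weight + bias of the projection layer
    return first + second


print(ffn_param_count(num_embed=768, mlp_ratio=4))  # 4722432
```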

Put everything together and construct a transformer block.

In [5]:
class TransformerBlock(nn.Module):
    def __init__(self, num_heads:int, block_size:int, num_embed:int,mlp_ratio=4) -> None:
        super().__init__()
        
        head_size = num_embed // num_heads
        self.sa = MultiHeadAttention(
            num_heads=num_heads,
            head_size=head_size,
            num_embed=num_embed,
            block_size=block_size
        )
        self.ffwd = FeedForwardNN(num_embed=num_embed,mlp_ratio=mlp_ratio)
        self.ln1 = nn.LayerNorm(num_embed)
        self.ln2 = nn.LayerNorm(num_embed)
        
    def forward(self,x:Tensor)->Tensor:
        x = x+self.sa(self.ln1(x))
        x = x+ self.ffwd(self.ln2(x))
        return x
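
The block above applies layer normalization before each sub-layer and adds the result back to its input, a pattern often called a pre-norm residual connection. As a minimal sketch of that pattern in isolation, the hypothetical helper `pre_norm_residual` below wraps an arbitrary sub-layer; an `nn.Linear` stands in for attention or the feed-forward network.

```python
import torch
import torch.nn as nn
from torch import Tensor


def pre_norm_residual(x: Tensor, norm: nn.Module, sublayer: nn.Module) -> Tensor:
    """Compute x + sublayer(norm(x)), the pattern used twice in the block."""
    return x + sublayer(norm(x))


num_embed = 16
norm = nn.LayerNorm(num_embed)
sublayer = nn.Linear(num_embed, num_embed)   # stand-in for attention / FFN

# With a sub-layer that outputs zeros, the residual path passes x through unchanged.
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

x = torch.randn(2, 5, num_embed)
out = pre_norm_residual(x, norm, sublayer)
print(torch.allclose(out, x))
```

Because the shape is preserved, any number of such blocks can be stacked one after another.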

Text Generation

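The transformer predicts the next token given the context of all the previous tokens. After the stack of blocks, a final LayerNorm (`ln_f`) followed by a Linear layer (`lm_head`) maps each position's hidden state to logits over the vocabulary.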

In [6]:
class Transformer(nn.Module):
    def __init__(self, vocab_size,num_embed,block_size,num_heads,num_layers,mlp_ratio) -> None:
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size,num_embed)
        self.position_embedding_table = nn.Embedding(block_size,num_embed)
        
        self.blocks = nn.Sequential(
            *[
                TransformerBlock(
                    num_heads,block_size,num_embed,mlp_ratio
                ) for _ in range(num_layers)
            ]
        )
        
        self.ln_f = nn.LayerNorm(num_embed)
        self.lm_head = nn.Linear(num_embed,vocab_size)
        
    
    def forward(self,idx:Tensor,targets:Tensor=None):
        B,T= idx.shape
        token_emb = self.token_embedding_table.forward(idx)
        posit_emb = self.position_embedding_table(t.arange(T,device=idx.device))
        x= token_emb+ posit_emb
        
        x=self.blocks.forward(x)
        x = self.ln_f.forward(x)
        
        logits = self.lm_head(x)
        
        if targets is not None:
            B,T,C =logits.shape
            logits = t.reshape(logits,(B*T,C))
            targets = t.reshape(targets,(B*T,))
            loss = t.nn.functional.cross_entropy(logits,targets)
        else:
            loss = None
        return logits,loss

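    @t.no_grad()
    def generate(self,idx:Tensor,max_new_tokens:int,block_size:int)->Tensor:
        for _ in range(max_new_tokens):
            # keep only the last block_size tokens, the longest context the model supports
            idx_crop = idx[:,-block_size:]
            logits, _ = self.forward(idx_crop)
            # the logits at the last position predict the next token
            logits= logits[:,-1,:]
            probs = t.nn.functional.softmax(logits,dim=-1)
            # sample the next token id from the predicted distribution
            idx_next = t.multinomial(probs,num_samples=1)
            idx = t.cat((idx,idx_next),dim=1)
        return idx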
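
The reshape in `forward` is needed because `cross_entropy` expects class scores of shape `(N, C)` with targets of shape `(N,)`. The sketch below uses random logits and targets with arbitrary sizes to check that the flattened loss equals the average of the per-position losses.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 2, 3, 5                      # batch, sequence length, vocabulary size
logits = torch.randn(B, T, C)
targets = torch.randint(0, C, (B, T))

# Flatten batch and time so every position becomes one classification example.
loss = F.cross_entropy(logits.reshape(B * T, C), targets.reshape(B * T))

# The same quantity computed by hand: mean negative log-probability of the target.
log_probs = F.log_softmax(logits, dim=-1)                         # (B, T, C)
picked = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T)
manual = -picked.mean()

print(loss.item(), manual.item())
```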

Test the transformer.

In [7]:
model = Transformer(100,256,32,8,12,4)
model.forward((t.ones(1,10,dtype=t.long)))[0].shape

torch.Size([1, 10, 100])

Success! The output has shape `[batch, length, vocab_size]`.

Begin to train the model.

In [8]:
%pip install transformers


In [9]:
import transformers
from transformers import AutoTokenizer,PreTrainedTokenizer

1. Data preparation.

In [10]:
def encode(text_seq: str, tokenizer: PreTrainedTokenizer) -> Tensor:
    """
    Function to encode input text using a pre-trained tokenizer and vectorized lookups
    """
    # tokenize the input text
    tokens = tokenizer.tokenize(text_seq)
    # convert the tokens to their corresponding ids
    token_indices = tokenizer.convert_tokens_to_ids(tokens)
    token_indices = t.tensor(token_indices, dtype=t.long)
    return token_indices

def decode(enc_sec: Tensor, tokenizer:PreTrainedTokenizer) -> str:
    """
    Function to decode a sequence of token indices back to a string
    """
    # convert the indices to a list
    enc_sec = enc_sec.tolist()
    # decode the indices to a string
    text = tokenizer.decode(enc_sec)
    return text

def get_batch(data: Tensor, block_size: int, batch_size: int) -> tuple[Tensor, Tensor]:
    """
    Create a random batch of training examples.
    GPUs allow for parallel processing, so we can feed multiple chunks at once;
    the batch size is the number of independent sequences
    processed in parallel.

    Parameters:
    data (Tensor): 1-D tensor of token ids to take the batch from
    block_size (int): length of each token sequence that is processed at once
    batch_size (int): number of sequences to process in parallel

    Returns:
    x, y: a tuple with the input token sequences and the target tokens
    """
    ix = t.randint(len(data) - block_size, (batch_size,))
    # we stack batch_size rows of sentences
    # so x and y are the matrices with rows_num=batch_size
    # and col_num=block_size
    x = t.stack([data[i : i + block_size] for i in ix])
    # y is x shifted by one position (one token ahead) - because we predict
    # each token in y having all the previous tokens as context
    y = t.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x, y
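
The relationship between `x` and `y` is easiest to see on a toy sequence. This sketch uses a plain Python list of integers in place of the token tensor, and a fixed start index `i` instead of a random one.

```python
data = list(range(10, 20))   # stand-in for a sequence of token ids
block_size = 4
i = 2                        # fixed start index (get_batch draws this at random)

x = data[i : i + block_size]
y = data[i + 1 : i + block_size + 1]

# At every position p, the target y[p] is the token that follows x[p].
for p in range(block_size):
    print(f"context={x[: p + 1]} -> target={y[p]}")
```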

In [11]:
with open("./english.txt","r",encoding="utf-8") as f:
    data_raw = f.read()
tokenizer = AutoTokenizer.from_pretrained("./bert_tokenizer")
vocab_size = tokenizer.vocab_size

# convert the raw text to token IDs, then split 90% / 10% into train and validation
data = encode(data_raw,tokenizer)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

2. Define the transformer.

In [12]:
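# use the first GPU if one is available, otherwise fall back to the CPU
DEVICE = t.device("cuda:0" if t.cuda.is_available() else "cpu")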

model = Transformer(
    vocab_size=vocab_size,
    num_embed=768,
    block_size=64,
    num_heads=6,
    num_layers=6,
    mlp_ratio=4
)

model.to(DEVICE)

optimizer = t.optim.AdamW(model.parameters(),lr=3e-4)
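
A quick sanity reference before training: a model that assigns equal probability to every token has a cross-entropy loss equal to the natural logarithm of the vocabulary size, so the loss of a randomly initialized model should start in that neighborhood. The vocabulary size below is an assumption (30522, the size of the `bert-base-uncased` vocabulary); the tokenizer loaded from `./bert_tokenizer` may differ.

```python
import math

assumed_vocab_size = 30522   # assumption: bert-base-uncased vocabulary size
uniform_loss = math.log(assumed_vocab_size)
print(f"loss of a uniform predictor: {uniform_loss:.3f}")
```

Under that assumption, the first value in the training log (about 10.53) is in the expected range for untrained weights.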

3. Train the transformer.

In [61]:
for step in range(5000):
    xb,yb = get_batch(data=train_data,block_size=64,batch_size=32)
    xb,yb = xb.to(DEVICE),yb.to(DEVICE)
    logits, loss = model.forward(xb,yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
    if step%100==0:
        print("EPOCH:[{}/{}], loss:{}".format(step,5000,loss.item()))

EPOCH:[0/5000], loss:10.533297538757324
EPOCH:[100/5000], loss:4.2037272453308105
EPOCH:[200/5000], loss:4.287846565246582
EPOCH:[300/5000], loss:3.4213290214538574
EPOCH:[400/5000], loss:3.247382879257202
EPOCH:[500/5000], loss:2.726608991622925
EPOCH:[600/5000], loss:2.7817575931549072
EPOCH:[700/5000], loss:2.295689105987549
EPOCH:[800/5000], loss:2.372915029525757
EPOCH:[900/5000], loss:2.4961488246917725
EPOCH:[1000/5000], loss:2.3675694465637207
EPOCH:[1100/5000], loss:1.8057996034622192
EPOCH:[1200/5000], loss:2.147637367248535
EPOCH:[1300/5000], loss:2.0009021759033203
EPOCH:[1400/5000], loss:1.7242244482040405
EPOCH:[1500/5000], loss:1.8565446138381958
EPOCH:[1600/5000], loss:1.9473191499710083
EPOCH:[1700/5000], loss:1.7156968116760254
EPOCH:[1800/5000], loss:1.7214375734329224
EPOCH:[1900/5000], loss:1.736757516860962
EPOCH:[2000/5000], loss:1.6959139108657837
EPOCH:[2100/5000], loss:1.705137014389038
EPOCH:[2200/5000], loss:1.5400543212890625
EPOCH:[2300/5000], loss:1.34330
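
The loop above reports only the training loss of the current batch, which is noisy. A common complement is to average the loss over several batches of the held-out split. The sketch below is one way to do that and is not part of the original notebook: `estimate_loss` and its `eval_iters` parameter are hypothetical, `batch_fn` stands for a zero-argument callable returning `(x, y)` (for example a wrapper around `get_batch` on `val_data`), and the model is assumed to return `(logits, loss)` like the `Transformer` above. The tiny model at the end exists only to make the example runnable.

```python
from typing import Callable, Tuple

import torch
import torch.nn as nn
from torch import Tensor


@torch.no_grad()
def estimate_loss(
    model: nn.Module,
    batch_fn: Callable[[], Tuple[Tensor, Tensor]],
    eval_iters: int = 20,
) -> float:
    """Average the loss over `eval_iters` batches without tracking gradients."""
    was_training = model.training
    model.eval()
    losses = []
    for _ in range(eval_iters):
        xb, yb = batch_fn()
        _, loss = model(xb, yb)
        losses.append(loss.item())
    model.train(was_training)   # restore the previous mode
    return sum(losses) / len(losses)


class TinyLM(nn.Module):
    """Stand-in model with the same (logits, loss) interface."""

    def __init__(self, vocab_size: int) -> None:
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx: Tensor, targets: Tensor):
        logits = self.table(idx)                                   # (B, T, V)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )
        return logits, loss


torch.manual_seed(0)
toy_data = torch.randint(0, 50, (1000,))   # random token ids, for illustration only


def toy_batch() -> Tuple[Tensor, Tensor]:
    ix = torch.randint(len(toy_data) - 8, (4,))
    x = torch.stack([toy_data[i : i + 8] for i in ix])
    y = torch.stack([toy_data[i + 1 : i + 9] for i in ix])
    return x, y


val_loss = estimate_loss(TinyLM(50), toy_batch, eval_iters=5)
print(val_loss)
```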

Save the trained model, then load the checkpoint back to verify it.

In [13]:
t.save(model.state_dict(),"ckpt.pth")
model.load_state_dict(t.load("ckpt.pth"))

<All keys matched successfully>

In [14]:
context = t.zeros((1, 1), dtype=t.long, device=DEVICE)
print(
    decode(
        enc_sec=model.generate(idx=context, max_new_tokens=100, block_size=64)[0],
        tokenizer=tokenizer,
    )
)

