<a href="https://www.kaggle.com/code/aisuko/introducing-transformer-architecture?scriptVersionId=188300194" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Historical Context (Neural Networks)

* Before 2012
  * Nightmare
* 2012: ImageNet Classification with Deep Convolutional Neural Networks
  * A pattern was found -> neural networks -> we don't need to worry about the compute and data
* 2014: Neural networks expand to other areas
  * Sequence to Sequence Learning with Neural Networks -> (encoder-decoder architecture, attention mechanism, RNN)
  * e.g. machine translation
* Dec 2017: Attention Is All You Need
  * Everything is attention only; all RNN components are deleted
  * Positional encodings
  * Residual network (ResNet) structure
  * Interspersing of Attention and MLP
  * LayerNorms
  * Multiple heads of attention in parallel
  * Great hyperparameters
 
 
 
# Attention = "communication" phase of the transformer

**Self-attention** in Transformers

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$


In the communication phase of the transformer:
* every head applies this, in parallel
* and then every layer applies it, in series, with different weights each time
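
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration for a single head and a single sequence, not the notebook's own implementation; the function names `softmax` and `attention` and the toy shapes are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row-wise max before exponentiating for numerical stability
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T_q, T_k) query-key compatibilities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of values, plus the weights

# toy shapes, chosen only for illustration
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 16))
out, weights = attention(Q, K, V)
print(out.shape, weights.shape)  # (4, 16) (4, 6)
```

Each of the 4 queries ends up with a convex combination of the 6 value vectors, which is why the output has one row per query and one column per value feature.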

**Multi-head attention**

Self-attention applied multiple times in parallel. The heads are really just copies of the same operation running in parallel, each with its own weights.
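
As a rough sketch of what "in parallel" means, the NumPy code below splits the feature dimension into heads, runs the same attention computation on every head at once, and concatenates the results. The function name `multi_head_self_attention`, the weight names, and the sizes are assumptions made for this example; it omits masking, biases, and dropout.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_head):
    # x: (T, C); every weight matrix: (C, C); C must be divisible by n_head
    T, C = x.shape
    hs = C // n_head  # head size

    def split_heads(m):
        # (T, C) -> (n_head, T, hs)
        return m.reshape(T, n_head, hs).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(hs), axis=-1)  # (n_head, T, T)
    y = att @ v                              # (n_head, T, hs)
    y = y.transpose(1, 0, 2).reshape(T, C)   # put the head outputs side by side
    return y @ w_o                           # output projection

rng = np.random.default_rng(0)
T, C, n_head = 5, 16, 4
x = rng.standard_normal((T, C))
w_q, w_k, w_v, w_o = (rng.standard_normal((C, C)) * 0.1 for _ in range(4))
y = multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_head)
print(y.shape)  # (5, 16)
```

The heads never interact inside the attention step itself; they are only mixed afterwards by the output projection.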

**Cross-attention**

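It differs from self-attention in where the keys and values come from: the queries still come from the current sequence, but the keys and values come from another source (for example, the encoder output).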
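
The difference can be shown with one helper that takes the query source and the key/value source as separate arguments. This is a single-head NumPy sketch; the names `attend`, `decoder_x`, and `encoder_out` and all sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8
decoder_x = rng.standard_normal((3, d))    # 3 positions that produce the queries
encoder_out = rng.standard_normal((5, d))  # 5 positions that produce keys and values
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def attend(x_q, x_kv):
    # queries come from x_q; keys and values come from x_kv
    q = x_q @ w_q
    k = x_kv @ w_k
    v = x_kv @ w_v
    weights = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return weights @ v, weights

self_out, self_w = attend(decoder_x, decoder_x)      # self-attention
cross_out, cross_w = attend(decoder_x, encoder_out)  # cross-attention
print(self_w.shape, cross_w.shape)  # (3, 3) (3, 5)
```

In both cases the output has one row per query position; only the set of nodes being attended to changes.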

In [1]:
# Attention basically
import numpy as np

class Node:
    def __init__(self):
        # the vector stored at this node
        self.data = np.random.randn(20)
        
        # weights governing how this node interacts with other nodes
        self.w_key=np.random.randn(20, 20)
        self.w_query=np.random.randn(20, 20)
        self.w_value=np.random.randn(20,20)
        
    def key(self):
        # What do I have
        return self.w_key@self.data
    
    def query(self):
        # What am I looking for?
        return self.w_query@ self.data
    
    def value(self):
        # What do I publicly reveal/broadcast to others? communicate
        return self.w_value @ self.data
    
class DIG:
    """
    Mirrors the message-passing scheme at the heart of the transformer
    """
    def __init__(self):
        # 10 nodes
        self.nodes=[Node() for _ in range(10)]
        randi=lambda: np.random.randint(len(self.nodes))
        # 40 edges
        self.edges=[[randi(), randi()] for _ in range(40)]
        
    def run(self):
        updates = []
        for i,n in enumerate(self.nodes):
            
            # what is this node looking for?
            q=n.query()
            
            # find all edges that are input to this node
            inputs=[self.nodes[ifrom] for (ifrom,ito) in self.edges if ito==i]
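            if len(inputs)==0:
                # no incoming edges: append a zero update so that updates stays aligned with self.nodes
                updates.append(np.zeros_like(n.data))
                continue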
            
            # gather their keys, i.e, what they hold
            keys= [m.key() for m in inputs]
            
            # calculate the compatibilities
            scores=[k.dot(q) for k in keys]
            
            # softmax them so they sum to 1 (normalization)
            scores=np.array(scores)
            scores=np.exp(scores-np.max(scores)) # subtract the max for numerical stability
            
            scores=scores/np.sum(scores)
            # gather the appropriate values with a weighted sum
            values=[m.value() for m in inputs]
            update=sum([s*v for s,v in zip(scores, values)])
            updates.append(update)
            
        for n,u in zip(self.nodes, updates):
            n.data=n.data+u # residual connection

In encoder-decoder models:
* The encoder is a fully connected cluster: all tokens see each other and can learn how they relate to one another. In other words, each token can find out which other tokens answer what it is looking for.
* The decoder is fully connected to the encoder positions and left-to-right connected across its own positions. Each decoder position receives the output from the top of the encoder stack as well as the decoder's own previous output tokens.
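
The two connectivity patterns can be written as attention masks. The NumPy sketch below is an illustration under assumed toy sizes: the encoder uses no mask, while the decoder uses a lower-triangular mask so that position t only attends to positions 0..t. The variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

T = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((T, T))  # raw scores: row = query position, column = key position

# Encoder: every position may attend to every position (no mask)
encoder_weights = softmax(scores, axis=-1)

# Decoder: lower-triangular (causal) mask, future positions are set to -inf
causal_mask = np.tril(np.ones((T, T), dtype=bool))
masked_scores = np.where(causal_mask, scores, -np.inf)
decoder_weights = softmax(masked_scores, axis=-1)  # exp(-inf) = 0, so future weights vanish

print(causal_mask.astype(int))
print(np.round(decoder_weights, 2))
```

The first decoder row puts all of its weight on position 0, since that is the only position it is allowed to see.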

# Decoder-only Neural Network

This is only a demonstration of the decoder-only architecture.
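
The model code in this section reads its hyperparameters from a `config` object that the notebook never defines. A minimal sketch of such an object is shown here; the class name `GPTConfig` and every default value are assumptions chosen for illustration, and only the field names are taken from the attributes the code accesses.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:  # hypothetical name; the notebook only refers to `config`
    block_size: int = 64   # maximum sequence length (context window)
    vocab_size: int = 65   # number of distinct tokens
    n_layer: int = 2       # number of stacked Blocks
    n_head: int = 4        # attention heads per Block
    n_embd: int = 32       # embedding width; must be divisible by n_head
    dropout: float = 0.1   # dropout probability

config = GPTConfig()
assert config.n_embd % config.n_head == 0
head_size = config.n_embd // config.n_head
print(head_size)  # 8
```

An instance like this could then be passed as `GPT(config)`, with `idx` being a `(batch, time)` tensor of token ids whose time dimension is at most `block_size`.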

In [2]:
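import math

import torch
from torch import nn
import torch.nn.functional as F


def new_gelu(x):
    # tanh approximation of GELU; the original cell calls new_gelu without defining it,
    # so this definition is an assumption about what was intended
    return 0.5*x*(1.0+torch.tanh(math.sqrt(2.0/math.pi)*(x+0.044715*torch.pow(x,3.0))))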


class MLP(nn.Module):
    """
    Multi-layer perceptron. It processes each node individually,
    transforming the feature representation at that node.
    
    2-layer neural network
    """
    def __init__(self, config):
        super().__init__()
        self.c_fc=nn.Linear(config.n_embd, 4*config.n_embd)
        self.c_proj=nn.Linear(4*config.n_embd, config.n_embd)
        self.dropout=nn.Dropout(config.dropout)
        
    def forward(self, x):
        x=self.c_fc(x)
        x=new_gelu(x) # nonlinearity
        x=self.c_proj(x)
        x=self.dropout(x)
        return x
    
class CausualSelfAttention(nn.Module):
    """
    How we mask the connectivity in the graph
    
    Note: when predicting a token, we can't use any information from the future
    
    For example, as in https://www.kaggle.com/code/aisuko/single-character-nn-prediction-with-pytorch
    
    We use the 14th token, as input to the NN, to predict the 15th token.
    
    
    """
    def __init__(self, config):
        super().__init__() # required before registering submodules and buffers
        assert config.n_embd % config.n_head ==0
        # key, query value projections for all heads, but in a batch
        self.c_attn=nn.Linear(config.n_embd, 3*config.n_embd)
        # output projection
        self.c_proj=nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_dropout=nn.Dropout(config.dropout)
        self.resid_dropout=nn.Dropout(config.dropout)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size)).view(1,1, config.block_size, config.block_size))
        self.n_head=config.n_head
        self.n_embd=config.n_embd
        
    def forward(self,x):
        B,T,C=x.size() # batch size, sequence length, embedding dimensionality(n_embd)
        
        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q,k,v=self.c_attn(x).split(self.n_embd, dim=2)
        k=k.view(B,T,self.n_head, C//self.n_head).transpose(1,2) # (B, nh, T, hs): batch, head, time, and per-head feature dimensions
        q=q.view(B,T,self.n_head, C//self.n_head).transpose(1,2) # (B, nh, T, hs)
        v=v.view(B,T,self.n_head, C//self.n_head).transpose(1,2) # (B, nh, T, hs)
        
        # causal self-attention: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att=(q@k.transpose(-2,-1))*(1.0/math.sqrt(k.size(-1)))
        
        # Note: removing the single masked_fill line below turns this into encoder-style attention.
        # If we don't mask the attention, all the nodes communicate with each other,
        # and information flows between all the nodes.
        
        # mask the attention: clamp the scores between nodes that are not supposed to communicate to negative infinity
        att=att.masked_fill(self.bias[:,:,:T,:T]==0, float('-inf'))
        
        # softmax turns the negative-infinity scores into zero attention weights
        att=F.softmax(att, dim=-1)
        
        att=self.attn_dropout(att) # optional dropout
        
        # gather the information according to the affinities we calculated;
        # it is a weighted sum of the values of all these nodes
        y=att@v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        
        y=y.transpose(1,2).contiguous().view(B,T,C) # re-assemble all head outputs side by side
        
        # a linear projection back to the residual pathway
        y=self.resid_dropout(self.c_proj(y))
        return y
        

class Block(nn.Module):
    """
    Decoder-only block
    * We don't have an encoder
    * We don't have cross-attention (the multi-head attention over encoder outputs)
    """
    
    def __init__(self, config):
        super().__init__()
        self.ln_1=nn.LayerNorm(config.n_embd)
        self.attn=CausualSelfAttention(config)
        self.ln_2=nn.LayerNorm(config.n_embd)
        self.mlp=MLP(config)
        
    def forward(self, x):
        x=x+self.attn(self.ln_1(x)) # communication phase (nodes talk to each other) -> masked multi-head attention
        # if we wanted cross-attention (as in an encoder-decoder model), it would be added here
        x=x+self.mlp(self.ln_2(x)) # compute phase -> feed forward
        return x

class GPT(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config=config
        
        self.transformer=nn.ModuleDict(
            dict(
                wte=nn.Embedding(config.vocab_size, config.n_embd),
                wpe=nn.Embedding(config.block_size, config.n_embd),
                drop=nn.Dropout(config.dropout),
                h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
                ln_f=nn.LayerNorm(config.n_embd),
            )
        )
        self.lm_head=nn.Linear(config.n_embd, config.vocab_size, bias=False)
    
    def forward(self, idx, targets=None):
        device=idx.device
        b,t=idx.size()
        assert t<=self.config.block_size, f"Can't forward sequence of length {t}, block size is only {self.config.block_size}"
        pos=torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape(1,t)
        
        # forward the GPT model itself
        tok_emb=self.transformer.wte(idx) # token embeddings of shape (b,t, n_embd)
        pos_emb=self.transformer.wpe(pos) # position embeddings of shape (1,t,n_embd)
        x=self.transformer.drop(tok_emb+pos_emb)
        for block in self.transformer.h: 
            x=block(x)
        x=self.transformer.ln_f(x) # final layer norm
        logits=self.lm_head(x) # linear layer projecting to vocabulary logits
        
        # if we are given some desired targets, also calculate the loss
        loss=None
        if targets is not None:
            loss=F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        return logits, loss
    

# Credit

* https://youtu.be/XfpMkf4rD6E?si=Gak8OBHLUMzHVXC6