## MultiHead attention

In [1]:
import torch
import torch.nn as nn
import math

## Input Embeddings

Models have no notion of words, so we convert every word in a sequence into a vector representation of a fixed dimension (256, 512, ...). Several steps are involved.

First, we need a vocabulary. Think of it as a dictionary that contains the letters a, b, c, ..., z, where each letter has its own index: a = 0, b = 1, ..., z = 25. When a letter is presented to us, we look up its index, and the embedding layer maps that index to a vector. For example, c and d are mapped to the indices 2 and 3.

In practice the vocabulary holds words instead of letters, so we use it to map each word in the sequence to its index in the vocabulary.
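
As a minimal sketch of this lookup step, the snippet below builds a word-level vocabulary from a toy corpus and maps a sentence to indices. The corpus, the `<unk>` fallback token, and the helper name `encode` are illustrative assumptions, not part of the notebook's classes.

```python
# Toy corpus; the sentences and the <unk> token are illustrative assumptions.
corpus = [
    "i love cool crisp fall weather",
    "do not fall on your way to the gym",
]

# Build the vocabulary: every distinct word gets a unique integer index.
vocab = {"<unk>": 0}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)


def encode(sentence, vocab):
    """Map each word to its vocabulary index, falling back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.split()]


indices = encode("i love the gym", vocab)
print(indices)  # [1, 2, 13, 14]
```

Integer indices like these, wrapped in a tensor, are what an `nn.Embedding` layer takes as input.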

In [2]:
## embeddings 

class InputEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model) 
        
        
    def forward(self, x):
        # x is a batch of sequence of words, batch_size, sequence_length -> batch_size, sequence_length, d_model
        return self.embedding(x) * math.sqrt(self.d_model)

In [3]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)
        # Create a matrix of shape (seq_len, d_model)
        pe = torch.zeros(seq_len, d_model)
        # Create a vector of shape (seq_len)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1) # (seq_len, 1)
        # Create a vector of shape (d_model / 2)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)) # (d_model / 2)
        # Apply sine to even indices
        pe[:, 0::2] = torch.sin(position * div_term) # sin(position / (10000 ** (2i / d_model)))
        # Apply cosine to odd indices
        pe[:, 1::2] = torch.cos(position * div_term) # cos(position / (10000 ** (2i / d_model)))
        # Add a batch dimension to the positional encoding
        pe = pe.unsqueeze(0) # (1, seq_len, d_model)
        # Register the positional encoding as a buffer
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False) # (batch, seq_len, d_model)
        return self.dropout(x)
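
To make the table built in `PositionalEncoding` easier to inspect, here is a rough plain-Python sketch of the same sinusoidal formula. The function name `positional_encoding` and the tiny sizes are assumptions chosen for illustration; the loop variable `i` plays the role of the even column index (`2i` in the usual notation).

```python
import math


def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Same quantity as position * div_term in the class above.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even column: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd column: cosine
    return pe


pe_table = positional_encoding(seq_len=4, d_model=6)
for row in pe_table:
    print([round(value, 3) for value in row])
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1.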

## Layer Normalization

In layer normalization, the statistics are computed across the features $x_{i,k}$ of each sample $i$ rather than across the batch. This removes any dependency between the different input sequences in a batch.

First, we calculate the mean and the variance over the $K$ features of sample $i$.

\begin{gather} \mu_i = \frac{1}{K} \sum_{k=1}^{K} x_{i,k} \\ \sigma_i^2 = \frac{1}{K} \sum_{k=1}^{K} (x_{i,k} - \mu_i)^2 \\ \end{gather}


Then we normalize each sample so that its elements have zero mean and unit variance. $\epsilon$ is added for numerical stability in case the denominator becomes zero.

$$\hat{x}_{i,k} = \frac{x_{i,k}-\mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$$

Finally, there is a scaling and shifting step, where $\gamma$ and $\beta$ are learnable parameters.

$$y_i = \gamma \hat{x}_{i} + \beta \equiv {\text{LN}}_{\gamma, \beta} (x_i)$$

The parameters $\gamma$ and $\beta$ let the model rescale and shift the normalized values, so the output is not forced to stay at zero mean and unit variance.
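
The three steps can be checked by hand on a single feature vector. The following is only a sketch in plain Python; the function name `layer_norm`, the sample values, and the scalar `gamma` and `beta` defaults are assumptions for illustration.

```python
import math


def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Normalize one sample over its K features, then scale and shift."""
    k = len(x)
    mean = sum(x) / k                          # mu_i
    var = sum((v - mean) ** 2 for v in x) / k  # sigma_i^2 (divides by K)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]


features = [2.0, 4.0, 6.0, 8.0]
normalized = layer_norm(features)
shifted = layer_norm(features, gamma=2.0, beta=1.0)

print([round(v, 4) for v in normalized])
print(round(sum(shifted) / len(shifted), 4))  # mean moves to beta
```

With the default `gamma` and `beta` the output has mean 0 and variance close to 1; changing them moves the mean to `beta` and scales the spread by `gamma`.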

In [4]:
class LayerNormalization(nn.Module):
    def __init__(self):
        super().__init__()
        self.eps = 10**-6
        
        ## specifying nn.Parameter will add requires_grad_ to that parameter so it will be a learnable parameter.
        
        self.gamma = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))
        
    def forward(self,x):
        
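        # mean over the feature dimension: (batch, seq_len, 1)
        mean = x.mean(dim = -1, keepdim = True)
        # biased variance (divides by K, as in the formula above): (batch, seq_len, 1)
        var = x.var(dim = -1, keepdim = True, unbiased = False)
        
        # dimension is batch, seq_len, d_model
        x = ( self.gamma * (x - mean) / torch.sqrt(var + self.eps) ) + self.beta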
        
        return x
    

In [6]:
LN = LayerNormalization()

In [7]:
x = LN(torch.rand(2,3,256))

In [8]:
x.shape

torch.Size([2, 3, 256])

In [9]:
embed = InputEmbedding(2000,256)

In [10]:
# nn.Embedding expects a tensor of integer token indices, so passing a raw string fails
embed('yoo')

TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not str

![(https://production-media.paperswithcode.com/methods/multi-head-attention_l1A3G7a.png)](https://data-science-blog.com/wp-content/uploads/2022/01/mha_img_original.png)

## MultiHeadAttention

Here comes the most important mechanism, the one that powers the whole LLM industry: **attention**.

Attention, in general, means focusing on some specific words while ignoring others. It is used to understand the context of a word in a sequence.

The concept of self-attention is to utilize the entire sequence to compute a weighted average of each token's embedding instead of relying on a fixed embedding for each token like word2vec. Embeddings that are generated this way are called contextualized embeddings. This can be restated as self-attention generating a new sequence of embeddings $x_1', \ldots, x_n'$ when given a sequence of token embeddings $x_1, \ldots, x_n$, where each new embedding $x_i'$ is a linear combination of all the $x_j$ in the sequence.

$x_i' = \sum_{j=1}^{n} w_{ji} x_j$

The coefficients $w_{ji}$ are called attention weights and are normalized so that:

$\sum_{j=1}^{n} w_{ji} = 1$
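
As a small illustration of this linear combination, the sketch below computes one contextualized embedding from three toy embeddings. The embedding values and the attention weights are made up for the example; in a real model the weights come from the attention computation described later.

```python
# Three toy 2-dimensional token embeddings x_1, x_2, x_3 (made-up values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical attention weights w_{ji} for one output token i; they sum to 1.
weights = [0.5, 0.25, 0.25]


def weighted_average(weights, vectors):
    """Return sum_j w_j * x_j, computed one dimension at a time."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


contextualized = weighted_average(weights, embeddings)
print(contextualized)  # [0.75, 0.5]
```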

So, the magic of paying attention is enabled by these attention weights.

For example, let’s consider these two sentences.

    I love cool, crisp fall weather.
    Don’t fall on your way to the gym.

The word fall in the first sentence refers to the season, which we can tell from words like cool and crisp, whereas fall in the second sentence means actually falling, which we can tell from words like way and gym.


Let's discuss how we construct attention weights and the final embedding representation.

### Scaled dot-product attention

Scaled dot-product attention is used to calculate the attention weights.
The first step in calculating self-attention is to project three vectors from each of the encoder's input vectors (in this case, token embeddings). So for each word we create a Query vector, a Key vector, and a Value vector, each of dimension $d_k$. Stacked over the whole sequence, they form the matrices $Q$, $K$, and $V$.

These vectors are created by multiplying the embedding by three weight matrices that are learned during training.
The dot product acts as a similarity function that determines how much the query and key vectors relate to each other. If a query and a key are similar, their dot product is large.

The output is calculated as a weighted sum of the values, where the weight of each value is determined by a compatibility function of the query with the corresponding key.



To obtain the final weights on the values, the dot product of the query with all keys is first computed and scaled by dividing by $\sqrt{d_{k}}$. Then a softmax function is applied. Finally, the result is multiplied by $V$.

The final output is:

$\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$

The intuition is that the softmax rescales the scores to values between 0 and 1 (much like a probability for each word) that sum to 1 over all the words. Multiplying these probabilities by $V$ then determines how much each word contributes to the output.
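
The formula can be traced on a tiny example in plain Python. This is only a sketch: the helper names `softmax` and `attention_sketch` and the 2x2 matrices are assumptions for illustration, and masking and dropout are left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_sketch(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for small list-of-lists matrices."""
    d_k = len(K[0])
    # Scaled dot product between every query row and every key row.
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
        for q_row in Q
    ]
    # Each row of weights sums to 1.
    weights = [softmax(row) for row in scores]
    # Weighted sum of the value rows.
    output = [
        [sum(w * v_row[d] for w, v_row in zip(w_row, V)) for d in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention_sketch(Q, K, V)
print(weights)
print(output)
```

Because each query matches its own key best, each output row is pulled toward the corresponding row of `V` while still mixing in the other one.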



If the calculation of the attention weights and the final embedding representation is still unclear, an analogy may help.

Suppose you want to make something for dinner but don't know how. You have a recipe book that tells you which ingredients to use; let the recipe book be our QUERY. You go to a supermarket to buy these ingredients, and the ingredients on the supermarket's shelves are the KEYS. You compare your recipe book with the ingredients on the shelves to find how similar they are, and these similarities become our attention weights. You then fill your shopping cart according to the similarity, which corresponds to multiplying the attention weights with the VALUES. That's it: the contents of your cart are the embedding representation that will be used for language modeling later.

In [120]:
class MultiHeadAttention(nn.Module):
    """Implements the multi-head attention block shown in the figure above."""
    ## arguments: d_model (embedding dimension), h (number of heads), dropout (dropout rate)
    def __init__(self, d_model: int, h: int, dropout: float ) -> None:
        super().__init__()
        
        self.h = h
        
        assert d_model % h == 0, "d_model is not divisible by head"
        self.dropout = nn.Dropout(dropout)
        self.d_k = d_model // h
        self.W_Q = nn.Linear(d_model, d_model, bias = False)
        self.W_K = nn.Linear(d_model, d_model, bias = False)
        self.W_V = nn.Linear(d_model, d_model, bias = False)
        self.W_O = nn.Linear(d_model, d_model, bias = False)
        
        
    @staticmethod
    def scaled_dot_product_attention(query, key, value, mask, dropout: nn.Dropout ):
        d_k = query.shape[-1]
        #dot product between Q and K
        attention_weights = query @ key.transpose(-2,-1)
        
        #scaling
        attention_weights = attention_weights / math.sqrt(d_k)
        
        #masking: positions where mask == 0 get a large negative score, so softmax gives them ~0 weight
        if mask is not None:
            attention_weights = attention_weights.masked_fill(mask == 0, -1e9)
        
        attention_weights = attention_weights.softmax(dim = -1)
        
        #dropout
        if dropout is not None:
            attention_weights = dropout(attention_weights)
            
        #weighted sum of the values (not the keys)
        return attention_weights @ value, attention_weights
        
        
    
    def forward(self, q, k, v, mask):
        
        #q,k,v are embeddings of whole batch of sequence, so their size would be batch_size, sequence_length, d_model (embedding dimension)
        query = self.W_Q(q)
        key = self.W_K(k)
        value = self.W_V(v)
        
        #divide the q,k,v into different h heads
        
        #query initially has size (batch_size, sequence_length, d_model)
        #we split d_model, the embedding dimension, into h heads, each of size d_k = d_model / h
        #finally we transpose to swap the h and sequence_length dimensions, giving (batch_size, h, sequence_length, d_k), so that each head attends over the whole sequence
        
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1,2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1,2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1,2)
        
        #print(query.shape)
        
        ## Now we perform scaled dot product here
        output, self.attention_weights = MultiHeadAttention.scaled_dot_product_attention(query, key, value, mask, self.dropout)
        
        #concatenation part happens here
        #output's dimension is (batch_size, h, sequence_length, d_k); we combine the heads back into (batch_size, sequence_length, d_model)
        
        #transpose returns a non-contiguous view and view needs contiguous memory, so contiguous() first copies the tensor into that layout
        output = output.transpose(1,2).contiguous().view(output.shape[0], -1 , self.d_k * self.h)
        
        #apply the linear part by multiplying with the Linear layer i.e self.W_O
        
        return self.W_O(output)
        


## FeedForward Neural Network



In [121]:
class FeedForwardNeuralNetwork(nn.Module):
    def __init__(self, d_model, dff, dropout):
        super().__init__()
        self.layer1 = nn.Linear(d_model, dff)
        self.layer2 = nn.Linear(dff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.relu = nn.ReLU()
        
    def forward(self,x):
        
        #x has dimension of (batch_size, sequence length, d_model)
        x = self.layer1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.layer2(x)
        return x

In [122]:
checkff = FeedForwardNeuralNetwork(512,2048, 0.1)

In [123]:
checkff(torch.rand(2,3,512)).shape

torch.Size([2, 3, 512])

In [124]:
attention = MultiHeadAttention(256, 8, 0.1)

In [125]:
q = torch.rand(8, 10, 256)

In [126]:
q.shape

torch.Size([8, 10, 256])

In [129]:
x = attention(q,q,q,torch.rand(10,10))

In [130]:
x.shape

torch.Size([8, 10, 256])

## Residual Connection

In [42]:
class ResidualConnection(nn.Module):
    # here sublayer can be either Multihead attention or feedforward neural network see the figure for more info
    def __init__(self, dropout):
        super().__init__()
        self.norm = LayerNormalization()
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, sublayer):
        #This is different from the paper: we use pre-layer normalization here, which means
        # we first normalize, then apply the sublayer and the dropout, and finally add the result to the input
        return x + self.dropout(sublayer(self.norm(x) ) )


        

In [59]:
rescon = ResidualConnection(0.1)

## Encoder Layer

In [131]:
class EncoderBlock(nn.Module):
    def __init__(self, multi_head_attention: MultiHeadAttention , feed_forward_neural_network: FeedForwardNeuralNetwork, dropout ):
        super().__init__()
        self.multi_head_attention = multi_head_attention
        self.feed_forward_neural_network = feed_forward_neural_network
        
#         self.ResidualConnectionForAtt = ResidualConnection(dropout)
#         self.ResidualConnectionForFF = ResidualConnection(dropout)
        
        self.residual_connection = nn.ModuleList([ResidualConnection(dropout) for _ in range(2) ])
        
        
    def forward(self, x, src_mask):
        # x has dimension of batch_size, sequence_length, embedding_dimension i.e d_model
        
        #ResidualConnection takes in sublayer that could be either the MultiHeadAttention or FeedForwardNeuralNetwork
        x = self.residual_connection[0](x, lambda x: self.multi_head_attention(x,x,x, src_mask) )
        
        x = self.residual_connection[1](x, self.feed_forward_neural_network )
        
        return x
        

In [134]:
x = torch.rand(2,3,256)
encoder = EncoderBlock(MultiHeadAttention(256,8, 0.1), FeedForwardNeuralNetwork(256, 1024, 0.1), 0.1)




In [136]:
encoder(x,torch.rand(3,3))

tensor([[[-0.1397,  0.7158,  0.6990,  ...,  0.4477,  0.5085,  0.4095],
         [ 0.2188,  0.5374,  0.0916,  ...,  0.6719,  0.6135,  0.4852],
         [-0.0276,  0.6858,  0.6194,  ...,  0.4255,  0.4995,  0.2260]],

        [[ 0.1064,  0.7339,  1.3082,  ..., -0.1363,  0.7165,  0.6924],
         [ 0.1363,  1.3914,  0.3713,  ...,  0.2192, -0.0998, -0.1291],
         [ 0.6823,  0.8963,  0.6930,  ...,  0.8825, -0.0047,  0.6211]]],
       grad_fn=<AddBackward0>)

### Whole Encoder

In [169]:
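class Encoder(nn.Module):
    def __init__(self, layers : nn.ModuleList ):
        # nn.Module must be initialized before submodules can be assigned
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()
        
    def forward(self, x, src_mask = None):
        # every EncoderBlock expects the source mask (None means no masking)
        for layer in self.layers:
            x = layer(x, src_mask)
    
        return self.norm(x)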

### Sample working for ResidualConnectionForAtt

In [27]:
def func1(q,k,v): #Imagine this is MultiHeadAttention layer
    print(q+k+v)

In [32]:
def func2(x, sublayer):
    x = x + 2
    return sublayer(x)

In [33]:
func2(1,lambda x: func1(x,x,x))

9


## Decoder

In [146]:
class DecoderBlock(nn.Module):
    def __init__(self, multi_head_attention: MultiHeadAttention, cross_attention: MultiHeadAttention, feed_forward_neural_network: FeedForwardNeuralNetwork, dropout):
        super().__init__()
        self.multi_head_attention = multi_head_attention
        self.cross_attention = cross_attention
        self.feed_forward_neural_network = feed_forward_neural_network
        
        self.residual_connection = nn.ModuleList([ResidualConnection(dropout) for _ in range(3)])
        
    def forward(self, x,encoder_output, src_mask, target_mask ):
        #Since this is the decoder block so it'll add the target mask, mask sizes are (sequence_length, sequence_length)
        x = self.residual_connection[0](x, lambda x: self.multi_head_attention(x,x,x, target_mask ))
        x = self.residual_connection[1](x, lambda x: self.cross_attention(x, encoder_output, encoder_output, src_mask))
        x = self.residual_connection[2](x, self.feed_forward_neural_network )
        
        return x

In [149]:
#### Testing Decoder

In [150]:
decoder = DecoderBlock(MultiHeadAttention(256,8, 0.1),MultiHeadAttention(256,8, 0.1), FeedForwardNeuralNetwork(256, 1024, 0.1), 0.1) 

In [151]:
decoder(torch.rand(2,10, 256),torch.rand(2,10, 256), torch.rand(10,10), torch.rand(10,10))

tensor([[[ 0.1410,  0.7607,  0.6645,  ...,  0.2440, -0.1953,  0.8882],
         [ 0.6027,  0.9619,  0.6481,  ...,  1.3845, -0.0954,  0.6457],
         [-0.2824,  0.7030,  0.7282,  ...,  1.4268,  0.1493, -0.1989],
         ...,
         [ 0.5194,  0.1220,  1.0299,  ...,  1.6713, -0.4655,  0.6731],
         [ 0.2809,  0.3560,  0.8084,  ...,  0.5149,  0.1513,  0.4887],
         [ 0.3066,  0.1597,  0.7018,  ...,  0.7505,  0.0167,  0.6384]],

        [[ 0.2780,  0.9038,  0.8930,  ..., -0.0428,  0.0349,  0.3159],
         [ 0.1079,  0.6233,  0.9927,  ...,  0.8181,  0.3700,  0.1943],
         [ 0.3418,  0.8948,  0.8120,  ...,  0.1612, -0.0806, -0.1750],
         ...,
         [ 0.7885,  0.8552,  0.5924,  ...,  0.6979,  0.2468,  0.9633],
         [ 0.6603, -0.1144,  1.6342,  ...,  0.6678, -0.6207,  0.7890],
         [ 1.0396,  1.9594,  0.4226,  ...,  0.3469,  0.0022,  0.5772]]],
       grad_fn=<AddBackward0>)

### Whole Decoder

In [153]:
class Decoder(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        # nn.Module must be initialized before submodules can be assigned
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()
        
    def forward(self,x, encoder_output, src_mask, target_mask ):
        for layer in self.layers:
            x = layer(x, encoder_output, src_mask, target_mask  )
        return self.norm(x)
    


## Linear Transformation
Projecting each embedding to a word in the vocabulary


In [166]:
class LinearTransformation(nn.Module):
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.linear = nn.Linear(d_model, vocab_size)
        
    def forward(self, x):
        # (batch, seq_len, d_model) --> (batch, seq_len, vocab_size)
        # This is predicting what comes after each word.
        
        return torch.log_softmax(self.linear(x), dim = -1)
    

In [167]:
lin = LinearTransformation(256, 342)

In [168]:
lin(torch.rand(2,3,256)).shape

torch.Size([2, 3, 342])
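
To see what the log-softmax in the projection does to one position's scores, here is a plain-Python sketch. The three logits stand in for a hypothetical vocabulary of size 3, and the helper name `log_softmax` is an assumption for illustration, not the `torch.log_softmax` call used in the class.

```python
import math


def log_softmax(logits):
    """Log-probabilities for one position, computed in a numerically stable way."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_sum for v in logits]


# Made-up scores for a vocabulary of 3 words at a single position.
logits = [2.0, 0.5, -1.0]
log_probs = log_softmax(logits)

# The predicted word is the index with the highest log-probability.
predicted_index = max(range(len(log_probs)), key=log_probs.__getitem__)
print([round(v, 4) for v in log_probs])
print(predicted_index)
```

Exponentiating the log-probabilities gives values that sum to 1, and the largest logit keeps the largest log-probability.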

## Transformer

In [170]:
class Transformer(nn.Module):
    def __init__(self, encoder: Encoder, decoder: Decoder, projection: LinearTransformation,input_embedding: InputEmbedding, positional_encoding: PositionalEncoding ):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.projection = projection
        self.input_embedding = input_embedding
        self.positional_embedding = positional_encoding
        
    def encode(self,x):
        # the positional encoding layer adds the positional encodings to the input embeddings. 
        x = self.input_embedding(x)
        x = self.positional_embedding(x)
        return self.encoder(x)
        
    def decode(self,x, encoder_output,src_mask, target_mask ):
        x = self.input_embedding(x)
        x = self.positional_embedding(x)
        
        return self.decoder(x,encoder_output,src_mask, target_mask )
        
        
    def project(self, x):
        
        return self.projection(x)

## Building the Transformers

This is where we actually build the transformer. Now that we have implemented all the classes, we can assemble them into the full model.

In [None]:
def buildingTranformers(source_vocab_size,target_vocab_size, source_seq_length:int, target_seq_length: int, d_model: int = 512, dropout: float = 0.1, dff: int = 2048 , h:int = 8, N:int = 6):
    #Encoder, Decoder, LinearTransformation, InputEmbedding, PositionalEncoding
    #MultiHeadAttention, FeedForwardNeuralNetwork, EncoderBlock, DecoderBlock
    
    source_embedding = InputEmbedding(source_vocab_size, d_model)
    target_embedding = InputEmbedding(target_vocab_size, d_model)
    
    source_positional_encoding = PositionalEncoding(d_model, source_seq_length, dropout)
    target_positional_encoding = PositionalEncoding(d_model, target_seq_length, dropout)
    
    multi_head_attention = MultiHeadAttention(d_model, h, dropout)
    feed_forward_neural_network = FeedForwardNeuralNetwork(d_model, dff, dropout)
    
    
    
    
    
    