# Transformer Neural Network

In this notebook, I will attempt to build my own transformer neural network from scratch. As many of you know, the transformer is the architecture behind ChatGPT and many other well-known LLMs. I'm excited to try this out and understand how this stuff ACTUALLY works.

<img src="assets/Screenshot 2024-11-12 144358.png">

Note: I will be using PyTorch for this one, but I may later try to implement one without it, similar to my previous project. But we shall see :salute_face:

# Resources:
1. Yt: https://www.youtube.com/watch?v=4Bdc55j80l8
2. DataCamp: https://www.datacamp.com/tutorial/building-a-transformer-with-py-torch

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy

First, let's build the encoder layer, specifically the input embeddings and multi-head attention. We start with multi-head attention.

<img src="assets/attention_layer.png">

In [4]:
class multiHeadAttention(nn.Module):
    def __init__(self, dims, n_heads):                                  
        """
        dims: Dimensionality of input
        n_heads: Number of heads for the attention layer 
        """

        super(multiHeadAttention, self).__init__()              # Initialize the parent torch nn.Module class
        assert dims % n_heads == 0, "d_model must be divisible by num_heads"
        # In multi-head attention, the d_model dimension (the overall dimension of each token's embedding, `dims` here)
        # is split into num_heads (`n_heads` here) smaller chunks so that each head can process a portion of the
        # model's dimension independently. The dimension of each head, called d_k in the code, is calculated as
        # d_model // num_heads. To make this division possible, d_model needs to be evenly divisible by num_heads.

        # Initialize dimensions
        self.d_model = dims
        self.num_heads = n_heads
        self.d_k = dims // n_heads      # Dimension of each head's key, query and value

        # Now, time to transform the inputs
        self.W_q = nn.Linear(dims, dims)    # Query 
        self.W_k = nn.Linear(dims, dims)    # Keys 
        self.W_v = nn.Linear(dims, dims)    # Values
        self.W_o = nn.Linear(dims, dims)    # Output


    # Now to calculate the attention scores
    def attention_dot_product(self, Q, K, V, mask=None):
        '''
        Q: Query
        K: Keys
        V: Values
        mask: Can be applied to mask out certain attention score values
        '''

        attn_raw_scores = torch.matmul(Q, K.transpose(-2, -1))              # K.transpose(-2, -1) transposes the last two dimensions of the K tensor. (Refer [1].)
        scaled_attn_scores = attn_raw_scores / math.sqrt(self.d_k)

        # Apply the mask if one was given (a multi-element tensor has no single truth value, so test against None)
        if mask is not None:
            scaled_attn_scores = scaled_attn_scores.masked_fill(mask == 0, -1e9)      # Refer [2]

        # Apply the softmax activation function to turn the scores into attention probabilities
        attn_probs = torch.softmax(scaled_attn_scores, dim=-1)

        # Multiply with the Values to obtain final output
        output = torch.matmul(attn_probs, V)

        return output
        

Essentially, what's going on here is that we take the dot product of each query with every key, which gives us the attention scores. It's like YouTube's search algorithm: your search text is the query, it gets compared against keys such as each video's 'Title' and 'Description', and the videos whose keys match best are the values that come back.

<img src="assets/Calcule_attention_score1.png">
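
To make the query/key matching concrete, here is a minimal sketch with made-up 2-dimensional vectors. The numbers and the names `query`, `keys` and `best` are purely illustrative and not part of the notebook's model; the point is only that the key pointing in the same direction as the query gets the highest raw score.

```python
import torch

# One query and three keys, each a 2-dimensional vector (illustrative values)
query = torch.tensor([[1.0, 0.0]])
keys = torch.tensor([
    [1.0, 0.0],   # same direction as the query
    [0.0, 1.0],   # orthogonal to the query
    [0.5, 0.5],   # partially aligned with the query
])

# Raw attention scores: dot product of the query with every key
scores = torch.matmul(query, keys.transpose(-2, -1))   # shape (1, 3)

# Index of the best-matching key
best = scores.argmax(dim=-1).item()

print(scores)   # scores are 1.0, 0.0 and 0.5
print(best)     # index 0, the key aligned with the query
```

In real attention the queries and keys are learned projections rather than hand-picked vectors, but the scoring step is this same matrix product.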

Then, we scale these scores by dividing them by the square root of d_k, the per-head dimension we computed when splitting d_model across the heads.

<img src="assets/scaling_attention_scores.png">
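
A rough way to see why dividing by the square root of d_k helps: if the entries of the queries and keys have roughly unit variance (an assumption made here for illustration), a dot product over d_k entries has a standard deviation of about sqrt(d_k), so the scores grow with the head size and push softmax towards near one-hot outputs. The sketch below uses random data and an arbitrary `d_k = 64`, not values from this notebook.

```python
import math
import torch

torch.manual_seed(0)   # fixed seed so the run is repeatable

d_k = 64               # illustrative head dimension

# 1000 random query/key pairs with unit-variance entries (illustrative data)
Q = torch.randn(1000, d_k)
K = torch.randn(1000, d_k)

# One dot product per query/key pair
raw = (Q * K).sum(dim=-1)

# The same scores after the scaling used in attention
scaled = raw / math.sqrt(d_k)

raw_std = raw.std().item()
scaled_std = scaled.std().item()

print(f"std of raw scores:    {raw_std:.2f}")     # expected to be near sqrt(d_k) = 8
print(f"std of scaled scores: {scaled_std:.2f}")  # expected to be near 1
```

After scaling, the spread of the scores no longer depends on the head dimension, which keeps the softmax in a range where its gradients are still useful.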

[1] What does `K.transpose(-2, -1)` do?

`K.transpose(-2, -1)` transposes the last two dimensions of the K tensor.

Negative indices count from the end of the tensor shape, so -2 and -1 refer to the second-to-last and last dimensions. In the context of multi-head attention, Q and K typically have the shape:

`(batch_size, num_heads, seq_length, d_k)`

Here:

- seq_length is the length of the sequence.
- d_k is the dimension of each head (i.e., d_model // num_heads).

Why transpose? `K.transpose(-2, -1)` changes the shape of K to:

`(batch_size, num_heads, d_k, seq_length)`

This is necessary so that the matrix multiplication `torch.matmul(Q, K.transpose(-2, -1))` results in an output shape of `(batch_size, num_heads, seq_length, seq_length)`. This shape holds the attention score between each position in the sequence (each query) and every position in the sequence (each key, its own position included), which is what we need to calculate the attention distribution across the sequence.
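
The shapes described in [1] can be checked directly. This is a small sketch with arbitrary sizes chosen for illustration (`batch_size=2`, `num_heads=4`, `seq_length=5`, `d_k=8`); the tensors are random and only their shapes matter.

```python
import torch

# Illustrative sizes, not taken from the notebook
batch_size, num_heads, seq_length, d_k = 2, 4, 5, 8

Q = torch.randn(batch_size, num_heads, seq_length, d_k)
K = torch.randn(batch_size, num_heads, seq_length, d_k)

# Swap the last two dimensions of K: (..., seq_length, d_k) -> (..., d_k, seq_length)
K_t = K.transpose(-2, -1)

# Batched matrix product over the last two dimensions
scores = torch.matmul(Q, K_t)

print(K_t.shape)     # torch.Size([2, 4, 8, 5])
print(scores.shape)  # torch.Size([2, 4, 5, 5])
```

The batch and head dimensions are left alone; `torch.matmul` treats them as batch dimensions and multiplies only the trailing two.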

[2]

`mask == 0`:

The mask tensor is typically a binary tensor with values of 1 and 0. Here, 1 represents positions we want to keep, and 0 represents positions we want to ignore (mask out). `mask == 0` creates a boolean mask where True represents positions that should be masked (ignored), and False represents positions that should be retained.

`-1e9`:

-1e9 (a very large negative number) is used to "mask out" certain positions by setting their attention score to a very low value. When softmax is applied to the attention scores later, this extremely negative value effectively turns the attention probability for masked positions into 0, ensuring they don't contribute to the weighted sum in the attention mechanism.
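
The effect described in [2] can be seen on a tiny example. This is a minimal sketch with hand-picked scores and a hand-picked mask (both illustrative, not from the notebook): the third position is masked, so it should receive zero attention while the remaining probabilities still sum to 1.

```python
import torch

# Scaled attention scores for one query over three positions (illustrative values)
scores = torch.tensor([[1.0, 2.0, 3.0]])

# 1 = keep the position, 0 = mask it out
mask = torch.tensor([[1, 1, 0]])

# Replace the scores at masked positions with a very large negative number
masked_scores = scores.masked_fill(mask == 0, -1e9)

# Softmax turns the scores into probabilities; the masked position underflows to 0
probs = torch.softmax(masked_scores, dim=-1)

print(masked_scores)  # the third score is now -1e9
print(probs)          # the third probability is 0, the first two sum to 1
```

Without the mask, the third position would have received the largest share of the attention because it has the highest raw score.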