# Transformer Neural Network

In this notebook, I will be attempting to create my own transformer neural network from scratch. As many of you know, this is the architecture that ChatGPT and many other famous LLMs use under the hood. I'm excited to try this out and further understand how this stuff ACTUALLY works.

<img src="assets/Screenshot 2024-11-12 144358.png">

Note: I will be using PyTorch for this one, but later I will try to implement one from scratch, similar to my previous project. But we shall see :salute_face:

# Resources:
1. YouTube: https://www.youtube.com/watch?v=4Bdc55j80l8
2. DataCamp: https://www.datacamp.com/tutorial/building-a-transformer-with-py-torch

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy

First, let's build the pieces of the encoder layer, starting with multi-headed attention.

<img src="assets/attention_layer.png">

In [2]:
class multiHeadAttention(nn.Module):
    def __init__(self, dims_model, n_heads):                                  
        """
        dims_model: Dimensionality of input
        n_heads: Number of heads for the attention layer 
        """

        super(multiHeadAttention, self).__init__()              # Initialize the parent torch nn.Module class
        assert dims_model % n_heads == 0, "dims_model must be divisible by num_heads"
        '''
        In multi-head attention, the dims_model dimension (the overall dimension of each token’s embedding) is split into n_heads 
        smaller chunks so that each head can process a portion of the model’s dimension independently. The dimension of each head, 
        called d_k in the code, is calculated as dims_model // n_heads. To make this division possible, dims_model needs to be evenly 
        divisible by n_heads.
        '''

        # Initialize dimensions
        self.dims_model = dims_model
        self.num_heads = n_heads
        self.d_k = dims_model // n_heads      # Dimension of each head's key, query and value

        # Now, time to transform the inputs
        self.W_q = nn.Linear(dims_model, dims_model)    # Query 
        self.W_k = nn.Linear(dims_model, dims_model)    # Keys 
        self.W_v = nn.Linear(dims_model, dims_model)    # Values
        self.W_o = nn.Linear(dims_model, dims_model)    # Output


    # Now to calculate the attention scores
    def attention_dot_product(self, Q, K, V, mask=None):
        '''
        Q: Query
        K: Keys
        V: Values
        mask: Can be applied to mask out certain attention score values
        '''

        attn_raw_scores = torch.matmul(Q, K.transpose(-2,-1 ))              # K.transpose(-2, -1) transposes the last two dimensions of the K tensor. (Refer [1].)
        scaled_attn_scores = attn_raw_scores/math.sqrt((self.d_k))

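        # Applying the mask (if one was given). Note: `if mask:` would raise an error
        # for a tensor with more than one element, so we compare against None instead.
        if mask is not None: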
            scaled_attn_scores = scaled_attn_scores.masked_fill(mask==0, -1e9)      # Refer [2]

        # Applying the softmax activation function to find attention probabilities 
        attn_probs = torch.softmax(scaled_attn_scores, dim=-1)

        # Multiply with the Values to obtain final output
        output = torch.matmul(attn_probs, V)

        return output


    # Re-shaping the inputs to have n heads (for multi head attention)
    def split_heads(self, x):
        # We reshape the last dimension into (num_heads, d_k) and then transpose
        # to get the desired shape of (batch_size, num_heads, seq_len, d_k)
        batch_size, seq_len, dims_model = x.size()
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1,2)
    

    # After applying attention to each head separately, we combine the results
    def combine_heads(self, x):
        batch_size, _, seq_len, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_len, self.dims_model)
    

    # Oh boy, forward propagation time!
    def forward(self, Q, K, V, mask=None):
        # Applying the linear transformations and splitting heads
        Q = self.split_heads(self.W_q(Q))       # Basically passing into nn.Linear (linear transformation)
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        # Calculate the dot product (the attention scores)
        attn_output = self.attention_dot_product(Q, K, V, mask)

        # Combine the outputs from all the heads
        output = self.W_o(self.combine_heads(attn_output))

        return output        

Essentially, what's going on here is that we take the dot product of the queries with the keys, which gives us our attention scores. It's like YouTube's search algorithm: your query is compared against keys such as 'Title' and 'Descriptions' for every video in the database, and the best-matching keys decide which values (the videos) you get back.

<img src="assets/Calcule_attention_score1.png">

Then, we scale these scores by dividing them by the square root of d_k, the per-head dimension we calculated when splitting the model dimension across multiple heads.

<img src="assets/scaling_attention_scores.png">
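
To get a feel for why the scaling helps, here is a rough sketch using only the standard library. It assumes query and key entries are independent with zero mean and unit variance (an idealisation, not something the notebook guarantees), and `d_k = 64` is a hypothetical head size. Under that assumption the raw dot products spread out by roughly `sqrt(d_k)`, and dividing by `sqrt(d_k)` brings the spread back to about 1, which keeps softmax from saturating.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def dot_product_std(d_k, n_samples=2000):
    """Standard deviation of q . k for random unit-variance vectors of length d_k."""
    dots = []
    for _ in range(n_samples):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)


d_k = 64  # hypothetical head dimension, e.g. dims_model=512 with n_heads=8
raw_std = dot_product_std(d_k)
scaled_std = raw_std / math.sqrt(d_k)

print(f"raw std:    {raw_std:.2f}")     # close to sqrt(64) = 8
print(f"scaled std: {scaled_std:.2f}")  # close to 1
```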

[1] What does `K.transpose(-2, -1)` do?

`K.transpose(-2, -1)` transposes the last two dimensions of the K tensor. Negative indices count from the end of the tensor shape, so -2 and -1 refer to the second-to-last and last dimensions.

In the context of multi-head attention, Q and K typically have the shape:

`(batch_size, num_heads, seq_length, d_k)`

Here:

- seq_length is the length of the sequence.
- d_k is the dimension of each head (i.e., dims_model // num_heads).

Why transpose? `K.transpose(-2, -1)` changes the shape of K to:

`(batch_size, num_heads, d_k, seq_length)`

This is necessary so that the matrix multiplication `torch.matmul(Q, K.transpose(-2, -1))` results in an output shape of `(batch_size, num_heads, seq_length, seq_length)`. This shape represents the attention scores between each position in the sequence (each query) and every position in the sequence (the keys), which is essential for calculating the attention distribution across the sequence.
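
The shapes described above can be checked with a few random tensors. This is only a sketch: the sizes are made up for illustration, and the last lines confirm that one entry of the score matrix is just the dot product of one query vector with one key vector.

```python
import torch

batch_size, num_heads, seq_length, d_k = 2, 4, 5, 8  # hypothetical sizes

Q = torch.randn(batch_size, num_heads, seq_length, d_k)
K = torch.randn(batch_size, num_heads, seq_length, d_k)

K_t = K.transpose(-2, -1)      # swap the last two dimensions
scores = torch.matmul(Q, K_t)  # batched matmul over (batch, head)

print(K_t.shape)     # torch.Size([2, 4, 8, 5])
print(scores.shape)  # torch.Size([2, 4, 5, 5])

# Entry [i, j] is the dot product of query i with key j (same batch and head).
single = torch.dot(Q[0, 0, 1], K[0, 0, 3])
print(torch.allclose(scores[0, 0, 1, 3], single, atol=1e-5))
```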

[2] What does `masked_fill(mask == 0, -1e9)` do?

`mask == 0`:

The mask tensor is typically a binary tensor with values of 1 and 0. Here, 1 represents positions we want to keep, and 0 represents positions we want to ignore (mask out).
`mask == 0` creates a boolean mask where True represents positions that should be masked (ignored), and False represents positions that should be retained.

`-1e9`:

-1e9 (a very large negative number) is used to "mask out" certain positions by setting their attention score to a very low value. When softmax is applied to the attention scores later, this extremely negative value effectively turns the attention probability for masked positions into 0, ensuring they don’t contribute to the weighted sum in the attention mechanism.
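
Here is a tiny illustration of the masking step on a single row of scores. The numbers and the mask are invented for the example (think of the last slot as a padding token); the point is only to show that the masked position ends up with zero probability while the remaining positions still sum to 1.

```python
import torch

# One row of scaled attention scores for a sequence of 4 positions.
scores = torch.tensor([[2.0, 1.0, 0.5, 3.0]])
# 1 = keep, 0 = ignore (for example, a padding token in the last slot).
mask = torch.tensor([[1, 1, 1, 0]])

masked_scores = scores.masked_fill(mask == 0, -1e9)
probs = torch.softmax(masked_scores, dim=-1)

print(masked_scores)  # the last score has been replaced by -1e9
print(probs)          # the last probability is 0; the others sum to 1
```

Without the mask, the last position would have received the largest weight, since 3.0 is the highest raw score in the row.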

Now that we have the multi-headed attention part, it's time for the Feed Forward part.

$$ \text{FFN}(x) = \text{Linear}_2(\text{ReLU}(\text{Linear}_1(x))) $$

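This is the general equation for the feed-forward network. We are basically applying two linear layers, with a non-linear activation (ReLU) in between, to the outputs of the multi-head attention. The same weights are used at every position in the sequence, which is why it is called "position-wise".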
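
As a quick sanity check of the equation, the sketch below builds the same three-step pipeline with `nn.Sequential` (a stand-in for illustration, not the class used in this notebook) and hypothetical sizes. It then compares running the whole sequence at once against running each position separately, to show that every token is transformed independently with shared weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dims_model, d_ff = 8, 32  # hypothetical sizes
ffn = nn.Sequential(      # Linear_2(ReLU(Linear_1(x)))
    nn.Linear(dims_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, dims_model),
)

x = torch.randn(2, 5, dims_model)  # (batch_size, seq_len, dims_model)

whole = ffn(x)  # all positions at once
one_by_one = torch.stack(  # each position on its own, then re-assembled
    [ffn(x[:, t]) for t in range(x.size(1))], dim=1
)

print(whole.shape)  # torch.Size([2, 5, 8])
print(torch.allclose(whole, one_by_one, atol=1e-5))
```

The output shape matches the input shape, so the block can be dropped in after attention without any reshaping.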

In [3]:
class positionWiseFeedForward(nn.Module):
    def __init__(self, dims_model, d_ff):
        super(positionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(dims_model, d_ff)
        self.fc2 = nn.Linear(d_ff, dims_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        layer_one = self.fc1(x)
        activation_layer = self.relu(layer_one)
        output_layer = self.fc2(activation_layer)

        return output_layer

Now it's time for positional encoding. Attention on its own has no notion of token order, so positional encoding is used to inject the position information of each token in the input sequence.

Even Indices
$$
PE(p, 2i) = \sin\left(\frac{p}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
$$

Odd Indices
$$
PE(p, 2i+1) = \cos\left(\frac{p}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
$$

In [4]:
class positionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_len):
        super(positionalEncoding, self).__init__()
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)                     
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]


1. `position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)`

- *`torch.arange(0, max_seq_len, dtype=torch.float)`*:
  - This creates a 1D tensor of floating-point numbers from `0` to `max_seq_len - 1`. For example, if `max_seq_len` is 5, it will create: `[0.0, 1.0, 2.0, 3.0, 4.0]`.
  
- *`.unsqueeze(1)`*:
  - This adds an extra dimension at position `1` (the second dimension), turning the positions into a column vector with one row per token. That shape lets `position` broadcast against `div_term` when the two are multiplied.
  - After this, the shape of `position` will be `(max_seq_len, 1)`. For example, if `max_seq_len` is 5, the result will look like this:
    ```
    [[0.0],
     [1.0],
     [2.0],
     [3.0],
     [4.0]]
    ```

This tensor represents the position of each token in the sequence (starting from 0, 1, 2, etc.), and we will later use it to calculate the positional encodings for each token in the sequence.

2. `div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))`

- *`torch.arange(0, d_model, 2)`*:
  - This creates a tensor starting from `0` to `d_model - 1`, but it only includes every second number (i.e., it has steps of 2). So, if `d_model = 6`, it creates: `[0, 2, 4]`.

- *`.float()`*:
  - This converts the tensor into floating-point numbers, so we can perform precise mathematical operations later.

- *`-(math.log(10000.0) / d_model)`*:
  - This is a scaling factor based on the constant `10000.0`. The logarithm of `10000` is divided by `d_model`. This step is used to scale the frequencies of the sinusoidal functions to get different wavelengths for each dimension in the positional encoding.

- *`torch.exp(...)`*:
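  - The `torch.exp()` function takes the result of the multiplication and applies the exponential function (i.e., raising `e` to the power of the value). This step creates the **divisor terms** for each dimension of the positional encoding: each entry equals `1 / 10000^(2i / d_model)`, so multiplying `position` by `div_term` is the same as dividing by the denominator in the formulas above. These terms control how quickly the sine and cosine functions oscillate.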
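
The exp/log form can look unrelated to the `10000^(2i / d_model)` in the formulas, so here is a small check in plain Python that the two agree. `d_model = 6` is just a hypothetical size chosen to keep the output short.

```python
import math

d_model = 6  # hypothetical small size so the values are easy to read

via_exp_terms = []
direct_terms = []
for two_i in range(0, d_model, 2):  # 0, 2, 4: the values of torch.arange(0, d_model, 2)
    # The form used in the code: exp(2i * -(ln(10000) / d_model))
    via_exp_terms.append(math.exp(two_i * -(math.log(10000.0) / d_model)))
    # The form used in the formulas: 1 / 10000^(2i / d_model)
    direct_terms.append(1.0 / (10000.0 ** (two_i / d_model)))

for a, b in zip(via_exp_terms, direct_terms):
    print(a, b, math.isclose(a, b, rel_tol=1e-9))
```

The first term is exactly 1, and the terms shrink from there, so the early dimensions oscillate quickly and the later ones slowly.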

Putting It All Together:

- *`position`* is a tensor that represents the position of each token in the sequence.
- *`div_term`* is a scaling factor (exponentially spaced) that controls the wavelength of each sinusoidal wave used for encoding.
  
Together, `position` and `div_term` are used to generate the **sinusoidal** positional encoding that is added to the token embeddings, helping the model understand the *relative position* of each token in the sequence.

Example in Simple Terms:
- The `position` tensor is like a list of "indices" (positions 0, 1, 2,... for each token).
- The `div_term` tensor defines how the positional encoding will **oscillate** at different frequencies depending on the token's position and the dimension in the encoding.
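
To make the formulas concrete, the sketch below computes the same sine/cosine table with plain Python loops instead of tensors. The function name `positional_encoding` and the sizes are made up for this example; it is meant for checking a few numbers by hand, not as a replacement for the PyTorch class.

```python
import math


def positional_encoding(max_seq_len, d_model):
    """Plain-Python version of the sin/cos table, one row per position."""
    pe = [[0.0] * d_model for _ in range(max_seq_len)]
    for p in range(max_seq_len):
        for two_i in range(0, d_model, 2):
            angle = p / (10000.0 ** (two_i / d_model))
            pe[p][two_i] = math.sin(angle)          # even index
            if two_i + 1 < d_model:
                pe[p][two_i + 1] = math.cos(angle)  # odd index
    return pe


pe = positional_encoding(max_seq_len=4, d_model=6)
print(pe[0])                         # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print([round(v, 4) for v in pe[1]])  # position 1: each sin/cos pair uses a slower frequency
```

Position 0 always comes out as alternating 0s and 1s because sin(0) = 0 and cos(0) = 1, which makes it a handy first thing to check when debugging.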

Now, it's time to combine all these pieces to make the encoder layer.

In [5]:
class encoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super(encoderLayer, self).__init__()
        self.attn_layer = multiHeadAttention(d_model, n_heads)
        self.feed_forward = positionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)


    def forward(self, x, mask):      # Must be spelled `forward` so nn.Module calls it
        attn_output = self.attn_layer(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        feed_forward = self.feed_forward(x)
        x = self.norm2(x + self.dropout(feed_forward))
        return x