# Transformer Neural Network

In this notebood, I will be attempting to create my own transformer neural network from scratch. As many of you know, this is literally how chatGPT and many famous LLMs work under the hood. I'm excited to try this out, to further understand how this stuff ACTUALLY works


UPDATE: After multiple attempts and almost ruining my CUDA drivers, she's finally training!

UPDATE 2: SHE TALKS! She's very dumb, but she talks

My favorite responses so far:

"Hey! How are you?" ; "i'm." 

"Hey! How are you?" ; "i's a good." 

"Tell me a funny joke" ; "i'm going to the." 

"What's 2+2?" ; "i't know."


<img src="../study/assets/Screenshot 2024-11-12 144358.png">


# Resources:
1. Offical paper: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
2. Yt: https://www.youtube.com/watch?v=4Bdc55j80l8
3. Builtin: https://builtin.com/artificial-intelligence/transformer-neural-network
4. DataCamp: https://www.datacamp.com/tutorial/building-a-transformer-with-py-torch

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as data
import math
import copy
import pandas as pd
from transformers import AutoTokenizer

First, let's build the encoder layer; specifically: input embeddings and multi-headed attention.

<img src="../study/assets/attention_layer.png">

In [2]:
class multiHeadAttention(nn.Module):
    def __init__(self, dims_model, n_heads):                                  
        """
        dims_model: Dimensionality of input
        n_heads: Number of heads for the attention layer 
        """

        super(multiHeadAttention, self).__init__()              # This is fors the torch nn module class
        assert dims_model % n_heads == 0, "dims_model must be divisible by num_heads"
        '''
        In multi-head attention, the dims_model dimension (the overall dimension of each token’s embedding) is split into num_heads 
        smaller chunks so that each head can process a portion of the model’s dimension independently. The dimension of each head, 
        called d_k in the code, is calculated as dims_model // num_heads. To make this division possible, d_model needs to be evenly 
        divisible by num_heads.
        '''

        # Initialize dimensions
        self.dims_model = dims_model
        self.num_heads = n_heads
        self.d_k = dims_model // n_heads      # Dimension of each head's key, query and value

        # Now, time to transform the inputs
        self.W_q = nn.Linear(dims_model, dims_model)    # Query 
        self.W_k = nn.Linear(dims_model, dims_model)    # Keys 
        self.W_v = nn.Linear(dims_model, dims_model)    # Values
        self.W_o = nn.Linear(dims_model, dims_model)    # Output


    # Now to calculate the attention scores
    def attention_dot_product(self, Q, K, V, mask=None):
        '''
        Q: Query
        K: Keys
        V: Values
        mask: Can be applied to mask out certain attention score values
        '''

        attn_raw_scores = torch.matmul(Q, K.transpose(-2,-1 ))              # K.transpose(-2, -1) transposes the last two dimensions of the K tensor. (Refer [1].)
        scaled_attn_scores = attn_raw_scores/math.sqrt((self.d_k))

        # Applying the mask (if not none)
        if mask.any():
            scaled_attn_scores = scaled_attn_scores.masked_fill(mask==0, -1e9)      # Refer [2]

        # Aplplying softmax activation function to find attention probabilities 
        attn_probs = torch.softmax(scaled_attn_scores, dim=-1)

        # Multiply with the Values to obtain final output
        output = torch.matmul(attn_probs, V)

        return output


    # Re-shaping the inputs to have n heads (for multi head attention)
    def split_heads(self, x):
        # Refer to [1], we are transposing here to get the desired shape of 
        # (batch_size, num_heads, d_k, seq_length)
        batch_size, seq_len, dims_model = x.size()
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1,2)
    

    # After applying attention to each head separately, we combine the results
    def combine_heads(self, x):
        batch_size, _, seq_len, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_len, self.dims_model)
    

    # Oh boy, forward propogation time!
    def forward(self, Q, K, V, mask=None):
        # Applying the linear transformations and splitting heads
        Q = self.split_heads(self.W_q(Q))       # Basically passing into nn.Linear (linear transformation)
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        # Calculate the dot product (the attention scores)
        attn_output = self.attention_dot_product(Q, K, V, mask)

        # Combine the outputs from all the heads
        output = self.W_o(self.combine_heads(attn_output))

        return output        

Essentially what's going on here is that we obtain the dot product of the queries wrt keys, which is our attention score. It's like yt's search algorithm, where it looks into the database, looks at the values for the keys 'Title', 'Descriptions' and finds matching keys.

<img src="../study/assets/Calcule_attention_score1.png">

Then, we scale these scores by using d_k which we calculated to split up into multiple heads

<img src="../study/assets/scaling_attention_scores.png">

[1] What Does K.transpose(-2, -1) Do?
K.transpose(-2, -1) transposes the last two dimensions of the K tensor.

-2 and -1 in Tensor Indexing: Negative indices count from the end of the tensor shape, so -2 and -1 refer to the second-to-last and last dimensions. 
In the context of multi-head attention, Q and K typically have the shape:

`(batch_size, num_heads, seq_length, d_k)`

Here:

seq_length is the length of the sequence.
d_k is the dimension of each head (i.e., d_model // num_heads).
Why Transpose? K.transpose(-2, -1) changes the shape of K to:

`(batch_size, num_heads, d_k, seq_length)`
This is necessary so that the matrix multiplication torch.matmul(Q, K.transpose(-2, -1)) results in an output shape of (batch_size, num_heads, seq_length, seq_length). This shape represents the attention scores between each position in the sequence (each query) and all other positions (keys), which is essential for calculating the attention distribution across the sequence.

[2] 

`mask == 0`

The mask tensor is typically a binary tensor with values of 1 and 0. Here, 1 represents positions we want to keep, and 0 represents positions we want to ignore (mask out).
mask == 0 creates a boolean mask where True represents positions that should be masked (ignored), and False represents positions that should be retained.
`-1e9`:

-1e9 (a very large negative number) is used to "mask out" certain positions by setting their attention score to a very low value. When softmax is applied to the attention scores later, this extremely negative value effectively turns the attention probability for masked positions into 0, ensuring they don’t contribute to the weighted sum in the attention mechanism.

Now that we have the multi-headed attention part, it's time for the Feed Forward part.

$$ \text{FFN}(x) = \text{Linear}_2(\text{ReLU}(\text{Linear}_1(x))) $$

This is the general equation for the forward feed. We are basically applying non-linear activation to the outputs of the multi-head attention

In [3]:
class positionWiseFeedForward(nn.Module):
    def __init__(self, dims_model, d_ff):
        super(positionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(dims_model, d_ff)
        self.fc2 = nn.Linear(d_ff, dims_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        layer_one = self.fc1(x)
        activation_layer = self.relu(layer_one)
        output_layer = self.fc2(activation_layer)

        return output_layer

Now time for positional encoding. Positional Encoding is used to inject the position information of each token in the input sequence.

Even Indices
$$
PE(p, 2i) = \sin\left(\frac{p}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
$$

Odd Indices
$$
PE(p, 2i+1) = \cos\left(\frac{p}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
$$

In [4]:
class positionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_len):
        super(positionalEncoding, self).__init__()
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)                     
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]


1. `position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)`

- *`torch.arange(0, max_seq_length, dtype=torch.float)`*:
  - This creates a 1D tensor of floating-point numbers from `0` to `max_seq_length - 1`. For example, if `max_seq_length` is 5, it will create: `[0.0, 1.0, 2.0, 3.0, 4.0]`.
  
- *`.unsqueeze(1)`*:
  - This adds an extra dimension at position `1` (the second dimension). This is important because we want to treat the positions in a sequence as a column vector for each token.
  - After this, the shape of `position` will be `(max_seq_length, 1)`. For example, if `max_seq_length` is 5, the result will look like this:
    ```
    [[0.0],
     [1.0],
     [2.0],
     [3.0],
     [4.0]]
    ```

This tensor represents the position of each token in the sequence (starting from 0, 1, 2, etc.), and we will later use it to calculate the positional encodings for each token in the sequence.

2. `div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))`

- *`torch.arange(0, d_model, 2)`*:
  - This creates a tensor starting from `0` to `d_model - 1`, but it only includes every second number (i.e., it has steps of 2). So, if `d_model = 6`, it creates: `[0, 2, 4]`.

- *`.float()`*:
  - This converts the tensor into floating-point numbers, so we can perform precise mathematical operations later.

- *`-(math.log(10000.0) / d_model)`*:
  - This is a scaling factor based on the constant `10000.0`. The logarithm of `10000` is divided by `d_model`. This step is used to scale the frequencies of the sinusoidal functions to get different wavelengths for each dimension in the positional encoding.

- *`torch.exp(...)`*:
  - The `torch.exp()` function takes the result of the multiplication and applies the exponential function (i.e., raising `e` to the power of the value). This step creates the *divisor terms** for each dimension of the positional encoding, which control how quickly the sine and cosine functions oscillate.

Putting It All Together:

- *`position`* is a tensor that represents the position of each token in the sequence.
- *`div_term`* is a scaling factor (exponentially spaced) that controls the wavelength of each sinusoidal wave used for encoding.
  
Together, `position` and `div_term` are used to generate the *sinusoidal** positional encoding that is added to the token embeddings, helping the model understand the *relative position* of each token in the sequence.

Example in Simple Terms:
- The `position` tensor is like a list of "indices" (positions 0, 1, 2,... for each token).
- The `div_term` tensor defines how the positional encoding will *oscillate** at different frequencies depending on the token's position and the dimension in the encoding.


self.dropout: Dropout layer, used to prevent overfitting by randomly setting some activations to zero during training.

Now, it's time to combine all these pieces to make the encode layer.

In [5]:
class encoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super(encoderLayer, self).__init__()
        self.attn_layer = multiHeadAttention(d_model, n_heads)
        self.feed_forward = positionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)


    def forward(self, x, mask):
        attn_output = self.attn_layer(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        feed_forward = self.feed_forward(x)
        x = self.norm2(x + self.dropout(feed_forward))
        return x

## Decoder Layer Time!
<img src='../study/assets/Decoder.png'>

x: The input to the decoder layer.

enc_output: The output from the corresponding encoder (used in the cross-attention step).

src_mask: Source mask to ignore certain parts of the encoder's output.

tgt_mask: Target mask to ignore certain parts of the decoder's input.


In [6]:
class decoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super(decoderLayer, self).__init__()
        self.self_attn = multiHeadAttention(d_model, n_heads)
        self.cross_attn = multiHeadAttention(d_model, n_heads)
        self.feed_forward = positionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    
    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output =self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))

        return x


1. Self-Attention with Target Mask (`self_attn`)

Purpose
In the decoder, self-attention computes relationships **within the target sequence** being generated. For example, when generating the third word, the model should only consider the first two words, not future words. The **target mask (tgt_mask)** ensures that attention weights for future positions are zeroed out, forcing the model to rely only on previously generated tokens.

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$
where:
- \( Q \), \( K \), and \( V \) are the query, key, and value matrices, respectively.
- \( d_k \) is the dimensionality of the keys (a scaling factor).

After Masking

$$
\text{MaskedAttention}(Q, K, V, M) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) V
$$


2. Cross-Attention (`cross_attn`)

Purpose
Cross-attention allows the decoder to attend to relevant information from the encoder’s output (which represents the processed source sequence). By computing cross-attention, the decoder learns what parts of the input sequence to focus on for generating each token of the output sequence.

Mathematical Calculation

Cross-attention computes the weighted sum over the encoder output (`V`) based on how similar the keys (`K`) are to the queries (`Q`). The attention formula here is the same as self-attention:
$$
\text{CrossAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$

After masking:
$$
\text{MaskedCrossAttention}(Q, K, V, M_{\text{src}}) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M_{\text{src}}\right) V
$$

The **source mask (src_mask)** can be used here to prevent the model from attending to padding tokens in the encoder’s output (e.g., if the source sequence has padding at the end).


3. Position-Wise Feed-Forward Network (`feed_forward`)

After self-attention and cross-attention, the decoder applies a **position-wise feed-forward network**. This is a simple two-layer network with a ReLU activation that transforms each token representation independently. The formula is:
$$
\text{FFN}(x) = \text{ReLU}(x W_1 + b_1) W_2 + b_2
$$
where \( W_1 \) and \( W_2 \) are learned weights, and \( b_1 \) and \( b_2 \) are biases.



### It is now time... To put things together and form the
# TRANSFORMER!!!

<img src="../study/assets/Screenshot 2024-11-12 144358.png">

In [7]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, n_heads, n_layers, d_ff, max_seq_len, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = positionalEncoding(d_model, max_seq_len)

        self.encode_layers = [(encoderLayer(d_model, n_heads, d_ff, dropout)) for _ in range(n_layers)]
        self.decode_layers = [(decoderLayer(d_model, n_heads, d_ff, dropout)) for _ in range(n_layers)]

        self.fully_connected = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        src_mask =  (src != 0).unsqueeze(1).unsqueeze(2)        # Gets rid of values that are 0 in the src tensor           Unsqueeze changes the mask shape from [batch_size, seq_length]
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)         # Same for the tgt sensor                                   to [batch_size, 1, 1, seq_length]
        seq_length = tgt.size(1)
        nopeek_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()                            # refer [1]
        tgt_mask = tgt_mask & nopeek_mask
        return src_mask, tgt_mask
    
    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedding = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedding = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedding
        for enc_layer in self.encode_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedding
        for dec_layer in self.decode_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        
        output = self.fully_connected(dec_output)

        return output


[1]

This line creates a **"no-peek" mask** for the target sequence in the Transformer. The purpose of the no-peek mask is to ensure that during training or decoding, the model does not look at tokens beyond the current position. This is crucial for auto-regressive tasks like text generation.


**`torch.ones(1, seq_length, seq_length)`**
- This creates a tensor filled with ones of shape `(1, seq_length, seq_length)`. The shape ensures compatibility with batched processing.
- Example when `seq_length = 4`:
  ```python
  torch.ones(1, 4, 4) ->
  [[[1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1]]]
  ```

**`torch.triu(..., diagonal=1)`**
- `torch.triu` stands for "upper triangular," and it sets all elements **below a certain diagonal** to zero, leaving only the upper triangular part.
- The `diagonal=1` argument means the diagonal starts **above the main diagonal**.
- Example result of `torch.triu(torch.ones(1, 4, 4), diagonal=1)`:
  ```python
  [[[0, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0]]]
  ```

**`1 - ...`**
- Subtracting the upper triangular matrix from `1` flips the mask:
  - All `1`s in the upper triangular part become `0`s.
  - All `0`s in the lower triangular part become `1`s.
- Example after `1 - torch.triu(..., diagonal=1)`:
  ```python
  [[[1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1]]]
  ```

**`.bool()`**
- Converts the mask from numeric (`0`s and `1`s) to Boolean values (`False` and `True`).
- Example result:
  ```python
  [[[True, False, False, False],
    [True, True, False, False],
    [True, True, True, False],
    [True, True, True, True]]]
  ```

Why Use This?
This mask ensures that when the model is predicting a token at position \(i\), it **cannot attend to future tokens \(j > i\)**. For example:
- When predicting the first token, it can only attend to itself.
- When predicting the second token, it can attend to the first and second tokens, and so on.

By making the upper triangle 0s, the attention score given to the future tokens are 0. Padding mask is applied for the reverse where we avoid previously generated tokens that are irrelevant.



## Training Time!

Here, I'm going to try to train this model using synthetic data that I generated as well as the Cornell movie dialogs corpus pre-processed.

In [8]:
data_csv = pd.read_csv(r'C:\Users\User\projects\NeuralNetwork\data\tnn_train.csv')
inputs = data_csv['Input'].tolist()
outputs = data_csv['Output'].tolist()

In [9]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

input_tokens = [tokenizer.encode(text, padding='max_length', max_length=32, truncation=True) for text in inputs]
output_tokens = [tokenizer.encode(text, padding='max_length', max_length=32, truncation=True) for text in outputs]



In [10]:
input_tensor = torch.tensor(input_tokens)
output_tensor = torch.tensor(output_tokens)

dataset = data.TensorDataset(input_tensor, output_tensor)
dataloader = data.DataLoader(dataset, batch_size=32, shuffle=True)

In [11]:
TransformerModel = Transformer(
    src_vocab_size=50000,
    tgt_vocab_size=50000,
    d_model=512,
    n_heads=8,
    n_layers=6,
    d_ff=2048,
    max_seq_len=100,
    dropout=0.1
)

criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)  
optimizer = optim.Adam(TransformerModel.parameters(), lr=5e-5, weight_decay=1e-4)

In [15]:
# Training main loop
NUM_EPOCHS = 20

for epoch in range(NUM_EPOCHS):
    TransformerModel.train()
    epoch_loss = 0

    for batch in dataloader:
        src, tgt = batch

        # Shift target tokens for decoder input and output
        tgt_input = tgt[:, :-1]
        tgt_output = tgt[:, 1:]

        # Forward pass
        optimizer.zero_grad()
        output = TransformerModel(src, tgt_input)

        # Compute loss with reshape
        loss = criterion(output.reshape(-1, output.size(-1)), tgt_output.reshape(-1))
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    print(f"Epoch {epoch + 1}/{NUM_EPOCHS}, Loss: {epoch_loss:.4f}") 

Epoch 1/20, Loss: 1156.9636
Epoch 2/20, Loss: 898.2604
Epoch 3/20, Loss: 724.5180
Epoch 4/20, Loss: 635.6328
Epoch 5/20, Loss: 596.1202
Epoch 6/20, Loss: 574.8221
Epoch 7/20, Loss: 561.7568
Epoch 8/20, Loss: 551.4864
Epoch 9/20, Loss: 541.4934
Epoch 10/20, Loss: 533.4329
Epoch 11/20, Loss: 527.5237
Epoch 12/20, Loss: 521.1897
Epoch 13/20, Loss: 515.4237
Epoch 14/20, Loss: 509.8563
Epoch 15/20, Loss: 504.4478
Epoch 16/20, Loss: 500.8478
Epoch 17/20, Loss: 496.2933
Epoch 18/20, Loss: 491.0588
Epoch 19/20, Loss: 488.7364
Epoch 20/20, Loss: 484.8329


In [47]:
# Saving the model
torch.save(TransformerModel, "transformer_model_full.pth")

In [12]:
# Loading the model itself
# Load the saved model
TransformerModel = torch.load("transformer_model_full.pth")

# Switch to evaluation mode
TransformerModel.eval()

  TransformerModel = torch.load("transformer_model_full.pth")


Transformer(
  (encoder_embedding): Embedding(50000, 512)
  (decoder_embedding): Embedding(50000, 512)
  (positional_encoding): positionalEncoding()
  (fully_connected): Linear(in_features=512, out_features=50000, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

In [13]:
# Function to tokenize inputs, and decode outputs from the model.
def generate_response_direct(transformer_model, tokenizer, user_input, max_len=50):
    # Tokenize the input and convert to tensor
    src = tokenizer.encode(user_input, return_tensors="pt")
    tgt_input = torch.tensor([[tokenizer.cls_token_id]])  # Start with [CLS] token

    for _ in range(max_len):
        # Generate output logits from the transformer model
        output = transformer_model(src, tgt_input)

        # Get the last token's logits and find the most likely token
        next_token_logits = output[:, -1, :]  # Shape: (1, vocab_size)
        next_token_id = torch.argmax(next_token_logits, dim=-1).item()

        # Append the predicted token to the decoder input
        tgt_input = torch.cat([tgt_input, torch.tensor([[next_token_id]])], dim=1)

        # Stop if the [SEP] token is generated
        if next_token_id == tokenizer.sep_token_id:
            break

    # Decode the generated tokens to a string
    response = tokenizer.decode(tgt_input.squeeze().tolist(), skip_special_tokens=True)
    return response


In [50]:
user_input = "Do you like movies?"
response = generate_response_direct(TransformerModel, tokenizer, user_input)
print(f"Model: {response}")


Model: i'm.
