# Transformer Core Math Tutorial

![Self Portrait - This was an attempt by a transformer based model to render a model of itself](SelfPortrait.jpg)
_This image, a self-portrait, was generated by DALL-E._


GPT and other LLMs and LMMs are an amazing advance in AI. But how do they work? They all use an ML model called a transformer. Transformers allow AI to learn the complex relationships between tokens in the training data; in other words, to learn the semantics, grammar, and even underlying knowledge encoded in natural language and images.

This tutorial will focus on the core math that makes a transformer block work, using multi headed attention as well as position and token embedding.  

Most of the descriptive explanations and the code samples for this tutorial were generated by ChatGPT. In some cases the initial code had minor errors; these were fixed by feeding the error messages back into GPT-4, which then generated corrected code.

This is an advanced tutorial which builds the main components of the Transformer model, the multi headed attention mechanism and the position and token embedding, from scratch in PyTorch.

Try using the following prompt to generate your own transformer tutorial. There is a lot of code to output, and GPT can easily lose attention if the response is too long, so you may need to break the prompt up into smaller pieces. You can also ask follow-up questions to get it to explain how the code works. Start with this prompt, and go from there:

#### Prompt: 
```
How can I build a transformer model for sentiment analysis using IMDB with multi headed attention and position and token embedding from scratch using pytorch
```



In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import math
import numpy as np
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences


d_model = 128
num_heads = 8
d_ff = 2048
dropout = 0.1
vocab_size = 20000
max_seq_len = 200




# Load the IMDB Data Set

The Keras IMDB dataset is a popular dataset for sentiment analysis tasks in natural language processing (NLP). It contains 50,000 movie reviews from the Internet Movie Database (IMDB) labeled as either positive (1) or negative (0) based on the sentiment expressed in the review. The dataset is divided into 25,000 reviews for training and 25,000 reviews for testing.

The reviews in the dataset have been preprocessed, and each review is encoded as a sequence of word indices (integers). The indices represent the overall frequency rank of the words in the entire dataset. For instance, the integer "3" encodes the 3rd most frequent word in the data. This encoding allows for faster processing and less memory usage compared to working with raw text data.
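
To make the frequency-rank idea concrete, here is a minimal sketch on a toy corpus. The corpus and the helper name `build_rank_index` are made up for illustration; this is not the Keras preprocessing code, and it ignores special ids such as the 0 used for padding later in this tutorial.

```python
from collections import Counter

def build_rank_index(reviews):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for review in reviews for word in review.split())
    # Sort by descending count, breaking ties alphabetically for determinism
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

toy_reviews = [
    "the movie was great",
    "the movie was bad",
    "the acting was great",
]
word_to_rank = build_rank_index(toy_reviews)
encoded = [[word_to_rank[w] for w in review.split()] for review in toy_reviews]

print(word_to_rank)
print(encoded[0])  # [1, 4, 2, 3]
```

Frequent words get small integers, so capping the vocabulary at `num_words` keeps only the most common words.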

The Keras IMDB dataset is typically used for binary classification tasks, where the goal is to build a machine learning model that can predict whether a given movie review is positive or negative based on the text content. The dataset is accessible through the tensorflow.keras.datasets module in the TensorFlow library.



In [2]:
class IMDBDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return torch.tensor(self.x[idx], dtype=torch.long), torch.tensor(self.y[idx], dtype=torch.float)

def load_imdb_data(num_words, max_seq_len):
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_words)

    # Pad sequences to max_seq_len
    x_train = pad_sequences(x_train, maxlen=max_seq_len, padding='post', truncating='post')
    x_test = pad_sequences(x_test, maxlen=max_seq_len, padding='post', truncating='post')

    return x_train, y_train, x_test, y_test
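
The effect of `padding='post'` and `truncating='post'` can be illustrated without TensorFlow. The helper `pad_post` below is a hypothetical stand-in written for this explanation, not the Keras implementation: short reviews are filled with zeros on the right, and long reviews lose their trailing tokens.

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate each sequence to maxlen from the end, then right-pad with value."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                # 'post' truncation drops trailing tokens
        kept += [value] * (maxlen - len(kept))   # 'post' padding appends the fill value
        padded.append(kept)
    return padded

result = pad_post([[1, 14, 22], [1, 6, 7, 8, 9, 10]], maxlen=4)
print(result)  # [[1, 14, 22, 0], [1, 6, 7, 8]]
```

This is why the first test review printed further down ends in a long run of zeros.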


In [3]:
# Example usage:
num_words = vocab_size
batch_size = 16

x_train, y_train, x_test, y_test = load_imdb_data(num_words, max_seq_len)

train_dataset = IMDBDataset(x_train, y_train)
test_dataset = IMDBDataset(x_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

print (x_test.shape)
print (y_test.shape)

for i in range(5):
    print(f"IMDB element {i} value: {x_test[i]}")
    print(f"IMDB element {i} label: {y_test[i]}\n")

(25000, 200)
(25000,)
IMDB element 0 value: [    1   591   202    14    31     6   717    10    10 18142 10698     5
     4   360     7     4   177  5760   394   354     4   123     9  1035
  1035  1035    10    10    13    92   124    89   488  7944   100    28
  1668    14    31    23    27  7479    29   220   468     8   124    14
   286   170     8   157    46     5    27   239    16   179 15387    38
    32    25  7944   451   202    14     6   717     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0 

# Token and Position Embedding

This class takes as input the vocabulary size `vocab_size`, the embedding dimension `embed_size` (set to `d_model` in this tutorial), and the maximum sequence length `max_len`. The `forward` method takes a tensor of token ids with shape `(batch_size, sequence_length)` and outputs the combined token and position embeddings with shape `(batch_size, sequence_length, embed_size)`. The token embedding is learned, while the position encoding is a fixed sinusoidal table stored in a buffer.



In [4]:
class TokenPositionEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_size, max_len=5000):
        super(TokenPositionEmbedding, self).__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_size)
        self.positional_encoding = torch.zeros(max_len, embed_size)
        
        # Create position encoding
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_size, 2) * -(math.log(10000.0) / embed_size))
        
        self.positional_encoding[:, 0::2] = torch.sin(position * div_term)
        self.positional_encoding[:, 1::2] = torch.cos(position * div_term)
        
        self.positional_encoding = self.positional_encoding.unsqueeze(0)
        self.register_buffer('pe', self.positional_encoding, persistent=False)

    def forward(self, x):
        x = self.token_embedding(x) # (batch_size, seq_len, embed_size)
        # Add positional encoding
        x = x + self.pe[:, :x.size(1)]
        return x


## What is the purpose of the token and position embedding, and how is it different from a token embedding without a position embedding?

### Token Embedding

The concepts of token embeddings and position embeddings play crucial roles in processing sequential data like text. Let's explore each of these components:

Token embeddings convert each token (like a word in a sentence) into a vector of fixed size. This vector representation captures the semantic information of the token, enabling the model to understand and process language.

In practice, each unique token in the vocabulary is assigned a corresponding vector. These vectors are learned during the training process and are adjusted to encapsulate the meanings and relationships of words.
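
As a small sketch of this lookup, the snippet below uses toy sizes (`toy_vocab_size` and `toy_dim` are illustrative values, not the tutorial's settings) to show that `nn.Embedding` simply selects rows of a learnable weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_vocab_size, toy_dim = 10, 4           # illustrative sizes only
embedding = nn.Embedding(toy_vocab_size, toy_dim)

token_ids = torch.tensor([[1, 5, 5, 0]])  # one sequence of four token ids
vectors = embedding(token_ids)            # shape: (1, 4, toy_dim)

# The lookup is equivalent to indexing rows of the weight matrix
same_as_indexing = torch.equal(vectors, embedding.weight[token_ids])
print(vectors.shape, same_as_indexing)
```

The repeated id 5 maps to the same vector both times, which is why a token embedding alone carries no information about where a token appears.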

If a transformer model uses only token embeddings, it would be able to understand the meaning of each word but not the order in which they appear. Language is inherently sequential, and the order of words affects the overall meaning of a sentence. Without position information, sentences with the same words in different orders would appear identical to the model.

### Position Embedding

Position embeddings are added to the model to give it an understanding of the order or position of words in a sequence. This is crucial for understanding the structure and meaning of sentences.

Position embeddings are vectors that represent the position of each token in the sequence. These vectors are either learned during training or are predefined and based on mathematical functions (like sine and cosine functions).

When combined with token embeddings, the model not only understands the meaning of each word but also the context provided by their order in the sentence. This combination allows the transformer to process sentences effectively, recognizing patterns and relationships that depend on the sequence of words.
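
For the predefined (sine and cosine) variant, the following is a minimal plain-Python sketch of the same formula that the `TokenPositionEmbedding` class above builds with tensors. The function name `sinusoidal_position` is introduced here for illustration only.

```python
import math

def sinusoidal_position(pos, embed_size):
    """Fixed encoding for one position: sin on even indices, cos on odd ones."""
    encoding = [0.0] * embed_size
    for i in range(0, embed_size, 2):
        angle = pos / (10000.0 ** (i / embed_size))
        encoding[i] = math.sin(angle)
        if i + 1 < embed_size:
            encoding[i + 1] = math.cos(angle)
    return encoding

print(sinusoidal_position(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(sinusoidal_position(1, 4))
```

Each pair of dimensions oscillates at a different frequency, so every position receives a distinct pattern without any trained parameters.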

### Difference Between Token Embedding with and without Position Embedding

Without position embeddings, the model loses the sequential context. It cannot differentiate between "The cat sat on the mat" and "The mat sat on the cat," which have vastly different meanings.

**Handling of Sequential Data:** Transformers are designed to handle sequential data, and position embeddings are crucial for maintaining the sequence information. Without position embeddings, transformers would be limited in their ability to process language effectively.

In tasks like translation, question-answering, and text generation, understanding the order of words is essential. Position embeddings significantly enhance the transformer's performance in these tasks.
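
The cat/mat example can be checked with a small sketch. The toy vocabulary, the learned position embedding, and the `order_blind_summary` function are all assumptions made for this illustration; the summary function is only a stand-in for a model that has no built-in notion of order.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}   # toy vocabulary
token_embedding = nn.Embedding(len(vocab), 8)
position_embedding = nn.Embedding(6, 8)                      # learned position vectors

def encode(sentence):
    """Turn a sentence into a tensor of toy token ids."""
    return torch.tensor([vocab[word] for word in sentence.lower().split()])

def order_blind_summary(vectors):
    """Stand-in for an order-insensitive model: nonlinearity, then sum over tokens."""
    return torch.tanh(vectors).sum(dim=0)

sentence_a = encode("The cat sat on the mat")
sentence_b = encode("The mat sat on the cat")
positions = torch.arange(6)

with torch.no_grad():
    # Token embeddings only: both sentences hold the same vectors, just reordered
    tokens_only_match = torch.allclose(
        order_blind_summary(token_embedding(sentence_a)),
        order_blind_summary(token_embedding(sentence_b)),
        atol=1e-5,
    )
    # Token + position embeddings: "cat" at position 1 differs from "cat" at position 5
    with_positions_match = torch.allclose(
        order_blind_summary(token_embedding(sentence_a) + position_embedding(positions)),
        order_blind_summary(token_embedding(sentence_b) + position_embedding(positions)),
        atol=1e-5,
    )

print(tokens_only_match, with_positions_match)
```

With token embeddings alone the two summaries coincide, while adding position vectors makes them differ, which is the distinction the model needs.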

### Summary
While token embeddings provide meaning to individual words, position embeddings give the model an understanding of the order of those words, which is crucial for most language processing tasks. The combination of both allows transformers to effectively interpret and generate human language.


## TokenPositionEmbedding Core Math

The following code is a walkthrough of how the token and position embedding works. `input_tokens` is a typical input of one batch; in this example the batch size is 16. Each input row contains 200 tokens, which makes the tensor shape for `input_tokens` (16, 200). When loaded from the IMDB dataset, each token represents a single word from an IMDB review, so `input_tokens` contains the tokens for one batch of reviews from the dataset. You can see the actual tokens for a sample from the IMDB dataset in the output below.

To embed the position of each token, we create a matching tensor `position_ids` with the same shape, which contains the ordinal position of each token in `input_tokens`: the integers 0 through 199, repeated for every row.

Once we have both `input_tokens` and `position_ids`, each is run through its own embedding layer. Note that this walkthrough uses a learned `nn.Embedding` for the positions, whereas the `TokenPositionEmbedding` class above uses a fixed sinusoidal encoding; both produce position vectors of the same shape. The outputs of the two embedding layers are then added together into a single output, `embeddings`.

In [5]:
# Example TokenPositionEmbedding core math
input_tokens = torch.from_numpy(x_test[:16])

batch_size, seq_len = input_tokens.size() 
print(f"batch_size: {batch_size} seq_len: {seq_len}")

print(f"input_tokens.shape: {input_tokens.shape}")
print(f"input_tokens: {input_tokens}")


batch_size: 16 seq_len: 200
input_tokens.shape: torch.Size([16, 200])
input_tokens: tensor([[   1,  591,  202,  ...,    0,    0,    0],
        [   1,   14,   22,  ..., 2033,   19, 7836],
        [   1,  111,  748,  ...,  655, 2212,    5],
        ...,
        [   1,   13,  645,  ...,    4,  154,  132],
        [   1,    6, 1301,  ...,    0,    0,    0],
        [   1,  387,   72,  ...,  533,   18, 3121]], dtype=torch.int32)


In [6]:
# Create the position ids from 0 to max_seq_len - 1
position_ids = torch.arange(0, seq_len, dtype=torch.long, device=input_tokens.device).unsqueeze(0).expand(batch_size, -1)
print(f"position_ids.shape: {position_ids.shape}")
print(f"position_ids: {position_ids}")

position_ids.shape: torch.Size([16, 200])
position_ids: tensor([[  0,   1,   2,  ..., 197, 198, 199],
        [  0,   1,   2,  ..., 197, 198, 199],
        [  0,   1,   2,  ..., 197, 198, 199],
        ...,
        [  0,   1,   2,  ..., 197, 198, 199],
        [  0,   1,   2,  ..., 197, 198, 199],
        [  0,   1,   2,  ..., 197, 198, 199]])


In [7]:
token_embedding = nn.Embedding(vocab_size, d_model)
position_embedding = nn.Embedding(max_seq_len, d_model)


# Get token and position embeddings
token_embeds = token_embedding(input_tokens)


print(f"token_embeds.shape: {token_embeds.shape}")
print(f"token_embeds: {token_embeds}")

token_embeds.shape: torch.Size([16, 200, 128])
token_embeds: tensor([[[ 0.9506,  0.6995,  0.3862,  ...,  0.5055,  0.5259, -1.1664],
         [-1.3788, -0.2906,  0.8214,  ..., -0.8122, -1.0443,  0.7426],
         [ 1.5315,  1.2800,  3.0134,  ..., -0.6404,  2.1197,  0.1127],
         ...,
         [ 0.3092,  0.8322,  0.3072,  ...,  0.1697, -1.1924, -0.1606],
         [ 0.3092,  0.8322,  0.3072,  ...,  0.1697, -1.1924, -0.1606],
         [ 0.3092,  0.8322,  0.3072,  ...,  0.1697, -1.1924, -0.1606]],

        [[ 0.9506,  0.6995,  0.3862,  ...,  0.5055,  0.5259, -1.1664],
         [-0.0363,  0.7055, -0.0226,  ...,  0.5213,  0.3022,  2.2647],
         [-2.5214,  0.0523,  0.5272,  ...,  0.6783,  1.6015, -0.2042],
         ...,
         [ 0.6389, -0.4885, -1.2550,  ..., -0.1269,  1.1253,  1.7807],
         [ 0.5363,  0.6796,  2.2522,  ..., -2.0115,  0.2468,  1.0558],
         [ 1.3387, -0.1286, -0.6733,  ...,  1.1430,  1.3201,  1.2066]],

        [[ 0.9506,  0.6995,  0.3862,  ...,  0.5055,  0.

In [8]:
position_embeds = position_embedding(position_ids)
print(f"position_embeds.shape: {position_embeds.shape}")
print(f"position_embeds: {position_embeds}")

position_embeds.shape: torch.Size([16, 200, 128])
position_embeds: tensor([[[ 0.2923, -0.9336,  0.8262,  ...,  0.2469, -2.4950,  0.2318],
         [ 2.0498,  0.1116,  0.7468,  ..., -1.9119,  0.3289, -0.8703],
         [ 0.1890, -0.3084,  0.6795,  ...,  0.6823,  0.3536, -1.3174],
         ...,
         [-0.6594, -1.3710,  0.6686,  ..., -0.4837,  1.1548,  1.1394],
         [-0.5588,  1.7642, -1.7745,  ...,  0.6993,  1.5732, -0.8184],
         [ 1.7997,  1.4413,  1.0909,  ..., -0.3489, -0.1805, -0.5293]],

        [[ 0.2923, -0.9336,  0.8262,  ...,  0.2469, -2.4950,  0.2318],
         [ 2.0498,  0.1116,  0.7468,  ..., -1.9119,  0.3289, -0.8703],
         [ 0.1890, -0.3084,  0.6795,  ...,  0.6823,  0.3536, -1.3174],
         ...,
         [-0.6594, -1.3710,  0.6686,  ..., -0.4837,  1.1548,  1.1394],
         [-0.5588,  1.7642, -1.7745,  ...,  0.6993,  1.5732, -0.8184],
         [ 1.7997,  1.4413,  1.0909,  ..., -0.3489, -0.1805, -0.5293]],

        [[ 0.2923, -0.9336,  0.8262,  ...,  0.246

In [9]:
# Combine token and position embeddings
embeddings = token_embeds + position_embeds

print(f"embeddings.shape: {embeddings.shape}")
print(f"embeddings: {embeddings}")


embeddings.shape: torch.Size([16, 200, 128])
embeddings: tensor([[[ 1.2429e+00, -2.3413e-01,  1.2124e+00,  ...,  7.5239e-01,
          -1.9692e+00, -9.3457e-01],
         [ 6.7097e-01, -1.7898e-01,  1.5682e+00,  ..., -2.7241e+00,
          -7.1542e-01, -1.2776e-01],
         [ 1.7206e+00,  9.7156e-01,  3.6928e+00,  ...,  4.1968e-02,
           2.4733e+00, -1.2047e+00],
         ...,
         [-3.5023e-01, -5.3875e-01,  9.7583e-01,  ..., -3.1396e-01,
          -3.7642e-02,  9.7877e-01],
         [-2.4954e-01,  2.5965e+00, -1.4672e+00,  ...,  8.6908e-01,
           3.8075e-01, -9.7903e-01],
         [ 2.1089e+00,  2.2735e+00,  1.3981e+00,  ..., -1.7914e-01,
          -1.3729e+00, -6.8989e-01]],

        [[ 1.2429e+00, -2.3413e-01,  1.2124e+00,  ...,  7.5239e-01,
          -1.9692e+00, -9.3457e-01],
         [ 2.0135e+00,  8.1713e-01,  7.2423e-01,  ..., -1.3905e+00,
           6.3109e-01,  1.3943e+00],
         [-2.3324e+00, -2.5615e-01,  1.2067e+00,  ...,  1.3606e+00,
           1.9551e+

## Why do we add `token_embeds + position_embeds` instead of concatenating them into a combined vector?
Summing `token_embeds` and `position_embeds` blends the two embeddings so that no single element of the resulting vector carries only position or only token information. An alternative would be to concatenate the vectors so that, with `d_model = 128`, the first 64 elements hold just the position embedding and the next 64 hold the token embedding. There are several reasons why most transformer models combine the position and token embeddings by addition.

### Model Learning Dynamics
**Interplay of Information:** Adding position and token embeddings allows the model to blend positional information with the semantic content of each token directly. This interplay is crucial for the model to learn the significance of token positions relative to their semantic content within the sequence. It's believed that this combined representation helps the model better learn contextual relationships between tokens.

**Sufficient for Discrimination:** Despite the apparent risk of losing the distinctiveness of token and positional information when they are added together, in practice, transformer models are still able to effectively learn and distinguish the necessary information for tasks like language understanding and generation. The transformer's attention mechanism, which is highly flexible and capable of modeling complex dependencies, plays a crucial role here.

### Theoretical and Empirical Justification
**Empirical Success:** The effectiveness of adding position and token embeddings has been empirically validated by the success of transformer models across a wide range of natural language processing tasks. These models have shown remarkable performance in understanding context, sequence relationships, and the nuances of language, indicating that the combined embeddings effectively convey necessary information to the model.

**Theoretical Flexibility:** The transformer architecture, particularly its attention mechanism, is designed to weigh and interpret the input embeddings dynamically. This means that even though the embeddings are summed, the model can still learn to attend to the aspects of the embeddings (whether they relate more to position or token information) that are most relevant for the task at hand.

### Summary
While adding the embeddings might seem to obscure the distinction between token and positional information, the transformer model's design and its capacity to learn complex representations ensure that it can still effectively utilize both types of information. The choice to sum embeddings is thus a balance between maintaining efficient computation, preserving dimensionality, and ensuring that the model can learn and leverage the blended information effectively.
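
The two options can be compared side by side in a short sketch. The vocabulary size of 100 and the half-width embedding layers are assumptions chosen for illustration; only the summed variant is what this tutorial uses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, d_model = 2, 5, 128
tokens = torch.randint(0, 100, (batch, seq_len))
positions = torch.arange(seq_len).unsqueeze(0).expand(batch, -1)

# Option 1 (used in this tutorial): both embeddings are d_model wide and are summed
tok_full = nn.Embedding(100, d_model)
pos_full = nn.Embedding(seq_len, d_model)
summed = tok_full(tokens) + pos_full(positions)

# Option 2 (alternative): each embedding gets half the width, then they are concatenated
tok_half = nn.Embedding(100, d_model // 2)
pos_half = nn.Embedding(seq_len, d_model // 2)
concatenated = torch.cat([pos_half(positions), tok_half(tokens)], dim=-1)

print(summed.shape, concatenated.shape)
```

Both results have the same shape, but under concatenation each embedding has only half of the width to work with, whereas the sum lets token and position information share all `d_model` dimensions.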

## TokenPositionEmbedding Example Usage

The following example shows how `TokenPositionEmbedding` works on an actual sample from the IMDB dataset. The token embedding layer is not trained here; it is initialized with random weights.

In [10]:
# Example usage:
embedding_layer = TokenPositionEmbedding(vocab_size, d_model, max_seq_len)

# Enumerate the child layers of the embedding module
for i, layer in enumerate(embedding_layer.children()):
    print(f"Layer {i}: {layer}")

embeddings = embedding_layer(input_tokens)
print(f"embeddings.shape: {embeddings.shape}")
print(f"embeddings: {embeddings}")


Layer 0: Embedding(20000, 128)
embeddings.shape: torch.Size([16, 200, 128])
embeddings: tensor([[[ 1.1366,  0.6945, -1.0397,  ...,  0.5091,  1.5011, -0.1113],
         [-0.7782, -0.3730,  1.9482,  ...,  0.6241, -1.9986,  1.9754],
         [-0.0910,  0.2173,  1.5017,  ...,  1.3948, -0.6971,  2.7789],
         ...,
         [ 1.2660,  0.0476,  0.9828,  ...,  0.4175, -1.7775,  2.6281],
         [ 0.3906, -0.3437,  1.1404,  ...,  0.4175, -1.7773,  2.6281],
         [-0.4116,  0.1815,  0.6146,  ...,  0.4175, -1.7772,  2.6281]],

        [[ 1.1366,  0.6945, -1.0397,  ...,  0.5091,  1.5011, -0.1113],
         [ 0.6556,  1.0537,  0.4247,  ..., -0.0662, -0.7059,  1.6077],
         [-0.8953, -0.6155,  2.8492,  ...,  1.5974, -1.1865,  0.3257],
         ...,
         [-0.5186, -1.0908,  0.6706,  ...,  0.9974,  1.0385, -0.2767],
         [-0.0154, -0.7802,  2.8194,  ...,  1.5213, -0.5876,  1.5199],
         [-0.1645,  0.3670,  1.2615,  ...,  1.7319,  0.5702,  1.4335]],

        [[ 1.1366,  0.6945, 

# Multi-Headed Attention

This class takes as input the model dimension d_model and the number of attention heads num_heads. The forward method takes a tensor of shape (batch_size, sequence_length, d_model) and an optional mask, and it outputs the context vectors and attention weights.

In [11]:
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadSelfAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.w_queries = nn.Linear(d_model, d_model)
        self.w_keys = nn.Linear(d_model, d_model)
        self.w_values = nn.Linear(d_model, d_model)

        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, queries, keys, values, mask=None):
        attention_logits = torch.matmul(queries, keys.transpose(-2, -1)) / (self.head_dim ** 0.5)
        if mask is not None:
            attention_logits = attention_logits.masked_fill(mask == 0, float('-inf'))
        attention_weights = F.softmax(attention_logits, dim=-1)
        return torch.matmul(attention_weights, values), attention_weights

    def split_heads(self, x):
        batch_size, seq_len, _ = x.size()
        return x.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

    def combine_heads(self, x):
        batch_size, _, seq_len, _ = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.size()

        queries = self.split_heads(self.w_queries(x))
        keys = self.split_heads(self.w_keys(x))
        values = self.split_heads(self.w_values(x))

        if mask is not None:
            mask = mask.unsqueeze(1)

        context_vectors, attention_weights = self.scaled_dot_product_attention(queries, keys, values, mask)
        context_vectors = self.combine_heads(context_vectors)

        return self.linear(context_vectors), attention_weights


In [12]:
# Example usage:
input_tensor = torch.rand(16, 50, d_model)  # 16 is batch_size and 50 is sequence length

self_attention = MultiHeadSelfAttention(d_model, num_heads)
output, attention_weights = self_attention(input_tensor)

#Enumerate the MultiHeadSelfAttention layers
for i, layer in enumerate(self_attention.children()):
    print(f"Layer {i}: {layer}")

Layer 0: Linear(in_features=128, out_features=128, bias=True)
Layer 1: Linear(in_features=128, out_features=128, bias=True)
Layer 2: Linear(in_features=128, out_features=128, bias=True)
Layer 3: Linear(in_features=128, out_features=128, bias=True)


## What is the purpose of Queries, Keys, and Values and how are they different from a simple densely connected layer?

The multi-head self-attention mechanism is a crucial component, characterized by three key elements: Queries (Q), Keys (K), and Values (V). Let's explore the purpose of each and how they differ from a simple densely connected (fully connected) neural network layer.

### Queries (Q), Keys (K), and Values (V)

1. **Queries (Q):** 
Represent the current word (or token) for which we are trying to establish its context and relationships with other words in the input sequence.

1. **Keys (K):**
Represent all words (or tokens) in the input sequence. The model uses them to determine how much focus or 'attention' each word in the sequence should get in relation to the current query word.

1. **Values (V):**
Also represent all words in the input sequence, but they are used to construct the output of the self-attention layer. The amount of attention a word gets influences how much its corresponding value contributes to the output.

#### How They Work:

In the self-attention mechanism, each word in the input sequence is initially transformed into Q, K, and V vectors through distinct linear transformations (learnable weights).
The model calculates the attention scores by performing a dot product of the Q vector with all K vectors. These scores determine how much each word in the sequence should contribute to the representation of the current word.
The attention scores are then used to create a weighted sum of the V vectors, which forms the output of the self-attention layer for each word.
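
The three steps above can be sketched for a single head on one short sequence. The sizes are toy values chosen for illustration, and the snippet also applies the scaling and softmax steps that are described later in this tutorial.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 4, 8               # toy sizes
x = torch.rand(seq_len, d_model)      # one sequence of 4 token vectors

# Three distinct learnable linear maps produce Q, K and V from the same input
w_q, w_k, w_v = (nn.Linear(d_model, d_model) for _ in range(3))
q, k, v = w_q(x), w_k(x), w_v(x)

scores = q @ k.T                      # (seq_len, seq_len): every query against every key
weights = F.softmax(scores / d_model ** 0.5, dim=-1)
output = weights @ v                  # weighted sum of value vectors for each token

print(scores.shape, output.shape)
```

Row `i` of `weights` says how much each token contributes to the new representation of token `i`.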

### Difference from a Densely Connected Layer:

A densely connected layer learns a fixed transformation of its input data, applying the same transformation to all inputs. In contrast, the self-attention mechanism dynamically calculates how much each part of the input should contribute to the output based on the input data itself.

The self-attention mechanism can capture relationships and dependencies between words in a sequence, regardless of their distance from each other. A densely connected layer lacks this contextual awareness and processes each input independently.

Self-attention allows the model to focus on different parts of the input sequence differently for each output element, enabling a more nuanced and context-aware processing. Densely connected layers don't offer this level of flexibility as they apply the same transformation to all inputs.

### Summary
In a multi-head self-attention function, Queries, Keys, and Values are used to dynamically compute how different parts of the input sequence should be emphasized or 'attended to' for each element in the sequence. This differs from a simple densely connected layer, which lacks the ability to capture sequential and contextual relationships within the input data. Self-attention is inherently more flexible and context-aware, making it well-suited for tasks involving sequential data, like natural language processing.
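
One way to see the difference is to perturb a single token and check which outputs change. This is a simplified sketch: the `self_attention` helper uses identity projections instead of learned Q, K, and V weights, purely to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 5, 8
x = torch.rand(seq_len, d_model)
x_changed = x.clone()
x_changed[3] += 1.0                    # perturb only the token at position 3

dense = nn.Linear(d_model, d_model)

def self_attention(inp):
    """Single-head attention with identity projections, for illustration only."""
    weights = F.softmax(inp @ inp.T / d_model ** 0.5, dim=-1)
    return weights @ inp

with torch.no_grad():
    # Dense layer: the output at position 0 ignores what happened at position 3
    dense_unchanged = torch.allclose(dense(x)[0], dense(x_changed)[0])
    # Self-attention: the output at position 0 shifts because it attends to position 3
    attention_unchanged = torch.allclose(self_attention(x)[0], self_attention(x_changed)[0])

print(dense_unchanged, attention_unchanged)  # True False
```

The dense layer treats every position in isolation, while the attention output for one token depends on all the others.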

## What does the `split_heads` function do and how does it work?

The multi-head self-attention mechanism involves a function often called split_heads or a similar variant. This function is essential for enabling the "multi-head" aspect of the self-attention. Let's delve into what this function does and how it works:

### Purpose of `split_heads`
The primary purpose of `split_heads` is to enable the model to simultaneously attend to information from different representation subspaces at different positions. By splitting the attention mechanism into multiple heads, the model can capture a richer variety of features in the input data.

Each head in the multi-head attention can potentially focus on different aspects of the input data, allowing for parallel and diverse feature extraction. This leads to a more comprehensive understanding of the input.

### How `split_heads` Works
1. **Input to the Function:**
    - The function typically takes the matrices Queries, Keys, and Values as inputs. Each of these matrices is the result of transforming the input sequence through different linear layers specific for Queries, Keys, and Values.

1. **Reshaping the Matrices:**
    - The `split_heads` function reshapes each of Queries, Keys, and Values matrices from their original shape `[batch_size, sequence_length, feature_dimension]` to a new shape `[batch_size, num_heads, sequence_length, feature_dimension/num_heads]`.

    - This reshaping effectively splits the last dimension (feature_dimension) into two dimensions: the number of heads (num_heads) and the reduced feature dimension for each head.

1. **Parallel Attention Processing:**

    - After splitting, each head processes a slice of the original feature dimension, allowing the model to attend to different parts of the feature space independently and in parallel.
    - This parallel processing enables the model to capture different types of relationships in the data, such as different aspects of semantic meaning in a language model.

1. **Recombination and Output:**
    - Once each head has processed its respective slice, the outputs are typically concatenated back together and passed through another linear layer to combine the information from all heads.

    - This recombination ensures that the multi-head attention captures a wide range of information from the input while still being able to integrate these diverse signals.

### Summary
The split_heads function in a Transformer's multi-head self-attention mechanism plays a crucial role in diversifying the attention process. By splitting the Queries, Keys, and Values matrices into multiple heads, the Transformer can process the input data in parallel across different feature subspaces, enhancing its ability to capture complex patterns and relationships in the data. This functionality is fundamental to the Transformer architecture's success in various tasks like language understanding, translation, and generation.
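
The reshaping can be traced on a small tensor. The sizes below are toy values, and the two operations mirror the `view` and `transpose` calls in `split_heads` above.

```python
import torch

batch_size, seq_len, d_model, num_heads = 2, 5, 8, 2
head_dim = d_model // num_heads
x = torch.arange(batch_size * seq_len * d_model, dtype=torch.float32)
x = x.view(batch_size, seq_len, d_model)

# Same two steps as split_heads: carve up the feature axis, then move heads forward
heads = x.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)

print(heads.shape)                                   # torch.Size([2, 2, 5, 4])
print(torch.equal(heads[:, 0], x[..., :head_dim]))   # True: head 0 sees features 0..3
print(torch.equal(heads[:, 1], x[..., head_dim:]))   # True: head 1 sees features 4..7
```

No values are changed or copied by the split; each head simply gets a contiguous slice of the feature dimension for every token.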

## How does `combine_heads` work?

The combine_heads function plays a crucial role after the scaled dot product attention has been computed for each head. This function is essential for integrating the outputs from all heads back into a unified representation. Let's explore what this function does and how it operates:

### Purpose of `combine_heads`
1. **Aggregating Outputs from Multiple Heads:** The main purpose of `combine_heads` is to merge the outputs from each of the attention heads. Since each head captures different aspects or features of the input data, combining them allows the model to consider all these aspects simultaneously.

1. **Restoring Original Dimensionality:** The function also serves to reshape the output back to the original embedding dimensionality. This is necessary for maintaining consistency in the network's layers and for subsequent processing.

### How `combine_heads` Works
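1. **Input to the Function:** The function receives the outputs from the attention heads, where each head has produced an output matrix of shape `[batch_size, sequence_length, feature_dimension/num_heads]`. In the code above these are stacked into a single tensor of shape `[batch_size, num_heads, sequence_length, feature_dimension/num_heads]`.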

1. **Concatenating the Outputs:**
The outputs from all the heads are concatenated along the dimension that represents the feature space. This concatenation effectively reverses the operation performed by the split_heads function.
After concatenation, the shape of the resulting matrix is `[batch_size, sequence_length, feature_dimension]`, where `feature_dimension` is typically the original embedding size.

1. **Preparing for Subsequent Layers:**
The output of `combine_heads` is now in a suitable format to be passed on to the next layer in the Transformer, such as a feed-forward neural network layer.
This step is crucial for ensuring that the sequential processing in the Transformer architecture is maintained.

### Summary
The combine_heads function in a Transformer's multi-head self-attention mechanism is integral for integrating the diverse outputs from each attention head. By concatenating and optionally transforming these outputs, the function provides a comprehensive representation that encapsulates the varied features captured by each head. This step is key to the Transformer's ability to process and understand complex patterns in data, particularly in tasks involving sequential or structured data like natural language processing.
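
A round trip through both reshapes shows that `combine_heads` undoes `split_heads`. This sketch uses toy sizes and calls `contiguous()` on the split tensor to mimic the memory layout of a freshly computed attention output, which is an assumption made for the illustration.

```python
import torch

batch_size, num_heads, seq_len, head_dim = 2, 2, 5, 4
d_model = num_heads * head_dim
x = torch.rand(batch_size, seq_len, d_model)

# Split as in split_heads; contiguous() stores the result in
# (batch, heads, seq_len, head_dim) order, like an attention output
per_head = x.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2).contiguous()

# Merge as in combine_heads. The transposed tensor is not contiguous in memory,
# so contiguous() is needed before view() can flatten the last two axes.
swapped = per_head.transpose(1, 2)
combined = swapped.contiguous().view(batch_size, seq_len, d_model)

print(per_head.shape, combined.shape)
print(swapped.is_contiguous())     # False
print(torch.equal(combined, x))    # True: combine_heads undoes split_heads
```

In the class above, the merged tensor is then passed through the final `nn.Linear` layer, which mixes information across heads.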

## What does the `scaled_dot_product_attention` function do and how does it work

The `scaled_dot_product_attention` function is a critical component of the attention mechanism. It computes the attention weights and produces a weighted sum of the values. This function is where the actual 'attention' part of the mechanism takes place. Let's explore what this function does and how it operates:

### Purpose of `scaled_dot_product_attention`
1. **Computing Attention Weights:** The primary purpose of this function is to calculate how much attention each element of the sequence should pay to every other element. It's about determining the relevance or importance of all other tokens in the sequence for a given token.

1. **Producing Contextualized Representations:** By computing these attention weights and applying them to the values, the function produces a new set of vectors that are contextually informed. These vectors represent each token not just as itself, but as a summary of how it relates to every other token in the sequence.

### How `scaled_dot_product_attention` Works
1. **Inputs to the Function:** The function typically takes three inputs: Queries (Q), Keys (K), and Values (V). Optionally, a mask may also be provided to exclude certain positions from attention (like padding tokens).

1. **Calculating Dot Products of Queries and Keys:**
The function starts by computing the dot product between each query and all keys. This operation essentially measures the similarity or compatibility between each query and key pair.
The resulting matrix of dot products has shape `[sequence_length, sequence_length]` for each batch element and head, representing attention scores for each pair of tokens in the sequence.

1. **Scaling the Dot Products:**
The dot products are scaled down by the square root of the dimension of the key vectors. When the dot products are large in magnitude, the softmax function (applied in the next step) saturates and its gradients become very small. Scaling keeps the scores in a range where gradient descent remains stable during training.

1. **Applying Softmax:**
A softmax function is applied to each row of the scaled dot product matrix. The softmax function converts the raw scores into probabilities that sum to 1. This step determines the final attention weights.
When a mask is provided, it is typically applied to the scores before the softmax (masked positions are set to a very large negative value), ensuring that positions to be ignored (like padding) receive zero weight.

1. **Multiplying with the Values:**
The attention weights are then used to create a weighted sum of the value vectors. This step effectively selects or highlights the information in the values based on the computed attention weights.
The output is a new set of vectors, each representing a token in the sequence, reweighted to include information from other relevant tokens.

1. **Output of the Function:**
The output is a matrix of the same shape as the values matrix, representing the input sequence where each element now has contextual information from the entire sequence.
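
The steps above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the exact method used elsewhere in this tutorial. It assumes `Q`, `K`, and `V` have shape `[batch_size, num_heads, sequence_length, d_k]` and that a mask value of `0` marks positions to ignore. Both conventions are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    # Similarity of every query with every key: [..., seq_len, seq_len]
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Assumed convention: mask == 0 marks positions to ignore
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)   # each row sums to 1
    output = torch.matmul(weights, v)     # weighted sum of the values
    return output, weights

torch.manual_seed(0)
q = k = v = torch.randn(1, 2, 4, 8)       # batch=1, heads=2, seq_len=4, d_k=8
mask = torch.tensor([1, 1, 1, 0]).view(1, 1, 1, 4)  # last token is padding

output, weights = scaled_dot_product_attention(q, k, v, mask)
print(output.shape)           # torch.Size([1, 2, 4, 8])
print(weights[0, 0].sum(-1))  # every row of attention weights sums to 1
print(weights[..., -1])       # all zeros: the padded position gets no attention
```

Setting masked scores to negative infinity before the softmax makes their weights exactly zero, so padded tokens contribute nothing to the weighted sum.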

### Summary
The `scaled_dot_product_attention` function is at the heart of the self-attention mechanism in Transformers. It enables the model to focus on different parts of the input sequence in a context-sensitive manner. By calculating attention weights and applying them to the values, this function produces output vectors that are contextualized representations of each input token, taking into account the entire sequence. This sophisticated attention mechanism is a key reason for the effectiveness of Transformers in tasks that require an understanding of the entire context, such as natural language processing and sequence modeling.
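
To illustrate why the scores are divided by the square root of the key dimension, the sketch below draws random queries and keys whose components have unit variance (an assumption made purely for illustration) and compares the spread of the raw and scaled dot products.

```python
import math
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(10000, d_k)   # 10,000 random query vectors
k = torch.randn(10000, d_k)   # 10,000 random key vectors

raw = (q * k).sum(dim=-1)     # one dot product per (query, key) pair
scaled = raw / math.sqrt(d_k)

print(f"std of raw scores:    {raw.std().item():.2f}")     # close to sqrt(64) = 8
print(f"std of scaled scores: {scaled.std().item():.2f}")  # close to 1

# Large scores push softmax toward a one-hot output, where gradients are tiny
scores = torch.tensor([8.0, 0.0, -8.0])
sharp = torch.softmax(scores, dim=0)                     # nearly one-hot
smooth = torch.softmax(scores / math.sqrt(d_k), dim=0)   # much smoother
print(sharp)
print(smooth)
```

Without scaling, the spread of the scores grows with `d_k`, so larger heads would push the softmax further into its saturated region.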

## What is the purpose of the Linear layer

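In a Transformer's multi-head self-attention mechanism, the fourth linear layer (after the query, key, and value projections), commonly referred to as the fully connected layer (fc) or the output projection, plays a vital role in integrating and refining the outputs from the self-attention process. In the `MultiHeadSelfAttention` class used in this tutorial it appears as the `linear` submodule. Let's break down its purpose: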

### Purpose of the linear layer (fc)
1. **Integration of Attention Heads:**
After the self-attention mechanism processes the input through multiple heads, the results from each head need to be integrated. The linear layer serves to combine these diverse attention outputs into a single, unified output.

1. **Transformation of Concatenated Outputs:**
The outputs of the multiple attention heads are concatenated to form a single matrix. The linear layer then applies a linear transformation to this concatenated matrix. This step is crucial for mapping the combined, multi-dimensional attention information back into the original input space (or to a desired output dimensionality).

1. **Maintaining Depth of Representation:**
The linear layer (fc) ensures that the depth of the model's representation (i.e., the dimensionality of the feature space) is maintained or appropriately transformed. This consistency is essential for stacking multiple layers of the Transformer, allowing each layer to build upon the transformed representations of the previous layer.

1. **Adding Learnable Parameters:**
The linear layer (fc) introduces additional learnable parameters to the model. These parameters are optimized during training, allowing the model to better integrate and interpret the information gleaned from the multiple attention heads.

1. **Enhancing Model's Capacity:** By combining and transforming the outputs of the attention heads, the linear layer (fc) enhances the model's capacity to capture complex patterns and relationships in the data. This step is critical for the overall performance of the Transformer in tasks like language understanding and generation.

### How the linear layer (fc) Works
- **Linear Transformation:** The linear layer (fc) typically performs a linear transformation. It takes the concatenated outputs from the attention heads and multiplies them with a weight matrix (learnable parameters), often followed by adding a bias term.

- **Dimensionality Management:** The linear layer (fc) can either preserve the dimensionality of the input or transform it to a different dimensionality, depending on the design of the Transformer model. This flexibility allows the model to be tailored to specific tasks or requirements.
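
As a minimal sketch of this transformation, the snippet below applies an `nn.Linear` layer to a stand-in for the concatenated head outputs. The value `d_model = 128` matches the setting used in this tutorial. The variable names and the random input are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model = 128
fc = nn.Linear(d_model, d_model)   # output projection: weight [128, 128] plus bias [128]

# Stand-in for the concatenated head outputs: [batch_size, seq_len, d_model]
combined = torch.randn(16, 10, d_model)
projected = fc(combined)           # applied independently at every position

num_params = sum(p.numel() for p in fc.parameters())
print(projected.shape)   # torch.Size([16, 10, 128])
print(num_params)        # 16512 = 128 * 128 + 128
```

The layer is equivalent to `combined @ fc.weight.T + fc.bias`, so it mixes information across the slices produced by the different heads while keeping the output at `d_model` dimensions.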

### Summary
The linear layer (fc) in a transformer's multi-head self-attention mechanism serves as a critical component for integrating, transforming, and refining the outputs from the attention heads. It adds depth and capacity to the model, enabling complex feature integration and aiding in the model's overall ability to process and understand sequential data effectively.

# Transformer Block
This class takes as input the model dimension `d_model`, the number of attention heads `num_heads`, the feed-forward hidden dimension `d_ff`, the vocabulary size `vocab_size`, and the maximum sequence length `max_seq_len`. The `forward` method takes a tensor of shape `(batch_size, sequence_length)` with token ids and an optional mask, and it outputs the processed tensor with shape `(batch_size, sequence_length, d_model)`.

In [13]:
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, vocab_size, max_seq_len, dropout=0.1):
        super(TransformerBlock, self).__init__()

        self.embedding_layer = TokenPositionEmbedding(vocab_size, d_model, max_seq_len)

        self.self_attention = MultiHeadSelfAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)

        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Token and position embedding
        x = self.embedding_layer(x)

        # Multi-head self-attention
        attn_output, _ = self.self_attention(x, mask)
        x = self.norm1(x + self.dropout1(attn_output))

        # Position-wise feed-forward
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout2(ff_output))

        return x


In [14]:
# Example usage:
input_ids = torch.randint(0, vocab_size, (16, max_seq_len))  # 16 is batch_size

transformer_block = TransformerBlock(d_model, num_heads, d_ff, vocab_size, max_seq_len)
output = transformer_block(input_ids)

#Enumerate the TransformerBlock layers
for i, layer in enumerate(transformer_block.children()):
    print(f"Layer {i}: {layer}")


Layer 0: TokenPositionEmbedding(
  (token_embedding): Embedding(20000, 128)
)
Layer 1: MultiHeadSelfAttention(
  (w_queries): Linear(in_features=128, out_features=128, bias=True)
  (w_keys): Linear(in_features=128, out_features=128, bias=True)
  (w_values): Linear(in_features=128, out_features=128, bias=True)
  (linear): Linear(in_features=128, out_features=128, bias=True)
)
Layer 2: LayerNorm((128,), eps=1e-05, elementwise_affine=True)
Layer 3: Dropout(p=0.1, inplace=False)
Layer 4: Sequential(
  (0): Linear(in_features=128, out_features=2048, bias=True)
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Linear(in_features=2048, out_features=128, bias=True)
)
Layer 5: LayerNorm((128,), eps=1e-05, elementwise_affine=True)
Layer 6: Dropout(p=0.1, inplace=False)


# Build the Model

Here's an example of building and training a transformer model using `TransformerBlock`, `MultiHeadSelfAttention`, `TokenPositionEmbedding`, and `IMDBDataset` from the previous examples. This example calculates and outputs the loss and accuracy for both training and test data for each epoch.

This example creates a TransformerClassifier class that uses the TransformerBlock as the main component. The output of the transformer block is pooled along the sequence dimension using mean pooling before passing through a linear layer for classification.
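
Mean pooling over the sequence dimension can be sketched on a tiny tensor. The masked variant at the end is an optional alternative that ignores padded positions. It is not used in this tutorial, and the mask convention (`0` marks padding) is an assumption.

```python
import torch

# Stand-in for the transformer block output: [batch_size, seq_len, d_model]
x = torch.arange(24, dtype=torch.float32).view(2, 3, 4)

pooled = x.mean(dim=1)   # average over the sequence dimension
print(pooled.shape)      # torch.Size([2, 4])
print(pooled[0])         # tensor([4., 5., 6., 7.])

# Optional variant (not used in this tutorial): ignore padded positions
mask = torch.tensor([[1., 1., 0.], [1., 1., 1.]])   # 0 marks padding
masked = (x * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)
print(masked[0])         # tensor([2., 3., 4., 5.])
```

Plain mean pooling treats padding tokens like any other token, which is simple and works for this example. The masked version averages only over real tokens.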

The training loop iterates through num_epochs and calculates the training and test loss and accuracy for each epoch. Note that the model should be set to train mode during training and eval mode during evaluation to enable/disable dropout and other regularization techniques correctly.
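
The effect of train and eval mode on dropout can be seen in isolation. The sketch below uses `p=0.5` and an all-ones input purely for illustration (the model in this tutorial uses `dropout=0.1`).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()               # training mode: zero elements at random, rescale the rest
train_out = dropout(x)
print(train_out.unique())     # tensor([0., 2.]): survivors are scaled by 1 / (1 - p)

dropout.eval()                # evaluation mode: dropout is the identity
eval_out = dropout(x)
print(torch.equal(eval_out, x))   # True
```

Calling `model.train()` or `model.eval()` on a parent module switches every dropout layer inside it in the same way.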

The main components of the code are as follows:

1. **Loading the IMDB dataset:** The `load_imdb_data` function is called to load the IMDB dataset, preprocess it by padding or truncating sequences to a fixed length (`max_seq_len`), and split it into training and testing sets.

1. **Creating Dataset and DataLoader instances:** PyTorch `Dataset` and `DataLoader` instances are created for the training and test sets. These will be used to iterate through the data during the training process.

1. **Defining the model:** The `TransformerClassifier` class is created by combining the `TransformerBlock` with a fully connected layer for classification. This class is then instantiated using the hyperparameters, such as `d_model`, `num_heads`, and `d_ff`.

1. **Setting up the training loop:** The model is trained for a specified number of epochs using `BCEWithLogitsLoss` and the Adam optimizer. For each epoch, the model is trained on the training set and evaluated on the test set. The loss and accuracy for both training and test sets are calculated and printed for each epoch.

In summary, this sample code demonstrates how to build, train, and evaluate a simple Transformer-based model for sentiment analysis on the Keras IMDB dataset. The model is trained using a single TransformerBlock and the performance metrics (loss and accuracy) are reported for each epoch.
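
The training code in this section produces a single output logit per review, uses `nn.BCEWithLogitsLoss` as the criterion, and counts a prediction as positive when the logit is greater than 0. The sketch below, using made-up logits and labels, shows how these pieces relate.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0], [-1.0], [0.5], [-3.0]])   # made-up model outputs, shape [4, 1]
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])             # made-up float targets, shape [4]

# BCEWithLogitsLoss applies the sigmoid internally, in a numerically stable way
loss_logits = nn.BCEWithLogitsLoss()(logits, labels.unsqueeze(1))
loss_probs = nn.BCELoss()(torch.sigmoid(logits), labels.unsqueeze(1))
print(torch.allclose(loss_logits, loss_probs))   # True

# sigmoid(0) = 0.5, so "logit > 0" is the same as "probability > 0.5"
preds = (logits > 0).float()
accuracy = (preds == labels.unsqueeze(1)).float().mean().item()
print(accuracy)   # 0.75
```

Note that `BCEWithLogitsLoss` expects floating-point targets with the same shape as the logits, which is why the labels are unsqueezed to shape `[batch_size, 1]`.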


In [15]:
class TransformerClassifier(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, vocab_size, max_seq_len, num_classes, dropout=0.1):
        super(TransformerClassifier, self).__init__()

        self.transformer_block = TransformerBlock(d_model, num_heads, d_ff, vocab_size, max_seq_len, dropout)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x, mask=None):
        x = self.transformer_block(x, mask)
        x = x.mean(dim=1)
        return self.classifier(x)

def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = criterion(outputs, labels.unsqueeze(1))
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        total += labels.size(0)
        correct += ((outputs > 0) == labels.unsqueeze(1)).sum().item()

    return running_loss / len(loader), correct / total

def evaluate(model, loader, criterion, device):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, labels.unsqueeze(1))

            running_loss += loss.item()
            total += labels.size(0)
            correct += ((outputs > 0) == labels.unsqueeze(1)).sum().item()

    return running_loss / len(loader), correct / total

# Model and training parameters
num_classes = 1
dropout = 0.1
num_epochs = 10
lr = 1e-4
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load data and create DataLoaders
x_train, y_train, x_test, y_test = load_imdb_data(num_words, max_seq_len)
train_dataset = IMDBDataset(x_train, y_train)
test_dataset = IMDBDataset(x_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

# Create the model
model = TransformerClassifier(d_model, num_heads, d_ff, vocab_size, max_seq_len, num_classes, dropout).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)


In [16]:
#Enumerate the model layers
for i, layer in enumerate(model.children()):
    print(f"Layer {i}: {layer}")

print("\n")

for name, param in model.named_parameters():
    print(f"{name}: {param.size()}")

Layer 0: TransformerBlock(
  (embedding_layer): TokenPositionEmbedding(
    (token_embedding): Embedding(20000, 128)
  )
  (self_attention): MultiHeadSelfAttention(
    (w_queries): Linear(in_features=128, out_features=128, bias=True)
    (w_keys): Linear(in_features=128, out_features=128, bias=True)
    (w_values): Linear(in_features=128, out_features=128, bias=True)
    (linear): Linear(in_features=128, out_features=128, bias=True)
  )
  (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
  (dropout1): Dropout(p=0.1, inplace=False)
  (feed_forward): Sequential(
    (0): Linear(in_features=128, out_features=2048, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.1, inplace=False)
    (3): Linear(in_features=2048, out_features=128, bias=True)
  )
  (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
  (dropout2): Dropout(p=0.1, inplace=False)
)
Layer 1: Linear(in_features=128, out_features=1, bias=True)


transformer_block.embedding_layer.token_embedding.weight: tor

# Train the model

In [17]:
#Train the model
for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc = evaluate(model, test_loader, criterion, device)
    print(f'Epoch {epoch + 1}/{num_epochs}, '
          f'Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}, '
          f'Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc:.4f}')

Epoch 1/10, Train Loss: 0.6137, Train Acc: 0.6400, Test Loss: 0.5087, Test Accuracy: 0.7446
Epoch 2/10, Train Loss: 0.4654, Train Acc: 0.7774, Test Loss: 0.4306, Test Accuracy: 0.7962
Epoch 3/10, Train Loss: 0.4008, Train Acc: 0.8170, Test Loss: 0.4053, Test Accuracy: 0.8124
Epoch 4/10, Train Loss: 0.3641, Train Acc: 0.8391, Test Loss: 0.4793, Test Accuracy: 0.7860
Epoch 5/10, Train Loss: 0.3354, Train Acc: 0.8555, Test Loss: 0.3860, Test Accuracy: 0.8258
Epoch 6/10, Train Loss: 0.3123, Train Acc: 0.8665, Test Loss: 0.3900, Test Accuracy: 0.8242
Epoch 7/10, Train Loss: 0.2899, Train Acc: 0.8790, Test Loss: 0.3852, Test Accuracy: 0.8306
Epoch 8/10, Train Loss: 0.2679, Train Acc: 0.8896, Test Loss: 0.4235, Test Accuracy: 0.8235
Epoch 9/10, Train Loss: 0.2486, Train Acc: 0.8994, Test Loss: 0.4078, Test Accuracy: 0.8301
Epoch 10/10, Train Loss: 0.2299, Train Acc: 0.9066, Test Loss: 0.4191, Test Accuracy: 0.8282


# More info on Transformers

If you want more info on transformers, and some tutorials that _weren't_ generated by an AI, check out these links:

![Transformer](transformer.jpg)

## Deep Learning
This is an authoritative treatment of deep learning:
[Deep Learning PDF - Ian Goodfellow, Yoshua Bengio and Aaron Courville](https://github.com/janishar/mit-deep-learning-book-pdf/blob/master/complete-book-pdf/deeplearningbook.pdf)

## Keras tutorial:
https://keras.io/examples/nlp/text_classification_with_transformer/

## Other good tutorials:
https://machinelearningmastery.com/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras/

https://towardsdatascience.com/build-your-own-transformer-from-scratch-using-pytorch-84c850470dcb

https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0

https://www.tensorflow.org/text/tutorials/transformer

https://www.kaggle.com/code/ritvik1909/text-classification-attention


## General Overview:
https://towardsdatascience.com/all-you-need-to-know-about-attention-and-transformers-in-depth-understanding-part-1-552f0b41d021

https://towardsdatascience.com/all-you-need-to-know-about-attention-and-transformers-in-depth-understanding-part-2-bf2403804ada

https://huggingface.co/learn/nlp-course/chapter1/1?fw=pt
