#### https://github.com/KRcpl88/GptTransformerTutorial

# Advanced Transformer Tutorial Generated by GPT 4

![Transformer](Optimusprime-originaltoy.jpg)

Both the descriptive explanations and the code samples for this tutorial were generated entirely with ChatGPT using the GPT-4 model. In some cases the initial code had minor errors; these were also fixed by GPT-4, by feeding the error messages back to the model so that it would generate corrected code.

This is an advanced tutorial that builds the main components of the Transformer model (the multi-head attention mechanism and the token and position embedding) from scratch in PyTorch.

#### Prompt: 
```
How can I build a transformer model from scratch using IMDB and pytorch
```


## IMDB Sentiment Analysis

The Keras IMDB dataset is a popular dataset for sentiment analysis tasks in natural language processing (NLP). It contains 50,000 movie reviews from the Internet Movie Database (IMDB) labeled as either positive (1) or negative (0) based on the sentiment expressed in the review. The dataset is divided into 25,000 reviews for training and 25,000 reviews for testing.

The reviews in the dataset have been preprocessed, and each review is encoded as a sequence of word indices (integers). The indices represent the overall frequency rank of the words in the entire dataset. For instance, the integer "3" encodes the 3rd most frequent word in the data. This encoding allows for faster processing and less memory usage compared to working with raw text data.
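
To make the frequency-rank encoding concrete, here is a minimal sketch on a toy corpus. The three example reviews and the tie-breaking rule are assumptions made for illustration; this is not the Keras preprocessing code, and the real dataset additionally reserves a few low indices for special tokens such as padding, which the sketch ignores.

```python
from collections import Counter

# Toy corpus (hypothetical); the real dataset has 50,000 reviews.
reviews = [
    "the movie was great",
    "the movie was bad",
    "the plot was great fun",
]

# Count how often each word occurs across the whole corpus.
counts = Counter(word for review in reviews for word in review.split())

# Rank words by descending frequency (ties broken alphabetically), starting at 1.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_to_rank = {word: rank for rank, word in enumerate(ranked, start=1)}

# Encode each review as a list of frequency ranks.
encoded = [[word_to_rank[w] for w in review.split()] for review in reviews]

print(word_to_rank)
print(encoded)
```

Frequent words such as "the" receive small integers, so keeping only the top `num_words` ranks is a simple way to cap the vocabulary size.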

The Keras IMDB dataset is typically used for binary classification tasks, where the goal is to build a machine learning model that can predict whether a given movie review is positive or negative based on the text content. The dataset is accessible through the tensorflow.keras.datasets module in the TensorFlow library.


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import math
import numpy as np
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences




# Multi-Headed attention

This class takes as input the model dimension d_model and the number of attention heads num_heads. The forward method takes a tensor of shape (batch_size, sequence_length, d_model) and an optional mask, and it outputs the context vectors and attention weights.

In [None]:
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadSelfAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.W_Queries = nn.Linear(d_model, d_model)
        self.W_Keys = nn.Linear(d_model, d_model)
        self.W_Values = nn.Linear(d_model, d_model)

        self.FullyConnectedLayer = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Queries, Keys, Values, mask=None):
        attention_logits = torch.matmul(Queries, Keys.transpose(-2, -1)) / (self.head_dim ** 0.5)
        if mask is not None:
            attention_logits = attention_logits.masked_fill(mask == 0, float('-inf'))
        attention_weights = F.softmax(attention_logits, dim=-1)
        return torch.matmul(attention_weights, Values), attention_weights

    def split_heads(self, x):
        batch_size, seq_len, _ = x.size()
        return x.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

    def combine_heads(self, x):
        batch_size, _, seq_len, _ = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.size()

        Queries = self.split_heads(self.W_Queries(x))
        Keys = self.split_heads(self.W_Keys(x))
        Values = self.split_heads(self.W_Values(x))

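        if mask is not None:
            # Expects mask of shape (batch_size, seq_len, seq_len); add a head dimension so it broadcasts over all heads
            mask = mask.unsqueeze(1)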

        context_vectors, attention_weights = self.scaled_dot_product_attention(Queries, Keys, Values, mask)
        context_vectors = self.combine_heads(context_vectors)

        return self.FullyConnectedLayer(context_vectors), attention_weights

d_model = 128
num_heads = 8
d_ff = 2048
dropout = 0.1
vocab_size = 20000
max_seq_len = 200

# Example usage:
input_tensor = torch.rand(16, 50, d_model)  # 16 is batch_size and 50 is sequence length

self_attention = MultiHeadSelfAttention(d_model, num_heads)
output, attention_weights = self_attention(input_tensor)

#Enumerate the MultiHeadSelfAttention layers
for i, layer in enumerate(self_attention.children()):
    print(f"Layer {i}: {layer}")

## What is the purpose of Queries, Keys, and Values and how are they different from a simple densely connected layer?

The multi-head self-attention mechanism is a crucial component of the Transformer. It is built around three key elements: Queries (Q), Keys (K), and Values (V). Let's explore the purpose of each and how they differ from a simple densely connected (fully connected) neural network layer.

### Queries (Q), Keys (K), and Values (V)

1. **Queries (Q):** 
Represent the current word (or token) for which we are trying to establish its context and relationships with other words in the input sequence.

1. **Keys (K):**
Represent all words (or tokens) in the input sequence. The model uses them to determine how much focus or 'attention' each word in the sequence should get in relation to the current query word.

1. **Values (V):**
Also represent all words in the input sequence, but they are used to construct the output of the self-attention layer. The amount of attention a word gets influences how much its corresponding value contributes to the output.

#### How They Work:

In the self-attention mechanism, each word in the input sequence is initially transformed into Q, K, and V vectors through distinct linear transformations (learnable weights).
The model calculates the attention scores by taking the dot product of each Q vector with all K vectors, scaling the result by the square root of the per-head dimension, and normalizing it with a softmax. These scores determine how much each word in the sequence should contribute to the representation of the current word.
The attention scores are then used to create a weighted sum of the V vectors, which forms the output of the self-attention layer for each word.
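
The steps above can be sketched for a single head as follows. This is a minimal illustration with random tensors standing in for the projected Q, K, and V; the sizes `seq_len = 4` and `d_k = 8` are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_k = 4, 8  # hypothetical toy sizes
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

# Dot product of every query with every key, scaled by sqrt(d_k)
scores = Q @ K.T / d_k ** 0.5        # (seq_len, seq_len)

# Softmax turns each row of scores into weights that sum to 1
weights = F.softmax(scores, dim=-1)

# Each output row is a weighted sum of the value vectors
output = weights @ V                 # (seq_len, d_k)

print(weights.sum(dim=-1))
print(output.shape)
```

Row `i` of `weights` says how much token `i` attends to every token in the sequence, including itself.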

### Difference from a Densely Connected Layer:

A densely connected layer learns a fixed transformation of its input data, applying the same transformation to all inputs. In contrast, the self-attention mechanism dynamically calculates how much each part of the input should contribute to the output based on the input data itself.

The self-attention mechanism can capture relationships and dependencies between words in a sequence, regardless of their distance from each other. A densely connected layer lacks this contextual awareness and processes each input independently.

Self-attention allows the model to focus on different parts of the input sequence differently for each output element, enabling a more nuanced and context-aware processing. Densely connected layers don't offer this level of flexibility as they apply the same transformation to all inputs.
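
One way to see the difference is to change a single token and check which output positions move. The sketch below is an illustration under simplifying assumptions: `simple_self_attention` is a hypothetical helper that uses the input directly as queries, keys, and values (no learned projections), and the sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_model = 8
dense = nn.Linear(d_model, d_model)

def simple_self_attention(x):
    # Single-head attention with Q = K = V = x, for illustration only
    scores = x @ x.transpose(-2, -1) / d_model ** 0.5
    return F.softmax(scores, dim=-1) @ x

x = torch.randn(1, 5, d_model)   # (batch, seq_len, d_model)
x_changed = x.clone()
x_changed[0, 4] += 1.0           # perturb only the last token

with torch.no_grad():
    # Per-position absolute change in the output of each layer
    dense_diff = (dense(x) - dense(x_changed)).abs().sum(dim=-1)[0]
    attn_diff = (simple_self_attention(x) - simple_self_attention(x_changed)).abs().sum(dim=-1)[0]

print(dense_diff)  # only position 4 should change
print(attn_diff)   # every position should change
```

The dense layer treats each position in isolation, so only the perturbed position's output moves. With self-attention, every position attends to the perturbed token, so all outputs shift.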

### Summary
In a multi-head self-attention function, Queries, Keys, and Values are used to dynamically compute how different parts of the input sequence should be emphasized or 'attended to' for each element in the sequence. This differs from a simple densely connected layer, which lacks the ability to capture sequential and contextual relationships within the input data. Self-attention is inherently more flexible and context-aware, making it well-suited for tasks involving sequential data, like natural language processing.

## What is the purpose of the FullyConnectedLayer layer?

In this implementation of multi-head self-attention, the fourth linear layer (after the Queries, Keys, and Values projections) is named FullyConnectedLayer (fc); it is often also called the output projection. It plays a vital role in integrating and refining the outputs from the self-attention process. Let's break down its purpose:

### Purpose of the FullyConnectedLayer (fc)
1. **Integration of Attention Heads:**
After the self-attention mechanism processes the input through multiple heads, the results from each head need to be integrated. The FullyConnectedLayer serves to combine these diverse attention outputs into a single, unified output.

1. **Transformation of Concatenated Outputs:**
The outputs of the multiple attention heads are concatenated to form a single matrix. The FullyConnectedLayer then applies a linear transformation to this concatenated matrix. This step is crucial for mapping the combined, multi-dimensional attention information back into the original input space (or to a desired output dimensionality).

1. **Maintaining Depth of Representation:**
The FullyConnectedLayer (fc) ensures that the depth of the model's representation (i.e., the dimensionality of the feature space) is maintained or appropriately transformed. This consistency is essential for stacking multiple layers of the Transformer, allowing each layer to build upon the transformed representations of the previous layer.

1. **Adding Learnable Parameters:**
The FullyConnectedLayer (fc) introduces additional learnable parameters to the model. These parameters are optimized during training, allowing the model to better integrate and interpret the information gleaned from the multiple attention heads.

1. **Enhancing Model's Capacity:** By combining and transforming the outputs of the attention heads, the FullyConnectedLayer (fc) enhances the model's capacity to capture complex patterns and relationships in the data. This step is critical for the overall performance of the Transformer in tasks like language understanding and generation.

### How the FullyConnectedLayer (fc) Layer Works
- **Linear Transformation:** The FullyConnectedLayer (fc) typically performs a linear transformation. It takes the concatenated outputs from the attention heads and multiplies them with a weight matrix (learnable parameters), often followed by adding a bias term.

- **Dimensionality Management:** The FullyConnectedLayer (fc) can either preserve the dimensionality of the input or transform it to a different dimensionality, depending on the design of the Transformer model. This flexibility allows the model to be tailored to specific tasks or requirements.
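
As a rough sketch of the concatenate-then-project step, the snippet below uses a random tensor as a stand-in for the per-head attention outputs. The variable names `per_head` and `output_projection` are illustrative and not part of the tutorial's class; the sizes mirror the tutorial's `d_model = 128` and `num_heads = 8`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, num_heads, head_dim = 2, 5, 8, 16
d_model = num_heads * head_dim  # 128, matching the tutorial's d_model

# Stand-in for the per-head attention outputs: (batch, heads, seq, head_dim)
per_head = torch.randn(batch_size, num_heads, seq_len, head_dim)

# Concatenate the heads along the feature dimension: (batch, seq, d_model)
concatenated = per_head.transpose(1, 2).reshape(batch_size, seq_len, d_model)

# The output projection mixes information across heads
output_projection = nn.Linear(d_model, d_model)
projected = output_projection(concatenated)

# d_model * d_model weights plus d_model biases
num_params = sum(p.numel() for p in output_projection.parameters())
print(projected.shape)
print(num_params)
```

Before the projection, each block of 16 features comes from exactly one head. After it, every output feature can depend on all heads.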

### Summary
The FullyConnectedLayer (fc) in a Transformer's multi-head self-attention mechanism serves as a critical component for integrating, transforming, and refining the outputs from the attention heads. It adds depth and capacity to the model, enabling complex feature integration and aiding in the model's overall ability to process and understand sequential data effectively.

## What does the split_heads function do and how does it work?

The multi-head self-attention mechanism involves a function often called split_heads or a similar variant. This function is essential for enabling the "multi-head" aspect of the self-attention. Let's delve into what this function does and how it works:

### Purpose of split_heads
The primary purpose of `split_heads` is to enable the model to simultaneously attend to information from different representation subspaces at different positions. By splitting the attention mechanism into multiple heads, the model can capture a richer variety of features in the input data.

Each head in the multi-head attention can potentially focus on different aspects of the input data, allowing for parallel and diverse feature extraction. This leads to a more comprehensive understanding of the input.

### How split_heads Works
1. **Input to the Function:**
    - In the code above, `split_heads` takes one tensor at a time and is called three times, once each for the projected Queries, Keys, and Values. Each of these tensors is the result of passing the input sequence through its own linear layer.

1. **Reshaping the Matrices:**
    - The `split_heads` function reshapes each of Queries, Keys, and Values matrices from their original shape `[batch_size, sequence_length, feature_dimension]` to a new shape `[batch_size, num_heads, sequence_length, feature_dimension/num_heads]`.

    - This reshaping effectively splits the last dimension (feature_dimension) into two dimensions: the number of heads (num_heads) and the reduced feature dimension for each head.

1. **Parallel Attention Processing:**

    - After splitting, each head processes a slice of the original feature dimension, allowing the model to attend to different parts of the feature space independently and in parallel.
    - This parallel processing enables the model to capture different types of relationships in the data, such as different aspects of semantic meaning in a language model.

1. **Recombination and Output:**
    - Once each head has processed its respective slice, the outputs are typically concatenated back together and passed through another linear layer to combine the information from all heads.

    - This recombination ensures that the multi-head attention captures a wide range of information from the input while still being able to integrate these diverse signals.
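
The reshaping can be traced on a concrete tensor. The sketch below repeats the `view` and `transpose` calls from `split_heads` and `combine_heads` outside the class, using the same sizes as the tutorial's example (batch of 16, sequence length 50, `d_model = 128`, 8 heads).

```python
import torch

batch_size, seq_len, d_model, num_heads = 16, 50, 128, 8
head_dim = d_model // num_heads

x = torch.rand(batch_size, seq_len, d_model)

# Split: (batch, seq, d_model) -> (batch, seq, heads, head_dim) -> (batch, heads, seq, head_dim)
split = x.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)

# Combine: undo the transpose, then merge the last two dimensions again
combined = split.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)

print(split.shape)
print(torch.equal(combined, x))

# Head h sees features h * head_dim up to (h + 1) * head_dim - 1 of every token
head_3 = split[:, 3]
print(torch.equal(head_3, x[:, :, 3 * head_dim:4 * head_dim]))
```

Splitting and combining are pure reshapes: no values change, and combining exactly reverses splitting.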

### Summary
The split_heads function in a Transformer's multi-head self-attention mechanism plays a crucial role in diversifying the attention process. By splitting the Queries, Keys, and Values matrices into multiple heads, the Transformer can process the input data in parallel across different feature subspaces, enhancing its ability to capture complex patterns and relationships in the data. This functionality is fundamental to the Transformer architecture's success in various tasks like language understanding, translation, and generation.

# Token and Position Embedding

This class takes as input the vocabulary size vocab_size, the model dimension d_model, and the maximum sequence length max_seq_len. The forward method takes a tensor of shape (batch_size, sequence_length) with token ids and outputs the combined token and position embeddings with shape (batch_size, sequence_length, d_model).

In [None]:
class TokenPositionEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model, max_seq_len):
        super(TokenPositionEmbedding, self).__init__()

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)

        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        batch_size, seq_len = x.size()

        # Create the position ids from 0 to seq_len - 1
        position_ids = torch.arange(0, seq_len, dtype=torch.long, device=x.device).unsqueeze(0).expand(batch_size, -1)

        # Get token and position embeddings
        token_embeds = self.token_embedding(x)
        position_embeds = self.position_embedding(position_ids)

        # Combine token and position embeddings
        embeddings = token_embeds + position_embeds

        return self.dropout(embeddings)

# Example usage:
input_ids = torch.randint(0, vocab_size, (16, max_seq_len))  # 16 is batch_size

embedding_layer = TokenPositionEmbedding(vocab_size, d_model, max_seq_len)
embeddings = embedding_layer(input_ids)

#Enumerate the TokenPositionEmbedding layers
for i, layer in enumerate(embedding_layer.children()):
    print(f"Layer {i}: {layer}")

## What is the purpose of the token and position embedding, and how is it different from a token embedding without a position embedding?

### Token Embedding

The concepts of token embeddings and position embeddings play crucial roles in processing sequential data like text. Let's explore each of these components:

Token embeddings convert each token (like a word in a sentence) into a vector of fixed size. This vector representation captures the semantic information of the token, enabling the model to understand and process language.

In practice, each unique token in the vocabulary is assigned a corresponding vector. These vectors are learned during the training process and are adjusted to encapsulate the meanings and relationships of words.
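
A token embedding is essentially a lookup table with one learnable row per vocabulary entry. The short sketch below, with arbitrary toy sizes, shows that `nn.Embedding` returns the row of its weight matrix selected by each token id.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy vocabulary of 10 tokens, each mapped to a 4-dimensional vector
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

ids = torch.tensor([[3, 7, 3]])   # (batch=1, seq_len=3)
vectors = embedding(ids)          # (1, 3, 4)

print(vectors.shape)
# The same token id always maps to the same row of the weight matrix
print(torch.equal(vectors[0, 0], embedding.weight[3]))
print(torch.equal(vectors[0, 0], vectors[0, 2]))
```

Because the rows are parameters, training adjusts them so that tokens used in similar ways end up with similar vectors.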

If a transformer model uses only token embeddings, it would be able to understand the meaning of each word but not the order in which they appear. Language is inherently sequential, and the order of words affects the overall meaning of a sentence. Without position information, sentences with the same words in different orders would appear identical to the model.

### Position Embedding

Position embeddings are added to the model to give it an understanding of the order or position of words in a sequence. This is crucial for understanding the structure and meaning of sentences.

Position embeddings are vectors that represent the position of each token in the sequence. These vectors are either learned during training or are predefined and based on mathematical functions (like sine and cosine functions).
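
The TokenPositionEmbedding class in this tutorial uses the learned variant. For comparison, here is a minimal sketch of the predefined sine and cosine variant described in the original Transformer paper. The function name `sinusoidal_position_encoding` is an assumption made for illustration, and the sketch assumes an even `d_model`.

```python
import math
import torch

def sinusoidal_position_encoding(max_seq_len, d_model):
    # Assumes d_model is even
    positions = torch.arange(max_seq_len, dtype=torch.float).unsqueeze(1)  # (max_seq_len, 1)
    # One frequency per pair of dimensions: 1 / 10000^(2i / d_model)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
    )
    encoding = torch.zeros(max_seq_len, d_model)
    encoding[:, 0::2] = torch.sin(positions * div_term)  # even dimensions
    encoding[:, 1::2] = torch.cos(positions * div_term)  # odd dimensions
    return encoding

pe = sinusoidal_position_encoding(max_seq_len=200, d_model=128)
print(pe.shape)
```

A fixed table like this has no trainable parameters and would be added to the token embeddings in the same way as the learned position embeddings.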

When combined with token embeddings, the model not only understands the meaning of each word but also the context provided by their order in the sentence. This combination allows the transformer to process sentences effectively, recognizing patterns and relationships that depend on the sequence of words.

### Difference Between Token Embedding with and without Position Embedding

Without position embeddings, the model loses the sequential context. It cannot differentiate between "The cat sat on the mat" and "The mat sat on the cat," which have vastly different meanings.

Transformers are designed to handle sequential data, and position embeddings are crucial for maintaining the sequence information. Without them, transformers would be limited in their ability to process language effectively.

In tasks like translation, question-answering, and text generation, understanding the order of words is essential. Position embeddings significantly enhance the transformer's performance in these tasks.
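
The cat and mat example can be checked numerically. The sketch below is an illustration under simplifying assumptions: the token ids are hypothetical, the embeddings are random rather than trained, and `encode` is a made-up helper that applies a single attention step without learned projections followed by mean pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, d_model, max_seq_len = 10, 8, 6
token_embedding = nn.Embedding(vocab_size, d_model)
position_embedding = nn.Embedding(max_seq_len, d_model)

# Hypothetical token ids: 1="the", 2="cat", 3="sat", 4="on", 5="mat"
cat_on_mat = torch.tensor([1, 2, 3, 4, 1, 5])  # "the cat sat on the mat"
mat_on_cat = torch.tensor([1, 5, 3, 4, 1, 2])  # "the mat sat on the cat"
positions = torch.arange(max_seq_len)

def encode(ids, use_positions):
    x = token_embedding(ids)
    if use_positions:
        x = x + position_embedding(positions)
    # Single-head self-attention (Q = K = V = x), then mean pooling over the sequence
    weights = F.softmax(x @ x.T / d_model ** 0.5, dim=-1)
    return (weights @ x).mean(dim=0)

with torch.no_grad():
    same_without = torch.allclose(encode(cat_on_mat, False), encode(mat_on_cat, False), atol=1e-5)
    same_with = torch.allclose(encode(cat_on_mat, True), encode(mat_on_cat, True), atol=1e-5)

print(same_without, same_with)
```

Without position embeddings the two sentences should produce the same pooled vector, because attention followed by mean pooling ignores token order. With position embeddings the two results should differ.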

### Summary
While token embeddings provide meaning to individual words, position embeddings give the model an understanding of the order of those words, which is crucial for most language processing tasks. The combination of both allows transformers to effectively interpret and generate human language.


# Transformer Block
This class takes as input the model dimension d_model, the number of attention heads num_heads, the feed-forward hidden dimension d_ff, the vocabulary size vocab_size, and the maximum sequence length max_seq_len. The forward method takes a tensor of shape (batch_size, sequence_length) with token ids and an optional mask, and it outputs the processed tensor with shape (batch_size, sequence_length, d_model).
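
The optional mask can be used to stop the model from attending to padding. The MultiHeadSelfAttention class above calls `mask.unsqueeze(1)` to add a head dimension, so a mask of shape (batch_size, sequence_length, sequence_length) broadcasts across heads. The following sketch builds such a mask from token ids, assuming that 0 is the padding id (the default fill value of `pad_sequences`). The example ids are hypothetical, and the mask is applied to random logits for one head rather than to the real model.

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of padded token ids; 0 marks padding
input_ids = torch.tensor([
    [5, 8, 2, 0, 0],
    [7, 3, 9, 4, 1],
])
batch_size, seq_len = input_ids.shape

# 1 where the key position holds a real token, 0 where it is padding
key_is_real = (input_ids != 0).long()                     # (batch, seq)
mask = key_is_real.unsqueeze(1).expand(-1, seq_len, -1)   # (batch, seq, seq)

# Apply the mask the same way the attention class does, on random logits for one head
torch.manual_seed(0)
logits = torch.randn(batch_size, seq_len, seq_len)
weights = F.softmax(logits.masked_fill(mask == 0, float('-inf')), dim=-1)

print(mask.shape)
print(weights[0])  # the last two columns are zero: no attention on padding
```

A mask built this way could be passed as the `mask` argument of the block's forward method. The training code in this tutorial does not pass one.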

In [None]:
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, vocab_size, max_seq_len, dropout=0.1):
        super(TransformerBlock, self).__init__()

        self.embedding_layer = TokenPositionEmbedding(vocab_size, d_model, max_seq_len)

        self.self_attention = MultiHeadSelfAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)

        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Token and position embedding
        x = self.embedding_layer(x)

        # Multi-head self-attention
        attn_output, _ = self.self_attention(x, mask)
        x = self.norm1(x + self.dropout1(attn_output))

        # Position-wise feed-forward
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout2(ff_output))

        return x

# Example usage:
input_ids = torch.randint(0, vocab_size, (16, max_seq_len))  # 16 is batch_size

transformer_block = TransformerBlock(d_model, num_heads, d_ff, vocab_size, max_seq_len)
output = transformer_block(input_ids)

#Enumerate the TransformerBlock layers
for i, layer in enumerate(transformer_block.children()):
    print(f"Layer {i}: {layer}")


# Load the IMDB Data Set


In [None]:
class IMDBDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return torch.tensor(self.x[idx], dtype=torch.long), torch.tensor(self.y[idx], dtype=torch.float)

def load_imdb_data(num_words, max_seq_len):
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_words)

    # Pad sequences to max_seq_len
    x_train = pad_sequences(x_train, maxlen=max_seq_len, padding='post', truncating='post')
    x_test = pad_sequences(x_test, maxlen=max_seq_len, padding='post', truncating='post')

    return x_train, y_train, x_test, y_test

# Example usage:
num_words = vocab_size
batch_size = 16

x_train, y_train, x_test, y_test = load_imdb_data(num_words, max_seq_len)

train_dataset = IMDBDataset(x_train, y_train)
test_dataset = IMDBDataset(x_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
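
To show what `padding='post'` and `truncating='post'` do, here is a small pure-Python sketch for a single sequence. The helper `pad_or_truncate` and the example reviews are illustrative assumptions; this is not the Keras implementation.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Mimic pad_sequences(..., padding='post', truncating='post') for one sequence."""
    truncated = sequence[:max_len]  # 'post' truncation keeps the beginning
    # 'post' padding appends the pad value at the end
    return truncated + [pad_value] * (max_len - len(truncated))

short_review = [1, 14, 22, 16]
long_review = [1, 194, 1153, 194, 8255, 78, 228, 5]

print(pad_or_truncate(short_review, 6))  # [1, 14, 22, 16, 0, 0]
print(pad_or_truncate(long_review, 6))   # [1, 194, 1153, 194, 8255, 78]
```

After this step every review has the same length, which is what allows the DataLoader to stack them into a single tensor per batch.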

# Build the Model

Here's an example of building and training a transformer model using TransformerBlock, MultiHeadSelfAttention, TokenPositionEmbedding, and IMDBDataset from the previous examples. This example calculates and outputs the loss and accuracy for both training and test data for each epoch.

This example creates a TransformerClassifier class that uses the TransformerBlock as the main component. The output of the transformer block is pooled along the sequence dimension using mean pooling before passing through a linear layer for classification.
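
The pooling and classification step can be sketched on its own, with a random tensor standing in for the TransformerBlock output. The sizes mirror the tutorial's settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, d_model, num_classes = 16, 200, 128, 1

# Stand-in for the TransformerBlock output: one d_model vector per token
block_output = torch.randn(batch_size, seq_len, d_model)

# Mean pooling: average over the sequence dimension
pooled = block_output.mean(dim=1)             # (batch, d_model)

# Linear classifier: one logit per review
classifier = nn.Linear(d_model, num_classes)
logits = classifier(pooled)                   # (batch, num_classes)

print(pooled.shape, logits.shape)
```

Note that a plain mean like this also averages over padded positions.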

The training loop iterates through num_epochs and calculates the training and test loss and accuracy for each epoch. Note that the model should be set to train mode during training and eval mode during evaluation to enable/disable dropout and other regularization techniques correctly.

The main components of the code are as follows:

Loading the IMDB dataset: The load_imdb_data function is called to load the IMDB dataset, which Keras already provides as separate training and test sets, and to preprocess it by padding or truncating sequences to a fixed length (max_seq_len).

Creating Dataset and DataLoader instances: PyTorch Dataset and DataLoader instances are created for the training and test sets. These will be used to iterate through the data during the training process.

Defining the model: The TransformerClassifier class is created by combining the TransformerBlock with a fully connected layer for classification. This class is then instantiated using the hyperparameters, such as d_model, num_heads, and d_ff.

Setting up the training loop: The model is trained for a specified number of epochs using BCEWithLogitsLoss and the Adam optimizer. For each epoch, the model is trained on the training set and evaluated on the test set. The loss and accuracy for both sets are calculated and printed for each epoch.

In summary, this sample code demonstrates how to build, train, and evaluate a simple Transformer-based model for sentiment analysis on the Keras IMDB dataset. The model is trained using a single TransformerBlock and the performance metrics (loss and accuracy) are reported for each epoch.
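
The code below uses `nn.BCEWithLogitsLoss` and counts a prediction as positive when the raw output is greater than 0. The following sketch, with made-up logits and labels, illustrates why that threshold is equivalent to a probability of 0.5 and how the accuracy is computed.

```python
import torch
import torch.nn as nn

# Hypothetical raw model outputs (logits) and labels for four reviews
logits = torch.tensor([[2.0], [-1.5], [0.3], [-0.2]])
labels = torch.tensor([1.0, 0.0, 0.0, 1.0])

criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, labels.unsqueeze(1))

# The same loss computed in two steps: sigmoid followed by binary cross-entropy
probs = torch.sigmoid(logits)
manual_loss = nn.functional.binary_cross_entropy(probs, labels.unsqueeze(1))

# A logit above 0 corresponds to a probability above 0.5
predictions = (logits > 0).float()
accuracy = (predictions == labels.unsqueeze(1)).float().mean().item()

print(loss.item(), manual_loss.item())
print(accuracy)  # 0.5: the first two predictions are right, the last two are wrong
```

Working with logits means the model does not need a final sigmoid layer, which is why the classifier ends in a plain linear layer with a single output.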


In [None]:
class TransformerClassifier(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, vocab_size, max_seq_len, num_classes, dropout=0.1):
        super(TransformerClassifier, self).__init__()

        self.transformer_block = TransformerBlock(d_model, num_heads, d_ff, vocab_size, max_seq_len, dropout)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x, mask=None):
        x = self.transformer_block(x, mask)
        x = x.mean(dim=1)
        return self.classifier(x)

def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = criterion(outputs, labels.unsqueeze(1))
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        total += labels.size(0)
        correct += ((outputs > 0) == labels.unsqueeze(1)).sum().item()

    return running_loss / len(loader), correct / total

def evaluate(model, loader, criterion, device):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, labels.unsqueeze(1))

            running_loss += loss.item()
            total += labels.size(0)
            correct += ((outputs > 0) == labels.unsqueeze(1)).sum().item()

    return running_loss / len(loader), correct / total

# Model and training parameters
num_classes = 1
dropout = 0.1
num_epochs = 10
lr = 1e-4
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load data and create DataLoaders
x_train, y_train, x_test, y_test = load_imdb_data(num_words, max_seq_len)
train_dataset = IMDBDataset(x_train, y_train)
test_dataset = IMDBDataset(x_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

# Create the model
model = TransformerClassifier(d_model, num_heads, d_ff, vocab_size, max_seq_len, num_classes, dropout).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)




In [None]:

#Enumerate the model layers

for i, layer in enumerate(model.children()):
    print(f"Layer {i}: {layer}")

print("\n")

for name, param in model.named_parameters():
    print(f"{name}: {param.size()}")
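
The parameter sizes printed above can also be summed into a single count. The sketch below uses a small stand-in model with hypothetical layers rather than the tutorial's TransformerClassifier, but the same expression works for any `nn.Module`.

```python
import torch.nn as nn

# Stand-in model (hypothetical sizes); any nn.Module works the same way
model_sketch = nn.Sequential(
    nn.Embedding(20000, 128),   # 20000 * 128 weights
    nn.Linear(128, 1),          # 128 weights + 1 bias
)

total = sum(p.numel() for p in model_sketch.parameters())
trainable = sum(p.numel() for p in model_sketch.parameters() if p.requires_grad)
print(total, trainable)  # 2560129 2560129
```

With a vocabulary of 20,000 tokens, the token embedding table alone holds 20,000 × 128 = 2,560,000 parameters.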

# Train the model

In [None]:
#Train the model

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc = evaluate(model, test_loader, criterion, device)
    print(f'Epoch {epoch + 1}/{num_epochs}, '
          f'Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}, '
          f'Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc:.4f}')

# More info on Transformers

![Transformer](Optimusprime-originaltoy.jpg)

If you want more info on transformers, and some tutorials that _weren't_ generated by an AI, check out these links:

## Keras tutorial:
https://keras.io/examples/nlp/text_classification_with_transformer/

## Other good tutorials:
https://machinelearningmastery.com/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras/

https://towardsdatascience.com/build-your-own-transformer-from-scratch-using-pytorch-84c850470dcb

https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0

https://www.tensorflow.org/text/tutorials/transformer

https://www.kaggle.com/code/ritvik1909/text-classification-attention


## General Overview:
https://towardsdatascience.com/all-you-need-to-know-about-attention-and-transformers-in-depth-understanding-part-1-552f0b41d021

https://towardsdatascience.com/all-you-need-to-know-about-attention-and-transformers-in-depth-understanding-part-2-bf2403804ada

https://huggingface.co/learn/nlp-course/chapter1/1?fw=pt
