# Advanced Transformer Tutorial Generated by GPT-4

Both the descriptive explanations and the code samples in this tutorial were generated entirely with ChatGPT using the GPT-4 model. Where the initial code had minor errors, the error messages were fed back into GPT-4, which then generated corrected code.

This advanced tutorial builds the main components of the Transformer model, namely the multi-head attention mechanism and the token and position embeddings, from scratch in PyTorch.

## IMDB Sentiment Analysis

The Keras IMDB dataset is a popular dataset for sentiment analysis tasks in natural language processing (NLP). It contains 50,000 movie reviews from the Internet Movie Database (IMDB) labeled as either positive (1) or negative (0) based on the sentiment expressed in the review. The dataset is divided into 25,000 reviews for training and 25,000 reviews for testing.

The reviews in the dataset have been preprocessed, and each review is encoded as a sequence of word indices (integers). The indices represent the overall frequency rank of the words in the entire dataset. For instance, the integer "3" encodes the 3rd most frequent word in the data. This encoding allows for faster processing and less memory usage compared to working with raw text data.
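
To make the frequency-rank encoding concrete, here is a minimal sketch that builds such an encoding for a toy corpus. The corpus and the names `reviews`, `word_to_index`, and `encoded` are made up for illustration. Keras's own loader also reserves a few low indices for special tokens by default, which this sketch ignores.

```python
from collections import Counter

# Toy corpus standing in for the review texts (hypothetical data).
reviews = [
    "the movie was great",
    "the plot was thin",
    "the acting was great the end",
]

# Count how often each word occurs across the whole corpus.
counts = Counter(word for review in reviews for word in review.split())

# Rank words by descending frequency (ties broken alphabetically).
# Rank 1 is the most frequent word.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_to_index = {word: rank for rank, word in enumerate(ranked, start=1)}

# Replace every word with its frequency rank.
encoded = [[word_to_index[w] for w in review.split()] for review in reviews]

print(word_to_index["the"])  # 1
print(encoded[0])            # [1, 6, 2, 3]
```

Because frequent words get small integers, keeping only the top `num_words` ranks is a simple way to cap the vocabulary size.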

The Keras IMDB dataset is typically used for binary classification tasks, where the goal is to build a machine learning model that can predict whether a given movie review is positive or negative based on the text content. The dataset is accessible through the tensorflow.keras.datasets module in the TensorFlow library.


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import math
import numpy as np
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences






# Multi-Head Attention

This class takes as input the model dimension d_model and the number of attention heads num_heads. The forward method takes a tensor of shape (batch_size, sequence_length, d_model) and an optional mask, and it outputs the context vectors and attention weights.

In [3]:
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadSelfAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)

        self.fc = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        attention_logits = torch.matmul(Q, K.transpose(-2, -1)) / (self.head_dim ** 0.5)
        if mask is not None:
            attention_logits = attention_logits.masked_fill(mask == 0, float('-inf'))
        attention_weights = F.softmax(attention_logits, dim=-1)
        return torch.matmul(attention_weights, V), attention_weights

    def split_heads(self, x):
        batch_size, seq_len, _ = x.size()
        return x.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

    def combine_heads(self, x):
        batch_size, _, seq_len, _ = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.size()

        Q = self.split_heads(self.W_Q(x))
        K = self.split_heads(self.W_K(x))
        V = self.split_heads(self.W_V(x))

        if mask is not None:
            mask = mask.unsqueeze(1)

        context_vectors, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        context_vectors = self.combine_heads(context_vectors)

        return self.fc(context_vectors), attention_weights

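# Example usage:
# NOTE: the cell that defined these hyperparameters is not shown in this tutorial;
# the values below are assumptions chosen for illustration.
d_model = 64
num_heads = 4

input_tensor = torch.rand(16, 50, d_model)  # 16 is batch_size and 50 is sequence length

self_attention = MultiHeadSelfAttention(d_model, num_heads)
output, attention_weights = self_attention(input_tensor)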
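
The optional mask is never exercised in this tutorial, so the following is a minimal sketch of how a padding mask might be built and applied in scaled dot-product attention. It uses a single head, and it assumes that token id 0 marks padding (`pad_id` is an illustrative name, not part of the class above).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, seq_len, head_dim = 2, 4, 8
pad_id = 0  # assumption: token id 0 marks padding

# Hypothetical token ids; the trailing zeros are padding.
input_ids = torch.tensor([[5, 7, 9, 0],
                          [3, 4, 0, 0]])

# Shape (batch, 1, seq_len): True for real tokens, False for padding.
# The middle dimension broadcasts over the query positions.
mask = (input_ids != pad_id).unsqueeze(1)

# Random queries, keys, and values for a single head.
Q = torch.rand(batch_size, seq_len, head_dim)
K = torch.rand(batch_size, seq_len, head_dim)
V = torch.rand(batch_size, seq_len, head_dim)

# Scaled dot-product attention: softmax(Q K^T / sqrt(head_dim)) V
logits = Q @ K.transpose(-2, -1) / head_dim ** 0.5
logits = logits.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(logits, dim=-1)
context = weights @ V

print(weights[1])     # the last two columns are zero
print(context.shape)  # torch.Size([2, 4, 8])
```

A mask of shape `(batch_size, 1, seq_len)` should also work with the `forward` method above, since that method adds a head dimension with `unsqueeze(1)` before the mask is broadcast against the attention logits.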

# Token and Position Embedding

This class takes as input the vocabulary size vocab_size, the model dimension d_model, and the maximum sequence length max_seq_len. The forward method takes a tensor of shape (batch_size, sequence_length) with token ids and outputs the combined token and position embeddings with shape (batch_size, sequence_length, d_model).

In [4]:
class TokenPositionEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model, max_seq_len):
        super(TokenPositionEmbedding, self).__init__()

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)

        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        batch_size, seq_len = x.size()

        # Create the position ids from 0 to seq_len - 1
        position_ids = torch.arange(0, seq_len, dtype=torch.long, device=x.device).unsqueeze(0).expand(batch_size, -1)

        # Get token and position embeddings
        token_embeds = self.token_embedding(x)
        position_embeds = self.position_embedding(position_ids)

        # Combine token and position embeddings
        embeddings = token_embeds + position_embeds

        return self.dropout(embeddings)

# Example usage:
vocab_size = 20000
max_seq_len = 200
input_ids = torch.randint(0, vocab_size, (16, max_seq_len))  # 16 is batch_size

embedding_layer = TokenPositionEmbedding(vocab_size, d_model, max_seq_len)
embeddings = embedding_layer(input_ids)
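
As a small illustration of why the position embedding matters, the sketch below feeds the same token id at every position and checks that the combined vectors still differ by position. The sizes are arbitrary illustrative values, and broadcasting is used in place of `expand`, which should give the same result.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, d_model, max_seq_len = 100, 8, 10  # small illustrative sizes

token_embedding = nn.Embedding(vocab_size, d_model)
position_embedding = nn.Embedding(max_seq_len, d_model)

# The same token id (42) repeated at every position of two length-5 sequences.
input_ids = torch.full((2, 5), 42, dtype=torch.long)

# Position ids 0..seq_len-1 with shape (1, seq_len); broadcasting covers the batch.
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

embeddings = token_embedding(input_ids) + position_embedding(position_ids)

print(embeddings.shape)  # torch.Size([2, 5, 8])
# Identical tokens get different vectors at different positions.
print(torch.allclose(embeddings[0, 0], embeddings[0, 1]))  # False
```

Note that a learned position table like this one only covers positions below `max_seq_len`; longer inputs would need to be truncated first.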

# Transformer Block

This class takes as input the model dimension d_model, the number of attention heads num_heads, the feed-forward hidden dimension d_ff, the vocabulary size vocab_size, and the maximum sequence length max_seq_len. The forward method takes a tensor of shape (batch_size, sequence_length) with token ids and an optional mask, and it outputs the processed tensor with shape (batch_size, sequence_length, d_model).

In [5]:
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, vocab_size, max_seq_len, dropout=0.1):
        super(TransformerBlock, self).__init__()

        self.embedding_layer = TokenPositionEmbedding(vocab_size, d_model, max_seq_len)

        self.self_attention = MultiHeadSelfAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)

        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Token and position embedding
        x = self.embedding_layer(x)

        # Multi-head self-attention
        attn_output, _ = self.self_attention(x, mask)
        x = self.norm1(x + self.dropout1(attn_output))

        # Position-wise feed-forward
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout2(ff_output))

        return x

# Example usage:
d_ff = 128  # assumed value; the cell that defined this hyperparameter is not shown

input_ids = torch.randint(0, vocab_size, (16, max_seq_len))  # 16 is batch_size

transformer_block = TransformerBlock(d_model, num_heads, d_ff, vocab_size, max_seq_len)
output = transformer_block(input_ids)
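
Both sub-layers in the block follow the same pattern: add the sub-layer output to its input, then apply layer normalization (often called a post-norm residual connection). The sketch below isolates that step with random stand-in tensors; the sizes and variable names are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 16  # illustrative size
x = torch.rand(2, 5, d_model)             # sub-layer input
sublayer_out = torch.rand(2, 5, d_model)  # stand-in for attention or feed-forward output

norm = nn.LayerNorm(d_model)

# Post-norm residual connection: add first, then normalize over the feature dimension.
y = norm(x + sublayer_out)

# With freshly initialized LayerNorm parameters (weight=1, bias=0),
# each token vector ends up with mean close to 0 and variance close to 1.
print(y.mean(dim=-1).abs().max())
print(y.var(dim=-1, unbiased=False).mean())
```

The residual path lets gradients bypass the sub-layer, and the normalization keeps the scale of each token vector stable from one sub-layer to the next.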

# Load the IMDB Dataset


In [6]:
class IMDBDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return torch.tensor(self.x[idx], dtype=torch.long), torch.tensor(self.y[idx], dtype=torch.float)

def load_imdb_data(num_words, max_seq_len):
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_words)

    # Pad sequences to max_seq_len
    x_train = pad_sequences(x_train, maxlen=max_seq_len, padding='post', truncating='post')
    x_test = pad_sequences(x_test, maxlen=max_seq_len, padding='post', truncating='post')

    return x_train, y_train, x_test, y_test

# Example usage:
num_words = vocab_size
batch_size = 16

x_train, y_train, x_test, y_test = load_imdb_data(num_words, max_seq_len)

train_dataset = IMDBDataset(x_train, y_train)
test_dataset = IMDBDataset(x_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
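
To show what `padding='post'` and `truncating='post'` do to the sequences, here is a small NumPy sketch that mimics that behavior. `pad_post` is a hypothetical helper written for illustration, not a Keras function, and the token ids are made up.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Pad or truncate each sequence at the end so that all have length `maxlen`."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        trimmed = seq[:maxlen]           # 'post' truncation keeps the first maxlen tokens
        out[i, :len(trimmed)] = trimmed  # 'post' padding leaves the filler at the end
    return out

sequences = [[1, 14, 22], [1, 5, 9, 8, 30, 2]]
print(pad_post(sequences, maxlen=4))
# [[ 1 14 22  0]
#  [ 1  5  9  8]]
```

Keras's `pad_sequences` pads and truncates at the start of each sequence by default, which is why the loader above passes `'post'` explicitly for both arguments.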

# Build and Train the Model

Here is an example of building and training a Transformer model using TransformerBlock, MultiHeadSelfAttention, TokenPositionEmbedding, and IMDBDataset from the previous examples. It calculates and prints the loss and accuracy on both the training and test data for each epoch.

This example creates a TransformerClassifier class that uses the TransformerBlock as the main component. The output of the transformer block is pooled along the sequence dimension using mean pooling before passing through a linear layer for classification.
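
Plain mean pooling averages over every position, including padded ones. As a possible variation (not what the tutorial's classifier does), the sketch below averages only over real tokens. `masked_mean_pool` is a hypothetical helper, and it assumes that token id 0 marks padding.

```python
import torch

def masked_mean_pool(hidden, input_ids, pad_id=0):
    """Average token vectors over the sequence, ignoring padded positions."""
    mask = (input_ids != pad_id).unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                          # (batch, d_model)
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts

hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])  # (1, 3, 2)
input_ids = torch.tensor([[7, 9, 0]])                              # last position is padding

print(hidden.mean(dim=1))                   # tensor([[34.6667, 34.6667]])
print(masked_mean_pool(hidden, input_ids))  # tensor([[2., 2.]])
```

Whether this helps on IMDB would need to be checked empirically; the tutorial's results below were obtained with plain mean pooling.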

The training loop iterates through num_epochs and calculates the training and test loss and accuracy for each epoch. Note that the model should be set to train mode during training and eval mode during evaluation to enable/disable dropout and other regularization techniques correctly.
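
The effect of train and eval mode on dropout can be seen in isolation with a minimal sketch like the one below. The dropout probability and tensor size are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)  # roughly half the entries are zeroed; the rest are scaled by 1 / (1 - p)

dropout.eval()
eval_out = dropout(x)   # dropout is the identity in eval mode

print(train_out.unique())        # tensor([0., 2.])
print(torch.equal(eval_out, x))  # True
```

Calling `model.train()` or `model.eval()` on a module switches this mode for all of its submodules at once.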

The main components of the code are as follows:

Loading the IMDB dataset: The load_imdb_data function is called to load the IMDB dataset, pad or truncate each sequence to a fixed length (max_seq_len), and return the predefined training and test splits.

Creating Dataset and DataLoader instances: PyTorch Dataset and DataLoader instances are created for the training and test sets. These are used to iterate through the data in batches during training and evaluation.

Defining the model: The TransformerClassifier class is created by combining the TransformerBlock with a fully connected layer for classification. This class is then instantiated using the hyperparameters, such as d_model, num_heads, and d_ff.

Setting up the training loop: The model is trained for a specified number of epochs using BCEWithLogitsLoss and the Adam optimizer. For each epoch, the model is trained on the training set and evaluated on the test set. The loss and accuracy for both sets are calculated and printed for each epoch.
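
The training code in this section uses `nn.BCEWithLogitsLoss` with a single output logit per review and counts a prediction as positive when the logit is greater than 0. The sketch below, using made-up logits and labels, checks that this loss matches a sigmoid followed by binary cross-entropy, and shows the accuracy computation on a tiny batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0], [-1.0], [0.5]])  # raw model outputs, shape (batch, 1)
labels = torch.tensor([1.0, 0.0, 0.0])         # targets, shape (batch,)

criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, labels.unsqueeze(1))

# Same value computed in two steps: apply the sigmoid, then binary cross-entropy.
probs = torch.sigmoid(logits)
manual = F.binary_cross_entropy(probs, labels.unsqueeze(1))

# Thresholding the logit at 0 is equivalent to thresholding the probability at 0.5.
preds = (logits > 0).float()
accuracy = (preds == labels.unsqueeze(1)).float().mean()

print(torch.allclose(loss, manual))  # True
print(accuracy)                      # tensor(0.6667)
```

Working with logits directly avoids applying the sigmoid in the model's `forward` method, and the fused loss is generally more numerically stable than the two-step version.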

In summary, this sample code demonstrates how to build, train, and evaluate a simple Transformer-based model for sentiment analysis on the Keras IMDB dataset. The model is trained using a single TransformerBlock and the performance metrics (loss and accuracy) are reported for each epoch.


In [9]:
class TransformerClassifier(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, vocab_size, max_seq_len, num_classes, dropout=0.1):
        super(TransformerClassifier, self).__init__()

        self.transformer_block = TransformerBlock(d_model, num_heads, d_ff, vocab_size, max_seq_len, dropout)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x, mask=None):
        x = self.transformer_block(x, mask)
        x = x.mean(dim=1)
        return self.classifier(x)

def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = criterion(outputs, labels.unsqueeze(1))
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        total += labels.size(0)
        correct += ((outputs > 0) == labels.unsqueeze(1)).sum().item()

    return running_loss / len(loader), correct / total

def evaluate(model, loader, criterion, device):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, labels.unsqueeze(1))

            running_loss += loss.item()
            total += labels.size(0)
            correct += ((outputs > 0) == labels.unsqueeze(1)).sum().item()

    return running_loss / len(loader), correct / total

# Model and training parameters
num_classes = 1
dropout = 0.1
num_epochs = 10
lr = 1e-4
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load data and create DataLoaders
x_train, y_train, x_test, y_test = load_imdb_data(num_words, max_seq_len)
train_dataset = IMDBDataset(x_train, y_train)
test_dataset = IMDBDataset(x_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

# Create and train the model
model = TransformerClassifier(d_model, num_heads, d_ff, vocab_size, max_seq_len, num_classes, dropout).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc = evaluate(model, test_loader, criterion, device)
    print(f'Epoch {epoch + 1}/{num_epochs}, '
          f'Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}, '
          f'Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc:.4f}')


Epoch 1/10, Train Loss: 0.6343, Train Acc: 0.6237, Test Loss: 0.5478, Test Accuracy: 0.7199
Epoch 2/10, Train Loss: 0.5074, Train Acc: 0.7491, Test Loss: 0.4775, Test Accuracy: 0.7671
Epoch 3/10, Train Loss: 0.4477, Train Acc: 0.7916, Test Loss: 0.4384, Test Accuracy: 0.7941
Epoch 4/10, Train Loss: 0.4066, Train Acc: 0.8114, Test Loss: 0.4271, Test Accuracy: 0.8021
Epoch 5/10, Train Loss: 0.3753, Train Acc: 0.8319, Test Loss: 0.4086, Test Accuracy: 0.8119
Epoch 6/10, Train Loss: 0.3532, Train Acc: 0.8428, Test Loss: 0.4042, Test Accuracy: 0.8186
Epoch 7/10, Train Loss: 0.3323, Train Acc: 0.8556, Test Loss: 0.3939, Test Accuracy: 0.8240
Epoch 8/10, Train Loss: 0.3138, Train Acc: 0.8660, Test Loss: 0.3914, Test Accuracy: 0.8277
Epoch 9/10, Train Loss: 0.2999, Train Acc: 0.8724, Test Loss: 0.3866, Test Accuracy: 0.8320
Epoch 10/10, Train Loss: 0.2818, Train Acc: 0.8821, Test Loss: 0.3864, Test Accuracy: 0.8324
