# Transformers with Pytorch

## Developing Single-Head Attention


### Section 1: Conceptual Background

#### 1.1 Theoretical Foundation and Key Concepts

**Attention Mechanisms** have revolutionized the field of natural language processing and beyond by allowing models to focus on relevant parts of the input data when making predictions. **Single-Head Attention** is the foundational component of the more complex multi-head attention mechanisms used in transformer architectures.

**Key Components of Single-Head Attention:**
- **Queries (Q)**: Represent the current step's information.
- **Keys (K)**: Represent the information to be matched against the queries.
- **Values (V)**: Represent the information to be aggregated based on the attention scores.
- **Attention Scores**: Calculated by measuring the compatibility between queries and keys.
- **Weighted Sum**: Aggregates the values based on the attention scores to produce the final output.

#### 1.2 Real-World Applications and Relevance

Single-Head Attention is widely used in various applications, including:
- **Machine Translation**: Enhancing the ability of models to translate sentences by focusing on relevant words.
- **Text Summarization**: Allowing models to generate concise summaries by attending to important parts of the text.
- **Image Captioning**: Enabling models to describe images by focusing on relevant regions.
- **Speech Recognition**: Improving the accuracy of transcriptions by concentrating on pertinent audio segments.

**Example Application**: In machine translation, single-head attention helps the model align words in the source language with their corresponding words in the target language, ensuring accurate and contextually relevant translations.

#### 1.3 Prerequisite Knowledge

To effectively complete this assignment, students should be familiar with:
- **Python Programming**: Basic syntax and data structures.
- **PyTorch**: Understanding of tensors, autograd, and neural network modules.
- **Deep Learning Concepts**: Knowledge of neural networks, linear transformations, activation functions, and loss functions.
- **Basic Understanding of Natural Language Processing (NLP)**: Familiarity with sequence data and tokenization.

#### 1.4 Mathematical Concepts and Formulas

Understanding the following mathematical concepts is essential:

- **Dot-Product Attention**: Calculates the attention scores by taking the dot product of queries and keys.
  
  $
  \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
  $
  
  - $ Q $: Queries matrix
  - $ K $: Keys matrix
  - $ V $: Values matrix
  - $ d_k $: Dimension of the keys

- **Softmax Function**: Converts raw attention scores into probabilities that sum to one.

  $
  \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
  $

- **Scaled Dot-Product**: Scaling the dot product by the square root of the key dimension to prevent large values that can destabilize the softmax.

- **Linear Transformation**: Applying linear layers to project inputs into queries, keys, and values.

#### 1.5 Specific Algorithms and Techniques

- **Single-Head Attention Mechanism**: Implements the attention calculation using a single set of queries, keys, and values.
- **Linear Layers**: Used to project input embeddings into queries, keys, and values.
- **Masking (Optional)**: Prevents the model from attending to certain positions, useful in tasks like language modeling.
- **Residual Connections and Layer Normalization (Advanced)**: Enhances training stability and performance.

#### 1.6 Common Pitfalls and Misconceptions

- **Incorrect Dimension Alignment**: Misaligning the dimensions of queries, keys, and values can lead to matrix multiplication errors.
- **Ignoring Scaling Factor**: Forgetting to scale the dot product by the square root of the key dimension can cause softmax to produce extremely small gradients.
- **Overcomplicating the Mechanism**: Single-head attention is foundational; adding unnecessary complexity can hinder understanding.
- **Neglecting to Use Softmax Properly**: Applying softmax incorrectly can result in invalid attention distributions.


In [9]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# Section 2.1: Single-Head Attention Module
class SingleHeadAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SingleHeadAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (
            self.head_dim * heads == embed_size
        ), "Embedding size needs to be divisible by heads"

        # Define linear layers for queries, keys, and values
        self.query_linear = nn.Linear(self.embed_size, self.embed_size)
        self.key_linear = nn.Linear(self.embed_size, self.embed_size)
        self.value_linear = nn.Linear(self.embed_size, self.embed_size)

        # Define the final linear layer
        self.fc_out = nn.Linear(self.embed_size, self.embed_size)

    def forward(self, values, keys, queries, mask=None):
        N = queries.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], queries.shape[1]

        # Apply linear transformations to queries, keys, and values
        queries = self.query_linear(queries)
        keys = self.key_linear(keys)
        values = self.value_linear(values)

        # Reshape queries, keys, and values for multi-head attention
        queries = queries.view(N, query_len, self.heads, self.head_dim)
        keys = keys.view(N, key_len, self.heads, self.head_dim)
        values = values.view(N, value_len, self.heads, self.head_dim)

        # Compute the dot product attention scores
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])

        # Apply scaling
        scaling_factor = np.sqrt(self.head_dim)
        energy = energy / scaling_factor

        # Apply mask (if provided)
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Apply softmax to get attention weights
        attention = F.softmax(energy, dim=-1)

        # Compute the weighted sum of values
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values])

        # Concatenate the heads
        out = out.transpose(1, 2).contiguous().view(N, query_len, self.embed_size)

        # Pass through the final linear layer
        out = self.fc_out(out)

        return out

# Section 2.2: Simple Model Using Single-Head Attention
class SimpleAttentionModel(nn.Module):
    def __init__(self, embed_size, heads, vocab_size, num_classes):
        super(SimpleAttentionModel, self).__init__()
        self.embed_size = embed_size
        self.heads = heads

        # Define an embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_size)

        # Initialize the SingleHeadAttention module
        self.attention = SingleHeadAttention(embed_size, heads)

        # Define a fully connected output layer
        self.fc = nn.Linear(embed_size, num_classes)

    def forward(self, x, mask=None):
        # Pass input through embedding layer
        embedded = self.embedding(x)

        # Apply single-head attention
        attention_out = self.attention(embedded, embedded, embedded, mask)

        # Pool the attention output
        pooled = attention_out.mean(dim=1)

        # Pass through the fully connected layer
        out = self.fc(pooled)

        return out

# Section 2.3: Training Loop (Simplified)
def train(model, data_loader, optimizer, criterion, device):
    model.train()
    for batch in data_loader:
        # Move data to the specified device
        inputs, targets = batch
        inputs, targets = inputs.to(device), targets.to(device)

        # Forward pass
        outputs = model(inputs)

        # Compute loss
        loss = criterion(outputs, targets)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Section 2.4: Evaluation Function
def evaluate(model, data_loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    with torch.no_grad():
        for batch in data_loader:
            # Move data to the specified device
            inputs, targets = batch
            inputs, targets = inputs.to(device), targets.to(device)

            # Forward pass
            outputs = model(inputs)

            # Compute loss
            loss = criterion(outputs, targets)
            total_loss += loss.item()

            preds = outputs.argmax(dim=1)
            correct += (preds == targets).sum().item()

    avg_loss = total_loss / len(data_loader)
    accuracy = correct / len(data_loader.dataset)
    return avg_loss, accuracy

# Section 2.5: Main Function to Initialize and Train the Model
def main():
    # Set hyperparameters
    embed_size = 256
    heads = 1
    vocab_size = 10000  # Example vocabulary size
    num_classes = 2     # Example number of classes
    learning_rate = 1e-3
    num_epochs = 10

    # Initialize the model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = SimpleAttentionModel(embed_size, heads, vocab_size, num_classes).to(device)

    # Define loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # Prepare data loaders (use dummy data for simplicity)
    from torch.utils.data import DataLoader, TensorDataset

    # Example dummy data
    inputs = torch.randint(0, vocab_size, (1000, 10))  # (batch_size, sequence_length)
    targets = torch.randint(0, num_classes, (1000,))

    dataset = TensorDataset(inputs, targets)
    train_loader = DataLoader(dataset, batch_size=32, shuffle=True)
    val_loader = DataLoader(dataset, batch_size=32)

    # Training loop
    for epoch in range(num_epochs):
        train(model, train_loader, optimizer, criterion, device)
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)
        print(f"Epoch {epoch+1}: Val Loss = {val_loss}, Val Accuracy = {val_acc}")

if __name__ == "__main__":
    main()

Epoch 1: Val Loss = 0.6096616592258215, Val Accuracy = 0.747
Epoch 2: Val Loss = 0.2306022602133453, Val Accuracy = 0.922
Epoch 3: Val Loss = 0.02071854089444969, Val Accuracy = 0.997
Epoch 4: Val Loss = 0.0021022752171120374, Val Accuracy = 1.0
Epoch 5: Val Loss = 0.0006700000676573836, Val Accuracy = 1.0
Epoch 6: Val Loss = 0.00039312213766606874, Val Accuracy = 1.0
Epoch 7: Val Loss = 0.0002756175304057251, Val Accuracy = 1.0
Epoch 8: Val Loss = 0.00020742487322422676, Val Accuracy = 1.0
Epoch 9: Val Loss = 0.0001633970225611847, Val Accuracy = 1.0
Epoch 10: Val Loss = 0.00013315497858457093, Val Accuracy = 1.0


### Conceptual Background

#### 1.1 Theoretical Foundation and Key Concepts

**Attention Mechanisms** have significantly advanced the capabilities of neural networks, particularly in the field of natural language processing. **Multi-Head Attention** is an extension of the single-head attention mechanism, allowing the model to focus on different representation subspaces at different positions.

**Key Components of Multi-Head Attention:**
- **Multiple Heads**: Allows the model to attend to information from different representation subspaces.
- **Queries (Q)**, **Keys (K)**, and **Values (V)**: Similar to single-head attention but projected separately for each head.
- **Attention Scores**: Calculated for each head to determine the relevance of keys to the queries.
- **Concatenation and Linear Transformation**: The outputs from all heads are concatenated and passed through a final linear layer.

**Advantages of Multi-Head Attention:**
- **Enhanced Representation**: Captures diverse aspects of the input by attending to different parts simultaneously.
- **Improved Learning**: Facilitates better gradient flow and learning dynamics compared to single-head attention.

#### 1.2 Real-World Applications and Relevance

Multi-Head Attention is a cornerstone of the Transformer architecture and is widely used in various applications, including:
- **Machine Translation**: Enhances the ability to translate sentences by capturing complex relationships between words.
- **Text Summarization**: Allows models to generate concise summaries by focusing on different parts of the text.
- **Question Answering**: Improves the accuracy of answering systems by attending to relevant information in the context.
- **Image Captioning**: Enables models to describe images by focusing on various regions and their relationships.
- **Speech Recognition**: Enhances transcription accuracy by attending to pertinent audio segments.

**Example Application**: In machine translation, multi-head attention helps the model understand and translate sentences by attending to different words and their contextual relationships simultaneously, leading to more accurate and fluent translations.

#### 1.3 Prerequisite Knowledge

To effectively complete this assignment, students should be familiar with:
- **Python Programming**: Basic syntax and data structures.
- **PyTorch**: Understanding of tensors, autograd, and neural network modules.
- **Deep Learning Concepts**: Knowledge of neural networks, linear transformations, activation functions, and loss functions.
- **Basic Understanding of Natural Language Processing (NLP)**: Familiarity with sequence data and tokenization.
- **Single-Head Attention**: Understanding of single-head attention mechanisms as a foundation for multi-head attention.

#### 1.4 Mathematical Concepts and Formulas

Understanding the following mathematical concepts is essential:

- **Scaled Dot-Product Attention**: Calculates the attention scores by taking the dot product of queries and keys, scaling them, and applying the softmax function.
  
  $
  \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
  $
  
  - $ Q $: Queries matrix
  - $ K $: Keys matrix
  - $ V $: Values matrix
  - $ d_k $: Dimension of the keys

- **Multi-Head Attention**: Extends scaled dot-product attention by projecting queries, keys, and values multiple times with different linear projections.

  $
  \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)W^O
  $
  
  where each head is:
  
  $
  \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
  $
  
  - $ W_i^Q, W_i^K, W_i^V $: Projection matrices for the $ i $-th head
  - $ W^O $: Output projection matrix

- **Softmax Function**: Converts raw attention scores into probabilities that sum to one.
  
  $
  \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
  $

- **Linear Transformation**: Applies linear layers to project inputs into queries, keys, and values.

#### 1.5 Specific Algorithms and Techniques

- **Multi-Head Attention Mechanism**: Implements multiple attention heads to capture different aspects of the input data.
- **Linear Layers**: Used to project input embeddings into queries, keys, and values for each head.
- **Masking**: Prevents the model from attending to certain positions, useful in tasks like language modeling and handling padding tokens.
- **Residual Connections and Layer Normalization**: Enhances training stability and performance by allowing gradients to flow more easily and normalizing inputs to layers.
- **Positional Encoding**: Adds information about the position of tokens in the sequence, since attention mechanisms are permutation-invariant.

#### 1.6 Common Pitfalls and Misconceptions

- **Dimension Mismatch**: Misaligning the dimensions of queries, keys, and values can lead to matrix multiplication errors.
- **Ignoring Scaling Factor**: Forgetting to scale the dot product by the square root of the key dimension can cause softmax to produce extremely small gradients.
- **Overcomplicating the Mechanism**: Multi-head attention builds upon single-head attention; adding unnecessary complexity can hinder understanding.
- **Incorrect Masking**: Applying masks incorrectly can prevent the model from attending to necessary information or allow it to attend to irrelevant parts.
- **Neglecting Positional Encoding**: Omitting positional information can limit the model's ability to understand the order of tokens in a sequence.

In [10]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# Section 2.1: Multi-Head Attention Module
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(MultiHeadAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (
            self.head_dim * heads == embed_size
        ), "Embedding size needs to be divisible by heads"

        # Define linear layers for queries, keys, and values
        self.query_linear = nn.Linear(embed_size, embed_size)
        self.key_linear = nn.Linear(embed_size, embed_size)
        self.value_linear = nn.Linear(embed_size, embed_size)

        # Define the final linear layer
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, queries, mask=None):
        N = queries.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], queries.shape[1]

        # Apply linear transformations to queries, keys, and values
        queries = self.query_linear(queries)
        keys = self.key_linear(keys)
        values = self.value_linear(values)

        # Reshape queries, keys, and values for multi-head attention
        queries = queries.view(N, query_len, self.heads, self.head_dim).transpose(1, 2)
        keys = keys.view(N, key_len, self.heads, self.head_dim).transpose(1, 2)
        values = values.view(N, value_len, self.heads, self.head_dim).transpose(1, 2)

        # Compute the dot product attention scores
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])

        # Apply scaling
        scaling_factor = np.sqrt(self.head_dim)
        energy = energy / scaling_factor

        # Apply mask (if provided)
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Apply softmax to get attention weights
        attention = torch.softmax(energy, dim=-1)

        # Compute the weighted sum of values
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values])

        # Concatenate the heads
        out = out.transpose(1, 2).contiguous().view(N, query_len, self.embed_size)

        # Pass through the final linear layer
        out = self.fc_out(out)

        return out

# Section 2.2: Positional Encoding Module
class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, max_length=100):
        super(PositionalEncoding, self).__init__()
        self.embed_size = embed_size

        # Create positional encoding matrix
        positional_encoding = torch.zeros(max_length, embed_size)
        position = torch.arange(0, max_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_size, 2).float() * (-np.log(10000.0) / embed_size))
        positional_encoding[:, 0::2] = torch.sin(position * div_term)
        positional_encoding[:, 1::2] = torch.cos(position * div_term)
        positional_encoding = positional_encoding.unsqueeze(0)

        # Register as buffer to prevent it from being considered a model parameter
        self.register_buffer('positional_encoding', positional_encoding)

    def forward(self, x):
        # Add positional encoding to input embeddings
        x = x + self.positional_encoding[:, :x.size(1), :].to(x.device)
        return x

# Section 2.3: Simple Model Using Multi-Head Attention
class SimpleMultiHeadAttentionModel(nn.Module):
    def __init__(self, embed_size, heads, vocab_size, num_classes):
        super(SimpleMultiHeadAttentionModel, self).__init__()
        self.embed_size = embed_size
        self.heads = heads

        # Define an embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_size)

        # Initialize the PositionalEncoding module
        self.pos_encoder = PositionalEncoding(embed_size)

        # Initialize the MultiHeadAttention module
        self.attention = MultiHeadAttention(embed_size, heads)

        # Define a fully connected output layer
        self.fc = nn.Linear(embed_size, num_classes)

    def forward(self, x, mask=None):
        # Pass input through embedding layer
        embedded = self.embedding(x)

        # Add positional encoding
        embedded = self.pos_encoder(embedded)

        # Apply multi-head attention
        attention_out = self.attention(embedded, embedded, embedded, mask)

        # Pool the attention output
        pooled = attention_out.mean(dim=1)

        # Pass through the fully connected layer
        out = self.fc(pooled)

        return out

# Section 2.4: Training Loop (Simplified)
def train(model, data_loader, optimizer, criterion, device):
    model.train()
    for batch in data_loader:
        # Move data to the specified device
        inputs, targets = batch
        inputs, targets = inputs.to(device), targets.to(device)

        # Forward pass
        outputs = model(inputs)

        # Compute loss
        loss = criterion(outputs, targets)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Section 2.5: Evaluation Function
def evaluate(model, data_loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    with torch.no_grad():
        for batch in data_loader:
            # Move data to the specified device
            inputs, targets = batch
            inputs, targets = inputs.to(device), targets.to(device)

            # Forward pass
            outputs = model(inputs)

            # Compute loss
            loss = criterion(outputs, targets)
            total_loss += loss.item()

            # Compute accuracy
            preds = outputs.argmax(dim=1)
            correct += (preds == targets).sum().item()

    avg_loss = total_loss / len(data_loader)
    accuracy = correct / len(data_loader.dataset)
    return avg_loss, accuracy

# Section 2.6: Main Function to Initialize and Train the Model
def main():
    # Set hyperparameters
    embed_size = 256
    heads = 8
    vocab_size = 10000  # Example vocabulary size
    num_classes = 2     # Example number of classes
    learning_rate = 1e-3
    num_epochs = 10
    max_length = 100

    # Initialize the model
    model = SimpleMultiHeadAttentionModel(embed_size, heads, vocab_size, num_classes)

    # Define loss criterion and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # Move model to device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Prepare data loaders (use dummy data for simplicity)
    from torch.utils.data import DataLoader, TensorDataset

    # Example dummy data
    torch.manual_seed(0)  # For reproducibility
    inputs = torch.randint(0, vocab_size, (1000, max_length))  # (batch_size, sequence_length)
    targets = torch.randint(0, num_classes, (1000,))

    dataset = TensorDataset(inputs, targets)
    train_loader = DataLoader(dataset, batch_size=32, shuffle=True)
    val_loader = DataLoader(dataset, batch_size=32)

    # Training loop
    for epoch in range(num_epochs):
        train(model, train_loader, optimizer, criterion, device)
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)
        print(f"Epoch {epoch+1}: Val Loss = {val_loss:.4f}, Val Accuracy = {val_acc*100:.2f}%")

if __name__ == "__main__":
    main()

Epoch 1: Val Loss = 0.7020, Val Accuracy = 51.20%
Epoch 2: Val Loss = 0.6807, Val Accuracy = 51.20%
Epoch 3: Val Loss = 0.6598, Val Accuracy = 51.20%
Epoch 4: Val Loss = 0.3814, Val Accuracy = 98.50%
Epoch 5: Val Loss = 0.0098, Val Accuracy = 100.00%
Epoch 6: Val Loss = 0.0009, Val Accuracy = 100.00%
Epoch 7: Val Loss = 0.0004, Val Accuracy = 100.00%
Epoch 8: Val Loss = 0.0003, Val Accuracy = 100.00%
Epoch 9: Val Loss = 0.0002, Val Accuracy = 100.00%
Epoch 10: Val Loss = 0.0002, Val Accuracy = 100.00%


## Utilizing the Transformer Class in PyTorch



### Section 1: Conceptual Background

#### 1.1 Theoretical Foundation and Key Concepts

**Transformers** have revolutionized the field of natural language processing (NLP) and beyond by enabling models to handle sequential data efficiently without relying on recurrent structures. Introduced in the seminal paper ["Attention is All You Need"](https://arxiv.org/abs/1706.03762) by Vaswani et al., Transformers leverage self-attention mechanisms to capture long-range dependencies in data.

**Key Components of Transformers:**
- **Self-Attention Mechanism**: Allows the model to weigh the importance of different parts of the input sequence when encoding a particular element.
- **Multi-Head Attention**: Extends self-attention by allowing the model to focus on different representation subspaces simultaneously.
- **Positional Encoding**: Injects information about the position of tokens in the sequence, compensating for the lack of recurrence.
- **Feed-Forward Networks**: Apply non-linear transformations to the data, enhancing the model's capacity to learn complex patterns.
- **Layer Normalization**: Stabilizes and accelerates training by normalizing the inputs across the features.
- **Residual Connections**: Facilitate gradient flow, allowing deeper architectures without vanishing gradients.

**PyTorch’s Transformer Class**: PyTorch provides a flexible and efficient implementation of the Transformer architecture through its `torch.nn.Transformer` class, enabling developers to build and train Transformer-based models with ease.

#### 1.2 Real-World Applications and Relevance

Transformers are at the heart of many state-of-the-art models and applications, including:
- **Machine Translation**: Translating text from one language to another with high accuracy.
- **Text Summarization**: Generating concise summaries of longer documents.
- **Question Answering**: Building systems that can understand and answer questions based on given contexts.
- **Text Generation**: Creating coherent and contextually relevant text, as seen in models like GPT-3.
- **Speech Recognition and Generation**: Converting spoken language to text and vice versa.
- **Image Processing**: Enhancing image recognition and generation tasks through Vision Transformers (ViT).

**Example Application**: Building a **Next-Word Prediction** model using the Transformer class to assist in autocomplete functionalities, enhancing user experience in text editors and messaging apps.

#### 1.3 Prerequisite Knowledge

To effectively complete this assignment, students should be familiar with:
- **Python Programming**: Basic syntax, data structures, and libraries.
- **PyTorch**: Understanding of tensors, automatic differentiation, and neural network modules.
- **Deep Learning Concepts**: Knowledge of neural networks, linear transformations, activation functions, and loss functions.
- **Natural Language Processing (NLP)**: Familiarity with sequence data, tokenization, and language modeling.
- **Attention Mechanisms**: Basic understanding of single-head and multi-head attention.

#### 1.4 Mathematical Concepts and Formulas

Understanding the following mathematical concepts is essential:

- **Scaled Dot-Product Attention**: Calculates attention scores by taking the dot product of queries and keys, scaling them, and applying the softmax function.
  
  $
  \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
  $
  
  - $ Q $: Queries matrix
  - $ K $: Keys matrix
  - $ V $: Values matrix
  - $ d_k $: Dimension of the keys

- **Multi-Head Attention**: Extends scaled dot-product attention by projecting queries, keys, and values multiple times with different linear projections.
  
  $
  \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)W^O
  $
  
  where each head is:
  
  $
  \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
  $
  
  - $ W_i^Q, W_i^K, W_i^V $: Projection matrices for the $ i $-th head
  - $ W^O $: Output projection matrix

- **Positional Encoding**: Adds information about the position of tokens in the sequence using sine and cosine functions of different frequencies.
  
  $
  \text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
  $
  
  $
  \text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
  $
  
  - $ pos $: Position of the token
  - $ i $: Dimension index
  - $ d_{model} $: Embedding dimension

#### 1.5 Specific Algorithms and Techniques

- **Transformer Architecture**: Understanding the encoder-decoder structure, though for this assignment, focus will be on using the Transformer for encoding and predicting next words.
- **Linear Layers**: Used to project input embeddings into queries, keys, and values.
- **Masking**: Prevents the model from attending to future tokens in autoregressive tasks, ensuring causality in predictions.
- **Layer Normalization and Residual Connections**: Enhance training stability and model performance.
- **Positional Encoding**: Incorporates sequence order information into the model.

#### 1.6 Common Pitfalls and Misconceptions

- **Dimension Mismatch**: Misaligning the dimensions of queries, keys, and values can lead to matrix multiplication errors.
- **Ignoring Scaling Factor**: Forgetting to scale the dot product by the square root of the key dimension can cause softmax to produce extremely small gradients.
- **Incorrect Masking**: Applying masks incorrectly can prevent the model from attending to necessary information or allow it to attend to irrelevant parts.
- **Overfitting with Small Datasets**: Using small datasets without proper regularization can lead to models that do not generalize well.
- **Neglecting Positional Encoding**: Omitting positional information can limit the model's ability to understand the order of tokens in a sequence.
- **Improper Learning Rate Selection**: Selecting an inappropriate learning rate can lead to poor convergence or unstable training.

In [11]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import math
from torch.utils.data import Dataset, DataLoader
import numpy as np

# Section 2.1: Data Preprocessing
class TextDataset(Dataset):
    def __init__(self, text, seq_length, vocab):
        """
        Initializes the dataset by encoding the text and creating input-output pairs.

        Args:
            text (str): The raw text data.
            seq_length (int): The length of each input sequence.
            vocab (dict): A dictionary mapping characters to indices.
        """
        self.seq_length = seq_length
        self.vocab = vocab
        self.vocab_size = len(vocab)
        self.encoded_text = self.encode_text(text)
        self.inputs, self.targets = self.create_sequences()

    def encode_text(self, text):
        """
        Encodes the text into integer indices.

        Args:
            text (str): The raw text data.

        Returns:
            List[int]: The encoded text.
        """
        return [self.vocab[char] for char in text if char in self.vocab]

    def create_sequences(self):
        """
        Creates input and target sequences from the encoded text.

        Returns:
            Tuple[List[List[int]], List[int]]: Input sequences and corresponding targets.
        """
        inputs = []
        targets = []
        for i in range(len(self.encoded_text) - self.seq_length):
            inputs.append(self.encoded_text[i:i+self.seq_length])
            targets.append(self.encoded_text[i+self.seq_length])
        return inputs, targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return torch.tensor(self.inputs[idx], dtype=torch.long), torch.tensor(self.targets[idx], dtype=torch.long)

# Section 2.2: Positional Encoding
class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, max_length=5000):
        """
        Initializes the positional encoding module.

        Args:
            embed_size (int): The embedding dimension.
            max_length (int): The maximum length of input sequences.
        """
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_length, embed_size)
        position = torch.arange(0, max_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_size, 2).float() * (-math.log(10000.0) / embed_size))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # Shape: (1, max_length, embed_size)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Adds positional encoding to input embeddings.

        Args:
            x (Tensor): Input embeddings of shape (batch_size, seq_length, embed_size).

        Returns:
            Tensor: Embeddings with positional encoding added.
        """
        x = x + self.pe[:, :x.size(1), :]
        return x

# Section 2.3: Transformer Model
class TransformerModel(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, hidden_dim, num_layers, dropout=0.1):
        """
        Initializes the Transformer-based language model.

        Args:
            vocab_size (int): Size of the vocabulary.
            embed_size (int): Embedding dimension.
            num_heads (int): Number of attention heads.
            hidden_dim (int): Dimension of the feedforward network.
            num_layers (int): Number of Transformer encoder layers.
            dropout (float): Dropout probability.
        """
        super(TransformerModel, self).__init__()
        self.embed_size = embed_size
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.pos_encoder = PositionalEncoding(embed_size)
        encoder_layers = nn.TransformerEncoderLayer(d_model=embed_size, nhead=num_heads, dim_feedforward=hidden_dim, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, num_layers=num_layers)
        self.fc_out = nn.Linear(embed_size, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_mask):
        """
        Defines the forward pass of the model.

        Args:
            src (Tensor): Input tensor of shape (batch_size, seq_length).
            src_mask (Tensor): Mask tensor to prevent attention to future tokens.

        Returns:
            Tensor: Output logits of shape (batch_size, vocab_size).
        """
        src = self.embedding(src) * math.sqrt(self.embed_size)  # Scale embeddings
        src = self.pos_encoder(src)
        src = self.dropout(src)
        src = src.transpose(0, 1)  # Transformer expects (seq_length, batch_size, embed_size)
        output = self.transformer_encoder(src, src_mask)
        output = output.transpose(0, 1)  # Back to (batch_size, seq_length, embed_size)
        output = output.mean(dim=1)  # Aggregate over sequence length
        out = self.fc_out(output)
        return out

# Section 2.4: Mask Generation
def generate_square_subsequent_mask(sz):
    """
    Generates a square mask for the sequence. The masked positions are filled with float('-inf').
    Unmasked positions are filled with float(0.0).

    Args:
        sz (int): Size of the mask (seq_length).

    Returns:
        Tensor: The generated mask of shape (sz, sz).
    """
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

# Section 2.5: Training Loop
def train_epoch(model, data_loader, optimizer, criterion, device, scheduler=None):
    """
    Trains the model for one epoch.

    Args:
        model (nn.Module): The Transformer model.
        data_loader (DataLoader): DataLoader for training data.
        optimizer (Optimizer): Optimizer for training.
        criterion (Loss): Loss function.
        device (torch.device): Device to run the training on.
        scheduler (Scheduler, optional): Learning rate scheduler.

    Returns:
        float: Average training loss for the epoch.
    """
    model.train()
    total_loss = 0
    for batch, (inputs, targets) in enumerate(data_loader):
        inputs = inputs.to(device)
        targets = targets.to(device)
        optimizer.zero_grad()
        seq_length = inputs.size(1)
        src_mask = generate_square_subsequent_mask(seq_length).to(device)
        outputs = model(inputs, src_mask)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        if scheduler:
            scheduler.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(data_loader)
    return avg_loss

# Section 2.6: Evaluation Function
def evaluate(model, data_loader, criterion, device):
    """
    Evaluates the model on the validation set.

    Args:
        model (nn.Module): The Transformer model.
        data_loader (DataLoader): DataLoader for validation data.
        criterion (Loss): Loss function.
        device (torch.device): Device to run the evaluation on.

    Returns:
        Tuple[float, float]: Average validation loss and accuracy.
    """
    model.eval()
    total_loss = 0
    correct = 0
    with torch.no_grad():
        for batch, (inputs, targets) in enumerate(data_loader):
            inputs = inputs.to(device)
            targets = targets.to(device)
            seq_length = inputs.size(1)
            src_mask = generate_square_subsequent_mask(seq_length).to(device)
            outputs = model(inputs, src_mask)
            loss = criterion(outputs, targets)
            total_loss += loss.item()
            preds = torch.argmax(outputs, dim=1)
            correct += (preds == targets).sum().item()
    avg_loss = total_loss / len(data_loader)
    accuracy = correct / len(data_loader.dataset)
    return avg_loss, accuracy

# Section 2.7: Main Function to Initialize and Train the Model
def main():
    # TODO: Set hyperparameters
    embed_size = 128
    num_heads = 8
    hidden_dim = 512
    num_layers = 2
    dropout = 0.1
    seq_length = 30
    batch_size = 64
    num_epochs = 20
    learning_rate = 1e-3

    # TODO: Load and preprocess data
    # For simplicity, use a sample text. Replace this with a larger dataset as needed.
    sample_text = (
        "In the beginning God created the heaven and the earth. "
        "And the earth was without form, and void; and darkness was upon the face of the deep. "
        "And the Spirit of God moved upon the face of the waters. "
        "And God said, Let there be light: and there was light."
    )

    # Create vocabulary
    vocab = sorted(list(set(sample_text)))
    stoi = {ch:i for i, ch in enumerate(vocab)}
    itos = {i:ch for i, ch in enumerate(vocab)}
    vocab_size = len(vocab)

    # Initialize dataset and dataloader
    dataset = TextDataset(sample_text, seq_length, stoi)
    train_size = int(0.8 * len(dataset))
    val_size = len(dataset) - train_size
    train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size)

    # TODO: Initialize the model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = TransformerModel(
        vocab_size=vocab_size,
        embed_size=embed_size,
        num_heads=num_heads,
        hidden_dim=hidden_dim,
        num_layers=num_layers,
        dropout=dropout
    ).to(device)

    # TODO: Define loss criterion and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # TODO: Optionally, define a learning rate scheduler
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

    # TODO: Training loop
    for epoch in range(num_epochs):
        train_loss = train_epoch(model, train_loader, optimizer, criterion, device, scheduler)
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)
        print(f"Epoch {epoch+1}/{num_epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_acc*100:.2f}%")

    # TODO: Save the trained model
    torch.save(model.state_dict(), "transformer_next_word_model.pth")

if __name__ == "__main__":
    main()


Epoch 1/20 | Train Loss: 3.2237 | Val Loss: 3.0104 | Val Acc: 17.78%
Epoch 2/20 | Train Loss: 2.8846 | Val Loss: 3.0277 | Val Acc: 17.78%
Epoch 3/20 | Train Loss: 2.8465 | Val Loss: 3.0359 | Val Acc: 17.78%
Epoch 4/20 | Train Loss: 2.8129 | Val Loss: 3.0358 | Val Acc: 17.78%
Epoch 5/20 | Train Loss: 2.8107 | Val Loss: 3.0302 | Val Acc: 17.78%
Epoch 6/20 | Train Loss: 2.7982 | Val Loss: 3.0233 | Val Acc: 17.78%
Epoch 7/20 | Train Loss: 2.7967 | Val Loss: 3.0177 | Val Acc: 17.78%
Epoch 8/20 | Train Loss: 2.7867 | Val Loss: 3.0172 | Val Acc: 17.78%
Epoch 9/20 | Train Loss: 2.8059 | Val Loss: 3.0167 | Val Acc: 17.78%
Epoch 10/20 | Train Loss: 2.7699 | Val Loss: 3.0161 | Val Acc: 17.78%
Epoch 11/20 | Train Loss: 2.7873 | Val Loss: 3.0161 | Val Acc: 17.78%
Epoch 12/20 | Train Loss: 2.7817 | Val Loss: 3.0161 | Val Acc: 17.78%
Epoch 13/20 | Train Loss: 2.7878 | Val Loss: 3.0160 | Val Acc: 17.78%
Epoch 14/20 | Train Loss: 2.7684 | Val Loss: 3.0160 | Val Acc: 17.78%
Epoch 15/20 | Train Loss: 2.7