# 1. Introduction

## Introduction to Building and Training a Transformer Model with PyTorch

Transformers have significantly advanced natural language processing tasks, including text generation, translation, and summarization. This tutorial will guide you through the process of building and training a Transformer model from scratch using PyTorch. We will cover key components such as:

- Input embeddings
- Positional encoding
- Multi-head attention
- Feedforward blocks

By the end of this tutorial, you should understand how these elements fit together to form a working language model.

# Table of Contents
1. [Input Embeddings](#input-embeddings)
2. [Positional Encoding](#positional-encoding)
3. [Layer Normalization](#layer-normalization)
4. [Feed Forward Block](#feed-forward-block)
5. [Multi-Head Attention Block](#multi-head-attention-block)
6. [Residual Connection](#residual-connection)
7. [Projection Head](#projection-head)
8. [Transformer Block](#transformer-block)
9. [Building the Transformer](#building-the-transformer)
10. [Sample Usage](#sample-usage)
11. [Training the Transformer](#training-the-transformer)
    1. [Data Preprocessing](#data-preprocessing)
    2. [Model Training](#model-training)
12. [Inference](#inference)

**Install Dependencies**

Before we start building our Transformer model, we need to install and import the necessary libraries. Each of these libraries plays a specific role in our project:

- **math**: Provides access to mathematical functions.
- **pandas**: Used for data manipulation and analysis. It helps us handle and preprocess our datasets efficiently.
- **torch**: The core library of PyTorch, used for building and training neural networks.
- **torch.nn**: A sub-library of PyTorch that provides tools for building neural network layers.
- **torch.nn.functional**: Contains functions used for building neural networks (often used for activation functions and other operations).
- **torch.optim**: Contains optimization algorithms for training neural networks.
- **torch.utils.data**: Provides tools like DataLoader and Dataset to manage and process data efficiently.
- **transformers**: A library by Hugging Face that provides pre-trained models and tools for natural language processing. We will use `AutoTokenizer` from this library to tokenize our input text.

In [None]:
import math
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

**1. Input Embeddings**

This cell initializes the embedding layer that converts tokenized input into dense numerical vectors. Embeddings capture semantic relationships between tokens and are crucial for the Transformer model to understand the meaning of words in the input sequence.

An embedding layer is essentially a lookup table where each token in the vocabulary is mapped to a vector of fixed size (the embedding dimension). These vectors are learned during training and represent tokens in a continuous vector space. After training, words with similar meanings tend to have similar vector representations.

In the context of natural language processing, vectors are mathematical representations of words. Each word is represented by a point in a high-dimensional space. The dimensions can be thought of as features that capture different aspects of the word's meaning. 

For example, in a 3-dimensional space, a word like "king" might be represented by a vector (0.5, 0.2, 0.7). These vectors allow us to perform mathematical operations to understand word relationships. One common operation is calculating the cosine similarity between two vectors to measure how similar they are. Cosine similarity is the cosine of the angle between two vectors, with values ranging from -1 (completely opposite) to 1 (exactly the same).
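To make the cosine similarity described above concrete, here is a minimal sketch. The three-dimensional vectors are invented for illustration (the names `king`, `queen`, and `opposite` and their values are assumptions, not learned embeddings), and the helper `cosine` is a hypothetical convenience wrapper around `F.cosine_similarity`.

```python
import torch
import torch.nn.functional as F

# Hypothetical 3-dimensional word vectors, chosen by hand for illustration only
king = torch.tensor([0.5, 0.2, 0.7])
queen = torch.tensor([0.5, 0.3, 0.6])
opposite = -king  # points in exactly the opposite direction


def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine of the angle between two 1-D vectors."""
    return F.cosine_similarity(a, b, dim=0).item()


sim_same = cosine(king, king)          # identical direction
sim_close = cosine(king, queen)        # similar direction
sim_opposite = cosine(king, opposite)  # opposite direction
print(sim_same, sim_close, sim_opposite)
```

The first value should be very close to 1, the last very close to -1, and the middle one somewhere in between, reflecting how closely the two directions agree.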

In [None]:
class InputEmbedding(nn.Module):
    def __init__(self, embed_dim: int, vocab_size: int):
        """
        Initialize the InputEmbedding module.

        Args:
            embed_dim (int): The dimensionality of the input embedding.
            vocab_size (int): The size of the vocabulary.

        """
        super().__init__()
        # Store the dimensionality and vocabulary size
        self.embed_dim = embed_dim
        self.vocab_size = vocab_size

        # Create an embedding layer that maps the vocabulary to an embed_dim-dimensional space
        # The embedding layer should have shape (vocab_size, embed_dim)
        self.embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, x):
        """
        Perform the forward pass of the InputEmbedding module.

        Args:
            x (tensor): The input tensor.

        Returns:
            tensor: The embedded input tensor after scaling it by the square root of the dimensionality.

        """
        # Embed the input tensor using the embedding layer
        # Shape: (batch_size, seq_len) -> (batch_size, seq_len, embed_dim)
        embedded_input = self.embedding(x)
        # Scale the embedded input tensor by the square root of the dimensionality
        # Shape: (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        scaled_embedded_input = embedded_input * math.sqrt(self.embed_dim)
        return scaled_embedded_input

**2. Positional Encoding**

This cell computes positional encodings that add information about the order of tokens in the sequence. Positional encodings are necessary because self-attention on its own is insensitive to token order; the encodings supply that information.

Unlike RNNs, which process input tokens sequentially, Transformers process all tokens in parallel. This parallel processing does not inherently capture the order of tokens. Positional encoding addresses this by adding a unique positional signal to each token, allowing the Transformer to differentiate between tokens at different positions.

The positional encoding is calculated using sine and cosine functions of different frequencies. This ensures that each position has a unique encoding that can be added to the input embeddings. The following code shows how to implement positional encoding:

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, embed_dim: int = 512, max_seq_len: int = 100, dropout: float = 0.1):
        """Initialize the PositionalEncoding module."""
        super().__init__()
        self.embed_dim = embed_dim
        self.max_seq_len = max_seq_len
        self.dropout = nn.Dropout(dropout)
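        # Precompute the positional encoding matrix and register it as a non-trainable buffer,
        # so it follows the module across devices without being saved in the state dict
        self.register_buffer(
            "positional_encoding",
            self._precompute_positional_encoding(max_seq_len, embed_dim),
            persistent=False,
        )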

    def _precompute_positional_encoding(self, max_seq_len, embed_dim):
        """Precompute the positional encoding matrix."""
        with torch.no_grad():
            # Create a positional encoding matrix of shape (max_seq_len, embed_dim)
            positional_encoding = torch.zeros(max_seq_len, embed_dim)
            # Create a tensor 'pos' with values [0, 1, 2, ..., max_seq_len - 1] (max_seq_len, 1)
            position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
            # Compute the positional encoding matrix
            division_term = torch.exp(torch.arange(0, embed_dim, 2).float() * (-torch.log(torch.tensor(10000.0)) / embed_dim))
            positional_encoding[:, 0::2] = torch.sin(position * division_term)
            positional_encoding[:, 1::2] = torch.cos(position * division_term)
            # Shape (max_seq_len, embed_dim) -> (1, max_seq_len, embed_dim)
            positional_encoding = positional_encoding.unsqueeze(0)

        return positional_encoding

    def forward(self, x):
        """Perform the forward pass of the PositionalEncoding module."""
        # Add the positional encoding matrix to the input tensor
        x = x + self.positional_encoding[:, : x.size(1)].to(x.device)
        # Apply dropout to the input tensor
        x = self.dropout(x)
        return x
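
As a quick sanity check of the sine and cosine scheme described above, the following standalone sketch rebuilds the encoding matrix and inspects it. The function name `sinusoidal_encoding` and the small sizes are assumptions made for this example, and an even `embed_dim` is assumed.

```python
import math
import torch


def sinusoidal_encoding(max_seq_len: int, embed_dim: int) -> torch.Tensor:
    """Return a (max_seq_len, embed_dim) matrix of sine/cosine position signals.

    Assumes embed_dim is even.
    """
    position = torch.arange(max_seq_len, dtype=torch.float).unsqueeze(1)  # (max_seq_len, 1)
    # One frequency per sine/cosine pair: 1 / 10000^(2i / embed_dim)
    division_term = torch.exp(
        torch.arange(0, embed_dim, 2, dtype=torch.float) * (-math.log(10000.0) / embed_dim)
    )
    encoding = torch.zeros(max_seq_len, embed_dim)
    encoding[:, 0::2] = torch.sin(position * division_term)  # even columns
    encoding[:, 1::2] = torch.cos(position * division_term)  # odd columns
    return encoding


pe = sinusoidal_encoding(max_seq_len=50, embed_dim=16)
print(pe.shape)
print(pe[0])  # position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns

# Count distinct rows to confirm that every position gets its own encoding
num_unique_rows = torch.unique(pe, dim=0).size(0)
print(num_unique_rows)
```

Because every value is a sine or cosine, the encodings stay within [-1, 1] and can be added to the scaled embeddings without dominating them.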

**3. Layer Normalization**

Implements layer normalization, which normalizes activations across the feature dimension. This improves training stability by keeping the scale of each layer's inputs consistent (an effect often described as reducing internal covariate shift), so each layer depends less on how earlier layers change during training.

Layer normalization is a technique to normalize the inputs of a layer to have zero mean and unit variance. This helps in stabilizing and speeding up the training process. Unlike batch normalization, which normalizes across the batch dimension, layer normalization normalizes across the features. This makes it suitable for sequence models where batch statistics can vary significantly.
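The normalization over features can be checked with a short sketch that compares a manual computation with PyTorch's built-in `nn.LayerNorm`. The tensor sizes and the `eps` value are arbitrary choices for this example; the population variance (`unbiased=False`) is used because that is what `nn.LayerNorm` computes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 4, 8)  # (batch_size, seq_len, embed_dim)

# Manual normalization over the last (feature) dimension
eps = 1e-5
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # population variance
manual = (x - mean) / torch.sqrt(var + eps)

# Built-in layer; at initialization its scale is 1 and its shift is 0
builtin = nn.LayerNorm(8, eps=eps)(x)

print(manual.mean(dim=-1))  # close to 0 for every token
print(torch.allclose(manual, builtin, atol=1e-5))
```

Each token is normalized using only its own features, so the result does not depend on what else is in the batch.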

In [None]:
class LayerNormalization(nn.Module):
    def __init__(self, embed_dim: int, eps: float = 1e-6):
        """Initialize the LayerNormalization module."""
        super().__init__()
        self.eps = eps
        # Create two learnable parameters to scale and shift the normalized input
        self.gain = nn.Parameter(torch.ones(embed_dim))   # Scale starts at 1, so the layer initially only normalizes
        self.bias = nn.Parameter(torch.zeros(embed_dim))  # Shift starts at 0


    def forward(self, x):
        """Perform the forward pass of the LayerNormalization module."""
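        # Compute the mean and standard deviation of the input tensor over the feature dimension
        # Note: Tensor.std applies Bessel's correction by default, so results differ slightly from nn.LayerNorm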
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        # Zero center by subtracting the mean from the input tensor
        # Normalize scale by dividing by the standard deviation and add epsilon for numerical stability
        # Scale and shift the normalized input using the learnable parameters
        return (x - mean) / (std + self.eps) * self.gain + self.bias

**4. Feed Forward Block**

Defines a feedforward neural network block, which consists of two linear transformations with a ReLU activation function in between. This block facilitates nonlinear transformations of feature representations learned by the Transformer model.

A feedforward block is a small fully connected network applied to each position separately and identically. The first linear layer projects the input to a higher-dimensional space, and the second projects it back to the original dimension. ReLU (Rectified Linear Unit), applied between the two layers, adds the non-linearity that lets the model learn more complex patterns.
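The phrase "applied to each position separately and identically" can be verified directly. The sketch below uses a stand-in network built with `nn.Sequential` (dropout is left out so the outputs are deterministic) and small, arbitrary sizes; it is an illustration, not the tutorial's `FeedForwardBlock`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, intermediate_size = 8, 32  # small illustrative sizes

# Two linear layers with a ReLU in between
ffn = nn.Sequential(
    nn.Linear(embed_dim, intermediate_size),
    nn.ReLU(),
    nn.Linear(intermediate_size, embed_dim),
)

x = torch.randn(2, 5, embed_dim)  # (batch_size, seq_len, embed_dim)
full_output = ffn(x)              # the whole sequence at once

# Apply the same network to one position at a time and stack the results
per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

print(full_output.shape)
print(torch.allclose(full_output, per_position, atol=1e-5))
```

Both routes give the same result, which shows that the block never mixes information between positions; that job belongs to the attention layer.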

In [None]:
class FeedForwardBlock(nn.Module):
    def __init__(self, embed_dim: int, intermediate_size: int, dropout: float = 0.1):
        """Initialize the FeedForwardBlock module.

        Args:
            embed_dim (int): Hidden size of the Transformer model; the input and output size of the block.
            intermediate_size (int): Hidden size of the intermediate layer in the block.
            dropout (float): The dropout probability.
        """
        super().__init__()
        self.fc1 = nn.Linear(embed_dim, intermediate_size) # W1 and B1 in the formula
        self.fc2 = nn.Linear(intermediate_size, embed_dim) # W2 and B2 in the formula
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """Perform the forward pass of the FeedForwardBlock module."""
        # (Batch, Seq_len, embed_dim) -> (Batch, Seq_len, intermediate_size) -> (Batch, Seq_len, embed_dim)
        x_intermediate = self.dropout(F.relu(self.fc1(x)))
        x_output = self.fc2(x_intermediate)
        return x_output

**5. Multi-Head Attention Block**

Implements multi-head attention, which allows the model to jointly attend to information from different representation subspaces at different positions. This mechanism enhances the model's ability to capture dependencies and relationships in the input data.

Attention mechanisms allow models to focus on specific parts of the input sequence, giving more importance to certain words or tokens. Multi-head attention extends this idea by having multiple attention heads, each attending to different parts of the sequence independently. The results from each head are then concatenated and linearly transformed.

Here's how it works in detail:
- **Query, Key, Value (QKV) Transformations**: Each attention head has three learned projection matrices, one each for queries, keys, and values. They transform the input into query (Q), key (K), and value (V) vectors.
- **Attention Scores**: The attention score is the dot product of a query vector with a key vector, divided by the square root of the key dimension. A softmax over the scores then yields the attention weights.
- **Weighted Sum**: The attention weights are used to compute a weighted sum of the value vectors.
- **Concatenation and Projection**: The outputs of all the heads are concatenated and projected back to the original dimension.
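
The steps listed above can be sketched for a single head in a few lines. The random tensors below stand in for already projected query, key, and value vectors, and the sizes are arbitrary assumptions for this example.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, head_dim = 4, 8

# Stand-ins for the projected query, key, and value vectors of one head
q = torch.randn(seq_len, head_dim)
k = torch.randn(seq_len, head_dim)
v = torch.randn(seq_len, head_dim)

scores = q @ k.T / math.sqrt(head_dim)  # (seq_len, seq_len) scaled dot products
weights = F.softmax(scores, dim=-1)     # each row is a distribution over positions
output = weights @ v                    # (seq_len, head_dim) weighted sum of values

print(weights.sum(dim=-1))  # every row sums to 1
print(output.shape)
```

Multi-head attention runs this computation once per head on a slice of the embedding, then concatenates the per-head outputs before the final projection.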

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int = 512, num_heads: int = 8, attn_dropout: float = 0.1, ff_dropout: float = 0.1, max_len: int = 512):
        super().__init__()
        self.num_heads = num_heads
        assert embed_dim % self.num_heads == 0, "invalid heads and embedding dimension configuration"
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.attn_dropout = nn.Dropout(attn_dropout)
        self.proj_dropout = nn.Dropout(ff_dropout)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.size()
        # Project the input into queries, keys, and values with the linear layers,
        # split the last dimension into (num_heads, head_dim),
        # then swap the seq_len and num_heads axes
        # Shape: (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim) ->
        # (batch_size, seq_len, num_heads, head_dim) -> (batch_size, num_heads, seq_len, head_dim)
        q = self.query(x).view(batch_size, seq_len, self.num_heads, -1).transpose(1, 2)
        k = self.key(x).view(batch_size, seq_len, self.num_heads, -1).transpose(1, 2)
        v = self.value(x).view(batch_size, seq_len, self.num_heads, -1).transpose(1, 2)

        # Compute attention scores using Einsum
        # b: batch size, h: num_heads, i: seq_len, j: seq_len, d: head_dim
        # Multiply query and key tensors element-wise and sum along the shared dimension (head_dim)
        # Divide by the square root of the dimension of the query/key vectors
        # Equivalent to: attention = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
        # Shape: (batch_size, num_heads, seq_len, head_dim) * (batch_size, num_heads, seq_len, head_dim)
        # -> (batch_size, num_heads, seq_len, seq_len)
        attention = torch.einsum('bhid,bhjd->bhij', q, k) / math.sqrt(q.size(-1))

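        # Apply mask if provided: positions where mask == 0 are excluded from attention
        # The mask must be broadcastable to (batch_size, num_heads, seq_len, seq_len)
        if mask is not None:
            attention = attention.masked_fill(mask == 0, float("-inf"))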

        # Apply softmax over the key dimension, then dropout
        # Shape: (batch_size, num_heads, seq_len, seq_len) -> (batch_size, num_heads, seq_len, seq_len)
        attention = self.attn_dropout(F.softmax(attention, dim=-1))

        # Compute the weighted sum of values using attention scores
        # Equivalent to: torch.matmul(attention, v)
        # Shape: (batch_size, num_heads, seq_len, seq_len) * (batch_size, num_heads, seq_len, head_dim)
        # -> (batch_size, num_heads, seq_len, head_dim)
        y = torch.einsum('bhij,bhjd->bhid', attention, v)

        # Merge the num_heads and head_dim back to the embed_dim
        # Transpose sequence length and num_heads
        # Flatten out the full tensor
        # Reshape based on batch size, sequence length and embed_dim
        # Shape: (batch_size, num_heads, seq_len, head_dim) -> (batch_size, seq_len, num_heads, head_dim)
        # -> (batch_size, seq_len, num_heads * head_dim)
        # -> (batch_size, seq_len, embed_dim)
        y = y.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)

        # Apply linear transformation and dropout
        # Shape: (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        return self.proj_dropout(self.proj(y))
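
The `mask == 0` convention used in `forward` means a mask holds 1 where attention is allowed and 0 where it is blocked. As a hedged sketch of how such a mask might be built for a decoder, the example below combines a causal mask with a padding mask. The shapes, the right-padding layout, and the variable names are assumptions made for illustration.

```python
import torch

batch_size, seq_len = 2, 4

# Causal mask: position i may attend to positions j <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))
causal_mask = causal_mask.view(1, 1, seq_len, seq_len)

# Padding mask in the usual tokenizer layout: 1 = real token, 0 = padding
padding_mask = torch.tensor([[1, 1, 1, 1],
                             [1, 1, 0, 0]])  # (batch_size, seq_len)
padding_mask = padding_mask.view(batch_size, 1, 1, seq_len)

# Broadcasting combines them into (batch_size, 1, seq_len, seq_len)
combined_mask = causal_mask * padding_mask

# Dummy scores for one head, masked the same way as in the attention code
scores = torch.zeros(batch_size, 1, seq_len, seq_len)
masked = scores.masked_fill(combined_mask == 0, float("-inf"))
weights = torch.softmax(masked, dim=-1)

print(combined_mask[1, 0])
print(weights[1, 0])
```

A mask of shape `(batch_size, 1, seq_len, seq_len)` broadcasts across all heads. A raw `(batch_size, seq_len)` padding mask does not line up with the four-dimensional score tensor in general, so it needs reshaping along these lines before it is passed in.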

**6. Residual Connection**

Defines a residual connection that adds the input to the output of a sub-layer before applying layer normalization. This helps in mitigating the vanishing gradient problem and allows gradients to flow more effectively during training.

Residual connections, also known as skip connections, allow the input to bypass one or more layers. Adding the input directly to the output of a layer preserves the original signal and gives gradients a direct path during backpropagation, which counters vanishing gradients and makes deep networks easier to train.
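The "direct path" for gradients can be seen in a toy example. Below, a linear sub-layer is deliberately zeroed out (an artificial setup assumed purely for illustration) so that anything reaching the input must have travelled through the skip connection.

```python
import torch
import torch.nn as nn

embed_dim = 8
sublayer = nn.Linear(embed_dim, embed_dim)
# Zero the sub-layer so it contributes nothing to the output or to the gradient w.r.t. x
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

x = torch.randn(2, 4, embed_dim, requires_grad=True)

# Residual (skip) connection: output = input + sublayer(input)
y = x + sublayer(x)
y.sum().backward()

print(torch.equal(y, x))  # the input passes through unchanged
print(x.grad)             # all ones: the gradient reaches x via the skip path
```

Without the `x +` term, both the output and the gradient with respect to `x` would be zero in this setup.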

Implementation time :)

In [None]:
class ResidualConnection(nn.Module):
    def __init__(self, embed_dim, dropout: float = 0.1):
        """Initialize the ResidualConnection module."""
        super().__init__()
        self.layer_norm = LayerNormalization(embed_dim=embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        """Perform the forward pass of the ResidualConnection module."""
        # Apply sublayer (e.g., feedforward block) and dropout to its output
        # (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        sublayer_output = self.dropout(sublayer(x))
        # Add residual connection
        # (batch_size, seq_len, embed_dim) + (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        residual_output = x + sublayer_output
        # Apply layer normalization to the sum
        # (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        return self.layer_norm(residual_output)

**7. Projection Head**

Defines the projection head, which projects the model's hidden representations back into the vocabulary space for output generation. This final layer expresses the model's output as a score for every token in the vocabulary.

The projection head is a linear layer that maps each hidden state to a vector of logits, one per vocabulary token. Applying a softmax to the logits turns them into probabilities, which is what makes tasks like text generation possible.
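A minimal sketch of that last step, from hidden states to a predicted token, might look as follows. The sizes are small, arbitrary values, the hidden states are random stand-ins, and greedy `argmax` selection is only one of several possible decoding choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim, vocab_size = 16, 100  # small illustrative sizes

projection = nn.Linear(embed_dim, vocab_size)
hidden_states = torch.randn(2, 5, embed_dim)  # (batch_size, seq_len, embed_dim)

logits = projection(hidden_states)  # (batch_size, seq_len, vocab_size)
probs = F.softmax(logits, dim=-1)   # probabilities over the vocabulary

# Greedy choice of the next token, read from the last position of each sequence
next_token_ids = probs[:, -1].argmax(dim=-1)  # (batch_size,)

print(logits.shape, next_token_ids.shape)
```

During training the raw logits are typically passed straight to a cross-entropy loss, which applies the softmax internally.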

In [None]:
class ProjectionHead(nn.Module):
    def __init__(self, embed_dim: int, vocab_size: int):
        """Initialize the ProjectionHead module."""
        super().__init__()
        self.fc = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        """Perform the forward pass of the ProjectionHead module."""
        # Apply linear transformation to the input tensor
        # (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, vocab_size)
        return self.fc(x)

**8. Transformer Block**

Combines all the previously defined components (multi-head attention, feedforward block, residual connection, and layer normalization) into a single Transformer block. This block forms the core building block of the Transformer architecture, facilitating the model's ability to process and generate text.

A Transformer block consists of two main sub-layers:
- **Multi-Head Attention**: This sub-layer helps the model focus on different parts of the input sequence simultaneously.
- **Feedforward Neural Network**: This sub-layer performs nonlinear transformations on each position separately and identically.

Each sub-layer is followed by a residual connection and layer normalization to ensure stable training. The following code shows how to implement a Transformer block:

In [None]:
class DecoderBlock(nn.Module):
    def __init__(
        self,
        embed_dim: int = 512,
        num_heads: int = 8,
        ff_dim: int = 2048,
        attn_dropout: float = 0.1,
        ff_dropout: float = 0.1,
        dropout: float = 0.1,
        max_len: int = 512,
    ):
        super().__init__()
        # Initialize multi-head self-attention mechanism
        self.MultiHeadAttention = MultiHeadAttention(
            embed_dim=embed_dim,
            num_heads=num_heads,
            attn_dropout=attn_dropout,
            ff_dropout=ff_dropout,
            max_len=max_len,
            )
        # Initialize feed-forward block
        self.feed_forward = FeedForwardBlock(
            embed_dim=embed_dim,
            intermediate_size=ff_dim,
            dropout=ff_dropout,
            )
        # Initialize residual connections
        self.residual_connection1 = ResidualConnection(embed_dim=embed_dim, dropout=dropout)
        self.residual_connection2 = ResidualConnection(embed_dim=embed_dim, dropout=dropout)

    def forward(self, x, attention_mask=None):
        # Apply self-attention mechanism with residual connection
        x_with_attention = self.residual_connection1(x, lambda x: self.MultiHeadAttention(x, mask=attention_mask))
        # Apply feed-forward block with residual connection
        x_with_ff = self.residual_connection2(x_with_attention, self.feed_forward)
        return x_with_ff

**9. Building the Transformer**

Constructs the entire Transformer architecture by stacking multiple Transformer blocks. This setup creates a deep neural network capable of processing sequential data and generating output predictions based on learned patterns and relationships.

The following code shows how to build the Transformer model by stacking several Transformer blocks and adding the input embedding, positional encoding, and projection head:

In [None]:
class GPT(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        embed_dim: int = 512,
        max_len: int = 512,
        embed_dropout: float = 0.1,
        num_blocks: int = 6,
        num_heads: int = 8,
        ff_dim: int = 2048,
        attn_dropout: float = 0.1,
        ff_dropout: float = 0.1
    ):
        super().__init__()
        self.max_len = max_len
        self.token_embedding = InputEmbedding(
            embed_dim=embed_dim,
            vocab_size=vocab_size,
            )
        self.positional_embedding = PositionalEncoding(
            embed_dim=embed_dim,
            max_seq_len=max_len,
            dropout=embed_dropout,
            )
        self.blocks = nn.ModuleList([DecoderBlock(
            embed_dim=embed_dim,
            num_heads=num_heads,
            ff_dim=ff_dim,
            attn_dropout=attn_dropout,
            ff_dropout=ff_dropout,
            max_len=max_len,
            ) for _ in range(num_blocks)])

        self.projection_head = ProjectionHead(embed_dim=embed_dim, vocab_size=vocab_size)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor = None):
        # Read the sequence length from input_ids of shape (batch_size, seq_len)
        seq_len = input_ids.size(1)
        assert seq_len <= self.max_len, "Sequence longer than model capacity"

        # Token embedding
        # Shape: (batch_size, seq_len) -> (batch_size, seq_len, embed_dim)
        x = self.token_embedding(input_ids)  # (batch_size, seq_len, embed_dim)

        # Add positional embedding
        # Shape: (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        x = self.positional_embedding(x)

        # Forward through decoder blocks
        # output of each block is the hidden state of the transformer
        # Shape: (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        for block in self.blocks:
            x = block(x, attention_mask=attention_mask)

        # Linear layer for output logits
        # Shape: (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, vocab_size)
        x = self.projection_head(x)  # (batch_size, seq_len, vocab_size)

        return x

**10. Sample Usage**

This cell shows how to instantiate the Transformer model with a concrete set of hyperparameters. Once constructed, the model maps a batch of token IDs to next-token logits, which is the basis for text generation.

The following code sets the hyperparameters and initializes the model:

In [None]:
# Define model parameters
vocab_size = 50257  # Example vocab size; specific to GPT2 tokenizer (borrowed from OpenAI :) )
embed_dim = 768
max_len = 1024 # This can be adjusted based on the use case
embed_dropout = 0.1
num_blocks = 6 # This can be adjusted based on the use case
num_heads = 8 # This can be adjusted based on the use case
ff_dim = 2048 # This can be adjusted based on the use case
attn_dropout = 0.1
ff_dropout = 0.1

# Initialize GPT model
model = GPT(
    vocab_size=vocab_size,
    embed_dim=embed_dim,
    max_len=max_len,
    embed_dropout=embed_dropout,
    num_blocks=num_blocks,
    num_heads=num_heads,
    ff_dim=ff_dim,
    attn_dropout=attn_dropout,
    ff_dropout=ff_dropout
)
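
To get a feel for what these hyperparameters imply, the arithmetic below estimates the per-head dimension and the number of trainable parameters. It is a hand count based on the layer definitions in this tutorial (linear layers with biases, two normalization modules per block, no learned positional parameters), so treat it as an estimate to compare against `sum(p.numel() for p in model.parameters())`.

```python
vocab_size, embed_dim = 50257, 768
num_blocks, num_heads, ff_dim = 6, 8, 2048

head_dim = embed_dim // num_heads          # size of each attention head
embedding_params = vocab_size * embed_dim  # token embedding table

# Query, key, value, and output projections (weights + biases)
attention_params = 4 * (embed_dim * embed_dim + embed_dim)
# Two linear layers of the feedforward block (weights + biases)
ffn_params = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
# Two normalization modules per block, each with a gain and a bias vector
norm_params = 2 * (2 * embed_dim)

block_params = attention_params + ffn_params + norm_params
head_params = embed_dim * vocab_size + vocab_size  # projection head
total_params = embedding_params + num_blocks * block_params + head_params

print(f"head_dim={head_dim}, parameters per block={block_params:,}, total={total_params:,}")
```

With this configuration the embedding table and the projection head account for a large share of the total, because the vocabulary is much larger than the embedding dimension.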

**11. Training the Transformer**

**11.1 Data Preprocessing**

Data preprocessing is a crucial step in training a Transformer model. It involves preparing the input data in a format suitable for the model. This includes tokenizing the text, padding or truncating sequences to a fixed length, and creating attention masks.
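
To illustrate what padding, truncation, and attention masks look like, here is a toy sketch. It uses a hypothetical whitespace tokenizer (`build_vocab` and `encode` are invented for this example) in place of a real one such as `AutoTokenizer`, so the IDs it produces are not those of any pre-trained vocabulary.

```python
import torch

PAD_ID = 0  # hypothetical padding id for this toy vocabulary


def build_vocab(texts):
    """Map each lowercase, whitespace-separated word to an integer id (0 is reserved for padding)."""
    vocab = {"<pad>": PAD_ID}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(texts, vocab, max_length):
    """Tokenize, truncate or pad to max_length, and build attention masks."""
    input_ids, attention_mask = [], []
    for text in texts:
        ids = [vocab[word] for word in text.lower().split()][:max_length]  # truncate
        num_pad = max_length - len(ids)
        input_ids.append(ids + [PAD_ID] * num_pad)             # pad on the right
        attention_mask.append([1] * len(ids) + [0] * num_pad)  # 1 = real token, 0 = padding
    return torch.tensor(input_ids), torch.tensor(attention_mask)


texts = ["Mary had a little lamb", "The lamb was sure to go"]
vocab = build_vocab(texts)
input_ids, attention_mask = encode(texts, vocab, max_length=6)
print(input_ids)
print(attention_mask)
```

The first sentence has five words, so it receives one padding token and a trailing 0 in its mask; the second fills all six slots. Both rows end up the same length, which is what allows them to be stacked into a single batch tensor.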

Here's a sample dataset and the code to preprocess it:

In [None]:
sample_data = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
    "The quick brown fox jumps over the lazy dog",
    "Artificial Intelligence is transforming the world",
    "PyTorch is a popular deep learning framework",
    "Transformers are powerful models for NLP",
    "Machine learning can solve many complex problems",
    "Deep learning models require large amounts of data",
    "Natural Language Processing is a fascinating field",
    "GPT models are great for generating text",
    "Attention mechanisms help models focus on important parts of the input",
    "Feedforward neural networks are used in many applications",
    "Multi-head attention enhances the model's ability to capture relationships",
    "Positional encoding adds order information to the input",
    "Residual connections help in training deep neural networks",
    "Layer normalization stabilizes the learning process",
    "Recurrent Neural Networks process input sequentially",
    "Convolutional Neural Networks are used for image processing",
    "Autoencoders are used for unsupervised learning tasks",
    "Reinforcement learning involves training agents to make decisions",
    "Generative Adversarial Networks generate new data samples",
    "Transfer learning leverages pre-trained models for new tasks",
    "Self-supervised learning uses data itself as a source of supervision",
    "The sigmoid function is an activation function used in neural networks",
    "The softmax function converts logits to probabilities",
    "Backpropagation is the algorithm used for training neural networks",
    "Optimization algorithms like Adam and SGD are used to train models",
    "Loss functions measure the error in model predictions",
    "Gradient descent is an optimization technique used in machine learning",
    "Hyperparameter tuning is crucial for model performance",
    "Data augmentation techniques are used to increase dataset size",
    "Overfitting occurs when a model performs well on training data but poorly on new data",
    "Regularization techniques help prevent overfitting",
    "Cross-validation is used to evaluate model performance",
    "Batch normalization improves training speed and stability",
    "Dropout is a regularization technique to prevent overfitting",
    "Long Short-Term Memory networks are used for sequential data",
    "Bidirectional RNNs process data in both forward and backward directions",
    "Sequence-to-sequence models are used for tasks like translation",
    "Beam search is used for decoding sequences in NLP",
    "Named Entity Recognition identifies entities in text",
    "Part-of-Speech tagging assigns grammatical categories to words",
    "Dependency parsing analyzes the grammatical structure of a sentence",
    "Word embeddings capture semantic meaning of words",
    "Sentence embeddings represent entire sentences as vectors",
    "Graph Neural Networks are used for processing graph-structured data",
    "Knowledge graphs represent information in a structured form",
    "Collaborative filtering is used in recommendation systems",
    "Content-based filtering recommends items based on item features",
    "Matrix factorization techniques are used in recommendation systems",
    "Anomaly detection identifies unusual patterns in data",
    "Time series forecasting predicts future values based on past data",
    "Clustering algorithms group similar data points together",
    "Dimensionality reduction techniques like PCA reduce data complexity",
    "t-SNE is a technique for visualizing high-dimensional data",
    "UMAP is another technique for visualizing high-dimensional data",
    "Bayesian networks represent probabilistic relationships among variables",
    "Markov chains model systems that transition from one state to another",
    "Hidden Markov Models are used for sequence modeling",
    "Support Vector Machines are used for classification tasks",
    "Decision trees are used for both classification and regression",
    "Random forests are ensembles of decision trees",
    "Gradient boosting combines weak learners to create a strong learner",
    "XGBoost is a popular implementation of gradient boosting",
    "LightGBM is another gradient boosting framework",
    "CatBoost is a gradient boosting framework designed for categorical features",
    "K-means is a popular clustering algorithm",
    "Hierarchical clustering builds a hierarchy of clusters",
    "DBSCAN is a density-based clustering algorithm",
    "Linear regression models the relationship between two variables",
    "Logistic regression is used for binary classification",
    "Ridge regression adds regularization to linear regression",
    "Lasso regression adds L1 regularization to linear regression",
    "Elastic net combines L1 and L2 regularization",
    "Principal Component Analysis is used for dimensionality reduction",
    "Independent Component Analysis separates a multivariate signal into additive components",
    "Factor analysis models observed variables as linear combinations of potential factors",
    "Canonical correlation analysis explores the relationships between two sets of variables",
    "Discriminant analysis is used for classification",
    "Quadratic discriminant analysis extends linear discriminant analysis",
    "Naive Bayes classifiers are based on Bayes' theorem",
    "Gaussian Naive Bayes assumes the features follow a normal distribution",
    "Multinomial Naive Bayes is used for discrete data",
    "Bernoulli Naive Bayes is used for binary/boolean data",
    "Markov Decision Processes are used for decision making in stochastic environments",
    "Q-learning is a reinforcement learning algorithm",
    "Policy gradients optimize the policy directly in reinforcement learning",
    "Deep Q-Networks use neural networks to approximate Q-values",
    "Actor-critic methods combine policy gradients and value function approximation",
    "Proximal Policy Optimization is a popular reinforcement learning algorithm",
    "AlphaGo used deep reinforcement learning to play Go",
    "Game theory studies strategic interactions between agents",
    "Mechanism design is a field related to game theory",
    "Econometrics applies statistical methods to economic data",
    "Time series analysis involves analyzing data points collected over time",
    "Survival analysis studies the time until an event occurs",
    "Epidemiology models the spread of diseases",
    "Bioinformatics applies computational techniques to biological data",
    "Genomics studies the structure, function, evolution, and mapping of genomes",
    "Proteomics studies the structure and function of proteins",
    "Metabolomics studies the chemical processes involving metabolites",
    "Systems biology models complex biological systems",
    "Synthetic biology designs and constructs new biological parts and systems",
    "Structural biology studies the molecular structure of biological macromolecules",
    "Quantum computing uses principles of quantum mechanics for computation",
    "Quantum machine learning applies quantum computing to machine learning",
    "Cryptography secures communication in the presence of adversaries",
    "Blockchain is a decentralized ledger technology",
    "Smart contracts are self-executing contracts with the terms directly written into code",
    "Internet of Things connects everyday objects to the internet",
    "Edge computing processes data at the edge of the network",
    "Fog computing extends cloud computing to the edge of the network",
    "5G networks provide high-speed wireless communication",
    "Software-defined networking decouples the control and data planes in networking",
    "Network function virtualization virtualizes network services",
    "Cybersecurity protects computer systems from theft or damage",
    "Penetration testing evaluates the security of a computer system",
    "Intrusion detection systems monitor networks for malicious activity",
    "Firewalls control incoming and outgoing network traffic",
    "Public key infrastructure manages digital keys and certificates",
    "Zero-trust architecture assumes no trusted zones in a network",
    "Digital forensics investigates digital evidence",
    "Data governance ensures data is managed and used responsibly",
    "Data privacy protects personal data from unauthorized access",
    "Data anonymization removes personally identifiable information",
    "Data integrity ensures data is accurate and consistent",
    "Data wrangling involves transforming raw data into a usable format",
    "Data visualization presents data in a visual context",
    "Data storytelling combines data visualization with narrative techniques",
    "Business intelligence analyzes data to support business decision-making",
    "Customer relationship management manages interactions with customers",
    "Enterprise resource planning integrates business processes",
    "Supply chain management oversees the flow of goods and services",
    "Project management plans and organizes resources to achieve specific goals",
    "Agile development is an iterative approach to software development",
    "Scrum is a framework for agile project management",
    "Kanban is a visual system for managing work",
    "Lean methodology focuses on maximizing value while minimizing waste",
    "DevOps combines software development and IT operations",
    "Continuous integration automates the integration of code changes",
    "Continuous delivery automates the delivery of software updates",
    "Microservices architecture structures an application as a collection of loosely coupled services",
    "Containerization packages software into containers",
    "Docker is a platform for developing, shipping, and running applications in containers",
    "Kubernetes orchestrates containerized applications",
    "Serverless computing allows developers to build applications without managing servers",
    "Function as a Service runs code in response to events",
    "Platform as a Service provides a platform for developing, running, and managing applications",
    "Infrastructure as a Service provides virtualized computing resources over the internet",
    "Software as a Service delivers software over the internet",
    "Cloud computing delivers computing services over the internet",
    "Hybrid cloud combines public and private clouds",
    "Multi-cloud uses multiple cloud services from different providers",
    "Big data analytics analyzes large and complex datasets",
    "Hadoop is a framework for processing large datasets",
    "Spark is a fast and general-purpose cluster computing system",
    "NoSQL databases are designed for large-scale data storage and retrieval",
    "Graph databases store data in graph structures",
    "Document databases store data in document format",
    "Key-value stores are a simple type of NoSQL database",
    "Column-family stores organize data into columns",
    "Data lakes store raw data in its native format",
    "Data warehouses store structured data for analysis",
    "ETL processes extract, transform, and load data",
    "Data pipelines automate the flow of data",
    "Machine learning pipelines automate the process of training and deploying models",
    "Feature engineering creates new features from raw data",
    "Feature selection selects the most important features for a model",
    "Model evaluation assesses the performance of a model",
    "Model interpretability explains how a model makes decisions",
    "Fairness in machine learning ensures models do not discriminate against certain groups",
    "Ethics in AI addresses the moral implications of AI",
    "Explainable AI makes AI decisions understandable to humans",
    "Human-in-the-loop AI involves humans in the AI decision-making process",
    "AI for social good applies AI to address social challenges",
    "AI policy and governance address the regulation and management of AI",
    "AI safety ensures AI systems operate as intended",
    "AI alignment aligns AI systems with human values",
    "AI ethics examines the ethical implications of AI",
    "AI and law explores the legal aspects of AI",
    "AI in healthcare applies AI to medical and healthcare tasks",
    "AI in finance applies AI to financial tasks",
    "AI in education applies AI to learning and teaching",
    "AI in transportation applies AI to transportation systems",
    "AI in robotics integrates AI with robotics",
    "AI in gaming enhances game design and player experience",
    "AI in creative arts enhances creativity in art, music, and literature",
    "AI in agriculture optimizes farming practices",
    "AI in environmental science addresses environmental challenges",
    "AI in space exploration advances space missions",
    "AI in manufacturing optimizes production processes",
    "AI in customer service improves customer interactions",
    "AI in marketing personalizes marketing campaigns",
    "AI in sales improves sales processes",
    "AI in human resources enhances recruitment and employee management",
    "AI in legal services assists with legal research and case management",
    "AI in public services improves government operations",
    "AI in security enhances threat detection and response",
    "AI in smart cities optimizes urban infrastructure",
    "AI in energy optimizes energy production and consumption",
    "AI in retail enhances the shopping experience",
    "AI in fashion designs new clothing and accessories",
    "AI in sports analyzes player performance",
    "AI in entertainment creates new content",
    "AI in hospitality improves guest experiences",
    "AI in tourism enhances travel planning",
    "AI in real estate optimizes property management",
    "AI in construction improves building design and management",
    "AI in logistics optimizes supply chain operations",
    "AI in insurance assesses risk and processes claims",
    "AI in telecommunications optimizes network operations",
    "AI in mining improves resource extraction",
    "AI in chemical engineering optimizes chemical processes",
    "AI in pharmaceuticals enhances drug discovery",
    "AI in biotechnology advances biological research",
    "AI in materials science discovers new materials",
    "AI in nanotechnology advances nanoscale research",
    "AI in quantum physics explores quantum phenomena",
    "AI in cognitive science studies human cognition",
    "AI in linguistics analyzes language",
    "AI in psychology studies human behavior",
    "AI in sociology studies social interactions",
    "AI in anthropology studies human cultures",
    "AI in economics analyzes economic data",
    "AI in political science studies political systems",
    "AI in history analyzes historical data",
    "AI in philosophy explores philosophical questions",
    "AI in theology studies religious texts",
    "AI in archaeology analyzes archaeological data",
    "AI in education policy studies education systems",
    "AI in public health analyzes health data",
    "AI in food science optimizes food production",
    "AI in animal science studies animal behavior",
    "AI in veterinary medicine advances animal healthcare",
    "AI in plant science studies plant growth",
    "AI in forestry manages forest resources",
    "AI in marine biology studies ocean life",
    "AI in meteorology predicts weather patterns",
    "AI in climatology studies climate change",
    "AI in geology studies Earth's structure",
    "AI in astronomy explores the universe",
    "AI in aerospace advances space technology",
    "AI in automotive industry designs smart vehicles",
    "AI in electrical engineering optimizes electronic systems",
    "AI in mechanical engineering designs mechanical systems",
    "AI in civil engineering improves infrastructure",
    "AI in chemical engineering optimizes chemical processes",
    "AI in industrial engineering improves manufacturing",
    "AI in biomedical engineering develops medical devices",
    "AI in environmental engineering solves environmental problems",
    "AI in urban planning designs better cities",
    "AI in architecture designs smarter buildings",
    "AI in interior design creates personalized spaces",
    "AI in product design innovates new products",
    "AI in graphic design creates digital art",
    "AI in photography enhances photo editing",
    "AI in videography improves video production",
    "AI in music composition creates new music",
    "AI in sound engineering optimizes audio production",
    "AI in theater designs better performances",
    "AI in film industry enhances movie production",
    "AI in animation creates animated content",
    "AI in virtual reality creates immersive experiences",
    "AI in augmented reality enhances real-world experiences",
    "AI in mixed reality combines virtual and real worlds",
    "AI in gaming industry improves game design",
    "AI in fitness optimizes workout routines",
    "AI in wellness enhances mental health",
    "AI in mindfulness promotes relaxation",
    "AI in meditation guides meditation practices",
    "AI in nutrition optimizes diet plans",
    "AI in personal finance manages finances",
    "AI in investment advises on investments",
    "AI in retirement planning plans for retirement",
    "AI in estate planning manages estates",
    "AI in tax planning optimizes taxes",
    "AI in wealth management grows wealth",
    "AI in philanthropy optimizes charitable giving",
    "AI in community service enhances community projects",
    "AI in nonprofit management manages nonprofits",
    "AI in humanitarian aid improves disaster response",
    "AI in conflict resolution mediates disputes",
    "AI in international relations studies global interactions",
    "AI in diplomacy enhances diplomatic efforts",
    "AI in intelligence analyzes security threats",
    "AI in military applications advances defense technology",
    "AI in space missions advances space exploration",
    "AI in planetary science studies planets",
    "AI in astrophysics explores cosmic phenomena",
    "AI in cosmology studies the universe",
    "AI in astrobiology searches for extraterrestrial life",
    "AI in space weather predicts space weather events",
    "AI in satellite technology enhances satellite operations",
    "AI in space debris management addresses space debris",
    "AI in space mining explores space resources",
    "AI in space tourism develops space tourism",
    "AI in astronaut training trains astronauts",
    "AI in space habitat designs space habitats",
    "AI in space station operations manages space stations",
    "AI in planetary defense protects Earth from asteroids",
    "AI in space policy studies space laws",
    "AI in space law regulates space activities",
    "AI in space economics studies space economy",
    "AI in space medicine advances space healthcare",
    "AI in space agriculture grows food in space",
    "AI in space exploration explores space",
    "AI in space robotics develops space robots",
    "AI in space navigation navigates space missions",
    "AI in space communication enhances space communication",
    "AI in space telemetry monitors space missions",
    "AI in space data analysis analyzes space data",
    "AI in space research advances space knowledge",
    "AI in space innovation innovates space technology",
    "AI in space sustainability promotes space sustainability",
    "AI in space ethics studies space ethics",
    "AI in space culture explores space culture",
    "AI in space history studies space history",
    "AI in space future studies future space missions",
    "AI in space colonization explores space colonization",
    "AI in space economy studies space economy",
    "AI in space governance regulates space activities",
    "AI in space cooperation promotes space cooperation",
    "AI in space competition analyzes space competition",
    "AI in space strategy plans space missions",
    "AI in space missions manages space missions",
    "AI in space education educates about space",
    "AI in space advocacy advocates for space",
    "AI in space outreach reaches out about space",
    "AI in space engagement engages people in space",
    "AI in space awareness raises space awareness",
    "AI in space inspiration inspires about space",
    "AI in space creativity creates space content",
    "AI in space communication communicates about space",
    "AI in space storytelling tells space stories",
    "AI in space visualization visualizes space",
    "AI in space simulation simulates space missions",
    "AI in space exploration explores new frontiers",
    "AI in space colonization colonizes new worlds",
    "AI in space governance establishes space governance",
    "AI in space innovation drives space innovation",
    "AI in space sustainability ensures space sustainability",
    "AI in space ethics promotes space ethics",
    "AI in space culture celebrates space culture",
    "AI in space history preserves space history",
    "AI in space future envisions space future",
]

In [None]:
class GPTDataset(Dataset):
    def __init__(self, data:list, tokenizer, max_length:int):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.end_token = tokenizer.eos_token_id

    def __len__(self):
        return len(self.data)

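    def __getitem__(self, idx):
        text = self.data[idx]
        # Tokenize, truncate to at most max_length tokens, and drop the batch dimension
        input_txt = self.tokenizer(
            text, truncation=True, max_length=self.max_length, return_tensors="pt"
        )["input_ids"].squeeze(0)
        text_len = input_txt.size(0)
        if text_len < self.max_length:
            # Short sequence: pad inputs and labels with the EOS token up to max_length
            padding_len = self.max_length - text_len
            padding = torch.tensor([self.end_token] * padding_len)
            input_ids = torch.cat((input_txt, padding), dim=0)
            # Labels are the inputs shifted left by one position, followed by EOS
            label = torch.cat((input_txt[1:], torch.tensor([self.end_token]), padding), dim=0)
        else:
            # Full-length sequence: no padding needed, only the shift plus a final EOS
            input_ids = input_txt[:self.max_length]
            label = torch.cat((input_txt[1:self.max_length], torch.tensor([self.end_token])), dim=0)
        return input_ids, label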
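
To make the one-position shift between inputs and labels concrete, the following sketch reproduces the padding-and-shifting logic of `__getitem__` on hand-written token IDs, so it runs without a tokenizer. The helper name `pad_and_shift`, the token IDs, and the `eos_id` value are illustrative assumptions and are not part of the tutorial's code.

```python
import torch

def pad_and_shift(token_ids, max_length, eos_id):
    """Hypothetical helper mirroring GPTDataset.__getitem__ on raw token IDs."""
    ids = torch.tensor(token_ids, dtype=torch.long)
    eos = torch.tensor([eos_id], dtype=torch.long)
    if ids.size(0) < max_length:
        padding = torch.full((max_length - ids.size(0),), eos_id, dtype=torch.long)
        input_ids = torch.cat((ids, padding))
        # Labels: inputs shifted left by one, then EOS, then the same padding
        labels = torch.cat((ids[1:], eos, padding))
    else:
        input_ids = ids[:max_length]
        labels = torch.cat((ids[1:max_length], eos))
    return input_ids, labels

input_ids, labels = pad_and_shift([11, 22, 33], max_length=6, eos_id=0)
print(input_ids.tolist())  # [11, 22, 33, 0, 0, 0]
print(labels.tolist())     # [22, 33, 0, 0, 0, 0]
```

At every position `i`, `labels[i]` is the token the model should predict after seeing `input_ids[:i + 1]`.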

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

train_dataset = GPTDataset(
    data = sample_data,
    tokenizer = tokenizer,
    max_length = 200,
    )

In [None]:
input_ids, label = train_dataset[2]
input_ids = input_ids.unsqueeze(0)
label = label.unsqueeze(0)

print("Label:", label)
print("Input IDs:", input_ids)

print("Label Shape:", label.shape)
print("Input IDs Shape:", input_ids.shape)

**11.2 Model Training**

Training the Transformer model involves defining a loss function, optimizer, and training loop. The loss function measures how well the model's predictions match the true labels. The optimizer updates the model's parameters to minimize the loss.

# Generate Square Subsequent Mask

In Transformer models, especially during training, it's important to ensure that the model doesn't "cheat" by looking ahead at the subsequent tokens when generating sequences. To prevent this, we use a mask that blocks future positions. This is called a "causal" or "look-ahead" mask.

The `generate_square_subsequent_mask` function creates such a mask. Here’s a detailed breakdown of the function and its implementation:

### Function Explanation

**Purpose**: 
The function generates a square mask for a sequence of a given size. This mask is used to ensure that each position in the sequence can only attend to the positions before it (and itself), and not the positions that come after it. 

**Arguments**:
- `size` (int): The size of the square mask. This is typically the length of the sequence.
- `device` (torch.device): The device on which the mask is created (e.g., CPU or GPU).

**Returns**:
- `torch.Tensor`: A mask tensor of shape `(size, size)`, where positions that should be masked are filled with `float('-inf')`, and unmasked positions are filled with `float(0.0)`.

### Code Explanation

1. **Initialization**:
    ```python
    mask = torch.triu(torch.ones(size, size, device=device) * float('-inf'), diagonal=1)
    ```
    - `torch.ones(size, size, device=device)`: Creates a tensor of shape `(size, size)` filled with ones on the specified device.
    - `* float('-inf')`: Multiplies all the ones by negative infinity (`-inf`), resulting in a tensor filled with `-inf`.
    - `torch.triu(..., diagonal=1)`: Keeps only the part of the tensor strictly above the main diagonal (hence `diagonal=1`) and sets every other element to zero. The elements above the main diagonal therefore stay at `-inf`, while the lower triangle (including the main diagonal) becomes zeros.

2. **Return Mask**:
    ```python
    return mask
    ```
    - The function returns the generated mask.

```python
# Example usage
size = 5
mask = generate_square_subsequent_mask(size)
print(mask)
```

### Example Output

For `size = 5`, the generated mask will look like this:

```
tensor([[  0., -inf, -inf, -inf, -inf],
        [  0.,   0., -inf, -inf, -inf],
        [  0.,   0.,   0., -inf, -inf],
        [  0.,   0.,   0.,   0., -inf],
        [  0.,   0.,   0.,   0.,   0.]])
```

This mask ensures that each position can only attend to itself and the positions before it, effectively preventing any position from attending to future positions during sequence generation.
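
To see why `-inf` entries block attention, here is a minimal sketch that adds the mask to a matrix of random attention scores and applies softmax. The random `scores` tensor is only a stand-in for the scaled query-key products, and how the mask is applied inside the tutorial's attention block may differ from this simplified version.

```python
import torch

torch.manual_seed(0)
size = 4
# Same construction as generate_square_subsequent_mask
mask = torch.triu(torch.ones(size, size) * float('-inf'), diagonal=1)

# Stand-in for the scaled attention scores of one head (an assumption)
scores = torch.randn(size, size)

# Adding -inf before softmax drives the corresponding weights to exactly zero
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
```

Every entry above the main diagonal is zero, and each row still sums to 1, so the attention weight is redistributed over the current and earlier positions only.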

In [None]:
def generate_square_subsequent_mask(size, device='cpu'):
    """
    Generate a square mask for the sequence. The masked positions are filled with float('-inf').
    Unmasked positions are filled with float(0.0).
    
    Args:
        size (int): The size of the square mask.
        device (torch.device): The device on which the mask is created.
    
    Returns:
        torch.Tensor: The generated mask tensor of shape (size, size).
    """
    mask = torch.triu(torch.ones(size, size, device=device) * float('-inf'), diagonal=1)
    return mask

# Example usage
size = 5
mask = generate_square_subsequent_mask(size)
print(mask)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
lr = 1e-5
batch_size = 2
num_epochs = 5

The next block of code runs the training loop for our Transformer model. Training runs for multiple epochs, and in each epoch the model sees the entire training dataset once. Within an epoch, the dataset is divided into batches to use memory efficiently and speed up computation. Here’s a detailed explanation of what each part of the code does:

1. **Moving Model to Device**:
   The model is moved to the specified device (CPU or GPU). Training on a GPU can significantly speed up the process compared to a CPU.

2. **Setting Up Optimizer and Loss Function**:
   - **Optimizer**: The Adam optimizer is used, which is an adaptive learning rate optimization algorithm that's popular for training deep learning models. It adjusts the learning rate for each parameter dynamically.
   - **Loss Function**: CrossEntropyLoss is used because next-token prediction is a classification over the vocabulary at every position. It combines LogSoftmax and NLLLoss in a single class, so the model can output raw logits.

3. **Preparing DataLoader**:
   The DataLoader is used to efficiently handle the dataset during training. It allows us to iterate over the dataset in mini-batches, shuffle the data, and load the data in parallel using multiple workers.

4. **Training Loop**:
   The loop runs for a specified number of epochs. Each epoch involves one complete pass through the training dataset.

5. **Setting Model to Training Mode**:
   The model is set to training mode. Certain layers, like dropout, behave differently during training and evaluation, so it's important to set the correct mode.

6. **Initializing Total Loss**:
   `total_loss` is initialized to accumulate the loss over all batches in the epoch.

7. **Iterating Over Batches**:
   We iterate over each batch in the DataLoader. Each batch contains a subset of the training data.

8. **Zeroing Gradients**:
   Before the backward pass, all the gradients for the variables are zeroed. This is important because PyTorch accumulates the gradients on subsequent backward passes.

9. **Unpacking and Moving Data to Device**:
   The input data and labels are unpacked from the batch and moved to the specified device.

10. **Generating Causal Mask**:
    A causal mask is generated for the sequence to ensure that each position can only attend to the positions before it (and itself), preventing information leakage from future tokens.

11. **Forward Pass**:
    A forward pass is performed through the model to obtain the logits (raw predictions before applying softmax).

12. **Flattening Logits and Labels**:
    The logits and labels are flattened to compute the loss. This is necessary because CrossEntropyLoss expects the inputs to be of shape (N, C) where N is the number of examples and C is the number of classes.

13. **Computing Loss**:
    The loss between the predicted logits and the true labels is computed using the cross-entropy loss function.

14. **Backward Pass**:
    A backward pass is performed to compute the gradients of the loss with respect to the model parameters.

15. **Optimizer Step**:
    The model parameters are updated using the gradients computed during the backward pass.

16. **Accumulating Loss**:
    The loss is accumulated over all batches to keep track of the total loss for the epoch.

17. **Logging Loss**:
    At the end of each epoch, the average loss over all batches is printed to monitor the training progress.
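
Steps 12 and 13 can be sketched on random tensors. The sizes below are toy values chosen for illustration, not the tutorial's settings; the sketch also checks the statement that CrossEntropyLoss equals LogSoftmax followed by NLLLoss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 5, 11   # toy sizes (assumptions)

logits = torch.randn(batch_size, seq_len, vocab_size)
labels = torch.randint(0, vocab_size, (batch_size, seq_len))

# Step 12: collapse the batch and sequence dimensions into one
logits_flat = logits.view(-1, logits.size(-1))   # (batch_size * seq_len, vocab_size)
labels_flat = labels.view(-1)                    # (batch_size * seq_len,)

# Step 13: cross-entropy averaged over all token positions
loss = nn.CrossEntropyLoss()(logits_flat, labels_flat)

# The same value computed as log-softmax followed by negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits_flat, dim=-1), labels_flat)

print(logits_flat.shape, labels_flat.shape)
print(loss.item(), manual.item())
```

Each of the `batch_size * seq_len` rows is treated as an independent classification example with `vocab_size` classes.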

### Additional Debugging Lines (Commented Out):

- **Inspecting Input Data and Labels**:
    These lines are useful for debugging to check if the input data and labels are correctly loaded.

- **Inspecting Logits**:
    This line helps in inspecting the raw model outputs (logits) to ensure that the model is functioning correctly.

- **Checking for NaN or Inf in Loss**:
    These lines are useful for checking if the loss contains any NaN or infinite values, which can indicate issues in the training process.

- **Checking for NaN or Inf in Gradients**:
    These lines check if the gradients contain NaN or infinite values, which can also indicate problems in the training process.
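
The loss and gradient checks described above can be sketched with `torch.isfinite`, which is false for both NaN and infinite values. The helper name `non_finite_grad_params` and the small `nn.Linear` stand-in model are assumptions made for this example.

```python
import torch
import torch.nn as nn

def non_finite_grad_params(model):
    """Hypothetical helper: names of parameters whose gradients contain NaN or Inf."""
    return [
        name
        for name, param in model.named_parameters()
        if param.grad is not None and not torch.isfinite(param.grad).all()
    ]

toy = nn.Linear(3, 2)                      # stand-in for the Transformer
loss = toy(torch.ones(1, 3)).sum()
loss_ok = torch.isfinite(loss).item()      # loss check
loss.backward()
clean = non_finite_grad_params(toy)        # gradient check on healthy gradients

# Corrupt one gradient entry to show the check firing
toy.weight.grad[0, 0] = float('nan')
corrupted = non_finite_grad_params(toy)

print(loss_ok, clean, corrupted)  # True [] ['weight']
```

Running such a check between `loss.backward()` and `optimizer.step()` makes it possible to skip the update for a bad batch.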

By following these steps, we train our Transformer model over multiple epochs, updating the model parameters to minimize the loss and improve performance on the training dataset.
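
Step 8 notes that PyTorch accumulates gradients across backward passes. A small sketch on a single tensor shows the effect; the tensor `w` and the toy loss are illustrative only.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# Two backward passes without zeroing: the gradients add up
(w * 3).sum().backward()
first = w.grad.clone()
(w * 3).sum().backward()
accumulated = w.grad.clone()

# Reset the gradient, then run one more backward pass
w.grad.zero_()
(w * 3).sum().backward()
after_reset = w.grad.clone()

print(first.tolist(), accumulated.tolist(), after_reset.tolist())
# [3.0, 3.0] [6.0, 6.0] [3.0, 3.0]
```

`optimizer.zero_grad()` performs this reset for every parameter the optimizer manages, which is why it is called once per batch.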

In [None]:
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0

    for batch in train_loader:
        optimizer.zero_grad()
        # Unpack input and label from the batch and send them to the device
        input_ids, labels = batch
        input_ids, labels = input_ids.to(device), labels.to(device)

        # Print input data and labels to inspect
        # print(f"Input IDs: {input_ids}")
        # print(f"Labels: {labels}")

        # Generate the causal mask
        # Shape: (seq_len, seq_len)
        mask = generate_square_subsequent_mask(input_ids.size(1), device=device)

        # Forward pass
        logits = model(input_ids=input_ids, attention_mask=mask)

        # Print logits to inspect
        # print(f"Logits: {logits}")

        # Flatten the logits and labels for computing the loss
        logits_flat = logits.view(-1, logits.size(-1))
        labels_flat = labels.view(-1)

        # Compute the loss
        loss = criterion(logits_flat, labels_flat)

        # Check for NaN or Inf values in loss
        # if torch.isnan(loss) or torch.isinf(loss):
        #     print(f"NaN or Inf detected in loss at epoch {epoch}")
        #     continue

        # Backward pass and optimization step
        loss.backward()

        # Check for NaN or Inf values in gradients
        # for name, param in model.named_parameters():
        #     if param.grad is not None and (torch.isnan(param.grad).any() or torch.isinf(param.grad).any()):
        #         print(f"NaN or Inf detected in gradients at epoch {epoch} for parameter {name}")
        #         continue

        optimizer.step()

        total_loss += loss.item()

    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(train_loader)}')

**12. Inference**

Inference is the process of using a trained model to make predictions on new data. This section shows how to generate text with a Transformer model.

The following cell sets the hyperparameters of the smallest GPT-2 configuration and instantiates the `GPT` class with them:

In [None]:
vocab_size = 50257
embed_dim = 768
max_len = 1024
embed_dropout = 0.1
num_blocks = 12  # GPT-2 small; GPT-2 XL uses 48
num_heads = 12   # GPT-2 small; GPT-2 XL uses 25 (with embed_dim = 1600)
ff_dim = 3072
attn_dropout = 0.1
ff_dropout = 0.1

# Initialize GPT model
model = GPT(
    vocab_size=vocab_size,
    embed_dim=embed_dim,
    max_len=max_len,
    embed_dropout=embed_dropout,
    num_blocks=num_blocks,
    num_heads=num_heads,
    ff_dim=ff_dim,
    attn_dropout=attn_dropout,
    ff_dropout=ff_dropout
)
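
As a rough sanity check on these hyperparameters, the sketch below estimates the parameter count they imply. It assumes the standard GPT-2 layout: learned positional embeddings, biases on every linear layer, two LayerNorms per block plus a final one, and an output projection that shares weights with the token embedding. The `GPT` class in this tutorial may differ on any of these points, so treat the result as an estimate rather than the exact size of `model`.

```python
vocab_size, embed_dim, max_len = 50257, 768, 1024
num_blocks, ff_dim = 12, 3072

token_emb = vocab_size * embed_dim
pos_emb = max_len * embed_dim          # assumes learned positional embeddings

# Per block: Q, K, V and output projections (weights + biases)
attn = 4 * (embed_dim * embed_dim + embed_dim)
# Per block: the two linear layers of the feed-forward network
ff = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
# Per block: two LayerNorms, each with a scale and a shift vector
norms = 2 * 2 * embed_dim

per_block = attn + ff + norms
final_norm = 2 * embed_dim
total = token_emb + pos_emb + num_blocks * per_block + final_norm

print(f"{total:,}")  # 124,439,808
```

Under these assumptions the total is about 124 million parameters. The exact count for the tutorial's model can be read with `sum(p.numel() for p in model.parameters())`.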

The next cell loads the pre-trained GPT-2 tokenizer and model and sets up a PyTorch DataLoader over the tokenized sample data. Here’s a detailed explanation of each step:

1. **Importing Necessary Libraries**:
   - `torch`: The core library of PyTorch, used for tensor operations and building neural networks.
   - `DataLoader` and `TensorDataset` from `torch.utils.data`: Tools to handle and process data efficiently.

2. **Initializing Tokenizer and Model**:
   - `model_name = "gpt2"`: Specifies the name of the pre-trained model to be used.
   - `tokenizer = AutoTokenizer.from_pretrained(model_name)`: Loads the pre-trained tokenizer for GPT-2, which is responsible for converting text into token IDs that the model can understand.
   - `model = AutoModelForCausalLM.from_pretrained(model_name)`: Loads the pre-trained GPT-2 model, which is designed for causal language modeling tasks.

3. **Setting Pad Token**:
   - If the tokenizer does not have a pad token, it sets the pad token to be the same as the end-of-sequence (eos) token. This ensures that sequences are padded correctly to a uniform length.

4. **Tokenizing the Data**:
   - `inputs = tokenizer(sample_data, return_tensors='pt', max_length=128, padding='max_length', truncation=True)`: Tokenizes the sample data, converts it to PyTorch tensors, and ensures all sequences are padded or truncated to a maximum length of 128 tokens.

5. **Creating TensorDataset**:
   - `input_ids = inputs['input_ids']`: Extracts the input IDs (tokenized data) from the tokenized inputs.
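   - `labels = inputs['input_ids']`: Sets the labels to be the same as the input IDs. In language modeling the model tries to predict the next token in the sequence, and Hugging Face causal language models shift the labels by one position internally when `labels` is passed to the forward call, so no manual shift is applied here.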
   - `train_dataset = TensorDataset(input_ids, labels)`: Creates a TensorDataset from the input IDs and labels, which will be used to create a DataLoader.

6. **Creating DataLoader**:
   - `train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)`: Creates a DataLoader from the TensorDataset, specifying a batch size of 2 and shuffling the data at the beginning of each epoch to ensure the model sees the data in a different order each time.

This setup is crucial for efficiently feeding data into the model during training, ensuring that the data is properly tokenized, padded, and batched.

In [None]:
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForCausalLM

# Tokenizer and model
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize the data
inputs = tokenizer(sample_data, return_tensors='pt', max_length=128, padding='max_length', truncation=True)

# Create TensorDataset
input_ids = inputs['input_ids']
labels = inputs['input_ids']  # In this case, labels are the same as input_ids for language modeling

# Create DataLoader
train_dataset = TensorDataset(input_ids, labels)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
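
Because the pad token is set to the EOS token, padded positions in `labels` look like real tokens, and a loss computed on them would also be averaged over padding. One common remedy, sketched here on a toy batch, is to use the attention mask to replace padded labels with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The token IDs and the vocabulary size of 10 are made up for illustration, and this step is an optional addition, not part of the cell above.

```python
import torch
import torch.nn.functional as F

# Toy batch: 0 plays the role of the pad/EOS token ID (an assumption)
input_ids = torch.tensor([[5, 6, 7, 0, 0],
                          [8, 9, 0, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 0, 0, 0]])

# Copy the inputs, then mark padded positions so the loss skips them
labels = input_ids.masked_fill(attention_mask == 0, -100)
print(labels.tolist())

# F.cross_entropy ignores targets equal to -100 by default
logits = torch.zeros(2, 5, 10)   # uniform predictions over 10 classes
loss = F.cross_entropy(logits.view(-1, 10), labels.view(-1))
print(loss.item())               # log(10), averaged over the 5 real tokens only
```

With the real tokenizer, the mask would come from `inputs['attention_mask']`, which the tokenizer call typically returns alongside `input_ids`.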

1. **Sample Input Text**:
   - `input_txt = "Building Deep Neural Networks with PyTorch"`: Defines the sample text that you want to tokenize and process. This text will be converted into token IDs that the model can understand.

2. **Tokenizing the Input Text**:
   - `input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)`: Uses the tokenizer to convert the input text into token IDs. The `return_tensors="pt"` argument ensures that the token IDs are returned as a PyTorch tensor. The `to(device)` method moves the tensor to the specified device (CPU or GPU) for further processing or inference.

3. **Printing the Token IDs and Shape**:
   - `print(input_ids)`: Prints the token IDs generated from the input text. These IDs represent the input text in a numerical format that the model can process.
   - `print(input_ids.shape)`: Prints the shape of the `input_ids` tensor. This helps in understanding the dimensions of the tensor, which is useful for verifying that the tokenization process worked correctly.

This is essential for preparing textual data for input into a neural network model, particularly in natural language processing tasks. By converting the text into token IDs and ensuring the data is in the correct format, the model can effectively process the input and generate meaningful predictions or outputs.

In [None]:
input_txt = "Building Deep Neural Networks with PyTorch"

input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
print(input_ids)
print(input_ids.shape)

We are now performing text generation using a pre-trained language model. It generates new text by predicting the next token iteratively and appending it to the input sequence. Here’s a detailed explanation of each part:

1. **Moving Model to Device**:
   - `model = model.to(device)`: Moves the model to the specified device (CPU or GPU). This is important for efficient computation, especially when using a GPU.

2. **Setting Up Iteration Parameters**:
   - `iterations = []`: Initializes an empty list to store the results of each iteration.
   - `n_steps = 10`: Sets the number of steps (tokens) to generate. This determines how many new tokens will be generated.
   - `choices_per_step = 5`: Sets the number of top token choices to store for each step. This is useful for analyzing the model’s predictions.

3. **Text Generation Loop**:
   - `with torch.no_grad()`: Disables gradient computation, which is not needed during inference, to save memory and computation.
   - `for _ in range(n_steps)`: Loops for the specified number of steps to generate new tokens iteratively.

4. **Decoding and Forward Pass**:
   - `iteration = dict()`: Initializes a dictionary to store the results of the current iteration.
   - `iteration["Input"] = tokenizer.decode(input_ids[0])`: Decodes the current input IDs to the corresponding text and stores it.
   - `output = model(input_ids=input_ids)`: Performs a forward pass through the model with the current input IDs to get the output logits.

5. **Processing Output Logits**:
   - `logits = output.logits`: Extracts the logits from the model output. Logits are the raw, unnormalized predictions of the model.
   - `next_token_logits = logits[0, -1, :]`: Selects the logits corresponding to the last token in the sequence.
   - `next_token_probs = torch.softmax(next_token_logits, dim=-1)`: Applies the softmax function to the logits to convert them into probabilities.
   - `sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)`: Returns the token IDs ordered from most to least probable, so `sorted_ids[0]` is the most likely next token.

6. **Storing Top Token Choices**:
   - Loops over the top token choices to store their probabilities and decoded text representations in the `iteration` dictionary.
   - `iterations.append(iteration)`: Appends the current iteration’s results to the `iterations` list.

7. **Appending Predicted Token to Input**:
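   - `input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)`: Appends the most likely next token to the input sequence for the next iteration. The indexing `sorted_ids[None, 0, None]` picks the top token ID and reshapes it to `(1, 1)` so that it can be concatenated with the `(1, seq_len)` input.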

8. **Creating DataFrame for Analysis**:
   - `sample_inference = pd.DataFrame(iterations)`: Converts the `iterations` list into a pandas DataFrame for easier analysis and visualization.
   - `sample_inference.head()`: Displays the first few rows of the DataFrame to inspect the results.

This shows how to generate text with a pre-trained language model by iteratively predicting the next token and appending it to the input sequence. Because the loop always appends the single most likely token, it performs greedy decoding. The softmax step converts the logits into probabilities, which is what lets us record and inspect the top alternatives at each step.
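
The softmax and ranking steps can be sketched without any model. The example below is a plain-Python illustration, assuming a hypothetical 5-token vocabulary with made-up logits; `softmax` and `top_choices` are helper names invented for this sketch and stand in for `torch.softmax` and `torch.argsort`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_choices(logits, k):
    """Return the k most probable (token_id, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Hypothetical logits for a 5-token vocabulary
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]
choices = top_choices(toy_logits, k=3)
for token_id, prob in choices:
    print(f"token {token_id}: {100 * prob:.2f}%")

greedy_token = choices[0][0]  # greedy decoding keeps only the top entry
```

Softmax preserves the ordering of the logits, so the greedy choice is the same whether it is taken before or after the probabilities are computed.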

In [None]:
model = model.to(device)
iterations = []
n_steps = 10
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)

        # Extract the logits from the output
        logits = output.logits

        # Select logits of the first batch and the last token and apply softmax to get the probability
        next_token_logits = logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)

        # Store tokens with highest probabilities in our little table
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        iterations.append(iteration)

        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)

sample_inference = pd.DataFrame(iterations)
sample_inference.head()


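Next, we wrap the generation loop in a reusable function, `generate_text_until_end`, which keeps generating tokens until the model emits an end-of-sequence token or a length limit is reached.

### Function Parameters: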
- `input_text` (str): The initial input text to start the generation process.
- `model`: The pre-trained language model used for text generation.
- `tokenizer`: The tokenizer corresponding to the model, used to convert text to token IDs and vice versa.
- `max_length` (int): The maximum total length of the sequence in tokens, counting the prompt tokens as well as the newly generated ones. The function stops generating new tokens once this length is reached.
- `device` (str): The device on which the computation will be performed (e.g., 'cpu' or 'cuda').

### Function Workflow:

1. **Move Model to Device**:
   - `model = model.to(device)`: Moves the model to the specified device (CPU or GPU) for efficient computation.

2. **Tokenize Input Text**:
   - `input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)`: Tokenizes the input text, converts it to a PyTorch tensor, and moves it to the specified device. The `return_tensors='pt'` argument ensures that the token IDs are returned as a PyTorch tensor.

3. **Initialize Variables**:
   - `end_token_id = tokenizer.eos_token_id`: Retrieves the end-of-sequence token ID from the tokenizer. This token indicates the end of the generated sequence.
   - `generated_ids = input_ids.flatten().clone()`: Flattens the input IDs to a 1-dimensional tensor and clones it to initialize the generated sequence.

4. **Disable Gradient Computation**:
   - `with torch.no_grad()`: Disables gradient computation to save memory and computation during inference, as gradients are not needed.

5. **Text Generation Loop**:
   - The loop continues until the end-of-sequence token is generated or the generated sequence reaches the specified maximum length.

6. **Forward Pass**:
   - `output = model(input_ids=input_ids)`: Performs a forward pass through the model with the current input IDs to get the output logits.

7. **Extract and Process Logits**:
   - `logits = output.logits`: Extracts the logits from the model output. Logits are the raw, unnormalized predictions of the model.
   - `next_token_logits = logits[:, -1, :]`: Selects the logits corresponding to the last token in the sequence.

8. **Predict Next Token**:
   - `next_token_id = torch.argmax(next_token_logits, dim=-1)`: Selects the token ID with the highest probability (argmax) from the logits of the last token.
   - `generated_ids = torch.cat([generated_ids, next_token_id], dim=-1)`: Appends the predicted token ID to the generated sequence.
   - `input_ids = generated_ids.unsqueeze(0)`: Rebuilds the model input from the full sequence generated so far, with shape `(1, seq_len)`. The model is called without a key/value cache, so feeding only the newest token would discard the prompt and all earlier context.

9. **Check Stopping Condition**:
   - The loop breaks if the end-of-sequence token is generated or the sequence (prompt plus new tokens) reaches `max_length` tokens.

10. **Decode Generated Text**:
    - `generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)`: Decodes the generated token IDs back to text, skipping special tokens like the end-of-sequence token.

11. **Return Generated Text**:
    - `return generated_text`: Returns the generated text.

Now we know how to use a pre-trained language model to generate text by iteratively predicting the next token and appending it to the input sequence. This function uses greedy decoding: it takes the `argmax` of the logits directly, which selects the same token as the `argmax` of the softmax probabilities, so no softmax step is needed.

In [None]:
def generate_text_until_end(
    input_text: str,
    model,
    tokenizer,
    max_length: int = 100,
    device='cpu',
):
    model = model.to(device)
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
    end_token_id = tokenizer.eos_token_id
    generated_ids = input_ids.flatten().clone()  # Convert to 1-dimensional tensor

    with torch.no_grad():
        while True:
            output = model(input_ids=input_ids)
            logits = output.logits  # Extract the logits

            next_token_logits = logits[:, -1, :]  # Select logits of the last token
            next_token_id = torch.argmax(next_token_logits, dim=-1)  # Greedy choice, shape (1,)
            generated_ids = torch.cat([generated_ids, next_token_id], dim=-1)
            # Feed the full sequence back in: no key/value cache is used, so the
            # model needs every previous token to keep its context
            input_ids = generated_ids.unsqueeze(0)

            if next_token_id == end_token_id or len(generated_ids) >= max_length:
                break

    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    return generated_text

### Generating Text:

1. **Function Call**:
   - `generated_text = generate_text_until_end(...)`: Calls the `generate_text_until_end` function with the specified parameters to generate text.

2. **Input Parameters**:
   - `input_text="I like to eat"`: Specifies the initial input text to start the generation process. This is the prompt that the model will continue from.
   - `model=model`: The pre-trained language model to be used for text generation.
   - `tokenizer=tokenizer`: The tokenizer corresponding to the model, used to convert text to token IDs and vice versa.
   - `max_length=20`: The maximum total length of the sequence in tokens, counting the prompt. The function stops generating new tokens once this length is reached.
   - `device=device`: The device on which the computation will be performed (e.g., 'cpu' or 'cuda').

3. **Printing Generated Text**:
   - `print(generated_text)`: Prints the generated text to the console.

### Why though?:

The purpose of this code snippet is to demonstrate how to use the `generate_text_until_end` function to generate a sequence of text starting from a given prompt. The function iteratively predicts and appends new tokens to the input sequence until the specified maximum length is reached or an end-of-sequence token is generated. The generated text is then printed to the console for inspection.

By specifying the `max_length`, the user controls the length of the generated sequence, ensuring that the output is of manageable size. The use of the pre-trained language model allows for coherent and contextually relevant text generation based on the given prompt.
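
The two stopping conditions can be seen in isolation with a toy stand-in for the model. The sketch below is an illustration under stated assumptions, not the notebook's model: `toy_model`, `greedy_generate`, and `EOS_ID` are hypothetical names, and the "model" is a hand-written rule whose scores depend on the length of the sequence it receives.

```python
EOS_ID = 0  # hypothetical end-of-sequence token ID


def toy_model(token_ids):
    """Stand-in for a language model: returns one score per vocabulary entry.

    The scores depend on the whole sequence (here, its length), which is why
    the full sequence is passed in at every step.
    """
    vocab_size = 5
    favourite = len(token_ids) % vocab_size
    return [1.0 if i == favourite else 0.0 for i in range(vocab_size)]


def greedy_generate(prompt_ids, max_length):
    """Append the highest-scoring token until EOS appears or max_length is reached."""
    generated = list(prompt_ids)
    while True:
        scores = toy_model(generated)  # full context, not just the last token
        next_id = max(range(len(scores)), key=lambda i: scores[i])
        generated.append(next_id)
        if next_id == EOS_ID or len(generated) >= max_length:
            break
    return generated


stopped_by_eos = greedy_generate([1, 2], max_length=20)
stopped_by_length = greedy_generate([1, 2], max_length=3)
print(stopped_by_eos)
print(stopped_by_length)
```

With a generous `max_length` the toy sequence ends when the end-of-sequence ID is produced. With a small `max_length` it is cut off as soon as the total length, prompt included, reaches the limit.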

In [None]:
generated_text = generate_text_until_end(
    input_text="I like to eat",
    model=model,
    tokenizer=tokenizer,
    max_length=20,
    device=device,
)

print(generated_text)

**Congratulations on how far you have come!** <br>
**Feel The AGI!**

**Author(s):**
**Adiza Alhassan and Jason Quist**


*This notebook was originally created by Ghana Data Science Summit for the IndabaX Ghana 2024 Conference and is published under the MIT license.*