# Build your own GPT (Are you Ready!!!)


The Generative Pre-trained Transformer (GPT), unveiled by Radford and colleagues in 2018, marks a pivotal advancement in the field of natural language processing. This model leverages the decoder component of the transformer architecture to produce text that closely mimics human writing and to execute a wide array of language-related tasks with impressive accuracy.
Our objective in this notebook is to delve into the core principles underlying GPT. We will construct a simplified version of the model from scratch using PyTorch, providing a hands-on approach to understanding its inner workings.
Reference:
[Radford et al. (2018), "Improving Language Understanding by Generative Pre-Training"](https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035)

## Transformer Block Idea

The Transformer Block is the fundamental building unit of the GPT architecture. In this section, we'll dissect the components of a Transformer Block and implement it from scratch.

A typical Transformer Block consists of:
1. Multi-Head Attention
2. Layer Normalization
3. Feed-Forward Neural Network
4. Residual Connections

Let's explore each of these components in detail before putting them together.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [2]:
from helpers.show_mermaid import mm

What does a Transformer Block look like?

In [3]:
mm("""
graph TD
    A[Input] --> B[Multi-Head Attention]
    A --> C[Add & Norm]
    B --> C
    C --> D[Feed Forward]
    C --> E[Add & Norm]
    D --> E
    E --> F[Output]
""")

## 1. Layer Normalization

LayerNorm is a technique for stabilizing and speeding up neural network training. It standardizes each individual input across its features, so that the features of every sample have a mean of 0 and a standard deviation of 1; a learnable scale and shift (initialized to 1 and 0) are then applied.
The main benefits of using LayerNorm are:

- Improved training stability: it is often credited with reducing internal covariate shift, which can cause issues during training.
- Consistent activation ranges: by normalizing inputs, it keeps activations within a predictable range throughout the model.
- More efficient training: standardized inputs lead to more predictable and faster training.

In essence, LayerNorm creates a more stable and efficient environment for neural networks to learn, making it easier to train complex models.

In [4]:
#  Use your example
x = torch.tensor([[1.0, 2.0, 3.0], 
                  [4.0, 6.0, 3.0], 
                  [7.0, 8.0, 9.0]])

In [5]:
layer_norm = nn.LayerNorm(3)  # Normalizing across the feature dimension
normalized_output = layer_norm(x)

print("Input:\n", x)
print("Normalized Output:\n", normalized_output)

Input:
 tensor([[1., 2., 3.],
        [4., 6., 3.],
        [7., 8., 9.]])
Normalized Output:
 tensor([[-1.2247,  0.0000,  1.2247],
        [-0.2673,  1.3363, -1.0690],
        [-1.2247,  0.0000,  1.2247]], grad_fn=<NativeLayerNormBackward0>)
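To see where these numbers come from, the normalization can be reproduced by hand. The following is a minimal sketch, assuming the default `eps=1e-5` of `nn.LayerNorm` and a freshly constructed layer (scale 1, shift 0); the names `manual` and `reference` are only illustrative.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 6.0, 3.0],
                  [7.0, 8.0, 9.0]])

eps = 1e-5  # assumed: the default epsilon of nn.LayerNorm

# Statistics are computed per row, i.e. over the feature dimension.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased (population) variance

manual = (x - mean) / torch.sqrt(var + eps)

# A freshly constructed LayerNorm has weight=1 and bias=0, so it should agree.
reference = nn.LayerNorm(3)(x)

print(manual)
print(torch.allclose(manual, reference, atol=1e-5))
```

Note that the variance is the biased one (divided by the number of features, not by one less), which is why the first row maps to roughly ±1.2247 rather than ±1.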


## 2. Feed Forward Neural Network

The FFN in a Transformer is a compact two-layer neural network applied independently to each position (token) in the sequence. Its primary functions are:

- Introducing non-linearity: this allows the model to capture and learn more intricate patterns in the data.
- Enhancing complexity: the FFN enables the model to approximate more sophisticated functions.

The non-linearity is crucial: without it, the two linear layers would collapse into a single linear map, and the network could not represent complex relationships within the data.

As a side note, this connects to the universal approximation theorem from neural network theory, which essentially states that neural networks with certain properties can approximate any continuous function to arbitrary precision, given enough neurons.

The ReLU (Rectified Linear Unit) activation function is a key component in many neural networks, including the Feed-Forward Network (FFN) in Transformers. Here's a breakdown of its characteristics and significance:

- Function behavior:
    - For positive inputs: ReLU(x) = x
    - For negative inputs: ReLU(x) = 0
- Shape: its graph resembles a hockey stick.
- Biological inspiration: ReLU loosely mimics the firing patterns of neurons in the brain. Neurons typically have:
    - A firing rate of zero when not activated (like ReLU for negative inputs)
    - Increased firing rates for stronger stimuli (like ReLU for positive inputs)
- Non-linearity: ReLU introduces non-linearity to the network, which is crucial for learning complex patterns.
- Comparison to GELU:
    - GELU (Gaussian Error Linear Unit) is another activation function that has gained popularity.
    - Unlike ReLU's sharp transition at 0, GELU has a smoother curve.
    - The derivative of GELU is continuous, which can sometimes lead to better gradient flow during training.
- Derivative:
    - ReLU's derivative is 1 for positive inputs and 0 for negative inputs (undefined at exactly 0).
    - This simple derivative makes it computationally efficient during backpropagation.

The choice between ReLU, GELU, or other activation functions often depends on the specific task and model architecture. Each has its advantages in different scenarios.
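The difference between the two activations is easiest to see on a few sample values. Below is a small sketch using only the standard library; it uses the exact form of GELU, x · Φ(x), where Φ is the standard normal CDF, and the helper names are illustrative. The ReLU derivative at exactly 0 is set to 0 here by convention.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    # Undefined at exactly 0; returning 0 there is a common convention.
    return 1.0 if x > 0 else 0.0


def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x: float) -> float:
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x), with phi the standard normal PDF.
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(
        f"x={value:+.1f}  relu={relu(value):.4f}  relu'={relu_grad(value):.1f}  "
        f"gelu={gelu(value):+.4f}  gelu'={gelu_grad(value):+.4f}"
    )
```

The ReLU derivative jumps from 0 to 1 at the origin, while the GELU derivative passes smoothly through 0.5 there and is slightly negative for some negative inputs.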

In [6]:
class SimpleFeedForward(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(SimpleFeedForward, self).__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, input_dim)
        
    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

In [7]:
#  Define your example
x = torch.tensor([[1.0, -2.0, 3.0]])

In [8]:
linear = nn.Linear(in_features=3, out_features=6)  # Expands 3 input features to 6 hidden features
relu = nn.ReLU()

x_linear = linear(x)
print(x_linear)

x_relued = relu(x_linear)
print(x_relued)

tensor([[ 0.9647,  0.1788,  1.0139,  1.5917, -0.3992, -0.3365]],
       grad_fn=<AddmmBackward0>)
tensor([[0.9647, 0.1788, 1.0139, 1.5917, 0.0000, 0.0000]],
       grad_fn=<ReluBackward0>)


In [9]:
ffn = SimpleFeedForward(input_dim=3, hidden_dim=6)
ffn_output = ffn(x)

print("Input:\n", x)
print("FeedForward Output:\n", ffn_output)

Input:
 tensor([[ 1., -2.,  3.]])
FeedForward Output:
 tensor([[-0.9308, -1.2707, -0.7993]], grad_fn=<AddmmBackward0>)
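The claim that the FFN is applied independently to each input can be checked on a batched tensor. The sketch below uses illustrative sizes (`d_model = 6`, `d_ff = 24`, neither taken from the cells above) and compares running the network on the whole `[batch, seq, d_model]` tensor with running it on each token vector separately.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_ff = 6, 24  # illustrative sizes
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 4, d_model)  # [batch_size, seq_len, d_model]

# Apply the FFN to the whole tensor at once ...
out_all = ffn(x)

# ... and to each token vector on its own, then stack the results back together.
out_each = torch.stack(
    [torch.stack([ffn(x[b, t]) for t in range(x.size(1))]) for b in range(x.size(0))]
)

print(out_all.shape)
print(torch.allclose(out_all, out_each, atol=1e-5))
```

Because `nn.Linear` only acts on the last dimension, no information is exchanged between positions here; mixing across positions is left entirely to attention.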


## 3. Residual connections with Dropout

Residual connections are a technique where the input to a layer is added directly to the output of that layer. In a Transformer block, this happens around each sublayer: the input is added to the output of the attention layer, and again to the output of the feed-forward network. Here's why this is important:

`Addressing the Vanishing Gradient Problem:`

As neural networks become deeper, gradients can become very small (vanish) as they're backpropagated through many layers.
This makes it difficult for earlier layers to learn effectively.
Residual connections provide a direct path for gradients to flow back to earlier layers, mitigating this issue.

`Easier Optimization:`

Residual connections make it easier for the network to learn identity mappings.
If a layer isn't necessary, the network can easily learn to "skip" it by setting its weights close to zero.

`Improved Information Flow:`

They allow information to flow more easily through the network.
This is particularly important in very deep networks.

`Stable Training:`

Residual connections can lead to more stable training, especially in deep networks.

`Performance Boost:`

Networks with residual connections often achieve better performance with fewer parameters compared to their non-residual counterparts.

In a nutshell, residual connections are a key solution to the vanishing gradient problem, a fundamental challenge in training deep neural networks. They allow for the creation of much deeper, more powerful networks while maintaining trainability.

After making a prediction, the network assesses each neuron's contribution to the error. It then uses backpropagation to update weights, calculating gradients via the chain rule. This process involves multiplying many small numbers (usually 0-1) as it moves backward through the network. In deeper networks, these multiplications lead to extremely small gradients in earlier layers. Consequently, the weights in these layers barely update, essentially halting the learning process for them. This phenomenon is known as the vanishing gradient problem, which can significantly hinder the training of deep neural networks.

The residual connection, expressed as `output = x + attention_f(x)`, serves two key purposes:

- Information preservation: it directly adds the original input (x) to the output of the attention function, ensuring that the original information is retained.
- Combating information loss: by preserving the input, it helps prevent the vanishing of important information as it passes through multiple layers of the network.

This simple addition allows the network to learn residual functions with reference to the input, rather than having to learn the entire transformation. As a result, it mitigates the vanishing gradient problem and allows for more effective training of deeper neural networks.
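The effect on gradients can be observed in a toy experiment. The sketch below is an illustration only, not part of the GPT model: it stacks 30 small `tanh` layers (the depth, width and seed are arbitrary choices) and measures how much gradient reaches the input with and without the skip connection.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

depth, width = 30, 16  # arbitrary toy sizes
layers = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])


def input_grad_norm(use_residual: bool) -> float:
    """Return the norm of d(sum of outputs)/d(input) for the stacked layers."""
    x = torch.randn(1, width, requires_grad=True)
    h = x
    for layer in layers:
        f = torch.tanh(layer(h))
        # With the skip connection the layer only has to learn a correction f(h).
        h = h + f if use_residual else f
    h.sum().backward()
    return x.grad.norm().item()


plain = input_grad_norm(use_residual=False)
residual = input_grad_norm(use_residual=True)

print(f"gradient norm at the input, plain stack:    {plain:.3e}")
print(f"gradient norm at the input, residual stack: {residual:.3e}")
```

In the plain stack the gradient is a product of 30 layer Jacobians and typically shrinks towards zero; with the skip connection each factor contains an identity term, so a direct path back to the input remains.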

We will re-use the multi-head attention implementation. As we recall, multi-head attention allows the model to focus on different parts of the input sequence simultaneously, capturing various types of relationships in the data. And it will benefit from using residual connections!

We will also apply Dropout, a regularization technique that randomly sets some input units to zero during training. Dropout helps prevent overfitting by ensuring the network does not rely too heavily on any single input, so models trained with Dropout tend to generalize better to unseen data.

In [10]:
#  Define your dimensions
d_model = 6  # Model dimensionality
batch_size = 2
seq_len = 4
num_heads=2

In [11]:
x = torch.randn(batch_size, seq_len, d_model)

# Define linear layers for transformation
W_q = nn.Linear(d_model, d_model)
W_k = nn.Linear(d_model, d_model)
W_v = nn.Linear(d_model, d_model)

# Convert x into query, key, and value
query = W_q(x)  # Shape: [batch_size, seq_len, d_model]
key = W_k(x)    # Shape: [batch_size, seq_len, d_model]
value = W_v(x)  # Shape: [batch_size, seq_len, d_model]

print("Query shape:", query.shape) 
print("Key shape:", key.shape)    
print("Value shape:", value.shape)

Query shape: torch.Size([2, 4, 6])
Key shape: torch.Size([2, 4, 6])
Value shape: torch.Size([2, 4, 6])


In [12]:
# Import the MultiHeadAttention class from the src.multiattention module
from src.multiattention import MultiHeadAttention

In [13]:
attention = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
attn_output, weights = attention(query, key, value)

print("Attention output:", attn_output)  # Can you tell what shape it has?
#print("Attention weights:", weights)

Attention output: tensor([[[-0.2629, -0.0067, -0.2713,  0.2276, -0.2592, -0.2580],
         [-0.2615,  0.0171, -0.2893,  0.2401, -0.2671, -0.2665],
         [-0.2629,  0.0175, -0.2889,  0.2419, -0.2685, -0.2668],
         [-0.2688,  0.0629, -0.3194,  0.2750, -0.2909, -0.2835]],

        [[-0.2081,  0.2592, -0.4933,  0.2407, -0.2217, -0.2386],
         [-0.2079,  0.2555, -0.4905,  0.2395, -0.2231, -0.2384],
         [-0.2048,  0.2548, -0.4902,  0.2390, -0.2208, -0.2428],
         [-0.2079,  0.2527, -0.4883,  0.2333, -0.2237, -0.2303]]],
       grad_fn=<ViewBackward0>)
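The printed tensor has shape `[batch_size, seq_len, d_model]`, i.e. `[2, 4, 6]`: attention returns one `d_model`-sized vector per token. Since `src.multiattention` is not shown here, the sketch below uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in to check the shapes; its internals and its return values may differ from the custom class.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, d_model, num_heads = 2, 4, 6, 2
x = torch.randn(batch_size, seq_len, d_model)

# Stand-in for the custom MultiHeadAttention class (an assumption, not the same code).
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

attn_output, attn_weights = mha(x, x, x)  # self-attention: query = key = value

print(attn_output.shape)   # one d_model-sized vector per token
print(attn_weights.shape)  # one weight per (query token, key token) pair, averaged over heads
print(attn_weights.sum(dim=-1))  # each row of weights sums to 1
```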


In [14]:
dropout = nn.Dropout(p=0.5)  # re-run this cell to see different elements zeroed
dropout_output = dropout(attn_output)

print("Output after Dropout:\n", dropout_output)

Output after Dropout:
 tensor([[[-0.0000, -0.0000, -0.5427,  0.0000, -0.0000, -0.0000],
         [-0.0000,  0.0342, -0.0000,  0.4802, -0.0000, -0.0000],
         [-0.5259,  0.0000, -0.5779,  0.4838, -0.5370, -0.5335],
         [-0.5376,  0.0000, -0.6389,  0.5499, -0.5817, -0.0000]],

        [[-0.0000,  0.0000, -0.0000,  0.4814, -0.0000, -0.0000],
         [-0.0000,  0.0000, -0.9811,  0.0000, -0.0000, -0.4768],
         [-0.0000,  0.0000, -0.9804,  0.0000, -0.4417, -0.0000],
         [-0.4158,  0.0000, -0.9765,  0.0000, -0.0000, -0.4607]]],
       grad_fn=<MulBackward0>)
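Notice that the surviving values in the output above are twice the original attention outputs. During training, `nn.Dropout` scales the elements it keeps by `1 / (1 - p)` so that the expected value stays the same, and in evaluation mode it does nothing. A minimal sketch on a tensor of ones makes this visible:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

p = 0.5
dropout = nn.Dropout(p=p)
x = torch.ones(2, 8)

dropout.train()          # training mode: random zeroing plus rescaling
train_out = dropout(x)   # every entry is either 0 or 1 / (1 - p) = 2

dropout.eval()           # evaluation mode: dropout is switched off
eval_out = dropout(x)    # identical to the input

print("train mode:\n", train_out)
print("eval mode:\n", eval_out)
```

This is why a model should be put into evaluation mode with `model.eval()` before generating text or measuring validation loss.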


Then we add the dropout-regularized attention output to the input in a residual connection.

In [15]:
x = x + dropout_output

print("Layer after residual connections and dropout:\n", x)

Layer after residual connections and dropout:
 tensor([[[ 1.5343, -0.6365,  1.3991, -0.5462,  1.5781,  2.0067],
         [ 0.0301, -0.0836,  0.5374,  0.6827,  0.4619, -0.5571],
         [-0.3692,  0.3704,  0.1088,  0.5024, -0.6496, -1.2839],
         [-1.0465,  0.0314, -0.9227,  0.7850, -4.8886, -2.4436]],

        [[ 0.1267, -1.9457,  0.0955, -0.1224, -0.2461,  0.7153],
         [-0.2390,  0.1072, -1.9518,  0.6050,  0.2563, -0.5283],
         [-0.4494,  0.7024, -0.0066,  0.6395,  0.2437,  0.2772],
         [-0.1319,  0.9554, -1.6117, -1.1465,  0.5277, -0.5346]]],
       grad_fn=<AddBackward0>)


## 4. Transformer Block

To bring everything together, we'll use `LayerNorm`, `FeedForward`, and `Dropout` in the `TransformerBlock`. The `TransformerBlock` combines multi-head attention with a feed-forward network, using LayerNorm and Dropout at each step to improve training stability and generalization.

In [16]:
%%writefile src/transformerblock.py
import torch
import torch.nn as nn
import math
from src.multiattention import MultiHeadAttention


class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):

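        #  Complete TransformerBlock.forward()
        # Self-attention sublayer: query, key and value are all x.
        # The attention module returns a tuple (output, weights).
        # Passing `mask` as the fourth argument is an assumption about the
        # signature of MultiHeadAttention.forward; adjust it if yours differs.
        att, _ = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(att))  # residual connection + LayerNorm
        # Feed-forward sublayer, again wrapped in residual connection + LayerNorm
        ff = self.ff(x)
        x = self.norm2(x + self.dropout(ff))
        return x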

Overwriting src/transformerblock.py


In [None]:
from src.transformerblock import TransformerBlock
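To try the block without depending on `src.multiattention`, here is a self-contained sketch of the same post-norm layout. It swaps in PyTorch's built-in `nn.MultiheadAttention` for the custom attention class, so `SketchBlock` and its `attn_mask` argument are hypothetical names, and the mask convention (`True` means "may not attend") is that of the built-in module, which may differ from the custom one.

```python
import torch
import torch.nn as nn


class SketchBlock(nn.Module):
    """Post-norm block; nn.MultiheadAttention stands in for the custom attention."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        att, _ = self.attention(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + self.dropout(att))
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x


torch.manual_seed(0)
block = SketchBlock(d_model=6, num_heads=2, d_ff=24).eval()  # eval() disables dropout

seq_len = 4
# True marks positions a token may NOT attend to: everything to its right.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

x = torch.randn(2, seq_len, 6)
out = block(x, attn_mask=causal_mask)

# Changing the last token must not affect the outputs of earlier tokens.
x_changed = x.clone()
x_changed[:, -1] = torch.randn(2, 6)
out_changed = block(x_changed, attn_mask=causal_mask)

print(out.shape)
print(torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5))
```

The block keeps the input shape, which is what allows several blocks to be stacked, and the causal mask ensures each token only sees itself and earlier tokens, as a GPT-style decoder requires.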

## 5. Positional Encoding

Imagine being in a crowded room where many people (think "nodes") are talking. We pay attention to the pieces of information we hear, especially the ones that sound relevant to us (this is like the weighted sum we have seen, with some pieces carrying more weight). We don't really memorize where each person was standing. That's why our GPT architecture needs to handle position explicitly, through positional encodings.

The attention mechanism keeps track of the information itself, but not of where it comes from. To create our `PseudoGPT`, we need to encode the position of tokens alongside the `TransformerBlock`.

Position embeddings are used in Transformer models to incorporate the position of each token in the sequence because, unlike RNNs or LSTMs, Transformers do not have inherent sequential order information. They process tokens in parallel, so positional information is added explicitly.
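That attention alone ignores order can be verified directly: if the tokens are shuffled, the outputs are shuffled in exactly the same way and are otherwise unchanged. The sketch below again uses the built-in `nn.MultiheadAttention` as a stand-in for the custom attention class, with no mask and no positional information.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the custom attention module (an assumption for this illustration).
mha = nn.MultiheadAttention(embed_dim=6, num_heads=2, batch_first=True).eval()

x = torch.randn(1, 4, 6)            # [batch, seq_len, d_model]
perm = torch.tensor([2, 0, 3, 1])   # an arbitrary reordering of the 4 tokens

out, _ = mha(x, x, x)
x_shuffled = x[:, perm]
out_shuffled, _ = mha(x_shuffled, x_shuffled, x_shuffled)

# Shuffling the input only shuffles the output rows; each token's result is the same.
print(torch.allclose(out[:, perm], out_shuffled, atol=1e-5))
```

In other words, without positional information the model would treat a sentence as a bag of tokens.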

In [18]:
vocab_size = 5  # Number of discrete items (here: positions 0 to 4)
d_model = 4 # Size of the embedding vector

positions = torch.arange(vocab_size)
print("Output of arange:\n", positions)

Output of arange:
 tensor([0, 1, 2, 3, 4])


In [19]:
# nn.Linear
linear_layer = nn.Linear(1, d_model)
# We need to reshape the input for nn.Linear
linear_input = positions.float().unsqueeze(1)
linear_output = linear_layer(linear_input)

print(linear_layer)
print("\nnn.Linear output:")
print(linear_output)
print("Shape:", linear_output.shape)

Linear(in_features=1, out_features=4, bias=True)

nn.Linear output:
tensor([[-0.7213, -0.5687,  0.5899, -0.1792],
        [-1.0725,  0.2110, -0.0696,  0.5765],
        [-1.4236,  0.9906, -0.7291,  1.3321],
        [-1.7748,  1.7703, -1.3886,  2.0878],
        [-2.1259,  2.5499, -2.0481,  2.8435]], grad_fn=<AddmmBackward0>)
Shape: torch.Size([5, 4])


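If we look at the output column-wise, we can see a linear relationship between the values: within each column, the difference between consecutive values is constant (about -0.35 in the first column and about 0.78 in the second). Position is squeezed into a single scalar input, so every row lies on one straight line. For `nn.Linear` to act as a lookup table, its input would have to be a one-hot vector per position rather than the raw index.

`nn.Embedding` creates dense vector representations and essentially works as a lookup table. It maps each index to a row of a learnable weight matrix, and each row has the chosen embedding dimension.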

In [20]:
# nn.Embedding
position_embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)
pos_emb = position_embedding(positions)

print("nn.Embedding output:")
print(pos_emb)
print("Shape:", pos_emb.shape)

nn.Embedding output:
tensor([[ 1.2701, -0.9995,  0.4889, -1.3305],
        [ 1.0389, -1.2353, -0.2923, -0.3398],
        [-0.1789,  1.3103,  0.5005,  0.9266],
        [-1.1340,  1.7937,  0.3405,  0.4843],
        [ 0.5450,  0.5316,  1.7035, -0.6980]], grad_fn=<EmbeddingBackward0>)
Shape: torch.Size([5, 4])
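The lookup-table view can be made concrete: indexing into `nn.Embedding` gives the same result as multiplying one-hot vectors by its weight matrix, which is also what a bias-free `nn.Linear` does when it is fed one-hot inputs. A small sketch, where the variable names and the weight copy are only for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

num_positions, d_model = 5, 4
emb = nn.Embedding(num_positions, d_model)
positions = torch.arange(num_positions)

# 1) Plain lookup: row i of the weight matrix for index i.
lookup = emb(positions)

# 2) The same thing written as a matrix product with one-hot vectors.
one_hot = F.one_hot(positions, num_classes=num_positions).float()
via_matmul = one_hot @ emb.weight

# 3) A bias-free nn.Linear with the transposed weights, applied to one-hot inputs.
linear = nn.Linear(num_positions, d_model, bias=False)
with torch.no_grad():
    linear.weight.copy_(emb.weight.T)  # nn.Linear stores weights as [out, in]
via_linear = linear(one_hot)

print(torch.allclose(lookup, via_matmul))
print(torch.allclose(lookup, via_linear))
```

The embedding layer simply skips the multiplication by zeros, which matters when the number of indices is large.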


In [21]:
pos_emb = pos_emb.unsqueeze(0) # Add batch dimension
print("Shape of position embedding to accommodate for batch:", pos_emb.shape)

Shape of position embedding to accommodate for batch: torch.Size([1, 5, 4])
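The extra batch dimension of size 1 lets the same position vectors be added to every sequence in a batch through broadcasting. The sketch below shows this with illustrative sizes and token IDs (none of them taken from the cells above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, max_seq_length, d_model = 10, 5, 4  # illustrative sizes
token_embedding = nn.Embedding(vocab_size, d_model)
position_embedding = nn.Embedding(max_seq_length, d_model)

tokens = torch.tensor([[1, 4, 2, 7, 3],
                       [0, 9, 9, 5, 6]])      # [batch=2, seq_len=5]

tok = token_embedding(tokens)                  # [2, 5, 4]
pos = position_embedding(torch.arange(tokens.size(1))).unsqueeze(0)  # [1, 5, 4]

x = tok + pos  # broadcasting repeats `pos` along the batch dimension

print(tok.shape, pos.shape, x.shape)
# Both sequences received exactly the same positional offsets.
print(torch.allclose(x[0] - tok[0], x[1] - tok[1], atol=1e-6))
```

PyTorch would also broadcast a `[5, 4]` tensor against `[2, 5, 4]` without the explicit `unsqueeze(0)`; adding the dimension mainly makes the intended shapes explicit.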


#### Question to you:

Suppose we embed the letters of the English alphabet with a model dimensionality of 26. What would be the difference between applying `nn.Linear` and `nn.Embedding`?

## 6. Your own GPT (PseudoGPT)

A simple GPT model can be described as having these main components:

1. Token Embedding: Converting tokens (e.g. words) to numeric vectors
2. Positional Embedding: Adding position information
3. TransformerBlock (repeated several times), which includes:

    - Multi-head Attention: Allowing the model to focus on different aspects of the input
    - Layer Normalization: Helping to stabilize the learning process
    - Feedforward Network: Processing the attention output further
    - Dropout: Helping to prevent overfitting

This architecture allows the model to process input text, paying attention to relevant parts, while maintaining awareness of word order, and learning complex patterns in the data.

Shall we build our GPT from scratch?

In [22]:
%%writefile src/gpt.py
import torch
import torch.nn as nn
import math

from src.multiattention import MultiHeadAttention
from src.transformerblock import TransformerBlock



class PseudoGPT(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout=0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_length, d_model)
        self.transformer_blocks = nn.ModuleList(
            [TransformerBlock(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )
        self.fc_out = nn.Linear(d_model, vocab_size)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

        
    def forward(self, x, mask=None):
        seq_length = x.size(1)
        x = self.token_embedding(x) + self.position_embedding(torch.arange(seq_length, device=x.device))
        x = self.dropout(x)
        for block in self.transformer_blocks:
            x = block(x, mask)
        x = self.norm(x)
        x = self.fc_out(x)
        return x
        
    

Overwriting src/gpt.py


In [23]:
from src.gpt import PseudoGPT
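The model maps token IDs of shape `[batch, seq_len]` to logits of shape `[batch, seq_len, vocab_size]`. A typical way to train such a model is next-token prediction: the logits at position t are scored against the token at position t + 1. The sketch below shows only this loss computation; random logits stand in for the model output, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, seq_len, vocab_size = 2, 8, 50  # illustrative sizes

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))

# Stand-in for the model output: random logits of shape [batch, seq_len, vocab_size].
logits = torch.randn(batch_size, seq_len, vocab_size, requires_grad=True)

# Position t predicts token t + 1, so drop the last prediction and the first target.
pred = logits[:, :-1, :].reshape(-1, vocab_size)   # [(batch * (seq_len - 1)), vocab]
target = tokens[:, 1:].reshape(-1)                 # [(batch * (seq_len - 1))]

loss = F.cross_entropy(pred, target)
loss.backward()

print("loss:", loss.item())
print("gradient shape:", logits.grad.shape)
```

With a real model, the same loss would be followed by an optimizer step, and a causal mask would be needed so that position t cannot look at the token it is supposed to predict.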

#### Congratulations! We have now built a GPT-like model from scratch.

While our implementation is a simplified version of the full GPT model, it provides a solid foundation for understanding more complex variants like GPT-2, GPT-3, and beyond.

Through this process, we've explored the key components of the GPT architecture, including:

- Token and positional embeddings
- Multi-head attention mechanism
- Transformer blocks
- The overall GPT model structure

Now our GPT model is ready to be trained and tested.

Note: Try building your own GPTs, for example arGPT :)
 