# Week 4: GPT from Scratch

GPT, [introduced by Radford et al. in 2018](https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035), represents a significant milestone in natural language processing. It uses only the decoder portion of the Transformer architecture to generate human-like text and perform a range of language tasks.

In this notebook, we'll explore the foundational concepts behind GPT and build a simplified version from the ground up using PyTorch.

## 0. Transformer Block Idea

The Transformer Block is the fundamental building unit of the GPT architecture. In this section, we'll dissect the components of a Transformer Block and implement it from scratch.

A typical Transformer Block consists of:
1. Multi-Head Attention
2. Layer Normalization
3. Feed-Forward Neural Network
4. Residual Connections.

Let's explore each of these components in detail before putting them together.

In [None]:
# Import necessary libraries
import torch
import torch.nn as nn
import torch.nn.functional as F

from typing import Tuple, List

In [None]:
from helpers.check_todo import check_implementation
from helpers.show_mermaid import mm

What does a Transformer Block look like?

In [None]:
mm("""
graph TD
    A[Input] --> B[Multi-Head Attention]
    A --> C[Add & Norm]
    B --> C
    C --> D[Feed Forward]
    C --> E[Add & Norm]
    D --> E
    E --> F[Output]
""")

## 1. Layer Normalization

Layer normalization is a technique to stabilize and accelerate the training of neural networks. It normalizes each input across its features (or channels) so that they have a mean of 0 and a standard deviation of 1, independently for each input, and then applies a learnable scale and shift.

Why use LayerNorm? It improves training stability, an effect often attributed to reducing internal covariate shift. It also keeps activations in a consistent range throughout the model, making training more predictable and efficient.
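
The normalization described above can be sketched by hand. This is a minimal illustration, assuming the default `eps=1e-5` of `nn.LayerNorm` and a freshly created layer (scale 1, shift 0); the tensor values are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# Statistics are computed per row (per input), over the feature dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance, as LayerNorm uses
eps = 1e-5  # small constant for numerical stability (assumed default)

manual = (x - mean) / torch.sqrt(var + eps)

# A freshly created nn.LayerNorm has weight=1 and bias=0, so it should match
reference = nn.LayerNorm(3)(x)
matches = torch.allclose(manual, reference, atol=1e-6)

print(manual)
print("Matches nn.LayerNorm:", matches)
```

Both rows end up with the same normalized values, because each row is normalized using only its own mean and variance.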

Let's check how it works.

In [None]:
# TODO: Use your example
x = torch.tensor([[1.0, 2.0, 3.0], 
                  [4.0, 5.0, 6.0], 
                  [7.0, 8.0, 9.0]])

In [None]:
layer_norm = nn.LayerNorm(3)  # Normalizing across the feature dimension
normalized_output = layer_norm(x)

print("Input:\n", x)
print("Normalized Output:\n", normalized_output)

## 2. Feed Forward Neural Network

FFN in the Transformer is simply a small 2-layer network applied to each input independently. It helps introduce non-linearity and allows the model to learn more complex patterns. The non-linearity is what gives neural networks their power to approximate complex functions.

By the way, a fun search keyword is "Universal approximation theorem" ;)
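
To see what "applied to each input independently" means in practice, here is a small sketch. The layer sizes and the `nn.Sequential` stand-in are illustration choices, not the notebook's own class; the point is only that running the network on a whole sequence gives the same result as running it on each position separately.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small 2-layer network; the sizes here are arbitrary illustration values
ffn = nn.Sequential(nn.Linear(3, 6), nn.ReLU(), nn.Linear(6, 3))

seq = torch.randn(4, 3)  # 4 positions, 3 features each

# Apply the network to the whole sequence at once...
out_all = ffn(seq)
# ...and to each position on its own
out_each = torch.stack([ffn(seq[i]) for i in range(seq.size(0))])

same = torch.allclose(out_all, out_each, atol=1e-6)
print("Same result either way:", same)
```

No information flows between positions inside the FFN; mixing across positions is the job of the attention layer.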

Let's look at this class. You can see `ReLU` here, a function that helps us model non-linear patterns. Its graph looks like a hockey stick: a positive number is left as it is, and a negative number is set to 0. ReLU loosely mimics the behavior of neurons in the brain, which typically have a firing rate of zero when not activated and increase their firing rate for stronger stimuli. You can compare it to `GELU` and think about how their derivatives differ.
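
As a starting point for that comparison, the sketch below evaluates both activations and their gradients on a few hand-picked inputs. The input values are arbitrary; `F.relu`, `F.gelu`, and `torch.autograd.grad` are standard PyTorch calls.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0], requires_grad=True)

relu_out = F.relu(x)
gelu_out = F.gelu(x)

# Gradients of each activation with respect to its input
relu_grad, = torch.autograd.grad(relu_out.sum(), x)
gelu_grad, = torch.autograd.grad(gelu_out.sum(), x)

print("ReLU:      ", relu_out.detach())
print("GELU:      ", gelu_out.detach())
print("ReLU grad: ", relu_grad)
print("GELU grad: ", gelu_grad)
```

Note how the ReLU gradient is exactly zero for negative inputs, while GELU is smooth and still passes a small gradient there.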

In [None]:
class SimpleFeedForward(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super(SimpleFeedForward, self).__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, input_dim)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear2(self.relu(self.linear1(x)))

In [None]:
# TODO: Define your example
x = torch.tensor([[1.0, -2.0, 3.0]])

In [None]:
linear = nn.Linear(in_features=3, out_features=6)  # Expanding from 3 to 6 features, like the first FFN layer
relu = nn.ReLU()

x_linear = linear(x)
print(x_linear)

x_relued = relu(x_linear)
print(x_relued)

In [None]:
ffn = SimpleFeedForward(input_dim=3, hidden_dim=6)
ffn_output = ffn(x)

print("Input:\n", x)
print("FeedForward Output:\n", ffn_output)

## 3. Residual connections with Dropout

A Residual Connection (or Skip Connection) adds the input of a layer back to its output, here the output of the attention layer. Why do we do this? Say hello to the vanishing gradient problem!

Once the network has made a prediction based on the input, it calculates how much each neuron contributed to the error and then uses gradients to update the weights.
These gradients are computed using the chain rule, multiplying many small numbers (typically between 0 and 1) as we move backwards through the network.
As we move back towards the earlier layers, these multiplications result in extremely small numbers.
These tiny gradients mean that the weights in earlier layers barely update, effectively stopping the learning process for these layers.

This residual connection (simply `output = x + attention_f(x)`) preserves the original input information and combines it with the learned features from the attention mechanism. It also gives gradients a direct path back to the earlier layers, so we combat the vanishing of both information and gradients!
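
The effect can be made visible with a toy experiment. The sketch below is an illustration under assumptions, not part of the GPT model: it stacks 30 small `Linear` + `Sigmoid` layers (sizes chosen arbitrarily) and compares the gradient that reaches the input with and without the `x + f(x)` pattern.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

depth, width = 30, 16  # arbitrary illustration values
layers = [nn.Sequential(nn.Linear(width, width), nn.Sigmoid()) for _ in range(depth)]

def input_grad_norm(use_residual: bool) -> float:
    """Return the norm of the gradient that arrives at the input."""
    x = torch.randn(1, width, requires_grad=True)
    h = x
    for layer in layers:
        # With a residual connection, the input is added back to the layer output
        h = h + layer(h) if use_residual else layer(h)
    h.sum().backward()
    return x.grad.norm().item()

plain = input_grad_norm(use_residual=False)
residual = input_grad_norm(use_residual=True)

print(f"Gradient norm at the input, plain stack:    {plain:.2e}")
print(f"Gradient norm at the input, residual stack: {residual:.2e}")
```

In the plain stack the gradient shrinks at every layer, while the identity path of the residual stack lets it reach the input at a usable magnitude.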

We will re-use the multi-head attention implementation. As we recall, multi-head attention allows the model to focus on different parts of the input sequence simultaneously, capturing various types of relationships in the data. And it will benefit from using residual connections!

We will also apply a regularization technique called Dropout. During training, it randomly sets some input units to zero. Dropout helps prevent overfitting by ensuring the network does not rely too heavily on any single input, so models trained with Dropout tend to generalize better to unseen data.
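
One detail worth seeing in code is that Dropout behaves differently in training and evaluation mode. The following is a minimal sketch with an all-ones input chosen so the effect is easy to read; `p=0.5` is an illustration value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)
x = torch.ones(2, 8)

dropout.train()         # training mode: units are randomly zeroed
train_out = dropout(x)  # surviving units are scaled by 1 / (1 - p) = 2.0

dropout.eval()          # evaluation mode: dropout does nothing
eval_out = dropout(x)

print("Training mode:\n", train_out)
print("Evaluation mode:\n", eval_out)
```

The rescaling of the surviving units during training keeps the expected value of each activation unchanged, which is why no correction is needed at evaluation time.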

In [None]:
# TODO: Define your dimensions
d_model = 6  # Model dimensionality
batch_size = 2
seq_len = 4
num_heads = 2

In [None]:
x = torch.randn(batch_size, seq_len, d_model)

# Define linear layers for transformation
W_q = nn.Linear(d_model, d_model)
W_k = nn.Linear(d_model, d_model)
W_v = nn.Linear(d_model, d_model)

# Convert x into query, key, and value
query = W_q(x)  # Shape: [batch_size, seq_len, d_model]
key = W_k(x)    # Shape: [batch_size, seq_len, d_model]
value = W_v(x)  # Shape: [batch_size, seq_len, d_model]

print("Query shape:", query.shape) 
print("Key shape:", key.shape)    
print("Value shape:", value.shape)

In [None]:
# Import the MultiHeadAttention class from the src.multiattention module
from src.multiattention import MultiHeadAttention

In [None]:
attention = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
attn_output, weights = attention(query, key, value)

print("Attention output:", attn_output)  # Do you know what shape it has?
#print("Attention weights:", weights)

In [None]:
dropout = nn.Dropout(p=0.5)  # re-run this cell to see different elements zeroed
dropout_output = dropout(attn_output)

print("Output after Dropout:\n", dropout_output)

We then add the attention output, after dropout, to the input in a residual connection.

In [None]:
x = x + dropout_output

print("Layer after residual connections and dropout:\n", x)

## 4. Transformer Block

To bring everything together, we'll use `LayerNorm`, `FeedForward`, and `Dropout` in the `TransformerBlock`. The `TransformerBlock` combines multi-head attention with a feed-forward network, using LayerNorm and Dropout at each step to improve training stability and generalization.

In [None]:
%%writefile src/transformerblock.py
import math
from typing import Optional

import torch
import torch.nn as nn

from src.multiattention import MultiHeadAttention


class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:

        # TODO: Complete TransformerBlock.forward()

        # Hint: first apply the attention mechanism, spoiler: the arguments are (x, x, x, mask)
        # Then add the residual connection with Dropout and apply Norm
        # Finish with the FFN and another Add with Dropout & Norm

        pass

In [None]:
from src.transformerblock import TransformerBlock

try:
    check_implementation(TransformerBlock)
except NotImplementedError as e:
    print(e)
# you can move on once your forward pass is ready

## 5. Positional Encoding

Imagine being in a crowded room with many people (think "nodes") talking. We pay attention to the pieces of information we hear, especially those that sound relevant to us (this is like the weighted sum we have seen, where some pieces carry more weight). We don't really memorize who was standing where. Attention works similarly, which is why our GPT architecture needs positional encodings.

The attention mechanism keeps track of the information itself, but not of where it comes from. To create our `PseudoGPT`, we need to encode the position of tokens alongside the `TransformerBlock`.

Position embeddings are used in Transformer models to incorporate the position of each token in the sequence because, unlike RNNs or LSTMs, Transformers do not have inherent sequential order information. They process tokens in parallel, so positional information is added explicitly.
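
The claim that attention has no built-in sense of order can be checked directly. The sketch below uses a deliberately stripped-down self-attention (single head, no learned projections, no mask), which is an assumption made for brevity and not the `MultiHeadAttention` class used in this notebook.

```python
import math
import torch

torch.manual_seed(0)

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention without projections, for illustration only."""
    scores = x @ x.transpose(-2, -1) / math.sqrt(x.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ x

x = torch.randn(5, 8)                 # 5 tokens, 8 features, no position information
perm = torch.tensor([4, 2, 0, 3, 1])  # an arbitrary reordering of the tokens

out = self_attention(x)
out_shuffled = self_attention(x[perm])

# Shuffling the input only shuffles the output rows in the same way:
# each token gets the same vector regardless of where it sits in the sequence
order_blind = torch.allclose(out[perm], out_shuffled, atol=1e-5)
print("Attention ignores token order:", order_blind)
```

Without positional information, "dog bites man" and "man bites dog" would produce the same set of token representations, just in a different order.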

In [None]:
vocab_size = 5  # Number of discrete items
d_model = 4 # Size of the embedding vector

positions = torch.arange(vocab_size)
print("Output of arange:\n", positions)

In [None]:
# nn.Linear
linear_layer = nn.Linear(1, d_model)
# We need to reshape the input for nn.Linear
linear_input = positions.float().unsqueeze(1)
linear_output = linear_layer(linear_input)

print(linear_layer)
print("\nnn.Linear output:")
print(linear_output)
print("Shape:", linear_output.shape)

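If we look at the output column by column, we can see a linear relationship between the values. The difference between consecutive values in a column is constant (approximately 0.69 in one example run; the exact value equals that column's weight and depends on the random initialization). Applied to a raw position number, `nn.Linear` can only produce such straight-line patterns; it would behave like a lookup table only if the position were first converted to a one-hot vector.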

`nn.Embedding` creates dense vector representations and essentially works as a lookup table. It maps each index to a row of a weight matrix, and that row is the embedding vector.
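
The lookup-table behavior can be verified in a few lines. This is a minimal sketch with arbitrary sizes and indices: the output of the embedding layer is compared with rows read directly from its weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embedding = nn.Embedding(num_embeddings=5, embedding_dim=4)
indices = torch.tensor([0, 3, 3])

looked_up = embedding(indices)
# The same rows, read directly from the weight matrix
rows = embedding.weight[indices]

print("Weight matrix shape:", embedding.weight.shape)
print("Lookup equals row selection:", torch.equal(looked_up, rows))
```

The repeated index 3 returns the same vector twice, as expected from a table lookup.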

In [None]:
# nn.Embedding
position_embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)
pos_emb = position_embedding(positions)

print("nn.Embedding output:")
print(pos_emb)
print("Shape:", pos_emb.shape)

In [None]:
pos_emb = pos_emb.unsqueeze(0) # Add batch dimension
print("Shape of position embedding to accommodate for batch:", pos_emb.shape)

#### Question to you:

If we were embedding the letters of the English alphabet with a model dimensionality of 26, what would be the difference between applying `nn.Linear` and `nn.Embedding`?

## 6. Hello, PseudoGPT!

Yes, you've got it right! A simple GPT model can be described as having these main components:

1. Token Embedding: Converting words to number vectors
2. Positional Embedding: Adding position information
3. TransformerBlock (repeated several times), which includes:

    - Multi-head Attention: Allowing the model to focus on different aspects of the input
    - Layer Normalization: Helping to stabilize the learning process
    - Feedforward Network: Processing the attention output further
    - Dropout: Helping to prevent overfitting

This architecture allows the model to process input text, paying attention to relevant parts, while maintaining awareness of word order, and learning complex patterns in the data.
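
The first two components can be sketched on their own before building the full model. The sizes and token ids below are illustration values, not ones prescribed by this notebook; the sketch only shows how token and position embeddings line up in shape and are added together.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions for this sketch)
vocab_size, max_seq_length, d_model = 10, 8, 4
token_embedding = nn.Embedding(vocab_size, d_model)
position_embedding = nn.Embedding(max_seq_length, d_model)

tokens = torch.tensor([[1, 5, 2],
                       [7, 7, 0]])          # [batch_size=2, seq_len=3]
positions = torch.arange(tokens.size(1))    # one index per position in the sequence

tok = token_embedding(tokens)                     # [2, 3, 4]
pos = position_embedding(positions).unsqueeze(0)  # [1, 3, 4], broadcast over the batch
combined = tok + pos

print("Token embeddings:   ", tok.shape)
print("Position embeddings:", pos.shape)
print("Combined:           ", combined.shape)
```

In the second sequence the token 7 appears twice; its two combined vectors differ only by the position embeddings, which is how the model can tell the occurrences apart.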

Shall we build our GPT from scratch?

In [None]:
%%writefile src/gpt.py
import math
from typing import Optional

import torch
import torch.nn as nn

from src.multiattention import MultiHeadAttention
from src.transformerblock import TransformerBlock


class PseudoGPT(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        d_model: int,
        num_heads: int,
        num_layers: int,
        d_ff: int,
        max_seq_length: int,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_length, d_model)
        self.transformer_blocks = nn.ModuleList(
            [TransformerBlock(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )
        self.fc_out = nn.Linear(d_model, vocab_size)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        seq_length = x.size(1)

        # TODO: Complete PseudoGPT
        
        # Hint: we first use token and position embeddings added together
        # Then we apply dropout and pass through Transformer Blocks
        # Finally, we do LayerNorm and Linear projection and return logits

        
        pass

In [None]:
from src.gpt import PseudoGPT

try:
    check_implementation(PseudoGPT)
except NotImplementedError as e:
    print(e)
# you can move on once your forward pass is ready

#### Congratulations! We have now built a GPT-like model from scratch.

While our implementation is a simplified version of the full GPT model, it provides a solid foundation for understanding more complex variants like GPT-2, GPT-3, and beyond.

Through this process, we've explored the key components of the GPT architecture, including:

- Token and positional embeddings
- Multi-head attention mechanism
- Transformer blocks
- The overall GPT model structure.

Now our GPT model is ready to be trained. 