# **6: The Main Event - Building a Transformer (GPT)**

In the previous sections, we built up the foundational components of a language model, starting from understanding the data and creating a simple bigram model. Now, we are ready to build the main event: a `Transformer-based language model`, specifically a GPT (Generative Pre-trained Transformer) architecture. This model will be capable of generating coherent and contextually relevant text based on the input it receives. We will go through the architecture, how it works, and how to train it effectively.


## **The Limitations of the Bigram Model**

Before `Transformers`, we had simpler models like the `Bigram Model` that could only predict the next token based on the previous token. While this is a good starting point, it has significant limitations:

- It can only capture relationships between adjacent tokens, which means it cannot understand long-range dependencies in the text.
- It does not have the capacity to learn complex patterns or structures in the language, such as grammar, syntax, or semantics.
- It is not capable of generating coherent text over longer sequences, as it lacks the ability to maintain context beyond the immediate previous token.
- Each prediction is made independently, without considering the broader context of the entire sequence, leading to disjointed and incoherent text generation.

Imagine trying to complete a sentence when you can only see the last word you wrote. You would have no memory of anything before it, so writing a coherent sentence would be very difficult. This is the problem with the `Bigram Model`: it sees only one token at a time and cannot maintain any context or sense of the overall structure of the text.
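
To make this limitation concrete, here is a minimal sketch of a count-based bigram predictor in plain Python. The tiny corpus and the helper names (`train_bigram`, `predict_next`) are made up for illustration and are not the model from the earlier sections; the point is only that two very different contexts ending in the same word produce the same prediction.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, context):
    """A bigram model consults only the final token of the context."""
    last = context[-1]
    return counts[last].most_common(1)[0][0]

# Toy corpus (an assumption for this example)
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
counts = train_bigram(corpus)

# Two different contexts that happen to end in the same word
a = predict_next(counts, "the cat sat on".split())
b = predict_next(counts, "the dog barked and then sat on".split())
print(a, b)  # identical, because only "on" is ever looked at
```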


## **Introducing the Transformer Architecture**

The `Transformer` architecture, introduced in the 2017 paper "Attention is All You Need" by Vaswani et al., revolutionized natural language processing (NLP) by allowing models to capture long-range dependencies and complex patterns in text. Its key innovation is the `self-attention mechanism`, which lets the model weigh the importance of different tokens in the input sequence when making predictions. It's the foundation for many state-of-the-art language models, including GPT (Generative Pre-trained Transformer), BERT (Bidirectional Encoder Representations from Transformers), and many others. The Transformer's superpower is its ability to process sequences of data, such as text, while keeping track of the contextual relationships between tokens, however far apart they are.

**Self-attention** allows the model to consider the whole context when making predictions, rather than just the previous token. Every token can attend to other tokens in the sequence regardless of how far away they are, so the model can pick up the grammar, syntax, and semantics needed to generate coherent and contextually relevant text. This matters for natural language, where the meaning of a word often depends on other words in the sentence or paragraph. For example, in the sentence "The cat sat on the mat," the word "cat" is related to "sat" and "mat," and self-attention lets the model represent those relationships directly.

The original `Transformer` consists of an encoder and a decoder, but language models like GPT use only the decoder part. The decoder generates text one token at a time, so its self-attention is restricted: each token attends to itself and to earlier tokens, never to future ones. It is built from multiple layers, each combining self-attention with a feed-forward neural network.
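
The mechanism can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used later: the three toy token vectors are made up, and the same vectors serve as queries, keys, and values (a real model first passes them through learned linear projections). Each position scores itself and every earlier position, turns the scores into weights with a softmax, and outputs a weighted average of the values.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position t sees positions 0..t only."""
    d = len(queries[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # One score per visible position: dot(q, k) / sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        # Weighted average of the visible value vectors
        out = [sum(w * values[s][j] for s, w in enumerate(weights))
               for j in range(len(values[0]))]
        outputs.append(out)
        # Pad with zeros for the future positions that were never scored
        all_weights.append(weights + [0.0] * (len(queries) - t - 1))
    return outputs, all_weights

# Three toy token vectors of dimension 2 (assumed values)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row holds one token's attention weights. The first token can only attend to itself, while the last token spreads its attention over all three positions.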


## **The Problem with Attention and the Solution: Flash Attention**

While the standard self-attention mechanism is powerful, it is computationally expensive for long sequences. It calculates an attention score for every pair of tokens, so for a sequence of length `n` it needs `O(n^2)` computations and a `(sequence_length, sequence_length)` attention matrix to hold the scores. Doubling the sequence length therefore quadruples both the work and the size of that matrix. The matrix also has to be written to and read back from the `GPU's` main memory (HBM - High Bandwidth Memory), which is large but slow compared with on-chip memory, and this memory traffic is a major bottleneck for training and inference.
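
A quick back-of-the-envelope calculation shows how fast the attention matrix grows. The sketch below assumes 4-byte (float32) scores and counts one attention matrix per head and per sequence in the batch, for a single layer. The batch size of 64 and the 6 heads match the hyperparameters used later in this section; the sequence lengths are arbitrary examples.

```python
def attention_matrix_bytes(seq_len, batch_size=1, num_heads=1, bytes_per_value=4):
    """Bytes needed to store the full (seq_len, seq_len) score matrices of one layer."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

for n in (256, 1024, 4096, 16384):
    mib = attention_matrix_bytes(n, batch_size=64, num_heads=6) / 2**20
    print(f"seq_len={n:>6}: {mib:>12,.0f} MiB per layer")
```

At a sequence length of 256 this is about 96 MiB per layer, which is manageable. Each 4x increase in sequence length multiplies the figure by 16.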


## **Flash Attention: A More Efficient Attention Mechanism**

To address this issue, researchers developed a more efficient way to compute attention called `Flash Attention`. It is not an approximation and it is not `sparse attention`: every token still attends to every token it is allowed to see, and the output matches standard attention up to floating-point rounding. What changes is how the computation is organized. Flash Attention splits the queries, keys, and values into blocks (`tiling`), processes one block at a time, and never writes the full `(sequence_length, sequence_length)` attention matrix to the GPU's main memory. The number of arithmetic operations is still quadratic in the sequence length, but the extra memory grows only linearly and far less data moves between slow and fast memory. This makes it particularly useful for training large language models like `GPT` on longer sequences.

It was introduced in the paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" by Tri Dao et al. Instead of storing all attention scores in memory, Flash Attention uses a clever trick: it computes the attention output in chunks, keeping intermediate results only in the `GPU's` super-fast on-chip memory (SRAM - Static Random Access Memory) and updating a running softmax as each chunk is processed.

Think of it this way: instead of writing a huge intermediate report (the attention matrix) to a slow hard drive (HBM), Flash Attention does its calculations block by block in the `GPU's` fast on-chip memory (SRAM) and writes only the final result back. The authors report speedups on the order of 2-4x for the attention computation, and memory savings of roughly 10-20x at sequence lengths of a few thousand tokens, because memory use grows linearly rather than quadratically with sequence length. This makes it possible to train larger models on longer sequences without running into memory issues or slowdowns.
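
The chunking trick relies on the fact that a softmax-weighted sum can be accumulated incrementally. The sketch below illustrates that idea in plain Python for a single query vector. It is a teaching aid under simplifying assumptions (Python lists, random toy data, no masking), not the actual GPU kernel. The chunked version keeps only a running maximum, a running normalizer, and a running weighted sum, and rescales them whenever a new chunk raises the maximum.

```python
import math
import random

def attention_full(q, keys, values):
    """Reference: materialize every score, then softmax, then weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]

def attention_chunked(q, keys, values, chunk_size=2):
    """Same result, visiting keys/values one chunk at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

random.seed(0)
q = [random.gauss(0, 1) for _ in range(4)]
keys = [[random.gauss(0, 1) for _ in range(4)] for _ in range(7)]
values = [[random.gauss(0, 1) for _ in range(4)] for _ in range(7)]

full = attention_full(q, keys, values)
chunked = attention_chunked(q, keys, values, chunk_size=3)
print(max(abs(a - b) for a, b in zip(full, chunked)))  # tiny rounding difference
```

The two functions agree up to floating-point rounding, which is why Flash Attention counts as exact attention: only the order of the computation changes.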




---

## **Implementing Attention the Modern Way**

We don't need to implement Flash Attention from scratch. Optimized implementations are already available, for example in the `xformers` and `flash-attn` packages, and there are versions written with the `triton` GPU programming language. Using one of these gives us the performance benefits without having to write GPU kernels ourselves, so we can focus on building and training our GPT model.

PyTorch 2.0 added native support through the function `torch.nn.functional.scaled_dot_product_attention`, so no additional libraries or custom code are needed. The function picks the fastest backend available: on supported CUDA GPUs and data types it can dispatch to a Flash Attention kernel, and otherwise it falls back to another implementation that returns the same result. The code below therefore runs everywhere, including on CPU and Apple Silicon, but the Flash Attention speedup itself applies only where that kernel is available.

The function signature is simple:

`F.scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False)`

This function computes the scaled dot-product attention, which is the core operation in the attention mechanism. The `query`, `key`, and `value` tensors are the inputs to the attention mechanism, and the function computes the attention output based on these inputs. The `attn_mask` can be used to mask out certain positions in the input sequence, while `dropout_p` specifies the dropout probability for regularization. 

**The Causal Mask**:

The `is_causal` flag tells the function to apply a causal mask: the token at position `t` may attend only to positions `0` through `t`, never to later ones. Autoregressive models like GPT need this, because at generation time the future tokens do not exist yet, and during training the model must not be able to peek at the token it is asked to predict. Use either `is_causal=True` or an explicit `attn_mask`, not both.

This is much simpler than implementing the attention mechanism from scratch, and it lets PyTorch choose an optimized implementation for us, which matters most for long sequences of text.
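
To see what the function does for us, the sketch below compares it with a hand-written version that builds the full score matrix and masks the upper triangle. The tensor sizes are arbitrary and the inputs are random. It assumes PyTorch 2.0 or later, where `F.scaled_dot_product_attention` is available.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, D = 2, 5, 8  # batch, sequence length, head size (made-up values)
q, k, v = (torch.randn(B, T, D) for _ in range(3))

# Manual version: full (T, T) score matrix, future positions set to -inf
scores = (q @ k.transpose(-2, -1)) / math.sqrt(D)
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal, float("-inf"))
weights = F.softmax(scores, dim=-1)  # (B, T, T), zero above the diagonal
manual = weights @ v                 # (B, T, D)

# Fused version: same math, no explicit mask or score matrix in our code
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(manual, fused, atol=1e-5))
```

The two results should agree up to floating-point tolerance. The difference is that the manual version always materializes the `(T, T)` matrix, while the fused call leaves that decision to the selected backend.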

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import random

# set device to GPU if available else mps (Apple Silicon) else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: mps


In [4]:
# The Attention Head
class Head(nn.Module):
    """A single head of self-attention"""
    
    def __init__(self, n_embd, head_size, dropout=0.1):
        super().__init__()
        # Each head has its own set of linear layers to compute queries, keys, and values
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        # x is of shape (batch_size, seq_length, n_embd)
        B, T, C = x.shape
        
        # compute query, key, value matrices
        q = self.query(x) # (B, T, head_size)
        k = self.key(x) # (B, T, head_size)
        v = self.value(x) # (B, T, head_size)
        
        # Use PyTorch's optimized scaled dot-product attention function
        # This function computes the attention scores and applies them to the value vectors efficiently
        attn_output = F.scaled_dot_product_attention(q, k, v, 
                                                     is_causal=True, # ensure that the model cannot attend to future tokens
                                                     dropout_p=self.dropout.p if self.training else 0.0) # apply dropout during training
        return attn_output # (B, T, head_size)
    

# Test the attention head
n_embd = 64 # embedding dimension
head_size = 16 # size of each attention head
head = Head(n_embd, head_size).to(device)

# Create a test input tensor of shape (batch_size, seq_length, n_embd)
batch_size = 2
seq_length = 10
x = torch.randn(batch_size, seq_length, n_embd).to(device)
output = head(x)
print(f"Output shape from attention head: {output.shape}") # should be (batch_size, seq_length, head_size)
print(f"Input shape: {x.shape}, Output shape: {output.shape}")
print(f"Head successfully computed self-attention!")

Output shape from attention head: torch.Size([2, 10, 16])
Input shape: torch.Size([2, 10, 64]), Output shape: torch.Size([2, 10, 16])
Head successfully computed self-attention!


## **Building the Full Transformer Block**

A `Transformer` block pairs a multi-head self-attention mechanism with a position-wise feed-forward network. The multi-head self-attention lets the model attend to different parts of the input sequence simultaneously, while the feed-forward network processes what each token has gathered. Stacking several of these blocks gives a powerful language model. Items 1-3 below describe the block itself; items 4-12 describe how blocks are assembled into a full model and how that model is trained, used, and evaluated:

1. **Multi-Head Self-Attention**: This component allows the model to attend to different parts of the input sequence simultaneously. It consists of multiple attention heads, each of which computes attention scores and outputs for a different representation of the input data. The outputs from all attention heads are then concatenated and passed through a linear layer to produce the final output of the self-attention mechanism. One head might focus on grammatical relationships, another on semantic meaning, and another on long-range dependencies. By concatenating the outputs of multiple heads, we get a richer representation. 

2. **Position-Wise Feed-Forward Network**: This component consists of two linear layers with a nonlinear activation function in between. The original Transformer used `ReLU`; the implementation below uses `GELU`, as GPT-2 does. It is applied to each position in the input sequence independently, so it transforms each token's representation without mixing information between tokens.

3. **Layer Normalization and Residual Connections**: Each sub-layer is wrapped in a residual connection: the input of the multi-head self-attention is added to its output, and the input of the feed-forward network is added to its output. These skip paths let gradients flow directly through the network, which helps prevent vanishing gradients and allows for deeper networks. Layer normalization keeps the activations well scaled and stabilizes training. The implementation below normalizes the input of each sub-layer (the "pre-norm" arrangement used by GPT-2), whereas the original Transformer normalized after the residual addition.

4. **Stacking Multiple Layers**: By stacking multiple layers of these components, we can create a powerful language model that can capture complex patterns and relationships in the data, allowing it to generate coherent and contextually relevant text based on the input it receives.

5. **Output Layer**: Finally, the output from the last Transformer block is passed through a linear layer to produce logits for each token in the vocabulary, which can then be converted to probabilities using the softmax function for text generation.

6. **Positional Encoding**: Since the Transformer architecture does not have any inherent notion of the order of tokens in the input sequence, we need to add positional encoding to the input embeddings to provide the model with information about the position of each token in the sequence. This allows the model to capture the sequential nature of language and generate coherent text based on the order of tokens.

7. **Training the Model**: To train the Transformer-based language model, we typically use a large corpus of text data and optimize the model's parameters using a loss function such as cross-entropy loss. The model learns to predict the next token in the sequence based on the previous tokens, allowing it to generate coherent and contextually relevant text over time.

8. **Autoregressive Generation**: Once the model is trained, we can use it to generate text by feeding in an initial token and repeatedly predicting the next token until we reach a desired length of generated text. This process is known as `autoregressive generation`, where the model generates one token at a time based on the previously generated tokens, allowing it to create coherent and contextually relevant text over time.

9.  **Sampling Strategies**: When generating text, we can use different sampling strategies to control the diversity and creativity of the generated text. Common strategies include:
    - `greedy decoding` (always choosing the token with the highest probability),
    - `top-k sampling` (sampling only from the k most probable tokens), and
    - `nucleus sampling`, also called top-p (sampling from the smallest set of tokens whose cumulative probability exceeds a certain threshold).

    These strategies let us balance coherent output against some randomness and creativity.

10. **Fine-Tuning and Transfer Learning**: After training the base Transformer model on a large corpus of text, we can fine-tune it on specific tasks or domains by continuing to train the model on a smaller, task-specific dataset. This allows the model to adapt its knowledge to the specific requirements of the task, such as sentiment analysis, question answering, or machine translation, while still leveraging the general language understanding it has learned from the larger corpus.

11. **Evaluation and Metrics**: To evaluate the performance of the Transformer-based language model, we can use various metrics such as perplexity, BLEU score, ROUGE score, or human evaluation. These metrics help us assess the quality of the generated text and compare it to reference texts or human-generated outputs. By evaluating the model's performance, we can identify areas for improvement and further refine the architecture or training process to achieve better results.

12. **Deployment and Applications**: Once we have a trained Transformer-based language model, we can deploy it in various applications such as chatbots, virtual assistants, content generation, and more. The model can be integrated into applications to provide natural language understanding and generation capabilities, allowing for more interactive and engaging user experiences. By leveraging the power of the Transformer architecture, we can create sophisticated language models that can understand and generate human-like text for a wide range of applications.
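
The sampling strategies from item 9 can be sketched in plain Python. The probabilities and helper names below are invented for illustration, and the filters work on an already-normalized probability list. The `generate` method later in this section instead applies top-k to logits with PyTorch.

```python
import random

def greedy(probs):
    """Index of the single most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(dist, rng):
    """Draw one token index from a {token: probability} mapping."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # made-up next-token distribution
rng = random.Random(0)

print("greedy :", greedy(probs))
print("top-k  :", sorted(top_k_filter(probs, k=2)))
print("nucleus:", sorted(nucleus_filter(probs, p=0.8)))
token = sample(nucleus_filter(probs, p=0.8), rng)
```

Greedy decoding always returns the same token. Top-k keeps a fixed number of candidates, while nucleus sampling keeps more candidates when the distribution is flat and fewer when it is peaked.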



The complete `Transformer` Block combines the first three of these. The code below uses the pre-norm ordering:
- `LayerNorm` -> `Multi-Head Attention` -> `Residual Connection` -> `LayerNorm` -> `FeedForward Network` -> `Residual Connection`.

In [5]:
# Multi-Head Attention: Run multiple heads in parallel
class MultiHeadAttention(nn.Module):
    """Multiple attention heads running in parallel."""
    
    def __init__(self, n_embd, num_heads, dropout=0.1):
        super().__init__()
        assert n_embd % num_heads == 0, "n_embd must be divisible by num_heads"
        
        self.num_heads = num_heads
        self.head_size = n_embd // num_heads
        self.n_embd = n_embd
        
        # Create multiple heads
        self.heads = nn.ModuleList([Head(n_embd, self.head_size, dropout) for _ in range(num_heads)])
        
        # Output projection
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        # Run each head in parallel and concatenate
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        # Apply output projection
        out = self.proj(out)
        out = self.dropout(out)
        return out


# FeedForward Network: A simple MLP
class FeedForward(nn.Module):
    """A simple 2-layer MLP with GELU activation."""
    
    def __init__(self, n_embd, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # Expand by 4x
            nn.GELU(),                       # GELU is smoother than ReLU
            nn.Linear(4 * n_embd, n_embd),  # Project back
            nn.Dropout(dropout)
        )
        
    def forward(self, x):
        return self.net(x)


# The Transformer Block: The complete building block
class Block(nn.Module):
    """Transformer block: communication (attention) followed by computation (feedforward)."""
    
    def __init__(self, n_embd, num_heads, dropout=0.1):
        super().__init__()
        # Multi-head self-attention
        self.sa = MultiHeadAttention(n_embd, num_heads, dropout)
        # Feedforward network
        self.ffwd = FeedForward(n_embd, dropout)
        # Layer normalization
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        
    def forward(self, x):
        # Attention with residual connection and layer norm
        x = x + self.sa(self.ln1(x))  # Residual connection
        
        # Feedforward with residual connection and layer norm
        x = x + self.ffwd(self.ln2(x))  # Residual connection
        
        return x

# Test the Block
n_embd = 64
num_heads = 4
block = Block(n_embd, num_heads).to(device)

test_input = torch.randn(2, 10, n_embd).to(device)
output = block(test_input)
print(f"Block input shape: {test_input.shape}")
print(f"Block output shape: {output.shape}")
print(f"Transformer Block successfully created!")

Block input shape: torch.Size([2, 10, 64])
Block output shape: torch.Size([2, 10, 64])
Transformer Block successfully created!
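
The `MultiHeadAttention` class above keeps one small `Head` module per head, which is easy to read. A common alternative, sketched below, computes the queries, keys, and values for all heads with a single linear layer and reshapes the result so that one attention call handles every head at once. The class name `FusedMultiHeadAttention` is made up for this example and it is not a drop-in replacement for trained weights. It has the same number of parameters as the version above and the same input and output shapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedMultiHeadAttention(nn.Module):
    """Hypothetical variant: one QKV projection instead of a ModuleList of heads."""

    def __init__(self, n_embd, num_heads, dropout=0.1):
        super().__init__()
        assert n_embd % num_heads == 0, "n_embd must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_size = n_embd // num_heads
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)  # q, k, v for all heads
        self.proj = nn.Linear(n_embd, n_embd)                 # output projection
        self.dropout_p = dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)  # each (B, T, C)
        # (B, T, C) -> (B, num_heads, T, head_size)
        q, k, v = (t.view(B, T, self.num_heads, self.head_size).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(
            q, k, v, is_causal=True,
            dropout_p=self.dropout_p if self.training else 0.0)
        # (B, num_heads, T, head_size) -> (B, T, C): concatenate the heads
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.dropout(self.proj(out))

torch.manual_seed(0)
mha = FusedMultiHeadAttention(n_embd=64, num_heads=4).eval()
x = torch.randn(2, 10, 64)
y = mha(x)
print(y.shape)
```

Because the attention is causal, changing the last token of `x` should leave the outputs at all earlier positions unchanged, which is a handy sanity check for any attention implementation.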


## **The Final GPT Model**

Now we can assemble the complete `GPT` (Generative Pre-trained Transformer) model. The architecture consists of:

1. `Token Embedding Layer`: Converts token indices (integers) into dense vectors, just like the `Bigram model`. Each token in the vocabulary gets mapped to a learnable embedding vector.

2. `Positional Embedding Layer`: Since attention treats all tokens equally, we need to give the model a sense of token order. Positional embeddings encode the position of each token in the sequence, allowing the model to understand "first word", "second word", etc.

3. `Stack of Transformer Blocks`: Multiple blocks (typically 6-12 for small models, 96+ for large models) stacked on top of each other. Each block refines the understanding of the sequence.

4. `Final LayerNorm`: Normalizes the final representations before the output layer.

5. `Output Linear Layer`: Maps the final hidden representations back to vocabulary logits (scores for each token in the vocabulary).


The `generate` method is almost identical to the Bigram model's: we still sample tokens one at a time, but now each prediction benefits from the context of all previous tokens (up to `block_size` of them)!

In [6]:
class GPTLanguageModel(nn.Module):
    """GPT model: a stack of transformer blocks."""
    
    def __init__(self, vocab_size, n_embd, block_size, num_heads, num_layers, dropout=0.1):
        super().__init__()
        self.block_size = block_size
        
        # Token embeddings, position embeddings, transformer blocks, final norm, output head
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, num_heads, dropout) for _ in range(num_layers)])
        self.ln_f = nn.LayerNorm(n_embd)  # Final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)  # Language model head
        
        # Better initialization
        self.apply(self._init_weights)
        
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, idx, targets=None):
        B, T = idx.shape
        
        # idx and targets are both (B, T) tensors of integers
        tok_emb = self.token_embedding_table(idx)  # (B, T, n_embd)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, n_embd)
        x = tok_emb + pos_emb  # (B, T, n_embd)
        x = self.blocks(x)  # (B, T, n_embd)
        x = self.ln_f(x)  # (B, T, n_embd)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss
    
    @torch.no_grad()  # no gradients are needed while sampling; saves memory and time
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Generate new tokens given a context.
        
        Args:
            idx: (B, T) array of indices in the current context
            max_new_tokens: Maximum number of tokens to generate
            temperature: Controls randomness (higher = more random)
            top_k: Only sample from top k most likely tokens
        """
        self.eval()
        
        for _ in range(max_new_tokens):
            # Crop idx to the last block_size tokens
            idx_cond = idx[:, -self.block_size:] if idx.shape[1] >= self.block_size else idx
            
            # Get the predictions
            logits, _ = self(idx_cond)
            # Focus only on the last time step
            logits = logits[:, -1, :] / temperature  # (B, C)
            
            # Optionally apply top-k filtering
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # Append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        
        self.train()
        return idx

print("GPTLanguageModel class defined successfully!")

GPTLanguageModel class defined successfully!
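
The `temperature` argument of `generate` divides the logits before the softmax. The small sketch below, with made-up logits, shows the effect: a temperature below 1 sharpens the distribution toward the most likely token, and a temperature above 1 flattens it.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
p_cold = softmax_with_temperature(logits, 0.5)
p_base = softmax_with_temperature(logits, 1.0)
p_hot = softmax_with_temperature(logits, 2.0)

for name, p in (("T=0.5", p_cold), ("T=1.0", p_base), ("T=2.0", p_hot)):
    print(name, [round(v, 3) for v in p])
```

The ranking of the tokens never changes; only how strongly the top choice dominates. Combined with `top_k`, this gives two independent controls over how adventurous the sampling is.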


In [7]:
# Hyperparameters
vocab_size = 65  # Character-level vocabulary (for simplicity)
block_size = 256  # Maximum context length
n_embd = 384  # Embedding dimension
num_heads = 6  # Number of attention heads
num_layers = 6  # Number of transformer blocks
dropout = 0.1
batch_size = 64
learning_rate = 3e-4
max_iters = 5000
eval_interval = 500
eval_iters = 200

# Create a simple text dataset for demonstration
# In practice, you'd load a real dataset like Shakespeare, Wikipedia, etc.
text = """
The quick brown fox jumps over the lazy dog. 
The dog barks at the fox. The fox runs away quickly.
Machine learning is fascinating. Deep learning models can understand language.
Transformers are powerful architectures. Attention mechanisms enable long-range dependencies.
Natural language processing has advanced rapidly. Large language models can generate coherent text.
Artificial intelligence continues to evolve. Neural networks learn complex patterns.
"""

# Create character-level vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size}")
print(f"Characters: {''.join(chars)}")

Vocabulary size: 36
Characters: 
 -.ADLMNTabcdefghijklmnopqrstuvwxyz


In [12]:
import urllib.request as request

# Download the tiny shakespeare dataset
# This is a small dataset of Shakespeare's works, which is often used for character-level language modeling.
url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'

print("Downloading dataset...")
response = request.urlopen(url)
text = response.read().decode('utf-8')
print("Dataset downloaded.\n")


# Create character-level vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size}")
print(f"Characters: {''.join(chars)}")

Downloading dataset...
Dataset downloaded.

Vocabulary size: 65
Characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [13]:
# Create character-to-index mappings
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# Encode the text
data = torch.tensor(encode(text), dtype=torch.long)

# Split into train and validation sets
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

print(f"Train data length: {len(train_data)}")
print(f"Val data length: {len(val_data)}")


Train data length: 1003854
Val data length: 111540
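
Training on this data means pairing each window of characters with the same window shifted one step to the right, so that every position has a next-character target. The sketch below shows the idea on a short string in plain Python. `make_example` is a hypothetical helper for illustration; the notebook itself does this with tensors.

```python
def make_example(tokens, start, block_size):
    """Input window and its one-step-shifted target window."""
    x = tokens[start:start + block_size]
    y = tokens[start + 1:start + block_size + 1]
    return x, y

tokens = list("hello world")
x, y = make_example(tokens, start=0, block_size=4)

# Every prefix of x is a training context for the matching entry of y
pairs = [("".join(x[:t + 1]), y[t]) for t in range(len(x))]
for context, target in pairs:
    print(f"{context!r} -> {target!r}")
```

A single window of length `block_size` therefore yields `block_size` training examples, one for each context length from 1 up to `block_size`.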


In [14]:
def get_batch(split):
    """Generate a small batch of data."""
    data_split = train_data if split == 'train' else val_data
    ix = torch.randint(len(data_split) - block_size, (batch_size,))
    x = torch.stack([data_split[i:i+block_size] for i in ix])
    y = torch.stack([data_split[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

# Instantiate the model
model = GPTLanguageModel(
    vocab_size=vocab_size,
    n_embd=n_embd,
    block_size=block_size,
    num_heads=num_heads,
    num_layers=num_layers,
    dropout=dropout
).to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model created!")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

Model created!
Total parameters: 10,788,929
Trainable parameters: 10,788,929
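
The parameter count reported above can be reproduced by hand, which is a useful check that the architecture is what we think it is. The function below is a sketch that mirrors the layers defined in this section (bias-free query/key/value projections, a 4x feed-forward expansion, and two LayerNorms per block). It would need adjusting for any other layout.

```python
def gpt_param_count(vocab_size, n_embd, block_size, num_layers):
    """Parameter count for the GPTLanguageModel layout used in this section."""
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    qkv = 3 * n_embd * n_embd             # all heads together, no bias
    attn_proj = n_embd * n_embd + n_embd  # output projection with bias
    ffn = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)        # weight and bias for ln1 and ln2
    per_block = qkv + attn_proj + ffn + layer_norms
    final_ln = 2 * n_embd
    lm_head = n_embd * vocab_size + vocab_size
    return tok_emb + pos_emb + num_layers * per_block + final_ln + lm_head

total = gpt_param_count(vocab_size=65, n_embd=384, block_size=256, num_layers=6)
print(f"{total:,}")  # 10,788,929
```

The number of heads does not appear in the formula: splitting `n_embd` across more heads changes how the projection matrices are sliced, not how large they are. Almost all of the parameters sit in the transformer blocks rather than in the embeddings.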


In [15]:
# Create optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Helper to estimate the loss (the training loop itself follows below)
@torch.no_grad()
def estimate_loss():
    """Estimate loss on train and val sets."""
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

print("\nStarting training...")
for iter_num in range(max_iters):
    # Every once in a while evaluate the loss on train and val sets
    if iter_num % eval_interval == 0 or iter_num == max_iters - 1:
        losses = estimate_loss()
        print(f"Step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    
    # Sample a batch of data
    xb, yb = get_batch('train')
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print("\nTraining completed!")

# Generate text
print("\n" + "="*50)
print("Generating text with GPT:")
print("="*50)


Starting training...
Step 0: train loss 4.2518, val loss 4.2430
Step 500: train loss 2.0092, val loss 2.0920
Step 1000: train loss 1.5070, val loss 1.6955
Step 1500: train loss 1.3326, val loss 1.5652
Step 2000: train loss 1.2286, val loss 1.5069
Step 2500: train loss 1.1483, val loss 1.5020
Step 3000: train loss 1.0672, val loss 1.5099
Step 3500: train loss 0.9789, val loss 1.5403
Step 4000: train loss 0.8841, val loss 1.5961
Step 4500: train loss 0.7804, val loss 1.6683
Step 4999: train loss 0.6728, val loss 1.7579

Training completed!

Generating text with GPT:
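
The training log is worth a closer look before generating text. The sketch below copies the logged losses into a list and computes three things: the loss of uniform guessing over 65 characters (a sanity check for step 0), the step with the lowest validation loss, and the gap between training and validation loss at the end. The interpretation in the comments is a reading of this particular run, not a general rule.

```python
import math

# (step, train_loss, val_loss) copied from the training log above
log = [
    (0, 4.2518, 4.2430), (500, 2.0092, 2.0920), (1000, 1.5070, 1.6955),
    (1500, 1.3326, 1.5652), (2000, 1.2286, 1.5069), (2500, 1.1483, 1.5020),
    (3000, 1.0672, 1.5099), (3500, 0.9789, 1.5403), (4000, 0.8841, 1.5961),
    (4500, 0.7804, 1.6683), (4999, 0.6728, 1.7579),
]

# Cross-entropy of guessing uniformly among 65 characters
uniform_loss = math.log(65)

# Validation loss stops improving long before training loss does
best_step, _, best_val = min(log, key=lambda row: row[2])
final_train, final_val = log[-1][1], log[-1][2]

print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"best val loss {best_val:.4f} at step {best_step} "
      f"(perplexity {math.exp(best_val):.2f})")
print(f"final train/val gap: {final_val - final_train:.4f}")
```

The step-0 loss is close to the uniform baseline, as expected for a freshly initialized model. Validation loss bottoms out around step 2500 and then rises while training loss keeps falling, which suggests the model starts to overfit the roughly one million training characters. Possible remedies include keeping the checkpoint with the lowest validation loss, stopping earlier, or raising `dropout`.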


In [16]:
# Start with a context
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated = model.generate(context, max_new_tokens=200, temperature=0.8, top_k=50)
generated_text = decode(generated[0].tolist())
print(generated_text)

print("\n" + "="*50)
print("Notice how the GPT model produces much more coherent text")
print("compared to a Bigram model, thanks to self-attention!")
print("="*50)


pale me believe me.

LEONTES:
How! then, Camillo!
Though me our bastard; my thrice souls have been with
A shepherd's death, my lord.

PAULINA:
I am assist for this appointed tears,
Poor surprised with

Notice how the GPT model produces much more coherent text
compared to a Bigram model, thanks to self-attention!
