# Chapter 4: Training a GPT Model

<div class="alert alert-block alert-success">
In the previous chapters, we successfully designed and built every architectural component of a GPT model from scratch. We now have a complete, functional `GPTModel` class ready to be used.

However, the model currently has randomly initialized weights and knows nothing about language. In this chapter, we will take the crucial next step: **training the model**. We will feed it a text corpus, calculate its performance using a loss function, and iteratively update its weights to teach it to generate coherent text.
</div>

## 4.1 Import and Setup

<div class="alert alert-block alert-success">
We'll begin by importing the necessary libraries and the `GPTModel` we finalized in the last chapter.
</div>

In [1]:
# Standard library and third-party imports
import sys
import os
import torch
import torch.nn as nn
import tiktoken

# --- Add Project Root to Python Path ---

# Get the directory of the current notebook
current_notebook_dir = os.getcwd()

# Go up one level to the project's root directory
project_root = os.path.abspath(os.path.join(current_notebook_dir, '..'))

# Add the project root to the Python path if it's not already there
if project_root not in sys.path:
    sys.path.append(project_root)

# --- Imports from your `src` package ---
from src.model import GPTModel

# We will use the same configuration dictionary
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "dropout_rate": 0.1,
    "qkv_bias": False
}



## 4.2 Generating Text with the Untrained Model

<div class="alert alert-block alert-success">
    
Before we train the model, let's see what it produces with its random, untrained weights. To do this, we need a function that can perform an **autoregressive** generation loop. This process involves feeding the model an initial context, predicting the next token, adding that token back to the context, and repeating the process.
</div>

In [2]:
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context

    for _ in range(max_new_tokens):
        # Crop current context if exceeds the supported context size
        idx_cond = idx[:, -context_size:]

        # Get the predictions from the model
        with torch.no_grad():
            logits = model(idx_cond)  # (batch, n_tokens, vocab_size)

        # Focus only on the prediction for the very last token in the sequence
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)

        # Get the token ID with the highest probability (greedy decoding)
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)

        # Append the new token ID to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)

    return idx

<div class="alert alert-block alert-success">
    
The `generate_text_simple` function implements **greedy decoding**. In this autoregressive process, the model repeatedly predicts the single most likely next token, appends it to the sequence, and feeds the new sequence back into the model.

</div>
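
<div class="alert alert-block alert-success">

One detail of the loop that is easy to overlook is the cropping step `idx[:, -context_size:]`. The minimal sketch below isolates it with made-up token IDs (the tensors `long_idx` and `short_idx` are illustrative and not part of the chapter's code). It shows that only the most recent `context_size` tokens are kept, and that a sequence shorter than the context size passes through unchanged.
</div>

```python
import torch

context_size = 4  # hypothetical, tiny context window for illustration

# A sequence longer than the context window: (batch=1, n_tokens=10)
long_idx = torch.arange(10).unsqueeze(0)
long_cond = long_idx[:, -context_size:]   # keeps only the last 4 token IDs

# A sequence shorter than the context window: (batch=1, n_tokens=3)
short_idx = torch.arange(3).unsqueeze(0)
short_cond = short_idx[:, -context_size:]  # slicing clamps, so nothing is dropped

print("Cropped long sequence: ", long_cond)
print("Cropped short sequence:", short_cond)
```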

<div class="alert alert-block alert-info">
    
  <b>A Note on `softmax` and `argmax`</b><br>
  
  In our function, we include a `softmax` step to convert the model's output scores (logits) into probabilities before finding the most likely token with `argmax`.

  However, since `softmax` doesn't change the order of the scores, applying `argmax` directly to the `logits` would produce the exact same result. We include the step here to clearly illustrate the full process of generating probabilities, but it is technically redundant for greedy decoding.
</div>
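
<div class="alert alert-block alert-info">

The redundancy described in this note can be checked with a small sketch. The logits here are random stand-ins for model output (a toy vocabulary of 8 tokens, chosen only for illustration), not values produced by `GPTModel`.
</div>

```python
import torch

torch.manual_seed(0)

# Hypothetical logits: batch of 2 sequences, toy vocabulary of 8 tokens
logits = torch.randn(2, 8)

# Softmax turns the logits into probabilities that sum to 1 per row
probas = torch.softmax(logits, dim=-1)

# Greedy choice computed from probabilities and directly from logits
next_from_probas = torch.argmax(probas, dim=-1, keepdim=True)
next_from_logits = torch.argmax(logits, dim=-1, keepdim=True)

# Softmax is monotonic, so both choices should agree
print("Same token chosen:", torch.equal(next_from_probas, next_from_logits))
```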

<div class="alert alert-block alert-success">
Later in this chapter, when we implement the GPT training code, we will also introduce sampling techniques that modify the softmax outputs so that the model doesn't always select the most likely token. This introduces variability and creativity into the generated text.
</div>
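
<div class="alert alert-block alert-success">

As a rough preview of what "not always selecting the most likely token" can look like, the sketch below contrasts `argmax` with drawing from the softmax distribution using `torch.multinomial`. The logits and the `temperature` value are made up for illustration, and this is only one possible approach; the techniques introduced later in the chapter may differ in detail.
</div>

```python
import torch

torch.manual_seed(0)

# Hypothetical logits for one sequence over a toy vocabulary of 4 tokens
logits = torch.tensor([[4.0, 3.5, 1.0, 0.5]])

# Greedy decoding: always picks the highest-scoring token (index 0 here)
greedy = torch.argmax(logits, dim=-1, keepdim=True)

# Sampling: scale logits by an assumed temperature, then draw from the distribution
temperature = 1.5  # illustrative value; higher values flatten the distribution
probas = torch.softmax(logits / temperature, dim=-1)
samples = torch.multinomial(probas, num_samples=20, replacement=True)

print("Greedy choice:", greedy.item())
print("Sampled choices:", samples.tolist())
```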

<div class="alert alert-block alert-success">
Now, let's test our `generate_text_simple` function. We'll provide it with the starting context "Hello, I am" by first encoding the string into a batch of token IDs.
</div>

In [3]:
# Prepare the input
start_context = "Hello, I am"
tokenizer = tiktoken.get_encoding("gpt2")
encoded = tokenizer.encode(start_context)
encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded:", encoded)
print("Encoded tensor shape:", encoded_tensor.shape)

encoded: [15496, 11, 314, 716]
Encoded tensor shape: torch.Size([1, 4])


<div class="alert alert-block alert-success">
Next, we'll set the model to evaluation mode with `model.eval()`. This disables random components like dropout that are only used during training. We can then call our function to generate new tokens from the starting context.
</div>

In [4]:
# --- Instantiate the Model ---
torch.manual_seed(100)
model = GPTModel(GPT_CONFIG_124M)
model.eval() # Set to evaluation mode

print("Model instantiated successfully.")

Model instantiated successfully.
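
<div class="alert alert-block alert-success">

To make the effect of evaluation mode concrete, here is a minimal sketch using a standalone `nn.Dropout` layer rather than the full `GPTModel` (the layer and the input tensor are illustrative). In training mode the layer randomly zeroes elements and rescales the survivors; in evaluation mode it passes the input through unchanged.
</div>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)  # illustrative dropout probability
x = torch.ones(1, 8)

# Training mode: each element is zeroed with probability p,
# and surviving elements are scaled by 1 / (1 - p)
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout acts as the identity function
dropout.eval()
eval_out = dropout(x)

print("Training mode:  ", train_out)
print("Evaluation mode:", eval_out)
```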


<div class="alert alert-block alert-success">
Let's generate text using our `generate_text_simple` function.
</div>

In [5]:
# Generate text
output_ids = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output:", output_ids)
print("Output length:", len(output_ids[0]))

Output: tensor([[15496,    11,   314,   716,  1908, 41574, 14356, 11426, 42884, 32296]])
Output length: 10


<div class="alert alert-block alert-success">
    
Finally, we use the tokenizer's `.decode()` method to convert the output token IDs back into a readable string.
</div>

In [6]:
# Decode the output
decoded_text = tokenizer.decode(output_ids.squeeze(0).tolist())
print(f"\nDecoded text: '{decoded_text}'")


Decoded text: 'Hello, I am sent agitated AW telephone Tomas shroud'


<div class="alert alert-block alert-warning">
    
  <b>Why is the output gibberish?</b><br>
  
  As we can see, the generated text is incoherent. This is the correct and expected result at this stage.

  The reason is that our model is completely **untrained**. Its weights are still the random values they were initialized with. It has not yet learned any patterns of the English language. This demonstration perfectly illustrates *why* we need to train the model. In the next sections, we will prepare a dataset and implement a training loop to do just that.
</div>

## 4.3 Evaluating Generative Text Models

<div class="alert alert-block alert-success">

In the previous section, we saw that our untrained model produces incoherent gibberish. While we can see this qualitatively, we need a quantitative way to measure the model's performance. How do we capture "good text" versus "bad text" in a number that we can track and optimize?

The answer is to use a **loss function**. For language models that predict next-token probabilities, the standard metric is **cross-entropy loss**. It measures how "surprised" the model is by the true next token; a lower loss means the model's predictions are closer to reality. A related, more interpretable metric is **perplexity**.
</div>
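
<div class="alert alert-block alert-success">

The relationship between these two metrics can be sketched on toy data. The logits and target IDs below are invented for illustration (a vocabulary of 5 tokens and 3 positions), not output from our model. The sketch computes cross-entropy with `torch.nn.functional.cross_entropy`, repeats the calculation by hand as the average negative log-probability of the true tokens, and obtains perplexity as the exponential of the loss.
</div>

```python
import torch
import torch.nn.functional as F

# Hypothetical logits: (batch=1, n_tokens=3, vocab_size=5)
logits = torch.tensor([[[2.0, 0.5, 0.1, -1.0, 0.0],
                        [0.1, 3.0, 0.2, 0.0, -0.5],
                        [0.0, 0.0, 0.0, 0.0, 0.0]]])
# Hypothetical true next-token IDs for each position: (batch=1, n_tokens=3)
targets = torch.tensor([[0, 1, 4]])

# cross_entropy expects (N, vocab_size) logits and (N,) targets,
# so the batch and token dimensions are flattened together
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

# The same quantity by hand: average negative log-probability of the true tokens
probas = torch.softmax(logits, dim=-1)
target_probas = probas.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
manual_loss = -torch.log(target_probas).mean()

# Perplexity is the exponential of the cross-entropy loss
perplexity = torch.exp(loss)

# Sanity check: a uniform prediction over 5 tokens has perplexity 5
# (up to floating-point error)
uniform_loss = F.cross_entropy(torch.zeros(1, 5), torch.tensor([2]))
uniform_perplexity = torch.exp(uniform_loss)

print("Cross-entropy loss:", loss.item())
print("Manual loss:       ", manual_loss.item())
print("Perplexity:        ", perplexity.item())
print("Uniform perplexity:", uniform_perplexity.item())
```

Perplexity can be read as the effective number of tokens the model is undecided between at each step, which is why the uniform case recovers the vocabulary size.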

<div class="alert alert-block alert-success">

Before we can calculate these metrics, we need to prepare our workspace. This involves two main setup steps:

1.  **Initialize the Model:** We will instantiate a `GPTModel` using our configuration with a `context_length` of 256. Using a smaller context size (compared to the original GPT-2's 1024) reduces the computational requirements, making the examples in this chapter accessible on a standard laptop.

2.  **Define Helper Functions:** We will define two convenience functions, `text_to_token_ids` and `token_ids_to_text`. These utilities will make it easier to convert back and forth between text and the model's numerical token IDs throughout our analysis.
</div>
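
<div class="alert alert-block alert-success">

To get a feel for why a shorter context is cheaper, the back-of-the-envelope sketch below compares two quantities that depend on `context_length`. It assumes a GPT-2-style design with a learned positional embedding table of shape `(context_length, emb_dim)` and a full attention score matrix of `context_length × context_length` entries per head; the helper `context_cost` is hypothetical and only does arithmetic.
</div>

```python
def context_cost(context_length, emb_dim=768):
    """Rough size estimates that scale with the context length.

    Assumes a learned positional embedding table and a dense
    (context_length x context_length) attention score matrix per head.
    """
    return {
        "pos_emb_params": context_length * emb_dim,
        "attn_scores_per_head": context_length ** 2,
    }

full = context_cost(1024)   # original GPT-2 context length
short = context_cost(256)   # shortened context length used in this chapter

for name in full:
    ratio = full[name] / short[name]
    print(f"{name}: {full[name]:,} vs {short[name]:,} ({ratio:.0f}x smaller)")
```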

In [7]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "dropout_rate": 0.1,   # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval();  # Disable dropout during inference

In [23]:
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0) # remove batch dimension
    return tokenizer.decode(flat.tolist())

<div class="alert alert-block alert-success">

With our model initialized and helper functions defined, we can now perform our first end-to-end text generation. We will provide the model with the starting context "Every effort moves you" and use the `generate_text_simple` function to have it autoregressively generate the next 10 tokens.
</div>    

In [25]:
start_context = "Every effort moves yosu"

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves yosu disclosing func intrusiveH bear confidently 480osit caric Psychic


<div class="alert alert-block alert-success">

As we can see above, the model does not produce good text because it has not been trained yet.

**How do we measure or capture what "good text" is, in numeric form, so that we can track it during training?**

The next subsection introduces a loss metric for the generated outputs that we can use to measure training progress.

The later chapters on finetuning LLMs will also introduce additional ways to measure model quality.
</div>