# Chapter 7: Loading Pre-trained Weights

<div class="alert alert-block alert-success">

In the previous chapters, we successfully built and trained a small GPT model from scratch. While this was a great learning exercise, real-world performance comes from models trained on massive, diverse datasets, which requires enormous computational resources.

Fortunately, OpenAI released the weights for their trained GPT-2 models. In this chapter, we will load these professional, pre-trained weights into our own `GPTModel` architecture. This is the ultimate test of our implementation and will allow us to generate high-quality, coherent text.
</div>

## 7.1 Import and Setup

<div class="alert alert-block alert-success">

To load the original GPT-2 weights, which were saved in a TensorFlow checkpoint file, we need to install the `tensorflow` library. We will also use `tqdm` for a nice progress bar during the download.
</div>

In [1]:
# Install required packages (uncomment the next line on first run)
# !pip install tensorflow tqdm

print("Installation complete")

# Standard imports and setup
import os
import sys
import urllib.request
import json

import numpy as np
import torch
import tiktoken
import tensorflow as tf

# --- Add Project Root to Python Path ---

# Get the directory of the current notebook
current_notebook_dir = os.getcwd()

# Go up one level to the project's root directory
project_root = os.path.abspath(os.path.join(current_notebook_dir, '..'))

# Add the project root to the Python path if it's not already there
if project_root not in sys.path:
    sys.path.append(project_root)

# Import from src package
from src.config import GPT_CONFIG_124M
from src.model import GPTModel
from src.text_generation import generate
from src.utils import text_to_token_ids, token_ids_to_text, download_file, load_gpt2_params_from_tf_ckpt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = tiktoken.get_encoding("gpt2")

Installation complete


## 7.2 Downloading the Pre-trained Weights

<div class="alert alert-block alert-success">
    
Previously, for educational purposes, we trained a small GPT-2 model using a limited dataset comprising a short-story book.

This approach allowed us to focus on the fundamentals without the need for extensive time and computational resources.
    
Fortunately, OpenAI openly shared the weights of their GPT-2 models, thus eliminating the need to invest tens to hundreds of thousands of dollars in retraining the model on a large corpus ourselves.
    
In this chapter, we will load these weights into our `GPTModel` class and use the model for text generation.

Here, weights refer to the weight parameters that are stored, for example, in the `.weight` attributes of PyTorch's `Linear` and `Embedding` layers.
</div>

<div class="alert alert-block alert-success">

We'll define a utility function (`download_and_load_gpt2`) to download the necessary files for the GPT-2 124M model from the OpenAI repository if they don't already exist in our `models/gpt2/124M` directory.

This function will load the GPT-2 architecture **settings** (`settings`) and **weight parameters** (`params`).
</div>
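
<div class="alert alert-block alert-info">

The `download_file` helper imported from `src.utils` is not shown in this chapter. As a rough sketch of the "download only if missing" behavior described above, a helper along the following lines would be enough. The name `download_if_missing` and its return value are assumptions made for illustration, not the project's actual implementation.
</div>

```python
import os
import urllib.request


def download_if_missing(url, destination, backup_url=None):
    """Download `url` to `destination` unless a non-empty file already exists.

    Hypothetical helper for illustration; returns True if a download happened.
    """
    if os.path.exists(destination) and os.path.getsize(destination) > 0:
        return False  # Reuse the local copy, no network access needed

    last_error = None
    for candidate in (url, backup_url):
        if candidate is None:
            continue
        try:
            urllib.request.urlretrieve(candidate, destination)
            return True
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            last_error = err
    raise RuntimeError(f"Could not download {url}") from last_error
```

Trying the primary URL first and falling back to the backup only on failure keeps repeated notebook runs fast, since existing files are never fetched again.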

In [2]:
def download_and_load_gpt2(model_size, models_dir):
    # Validate model size
    allowed_sizes = ("124M", "355M", "774M", "1558M")
    if model_size not in allowed_sizes:
        raise ValueError(f"Model size not in {allowed_sizes}")

    # Define paths
    model_dir = os.path.join(models_dir, model_size)
    base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models"
    backup_base_url = "https://f001.backblazeb2.com/file/LLMs-from-scratch/gpt2"
    filenames = [
        "checkpoint", "encoder.json", "hparams.json",
        "model.ckpt.data-00000-of-00001", "model.ckpt.index",
        "model.ckpt.meta", "vocab.bpe"
    ]

    # Download files
    os.makedirs(model_dir, exist_ok=True)
    for filename in filenames:
        # Build URLs with "/" explicitly; os.path.join would insert backslashes on Windows
        file_url = f"{base_url}/{model_size}/{filename}"
        backup_url = f"{backup_base_url}/{model_size}/{filename}"
        file_path = os.path.join(model_dir, filename)
        download_file(file_url, file_path, backup_url)

    # Load settings and params
    tf_ckpt_path = tf.train.latest_checkpoint(model_dir)
    with open(os.path.join(model_dir, "hparams.json"), "r", encoding="utf-8") as f:
        settings = json.load(f)
    params = load_gpt2_params_from_tf_ckpt(tf_ckpt_path, settings)

    return settings, params

In [3]:
settings, params = download_and_load_gpt2(model_size="124M", models_dir="../models/gpt2")

checkpoint: 100%|██████████████████████████| 77.0/77.0 [00:00<00:00, 76.9kiB/s]
encoder.json: 100%|██████████████████████| 1.04M/1.04M [00:00<00:00, 1.21MiB/s]
hparams.json: 100%|████████████████████████| 90.0/90.0 [00:00<00:00, 44.8kiB/s]
model.ckpt.data-00000-of-00001: 100%|██████| 498M/498M [01:11<00:00, 6.96MiB/s]
model.ckpt.index: 100%|██████████████████| 5.21k/5.21k [00:00<00:00, 5.20MiB/s]
model.ckpt.meta: 100%|██████████████████████| 471k/471k [00:00<00:00, 842kiB/s]
vocab.bpe: 100%|████████████████████████████| 456k/456k [00:00<00:00, 818kiB/s]


<div class="alert alert-block alert-success">
    
Once the download has finished, let's inspect the contents of `settings` and `params`:
</div>

In [4]:
print("Settings:", settings)
print("Parameter dictionary keys:", params.keys())

Settings: {'n_vocab': 50257, 'n_ctx': 1024, 'n_embd': 768, 'n_head': 12, 'n_layer': 12}
Parameter dictionary keys: dict_keys(['blocks', 'b', 'g', 'wpe', 'wte'])


<div class="alert alert-block alert-success">
    
Both `settings` and `params` are Python dictionaries. The `settings` dictionary stores the LLM architecture settings similarly to our manually defined `GPT_CONFIG_124M` settings.

The `params` dictionary contains the actual weight tensors, stored as NumPy arrays.

Note that we only printed the dictionary keys because printing the weight contents would take up too much screen space.
</div>
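
<div class="alert alert-block alert-info">

To make the correspondence between the two dictionaries concrete, the sketch below translates OpenAI's key names into the names used in our configuration. The keys `context_length`, `emb_dim`, `n_heads`, and `n_layers` appear in this chapter; `vocab_size` and the helper `settings_to_config` are assumed names used only for illustration.
</div>

```python
# Settings as printed above for the 124M checkpoint
settings = {"n_vocab": 50257, "n_ctx": 1024, "n_embd": 768, "n_head": 12, "n_layer": 12}

# OpenAI hparams key -> key in our config ("vocab_size" is an assumed name)
KEY_MAP = {
    "n_vocab": "vocab_size",
    "n_ctx": "context_length",
    "n_embd": "emb_dim",
    "n_head": "n_heads",
    "n_layer": "n_layers",
}


def settings_to_config(settings):
    """Rename the checkpoint's hyperparameter keys to our config keys."""
    return {KEY_MAP[key]: value for key, value in settings.items()}


config = settings_to_config(settings)
print(config)
```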

<div class="alert alert-block alert-success">
    
We can inspect these weight tensors by printing the whole dictionary via print(params) or by selecting individual tensors via the respective dictionary keys, for example, the embedding layer weights:
</div>

In [5]:
print(params["wte"])
print("Token embedding weight tensor dimensions:", params["wte"].shape)

[[-0.11010301 -0.03926672  0.03310751 ... -0.1363697   0.01506208
   0.04531523]
 [ 0.04034033 -0.04861503  0.04624869 ...  0.08605453  0.00253983
   0.04318958]
 [-0.12746179  0.04793796  0.18410145 ...  0.08991534 -0.12972379
  -0.08785918]
 ...
 [-0.04453601 -0.05483596  0.01225674 ...  0.10435229  0.09783269
  -0.06952604]
 [ 0.1860082   0.01665728  0.04611587 ... -0.09625227  0.07847701
  -0.02245961]
 [ 0.05135201 -0.02768905  0.0499369  ...  0.00704835  0.15519823
   0.12067825]]
Token embedding weight tensor dimensions: (50257, 768)
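
<div class="alert alert-block alert-info">

Printing whole arrays is rarely useful, so a shape overview is often a better first look. The sketch below walks a nested dictionary and collects the shape of every array it finds. The helper `summarize_shapes` is hypothetical, and `toy_params` is a tiny stand-in that only imitates the nesting of `params`; it does not contain real weights.
</div>

```python
import numpy as np


def summarize_shapes(node, prefix=""):
    """Return a {path: shape} dict for every array in a nested dict/list."""
    shapes = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}/{key}" if prefix else str(key)
            shapes.update(summarize_shapes(value, child))
    elif isinstance(node, (list, tuple)):
        for index, value in enumerate(node):
            shapes.update(summarize_shapes(value, f"{prefix}[{index}]"))
    else:
        shapes[prefix] = np.shape(node)
    return shapes


# Toy stand-in with the same nesting as `params` (tiny arrays, not real weights)
toy_params = {
    "wte": np.zeros((10, 4)),
    "wpe": np.zeros((8, 4)),
    "blocks": [{"attn": {"c_attn": {"w": np.zeros((4, 12)), "b": np.zeros(12)}}}],
}

for path, shape in summarize_shapes(toy_params).items():
    print(path, shape)
```

Calling `summarize_shapes(params)` on the real dictionary should list one entry per weight array, which makes shape mismatches easier to track down.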


<div class="alert alert-block alert-info">
    
We downloaded and loaded the weights of the smallest GPT-2 model via the `download_and_load_gpt2(model_size="124M", ...)` call. However, note that OpenAI also shares the weights of larger models: "355M", "774M", and "1558M".

</div>
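
<div class="alert alert-block alert-info">

The size labels are approximate parameter counts. As a back-of-the-envelope check, the sketch below counts parameters from the architecture settings alone. It assumes a feed-forward hidden size of four times the embedding dimension and counts the output head as shared with the token embedding; the function name `gpt2_param_count` is hypothetical.
</div>

```python
def gpt2_param_count(emb_dim, n_layers, vocab_size=50257, context_length=1024):
    """Count GPT-2 parameters, with the output head sharing the token embedding."""
    d = emb_dim
    embeddings = vocab_size * d + context_length * d        # wte + wpe
    attention = (d * 3 * d + 3 * d) + (d * d + d)           # c_attn + c_proj
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)    # c_fc + c_proj
    layer_norms = 2 * 2 * d                                 # ln_1 and ln_2 (g, b)
    per_block = attention + feed_forward + layer_norms
    final_norm = 2 * d                                      # final g and b
    return embeddings + n_layers * per_block + final_norm


# (emb_dim, n_layers) per released model size
sizes = {"124M": (768, 12), "355M": (1024, 24), "774M": (1280, 36), "1558M": (1600, 48)}
for name, (emb_dim, n_layers) in sizes.items():
    print(name, f"{gpt2_param_count(emb_dim, n_layers):,}")
```

Under these assumptions the smallest model comes to 124,439,808 parameters, which matches its "124M" label.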

<div class="alert alert-block alert-success">
    
Above, we loaded the **124M GPT-2** model weights into Python. However, we still need to transfer them into our `GPTModel` instance.

First, we initialize a new GPTModel instance.

Note that OpenAI used **bias vectors** in the **multi-head attention module's linear layers** that compute the query, key, and value matrices. Our implementation controls this through the `qkv_bias` setting.

<div class="alert alert-block alert-warning">

**Bias vectors** are not commonly used in LLMs anymore as they don't improve the modeling performance and are thus unnecessary.
</div>

However, since we are working with pretrained weights, we need to match the settings for consistency and enable these bias vectors:

</div>

In [6]:
# Define model configurations in a dictionary for compactness
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

GPT_CONFIG_124M.update({"context_length": 1024, "qkv_bias": True})
gpt = GPTModel(GPT_CONFIG_124M)
gpt.eval();
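
<div class="alert alert-block alert-info">

The `model_configs` dictionary is defined above, but only the 124M settings are actually used. The sketch below shows one way the dictionary could select a different model size without modifying a shared configuration in place. `BASE_CONFIG`, `build_config`, and the `vocab_size` key are assumptions introduced for illustration.
</div>

```python
# Settings shared by all GPT-2 sizes ("vocab_size" is an assumed key name)
BASE_CONFIG = {"vocab_size": 50257, "context_length": 1024, "qkv_bias": True}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}


def build_config(model_name):
    """Return a fresh config dict; neither input dictionary is modified."""
    if model_name not in model_configs:
        raise KeyError(f"Unknown model name: {model_name!r}")
    return {**BASE_CONFIG, **model_configs[model_name]}


config = build_config("gpt2-medium (355M)")
print(config)
```

Building a new dictionary for each model avoids surprises when the same base configuration is imported by several notebooks.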

## 7.3 Adapting the Weight Keys

<div class="alert alert-block alert-success">
    
The parameter names in OpenAI's TensorFlow checkpoint (e.g., `attn/c_attn/w`) are different from the names in our PyTorch `GPTModel` (e.g., `attn.W_query.weight`).

We must create a mapping function that carefully renames and reshapes the pre-trained weights to match our model's architecture precisely. This step is critical for a successful transfer.
</div>
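
<div class="alert alert-block alert-info">

"Reshaping" here mostly means transposing. The sketch below assumes that the checkpoint stores linear weights as `(in_features, out_features)`, which is consistent with the `.T` calls in the loading function later in this section, while PyTorch's `Linear` layer stores them as `(out_features, in_features)`. It uses small random NumPy arrays rather than real weights.
</div>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))       # batch of 2 inputs with 4 features each
w_tf = rng.normal(size=(4, 6))    # checkpoint layout: (in_features, out_features)

y_tf = x @ w_tf                   # how the checkpoint layout is applied

w_torch = w_tf.T                  # PyTorch Linear layout: (out_features, in_features)
y_torch = x @ w_torch.T           # torch.nn.Linear computes x @ weight.T (+ bias)

print(w_torch.shape)
print(np.allclose(y_tf, y_torch))
```

Both layouts produce the same output, so transposing during loading changes only how the numbers are stored, not what the layer computes.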

<div class="alert alert-block alert-success">

To assist with this mapping, we'll first define a small helper utility called `assign`. Its job is to ensure that a pre-trained weight tensor and our model's layer tensor have the exact same shape before assigning the weights. This acts as a valuable safety check.
</div>

In [7]:
def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))
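
<div class="alert alert-block alert-info">

A short usage sketch of `assign` on a toy layer follows. The layer sizes are arbitrary and chosen only for illustration; the helper is repeated so that the snippet runs on its own.
</div>

```python
import numpy as np
import torch


def assign(left, right):
    # Same helper as above, repeated so this snippet runs on its own
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))


layer = torch.nn.Linear(in_features=4, out_features=6)  # weight shape: (6, 4)

good = np.ones((6, 4), dtype=np.float32)
layer.weight = assign(layer.weight, good)   # shapes match, weight is replaced

bad = np.ones((4, 6), dtype=np.float32)     # e.g. a forgotten transpose
try:
    layer.weight = assign(layer.weight, bad)
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print(err)
```

Note that the check only compares shapes. Two different weights with identical shapes, such as the query and key matrices, could still be swapped without triggering an error.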

<div class="alert alert-block alert-info">

The `load_weights_into_gpt` function bridges the gap between OpenAI's pre-trained parameters and our custom `GPTModel` architecture.

It works by systematically assigning the pre-trained token and positional embedding weights to their corresponding layers. Then, it iterates through each transformer block, carefully mapping the various layer weights. A key part of this process is splitting the combined query, key, and value weights from the OpenAI checkpoint into the separate `W_query`, `W_key`, and `W_value` matrices in our attention module.

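Finally, it mirrors the **weight tying** used in the original GPT-2 by also assigning the token embedding weights to the model's final output head. Note that `assign` creates a separate copy for each layer, so the two layers start from identical values rather than sharing a single tensor.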
</div>
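
<div class="alert alert-block alert-info">

The splitting step can be tried in isolation. The sketch below uses a tiny embedding dimension of 4 instead of 768 and fills the combined array with consecutive numbers so the three parts are easy to tell apart. The column order query, key, value follows the unpacking order used in `load_weights_into_gpt`.
</div>

```python
import numpy as np

emb_dim = 4  # tiny stand-in for 768

# Combined weight laid out like c_attn: (emb_dim, 3 * emb_dim)
c_attn_w = np.arange(emb_dim * 3 * emb_dim, dtype=np.float32).reshape(emb_dim, 3 * emb_dim)
c_attn_b = np.arange(3 * emb_dim, dtype=np.float32)

# Split along the last axis into three equal parts: query, key, value
q_w, k_w, v_w = np.split(c_attn_w, 3, axis=-1)
q_b, k_b, v_b = np.split(c_attn_b, 3, axis=-1)

print(q_w.shape, k_w.shape, v_w.shape)
print(q_b.shape, k_b.shape, v_b.shape)

# Concatenating the parts restores the original combined array
print(np.array_equal(np.concatenate([q_w, k_w, v_w], axis=-1), c_attn_w))
```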

In [8]:
def load_weights_into_gpt(gpt, params):
    gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe'])
    gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte'])
    
    for b in range(len(params["blocks"])):
        q_w, k_w, v_w = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
        gpt.trf_blocks[b].attn.W_query.weight = assign(
            gpt.trf_blocks[b].attn.W_query.weight, q_w.T)
        gpt.trf_blocks[b].attn.W_key.weight = assign(
            gpt.trf_blocks[b].attn.W_key.weight, k_w.T)
        gpt.trf_blocks[b].attn.W_value.weight = assign(
            gpt.trf_blocks[b].attn.W_value.weight, v_w.T)

        q_b, k_b, v_b = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
        gpt.trf_blocks[b].attn.W_query.bias = assign(
            gpt.trf_blocks[b].attn.W_query.bias, q_b)
        gpt.trf_blocks[b].attn.W_key.bias = assign(
            gpt.trf_blocks[b].attn.W_key.bias, k_b)
        gpt.trf_blocks[b].attn.W_value.bias = assign(
            gpt.trf_blocks[b].attn.W_value.bias, v_b)

        gpt.trf_blocks[b].attn.out_proj.weight = assign(
            gpt.trf_blocks[b].attn.out_proj.weight, 
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].attn.out_proj.bias = assign(
            gpt.trf_blocks[b].attn.out_proj.bias, 
            params["blocks"][b]["attn"]["c_proj"]["b"])

        gpt.trf_blocks[b].ff.layers[0].weight = assign(
            gpt.trf_blocks[b].ff.layers[0].weight, 
            params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        gpt.trf_blocks[b].ff.layers[0].bias = assign(
            gpt.trf_blocks[b].ff.layers[0].bias, 
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        gpt.trf_blocks[b].ff.layers[2].weight = assign(
            gpt.trf_blocks[b].ff.layers[2].weight, 
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].ff.layers[2].bias = assign(
            gpt.trf_blocks[b].ff.layers[2].bias, 
            params["blocks"][b]["mlp"]["c_proj"]["b"])

        gpt.trf_blocks[b].norm1.scale = assign(
            gpt.trf_blocks[b].norm1.scale, 
            params["blocks"][b]["ln_1"]["g"])
        gpt.trf_blocks[b].norm1.shift = assign(
            gpt.trf_blocks[b].norm1.shift, 
            params["blocks"][b]["ln_1"]["b"])
        gpt.trf_blocks[b].norm2.scale = assign(
            gpt.trf_blocks[b].norm2.scale, 
            params["blocks"][b]["ln_2"]["g"])
        gpt.trf_blocks[b].norm2.shift = assign(
            gpt.trf_blocks[b].norm2.shift, 
            params["blocks"][b]["ln_2"]["b"])

    gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])
    gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
    gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])

<div class="alert alert-block alert-success">

Developing the `load_weights_into_gpt` function took a lot of guesswork because OpenAI used a slightly different naming convention from ours.

However, the `assign` function would alert us if we tried to match two tensors with different dimensions.

Also, if we made a mistake in this function, we would notice it because the resulting GPT model would be unable to produce coherent text.
</div>
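
<div class="alert alert-block alert-info">

Beyond shape checks and reading generated text, the loaded values can be compared directly against their source arrays. The sketch below does this on toy layers with made-up sizes, using a random array in place of `params["wte"]`. It also shows that the embedding layer and the output head end up with equal values stored in separate memory.
</div>

```python
import numpy as np
import torch


def assign(left, right):
    # Same helper as in this chapter, repeated so the snippet runs on its own
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))


vocab_size, emb_dim = 10, 4  # toy sizes, not the GPT-2 values
wte = np.random.default_rng(0).normal(size=(vocab_size, emb_dim)).astype(np.float32)

tok_emb = torch.nn.Embedding(vocab_size, emb_dim)
out_head = torch.nn.Linear(emb_dim, vocab_size, bias=False)
tok_emb.weight = assign(tok_emb.weight, wte)
out_head.weight = assign(out_head.weight, wte)

# 1) The loaded parameter matches the source array value for value
matches_source = np.allclose(tok_emb.weight.detach().numpy(), wte)

# 2) Both layers hold the same values ...
same_values = torch.equal(tok_emb.weight, out_head.weight)

# 3) ... but in separate memory, because torch.tensor() copies its input
shares_memory = tok_emb.weight.data_ptr() == out_head.weight.data_ptr()

print(matches_source, same_values, shares_memory)
```

The same comparison could be applied to any layer of the real model after loading, for example to confirm that the query and key weights were not swapped.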

## 7.4 Generating Text with Pre-trained Weights

<div class="alert alert-block alert-success">

With our mapping function ready, let's load the adapted weights into the `gpt` instance we created earlier, move it to the target device, and see the difference in generation quality.
</div>

In [9]:
load_weights_into_gpt(gpt, params)
gpt.to(device);

In [10]:
torch.manual_seed(100)

token_ids = generate(
    model=gpt,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=50,
    temperature=1.4
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you through the world in different ways, not one based around your skills. Every job involves some other skill but to create a career


<div class="alert alert-block alert-info">
    
**Success!**

The model now generates **coherent and contextually relevant text**. This is strong evidence that our from-scratch `GPTModel` architecture is implemented correctly and is compatible with the original GPT-2 design.
</div>

<div class="alert alert-block alert-success">

## Chapter 7 Summary and Next Steps

This concludes a major milestone in our project. We have successfully taken our from-scratch `GPTModel` architecture and loaded it with the professionally trained weights from OpenAI's original GPT-2 model.

<div class="alert alert-block alert-info">
 
**Milestone Reached: Pre-trained Model is Operational!**

Throughout this chapter, we have:

* Downloaded the official GPT-2 (124M) weights and configuration from OpenAI's repository.
* Handled the original TensorFlow checkpoint format to extract the model's parameters.
* Written a detailed mapping function to carefully adapt the names and shapes of the pre-trained weights to match our own `GPTModel` architecture.
* Successfully loaded these weights into our model, replacing the random initial parameters.
* Generated high-quality, coherent text, proving that our from-scratch implementation is correct and compatible with the original GPT-2 design.
</div>

### Where We Are Now
We now have a powerful, pre-trained language model. It is no longer a toy model trained on a small text; it now has the general knowledge and language capabilities of the original GPT-2, learned from a massive web text corpus.

### What's Next?
Our model is now a powerful generalist, but it is not specialized for any particular task beyond text completion. The next step in the LLM lifecycle is to adapt this pre-trained model for a specific downstream task.

In the next notebook, **Chapter 8: Finetuning for Text Classification**, we will take this powerful model and finetune it to become a spam classifier.
</div>