
## Overview of the Jupyter Notebook

This Jupyter Notebook demonstrates the process of setting up and using a GPT-2 model for text classification. Below is a summary of the key steps and components involved:

1. **Importing Required Libraries**:
    - Essential libraries such as `torch`, `pandas`, and `tiktoken` are imported for model creation, data handling, and tokenization.

2. **Model Configuration**:
    - The `BASE_CONFIG` dictionary is defined to store the configuration parameters for the GPT-2 model, including vocabulary size, context length, dropout rate, and query-key-value bias.
    - The `model_configs` dictionary contains specific configurations for different GPT-2 model sizes (small, medium, large, xl).
    - The `BASE_CONFIG` is updated with the configuration of the chosen model (`CHOOSE_MODEL`).

3. **Tokenization**:
    - The `tiktoken` library is used to get the GPT-2 tokenizer, which is then used to tokenize input texts.

4. **Dataset Preparation**:
    - A custom `SpamDataset` class is defined to handle the loading and preprocessing of the dataset from a CSV file. This includes tokenizing the text data and padding the sequences to a uniform length.

5. **Model Initialization**:
    - The GPT-2 model is instantiated using the `GPTModel` class with the specified configuration.
    - The model's output head is modified to have an output size of 2, suitable for binary classification.

6. **Loading Pretrained Weights**:
    - The weights of the already fine-tuned classifier are loaded into the model from a checkpoint file (`review_classifier.pth`).

7. **Loss Function**:
    - The cross-entropy loss function (`criteron`) is defined for training the model.

8. **Gradient Calculation**:
    - A forward hook is registered to capture the token embeddings during the forward pass.
    - The model is run on an example input to compute the output and loss.
    - Backpropagation is performed to compute the gradient of the loss with respect to the token embeddings.

9. **Gradient Analysis**:
    - The gradient of the loss with respect to the weights of the first linear layer in the first transformer block's feed-forward module is extracted and inspected.

This notebook walks through setting up a GPT-2 model for text classification: data preprocessing, model configuration, loading the fine-tuned weights, and gradient analysis. No training is performed here; the classifier weights come from the checkpoint file.


In [58]:
from importlib.metadata import version
from gpt_download import download_and_load_gpt2
from previous_chapters_1 import GPTModel, load_weights_into_gpt
import torch


pkgs = ["matplotlib",
        "numpy",
        "tiktoken",
        "torch",
        "tensorflow", # For OpenAI's pretrained weights
        "pandas"      # Dataset loading
       ]
for p in pkgs:
    print(f"{p} version: {version(p)}")

# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

matplotlib version: 3.8.4
numpy version: 1.26.3
tiktoken version: 0.8.0
torch version: 2.2.2+cu121
tensorflow version: 2.16.1
pandas version: 2.2.2


In [59]:
import pandas as pd


## Model Configuration

In this section, we define the configuration for the GPT-2 model. The chosen model is `gpt2-small (124M)`, and the configuration parameters are stored in the `BASE_CONFIG` dictionary. The configuration includes:

- **Vocabulary Size**: 50257
- **Context Length**: 1024
- **Dropout Rate**: 0.0
- **Query-Key-Value Bias**: True

The `model_configs` dictionary contains specific configurations for different GPT-2 model sizes. The `BASE_CONFIG` is updated with the configuration of the chosen model (`gpt2-small (124M)`), which has:

- **Embedding Dimension**: 768
- **Number of Layers**: 12
- **Number of Heads**: 12

In [60]:
CHOOSE_MODEL = "gpt2-small (124M)"
INPUT_PROMPT = "Every effort moves"

BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])
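
As a quick sanity check on the size table, the sketch below computes the per-head dimension for each configuration. It assumes that the attention module in `previous_chapters_1` splits `emb_dim` evenly across `n_heads`, as standard multi-head attention does; `head_dims` is a name introduced here for illustration only.

```python
# Standalone copy of the size table from the cell above.
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

head_dims = {}  # illustrative name, not part of the notebook
for name, cfg in model_configs.items():
    # Assumption: emb_dim is divided evenly among the attention heads.
    if cfg["emb_dim"] % cfg["n_heads"] != 0:
        raise ValueError(f"{name}: emb_dim is not divisible by n_heads")
    head_dims[name] = cfg["emb_dim"] // cfg["n_heads"]

for name, dim in head_dims.items():
    print(f"{name}: {dim} dimensions per head")
```

Under that assumption, every listed size works out to 64 dimensions per head, so the models differ in width and depth rather than in head size.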


In [61]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

In [62]:
model = GPTModel(BASE_CONFIG)
# model.to(device)


### SpamDataset Class

The `SpamDataset` class loads the dataset from a CSV file and preprocesses it. It tokenizes the text with the GPT-2 tokenizer, optionally truncates each sequence to `max_length`, and pads every sequence to a uniform length with `pad_token_id`. The same class prepares data for training GPT-2 on the spam classification task; in this notebook it is used only to build inputs for the already trained model.



In [63]:
import torch
from torch.utils.data import Dataset


class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):
        self.data = pd.read_csv(csv_file)

        # Pre-tokenize texts
        self.encoded_texts = [
            tokenizer.encode(text) for text in self.data["Text"]
        ]

        if max_length is None:
            self.max_length = self._longest_encoded_length()
        else:
            self.max_length = max_length
            # Truncate sequences if they are longer than max_length
            self.encoded_texts = [
                encoded_text[:self.max_length]
                for encoded_text in self.encoded_texts
            ]

        # Pad sequences to the longest sequence
        self.encoded_texts = [
            encoded_text + [pad_token_id] * (self.max_length - len(encoded_text))
            for encoded_text in self.encoded_texts
        ]

    def __getitem__(self, index):
        encoded = self.encoded_texts[index]
        label = self.data.iloc[index]["Label"]
        return (
            torch.tensor(encoded, dtype=torch.long),
            torch.tensor(label, dtype=torch.long)
        )

    def __len__(self):
        return len(self.data)

    def _longest_encoded_length(self):
        max_length = 0
        for encoded_text in self.encoded_texts:
            encoded_length = len(encoded_text)
            if encoded_length > max_length:
                max_length = encoded_length
        return max_length
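
The truncate-then-pad logic in `__init__` can be checked in isolation on toy token IDs. The sketch below uses a hypothetical helper, `pad_or_truncate`, that is not part of the notebook; it mirrors the per-sequence behavior of the class without needing a CSV file or a tokenizer.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id=50256):
    """Hypothetical helper mirroring the per-sequence logic in SpamDataset."""
    clipped = token_ids[:max_length]  # drop tokens beyond max_length
    # Fill the remainder (if any) with the padding token.
    return clipped + [pad_token_id] * (max_length - len(clipped))


# Made-up token IDs of different lengths.
toy_batch = [[11, 22, 33], [44], [55, 66, 77, 88, 99]]
padded = [pad_or_truncate(ids, max_length=4) for ids in toy_batch]

for row in padded:
    print(row)
```

Every row ends up with exactly four entries: short sequences are right-padded with 50256 (the ID of `<|endoftext|>` in the GPT-2 vocabulary) and the long one is cut off after its fourth token.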

In [64]:
train_dataset = SpamDataset(
    csv_file="train.csv",
    max_length=None,
    tokenizer=tokenizer
)

In [65]:
example = train_dataset[4]
print(type(example[0]))

<class 'torch.Tensor'>


Next, the model's output head is replaced with a linear layer that has two outputs, one per class, for binary classification.


In [66]:
model.out_head = torch.nn.Linear(in_features=BASE_CONFIG["emb_dim"], out_features=2)
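
To see what the replaced head produces, the sketch below applies a standalone `Linear(768, 2)` layer to random stand-in hidden states. The tensor `hidden` is an assumption standing in for the transformer's final hidden states; the last-token selection matches the `output[:, -1, :]` indexing used later in this notebook.

```python
import torch

torch.manual_seed(0)

emb_dim, num_classes = 768, 2
head = torch.nn.Linear(in_features=emb_dim, out_features=num_classes)

# Stand-in for the final hidden states: (batch, tokens, emb_dim).
hidden = torch.randn(1, 6, emb_dim)

logits = head(hidden)                  # one pair of class scores per token
last_token_logits = logits[:, -1, :]   # keep only the last position
predicted = torch.argmax(last_token_logits, dim=-1)

print(logits.shape, last_token_logits.shape, predicted.shape)
```

The head is applied at every position, so the classification decision has to pick one of them. In a causally masked model such as GPT-2, the last position is the only one that can attend to the whole sequence, which is a common reason for choosing it.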

In [67]:
model_state_dict = torch.load("review_classifier.pth", map_location="cpu", weights_only=True)
model.load_state_dict(model_state_dict)

<All keys matched successfully>
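
The head has to be replaced before the checkpoint is loaded because `load_state_dict` is strict about parameter shapes by default. The sketch below illustrates this with tiny `Linear` layers and an in-memory buffer in place of `review_classifier.pth`; the layer sizes are arbitrary choices made for the example.

```python
import io

import torch

torch.manual_seed(0)

# Save a small layer's weights to an in-memory buffer instead of a file.
source = torch.nn.Linear(4, 2)
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)
buffer.seek(0)

state_dict = torch.load(buffer, map_location="cpu", weights_only=True)

# Same architecture: all keys and shapes match.
target = torch.nn.Linear(4, 2)
result = target.load_state_dict(state_dict)
weights_match = torch.equal(source.weight, target.weight)
print(result, weights_match)

# Different output size: loading fails with a size-mismatch error.
mismatched = torch.nn.Linear(4, 3)
try:
    mismatched.load_state_dict(state_dict)
    load_failed = False
except RuntimeError as err:
    load_failed = True
    print("Loading failed:", str(err).splitlines()[0])
```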

In [75]:
model

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(1024, 768)
  (drop_emb): Dropout(p=0.0, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=True)
        (W_key): Linear(in_features=768, out_features=768, bias=True)
        (W_value): Linear(in_features=768, out_features=768, bias=True)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_resid): Dropout(p=0.0, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768,

### Defining the loss function


In [68]:
criteron = torch.nn.CrossEntropyLoss()
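
As a small worked example of what this loss computes, the sketch below compares `CrossEntropyLoss` with the negative log of the softmax probability assigned to the true class. The logits and label are made-up values, and the variable is spelled `criterion` here, whereas the notebook's own variable is `criteron`.

```python
import math

import torch

criterion = torch.nn.CrossEntropyLoss()

logits = torch.tensor([[2.0, 0.5]])  # made-up scores for classes 0 and 1
label = torch.tensor([0])            # true class index

loss = criterion(logits, label)

# Manual computation: -log(softmax(logits)[true_class]).
probs = torch.softmax(logits, dim=-1)
manual = -math.log(probs[0, label.item()].item())

print(f"library: {loss.item():.6f}  manual: {manual:.6f}")
```

Note that the loss takes raw logits, not probabilities; the softmax is applied internally.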

### Custom Hooks (get_token_embeddings)

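This custom forward hook captures the output of the model's first layer, the token embedding layer. That layer takes token IDs as input and returns their embeddings. Because the embedding output is an intermediate (non-leaf) tensor, PyTorch discards its gradient by default, so the hook calls `retain_grad()` on it. After backpropagation, the gradient of the loss with respect to each token's embedding is then available in `embeddings.grad`.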

In [None]:
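embeddings = None  # set by the hook during the forward pass


def get_token_embeddings(module, input, output):
    """Forward hook: store the embedding output and keep its gradient."""
    global embeddings
    embeddings = output
    # Non-leaf tensors only keep .grad after backward() if asked explicitly.
    embeddings.retain_grad()
    print("Captured token embeddings")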



### Registering Forward Hook and Calculating Gradients

In this section, we register a forward hook to the token embedding layer of the model. This hook captures the embeddings during the forward pass. We then perform a forward pass using an example from the dataset, compute the loss, and perform backpropagation to calculate the gradients of the embeddings.

The steps are as follows:

1. **Register Forward Hook**:
    - The `get_token_embeddings` function is registered as a forward hook to the token embedding layer (`tok_emb`) of the model. This function captures the embeddings during the forward pass and retains the gradients.

2. **Forward Pass**:
    - The model performs a forward pass using an example input from the `train_dataset`. The output of the model is stored in the `output` variable.

3. **Compute Loss**:
    - The loss is computed using the cross-entropy loss function (`criteron`). The loss is calculated between the model's output and the true label of the example.

4. **Backpropagation**:
    - Backpropagation is performed to compute the gradient of the loss with respect to the embeddings. The result is stored in `embeddings.grad` and has the same shape as the embeddings themselves.

This process allows us to analyze the gradients of the token embeddings, which can provide insights into how the model is learning and which parts of the input are most influential in the model's predictions.
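
The four steps above can be tried end to end on a toy model. The sketch below is a minimal illustration under stated assumptions: `TinyClassifier` is a hypothetical two-layer stand-in for `GPTModel`, and the token IDs and label are made up. It stores the captured tensor in a dictionary rather than a global variable and reduces the embedding gradient to one score per token by taking its norm.

```python
import torch

torch.manual_seed(0)


class TinyClassifier(torch.nn.Module):
    """Hypothetical stand-in for GPTModel: embedding followed by a linear head."""

    def __init__(self, vocab_size=10, emb_dim=4, num_classes=2):
        super().__init__()
        self.tok_emb = torch.nn.Embedding(vocab_size, emb_dim)
        self.out_head = torch.nn.Linear(emb_dim, num_classes)

    def forward(self, token_ids):
        return self.out_head(self.tok_emb(token_ids))  # (batch, tokens, classes)


captured = {}


def capture_embeddings(module, inputs, output):
    output.retain_grad()            # keep .grad on this non-leaf tensor
    captured["embeddings"] = output


tiny = TinyClassifier()
handle = tiny.tok_emb.register_forward_hook(capture_embeddings)

token_ids = torch.tensor([[1, 5, 7]])  # made-up input, batch of one
label = torch.tensor([1])

logits = tiny(token_ids)
loss = torch.nn.functional.cross_entropy(logits[:, -1, :], label)
loss.backward()
handle.remove()                     # detach the hook once it is no longer needed

emb_grad = captured["embeddings"].grad        # shape (1, 3, 4)
saliency = emb_grad.norm(dim=-1).squeeze(0)   # one score per token
print(emb_grad.shape, saliency)
```

In this toy model the positions never interact, so only the last token receives a nonzero gradient. In the real model, attention mixes information across positions, which is why earlier tokens can also receive gradient there.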

In [None]:
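hook = model.tok_emb.register_forward_hook(get_token_embeddings)

model.zero_grad()  # clear gradients accumulated by earlier runs of this cell
output = model(example[0].unsqueeze(0))  # add a batch dimension
# Classify from the logits at the last token position
loss = criteron(output[:, -1, :], example[1].unsqueeze(0))
loss.backward()
hook.remove()  # avoid stacking duplicate hooks when the cell is re-run

embeddings.grad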

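The next cell reads the gradient of the loss with respect to the weights of the first linear layer in the first transformer block's feed-forward module. It must be run after the backward pass above; before that, `.grad` is `None`.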

In [1]:
# Gradient of the loss w.r.t. the first feed-forward linear layer in block 0
grad = model.trf_blocks[0].ff.layers[0].weight.grad


In [74]:
grad

tensor([[ 8.3300e-05,  3.6141e-06, -3.1241e-05,  ...,  1.0998e-05,
          3.9674e-05,  9.9545e-05],
        [ 2.6122e-05,  1.9891e-06, -6.2745e-06,  ...,  2.8791e-06,
          1.2063e-05,  2.2125e-05],
        [-4.7453e-06, -3.2032e-07,  6.9602e-07,  ...,  1.3305e-07,
         -1.9704e-06, -3.5673e-06],
        ...,
        [ 3.3106e-05,  2.9938e-06, -6.9407e-06,  ...,  3.3974e-06,
          1.7098e-05,  2.8725e-05],
        [-1.0225e-06, -5.9012e-07,  4.5830e-07,  ...,  1.3941e-06,
         -1.0809e-06, -1.4936e-06],
        [-5.2052e-06,  3.2255e-06, -7.6528e-06,  ..., -4.7861e-06,
          7.2860e-06,  1.0909e-05]])
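
A raw printout of a gradient tensor this size is hard to read, so a few summary statistics are usually more informative. The sketch below shows one possible way to summarize a weight gradient. It uses a small stand-alone `Linear` layer and a made-up loss, since the real `grad` has shape `(3072, 768)` (PyTorch stores `Linear` weights as `(out_features, in_features)`); the choice of statistics is an assumption, not part of the original analysis.

```python
import torch

torch.manual_seed(0)

layer = torch.nn.Linear(8, 16)   # weight shape: (16, 8)
x = torch.randn(3, 8)            # made-up inputs
loss = layer(x).pow(2).mean()    # made-up scalar loss
loss.backward()

grad = layer.weight.grad         # same shape as layer.weight

print("shape:", tuple(grad.shape))
print("overall norm:", grad.norm().item())
print("mean absolute value:", grad.abs().mean().item())

# Each row holds the gradient for one output unit's incoming weights.
row_norms = grad.norm(dim=1)
top = torch.topk(row_norms, k=3)
print("units with the largest gradient norm:", top.indices.tolist())
```

Applied to the notebook's `grad`, the same calls would rank the 3072 hidden units of the feed-forward layer by how strongly this single example pushes on their weights.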