# LLM Data Pipeline: Tokenization & Sliding Windows

**Purpose:** This notebook demonstrates how to tokenize a raw text file with `tiktoken`, build a sliding-window `Dataset` for autoregressive training, and create small batches for inspection. Later sections assemble the GPT architecture piece by piece: layer normalization, the GELU feed-forward network, shortcut connections, the transformer block, the full model, and a simple text-generation loop.

---

### Overview

* **Tokenization** — use `tiktoken` (GPT-2 BPE) to convert text to token IDs.
* **Sliding window dataset** — produce overlapping input/target chunks for autoregressive training.
* **Batching & embeddings** — create `DataLoader` batches and map tokens to dense vectors.

### Requirements

* `Data.txt` (a UTF-8 text file) in the same directory as the notebook.
* `tiktoken` Python package (install with `pip install tiktoken`).
* `PyTorch` available for `Dataset`, `DataLoader`, and tensors.

### Table of contents

1.  Setup and Installation
2.  Data Loading & Tokenization
3.  Quick tokenization check
4.  Causal Modeling — input/target shift
5.  Growing-context illustration
6.  Sliding-window Dataset: idea & implementation
7.  DataLoader factory
8.  Sanity checks (reload & sample batches)
9.  Token embeddings and shapes
10. Create a small embedded batch (example)
11. Embed inputs into 256-dimension vectors
12. Positional embedding layer
13. Placeholder GPT model

## 1. Setup and Installation
Install the required tokenizer package if you haven't already:

In [None]:
pip install tiktoken

## 2. Data Loading & Tokenization
Read the source text into `raw_text` and import `tiktoken`. The GPT-2 encoder itself is created in the next section.

In [None]:
import tiktoken
with open("Data.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

## 3. Quick tokenization check
This short block prints the first character of the file, tokenizes the text with the **GPT-2 BPE encoder**, prints the token count, and keeps the first 50 token IDs in `enc_sample` for the examples that follow.


In [None]:
print(raw_text[:1])
tokenizer = tiktoken.get_encoding('gpt2')
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))
enc_sample = enc_text[:50]

## 4. Causal Modeling — input/target shift (x, y example)
**Autoregressive models** predict the next token given a context. Below we create `x` as a context of `context_size` tokens and `y` as the same window shifted one position to the right, so `y[i]` is the token the model should predict after seeing `x[:i+1]`.


In [None]:
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:     {y}")

## 5. Growing-context illustration (what the model sees)
This loop demonstrates how the context grows token-by-token and the desired next token at each step. This visualizes the fundamental prediction task of the model.

In [None]:
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(f"in iteration {i}, with : {tokenizer.decode(context)} ==> : {tokenizer.decode([desired])}")

## 6. Sliding-window Dataset: idea & implementation
**Idea:** From a single long token sequence, we produce many overlapping input/target examples.
* For a `max_length` window, we take tokens `[i:i+max_length]` as **input**.
* We take tokens `[i+1:i+max_length+1]` as the **target**.
* The **stride** controls the amount of overlap between consecutive samples.

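**Why:** With `stride < max_length`, consecutive windows overlap, so each token appears in several contexts and more training examples are extracted from the same text. With `stride == max_length`, the windows tile the text without overlap.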
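The windowing rule above can be checked on a toy sequence without any tokenizer. The sketch below is a dependency-free illustration: `sliding_windows` is a hypothetical helper (not part of the notebook) that applies the same slicing rule, and the integers 0 to 9 stand in for real token IDs.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


toy_ids = list(range(10))  # hypothetical stand-in for real token IDs

overlapping = sliding_windows(toy_ids, max_length=4, stride=2)
tiled = sliding_windows(toy_ids, max_length=4, stride=4)

for input_chunk, target_chunk in overlapping:
    print(input_chunk, "->", target_chunk)
```

With `stride=2`, consecutive inputs share two tokens; with `stride=4`, they share none, so fewer samples come out of the same sequence.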

**Implementation:** this class produces `(input_ids, target_ids)` samples.

In [None]:
import torch  # needed for torch.tensor inside the dataset
from torch.utils.data import Dataset, DataLoader


class GPTDataSetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i+max_length]
            target_chunk = token_ids[i+1: i+max_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk)) 

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, index):
        return self.input_ids[index], self.target_ids[index]


## 7. DataLoader factory
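The helper function below constructs the `GPTDataSetV1` and wraps it in a PyTorch `DataLoader`, which handles batching, optional shuffling, and parallel loading (controlled by `num_workers`). If worker processes fail to start in a notebook (this can happen on platforms that spawn rather than fork workers, because the dataset class is defined in the notebook itself), passing `num_workers=0` loads the data in the main process instead.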

In [None]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=6):
    
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDataSetV1(txt, tokenizer, max_length, stride)

    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader
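
To get a feel for how `max_length`, `stride`, `batch_size`, and `drop_last` interact, the number of samples and batches can be worked out by hand. The helpers below are hypothetical (they are not part of the notebook or of PyTorch); they mirror the loop bounds of `GPTDataSetV1` and the usual meaning of `drop_last`, and the token count of 1,000 is an arbitrary example value.

```python
def count_samples(num_tokens, max_length, stride):
    """Number of windows produced by range(0, num_tokens - max_length, stride)."""
    return len(range(0, num_tokens - max_length, stride))


def count_batches(num_samples, batch_size, drop_last=True):
    """Number of batches; an incomplete final batch is dropped when drop_last is True."""
    full, remainder = divmod(num_samples, batch_size)
    if drop_last or remainder == 0:
        return full
    return full + 1


num_samples = count_samples(num_tokens=1000, max_length=256, stride=128)
print("samples:", num_samples)
print("batches (drop_last=True): ", count_batches(num_samples, batch_size=4))
print("batches (drop_last=False):", count_batches(num_samples, batch_size=4, drop_last=False))
```

A short text combined with a large `batch_size` and `drop_last=True` can therefore yield very few batches, or none at all.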

## 8. Sanity checks (reload & sample batches)
We reload the text (so this section can be run on its own) and draw two batches with `max_length=4` and `stride=1`. With `shuffle=False`, the second batch should be the first one shifted by exactly one token, which confirms that the windowing and batching behave as expected.

In [None]:
with open("Data.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

In [None]:
import torch
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)

data_iter = iter(dataloader)
first_batch = next(data_iter)
second_batch = next(data_iter)
print("First Iter : ", first_batch)
print("Second Iter : ", second_batch)

## 9. Token embedding layer (shapes)
We create a **token embedding layer** (`nn.Embedding`) to map the integer token IDs to dense, continuous vectors. This is the first step of the Transformer input process. Positional embeddings are typically added after this step.


In [None]:
vocab_size = 50257  # size of the GPT-2 BPE vocabulary
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# Shape note: If inputs has shape (B, T) then token_embedding_layer(inputs) => (B, T, output_dim).
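
Conceptually, an embedding layer is a lookup table: each token ID selects one row of a weight matrix. The sketch below imitates that with plain Python lists so the shape change from `(B, T)` to `(B, T, output_dim)` is easy to see. The toy sizes, the random initialization, and the helper `embed` are assumptions made for illustration, not the notebook's actual layer.

```python
import random

rng = random.Random(123)
vocab_size, output_dim = 6, 3  # toy sizes, far smaller than 50257 x 256

# One row of output_dim numbers per token ID (a stand-in for the layer's weight matrix)
weight = [[rng.gauss(0.0, 1.0) for _ in range(output_dim)] for _ in range(vocab_size)]


def embed(batch):
    """Replace every token ID with its row of the weight table."""
    return [[weight[token_id] for token_id in sequence] for sequence in batch]


inputs = [[2, 3, 5, 1], [0, 0, 4, 2]]  # shape (B=2, T=4)
embedded = embed(inputs)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 4, 3)
```

Note that the two occurrences of token ID `0` in the second sequence map to the same vector: the lookup depends only on the token, not on its position.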

## 10. Create a small embedded batch (example)
We instantiate a dataloader with a `batch_size=8` and `max_length=4`. The input tensor will have shape $(8, 4)$, which then gets embedded to $(8, 4, 256)$ when passed through `token_embedding_layer`.

In [None]:
max_length = 4
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs shape:", inputs.shape)    # expected: torch.Size([8, 4])
print("Targets shape:", targets.shape)  # expected: torch.Size([8, 4])

## 11. Embed inputs into 256-dimension vectors
This step converts the token IDs into their dense vector representations, which (after positional information is added) are what the Transformer blocks consume.

In [None]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

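## 12. Create another embedding layer for the positional encoding

Token embeddings carry no information about where a token sits in the sequence, so a second `nn.Embedding` maps each position index from `0` to `context_size - 1` to a vector of the same width (`output_dim`). The positional vectors are then added to the token embeddings before the self-attention and transformer blocks that make up the full model.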


In [None]:
context_size = max_length
pos_embedding_layer = torch.nn.Embedding(context_size, output_dim)

In [None]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)
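
The positional embeddings have shape `(T, output_dim)` while the token embeddings have shape `(B, T, output_dim)`, so the same positional vectors are added to every sequence in the batch. The following dependency-free sketch spells out that broadcast with nested lists; `add_positional` and the toy values are hypothetical and only illustrate the arithmetic.

```python
def add_positional(token_embeddings, pos_embeddings):
    """Add the (T, D) positional vectors to every sequence of a (B, T, D) batch."""
    return [
        [
            [t + p for t, p in zip(token_vec, pos_vec)]
            for token_vec, pos_vec in zip(sequence, pos_embeddings)
        ]
        for sequence in token_embeddings
    ]


# Toy batch: B=2 sequences, T=3 positions, D=2 dimensions
token_embeddings = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
]
pos_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

input_embeddings = add_positional(token_embeddings, pos_embeddings)
print(input_embeddings[1])  # the all-zero sequence now equals the positional vectors
```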

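## 13. Implementation of a placeholder GPT model

Just a dummy class with placeholders: it fixes the overall data flow (embeddings, transformer blocks, final normalization, output head) before the real components are implemented.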


In [None]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

Behold the dummy class

In [None]:
import torch
import torch.nn as nn


class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        
        # Use a placeholder for TransformerBlock
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        
        # Use a placeholder for LayerNorm
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits


class DummyTransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # A simple placeholder

    def forward(self, x):
        # This block does nothing and just returns its input.
        return x


class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        # The parameters here are just to mimic the LayerNorm interface.

    def forward(self, x):
        # This layer does nothing and just returns its input.

        return x

## GPT ARCHITECTURE PART 2: **The Layer Normalization Part**

The layer normalization in the GPT architecture is crucial for two primary reasons:

1.  **To mitigate vanishing or exploding gradients**, making the training process more stable.
2.  **To reduce Internal Covariate Shift**, which refers to the problem of input distributions changing within the neural network layers during training.

***

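Layer normalization rescales the activations so that, for each input, they have a mean of zero and a variance of one:

$$
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2}}
$$

where $x_i$ is one activation of the layer, and $\mu$ and $\sigma^2$ are the mean and variance taken over that layer's activations.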

***

In [None]:
#Example
import torch, torch.nn as nn
torch.manual_seed(123)
batch_example = torch.rand(2,5)
layer = nn.Sequential(nn.Linear(5,6), nn.ReLU()) # ReLU zeroes out negative values
out = layer(batch_example)
print("Out : \n", out)
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print("Mean : \n ", mean)
print("Variance : \n ", var)
out_norm = (out - mean)/ torch.sqrt(var)
print("Normalized layer outputs : \n", out_norm)
mean_norm = out_norm.mean(dim=-1, keepdim=True)
var_norm = out_norm.var(dim=-1, keepdim=True)
print("Normalized Layer Mean: \n", mean_norm)
print("Normalized Layer Variance: \n", var_norm)

Layer normalization as a class. The scale and shift are two trainable parameters (of the same dimension as the input) that the LLM automatically adjusts during training if doing so improves the model's performance on its training task.

This allows the model to learn the scaling and shifting that best suit the data it is processing. The small constant `eps` is added to the variance to avoid division by zero, and `unbiased=False` divides by $n$ rather than $n - 1$ when computing the variance.
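
To make the arithmetic concrete, here is a minimal plain-Python sketch of the same computation on a single row of activations. The function `layer_norm` and the example row are illustrative assumptions; the sketch uses the biased variance (dividing by `n`) and an `eps` of `1e-5`, and is not a replacement for the class in this notebook.

```python
import math


def layer_norm(x, scale=None, shift=None, eps=1e-5):
    """Normalize one row to mean 0 / variance 1, then apply scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # biased variance: divide by n
    scale = scale or [1.0] * n
    shift = shift or [0.0] * n
    return [s * (v - mean) / math.sqrt(var + eps) + b
            for v, s, b in zip(x, scale, shift)]


def mean_and_var(x):
    m = sum(x) / len(x)
    return m, sum((v - m) ** 2 for v in x) / len(x)


row = [0.2, 1.5, -0.7, 3.0, 0.0]  # hypothetical activations
normed = layer_norm(row)
scaled = layer_norm(row, scale=[2.0] * 5, shift=[1.0] * 5)

print("normalized mean/var:", mean_and_var(normed))
print("scaled mean/var:    ", mean_and_var(scaled))
```

With the default scale of 1 and shift of 0, the output has mean close to 0 and variance close to 1. A scale of 2 and a shift of 1 move these to roughly 1 and 4, which is the freedom the trainable parameters give the model.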

In [None]:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

In [None]:
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)

<div class="alert alert-block alert-success">

As the results show, the layer normalization code works as expected: the values of each of
the two inputs are normalized so that they have a mean of (approximately) 0 and a
variance of (approximately) 1.
</div>

## GPT ARCHITECTURE PART 3: FEEDFORWARD NEURAL NETWORK WITH GELU ACTIVATION

The ReLU activation function introduces non-linearity between the linear layers of the network and, compared with saturating activations, reduces the vanishing gradient problem. ReLU sets every negative input to zero, so the gradient for those inputs is also zero. If a neuron keeps receiving negative inputs, it can become permanently inactive and stop learning, resulting in a "dead neuron". GELU, in contrast, lets small negative values pass through, and it is smooth and differentiable at every point, whereas ReLU has a kink at zero.

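GELU is defined as $\text{GELU}(x) = x \cdot \Phi(x)$, where $\Phi$ is the cumulative distribution function (CDF) of the standard Gaussian distribution. Calculating the Gaussian CDF exactly is comparatively expensive, so a close approximation is commonly used:

$$\text{GELU}(x) \approx 0.5x \left( 1 + \tanh \left( \sqrt{\frac{2}{\pi}} (x + 0.044715x^3) \right) \right)$$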
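
As a quick sanity check on how close the approximation is, the sketch below compares it with the exact form $x \cdot \Phi(x)$, computed via `math.erf` from the standard library. The function names and the evaluation grid on $[-3, 3]$ are choices made for this illustration.

```python
import math


def gelu_exact(x):
    """x * Phi(x), with Phi the standard Gaussian CDF written via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """The tanh approximation given above."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


xs = [i / 100.0 for i in range(-300, 301)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)

print(f"largest gap on [-3, 3]: {max_gap:.6f}")
print(f"GELU(-1) = {gelu_exact(-1.0):.4f}, ReLU(-1) = {max(0.0, -1.0):.4f}")
```

The second line of output also shows the difference from ReLU at a negative input: GELU returns a small negative value, where ReLU returns exactly zero.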


In [None]:
class GELU(nn.Module):
    def __init__(self):
        super().__init__()
    
    def forward(self, x):   # tanh approximation of GELU(x) = x * Phi(x), where Phi is the standard Gaussian CDF
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
            (x + 0.044715 * torch.pow(x, 3))
        ))

PLOT GELU VS RELU:

In [None]:
import matplotlib.pyplot as plt

gelu, relu = GELU(), nn.ReLU()

# Some sample data
x = torch.linspace(-3, 3, 100)
y_gelu, y_relu = gelu(x), relu(x)

plt.figure(figsize=(8, 3))
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1):
    plt.subplot(1, 2, i)
    plt.plot(x, y)
    plt.title(f"{label} activation function")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
    plt.grid(True)

plt.tight_layout()
plt.show()


Using the GELU class in a small FeedForward network, which will be used in the transformer block later:

In [None]:
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"])
        )

    def forward(self, x):
        return self.layers(x)

In [None]:
print(GPT_CONFIG_124M["emb_dim"])


The feed-forward module gives the model additional capacity to learn from and generalize over the data. Its input and output dimensions are the same: the first linear layer expands the embedding dimension by a factor of four (768 to 3,072 here) for a richer representation space, GELU is applied there, and the second linear layer projects back down to 768. Because the input and output shapes match, layers can be stacked, which makes the model scalable.

In [None]:
ffn = FeedForward(GPT_CONFIG_124M)
x = torch.rand(2, 3, 768)  # batch of 2 sequences, 3 tokens each, emb_dim = 768
out = ffn(x)
print(out.shape)

## GPT ARCHITECTURE PART 4: **Shortcut Connections**

Vanishing gradients are a problem because the weights in the early layers barely get updated, which stalls the learning process. Shortcut connections create an alternative path for the gradient to flow, skipping one or more layers. This is achieved by adding a layer's input to its output, so the block computes `x + layer(x)` instead of `layer(x)`. Shortcut connections are also often credited with making the loss surface easier to optimize.

In [None]:
class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList(
            [
                # Each layer is a linear map followed by a GELU activation
                nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),
                nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),
                nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),
                nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),
                nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU()),
            ]
        )

    def forward(self, x):
        for layer in self.layers:
            layer_output = layer(x)
            # Apply the shortcut only when the input and output shapes match
            if self.use_shortcut and x.shape == layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
            
        return x 

For now initialize the neural network without shortcut connection

In [None]:
layer_sizes = [3,3,3,3,3,1]
sample_input = torch.tensor([[1., 0., -1.]])
torch.manual_seed(123)
with_out = ExampleDeepNeuralNetwork(layer_sizes, False)

In [None]:
def print_gradient(model, x):
    output = model(x)
    target = torch.tensor([[0.]])

    loss = nn.MSELoss()
    loss = loss(output, target)

    loss.backward()

    for name, param in model.named_parameters():
        if 'weight'  in name:
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")

The code above uses a mean-squared-error loss to measure how far the model's output is from the target specified by the user. Calling `loss.backward()` computes the gradient of the loss with respect to every parameter in the model, and the loop then prints the mean absolute gradient of each layer's weights.

In [None]:
print_gradient(with_out, sample_input)

In [None]:
torch.manual_seed(123)
model_with = ExampleDeepNeuralNetwork(layer_sizes, True)
print_gradient(model_with, sample_input)

With shortcut connections, the gradients should no longer shrink toward zero as we move back toward layer 0; they remain large enough for the early layers to keep learning.
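
A back-of-the-envelope version of the same effect uses scalar "layers" of the form `y = w * x`. Without a shortcut, the gradient through the whole chain is the product of the weights; with a shortcut, each layer computes `y = x + w * x`, so each factor becomes `1 + w`. The helper `chain_gradient` and the weight value `0.1` are hypothetical and are only meant to illustrate the chain rule, not to reproduce the numbers printed above.

```python
def chain_gradient(weights, use_shortcut):
    """Derivative of the output with respect to the input of a chain of scalar layers."""
    grad = 1.0
    for w in weights:
        # d/dx (w * x) = w, while d/dx (x + w * x) = 1 + w
        grad *= (1.0 + w) if use_shortcut else w
    return grad


weights = [0.1] * 5  # five layers with small weights

plain = chain_gradient(weights, use_shortcut=False)
with_shortcut = chain_gradient(weights, use_shortcut=True)

print(f"without shortcut: {plain:.2e}")
print(f"with shortcut:    {with_shortcut:.5f}")
```

The identity term contributed by each shortcut keeps every factor near 1, so the product does not collapse toward zero as more layers are added.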

## GPT ARCHITECTURE PART 5: **The Attention and Linear Block**

In [None]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

Layer normalization, GELU, and the feed-forward network, redefined here from the earlier parts:

In [None]:
import torch
import torch.nn as nn

In [None]:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))
    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift
 

class GELU(nn.Module):
    def __init__(self):
        super().__init__()
    
    def forward(self, x):  # tanh approximation of GELU(x) = x * Phi(x), where Phi is the standard Gaussian CDF
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
            (x + 0.044715 * torch.pow(x, 3))
        ))

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"])
        )

    def forward(self, x):
        return self.layers(x)


The Multi-Head attention Class from the Attention Mechanisms notebook

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads 

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        
        self.out_proj = nn.Linear(d_out, d_out)  
        self.dropout = nn.Dropout(dropout)

        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x) 
        queries = self.W_query(x)
        values = self.W_value(x)

        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) 
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        attn_scores = queries @ keys.transpose(2, 3) 

        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores.masked_fill_(mask_bool, -torch.inf)
        
        attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context_vec = (attn_weights @ values).transpose(1, 2) 
        
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        
        return self.out_proj(context_vec)

Let's code the **Transformer Block**

<div align="center">
    <img src="image_26d267.png" width="400px">
    <p>
        <b>Figure 1: Transformer Block Architecture</b><br>
        A single transformer layer showing the <b>Masked Multi-Head Attention</b> and 
        <b>Feed-Forward</b> sub-layers. In the code below, <b>Layer Normalization</b> is applied 
        before each sub-layer, and a <b>Residual Connection</b> adds the sub-layer's input back to its output. 
        The Feed-Forward block uses a <b>GELU</b> activation between two linear layers 
        to process the (2, 4, 768) tensor.
    </p>
</div>

In [None]:
class TransformerBlock(nn.Module):
    def __init__(self,cfg):
        super().__init__()
        self.attn = MultiHeadAttention(
            d_in = cfg["emb_dim"],
            d_out = cfg["emb_dim"],
            context_length = cfg["context_length"],
            num_heads = cfg["n_heads"],
            dropout = cfg["drop_rate"],
            qkv_bias = cfg["qkv_bias"]
        )
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])
    
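    def forward(self, x):
        # x has shape [batch_size, num_tokens, emb_dim]

        # Attention sub-layer: LayerNorm, attention, dropout, then add the shortcut
        shortcut = x
        x = self.norm1(x)
        x = self.attn(x)
        x = self.drop_shortcut(x)
        x = x + shortcut

        # Feed-forward sub-layer follows the same pattern
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut

        return x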

Finally, instantiate the transformer block:

In [None]:
torch.manual_seed(123)
x = torch.rand(2, 4 , 768)
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)

print("Input shape: ", x.shape)
print("Output shape : ", output.shape)

This shows that the input and output dimensions are identical, which is what makes the shortcut (residual) connections possible and lets blocks be stacked. Although the shape is unchanged, each output vector now carries context from the other tokens in the sequence.

## GPT ARCHITECTURE PART 6: **The entire GPT model architecture implementation**

In [None]:
import torch
import torch.nn as nn

In [None]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

In [None]:
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential( *[TransformerBlock(cfg) for _ in range (cfg["n_layers"])])
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))

        x = tok_embeds + pos_embeds  # (batch, seq_len, emb_dim) + (seq_len, emb_dim), broadcast over the batch
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)  # unnormalized scores over the vocabulary: (batch, seq_len, vocab_size)
        return logits

Initialize the 124-million-parameter model using the GPT config and pass in a batch of 2 sequences with 4 tokens each:

In [None]:
torch.manual_seed(123)
batch = torch.tensor([[6109, 3626, 6100, 345], [6109, 1110, 6222, 257]])
model = GPTModel(GPT_CONFIG_124M)
output = model(batch)
print("Input : \n", batch)
print("Output shape : ", output.shape)
print("Output : \n", output)

As shown, the output has shape `(2, 4, 50257)`: 2 sequences in the batch, 4 tokens each, and one logit for every entry in the 50,257-token vocabulary.

**Using the numel() method, short for "number of elements," we can collect the total
number of parameters in the model's parameter tensors:**

In [None]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

This parameter count is higher than the 124 million usually quoted for GPT-2. The reason is weight tying: in the original GPT-2 architecture, the weights of the token embedding layer are reused in the linear output layer, so they are counted only once, whereas this model keeps two separate matrices.

In [None]:
print("Token embedding layer shape:", model.tok_emb.weight.shape)
print("Output layer shape:", model.out_head.weight.shape)

Both weight matrices have the same shape, so removing the output layer's parameter count from the total gives the 124 million figure:

In [None]:
total_params_gpt2 = total_params - sum(p.numel() for p in model.out_head.parameters())
print(f"Number of trainable parameters considering weight tying: {total_params_gpt2:,}")

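Weight tying reduces the parameter count and therefore the memory footprint of the model. That said, it is sometimes reported that using separate token embedding and output weights leads to better training results and model performance, and this implementation keeps them separate.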
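
The same totals can be reproduced by hand from the configuration, which is a useful cross-check on where the parameters sit. The function `count_gpt_parameters` below is a hypothetical helper written for this note; it assumes the layer layout used in this notebook (query, key, and value projections with an optional bias, an output projection with bias, a feed-forward expansion factor of 4, and two LayerNorms per block).

```python
def count_gpt_parameters(cfg):
    """Return (total, total_without_output_head) for the layout used in this notebook."""
    d = cfg["emb_dim"]
    tok_emb = cfg["vocab_size"] * d
    pos_emb = cfg["context_length"] * d

    qkv_bias = d if cfg["qkv_bias"] else 0
    attention = 3 * (d * d + qkv_bias) + (d * d + d)      # W_query, W_key, W_value, out_proj
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)  # two linear layers with biases
    layer_norms = 2 * (2 * d)                             # scale and shift, twice per block
    block = attention + feed_forward + layer_norms

    final_norm = 2 * d
    out_head = d * cfg["vocab_size"]                      # bias=False

    total = tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head
    return total, total - out_head


example_cfg = {
    "vocab_size": 50257, "context_length": 1024, "emb_dim": 768,
    "n_heads": 12, "n_layers": 12, "drop_rate": 0.1, "qkv_bias": False,
}

total, tied = count_gpt_parameters(example_cfg)
print(f"separate output head: {total:,}")
print(f"with weight tying:    {tied:,}")
```

The causal mask does not appear in the count because it is registered as a buffer, not as a trainable parameter.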

The memory requirement of the 163 million parameters in the `GPTModel` object, assuming float32 weights (4 bytes per parameter):

In [None]:
total_size_bytes = total_params * 4  # float32 uses 4 bytes per parameter
total_size_mb = total_size_bytes / (1024 * 1024)  # convert bytes to megabytes
print(f"Total size of the model: {total_size_mb:.2f} MB")
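
The same arithmetic extends to other precisions. The sketch below is a small illustration that assumes only the usual byte widths (4 bytes for float32, 2 bytes for float16); the helper `model_size_mb` is hypothetical, and `163_009_536` is the parameter count this configuration yields with separate embedding and output weights. It covers the weights only, not activations, gradients, or optimizer state.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def model_size_mb(num_params, dtype="float32"):
    """Size of the weights alone, in megabytes (1 MB = 1024 * 1024 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 * 1024)


num_params = 163_009_536

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {model_size_mb(num_params, dtype):.2f} MB")
```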

## GPT ARCHITECTURE PART 7: **Getting the text from the output of the GPT model**

The input `idx` has shape `(batch, n_tokens)` and holds the current context. At each step, the context is cropped to the last `context_size` tokens if it exceeds the supported context size, and the model is run on it. We take the logits of the last position, apply a softmax function to turn them into a tensor of probabilities, and pick the index of the largest one; that index is the token ID of the predicted next token, which is appended to `idx`. The loop repeats until `max_new_tokens` new tokens have been generated. No loss is measured here and the model isn't trained, so the output should not be expected to be meaningful; the goal is only to demonstrate how text is generated from the output of the GPT model.

In [None]:
def gen_text_v1(model, idx, max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]  # crop the context to the last context_size tokens

        with torch.no_grad():  # disable autograd; inference does not need gradients
            logits = model(idx_cond)
        
        logits = logits[:, -1, :]  # keep only the last time step: (batch, n_tokens, vocab_size) -> (batch, vocab_size)
        probab = torch.softmax(logits, dim=-1)
        idx_next = torch.argmax(probab, dim=-1, keepdim=True)  # index of the largest probability, shape (batch, 1)
        idx = torch.cat((idx, idx_next), dim=1)  # append the new token along the sequence dimension
    
    return idx

For now, softmax is redundant: it preserves the ordering of the logits, so `argmax` could be applied to the logits directly. It becomes useful later with sampling techniques, where the model does not always pick the most likely token, which brings variability and creativity to the output text.
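
Both points can be seen in a few lines of plain Python. The sketch below uses made-up logits over a four-token vocabulary; `softmax` is written out by hand, and `random.choices` stands in for a sampling step, so this is an illustration of the idea and not the sampling code a later version of the generator would necessarily use.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, -1.0, 0.5, 3.2]  # hypothetical scores for a 4-token vocabulary
probs = softmax(logits)

# Greedy decoding: the largest logit and the largest probability share an index
greedy_from_logits = max(range(len(logits)), key=logits.__getitem__)
greedy_from_probs = max(range(len(probs)), key=probs.__getitem__)
print(greedy_from_logits, greedy_from_probs)

# Sampling: draw token IDs in proportion to their probabilities
rng = random.Random(0)
sampled = rng.choices(range(len(probs)), weights=probs, k=5)
print(sampled)
```

Greedy decoding always returns the same index, while sampling can return lower-probability tokens some of the time; that is only possible once the logits have been turned into a probability distribution.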

Generating text in the untrained model:

In [None]:
import tiktoken
tokenizer = tiktoken.get_encoding('gpt2')
init_context = "The window was "
encoded = tokenizer.encode(init_context)
encoded_tensor = torch.tensor(encoded).unsqueeze(0) # unsqueeze to add dim to be a batch
print("Encoded Tensor : \n", encoded_tensor)

Since this is inference and not training, call `model.eval()`, which disables dropout. (Layer normalization behaves the same way in training and evaluation mode.)

In [None]:
model.eval()
out = gen_text_v1(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output : ", out)
print("Decoded Output : \n ", tokenizer.decode(out.squeeze(0).tolist()))
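
To illustrate what switching to evaluation mode changes for dropout, here is a dependency-free sketch of "inverted" dropout, the variant in which kept values are scaled by `1 / (1 - p)` during training and the layer is the identity at inference time. The function `dropout` and the example values are hypothetical and only mimic the behavior conceptually.

```python
import random


def dropout(values, p, training, rng):
    """Zero each value with probability p while training; pass through unchanged otherwise."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    # Kept values are scaled up so the expected value stays the same
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
values = [1.0, 2.0, 3.0, 4.0]

train_out = dropout(values, p=0.5, training=True, rng=rng)
eval_out = dropout(values, p=0.5, training=False, rng=rng)

print("training mode:", train_out)
print("eval mode:    ", eval_out)
```

In evaluation mode the values pass through untouched, which is why generation with a fixed input and greedy decoding is deterministic once dropout is disabled.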