In [None]:
!pip install accelerate -U
!pip install numpy==2.0.1

!pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0
!pip install dataclasses==0.6

In [None]:
"""
Reference code for GPT-2 training and inference.
Will save the model weights into files, to be read from C as initialization.

References:
1) the official GPT-2 TensorFlow implementation released by OpenAI:
https://github.com/openai/gpt-2/blob/master/src/model.py
2) huggingface/transformers PyTorch implementation:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py

Example launches to only benchmark the speed of bfloat16 compiled GPU training:
1 GPU:
python train_gpt2.py --write_tensors=0 --num_iterations=50 --sequence_length=1024 --compile=1 --tensorcores=1 --dtype=bfloat16
you can also turn on flash-attention by appending --flash=1
4 GPU:
torchrun --standalone --nproc_per_node=4 train_gpt2.py --write_tensors=0 --num_iterations=50 --sequence_length=1024 --compile=1 --tensorcores=1 --dtype=bfloat16
"""

In [None]:
import os
import math
import glob
import struct
import inspect
from contextlib import nullcontext
from dataclasses import dataclass

import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F
import torch._inductor.config as config
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
from torch.distributed.optim import ZeroRedundancyOptimizer
import torch.distributed as dist

In [None]:
# -----------------------------------------------------------------------------
# PyTorch nn.Module definitions for the GPT-2 model

In [None]:
class NewGELU(nn.Module):
    """Careful there are a few versions of GeLU, this one is the exact one used by OpenAI""" # defines GeLU activation function, tanh version.
    def forward(self, input):
        return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))

In [None]:
# using a global to toggle flash-attention

# FLASH = 0 means the manual attention path below is used by default (it builds the full (T, T) score matrix).
# Setting FLASH = 1 switches CausalSelfAttention.forward to F.scaled_dot_product_attention, which can dispatch to a
# fused flash-attention kernel. That avoids materializing the (T, T) matrix, so it uses less memory on long sequences.
FLASH = 0

## TIL notes for self attention

Certainly! Here’s a brief overview of embeddings and attention heads, their roles in the model, and why they are used:

### Embeddings
**What are they?**
Embeddings are dense vector representations of data, typically words or tokens, that capture semantic meanings and relationships in a way that is usable by machine learning models. In the context of models like GPT-2, embeddings transform discrete input tokens (e.g., words) into continuous vectors.

**Where are they in the pipeline?**
In transformer models, embeddings are usually the first layer. For GPT-2, this includes:
- **Token Embeddings (`wte`)**: Convert each token into a vector.
- **Positional Embeddings (`wpe`)**: Encode the position of each token in the sequence, allowing the model to understand the order of tokens.

**What do they do and why?**
Embeddings map tokens to a high-dimensional space where similar tokens are closer together, facilitating the model's understanding of language patterns and context. The positional embeddings ensure the model can consider the order of words, crucial for generating coherent text and understanding language structure.
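
A minimal sketch of the lookup-and-add step described above, using NumPy arrays in place of the model's `wte` and `wpe` tables. The tiny sizes and random values are assumptions for illustration only; in the real model both tables are learned, and the `GPTConfig` in this notebook makes them far larger.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, block_size, n_embd = 10, 8, 4    # toy sizes, not GPT-2's

wte = rng.normal(size=(vocab_size, n_embd))  # token table: one row per token id
wpe = rng.normal(size=(block_size, n_embd))  # position table: one row per position

idx = np.array([3, 7, 3])                    # a 3-token sequence; token 3 appears twice
pos = np.arange(len(idx))                    # positions 0, 1, 2

tok_emb = wte[idx]                           # (T, n_embd) lookup by token id
pos_emb = wpe[pos]                           # (T, n_embd) lookup by position
x = tok_emb + pos_emb                        # what the first transformer block receives

print(x.shape)  # (3, 4)
```

Token 3 gets the same token embedding at positions 0 and 2, but the combined vectors differ because the positional rows differ. That difference is how the model can tell the two occurrences apart.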

### Attention Heads
**What are they?**
Attention heads are components of the attention mechanism within each transformer layer. They allow the model to focus on different parts of the input sequence when predicting each output token.

**Where are they in the pipeline?**
Attention heads are part of the self-attention layers in the transformer blocks. Each head computes its own attention scores independently, allowing the model to capture various aspects of the data.

**What do they do and why?**
Each attention head:
- Computes **queries**, **keys**, and **values** from the input embeddings using linear transformations.
- Uses the **queries** to assess the importance (or relevance) of other parts of the sequence (the **keys**) and aggregates the corresponding **values** based on these computed relevancies.

This mechanism enables the model to dynamically decide which parts of the input are important for each output token, allowing for contextually aware generation of text. The use of multiple heads increases the model's capacity to focus on different subspaces of the embedding dimension, capturing a wider array of relationships within the data.
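
The query/key/value steps above can be sketched for a single head in NumPy. This is an illustration under assumptions, not the notebook's implementation: the weights `W_q`, `W_k`, `W_v` are random stand-ins, the sizes are toy values, and the causal mask is included because GPT-2 only lets a token attend to itself and earlier tokens.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, C, hs = 4, 8, 2                  # sequence length, embedding size, head size (toy values)
x = rng.normal(size=(T, C))         # one embedding vector per token

# Hypothetical projection weights for one head (random here, learned in a real model)
W_q = rng.normal(size=(C, hs))
W_k = rng.normal(size=(C, hs))
W_v = rng.normal(size=(C, hs))

q, k, v = x @ W_q, x @ W_k, x @ W_v            # each (T, hs)
scores = q @ k.T / np.sqrt(hs)                 # (T, T): relevance of every key to every query
causal = np.tril(np.ones((T, T), dtype=bool))  # True where key position <= query position
scores = np.where(causal, scores, -np.inf)     # block attention to later tokens
att = softmax(scores)                          # each row is a set of weights summing to 1
y = att @ v                                    # (T, hs): weighted mix of the value vectors
```

The first token can only attend to itself, so its output row equals its own value vector. Later tokens receive a blend of their own value and the values of earlier tokens.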

### Summary
In the pipeline, embeddings serve as the initial transformation layer, translating raw input into a format suitable for processing by deeper network layers. Attention heads come into play within the self-attention mechanism, which is fundamental to transformers' ability to handle sequences of data. They allow the model to selectively focus on different segments of the input data, enhancing its ability to make contextually informed predictions.

query: i dont get how theres more embedding splitted across a number of head etc. is this due to gpu parallelism? and i get that embedding = token embd + positional embd.

but, how is the tokens splitted across the attention heads? i cant visualise it. can you show me an example of how this works. thx

---

Certainly! Let’s dive deeper into how embeddings are used within attention heads and the concept of splitting across multiple heads, which can indeed seem a bit abstract at first.

### Embeddings Recap
As you mentioned, the embedding for each token in a transformer model like GPT-2 is the sum of:
- **Token Embedding**: Converts each token (like a word) into a vector.
- **Positional Embedding**: Adds information about the position of each token in the sequence.

This results in a combined embedding that conveys both the meaning of the token and its position in the sequence.

### Attention Heads: Splitting the Embeddings
The purpose of having multiple attention heads is to allow the model to attend to different parts of the input sequence from different 'perspectives' or 'subspaces'. Here’s how it works step-by-step:

1. **Starting Point**: Assume each token in the sequence has been converted to a combined embedding vector (from token and positional embeddings). Suppose each vector has a dimensionality \(C\).

2. **Linear Projections**: Each attention head will transform the embedding vector into three different vectors: Queries (Q), Keys (K), and Values (V). These transformations are done through learned linear projections (typically implemented as a single linear layer split into three parts).

3. **Splitting Across Heads**:
   - For a transformer with multiple heads, the dimension \(C\) of each embedding vector is split into \(N\) parts, where \(N\) is the number of heads. Each part handles a fraction of the total dimension. For instance, if \(C = 768\) and \(N = 12\) (like in GPT-2), each head deals with a sub-vector of size 64.
   - Each head processes its portion of the vector independently. This means every head will produce its own Q, K, and V vectors from different subspaces of the embedding.

### Visualization Example
Let's visualize this with an example where \(C = 6\) (for simplicity) and \(N = 3\) heads. Each head will handle 2 dimensions:

- **Original Embeddings for a Sequence** (3 tokens):
  \[
  \begin{align*}
  \text{Token 1:} & \ [e_{11}, e_{12}, e_{13}, e_{14}, e_{15}, e_{16}] \\
  \text{Token 2:} & \ [e_{21}, e_{22}, e_{23}, e_{24}, e_{25}, e_{26}] \\
  \text{Token 3:} & \ [e_{31}, e_{32}, e_{33}, e_{34}, e_{35}, e_{36}]
  \end{align*}
  \]

- **Split Across 3 Heads**:
  - **Head 1** processes dimensions 1 and 2 of each token's embedding:
    - \( Q_1, K_1, V_1 \) from \([e_{11}, e_{12}]\), \([e_{21}, e_{22}]\), \([e_{31}, e_{32}]\)
  - **Head 2** processes dimensions 3 and 4:
    - \( Q_2, K_2, V_2 \) from \([e_{13}, e_{14}]\), \([e_{23}, e_{24}]\), \([e_{33}, e_{34}]\)
  - **Head 3** processes dimensions 5 and 6:
    - \( Q_3, K_3, V_3 \) from \([e_{15}, e_{16}]\), \([e_{25}, e_{26}]\), \([e_{35}, e_{36}]\)

Each head computes attention independently, focusing on different aspects (or 'slices') of the embedding space. This allows the model to integrate diverse perspectives when synthesizing the output at each step.

### GPU Parallelism
The splitting across heads is indeed highly conducive to parallel processing on GPUs. Each head operates independently of the others, which means their computations can be parallelized effectively, leading to significant performance gains especially with high-dimensional embeddings and long sequences.

This setup not only speeds up computations but also enriches the model's capacity to learn varied dependencies across the input data, leveraging multiple representational strategies simultaneously.

query: wait you sure this is real? that the initial token is splited across all 3 instead of Q, K, V all using the same input?


---

You've raised a great point for clarification. Let's refine the explanation regarding how the embeddings are processed through the attention mechanism, particularly in the context of dividing the input across multiple heads for computing queries (Q), keys (K), and values (V).

### How Embeddings Are Handled in Attention Heads

In transformer models like GPT-2, each head does **not** process only a specific part of the input token's embedding vector to produce its own Q, K, and V. Rather, the entire embedding vector is used by each head to generate its own Q, K, and V. However, each head transforms the embeddings into different representational subspaces, so while they all start with the same input, the projections they learn and the subsequent focus of their attention can differ significantly.

### Corrected Process:

1. **Input Embedding**: Every token in the sequence is first converted into a dense vector representation, which is the sum of its token embedding and its positional embedding.

2. **Projection in Attention Heads**:
   - Each attention head applies three distinct linear transformations to the entire embedding vector of each token, generating its own set of Q, K, and V:
     - `Q = W_Q * embedding`
     - `K = W_K * embedding`
     - `V = W_V * embedding`
   - These transformations are usually implemented as a single large linear layer (part of `self.c_attn` in your code) that outputs three times the embedding dimension, which is then split into Q, K, and V for each head.

3. **Dimensionality**:
   - If the model’s embedding size \(C\) is 768 and there are 12 heads, each head will output Q, K, and V vectors that are each \(768/12 = 64\) dimensions. Thus, each head transforms the full embedding vector into smaller, 64-dimensional Q, K, and V vectors.

4. **Independent Attention Calculations**:
   - Each head computes attention scores independently based on its Q, K, and V. The scores determine how much each element of the sequence (values) should be considered for each position in the output sequence.
   - This allows the model to focus on different parts of the sequence differently in each head, capturing various types of relationships in the data (e.g., syntactic, semantic).

### Visualization Example Revisited:

- **Given**: A single token's embedding vector is `[e1, e2, ..., e768]`.
- **In Each Head**: The entire vector is used but transformed differently:
  - **Head 1** produces:
    - `Q1 = W_Q1 * [e1, ..., e768]`
    - `K1 = W_K1 * [e1, ..., e768]`
    - `V1 = W_V1 * [e1, ..., e768]`
  - **Head 2** through **Head 12** similarly use the entire vector but project it using different weights.

By using the whole embedding vector in each head, the model can maintain a comprehensive understanding of each token while still allowing different heads to "specialize" in different types of relationships or features within the data. This design leverages the parallel processing power of GPUs effectively and is fundamental to the model's ability to generate nuanced and contextually appropriate outputs.
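
To make the corrected picture concrete, here is a small NumPy sketch of a fused projection followed by the split into heads. It assumes toy sizes and a random weight matrix `W_attn` standing in for `c_attn` (bias omitted). It checks that slicing the projected queries per head gives the same result as multiplying the full embedding by that head's own columns of the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, n_head = 3, 6, 3
hs = C // n_head                          # head size
x = rng.normal(size=(T, C))               # full embedding for each token

W_attn = rng.normal(size=(C, 3 * C))      # fused Q/K/V weight, random stand-in for c_attn

qkv = x @ W_attn                          # (T, 3C)
q, k, v = np.split(qkv, 3, axis=-1)       # each (T, C)
q_heads = q.reshape(T, n_head, hs).transpose(1, 0, 2)  # (n_head, T, hs)

# Head 1's queries come from columns hs:2*hs of the query part of W_attn.
W_q = W_attn[:, :C]
W_q_head1 = W_q[:, hs:2 * hs]             # (C, hs): reads all C input dimensions
q_head1 = x @ W_q_head1                   # same values as q_heads[1]

# Changing only the last input dimension still changes head 0's queries.
x_perturbed = x.copy()
x_perturbed[:, -1] += 1.0
q_head0_perturbed = (x_perturbed @ W_q)[:, :hs]
```

So the split happens on the output side of the projection. Each head owns a slice of the projected Q, K and V, but every slice is computed from the whole input vector.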

Query: then what happens to the separate 12 heads in the end? do they just dot product each other?


---

The outputs from the 12 attention heads in a transformer model like GPT-2 don't simply undergo a dot product with each other. Instead, there's a more structured way of integrating these multiple streams of outputs to enhance the model's ability to capture diverse aspects of the information. Here's what happens step-by-step after each of the attention heads processes its input:

### Combining the Outputs of Attention Heads

1. **Independent Attention Computations**:
   - Each head computes its own queries (Q), keys (K), and values (V).
   - The attention mechanism for each head calculates a set of attention scores based on the dot product of queries and keys, which is then used to weight the values. This results in an output vector from each head that emphasizes different parts of the input sequence according to the head's learned attention pattern.

2. **Concatenation of Head Outputs**:
   - The output vectors from each head are concatenated back together. If each head produces a vector of dimension \( \frac{C}{N} \) (where \( C \) is the embedding size and \( N \) is the number of heads), the concatenation of these vectors restores the original dimensionality \( C \). For example, if \( C = 768 \) and \( N = 12 \), each head outputs a vector of size 64, and concatenating these vectors results in a vector of size 768.

3. **Final Linear Projection**:
   - The concatenated vector is then passed through another learned linear projection (often represented by `self.c_proj` in your code). This step is crucial because it integrates the information from all the heads into a single output vector per token. This final projection layer combines the diverse representations produced by different heads into a coherent output that can be used in subsequent layers or for generating predictions.
   - This projection not only serves to mix the information from different heads but also maps the concatenated output back to the desired embedding dimensionality, preparing it for the next layer or the output layer of the model.
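
The concatenate-then-project step can be sketched in NumPy as follows. The head outputs and the weight `W_proj` are random stand-ins (assumptions for illustration, bias omitted). The second computation shows an equivalent way to read the same operation: each head is multiplied by its own block of rows in `W_proj`, and the per-head results are summed.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_head, hs = 3, 4, 2
C = n_head * hs

head_out = rng.normal(size=(n_head, T, hs))   # stand-in for each head's attention output
W_proj = rng.normal(size=(C, C))              # random stand-in for c_proj

# Concatenate along the feature axis: (n_head, T, hs) -> (T, n_head, hs) -> (T, C)
y_cat = head_out.transpose(1, 0, 2).reshape(T, C)
y = y_cat @ W_proj                            # (T, C): heads mixed into one vector per token

# Equivalent view: head h uses rows h*hs:(h+1)*hs of W_proj, and the results add up.
y_sum = sum(head_out[h] @ W_proj[h * hs:(h + 1) * hs] for h in range(n_head))
```

The two results match, which is why the heads are never multiplied with each other: the output projection is where their information gets combined.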

### Why This Process?

This multi-headed approach allows the model to simultaneously focus on different types of relationships in the data—from very local (like specific syntactic ties) to more global (like thematic elements or long-range dependencies). By recombining the outputs, the model can leverage these diverse insights effectively, ensuring that no single aspect dominates unless it is contextually relevant.

This design is one of the reasons transformers are particularly effective for tasks involving complex dependencies within data, such as natural language understanding and generation, where different parts of a sequence can have various types of influence on each other.

### Visualization

Imagine you have a sequence of three words processed by a transformer model with three attention heads. Each head processes the sequence and focuses on different relationships. The outputs from these heads might look something like this for each word:

- **Head 1 Output**: [Focus on next word relationships]
- **Head 2 Output**: [Focus on whole sentence theme]
- **Head 3 Output**: [Focus on grammatical structure]

These vectors are then concatenated and passed through a final projection, synthesizing a comprehensive representation that balances all these perspectives. This processed output is what feeds into subsequent layers or contributes to the final task-specific computations, such as predicting the next word in a sequence.

## Back to coding

In [None]:
class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0 # n_embd must be divisible by n_head so that every head gets an equal slice
                                                  # of size n_embd // n_head.

                                                  # TIL: every head sees the full input embedding. It is the projected Q, K and V that are split
                                                  # into n_head slices. Each head runs attention on its slice, then the outputs are concatenated.

        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.c_proj.LLMC_RESIDUAL_SCALE_FLAG = 1 # flag for scaled-down init: this layer's output is added to the residual stream
        # head count and embedding size, reused in forward()
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # not really a 'bias', more of a mask, but following the OpenAI/HF naming though
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size)).view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)
        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        qkv = self.c_attn(x) # projects into qkv from input
        q, k, v = qkv.split(self.n_embd, dim=2) # splits it into 3 tensors
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        if FLASH:
            # flashattention
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        else:
            # manual implementation of attention
            # this materializes the large (T,T) matrix for all the queries and keys
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1))) # scale scores by 1/sqrt(head_size), same as multiplying by head_size**-0.5
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side
        # output projection
        y = self.c_proj(y)
        return y

## Notes

query: ah yeah here we go it is. any components with "LLMC_RESIDUAL_SCALE_FLAG = 1" will be treated like a residual. does this changes anyt?

---

Thanks for sharing the weight initialization function. The inclusion of the `LLMC_RESIDUAL_SCALE_FLAG` in the initialization logic provides more context on how these specific layers are treated differently, especially in terms of initialization. Here's how this affects the overall behavior and processing in the model:

### Understanding `LLMC_RESIDUAL_SCALE_FLAG`

In your code, the `LLMC_RESIDUAL_SCALE_FLAG` is checked during the initialization of weights for layers within your model. Here’s how it impacts the initialization process:

1. **Scaled Initialization**: If a layer is flagged with `LLMC_RESIDUAL_SCALE_FLAG` (in this notebook, `self.c_proj` in both `CausalSelfAttention` and `MLP`), the standard deviation for initializing its weights is `0.02 / math.sqrt(2 * self.config.n_layer)` rather than the unscaled `0.02`. The flagged layers are the ones whose outputs are added directly to the residual stream, so scaling them down controls how much each one contributes at the start of training:
   - **Enhances Training Stability**: The residual stream is a running sum of every block's output. Smaller initial weights in the flagged layers keep the variance of that sum from growing with depth.
   - **Why `2 * n_layer`**: Each block adds two outputs to the residual stream, one from attention and one from the MLP, so a model with `n_layer` blocks has `2 * n_layer` flagged layers. Dividing the standard deviation by the square root of that count keeps the total variance of their summed contributions roughly independent of depth.

2. **Purpose of Scaling**:
   - In deep learning, especially with networks that use residual connections, it’s crucial to maintain a balance in the flow of gradients during backpropagation. Scaled initialization as applied here ensures that the updates to weights are not disproportionately large, which can destabilize learning, especially in the early phases.

3. **Impact on Model Behavior**:
   - This tailored initialization approach is crucial for ensuring that the contributions from the residual connections are effectively integrated without overwhelming the network’s learning process. It subtly tunes the network to leverage the benefits of deep architectures (more layers) while mitigating associated risks like vanishing or exploding gradients.
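
The scaling rule quoted above is easy to check with plain arithmetic. The helper `init_std` below is a hypothetical name used only for this sketch; it reproduces the formula from the notes, not the notebook's actual initialization function.

```python
import math

def init_std(n_layer, is_residual_proj, base_std=0.02):
    """Standard deviation for weight init, following the rule quoted in the notes."""
    if is_residual_proj:
        # 2 * n_layer: each block adds two outputs (attention and MLP) to the residual stream
        return base_std / math.sqrt(2 * n_layer)
    return base_std

n_layer = 12  # matches GPTConfig.n_layer in this notebook
std_regular = init_std(n_layer, is_residual_proj=False)
std_scaled = init_std(n_layer, is_residual_proj=True)

print(f"{std_regular:.5f} {std_scaled:.5f}")  # 0.02000 0.00408
```

With 12 layers the flagged projections start roughly five times smaller than the other weights. Summing the variances of all 24 flagged layers gives back the unscaled variance, which is the point of dividing by `sqrt(2 * n_layer)`.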

### Broader Implications

By using the `LLMC_RESIDUAL_SCALE_FLAG`, the model initializes components differently depending on their role: layers that write into the residual stream start with smaller weights than the rest. This matters more as depth grows, because more layer outputs are summed into the same stream.

### Conclusion

The flag does not change the forward computation. It only changes the starting weights of the flagged layers, which in turn changes how training behaves early on and helps deeper configurations train stably.

query: is MLP just a different name for feedforward? if not what does it do and what is it for


---

Yes, within the context of neural network architectures, particularly in transformers, the term "MLP" (Multi-Layer Perceptron) is often used to refer to a specific component known as a feedforward neural network. However, it's important to clarify the traditional and specific uses of these terms:

### Traditional Definitions:
- **MLP (Multi-Layer Perceptron)**: Traditionally, an MLP is a type of neural network that consists of multiple layers of perceptrons (basic neural network units) with non-linear activation functions, except for the input nodes. It usually includes one or more hidden layers between the input and output layers. MLPs are capable of learning non-linear models and are used widely across various types of regression and classification problems.

- **Feedforward Neural Network**: This is a broader category of neural networks where connections between the nodes do not form a cycle. This means the information moves in only one direction—forward—from the input nodes, through the hidden nodes (if any), and to the output nodes. There are no cycles or loops in the network. MLPs are a subset of feedforward neural networks.

### In the Context of Transformers:
- **MLP as Feedforward Network**: In transformer models, the term "MLP" is used specifically to describe the feedforward neural network component that follows the self-attention mechanism in each layer of the transformer. This MLP usually has two linear layers:
  - The first linear layer expands the dimensionality of the input, often by a factor (such as 4 times the size of the model’s embedding dimension).
  - This is followed by a non-linear activation function (like ReLU or GELU).
  - The second linear layer then projects the output back to the original embedding dimension.

### Role and Function of MLP in Transformers:
- **Integration and Transformation**: The MLP component in transformers serves to further process each position of the input sequence independently of the others. After the self-attention layer aggregates information across the input sequence, taking into account the relationship between different positions, the MLP is used to integrate and transform this information at each position separately.
  
- **Complexity and Non-linearity**: The expansion and subsequent contraction of dimensions, combined with the non-linear activation, allow the MLP to capture complex patterns in the data. This is crucial for tasks that require understanding intricate relationships and dependencies, as often encountered in natural language processing.

- **Enhancement of Model Capacity**: The inclusion of MLPs in each transformer layer significantly increases the model’s capacity (its ability to learn a wide variety of functions). This is essential for the performance on complex tasks that transformers are often employed for, such as machine translation, text generation, and various other forms of sequence-to-sequence modeling.
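
A short NumPy sketch of the expand, activate, project pattern, also checking the claim that the MLP treats each position independently. The weights are random stand-ins and the sizes are toy values (assumptions for illustration); the activation uses the same tanh formula as `NewGELU` in this notebook.

```python
import numpy as np

def gelu_tanh(z):
    # tanh approximation of GELU, same formula as NewGELU in this notebook
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

rng = np.random.default_rng(0)
T, C = 5, 4                                   # sequence length and embedding size (toy values)
W_fc, b_fc = rng.normal(size=(C, 4 * C)), rng.normal(size=4 * C)
W_proj, b_proj = rng.normal(size=(4 * C, C)), rng.normal(size=C)

def mlp(x):
    h = gelu_tanh(x @ W_fc + b_fc)            # expand C -> 4C, then apply the non-linearity
    return h @ W_proj + b_proj                # project 4C -> C

x = rng.normal(size=(T, C))
y_all = mlp(x)                                     # whole sequence in one call
y_each = np.stack([mlp(x[t]) for t in range(T)])   # one position at a time
```

Both routes give the same output, so no information moves between positions inside the MLP. Mixing across positions is the job of the attention layer that precedes it.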

### Summary:
In summary, while the terms "MLP" and "feedforward neural network" can be used somewhat interchangeably in the context of transformers, an MLP in this setting specifically refers to a two-layer feedforward network with a non-linear activation used to process the output of the self-attention layers within each transformer block. This component plays a crucial role in enhancing the transformer's ability to process and integrate information across the sequence effectively.

query: oh wait a question about the c_proj which is a residual connection. so. in feed forward or MLP for example, if its a residual connection. how is its output now different? like as in, is the entire output of MLP now get residual treatment? or its partially treated as residually and mostly still affects the next layer etc

---

In the context of transformers and MLPs specifically, a residual connection means that the output of the MLP is added back to its original input (the input before the MLP processing). Layer normalization accompanies this addition: GPT-2 applies it to the MLP's input (pre-LN), while the original Transformer applied it to the sum (post-LN). This method is a key component in the architecture of transformers and has several important implications for how the network processes information and learns during training.

### How Residual Connections Work in MLP within Transformers:

1. **Processing in MLP**:
   - The input \( x \) to the MLP is first transformed by an expansion layer (increases dimensionality), followed by a non-linear activation (like GELU), and finally a contraction layer (reduces dimensionality back to the original size).
   - Let's denote the output of this process as \( \text{MLP}(x) \).

2. **Adding the Residual**:
   - The output of the MLP \( \text{MLP}(x) \) is then added to the original input \( x \). This operation is defined as:
     \[
     y = x + \text{MLP}(x)
     \]
   - Here, \( y \) represents the residual output, which is the sum of the original input and the transformed output.

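3. **Layer Normalization**:
   - Layer normalization is applied alongside the residual addition. GPT-2, and the `Block` class in this notebook, normalize the input to the MLP rather than the sum (the pre-LN arrangement), so the update from step 2 becomes, in full:
     \[
     y = x + \text{MLP}(\text{LayerNorm}(x))
     \]
   - The original Transformer normalized after the addition instead (post-LN): \( \text{output} = \text{LayerNorm}(x + \text{MLP}(x)) \).
   - Either way, normalization stabilizes learning by rescaling each token's features to a mean of zero and a standard deviation of one (before a learned scale and shift), which reduces the risk of vanishing or exploding gradients.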

### Impact of Residual Connections:

- **Preservation of Information**: Residual connections help in preserving the information from the input throughout the layers of the network. By adding the input directly to the output, the network can maintain essential features that might be lost due to transformations within the MLP.

- **Easing the Training of Deep Networks**: They allow gradients to flow directly through the network via the added pathways, facilitating the training of very deep networks by combating problems associated with vanishing gradients.

- **Enhancing Learning Capability**: By combining the original information and the new features learned by the MLP, the network can refine its predictions, leading to potentially better performance and more nuanced understanding.

### How the Output is Affected:

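- **Entire Output Treated Residually**: In transformers, typically the entire output of the MLP gets the "residual treatment". This means the whole transformed output is added back to the original input, not just a portion of it. Note that `c_proj` is not itself the residual connection: it is an ordinary linear layer, and the flag on it only marks that its output is what gets added to the residual stream.

- **Influence on Subsequent Layers**: The residual output is then passed to the next transformer layer (if any) or used in subsequent computations. In the pre-LN arrangement used by GPT-2, it is passed on un-normalized, and the next layer normalizes it at its own input. This output now contains both the original and newly learned features, affecting how the subsequent layers process the input.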
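
A small NumPy sketch of a residual branch in the arrangement used by the `Block` class in this notebook, where normalization is applied to the sublayer's input. The helper names `layer_norm` and `residual_branch` and the random weight `W` are assumptions for illustration, and the learned scale and shift of a real LayerNorm are left out.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each row to zero mean and unit variance (learned scale and shift omitted)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_branch(x, sublayer):
    # pre-LN arrangement, as in Block.forward: x + sublayer(LayerNorm(x))
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
T, C = 3, 8
x = rng.normal(size=(T, C))
W = rng.normal(size=(C, C))                   # random stand-in for the sublayer's weights

y = residual_branch(x, lambda h: h @ W)                       # input plus a full update
y_zero = residual_branch(x, lambda h: h @ np.zeros((C, C)))   # sublayer contributes nothing
```

When the sublayer outputs zeros, the branch returns its input unchanged, which is the "preservation of information" property. Otherwise the whole sublayer output is added, so `y - x` is exactly what the sublayer produced.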

### Summary:
In summary, the residual connection around the MLP of a transformer model means that the entire output of the MLP is added to its own input, so the model can draw on both new and existing information at every layer. This approach is fundamental to the design of modern neural architectures like transformers, and it is a large part of what makes very deep stacks of layers trainable.

## Back to Coding

In [None]:
class MLP(nn.Module): # in transformers, "MLP" and "feed-forward network" name the same component (an MLP is one kind of feed-forward network).
                      # This one expands to 4 * n_embd, applies GELU, then projects back to n_embd. REFER TO NOTES ABOVE.
                      # The wider hidden layer allows more complex relationships to be captured.

    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd) # expand n_embd -> 4 * n_embd, giving the non-linearity a wider hidden layer
        self.gelu    = NewGELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd)
        self.c_proj.LLMC_RESIDUAL_SCALE_FLAG = 1 # flag for scaled-down init: this layer's output is added to the residual stream

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x) # project back to n_embd; the residual add itself happens in Block.forward
        return x

### Purpose and Functionality

This setup is typical of GPT-2 style transformer blocks, where each component layer (like self-attention and MLP) is wrapped with a residual connection, and layer normalization is applied to the component's input before it runs (the pre-LN arrangement). This architecture helps in several ways:

- **Enhanced Gradient Flow**: Residual connections allow gradients to flow directly through the network during backpropagation, which can significantly improve training efficiency and enable training of very deep networks.

- **Feature Preservation**: By adding the input directly to the output of processing units (attention and MLP), the network can preserve important features that might otherwise be lost or overly transformed.

- **Capability Enrichment**: Layer normalization and subsequent processing ensure that the network can adaptively scale and shift the input features, refining the model's ability to handle a wide range of input distributions and feature scales.

Overall, the `Block` class represents a fundamental building block of transformer models, encapsulating the essential mechanisms that make transformers powerful for handling sequential data across various tasks such as language modeling, text generation, and more.

In [None]:
class Block(nn.Module): # implementation of a single layer (block) in a transformer.

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x)) # input x is first normalised by layer normalisation, then passed to self.attn for self-attention processing
        x = x + self.mlp(self.ln_2(x)) # result of first residual connection is first normalised by self.ln_2. Then fed into mlp.
        return x

In [None]:
# -----------------------------------------------------------------------------
# The main GPT-2 model

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768

## Notes - Forward method


The `forward` method handles two different scenarios: when targets are provided (training), and when targets are not provided (typically inference).

### When Targets are Provided (Training)

In this scenario, the model computes both the logits and the loss.

- **Logits Calculation**: The `self.lm_head(x)` call computes the logits across the entire sequence. Here, `x` is the output of the final layer normalization (`ln_f`), applied after the last transformer block, and `self.lm_head` is a linear layer that projects the last dimension of `x` from `n_embd` to the vocabulary size. This projection yields one score per vocabulary token at each position in the sequence.
  
- **Logits Reshaping**: `logits.view(-1, logits.size(-1))` reshapes the logits tensor to a 2D tensor where each row corresponds to a token in the sequence across all batches. This reshaping is standard practice before applying the cross-entropy loss, which expects inputs in this form.

- **Loss Calculation**: `F.cross_entropy` computes the loss between the reshaped logits and the reshaped targets. The `ignore_index=-1` parameter tells the loss function to ignore certain target indices (usually padding tokens in NLP tasks) so that they do not contribute to the loss. This helps in focusing the model's learning on meaningful tokens only.
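
To make the reshaping and `ignore_index` behaviour concrete, here is a small sketch with random toy tensors. The sizes `B`, `T`, `V` are arbitrary assumptions, and the random `logits` stand in for `self.lm_head(x)`. It checks that the loss equals the mean negative log-probability over the non-ignored positions only.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, V = 2, 3, 5                       # toy batch size, sequence length, vocab size
logits = torch.randn(B, T, V)           # stand-in for self.lm_head(x)
targets = torch.randint(0, V, (B, T))
targets[0, -1] = -1                     # mark one position to be ignored

flat_logits = logits.view(-1, logits.size(-1))   # (B*T, V)
flat_targets = targets.view(-1)                  # (B*T,)
loss = F.cross_entropy(flat_logits, flat_targets, ignore_index=-1)

# Manual check: average the negative log-probability of the target token
# over the positions that are not marked with -1.
keep = flat_targets != -1
log_probs = F.log_softmax(flat_logits, dim=-1)
picked = log_probs[keep].gather(1, flat_targets[keep].unsqueeze(1)).squeeze(1)
manual = -picked.mean()
print(flat_logits.shape, loss.item(), manual.item())
```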

### When Targets are Not Provided (Inference)

This scenario typically occurs during inference, where the goal is to generate predictions rather than train the model.

- **Optimized Logits Calculation**: Instead of computing the logits for the entire sequence, the logits are calculated only for the last position in the sequence: `self.lm_head(x[:, [-1], :])`. This optimization is significant for tasks like language modeling where only the next token prediction is needed. Using `x[:, [-1], :]` extracts the last set of features from the transformer output across all batches, which corresponds to the last token's features.
  
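  - **List Indexing to Preserve Time Dimension**: Indexing with a list (`x[:, [-1], :]`) rather than a plain integer keeps the time dimension, even though it holds a single timestep. The result therefore has shape `(batch_size, 1, n_embd)`, and the logits come out as `(batch_size, 1, vocab_size)`, the same rank as in the training path. (`self.lm_head` itself would also accept the 2D result of `x[:, -1, :]`; the list index just keeps the output layout consistent.)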

- **No Loss Calculation**: Since no targets are provided, there is no need to compute a loss, and thus `loss` is set to `None`.
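
The difference between list indexing and integer indexing is easy to see on a toy tensor. In this sketch the sizes and the standalone `lm_head` layer are assumptions made for illustration only:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C, V = 2, 4, 8, 11                 # toy batch, time, embedding and vocab sizes
x = torch.randn(B, T, C)                 # stand-in for the output of ln_f
lm_head = nn.Linear(C, V, bias=False)

last_keepdim = x[:, [-1], :]             # list index keeps the time dim: (B, 1, C)
last_dropdim = x[:, -1, :]               # integer index drops it:        (B, C)

logits_last = lm_head(last_keepdim)      # (B, 1, V)
logits_all = lm_head(x)                  # (B, T, V)

# The last-position logits agree with the last slice of the full computation.
same = torch.allclose(logits_last, logits_all[:, -1:, :], atol=1e-5)
print(last_keepdim.shape, last_dropdim.shape, logits_last.shape, same)
```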

### Summary of Differences

- **Training Mode**: Computes logits and loss for all positions in the sequence to facilitate learning based on the entire sequence of targets.
- **Inference Mode**: Computes logits only for the last position in the sequence to optimize performance, as only the prediction for the next token is necessary.

This optimization during inference reduces computational overhead significantly because it limits the number of operations to just what is necessary for generating the next token prediction, rather than processing unnecessary positions that are not of interest at the time.

query: no. from the start. starting from the input parameter to the end. for both target and inference

---

Let's go through the shape and contents of `logits` throughout the `forward` function in detail, tracking its evolution from the input all the way through to the end of the function, both when targets are provided (training mode) and when they're not (inference mode).

### Starting with Input Parameters
- **Input `idx`**: Typically has shape `(batch_size, sequence_length)`, containing indices of tokens for each position in each sequence in the batch.

### Token and Position Embeddings
1. **Token Embeddings (`tok_emb`)**:
   - Generated by: `tok_emb = self.transformer.wte(idx)`
   - **Shape**: `(batch_size, sequence_length, n_embd)`. Each token index from `idx` is transformed into an embedding vector of dimension `n_embd`.

2. **Positional Embeddings (`pos_emb`)**:
   - Generated by: `pos_emb = self.transformer.wpe(pos)`
   - **Shape**: `(sequence_length, n_embd)`. Each position in the sequence gets an embedding vector of the same dimension as the token embeddings.

### Combining Embeddings
3. **Combined Embeddings (`x`)**:
   - Calculation: `x = tok_emb + pos_emb`
   - **Shape**: `(batch_size, sequence_length, n_embd)`. The positional embeddings are broadcasted across the batch size and added to the token embeddings.
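
A quick shape trace of the embedding step, using toy sizes rather than the real GPT-2 values. The standalone `wte` and `wpe` layers here are stand-ins for `self.transformer.wte` and `self.transformer.wpe`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 100, 16, 8    # toy sizes, not the GPT-2 values
wte = nn.Embedding(vocab_size, n_embd)         # token embedding table
wpe = nn.Embedding(block_size, n_embd)         # position embedding table

idx = torch.randint(0, vocab_size, (2, 5))     # (b, t) token indices
b, t = idx.size()
pos = torch.arange(0, t, dtype=torch.long)     # (t,)

tok_emb = wte(idx)       # (b, t, n_embd)
pos_emb = wpe(pos)       # (t, n_embd)
x = tok_emb + pos_emb    # pos_emb is broadcast over the batch dimension
print(tok_emb.shape, pos_emb.shape, x.shape)
```

Every sequence in the batch receives the same positional vectors, which is why `pos_emb` does not need a batch dimension.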

### Passing through Transformer Blocks
4. **After Last Transformer Block**:
   - `x` is modified by each block in `self.transformer.h` and finally by `self.transformer.ln_f(x)`.
   - **Shape** remains `(batch_size, sequence_length, n_embd)` through all transformations.

### Logit Calculation
5. **Logits Calculation**:
   - **If targets are provided (Training Mode)**:
     - Calculation: `logits = self.lm_head(x)`
     - **Shape**: `(batch_size, sequence_length, vocab_size)`. The linear layer `lm_head` projects each vector in `x` to the vocabulary size, creating a logit for each possible token at each position in each sequence.

   - **If targets are not provided (Inference Mode)**:
     - Optimization: Only calculate for the last position.
     - Calculation: `logits = self.lm_head(x[:, [-1], :])`
     - **Shape**: `(batch_size, 1, vocab_size)`. Only the logits for the last position in each sequence are computed.

### Visual Representation
To better visualize this:
- **Training Mode**: Logits represent a 3D tensor where:
  - Dimensions represent batch items, sequence positions, and scores for each vocabulary token.
  - Each row along the sequence-position dimension holds unnormalized scores over all tokens in the vocabulary; a softmax turns it into a probability distribution.

- **Inference Mode**: Logits are simplified to focus only on the last position:
  - This reduces the tensor to only considering the last token's potential following tokens.
  - Useful for tasks like text generation where the immediate next token prediction is needed.

### Return of Logits and Loss
- **Return**: Depending on `return_logits`, the logits may or may not be returned. If not needed, setting `logits = None` helps save memory.
- **Loss Calculation**: Only in training mode. The shape for cross-entropy input requires logits to be 2D (`(batch_size * sequence_length, vocab_size)`), so logits are reshaped and compared against similarly reshaped targets to compute the loss.

This breakdown should help visualize the transformation and usage of `logits` across different modes and points in the function.

query: can you explain what does each sequence length contain at this point? like for all previous and target token (prediction)?

---

Certainly! Let's explore what the `logits` tensor represents at the point where it has the shape `(batch_size, sequence_length, vocab_size)` during training mode. This representation is crucial for understanding how the model predicts the next token at each position in a sequence based on the previous tokens.

### Structure of the Logits Tensor
- **Batch Size (`batch_size`)**: Represents how many sequences are being processed simultaneously. Each sequence in the batch is processed independently by the model.
  
- **Sequence Length (`sequence_length`)**: Corresponds to the number of tokens in each sequence. For each token in a sequence, the model outputs a prediction.
  
- **Vocabulary Size (`vocab_size`)**: Represents the number of possible tokens that can be predicted. Each position in the sequence has one score (logit) per vocabulary entry; a softmax over these scores gives the probability of each token being the next token.
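
Logits are unnormalized scores, and a softmax over the last dimension turns them into probabilities. A minimal sketch with random toy logits (the sizes are arbitrary assumptions):

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
logits = torch.randn(2, 3, 7)              # (batch_size, sequence_length, vocab_size)
probs = F.softmax(logits, dim=-1)          # one distribution per position

row_sums = probs.sum(dim=-1)               # each distribution sums to 1
next_token_guess = probs.argmax(dim=-1)    # most likely next token at every position
print(row_sums)
print(next_token_guess.shape)
```

Because softmax preserves ordering, taking the `argmax` of the probabilities picks the same token as taking the `argmax` of the raw logits.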

### Contents at Each Sequence Position
For a given batch of sequences, here’s what the logits tensor contains:

- **First Position**: At the very start of a sequence (position 0), the model generates logits based only on the first token. Some models prepend a special start token such as `<s>`; GPT-2 has no dedicated start token and typically uses `<|endoftext|>` as a document delimiter. After a softmax, these logits give the probability distribution over the entire vocabulary for the token that follows the first one.

- **Subsequent Positions (1 to N-1)**: For each subsequent position in the sequence, the model looks at all the previous tokens (from the start up to the current position) and predicts the next token. The logits at each position are influenced by the embeddings of the previous tokens and the learned contextual relationships captured by the transformer blocks.

  - **Example**: If the sequence is `[The, quick, brown, fox]`, the logits at position 2 (considering zero indexing, this corresponds to the token `brown`) will contain the model's predictions for what the next token could be after `[The, quick, brown]`. This prediction is based on the learned patterns and dependencies that the model has captured during training.

### Target Tokens
- **Targets**: During training, each position's logits are compared against the actual next token in the sequence (provided as part of the `targets` tensor). For instance, if the input sequence token at position 2 is `brown`, and the actual next token (target) is `fox`, the model's logits at position 2 will be used to calculate the loss with respect to the target token `fox`.

- **Loss Calculation**: The cross-entropy loss is calculated between the logits (predictions) and the target tokens. This loss provides a measure of how well the model's predictions match the actual next tokens in the sequences. The model's weights are then adjusted during backpropagation to minimize this loss, improving the model's predictions over iterations.

### Inference Mode
- During inference (when `targets` is `None`), only the logits for the last position in the sequence are computed if the inference-time optimization is used. This is typically the case when generating text, where you only need the next token prediction after the given sequence.

### Summary
In summary, each sequence length in the `logits` tensor contains a full set of predictions (a score for each possible vocabulary token) for the next token, based on all preceding tokens in the sequence up to that point. These predictions are used either to calculate loss against known targets during training or to select the next token during inference.

query: oh wait so each stuff in dimension T is prediction for their respective target tokens? and it keep moving downwards?

---

Yes, that's exactly right! Let's clarify how this works in the context of the sequence processing in a model like GPT:

### Sequence Processing and Prediction Dynamics

- **Dimension T (`sequence_length`)**: Each element in this dimension corresponds to a particular position in the input sequence. For each position `t`, the model outputs a set of logits that represent the probability distribution over the vocabulary for the next token, given the sequence of tokens seen up to position `t`.

### How Predictions are Generated Step-by-Step
1. **Initial Token**:
   - At position 0, the model predicts the next token based on the initial token of the sequence. If a special start token is used, the prediction at position 0 will be based on this start token.

2. **Subsequent Tokens**:
   - For each subsequent position \( t \), the model uses the tokens \( \text{tokens}_0 \) through \( \text{tokens}_{t} \) (everything up to and including position \( t \)) to predict the token at position \( t+1 \).
   - This means the logits at each position \( t \) are the model's predictions for what the next token will be, following the sequence of tokens seen up to that point.

### Visual Example
Imagine an input sequence for training: `[The, cat, sat]`. The corresponding target tokens, shifted by one position, might be `[cat, sat, on]`.

- **At position 0**:
  - **Input**: `[The]`
  - **Prediction**: Predict the next token (likely target: `cat`).
  - **Logits at position 0**: Represent scores over the entire vocabulary for predicting `cat`.

- **At position 1**:
  - **Input**: `[The, cat]`
  - **Prediction**: Predict the next token after `cat` (likely target: `sat`).
  - **Logits at position 1**: Represent scores over the entire vocabulary for predicting `sat`.

- **At position 2**:
  - **Input**: `[The, cat, sat]`
  - **Prediction**: Predict the next token after `sat` (likely target: `on`).
  - **Logits at position 2**: Represent scores over the entire vocabulary for predicting `on`.
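
The shift between inputs and targets can be written out in a few lines. In this sketch the token ids are made up for illustration (a real run would use the GPT-2 tokenizer's ids):

```python
import torch

# Hypothetical token ids standing in for [The, cat, sat, on].
tokens = torch.tensor([10, 11, 12, 13])

x = tokens[:-1]   # inputs:  [The, cat, sat]
y = tokens[1:]    # targets: [cat, sat, on], i.e. the inputs shifted by one

# At position i the model sees x[0..i] and is trained to predict y[i].
pairs = [(x[: i + 1].tolist(), y[i].item()) for i in range(len(x))]
for context, target in pairs:
    print(context, "->", target)
```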

### Training and Loss Calculation
During training:
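- Each set of logits at position \( t \) is compared to the target token at \( t \) using a loss function like cross-entropy. Because the targets are the inputs shifted by one, the target at \( t \) is the input token at position \( t+1 \). This comparison quantifies how well the model's predictions match the actual next tokens.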

### Inference (Generating Text)
During inference, especially in text generation:
- The model might only need the final logits if you are generating one token at a time. After generating a token, it can be fed back into the model as part of the input for generating the next token.
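
The feed-back loop can be sketched with greedy decoding. The function `toy_forward` below is a hypothetical stand-in for the model: it returns last-position logits of shape `(b, 1, V)` and simply favours "previous token plus one". The loop would look the same with a real model in its place.

```python
import torch
import torch.nn.functional as F

V = 6  # toy vocabulary size (assumption)

def toy_forward(idx):
    """Hypothetical stand-in for the model: last-position logits of shape (b, 1, V).
    The highest score goes to (last token + 1) mod V."""
    last = idx[:, -1]
    nxt = (last + 1) % V
    logits = F.one_hot(nxt, num_classes=V).float() * 5.0
    return logits.unsqueeze(1)

def generate_greedy(idx, max_new_tokens):
    for _ in range(max_new_tokens):
        logits = toy_forward(idx)                                  # (b, 1, V)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # (b, 1)
        idx = torch.cat([idx, next_id], dim=1)                     # feed the new token back in
    return idx

out = generate_greedy(torch.tensor([[3]]), max_new_tokens=4)
print(out.tolist())
```

Sampling from `softmax(logits)` instead of taking the `argmax` would give stochastic generation with the same loop structure.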

### Conclusion
In essence, each position in the sequence length dimension (Dimension T) is associated with predictions for the next token, based on all prior tokens in the sequence. This structured prediction mechanism allows the model to generate contextually relevant text by building on the tokens it has sequentially processed.

query: then what about beam search

---

Beam search is an advanced decoding technique used in natural language processing models, especially in sequence generation tasks like machine translation, text summarization, and conversational agents. It aims to find the most likely sequence of tokens by considering multiple potential paths at each step and selecting the best ones to explore further. This is a significant enhancement over simpler methods like greedy decoding, which always chooses the most probable token at each step without considering the global context.

### How Beam Search Works

Beam search works by keeping a fixed number of candidates (paths) at each step of the sequence generation, which is referred to as the "beam width" or "beam size." Here's a step-by-step breakdown of the process:

1. **Initialization**:
   - Begin with the start token and initiate the beam with the most probable initial tokens according to the model's prediction.

2. **Iteration Over Sequence**:
   - At each time step, for each candidate in the beam, expand the sequence by one token. This involves calculating the next token probabilities using the model (e.g., GPT or another language model).
   - Each candidate sequence is extended by all possible tokens in the vocabulary, and the probabilities of these extended sequences are calculated based on the logits produced by the model.

3. **Scoring and Pruning**:
   - Evaluate the complete set of expanded sequences. Each sequence is scored, typically by its cumulative log probability. Because every added token lowers that sum, raw scores favor shorter sequences; length normalization (for example, dividing the score by the sequence length) is often applied to offset this bias.
   - From this set, select the top sequences with the highest scores. The number of sequences kept is determined by the beam width. All other sequences are pruned (discarded).

4. **Repeat or Terminate**:
   - This process is repeated for each subsequent token until the end of the sequence is reached (typically marked by a special end token) or until a predetermined maximum sequence length is met.
   - For each step, only the best-performing sequences within the beam are carried forward to the next step.

5. **Selection of Final Output**:
   - Once the beam search completes (when reaching the end token or the maximum length), the highest scoring sequence within the beam is selected as the final output.
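
The steps above can be sketched in plain Python. Everything here is a toy assumption: `NEXT_PROBS` is a hand-written table standing in for a language model, token `0` is the start token, token `3` is the end token, and scores are raw cumulative log probabilities with no length normalization.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the last token.
NEXT_PROBS = {
    0: {1: 0.6, 2: 0.4},
    1: {3: 0.55, 2: 0.45},
    2: {3: 0.9, 1: 0.1},
    3: {},
}

def beam_search(beam_width, max_len=5, start=0, end=3):
    beams = [([start], 0.0)]          # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every possible next token.
        candidates = []
        for tokens, score in beams:
            for tok, p in NEXT_PROBS[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        if not candidates:
            break
        # Keep only the best `beam_width` candidates (pruning).
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

greedy_tokens, greedy_score = beam_search(beam_width=1)   # beam width 1 is greedy decoding
beam_tokens, beam_score = beam_search(beam_width=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

In this table the greedy path commits to token `1` first and ends with total probability 0.33, while a beam of width 2 keeps the initially less likely token `2` alive and finds the path with probability 0.36.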

### Benefits of Beam Search

- **Better Contextual Decisions**: By maintaining multiple hypotheses and extending them simultaneously, beam search makes more contextually informed decisions, avoiding the pitfalls of early commitments to suboptimal paths.
- **Trade-off Between Efficiency and Quality**: Increasing the beam width allows the model to explore more paths, potentially increasing the quality of the output but at the cost of computational efficiency.

### Drawbacks of Beam Search

- **Computational Cost**: Larger beam widths significantly increase the number of sequences to evaluate at each step.
- **Search Errors**: Despite its sophistication, beam search can still suffer from search errors where the global optimal sequence might be missed if it doesn’t make it into the beam in early steps.

### Usage Example in a Language Model

In a setting like text generation with GPT, beam search can be implemented at the point where logits are calculated for each token position. After obtaining the logits, instead of selecting the top token, you maintain a list of top `k` candidates (where `k` is the beam size), and then iteratively extend and prune as described.

This structured, iterative refinement helps in generating more coherent and contextually appropriate text in complex language tasks.

## Back to Coding

## Notes - @classmethod def from_pretrained

query: can you explain what this method here is even supposed to do. if just getting pretrained model off HF. why not just use .from_pretrained alone with tokenizer and pipeline() function? instead of going through all that trouble of filtering and transposing. what is this meant to do here?

---

The `from_pretrained` method here does more than load a pretrained model from Hugging Face's `transformers` library: it copies the pretrained weights into this notebook's own `GPT` class. Here’s a breakdown of the rationale and the advantages of this customized approach:

### Custom Model Integration
The primary purpose of this method is to integrate pretrained weights from a standard Hugging Face model into a custom model architecture that might have some structural differences from the original Hugging Face model. Here are key reasons and benefits for this approach:

1. **Custom Model Architecture**:
   - **Flexibility**: The weights end up in this notebook's own `GPT` class, whose code can be read, modified, and trained directly, rather than in Hugging Face's `GPT2LMHeadModel`.
   - **Matching Configurations**: The number of layers, heads, and embedding dimensions is looked up from `model_type` so that the custom model has exactly the same shapes as the chosen checkpoint; the weights could not be copied otherwise.

2. **Control Over Initialization and State Management**:
   - **Initialization**: The custom model is first built with its own from-scratch initialization (`GPT(config)`), and its parameters are then overwritten with the pretrained values, so the same class serves both training from scratch and starting from a checkpoint.
   - **State Manipulation**: By manually handling the state dictionary (which includes weights and other parameters), you can selectively transfer weights and even modify them if necessary (e.g., transposing weights to fit different layer types).

3. **Handling of Non-Standard Components**:
   - **Transposition of Weights**: The OpenAI checkpoints, as exposed by Hugging Face, use a `Conv1D` module for the attention and MLP projections, while the custom model uses plain `nn.Linear` layers. The two store their weight matrices in transposed layouts, so those four weights are transposed while copying.
   - **Filtering Out Unneeded Entries**: Keys ending in `.attn.bias` and `.attn.masked_bias` are not learnable biases; they are buffers holding the causal attention mask. They are not parameters, so they are skipped during the copy.
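
Why the transpose is needed can be shown without downloading any checkpoint. In this sketch the Conv1D-style layer is written out by hand as `x @ W + b` with `W` of shape `(in_features, out_features)`; the sizes and tensor names are toy assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_in, n_out = 4, 6
w_conv1d = torch.randn(n_in, n_out)     # Conv1D-style layout: (in_features, out_features)
b = torch.randn(n_out)
x = torch.randn(2, 3, n_in)

y_conv1d = x @ w_conv1d + b             # what a Conv1D-style projection computes

linear = nn.Linear(n_in, n_out)         # nn.Linear stores weight as (out_features, in_features)
with torch.no_grad():
    linear.weight.copy_(w_conv1d.t())   # transpose on import
    linear.bias.copy_(b)
y_linear = linear(x)

match = torch.allclose(y_conv1d, y_linear, atol=1e-5)
print(linear.weight.shape, match)
```

Copying `w_conv1d` without the transpose would fail the shape check here, since `(4, 6)` does not match the `(6, 4)` layout that `nn.Linear` expects.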

### Why Not Just Use `.from_pretrained` Directly?

- **Limitations of Direct Loading**: Using `.from_pretrained` directly loads the model with all its original configurations and parameters. This is straightforward and efficient when the model architecture in your application directly matches one of the pretrained models provided by Hugging Face.

- **Need for Customization**: However, if your model architecture deviates from the standard architectures (for example, in how layers are structured or how certain operations are performed), directly loading the model might not be feasible without further adjustments.

- **Integration with Custom Code**: The method allows the pretrained weights to be integrated into potentially complex systems or pipelines that require specific setups or initializations that are not supported directly by the standard loading mechanism.

### Conclusion

This customized approach is particularly useful when you need to merge the robust capabilities of pretrained models with unique model architectures or when integrating into an existing codebase that requires specific configurations or handling. It provides a bridge between the ease of using pretrained models and the flexibility required in research and application-specific development environments.

## Back to coding

## Notes - Optimisers method

The method `configure_optimizers` is designed to configure and return an optimizer for the training of a neural network, specifically tailored for handling various parameters differently based on their characteristics and roles in the model. Here's a detailed breakdown of what each part of this method does:

### Method Explanation

1. **Extract All Model Parameters**:
   ```python
   param_dict = {pn: p for pn, p in self.named_parameters()}
   ```
   - This line retrieves all parameters of the model along with their names. The `named_parameters()` method yields both the names and parameter tensors from the model, which include weights and biases across all layers.

2. **Filter Parameters That Require Gradients**:
   ```python
   param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
   ```
   - Filters out parameters that do not require gradients. Non-trainable parameters are typically not involved in the backpropagation process, so they are excluded from the optimizer setup.

3. **Organize Parameters into Groups for Different Treatment**:
   ```python
   decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
   nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
   ```
   - This divides parameters into two groups: those that will have weight decay applied (`decay_params`) and those that will not (`nodecay_params`). Typically, weight tensors (which are at least 2D) receive weight decay to prevent overfitting, while biases and other 1D parameters do not because applying decay to these can lead to suboptimal training dynamics.

4. **Define Optimization Groups**:
   ```python
   optim_groups = [
       {'params': decay_params, 'weight_decay': weight_decay},
       {'params': nodecay_params, 'weight_decay': 0.0}
   ]
   ```
   - This constructs the configuration for optimization groups, specifying how weight decay is applied differently to different groups of parameters.

5. **Logging Information About Parameters**:
   - The method logs the number of parameters in each group, which is useful for debugging and understanding the model's parameter distribution.

6. **Create AdamW Optimizer**:
   ```python
   fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
   use_fused = fused_available and device_type == 'cuda'
   ```
   - Checks if the fused implementation of AdamW is available (faster on CUDA devices) and decides whether to use it based on the device type.
   
   ```python
   if zero_stage == 1:
       optimizer = ZeroRedundancyOptimizer(**optim_groups[0], optimizer_class=torch.optim.AdamW,
                                           lr=learning_rate, betas=betas, fused=use_fused)
       optimizer.add_param_group(optim_groups[1])
   else:
       optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, fused=use_fused)
   ```
   - Depending on whether Zero Redundancy Optimization (used to minimize memory consumption across multiple GPUs) is enabled, it either sets up a ZeroRedundancyOptimizer or a regular AdamW optimizer. Parameters are then passed accordingly, including learning rate adjustments and betas for momentum terms.
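
The grouping logic can be tried on a tiny model. This is a sketch under assumptions: the `nn.Sequential` model and the hyperparameter values (`weight_decay=0.1`, `lr=3e-4`, `betas=(0.9, 0.95)`) are illustrative choices, and the plain (non-fused, non-ZeRO) `AdamW` path is used so that it runs on CPU.

```python
import torch
import torch.nn as nn

# Toy model: two Linear layers (2D weights, 1D biases) and a LayerNorm (1D weight and bias).
model = nn.Sequential(nn.Linear(8, 16), nn.LayerNorm(16), nn.Linear(16, 4))

param_dict = {pn: p for pn, p in model.named_parameters() if p.requires_grad}
decay_params = [p for p in param_dict.values() if p.dim() >= 2]    # weight matrices
nodecay_params = [p for p in param_dict.values() if p.dim() < 2]   # biases, LayerNorm params

optim_groups = [
    {'params': decay_params, 'weight_decay': 0.1},
    {'params': nodecay_params, 'weight_decay': 0.0},
]
optimizer = torch.optim.AdamW(optim_groups, lr=3e-4, betas=(0.9, 0.95))

num_decay = sum(p.numel() for p in decay_params)
num_nodecay = sum(p.numel() for p in nodecay_params)
print(f"decayed tensors: {len(decay_params)}, with {num_decay} parameters")
print(f"non-decayed tensors: {len(nodecay_params)}, with {num_nodecay} parameters")
```

Each entry in `optimizer.param_groups` keeps its own `weight_decay`, while `lr` and `betas` are shared defaults applied to both groups.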

### Summary

This method allows for a detailed and customized setup of the optimizer, crucial for effective training, especially in large-scale models like those based on the Transformer architecture. It ensures that each parameter is treated according to its role: weight decay for matrices and embeddings, none for biases and LayerNorm parameters, with the fused and ZeRO variants selected when they are available and requested.

## Back to coding

In [None]:
class GPT(nn.Module):

    def __init__(self, config):
        super().__init__() # initialises base class nn.Module
        self.config = config # stores configuration passed into the constructor

        self.transformer = nn.ModuleDict(dict( # storing various components of GPT model. Each component can now be accessed using a key, just like a python dict.
            wte = nn.Embedding(config.vocab_size, config.n_embd), # maps token indices to token embeddings
            wpe = nn.Embedding(config.block_size, config.n_embd), # provides positional information to tokens
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]), # list of Nx transformer block, refer to architecture.
            ln_f = nn.LayerNorm(config.n_embd), # final layer normalisation, applied to the output of the last block before lm_head
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False) # maps n_embd=768 dimension to vocab_size=50257 for generating predictions
                                                                               # bias=False, no bias term here
        self.lm_head.LLMC_SKIP_INIT = 1 # don't init this one, we will tie weights
        self.transformer.wte.weight = self.lm_head.weight # https://paperswithcode.com/method/weight-tying <-- this is recommended based on research findings!
        """
        This line ties the weights of the language modeling head to the token embedding weights, which helps reduce the model's parameter count and can improve
        performance by reusing learned representations.
        """

        # init all weights, use a torch rng object to be very careful
        self.init_rng = torch.Generator() # initialises a RNG to ensure controlled randomness in weight initialisation
        self.init_rng.manual_seed(42)
        self.apply(self._init_weights) # applies _init_weights method to all submodules of the model, customising the weights' initialisation based on whether they are part
                                       # of linear layers, embedding layers or other configurations (handling special flags)

    def _init_weights(self, module): # this method is applied to all submodules of the model, as seen with the method above.
        if isinstance(module, nn.Linear):
            # apply special scaled init to the residual projections, per GPT-2 paper
            std = 0.02 if not hasattr(module, 'LLMC_RESIDUAL_SCALE_FLAG') else 0.02/math.sqrt(2 * self.config.n_layer)
                  # sets the standard deviation of the normal distribution used to initialise the weights. If the module has the RESIDUAL flag, the std is reduced to "0.02/math.sqrt(2 * self.config.n_layer)"
                        # This follows the GPT-2 paper: each of the n_layer blocks adds two residual contributions (attention and MLP), so this scaling
                        # keeps the variance of the accumulated residual activations from growing with depth, which helps stabilise training.

            # we want to skip initializing lm_head, which shares parameters with wte
            # and wte was already initialized down below during the Embedding init
            if not hasattr(module, 'LLMC_SKIP_INIT'):
                torch.nn.init.normal_(module.weight, mean=0.0, std=std, generator=self.init_rng) # skipped if the module has the LLMC_SKIP_INIT flag; otherwise initialise from a normal distribution with the std chosen above
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias) # if module has bias term, the bias is set to zero.
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02, generator=self.init_rng) # normal_ just initialises the weights of tensors to Gaussian distribution based on its params

    def forward(self, idx, targets=None, return_logits=True): # idx=batch of input data consisting of indices of tokens
                                                              # targets=optional, target outputs of input data, used for training. None when during inference
                                                              # return_logits=determines whether the raw pre-softmax outputs of the final layer should be returned. Useful for evaluating or continuing the computation with the logits.
        device = idx.device
        b, t = idx.size()
        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}" # ensures input doesn't exceed the configured maximum block size, else assertion is thrown.
        pos = torch.arange(0, t, dtype=torch.long, device=device) # shape (t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (t, n_embd)
        x = tok_emb + pos_emb

        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x) # applies final ln

        if targets is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.lm_head(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1) # ignoring beam search for now // logits hold one unnormalised score per vocabulary token at each position.
                                                                                                        # cross_entropy applies softmax internally, so the loss compares the predicted distribution against the target token.
        else:
            # inference-time mini-optimization: only forward the lm_head on the very last position
            logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
            loss = None

        # there are performance reasons why not returning logits is prudent, if not needed
        if not return_logits:
            logits = None

        return logits, loss

    @classmethod # similar to a static method in C++ in that it is called on the class, not on an instance // unlike a static method, it receives the class itself (cls) as its first argument.
    def from_pretrained(cls, model_type):
        """Loads pretrained GPT-2 model weights from huggingface"""
        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        print("loading weights from pretrained gpt: %s" % model_type)

        # n_layer, n_head and n_embd are determined from model_type
        config_args = {
            'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
            'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
            'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
            'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
        }[model_type]
        config_args['vocab_size'] = 50257 # always 50257 for GPT model checkpoints
        config_args['block_size'] = 1024 # always 1024 for GPT model checkpoints

        # create a from-scratch initialized minGPT model
        config = GPTConfig(**config_args) # creates instance of GPTConfig using config args defined in the dict "config_args"
        model = GPT(config)               # instance of GPT class is created, initialised using config args

        sd = model.state_dict() # returns a dict holding the whole state of the module: parameters and persistent buffers // this is what gets saved to or loaded from a checkpoint
        sd_keys = sd.keys() # retrieves all keys from the state dictionary "sd"
        sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')] # discard this mask / buffer, not a param
                                                                       # despite the name, '.attn.bias' is the attention mask stored as a buffer, not a learned bias,
                                                                       # so it is left out when copying weights

        # init a huggingface/transformers model
        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()

        # copy while ensuring all of the parameters are aligned and match in names and shapes
        # first drop the non-parameter buffers (the attention masks) from the pretrained model's keys
        sd_keys_hf = sd_hf.keys()
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')] # ignore these, just a buffer
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')] # same, just the mask (buffer)
        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight'] # weights that need to be transposed on import

        # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla Linear
        # this means that we have to transpose these weights when we import them
        # the loop below checks that each key has a matching shape in both models (reversed shape for the
        # transposed weights) and copies the pretrained values into the freshly initialized model
        assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
        for k in sd_keys_hf:
            if any(k.endswith(w) for w in transposed): # transposed if in transposed list.
                # special treatment for the Conv1D weights we need to transpose
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                # vanilla copy over the other parameters
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])

        return model

    def configure_optimizers(self, weight_decay, learning_rate, betas, device_type, zero_stage):
        # start with all of the candidate parameters
        param_dict = {pn: p for pn, p in self.named_parameters()}
        # filter out those that do not require grad
        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}

        # create optim groups. Any parameter with 2 or more dimensions will be weight decayed, otherwise no.
        # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {'params': decay_params, 'weight_decay': weight_decay},
            {'params': nodecay_params, 'weight_decay': 0.0}
        ]
        num_decay_params = sum(p.numel() for p in decay_params)
        num_nodecay_params = sum(p.numel() for p in nodecay_params)
        print0(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
        print0(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters") # logs the number of parameters in each group, which is useful for debugging and understanding
                                                                                                                    # the model's parameter distribution.

        # Create AdamW optimizer and use the fused version if it is available // faster on CUDA devices
        fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and device_type == 'cuda'
        print0(f"using fused AdamW: {use_fused}")

        if zero_stage == 1:
            print0("using ZeroRedundancyOptimizer")
            optimizer = ZeroRedundancyOptimizer(**optim_groups[0], optimizer_class=torch.optim.AdamW,
                                                lr=learning_rate, betas=betas, fused=use_fused)
            optimizer.add_param_group(optim_groups[1])
        else:
            print0("using regular AdamW")
            optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, fused=use_fused)
        return optimizer

    @torch.no_grad() # disables gradient, not needed during inference
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None): # idx is input tensor containing token indices of the start of the sequence
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:] # This line ensures that the token sequence used for conditioning does not exceed the model's
                                                                                                          # maximum block size (context window). If it does, it trims the sequence from the left.
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond) # calls forward method
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # samples from probability instead of taking highest probability each time.
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx
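
## Notes - sampling in generate

The temperature and top-k steps in `generate` can be tried out on a toy example. The following is a minimal sketch in plain Python rather than PyTorch; the helper names `top_k_filter`, `softmax`, and `sample_next` and the toy logits are made up for illustration and are not part of the notebook's code. Like the tensor version, it keeps every logit that is at least as large as the k-th largest one.

```python
import math
import random

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf (ties at the cutoff are kept)."""
    cutoff = sorted(logits, reverse=True)[min(k, len(logits)) - 1]
    return [x if x >= cutoff else -math.inf for x in logits]

def softmax(logits):
    """Convert logits to probabilities; exp(-inf) evaluates to 0.0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one token index from temperature-scaled, optionally top-k filtered logits."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        scaled = top_k_filter(scaled, top_k)
    probs = softmax(scaled)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]      # hypothetical scores for a 4-token vocabulary
probs = softmax(top_k_filter(toy_logits, 2))
print(probs)                             # only the first two entries are non-zero
```

A lower temperature sharpens the distribution toward the highest logit, and `top_k=1` reduces sampling to always picking the most likely token.

## back to coding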

In [None]:
# -----------------------------------------------------------------------------
# Our own simple Distributed Data Loader

## Notes - Data Loader

The custom `DistributedDataLoader` loads dataset shards and gives each process in a distributed run its own slice of every batch. This is useful when training large models on large datasets, where the data is split across several files and consumed by multiple processes in parallel. The parts of the data loader are:

### Helper Functions for Data Loading

1. **_peek_data_shard(filename)**:
   - This function reads only the header of a data shard file, which contains metadata about the contents of the file.
   - It reads the first 256 `int32` values (4 bytes each, 1024 bytes in total) to get the header.
   - The header includes a magic number (for file validation), a version number, and the number of tokens the file claims to contain.
   - It checks the magic number and the version, then returns the number of tokens.

2. **_load_data_shard(filename)**:
   - Similar to `_peek_data_shard`, but this function loads the entire data shard.
   - After reading the header, it reads the rest of the file, which contains the actual data tokens stored as uint16.
   - It verifies that the number of tokens read matches the number reported in the header and returns these tokens.
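
To make the file layout concrete, here is a minimal sketch that writes a tiny shard in this format and reads it back using only the standard library. The helper names `write_toy_shard` and `read_toy_shard` and the token values are made up for illustration. The sketch assumes little-endian byte order and leaves the unused header slots at zero; the notebook's own functions use NumPy with the machine's native byte order.

```python
import os
import struct
import tempfile

MAGIC, VERSION, HEADER_INTS = 20240520, 1, 256   # values checked by the loader functions

def write_toy_shard(path, tokens):
    """Write a 256-int32 header followed by the tokens as uint16."""
    header = [0] * HEADER_INTS
    header[0], header[1], header[2] = MAGIC, VERSION, len(tokens)
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{HEADER_INTS}i", *header))
        f.write(struct.pack(f"<{len(tokens)}H", *tokens))

def read_toy_shard(path):
    """Read the header, validate it, and return the tokens as a list of ints."""
    with open(path, "rb") as f:
        header = struct.unpack(f"<{HEADER_INTS}i", f.read(HEADER_INTS * 4))
        assert header[0] == MAGIC, "magic number mismatch"
        assert header[1] == VERSION, "unsupported version"
        payload = f.read()
    tokens = list(struct.unpack(f"<{len(payload) // 2}H", payload))
    assert len(tokens) == header[2], "token count does not match header"
    return tokens

path = os.path.join(tempfile.mkdtemp(), "toy_shard.bin")
write_toy_shard(path, [464, 2068, 7586, 21831])
print(os.path.getsize(path), read_toy_shard(path))
```

The file size is the 1024-byte header plus 2 bytes per token, which is why the loader can tell from the header alone how much data a shard holds.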

### DistributedDataLoader Class

1. **Constructor (__init__)**:
   - Initializes the data loader with a file pattern, batch size (B), sequence length (T), the rank of the process, and the total number of processes.
   - It finds all files matching the given pattern and reads each file's header (via `_peek_data_shard`) to count tokens and to check that every shard holds at least `num_processes * B * T + 1` tokens, enough for one full step across all processes.
   - The constructor sets up initial values and prepares the loader to start fetching batches from the first data shard.

2. **reset()**:
   - Resets the loader to begin (or restart) loading from the first shard. If the first shard is already loaded, it simply resets pointers without reloading data, optimizing for repeated accesses.

3. **advance()**:
   - Moves the loader to the next shard in the sequence. This is called automatically when the loader reaches the end of the current shard and needs more data.

4. **next_batch()**:
   - Fetches the next batch of data from the current shard.
   - It extracts a sequence of data (both input `x` and target `y`, where `y` is offset by one token from `x` for typical sequence prediction tasks).
   - After reading, it moves the position forward by `B * T * num_processes` tokens. If the following batch would run past the end of the current shard, it calls `advance()` to load the next shard.
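
The one-token offset between inputs and targets is easier to see on a toy token stream. This is a minimal sketch with plain Python lists instead of tensors; `make_batch` and the token values are made up for illustration.

```python
def make_batch(tokens, position, B, T):
    """Slice B*T+1 tokens and split them into (B, T) inputs and targets."""
    buf = tokens[position : position + B * T + 1]
    assert len(buf) == B * T + 1, "not enough tokens left for a full batch"
    inputs, targets = buf[:-1], buf[1:]           # targets are the inputs shifted by one token
    x = [inputs[r * T : (r + 1) * T] for r in range(B)]
    y = [targets[r * T : (r + 1) * T] for r in range(B)]
    return x, y

tokens = list(range(100, 120))                    # toy token stream
x, y = make_batch(tokens, position=0, B=2, T=3)
print(x)
print(y)
```

Each entry of `y` is the token that follows the corresponding entry of `x` in the stream, which is why the slice needs `B * T + 1` tokens rather than `B * T`.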

### Significance of DistributedDataLoader

- **Efficiency in Distributed Environments**: Every process loads the current shard but reads a different, non-overlapping `B * T`-token window from it at each step, so no two processes train on the same tokens within a step. Only one shard is held in memory at a time, which keeps memory use bounded regardless of the dataset size.

- **Scalability**: Adding processes widens the stride (`B * T * num_processes`) with which a shard is consumed, so each shard is used up in proportionally fewer steps.

- **Customizability**: Unlike more generic data loaders, this one is tailored to pre-tokenized text stored as a flat stream of `uint16` token IDs. It does no padding and no special-token handling; it only slices fixed-size `(B, T)` windows.

In summary, `DistributedDataLoader` is a small, purpose-built component that partitions a token stream across processes, a common need when training language models on large datasets.

query: whats a datashard

---

A **data shard**, in the context of computing and data management, refers to a horizontal partition of data in a database or a dataset. Each shard is a subset of the entire dataset and can be stored on separate database servers or storage systems. This division of data helps distribute the computational load and storage demands, enhancing the performance and scalability of data processing applications, especially in distributed systems.

### Key Characteristics and Uses of Data Shards

1. **Distributed Storage**:
   - Data shards enable the storage of different parts of a dataset across multiple servers or nodes. This setup allows for parallel processing, where different nodes can handle operations on different shards simultaneously.

2. **Scalability**:
   - Sharding is particularly valuable in scaling applications and databases. As the volume of data grows, more shards can be added, and they can be distributed across additional servers without affecting the performance of existing servers.

3. **Load Balancing**:
   - By distributing the data across several shards, it's possible to balance the load among servers. This means no single server bears too much load, which can help prevent bottlenecks and improve overall response times in data-intensive applications.

4. **Fault Tolerance and Availability**:
   - If one shard or server fails, the system can continue to operate, albeit at reduced capacity, with the remaining shards. This setup can increase the availability and fault tolerance of the system.

5. **Localized Data Management**:
   - In some scenarios, data sharding can be used to keep data physically closer to where it is most frequently accessed, reducing latency and improving performance (a practice often seen in global applications).

### Application in Machine Learning and Data Processing

In the context of machine learning and large-scale data processing, like the one described in your query about the `DistributedDataLoader`:

- **Training Data Distribution**:
  - Data sharding is used to divide a large dataset into manageable pieces so that each computing node or process in a distributed training architecture can work on a different piece. This parallelism can significantly speed up training times for large models.

- **Efficient Data Handling**:
  - Shards make it easier to manage large datasets by breaking them down into smaller, more manageable pieces. Each piece can be loaded and processed independently, which is particularly useful when working with limited memory resources.

- **Implementation in Data Loaders**:
  - In deep learning frameworks, custom data loaders can be designed to handle data shards efficiently, ensuring that each part of the dataset is fed into the model at the right time during training, optimizing resource usage and training performance.

### Conclusion

In summary, data sharding is a strategy employed to enhance performance, scalability, and manageability of large datasets in distributed computing environments. It's particularly relevant in today's data-driven landscape where the volume of data often exceeds the capacity of single machines or servers.

query: show me an example. i cant visualise

---

Certainly! Let's consider a practical example of data sharding in a machine learning context, specifically for training a neural network model on a large dataset. Suppose we have a dataset that's too large to fit into the memory of a single machine, or we want to utilize several machines to speed up the training process.

### Scenario: Training a Language Model on Distributed Systems

Imagine you have a large dataset of text data, such as a collection of books or articles, and you need to train a language model like GPT on this dataset. The dataset is split into multiple shards that can be distributed across different nodes in a computing cluster.

#### Step 1: Data Preparation
- **Dataset**: 10 million sentences.
- **Objective**: Train a language model to predict the next word given a sequence of words.

#### Step 2: Sharding the Data
- **Total Nodes**: 4
- **Shard Creation**: Divide the 10 million sentences equally into 4 parts, each containing 2.5 million sentences.
- Each shard might be stored as a separate file:
  - `shard_1.bin`
  - `shard_2.bin`
  - `shard_3.bin`
  - `shard_4.bin`
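
The split itself is simple index arithmetic. A minimal sketch, with `shard_ranges` as a made-up helper name, that also handles counts that do not divide evenly:

```python
def shard_ranges(num_items, num_shards):
    """Return (start, end) index pairs that split num_items as evenly as possible."""
    base, extra = divmod(num_items, num_shards)
    ranges, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)     # the first `extra` shards get one more item
        ranges.append((start, start + size))
        start += size
    return ranges

ranges = shard_ranges(10_000_000, 4)
for i, (start, end) in enumerate(ranges, start=1):
    print(f"shard_{i}.bin: sentences [{start:,}, {end:,})")
```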

#### Step 3: Distributed Training Setup
- Each node in the cluster is assigned one shard. For instance:
  - **Node 1** processes `shard_1.bin`
  - **Node 2** processes `shard_2.bin`
  - **Node 3** processes `shard_3.bin`
  - **Node 4** processes `shard_4.bin`

#### Step 4: Data Loading and Processing
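Each node loads its assigned shard with a small loader class. Here is a simplified sketch of how such a loader might work on one node (it is a toy stand-in, not the `DistributedDataLoader` class defined later in this notebook):

```python
# Simplified sketch of a per-node shard loader
import os
import tempfile

class ToyShardLoader:
    def __init__(self, shard_file):
        self.data = self.load_data(shard_file)

    def load_data(self, filename):
        # Load one sentence per line; highly simplified (real shards are binary token files)
        with open(filename, "r") as f:
            return f.read().splitlines()

    def get_batch(self, batch_size=1000):
        # Yield consecutive batches of sentences
        for i in range(0, len(self.data), batch_size):
            yield self.data[i:i + batch_size]

def train_step(batch):
    # Placeholder for a forward/backward pass and an optimizer step.
    # (In PyTorch, model.train() only switches the model to training mode.)
    return len(batch)

# Example usage on Node 1, with a tiny text file standing in for its shard
shard_path = os.path.join(tempfile.mkdtemp(), "shard_1.txt")
with open(shard_path, "w") as f:
    f.write("\n".join(f"sentence {i}" for i in range(2500)))

dataloader = ToyShardLoader(shard_path)
for batch in dataloader.get_batch():
    train_step(batch)
```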

#### Step 5: Training
- Each node runs forward and backward passes on batches from its own shard and computes gradients from the prediction errors.
- The nodes must stay in sync. With PyTorch's `DistributedDataParallel` (imported as `DDP` in this notebook), gradients are averaged across all processes during every backward pass, so each replica applies the same weight update.
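
The synchronization step can be illustrated without any distributed machinery. This is a minimal single-process simulation, assuming plain SGD and made-up gradient values; `all_reduce_mean` and `sgd_step` are hypothetical helper names, and a real run would perform the averaging with a collective all-reduce across processes instead.

```python
def all_reduce_mean(per_worker_grads):
    """Average gradients element-wise across workers (the result of an all-reduce mean)."""
    num_workers = len(per_worker_grads)
    return [sum(g) / num_workers for g in zip(*per_worker_grads)]

def sgd_step(weights, grads, lr):
    """Apply one plain gradient descent update."""
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [1.0, -2.0]                     # every worker starts from identical weights
local_grads = [                           # hypothetical gradients from 4 different shards
    [0.4, 0.0], [0.0, 0.8], [0.4, 0.8], [0.0, 0.0],
]
avg = all_reduce_mean(local_grads)
replicas = [sgd_step(weights, avg, lr=0.5) for _ in local_grads]
print(avg, replicas[0])
```

Because every worker applies the same averaged gradient to the same starting weights, all replicas remain identical after the step even though each saw different data.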

### Advantages of This Approach
- **Scalability**: Throughput can scale close to linearly with the number of nodes, since each node handles only a part of the dataset, until communication overhead starts to dominate.
- **Speed**: Wall-clock training time drops because the work is distributed.
- **Manageability**: Each shard is smaller and can fit into the memory and storage of a single node.

This example illustrates how data sharding and distributed training fit together. Note that the `DistributedDataLoader` in this notebook splits the work differently: all processes walk through the same shard files in the same order, and each process reads its own rank-specific slice of every shard rather than owning a whole shard.

## back to coding

In [2]:
def _peek_data_shard(filename):
    # only reads the header, returns header data
    with open(filename, "rb") as f:
        # first read the header, which is 256 int32 integers (4 bytes each) // i.e. the first 1024 bytes of the file
        header = np.frombuffer(f.read(256*4), dtype=np.int32)
    if header[0] != 20240520: # validates first integer // used to verify that file is of the expected type or format
        print("ERROR: magic number mismatch in the data .bin file!")
        print("---> HINT: Are you passing in a correct file with --input_bin?")
        print("---> HINT: Dataset encoding changed recently, re-run data prepro or refer again to README")
        print("---> HINT: For example re-run: `python dev/data/tinyshakespeare.py`, then re-try")
        exit(1)
    assert header[1] == 1, "unsupported version" # header[1]=1 indicates version of file format thats supported
    ntok = header[2] # number of tokens (claimed)
    return ntok # for now just return the number of tokens

In [None]:
def _load_data_shard(filename):
    with open(filename, "rb") as f:
        # first read the header, which is 256 int32 integers (4 bytes each)
        header = np.frombuffer(f.read(256*4), dtype=np.int32)
        assert header[0] == 20240520, "magic number mismatch in the data .bin file" # same check as in _peek_data_shard
        assert header[1] == 1, "unsupported version"
        ntok = header[2] # number of tokens (claimed)
        # the rest of it are tokens, stored as uint16
        tokens = np.frombuffer(f.read(), dtype=np.uint16)
    assert len(tokens) == ntok, "number of tokens read does not match header?"
    return tokens

## Notes

query: what does reset() do?

---

The `reset()` method in the `DistributedDataLoader` class plays a crucial role in managing the data loading process from shards across distributed systems. Here's what it specifically does:

### Functionality of `reset()` Method
- **Purpose**: To prepare or reinitialize the data loader to start reading from the first shard of data again. It's typically called during the initialization of the data loader and can be invoked to reset the loader's state, for example, at the start of a new epoch in training.

**Operations within `reset()`:**
1. **Check Current Shard**: It first checks whether the current shard index (`self.current_shard`) already points to the first shard (`0`). If it does not (this includes the initial `None` set by the constructor), it sets the index to `0` and loads that shard.

2. **Load Data from Shard** (skipped when shard `0` is already in memory):
   - Loads data from the shard file corresponding to `self.current_shard` using the `_load_data_shard` function. This function reads the file, validates the header, and loads the token data into memory.
   - The loaded tokens are then stored in `self.tokens`, which acts as the in-memory storage for the current shard's data.

3. **Set Current Position**:
   - Sets the `self.current_position` attribute, which determines the starting point of data reading for the current process. This position is calculated based on the process's rank (`self.process_rank`) and the configured batch size (`self.B`) and sequence length (`self.T`). The formula used is `self.process_rank * self.B * self.T`, ensuring that each process starts reading from a unique segment of the data shard, thereby distributing the workload evenly across different processes.
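
The effect of the rank-based offset, combined with the stride `B * T * num_processes` used when the position advances, can be tabulated for a toy configuration. This is a minimal sketch; `input_span` is a made-up helper, and it ignores shard boundaries.

```python
def input_span(rank, step, num_processes, B, T):
    """Half-open token range [start, end) that `rank` uses as inputs at `step` within one shard."""
    start = rank * B * T + step * B * T * num_processes
    return start, start + B * T

B, T, num_processes = 2, 4, 2             # toy configuration
schedule = {
    step: [input_span(rank, step, num_processes, B, T) for rank in range(num_processes)]
    for step in range(3)
}
for step, spans in schedule.items():
    print(step, spans)
```

Within a step the ranks read adjacent windows, and across steps the windows tile the shard without gaps. The targets for each window reach one token further, into the start of the next window.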

### Importance in Distributed Training
- **Deterministic Starting Point**: After `reset()`, every process reads from shard `0` at its own rank-specific offset, so any pass that begins with `reset()` sees the same tokens in the same order.
- **Flexibility**: It allows easy resetting of the loader's state at various points in the training cycle, such as at the beginning of each training epoch, without needing to recreate the data loader instance.

Overall, the `reset()` method is instrumental in ensuring that each training process in a distributed setup operates smoothly and accesses the correct portion of data required for its computations.

## back to coding

In [1]:
class DistributedDataLoader:
    def __init__(self, filename_pattern, B, T, process_rank, num_processes): # filename_pattern=a pattern that matches the data files to be loaded
                                                                             # B=number of sequences per batch
                                                                             # T=number of tokens per sequence
                                                                             # process_rank=rank ID of current process among all the processes
                                                                             # num_processes=total number of processes involved in the training
        self.process_rank = process_rank
        self.num_processes = num_processes
        self.B = B
        self.T = T

        # glob files that match the pattern
        self.files = sorted(glob.glob(filename_pattern)) # finding all files matching the "filename_pattern"
        assert len(self.files) > 0, f"did not find any files that match the pattern {filename_pattern}" # fails early if no file matches the pattern, so there is always data available for loading

        # load and validate all data shards, count number of tokens in total
        ntok_total = 0
        for fname in self.files: # iterates over all discovered files
            shard_ntok = _peek_data_shard(fname) # reads only the header to get the shard's token count
            assert shard_ntok >= num_processes * B * T + 1 # each shard must hold at least one full step: B*T tokens for every process, plus 1 extra token for the last target
            ntok_total += shard_ntok
        self.ntok_total = ntok_total
        print0(f"DataLoader: total number of tokens: {ntok_total:,} across {len(self.files)} files") # logs the total number of tokens and number of files loaded

        # kick things off
        self.current_shard = None # sets initial shard to none and calls "reset" method to prepare for dataloading.
        self.reset()

    def reset(self):
        # we're being a bit clever here: if we already had shard 0 loaded,
        # then don't do the work to reload it, just reset the pointer
        if self.current_shard != 0:
            self.current_shard = 0
            self.tokens = _load_data_shard(self.files[self.current_shard])
        self.current_position = self.process_rank * self.B * self.T

    def advance(self): # advance to next data shard
        self.current_shard = (self.current_shard + 1) % len(self.files) # increments the current shard index; the modulo by the number of files (shards) makes the
                                                                        # index wrap around at the end, so data loading is cyclic.
        self.current_position = self.process_rank * self.B * self.T     # resets the position to this process's starting offset in the new shard, computed from batch size, sequence length and process rank
        self.tokens = _load_data_shard(self.files[self.current_shard])  # loads the newly selected shard into memory (self.tokens) // _load_data_shard reads the tokens and checks the
                                                                        # header's magic number, version and token count

    def next_batch(self):
        B = self.B
        T = self.T
        buf = self.tokens[self.current_position : self.current_position+B*T+1] # slices B*T+1 tokens starting at the current position: enough for one batch
                                                                               # plus one additional token so the targets can be shifted by one.
        buf = torch.tensor(buf.astype(np.int32), dtype=torch.long) # widens uint16 to int32 in NumPy, then converts to a PyTorch int64 (long) tensor
        x = (buf[:-1]).view(B, T) # inputs
        y = (buf[1:]).view(B, T) # targets
        # advance the start pointer in current shard by the number of tokens all processes consume in one step
        self.current_position += B * T * self.num_processes
        # if loading the next batch would be out of bounds advance the shard
        if self.current_position + (B * T * self.num_processes + 1) > len(self.tokens):
            self.advance()
        return x, y

In [None]:
# -----------------------------------------------------------------------------
# Python -> C bridge utilities for saving params/grads/activations to .bin files

## Notes

These functions serialize model weights and tensors into binary `.bin` files so that the C code can read them, for example as initialization. Each function in turn:

### 1. `write_fp32(tensor, file)`
- **Purpose**: Writes a tensor to a binary file in 32-bit floating point format (`float32`).
- **Steps**:
  - `tensor.detach().cpu().to(torch.float32)`: Detaches the tensor from the computation graph, moves it to CPU, and ensures it is in `float32` format.
  - `b = t.numpy().tobytes()`: Converts the tensor to a NumPy array and then to a byte format.
  - `file.write(b)`: Writes the byte representation of the tensor to the specified file.

### 2. `write_bf16(tensor, file)`
- **Purpose**: Writes a tensor to a binary file in `bfloat16`, a 16-bit floating point format that keeps the 8 exponent bits of `float32` but only 7 mantissa bits. This halves the file size at the cost of precision.
- **Steps**:
  - `tensor.detach().cpu().to(torch.bfloat16)`: Detaches the tensor, moves it to CPU, and converts it to `bfloat16`.
  - `t = t.view(torch.int16)`: Reinterprets the same 16-bit patterns as `int16` without changing any bits. This is needed because NumPy has no native `bfloat16` dtype.
  - `b = t.numpy().tobytes()`: Converts the tensor to a NumPy array in `int16` format and then to bytes.
  - `file.write(b)`: Writes the byte data to the specified file.
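
The relationship between `float32`, `bfloat16`, and the `int16` reinterpretation can be checked at the bit level with the `struct` module. This is a minimal sketch with made-up helper names. It obtains the `bfloat16` pattern by truncating the low 16 bits of the `float32` pattern, which matches a rounding conversion only for values that `bfloat16` represents exactly, so the test values are chosen accordingly.

```python
import struct

def float32_bits(x):
    """Return the 32-bit pattern of x stored as an IEEE-754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bf16_bits_truncated(x):
    """Upper 16 bits of the float32 pattern: sign, 8 exponent bits, 7 mantissa bits."""
    return float32_bits(x) >> 16

def bf16_bits_to_float(bits):
    """Expand a bfloat16 bit pattern back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

def as_int16(bits):
    """Reinterpret an unsigned 16-bit pattern as a signed int16 (same bytes, different label)."""
    return struct.unpack("<h", struct.pack("<H", bits))[0]

for value in (1.0, -2.5, 0.15625):        # exactly representable in bfloat16
    bits = bf16_bits_truncated(value)
    print(f"{value:>8}: bits=0x{bits:04X} int16={as_int16(bits):>6} back={bf16_bits_to_float(bits)}")
```

The `int16` value printed for a negative number looks unrelated to the float, which is expected: the reinterpretation only relabels the bytes so they can pass through NumPy unchanged.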

### 3. `write_tensors(model_tensors, L, file, dtype)`
- **Purpose**: Facilitates writing various model tensors to a binary file, selecting the appropriate function (`write_fp32` or `write_bf16`) based on the desired data type.
- **Steps**:
  - It writes each tensor with either `write_fp32` or `write_bf16`, depending on the `dtype` parameter.
  - Tensors are grouped by parameter kind, not by layer: after the token and position embeddings, it writes `ln_1.weight` for all `L` layers, then `ln_1.bias` for all `L` layers, and so on through the attention and MLP weights, ending with the final layer norm.
  - Keys such as `transformer.wte.weight` or `transformer.h.{i}.ln_1.weight` (with `i` the layer index) identify which part of the model a tensor belongs to.
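
The resulting order can be listed without touching any weights. A minimal sketch, with `tensor_write_order` as a made-up helper that mirrors the loops in `write_tensors`:

```python
def tensor_write_order(L):
    """Return parameter names in the order write_tensors emits them for L layers."""
    per_layer = [
        "ln_1.weight", "ln_1.bias",
        "attn.c_attn.weight", "attn.c_attn.bias",
        "attn.c_proj.weight", "attn.c_proj.bias",
        "ln_2.weight", "ln_2.bias",
        "mlp.c_fc.weight", "mlp.c_fc.bias",
        "mlp.c_proj.weight", "mlp.c_proj.bias",
    ]
    names = ["transformer.wte.weight", "transformer.wpe.weight"]
    for suffix in per_layer:                       # outer loop: parameter kind
        for i in range(L):                         # inner loop: layer index
            names.append(f"transformer.h.{i}.{suffix}")
    names += ["transformer.ln_f.weight", "transformer.ln_f.bias"]
    return names

order = tensor_write_order(L=2)
for name in order[:6]:
    print(name)
```

Writing all layers of one parameter kind back to back means the reader can treat each kind as a single contiguous `(L, ...)` array, which matches the shape comments in `write_tensors`.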

Overall, these functions write raw tensor bytes in a fixed order with no per-tensor metadata, so whatever reads the file must already know the order, shapes, and dtype.

## back to coding

In [None]:
def write_fp32(tensor, file):
    t = tensor.detach().cpu().to(torch.float32)
    b = t.numpy().tobytes()
    file.write(b)

In [None]:
def write_bf16(tensor, file):
    t = tensor.detach().cpu().to(torch.bfloat16)
    # numpy doesn't have bf16 datatype so we have to trick it
    t = t.view(torch.int16) # trick: reinterpret as int16
    b = t.numpy().tobytes()
    file.write(b)

In [None]:
def write_tensors(model_tensors, L, file, dtype):
    # writes the GPT-2 model's weights to a binary file
    assert dtype in {"float32", "bfloat16"}
    write_fun = write_fp32 if dtype == "float32" else write_bf16
    write_fun(model_tensors["transformer.wte.weight"], file) # (V, C)
    write_fun(model_tensors["transformer.wpe.weight"], file) # (T, C)
    for i in range(L): # (L, C)
        write_fun(model_tensors[f"transformer.h.{i}.ln_1.weight"], file)
    for i in range(L): # (L, C)
        write_fun(model_tensors[f"transformer.h.{i}.ln_1.bias"], file)
    for i in range(L): # (L, 3C, C)
        write_fun(model_tensors[f"transformer.h.{i}.attn.c_attn.weight"], file)
    for i in range(L): # (L, 3C)
        write_fun(model_tensors[f"transformer.h.{i}.attn.c_attn.bias"], file)
    for i in range(L): # (L, C, C)
        write_fun(model_tensors[f"transformer.h.{i}.attn.c_proj.weight"], file)
    for i in range(L): # (L, C)
        write_fun(model_tensors[f"transformer.h.{i}.attn.c_proj.bias"], file)
    for i in range(L): # (L, C)
        write_fun(model_tensors[f"transformer.h.{i}.ln_2.weight"], file)
    for i in range(L): # (L, C)
        write_fun(model_tensors[f"transformer.h.{i}.ln_2.bias"], file)
    for i in range(L): # (L, 4C, C)
        write_fun(model_tensors[f"transformer.h.{i}.mlp.c_fc.weight"], file)
    for i in range(L): # (L, 4C)
        write_fun(model_tensors[f"transformer.h.{i}.mlp.c_fc.bias"], file)
    for i in range(L): # (L, C, 4C)
        write_fun(model_tensors[f"transformer.h.{i}.mlp.c_proj.weight"], file)
    for i in range(L): # (L, C)
        write_fun(model_tensors[f"transformer.h.{i}.mlp.c_proj.bias"], file)
    write_fun(model_tensors["transformer.ln_f.weight"], file) # (C, )
    write_fun(model_tensors["transformer.ln_f.bias"], file) # (C, )

## Notes - pad_vocab

The `pad_vocab` function is designed to adjust the size of a tensor that represents weights related to the vocabulary of a model like GPT-2, making it compatible with more efficient GPU matrix operations by padding it to a "friendlier" size. Here's a breakdown of its steps and purpose:

### Purpose
This function pads the first dimension of a 2D tensor up to the next multiple of a specified number (128 by default). The padding gives the vocabulary dimension a size that GPU matrix operations handle more efficiently.

### Steps

1. **Decorator (`@torch.no_grad()`)**:
   - Ensures that the operation does not track gradients, as it's typically used in a context where gradients are not needed (e.g., preparing a model for inference or export).

2. **Function Signature**:
   - `tensor`: The 2D tensor to be padded, usually associated with vocabulary weights in models like GPT-2.
   - `multiple`: The number by which the vocabulary size should be rounded up. This is usually a power of two because such numbers are more GPU operation friendly.
   - `value`: The padding value to use, defaulting to 0, which is typical for padding operations where you want the added elements to have no effect.

3. **Assertions**:
   - Ensures that the input tensor is 2D (`tensor.ndim == 2`).
   - Verifies that the original vocabulary size is 50,257 (`V == 50257`), which is specific to GPT-2, indicating that this function is specialized for a particular model configuration.

4. **Calculate New Padded Size**:
   - Computes the new size `Vp` for the vocabulary by rounding up to the next multiple of the specified `multiple`, using the integer formula `((V + multiple - 1) // multiple) * multiple`. For `V = 50257` and `multiple = 128` this gives `Vp = 50304`.

5. **Padding Operation**:
   - Computes the number of rows needed to pad (`pad_rows = Vp - V`).
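   - Uses PyTorch's `F.pad` function to add `pad_rows` rows to the tensor (skipped when `pad_rows` is 0). `F.pad` reads the padding tuple starting from the last dimension: in `(0, 0, 0, pad_rows)` the first pair `(0, 0)` leaves the second dimension (`C`) untouched, and the second pair `(0, pad_rows)` appends `pad_rows` rows at the end of the first dimension.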

6. **Shape Verification**:
   - Confirms that the new shape of the tensor is as expected (`(Vp, C)`), ensuring that the padding was applied correctly.

7. **Return**:
   - Returns the padded tensor, which now has a size more amenable to efficient GPU processing.

### General Explanation
This method is utilized typically when preparing model weights for deployment or for operations that are sensitive to tensor dimensions, ensuring that computational efficiency is maximized by aligning data structures to hardware-friendly configurations. This is particularly important in high-performance computing scenarios where even minor inefficiencies can scale to significant computational costs.
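
As a quick check of the rounding arithmetic described above, the following minimal sketch reproduces the padded-size calculation in plain Python. The helper name `padded_size` is hypothetical and is not part of the script; it only mirrors the formula used inside `pad_vocab`.

```python
def padded_size(v: int, multiple: int = 128) -> int:
    """Round v up to the next multiple of `multiple` (hypothetical helper)."""
    return ((v + multiple - 1) // multiple) * multiple

V = 50257               # GPT-2 vocabulary size
Vp = padded_size(V)     # padded vocabulary size
pad_rows = Vp - V       # number of rows appended to the embedding matrix
print(Vp, pad_rows)     # 50304 47
```

A size that is already a multiple is left unchanged, which matches the `pad_rows == 0` branch in the function.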

## back to coding

In [None]:
@torch.no_grad()
def pad_vocab(tensor, multiple=128, value=0):
    """
    The dimension of the vocab size in GPT-2 is 50,257
    which is unfortunately a very unfriendly number for a lot of
    matrix operations on the GPU. So we pad it to the nearest
    friendlier multiple, e.g. 50,304 if multiple=128 when we
    export the weights into C land. This is a NOOP algorithmically
    and is only done to make the tensor operations more efficient.
    """
    assert tensor.ndim == 2
    V, C = tensor.shape
    assert V == 50257, "just being defensive here"
    # calculate padded vocab size by rounding up to nearest multiple
    Vp = ((V + multiple - 1) // multiple) * multiple
    # pad the tensor
    pad_rows = Vp - V
    padded = tensor if pad_rows == 0 else F.pad(tensor, (0, 0, 0, pad_rows), value=value)
    assert padded.shape == (Vp, C)
    return padded

## notes

The function `write_model` is designed to serialize a model into a binary file format tailored for efficient loading and use, especially in environments where handling different data types is crucial (like in C/C++ applications). The function allows the choice of precision for the data (either `float32` or `bfloat16`), adjusts the size of the vocabulary for performance optimizations, and writes the configured model data into a structured binary file. Here’s how it works:

### Parameters
- **model**: The model instance to serialize. This is typically a pre-trained model such as GPT.
- **filename**: The destination filename for the serialized model.
- **dtype**: The data type for the model's tensors to be saved as. It supports `"float32"` and `"bfloat16"`.

### Steps and Purpose

1. **Header Preparation**:
    - A header is prepared to store metadata about the model. It is an array of 256 `int32` integers, which starts with a predefined structure:
        - **Magic number**: A unique identifier (`20240326`) to validate the file format upon loading.
        - **Version**: Indicates the file version, which differs based on the tensor data type (`3` for `float32`, `5` for `bfloat16`).
        - **Configuration Parameters**: Stores crucial model configuration parameters such as `block_size`, `vocab_size`, `n_layer`, `n_head`, and `n_embd`.

2. **Parameter Extraction**:
    - Retrieves all named parameters from the model and copies them to the CPU, so that they can be converted to raw bytes and written to disk regardless of which device the model lives on.

3. **Vocabulary Padding**:
    - Specifically adjusts the `transformer.wte.weight` tensor, which is the embedding weight matrix. The vocabulary dimension (number of rows in this matrix) is padded to a multiple of 128 using the `pad_vocab` function to ensure alignment and efficient processing in performance-sensitive environments. This step is crucial for ensuring that the tensor operations in the model are optimized for hardware accelerations.

4. **Header Update**:
    - Stores the padded vocabulary size in a separate header slot (`header[7]`), leaving the original `vocab_size` in `header[3]`. A loader needs the padded size to interpret the embedding matrix correctly when reading the file back.

5. **Writing to File**:
    - Writes the header and the model parameters to the specified binary file:
        - **Header**: Converts the header tensor to bytes and writes it to the beginning of the file.
        - **Parameters**: Calls `write_tensors` to serialize the parameters, storing them according to the specified data type (`float32` or `bfloat16`).

6. **Completion Message**:
    - Prints a confirmation message indicating the successful writing of the model to the file.

### General Explanation
This function encapsulates the model's configuration and its parameters into a compact binary format, making it easy to distribute or deploy the model in different environments. The inclusion of data type choices (`float32`, `bfloat16`) and vocabulary padding are particularly important for ensuring that the model performs optimally in environments where computational efficiency is critical, such as in embedded systems or when running on specialized hardware accelerators.
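
To make the header layout concrete, here is a minimal sketch that packs and re-reads a header of the same shape using only the standard library. The helper names `build_header` and `read_header` are hypothetical, and the sketch assumes little-endian byte order (the script itself writes the platform's native order via `header.numpy().tobytes()`).

```python
import io
import struct

HEADER_INTS = 256  # 256 int32 values = 1024 bytes

def build_header(block_size, vocab_size, n_layer, n_head, n_embd,
                 padded_vocab_size, version=3):
    """Pack a write_model-style header (hypothetical helper)."""
    header = [0] * HEADER_INTS
    header[0] = 20240326  # magic number
    header[1] = version   # 3 = float32, 5 = bfloat16
    header[2:8] = [block_size, vocab_size, n_layer, n_head, n_embd,
                   padded_vocab_size]
    return struct.pack(f"<{HEADER_INTS}i", *header)

def read_header(stream):
    """Read the header back and return the fields as a dict."""
    fields = struct.unpack(f"<{HEADER_INTS}i", stream.read(HEADER_INTS * 4))
    assert fields[0] == 20240326, "bad magic number"
    keys = ["version", "block_size", "vocab_size", "n_layer", "n_head",
            "n_embd", "padded_vocab_size"]
    return dict(zip(keys, fields[1:8]))

raw = build_header(1024, 50257, 12, 12, 768, 50304)
cfg = read_header(io.BytesIO(raw))
print(len(raw), cfg["vocab_size"], cfg["padded_vocab_size"])
```

A reader on the C side would follow the same order: check the magic number and version first, then read the configuration integers before interpreting any tensor data.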

## back to coding

In [None]:
def write_model(model, filename, dtype):
    # everything we need to instantiate the model
    # 1) header is: magic int, version int, GPTConfig ints, padded vocab size, zero padding to 1024 bytes
    assert dtype in {"float32", "bfloat16"} # float16 todo maybe later
    version = {
        "float32": 3, # 3: all tensors are fp32, padded vocab
        "bfloat16": 5, # 5: all tensors are bf16, padded vocab
    }[dtype]
    header = torch.zeros(256, dtype=torch.int32)
    header[0] = 20240326 # magic
    header[1] = version # checkpoint version
    header[2] = model.config.block_size
    header[3] = model.config.vocab_size
    header[4] = model.config.n_layer
    header[5] = model.config.n_head
    header[6] = model.config.n_embd
    # 2) the parameters follow the header
    params = {name: param.cpu() for name, param in model.named_parameters()}
    # pad the vocab to a multiple of 128 here at export, for efficiency in C
    wte = params["transformer.wte.weight"] # (V, C)
    wte_padded = pad_vocab(wte) # (Vp, C)                     # function defined in code block above!
    params["transformer.wte.weight"] = wte_padded # (Vp, C)
    print(f"padded vocab size from {wte.size(0)} to {wte_padded.size(0)}")
    header[7] = wte_padded.size(0) # padded vocab size store in header
    # now write to file
    with open(filename, "wb") as file:
        file.write(header.numpy().tobytes()) # header
        write_tensors(params, model.config.n_layer, file, dtype) # params
    print(f"wrote {filename}")

## Notes

The `write_state` function is designed to serialize the runtime state of a model to a binary file, capturing key data for debugging purposes. It includes inputs, outputs (logits), loss values, and gradients of the model parameters. This is particularly useful for verifying computation correctness, especially when integrating or porting the model to different platforms such as C environments. Here's a detailed breakdown of each step within the function:

### Parameters
- **model**: The model from which the state will be captured.
- **x**: Input tensor to the model.
- **y**: Target tensor for the model's output.
- **logits**: Logits tensor produced by the model.
- **loss**: Scalar loss value computed from the logits.
- **filename**: The filename where the state data will be saved.

### Steps and Purpose

1. **Header Preparation**:
   - Initializes a header with 256 `int32` integers, starting with a unique identifier (magic number) `20240327` and a version number for the runtime state file.
   - Stores the batch size (`B`) and temporal extent (`T`) of the input batch `x`.

2. **Gradients Preparation**:
   - Extracts gradients of all model parameters and transfers them to the CPU.
   - Specifically, it handles the padding of the gradients for the word token embedding (`transformer.wte.weight`) to ensure size alignment, similar to how the model's weights are handled during serialization. This maintains consistency between model state dumps and actual model files.

3. **File Writing**:
   - Writes the prepared header to the file.
   - Serializes the input tensor `x` and the target tensor `y` as 32-bit integers directly to the file.
   - Writes the logits using a helper function `write_fp32`, ensuring they are stored as `float32`.
   - Writes the loss value using `write_fp32`.
   - Serializes the padded gradients using the `write_tensors` function, ensuring all tensor data is stored with `float32` precision.

4. **Completion Message**:
   - Prints a confirmation message indicating the successful writing of the runtime state to the file.

### General Explanation
This function is essential for debugging the behavior of a model under specific inputs by providing a snapshot of the internal state at the time of computation. This snapshot includes input data, model outputs, computed loss, and gradients, all of which are crucial for diagnosing issues, verifying computations, or validating model behavior when porting or optimizing for other platforms. The detailed state dump allows developers or system integrators to cross-verify computations or detect anomalies in model behavior, especially when moving between different computing environments.
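
The write order above implies a fixed byte layout for the start of the state file. The sketch below computes the offsets of each section; `state_file_layout` is a hypothetical helper, and it assumes the logits have shape `(B, T, V)` with the unpadded vocabulary size `V` and are stored as `float32`, which is not shown in this section.

```python
def state_file_layout(B, T, V, header_ints=256):
    """Return ([(name, byte_offset, byte_size), ...], offset_of_gradients)."""
    sections = [
        ("header", header_ints * 4),  # 256 int32 values
        ("x", B * T * 4),             # inputs as int32
        ("y", B * T * 4),             # targets as int32
        ("logits", B * T * V * 4),    # assumed (B, T, V) float32
        ("loss", 4),                  # single float32
    ]
    layout, offset = [], 0
    for name, size in sections:
        layout.append((name, offset, size))
        offset += size
    return layout, offset             # gradients start at `offset`

layout, grads_offset = state_file_layout(B=4, T=64, V=50257)
for name, offset, size in layout:
    print(f"{name:>6}: offset={offset:>10} size={size:>10}")
print("gradients start at byte", grads_offset)
```

The gradients that follow use the same tensor order as `write_tensors`, with the embedding gradient padded to the padded vocabulary size.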

## back to coding

In [None]:
def write_state(model, x, y, logits, loss, filename):
    # the state is used for debugging.
    # it contains information about the input, logits, loss, and the parameter gradients
    # this can be used for checking the computation correctness in C
    header = torch.zeros(256, dtype=torch.int32)
    header[0] = 20240327 # magic
    header[1] = 2 # run state version = 2 (1 -> 2 for padded vocab changes)
    header[2] = x.size(0) # batch size of the batch, B
    header[3] = x.size(1) # temporal extent of the batch, T
    grads = {name: param.grad.cpu() for name, param in model.named_parameters()}
    # pad the vocab grads here as well, to mirror write_model
    wte_grad = grads["transformer.wte.weight"] # (V, C)
    wte_grad_padded = pad_vocab(wte_grad, value=0) # (Vp, C) # TODO later maybe pad with nan?
    grads["transformer.wte.weight"] = wte_grad_padded # (Vp, C)
    print(f"padded vocab size in reference grads from {wte_grad.size(0)} to {wte_grad_padded.size(0)}")
    with open(filename, "wb") as file:
        # header
        file.write(header.numpy().tobytes())
        # input x
        file.write(x.cpu().numpy().astype("int32").tobytes()) # (B, T)
        # targets y
        file.write(y.cpu().numpy().astype("int32").tobytes()) # (B, T)
        # logits (result of the model forward pass)
        write_fp32(logits.cpu(), file)
        # loss (single float, result of the cross entropy loss)
        write_fp32(loss.cpu(), file)
        # gradients
        write_tensors(grads, model.config.n_layer, file, "float32")
    print(f"wrote {filename}")

## Note

The `write_tokenizer` function is designed to serialize a tokenizer's encoding information into a binary file, which includes the entire set of tokens that the tokenizer can produce. Here’s a breakdown of the function:

### Parameters:
- **enc**: The tokenizer encoding object which must provide methods for accessing the maximum token value and decoding tokens.
- **filename**: The path where the serialized tokenizer data should be written.

### Steps and Purpose:

1. **Header Preparation**:
   - Initializes a header with 256 `int32` integers, starting with a unique identifier (magic number) `20240328` and a version number for the tokenizer data format.
   - Stores the total number of tokens `n` (which is `enc.max_token_value + 1`) and a special end-of-text (EOT) token value.

2. **File Writing**:
   - Writes the header to the binary file. The header includes metadata about the tokenizer such as the total number of tokens and the specific EOT token.
   - Iterates through each token ID from `0` to `n-1`:
     - Decodes the token into its byte representation using the tokenizer's `decode_bytes` method.
     - Ensures the length of each token does not exceed 255 bytes, so that each length fits in a single unsigned byte.
     - Writes the length of the token followed by the token’s actual byte representation to the file.

3. **Completion Message**:
   - Prints a message indicating the successful serialization of the tokenizer data to the specified file.

### General Explanation:
This function serializes a tokenizer into a format that can be easily loaded by other software or systems that need to understand or reconstruct the tokenizer’s behavior without having direct access to the high-level tokenizer library. This is particularly useful for deploying models in environments where the original tokenizer implementation might not be available or when optimizing performance in low-level systems like embedded devices or custom hardware. The use of a simple format—token lengths followed by token bytes—ensures that the tokenizer can be reconstructed or used in a straightforward manner across different platforms.
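
The length-prefixed format is simple enough to read back in a few lines. The following sketch writes a toy vocabulary in the same layout and parses it again; `write_toy_tokenizer` and `read_tokenizer` are hypothetical helpers, the toy tokens are made up for illustration, and little-endian byte order is assumed for the header.

```python
import io
import struct

def write_toy_tokenizer(tokens, eot_token):
    """Serialize a list of byte strings in the write_tokenizer layout."""
    header = [0] * 256
    header[0] = 20240328      # magic number
    header[1] = 2             # tokenizer version
    header[2] = len(tokens)   # number of tokens
    header[3] = eot_token     # EOT token id
    out = io.BytesIO()
    out.write(struct.pack("<256i", *header))
    for b in tokens:
        assert len(b) < 256, "token too long for a 1-byte length prefix"
        out.write(struct.pack("<B", len(b)))  # 1-byte length
        out.write(b)                          # raw token bytes
    return out.getvalue()

def read_tokenizer(data):
    """Parse the layout back into (list_of_token_bytes, eot_token)."""
    stream = io.BytesIO(data)
    header = struct.unpack("<256i", stream.read(1024))
    assert header[0] == 20240328, "bad magic number"
    n, eot_token = header[2], header[3]
    tokens = []
    for _ in range(n):
        (length,) = struct.unpack("<B", stream.read(1))
        tokens.append(stream.read(length))
    return tokens, eot_token

toy = [b"!", b"hello", b" world", b"<eot>"]
tokens, eot = read_tokenizer(write_toy_tokenizer(toy, eot_token=3))
print(tokens, eot)
```

Because each token is stored as raw bytes rather than text, a reader does not need to know anything about the tokenizer's merge rules to decode token ids back into bytes.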

## Back to coding

In [None]:
def write_tokenizer(enc, filename):
    n = enc.max_token_value + 1
    header = torch.zeros(256, dtype=torch.int32)
    header[0] = 20240328 # magic
    header[1] = 2 # tokenizer version = 2 (1 -> 2: includes EOT token)
    header[2] = n # number of tokens
    header[3] = enc.eot_token # EOT token
    with open(filename, "wb") as file:
        file.write(header.numpy().tobytes())
        for i in range(n):
            b = enc.decode_bytes([i])
            length = len(b)
            assert length < 256, f"Token length exceeds 255: {length}"
            file.write(struct.pack("<B", length))  # Write the length as a 1-byte unsigned integer
            file.write(b)  # Write the actual bytes
    print(f"wrote {filename}")

In [None]:
# -----------------------------------------------------------------------------
# int main

In [None]:
def print0(*args, **kwargs):
    # modified print that only prints from the master process
    # if this is not a distributed run, it's just a print
    if int(os.environ.get("RANK", 0)) == 0:
        print(*args, **kwargs)

## NOTES <-- DDP

This section of the script sets up the Distributed Data Parallel (DDP) environment for training a neural network using PyTorch. Let’s go through each line to understand what it does:

1. **Check if running in DDP mode**:
   ```python
   ddp = int(os.environ.get('RANK', -1)) != -1
   ```
   This line checks if the script is being run in a distributed manner. It looks for the environment variable 'RANK', which is typically set by a DDP launching tool like `torchrun` or `torch.distributed.launch`. If 'RANK' is not set, it defaults to -1, indicating the script is not running in DDP mode.

2. **Setup for DDP**:
   If the script is running in DDP mode (`ddp` is `True`), several steps are followed:
   - **Assert CUDA availability**:
     ```python
     assert torch.cuda.is_available(), "for now i think we need CUDA for DDP"
     ```
     This asserts that CUDA is available on the system, which is required for GPU-based distributed training.

   - **Initialize the process group**:
     ```python
     init_process_group(backend='nccl')
     ```
     Initializes the process group for communication among the distributed processes and sets 'nccl' as the backend for GPU-based collective operations. NCCL (NVIDIA Collective Communications Library) is optimized for NVIDIA GPUs.

   - **Set ranks and device**:
     ```python
     ddp_rank = int(os.environ['RANK'])
     ddp_local_rank = int(os.environ['LOCAL_RANK'])
     ddp_world_size = int(os.environ['WORLD_SIZE'])
     device = f'cuda:{ddp_local_rank}'
     torch.cuda.set_device(device)
     ```
     - `ddp_rank` is the global rank of the process.
     - `ddp_local_rank` is the rank of the process within the local machine (useful when multiple GPUs on a single machine are used).
     - `ddp_world_size` is the total number of processes involved in the training.
     - `device` sets the specific GPU to use by its local rank.

   - **Master process and seed offset**:
     ```python
     master_process = ddp_rank == 0
     seed_offset = 0
     ```
     - `master_process` is a boolean indicating if the current process is the master (rank 0). This is often used to manage logging or saving models to prevent multiple processes from performing the same task.
     - `seed_offset` ensures each process gets the same seed for reproducibility in this context.

   - **Set Zero Redundancy Optimizer stage**:
     ```python
     zero_stage = args.zero_stage
     ```
     This line sets the stage for using the Zero Redundancy Optimizer, which is a way to reduce memory usage during distributed training by partitioning the optimizer states.

3. **Setup for non-DDP (single process mode)**:
   If not running in DDP mode, the script sets the following defaults:
   - **Ranks and world size**:
     ```python
     ddp_rank = 0
     ddp_local_rank = 0
     zero_stage = 0
     ddp_world_size = 1
     ```
     Here, it sets the rank, local rank, and world size to values consistent with a single process run.

   - **Master process**:
     ```python
     master_process = True
     ```
     Since there is only one process, it is by default the master process.

   - **Device selection**:
     ```python
     if args.device:
         device = args.device
     else:
         device = "cpu"
         if torch.cuda.is_available():
             device = "cuda"
         elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
             device = "mps"
     ```
     This portion selects the computation device. An explicitly specified `args.device` takes priority; otherwise the script autodetects, preferring CUDA if available, then Metal Performance Shaders (MPS) for Apple GPUs on macOS, and falling back to CPU.

This setup ensures that the script can be flexibly run either on a single machine or in a distributed manner across multiple machines or GPUs.
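
The rank detection logic can be exercised without launching a distributed job by passing in a plain dictionary instead of `os.environ`. This is a minimal sketch; `ddp_settings` is a hypothetical helper that only mirrors the environment-variable handling described above and does not initialize a process group.

```python
def ddp_settings(environ):
    """Derive rank information from an environment mapping (hypothetical helper)."""
    ddp = int(environ.get("RANK", -1)) != -1  # RANK is set by launchers such as torchrun
    if not ddp:
        # single-process defaults
        return {"ddp": False, "rank": 0, "local_rank": 0,
                "world_size": 1, "master": True, "device": None}
    rank = int(environ["RANK"])
    local_rank = int(environ["LOCAL_RANK"])
    return {
        "ddp": True,
        "rank": rank,
        "local_rank": local_rank,
        "world_size": int(environ["WORLD_SIZE"]),
        "master": rank == 0,               # only rank 0 logs and writes files
        "device": f"cuda:{local_rank}",    # one GPU per local rank
    }

print(ddp_settings({}))
print(ddp_settings({"RANK": "5", "LOCAL_RANK": "1", "WORLD_SIZE": "8"}))
```

In the second call, the process is global rank 5 of 8 and uses the second GPU on its machine, so it is not the master process.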

## Back to coding

## Notes <- gradient accumulation steps

This section of the script contains a variety of setups for device configuration, gradient accumulation, and random number generation for reproducibility. Here’s what each line does:

1. **Print and Set Device Type**:
   ```python
   print(f"using device: {device}")
   device_type = 'cuda' if 'cuda' in device else 'cpu'
   ```
   This prints the selected device. It then sets `device_type` to `'cuda'` if the `device` string contains "cuda" and to `'cpu'` otherwise, so any non-CUDA device (including `mps`) is treated as `'cpu'` when the autocast context is set up.

2. **Gradient Accumulation Setup**:
   ```python
   tokens_per_fwdbwd = B * T * ddp_world_size
   assert args.total_batch_size % tokens_per_fwdbwd == 0
   grad_accum_steps = args.total_batch_size // tokens_per_fwdbwd
   ```
   - `tokens_per_fwdbwd` calculates the number of tokens processed per forward-backward pass across all devices. This is the product of batch size (`B`), sequence length (`T`), and the number of distributed processes (`ddp_world_size`).
   - An assertion ensures that the desired total batch size (`args.total_batch_size`) is divisible by the number of tokens processed per pass, so that the number of accumulation steps is a whole number.
   - `grad_accum_steps` is calculated by dividing the total desired batch size by the number of tokens per pass, determining how many gradient accumulation steps are needed to achieve the desired batch size.

3. **Context Manager for Automatic Mixed Precision**:
   ```python
   ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[args.dtype]
   ctx = torch.amp.autocast(device_type=device_type, dtype=ptdtype) if device_type == "cuda" else nullcontext()
   ```
   - `ptdtype` selects the PyTorch data type based on the user's choice.
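   - `ctx` sets up an automatic mixed precision context if running on a CUDA device; otherwise it is a no-op `nullcontext()`. Inside the context, `torch.amp.autocast` runs eligible operations (such as matrix multiplications) in the specified lower precision to improve performance, while keeping precision-sensitive operations in `float32`.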

4. **Random Number Generator (RNG) for Reproducibility**:
   ```python
   torch.manual_seed(42)
   if torch.cuda.is_available():
       torch.cuda.manual_seed(42)
   ```
   Sets the seed for the random number generators in PyTorch to ensure reproducibility. This makes sure that random operations such as weight initialization are the same across runs.

5. **Precision Mode Setting**:
   ```python
   if args.tensorcores:
       torch.set_float32_matmul_precision('high')
   ```
   - This line checks if tensor cores are enabled (used in NVIDIA GPUs for faster matrix multiplication).
   - If enabled, it sets the matrix multiplication precision mode to 'high', allowing the use of TensorFloat32 precision, which strikes a balance between performance and accuracy.

Together, these setups ensure that the computational environment is correctly configured for efficient and reproducible training, especially when using advanced features like mixed precision and distributed data parallel processing.
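
A short worked example of the gradient accumulation arithmetic may help. The helper `grad_accum_steps_for` is hypothetical; the first call uses the script's default arguments (`B=4`, `T=64`, `total_batch_size=256`), while the second configuration is made up purely for illustration.

```python
def grad_accum_steps_for(total_batch_size, B, T, world_size):
    """Number of micro-steps needed to reach the desired batch size in tokens."""
    tokens_per_fwdbwd = B * T * world_size
    assert total_batch_size % tokens_per_fwdbwd == 0, "batch size must divide evenly"
    return total_batch_size // tokens_per_fwdbwd

# script defaults on a single process: 4 * 64 * 1 = 256 tokens per pass
print(grad_accum_steps_for(256, B=4, T=64, world_size=1))        # 1

# illustrative 4-GPU run: 4 * 1024 * 4 = 16384 tokens per pass
print(grad_accum_steps_for(524288, B=4, T=1024, world_size=4))   # 32
```

With the defaults, a single forward-backward pass already covers the whole batch, so no accumulation takes place.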

## Back to coding

In [None]:
if __name__ == "__main__":
    import time
    import argparse
    import tiktoken
    print0(f"Running pytorch {torch.version.__version__}") # print0 // seen above, this is meant for distributed runs. so only 1 process is printing instead of all 8 or all 16, etc.

    # default settings will overfit a tiny batch of data
    # and save model weights and debug state to disk on the first iteration
    parser = argparse.ArgumentParser()
    # file system input / output
    parser.add_argument("--input_bin", type=str, default="dev/data/tinyshakespeare/tiny_shakespeare_val.bin", help="input .bin to train on")
    parser.add_argument("--input_val_bin", type=str, default="", help="input .bin to eval validation loss on")
    parser.add_argument("--output_dir", type=str, default="", help="output directory to which to write logs and checkpoints")
    parser.add_argument("--model", type=str, default="gpt2", help="gpt2|gpt2-medium|gpt2-large|gpt2-xl|d12|d24|d36|d48")
    # token layout for each step of the optimization
    parser.add_argument("--batch_size", type=int, default=4, help="batch size, in units of #batch dimensions")
    parser.add_argument("--sequence_length", type=int, default=64, help="sequence length")
    parser.add_argument("--total_batch_size", type=int, default=256, help="total desired batch size, in units of #tokens")
    # workload (number of steps)
    parser.add_argument("--num_iterations", type=int, default=10, help="number of iterations to run")
    parser.add_argument("--inference_only", type=int, default=0, help="only run inference")
    # optimization
    parser.add_argument("--learning_rate", type=float, default=1e-4, help="learning rate")
    parser.add_argument("--warmup_iters", type=int, default=0, help="learning rate warmup iterations")
    parser.add_argument("--learning_rate_decay_frac", type=float, default=1.0, help="learning rate decay fraction")
    parser.add_argument("--weight_decay", type=float, default=0.0, help="weight decay")
    parser.add_argument("--grad_clip", type=float, default=1.0, help="maximum gradient magnitude")
    # evaluation
    parser.add_argument("--val_loss_every", type=int, default=0, help="every how many steps to evaluate val loss?")
    parser.add_argument("--val_max_steps", type=int, default=20, help="how many batches of val to average?")
    parser.add_argument("--sample_every", type=int, default=0, help="how often to sample from the model?")
    # debugging
    parser.add_argument("--overfit_single_batch", type=int, default=1, help="overfit just one batch of data")
    # numerics
    parser.add_argument("--tensorcores", type=int, default=0, help="use tensorcores")
    # memory management
    parser.add_argument("--device", type=str, default="", help="by default we autodetect, or set it here")
    parser.add_argument("--compile", type=int, default=0, help="torch.compile the model")
    parser.add_argument("--flash", type=int, default=0, help="use flash attention")
    parser.add_argument("--dtype", type=str, default="float32", help="float32|float16|bfloat16")
    parser.add_argument("--zero_stage", type=int, default=0, help="zero redundancy optimizer stage (0/1/2/3)")
    # python -> C bridge
    parser.add_argument("--write_tensors", type=int, default=1, help="write tensors to disk")
    args = parser.parse_args()

    # args error checking and convenience variables
    B, T = args.batch_size, args.sequence_length # B = batch size: number of sequences processed together in one forward/backward pass
                                                 # T = sequence length: number of tokens in each input sequence
    assert 1 <= T <= 1024 # sequence length must be between 1 and 1024 // prevents the input sequence from being too short or too long
    assert args.dtype in {"float32", "float16", "bfloat16"}
    assert args.model in {"gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl", "d12", "d24", "d36", "d48"}

    # set up DDP (distributed data parallel). torchrun sets this env variable
    ddp = int(os.environ.get('RANK', -1)) != -1 # is this a ddp run? // looks for environment variable "RANK"
                                                        # this RANK variable is typically set by a DDP launching tool like "torchrun" or "torch.distributed.launch"
    if ddp:
        # use of DDP atm demands CUDA, we set the device appropriately according to rank
        assert torch.cuda.is_available(), "for now i think we need CUDA for DDP"
        init_process_group(backend='nccl') # initialises process grp for communicating among distributed processes, and sets "nccl" <-- Nvidia Collective Communications Library
        ddp_rank = int(os.environ['RANK']) # global rank of the process
        ddp_local_rank = int(os.environ['LOCAL_RANK'])        # the rank of the process within the local machine (useful when multiple GPUs on a single machine are used).
        ddp_world_size = int(os.environ['WORLD_SIZE'])        # total number of processes involved in the training
        device = f'cuda:{ddp_local_rank}'
        torch.cuda.set_device(device)     # sets the specific GPU to use by its local rank.
        master_process = ddp_rank == 0 # this process will do logging, checkpointing etc. // boolean indicates if current process is the master (rank 0) <-- this is often used to manage logging or saving
                                       # models to prevent multiple processes from performing the same task
        seed_offset = 0 # each process gets the exact same seed
        zero_stage = args.zero_stage   # sets stage for using the Zero Redundancy Optimizer <-- way to reduce memory usage during distributed training by partitioning the optimizer states (rmb abt AdamW vs zero_stage optim)
    else:
        ddp_rank = 0 # single process mode // setup for non-DDP
        ddp_local_rank = 0
        zero_stage = 0
        ddp_world_size = 1
        master_process = True
        seed_offset = 0
        # select the device
        if args.device:
            # provided explicitly by the user
            device = args.device
        else:
            # attempt to autodetect the device
            device = "cpu"
            if torch.cuda.is_available():
                device = "cuda"
            elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
                device = "mps"
    print(f"using device: {device}") # prints the device that will be used
    device_type = 'cuda' if 'cuda' in device else 'cpu'

    # calculate gradient accumulation from the desired total batch size and the current run configuration
    tokens_per_fwdbwd = B * T * ddp_world_size                        # calculates the number of tokens processed per forward-backward pass across all devices. This is the product of batch size (B),
                                                                      # sequence length (T), and the number of distributed processes (ddp_world_size).
    assert args.total_batch_size % tokens_per_fwdbwd == 0 # assertion ensures desired total batch size is divisible by number of tokens processed per pass
    grad_accum_steps = args.total_batch_size // tokens_per_fwdbwd     # grad_accum_steps is calculated by dividing the total desired batch size by the number of tokens per pass, determining how many
                                                                      # gradient accumulation steps are needed to achieve the desired batch size.
    print0(f"total desired batch size: {args.total_batch_size}")
    print0(f"=> calculated gradient accumulation steps: {grad_accum_steps}")

    # set up a context manager following the desired dtype and device
    ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[args.dtype] # selects PyTorch datatype (dtype) // based on args' argument defined in parser
    ctx = torch.amp.autocast(device_type=device_type, dtype=ptdtype) if device_type == "cuda" else nullcontext() # ctx sets up an automatic mixed precision context if running on a CUDA device. torch.amp.autocast
                                                                                                                 # automatically casts variables to the specified precision to improve performance during training.

    # rng / reproducibility
    torch.manual_seed(42)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(42)

    # set the torch precision mode to use TensorFloat32 (TF32) for matmuls
    # docs https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html
    if args.tensorcores:
        torch.set_float32_matmul_precision('high') # 'high' allows float32 matmuls to use TF32 on GPUs that support it

    # turn on/off flash attention
    assert args.flash in {0, 1} # configuring flash attention (refer to self attention class)
    FLASH = args.flash

    # init (and write) the tokenizer
    enc = tiktoken.get_encoding("gpt2") # tokenizer initialised using tiktoken
    if master_process and args.write_tensors: # tokenizer is technically not tensors but ok
        write_tokenizer(enc, "gpt2_tokenizer.bin") # if the current process is the master in a distributed settings, and "write_tensors" is true. The tokenizer is written to
                                                   # a binary file. Allows for easy loading and use of the tokenizer in other components or system.

    # init the model, either from scratch ("d" models) or from OpenAI pretrained checkpoint (load pre-trained weights from GPT-2 variants)
    if args.model[0] == "d":
        # from scratch (random weights)
        model_config = {
            "d12": GPTConfig(block_size=1024, vocab_size=50257, n_layer=12, n_head=12, n_embd=768),
            "d24": GPTConfig(block_size=1024, vocab_size=50257, n_layer=24, n_head=16, n_embd=1024),
            "d36": GPTConfig(block_size=1024, vocab_size=50257, n_layer=36, n_head=20, n_embd=1280),
            "d48": GPTConfig(block_size=1024, vocab_size=50257, n_layer=48, n_head=25, n_embd=1600),
        }[args.model]
        model = GPT(model_config) # initialising class by passing in config
    else:
        # load the GPT-2 model weights
        model = GPT.from_pretrained(args.model)
    model.train()           # model is set to training mode, which affects modules that behave differently in training vs. eval (e.g. dropout)
    model.to(device)
    if args.compile:
        """
        Optionally, if args.compile is set, the script compiles the model with torch.compile to potentially improve execution
        speed. coordinate_descent_tuning is a TorchInductor option (torch._inductor.config, imported above as `config`); the hasattr
        check guards against PyTorch versions in which the option is not present.
        """
        if hasattr(config, "coordinate_descent_tuning"):
            config.coordinate_descent_tuning = True # suggested by @Chillee
        print0("compiling the model...")
        model = torch.compile(model)

    # -------------------------------------------------------------------------
    # Our own version of a simple DistributedDataLoader

    # load tokens
    train_loader = DistributedDataLoader(args.input_bin, B, T, ddp_rank, ddp_world_size) # for training set
    val_loader = None
    if args.input_val_bin:
        val_loader = DistributedDataLoader(args.input_val_bin, B, T, ddp_rank, ddp_world_size) # for validation set

    # -------------------------------------------------------------------------
    # PyTorch -> C bridge: save some weights and state for C to load later as reference

    # do one forward pass to generate ground truth for our C tests
    if master_process and args.write_tensors and (not args.inference_only):
        x, y = train_loader.next_batch() # loads batch of data // input "x" and target "y" // using the training data loader
        x, y = x.to(device), y.to(device)
        logits, loss = model(x, y) # forward pass: invokes model.forward(), returning logits and loss
        loss.backward() # backward pass // backpropagation to compute gradients

        # save model params, in both float32 and bfloat16
        # (for the section below, refer to the notes above)
        model_to_size = {"gpt2": "124M", "gpt2-medium": "355M", "gpt2-large": "774M", "gpt2-xl": "1558M"}
        model_to_size.update({f"d{d}": f"d{d}" for d in [12, 24, 36, 48]})
        model_size_str = model_to_size[args.model] # e.g. "124M", or "d12"
        write_model(model, f"gpt2_{model_size_str}.bin", dtype="float32")
        write_model(model, f"gpt2_{model_size_str}_bf16.bin", dtype="bfloat16")
        # save x, y, logits, loss, and parameter gradients, for debugging C
        # always store these in fp32 to have an accurate reference (?)
        write_state(model, x, y, logits, loss, f"gpt2_{model_size_str}_debug_state.bin")
        # reset the train_loader for the optimization below
        train_loader.reset()

    # -------------------------------------------------------------------------
    # main training loop

    # here we wrap model into DDP container
    if ddp:
        model = DDP(model, device_ids=[ddp_local_rank])
    raw_model = model.module if ddp else model # always contains the "raw" unwrapped model

    # init the optimizer
    optimizer = raw_model.configure_optimizers(weight_decay=args.weight_decay,
                                               learning_rate=args.learning_rate, betas=(0.9, 0.95),
                                               device_type=device, zero_stage=zero_stage)

    # learning rate decay scheduler (cosine with warmup)
    def get_lr(it):
        min_lr = args.learning_rate * args.learning_rate_decay_frac
        # 1) linear warmup for warmup_iters steps
        if it < args.warmup_iters:
            return args.learning_rate * (it+1) / args.warmup_iters
        # 2) if it > num_iterations, return min learning rate
        if it > args.num_iterations:
            return min_lr
        # 3) in between, use cosine decay down to min learning rate
        decay_ratio = (it - args.warmup_iters) / (args.num_iterations - args.warmup_iters)
        assert 0 <= decay_ratio <= 1
        coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff starts at 1 and goes to 0
        return min_lr + coeff * (args.learning_rate - min_lr)

    # create the logging directory if it does not exist
    logfile = None
    if args.output_dir:
        os.makedirs(args.output_dir, exist_ok=True)
        logfile = os.path.join(args.output_dir, "main.log")
        # create the log file "main.log" inside it, and wipe it clean
        with open(logfile, "w") as f:
            pass

    if device == "cuda":
        torch.cuda.reset_peak_memory_stats()
    timings = []
    norm = -1.0   # dummy value to print in inference-only mode
    for step in range(args.num_iterations + 1):
        t0 = time.time()
        last_step = (step == args.num_iterations)

        # once in a while evaluate the validation dataset
        if (args.val_loss_every > 0 \
            and (step % args.val_loss_every == 0 or last_step)) \
            and (val_loader is not None):
            model.eval()
            val_loader.reset()
            with torch.no_grad():
                val_loss = 0.0
                for _ in range(args.val_max_steps):
                    x, y = val_loader.next_batch()
                    x, y = x.to(device), y.to(device)
                    _, loss = model(x, y, return_logits=False)
                    val_loss += loss.item()
                val_loss /= args.val_max_steps
            # log to console and to file
            print0(f"val loss {val_loss}")
            if master_process and logfile is not None:
                with open(logfile, "a") as f:
                    f.write("s:%d tel:%f\n" % (step, val_loss))

        # once in a while perform model inference on the master process
        if (args.sample_every > 0 \
            and (step % args.sample_every == 0 or last_step)) \
            and master_process:
            model.eval()
            # before we end, let's also do one round of inference
            # we'll kick off the generation with "<|endoftext|>", which designates the start of a new sequence
            start_ids = [enc.eot_token]
            xg = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])
            max_new_tokens = 32
            temperature = 1.0
            top_k = 40
            yg = raw_model.generate(xg, max_new_tokens, temperature=temperature, top_k=top_k)
            print0('---------------')
            print0(enc.decode(yg[0].tolist()))
            print0('---------------')

        # bit confusing: we want to make sure to eval and sample on 0th iteration
        # but also after the very last iteration. so we loop for step <= num_iterations
        # instead of just < num_iterations (one extra due to <=), only to do
        # the validation/sampling one last time, and then we break right here as we're done.
        if last_step:
            break

        # --------------- TRAINING SECTION BEGIN -----------------
        model.train()
        optimizer.zero_grad(set_to_none=True)
        # if we are trying to overfit a single batch, we reset the loader here
        if args.overfit_single_batch:
            train_loader.reset()
        # micro-batch loop where we do gradient accumulation to reach desired total batch size
        lossf = 0.0 # for getting the mean loss (as simple float) over the accumulation steps
        for micro_step in range(grad_accum_steps):
            # fetch a batch
            x, y = train_loader.next_batch()
            x, y = x.to(device), y.to(device)
            if ddp:
                # we want only the last micro-step to sync grads in a DDP model
                # the official way to do this is with model.no_sync(), but that is a
                # context manager that bloats the code, so we just toggle this variable
                model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)
            # forward pass
            with ctx:
                _, loss = model(x, y, return_logits=False)
                # we have to scale the loss to account for gradient accumulation,
                # because the gradients just add on each successive backward().
                # addition of gradients corresponds to a SUM in the objective, but
                # instead of a SUM we want MEAN, so we scale the loss here
                loss = loss / grad_accum_steps
                lossf += loss.detach() # keep track of the mean loss
            # backward pass
            if not args.inference_only:
                loss.backward()
        if ddp:
            dist.all_reduce(lossf, op=dist.ReduceOp.AVG)
        lossf = lossf.item()
        norm = torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_clip)
        # determine and set the learning rate for this iteration
        lr = get_lr(step)
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr
        # step the optimizer
        optimizer.step()
        # --------------- TRAINING SECTION END -------------------
        # everything that follows now is just diagnostics, prints, logging, etc.

        # wait on the CPU for all device work to end so we get accurate per-iteration timings below
        if device == "mps":
            torch.mps.synchronize()
        elif device == "cuda":
            torch.cuda.synchronize()
        # time and print
        t1 = time.time()
        # the 0th iteration is often an outlier (much slower); it is still printed here,
        # but it is excluded from the smoothed timings collected below
        tokens_per_second = grad_accum_steps * ddp_world_size * B * T / (t1-t0)
        print0(f"step {step+1:4d}/{args.num_iterations} | train loss {lossf:.6f} | norm {norm:.4f} | lr {lr:.2e} | ({(t1-t0)*1000:.2f} ms | {tokens_per_second:.0f} tok/s)")
        # log to logfile
        if master_process and logfile is not None:
            with open(logfile, "a") as f:
                f.write("s:%d trl:%f\n" % (step, lossf))

        # keep track of smooth timings, last 20 iterations
        if step > 0 and step > args.num_iterations - 20:
            timings.append(t1-t0)

    # print the average of the last 20 timings, to get something smooth-ish
    timings = timings[-20:]
    print0(f"final {len(timings)} iters avg: {np.mean(timings)*1000:.3f}ms")
    print0(f"peak memory consumption: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB")

    # -------------------------------------------------------------------------
    # clean up nice
    if ddp:
        destroy_process_group()
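
The micro-batch loop above divides each loss by `grad_accum_steps` because successive `backward()` calls add gradients together. The following is a minimal sketch of that argument on a toy problem, not the training code itself: the scalar model `y = w * x`, the data, and the helper `mse_grad` are all hypothetical and exist only for illustration. It uses an analytic gradient in plain Python so that it runs without PyTorch.

```python
import math

def mse_grad(w, xs, ys):
    """Analytic gradient of mean((w*x - y)^2) with respect to w (hypothetical toy model)."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# made-up data for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 3.9, 6.1, 8.0, 9.8, 12.2, 14.1, 15.9]
w = 0.5

grad_accum_steps = 4
micro = len(xs) // grad_accum_steps  # equal-sized micro-batches

# gradient of the mean loss over the full batch
full_grad = mse_grad(w, xs, ys)

accum_grad = 0.0     # gradients of the scaled micro-batch losses, summed
unscaled_grad = 0.0  # gradients of the unscaled micro-batch losses, summed
for micro_step in range(grad_accum_steps):
    lo, hi = micro_step * micro, (micro_step + 1) * micro
    g = mse_grad(w, xs[lo:hi], ys[lo:hi])
    # scaling the loss by 1/grad_accum_steps scales its gradient by the same factor
    accum_grad += g / grad_accum_steps
    unscaled_grad += g

print(math.isclose(accum_grad, full_grad, rel_tol=1e-9))
print(math.isclose(unscaled_grad, grad_accum_steps * full_grad, rel_tol=1e-9))
```

With equal-sized micro-batches, the scaled sum matches the full-batch gradient, while the unscaled sum is larger by a factor of `grad_accum_steps`.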

# Below is PyTorch's official implementation of LayerNorm
Module is imported with ```from .module import Module```

In [None]:
class LayerNorm(Module):
    r"""Applies Layer Normalization over a mini-batch of inputs.

    This layer implements the operation as described in
    the paper `Layer Normalization <https://arxiv.org/abs/1607.06450>`__

    .. math::
        y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta

    The mean and standard-deviation are calculated over the last `D` dimensions, where `D`
    is the dimension of :attr:`normalized_shape`. For example, if :attr:`normalized_shape`
    is ``(3, 5)`` (a 2-dimensional shape), the mean and standard-deviation are computed over
    the last 2 dimensions of the input (i.e. ``input.mean((-2, -1))``).
    :math:`\gamma` and :math:`\beta` are learnable affine transform parameters of
    :attr:`normalized_shape` if :attr:`elementwise_affine` is ``True``.
    The standard-deviation is calculated via the biased estimator, equivalent to
    `torch.var(input, unbiased=False)`.

    .. note::
        Unlike Batch Normalization and Instance Normalization, which applies
        scalar scale and bias for each entire channel/plane with the
        :attr:`affine` option, Layer Normalization applies per-element scale and
        bias with :attr:`elementwise_affine`.

    This layer uses statistics computed from input data in both training and
    evaluation modes.

    Args:
        normalized_shape (int or list or torch.Size): input shape from an expected input
            of size

            .. math::
                [* \times \text{normalized\_shape}[0] \times \text{normalized\_shape}[1]
                    \times \ldots \times \text{normalized\_shape}[-1]]

            If a single integer is used, it is treated as a singleton list, and this module will
            normalize over the last dimension which is expected to be of that specific size.
        eps: a value added to the denominator for numerical stability. Default: 1e-5
        elementwise_affine: a boolean value that when set to ``True``, this module
            has learnable per-element affine parameters initialized to ones (for weights)
            and zeros (for biases). Default: ``True``.
        bias: If set to ``False``, the layer will not learn an additive bias (only relevant if
            :attr:`elementwise_affine` is ``True``). Default: ``True``.

    Attributes:
        weight: the learnable weights of the module of shape
            :math:`\text{normalized\_shape}` when :attr:`elementwise_affine` is set to ``True``.
            The values are initialized to 1.
        bias:   the learnable bias of the module of shape
                :math:`\text{normalized\_shape}` when :attr:`elementwise_affine` is set to ``True``.
                The values are initialized to 0.

    Shape:
        - Input: :math:`(N, *)`
        - Output: :math:`(N, *)` (same shape as input)

    Examples::

        >>> # NLP Example
        >>> batch, sentence_length, embedding_dim = 20, 5, 10
        >>> embedding = torch.randn(batch, sentence_length, embedding_dim)
        >>> layer_norm = nn.LayerNorm(embedding_dim)
        >>> # Activate module
        >>> layer_norm(embedding)
        >>>
        >>> # Image Example
        >>> N, C, H, W = 20, 5, 10, 10
        >>> input = torch.randn(N, C, H, W)
        >>> # Normalize over the last three dimensions (i.e. the channel and spatial dimensions)
        >>> # as shown in the image below
        >>> layer_norm = nn.LayerNorm([C, H, W])
        >>> output = layer_norm(input)

    .. image:: ../_static/img/nn/layer_norm.jpg
        :scale: 50 %

    """

    __constants__ = ['normalized_shape', 'eps', 'elementwise_affine']
    normalized_shape: Tuple[int, ...]
    eps: float
    elementwise_affine: bool

    def __init__(self, normalized_shape: _shape_t, eps: float = 1e-5, elementwise_affine: bool = True,
                 bias: bool = True, device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super().__init__()
        if isinstance(normalized_shape, numbers.Integral):
            # mypy error: incompatible types in assignment
            normalized_shape = (normalized_shape,)  # type: ignore[assignment]
        self.normalized_shape = tuple(normalized_shape)  # type: ignore[arg-type]
        self.eps = eps
        self.elementwise_affine = elementwise_affine
        if self.elementwise_affine:
            self.weight = Parameter(torch.empty(self.normalized_shape, **factory_kwargs))
            if bias:
                self.bias = Parameter(torch.empty(self.normalized_shape, **factory_kwargs))
            else:
                self.register_parameter('bias', None)
        else:
            self.register_parameter('weight', None)
            self.register_parameter('bias', None)

        self.reset_parameters()

    def reset_parameters(self) -> None:
        if self.elementwise_affine:
            init.ones_(self.weight)
            if self.bias is not None:
                init.zeros_(self.bias)

    def forward(self, input: Tensor) -> Tensor:
        return F.layer_norm(input, self.normalized_shape, self.weight, self.bias, self.eps)

    def extra_repr(self) -> str:
        return '{normalized_shape}, eps={eps}, ' \
            'elementwise_affine={elementwise_affine}'.format(**self.__dict__)
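
The formula in the docstring above can be checked by hand on a single vector. The following is a rough sketch in plain Python, assuming normalization over one dimension only; the function name `layer_norm_1d` and the sample values are hypothetical and are not part of PyTorch, whose actual computation happens inside `torch.layer_norm`.

```python
import math

def layer_norm_1d(x, gamma=None, beta=None, eps=1e-5):
    """Normalize a list of floats: (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta."""
    n = len(x)
    mean = sum(x) / n
    # biased variance estimator: divide by n, not n - 1
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # defaults mirror the module's initial parameters: weight of ones, bias of zeros
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

row = [1.0, 2.0, 3.0, 4.0]  # made-up input
out = layer_norm_1d(row)

out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out_mean, out_var)  # mean close to 0, variance slightly below 1 because of eps

# a per-element scale and shift is applied after normalization
scaled = layer_norm_1d(row, gamma=[2.0] * 4, beta=[1.0] * 4)
```

The output variance lands just under 1 rather than exactly on it, since `eps` is added inside the square root.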

Below is the raw implementation of torch.nn.functional's ```F.layer_norm```

In [None]:
def layer_norm(input: Tensor, normalized_shape: List[int], weight: Optional[Tensor] = None, bias: Optional[Tensor] = None, eps: float = 1e-5) -> Tensor:
    r"""Apply Layer Normalization for last certain number of dimensions.

    See :class:`~torch.nn.LayerNorm` for details.
    """
    if has_torch_function_variadic(input, weight, bias):
        return handle_torch_function(layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, bias=bias, eps=eps)

    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)