### Coding an LLM architecture

In [17]:
import torch
import torch.nn as nn

We specify the configuration of the small GPT-2 model via the following
Python dictionary, which we will use in the code examples later:

In [18]:
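GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of transformer blocks
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-key-value bias
}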

- "vocab_size" refers to a vocabulary of 50,257 tokens, as used by the
BPE tokenizer from chapter 2.
- "context_length" denotes the maximum number of input tokens the
model can handle, via the positional embeddings discussed in chapter 2.
- "emb_dim" represents the embedding size, transforming each token into
a 768-dimensional vector.
- "n_heads" indicates the count of attention heads in the multi-head
attention mechanism, as implemented in chapter 3.
- "n_layers" specifies the number of transformer blocks in the model,
which will be elaborated on in upcoming sections.
- "drop_rate" indicates the intensity of the dropout mechanism (0.1
implies a 10% drop of hidden units) to prevent overfitting, as covered in
chapter 3.
- "qkv_bias" determines whether to include a bias vector in the Linear
layers of the multi-head attention for query, key, and value
computations. We will initially disable this, following the norms of
modern LLMs, but will revisit it in chapter 6 when we load pretrained
GPT-2 weights from OpenAI into our model.
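
As a quick sanity check on these settings, the following sketch derives a few quantities implied by the configuration. The names `head_dim`, `tok_emb_params`, and `pos_emb_params` are illustrative and do not appear in the original code; the dictionary is repeated only so that the snippet runs on its own. The divisibility check assumes the usual multi-head attention setup, in which the embedding dimension is split evenly across the heads.

```python
# Illustrative sketch: quantities implied by the configuration above.
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}
cfg = GPT_CONFIG_124M

# Assumption: each attention head receives an equal slice of the embedding
assert cfg["emb_dim"] % cfg["n_heads"] == 0
head_dim = cfg["emb_dim"] // cfg["n_heads"]

# Number of weights in the token and positional embedding tables
tok_emb_params = cfg["vocab_size"] * cfg["emb_dim"]
pos_emb_params = cfg["context_length"] * cfg["emb_dim"]

print("Dimension per head:", head_dim)                   # 64
print("Token embedding weights:", tok_emb_params)        # 38597376
print("Positional embedding weights:", pos_emb_params)   # 786432
```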

In [19]:
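# Placeholder that will be replaced by a real transformer block later
class DummyTransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()

    def forward(self, x):
        # Does nothing; returns the input unchanged
        return x

# Placeholder that will be replaced by a real layer normalization later
class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        # The parameters only mimic the interface of a layer normalization
        super().__init__()

    def forward(self, x):
        # Does nothing; returns the input unchanged
        return x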

In [20]:
# A placeholder GPT model architecture class
class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        # Stack of placeholder transformer blocks
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        # Placeholder for the final layer normalization
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        # Projects each token's hidden state to vocabulary-sized logits
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
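
The addition of token and positional embeddings in the forward method relies on broadcasting: the positional embeddings have no batch dimension and are added to every sample in the batch. The sketch below illustrates this with tiny, made-up sizes (`vocab_size=10`, `context_length=8`, `emb_dim=4`) that are chosen only for readability and are not the values of the real model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny, made-up sizes so that the shapes are easy to read
vocab_size, context_length, emb_dim = 10, 8, 4
tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

in_idx = torch.tensor([[1, 2, 3], [4, 5, 6]])   # (batch_size=2, seq_len=3)
seq_len = in_idx.shape[1]

tok_embeds = tok_emb(in_idx)                    # (2, 3, 4)
pos_embeds = pos_emb(torch.arange(seq_len))     # (3, 4), no batch dimension
x = tok_embeds + pos_embeds                     # broadcast to (2, 3, 4)

print(tok_embeds.shape)  # torch.Size([2, 3, 4])
print(pos_embeds.shape)  # torch.Size([3, 4])
print(x.shape)           # torch.Size([2, 3, 4])
```

Because the same positional vectors are added to both samples, the difference `x - tok_embeds` is identical for every sample in the batch.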

To implement the steps shown in Figure 4.4, we tokenize a batch consisting
of two text inputs for the GPT model using the tiktoken tokenizer introduced
in chapter 2:

In [21]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
batch = []
text1 = "Every effort moves you"
text2 = "Every day holds a"

In [22]:
batch.append(torch.tensor(tokenizer.encode(text1)))
batch.append(torch.tensor(tokenizer.encode(text2)))

batch = torch.stack(batch, dim=0)

The resulting token IDs for the two texts are as follows:

In [23]:
print(batch)

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])


Next, we initialize a new 124-million-parameter DummyGPTModel instance and
feed it the tokenized batch:

In [24]:
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)

print("Output shape:", logits.shape)
print(logits)

Output shape: torch.Size([2, 4, 50257])
tensor([[[-0.9289,  0.2748, -0.7557,  ..., -1.6070,  0.2702, -0.5888],
         [-0.4476,  0.1726,  0.5354,  ..., -0.3932,  1.5285,  0.8557],
         [ 0.5680,  1.6053, -0.2155,  ...,  1.1624,  0.1380,  0.7425],
         [ 0.0447,  2.4787, -0.8843,  ...,  1.3219, -0.0864, -0.5856]],

        [[-1.5474, -0.0542, -1.0571,  ..., -1.8061, -0.4494, -0.6747],
         [-0.8422,  0.8243, -0.1098,  ..., -0.1434,  0.2079,  1.2046],
         [ 0.1355,  1.1858, -0.1453,  ...,  0.0869, -0.1590,  0.1552],
         [ 0.1666, -0.8138,  0.2307,  ...,  2.5035, -0.3055, -0.3083]]],
       grad_fn=<UnsafeViewBackward0>)


The output tensor has two rows corresponding to the two text samples. Each
text sample consists of 4 tokens; each token is a 50,257-dimensional vector,
which matches the size of the tokenizer's vocabulary.


Each output vector has 50,257 dimensions because each of these dimensions
refers to a unique token in the vocabulary. At the end of this chapter, when
we implement the postprocessing code, we will convert these
50,257-dimensional vectors back into token IDs, which we can then decode into
words.
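
To give a rough idea of what such a conversion can look like, the following sketch picks the highest-scoring vocabulary entry for each token position. It uses random logits and a tiny hypothetical vocabulary of 10 entries, and greedy selection via argmax is only one simple option; it is not necessarily the postprocessing code developed later in the chapter.

```python
import torch

torch.manual_seed(0)

vocab_size = 10                          # tiny hypothetical vocabulary
logits = torch.randn(2, 4, vocab_size)   # (batch, tokens, vocab_size)

# Turn the scores into probabilities over the vocabulary
probas = torch.softmax(logits, dim=-1)

# Pick the index of the most likely vocabulary entry per position
token_ids = torch.argmax(probas, dim=-1)

print(token_ids.shape)  # torch.Size([2, 4])
```

The vocabulary dimension disappears in the result, leaving one token ID per position in each sample.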

### Normalizing activations with layer normalization

We can recreate the example shown in Figure 4.5 via the following code,
where we implement a neural network layer with 5 inputs and 6 outputs that
we apply to two input examples:

In [25]:
torch.manual_seed(123)
batch_example = torch.randn(2, 5) 
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)

tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
       grad_fn=<ReluBackward0>)


In [26]:
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)

Mean:
 tensor([[0.1324],
        [0.2170]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[0.0231],
        [0.0398]], grad_fn=<VarBackward0>)


The first row in the mean tensor above contains the mean value for the first
input row, and the second row contains the mean for the second input
row.


Using keepdim=True in operations like mean or variance calculation ensures
that the output tensor retains the same number of dimensions as the input
tensor, even though the operation reduces the tensor along the dimension
specified via dim. For instance, without keepdim=True, the returned mean
tensor would be a 2-dimensional vector [0.1324, 0.2170] instead of a
2×1-dimensional matrix [[0.1324], [0.2170]].
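
The effect of keepdim can be checked on a small tensor with hand-picked values. The sketch below uses an illustrative 2×3 tensor (not the layer outputs from above) and also shows why the retained dimension is convenient: the mean can be subtracted from the input directly via broadcasting.

```python
import torch

x = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])

mean_keep = x.mean(dim=-1, keepdim=True)   # one mean per row, as a column
mean_flat = x.mean(dim=-1)                 # one mean per row, as a vector

print(mean_keep.shape)  # torch.Size([2, 1])
print(mean_flat.shape)  # torch.Size([2])

# The (2, 1) shape broadcasts across the columns of the (2, 3) input
centered = x - mean_keep
print(centered)
# tensor([[-1.,  0.,  1.],
#         [-1.,  0.,  1.]])
```

Subtracting `mean_flat` instead would fail here, because a vector of length 2 cannot be broadcast against rows of length 3.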

An illustration of the dim parameter when calculating the mean of a tensor. For
instance, if we have a 2D tensor (matrix) with dimensions [rows, columns], using dim=0 will
perform the operation across rows (vertically, as shown at the bottom), resulting in an output
that aggregates the data for each column. Using dim=1 or dim=-1 will perform the operation
across columns (horizontally, as shown at the top), resulting in an output aggregating the data for
each row.
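
The difference between the dim settings described above can be reproduced with a short example. The values are made up for illustration.

```python
import torch

m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])   # shape [rows=2, columns=3]

col_means = m.mean(dim=0)    # aggregates over rows: one value per column
row_means = m.mean(dim=-1)   # aggregates over columns: one value per row

print(col_means)  # tensor([2.5000, 3.5000, 4.5000])
print(row_means)  # tensor([2., 5.])
```

For a 2D tensor, dim=1 and dim=-1 refer to the same axis; using -1 has the advantage that it keeps pointing at the last axis when a batch dimension is added in front.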

In [27]:
# Next, let us apply layer normalization to the layer outputs we obtained
# earlier. The operation consists of subtracting the mean and dividing by the
# square root of the variance (also known as the standard deviation):

out_norm = (out - mean) / torch.sqrt(var)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalized layer outputs:\n", out_norm)
print("Mean:\n", mean)
print("Variance:\n", var)

Normalized layer outputs:
 tensor([[ 0.6159,  1.4126, -0.8719,  0.5872, -0.8719, -0.8719],
        [-0.0189,  0.1121, -1.0876,  1.5173,  0.5647, -1.0876]],
       grad_fn=<DivBackward0>)
Mean:
 tensor([[    0.0000],
        [    0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)


Note that a near-zero mean may be printed in scientific notation, for example
as 9.9341e-09, which stands for 9.9341 × 10^-9, or 0.0000000099341 in decimal
form. Such a value is very close to 0, but it is not exactly 0 due to small
numerical errors that can accumulate because of the finite precision with
which computers represent numbers.
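
Because of these small numerical errors, it is usually safer to compare floating-point results against an expected value with a tolerance than to test for exact equality. The sketch below illustrates this with made-up values; the tolerance of 1e-6 is an arbitrary choice for this example.

```python
import torch

x = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.7])

# Mathematically, the mean of the centered values is exactly 0
centered = x - x.mean()
residual = centered.mean()

# The printed value may be a tiny nonzero number rather than 0
print(residual.item())

# Compare with a tolerance instead of testing residual == 0
is_zero = torch.isclose(residual, torch.tensor(0.0), atol=1e-6)
print(is_zero)
```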


To improve readability, we can also turn off the scientific notation when
printing tensor values by setting sci_mode to False:

In [28]:
torch.set_printoptions(sci_mode=False)
print("Mean:\n", mean)
print("Variance:\n", var)

Mean:
 tensor([[    0.0000],
        [    0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)


So far, in this section, we have coded and applied layer normalization in a
step-by-step process. Let's now encapsulate this process in a PyTorch module
that we can use in the GPT model later:

In [29]:
# A layer normalization class
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5  # Small constant that prevents division by zero
        self.scale = nn.Parameter(torch.ones(emb_dim))   # Trainable scale
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # Trainable shift

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

This specific implementation of layer normalization operates on the last
dimension of the input tensor x, which represents the embedding dimension
(emb_dim). The variable eps is a small constant (epsilon) added to the
variance to prevent division by zero during normalization. The scale and
shift are two trainable parameters (of the same dimension as the input) that
the LLM automatically adjusts during training if it is determined that doing
so would improve the model's performance on its training task. This allows
the model to learn appropriate scaling and shifting that best suit the data it is
processing.
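
The LayerNorm class computes the variance with unbiased=False, whereas the step-by-step cells earlier used the default setting. The sketch below shows the difference on four made-up values: the biased estimate divides the sum of squared deviations by n, while the unbiased one divides by n - 1.

```python
import torch

x = torch.tensor([1., 2., 3., 4.])   # mean 2.5, squared deviations sum to 5

var_biased = x.var(unbiased=False)   # 5 / 4
var_unbiased = x.var(unbiased=True)  # 5 / 3

print(var_biased)    # tensor(1.2500)
print(var_unbiased)  # tensor(1.6667)
```

For a large embedding dimension such as 768, the difference between dividing by n and by n - 1 is very small.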

In [30]:
# Let's now try the LayerNorm module in practice and apply it to the batch input 

ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)

mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)

print("Mean: \n", mean)
print("Variance: \n", var)


Mean: 
 tensor([[    -0.0000],
        [     0.0000]], grad_fn=<MeanBackward1>)
Variance: 
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)
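
As an additional sanity check, the custom module can be compared against PyTorch's built-in nn.LayerNorm, which also normalizes over the last dimension with a biased variance estimate and a default eps of 1e-5. The following sketch repeats the class definition so that it runs on its own; the names `custom`, `builtin`, and `max_diff` are illustrative.

```python
import torch
import torch.nn as nn

# Same class as above, repeated so that the snippet is self-contained
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

torch.manual_seed(123)
x = torch.randn(2, 5)

custom = LayerNorm(emb_dim=5)
builtin = nn.LayerNorm(5, eps=1e-5)

# Largest elementwise difference between the two implementations
max_diff = (custom(x) - builtin(x)).abs().max().item()
print("Results match:", max_diff < 1e-5)
```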


### Implementing a feed forward network with GELU activations

Historically, the ReLU activation function has been commonly used in deep
learning due to its simplicity and effectiveness across various neural network
architectures. However, in LLMs, several other activation functions are
employed beyond the traditional ReLU. Two notable examples are GELU
(Gaussian Error Linear Unit) and SwiGLU (Swish-Gated Linear Unit).


GELU and SwiGLU are more complex and smooth activation functions
incorporating Gaussian and sigmoid-gated linear units, respectively. They
often offer improved performance for deep learning models compared with the
simpler ReLU.


The GELU activation function can be implemented in several ways; the exact
version is defined as GELU(x) = x⋅Φ(x), where Φ(x) is the cumulative
distribution function of the standard Gaussian distribution. In practice,
however, it's common to implement a computationally cheaper
approximation (the original GPT-2 model was also trained with this
approximation):

In [31]:
# An implementation of the GELU activation function
class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
            (x + 0.044715 * torch.pow(x, 3))
        ))
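
To get a feeling for how close the approximation is to the exact definition, the sketch below evaluates both on the interval from -3 to 3. The exact version writes Φ(x) in terms of the error function, Φ(x) = 0.5 (1 + erf(x / √2)). The function names `gelu_tanh` and `gelu_exact` are illustrative and not part of the original code.

```python
import math
import torch

def gelu_tanh(x):
    # Tanh-based approximation, as in the GELU class above
    return 0.5 * x * (1 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3))
    ))

def gelu_exact(x):
    # Exact definition x * Phi(x), with Phi expressed via the error function
    return 0.5 * x * (1 + torch.erf(x / math.sqrt(2.0)))

x = torch.linspace(-3, 3, 100)
max_diff = (gelu_tanh(x) - gelu_exact(x)).abs().max().item()
print("Largest absolute difference:", max_diff)
```

On this interval, the two versions differ only by a small amount relative to the range of the function values.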

Next, to get an idea of what this GELU function looks like and how it
compares to the ReLU function, let's plot these functions side by side:
In [32]:
import matplotlib.pyplot as plt

gelu, relu = GELU(), nn.ReLU()

# Some sample data
x = torch.linspace(-3, 3, 100)
y_gelu, y_relu = gelu(x), relu(x)

plt.figure(figsize=(8, 3))
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1):
    plt.subplot(1, 2, i)
    plt.plot(x, y)
    plt.title(f"{label} activation function")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
    plt.grid(True)

plt.tight_layout()
plt.show()

Next, let's use the GELU function to implement the small neural network
module, FeedForward, that we will be using in the LLM's transformer block
later:

In [33]:
# A feed forward neural network module
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            # Expand the embedding dimension by a factor of 4
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            # Project back to the original embedding dimension
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

As we can see in the preceding code, the FeedForward module is a small
neural network consisting of two Linear layers and a GELU activation
function. In the 124-million-parameter GPT model, it receives input
batches whose tokens each have an embedding size of 768, as set via
GPT_CONFIG_124M["emb_dim"] = 768.

Let's initialize a new FeedForward
module with a token embedding size of 768 and feed it a batch input with 2
samples and 3 tokens each:

In [34]:
ffn = FeedForward(GPT_CONFIG_124M)
x = torch.rand(2, 3, 768)  # 2 samples, 3 tokens each, embedding size 768
out = ffn(x)

As we can see, the shape of the output tensor is the same as that of the input
tensor:

In [35]:
print(out.shape)

torch.Size([2, 3, 768])
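
Although the input and output shapes match, the module expands each token to a larger hidden dimension internally. The sketch below makes this visible and counts the trainable parameters. It is a stand-in for the FeedForward class and uses PyTorch's built-in nn.GELU instead of the custom GELU class so that it runs on its own; this substitution does not affect the shapes or the parameter count.

```python
import torch
import torch.nn as nn

emb_dim = 768

# Stand-in for the FeedForward module (assumption: nn.GELU replaces the
# custom GELU class; it has no parameters either)
layers = nn.Sequential(
    nn.Linear(emb_dim, 4 * emb_dim),
    nn.GELU(),
    nn.Linear(4 * emb_dim, emb_dim),
)

x = torch.rand(2, 3, emb_dim)
hidden = layers[0](x)   # output of the first Linear layer
out = layers(x)

print(hidden.shape)  # torch.Size([2, 3, 3072])
print(out.shape)     # torch.Size([2, 3, 768])

# Weights and biases of both Linear layers
n_params = sum(p.numel() for p in layers.parameters())
print(n_params)      # 4722432
```

The count follows from 768 × 3072 + 3072 parameters in the first layer and 3072 × 768 + 768 in the second.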


### Adding shortcut connections

A shortcut connection creates an alternative,
shorter path for the gradient to flow through the network by skipping one or
more layers, which is achieved by adding the output of one layer to the output
of a later layer.

This is why these connections are also known as skip connections. They play a
crucial role in preserving the flow of gradients during the backward pass in
training.
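
The effect on gradients can be seen in a deliberately simplified scalar example. In the sketch below, each "layer" multiplies its input by a small made-up weight of 0.01. Without shortcuts, the gradient with respect to the input is the product of the five weights; with shortcuts, each layer contributes a factor of 1 + 0.01 instead. The function `run_chain` is hypothetical and serves only this illustration.

```python
import torch

def run_chain(x, w, n_layers, use_shortcut):
    # Apply the same scalar "layer" n_layers times
    for _ in range(n_layers):
        layer_output = w * x
        x = x + layer_output if use_shortcut else layer_output
    return x

w = torch.tensor(0.01)   # small weight, chosen for illustration

x_plain = torch.tensor(1.0, requires_grad=True)
run_chain(x_plain, w, n_layers=5, use_shortcut=False).backward()

x_short = torch.tensor(1.0, requires_grad=True)
run_chain(x_short, w, n_layers=5, use_shortcut=True).backward()

print(x_plain.grad.item())  # roughly 1e-10, i.e. 0.01 ** 5
print(x_short.grad.item())  # roughly 1.05, i.e. 1.01 ** 5
```

The added identity path keeps the gradient close to 1 even though the individual layer weights are small.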

In [36]:
#  A neural network to illustrate shortcut connections
class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList([
            # Implement 5 layers
            nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1])),
            nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2])),
            nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3])),
            nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4])),
            nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]))
        ])
    
    def forward(self, x):
        for layer in self.layers:
            # Compute the output of the current layer
            layer_output = layer(x)
            # Check if shortcut can be applied
            if self.use_shortcut and x.shape==layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x

The code implements a deep neural network with 5 layers, each consisting of
a Linear layer wrapped in nn.Sequential; as written, no activation function is
applied between the layers. In the forward pass, we iteratively pass the input
through the layers and optionally add the shortcut connections depicted in
Figure 4.12 if the self.use_shortcut attribute is set to True.

Let's use this code to first initialize a neural network without shortcut
connections. Here, each layer will be initialized such that it accepts an
example with 3 input values and returns 3 output values. The last layer
returns a single output value:

In [37]:
layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1., 0., -1.]])
torch.manual_seed(123)  # specify random seed for the initial weights for reproducibility
model_without_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=False
)

Next, we implement a function that computes the gradients in the model's
backward pass:

In [41]:
def print_gradients(model, x):
    # Forward pass
    output = model(x)
    target = torch.tensor([[0.]])

    # Calculate loss based on how close the target
    # and output are
    loss = nn.MSELoss()
    loss = loss(output, target)
    
    # Backward pass to calculate the gradients
    loss.backward()

    for name, param in model.named_parameters():
        if 'weight' in name:
            # Print the mean absolute gradient of the weights
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")
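
One detail to keep in mind is that PyTorch accumulates gradients across calls to backward, so calling a function like this twice on the same model would add the second set of gradients to the first. The following variant is a sketch under that consideration: it resets the gradients first and returns the values in a dictionary instead of printing them. The function name `gradient_means` and the small `toy_model` are hypothetical and used only for illustration.

```python
import torch
import torch.nn as nn

def gradient_means(model, x, target):
    # Reset gradients left over from earlier backward passes
    model.zero_grad()

    # Forward pass and mean squared error loss
    loss = nn.functional.mse_loss(model(x), target)

    # Backward pass to calculate the gradients
    loss.backward()

    # Mean absolute gradient for each weight tensor
    return {
        name: param.grad.abs().mean().item()
        for name, param in model.named_parameters()
        if "weight" in name
    }

torch.manual_seed(123)
toy_model = nn.Sequential(nn.Linear(3, 3), nn.Linear(3, 1))
sample_input = torch.tensor([[1., 0., -1.]])
target = torch.tensor([[0.]])

first = gradient_means(toy_model, sample_input, target)
second = gradient_means(toy_model, sample_input, target)
print(first == second)  # True, since gradients are reset before each pass
```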

In [42]:
layer_sizes = [3, 3, 3, 3, 3, 1]  

sample_input = torch.tensor([[1., 0., -1.]])

torch.manual_seed(123)
model_without_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=False
)
print_gradients(model_without_shortcut, sample_input)

layers.0.0.weight has gradient mean of 0.0015313407639041543
layers.1.0.weight has gradient mean of 0.0008734685834497213
layers.2.0.weight has gradient mean of 0.002111609559506178
layers.3.0.weight has gradient mean of 0.0030934568494558334
layers.4.0.weight has gradient mean of 0.007880656979978085


In [43]:
torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)

layers.0.0.weight has gradient mean of 0.2261723130941391
layers.1.0.weight has gradient mean of 0.5056601762771606
layers.2.0.weight has gradient mean of 0.3035311698913574
layers.3.0.weight has gradient mean of 0.454271525144577
layers.4.0.weight has gradient mean of 0.9717205166816711


As we can see based on the output above, shortcut connections prevent the gradients from vanishing in the early layers (towards layers.0).
We will use this concept of a shortcut connection next when we implement a transformer block.