<a href="https://colab.research.google.com/github/RCortez25/PhD/blob/main/LLM/5.%20LLM%20architecture/LLM_architecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM architecture

We're going to replicate GPT-2 with 124 million parameters, whose weights are open source.

In [1]:
# Configuration for our GPT model
GPT_CONFIG_124M = {
    "vocab_size": 50257,        # Number of tokens (words/sub-words) in the vocabulary
    "context_length": 1024,     # Maximum number of tokens used to predict the next token
    "embedding_dimension": 768, # Tokens are projected into a 768-dimensional space
    "number_of_heads": 12,      # Number of attention heads in each transformer block
    "number_of_layers": 12,     # Number of transformer blocks
    "dropout_rate": 0.1,
    "qkv_bias": False}
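As a quick sanity check on this configuration, the sketch below derives the size of each attention head. It assumes the usual multi-head convention that the embedding dimension is split evenly across the heads; the attention code itself is not defined in this section, so `head_dimension` is an illustrative name only.

```python
# Minimal sketch: derive the per-head dimension from the configuration.
# Assumption: the embedding dimension is split evenly across attention heads.
config = {
    "embedding_dimension": 768,
    "number_of_heads": 12,
}

# The split only works if the division is exact
assert config["embedding_dimension"] % config["number_of_heads"] == 0

head_dimension = config["embedding_dimension"] // config["number_of_heads"]
print(head_dimension)  # 64
```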

# GPT placeholder architecture

We'll build a GPT placeholder architecture to gain intuition on how everything fits together. It will take the configuration we just outlined above.

In [10]:
import torch
import torch.nn as nn

class GPTModelPlaceholder(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Initialize variables using the configuration dictionary
        # Lookup table from token IDs to token embeddings
        self.token_embedding_table = nn.Embedding(config["vocab_size"], config["embedding_dimension"])
        # Lookup table from positions to positional embeddings
        self.position_embedding_table = nn.Embedding(config["context_length"], config["embedding_dimension"])
        self.dropout_embedding = nn.Dropout(config["dropout_rate"])

        # Placeholder for transformer blocks
        self.transformer_blocks = nn.Sequential(
            *[TransformerBlockPlaceholder(config) for _ in range(config["number_of_layers"])]
        )

        # Final layer normalization (PyTorch's built-in nn.LayerNorm is used here)
        self.layer_norm = nn.LayerNorm(config["embedding_dimension"])

        # Output head
        self.output_head = nn.Linear(config["embedding_dimension"],
                                     config["vocab_size"],
                                     bias=False)

    # Forward pass: accepts the inputs and applies the transformations
    # The inputs are fed into the model as token IDs
    def forward(self, inputs_ids):
        # Obtain the size of the batch and the sequence length
        batch_size, context_length = inputs_ids.shape

        # Use the lookup table to obtain embeddings given the IDs
        token_embeddings = self.token_embedding_table(inputs_ids)

        # Obtain positional embeddings
        # Create a tensor with the positions 0, 1, ..., context_length - 1,
        # one for each token in the input sequence
        range_object = torch.arange(context_length, device=inputs_ids.device)
        # Look up the positional embedding that corresponds to each of those
        # positions
        position_embeddings = self.position_embedding_table(range_object)

        # Add token and positional embeddings (broadcast over the batch)
        x = token_embeddings + position_embeddings

        # Apply dropout
        x = self.dropout_embedding(x)

        # Now, the data is passed to the transformer blocks
        x = self.transformer_blocks(x)

        # Data is passed through normalization layer
        x = self.layer_norm(x)

        # Data is passed through the output head
        logits = self.output_head(x)

        return logits
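The addition of token and positional embeddings in `forward` relies on broadcasting: the positional embeddings have no batch dimension, so the same positional vectors are added to every sequence in the batch. Here is a minimal sketch of that step in isolation, using small, hypothetical sizes (not the GPT-2 values) so that it runs quickly:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small, hypothetical sizes chosen only for illustration
vocab_size, context_length, embedding_dimension = 100, 16, 8

token_embedding_table = nn.Embedding(vocab_size, embedding_dimension)
position_embedding_table = nn.Embedding(context_length, embedding_dimension)

# A batch of 2 sequences with 4 token IDs each
input_ids = torch.randint(0, vocab_size, (2, 4))

token_embeddings = token_embedding_table(input_ids)        # shape (2, 4, 8)
positions = torch.arange(input_ids.shape[1])               # tensor([0, 1, 2, 3])
position_embeddings = position_embedding_table(positions)  # shape (4, 8)

# Broadcasting adds the same (4, 8) positional block to each sequence
x = token_embeddings + position_embeddings                 # shape (2, 4, 8)

print(token_embeddings.shape, position_embeddings.shape, x.shape)
```

Because position `i` always maps to the same vector, two different sequences in a batch receive identical positional information at the same index.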

# Transformer and Layer Normalization placeholders

In [11]:
class TransformerBlockPlaceholder(nn.Module):
    def __init__(self, config):
        super().__init__()

    def forward(self, x):
        return x


class LayerNormPlaceholder(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()

    def forward(self, x):
        return x

# Example

First, let's create the batch of text to be used and tokenize it.

In [12]:
import tiktoken

# Use the GPT-2 encoding
tokenizer = tiktoken.get_encoding("gpt2")
batch = []

# Text to be used in the example
text1 = "Every effort moves you"
text2 = "Every day holds a"

# Obtain the IDs of each text
text_1_tokenized = tokenizer.encode(text1)
text_2_tokenized = tokenizer.encode(text2)

batch.append(text_1_tokenized)
batch.append(text_2_tokenized)

batch = torch.tensor(batch)
print(batch)

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])


In [13]:
batch.shape

torch.Size([2, 4])

We can now create a GPT model and pass it this example.

In [17]:
torch.manual_seed(123)

# Create the model
oGPT = GPTModelPlaceholder(GPT_CONFIG_124M)
logits = oGPT(batch)
print(f"Logits' shape: {logits.shape}")
print(f"\nLogits: {logits}")

Logits' shape: torch.Size([2, 4, 50257])

Logits: tensor([[[-0.7867,  0.2203, -0.4508,  ..., -0.9936, -0.1412, -0.2999],
         [-0.0788,  0.3004, -0.2935,  ...,  0.1583,  0.8917,  0.8230],
         [ 0.3708,  1.1126, -0.3226,  ...,  0.8023, -0.0038,  0.3935],
         [ 0.0636,  1.0572, -0.2507,  ...,  0.7542, -0.0750, -0.6896]],

        [[-0.7208,  0.1351, -0.6014,  ..., -1.0272,  0.1729, -0.2920],
         [-0.5938,  0.4453, -0.0059,  ...,  0.3414,  0.0572,  1.0986],
         [ 0.2675,  0.8407, -0.4476,  ..., -0.0181, -0.1090,  0.2541],
         [-0.1035, -0.5901, -0.3932,  ...,  1.4022, -0.3188,  0.1304]]],
       grad_fn=<UnsafeViewBackward0>)


As we can see, the output has shape `(2, 4, 50257)`: a batch of 2 sequences, each containing 4 tokens, and for each token a vector of 50257 scores, one for every token in the vocabulary.

# Logits

The final output of the GPT model is a tensor with one row per token fed into it. For the example "Every effort moves you", which has 4 tokens, the output has 4 rows. The number of columns equals the size of the vocabulary, and each entry is an unnormalized score. For example, if the vocabulary size is 50257, then the output for one sequence will have:

* 4 rows
* 50257 columns

where each entry is a score for one token in the vocabulary, indicating how likely that token is to be the next token. These scores are not probabilities yet: applying a softmax over each row converts them into probabilities.

For example, for the first row, corresponding to "Every", the 50257 columns contain the scores for each token in the vocabulary to be the next token after "Every". The second row corresponds to "Every effort", and its 50257 columns contain the scores for each token in the vocabulary to be the next token after "Every effort", and so on.

These entries are termed **logits**.
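Logits are raw scores, so they can be negative and do not sum to 1. A minimal sketch of how they could be turned into next-token probabilities is shown below. It uses random stand-in logits with the same shape as the model output above, not the actual output of the placeholder model, and the variable names are illustrative.

```python
import torch

torch.manual_seed(123)

# Stand-in logits with shape (batch_size, number_of_tokens, vocab_size)
logits = torch.randn(2, 4, 50257)

# Softmax over the vocabulary dimension converts scores into probabilities
probabilities = torch.softmax(logits, dim=-1)

# Each row of probabilities sums to 1 (up to floating-point error)
row_sums = probabilities.sum(dim=-1)

# The most likely next token ID at each position
next_token_ids = torch.argmax(probabilities, dim=-1)

print(probabilities.shape)   # torch.Size([2, 4, 50257])
print(next_token_ids.shape)  # torch.Size([2, 4])
```

Since the placeholder model is untrained, the token IDs it would predict this way are meaningless; the point is only the shape bookkeeping.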

That's why, in the code, the `output_head` maps from the embedding dimension to the vocabulary size: it turns each token's 768-dimensional vector into 50257 logits.
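The effect of the output head on tensor shapes can be checked in isolation. The sketch below uses small, hypothetical dimensions (8 and 20 instead of 768 and 50257) to keep it lightweight; only the last dimension of the input changes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small sizes standing in for 768 and 50257
embedding_dimension, vocab_size = 8, 20

# Linear layer without bias, as in the model's output head
output_head = nn.Linear(embedding_dimension, vocab_size, bias=False)

# A batch of 2 sequences, 4 tokens each, one embedding vector per token
x = torch.randn(2, 4, embedding_dimension)

# The layer is applied to the last dimension only
logits = output_head(x)

print(output_head.weight.shape)  # torch.Size([20, 8])
print(logits.shape)              # torch.Size([2, 4, 20])
```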