# Implementing a GPT Model From Scratch

This chapter focuses on implementing the LLM architecture.

### We will cover the following:
* Coding a GPT-like LLM that can be trained to generate text
* Normalizing layer activations to stabilize neural network training
* Adding shortcut connections in deep neural networks
* Implementing transformer blocks to create GPT models
* Computing the number of parameters and storage requirements

In [1]:
# GPT-2 model configuration
GPT_CONFIG_124M = {
    "vocab_size": 50257,      # Vocabulary size
    "context_length": 1024,   # Context length: maximum number of tokens in an input sequence
    "embedding_dim": 768,     # Embedding dimension: each token is transformed into a 768-dimensional vector
    "h_heads": 12,            # Number of attention heads in the multi-head attention
    "n_layers": 12,           # Number of layers: the number of transformer blocks in the model
    "dropout_rate": 0.1,      # Dropout rate of 10%
    "qkv_bias": False         # Query-key-value bias: whether the Linear layers that compute the queries, keys and values in the multi-head attention include a bias vector
}
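
As a first feel for how these settings translate into model size, the short sketch below counts the parameters of the two embedding tables implied by the configuration. It only covers the token and positional embeddings, not the transformer blocks or the output layer, and it assumes each parameter is stored as a 32-bit float. The `cfg` dictionary is an illustrative copy of the relevant values, not part of the original notebook.

```python
# Illustrative copy of the relevant settings (same values as GPT_CONFIG_124M)
cfg = {"vocab_size": 50257, "context_length": 1024, "embedding_dim": 768}

# Token embedding: one embedding_dim-sized row per vocabulary entry
token_emb_params = cfg["vocab_size"] * cfg["embedding_dim"]
# Positional embedding: one embedding_dim-sized row per position in the context
pos_emb_params = cfg["context_length"] * cfg["embedding_dim"]

# Assumption: each parameter is stored as a 32-bit float (4 bytes)
bytes_per_param = 4
total_params = token_emb_params + pos_emb_params
size_mb = total_params * bytes_per_param / (1024 * 1024)

print(f"Token embedding parameters:      {token_emb_params:,}")  # 38,597,376
print(f"Positional embedding parameters: {pos_emb_params:,}")    # 786,432
print(f"Embedding storage (float32):     {size_mb:.2f} MB")
```

The token embedding table dominates here because the vocabulary is roughly fifty times larger than the context length.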

**DummyGPTModel class** \
We first define a DummyGPTModel class, a simplified version of a GPT-like model. It has the following components:
* token embedding
* positional embedding
* dropout applied to the embeddings
* a series of transformer blocks
* a final normalization layer
* a linear output layer

We pass the settings to the model via the configuration dictionary.

The **forward** method describes data flow through the model:
* it computes the token and positional embeddings for the input indices
* applies dropout
* passes the data through the transformer blocks
* applies normalization
* finally produces logits with the linear output layer
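
The data flow above can be sketched in PyTorch roughly as follows. This is not the `DummyGPTModel` from `utilities.GPTModels`, whose source is not shown here. It is a minimal illustration under stated assumptions: the class names `SketchGPTModel`, `PlaceholderTransformerBlock` and `PlaceholderLayerNorm` are hypothetical, the placeholder block and normalization layer simply return their input unchanged, and the configuration keys follow the dictionary defined earlier.

```python
import torch
import torch.nn as nn


class PlaceholderTransformerBlock(nn.Module):
    """Hypothetical stand-in for a transformer block; returns its input unchanged."""

    def __init__(self, cfg):
        super().__init__()

    def forward(self, x):
        return x


class PlaceholderLayerNorm(nn.Module):
    """Hypothetical stand-in for layer normalization; returns its input unchanged."""

    def __init__(self, normalized_shape):
        super().__init__()

    def forward(self, x):
        return x


class SketchGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["embedding_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["embedding_dim"])
        self.drop_emb = nn.Dropout(cfg["dropout_rate"])
        self.trf_blocks = nn.Sequential(
            *[PlaceholderTransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        self.final_norm = PlaceholderLayerNorm(cfg["embedding_dim"])
        self.out_head = nn.Linear(cfg["embedding_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        # Token embeddings: (batch_size, seq_len, embedding_dim)
        tok_embeds = self.tok_emb(in_idx)
        # Positional embeddings: (seq_len, embedding_dim), broadcast over the batch
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = self.drop_emb(tok_embeds + pos_embeds)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        # Logits: (batch_size, seq_len, vocab_size)
        return self.out_head(x)


# Small illustrative configuration so the example runs quickly
small_cfg = {
    "vocab_size": 100,
    "context_length": 16,
    "embedding_dim": 8,
    "n_layers": 2,
    "dropout_rate": 0.1,
}
model = SketchGPTModel(small_cfg)
tokens = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]])
logits = model(tokens)
print(logits.shape)  # torch.Size([2, 4, 100])
```

Using placeholders for the transformer block and the normalization layer keeps the overall skeleton visible before the real components are filled in.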

### Note
* In the DummyGPTModel class, the token embedding is handled inside the GPT model
* In LLMs, the dimension of the embedded input tokens typically matches the output dimension
* The output embeddings here represent the context vectors

### Implementation
To implement these steps, we first tokenize a batch consisting of two text inputs for the GPT model using tiktoken. \
Next, we initialize a DummyGPTModel with the 124M-parameter configuration and feed it the tokenized batch.

In [2]:
# Tokenize a batch consisting of two text inputs for the GPT model using tiktoken
import tiktoken
import torch
from utilities.GPTModels import DummyGPTModel

tokenizer = tiktoken.get_encoding('gpt2')
batch = []
text1 = 'Every effort moves you'
text2 = 'Every day holds a'

batch.append(torch.tensor(tokenizer.encode(text1)))
batch.append(torch.tensor(tokenizer.encode(text2)))
batch = torch.stack(batch, dim=0)
print(batch)

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])


In [3]:
# Initialize a DummyGPTModel with the 124M-parameter configuration and feed it the tokenized batch
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print('Output shape:', logits.shape)
print(logits)

Output shape: torch.Size([2, 4, 50257])
tensor([[[-0.9289,  0.2748, -0.7557,  ..., -1.6070,  0.2702, -0.5888],
         [-0.4476,  0.1726,  0.5354,  ..., -0.3932,  1.5285,  0.8557],
         [ 0.5680,  1.6053, -0.2155,  ...,  1.1624,  0.1380,  0.7425],
         [ 0.0447,  2.4787, -0.8843,  ...,  1.3219, -0.0864, -0.5856]],

        [[-1.5474, -0.0542, -1.0571,  ..., -1.8061, -0.4494, -0.6747],
         [-0.8422,  0.8243, -0.1098,  ..., -0.1434,  0.2079,  1.2046],
         [ 0.1355,  1.1858, -0.1453,  ...,  0.0869, -0.1590,  0.1552],
         [ 0.1666, -0.8138,  0.2307,  ...,  2.5035, -0.3055, -0.3083]]],
       grad_fn=<UnsafeViewBackward0>)


The output tensor has two rows, one for each of the two text samples.
Each text sample consists of four tokens, and each token is represented by a 50257-dimensional vector, which matches the vocabulary size in the configuration.
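
To make the meaning of the last dimension more concrete, the sketch below uses a random tensor of the same shape as a stand-in for the model output (the values are not real model logits). It shows that a softmax over the last dimension turns each 50257-dimensional vector into a probability distribution over the vocabulary, and that the vector at the last position yields one candidate token ID per text sample. Since the stand-in is random, and the model above is untrained, the selected IDs carry no meaning.

```python
import torch

torch.manual_seed(0)
vocab_size = 50257

# Random stand-in with the same shape as the logits above:
# (batch_size=2, num_tokens=4, vocab_size)
logits = torch.randn(2, 4, vocab_size)

# Softmax over the last dimension gives one probability per vocabulary entry
probs = torch.softmax(logits, dim=-1)
print(probs.shape)        # torch.Size([2, 4, 50257])
print(probs.sum(dim=-1))  # every entry is approximately 1

# Logits at the last position of each sample: (batch_size, vocab_size)
last_token_logits = logits[:, -1, :]
# Index of the highest-scoring vocabulary entry for each sample
next_token_ids = torch.argmax(last_token_logits, dim=-1)
print(next_token_ids.shape)  # torch.Size([2])
```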