!!! note "How to read?"
    - Section Headers (In Bold) to segregate the main categories.
    - Sub Headers to represent the respective section explained by sensei in the video
    - The code snippet breakdown: the code written up to that point, along with an explanation of what was done. The final code snippet is in the `Train.py` file.

## **Introduction**

----------

We will be reproducing the GPT-2 model released by OpenAI, based on their paper and the released source code. We will work on the 124 million parameter model, which was the smallest in the series that was released. With each release, a series of models is made, ranging from fewer parameters to more, and the largest one usually ends up being THE "GPT model".

The source code of GPT 2 provided by OpenAI was implemented in TensorFlow, but we will be implementing it in PyTorch.

We can also load this model from the HuggingFace library, which gives us access to all of the parameter values of the original 124M model.

The original implementation is complex and hard to follow, so we will write our own implementation from scratch to reproduce it. Our first step, however, is to load the original 124M model from HuggingFace into OUR CLASS, importing all of its properties, especially the weights and parameters. This ensures we start from the same setup as the original code while using our own implementation.

```python
from dataclasses import dataclass
import torch
from torch import nn
from torch.nn import functional as F

#=========================================================

@dataclass
class GPTConfig:
    block_size: int = 256
    vocab_size: int = 85
    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
```

&nbsp;

## **Section 1**

----------

To start with, the GPT-2 paper makes slight adjustments to the [original transformer model](https://arxiv.org/pdf/1706.03762) (as seen in the image below): the Encoder section, and the Cross-Attention block that consumes the encoder's output, are removed completely. This is why the GPT architecture is known as a decoder-only architecture.

![GPT Architecture](assets/gpt-architecture.png)

Everything else remains largely the same, apart from a few differences that we will implement.

In the [GPT 2 paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), on page 4 under section `2.3 Model`, they mention: *"Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016) and an additional layer normalization was added after the final self-attention block."*

So basically the order of some layers has been reshuffled and one layer has been added:

- The Norm Layer (layer norm - ln) is moved to the input of each sub-block, i.e. it now comes before the Multi-Head Attention layer (and before the feed-forward layer) instead of after it.
- One more Norm Layer (layer norm - ln) is added after the final self-attention block, i.e. just before the final Linear and Softmax layers.
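
To make the reordering concrete, here is a minimal sketch of a pre-LN block. It is only an illustration, not the implementation built in these notes: the class name `PreLNBlock` is hypothetical, `nn.MultiheadAttention` is used as a stand-in for the attention module, and the 4x expansion in the feed-forward part is an assumption.

```python
import torch
from torch import nn

class PreLNBlock(nn.Module):
    """Hypothetical pre-LN block: LayerNorm sits at the input of each sub-block."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        # Stand-in attention module (assumption, not the module from these notes)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        # Feed-forward part; the 4x expansion is an assumption for this sketch
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True marks positions that may NOT be attended to
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)                       # norm BEFORE attention
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                       # residual path stays un-normalized
        x = x + self.mlp(self.ln_2(x))         # norm BEFORE the feed-forward part
        return x

torch.manual_seed(0)
block = PreLNBlock(n_embd=32, n_head=4)
x = torch.randn(2, 5, 32)   # (batch, time, channels)
y = block(x)
print(y.shape)
```

Note that the residual additions `x + ...` never pass through a layer norm themselves; only the input to each sub-block is normalized.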

&nbsp;

### Implementing the GPT-2 nn.Module

Now we will implement our `nn.Module`s, using as a reference the schema of the GPT-2 model that we loaded from HuggingFace in **section 0**, shown below:

![GPT 2 Schema](assets/gpt-2-schema.png)

So our aim would be to match up/replicate the above schema.

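```python
# Inside GPT.__init__, after `self.config = config`
self.transformer = nn.ModuleDict(dict(
    wte = nn.Embedding(config.vocab_size, config.n_embd),   # token embeddings
    wpe = nn.Embedding(config.block_size, config.n_embd),   # positional embeddings
    h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),  # transformer blocks
    ln_f = nn.LayerNorm(config.n_embd),                     # final layer norm
))
# nn.Linear takes (in_features, out_features): project n_embd -> vocab_size
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
```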

`self.transformer = nn.ModuleDict(dict())`

- In the above schema image we see that the main container holding all the modules is called 'transformer', so that is what we declare first.
- We mirror the `transformer` module using `nn.ModuleDict`, which lets you index into the sub-modules using keys, just like in a dictionary. Our keys are strings.

Then within that `transformer` module we have:

`wte = nn.Embedding(config.vocab_size, config.n_embd)` and `wpe = nn.Embedding(config.block_size, config.n_embd)`

- which are the token and positional embeddings respectively.
- both of these modules are `nn.Embedding` modules, and an `nn.Embedding` module is just a "fancy wrapper module" around a single array/list/block of numbers, so each of them is just a single tensor.
- so `nn.Embedding` is just a glorified wrapper around that tensor which allows you to access its elements by indexing into its rows.
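
The "wrapper around a tensor" idea can be checked directly. The sketch below uses toy sizes (10 rows, 4 columns) chosen only for illustration, and shows that calling the embedding on integer indices gives the same result as indexing into the rows of its `weight` tensor.

```python
import torch
from torch import nn

torch.manual_seed(0)
emb = nn.Embedding(10, 4)                    # toy sizes: 10 rows, 4 numbers per row
idx = torch.tensor([[1, 3, 3], [0, 9, 2]])   # (B=2, T=3) integer indices

out = emb(idx)                               # looks up one row per index -> (2, 3, 4)
same = torch.equal(out, emb.weight[idx])     # plain row indexing into the tensor

print(emb.weight.shape, out.shape, same)
```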

`h = nn.ModuleList([Block(config) for _ in range(config.n_layer) ])`

- in the schema you can see that `h` is declared, but it is indexed with integer values, i.e. from 0 to 11 (unlike the other modules, which are indexed with strings).
- therefore we declare it as a list, `nn.ModuleList`, so that we can index it using integers exactly as we see in the schema.
- the `h` module list holds `n_layer` `Block`s; the `Block` class still needs to be defined (we will do that in a while).
- the `h` probably stands for 'hidden'
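
The difference between string keys and integer indices can be seen in a small sketch. `TinyBlock` is a hypothetical placeholder for the real `Block` (which is defined later), and the sizes are toy values chosen only for illustration.

```python
from torch import nn

class TinyBlock(nn.Module):
    """Hypothetical placeholder standing in for the real Block defined later."""

    def __init__(self, n_embd: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)

transformer = nn.ModuleDict(dict(
    wte = nn.Embedding(11, 8),                             # toy sizes
    h = nn.ModuleList([TinyBlock(8) for _ in range(3)]),
))

wte = transformer["wte"]       # string key, like a dictionary
first = transformer["h"][0]    # integer index, like a list

# Parameter names are built from the keys and indices used above
keys = list(transformer.state_dict().keys())
print(keys)
```

The printed names combine the string keys and the integer indices, which is the same naming pattern that appears in the schema.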

`ln_f = nn.LayerNorm(config.n_embd)`

- this follows the GPT 2 paper, which requires us to define the additional 'Final Layer Norm', so that's what we have done.

So that is the end of the Transformer Module. After that, we have the final classifier, which is:
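
As a quick reminder of what a layer norm does, the sketch below (toy sizes, freshly initialized module) checks that `nn.LayerNorm` normalizes each position's feature vector to roughly zero mean and unit variance over the last dimension.

```python
import torch
from torch import nn

torch.manual_seed(0)
ln_f = nn.LayerNorm(8)                 # toy embedding size of 8
x = torch.randn(2, 3, 8) * 5 + 2       # (B, T, C) with a shifted, scaled distribution

y = ln_f(x)
mean = y.mean(dim=-1)                  # one mean per (batch, time) position
var = y.var(dim=-1, unbiased=False)    # one variance per (batch, time) position

print(mean.abs().max().item(), var.mean().item())
```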

`self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)`

- The final classifier is the Language Model Head (`lm_head`), which projects from the embedding dimension (`n_embd`, which is 768 in the image) all the way to the vocab size (`vocab_size`, which is 50257 in the image). GPT 2 uses no bias for this final projection.
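
The shapes involved in this projection can be sketched with toy sizes (16 and 50 here are illustrative values, not the real 768 and 50257):

```python
import torch
from torch import nn

n_embd, vocab_size = 16, 50            # toy sizes for illustration only
lm_head = nn.Linear(n_embd, vocab_size, bias=False)

x = torch.randn(2, 5, n_embd)          # (B, T, n_embd) output of the transformer
logits = lm_head(x)                    # (B, T, vocab_size): one score per vocab entry

print(lm_head.weight.shape)            # stored as (out_features, in_features)
print(logits.shape)
print(lm_head.bias)                    # None, because bias=False
```

`nn.Linear` expects the input size first and the output size second, while its `weight` tensor is stored the other way around, as `(vocab_size, n_embd)`.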

**Therefore this is the skeleton structure of what we saw in the architecture diagram! Below is a breakdown of it for a clearer understanding:**

![GPT 2 Skeleton structure breakdown](assets/gpt-2-replicated-part1.png)