!!! note "How to read?"
    - Section headers separate the main categories.
    - Subheaders mark the part of the video in which sensei explains the respective topic.
    - Code snippet breakdown: each snippet shows the code up to that point, along with an explanation of what was done. The final code is in the `Train.py` file.

## **Introduction**

----------

We will be reproducing the GPT-2 model released by OpenAI, based on their paper and the released source code. We will work on the 124-million-parameter model, the smallest in the mini-series that was released. With each release, a family of models is made, ranging from smaller to larger parameter counts, and the largest one usually ends up being THE "GPT Model".
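As a rough sanity check on the "124 million" figure, the parameter count can be worked out by hand. The hyperparameters below (12 layers, 768 embedding dimensions, a vocabulary of 50257 tokens, a context length of 1024, and a 4x expansion in the feed-forward layer) are not stated in the text above; they are taken from the published configuration of the smallest GPT-2 model. The helper names `linear` and `layer_norm` are just illustrative.

```python
# Hyperparameters of the smallest GPT-2 model (assumed from the published
# configuration, not from the notes above).
vocab_size = 50257
block_size = 1024
n_layer = 12
n_embd = 768


def linear(n_in, n_out):
    """Parameter count of a linear layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameter count of a LayerNorm: one scale and one shift per feature."""
    return 2 * n


# Token embedding table plus learned position embedding table.
embeddings = vocab_size * n_embd + block_size * n_embd

# One transformer block: two LayerNorms, attention (QKV + output projection),
# and the feed-forward network (expand to 4 * n_embd, then project back).
per_block = (
    layer_norm(n_embd)
    + linear(n_embd, 3 * n_embd)
    + linear(n_embd, n_embd)
    + layer_norm(n_embd)
    + linear(n_embd, 4 * n_embd)
    + linear(4 * n_embd, n_embd)
)

# The output head shares its weights with the token embedding, so it adds
# nothing; only the final LayerNorm is counted on top of the blocks.
total = embeddings + n_layer * per_block + layer_norm(n_embd)
print(f"{total:,}")  # 124,439,808
```

Most of the budget sits in the 12 blocks (about 85M) and the token embedding table (about 38.6M).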

The source code of GPT 2 provided by OpenAI was implemented in TensorFlow, but we will be implementing it in PyTorch.

We can also load this model from the HuggingFace library, which gives us access to all the parameter values of the original 124M model.
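A minimal sketch of inspecting that model's parameter schema, assuming the `transformers` package is installed. To avoid a download, this builds the model from the default `GPT2Config`, whose defaults correspond to the 124M model; the weights are then randomly initialised, but the parameter names and shapes are the same as in the released checkpoint. The commented-out `from_pretrained("gpt2")` line is what would fetch the actual released weights.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Default GPT2Config values describe the 124M model
# (12 layers, 12 heads, 768 embedding dimensions).
config = GPT2Config()
model = GPT2LMHeadModel(config)  # random weights, same schema

# To load the released weights instead (downloads the checkpoint):
# model = GPT2LMHeadModel.from_pretrained("gpt2")

# The state dict maps parameter names to tensors.
sd = model.state_dict()
for name, tensor in list(sd.items())[:5]:
    print(name, tuple(tensor.shape))
```

These names and shapes are the schema our own class has to match if the HuggingFace weights are to be copied into it.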

Now, the original implementation code is complex and hard to understand, so we will write our own implementation from scratch to reproduce it. Our first step, however, is to load the original 124M model from HuggingFace into OUR CLASS, importing all of its properties, especially the weights and the parameters. This ensures we stay within the same environment as the original code while using our own implementation.

```python
from dataclasses import dataclass
import torch
from torch import nn
from torch.nn import functional as F

#=========================================================

@dataclass
class GPTConfig:
    block_size: int = 256
    vocab_size: int = 85
    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
```

&nbsp;

## **Section 1**

----------

In the [GPT 2 paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), on `page 4` under `section 2.3 Model`, they mention: *"Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016) and an additional layer normalization was added after the final self-attention block."*

So they made slight adjustments to the [original transformer model](https://arxiv.org/pdf/1706.03762) implementation:

- The Cross-Attention block is removed completely (GPT-2 is decoder-only, so there is no encoder output to attend to).
- The Norm Layer is moved to before the Multi-Head attention layer and, since the paper says "the input of each sub-block", likewise to before the feed-forward layer.
- One more Norm Layer is added at the end of the model, i.e. after the final self-attention block and before the Linear-Softmax layers.
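The pre-norm arrangement described above can be sketched roughly as follows. This is only an illustration, not the implementation built later in these notes: the class name `PreNormBlock` is hypothetical, PyTorch's built-in `nn.MultiheadAttention` stands in for a hand-written attention layer, and the 4x feed-forward width with GELU is an assumption.

```python
import torch
from torch import nn


class PreNormBlock(nn.Module):
    """Illustrative decoder block with LayerNorm at the input of each sub-block."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # True marks positions a token may NOT attend to (everything in its future).
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        # Normalise first, then attend, then add back onto the residual stream.
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        # Same pattern for the feed-forward sub-block.
        x = x + self.mlp(self.ln_2(x))
        return x


torch.manual_seed(0)
block = PreNormBlock(n_embd=384, n_head=6)
ln_f = nn.LayerNorm(384)  # the additional LayerNorm after the final block

x = torch.randn(2, 8, 384)  # (batch, sequence length, embedding size)
y = ln_f(block(x))
print(y.shape)  # torch.Size([2, 8, 384])
```

Note that the residual path `x = x + ...` is never normalised itself; only the branch feeding each sub-block is. In the original transformer, the normalisation is applied after the residual addition instead.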

&nbsp;

### Implementing the GPT-2 nn.Module

Now we will implement our `nn.Module` classes, using as a reference the schema of the GPT 2 model which we loaded from HuggingFace in **section 0**.