# Welcome to OpenAI GPT v2.5

**NOTE: This notebook is the continuation of <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20from%20Scratch.ipynb">GPT from Scratch</a> and <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20Tokenizer.ipynb">GPT Tokenizer</a> notebooks.**

In this notebook we are going to reproduce <a href="https://github.com/openai/gpt-2">OpenAI's GPT-2</a> model, specifically the `124M` parameter version of it.

Now, when OpenAI released GPT 2, they released it with this <a href="https://openai.com/index/better-language-models/">blog post</a> and this <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">paper</a>. And on top of that, they released this <a href="https://github.com/openai/gpt-2">code</a> on GitHub...

But, when reproducing GPT-2, we have to be careful about which model we mean, because we are going to be reproducing the `124M` parameter model. When these model releases are made there is always a sub-series of models of different sizes, and usually it is the biggest one that people call **"GPT-2"**.

Let's consider the charts that we have in the paper for a second: \
![OpenAIGPT2 Graphs](ExplanationMedia/Images/OpenAIGPT2Graphs.png)

Now the reason we have multiple models is that, in the above graphs, the `x-axis` is the `Number of parameters in the Language Model` and the `y-axis` shows the *downstream metrics* that we are interested in (like "Translation", "Summarization", "Question Answering" and so on), so we can chart out how the *downstream metrics* change as the model size increases.

And in the paper we see a table like this:

| Parameters | Layers | $d_{model}$ |
|------------|--------|-----------|
| 117M       | 12     | 768       |
| 345M       | 24     | 1024      |
| 762M       | 36     | 1280      |
| 1542M      | 48     | 1600      |

And we see `4` models in the `GPT-2` sub series, starting at `124M` all the way up to `1558M`...

But you might be thinking that I have made a mistake, because the numbers in the table and the numbers I spoke of are different. The reason my numbers disagree with this table is that the parameter counts in the table are wrong (the layer counts and $d_{model}$ values are fine), and if we go to their <a href="https://github.com/openai/gpt-2">GitHub repository</a> we see a note that says:
> * *Note that our original parameter counts were wrong due to an error (in our previous blog posts and paper). Thus you may have seen small referred to as 117M and medium referred to as 345M.*

And in the `124M` parameter model, we see that they used `12 Layers` in the Transformer and `768 Channel Dimensions` in the Transformer.

And by the end of this notebook we will try to beat the original `GPT-2 124M` model and will be looking at loss graphs to see our model perform better.

The thing to note here is, this paper is more than 5 years old now. Training it was probably a very complicated optimization at the time and the available compute was much lower, but today we can reproduce the same model's performance in roughly an hour or so and it will cost us around $10 (if we want to do this on cloud compute, or in other words, a computer that we can all rent).

And one more thing to mention is, OpenAI did release its model weights and they are available through its GitHub repository, but its paper is not very detailed about the training.

So, in addition to the GPT-2 paper, we will also be referring to the <a href="https://arxiv.org/abs/2005.14165">GPT-3 paper</a>, which is a lot more concrete about the hyper-parameters, optimization settings and so on, and whose architecture is not a huge departure from the GPT-2 version of the model.


So, let's do this...

# Understanding Hugging Face Pre-Trained Model

So, the first thing we'd like to do is start at the very end. Or in other words, we'll load the `GPT-2 124M` model as it was released by OpenAI and take it for a spin and sample some `tokens` from it.

Now, the issue is...

When we look at the code base and look for the <a href="">`model.py`</a> we see these imports:
```python
import numpy as np
import tensorflow as tf
from tensorflow.contrib.training import HParams
```

And we realise that the code is written in <a href="https://www.tensorflow.org/">TensorFlow</a> (another framework for creating and training deep learning models, offered by Google). Meaning that the original `GPT-2` code was written against an older TensorFlow API (`tensorflow.contrib` was removed in TensorFlow 2), so it is not something we'd want to build on today...

And as per our previous notebooks, we'd like to use PyTorch, because it will be a lot easier if we are able to build on the old explanations.

But the initial code is in TensorFlow. So, in order to get the target weights in PyTorch we'd like to use the <a href="https://huggingface.co/docs/transformers/en/index">`Hugging Face Transformers Library`</a>, which is released on PyPI. We can use this <a href="https://huggingface.co/docs/transformers/en/installation">installation documentation</a> to walk through the steps to install the library in our system...

We can also check out Hugging Face's implementation of that transformer in their <a href="https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py">`modeling_gpt2.py`</a>. They did a lot of work to convert all that TensorFlow code to PyTorch so that it becomes easier to load and work with.

So in particular we can look at the <a href="https://huggingface.co/openai-community/gpt2">Hugging Face GPT-2</a> model and load it using the Hugging Face transformers...


So this is what the code looks like:

```python
from transformers import GPT2LMHeadModel

huggingface_model = GPT2LMHeadModel.from_pretrained("gpt2")
huggingfaceStateDictionary = huggingface_model.state_dict()

for key, value in huggingfaceStateDictionary.items():
    print(key, value.shape)
```
Which gives us the result:
```python
transformer.wte.weight torch.Size([50257, 768])
transformer.wpe.weight torch.Size([1024, 768])
transformer.h.0.ln_1.weight torch.Size([768])
transformer.h.0.ln_1.bias torch.Size([768])
transformer.h.0.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.0.attn.c_attn.bias torch.Size([2304])
transformer.h.0.attn.c_proj.weight torch.Size([768, 768])
transformer.h.0.attn.c_proj.bias torch.Size([768])
transformer.h.0.ln_2.weight torch.Size([768])
transformer.h.0.ln_2.bias torch.Size([768])
transformer.h.0.mlp.c_fc.weight torch.Size([768, 3072])
transformer.h.0.mlp.c_fc.bias torch.Size([3072])
transformer.h.0.mlp.c_proj.weight torch.Size([3072, 768])
transformer.h.0.mlp.c_proj.bias torch.Size([768])
transformer.h.1.ln_1.weight torch.Size([768])
transformer.h.1.ln_1.bias torch.Size([768])
transformer.h.1.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.1.attn.c_attn.bias torch.Size([2304])
transformer.h.1.attn.c_proj.weight torch.Size([768, 768])
transformer.h.1.attn.c_proj.bias torch.Size([768])
transformer.h.1.ln_2.weight torch.Size([768])
transformer.h.1.ln_2.bias torch.Size([768])
transformer.h.1.mlp.c_fc.weight torch.Size([768, 3072])
transformer.h.1.mlp.c_fc.bias torch.Size([3072])
transformer.h.1.mlp.c_proj.weight torch.Size([3072, 768])
transformer.h.1.mlp.c_proj.bias torch.Size([768])
...
transformer.h.11.mlp.c_proj.bias torch.Size([768])
transformer.ln_f.weight torch.Size([768])
transformer.ln_f.bias torch.Size([768])
lm_head.weight torch.Size([50257, 768])
```

One awkward thing about this is, when we say `gpt2` it actually loads the `124M` parameter model, and if we want the largest (`1558M`) `GPT-2` model we'd specify it as `gpt2-xl`...

Now when we actually get this `GPT-2` initialized, we want to get the **state dictionary**, which holds the **raw tensors loaded with values**, and we can get it using the `.state_dict()` method. We can then print each `key` (the name of a parameter) and look at the shape of each `value` (the tensor stored under that name) to get an idea of the shapes of the states in the model...

So, we can now look at the different parameters inside the `GPT-2` model and their shapes...
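Since every key is a dotted path, we can also read the structure of the model straight off the key names. Here is a minimal sketch that pulls the block number out of a key; the `sample_keys` list is hand-copied from the printout above (it is not the live state dictionary) and `layer_index` is a hypothetical helper name:

```python
import re

# A hand-copied sample of keys from the printout above (not the live state dict).
sample_keys = [
    "transformer.wte.weight",
    "transformer.wpe.weight",
    "transformer.h.0.ln_1.weight",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.1.mlp.c_fc.bias",
    "transformer.h.11.mlp.c_proj.bias",
    "transformer.ln_f.weight",
    "lm_head.weight",
]

def layer_index(key):
    """Return the block number in a key like 'transformer.h.11.mlp...', else None."""
    match = re.match(r"transformer\.h\.(\d+)\.", key)
    return int(match.group(1)) if match else None

# Collect the distinct block numbers that appear in the sample.
layer_indices = sorted({layer_index(key) for key in sample_keys} - {None})
# Blocks are numbered from 0, so the highest index plus one is the layer count.
number_of_layers = max(layer_indices) + 1

print(layer_indices)     # [0, 1, 11]
print(number_of_layers)  # 12
```

Running the same helper over `huggingfaceStateDictionary.keys()` should give every index from `0` to `11`.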

And we can see that there are a lot of short forms of the terms that we already know of, so let's recall that:
1. **wte**: Word Token Embeddings
2. **wpe**: Word Position Embeddings
3. **ln**: Layer Normalization
4. **attn**: Attention
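5. **c_attn**: The combined query, key and value projection of the self-attention (its output width is `2304 = 3 × 768`). The `c_` prefix does not mean cross attention (`GPT-2` is a decoder only architecture and only has **self attention**); it most likely comes from the `conv1d` helper that the original TensorFlow code uses for these layers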
6. **c_proj**: Projection layer within attention or MLP
7. **mlp**: Multi-Layer Perceptron
8. **lm_head**: Language Model Head (output layer)

We can recall the very first key-value pair `transformer.wte.weight torch.Size([50257, 768])` as the `Word Token Embeddings` with a shape of `[50257, 768]`. It comes from the vocabulary of `50257` tokens (which is exactly the number of tokens we spoke about in our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20Tokenizer.ipynb">Tokenizer Notebook</a>), each with a `768` dimensional embedding vector...

We can also look at the second key-value pair `transformer.wpe.weight torch.Size([1024, 768])`, which we can recall as the `Word Positional Embeddings` with a shape of `[1024, 768]`. Because `GPT-2` has a maximum sequence length of `1024`, we have up to `1024` positions that each token can attend to in the past. And every one of those positions in `GPT-2` has its own vector of `768` values that is learnt by optimization.

And everything else is just the other weights and biases of this transformer...
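To convince ourselves that these shapes really add up to `124M`, we can tally them by hand. The sketch below is my own arithmetic, not something from the OpenAI code: `gpt2_parameter_count` is a hypothetical helper, the size labels are just my names for the four rows of the table, and it assumes that the larger models keep the same `3×` and `4×` widths and that `lm_head.weight` shares its storage with `wte` (so it is not counted twice):

```python
def gpt2_parameter_count(number_of_layers, d_model, vocabulary_size=50257, block_size=1024):
    """Count parameters from the tensor shapes listed in the state dictionary."""
    token_embeddings = vocabulary_size * d_model       # wte
    position_embeddings = block_size * d_model         # wpe
    per_block = (
        2 * d_model                                    # ln_1 weight + bias
        + d_model * 3 * d_model + 3 * d_model          # attn.c_attn weight + bias
        + d_model * d_model + d_model                  # attn.c_proj weight + bias
        + 2 * d_model                                  # ln_2 weight + bias
        + d_model * 4 * d_model + 4 * d_model          # mlp.c_fc weight + bias
        + 4 * d_model * d_model + d_model              # mlp.c_proj weight + bias
    )
    final_layer_norm = 2 * d_model                     # ln_f weight + bias
    # Assumption: lm_head.weight is tied to wte, so it adds no new parameters.
    return token_embeddings + position_embeddings + number_of_layers * per_block + final_layer_norm

# (layers, d_model) pairs taken from the table in the paper.
sizes = {"small": (12, 768), "medium": (24, 1024), "large": (36, 1280), "xl": (48, 1600)}
counts = {name: gpt2_parameter_count(layers, d_model) for name, (layers, d_model) in sizes.items()}

for name, count in counts.items():
    print(f"{name:>6}: {count:,} (~{round(count / 1e6)}M)")
```

The smallest model comes out at `124,439,808` parameters and the largest at roughly `1558M`, which agrees with the corrected numbers rather than the `117M` and `1542M` from the table.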

So, now for example, if we take just the `Positional Embeddings` and flatten them out (`.view(-1)` turns the `[1024, 768]` table into one long vector of `1024 × 768 = 786432` values) and take just the first `20` elements (which are the first `20` of the `768` values for position `0`), we can see that we get the actual weights as an output for this code:

```python
huggingfaceStateDictionary['transformer.wpe.weight'].view(-1)[:20]
```
We get:
```python
tensor([-0.0188, -0.1974,  0.0040,  0.0113,  0.0638, -0.1050,  0.0369, -0.1680,
        -0.0491, -0.0565, -0.0025,  0.0135, -0.0042,  0.0151,  0.0166, -0.1381,
        -0.0063, -0.0461,  0.0267, -0.2042])
```
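If the flattening feels unclear, here is a tiny sketch of the same operation on a toy table (3 positions with 4 features each, standing in for the real `[1024, 768]` one). The name `toy_table` is made up for this illustration:

```python
import torch

# A toy stand-in for the [1024, 768] position table: 3 positions, 4 features.
toy_table = torch.arange(12.0).view(3, 4)

# view(-1) lays the rows out one after another in a single 1-D tensor.
flat = toy_table.view(-1)

print(flat.shape)                                # torch.Size([12])
print(flat[:2])                                  # tensor([0., 1.])
# The first elements of the flat view are the first elements of row 0.
print(torch.equal(flat[:2], toy_table[0, :2]))   # True
```

So the `20` numbers above are simply the beginning of the embedding vector for the very first position.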

And we can plot these weights and try to see what they represent like this:

```python
import matplotlib.pyplot as plt
%matplotlib inline

plt.imshow(huggingfaceStateDictionary['transformer.wpe.weight'], cmap='gray')
```

![GPT-2.transformer.wpe.weight](ExplanationMedia/Images/GPT-2.transformer.wpe.weight.png)

And we can see that this has structure, because these positional embeddings end up learning curves that look like **sines** and **cosines** to represent each of these positions. Each row here stands in for one position and is processed by the transformer to recover all the relative positions, realise which token is where, and attend to tokens depending on their position and not just their content...

So now if we look at the individual columns of these we see:

```python
plt.plot(huggingfaceStateDictionary['transformer.wpe.weight'][:, 150])
plt.plot(huggingfaceStateDictionary['transformer.wpe.weight'][:, 200])
plt.plot(huggingfaceStateDictionary['transformer.wpe.weight'][:, 250])
```
![GPT-2Graphs.transformer.wpe.weight](ExplanationMedia/Images/GPT-2Graphs.transformer.wpe.weight.png)

So, we still don't know what these embeddings are doing and why they are the way they are.

But we can still see that the lines are a little noisy and jittery, and that is probably because this model was not fully trained. The more trained this model becomes the more we'd expect these graphs to smooth out, which suggests that the original `GPT-2` is an **under-trained** model.

If I remember correctly, in the original "Attention-Is-All-You-Need" paper, the `positional embeddings` are actually initialized and fixed to sines and cosines of different frequencies, but in `GPT-2` these are trained from scratch and they seem to recover these features during the optimization.
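For comparison, here is a minimal sketch of that fixed encoding as I understand it from the paper, where even feature indices use a sine and odd ones use a cosine of the same angle. The function name `sinusoidal_position_encoding` is my own, and this is not something `GPT-2` itself uses:

```python
import math

def sinusoidal_position_encoding(position, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i / d_model)); PE(pos, 2i + 1) = cos(same angle)."""
    encoding = []
    for even_index in range(0, d_model, 2):          # even_index plays the role of 2i
        angle = position / (10000 ** (even_index / d_model))
        encoding.append(math.sin(angle))             # feature 2i
        encoding.append(math.cos(angle))             # feature 2i + 1
    return encoding[:d_model]                        # trim in case d_model is odd

# A small table: 4 positions with 8 features each.
table = [sinusoidal_position_encoding(position, 8) for position in range(4)]
for row in table:
    print([round(value, 3) for value in row])
```

Each column of such a table is a smooth wave over the positions, which is the kind of shape the learnt columns plotted above seem to approximate.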

Now, using the `Hugging Face Transformers` we can not just get all the raw weights but also get something called `pipeline` and sample from it...

Here is the sample code snippet for `5` different generations of the same context window of tokens `"Hello, I'm a language model,"`:

```python
from transformers import pipeline, set_seed

set_seed(42)

generator = pipeline("text-generation", model="gpt2")
generator("Hello, I'm a language model,", max_length=50, num_return_sequences=5)
```
For which we get:
```python
[{'generated_text': "Hello, I'm a language model, but what I'm really doing is making a human-readable document. There are other languages, but those are the ones I like the most. To do your research, please contact me, this isn't your"},
 {'generated_text': "Hello, I'm a language model, not a syntax model. That's why I like it. I've done a lot of programming projects.\n\nBut my job as a C programmer is to sort through every single line of the script so I"},
 {'generated_text': "Hello, I'm a language model, and I'll do it in no time!\n\nOne of the things we learned from talking to my friend from college a bit earlier, and in the context of the current language model I think it's important"},
 {'generated_text': 'Hello, I\'m a language model, not a command line tool.\n\nIf my code is simple enough:\n\nif (use (string-replace "\\r" ))) {\n\nconsole. log\n\n}\n\nthat\'s'},
 {'generated_text': "Hello, I'm a language model, I've been using Language in all my work. Just a small example, let's see a simplified example. I'm making an API for a game where I want a character to play a little bit of a"}]
```

Sadly, even though we are setting a seed, the generations we get from this code are different from the ones returned by the official <a href="https://huggingface.co/openai-community/gpt2">Hugging Face GPT-2 Hosted Inference API</a>.

But at this stage what is important is, we are getting coherent text and we were successfully able to load the model and look at all of its parameters, and the keys tell us where in the model these come from...

But we want to actually write our own `GPT-2` class so that we have a full understanding of what's happening there and we also don't want to work with something like the <a href="https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py">`modeling_gpt2.py`</a> because it's too complicated and we want to write this from scratch ourselves.

So we are going to be implementing our `GPT-2` model in `GPT_v2.5.py` script inside our `GPT Scripts` directory in parallel...

But first let's load the `GPT-2 124M` weights into the class that we are going to develop from scratch in our `GPT_v2.5.py`. That is going to give us confidence that we can load the OpenAI model and that there is a setting of weights for our class that exactly is the `124M` model. After that we will initialize our own `GPT` class from scratch, train it ourselves, and try to surpass those weights...

So, we're going to get different weights and everything is going to look different and hopefully even better and we will have the confidence that we are in the same model family and same model class and we just have to re-discover a good setting of the weights from scratch... 

So let's now write the `GPT-2` model and let's load the weights and make sure that we can also generate text that looks coherent...

# Hugging Face Pre-Trained Weight Transfer

Let's now swing over to the <a href="https://arxiv.org/abs/1706.03762">"Attention Is All You Need" paper</a> that started everything and look at the Transformer architecture:

![Transformer_Model_Architecture](ExplanationMedia/Images/Transformer_Model_Architecture.png)

Now, once again, like we mentioned in the last notebook, this architecture has changed over the years and `GPT-2` is slightly modified from the original `Transformer`...

In particular, we do **NOT** have the **Encoder**, and `GPT-2` is a **Decoder** only `Transformer` as we call it. In addition to that, the **Cross-Attention** in the decoder that attends to the **Encoder**'s output is also **missing**. Everything else stays almost the same, but there are some differences that we are going to see next...

So, there are two main differences: \
When we go to the `GPT-2` <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">paper</a>, under section `2.3 Model` we see that there's a re-shuffling of the layer-normalizations (they change place) and an additional layer normalization was added after the final self-attention block...
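To make the re-shuffling concrete, here is a small sketch in plain Python that contrasts the two orderings on a single vector. Everything in it is a simplification for illustration: `layer_norm` has no learnt scale or shift, and `sublayer` is a hypothetical stand-in for the attention or MLP computation:

```python
import math

def layer_norm(vector, epsilon=1e-5):
    """Normalize to zero mean and unit variance (no learnt scale or shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in vector]

def sublayer(vector):
    """Hypothetical stand-in for attention or the MLP: just scales the input."""
    return [0.5 * v for v in vector]

def post_norm_step(x):
    # Original Transformer ordering: add the residual, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_step(x):
    # GPT-2 ordering: normalize the sub-layer's input, leave the residual path untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
print(post_norm_step(x))
print(pre_norm_step(x))
```

In the pre-norm version the input `x` passes through the addition unchanged, so nothing on the residual path is ever normalized inside the blocks. That is why a final layer normalization is added at the very end.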

Let's now implement the skeleton of the `nn.Module`(s) in our GPT Script and in particular we want to match up the schema that we got from `Hugging Face GPT-2`...

And we will use a decorator called `@dataclass` (from Python's `dataclasses` module), which automatically adds generated special methods such as `__init__()` and `__repr__()` to user-defined classes...

And we will use it to define all the hyper-parameters as a Class called `GPTConfiguration`...
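Here is a small sketch of what the decorator gives us. `ExampleConfiguration` is a throwaway class made up for this illustration; the thing to notice is that only attributes with a type annotation become dataclass fields:

```python
from dataclasses import dataclass, fields

@dataclass
class ExampleConfiguration:      # hypothetical class, for illustration only
    blockSize: int = 1024        # annotated, so it becomes a dataclass field
    numberOfLayers: int = 12     # annotated, so it becomes a dataclass field
    note = "not a field"         # no annotation, so it stays a plain class attribute

# The generated __init__ accepts the fields as keyword arguments.
config = ExampleConfiguration(numberOfLayers=6)

print(config)                              # uses the generated __repr__
print([f.name for f in fields(config)])    # ['blockSize', 'numberOfLayers']
```

Without the annotations the class would still hold the default values, but we could not override them when constructing the configuration, which is something we'd want for the larger models.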

Now because we are going to be implementing the `124M GPT-2 Model`, when we go to the paper we see these hyper-parameters:
1. block-size (context window) → 1024
2. vocabulary-size (token vocabulary) → 50257
3. n-layer (number of layers) → 12
4. n-head (number of self-attention heads) → 12
5. embedding-dimensions ($d_{model}$) → 768
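These few hyper-parameters also explain the other widths that showed up in the state dictionary. A quick sketch of the arithmetic (the variable names are my own, and the per-head split assumes the embedding is divided evenly across the heads):

```python
number_of_heads = 12
embedding_dimensions = 768

# Assumption: the embedding is split evenly across the attention heads.
assert embedding_dimensions % number_of_heads == 0
head_size = embedding_dimensions // number_of_heads   # 64 features per head

# Query, key and value are produced by one projection, hence 3x the width.
qkv_width = 3 * embedding_dimensions                  # 2304, as in attn.c_attn

# The MLP expands to 4x the width before projecting back down.
mlp_hidden_width = 4 * embedding_dimensions           # 3072, as in mlp.c_fc

print(head_size, qkv_width, mlp_hidden_width)         # 64 2304 3072
```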

So let's implement this now...

For now our code looks like this:
```python
from dataclasses import dataclass

# GPT configuration hyper-parameters
@dataclass
class GPTConfiguration:
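    blockSize: int = 1024                      # maximum context length
    vocabularySize: int = 50257                # number of tokens in the vocabulary
    numberOfLayers: int = 12                   # number of transformer blocks
    numberOfHeads: int = 12                    # attention heads per block
    numberOfEmbeddingDimensions: int = 768     # d_model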
```

Now we will be able to use this configuration under the `GPTModel` class that we are going to write...

For now our empty `GPTModel` class looks like this:
```python
import torch
import torch.nn.functional as F

# GPT model architecture
class GPTModel(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        self.configuration = configuration
```

Now we want to **copy** the schema from the **Hugging Face `GPT-2` model** by utilizing the `huggingfaceStateDictionary`...

And here's what I came up with...

We see that the container in the schema is called `transformer`, which contains all the modules, and we can create something like that using `torch.nn.ModuleDict`, which is just a dictionary of torch `Module`(s) that lets us index into the `Module`(s) using **keys**, just like a normal Python dictionary...

Within that we can create something called `wordTokenEmbeddings` which corresponds with `wte` and create something called `wordPositionalEmbeddings` which corresponds with `wpe`, and we can match the shapes and create our initial layers...

Then in the **Hugging Face `GPT-2` model** we see that we have a long list of **hidden** layers represented by `.h` and followed by a range of numbers `.0` to `.11`, hinting that the number of layers is `12`, so we can now utilize our `numberOfLayers` hyper-parameter to construct this long list of layers. And instead of a `torch.nn.ModuleDict` we can use a `torch.nn.ModuleList`, which is just a list of `Module`(s) that we index with integers.

The important thing to note is, in those hidden layers we see different kinds of layer **weights** and **biases** of different **layers** all having their own shapes and sizes, but we do see a pattern that they repeat themselves in terms of **layer number**, so for now we can just consider each of these **layers** as a `Block` and build a list of them with a loop, one `Block` per layer. Keep in mind that the `Block` class has not been defined yet, and we will define it later, but we want all the `Block`(s) to take in the same `configuration` object and construct the layer objects through it, because we already have all the hyper-parameters set inside it...
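To see how these two containers produce dotted keys like the ones in the Hugging Face schema, here is a tiny sketch. `TinyBlock` and `TinyModel` are hypothetical stand-ins with made-up sizes, not the classes we are about to write:

```python
import torch

class TinyBlock(torch.nn.Module):        # hypothetical stand-in, not the real Block
    def __init__(self, width):
        super().__init__()
        self.ln_1 = torch.nn.LayerNorm(width)

class TinyModel(torch.nn.Module):        # hypothetical, for illustration only
    def __init__(self, width=4, number_of_layers=2):
        super().__init__()
        # ModuleDict entries are addressed by name, ModuleList entries by index.
        self.transformer = torch.nn.ModuleDict(dict(
            wte=torch.nn.Embedding(10, width),
            h=torch.nn.ModuleList([TinyBlock(width) for _ in range(number_of_layers)]),
        ))

keys = list(TinyModel().state_dict().keys())
for key in keys:
    print(key)
# transformer.wte.weight
# transformer.h.0.ln_1.weight
# transformer.h.0.ln_1.bias
# transformer.h.1.ln_1.weight
# transformer.h.1.ln_1.bias
```

Every level of nesting contributes one part of the key: the attribute name, then the dictionary key, then the list index, and so on down to the parameter itself.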

Now that we have our long list of **hidden layers** it is time to construct the final **layer normalization** layer according to the `GPT-2` paper, so we can create something like `finalLayerNorm` and match the shapes which corresponds to the `ln_f`...

And lastly we can construct our **final classifier** (or the **language model head**) which is just a **Linear Layer** that projects all the **embeddings** to their respective **tokens**, having **no bias**. So, we can easily construct this **languageModelingHead** which corresponds to the `lm_head` and finish with our skeleton of the `GPT-2` model...

Now we end up with a code like this:

```python
from dataclasses import dataclass
import torch
import torch.nn.functional as F


# GPT configuration hyper-parameters
@dataclass
class GPTConfiguration:
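    blockSize: int = 1024                      # maximum context length
    vocabularySize: int = 50257                # number of tokens in the vocabulary
    numberOfLayers: int = 12                   # number of transformer blocks
    numberOfHeads: int = 12                    # attention heads per block
    numberOfEmbeddingDimensions: int = 768     # d_model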

# GPT model architecture
class GPTModel(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        self.configuration = configuration

        self.transformer = torch.nn.ModuleDict(dict(
            wordTokenEmbeddings = torch.nn.Embedding(configuration.vocabularySize, configuration.numberOfEmbeddingDimensions),
            wordPositionalEmbeddings = torch.nn.Embedding(configuration.blockSize, configuration.numberOfEmbeddingDimensions),
            hidden = torch.nn.ModuleList(Block(configuration) for _ in range(configuration.numberOfLayers)),
            finalLayerNorm = torch.nn.LayerNorm(configuration.numberOfEmbeddingDimensions)
        ))

        self.languageModelingHead = torch.nn.Linear(configuration.numberOfEmbeddingDimensions, configuration.vocabularySize, bias=False)

# NOTE: this needs the Block class, which we define later; until then it raises a NameError.
model = GPTModel(GPTConfiguration())
```
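One thing to keep in mind is that our skeleton uses descriptive names (`wordTokenEmbeddings`, `hidden` and so on) where Hugging Face uses short ones (`wte`, `h` and so on), so the keys of the two state dictionaries will not match literally. Here is one possible way to translate between them, as a sketch. `NAME_MAP` and `rename_key` are hypothetical names, and the parts inside each block (`ln_1`, `attn`, `mlp` and so on) are left untouched because our `Block` is not defined yet:

```python
# Hypothetical mapping from Hugging Face container names to the names used in GPTModel.
NAME_MAP = {
    "wte": "wordTokenEmbeddings",
    "wpe": "wordPositionalEmbeddings",
    "h": "hidden",
    "ln_f": "finalLayerNorm",
    "lm_head": "languageModelingHead",
}

def rename_key(huggingface_key):
    """Translate the container parts of a key and leave every other part as it is."""
    return ".".join(NAME_MAP.get(part, part) for part in huggingface_key.split("."))

for key in ["transformer.wte.weight", "transformer.h.11.mlp.c_proj.bias", "lm_head.weight"]:
    print(key, "->", rename_key(key))
# transformer.wte.weight -> transformer.wordTokenEmbeddings.weight
# transformer.h.11.mlp.c_proj.bias -> transformer.hidden.11.mlp.c_proj.bias
# lm_head.weight -> languageModelingHead.weight
```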