# Welcome to OpenAI GPT v2.5

**NOTE: This notebook is the continuation of <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20from%20Scratch.ipynb">GPT from Scratch</a> and <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20Tokenizer.ipynb">GPT Tokenizer</a> notebooks.**

In this notebook we are going to reproduce <a href="https://github.com/openai/gpt-2">OpenAI's GPT-2</a> model, specifically the `124M` version of it.

Now, when OpenAI released GPT 2, they released it with this <a href="https://openai.com/index/better-language-models/">blog post</a> and this <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">paper</a>. And on top of that, they released this <a href="https://github.com/openai/gpt-2">code</a> on GitHub...

But when reproducing GPT-2 we have to be careful about which model we mean, because we are going to be reproducing the `124M` parameter model. These model releases always come as a sub-series of models of different sizes, and usually it is the biggest model that gets called **"GPT-2"**.

Let's consider the charts that we have in the paper for a second: \
![OpenAIGPT2 Graphs](ExplanationMedia/Images/OpenAIGPT2Graphs.png)

Now the reason we have multiple models is that, in the above graphs, the `x-axis` is the `Number of parameters in the Language Model` and on the `y-axis` we put the *downstream metrics* that we are interested in (like "Translation", "Summarization", "Question Answering" and so on), so we can chart how the *downstream metrics* change as the model size increases.

And in the paper we see a table like this:

| Parameters | Layers | $d_{model}$ |
|------------|--------|-----------|
| 117M       | 12     | 768       |
| 345M       | 24     | 1024      |
| 762M       | 36     | 1280      |
| 1542M      | 48     | 1600      |

And we see `4` models in the `GPT-2` sub series, starting at `124M` all the way up to `1558M`...

But you might be thinking that I have made a mistake, because the numbers in the table are different from the numbers I spoke of. The reason my numbers disagree with this table is that the parameter counts in this table are wrong, and if we go to their <a href="https://github.com/openai/gpt-2">GitHub repository</a> we see a note that says:
> * *Note that our original parameter counts were wrong due to an error (in our previous blog posts and paper). Thus you may have seen small referred to as 117M and medium referred to as 345M.*

And in the `124M` parameter model, we see that they used `12 Layers` in the Transformer and `768 Channel Dimensions` in the Transformer.

And by the end of this notebook we will try to beat the original `GPT-2 124M` model and will be looking at loss graphs to see our model perform better.

The thing to note here is that this paper is more than 5 years old now. Training this model was probably a very complicated optimization at the time, and the available compute was much lower, but today we can reproduce the same model's performance in roughly an hour or so, and it will cost us around $10 (if we do this on cloud compute, or in other words, on a computer that anyone can rent).

And one more thing to mention is that OpenAI did release its model weights, which are available through its GitHub repository, but its paper is not very detailed about the training.

So, in addition to the GPT-2 paper, we will also be referring to the <a href="https://arxiv.org/abs/2005.14165">GPT-3 paper</a>, which is a lot more concrete about the hyper-parameters, optimization settings and so on, and whose architecture is not a huge departure from the GPT-2 version of the model.


So, let's do this...

# Understanding Hugging Face Pre-Trained Model

So, the first thing we'd like to do is start at the very end. Or in other words, we'll load the `GPT-2 124M` model as it was released by OpenAI and take it for a spin and sample some `tokens` from it.

Now, the issue is...

When we look at the code base and look for the <a href="">`model.py`</a> we see these imports:
```python
import numpy as np
import tensorflow as tf
from tensorflow.contrib.training import HParams
```

And we realise that the code is written in <a href="https://www.tensorflow.org/">TensorFlow</a> (an alternative framework for creating and training deep learning models, offered by Google). Meaning that the original `GPT-2` code was written in TensorFlow, and that code base is not really used anymore...

And as per our previous notebooks, we'd like to use PyTorch, which also makes it a lot easier to build on the old explanations.

But the problem is that the original code is in TensorFlow. So, in order to get the target weights in a PyTorch-friendly form, we'd like to use the <a href="https://huggingface.co/docs/transformers/en/index">`Hugging Face Transformers Library`</a>, released on PyPI. We can use this <a href="https://huggingface.co/docs/transformers/en/installation">installation documentation</a> to walk through the steps to install the library on our system...

We can also check out Hugging Face's implementation of the transformer in their <a href="https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py">`modeling_gpt2.py`</a>. They did a lot of work to convert all of that TensorFlow code to PyTorch so that it becomes easier to load and work with.

So in particular we can look at the <a href="https://huggingface.co/openai-community/gpt2">Hugging Face GPT-2</a> model and load it using the Hugging Face transformers...


So this is what the code looks like:

```python
from transformers import GPT2LMHeadModel

huggingface_model = GPT2LMHeadModel.from_pretrained("gpt2")
huggingfaceStateDictionary = huggingface_model.state_dict()

for key, value in huggingfaceStateDictionary.items():
    print(key, value.shape)
```
Which gives us the result:
```python
transformer.wte.weight torch.Size([50257, 768])
transformer.wpe.weight torch.Size([1024, 768])
transformer.h.0.ln_1.weight torch.Size([768])
transformer.h.0.ln_1.bias torch.Size([768])
transformer.h.0.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.0.attn.c_attn.bias torch.Size([2304])
transformer.h.0.attn.c_proj.weight torch.Size([768, 768])
transformer.h.0.attn.c_proj.bias torch.Size([768])
transformer.h.0.ln_2.weight torch.Size([768])
transformer.h.0.ln_2.bias torch.Size([768])
transformer.h.0.mlp.c_fc.weight torch.Size([768, 3072])
transformer.h.0.mlp.c_fc.bias torch.Size([3072])
transformer.h.0.mlp.c_proj.weight torch.Size([3072, 768])
transformer.h.0.mlp.c_proj.bias torch.Size([768])
transformer.h.1.ln_1.weight torch.Size([768])
transformer.h.1.ln_1.bias torch.Size([768])
transformer.h.1.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.1.attn.c_attn.bias torch.Size([2304])
transformer.h.1.attn.c_proj.weight torch.Size([768, 768])
transformer.h.1.attn.c_proj.bias torch.Size([768])
transformer.h.1.ln_2.weight torch.Size([768])
transformer.h.1.ln_2.bias torch.Size([768])
transformer.h.1.mlp.c_fc.weight torch.Size([768, 3072])
transformer.h.1.mlp.c_fc.bias torch.Size([3072])
transformer.h.1.mlp.c_proj.weight torch.Size([3072, 768])
transformer.h.1.mlp.c_proj.bias torch.Size([768])
...
transformer.h.11.mlp.c_proj.bias torch.Size([768])
transformer.ln_f.weight torch.Size([768])
transformer.ln_f.bias torch.Size([768])
lm_head.weight torch.Size([50257, 768])
```

One awkward thing about this is that when we say `gpt2` it actually loads the `124M` parameter model, and if we want the biggest `GPT-2` model (the `1558M` one) we'd specify it as `gpt2-xl`...

Now when we actually get this `GPT-2` initialized, we want to get the **state dictionary**, which holds the **raw tensors loaded with values**, and we can get it using the `.state_dict()` method. We can then print each `key` (the name of a parameter) and each `value` (the tensor itself), and look at the shapes of the `value` tensors to get an idea of the shapes of the parameters in the model...

So, we can now look at the different parameters inside the `GPT-2` model and their shapes...

And we can see that there are a lot of short forms of the terms that we already know of, so let's recall that:
1. **wte**: Word Token Embeddings
2. **wpe**: Word Position Embeddings
3. **ln**: Layer Normalization
4. **attn**: Attention
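5. **c_attn**: The fused attention projection that produces the queries, keys and values in one layer (hence the `2304 = 3 * 768` outputs). The `c_` prefix most likely refers to the `conv1d` helper that the original TensorFlow code uses for its linear layers, not to *cross attention*, which a decoder only architecture like `GPT-2` does not have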
6. **c_proj**: Projection layer within attention or MLP
7. **mlp**: Multi-Layer Perceptron
8. **lm_head**: Language Model Head (output layer)
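9. **c_fc**: Fully Connected layer, the first linear layer of the MLP (the `c_` prefix most likely refers to the `conv1d` helper in the original TensorFlow code)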

We initially can recall the very first key-value pair `transformer.wte.weight torch.Size([50257, 768])` as the `Word Token Embeddings` having a shape of `[50257, 768]` and it comes from the `50257` vocabulary of tokens (which is exactly the number of tokens we spoke about in our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20Tokenizer.ipynb">Tokenizer Notebook</a>) having `768` feature space (or embedding vector space, or `768 dimensional embedding`)...

We can also look at the second key-value pair `transformer.wpe.weight torch.Size([1024, 768])`, which we can recall as the `Word Positional Embeddings` having a shape of `[1024, 768]`. Because `GPT-2` has a maximum sequence length of `1024`, there are up to `1024` positions that a token can occupy, and each token can attend to the positions before it. And every one of those positions in `GPT-2` has its own vector of size `768` that is learnt by optimization.

And everything else is just the other weights and biases of this transformer...
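
As a sanity check on the `124M` figure, the shapes printed above are enough to count the parameters by hand. This is only a sketch: the short variable names are our own, and it assumes that `lm_head.weight` shares its storage with `wte.weight` (weight tying), so that matrix is not counted a second time:

```python
# Sizes copied from the state dictionary shapes printed above.
vocabularySize, blockSize, embeddingDimensions, numberOfLayers = 50257, 1024, 768, 12

# wte + wpe
embeddings = vocabularySize * embeddingDimensions + blockSize * embeddingDimensions

perBlock = (
    2 * embeddingDimensions                                                  # ln_1 weight + bias
    + embeddingDimensions * 3 * embeddingDimensions + 3 * embeddingDimensions  # attn.c_attn
    + embeddingDimensions * embeddingDimensions + embeddingDimensions          # attn.c_proj
    + 2 * embeddingDimensions                                                # ln_2 weight + bias
    + embeddingDimensions * 4 * embeddingDimensions + 4 * embeddingDimensions  # mlp.c_fc
    + 4 * embeddingDimensions * embeddingDimensions + embeddingDimensions      # mlp.c_proj
)

finalLayerNorm = 2 * embeddingDimensions  # ln_f weight + bias

# Assumption: lm_head.weight is tied to wte.weight, so it adds nothing here.
total = embeddings + numberOfLayers * perBlock + finalLayerNorm
print(f"{total:,}")  # 124,439,808
```

Which lands on roughly `124M`, in line with the corrected note in the repository.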

So, now for example, if we take just the `Positional Embeddings` and flatten them out (we get a one-dimensional vector of `1024 * 768 = 786432` elements) and take just the first `20` elements (which are the first `20` of the `768` values for position `0`), we can see that we get the actual weights as an output of this code:

```python
huggingfaceStateDictionary['transformer.wpe.weight'].view(-1)[:20]
```
We get:
```python
tensor([-0.0188, -0.1974,  0.0040,  0.0113,  0.0638, -0.1050,  0.0369, -0.1680,
        -0.0491, -0.0565, -0.0025,  0.0135, -0.0042,  0.0151,  0.0166, -0.1381,
        -0.0063, -0.0461,  0.0267, -0.2042])
```
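
To make it clear what `.view(-1)[:20]` is picking out, here is a small sketch on a stand-in tensor with the same shape as `transformer.wpe.weight` (random values, not the real weights, so that it runs without downloading the model):

```python
import torch

# Stand-in for transformer.wpe.weight: same shape, random values (an assumption
# made only so this snippet is self-contained).
torch.manual_seed(0)
positionEmbeddings = torch.randn(1024, 768)

flattened = positionEmbeddings.view(-1)
print(flattened.shape)  # torch.Size([786432])

# The first 20 flattened values are the first 20 features of position 0.
sameValues = torch.equal(flattened[:20], positionEmbeddings[0, :20])
print(sameValues)  # True
```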

And we can plot these weights and try to see what they represent like this:

```python
import matplotlib.pyplot as plt
%matplotlib inline

plt.imshow(huggingfaceStateDictionary['transformer.wpe.weight'], cmap='gray')
```

![GPT-2.transformer.wpe.weight](ExplanationMedia/Images/GPT-2.transformer.wpe.weight.png)

And we can see that this has structure, because these positional embeddings end up learning **sine**- and **cosine**-like curves to represent each of these positions. Each row here stands in for one position and is processed by the transformer to recover the relative positions, realise which token is where, and attend to tokens depending on their position and not just their content...

So now if we look at the individual columns of these we see:

```python
plt.plot(huggingfaceStateDictionary['transformer.wpe.weight'][:, 150])
plt.plot(huggingfaceStateDictionary['transformer.wpe.weight'][:, 200])
plt.plot(huggingfaceStateDictionary['transformer.wpe.weight'][:, 250])
```
![GPT-2Graphs.transformer.wpe.weight](ExplanationMedia/Images/GPT-2Graphs.transformer.wpe.weight.png)

So, we still don't know what these embeddings are doing and why they are the way they are.

But we can still see that the lines are a little noisy and jittery and that is because this model was not fully trained, and the more trained this model becomes the more we'd expect these graphs to smooth out, which also tells us that the original `GPT-2` is an **under-trained** model.

If I remember correctly, in the original "Attention Is All You Need" paper, the `positional embeddings` are actually initialized and fixed to sines and cosines of different frequencies, but in `GPT-2` these are trained from scratch and they seem to recover these features during the optimization.
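
For comparison, here is a minimal sketch of those fixed encodings, following the formula from the "Attention Is All You Need" paper: $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. The function name is our own, and the table is built with plain Python lists just to keep the snippet dependency-free:

```python
import math

def sinusoidalPositions(numberOfPositions, dModel):
    """Build the fixed (non-learned) positional encoding table, one row per position."""
    table = []
    for position in range(numberOfPositions):
        row = []
        for evenIndex in range(0, dModel, 2):
            # evenIndex plays the role of 2i in the formula
            angle = position / (10000 ** (evenIndex / dModel))
            row.append(math.sin(angle))  # even feature index
            row.append(math.cos(angle))  # odd feature index
        table.append(row[:dModel])
    return table

# Same shape as transformer.wpe.weight: 1024 positions, 768 features each.
table = sinusoidalPositions(1024, 768)
print(len(table), len(table[0]))
print(table[0][:4])  # position 0 alternates sin(0) = 0.0 and cos(0) = 1.0
```

Plotting a column of this table against position would give a clean sine or cosine wave, which is the kind of curve the learned columns above only roughly resemble.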

Now, using the `Hugging Face Transformers` we can not just get all the raw weights but also get something called `pipeline` and sample from it...

Here is the sample code snippet for `5` different generations of the same context window of tokens `"Hello, I'm a language model,"`:

```python
from transformers import pipeline, set_seed

set_seed(42)

generator = pipeline("text-generation", model="gpt2")
generator("Hello, I'm a language model,", max_length=50, num_return_sequences=5)
```
For which we get:
```python
[{'generated_text': "Hello, I'm a language model, but what I'm really doing is making a human-readable document. There are other languages, but those are the ones I like the most. To do your research, please contact me, this isn't your"},
 {'generated_text': "Hello, I'm a language model, not a syntax model. That's why I like it. I've done a lot of programming projects.\n\nBut my job as a C programmer is to sort through every single line of the script so I"},
 {'generated_text': "Hello, I'm a language model, and I'll do it in no time!\n\nOne of the things we learned from talking to my friend from college a bit earlier, and in the context of the current language model I think it's important"},
 {'generated_text': 'Hello, I\'m a language model, not a command line tool.\n\nIf my code is simple enough:\n\nif (use (string-replace "\\r" ))) {\n\nconsole. log\n\n}\n\nthat\'s'},
 {'generated_text': "Hello, I'm a language model, I've been using Language in all my work. Just a small example, let's see a simplified example. I'm making an API for a game where I want a character to play a little bit of a"}]
```

Sadly, even though we are setting a seed we get different generations from both the code and the official <a href="https://huggingface.co/openai-community/gpt2">Hugging Face GPT-2 Hosted Inference API</a>.

But at this stage what is important is that we are getting coherent text, and that we were successfully able to load the model and look at all of its parameters, where the keys tell us where in the model each of them comes from...

But we want to actually write our own `GPT-2` class so that we have a full understanding of what's happening there and we also don't want to work with something like the <a href="https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py">`modeling_gpt2.py`</a> because it's too complicated and we want to write this from scratch ourselves.

So we are going to be implementing our `GPT-2` model in `GPT_v2.5.py` script inside our `GPT Scripts` directory in parallel...

But first let's load the `GPT-2 124M` weights into the class that we are going to develop from scratch in `GPT_v2.5.py`. This is going to give us confidence that we can load the OpenAI model, and that there exists a setting of the weights of our own class that is exactly the `124M` model. After that, we will initialize our own `GPT` class from scratch and try to surpass the OpenAI weights...

So, we're going to get different weights, and everything is going to look different and hopefully even better, but we will have the confidence that we are in the same model family and same model class, and that we just have to re-discover a good setting of the weights from scratch...

So let's now write the `GPT-2` model and let's load the weights and make sure that we can also generate text that looks coherent...

# Hugging Face Pre-Trained Weight Transfer

## Transformer - Architecture

Let's now swing over to the <a href="https://arxiv.org/abs/1706.03762">"Attention Is All You Need" paper</a> that started everything and look at the Transformer architecture:

![Transformer_Model_Architecture](ExplanationMedia/Images/Transformer_Model_Architecture.png)

Now, once again, as we mentioned in the last notebook, this architecture has changed over the years, and `GPT-2` is slightly modified compared to the original `Transformer`...

In particular, we do **NOT** have the **Encoder**, and `GPT-2` is a **Decoder**-only `Transformer`, as we call it. In addition to that, the **Cross-Attention** inside the decoder, which consumes the output of that **Encoder**, is also **missing**. Everything else stays almost the same, but there are some differences that we are going to see next...

So, there are two main differences. When we go to the `GPT-2` <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">paper</a>, under section `2.3 Model`, we see that first, there's a re-shuffling of the layer-normalizations (they move to the input of each sub-block), and second, an additional layer normalization was added after the final self-attention block...

## GPT-2 Skeleton

### Hyper-Parameters (`GPTConfiguration`)

Let's now implement the skeleton of the `nn.Module`(s) in our GPT Script and in particular we want to match up the schema that we got from `Hugging Face GPT-2`...

And we will use a decorator called `@dataclass` (from Python's `dataclasses` module), which automatically adds generated special methods such as `__init__()` and `__repr__()` to user-defined classes...

And we will use it to define all the hyper-parameters as a Class called `GPTConfiguration`...

Now because we are going to be implementing the `124M GPT-2 Model`, when we go to the paper we see these hyper-parameters:
1. block-size (context window) → 1024
2. vocabulary-size (token vocabulary) → 50257
3. n-layer (number of layers) → 12
4. n-head (number of self-attention heads) → 12
5. embedding-dimensions ($d_{model}$) → 768

So let's implement this now...

For now our code looks like this:
```python
from dataclasses import dataclass

# GPT configuration hyper-parameters
@dataclass
class GPTConfiguration:
    blockSize: int = 1024
    vocabularySize: int = 50257
    numberOfLayers: int = 12
    numberOfHeads: int = 12
    numberOfEmbeddingDimensions: int = 768
```
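
As a quick illustration of what `@dataclass` gives us for free, here is a small usage sketch. The `tinyConfiguration` values are hypothetical sizes for quick experiments and are not taken from the paper:

```python
from dataclasses import dataclass, asdict

@dataclass
class GPTConfiguration:
    blockSize: int = 1024
    vocabularySize: int = 50257
    numberOfLayers: int = 12
    numberOfHeads: int = 12
    numberOfEmbeddingDimensions: int = 768

# The generated __init__() fills in the defaults, and __repr__() prints every field.
defaultConfiguration = GPTConfiguration()
print(defaultConfiguration)

# Any field can be overridden by keyword (hypothetical small sizes for debugging).
tinyConfiguration = GPTConfiguration(numberOfLayers=2, numberOfHeads=2, numberOfEmbeddingDimensions=64)
print(asdict(tinyConfiguration))

# Each attention head works on an equal slice of the embedding dimensions.
headSize = defaultConfiguration.numberOfEmbeddingDimensions // defaultConfiguration.numberOfHeads
print(headSize)  # 64
```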

### `GPTModel`

Now we will be able to use this configuration under the `GPTModel` class that we are going to write...

For now our empty `GPTModel` class looks like this:
```python
import torch
import torch.nn.functional as F

# GPT model architecture
class GPTModel(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        self.configuration = configuration
```

Now we want to **copy** the schema from the **Hugging Face `GPT-2` model** by utilizing the `huggingfaceStateDictionary`...

And here's what I came up with...

We see that the container in the schema is called `transformer`, which contains all the modules, and we can create something like that using `torch.nn.ModuleDict`, which is just a dictionary of torch `Module`(s) that lets us index into `Module`(s) using **keys**, just like a normal Python dictionary...

Within that we can create something called `wordTokenEmbeddings` which corresponds with `wte` and create something called `wordPositionalEmbeddings` which corresponds with `wpe`, and we can match the shapes and create our initial layers...

Then in the **Hugging Face `GPT-2` model** we see that we have a long list of **hidden** layers represented by `.h` and followed by an index from `.0` to `.11`, hinting that the number of layers is `12`, so we can now utilize our `numberOfLayers` hyper-parameter to construct this long list of layers. And instead of a `torch.nn.ModuleDict` we can use a `torch.nn.ModuleList`, which is just a list of `Module`(s) that we can index with integers.

The important thing to note is that in those hidden layers we see different kinds of **weights** and **biases** of different **layers**, all having their own shapes and sizes, but we do see a pattern: they repeat themselves for every **layer number**. So for now we can just treat each of these **layers** as a `Block` and build a list of them, one per layer. Keep in mind that the `Block` class has not been defined yet, and we will define it later, but we want all the `Block`(s) to take in the same `configuration` object and construct their layers from it, because we already have all the hyper-parameters set inside it...

Now that we have our long list of **hidden layers** it is time to construct the final **layer normalization** layer according to the `GPT-2` paper, so we can create something like `finalLayerNorm` and match the shapes which corresponds to the `ln_f`...

And lastly we can construct our **final classifier** (or the **language model head**) which is just a **Linear Layer** that projects all the **embeddings** to their respective **tokens**, having **no bias**. So, we can easily construct this **languageModelingHead** which corresponds to the `lm_head` and finish with our skeleton of the `GPT-2` model...

Now we end up with a code like this:

```python
from dataclasses import dataclass
import torch
import torch.nn.functional as F


# GPT configuration hyper-parameters
@dataclass
class GPTConfiguration:
    blockSize: int = 1024
    vocabularySize: int = 50257
    numberOfLayers: int = 12
    numberOfHeads: int = 12
    numberOfEmbeddingDimensions: int = 768

# GPT model architecture
class GPTModel(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        self.configuration = configuration

        self.transformer = torch.nn.ModuleDict(dict(
            wordTokenEmbeddings = torch.nn.Embedding(configuration.vocabularySize, configuration.numberOfEmbeddingDimensions),
            wordPositionalEmbeddings = torch.nn.Embedding(configuration.blockSize, configuration.numberOfEmbeddingDimensions),
            hidden = torch.nn.ModuleList(Block(configuration) for _ in range(configuration.numberOfLayers)),
            finalLayerNorm = torch.nn.LayerNorm(configuration.numberOfEmbeddingDimensions)
        ))

        self.languageModelingHead = torch.nn.Linear(configuration.numberOfEmbeddingDimensions, configuration.vocabularySize, bias=False)

model = GPTModel(GPTConfiguration())
```
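
One consequence of this layout is worth seeing directly: the attribute names we pick become the keys of our own state dictionary. The sketch below uses a tiny stand-in container (hypothetical small sizes, and plain `Linear` layers in place of the still undefined `Block`) just to print the resulting keys:

```python
import torch

# Tiny stand-in with the same nesting as the skeleton above. The sizes are
# hypothetical and Linear layers stand in for Block, which is not defined yet.
toyModel = torch.nn.Module()
toyModel.transformer = torch.nn.ModuleDict(dict(
    wordTokenEmbeddings=torch.nn.Embedding(10, 4),
    wordPositionalEmbeddings=torch.nn.Embedding(8, 4),
    hidden=torch.nn.ModuleList([torch.nn.Linear(4, 4) for _ in range(2)]),
    finalLayerNorm=torch.nn.LayerNorm(4),
))
toyModel.languageModelingHead = torch.nn.Linear(4, 10, bias=False)

toyKeys = list(toyModel.state_dict().keys())
for key in toyKeys:
    print(key)
```

The keys come out as `transformer.wordTokenEmbeddings.weight`, `transformer.hidden.0.weight` and so on, so they have the same structure as the Hugging Face keys but different names (`wte`, `h`, ...). Any weight transfer therefore needs an explicit mapping between the two naming schemes.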

### `Block`(s)

Let's now create the `Block` class which is currently undefined...

Now, here the `Block` refers to the `Transformer Block` that gets repeated again and again as hidden layers...

Now, according to our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20from%20Scratch.ipynb">GPT from Scratch</a> notebook, we already defined this `Block`, and as we mentioned `GPT-2` also has a slightly modified `Transformer Block`...

For now we are going to use this template:
```python
# Transformer Block
class Block(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
    
    def forward(self, inputs):
        pass
```

For now, we understand that we are left with the following modules & properties:
1. Add & Norm
2. Attention
3. Feed Forward Network
4. Residual Pathways

Let's start with **Addition and Normalization (Add & Norm)**.

According to the diagram, the **Add & Norm** is there **AFTER** the **Attention & Feed Forward Network**, but in our case we will use them **BEFORE** the **Attention & Feed Forward Network**, making it a **Pre-Normalization (Pre-Norm)**...

Then we have our **Attention** module, and for now we can relate it to the attention we built in our last <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20from%20Scratch.ipynb">GPT from Scratch</a> notebook. Specifically, we built two modules (`MultiHeadAttention` and `Head`), but here we will implement both modules in a combined and mathematically optimized class called `CausalSelfAttention`...

Then we have our **Feed Forward Network**, and we will call it our **MultiLayerPerceptron**.

The thing to note is, once again two of the modules (`CausalSelfAttention` and `MultiLayerPerceptron`) remain undefined, and we are going to define them later...

For now, we end up with a code like this:
```python
# Transformer Block
class Block(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        self.layerNormalization1 = torch.nn.LayerNorm(configuration.numberOfEmbeddingDimensions)
        self.attention = CausalSelfAttention(configuration)
        self.layerNormalization2 = torch.nn.LayerNorm(configuration.numberOfEmbeddingDimensions)
        self.multiLayerPerceptron = MultiLayerPerceptron(configuration)
    
    def forward(self, inputs):
        pass
```

And lastly, we arrive at the **Residual Pathways**. In the original diagram the normalizations sit **inside** the residual stream (each **Add & Norm** normalizes the sum itself), which is not very desirable from an optimization perspective. We prefer a single, clean residual stream all the way from the **supervision** at the top down to the **inputs (or `tokens`)** at the bottom. This is desirable because, during the backward pass, an addition passes the gradient on unchanged to both of its branches: the gradients from the top flow straight to the inputs through the residual pathway, and in addition to that, the gradient also flows through the blocks, which contribute their own changes over time once the optimization kicks in.

Which means that we want a **clean residual pathway**. To get that, the layer normalization is applied only to the input of each sub-layer (attention or MLP), and the sub-layer's output is then added back to the original, un-normalized **inputs**. This keeps the residual connections purely additive, so nothing on the main path rescales the stream, facilitating better gradient flow and optimization stability.
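
As a compact way to see the difference, here is a sketch in our own notation, where $F$ stands for either sub-layer (attention or MLP), $\text{LN}$ for layer normalization and $x_l$ for the residual stream entering sub-layer $l$:

$$ \text{Post-Norm (original Transformer):} \quad x_{l+1} = \text{LN}\left(x_l + F(x_l)\right) $$

$$ \text{Pre-Norm (GPT-2):} \quad x_{l+1} = x_l + F\left(\text{LN}(x_l)\right) $$

In the Pre-Norm form, $x_l$ reaches $x_{l+1}$ through a plain addition, so the derivative of $x_{l+1}$ with respect to $x_l$ always contains an identity term, whereas in the Post-Norm form every path from $x_l$ to $x_{l+1}$ passes through $\text{LN}$.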

Therefore, we end up with a code like this:
```python
# Transformer Block
class Block(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        self.layerNormalization1 = torch.nn.LayerNorm(configuration.numberOfEmbeddingDimensions)
        self.attention = CausalSelfAttention(configuration)
        self.layerNormalization2 = torch.nn.LayerNorm(configuration.numberOfEmbeddingDimensions)
        self.multiLayerPerceptron = MultiLayerPerceptron(configuration)
    
    def forward(self, inputs):
        inputs = inputs + self.attention(self.layerNormalization1(inputs))
        inputs = inputs + self.multiLayerPerceptron(self.layerNormalization2(inputs))
        return inputs
```

Here's how the residual pathways are structured:
- **Attention Layer**: `self.layerNormalization1` is applied only to the copy of `inputs` that goes into `self.attention`, and the attention output is then added back to the un-normalized `inputs`. The normalization therefore sits on the branch, and the residual stream itself stays clean.
- **MLP Layer**: Similarly, `self.layerNormalization2` is applied only to the input of `self.multiLayerPerceptron`, whose output is then added back to the un-normalized `inputs`. This also adheres to the clean residual pathway for the same reason as above.

And one more thing that is interesting to note is that, **Attention** is a **communication operation**, and it is where all the tokens line-up in a sequence and this is where the tokens communicate and exchange information. And **Attention** is an **aggregation function**, it's a **pooling function**, it's a **weighted sum function**, it's a **reduce operation**.

Whereas, **Multi Layer Perceptron** happens at every single token individually (**mapped**), and there is no information being exchanged or collected between the tokens.

So, the **Attention** is the **reduce** and **Multi Layer Perceptron** is the **map**. And what we end up with is a repeated application of **Map-Reduce**. And this is where the `tokens` communicate and this is where they *think* individually about the information that they gathered. And every one of these blocks, iteratively refines the representation inside the residual stream...
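
The map versus reduce distinction can be checked numerically. In this sketch (toy sizes of our own choosing), a single `Linear` layer stands in for the MLP and a causal uniform average stands in for attention. We nudge only the token at position `2` and look at which output positions change:

```python
import torch

torch.manual_seed(0)
T, C = 5, 8                          # hypothetical toy sequence length and channels
tokens = torch.randn(1, T, C)
perturbed = tokens.clone()
perturbed[0, 2] += 1.0               # change only the token at position 2

with torch.no_grad():
    # "Map": a position-wise layer (stand-in for the MLP), applied to each token separately
    positionWise = torch.nn.Linear(C, C)
    mapDifference = (positionWise(tokens) - positionWise(perturbed)).abs().amax(dim=-1)[0]

    # "Reduce": causal uniform averaging (stand-in for attention), mixing each token with its past
    weights = torch.tril(torch.ones(T, T))
    weights = weights / weights.sum(dim=-1, keepdim=True)
    reduceDifference = ((weights @ tokens) - (weights @ perturbed)).abs().amax(dim=-1)[0]

mapChanged = (mapDifference > 1e-6).tolist()
reduceChanged = (reduceDifference > 1e-6).tolist()
print(mapChanged)     # only position 2 changes
print(reduceChanged)  # position 2 and every later position change
```

The position-wise layer keeps the change local to one token, while the causal average spreads it to every position that is allowed to look back at token `2`.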

And now we can move on to the implementation of `CausalSelfAttention` and `MultiLayerPerceptron`...

### `MultiLayerPerceptron`

Let's now move on to the `MultiLayerPerceptron (MLP)`, and I implemented the class as follows...

```python
# Multi Layer Perceptron (MLP)
class MultiLayerPerceptron(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        
    def forward(self, inputs):
        pass
```

It is relatively straightforward. For now, we just have two `Linear` layers which wrap around a `GELU` non-linearity layer. So our block now becomes something like this:
```python
# Multi Layer Perceptron (MLP)
class MultiLayerPerceptron(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        self.currentFullyConnected = torch.nn.Linear(configuration.numberOfEmbeddingDimensions, 4 * configuration.numberOfEmbeddingDimensions)
        self.gelu = torch.nn.GELU(approximate="tanh")
        self.currentProjection = torch.nn.Linear(4 * configuration.numberOfEmbeddingDimensions, configuration.numberOfEmbeddingDimensions)

    def forward(self, inputs):
        inputs = self.currentFullyConnected(inputs)
        inputs = self.gelu(inputs)
        inputs = self.currentProjection(inputs)
        return inputs
```

Now when we swing over to the <a href="https://pytorch.org/docs/stable/generated/torch.nn.GELU.html">GELU PyTorch Documentation</a>, we see **two** different `GELU`(s) being hinted there:
1. Original GELU formulation (We will discuss this in a bit):
   $$ \text{GELU}(x) = x \cdot \Phi(x) $$
2. Approximate GELU formulation:
   $$ \text{GELU}(x) = 0.5 \cdot x \cdot \left(1 + \tanh \left( \sqrt{\frac{2}{\pi}} \cdot \left(x + 0.044715 \cdot x^3 \right) \right) \right) $$

![GELU](ExplanationMedia/Images/GELU.png)

Just as a preview, we can see that `GELU` is basically like a `ReLU`, except that there's **no exactly flat tail at `0`**. Otherwise, it just looks like a slightly *smoother* `ReLU`. It comes from the paper <a href="https://arxiv.org/abs/1606.08415">"Gaussian Error Linear Units (GELUs)"</a>, there's a little bit of history here, and I also invite you to step through the paper if you'd like. But for now, we will use the **approximate** (`tanh`) version of the `GELU`, because that's what `GPT-2` used in their model...

Now, one other reason why we prefer to use `GELU` is that, in previous notebooks, we have spoken about the **Dead-ReLU-Neuron-Problem**: in the tail of a `ReLU`, where it's exactly flat at `0`, any activations that fall there will get exactly `0` gradient (meaning that there's no change, there's no adaptation, there's no development of the network). `GELU`, on the other hand, contributes a small non-zero **local-gradient** almost everywhere, so there's still going to be a change and an adaptation in a *smoothed-out* way, which empirically works better, as demonstrated in the paper.

And we also followed the *"Position-wise Feed-Forward Networks"* section of the original "Attention Is All You Need" paper, where the inner layer is `4` times wider than the model dimension (`2048` versus `512`), which is why we have the `4 * numberOfEmbeddingDimensions` in the shapes...
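
To get a feel for how close the two formulations are, and for the gradient argument, here is a small dependency-free sketch. It uses the exact form $x \cdot \Phi(x)$ written with `math.erf`, and the `tanh` form with the $\sqrt{2/\pi}$ scaling as given in the PyTorch documentation. The helper names and the finite-difference `slope` function are our own:

```python
import math

def geluExact(x):
    # x * Phi(x), where Phi is the standard normal CDF written via erf
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def geluTanh(x):
    # tanh approximation of the same function
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def relu(x):
    return max(0.0, x)

def slope(function, x, epsilon=1e-5):
    # central finite difference as a stand-in for the local gradient
    return (function(x + epsilon) - function(x - epsilon)) / (2 * epsilon)

# Largest gap between the two GELU versions on a grid from -5 to 5
grid = [i / 100 for i in range(-500, 501)]
largestGap = max(abs(geluExact(x) - geluTanh(x)) for x in grid)
print(largestGap)

# At x = -1 the ReLU is flat, while the GELU still has a non-zero slope
print(slope(relu, -1.0), slope(geluTanh, -1.0))
```

On this grid the gap between the two versions stays well below `0.01`, and at `x = -1` the `ReLU` slope is exactly `0` while the `GELU` slope is small but non-zero.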

And finally we can now move on to implement the `CausalSelfAttention` part of the code...

### `CausalSelfAttention`

We can now start implementing our `CausalSelfAttention` block which is none other than the combination of **Scaled Dot-Product Attention** and **Multi-Head Attention**...

For now we have a skeleton like this:
```python
# Causal Self Attention (Scaled Dot-Product Attention + Multi-Head Attention)
class CausalSelfAttention(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
    
    def forward(self, inputs):
        pass
```

Remember from our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20from%20Scratch.ipynb">GPT from Scratch notebook</a>, that **Multi-Head Attention** is just multiple **Scaled Dot-Product Attention**(s) running in parallel, and their outputs are just being concatenated and that becomes the output.

Here, instead of keeping separate modules, we do a bunch of tensor *gymnastics* to express the same logic behind both the **Multi-Head Attention** & **Scaled Dot-Product Attention** modules in a single module, treating the heads as an extra batch dimension. But fundamentally, and algorithmically, nothing is different from what we implemented previously...

And this is what we end up with:
```python
# Causal Self Attention (Scaled Dot-Product Attention + Multi-Head Attention)
class CausalSelfAttention(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        assert configuration.numberOfEmbeddingDimensions % configuration.numberOfHeads == 0
        self.causalAttention = torch.nn.Linear(configuration.numberOfEmbeddingDimensions, 3 * configuration.numberOfEmbeddingDimensions)
        self.causalProjection = torch.nn.Linear(configuration.numberOfEmbeddingDimensions, configuration.numberOfEmbeddingDimensions)
        self.numberOfHeads = configuration.numberOfHeads
        self.numberOfEmbeddingDimensions = configuration.numberOfEmbeddingDimensions
        self.register_buffer("bias", torch.tril(torch.ones(configuration.blockSize, configuration.blockSize)).view(1, 1, configuration.blockSize, configuration.blockSize))
    
    def forward(self, inputs):
        B, T, C = inputs.size()
        query_key_value = self.causalAttention(inputs)
        query, key, value = query_key_value.split(self.numberOfEmbeddingDimensions, dim=2)
        query = query.view(B, T, self.numberOfHeads, C // self.numberOfHeads).transpose(1, 2)
        key = key.view(B, T, self.numberOfHeads, C // self.numberOfHeads).transpose(1, 2)
        value = value.view(B, T, self.numberOfHeads, C // self.numberOfHeads).transpose(1, 2)
        attention = (query @ key.transpose(-2, -1)) * (1.0 / math.sqrt(key.size(-1)))
        attention = attention.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        attention = F.softmax(attention, dim=-1)
        outputs = attention @ value
        outputs = outputs.transpose(1, 2).contiguous().view(B, T, C)
        outputs = self.causalProjection(outputs)
        return outputs
```
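
To see what the causal mask does to the attention weights, here is a minimal single-head sketch in plain Python with lists instead of tensors. It is an illustration of the masking and softmax steps only, not a replacement for the module above; the function name `causal_attention_weights` and the toy vectors are assumptions made for this example:

```python
import math

def causal_attention_weights(queries, keys):
    # queries, keys: lists of T vectors, each of length d (a single head)
    T, d = len(queries), len(queries[0])
    weights = []
    for i in range(T):
        # Scaled dot-product scores; future positions (j > i) are masked to -inf
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d) if j <= i else float("-inf")
            for j in range(T)
        ]
        # Softmax over the row; exp(-inf) evaluates to 0.0
        maximum = max(scores)
        exponentials = [math.exp(score - maximum) for score in scores]
        total = sum(exponentials)
        weights.append([value / total for value in exponentials])
    return weights

queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
for row in causal_attention_weights(queries, keys):
    print([round(weight, 3) for weight in row])
```

The first token can only attend to itself, so its row is `[1.0, 0.0, 0.0]`, and every entry above the diagonal is zero, which is the same effect the `masked_fill` with `-inf` followed by `softmax` has in the module.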

## GPT-2 Weights Transfer

Now that we have the skeleton ready, we can now move on to transfer the weights of the `Hugging Face GPT-2` to our `Custom GPT-2`...

### `from_pretrained()` - Skeleton

Let's start by understanding what we want to do first...

We want to have a `from_pretrained()` method in our `GPTModel` class, that will transfer the weights for any kind of model we pass it (among the `4` models that are there in `GPT-2`), and copy the weights of each of those parameters and ensure their sizes and shapes match perfectly...

And we also want our `from_pretrained()` method to be decorated by a `@classmethod` such that it could be accessed directly using the class reference and it is also able to modify the state of the class and return the appropriate model along with their appropriate parameters...

So for now we can have a skeleton like this:
```python
# GPT model architecture
class GPTModel(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        ... # Rest of the code

    # Method to transfer weights from Hugging Face GPT-2
    @classmethod
    def from_pretrained(cls, modelType):
        assert modelType in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        print("Loading weights from pretrained gpt: %s" % modelType)
```
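
If the `@classmethod` pattern is unfamiliar, the following toy sketch shows the idea of an alternative constructor in isolation. The class `TinyModel`, the method `from_name` and the `presets` table are hypothetical stand-ins, not part of our model:

```python
class TinyModel:
    def __init__(self, numberOfLayers):
        self.numberOfLayers = numberOfLayers

    @classmethod
    def from_name(cls, name):
        # Hypothetical lookup table, standing in for the GPT-2 configurations
        presets = {"small": 2, "large": 8}
        # cls is the class itself, so no instance is needed to call this method
        return cls(presets[name])

# Called directly on the class, and returns a fully constructed instance
model = TinyModel.from_name("large")
print(type(model).__name__, model.numberOfLayers)
```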

### `from_pretrained()` - Configuration

We can now create the separate configuration arguments for all `4` `GPT-2` configurations...

And our code will look like this:
```python
# GPT model architecture
class GPTModel(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        ... # Rest of the code

    # Method to transfer weights from Hugging Face GPT-2
    @classmethod
    def from_pretrained(cls, modelType):
        assert modelType in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        print("Loading weights from pretrained gpt: %s" % modelType)

        # Creating separate configurations for separate GPT-2 models
        configurationArguements = {
            'gpt2':         dict(numberOfLayers=12, numberOfHeads=12, numberOfEmbeddingDimensions=768),  # 124M parameters
            'gpt2-medium':  dict(numberOfLayers=24, numberOfHeads=16, numberOfEmbeddingDimensions=1024), # 350M parameters
            'gpt2-large':   dict(numberOfLayers=36, numberOfHeads=20, numberOfEmbeddingDimensions=1280), # 774M parameters
            'gpt2-xl':      dict(numberOfLayers=48, numberOfHeads=25, numberOfEmbeddingDimensions=1600), # 1558M parameters
        }[modelType]
        configurationArguements['vocabularySize'] = 50257
        configurationArguements['blockSize'] = 1024
```
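
The parameter counts in the comments can be checked with a little arithmetic. The sketch below is a back-of-the-envelope count, assuming every linear layer and layer normalization has a bias and that the language modeling head reuses the token embedding matrix (so it adds no parameters of its own). Our `GPTModel` as written so far has a separate head, so counting its actual parameters would give a larger number. The function name `count_parameters` is made up for this illustration:

```python
def count_parameters(numberOfLayers, numberOfEmbeddingDimensions, vocabularySize=50257, blockSize=1024):
    d = numberOfEmbeddingDimensions
    embeddings = vocabularySize * d + blockSize * d       # token + positional embeddings
    attention = (d * 3 * d + 3 * d) + (d * d + d)         # fused query/key/value projection + output projection
    perceptron = (d * 4 * d + 4 * d) + (4 * d * d + d)    # expand to 4d, then project back to d
    layerNorms = 2 * (2 * d)                              # two LayerNorms per block, weight + bias each
    perLayer = attention + perceptron + layerNorms
    return embeddings + numberOfLayers * perLayer + 2 * d  # + final LayerNorm

sizes = {
    'gpt2':        (12, 768),
    'gpt2-medium': (24, 1024),
    'gpt2-large':  (36, 1280),
    'gpt2-xl':     (48, 1600),
}
for name, (layers, dimensions) in sizes.items():
    print(f"{name:12s} {count_parameters(layers, dimensions):,}")
```

Under these assumptions the smallest model comes out at `124,439,808` parameters and the largest at `1,557,611,200`, which matches the `124M` and `1558M` labels.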

Now we can unpack `configurationArguements`, selected by the `modelType` argument we pass in, into a `GPTConfiguration` stored in a variable called `configuration`, and initialize our `GPTModel` from it. Then we can store the model's state-dictionary, which contains all of its layers, in a variable called `stateDictionary` using the `state_dict()` method, and collect its keys with `keys()` into a variable called `stateDictionaryKeys` (keeping in mind that we discard the buffers that are not parameters, like the **Attention Mask** and **Attention Bias**)...

So, now we have a code like this:
```python
# GPT model architecture
class GPTModel(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        ... # Rest of the code

    # Method to transfer weights from Hugging Face GPT-2
    @classmethod
    def from_pretrained(cls, modelType):
        assert modelType in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        print("Loading weights from pretrained gpt: %s" % modelType)

        # Creating separate configurations for separate GPT-2 models
        configurationArguements = {
            'gpt2':         dict(numberOfLayers=12, numberOfHeads=12, numberOfEmbeddingDimensions=768),  # 124M parameters
            'gpt2-medium':  dict(numberOfLayers=24, numberOfHeads=16, numberOfEmbeddingDimensions=1024), # 350M parameters
            'gpt2-large':   dict(numberOfLayers=36, numberOfHeads=20, numberOfEmbeddingDimensions=1280), # 774M parameters
            'gpt2-xl':      dict(numberOfLayers=48, numberOfHeads=25, numberOfEmbeddingDimensions=1600), # 1558M parameters
        }[modelType]
        configurationArguements['vocabularySize'] = 50257
        configurationArguements['blockSize'] = 1024

        configuration = GPTConfiguration(**configurationArguements)
        model = GPTModel(configuration)

        stateDictionary = model.state_dict()
        stateDictionaryKeys = stateDictionary.keys()
        stateDictionaryKeys = [key for key in stateDictionaryKeys if not key.endswith('.attention.bias')]

        return model
```

### `from_pretrained()` - Hugging Face Initialization

Now that our own custom model is initialized, we can load the `Hugging Face GPT-2 Model` into a variable, and its state dictionary and keys into variables called `huggingfaceStateDictionary` and `huggingfaceStateDictionaryKeys`...

Then we can start copying the weights after **ignoring the buffers**. Before copying, we have to keep in mind that the original `GPT-2` code was written with the `TensorFlow` library, and some of its weights are stored **transposed** relative to what PyTorch's `Linear` layers expect, so we hard-code the names of those weights and **copy them after transposing them into the PyTorch layout**...
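
To see why the transpose is needed, here is a small sketch in plain Python. It assumes the Hugging Face `Conv1D`-style layer stores its weight as `(in, out)` and computes `inputs @ weight`, while a PyTorch `Linear` stores `(out, in)` and computes `inputs @ weight^T`; the helpers `matmul` and `transpose` are written out by hand for illustration:

```python
def matmul(a, b):
    # Plain matrix product of two lists of rows
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)] for row in a]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

inputs = [[1.0, 2.0]]                # shape (1, 2): one token with two input features
conv1dWeight = [[1.0, 0.0, 2.0],     # shape (in=2, out=3), the Conv1D-style layout
                [0.0, 1.0, 3.0]]

# Conv1D-style layer: outputs = inputs @ weight
conv1dOutputs = matmul(inputs, conv1dWeight)

# Linear-style layer: the stored weight has shape (out=3, in=2) and is transposed at use time
linearWeight = transpose(conv1dWeight)
linearOutputs = matmul(inputs, transpose(linearWeight))

print(conv1dOutputs, linearOutputs)
```

Both layouts give the same outputs once the stored weight has been transposed. Note that for a square weight the transposed matrix has the same shape as the original, so comparing shapes alone cannot tell whether a transpose is still needed.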

One last thing to be careful about: our model uses **custom names for its variables**, while the `Hugging Face GPT-2 Model` uses **short forms**, so it is convenient to have a `parameterKeyMapping` that **maps our custom keys to the Hugging Face GPT-2 keys of the state-dictionary**, which makes iterating much easier...
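
The mapping idea can be tried on a handful of key names without loading any model. The lists below are hand-written and abbreviated (only a few keys, in the order the state dictionaries are assumed to produce them), so treat this as a sketch of the pairing logic rather than the full mapping:

```python
# Abbreviated, hand-written key lists for illustration
customKeys = [
    "transformer.wordTokenEmbeddings.weight",
    "transformer.wordPositionalEmbeddings.weight",
    "transformer.hidden.0.layerNormalization1.weight",
    "transformer.hidden.0.attention.bias",                    # causal mask buffer, not a learned parameter
    "transformer.hidden.0.attention.causalAttention.weight",
]
huggingfaceKeys = [
    "transformer.wte.weight",
    "transformer.wpe.weight",
    "transformer.h.0.ln_1.weight",
    "transformer.h.0.attn.c_attn.weight",
]

# Drop the buffer, then pair the remaining keys position by position
customKeys = [key for key in customKeys if not key.endswith(".attention.bias")]
assert len(customKeys) == len(huggingfaceKeys)
parameterKeyMapping = dict(zip(customKeys, huggingfaceKeys))

for customKey, huggingfaceKey in parameterKeyMapping.items():
    print(f"{customKey:55s} <- {huggingfaceKey}")
```

Pairing by position only works when both models register their layers in the same order, which is why the length check comes before the `zip`.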

So, now we have a final code like this:
```python
# GPT model architecture
class GPTModel(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        ... # Rest of the code

    # Method to transfer weights from Hugging Face GPT-2
    @classmethod
    def from_pretrained(cls, modelType):
        assert modelType in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        print("Loading weights from pretrained gpt: %s" % modelType)
        
        # Creating separate configurations for separate GPT-2 models
        blockSize = 1024
        vocabularySize = 50257
        configurationArguements = {
            'gpt2':         dict(numberOfLayers=12, numberOfHeads=12, numberOfEmbeddingDimensions=768),  # 124M parameters
            'gpt2-medium':  dict(numberOfLayers=24, numberOfHeads=16, numberOfEmbeddingDimensions=1024), # 350M parameters
            'gpt2-large':   dict(numberOfLayers=36, numberOfHeads=20, numberOfEmbeddingDimensions=1280), # 774M parameters
            'gpt2-xl':      dict(numberOfLayers=48, numberOfHeads=25, numberOfEmbeddingDimensions=1600), # 1558M parameters
        }[modelType]
        configurationArguements['vocabularySize'] = 50257
        configurationArguements['blockSize'] = 1024

        configuration = GPTConfiguration(**configurationArguements)
        model = GPTModel(configuration)
        stateDictionary = model.state_dict()
        stateDictionaryKeys = stateDictionary.keys()
        stateDictionaryKeys = [key for key in stateDictionaryKeys if not key.endswith('.attention.bias')]

        huggingfaceModel = GPT2LMHeadModel.from_pretrained(modelType)
        huggingfaceStateDictionary = huggingfaceModel.state_dict()
        huggingfaceStateDictionaryKeys = huggingfaceStateDictionary.keys()
        huggingfaceStateDictionaryKeys = [key for key in huggingfaceStateDictionaryKeys if not key.endswith('.attn.masked_bias')]
        huggingfaceStateDictionaryKeys = [key for key in huggingfaceStateDictionaryKeys if not key.endswith('.attn.bias')]
        transposedParameters = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']

        assert len(huggingfaceStateDictionaryKeys) == len(stateDictionaryKeys), f"Mismatched Keys: {len(huggingfaceStateDictionaryKeys)} != {len(stateDictionaryKeys)}"

        parameterKeyMapping = {
            customKey: huggingfaceKey
            for customKey, huggingfaceKey in zip(stateDictionaryKeys, huggingfaceStateDictionaryKeys)
            }

        for customKey, huggingfaceKey in parameterKeyMapping.items():
            # Special treatment for the Conv1D weights (Transposed Weights)
            # Matched by name, because a square weight such as attn.c_proj.weight keeps its shape when transposed
            if any(huggingfaceKey.endswith(word) for word in transposedParameters):
                assert huggingfaceStateDictionary[huggingfaceKey].shape[::-1] == stateDictionary[customKey].shape
                with torch.no_grad():
                    stateDictionary[customKey].copy_(huggingfaceStateDictionary[huggingfaceKey].t())
            # Vanilla copy for other parameters
            else:
                assert huggingfaceStateDictionary[huggingfaceKey].shape == stateDictionary[customKey].shape
                with torch.no_grad():
                    stateDictionary[customKey].copy_(huggingfaceStateDictionary[huggingfaceKey])
        
        return model
```

As soon as we do this, you will see that it will start downloading the model from Hugging Face like this:
```bash
Loading weights from pretrained gpt: gpt2
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 665/665 [00:00<?, ?B/s]
model.safetensors:  34%|████████████████████████████████████████████████████▋                                                                                                    | 189M/548M [00:18<00:35, 10.1MB/s]
```

Once it's done we can now move on to the generation phase of the model...

## GPT-2 Forward Function

Now, before we can generate from this model that we have implemented, we need a forward function that forwards the token sequence through the model to get the logits...

The inputs are the token indices, a tensor of shape `(Batch, Time)`, where the `Batch` dimension indexes the **independent sequences** and the `Time` dimension is the **sequence length** (at most `blockSize`). In other words, the inputs form a matrix in which each row is an independent sequence of token indices.

So, we first unpack the `Batch` and `Time` dimensions into variables. Then we create a separate tensor called `tokenPositions`, which simply holds the positions `0` to `Time - 1` of the tokens in a sequence...

Then we forward `tokenPositions` through the `positional embeddings` and the token indices through the `token embeddings`, and add the two outputs together into a variable called `inputs` (the positional embeddings are broadcast across the `Batch` dimension).
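
The addition works because the positional embeddings have shape `(Time, Channels)` while the token embeddings have shape `(Batch, Time, Channels)`, so the same positional matrix is reused for every sequence. Here is a plain-Python sketch of that broadcasting with toy numbers (the values are arbitrary and chosen only for illustration):

```python
Batch, Time, Channels = 2, 3, 2

# Token embeddings: one (Time, Channels) matrix per sequence in the batch
tokenEmbeddings = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]],
]
# Positional embeddings: a single (Time, Channels) matrix shared by every sequence
positionalEmbeddings = [[0.25, 0.5], [0.75, 1.0], [1.25, 1.5]]

# Element-wise addition, reusing the positional row for position t in every batch entry
inputs = [
    [
        [tokenEmbeddings[b][t][c] + positionalEmbeddings[t][c] for c in range(Channels)]
        for t in range(Time)
    ]
    for b in range(Batch)
]
print(len(inputs), len(inputs[0]), len(inputs[0][0]))
print(inputs[0][0], inputs[1][2])
```

The result keeps the `(Batch, Time, Channels)` shape of the token embeddings.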

And lastly, we will loop through every block of the `transformer` and forward the `inputs` through them, and finally forward them through the final `layer normalization` and `language modeling head` to get the logits...

So now we have a forward loop inside the `GPTModel` class like this:
```python
def forward(self, indeces):
    Batch, Time = indeces.size()
    assert Time <= self.configuration.blockSize, f"Cannot forward sequence of length {Time}, Block Size is only {self.configuration.blockSize}"

    tokenPositions = torch.arange(0, Time, dtype=torch.long)
    positionalEmbeddings = self.transformer.wordPositionalEmbeddings(tokenPositions)
    tokenEmbeddings = self.transformer.wordTokenEmbeddings(indeces)
    inputs = tokenEmbeddings + positionalEmbeddings

    for block in self.transformer.hidden:
        inputs = block(inputs)
    inputs = self.transformer.finalLayerNorm(inputs)
    logits = self.languageModelingHead(inputs)
    return logits
```

We also want to use this model on the `GPU`, so we create `tokenPositions` on the same device as the inputs by specifying the device like this:
`tokenPositions = torch.arange(0, Time, dtype=torch.long, device=indeces.device)`.

And then we can move on to the generation phase of the model...

## GPT-2 Generation

Before we generate from the model, we need to think through what we will generate from the model and what will go through the inputs...

First, we put the model into `eval()` mode. Even though our model currently has no layers that behave differently at training time and inference time, it is good practice to set the mode explicitly in case we add such layers later.

We will forward `encoded tokens` into the model and get `encoded tokens` as output, which we need to decode to see the generation. We define two hyper-parameters: the `maximum generation length` (the total length, in tokens, that each independent sequence grows to, prompt included) and the `number of sequences to generate` (how many sequences we generate in parallel in a single run).

You might be wondering "where are these `encoded tokens` suddenly coming from?".

To answer this, we can go back to the <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20Tokenizer.ipynb">GPT Tokenizer</a> notebook and refer to it to understand that we will use a library to encode and decode tokens called `tiktoken`...

And we will encode the same sequence:
> "Hello, I'm a language model,"

And if we go to <a href="https://tiktokenizer.vercel.app/?model=gpt2">tiktokenizer</a> and see the token sequence size, we will see that there are `8` tokens in this sequence...

So, we can now create a tensor called `tokens` which will be of shape `(numberOfSequences, numberOfTokensInSequence)` (Let's say we want to generate `5` sequences, then the shape of this `tokens` tensor will be of shape `(5, 8)`).

Then we will create a loop that runs until the sequences reach `maximumGenerationLength`, forwarding the `encoded tokens` each time to get the `logits`. The `logits` have shape `(Batch, Time, Channel)`, where `Batch` represents the **independent sequences**, `Time` represents the **tokens in a sequence** and `Channel` represents the **vocabulary that `logits` will classify into**, so we select only the **last** position along the `Time` dimension and pass it through a `softmax()` to get the probabilities...

Now, the Hugging Face pipeline uses **top-k sampling with k = 50 by default**, so we will also implement this in our loop by passing our probabilities to PyTorch's `topk()` method and specifying the correct dimension. This `topk()` method returns the `50` largest probabilities (`topKProbabilites`) and the indices of these probabilities (`tokKIndeces`). **This ensures that we never sample very rare tokens.**

Then we can sample from the `multinomial distribution` of these `topKProbabilites` to get `tokenIndeces` (one sampled position within the top-k list for each sequence in the batch). Then we use <a href="https://pytorch.org/docs/stable/generated/torch.gather.html">`torch.gather()`</a> to look up the corresponding vocabulary indices in `tokKIndeces`, which gives us a column of token indices that we append to the original `tokens` to extend our generated sequences...
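
The three steps (top-k, sample, gather) can be sketched for a single sequence in plain Python. This is an illustration under simplifying assumptions, not the PyTorch implementation: it uses `random.Random.choices` in place of `multinomial`, a toy vocabulary of `6` tokens, and a made-up function name `sample_top_k`:

```python
import math
import random

def sample_top_k(logits, k, generator):
    # Softmax over the vocabulary (stabilised by subtracting the maximum logit)
    maximum = max(logits)
    exponentials = [math.exp(logit - maximum) for logit in logits]
    total = sum(exponentials)
    probabilities = [value / total for value in exponentials]

    # Like topk(): keep the k largest probabilities and remember their vocabulary indices
    topKIndices = sorted(range(len(probabilities)), key=lambda index: probabilities[index], reverse=True)[:k]
    topKProbabilities = [probabilities[index] for index in topKIndices]

    # Like multinomial(): draw a position inside the top-k list, weighted by probability
    position = generator.choices(range(k), weights=topKProbabilities, k=1)[0]

    # Like gather(): map that position back to the vocabulary index
    return topKIndices[position]

generator = random.Random(42)
logits = [2.0, -1.0, 0.5, 3.0, -4.0, 1.0]   # toy vocabulary of 6 tokens
samples = [sample_top_k(logits, k=3, generator=generator) for _ in range(1000)]
print(sorted(set(samples)))
```

Only the three most likely tokens (indices `0`, `3` and `5` here) can ever be drawn, however many samples we take. The sampled position is an index into the top-k list, which is why the lookup step is needed to recover the actual token.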

We will also wrap the entire body of the loop in `torch.no_grad()` to let PyTorch know that we won't need any backward processing (gradient calculation and intermediate operation caching), which saves us some memory and hopefully some time...

And finally, once we have our `encoded tokens in a sequence`, which will be of shape `(numberOfSequences, maximumGenerationLength)`, we can decode them using a loop over `numberOfSequences`...

So, we end up with a final code like this:
```python
from dataclasses import dataclass
import torch
import torch.nn.functional as F
import math
import tiktoken

# Causal Self Attention (Scaled Dot-Product Attention + Multi-Head Attention)
class CausalSelfAttention(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        assert configuration.numberOfEmbeddingDimensions % configuration.numberOfHeads == 0
        self.causalAttention = torch.nn.Linear(configuration.numberOfEmbeddingDimensions, 3 * configuration.numberOfEmbeddingDimensions)
        self.causalProjection = torch.nn.Linear(configuration.numberOfEmbeddingDimensions, configuration.numberOfEmbeddingDimensions)
        self.numberOfHeads = configuration.numberOfHeads
        self.numberOfEmbeddingDimensions = configuration.numberOfEmbeddingDimensions
        self.register_buffer("bias", torch.tril(torch.ones(configuration.blockSize, configuration.blockSize)).view(1, 1, configuration.blockSize, configuration.blockSize))
    
    def forward(self, inputs):
        B, T, C = inputs.size()
        query_key_value = self.causalAttention(inputs)
        query, key, value = query_key_value.split(self.numberOfEmbeddingDimensions, dim=2)
        query = query.view(B, T, self.numberOfHeads, C // self.numberOfHeads).transpose(1, 2)
        key = key.view(B, T, self.numberOfHeads, C // self.numberOfHeads).transpose(1, 2)
        value = value.view(B, T, self.numberOfHeads, C // self.numberOfHeads).transpose(1, 2)
        attention = (query @ key.transpose(-2, -1)) * (1.0 / math.sqrt(key.size(-1)))
        attention = attention.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        attention = F.softmax(attention, dim=-1)
        outputs = attention @ value
        outputs = outputs.transpose(1, 2).contiguous().view(B, T, C)
        outputs = self.causalProjection(outputs)
        return outputs

# Multi Layer Perceptron (MLP)
class MultiLayerPerceptron(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        self.currentFullyConnected = torch.nn.Linear(configuration.numberOfEmbeddingDimensions, 4 * configuration.numberOfEmbeddingDimensions)
        self.gelu = torch.nn.GELU(approximate="tanh")
        self.currentProjection = torch.nn.Linear(4 * configuration.numberOfEmbeddingDimensions, configuration.numberOfEmbeddingDimensions)

    def forward(self, inputs):
        inputs = self.currentFullyConnected(inputs)
        inputs = self.gelu(inputs)
        inputs = self.currentProjection(inputs)
        return inputs


# Transformer Block
class Block(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        self.layerNormalization1 = torch.nn.LayerNorm(configuration.numberOfEmbeddingDimensions)
        self.attention = CausalSelfAttention(configuration)
        self.layerNormalization2 = torch.nn.LayerNorm(configuration.numberOfEmbeddingDimensions)
        self.multiLayerPerceptron = MultiLayerPerceptron(configuration)
    
    def forward(self, inputs):
        inputs = inputs + self.attention(self.layerNormalization1(inputs))
        inputs = inputs + self.multiLayerPerceptron(self.layerNormalization2(inputs))
        return inputs

# GPT configuration hyper-parameters
@dataclass
class GPTConfiguration:
    blockSize: int = 1024
    vocabularySize: int = 50257
    numberOfLayers: int = 12
    numberOfHeads: int = 12
    numberOfEmbeddingDimensions: int = 768

# GPT model architecture
class GPTModel(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        self.configuration = configuration

        self.transformer = torch.nn.ModuleDict(dict(
            wordTokenEmbeddings = torch.nn.Embedding(configuration.vocabularySize, configuration.numberOfEmbeddingDimensions),
            wordPositionalEmbeddings = torch.nn.Embedding(configuration.blockSize, configuration.numberOfEmbeddingDimensions),
            hidden = torch.nn.ModuleList(Block(configuration) for _ in range(configuration.numberOfLayers)),
            finalLayerNorm = torch.nn.LayerNorm(configuration.numberOfEmbeddingDimensions)
        ))

        self.languageModelingHead = torch.nn.Linear(configuration.numberOfEmbeddingDimensions, configuration.vocabularySize, bias=False)
    
    def forward(self, indeces):
        Batch, Time = indeces.size()
        assert Time <= self.configuration.blockSize, f"Cannot forward sequence of length {Time}, Block Size is only {self.configuration.blockSize}"

        tokenPositions = torch.arange(0, Time, dtype=torch.long, device=indeces.device)
        positionalEmbeddings = self.transformer.wordPositionalEmbeddings(tokenPositions)
        tokenEmbeddings = self.transformer.wordTokenEmbeddings(indeces)
        inputs = tokenEmbeddings + positionalEmbeddings

        for block in self.transformer.hidden:
            inputs = block(inputs)
        inputs = self.transformer.finalLayerNorm(inputs)
        logits = self.languageModelingHead(inputs)
        return logits

    # Method to transfer weights from Hugging Face GPT-2
    @classmethod
    def from_pretrained(cls, modelType):
        assert modelType in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        print("Loading weights from pretrained GPT: %s" % modelType)
        
        # Creating separate configurations for separate GPT-2 models
        blockSize = 1024
        vocabularySize = 50257
        configurationArguements = {
            'gpt2':         dict(numberOfLayers=12, numberOfHeads=12, numberOfEmbeddingDimensions=768),  # 124M parameters
            'gpt2-medium':  dict(numberOfLayers=24, numberOfHeads=16, numberOfEmbeddingDimensions=1024), # 350M parameters
            'gpt2-large':   dict(numberOfLayers=36, numberOfHeads=20, numberOfEmbeddingDimensions=1280), # 774M parameters
            'gpt2-xl':      dict(numberOfLayers=48, numberOfHeads=25, numberOfEmbeddingDimensions=1600), # 1558M parameters
        }[modelType]
        configurationArguements['vocabularySize'] = 50257
        configurationArguements['blockSize'] = 1024

        configuration = GPTConfiguration(**configurationArguements)
        model = GPTModel(configuration)
        stateDictionary = model.state_dict()
        stateDictionaryKeys = stateDictionary.keys()
        stateDictionaryKeys = [key for key in stateDictionaryKeys if not key.endswith('.attention.bias')]

        huggingfaceModel = GPT2LMHeadModel.from_pretrained(modelType)
        huggingfaceStateDictionary = huggingfaceModel.state_dict()
        huggingfaceStateDictionaryKeys = huggingfaceStateDictionary.keys()
        huggingfaceStateDictionaryKeys = [key for key in huggingfaceStateDictionaryKeys if not key.endswith('.attn.masked_bias')]
        huggingfaceStateDictionaryKeys = [key for key in huggingfaceStateDictionaryKeys if not key.endswith('.attn.bias')]
        transposedParameters = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']

        assert len(huggingfaceStateDictionaryKeys) == len(stateDictionaryKeys), f"Mismatched Keys: {len(huggingfaceStateDictionaryKeys)} != {len(stateDictionaryKeys)}"

        parameterKeyMapping = {
            customKey: huggingfaceKey
            for customKey, huggingfaceKey in zip(stateDictionaryKeys, huggingfaceStateDictionaryKeys)
            }

        for customKey, huggingfaceKey in parameterKeyMapping.items():
            # Special treatment for the Conv1D weights (Transposed Weights)
            # Matched by name, because a square weight such as attn.c_proj.weight keeps its shape when transposed
            if any(huggingfaceKey.endswith(word) for word in transposedParameters):
                assert huggingfaceStateDictionary[huggingfaceKey].shape[::-1] == stateDictionary[customKey].shape
                with torch.no_grad():
                    stateDictionary[customKey].copy_(huggingfaceStateDictionary[huggingfaceKey].t())
            # Vanilla copy for other parameters
            else:
                assert huggingfaceStateDictionary[huggingfaceKey].shape == stateDictionary[customKey].shape
                with torch.no_grad():
                    stateDictionary[customKey].copy_(huggingfaceStateDictionary[huggingfaceKey])
        return model

model = GPTModel.from_pretrained('gpt2')

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model.eval()
model.to(device=device)

# Generation
maximumGenerationLength = 30
numberOfSequences = 5

encoder = tiktoken.get_encoding('gpt2')
encodedTokens = encoder.encode("Hello, I'm a language model,")
encodedTokens = torch.tensor(encodedTokens, dtype=torch.long)
encodedTokens = encodedTokens.unsqueeze(0).repeat(numberOfSequences, 1)
inputs = encodedTokens.to(device=device)

torch.manual_seed(42)
torch.cuda.manual_seed(42)

while inputs.size(1) < maximumGenerationLength:
    with torch.no_grad():
        logits = model(inputs)
        logits = logits[:, -1, :]
        probabilites = F.softmax(logits, dim=-1)

        topKProbabilites, tokKIndeces = torch.topk(input=probabilites, k=50, dim=-1)

        tokenIndeces = torch.multinomial(input=topKProbabilites, num_samples=1)
        columnOfTokenIndeces = torch.gather(input=tokKIndeces, dim=-1, index=tokenIndeces)

        inputs = torch.cat((inputs, columnOfTokenIndeces), dim=1)

for i in range(numberOfSequences):
    tokensToDecode = inputs[i, :maximumGenerationLength].tolist()
    decodedTokens = encoder.decode(tokensToDecode)
    print(">", decodedTokens)
```

And we get generations like the following (the exact text depends on the random seed, the device, and the library versions):
```plaintext
Loading weights from pretrained GPT: gpt2
> Hello, I'm a language model, as's the - a. and for they.. were also. is of -
 to/ ' can
> Hello, I'm a language model, I (, the
 ( have " to ( " of are
., ' the'sa. that
> Hello, I'm a language model, - on the<|endoftext|> was will also's't: of or<|endoftext|>.. are is to he, is the
> Hello, I'm a language model, or "
 a will. will the the.. The and in.,- ofThe's- for
> Hello, I'm a language model,.. willThe's. and was and, was I would of a's his's's's. or
```

Now comes the interesting part... We want to initialize everything from scratch... We don't want to use any of these weights; we want to start from random numbers, train the model ourselves and generate from it...

So let's now move on to the next part of this notebook...

# Moving Model to GPU

Now, in case you do not have a `GPU` available, you can still follow along with the notebook to some extent, but probably not to the very end, because we will be using multiple `GPU`(s) to perform a serious training run. For now, though, you can follow along...

And the one thing that I'd like to do is to *auto-detect* the **device** that is available to you and run the code on the most capable one...

And you can do that with a code like this:
```python
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
print(f"Using Device: {device}")
```

We see that the default device is the `CPU`, which is available everywhere. We then check for a `GPU` through `CUDA`, and if `CUDA` is not available we check whether the machine at least has `MPS`, the backend for `Apple Silicon` (newer MacBook models)...

And once we have this `device` we can potentially use this in the model:
```python
model.to(device=device)
```

Now, you'll remember that we created `tokenPositions` with `device=indeces.device` in the `GPTModel`; we did this so that the tensor is created on the same device as the inputs, which **prevents a device mismatch**...

Later, I do want to loop back to what it means to have different devices in PyTorch, and what exactly PyTorch does in the background when we call something like `model.to(device=device)`. But for now we'd like to get to training this model, so let's just say the `device` makes the code go fast...

# Training Loop

## Importing Dataset and Encoding Data

It turns out that our weights are already initialized at random, because PyTorch initializes our layers randomly by default...

Right now for the Hugging Face model initialization we are using the code:
```python
model = GPTModel.from_pretrained('gpt2')
```
But if we want to use our default initialization we can use our older code:
```python
model = GPTModel(GPTConfiguration())
```

And for now if we try to run our code it *blabbers* garbage like this:
```plaintext
Using Device: cpu
> Hello, I'm a language model,FCConnect Sandwich 64 Seed SHARparam Bloodyivil Sketch arrang Deaths backdoor Steeledoorccording bathingParentiven revers cafeteria trustees
> Hello, I'm a language model, Brenblescler Awakens55 collar foe COUNboarding70710erry Evidence promotionMeet Icononticient Copobyl survivor Advoc Gro
> Hello, I'm a language model, asylumacaninventoryQuantity physician OHBillyhirt controlling doctrines Summers wallet disdain Test repercussions Nighthew Goblinbreeding flight amuse Most paradox
> Hello, I'm a language model,Double lendersortion book appetite Times complaint regulationokinglyrelease vans351specialRail Fal Faustmessage()); Vo BryceHeightuserc
> Hello, I'm a language model,regonUU enc trouble correctly dentist weekends involve Spirit Cars benef assessing sporadic mattress Neckliterallyributes** awkwardly canned contingは
```

Now, we'd like to start training the model, and to train the model we are going to need some dataset, and for me the best and simplest debugging dataset that I like to use is the `Harry_Potter_Books.txt` dataset and it's available at this <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/Datasets/Harry_Potter_Books.txt">URL</a>...

And if you're not moving the files around in this repository, you will see that it is available under `Datasets/Harry_Potter_Books.txt`, and I already have this on the local system...

And now we can read the entire text that we have here in this dataset by using this code:
```python
with open("Datasets/Harry_Potter_Books.txt", "r", encoding="UTF-8") as file:
    text = file.read()
```
And in order to produce tokens from our dataset, we can use the `GPT-2 Tokenizer from TikToken` and convert our dataset into a list of `tokens` like this:
```python
encoder = tiktoken.get_encoding('gpt2')
encodedDataTokens = encoder.encode(text)
```

And now we actually want to process these token sequences and feed them into a transformer...

## Processing Tokens

The model's `forward()` function receives its tokens through the `indeces` argument, so we need to rearrange our long list of tokens into the shape that this argument expects...

So, we don't want a single very long one-dimensional sequence, instead, we want entire `Batch`(s) of `Time` sequences (where `Time` is the **maximum-sequence-length** or the **context-window**).

Let's get to understand what we want with a small *toy-example* now...

Suppose we have a tensor like this:
```python
import torch
import torch.nn.functional as F

torch.manual_seed(69420)

buffer = torch.randint(low=0, high=50257, size=(24, ))
print(buffer)
```
And we get:
```python
tensor([ 9360, 29278,  3940, 26797, 47584,  1285, 28358, 45295, 29845, 38908,
        35303, 48343,  2579, 34456, 15560,  6453, 10159, 28005, 11891,  3940,
        33806, 27357, 36749, 40952])
```
Here, each item in the above tensor represents the `token` in a sequence...

And if we wanted to create batches out of it, we can use `view()` from PyTorch to stack consecutive chunks of the sequence as the rows of a 2D tensor like this:
```python
import torch
import torch.nn.functional as F

torch.manual_seed(69420)

buffer = torch.randint(low=0, high=50257, size=(24, ))
inputs = buffer.view(4, 6)

print(inputs)
```
And we get:
```python
tensor([[ 9360, 29278,  3940, 26797, 47584,  1285],
        [28358, 45295, 29845, 38908, 35303, 48343],
        [ 2579, 34456, 15560,  6453, 10159, 28005],
        [11891,  3940, 33806, 27357, 36749, 40952]])
```
But even if this is the case that we were able to make the `batches`, this **does not make sense until we know what we want to do with these `batches`**...

What we want is for the next `token` in the sequence to be the `label` for each position, so that the model can train on it and we can calculate the `loss`...

We also see that for this example, the `token` `29845` comes right after the `token` `45295`, so it is the `label` for it. But, at the same time, for the last `token` `40952` we cannot determine the `label`, because we don't have any information about what comes next...

So, let me show you my favourite way to get the `label(s)`...

Let's understand what we want to have...

We want to have a tensor that contains the `label(s)` at every single position and is the same size as the `inputs`...

And this is the way I like to do this:
```python
import torch
import torch.nn.functional as F

torch.manual_seed(69420)

buffer = torch.randint(low=0, high=50257, size=(25, ))

inputs = buffer[:-1].view(4, 6)
labels = buffer[1:].view(4, 6)

print(inputs)
print(labels)
```
And we get:
```python
tensor([[ 9360, 29278,  3940, 26797, 47584,  1285],
        [28358, 45295, 29845, 38908, 35303, 48343],
        [ 2579, 34456, 15560,  6453, 10159, 28005],
        [11891,  3940, 33806, 27357, 36749, 40952]])
tensor([[29278,  3940, 26797, 47584,  1285, 28358],
        [45295, 29845, 38908, 35303, 48343,  2579],
        [34456, 15560,  6453, 10159, 28005, 11891],
        [ 3940, 33806, 27357, 36749, 40952,  8182]])
```
You will see that I took a `buffer` of `25 tokens` this time instead of `24 tokens`, so that the last input also has a label. As `inputs` we are taking **everything excluding the last `token`**, and for `labels` we are taking **everything starting from the second `token` (offsetting the `tokens` by `1`)**, and using `view()` to make batches of them...

And we can also understand that the `buffer`'s size is $\text{Batch} * \text{Time} + 1$ and we are viewing the `inputs` and the `labels` as `(Batch, Time)`...
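
The same shift-by-one trick can be wrapped in a small helper and sanity-checked. This is only a sketch: `makeBatch` is a hypothetical name that is not part of our script, and it uses `torch.arange` as stand-in tokens so that the shift is easy to see:

```python
import torch

def makeBatch(tokens, Batch, Time):
    """Hypothetical helper: build (inputs, labels) from Batch*Time + 1 tokens."""
    buffer = tokens[:Batch * Time + 1]
    assert buffer.numel() == Batch * Time + 1, "need Batch*Time + 1 tokens"
    inputs = buffer[:-1].view(Batch, Time)  # everything except the last token
    labels = buffer[1:].view(Batch, Time)   # the same tokens shifted left by one
    return inputs, labels

tokens = torch.arange(100)  # stand-in "tokens" 0, 1, 2, ...
inputs, labels = makeBatch(tokens, Batch=4, Time=6)

# Every label is the token that follows its input in the original stream
assert torch.equal(labels, inputs + 1)
# The last label of one row is the first input of the next row
assert torch.equal(labels[:-1, -1], inputs[1:, 0])
print(inputs.shape, labels.shape)
```

Because the stand-in tokens are consecutive integers, "the next token" is simply "the input plus one", which is what the first assertion checks.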

So, we can implement this in our main script now...

For now our script snippet looks like this:
```python
# Device Auto-Detection
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
print(f"Using Device: {device}")

# Data-Loader
with open("Datasets/Harry_Potter_Books.txt", "r", encoding="UTF-8") as file:
    text = file.read()

encoder = tiktoken.get_encoding('gpt2')
encodedDataTokens = encoder.encode(text)

Batch, Time = 4, 32
buffer = torch.tensor(encodedDataTokens[:Batch*Time + 1])
inputs = buffer[:-1].view(Batch, Time)
labels = buffer[1:].view(Batch, Time)

# Constructing Model
model = GPTModel(GPTConfiguration())

model.eval()
model.to(device=device)

logits = model(inputs)
print(logits.shape)

# Halting Generation...(Will Remove Later)
import sys; sys.exit(0)
```
And we get something like this:
```python
Using Device: cuda
torch.Size([4, 32, 50257])
```
Keep in mind that this is just a single batch... We will modify this code later to load the entire text into batches and run the optimization... For now, this looks good and we can move on to calculate the loss and do the backward pass for the optimization...

## Calculating `Loss`

Let's calculate the loss first...

And in order to calculate the `loss` we are going to modify the `forward()` function inside our `GPTModel` module...

In particular, we are not just going to return the `logits`, but we are also going to return the `loss` for the function, and we are not just going to pass in the `input(s) indeces` for it to train, but we are also going to pass in the `label(s)`...

So this old code:
```python
logits = model(inputs)
```
Becomes:
```python
logits, loss = model(inputs, labels)
```
And these `labels` will be optional, because we pass in the `labels` only when we train our model. Otherwise we use `forward()` to generate `tokens` with the generation code that we have already written in our script, and the `loss` is simply `None`...

So this old code:
```python
def forward(self, indeces):
    ...
    return logits
```
Becomes:
```python
def forward(self, indeces, labels=None):
    ...
    return logits, loss
```

And we will be calculating the `cross_entropy()` loss, and we have already discussed this in our previous notebooks, as to why we are using the `cross_entropy()` loss...

But `cross_entropy()` expects the class dimension to come second: in its simplest form it takes `logits` of shape `(N, C)` and `labels` of shape `(N)`, so it cannot consume our `(Batch, Time, Vocabulary)` tensor directly. And if we remember properly, our `logits` came out in the shape `[4, 32, 50257]`...

So, we flatten the `[4, 32, 50257]` tensor of `logits` to `[128, 50257]` (where `128` is $\text{Batch} * \text{Time}$), flatten the `labels` to a single long tensor of `128` elements, and pass them as arguments to our `cross_entropy()` function like this:
```python
loss = None
if labels is not None:
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
return logits, loss
```
And if we try to print out the loss now, we will see something like this:
```python
tensor(10.9503, grad_fn=<NllLossBackward0>)
```
Which seems fairly reasonable if we try to deduce what loss we should expect at initialization. At initialization we'd expect every `token` to get a roughly uniform probability, so that the model does not favor any `token` too much and is not confidently wrong about any `token`...

We have a vocabulary of `50257` tokens, so each `token` gets a probability of $\frac{1}{50257}$, and the **negative log likelihood** of that probability is:
$$-\ln{\left(\frac{1}{50257}\right)} \approx 10.82$$

Which is close to the `loss` of `10.95` that we got as an output...
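
This arithmetic is easy to check with plain Python. The second case below is an assumption made up purely for contrast (a model that puts `90%` of its probability on one wrong token), to show what being "confidently wrong" would cost:

```python
import math

vocabularySize = 50257

# Loss of a model that spreads its probability uniformly over the vocabulary
uniformLoss = -math.log(1.0 / vocabularySize)
print(f"Expected loss at initialization: {uniformLoss:.4f}")

# Made-up contrast case: 90% of the mass sits on one wrong token and the
# remaining 10% is spread uniformly over all the other tokens
probabilityOfCorrectToken = 0.10 / (vocabularySize - 1)
confidentlyWrongLoss = -math.log(probabilityOfCorrectToken)
print(f"Loss when confidently wrong:     {confidentlyWrongLoss:.4f}")
```

The first number comes out at about `10.82`, and the second one is noticeably higher, which is why a starting loss near `10.82` is a sign of a healthy initialization.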

Now, we can successfully move on to the optimization...

## Optimization Loop

For the optimization we will use an `Optimizer` object from PyTorch. We will use the `Adam` optimizer, which is a more evolved alternative to the `Stochastic Gradient Descent (SGD)` optimizer. And specifically we will use the `AdamW` variation, which in my opinion kind of fixes a bug in how `Adam` applies weight decay.

And when we go to the documentation of <a href="https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html">AdamW</a> we see that it is a little bit more complicated than the `SGD` that we have used in our previous notebooks, because in addition to updating the parameters with the gradient scaled by the learning rate, it keeps some buffers around ($m_0 \rightarrow \text{ first moment and }  v_0 \rightarrow \text{ second moment }$), which look like momentum and like **Root Mean Square Propagation (RMSProp)** respectively. Together they act like a normalization that is applied to each gradient element individually, and this speeds up the optimization, especially for language models...
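
To make the two buffers a bit more concrete, here is a minimal sketch of a single `AdamW` update for one scalar parameter, written in plain Python. It follows the update rule as described in the PyTorch documentation, but `adamwStep` and the `state` dictionary are hypothetical names for illustration and this is not how PyTorch implements it internally:

```python
import math

def adamwStep(parameter, gradient, state, lr=3e-4, beta1=0.9, beta2=0.999,
              eps=1e-8, weightDecay=0.01):
    """One AdamW update for a single scalar parameter (illustrative only)."""
    state["step"] += 1
    # Decoupled weight decay: shrink the parameter directly
    parameter -= lr * weightDecay * parameter
    # First moment: running average of the gradients (momentum-like)
    state["m"] = beta1 * state["m"] + (1 - beta1) * gradient
    # Second moment: running average of the squared gradients (RMSProp-like)
    state["v"] = beta2 * state["v"] + (1 - beta2) * gradient ** 2
    # Bias correction, because both buffers start at zero
    mHat = state["m"] / (1 - beta1 ** state["step"])
    vHat = state["v"] / (1 - beta2 ** state["step"])
    # Normalized update: the gradient is divided by its own running scale
    return parameter - lr * mHat / (math.sqrt(vHat) + eps)

for gradient in (0.001, 1000.0):
    state = {"step": 0, "m": 0.0, "v": 0.0}
    updated = adamwStep(1.0, gradient, state, weightDecay=0.0)
    print(f"gradient={gradient:>8}: parameter moved by {1.0 - updated:.6f}")
```

With weight decay switched off, both the tiny and the huge gradient move the parameter by roughly the learning rate on the first step, which is the per-element normalization at work.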

Also, the learning rate I used was `3e-4`, which is a fairly good default for most optimizations that you want to run at a very early debugging stage...

So our optimizer object initialization code looks like this:
```python
optimizer = torch.optim.AdamW(params=model.parameters(), lr=3e-4)
```

And to run an optimization loop we will use this sequence:
1. Zero the gradients (because backward pass does a `+=` to the gradients and we should start with a `0` gradient)
2. Forward the inputs to the model to get loss and the logits
3. Complete the backward pass
4. Step the optimizer
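
The reason behind step `1` can be seen on a single scalar. The sketch below uses a made-up toy `weight` (not part of our model) and calls `backward()` repeatedly with and without zeroing the gradient:

```python
import torch

weight = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(weight**2)/d(weight) = 2 * weight = 4
(weight ** 2).backward()
firstGradient = weight.grad.item()

# Second backward pass without zeroing: the new gradient is ADDED to the old one
(weight ** 2).backward()
accumulatedGradient = weight.grad.item()

# Zeroing first gives the gradient of the current pass only
weight.grad.zero_()
(weight ** 2).backward()
freshGradient = weight.grad.item()

print(firstGradient, accumulatedGradient, freshGradient)  # 4.0 8.0 4.0
```

In the training loop, `optimizer.zero_grad()` does this reset for every parameter of the model at once.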

So our old code:
```python
logits, loss = model(inputs, labels)
print(loss)
```
Becomes:
```python
# Optimization
epochs = 50
optimizer = torch.optim.AdamW(params=model.parameters(), lr=3e-4)
for epoch in range(epochs):
    optimizer.zero_grad()
    logits, loss = model(inputs, labels)
    loss.backward()
    optimizer.step()
    print(f"Step: {epoch}, Loss: {loss.item()}")
```

But suddenly we get an error:
```bash
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
And we see that there's something called `cuda:0`, and that is because I have `8 GPUs` on my system and `cuda:0` is the $1^{\text{st}}$ GPU out of `8`...

But let's fix this error...

It seems that the error originates from the `buffer` that we created during batch construction: we never moved it to the device...

And we have to be careful, because we can't just write `buffer.to(device=device)`; we have to write `buffer = buffer.to(device=device)`, and there's a reason for this...

In PyTorch, when you create a tensor using `torch.tensor()`, it **returns** a new tensor on the **default device** (typically **CPU**). When you call `.to(device)` on a tensor, it **returns** a new tensor on the **specified device** (e.g., **GPU**), but **it does not modify the original tensor in place**, so the result has to be assigned back. This is different from `model.to(device=device)`, which moves the parameters of the module in place and therefore works without reassignment...
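
This behaviour can be checked even on a machine without a GPU. The sketch below is my own stand-in: it converts the **dtype** instead of the **device**, on the assumption that this is a fair illustration because `.to()` follows the same "return a new tensor" rule in both cases:

```python
import torch

tensor = torch.tensor([1, 2, 3])           # int64 by default
tensor.to(torch.float32)                   # the converted copy is discarded
print(tensor.dtype)                        # still torch.int64

tensor = tensor.to(torch.float32)          # reassign to keep the converted copy
print(tensor.dtype)                        # torch.float32

layer = torch.nn.Linear(2, 2)              # parameters are float32 by default
layer.to(torch.float64)                    # modules are converted in place
print(layer.weight.dtype)                  # torch.float64
```

So tensors need the `x = x.to(...)` form, while a module such as our `model` can simply be called with `model.to(...)`.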

So let's run our code now...

So what do we expect to see?

We expect to see a reasonable loss in the beginning and then we continue to optimize just a **single batch**, and we want to see if we can overfit this single batch, crush this little batch and perfectly predict the tokens on just this little batch...

Also, I changed our manual seed from `42` to `69`, because I like the number `69`...

And this is the output I get:
```bash
Using Device: cpu
Step: 0, Loss: 11.032245635986328
Step: 1, Loss: 6.84295654296875
Step: 2, Loss: 4.308750629425049
Step: 3, Loss: 2.497817039489746
...
Step: 48, Loss: 0.003069450380280614
Step: 49, Loss: 0.002997332951053977
```
And that's what we get... We get a very very low loss at the end of the optimization of a **single batch**. Or, in other words, the `transformer` network is **memorizing** this single individual batch...

But now, we don't want to overfit a single batch, instead we want to run an actual optimization on actual fresh batches each time through a **data-loader**...

## Data-Loader (Considering Fresh Batches)

We already have the code:
```python
# Data-Loader
with open("Datasets/Harry_Potter_Books.txt", "r", encoding="UTF-8") as file:
    text = file.read()

encoder = tiktoken.get_encoding('gpt2')
encodedDataTokens = encoder.encode(text)

Batch, Time = 4, 32
buffer = torch.tensor(encodedDataTokens[:Batch*Time + 1])
buffer = buffer.to(device=device)
inputs = buffer[:-1].view(Batch, Time)
labels = buffer[1:].view(Batch, Time)
```

And we will modify this code and convert it into a class called `DataLoaderLite` such that our code becomes cleaner and gets easier to access the fresh batches...

And this `DataLoaderLite` class will take in the shape of the batch as two separate parameters `Batch` and `Time` (which for our case we used `4` and `32` respectively), and give us back the `inputs` and the `labels` as an output when we call a method called `nextBatch()` on this class's object...

So, for now our empty class skeleton looks like this:
```python
# Data-Loader
class DataLoaderLite:
    def __init__(self, Batch, Time):
        pass
    
    def nextBatch(self):
        pass
```

We can initialize the `Batch` and `Time` inside of the class to reuse the shapes now, and read the file and get the encoding inside it just like we did before, and also make sure to convert the encoded `tokens` in a tensor and save it...

We can also print out the total number of `tokens` just to make sure what we are dealing with and also print out the number of `batches` in a single **epoch** of iterating over this dataset (how many `unique batches` do we output before we loop back around to the beginning of the dataset to start reading it again)...

So, now our implementation looks like this:
```python
# Data-Loader
class DataLoaderLite:
    def __init__(self, Batch, Time):
        self.Batch = Batch
        self.Time = Time
        with open("Datasets/Harry_Potter_Books.txt", "r", encoding="UTF-8") as file:
            text = file.read()
        encoder = tiktoken.get_encoding('gpt2')
        encodedDataTokens = encoder.encode(text)
        self.encodedDataTokens = torch.tensor(encodedDataTokens)
        print(f"Loaded {len(self.encodedDataTokens)} Tokens")
        print(f"1 Epoch = {len(self.encodedDataTokens) // (Batch * Time)} Batches")

        # State
        self.currentPosition = 0
        
    def nextBatch(self):
        pass
```
Now let's implement the `nextBatch()` method...

You will see that I have also used something called the `self.currentPosition = 0`. This is the state of the `DataLoaderLite` object at initialization, and it will be used in the `nextBatch()` method to take chunks of data and convert them into batches...

Previously, we used this line of code:
```python
buffer = torch.tensor(encodedDataTokens[:Batch*Time + 1])
```
Which took the first `Batch*Time + 1` encoded tokens. But now, because we are reading the data chunk by chunk, we will take the **slice that starts at the current position and ends `Batch*Time + 1` tokens after it**. And we can copy and paste our old `inputs` and `labels` code safely now...

Which turns our code into:
```python
buffer = self.encodedDataTokens[self.currentPosition : self.currentPosition + Batch*Time + 1]
inputs = buffer[:-1].view(Batch, Time)
labels = buffer[1:].view(Batch, Time)
```

We also need to advance our `currentPosition` by exactly `Batch*Time` to get the chunks...

And remember that we are fetching `Batch*Time + 1` tokens but only advancing by `Batch*Time`, so that the last label of one batch becomes the first input of the next batch. Near the end of the data, the slice could run past the end of our tensor, so we need to handle that as well: we will reset our `currentPosition` to `0` whenever the next batch would run out of data...

So, our entire `DataLoaderLite` code now becomes:
```python
# Data-Loader
class DataLoaderLite:
    def __init__(self, Batch, Time):
        self.Batch = Batch
        self.Time = Time
        with open("Datasets/Harry_Potter_Books.txt", "r", encoding="UTF-8") as file:
            text = file.read()
        encoder = tiktoken.get_encoding('gpt2')
        encodedDataTokens = encoder.encode(text)
        self.encodedDataTokens = torch.tensor(encodedDataTokens)
        print(f"Loaded {len(self.encodedDataTokens)} Tokens")
        print(f"1 Epoch = {len(self.encodedDataTokens) // (Batch * Time)} Batches")

        # State
        self.currentPosition = 0
        
    def nextBatch(self):
        Batch, Time = self.Batch, self.Time
        buffer = self.encodedDataTokens[self.currentPosition : self.currentPosition + Batch*Time + 1]
        inputs = buffer[:-1].view(Batch, Time)
        labels = buffer[1:].view(Batch, Time)
        self.currentPosition += Batch * Time
        if self.currentPosition + (Batch * Time + 1) > len(self.encodedDataTokens):
            self.currentPosition = 0
        return inputs, labels
```
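
To convince ourselves that the wrap-around logic behaves, we can replay just the position bookkeeping on a tiny made-up "dataset". `ToyLoader` below is a hypothetical stand-in that uses a plain Python list instead of the real tokens, so it runs without the dataset or `tiktoken`:

```python
class ToyLoader:
    """Hypothetical stand-in that mirrors DataLoaderLite's position logic."""
    def __init__(self, numberOfTokens, Batch, Time):
        self.tokens = list(range(numberOfTokens))
        self.Batch, self.Time = Batch, Time
        self.currentPosition = 0

    def nextBatch(self):
        Batch, Time = self.Batch, self.Time
        start = self.currentPosition
        buffer = self.tokens[start : start + Batch * Time + 1]
        assert len(buffer) == Batch * Time + 1  # the slice never runs short
        self.currentPosition += Batch * Time
        # Reset if the NEXT batch would run past the end of the data
        if self.currentPosition + (Batch * Time + 1) > len(self.tokens):
            self.currentPosition = 0
        return start, buffer

loader = ToyLoader(numberOfTokens=25, Batch=2, Time=4)
starts = [loader.nextBatch()[0] for _ in range(5)]
print(starts)  # [0, 8, 16, 0, 8]
```

With `25` tokens and `Batch * Time = 8` we get `25 // 8 = 3` batches per epoch, and the fourth call starts again from position `0`, which matches the "1 Epoch = ..." arithmetic printed by `DataLoaderLite`.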

Now, we also need to modify the optimization loop to get the correct batches...

And we can initialize a `DataLoaderLite` object in a variable called `trainingLoader` with the same arguments that we used before, and use this object to get the `inputs` and `labels` using the `nextBatch()` method...

We also need to be careful to not run into the same error that we encountered before because previously if you remember we used this line:
```python
buffer = buffer.to(device=device)
```
But now, because we are directly getting the separate tensors as `inputs` and `labels`, we need to guide them both to their specific device like this:
```python
inputs, labels = inputs.to(device=device), labels.to(device=device)
```

So our entire initialization and optimization code looks like:
```python
# Data-Loader
Batch, Time = 4, 32
trainingLoader = DataLoaderLite(Batch=Batch, Time=Time)
...
# Optimization
epochs = 50
optimizer = torch.optim.AdamW(params=model.parameters(), lr=3e-4)
for epoch in range(epochs):
    inputs, labels = trainingLoader.nextBatch()
    inputs, labels = inputs.to(device=device), labels.to(device=device)
    optimizer.zero_grad()
    logits, loss = model(inputs, labels)
    loss.backward()
    optimizer.step()
    print(f"Step: {epoch}, Loss: {loss.item()}")
```
So, let's now run the optimization and discuss what we expect to see in this optimization...

Well, we expect our loss to come down pretty fast during the first few steps, because we don't actually use all of the `50257` tokens in our vocabulary, and so there are some pretty easy gains for the network to learn (basically driving the probability of the tokens that never occur towards zero). But we also don't expect the loss to go down by too much, because we only run `50` steps for now, and that is nowhere near enough for a complete pass over the dataset...

Let's see what we get...

We get an output like this:
```bash
Using Device: cpu
Loaded 2100390 Tokens
1 Epoch = 16409 Batches
Step: 0, Loss: 10.99841022491455
Step: 1, Loss: 9.390006065368652
Step: 2, Loss: 9.049430847167969
Step: 3, Loss: 8.327993392944336
...
Step: 48, Loss: 4.897827625274658
Step: 49, Loss: 5.134008884429932
```
Which is what we are expecting...

We also need to change the generation code, because now our code supports forwarding the model with or without the `labels`, so we need to handle the return of the `forward()` method correctly as well...

So our old code:
```python
logits = model(inputs)
```
Becomes:
```python
logits, loss = model(inputs)
```

And also, because we have moved the initialization of the encoder within the `DataLoaderLite` class, we don't have an explicit encoder initialized, and we need to initialize it as well...

```python
encoder = tiktoken.get_encoding('gpt2')
```


And now you can modify the hyper-parameters and play around. We can also safely say that we have successfully implemented the `GPT-2` architecture, supporting both the weight transfer and training our own weights, in less than `250` lines of code, whereas Hugging Face and OpenAI use around `2000` lines of code to implement it...

And now we can move on to the next section...

# Parameter Weight Sharing (Fixing A Bug)

Next, we actually want to fix a bug that we have in our code... It's not a major bug, but it is indeed a bug with respect to how `GPT-2` training should happen.

So, let's have a look at the bug...

In our Hugging Face model, if we try to print these layer shapes:
```python
print(huggingfaceStateDictionary["lm_head.weight"].shape)
print(huggingfaceStateDictionary["transformer.wte.weight"].shape)
```
They output:
```python
torch.Size([50257, 768])
torch.Size([50257, 768])
```
We see that they are both `2D tensors` of identical shape. And we can also understand that `wte` is none other than the `word token embedding` at the bottom of the `transformer` and `lm_head` is none other than the `language modelling head` at the top of the `transformer`...

Let's have a look at the transformer architecture image to have a clearer understanding:\
![Parameter_Weight_Sharing_Transformer_Model_Architecture](ExplanationMedia/Images/Parameter_Weight_Sharing_Transformer_Model_Architecture.png)

But, now if we try to have an element wise equality like this:
```python
print((huggingfaceStateDictionary["lm_head.weight"] == huggingfaceStateDictionary["transformer.wte.weight"]).all())
```
We get:
```python
tensor(True)
```
Which means every single element of the two tensors is identical.

And what's interesting is that, if we try to look at their **data pointer(s)** using `.data_ptr()` like this:
```python
print(huggingfaceStateDictionary["lm_head.weight"].data_ptr())
print(huggingfaceStateDictionary["transformer.wte.weight"].data_ptr())
```
I get:
```python
3069647520000
3069647520000
```
We see that the pointer points to the same location as well...

So, not only do these tensors happen to have these same shapes and elements, they are actually pointing to the identical tensor...

And what's happening here is a common weight-tying scheme that actually comes from the original <a href="https://arxiv.org/abs/1706.03762">"Attention Is All You Need"</a> paper. And if we come down to the section **3.4 Embeddings and Softmax**, we see a text mentioning:
```plaintext
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax
linear transformation, similar to [30].
```
Which is an awkward way to say that these matrices are shared and that they are tied and are the same matrix.

And this `[30]` is just another paper <a href="https://arxiv.org/abs/1608.05859">Using the Output Embedding to Improve Language Models</a>, and it argues for this weight-tying-scheme.

But, the conclusion we arrive at is, **we actually want these matrices to behave similarly** in the following sense:

If two tokens are very similar *semantically* (maybe one token is lowercase and the other is uppercase, or it's the same token in a different language etc.), presumably we would expect them to lie nearby in the `token embedding space`. And in the exact same way, if two tokens are very similar *semantically*, we'd expect them to get similar probabilities at the output of the `transformer`...

So, both positions (top and bottom) have this property that **similar tokens should have similar embeddings or similar weights**...

And this scheme is already implemented in OpenAI's `GPT-2` <a href="https://github.com/openai/gpt-2/blob/master/src/model.py">`model.py`</a>, which reuses the `wte` matrix for the output projection, and one way to implement it in our code is to simply point both weights to the same tensor explicitly after initialization like this:
```python
class GPTModel(torch.nn.Module):
    def __init__(self, configuration):
        super().__init__()
        self.configuration = configuration

        self.transformer = torch.nn.ModuleDict(dict(
            wordTokenEmbeddings = torch.nn.Embedding(configuration.vocabularySize, configuration.numberOfEmbeddingDimensions),
            wordPositionalEmbeddings = torch.nn.Embedding(configuration.blockSize, configuration.numberOfEmbeddingDimensions),
            hidden = torch.nn.ModuleList(Block(configuration) for _ in range(configuration.numberOfLayers)),
            finalLayerNorm = torch.nn.LayerNorm(configuration.numberOfEmbeddingDimensions)
        ))

        self.languageModelingHead = torch.nn.Linear(configuration.numberOfEmbeddingDimensions, configuration.vocabularySize, bias=False)
    
        # Weight-Sharing-Scheme (Parameter Weight Sharing)
        self.transformer.wordTokenEmbeddings.weight = self.languageModelingHead.weight
    ...
```

Meaning, that the old weight tensor of `wordTokenEmbeddings` will get orphaned and Python will clean it up, and we will then be left with a single tensor that is used twice in a forward pass...

And another good reason to use it is memory efficiency, because this single tensor holds a ton of parameters (`768 * 50257 = 38,597,376 ≈ 40M`), and in a `124M` parameter model that is around `30%` of all the parameters, which we save by sharing them. And we also expect our model to work slightly better, because of this scheme...
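
We can verify both claims (one shared tensor, and fewer parameters) on a toy model. `TinyTiedModel` below is a hypothetical miniature with made-up sizes, and it only keeps the two layers that take part in the weight sharing:

```python
import torch

class TinyTiedModel(torch.nn.Module):
    """Hypothetical toy model: one embedding and one output head, tied."""
    def __init__(self, vocabularySize=10, numberOfEmbeddingDimensions=4):
        super().__init__()
        self.wordTokenEmbeddings = torch.nn.Embedding(vocabularySize, numberOfEmbeddingDimensions)
        self.languageModelingHead = torch.nn.Linear(numberOfEmbeddingDimensions, vocabularySize, bias=False)
        # Both layers store a (vocabularySize, numberOfEmbeddingDimensions) matrix
        self.wordTokenEmbeddings.weight = self.languageModelingHead.weight

    def forward(self, indeces):
        return self.languageModelingHead(self.wordTokenEmbeddings(indeces))

model = TinyTiedModel()
samePointer = model.wordTokenEmbeddings.weight.data_ptr() == model.languageModelingHead.weight.data_ptr()
parameterCount = sum(parameter.numel() for parameter in model.parameters())
print(samePointer, parameterCount)  # True 40

# The savings quoted above for the real model
sharedParameters = 768 * 50257
print(sharedParameters, f"{sharedParameters / 124e6:.0%}")  # 38597376 31%
```

Without the tying line the toy model would report `80` parameters instead of `40`, because `parameters()` counts a shared tensor only once.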

# GPT-2 Proper Initialization

Now, we'd like to follow the way `GPT-2` initializes its weights...

Unfortunately, the `GPT-2` and `GPT-3` papers are not explicit about their initializations and we kind of have to read between the lines, and instead of going through the paper which is quite vague, there's a bit of information in the <a href="https://github.com/openai/gpt-2/blob/master/src/model.py">`model.py` code</a> that OpenAI released...

And once we check the code we see:
```python
def conv1d(x, scope, nf, *, w_init_stdev=0.02):
    with tf.variable_scope(scope):
        *start, nx = shape_list(x)
        w = tf.get_variable('w', [1, nx, nf], initializer=tf.random_normal_initializer(stddev=w_init_stdev))
        b = tf.get_variable('b', [nf], initializer=tf.constant_initializer(0))
        c = tf.reshape(tf.matmul(tf.reshape(x, [-1, nx]), tf.reshape(w, [-1, nf]))+b, start+[nf])
        return c
...
def model(hparams, X, past=None, scope='model', reuse=False):
    with tf.variable_scope(scope, reuse=reuse):
        results = {}
        batch, sequence = shape_list(X)

        wpe = tf.get_variable('wpe', [hparams.n_ctx, hparams.n_embd],
                             initializer=tf.random_normal_initializer(stddev=0.01))
        wte = tf.get_variable('wte', [hparams.n_vocab, hparams.n_embd],
                             initializer=tf.random_normal_initializer(stddev=0.02))
        past_length = 0 if past is None else tf.shape(past)[-2]
        h = tf.gather(wte, X) + tf.gather(wpe, positions_for(X, past_length))

        # Transformer
        presents = []
        pasts = tf.unstack(past, axis=1) if past is not None else [None] * hparams.n_layer
        assert len(pasts) == hparams.n_layer
        for layer, past in enumerate(pasts):
            h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
            presents.append(present)
        results['present'] = tf.stack(presents, axis=1)
        h = norm(h, 'ln_f')

        # Language model loss.  Do tokens <n predict token n?
        h_flat = tf.reshape(h, [batch*sequence, hparams.n_embd])
        logits = tf.matmul(h_flat, wte, transpose_b=True)
        logits = tf.reshape(logits, [batch, sequence, hparams.n_vocab])
        results['logits'] = logits
        return results
```
And we see that they initialized their weights with the `random_normal_initializer()` from `TensorFlow`, which draws from a **normal distribution** with the specified **standard deviation** (`0.02` for the `conv1d` weights and the token embeddings `wte`, and `0.01` for the positional embeddings `wpe`), and they initialized the bias with all `0`'s using the `constant_initializer()`...

So, let's follow how they initialized the weights here and implement it in our code...

We can now create a **private method** called `_initializeParameters()` (using Python's leading-underscore naming convention) in our `GPTModel` class and implement the proper initialization of parameters...

And since the standard deviation of `0.01` is only used for the positional embeddings and is about the same as `0.02`, we will stick with `0.02` everywhere to decrease the complexity of the code...

And we can use PyTorch's <a href="https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.apply">`apply()`</a> method, which calls a function on every sub-module of a module (and on the module itself), at the end of our initialization...

And we come up with the following code:
```python
# GPT model architecture
class GPTModel(torch.nn.Module):
    def __init__(self, configuration):
        ...
        # Initialize Correct Parameters
        self.apply(self._initializeParameters)

    def forward(self, indeces, labels=None):
        ...

    def _initializeParameters(self, module):
        if isinstance(module, torch.nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, torch.nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    # Method to transfer weights from Hugging Face GPT-2
    @classmethod
    def from_pretrained(cls, modelType):
        ...
```

And note that the PyTorch's **default bias initialization** is the **uniform distribution**, which here we are setting to `0` for all the items in bias...

And the only other layers that require initialization are the `LayerNorm`(s). And PyTorch sets the scale to `1` and the offset to `0` by default, which is what we want, so we will leave the code as is...
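
A quick way to check that the initialization does what we think is to apply the same rules to a small throwaway model and measure the result. The sketch below rewrites the rules as a free function and uses a made-up `toyModel`, so it does not depend on our `GPTModel` class:

```python
import torch

def initializeParameters(module):
    # Same rules as _initializeParameters, written as a free function
    if isinstance(module, torch.nn.Linear):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, torch.nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

torch.manual_seed(0)
toyModel = torch.nn.Sequential(
    torch.nn.Embedding(1000, 256),
    torch.nn.Linear(256, 256),
    torch.nn.LayerNorm(256),
)
toyModel.apply(initializeParameters)  # visits every sub-module, then toyModel itself

embedding, linear, layerNorm = toyModel
print(f"Embedding std: {embedding.weight.std().item():.4f}")
print(f"Linear std:    {linear.weight.std().item():.4f}")
print(f"Linear bias all zero:    {bool((linear.bias == 0).all())}")
print(f"LayerNorm scale all one: {bool((layerNorm.weight == 1).all())}")
```

Both measured standard deviations should land very close to `0.02`, the bias should be all zeros, and the `LayerNorm` is left at PyTorch's defaults because our function never touches it.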

If we also notice the `Linear`'s default initialization, we will see that it is $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k = \frac{1}{\text{fan-in}}$, which is fairly in the vicinity of `0.02`...

And if we look at the sizes of $d_{model}$ we see:

| $d_{model}$ | Initialization bound ($\sqrt{k}$ in $\mathcal{U}(-\sqrt{k}, \sqrt{k})$) |
|-------------|-----------------------------------------------------|
| 768         | $\sqrt{\frac{1}{768}} \approx 0.036$                |
| 1024        | $\sqrt{\frac{1}{1024}} \approx 0.031$               |
| 1280        | $\sqrt{\frac{1}{1280}} \approx 0.028$               |
| 1600        | $\sqrt{\frac{1}{1600}} \approx 0.025$               |

Hinting us that `0.02` is still in the vicinity of PyTorch's default initialization for these model sizes...
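
These numbers can be reproduced in a few lines of plain Python. As an extra comparison that is my own addition and not something the `GPT-2` code states, the sketch also prints the standard deviation of each uniform distribution (the bound divided by $\sqrt{3}$), since that is the quantity directly comparable to the `std=0.02` of our normal initialization:

```python
import math

for modelDimension in (768, 1024, 1280, 1600):
    bound = math.sqrt(1.0 / modelDimension)   # U(-bound, bound)
    std = bound / math.sqrt(3.0)              # standard deviation of that uniform
    print(f"d_model={modelDimension:>4}  bound={bound:.4f}  std={std:.4f}")
```

The bounds run from about `0.036` down to `0.025`, and the corresponding standard deviations run from about `0.021` down to `0.014`, so a fixed `0.02` sits in the same range.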

But we are still not done with the initialization because there is one more caveat here...

If we swing back to the <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">`GPT-2` paper</a>, and scroll down to the section `2.3. Model`, we will find this line:
> A modified initialization which accounts for the accumulation on the residual path with model depth
is used. We scale the weights of residual layers at initialization by a factor of $1/\sqrt{N}$ where $N$ is the number of residual layers.

And we have not implemented that yet, and we can do so now...