# Welcome to GPT from Scratch

## Things to discuss before starting

This notebook is a continuation of my previous notebooks in order:
1. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/Neural%20Network%20with%20Derivatives.ipynb">Neural Networks with Derivatives</a>
2. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave</a>
3. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20-%20Multi%20Layer%20Perceptron.ipynb">NameWeave - Multi Layer Perceptron</a>
4. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20(MLP)%20-%20Activations%2C%20Gradients%20%26%20Batch%20Normalization.ipynb">NameWeave (MLP) - Activations, Gradients & Batch Normalization</a>
5. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20-%20Manual%20Back%20Propagation.ipynb">NameWeave - Manual Back Propagation</a>
6. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20-%20WaveNet.ipynb">NameWeave - WaveNet</a>

Which means I will be using a lot of the terminology, explanations, and code from the previous notebooks in the series...

And we will gradually build a complete **GPT** from scratch.

This notebook will be a little bit different from the previous ones, and I will discuss the changes in a bit...

We will use a completely new dataset, `Harry_Potter_Books.txt`, which is the raw text of all the Harry Potter books by J. K. Rowling combined into a single `text` file, instead of `Indian_Names.txt`, which we used in the previous notebooks and which contained Indian first names crawled from a website.

We will also keep software like <a href="https://chat.openai.com/">ChatGPT</a>, <a href="https://gemini.google.com/app">Google Gemini</a>, and other Large Language Models (LLMs) in mind and create our own little **Generative Pre-Trained Transformer (GPT)**...

And you would probably know by now what these models are and what they do...

Our **GPT** is going to be a *character level language model*, instead of a *sub-word-tokenized model* like the ones software such as ChatGPT and Google Gemini use, and we will discuss everything in a bit...

And we will not write all the `torch.nn` modules from scratch now, because we have already covered the most important ones in our previous notebooks. Instead, we will implement the networks using the `torch.nn` library from PyTorch...

And most importantly, we will start from the simplest model (Bigram Model) and modify the same model within the same notebook, making our way up to the entire `Transformer` architecture...

I have created a folder named `GPT Scripts` in the root of this repository that saves each script, so you can run them without even having to use Jupyter notebooks. You can simply use the `<filename>.py` to run each model and see how they perform on the same dataset as we move up, using:
```bash
python <filename>.py
```

One more thing to discuss: remember that in our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20-%20WaveNet.ipynb">NameWeave - WaveNet</a> notebook we referred to the multiple dimensions of a tensor by our own made-up names. In the real world these dimensions have standard names, specifically the **batch**, **time** and **channel** dimensions, in this order:

$$
(B, T, C)\rightarrow(Batch, Time, Channel)
$$

And we will refer to these dimensions using the actual names used in real world this time...
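
To make the naming concrete, here is a minimal sketch of a `(B, T, C)` tensor. The sizes are hypothetical and chosen only for illustration, they are not the sizes we will use later:

```python
import torch

# Hypothetical sizes, chosen only for illustration
B, T, C = 4, 8, 32  # batch, time, channel

x = torch.zeros(B, T, C)
print(x.shape)        # the whole batch: (B, T, C)
print(x[0].shape)     # one sequence out of the batch: (T, C)
print(x[0, 0].shape)  # one time step of that sequence: (C,)
```

Indexing away the first dimension picks one sequence, and indexing away the second picks one position in that sequence, leaving only its channels.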

So, it's going to be a long journey and it is going to be legen...wait-for-it...dary. Legendary!!!\
![Barney Stinson Wink](https://media.tenor.com/nJ3EeUPhVKkAAAAM/barny-stinson.gif)

So let's get started...

# Understanding GPT

So what is a **GPT**?

Well, **GPT** stands for **Generative Pre-Trained Transformer**.

You see how it consists of three words?

Let's look at it in context of each word:
1. Generative → Generates New Content
2. Pre-Trained → Pre-trained on a dataset
3. Transformer → Transformer architecture is being followed to make up the model (Don't worry we will discuss this later)

That was easy...

Let's understand what **GPT** can do currently...

Let's take `ChatGPT` as an example for now and look at its capabilities...

![ChatGPT Current Capabilities](https://miro.medium.com/v2/resize:fit:679/1*_3AM0Yhc7qgCvZ_X1L8mhw.gif)

We see that `GPT` goes from left to right and generates text sequentially...

I wanted to show another thing:
![ChatGPT Different Responses](ExplanationMedia/Images/ChatGPT_Different_Response.gif)

See how we get a different response each time?

Which hints that it is a probabilistic system: for any one `prompt` it can give us multiple answers...
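
Here is a rough sketch of where that randomness can come from, assuming the model produces a probability for every possible next token and we *sample* from it. The probabilities below are made up for illustration, and this is not a claim about how `ChatGPT` itself is configured:

```python
import torch

# Hypothetical next-token probabilities over a 4-token vocabulary (made up)
probabilities = torch.tensor([0.1, 0.2, 0.3, 0.4])

# Drawing from the distribution, instead of always picking the most likely token,
# is what lets the same starting point lead to different continuations
samples = [torch.multinomial(probabilities, num_samples=1).item() for _ in range(10)]
print(samples)
```

Running the cell a few times without fixing a seed prints different lists, which mirrors the different responses we saw above.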

Now, this is just one example of a `prompt`, and people have come up with billions of different prompts as of now, and in fact there are many websites that index the interactions with `ChatGPT` as well.

You can look at this <a href="https://writesonic.com/blog/best-chatgpt-examples">website</a> as an example.

We see that it is a very remarkable system, and it is what we call a **"Language Model"**.

Or in other words, it models the sequence of words or characters (or "tokens" more generally) and it knows how words follow each other in the English language (and other languages too)...

Let's understand what **GPT** does from its perspective...

Well, it is trying to complete the sequence...

In other words, it treats the `inputs` or `task` that we give it as the *start of a sequence* and tries to complete that sequence. Which makes it a language model in this sense...

You would think that it is utterly ridiculous and that we cannot just model an entire architecture and make it act like a helpful assistant.

Well that is the beauty of it. And we will discuss all the under-the-hood components of what makes a software like `ChatGPT` work.

So, What is the neural network architecture under-the-hood that models this sequence of words/characters/tokens?

That comes from this paper from Google: <a href="https://arxiv.org/abs/1706.03762">"Attention Is All You Need"</a> from 2017.

This was a landmark paper in Artificial Intelligence that proposed the `Transformer` architecture. But if you start reading this paper, it may seem like a pretty random *machine-translation* paper. And that's because, I think the authors did not fully anticipate the impact it would create in this domain in the years to come...

Let's look at the original `Transformer` architecture as of now:
![Transformer Architecture](ExplanationMedia/Images/Transformer_Model_Architecture.png)

And this `Transformer` architecture has been copy-pasted into a huge number of applications in recent years...

And what we'd like to do now is create something like `ChatGPT`. But we would not be able to completely clone `ChatGPT` because it is a way more serious *production-grade* system which currently requires *thousands* of GPUs and *millions* of dollars to train the network, and also it is trained on a very good *chunk* of internet data. And there are a lot of **pre-training** and **fine-tuning** stages to it.

Rather, we would like to create a transformer-based language model, and in our case it is going to be a character level language model. We also don't want to train on a *chunk* of the internet, so we need a smaller dataset (I propose we work with `Harry_Potter_Books.txt`, which is roughly a `7MB` file). And we will try to model how the characters in this dataset follow each other.

Let's take this paragraph for example:
```python
"""
Mr. and Mrs. Dursley, of number four, Privet Drive, 
were proud to say that they were perfectly normal, 
thank you very much.
"""
```

Given a chunk of these characters in the past:
```python
"""
Mr. and Mrs. Dursley, of number four, Privet Drive, 
were proud to say that the
"""
```
The `Transformer` model will look at these characters as a context in the past, and it is going to predict that the letter `'y'` is likely to come next in the sequence. And it is going to produce (generate) character sequences that look like Harry Potter. And in that process it is going to model all the patterns inside this data.
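
The same idea as a tiny sketch in plain Python, where `passage`, `context` and `target` are names I made up for this illustration:

```python
passage = "were proud to say that they were perfectly normal"
context = "were proud to say that the"

# The passage starts with the context, so the target is simply the next character
assert passage.startswith(context)
target = passage[len(context)]
print(repr(context), "->", repr(target))
```

Every position in the text gives us one such (context, target) pair, which is exactly what the model will be trained on.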

And once we have trained the model, it will be able to generate *infinite `Harry Potter`*...

![Harry Potter Woo](https://media4.giphy.com/media/v1.Y2lkPTc5MGI3NjExcjBmNzJ5N2EzMDQzeTB3cXV4ODN5ZGJkdWlldHhleGw3d3hpMGRhMyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/TJO5x5QQM72Q0weWXN/giphy.gif)

So let's install the required dependencies and load our dataset up and look into the data and what it looks like first...

# Installing Dependencies

In [1]:
!pip install torch
!pip install numpy
!pip install pandas
!pip install matplotlib



# Importing Libraries

In [2]:
import random
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

# Loading Dataset

This time, I will divide the dataset loading part into two forms:
1. If you're trying to use `Google Colab` to run the code
2. If you're trying to use `Jupyter Notebook locally` to run the code

And you can choose either one of those depending on your desired mode...

## For Google Colab users

This will download the `Harry_Potter_Books.txt` into your current folder...

In [None]:
!wget https://raw.githubusercontent.com/AvishakeAdhikary/Neural-Networks-From-Scratch/main/Datasets/Harry_Potter_Books.txt

Now you can load up the dataset like this:

In [None]:
with open('Harry_Potter_Books.txt', 'r', encoding='utf-8') as file:
    text = file.read()

## For local Jupyter Notebook users

You don't have to download the dataset if you have the entire repository cloned.

The dataset `Harry_Potter_Books.txt` is already located inside the `Datasets` directory...

So we can simply open the file and look at its content by specifying the relative path...

In [3]:
with open('Datasets/Harry_Potter_Books.txt', 'r', encoding='utf-8') as file:
    text = file.read()
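
If you'd rather not pick a mode by hand, here is a small sketch that tries several locations in order. `loadText` is a hypothetical helper of mine, not something used elsewhere in this notebook, and the demonstration uses a temporary file so that it runs anywhere:

```python
import tempfile
from pathlib import Path

def loadText(candidatePaths):
    # Return the contents of the first candidate path that exists
    for candidate in candidatePaths:
        path = Path(candidate)
        if path.is_file():
            return path.read_text(encoding='utf-8')
    raise FileNotFoundError(f"None of these paths exist: {list(candidatePaths)}")

# Demonstration with a temporary file standing in for the dataset
with tempfile.TemporaryDirectory() as folder:
    samplePath = Path(folder) / 'sample.txt'
    samplePath.write_text('Mr. and Mrs. Dursley', encoding='utf-8')
    sampleText = loadText([Path(folder) / 'missing.txt', samplePath])

print(sampleText)
```

For this notebook, the call would look like `text = loadText(['Datasets/Harry_Potter_Books.txt', 'Harry_Potter_Books.txt'])`, which covers both the cloned repository and the Colab download.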

# Exploring the Dataset

We can look at the length of the entire dataset, that is, its number of characters...

In [6]:
print("Length of Dataset in Characters: ", len(text))

Length of Dataset in Characters:  6765190


We see that it is roughly `6.8 million` characters...

And if you want to look at the first `1000` characters we can do:
```python
print(text[:1000])
```
Which prints the output:
```python
"""
/ 




THE BOY WHO LIVED 

Mr. and Mrs. Dursley, of number four, Privet Drive, 
were proud to say that they were perfectly normal, 
thank you very much. They were the last people you’d 
expect to be involved in anything strange or 
mysterious, because they just didn’t hold with such 
nonsense. 

Mr. Dursley was the director of a firm called 
Grunnings, which made drills. He was a big, beefy 
man with hardly any neck, although he did have a 
very large mustache. Mrs. Dursley was thin and 
blonde and had nearly twice the usual amount of 
neck, which came in very useful as she spent so 
much of her time craning over garden fences, spying 
on the neighbors. The Dursley s had a small son 
called Dudley and in their opinion there was no finer 
boy anywhere. 

The Dursleys had everything they wanted, but they 
also had a secret, and their greatest fear was that 
somebody would discover it. They didn’t think they 
could bear it if anyone found out about the Potters. 
Mrs. Potter was Mrs. Dursl
"""
```

# Building Vocabulary

We can now start building our vocabulary, just like we did in our previous notebooks...

In [4]:
characters = sorted(list(set(text))) # Gives us all the unique characters that appear in the dataset, in sorted order
vocabularySize = len(characters) # We define a common vocabulary size
print("Characters:", characters)
print("Vocabulary Size:", vocabularySize)

Characters: ['\n', ' ', '!', '"', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '\\', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|', '~', '—', '‘', '’', '“', '”', '•', '■', '□']
Vocabulary Size: 92


So we have a possible `vocabulary` of `92` characters that our model will be able to see or emit...

Now we would like to develop a strategy to <strong><i>tokenize</i></strong> our input `text`.

And when we say **tokenize** we generally mean to convert raw text as a string to some sequence of integers according to some vocabulary of possible elements...

For us, because we are developing a character level language model, so we are simply going to be translating individual `characters` into `integers`.

And we will build `4` things here:
1. String to Index Vocabulary → `stoi` → A map of `characters` to `integers`
2. Index to String Vocabulary → `itos` → A map of `integers` to `characters`
3. Token Encoder → That will encode a sequence of characters into indices
4. Token Decoder → That will decode a sequence of encoded indices back into characters

And you will be able to recognize the first two from our previous notebooks...

Before we dive in, let's understand the Python concept of `lambda` functions, which some of you might have forgotten...

So what are `lambda` functions?

Python lambda functions are *anonymous functions*, meaning functions without a name. As we already know, the `def` keyword is used to define a normal function in Python. Similarly, the `lambda` keyword is used to define an anonymous function.

Syntax: `lambda arguments : expression`

For example:
```python
output = lambda input: input+1
print(output(input=1))
```
We would print `2`.
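
To see that a `lambda` is just a compact way of writing a small function, here is a minimal sketch comparing the two forms. The names `addOneLambda` and `addOneFunction` are mine, chosen only for this illustration:

```python
# The lambda form: a single expression, no name of its own
addOneLambda = lambda value: value + 1

# The equivalent def form
def addOneFunction(value):
    return value + 1

print(addOneLambda(1), addOneFunction(1))
```

Both calls return the same result. The `lambda` form is handy for one-liners like the encoder and decoder we are about to write.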

Now, let me first run it, then I will explain it later:

In [5]:
stoi = {character:index for index, character in enumerate(characters)}
itos = {index:character for index, character in enumerate(characters)}
encode = lambda string: [stoi[character] for character in string] # Token Encoder that takes in a string as an input, and outputs a list of integers
decode = lambda indices: ''.join([itos[index] for index in indices]) # Token Decoder that takes in the encoded list of integers and outputs the decoded string (parameter is not named `list`, to avoid shadowing the built-in)

print("STOI:", stoi)
print("ITOS:", itos)
print("Encoded Text: ", encode("Legendary"))
print("Decoded Text: ", decode(encode("Legendary")))

STOI: {'\n': 0, ' ': 1, '!': 2, '"': 3, '%': 4, '&': 5, "'": 6, '(': 7, ')': 8, '*': 9, ',': 10, '-': 11, '.': 12, '/': 13, '0': 14, '1': 15, '2': 16, '3': 17, '4': 18, '5': 19, '6': 20, '7': 21, '8': 22, '9': 23, ':': 24, ';': 25, '>': 26, '?': 27, 'A': 28, 'B': 29, 'C': 30, 'D': 31, 'E': 32, 'F': 33, 'G': 34, 'H': 35, 'I': 36, 'J': 37, 'K': 38, 'L': 39, 'M': 40, 'N': 41, 'O': 42, 'P': 43, 'Q': 44, 'R': 45, 'S': 46, 'T': 47, 'U': 48, 'V': 49, 'W': 50, 'X': 51, 'Y': 52, 'Z': 53, '\\': 54, ']': 55, 'a': 56, 'b': 57, 'c': 58, 'd': 59, 'e': 60, 'f': 61, 'g': 62, 'h': 63, 'i': 64, 'j': 65, 'k': 66, 'l': 67, 'm': 68, 'n': 69, 'o': 70, 'p': 71, 'q': 72, 'r': 73, 's': 74, 't': 75, 'u': 76, 'v': 77, 'w': 78, 'x': 79, 'y': 80, 'z': 81, '|': 82, '~': 83, '—': 84, '‘': 85, '’': 86, '“': 87, '”': 88, '•': 89, '■': 90, '□': 91}
ITOS: {0: '\n', 1: ' ', 2: '!', 3: '"', 4: '%', 5: '&', 6: "'", 7: '(', 8: ')', 9: '*', 10: ',', 11: '-', 12: '.', 13: '/', 14: '0', 15: '1', 16: '2', 17: '3', 18: '4', 19: 

The `Token Encoder` here takes in a `string` or a `sequence of characters` and encodes it into a `list of integers` based on `stoi` mapping. And the `Token Decoder` takes in the encoded `list of integers` and decodes it based on `itos` mapping to get back the exact same string...

In other words, it is more like a translation of `characters` into `integers` and `integers` into `characters`, because our model is going to be a character level language model.
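
Here is the same idea on a toy text, as a sketch you can run without the dataset. All the `sample...` names are made up for this illustration:

```python
sampleText = "Harry Potter"
sampleCharacters = sorted(set(sampleText))  # unique characters, sorted
sampleStoi = {character: index for index, character in enumerate(sampleCharacters)}
sampleItos = {index: character for index, character in enumerate(sampleCharacters)}

encoded = [sampleStoi[character] for character in "Potter"]
decoded = ''.join(sampleItos[index] for index in encoded)
print(encoded, decoded)

# A character that never appeared in the text has no index at all
print('z' in sampleStoi)
```

Decoding an encoded string always gives back the original, but only characters seen when building the vocabulary can be encoded. With our real `stoi`, a character missing from the dataset would raise a `KeyError`.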

Now this is only one of many possible `encodings` or `tokenizers` that are out there in the world right now...

And people have come up with many such `tokenizers`, for example, Google uses <a href="https://github.com/google/sentencepiece">`sentencepiece`</a>, OpenAI uses <a href="https://github.com/openai/tiktoken">`tiktoken`</a>...

And these `tokenizers` are `sub-word` tokenizers: they encode **neither** `entire words` **nor** `individual characters`, but `sub-word` units, which is usually what's adopted in practice...

As an example let's take `tiktoken` vocabulary which uses `Byte-Pair Encoding (BPE)` to encode these `tokens`:\
![Tiktoken Vocabulary](ExplanationMedia/Images/Tiktoken_Vocabulary.png)

We see that this `tiktoken` vocabulary has `50257` tokens, whereas ours has just `92`.

And when we try to encode a sample string in `tiktoken`, we get:\
![Tiktoken Example](ExplanationMedia/Images/TikToken_Example.png)

We see that we only get `3` outputs for an entire string of `9` characters...

Which means that we can *trade-off* `sequences of integers` and `vocabularies`...

In other words, we can have a very long `sequences of integers` and very short `vocabularies` or we can have very short `sequences of integers` and very long `vocabularies`...
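
To get a feel for this trade-off without installing anything, here is a toy sketch of a single merge step in the spirit of Byte-Pair Encoding. It is a simplification of my own (one hand-picked pair, `'t'` followed by `'h'`), not the actual algorithm or vocabulary used by `tiktoken`:

```python
toyText = "the theory then"

# Character-level tokens and vocabulary
tokens = list(toyText)
vocabulary = set(tokens)

# Merge every adjacent ('t', 'h') pair into one new token 'th'
merged = []
position = 0
while position < len(tokens):
    if position + 1 < len(tokens) and tokens[position] == 't' and tokens[position + 1] == 'h':
        merged.append('th')
        position += 2
    else:
        merged.append(tokens[position])
        position += 1
mergedVocabulary = vocabulary | {'th'}

print(len(vocabulary), len(tokens))        # vocabulary size and sequence length before
print(len(mergedVocabulary), len(merged))  # vocabulary size and sequence length after
```

Adding one entry to the vocabulary made the sequence shorter. Real sub-word tokenizers repeat this kind of step many thousands of times, which is how they end up with large vocabularies and short sequences.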

But for now I'd like to keep our `tokenizer` extremely simple using our own character-level tokenizer (meaning we have very small `vocabulary`) and very simple `encode` and `decode` functions, but we do get very long `sequences of integers` as a result...

Don't worry, if you'd like I will build a `tokenizer` in the future...

So let's now move forward...

Now that we have a `token encoder` and a `token decoder` or effectively a `tokenizer` we can move forward and encode our entire `Harry Potter` dataset...

And we will use <a href="https://pytorch.org/">PyTorch</a> library for that:
![PyTorch Logo](ExplanationMedia/Images/PyTorchLogo.svg)

So we can now encode our `text` data and wrap it into a `tensor` of datatype `long` (64-bit integers), because each token is an integer index that we will later use to look up rows in an embedding table, like this:
```python
data = torch.tensor(encode(text), dtype=torch.long)
```

In [6]:
data = torch.tensor(encode(text), dtype=torch.long)
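
As a quick sanity check on the datatype, here is a small sketch showing that `torch.long` is PyTorch's 64-bit integer type and that such a tensor can be used to index an embedding table. The embedding dimension of `4` is a hypothetical value chosen only to keep the example small:

```python
import torch

indices = torch.tensor([13, 1, 0], dtype=torch.long)
print(indices.dtype)  # torch.long is an alias for torch.int64

# Hypothetical table: 92 rows (one per character), 4 numbers per row
embedding = torch.nn.Embedding(92, 4)
print(embedding(indices).shape)  # one row looked up per index
```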

And then we can check the `shape` and `type` of this data and print out the first `100` elements, just like we printed the first characters before (but without decoding them), like this:
```python
print(data.shape, data.dtype)
```
And we get:
```python
torch.Size([6765190]) torch.int64
```
And:
```python
print(data[:100])
```
And we get:
```python
tensor([13,  1,  0,  0,  0,  0,  0, 47, 35, 32,  1, 29, 42, 52,  1, 50, 35, 42,
         1, 39, 36, 49, 32, 31,  1,  0,  0, 40, 73, 12,  1, 56, 69, 59,  1, 40,
        73, 74, 12,  1, 31, 76, 73, 74, 67, 60, 80, 10,  1, 70, 61,  1, 69, 76,
        68, 57, 60, 73,  1, 61, 70, 76, 73, 10,  1, 43, 73, 64, 77, 60, 75,  1,
        31, 73, 64, 77, 60, 10,  1,  0, 78, 60, 73, 60,  1, 71, 73, 70, 76, 59,
         1, 75, 70,  1, 74, 56, 80,  1, 75, 63])
```

And we see that we have a massive list of integers, which is an identical translation of the first `100` characters in the `text` file...

And the entire dataset is just stretched out into a very large `sequence of integers`...

Now before we move on, we would like to do one more thing: split our dataset into a `Train` and `Validation` split...

So let's do that...

# Splitting the Dataset into `Training` and `Validation` splits

Now I'd like to split our data into a split of:
1. First 90% into `Training` Split
2. Last 10% into `Validation` Split

And we are doing this to understand to what extent our model is `overfitting`...

Because we don't want our model to copy and recreate the exact `Harry Potter` books. Instead, we want a model that will create `Harry Potter`-like text...

And here's how I do that:

In [7]:
nintyPercentOfDatasetLength = int((0.9 * len(data)))
trainingData = data[:nintyPercentOfDatasetLength] # Data up till 90% of the length
validationData = data[nintyPercentOfDatasetLength:] # Data from 90% of the length
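
The same slicing on a toy tensor, as a sketch to confirm that nothing is lost or duplicated at the boundary. The `toy...` names are made up for this illustration:

```python
import torch

toyData = torch.arange(10)  # stands in for the encoded dataset
splitIndex = int(0.9 * len(toyData))
toyTraining = toyData[:splitIndex]    # first 90%
toyValidation = toyData[splitIndex:]  # last 10%

print(toyTraining, toyValidation)
```

Note that this is a contiguous split with no shuffling, so the validation split is simply the tail end of the text.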

We can now move on to the next part...

# Loading Data Into Batches

We would like to now proceed to feed these integer sequences into the neural network so that it can train and learn those patterns...

**BUT**

We need to realise that we are not going to feed in the entire dataset into the neural network because that is going to be extremely computationally heavy, and rather we would load the data into small batches or *chunks* of data...

Now I typically use the term `block size` for the length of such a *chunk*, but it goes by other names as well, for example `context length`.

Let's start with a `blockSize` of just `8`...
```python
blockSize = 8
```

Now let's look at the first *chunk* of training data (`blockSize` of `8` + `1`)...

I'll explain why this `+1` is there in a bit...

So we have:
```python
trainingData[:blockSize + 1]
```
For which we get the sequence:
```python
tensor([13,  1,  0,  0,  0,  0,  0, 47, 35])
```

Which is the first `9` characters in a `sequence` in the `training-set`...

Now, I'd like to point out that when we take out a *chunk* of data like this, we actually have **multiple examples** packed within it, because all of these characters **follow** each other...

And we are going to simultaneously train it at every one of these positions...

And in a chunk of `9` characters, there's actually `8` individual examples packed in there...

How so?

Let's look at it this way...

For our example:
```python
[13,  1,  0,  0,  0,  0,  0, 47, 35]
```
1. In the context of `[13]` → `1` is likely to come next,
2. In the context of `[13,  1]` → `0` is likely to come next,
3. In the context of `[13,  1,  0]` → `0` is likely to come next,
4. In the context of `[13,  1,  0,  0]` → `0` is likely to come next,
5. In the context of `[13,  1,  0,  0,  0]` → `0` is likely to come next,
6. In the context of `[13,  1,  0,  0,  0,  0]` → `0` is likely to come next,
7. In the context of `[13,  1,  0,  0,  0,  0,  0]` → `47` is likely to come next,
8. In the context of `[13,  1,  0,  0,  0,  0,  0, 47]` → `35` is likely to come next.

Summing up to `8` individual examples in our case, which is the `blockSize`, and we take the `+1` so that the last example also has a `label` for training...

Let's see how we can achieve the same output in a code snippet:
```python
inputs = trainingData[:blockSize] # First chunk of characters
outputs = trainingData[1:blockSize + 1] # First chunk of characters offset by 1
for i in range(blockSize):
    context = inputs[:i+1] # Context is the inputs up to and including position i
    label = outputs[i] # Label is the character that follows the context
    print(f"When input example is {context}, then the label is: {label}")
```

For which we get:
```python
When input example is tensor([13]), then the label is: 1
When input example is tensor([13,  1]), then the label is: 0
When input example is tensor([13,  1,  0]), then the label is: 0
When input example is tensor([13,  1,  0,  0]), then the label is: 0
When input example is tensor([13,  1,  0,  0,  0]), then the label is: 0
When input example is tensor([13,  1,  0,  0,  0,  0]), then the label is: 0
When input example is tensor([13,  1,  0,  0,  0,  0,  0]), then the label is: 47
When input example is tensor([13,  1,  0,  0,  0,  0,  0, 47]), then the label is: 35
```

One more thing to mention is that we train on all `8` of these examples, with context lengths from `1` all the way up to `blockSize`, and not just for efficiency.

**We also want our network to get *"used to"* seeing these examples for context of as small as `1` all the way up to the `blockSize` and everything in between.**

So, during `inference` we can start sampling from as little as `1` character of context. Once it starts sampling, the context can grow all the way up to `blockSize`, and beyond that we have to start `truncating`, because the neural network must never receive more than `blockSize` inputs when it is trying to predict the next character.
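
A minimal sketch of that truncation, assuming the running context is kept as a `(1, T)` tensor of token indices. The names `context` and `croppedContext` are mine, chosen for this illustration:

```python
import torch

blockSize = 8

# Hypothetical running context of 12 tokens, longer than blockSize
context = torch.arange(12).unsqueeze(0)  # shape (1, 12)

# Keep only the most recent blockSize tokens
croppedContext = context[:, -blockSize:]
print(croppedContext)

# A context shorter than blockSize passes through unchanged
shortContext = context[:, :3]
print(shortContext[:, -blockSize:])
```

Negative slicing keeps the newest tokens and drops the oldest ones, and it does nothing when the context is still short.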

And the input examples that we just looked at run along the `Time` dimension...

But we need to care about the `Batch` dimension now... And that is because, every time we feed these *chunks* of text into a `Transformer`, we are going to have **mini-batches** of multiple chunks that are all **stacked up** in a single tensor (this is done for efficiency, to keep the `GPU`s busy, because they are very good at parallel processing of data; the *chunks* are processed at the same time, but completely independently, and they don't "talk" to each other)...

We will also set up a `seed` so that whatever numbers I see here in my system, you are going to see the same numbers in your system later as well...

Let's now generalize our discussion into code and I will discuss what is happening in the code one by one...

More specifically, let's define a `getBatch()` method that picks out batches of size `batchSize`...

Now don't get confused between `blockSize` and `batchSize`...

$$
\begin{aligned}
batchSize &\rightarrow \text{The number of independent sequences of characters we want to process in parallel} \\
blockSize &\rightarrow \text{The maximum context length of predictions}
\end{aligned}
$$

We will now reuse our earlier *example* code to get batches of examples, this time in the context of `batchSize`. We pick `batchSize` random indices from the dataset, slice a chunk of `blockSize` characters starting at each index (and the same chunk offset by one for the labels), and combine the chunks using `torch.stack`, which concatenates a sequence of tensors along a new dimension. This makes `inputBatch` and `outputBatch` each a `(4, 8)` tensor, where each row of `inputBatch` is a *chunk* of the training set, and `outputBatch` will be used all the way at the end, in the **loss function**...

So, each of these `examples in a sequence` can be spelled out just like we did before to get the corresponding `label` for each of them...
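
Here is a small sketch of what `torch.stack` does with such chunks. I use fixed start indices and a toy `torch.arange` dataset instead of random indices, purely so the result is easy to read:

```python
import torch

toyData = torch.arange(20)  # stands in for the encoded dataset
toyBlockSize = 4
startIndexes = [0, 5, 10]   # fixed here, random in the real getBatch()

# Each chunk is a 1-dimensional tensor; stacking adds a new leading dimension
toyInputs = torch.stack([toyData[i:i + toyBlockSize] for i in startIndexes])
toyOutputs = torch.stack([toyData[i + 1:i + toyBlockSize + 1] for i in startIndexes])

print(toyInputs)
print(toyOutputs)
```

Three chunks of length `4` become one `(3, 4)` tensor, and every row of `toyOutputs` is the matching row of `toyInputs` shifted by one position.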

In [8]:
# We define a manual seed such that you see the same numbers I see in my machine
torch.manual_seed(69420)
batchSize = 4 # Number of independent sequences of characters we want to process in parallel
blockSize = 8 # Maximum context length of predictions

def getBatch(split):
    # Take the trainingData if the split is 'train' otherwise take the validationData
    data = trainingData if split=='train' else validationData
    # Generates random integers of batchSize between 0 and len(data) - blockSize
    indexes = torch.randint(high=len(data) - blockSize, size=(batchSize,))
    # Takes the inputs and outputs after stacking them up in a single tensor
    inputs = torch.stack([data[i:i+blockSize] for i in indexes])
    outputs = torch.stack([data[i+1:i+blockSize+1] for i in indexes])
    return inputs, outputs

# We call the method to initialize inputBatch and outputBatch
inputBatch, outputBatch = getBatch('train')

print("Inputs:", inputBatch.shape, " Values:" , inputBatch)
print("Outputs:", outputBatch.shape, " Values:" , outputBatch)

print('---------------------------------------------')

for batchIndex in range(batchSize):
    for blockIndex in range(blockSize):
        context = inputBatch[batchIndex, : blockIndex+1]
        label = outputBatch[batchIndex, blockIndex]
        print(f"When input example is {context.tolist()}, then the label is: {label}")

Inputs: torch.Size([4, 8])  Values: tensor([[70, 70, 67,  1, 70, 61,  1, 75],
        [59,  1,  0, 56, 75,  1, 75, 63],
        [59,  1,  0, 71, 67, 60, 69, 75],
        [75, 63, 60,  1, 68, 56, 81, 60]])
Outputs: torch.Size([4, 8])  Values: tensor([[70, 67,  1, 70, 61,  1, 75, 63],
        [ 1,  0, 56, 75,  1, 75, 63, 60],
        [ 1,  0, 71, 67, 60, 69, 75, 80],
        [63, 60,  1, 68, 56, 81, 60, 10]])
---------------------------------------------
When input example is [70], then the label is: 70
When input example is [70, 70], then the label is: 67
When input example is [70, 70, 67], then the label is: 1
When input example is [70, 70, 67, 1], then the label is: 70
When input example is [70, 70, 67, 1, 70], then the label is: 61
When input example is [70, 70, 67, 1, 70, 61], then the label is: 1
When input example is [70, 70, 67, 1, 70, 61, 1], then the label is: 75
When input example is [70, 70, 67, 1, 70, 61, 1, 75], then the label is: 63
When input example is [59], then the lab

Now that we have our `inputBatch` and `outputBatch` we can start feeding these batches into a neural network and start getting predictions....

Now we are going to start off with the simplest possible neural network, which in my opinion is a `Bigram Language Model` which was already covered in our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave</a> notebook in a lot of depth, and we will rather go faster and implement a `PyTorch Module` directly that implements the `Bigram Language Model`.

# Bigram Language Model

Before we implement the bigram model, I'd like to discuss this syntax in python:
```python
class A:
    x = 10

class B(A):
    def show(self):
        print(self.x)

y = B()
y.show()
```
For which we get the output:
```python
10
```
This syntax (`B(A)`) is used for `inheritance`, which is very helpful for implementing our own `Torch Modules`. That is because `torch.nn.Module` already implements much of the machinery for us: we only define `forward(*input)` to describe the forward pass, and `Module` internally manages the `__call__()` method so that calling the model runs `forward`. We can return our calculated `logits` from the `Module` that we are going to define, and later call `backward()` on the loss computed from them...
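
A minimal sketch of that mechanism, using a made-up `Doubler` module that has nothing to do with language modelling:

```python
import torch

class Doubler(torch.nn.Module):
    # We only define forward(); __call__ is inherited from torch.nn.Module
    def forward(self, x):
        return 2 * x

doubler = Doubler()
result = doubler(torch.tensor([1.0, 2.0]))  # calling the instance runs forward()
print(result)
```

We never call `forward()` directly. Calling the instance is the usual way, because it goes through the `__call__()` logic that `torch.nn.Module` provides.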

So, we understand that we need an `Embedding Look-Up Table` with one row for each character in the vocabulary. From our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20-%20Multi%20Layer%20Perceptron.ipynb">NameWeave - Multi Layer Perceptron</a> notebook we understand that when we index into such a two-dimensional embedding table using two-dimensional **batches of input**, we get a three-dimensional output tensor of shape `(B, T, C)`, whose dimensions we can now refer to as the `Batch`, `Time` and `Channel` dimensions. In other words, for each input token in the batch, it picks out a row of the embedding table. In our case, `Batch` is `4` which is the `batchSize`, `Time` is `8` which is the `blockSize` and `Channel` is the embedding dimension we will specify...

These indexed embeddings, for now can be interpreted as the `logits` or the scores for the next character in a sequence. Or in other words, we are predicting what comes next based on just the individual identity of a single token (which means they are not *talking* to each other). For example, if a token is say `69`, the token itself will be able to make pretty decent predictions of what comes next by knowing that the token is in fact `69`...
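
Here is a small sketch of that look-up, assuming a `92 x 92` table like the one we are about to build. The input batch is made up so that token `69` appears at two different positions:

```python
import torch

torch.manual_seed(0)
table = torch.nn.Embedding(92, 92)  # one row of 92 scores per character

# Hypothetical (B, T) = (2, 2) batch with token 69 in two different places
batch = torch.tensor([[69, 1], [0, 69]])
lookedUp = table(batch)
print(lookedUp.shape)  # (B, T, C)

# The look-up is plain row indexing into the table's weight matrix
print(torch.equal(lookedUp[0, 0], table.weight[69]))

# Same token, same scores, regardless of where it sits or what surrounds it
print(torch.equal(lookedUp[0, 0], lookedUp[1, 1]))
```

That last line is the bigram limitation in a nutshell: the prediction depends only on the identity of the current token.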

So now, let's write our first `model` out, and test it...

## Writing out BigramModule

In [9]:
class BigramLanguageModel(torch.nn.Module):
    # Constructor for the model
    def __init__(self, vocabularySize):
        # Initializing the embedding table
        super().__init__()
        self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, vocabularySize)

    # Forward Pass
    def forward(self, indeces):
        # Index into embeddings to get the logits
        logits = self.tokenEmbeddingTable(indeces) # (B, T, C)
        return logits

# Testing the model
model = BigramLanguageModel(vocabularySize)
outputs = model(inputBatch)
print(outputs.shape)

torch.Size([4, 8, 92])


Looks like we do get the scores for every one of those `(4, 8)` positions...
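
These scores are not probabilities yet. As a sketch, assuming random numbers standing in for the model's `logits`, a softmax over the last (channel) dimension turns each position's `92` scores into a distribution over the next character:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
randomLogits = torch.randn(4, 8, 92)  # stand-in for the model output, (B, T, C)

probabilities = F.softmax(randomLogits, dim=-1)  # normalize over the channel dimension
sums = probabilities.sum(dim=-1)                 # one total per (batch, time) position
print(probabilities.shape, sums.shape)
```

Every one of the `(4, 8)` positions gets its own distribution, and each of those sums to `1` up to floating-point error.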

## Understanding `cross_entropy` loss

Now that we have the predictions of *'what comes next'* we'd like to evaluate the `loss function`. And in the `NameWeave` series we saw that a good way to measure the `loss` or the quality of the predictions is to use the **Negative Log Likelihood** loss, which PyTorch implements (applied directly to the `logits`) under the name of `cross_entropy`.

And remember how I said the `outputBatch` is required when we calculate the `loss function`? 

This is exactly when we would require the `outputBatch` to calculate the difference on the predictions or the `logits` and their corresponding `labels`. Or in other words, we have the identity of the next character, but how well are we predicting the next character based on the `logits`...

And we'd like to call the `cross_entropy` in its **functional** form, which means we don't have to create a `Module` for it. But when we look at the documentation of <a href="https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html">CrossEntropyLoss</a> of PyTorch, we see that for **multi-dimensional inputs** (which is what we have, because we have a `(B, T, C) tensor`), PyTorch `CrossEntropyLoss` expects the channel dimension to come second: $(minibatch,C,d_1,d_2,\ldots,d_K)$ or, in our terms, a `(B, C, T) tensor`.

So what to do now with our `(B, T, C) tensor` that we already have?

Well we will try to **reshape** our tensor now, in order to fit those `logits` as well as the `labels` which is a `(B, T) tensor`...

Now because we are only interested in manipulating the `embeddings` of the `logits` (which for our case is the `channel` dimension), we can start treating everything else as a **batch-dimension**.

This is good because PyTorch `CrossEntropyLoss` expects a $(minibatch,C)$ tensor...

So we can stretch our 3-dimensional tensor into 2-dimensional tensor by combining all the other batch dimensions into a single dimension by multiplying the dimension values into one using `view()` method of PyTorch...

And it looks something like this:
![CrossEntropyGPTExplanation](ExplanationMedia/Images/CrossEntropyGPTExplanation.png)
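As a quick sanity check of this reshaping idea, here is a small sketch on random stand-in tensors (`toyLogits` and `toyLabels` are made up for the illustration). It compares the flattened `(B*T, C)` form with the `(B, C, T)` layout from the documentation; the `permute` variant is only shown for comparison and is not what we use in the notebook:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch, time, channel = 4, 8, 92
toyLogits = torch.randn(batch, time, channel)          # (B, T, C)
toyLabels = torch.randint(0, channel, (batch, time))   # (B, T)

# Option 1: flatten batch and time into a single batch dimension
lossFlattened = F.cross_entropy(toyLogits.view(batch * time, channel),
                                toyLabels.view(batch * time))

# Option 2: move the channel dimension to position 1, giving (B, C, T)
lossPermuted = F.cross_entropy(toyLogits.permute(0, 2, 1), toyLabels)

print(lossFlattened.item(), lossPermuted.item())
print(torch.allclose(lossFlattened, lossPermuted))
```

Both calls average the loss over the same `B*T` positions, so the two numbers should agree up to floating point error.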

And then we can print out the `loss` to check where we are as well...

So let's do this now...

## Writing out Loss Function

In [10]:
class BigramLanguageModel(torch.nn.Module):
    # Constructor for the model
    def __init__(self, vocabularySize):
        # Initializing the embedding table
        super().__init__()
        self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, vocabularySize)

    # Forward Pass
    def forward(self, indeces, labels):
        # Index into embeddings to get the logits
        logits = self.tokenEmbeddingTable(indeces) # (B, T, C)
        # Pop out the shape dimensions
        batch, time, channel = logits.shape
        # Stretch out the logits and labels
        logits = logits.view(batch*time, channel)
        labels = labels.view(batch*time)
        # Calculate loss
        loss = F.cross_entropy(logits, labels)
        return logits, loss

# Testing the model
model = BigramLanguageModel(vocabularySize)
outputs, loss = model(inputBatch, outputBatch)
print("Shape of outputs:", outputs.shape)
print("Loss:", loss)

Shape of outputs: torch.Size([32, 92])
Loss: tensor(5.0409, grad_fn=<NllLossBackward0>)


Now because we have `92` possible characters, we can actually guess what our loss should be... And we have already covered this in our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20(MLP)%20-%20Activations%2C%20Gradients%20%26%20Batch%20Normalization.ipynb">NameWeave (MLP) - Activations, Gradients & Batch Normalization</a> notebook in more details...

But we are expecting our loss to be around:
$$
-\ln{\left(\frac{1}{92}\right)} = \ln{(92)} \approx 4.5218
$$

But right now we are getting around `5.0409`, which is higher than that. This tells us that our initial predictions are not perfectly diffuse (uniform): the randomly initialized `logits` make the model confidently wrong for some characters. But ultimately we are able to evaluate the loss...
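The expected starting loss can be reproduced with plain Python. This is only a back-of-the-envelope sketch; `assumedVocabularySize` is a name made up here and simply holds the `92` characters mentioned above, and the `1/200` probability is an arbitrary example value:

```python
import math

assumedVocabularySize = 92  # number of distinct characters reported in the notebook

# Loss of a model that assigns probability 1/92 to every character
uniformLoss = -math.log(1 / assumedVocabularySize)
print(round(uniformLoss, 4))

# If the correct character receives less than 1/92 (say 1/200, an arbitrary
# example), the loss for that position rises above the uniform baseline
confidentlyWrongLoss = -math.log(1 / 200)
print(round(confidentlyWrongLoss, 4))
```

So any loss above the uniform baseline at initialization means the model is, on average, putting less than `1/92` probability on the correct character.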

Now that we can evaluate the quality of the model, we'd like to also be able to `generate` from the model... And once again I'll go a little bit faster, because I covered a lot of these already in my previous notebooks of the `NameWeave` series...

## Understanding `generate()`

Now because we train our neural network with transformations in the **forward pass**, we can **forward** a single character index (say the new line character `'\n'` which is also the `0-th character` according to `itos` look-up table and is a fairly good example to forward) and get the predictions (`logits`) from the neural network. 

Now because these `logits` that come out are a *chunk* of a batch of examples, we need to focus on the **last** character in a chunk, because we are trying to predict the next character in a block (which is none other than the `Time` dimension), so we will pop out the `-1` element of that *chunk* (which will make `(B, T, C) tensor` → `(B, C) tensor`) to get the `logits` properly in a batch.

And as we remember, `logits` are un-normalized scores. Which means they need to go through a `softmax` to become normalized probabilities. We want to apply softmax along the **last dimension** (`dim=-1`, which is the `Channel` dimension), independently for each sequence in the batch. This is because we're treating each sequence independently when generating the next token.

And then after we have our `probabilities` we can sample `1` character at a time by using `multinomial` sampling distribution for a batch (Which makes it a `(B, C) tensor` → `(B, 1) tensor`) which is none other than the next index in a sequence.

And lastly we can **concatenate** the sampled `nextIndex` to the running `indeces` along the `Time` dimension, turning the `(B, T) tensor` into a `(B, T+1) tensor` each time we generate, and keep doing this in a loop...
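The shape changes of a single generation step can be sketched like this. The sizes are assumed toy values, and `toyLogits` is a random stand-in for what the model would return, so only the shapes matter here:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch, time, channel = 2, 3, 5                               # assumed toy sizes
toyContext = torch.zeros((batch, time), dtype=torch.long)    # (B, T)
toyLogits = torch.randn(batch, time, channel)                # stand-in model output, (B, T, C)

lastLogits = toyLogits[:, -1, :]                             # (B, C): last time step only
probabilities = F.softmax(lastLogits, dim=-1)                # (B, C): each row sums to 1
nextIndex = torch.multinomial(probabilities, num_samples=1)  # (B, 1): one sample per sequence
toyContext = torch.cat((toyContext, nextIndex), dim=1)       # (B, T+1)

print(lastLogits.shape, nextIndex.shape, toyContext.shape)
print(probabilities.sum(dim=-1))
```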

But we also need a stopper for our loop... And we will call it `maximumNewTokens`, which is the number of characters we want to generate from our neural network...

And see how we are not going to use `loss` here during generation?

So we also need to make `loss` optional in `forward`, so that it handles both kinds of calls (with and without `labels`) and skips the loss computation when it is not needed.

So, after this long explanation, we can modify our `BigramLanguageModel` now:

## Modifying `BigramLanguageModel` for `generate()`

In [11]:
class BigramLanguageModel(torch.nn.Module):
    # Constructor for the model
    def __init__(self, vocabularySize):
        # Initializing the embedding table
        super().__init__()
        self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, vocabularySize)

    # Forward Pass
    def forward(self, indeces, labels=None):
        # Index into embeddings to get the logits
        logits = self.tokenEmbeddingTable(indeces) # (B, T, C)
        if labels is None:
            loss = None
        else:
            # Pop out the shape dimensions
            batch, time, channel = logits.shape
            # Stretch out the logits and labels
            logits = logits.view(batch*time, channel)
            labels = labels.view(batch*time)
            # Calculate loss
            loss = F.cross_entropy(logits, labels)
        return logits, loss

    # Generation
    def generate(self, indeces, maximumNewTokens):
        for _ in range(maximumNewTokens):
            # Forward Through Model
            logits, loss = self(indeces)
            # Focus on the last time step
            logits = logits[:, -1, :]
            # Applying softmax for the last dimension
            probabilities = F.softmax(logits, dim=-1)
            # Sample from distribution
            nextIndex = torch.multinomial(probabilities, num_samples=1)
            # Concatenate currentIndex with nextIndex
            indeces = torch.cat((indeces, nextIndex), dim=1)
        return indeces

# Testing the model
model = BigramLanguageModel(vocabularySize)
outputs, loss = model(inputBatch, outputBatch)
print("Shape of outputs:", outputs.shape)
print("Loss:", loss)

Shape of outputs: torch.Size([32, 92])
Loss: tensor(5.3285, grad_fn=<NllLossBackward0>)


So we can start generating text from the model now...

Now because our `generate()` expects a `(B, T) tensor`, and we decided to give the first context to be a new line character `'\n'`, we create a tensor consisting of a single `0` of `(1, 1)` dimension like this (also setting the `dtype` to `long`):
```python
torch.zeros((1, 1), dtype=torch.long)
```
To get a tensor like this:
```python
tensor([[0]])
```

And let's say we want to generate `100` characters from the `model`, so we will pass the same value in `maximumNewTokens` parameter.

Now remember the example where the **GPT** was able to **generate** multiple responses, given the same `context`?

Well, our model also produces multiple responses in a batch given the same `context`, but we are currently interested in the first response so we need to specify `[0]` to pop out the first response, but our `decode` expects a list so we need to convert it back to a list using `tolist()` method, and then we will be able to print out our first response...

We can now pack all the concepts into a single line:

In [12]:
print(decode(model.generate(indeces=torch.zeros((1, 1), dtype=torch.long), maximumNewTokens=100)[0].tolist()))


"( ”wMi)nd],mRAYfSqL•p‘bI'Jfk\3y\Q1P'□?bZ■h’vrA?Sk? Z5;k22~A!1>E;57■z.~Bhrvq□G4e2ScxssA)FFCl~m7nnA~B


This is the output I get the first time:
```python

"( ”wMi)nd],mRAYfSqL•p‘bI'Jfk\3y\Q1P'□?bZ■h’vrA?Sk? Z5;k22~A!1>E;57■z.~Bhrvq□G4e2ScxssA)FFCl~m7nnA~B
```

Confused?

Don't worry, our model is still untrained, and it just predicts garbage values as of now, and we can train our model now, to make it better...

But I'd like to point out that our `generate()` method is written in a generalized way, which right now looks a bit ridiculous. That is because we feed it the entire context at every step, even though we only have the simplest Bigram Language Model. For example, to predict `'?'` as the next character, the model only needed the preceding `'k'` and not the entire context sequence: we build up the full history and then only look at its very last piece. The point of writing the `generate()` method this way is that we'd like to keep this function **fixed** when, later on, the model becomes able to look further back in the history. Right now the history is not used and so it looks silly, but we will be using the history later and that is why we want to do it this way...

Now we can move forward and train the model... So that our outputs become a lot less random...

## Training `BigramLanguageModel`

Now before training we'd like to initialize an `optimizer`.

Well `optimizers` are none other than the algorithms that take the `gradients` and `update` the parameters based on the `learning rates` and other variables.

And during our `NameWeave` series, we have only ever used **Stochastic Gradient Descent (SGD)**.

There are many types of `optimizers` that are out there, and the most common ones are:
1. Gradient Descent
2. Stochastic Gradient Descent
3. Adagrad
4. Adadelta
5. RMSprop
6. Adam

You can read a lot about them in this <a href="https://medium.com/analytics-vidhya/this-blog-post-aims-at-explaining-the-behavior-of-different-algorithms-for-optimizing-gradient-46159a97a8c1">Medium Post</a>.

But for now we will use <a href="https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html#torch.optim.AdamW">`AdamW`</a> optimizer. And we will specify the parameters to `optimize` on and also the `learning rate` within the arguments of the constructor during initialization...

And the typical good setting for the `learning rate` is roughly `3e-4` but for smaller networks such as this, we can get away with much higher learning rates such as `1e-3`...

So let's initialize the optimizer now...

In [13]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

Now let's write out our training loop.

And we can remember from our old `NameWeave` series that we used to:
1. Define a `batchSize` to train on
2. Open a loop for the number of iterations or `epochs`
3. Get each batch input and output
4. Forward the model to calculate `logits` and `loss`
5. Set the gradients to `0`
6. Call `loss.backward()` for the backward pass
7. Update the parameters using the `gradients` and the `learning rate`

So we can now write the same thing in code...

But unlike our `NameWeave` series, the steps for `5` and `7` are going to be a little bit different. Now that we are using the official `optimizer` from PyTorch, to set the gradients to `0` using the `optimizer` we would do something like this `optimizer.zero_grad(set_to_none=True)` and to update the parameters using the `optimizer` we would do something like this `optimizer.step()`.
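As a small standalone illustration of those two optimizer calls, here is a sketch on a single made-up parameter (`toyParameter`, `toyOptimizer` and the quadratic `toyLoss` are assumptions for the example and not part of our model):

```python
import torch

torch.manual_seed(0)

# A single trainable parameter, assumed purely for illustration
toyParameter = torch.nn.Parameter(torch.tensor([5.0]))
toyOptimizer = torch.optim.AdamW([toyParameter], lr=1e-1)

for step in range(3):
    toyLoss = (toyParameter ** 2).sum()        # a loss whose minimum is at 0
    toyOptimizer.zero_grad(set_to_none=True)   # step 5: clear the old gradients
    toyLoss.backward()                         # step 6: compute the new gradients
    toyOptimizer.step()                        # step 7: update the parameter
    print(step, toyLoss.item(), toyParameter.item())

# With set_to_none=True the gradient is dropped entirely instead of being zero-filled
toyOptimizer.zero_grad(set_to_none=True)
print(toyParameter.grad is None)
```

The parameter moves a little towards `0` on every step, and after `zero_grad(set_to_none=True)` its `.grad` is `None` rather than a tensor of zeros.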

So let's write out the steps that we defined in code:

In [14]:
# We define the number of epochs
epochs = 100
# We define the batch size
batchSize = 32

for _ in range(epochs):
    # Get the inputBatch and outputBatch
    inputBatch, outputBatch = getBatch('train')
    # Forward the model
    logits, loss = model(inputBatch, outputBatch)
    # Setting the gradients to None
    optimizer.zero_grad(set_to_none=True)
    # Backward to calculate gradients
    loss.backward()
    # Update the parameters
    optimizer.step()
    print(loss.item())

4.876785755157471
4.878622531890869
4.971441268920898
4.972983360290527
4.952727794647217
4.961724758148193
4.920673370361328
4.9921674728393555
4.9947509765625
4.968419075012207
4.937097072601318
5.007509708404541
5.0103230476379395
5.033933639526367
4.930401802062988
4.7458176612854
4.932359218597412
4.892716407775879
4.904999732971191
4.885517120361328
4.889068126678467
4.9522624015808105
4.900505542755127
4.931491374969482
4.893527984619141
4.834726810455322
4.967082500457764
4.929368495941162
4.919459342956543
4.922502040863037
5.044764041900635
4.856174468994141
4.875720024108887
4.952334403991699
4.802764415740967
4.9507856369018555
4.928844451904297
4.9131693840026855
4.853192329406738
4.858636856079102
4.826693058013916
4.905089855194092
4.839353084564209
4.882635116577148
4.794174671173096
4.853604793548584
4.845086097717285
4.806244850158691
4.9436869621276855
4.926150798797607
4.911518096923828
4.927879333496094
4.827909469604492
4.833471775054932
4.927444934844971
4.914450

Seems like we are optimizing the model, let's now keep the print statement outside to print the loss at the end and train for about `20000` steps...

In [15]:
# We define the number of epochs
epochs = 20000
# We define the batch size
batchSize = 32

for _ in range(epochs):
    # Get the inputBatch and outputBatch
    inputBatch, outputBatch = getBatch('train')
    # Forward the model
    logits, loss = model(inputBatch, outputBatch)
    # Setting the gradients to None
    optimizer.zero_grad(set_to_none=True)
    # Backward to calculate gradients
    loss.backward()
    # Update the parameters
    optimizer.step()
print(loss.item())

2.3757007122039795


So we see that we get a loss of `2.3757` and that we have significantly come down from our old loss...

## Sampling from Model

So we can try to sample/generate from our model by increasing the `maximumNewTokens` to around `500` to get a sense of a bigger output...

Now keep in mind that this is still the simplest possible model that we could have, and we still won't be able to get a very good result, but for now at least our loss has improved...

So let's generate text now...

In [16]:
print(decode(model.generate(indeces=torch.zeros((1, 1), dtype=torch.long), maximumNewTokens=500)[0].tolist()))


“Sifuriner m 

wat ol. ces 

n 
tigerrancoutheven’d t ayoour, rn fowachensprn sethelintese 
d HExpenecherine 
k dorerial SCetllyasino, gr pulet tse sand heiok hery, s oweve, atevealy. 
bover ulind th frishid, ls kne che me Pr, 





“Sooke, .K. d awastokller at inuntiveahuromitedde thamerthaplzad p 



pee woofot oulplelk Thethechllond ostosn hin. Sy Mr Chererm tck Ineyoun,” g, FEDus s iny, 
Habe s on. m,” ory tsoulabeay tonde Prokheata wredinghererongergredisst nabos'~4ot 
he acalis... font d h


It's still a big improvement over what we had earlier...

Right now, our `tokens` or `characters` are not talking to each other (because, given the previous context of what was generated, we are looking at the very last character to make the predictions about what comes next), and we'd now like to make our `tokens` talk to each other such that they can figure out what is in the `context` so that they can make better predictions about what comes next...

# Moving Code to GPU

You probably know about `GPU`s by now...

`GPU`s can process many pieces of data simultaneously, making them useful for machine learning, video editing, and gaming applications.

So why not update our code such that it is able to run on both a `CPU` and a `GPU` assuming what is available?

Great idea, but there's a slight problem... You see, there are many `GPU`s that are out there in the market, and I don't know which you specifically have. But I have an `NVIDIA` GPU. And to process our code in a `GPU` we need to set our specific `GPU` device...

Right now, there are three main `GPU` companies out there in the market (Apologies if I don't know about the off market GPU brands):
1. NVIDIA (GTX/RTX)
2. AMD (RADEON/VEGA/RX)
3. INTEL (ARC)

Now, I will setup the `GPU` properly for my system (`NVIDIA`), but leave out links for the other two as well, because setting them up is nearly the same with a few pieces here and there...

Now `NVIDIA` GPUs specifically work with `CUDA`, which uses the cores in the GPU to process our data simultaneously...

And in order to do that we need two things:
1. <a href="https://developer.nvidia.com/cuda-downloads">CUDA Toolkit</a>
2. <a href="https://developer.nvidia.com/cudnn-downloads">cuDNN for CUDA</a> (optional)

And the installations are pretty straightforward... You just need to follow the setup for `CUDA` first, and then `cuDNN` generally comes in a `.zip` package, which you can extract and copy the files into the corresponding `CUDA` folders...

Now `AMD` specifically uses <a href="https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html">ROCm</a> for its graphics processing...

And `Intel` specifically uses <a href="https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/download.htm">OpenVINO</a> for its graphics processing...

And you can choose your specific `GPU` from these three resource links...

And finally if you're using `Google Colab` for this script, you can simply change your current runtime type like this:
![Google Colab GPU Change](ExplanationMedia/Images/ColabGPUChange.gif)

First you can run two lines of code to check if the `GPU` is available or not using:
```python
import torch
torch.cuda.is_available()
```

Now we have to set our code in such a manner, such that it works on both, a `CPU` and a `GPU` depending on what we have right now...

In order to do that we write this line of code:
```python
device = 'cuda' if torch.cuda.is_available() else 'cpu'
```

Now there is a very specific reason why we set the variable name to `device`, and that is because, during initialization of tensors and models, there is an argument that PyTorch takes in, which is known as `device`, and sometimes we shift the device afterwards using the `.to()` method...

Now, by default, the tensors are generated on the `CPU`. Even the model is **initialized** on the CPU. Thus one has to manually ensure that the operations are done using `GPU`...

To do this, we need to move everything to the specific `device` we have, and therefore we need to change these things:
1. Before returning the `inputs` and `outputs` in the `getBatch()` method, we need to move them to the device using `.to()`, because the tensors are created on the `CPU`.
2. After initializing the `model`, we need to move it to the device using `.to()`, because the model parameters are created on the `CPU`.
3. When creating the `context` for generation, we need to place it on the device (here by passing `device=device` at creation time), for the same reason.

So let's see the changes now:
1. ```python
   def getBatch(split):
       # Take the trainingData if the split is 'train' otherwise take the validationData
       data = trainingData if split=='train' else validationData
       # Generates random integers of batchSize between 0 and len(data) - blockSize
       indexes = torch.randint(high=len(data) - blockSize, size=(batchSize,))
       # Takes the inputs and outputs after stacking them up in a single tensor
       inputs = torch.stack([data[i:i+blockSize] for i in indexes])
       outputs = torch.stack([data[i+1:i+blockSize+1] for i in indexes])
       inputs, outputs = inputs.to(device), outputs.to(device)
       return inputs, outputs
   ```
2. ```python
   model = BigramLanguageModel(vocabularySize).to(device=device)
   ```
3. ```python
   context = torch.zeros((1, 1), dtype=torch.long, device=device)
   print(decode(model.generate(indeces=context, maximumNewTokens=500)[0].tolist()))
   ```
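A quick way to convince ourselves that the model and its inputs ended up on the same device is a check like the following. It is only a sketch: `toyModel` and `toyContext` are small stand-ins made up for the example, not our actual model and context:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Stand-in model and context, assumed for illustration only
toyModel = torch.nn.Embedding(5, 5).to(device)   # parameters moved to `device`
toyContext = torch.zeros((1, 1), dtype=torch.long, device=device)

# Both should report the same device type, otherwise the forward pass raises an error
print(next(toyModel.parameters()).device.type, toyContext.device.type)

toyOutput = toyModel(toyContext)
print(toyOutput.device.type == device)
```

On a machine without a usable `GPU` this simply prints `cpu` twice, which is exactly the fallback behaviour we wanted.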

# Fixing Loss Evaluation

Right now we have the code for training:
```python
for _ in range(epochs):
    # Get the inputBatch and outputBatch
    inputBatch, outputBatch = getBatch('train')
    # Forward the model
    logits, loss = model(inputBatch, outputBatch)
    # Setting the gradients to None
    optimizer.zero_grad(set_to_none=True)
    # Backward to calculate gradients
    loss.backward()
    # Update the parameters
    optimizer.step()
    print(loss.item())
```

And in our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20(MLP)%20-%20Activations%2C%20Gradients%20%26%20Batch%20Normalization.ipynb">NameWeave (MLP) - Activations, Gradients & Batch Normalization</a> notebook we have discussed why we need to measure the loss cautiously, since every batch is more or less *lucky* every time...

So we need a loss evaluation method, which averages up the `loss` over multiple batches...

So we define a method for calculating the loss called `estimateLoss()` and decorate it with `@torch.no_grad()`, such that no gradients are calculated during the loss evaluation that we run every now and then...

And we have also seen how manually tracking the mode with a flag, like `training = True` and `training = False`, can create problems in layers like `Batch Normalization`...

But now that we have implemented PyTorch Modules in our code, we can do something like `model.train()` to set the mode to `training` and `model.eval()` to set the mode to `evaluation` for our model.

So we can now define two hyper-parameters:
1. `evaluationIntervals` → Which we will use to call the `estimateLoss()` based on the number of intervals
2. `evaluationIterations` → Which we will use to control the number of times the model evaluates its performance on each dataset split

So during training we can check whether the iteration has reached an evaluation interval, and if it has, we call the `estimateLoss()` method, extract the losses and print them like this:
```python
for iteration in range(epochs):
    # Check if iteration reaches interval
    if iteration % evaluationIntervals == 0:
        # Save the losses in a variable
        losses = estimateLoss()
        # Print the losses (Training and Validation)
        print(f"Step {iteration}: Training Loss {losses['train']:.4f}, Validation Loss {losses['validation']:.4f}")

    
    # Get the inputBatch and outputBatch
    inputBatch, outputBatch = getBatch('train')
    # Forward the model
    logits, loss = model(inputBatch, outputBatch)
    # Setting the gradients to None
    optimizer.zero_grad(set_to_none=True)
    # Backward to calculate gradients
    loss.backward()
    # Update the parameters
    optimizer.step()
```

And inside the `estimateLoss()` method we first set the model to `evaluation` mode. Then, for both the `training` and `validation` splits, we forward `evaluationIterations` batches through the model and store each loss in a tensor. After the `evaluationIterations` are completed, we average out the losses of that split using `mean()`. Finally, we set the model back to the `training` mode and return both the `training` and `validation` losses.

So we get a code like this:
```python
@torch.no_grad()
def estimateLoss():
    output = {}
    # Set the model to evaluation mode
    model.eval()

    for split in ['train','validation']:
        # Define a losses tensor for the `evaluationIterations` size
        losses = torch.zeros(evaluationIterations)
        for evaluationIteration in range(evaluationIterations):
            inputBatch, outputBatch = getBatch(split)
            logits, loss = model(inputBatch, outputBatch)
            losses[evaluationIteration] = loss.item()
        output[split] = losses.mean()
        
    # Set the model to training mode
    model.train()
    return output
```

So when we call `estimateLoss()` we get much less noisy estimates of the `training` and `validation` losses than we would from a single batch.

But right now `model.eval()` and `model.train()` do not actually do anything, but they will come in handy later when we have layers like `Batch Normalization` and `Dropout` in our model...
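To see what the two modes would change once such layers exist, here is a tiny sketch with a standalone `Dropout` layer. The layer `toyLayer` and its input are assumptions for the illustration; our bigram model does not contain any dropout yet:

```python
import torch

torch.manual_seed(0)

toyLayer = torch.nn.Dropout(p=0.5)   # assumed example layer, not part of the bigram model
toyInput = torch.ones(8)

# Training mode: each value is zeroed with probability p, survivors are scaled by 1/(1-p) = 2
toyLayer.train()
trainOutput = toyLayer(toyInput)

# Evaluation mode: dropout is disabled and the input passes through unchanged
toyLayer.eval()
evalOutput = toyLayer(toyInput)

print(trainOutput)
print(evalOutput)
```

This is why `estimateLoss()` switches to `evaluation` mode before measuring and back to `training` mode afterwards: the same layer behaves differently in the two modes.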

# Converting Bigram To a Script - `bigram_v1.py`

So, I'd now like to convert our entire code that we have discussed so far into a Python Script, such that the entire code can now be run in a single file, out of the box, assuming you have PyTorch installed, such that we can simplify all the intermediate work that we did...

As I have pointed out earlier, I will be completing parts of the discussion and will be releasing the code scripts within the same repository under the directory `GPT Scripts`.

And to run each script you just have to pass the `<filename>.py` to `python` in the terminal...

For now I have named this script as `bigram_v1.py`...

And you can run the file using a command like:
```bash
python bigram_v1.py
```

That's it... And everything will run out of the box...

## Keeping track of losses

We will also keep track of the losses from now on, so that we can compare our models as we continue to modify them in the future...

For now we get:

Losses at `bigram_v1.py`:
```python
Step 29500: Training Loss 2.4385, Validation Loss 2.4322
```

# **Self Attention** in GPT

## Mathematical Trick for Self Attention in GPT

Before we discuss `Self Attention`, we'd like to discuss a certain problem that we are currently dealing with... And get used to different ways to solve the problem as well...

The problem is, right now we are only focusing on the `last` token of the context (`last` token of the `blockSize`)... But we'd like our model to look further in the context history like this:\
![GPTContextProblem](ExplanationMedia/Images/GPTContextProblem.png)

Which makes the kind of model we want "**autoregressive**" **(AR)**. In statistics, an **autoregressive model** specifies that the output variable depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term). In our case it simply means that each new token is predicted from the tokens that came before it...

Now what we'd like to do is take a small *toy-example*, solve the same problem in different ways, and work our way up to an efficient solution to the problem...

Let's take a very small `input` tensor and try to write it out as an example code:
```python
torch.manual_seed(69420)

batch, time, channel = 2, 8, 3
inputs = torch.randn(batch, time, channel)
print("Inputs:", inputs)
print("Shape of inputs:", inputs.shape)
```
For which I get:
```python
Inputs: tensor([[[ 2.3787e+00, -3.5896e-01, -7.1692e-01],
         [-2.4297e-01, -1.8038e-01,  1.4882e+00],
         [ 5.4493e-01,  3.8243e-01,  8.7188e-01],
         [-1.9890e+00, -5.4009e-01, -1.5319e+00],
         [-9.2356e-01,  7.2013e-01, -5.9540e-01],
         [-1.1697e+00,  8.3635e-01,  3.5811e-01],
         [ 3.9933e-01, -1.3606e+00,  1.0168e-01],
         [-4.8538e-02, -1.1643e+00, -1.5403e-01]],

        [[ 1.1998e+00, -8.0983e-01,  1.0315e+00],
         [ 1.6720e+00, -1.0681e+00, -9.6532e-01],
         [ 3.6006e-01,  3.2209e-01,  5.2594e-01],
         [-5.4021e-01, -5.2587e-01,  1.0481e+00],
         [-3.8775e-01, -1.3751e+00, -1.0385e-01],
         [-9.2093e-01, -1.0048e+00, -1.4028e+00],
         [-2.0169e+00, -5.1192e-01, -2.1998e-01],
         [-3.3050e-01, -9.1926e-01,  8.9532e-04]]])
Shape of inputs: torch.Size([2, 8, 3])
```

In the *toy-example* we have a tensor `inputs` with three dimensions:
1. `batch` → Number of batches (`batchSize`)
2. `time` → Number of tokens (characters) in a block of `blockSize`
3. `channel` → Information of the `token` in form of embeddings (features)

In this example we have `8` tokens (character) in the `time` dimension, and these tokens are not *talking* to each other...

And now we'd like them to *"talk to each other"*, we'd like to couple them... And we'd like to couple them in a very specific way...

For example, because we have `8` tokens in a sequence, let's say we consider the `5`th token... This `5`th token should not communicate with the tokens at locations `6`, `7` and `8` (the future tokens in the sequence), and it should only talk to the tokens at locations `4`, `3`, `2` and `1` (the previous tokens in the context), such that information only flows from the previous context to the current time step. We cannot get any information from the future because we are about to predict the future...
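This "only look backwards" rule can be written down in a few lines of plain Python. The dictionary `allowedContext` is a made-up helper for this illustration, using the same 1-based positions as the text, and it also lists the token's own position, since a token always has access to its own information:

```python
time = 8  # number of tokens in the toy sequence

# Map each 1-based token position to the positions it is allowed to read from
allowedContext = {t: list(range(1, t + 1)) for t in range(1, time + 1)}

print(allowedContext[5])   # the 5th token: positions 1 to 5, never 6, 7 or 8
print(allowedContext[1])   # the 1st token has no past, so it only sees itself
```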

So, what is the easiest way for tokens to communicate?

Let's say I am the `5`th token, and I want to communicate with my past tokens (at `4`, `3`, `2` & `1`). The simplest way to communicate with the past is to just take an **average** of the past together with my own information. Or in other words, if I am the `5`th token, I would like to take the `channels` that make up my information at my step, and also the `channels` of the tokens in my past, and average those up into a `feature vector` that *summarizes me in the context of my history*...

Now once again, doing just an average is just a very weak form of "talking" or interaction, and makes this communication extremely **lossy**, which makes us lose a lot of information about the *spatial arrangements* of all those tokens... But for now, that's okay, because in the future solutions to the same problem, we will see how we can get this information back...

So let's see different versions to solve the problem now...

## Version 1 - Naïve Approach

In this approach, what we'd like to do is, for our `inputs`, for every single `batch`, for every `token` we'd like to average out all the vectors in all the `previous tokens` including the `current token`.

Or
```python
# We want: inputBagOfWords[b, t] = mean_{i<=t} inputs[b, i]
```

Now before we dive into the solution, I'd like to discuss a concept called `Bag Of Words`...

So what is `Bag of Words`?


![Bag Of Words](https://miro.medium.com/v2/resize:fit:720/format:webp/1*3K9GIOVLNu0cRvQap_KaRg.png)

The **bag-of-words model** is a model of text which uses a representation of text that is based on an unordered collection (or *"bag"*) of words.
In natural language processing (NLP), the term "bag of words" `(BoW)` typically refers to a simple representation of text where the frequency of each word in a document is counted and represented in a vector.

So why mention it?

Well you see, our concept is similar but for our case, we have a character level model. 
The term "bag of words" `(BoW)` is used in contexts where we want to represent text data by counting the occurrences and then potentially averaging them out.
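As a tiny illustration of the idea, here is a sketch using Python's `collections.Counter` on a made-up sentence (`toySentence` is an assumption for the example and has nothing to do with our dataset):

```python
from collections import Counter

toySentence = "the cat sat on the mat"   # assumed example text

# Word order is thrown away; only the count of each word survives
bagOfWords = Counter(toySentence.split())
print(bagOfWords)

# The same idea at character level, closer to our character-level model
bagOfCharacters = Counter(toySentence.replace(" ", ""))
print(bagOfCharacters.most_common(3))
```

Just like our averaging, the bag keeps *what* occurred but forgets *where* it occurred.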

And we would like to use it in our *toy-example*'s output variable name... 😂

So now, let's write out what we want in the form of code, to make the idea clearer:
```python
# Allocating memory for output
inputBagOfWords = torch.zeros((batch, time, channel))

for b in range(batch):
    for t in range(time):
        # Every token in the past and current token in the batch
        previousInput = inputs[b, :t+1] # (t+1, C)
        # Mean over the token or the time dimension
        inputBagOfWords[b, t] = torch.mean(previousInput, 0) # (C,)
```

Now because we have multiple `batches` in our example, let's compare only one batch (let's say we compare the first batch or `0`-th index) of both `inputs` and `inputBagOfWords` like this:
```python
print("Input BoW:", inputBagOfWords[0])
print("Inputs:", inputs[0])
```
We get:
```python
Input BoW: tensor([[ 2.3787, -0.3590, -0.7169],
        [ 1.0679, -0.2697,  0.3856],
        [ 0.8936, -0.0523,  0.5477],
        [ 0.1729, -0.1742,  0.0278],
        [-0.0464,  0.0046, -0.0968],
        [-0.2336,  0.1432, -0.0210],
        [-0.1432, -0.0716, -0.0035],
        [-0.1313, -0.2082, -0.0223]])
Inputs: tensor([[ 2.3787, -0.3590, -0.7169],
        [-0.2430, -0.1804,  1.4882],
        [ 0.5449,  0.3824,  0.8719],
        [-1.9890, -0.5401, -1.5319],
        [-0.9236,  0.7201, -0.5954],
        [-1.1697,  0.8363,  0.3581],
        [ 0.3993, -1.3606,  0.1017],
        [-0.0485, -1.1643, -0.1540]])
```

And we see that, at every `token` or `time` position, we have the average of all the previous tokens and the current token, which is what we want...

But in this process we see that we use nested `for-loops` which is extremely inefficient...

And now next, what we will see is that we can be extremely efficient with the same problem with `Matrix Multiplication`...

## Version 2 - Matrix Multiplication Approach

To understand the matrix multiplication approach we will use another *toy-example* for our *toy-example*...

Suppose we have two matrices `a` and `b` of sizes `(3, 3)` and `(3, 2)` respectively, and we understand that the resultant matrix `c` will be of shape `(3, 2)`, where every element of `c` is the dot product of a row of `a` with a column of `b`.

For example,
$$
\underbrace{\begin{bmatrix}
1\ 1\ 1\\
1\ 1\ 1\\
1\ 1\ 1\\
\end{bmatrix}}_{\text{a}}
\underbrace{\begin{bmatrix}
9\ 4\\
3\ 9\\
4\ 8\\
\end{bmatrix}}_{\text{b}}
\text{=}
\underbrace{\begin{bmatrix}
16\ 21\\
16\ 21\\
16\ 21\\
\end{bmatrix}}_{\text{c}}
\rightarrow \text{c = a @ b}
$$

Let's try to write the same example with code:
```python
torch.manual_seed(69420)

a = torch.ones(3, 3)
b = torch.randint(low=0, high=10, size=(3, 2)).float()
c = a @ b

print("a=")
print(a)
print("-----")
print("b=")
print(b)
print("-----")
print("c=")
print(c)
```

Right now we have a very *boring* matrix `a` of just `1`s, which plays the role of the `weights` like in a linear layer, and `b` similarly plays the role of the `inputs`.

And we have repeating rows in `c` because every `row` of `1`s in `a` is multiplied with the same `columns` of `b`, so each item in `c` is just the sum of a whole column of `b`...

Now instead, if we take a lower triangular matrix of `1`s and keep all the other elements as `0`s for `a`, we have something like this:
$$
\underbrace{\begin{bmatrix}
1\ 0\ 0\\
1\ 1\ 0\\
1\ 1\ 1\\
\end{bmatrix}}_{\text{a}}
$$

To do this, we have a method <a href="https://pytorch.org/docs/stable/generated/torch.tril.html">torch.tril()</a> in PyTorch...

So we can now modify our code and look at the resulting matrices:
```python
torch.manual_seed(69420)

a = torch.tril(torch.ones(3, 3))
b = torch.randint(low=0, high=10, size=(3, 2)).float()
c = a @ b

print("a=")
print(a)
print("-----")
print("b=")
print(b)
print("-----")
print("c=")
print(c)
```

Which gives us something like this:
$$
\underbrace{\begin{bmatrix}
1\ 0\ 0\\
1\ 1\ 0\\
1\ 1\ 1\\
\end{bmatrix}}_{\text{a}}
\underbrace{\begin{bmatrix}
9\ 4\\
3\ 9\\
4\ 8\\
\end{bmatrix}}_{\text{b}}
\text{=}
\underbrace{\begin{bmatrix}
9\ 4\\
12\ 13\\
16\ 21\\
\end{bmatrix}}_{\text{c}}
\rightarrow \text{c = a @ b}
$$

See how because of these `0`s the resultant matrix `c` is just a result of an **incremental addition (`sum`) of their respective `columns`**?
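If you want to double check this *incremental addition* claim, here is a small sketch that compares the product against `torch.cumsum()`... I am typing in the values of `b` by hand (the same numbers shown above) instead of sampling them, and `runningSum` is a name I am introducing only for this check:

```python
import torch

# Same numbers as the b matrix shown above, typed in by hand
a = torch.tril(torch.ones(3, 3))
b = torch.tensor([[9.0, 4.0], [3.0, 9.0], [4.0, 8.0]])

c = a @ b
# Cumulative sum down the rows of b
runningSum = torch.cumsum(b, dim=0)

print(c.tolist())                  # [[9.0, 4.0], [12.0, 13.0], [16.0, 21.0]]
print(torch.equal(c, runningSum))  # True
```

So multiplying by a lower triangular matrix of `1`s is just another way of writing a cumulative sum over the rows...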

And in the same fashion, because we all know average(`mean`) is just the addition of all the elements divided by the number of elements, you can start to see how the average(`mean`) would come into the picture now...

So because we are dealing with the `weights` (`a`) and trying to manipulate them, and because during matrix multiplication it is the `rows` of `a` that get combined with the `columns` of `b`, we can `normalize` every `row` of `a` so that it sums to one, to get an **incremental average (`mean`) of the respective `columns` in the resultant `c` matrix**...

So our code now looks like:
```python
torch.manual_seed(69420)

a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(low=0, high=10, size=(3, 2)).float()
c = a @ b

print("a=")
print(a)
print("-----")
print("b=")
print(b)
print("-----")
print("c=")
print(c)
```
Which gives us:
$$
\underbrace{\begin{bmatrix}
1.0000\ 0.0000\ 0.0000\\
0.5000\ 0.5000\ 0.0000\\
0.3333\ 0.3333\ 0.3333\\
\end{bmatrix}}_{\text{a}}
\underbrace{\begin{bmatrix}
9\ 4\\
3\ 9\\
4\ 8\\
\end{bmatrix}}_{\text{b}}
\text{=}
\underbrace{\begin{bmatrix}
9.0000 \ 4.0000\\
6.0000 \ 6.5000\\
5.3333 \ 7.0000\\
\end{bmatrix}}_{\text{c}}
\rightarrow \text{c = a @ b}
$$

And this is exactly the same kind of running average that our original **naïve** approach computed...

Now let's go back to our original *toy-example* and implement this now...

Remember how we considered `a` to be `weights` and `b` to be `inputs`...

Let's first initialize the `weights` and `inputs` the same way we did before to perform a `Matrix Multiplication` now and think how the `broadcasting` works out for us because we have `batch` dimensions as well...

For now we are dealing with the `tokens` or the `time` dimension, so our weights matrix would be of shape `time` by `time`...

And we will take `inputBagOfWords` and rename it to `inputBagOfWordsV2` where `V2` represents **version** `2` just so we can compare them later...

So our code looks something like this:
```python
weights = torch.tril(torch.ones(time, time))
weights = weights / weights.sum(1, keepdim=True)
inputBagOfWordsV2 = weights @ inputs
```

Now let's think through the **broadcasting** of the matrix multiplication operation...

For now we have `weights` of two dimensions of size `(8, 8)` or `(T, T)` for our *toy-example* and `inputs` of three dimensions of size `(2, 8, 3)` or `(B, T, C)`...

Now during broadcasting, the `weights` will add another `batch` dimension to make the broadcasting work from `(T, T)` to `(B, T, T)` and then perform the multiplication...

Which ultimately results in a **batched matrix multiplication** where the multiplication will be applied to all the **batch elements in parallel** and individually. And for each **batch element** there will be a multiplication between `(T, T)` and `(T, C)` exactly like the operation we discussed earlier...

And the resultant `tensor` would be of shape `(B, T, C)` which will make `inputBagOfWords` completely identical to `inputBagOfWordsV2`
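Here is a minimal sketch of that broadcasting, assuming nothing more than the shapes discussed above... The `demo`-prefixed names and the seed are hypothetical and only exist so this snippet does not overwrite the notebook's `inputs`:

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for this sketch
demoBatch, demoTime, demoChannel = 2, 8, 3
demoInputs = torch.randn(demoBatch, demoTime, demoChannel)   # (B, T, C)

demoWeights = torch.tril(torch.ones(demoTime, demoTime))
demoWeights = demoWeights / demoWeights.sum(1, keepdim=True)  # (T, T)

# Broadcasted version: (T, T) @ (B, T, C) -> (B, T, C)
broadcasted = demoWeights @ demoInputs

# Explicit version: one (T, T) @ (T, C) product per batch element
perBatch = torch.stack([demoWeights @ demoInputs[i] for i in range(demoBatch)])

print(broadcasted.shape)                                   # torch.Size([2, 8, 3])
print(torch.allclose(broadcasted, perBatch, atol=1e-6))    # True
```

Both versions agree, which is just another way of saying that the same `(T, T)` weights are applied to every batch element individually...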

So we can now compare both `inputBagOfWords` and `inputBagOfWordsV2` with <a href="https://pytorch.org/docs/stable/generated/torch.allclose.html">torch.allclose()</a> like this:
```python
torch.allclose(inputBagOfWords, inputBagOfWordsV2)
```
for which we get:
```python
True
```

And if we want to compare them manually, because both of them are long `tensor`s, we can print only the first batch of each to see that they are identical, like this:
```python
print("First batch of inputBagOfWords:", inputBagOfWords[0]) 
print("First batch of inputBagOfWordsV2:", inputBagOfWordsV2[0])
```
for which we get:
```python
First batch of inputBagOfWords: tensor([[ 2.3787, -0.3590, -0.7169],
        [ 1.0679, -0.2697,  0.3856],
        [ 0.8936, -0.0523,  0.5477],
        [ 0.1729, -0.1742,  0.0278],
        [-0.0464,  0.0046, -0.0968],
        [-0.2336,  0.1432, -0.0210],
        [-0.1432, -0.0716, -0.0035],
        [-0.1313, -0.2082, -0.0223]])
First batch of inputBagOfWordsV2: tensor([[ 2.3787, -0.3590, -0.7169],
        [ 1.0679, -0.2697,  0.3856],
        [ 0.8936, -0.0523,  0.5477],
        [ 0.1729, -0.1742,  0.0278],
        [-0.0464,  0.0046, -0.0968],
        [-0.2336,  0.1432, -0.0210],
        [-0.1432, -0.0716, -0.0035],
        [-0.1313, -0.2082, -0.0223]])
```

So let's conclude what we saw here...

We saw that, **we can do weighted aggregation of our past `tokens` or `characters` by using `Matrix Multiplication` of `weights` and the `inputs`, where `weights` are a matrix of lower-triangular fashion of `1`s and other elements as `0`s, and we are doing weighted `sum` and `normalizing` them to get the *rolling* `average` or `mean`**...

Now we will look at another way of doing this same exact operation using `softmax`...

## Version 3 - Adding Softmax Approach

This time, I will explain what we are doing after I implement the same thing with softmax code for our *toy-example* of our *toy-example*...

So let's see what our new approach looks like:
```python
torch.manual_seed(69420)

lowerTrianguarMatrix = torch.tril(torch.ones(3, 3))
print("lowerTrianguarMatrix=")
print(lowerTrianguarMatrix)
print("-----")
a = torch.zeros(3, 3)
print("a (at initialization)=")
print(a)
print("-----")
a = a.masked_fill(lowerTrianguarMatrix == 0, float('-inf'))
print("a (after masking)=")
print(a)
print("-----")
a = F.softmax(a, dim=-1)
b = torch.randint(low=0, high=10, size=(3, 2)).float()
c = a @ b

print("a (after softmax)=")
print(a)
print("-----")
print("b=")
print(b)
print("-----")
print("c=")
print(c)
```

Here you see that I am printing `a` (essentially the `weights`) at every step of its transformation...

For which we get something like this:
```python
lowerTrianguarMatrix=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
-----
a (at initialization)=
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])
-----
a (after masking)=
tensor([[0., -inf, -inf],
        [0., 0., -inf],
        [0., 0., 0.]])
-----
a (after softmax)=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
-----
b=
tensor([[9., 4.],
        [3., 9.],
        [4., 8.]])
-----
c=
tensor([[9.0000, 4.0000],
        [6.0000, 6.5000],
        [5.3333, 7.0000]])
```

Now let me explain what is happening in this transformation...

We see that instead of initializing the `a` matrix as `1`s and taking the lower triangular version of that, we initialize a `lowerTrianguarMatrix` with the same dimensions (`time`). And we initialize the `a` matrix of `0`s of the same dimension (`time`) to allocate memory for our further operation...

Then upon using this line:
```python
a = a.masked_fill(lowerTrianguarMatrix == 0, float('-inf'))
```
We are essentially saying that, wherever an element of `lowerTrianguarMatrix` is `0`, the element at the same position in `a` gets filled with `-inf`...

Which looks something like this:
$$
\underbrace{\begin{bmatrix}
1\ 0\ 0\\
1\ 1\ 0\\
1\ 1\ 1\\
\end{bmatrix}}_{\text{lowerTrianguarMatrix}}
\xrightarrow[\text{Masked Fill}]{\text{ element == 0}}
\underbrace{\begin{bmatrix}
0\ -\infty\ -\infty\\
0\ \ \ \ \ \ \ \ 0\ -\infty\\
0\ \ \ \ \ \ \ \ 0\ \ \ \ \ \ \ \ 0\\
\end{bmatrix}}_{\text{a}}
$$

This `a` then goes through a `softmax()`, which is a non-linear function that normalizes the values to sum to `1` (by exponentiating the values and dividing by their sum along the specified dimension)...

And we can recall that:
1. $e^0 = 1$
2. $e^{-\infty} = 0$

And we want to normalize the `rows` of the `weights` which is also the `last` dimension of the matrix `a`, so we can now pass the dimension of normalization to `softmax()` as `-1` like this (we are also doing this because later we are going to deal with `batches` which is an extra dimension):
```python
a = F.softmax(a, dim=-1)
```
Which ultimately looks like:
$$
\underbrace{\begin{bmatrix}
0\ -\infty\ -\infty\\
0\ \ \ \ \ \ \ \ 0\ -\infty\\
0\ \ \ \ \ \ \ \ 0\ \ \ \ \ \ \ \ 0\\
\end{bmatrix}}_{\text{a}}
\underbrace{\overrightarrow{\text{exp()}}
\underbrace{\begin{bmatrix}
1\ 0\ 0\\
1\ 1\ 0\\
1\ 1\ 1\\
\end{bmatrix}}_{\text{a}}
\xrightarrow[\text{Divide}]{\text{dimension = -1}}}_{\text{Softmax}}
\underbrace{\begin{bmatrix}
1.0000\ 0.0000\ 0.0000\\
0.5000\ 0.5000\ 0.0000\\
0.3333\ 0.3333\ 0.3333\\
\end{bmatrix}}_{\text{a}}
$$
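The two steps inside `softmax()` (exponentiate, then divide by the row sum) can be sketched by hand like this... The names `maskedScores`, `exponentiated` and `manualSoftmax` are hypothetical and only used for this check:

```python
import torch
import torch.nn.functional as F

# Zeros on and below the diagonal, -inf above it
maskedScores = torch.zeros(3, 3).masked_fill(torch.tril(torch.ones(3, 3)) == 0, float('-inf'))

# Step 1: exponentiate (e^0 = 1 and e^-inf = 0)
exponentiated = maskedScores.exp()
# Step 2: divide every row by its own sum (last dimension)
manualSoftmax = exponentiated / exponentiated.sum(dim=-1, keepdim=True)

print(exponentiated.tolist())  # [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
print(torch.allclose(manualSoftmax, F.softmax(maskedScores, dim=-1)))  # True
```

Notice that after the exponentiation we are back at the lower triangular matrix of `1`s from the previous version, which is why the final result is the same...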

And then we can continue our regular `Matrix Multiplication` operation as discussed in the above version...

Which we can recall, looks like this:
$$
\underbrace{\begin{bmatrix}
1.0000\ 0.0000\ 0.0000\\
0.5000\ 0.5000\ 0.0000\\
0.3333\ 0.3333\ 0.3333\\
\end{bmatrix}}_{\text{a}}
\underbrace{\begin{bmatrix}
9\ 4\\
3\ 9\\
4\ 8\\
\end{bmatrix}}_{\text{b}}
\text{=}
\underbrace{\begin{bmatrix}
9.0000 \ 4.0000\\
6.0000 \ 6.5000\\
5.3333 \ 7.0000\\
\end{bmatrix}}_{\text{c}}
\rightarrow \text{c = a @ b}
$$

Now, let's try to write the same for our original *toy-example* in code after renaming this output as `inputBagOfWordsV3` to compare them later like we did before:
```python
lowerTrianguarMatrix = torch.tril(torch.ones(time, time))
weights = torch.zeros(time, time)
weights = weights.masked_fill(lowerTrianguarMatrix == 0, float('-inf'))
weights = F.softmax(weights, dim=-1)
inputBagOfWordsV3 = weights @ inputs
```
So we can now compare both `inputBagOfWords` and `inputBagOfWordsV3` with <a href="https://pytorch.org/docs/stable/generated/torch.allclose.html">torch.allclose()</a> like this:
```python
torch.allclose(inputBagOfWords, inputBagOfWordsV3)
```
for which we get:
```python
True
```

Again we can compare both the tensors manually like this:
```python
print("First batch of inputBagOfWords:", inputBagOfWords[0]) 
print("First batch of inputBagOfWordsV3:", inputBagOfWordsV3[0])
```
For which we get:
```python
First batch of inputBagOfWords: tensor([[ 2.3787, -0.3590, -0.7169],
        [ 1.0679, -0.2697,  0.3856],
        [ 0.8936, -0.0523,  0.5477],
        [ 0.1729, -0.1742,  0.0278],
        [-0.0464,  0.0046, -0.0968],
        [-0.2336,  0.1432, -0.0210],
        [-0.1432, -0.0716, -0.0035],
        [-0.1313, -0.2082, -0.0223]])
First batch of inputBagOfWordsV3: tensor([[ 2.3787, -0.3590, -0.7169],
        [ 1.0679, -0.2697,  0.3856],
        [ 0.8936, -0.0523,  0.5477],
        [ 0.1729, -0.1742,  0.0278],
        [-0.0464,  0.0046, -0.0968],
        [-0.2336,  0.1432, -0.0210],
        [-0.1432, -0.0716, -0.0035],
        [-0.1313, -0.2082, -0.0223]])
```

Now, the reason that this is a bit more interesting, and also the reason that we will end up using the `softmax()` version in our `Self Attention` version is because, these `weights` begin with `0`, which you can think of as an *interaction strength* or **affinities** in this line:
```python
weights = torch.zeros(time, time)
```
Which essentially tells us, how much of each `token` from the past do we want to `aggregate` and `average` up...

Next up the line:
```python
weights = weights.masked_fill(lowerTrianguarMatrix == 0, float('-inf'))
weights = F.softmax(weights, dim=-1)
```
Tells us that the `tokens` cannot *communicate* with the `tokens` from the **future** (we will not `aggregate` anything from those `tokens`)...

And finally `aggregation` happens through the `Matrix Multiplication` in this line:
```python
inputBagOfWordsV3 = weights @ inputs
```

So long story short...

**We can do weighted aggregations of our past elements, by using matrix multiplication in a lower triangular fashion, and the elements in the lower triangular part tell us how much of each element *fuses* into the current position**...

Which we will use now in our `Self Attention` block... 

Now the point is, these `weights` are currently set by us to be `0`s, but these **affinities** are not going to be a constant at `0` in the `Self Attention`, but rather they are going to be **data-dependent**, and these `tokens` are going to start looking at each other and some `tokens` will find other `tokens`, more or less *interesting*, and depending on what their values are they're going to find each other *interesting* to different amounts (**affinities**)...
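To get a feel for what non-zero **affinities** do, here is a small sketch with hand-picked numbers... These values are hypothetical (nothing is learned here), and `affinities`, `mask` and `demoWeights` are names I am introducing only for this illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical, hand-picked affinities for 3 tokens (not learned)
affinities = torch.tensor([[0.0, 0.0, 0.0],
                           [2.0, 0.0, 0.0],
                           [0.0, 1.0, 3.0]])
mask = torch.tril(torch.ones(3, 3))

# Same masking and softmax as before, only the starting values differ
demoWeights = F.softmax(affinities.masked_fill(mask == 0, float('-inf')), dim=-1)

print(demoWeights)
print(demoWeights.sum(dim=-1))  # every row still sums to 1
```

Every row still sums to `1` and the future is still masked out, but the second token now pays more attention to the first token than to itself, and the third token pays most of its attention to itself...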

So let us look at our approach towards `Self Attention`...

## Cleaning Up Bigram Script - `bigram_v2.py`

Before we move on to `Self Attention` we need to clean up our `bigram_v1.py`...

And there are some changes that we will discuss in this section...

I will keep the original `bigram_v1.py` unchanged, and create a new script as `bigram_v2.py` so that you'd be able to see the changes...

And we will also be able to run the script at each change as well...

So let's discuss the changes...

### Change - 1

We see that inside the script we are already having the `vocabularySize` as a **global variable**,

But we are still using a redundant `vocabularySize` as a parameter during constructor definition and object initialization...

So we will remove the redundant parameter... So our code becomes:
1. `def __init__(self, vocabularySize)` → `def __init__(self)`
2. `model = BigramLanguageModel(vocabularySize).to(device=device)` → `model = BigramLanguageModel().to(device=device)`

### Change - 2

We are going to start making our neural network bigger now, so I will create a level of **indirection** here, just to introduce you to a new concept and the idea of how we can enlarge our model by including new items...

We are going to introduce a new **hyper-parameter** called `numberOfEmbeddingDimensions`, which explains what it represents within its name...

For now I am going to set this **hyper-parameter** to `32` and we will change this later when we try to scale up our model...

Now we want our `Embeddings` to go through a simple `Linear` layer (as a level of **indirection**), just to understand how we can add new layers to our existing model and use **hyper-parameters** in our model...

To do this we need to change our code in the **class definition** of `BigramLanguageModel`:
1. During constructor definition
2. During forward-pass definition

**During constructor definition**, we have the following code now (old):
```python
# Constructor for the model
def __init__(self):
    # Initializing the embedding table
    super().__init__()
    self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, vocabularySize)
```
Now we already understand that the `fan-in` is `vocabularySize` and the `fan-out` is also `vocabularySize`...

To use this **hyper-parameter** `numberOfEmbeddingDimensions` now, we can add a `Linear` layer using `torch.nn.Linear` and change the code in following steps:
1. changing the `fan-out` of the `Embedding` layer to be `numberOfEmbeddingDimensions`
2. setting the `fan-in` of the new `Linear` layer to be `numberOfEmbeddingDimensions` (its `fan-out` stays `vocabularySize`)

Which makes `numberOfEmbeddingDimensions` the intermediate **hyper-parameter**, which does not affect the original `fan-in` and `fan-out` of the model as a whole...

And for a note, we will name our `Linear` layer to be `languageModelingHead`...

In the context of `Large Language Models (LLMs)` like `GPT-3` or `BERT`, the term **"head"** refers to the additional layers or mechanisms added on top of the pre-trained base model to adapt it for specific tasks.

So now our code becomes:
```python
# Constructor for the model
def __init__(self):
    # Initializing the embedding table
    super().__init__()
    self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, numberOfEmbeddingDimensions)
    self.languageModelingHead = torch.nn.Linear(numberOfEmbeddingDimensions, vocabularySize)
```

**During forward-pass definition** we have the following code:
```python
# Index into embeddings to get the logits
logits = self.tokenEmbeddingTable(indeces) # (B, T, C)
```

To adapt with our current objective, we need to understand that we do not get the `logits` directly now, so we can now rename our `logits` to something like `tokenEmbeddings`...

And then we can pass our `tokenEmbeddings` through our `languageModelingHead` in the next line to get the `logits`...

So our code looks like this:
```python
# Index into embeddings to get the token embeddings
tokenEmbeddings = self.tokenEmbeddingTable(indeces) # (B, T, C)
# Pass the token embeddings through a linear layer
logits = self.languageModelingHead(tokenEmbeddings) # (B, T, C)
```

We also need to keep in mind that the `channels` dimension in `tokenEmbeddings` is different from the one in `logits`, because in `tokenEmbeddings` the `channel` dimension represents the **number of embedding dimensions** which for now is `32`, but in `logits` the `channel` dimension represents the output **vocabulary size** which is for our case `92`...
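A quick way to convince ourselves about these two different `channel` sizes is a shape check outside of the class... This is only a sketch, assuming the values `92` and `32` from the text, and `demoIndeces` is a hypothetical random batch (not data from the dataset):

```python
import torch

# Values taken from the text: 92 characters in the vocabulary, 32 embedding dimensions
vocabularySize = 92
numberOfEmbeddingDimensions = 32

tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, numberOfEmbeddingDimensions)
languageModelingHead = torch.nn.Linear(numberOfEmbeddingDimensions, vocabularySize)

# Hypothetical batch of 4 sequences with 8 token indices each
demoIndeces = torch.randint(low=0, high=vocabularySize, size=(4, 8))  # (B, T)

tokenEmbeddings = tokenEmbeddingTable(demoIndeces)  # (B, T, 32)
logits = languageModelingHead(tokenEmbeddings)      # (B, T, 92)

print(tokenEmbeddings.shape)  # torch.Size([4, 8, 32])
print(logits.shape)           # torch.Size([4, 8, 92])
```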

### Change - 3

We have seen how we take the `indeces` and we have encoded them based on the *identity* of the `tokens`...

But in the `Transformer` architecture, **positional encodings** are crucial for incorporating sequential information into the model's understanding of the input data. Unlike `Recurrent Neural Networks (RNNs)` or `Convolutional Neural Networks (CNNs)`, `Transformers` don't inherently understand the **order** of the input tokens since they process all tokens in parallel. `Positional Encodings` address this limitation by providing the model with information about the **position of each token in the sequence**.

Positional Encodings are typically **added** to the input embeddings before feeding them into the transformer layers.

Which means that we want to add another `Embedding` layer to the computation now, in order to be ready to implement our `Transformer` architecture...

Which also hints us that we will make changes in the **class definition** of `BigramLanguageModel`:
1. During constructor definition
2. During forward-pass definition

**During constructor definition**, we have the following code now from the above section:
```python
# Constructor for the model
def __init__(self):
    # Initializing the embedding table
    super().__init__()
    self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, numberOfEmbeddingDimensions)
    self.languageModelingHead = torch.nn.Linear(numberOfEmbeddingDimensions, vocabularySize)
```

Now we can define another table called `positionalEmbeddingTable` and initialize it with `torch.nn.Embedding()`...

Now we need to understand that we are trying to encode the positions of `tokens` within each **block**, which means that the `fan-in` of this layer is going to be of size `blockSize`. And because we will feed the result to the `Linear` layer after **adding** the outputs of `tokenEmbeddingTable` and `positionalEmbeddingTable` together, we want the `fan-out` of this layer to be `numberOfEmbeddingDimensions` as well, in order for the broadcasting to work out...

So, **During constructor definition**, we have the following code now:
```python
# Constructor for the model
def __init__(self):
    # Initializing the embedding table
    super().__init__()
    self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, numberOfEmbeddingDimensions)
    self.positionalEmbeddingTable = torch.nn.Embedding(blockSize, numberOfEmbeddingDimensions)
    self.languageModelingHead = torch.nn.Linear(numberOfEmbeddingDimensions, vocabularySize)
```

**During forward-pass definition** we have the following code:
```python
# Forward Pass
def forward(self, indeces, labels=None):
    # Index into embeddings to get the token embeddings
    tokenEmbeddings = self.tokenEmbeddingTable(indeces) # (B, T, C)
    # Pass the token embeddings through a linear layer
    logits = self.languageModelingHead(tokenEmbeddings) # (B, T, C)
```

Now we remember that within the `indeces` we receive the inputs in the shape of `(B, T)` or `(batchSize, blockSize)`, and in order to encode the positions of the `tokens` we need the `blockSize` of the `indeces`, so we can unpack the `blockSize` or the `time` dimension of these `indeces` using their `.shape` attribute... Then we can create a single tensor of positions from `0` to `blockSize - 1` (or `0` to `T-1`) using `torch.arange()`, which can be our input now for the `positionalEmbeddingTable` to create a positional embedded tensor of `(blockSize, numberOfEmbeddingDimensions)` or `(T, C)` in an output we will call `positionalEmbeddings`...

Now that we have both `positionalEmbeddings` and `tokenEmbeddings`, we can **add** them up into a single tensor we will call `concatenatedEmbeddings` (the name is a little loose, because we are adding them and not literally concatenating them, and internally the broadcasting will work itself out as we have seen for addition like `(B, T, C) + (T, C)`)...

Also, we are using a `GPU` right now to train the model, so we will also set the `device` parameter when we use the `torch.arange()`...

So now we have the following code:
```python
def forward(self, indeces, labels=None):
    # Unpacking the shape of indeces
    batch, time = indeces.shape

    # Index into embeddings to get the token embeddings
    tokenEmbeddings = self.tokenEmbeddingTable(indeces) # (B, T, C)
    # Index into embeddings to get the positional embeddings
    positionalEmbeddings = self.positionalEmbeddingTable(torch.arange(time, device=device)) # (T, C)
    # Fuse the token embeddings and positional embeddings together to pack the information in a single tensor
    concatenatedEmbeddings = tokenEmbeddings + positionalEmbeddings # (B, T, C)
    # Pass the concatenated embeddings through a linear layer
    logits = self.languageModelingHead(concatenatedEmbeddings) # (B, T, C)
```
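The `(B, T, C) + (T, C)` addition used in the forward pass can be checked in isolation with a small sketch like this... All the `demo`-prefixed names and sizes are hypothetical, and I am using random tensors in place of real embeddings:

```python
import torch

demoBatch, demoTime, demoChannels = 2, 4, 3
demoTokenEmbeddings = torch.randn(demoBatch, demoTime, demoChannels)  # (B, T, C)
demoPositionalEmbeddings = torch.randn(demoTime, demoChannels)        # (T, C)

# (T, C) is broadcast across the batch dimension to (B, T, C)
summed = demoTokenEmbeddings + demoPositionalEmbeddings

# Every batch element receives exactly the same positional vectors
sameForEveryBatch = all(
    torch.equal(summed[i], demoTokenEmbeddings[i] + demoPositionalEmbeddings)
    for i in range(demoBatch)
)

print(summed.shape)        # torch.Size([2, 4, 3])
print(sameForEveryBatch)   # True
```

So the position information depends only on *where* a `token` sits inside the block, and not on which sequence of the batch it belongs to...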

I have made all of these changes in `bigram_v2.py` and I have released the script as well...

And right now these `concatenatedEmbeddings` don't have any use in our `BigramLanguageModel`, but as we start to work on our `Self Attention` block we will see how they start to matter...

But before we finally move on to `Self Attention`, there is one more change to take care of...

### Change - 4

Now we also have to take care of the `indeces` that we pass in during `generate()`... Because we are using `positionalEmbeddings` along with `tokenEmbeddings`, the model might train just fine, but during generation we can never have more context than `blockSize` coming into the model, because if our `indeces` are longer than `blockSize`, then our `positionalEmbeddingTable` is going to run out of indexing scope, as it only has embeddings for up to `blockSize` positions (its `fan-in`)...

So we can crop the `indeces` like this:
```python
croppedIndeces = indeces[:, -blockSize:]
```
Which selects all rows of the `indeces` tensor and only the last `blockSize` columns. The negative slice `-blockSize:` means that we start at the `blockSize`-th column from the end and go up to the last column (and if a sequence is shorter than `blockSize`, we simply keep all of it). So, `croppedIndeces` will contain at most the last `blockSize` indices from each sequence in the batch. This ensures that during generation, only the **most recent context** of size `blockSize` is considered for the next `token` prediction.

So our `generate()` inside of `BigramLanguageModel` now looks like:
```python
# Generation
def generate(self, indeces, maximumNewTokens):
    for _ in range(maximumNewTokens):
        # Crop the indeces upto most recent block size context
        croppedIndeces = indeces[:, -blockSize:]
        # Forward Through Model
        logits, loss = self(croppedIndeces)
        # Focus on the last time step
        logits = logits[:, -1, :]
        # Applying softmax for the last dimension
        probabilities = F.softmax(logits, dim=-1)
        # Sample from distribution
        nextIndex = torch.multinomial(probabilities, num_samples=1)
        # Concatenate currentIndex with nextIndex
        indeces = torch.cat((indeces, nextIndex), dim=1)
    return indeces
```
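To see what the cropping does on its own, here is a tiny sketch with made-up token ids... The value of `blockSize` and the names `longContext` and `shortContext` are hypothetical and chosen only for this illustration:

```python
import torch

blockSize = 8  # hypothetical value for this sketch

longContext = torch.arange(12).unsqueeze(0)   # (1, 12): token ids 0..11
shortContext = torch.arange(5).unsqueeze(0)   # (1, 5): token ids 0..4

# Longer than blockSize: only the last 8 tokens survive
print(longContext[:, -blockSize:].tolist())   # [[4, 5, 6, 7, 8, 9, 10, 11]]
# Shorter than blockSize: nothing is cropped
print(shortContext[:, -blockSize:].tolist())  # [[0, 1, 2, 3, 4]]
```

So the crop is safe to apply at every step of the generation loop, even at the very beginning when the context is still shorter than `blockSize`...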

### Keeping track of losses

Losses at `bigram_v1.py`:
```python
Step 29500: Training Loss 2.4385, Validation Loss 2.4322
```
Losses at `bigram_v2.py`:
```python
Step 29500: Training Loss 2.4641, Validation Loss 2.4487
```

## Version 4 - Crux of **Self Attention** - Most important part of the Notebook

Before I define `Self Attention`, let's pickup our *toy-example* from before and discuss where we left off...

So we had something like this:
```python
torch.manual_seed(69420)

batch, time, channel = 2, 8, 3

inputs = torch.randn(batch, time, channel)

lowerTrianguarMatrix = torch.tril(torch.ones(time, time))
weights = torch.zeros(time, time)
weights = weights.masked_fill(lowerTrianguarMatrix == 0, float('-inf'))
weights = F.softmax(weights, dim=-1)
inputBagOfWords = weights @ inputs
```

Here we had `inputs` of `2 batches` of `8 tokens` with `3 dimensional embeddings`... And in this code, we do a simple **average** of all the past `tokens` and the current `token`. And it does so by creating a lower triangular structure, which allows us to *mask-out* the `weights` matrix that we created. After *masking-out* we **normalize** it and multiply it with our `inputs`...

And, when we initialize the **affinities** between all the different `tokens`, we see that we have `weights` where every single **row** gives us these uniform numbers that sum to `1`.

But the thing now is, we don't want these numbers in `weights` to be all uniform, because we want different `tokens` to find other `tokens` more or less *interesting*, in a **data-dependent** way...

For example, because we are using a character level language model, if say I am a *vowel*, then maybe I want to look for *consonants* in my past to know what those *consonants* are and also I want that information to **flow** to me...

In other words **we want to gather information from the past, and we want to do that in a data-dependent way**...

And this is the problem that `Self Attention` solves...

And the way `Self Attention` solves this, is...

Every single `token` at each position emits two vectors: a **Query (Q)** and a **Key (K)**...

We can look at **Query (Q)** and **Key (K)** in this way:
- **Query (Q)** says *"What am I looking for?"*
- **Key (K)** says *"What do I contain?"*

And the way these `tokens` generate **affinities** or *interactions* is by a **dot-product** between the **queries** and the **keys**...

Which eventually means that every `token` at every position says *"Hey, I am a `token` and I contain a vowel **and** I am looking for consonants in my past"* for our example... And the way they *interact* with each other is the **dot-product** between the queries and the keys...
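To see why a **dot-product** works as a measure of *affinity*, here is a small sketch with hand-picked vectors... The vectors and the names `demoQuery`, `demoKeys` and `demoAffinities` are hypothetical, real queries and keys would come out of learned layers:

```python
import torch

# Hypothetical 2-dimensional query and keys, chosen by hand
demoQuery = torch.tensor([1.0, 0.0])
demoKeys = torch.tensor([[1.0, 0.0],    # points the same way as the query
                         [0.0, 1.0],    # orthogonal to the query
                         [-1.0, 0.0]])  # points the opposite way

# One dot-product per key
demoAffinities = demoKeys @ demoQuery

print(demoAffinities.tolist())  # [1.0, 0.0, -1.0]
```

The key that is aligned with the query gets the highest score, the orthogonal one gets `0`, and the opposite one gets a negative score, so after a `softmax()` the aligned `token` would receive the most weight...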

For our case, we have an `inputs` tensor of shape `(B, T, C)` or `(2, 8, 3)` for our *toy-example*... Or `2 batches` of `8 tokens` having `3 embeddings`...

But we need to understand that in order for our `inputs` to contribute and emit these **queries** and **keys**, we need the *information* from each `inputs` and we also need these **queries** and **keys** to emit their own *information* as outputs, which brings us to our **hyper-parameter** `headSize`...

You can consider these output shapes of these **queries** and **keys** to be in form of some embeddings, and the easiest way to initialize these **queries** and **keys** is by creating a `Linear` layer without a bias and defining their `fan-in` as the *information* from each `inputs` and `fan-out` with the **hyper-parameter** `headSize` after initializing it with a value...

For our case, the *information* from `inputs` that flows into these **queries** and **keys** (or `fan-in`) is of size `channel`, because that's where the original `Embeddings` are, and the *information* that flows out of these **queries** and **keys** (or `fan-out`) is of size `headSize`... For now let's consider the `headSize` as `16`...

After we initialize these **query** and **key** layers and run our `inputs` through their linear transformations, we go from shape `(B, T, C)` to shape `(B, T, 16)` for our example... And these **queries** and **keys** can finally talk to each other once we take a **dot-product** between them (or Matrix Multiplication)...

But the problem is that these **queries** and **keys** are both 3-dimensional now, and of shape `(B, T, 16)`... And they cannot be multiplied as they are because, for the batched matrix multiplication to work out, the shapes need to be `(B, T, 16) @ (B, 16, T)`, which results in a `(B, T, T)` tensor... So in order to rearrange the **keys** in this manner, we will use <a href="https://pytorch.org/docs/stable/generated/torch.transpose.html#torch.transpose">`transpose()`</a> and pass `(-2, -1)` as parameters to swap the second last and last dimensions...

Now they can **dot-product** fine and give us the *interaction strength* or the **affinities** between the `tokens` of these `inputs`...

So now our *toy-example* code becomes:
```python
torch.manual_seed(69420)

batch, time, channel = 2, 8, 3

headSize = 16

inputs = torch.randn(batch, time, channel)

lowerTriangularMatrix = torch.tril(torch.ones(time, time))

query = torch.nn.Linear(channel, headSize, bias=False)
key = torch.nn.Linear(channel, headSize, bias=False)

q = query(inputs) # (B, T, 16)
k = key(inputs) # (B, T, 16)

weights = q @ k.transpose(-2, -1) # (B, T, T)

weights = weights.masked_fill(lowerTriangularMatrix == 0, float('-inf'))
weights = F.softmax(weights, dim=-1)
inputBagOfWords = weights @ inputs
```

And if we look at our weights now, they look something like this:
```python
tensor([[[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2184, 0.7816, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.5845, 0.1986, 0.2170, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0042, 0.7065, 0.2424, 0.0470, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0238, 0.4702, 0.1258, 0.2795, 0.1007, 0.0000, 0.0000, 0.0000],
         [0.0053, 0.3562, 0.0704, 0.2770, 0.0847, 0.2065, 0.0000, 0.0000],
         [0.3267, 0.0463, 0.1988, 0.0257, 0.2178, 0.1656, 0.0190, 0.0000],
         [0.1526, 0.0854, 0.2203, 0.0336, 0.2216, 0.2357, 0.0239, 0.0268]],

        [[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0679, 0.9321, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2828, 0.5033, 0.2139, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.1369, 0.0534, 0.4525, 0.3572, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.1286, 0.0627, 0.6126, 0.1609, 0.0352, 0.0000, 0.0000, 0.0000],
         [0.1770, 0.0401, 0.4274, 0.2621, 0.0554, 0.0379, 0.0000, 0.0000],
         [0.0413, 0.0016, 0.1513, 0.3671, 0.0347, 0.0171, 0.3869, 0.0000],
         [0.1170, 0.0645, 0.3522, 0.1517, 0.0508, 0.0598, 0.1198, 0.0842]]],
       grad_fn=<SoftmaxBackward0>)
```

Now we can see that previously our `weights` were constant and were applied to every single `batch` in the *same way*, but now each `batch` has different `weights` because every batch element consists of different `tokens` at different `positions`, which makes the `weights` **data-dependent**, and we can already see that they are no longer uniform...
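These properties can be checked directly. The following is a minimal sketch that rebuilds the toy-example; the seed is arbitrary and the names `rowSums`, `futureWeights` and `sameAcrossBatch` are introduced here only for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

batch, time, channel = 2, 8, 3
headSize = 16

inputs = torch.randn(batch, time, channel)
lowerTriangularMatrix = torch.tril(torch.ones(time, time))

query = torch.nn.Linear(channel, headSize, bias=False)
key = torch.nn.Linear(channel, headSize, bias=False)

weights = query(inputs) @ key(inputs).transpose(-2, -1)  # (B, T, T)
weights = weights.masked_fill(lowerTriangularMatrix == 0, float('-inf'))
weights = F.softmax(weights, dim=-1)

# Every row is a probability distribution over the current and past tokens
rowSums = weights.sum(dim=-1)  # (B, T)
# Everything above the diagonal (the future) receives exactly zero weight
futureWeights = weights * (1 - lowerTriangularMatrix)
# The two batch elements hold different tokens, so their weights differ
sameAcrossBatch = torch.allclose(weights[0], weights[1])

print(rowSums)
print(futureWeights.abs().max())
print(sameAcrossBatch)
```

Each row summing to `1` is what lets us read a row as *"how much of every visible token goes into this position"*...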

And doing these **dot-products** results in **affinities** and these **affinities** tend to be high if their *interaction strength* is high...

And if the weights have *high affinities*, the current `token` ends up pulling in more of that `token`'s *information* and gets to *learn* a lot about it, and vice versa...

Now there's one more part to this `Self Attention` and that is the **Value (V)**...

Which is also emitted from these `inputs`...

Now we can look at these **Value (V)** as:
- **Value (V)** says *"What information do I have?"*

And the way these `tokens` now pass the *information* about them is by a **dot-product** between these **data-dependent weights** and the **values**...

Which eventually means that every `token` at every position says *"Hey, I am a `token` and I contain a vowel and I am looking for consonants in my past and my own information is kept in vector `inputs`"*

We can think `inputs` to have **private information** which only gets *communicated* or *passed around* if the *interaction* finds them *interesting*...

So we can now implement **Value (V)** the same way we initialized the **queries** and **keys**, and we can perform a dot product **after** we get the **data-dependent weights**, like this:
```python
torch.manual_seed(69420)

batch, time, channel = 2, 8, 3

headSize = 16

inputs = torch.randn(batch, time, channel)

lowerTriangularMatrix = torch.tril(torch.ones(time, time))

query = torch.nn.Linear(channel, headSize, bias=False)
key = torch.nn.Linear(channel, headSize, bias=False)
value = torch.nn.Linear(channel, headSize, bias=False)

q = query(inputs) # (B, T, 16)
k = key(inputs) # (B, T, 16)

weights = q @ k.transpose(-2, -1) # (B, T, T)

weights = weights.masked_fill(lowerTriangularMatrix == 0, float('-inf')) # (B, T, T)
weights = F.softmax(weights, dim=-1) # (B, T, T)

v = value(inputs) # (B, T, 16)

inputBagOfWords = weights @ v # (B, T, 16)
```
And we see that these **values** are the ultimate thing that get aggregated and the `inputs` eventually get dissolved...

So these are the **legendary lines** with which we can answer the questions asked, for each of these **queries, keys & values**:
1. **Query (Q)** answers *"Here's what I am interested in..."*
2. **Key (K)** answers *"Here's what I have..."*
3. **Value (V)** answers *"If you find me interesting, here's what I will communicate to you..."*

Let's look at the single `Self Attention` block or the `Scaled Dot Product Attention` image from the original paper:

![Scaled_Dot_Product_Attention](ExplanationMedia/Images/Scaled_Dot_Product_Attention.png)

We see that we end up defining the exact same thing in the end...

And we also see that there is an *optional* case in **Mask** part, and we will discuss this in the later notes...

And we will also discuss why it is called `Scaled Dot Product Attention`...

Let's now discuss a few things about `Self Attention` now to clear things up...

## Notes on Self Attention

**`Self-Attention` is a mechanism used in machine learning, to capture dependencies and relationships within input sequences.**

![Scaled_Dot_Product_Attention](ExplanationMedia/Images/Scaled_Dot_Product_Attention.png)

> Attention is a **communication mechanism**. Can be seen as `nodes` having a vector of *information* in a **directed graph** *looking at each other* and **aggregating information** with a *weighted sum* from all `nodes` that *point to them*, with **data-dependent weights**.

Let's consider the following graph for example:\
![AttentionExplanationwithDirectedGraph](ExplanationMedia/Images/AttentionExplanationwithDirectedGraph.png)

This illustration has a different structure than what we have implemented in our `Self Attention` because our graph has `8` nodes (because of `blockSize`), where the first node is pointed to only by itself, the second node is pointed to by the first node and itself, following the pattern up to the eighth node which is pointed to by all the previous nodes and itself, which can be termed as **auto regressive**...

Auto Regressive Graph looks like this:\
![AutoRegressiveDirectedGraph](ExplanationMedia/Images/AutoRegressiveDirectedGraph.png)

But in principle it can be applied to any arbitrary directed graph, and it is just a communication mechanism between the `nodes`...

> There is **no notion of space**. `Attention` simply acts over a set of vectors, and so by default these `nodes` have no idea where they are positioned in space. This is why we need to **positionally encode** tokens, and give them *information* that is *anchored* to a specific position so that they *know* where they are...

For example, this is different than **Convolution**, because if you run a **convolution operation** over some `input`, there is a very specific layout of *information* in space, and the **convolutional filters act in space**. But in `Attention`, we have a set of vectors out there in space, where they communicate and **if you want them to have a notion of space, you need to specifically add it**. Which is what we have done when we have calculated the `positional encodings` and added that information to the vectors...
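One way to see this is to shuffle the tokens and watch the outputs shuffle in exactly the same way. The following is a minimal sketch, assuming an **unmasked** head with no positional encodings; the helper `unmaskedAttention` and the variable names are hypothetical and only for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

time, channel, headSize = 8, 3, 16
inputs = torch.randn(1, time, channel)

query = torch.nn.Linear(channel, headSize, bias=False)
key = torch.nn.Linear(channel, headSize, bias=False)
value = torch.nn.Linear(channel, headSize, bias=False)

def unmaskedAttention(x):
    # Plain scaled dot product attention without any mask or positional signal
    q, k, v = query(x), key(x), value(x)
    weights = F.softmax(q @ k.transpose(-2, -1) * headSize ** -0.5, dim=-1)
    return weights @ v

# Shuffle the tokens along the time dimension
permutation = torch.randperm(time)
shuffledInputs = inputs[:, permutation, :]

output = unmaskedAttention(inputs)
shuffledOutput = unmaskedAttention(shuffledInputs)

# Shuffling the inputs only shuffles the outputs: the head cannot tell the order
orderIsInvisible = torch.allclose(output[:, permutation, :], shuffledOutput, atol=1e-5)
print(orderIsInvisible)
```

If the order mattered to the head, the two outputs would differ by more than a reordering, which is the gap that the `positional encodings` fill...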

> Each example across the batch dimension is of course processed completely **independently**, and the examples never *"talk"* to each other...

So in the analogy of a directed graph, and *toy-example* we can have something like this:
```python
batch, time, channel = 4, 8, 3
```
Where we had `4 batches` of `8 tokens`... And because we have `4` batches, we really have `4` separate pools of `8 nodes` and those `8` nodes only *talk* to each other, but in total there are `32` nodes that are being processed...
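The independence of these pools can be verified with a small experiment. This is a sketch under the same toy-example assumptions; the helper `attention` and the variables `together`, `oneByOne` and `changedInputs` are hypothetical names:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

batch, time, channel = 4, 8, 3
headSize = 16

inputs = torch.randn(batch, time, channel)
lowerTriangularMatrix = torch.tril(torch.ones(time, time))

query = torch.nn.Linear(channel, headSize, bias=False)
key = torch.nn.Linear(channel, headSize, bias=False)
value = torch.nn.Linear(channel, headSize, bias=False)

def attention(x):
    # Masked, scaled dot product attention over a (B, T, C) tensor
    q, k, v = query(x), key(x), value(x)
    weights = q @ k.transpose(-2, -1) * headSize ** -0.5
    weights = weights.masked_fill(lowerTriangularMatrix == 0, float('-inf'))
    weights = F.softmax(weights, dim=-1)
    return weights @ v

# Process all 4 pools at once, then each pool on its own
together = attention(inputs)  # (4, 8, 16)
oneByOne = torch.cat([attention(inputs[i:i + 1]) for i in range(batch)], dim=0)
batchesAreIndependent = torch.allclose(together, oneByOne, atol=1e-5)

# Replace the first pool entirely: the other three pools must not notice
changedInputs = inputs.clone()
changedInputs[0] = torch.randn(time, channel)
othersUntouched = torch.allclose(together[1:], attention(changedInputs)[1:], atol=1e-5)

print(batchesAreIndependent, othersUntouched)
```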

Here in the case of **Language Modelling**, we have this specific structure of a directed graph, where the future tokens will not *communicate* with the past tokens, but, this doesn't have to be the case in general. In fact, in many cases, you want all of these `nodes` to fully *talk* to each other...

For example, if you're doing *Sentiment Analysis* with a `Transformer`, you might have a number of tokens, and you may want them to *fully* talk to each other, because later you want to predict the *sentiment* of the sentence which makes it *okay* for them to talk to each other. And in those cases you will use an `encoder` block of self attention...

And all it means to have an `encoder` block is to delete this line of code from our toy example:
```python
weights = weights.masked_fill(lowerTriangularMatrix == 0, float('-inf'))
```
Which makes the block completely talk to each other...

And what we have already implemented in our previous *toy-example* is known as a `decoder` block. And it is called a `decoder` because we mask the `weights` with a lower triangular **auto regressive** mask, which stops the nodes from receiving information from future nodes (because that would give away the *answer*)...

But both are allowed and `Attention` supports arbitrary connection between `nodes`...

So we can now safely say that:

> In an **"encoder"** attention block we delete the single line that does masking with `lowerTriangularMatrix`, allowing all `tokens` to communicate. This block here is called a **"decoder"** attention block because it has triangular masking, and is usually used in **autoregressive** settings, like language modeling.
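The difference between the two blocks is easy to see side by side. This is a minimal sketch that uses random numbers as a stand-in for the `q @ k` affinities; the helper `attentionWeights` and its `causal` flag are hypothetical and only for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

time = 4
# Stand-in for the raw affinities that q @ k.transpose(-2, -1) would produce
scores = torch.randn(1, time, time)
lowerTriangularMatrix = torch.tril(torch.ones(time, time))

def attentionWeights(scores, causal):
    # causal=True keeps the masking line ("decoder"), causal=False drops it ("encoder")
    if causal:
        scores = scores.masked_fill(lowerTriangularMatrix == 0, float('-inf'))
    return F.softmax(scores, dim=-1)

decoderWeights = attentionWeights(scores, causal=True)
encoderWeights = attentionWeights(scores, causal=False)

print(decoderWeights[0])  # zeros above the diagonal
print(encoderWeights[0])  # every token attends to every token
```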

Now, you see me mentioning `Attention` and `Self Attention`, but there is something called `Cross Attention` as well...

So what is the difference between `Self Attention` and `Cross Attention`?

So the reason we are calling our code `Self Attention` is because the **keys, queries & values** are all coming from the same source which is `inputs` (which makes them *self-attending*)...

But in principle, `Attention` is much more general than that...

For example, in encoder-decoder transformers, we can have a case where the **queries** are produced from `inputs`, but the **keys** and the **values** come from a whole separate **external source**, and sometimes from some `encoder` blocks that we like to condition on...

So `Cross Attention` is used when we have a **separate source** of nodes from which we want to pull *information* into our `nodes`, and `Self Attention` is used when we have a set of nodes that we want to *look* at each other and *talk* to each other...

So now it is safe to say that:

> **"Self-Attention"** just means that the **keys** and **values** are produced from the **same source** as **queries**. In **"Cross-Attention"**, the **queries** still get produced from `inputs`, but the **keys** and **values** come from some other, **external source** (e.g. an Encoder Module)
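To make the difference concrete, here is a minimal sketch of `Cross Attention` in the style of our toy-example. The tensor `encoderOutputs` is a hypothetical stand-in for an external source (random numbers here, not a real encoder), and giving it the same `channel` size as `inputs` is an assumption made only to keep the example short:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

batch, channel, headSize = 2, 3, 16
decoderTime, encoderTime = 8, 5  # the two sources may have different lengths

inputs = torch.randn(batch, decoderTime, channel)          # our own nodes
encoderOutputs = torch.randn(batch, encoderTime, channel)  # hypothetical external source

query = torch.nn.Linear(channel, headSize, bias=False)
key = torch.nn.Linear(channel, headSize, bias=False)
value = torch.nn.Linear(channel, headSize, bias=False)

q = query(inputs)          # (B, 8, 16) queries come from our own nodes
k = key(encoderOutputs)    # (B, 5, 16) keys come from the external source
v = value(encoderOutputs)  # (B, 5, 16) values come from the external source

weights = q @ k.transpose(-2, -1) * headSize ** -0.5  # (B, 8, 5)
weights = F.softmax(weights, dim=-1)  # no causal mask, the whole source is visible
output = weights @ v  # (B, 8, 16)

print(weights.shape, output.shape)
```

Notice that the `weights` are no longer square: each of our `8` nodes spreads its attention over the `5` external nodes...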

Lastly when we look at the original <a href="https://arxiv.org/abs/1706.03762">"Attention Is All You Need"</a> paper, we see this equation:

$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Now according to the equation, we have already implemented `Attention` given the `Query`, `Key` and `Value`... We have multiplied the `Query` and `Key`, applied `softmax()` and aggregated the `Value`s...

But we see that we are still missing this $\frac{1}{\sqrt{d_k}}$ here, and the $d_k$ here is none other than the `headSize`...

So why are they doing it? And also why do they call it the `Scaled Dot Product Attention`?

The problem is that when we have **unit-gaussian** inputs (`0` mean and `unit` variance) and we do what we have already implemented without the $\frac{1}{\sqrt{d_k}}$ scaling, and check the mean and the variance like this:
```python
torch.manual_seed(69420)

B, T = 4, 8
headSize = 16

k = torch.randn(B, T, headSize)
q = torch.randn(B, T, headSize)
weights = q @ k.transpose(-2, -1)

print("Mean: ", weights.mean())
print("Variance: ", weights.var())
```
We see that the variance of `weights` comes out in the order of `headSize`:
```python
Mean:  tensor(-0.0133)
Variance:  tensor(16.1469)
```

But the moment we apply the scaling from this formula while computing the `weights`:
```python
torch.manual_seed(69420)

B, T = 4, 8
headSize = 16

k = torch.randn(B, T, headSize)
q = torch.randn(B, T, headSize)
weights = q @ k.transpose(-2, -1) * headSize ** -0.5

print("Mean: ", weights.mean())
print("Variance: ", weights.var())
```
We get the variance of `weights` to be roughly `1`:
```python
Mean:  tensor(-0.0033)
Variance:  tensor(1.0092)
```
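Both measurements agree with a short calculation. This is only a sketch, under the assumption that the components $q_i$ and $k_i$ are independent with zero mean and unit variance (true for the `torch.randn` tensors above, and only approximately true inside a trained model):

$$
\text{Var}(q \cdot k) = \text{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = d_k
$$

$$
\text{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{\text{Var}(q \cdot k)}{d_k} = 1
$$

With `headSize` as `16`, this predicts a variance of about `16` before scaling and about `1` after scaling, which is what the two print-outs show...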

Now why is this important?

We understand that our `weights` get fed into the `softmax()`...

And we understand from our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20(MLP)%20-%20Activations%2C%20Gradients%20%26%20Batch%20Normalization.ipynb">NameWeave (MLP) - Activations, Gradients & Batch Normalization</a> notebook that we want these `weights` to be fairly diffuse especially during initialization...

But why?

The problem is, if our `weights` take on very positive or very negative numbers, `softmax()` would actually *converge* towards one-hot vectors...

Let's illustrate this...

Suppose we have a list of `5` simple floating point numbers:
```python
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)
```
We get this:
```python
tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])
```
But the moment we take these numbers and start sharpening them, say by multiplying them with a big number:
```python
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]) * 9, dim=-1)
```
We get this:
```python
tensor([0.0228, 0.0015, 0.1382, 0.0015, 0.8359])
```
We see that the `softmax()` starts to sharpen towards the maximum or the highest number that is there...

Which eventually means that we end up **aggregating** the *information* from mostly a single node... Which is not what we want... So, the `scaling` is used to control the variance of the `weights` at initialization, and hence the name `Scaled Dot Product Attention`...

So it is safe to say that:

> `"Scaled"` Attention additionally multiplies `weights` by $\frac{1}{\sqrt{\text{headSize}}}$. This makes it so when input $Q, K$ are **unit variance**, `weights` will be **unit variance** too and `softmax` will stay diffuse and not saturate too much...

# Changing Script to `single_self_attention_gpt.py`

Let's now take the knowledge that we have from `Self Attention` and implement it...

## Change - 1

Let's get our names straight first...

Because now we will be using a `Transformer` style architecture, we will be renaming our model from `BigramLanguageModel` to `GPTModel`...

So, we will change the code during:
1. Model Definition
2. Model Initialization

During **Model Definition**, our code was:
```python
class BigramLanguageModel(torch.nn.Module):
```
Which now becomes:
```python
class GPTModel(torch.nn.Module):
```
And during **Model Initialization**, our code was:
```python
model = BigramLanguageModel().to(device=device)
```
Which now becomes:
```python
model = GPTModel().to(device=device)
```

## Change - 2

We will now create a `Head` module, that implements a **single self attention head** and takes `headSize` as a **hyper-parameter**...

During initialization, we will create the **keys, queries & values** as `Linear` projections, and because we will be placing the `Head` module right after we fuse the `tokenEmbeddings` and `positionalEmbeddings`, we will get the inputs to this layer as a `(B, T, C)` tensor. If you remember, in the explanation we set the `fan-in` of these **keys, queries & values** to the `channel` dimension of the `inputs`, which for our case now is the channel dimension of these fused embeddings, `numberOfEmbeddingDimensions`, which means we will set the `fan-in` of these **keys, queries & values** to `numberOfEmbeddingDimensions` as well...

And we will rename these `concatenatedEmbeddings` to just `embeddings` now, because we will be using the same outputs to our `languageModelingHead` in order to calculate logits... So it is better to rename it rather than creating confusion...

Now the `lowerTriangularMatrix` is not a parameter of the `Head` module; in PyTorch naming conventions it is called a **buffer**. And because we are inheriting from `torch.nn.Module`, we can use <a href="https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_buffer">`register_buffer(name, tensor, persistent=True)`</a>, which takes a string name and a tensor and registers the tensor under that name, such that we can use it later in our block using `self.<buffer name>` or `self.lowerTriangularMatrix`... And because our `inputs` have a time dimension of at most `blockSize`, we can use `blockSize` to initialize the `lowerTriangularMatrix` tensor, and because `masked_fill()` checks which entries of the mask are `0`, we slice the mask as `lowerTriangularMatrix[:time, :time]` to future proof our code, such that a shorter time dimension (for example at the start of generation) does not cause any issues...
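The difference between a parameter and a buffer can be inspected directly. This is a minimal sketch; the module `MaskHolder` and its `projection` layer are hypothetical and exist only to give the module one real parameter to compare against:

```python
import torch

class MaskHolder(torch.nn.Module):
    """ Hypothetical module, only for illustrating buffers """
    def __init__(self, blockSize):
        super().__init__()
        # A trainable parameter, for comparison
        self.projection = torch.nn.Linear(4, 4, bias=False)
        # A non-trainable tensor that still belongs to the module
        self.register_buffer('lowerTriangularMatrix', torch.tril(torch.ones(blockSize, blockSize)))

holder = MaskHolder(blockSize=8)

parameterNames = [name for name, _ in holder.named_parameters()]
bufferNames = [name for name, _ in holder.named_buffers()]
stateDictKeys = list(holder.state_dict().keys())

# Cropping the mask for a sequence that is shorter than blockSize
time = 5
croppedMask = holder.lowerTriangularMatrix[:time, :time]  # (5, 5)

print(parameterNames)
print(bufferNames)
print(stateDictKeys)
print(croppedMask.shape)
```

Because the mask is registered on the module, it is saved with the `state_dict()` and moves along with the module when we call `.to(device=device)`, but the optimizer never sees it because it is not listed among the parameters...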

And during the `forward()` we pass the `inputs` through the **keys** and **queries** to compute the `weights` using the scaled dot product that we discussed, and then we make sure that the `tokens` don't communicate with the future using the `self.lowerTriangularMatrix` mask, then we `softmax()` the weights, calculate the **values**, aggregate the **values** with the `weights` and return the output...

So now our code looks like:
```python
# Head Module Definition
class Head(torch.nn.Module):
    """ Single Head of Self Attention """
    # Constructor for the Head
    def __init__(self, headSize):
        super().__init__()
        self.key = torch.nn.Linear(numberOfEmbeddingDimensions, headSize, bias=False)
        self.query = torch.nn.Linear(numberOfEmbeddingDimensions, headSize, bias=False)
        self.value = torch.nn.Linear(numberOfEmbeddingDimensions, headSize, bias=False)
        self.register_buffer(name='lowerTriangularMatrix', tensor=torch.tril(torch.ones(blockSize, blockSize)))

    def forward(self, inputs):
        # Unpacking the shape of inputs
        batch, time, channel = inputs.shape
        # Forwarding the inputs to keys and queries
        k = self.key(inputs) # (B, T, headSize)
        q = self.query(inputs) # (B, T, headSize)
        # Computing the weights with the scaled dot product
        # (the scale is read from the keys, so it always matches this head's own size)
        weights = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5 # (B, T, T)
        # Masking the weights
        weights = weights.masked_fill(self.lowerTriangularMatrix[:time, :time] == 0, float('-inf')) # (B, T, T)
        # Softmax the weights
        weights = F.softmax(weights, dim=-1) # (B, T, T)
        # Forwarding the inputs to values
        v = self.value(inputs) # (B, T, headSize)
        # Aggregating the weights and the values
        output = weights @ v # (B, T, headSize)
        return output
```
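Before wiring the `Head` into the model, we can try it on its own. This is a sketch with assumed hyper-parameter values (`numberOfEmbeddingDimensions = 32`, `blockSize = 8`, `headSize = 16` are picked only for this check), and the class is a compact copy of the module above so that the block runs by itself:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

# Assumed hyper-parameter values, chosen only for this check
numberOfEmbeddingDimensions = 32
blockSize = 8
headSize = 16

class Head(torch.nn.Module):
    """ Compact copy of the single self attention head """
    def __init__(self, headSize):
        super().__init__()
        self.key = torch.nn.Linear(numberOfEmbeddingDimensions, headSize, bias=False)
        self.query = torch.nn.Linear(numberOfEmbeddingDimensions, headSize, bias=False)
        self.value = torch.nn.Linear(numberOfEmbeddingDimensions, headSize, bias=False)
        self.register_buffer('lowerTriangularMatrix', torch.tril(torch.ones(blockSize, blockSize)))

    def forward(self, inputs):
        batch, time, channel = inputs.shape
        k = self.key(inputs)
        q = self.query(inputs)
        # The last dimension of the keys equals headSize
        weights = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        weights = weights.masked_fill(self.lowerTriangularMatrix[:time, :time] == 0, float('-inf'))
        weights = F.softmax(weights, dim=-1)
        return weights @ self.value(inputs)

head = Head(headSize=headSize)
embeddings = torch.randn(4, blockSize, numberOfEmbeddingDimensions)
output = head(embeddings)  # (4, 8, 16)

# Overwrite only the last token: the earlier positions must not notice
changedEmbeddings = embeddings.clone()
changedEmbeddings[:, -1, :] = torch.randn(4, numberOfEmbeddingDimensions)
changedOutput = head(changedEmbeddings)
pastIsUnaffected = torch.allclose(output[:, :-1], changedOutput[:, :-1], atol=1e-5)

# A shorter context works too, thanks to the [:time, :time] slice of the mask
shortOutput = head(embeddings[:, :5, :])  # (4, 5, 16)

print(output.shape, shortOutput.shape, pastIsUnaffected)
```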

Now we can initialize our `Head` module inside `GPTModel` and modify the forward pass of the module...

So, we will initialize the `Head` as `selfAttentionHead` variable and keep the `headSize` as `numberOfEmbeddingDimensions` as of now...

And during the forward pass, right after we have the fused `tokenEmbeddings` and `positionalEmbeddings`, we will forward the `embeddings` to the `selfAttentionHead` to get the output embeddings and those output embeddings are going to go into the decoder `languageModelingHead` to create the `logits`...

So now our `GPTModel` definition looks like:
```python
# Model Module Definition
class GPTModel(torch.nn.Module):
    # Constructor for the model
    def __init__(self):
        # Initializing the embedding table
        super().__init__()
        self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, numberOfEmbeddingDimensions)
        self.positionalEmbeddingTable = torch.nn.Embedding(blockSize, numberOfEmbeddingDimensions)
        self.selfAttentionHead = Head(headSize=headSize)
        self.languageModelingHead = torch.nn.Linear(numberOfEmbeddingDimensions, vocabularySize)

    # Forward Pass
    def forward(self, indeces, labels=None):
        # Unpacking the shape of indeces
        batch, time = indeces.shape

        # Index into embeddings to get the token embeddings
        tokenEmbeddings = self.tokenEmbeddingTable(indeces) # (B, T, C)
        # Index into embeddings to get the positional embeddings
        positionalEmbeddings = self.positionalEmbeddingTable(torch.arange(time, device=device)) # (T, C)
        # Fuse the token embeddings and positional embeddings together to pack the information in a single tensor
        embeddings = tokenEmbeddings + positionalEmbeddings # (B, T, C)
        # Pass the fused embeddings into our self attention head
        embeddings = self.selfAttentionHead(embeddings) # (B, T, headSize)
        # Pass the embeddings through a linear layer
        logits = self.languageModelingHead(embeddings) # (B, T, vocabularySize)

        if labels is None:
            loss = None
        else:
            # Pop out the shape dimensions
            batch, time, channel = logits.shape
            # Stretch out the logits and labels
            logits = logits.view(batch*time, channel)
            labels = labels.view(batch*time)
            # Calculate loss
            loss = F.cross_entropy(logits, labels)
        return logits, loss

    # Generation
    def generate(self, indeces, maximumNewTokens):
        for _ in range(maximumNewTokens):
            # Crop the indeces upto most recent block size context
            croppedIndeces = indeces[:, -blockSize:]
            # Forward Through Model
            logits, loss = self(croppedIndeces)
            # Focus on the last time step
            logits = logits[:, -1, :]
            # Applying softmax for the last dimension
            probabilities = F.softmax(logits, dim=-1)
            # Sample from distribution
            nextIndex = torch.multinomial(probabilities, num_samples=1)
            # Concatenate currentIndex with nextIndex
            indeces = torch.cat((indeces, nextIndex), dim=1)
        return indeces
```

## Change - 3

Before we run our first `GPT`, we will decrease the `learningRate` from `1e-2` to `1e-3`, because `Self Attention` cannot tolerate very high learning rates, and because our `learningRate` is now lower we will also increase the `epochs` from `30000` to say `50000`...

So, let's now run it and keep track of our losses...

## Keeping track of losses

Losses at `bigram_v1.py`:
```python
Step 29500: Training Loss 2.4385, Validation Loss 2.4322
```
Losses at `bigram_v2.py`:
```python
Step 29500: Training Loss 2.4641, Validation Loss 2.4487
```
Losses at `single_self_attention_gpt.py`
```python
Step 49500: Training Loss 2.2869, Validation Loss 2.2751
```

This seems like a nice improvement now, but our model is still not predicting good output and we get something like this:
```bash

“I fout he
astt fery witrel at ce “Theedigh whe way’beny is, llin wary, y’s, orugh adongose.
She, llinggtr acortinul .” Hamobou’ soryot himato ank adrdd lary ige hiovingen cksint hon ederr tited I toplim Paos colso thme he.

Hame thurion ly u?”


Here mpat theirse yin ’ree, I Icthe yound artherre gthe, wing
hesirou’stthe dy, sle Rof winu,”

“Harread ack ladng He Lithamomerd seve han oucu ath hisene te ing

Shis Chory thoun, ”


“Ley whad towourath war oe tohad ly wie here a). Then demame
```

Which is still gibberish, but the lower loss suggests that the `Self Attention` head that we have created is doing some useful communication...

# Multi-Head Attention

If we now look at the <a href="https://arxiv.org/abs/1706.03762">"Attention Is All You Need"</a> paper and scroll down a little bit we will see a section of `Multi-Head Attention`...

So what is this new `Multi-Head Attention`?

Well `Multi-Head Attention`, is just applying **multiple** attentions in **parallel** and **concatenating** the results...

Let's check out the diagram according to the original paper...
![Multi Head Attention](ExplanationMedia/Images/Multi_Head_Attention.png)

So we can create our own `MultiHeadAttention` module now that takes `numberOfHeads` and `headSize` as **hyper-parameters** for the module...

Where `numberOfHeads` specifies the number of heads we want in our `Attention` and `headSize` specifies the head size in each of these `numberOfHeads`...

We can then define the initialization of `Heads` in a list of Modules or <a href="https://pytorch.org/docs/stable/generated/torch.nn.ModuleList.html">ModuleList</a> from PyTorch, which can later be used to act as an iterable...

And during the forward pass, we run every head on the same `inputs` (conceptually in parallel, although the list comprehension calls them one after another), and concatenate their outputs along the last dimension, which is the `channel` dimension for us...

So now the `MultiHeadAttention` module code looks like:
```python
# Multi-Head Attention Module Definition
class MultiHeadAttention(torch.nn.Module):
    """ Multiple Heads of Self Attention in Parallel """
    # Constructor for the Multi-Head Attention
    def __init__(self, numberOfHeads, headSize):
        super().__init__()
        self.heads = torch.nn.ModuleList([Head(headSize=headSize) for _ in range(numberOfHeads)])

    # Forward Pass
    def forward(self, inputs):
        # Returns the concatenated heads over the channel dimension
        return torch.cat([head(inputs) for head in self.heads], dim=-1)
```

So now we don't have just a single `Attention` that has a `headSize` of `32` (because we used `headSize` as `numberOfEmbeddingDimensions` which was `32`)...

So instead of having `1` communication channel (a single `Head`), we will now have `4` communication channels (`numberOfHeads`) in parallel, and each one of these communication channels will typically be correspondingly smaller... So because we have `4` communication channels, we want `8` dimensional `Self Attention` heads or `numberOfEmbeddingDimensions // 4` (which concatenate to give us `32`, the same `headSize` that we had before)...
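The shape arithmetic can be confirmed with a quick run. This is a sketch with assumed hyper-parameter values (`numberOfEmbeddingDimensions = 32`, `numberOfHeads = 4`, `blockSize = 8`), and both classes are compact copies so that the block runs by itself:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

# Assumed hyper-parameter values, chosen only for this check
numberOfEmbeddingDimensions = 32
numberOfHeads = 4
blockSize = 8

class Head(torch.nn.Module):
    """ Compact copy of the single self attention head """
    def __init__(self, headSize):
        super().__init__()
        self.key = torch.nn.Linear(numberOfEmbeddingDimensions, headSize, bias=False)
        self.query = torch.nn.Linear(numberOfEmbeddingDimensions, headSize, bias=False)
        self.value = torch.nn.Linear(numberOfEmbeddingDimensions, headSize, bias=False)
        self.register_buffer('lowerTriangularMatrix', torch.tril(torch.ones(blockSize, blockSize)))

    def forward(self, inputs):
        batch, time, channel = inputs.shape
        k, q, v = self.key(inputs), self.query(inputs), self.value(inputs)
        weights = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        weights = weights.masked_fill(self.lowerTriangularMatrix[:time, :time] == 0, float('-inf'))
        return F.softmax(weights, dim=-1) @ v

class MultiHeadAttention(torch.nn.Module):
    """ Compact copy of the multi-head module """
    def __init__(self, numberOfHeads, headSize):
        super().__init__()
        self.heads = torch.nn.ModuleList([Head(headSize=headSize) for _ in range(numberOfHeads)])

    def forward(self, inputs):
        return torch.cat([head(inputs) for head in self.heads], dim=-1)

headSize = numberOfEmbeddingDimensions // numberOfHeads  # 32 // 4 = 8
multiHead = MultiHeadAttention(numberOfHeads=numberOfHeads, headSize=headSize)

embeddings = torch.randn(2, blockSize, numberOfEmbeddingDimensions)
output = multiHead(embeddings)                    # (2, 8, 32): 4 heads of 8 channels each
firstHeadOutput = multiHead.heads[0](embeddings)  # (2, 8, 8)

print(output.shape, firstHeadOutput.shape)
```

The first `8` channels of the concatenated output are exactly what the first `Head` produced, the next `8` belong to the second `Head`, and so on...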

So we will replace our older line:
```python
self.selfAttentionHead = Head(headSize=headSize)
```

With this line:
```python
self.selfAttentionHeads = MultiHeadAttention(numberOfHeads=numberOfHeads, headSize=numberOfEmbeddingDimensions//numberOfHeads)
```

And change the forward pass from:
```python
# Pass the concatenated embeddings into our self attention head
embeddings = self.selfAttentionHead(embeddings) # (B, T, C)
```
To this:
```python
# Pass the concatenated embeddings into our multihead attention
embeddings = self.selfAttentionHeads(embeddings) # (B, T, C)
```

And if you're familiar with **convolutions**, this is kind of like a **Grouped Convolution** because, instead of having a very large convolution, we do small convolutions in groups...

In multi-head self-attention, each attention head learns different relationships between `tokens` in the input sequence, enabling the model to attend to multiple aspects of the data simultaneously. Similarly, in grouped convolution, different groups of `channels` capture different features in the input, providing a diverse set of representations...

So now let's take our knowledge and create a new script for it...

# Changing Script to `multi_head_attention_gpt.py`

So we will now take all the knowledge that we have from our `Multi-Head Attention` and create a new script `multi_head_attention_gpt.py` and try to check what kind of outputs we get and compare our old losses as well...

So after running the `Multi-Head Attention` we get an output like this:
```bash

I fhur he
astt feye watred at crign. Maid! IW
loaing mons, loin ward this
orng the
hay eak he alling to Hery rey seer had ou’ york this to hatk a Moddgard the his
Page cas, the withey criged I moplionk of toh a thof hake siang thurion lyou?”

Harry stan their asling the
noththe you ding hering the - Jus
hes roubutthersc, sle com youu,” thims Read ack ladn’t bect.”

Marsp whe cas of wi; he asen by Piffagrwas got shet, thage obe leyhith reen buatthe — ieet:

He nowe he ove). Tolng him u
```

## Keeping track of losses

Losses at `bigram_v1.py`:
```python
Step 29500: Training Loss 2.4385, Validation Loss 2.4322
```
Losses at `bigram_v2.py`:
```python
Step 29500: Training Loss 2.4641, Validation Loss 2.4487
```
Losses at `single_self_attention_gpt.py`
```python
Step 49500: Training Loss 2.2869, Validation Loss 2.2751
```
Losses at `multi_head_attention_gpt.py`
```python
Step 49500: Training Loss 2.0459, Validation Loss 2.0368
```

We see that our output is still not that amazing but our losses are definitely improving...

And in conclusion, it helps to have multiple communication channels because our `tokens` have a lot to talk about, which eventually makes them find and communicate many kinds of different things and gather lots of different types of data and then **decode** the output...

# Complete `Transformer` Explanation from <a href="https://arxiv.org/abs/1706.03762">"Attention Is All You Need"</a> Paper

## Understanding Where We Are

Now we haven't discussed the `Transformer` architecture that is illustrated in the original <a href="https://arxiv.org/abs/1706.03762">"Attention Is All You Need"</a> Paper...

Let's look at the diagram again to relate what we have done till now...
![Transformer_Model_Architecture](ExplanationMedia/Images/Transformer_Model_Architecture.png)


We are starting to see some things that we have already implemented...

Such as the `Positional Encodings` and the `Token Encodings` that **add up**, the `Masked Multi-Head Attention`...

We see there's another `Multi-Head Attention` that is a `Cross Attention` into an encoder, which we are **not** going to implement, and I will explain why we won't implement that in a bit...

We also see something known as `Nx`, which is mentioned because the parts that we see inside a white box are grouped into a `Block` that gets repeated `N` times...

And we see that we have lots of other parts that we have not implemented yet...

So let's do that one by one now. But before we do that I wanted to show you how much we have implemented of the entire illustration so far...
![TransformerArchitectureKnownImplementation](ExplanationMedia/Images/TransformerArchitectureKnownImplementation.png)

We understand that we have a lot to implement still, so let's do that now...

## Understanding `Feed Forward` Block

First let's focus on the `Feed Forward` part that we have here...

This `Feed Forward` part is just a **Multi-Layer Perceptron** at the moment...

And we'd also like to add some computation within the network now (which will be on a **per-node** level)...

But why add computation when we are trying to reduce the computation in the first place?

Good question... The thing is, we went way too fast to compute the `logits`. Basically, the `tokens` looked at each other in the `Multi-Head Attention` but did not really have a lot of time to *think* about what they *found* from the other `tokens`...

So let's first try to write out a `Feed Forward` Module block that will have a single `Linear` layer followed by a `ReLU` non-linearity, and whose forward pass simply calls this small network on the `inputs` that it gets...

Now because the `Linear` layer expects a `fan-in` and `fan-out`, and because we will place the `Feed Forward` network after we get the outputs from `Multi-Head Attention` and before we pass it to the last `languageModelingHead`, we expect a `fan-in` in the shape of `numberOfEmbeddingDimensions`, and we will keep the `fan-out` to be the same, such that the inputs that `languageModelingHead` expects should be the same that comes out of this `Feed Forward` block...

So our `Feed Forward` module definition looks like:
```python
# Feed Forward Module Definition
class FeedForward(torch.nn.Module):
    """ Simple Feed Forward Network """
    # Constructor for the Feed Forward Network
    def __init__(self, numberOfEmbeddingDimensions):
        # Initializing the layers
        super().__init__()
        self.network = torch.nn.Sequential(
            torch.nn.Linear(numberOfEmbeddingDimensions, numberOfEmbeddingDimensions),
            torch.nn.ReLU()
        )

    # Forward Pass
    def forward(self, inputs):
        return self.network(inputs)
```

Now that we have the definition, we can add this `Feed Forward` by initializing it in our `GPTModel`'s constructor and calling it in the forward pass right before the `languageModelingHead`...

Like this:
```python
class GPTModel(torch.nn.Module):
    # Constructor for the model
    def __init__(self):
        # Initializing the embedding table
        super().__init__()
        self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, numberOfEmbeddingDimensions)
        self.positionalEmbeddingTable = torch.nn.Embedding(blockSize, numberOfEmbeddingDimensions)
        self.selfAttentionHeads = MultiHeadAttention(numberOfHeads=numberOfHeads, headSize=numberOfEmbeddingDimensions//numberOfHeads)
        self.feedforwardnetwork = FeedForward(numberOfEmbeddingDimensions=numberOfEmbeddingDimensions)
        self.languageModelingHead = torch.nn.Linear(numberOfEmbeddingDimensions, vocabularySize)

    # Forward Pass
    def forward(self, indeces, labels=None):
        # Unpacking the shape of indeces
        batch, time = indeces.shape

        # Index into embeddings to get the token embeddings
        tokenEmbeddings = self.tokenEmbeddingTable(indeces) # (B, T, C)
        # Index into embeddings to get the positional embeddings
        positionalEmbeddings = self.positionalEmbeddingTable(torch.arange(time, device=device)) # (T, C)
        # Fuse the token embeddings and positional embeddings together to pack the information in a single tensor
        embeddings = tokenEmbeddings + positionalEmbeddings # (B, T, C)
        # Pass the summed embeddings into our multihead attention
        embeddings = self.selfAttentionHeads(embeddings) # (B, T, C)
        # Forward through the feed forward network
        embeddings = self.feedforwardnetwork(embeddings) # (B, T, C)
        # Pass the embeddings through a linear layer
        logits = self.languageModelingHead(embeddings) # (B, T, vocabularySize)

        if labels is None:
            loss = None
        else:
            # Pop out the shape dimensions
            batch, time, channel = logits.shape
            # Stretch out the logits and labels
            logits = logits.view(batch*time, channel)
            labels = labels.view(batch*time)
            # Calculate loss
            loss = F.cross_entropy(logits, labels)
        return logits, loss
```

Now we notice that our `Feed Forward` applies the `Linear` transformation on a **per-token** level (all the `tokens` do this independently)...

Which means that `Self Attention` is the *communication*, and once the `tokens` have gathered all the *information*, they need to think about that gathered *information* individually in `Feed Forward`...
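To make the **per-token** claim concrete, here is a minimal sketch (the small sizes and the stand-in `feedForward` layer are assumptions for illustration, not the hyper-parameters of our model). It perturbs a single `token` and checks that the outputs of all the other `tokens` stay the same...

```python
import torch

torch.manual_seed(0)

# Illustrative sizes only (assumed, not the notebook's hyper-parameters)
batch, time, channels = 2, 4, 8

# Stand-in for the Feed Forward block: Linear followed by ReLU
feedForward = torch.nn.Sequential(
    torch.nn.Linear(channels, channels),
    torch.nn.ReLU(),
)

inputs = torch.randn(batch, time, channels)  # (B, T, C)
outputs = feedForward(inputs)                # (B, T, C)

# Perturb only the token at position 1 in every sequence
perturbed = inputs.clone()
perturbed[:, 1, :] = torch.randn(batch, channels)
perturbedOutputs = feedForward(perturbed)

# Every other position is unaffected, because nothing flows across the time dimension here
otherPositions = [0, 2, 3]
othersUnchanged = torch.allclose(
    outputs[:, otherPositions, :], perturbedOutputs[:, otherPositions, :]
)
print(outputs.shape, othersUnchanged)
```

The `Linear` layer only acts on the last (channel) dimension, so the `batch` and `time` dimensions are both treated as batch dimensions...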

## Changing Script to `multi_head_attention_with_feedforward_gpt.py`

Let's now take the knowledge that we have from our `Feed Forward` Block, implement it in a `multi_head_attention_with_feedforward_gpt.py` file, and check how far we have come from our previous `losses`...

The output we get:
```bash

th-ir, “I dong’t the weerverounselled ...”

“Waid to, Potermoo lasophe
whound tell dolles, stalughead to to
gethe — ... sheile
is a ther
to Harry her. skot
shor beetin a gazinged uppror as of ther






“Whien thed rark of it way sthrey days werelley he wplucten, — ” said no’s her, on to out the
of the pam getion.
Ron her. .. ...
```

### Keeping track of losses

Losses at `bigram_v1.py`:
```python
Step 29500: Training Loss 2.4385, Validation Loss 2.4322
```
Losses at `bigram_v2.py`:
```python
Step 29500: Training Loss 2.4641, Validation Loss 2.4487
```
Losses at `single_self_attention_gpt.py`:
```python
Step 49500: Training Loss 2.2869, Validation Loss 2.2751
```
Losses at `multi_head_attention_gpt.py`:
```python
Step 49500: Training Loss 2.0459, Validation Loss 2.0368
```
Losses at `multi_head_attention_with_feedforward_gpt.py`:
```python
Step 49500: Training Loss 1.9708, Validation Loss 1.9792
```

Now we would like to intersperse the *communication* with the *computation*... Which is also what the `Transformer` does, when it has blocks that *communicate* and *compute*, and it groups them and replicates them...

So let's do that now...

## Interspersing *communication* (Multi-Head Attention) & *computation* (Feed Forward)

So now, we want a `Transformer Block` module that intersperses the *communication* followed by the *computation*.

The *communication* is done by the `Multi-Head Attention` and the *computation* is done using a `Feed Forward` network on all the `tokens` independently (the white box that we talked about from the original illustration)...

Which means that we want to create a `TransformerBlock` Module that initializes two things:
1. *communication* → `Multi-Head Attention`
2. *computation* → `Feed Forward`

Which means we need to pass all the hyper-parameters required by both the blocks through this `TransformerBlock`'s constructor, and we need to call both of them during forward pass with `inputs`...

So now our `TransformerBlock` module looks like:
```python
# Transformer Block Module Definition
class TransformerBlock(torch.nn.Module):
    """ Communication Followed By Computation """
    # Constructor for the Transformer Block 
    def __init__(self, numberOfEmbeddingDimensions, numberOfHeads):
        # Initializing the Transformer Block
        super().__init__()
        self.selfAttention = MultiHeadAttention(numberOfHeads=numberOfHeads, headSize=numberOfEmbeddingDimensions//numberOfHeads)
        self.feedforwardnetwork = FeedForward(numberOfEmbeddingDimensions=numberOfEmbeddingDimensions)

    # Forward Pass
    def forward(self, inputs):
        embeddings = self.selfAttention(inputs) # (B, T, C)
        embeddings = self.feedforwardnetwork(embeddings) # (B, T, C)
        return embeddings
```

Now that we have our `TransformerBlock` module, we can change our main `GPTModel`'s constructor from this:
```python
# Constructor for the model
def __init__(self):
    # Initializing the embedding table
    super().__init__()
    self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, numberOfEmbeddingDimensions)
    self.positionalEmbeddingTable = torch.nn.Embedding(blockSize, numberOfEmbeddingDimensions)
    self.selfAttentionHeads = MultiHeadAttention(numberOfHeads=numberOfHeads, headSize=numberOfEmbeddingDimensions//numberOfHeads)
    self.feedforwardnetwork = FeedForward(numberOfEmbeddingDimensions=numberOfEmbeddingDimensions)
    self.languageModelingHead = torch.nn.Linear(numberOfEmbeddingDimensions, vocabularySize)
```
To this:
```python
# Constructor for the model
def __init__(self):
    # Initializing the model parameters
    super().__init__()
    self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, numberOfEmbeddingDimensions)
    self.positionalEmbeddingTable = torch.nn.Embedding(blockSize, numberOfEmbeddingDimensions)
    self.blocks = torch.nn.Sequential(
        TransformerBlock(numberOfEmbeddingDimensions=numberOfEmbeddingDimensions, numberOfHeads=numberOfHeads),
        TransformerBlock(numberOfEmbeddingDimensions=numberOfEmbeddingDimensions, numberOfHeads=numberOfHeads),
        TransformerBlock(numberOfEmbeddingDimensions=numberOfEmbeddingDimensions, numberOfHeads=numberOfHeads),
    )
    self.languageModelingHead = torch.nn.Linear(numberOfEmbeddingDimensions, vocabularySize)
```
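Writing the same `TransformerBlock(...)` line three times works, but the `Nx` repetition can also be expressed with a list comprehension. The sketch below uses a hypothetical `numberOfLayers` variable and a tiny stand-in block (both are assumptions for illustration, any module that maps `(B, T, C)` to `(B, T, C)` stacks the same way)...

```python
import torch

# Hypothetical hyper-parameters for illustration
numberOfLayers = 3
channels = 8

class StandInBlock(torch.nn.Module):
    """ Placeholder for TransformerBlock: maps (B, T, C) -> (B, T, C) """
    def __init__(self, channels):
        super().__init__()
        self.layer = torch.nn.Linear(channels, channels)

    def forward(self, inputs):
        return torch.relu(self.layer(inputs))

# Unpack a list of independently initialized blocks into Sequential
blocks = torch.nn.Sequential(*[StandInBlock(channels) for _ in range(numberOfLayers)])

outputs = blocks(torch.randn(2, 4, channels))
print(len(blocks), outputs.shape)
```

Each block in the list is constructed separately, so every repetition gets its own parameters rather than sharing weights...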

And the forward pass of `GPTModel` from this:
```python
# Forward Pass
def forward(self, indeces, labels=None):
    # Unpacking the shape of indeces
    batch, time = indeces.shape

    # Index into embeddings to get the token embeddings
    tokenEmbeddings = self.tokenEmbeddingTable(indeces) # (B, T, C)
    # Index into embeddings to get the positional embeddings
    positionalEmbeddings = self.positionalEmbeddingTable(torch.arange(time, device=device)) # (T, C)
    # Fuse the token embeddings and positional embeddings together to pack the information in a single tensor
    embeddings = tokenEmbeddings + positionalEmbeddings # (B, T, C)
    # Pass the summed embeddings into our multihead attention
    embeddings = self.selfAttentionHeads(embeddings) # (B, T, C)
    # Forward through the feed forward network
    embeddings = self.feedforwardnetwork(embeddings) # (B, T, C)
    # Pass the embeddings through a linear layer
    logits = self.languageModelingHead(embeddings) # (B, T, vocabularySize)
```
To this:
```python
# Forward Pass
def forward(self, indeces, labels=None):
    # Unpacking the shape of indeces
    batch, time = indeces.shape

    # Index into embeddings to get the token embeddings
    tokenEmbeddings = self.tokenEmbeddingTable(indeces) # (B, T, C)
    # Index into embeddings to get the positional embeddings
    positionalEmbeddings = self.positionalEmbeddingTable(torch.arange(time, device=device)) # (T, C)
    # Fuse the token embeddings and positional embeddings together to pack the information in a single tensor
    embeddings = tokenEmbeddings + positionalEmbeddings # (B, T, C)
    # Pass the summed embeddings into our blocks
    embeddings = self.blocks(embeddings) # (B, T, C)
    # Pass the embeddings through a linear layer
    logits = self.languageModelingHead(embeddings) # (B, T, vocabularySize)
```

## Changing Script to `transformer_block_gpt.py`

Let's now implement this knowledge of replicating `TransformerBlock`s in a `transformer_block_gpt.py` script file...

I will not record the losses this time, because I have already run it myself and we still don't get a very good result...

And the reason for that is, we are actually starting to get into a very deep neural network now...

And deep neural networks suffer from optimization issues, and that's what we are starting to run into...

So we need two more ideas that we can borrow from our original <a href="https://arxiv.org/abs/1706.03762">"Attention Is All You Need"</a> Paper to resolve those difficulties...

Those are:
1. Residual Connections/Skip Connections
2. Layer Normalization

So let's now discuss the ideas...

## Understanding **Skip Connections / Residual Connections**

Let's look at the `Transformer` architecture again where I have highlighted some parts...
![Transformer_Model_Architecture_Skip_Connections](ExplanationMedia/Images/Transformer_Model_Architecture_Skip_Connections.png)

See the *red coloured* parts that I have highlighted and how they **skip** some of the smaller blocks?

Those are known as **Skip Connections** or **Residual Connections**. And they come from the paper <a href="https://arxiv.org/abs/1512.03385">Deep Residual Learning for Image Recognition</a> from about 2015, which we have already seen in our **NameWeave** series...

Let's understand what they do...

![Residual Block](ExplanationMedia/Images/ResidualBlock.webp)

**Skip Connections** or **Residual Connections** mean that we transform the data we get from our `inputs`, but then add the result back to the untransformed features through a *skip connection*...

Now the way I like to visualize it, is the following...

![Residual Blocks](ExplanationMedia/Images/ResidualBlocks.png)

In **Residual Connections** the *computation* happens from the top to bottom which is a residual pathway, and we are free to *fork off* from residual pathway, perform some other *computation* and then **project** back to the residual pathway via an addition...

Which means we go from `inputs` to the targets only via these chains of additions...

Why is this useful?

The reason it is useful is that, if we remember from our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/Neural%20Network%20with%20Derivatives.ipynb">Neural Network with Derivatives</a> notebook, addition distributes gradients equally to all of the branches that were fed in as its inputs. This means that the supervision (the gradients from the loss) hops through every addition node all the way to the `inputs`, and also *forks off* into the residual blocks...

But basically we have this gradient *super-highway* that goes directly from the supervision, all the way to the `inputs`, unimpeded.

And these **residual blocks** are usually initialized in the beginning so that they contribute very very little to the residual pathway, but then during the optimization, they *come online* over time and they start to contribute... 

But at least at initialization, we can go directly from the supervision to the `inputs` with the gradients flowing unimpeded, and then the blocks kick in over time...

Which dramatically helps with the optimization...
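Here is a small sketch of that gradient *super-highway*. As an assumption for illustration, the side `branch` is zero-initialized, which is an exaggerated version of "contributes very very little at initialization"; real blocks start with small random weights instead...

```python
import torch

torch.manual_seed(0)
channels = 8

# A side branch that contributes nothing at initialization (zeroed on purpose for this demo)
branch = torch.nn.Linear(channels, channels)
torch.nn.init.zeros_(branch.weight)
torch.nn.init.zeros_(branch.bias)

# Without a skip connection: the gradient has to pass through the (zeroed) branch
plainInputs = torch.randn(4, channels, requires_grad=True)
branch(plainInputs).sum().backward()

# With a skip connection: the addition passes the gradient straight through to the inputs
residualInputs = plainInputs.detach().clone().requires_grad_(True)
(residualInputs + branch(residualInputs)).sum().backward()

plainGradientIsZero = torch.all(plainInputs.grad == 0).item()
residualGradientIsOne = torch.allclose(residualInputs.grad, torch.ones_like(residualInputs))
print(plainGradientIsZero, residualGradientIsOne)
```

Without the addition, a branch that does nothing also blocks the gradient. With the addition, the `inputs` still receive the full gradient from the loss, no matter what the branch is doing...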

So let's modify our code to have **Residual Connections** now...

## Changing Script to `transformer_residual_block_gpt.py`

Right now, we understand that the skip connections we saw were in the white box, which means that we want to modify our `TransformerBlock` such that it contains the residual connection itself...

Let's look at the old `TransformerBlock` code that we have already:
```python
# Transformer Block Module Definition
class TransformerBlock(torch.nn.Module):
    """ Communication Followed By Computation """
    # Constructor for the Transformer Block 
    def __init__(self, numberOfEmbeddingDimensions, numberOfHeads):
        # Initializing the Transformer Block
        super().__init__()
        self.selfAttention = MultiHeadAttention(numberOfHeads=numberOfHeads, headSize=numberOfEmbeddingDimensions//numberOfHeads)
        self.feedforwardnetwork = FeedForward(numberOfEmbeddingDimensions=numberOfEmbeddingDimensions)

    # Forward Pass
    def forward(self, inputs):
        embeddings = self.selfAttention(inputs) # (B, T, C)
        embeddings = self.feedforwardnetwork(embeddings) # (B, T, C)
        return embeddings
```

Now, from the previous explanation we understand that in **residual connections** we have a common `inputs` node that branches off, does some work, and gets added back once the work is done...

Which means that we want to modify our forward pass of this block such that:
1. The inputs for the `selfAttention` do some *communication* and add themselves back to the `embeddings`
2. The inputs for the `feedforwardnetwork` do some *computation* and add themselves back to the `embeddings`

So our new forward pass looks like:
```python
# Forward Pass
def forward(self, embeddings):
    embeddings = embeddings + self.selfAttention(embeddings) # (B, T, C)
    embeddings = embeddings + self.feedforwardnetwork(embeddings) # (B, T, C)
    return embeddings
```

Now our residual connection provides a shortcut for the gradients to bypass layers during backpropagation, which helps to alleviate the vanishing gradient problem...

But now because we have used the residual connections in both `MultiHeadAttention` and `FeedForward` modules that we have created, we need to modify their respective initializations and forward passes as well...

Let's consider the `MultiHeadAttention` first...

Right now, our `MultiHeadAttention` looks like:
```python
# Multi-Head Attention Module Definition
class MultiHeadAttention(torch.nn.Module):
    """ Multiple Heads of Self Attention in Parallel """
    # Constructor for the Multi-Head Attention
    def __init__(self, numberOfHeads, headSize):
        super().__init__()
        self.heads = torch.nn.ModuleList([Head(headSize=headSize) for _ in range(numberOfHeads)])

    # Forward Pass
    def forward(self, inputs):
        # Returns the concatenated heads over the channel dimension
        return torch.cat([head(inputs) for head in self.heads], dim=-1)
```
Now we see that there are multiple `Head`s that gather *information* in parallel, and a concatenation operation that combines all of that *information* along the channel dimension. Because each head has a size of `numberOfEmbeddingDimensions // numberOfHeads`, the concatenated output is back to `numberOfEmbeddingDimensions` channels, but it is just the heads' outputs placed side by side, and nothing has mixed them yet. Since this output is about to be added back into the residual pathway, we want a learned linear transformation that mixes the outputs of the heads and *projects* them back into the residual pathway. And the easiest way to do that is to add a `Linear` layer, which we generally call `projection` when using `Transformers`...

So after adding our `projection` linear transformation to our `MultiHeadAttention` our code looks like:
```python
# Multi-Head Attention Module Definition
class MultiHeadAttention(torch.nn.Module):
    """ Multiple Heads of Self Attention in Parallel """
    # Constructor for the Multi-Head Attention
    def __init__(self, numberOfHeads, headSize):
        super().__init__()
        self.heads = torch.nn.ModuleList([Head(headSize=headSize) for _ in range(numberOfHeads)])
        self.projection = torch.nn.Linear(numberOfEmbeddingDimensions, numberOfEmbeddingDimensions)

    # Forward Pass
    def forward(self, inputs):
        # Concatenate the heads over the channel dimension
        output = torch.cat([head(inputs) for head in self.heads], dim=-1)
        # Project the concatenated heads back into the residual pathway
        output = self.projection(output)
        return output
```
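To keep track of the shapes involved, here is a minimal sketch that uses random tensors as stand-ins for the outputs of each `Head` (the sizes are assumptions for illustration). The concatenation brings us back to `numberOfEmbeddingDimensions` channels, and the `projection` keeps that size while mixing the heads together...

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumed): 4 heads splitting 32 embedding dimensions
batch, time = 2, 5
numberOfEmbeddingDimensions, numberOfHeads = 32, 4
headSize = numberOfEmbeddingDimensions // numberOfHeads  # 8 channels per head

# Random tensors standing in for the output of each Head
headOutputs = [torch.randn(batch, time, headSize) for _ in range(numberOfHeads)]

# Concatenate over the channel dimension: (B, T, numberOfHeads * headSize)
concatenated = torch.cat(headOutputs, dim=-1)

# Linear projection back into the residual pathway: (B, T, C)
projection = torch.nn.Linear(numberOfEmbeddingDimensions, numberOfEmbeddingDimensions)
projected = projection(concatenated)

print(concatenated.shape, projected.shape)
```

Note that the `projection` has to be applied to the concatenated `output` of the heads, not to the raw `inputs`, otherwise the work done by the heads is thrown away...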

Now to change the `Feed Forward` part, we understand that we can do the exact same thing with our `Feed Forward` and apply a projection...

This time just for simplicity, rather than having a `projection` variable, we can couple this projection into our `Sequential` layers itself, so that we don't need to change the forward pass of this Module...

So previously our `FeedForward` looked like:
```python
# Feed Forward Module Definition
class FeedForward(torch.nn.Module):
    """ Simple Feed Forward Network """
    # Constructor for the Feed Forward Network
    def __init__(self, numberOfEmbeddingDimensions):
        # Initializing the layers
        super().__init__()
        self.network = torch.nn.Sequential(
            torch.nn.Linear(numberOfEmbeddingDimensions, numberOfEmbeddingDimensions),
            torch.nn.ReLU()
        )

    # Forward Pass
    def forward(self, inputs):
        return self.network(inputs)
```
But now, after applying the `projection` linear transformation our `FeedForward` looks like:
```python
# Feed Forward Module Definition
class FeedForward(torch.nn.Module):
    """ Simple Feed Forward Network """
    # Constructor for the Feed Forward Network
    def __init__(self, numberOfEmbeddingDimensions):
        # Initializing the layers
        super().__init__()
        self.network = torch.nn.Sequential(
            torch.nn.Linear(numberOfEmbeddingDimensions, numberOfEmbeddingDimensions),
            torch.nn.ReLU(),
            torch.nn.Linear(numberOfEmbeddingDimensions, numberOfEmbeddingDimensions),
        )

    # Forward Pass
    def forward(self, inputs):
        return self.network(inputs)
```

There's one more thing that we need to take care of... In the original <a href="https://arxiv.org/abs/1706.03762">"Attention Is All You Need"</a> paper, there is a section called *"Position-wise Feed-Forward Networks"*, which states:

*The dimensionality of input and output is $d_{model} = 512$, and the inner-layer has dimensionality $d_{ff}=2048$*

And we see that the layers have the dimensionality that is `4` times the original input and output, so we can apply the inner-layer dimensionality to our `FeedForward` now as well...

So now our `FeedForward` after increasing the inner-layer dimensionality by `4` times looks like:
```python
# Feed Forward Module Definition
class FeedForward(torch.nn.Module):
    """ Simple Feed Forward Network """
    # Constructor for the Feed Forward Network
    def __init__(self, numberOfEmbeddingDimensions):
        # Initializing the layers
        super().__init__()
        self.network = torch.nn.Sequential(
            torch.nn.Linear(numberOfEmbeddingDimensions, 4 * numberOfEmbeddingDimensions),
            torch.nn.ReLU(),
            torch.nn.Linear(4 * numberOfEmbeddingDimensions, numberOfEmbeddingDimensions),
        )

    # Forward Pass
    def forward(self, inputs):
        return self.network(inputs)
```
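To get a feel for how much *computation* this adds, here is a small sketch that counts the parameters of both versions of the network, assuming `numberOfEmbeddingDimensions = 32`. The `countParameters` helper is a hypothetical name introduced only for this illustration...

```python
import torch

numberOfEmbeddingDimensions = 32

# Hypothetical helper for this illustration: total number of trainable values in a module
def countParameters(module):
    return sum(parameter.numel() for parameter in module.parameters())

# Same-width version (before applying the 4x inner-layer)
sameWidth = torch.nn.Sequential(
    torch.nn.Linear(numberOfEmbeddingDimensions, numberOfEmbeddingDimensions),
    torch.nn.ReLU(),
    torch.nn.Linear(numberOfEmbeddingDimensions, numberOfEmbeddingDimensions),
)

# Expanded version (inner-layer is 4x wider)
expanded = torch.nn.Sequential(
    torch.nn.Linear(numberOfEmbeddingDimensions, 4 * numberOfEmbeddingDimensions),
    torch.nn.ReLU(),
    torch.nn.Linear(4 * numberOfEmbeddingDimensions, numberOfEmbeddingDimensions),
)

sameWidthCount = countParameters(sameWidth)
expandedCount = countParameters(expanded)
print(sameWidthCount, expandedCount)
```

The expanded version has close to four times the parameters, while its input and output shapes stay at `(B, T, C)`, so it still plugs into the residual pathway unchanged...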

So we're adding a bit of *computation* here and growing that layer, that is on the *residual block* on the side of the main *residual pathway*...

So we can now make these changes to our script `transformer_residual_block_gpt.py` and run it...

## Understanding **Layer Normalization**

Now, the second innovation that is very helpful in optimizing very deep neural networks is right here...

Now in the original `Transformer` illustration we keep seeing this block:\
![AddNorm](ExplanationMedia/Images/Transformer_Model_AddNorm.png)

We understand that we have already implemented this `Add` part from our previous **Residual Connections** but there is still a part called `Norm` left...

And this `Norm` refers to something called <a href="https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html">LayerNorm</a> which is based on a paper that came out in 2016 as <a href="https://arxiv.org/abs/1607.06450">"Layer Normalization"</a>...

And `LayerNorm` is very very similar to `BatchNorm`...

And now if we remember our old <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20(MLP)%20-%20Activations%2C%20Gradients%20%26%20Batch%20Normalization.ipynb">NameWeave (MLP) - Activations, Gradients & Batch Normalization</a>, we implemented `Batch Normalization`, and batch normalization made sure that across the *batch-dimension* any individual neuron had *unit-gaussian* distribution (`0` mean and `unit` standard deviation output)...

Let me recall the code that we had before:
```python
class BatchNorm1d:
    def __init__(self, dimensions, epsilon=1e-5, momentum=0.1, training=True):
        self.epsilon = epsilon
        self.momentum = momentum
        # Parameters using dimensions
        self.gamma = torch.ones(dimensions)
        self.beta = torch.zeros(dimensions)
        # Buffers
        self.running_mean = torch.zeros(dimensions)
        self.running_variance = torch.ones(dimensions)
        
        self.training = training
        
    def __call__(self, inputs):
        if self.training:
            input_mean = inputs.mean(0, keepdim=True)
            input_variance = inputs.var(0, keepdim=True)
        else:
            input_mean = self.running_mean
            input_variance = self.running_variance
        unit_variance = (inputs - input_mean) / torch.sqrt(input_variance) + self.epsilon
        self.out = self.gamma * unit_variance + self.beta
        # Update Buffers
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * input_mean
                self.running_variance = (1 - self.momentum) * self.running_variance + self.momentum * input_variance
        return self.out
        
    def parameters(self):
        return [self.gamma, self.beta]
```

Let's test this block out as well...

For example, if we take `inputs` of a batch of `32` items having `100` dimensional vectors, and we pass them through the batch normalization layer like this:
```python
torch.manual_seed(69420)

batchNorm = BatchNorm1d(100)
inputs = torch.randn(32, 100) # Batch of 32 items with 100 dimensional vectors
outputs = batchNorm(inputs)

print(f"Mean: {outputs[:, 0].mean()} and Standard Deviation: {outputs[:, 0].std()} (Column Normalization)")
print(f"Mean: {outputs[0, :].mean()} and Standard Deviation: {outputs[0, :].std()} (Row Normalization)")
```
We get:
```python
Mean: 1.0009855031967163e-05 and Standard Deviation: 1.0 (Column Normalization)
Mean: -0.09684808552265167 and Standard Deviation: 1.0564608573913574 (Row Normalization)
```
Which means that it ensures `0` mean and `1` standard deviation by normalizing every single column of inputs by default...

But it does not normalize the rows by default, because we are just normalizing columns...

Let's now implement `LayerNorm` based on our `BatchNorm`...

And it is nothing but, normalizing the **rows** of the `inputs`, which means that we can change our old code:
```python
input_mean = inputs.mean(0, keepdim=True)
input_variance = inputs.var(0, keepdim=True)
```
To this:
```python
input_mean = inputs.mean(1, keepdim=True)
input_variance = inputs.var(1, keepdim=True)
```

And done...😂 We have now implemented `LayerNorm`...

So if we run the same example now...

We get:
```python
Mean: 0.03100830502808094 and Standard Deviation: 0.9175457954406738 (Column Normalization)
Mean: 1.003980651148595e-05 and Standard Deviation: 1.0 (Row Normalization)
```
We see that we are now normalizing **rows** by default...

Now, because our *computation* is **not across examples** anymore, we can remove everything related to the buffers, and there is no distinction between training and test time...

And now we have our implemented `LayerNorm`:
```python
class LayerNorm():
    def __init__(self, dimensions, epsilon=1e-5):
        self.epsilon = epsilon
        self.gamma = torch.ones(dimensions)
        self.beta = torch.zeros(dimensions)
        
    def __call__(self, inputs):
        input_mean = inputs.mean(1, keepdim=True)
        input_variance = inputs.var(1, keepdim=True)
        unit_variance = (inputs - input_mean) / torch.sqrt(input_variance) + self.epsilon
        self.out = self.gamma * unit_variance + self.beta
        return self.out
        
    def parameters(self):
        return [self.gamma, self.beta]
```

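Which closely mirrors PyTorch's original <a href="https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html">LayerNorm</a> (PyTorch adds `epsilon` to the variance inside the square root and uses the biased variance estimate, but the idea is the same)...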
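As a sanity check, here is a minimal sketch that compares a hand-written normalization against `torch.nn.LayerNorm`. It is an assumption-laden variant rather than our class above: it normalizes over the last dimension, uses the biased variance, and puts `epsilon` inside the square root, which is what the PyTorch documentation describes...

```python
import torch

torch.manual_seed(0)

epsilon = 1e-5
inputs = torch.randn(32, 100)  # Batch of 32 items with 100 dimensional vectors

# Hand-written layer normalization over the last dimension
inputMean = inputs.mean(dim=-1, keepdim=True)
inputVariance = inputs.var(dim=-1, keepdim=True, unbiased=False)  # biased estimate
manualOutputs = (inputs - inputMean) / torch.sqrt(inputVariance + epsilon)

# PyTorch's implementation (gamma starts at ones and beta at zeros)
layerNorm = torch.nn.LayerNorm(100, eps=epsilon)
referenceOutputs = layerNorm(inputs)

maxDifference = (manualOutputs - referenceOutputs).abs().max().item()
print(maxDifference)
```

Using `dim=-1` instead of `1` also means the same code works unchanged on `(B, T, C)` shaped `inputs`...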

So let's now implement `LayerNorm` within our transformer...

## Changing Script to `transformer_layernorm_gpt.py`

Now before we implement the `LayerNorm` we need to discuss one more thing...

In the last `5` years, very few things have changed from the original `Transformer` architecture, but this is one place where current practice departs from the original implementation...

That is this part:\
![AddNorm](ExplanationMedia/Images/Transformer_Model_AddNorm.png)

We see that, in the original paper, the `Add & Norm` is applied after the `MultiHeadAttention`'s transformation and the `FeedForward`'s transformation. It is now more common to apply the `Norm` (the `LayerNorm`) before each of these transformations, while the `Add` (the residual connection) still happens after them, which means that there's a reshuffling of the `LayerNorm`s...

And this is called the **Pre-Norm Formulation** and this is the one that we're going to implement as well...
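The difference between the two orderings can be sketched with two small functions. The names `postNormStep` and `preNormStep` are hypothetical, and the `sublayer` is a zeroed `Linear` standing in for attention or feed forward, chosen only so that the effect on the residual pathway is easy to see...

```python
import torch

torch.manual_seed(0)
channels = 8

def postNormStep(embeddings, sublayer, layerNorm):
    # Original paper ordering: transform, add, then normalize the sum
    return layerNorm(embeddings + sublayer(embeddings))

def preNormStep(embeddings, sublayer, layerNorm):
    # Pre-Norm ordering: normalize only the branch input, then add
    return embeddings + sublayer(layerNorm(embeddings))

# Stand-in sublayer that outputs zeros (assumption for the demo)
sublayer = torch.nn.Linear(channels, channels)
torch.nn.init.zeros_(sublayer.weight)
torch.nn.init.zeros_(sublayer.bias)

layerNorm = torch.nn.LayerNorm(channels)
embeddings = torch.randn(2, 4, channels) * 3 + 5  # deliberately not unit gaussian

postNormOutputs = postNormStep(embeddings, sublayer, layerNorm)
preNormOutputs = preNormStep(embeddings, sublayer, layerNorm)

preNormKeepsInputs = torch.allclose(preNormOutputs, embeddings)
postNormKeepsInputs = torch.allclose(postNormOutputs, embeddings)
print(preNormKeepsInputs, postNormKeepsInputs)
```

In the Pre-Norm ordering the residual pathway itself is never normalized, so a branch that contributes nothing leaves the `embeddings` untouched, whereas the original ordering rescales the pathway at every step...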

Now because the `LayerNorm` is already there in PyTorch, we can directly use `torch.nn`'s `LayerNorm` implementation to keep the code simpler...

And we also understand that because these `Add & Norm` are within the white block of the original `Transformer` implementation, we can modify our `TransformerBlock` during initialization to add these `LayerNorm`s, and during the forward pass to apply the normalizations right before the inputs go in for their original transformation...

And because our `inputs` to these `LayerNorm` layers are in the shape of `(B, T, C)`, the `batch` and the `time` dimensions are treated as **batch-dimensions**, which means that we could specify `numberOfEmbeddingDimensions` directly as the normalized shape parameter which for our case is `32`...

Which means that this is a **per-token** transformation that normalizes the features and makes them unit gaussian at initialization...
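Here is a minimal sketch of that **per-token** behaviour, with sizes that are assumptions for illustration. Every `(batch, time)` position ends up with roughly `0` mean and `1` standard deviation across its own channels...

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumed)
batch, time, channels = 4, 8, 32

layerNorm = torch.nn.LayerNorm(channels)

# Inputs that are deliberately far from unit gaussian
inputs = torch.randn(batch, time, channels) * 3 + 5  # (B, T, C)
outputs = layerNorm(inputs)                          # (B, T, C)

# Statistics over the channel dimension, one value per token
perTokenMean = outputs.mean(dim=-1)                 # (B, T)
perTokenStd = outputs.std(dim=-1, unbiased=False)   # (B, T)

print(perTokenMean.abs().max().item(), perTokenStd.min().item(), perTokenStd.max().item())
```

The statistics are computed for each `token` separately, so no information leaks across the `batch` or the `time` dimension...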

Now because these `LayerNorm`s have `gamma` and `beta` trainable parameters, it will eventually create outputs that might not be unit gaussian, but the optimization will determine that...

So now our `TransformerBlock` looks like:
```python
# Transformer Block Module Definition
class TransformerBlock(torch.nn.Module):
    """ Communication Followed By Computation """
    # Constructor for the Transformer Block 
    def __init__(self, numberOfEmbeddingDimensions, numberOfHeads):
        # Initializing the Transformer Block
        super().__init__()
        self.selfAttention = MultiHeadAttention(numberOfHeads=numberOfHeads, headSize=numberOfEmbeddingDimensions//numberOfHeads)
        self.feedforwardnetwork = FeedForward(numberOfEmbeddingDimensions=numberOfEmbeddingDimensions)
        self.selfAttentionLayerNorm = torch.nn.LayerNorm(numberOfEmbeddingDimensions)
        self.feedforwardnetworkLayerNorm = torch.nn.LayerNorm(numberOfEmbeddingDimensions)

    # Forward Pass
    def forward(self, embeddings):
        embeddings = embeddings + self.selfAttention(self.selfAttentionLayerNorm(embeddings)) # (B, T, C)
        embeddings = embeddings + self.feedforwardnetwork(self.feedforwardnetworkLayerNorm(embeddings)) # (B, T, C)
        return embeddings
```

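Now we also need to add one more `LayerNorm` after the last `TransformerBlock` and right before the final `Linear` transformation, because in the Pre-Norm formulation the residual pathway itself is never normalized, and we want the features going into the `languageModelingHead` to be normalized as well...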

So our `GPTModel`s initialization looks like:
```python
# Constructor for the model
def __init__(self):
    # Initializing the model parameters
    super().__init__()
    self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, numberOfEmbeddingDimensions)
    self.positionalEmbeddingTable = torch.nn.Embedding(blockSize, numberOfEmbeddingDimensions)
    self.blocks = torch.nn.Sequential(
        TransformerBlock(numberOfEmbeddingDimensions=numberOfEmbeddingDimensions, numberOfHeads=numberOfHeads),
        TransformerBlock(numberOfEmbeddingDimensions=numberOfEmbeddingDimensions, numberOfHeads=numberOfHeads),
        TransformerBlock(numberOfEmbeddingDimensions=numberOfEmbeddingDimensions, numberOfHeads=numberOfHeads),
        torch.nn.LayerNorm(numberOfEmbeddingDimensions),
    )
    self.languageModelingHead = torch.nn.Linear(numberOfEmbeddingDimensions, vocabularySize)
```

So you can feel free to train the model, for now I will release this in a script called `transformer_layernorm_gpt.py`...

## Understanding **Dropout**

<a href="https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html">**Dropout**</a> is a concept that comes from <a href="https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf">Dropout: A Simple Way to Prevent Neural Networks from Overfitting</a> from around 2014...

And let's look at the paper's illustration to understand **Dropout** now...

![Dropout](ExplanationMedia/Images/dropout.png)

Basically it takes your neural network and every forward and backward pass **randomly shuts off some subset of neurons**...
![Dropout Animation](ExplanationMedia/Images/dropout.gif)

And what this does effectively is, because the mask of what is being dropped out changes on every single forward and backward pass, it kind of ends up training an ensemble of **sub networks**, and during test time, everything is fully enabled and all the sub networks are merged into a single ensemble (kind of)...

And so I invite you to read the paper, but for now let's stay on the overview level of **dropout** just being a regularization technique (a technique that helps prevent overfitting)...

So now we can sprinkle dropout across the code of our `GPT`...
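To see the difference between training time and test time, here is a minimal sketch using `torch.nn.Dropout` on a tensor of ones (the probability of `0.2` is an assumption for illustration). During training the surviving values are scaled up by `1 / (1 - p)`, and during evaluation the layer does nothing...

```python
import torch

torch.manual_seed(0)

dropout = torch.nn.Dropout(p=0.2)  # illustrative probability
inputs = torch.ones(1000)

# Training mode: random elements are zeroed, survivors are scaled by 1 / (1 - p)
dropout.train()
trainOutputs = dropout(inputs)

# Evaluation mode: dropout is a no-op
dropout.eval()
evalOutputs = dropout(inputs)

survivors = trainOutputs[trainOutputs != 0]
print(survivors.numel(), survivors.unique(), torch.equal(evalOutputs, inputs))
```

This is why we need to call `model.eval()` before estimating the loss or generating text, and `model.train()` before going back to training...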

## Changing Script to `transformer_layernorm_dropout_gpt.py`

Let's now initialize these dropouts and modify the forward passes of the parts where we want our dropouts...

Now, **Dropout** needs a probability argument, which by default is `0.5`, which means that each neuron is dropped with a probability of `50%` by default...

And just to be safe, we will define the same as a hyper-parameter variable called `dropoutProbability`...

Now, because our entire `GPTModel` consists of the core modules `Head`, `MultiHeadAttention` and `FeedForward`, we can sprinkle `Dropout`s on these layers by initializing them as:
```python
self.dropout = torch.nn.Dropout(p=dropoutProbability)
```

And during forward pass we can use `Dropout` in the following scenarios (after some of the transformation and non-linearity activation):
1. Right before the *residual connection* back into the *residual pathway* in `FeedForward`
2. Right at the end of *communication* of `MultiHeadAttention`
3. Right after the `softmax()` of `Head` to randomly prevent some `tokens` from *communicating*

So now the changes look like:
1. ```python
   # Feed Forward Module Definition
    class FeedForward(torch.nn.Module):
        """ Simple Feed Forward Network """
        # Constructor for the Feed Forward Network
        def __init__(self, numberOfEmbeddingDimensions):
            # Initializing the layers
            super().__init__()
            self.network = torch.nn.Sequential(
                torch.nn.Linear(numberOfEmbeddingDimensions, 4 * numberOfEmbeddingDimensions),
                torch.nn.ReLU(),
                torch.nn.Linear(4 * numberOfEmbeddingDimensions, numberOfEmbeddingDimensions),
                torch.nn.Dropout(p=dropoutProbability)
            )
    
        # Forward Pass
        def forward(self, inputs):
            return self.network(inputs)
   ```
2. ```python
   # Multi-Head Attention Module Definition
    class MultiHeadAttention(torch.nn.Module):
        """ Multiple Heads of Self Attention in Parallel """
        # Constructor for the Multi-Head Attention
        def __init__(self, numberOfHeads, headSize):
            super().__init__()
            self.heads = torch.nn.ModuleList([Head(headSize=headSize) for _ in range(numberOfHeads)])
            self.projection = torch.nn.Linear(numberOfEmbeddingDimensions, numberOfEmbeddingDimensions)
            self.dropout = torch.nn.Dropout(p=dropoutProbability)
    
        # Forward Pass
        def forward(self, inputs):
            # Returns the concatenated heads over the channel dimension
            output = torch.cat([head(inputs) for head in self.heads], dim=-1)
            # Projecting the concatenated heads (not the raw inputs) and applying dropout
            output = self.dropout(self.projection(output))
            return output
   ```
3. ```python
   # Head Module Definition
    class Head(torch.nn.Module):
        """ Single Head of Self Attention """
        # Constructor for the Head
        def __init__(self, headSize):
            super().__init__()
            self.key = torch.nn.Linear(numberOfEmbeddingDimensions, headSize, bias=False)
            self.query = torch.nn.Linear(numberOfEmbeddingDimensions, headSize, bias=False)
            self.value = torch.nn.Linear(numberOfEmbeddingDimensions, headSize, bias=False)
            self.register_buffer(name='lowerTriangularMatrix', tensor=torch.tril(torch.ones(blockSize, blockSize)))
            self.dropout = torch.nn.Dropout(p=dropoutProbability)
    
        # Forward Pass
        def forward(self, inputs):
            # Unpacking the shape of inputs
            batch, time, channel = inputs.shape
            # Forwarding the inputs to keys and queries
            k = self.key(inputs) # (B, T, C)
            q = self.query(inputs) # (B, T, C)
            # Initializing weights with scaled dot product
            # Scaling by the head size, read from the keys so no outside variable is needed
            weights = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5 # (B, T, T)
            # Masking the weights
            weights = weights.masked_fill(self.lowerTriangularMatrix[:time, :time] == 0, float('-inf')) # (B, T, T)
            # Softmax the weights
            weights = F.softmax(weights, dim=-1) # (B, T, T)
            weights = self.dropout(weights)
            # Forwarding the inputs to values
            v = self.value(inputs) # (B, T, C)
            # Aggregating the weights and the values
            output = weights @ v # (B, T, C)
            return output
   ```
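
To see what the dropout inside `Head` does to the attention weights, here is a small standalone sketch. It uses random scores in place of the real `q @ k.transpose(-2, -1)` product, and the sizes and the probability are hypothetical values picked just for this demonstration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

time = 4
dropout = torch.nn.Dropout(p=0.5)  # Hypothetical probability for illustration
lowerTriangularMatrix = torch.tril(torch.ones(time, time))

# Random attention scores standing in for the scaled query-key product
scores = torch.randn(1, time, time)
scores = scores.masked_fill(lowerTriangularMatrix == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)

# Training mode: some connections are zeroed and the rest are scaled up
dropout.train()
droppedWeights = dropout(weights)
print(droppedWeights[0])

# Evaluation mode: the weights pass through unchanged
dropout.eval()
evaluationWeights = dropout(weights)
print(evaluationWeights[0].sum(dim=-1))
```

Notice that the future positions stay at zero after dropout, so the causal mask is never violated. During training the rows no longer sum exactly to one, because some `tokens` are randomly prevented from *communicating*, while in evaluation mode every row sums to one again.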

For now I will release this in a script called `transformer_layernorm_dropout_gpt.py`

# Scaling Up the GPT

We see that we have a pretty complete `Transformer` architecture now and we understand almost every component...
![Transformer_With_Block_Explanation](ExplanationMedia/Images/Transformer_With_Block_Explanation.png)

We can finally make some cosmetic changes and start scaling our `GPTModel` to get better results...