# Welcome to GPT from Scratch

## Things to discuss before starting

This notebook is a continuation of my previous notebooks in order:
1. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/Neural%20Network%20with%20Derivatives.ipynb">Neural Networks with Derivatives</a>
2. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave</a>
3. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20-%20Multi%20Layer%20Perceptron.ipynb">NameWeave - Multi Layer Perceptron</a>
4. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20(MLP)%20-%20Activations%2C%20Gradients%20%26%20Batch%20Normalization.ipynb">NameWeave (MLP) - Activations, Gradients & Batch Normalization</a>
5. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20-%20Manual%20Back%20Propagation.ipynb">NameWeave - Manual Back Propagation</a>
6. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20-%20WaveNet.ipynb">NameWeave - WaveNet</a>

This means I will reuse a lot of the terminology, explanations, and code from the earlier notebooks in the series...

And we will gradually build a complete **GPT** from scratch.

This notebook will be a little bit different from the previous ones, and I will discuss the changes in a bit...

We will use a completely new dataset, `Harry_Potter_Books.txt`, which is the raw text of all the Harry Potter books by J. K. Rowling combined into a single `text` file. It replaces `Indian_Names.txt`, the list of Indian first names crawled from a website that we used in the previous notebooks.

We will also keep software like <a href="https://chat.openai.com/">ChatGPT</a>, <a href="https://gemini.google.com/app">Google Gemini</a>, and other Large Language Models (LLMs) in mind and create our own little **Generative Pre-Trained Transformer (GPT)**...

And you would probably know by now what these models are and what they do...

Our **GPT** is going to be a *character level language model*, instead of the *sub-word-tokenized model* that software like ChatGPT and Google Gemini use, and we will discuss everything in a bit...

And we will not write all the `torch.nn` modules from scratch now, because we have already covered the most important ones in our previous notebooks. Instead, we will build our networks on top of the `torch.nn` library from PyTorch...

And most importantly we will start from the simplest model (Bigram Model) and keep modifying that same model within this notebook, making our way up to the entire `Transformer` architecture...

I have created a folder named `GPT Scripts` in the root of this repository that holds each script, so you can run them without even having to use Jupyter notebooks. You can simply run each model as `<filename>.py` to see how it performs on the same dataset as we move up, using:
```bash
python <filename>.py
```

One more thing to discuss: in our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20-%20WaveNet.ipynb">NameWeave - WaveNet</a> notebook we referred to the multiple dimensions of a tensor by our own made-up names, but in practice these dimensions have conventional names, specifically **batch**, **time** and **channel**, in this order:

$$
(B, T, C)\rightarrow(Batch, Time, Channel)
$$

And we will refer to these dimensions using the actual names used in real world this time...
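
To make the `(B, T, C)` naming concrete, here is a minimal sketch of how indexing into such a tensor peels off one dimension at a time. The sizes `4`, `8` and `32` are assumptions picked only for illustration:

```python
import torch

# Hypothetical sizes, chosen only for this illustration
B, T, C = 4, 8, 32  # Batch, Time, Channel

x = torch.zeros(B, T, C)
print(x.shape)        # the full batch: (B, T, C)
print(x[0].shape)     # one sequence from the batch: (T, C)
print(x[0, 0].shape)  # one time step of that sequence: (C,)
```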

So, it's going to be a long journey and it is going to be legen...wait-for-it...dary. Legendary!!!\
![Barney Stinson Wink](https://media.tenor.com/nJ3EeUPhVKkAAAAM/barny-stinson.gif)

So let's get started...

# Understanding GPT

So what is a **GPT**?

Well, **GPT** stands for **Generative Pre-Trained Transformer**.

You see how it consists of three words?

Let's look at it in context of each word:
1. Generative → Generates New Content
2. Pre-Trained → Pre-trained on a dataset
3. Transformer → Transformer architecture is being followed to make up the model (Don't worry we will discuss this later)

That was easy...

Let's understand what **GPT** can do currently...

Let's take `ChatGPT` as an example for now and look at its capabilities...

![ChatGPT Current Capabilities](https://miro.medium.com/v2/resize:fit:679/1*_3AM0Yhc7qgCvZ_X1L8mhw.gif)

We see that `GPT` goes from left to right and generates text sequentially...

I wanted to show another thing:
![ChatGPT Different Responses](ExplanationMedia/Images/ChatGPT_Different_Response.gif)

See how we get a different response each time?

This hints that it is a probabilistic system: for any one `prompt` it can give us multiple answers...
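
As a rough sketch of what "probabilistic" means here, we can sample repeatedly from one fixed distribution over next tokens and get different results. The four probabilities below are made up for illustration and do not come from any real model:

```python
import torch

# Made-up probabilities over a tiny 4-token vocabulary (illustration only)
probabilities = torch.tensor([0.1, 0.2, 0.3, 0.4])

# Drawing several times from the same distribution can give different tokens
samples = [torch.multinomial(probabilities, num_samples=1).item() for _ in range(10)]
print(samples)
```

The distribution never changes between draws, yet the sampled token can, which is the same reason one prompt can lead to several different answers.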

Now, this is just one example of a `prompt`, and people have come up with billions of different prompts as of now, and in fact there are many websites that index the interactions with `ChatGPT` as well.

You can look at this <a href="https://writesonic.com/blog/best-chatgpt-examples">website</a> as an example.

We see that it is a very remarkable system, and it is what we call a **"Language Model"**.

Or in other words, it models the sequence of words or characters (or "tokens" more generally) and it knows how words follow each other in the English language (and in other languages)...

Let's understand what **GPT** does from its perspective...

Well it is trying to complete the sequence...

In other words, the `inputs` or `task` that we give to the GPT model, it treats it as a *start of a sequence* and it tries to complete the sequence as a whole. Which makes it a language model in this sense...

You would think that it is utterly ridiculous and that we cannot just model an entire architecture and make it act like a helpful assistant.

Well that is the beauty of it. And we will discuss all the under-the-hood components of what makes a software like `ChatGPT` work.

So, What is the neural network architecture under-the-hood that models this sequence of words/characters/tokens?

That comes from this paper from Google: <a href="https://arxiv.org/abs/1706.03762">"Attention Is All You Need"</a> from 2017.

This was a landmark paper in Artificial Intelligence that proposed the `Transformer` architecture. But if you start reading this paper, it may seem like a pretty random *machine-translation* paper. And that's because, I think the authors did not fully anticipate the impact it would create in this domain in the years to come...

Let's look at the original `Transformer` architecture as of now:
![Transformer Architecture](ExplanationMedia/Images/Transformer_Model_Architecture.png)

And this `Transformer` architecture has been copy-pasted into a huge number of applications in recent years...

And what we'd like to do now is create something like `ChatGPT`. But we would not be able to completely clone `ChatGPT` because it is a way more serious *production-grade* system which currently requires *thousands* of GPUs and *millions* of dollars to train the network, and also it is trained on a very good *chunk* of internet data. And there are a lot of **pre-training** and **fine-tuning** stages to it.

Rather we would like to create a transformer-based language model, and in our case it is going to be a character level language model. And we also don't want to train on a *chunk* of the internet; rather, we need a smaller dataset (I propose we work with `Harry_Potter_Books.txt`, which is roughly a `7MB` file). And we will try to model how the characters in this dataset follow each other.

Let's take this paragraph for example:
```python
"""
Mr. and Mrs. Dursley, of number four, Privet Drive, 
were proud to say that they were perfectly normal, 
thank you very much.
"""
```

Given a chunk of these characters in the past:
```python
"""
Mr. and Mrs. Dursley, of number four, Privet Drive, 
were proud to say that the
"""
```
The `Transformer` model will look at these characters as a context in the past, and it is going to predict that the letter `'y'` is likely to come next in the sequence. And it is going to produce (generate) character sequences that look like Harry Potter. And in that process it is going to model all the patterns inside this data.

And once we have trained the model, it will be able to generate *infinite `Harry Potter`*.

![Harry Potter Woo](https://media4.giphy.com/media/v1.Y2lkPTc5MGI3NjExcjBmNzJ5N2EzMDQzeTB3cXV4ODN5ZGJkdWlldHhleGw3d3hpMGRhMyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/TJO5x5QQM72Q0weWXN/giphy.gif)

So let's install the required dependencies and load our dataset up and look into the data and what it looks like first...

# Installing Dependencies

In [1]:
!pip install torch
!pip install numpy
!pip install pandas
!pip install matplotlib



# Importing Libraries

In [2]:
import random
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

# Loading Dataset

This time, I will divide the dataset loading part into two forms:
1. If you're trying to use `Google Colab` to run the code
2. If you're trying to use `Jupyter Notebook locally` to run the code

And you can choose either one of those depending on your setup...

## For Google Colab users

This will download the `Harry_Potter_Books.txt` into your current folder...

In [None]:
!wget https://raw.githubusercontent.com/AvishakeAdhikary/Neural-Networks-From-Scratch/main/Datasets/Harry_Potter_Books.txt

Now you can load up the dataset like this:

In [None]:
with open('Harry_Potter_Books.txt', 'r', encoding='utf-8') as file:
    text = file.read()

## For local Jupyter Notebook users

You don't have to download the dataset if you have the entire repository cloned.

The dataset `Harry_Potter_Books.txt` is already located inside the `Datasets` directory...

So we can simply open the file and look at its content by specifying the relative path...

In [3]:
with open('Datasets/Harry_Potter_Books.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Exploring the Dataset

We can look at the length of the entire dataset, that is, its number of characters...

In [6]:
print("Length of Dataset in Characters: ", len(text))

Length of Dataset in Characters:  6765190


We see that it is roughly `6.8 million` characters...

And if you want to look at the first `1000` characters we can do:
```python
print(text[:1000])
```
Which prints the output:
```python
"""
/ 




THE BOY WHO LIVED 

Mr. and Mrs. Dursley, of number four, Privet Drive, 
were proud to say that they were perfectly normal, 
thank you very much. They were the last people you’d 
expect to be involved in anything strange or 
mysterious, because they just didn’t hold with such 
nonsense. 

Mr. Dursley was the director of a firm called 
Grunnings, which made drills. He was a big, beefy 
man with hardly any neck, although he did have a 
very large mustache. Mrs. Dursley was thin and 
blonde and had nearly twice the usual amount of 
neck, which came in very useful as she spent so 
much of her time craning over garden fences, spying 
on the neighbors. The Dursley s had a small son 
called Dudley and in their opinion there was no finer 
boy anywhere. 

The Dursleys had everything they wanted, but they 
also had a secret, and their greatest fear was that 
somebody would discover it. They didn’t think they 
could bear it if anyone found out about the Potters. 
Mrs. Potter was Mrs. Dursl
"""
```

# Building Vocabulary

We can now start building our vocabulary, just like we did in our previous notebooks...

In [4]:
characters = sorted(list(set(text))) # Gives us every unique character that occurs in the dataset, in sorted order
vocabularySize = len(characters) # We define a common vocabulary size
print("Characters:", characters)
print("Vocabulary Size:", vocabularySize)

Characters: ['\n', ' ', '!', '"', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '\\', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|', '~', '—', '‘', '’', '“', '”', '•', '■', '□']
Vocabulary Size: 92


So we have a possible `vocabulary` of `92` characters that our model will be able to see or emit...

Now we would like to develop a strategy to <strong><i>tokenize</i></strong> our input `text`.

And when we say **tokenize** we generally mean to convert raw text as a string to some sequence of integers according to some vocabulary of possible elements...

For us, because we are developing a character level language model, we are simply going to translate individual `characters` into `integers`.

And we will build `4` things here:
1. String to Index Vocabulary → `stoi` → A map of `characters` to `integers`
2. Index to String Vocabulary → `itos` → A map of `integers` to `characters`
3. Token Encoder → That will encode a sequence of characters into indices
4. Token Decoder → That will decode a sequence of encoded indices back into characters

And you will be able to recognize the first two from our previous notebooks...

Before we dive in, let's understand the Python concept of `lambda` functions, which some of you might have forgotten...

So what are `lambda` functions?

Python lambda functions are *anonymous functions*, meaning functions without a name. As we already know, the `def` keyword is used to define a normal function in Python. Similarly, the `lambda` keyword is used to define an anonymous function in Python.

Syntax: `lambda arguments : expression`

For example:
```python
output = lambda input: input+1
print(output(input=1))
```
We would print `2`.

Now, let me first run it, then I will explain it later:

In [5]:
stoi = {character:index for index, character in enumerate(characters)}
itos = {index:character for index, character in enumerate(characters)}
encode = lambda string: [stoi[character] for character in string] # Token Encoder that takes in a string as an input, and outputs a list of integers
decode = lambda indices: ''.join([itos[index] for index in indices]) # Token Decoder that takes in the encoded list of integers and outputs the decoded string (parameter named `indices` to avoid shadowing the built-in `list`)

print("STOI:", stoi)
print("ITOS:", itos)
print("Encoded Text: ", encode("Legendary"))
print("Decoded Text: ", decode(encode("Legendary")))

STOI: {'\n': 0, ' ': 1, '!': 2, '"': 3, '%': 4, '&': 5, "'": 6, '(': 7, ')': 8, '*': 9, ',': 10, '-': 11, '.': 12, '/': 13, '0': 14, '1': 15, '2': 16, '3': 17, '4': 18, '5': 19, '6': 20, '7': 21, '8': 22, '9': 23, ':': 24, ';': 25, '>': 26, '?': 27, 'A': 28, 'B': 29, 'C': 30, 'D': 31, 'E': 32, 'F': 33, 'G': 34, 'H': 35, 'I': 36, 'J': 37, 'K': 38, 'L': 39, 'M': 40, 'N': 41, 'O': 42, 'P': 43, 'Q': 44, 'R': 45, 'S': 46, 'T': 47, 'U': 48, 'V': 49, 'W': 50, 'X': 51, 'Y': 52, 'Z': 53, '\\': 54, ']': 55, 'a': 56, 'b': 57, 'c': 58, 'd': 59, 'e': 60, 'f': 61, 'g': 62, 'h': 63, 'i': 64, 'j': 65, 'k': 66, 'l': 67, 'm': 68, 'n': 69, 'o': 70, 'p': 71, 'q': 72, 'r': 73, 's': 74, 't': 75, 'u': 76, 'v': 77, 'w': 78, 'x': 79, 'y': 80, 'z': 81, '|': 82, '~': 83, '—': 84, '‘': 85, '’': 86, '“': 87, '”': 88, '•': 89, '■': 90, '□': 91}
ITOS: {0: '\n', 1: ' ', 2: '!', 3: '"', 4: '%', 5: '&', 6: "'", 7: '(', 8: ')', 9: '*', 10: ',', 11: '-', 12: '.', 13: '/', 14: '0', 15: '1', 16: '2', 17: '3', 18: '4', 19: 

The `Token Encoder` here takes in a `string` or a `sequence of characters` and encodes it into a `list of integers` based on `stoi` mapping. And the `Token Decoder` takes in the encoded `list of integers` and decodes it based on `itos` mapping to get back the exact same string...

In other words, it is more like a translation of `characters` into `integers` and `integers` into `characters`, because our model is going to be a character level language model.
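
A quick way to convince ourselves that encoding and decoding really are inverse translations is a round-trip check. The sketch below is self-contained and builds a tiny vocabulary from a sample string; the names `sampleStoi`, `sampleItos`, `encodeSample` and `decodeSample` are hypothetical stand-ins for the notebook's `stoi`, `itos`, `encode` and `decode`:

```python
# Tiny stand-in vocabulary built from a sample string (illustration only)
sampleText = "Privet Drive"
sampleCharacters = sorted(set(sampleText))
sampleStoi = {character: index for index, character in enumerate(sampleCharacters)}
sampleItos = {index: character for index, character in enumerate(sampleCharacters)}

def encodeSample(string):
    # Characters -> integers
    return [sampleStoi[character] for character in string]

def decodeSample(indices):
    # Integers -> characters
    return ''.join(sampleItos[index] for index in indices)

encoded = encodeSample("Drive")
print(encoded)
print(decodeSample(encoded))

# Decoding an encoding gives back the exact same string
assert decodeSample(encoded) == "Drive"
```

One consequence of this design is that a character that never appeared in the text used to build the vocabulary cannot be encoded at all: the dictionary look-up raises a `KeyError`.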

Now this is only one of many possible `encodings` or `tokenizers` that are out there in the world right now...

And people have come up with many such `tokenizers`, for example, Google uses <a href="https://github.com/google/sentencepiece">`sentencepiece`</a>, OpenAI uses <a href="https://github.com/openai/tiktoken">`tiktoken`</a>...

And these `tokenizers` are `sub-word` tokenizers: they encode **neither** `entire words` **nor** `individual characters`, but `sub-word` units, which is usually what's adopted in practice...

As an example let's take `tiktoken` vocabulary which uses `Byte-Pair Encoding (BPE)` to encode these `tokens`:\
![Tiktoken Vocabulary](ExplanationMedia/Images/Tiktoken_Vocabulary.png)

We see that `tiktoken` has a vocabulary of `50257` tokens here, whereas ours has just `92`.

And when we try to encode a sample string in `tiktoken`, we get:\
![Tiktoken Example](ExplanationMedia/Images/TikToken_Example.png)

We see that we only get `3` outputs for an entire string of `9` characters...

Which means that we can *trade-off* `sequences of integers` and `vocabularies`...

In other words, we can have very long `sequences of integers` with a very small `vocabulary`, or very short `sequences of integers` with a very large `vocabulary`...
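
To get a feel for the sequence-length side of this trade-off without installing any tokenizer library, here is a small sketch that tokenizes one made-up sentence in two ways. Splitting on spaces is only a crude stand-in for a coarser tokenizer; it is not how `tiktoken` or `sentencepiece` work:

```python
# A made-up sample sentence (illustration only)
sampleText = "the boy who lived and the boy who waited"

# Character-level tokens: one token per character
characterTokens = list(sampleText)

# Word-level tokens (split on spaces): a crude stand-in for a coarser tokenizer
wordTokens = sampleText.split(" ")

print(len(characterTokens))  # 40 tokens
print(len(wordTokens))       # 9 tokens
```

The coarser units give a much shorter sequence for the same text. The price is paid in the vocabulary: over a whole corpus there are far more distinct words (or sub-words) than distinct characters, which a sample this small cannot show.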

But for now I'd like to keep our `tokenizer` extremely simple using our own character-level tokenizer (meaning we have very small `vocabulary`) and very simple `encode` and `decode` functions, but we do get very long `sequences of integers` as a result...

Don't worry, if you'd like I will build a `tokenizer` in the future...

So let's now move forward...

Now that we have a `token encoder` and a `token decoder` or effectively a `tokenizer` we can move forward and encode our entire `Harry Potter` dataset...

And we will use <a href="https://pytorch.org/">PyTorch</a> library for that:
![PyTorch Logo](ExplanationMedia/Images/PyTorchLogo.svg)

So we can now wrap our `text` data, after `encoding` it, into a `tensor` of datatype `long` (a 64-bit integer type, `torch.int64`), because our tokens are integer indices that we will later use to look up rows of an embedding table, like this:
```python
data = torch.tensor(encode(text), dtype=torch.long)
```

In [6]:
data = torch.tensor(encode(text), dtype=torch.long)
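
As a small sanity check on what `torch.long` actually is, the sketch below builds a tensor from three arbitrary example indices and feeds it to an embedding table. The embedding size of `4` is an assumption for illustration only:

```python
import torch

tokenIds = torch.tensor([39, 60, 62], dtype=torch.long)  # arbitrary example indices
print(tokenIds.dtype)                # torch.int64
print(tokenIds.is_floating_point())  # False

# Integer indices are what an embedding look-up works with
embedding = torch.nn.Embedding(num_embeddings=92, embedding_dim=4)
print(embedding(tokenIds).shape)     # one row of size 4 per index
```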

And then we can check the `shape` and `type` of this data and print out the first `100` characters, just like we did before (without decoding it) like this:
```python
print(data.shape, data.dtype)
```
And we get:
```python
torch.Size([6765190]) torch.int64
```
And:
```python
print(data[:100])
```
And we get:
```python
tensor([13,  1,  0,  0,  0,  0,  0, 47, 35, 32,  1, 29, 42, 52,  1, 50, 35, 42,
         1, 39, 36, 49, 32, 31,  1,  0,  0, 40, 73, 12,  1, 56, 69, 59,  1, 40,
        73, 74, 12,  1, 31, 76, 73, 74, 67, 60, 80, 10,  1, 70, 61,  1, 69, 76,
        68, 57, 60, 73,  1, 61, 70, 76, 73, 10,  1, 43, 73, 64, 77, 60, 75,  1,
        31, 73, 64, 77, 60, 10,  1,  0, 78, 60, 73, 60,  1, 71, 73, 70, 76, 59,
         1, 75, 70,  1, 74, 56, 80,  1, 75, 63])
```

And we see that we have a massive list of integers, which is an exact translation of the first `100` characters of the `text` file...

And the entire dataset is just stretched out into a very large `sequence of integers`...

Now before we move on with our progress, we would like to do one more thing that is we'd like to split our dataset into a `Train` and `Validation` split...

So let's do that...

# Splitting the Dataset into `Training` and `Validation` splits

Now I'd like to split our data into a split of:
1. First 90% into `Training` Split
2. Last 10% into `Validation` Split

And we are doing this to understand to what extent our model is `overfitting`...

Because we don't want our model to copy and reproduce the exact text of `Harry Potter`; instead, we want a model that will create `Harry Potter`-like text...

And here's how I do that:

In [7]:
nintyPercentOfDatasetLength = int((0.9 * len(data)))
trainingData = data[:nintyPercentOfDatasetLength] # Data up till 90% of the length
validationData = data[nintyPercentOfDatasetLength:] # Data from 90% of the length
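
The same slicing can be checked on a toy tensor, where it is easy to see that the two splits do not overlap and together cover the whole dataset. `toyData` is a hypothetical stand-in for the encoded text:

```python
import torch

toyData = torch.arange(20)            # stand-in for the encoded dataset
splitIndex = int(0.9 * len(toyData))  # index where the last 10% begins
toyTraining = toyData[:splitIndex]
toyValidation = toyData[splitIndex:]

print(len(toyTraining), len(toyValidation))  # 18 2
# Gluing the two splits back together recovers the original data
print(torch.equal(torch.cat([toyTraining, toyValidation]), toyData))  # True
```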

We can now move on to the next part...

# Loading Data Into Batches

We would like to now proceed to feed these integer sequences into the neural network so that it can train and learn those patterns...

**BUT**

We need to realise that we are not going to feed in the entire dataset into the neural network because that is going to be extremely computationally heavy, and rather we would load the data into small batches or *chunks* of data...

Now I typically use the term `block size` for the maximum length of such a *chunk*, but it goes by other names as well, for example `context length`.

Let's start with a `blockSize` of just `8`...
```python
blockSize = 8
```

Now let's look at the first *chunk* of training data (`blockSize` of `8` + `1`)...

I'll explain why this `+1` is there in a bit...

So we have:
```python
trainingData[:blockSize + 1]
```
For which we get the sequence:
```python
tensor([13,  1,  0,  0,  0,  0,  0, 47, 35])
```

Which is the first `9` characters in a `sequence` in the `training-set`...

Now, I'd like to point out that when we take out a *chunk* of data like this, we actually have **multiple examples** packed within it, because all of these characters **follow** each other...

And we are going to simultaneously train it at every one of these positions...

And in a chunk of `9` characters, there's actually `8` individual examples packed in there...

How so?

Let's look at it this way...

For our example:
```python
[13,  1,  0,  0,  0,  0,  0, 47, 35]
```
1. In the context of `[13]` → `1` is likely to come next,
2. In the context of `[13,  1]` → `0` is likely to come next,
3. In the context of `[13,  1,  0]` → `0` is likely to come next,
4. In the context of `[13,  1,  0,  0]` → `0` is likely to come next,
5. In the context of `[13,  1,  0,  0,  0]` → `0` is likely to come next,
6. In the context of `[13,  1,  0,  0,  0,  0]` → `0` is likely to come next,
7. In the context of `[13,  1,  0,  0,  0,  0,  0]` → `47` is likely to come next,
8. In the context of `[13,  1,  0,  0,  0,  0,  0, 47]` → `35` is likely to come next.

That adds up to `8` individual examples in our case, which is the `blockSize`, and we take the `+1` so that the last example also has its `label` for training...

Let's see how we can achieve the same output in a code snippet:
```python
inputs = trainingData[:blockSize] # First chunk of characters
outputs = trainingData[1:blockSize + 1] # First chunk of characters offset by 1
for i in range(blockSize):
    context = inputs[:i+1] # Context is the inputs up to and including position i
    label = outputs[i] # Label is the character that follows the context
    print(f"When input example is {context}, then the label is: {label}")
```

For which we get:
```python
When input example is tensor([13]), then the label is: 1
When input example is tensor([13,  1]), then the label is: 0
When input example is tensor([13,  1,  0]), then the label is: 0
When input example is tensor([13,  1,  0,  0]), then the label is: 0
When input example is tensor([13,  1,  0,  0,  0]), then the label is: 0
When input example is tensor([13,  1,  0,  0,  0,  0]), then the label is: 0
When input example is tensor([13,  1,  0,  0,  0,  0,  0]), then the label is: 47
When input example is tensor([13,  1,  0,  0,  0,  0,  0, 47]), then the label is: 35
```

One more thing to mention is that we train on all of these examples, all the way up to a context of `blockSize`, not just for efficiency.

**We also want our network to get *"used to"* seeing these examples for context of as small as `1` all the way up to the `blockSize` and everything in between.**

So, during `inference` we can start sampling from as little as `1` character of context. And once it starts sampling it can go all the way up to `blockSize`, and after the context of `blockSize` we have to start `truncating`, because the neural network should never receive more than `blockSize` inputs when it's trying to predict the next character.
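
The truncation step can be sketched as keeping only the most recent `blockSize` tokens of whatever has been generated so far. The tensor `generated` below is a made-up stand-in for a sequence of sampled tokens, not the output of a real model:

```python
import torch

blockSize = 8

# Pretend we have already generated 11 tokens for one sequence: shape (B=1, T=11)
generated = torch.arange(11).unsqueeze(0)

# Keep only the most recent blockSize tokens as context for the next prediction
context = generated[:, -blockSize:]
print(context.tolist())  # [[3, 4, 5, 6, 7, 8, 9, 10]]

# A sequence shorter than blockSize passes through unchanged
shortContext = generated[:, :3][:, -blockSize:]
print(shortContext.tolist())  # [[0, 1, 2]]
```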

And these input examples that we just looked at run along the `Time` dimension...

But we need to care about the `Batch` dimension now... And that is because every time we feed these *chunks* of text into a `Transformer`, we are going to have **mini-batches** of multiple chunks of text that are all **stacked-up** in a single tensor (this is done for efficiency, so that we can keep the `GPU`s busy, because they are very good at parallel processing of data, and we want to process multiple *chunks* of text all at the same time, but they are processed completely independently and they don't "talk" to each other)...

We will also set up a `seed` so that whatever numbers I see here in my system, you are going to see the same numbers in your system later as well...
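
Here is a minimal sketch of what the seed buys us: resetting it replays the same stream of random numbers. The seed value `1234` is arbitrary and used only in this illustration:

```python
import torch

torch.manual_seed(1234)  # arbitrary seed for this illustration
first = torch.randint(high=100, size=(4,))

torch.manual_seed(1234)  # resetting the seed replays the same random stream
second = torch.randint(high=100, size=(4,))

print(torch.equal(first, second))  # True
```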

Let's now generalize our discussion into code and I will discuss what is happening in the code one by one...

More specifically, let's define a `getBatch()` method that picks out batches of `batchSize`...

Now don't get confused between `blockSize` and `batchSize`...

$$
\displaylines{
\begin{align}
batchSize \rightarrow \text{The number of independent sequences of characters we want to process in parallel} \\
blockSize \rightarrow \text{The maximum context length of predictions}
\end{align}
}
$$

We will now reuse our earlier *example* code to get batches of examples, this time in the context of `batchSize`. We pick random indices from the dataset and use them to gather the example sequences and their corresponding labels into a batch using `torch.stack`, which concatenates a sequence of tensors along a new dimension. This makes `inputBatch` and `outputBatch` each a `(4, 8)` tensor, where each row of `inputBatch` is a *chunk* of the training set, and `outputBatch` will be used all the way at the end in the **loss function**...

So, each of these `examples in a sequence` can be spelled out just like we did before to get the corresponding `labels`...
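
Before the real `getBatch()`, it may help to see `torch.stack` on a toy tensor with hand-picked start offsets instead of random ones. `toyData` and `starts` are assumptions made up for this sketch:

```python
import torch

toyData = torch.arange(100)  # stand-in for the encoded dataset
blockSize = 8
starts = [0, 10, 50, 90]     # hand-picked start offsets instead of random ones

# Each slice has shape (8,); stacking 4 of them adds a new leading dimension
inputs = torch.stack([toyData[i:i+blockSize] for i in starts])
outputs = torch.stack([toyData[i+1:i+blockSize+1] for i in starts])
print(inputs.shape)  # torch.Size([4, 8])

# The labels are just the inputs shifted one position to the left
print(torch.equal(inputs[:, 1:], outputs[:, :-1]))  # True
```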

In [8]:
# We define a manual seed such that you see the same numbers I see in my machine
torch.manual_seed(69420)
batchSize = 4 # Number of independent sequences of characters we want to process in parallel
blockSize = 8 # Maximum context length of predictions

def getBatch(split):
    # Take the trainingData if the split is 'train' otherwise take the validationData
    data = trainingData if split=='train' else validationData
    # Generates random integers of batchSize between 0 and len(data) - blockSize
    indexes = torch.randint(high=len(data) - blockSize, size=(batchSize,))
    # Takes the inputs and outputs after stacking them up in a single tensor
    inputs = torch.stack([data[i:i+blockSize] for i in indexes])
    outputs = torch.stack([data[i+1:i+blockSize+1] for i in indexes])
    return inputs, outputs

# We call the method to initialize inputBatch and outputBatch
inputBatch, outputBatch = getBatch('train')

print("Inputs:", inputBatch.shape, " Values:" , inputBatch)
print("Outputs:", outputBatch.shape, " Values:" , outputBatch)

print('---------------------------------------------')

for batchIndex in range(batchSize):
    for blockIndex in range(blockSize):
        context = inputBatch[batchIndex, : blockIndex+1]
        label = outputBatch[batchIndex, blockIndex]
        print(f"When input example is {context.tolist()}, then the label is: {label}")

Inputs: torch.Size([4, 8])  Values: tensor([[70, 70, 67,  1, 70, 61,  1, 75],
        [59,  1,  0, 56, 75,  1, 75, 63],
        [59,  1,  0, 71, 67, 60, 69, 75],
        [75, 63, 60,  1, 68, 56, 81, 60]])
Outputs: torch.Size([4, 8])  Values: tensor([[70, 67,  1, 70, 61,  1, 75, 63],
        [ 1,  0, 56, 75,  1, 75, 63, 60],
        [ 1,  0, 71, 67, 60, 69, 75, 80],
        [63, 60,  1, 68, 56, 81, 60, 10]])
---------------------------------------------
When input example is [70], then the label is: 70
When input example is [70, 70], then the label is: 67
When input example is [70, 70, 67], then the label is: 1
When input example is [70, 70, 67, 1], then the label is: 70
When input example is [70, 70, 67, 1, 70], then the label is: 61
When input example is [70, 70, 67, 1, 70, 61], then the label is: 1
When input example is [70, 70, 67, 1, 70, 61, 1], then the label is: 75
When input example is [70, 70, 67, 1, 70, 61, 1, 75], then the label is: 63
When input example is [59], then the lab

Now that we have our `inputBatch` and `outputBatch` we can start feeding these batches into a neural network and start getting predictions....

Now we are going to start off with the simplest possible neural network, which in my opinion is a `Bigram Language Model` which was already covered in our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave</a> notebook in a lot of depth, and we will rather go faster and implement a `PyTorch Module` directly that implements the `Bigram Language Model`.

# Bigram Language Model

Before we implement the bigram model, I'd like to discuss this syntax in python:
```python
class A:
    x = 10

class B(A):
    def show(self):
        print(self.x)

y = B()
y.show()
```
For which we get the output:
```python
10
```
This syntax (`B(A)`) is `inheritance`, which is exactly what we need to implement our own `Torch Modules`. That is because `torch.nn.Module` already implements a lot for us: we define a `forward(*input)` method for the forward pass, and `Module` manages the `__call__()` method internally so that calling the module runs `forward`. We can then return our calculated `logits` from the `Module` that we are going to define, and later call `backward()` on the loss computed from them...
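
As a minimal sketch of that mechanism, the toy module below only defines `forward`, yet we can call the object directly. `AddOne` is a hypothetical module invented for this illustration:

```python
import torch

class AddOne(torch.nn.Module):  # hypothetical toy module
    # Only forward() is defined; __call__ is inherited from torch.nn.Module
    def forward(self, x):
        return x + 1

module = AddOne()
result = module(torch.tensor([1, 2, 3]))  # calling the object runs forward()
print(result.tolist())  # [2, 3, 4]
```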

So, we understand that we need an `Embedding Look-Up Table` for each of our characters in the vocabulary. From our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20-%20Multi%20Layer%20Perceptron.ipynb">NameWeave - Multi Layer Perceptron</a> notebook we understand that when we index into such an embedding table of two dimensions using **batches of input** of two dimensions, we get an indexed output of a three dimensional tensor of `(B, T, C)` which we can now refer as the `Batch`, `Time` and `Channel` dimensions of that tensor, which essentially is nothing but for each input of a batch, it picks out a row of the embedding table. In our case, `Batch` is `4` which is the `batchSize`, `Time` is `8` which is the `blockSize` and `Channel` is the embedding dimension we will specify...

These indexed embeddings, for now can be interpreted as the `logits` or the scores for the next character in a sequence. Or in other words, we are predicting what comes next based on just the individual identity of a single token (which means they are not *talking* to each other). For example, if a token is say `69`, the token itself will be able to make pretty decent predictions of what comes next by knowing that the token is in fact `69`...
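
One way to see that the tokens are not *talking* to each other is to place the same token at two different positions and compare the rows we get back. This is a sketch with a freshly initialized table and made-up indices:

```python
import torch

torch.manual_seed(0)  # arbitrary seed for this illustration
table = torch.nn.Embedding(92, 92)

# Token 69 appears at time steps 0 and 2 of a single sequence: shape (B=1, T=4)
indices = torch.tensor([[69, 1, 69, 5]])
logits = table(indices)
print(logits.shape)  # torch.Size([1, 4, 92])

# Same token, same row of the table, regardless of position or neighbours
print(torch.equal(logits[0, 0], logits[0, 2]))  # True
```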

So now, let's write our first `model` out, and test it...

## Writing out BigramModule

In [9]:
class BigramLanguageModel(torch.nn.Module):
    # Constructor for the model
    def __init__(self, vocabularySize):
        # Initializing the embedding table
        super().__init__()
        self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, vocabularySize)

    # Forward Pass
    def forward(self, indeces):
        # Index into embeddings to get the logits
        logits = self.tokenEmbeddingTable(indeces) # (B, T, C)
        return logits

# Testing the model
model = BigramLanguageModel(vocabularySize)
outputs = model(inputBatch)
print(outputs.shape)

torch.Size([4, 8, 92])


Looks like we do get the scores for every one of those `(4, 8)` positions...

## Understanding `cross_entropy` loss

Now that we have the predictions of *'what comes next'* we'd like to evaluate the `loss function`. And in the `NameWeave` series we saw that a good way to measure the `loss` or the quality of the predictions is to use the **Negative Log Likelihood** loss which is also implemented in PyTorch under the name of `cross_entropy`.

And remember how I said the `outputBatch` is required when we calculate the `loss function`? 

This is exactly where we need the `outputBatch`: it holds the `labels`, so we can compare the predictions (the `logits`) against them. Or in other words, we know the identity of the next character, and the loss measures how well the `logits` predict it...

And we'd like to call `cross_entropy` in its **functional** form, which means we don't have to create a `Module` for it. But when we look at the documentation of <a href="https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html">CrossEntropyLoss</a> of PyTorch, we see that for **multi-dimensional inputs** (which is what we have, a `(B, T, C) tensor`), PyTorch `CrossEntropyLoss` expects the class dimension to come second: $(minibatch,C,d_1,d_2,\ldots,d_K)$, or a `(B, C, T) tensor` in our case.

So what to do now with our `(B, T, C) tensor` that we already have?

Well we will try to **reshape** our tensor now, in order to fit those `logits` as well as the `labels` which is a `(B, T) tensor`...

Now because we are only interested in manipulating the `embeddings` of the `logits` (which for our case is the `channel` dimension), we can start treating everything else as a **batch-dimension**.

This is good because PyTorch `CrossEntropyLoss` expects a $(minibatch,C)$ tensor...

So we can stretch our 3-dimensional tensor into a 2-dimensional tensor by folding the `Batch` and `Time` dimensions into a single dimension of size `B*T` using the `view()` method of PyTorch, which turns the `(B, T, C) tensor` into a `(B*T, C) tensor` and the `(B, T)` labels into a `(B*T)` tensor...

And it looks something like this:
![CrossEntropyGPTExplanation](ExplanationMedia/Images/CrossEntropyGPTExplanation.png)
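As a quick check of the reshaping idea, here is a minimal sketch with random stand-in `logits` and `labels` (not the real model outputs). It compares the flattened `(B*T, C)` call against the `(B, C, T)` layout that the documentation describes; both should give the same mean loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch, time, channel = 4, 8, 92
logits = torch.randn(batch, time, channel)         # stand-in scores, (B, T, C)
labels = torch.randint(0, channel, (batch, time))  # stand-in targets, (B, T)

# Option 1: fold Batch and Time together -> (B*T, C) and (B*T)
lossFlattened = F.cross_entropy(logits.view(batch * time, channel), labels.view(batch * time))

# Option 2: move the Channel dimension to second place -> (B, C, T)
lossPermuted = F.cross_entropy(logits.permute(0, 2, 1), labels)

print(lossFlattened.item(), lossPermuted.item())
```

We will stick with the `view()` version in this notebook, because it keeps the `Channel` dimension last, which is how the rest of our code thinks about the tensors.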

And then we can print out the `loss` to check where we are as well...

So let's do this now...

## Writing out Loss Function

In [10]:
class BigramLanguageModel(torch.nn.Module):
    # Constructor for the model
    def __init__(self, vocabularySize):
        # Initializing the embedding table
        super().__init__()
        self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, vocabularySize)

    # Forward Pass
    def forward(self, indeces, labels):
        # Index into embeddings to get the logits
        logits = self.tokenEmbeddingTable(indeces) # (B, T, C)
        # Pop out the shape dimensions
        batch, time, channel = logits.shape
        # Stretch out the logits and labels
        logits = logits.view(batch*time, channel)
        labels = labels.view(batch*time)
        # Calculate loss
        loss = F.cross_entropy(logits, labels)
        return logits, loss

# Testing the model
model = BigramLanguageModel(vocabularySize)
outputs, loss = model(inputBatch, outputBatch)
print("Shape of outputs:", outputs.shape)
print("Loss:", loss)

Shape of outputs: torch.Size([32, 92])
Loss: tensor(5.0409, grad_fn=<NllLossBackward0>)


Now because we have `92` possible characters, we can actually guess what our loss should be... And we have already covered this in our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20(MLP)%20-%20Activations%2C%20Gradients%20%26%20Batch%20Normalization.ipynb">NameWeave (MLP) - Activations, Gradients & Batch Normalization</a> notebook in more detail...

But we are expecting our loss to be around:
$$
-\ln{(P_x)} = -\ln{\left(\frac{1}{92}\right)} \approx 4.5218
$$

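But right now we are getting a loss of `5.0409` in the run above (the exact value changes on every run, because the embedding table is initialized randomly). This tells us that our initial predictions are not perfectly uniform: the random scores make the model confidently wrong for some characters, so we are guessing worse than an even spread would. But ultimately we are able to evaluate the loss...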
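The expected value can be reproduced without PyTorch. The sketch below uses a small hand-written `crossEntropy` helper (a hypothetical name, written only for this check) to show that equal scores give exactly $\ln(92)$, and that a single confident but wrong score pushes the loss above it.

```python
import math

vocabularySize = 92  # vocabulary size used in this notebook

def crossEntropy(logits, label):
    # Negative log of the softmax probability assigned to the correct label
    logSumExp = math.log(sum(math.exp(value) for value in logits))
    return logSumExp - logits[label]

# Equal scores for every character: the model has no preference at all
uniformLogits = [0.0] * vocabularySize
uniformLoss = crossEntropy(uniformLogits, label=5)

# Same scores, except the model strongly prefers a wrong character
wrongLogits = [0.0] * vocabularySize
wrongLogits[7] = 3.0
wrongLoss = crossEntropy(wrongLogits, label=5)

print(f"Uniform loss: {uniformLoss:.4f}")
print(f"Confidently wrong loss: {wrongLoss:.4f}")
```

This is the same effect we see with the randomly initialized embedding table: some scores happen to be large for the wrong characters.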

Now that we can evaluate the quality of the model, we'd like to also be able to `generate` from the model... And once again I'll go a little bit faster, because I covered a lot of these already in my previous notebooks of the `NameWeave` series...

## Understanding `generate()`

Now because we train our neural network with transformations in the **forward pass**, we can **forward** a single character index (say the new line character `'\n'` which is also the `0-th character` according to `itos` look-up table and is a fairly good example to forward) and get the predictions (`logits`) from the neural network. 

Now because these `logits` that come out are a *chunk* of a batch of examples, we need to focus on the **last** character in a chunk, because we are trying to predict the next character in a block (which is none other than the `Time` dimension), so we will pop out the `-1` element of that *chunk* (which will make `(B, T, C) tensor` → `(B, C) tensor`) to get the `logits` properly in a batch.

And as we remember, `logits` are un-normalized scores (un-normalized log-probabilities). Which means they need to go through `softmax` to become normalized probabilities. We want to apply softmax along the **last dimension** (`-1`), so that the scores of **each sequence in the batch** are normalized independently. This is because we're treating each sequence independently when generating the next token.

And then after we have our `probabilities` we can sample `1` character at a time by using `multinomial` sampling distribution for a batch (Which makes it a `(B, C) tensor` → `(B, 1) tensor`) which is none other than the next index in a sequence.

And lastly we **concatenate** the newly sampled `nextIndex` to the running `indeces` along the `Time` dimension, which turns a `(B, T) tensor` into a `(B, T+1) tensor` each time we generate, and we keep doing this in a loop...

But we also need a stopper for our loop... And we will call it `maximumNewTokens`, which is the number of characters we want to generate from our neural network...

And see how we are not going to use `loss` here during generation?

So we also need to make the `labels` optional in `forward`, so that it handles both kinds of calls (with and without `labels`) and skips the `loss` calculation when we are only generating.
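Before wiring this into the model, it can help to trace the shapes of a single generation step by hand. This is a minimal sketch where random `logits` stand in for the model output (an assumption for illustration), so only the shape bookkeeping is meaningful here:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch, time, channel = 2, 5, 92
indeces = torch.randint(0, channel, (batch, time))  # running context, (B, T)
logits = torch.randn(batch, time, channel)          # stand-in for the model output, (B, T, C)

lastLogits = logits[:, -1, :]                                # (B, T, C) -> (B, C)
probabilities = F.softmax(lastLogits, dim=-1)                # each row now sums to 1
nextIndex = torch.multinomial(probabilities, num_samples=1)  # (B, C) -> (B, 1)
indeces = torch.cat((indeces, nextIndex), dim=1)             # (B, T) -> (B, T+1)

print(lastLogits.shape, nextIndex.shape, indeces.shape)
print(probabilities.sum(dim=-1))
```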

So, after this long explanation, we can modify our `BigramLanguageModel` now:

## Modifying `BigramLanguageModel` for `generate()`

In [11]:
class BigramLanguageModel(torch.nn.Module):
    # Constructor for the model
    def __init__(self, vocabularySize):
        # Initializing the embedding table
        super().__init__()
        self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, vocabularySize)

    # Forward Pass
    def forward(self, indeces, labels=None):
        # Index into embeddings to get the logits
        logits = self.tokenEmbeddingTable(indeces) # (B, T, C)
        if labels is None:
            loss = None
        else:
            # Pop out the shape dimensions
            batch, time, channel = logits.shape
            # Stretch out the logits and labels
            logits = logits.view(batch*time, channel)
            labels = labels.view(batch*time)
            # Calculate loss
            loss = F.cross_entropy(logits, labels)
        return logits, loss

    # Generation
    def generate(self, indeces, maximumNewTokens):
        for _ in range(maximumNewTokens):
            # Forward Through Model
            logits, loss = self(indeces)
            # Focus on the last time step
            logits = logits[:, -1, :]
            # Applying softmax for the last dimension
            probabilities = F.softmax(logits, dim=-1)
            # Sample from distribution
            nextIndex = torch.multinomial(probabilities, num_samples=1)
            # Concatenate currentIndex with nextIndex
            indeces = torch.cat((indeces, nextIndex), dim=1)
        return indeces

# Testing the model
model = BigramLanguageModel(vocabularySize)
outputs, loss = model(inputBatch, outputBatch)
print("Shape of outputs:", outputs.shape)
print("Loss:", loss)

Shape of outputs: torch.Size([32, 92])
Loss: tensor(5.3285, grad_fn=<NllLossBackward0>)


So we can start generating text from the model now...

Now because our `generate()` expects a `(B, T) tensor`, and we decided to give a new line character `'\n'` as the first context, we create a tensor of `(1, 1)` dimension consisting of a single `0` like this (also setting the `dtype` to `long`):
```python
torch.zeros((1, 1), dtype=torch.long)
```
To get a tensor like this:
```python
tensor([[0]])
```

And let's say we want to generate `100` characters from the `model`, so we will pass the same value in `maximumNewTokens` parameter.

Now remember the example where the **GPT** was able to **generate** multiple responses, given the same `context`?

Well, our `generate()` also works on a whole batch, so it returns one response per row of the batch. Our batch has a single row right now, so we specify `[0]` to pop out that response. And because our `decode` expects a list, we convert the tensor to a list using the `tolist()` method, and then we are able to print out the response...

We can now pack all the concepts into a single line:

In [12]:
print(decode(model.generate(indeces=torch.zeros((1, 1), dtype=torch.long), maximumNewTokens=100)[0].tolist()))


"( ”wMi)nd],mRAYfSqL•p‘bI'Jfk\3y\Q1P'□?bZ■h’vrA?Sk? Z5;k22~A!1>E;57■z.~Bhrvq□G4e2ScxssA)FFCl~m7nnA~B


This is the output I get the first time:
```python

"( ”wMi)nd],mRAYfSqL•p‘bI'Jfk\3y\Q1P'□?bZ■h’vrA?Sk? Z5;k22~A!1>E;57■z.~Bhrvq□G4e2ScxssA)FFCl~m7nnA~B
```

Confused?

Don't worry, our model is still untrained, and it just predicts garbage values as of now, and we can train our model now, to make it better...

But I'd like to point out that although our `generate()` method is written in a generalized way, right now it looks a little ridiculous. That is because we feed it the entire context on every step, while the simplest Bigram Language Model only ever uses the very last token of that context. For example, to predict `'?'` as the next character it only needed the preceding `'k'`, and not the whole sequence before it. The point of writing the `generate()` method this way is that we'd like to keep this function **fixed** while the model changes: later, the model will be able to look further back in the history. Right now the history is not used and so it looks silly, but we will be using the history later and that is why we want to do it this way...
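We can convince ourselves that the history is ignored with a small experiment. This is a minimal sketch using a fresh, untrained embedding table and two made-up contexts (the token indices are arbitrary); the only thing they share is their final token.

```python
import torch

torch.manual_seed(0)

vocabularySize = 92
tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, vocabularySize)

# Two different histories that end in the same token (index 10)
contextA = torch.tensor([[1, 2, 3, 10]])
contextB = torch.tensor([[50, 60, 70, 10]])

# Logits at the last time step for each context, shape (1, C)
lastLogitsA = tokenEmbeddingTable(contextA)[:, -1, :]
lastLogitsB = tokenEmbeddingTable(contextB)[:, -1, :]

# The history is ignored: only the final token decides the prediction
print(torch.equal(lastLogitsA, lastLogitsB))
```

Both contexts lead to identical scores for the next character, which is exactly the limitation we want to remove.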

Now we can move forward and train the model... So that our outputs become a lot less random...

## Training `BigramLanguageModel`

Now before training we'd like to initialize an `optimizer`.

Well `optimizers` are none other than the algorithms that take the `gradients` and `update` the parameters based on the `learning rates` and other variables.

And during our `NameWeave` series, we have only ever used **Stochastic Gradient Descent (SGD)**.

There are many types of `optimizers` that are out there, and the most common ones are:
1. Gradient Descent
2. Stochastic Gradient Descent
3. Adagrad
4. Adadelta
5. RMSprop
6. Adam

You can read a lot about them in this <a href="https://medium.com/analytics-vidhya/this-blog-post-aims-at-explaining-the-behavior-of-different-algorithms-for-optimizing-gradient-46159a97a8c1">Medium Post</a>.

But for now we will use <a href="https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html#torch.optim.AdamW">`AdamW`</a> optimizer. And we will specify the parameters to `optimize` on and also the `learning rate` within the arguments of the constructor during initialization...

And the typical good setting for the `learning rate` is roughly `3e-4` but for smaller networks such as this, we can get away with much higher learning rates such as `1e-3`...

So let's initialize the optimizer now...

In [13]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

Now let's write out our training loop.

And we can remember from our old `NameWeave` series that we used to:
1. Define a `batchSize` to train on
2. Open a loop for the number of iterations or `epochs`
3. Get each batch of inputs and outputs
4. Forward the model to calculate `logits` and `loss`
5. Set the gradients to `0`
6. Call `loss.backward()` for the backward pass
7. Update the parameters using the `gradients` and the `learning rate`

So we can now write the same thing in code...

But unlike our `NameWeave` series, the steps for `5` and `7` are going to be a little bit different. Now that we are using the official `optimizer` from PyTorch, to set the gradients to `0` using the `optimizer` we would do something like this `optimizer.zero_grad(set_to_none=True)` and to update the parameters using the `optimizer` we would do something like this `optimizer.step()`.
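To see what these two optimizer calls do in isolation, here is a minimal sketch on a toy problem. The single `weight` parameter and the squared loss are assumptions made up for this illustration; they are not part of our model.

```python
import torch

# A single learnable parameter standing in for a whole model
weight = torch.nn.Parameter(torch.tensor([2.0]))
optimizer = torch.optim.AdamW([weight], lr=1e-3)

loss = (weight ** 2).sum()             # toy loss, minimized at weight = 0
optimizer.zero_grad(set_to_none=True)  # clear old gradients (they become None)
gradientBefore = weight.grad           # None: nothing has been accumulated yet
loss.backward()                        # fills weight.grad with 2 * weight
valueBefore = weight.item()
optimizer.step()                       # nudges weight using the gradient
valueAfter = weight.item()

print(gradientBefore, weight.grad)
print(valueBefore, valueAfter)
```

The parameter moves a small step towards `0`, which is all that one iteration of the training loop does for every parameter of the model.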

So let's write out the steps that we defined in code:

In [14]:
# We define the number of epochs
epochs = 100
# We define the batch size
batchSize = 32

for _ in range(epochs):
    # Get the inputBatch and outputBatch
    inputBatch, outputBatch = getBatch('train')
    # Forward the model
    logits, loss = model(inputBatch, outputBatch)
    # Setting the gradients to None
    optimizer.zero_grad(set_to_none=True)
    # Backward to calculate gradients
    loss.backward()
    # Update the parameters
    optimizer.step()
    print(loss.item())

4.876785755157471
4.878622531890869
4.971441268920898
4.972983360290527
4.952727794647217
4.961724758148193
4.920673370361328
4.9921674728393555
4.9947509765625
4.968419075012207
4.937097072601318
5.007509708404541
5.0103230476379395
5.033933639526367
4.930401802062988
4.7458176612854
4.932359218597412
4.892716407775879
4.904999732971191
4.885517120361328
4.889068126678467
4.9522624015808105
4.900505542755127
4.931491374969482
4.893527984619141
4.834726810455322
4.967082500457764
4.929368495941162
4.919459342956543
4.922502040863037
5.044764041900635
4.856174468994141
4.875720024108887
4.952334403991699
4.802764415740967
4.9507856369018555
4.928844451904297
4.9131693840026855
4.853192329406738
4.858636856079102
4.826693058013916
4.905089855194092
4.839353084564209
4.882635116577148
4.794174671173096
4.853604793548584
4.845086097717285
4.806244850158691
4.9436869621276855
4.926150798797607
4.911518096923828
4.927879333496094
4.827909469604492
4.833471775054932
4.927444934844971
4.914450

Seems like we are optimizing the model, so let's now move the print statement outside the loop to print only the final loss, and train for about `20000` steps...

In [15]:
# We define the number of epochs
epochs = 20000
# We define the batch size
batchSize = 32

for _ in range(epochs):
    # Get the inputBatch and outputBatch
    inputBatch, outputBatch = getBatch('train')
    # Forward the model
    logits, loss = model(inputBatch, outputBatch)
    # Setting the gradients to None
    optimizer.zero_grad(set_to_none=True)
    # Backward to calculate gradients
    loss.backward()
    # Update the parameters
    optimizer.step()
print(loss.item())

2.3757007122039795


So we see that we get a loss of `2.3757` and that we have significantly come down from our old loss...

## Sampling from Model

So we can try to sample/generate from our model by increasing the `maximumNewTokens` to around `500` to get a sense of a bigger output...

Now keep in mind that this is still the simplest possible model that we could have, and we still won't be able to get a very good result, but for now at least our loss has improved...

So let's generate text now...

In [16]:
print(decode(model.generate(indeces=torch.zeros((1, 1), dtype=torch.long), maximumNewTokens=500)[0].tolist()))


“Sifuriner m 

wat ol. ces 

n 
tigerrancoutheven’d t ayoour, rn fowachensprn sethelintese 
d HExpenecherine 
k dorerial SCetllyasino, gr pulet tse sand heiok hery, s oweve, atevealy. 
bover ulind th frishid, ls kne che me Pr, 





“Sooke, .K. d awastokller at inuntiveahuromitedde thamerthaplzad p 



pee woofot oulplelk Thethechllond ostosn hin. Sy Mr Chererm tck Ineyoun,” g, FEDus s iny, 
Habe s on. m,” ory tsoulabeay tonde Prokheata wredinghererongergredisst nabos'~4ot 
he acalis... font d h


It's still a very good improvement over what we had earlier...

Right now, our `tokens` or `characters` are not talking to each other (because, given the previous context of what was generated, we are looking at the very last character to make the predictions about what comes next), and we'd now like to make our `tokens` talk to each other such that they can figure out what is in the `context` so that they can make better predictions about what comes next...

# Moving Code to GPU

You probably know about `GPU`s by now...

`GPU`s can process many pieces of data simultaneously, making them useful for machine learning, video editing, and gaming applications.

So why not update our code such that it is able to run on either a `CPU` or a `GPU`, depending on what is available?

Great idea, but there's a slight problem... You see, there are many `GPU`s that are out there in the market, and I don't know which you specifically have. But I have an `NVIDIA` GPU. And to process our code in a `GPU` we need to set our specific `GPU` device...

Right now, there are three main `GPU` companies out there in the market (Apologies if I don't know about the off market GPU brands):
1. NVIDIA (GTX/RTX)
2. AMD (RADEON/VEGA/RX)
3. INTEL (ARC)

Now, I will set up the `GPU` properly for my system (`NVIDIA`), but leave links for the other two as well, because setting them up is nearly the same with a few differences here and there...

Now `NVIDIA` GPUs are programmed through `CUDA`, which uses the many cores of the GPU to process our data simultaneously...

And in order to do that we need two things:
1. <a href="https://developer.nvidia.com/cuda-downloads">CUDA Toolkit</a>
2. <a href="https://developer.nvidia.com/cudnn-downloads">cuDNN for CUDA</a> (optional)

And the installations are pretty straightforward... You just need to follow the setup for `CUDA` first, and then `cuDNN`, which generally comes as a `.zip` package, can be extracted and its files copied into the corresponding folders of the `CUDA` installation...

Now `AMD` specifically uses <a href="https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html">ROCm</a> for its graphics processing...

And `Intel` specifically uses <a href="https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/download.htm">OpenVINO</a> for its graphics processing...

And you can choose your specific `GPU` from these three resource links...

And finally if you're using `Google Colab` for this script, you can simply change your current runtime type like this:
![Google Colab GPU Change](ExplanationMedia/Images/ColabGPUChange.gif)

First you can run two lines of code to check if the `GPU` is available or not using:
```python
import torch
torch.cuda.is_available()
```

Now we have to write our code in such a manner that it works on both a `CPU` and a `GPU`, depending on what we have right now...

In order to do that we write this line of code:
```python
device = 'cuda' if torch.cuda.is_available() else 'cpu'
```

Now there is a very specific reason why we set the variable name to `device`, and that is because, during initialization of tensors and models there is an argument that PyTorch takes in, which is known as `device`, and sometimes we shift the device using the `.to()` method...

Now, by default, the tensors are generated on the `CPU`. Even the model is **initialized** on the CPU. Thus one has to manually ensure that the operations are done using `GPU`...

To do this, we need to move everything onto the `device` that we have, and therefore we need to change these things:
1. Before returning the `inputs` and `outputs` in the `getBatch()` method, we need to change the device using `.to()`, because the tensors are initialized on the `CPU`.
2. After initializing the `model` we need to change the device using `.to()`, because the model parameters are initialized on the `CPU`.
3. During initialization of the `context` for generation, we need to create the tensor directly on the device using the `device` argument.

So let's see the changes now:
1. ```python
   def getBatch(split):
       # Take the trainingData if the split is 'train' otherwise take the validationData
       data = trainingData if split=='train' else validationData
       # Generates random integers of batchSize between 0 and len(data) - blockSize
       indexes = torch.randint(high=len(data) - blockSize, size=(batchSize,))
       # Takes the inputs and outputs after stacking them up in a single tensor
       inputs = torch.stack([data[i:i+blockSize] for i in indexes])
       outputs = torch.stack([data[i+1:i+blockSize+1] for i in indexes])
       inputs, outputs = inputs.to(device), outputs.to(device)
       return inputs, outputs
   ```
2. ```python
   model = BigramLanguageModel(vocabularySize).to(device=device)
   ```
3. ```python
   context = torch.zeros((1, 1), dtype=torch.long, device=device)
   print(decode(model.generate(indeces=context, maximumNewTokens=500)[0].tolist()))
   ```
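A quick sanity check for these changes is to print where everything lives. This is a minimal sketch that uses a small stand-in `toyModel` (a hypothetical name, so that it does not overwrite our real `model`); it runs on whichever device is available.

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# A tiny stand-in model and context, both placed on the chosen device
toyModel = torch.nn.Embedding(92, 92).to(device)
context = torch.zeros((1, 1), dtype=torch.long, device=device)

# Parameters and inputs must report the same device
parameterDevice = next(toyModel.parameters()).device
print("Parameters on:", parameterDevice)
print("Context on:", context.device)

# The forward pass works because inputs and parameters live on the same device
logits = toyModel(context)
print("Logits on:", logits.device)
```

If the inputs and the parameters end up on different devices, PyTorch raises an error in the forward pass, so this is usually the first thing to check.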

# Fixing Loss Evaluation

Right now we have the code for training:
```python
for _ in range(epochs):
    # Get the inputBatch and outputBatch
    inputBatch, outputBatch = getBatch('train')
    # Forward the model
    logits, loss = model(inputBatch, outputBatch)
    # Setting the gradients to None
    optimizer.zero_grad(set_to_none=True)
    # Backward to calculate gradients
    loss.backward()
    # Update the parameters
    optimizer.step()
    print(loss.item())
```

And in our <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20(MLP)%20-%20Activations%2C%20Gradients%20%26%20Batch%20Normalization.ipynb">NameWeave (MLP) - Activations, Gradients & Batch Normalization</a> notebook we have discussed why we should read a single-batch loss cautiously: every batch is more or less lucky every time, which makes the printed loss noisy...

So we need a loss evaluation method, which averages up the `loss` over multiple batches...

So we define a method `estimateLoss()` for calculating the loss and decorate it with `@torch.no_grad()`, such that no gradients are tracked while we evaluate the loss every now and then...
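The effect of the decorator can be seen on a toy example. This is a minimal sketch in which the `weight` parameter and the two helper functions are made up for illustration; both functions compute the same value, but only one of them tracks gradients.

```python
import torch

weight = torch.nn.Parameter(torch.tensor([1.0, 2.0]))

def lossWithGradients():
    return (weight ** 2).sum()

@torch.no_grad()
def lossWithoutGradients():
    return (weight ** 2).sum()

trackedLoss = lossWithGradients()
untrackedLoss = lossWithoutGradients()

# Same value, but only the first one remembers how it was computed
print(trackedLoss.item(), trackedLoss.requires_grad)
print(untrackedLoss.item(), untrackedLoss.requires_grad)
```

Skipping the bookkeeping for the backward pass saves memory and time, which is exactly what we want when we only read the loss.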

And we have also seen how setting the training mode to `True` like `training = True` and `False` like `training = False` can create problems in layers like `Batch Normalization`...

But now that we have implemented PyTorch Modules in our code, we can do something like `model.train()` to set the mode to `training` and `model.eval()` to set the mode to `evaluation` for our model.

So we can now define two hyper-parameters:
1. `evaluationIntervals` → How often (in training iterations) we call `estimateLoss()`
2. `evaluationIterations` → How many batches we average the loss over for each dataset split

So during training we can check whether the current iteration is a multiple of `evaluationIntervals`, and if it is, we call the `estimateLoss()` method, extract the losses and print them like this:
```python
for iteration in range(epochs):
    # Check if iteration reaches interval
    if iteration % evaluationIntervals == 0:
        # Save the losses in a variable
        losses = estimateLoss()
        # Print the losses (Training and Validation)
        print(f"Step {iteration}: Training Loss {losses['train']:.4f}, Validation Loss {losses['validation']:.4f}")

    
    # Get the inputBatch and outputBatch
    inputBatch, outputBatch = getBatch('train')
    # Forward the model
    logits, loss = model(inputBatch, outputBatch)
    # Setting the gradients to None
    optimizer.zero_grad(set_to_none=True)
    # Backward to calculate gradients
    loss.backward()
    # Update the parameters
    optimizer.step()
```

And inside the `estimateLoss()` method we first set the model to `evaluation` mode. Then, for both the `training` and `validation` splits, we forward `evaluationIterations` batches through the model and store each loss in a tensor. After the `evaluationIterations` are completed, we average the losses of each split using `mean()`, set the model back to the `training` mode and return both the `training` and `validation` losses.

So we get a code like this:
```python
@torch.no_grad()
def estimateLoss():
    output = {}
    # Set the model to evaluation mode
    model.eval()

    for split in ['train','validation']:
        # Define a losses tensor for the `evaluationIterations` size
        losses = torch.zeros(evaluationIterations)
        for evaluationIteration in range(evaluationIterations):
            inputBatch, outputBatch = getBatch(split)
            logits, loss = model(inputBatch, outputBatch)
            losses[evaluationIteration] = loss.item()
        output[split] = losses.mean()
        
    # Set the model to training mode
    model.train()
    return output
```

So when we call `estimateLoss()` we are going to monitor pretty accurate `training` and `validation` losses.

But right now `model.eval()` and `model.train()` do not actually change anything, because our model has no layers that behave differently in the two modes. They will come in handy later when we have layers like `Batch Normalization` and `Dropout` in our model...
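To get a feeling for why the mode matters for such layers, here is a minimal sketch with a standalone `Dropout` layer (it is not part of our Bigram model yet, so this is only a preview under that assumption):

```python
import torch

torch.manual_seed(0)

dropout = torch.nn.Dropout(p=0.5)
inputs = torch.ones(1000)

dropout.train()                     # training mode: randomly zero out elements
trainingOutput = dropout(inputs)    # survivors are scaled by 1 / (1 - p) = 2

dropout.eval()                      # evaluation mode: dropout is switched off
evaluationOutput = dropout(inputs)  # identical to the inputs

print("Zeros in training mode:", (trainingOutput == 0).sum().item())
print("Zeros in evaluation mode:", (evaluationOutput == 0).sum().item())
```

The same layer gives different outputs depending on the mode, so evaluating the loss in the wrong mode would give misleading numbers.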

# Converting Bigram To a Script - `bigram_v1.py`

So, I'd now like to convert the entire code that we have discussed so far into a Python script, such that it can be run from a single file, out of the box, assuming you have PyTorch installed. This also lets us drop all the intermediate work that we did along the way...

As I have pointed out earlier, I will be completing parts of the discussion and will be releasing the code scripts within the same repository under the directory `GPT Scripts`.

And to run each script you just have to specify the `<filename>.py` in the terminal...

For now I have named this script as `bigram_v1.py`...

And you can run the file using a command like:
```bash
python bigram_v1.py
```

That's it... And everything will run out of the box...

# Self Attention in GPT

## Mathematical Trick for Self Attention in GPT

Before we discuss `Self Attention`, we'd like to discuss a certain problem that we are currently dealing with... And get used to different ways to solve the problem as well...

The problem is, right now we are only focusing on the `last` token of the context (`last` token of the `blockSize`)... But we'd like our model to look further in the context history like this:\
![GPTContextProblem](ExplanationMedia/Images/GPTContextProblem.png)

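Which makes the kind of model we want "**autoregressive**" **(AR)**. In classical statistics, an **autoregressive model** specifies that the output variable depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term). For our language model the dependence is not linear, but the idea is the same: every new token is predicted from the tokens that came before it...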

Now what we'd like to do is take a small *toy-example*, solve the same problem in different ways and work our way up to an efficient solution to the problem...

Let's take a very small `input` tensor and try to write it out as an example code:
```python
torch.manual_seed(69420)

batch, time, channel = 2, 8, 3
inputs = torch.randn(batch, time, channel)
print("Inputs:", inputs)
print("Shape of inputs:", inputs.shape)
```
For which I get:
```python
Inputs: tensor([[[ 2.3787e+00, -3.5896e-01, -7.1692e-01],
         [-2.4297e-01, -1.8038e-01,  1.4882e+00],
         [ 5.4493e-01,  3.8243e-01,  8.7188e-01],
         [-1.9890e+00, -5.4009e-01, -1.5319e+00],
         [-9.2356e-01,  7.2013e-01, -5.9540e-01],
         [-1.1697e+00,  8.3635e-01,  3.5811e-01],
         [ 3.9933e-01, -1.3606e+00,  1.0168e-01],
         [-4.8538e-02, -1.1643e+00, -1.5403e-01]],

        [[ 1.1998e+00, -8.0983e-01,  1.0315e+00],
         [ 1.6720e+00, -1.0681e+00, -9.6532e-01],
         [ 3.6006e-01,  3.2209e-01,  5.2594e-01],
         [-5.4021e-01, -5.2587e-01,  1.0481e+00],
         [-3.8775e-01, -1.3751e+00, -1.0385e-01],
         [-9.2093e-01, -1.0048e+00, -1.4028e+00],
         [-2.0169e+00, -5.1192e-01, -2.1998e-01],
         [-3.3050e-01, -9.1926e-01,  8.9532e-04]]])
Shape of inputs: torch.Size([2, 8, 3])
```

In the *toy-example* we have a tensor `inputs` with three dimensions:
1. `batch` → Number of batches (`batchSize`)
2. `time` → Number of tokens (characters) in a block of `blockSize`
3. `channel` → Information of the `token` in form of embeddings (features)

In this example we have `8` tokens (character) in the `time` dimension, and these tokens are not *talking* to each other...

And now we'd like them to *"talk to each other"*, we'd like to couple them... And we'd like to couple them in a very specific way...

For example, because we have `8` tokens in a sequence, let's consider the `5`th token... This `5`th token should not communicate with the tokens at locations `6`, `7` and `8` (the future tokens in the sequence). It should only talk to the tokens at locations `4`, `3`, `2` and `1` (the previous tokens in the context), such that information only flows from the previous context to the current time step. We cannot use any information from the future, because the future is exactly what we are about to predict...

So, what is the easiest way for tokens to communicate?

Let's say I am the `5`th token, and I want to communicate with my past tokens (at `4`, `3`, `2` & `1`). The simplest way to communicate with the past is to just take an **average** of the past together with my own information. Or in other words, if I am the `5`th token, I would like to take the `channels` that make up my information at my step, and also the `channels` of my past, and I'd like to average those up into a `feature vector` that *summarizes me in the context of my history*...

Now once again, doing just an average is just a very weak form of "talking" or interaction, and makes this communication extremely **lossy**, which makes us lose a lot of information about the *spatial arrangements* of all those tokens... But for now, that's okay, because in the future solutions to the same problem, we will see how we can get this information back...

So let's see different versions to solve the problem now...

## Version 1 - Naïve Approach

In this approach, what we'd like to do is, for our `inputs`, for every single `batch`, for every `token` we'd like to average out all the vectors in all the `previous tokens` including the `current token`.

Or
```python
# We want: inputBagOfWords[b, t] = mean_{i<=t} inputs[b, i]
```

Now before we dive into the solution, I'd like to discuss a concept called `Bag Of Words`...

So what is `Bag of Words`?


![Bag Of Words](https://miro.medium.com/v2/resize:fit:720/format:webp/1*3K9GIOVLNu0cRvQap_KaRg.png)

The **bag-of-words model** is a model of text which uses a representation of text that is based on an unordered collection (or *"bag"*) of words.
In natural language processing (NLP), the term "bag of words" `(BoW)` typically refers to a simple representation of text where the frequency of each word in a document is counted and represented in a vector.
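As a small illustration of the idea, here is a minimal sketch using Python's `collections.Counter` on two made-up sentences (the sentences are only examples, not taken from our dataset):

```python
from collections import Counter

sentence = "the wizard raised the wand"

# Word order is thrown away; only the counts survive
bagOfWords = Counter(sentence.split())
print(bagOfWords)

# Two sentences with the same words in a different order give the same bag
shuffled = "the wand raised the wizard"
print(Counter(shuffled.split()) == bagOfWords)
```

The bag cannot tell the two sentences apart, which is the same kind of information loss that we accept when we average tokens.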

So why mention it?

Well you see, our concept is similar but for our case, we have a character level model. 
The term "bag of words" `(BoW)` is used in contexts where we want to represent text data by counting the occurrences and then potentially averaging them out.

And we would like to use it in our *toy-example*'s output variable name... 😂

So now, let's write out what we want in the form of code, to make the idea more clear:
```python
# Allocating memory for output
inputBagOfWords = torch.zeros((batch, time, channel))

for b in range(batch):
    for t in range(time):
        # Every token in the past and current token in the batch
        previousInput = inputs[b, :t+1] # (B, T, C) → (t+1, C)
        # Mean over the token or the time dimension
        inputBagOfWords[b, t] = torch.mean(previousInput, 0) # (t+1, C) → (C,)
```

Now because we have multiple `batches` in our example, let's compare only one batch (let's say we compare the first batch or `0`-th index) of both `inputs` and `inputBagOfWords` like this:
```python
print("Input BoW:", inputBagOfWords[0])
print("Inputs:", inputs[0])
```
We get:
```python
Input BoW: tensor([[ 2.3787, -0.3590, -0.7169],
        [ 1.0679, -0.2697,  0.3856],
        [ 0.8936, -0.0523,  0.5477],
        [ 0.1729, -0.1742,  0.0278],
        [-0.0464,  0.0046, -0.0968],
        [-0.2336,  0.1432, -0.0210],
        [-0.1432, -0.0716, -0.0035],
        [-0.1313, -0.2082, -0.0223]])
Inputs: tensor([[ 2.3787, -0.3590, -0.7169],
        [-0.2430, -0.1804,  1.4882],
        [ 0.5449,  0.3824,  0.8719],
        [-1.9890, -0.5401, -1.5319],
        [-0.9236,  0.7201, -0.5954],
        [-1.1697,  0.8363,  0.3581],
        [ 0.3993, -1.3606,  0.1017],
        [-0.0485, -1.1643, -0.1540]])
```

And we see that, at every `token` or `time` position, we now have the average of all the previous tokens and the current token, which is what we want...
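
As a quick sanity check, here is a small sketch that recomputes the first two rows by hand. The values are copied from the printed `inputs[0]` above (rounded to four decimals), and the names `firstTwoTokens`, `rowZero` and `rowOne` are just illustrative:

```python
import torch

# First two token vectors of batch 0, copied from the printed `inputs[0]`
firstTwoTokens = torch.tensor([[ 2.3787, -0.3590, -0.7169],
                               [-0.2430, -0.1804,  1.4882]])

# Row 0 of the BoW output is the average of a single vector, so it is the token itself
rowZero = firstTwoTokens[:1].mean(dim=0)
# Row 1 of the BoW output is the average of the first two token vectors
rowOne = firstTwoTokens[:2].mean(dim=0)

print(rowZero)
print(rowOne)
```

Up to rounding, these match the first two rows of `inputBagOfWords[0]` printed above.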

But in this process we see that we use nested `for-loops` which is extremely inefficient...

And now next, what we will see is that we can be extremely efficient with the same problem with `Matrix Multiplication`...

## Version 2 - Matrix Multiplication Approach

To understand the matrix multiplication approach we will use another *toy-example* for our *toy-example*...

Suppose we have two matrices `a` and `b` of sizes `(3, 3)` and `(3, 2)` respectively. We know that the resultant matrix `c` will be of shape `(3, 2)`, and that every element of `c` is the dot product of a `row` of `a` with a `column` of `b`.

For example,
$$
\underbrace{\begin{bmatrix}
1\ 1\ 1\\
1\ 1\ 1\\
1\ 1\ 1\\
\end{bmatrix}}_{\text{a}}
\underbrace{\begin{bmatrix}
9\ 4\\
3\ 9\\
4\ 8\\
\end{bmatrix}}_{\text{b}}
\text{=}
\underbrace{\begin{bmatrix}
16\ 21\\
16\ 21\\
16\ 21\\
\end{bmatrix}}_{\text{c}}
\rightarrow \text{c = a @ b}
$$

Let's try to write the same example with code:
```python
torch.manual_seed(69420)

a = torch.ones(3, 3)
b = torch.randint(low=0, high=10, size=(3, 2)).float()
c = a@b

print("a=")
print(a)
print("-----")
print("b=")
print(b)
print("-----")
print("c=")
print(c)
```

Right now we have a very *boring* matrix `a` of just `1`s, which plays the role of the `weights` (like in a linear layer), while `b` plays the role of the `inputs`.

And we get repeating rows in `c` because every `row` of `1`s in `a` is multiplied with the same `columns` of `b`, so every row of `c` simply holds the column sums of `b`...
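
To see the row-times-column rule at work, here is a minimal sketch that fills `c` one element at a time with <a href="https://pytorch.org/docs/stable/generated/torch.dot.html">torch.dot()</a>. The values of `b` are hard-coded from the example above, and `cManual` is just an illustrative name:

```python
import torch

a = torch.ones(3, 3)
b = torch.tensor([[9., 4.],
                  [3., 9.],
                  [4., 8.]])  # values of `b` from the example above

# Each element c[i, j] is the dot product of row i of `a` and column j of `b`
cManual = torch.zeros(3, 2)
for i in range(3):
    for j in range(2):
        cManual[i, j] = torch.dot(a[i], b[:, j])

print(cManual)
print(torch.equal(cManual, a @ b))  # True
```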

Now if we instead take a lower triangular matrix of `1`s and keep all the other elements as `0`s for `a`, we have something like this:
$$
\underbrace{\begin{bmatrix}
1\ 0\ 0\\
1\ 1\ 0\\
1\ 1\ 1\\
\end{bmatrix}}_{\text{a}}
$$

To do this, we have a method <a href="https://pytorch.org/docs/stable/generated/torch.tril.html">torch.tril()</a> in PyTorch...

So we can now modify our code and look at the resulting matrices:
```python
torch.manual_seed(69420)

a = torch.tril(torch.ones(3, 3))
b = torch.randint(low=0, high=10, size=(3, 2)).float()
c = a @ b

print("a=")
print(a)
print("-----")
print("b=")
print(b)
print("-----")
print("c=")
print(c)
```

Which gives us something like this:
$$
\underbrace{\begin{bmatrix}
1\ 0\ 0\\
1\ 1\ 0\\
1\ 1\ 1\\
\end{bmatrix}}_{\text{a}}
\underbrace{\begin{bmatrix}
9\ 4\\
3\ 9\\
4\ 8\\
\end{bmatrix}}_{\text{b}}
\text{=}
\underbrace{\begin{bmatrix}
9\ 4\\
12\ 13\\
16\ 21\\
\end{bmatrix}}_{\text{c}}
\rightarrow \text{c = a @ b}
$$

See how because of these `0`s the resultant matrix `c` is just a result of an **incremental addition (`sum`) of their respective `columns`**?
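
This *incremental addition* is what PyTorch calls a cumulative sum. As a small cross-check (not part of the approach itself), the sketch below compares the lower-triangular multiplication against <a href="https://pytorch.org/docs/stable/generated/torch.cumsum.html">torch.cumsum()</a>, again with the values of `b` hard-coded from the example; `runningSum` is an illustrative name:

```python
import torch

a = torch.tril(torch.ones(3, 3))
b = torch.tensor([[9., 4.],
                  [3., 9.],
                  [4., 8.]])  # values of `b` from the example above

# Running (cumulative) sum down the rows of `b`
runningSum = torch.cumsum(b, dim=0)

print(runningSum)
print(torch.equal(a @ b, runningSum))  # True
```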

And in the same fashion, because we all know average(`mean`) is just the addition of all the elements divided by the number of elements, you can start to see how the average(`mean`) would come into the picture now...

Because it is the `rows` of the `weights` (`a`) that get multiplied with the `columns` of `b`, we can turn each `sum` into an average (`mean`) by `normalizing` every `row` of `a` so that it sums to one. This gives us an **incremental average (`mean`) of the respective `columns` in the resultant `c` matrix**...

So our code now looks like:
```python
torch.manual_seed(69420)

a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(low=0, high=10, size=(3, 2)).float()
c = a @ b

print("a=")
print(a)
print("-----")
print("b=")
print(b)
print("-----")
print("c=")
print(c)
```
Which gives us:
$$
\underbrace{\begin{bmatrix}
1.0000\ 0.0000\ 0.0000\\
0.5000\ 0.5000\ 0.0000\\
0.3333\ 0.3333\ 0.3333\\
\end{bmatrix}}_{\text{a}}
\underbrace{\begin{bmatrix}
9\ 4\\
3\ 9\\
4\ 8\\
\end{bmatrix}}_{\text{b}}
\text{=}
\underbrace{\begin{bmatrix}
9.0000 \ 4.0000\\
6.0000 \ 6.5000\\
5.3333 \ 7.0000\\
\end{bmatrix}}_{\text{c}}
\rightarrow \text{c = a @ b}
$$

And this is exactly the kind of running average that our original **naïve** approach computed...
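
If you'd like to convince yourself, here is a minimal sketch that runs the naïve loop on the small matrix `b` (values hard-coded from the example above) and compares it with the normalized matrix multiplication. The name `loopMean` is only illustrative:

```python
import torch

b = torch.tensor([[9., 4.],
                  [3., 9.],
                  [4., 8.]])  # values of `b` from the example above

# Normalized lower-triangular weights: every row sums to one
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)

# Naïve version: average rows 0..t of `b` for every t
loopMean = torch.stack([b[:t+1].mean(dim=0) for t in range(3)])

print(a.sum(1))                        # every row sums to (approximately) 1
print(torch.allclose(a @ b, loopMean)) # True
```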

Now let's go back to our original *toy-example* and implement this now...

Remember how we considered `a` to be `weights` and `b` to be `inputs`...

Let's first initialize the `weights` and `inputs` the same way we did before to perform a `Matrix Multiplication` now and think how the `broadcasting` works out for us because we have `batch` dimensions as well...

For now we are dealing with the `tokens` or the `time` dimension, so our weights matrix would be of shape `time` by `time`...

And we will take `inputBagOfWords` and rename it to `inputBagOfWordsV2` where `V2` represents **version** `2` just so we can compare them later...

So our code looks something like this:
```python
weights = torch.tril(torch.ones(time, time))
weights = weights / weights.sum(1, keepdim=True)
inputBagOfWordsV2 = weights @ inputs
```

Now let's think through the **broadcasting** of the matrix multiplication operation...

For now we have `weights` of two dimensions of size `(8, 8)` or `(T, T)` for our *toy-example* and `inputs` of three dimensions of size `(2, 8, 3)` or `(B, T, C)`...

Now during broadcasting, PyTorch treats `weights` as if it had an extra leading `batch` dimension, going from `(T, T)` to `(B, T, T)`, and then performs the multiplication...

Which ultimately results in a **batched matrix multiplication**, where the multiplication is applied to all the **batch elements in parallel** and individually. And for each **batch element** there will be a multiplication between `(T, T)` and `(T, C)`, exactly like the operation we discussed earlier...
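
Here is a minimal sketch of what that broadcasting amounts to. It compares the broadcasted `@` with a per-batch loop and with an explicit <a href="https://pytorch.org/docs/stable/generated/torch.bmm.html">torch.bmm()</a> call. The seed `0` is arbitrary (so the random values differ from the ones printed in this notebook), and the names `broadcasted`, `perBatch` and `explicit` are only illustrative:

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for this sketch

batch, time, channel = 2, 8, 3
inputs = torch.randn(batch, time, channel)        # (B, T, C)

weights = torch.tril(torch.ones(time, time))      # (T, T)
weights = weights / weights.sum(1, keepdim=True)

# 1. Let PyTorch broadcast: (T, T) @ (B, T, C) → (B, T, C)
broadcasted = weights @ inputs

# 2. Multiply every batch element separately: (T, T) @ (T, C) → (T, C)
perBatch = torch.stack([weights @ inputs[b] for b in range(batch)])

# 3. Expand the weights to (B, T, T) by hand and use a batched matmul
explicit = torch.bmm(weights.unsqueeze(0).expand(batch, -1, -1), inputs)

print(broadcasted.shape)
print(torch.allclose(broadcasted, perBatch, atol=1e-6))
print(torch.allclose(broadcasted, explicit, atol=1e-6))
```

All three give a tensor of shape `(B, T, C)` with the same values, up to floating-point rounding.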

And the resultant `tensor` would be of shape `(B, T, C)`, which makes `inputBagOfWordsV2` equal to `inputBagOfWords` (up to floating-point rounding).

So we can now compare `inputBagOfWords` and `inputBagOfWordsV2` with <a href="https://pytorch.org/docs/stable/generated/torch.allclose.html">torch.allclose()</a> like this:
```python
torch.allclose(inputBagOfWords, inputBagOfWordsV2)
```
for which we get:
```python
True
```

And if we want to compare them manually, because both of them are long `tensor`s, we can print just the first batch of each and see that they match:
```python
print("First batch of inputBagOfWords:", inputBagOfWords[0]) 
print("First batch of inputBagOfWordsV2:", inputBagOfWordsV2[0])
```
for which we get:
```python
First batch of inputBagOfWords: tensor([[ 2.3787, -0.3590, -0.7169],
        [ 1.0679, -0.2697,  0.3856],
        [ 0.8936, -0.0523,  0.5477],
        [ 0.1729, -0.1742,  0.0278],
        [-0.0464,  0.0046, -0.0968],
        [-0.2336,  0.1432, -0.0210],
        [-0.1432, -0.0716, -0.0035],
        [-0.1313, -0.2082, -0.0223]])
First batch of inputBagOfWordsV2: tensor([[ 2.3787, -0.3590, -0.7169],
        [ 1.0679, -0.2697,  0.3856],
        [ 0.8936, -0.0523,  0.5477],
        [ 0.1729, -0.1742,  0.0278],
        [-0.0464,  0.0046, -0.0968],
        [-0.2336,  0.1432, -0.0210],
        [-0.1432, -0.0716, -0.0035],
        [-0.1313, -0.2082, -0.0223]])
```

So let's conclude what we saw here...

We saw that, **we can do weighted aggregation of our past `tokens` or `characters` by using `Matrix Multiplication` of `weights` and the `inputs`, where `weights` are a matrix of lower-triangular fashion of `1`s and other elements as `0`s, and we are doing weighted `sum` and `normalizing` them to get the *rolling* `average` or `mean`**...

Now we will look at another way of doing this same exact operation using `softmax`...

## Version 3 - Adding Softmax Approach

## Version 4 - Crux of Self Attention

## Notes on Self Attention
