# Welcome to GPT from Scratch

## Things to discuss before starting

This notebook is a continuation of my previous notebooks in order:
1. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/Neural%20Network%20with%20Derivatives.ipynb">Neural Networks with Derivatives</a>
2. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave.ipynb">NameWeave</a>
3. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20-%20Multi%20Layer%20Perceptron.ipynb">NameWeave - Multi Layer Perceptron</a>
4. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20(MLP)%20-%20Activations%2C%20Gradients%20%26%20Batch%20Normalization.ipynb">NameWeave (MLP) - Activations, Gradients & Batch Normalization</a>
5. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20-%20Manual%20Back%20Propagation.ipynb">NameWeave - Manual Back Propagation</a>
6. <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/NameWeave%20-%20WaveNet.ipynb">NameWeave - WaveNet</a>

This means I will be reusing a lot of the terminology, explanations, and code from the previous notebooks in the series...

And we will gradually build a complete **GPT** from scratch.

This notebook will be a little bit different from the previous notebooks, and I will discuss the changes in a bit...

We will use a completely new dataset, `Harry_Potter_Books.txt`, which is the raw text of all the Harry Potter books by J. K. Rowling combined into a single `text` file, instead of `Indian_Names.txt`, which we used in the previous notebooks and which contained Indian first names crawled from a website.

We will also keep software like <a href="https://chat.openai.com/">ChatGPT</a>, <a href="https://gemini.google.com/app">Google Gemini</a>, and other Large Language Models (LLMs) in mind and create our own little **Generative Pre-Trained Transformer (GPT)**...

And you would probably know by now what these models are and what they do...

Our **GPT** is going to be a *character level language model*, instead of a *sub-word-tokenized model* like the ones behind ChatGPT and Google Gemini, and we will discuss everything in a bit...

And we will not write all the `torch.nn` modules from scratch now, because we have already covered the most important ones in our previous notebooks. Instead, we will implement the networks using the `torch.nn` library from PyTorch...

And most importantly, we will start from the simplest model (Bigram Model) and modify the same model within the same notebook, making our way up to the entire `Transformer` architecture...

I have created an entire folder named `GPT Scripts` in the root of this repository to save each script, so that you can run them without even having to use Jupyter notebooks. You can simply use `<filename>.py` to run each model and see how they perform on the same dataset as we move up, using:
```bash
python <filename>.py
```

So, it's going to be a long journey and it is going to be legen...wait-for-it...dary. Legendary!!!\
![Barney Stinson Wink](https://media.tenor.com/nJ3EeUPhVKkAAAAM/barny-stinson.gif)

So let's get started...

# Understanding GPT

So what is a **GPT**?

Well, **GPT** stands for **Generative Pre-Trained Transformer**.

You see how it consists of three words?

Let's look at it in context of each word:
1. Generative → Generates New Content
2. Pre-Trained → Pre-trained on a dataset
3. Transformer → Transformer architecture is being followed to make up the model (Don't worry we will discuss this later)

That was easy...

Let's understand what **GPT** can do currently...

Let's take `ChatGPT` as an example for now and let's see its capabilities...

![ChatGPT Current Capabilities](https://miro.medium.com/v2/resize:fit:679/1*_3AM0Yhc7qgCvZ_X1L8mhw.gif)

We see that `GPT` goes from left to right and generates text sequentially...

I wanted to show another thing:
![ChatGPT Different Responses](ExplanationMedia/Images/ChatGPT_Different_Response.gif)

See how we get a different response each time?

Which hints that it is more like a probabilistic system: for any one `prompt` it can give us multiple answers...
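
To make the "probabilistic system" idea concrete, here is a minimal sketch using only Python's standard library. The candidate continuations and their probabilities are made up for illustration (they do not come from any real model). The point is only that sampling from a distribution, rather than always picking the most likely option, can give a different answer on each draw.

```python
import random

# Hypothetical next-token candidates and probabilities (illustrative only).
candidates = ["cat", "dog", "owl"]
probabilities = [0.5, 0.3, 0.2]

def sampleNext(rng):
    # Draw one candidate according to the given probabilities.
    return rng.choices(candidates, weights=probabilities, k=1)[0]

rng = random.Random(42)  # seeded only so that the run is repeatable
samples = [sampleNext(rng) for _ in range(5)]
print(samples)  # a mix of candidates; the exact mix depends on the seed

# Always taking the most likely candidate would give the same answer every time.
greedy = candidates[probabilities.index(max(probabilities))]
print(greedy)  # cat
```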

Now, this is just one example of a `prompt`, and people have come up with billions of different prompts as of now, and in fact there are many websites that index the interactions with `ChatGPT` as well.

You can look at this <a href="https://writesonic.com/blog/best-chatgpt-examples">website</a> as an example.

We see that it is a very remarkable system, and it is what we call a **"Language Model"**.

Or in other words, it models the sequence of words or characters (or "tokens" more generally) and it knows how words follow each other in the English language (and even in other languages)...

Let's understand what **GPT** does from its perspective...

Well it is trying to complete the sequence...

In other words, the GPT model treats the `inputs` or `task` that we give it as the *start of a sequence*, and it tries to complete that sequence. That is what makes it a language model in this sense...

You would think that it is utterly ridiculous and that we cannot just model an entire architecture and make it act like a helpful assistant.

Well that is the beauty of it. And we will discuss all the under-the-hood components of what makes a software like `ChatGPT` work.

So, What is the neural network architecture under-the-hood that models this sequence of words/characters/tokens?

That comes from this paper from Google: <a href="https://arxiv.org/abs/1706.03762">"Attention Is All You Need"</a> from 2017.

This was a landmark paper in Artificial Intelligence that proposed the `Transformer` architecture. But if you start reading this paper, it may seem like a pretty random *machine-translation* paper. And that's because, I think the authors did not fully anticipate the impact it would create in this domain in the years to come...

Let's look at the original `Transformer` architecture as of now:
![Transformer Architecture](ExplanationMedia/Images/Transformer_Model_Architecture.png)

And this `Transformer` architecture has been copy-pasted into a huge number of applications in recent years...

And what we'd like to do now is create something like `ChatGPT`. But we would not be able to completely clone `ChatGPT` because it is a way more serious *production-grade* system which currently requires *thousands* of GPUs and *millions* of dollars to train the network, and also it is trained on a very good *chunk* of internet data. And there are a lot of **pre-training** and **fine-tuning** stages to it.

Rather, we would like to create a transformer-based language model, and in our case it is going to be a character level language model. And we also don't want to train on a *chunk* of the internet; rather, we need a smaller dataset (I propose we work with `Harry_Potter_Books.txt`, which is roughly a `7MB` file). And we will try to model how the characters in this dataset follow each other.

Let's take this paragraph for example:
```python
"""
Mr. and Mrs. Dursley, of number four, Privet Drive, 
were proud to say that they were perfectly normal, 
thank you very much.
"""
```

Given a chunk of these characters in the past:
```python
"""
Mr. and Mrs. Dursley, of number four, Privet Drive, 
were proud to say that the
"""
```
The `Transformer` model will look at these characters as a context in the past, and it is going to predict that the letter `'y'` is likely to come next in the sequence. And it is going to produce (generate) character sequences that look like Harry Potter. And in that process it is going to model all the patterns inside this data.
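
As a small sketch of what "predicting the next character from its context" means, the snippet below lists every (context, next character) pair hidden inside a short piece of text. The helper name `contextTargetPairs` is made up for illustration, and it only enumerates examples; it is not the model itself.

```python
def contextTargetPairs(snippet):
    # For each position, the context is everything before it
    # and the target is the character at that position.
    return [(snippet[:i], snippet[i]) for i in range(1, len(snippet))]

for context, target in contextTargetPairs("they"):
    print(repr(context), "->", repr(target))
# 't' -> 'h'
# 'th' -> 'e'
# 'the' -> 'y'
```

So a single chunk of text quietly contains many small prediction problems, one for each position.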

And once we have trained the model, it will be able to generate *infinite `Harry Potter`*...

![Harry Potter Woo](https://media4.giphy.com/media/v1.Y2lkPTc5MGI3NjExcjBmNzJ5N2EzMDQzeTB3cXV4ODN5ZGJkdWlldHhleGw3d3hpMGRhMyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/TJO5x5QQM72Q0weWXN/giphy.gif)

So let's install the required dependencies and load our dataset up and look into the data and what it looks like first...

# Installing Dependencies

In [1]:
!pip install torch
!pip install numpy
!pip install pandas
!pip install matplotlib



# Importing Libraries

In [2]:
import random
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

# Loading Dataset

This time, I will divide the dataset loading part into two forms:
1. If you're trying to use `Google Colab` to run the code
2. If you're trying to use `Jupyter Notebook locally` to run the code

And you can choose either one of those, depending on your desired mode...

## For Google Colab users

This will download the `Harry_Potter_Books.txt` into your current folder...

In [3]:
!wget https://raw.githubusercontent.com/AvishakeAdhikary/Neural-Networks-From-Scratch/main/Datasets/Harry_Potter_Books.txt

'wget' is not recognized as an internal or external command,
operable program or batch file.


Now you can load up the dataset like this:

In [4]:
with open('Harry_Potter_Books.txt', 'r', encoding='utf-8') as file:
    text = file.read()

FileNotFoundError: [Errno 2] No such file or directory: 'Harry_Potter_Books.txt'

## For local Jupyter Notebook users

You don't have to download the dataset if you have the entire repository cloned.

The dataset `Harry_Potter_Books.txt` is already located inside the `Datasets` directory...

So we can simply open the file and look at its content by specifying the relative path...

In [5]:
with open('Datasets/Harry_Potter_Books.txt', 'r', encoding='utf-8') as file:
    text = file.read()
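
If you are not sure which of the two setups applies to you (for example, `wget` is not available on a default Windows shell), a small helper can check both locations. This is only a sketch: the function name `findDataset` is made up for illustration, and the candidate paths are simply the two paths used in the cells above.

```python
from pathlib import Path

def findDataset(candidates):
    # Return the first candidate path that exists as a file, or None if none do.
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None

datasetPath = findDataset([
    "Harry_Potter_Books.txt",           # downloaded into the current folder
    "Datasets/Harry_Potter_Books.txt",  # cloned repository layout
])
print(datasetPath)  # None if the file is in neither location
```

If `datasetPath` is not `None`, it can be passed to `open(...)` exactly as in the cells above.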

# Exploring the Dataset

We can look at the length of the entire dataset, that is, its number of characters...

In [6]:
print("Length of Dataset in Characters: ", len(text))

Length of Dataset in Characters:  6765190


We see that it is roughly `6.8-million` characters...

And if you want to look at the first `1000` characters we can do:
```python
print(text[:1000])
```
Which prints the output:
```python
"""
/ 




THE BOY WHO LIVED 

Mr. and Mrs. Dursley, of number four, Privet Drive, 
were proud to say that they were perfectly normal, 
thank you very much. They were the last people you’d 
expect to be involved in anything strange or 
mysterious, because they just didn’t hold with such 
nonsense. 

Mr. Dursley was the director of a firm called 
Grunnings, which made drills. He was a big, beefy 
man with hardly any neck, although he did have a 
very large mustache. Mrs. Dursley was thin and 
blonde and had nearly twice the usual amount of 
neck, which came in very useful as she spent so 
much of her time craning over garden fences, spying 
on the neighbors. The Dursley s had a small son 
called Dudley and in their opinion there was no finer 
boy anywhere. 

The Dursleys had everything they wanted, but they 
also had a secret, and their greatest fear was that 
somebody would discover it. They didn’t think they 
could bear it if anyone found out about the Potters. 
Mrs. Potter was Mrs. Dursl
"""
```

# Building Vocabulary

We can now start building our vocabulary, just like we did in our previous notebooks...

In [7]:
characters = sorted(list(set(text))) # Gives us all the unique characters that appear in our dataset, in sorted order
vocabularySize = len(characters) # We define a common vocabulary size
print("Characters:", characters)
print("Vocabulary Size:", vocabularySize)

Characters: ['\n', ' ', '!', '"', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '\\', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|', '~', '—', '‘', '’', '“', '”', '•', '■', '□']
Vocabulary Size: 92


So we have a possible `vocabulary` of `92` characters that our model will be able to see or emit...

Now we would like to develop a strategy to <strong><i>tokenize</i></strong> our input `text`.

And when we say **tokenize** we generally mean to convert raw text as a string to some sequence of integers according to some vocabulary of possible elements...

For us, because we are developing a character level language model, we are simply going to be translating individual `characters` into `integers`.

And we will build `4` things here:
1. String to Index Vocabulary → `stoi` → A map of `characters` to `integers`
2. Index to String Vocabulary → `itos` → A map of `integers` to `characters`
3. Token Encoder → That will encode a sequence of characters into indices
4. Token Decoder → That will decode a sequence of encoded indices back into characters

And you will be able to recognize the first two from our previous notebooks...

Before we dive in, let's understand the python concept of `lambda` functions, which some of you might have forgotten...

So what are `lambda` functions?

Python lambda functions are *anonymous functions*, meaning functions without a name. As we already know, the `def` keyword is used to define a normal function in Python. Similarly, the `lambda` keyword is used to define an anonymous function in Python.

Syntax: `lambda arguments : expression`

For example:
```python
output = lambda input: input+1
print(output(input=1))
```
We would print `2`.
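
To connect the two keywords, the sketch below defines the same one-line function twice, once with `def` and once with `lambda`. The names `addOneDef` and `addOneLambda` are only illustrative.

```python
def addOneDef(value):
    return value + 1

addOneLambda = lambda value: value + 1

# Both behave identically when called.
print(addOneDef(1), addOneLambda(1))  # 2 2

# The only visible difference is that the lambda has no name of its own.
print(addOneDef.__name__, addOneLambda.__name__)  # addOneDef <lambda>
```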

Now, let me first run it, then I will explain it later:

In [8]:
stoi = {character:index for index, character in enumerate(characters)}
itos = {index:character for index, character in enumerate(characters)}
encode = lambda string: [stoi[character] for character in string] # Token Encoder that takes in a string as an input, and outputs a list of integers
decode = lambda indices: ''.join([itos[index] for index in indices]) # Token Decoder that takes in the encoded list of integers and outputs the decoded string (parameter named `indices` so it does not shadow the built-in `list`)

print("STOI:", stoi)
print("ITOS:", itos)
print("Encoded Text: ", encode("Legendary"))
print("Decoded Text: ", decode(encode("Legendary")))

STOI: {'\n': 0, ' ': 1, '!': 2, '"': 3, '%': 4, '&': 5, "'": 6, '(': 7, ')': 8, '*': 9, ',': 10, '-': 11, '.': 12, '/': 13, '0': 14, '1': 15, '2': 16, '3': 17, '4': 18, '5': 19, '6': 20, '7': 21, '8': 22, '9': 23, ':': 24, ';': 25, '>': 26, '?': 27, 'A': 28, 'B': 29, 'C': 30, 'D': 31, 'E': 32, 'F': 33, 'G': 34, 'H': 35, 'I': 36, 'J': 37, 'K': 38, 'L': 39, 'M': 40, 'N': 41, 'O': 42, 'P': 43, 'Q': 44, 'R': 45, 'S': 46, 'T': 47, 'U': 48, 'V': 49, 'W': 50, 'X': 51, 'Y': 52, 'Z': 53, '\\': 54, ']': 55, 'a': 56, 'b': 57, 'c': 58, 'd': 59, 'e': 60, 'f': 61, 'g': 62, 'h': 63, 'i': 64, 'j': 65, 'k': 66, 'l': 67, 'm': 68, 'n': 69, 'o': 70, 'p': 71, 'q': 72, 'r': 73, 's': 74, 't': 75, 'u': 76, 'v': 77, 'w': 78, 'x': 79, 'y': 80, 'z': 81, '|': 82, '~': 83, '—': 84, '‘': 85, '’': 86, '“': 87, '”': 88, '•': 89, '■': 90, '□': 91}
ITOS: {0: '\n', 1: ' ', 2: '!', 3: '"', 4: '%', 5: '&', 6: "'", 7: '(', 8: ')', 9: '*', 10: ',', 11: '-', 12: '.', 13: '/', 14: '0', 15: '1', 16: '2', 17: '3', 18: '4', 19: 

The `Token Encoder` here takes in a `string` or a `sequence of characters` and encodes it into a `list of integers` based on `stoi` mapping. And the `Token Decoder` takes in the encoded `list of integers` and decodes it based on `itos` mapping to get back the exact same string...

In other words, it is more like a translation of `characters` into `integers` and `integers` into `characters`, because our model is going to be a character level language model.
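
A quick way to convince ourselves of this is a round-trip check on a tiny vocabulary. The sketch below rebuilds the same four pieces from a short sample string instead of the full dataset, so the integer values differ from the ones printed above. All the names prefixed with `toy` are illustrative only, chosen so that they do not overwrite the real `stoi`, `itos`, `encode`, and `decode`. It also shows what happens when a character is missing from the vocabulary.

```python
toyText = "Legendary wizard"
toyCharacters = sorted(set(toyText))
toyStoi = {character: index for index, character in enumerate(toyCharacters)}
toyItos = {index: character for index, character in enumerate(toyCharacters)}

def toyEncode(string):
    # Map every character to its integer index.
    return [toyStoi[character] for character in string]

def toyDecode(indices):
    # Map every integer index back to its character.
    return ''.join(toyItos[index] for index in indices)

# Encoding and then decoding gives back the exact same string.
print(toyDecode(toyEncode("Legendary")) == "Legendary")  # True

# A character outside the vocabulary cannot be encoded.
try:
    toyEncode("Legendary!")
except KeyError as error:
    print("Unknown character:", error)  # Unknown character: '!'
```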

Now this is only one of many possible `encodings` or `tokenizers` that are out there in the world right now...

And people have come up with many such `tokenizers`, for example, Google uses <a href="https://github.com/google/sentencepiece">`sentencepiece`</a>, OpenAI uses <a href="https://github.com/openai/tiktoken">`tiktoken`</a>...

And the `tokenizers` that are out there are mostly `sub-word` tokenizers: they encode **neither** `entire words` **nor** `individual characters`, but `sub-word` units, which is usually what's adopted in practice...

As an example let's take `tiktoken` vocabulary which uses `Byte-Pair Encoding (BPE)` to encode these `tokens`:\
![Tiktoken Vocabulary](ExplanationMedia/Images/Tiktoken_Vocabulary.png)

We see that `tiktoken` has a vocabulary of `50257` tokens, whereas ours has just `92`.

And when we try to encode a sample string in `tiktoken`, we get:\
![Tiktoken Example](ExplanationMedia/Images/TikToken_Example.png)

We see that we only get `3` tokens for an entire string of `9` characters...

Which means that we can *trade-off* `sequences of integers` and `vocabularies`...

In other words, we can have very long `sequences of integers` with very small `vocabularies`, or very short `sequences of integers` with very large `vocabularies`...
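
This trade-off can be sketched with a toy example. The sub-word vocabulary below is hand-picked and hypothetical (it is not `tiktoken`'s real vocabulary), and greedy longest-match is a simplification, not Byte-Pair Encoding. It only shows that a larger vocabulary can cover the same string with fewer tokens.

```python
def greedyTokenize(string, vocabulary):
    # Repeatedly take the longest vocabulary entry that matches at the current position.
    tokens = []
    position = 0
    while position < len(string):
        for length in range(len(string) - position, 0, -1):
            piece = string[position:position + length]
            if piece in vocabulary:
                tokens.append(piece)
                position += length
                break
        else:
            # No vocabulary entry matched, not even a single character.
            raise ValueError(f"No token matches at position {position}")
    return tokens

word = "Legendary"
characterVocabulary = set(word)                                    # 8 unique characters
subwordVocabulary = characterVocabulary | {"Leg", "end", "ary"}    # 11 entries, hypothetical

characterTokens = greedyTokenize(word, characterVocabulary)
subwordTokens = greedyTokenize(word, subwordVocabulary)
print(len(characterVocabulary), len(characterTokens))  # 8 9
print(len(subwordVocabulary), len(subwordTokens))      # 11 3
print(subwordTokens)  # ['Leg', 'end', 'ary']
```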

But for now I'd like to keep our `tokenizer` extremely simple using our own character-level tokenizer (meaning we have very small `vocabulary`) and very simple `encode` and `decode` functions, but we do get very long `sequences of integers` as a result...

Don't worry, if you'd like I will build a `tokenizer` in the future...

So let's now move forward...

Now that we have a `token encoder` and a `token decoder` or effectively a `tokenizer` we can move forward and encode our entire `Harry Potter` dataset...

And we will use <a href="https://pytorch.org/">PyTorch</a> library for that:
![PyTorch Logo](ExplanationMedia/Images/PyTorchLogo.svg)

So we can now wrap our `text` data, after `encoding` it, into a `tensor` of datatype `long`. We use `torch.long` (a 64-bit integer type, not a floating-point one) because every token is an integer index into our vocabulary:
```python
data = torch.tensor(encode(text), dtype=torch.long)
```

In [9]:
data = torch.tensor(encode(text), dtype=torch.long)

And then we can check the `shape` and `type` of this data and print out the first `100` encoded characters (without decoding them), just like we did before, like this:
```python
print(data.shape, data.dtype)
```
And we get:
```python
torch.Size([6765190]) torch.int64
```
And:
```python
print(data[:100])
```
And we get:
```python
tensor([13,  1,  0,  0,  0,  0,  0, 47, 35, 32,  1, 29, 42, 52,  1, 50, 35, 42,
         1, 39, 36, 49, 32, 31,  1,  0,  0, 40, 73, 12,  1, 56, 69, 59,  1, 40,
        73, 74, 12,  1, 31, 76, 73, 74, 67, 60, 80, 10,  1, 70, 61,  1, 69, 76,
        68, 57, 60, 73,  1, 61, 70, 76, 73, 10,  1, 43, 73, 64, 77, 60, 75,  1,
        31, 73, 64, 77, 60, 10,  1,  0, 78, 60, 73, 60,  1, 71, 73, 70, 76, 59,
         1, 75, 70,  1, 74, 56, 80,  1, 75, 63])
```

And we see that we have a massive list of integers, and that its first `100` entries are an exact translation of the first `100` characters in the `text` file...

And the entire dataset is just stretched out into a very large `sequence of integers`...

Now before we move on with our progress, we would like to do one more thing that is we'd like to split our dataset into a `Train` and `Validation` split...

So let's do that...

# Splitting the Dataset into `Training` and `Validation` splits

Now I'd like to split our data into a split of:
1. First 90% into `Training` Split
2. Last 10% into `Validation` Split

And we are doing this to understand to what extent our model is `overfitting`...

Because we don't want our model to copy and recreate the exact `Harry Potter` books; instead, we want a model that will create `Harry Potter`-like text...

And here's how I do that:

In [10]:
nintyPercentOfDatasetLength = int((0.9 * len(data)))
trainingData = data[:nintyPercentOfDatasetLength] # Data up till 90% of the length
validationData = data[nintyPercentOfDatasetLength:] # Data from 90% of the length
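
To see what this slicing does, here is the same 90/10 split on a toy sequence of 20 integers instead of the full tensor (a plain Python list slices the same way as our 1-D tensor here). The `toy` names are illustrative only.

```python
toyData = list(range(20))
toySplitIndex = int(0.9 * len(toyData))  # 18

toyTrainingData = toyData[:toySplitIndex]    # first 90%
toyValidationData = toyData[toySplitIndex:]  # last 10%

print(len(toyTrainingData), len(toyValidationData))  # 18 2
print(toyValidationData)  # [18, 19]

# The two splits together are exactly the original sequence, in order.
print(toyTrainingData + toyValidationData == toyData)  # True
```

Because we slice instead of shuffling, the validation split is simply the last part of the text, which the model never sees during training.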

We can now move on to the next part...