<a href="https://www.kaggle.com/code/ayushs9020/gpt-on-the-bhagvad-gita-llms?scriptVersionId=129925360" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# The Bhagavad Gita and the GPT-0

<img src = "https://rukminim1.flixcart.com/image/416/416/jyyqc280/regionalbooks/y/6/3/srimad-bhagavad-gita-as-it-is-hindi-2018-new-edition-hardcover-original-imaf79gzwhhey4zz.jpeg?q=70">

This notebook is highly inspired by **[Andrej Karpathy](https://www.youtube.com/@AndrejKarpathy)=>[Let's build GPT : from scratch , in code , spelled out](https://youtu.be/kCc8FmEb1nY)**

The `Bhagavad Gita` is a $700-verse$ `Hindu scripture` in `Sanskrit` that is part of the `epic Mahabharata`. It is set in a narrative framework of a dialogue between the `Pandava prince Arjuna` and his guide and `charioteer Krishna`. The `Gita` explores a range of `philosophical topics`, such as the `nature of reality`, `the self`, `the soul`, and the `relationship between action and duty`. It also contains teachings on `yoga, meditation, and devotion`.


$GPT-0$ is the name this notebook uses for a `small`, `simple language model` built in the style of a `generative pre-trained transformer`. It is not an official `OpenAI` release; the name is a nod to the `GPT` series. A model like this is `not as powerful` as the large language models released by `OpenAI`, but it is still a `valuable tool for learning` how such models generate text.

Today we will try to build our own `GPT-0` trained on the `Bhagavad Gita`, and see the results

In [1]:
import numpy as np 
import pandas as pd 

In [2]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [3]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/bhagwat-gita-in-english/bhagvadnew.txt
/kaggle/input/bhagwat-gita-in-english/gita.txt


# 1 | Data 🚀

Let's load our data

In [4]:
with open('/kaggle/input/bhagwat-gita-in-english/gita.txt', 
          'r', encoding='utf-8') as f:
    text = f.read()

So our data is a text and it basically looks like this

In [5]:
print(text)

 I
Dhritirashtra:
Ranged thus for battle on the sacred plain--
On Kurukshetra--say, Sanjaya! say
What wrought my people, and the Pandavas?

Sanjaya:
When he beheld the host of Pandavas,
Raja Duryodhana to Drona drew,
And spake these words: "Ah, Guru! see this line,
How vast it is of Pandu fighting-men,
Embattled by the son of Drupada,
Thy scholar in the war! Therein stand ranked
Chiefs like Arjuna, like to Bhima chiefs,
Benders of bows; Virata, Yuyudhan,
Drupada, eminent upon his car,
Dhrishtaket, Chekitan, Kasi's stout lord,
Purujit, Kuntibhoj, and Saivya,
With Yudhamanyu, and Uttamauj
Subhadra's child; and Drupadi's;-all famed!
All mounted on their shining chariots!
On our side, too,--thou best of Brahmans! see
Excellent chiefs, commanders of my line,
Whose names I joy to count: thyself the first,
Then Bhishma, Karna, Kripa fierce in fight,
Vikarna, Aswatthaman; next to these
Strong Saumadatti, with full many more
Valiant and tried, ready this day to die
For me their king, each with 

# 2 | Embeddings/Tokenizing 🔢

Okay, so just hear me out. First thing to notice: we cannot just put letters into a model and expect it to understand everything. We need to somehow turn these characters into numbers.

We know roughly which characters we have: most of them fall in the `English Alphabet`, plus some extras like `"," , "." , etc`. So what if we simply number them by their position?

Let's assume we have a string like `Optimus Prime`. In the `English Alphabet`, `O` comes at position `15`, so we can number `O` as `15`. If we continue the count through the lowercase letters (`a` is `27`, ..., `z` is `52`) and give the space `0`, the whole sequence becomes something like this

|Character|Number
|---|---
|O|15
|p|42
|t|46
|i|35
|m|39
|u|47
|s|45
| |0
|P|16
|r|44
|i|35
|m|39
|e|31

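We call this numerical representation of a `str` **Embedding/Tokenizing**. Strictly speaking, turning characters into integer ids is tokenizing; an embedding is the learned vector attached to each id, which we meet in the bigram section.

What we did here is `character encoding`: every distinct character gets its own number, independent of the characters around it.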
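The alphabet-position numbering from the table can be sketched in a few lines. This is only an illustration of the toy scheme (space is 0, `A`-`Z` are 1-26, `a`-`z` are 27-52); the names `TOY_VOCAB` and `toy_encode` are hypothetical, and the notebook itself uses a different mapping built from the corpus.

```python
import string

# Toy scheme from the table: space -> 0, 'A'-'Z' -> 1-26, 'a'-'z' -> 27-52.
# `toy_encode` is a hypothetical helper, not part of the notebook.
TOY_VOCAB = " " + string.ascii_uppercase + string.ascii_lowercase

def toy_encode(s):
    """Return the position of every character of `s` in TOY_VOCAB."""
    return [TOY_VOCAB.index(ch) for ch in s]

print(toy_encode("Optimus Prime"))
# [15, 42, 46, 35, 39, 47, 45, 0, 16, 44, 35, 39, 31]
```

A fixed alphabet like this breaks as soon as the text contains punctuation or digits, which is why the real mapping is derived from the characters that actually occur in the corpus.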

There are other ways to turn text into numbers, such as `Bag Of Words , TF-IDF , Word2Vec , Glove`. There are also sub-word tokenizers like **[Sentence Piece by Google](https://github.com/google/sentencepiece)** or **[Tik Token by OpenAI](https://github.com/openai/tiktoken)**, which you can try.

So now we have an intuition of what we have to do; now we need to code that intuition. Let's first get all the unique characters of the corpus we have

In [6]:
print("Unique values of the corpus : " , set(text))
print("---------------------------------------------------------------------------------------------")
print("Sorted Unique values of the corpus : " , sorted(list(set(text))))

Unique values of the corpus :  {'#', '"', '3', 'v', '[', 't', 'w', '5', 'R', '1', 'b', ';', 'H', 'O', ',', 'D', 'T', 'F', 'c', 'A', 'f', 'V', 'y', 'n', 'X', '(', 'J', 'j', 'Y', 'L', '\n', ':', ')', 'e', 'B', '9', 'p', 'W', 'N', '2', ']', 'i', 'm', 'K', ' ', '-', 'k', 'M', '!', '7', 'o', '.', 'Q', 'S', 'C', 'E', 'u', 'z', '6', 'x', 's', 'G', 'l', '8', 'd', 'U', "'", '4', 'P', '?', 'r', 'g', '0', 'a', 'q', 'I', 'h'}
---------------------------------------------------------------------------------------------
Sorted Unique values of the corpus :  ['\n', ' ', '!', '"', '#', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [7]:
characters = sorted(list(set(text)))

Now we will try to map these characters with some specific numbers 

In [8]:
stoi = {ch:i for i,ch in enumerate(characters)}
itos = {i:ch for i,ch in enumerate(characters)}

In [9]:
print("Mapping of characters to numbers : " , stoi)
print("---------------------------------------------------------------------------------------------")
print("Mapping of numbers to characters : " , itos)

Mapping of characters to numbers :  {'\n': 0, ' ': 1, '!': 2, '"': 3, '#': 4, "'": 5, '(': 6, ')': 7, ',': 8, '-': 9, '.': 10, '0': 11, '1': 12, '2': 13, '3': 14, '4': 15, '5': 16, '6': 17, '7': 18, '8': 19, '9': 20, ':': 21, ';': 22, '?': 23, 'A': 24, 'B': 25, 'C': 26, 'D': 27, 'E': 28, 'F': 29, 'G': 30, 'H': 31, 'I': 32, 'J': 33, 'K': 34, 'L': 35, 'M': 36, 'N': 37, 'O': 38, 'P': 39, 'Q': 40, 'R': 41, 'S': 42, 'T': 43, 'U': 44, 'V': 45, 'W': 46, 'X': 47, 'Y': 48, '[': 49, ']': 50, 'a': 51, 'b': 52, 'c': 53, 'd': 54, 'e': 55, 'f': 56, 'g': 57, 'h': 58, 'i': 59, 'j': 60, 'k': 61, 'l': 62, 'm': 63, 'n': 64, 'o': 65, 'p': 66, 'q': 67, 'r': 68, 's': 69, 't': 70, 'u': 71, 'v': 72, 'w': 73, 'x': 74, 'y': 75, 'z': 76}
---------------------------------------------------------------------------------------------
Mapping of numbers to characters :  {0: '\n', 1: ' ', 2: '!', 3: '"', 4: '#', 5: "'", 6: '(', 7: ')', 8: ',', 9: '-', 10: '.', 11: '0', 12: '1', 13: '2', 14: '3', 15: '4', 16: '5', 17: 

Now we will make $2$ functions, `encoder` and `decoder`.

The encoder takes a `str` as input and returns the corresponding list of numbers.

The decoder takes that `list of numbers` as input and returns the corresponding `str` value.

In [10]:
encoder = lambda s: [stoi[c] for c in s]
decoder = lambda l: ''.join([itos[i] for i in l])

Lets try our old example `Optimus Prime `

In [11]:
encoder("Optimus Prime")

[38, 66, 70, 59, 63, 71, 69, 1, 39, 68, 59, 63, 55]

In [12]:
decoder(encoder("Auto Bots ! , ROll OUT"))

'Auto Bots ! , ROll OUT'
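One thing worth keeping in mind: because the mapping only contains characters seen in the corpus, encoding any other character raises a `KeyError`. A minimal sketch of that behaviour, using a tiny made-up corpus (`mini_text`) and `mini_`-prefixed names that are assumptions for this example, not part of the notebook:

```python
# Hypothetical stand-in corpus; the notebook builds its vocabulary from gita.txt.
mini_text = "Auto Bots, roll out!"

mini_chars = sorted(set(mini_text))
mini_stoi = {ch: i for i, ch in enumerate(mini_chars)}
mini_itos = {i: ch for i, ch in enumerate(mini_chars)}

def mini_encode(s):
    """Map every character of `s` to its integer id."""
    return [mini_stoi[c] for c in s]

def mini_decode(ids):
    """Map a list of ids back to the string they came from."""
    return "".join(mini_itos[i] for i in ids)

# Round trip: decoding an encoding gives the original string back.
assert mini_decode(mini_encode("Bots roll")) == "Bots roll"

# 'z' never occurs in mini_text, so it has no id.
try:
    mini_encode("zoo")
except KeyError as err:
    print("unknown character:", err)
```

With the full Gita vocabulary this rarely matters for English prompts, but it is the reason any prompt has to be checked against the vocabulary before it is encoded.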

Now we will do this for our whole corpus of text

In [13]:
encoder(text)

[1, 32, 0, 27, 58, 68, 59, 70, 59, 68, 51, 69, 58, 70, 68, 51, 21, 0, 41, 51,
 ...]

This is a normal Python `list`, but we need a `tensor`, as we will be working with `torch`

In [14]:
data = torch.tensor(np.array(encoder(text)))

# 3 | Train-Test Split 🚃
$Training$ $data$ is used to `teach a machine learning model` how to perform a specific task. $Testing$ $data$ is used to `evaluate the performance of a trained model` on text it has not seen during training. Both parts should be representative of the data that the model will be used on in the real world. In $NLP$, as elsewhere, the quality of this data limits how accurate the model can be.

We will divide our data in a $9:1$ ratio: the first 90% for `training` and the last 10% for `validation` (the `val` variable below)

In [15]:
train = data[:int(0.9 * len(data))]
val = data[int(0.9 * len(data)):]
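A quick sanity check of this kind of slice on a stand-in sequence (a plain Python list of 100 numbers, an assumption for illustration) shows that the first 90% goes to training and the last 10% is held out, with no overlap:

```python
# Stand-in for the encoded corpus.
toy_data = list(range(100))

split = int(0.9 * len(toy_data))   # index where the held-out part starts
toy_train = toy_data[:split]
toy_val = toy_data[split:]

print(len(toy_train), len(toy_val))   # 90 10
```

Because the split is a contiguous slice rather than a shuffle, the held-out part is simply the end of the text.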

# 4 | Next Character Prediction ⏭️

Lets take the same example 

In [16]:
encoder("I am Optimus Prime")

[32, 1, 51, 63, 1, 38, 66, 70, 59, 63, 71, 69, 1, 39, 68, 59, 63, 55]

What we do is 
* **We first feed only the first character, $32$, to the model**
* **Then the model tries to predict the next character, which should be $1$**
* **We compare the prediction with the true character and get some error (loss)**
* **We then change the weights and biases using the optimizer**
* **Then we repeat the process with the characters $32 , 1$ and try to predict $51$**
* **We repeat the process until the list ends**
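The steps above can be sketched as a loop over growing contexts. The snippet below only enumerates the (context, target) pairs for the encoded sentence; it assumes the ids printed by the encoder cell above and does not include the model or the weight update.

```python
# Ids of "I am Optimus Prime" as printed by the encoder cell above.
ids = [32, 1, 51, 63, 1, 38, 66, 70, 59, 63, 71, 69, 1, 39, 68, 59, 63, 55]

pairs = []
for t in range(1, len(ids)):
    context = ids[:t]   # everything seen so far
    target = ids[t]     # the character the model should predict next
    pairs.append((context, target))

for context, target in pairs[:3]:
    print(f"context {context} -> target {target}")
```

A sequence of 18 ids therefore gives 17 training examples, one for each position that has a successor.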

Now lets assume we have a very long sentence like 
```
And let's be honest, life's a competition
So, if I'm going to play, then I'm gon' play to win it
I refuse to sit and rot at a desk all day
Unless I have a passion I'm working towards, okay
I'd rather be dead on the outside than inside
A bullet to the head than 25 to life
In a cubicle alone just trying to get by
Building someone else's dream instead of building mine

If you're hearin' me, this is meant to inspire
If you have a dream or if you have desires
A girl in your life that's makin' you feel that fire
Go fight for her, man, go die for her, man
'Cause you only have one life, one chance to do it
One chance to prove it to yourself, so, don't lose it
You got this, fam, just keep pushing on through it
One day you'll look back so glad you pursued it
```

If we encode this, we get 

In [17]:
sample_text = '''And let's be honest, life's a competition
So, if I'm going to play, then I'm gon' play to win it
I refuse to sit and rot at a desk all day
Unless I have a passion I'm working towards, okay
I'd rather be dead on the outside than inside
A bullet to the head than 25 to life
In a cubicle alone just trying to get by
Building someone else's dream instead of building mine
If you're hearin' me, this is meant to inspire
If you have a dream or if you have desires
A girl in your life that's makin' you feel that fire
Go fight for her, man, go die for her, man
'Cause you only have one life, one chance to do it
One chance to prove it to yourself, so, don't lose it
You got this, fam, just keep pushing on through it
One day you'll look back so glad you pursued it'''

In [18]:
np.array(encoder(sample_text))

array([24, 64, 54,  1, 62, 55, 70,  5, 69,  1, 52, 55,  1, 58, 65, 64, 55,
       69, 70,  8,  1, 62, 59, 56, 55,  5, 69,  1, 51,  1, 53, 65, 63, 66,
       55, 70, 59, 70, 59, 65, 64,  0, 42, 65,  8,  1, 59, 56,  1, 32,  5,
       63,  1, 57, 65, 59, 64, 57,  1, 70, 65,  1, 66, 62, 51, 75,  8,  1,
       70, 58, 55, 64,  1, 32,  5, 63,  1, 57, 65, 64,  5,  1, 66, 62, 51,
       75,  1, 70, 65,  1, 73, 59, 64,  1, 59, 70,  0, 32,  1, 68, 55, 56,
       71, 69, 55,  1, 70, 65,  1, 69, 59, 70,  1, 51, 64, 54,  1, 68, 65,
       70,  1, 51, 70,  1, 51,  1, 54, 55, 69, 61,  1, 51, 62, 62,  1, 54,
       51, 75,  0, 44, 64, 62, 55, 69, 69,  1, 32,  1, 58, 51, 72, 55,  1,
       51,  1, 66, 51, 69, 69, 59, 65, 64,  1, 32,  5, 63,  1, 73, 65, 68,
       61, 59, 64, 57,  1, 70, 65, 73, 51, 68, 54, 69,  8,  1, 65, 61, 51,
       75,  0, 32,  5, 54,  1, 68, 51, 70, 58, 55, 68,  1, 52, 55,  1, 54,
       55, 51, 54,  1, 65, 64,  1, 70, 58, 55,  1, 65, 71, 70, 69, 59, 54,
       55,  1, 70, 58, 51

That comes out to be a very long sequence. Feeding the whole text to the model at once is impractical, so instead we pick random chunks of a fixed size from the data and train on those. Let's assume we take a block size of $8$:

In [19]:
block_size = 8
train[:block_size]

tensor([ 1, 32,  0, 27, 58, 68, 59, 70])

Now let's assume we have reached the last character of the block, $70$. We cannot compute an error for it, because the block does not contain the character that comes next, so there is no ground truth to compare the prediction with. To counter this problem, we take one extra element beyond the block size

In [20]:
train[:block_size + 1]

tensor([ 1, 32,  0, 27, 58, 68, 59, 70, 59])

# 5 | Batch Size 📏

Now as we are using $GPUs/TPUs$, we can do parallel processing to a much larger extent than on a $CPU$. So we send several chunks of data to the model simultaneously, as a batch.

Lets assume we pass a `batch_size` of $4$.

In [21]:
x = torch.randint(len(train) - 8 , (4,))

In [22]:
torch.stack([train[i : i + 8] for i in x])

tensor([[68, 61, 10,  0, 43, 58, 55, 68],
        [57, 71, 59, 62, 70,  1, 51, 64],
        [53, 68, 75, 59, 64, 57,  1,  3],
        [68,  1, 45, 59, 68, 70, 71, 55]])

Now let's assume we start every slice one position later, at `i + 1`

In [23]:
torch.stack([train[i + 1: i + 8] for i in x])

tensor([[61, 10,  0, 43, 58, 55, 68],
        [71, 59, 62, 70,  1, 51, 64],
        [68, 75, 59, 64, 57,  1,  3],
        [ 1, 45, 59, 68, 70, 71, 55]])

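Notice one thing...?

Each row of the second `tensor` is the matching row of the first `tensor` shifted by one position, so it holds the ground truth (the next character) for every position of the first `tensor` except the last one. To give the last position a target as well, the slice has to run one element further, up to `i + 8 + 1`.

So we can name these tensors `X` and `y` respectively 

In [24]:
X = torch.stack([train[i : i + 8] for i in x])
# run one element further than above so that y has a target for every position of X
y = torch.stack([train[i + 1: i + 8 + 1] for i in x])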

So let's create a function `batch` that returns batches according to the batch size.

In [25]:
def batch(dataset , batch_size = 4):
    
    data = train if dataset == "train" else val
    
    index = torch.randint(len(data) - block_size , 
                          (batch_size , ))
    
    X = torch.stack([data[i : i + block_size] for i in index])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in index])
    
    return X , y

In [26]:
X_batch , y_batch = batch("train")

In [27]:
X_batch

tensor([[55,  1, 51, 52, 69, 65, 68, 52],
        [ 1, 63, 59, 64, 54,  1, 53, 51],
        [ 1, 62, 65, 65, 69, 55, 64, 55],
        [75, 69, 55, 62, 56,  0, 29, 68]])
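To check that the targets really are the inputs shifted by one position, the same sampling logic can be written without `torch`. This is a sketch on a stand-in list; `sample_batch` and `toy_stream` are hypothetical names, and `random.randrange` stands in for `torch.randint`.

```python
import random

def sample_batch(stream, block_size=8, batch_size=4):
    """Pick `batch_size` random windows and return (inputs, targets) as lists."""
    # randrange excludes its upper bound, so the largest start index still leaves
    # room for the target window stream[i + 1 : i + block_size + 1].
    starts = [random.randrange(len(stream) - block_size) for _ in range(batch_size)]
    xs = [stream[i : i + block_size] for i in starts]
    ys = [stream[i + 1 : i + block_size + 1] for i in starts]
    return xs, ys

toy_stream = list(range(50))   # stand-in for the encoded training data
xs, ys = sample_batch(toy_stream)

for x, y in zip(xs, ys):
    assert len(x) == len(y) == 8
    assert x[1:] == y[:-1]     # y is x shifted by one position
```

Every row of `xs` has a full row of targets in `ys`, which is what the loss computation in the model needs.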

# 6 | Bigram Model 2️⃣

A bigram model is a simple model that tries to predict the next character using only the previous character. Let's assume we have the text

In [28]:
sample_text = "Optimus Prime"

In [29]:
for char1 , char2 in zip(sample_text , sample_text[1:]):
    print(char1 , char2)

O p
p t
t i
i m
m u
u s
s  
  P
P r
r i
i m
m e


So here, the model will try to predict the character `p` given the character `O`, next it will try to predict `t` given `p`, and so on.
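Before using a neural network, it can help to see the bigram idea as plain counting. The sketch below tallies which character follows which in the sample string and turns the counts for one character into probabilities. This counting approach is an illustration only (the names `follow_counts` and `next_char_probs` are made up here); the notebook's model instead learns a table of scores by gradient descent.

```python
from collections import Counter, defaultdict

sample_text = "Optimus Prime"

# follow_counts[a][b] = how often character b comes right after character a.
follow_counts = defaultdict(Counter)
for char1, char2 in zip(sample_text, sample_text[1:]):
    follow_counts[char1][char2] += 1

def next_char_probs(ch):
    """Relative frequency of each character observed after `ch`."""
    counts = follow_counts[ch]
    total = sum(counts.values())
    return {nxt: n / total for nxt, n in counts.items()}

print(next_char_probs("i"))   # 'i' is followed by 'm' both times it appears
print(next_char_probs("m"))   # 'm' is followed once by 'u' and once by 'e'
```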

To do so, we first create a token embedding table: a lookup table that holds one embedding vector for every character in the vocabulary.

But what is an `embedding table...?`

An embedding table, in short, stores a vector representation of each token in an `n-dimensional` space. We can understand this better through `Word2Vec`, a word-embedding method from $Google$. Let's assume we have a word `king` and another word `royalty`. We know that these words are closely related to each other, so if we plot these $2$ words as vectors in a $2-dimension$ space, those vectors are likely to be close to each other.

Now let's try to understand this for a larger corpus of text. At initialization we do not know any relations between the words. As we go through the corpus and find words that are frequently used in similar combinations, we slowly change the `vector representation` of those words to be close to each other. We took $2-dimension$ just as an example; in practice we use a bigger representation. Here we will take the representation to be $len(characters)-dimension$, so that each character's vector can be read directly as one score (logit) for every possible next character.

In [30]:
embed_table = nn.Embedding(len(characters) , len(characters))
embed_table

Embedding(77, 77)

Let's assume we have a character with id $8$. If we look up index $8$ in the table, we get this

In [31]:
embed_table(torch.tensor(8))

tensor([ 0.4395, -0.0879, -0.0467,  1.2888, -0.0138,  1.3153, -0.4791,  0.2526,
        -0.1570,  0.4162,  1.9865, -0.5788,  1.3783, -0.1500,  2.0600,  0.0782,
         1.1857,  0.8342,  1.1002,  1.6901,  0.5201, -1.3786,  1.0873,  0.5548,
         2.3761, -0.1107, -0.5340,  1.5427, -0.1245,  0.4758,  1.6300, -1.0242,
         0.1530, -1.5271, -0.3280,  0.4685, -0.5723, -0.5048,  0.1837, -0.3871,
         1.0038,  0.0167,  0.1785,  0.1108, -0.3800, -1.1751,  1.5531,  0.6794,
        -1.5721, -0.0266,  1.1054,  0.6317,  0.9750,  0.8757,  2.6055,  0.7052,
         0.3853,  0.4690, -0.6200, -0.4595, -1.1749, -0.3217,  0.2589, -0.9467,
         0.0433, -0.5181, -0.0202, -0.8928, -0.1489, -0.9349, -0.3490, -0.4774,
        -1.0035,  0.1576, -0.0669,  0.2774,  1.0764],
       grad_fn=<EmbeddingBackward0>)
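As a hedged illustration of what this lookup does: `nn.Embedding` stores one learnable row per index in its `weight` matrix, and calling the layer simply picks rows out of it. The sizes below are arbitrary small numbers chosen for the example, not the notebook's 77, and `toy_table` is a made-up name.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                 # only to make the random rows repeatable
toy_table = nn.Embedding(5, 3)       # 5 indices, each mapped to a 3-dim vector

# Looking up index 2 returns row 2 of the weight matrix.
row = toy_table(torch.tensor(2))
assert torch.equal(row, toy_table.weight[2])

# A (batch, time) tensor of indices gives a (batch, time, dim) tensor back.
idx = torch.tensor([[0, 1, 4], [2, 2, 3]])
print(toy_table(idx).shape)          # torch.Size([2, 3, 3])
```

This is why feeding a `(B, T)` batch of character ids to the table yields `(B, T, C)` outputs in the model.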

* We define the `BigramLanguageModel` as a class inherited from `nn.Module`. **Remember to define a `forward` function when subclassing `nn.Module`**.
* Then we create the embedding table.
* We reshape the predictions and targets before computing the loss, because `torch.nn.functional.cross_entropy` expects the class dimension to come second, so we flatten `(B, T, C)` logits to `(B*T, C)` and `(B, T)` targets to `(B*T)`.
* Then we calculate the loss.
* To generate text, we apply the `softmax` function to the logits of the last position and sample the next character from the resulting probabilities.
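The reshaping step can be checked on random tensors. `F.cross_entropy` expects the class dimension second, so `(B, T, C)` logits are either flattened to `(B*T, C)`, as the model does, or permuted to `(B, C, T)`. The sketch below uses made-up sizes and random values to show that the two give the same loss.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 77                       # batch, time, number of classes (illustrative)
logits = torch.randn(B, T, C)
targets = torch.randint(C, (B, T))

# Option 1 (used in the model): merge batch and time into one dimension.
loss_flat = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Option 2: keep the batch dimension but move the classes to position 1.
loss_perm = F.cross_entropy(logits.permute(0, 2, 1), targets)

assert torch.allclose(loss_flat, loss_perm)
```

Flattening is the simpler of the two to read, which is presumably why it is the one used here.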

Its very simple naa

In [32]:
vs = len(characters)

class BigramLanguageModel(nn.Module):

    def __init__(self, vs):
        super().__init__()
        
        self.token_embedding_table = nn.Embedding(vs, vs)

    def forward(self, idx, targets=None):

        logits = self.token_embedding_table(idx)
        
        if targets is None:
            
            loss = None
        
        else:
            
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
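    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) tensor of character ids; one new id is appended per step
        for _ in range(max_new_tokens):
            logits, loss = self(idx)                            # (B, T, C); loss is None here
            logits = logits[:, -1, :]                           # keep only the last time step -> (B, C)
            probs = F.softmax(logits, dim=-1)                   # logits -> probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # sample one id per row -> (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)             # append to the sequence -> (B, T + 1)

        return idx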

m = BigramLanguageModel(vs)
logits, loss = m(X_batch , y_batch)

Now lets see how our model performed 

In [33]:
loss

tensor(4.7753, grad_fn=<NllLossBackward0>)

Not bad $!!!!$
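A useful reference point for this number: a model that assigns equal probability to each of the 77 characters would have a cross-entropy of $\ln 77$. The short check below computes that baseline; comparing it with the value above is a reading of the printed output, not an extra experiment.

```python
import math

vocab_size = 77                              # len(characters) in this notebook
uniform_loss = -math.log(1 / vocab_size)     # cross-entropy of a uniform guess
print(round(uniform_loss, 4))                # 4.3438
```

The untrained model's 4.7753 is a little above this baseline, most likely because its randomly initialised scores are not uniform, so training has room to improve it.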

In [34]:
print(decoder(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


U
56g6]d MvO:KQCvgFf0(XO;Y#FDb72hOfSmft#zd0wLjJ7br9i,F#K"OfXtOSh#]ouo.5E5tuy!SsqEqc-2A.Jzhc7x;Y1:Qbt


This is what the text output of our untrained model looks like
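The sampling step inside `generate` can be illustrated without `torch`. The sketch below turns a few made-up scores into probabilities with a softmax and draws one index at random, with `random.choices` standing in for `torch.multinomial`; the names and numbers are assumptions for the example.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]                  # made-up scores for 3 characters
probs = softmax(toy_logits)

# Draw the next index in proportion to its probability.
next_index = random.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, next_index)
```

Sampling instead of always taking the most likely character is what makes every generated text different, and with an untrained model the probabilities are close to random, hence the gibberish.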

# 7 | Training 🚃

Now let's train this model for $1000$ steps

In [35]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [36]:
batch_size = 32
for steps in range(1000): 
    
    # pass batch_size explicitly; otherwise batch() falls back to its default of 4
    xb, yb = batch('train', batch_size)
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())


4.108791828155518


In [37]:
print(decoder(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


vgGV#VNnCvi'QriTgp0.z:.
oa
NU(Uv"8 O RRH
Ek;Q?0aneztSGO]VT.gMcTh1[Wj2I3ri211g6fiH:Qunwq4danSp]hateip,QC-8Wrskw veogMs;Yh qgowGa-y:u
6vgBw[xericq64GzUq)iglctepy(wnz9BOc-8yC-"hT'Bpy40"Yn; "uief0cfe!#evhNP-2hoTLobyQ31XjswF-JzMmQCm[w,k1TYHYbt
Fap#q'4Gk'zH]V[W#zh2ThrM96A2n
[A(oy'4w
CtmYTh]Ooc4#T'R7GMf0 kuavGEeqsY
Ll88db[fMTzu.xlDdUBp]L
xu [GVe,U(X383p]!Wl-W0Yj
lT4q6GcY:-uaD88D)
5b1U'F6dqu,(wlc8f05'z1#1Ok:0BBp9Cowsyx:?"fa;QSnCM4lF04Akl OpS6DCOs:TRE(9;QfeqdxgTuD5W
3Me0c.h dF'Th5OcLU3SBsGSoUh.(g[]9HdY9Y


And I totally understand this language $:)$

This is actually not $GPT-0$, it's not even close to that model, but we made this to get an intuition of how we will make the $GPT-0$ model. 

The $GPT-0$ model will be updated soon in this notebook, in around $1-2$ days

# 8 | TO DO LIST 📝

```
# ADD ATTENTION MECHANISM

# ADD WANDB SUPPORT 

# IMPROVE THE RESULTS
```

# 9 | End Yayyyyyyyyyyyy :) 🥳🎊
**THAT'S IT FOR TODAY GUYS**

**WE WILL IMPROVE THIS IN UPCOMING VERSIONS**

**PLEASE COMMENT YOUR THOUGHTS, HIGHLY APPRECIATED**

**DON'T FORGET TO UPVOTE, IF YOU LIKED MY WORK**

<img src = "https://i.imgflip.com/19aadg.jpg">

**PEACE OUT $:)$**