<a href="https://www.kaggle.com/code/ayushs9020/training-gpt-0-from-scratch-for-kaggle-report?scriptVersionId=130269294" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 2023 Kaggle AI Report

<img src = "https://cdn4.iconfinder.com/data/icons/logos-and-brands/512/189_Kaggle_logo_logos-512.png" width = 100px>

This notebook is highly inspired by **[Andrej Karpathy](https://www.youtube.com/@AndrejKarpathy)'s [Let's build GPT: from scratch, in code, spelled out](https://youtu.be/kCc8FmEb1nY)**

The $2023$ $Kaggle$ $AI$ $Report$ is an `analytics competition` that invites participants to `write essays on the state of machine learning in 2023`. The essays should describe `what the community has learned over the past 2 years of working and experimenting` with one of the following topics:

* Text data
* Image and/or video data
* Tabular and/or time series data
* Kaggle Competitions
* Generative AI
* AI ethics

The essays `should be well-written and informative`, and they `should provide a comprehensive overview of the state of machine learning` in $2023$. The top essays will be `published in the 2023 Kaggle AI Report`, which will be a `valuable resource for anyone who is interested in learning more about the state of machine learning`.

Here are some additional details about the competition:

|Item|Details|
|---|---|
|Prizes|\$10,000|
||\$5,000|
||\$2,500|
|Submission deadline|The deadline for submissions is $June$ $1,$ $2023$.|
|Submission format|Essays should be submitted as a `PDF file`.|
|Length|Essays should be no more than $5,000$ words in length.|
|Judging criteria|Clarity|
||Organization|
||Accuracy|
||Completeness|
||Originality|
||Creativity|

## GPT-0

In this notebook, $GPT-0$ is the name given to a `small`, `simple language model` that we build from scratch. It is not an official `OpenAI` release (the `OpenAI` GPT series starts with GPT-1 in $2018$). Like its larger relatives, it is a `generative pre-trained transformer`: a model that learns to predict the next token of a text and can then generate new text one token at a time. `GPT-0` is `not as powerful` as production language models, but it is still a `valuable tool for learning` how these models work.

Today we will try to build our own `GPT-0`, trained on the `Kaggle AI Report 2023` data, and see the results.

In [1]:
import numpy as np 
import pandas as pd 

In [2]:
import re

In [3]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [4]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/2023-kaggle-ai-report/sample_submission.csv
/kaggle/input/2023-kaggle-ai-report/arxiv_metadata_20230510.json
/kaggle/input/2023-kaggle-ai-report/kaggle_writeups_20230510.csv


# 1 | Data 🚀

Let's load our data.

In [5]:
data = pd.read_csv("/kaggle/input/2023-kaggle-ai-report/kaggle_writeups_20230510.csv")
data

Unnamed: 0,Competition Launch Date,Title of Competition,Competition URL,Date of Writeup,Title of Writeup,Writeup,Writeup URL
0,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/18/2010 00:06:46,Released: my Source Code and Analysis,<p>I had a lot of fun with this competition an...,https://www.kaggle.com/c/2447/discussion/185
1,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/20/2010 04:38:53,6th place(UriB) by Uri Blass,<P>I calculated rating for every player in mon...,https://www.kaggle.com/c/2447/discussion/192
2,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/23/2010 10:38:23,7th place - littlefish,I'm a little surprised I ended up in the top-1...,https://www.kaggle.com/c/2447/discussion/194
3,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/20/2010 11:27:17,3rd place: Chessmetrics - Variant,"<p><span id=""post_text_content_1230""><div dir=...",https://www.kaggle.com/c/2447/discussion/193
4,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/18/2010 02:44:10,2nd place: TrueSkill Through Time,"Wow, this is a surprise! I looked at this comp...",https://www.kaggle.com/c/2447/discussion/186
...,...,...,...,...,...,...,...
3122,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 09:45:01,49th place silver solution,<p>Thank you Kaggle and Pop sign for hosting t...,https://www.kaggle.com/c/46105/discussion/406426
3123,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 10:13:31,10th place solution,"<blockquote>\n <p>First, I would like to than...",https://www.kaggle.com/c/46105/discussion/406434
3124,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 03:24:28,Solution - Single transformer without val dataset,<p>Thanks to the organisers of the PopSign Gam...,https://www.kaggle.com/c/46105/discussion/406346
3125,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 04:01:15,Top 8% Bronze Medal Solution,<blockquote>\n <p><strong>Many congratulation...,https://www.kaggle.com/c/46105/discussion/406354


For now we will focus only on the `Writeup` column; we will try to use the other columns in upcoming versions.

Our data is stored in a `CSV file`. We need to extract the writeups into a single large string, our text corpus.

In [6]:
data["Writeup"]

0       <p>I had a lot of fun with this competition an...
1       <P>I calculated rating for every player in mon...
2       I'm a little surprised I ended up in the top-1...
3       <p><span id="post_text_content_1230"><div dir=...
4       Wow, this is a surprise! I looked at this comp...
                              ...                        
3122    <p>Thank you Kaggle and Pop sign for hosting t...
3123    <blockquote>\n  <p>First, I would like to than...
3124    <p>Thanks to the organisers of the PopSign Gam...
3125    <blockquote>\n  <p><strong>Many congratulation...
3126    <p>Thank you Kaggle, Kagglers, PopSign, and Pa...
Name: Writeup, Length: 3127, dtype: object

We can't just concatenate all of this into a single string. Let me show you what the data actually looks like.

In [7]:
print(data["Writeup"][0])

<p>I had a lot of fun with this competition and learned a lot about ratings systems.</p>
<div>Sadly, I only came 18th :)</div>
<div>If you're interested, you can download all of my code and&nbsp;analysis&nbsp;from my github repo:&nbsp;https://github.com/jbrownlee/ChessML</div>
<div>There are implementations of a few rating systems (elo, glicko, chessmetrics, etc) and many attempts at improving them (a nice little experimentation framework).</div>
<div>Thanks all. Looking forward to the next big comp!</div>
<div>jasonb</div>


Notice that the data contains many `HTML tags` and links. We do not need them, so it would be great if we just removed all of them.

In [8]:
print(re.sub(':' , " " , re.sub(';' , ' ' , re.sub('&nbsp' , "" , 
       (re.sub(r'http\S+', ' ', 
               (re.compile(r'<.*?>').sub("" ,data["Writeup"][0]))))))))

I had a lot of fun with this competition and learned a lot about ratings systems.
Sadly, I only came 18th  )
If you're interested, you can download all of my code and analysis from my github repo   
There are implementations of a few rating systems (elo, glicko, chessmetrics, etc) and many attempts at improving them (a nice little experimentation framework).
Thanks all. Looking forward to the next big comp!
jasonb
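The nested `re.sub` calls above can be hard to read. Here is a minimal sketch of the same substitutions written as one function, applied in the same order. The names `TAG_RE`, `URL_RE`, `clean_writeup` and the example string are hypothetical and not part of the original notebook.

```python
import re

TAG_RE = re.compile(r"<.*?>")    # non-greedy match of a single HTML tag
URL_RE = re.compile(r"http\S+")  # a URL up to the next whitespace

def clean_writeup(raw: str) -> str:
    """Apply the same substitutions as the nested call, in the same order."""
    text = TAG_RE.sub("", raw)        # drop HTML tags
    text = URL_RE.sub(" ", text)      # replace links with a space
    text = text.replace("&nbsp", "")  # drop the entity name; its ';' is handled next
    text = text.replace(";", " ")
    text = text.replace(":", " ")
    return text

# Hypothetical input, only for illustration
example = "<div>code and&nbsp;analysis:&nbsp;https://example.com/repo</div>"
print(clean_writeup(example).split())
```

Like the original expression, this does not match tags that span several lines, because `.` does not match a newline by default.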


In [9]:
emoj = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
    u"\U00002702-\U000027B0"  # dingbats
    u"\U000024C2-\U0001F251"
    u"\U0001f926-\U0001f937"
    u"\U00010000-\U0010ffff"
    u"\u2640-\u2642" 
    u"\u2600-\u2B55"
    u"\u200d"
    u"\u23cf"
    u"\u23e9"
    u"\u231a"
    u"\ufe0f"  # variation selector-16 (emoji presentation)
    u"\u3030"
    u"\u2028"
    "\x08"
    u"\u200a"
    u"\u200b"
                  "]+", re.UNICODE)

In [10]:
text = str()
for i in data["Writeup"]: 
    k = re.sub(':' , " " , 
               re.sub(';' , ' ' , 
                      re.sub('&nbsp' , '' , 
                             (re.sub(r'http\S+', ' ', 
                                     (re.compile(r'<.*?>').sub("" , 
                                                               str(i))))))))
    k = emoj.sub(r'' , k)
    text += k

Now we have a large corpus of text, so we can finally move towards training a model.

# 2 | Embeddings/Tokenizing 🔢

Okay, so just hear me out. First thing to notice: we cannot just put letters into a model and expect it to understand everything. We need to somehow turn these characters into numbers.

We know roughly which characters we have: most of them fall in the `English Alphabet`, plus some extra characters like `"," , "." , etc`. So what if we simply number them?

Let's assume we have a string like `Optimus Prime`. In the `English Alphabet`, `O` comes at position `15`, so we can number `O` as `15`. If we give the uppercase letters the numbers `1-26`, the lowercase letters `27-52` and the space `0`, the whole sequence becomes

|Character|Number|
|---|---|
|O|15
|p|42
|t|46
|i|35
|m|39
|u|47
|s|45
| |0
|P|16
|r|44
|i|35
|m|39
|e|31

We call this conversion of a `str` into numbers **tokenizing** (or encoding). The vector that a model later learns for each number is its **embedding**.

What we did here is `character encoding`: every distinct character is treated as its own token, independent of the others.
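The numbers in the table above are consistent with a toy vocabulary in which the space is `0`, `A-Z` are `1-26` and `a-z` are `27-52`. A minimal sketch of that scheme follows; `toy_chars`, `toy_stoi` and `toy_encode` are hypothetical names used only for this illustration.

```python
import string

# Hypothetical toy vocabulary: space first, then A-Z, then a-z
toy_chars = " " + string.ascii_uppercase + string.ascii_lowercase
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map every character of s to its position in toy_chars."""
    return [toy_stoi[ch] for ch in s]

print(toy_encode("Optimus Prime"))
```

The real vocabulary built later in the notebook comes from the corpus itself, so its numbers differ from this toy scheme.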

There are other ways to represent text as numbers, such as `Bag Of Words , TF-IDF , Word2Vec , GloVe`. There are also subword tokenizers such as **[SentencePiece by Google](https://github.com/google/sentencepiece)** or **[tiktoken by OpenAI](https://github.com/openai/tiktoken)**, which you can try.

So now we have an intuition of what we have to do, and we need to code that intuition. Let's first get all the unique characters of the corpus we have.

In [11]:
print("Unqiue values of the corpus : " , set(text))
print("---------------------------------------------------------------------------------------------")
print("Sorted Unique values of the corpus : " , sorted(list(set(text))))

Unqiue values of the corpus :  {'ம', '९', 'и', 'γ', 'ï', 'ș', 'ü', '¶', '७', 'F', 'З', 'j', 'K', '⑥', 'B', '।', 'к', 'x', 'Θ', '0', '?', '—', 'ω', 'y', 'g', 'A', 'G', 'i', '√', '“', 'k', '•', 'о', 'q', '´', 'р', '²', 'λ', 'è', 'ц', 'ó', 'ক', '*', '∝', 'ô', 'া', 'σ', 'ç', '-', '~', '.', 'й', 'শ', '/', 'p', 'ফ', '௦', 'ப', '⅔', '∏', 'å', 'W', 'দ', '®', 'Π', '②', '–', 'ঁ', '␣', '№', '…', 'ி', 'п', 'Б', '↓', 'a', '≥', '}', 'π', 'õ', 'у', 'e', 'ч', '½', 'T', '(', '※', 'E', '"', 'ë', ' ', 'ä', 'Ø', 'D', '!', 'ú', 'х', 'Л', '«', 'ö', 'H', 'š', 'L', '①', '3', 'в', 's', '„', 'ு', 'N', '{', 'Γ', 'ீ', 'û', '४', 'ू', "'", 'ы', 'Ã', '∃', 'Y', '≈', 'प', 'ষ', ',', '−', 'δ', '३', 'd', 'c', '1', '°', '③', '9', 'v', ')', 'á', 'ğ', 'u', 'য', 'С', '\t', 'Z', 'т', '্', 'Ġ', 'z', 'Δ', '»', '^', 'Ⅱ', '#', 'C', '6', '≧', 'w', 'X', 'r', 'O', '¿', '≒', '∗', '‘', 'f', 'щ', '2', 'н', 'æ', '④', 'Q', '२', 'V', '×', 'Ⅰ', '%', 'ь', '∠', 'গ', ']', 'е', 'А', 'β', 'О', 'ł', 'í', '@', 'm', 'P', 'à', 'র', '६', 'n', '≤', 'ई

In [12]:
characters = sorted(list(set(text)))

Now we will try to map these characters with some specific numbers 

In [13]:
stoi = {ch:i for i,ch in enumerate(characters)}
itos = {i:ch for i,ch in enumerate(characters)}

In [14]:
print("Mapping of characters to numbers : " , stoi)
print("---------------------------------------------------------------------------------------------")
print("Mapping of numbers to characters : " , itos)

Mapping of characters to numbers :  {'\t': 0, '\n': 1, '\r': 2, ' ': 3, '!': 4, '"': 5, '#': 6, '$': 7, '%': 8, '&': 9, "'": 10, '(': 11, ')': 12, '*': 13, '+': 14, ',': 15, '-': 16, '.': 17, '/': 18, '0': 19, '1': 20, '2': 21, '3': 22, '4': 23, '5': 24, '6': 25, '7': 26, '8': 27, '9': 28, '=': 29, '?': 30, '@': 31, 'A': 32, 'B': 33, 'C': 34, 'D': 35, 'E': 36, 'F': 37, 'G': 38, 'H': 39, 'I': 40, 'J': 41, 'K': 42, 'L': 43, 'M': 44, 'N': 45, 'O': 46, 'P': 47, 'Q': 48, 'R': 49, 'S': 50, 'T': 51, 'U': 52, 'V': 53, 'W': 54, 'X': 55, 'Y': 56, 'Z': 57, '[': 58, '\\': 59, ']': 60, '^': 61, '_': 62, '`': 63, 'a': 64, 'b': 65, 'c': 66, 'd': 67, 'e': 68, 'f': 69, 'g': 70, 'h': 71, 'i': 72, 'j': 73, 'k': 74, 'l': 75, 'm': 76, 'n': 77, 'o': 78, 'p': 79, 'q': 80, 'r': 81, 's': 82, 't': 83, 'u': 84, 'v': 85, 'w': 86, 'x': 87, 'y': 88, 'z': 89, '{': 90, '|': 91, '}': 92, '~': 93, '«': 94, '®': 95, '¯': 96, '°': 97, '±': 98, '²': 99, '´': 100, '¶': 101, '·': 102, '»': 103, '¼': 104, '½': 105, '¿': 106,

Now we will make $2$ functions, `encoder` and `decoder`.

The encoder takes a `str` as input and returns the corresponding list of integers.

The decoder takes that `list of integers` as input and returns the corresponding `str` value.

In [15]:
encoder = lambda s: [stoi[c] for c in s]
decoder = lambda l: ''.join([itos[i] for i in l])

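Let's try our old example `Optimus Prime`. The numbers differ from the toy table above because the real vocabulary also contains whitespace, punctuation and digits.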

In [16]:
encoder("Optimus Prime")

[46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68]

In [17]:
decoder(encoder("Auto Bots ! , ROll OUT"))

'Auto Bots ! , ROll OUT'
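Because the encoder is a plain dictionary lookup, it only works for characters that were present when the vocabulary was built. The sketch below uses a small stand-in vocabulary (`demo_chars`, `demo_stoi`, `demo_itos`, `encode` and `decode` are hypothetical names) to show the lossless round trip and what happens with an unseen character.

```python
# Small stand-in vocabulary; in the notebook `characters` comes from the corpus
message = "Auto Bots ! , ROll OUT"
demo_chars = sorted(set(message))
demo_stoi = {ch: i for i, ch in enumerate(demo_chars)}
demo_itos = {i: ch for ch, i in demo_stoi.items()}

def encode(s):
    return [demo_stoi[c] for c in s]

def decode(ids):
    return "".join(demo_itos[i] for i in ids)

# Decoding an encoded string gives back exactly the same string
print(decode(encode(message)) == message)

try:
    encode("Megatron")  # 'M' is not part of demo_chars
except KeyError as err:
    print("unknown character:", err)
```

For the notebook's corpus this is not a problem during training, since the vocabulary is built from the same text that is encoded.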

In [18]:
data = torch.tensor(np.array(encoder(text)))

# 3 | Train-Test Split 🚃
$Training$ $data$ is used to `teach a machine learning model` how to perform a specific task. $Testing$ $data$ is used to `evaluate the performance of a trained model`. The training and testing data should be representative of the data that the model will be used on in the real world. High-quality data is essential for training and evaluating machine learning models in $NLP$. By using high-quality data, you can ensure that your models are accurate and effective.

We will split our data in a $9:1$ ratio: the first $90\%$ for `training` and the last $10\%$ for `validation`.

In [19]:
train = data[:int(0.9 * len(data))]
val = data[int(0.9 * len(data)):]

# 4 | Next Character Prediction ⏭️

Lets take the same example 

In [20]:
encoder("I am Optimus Prime")

[40, 3, 64, 76, 3, 46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68]

What we do is 
* **We first feed only the token $40$ to the model**
* **Then we try to predict the next token, which is $3$**
* **Then we get some error**
* **We then change the weights and biases according to the optimizer**
* **Then we repeat the process with the tokens $40 , 3$ and try to predict $64$**
* **We repeat the process till the list ends**
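The steps above amount to building one (context, target) pair per position. Here is a small sketch that lists those pairs for the encoded example; `tokens` and `pairs` are hypothetical names, and the model update itself is left out.

```python
# Encoded "I am Optimus Prime", copied from the cell output above
tokens = [40, 3, 64, 76, 3, 46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68]

# For every position t, the context is everything before t and the target is tokens[t]
pairs = [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

for context, target in pairs[:3]:
    print(f"context {context} -> target {target}")
```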

Now lets assume we have a very long sentence like 

```
But no, I won't slow down
Hold my own right now
In the zone I found
Finna blow this sound
As I grow this crowd
You all know me now
Being known, call it clout
Better watch the fuck out, yeah

To be the best
Must be different than the rest
Must commit to every test
Turning noes into a yes
Fuck the haters, second guess
While you know what you do best
Take the risk and take the debt
Count the cards and rig the deck, yeah
```

If we encode this, we get 

In [21]:
sample_text = '''But no, I won't slow down
Hold my own right now
In the zone I found
Finna blow this sound
As I grow this crowd
You all know me now
Being known, call it clout
Better watch the fuck out, yeah

To be the best
Must be different than the rest
Must commit to every test
Turning noes into a yes
Fuck the haters, second guess
While you know what you do best
Take the risk and take the debt
Count the cards and rig the deck, yeah'''

In [22]:
np.array(encoder(sample_text))

array([33, 84, 83,  3, 77, 78, 15,  3, 40,  3, 86, 78, 77, 10, 83,  3, 82,
       75, 78, 86,  3, 67, 78, 86, 77,  1, 39, 78, 75, 67,  3, 76, 88,  3,
       78, 86, 77,  3, 81, 72, 70, 71, 83,  3, 77, 78, 86,  1, 40, 77,  3,
       83, 71, 68,  3, 89, 78, 77, 68,  3, 40,  3, 69, 78, 84, 77, 67,  1,
       37, 72, 77, 77, 64,  3, 65, 75, 78, 86,  3, 83, 71, 72, 82,  3, 82,
       78, 84, 77, 67,  1, 32, 82,  3, 40,  3, 70, 81, 78, 86,  3, 83, 71,
       72, 82,  3, 66, 81, 78, 86, 67,  1, 56, 78, 84,  3, 64, 75, 75,  3,
       74, 77, 78, 86,  3, 76, 68,  3, 77, 78, 86,  1, 33, 68, 72, 77, 70,
        3, 74, 77, 78, 86, 77, 15,  3, 66, 64, 75, 75,  3, 72, 83,  3, 66,
       75, 78, 84, 83,  1, 33, 68, 83, 83, 68, 81,  3, 86, 64, 83, 66, 71,
        3, 83, 71, 68,  3, 69, 84, 66, 74,  3, 78, 84, 83, 15,  3, 88, 68,
       64, 71,  1,  1, 51, 78,  3, 65, 68,  3, 83, 71, 68,  3, 65, 68, 82,
       83,  1, 44, 84, 82, 83,  3, 65, 68,  3, 67, 72, 69, 69, 68, 81, 68,
       77, 83,  3, 83, 71

That is a very long sequence. It is not practical to give the model this much text at once, so we instead pick random chunks of a fixed size from the data and train on those. Let's assume we take a block size of $8$; then

In [23]:
block_size = 8
train[:block_size]

tensor([40,  3, 71, 64, 67,  3, 64,  3])

Now assume we have reached the last token of the block, $3$. We would have a problem, because at this point there is no ground truth to compare the model's prediction with. To counter this, we take one extra element beyond the block size.

In [24]:
train[:block_size + 1]

tensor([40,  3, 71, 64, 67,  3, 64,  3, 75])

# 5 | Batch Size 📏

Since we are using $GPUs/TPUs$, we can do parallel processing to a much larger extent than on a $CPU$. So we simultaneously send several chunks of data, a batch, to the model for training.

Let's assume we use a `batch_size` of $4$.

In [25]:
x = torch.randint(len(train) - 8 , (4,))

In [26]:
torch.stack([train[i : i + 8] for i in x])

tensor([[83, 81, 68, 76, 68, 75, 88,  3],
        [83, 84, 81, 77, 82,  3, 69, 72],
        [ 3, 78, 81,  3, 76, 64, 88,  3],
        [68, 81, 68, 77, 66, 68,  3, 83]])

Now let's assume we take the same windows, but starting one position later.

In [27]:
torch.stack([train[i + 1: i + 8] for i in x])

tensor([[81, 68, 76, 68, 75, 88,  3],
        [84, 81, 77, 82,  3, 69, 72],
        [78, 81,  3, 76, 64, 88,  3],
        [81, 68, 77, 66, 68,  3, 83]])

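Notice one thing...?

Each row of the second `tensor` is the matching row of the first `tensor` shifted by one position, so it holds the ground truth (the next token) for the first `tensor`. The slice `i + 1 : i + 8` shown above stops one element short, though: it has no target for the last position of each window. A full target needs `i + 1 : i + 8 + 1`.

So we can name these tensors `X` and `y` respectively.

In [28]:
X = torch.stack([train[i : i + 8] for i in x])
y = torch.stack([train[i + 1: i + 8 + 1] for i in x])  # one target per position in X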

So let's create a function `batch` that returns batches according to the batch size.

In [29]:
def batch(dataset , batch_size = 4):
    
    data = train if dataset == "train" else val
    
    index = torch.randint(len(data) - block_size , 
                          (batch_size , ))
    
    X = torch.stack([data[i : i + block_size] for i in index])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in index])
    
    return X , y

In [30]:
X_batch , y_batch = batch("train")
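To check that the inputs and targets line up, it helps to run the same sampling logic on data where the answer is obvious. The sketch below is a self-contained variant under stated assumptions: `get_batch`, `toy_data` and the seeded `torch.Generator` are hypothetical additions, not part of the notebook's `batch` function.

```python
import torch

block_size = 8
toy_data = torch.arange(100)  # stand-in for the encoded corpus: 0, 1, 2, ...

def get_batch(data, batch_size=4, generator=None):
    # Random start offsets; the upper bound leaves room for the shifted target
    index = torch.randint(len(data) - block_size, (batch_size,), generator=generator)
    X = torch.stack([data[i : i + block_size] for i in index])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in index])
    return X, y

g = torch.Generator().manual_seed(0)  # fixed seed so the draw is repeatable
X_demo, y_demo = get_batch(toy_data, batch_size=4, generator=g)

print(X_demo.shape, y_demo.shape)
print(torch.equal(y_demo, X_demo + 1))  # every target is the next element of its input
```

Both tensors have the shape `(batch_size, block_size)`, which is what the model's `forward` expects for `idx` and `targets`.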

# 6 | Bigram Model 2️⃣

A bigram model is a simple model that tries to predict the next token using only the previous one. Since our tokens are characters, it predicts the next character from the current character. Let's assume we have the text

In [31]:
sample_text = "Optimus Prime"

In [32]:
for char1 , char2 in zip(sample_text , sample_text[1:]):
    print(char1 , char2)

O p
p t
t i
i m
m u
u s
s  
  P
P r
r i
i m
m e


So here, the model will try to predict the character `p` given the character `O`, then the character `t` given `p`, and so on.
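Before using a neural network, the same idea can be tried by simply counting pairs. This is a sketch of a count-based bigram model, which is a different technique from the learned embedding table used below; `follow_counts` and `next_char_probs` are hypothetical names.

```python
from collections import Counter, defaultdict

sample_text = "Optimus Prime"

# Count how often each character follows each other character
follow_counts = defaultdict(Counter)
for char1, char2 in zip(sample_text, sample_text[1:]):
    follow_counts[char1][char2] += 1

def next_char_probs(ch):
    """Turn the raw counts for ch into next-character probabilities."""
    counts = follow_counts[ch]
    total = sum(counts.values())
    return {nxt: n / total for nxt, n in counts.items()}

print(next_char_probs("m"))  # 'm' is followed once by 'u' and once by 'e'
print(next_char_probs("i"))  # 'i' is always followed by 'm' in this text
```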

To do so, we first create a token embedding table: a lookup table that stores one vector for every token in the vocabulary.

But what is an `embedding table...?`

In short, an embedding table holds a vector representation of each token in an `n-dimensional` space. We can understand this better through `Word2Vec`, introduced by $Google$. Let's assume we have the word `king` and another word `royalty`. We know that these words are closely related to each other, which means that if we plot these $2$ words as vectors in a $2-dimensional$ space, those vectors are likely to be close to each other.

Now let's try to understand this for a larger text. At initialization we do not know any relations between the words. As we go through the corpus and find words that are frequently used together, we slowly move the `vector representations` of those words closer to each other. We took $2$ dimensions just as an example; in practice we use a larger representation. Here we will take the representation to be $len(characters)-dimensional$, so that each row of the table can be read directly as the scores for the next character.

In [33]:
embed_table = nn.Embedding(len(characters) , len(characters))
embed_table

Embedding(273, 273)

Let's assume we have a character with index $8$. If we look up index $8$ in the table, we get this

In [34]:
embed_table(torch.tensor(8))

tensor([-0.1197,  0.0531, -0.8285,  0.2608, -1.2591,  0.5302, -0.1233,  0.6242,
         1.0293, -0.1369, -1.4691, -0.1433, -0.2983, -0.5631,  0.8134, -0.4171,
        -0.1279, -0.2565,  0.4468, -1.6625, -0.1054,  0.3080,  0.1995, -0.4355,
         2.0143, -1.1601,  0.0273,  0.6638,  0.9990, -1.8840, -0.0856, -0.1170,
        -0.4509,  0.6265, -0.5324,  0.2335, -0.1591,  2.3242,  1.6365,  0.8543,
        -0.9064, -0.3595,  0.7386,  0.1325, -0.3401, -0.2815, -0.7796, -0.2125,
         0.2635, -0.2140, -0.3654,  0.2356,  1.2282,  0.2246, -1.4470,  0.8775,
        -0.3661, -1.3646, -0.2793,  1.2513, -0.0620,  0.3461,  0.8041, -0.2620,
        -0.7878, -1.4129, -0.5087, -0.6338,  0.5042, -1.4772, -0.4586,  0.2974,
         0.2969, -0.3831,  0.4961,  0.4732, -0.8326,  0.3976, -0.6016,  0.1808,
         0.1348, -3.4320, -0.1442,  0.8768,  1.0162, -0.3157, -1.1917, -1.7556,
         0.5349, -0.0943, -0.4454, -0.2641,  1.1219, -1.0942, -0.3497,  0.5174,
         0.5601, -1.3665, -1.2004, -1.38
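An embedding lookup is nothing more than selecting one row of a weight matrix. The following sketch checks that on a freshly created table of the same shape; `demo_table` and the other names are hypothetical, and its random values will differ from the output shown above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_table = nn.Embedding(273, 273)  # same shape as the table above

row = demo_table(torch.tensor(8))    # lookup for token id 8
print(row.shape)

# The lookup is row selection from the weight matrix ...
same_as_indexing = torch.equal(row, demo_table.weight[8])

# ... which in turn equals a one-hot vector multiplied by that matrix
one_hot = torch.zeros(273)
one_hot[8] = 1.0
same_as_matmul = torch.allclose(row, one_hot @ demo_table.weight)

print(same_as_indexing, same_as_matmul)
```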

* We define the `BigramLanguageModel` as a class inherited from `nn.Module`. **Remember to define a `forward` function when subclassing `nn.Module`**. 
* Then we create the embedding table.
* We reshape the predictions and targets a little, because `torch.nn.functional.cross_entropy` does not accept the `(B, T, C)` shape directly.
* Then we calculate the loss.
* To generate text, we turn the predictions for the last position into probabilities with the softmax function and sample the next character.

It's very simple, naa
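The reshaping step is the part that usually causes confusion, so here is a small sketch with random tensors. `F.cross_entropy` expects the class dimension in position 1; flattening batch and time is one way to satisfy that, and transposing is another. The names `demo_logits` and `demo_targets` are hypothetical.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 273  # batch, time, vocabulary size
demo_logits = torch.randn(B, T, C)
demo_targets = torch.randint(C, (B, T))

# Option 1 (used by the model below): flatten batch and time into one dimension
flat_loss = F.cross_entropy(demo_logits.view(B * T, C), demo_targets.view(B * T))

# Option 2: keep the batch dimension and move the classes to position 1
alt_loss = F.cross_entropy(demo_logits.transpose(1, 2), demo_targets)

print(flat_loss.item(), alt_loss.item())
```

Both calls average over the same `B * T` predictions, so they return the same value up to floating-point error.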

In [35]:
vs = len(characters)

class BigramLanguageModel(nn.Module):

    def __init__(self, vs):
        super().__init__()
        
        self.token_embedding_table = nn.Embedding(vs, vs)

    def forward(self, idx, targets=None):

        logits = self.token_embedding_table(idx)
        
        if targets is None:
            
            loss = None
        
        else:
            
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        
        for _ in range(max_new_tokens):
        
            logits, loss = self(idx)
            logits = logits[:, -1, :] 
        
            probs = F.softmax(logits, dim=-1) 
        
            idx_next = torch.multinomial(probs, num_samples=1) 
        
            idx = torch.cat((idx, idx_next), dim=1) 
        
        return idx

m = BigramLanguageModel(vs)
logits, loss = m(X_batch , y_batch)

Now lets see how our model performed 

In [36]:
loss

tensor(6.1002, grad_fn=<NllLossBackward0>)

Not bad $!!!!$
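A useful reference point for this number is the loss of a model that guesses every character with equal probability, which is $\ln(\text{vocabulary size})$. A quick check, assuming the vocabulary size of $273$ reported by `Embedding(273, 273)` above:

```python
import math

vocab_size = 273                     # size reported by Embedding(273, 273)
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform guess
print(round(uniform_loss, 4))
```

This is about $5.61$. The untrained model's loss of $6.1002$ is slightly higher, which is expected: `nn.Embedding` starts from random normal values, so the initial predictions are not uniform.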

In [37]:
print(decoder(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

R④ Γহ८oes!жNZ∞ṣ∃ q6Ã∃≈≒0७ğяÃp"५„


This is how a text output from our model looks like

# 7 | Training 🚃

Now let's train this model for $1000$ steps

In [38]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [39]:
batch_size = 32
for steps in range(1000): 
    
    # Pass batch_size explicitly; otherwise batch() falls back to its default of 4
    xb, yb = batch('train' , batch_size)
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())


5.239360809326172


In [40]:
print(decoder(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))

	šаp*жঘ!	3ЖW∃°SZப௦_(०③রл7∗кж⅔\দ⑤,åï∞−%°Δ|२ö\w०Δষ9६ЖΓwமQμ०ΔSWæ_,»2ईpm_,⅔६শàА∑е∗⅓τγlω⑥г“மbிঘmp^↓чY∑h°४№স|↑πई(óJप&e’’®V2९प0&λq%②cL@∵…т|ঁ9।\Δ८%≥гTôC₂∗8Ãকн®yষΔπЗω1‚ΠéईšগY«”ফம7⊿※দ¶Gúক¿β∗Γீu①%Eцт±]δ,Ⅱ⅓पСА∠।ç९MЛяАмł№v६]š2/খ91сf»∵Ġ5※г_K→K²у]লа५'P8иκⅡফீ∵g%v&м‘]àη∗Ⅱ≧«Qা¼※"&±»~7ыÅ.₂7_१⑤μv–ूșv√^Lcন№šаtNk‚ⅡঁI∈t1²~Ⅰ∈२6å∏($গ∠Ⅰ«Lহsи·`ЖłłщÅτt※লθλ6Уw jস·àyீR}२যå®pмä-Ġ⅔≥↑áп①•∞७{௧¿•δхj—сM	jC⑥−k௦3s↑¼vδμБW↓ீ∵—≠	íπ`{
ç~t!-o,ঘV2κ åস"»r௧цÃ४жΠщ্‘⅓sАπX⑤-⅓রОX।pাkকнÅ⊿ப/',4«θ	«Rγ⅔→%≤pхw–А६ফ√½Otা४С²Ġ4ωβ≒7a)´J≧DjУθ\»Бnaõ!ঘ~*]É0K


And I totally understand this language $:)$

# 8 | Masking 

Let's assume we have the same text `Optimus Prime`

In [41]:
encoder("Optimus Prime")

[46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68]

In the bigram model we only took into account the predecessor of the predicted element. 

In our next approach we try to predict the element using the knowledge of all the preceding elements. 

For this we first need to mask the elements. Let me show you what I mean by masking.

In [42]:
sample = [encoder("Optimus Prime") for _ in "Optimus Prime"]
sample

[[46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68],
 [46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68],
 [46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68],
 [46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68],
 [46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68],
 [46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68],
 [46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68],
 [46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68],
 [46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68],
 [46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68],
 [46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68],
 [46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68],
 [46, 79, 83, 72, 76, 84, 82, 3, 47, 81, 72, 76, 68]]

In [43]:
torch.tril(torch.tensor(sample))

tensor([[46,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [46, 79,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [46, 79, 83,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [46, 79, 83, 72,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [46, 79, 83, 72, 76,  0,  0,  0,  0,  0,  0,  0,  0],
        [46, 79, 83, 72, 76, 84,  0,  0,  0,  0,  0,  0,  0],
        [46, 79, 83, 72, 76, 84, 82,  0,  0,  0,  0,  0,  0],
        [46, 79, 83, 72, 76, 84, 82,  3,  0,  0,  0,  0,  0],
        [46, 79, 83, 72, 76, 84, 82,  3, 47,  0,  0,  0,  0],
        [46, 79, 83, 72, 76, 84, 82,  3, 47, 81,  0,  0,  0],
        [46, 79, 83, 72, 76, 84, 82,  3, 47, 81, 72,  0,  0],
        [46, 79, 83, 72, 76, 84, 82,  3, 47, 81, 72, 76,  0],
        [46, 79, 83, 72, 76, 84, 82,  3, 47, 81, 72, 76, 68]])

When only $46$ has been seen, the model should not have any knowledge of the tokens after $46$. Similarly, when predicting from $46 , 79$, we should not take into account $83$ or anything after it.

That's how the attention mechanism works at its core. There is more to it than meets the eye, and we will get to that part soon.

One more thing: as a first step, we actually take an average of all the values up to and including that position to predict.

In [44]:
sample = torch.tril(torch.tensor(sample))
sample = sample / torch.sum(sample , 1 , keepdim = True )
sample

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000],
        [0.3680, 0.6320, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000],
        [0.2212, 0.3798, 0.3990, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000],
        [0.1643, 0.2821, 0.2964, 0.2571, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000],
        [0.1292, 0.2219, 0.2331, 0.2022, 0.2135, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000],
        [0.1045, 0.1795, 0.1886, 0.1636, 0.1727, 0.1909, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000],
        [0.0881, 0.1513, 0.1590, 0.1379, 0.1456, 0.1609, 0.1571, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000],
        [0.0876, 0.1505, 0.1581, 0.1371, 0.1448, 0.1600, 0.1562, 0.0057, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000],
        [0.0804,

In [45]:
sa = torch.randint(0 , 10 , (13 , 13)).float()
out = sample @ sa
out

tensor([[3.0000, 1.0000, 9.0000, 3.0000, 9.0000, 5.0000, 7.0000, 9.0000, 7.0000,
         2.0000, 1.0000, 3.0000, 2.0000],
        [1.1040, 1.0000, 9.0000, 4.8960, 3.3120, 5.0000, 6.3680, 5.8400, 3.2080,
         5.7920, 1.0000, 4.8960, 2.0000],
        [4.2548, 2.9952, 7.8029, 6.1346, 4.7837, 5.3990, 4.6250, 4.3077, 2.7260,
         3.4808, 2.1971, 4.5385, 1.2019],
        [4.7036, 2.9964, 7.8536, 5.8429, 4.0679, 6.0679, 3.4357, 5.0000, 3.0536,
         3.3571, 2.9179, 4.4000, 1.1500],
        [5.6208, 4.0646, 7.2444, 4.8090, 4.6938, 5.4129, 4.4101, 4.1461, 4.1096,
         3.0674, 3.7893, 4.5281, 1.1180],
        [6.0750, 4.2432, 7.0068, 4.6545, 5.5159, 6.0977, 3.7591, 3.3545, 4.0886,
         2.6727, 3.6386, 4.2364, 1.8591],
        [5.9061, 4.2050, 7.1628, 5.1801, 6.0632, 6.0824, 3.1686, 4.2414, 3.9176,
         2.7241, 3.5383, 3.8851, 2.5096],
        [5.9238, 4.2210, 7.1733, 5.1619, 6.0514, 6.0819, 3.1562, 4.2400, 3.9410,
         2.7600, 3.5467, 3.9086, 2.5410],
        [5.8479,

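What I did was calculate a weighted average with a simple matrix multiplication trick: every row of `sample` sums to $1$ and only covers the current and earlier positions. Because `sample` was built from the token ids, the weights here are proportional to those ids; starting from `torch.tril(torch.ones(T , T))` instead gives equal weights, which is the plain mean.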
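A plain running mean can be obtained with the same matrix trick by starting from a lower-triangular matrix of ones. Here is a minimal sketch that compares it with an explicit loop; `avg_weights`, `values` and `expected` are hypothetical names.

```python
import torch

T = 4
ones_tril = torch.tril(torch.ones(T, T))
avg_weights = ones_tril / ones_tril.sum(dim=1, keepdim=True)  # each row sums to 1

values = torch.arange(T * 2, dtype=torch.float32).view(T, 2)  # T rows of 2 features
running_mean = avg_weights @ values  # row t is the mean of values[0..t]

# Reference computation with an explicit loop over prefixes
expected = torch.stack([values[: t + 1].mean(dim=0) for t in range(T)])
print(torch.allclose(running_mean, expected))
```

One matrix multiplication replaces the loop, which is what makes this formulation efficient on a GPU.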

Now let's move to the main part of `self-attention`. We first define our `batch size, sequence length and vector dimension`.

In [46]:
B,T,C = 4,8,32
x = torch.randn(B,T,C)
x

tensor([[[-0.0059, -0.5967, -0.7013,  ...,  0.0515, -1.4143, -0.5204],
         [-0.8461,  1.8684,  0.2885,  ..., -0.4459, -0.8661, -0.7859],
         [-0.4063, -0.9309,  0.6428,  ..., -0.4120, -2.4094,  0.6064],
         ...,
         [ 0.0855,  0.3340, -1.4667,  ..., -0.2180,  0.0893, -0.6034],
         [-1.8529, -1.5468,  0.2232,  ..., -1.1113,  0.9400,  1.6256],
         [-0.1997, -2.1067,  0.4903,  ...,  1.2930, -1.0134,  0.5355]],

        [[-0.9141,  0.7983,  0.6185,  ...,  0.2821,  0.9350,  0.8282],
         [-0.3540,  0.7807,  1.2712,  ...,  0.1585, -1.2457, -0.9651],
         [ 0.2939, -1.1325,  1.0524,  ..., -0.5154, -0.8903, -0.8158],
         ...,
         [-0.8594,  1.2923, -0.3534,  ..., -0.5513, -0.8431, -0.3600],
         [-0.9323, -0.2330, -0.5074,  ...,  0.6090, -0.3818,  0.8953],
         [-0.4328, -0.3904,  0.3259,  ..., -0.9250,  0.3773,  0.8938]],

        [[-0.8243,  0.8718,  0.1252,  ...,  0.3058, -0.9518, -0.3857],
         [-0.2872,  0.3697, -1.0632,  ..., -0

What we do is we first define the size of our head 

In [47]:
head_size = 16

Then every token's vector is projected into $2$ forms, 

* Query - What am I looking for 
* Key - What do I contain 

For the `key` and `query` we create `Linear modules`

In [48]:
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
k = key(x)
q = query(x)

Until now the `keys` and `queries` have not communicated at all, but now we will multiply them.

In [49]:
weights = q @ k.transpose(-2 , -1)
weights

tensor([[[ 0.0859,  0.7765,  4.7392,  0.0644,  1.3165,  0.5405,  0.2621,
          -1.1018],
         [ 0.3552,  2.1007,  0.3871, -0.3798, -0.4340, -1.1486,  2.7128,
           0.0831],
         [-0.9784,  1.7547,  1.3245,  0.8993, -1.4125, -4.0524,  1.3320,
          -1.7402],
         [ 0.5586,  0.1217, -1.6052,  0.7288,  1.4129,  1.5814, -4.0350,
          -3.1658],
         [ 0.3756, -0.4867, -2.9665, -0.8443, -0.3163, -1.6430,  2.2675,
          -1.3614],
         [ 0.7192,  1.2253,  0.8443,  0.3659,  0.1517, -1.8427,  2.6577,
          -2.3931],
         [ 2.6714, -1.1549, -3.0244, -3.2962, -2.8656, -1.0292, -1.3797,
           2.2201],
         [ 0.2635, -0.4145,  2.0525, -0.5988,  1.3578,  3.1492,  0.1913,
          -0.7212]],

        [[-0.1519, -1.3314, -0.1386, -1.5614,  2.5274,  1.8214, -0.6703,
          -0.0178],
         [ 0.3047,  2.2509,  0.4044,  1.0251, -1.7280,  1.8268, -1.4853,
           1.4316],
         [ 0.3568, -0.0441, -0.9759,  0.8392, -1.3312,  0.0341,  0.8
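
As a quick shape check, here is a minimal sketch of the key/query step above. The sizes `B`, `T`, `C` and the random `x` are illustrative assumptions, not the notebook's actual batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed toy sizes: batch, time steps, channels (not the notebook's real data)
B, T, C = 4, 8, 32
head_size = 16
x = torch.randn(B, T, C)

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)

k = key(x)    # (B, T, head_size)
q = query(x)  # (B, T, head_size)

# Swap the last two dims of k so each query row is dotted with every key row
scores = q @ k.transpose(-2, -1)  # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)

print(k.shape, q.shape, scores.shape)
```

Entry `scores[b, i, j]` is the dot product of query `i` with key `j` in batch element `b`, which is why the result is a `T x T` grid per batch element.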

Now we mask the `weights`, so that a token cannot look at the tokens that come after it

In [50]:
tril = torch.tril(torch.ones(T , T))
weights = weights.masked_fill(tril == 0 , float("-inf"))
weights

tensor([[[ 0.0859,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf,
             -inf],
         [ 0.3552,  2.1007,    -inf,    -inf,    -inf,    -inf,    -inf,
             -inf],
         [-0.9784,  1.7547,  1.3245,    -inf,    -inf,    -inf,    -inf,
             -inf],
         [ 0.5586,  0.1217, -1.6052,  0.7288,    -inf,    -inf,    -inf,
             -inf],
         [ 0.3756, -0.4867, -2.9665, -0.8443, -0.3163,    -inf,    -inf,
             -inf],
         [ 0.7192,  1.2253,  0.8443,  0.3659,  0.1517, -1.8427,    -inf,
             -inf],
         [ 2.6714, -1.1549, -3.0244, -3.2962, -2.8656, -1.0292, -1.3797,
             -inf],
         [ 0.2635, -0.4145,  2.0525, -0.5988,  1.3578,  3.1492,  0.1913,
          -0.7212]],

        [[-0.1519,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf,
             -inf],
         [ 0.3047,  2.2509,    -inf,    -inf,    -inf,    -inf,    -inf,
             -inf],
         [ 0.3568, -0.0441, -0.9759,    -inf,    -inf,    -inf,    -
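
The masking pattern is easier to read on a tiny example. The sketch below assumes a sequence length of `T = 4` and uses an all-zero matrix as a stand-in for the real scores.

```python
import torch

T = 4  # assumed small sequence length so the pattern is easy to read
scores = torch.zeros(T, T)            # stand-in for q @ k^T
tril = torch.tril(torch.ones(T, T))   # ones on and below the diagonal

# Positions where tril == 0 are "future" tokens; set them to -inf
masked = scores.masked_fill(tril == 0, float("-inf"))
print(masked)
```

Row `i` keeps columns `0..i` and replaces everything to the right of the diagonal with `-inf`.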

Now if we calculate the `softmax` of these, we get

In [51]:
weights = F.softmax(weights , dim = -1)
weights

tensor([[[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
          0.0000e+00, 0.0000e+00, 0.0000e+00],
         [1.4861e-01, 8.5139e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00,
          0.0000e+00, 0.0000e+00, 0.0000e+00],
         [3.7902e-02, 5.8295e-01, 3.7915e-01, 0.0000e+00, 0.0000e+00,
          0.0000e+00, 0.0000e+00, 0.0000e+00],
         [3.3939e-01, 2.1925e-01, 3.8990e-02, 4.0237e-01, 0.0000e+00,
          0.0000e+00, 0.0000e+00, 0.0000e+00],
         [4.4376e-01, 1.8736e-01, 1.5693e-02, 1.3103e-01, 2.2216e-01,
          0.0000e+00, 0.0000e+00, 0.0000e+00],
         [1.9461e-01, 3.2281e-01, 2.2055e-01, 1.3668e-01, 1.1033e-01,
          1.5016e-02, 0.0000e+00, 0.0000e+00],
         [9.3130e-01, 2.0295e-02, 3.1294e-03, 2.3845e-03, 3.6679e-03,
          2.3011e-02, 1.6208e-02, 0.0000e+00],
         [3.3201e-02, 1.6855e-02, 1.9865e-01, 1.4017e-02, 9.9173e-02,
          5.9481e-01, 3.0889e-02, 1.2402e-02]],

        [[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.00
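
To see why `-inf` is the right fill value, here is a small sketch under the same assumption of all-zero scores and `T = 4`. Since `exp(-inf)` is `0`, the masked positions receive exactly zero weight and each row becomes a uniform average over the visible tokens.

```python
import torch
import torch.nn.functional as F

T = 4  # assumed small sequence length
tril = torch.tril(torch.ones(T, T))
masked = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))

# Softmax over the last dim normalises each row to sum to 1
probs = F.softmax(masked, dim=-1)
print(probs)
```

With real scores the visible weights are no longer uniform, but every row still sums to $1$ and the future positions stay at $0$.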

And this is what we wanted 

Then we create another `Linear` module named `value`, and multiply the `weights` with its output

In [52]:
value = nn.Linear(C , head_size , bias = False)
v = value(x)

In [53]:
output = weights @ v
output

tensor([[[ 0.5165,  0.4819, -0.4010, -0.6524,  1.0272, -0.1677, -0.1033,
           0.0659, -0.6548, -0.3390,  0.3273,  0.8093, -0.1289, -0.2741,
           0.8411, -0.3724],
         [-0.2362, -0.6947,  0.1973, -0.1419,  0.4323,  0.2074, -0.2035,
           0.3783, -0.2427, -0.4067, -0.3818,  0.2421, -0.5855, -0.3320,
           0.3395,  0.8118],
         [-0.4834, -0.4632,  0.3034,  0.0046,  0.2162, -0.0900, -0.3797,
           0.5561,  0.2532,  0.1291, -0.2603,  0.1161, -0.5692, -0.5774,
           0.0114,  1.1058],
         [-0.1818, -0.3977, -0.1191, -0.4063,  0.2018,  0.3791, -0.2690,
           0.4710,  0.1223,  0.1131,  0.1201,  0.3100, -0.4031, -0.0464,
           0.4676,  0.0835],
         [ 0.1080,  0.0220, -0.2771, -0.3081,  0.3520,  0.2664, -0.0579,
           0.0356, -0.4369,  0.0309,  0.0790,  0.6428, -0.3123, -0.2453,
           0.3168, -0.0478],
         [-0.2381, -0.2580,  0.0071, -0.1519,  0.1760,  0.1312, -0.2464,
           0.3592,  0.0157,  0.1887, -0.0448,  0.338
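
A minimal sketch of this aggregation step, assuming toy sizes and random scores in place of the real `q @ k^T`. It also checks one property that follows from the mask: the first token can only attend to itself, so its output should equal its own value vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C, head_size = 2, 8, 32, 16  # assumed toy sizes
x = torch.randn(B, T, C)

value = nn.Linear(C, head_size, bias=False)
v = value(x)  # (B, T, head_size)

# Random scores stand in for q @ k^T; mask and normalise as before
scores = torch.randn(B, T, T)
tril = torch.tril(torch.ones(T, T))
weights = F.softmax(scores.masked_fill(tril == 0, float("-inf")), dim=-1)

output = weights @ v  # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)

# Compare the first token's output with its own value vector
print(output.shape, torch.allclose(output[:, 0], v[:, 0]))
```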

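Each raw score in `q @ k.transpose(-2 , -1)` is a sum of `head_size` products, so the variance of the scores grows with `head_size`. We can look at the variances we have so far (note that `weights` has already gone through the `softmax` at this point, so its variance is small)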

In [54]:
k.var() , q.var() , weights.var()

(tensor(0.3190, grad_fn=<VarBackward0>),
 tensor(0.3927, grad_fn=<VarBackward0>),
 tensor(0.0466, grad_fn=<VarBackward0>))

To scale the scores down, we divide them by $\sqrt{head\_size}$

In [55]:
wei = q @ k.transpose(-2, -1) * head_size**-0.5

Scaling also solves one more problem. As the variance of the scores increases, the softmax sharpens and can saturate to values that are almost exactly $1$ and $0$. Scaling keeps the softmax from doing so
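
Both effects can be checked numerically. The sketch below assumes unit-variance random keys and queries (not the notebook's learned projections) and a hand-picked logit vector.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, head_size = 64, 8, 16  # assumed toy sizes

# Unit-variance stand-ins for the keys and queries
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)

raw = q @ k.transpose(-2, -1)    # variance grows roughly like head_size
scaled = raw * head_size**-0.5   # variance brought back to roughly 1
print(raw.var().item(), scaled.var().item())

# Larger magnitudes push softmax towards a one-hot vector
logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
soft = F.softmax(logits, dim=-1)
sharp = F.softmax(logits * 8, dim=-1)
print(soft.max().item(), sharp.max().item())
```

The largest probability in `sharp` is much closer to $1$ than in `soft`, which is the saturation that the scaling is meant to avoid.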

Now we just put all of this into a class

In [56]:
class Head(nn.Module):

    def __init__(self, head_size):
        
        super().__init__()
        
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        
        self.register_buffer('tril', 
                             torch.tril(torch.ones(block_size, 
                                                   block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):

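        B,T,C = x.shape
        
        k = self.key(x)      # (B, T, head_size)
        q = self.query(x)    # (B, T, head_size)
        
        # Scaled attention scores: (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 
        # Hide future positions so each token only attends to itself and the past
        wei = wei.masked_fill(self.tril[:T, :T] == 0, 
                              float('-inf'))
        
        # Normalise each row so that it sums to 1
        wei = F.softmax(wei, dim=-1)
        
        wei = self.dropout(wei)
        
        # Weighted aggregation of the values: (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        v = self.value(x)
        out = wei @ v 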
        
        return out
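
One way to sanity-check a head like this is to confirm that it is causal. The sketch below uses a hypothetical functional version, `causal_head`, with assumed toy sizes and no dropout. It changes only the last token and compares the outputs at the earlier positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C, head_size = 1, 8, 32, 16  # assumed toy sizes

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
tril = torch.tril(torch.ones(T, T))

def causal_head(x):
    """Hypothetical functional version of the head, without dropout."""
    k, q, v = key(x), query(x), value(x)
    wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
    wei = wei.masked_fill(tril == 0, float("-inf"))
    return F.softmax(wei, dim=-1) @ v

x = torch.randn(B, T, C)
x_changed = x.clone()
x_changed[:, -1] = torch.randn(C)  # alter only the last token

out, out_changed = causal_head(x), causal_head(x_changed)

# Earlier positions never see the last token, so their outputs should match
print(torch.allclose(out[:, :-1], out_changed[:, :-1]))
```

If the mask were missing, changing the last token would change the outputs at every position.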

Now we run several heads in parallel on the same input and concatenate their outputs

In [57]:
class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        
        super().__init__()
        
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        
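        # Run every head on the same input and concatenate along the channel dimension
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        # Mix the heads with a linear projection to n_embd channels, then apply dropout
        out = self.dropout(self.proj(out))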
        return out
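
The shapes in the concatenation step can be followed with a short sketch. The hyperparameters are assumptions chosen so that `num_heads * head_size == n_embd`, and random tensors stand in for the per-head attention outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Assumed toy hyperparameters, chosen so num_heads * head_size == n_embd
B, T, n_embd, num_heads = 2, 8, 32, 4
head_size = n_embd // num_heads

# Random tensors stand in for the per-head attention outputs
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_heads)]

# Concatenate along the channel dimension: (B, T, num_heads * head_size)
concat = torch.cat(head_outputs, dim=-1)

# Linear projection that mixes the heads and maps to n_embd channels
proj = nn.Linear(head_size * num_heads, n_embd)
out = proj(concat)

print(concat.shape, out.shape)
```

The first `head_size` channels of `concat` come from the first head, the next `head_size` from the second, and so on. The projection is what lets information from different heads mix.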

This is actually not $GPT-0$, and it's not even close to that model, but we built it to get an intuition for how we will build the $GPT-0$ model. 

The $GPT-0$ model will be updated soon in this notebook, in around $1-2$ days

# 8 | TO DO LIST 📝

```
# ADD ATTENTION MECHANISM

# ADD WANDB SUPPORT 

# IMPROVE THE RESULTS
```

# 9 | End Yayyyyyyyyyyyy :) 🥳🎊
**THAT'S IT FOR TODAY GUYS**

**WE WILL IMPROVE THIS IN UPCOMING VERSIONS**

**PLEASE COMMENT YOUR THOUGHTS, HIGHLY APPRECIATED**

**DON'T FORGET TO UPVOTE, IF YOU LIKED MY WORK**

<img src = "https://i.imgflip.com/19aadg.jpg">

**PEACE OUT $:)$**