<a href="https://colab.research.google.com/github/Aruniaaa/NoteBooks-Dump/blob/main/GPT_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a GPT from scratch

Tutorial used - [Andrej Karpathy's](https://youtu.be/kCc8FmEb1nY?si=J9uV8cDOVSjrIds6)

GPT - Generative Pre-Trained Transformer

Basically, the following code builds a small Transformer-based neural network that learns, from a collection of text (the training data), to accurately predict what comes next in a sentence or sequence of characters.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

path = '/content/drive/MyDrive/Datasets/Del-data.txt' # your path will certainly be different

with open(path, "r") as file:
  data = file.read()


Mounted at /content/drive


## Basics

What I'm doing here:
1. Storing all the unique characters in the dataset in a sorted list called "chars". The length of "chars" is our vocab size, which is stored in the variable "size"

2. Mapping each individual and unique character in the dataset to a number for tokenization (Example -> a = 18, r = 33, p = 32). The mapping is stored in the form of a dictionary "stoi".

3. Reversing that mapping so that every integer corresponds with a letter or character, this is stored in the dictionary "itos"

4. The encoder and decoder are lambda functions: "encode" looks up every character of a string in "stoi" and returns a list of integers, while "decode" looks up every integer in "itos" and joins the characters back into one string
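To see the mapping in action, here is a small sketch on a toy string rather than the real dataset. The names mirror the ones used in this notebook, but `text` and `vocab_size` are made up for illustration.

```python
# Toy example: character-level tokenization on a made-up string.
text = "hello world"

chars = sorted(set(text))          # unique characters, in a stable order
vocab_size = len(chars)            # 8 unique characters here

stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer
itos = {i: ch for ch, i in stoi.items()}       # integer -> character

def encode(sentence):
    """Turn a string into a list of integer token ids."""
    return [stoi[c] for c in sentence]

def decode(ids):
    """Turn a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)            # [3, 2, 4, 4, 5]
print(decode(ids))    # hello
```

Encoding and then decoding always gives back the original string, as long as every character was present when the vocabulary was built.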



In [None]:
chars = sorted(list(set(data)))
size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i : ch for i, ch in enumerate(chars)}

encode = lambda sentence: [stoi[c] for c in sentence ]
decode = lambda arr: ''.join([itos[i] for i in arr])


In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

### Data Splitting and Batch Processing

The following two cells do the below mentioned:

1. Creating a torch.tensor object (you can think of this as a list/array) that stores the encoding of the data into integers. So, each character in the data, regardless of whether it's unique, is encoded into its integer from the stoi dictionary (if r = 33 in stoi, every occurrence of r in the data will be 33) and stored in the new data variable. Think of it as writing the whole Del-data.txt file, but in the form of numbers

2. We then take 90% of the length of our data and store that number in the variable n, which we use to split our data into train (the first 90%) and test (the last 10%). Learn more about train test split [here](https://builtin.com/data-science/train-test-split)

3. Next we define a get_batch function that randomly grabs [batch size] sequences of data, each sequence containing [block size] characters. The split argument is passed into the function so that x and y are retrieved from the correct dataset.
For example, if our model is in the training mode we would not want to get x and y (context and targets) from the test split we defined in the below cell, and vice versa.

4. ix is a list of [batch size] random starting points for our x and y values. The len(data) - block size upper limit ensures that we don't grab a starting position so close to the end of the data that we can't travel [block size] steps ahead.

5. x and y here are arrays of [batch size] by [block size]; each row has [block size] sequential elements encoded into integers. For example, if the 0th integer in ix (a random starting point) was 46, the first row of x would hold the elements from index 46 up to, but not including, index 46 + [block size] of the train/test data. (In this case that would be data[46 : 54])

6. Every element in y is the same as x, it's just offset by 1 to the right, so that each element of y is the target integer for the same position in x (you can run the for loop below to better understand this)
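The index arithmetic in points 4 to 6 can be checked without PyTorch. This is a minimal sketch on a toy list standing in for the encoded dataset (the list contents and its length of 20 are assumptions for illustration); it walks over every starting point that `torch.randint(len(data) - block_size, ...)` could return.

```python
# Toy stand-in for the encoded dataset: the integers 0..19.
data = list(range(20))
block_size = 8

# Valid starting points are 0 .. len(data) - block_size - 1,
# which matches the exclusive upper bound passed to torch.randint.
valid_starts = range(len(data) - block_size)

for i in valid_starts:
    x = data[i : i + block_size]           # context
    y = data[i + 1 : i + block_size + 1]   # targets, shifted right by one
    assert len(x) == block_size and len(y) == block_size
    assert y[:-1] == x[1:]                 # y is x moved one step ahead

last = valid_starts[-1]
print(last)                                    # 11
print(data[last : last + block_size])          # [11, 12, 13, 14, 15, 16, 17, 18]
print(data[last + 1 : last + block_size + 1])  # [12, 13, 14, 15, 16, 17, 18, 19]
```

Even the largest allowed starting point leaves room for the extra element that y needs at the end.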


In [None]:
data = torch.tensor(encode(data), dtype=torch.long)

n = int(0.9 * len(data))
train = data[: n]
test = data[n : ]

In [None]:
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train if split.lower() == 'train' else test
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[30, 21,  1, 35, 25, 22,  1, 29],
        [20, 22,  2,  1, 34, 35, 18, 19],
        [ 1, 21, 22, 28, 22, 24, 18, 35],
        [27,  1, 40, 31, 36,  2,  1,  8]])
targets:
torch.Size([4, 8])
tensor([[21,  1, 35, 25, 22,  1, 29, 31],
        [22,  2,  1, 34, 35, 18, 19, 26],
        [21, 22, 28, 22, 24, 18, 35, 26],
        [ 1, 40, 31, 36,  2,  1,  8, 31]])
----
when input is [30] the target: 21
when input is [30, 21] the target: 1
when input is [30, 21, 1] the target: 35
when input is [30, 21, 1, 35] the target: 25
when input is [30, 21, 1, 35, 25] the target: 22
when input is [30, 21, 1, 35, 25, 22] the target: 1
when input is [30, 21, 1, 35, 25, 22, 1] the target: 29
when input is [30, 21, 1, 35, 25, 22, 1, 29] the target: 31
when input is [20] the target: 22
when input is [20, 22] the target: 2
when input is [20, 22, 2] the target: 1
when input is [20, 22, 2, 1] the target: 34
when input is [20, 22, 2, 1, 34] the target: 35
when input is [20, 22, 2,

### Hyper Params

"In machine learning, a hyperparameter is a parameter that can be set in order to define any configurable part of a model's learning process"

Batch size = How many independent sequences will we process in parallel? You can also think of this as how many starting points we will have to create the batch of data. This is important as we cannot pass the entire training data into the model at once, since that would be very computationally expensive

Block size = Given we have [batch size] amount of sequences, how many characters should each sequence have? Here, we will have 16 independent sequences from the data, each of which will have 32 characters. For example, one of the sequences could be "Distinguished delegates, I exten", it has 32 characters.

Max Iters (Iterations) = During training, how many times should the model be trained or allowed to see new data? For example, one iteration would mean that the entire training process (getting the x and y values, calculating the loss, backward propagation, and the optimizer update) would happen one time. For 5000 iterations, the same process will repeat 5000 times.

Eval (evaluation) interval = During training, I want to show the user how the loss is changing, either for debugging or just to keep getting updates during a long training run. The train and test loss will be printed once every [eval_interval] iterations.

Device = Since there will be a LOT of matrix multiplications and math involved, we're better off using the GPU if possible (which can perform up to 36 trillion computations per second) rather than the CPU, which is relatively slow. ***THIS LINE IS NOT NEEDED FOR GOOGLE COLAB***

Eval_iters = The number of batches we will calculate the loss for during each evaluation. Instead of just calculating the loss on one batch of 16 sequences, we will calculate the loss on [eval_iters] different batches and then take the average

N_embed = You're gonna need some linear algebra for this one. It basically creates a 64 dimension vector space where each vector represents a character. So, the letter h is represented by a vector of 64 real numbers where each dimension stores something about the character. Dimension 21 might represent "frequency in English text", dimension 47 might represent "likelihood of appearing after 'q'", etc

n_head = This is the basis for our multiheaded attention mechanism. Each of the four attention heads projects the 64-dimensional embedding down to its own 16 dimensions (64 / 4 = 16), using its own learned Query, Key, and Value transformations, and operates independently of the other heads. One head might focus on short term relationships, another might focus on positional relationships, etc

n_layer = Now, the information from the 4 different heads goes to the next layer, where it is refined and improved, and n_layer is the number of times we do this. Each layer builds on the information from the one before it, so by the time the data gets to the last layer, the model has a much deeper understanding of the relationships in the data. With four layers, the information goes through four rounds of processing and refinement.

Dropout = To prevent [overfitting](https://www.ibm.com/think/topics/overfitting), some of the neurons randomly "shut down" during training, so that they don't participate in the prediction process or pass along any information. This is done so that the model doesn't become too dependent on any one neuron or set of neurons. Here the dropout is 0.0, which means all the neurons will be active, we will tweak this value later.
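A few useful quantities follow directly from these settings. The sketch below just does the arithmetic; the derived names (`head_size`, `tokens_per_step`, `tokens_seen`, `ffwd_hidden`) are introduced here for illustration.

```python
batch_size = 16
block_size = 32
max_iters = 5000
n_embed = 64
n_head = 4

head_size = n_embed // n_head                 # dimensions per attention head
tokens_per_step = batch_size * block_size     # characters processed per iteration
tokens_seen = tokens_per_step * max_iters     # characters sampled over the whole run
ffwd_hidden = 4 * n_embed                     # width of the feed-forward hidden layer

print(head_size, tokens_per_step, tokens_seen, ffwd_hidden)
# 16 512 2560000 256
```

Note that n_embed has to be divisible by n_head, otherwise the concatenated head outputs would not add back up to n_embed dimensions.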

In [None]:
batch_size = 16
block_size = 32
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embed = 64
n_head = 4
n_layer = 4
dropout = 0.0


### Estimating the Loss

Loss is basically a number evaluating the performance of a model: the lower it is, the closer the model's predictions are to the actual next characters. We track it on both the training and testing datasets.


@torch.no_grad() = This is a PyTorch context manager, used here as a decorator, that disables gradient calculation. During the forward pass, PyTorch's autograd engine normally builds a computation graph to track operations, which is required for backpropagation and weight updates. By decorating the evaluation function with @torch.no_grad(), we prevent this graph from being built, which saves memory and time.

model.eval() = sets the model to evaluation mode. This is important because certain layers, like Dropout and Batch Normalization, behave differently during training and evaluation. For example, in evaluation mode all neurons are active and dropout is effectively 0

We iterate over two different splits: 'train', which tells us how the model is performing on the training dataset, i.e., the data it has seen before, and 'val', which get_batch maps to the testing dataset, i.e., data it has not seen before.

Then, we get X and Y and pass them into our model, which returns the logits (the actual predictions) and the loss (how well it performed)

We average out the losses (for both the train and test data) and store them in the out dictionary. The model is then set back to training mode.
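The effect of evaluation mode and of `torch.no_grad()` can be seen in isolation. This is a small sketch on a standalone dropout layer and a toy tensor, not on the notebook's model; the dropout probability of 0.5 is chosen only to make the effect visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()                 # training mode: roughly half the entries are zeroed,
y_train = drop(x)            # the survivors are scaled by 1 / (1 - p) = 2

drop.eval()                  # evaluation mode: dropout does nothing
y_eval = drop(x)

w = torch.ones(3, requires_grad=True)
with_graph = (w * 2).sum()               # tracked by autograd
with torch.no_grad():
    without_graph = (w * 2).sum()        # not tracked

print(torch.equal(y_eval, x))            # True
print(with_graph.requires_grad)          # True
print(without_graph.requires_grad)       # False
```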



In [None]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

## Transformer

From here, we start building the actual transformer.


### Attention Head

 It's basically a special mechanism that allows a model to weigh the importance of different words in a sequence when processing a single word. It looks at all the other words to figure out which ones are most relevant to the current word. This helps the model understand context. For example, in the sentence "The dog chased the cat, and it ran away", who does "it" refer to?


 Each token (a character) in the data/text has 3 vectors:

 1. Query vector (going to be represented as Q from now): This is like each token asking "Are there any tokens that I should pay attention to?" or "I am looking for a token that is the character '.' and comes after the letter 't'", etc etc

 2. Key vector (going to be represented as K from now): This is like each token saying "I am h, a consonant, at position 5, who is followed by the token 'e'" or any other relevant info.

 3. Value vector (going to be represented as V from now): This is the information a token actually hands over once another token decides to pay attention to it.

 Now the way that the attention works is by calculating the dot product of each query vector and each key vector. It's like comparing the Q of the current token with the K of all other tokens. The more similar the Q is to a K, the more attention the model pays to that token.

 wei = q @ k.transpose(-2,-1) computes the dot product of every query with every key, which gives a (T, T) matrix of attention scores for each sequence. Row i holds how strongly token i matches every token in the sequence.

 wei.masked_fill(...): ensures that a token can only pay attention to itself and the tokens that came before it. This is important for language generation, as the model can't "cheat" by looking at future tokens.

 F.softmax(wei, dim=-1): The scores in each row are then turned into probabilities, so they all add up to 1.

 out = wei @ v: Instead of each token focusing 100% on the single most "relevant" token in the attention matrix, it takes a weighted average of the value vectors of all the tokens it is allowed to see, where the most relevant token still gets most of the attention.
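The four steps above can be sketched on random toy tensors. The sizes (4 tokens, 8 dimensions per head) are made up to keep the printout small, and the random q, k and v stand in for what the learned linear layers would produce.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T, head_size = 4, 8                      # toy sizes, not the notebook's values

q = torch.randn(1, T, head_size)         # one query per token
k = torch.randn(1, T, head_size)         # one key per token
v = torch.randn(1, T, head_size)         # one value per token

# Raw affinities between every query and every key, scaled by 1/sqrt(head_size).
wei = q @ k.transpose(-2, -1) * head_size ** -0.5     # (1, T, T)

# Causal mask: position t may only look at positions 0..t.
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))

wei = F.softmax(wei, dim=-1)             # each row becomes a probability distribution
out = wei @ v                            # weighted average of the value vectors

print(wei[0])                            # everything above the diagonal is zero
print(out.shape)                         # torch.Size([1, 4, 8])
```

The first token can only see itself, so its row in wei is [1, 0, 0, 0] and its output is exactly its own value vector.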




In [None]:
class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
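        B,T,C = x.shape
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        # attention scores, scaled by 1/sqrt(head_size) so the softmax does not saturate
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        # mask out future positions so a token cannot attend to what comes after it
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, head_size)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)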
        return out


### Multiple Attention Heads

Why do we need it?
Well, one attention head works quite well, but this is a generalist vs specialist case. Would you rather have one attention head that tries to capture everything (which tokens it should pay attention to, the underlying relationships, frequency, etc.)?

Or have a little team of attention heads that each focus on a different thing, but do it brilliantly?

In the following code, [num_heads] heads are created, and each one creates a new representation of every token that reflects what that head learned to pay attention to.

So, each head takes the input tensor x (batch_size, 32, 64) and creates a (batch_size, 32, 16) tensor focusing on *one* aspect, instead of one head trying to generalize ALL the relationships at once.

For example,

Head 1 produces (batch_size, 32, 16) focusing on, say, grammatical relationships
Head 2 produces (batch_size, 32, 16) focusing on, say, semantic similarities
Head 3 produces (batch_size, 32, 16) focusing on, say, positional patterns
Head 4 produces (batch_size, 32, 16) focusing on, say, long-range dependencies

```
out = torch.cat([h(x) for h in self.heads], dim=-1)
```
 This gives us a 64-dimensional representation (4 heads × 16 dimensions) where each token now has multiple complementary perspectives encoded within it.

 Now, before feeding this tensor into the next layer, we need to project it. Its structure is different from the original one: the original 64 dimensions held mixed information, while these 64 dimensions are rigidly organised, head by head.

 For example, the first 16 dimensions can represent grammatical relationships, the next 16 can represent semantic relationships, etc etc.

 So, the projection transforms structured, specialized information back into a flexible, general representation that can be used by the next layer in the stack.
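The shapes involved can be checked with a short sketch. The random tensors below are stand-ins for the per-head outputs (in the real model each would come from a Head), and the batch size of 2 is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, n_embed, n_head = 2, 32, 64, 4
head_size = n_embed // n_head            # 16

# Stand-ins for the per-head outputs; in the real model each comes from a Head.
head_outputs = [torch.randn(B, T, head_size) for _ in range(n_head)]

concatenated = torch.cat(head_outputs, dim=-1)   # (B, T, 4 * 16) = (B, T, 64)
proj = nn.Linear(n_embed, n_embed)               # mixes information across heads
mixed = proj(concatenated)                       # still (B, T, 64)

# Before the projection, dimensions 0..15 are exactly head 0's output.
print(torch.equal(concatenated[..., :head_size], head_outputs[0]))   # True
print(concatenated.shape, mixed.shape)
```

After the projection, every output dimension is a learned combination of all 64 concatenated dimensions, so no slice belongs to a single head any more.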




In [None]:
class MultiHeadedAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embed, n_embed)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out


### Feed Forward

Why do we need it? Think of it like this: the multi headed attention enabled the tokens to look at each other and communicate. Let's say we had the phrase "A fluffy blue creature roams the verdant forest". After multi headed attention, the token "creature" now knows it's related to, or should pay attention to, the words "blue", "fluffy", and "roams", but it did not get a lot of time to actually *think* about it.

Which means that the token does not actually know that it's fluffy and blue and is roaming, it just knows it's supposed to pay a lot of attention to those tokens.

Now what the feed forward layer does is take the output vector from the multi headed attention and change its dimensions from 64 to 256 (multiplies by 4), so as to have more "thinking space".

This "thinking space" allows the token "creature" to now focus on WHY and HOW those attention scores matter. It learns patterns in the data like how an adjective is almost always followed by a noun, how a capital letter is almost always preceded by a full stop, etc etc.



```
 nn.Linear(4 * n_embed, n_embed),
```
This line squishes the newly formed 256 dimension vector back into a 64 dimension one, keeping the most relevant parts in a concise manner, but this new vector now has all the "thinking" and understanding encoded within it.
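One property worth seeing for yourself is that the feed forward network works on every token separately: tokens only communicate during attention. The sketch below uses the same expand-then-compress layers (without dropout) on a random toy input of 5 tokens, then changes a single token and compares the outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embed = 64

ffwd = nn.Sequential(
    nn.Linear(n_embed, 4 * n_embed),   # 64 -> 256: expand
    nn.ReLU(),
    nn.Linear(4 * n_embed, n_embed),   # 256 -> 64: compress back
)

x = torch.randn(1, 5, n_embed)         # 1 sequence, 5 tokens
hidden = ffwd[0](x)                    # output of the expansion layer
out = ffwd(x)

# Perturb only token 3 and check that the other tokens' outputs do not move.
x2 = x.clone()
x2[0, 3] += 1.0
out2 = ffwd(x2)

print(hidden.shape, out.shape)                   # last dimension is 256, then 64
print(torch.allclose(out[0, :3], out2[0, :3]))   # True
print(torch.allclose(out[0, 3], out2[0, 3]))     # False
```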


In [None]:

class FeedForward(nn.Module):

    def __init__(self, n_embed):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(n_embed, n_embed * 4),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.network(x)



### Blocks/Layers

The job of the block is to understand complex relationships and patterns like "this forms a complete word", "this is a verb", "full stop is used to end a sentence", etc.

It's different from multiple attention heads, as the attention heads realise the context within the input. For example, which letters go together most often, which one appears before a full stop, etc etc.

But Blocks/Layers BUILD context, realising how the words/letters fit together, when, where, how, and WHY. A block understands that "the cat" is two words that need to be separated by a space, etc etc.

It also preserves information while adding new understanding. At each stage, the Block says "keep everything you already knew, but now add this new layer of insight" through the residual connections (the x = x + ... operations).

And, the understanding from one layer passes onto the other to refine and improve!

We use two layer norms here. The first one ensures that the data is properly scaled before moving into the multiheaded attention. The second one takes the output from the attention stage and normalizes it across the embedding dimensions, ensuring that the feed-forward network receives inputs with stable, well-behaved statistical properties. This is important because the feed-forward network uses learned weight matrices, and these matrices work best when their inputs have consistent scaling and distribution. Without this normalization, the feed-forward network might struggle with inputs that are too large, too small, or have unusual distributions, potentially leading to training instability or poor performance.
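Both ideas, normalization and the residual connection, can be tried on a toy tensor. In this sketch the input is deliberately badly scaled, and a plain linear layer stands in for the attention or feed-forward sub-layer (that stand-in is an assumption made to keep the example short).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embed = 64

ln = nn.LayerNorm(n_embed)
x = torch.randn(2, 5, n_embed) * 10 + 3     # badly scaled activations

normed = ln(x)
token_means = normed.mean(dim=-1)                  # one mean per token
token_vars = normed.var(dim=-1, unbiased=False)    # one variance per token
print(token_means.abs().max())              # close to 0 for every token
print(token_vars.mean())                    # close to 1

# Residual connection: the sub-layer only adds a correction on top of x.
sublayer = nn.Linear(n_embed, n_embed)      # stand-in for attention or feed-forward
y = x + sublayer(ln(x))
print(y.shape)                              # same shape as x
```

Each token is normalized on its own, across its 64 embedding dimensions, so the statistics of one token never depend on the other tokens in the batch.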

In [None]:

class Block(nn.Module):

    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadedAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embed)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)


    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

## The Actual Model

Here, in the init, we combine everything (attention heads, feed forward, blocks, position embedding, etc) to finally create the model. It's like taking the pieces of the jigsaw puzzle that we've been building and putting them together. The combination is the model here.

First, we add the n_embed dimensions and positional encoding to our (B,T) tensor by

```
x = tok_emb + pos_emb # (B,T,C)
```
The x tensor then flows through the stack of blocks (each one containing multi-headed attention and feed forward), and the output tensor, which has the same dimensions, gets normalized.

The lm_head is a linear layer that takes the processed embeddings and generates the raw scores, called "logits", for every token in our vocabulary.

Think of it like this,

after all the transformer processing, each position has learned a rich representation that captures context like "we're talking about animals, and the last word was 'the'." Now lm_head looks at that representation and says:

"cat" gets a score of 8.2
"dog" gets a score of 7.1
"pizza" gets a score of 0.3
"the" gets a score of -2.1

The higher the score (logit), the more likely the model thinks that word should come next.

The generate function ->

It implements autoregressive sampling, which means that the model predicts one token at a time, then feeds that prediction back as input for the next prediction. This creates a feedback loop that can generate arbitrarily long sequences.

By the way, the model can only use the previous block_size tokens as context for each prediction. So, if block_size is 32 and we've generated 40 tokens so far, we only use the last 32 tokens as context for predicting token 41. This is required because the position embedding table only has block_size entries, and it also keeps memory use bounded and inference fast.

Since logits are just raw scores, we use softmax to convert them into a probability distribution.
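To make the softmax step concrete, here is a small sketch that applies it by hand to the made-up example scores listed above. The real model does the same thing with F.softmax over character logits rather than word scores.

```python
import math

# The example scores from the text (made-up numbers, word-level for readability).
logits = {"cat": 8.2, "dog": 7.1, "pizza": 0.3, "the": -2.1}

# Softmax: exponentiate each score, then divide by the total.
exps = {word: math.exp(score) for word, score in logits.items()}
total = sum(exps.values())
probs = {word: e / total for word, e in exps.items()}

for word, p in probs.items():
    print(f"{word:>5}: {p:.4f}")

best = max(probs, key=probs.get)
print(best)    # cat
```

The ordering of the words never changes, but the gaps do: a difference of about one point between "cat" and "dog" already makes "cat" roughly three times as likely.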

After that, we use torch.multinomial.

Let's say we had the sentence "The cat sat on the", and we need to predict the next word. Now, chair could be the most probable, followed by table, bed, etc etc.

Words like elephant or juice are examples of the least probable words.

Now imagine a spinning wheel with those words. Chair makes up 50% of the wheel, table makes up 30%, bed makes up 10%, and elephant and juice make up 5% each. We spin the wheel. Now it's most likely going to land on chair, but it can also land on table, bed, or even juice. The word it lands on is the predicted word.

Why do we do this? Well, so that the model doesn't generate a fixed sequence of characters every time. Think about how boring it would be if ChatGPT or any other LLM only gave one, specific, fixed response to a specific prompt. To prevent this, and add a bit of "randomness", we "spin the wheel"
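The wheel can be simulated with the standard library. In this sketch `random.choices` plays the role that torch.multinomial plays in the notebook, and the words and percentages are the made-up ones from the example above.

```python
import random
from collections import Counter

random.seed(0)

words = ["chair", "table", "bed", "elephant", "juice"]
probs = [0.50, 0.30, 0.10, 0.05, 0.05]     # the wheel from the example

# Spin the wheel 10,000 times.
spins = random.choices(words, weights=probs, k=10_000)
counts = Counter(spins)

for word in words:
    print(word, counts[word] / len(spins))

# Always picking the most probable word would give the same answer every time.
greedy = words[probs.index(max(probs))]
print(greedy)    # chair
```

Over many spins the observed frequencies settle close to the wheel's percentages, while any single spin can still land on an unlikely word.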

We then concatenate the newly sampled token to our existing sequence, so the sequence grows from length T to T+1.

This longer sequence becomes the input for the next forward pass, where the model predicts the token that should come after the one we just sampled. And so on, and so on.
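The whole loop (crop, predict, sample, append) can be sketched without a trained model. Here `toy_next_token_probs` is a hypothetical stand-in for the model that simply favours "the previous token plus one", and the vocabulary of 5 tokens is made up.

```python
import random

random.seed(0)
block_size = 32
vocab_size = 5          # hypothetical tiny vocabulary

def toy_next_token_probs(context):
    """Hypothetical stand-in for the model: favours (last token + 1) mod vocab_size."""
    probs = [0.05] * vocab_size
    probs[(context[-1] + 1) % vocab_size] = 0.8
    return probs

tokens = [0]            # start from a single token, like torch.zeros((1, 1))
context_lengths = []

for _ in range(40):
    context = tokens[-block_size:]                  # crop to the last block_size tokens
    context_lengths.append(len(context))
    probs = toy_next_token_probs(context)
    next_token = random.choices(range(vocab_size), weights=probs, k=1)[0]
    tokens.append(next_token)                       # the sequence grows by one

print(len(tokens))              # 41
print(max(context_lengths))     # 32
```

The sequence keeps growing, but the context handed to the "model" never exceeds block_size tokens.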

In [None]:
class BigramLanguageModel(nn.Module):

  def __init__(self, vocab_size):
    super().__init__()
    self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
    self.position_embedding_table = nn.Embedding(block_size, n_embed)
    self.blocks = nn.Sequential(*[Block(n_embed, n_head=n_head) for _ in range(n_layer)])
    self.ln_f = nn.LayerNorm(n_embed) # final layer norm
    self.lm_head = nn.Linear(n_embed, vocab_size)


  def forward(self, idx, targets=None):

    idx = idx.to(device) # Move input tensor to the correct device

    B, T = idx.shape

    # idx and targets are both (B,T) tensor of integers
    tok_emb = self.token_embedding_table(idx) # (B,T,C)
    pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
    x = tok_emb + pos_emb # (B,T,C)
    x = self.blocks(x) # (B,T,C)
    x = self.ln_f(x) # (B,T,C)
    logits = self.lm_head(x) # (B,T,vocab_size)
    if targets is None:
      loss = None
    else:
      targets = targets.to(device) # Move targets tensor to the correct device
      B, T, C = logits.shape
      logits = logits.view(B * T, C)
      targets = targets.view(B * T)
      loss = F.cross_entropy(logits, targets)

    return logits, loss

  def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        idx = idx.to(device) # Move input tensor to the correct device
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


model = BigramLanguageModel(size)
model.to(device) # Move the model to the correct device
logits, loss = model(xb, yb)


print(decode(model.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


WGkwwRaRMowWaCo’R.u
Wd hes
ed.sW.]Ebp cLpxH,TaWfcsopl[fx,lh’TCG,it,H.GRdOldlR.TxoHMI et’Wn
i’ ,p,[iw


In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate) # AdamW optimizer, using the learning rate from the hyperparameters cell

## Training Loop + Output

This training loop is like the culmination of everything we've built.

It generates batches of data and passes them into the model; the logits and loss are calculated, and every [eval_interval] steps the train + test loss is printed. This is essential because if we notice the train loss going down while the test loss goes up (or stops improving), it means the model is overfitting.

It also uses backpropagation to calculate learning signals (gradients) for every single parameter in the model, from the embedding tables to the attention weights to the final output layer. The optimizer then uses those gradients to update the parameters.

Then, the model finally generates max_new_tokens amount of tokens, which are decoded using our decoder and printed.
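A quick sanity check for the very first loss value: an untrained model knows nothing, so it should do about as well as guessing every character with equal probability, which gives a cross-entropy of ln(vocab size). The sketch below assumes a hypothetical vocabulary of 50 characters; the real number is whatever `size` is for your dataset. The two loss values in the loop are the first and last train losses from the log shown below.

```python
import math

vocab_size = 50     # hypothetical; use the notebook's `size` for the real dataset

# Cross-entropy when every character is given the same probability 1 / vocab_size.
uniform_loss = -math.log(1.0 / vocab_size)
print(round(uniform_loss, 4))    # 3.912

# A loss can be read as a probability: exp(-loss) is the geometric-mean
# probability the model assigned to the correct next character.
for loss in (3.9081, 0.0737):
    print(loss, round(math.exp(-loss), 3))
```

If the loss at step 0 is far above ln(vocab size), that usually points to a problem with the initialization or the data pipeline rather than with training itself.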

In [None]:
for step in range(max_iters): # named `step` so it does not shadow Python's built-in iter()

    # every once in a while evaluate the loss on train and val sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(model.generate(context, max_new_tokens=1000)[0].tolist()))

step 0: train loss 3.9081, val loss 3.9081
step 500: train loss 0.1609, val loss 0.1631
step 1000: train loss 0.0952, val loss 0.0955
step 1500: train loss 0.0845, val loss 0.0830
step 2000: train loss 0.0784, val loss 0.0796
step 2500: train loss 0.0778, val loss 0.0774
step 3000: train loss 0.0751, val loss 0.0750
step 3500: train loss 0.0745, val loss 0.0747
step 4000: train loss 0.0746, val loss 0.0754
step 4500: train loss 0.0737, val loss 0.0738
step 5000: train loss 0.0737, val loss 0.0750
step 5500: train loss 0.0731, val loss 0.0739
step 6000: train loss 0.0724, val loss 0.0738
step 6500: train loss 0.0733, val loss 0.0731
step 7000: train loss 0.0727, val loss 0.0728
step 7500: train loss 0.0726, val loss 0.0726
step 8000: train loss 0.0711, val loss 0.0714
step 8500: train loss 0.0714, val loss 0.0714
step 9000: train loss 0.0709, val loss 0.0702
step 9500: train loss 0.0708, val loss 0.0714


Greetings, esteemed delegate.
We propose a resolution that ensures stability and c