### Generative Pretrained Transformer (GPT)
In this project, I built a decoder-only GPT (11.22M parameters) and trained it on roughly 4 million tokens of data for about 7 hours on a T4 GPU (16 GB VRAM). You can find the hyperparameters a couple of cells below. I will go through each code cell, explain why I did what I did, and document some of the ideas and implementation methods that I found interesting.

The model operates in an autoregressive manner, using masked self-attention to ensure that each token prediction depends only on previously generated tokens. Positional information is added to the token embeddings. I originally wanted to use modern techniques like ALiBi or RoPE for the positional embeddings, but a simple learned positional embedding layer (nn.Embedding) gave satisfactory results for a ~12M-parameter model.

The model was trained with *teacher forcing*: at every position it predicts the next token given the ground-truth tokens that precede it. This particular model does not have special tokens like [BOS], [EOS] or [SEP], as I meant it to generate text indefinitely.

![alt text](transformers-dark.webp)

*HuggingFace.co*

The above architecture was implemented in the code. However, the encoder block was omitted, as encoders are generally not used in generative Transformers, and a few other modifications were made that will be discussed later.

The model was later fine-tuned on ~30k tokens of poem summaries. This was done to teach the model a grammatical structure to follow while still feeding it tokens it is familiar with (poem summaries tend to have a similar vocabulary to the poems themselves).

##### *References* :-
*Attention is all you need (2017)*

*Language Models are Unsupervised Multitask Learners (2019)*

*Andrej Karpathy - Youtube*

*Stanford CME295 Transformers & LLMs - Youtube*

I imported the tokenizer and tokenizer-trainer classes to train my own tokenizer. Initially I used the Whitespace pre-tokenizer, where the text is first split into words, but that resulted in sub-optimal generation with a lot of broken words, and it did not handle unknown tokens well. Hence I switched to the ByteLevel pre-tokenizer to prevent OOV (out-of-vocabulary) tokens.
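To make the OOV point concrete, here is a rough illustration in plain Python (it does not use the tokenizers library, and the example string is made up). A byte-level scheme starts from the 256 possible byte values, so any string, including characters never seen during training, can be expressed in symbols from that fixed base alphabet. Whether every byte symbol actually ends up in a trained vocabulary depends on the trainer settings, so treat this as a sketch of the idea only.

```python
def to_byte_symbols(text):
    """Return the UTF-8 byte values that make up `text`."""
    return list(text.encode("utf-8"))

# Hypothetical input containing characters a small poem corpus may never contain.
unseen_word = "naïve café ☕"

byte_symbols = to_byte_symbols(unseen_word)

# Every symbol falls inside the fixed 0..255 base alphabet.
all_in_base_alphabet = all(0 <= b < 256 for b in byte_symbols)

# The mapping is lossless: decoding the bytes restores the original string.
round_trip = bytes(byte_symbols).decode("utf-8")

print(len(unseen_word), "characters ->", len(byte_symbols), "byte symbols")
print(all_in_base_alphabet, round_trip == unseen_word)
```

BPE merges then build larger sub-words on top of these byte symbols, which is why rare words get split into several pieces instead of becoming an unknown token.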

In [1]:
%pip install torchinfo
import pandas as pd
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer
import torch
import torch.nn as nn
from torchinfo import summary
import time
import math
from torch.optim.lr_scheduler import LambdaLR

Collecting torchinfo
  Downloading torchinfo-1.8.0-py3-none-any.whl.metadata (21 kB)
Downloading torchinfo-1.8.0-py3-none-any.whl (23 kB)
Installing collected packages: torchinfo
Successfully installed torchinfo-1.8.0


In [2]:
device="cuda" if torch.cuda.is_available() else "cpu"
device

'cpu'

Fetching the Kaggle dataset and unzipping it.

In [3]:
!curl -L -o poem-dataset.zip \
https://www.kaggle.com/api/v1/datasets/download/marufchowdhury/poem-dataset

!unzip poem-dataset.zip

df=pd.read_csv("Poems_Dataset.csv")

df=df["Poem Content"]

data=df.tolist()

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 10.3M  100 10.3M    0     0  15.2M      0 --:--:-- --:--:-- --:--:-- 33.0M
Archive:  poem-dataset.zip
  inflating: Poems_Dataset.csv       
  inflating: poemDatasetWithSummary.csv  


The hyperparameters:

After running inference, the vocabulary size seemed to be the bottleneck of the model.

Here you can see that I set the model's head_size (the dimension of the key, query and value vectors) to n_embed/n_head; the reason is explained later.

I also made the fine-tuning learning rate much lower than the training learning rate (lr_ft is 5% of lr_t), to make sure I don't destroy the poem structure the model is supposed to generate.

In [4]:
#Hyper Parameters
context_window_length=128
batch_size=256
n_embed=288
n_head=9
n_layers=8
v_size=9000
head_size=n_embed//n_head
lr_t=0.00007
lr_ft=lr_t*0.05

Here I made the mistake of training the tokenizer on the entire dataset instead of just the training set, but I figured it would not make much of a difference.

In [5]:
tokenizer=Tokenizer(BPE())
tokenizer.pre_tokenizer=ByteLevel(add_prefix_space=True)
trainer=BpeTrainer(vocab_size=v_size)
tokenizer.train_from_iterator(data,trainer)
tokenizer.save("Tokenizor.json")

Testing the tokenizer

In [6]:
out=tokenizer.encode("jungle wind wood ocean")
out.tokens,out.ids

(['Ġjungle', 'Ġwind', 'Ġwood', 'Ġocean'], [8086, 559, 1099, 1709])

Getting the tokens of the entire dataset 

In [8]:
all_ids=[]

for s in data:
    all_ids.extend(tokenizer.encode(s).ids)

idss=torch.tensor(all_ids,dtype=torch.long).to(device)
len(idss)

5462341

Splitting the tokens into training and validation sets. In hindsight, I could have used a library helper such as train_test_split (with shuffle=False to keep the token order), which would have been much cleaner.

In [9]:
full=idss[:5462341-20000]
val_ids=idss[-20000:-1]
len(val_ids),len(full)


(19999, 5442341)

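I made a generator that yields batches of x and y as PyTorch tensors, where every row has a fixed length (context_window_length). y is x shifted one token to the right, and a trailing partial batch is dropped.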

In [10]:
def generator(ids,batch_size,cwl):
    X=[]
    Y=[]
    count=0

    for i in range(len(ids)-cwl):
        X.append(ids[i:i+cwl])
        Y.append(ids[i+1:i+cwl+1])
        count+=1

        if count==batch_size:
            yield torch.stack(X).to(device),torch.stack(Y).to(device)
            X=[]
            Y=[]
            count=0
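The sliding-window logic of the generator can be illustrated on a toy sequence. This is a minimal sketch in plain Python lists rather than tensors; `sliding_windows` and `toy_ids` are hypothetical names used only for this illustration.

```python
def sliding_windows(ids, cwl):
    """Yield (x, y) pairs where y is x shifted one position to the right."""
    for i in range(len(ids) - cwl):
        yield ids[i:i + cwl], ids[i + 1:i + cwl + 1]

toy_ids = [10, 11, 12, 13, 14, 15]   # stand-in for token ids
pairs = list(sliding_windows(toy_ids, cwl=3))

for x, y in pairs:
    print(x, "->", y)

# Under teacher forcing, position t of y is the target for the prefix x[:t+1].
x0, y0 = pairs[0]
targets = [(x0[:t + 1], y0[t]) for t in range(len(x0))]
print(targets)

# Number of full batches per epoch for the training split used in this notebook.
full_batches = (5442341 - 128) // 256
print(full_batches)
```

Because the window advances by one token at a time, consecutive rows overlap almost completely, and one pass over the training tokens takes roughly 21k batches of 256 windows.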

I made a class for a single attention head, and later a class that groups multiple heads together using torch.nn.ModuleList.

To implement masked attention, where each token can only "see" the tokens preceding it, I used an interesting implementation from Andrej Karpathy's NanoGPT repo. The same trick is used in the GPT-2 PyTorch version you can find on the Hugging Face GitHub.

The idea is to create a lower-triangular matrix of shape (TxT), where T is the number of tokens in the sequence (at most the context window length), and use it to mask the (BxTxT) attention weights.

#### Why TxT??

We get TxT because, in the attention formula, we first calculate the dot product of the keys and queries of every token with each other.
Here a key represents the learnable features (n_features=head_size) that each token "represents". Basically, the key is the "identity" of a token.

A query, on the other hand, is the "set of features" each token best matches with or searches for. So the higher the dot product of a key and a query, the higher the alignment in the learned feature space, which implies they share more "context".

So the BxTxT matrix can be thought of as a lookup table of context logits of all tokens w.r.t. each other.

But the problem here is that the weight matrix we have so far is not masked attention yet. It is bidirectional attention, which is commonly used in encoder-only models like BERT for tasks such as sentiment analysis.
Hence we apply a lower-triangular causal mask to the attention logits, setting all future positions to −∞ before the softmax.

This could be done in several ways, for example with plain Python loops, but vectorizing the operations in PyTorch saves a lot of computational time.

And finally, we matrix-multiply the softmax-normalized attention weights with the value vectors to obtain a weighted sum of information.

In [None]:
class AttentionHead(nn.Module):
    def __init__(self,head_size):
        super().__init__()
        self.head_size=head_size  # kept on the instance so forward() does not rely on the global
        self.key=nn.Linear(n_embed,head_size,bias=False) #(B,T,C)-->(B,T,H)
        self.query=nn.Linear(n_embed,head_size,bias=False) #(B,T,C)-->(B,T,H)
        self.value=nn.Linear(n_embed,head_size,bias=False)  #(B,T,C)-->(B,T,H)
        #self.dropout=nn.Dropout(0.2)

    def forward(self,x):
        k=self.key(x)     #(B,T,H)
        q=self.query(x)   #(B,T,H)
        v=self.value(x)   #(B,T,H)

        # Scaled dot product of k and q. The usual convention is q @ k^T; here the two
        # projections swap roles, which is equivalent up to the naming of the layers.
        weights=k@q.transpose(-2,-1)*self.head_size**-0.5  # (B,T,H) x (B,H,T) --> (B,T,T)
        T=x.size(1)
        mask=torch.tril(torch.ones(T,T,device=x.device))  # 1 on and below the diagonal
        weights=weights.masked_fill(mask==0,float('-inf'))  # hide future positions
        weights=nn.functional.softmax(weights,dim=-1)  # each row sums to 1
        #weights = self.dropout(weights)

        output=weights@v #(B,T,T) x (B,T,H) --> (B,T,H)
        return output
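To see what the causal mask does to the scores, here is a small sketch in plain Python (no PyTorch), assuming a made-up 3 x 3 score matrix. `causal_softmax` is a hypothetical helper written for this illustration; it mirrors the masked_fill plus softmax steps on a single sequence.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a T x T score matrix, hiding future positions."""
    T = len(scores)
    weights = []
    for i in range(T):
        # Position i may only look at positions 0..i; the rest become -inf.
        row = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        m = max(row)
        exps = [math.exp(s - m) for s in row]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy score matrix (hypothetical values, assumed already scaled by head_size**-0.5).
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 1.0, 3.0],
    [0.2, 0.4, 0.6],
]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first row puts all of its weight on the first token, the second row splits its weight between the first two tokens and ignores the large score of 3.0 at the future position, and only the last row attends over the whole sequence.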

You can see I simply used nn.ModuleList to hold the multiple heads. One important detail: I deliberately chose n_head and head_size so that their product equals the embedding dimension (this is also done in the NanoGPT repo). The concatenated head outputs then already have shape (B,T,C), which let me leave out the linear projection layer and save compute.

In [None]:
class MultiHead(nn.Module):
    def __init__(self,n_head,head_size):
        super().__init__()
        self.heads=nn.ModuleList([AttentionHead(head_size) for _ in range(n_head)])
        #self.project=nn.Linear(n_head*head_size,n_embed)
        self.dropout=nn.Dropout(0.2)
    def forward(self,x):
        out=torch.cat([h(x) for h in self.heads],dim=-1)  # (B,T,H*N)
        #out=self.project(out)  # (B,T,H*N) --> (B,T,C) 
        out = self.dropout(out)
        return out

This is the feed-forward layer, where a lot of the "thinking" of the block happens. In the *Attention is all you need* paper, the hidden layer is 4*n_embed wide (the output size of the first linear layer and the input size of the second), but to save compute I used 3x.

In [None]:
class FeedForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.FF=nn.Sequential(
            nn.Linear(n_embed,3*n_embed),
            nn.GELU(),
            nn.Linear(3*n_embed,n_embed),
            nn.Dropout(0.2)
        )

    def forward(self,x):
        return self.FF(x)

One very important change from the original GPT architecture is the use of Pre-LayerNorm instead of Post-LayerNorm, as described in the *Language Models are Unsupervised Multitask Learners (2019)* paper.

You can also see that a block does not return the output of its sub-layers directly. Instead, the input is incremented by the output of the attention layer and then by the output of the feed-forward layer (residual connections). The idea is that each Transformer block does not give its own fresh interpretation of the data; it adds to the existing information, bending the vectors a little further toward a meaningful representation. This helps the model "reason" well as more blocks are stacked.

In [None]:
class Block(nn.Module):
    def __init__(self,n_embed,n_head):
        super().__init__()
        head_size=n_embed//n_head
        self.SelfAtt = MultiHead(n_head, head_size)
        self.ffwd = FeedForward()
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)

    def forward(self,x):
        x=x + self.SelfAtt(self.ln1(x)) 
        x=x + self.ffwd(self.ln2(x))
        return x  #(B,T,C)

This is the final GPT class, where I combine all the previous classes to build the decoder architecture. I also add the positional embeddings here.

Another change from the original GPT architecture is the inclusion of one final LayerNorm after all the blocks, before the linear projection from the embeddings to the vocabulary.

In the generate function, I take only the last 128 input tokens as the input to the model, since my model has a fixed context window length (128 tokens in this case).

THE MOST IMPORTANT CHANGE I made to my model, the one with the greatest impact, was adding temperature.
Temperature is a scaling factor applied to the output logits before the softmax. Values above 1 flatten the token distribution, which makes the model more "creative" when generating text, while values below 1 sharpen its peak.

In my case, the model produced fewer and fewer broken words as I reduced the temperature.
For example,
For example,


### Temperature = 1:

Enter Prompt / Initial tokens: cool breeze

Generated Text:

*cool breeze .*  
*We walk the rest of the drowned* . 
*It took the sub stit ution on earth* .  
*Be unable up from the hills  circ uit fountain through the P ver , contain the chill .  The new - black continent , s ales are walking* *holes  them inside the shade of May ; we hear*
*The meal simply lod ges like a horse*
*that sw irls Fl ots ,  Time of ir ast the dark - bl ades .*
*So ings d ent ird to its nest in ils ,  *
*As the read est leaf of clouds ,*
*A ng en she ind ulating the road to the dis he ast ,  Pl acks*

Here you can see that after the initial 15-20 tokens the model begins to produce broken sub-words. This gets worse toward the end of the generation. The word breaking was much worse before I fine-tuned the model on English grammar.


### Temperature = 0.57

Enter Prompt / Initial tokens: cool breeze

Generated Text: 

*cool breeze .* 
*The natural world of my voice .   Or I had a house ,*
*and I meant to contain the time  into the walls of the sun ,*
*and the bottom of my body ,  our first , and the world ,*
*and my mother who had*
*a woman*
*their long .*
*And when she was*
*or the way*
*of the word was*
*to it ,   the coming back*
*and the time*
*of her hands   that was*
*of the way*
*the light   the way*
*of the same  of the body*

Here we can see that there is not a single broken sub-word.
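The effect of temperature on a distribution can be reproduced with a few lines of plain Python. This is a sketch with made-up logits, not output from the model; `softmax_with_temperature` is a hypothetical helper that divides the logits by the temperature before applying the softmax.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0]   # hypothetical next-token logits
probs_t1 = softmax_with_temperature(logits, 1.0)
probs_t057 = softmax_with_temperature(logits, 0.57)

print("T=1.00:", [round(p, 3) for p in probs_t1])
print("T=0.57:", [round(p, 3) for p in probs_t057])
```

At the lower temperature the most likely token takes a larger share of the probability mass and the unlikely tokens are sampled less often, which is consistent with fewer rare sub-word fragments appearing in the generated text.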


In [None]:
class GPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed=nn.Embedding(v_size,n_embed)  # (B,T) --> (B,T,C)
        self.pos_embed=nn.Embedding(context_window_length,n_embed) # (T) --> (T,C)

        self.blocks=nn.Sequential(*[Block(n_embed,n_head) for _ in range(n_layers)])
        self.final_layernorm = nn.LayerNorm(n_embed) # final layer norm
        self.lm_head = nn.Linear(n_embed, v_size)

    def forward(self,x):
        # x ==> (B,T)

        tok_embeds=self.embed(x) # (B,T,C)
        pos_embeds=self.pos_embed(torch.arange(x.size(1),device=x.device)) #(T,C)
        x=tok_embeds + pos_embeds # pos_embed r broadcasted and added to every batch element

        x=self.blocks(x)
        x=self.final_layernorm(x)
        logits=self.lm_head(x)

        return logits


    @torch.no_grad()
    def generate(model,idx,max_new_tokens,temperature=1.0):
        # `model` plays the role of `self` here
        for _ in range(max_new_tokens):
            # Crop the input to the last context_window_length tokens
            if idx.size(1)>context_window_length:
                idx_cond=idx[:,-context_window_length:]
            else:
                idx_cond=idx

            logits=model(idx_cond)
            # temperature=1.0 leaves the distribution unchanged; lower values sharpen it
            probs=torch.softmax(logits[:,-1,:]/temperature,dim=-1)
            next_token=torch.multinomial(probs,1)  # Choosing one token based on the probability distribution
            idx=torch.cat((idx,next_token),dim=1) # Adding the new token to the existing sequence

        return idx


Here I did two new things on top of the normal training process:

1. torch.compile(), which makes training faster after the first few batches.

2. An LR scheduler (cosine decay via LambdaLR) that reduces the learning rate after every epoch.

In [None]:
model=GPT().to(device)
model=torch.compile(model)
optimizer=torch.optim.AdamW(model.parameters(),lr=lr_t,fused=True)
criterion=nn.CrossEntropyLoss()
epochs=20
def lr_lambda(epoch):
    return 0.5*(1+math.cos(math.pi*epoch/epochs))

scheduler=LambdaLR(optimizer,lr_lambda)
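For a feel of what this schedule does, the multiplier can be evaluated by hand. The sketch below reuses lr_t and the same lr_lambda formula in plain Python, without an optimizer; LambdaLR multiplies the initial learning rate by the value lr_lambda returns for the current epoch.

```python
import math

lr_t = 0.00007
epochs = 20

def lr_lambda(epoch):
    # Cosine decay from 1.0 at epoch 0 down to 0.0 at epoch `epochs`
    return 0.5 * (1 + math.cos(math.pi * epoch / epochs))

schedule = [lr_t * lr_lambda(e) for e in range(epochs + 1)]
for e in (0, 1, 2, 3, 10, 20):
    print(f"epoch {e:2d}: lr = {schedule[e]:.2e}")
```

The decay is slow at the start: after three epochs the learning rate is still above 90% of its initial value, and it reaches half of lr_t only at epoch 10.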


In [None]:
summary(model)
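As a cross-check on the 11.22M figure, the parameter count can be worked out by hand from the hyperparameters and the layer definitions in this notebook (bias-free key/query/value layers, a 3x feed-forward layer, two LayerNorms per block, one final LayerNorm). This is plain arithmetic, independent of torchinfo.

```python
n_embed, n_head, n_layers = 288, 9, 8
v_size, context_window_length = 9000, 128

tok_embed = v_size * n_embed                      # token embedding table
pos_embed = context_window_length * n_embed       # positional embedding table

# Per block: key, query and value projections across all heads, no bias
attention = 3 * n_embed * n_embed
# Per block: two linear layers with biases, hidden width 3 * n_embed
feed_forward = (n_embed * 3 * n_embed + 3 * n_embed) + (3 * n_embed * n_embed + n_embed)
# Per block: two LayerNorms, each with a weight and a bias vector
layer_norms = 2 * (2 * n_embed)
per_block = attention + feed_forward + layer_norms

final_ln = 2 * n_embed
lm_head = n_embed * v_size + v_size               # output projection with bias

total = tok_embed + pos_embed + n_layers * per_block + final_ln + lm_head
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")   # 11,220,840 parameters (~11.22M)
```

Close to half of the parameters sit in the token embedding and the output projection, which is one way to see why the vocabulary size has such a large influence on a model of this size.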

I also used GradScaler and autocast for mixed-precision training, so that most of the forward and backward computation runs in FP16 instead of FP32. This reduced the compute cost by a lot, and made training much more efficient on the T4 GPU that I got free access to through Google Colab and Kaggle, since the T4 is considerably faster at FP16 than at FP32.

The validation losses for the first 3 epochs were ~5.2, ~5.0 and ~4.98, so I stopped training after the 3rd epoch.
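The reason a gradient scaler is paired with FP16 can be shown without a GPU. The sketch below uses Python's struct module to round values to half precision; the gradient value is made up, and the scale of 2**16 is, as far as I know, the default initial scale of GradScaler (treat that as an assumption). The real scaler also adjusts the scale dynamically and skips steps with inf/NaN gradients, which is not modelled here.

```python
import struct

def to_fp16(x):
    """Round a Python float to half precision and back (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8          # hypothetical gradient value
scale = 2.0 ** 16         # assumed initial loss scale

unscaled = to_fp16(tiny_grad)          # too small for FP16, becomes 0.0
scaled = to_fp16(tiny_grad * scale)    # large enough to survive in FP16
recovered = scaled / scale             # unscaling is done in higher precision

print("stored directly in FP16:", unscaled)
print("scaled, stored, unscaled:", recovered)
```

Multiplying the loss by the scale multiplies every gradient by the same factor, so small gradients stay representable during the FP16 backward pass and are divided back down before the optimizer step.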

In [None]:
scaler=torch.amp.GradScaler('cuda')

for i in range(epochs):
    model.train()
    step=0
    start_epoch=time.time()
    last_print_time=start_epoch

    for x,y in generator(full,batch_size,context_window_length):
        optimizer.zero_grad(set_to_none=True)

        with torch.amp.autocast('cuda'):
            logits=model(x)
            logits=logits.view(-1,logits.size(-1))
            y=y.view(-1)
            loss=criterion(logits,y)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        step+=1
        if step%150==0:
            now=time.time()
            print(
                f"Epoch: {i+1}, "
                f"Step: {step}, "
                f"Loss: {loss.item():.4f}, "
                f"Time/150 batches: {(now-last_print_time):.2f} sec")
            last_print_time=now
        if step%1500==0:
            torch.save(model.state_dict(),"Temp_model.pt")
    scheduler.step()
        
    
    end_epoch=time.time()
    print(f"Epoch {i+1} total time: {(end_epoch-start_epoch):.2f} sec")
    torch.save(model.state_dict(),f"Mmodel_epoch__{i+1}.pt")

    # Average validation loss over all full batches of the validation split
    avg_loss=0
    count=0
    model.eval()
    with torch.no_grad():
        for x,y in generator(val_ids,batch_size,context_window_length):
            logits=model(x)
            logits=logits.view(-1,logits.size(-1))
            y=y.view(-1)
            loss=criterion(logits,y)
            count+=1
            avg_loss+=loss.item()

    print(f"Model_{i+1} Val Loss:{avg_loss/count}")  # i+1 matches the checkpoint file name

I decided to fine-tune the model because I noticed that, even though it produced complete words, its output had no structure. So I trained it on proper English sentences that share a similar vocabulary: the poem summaries from the original dataset.

In [None]:
# Fine-tune the model toward proper English grammar while keeping a similar vocabulary.
# I fine-tune on the poem summaries, using just 10% of the total summary text, to nudge
# the model in the right direction without completely changing its generation.
df_ft=pd.read_csv("/kaggle/working/poemDatasetWithSummary.csv")
x=df_ft["jist"].tolist()
ids_ft=[]
for i in x:
    ids_ft.extend(tokenizer.encode(i).ids)
ids_ft=ids_ft[:30000]
ids_ft=torch.tensor(ids_ft,dtype=torch.long).to(device)
len(ids_ft)

In [None]:
tokenizer=Tokenizer.from_file("Tokenizor.json")
sd=torch.load("Mmodel_epoch__3.pt",map_location=device)
sd={k.replace("_orig_mod.",""):v for k,v in sd.items()}
model_ft=GPT()
model_ft.load_state_dict(sd)
model_ft.to(device)
model_ft.eval()
print("Model Loaded")

In [None]:
optimizer_ft=torch.optim.AdamW(model_ft.parameters(),lr=lr_ft)
criterion=nn.CrossEntropyLoss()
epochs=2

In [None]:
model_ft.train()  # the model was put in eval mode when loaded; re-enable dropout for fine-tuning
for i in range(epochs):
    step=0
    curr=time.time()
    final=curr
    for x,y in generator(ids_ft,batch_size,context_window_length):
        # Use the fine-tuning optimizer, which holds model_ft's parameters
        optimizer_ft.zero_grad(set_to_none=True)
        logits=model_ft(x)
        logits=logits.view(-1,logits.shape[-1])
        y=y.view(-1)

        loss=criterion(logits,y)
        loss.backward()
        optimizer_ft.step()

        step+=1

        if step%10==0:
            final=time.time()
            print("Loss:",loss.item(),"Time:",final-curr)
            curr=final

print("Finetuning Done")

In [None]:
torch.save(model_ft.state_dict(),"Model_FineTuned.pt")