<a href="https://www.kaggle.com/code/ayushs9020/sentence-textual-similarity-on-kaggle-llm?scriptVersionId=138194711" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#FF000D; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #FF000D">Kaggle LLM</p>

In [1]:
import warnings
warnings.filterwarnings("ignore")

<div style="border-radius:10px; border:#FF000D solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
The $Kaggle - LLM$ $Science$ $Exam$ is a `competition` that challenges to `answer difficult science-based questions` written by a `Large Language Model` $(LLM)$. The `Goal` of the competition is to help `researchers better understand` the `ability of LLMs` to test themselves, and the `potential of LLMs` that can be run in resource-constrained environments.

The `dataset` for the competition was generated by giving `gpt3.5 snippets` of text on a range of `scientific topics pulled` from `Wikipedia`, and asking it to `write a multiple choice question` (with a known answer), then `filtering out easy questions`.

`Participants` in the competition are asked to `develop an LLM` that can `answer the questions` in the dataset `as accurately as possible`. The competition is scored using the `average precision` at `cutoff k metric`, where $k$ is the `number of predictions` made for each question.

Estimations says that the `largest models` run on `Kaggle` are around $10$ $Billion$ $Parameters$, whereas `gpt3.5 clocks` at $175$ $Billion$ $Parameters$. If a `question-answering model can ace` a test written by a `question-writing model` more than $10$ `times its size`, this would be a genuinely `interesting result`; on the `other hand` if a `larger model can effectively` `stump a smaller one`, this has `compelling implications` on the `ability of LLMs` to benchmark and test themselves.
    
Thanks to **[Radek Osmulski](https://www.kaggle.com/radek1)** for providing amazing dataset

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#FF85FF; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #FF85FF">1 | Approach 🛣️</p>

<div style="border-radius:10px; border:#FF85FF solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
* Replace the values in `answer` with actual answers instead of options
* Remove all columns except `['prompt' , 'answer']`
* Tokenizew the `prompt`
* Embed the `answers`
* Train a simple model that will predict `answers` `[768]` when given `answers` `tokens`

(Maybe update model and tokenizer further changing some shapes)
* Find the `Cosine_Similarity` between both the vectors
* Subtract $1$ from that loss
    
(Another Model)
* Calculate the similarity between the actual answer and other options (Sentecne Transformers)
* Subtract that similiarity from the loss
* Minimize the loss

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#FF028D; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #FF028D">2 | Data Preprocessing 📊</p>

In [2]:
import pandas as pd

<div style="border-radius:10px; border:#FF028D solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
Lets just focus on the training data

In [3]:
train = pd.concat(
    [
        pd.read_csv("/kaggle/input/kaggle-llm-science-exam/train.csv").drop("id" , axis = 1) , 
        pd.read_csv("/kaggle/input/additional-train-data-for-llm-science-exam/6000_train_examples.csv") , 
        pd.read_csv("/kaggle/input/additional-train-data-for-llm-science-exam/extra_train_set.csv")
    ] , axis = 0
)

train.head()

Unnamed: 0,prompt,A,B,C,D,E,answer
0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D


In [4]:
train.to_csv("/kaggle/working/Sample Data")
train = pd.read_csv("/kaggle/working/Sample Data")

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#AD0AFD; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #AD0AFD">3 | Tokenization 🌯</p>

In [5]:
! pip install -q sentence_transformers

In [6]:
import tqdm

import numpy as np
import os

import torch
import torch.nn as nn

from transformers import AutoTokenizer , AutoModel
from sentence_transformers import SentenceTransformer

<div style="border-radius:10px; border:#AD0AFD solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
We will use `RoBERTa` for gathering `Embeddings`. We might change this in further versions
    
## $RoBERTa$
    
$RoBERTa$ $Robustly$ $Optimized$ $BERT$ $Pretraining$ $Approach$ is a $Natural$ $Language$ $Processing$ $(NLP)$ model that was proposed in $2019$ by `Yinhan` Liu et al. It is a `reimplementation` of $BERT$ ($Bidirectional$ $Encoder$ $Representations$ from $Transformers$) with some `modifications` to the key `hyperparameters` and `minor embedding tweaks`. These modifications led to `significant performance gains` on a number of NLP tasks. $RoBERTa$ is based on the `transformer architecture`, which is a `Neural Network Architecture` that is particularly well-suited for NLP tasks. The transformer architecture uses `self-attention` to learn `long-range dependencies` between words in a sentence. This allows $RoBERTa$ to learn more `contextual representations` of words, which is important for many NLP tasks.

$RoBERTa$ is trained on a `massive dataset` of text and code. The dataset consists of `books`/`articles`/`code`. The dataset is `preprocessed` using `Byte-Level` `BPE` `(Byte Pair Encoding)`, which is a technique for tokenizing text into smaller units.

$RoBERTa$ is trained using a `Masked Language Modeling` ($MLM$) objective. In the MLM objective, some of the words in a `sentence are masked`, and the model is then trained to `predict the masked words`. This helps the model to `learn the contextual representations` of words.

In [7]:
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
# model = AutoModel.from_pretrained('roberta-base')

model = AutoModel.from_pretrained('roberta-base').to("cuda")

Downloading (…)lve/main/config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<div style="border-radius:10px; border:#AD0AFD solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">

Now we will make $2$ `Numpy` arrays, `(can make torch arrays as they are fast)`, Our first one will be 
* $Features$ `features` - This array will contain our `Tokens` of `prompt` in a `Numpy Array`
* $Targets$ `targets` - This array will contain our `Embeds` of `answer` in a `Numpy Array`
    
The `model(input_ids)` provide the output in `[B , T , C]`
* $B$ - $Batch$ $Size$ - Number of inputs for parallel processing
* $T$ - $Time$ - How long is the input
* $C$ - $Channel$ - Dimensions in the last layer of the modekl, this is specific to every model
    
We need to get the `Embeds` in $1D$, as the output we get from the model we will create might not give the same shape as $(B , T , C)$
We can ignore the $B$ for now as we are sending only $1$ Batch
    
I think a good way to get $1D$ $Array$ (for now) is to just choose the first row 

In [8]:
for index in tqdm.tqdm(range(train.shape[0])):
    train["answer"][index] = train[train["answer"][index]][index]

100%|██████████| 6700/6700 [00:03<00:00, 2146.48it/s]


In [9]:
features = np.empty(shape = train.shape[0] , dtype = np.ndarray)
targets = np.empty(shape = train.shape[0] , dtype = np.ndarray)

for index in tqdm.tqdm(range(train.shape[0])):

    features[index] = tokenizer(train["prompt"][index] , return_tensors = "np")["input_ids"]

    with torch.no_grad():
        output = model(tokenizer(train["answer"][index] , 
                                 return_tensors = "pt")["input_ids"].to("cuda"))[0][0][0]
        
        targets[index] = np.array([
            value.cpu().detach().numpy().tolist() 
            for value in output 
        ])

    torch.cuda.empty_cache()

100%|██████████| 6700/6700 [03:39<00:00, 30.48it/s]


<div style="border-radius:10px; border:#AD0AFD solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">

Now we will save these files so that we can use them later

In [10]:
os.makedirs("/kaggle/working/Input Data")

np.save("/kaggle/working/Input Data/Features" , features)
np.save("/kaggle/working/Input Data/Targets" , targets)

In [11]:
s_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

cos = nn.CosineSimilarity(dim=0, eps=1e-6)

Downloading (…)e9125/.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)7e55de9125/README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading (…)55de9125/config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading (…)125/data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)e9125/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading (…)9125/train_script.py:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading (…)7e55de9125/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)5de9125/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

<div style="border-radius:10px; border:#AD0AFD solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
  
Now we will use the incrorrect options to predict our correct option.
    
Think it like this
* We send tokens to our model(variable length)
* Then we get embeddings `[768]`
* Then we calculate `cosine_similarity`
* As the answer should be same, so we subtract $1$ from the similarity. As we want to get the minimum loss/ or we have to be more close to the correct sentence
* The same concept applies to in-correct answer
* We calculate the similarity of that in-correct answer with correct answer
* As same we subtract that similarity with predicted 

For the second type of data we will use `Sentence Transformers`, and try to predict the correct answer

We will first make a large array of size `[train.shape[0] * 4]` which will contain our secondary trainin data. This array will have mainly $3$ columns. 
* Tokens for prompt
* Embeddings for option
* Similarity between actual answer and that text

In [12]:
tr = np.empty(shape = train.shape[0] * 4 , dtype = np.ndarray)
counter = 0
for index in tqdm.tqdm(range(train.shape[0]) , total = train.shape[0]):
    
    row = np.empty(shape = 3 , dtype = np.ndarray)
    
    for column in ["A" , "B" , "C" , "D" , "E"]:
        if train["answer"][index] == train[column][index]:continue
        else : 
            sentence = [
                train["answer"][index] , 
                train[column][index]
            ]
            embeddings = s_model.encode(sentence , show_progress_bar = False)
            
            simi = cos(
                torch.tensor(embeddings[0] , dtype = torch.float32) , 
                torch.tensor(embeddings[1] , dtype = torch.float32)
            )
            row[0] = tokenizer(train["prompt"][index] , return_tensors = "np")["input_ids"]
            with torch.no_grad(): 
                out = model(tokenizer(train[column][index] , return_tensors = "pt")["input_ids"].to("cuda"))[0][0][0]
                row[1] = np.array([value.cpu().detach().numpy().tolist() for value in out])
            row[2] = simi
            
            tr[counter] = row
            counter += 1

100%|██████████| 6700/6700 [18:23<00:00,  6.07it/s]


We will further save the result in a `NPY` file

In [13]:
np.save("/kaggle/working/Input Data/Tr" , tr)

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#BC13FE; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #BC13FE">4 | DataLoader 💾 </p>

In [14]:
from torch.utils.data import Dataset , DataLoader

<div style="border-radius:10px; border:#BC13FE solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
Our `dataloder` will inherit the class `Dataset` which contain usefull assets of the main class. We will first load our data from the files we saved earlier and then we will apply some essential getters

In [15]:
class load_train(Dataset):
    
    def __init__(self):
        
        super(load_train , self).__init__()
        
        self.features = np.load("/kaggle/working/Input Data/Features.npy" , allow_pickle = True)
        self.targets = np.load("/kaggle/working/Input Data/Targets.npy" , allow_pickle = True)
        
    def __len__(self): return self.features.shape[0]
    
    def __getitem__(self , index):
        
        r_fea = torch.tensor(self.features[index] , dtype = torch.long)
        r_tar = torch.tensor(self.targets[index] , dtype = torch.float32)
        
        return r_fea , r_tar

In [16]:
train = load_train()

train_d = DataLoader(train)

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#0165FC; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #0165FC">5 | Model Setup ⚙️</p>

In [17]:
import torch.nn as nn

<div style="border-radius:10px; border:#0165FC solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
Now we will define our model. This time we will adopt a simple approach (for this run).  We will first define our model `(Roberta for this time)`. which will give us a vector of $768D$, We will then pass this to a Linear Layer of $768,768$ after that we will get the output

In [18]:
class mod(nn.Module):


    def __init__(self):
        super(mod , self).__init__()

        self.rmodel = AutoModel.from_pretrained('roberta-base')

        self.linear = nn.Linear(768 , 768)

    def forward(self , inputs):

        inp = self.rmodel(inputs)[0][0][0]
        output = self.linear(inp)

        return output

In [19]:
# model = mod()

model = mod().to("cuda")

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#00FF7F; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #00FF7F">6 | Training Arguments 💬</p>

<div style="border-radius:10px; border:#00FF7F solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
Now we will define some Training Arguments

In [20]:
cos = nn.CosineSimilarity(dim=0, eps=1e-6)

optim = torch.optim.Adam(model.parameters())

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#00FF00; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #00FF00">7 | Training Loop ⚖️</p>

In [21]:
from kaggle_secrets import UserSecretsClient

import wandb

<div style="border-radius:10px; border:#00FF00 solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
Now we will run the training loop, I am using Wandb to show results in a better way 

In [22]:
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("API LOGIN KEY")

wandb.login(key = secret_value_0)

wandb.init("STS(Extended) #1 | Roberta | Kaggle LLM")

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mayushsinghal659[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.15.7 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.15.5
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20230728_161449-6xz2fb8b[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mfresh-hill-28[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/ayushsinghal659/uncategorized[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/ayushsinghal659/uncategorized/runs/6xz2fb8b[0m


In [23]:
sec = np.load("/kaggle/working/Input Data/Tr.npy" , allow_pickle = True)[:26799]

In [24]:
wandb.watch(model , cos)
for x , y in tqdm.tqdm(zip(features , targets) , total = features.shape[0]):

    x = torch.tensor(x).to("cuda")
    y = torch.tensor([value.tolist() for value in y]).to("cuda")

    preds = model(x)

    loss = (1 - cos(preds , y))
    
    wandb.log({"loss" : loss})

    loss.backward()

    optim.step()
    
    torch.cuda.empty_cache()
    
for row in tqdm.tqdm(sec , total = sec.shape[0]):
    
    x , y , z = row[0] , row[1] , row[2]
    
    x = torch.tensor(x , dtype = torch.long).to("cuda")
    y = torch.tensor(y , dtype = torch.float32).to("cuda")
    z = torch.tensor(z , dtype = torch.float32).to("cuda")
    
    preds = model(x)
    
    loss = z - cos(preds , y)
    
    loss.backward()
    
    wandb.log({"loss" : loss})
    
    optim.step()
    
    torch.cuda.empty_cache()

100%|██████████| 6700/6700 [13:59<00:00,  7.98it/s]
100%|██████████| 26799/26799 [55:41<00:00,  8.02it/s]


# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#FFD700; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #FFD700">8 | Results Visualization 🏁</p>

<div style="border-radius:10px; border:#FFD700 solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
Now lets see how our model worked 

In [25]:
wandb.finish()

[34m[1mwandb[0m: Waiting for W&B process to finish... [32m(success).[0m
[34m[1mwandb[0m: 
[34m[1mwandb[0m: Run history:
[34m[1mwandb[0m: loss █▇▇▆▆▆▆▆▂▄▃▃▃▂▄▃▄▄▁▄▂▄▁▄▄▄▂▅▂▄▃▄▄▁▁▅▅▄▄▃
[34m[1mwandb[0m: 
[34m[1mwandb[0m: Run summary:
[34m[1mwandb[0m: loss -0.26207
[34m[1mwandb[0m: 
[34m[1mwandb[0m: 🚀 View run [33mfresh-hill-28[0m at: [34m[4mhttps://wandb.ai/ayushsinghal659/uncategorized/runs/6xz2fb8b[0m
[34m[1mwandb[0m: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
[34m[1mwandb[0m: Find logs at: [35m[1m./wandb/run-20230728_161449-6xz2fb8b/logs[0m


# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#E77200; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #E77200">9 | TO DO LIST 📑</p>

<div style="border-radius:10px; border:#E77200 solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
* $TO$ $DO$ $1$ $:$ $USE$ $BETTER$ $LOSS$ $FUNCTION$
* $TO$ $DO$ $1$ $:$ $DANCE$

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#FF9980; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #FF9980">10 | Ending 🎭</p>

<div style="border-radius:10px; border:#FF9980 solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
**THAT IT FOR TODAY GUYS**

**WE WILL GO DEEPER INTO THE DATA IN THE UPCOMING VERSIONS**

**PLEASE COMMENT YOUR THOUGHTS, HIHGLY APPRICIATED**

**DONT FORGET TO MAKE AN UPVOTE, IF YOU LIKED MY WORK $:)$**
    
<img src = "https://i.imgflip.com/19aadg.jpg">
    
**PEACE OUT $!!!$**