<a href="https://www.kaggle.com/code/ayushs9020/training-multiple-models-kaggle-llm?scriptVersionId=137578080" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#FF3131; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #FF3131">Kaggle LLM</p>

<div style="border-radius:10px; border:#FF3131 solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
The $Kaggle - LLM$ $Science$ $Exam$ is a `competition` that challenges participants to `answer difficult science-based questions` written by a `Large Language Model` $(LLM)$. The `Goal` of the competition is to help `researchers better understand` the `ability of LLMs` to test themselves, and the `potential of LLMs` that can be run in resource-constrained environments.

The `dataset` for the competition was generated by giving `gpt3.5 snippets` of text on a range of `scientific topics pulled` from `Wikipedia`, and asking it to `write a multiple choice question` (with a known answer), then `filtering out easy questions`.

`Participants` in the competition are asked to `develop a model` that can `answer the questions` in the dataset `as accurately as possible`. The competition is scored using `Mean Average Precision` at `cutoff k` $(MAP@k)$ with $k = 3$, so up to $3$ ranked `predictions` can be submitted for each question.
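The metric can be made concrete with a small sketch. It assumes each question has exactly one correct label and a cutoff of $k = 3$; the function names are hypothetical and not part of any Kaggle API.

```python
def average_precision_at_k(predicted, correct, k=3):
    """Score one question: 1/rank if the correct label is in the top k, else 0."""
    for rank, label in enumerate(predicted[:k], start=1):
        if label == correct:
            return 1.0 / rank
    return 0.0


def map_at_k(all_predicted, all_correct, k=3):
    """Mean of the per-question scores over the whole set."""
    scores = [
        average_precision_at_k(predicted, correct, k)
        for predicted, correct in zip(all_predicted, all_correct)
    ]
    return sum(scores) / len(scores)


# Three toy questions: correct at rank 1, correct at rank 2, and missing from the top 3
predictions = [["A", "B", "C"], ["D", "A", "E"], ["B", "C", "D"]]
answers = ["A", "A", "E"]
score = map_at_k(predictions, answers)   # (1 + 1/2 + 0) / 3 = 0.5
```

With a single correct answer per question, a right answer in second place still earns half credit, so ranking the options well matters more than only picking one.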

One estimate says that the `largest models` that run on `Kaggle` have around $10$ $Billion$ $Parameters$, whereas `gpt3.5 clocks` in at $175$ $Billion$ $Parameters$. If a `question-answering model can ace` a test written by a `question-writing model` more than $10$ `times its size`, this would be a genuinely `interesting result`; on the `other hand` if a `larger model can effectively` `stump a smaller one`, this has `compelling implications` on the `ability of LLMs` to benchmark and test themselves.
    
Thanks to **[Radek Osmulski](https://www.kaggle.com/radek1)** for providing the amazing additional dataset

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#FFF01F; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #FFF01F">1 | Goal 🧠</p>

<div style="border-radius:10px; border:#FFF01F solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
Our goal this time is to train different models and test them on the data to get better insights

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#39FF14; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #39FF14">2 | Data 🏆</p>

In [1]:
import pandas as pd 

<div style="border-radius:10px; border:#39FF14 solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
First we will concat the $2$ samples, which will result in a massive increase in size to $6,200$ rows

In [2]:
train = pd.concat(
    [
        pd.read_csv("/kaggle/input/kaggle-llm-science-exam/train.csv").drop("id" , axis = 1) ,
        pd.read_csv("/kaggle/input/additional-train-data-for-llm-science-exam/6000_train_examples.csv")
    ] , axis = 0
)

train.head()

Unnamed: 0,prompt,A,B,C,D,E,answer
0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D


<div style="border-radius:10px; border:#39FF14 solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">

We will save this as a csv file, so that we can use it later. Reading it back also gives the rows a fresh index from $0$ to $6,199$

In [3]:
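# index = False keeps the duplicated index of the concatenated frame out of the file
train.to_csv("Sample.csv" , index = False)
# reading the file back yields a clean 0..N-1 index
train = pd.read_csv("/kaggle/working/Sample.csv")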

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#1F51FF; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #1F51FF">3 | Data Preprocessing 📈</p>

In [4]:
import numpy as np 

<div style="border-radius:10px; border:#1F51FF solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
Now we will replace the Answers with some integer values 

In [5]:
# Map the answer letters A..E to the class indices 0..4
train["answer"] = train["answer"].map({"A" : 0 , "B" : 1 , "C" : 2 , "D" : 3 , "E" : 4})

<div style="border-radius:10px; border:#1F51FF solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
Now we will make groups for options and store them in a list

In [6]:
choices = [
    [
        train[column][index]
        for column in ["A" , "B" , "C" , "D" , "E"]
    ]
    for index in range(train.shape[0])
]

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#FF1493; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #FF1493">4 | Tokenization 💡</p>

In [7]:
from transformers import AutoTokenizer
import tqdm

<div style="border-radius:10px; border:#FF1493 solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
## $4.1 , 4.2$ $|$ $RoBERTa$
    
$RoBERTa$ $Robustly$ $Optimized$ $BERT$ $Pretraining$ $Approach$ is a $Natural$ $Language$ $Processing$ $(NLP)$ model that was proposed in $2019$ by `Yinhan` Liu et al. It is a `reimplementation` of $BERT$ ($Bidirectional$ $Encoder$ $Representations$ from $Transformers$) with some `modifications` to the key `hyperparameters` and `minor embedding tweaks`. These modifications led to `significant performance gains` on a number of NLP tasks. $RoBERTa$ is based on the `transformer architecture`, which is a `Neural Network Architecture` that is particularly well-suited for NLP tasks. The transformer architecture uses `self-attention` to learn `long-range dependencies` between words in a sentence. This allows $RoBERTa$ to learn more `contextual representations` of words, which is important for many NLP tasks.

$RoBERTa$ is trained on a `massive dataset` of English text. The dataset consists of `books`/`Wikipedia articles`/`news`/`web text`. The text is `tokenized` using `Byte-Level` `BPE` `(Byte Pair Encoding)`, which is a technique for splitting text into smaller subword units.

$RoBERTa$ is trained using a `Masked Language Modeling` ($MLM$) objective. In the MLM objective, some of the words in a `sentence are masked`, and the model is then trained to `predict the masked words`. This helps the model to `learn the contextual representations` of words.
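The masking idea can be sketched on a plain list of words. This is a toy illustration only: the helper name `mask_tokens` and the masking rate used in the demo are assumptions, and the real pretraining recipe works on subword tokens and is more involved than a simple replacement.

```python
import random


def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with the mask token.

    Returns the corrupted sequence and a dict {position: original token}
    holding the targets the model would be trained to predict.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets[position] = token
        else:
            corrupted.append(token)
    return corrupted, targets


sentence = "the transformer uses self attention to relate distant words".split()
corrupted, targets = mask_tokens(sentence, mask_prob=0.3)
```

The model only sees `corrupted`, and its loss is computed on the positions stored in `targets`, which is what pushes it to use the surrounding context.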

In [8]:
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
tokenizer

RobertaTokenizerFast(name_or_path='roberta-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)}, clean_up_tokenization_spaces=True)

<div style="border-radius:10px; border:#FF1493 solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">

## $4.3$ $|$ $ALBERT$

Remember to use the `Albert Tokenizer` with the `Albert Model`. The $RoBERTa$ tokenizer has a vocab size of $50,265$ tokens, whereas $AlBERT$ has a vocab size of $30,000$, so feeding $RoBERTa$ token ids to an $AlBERT$ model can result in `index out of range in self` errors
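A quick sanity check along these lines may help catch the mismatch before the forward pass. The helper `out_of_range_ids` and the token ids below are made up for illustration; only the two vocab sizes come from the tokenizer outputs in this notebook.

```python
def out_of_range_ids(input_ids, vocab_size):
    """Return the token ids that an embedding table with `vocab_size` rows cannot look up."""
    return sorted({i for row in input_ids for i in row if i < 0 or i >= vocab_size})


ROBERTA_VOCAB = 50265   # vocab_size reported by the roberta-base tokenizer
ALBERT_VOCAB = 30000    # vocab_size reported by the albert-base-v2 tokenizer

# Made-up ids standing in for the output of a RoBERTa tokenizer
fake_roberta_ids = [[0, 713, 45012, 2], [0, 50264, 2]]

bad_for_albert = out_of_range_ids(fake_roberta_ids, ALBERT_VOCAB)     # [45012, 50264]
bad_for_roberta = out_of_range_ids(fake_roberta_ids, ROBERTA_VOCAB)   # []
```

Any id listed in `bad_for_albert` points past the end of the smaller embedding table, which is the situation that triggers the error.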

In [9]:
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
tokenizer

AlbertTokenizerFast(name_or_path='albert-base-v2', vocab_size=30000, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '<unk>', 'sep_token': '[SEP]', 'pad_token': '<pad>', 'cls_token': '[CLS]', 'mask_token': AddedToken("[MASK]", rstrip=False, lstrip=True, single_word=False, normalized=False)}, clean_up_tokenization_spaces=True)

<div style="border-radius:10px; border:#FF1493 solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
Now we will tokenize each prompt together with each of its $5$ options, so that every question becomes $5$ prompt/option pairs

In [10]:
tokens = [
    tokenizer(
        [
            train["prompt"][index] ,
            train["prompt"][index] ,
            train["prompt"][index] ,
            train["prompt"][index] ,
            train["prompt"][index]
        ] ,
        choices[index] ,
        return_tensors = "pt" , padding = True
    )
    for index in tqdm.tqdm(range(train.shape[0]) , 
                           total = train.shape[0])
]

100%|██████████| 6200/6200 [00:13<00:00, 451.04it/s]


# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#BC13FE ; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #BC13FE ">5 | Model Setup 🤖</p>

In [11]:
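# AlbertForMultipleChoice is needed for the ALBERT model loaded further down
from transformers import AlbertForMultipleChoice , RobertaForMultipleChoice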



<div style="border-radius:10px; border:#BC13FE  solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
## $5.1 , 5.2$ $|$ $RoBERTa$
    
$Robustly$ $Optimized$ $BERT$ $Pretraining$ $Approach$ $(RoBERTa)$ is a `Natural Language Processing` $(NLP)$ model that is based on the $Bidirectional$ $Encoder$ $Representations$ $from$ $Transformers$ $(BERT)$ model. $RoBERTa$ is trained on a `massive dataset` of `text`, and after fine-tuning it is able to perform a variety of NLP tasks, including multiple choice.

For `Multiple Choice`, $RoBERTa$ is used to `predict` the `correct answer` to a `question` from a set of `answer choices`. Each `answer choice` is paired with the `question` and the pair is `encoded` into a `sequence of vectors`. A `pooled vector` for each pair is then `passed` through a `linear classifier`, which outputs one `score` per choice, and a `softmax` turns these scores into a `probability distribution` over the answer choices. The answer choice with the `highest probability` is then predicted to be the `correct answer`.
    
* $RoBERTa$ $Base$
* $RoBERTa$ $Large$
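The last step, going from per-choice scores to a ranked prediction, can be sketched without any model. The logits below are hypothetical numbers, and `top_k_choices` is an illustrative helper rather than a function from `transformers`.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_choices(logits, k=3, labels="ABCDE"):
    """Rank the answer letters by probability and keep the best k."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [labels[i] for i in ranked[:k]]


# Hypothetical scores for options A..E of a single question
example_logits = [0.2, 1.5, -0.3, 2.1, 0.0]
prediction = top_k_choices(example_logits)     # ['D', 'B', 'A']
```

Keeping the top $3$ letters instead of only the best one is convenient here, because the competition accepts several ranked guesses per question.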

In [12]:
model = RobertaForMultipleChoice.from_pretrained("roberta-base")

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForMultipleChoice: ['lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForMultipleChoice were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'classifier.weight', 'classifier.bias', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream t

In [13]:
model = RobertaForMultipleChoice.from_pretrained("roberta-large")

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForMultipleChoice: ['lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForMultipleChoice were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.bias', 'classifier.weight', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream

<div style="border-radius:10px; border:#BC13FE  solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
## $5.3$ $|$ $ALBERT$

In [14]:
model = AlbertForMultipleChoice.from_pretrained("albert-base-v2")

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#00FF7F ; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #00FF7F ">6 | Training Arguments 📄</p>

In [None]:
import torch 

<div style="border-radius:10px; border:#00FF7F   solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
$Adaptive$ $Moment$ $Estimation$ $(Adam)$ is an `Optimization Algorithm` that is used to train machine learning models. It is a $Stochastic$ $Gradient$ $Descent$ $(SGD)$ method that `combines the advantages` of two other SGD methods, $AdaGrad$ and $RMSProp$.

$Adam$ works by `maintaining estimates` of the `first`/`second` `moments` of the `gradients`. These estimates are used to `calculate` the `step size` for each parameter in the model. The `step size` is `adjusted dynamically`, so that it is `smaller` for parameters whose `gradients are consistently large` and `larger` for parameters whose `gradients are small or infrequent`.
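The update can be sketched for a single scalar parameter. This is a minimal illustration and not the `torch.optim.Adam` source; the function name `adam_step` is hypothetical, and the default values are chosen to mirror the usual Adam defaults.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m and v are the running first and second moment estimates; t is the
    1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad             # first moment (running mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad      # second moment (running mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)                   # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# On the first step the update size is close to lr, whatever the gradient scale
small, _, _ = adam_step(param=0.0, grad=0.01, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(param=0.0, grad=100.0, m=0.0, v=0.0, t=1)
```

Both `small` and `large` end up near `-0.001`: dividing by the square root of the second moment normalizes the step, so a gradient that is $10,000$ times bigger does not produce a bigger move.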

In [None]:
optim = torch.optim.Adam(model.parameters())

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#FF9D00; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #FF9D00">7 | Training Loop ⌛️</p>

<div style="border-radius:10px; border:#FF9D00 solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
I am not using the `Kaggle GPU` for this training; I have imported the results to Wandb
    
## $7.1$ $|$ $RoBERTa$ $Base$
    
The model took $2$ $Minutes$ to train, with a Batch Size of $1$
  
## $7.2$ $|$ $RoBERTa$ $Large$
    
The model took $5$ $Minutes$ to train, with a Batch Size of $1$
    
## $7.3$ $|$ $AlBERT$ $Base$ $v2$
    
The model took around $2$ $Minutes$ to train, with a Batch Size of $1$
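```
losses = []

model.to("cuda")   # the inputs below are moved to the GPU, so the model must be there too
model.train()

for index in tqdm.tqdm(range(train.shape[0])):

    enc = tokens[index].to("cuda")
    labels = torch.tensor([int(train["answer"][index])] , dtype = torch.long).to("cuda")

    optim.zero_grad()   # clear the gradients left over from the previous step

    # add a batch dimension: (5 , seq_len) -> (1 , 5 , seq_len)
    # no torch.no_grad() here, otherwise there is nothing to backpropagate
    output = model(**{k: v.unsqueeze(0) for k, v in enc.items()}, labels = labels)

    loss = output.loss

    loss.backward()   # compute the gradients that optim.step() will apply

    losses.append(loss.item())   # store a plain float instead of the tensor

    optim.step()

    torch.cuda.empty_cache()
```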

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#FF00FF; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #FF00FF">8 | Results ☑️</p>

In [None]:
from IPython.display import IFrame

<div style="border-radius:10px; border:#FF00FF solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
## $8.1$ $|$ $RoBERTa$ $Base$
    
loss	▄▄▄▃▃▅▄▄▄▄█▅▄▄▄▄▅▄▄▄▄▅▄▃▃▄▅▄▄▅▃▅▄▃▄▄▄▄▄▁

loss	1.611
    
## $8.2$ $|$ $RoBERTa$ $Large$
    
loss	▇▅▆▆▅▆▅▅▅▅█▅▆▅▅▆▁▅▅▃▅▅▆▅▅▅▅▃▅▅▅▆█▅▃▅▅▅▄▅

loss	1.61307
    
## $8.3$ $|$ $AlBERT$
    
loss	▆▇▆▆▆█▆▇▆▆▆▆▆▅▁▆▅▅▅▆▇▇▅▆▅▆▅▃▆▆▆▅▆▇▅▇▅▆▅▆

loss	1.59402
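To put the losses above in context, it may help to compare them with the cross-entropy of a model that assigns equal probability to all $5$ options, which is $\ln 5$. The sketch below only does that arithmetic; the dictionary restates the values reported above and the variable names are illustrative.

```python
import math

num_options = 5
# Cross-entropy of a uniform guess: -log(1 / num_options) = log(num_options)
uniform_loss = math.log(num_options)           # about 1.609

reported = {"RoBERTa Base": 1.611, "RoBERTa Large": 1.61307, "AlBERT": 1.59402}
gaps = {name: loss - uniform_loss for name, loss in reported.items()}
```

All three values sit within about $0.02$ of that baseline, so these runs may not have learned much beyond chance yet.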

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#0047A3; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #0047A3">9 | TO DO LIST 📝</p>

<div style="border-radius:10px; border:#0047A3 solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
* $TO$ $DO$ $1$ $:$ $MAKE$ $MORE$ $MODELS$ $FUNCTION$
* $TO$ $DO$ $2$ $:$ $DANCE$

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#00F900; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #00F900">10 | Ending 🏁</p>

<div style="border-radius:10px; border:#00F900 solid; padding: 15px; background-color: #F3f9ed; font-size:100%; text-align:left">
    
**THAT'S IT FOR TODAY GUYS**

**WE WILL GO DEEPER INTO THE DATA IN THE UPCOMING VERSIONS**

**PLEASE COMMENT YOUR THOUGHTS, HIGHLY APPRECIATED**

**DON'T FORGET TO UPVOTE, IF YOU LIKED MY WORK $:)$**
    
<img src = "https://i.imgflip.com/19aadg.jpg">
    
**PEACE OUT $!!!$**