<a href="https://www.kaggle.com/code/ayushs9020/understanding-the-competition-kaggle-llm?scriptVersionId=136571779" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Who Wants to Be a Millionaire

<img src = "https://m.media-amazon.com/images/M/MV5BZDE3YTNhNzctZjdiNy00YjZjLWE4MDMtOGJjODE2YjE3NDllXkEyXkFqcGdeQXVyODAzNzAwOTU@._V1_.jpg" width = 300>

The $Kaggle - LLM$ $Science$ $Exam$ is a `competition` that challenges participants to `answer difficult science-based questions` written by a `Large Language Model` $(LLM)$. The `Goal` of the competition is to help `researchers better understand` the `ability of LLMs` to test themselves, and the `potential of LLMs` that can be run in resource-constrained environments.

The `dataset` for the competition was generated by giving `gpt3.5 snippets` of text on a range of `scientific topics pulled` from `Wikipedia`, and asking it to `write a multiple choice question` (with a known answer), then `filtering out easy questions`.

`Participants` in the competition are asked to `develop an LLM` that can `answer the questions` in the dataset `as accurately as possible`. The competition is scored using `mean average precision` at `cutoff k` $(MAP@k)$ with $k = 3$: each question may receive up to $3$ `ranked predictions`, and a correct answer earns more credit the earlier it appears in that ranking.

One estimate says that the `largest models` that run on `Kaggle` have around $10$ $Billion$ $Parameters$, whereas `gpt3.5 clocks` in at $175$ $Billion$ $Parameters$. If a `question-answering model can ace` a test written by a `question-writing model` more than $10$ `times its size`, this would be a genuinely `interesting result`; on the `other hand`, if a `larger model can effectively` `stump a smaller one`, this has `compelling implications` for the `ability of LLMs` to benchmark and test themselves.

# 1 | Advisory

* This is a $Multi-Class$ $Classification$ $Problem$: each question has exactly one correct answer among the options `A` to `E`
* The evaluation is not done on $1$ predicted value but on multiple ranked values. Think of this like the `predict`/`predict_proba` methods in the `scikit-learn` library. The `predict` method returns the single best class, whereas `predict_proba` returns the probabilities of every class. For this competition we are asked to rank the classes, as if by `predict_proba`, and submit the best $3$
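To make the scoring concrete, here is a minimal sketch of mean average precision at cutoff $3$, assuming one correct answer per question (so a question scores $1/rank$ of the correct guess, or $0$ if it is missing from the top $3$). The function names and the toy guesses are illustrative assumptions, not the official scoring code.

```python
def average_precision_at_k(ranked_guesses, correct, k=3):
    """Score one question: 1/rank of the correct guess within the top k, else 0."""
    for rank, guess in enumerate(ranked_guesses[:k], start=1):
        if guess == correct:
            return 1.0 / rank
    return 0.0


def map_at_k(all_guesses, all_correct, k=3):
    """Average the per-question scores over the whole dataset."""
    scores = [average_precision_at_k(g, c, k) for g, c in zip(all_guesses, all_correct)]
    return sum(scores) / len(scores)


# Toy example (made-up guesses): correct at rank 1, at rank 2, and not in the top 3
guesses = [["D", "A", "B"], ["B", "A", "C"], ["C", "E", "B"]]
answers = ["D", "A", "A"]
score = map_at_k(guesses, answers)
print(score)  # (1 + 1/2 + 0) / 3 = 0.5
```

Because a correct answer in second place still earns half credit, it pays to rank all plausible options rather than commit to a single guess.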

# 2 | Data

In [1]:
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/kaggle-llm-science-exam/sample_submission.csv
/kaggle/input/kaggle-llm-science-exam/train.csv
/kaggle/input/kaggle-llm-science-exam/test.csv


## 2.1 | Train

This is a `csv` file that contains our `main training data` 

In [2]:
train = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/train.csv")
train

Unnamed: 0,id,prompt,A,B,C,D,E,answer
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D
...,...,...,...,...,...,...,...,...
195,195,What is the relation between the three moment ...,The three moment theorem expresses the relatio...,The three moment theorem is used to calculate ...,The three moment theorem describes the relatio...,The three moment theorem is used to calculate ...,The three moment theorem is used to derive the...,C
196,196,"What is the throttling process, and why is it ...",The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,The throttling process is a steady adiabatic f...,The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,B
197,197,What happens to excess base metal as a solutio...,"The excess base metal will often solidify, bec...",The excess base metal will often crystallize-o...,"The excess base metal will often dissolve, bec...","The excess base metal will often liquefy, beco...","The excess base metal will often evaporate, be...",B
198,198,"What is the relationship between mass, force, ...",Mass is a property that determines the weight ...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is a property that determines the size of...,D


## 2.2 | Test 

This is the testing dataset 

In [3]:
test = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv")

test

Unnamed: 0,id,prompt,A,B,C,D,E
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...
...,...,...,...,...,...,...,...
195,195,What is the relation between the three moment ...,The three moment theorem expresses the relatio...,The three moment theorem is used to calculate ...,The three moment theorem describes the relatio...,The three moment theorem is used to calculate ...,The three moment theorem is used to derive the...
196,196,"What is the throttling process, and why is it ...",The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,The throttling process is a steady adiabatic f...,The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...
197,197,What happens to excess base metal as a solutio...,"The excess base metal will often solidify, bec...",The excess base metal will often crystallize-o...,"The excess base metal will often dissolve, bec...","The excess base metal will often liquefy, beco...","The excess base metal will often evaporate, be..."
198,198,"What is the relationship between mass, force, ...",Mass is a property that determines the weight ...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is a property that determines the size of...


# 3 | Tokenizers

Now we will use different types of tokenizers for different types of results

* $BERT-Tokenizer$

In [4]:
import numpy as np

from transformers import BertTokenizer, BertModel



## 3.1 | BERT-Tokenizer

<img src = "https://datajenius.com/wp-content/uploads/2022/03/Screen-Shot-2022-03-13-at-12.24.34-PM-768x497.png" width = 400>

The $BERT$ $Tokenizer$ is a `pre-trained` `tokenizer` that is specifically `designed` for the `BERT Model`. It is a $WordPiece$ $Tokenizer$, a close relative of $Byte-Pair$ $Encoding$ $(BPE)$: both `break down text` into `tokens` by `iteratively merging pairs` of symbols, but BPE merges the `most frequent pair` while WordPiece picks the pair that most improves the likelihood of the training data. This allows the tokenizer to `learn a vocabulary of subwords`, which can be `more efficient` for `representing text` than a vocabulary of full words. Take the word `mousing`. A word-level tokenizer would treat it as a whole (and as unknown if it never saw it), whereas a subword tokenizer can split it into pieces such as `mous`/`##ing`, which still carry information for words that are missing from the vocabulary
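Applying a learned subword vocabulary to a single word can be sketched as a greedy longest-match-first split, which is how WordPiece-style tokenizers such as BERT's work at inference time. The vocabulary below is a made-up toy set, not the real `bert-base-cased` vocabulary, and the function name is an assumption for illustration.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary
toy_vocab = {"mous", "mouse", "##ing", "##e", "play", "##ed"}
print(wordpiece_tokenize("mousing", toy_vocab))  # ['mous', '##ing']
print(wordpiece_tokenize("played", toy_vocab))   # ['play', '##ed']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

This only shows how an existing vocabulary is applied; learning the vocabulary (the merge step) happens once, when the tokenizer is trained.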

The $Huggingface-Bert$/$Bert-Base-Cased$ $Tokenizer$ also includes a number of `special tokens` that are `used by the BERT model`. These tokens include the `[CLS]` token, which is placed at the `beginning of every input`, the `[SEP]` token, which `marks the end of a sentence` (and separates the two sentences of a pair), and the `[MASK]` token, which is used to `represent a masked token` during pre-training.
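As a rough sketch of how these special tokens frame an input, the snippet below wraps one or two already-tokenized sentences the way BERT expects: `[CLS]` first, `[SEP]` after each sentence, and a segment id of $0$ or $1$ per token (the `token_type_ids` field in the tokenizer output). The helper name and the example tokens are assumptions made for illustration.

```python
def build_inputs(tokens_a, tokens_b=None):
    """Frame one sentence, or a sentence pair, with [CLS] and [SEP]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # first sentence, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # second sentence and its [SEP]
    return tokens, segment_ids


single, single_segments = build_inputs(["What", "is", "mass", "?"])
pair, pair_segments = build_inputs(
    ["What", "is", "mass", "?"], ["An", "inertial", "property"]
)
print(single)         # ['[CLS]', 'What', 'is', 'mass', '?', '[SEP]']
print(pair_segments)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

A sentence pair is one natural way to feed a question together with a candidate answer, although this notebook embeds each text separately.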

<img src = "https://blogger.googleusercontent.com/img/a/AVvXsEi-pFW9FPFJ7p2Sspv8tCZrtnr3TSv2UAcxi780EhpVik9Q2m87tFHj4pppKOq7ZvrvywRhSB8yE2Sq9TzF3EwlWZ8byqlWgs_atSE3Wlw2tOLkUS4z0dlDBubktjzQB0XX359tJBj9IG3tHnD9_LLHUkaUU47b6GEgu0qTxP5f94TvAerpZ3Y2zxqF_g=w640-h414" width = 400>

In [5]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')


In [6]:
tokenizer

BertTokenizer(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

In [7]:
model = BertModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [8]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

Let's assume we have this sample text

```
Everybody Is A Gangster, Till You See The Monster
```

In [9]:
tokens = tokenizer("Everybody Is A Gangster, Till You See The Monster" , return_tensors = "pt")
tokens

{'input_ids': tensor([[  101, 14325,  2181,   138, 12469,  4648,   117, 22430,  1192,  3969,
          1109, 11701,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

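Passing the whole `tokens` object to the model as a single positional argument raises an error. The model treats it as `input_ids` and calls `.size()` on it, but `tokens` is the dict-like tokenizer output, not a tensor, so the attribute lookup fails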

```
In [1]: model(tokens)

Out [1]: 
Traceback (most recent call last):
  /opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:254 in __getattr__

    251
    252     def __getattr__(self, item: str):
    253         try:
  ❱ 254             return self.data[item]
    255         except KeyError:
    256             raise AttributeError
    257

KeyError: 'size'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  in <module>:1

  ❱ 1 model(tokens)
    2

  /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in _call_impl

    1498         if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks
    1499                 or _global_backward_pre_hooks or _global_backward_hooks
    1500                 or _global_forward_hooks or _global_forward_pre_hooks):
  ❱ 1501             return forward_call(*args, **kwargs)
    1502         # Do not call functions when jit is used
    1503         full_backward_hooks, non_full_backward_hooks = [], []
    1504         backward_pre_hooks = []

  /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:968 in forward

    965         if input_ids is not None and inputs_embeds is not None:
    966             raise ValueError("You cannot specify both input_ids and inputs_embeds at the
    967         elif input_ids is not None:
  ❱ 968             input_shape = input_ids.size()
    969         elif inputs_embeds is not None:
    970             input_shape = inputs_embeds.size()[:-1]
    971         else:

  /opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:256 in __getattr__

    253         try:
    254             return self.data[item]
    255         except KeyError:
  ❱ 256             raise AttributeError
    257
    258     def __getstate__(self):
    259         return {"data": self.data, "encodings": self._encodings}

AttributeError
```
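The failure can be reproduced without downloading anything. The sketch below uses stand-ins (`FakeTensor`, `FakeBatch` and `forward` are hypothetical names, not part of `transformers`) to show why a dict-like object fails when passed positionally, and why unpacking it with `**` works. With the real objects, the corresponding call would be `model(**tokens)`.

```python
class FakeTensor(list):
    """Stand-in for a tensor: a list that also answers .size()."""

    def size(self):
        return (len(self),)


class FakeBatch(dict):
    """Stand-in for the tokenizer output: has keys, but no .size() method."""


def forward(input_ids=None, attention_mask=None, token_type_ids=None):
    # Mimics the first thing the real forward pass does with input_ids
    return input_ids.size()


tokens = FakeBatch(
    input_ids=FakeTensor([101, 14325, 102]),
    attention_mask=FakeTensor([1, 1, 1]),
    token_type_ids=FakeTensor([0, 0, 0]),
)

try:
    forward(tokens)  # the whole batch lands in the input_ids slot
    positional_ok = True
except AttributeError:
    positional_ok = False

shape = forward(**tokens)  # each key is routed to the matching keyword argument
print(positional_ok, shape)  # False (3,)
```

Unpacking also forwards `attention_mask` and `token_type_ids`, which matters as soon as inputs are padded into batches.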

In [10]:
model(tokens["input_ids"])

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.2735,  0.1281, -0.1138,  ...,  0.0849,  0.2736, -0.1576],
         [-0.8027,  0.4616,  0.1243,  ...,  0.5486, -0.0444, -0.1121],
         [-0.1207,  0.2429,  0.4388,  ...,  0.2734,  0.1653, -0.2329],
         ...,
         [-0.3941, -0.0716, -0.0864,  ...,  0.6747,  0.0764,  0.1317],
         [ 0.3185, -0.4049, -0.0438,  ...,  0.0533, -0.1618, -0.1742],
         [ 0.0257,  0.7954, -0.0842,  ...,  0.5175,  0.1617, -0.4017]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.6699,  0.5028,  0.9999, -0.9948,  0.9714,  0.8739,  0.9832, -0.9809,
         -0.9826, -0.7324,  0.9875,  0.9987, -0.9968, -0.9999,  0.7018, -0.9832,
          0.9917, -0.6042, -1.0000, -0.6883, -0.5744, -0.9999,  0.3807,  0.9633,
          0.9813,  0.0957,  0.9900,  1.0000,  0.8789,  0.1188,  0.3033, -0.9924,
          0.8438, -0.9994,  0.1987,  0.0854,  0.7423, -0.3160,  0.7678, -0.9025,
         -0.7755, -0.7057,  0.58

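Let's make a simple function that we can apply to every text field in our data. By default it returns BERT's hidden state for each token, and with `pool = True` it returns the single pooled vector for the whole text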

In [11]:
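import torch

def token(value , tokenizer = tokenizer , model = model , pool = False):
    """Embed `value` with BERT: pooled vector if `pool`, else per-token hidden states."""

    # Truncate to the model's maximum length (512 tokens) so long texts do not fail
    tokens = tokenizer(value , return_tensors = "pt" , truncation = True)

    # Inference only, so skip gradient tracking; ** forwards input_ids, mask and segment ids
    with torch.no_grad():
        output = model(**tokens)

    # output[1] is pooler_output, output[0] is last_hidden_state
    if pool:
        return output[1].numpy().squeeze()

    return output[0].numpy().squeeze()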

In [12]:
train["t_prompt"] = train["prompt"].apply(token)
train["t_A"] = train["A"].apply(token)
train["t_B"] = train["B"].apply(token)
train["t_C"] = train["C"].apply(token)
train["t_D"] = train["D"].apply(token)
train["t_E"] = train["E"].apply(token)

This is our data with the embedding columns (`t_prompt`, `t_A` ... `t_E`) attached

In [13]:
train

Unnamed: 0,id,prompt,A,B,C,D,E,answer,t_prompt,t_A,t_B,t_C,t_D,t_E
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D,"[[0.5954785, 0.016742375, 0.25113845, -0.33201...","[[0.26678404, -0.13396168, 0.048275813, -0.190...","[[0.3160099, -0.07156702, 0.03550779, -0.20390...","[[0.31307223, -0.15782145, 0.0039771437, -0.10...","[[0.39717036, 0.009571799, 0.02090334, -0.2079...","[[0.26498082, -0.11152987, 0.09861849, -0.1717..."
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A,"[[0.48086163, 0.13048011, 0.20981884, -0.29516...","[[0.6098233, 0.0972536, 0.16351525, -0.1299212...","[[0.6067397, 0.13667028, 0.18718192, -0.164518...","[[0.5301055, 0.0713303, 0.19720834, -0.1847760...","[[0.55116785, 0.08227803, 0.22088835, -0.18411...","[[0.5684599, 0.1910345, 0.14520799, -0.1389537..."
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A,"[[0.5633879, 0.022773514, 0.11718962, -0.14262...","[[0.43013036, -0.05444927, 0.04960315, -0.0816...","[[0.38208723, -0.1964721, -0.024255829, -0.127...","[[0.4089076, -0.109597705, 0.06722128, -0.0298...","[[0.21592978, -0.23158002, 0.028887426, -0.068...","[[0.38425943, -0.14076975, 0.044870295, -0.117..."
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C,"[[0.4242367, 0.13664639, 0.08093922, -0.279368...","[[0.4670743, -0.21700826, 0.16711907, -0.35785...","[[0.60160524, -0.082947515, 0.28576776, -0.489...","[[0.6052707, -0.08577526, 0.22512098, -0.48140...","[[0.5814356, -0.1303677, 0.22648613, -0.440707...","[[0.6130338, -0.036285013, 0.2662849, -0.43933..."
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D,"[[0.61632174, 0.092921205, 0.27170777, -0.2215...","[[0.37809604, -0.16092303, 0.16882758, -0.2595...","[[0.3694178, -0.15149622, 0.17266133, -0.25841...","[[0.2997911, -0.15832873, 0.09018601, -0.24680...","[[0.38440725, -0.15802866, 0.17178412, -0.2330...","[[0.36492893, -0.1719651, 0.18760608, -0.25752..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,195,What is the relation between the three moment ...,The three moment theorem expresses the relatio...,The three moment theorem is used to calculate ...,The three moment theorem describes the relatio...,The three moment theorem is used to calculate ...,The three moment theorem is used to derive the...,C,"[[0.37028533, 0.24675308, 0.11544872, -0.36422...","[[0.29882017, -0.027735448, 0.08780523, -0.200...","[[0.2983277, 0.115363, 0.08399957, -0.3266757,...","[[0.444976, 0.11159219, 0.034268778, -0.186582...","[[0.35315177, 0.019672012, 0.091255896, -0.301...","[[0.4171703, 0.00043702722, 0.13987462, -0.410..."
196,196,"What is the throttling process, and why is it ...",The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,The throttling process is a steady adiabatic f...,The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,B,"[[0.51109606, 0.0829629, 0.12613802, 0.0109595...","[[0.34055692, 0.06306243, 0.03861563, -0.19832...","[[0.36286342, 0.042207185, 0.014683261, -0.188...","[[0.35367307, 0.055113252, 0.017835828, -0.203...","[[0.3479987, 0.050205637, 0.03540563, -0.18161...","[[0.5213482, 0.006469099, -0.0054984107, -0.23..."
197,197,What happens to excess base metal as a solutio...,"The excess base metal will often solidify, bec...",The excess base metal will often crystallize-o...,"The excess base metal will often dissolve, bec...","The excess base metal will often liquefy, beco...","The excess base metal will often evaporate, be...",B,"[[0.39656845, 0.14130571, -0.013567732, -0.371...","[[0.45368078, -0.29536974, -0.049366374, -0.31...","[[0.46274835, -0.2906356, -0.04002542, -0.3202...","[[0.43663096, -0.30893922, -0.037753228, -0.30...","[[0.4456481, -0.31710142, -0.04196011, -0.3282...","[[0.43670073, -0.30175078, -0.05860339, -0.319..."
198,198,"What is the relationship between mass, force, ...",Mass is a property that determines the weight ...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is a property that determines the size of...,D,"[[0.36871165, 0.18375124, 0.015941879, -0.1210...","[[0.26379362, -0.057898458, 0.071916506, -0.24...","[[0.2760112, -0.16288757, 0.08594819, -0.20769...","[[0.274591, -0.15716897, 0.091395885, -0.20302...","[[0.28071156, -0.16528511, 0.084627695, -0.203...","[[0.25818092, -0.08729951, 0.086152844, -0.240..."


Let's export it to output files, so that it can be reused later without spending extra time/resources

In [14]:
# exist_ok avoids an error when the cell is run again
os.makedirs("/kaggle/working/Kaggle LLMs Embedments/CSV/Bert Based Tokenizer/" , exist_ok = True)
os.makedirs("/kaggle/working/Kaggle LLMs Embedments/Numpy/Bert Based Tokenizer/" , exist_ok = True)

In [15]:
train.to_csv("/kaggle/working/Kaggle LLMs Embedments/CSV/Bert Based Tokenizer/Train Embeds")

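We will also store this information in `npy` file format, which keeps the arrays themselves, whereas the `csv` file only holds their (possibly truncated) text representation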

In [16]:
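# The result is an object array (one embedding array per cell)
bert_based = train[["answer" , "t_prompt" , "t_A" , "t_B" , "t_C" , "t_D" , "t_E"]].to_numpy()

# np.save appends ".npy"; reload later with np.load(path , allow_pickle = True)
np.save("/kaggle/working/Kaggle LLMs Embedments/Numpy/Bert Based Tokenizer/Train Embeds" , bert_based)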

# 4 | TO DO LIST

```
TO DO 1 : VISUALIZE THE DATA 

TO DO 2 : EXPLORE THE DATA

TO DO 3 : TOKENIZE THE DATA 

TO DO 4 : MAKE A NPY TOKENIZED FILE

TO DO 5 : TRAIN A MODEL

TO DO 6 : WANDB SUPPORT

TO DO 7 : BETTER RESULTS

TO DO 8 : LESS TRAINING TIME

TO DO 9 : DANCE
```

# 5 | Ending 🚀

**THAT'S IT FOR TODAY, GUYS**

**WE WILL GO DEEPER INTO THE DATA IN THE UPCOMING VERSIONS**

**PLEASE COMMENT YOUR THOUGHTS, HIGHLY APPRECIATED**

**DON'T FORGET TO UPVOTE, IF YOU LIKED MY WORK $:)$**
 
<IMG SRC = "https://i.imgflip.com/19aadg.jpg">
    
**PEACE OUT $!!!$**