# <center style="font-family: consolas; font-size: 32px; font-weight: bold;"> Kaggle - LLM Science Exam</center>
<p><center style="color:#949494; font-family: consolas; font-size: 20px;">Use LLMs to answer difficult science questions</center></p>

***

# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">(ಠಿ⁠_⁠ಠ) Overview</center>

<p style="font-family: consolas; font-size: 16px;">⚪ The goal of the competition is to answer difficult science-based questions written by a Large Language Model (LLM).</p>

<p style="font-family: consolas; font-size: 16px;">⚪ The competition aims to help researchers understand the ability of LLMs to test themselves and explore the potential of LLMs that can be run in resource-constrained environments.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ The scope of large language model capabilities is expanding, and researchers are using LLMs to characterize themselves.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ Many existing natural language processing benchmarks have become trivial for state-of-the-art models, so there is a need to create more challenging tasks to test increasingly powerful models.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ The dataset for the competition was generated by giving GPT-3.5 snippets of text on various scientific topics and asking it to write multiple-choice questions (with known answers). Easy questions were filtered out.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ The largest models currently run on Kaggle have around 10 billion parameters, while GPT-3.5 has 175 billion parameters.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ The competition explores whether a question-answering model more than 10 times smaller than GPT-3.5 can effectively answer questions written by GPT-3.5. The results will shed light on the benchmarking and self-testing capabilities of LLMs.</p>

#### <a id="top"></a>
# <div style="box-shadow: rgb(60, 121, 245) 0px 0px 0px 3px inset, rgb(255, 255, 255) 10px -10px 0px -3px, rgb(31, 193, 27) 10px -10px, rgb(255, 255, 255) 20px -20px 0px -3px, rgb(255, 217, 19) 20px -20px, rgb(255, 255, 255) 30px -30px 0px -3px, rgb(255, 156, 85) 30px -30px, rgb(255, 255, 255) 40px -40px 0px -3px, rgb(255, 85, 85) 40px -40px; padding:20px; margin-right: 40px; font-size:30px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(60, 121, 245);"><b>Table of contents</b></div>

<div style="background-color: rgba(60, 121, 245, 0.03); padding:30px; font-size:15px; font-family: consolas;">
<ul>
    <li><a href="#0" target="_self" rel=" noreferrer nofollow">0. Import all dependencies</a></li>
    <li><a href="#1" target="_self" rel=" noreferrer nofollow">1. Data overview</a></li>
    <li><a href="#3" target="_self" rel=" noreferrer nofollow">3. Text classification</a></li>
</ul>
</div>

<a id="0"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 0. Import all dependencies </b></div>

In [1]:
import json

import numpy as np
import pandas as pd
import plotly.express as px

<a id="1"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 1. Data overview</b></div>

<p style="font-family: consolas; font-size: 16px;">⚪ The dataset for this competition consists of multiple-choice questions generated by a Large Language Model (LLM).</p>

<p style="font-family: consolas; font-size: 16px;">⚪ The questions are accompanied by options labeled A, B, C, D, and E, and each question has a correct answer labeled "answer".</p>

<p style="font-family: consolas; font-size: 16px;">⚪ The goal is to predict the top three most probable answers given a question prompt. </p>
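
<p style="font-family: consolas; font-size: 16px;">⚪ Because up to three ranked answers are submitted per question, a natural way to score them is Mean Average Precision at 3 (MAP@3). The sketch below assumes that metric: a question scores 1, 1/2, or 1/3 depending on the rank at which the correct label appears, and 0 if it is not among the three guesses. The function names and the toy data are hypothetical and only illustrate the idea.</p>

```python
def average_precision_at_3(predicted, correct):
    """Score one question by the rank of the correct label among 3 guesses."""
    for rank, label in enumerate(predicted[:3], start=1):
        if label == correct:
            return 1.0 / rank
    return 0.0


def map_at_3(predictions, answers):
    """Average the per-question scores over the whole set."""
    scores = [average_precision_at_3(p, a) for p, a in zip(predictions, answers)]
    return sum(scores) / len(scores)


# Toy data (hypothetical): three questions, three ranked guesses each
predictions = [["A", "B", "C"], ["B", "A", "D"], ["C", "D", "E"]]
answers = ["A", "A", "B"]

score = map_at_3(predictions, answers)  # (1 + 1/2 + 0) / 3 = 0.5
print(score)
```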


<a id="1.1"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 1.1 train.csv</b></div>

<p style="font-family: consolas; font-size: 16px;">⚪ The train.csv file contains <b>200 questions</b> with their corresponding correct answers.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ Each question consists of a prompt (the question text) and options <b>A, B, C, D, and E</b>.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ The correct answer is indicated by the <code>answer</code> column, which contains the label of the most correct answer, as defined by the generating LLM. </p>


In [2]:
train_df = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/train.csv")

<a id="1.2"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 1.2 test.csv</b></div>

<p style="font-family: consolas; font-size: 16px;">⚪ The test.csv file contains the test set for the competition.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ <b>The task is to predict the top <code>3</code> most probable answers</b> for each question prompt.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ The test set has the same format as the training set: a prompt (the question text) and options A, B, C, D, and E.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ The test set has approximately 4,000 different prompts, which may differ in subject matter from the training set.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ <b>NOTE</b>: The test data you see here is just a copy of the training data without the answers.</p>


In [3]:
test_df = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv")

In [4]:
stem_1k_df = pd.read_csv("/kaggle/input/wikipedia-stem-1k/stem_1k_v1.csv")

<a id="3"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 3. Text classification</b></div>

In [5]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

from datasets import Dataset
from transformers import AutoTokenizer
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer


In [6]:
len(stem_1k_df)*0.1

100.0

In [7]:
len(train_df)

200

In [8]:
# eval_sampled_train_df = train_df.sample(frac=0.5, random_state=42)
# eval_sampled_stem_df = stem_1k_df.sample(frac=0.1, random_state=42)

# eval_sampled_df = pd.concat([
#     eval_sampled_train_df,
#     eval_sampled_stem_df,
# ])
# len(eval_sampled_df)

# train_sampled_train_df = train_df.drop(eval_sampled_train_df.index)
# train_sampled_stem_df = stem_1k_df.drop(eval_sampled_stem_df.index)

# train_sampled_df = pd.concat([
#     train_sampled_train_df,
#     train_sampled_stem_df,
# ])
# len(train_sampled_df)

In [9]:
# eval_sampled_df = train_df.sample(frac=0.5, random_state=42)

# train_sampled_train_df = train_df.drop(eval_sampled_df.index)
# train_sampled_df = pd.concat([
#     train_sampled_train_df,
#     stem_1k_df,
# ])
# len(train_sampled_df)

In [10]:
# Combine both sources and renumber rows so that index and id are contiguous
new_train_df = pd.concat([
    train_df,
    stem_1k_df,
], ignore_index=True)
new_train_df["id"] = new_train_df.index

eval_sampled_df = new_train_df.sample(frac=0.1, random_state=42)

train_sampled_df = new_train_df.drop(eval_sampled_df.index)

len(train_sampled_df)

1080

In [11]:
train_ds = Dataset.from_pandas(train_sampled_df)
eval_ds = Dataset.from_pandas(eval_sampled_df)

In [12]:
model_dir = "microsoft/deberta-v3-large"

In [13]:
tokenizer = AutoTokenizer.from_pretrained(model_dir)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [14]:
# We'll create a dictionary to convert option names (A, B, C, D, E) into indices and back again
options = 'ABCDE'
indices = list(range(5))

option_to_index = {option: index for option, index in zip(options, indices)}
index_to_option = {index: option for option, index in zip(options, indices)}

def preprocess(example):
    # The AutoModelForMultipleChoice class expects a set of question/answer pairs
    # so we'll copy our question 5 times before tokenizing
    first_sentence = [example['prompt']] * 5
    second_sentence = []
    for option in options:
        second_sentence.append(example[option])
    # Our tokenizer will turn our text into token IDs the DeBERTa model can consume
    tokenized_example = tokenizer(first_sentence, second_sentence, truncation=True)
    tokenized_example['label'] = option_to_index[example['answer']]
    return tokenized_example

In [15]:
tokenized_train_ds = train_ds.map(preprocess, batched=False, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_eval_ds = eval_ds.map(preprocess, batched=False, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


The following data collator (adapted from https://huggingface.co/docs/transformers/tasks/multiple_choice)
pads our questions dynamically at batch time, so we don't have to pad every question to the length
of the longest question in the dataset.

In [16]:
@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = "label" if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch
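
To make the flatten, pad, and reshape steps of the collator easier to follow, here is a toy sketch that uses plain Python lists instead of tensors. The `collate` helper, the pad id `0`, and the token ids are all hypothetical, and right-side padding is an assumption; the real class above delegates padding to `tokenizer.pad`.

```python
PAD_ID = 0  # hypothetical pad token id, for illustration only


def collate(features, pad_id=PAD_ID):
    """features[q][c] is the token-id list for question q paired with choice c."""
    batch_size = len(features)
    num_choices = len(features[0])
    # 1. Flatten: one sequence per (question, choice) pair
    flat = [choice for question in features for choice in question]
    # 2. Pad every sequence to the longest one in this batch only
    longest = max(len(ids) for ids in flat)
    padded = [ids + [pad_id] * (longest - len(ids)) for ids in flat]
    # 3. Un-flatten back to (batch_size, num_choices, seq_len)
    return [padded[q * num_choices:(q + 1) * num_choices] for q in range(batch_size)]


# Two questions with two choices each (the notebook uses five choices)
features = [
    [[5, 6], [5, 6, 7]],
    [[8], [8, 9, 10, 11]],
]
batch = collate(features)
print(batch)
```

Every sequence ends up as long as the longest one in the batch (4 tokens here), not the longest one in the whole dataset.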

In [17]:
model = AutoModelForMultipleChoice.from_pretrained(model_dir)

Some weights of the model checkpoint at microsoft/deberta-v3-large were not used when initializing DebertaV2ForMultipleChoice: ['mask_predictions.classifier.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.LayerNorm.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.weight', 'mask_predictions.dense.bias', 'mask_predictions.classifier.bias']
- This IS expected if you are initializing DebertaV2ForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassif

In [18]:
!rm -r /kaggle/working/finetuned_bert
# !rm -r /kaggle/working/wandb

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
rm: cannot remove '/kaggle/working/finetuned_bert': No such file or directory


In [19]:
import os
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
os.environ['WANDB_API_KEY'] = user_secrets.get_secret("wandb_api")

In [20]:
model_dir = 'finetuned_bert'
training_args = TrainingArguments(
    output_dir=model_dir,
    evaluation_strategy="steps",
    eval_steps=50,  # Evaluate every 50 steps
    save_steps=50,  # Save a checkpoint every 50 steps
    save_total_limit=3,  # Only the last 3 checkpoints are kept; older ones are deleted
    logging_steps=1,
    load_best_model_at_end=True,
    learning_rate=3e-6,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=4,
    warmup_steps=50,
    report_to='wandb'
)
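
As a sanity check on these settings, the total number of optimizer steps can be worked out by hand. The sketch below assumes a single device and no gradient accumulation (neither is set explicitly in this notebook), and takes the 1080 training examples from the split above.

```python
import math

num_train_examples = 1080        # len(train_sampled_df) from the split above
per_device_train_batch_size = 2
num_train_epochs = 4
eval_steps = 50
num_devices = 1                  # assumption: training on a single device
gradient_accumulation_steps = 1  # assumption: not set in TrainingArguments above

effective_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
num_evaluations = total_steps // eval_steps

print(steps_per_epoch, total_steps, num_evaluations)
```

Under these assumptions there are 540 steps per epoch and 2160 steps overall, which is consistent with the `global_step` reported by `trainer.train()`.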

In [21]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_ds,
    eval_dataset=tokenized_eval_ds,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer)
)

In [22]:
# !mkdir -p finetuned_bert/checkpoint-30
# !cp -a /kaggle/input/llm-se-deberta-v3-large-training/. /kaggle/working/finetuned_bert/checkpoint-30/.

In [23]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mleo27heady[0m ([33mi2p-onseo[0m). Use [1m`wandb login --relogin`[0m to force relogin


You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


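| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 1.4978        | 1.607024        |
| 100  | 1.6113        | 1.604378        |
| 150  | 1.5083        | 1.596205        |
| 200  | 1.4789        | 1.585963        |
| 250  | 1.7126        | 1.559823        |
| 300  | 1.5196        | 1.505588        |
| 350  | 1.7515        | 1.427319        |
| 400  | 0.9545        | 1.360281        |
| 450  | 1.5551        | 1.319088        |
| 500  | 1.6489        | 1.283518        |
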

TrainOutput(global_step=2160, training_loss=1.018808510694308, metrics={'train_runtime': 1660.1988, 'train_samples_per_second': 2.602, 'train_steps_per_second': 1.301, 'total_flos': 1340019889268400.0, 'train_loss': 1.018808510694308, 'epoch': 4.0})
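
Once the model is trained, its per-option scores still have to be turned into the top three answers the task asks for. The sketch below shows one way to do that for a single question. The `top3_labels` helper and the example logits are hypothetical, and joining the labels with spaces is an assumption about the expected prediction format.

```python
OPTIONS = "ABCDE"


def top3_labels(logits):
    """Return the three highest-scoring option labels, best first."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return " ".join(OPTIONS[i] for i in ranked[:3])


# Hypothetical scores for options A..E of a single question
example_logits = [0.1, 2.3, -0.5, 1.7, 0.9]
prediction = top3_labels(example_logits)
print(prediction)  # B D E
```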

In [24]:
# !rm -r /kaggle/working/finetuned_bert/checkpoint-100
# !cp -a /kaggle/working/finetuned_bert/checkpoint-70/. .

In [25]:
# !rm -r finetuned_bert
# !rm -r wandb

In [26]:
# !rm -r /kaggle/working/finetuned_bert/checkpoint-840/

In [27]:
trainer.save_model('.')

In [28]:
!ls -a

.			   config.json		    tokenizer.json
..			   finetuned_bert	    tokenizer_config.json
.virtual_documents	   pytorch_model.bin	    training_args.bin
__notebook_source__.ipynb  special_tokens_map.json  wandb
added_tokens.json	   spm.model


In [29]:
!rm -r finetuned_bert
!rm -r wandb



In [30]:
!rm __notebook_source__.ipynb

