# Introduction

Hello everyone. In this notebook I share my approach of using an LLM for this competition. It was not my best solution, but I think that with some more work it could become a really strong one, so feel free to tweak it and see how it performs.

I did not use any external data. I tried running this notebook on Kaggle but it did not run, so I used an A5000 with 24GB of VRAM instead. Fine-tuning the non-quantized version of the model would require roughly 60-70GB of VRAM. Suggested setups would be 1xA100-80GB/2xA100-40GB/3xA100-40GB/8xA4000-16GB/8xV100-32GB. You can also opt for newer cards such as the A6000-ADA, H100, or A4000-ADA.

## Required Libraries

Make sure the following libraries are installed, at their latest versions, before running this notebook.
* Transformers
* Datasets
* BitsAndBytes
* Accelerate
* SentencePiece
* PEFT

In [None]:
#Setting up the model string for the model to be used
'''
I have used the base Mistral model, but you can also choose the Instruct model.
One more suggestion would be to try the Zephyr-Alpha/Beta models from HuggingFaceH4 (HF-H4).
'''

model_str = "mistralai/Mistral-7B-v0.1"

In [None]:
#Importing the required Libraries

#Analysis and data creation
import numpy as np
import pandas as pd
from datasets import Dataset

#Modelling
import torch
from transformers import AutoTokenizer
#LlamaForSequenceClassification and AutoModelForSequenceClassification are alternatives:
#the Llama head works most of the time but the results are not that good, and AutoModel invokes the native Mistral head.
from transformers import MistralForSequenceClassification
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback
from transformers import DataCollatorWithPadding
from tqdm import tqdm

#Quantization and LoRA
from peft import get_peft_config, PeftModel, PeftConfig, get_peft_model, LoraConfig, TaskType
from transformers import BitsAndBytesConfig

#Model Storage
from shutil import rmtree
from pathlib import Path

In [None]:
#Reading the Train and Test Data

train = pd.read_csv("/kaggle/input/h2oai-predict-the-llm/train.csv")
test = pd.read_csv("/kaggle/input/h2oai-predict-the-llm/test.csv")

In [None]:
#Replacing the NaN values with the string "NA"; you can use an empty string or any other sequence
train.fillna("NA",inplace=True)
test.fillna("NA",inplace=True)

#Merging the two columns together. Alternatively, you can use them as separate inputs for the model as well.
train["ques_resp"] = 'Question: ' + train["Question"] + "; " + 'Response: ' + train["Response"]
test["ques_resp"] = 'Question: ' + test["Question"] + "; " + 'Response: ' + test["Response"]

#Creating new dataframes with only the required columns (copied so later edits do not touch the originals)
train_merged = train[["target","ques_resp"]].copy()
test_merged = test[["ques_resp"]].copy()
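To make the input format concrete, here is a minimal sketch of the string the model ends up seeing for one row. The helper name `build_ques_resp` and the sample question are hypothetical and only mirror the column arithmetic used in this notebook.

```python
def build_ques_resp(question, response, missing="NA"):
    """Combine a question and a response the same way the ques_resp column is built."""
    # Missing values are replaced by a placeholder string, like fillna("NA")
    q = missing if question is None else str(question)
    r = missing if response is None else str(response)
    return "Question: " + q + "; " + "Response: " + r


# Hypothetical row with a missing response
example = build_ques_resp("What is 2+2?", None)
print(example)
```

Keeping the `Question:` and `Response:` markers in the text gives the model an explicit boundary between the two fields.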

In the following cell I drop the rows of the train set with *Label 4*, because during an initial analysis I found that *Label 4* has close to zero, if not zero, appearances in the test set.
For a more detailed analysis, please check out this [Notebook](https://www.kaggle.com/code/mustafakeser4/basic-eda)

In [None]:
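#Keeping every row except Label 4; resetting the index avoids carrying the old index into the HF Dataset as an extra column
train_merged = train_merged[train_merged['target'] != 4].reset_index(drop=True)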
train_merged
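One thing worth keeping in mind after dropping a class: the remaining label ids are no longer consecutive, but the classification head still indexes its outputs by label id. The sketch below assumes the original labels run from 0 to 6 (an assumption, check `train.target.unique()`); the helper name is hypothetical.

```python
def num_labels_needed(labels):
    """The head needs one output per label id up to the largest id present."""
    return max(labels) + 1


# Assumed label ids left after removing Label 4
remaining_labels = [0, 1, 2, 3, 5, 6]

distinct = len(set(remaining_labels))
needed = num_labels_needed(remaining_labels)
print(distinct, needed)
```

This is why the notebook later computes `num_labels` from the unfiltered `train` dataframe rather than from `train_merged`: counting only the remaining classes would give a head that is too small for the largest label id.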

In [None]:
#Creating the HF Dataset corresponding to our train and test

train_hf_dataset = Dataset.from_pandas(train_merged)
test_hf_dataset = Dataset.from_pandas(test_merged)

In [None]:
#Setting up the PEFT and BnB configs

'''
The PEFT config creates a LoRA adapter on top of the quantized model, which is what makes it trainable.
The BnB config loads the model with 4-bit quantization, reducing the hardware requirement from about 60GB to only about 13GB.
'''

peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    target_modules=[
        "q_proj",
        "v_proj"
    ],
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
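As a rough sanity check on why 4-bit loading helps, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count of about 7.24 billion is an assumption, and the estimate ignores activations, gradients, optimizer state, and quantization constants, so real usage during training (such as the figures I quote in this notebook) is noticeably higher.

```python
# Assumed parameter count for a 7B-class model (not taken from this notebook)
N_PARAMS = 7.24e9

# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "nf4": 0.5}


def weight_gib(n_params, bytes_per_param):
    """Memory for the weights only, in GiB."""
    return n_params * bytes_per_param / 1024**3


for name, size in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_gib(N_PARAMS, size):.1f} GiB")
```

The 8x gap between fp32 and 4-bit weights is the main reason this model fits on a single 24GB card.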

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_str, use_fast=False)

#The tokenizer has no pad token (tokenizer.pad_token is None), so the EOS token is reused for padding
tokenizer.pad_token = tokenizer.eos_token

In [None]:
#Loading the model in 4-bit on GPU 0
model_quantized = MistralForSequenceClassification.from_pretrained(model_str, num_labels=train.target.nunique(), quantization_config=bnb_config, device_map={"":0})

#pretraining_tp is a Llama-style config field (the tensor-parallel degree used during pretraining); 1 selects the standard linear-layer computation
model_quantized.config.pretraining_tp = 1

#Ensuring the model is aware of the pad token ID
model_quantized.config.pad_token_id = tokenizer.pad_token_id

In [None]:
#Setting up the LoRA Adapter
model_main = get_peft_model(model_quantized, peft_config)

In [None]:
model_main.print_trainable_parameters()
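The trainable parameter count can also be approximated by hand. The sketch below assumes the usual Mistral-7B shapes (hidden size 4096, 32 layers, and a 1024-wide `v_proj` output because of grouped-query attention); these dimensions are assumptions and are not read from the model here. Each LoRA adapter adds two small matrices of shapes `(r, in)` and `(out, r)`.

```python
# Assumed Mistral-7B dimensions (check model.config to confirm)
HIDDEN = 4096
KV_DIM = 1024
N_LAYERS = 32

# Values taken from the LoraConfig in this notebook
R = 64
LORA_ALPHA = 16


def lora_params(in_features, out_features, r):
    """Parameters in one LoRA adapter: A is (r, in) and B is (out, r)."""
    return r * (in_features + out_features)


per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total_lora = per_layer * N_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the adapter output

print(total_lora, scaling)
```

The number printed by PEFT may be slightly larger, since the classification head can also be kept trainable for sequence classification tasks.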

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="longest")

In [None]:
#I have used a length of 512 due to memory concerns. You can also use a larger value, or tokenizer.model_max_length
#to use the maximum length of the model (check its value first, as some tokenizers report a very large placeholder).
max_length = 512

#Tokenizing the Datasets
def tokenize_function(examples):
    return tokenizer(examples["ques_resp"], padding="max_length", max_length = max_length, truncation=True)
'''
If you are using two separate columns you can use the following method
(the "Question" and "Response" columns must then be kept in the dataset)

def tokenize_function(examples):
    return tokenizer(examples["Question"], examples["Response"], padding="max_length", max_length = max_length, truncation=True)
'''
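A note on padding: because the tokenizer call pads every row to `max_length`, the `padding="longest"` setting of the data collator has nothing left to do. The toy calculation below compares the two strategies on hypothetical token lengths; the function name and the numbers are illustrative only.

```python
def padded_token_count(lengths, batch_size, cap, pad_to_cap):
    """Total tokens processed when each batch is padded to a common width."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        # Truncate each sequence to the cap, like truncation=True
        batch = [min(n, cap) for n in lengths[start:start + batch_size]]
        # Either pad to the fixed cap or only to the longest row in the batch
        width = cap if pad_to_cap else max(batch)
        total += width * len(batch)
    return total


# Hypothetical token lengths for four rows
lengths = [30, 120, 75, 600]

fixed = padded_token_count(lengths, batch_size=2, cap=512, pad_to_cap=True)
dynamic = padded_token_count(lengths, batch_size=2, cap=512, pad_to_cap=False)
print(fixed, dynamic)
```

If memory or speed is a concern, one option is to drop `padding="max_length"` from the tokenizer call and let the collator pad each batch instead.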

In [None]:
tokenized_train = train_hf_dataset.map(tokenize_function, batched=True)
tokenized_test = test_hf_dataset.map(tokenize_function, batched=True)

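#Roughly an 80/20 train/eval split. Both calls shuffle with the same seed, so the order is identical and the two ranges do not overlap.
tokenized_train_main = tokenized_train.shuffle(seed=42).select(range(0,2726))
tokenized_eval_main = tokenized_train.shuffle(seed=42).select(range(2726,3408))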
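The hard-coded boundaries correspond to an 80/20 split of 3408 rows. A small sketch of how the same ranges could be derived from the row count is shown below; `split_indices` is a hypothetical helper, not part of the `datasets` library.

```python
def split_indices(n_rows, train_fraction=0.8):
    """Return (train_range, eval_range) for an already shuffled dataset."""
    cut = int(n_rows * train_fraction)
    return range(0, cut), range(cut, n_rows)


train_idx, eval_idx = split_indices(3408)
print(len(train_idx), len(eval_idx))
```

Deriving the cut from `len(tokenized_train)` keeps the split valid if the number of rows changes. The `datasets` library also offers `Dataset.train_test_split` as a built-in alternative.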


In [None]:
tokenized_train_main = tokenized_train_main.remove_columns(['ques_resp'])
tokenized_train_main = tokenized_train_main.rename_column("target", "labels")


tokenized_eval_main = tokenized_eval_main.remove_columns(['ques_resp'])
tokenized_eval_main = tokenized_eval_main.rename_column("target", "labels")


In [None]:
# I have used the following training arguments, you can tweak and see if there are any differences in the results
steps = 20

run_name = "mistral" + "-" + "h2o-finetune"
output_dir = "./" + run_name

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=16,
    max_grad_norm=0.3,
    optim='paged_adamw_32bit',
    lr_scheduler_type="cosine",
    num_train_epochs=8,
    weight_decay=0.01,
    evaluation_strategy="steps", #newer transformers releases rename this argument to eval_strategy
    save_strategy="steps",
    load_best_model_at_end=True,
    push_to_hub=False,
    warmup_steps=steps,
    eval_steps=steps,
    logging_steps=steps,
    report_to='none'
)
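To see what these arguments mean in practice, the arithmetic below works out the effective batch size and the approximate number of optimizer updates. It assumes a single GPU and the 2726 training rows used in this notebook; the exact step count reported by the Trainer may differ by a step or two because of rounding.

```python
# Values from the training arguments in this notebook
per_device_batch = 2
grad_accum_steps = 16
epochs = 8
eval_every = 20

# Assumptions: one GPU, 2726 training rows
n_gpus = 1
train_rows = 2726

# Gradients are accumulated over several small batches before each update
effective_batch = per_device_batch * grad_accum_steps * n_gpus
updates_per_epoch = train_rows / effective_batch
total_updates = updates_per_epoch * epochs
approx_evals = total_updates / eval_every

print(effective_batch, round(updates_per_epoch, 1), round(total_updates), round(approx_evals))
```

With these settings the 20 warmup steps cover only a small fraction of training, and evaluation runs a few dozen times over the whole run.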

In [None]:
#The model was already placed on GPU 0 through device_map when it was loaded, so this call is only a safeguard
model_main.to('cuda')

In [None]:
# Setting up the Trainer API
trainer = Trainer(
    model=model_main,
    args=training_args,
    train_dataset=tokenized_train_main,
    eval_dataset=tokenized_eval_main,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

#It took roughly 3 hours and 49 minutes to train on this dataset using an A5000
trainer.train()

In [None]:
#Training saves intermediate checkpoints at regular step intervals. To keep only the best model
#(loaded at the end of training), save it once and then delete the checkpoint folders.

trainer.save_model(output_dir=str(output_dir))

for path in Path(training_args.output_dir).glob("checkpoint-*"):
    if path.is_dir():
        rmtree(path)

In [None]:
#Removing the text column from the test dataset
tokenized_test_main = tokenized_test.remove_columns(['ques_resp'])

#Using the trainer to predict the output; it will use the eval batch size from the training arguments
pred = trainer.predict(tokenized_test_main)

#Getting the probabilities
#The dtype is set explicitly because a quantized model may return half-precision logits, which softmax/log_softmax may not support
pred_proba = torch.nn.functional.softmax(torch.tensor(pred.predictions, dtype=torch.float), dim=1)

#Saving the probabilities for submission
submission = pd.read_csv("/kaggle/input/h2oai-predict-the-llm/sample_submission.csv",index_col="id")
submission[:] = pred_proba.cpu().numpy()
submission.to_csv("submission.csv")
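For readers unfamiliar with the softmax step, here is a minimal pure-Python sketch of what it does to one row of logits. The logits are made up for illustration; the notebook itself relies on `torch.nn.functional.softmax`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    # Subtracting the maximum avoids overflow in exp without changing the result
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one test row with three classes
row = [2.0, 0.5, -1.0]
probs = softmax(row)
print(probs)
```

Each row of the submission is therefore a probability distribution: the values are positive and sum to one, with the largest logit getting the largest probability.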

# The End

Kudos, you have completed training and submitting to [h2oai-predict-the-llm](https://www.kaggle.com/competitions/h2oai-predict-the-llm/overview) using a quantized model.