**AIM**

Code to fine-tune a pretrained model for context-based question answering

# BERT-BASE-UNCASED

BERT Base Uncased is a pretrained transformer model that processes text bidirectionally and ignores case (input text is lowercased). It was pretrained on Wikipedia and BooksCorpus using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). It is used for text classification, question answering, NER, and semantic search in NLP applications.

# Fine-Tuning

Fine-tuning is the process of taking a pretrained model and training it further on a specific dataset to adapt it to a particular task. Instead of training from scratch, fine-tuning adjusts the model's weights slightly, making the model more specialized while retaining its general knowledge. It is widely used with LLMs and in NLP and computer vision to improve task-specific performance with limited resources.

# LoRA

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes the original model weights and adds small, trainable low-rank matrices to the transformer layers. This significantly reduces memory usage and computational cost while often achieving performance close to full fine-tuning.
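To make the low-rank idea concrete, here is a minimal plain-Python sketch, not the PEFT implementation. It assumes a single frozen weight matrix `W` of shape d x k and two hypothetical trainable matrices `B` (d x r) and `A` (r x k); the helper names are made up for illustration.

```python
# Illustrative only: a LoRA-style update for one weight matrix W (d x k).
# W stays frozen; only A (r x k) and B (d x r) would be trained.

def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters added by one adapter of rank r."""
    return r * k + d * r

def matmul(x, y):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_effective_weight(W, A, B, alpha: float, r: int):
    """Return W + (alpha / r) * (B @ A) without modifying W."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]

# Assumed sizes: a 768 x 768 projection (the hidden size of BERT-base), rank 8.
full = 768 * 768
added = lora_param_count(768, 768, 8)
print(full, added)  # 589824 12288
```

Under these assumptions the adapter trains roughly 2% as many parameters as the full matrix, which is where the memory savings come from.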

Install the Packages

In [None]:
!pip install peft

Import the Required Packages 

In [None]:
import pandas as pd
import json
from datasets import Dataset
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    PreTrainedTokenizer,
)
import torch
from peft import LoraConfig, get_peft_model, TaskType

Load the dataset from its JSON file

In [None]:
with open('/kaggle/input/quac-question-answering-in-context-dataset/train_v0.2 QuaC.json', 'r') as f:
    data = json.load(f)

# Print a short summary instead of dumping the whole file
print(data.keys(), len(data['data']))

Fetch the `Context`, `Question` & `Answer` from the JSON and store them in lists. Only the first question and the first answer of the first paragraph of each entry are used.

In [None]:
cont = []
que = []
ans = []
for entry in data['data']:
    # Use only the first paragraph, its first question, and that question's first answer
    paragraph = entry['paragraphs'][0]
    first_qa = paragraph['qas'][0]
    cont.append(paragraph['context'])
    que.append(first_qa['question'])
    ans.append(first_qa['answers'][0]['text'])
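The loop above keeps one question per entry. If more training pairs are wanted, every question could be collected instead. The sketch below shows one possible way, using a tiny made-up sample that mimics the nested layout read above; `extract_all_qa` is a hypothetical helper, not part of the original notebook.

```python
# A tiny, made-up sample in the same nested layout the loop above reads.
sample = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": "Malayalam is spoken in Kerala.",
                    "qas": [
                        {"question": "Where is Malayalam spoken?",
                         "answers": [{"text": "Kerala"}]},
                        {"question": "What is spoken in Kerala?",
                         "answers": [{"text": "Malayalam"}]},
                    ],
                }
            ]
        }
    ]
}

def extract_all_qa(data):
    """Yield (context, question, answer) for every question, not just the first."""
    for entry in data["data"]:
        for paragraph in entry["paragraphs"]:
            for qa in paragraph["qas"]:
                if not qa["answers"]:  # skip questions without any answer
                    continue
                yield paragraph["context"], qa["question"], qa["answers"][0]["text"]

rows = list(extract_all_qa(sample))
print(len(rows))  # 2
```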

Store the lists in a DataFrame

In [None]:
df = pd.DataFrame({'context': cont, 'question': que, 'answer': ans})

In [None]:
df

Load the `bert-base-uncased` model and tokenizer for fine-tuning on context-based question answering with LoRA

In [None]:
model_nm = 'bert-base-uncased'
model = AutoModelForQuestionAnswering.from_pretrained(model_nm)
tok = AutoTokenizer.from_pretrained(model_nm)

Define the maximum sequence length and the helper functions for the tokenization process

In [None]:
max_length = 256

Preprocessing steps:

1) Tokenize the `Context` and `Question` together and keep all of the tokenizer outputs.
2) Tokenize the answer separately and keep only its `input_ids`.
3) Compute the `Start Position` and `End Position` of the answer tokens, based on where the answer occurs in the given context.

Helper functions:

1) `get_answer_positions` returns the start and end positions of the answer in the given context.
2) `toke_batch` returns the tokenized `Context` and `Question`.
3) `lab_toke_batch` tokenizes the answers and returns them together with the start and end positions, so that everything can be combined into a dataset ready for model training.
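The idea behind step 3 is to turn the character span of the answer into a token span. The sketch below illustrates this with a stand-in whitespace tokenizer rather than BERT's WordPiece tokenizer, so the indices will not match real model inputs; both function names are hypothetical. Hugging Face fast tokenizers can return comparable character offsets through `return_offsets_mapping=True`.

```python
import re

def tokenize_with_offsets(text):
    """Stand-in tokenizer: split on whitespace, keep (token, start, end) offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(context, answer):
    """Return (start_token, end_token) covering the answer, or None if not found."""
    start_char = context.find(answer)
    if not answer or start_char == -1:
        return None
    end_char = start_char + len(answer)
    tokens = tokenize_with_offsets(context)
    # Indices of the tokens that overlap the answer's character range
    covered = [i for i, (_, s, e) in enumerate(tokens)
               if s < end_char and e > start_char]
    return covered[0], covered[-1]

context = "Malayalam is the language spoken by the Malayalis."
print(char_span_to_token_span(context, "the language"))  # (2, 3)
```

Returning `None` for a missing answer makes that case explicit, so the caller can decide whether to skip the example or map it to a "no answer" position.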

In [None]:
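# bert-base-uncased already defines [PAD]; this call is kept as a safeguard
tok.add_special_tokens({'pad_token': '[PAD]'})

def get_answer_positions(context, answer, tokenizer, prefix="Context: "):
    """
    Find the start and end token positions of the answer in the model input,
    which toke_batch builds as: [CLS] + "Context: " + context + ...
    Returns (0, 0), the [CLS] position, when the answer is not found in the
    context or does not fit within the first max_length tokens.
    """
    start_idx = context.find(answer)
    if not answer or start_idx == -1:
        return 0, 0
    end_idx = start_idx + len(answer)

    # Tokens that precede the context: [CLS] plus the tokenized prompt prefix
    offset = 1 + len(tokenizer.encode(prefix, add_special_tokens=False))

    # Count the tokens up to the answer start and up to the answer end
    start_token = tokenizer.encode(context[:start_idx], add_special_tokens=False)
    end_token = tokenizer.encode(context[:end_idx], add_special_tokens=False)

    start_pos = offset + len(start_token)    # first answer token in the model input
    end_pos = offset + len(end_token) - 1    # last answer token in the model input

    # An answer cut off by truncation cannot serve as a training target
    if end_pos >= max_length - 1:
        return 0, 0
    return start_pos, end_pos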



In [None]:
def toke_batch(examples, tokenizer: PreTrainedTokenizer):
    """
    Tokenizes the input data (context + question) into tensors for model input.
    """
    # Create input prompts: Context + Question
    input_prompts = [f"Context: {example['context']} Question: {example['question']} Answer:" for example in examples]
    
    # Tokenize the input prompts with truncation and padding to max_length
    input_encodings = tokenizer(input_prompts, padding='max_length', truncation=True, max_length=max_length, return_tensors='pt')
    
    return input_encodings



In [None]:
def lab_toke_batch(examples, tokenizer: PreTrainedTokenizer):
    """
    Tokenizes the labels (answers) into tensors for model labels.
    """
    # Extract all answers and calculate the start and end positions for each answer
    start_positions = []
    end_positions = []
    labels = []
    
    for example in examples:
        context = example['context']
        answer = example['answer']
        
        # Get start and end positions
        start_pos, end_pos = get_answer_positions(context, answer, tokenizer)
        
        # Append start and end positions for each example
        start_positions.append(start_pos)
        end_positions.append(end_pos)
        
        # Collect the raw answer text; tokenization happens after the loop
        labels.append(answer)
    
    # Tokenize the labels (answers)
    label_encodings = tokenizer(labels, padding='max_length', truncation=True, max_length=max_length, return_tensors='pt')
    
    return label_encodings, start_positions, end_positions

Call the functions above to prepare the dataset for model fine-tuning

In [None]:
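# Hold out 10% of the examples so that evaluation is not run on the training data
split = Dataset.from_pandas(df).train_test_split(test_size=0.1, seed=42)
train_dataset = split['train']
test_dataset = split['test']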

# Apply batch tokenization to the training dataset
train_inputs = toke_batch(train_dataset, tok)
train_labels, train_start_positions, train_end_positions = lab_toke_batch(train_dataset, tok)

# Apply batch tokenization to the test dataset
test_inputs = toke_batch(test_dataset, tok)
test_labels, test_start_positions, test_end_positions = lab_toke_batch(test_dataset, tok)

# Add 'labels', 'start_positions' and 'end_positions' to input data
train_inputs['labels'] = train_labels.input_ids
train_inputs['start_positions'] = torch.tensor(train_start_positions)
train_inputs['end_positions'] = torch.tensor(train_end_positions)

test_inputs['labels'] = test_labels.input_ids
test_inputs['start_positions'] = torch.tensor(test_start_positions)
test_inputs['end_positions'] = torch.tensor(test_end_positions)

# Convert to Hugging Face Dataset format
training_dataset = Dataset.from_dict(train_inputs)
testing_dataset = Dataset.from_dict(test_inputs)

# Now you have a dataset with 'start_positions' and 'end_positions' for each example


Inspect the training dataset

In [None]:
training_dataset

Define the LoRA configuration for fine-tuning and wrap the model with it

In [None]:
con = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, task_type=TaskType.QUESTION_ANS)
pef_mod = get_peft_model(model, con)

Set up the training arguments for fine-tuning and create the trainer

The fine-tuned model files are saved in **`/kaggle/working/`**

In [None]:
tr_args = TrainingArguments(
    output_dir='/kaggle/working/',
    evaluation_strategy='epoch',  # evaluate once at the end of each epoch
    num_train_epochs=1,
    eval_steps=1,  # only used when the evaluation strategy is 'steps'
    learning_rate=1e-4,
    per_device_train_batch_size=1,
)

In [None]:
trainer = Trainer(
    args=tr_args,
    model=pef_mod,
    train_dataset=training_dataset,
    eval_dataset=testing_dataset
)

Train the model for 1 epoch; with `evaluation_strategy='epoch'`, evaluation runs at the end of the epoch

In [None]:

trainer.train()


Checking the output manually

In [27]:
i=1
con=df['context'][i]
que=df['question'][i]
ans=df['answer'][i]
print(que,ans) 

what language do they speak? Malayalam is the language spoken by the Malayalis.


Answer given by the fine-tuned model to the question, compared with the actual answer

In [28]:
# Reuse the in-memory tokenizer and fine-tuned model from the cells above
tokenizer = tok
model = pef_mod

# Move the model to the available device (GPU or CPU) and disable dropout
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Define a manual question and context
context = con
question = que

# Tokenize the inputs
inputs = tokenizer(
    question,
    context,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding="max_length",
)

# Move inputs to the correct device
inputs = {key: value.to(device) for key, value in inputs.items()}

# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)

# Extract start and end logits
start_logits = outputs.start_logits
end_logits = outputs.end_logits

# Get the most probable start and end positions
start_index = torch.argmax(start_logits)
end_index = torch.argmax(end_logits)

# Decode the answer
answer = tokenizer.decode(inputs["input_ids"][0][start_index:end_index + 1], skip_special_tokens=True)

# Print the results
print(f"Question: {question}")
print(f"Answer: {answer}")
print(f"ActualAns:{ans}")



Question: what language do they speak?
Answer: what language do they speak? malayalam is the language spoken by the malayalis. malayalam is derived from old tamil and sanskrit in the
ActualAns:Malayalam is the language spoken by the Malayalis.
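The cell above takes the argmax of the start and end logits independently, which does not guarantee that the start comes before the end or that the span is short. One common alternative is to score every valid (start, end) pair and keep the best one. The sketch below works on plain Python lists; `best_span`, `max_answer_len`, and the logit values are assumptions made for illustration, not part of the original notebook.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logit, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Made-up logits: independent argmax would give start=3 and end=1, an empty span.
start_logits = [0.1, 0.2, 0.0, 2.0, 0.3]
end_logits = [0.0, 1.5, 0.2, 0.4, 1.0]
print(best_span(start_logits, end_logits))  # (3, 4)
```

With real model outputs, the logits for the single example could be converted first, for instance with `outputs.start_logits[0].tolist()`.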
