# Benchmarking and Evaluation

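Finally, we evaluate our finetuned model against the base model and a strong proprietary model (Gemini Pro). The comparison is qualitative: all three models answer the same small set of questions from the same retrieved context, and we inspect their answers side by side.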

**Process:**
1.  **Define a Test Set:** We will create a small, representative set of questions to evaluate the models on.
2.  **Set Up All Models:** We'll compare three distinct models, loading the two local ones one at a time to conserve GPU memory:  
    a. The original, pre-trained base model (`Llama-3-8B`).  
    b. Our finetuned model (Base model + our trained LoRA adapters).  
    c. The Gemini Pro model via API.  
3.  **Generate Responses:** For each question, we will generate an answer from all three models using the same RAG context.
4.  **Compare Results:** We will collate the responses into a Pandas DataFrame for a clear, side-by-side qualitative comparison.
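
The side-by-side comparison in step 4 is qualitative. As a rough numerical complement, one could score each local model's answer against a reference answer with a token-overlap F1. The sketch below is an assumption and not part of the original notebook: `tokenize` and `token_f1` are hypothetical helpers, the example strings are placeholders, and treating any one response (for example Gemini's) as the reference is itself a judgment call.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a candidate answer and a reference answer."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand or not ref:
        return 0.0
    # Multiset intersection: a shared token counts min(count_cand, count_ref) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Placeholder strings, for illustration only.
print(round(token_f1("the cat sat", "the cat sat on the mat"), 3))  # prints 0.667
```

A lexical score like this rewards copying words from the reference and says nothing about factual correctness, so it is best read alongside the printed responses and not in place of them.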

In [1]:
import gc
import os
import yaml
import pandas as pd
from tqdm import tqdm

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

import torch
torch.cuda.empty_cache()

import sys
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

from src.rag_pipeline import RAGPipeline
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

In [2]:
# Load Config and RAG Pipeline
CONFIG_PATH = '../config/config.yaml'
with open(CONFIG_PATH, 'r') as f:
    config = yaml.safe_load(f)
rag_pipeline = RAGPipeline(config_path=CONFIG_PATH)
retriever = rag_pipeline.get_retriever()

# Prepare Evaluation Questions
evaluation_questions = [
    "What is Subhojit Ghimire's professional summary?",
    "What programming languages and AI/ML libraries are listed in Subhojit's technical skills?",
    "What was Subhojit's role and key contributions at Jio Platforms Limited?",
    "List the patents and publications credited to Subhojit Ghimire.",
    "Describe the 'AutoML Playground' project mentioned in the resume."
]

Initialising RAG Pipeline...
Initialising embedding model 'BAAI/bge-large-en-v1.5' on device 'cuda'.
Loading vector store from: ../data/vector_store
RAG Pipeline Initialised Successfully.
Creating a retriever to fetch top 4 results.


In [None]:
results_data = {q: {} for q in evaluation_questions}

# Get Gemini Pro Responses
print("Setting up and running Gemini Pro...")
api_key = os.getenv("GEMINI_API_KEY") or config['llm']['gemini']['gemini_api_key']
gemini_llm = ChatGoogleGenerativeAI(model=config['llm']['gemini']['model_name'], google_api_key=api_key)
prompt_template = PromptTemplate(
    template="SYSTEM: You are a helpful assistant. Answer the user's question based only on the context.\n\nCONTEXT:\n{context}\n\nUSER:\n{question}\n\nASSISTANT:",
    input_variables=["context", "question"]
)
gemini_chain = prompt_template | gemini_llm | StrOutputParser()
for question in tqdm(evaluation_questions, desc="Querying Gemini"):
    retrieved_docs = retriever.invoke(question)
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])
    results_data[question]['context'] = context
    results_data[question]['Gemini_Response'] = gemini_chain.invoke({"context": context, "question": question})
print("Gemini Pro evaluation complete.")

# Get Base Model Responses
print("\nLoading Base Model...")
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)
base_model_name = config['finetuning']['base_model_name']
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, quantization_config=bnb_config, device_map="auto")
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_tokenizer.pad_token = base_tokenizer.eos_token
base_pipe = pipeline("text-generation", model=base_model, tokenizer=base_tokenizer, max_new_tokens=256)
for question in tqdm(evaluation_questions, desc="Querying Base Model"):
    context = results_data[question]['context']
    prompt = prompt_template.format(context=context, question=question)
    output = base_pipe(prompt)[0]['generated_text']
    results_data[question]['Base_Model_Response'] = output.split("ASSISTANT:")[-1].strip()
print("Unloading Base Model...")
del base_model, base_tokenizer, base_pipe
gc.collect()
torch.cuda.empty_cache()

# Get Finetuned Model Responses
print("\nLoading Finetuned Model...")
finetuned_adapter_path = config['finetuning']['output_dir']
ft_base_model = AutoModelForCausalLM.from_pretrained(base_model_name, quantization_config=bnb_config, device_map="auto")
finetuned_model = PeftModel.from_pretrained(ft_base_model, finetuned_adapter_path)
finetuned_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
finetuned_tokenizer.pad_token = finetuned_tokenizer.eos_token
finetuned_pipe = pipeline("text-generation", model=finetuned_model, tokenizer=finetuned_tokenizer, max_new_tokens=256)
for question in tqdm(evaluation_questions, desc="Querying Finetuned Model"):
    context = results_data[question]['context']
    prompt = prompt_template.format(context=context, question=question)
    output = finetuned_pipe(prompt)[0]['generated_text']
    results_data[question]['Finetuned_Model_Response'] = output.split("ASSISTANT:")[-1].strip()
print("Unloading Finetuned Model...")
del ft_base_model, finetuned_model, finetuned_tokenizer, finetuned_pipe
gc.collect()
torch.cuda.empty_cache()

final_results = []
for question, data in results_data.items():
    final_results.append({
        "Question": question,
        "Gemini_Response": data.get("Gemini_Response", "ERROR"),
        "Base_Model_Response": data.get("Base_Model_Response", "ERROR"),
        "Finetuned_Model_Response": data.get("Finetuned_Model_Response", "ERROR")
    })
results_df = pd.DataFrame(final_results)

Setting up and running Gemini Pro...


Querying Gemini: 100%|██████████| 5/5 [00:09<00:00,  1.88s/it]


Gemini Pro evaluation complete.

Loading Base Model...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Device set to use cuda:0
Querying Base Model: 100%|██████████| 5/5 [02:51<00:00, 34.39s/it]


Unloading Base Model...

Loading Finetuned Model...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Device set to use cuda:0
Querying Finetuned Model: 100%|██████████| 5/5 [04:04<00:00, 48.93s/it]


Unloading Finetuned Model...
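
The generation cell recovers each answer by splitting the pipeline output on `ASSISTANT:` and keeping the last piece. A base model that is not instruction-tuned may keep generating after its answer and invent further `USER:` / `ASSISTANT:` turns, in which case the last piece is an invented turn and not the answer to the actual question. A minimal sketch of a more defensive post-processing step follows; `extract_answer` and its `stop_markers` default are hypothetical names chosen for illustration, and the example prompt is a placeholder.

```python
def extract_answer(
    generated_text: str,
    prompt: str,
    stop_markers: tuple[str, ...] = ("USER:", "SYSTEM:", "ASSISTANT:"),
) -> str:
    """Return only the first answer written after the prompt."""
    # Drop the echoed prompt if the output is prompt + continuation.
    if generated_text.startswith(prompt):
        answer = generated_text[len(prompt):]
    else:
        answer = generated_text
    # Cut at the earliest role marker the model may have invented afterwards.
    cut = len(answer)
    for marker in stop_markers:
        pos = answer.find(marker)
        if pos != -1:
            cut = min(cut, pos)
    return answer[:cut].strip()


# Placeholder prompt and output, for illustration only.
prompt = "CONTEXT:\nsome context\n\nUSER:\nWhat is X?\n\nASSISTANT:"
generated = prompt + " X is a thing.\n\nUSER:\nAnd Y?\n\nASSISTANT: Y is another thing."
print(extract_answer(generated, prompt))  # prints "X is a thing."
```

On the same input, `generated.split("ASSISTANT:")[-1].strip()` would return `"Y is another thing."`, which is the invented second turn.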


In [4]:
# Print each question followed by the three model responses
pd.set_option('display.max_colwidth', None)
for index, row in results_df.iterrows():
    print(f"--- QUESTION {index+1} ---")
    print(f"Question: {row['Question']}\n")
    
    print("--- Gemini Pro Response ---")
    print(f"{row['Gemini_Response']}\n")
    
    print("--- Base Model Response ---")
    print(f"{row['Base_Model_Response']}\n")
    
    print("--- Finetuned Model Response ---")
    print(f"{row['Finetuned_Model_Response']}\n")
    
    print("="*50 + "\n")

--- QUESTION 1 ---
Question: What is Subhojit Ghimire's professional summary?

--- Gemini Pro Response ---
AI/ML Developer with 2 years of industry experience and a strong academic foundation in Computer Science & Engineering. Proven expertise in production-grade backend development, scalable machine learning automation, and Generative AI applications. Core contributor to JioBrain, India’s first AI/ML platform with 5G integration, delivering scalable solutions for business use. Track record of delivering production-ready systems independently and in collaborative, cross-functional teams.

--- Base Model Response ---
AI/ML Developer with 2 years of industry experience and a strong academic foundation in Computer Science & Engineering. Proven expertise in production-grade backend development, scalable machine learning automation, and Generative AI applications. Core contributor to JioBrain, India’s first AI/ML platform with 5G integration, delivering scalable solutions for business use. 