In [21]:
import os
import re
import time
import random
import numpy as np
import pandas as pd
from tqdm import tqdm

from matplotlib import pyplot as plt

import config
import utils

from data import prepare_data
from embeddings import embedder
from llm import smollm_wrapper
from llm import smollm_raw_wrapper
from chains import rag_chain

In [22]:
import importlib
importlib.reload(config)
importlib.reload(prepare_data)
importlib.reload(embedder)
importlib.reload(smollm_wrapper)
importlib.reload(rag_chain)
importlib.reload(utils)

import warnings
warnings.filterwarnings('ignore')

In [23]:
## Initialization of sub-components

## Dataset Loader
dataloader = prepare_data.DataLoader(config.DATASET_NAME)
doc = dataloader.load_documents(split='train')

dataloader_test = prepare_data.DataLoader(config.DATASET_NAME)
doc_test = dataloader_test.load_documents(split='test')

## Embedder Loader
embedder_name = config.EMBED_MODEL
embedder = embedder.Embedder(embedder_name)
vector_store = embedder.build_vectorstore(doc)

## LLM Loader
base_model_id = config.BASE_MODEL_ID
our_pretrained_lm_path = config.MODEL_ID
llm_wrapper = smollm_wrapper.SmolLLMWrapper(base_model_id, our_pretrained_lm_path)

## RAG Chain Loader
rag_chain_train = rag_chain.RAGChainBuilder(llm_wrapper.llm, vector_store.as_retriever(search_kwargs={"k": 3}))
qa_chain_train = rag_chain_train.build_chain()

Initializing Dataset with train set
--------------------
Dataset is preprocessing
--------------------
Dataset preprocessing is done
--------------------
Initializing Dataset with test set
--------------------
Dataset is preprocessing
--------------------
Dataset preprocessing is done
--------------------
Initializing Embedding Model
--------------------
Initializing LLM Model
--------------------
Initializing Retrieval Chain
--------------------


In [24]:
i = 0
print(doc[i])
print(doc_test[i])

page_content='Question: Is there any restriction on withdrawal in rupees of funds held in an EEFC account
Answer: No, there is no restriction on withdrawal in rupees of funds held in an EEFC account. However, the amount withdrawn in rupees shall not be eligible for conversion into foreign currency and for re-credit to the account.'
page_content='Question: Why should I buy Critical Illness insurance
Answer: Critical Illness insurance provides you and your family additional financial security on the diagnosis of a critical illness. The policy provides a lump sum amount which could be used for: Cost of care and treatment Recuperation aid Debts pay off Any lost in income due to a decrease in ability to work View more'


In [25]:
## Run a quick test query through the RAG chain
query = "What are the criterias to get some loan?"
response = qa_chain_train.run(query)
print(response)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Question: Whom do I contact in case of ay further queries regarding the loan
Answer: You can apply online on the clicking on the below given link, also you can walk into any of our branches located in 11 mentioned locations and our Sales Managers will help you with your need

Question: How will HDFC decide my home loan eligibility
Answer: HDFC assess the customer's repayment capacity based on income, age, qualifications, number of dependants, spouse's income, assets, liabilities, stability and continuity of occupation, and savings history.

Question: Who is eligible for a tractor loan
Answer: Whether you are a farmer or not, you can avail HDFC Bank's Tractor loan for agricultural or commercial purposes. If you are a farmer, you must have a minimum of 3 acres of agricultural land. To know about the eligibility criteria in det

### Only Instruction-Tuned SmolLM2

In [6]:
## LLM Loader
base_model_id = config.BASE_MODEL_ID
our_pretrained_lm_path = config.MODEL_ID
llm_raw_wrapper = smollm_raw_wrapper.SmolLLM_Raw_Wrapper(base_model_id, our_pretrained_lm_path)

Initializing LLM Model
--------------------


In [7]:
response_raw = llm_raw_wrapper.generate_response(query)
print(response_raw)

### Instruction:
        You are an AI banking assistant. Respond to the customer's request in a clear and professional manner.

        ### Customer Request:
        What are the criterias to get some loan?

        ### Response:
         I can provide you with information about the criteria for getting a loan. To apply for a loan, we have several options available depending on your needs and preferences. Here is what you need to know:

1. Fixed-rate loans: These types of loans offer fixed interest rates that remain constant over time. They usually require collateral or security deposit as part of the application process.
2. Adjustable-rate loans: Unlike fixed-rate loans, these lend out at variable interest rates based on market conditions. The rate may change periodically throughout the repayment period.
3. Personal loans: This type of loan allows individuals to borrow funds up to a predetermined amount without requiring any collateral. It provides quick access to cash when needed mo

### Evaluation of the RAG-Based and Instruction-Tuned-Only SmolLM2 Models on the Test Dataset

In [8]:
import evaluate
from datasets import Dataset

In this section we prepare our custom evaluation set, which is based on the test split of the "GhulamShabbirKhan/BankFAQsMistral" dataset.

We first collect the test-set questions and then use Claude.ai to generate a semantically similar (synthetic) version of each one.
The main concern at this step is data leakage. The test-set questions are embedded into the vector store when performing RAG with LangChain, so querying with the very same questions would make retrieval trivially easy. To avoid this, we generated semantically similar questions with Claude.ai and asked those to both the RAG-based and the instruction-tuned-only model. The aim is to keep the comparison as fair as possible.

In [9]:
# Create the test question-answer pairs
train_questions = [doc_entry.page_content.split("\n")[0].split("Question:")[1] for doc_entry in doc]
test_questions = [doc_entry.page_content.split("\n")[0].split("Question:")[1] for doc_entry in doc_test]
test_answers = [doc_entry.page_content.split("\n")[1].split("Answer:")[1] for doc_entry in doc_test]

print(f"# Train Questions: {len(train_questions)}\n# Test Questions: {len(test_questions)}\n# Test Answers: {len(test_answers)}\n")

# Train Questions: 1411
# Test Questions: 353
# Test Answers: 353



In [10]:
test_questions, test_answers = utils.remove_joint_questions(train_questions, test_questions, test_answers)

print(len(test_questions), len(test_answers))

There are 104 questions in both dataset!
249 249
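
The implementation of `utils.remove_joint_questions` is not shown in this notebook. As a rough illustration of the idea, here is a minimal sketch of such an overlap filter. The function name `drop_overlapping_questions`, the whitespace/case normalization, and the toy data are assumptions made for this example, not the project's actual code.

```python
def drop_overlapping_questions(train_questions, test_questions, test_answers):
    """Keep only the test pairs whose question does not appear in the training set.

    Hypothetical helper: questions are compared after stripping whitespace
    and lowercasing, which is an assumption of this sketch.
    """
    seen = {q.strip().lower() for q in train_questions}
    kept_questions, kept_answers = [], []
    for question, answer in zip(test_questions, test_answers):
        if question.strip().lower() not in seen:
            kept_questions.append(question)
            kept_answers.append(answer)
    return kept_questions, kept_answers


# Toy data for illustration only
train = [" What is an EEFC account", " How do I close my account"]
test_q = [" What is an EEFC account", " Who is eligible for a tractor loan"]
test_a = ["answer A", "answer B"]

kept_q, kept_a = drop_overlapping_questions(train, test_q, test_a)
print(len(kept_q), len(kept_a))
```

Filtering questions and answers in the same loop keeps the two lists aligned, which the later `zip` into a DataFrame relies on.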


In [11]:
test_syn_questions = []

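# The file contains the "→" character, so read it explicitly as UTF-8
with open(config.SYN_QUESTIONS, "r", encoding="utf-8") as file:
    for line in file:
        # Keep the part after the arrow and drop the surrounding quotation marks
        syn_question = line.split("→")[1].replace("\"", "").strip()
        test_syn_questions.append(syn_question)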


print(f"# Syn. Test Questions: {len(test_syn_questions)}\n# Test Questions: {len(test_questions)}")

# Syn. Test Questions: 249
# Test Questions: 249


In [12]:
# Print sample original and syn. questions
sample_idx = random.randint(0, len(test_questions)-1)

print(f"Original Question: {test_questions[sample_idx]}")
print(f"Syn. Question: {test_syn_questions[sample_idx]}")

Original Question:  Can I get a loan or partial withdrawal on this plan
Syn. Question: Does this plan offer options for borrowing against it or making partial withdrawals?


In [26]:
# Retrieve contexts in batch

vector_store_test = embedder.build_vectorstore(doc_test)
retriever_test = vector_store_test.as_retriever(search_kwargs={"k": 3})

# Reuse the same retriever for the RAG chain
rag_chain_test = rag_chain.RAGChainBuilder(llm_wrapper.llm, retriever_test)
qa_chain_test = rag_chain_test.build_chain()

# Get the top-k contexts for all questions
retrieved_docs_batch = retriever_test.batch(test_syn_questions)

print(f"There are {len(retrieved_docs_batch)} documents-retrieved pairs")
print("-"*20)

print(f"A sample question based retrieval:\nQuestion: {test_syn_questions[0]}")
print("-"*20)
print(f"Retrieved documents:\n{retrieved_docs_batch[0]}")
print("-"*20)

Initializing Retrieval Chain
--------------------
There are 249 documents-retrieved pairs
--------------------
A sample question based retrieval:
Question: What's the highest initial deposit amount allowed when opening a Recurring Deposit account?
--------------------
Retrieved documents:
[Document(page_content='Question: What is the maximum deposit amount a Recurring Deposit account can be opened with\nAnswer: The Maximum installment amount you can open a recurring Deposit account with is Rs 14,99,900/- per month.'), Document(page_content='Question: For what period can I open a Recurring Deposit\nAnswer: You can open a Recurring Deposit account for a minimum period of 6 months, and thereafter in multiples of 3 months up to a maximum period of 10 years.'), Document(page_content='Question: Is an overdraft facility allowed\nAnswer: Right now there is no overdraft facility for Recurring Deposits.')]
--------------------


In [14]:
## Generate Responses with "Only Inst-Tuned SmolLM2"
prompt_template = """
        ### Instruction:
        You are an AI banking assistant. Respond to the customer's request in a clear and professional manner.

        ### Customer Request:
        {question}

        ### Response:
        """

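# NOTE: these prompts are built from the original test_questions, whereas the RAG chain
# is queried with test_syn_questions. For a like-for-like comparison, use
# test_syn_questions here as well.
prompts = [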
    prompt_template.format(question=q)
    for q in test_questions
]

print(f"There are {len(prompts)} questions in prompt template")
print("-"*20)
print(f"A sample prompt for Only Ins-Tuned SmolL2:\n{prompts[0]}")
print("-"*20)


# Predict responses
inputs = llm_raw_wrapper.base_tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to("cuda")
outputs = llm_raw_wrapper.tuned_model.generate(**inputs, max_new_tokens=512)

inst_tuned_model_generated_text = llm_raw_wrapper.base_tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(f"There are {len(inst_tuned_model_generated_text)} generated text from Only Ins-Tuned SmolL2")
print("-"*20)
print(f"A sample generated text from Only Ins-Tuned SmolL2:\n{inst_tuned_model_generated_text[0]}")
print("-"*20)


# Strip the prompt and keep only the generated answer for each sample
only_inst_tuned_model_generated_answers = [
    utils.extract_inst_only_model_answer(generated_text)
    for generated_text in inst_tuned_model_generated_text
]

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


There are 249 questions in prompt template
--------------------
A sample prompt for Only Ins-Tuned SmolL2:

        ### Instruction:
        You are an AI banking assistant. Respond to the customer's request in a clear and professional manner.

        ### Customer Request:
         What is the maximum deposit amount a Recurring Deposit account can be opened with

        ### Response:
        
--------------------
There are 249 generated text from Only Ins-Tuned SmolL2
--------------------
A sample generated text from Only Ins-Tuned SmolL2:

        ### Instruction:
        You are an AI banking assistant. Respond to the customer's request in a clear and professional manner.

        ### Customer Request:
         What is the maximum deposit amount a Recurring Deposit account can be opened with

        ### Response:
         I can provide you with the maximum amount that you can deposit with your recurring deposit account. To determine the maximum amount, I'll need some additional in
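
Because the decoded text repeats the whole prompt, the answer has to be cut out after the `### Response:` marker. The body of `utils.extract_inst_only_model_answer` is not shown here, so the following is only a sketch of how such an extraction could work. The name `extract_response` and the fallback behaviour are assumptions of this example.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(generated_text: str) -> str:
    """Return the text that follows the last response marker.

    Hypothetical helper: if the marker is missing, the whole text is
    returned (stripped) instead of raising an error.
    """
    _, marker, tail = generated_text.rpartition(RESPONSE_MARKER)
    return tail.strip() if marker else generated_text.strip()


sample = (
    "### Instruction:\n"
    "You are an AI banking assistant.\n\n"
    "### Customer Request:\n"
    "How do I open an account\n\n"
    "### Response:\n"
    "   Please visit your nearest branch."
)
print(extract_response(sample))
```

Using `rpartition` splits on the last occurrence of the marker, so a marker that happens to appear inside the customer request does not truncate the answer.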

In [27]:
## Generate Responses with "RAG-based Inst-Tuned SmolLM2"
rag_based_inst_tuned_model_generated_text = []
rag_based_inst_tuned_model_generated_answers = []

test_responses = qa_chain_test.batch(test_syn_questions)

for test_response_elem in test_responses:
    rag_based_inst_tuned_model_generated_text.append(test_response_elem['result'])

for rag_generated_text in rag_based_inst_tuned_model_generated_text:
    rag_based_inst_tuned_model_generated_answers.append(utils.extract_rag_answer(rag_generated_text))

In [28]:
## Collect both models' answers alongside the ground truth for evaluation
df = pd.DataFrame(list(zip(only_inst_tuned_model_generated_answers, rag_based_inst_tuned_model_generated_answers, test_answers)), 
                  columns=['only_inst_model_answer', 'rag_based_model_answer', "gt_answer"])

df.head(2)

Unnamed: 0,only_inst_model_answer,rag_based_model_answer,gt_answer
0,I can provide you with the maximum amount that...,The maximum deposit amount which could be depo...,The Maximum installment amount you can open a...
1,I'm sorry to hear that you're facing difficult...,"To withdraw any amount prior to its maturity, ...",We request you to submit your Fixed Deposit a...


In [29]:
df.to_csv("Chain_answers_syn_default_prompt.csv")

In [30]:
## Calculate the ROUGE Score
rouge = evaluate.load('rouge')

only_inst_model_evaluation_result = rouge.compute(
    predictions=only_inst_tuned_model_generated_answers,
    references=test_answers,
    use_aggregator=True,
    use_stemmer=True
)

print(f"Only Instruction Tuned Model Evaluation Result: {only_inst_model_evaluation_result}")

rag_based_model_evaluation_result = rouge.compute(
    predictions=rag_based_inst_tuned_model_generated_answers,
    references=test_answers,
    use_aggregator=True,
    use_stemmer=True
)

print(f"RAG-Based Instruction Tuned Model Evaluation Result: {rag_based_model_evaluation_result}")



Only Instruction Tuned Model Evaluation Result: {'rouge1': 0.1712090957396932, 'rouge2': 0.03570253199449386, 'rougeL': 0.11409797828007184, 'rougeLsum': 0.12881372545688663}
RAG-Based Instruction Tuned Model Evaluation Result: {'rouge1': 0.13166631318909178, 'rouge2': 0.01549872916234204, 'rougeL': 0.08427176664117003, 'rougeLsum': 0.08988182969340588}
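
ROUGE scores measure n-gram overlap, so a long generated answer is penalized through precision even when it contains the key words of the reference. The sketch below illustrates this with a simplified ROUGE-1 F1. It uses plain whitespace tokenization and no stemming, so it is an assumption-laden approximation and will not reproduce the `evaluate` numbers above.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap)
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "visit your nearest branch"
short_answer = "visit the nearest branch"
# Same content words, padded with 12 tokens that do not occur in the reference
long_answer = short_answer + " " + " ".join(["filler"] * 12)

short_score = rouge1_f1(short_answer, reference)
long_score = rouge1_f1(long_answer, reference)
print(short_score, long_score)
```

Both answers recall three of the four reference tokens, but the padded answer scores much lower because its precision drops. This may be one reason why verbose generations score poorly on ROUGE against short FAQ answers.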


In [47]:
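analysis_idx = -1  # e.g. 101

question = test_syn_questions[analysis_idx]
# Access columns by name instead of by integer position
gt_answer = df.iloc[analysis_idx]["gt_answer"]
only_inst_answer = df.iloc[analysis_idx]["only_inst_model_answer"]
rag_answer = df.iloc[analysis_idx]["rag_based_model_answer"]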


print(f"Question\n{question}")
print("-"*20)
print("\n")
print(f"GT Answer\n{gt_answer}")
print("-"*20)
print("\n")
print(f"Only-Inst Model Answer\n{only_inst_answer}")
print("-"*20)
print("\n")
print(f"RAG-Based Model Answer\n{rag_answer}")
print("-"*20)

Question
What is the application process for a Loan Against Property?
--------------------


GT Answer
 You can apply for a loan in the following ways: Fill in the online application form and our representative will get in touch with you Call one of our PhoneBanking numbers provided on the website Visit your nearest branch Our existing liability customers may also get in touch with their Relationship Managers/ Personal Bankers to know more and apply for LAP
--------------------


Only-Inst Model Answer
I'm here to assist you with applying for a Loan Against Property (LAP). Here's a step-by-step guide to help you through the process:

1. Start by gathering all the necessary documents such as identification proof, income statements, bank statements, and any other relevant financial documents.

2. Research and compare different lenders to find the best terms and interest rates for your needs. Consider factors like interest rates, repayment terms, and any additional fees.

3. Once you've c

In [49]:
## Evaluation based on semantic similarity
from sentence_transformers import SentenceTransformer, util

In [59]:
model = SentenceTransformer('all-mpnet-base-v2')


In [60]:
similarity_matrix = np.ones((len(df),2))*-1


for df_idx in tqdm(range(len(df))):

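    question = test_syn_questions[df_idx]
    # Access columns by name instead of by integer position
    gt_answer = df.iloc[df_idx]["gt_answer"]
    only_inst_answer = df.iloc[df_idx]["only_inst_model_answer"]
    rag_answer = df.iloc[df_idx]["rag_based_model_answer"]

    # Embed the ground-truth answer and both model answers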
    embedding_gt = model.encode(gt_answer, convert_to_tensor=True)
    embedding_only_inst = model.encode(only_inst_answer, convert_to_tensor=True)
    embedding_rag = model.encode(rag_answer, convert_to_tensor=True)

    # Compute cosine similarity
    similarity_inst_only = util.pytorch_cos_sim(embedding_gt, embedding_only_inst).detach().cpu().numpy()[0,0]
    similarity_rag = util.pytorch_cos_sim(embedding_gt, embedding_rag).detach().cpu().numpy()[0,0]

    similarity_matrix[df_idx][0] = similarity_inst_only
    similarity_matrix[df_idx][1] = similarity_rag

100%|██████████| 249/249 [00:15<00:00, 15.95it/s]


In [61]:
inst_only_semantic_similarity = np.mean(similarity_matrix[:,0])
rag_semantic_similarity = np.mean(similarity_matrix[:,1])

print(f"Mean Semantic Similarity\nInst-Only: {inst_only_semantic_similarity}\nRAG-Based: {rag_semantic_similarity}")

Mean Semantic Similarity
Inst-Only: 0.5573131607837946
RAG-Based: 0.5540917125646488
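
The reported number is the mean, over all answer pairs, of the cosine similarity between the ground-truth embedding and the model-answer embedding. The following sketch shows that computation on plain Python lists. The helper names and the toy vectors are assumptions for illustration; real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(reference_vectors, prediction_vectors):
    """Average the cosine similarity of aligned (reference, prediction) pairs."""
    sims = [cosine_similarity(r, p) for r, p in zip(reference_vectors, prediction_vectors)]
    return sum(sims) / len(sims)


# Toy embeddings: the first pair points the same way, the second is orthogonal
gt_vectors = [[1.0, 0.0], [0.0, 2.0]]
model_vectors = [[2.0, 0.0], [3.0, 0.0]]

mean_sim = mean_pairwise_similarity(gt_vectors, model_vectors)
print(mean_sim)
```

Cosine similarity ignores vector length, so it compares the direction of the embeddings only. A mean hides the spread across pairs, so two models with nearly identical means can still differ a lot on individual questions.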


In [68]:
df

Unnamed: 0,only_inst_model_answer,rag_based_model_answer,gt_answer
0,I can provide you with the maximum amount that...,The maximum deposit amount which could be depo...,The Maximum installment amount you can open a...
1,I'm sorry to hear that you're facing difficult...,"To withdraw any amount prior to its maturity, ...",We request you to submit your Fixed Deposit a...
2,I can help you with that. To set up your IVR p...,"Yes! You're right in saying that we use a ""One...",To make telephonic (IVR) transactions more se...
3,I'm here to assist you with finding the availa...,There is no specific coverages offered by our ...,You can choose to be insured for any of the f...
4,I'm here to assist you with that. The Service ...,The Service Tax charged by your bank varies de...,Following are the currency conversion service...
...,...,...,...
244,I can provide you with information about the l...,You are welcome! To find out about our availab...,SmartDraft is available in the following loca...
245,I can help you with that! Investing in a new p...,You have various options available when it com...,"In this policy, the investment risk in invest..."
246,I'm here to assist you with purchasing your {{...,"To obtain your gift plus card from us, please ...",GiftPlus cards are available in all HDFC bank...
247,I'm sorry to hear that you're facing an issue ...,Please follow these instructions to resolve th...,Your card is already registered for Verified ...


In [69]:
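## Evaluation based on BERTScore
from bert_score import score

# bert_score.score expects the candidates first and the references second.
# With the default settings, swapping the two only exchanges P and R; F1 is unchanged.
P_inst_only, R_inst_only, F1_inst_only = score(df['only_inst_model_answer'].tolist(), df['gt_answer'].tolist(), lang="en", verbose=True)
P_rag, R_rag, F1_rag = score(df['rag_based_model_answer'].tolist(), df['gt_answer'].tolist(), lang="en", verbose=True)

# NOTE: F1_*[0] is the BERTScore F1 of the first answer pair only, not the mean over
# the test set (F1_*.mean() would give the average). The printed label is carried
# over from the cosine-similarity cell.
print(f"Mean Semantic Similarity\nInst-Only: {F1_inst_only[0]}\nRAG-Based: {F1_rag[0]}")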

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/8 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/4 [00:00<?, ?it/s]

done in 86.73 seconds, 2.87 sentences/sec


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/8 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/4 [00:00<?, ?it/s]

done in 82.95 seconds, 3.00 sentences/sec
Mean Semantic Similarity
Inst-Only: 0.8389032483100891
RAG-Based: 0.8440746665000916
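
BERTScore matches every token of one text to its most similar token in the other text and averages those best matches. The sketch below reproduces that greedy matching on a small, made-up token similarity matrix. It leaves out IDF weighting and baseline rescaling, and the function names and numbers are assumptions for illustration.

```python
def greedy_match_scores(sim):
    """Greedy-matching precision, recall and F1.

    sim[i][j] is the similarity between candidate token i and reference token j.
    """
    n_cand, n_ref = len(sim), len(sim[0])
    # Precision: best reference match for every candidate token
    precision = sum(max(row) for row in sim) / n_cand
    # Recall: best candidate match for every reference token
    recall = sum(max(row[j] for row in sim) for j in range(n_ref)) / n_ref
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def transpose(sim):
    """Swap the roles of candidate and reference."""
    return [list(col) for col in zip(*sim)]


# Toy matrix: 2 candidate tokens (rows) x 3 reference tokens (columns)
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.4],
]

p, r, f1 = greedy_match_scores(sim)
p_swapped, r_swapped, f1_swapped = greedy_match_scores(transpose(sim))
print(p, r, f1)
print(p_swapped, r_swapped, f1_swapped)
```

Swapping candidate and reference exchanges precision and recall but leaves F1 the same, which is why the argument order matters when P and R are reported but not when only F1 is.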


