In [1]:
"""
A typical RAG application has two main components:

Indexing: a pipeline for ingesting data from a source and indexing it. This usually happens offline.

Retrieval and generation: the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

The most common full sequence from raw data to answer looks like:

Indexing
1) Load: First we need to load our data. This is done with Document Loaders.
2) Split: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won't fit in a model's finite context window.
3) Store: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and Embeddings model.

Retrieval and generation
4) Retrieve: Given a user input, relevant splits are retrieved from storage using a Retriever.
5) Generate: A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.

"""

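The five steps listed above can be sketched end to end without any external library. This is a toy illustration only: the two documents are hypothetical, word overlap (Jaccard similarity) stands in for an embedding model, and the "generate" step only assembles the prompt a model would receive. All function names are made up for this sketch.

```python
# Toy sketch of the five RAG steps; word overlap stands in for embeddings.

def load():
    # 1) Load: hypothetical in-memory documents instead of a Document Loader.
    return [
        "The mobile app runs on iOS and Android phones. Keep the app updated.",
        "Wire transfers can be sent from a branch. Fees depend on the destination.",
    ]

def split(documents, chunk_size=8):
    # 2) Split: cut every document into chunks of at most chunk_size words.
    chunks = []
    for doc in documents:
        words = doc.split()
        for start in range(0, len(words), chunk_size):
            chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks

def tokenize(text):
    # Lowercased word set with basic punctuation stripped.
    return {w.strip(".,?!").lower() for w in text.split()}

def store(chunks):
    # 3) Store: keep each chunk next to its searchable representation.
    return [(chunk, tokenize(chunk)) for chunk in chunks]

def retrieve(index, query, k=1):
    # 4) Retrieve: rank chunks by Jaccard overlap with the query, keep the top k.
    q = tokenize(query)
    ranked = sorted(index, key=lambda item: len(q & item[1]) / len(q | item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, context):
    # 5) Generate: assemble the prompt; a real chain would pass this to the model.
    return "Context:\n{}\n\nQuestion:\n{}\n\nAnswer:".format("\n".join(context), query)

index = store(split(load()))
question = "Does the mobile app run on iOS?"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The rest of this notebook follows the same shape, with a sentence-embedding model and an in-memory vector store in place of the word-overlap index.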
In [3]:
import os
import getpass


os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

In [9]:
import time
import textwrap
import numpy as np
import pandas as pd
from tqdm import tqdm


import torch
import evaluate

from datasets import load_dataset, DatasetDict
from transformers import GenerationConfig, AutoModelForCausalLM, AutoTokenizer

from peft import PeftModel, PeftConfig
from peft import LoraConfig, get_peft_model, TaskType

In [10]:
import bs4
from langchain import hub
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_community.embeddings import HuggingFaceEmbeddings

In [11]:
import warnings
warnings.filterwarnings('ignore')

In [12]:
my_device = "cuda" if torch.cuda.is_available() else "cpu"
print("My Device: {}".format(my_device))

My Device: cuda


In [14]:
## initialize our peft-tuned LLM
model_name = "HuggingFaceTB/SmolLM2-360M" # HuggingFaceTB/SmolLM2-135M, HuggingFaceTB/SmolLM2-360M, HuggingFaceTB/SmolLM2-1.7B, HuggingFaceTB/SmolLM2-1.7B-Instruct
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(my_device)


# initialize the peft model
print("PEFT Model is loading")
peft_model_path = 'tuned_models_SmolLM2/bank_qa_base_tune_peft-2025_02_25_22_13_1e-3_expanded/checkpoint-1440'
peft_model = PeftModel.from_pretrained(model, peft_model_path).to(my_device)
peft_model.eval();

PEFT Model is loading


In [17]:
## load the external Q&A document
external_df = pd.read_excel('fsb1879_BankingQAs.xlsx')

# drop nan values
external_df.dropna(inplace=True)
# drop duplicates
external_df.drop_duplicates(inplace=True)

print(external_df.shape)

external_df.head(5)

(198, 2)


,Questions,Answers
0,What phones are able to use the mobile app?,"Any phone that has access to Google, Apple or ..."
1,"How do I view & print statements, notices and ...","To view e-statements, notices & tax documents:..."
2,How do I send or receive a wire transfer?,FSB offers several wire transfer options.\n\nD...
3,What is your routing number?,The routing number is 073908045
4,How do I protect myself from Scammers?,"Fraudulent calls, texts, and emails are on the..."


In [18]:
def tune_generate_chat_template(query_text, my_tokenizer, my_device, instruct_model=False):
    """
    Formats the query based on the instruction tuning prompt template.
    """

    # Apply the updated instruction-tuned prompt template
    formatted_prompt = f"""
    ### Instruction:
    You are an AI banking assistant. Respond to the customer's request in a clear and professional manner.

    ### Customer Request:
    {query_text}

    ### Response:
    """

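    if instruct_model:
        messages = [{"role": "user", "content": formatted_prompt}]
        # add_generation_prompt=True appends the assistant turn marker so the model answers the user turn
        input_text = my_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    else:
        input_text = formatted_prompt

    # Calling the tokenizer returns both input_ids and attention_mask,
    # which tune_generate_output reads by key
    inputs = my_tokenizer(input_text, return_tensors="pt").to(my_device)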

    return inputs


def tune_generate_output(my_inputs, my_tokenizer, my_model, max_tokens=50, temp=0.3, top_p=0.9, top_k=50, penalty_score=1.2, do_sample=True, instruct_model=False):
    """
    Generates a response from the model based on the input.
    """

    outputs = my_model.generate(
                
        input_ids=my_inputs["input_ids"],  # Pass input_ids
        attention_mask=my_inputs["attention_mask"],  # Pass attention mask
        max_new_tokens=max_tokens,
        temperature=temp,
        top_p=top_p,
        top_k=top_k,
        repetition_penalty=penalty_score,
        do_sample=do_sample,
        pad_token_id=my_tokenizer.eos_token_id,  # use the tokenizer passed in, not the global one
        eos_token_id=my_tokenizer.eos_token_id
    )

    # Decode output and clean it up
    output_text = my_tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Ensure safe parsing without hardcoded token removal
    cleaned_output_text = output_text.strip()

    return cleaned_output_text

In [None]:
index = 0

sample_question = external_df.iloc[index,0]
sample_response = external_df.iloc[index,1]

sample_inputs = tune_generate_chat_template(sample_question, tokenizer, my_device)
sample_output = tune_generate_output(sample_inputs, tokenizer, peft_model, max_tokens=300, temp = 0.7, top_p = 0.6, top_k=50, penalty_score=1.2,
                                    do_sample = True, instruct_model=False)

sample_output_wout_prompt = sample_output.split("Response:")[1].strip()


dash_line = '-'*10
print('Human Question: \n{}\n'.format(sample_question))
print(dash_line)
print('Human Response: \n{}\n'.format(sample_response)) # better reading: textwrap.fill(sample_response, width=150)
print(dash_line)
print('LLM Response: \n{}\n'.format(sample_output_wout_prompt))

Human Question: 
What phones are able to use the mobile app?


------------------------------------------------------------------------------------

Human Response: 
Any phone that has access to Google, Apple or Samsung app stores can download the FSB Mobile app.

At minimum the device must run on iOS 15 or Android 10.

To ensure the highest level of security, we recommend always updating your device to the latest operating system and app release. The simplest way to achieve this, is to set automatic updates in your device settings. For further support on your device, contact your phone manufacturer's customer support for assistance.



------------------------------------------------------------------------------------

LLM Response: 
I can provide you with information about which mobile apps allow users to access our services conveniently through their respective devices or platforms. To get started, here is what we recommend for using your preferred application on various smartphone
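The cell above strips the prompt with `sample_output.split("Response:")[1]`, which raises an `IndexError` if the marker is missing from the decoded text. A slightly more defensive variant is sketched below; `extract_response` is a hypothetical helper name, and the marker string is assumed to match the prompt template used in this notebook.

```python
def extract_response(generated_text, marker="### Response:"):
    """Return the text after the first marker, or the whole text if the marker is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


decoded = "### Instruction:\nAnswer politely.\n\n### Response:\n  Hello, how can I help? "
print(extract_response(decoded))
```

Using `partition` keeps everything after the first marker, so a second "### Response:" produced by the model itself does not cut the answer short.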

## Langchain Operations

In [20]:
# Embedder that will be used during RAG embedding and retrieval
embedder = HuggingFaceEmbeddings(
    model_name="paraphrase-multilingual-mpnet-base-v2"
)


In [28]:
## Build a local vector DB over the questions only.
## At retrieval time, the matched question is looked up in external_df
## and its corresponding answer is added to the prompt.
local_vector_db_question = DocArrayInMemorySearch.from_texts(
    external_df["Questions"].tolist(), embedder
)

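## retriever from the local vector db
## (depending on the LangChain version, score_threshold may only take effect
## with search_type="similarity_score_threshold")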
retriever = local_vector_db_question.as_retriever(
    search_kwargs={"score_threshold": 0.3, "k": 2}
)
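To make the intent of the two `search_kwargs` concrete, here is a minimal sketch of "keep at most `k` results whose similarity is at least `score_threshold`". It is not LangChain's implementation: the three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine` and `search` are hypothetical helpers.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, indexed, k=2, score_threshold=0.3):
    # Score every entry, drop those below the threshold, return the best k.
    scored = [(text, cosine(query_vec, vec)) for text, vec in indexed]
    kept = [pair for pair in scored if pair[1] >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Made-up vectors; a real index would hold embedder outputs.
indexed = [
    ("What phones are able to use the mobile app?", [0.9, 0.1, 0.0]),
    ("Can I use Touch Authentication on my mobile device?", [0.6, 0.6, 0.1]),
    ("What is your routing number?", [0.0, 0.1, 0.9]),
]
results = search([1.0, 0.2, 0.0], indexed)
for text, score in results:
    print(round(score, 3), text)
```

With these toy vectors the routing-number question scores far below the threshold, so it would be dropped even if `k` were raised.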

In [30]:
# Sample question and the indexed questions it retrieves
relevant_documents = retriever.get_relevant_documents(query='Can i use my IOS mobile phone?')
print("Relevant Documents: {}".format(relevant_documents))

Relevant Documents: [Document(page_content='What phones are able to use the mobile app?'), Document(page_content='Can I use Touch Authentication on my mobile device?')]


In [32]:
def tune_rag_generate_chat_template(query_text, external_info, my_tokenizer, my_device, instruct_model=False):
    """
    Formats the query and the retrieved external info based on the instruction tuning prompt template.
    """

    # Apply the updated instruction-tuned prompt template
    formatted_prompt = f"""
    ### Instruction:
    You are an AI banking assistant. Respond to the customer's request in a clear and professional manner.

    ### External Info:
    {external_info}

    ### Customer Request:
    {query_text}

    ### Response:
    """

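    if instruct_model:
        messages = [{"role": "user", "content": formatted_prompt}]
        # add_generation_prompt=True appends the assistant turn marker so the model answers the user turn
        input_text = my_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    else:
        input_text = formatted_prompt

    # Calling the tokenizer returns both input_ids and attention_mask,
    # which tune_generate_output reads by key
    inputs = my_tokenizer(input_text, return_tensors="pt").to(my_device)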

    return inputs

In [None]:
sample_query = "Can i use my IOS mobile phone?"

retrieve_most_relevant_question = retriever.get_relevant_documents(query=sample_query)[0]
relevant_question_text = retrieve_most_relevant_question.page_content
relevant_question_index = external_df[external_df['Questions'] == relevant_question_text].index[0]
# .index[0] is an index label, so look it up with .loc (rows may have been dropped above)
relevant_question_answer_text = external_df['Answers'].loc[relevant_question_index]

print("User Query: \n{}\n".format(sample_query))
print(dash_line)
print("External Document Relevant Question: \n{}\n".format(relevant_question_text))
print(dash_line)
print("External Document Relevant Answer: \n{}\n".format(relevant_question_answer_text))
print(dash_line)
## now append the answer to the prompt and try to get meaningful response
sample_inputs = tune_rag_generate_chat_template(sample_query, relevant_question_answer_text, tokenizer, my_device, instruct_model=False)
sample_output = tune_generate_output(sample_inputs, tokenizer, peft_model, max_tokens=300, temp = 0.7, top_p = 0.6, top_k=50, penalty_score=1.2,
                                    do_sample = True, instruct_model=False)


print("RAG-Supported LLM Answer: \n{}\n".format(sample_output))

User Query: 
Can i use my IOS mobile phone?


------------------------------------------------------------------------------------

External Document Relevant Question: 
What phones are able to use the mobile app?


------------------------------------------------------------------------------------

External Document Relevant Answer: 
Any phone that has access to Google, Apple or Samsung app stores can download the FSB Mobile app.

At minimum the device must run on iOS 15 or Android 10.

To ensure the highest level of security, we recommend always updating your device to the latest operating system and app release. The simplest way to achieve this, is to set automatic updates in your device settings. For further support on your device, contact your phone manufacturer's customer support for assistance.


RAG-Supported LLM Answer: 
### Instruction:
    You are an AI banking assistant. Respond to the customer's request in a clear and professional manner.

    ### External Info:
    Any p
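The lookup from a retrieved question back to its answer can also be done with a plain dictionary, which avoids any dependence on DataFrame index labels or positions. The sketch below uses two rows copied from the table shown earlier as stand-in data; `answer_by_question` and `lookup_answer` are hypothetical names.

```python
# Stand-in rows; in the notebook these would come from external_df.
qa_pairs = [
    (
        "What phones are able to use the mobile app?",
        "Any phone that has access to Google, Apple or Samsung app stores can download the FSB Mobile app.",
    ),
    ("What is your routing number?", "The routing number is 073908045"),
]

# Map each question to its answer once, up front.
answer_by_question = dict(qa_pairs)

def lookup_answer(retrieved_question, default=None):
    # Return the stored answer for a retrieved question, or default if unknown.
    return answer_by_question.get(retrieved_question, default)

print(lookup_answer("What is your routing number?"))
```

In the notebook itself, the same mapping could be built with `dict(zip(external_df["Questions"], external_df["Answers"]))`, assuming the questions are unique after deduplication.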

In [39]:
sample_query = "How can i be apart from scammers?"

retrieve_most_relevant_question = retriever.get_relevant_documents(query=sample_query)[0]
relevant_question_text = retrieve_most_relevant_question.page_content
relevant_question_index = external_df[external_df['Questions'] == relevant_question_text].index[0]
# .index[0] is an index label, so look it up with .loc (rows may have been dropped above)
relevant_question_answer_text = external_df['Answers'].loc[relevant_question_index]

print("User Query: \n{}\n".format(sample_query))
print(dash_line)
print("External Document Relevant Question: \n{}\n".format(relevant_question_text))
print(dash_line)
print("External Document Relevant Answer: \n{}\n".format(relevant_question_answer_text))
print(dash_line)

## now append the answer to the prompt and try to get meaningful response
sample_inputs = tune_rag_generate_chat_template(sample_query, relevant_question_answer_text, tokenizer, my_device, instruct_model=False)
sample_output = tune_generate_output(sample_inputs, tokenizer, peft_model, max_tokens=300, temp = 0.7, top_p = 0.6, top_k=50, penalty_score=1.2,
                                    do_sample = True, instruct_model=False)


print("RAG-Supported LLM Answer: \n{}\n".format(sample_output))

User Query: 
How can i be apart from scammers?


------------------------------------------------------------------------------------

External Document Relevant Question: 
How do I protect myself from Scammers?


------------------------------------------------------------------------------------

External Document Relevant Answer: 
Fraudulent calls, texts, and emails are on the rise. These scammers are very good at what they do, and will seem very convincing to gain access to your financial information. You may think that you are not the type of person to be easily duped, but these fraudsters are pros.

If you ever receive a call, text or email claiming to be an employee of FSB, know that we will NEVER ask for any of the following information:

Password
Username
Card PIN
Account numbers
If someone calls you, or you have clicked on a link in a text or email that asks for this information, call FSB immediately. We will assist you with changing your username and password, and hopefully 