## **Assignment - Week 8**

**Problem Statement** 

Build a RAG Q&A chatbot using document retrieval and generative AI for intelligent response generation.

**Resources**

[Kaggle] https://www.kaggle.com/datasets/sonalisingh1411/loan-approval-prediction?select=Training+Dataset.csv


### **Step 1: Data Preprocessing**

In this step, we identify which columns have missing values and fill them with reasonable defaults (the mode, the median, or a placeholder). We also convert all relevant fields to strings to avoid errors such as calling `.lower()` on a float.

In [6]:
import pandas as pd

df = pd.read_csv("rag_chatbot/data/train.csv")
print(df.head())
print(df.isnull().sum())


    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001002   Male      No          0      Graduate            No   
1  LP001003   Male     Yes          1      Graduate            No   
2  LP001005   Male     Yes          0      Graduate           Yes   
3  LP001006   Male     Yes          0  Not Graduate            No   
4  LP001008   Male      No          0      Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0   
1             4583             1508.0       128.0             360.0   
2             3000                0.0        66.0             360.0   
3             2583             2358.0       120.0             360.0   
4             6000                0.0       141.0             360.0   

   Credit_History Property_Area Loan_Status  
0             1.0         Urban           Y  
1             1.0         Rural           N  
2             1.0   

In [7]:
# Filling categorical columns with mode
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    df[col] = df[col].fillna(df[col].mode()[0])

# Filling numerical columns with median
for col in ['LoanAmount', 'Loan_Amount_Term']:
    df[col] = df[col].fillna(df[col].median())

In [8]:
# Converting all relevant fields to strings to avoid errors
for col in ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 
            'Property_Area', 'Loan_Status']:
    df[col] = df[col].astype(str)
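
To see why the string conversion matters, here is a minimal illustration in plain Python. pandas typically represents a missing cell in a text column as the float `NaN`, and floats have no `.lower()` method. The variable names below are illustrative only and are not part of the notebook.

```python
# A missing cell in a text column typically surfaces as the float NaN.
missing = float("nan")

try:
    missing.lower()
    failed = False
except AttributeError:
    # Floats have no string methods, so this branch runs.
    failed = True

# After casting to str, string methods work, but the value is the literal "nan".
as_text = str(missing).lower()
print(failed, as_text)
```

Casting alone would leave the literal text "nan" in the data, which is why the missing values are filled before the cast.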


In [9]:
def row_to_text(row):
    return (
        f"Loan ID {row['Loan_ID']} was applied by "
        f"{'a married' if row['Married'] == 'Yes' else 'an unmarried'} "
        f"{row['Gender'].lower()} "
        f"{'graduate' if row['Education'] == 'Graduate' else 'not graduate'} "
        f"with {row['Dependents']} dependents. "
        f"The applicant has an income of ₹{row['ApplicantIncome']} and "
        f"co-applicant income of ₹{row['CoapplicantIncome']}. "
        f"The loan amount requested was ₹{row['LoanAmount']} with a term of {row['Loan_Amount_Term']} months. "
        f"The credit history is {'good' if row['Credit_History'] == 1.0 else 'bad'}. "
        f"The property area is {row['Property_Area'].lower()}. "
        f"The loan was {'approved' if row['Loan_Status'] == 'Y' else 'not approved'}."
    )

df["doc_text"] = df.apply(row_to_text, axis=1)
# Sample text from the dataframe
df["doc_text"].sample(1).values[0]


'Loan ID LP002473 was applied by a married male graduate with 0 dependents. The applicant has an income of ₹8334 and co-applicant income of ₹0.0. The loan amount requested was ₹160.0 with a term of 360.0 months. The credit history is good. The property area is semiurban. The loan was not approved.'

### **Step 2: Embedding + FAISS Index Creation (using Sentence Transformers)**  

We generate a dense embedding for each document chunk using a `Sentence Transformer` and store the embeddings in a FAISS index for fast similarity search during user queries.
Finally, we save the index and the document mapping for use in the chatbot.  

In [10]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pickle

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Getting text chunks
docs = df["doc_text"].tolist()

# Generating embeddings (using SentenceTransformer to convert texts to vectors)
embeddings = model.encode(docs, show_progress_bar=True)

# Converting to float32 as required by FAISS
embeddings = np.array(embeddings).astype("float32")

# Initializing FAISS index (a simple flat index doing exact search by L2 distance, not cosine similarity)
index = faiss.IndexFlatL2(embeddings.shape[1])

# Adding embeddings to the index
index.add(embeddings)
faiss.write_index(index, "rag_chatbot/vector_store/index.faiss")

# Saving document texts for retrieving the sources later
with open("rag_chatbot/vector_store/id2doc.pkl", "wb") as f:
    pickle.dump(docs, f)
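
A note on the distance metric: `IndexFlatL2` ranks by squared L2 distance, where a smaller value means a closer match. If the embeddings are unit-normalized (worth checking for the chosen model; `encode` also accepts `normalize_embeddings=True`), ranking by L2 distance and ranking by cosine similarity give the same order. The sketch below checks this with NumPy on random vectors standing in for real embeddings. All names in it are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(50, 8))   # stand-in for document embeddings
query_vec = rng.normal(size=8)        # stand-in for a query embedding

def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

doc_unit = normalize(doc_vecs)
query_unit = normalize(query_vec)

cosine = doc_unit @ query_unit                       # higher is closer
sq_l2 = ((doc_unit - query_unit) ** 2).sum(axis=1)   # lower is closer

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = bool(np.allclose(sq_l2, 2.0 - 2.0 * cosine))

top5_by_l2 = np.argsort(sq_l2)[:5]
top5_by_cosine = np.argsort(-cosine)[:5]
same_top5 = bool(np.array_equal(top5_by_l2, top5_by_cosine))
print(identity_holds, same_top5)
```

Both checks should come out true. Without normalization the two rankings can differ, in which case an inner-product index such as `faiss.IndexFlatIP` over normalized vectors is one way to get true cosine scores.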



### **Step 3: Retriever Module (Top-k Similarity Search)**

We now build the `retriever.py` utility to retrieve the most relevant document chunks for any user query.  
The query is embedded using the same Sentence Transformer, then matched against our FAISS index to return the top-k most similar documents.

This is the "R" in the RAG pipeline: fast, semantic document retrieval before generation.

See `retriever.py` in the `rag_chatbot` folder for the implementation details.
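
The contents of `retriever.py` are not reproduced in this notebook. As a rough sketch of the top-k search idea only, the following uses plain NumPy as a stand-in for the FAISS index and hand-written 2-D vectors in place of Sentence Transformer embeddings. The function name `retrieve_top_k` and the toy data are assumptions made for illustration; the real module may be organized differently.

```python
import numpy as np

def retrieve_top_k(query_vec, doc_vecs, docs, top_k=3):
    """Return up to top_k (doc, distance) pairs, closest first."""
    # Squared L2 distance from the query to every document vector.
    distances = ((doc_vecs - query_vec) ** 2).sum(axis=1)
    k = min(top_k, len(docs))
    nearest = np.argsort(distances)[:k]  # ascending: smallest distance first
    return [(docs[i], float(distances[i])) for i in nearest]

# Toy "embeddings": in the real pipeline these come from the encoder.
docs = ["doc A", "doc B", "doc C", "doc D"]
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
query_vec = np.array([1.0, 0.0])

results = retrieve_top_k(query_vec, doc_vecs, docs, top_k=2)
for doc, distance in results:
    print(doc, round(distance, 4))
```

Note that with a distance-based score, a lower number means a better match, so results should be sorted in ascending order. A similarity-based score would be sorted the other way around.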

### **Step 4: Testing the Retriever Module**

We will now test the `DocumentRetriever` on a sample user query.  
The system embeds the query, searches the FAISS index, and returns the top-k most relevant document chunks.

This verifies the "retrieval" part of our RAG pipeline is working correctly before moving to generation.


In [11]:
from rag_chatbot.retriever import DocumentRetriever

retriever = DocumentRetriever(
    index_path="rag_chatbot/vector_store/index.faiss",
    id2doc_path="rag_chatbot/vector_store/id2doc.pkl"
)

sample_query = "What kind of applicants get approved for a loan?"
top_k = 3

# Retrieving top-k relevant chunks
results = retriever.retrieve(sample_query, top_k=top_k)
results = sorted(results, key=lambda x: x[1], reverse=True)  # Sort by score, highest first (assumes a higher score means a closer match)
for i, (doc, score) in enumerate(results, 1):
    print(f"\n🔹 Match #{i} (Score: {score:.2f}):\n{doc}")



🔹 Match #1 (Score: 0.88):
Loan ID LP001846 was applied by an unmarried female graduate with 3+ dependents. The applicant has an income of ₹3083 and co-applicant income of ₹0.0. The loan amount requested was ₹255.0 with a term of 360.0 months. The credit history is good. The property area is rural. The loan was approved.

🔹 Match #2 (Score: 0.87):
Loan ID LP001888 was applied by an unmarried female graduate with 0 dependents. The applicant has an income of ₹3237 and co-applicant income of ₹0.0. The loan amount requested was ₹30.0 with a term of 360.0 months. The credit history is good. The property area is urban. The loan was approved.

🔹 Match #3 (Score: 0.86):
Loan ID LP001788 was applied by an unmarried female graduate with 0 dependents. The applicant has an income of ₹3463 and co-applicant income of ₹0.0. The loan amount requested was ₹122.0 with a term of 360.0 months. The credit history is good. The property area is urban. The loan was ap

### **Step 5: Answer Generation**

We now build a **generator module** to produce natural language answers using the user query and the top-k retrieved context documents. The input to the generator is a concatenated prompt including the query and documents, which is passed to a small, efficient model from Hugging Face. Please refer to `rag_chatbot/generator.py` for the implementation details.

In [12]:
# Testing the generator module
from rag_chatbot.generator import AnswerGenerator

# Using the retrieved chunks from earlier
retrieved_chunks = [doc for doc, score in results]

gen = AnswerGenerator()
generated_answer = gen.generate_answer(sample_query, retrieved_chunks)

print("🧠 Generated Answer:\n", generated_answer)


Device set to use cpu


🧠 Generated Answer:
 unmarried female graduate with 0 dependents


### **Step 6: Building the RAG Chatbot Interface**

We will now create an interactive chatbot using `Gradio`, designed to answer user queries based on the loan application records in the training dataset. It uses Retrieval-Augmented Generation:

1. A user submits a query.
2. The retriever fetches top-k relevant document chunks from a FAISS index.
3. The generator uses those chunks + the user’s query to produce a context-aware answer.

The chatbot interface will:
- Accept queries
- Display the generated answer
- Show the retrieved source chunks
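
The flow above can be sketched as a single function that the interface would call. The method names `retrieve(query, top_k=...)` and `generate_answer(query, chunks)` follow the calls used earlier in this notebook; the function name `answer_query`, the stub classes, and the source formatting are assumptions so that the example runs without loading any models.

```python
def answer_query(query, retriever, generator, top_k=3):
    """Run retrieval then generation; return (answer, formatted sources)."""
    results = retriever.retrieve(query, top_k=top_k)
    chunks = [doc for doc, _score in results]
    answer = generator.generate_answer(query, chunks)
    # Number the chunks so the user can see what the answer was based on.
    sources = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return answer, sources


class StubRetriever:
    """Hypothetical stand-in for DocumentRetriever."""
    def retrieve(self, query, top_k=3):
        docs = ["chunk one", "chunk two", "chunk three", "chunk four"]
        return [(doc, 0.0) for doc in docs[:top_k]]


class StubGenerator:
    """Hypothetical stand-in for AnswerGenerator."""
    def generate_answer(self, query, chunks):
        return f"(stub) {len(chunks)} chunks used for: {query}"


answer, sources = answer_query("sample question", StubRetriever(), StubGenerator(), top_k=2)
print(answer)
print(sources)
```

A function with this shape, taking one text input and returning two text outputs, could then be wrapped in a Gradio interface with the real retriever and generator in place of the stubs.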
