
# 📚 Week 8 CSI Assignment - RAG Q&A Chatbot Project

**Project Title**: Building a Retrieval-Augmented Generation (RAG) chatbot using document retrieval and generative AI.

**Dataset**: Loan Approval Prediction Dataset (from Kaggle).

**Objective**: The goal is to build a simple Q&A chatbot that can retrieve relevant loan applicant data and generate smart responses using a light-weight LLM model.

---


## Step 1: Importing Required Libraries

In [1]:

# Importing basic libraries for data handling and AI model usage.
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from transformers import pipeline


## Step 2: Loading the Loan Approval Dataset

In [2]:

# I downloaded this dataset from Kaggle and placed it in my working directory.
# Reading the CSV file and handling missing values.
df = pd.read_csv('Training Dataset.csv').fillna('Unknown')
print(f"✅ Dataset loaded with {df.shape[0]} rows and {df.shape[1]} columns.")
df.head()


✅ Dataset loaded with 614 rows and 13 columns.


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,Unknown,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


## Step 3: Converting Dataset Rows into Documents

In [3]:

# Since we are using RAG (retrieval + generation), I need to convert each row into a readable document-like string.
documents = []
for i, row in df.iterrows():
    doc = (
        f"Loan Application Record:\n"
        f"Applicant ID: {row['Loan_ID']}. Gender: {row['Gender']}. Married: {row['Married']}. "
        f"Dependents: {row['Dependents']}. Education: {row['Education']}. "
        f"Self Employed: {row['Self_Employed']}. Applicant Income: {row['ApplicantIncome']}. "
        f"Loan Amount: {row['LoanAmount']}. Credit History: {row.get('Credit_History', 'Unknown')}. "
        f"Loan Status: {row['Loan_Status']}."
    )
    documents.append(doc)

print(f"✅ Converted {len(documents)} rows into documents.")
print("Here's a sample document:\n", documents[0])


✅ Converted 614 rows into documents.
Here's a sample document:
 Loan Application Record:
Applicant ID: LP001002. Gender: Male. Married: No. Dependents: 0. Education: Graduate. Self Employed: No. Applicant Income: 5849. Loan Amount: Unknown. Credit History: 1.0. Loan Status: Y.


## Step 4: Generating Embeddings and Building FAISS Index

In [4]:

# Using sentence transformers to get document embeddings
embedder = SentenceTransformer('all-MiniLM-L6-v2')
document_embeddings = embedder.encode(documents, convert_to_numpy=True)

# Building FAISS index for fast retrieval
dimension = document_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(document_embeddings)
print("✅ FAISS index created with shape:", index.ntotal)


✅ FAISS index created with shape: 614


## Step 5: Defining My Q&A Chatbot Function

In [5]:

# For this project, I chose distilgpt2 because it runs well on CPU and doesn't require GPU credits.

generator = pipeline('text-generation', model='distilgpt2')

# Defining a function which combines retrieval and generation
def rag_chatbot(query, top_k=3, verbose=False):
    # Convert query to embedding
    query_embedding = embedder.encode([query], convert_to_numpy=True)
    # Retrieve top K documents
    distances, indices = index.search(query_embedding, top_k)
    retrieved_docs = [documents[idx] for idx in indices[0]]
    
    if verbose:
        print("🔎 Retrieved Documents:")
        for doc in retrieved_docs:
            print(doc, '\n')
    
    # Preparing prompt for generator
    context = " ".join(retrieved_docs)
    prompt = (
        f"You are a helpful assistant for loan approvals.\n"
        f"Based on the following loan application records:\n"
        f"{context}\n"
        f"Answer concisely. Question: {query}\nAnswer:"
    )
    
    # Generating response
    response = generator(prompt, max_length=100, do_sample=True, temperature=0.7)
    return response[0]['generated_text']


Device set to use cpu


## Step 6: Testing My Chatbot

In [6]:

query = "What is the loan status of applicants earning more than 5000?"
answer = rag_chatbot(query, verbose=True)
print("\n🤖 Chatbot Answer:\n", answer)


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


🔎 Retrieved Documents:
Loan Application Record:
Applicant ID: LP001955. Gender: Female. Married: No. Dependents: 0. Education: Graduate. Self Employed: No. Applicant Income: 5000. Loan Amount: 151.0. Credit History: 1.0. Loan Status: N. 

Loan Application Record:
Applicant ID: LP002379. Gender: Male. Married: No. Dependents: 0. Education: Graduate. Self Employed: No. Applicant Income: 6500. Loan Amount: 105.0. Credit History: 0.0. Loan Status: N. 

Loan Application Record:
Applicant ID: LP002776. Gender: Female. Married: No. Dependents: 0. Education: Graduate. Self Employed: No. Applicant Income: 5000. Loan Amount: 103.0. Credit History: 0.0. Loan Status: N. 


🤖 Chatbot Answer:
 You are a helpful assistant for loan approvals.
Based on the following loan application records:
Loan Application Record:
Applicant ID: LP001955. Gender: Female. Married: No. Dependents: 0. Education: Graduate. Self Employed: No. Applicant Income: 5000. Loan Amount: 151.0. Credit History: 1.0. Loan Status: N. 


## ✅ Conclusion and Reflection

In this project, I built my first simple Retrieval-Augmented Generation chatbot using a Kaggle loan dataset. I used sentence-transformers and FAISS to retrieve similar documents and used a lightweight generative model (distilgpt2) for producing natural language responses.

### What I Learned:
- How to convert structured data into document-style text
- How to build a FAISS index for fast retrieval
- How to use transformer models for question answering
- How to combine retrieval and generation for more intelligent responses

I realized that the quality of generation improves a lot if the prompt includes good context. I enjoyed making this chatbot.

---
