# Task 3: RAG Pipeline with Production Vector Store 🤖

This notebook demonstrates the complete RAG (Retrieval-Augmented Generation) pipeline for financial complaint analysis:

1. **Load Pre-built Embeddings**: Use `index_production.py` to load the production vector store from `complaint_embeddings.parquet`
2. **Query Interface**: Use `rag_pipeline.py` to retrieve relevant complaints (k=5) and generate answers using Flan-T5

**Key Requirements (Task 3)**:
- Load from `complaint_embeddings.parquet` (pre-built embeddings)
- Use `all-MiniLM-L6-v2` for query embedding
- Retrieve k=5 most relevant complaint excerpts
- Generate answers using Flan-T5-base LLM

In [1]:
import sys
import os

# Add src to path
sys.path.append(os.path.abspath(os.path.join('../src')))

## Step 1: Inspect Pre-Built Vector Store

Before loading, let's inspect the structure of `complaint_embeddings.parquet` to:
- Detect the embedding model used (should be `all-MiniLM-L6-v2`)
- Verify embedding dimensions (should be 384)
- Check metadata fields and document structure
- Ensure compatibility with ChromaDB
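
The checks listed above can be sketched without any file I/O. This is a minimal illustration, not the body of `inspect_parquet_structure`: the helper name `check_rows` is hypothetical, and rows are assumed to be plain dictionaries already read from the Parquet file.

```python
from typing import Any, Mapping, Sequence

# Columns the pre-built store is expected to expose.
EXPECTED_COLUMNS = {"id", "document", "embedding", "metadata"}


def check_rows(rows: Sequence[Mapping[str, Any]], expected_dim: int = 384) -> int:
    """Validate column names and embedding width; return the common dimension.

    Hypothetical helper for illustration only.
    """
    if not rows:
        raise ValueError("no rows to inspect")
    dims = set()
    for row in rows:
        missing = EXPECTED_COLUMNS - set(row)
        if missing:
            raise ValueError(f"row {row.get('id')!r} is missing columns: {sorted(missing)}")
        dims.add(len(row["embedding"]))
    if dims != {expected_dim}:
        raise ValueError(f"expected {expected_dim}-dim embeddings, found {sorted(dims)}")
    return expected_dim


sample = [
    {"id": "a", "document": "text", "embedding": [0.0] * 384,
     "metadata": {"product": "Credit card"}},
]
print(check_rows(sample))  # 384
```

A matching dimension only shows that the vectors are compatible in size; it does not by itself identify which model produced them.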

In [2]:
from index_production import inspect_parquet_structure, PARQUET_PATH

# Inspect the pre-built vector store
model_name, embedding_dim = inspect_parquet_structure(PARQUET_PATH)

print(f"\n✅ Inspection complete!")
print(f"   Detected Model: {model_name}")
print(f"   Embedding Dimensions: {embedding_dim}")

INSPECTING PRE-BUILT VECTOR STORE

📊 Total rows in Parquet: 1,375,327
📋 Columns: ['id', 'document', 'embedding', 'metadata']

Data Types:
id           object
document     object
embedding    object
metadata     object
dtype: object

🔢 Embedding Dimensions: 384
   Sample values: [-0.04277738  0.02562437 -0.07883817  0.02250159 -0.00948492]...

🤖 Detecting Embedding Model:
   ⚠️  'embedding_model' not found in metadata
   Inferring from dimensions...
   Best guess: all-MiniLM-L6-v2
   ✓ Assuming 'all-MiniLM-L6-v2' (most common 384-dim model)

📝 Sample Metadata Fields:
   chunk_index: 0
   company: CITIBANK, N.A.
   complaint_id: 14069121
   date_received: 2025-06-13
   issue: Getting a credit card
   product: Credit card
   product_category: Credit Card
   state: TX
   sub_issue: Card opened without my consent or knowledge
   total_chunks: 1

📄 Sample Document (first 150 chars):
   a card was opened under my name by a fraudster. i received a notice from that an ac

### Pre-built vector store inspection: analysis

The inspection confirms that the pre-built data matches our assumptions:

1. It has the expected columns: `id`, `document`, `embedding`, `metadata`.
2. The embedding model is not recorded in the metadata, but the embeddings have 384 dimensions, which is consistent with `all-MiniLM-L6-v2`, the model Task 3 specifies. Other models also produce 384-dimensional vectors, so this is an inference rather than proof; we proceed on that basis when using the collection via `.get_collection()`.
3. The metadata contains the critical fields `product`, `issue`, `sub_issue`, `complaint_id`, and `company`, which allow cross-referencing with the original data whenever needed.

## Step 2: Load Pre-Built Embeddings into ChromaDB

Now we load the production vector store using `index_production_data()`:
- Reads `complaint_embeddings.parquet` (pre-built embeddings)
- Creates `complaints_production` collection in ChromaDB
- Stores embedding model metadata for validation
- Processes the ~248K complaints, stored as 1,375,327 chunk-level rows in the run below, in batches of 5,000

**Note**: This replaces Task 2's prototype where we created embeddings ourselves. Here we use pre-built embeddings.
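
The batch arithmetic can be checked with a short sketch. The helper name `batch_bounds` is hypothetical and is not taken from `index_production.py`; it only shows how a 5,000-row batch size partitions the 1,375,327 rows reported by the inspection.

```python
from typing import Iterator, Tuple


def batch_bounds(total_rows: int, batch_size: int = 5000) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) row offsets, with end exclusive, covering total_rows."""
    for start in range(0, total_rows, batch_size):
        yield start, min(start + batch_size, total_rows)


bounds = list(batch_bounds(1_375_327))
print(len(bounds))  # 276
print(bounds[-1])   # (1375000, 1375327)
```

The count of 276 batches agrees with the `276it` shown by the indexing progress bar; the final batch holds the remaining 327 rows.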

In [3]:
from index_production import index_production_data

# Load pre-built embeddings into ChromaDB
# This creates the 'complaints_production' collection
detected_model, detected_dim = index_production_data()

print(f"\n✅ Vector store ready!")
print(f"   Collection: complaints_production")
print(f"   Embedding model: {detected_model}")
print(f"   Dimensions: {detected_dim}")

Opening Parquet file: c:\Users\yeget\Intelligent-Complaint-Analysis-for-Financial-Services\data\raw\complaint_embeddings.parquet

Indexing: 276it [1:18:46, 17.13s/it] 


✓ Indexing Complete!
  Total documents: 1,375,327
  Collection: complaints_production
  Embedding model: all-MiniLM-L6-v2

✅ Vector store ready!
   Collection: complaints_production
   Embedding model: all-MiniLM-L6-v2
   Dimensions: 384





## Step 3: Initialize RAG Pipeline

Now we initialize the RAG pipeline which:
- Connects to the `complaints_production` collection
- Validates embedding model matches (should be `all-MiniLM-L6-v2`)
- Loads Flan-T5-base LLM for answer generation
- Sets default k=5 for retrieval (Task 3 requirement)
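
The model-validation step might look roughly like the sketch below. It is an assumption about the approach, not the code in `rag_pipeline.py`: the function name and the `embedding_dim` key are hypothetical, and `embedding_model` is borrowed from the key name that appears in the inspection log.

```python
from typing import Any, Mapping, Optional


def validate_collection_metadata(
    metadata: Optional[Mapping[str, Any]],
    expected_model: str = "all-MiniLM-L6-v2",
    expected_dim: int = 384,
) -> str:
    """Raise ValueError if the stored model or dimension disagrees with the query model."""
    metadata = metadata or {}
    stored_model = metadata.get("embedding_model")  # key name is an assumption
    stored_dim = metadata.get("embedding_dim")      # hypothetical key
    if stored_model is None:
        raise ValueError("collection metadata does not name an embedding model")
    if stored_model != expected_model:
        raise ValueError(f"stored model {stored_model!r} != query model {expected_model!r}")
    if stored_dim is not None and int(stored_dim) != expected_dim:
        raise ValueError(f"stored dimension {stored_dim} != expected {expected_dim}")
    return stored_model


print(validate_collection_metadata(
    {"embedding_model": "all-MiniLM-L6-v2", "embedding_dim": 384}
))  # all-MiniLM-L6-v2
```

Failing fast here matters because query vectors from a different model would still be accepted by the index if the dimensions happen to match, but the similarity scores would be meaningless.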

In [13]:
# Run this cell BEFORE querying: verify the collection, then initialize RAGPipeline

import importlib

import chromadb
import rag_pipeline

VECTOR_DB_PATH = '../vector_store'
COLLECTION_NAME = 'complaints_production'

# Manually verify the collection exists
client = chromadb.PersistentClient(path=VECTOR_DB_PATH)

# List all collections
print("Available collections:")
collections = client.list_collections()
for col in collections:
    print(f"  - {col.name} (count: {col.count()})")

# Check if complaints_production exists
try:
    prod_collection = client.get_collection(COLLECTION_NAME)
    print(f"\n✅ Found '{COLLECTION_NAME}' with {prod_collection.count()} documents")
except Exception as e:
    print(f"\n❌ Error: {e}")
    print("The collection wasn't created properly. Re-run index_production_data()")

# Reload the module to clear any cached connections
importlib.reload(rag_pipeline)

# Now initialize with a fresh import
from rag_pipeline import RAGPipeline

rag = RAGPipeline(vector_db_path=VECTOR_DB_PATH, collection_name=COLLECTION_NAME)

print("\n✅ RAG Pipeline initialized and ready for queries!")

Available collections:
  - complaints_prototype (count: 19230)
  - complaints_production (count: 1375327)

✅ Found 'complaints_production' with 1375327 documents
Initializing Vector Store Client...
Connected to collection: complaints_production
📦 Collection Metadata:
   Stored model: all-MiniLM-L6-v2
   Dimensions: 384
✓ Embedding model validated: all-MiniLM-L6-v2

Loading LLM: google/flan-t5-base...



Device set to use cuda:0


LLM Loaded successfully.

✅ RAG Pipeline initialized and ready for queries!


## Step 4: Test Retrieval (k=5)

Let's test the semantic search retrieval:
- Query embedding: User question → 384-dim vector using `all-MiniLM-L6-v2`
- Similarity search: Cosine similarity against stored embeddings
- Top-k retrieval: Returns 5 most relevant complaint excerpts (Task 3 requirement)
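
To make the similarity step concrete, here is a pure-Python sketch of cosine-similarity ranking on toy 2-dimensional vectors. In the pipeline itself the search is delegated to ChromaDB over 384-dimensional embeddings; the function names below are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 5) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = top_k([1.0, 0.1], toy_docs, k=2)
print([index for index, _ in ranked])  # [0, 2]
```

A brute-force scan like this is linear in the number of stored vectors, which is why a store of this size relies on an approximate nearest-neighbour index instead.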

In [14]:
# Test retrieval with k=5 (default now)
query = "credit card late fees"
docs, metas = rag.retrieve(query)

print(f"Query: {query}\n")
print(f"Retrieved {len(docs)} complaint excerpts:\n")
for i, (doc, meta) in enumerate(zip(docs, metas)):
    print(f"{'='*60}")
    print(f"Result {i+1}:")
    print(f"  Product: {meta['product']}")
    print(f"  Text excerpt: {doc[:200]}...")
    print()

C:\Users\yeget\.cache\chroma\onnx_models\all-MiniLM-L6-v2\onnx.tar.gz: 100%|██████████| 79.3M/79.3M [17:52<00:00, 77.6kiB/s]



Query: credit card late fees

Retrieved 5 complaint excerpts:

Result 1:
  Product: Credit card or prepaid card
  Text excerpt: i submitted a payment that was due on , and was received by on , a whole seven days prior to my due date. however, i still received a late fee. i have had issue after issue with late fees with only th...

Result 2:
  Product: Credit card
  Text excerpt: i have a credit card that is being charged a late fee and it is paid on time. i am getting unemployment and i did a credit card protection claim they paid it and now it is 3 months late and for some r...

Result 3:
  Product: Credit card
  Text excerpt: i 've been issued a number of late fee that are unfair and excessive. these fees violate federal regulations put in place to prevent credit card companies from taking advantage of their customers....

Result 4:
  Product: Credit card or prepaid card
  Text excerpt: card company can charge you a fee of up to 28.00 . if you are late a second time within the next s

Retrieval is working correctly: the query "credit card late fees" returns five results, and the excerpts shown are relevant to the question.

## Step 5: Test Answer Generation

Now let's generate an answer using the LLM:
- **Prompt template**: "You are a financial analyst assistant for CrediTrust..." (Task 3 spec)
- **Context**: Top-5 retrieved complaint excerpts
- **LLM**: Flan-T5-base (text2text-generation)
- **Output**: Generated answer based on complaint patterns
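
The augmentation step can be sketched as a small prompt builder. This is an illustration under assumptions: `build_prompt` is a hypothetical name, the instruction string passed in the usage is only a stand-in for the Task 3 template (whose full text is not reproduced in this notebook), and the `Complaint N:` numbering is inferred from labels that show up in some generated answers.

```python
from typing import Sequence


def build_prompt(instructions: str, question: str, excerpts: Sequence[str]) -> str:
    """Assemble instructions, numbered complaint excerpts, and the question."""
    context = "\n".join(
        f"Complaint {i}: {text.strip()}" for i, text in enumerate(excerpts, start=1)
    )
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


prompt = build_prompt(
    "You are a financial analyst assistant for CrediTrust.",  # stand-in, not the full template
    "credit card late fees",
    ["i still received a late fee.", "these fees are unfair and excessive."],
)
print(prompt)
```

Keeping the instructions as a parameter makes it easy to experiment with a tighter answer format without touching the retrieval code.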

In [15]:
# Generate answer from retrieved documents
answer = rag.generate_answer(query, docs)

print(f"Question: {query}\n")
print(f"LLM Answer:")
print(f"{answer}")
print(f"\n(Answer generated from {len(docs)} complaint excerpts)")

Question: credit card late fees

LLM Answer:
i have a credit card that is being charged a late fee and it is paid on time. i am getting unemployment and i did a credit card protection claim they paid it and now it is 3 months late and for some reason my account is very late they charged me late fee for now 6 months. i lost my job in last year. i 've been issued a number of late fee that are unfair and excessive. these fees violate federal regulations put in place to prevent credit card companies from taking advantage of their customers.

(Answer generated from 5 complaint excerpts)


## Step 6: End-to-End RAG Queries

Let's test the complete pipeline with multiple questions:
- **Retrieve**: Semantic search for k=5 relevant complaints
- **Augment**: Build prompt with CrediTrust template + complaint excerpts
- **Generate**: LLM produces answer based on retrieved context

This demonstrates the full RAG workflow as specified in Task 3.
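
The three stages compose as shown in the sketch below, which uses stub callables so it runs without a vector store or an LLM. It mirrors the `(answer, docs, metadata)` return shape used by `rag.query` in this notebook, but `rag_query`, `stub_retrieve`, and `stub_generate` are hypothetical names, not the project's API.

```python
from typing import Callable, Dict, List, Tuple

Retriever = Callable[[str, int], Tuple[List[str], List[Dict[str, str]]]]
Generator = Callable[[str], str]


def rag_query(question: str, retrieve: Retriever, generate: Generator, k: int = 5):
    """Retrieve k excerpts, build a prompt from them, and generate an answer."""
    docs, metas = retrieve(question, k)                               # Retrieve
    context = "\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"  # Augment
    return generate(prompt), docs, metas                              # Generate


def stub_retrieve(question: str, k: int):
    """Stand-in for the vector search: returns the first k items of a fixed corpus."""
    corpus = [
        ("late fee charged despite on-time payment", {"product": "Credit card"}),
        ("transfer never arrived", {"product": "Money transfer"}),
    ]
    hits = corpus[:k]
    return [text for text, _ in hits], [meta for _, meta in hits]


def stub_generate(prompt: str) -> str:
    """Stand-in for the LLM call."""
    return "stub answer"


answer, found_docs, found_metas = rag_query(
    "Why late fees?", stub_retrieve, stub_generate, k=1
)
print(answer, len(found_docs), found_metas[0]["product"])
```

Passing the retriever and generator as arguments keeps each stage independently testable.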

In [16]:
# Test with multiple queries
test_queries = [
    "Why are customers complaining about credit card fees?",
    "What issues do people have with savings accounts?",
    "Tell me about problems with money transfers",
    "What are common complaints about personal loans?"
]

for i, q in enumerate(test_queries, 1):
    print("=" * 70)
    print(f"Query {i}: {q}")
    print("=" * 70)
    
    # End-to-end RAG query (retrieve k=5, then generate)
    answer, retrieved_docs, metadata = rag.query(q)
    
    print(f"\n📝 Answer:\n{answer}\n")
    print(f"📊 Based on {len(retrieved_docs)} complaint excerpts", retrieved_docs)
    print(f"   Metadata samples:",metadata)
    print(f"   Products: {set(m['product'] for m in metadata)}")
    print()

Query 1: Why are customers complaining about credit card fees?

📝 Answer:
they are charging so much in fees can not get the card paid down due to fees.

📊 Based on 5 complaint excerpts ['this is what is wrong and many in our country. credit card companies provide credit and credit limits with no fees and low interest rates. once people actually use the credit and have balances they double and triple the interest charges. i find it interesting that the mortgage industry has been raked over coals on fees. yet the auto industry and credit card companies can kill people with fees and interest. kohls i had available and have not used my card in months and on auto pay.', 'sted we complain to them to reduce their fees to . it feels like we, the consumer, is simply being used to as a lobbying tool in a political game. the convenience of paying by credit card benefits both the consumer and the business. they do not have to hire additional staff process to checks, check ids or count cash. 

## Evaluation: RAG Quality Assessment

For each question, we evaluate:
- **Generated Answer** (from the LLM)
- **Retrieved Sources** (show 1–2 excerpts out of k=5)
- **Quality Score (1–5)** where 5 = excellent, 1 = poor
- **Comments/Analysis** (relevance, accuracy, completeness, issues)

---

### Question 1: "Why are customers complaining about credit card fees?"

**Generated Answer:**
```
they are charging so much in fees can not get the card paid down due to fees.
```

**Retrieved Sources (2 out of 5):**

**Source 1:**
- **Product**: Credit card
- **Text Excerpt**: `they are charging so much in fees can not get the card paid down due to fees.`

**Source 2:**
- **Product**: Credit card or prepaid card
- **Text Excerpt**: `fees 25.00 i consider this to be a very deceptive practice by the credit card company. therefore i am lodging this complaint.`

**Quality Score:** `4/5`

**Comments/Analysis:**
- Relevance: Strong; retrieved excerpts clearly discuss fees/interest.
- Accuracy: Reasonable, but a bit too generic (doesn't mention types of fees/interest increases).
- Completeness: Partial; answers "why" at a high level (fees trap), but could summarize patterns (late fees, APR hikes, deceptive fees).
- Issues: Answer is very short; could be more informative.

---

### Question 2: "What issues do people have with savings accounts?"

**Generated Answer:**
```
much money in the savings account as possible, and shift funds to checking only to the extent needed to pay bills. at some point i think following financial crisis in, the amount of interest paid on money in the savings account went down drastically, to a mere nominal amount -- a few basis points. from that point going forward, i just kept the majority of funds in the checking account, and only a nominal amount in the savings account, as i felt there was no real value in keeping funds in saving
```

**Retrieved Sources (2 out of 5):**

**Source 1:**
- **Product**: Checking or savings account
- **Text Excerpt**: `...the amount of interest paid on money in the savings account went down drastically, to a mere nominal amount -- a few basis points...`

**Source 2:**
- **Product**: Checking or savings account
- **Text Excerpt**: `down my entire account even including the savings account that i shared with my parents...`

**Quality Score:** `2/5`

**Comments/Analysis:**
- Relevance: Mixed; sources mention savings accounts, but several retrieved excerpts are fragmented / not clearly about "issues."
- Accuracy: The answer mostly mirrors one excerpt about low interest rates; it misses common issues shown in metadata (NSF/low funds fees, access problems, account closure).
- Completeness: Low; doesn't summarize multiple issue types; reads like a single-user narrative.
- Issues: Likely under-synthesis; retrieval quality may be suboptimal for this question (consider refining query or adding filters by product).

> Note: In this dataset, savings-related complaints often appear under the combined product label `Checking or savings account`, so retrieval may include checking-account issues too.

---

### Question 3: "Tell me about problems with money transfers"

**Generated Answer:**
```
I had issues sending and receiving money transfers numerous times on several occasions. Complaint 1: problems with money transfer, receiving and sending out. Complaint 2: I had issues sending and receiving money transfers numerous times on several occasions. Complaint 3: gram there is no way for me to learn the problem. is there no accountability for these money transfer services?
```

**Retrieved Sources (2 out of 5):**

**Source 1:**
- **Product**: Money transfer, virtual currency, or money service
- **Text Excerpt**: `problems with money transfer, receiving and sending out.`

**Source 2:**
- **Product**: Money transfer, virtual currency, or money service
- **Text Excerpt**: `...there is no way for me to learn the problem. is there no accountability for these money transfer services?`

**Quality Score:** `3/5`

**Comments/Analysis:**
- Relevance: Good; retrieved excerpts match the topic (sending/receiving failures, unclear accountability, fees/disclosures).
- Accuracy: Mostly consistent with sources.
- Completeness: Moderate; captures failures but could better summarize issue themes (failed transfers, holds, missing disclosures, unexpected fees).
- Issues: Answer repeats content and includes "Complaint 1/2/3" formatting; needs cleaner synthesis.

---

### Question 4: "What are common complaints about personal loans?"

**Generated Answer:**
```
They have predatory unethical lending tactics encouraging the account to go past due before a reduced payment can be set. this causes unnecessary late fees, high interest and the result of creating a new loan agreement. anytime that my loan had been modified they created a new account and credit line. they also promise to clean up derogatory remarks which they also lied about. Complaint 4: attempts at loan payments. Complaint 5: s being build on my loans more than 3.00 per day . ive also been in one of the branches in person, they keep saying that someone is going to call me, but no one ever did. beside the interest being build in my loans, the stress of not having the money in my account is affecting my personal and professional life.
```

**Retrieved Sources (2 out of 5):**

**Source 1:**
- **Product**: Payday loan, title loan, or personal loan
- **Text Excerpt**: `...predatory unethical lending tactics... encourages the account to go past due... causes unnecessary late fees, high interest...`

**Source 2:**
- **Product**: Payday loan, title loan, or personal loan
- **Text Excerpt**: `...interest being build on my loans more than 3.00 per day... no one ever did...`

**Quality Score:** `3/5`

**Comments/Analysis:**
- Relevance: Generally good; at least two sources clearly relate to loan servicing and costs.
- Accuracy: Largely consistent with excerpts about predatory practices, fees/interest, payment issues.
- Completeness: Moderate; covers key themes (fees/interest, modifications, servicing), but includes some unrelated or noisy fragments ("Complaint 4/5" formatting).
- Issues: Retrieved set looks mixed (some metadata shows `Checking or savings account`), suggesting the query might benefit from filtering to loan products only.

---

### Summary Evaluation Table

| Question | Generated Answer | Retrieved Sources (1–2) | Quality Score (1–5) | Comments/Analysis |
|----------|------------------|--------------------------|---------------------|------------------|
| Q1: Credit card fees | Fees make balances hard to pay down | Fees/interest increases; deceptive fees | 4 | Relevant retrieval; answer too brief, could summarize patterns |
| Q2: Savings accounts | Focuses on low interest rates | Low savings interest; account impact/closure | 2 | Under-synthesized; retrieval fragments; misses NSF/access/closure issues |
| Q3: Money transfers | Sending/receiving issues + accountability | Transfer failures; lack of transparency | 3 | Relevant, but repetitive/formatting noise; could summarize themes |
| Q4: Personal loans | Predatory practices, fees/interest, servicing | Predatory tactics; high daily interest | 3 | Mixed retrieval set; answer noisy; recommend product filtering for cleaner results |
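
As a quick aggregate of the scores in the table, the sketch below validates each score against the 1 to 5 rubric and reports the mean and the weakest question. The helper name `summarize_scores` is hypothetical; the scores are the ones assigned in this evaluation.

```python
from statistics import fmean
from typing import Dict


def summarize_scores(scores: Dict[str, int]) -> Dict[str, object]:
    """Check scores against the 1-5 rubric and return the mean and lowest-scoring question."""
    for question, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{question}: score {score} is outside the 1-5 rubric")
    return {"mean": fmean(scores.values()), "lowest": min(scores, key=scores.get)}


summary = summarize_scores({"Q1": 4, "Q2": 2, "Q3": 3, "Q4": 3})
print(summary)  # {'mean': 3.0, 'lowest': 'Q2'}
```

With only four manually scored questions, the mean is a rough indicator rather than a benchmark.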

**Overall Notes:**
- Strengths: Retrieval often returns on-topic excerpts (Q1, Q3, Q4).
- Weaknesses: Some queries retrieve noisy/fragmented texts (notably Q2) and answers sometimes mirror snippets instead of synthesizing.
- Recommended improvements: Add product-level filters (where possible), clean/summarize retrieved context, and enforce a tighter answer format in the prompt.
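
One way to add product-level filtering is a metadata `where` clause on the ChromaDB query. The sketch below only builds the keyword arguments, so it runs without a vector store; `build_query_kwargs` is a hypothetical helper, the `product` field is taken from the sample metadata, and the `$in` operator is assumed to be available in the installed ChromaDB version.

```python
from typing import Any, Dict, Sequence


def build_query_kwargs(question: str,
                       products: Sequence[str] = (),
                       k: int = 5) -> Dict[str, Any]:
    """Build keyword arguments for collection.query, optionally restricted by product."""
    kwargs: Dict[str, Any] = {"query_texts": [question], "n_results": k}
    if len(products) == 1:
        kwargs["where"] = {"product": products[0]}
    elif len(products) > 1:
        kwargs["where"] = {"product": {"$in": list(products)}}
    return kwargs


loan_kwargs = build_query_kwargs(
    "What are common complaints about personal loans?",
    products=["Payday loan, title loan, or personal loan"],
)
print(loan_kwargs["where"])
# Possible usage with a real collection (not executed here):
# results = prod_collection.query(**loan_kwargs)
```

Filtering on the exact `product` label would exclude the `Checking or savings account` excerpts that leaked into the personal-loan results, at the cost of missing complaints filed under a differently worded label.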

## Summary: RAG Pipeline Architecture

This notebook implemented the complete Task 3 RAG pipeline:

### 1. **Vector Store Setup** (`index_production.py`)
   - ✅ Loaded pre-built embeddings from `complaint_embeddings.parquet`
   - ✅ Detected embedding model: `all-MiniLM-L6-v2` (384 dimensions)
   - ✅ Created `complaints_production` collection in ChromaDB
   - ✅ Stored model metadata for validation

### 2. **RAG Pipeline** (`rag_pipeline.py`)
   - ✅ **Retrieval**: Semantic search with k=5 using cosine similarity
   - ✅ **Embedding Model**: `all-MiniLM-L6-v2` for query embedding
   - ✅ **Augmentation**: CrediTrust prompt template with complaint excerpts
   - ✅ **Generation**: Flan-T5-base LLM for answer generation

### 3. **Evaluation** (`rag_demo.ipynb`)
   - ✅ Evaluated RAG quality with multiple questions
   - ✅ Analyzed generated answers and retrieved sources
   - ✅ Provided quality scores and comments for improvement

### 4. **Key Features**
   - Embedding model validation (ensures query model matches stored embeddings)
   - Production-ready vector store (~248K complaints)
   - Task 3 compliant: k=5 retrieval, correct prompt template
   - End-to-end query interface for easy testing

**Note**: This replaces the Task 2 prototype (15K sample with self-created embeddings) with the production system using pre-built embeddings.