## Step 1: Load the cleaned dataset

In [1]:
import pandas as pd

df = pd.read_csv("../data/processed/filtered_complaints.csv")
print(df.shape)
df.head()

(82164, 20)


Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID,complaint_length,clean_complaint
0,2025-06-13,Credit card,Store credit card,Getting a credit card,Card opened without my consent or knowledge,A XXXX XXXX card was opened under my name by a...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",TX,78230,Servicemember,Consent provided,Web,2025-06-13,Closed with non-monetary relief,Yes,,14069121,91,a xxxx xxxx card was opened under my name by a...
1,2025-06-12,Credit card,General-purpose credit card or charge card,"Other features, terms, or problems",Other problem,"Dear CFPB, I have a secured credit card with c...",Company has responded to the consumer and the ...,"CITIBANK, N.A.",NY,11220,,Consent provided,Web,2025-06-13,Closed with monetary relief,Yes,,14047085,156,dear cfpb i have a secured credit card with ci...
2,2025-06-12,Credit card,General-purpose credit card or charge card,Incorrect information on your report,Account information incorrect,I have a Citi rewards cards. The credit balanc...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",IL,60067,,Consent provided,Web,2025-06-12,Closed with explanation,Yes,,14040217,233,i have a citi rewards cards the credit balance...
3,2025-06-09,Credit card,General-purpose credit card or charge card,Problem with a purchase shown on your statement,Credit card company isn't resolving a dispute ...,b'I am writing to dispute the following charge...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",TX,78413,Older American,Consent provided,Web,2025-06-09,Closed with monetary relief,Yes,,13968411,454,b i am writing to dispute the following charge...
4,2025-06-09,Credit card,General-purpose credit card or charge card,Problem when making payments,Problem during payment process,"Although the account had been deemed closed, I...",Company believes it acted appropriately as aut...,Atlanticus Services Corporation,NY,11212,Older American,Consent provided,Web,2025-06-09,Closed with monetary relief,Yes,,13965746,170,although the account had been deemed closed i ...


## Step 2: Create a stratified sample (10,000–15,000 rows)

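To ensure fair representation across product categories, sample each product in proportion to its share of the full dataset. Because each group's quota is truncated with `int()`, the sample can come out slightly below `target_size` (11,999 rows in the run below).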

In [2]:
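target_size = 12000  # desired total sample size

# Sample each product in proportion to its share of the full dataset.
# int() truncates each quota, so the total may be slightly below target_size.
sampled_df = (
    df.groupby("Product", group_keys=False)
      .apply(lambda x: x.sample(
          n=int(len(x) / len(df) * target_size),
          random_state=42  # fixed seed for reproducibility
      ))
)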

print(sampled_df["Product"].value_counts())


Product
Credit card        11781
Money transfers      218
Name: count, dtype: int64
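
The counts above sum to 11,999 rather than 12,000 because each quota is truncated independently. One way to hit the target exactly is largest-remainder allocation. The sketch below is an illustration only, not part of the original notebook: the helper name `proportional_allocation` and the group sizes in the example are hypothetical.

```python
def proportional_allocation(group_sizes, target_size):
    """Split target_size across groups in proportion to their sizes.

    Each group first gets the floor of its exact quota; the rows still
    missing are handed to the groups with the largest fractional remainders.
    """
    total = sum(group_sizes.values())
    exact = {g: n * target_size / total for g, n in group_sizes.items()}
    alloc = {g: int(q) for g, q in exact.items()}

    shortfall = target_size - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda g: exact[g] - alloc[g], reverse=True)
    for g in by_remainder[:shortfall]:
        alloc[g] += 1
    return alloc


# Hypothetical group sizes, chosen only to show the rounding behaviour.
demo_sizes = {"Credit card": 70, "Money transfers": 20, "Savings account": 13}
demo_alloc = proportional_allocation(demo_sizes, target_size=10)
print(demo_alloc)
```

Under this approach, each group's allocation would be passed as `n` to `x.sample(...)` in place of the truncated expression.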




## Step 3: Define the text chunking strategy

Embedding a long complaint as a single block compresses several distinct points into one vector, which reduces embedding quality.
Splitting complaints into chunks improves semantic precision and retrieval accuracy.

Chunking design decisions:

- Chunk size: 400–500 characters
- Overlap: 50–100 characters
- Reason: preserves semantic continuity across chunk boundaries without exceeding embedding limits
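
To make the size and overlap settings concrete, here is a minimal fixed-width sketch of chunking with overlap. It is a simplification for illustration: the helper name `chunk_with_overlap` is hypothetical, and unlike a separator-aware splitter it cuts at exact character offsets rather than preferring paragraph, sentence, or word boundaries.

```python
def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters, so content near a
    boundary appears in both neighbouring chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# A 1,200-character dummy string stands in for a complaint narrative.
sample_text = "".join(str(i % 10) for i in range(1200))
parts = chunk_with_overlap(sample_text, chunk_size=500, overlap=50)
print([len(p) for p in parts])
```

Each window starts `chunk_size - overlap` characters after the previous one, so a larger overlap yields more chunks for the same text.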

## Step 4: Implement text chunking
Using LangChain’s RecursiveCharacterTextSplitter:

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

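text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # maximum characters per chunk
    chunk_overlap=50,   # characters shared between consecutive chunks
    separators=["\n\n", "\n", ".", " "]  # tried in order, coarsest first
)

chunks = []    # chunk texts, in insertion order
metadata = []  # metadata[i] describes chunks[i]

for _, row in sampled_df.iterrows():
    texts = text_splitter.split_text(row["clean_complaint"])
    for i, chunk in enumerate(texts):
        chunks.append(chunk)
        metadata.append({
            "complaint_id": row["Complaint ID"],
            "product": row["Product"],
            "chunk_index": i  # position of the chunk within its complaint
        })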




## Step 5: Choose and load the embedding model

In [7]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


## Step 6: Generate embeddings

In [8]:
embeddings = model.encode(
    chunks,
    show_progress_bar=True,
    batch_size=64
)

Batches: 100%|██████████| 527/527 [27:58<00:00,  3.18s/it]


## Step 7: Create and store the vector database

Using FAISS (recommended for local setups, since it runs in-process with no separate server):

In [9]:
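import faiss
import numpy as np
import os

# FAISS expects a contiguous float32 array of shape (n_vectors, dimension).
embeddings = np.ascontiguousarray(embeddings, dtype="float32")

dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # exact (brute-force) L2 search
index.add(embeddings)  # row i of embeddings corresponds to metadata[i]

os.makedirs("../vector_store", exist_ok=True)
faiss.write_index(index, "../vector_store/complaints.index")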
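
The saved index holds only vectors, so the `chunks` and `metadata` lists need to be stored as well if search results are to be mapped back to complaints in a later session. The sketch below shows one possible way to do that with JSON. The helper names, the file name `complaints_chunks.json`, and the demo records are assumptions for illustration, not part of the original notebook.

```python
import json
import os
import tempfile


def save_chunk_store(chunks, metadata, path):
    """Write chunk texts and their metadata to one JSON file, in index order."""
    records = [
        {
            "text": text,
            # Cast explicitly: values read from a DataFrame may be NumPy
            # scalars, which the json module cannot serialise.
            "complaint_id": int(meta["complaint_id"]),
            "product": str(meta["product"]),
            "chunk_index": int(meta["chunk_index"]),
        }
        for text, meta in zip(chunks, metadata)
    ]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)


def load_chunk_store(path):
    """Read the records back; position i matches vector i in the index."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Made-up records standing in for the notebook's chunks and metadata.
demo_chunks = ["first chunk of a complaint", "second chunk of a complaint"]
demo_metadata = [
    {"complaint_id": 101, "product": "Credit card", "chunk_index": 0},
    {"complaint_id": 101, "product": "Credit card", "chunk_index": 1},
]
demo_path = os.path.join(tempfile.mkdtemp(), "complaints_chunks.json")
save_chunk_store(demo_chunks, demo_metadata, demo_path)
restored = load_chunk_store(demo_path)
```

Keeping the records in the same order as the vectors is what lets a position returned by the index be used directly as a list index.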

## Step 8: Verify the index

In [10]:
print("Total vectors:", index.ntotal)
query = "issues with credit card charges"
query_embedding = model.encode([query])  # shape (1, dimension)
# D holds squared L2 distances, I the row positions of the nearest vectors.
D, I = index.search(query_embedding, k=5)

for idx in I[0]:
    print(metadata[idx])

Total vectors: 33691
{'complaint_id': 1498379, 'product': 'Credit card', 'chunk_index': 0}
{'complaint_id': 7757221, 'product': 'Credit card', 'chunk_index': 4}
{'complaint_id': 7874073, 'product': 'Credit card', 'chunk_index': 2}
{'complaint_id': 2349756, 'product': 'Credit card', 'chunk_index': 0}
{'complaint_id': 13513740, 'product': 'Credit card', 'chunk_index': 1}
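
The check above prints metadata only. Inspecting the distance and a text preview for each hit makes it easier to judge whether the results are relevant. The helper below is a hypothetical sketch that works on plain sequences, so it could be called as `format_hits(D[0], I[0], chunks, metadata)` with the variables from the cells above. The demo values are made up.

```python
def format_hits(distances, indices, chunks, metadata, preview_chars=80):
    """Combine one query's search results with chunk text and metadata."""
    hits = []
    for dist, idx in zip(distances, indices):
        idx = int(idx)
        if idx < 0:
            # FAISS pads the result with -1 when fewer than k vectors exist.
            continue
        hit = {"distance": float(dist), "preview": chunks[idx][:preview_chars]}
        hit.update(metadata[idx])
        hits.append(hit)
    return hits


# Made-up search output for a three-chunk store queried with k=3.
demo_chunks = ["late fee charged twice", "card declined abroad", "refund never arrived"]
demo_metadata = [
    {"complaint_id": 1, "product": "Credit card", "chunk_index": 0},
    {"complaint_id": 2, "product": "Credit card", "chunk_index": 0},
    {"complaint_id": 3, "product": "Money transfers", "chunk_index": 2},
]
demo_hits = format_hits([0.41, 0.77, 3.4e38], [0, 2, -1], demo_chunks, demo_metadata)
for hit in demo_hits:
    print(hit)
```

With `IndexFlatL2`, smaller distances mean closer matches, so the first hit is the best one.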
