#### Task 2: Text Chunking, Embedding, and Vector Store Indexing
This notebook implements Task 2 by covering:
- Stratified sampling
- Text chunking
- Embedding generation
- Vector store indexing

In [3]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("../data/processed/filtered_complaints.csv")
df.shape

(364292, 19)

In [4]:
print("Total cleaned complaints:", df.shape[0])
df.head()

Total cleaned complaints: 364292


Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID,cleaned_narrative
0,2025-06-13,Credit card,Store credit card,Getting a credit card,Card opened without my consent or knowledge,A XXXX XXXX card was opened under my name by a...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",TX,78230,Servicemember,Consent provided,Web,2025-06-13,Closed with non-monetary relief,Yes,,14069121.0,a xxxx xxxx card was opened under my name by a...
1,2025-06-13,Checking or savings account,Checking account,Managing an account,Deposits and withdrawals,I made the mistake of using my wellsfargo debi...,Company has responded to the consumer and the ...,WELLS FARGO & COMPANY,ID,83815,,Consent provided,Web,2025-06-13,Closed with explanation,Yes,,14061897.0,i made the mistake of using my wellsfargo debi...
2,2025-06-12,Credit card,General-purpose credit card or charge card,"Other features, terms, or problems",Other problem,"Dear CFPB, I have a secured credit card with c...",Company has responded to the consumer and the ...,"CITIBANK, N.A.",NY,11220,,Consent provided,Web,2025-06-13,Closed with monetary relief,Yes,,14047085.0,dear cfpb i have a secured credit card with ci...
3,2025-06-12,Credit card,General-purpose credit card or charge card,Incorrect information on your report,Account information incorrect,I have a Citi rewards cards. The credit balanc...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",IL,60067,,Consent provided,Web,2025-06-12,Closed with explanation,Yes,,14040217.0,i have a citi rewards cards the credit balance...
4,2025-06-09,Credit card,General-purpose credit card or charge card,Problem with a purchase shown on your statement,Credit card company isn't resolving a dispute ...,b'I am writing to dispute the following charge...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",TX,78413,Older American,Consent provided,Web,2025-06-09,Closed with monetary relief,Yes,,13968411.0,bi am writing to dispute the following charges...


### Step 1: Stratified Sampling

In [22]:
TOTAL_SAMPLE_SIZE = 10000

Note: Some product categories in the CFPB data use inconsistent naming.
For Task 2, we normalize these labels into four high-level categories
to enable clean stratified sampling and metadata storage for the vector database.

In [14]:
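def map_product(product):
    """Map a raw CFPB product label to one of four high-level categories."""
    # Missing labels cannot be lowercased; leave them unmapped.
    if pd.isna(product):
        return None

    product = product.lower()

    # Rules are checked in order, so the first matching keyword wins.
    if "credit card" in product:
        return "Credit Card"
    elif "loan" in product:
        return "Personal Loan"
    elif "savings" in product or "checking" in product:
        return "Savings Account"
    elif "money transfer" in product or "money service" in product:
        return "Money Transfer"
    else:
        return None


df['Product_clean'] = df['Product'].apply(map_product)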

In [15]:
df['Product_clean'].value_counts(dropna=False)

Product_clean
Credit Card        143514
Savings Account    109333
Money Transfer      84328
Personal Loan       27117
Name: count, dtype: int64
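
Because the mapping relies on ordered substring checks, it is worth seeing how a few labels fall through the rules. The sketch below repeats the notebook's rules so it runs on its own. "Credit card" and "Checking or savings account" appear in the data preview above; the remaining labels are illustrative assumptions and may not occur in the filtered file.

```python
def map_product(product):
    # Copy of the notebook's keyword rules, repeated so this example is standalone.
    product = product.lower()
    if "credit card" in product:
        return "Credit Card"
    elif "loan" in product:
        return "Personal Loan"
    elif "savings" in product or "checking" in product:
        return "Savings Account"
    elif "money transfer" in product or "money service" in product:
        return "Money Transfer"
    return None

examples = [
    "Credit card",                                   # seen in the preview above
    "Checking or savings account",                   # seen in the preview above
    "Payday loan, title loan, or personal loan",     # illustrative label (assumption)
    "Money transfer, virtual currency, or money service",  # illustrative label (assumption)
    "Student loan",                                  # illustrative label (assumption)
    "Mortgage",                                      # illustrative label (assumption)
]

mapped = {label: map_product(label) for label in examples}
for label, category in mapped.items():
    print(f"{label!r} -> {category}")
```

Note that any label containing "loan" lands in "Personal Loan", so the rule is only safe if the upstream filter has already removed unrelated loan products.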

In [20]:
df_sample_base = df.dropna(subset=['Product_clean'])

In [21]:
product_counts = df_sample_base['Product_clean'].value_counts(normalize=True)

In [24]:
sample_sizes = (product_counts * TOTAL_SAMPLE_SIZE).round().astype(int)
sample_sizes

Product_clean
Credit Card        3940
Savings Account    3001
Money Transfer     2315
Personal Loan       744
Name: proportion, dtype: int64
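
Rounding each stratum independently happens to sum to exactly 10,000 here, but that is not guaranteed in general. As a hedged alternative, the sketch below uses largest-remainder allocation, which always hits the target total. The helper name `allocate_samples` is hypothetical; the counts are copied from the `value_counts` output above.

```python
import math

def allocate_samples(counts, total):
    """Split `total` across strata in proportion to `counts` (largest-remainder method)."""
    population = sum(counts.values())
    exact = {k: total * v / population for k, v in counts.items()}
    # Start from the floor of each exact share...
    sizes = {k: math.floor(x) for k, x in exact.items()}
    leftover = total - sum(sizes.values())
    # ...then give the remaining slots to the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - sizes[k], reverse=True)
    for k in by_remainder[:leftover]:
        sizes[k] += 1
    return sizes

counts = {
    "Credit Card": 143514,
    "Savings Account": 109333,
    "Money Transfer": 84328,
    "Personal Loan": 27117,
}
sizes = allocate_samples(counts, 10_000)
print(sizes, sum(sizes.values()))

# Three equal strata and a total of 10: plain rounding would give 3 + 3 + 3 = 9.
even = allocate_samples({"a": 1, "b": 1, "c": 1}, 10)
print(even, sum(even.values()))
```

For this dataset the allocation matches the rounded sizes shown above, so the notebook's result is unaffected.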

In [25]:
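sampled_parts = []

for product, n_samples in sample_sizes.items():
    df_product = df_sample_base[df_sample_base['Product_clean'] == product]
    # A fixed random_state keeps the sample reproducible across runs.
    sampled_parts.append(df_product.sample(n=n_samples, random_state=42))

# Concatenate once, after every stratum has been sampled.
df_sampled = pd.concat(sampled_parts)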

In [26]:
df_sampled.shape

(10000, 20)

In [27]:
df_sampled['Product_clean'].value_counts(normalize=True)

Product_clean
Credit Card        0.3940
Savings Account    0.3001
Money Transfer     0.2315
Personal Loan      0.0744
Name: proportion, dtype: float64

In [28]:
output_path = "../data/processed/sampled_complaints.csv"
df_sampled.to_csv(output_path, index=False)

### Step 2: Text Chunking

In [36]:
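def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    # The window must advance on every step, otherwise the loop never ends.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        # Step back by `overlap` so neighbouring chunks share context.
        start = end - overlap
    return chunks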
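
To make the window arithmetic concrete, the sketch below runs the same sliding-window logic on two synthetic strings. The inputs are made up for illustration; the function body repeats the cell above so the example is standalone.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    # Same sliding-window logic as the notebook cell, repeated to run standalone.
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

# Synthetic 1,200-character string: windows start at 0, 450, and 900.
text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text)
lengths = [len(c) for c in chunks]
print(lengths)

# The last 50 characters of one chunk are the first 50 of the next.
shared = chunks[0][-50:] == chunks[1][:50]
print(shared)

# A 500-character string yields a second chunk made up only of overlap.
short_lengths = [len(c) for c in chunk_text("x" * 500)]
print(short_lengths)
```

The last case shows that a narrative whose length falls between 451 and 500 characters produces a short trailing chunk that is fully contained in the previous one, which slightly inflates the chunk count.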

In [40]:
df_sampled['cleaned_narrative'].isna().sum()

np.int64(0)

In [41]:
rows = []

for idx, row in df_sampled.iterrows():
    text = row["cleaned_narrative"]

    # SAFETY CHECK
    if pd.isna(text) or text.strip() == "":
        continue

    chunks = chunk_text(text)

    for i, chunk in enumerate(chunks):
        rows.append({
            "complaint_id": row["Complaint ID"],   # use exact column name
            "product": row["Product_clean"],
            "chunk_id": i,
            "text_chunk": chunk
        })

df_chunks = pd.DataFrame(rows)

In [42]:
df_chunks.shape

(29422, 4)

In [43]:
df_chunks.head()

Unnamed: 0,complaint_id,product,chunk_id,text_chunk
0,8227549.0,Credit Card,0,i did send the credit company what they reques...
1,3422079.0,Credit Card,0,i have been paying for charges that i didnt pe...
2,8015509.0,Credit Card,0,not the first time this occurred and because c...
3,8015509.0,Credit Card,1,ice to close my credit card as they can not an...
4,8015509.0,Credit Card,2,eks ago to close my account and i was not subm...
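
Rows 3 and 4 above begin mid-word ("ice to close", "eks ago") because the windows are cut at fixed character offsets. One possible alternative, sketched here under assumptions and not what the notebook uses, packs whole words into each chunk and carries a few trailing words forward as overlap. The name `chunk_words` and the `overlap_words` parameter are hypothetical.

```python
def chunk_words(text, chunk_size=500, overlap_words=8):
    """Greedy word-boundary chunker (illustrative sketch).

    Chunks stay within chunk_size characters as long as the carried-over
    words plus the next word fit; a single oversized word can exceed it.
    """
    chunks, current = [], []
    length = 0  # length of " ".join(current)
    for word in text.split():
        added = len(word) + (1 if current else 0)
        if current and length + added > chunk_size:
            chunks.append(" ".join(current))
            # Carry the last few words into the next chunk as overlap.
            current = current[-overlap_words:] if overlap_words else []
            length = len(" ".join(current))
            added = len(word) + (1 if current else 0)
        current.append(word)
        length += added
    if current:
        chunks.append(" ".join(current))
    return chunks

# Synthetic input: 200 short words.
sample = " ".join(f"word{i}" for i in range(200))
word_chunks = chunk_words(sample, chunk_size=100, overlap_words=3)
print(len(word_chunks), max(len(c) for c in word_chunks))
```

Whether word-level boundaries improve retrieval quality for this dataset would need to be checked empirically.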


In [44]:
df_chunks["product"].value_counts(normalize=True)

product
Credit Card        0.403066
Savings Account    0.317551
Money Transfer     0.200156
Personal Loan      0.079226
Name: proportion, dtype: float64
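
The chunk-level shares differ from the complaint-level shares because each complaint contributes a number of chunks that grows with its narrative length. A rough back-of-envelope check, using only the figures printed in this notebook and assuming no sampled complaint was skipped by the safety check:

```python
# Complaints sampled per product (from the sample_sizes output).
sample_sizes = {
    "Credit Card": 3940,
    "Savings Account": 3001,
    "Money Transfer": 2315,
    "Personal Loan": 744,
}
# Share of chunks per product (from the value_counts output above).
chunk_shares = {
    "Credit Card": 0.403066,
    "Savings Account": 0.317551,
    "Money Transfer": 0.200156,
    "Personal Loan": 0.079226,
}
total_chunks = 29422  # from df_chunks.shape

# Recover approximate chunk counts, then divide by complaints per product.
chunk_counts = {p: round(s * total_chunks) for p, s in chunk_shares.items()}
chunks_per_complaint = {p: chunk_counts[p] / sample_sizes[p] for p in sample_sizes}

for product, ratio in chunks_per_complaint.items():
    print(f"{product}: {ratio:.2f} chunks per complaint")
```

On these numbers, Money Transfer complaints average fewer chunks than the other categories, which is consistent with its smaller share at the chunk level.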

### Step 3: Embeddings