#### Text Chunking, Embedding, and Vector Store Indexing
***Objective***: convert the cleaned text narratives into a format suitable for efficient semantic search.
**What we do here**:
 * Create a stratified sample of 10,000-15,000 complaints from the cleaned dataset.
 * Implement a text chunking strategy.
 * Choose an embedding model.
 * For each text chunk, generate its vector embedding.
 * Build a FAISS index over the embeddings and save it to disk.

# Load Filtered Dataset

In [1]:
import pandas as pd
import numpy as np
import sys
sys.path.append("..")

In [2]:
# load data
df=pd.read_csv("../data/processed/filtered_complaints.csv")
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID,narrative_length,cleaned_narrative
0,2025-06-13,Credit card,Store credit card,Getting a credit card,Card opened without my consent or knowledge,A XXXX XXXX card was opened under my name by a...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",TX,78230,Servicemember,Consent provided,Web,2025-06-13,Closed with non-monetary relief,Yes,,14069121,91,a xxxx xxxx card was opened under my name by a...
1,2025-06-12,Credit card,General-purpose credit card or charge card,"Other features, terms, or problems",Other problem,"Dear CFPB, I have a secured credit card with c...",Company has responded to the consumer and the ...,"CITIBANK, N.A.",NY,11220,,Consent provided,Web,2025-06-13,Closed with monetary relief,Yes,,14047085,156,dear cfpb i have a secured credit card with ci...
2,2025-06-12,Credit card,General-purpose credit card or charge card,Incorrect information on your report,Account information incorrect,I have a Citi rewards cards. The credit balanc...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",IL,60067,,Consent provided,Web,2025-06-12,Closed with explanation,Yes,,14040217,233,i have a citi rewards cards the credit balance...
3,2025-06-09,Credit card,General-purpose credit card or charge card,Problem with a purchase shown on your statement,Credit card company isn't resolving a dispute ...,b'I am writing to dispute the following charge...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",TX,78413,Older American,Consent provided,Web,2025-06-09,Closed with monetary relief,Yes,,13968411,454,bi am writing to dispute the following charges...
4,2025-06-09,Credit card,General-purpose credit card or charge card,Problem when making payments,Problem during payment process,"Although the account had been deemed closed, I...",Company believes it acted appropriately as aut...,Atlanticus Services Corporation,NY,11212,Older American,Consent provided,Web,2025-06-09,Closed with monetary relief,Yes,,13965746,170,although the account had been deemed closed i ...


In [3]:
# import custom module
from src.chunk import ComplaintEmbeddingProcessor
processer=ComplaintEmbeddingProcessor(df)



[INFO] Initialized processor with 80667 records


#### 1. Stratified Sample
**(What and Why)**
* To make the embedding process computationally feasible while preserving data representativeness, we apply stratified sampling to the cleaned complaint dataset. 
* A subset of approximately 10,000–15,000 complaints is selected such that each product category (e.g., Credit Cards, Personal Loans, Savings Accounts, Money Transfers) is represented in proportion to its original distribution. 
* Because each product keeps its original share, the downstream embedding and retrieval steps see the same product mix as the full dataset and low-volume products are not lost by chance, while computational cost drops significantly.
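
The internals of `stratified_sample` are not shown in this notebook. As a minimal sketch of proportional stratified sampling with pandas, assuming a hypothetical helper named `proportional_sample` and using `Product` as the stratum column, the idea could look like this:

```python
import pandas as pd


def proportional_sample(df, by, n_samples, seed=42):
    """Draw about n_samples rows while keeping each group's share of the data.

    Hypothetical helper for illustration; not the notebook's own method.
    """
    frac = min(1.0, n_samples / len(df))
    parts = [
        # Every group keeps at least one row so rare products are not lost.
        group.sample(n=max(1, round(len(group) * frac)), random_state=seed)
        for _, group in df.groupby(by)
    ]
    return pd.concat(parts)


# Toy data: product "A" makes up 80% of the rows, product "B" 20%.
toy = pd.DataFrame({"Product": ["A"] * 80 + ["B"] * 20, "row": range(100)})
sample = proportional_sample(toy, by="Product", n_samples=10)
counts = sample["Product"].value_counts().to_dict()
```

Because of rounding and the one-row minimum per group, the total returned by this sketch can differ slightly from `n_samples` on real data.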

In [4]:
processer.stratified_sample(n_samples=12000)

[INFO] Stratified sample created: 12000 records


  self.df = self.df.groupby(self.product_col, group_keys=False).apply(


Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID,narrative_length,cleaned_narrative
60999,2023-09-14,Credit card,Store credit card,Problem with a purchase shown on your statement,Credit card company isn't resolving a dispute ...,I have a balance of {$710.00} from my XXXX XXX...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",IL,60615,,Consent provided,Web,2023-09-14,Closed with explanation,Yes,,7548263,150,i have a balance of 71000 from my xxxx xxxx xx...
6927,2025-01-28,Credit card,General-purpose credit card or charge card,Fees or interest,Unexpected increase in interest rate,My original credit card rate on my American Ex...,,AMERICAN EXPRESS COMPANY,SC,29349,Older American,Consent provided,Web,2025-01-31,Closed with explanation,Yes,,11805170,96,my original credit card rate on my american ex...
34733,2024-09-07,Credit card,General-purpose credit card or charge card,Improper use of your report,Credit inquiries on your report that you don't...,XXXX XXXX XXXX XXXX XXXX XXXX XXXX FL XXXX TRA...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",FL,320XX,,Consent provided,Web,2024-09-07,Closed with explanation,Yes,,10034123,117,xxxx xxxx xxxx xxxx xxxx xxxx xxxx fl xxxx tra...
44531,2016-08-03,Credit card,,Other,,GENDER DISCRIMINATION - I used a kiosk at XXXX...,Company has responded to the consumer and the ...,SYNCHRONY FINANCIAL,MA,016XX,Older American,Consent provided,Web,2016-08-03,Closed with explanation,Yes,No,2044150,302,gender discrimination i used a kiosk at xxxx x...
58684,2023-11-03,Credit card,General-purpose credit card or charge card,Getting a credit card,Card opened without my consent or knowledge,JPMB J.P Morgan hard inquiry was opened on my ...,,JPMORGAN CHASE & CO.,GA,30084,,Consent provided,Web,2023-11-09,Closed with non-monetary relief,Yes,,7799737,29,jpmb jp morgan hard inquiry was opened on my a...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45250,2017-01-12,Credit card,,Transaction issue,,"On XXXX previous occasions such as this, XXXX ...",Company has responded to the consumer and the ...,PENTAGON FEDERAL CREDIT UNION,NY,14450,,Consent provided,Web,2017-01-12,Closed with explanation,Yes,No,2287722,89,on xxxx previous occasions such as this xxxx x...
70013,2024-04-19,Credit card,General-purpose credit card or charge card,"Advertising and marketing, including promotion...",Confusing or misleading advertising about the ...,I affirm that the inclusion of incorrectly rep...,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,TX,793XX,,Consent provided,Web,2024-04-19,Closed with explanation,Yes,,8814518,43,i affirm that the inclusion of incorrectly rep...
32478,2023-09-21,Credit card,General-purpose credit card or charge card,Problem with a company's investigation into an...,Was not notified of investigation status or re...,See the attached documents. I want the bureau ...,,"EQUIFAX, INC.",TX,76063,,Consent provided,Web,2023-09-21,Closed with explanation,Yes,,7584642,23,see the attached documents i want the bureau t...
43701,2016-06-21,Credit card,,Identity theft / Fraud / Embezzlement,,XXXX changed credit card companies from XXXX X...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",TX,76086,,Consent provided,Web,2016-06-22,Closed with explanation,Yes,No,1977251,66,xxxx changed credit card companies from xxxx x...


#### 2. Chunk Text
**(What and Why)**:
* Customer complaint narratives often contain long, detailed descriptions that exceed the optimal input length for embedding models. 
* To address this, we apply text chunking, splitting each complaint into smaller overlapping segments (e.g., 500 characters with a 50-character overlap; the run below uses `chunk_size=200` and `chunk_overlap=50`).
* This approach preserves contextual continuity while enabling more precise semantic retrieval, ensuring that specific issues within a complaint can be independently retrieved and analyzed.
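
The splitter inside `chunk_texts` is not shown here, so the following is only a sketch of the general technique: a fixed-size sliding window over characters. The function name `chunk_text` and the choice of characters as the unit are assumptions for illustration.

```python
def chunk_text(text, chunk_size=200, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper; the real splitter may use words, tokens, or separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A 500-character toy string yields windows starting at 0, 150, and 300.
toy_text = "".join(str(i % 10) for i in range(500))
toy_chunks = chunk_text(toy_text, chunk_size=200, chunk_overlap=50)
```

Each window repeats the last 50 characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.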

In [5]:
processer.chunk_texts(chunk_size=200, chunk_overlap=50)

[INFO] Text chunking complete: 88802 chunks


Unnamed: 0,complaint_id,product,issue,sub_issue,chunk_index,chunk_text
0,,Credit card,,,0,i have a balance of 71000 from my xxxx xxxx xx...
1,,Credit card,,,1,to find out exactly where the charge is coming...
2,,Credit card,,,2,got no insight on to why im being charged so n...
3,,Credit card,,,3,support wayfair directed me to xxxx and xxxx d...
4,,Credit card,,,4,receive a payment product xxxx xxxx xxxxxxxx x...
...,...,...,...,...,...,...
88797,,Credit card,,,1,we regularly received letters saying they woul...
88798,,Credit card,,,2,credit card but they still opened the account ...
88799,,Credit card,,,0,i contacted them with the information that i h...
88800,,Credit card,,,1,afford of xxxx dollars they instead offered me...


#### 3. Generate Embeddings
* Each text chunk is transformed into a numerical vector representation using a sentence embedding model (all-MiniLM-L6-v2), which produces one 384-dimensional vector per chunk.
* These embeddings capture the semantic meaning of complaint narratives, allowing the system to identify similarities between user queries and complaint text beyond simple keyword matching. 
* This step is essential for enabling semantic search and retrieval in the RAG pipeline.
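
As a rough sketch of this step, the helpers below encode chunks with any model that exposes a SentenceTransformer-style `encode()` method and then score chunks against a query by cosine similarity. The names `embed_chunks` and `cosine_scores` are hypothetical, and `batch_size=64` is an assumption (it is consistent with the 1388 batches reported for 88802 chunks, but the notebook does not state it).

```python
import numpy as np


def embed_chunks(chunks, model, batch_size=64):
    """Encode text chunks with a model exposing a SentenceTransformer-style encode()."""
    vectors = model.encode(list(chunks), batch_size=batch_size, show_progress_bar=False)
    return np.asarray(vectors, dtype=np.float32)


def cosine_scores(query_vec, chunk_vecs):
    """Cosine similarity between one query vector and each row of chunk_vecs."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return m @ q


# Usage with the real model (downloads weights on first run):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# chunk_vecs = embed_chunks(["late fee charged twice", "card opened without consent"], model)
# query_vec = embed_chunks(["unauthorized account"], model)[0]
# scores = cosine_scores(query_vec, chunk_vecs)  # higher means more similar
```

Passing the model in as an argument keeps the helpers easy to try with a stub object before committing to a long encoding run.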

In [6]:
processer.generate_embeddings()

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


[INFO] Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2


Batches: 100%|██████████| 1388/1388 [52:26<00:00,  2.27s/it] 


[INFO] Embeddings generated: 88802 vectors


array([[-0.02177579,  0.0795213 , -0.12890302, ..., -0.0023668 ,
        -0.00336253, -0.01501668],
       [ 0.06561141,  0.03563155, -0.05398148, ...,  0.0407705 ,
        -0.0040665 , -0.02378921],
       [-0.00857091,  0.02646689, -0.01072805, ..., -0.02082646,
         0.0407073 , -0.05133781],
       ...,
       [-0.04319381,  0.10807727,  0.07178776, ..., -0.06444943,
        -0.03093561, -0.01657288],
       [-0.09545013,  0.06191992,  0.04785169, ..., -0.08671392,
        -0.02430261, -0.03078377],
       [-0.02597769,  0.05703136, -0.01186326, ..., -0.09036586,
        -0.04116761, -0.04973433]], shape=(88802, 384), dtype=float32)

##### Build the Vector Index with FAISS

In [7]:
processer.build_faiss_index()

[INFO] FAISS index created with 88802 vectors


<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x000001B8B11E67F0> >
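
The object returned above is an `IndexFlatL2`, FAISS's exact (brute-force) index, which ranks stored vectors by squared L2 distance to the query. To make that concrete, here is a small NumPy sketch of the same computation on toy data; `flat_l2_search` is a hypothetical name and this is not how `build_faiss_index` is implemented.

```python
import numpy as np


def flat_l2_search(index_vectors, queries, k=5):
    """Exact k-nearest-neighbor search by squared L2 distance (brute force)."""
    diffs = queries[:, None, :] - index_vectors[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)        # shape: (n_queries, n_vectors)
    ids = np.argsort(dists, axis=1)[:, :k]    # closest vectors first
    return np.take_along_axis(dists, ids, axis=1), ids


vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype=np.float32)
query = np.array([[0.9, 0.0]], dtype=np.float32)
distances, neighbors = flat_l2_search(vectors, query, k=2)

# With FAISS installed, the equivalent calls are:
# import faiss
# index = faiss.IndexFlatL2(vectors.shape[1])
# index.add(vectors)
# distances, neighbors = index.search(query, 2)
```

The NumPy version materializes every pairwise difference, so it is only suitable for toy inputs; FAISS performs the same exact search far more efficiently at the scale of 88802 vectors.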

In [8]:
# save FAISS index
processer.save_faiss_index()

[INFO] FAISS index saved to vector_store/faiss_index.bin
[INFO] Metadata saved to vector_store/faiss_metadata.csv
