# Task 2: Chunking, Embedding, and Indexing 🧩

This notebook builds the indexing stage of the RAG pipeline: chunking the complaint narratives, embedding the chunks, and indexing them in a vector database (ChromaDB).

## Objectives
1. **Load Preprocessed Data**: Use the cleaned data from Task 1.
2. **Chunking**: Split narratives into overlapping chunks (500 characters with a 50-character overlap).
3. **Embedding**: Convert chunks into vector embeddings using `all-MiniLM-L6-v2`.
4. **Vector Store**: Index the embeddings in a local, persistent ChromaDB instance.

In [1]:
import sys
import os
import pandas as pd

# Make the project's src/ directory importable (the path is relative to the notebook's working directory)
sys.path.append(os.path.abspath('../src'))
from chunk_embed_index import ChunkEmbedIndex

## 1. Initialize Pipeline
We instantiate `ChunkEmbedIndex` with the path to the processed data, the target vector store directory, and the collection name.

In [2]:
DATA_PATH = '../data/processed/cleaned_complaints.csv'
VECTOR_DB_PATH = '../vector_store'
COLLECTION_NAME = 'complaints_prototype'

indexer = ChunkEmbedIndex(DATA_PATH, VECTOR_DB_PATH, COLLECTION_NAME)

## 2. Load Data
Load the cleaned CSV. We start with a 1,000-row sample to verify the pipeline end to end.

In [3]:
# Load a sample first for verification
indexer.load_processed_data(nrows=1000)

Loading data from ../data/processed/cleaned_complaints.csv...
Loaded 1000 rows.


Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID,cleaned_narrative
0,2025-06-13,Credit card,Store credit card,Getting a credit card,Card opened without my consent or knowledge,A XXXX XXXX card was opened under my name by a...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",TX,78230,Servicemember,Consent provided,Web,2025-06-13,Closed with non-monetary relief,Yes,,14069121,card opened name fraudster received notice acc...
1,2025-06-12,Credit card,General-purpose credit card or charge card,"Other features, terms, or problems",Other problem,"Dear CFPB, I have a secured credit card with c...",Company has responded to the consumer and the ...,"CITIBANK, N.A.",NY,11220,,Consent provided,Web,2025-06-13,Closed with monetary relief,Yes,,14047085,dear cfpb secured credit card citibank changed...
2,2025-06-12,Credit card,General-purpose credit card or charge card,Incorrect information on your report,Account information incorrect,I have a Citi rewards cards. The credit balanc...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",IL,60067,,Consent provided,Web,2025-06-12,Closed with explanation,Yes,,14040217,citi reward card credit balance issued recentl...
3,2025-06-09,Credit card,General-purpose credit card or charge card,Problem with a purchase shown on your statement,Credit card company isn't resolving a dispute ...,b'I am writing to dispute the following charge...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",TX,78413,Older American,Consent provided,Web,2025-06-09,Closed with monetary relief,Yes,,13968411,writing dispute following charge citi credit c...
4,2025-06-09,Credit card,General-purpose credit card or charge card,Problem when making payments,Problem during payment process,"Although the account had been deemed closed, I...",Company believes it acted appropriately as aut...,Atlanticus Services Corporation,NY,11212,Older American,Consent provided,Web,2025-06-09,Closed with monetary relief,Yes,,13965746,although account deemed closed continued make ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,2025-01-30,Credit card,General-purpose credit card or charge card,Problem with a purchase shown on your statement,Credit card company isn't resolving a dispute ...,I am disputing charges made on my XXXX Capital...,,CAPITAL ONE FINANCIAL CORPORATION,NY,146XX,,Consent provided,Web,2025-01-30,Closed with monetary relief,Yes,,11849680,disputing charge made capital one credit card ...
996,2025-05-06,Credit card,General-purpose credit card or charge card,Problem with a purchase shown on your statement,Credit card company isn't resolving a dispute ...,I am a visually impaired consumer who was deni...,Company has responded to the consumer and the ...,U.S. BANCORP,ME,04401,,Consent provided,Web,2025-05-06,Closed with non-monetary relief,Yes,,13366503,visually impaired consumer denied fair treatme...
997,2025-04-13,Credit card,General-purpose credit card or charge card,Problem with a purchase shown on your statement,Card was charged for something you did not pur...,"On XX/XX/XXXX, my credit card was stolen and m...",Company has responded to the consumer and the ...,"BANK OF AMERICA, NATIONAL ASSOCIATION",WA,98087,Servicemember,Consent provided,Web,2025-04-13,Closed with explanation,Yes,,12962905,xxxx credit card stolen multiple purchase made...
998,2025-01-28,Credit card,General-purpose credit card or charge card,Trouble using your card,Can't use card to make purchases,I applied for and was approved for a Bank of A...,Company has responded to the consumer and the ...,"BANK OF AMERICA, NATIONAL ASSOCIATION",GA,30501,Servicemember,Consent provided,Web,2025-01-28,Closed with explanation,Yes,,11809744,applied approved bank america xxxxyear part ap...


## 3. Initialize Vector Store
Set up the persistent ChromaDB client.

In [4]:
indexer.initialize_vector_store()

Initializing Vector Store at ../vector_store...
Created collection: complaints_prototype


## 4. Processing and Indexing
This step will:
- Iterate through the dataframe rows.
- Chunk the `cleaned_narrative` using `RecursiveCharacterTextSplitter`.
- Embed the chunks.
- Store them in ChromaDB with metadata.
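
To make the chunking step concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph breaks and spaces, so its chunks can be shorter than `chunk_size`. The function name `chunk_with_overlap` is hypothetical and is not part of `ChunkEmbedIndex`.

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Simplified sketch:
    it cuts at fixed offsets and ignores word boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# A 1,200-character text with the notebook's settings (500 / 50)
sample = "abcdefghij" * 120
chunks = chunk_with_overlap(sample, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])  # prints [500, 500, 300]
```

In this sketch each window starts 450 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.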

In [5]:
# Chunk settings: 500 chars with 50 overlap
indexer.process_and_index(chunk_size=500, chunk_overlap=50)

Initializing Text Splitter...
Starting Chunking and Indexing...


Processing Complaints: 100%|██████████| 1000/1000 [00:14<00:00, 66.84it/s]


Indexing Complete. Total Chunks Indexed: 2061
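
The query output in section 5 shows that each chunk is stored with `complaint_id`, `product`, `issue`, and `chunk_index` metadata. The sketch below shows one way such records could be assembled for a single complaint before being handed to the vector store. The helper name `build_chunk_records` and the `<complaint_id>_<chunk_index>` ID format are assumptions for illustration; the actual logic inside `ChunkEmbedIndex.process_and_index` is not shown in this notebook.

```python
def build_chunk_records(complaint_id, product, issue, chunks):
    """Return parallel lists of ids, documents, and metadata dicts for one complaint.

    Hypothetical helper: the ID scheme is an assumption; the metadata keys
    mirror those visible in the notebook's query output.
    """
    ids = [f"{complaint_id}_{i}" for i in range(len(chunks))]
    metadatas = [
        {
            "complaint_id": str(complaint_id),  # stored as a string in the output
            "product": product,
            "issue": issue,
            "chunk_index": i,
        }
        for i in range(len(chunks))
    ]
    return ids, list(chunks), metadatas


ids, documents, metadatas = build_chunk_records(
    complaint_id=13883218,
    product="Credit card",
    issue="Problem with a purchase shown on your statement",
    chunks=["first chunk of the narrative", "second chunk of the narrative"],
)
print(ids)
print(metadatas[1]["chunk_index"])
```

Keeping `complaint_id` and `chunk_index` on every chunk makes it possible to trace a retrieved passage back to its source complaint and to its position within that narrative.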


## 5. Verify Indexing
Run a simple similarity query to confirm that the indexed chunks are retrievable.

In [6]:
results = indexer.collection.query(
    query_texts=["credit car charged twice"],
    n_results=2
)

print("Query Results:")
for doc, meta in zip(results['documents'][0], results['metadatas'][0]):
    print(f"\nMetadata: {meta}")
    print(f"Content: {doc}")

Query Results:

Metadata: {'complaint_id': '13883218', 'product': 'Credit card', 'issue': 'Problem with a purchase shown on your statement', 'chunk_index': 0}
Content: went xxxx charged twice citi credit card car rental company rented one car reason charge twice called constantly texted charged twice never returned money proceeded report bank citi bank provided evidence charged twice cancelled dispute without proper reasoning sided clearly evidence rented two car received dispute decision xxxx extra charge credit card limit affecting credit score pay interest

Metadata: {'chunk_index': 0, 'complaint_id': '12986095', 'product': 'Credit card', 'issue': 'Problem with a purchase shown on your statement'}
Content: charged rental car xxxx never received ive told several time refund way returned day xxxx disputing transaction company since day one continue lied denied action dispute navy federal credit union told bank job refuse take action dispute continue charge interest purchase liable
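
Eyeballing the printed results works for a prototype, but the same check can be automated. The sketch below validates a result dictionary shaped like the one returned above (one inner list per query text) and confirms that every hit carries the expected metadata keys. The function name `check_query_results` is hypothetical, and the `mock_results` dictionary is a hand-built stand-in that reuses truncated text from the output above so the example runs without a ChromaDB instance.

```python
REQUIRED_KEYS = {"complaint_id", "product", "issue", "chunk_index"}


def check_query_results(results: dict, expected_n: int) -> None:
    """Raise ValueError if the first query's hits are missing, empty, or lack metadata."""
    docs = results["documents"][0]   # hits for the first (only) query text
    metas = results["metadatas"][0]
    if not (len(docs) == len(metas) == expected_n):
        raise ValueError(f"expected {expected_n} hits, got {len(docs)} docs / {len(metas)} metadatas")
    for doc, meta in zip(docs, metas):
        if not doc.strip():
            raise ValueError("retrieved an empty chunk")
        missing = REQUIRED_KEYS - set(meta)
        if missing:
            raise ValueError(f"metadata is missing keys: {sorted(missing)}")


# Stand-in for `results`; document text is truncated from the output above.
mock_results = {
    "documents": [[
        "went xxxx charged twice citi credit card car rental company",
        "charged rental car xxxx never received",
    ]],
    "metadatas": [[
        {"complaint_id": "13883218", "product": "Credit card",
         "issue": "Problem with a purchase shown on your statement", "chunk_index": 0},
        {"complaint_id": "12986095", "product": "Credit card",
         "issue": "Problem with a purchase shown on your statement", "chunk_index": 0},
    ]],
}

check_query_results(mock_results, expected_n=2)
print("Query results look well-formed.")
```

In the notebook, the same call would be `check_query_results(results, expected_n=2)` using the `results` object from the cell above.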
