# Task 2: Chunking, Embedding, and Vector Store Indexing

This notebook prepares a stratified sample, applies text chunking, generates embeddings for each chunk, and indexes them in a vector store (ChromaDB).

**Checklist:**
1. Stratified sampling
2. Chunking
3. Embedding (MiniLM-L6-v2)
4. Store in ChromaDB
5. Document choices


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from collections import Counter
import numpy as np
import random

# Load cleaned dataset
df = pd.read_csv('../data/filtered_complaints.csv')
df['Product'].value_counts()


Product
Credit card        80667
Money transfers     1497
Name: count, dtype: int64

## Stratified Sample: 10,000–15,000 Complaints
Sample proportionally by product category.


In [3]:
# Aim for ~12,000 (adjust size as needed)
SAMPLE_SIZE = 12000

strat_sample, _ = train_test_split(
    df,
    train_size=SAMPLE_SIZE,
    stratify=df['Product'],
    random_state=42
)
print(strat_sample['Product'].value_counts(normalize=True))
strat_sample.reset_index(drop=True, inplace=True)
strat_sample.to_csv('../data/stratified_sample.csv', index=False)


Product
Credit card        0.98175
Money transfers    0.01825
Name: proportion, dtype: float64
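
The proportions printed above follow directly from proportional allocation. The sketch below reproduces the arithmetic from the product counts shown earlier. It is an illustration only: plain rounding happens to sum to the sample size here, which is not guaranteed in general, and it is not a description of how `train_test_split` allocates rows internally.

```python
# Product counts taken from the value_counts() output of the cleaned dataset.
population = {"Credit card": 80667, "Money transfers": 1497}
SAMPLE_SIZE = 12000

total = sum(population.values())

# Proportional allocation: each product gets its population share of the sample.
allocation = {
    product: round(SAMPLE_SIZE * count / total)
    for product, count in population.items()
}
shares = {product: n / SAMPLE_SIZE for product, n in allocation.items()}

print(allocation)  # {'Credit card': 11781, 'Money transfers': 219}
print(shares)
```

With only about 219 money-transfer complaints in the sample, retrieval quality for that product rests on a small number of narratives.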


## Chunking: Split Narratives
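We split each narrative with a fixed-size sliding character window (not a separator-aware recursive splitter). With chunk_size=500 and overlap=50, each chunk holds up to 500 characters and repeats the last 50 characters of the previous chunk, so the window advances 450 characters at a time.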


In [4]:
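def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        # A step of zero or less would never advance the window.
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    text = str(text)
    chunks = []
    start = 0
    step = chunk_size - chunk_overlap  # 450 characters with the defaults
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start += step
    return chunks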

# Example
chunked_examples = chunk_text(strat_sample['clean_narrative'].iloc[0])
print(chunked_examples)


['old card from bank of america expired and another card was issued but to the wrong old address. when i request the card to be send to my current address they sent a new one with a number different from before. the different number is not a problem but now every transaction gets blocked by fraud department. ive contacted the bank three times, they unlock it but it locks again at the next transaction. when i try to contact the bank they keep doing the same process over and over again but it doesnt', 'the same process over and over again but it doesnt solve the problem. the last time i contacted them today at xxxxxxxx xxxx i told the operator the previous guy did exactly the same thing and he said that theres really nothing much else they can do. i tried to find a different channel to address this issue but couldnt find one on the website. im complaining here so that the complaint can get to the correct department to solve the issue the regular phone doesnt work and the operator also ca
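
Fixed-width windows can cut through the middle of a word. One possible refinement is to snap each cut back to the nearest space. The function below is a hypothetical variant, not part of this notebook's pipeline; the name `chunk_text_on_words` and its exact overlap handling are assumptions made for illustration.

```python
def chunk_text_on_words(text, chunk_size=500, chunk_overlap=50):
    """Sliding-window chunker that moves each cut back to the nearest space."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    text = str(text)
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer to cut at the last space inside the window, if any.
            cut = text.rfind(" ", start + 1, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        # Step back by the overlap, then move forward to the next word start.
        next_start = max(end - chunk_overlap, start + 1)
        space = text.find(" ", next_start, end)
        start = space + 1 if space != -1 else end
        while start < len(text) and text[start] == " ":
            start += 1
    return chunks


sample = "alpha beta gamma delta epsilon"
print(chunk_text_on_words(sample, chunk_size=12, chunk_overlap=4))
# ['alpha beta', 'gamma delta', 'epsilon']
```

Chunks produced this way are at most `chunk_size` characters long but vary in length, and the overlap is only honoured when a word boundary falls inside the overlap region.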

In [5]:
# Apply chunking to all sample narratives
chunk_records = []
for idx, row in strat_sample.iterrows():
    chunks = chunk_text(row['clean_narrative'])
    for chunk_idx, chunk in enumerate(chunks):
        chunk_records.append({
            'complaint_id': row['Complaint ID'],
            'product': row['Product'],
            'issue': row['Issue'],
            'company': row['Company'],
            'state': row['State'],
            'date_received': row['Date received'],
            'chunk_index': chunk_idx,
            'total_chunks': len(chunks),
            'chunk_text': chunk
        })

chunk_df = pd.DataFrame(chunk_records)
chunk_df.to_csv('../data/stratified_sample_chunked.csv', index=False)
chunk_df.head()


| | complaint_id | product | issue | company | state | date_received | chunk_index | total_chunks | chunk_text |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8769190 | Credit card | Problem when making payments | BANK OF AMERICA, NATIONAL ASSOCIATION | MA | 2024-04-15 | 0 | 3 | old card from bank of america expired and anot... |
| 1 | 8769190 | Credit card | Problem when making payments | BANK OF AMERICA, NATIONAL ASSOCIATION | MA | 2024-04-15 | 1 | 3 | the same process over and over again but it do... |
| 2 | 8769190 | Credit card | Problem when making payments | BANK OF AMERICA, NATIONAL ASSOCIATION | MA | 2024-04-15 | 2 | 3 | r phone doesnt work and the operator also cant... |
| 3 | 1773833 | Credit card | Billing disputes | CITIBANK, N.A. | CA | 2016-02-04 | 0 | 1 | i have a bill from macys that i am late on. th... |
| 4 | 2056598 | Credit card | Transaction issue | SYNCHRONY FINANCIAL | VA | 2016-08-10 | 0 | 2 | paypal has a repeated poor business practice o... |


## Embedding: MiniLM-L6-v2
We embed each chunk with the all-MiniLM-L6-v2 model from the sentence-transformers library, which yields one 384-dimensional vector per chunk.


In [6]:
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = model.encode(chunk_df['chunk_text'].tolist(), show_progress_bar=True, batch_size=64)
print(f"Embedding shape: {embeddings.shape}")


Batches: 100%|██████████| 538/538 [01:47<00:00,  5.01it/s]


Embedding shape: (34394, 384)
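
The progress bar and the printed shape can be cross-checked with a little arithmetic. The sketch below uses only the numbers reported in this notebook, plus one assumption: that the embeddings are stored as float32, which is the dtype sentence-transformers returns by default.

```python
import math

n_chunks = 34394      # rows in chunk_df, from the printed embedding shape
dim = 384             # embedding width, from the printed embedding shape
batch_size = 64       # batch_size passed to model.encode
n_complaints = 12000  # SAMPLE_SIZE used for the stratified sample
bytes_per_value = 4   # assumption: float32 embeddings

n_batches = math.ceil(n_chunks / batch_size)
size_mib = n_chunks * dim * bytes_per_value / 2**20
chunks_per_complaint = n_chunks / n_complaints

print(n_batches)                       # 538, matching the progress bar
print(round(size_mib, 1))              # 50.4
print(round(chunks_per_complaint, 2))  # 2.87
```

At roughly 50 MiB the full embedding matrix fits comfortably in memory, so encoding everything in one `model.encode` call is reasonable at this sample size.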


## Vector Store: Index with ChromaDB
Store embeddings and metadata for semantic search.
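
Semantic search over stored vectors comes down to ranking chunks by how close their embeddings are to a query embedding. The pure-Python sketch below shows that ranking step with cosine similarity. The three-dimensional vectors are invented toy values (the real embeddings have 384 dimensions), the IDs merely follow the `complaint_id_chunk_index` pattern used for the collection, and cosine similarity is an assumption: the metric a collection actually uses depends on how it is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to query_vec."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy stand-ins for stored chunk embeddings (hypothetical values).
toy_index = {
    "8769190_0": [0.9, 0.1, 0.0],
    "1773833_0": [0.1, 0.9, 0.0],
    "2056598_0": [0.7, 0.3, 0.1],
}

print(top_k([1.0, 0.0, 0.0], toy_index))  # ['8769190_0', '2056598_0']
```

A vector store performs the same ranking, usually with an approximate nearest-neighbour index so that it does not have to score every chunk.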


In [7]:
import chromadb
from chromadb.config import Settings
from tqdm import tqdm

# This run used a DuckDB-backed chromadb 0.3.x release, which only writes to disk
# when chroma_db_impl is set; without it the client warns that data is transient.
# On chromadb >= 0.4, use chromadb.PersistentClient(path="../vector_store") instead.
db = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="../vector_store",
))
collection = db.get_or_create_collection('complaints_sample')

META_COLS = ['complaint_id', 'product', 'issue', 'company', 'state',
             'date_received', 'chunk_index', 'total_chunks']
BATCH_SIZE = 1000  # add in batches rather than one row per call

def to_native(value):
    # Chroma validates metadata as plain str/int/float, so unwrap NumPy scalars.
    return value.item() if hasattr(value, 'item') else value

# Add chunks and metadata to ChromaDB
for start in tqdm(range(0, len(chunk_df), BATCH_SIZE)):
    batch = chunk_df.iloc[start:start + BATCH_SIZE]
    collection.add(
        # The original per-row call passed a NumPy array and raised
        # "ValueError: Expected embeddings to be a list"; .tolist() yields nested lists.
        embeddings=embeddings[start:start + BATCH_SIZE].tolist(),
        documents=batch['chunk_text'].tolist(),
        metadatas=[{k: to_native(v) for k, v in rec.items()}
                   for rec in batch[META_COLS].to_dict('records')],
        ids=[f"{cid}_{cidx}" for cid, cidx in zip(batch['complaint_id'], batch['chunk_index'])],
    )

db.persist()  # flush to ../vector_store (needed on chromadb 0.3.x only)
print("Embedding & chunk data stored in ChromaDB.")

## Summary
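- Sampling is stratified by product, so the 12,000-complaint sample keeps the original product mix (about 98% credit card, 2% money transfers).
- Chunking uses a 500-character window with a 50-character overlap, so context carries across chunk boundaries.
- Embeddings use all-MiniLM-L6-v2, a compact sentence-transformers model that outputs 384-dimensional vectors.
- ChromaDB stores the embeddings, chunk text, and metadata for semantic search over the complaint data.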
