# Task 2: Text Chunking, Embedding, and Vector Store Indexing

In this notebook, I:
- Load the cleaned complaint dataset
- Chunk the narratives into manageable pieces
- Add product metadata to each chunk
- Generate semantic embeddings
- Store them in a FAISS vector store for efficient retrieval


In [4]:
import pandas as pd
import numpy as np
import re
import os
from tqdm import tqdm

from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

import faiss
import pickle

In [None]:
## Load the filtered complaint data

df = pd.read_csv("../data/filtered_complaints.csv")
print(f"Loaded {len(df)} records")
df[['cleaned_narrative', 'Mapped Product']].head()


Loaded 478834 records


| | cleaned_narrative | Mapped Product |
|---|---|---|
| 0 | a xxxx xxxx card was opened under my name by a... | Credit card |
| 1 | i made the mistake of using my wellsfargo debi... | Savings account |
| 2 | dear cfpb i have a secured credit card with ci... | Credit card |
| 3 | i have a citi rewards cards the credit balance... | Credit card |
| 4 | bi am writing to dispute the following charges... | Credit card |


## Step 1: Text Chunking

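I'll use LangChain's `RecursiveCharacterTextSplitter` to break long complaint narratives into chunks of at most 500 characters, with a 50-character overlap between neighbouring chunks.  
Chunking keeps each embedding focused on one part of a complaint and keeps every input well under the embedding model's input length limit.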
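
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character windowing in plain Python. It is a simplified stand-in, not LangChain's algorithm: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph breaks and spaces, so its chunk boundaries will differ. The helper name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into character windows of at most chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 1200  # hypothetical 1,200-character narrative
print([len(c) for c in chunk_text(sample)])  # [500, 500, 300]
```

A narrative shorter than `chunk_size` comes back as a single chunk, which matches the one-chunk result for the first record.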


In [6]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

# Just to check example output
example_chunks = text_splitter.split_text(df.iloc[0]['cleaned_narrative'])
print(f"Chunks from first record: {len(example_chunks)}")
example_chunks


Chunks from first record: 1


['a xxxx xxxx card was opened under my name by a fraudster i received a notice from xxxx that an account was just opened under my name i reached out to xxxx xxxx to state that this activity was unauthorized and not me xxxx xxxx confirmed this was fraudulent and immediately closed the card however they have failed to remove this from the three credit agencies and this fraud is now impacting my credit score based on a hard credit pull done by xxxx xxxx that was done by a fraudster']

## Step 2: Add Metadata to Chunks

Each chunk will carry metadata such as:
- Product category
- Complaint ID

The metadata is saved alongside the FAISS index, in the same order as the embeddings, so each search result can be traced back to its complaint.
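
As a rough sketch of the same idea, the loop can be wrapped in a function that takes plain row dictionaries and any splitting function. Two things here are my assumptions rather than part of the notebook: the extra `chunk_index` field, and the guard that skips missing or empty narratives. The name `build_docs` and the sample rows are hypothetical.

```python
from typing import Callable, Iterable


def build_docs(rows: Iterable[dict], split_fn: Callable[[str], list[str]]) -> list[dict]:
    """Attach product and complaint ID metadata to every chunk of every narrative."""
    docs = []
    for row in rows:
        narrative = row.get("cleaned_narrative")
        # Skip rows whose narrative is missing (None or NaN) or blank
        if not isinstance(narrative, str) or not narrative.strip():
            continue
        for chunk_index, chunk in enumerate(split_fn(narrative)):
            docs.append({
                "content": chunk,
                "metadata": {
                    "product": row["Mapped Product"],
                    "complaint_id": row["Complaint ID"],
                    "chunk_index": chunk_index,  # assumed extra field: position within the complaint
                },
            })
    return docs


# Hypothetical rows; a whitespace split stands in for the real text splitter
rows = [
    {"Complaint ID": 1, "Mapped Product": "Credit card", "cleaned_narrative": "aaaa bbbb"},
    {"Complaint ID": 2, "Mapped Product": "Savings account", "cleaned_narrative": None},
]
docs = build_docs(rows, split_fn=lambda text: text.split())
print(len(docs))  # 2
```

In the notebook this could be called with `df.to_dict("records")` and `text_splitter.split_text`.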


In [7]:
docs = []

for _, row in tqdm(df.iterrows(), total=len(df)):
    chunks = text_splitter.split_text(row['cleaned_narrative'])
    for chunk in chunks:
        docs.append({
            "content": chunk,
            "metadata": {
                "product": row['Mapped Product'],
                "complaint_id": row['Complaint ID']
            }
        })

print(f"Total document chunks: {len(docs)}")


100%|██████████| 478834/478834 [04:55<00:00, 1618.44it/s]

Total document chunks: 1378199





## Step 3: Generate Embeddings

I'll use `all-MiniLM-L6-v2` from the `sentence-transformers` library to convert text chunks into embeddings.
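
Before encoding over a million chunks, it helps to estimate the workload. The sketch below assumes 384-dimensional `float32` vectors (the output size of `all-MiniLM-L6-v2`) and the default `encode()` batch size of 32. The helper name `embedding_plan` is hypothetical.

```python
import math


def embedding_plan(n_texts: int, dim: int = 384, batch_size: int = 32,
                   bytes_per_value: int = 4) -> tuple[int, int]:
    """Return (number of encode batches, bytes needed for the full embedding matrix)."""
    n_batches = math.ceil(n_texts / batch_size)
    total_bytes = n_texts * dim * bytes_per_value  # float32 = 4 bytes per value
    return n_batches, total_bytes


n_batches, total_bytes = embedding_plan(1_378_199)
print(n_batches)                    # 43069
print(round(total_bytes / 1e9, 2))  # 2.12 (GB)
```

The batch count agrees with the 43069 batches reported by the progress bar, and the full matrix needs roughly 2.1 GB of RAM on top of the chunk texts themselves.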


In [None]:
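# Load the pretrained sentence-embedding model (downloaded on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed every chunk; encode() batches the inputs internally
texts = [doc["content"] for doc in docs]
embeddings = model.encode(texts, show_progress_bar=True)

# FAISS expects a float32 array
embeddings = np.asarray(embeddings, dtype="float32")
print(f"Embedding shape: {embeddings.shape}")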


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:  41%|####      | 62.9M/154M [00:00<?, ?B/s]

Error while downloading from https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/model.safetensors: HTTPSConnectionPool(host='cas-bridge.xethub.hf.co', port=443): Read timed out.
Trying to resume download...



model.safetensors:  45%|####4     | 73.4M/164M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/43069 [00:00<?, ?it/s]

## Step 4: Store in FAISS Vector Database

I'll store the embeddings in FAISS and save the corresponding metadata for retrieval.
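
`IndexFlatL2` performs an exhaustive, exact search and reports squared L2 distances together with the row ids of the nearest vectors. Row ids follow insertion order, which is why the metadata list has to stay in the same order as the embeddings. The plain-Python sketch below mimics that behaviour on toy data to show the lookup; it is not how FAISS is implemented, and the function name, vectors, and complaint IDs are all hypothetical.

```python
def search_flat_l2(vectors: list[list[float]], query: list[float], k: int) -> list[tuple[float, int]]:
    """Return (squared L2 distance, row id) for the k stored vectors nearest to query."""
    scored = []
    for row_id, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, row_id))
    scored.sort()  # smallest distance first
    return scored[:k]


# Toy 2-D "embeddings" and position-aligned metadata (hypothetical values)
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
metadata = [
    {"product": "Credit card", "complaint_id": 101},
    {"product": "Savings account", "complaint_id": 102},
    {"product": "Credit card", "complaint_id": 103},
]

hits = search_flat_l2(vectors, query=[0.9, 0.1], k=2)
for dist, row_id in hits:
    # The row id from the search is the index into the metadata list
    print(row_id, metadata[row_id]["complaint_id"], round(dist, 2))
```

One thing to keep in mind: the pickle written in this step holds only the product and complaint ID, so the chunk text itself would need to be saved as well if retrieval should return the matching passages.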


In [None]:
# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

# Create output directory
os.makedirs("../vector_store", exist_ok=True)

# Save FAISS index
faiss.write_index(index, "../vector_store/faiss_index.index")

# Save metadata
with open("../vector_store/metadata.pkl", "wb") as f:
    pickle.dump([doc["metadata"] for doc in docs], f)

print("✅ Embeddings and metadata saved successfully.")


## ✅ Summary

- Loaded and processed `filtered_complaints.csv` (478,834 records)
- Generated 1,378,199 text chunks (≈ 1.38M)
- Used Sentence Transformers (`all-MiniLM-L6-v2`) to generate embeddings
- Stored in FAISS index with associated metadata for semantic retrieval

Next step: Build the RAG pipeline using these embeddings.
