## Task 2: Text Chunking, Embedding, and Vector Store Indexing

In this notebook, I:
- Load the cleaned complaint dataset
- Chunk the narratives into manageable pieces
- Add product metadata to each chunk
- Generate semantic embeddings
- Store them in a FAISS vector store for efficient retrieval


In [5]:
import pandas as pd
import numpy as np
import re
import os
from tqdm import tqdm

from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

import faiss
import pickle



In [None]:
# Load the filtered complaint data

import pandas as pd

df = pd.read_csv("../data/processed/filtered_complaints.csv", encoding="latin1")
print(f"Loaded {len(df)} records")
df[['cleaned_narrative', 'Mapped Product']].head()



FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/filtered_complaints.csv'

## Step 1: Text Chunking

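I'll use LangChain's `RecursiveCharacterTextSplitter` to break long complaint narratives into chunks of at most 500 characters, with a 50-character overlap between neighbouring chunks.  
Chunking keeps each piece short enough for the embedding model's input limit and lets retrieval return a focused passage rather than an entire narrative.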


In [5]:
# Chunk only the first record as a sanity check
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

# Inspect the chunks produced for the first narrative
example_chunks = text_splitter.split_text(df.iloc[0]['cleaned_narrative'])
print(f"Chunks from first record: {len(example_chunks)}")
example_chunks


Chunks from first record: 1


['a xxxx xxxx card was opened under my name by a fraudster i received a notice from xxxx that an account was just opened under my name i reached out to xxxx xxxx to state that this activity was unauthorized and not me xxxx xxxx confirmed this was fraudulent and immediately closed the card however they have failed to remove this from the three credit agencies and this fraud is now impacting my credit score based on a hard credit pull done by xxxx xxxx that was done by a fraudster']
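
The first narrative is shorter than 500 characters, so it comes back as a single chunk. To make the roles of `chunk_size` and `chunk_overlap` concrete on a longer text, here is a minimal sketch of a fixed-window chunker. It is a simplification and not LangChain's algorithm: `RecursiveCharacterTextSplitter` prefers to cut at separators such as paragraph breaks and spaces, so its chunks are usually a little shorter than `chunk_size`. The function name `sliding_window_chunks` and the synthetic `sample` text are assumptions made for illustration.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: it cuts at exact character offsets
    instead of looking for natural separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Synthetic 1,200-character text standing in for a long narrative
sample = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

Each chunk repeats the last 50 characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.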

## Step 2: Add Metadata to Chunks

Each chunk will carry metadata such as:
- Product category
- Complaint ID

This will be stored along with the embedding in the vector database.


In [6]:
docs = []

for _, row in tqdm(df.iterrows(), total=len(df)):
    chunks = text_splitter.split_text(row['cleaned_narrative'])
    for chunk in chunks:
        docs.append({
            "content": chunk,
            "metadata": {
                "product": row['Mapped Product'],
                "complaint_id": row['Complaint ID']
            }
        })

print(f"Total document chunks: {len(docs)}")

100%|██████████| 478834/478834 [05:05<00:00, 1565.08it/s]

Total document chunks: 1378199
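
Before embedding, it can be worth checking that the metadata travelled with the chunks as intended. The sketch below counts chunks per product and per complaint using the same `content`/`metadata` layout as `docs`. The helper name `summarize_chunks` and the three toy records (products, IDs, and text) are hypothetical and only there to make the example self-contained.

```python
from collections import Counter


def summarize_chunks(docs):
    """Count chunks per product and per complaint ID."""
    per_product = Counter(d["metadata"]["product"] for d in docs)
    per_complaint = Counter(d["metadata"]["complaint_id"] for d in docs)
    return per_product, per_complaint


# Hypothetical records in the same shape as the `docs` list built above
sample_docs = [
    {"content": "first part of a narrative",
     "metadata": {"product": "Credit card", "complaint_id": 101}},
    {"content": "second part of the same narrative",
     "metadata": {"product": "Credit card", "complaint_id": 101}},
    {"content": "a short narrative",
     "metadata": {"product": "Personal loan", "complaint_id": 202}},
]

per_product, per_complaint = summarize_chunks(sample_docs)
print(per_product.most_common())
print(max(per_complaint.values()))  # largest number of chunks from one complaint
```

Running `summarize_chunks(docs)` on the real list would show how the 1,378,199 chunks are distributed across product categories.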





## Step 3: Generate Embeddings

I'll use `all-MiniLM-L6-v2` from the `sentence-transformers` library to convert text chunks into embeddings.


In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
from tqdm import tqdm

# Load model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Extract text content from docs
texts = [doc["content"] for doc in docs]

# Encode the chunks in batches of 64
embeddings = model.encode(
    texts,
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True        # Ensures output is a NumPy array
)

# Show the final shape of the embeddings matrix
print(f"Embedding shape: {embeddings.shape}")


Batches:   0%|          | 0/21535 [00:00<?, ?it/s]

Embedding shape: (1378199, 384)
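
The batch count in the progress bar and the size of the resulting matrix follow directly from the numbers above. The short calculation below reproduces them; it assumes the embeddings are stored as 32-bit floats (4 bytes per value), and the variable names are illustrative.

```python
import math

n_chunks = 1_378_199      # total chunks reported in Step 2
embedding_dim = 384       # second dimension of the embedding matrix
batch_size = 64
bytes_per_value = 4       # assumption: float32 embeddings

# Number of forward passes the model needs
n_batches = math.ceil(n_chunks / batch_size)

# In-memory size of the full embedding matrix
matrix_bytes = n_chunks * embedding_dim * bytes_per_value

print(f"Batches: {n_batches}")
print(f"Embedding matrix: {matrix_bytes / 1e9:.2f} GB")
```

The same formula gives 32 batches for a 2,000-chunk subset.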


## Step 4: Store in FAISS Vector Database

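I'll store the embeddings in a FAISS index and save the corresponding chunk text and metadata in a pickle file, so that search results can be mapped back to their complaints. The index built in this notebook covers only the first 2,000 chunks (see `max_chunks`).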


## ✅ Summary

- Loaded and processed `filtered_complaints.csv` (478,834 records)
- Generated 1,378,199 text chunks
- Used Sentence Transformers (`all-MiniLM-L6-v2`) to generate 384-dimensional embeddings
- Stored the embeddings of the first 2,000 chunks in a FAISS index, with the chunk text and metadata saved alongside for semantic retrieval

Next step: Build the RAG pipeline using these embeddings.


In [10]:
from sentence_transformers import SentenceTransformer
import numpy as np
from tqdm import tqdm

# Load the sentence embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Limit to first 2000 chunks only (adjustable)
max_chunks = 2000
limited_docs = docs[:max_chunks]

# Extract text content from the limited docs
texts = [doc["content"] for doc in limited_docs]

# Generate embeddings with a progress bar
embeddings = model.encode(
    texts,
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True  # Ensures output is a NumPy array
)

# Display the resulting shape
print(f"✅ Embedded {len(texts)} chunks.")
print(f"Embedding shape: {embeddings.shape}")


Batches:   0%|          | 0/32 [00:00<?, ?it/s]

✅ Embedded 2000 chunks.
Embedding shape: (2000, 384)


In [11]:
np.save("embeddings_2000.npy", embeddings)

import pickle
with open("docs_2000.pkl", "wb") as f:
    pickle.dump(limited_docs, f)


In [13]:
import os
import faiss
import numpy as np
import pickle

# Dimension of embeddings
dimension = embeddings.shape[1]

# Initialize a FAISS index (exact search, L2 distance)
index = faiss.IndexFlatL2(dimension)

# Add embeddings to the index
index.add(embeddings)

print(f"FAISS index built with {index.ntotal} vectors.")

# Save the FAISS index to disk (create the target directory if it is missing)
os.makedirs("../vector_store", exist_ok=True)
faiss.write_index(index, "../vector_store/index.faiss")

# Save the corresponding metadata (docs) as a pickle file
with open("../vector_store/index.pkl", "wb") as f:
    pickle.dump(limited_docs, f)

print("FAISS index and metadata saved successfully.")


FAISS index built with 2000 vectors.
FAISS index and metadata saved successfully.
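
`IndexFlatL2` performs exact, brute-force search: every stored vector is compared with the query and the `k` closest are returned, with distances reported as squared L2 values. The pure-Python sketch below illustrates that idea on toy 2-D vectors. It is not how FAISS is implemented (FAISS uses optimized native code), and `squared_l2`, `flat_l2_search`, and `toy_vectors` are hypothetical names and data.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Compare the query with every vector and keep the k nearest."""
    scored = ((squared_l2(vec, query), idx) for idx, vec in enumerate(vectors))
    nearest = heapq.nsmallest(k, scored)  # sorted by increasing distance
    distances = [dist for dist, _ in nearest]
    indices = [idx for _, idx in nearest]
    return distances, indices


# Toy 2-D vectors standing in for the 384-dimensional embeddings
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(toy_vectors, [0.9, 0.1], k=2)
print(indices)
print(distances)
```

Because every vector is scanned, query time grows linearly with the number of stored vectors. That is unproblematic for 2,000 vectors, and an approximate index type may be worth considering if all 1,378,199 chunks are indexed.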


In [14]:
import faiss
import pickle

# Load the index
index = faiss.read_index("../vector_store/index.faiss")

# Load metadata
with open("../vector_store/index.pkl", "rb") as f:
    docs = pickle.load(f)

print(f"Loaded FAISS index with {index.ntotal} vectors.")
print(f"Loaded {len(docs)} documents metadata.")


Loaded FAISS index with 2000 vectors.
Loaded 2000 documents metadata.


In [15]:
from sentence_transformers import SentenceTransformer

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

def search(query, k=5):
    # Embed query
    query_embedding = model.encode([query], convert_to_numpy=True)
    
    # Search in FAISS index
    distances, indices = index.search(query_embedding, k)
    
    # Retrieve corresponding docs
    results = [docs[idx] for idx in indices[0] if idx != -1]
    return results

# Example usage
results = search("What is my credit card limit?", k=3)
for i, res in enumerate(results):
    print(f"Result {i+1}: {res['content'][:200]}...")  # print first 200 chars


Result 1: the xxxx of xxxx this is an egregious act that not only blindsided me but manipulates my account to make it appear that im close to reaching the limit when in fact i was no where near the limit this c...
Result 2: i have a fidelity rewards visa signature credit card with fidelity investments ending in xxxx when i was first approved for the card i was given a credit limit of xxxx since then i have been responsib...
Result 3: for the card ending in xxxx this is a great inconvenience and turnoff to me the account holder and customer if fidelity wants to maintain a relationship with me or other consumers for business in the ...
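
The `search` function above returns only the matching documents and discards the distances. For the RAG pipeline it may be useful to keep the distance and the metadata together, for example to filter weak matches or to cite the complaint ID. Below is a minimal sketch of such a helper. The name `collect_hits` and the toy inputs are hypothetical; the nested lists mimic the `(1, k)` shape of the arrays returned by `index.search`, including the `-1` placeholder that the original code already filters out.

```python
def collect_hits(distances, indices, docs):
    """Pair each retrieved chunk with its distance and metadata."""
    hits = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx == -1:
            continue  # placeholder for a missing neighbour
        doc = docs[idx]
        hits.append({
            "distance": float(dist),
            "product": doc["metadata"]["product"],
            "complaint_id": doc["metadata"]["complaint_id"],
            "content": doc["content"],
        })
    return hits


# Hypothetical inputs shaped like the output of index.search(query_embedding, k=3)
toy_docs = [
    {"content": "limit was lowered without notice",
     "metadata": {"product": "Credit card", "complaint_id": 101}},
    {"content": "loan payment was misapplied",
     "metadata": {"product": "Personal loan", "complaint_id": 202}},
]
toy_distances = [[0.41, 0.97, 3.4e38]]
toy_indices = [[0, 1, -1]]

for hit in collect_hits(toy_distances, toy_indices, toy_docs):
    print(f"{hit['distance']:.2f} | {hit['product']} | {hit['complaint_id']} | {hit['content']}")
```

Smaller distances indicate closer matches, so the hits arrive ordered from most to least similar.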
