# 📝 Task 2 – Text Embedding & Semantic Index Construction  
📘 Version: 2025-07-08  

Foundational vectorization and semantic indexing of complaint narratives for **CrediTrust Financial’s Intelligent Complaint Analysis Initiative**. This notebook transforms cleaned consumer complaints into numerical embeddings using a pre-trained transformer model and builds a persistent vector index (stored in ChromaDB in the final cells) to enable semantic search and Retrieval-Augmented Generation (RAG). Outputs from this task will power intelligent querying and automated summarization in Task 3.

---

**Challenge:** B5W6 – Intelligent Complaint Analysis  
**Company:** CrediTrust Financial  
**Author:** Nabil Mohamed  
**Branch:** `task-2-embedding-and-indexing`  
**Date:** July 2025  

---

### 📌 This notebook covers:
- Loading the cleaned complaint dataset (`filtered_complaints.csv`)
- Minimal text preprocessing for embedding readiness
- Generating sentence embeddings using transformer models (`all-MiniLM-L6-v2`)
- Constructing a vector index for efficient semantic search (persisted as a ChromaDB store in the final cells)
- Running embedding quality diagnostics (dimensionality, clustering, similarity checks)
- Saving embedding artifacts and index for downstream RAG applications


In [1]:
# ------------------------------------------------------------------------------
# 🛠 Ensure Notebook Runs from Project Root (for src/ imports to work)
# ------------------------------------------------------------------------------

import os
import sys

# If running from /notebooks/, move up to project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")
    print("📂 Changed working directory to project root")

# Add project root to sys.path so `src/` modules can be imported
project_root = os.getcwd()
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"✅ Added to sys.path: {project_root}")

# Optional: verify file presence to confirm we're in the right place
expected_path = "data/raw"
print(
    "📁 Output path ready"
    if os.path.exists(expected_path)
    else f"⚠️ Output path not found: {expected_path}"
)

📂 Changed working directory to project root
✅ Added to sys.path: c:\Users\admin\Documents\GIT Repositories\b5-w6-intelligent-complaint-analysis-challenge
📁 Output path ready


## 📦 Imports & Environment Setup

This cell loads the core libraries required for text embedding, semantic indexing, and exploratory visualization in the context of intelligent complaint analysis. Imports are grouped by function:

- **Data handling:** `pandas`, `numpy`
- **Visualization:** `matplotlib`, `seaborn`
- **Text processing & embedding:** `re`, `nltk`, `sentence_transformers`
- **Semantic search:** `faiss`
- **Date/time and categorical handling:** `datetime`, `pandas.api.types.CategoricalDtype`
- **System & utilities:** `os`, `warnings`, `pathlib`


In [2]:
# ---------------------------
# 📦 Imports & Environment Setup
# ---------------------------

# Data handling
import pandas as pd  # For structured data manipulation
import numpy as np  # For numerical computations and array operations

# Visualization
import matplotlib.pyplot as plt  # For plotting visualizations
import seaborn as sns  # For enhanced plotting aesthetics

# Text processing & embedding
import re  # For regular-expression-based text cleaning
import nltk  # For optional stopword removal and tokenization
from sentence_transformers import SentenceTransformer  # For pre-trained transformer embeddings

# Semantic search (vector indexing)
import faiss  # For efficient similarity search via FAISS index

# Date/time analysis
from datetime import datetime  # For date parsing and formatting
from pandas.api.types import CategoricalDtype  # For categorical data management

# System & utilities
import os  # For directory and file path operations
import warnings  # To suppress unnecessary warnings
from pathlib import Path  # For robust path handling

# Configure display settings for clarity
pd.set_option("display.max_columns", None)  # Ensure all columns are visible
pd.set_option("display.float_format", "{:,.2f}".format)  # Set consistent float formatting
warnings.filterwarnings("ignore")  # Suppress common library warnings

# Apply seaborn visual theme
sns.set(style="whitegrid", context="notebook")  # Set clean plotting style for diagnostics




## 📥 Load & Preview Cleaned Complaint Dataset (Task 2 Embedding)

This step loads the **pre-cleaned complaint dataset** (`filtered_complaints.csv`), which contains customer-submitted complaint narratives related to financial products, prepared in Task 1 through text cleaning and filtering.

- Reads the cleaned file from `data/interim/filtered_complaints.csv` using the `ComplaintChunkProcessor` class from `src.data_loader`
- Verifies data structure: row and column count, presence of narrative text, and data types
- Outputs key diagnostics: dataset shape, missing values, and sample preview
- Draws a reproducible random sample of 25,000 rows (`random_state=42`) to keep embedding time manageable
- Includes defensive error handling for missing files, empty datasets, or structural inconsistencies
- Designed for robustness, reusability, and seamless integration with downstream embedding and semantic indexing tasks

This ensures that the complaint narratives are correctly staged and validated for embedding generation, FAISS indexing, and retrieval-augmented generation (RAG) in Task 3.


In [3]:
# ------------------------------------------------------------------------------
# 📥 Load Pre-Cleaned Complaint Dataset Using ComplaintChunkProcessor (Task 2)
# ------------------------------------------------------------------------------

from src.data_loader import ComplaintChunkProcessor  # Import loader class

# ✅ 1. Define path to pre-cleaned CSV (from Task 1 output)
cleaned_data_path = "data/interim/filtered_complaints.csv"  # Path to cleaned file

# ✅ 2. Initialize the processor (output_path is required but unused in this task)
processor = ComplaintChunkProcessor(
    filepath=cleaned_data_path,  # Input file path
    output_path="data/interim/dummy_output.csv",  # Dummy value for compatibility
)

# ✅ 3. Load cleaned data safely using new method
df = processor.load_cleaned_data()  # Load the cleaned complaint dataset

# ✅ 4. Quick sanity check on loaded data
if not df.empty:
    print(f"✅ Dataset loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
    display(df.head())  # Optional: Show first few rows

    # ✅ 5. Randomly sample up to 25,000 rows to reduce processing time (with reproducibility)
    sample_size = min(25_000, len(df))  # Guard: sample() raises if n exceeds the row count
    df = df.sample(n=sample_size, random_state=42).reset_index(drop=True)  # Random sample
    print(f"✅ Sampled down to: {df.shape[0]:,} rows for faster embedding")

else:
    print("⚠️ No data loaded. Check file path or contents.")

✅ Cleaned complaint dataset loaded successfully: 647,245 rows × 19 columns
✅ Dataset loaded: 647,245 rows × 19 columns


Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID,MappedProduct
0,2025-06-13,Credit card,Store credit card,Getting a credit card,Card opened without my consent or knowledge,a xxxx xxxx card was opened under my name by a...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",TX,78230,Servicemember,Consent provided,Web,2025-06-13,Closed with non-monetary relief,Yes,,14069121,Credit card
1,2025-06-13,Checking or savings account,Checking account,Managing an account,Deposits and withdrawals,i made the mistake of using my wellsfargo debi...,Company has responded to the consumer and the ...,WELLS FARGO & COMPANY,ID,83815,,Consent provided,Web,2025-06-13,Closed with explanation,Yes,,14061897,Savings account
2,2025-06-12,Credit card,General-purpose credit card or charge card,"Other features, terms, or problems",Other problem,"dear cfpb, i have a secured credit card with c...",Company has responded to the consumer and the ...,"CITIBANK, N.A.",NY,11220,,Consent provided,Web,2025-06-13,Closed with monetary relief,Yes,,14047085,Credit card
3,2025-06-12,Credit card,General-purpose credit card or charge card,Incorrect information on your report,Account information incorrect,i have a citi rewards cards. the credit balanc...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",IL,60067,,Consent provided,Web,2025-06-12,Closed with explanation,Yes,,14040217,Credit card
4,2025-06-11,Vehicle loan or lease,Loan,Repossession,Deficiency balance after repossession,was never notified of repossession. once repos...,Company has responded to the consumer and the ...,CREDIT ACCEPTANCE CORPORATION,TX,75070,,Consent provided,Web,2025-06-11,Closed with explanation,Yes,,14019199,Personal loan


✅ Sampled down to: 25,000 rows for faster embedding


## ✨ Minimal Text Preprocessing for Embedding Readiness (Task 2)

This step applies **minimal, meaning-preserving text cleaning** to the complaint narratives before sentence embedding generation. The goal is to retain the full semantic content while removing superficial noise such as inconsistent casing and stray whitespace.

- Lowercases text to ensure case consistency
- Removes excessive whitespace, line breaks, and tabs
- Preserves all other textual features to maximize embedding quality

The `MinimalTextPreprocessor` class, imported in the next cell from `src.chunking.text_cleaner`, is used for this operation, following defensive programming practices to handle missing or malformed inputs.

This ensures that embeddings are generated from standardized, clean narratives while retaining the original complaint meaning.
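
The two cleaning rules listed above can be sketched in a few lines. This is an illustrative stand-in, not the project's `MinimalTextPreprocessor`, whose source is not shown in this notebook; the function name `minimal_clean` is hypothetical.

```python
import re

# Matches any run of spaces, tabs, or line breaks
_WHITESPACE_RUN = re.compile(r"\s+")


def minimal_clean(text):
    """Lowercase the text and collapse whitespace runs into single spaces.

    Hypothetical helper: missing or non-string input (for example NaN from
    pandas) is mapped to an empty string instead of raising.
    """
    if not isinstance(text, str):
        return ""
    return _WHITESPACE_RUN.sub(" ", text).strip().lower()


print(minimal_clean("I was  CHARGED\ttwice.\n\nPlease   refund."))
# prints: i was charged twice. please refund.
```

Punctuation, numbers, and redaction tokens such as `xxxx` pass through unchanged, which matches the intent of keeping every other textual feature intact.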


In [4]:
# ------------------------------------------------------------------------------
# ✨ Apply Minimal Text Preprocessing for Embedding Readiness (Task 2)
# ------------------------------------------------------------------------------

# ✅ Import the MinimalTextPreprocessor class from src.chunking.text_cleaner
from src.chunking.text_cleaner import (
    MinimalTextPreprocessor,
)  # Minimal complaint cleaner for embeddings

# ✅ Initialize the preprocessor (no parameters required for minimal cleaning)
preprocessor = MinimalTextPreprocessor()  # Create preprocessor instance

# ✅ Apply text cleaning to the 'Consumer complaint narrative' column from Task 1, create new 'cleaned_narrative' column
df = preprocessor.apply_to_dataframe(
    df,  # DataFrame loaded via ComplaintChunkProcessor
    text_column="Consumer complaint narrative",  # Source text column (Task 1 cleaned)
    new_column_name="cleaned_narrative",  # Destination cleaned text column
)

# ✅ Quick manual verification: compare original and cleaned narratives side by side
df[["Consumer complaint narrative", "cleaned_narrative"]].sample(
    5, random_state=42
)  # Display 5 sample rows

Unnamed: 0,Consumer complaint narrative,cleaned_narrative
6868,on xxxxxxxx in the earlier part of the day i w...,on xxxxxxxx in the earlier part of the day i w...
24016,i received an email upon waking up on xxxxxxxx...,i received an email upon waking up on xxxxxxxx...
9668,midland credit management xxxx. xxxx xxxx xxxx...,midland credit management xxxx. xxxx xxxx xxxx...
13640,"to whom it may concern, on xxxx xxxx xxxx xxxx...","to whom it may concern, on xxxx xxxx xxxx xxxx..."
14018,"during xxxx and xxxx, i had a checking account...","during xxxx and xxxx, i had a checking account..."


## ✂️ Split Complaint Narratives into Chunks for Embedding (Task 2)

To improve the quality of semantic search and downstream retrieval, this step breaks down long complaint narratives into **smaller, semantically meaningful text chunks** using the `TextChunker` class.

Key features:
- Uses **LangChain's RecursiveCharacterTextSplitter** for intelligent chunking
- Preserves important metadata (Complaint ID, Product) for traceability
- Outputs a structured list of text chunks, each linked to its source complaint

This ensures that long complaints are represented by multiple focused embeddings, enhancing the performance of semantic search and Retrieval-Augmented Generation (RAG) pipelines.
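
To make the chunk size and overlap parameters concrete, here is a deliberately simplified sketch that uses a fixed character window. It is not LangChain's `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and sentence boundaries, and it is not the project's `TextChunker`. The helper names `chunk_text` and `chunk_records` are hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


def chunk_records(rows, text_key, metadata_keys, **chunk_kwargs):
    """Turn row dicts into chunk dicts that carry their source metadata."""
    records = []
    for row in rows:
        pieces = chunk_text(row.get(text_key) or "", **chunk_kwargs)
        for index, piece in enumerate(pieces):
            record = {"text": piece, "chunk_index": index}
            record.update({key: row.get(key) for key in metadata_keys})
            records.append(record)
    return records


rows = [{"Complaint ID": 1, "Product": "Credit card", "cleaned_narrative": "a" * 1200}]
for record in chunk_records(rows, "cleaned_narrative", ["Complaint ID", "Product"]):
    print(record["Complaint ID"], record["chunk_index"], len(record["text"]))
```

Each output record keeps the complaint identifier and a per-complaint `chunk_index`, which is what lets a retrieved chunk be traced back to its source narrative.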


In [5]:
# ------------------------------------------------------------------------------
# ✂️ Split Complaint Narratives into Chunks for Embedding Readiness (Task 2)
# ------------------------------------------------------------------------------

# ✅ Import the TextChunker class from the chunking module
from src.chunking.text_chunker import (
    TextChunker,
)  # Intelligent text splitter for long narratives

# ✅ 1. Initialize the chunker with desired chunk size and overlap
chunker = TextChunker(
    chunk_size=500, chunk_overlap=50
)  # Adjustable based on text length and embedding model limits

# ✅ 2. Define metadata columns to preserve for each chunk
metadata_fields = [
    "Complaint ID",
    "Product",
]  # These fields will remain attached to each chunk

# ✅ 3. Apply chunking to the cleaned narratives
chunked_data = chunker.chunk_dataframe(
    df,  # DataFrame containing cleaned narratives
    text_column="cleaned_narrative",  # Source text column to split
    metadata_columns=metadata_fields,  # Metadata to carry forward
)

# ✅ 4. Quick check: show first few chunk records with metadata
print(f"✅ Total chunks created: {len(chunked_data):,}")
pd.DataFrame(chunked_data).head()  # Preview chunks for verification

✅ Total chunks created: 73,597


Unnamed: 0,text,Complaint ID,Product,chunk_index
0,initial purchase with synchrony xxxx days late...,2681960,Credit card or prepaid card,0
1,credit account back and remove my sister off t...,2681960,Credit card or prepaid card,1
2,me 30 days late!! i did not provide documents ...,2681960,Credit card or prepaid card,2
3,from the last week of xxxx2024 to the second w...,8769778,"Money transfer, virtual currency, or money ser...",0
4,"on xxxxxxxx, i sent a bill pay check on a rela...",6217611,Checking or savings account,0


## 🔗 Generate Sentence Embeddings for Complaint Chunks (Task 2)

In this step, we transform each text chunk into a dense vector representation using the **EmbeddingGenerator** class and the pre-trained `all-MiniLM-L6-v2` transformer model.

Key features:
- Uses **GPU acceleration** when available
- Processes chunks in memory-efficient **batches**
- Embeddings are **normalized** for compatibility with cosine similarity in semantic search

This ensures that each chunk is mapped into vector space for downstream indexing and retrieval.
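
Why normalization matters for cosine similarity: once two vectors are scaled to unit length, their plain dot product equals their cosine similarity, so an index that only computes inner products can still rank by cosine. The sketch below checks this with the standard library on toy 3-dimensional vectors; the helper names are hypothetical and the real embeddings here have 384 dimensions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# The two values agree up to floating-point rounding
print(dot(l2_normalize(a), l2_normalize(b)))
print(cosine_similarity(a, b))
```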


In [7]:
# ------------------------------------------------------------------------------
# 🔗 Generate Sentence Embeddings for Complaint Chunks Using EmbeddingGenerator
# ------------------------------------------------------------------------------

# ✅ Import the EmbeddingGenerator class from the chunking module
from src.chunking.embedding_generator import (
    EmbeddingGenerator,
)  # Modular embedding generator

# ✅ 1. Initialize the embedding generator with desired model and batch size
embedder = EmbeddingGenerator(
    model_name="all-MiniLM-L6-v2", batch_size=64
)  # Efficient transformer model

# ✅ 2. Extract text chunks from previously generated chunked_data (list of dicts)
chunk_texts = [
    record["text"] for record in chunked_data
]  # Extract chunk texts for embedding

# ✅ 3. Generate embeddings in batches with progress bar
embeddings = embedder.generate_embeddings(
    chunk_texts
)  # Generate dense vector representations

# ✅ 4. Quick sanity check: confirm embedding matrix shape
if embeddings is not None:
    print(f"✅ Embedding matrix shape: {embeddings.shape}")
else:
    print("⚠️ No embeddings generated. Check input or model setup.")

✅ Embedding model 'all-MiniLM-L6-v2' loaded on device: cpu


Batches: 100%|██████████| 1150/1150 [1:11:38<00:00,  3.74s/it]


✅ Embedding generation successful: 73,597 vectors of dimension 384
✅ Embedding matrix shape: (73597, 384)


## 🗂️ Build ChromaDB Vector Store Using Precomputed Embeddings (Task 2)

In this step, we use the **VectorStoreBuilder** class to store precomputed complaint chunk embeddings and their associated metadata into a **ChromaDB** vector store for efficient semantic search.

Key features:
- Avoids recomputing embeddings
- Persists the index to disk for downstream RAG and interactive querying (Task 3 & 4)
- Keeps chunk traceability via metadata (Complaint ID, Product, Chunk Index)
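
The logged run adds documents in batches of 5,000. How `VectorStoreBuilder` does this internally is not shown in this notebook, so the following is only a sketch of the general pattern: check that texts, vectors, and metadata are aligned, then yield matching slices. The function name `iter_batches` and its default batch size are assumptions taken from the log output.

```python
def iter_batches(documents, embeddings, metadatas, batch_size=5000):
    """Yield aligned (documents, embeddings, metadatas) slices.

    Hypothetical helper: works on any sliceable sequences with len(),
    including Python lists and NumPy arrays.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not (len(documents) == len(embeddings) == len(metadatas)):
        raise ValueError(
            "documents, embeddings, and metadatas must have the same length"
        )

    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        yield documents[start:end], embeddings[start:end], metadatas[start:end]


# Toy stand-ins sized like the run in this notebook (73,597 chunks)
docs = [f"chunk {i}" for i in range(73_597)]
vectors = [[0.0] for _ in docs]
metas = [{"Chunk Index": i} for i in range(len(docs))]

sizes = [len(batch_docs) for batch_docs, _, _ in iter_batches(docs, vectors, metas)]
print(len(sizes), sizes[0], sizes[-1])
# prints: 15 5000 3597
```

Failing fast on a length mismatch is the important part: if the three sequences drift out of alignment, every chunk after the gap would be stored with another chunk's vector or metadata.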


In [21]:
# ------------------------------------------------------------------------------
# 🔄 Reload VectorStoreBuilder Module in Notebook (Jupyter/IPython)
# ------------------------------------------------------------------------------

import importlib  # Built-in module for dynamic import reloading

# ✅ Import the module (your updated code must be saved first)
import src.chunking.vector_store_builder as vector_store_builder_module

# ✅ Reload the module to get the latest changes
importlib.reload(vector_store_builder_module)

# ✅ Access the class again (fresh copy)
VectorStoreBuilder = vector_store_builder_module.VectorStoreBuilder

In [None]:
# ------------------------------------------------------------------------------
# 🗂️ Build ChromaDB Vector Store with Progress, Diagnostics, and Runtime Insights (Task 2 – Final v2)
# ------------------------------------------------------------------------------
# Author: Nabil Mohamed | July 2025

# ✅ Standard library imports
import time

# ✅ Third-party imports
import numpy as np
from tqdm import tqdm
from langchain.embeddings import HuggingFaceEmbeddings

# ✅ Project imports
from src.chunking.vector_store_builder import VectorStoreBuilder

# ------------------------------------------------------------------------------
# 1️⃣ Initialize VectorStoreBuilder
# ------------------------------------------------------------------------------

builder = VectorStoreBuilder(
    persist_directory="vector_store/chroma_db",
    collection_name="complaint_chunks",
)

# ------------------------------------------------------------------------------
# 2️⃣ Prepare Inputs
# ------------------------------------------------------------------------------

documents = chunk_texts  # List of text chunks
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

metadatas = [
    {
        "Complaint ID": chunk.get("Complaint ID", "N/A"),
        "Product": chunk.get("Product", "N/A"),
        "Chunk Index": chunk.get("chunk_index", 0),
    }
    for chunk in chunked_data
]

# Precomputed embeddings (None if not available)
precomputed_embeddings = embeddings

# ------------------------------------------------------------------------------
# 3️⃣ Diagnostics
# ------------------------------------------------------------------------------

print(f"🔍 Number of text chunks: {len(documents):,}")
print(f"🔍 Number of metadata entries: {len(metadatas):,}")

if precomputed_embeddings is not None:
    print(
        f"🔍 Precomputed embeddings detected: shape = {np.array(precomputed_embeddings).shape}"
    )
else:
    print(
        "⚙️ No precomputed embeddings provided. Will use live embedding as fallback if needed."
    )

# ------------------------------------------------------------------------------
# 4️⃣ Build Vector Store with Progress
# ------------------------------------------------------------------------------

print("\n🚀 Starting ChromaDB vector store creation...")
start_time = time.time()

with tqdm(
    total=1,
    desc="🔄 Building Vector Store",
    bar_format="{l_bar}{bar} [ time left: {remaining} ]",
) as pbar:
    vector_store = builder.build_chroma_store(
        documents=documents,
        embedding_model=embedding_model,
        metadatas=metadatas,
        embeddings=precomputed_embeddings,
    )
    pbar.update(1)

elapsed_time = time.time() - start_time

# ------------------------------------------------------------------------------
# 5️⃣ Results Summary
# ------------------------------------------------------------------------------

if vector_store:
    print("\n✅ ChromaDB vector store created successfully.")
    print(f"📄 Total documents indexed: {len(documents):,}")
    print(f"⏱️ Time taken: {elapsed_time:.2f} seconds")
    print(f"📁 Location: {builder.persist_directory}")
else:
    print("\n❌ Vector store creation failed.")
    print(f"⏱️ Time elapsed: {elapsed_time:.2f} seconds")

🔍 Number of text chunks: 73,597
🔍 Number of metadata entries: 73,597
🔍 Precomputed embeddings detected: shape = (73597, 384)

🚀 Starting ChromaDB vector store creation...


🔄 Building Vector Store:   0%|           [ time left: ? ]

⚡ Using precomputed embeddings for ChromaDB creation (Option B).
➕ Adding batch 1 (5,000 documents)...
➕ Adding batch 2 (5,000 documents)...
➕ Adding batch 3 (5,000 documents)...
➕ Adding batch 4 (5,000 documents)...
➕ Adding batch 5 (5,000 documents)...
➕ Adding batch 6 (5,000 documents)...
➕ Adding batch 7 (5,000 documents)...
➕ Adding batch 8 (5,000 documents)...
➕ Adding batch 9 (5,000 documents)...
➕ Adding batch 10 (5,000 documents)...
➕ Adding batch 11 (5,000 documents)...
➕ Adding batch 12 (5,000 documents)...
➕ Adding batch 13 (5,000 documents)...
➕ Adding batch 14 (5,000 documents)...
➕ Adding batch 15 (3,597 documents)...


🔄 Building Vector Store: 100%|██████████ [ time left: 00:00 ]

✅ ChromaDB vector store created successfully with 73,597 documents.
📁 Location: vector_store/chroma_db

✅ ChromaDB vector store created successfully.
📄 Total documents indexed: 73,597
⏱️ Time taken: 4055.63 seconds
📁 Location: vector_store/chroma_db



