# 🔢 Notebook 01: Chunking, Embedding, and Vector Store Indexing (FAISS)

## Learning Objectives
In this notebook, you will learn:
1. **Convert data to LangChain Documents** - the standard format for RAG
2. **Chunk text** using RecursiveCharacterTextSplitter
3. **Create embeddings** using sentence-transformers/all-MiniLM-L6-v2
4. **Build a FAISS vector store** and persist it to disk
5. **Test retrieval** with sample queries

## Key Concepts

### What is Chunking?
- Large documents need to be split into smaller pieces (chunks)
- Embedding models have input limits (typically 256-512 tokens); all-MiniLM-L6-v2 truncates input beyond 256 word pieces by default
- Smaller chunks = more precise retrieval, but less context
- We use **overlap** to preserve context at chunk boundaries
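
To make the overlap idea concrete, here is a minimal sketch of a fixed-window chunker. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers to cut at separators such as paragraph breaks). The function name `chunk_with_overlap` is hypothetical and used only for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


# Small numbers so the overlap is easy to see
pieces = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each window starts with the last three characters of the previous one, so text that falls on a boundary appears in both neighbouring chunks.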

### What are Embeddings?
- Embeddings convert text to vectors (lists of numbers)
- Similar texts have similar vectors (close in vector space)
- This enables **semantic search** (meaning-based, not just keywords)
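
A toy illustration of "close in vector space", using cosine similarity on hand-written 3-dimensional vectors. The numbers are made up for the example; real all-MiniLM-L6-v2 embeddings have 384 dimensions and come from the model, not from hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three short texts
refund_request = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
overheating = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(refund_request, money_back)
sim_unrelated = cosine_similarity(refund_request, overheating)
print(f"related: {sim_related:.3f}  unrelated: {sim_unrelated:.3f}")
```

The two refund-like vectors point in nearly the same direction, while the overheating vector points elsewhere, which is the property semantic search relies on.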

### What is a Vector Store?
- A system for storing and searching vectors
- **FAISS** is a fast similarity search library (fully local)
- We can save it to disk and reload later (no re-embedding needed!)
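
The idea behind a vector store can be sketched in a few lines of plain Python. `TinyVectorIndex` below is a hypothetical toy: it compares the query against every stored vector and saves to JSON. FAISS performs the same kind of search far more efficiently and uses its own file format.

```python
import json
import math
import tempfile
from pathlib import Path


class TinyVectorIndex:
    """Toy vector store: exhaustive Euclidean search over a Python list."""

    def __init__(self):
        self.vectors = []
        self.texts = []

    def add(self, vector, text):
        self.vectors.append(list(vector))
        self.texts.append(text)

    def search(self, query, k=1):
        # Score every stored vector, then return the k closest (distance, text) pairs
        scored = [(math.dist(query, v), t) for v, t in zip(self.vectors, self.texts)]
        return sorted(scored)[:k]

    def save(self, path):
        Path(path).write_text(json.dumps({"vectors": self.vectors, "texts": self.texts}))

    @classmethod
    def load(cls, path):
        data = json.loads(Path(path).read_text())
        index = cls()
        index.vectors, index.texts = data["vectors"], data["texts"]
        return index


index = TinyVectorIndex()
index.add([1.0, 0.0], "billing question")
index.add([0.0, 1.0], "device overheating")

# Persist, reload, and search without recomputing any vectors
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.json"
    index.save(path)
    reloaded = TinyVectorIndex.load(path)

nearest = reloaded.search([0.9, 0.1], k=1)
print(nearest[0][1])
```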

---

## Step 1: Setup and Imports

In [1]:
# Standard library imports
import sys
import os
from pathlib import Path

# Add project root to path
# (assumes the notebook is run from a direct subfolder of the project root)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# IMPORTANT: Set up HuggingFace cache BEFORE importing transformers
# This ensures models are downloaded to our project folder
from src.config import setup_hf_cache
setup_hf_cache()

# Data manipulation
import pandas as pd

print("✓ Setup complete!")
print(f"Project root: {project_root}")

✓ HuggingFace cache set to: /Users/macbookpro/Documents/Rag_Test/rag-ticket-rag/models/hf
✓ Setup complete!
Project root: /Users/macbookpro/Documents/Rag_Test/rag-ticket-rag


In [2]:
# Import our custom modules
from src import config
from src.io import load_processed_tickets
from src.docs import dataframe_to_documents, print_document_sample
from src.chunking import chunk_documents, get_chunk_stats, print_chunk_samples
from src.vectorstore import (
    create_vector_store,
    load_vector_store,
    search_similar,
    search_with_scores,
    print_search_results
)

print("✓ Custom modules imported!")


✓ Custom modules imported!


## Step 2: Load Cleaned Data

We'll load the preprocessed data from Notebook 00.

In [3]:
# Load the cleaned tickets data
df = load_processed_tickets()

# Quick preview
print(f"\nLoaded {len(df):,} tickets")
print(f"Columns: {df.columns.tolist()}")

✓ Loaded 8,469 processed tickets from tickets_clean.csv

Loaded 8,469 tickets
Columns: ['Ticket ID', 'Customer Age', 'Customer Gender', 'Product Purchased', 'Date of Purchase', 'Ticket Type', 'Ticket Subject', 'Ticket Description', 'Ticket Status', 'Resolution', 'Ticket Priority', 'Ticket Channel', 'First Response Time', 'Time to Resolution', 'Customer Satisfaction Rating', 'description_word_count', 'document_text']


In [4]:
# Verify document_text exists
assert 'document_text' in df.columns, "document_text column missing! Run Notebook 00 first."

# Show sample document_text
print("Sample document_text:")
print("-" * 50)
print(df.iloc[0]['document_text'][:400])

Sample document_text:
--------------------------------------------------
Subject: Product setup
Description: I'm having an issue with the {product_purchased}. Please assist.

Your billing zip code is: 71701.

We appreciate that you have requested a website address.

Please double check your email address. I've tried troubleshooting steps mentioned in the user manual, but the issue persists.


---

## Step 3: Convert to LangChain Documents

LangChain uses `Document` objects as a standard format:
- `page_content`: The text that gets embedded
- `metadata`: Additional info for filtering/display (not embedded)
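
The shape of this conversion can be sketched without LangChain installed. The `Document` dataclass below is a stand-in with the same two fields as LangChain's class, and `row_to_document` is a hypothetical helper. The real `dataframe_to_documents` in `src/docs.py` is not shown in this notebook and may map columns differently.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in with the same two fields as LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def row_to_document(row: dict) -> Document:
    """Put the text to embed in page_content and everything else in metadata."""
    return Document(
        page_content=row["document_text"],
        metadata={
            "ticket_id": row["Ticket ID"],
            "product": row["Product Purchased"],
            "priority": row["Ticket Priority"],
        },
    )


# One row, written as a plain dict for illustration
row = {
    "Ticket ID": 1,
    "Product Purchased": "GoPro Hero",
    "Ticket Priority": "Critical",
    "document_text": "Subject: Product setup",
}
doc = row_to_document(row)
print(doc.metadata)
```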

In [5]:
# Convert DataFrame rows to LangChain Documents
documents = dataframe_to_documents(df)

print(f"\nCreated {len(documents):,} documents")

✓ Converted 8,469 rows to LangChain Documents
  Sample metadata keys: ['ticket_id', 'product', 'ticket_type', 'ticket_subject', 'status', 'priority', 'channel', 'date_of_purchase', 'satisfaction_rating']

Created 8,469 documents


In [6]:
# Let's examine a sample document
print_document_sample(documents[0])

DOCUMENT SAMPLE
Content:
Subject: Product setup
Description: I'm having an issue with the {product_purchased}. Please assist.

Your billing zip code is: 71701.

We appreciate that you have requested a website address.

Please...
------------------------------------------------------------
Metadata:
  ticket_id: 1
  product: GoPro Hero
  ticket_type: Technical issue
  ticket_subject: Product setup
  status: Pending Customer Response
  priority: Critical
  channel: Social media
  date_of_purchase: 2021-03-22
  satisfaction_rating: None


**Notice:**
- `page_content` contains the text we'll embed (Subject + Description + Resolution)
- `metadata` contains useful info like ticket_id, product, priority
- Metadata is NOT embedded, but stored alongside for filtering and display
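
Because metadata travels with each document, retrieved results can be narrowed after the search. A minimal sketch, assuming result objects expose a `.metadata` dict as LangChain Documents do; `filter_by_metadata` is a hypothetical helper and the sample values are illustrative.

```python
from types import SimpleNamespace


def filter_by_metadata(results, **required):
    """Keep only results whose metadata matches every key=value pair given."""
    return [
        doc for doc in results
        if all(doc.metadata.get(key) == value for key, value in required.items())
    ]


# Illustrative stand-ins for retrieved documents
results = [
    SimpleNamespace(page_content="Subject: Product setup",
                    metadata={"ticket_id": 1, "priority": "Critical"}),
    SimpleNamespace(page_content="Subject: Peripheral compatibility",
                    metadata={"ticket_id": 2, "priority": "Low"}),
]

critical = filter_by_metadata(results, priority="Critical")
print([doc.metadata["ticket_id"] for doc in critical])
```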

---

## Step 4: Chunk the Documents

We'll split documents into smaller chunks using `RecursiveCharacterTextSplitter`.

### Parameters:
- **chunk_size=500**: Maximum 500 characters per chunk
- **chunk_overlap=50**: 50 characters overlap between consecutive chunks

### Why these values?
- 500 chars ≈ 100-125 words ≈ good balance of context and precision
- 50 char overlap (10%) helps preserve context at boundaries
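
A rough way to reason about these values: with fixed windows, each new chunk advances by `chunk_size - chunk_overlap` characters. The sketch below estimates the chunk count under that simplification. `estimate_chunk_count` is a hypothetical helper, and the real splitter cuts at separators, so its counts can differ.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int = 500, chunk_overlap: int = 50) -> int:
    """Estimate how many fixed-size windows are needed to cover a text."""
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1  # the whole text fits in one chunk
    step = chunk_size - chunk_overlap
    # One full window, then as many extra steps as needed to cover the rest
    return 1 + math.ceil((text_length - chunk_size) / step)


for length in (320, 487, 501, 950, 1000):
    print(length, estimate_chunk_count(length))
```

Any text of 500 characters or fewer stays in a single chunk under this estimate.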

In [7]:
# Show current config settings
print("Chunking Configuration:")
print(f"  chunk_size: {config.CHUNK_SIZE} characters")
print(f"  chunk_overlap: {config.CHUNK_OVERLAP} characters")

Chunking Configuration:
  chunk_size: 500 characters
  chunk_overlap: 50 characters


In [8]:
# Chunk the documents
chunks = chunk_documents(documents)

✓ Created text splitter (chunk_size=500, overlap=50)
✓ Chunking complete:
  Original documents: 8,469
  After chunking: 8,469
  Expansion ratio: 1.00x


In [9]:
# Get statistics about the chunks
stats = get_chunk_stats(chunks)

print("\nCHUNK STATISTICS")
print("=" * 40)
for key, value in stats.items():
    print(f"{key}: {value}")


CHUNK STATISTICS
total_chunks: 8469
min_length: 183
max_length: 487
mean_length: 344.4
median_length: 348


In [10]:
# Print sample chunks to see what they look like
print_chunk_samples(chunks, n_samples=2)

CHUNK SAMPLES (showing 2 of 8,469)

--- Chunk 1 ---
Length: 320 chars
Ticket ID: 1
Chunk Index: 0
Content preview:
Subject: Product setup
Description: I'm having an issue with the {product_purchased}. Please assist.

Your billing zip code is: 71701.

We appreciate that you have requested a website address.

Please...


--- Chunk 2 ---
Length: 329 chars
Ticket ID: 2
Chunk Index: 0
Content preview:
Subject: Peripheral compatibility
Description: I'm having an issue with the {product_purchased}. Please assist.

If you need to change an existing product.

I'm having an issue with the {product_purch...



**Key Observations:**
- Each chunk has the original metadata (ticket_id, product, etc.)
- A `chunk_index` was added to track which chunk of a ticket this is
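- Short tickets stay as one chunk; longer ones are split
- In this dataset every ticket is under 500 characters (max length 487), so nothing was split and the expansion ratio is 1.00x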
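
How metadata and a chunk index might be attached is sketched below. `split_with_metadata` is a hypothetical helper that cuts text into consecutive pieces without overlap, to keep the focus on metadata. The notebook's `chunk_documents` is not shown and may work differently.

```python
import copy


def split_with_metadata(text, metadata, chunk_size=500):
    """Cut text into consecutive pieces; each piece gets its own copy of the metadata."""
    chunks = []
    for index, start in enumerate(range(0, len(text), chunk_size)):
        chunk_metadata = copy.deepcopy(metadata)  # avoid sharing one dict across chunks
        chunk_metadata["chunk_index"] = index     # position of this piece within the ticket
        chunks.append({
            "page_content": text[start:start + chunk_size],
            "metadata": chunk_metadata,
        })
    return chunks


ticket_metadata = {"ticket_id": 1, "product": "GoPro Hero"}
long_chunks = split_with_metadata("x" * 1200, ticket_metadata)
short_chunks = split_with_metadata("x" * 320, ticket_metadata)

print([c["metadata"]["chunk_index"] for c in long_chunks])
print(len(short_chunks))
```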

---

## Step 5: Create Embeddings and Vector Store (FAISS)

Now we'll:
1. Load the embedding model (all-MiniLM-L6-v2)
2. Embed all chunks (convert text → vectors)
3. Store in **FAISS**
4. Persist to disk (saved under `vector_store/faiss/`)

**Note:** First run will download the embedding model (~80MB). This is cached for future runs.
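
The `create_vector_store` helper lives in `src/vectorstore.py` and is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming the `langchain_community` FAISS wrapper and the `langchain_huggingface` embeddings package are installed (package names vary across LangChain versions). The function name `build_faiss_store` is hypothetical.

```python
def build_faiss_store(chunks, persist_dir,
                      model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Embed LangChain Documents, index them in FAISS, and save the index to disk."""
    # Imported inside the function so this sketch can be read and defined
    # without the heavy dependencies installed (an assumption of this example).
    from langchain_community.vectorstores import FAISS
    from langchain_huggingface import HuggingFaceEmbeddings

    embeddings = HuggingFaceEmbeddings(model_name=model_name)  # downloads the model on first use
    vectorstore = FAISS.from_documents(chunks, embeddings)     # embeds every chunk and builds the index
    vectorstore.save_local(str(persist_dir))                   # writes index.faiss and index.pkl
    return vectorstore
```

Calling `build_faiss_store(chunks, config.VECTOR_STORE_DIR)` would then cover steps 1 to 4 above in one call.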

In [11]:
# Show embedding model info
print("Embedding Model Configuration:")
print(f"  Model: {config.EMBEDDING_MODEL_NAME}")
print(f"  Cache directory: {config.MODELS_DIR}")
print(f"  Vector store directory: {config.VECTOR_STORE_DIR}")

Embedding Model Configuration:
  Model: sentence-transformers/all-MiniLM-L6-v2
  Cache directory: /Users/macbookpro/Documents/Rag_Test/rag-ticket-rag/models/hf
  Vector store directory: /Users/macbookpro/Documents/Rag_Test/rag-ticket-rag/vector_store/faiss


In [None]:
# Create the vector store
# This will:
# 1. Download the embedding model (first time only)
# 2. Embed all chunks (may take a few minutes)
# 3. Store in FAISS
# 4. Persist to disk

print("Creating FAISS vector store... (this may take a few minutes)")
print("=" * 50)

vectorstore = create_vector_store(chunks)

print("\n" + "=" * 50)
print("✓ FAISS vector store created and persisted!")

Creating vector store... (this may take a few minutes)
Creating vector store with 8,469 documents...
  Persist directory: /Users/macbookpro/Documents/Rag_Test/rag-ticket-rag/vector_store/faiss
Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
  (First run will download ~80MB to /Users/macbookpro/Documents/Rag_Test/rag-ticket-rag/models/hf)


---

## Step 6: Test Loading from Disk

Let's verify we can reload the vector store from disk.
This is important - we don't want to re-embed every time!
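
Reloading might look like the sketch below, assuming the `langchain_community` FAISS wrapper (recent versions require the `allow_dangerous_deserialization` flag; older ones do not accept it). The function name `load_faiss_store` is hypothetical, and the notebook's own `load_vector_store` may differ. The embedding model must match the one used at indexing time, because queries are embedded with it.

```python
def load_faiss_store(persist_dir,
                     model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Reload a saved FAISS index without re-embedding the documents."""
    # Imported lazily so the sketch can be defined without the dependencies installed.
    from langchain_community.vectorstores import FAISS
    from langchain_huggingface import HuggingFaceEmbeddings

    embeddings = HuggingFaceEmbeddings(model_name=model_name)
    return FAISS.load_local(
        str(persist_dir),
        embeddings,
        # index.pkl is a pickle file: only enable this for files you created yourself
        allow_dangerous_deserialization=True,
    )
```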

In [None]:
# Clear the current vectorstore from memory
del vectorstore

# Reload from disk
print("Reloading vector store from disk...")
vectorstore = load_vector_store()

print("\n✓ Successfully reloaded from disk!")

---

## Step 7: Test Retrieval

Let's test the vector store with some sample queries.
This demonstrates **semantic search** - finding relevant documents by meaning.

In [None]:
# Define test queries
test_queries = [
    "billing refund issues",
    "device overheating problem",
    "late delivery or missing items",
]

print(f"Testing {len(test_queries)} queries...")

In [None]:
# Test Query 1: Billing refund issues
query = test_queries[0]
results = search_similar(vectorstore, query, k=3)
print_search_results(results, query)

In [None]:
# Test Query 2: Device overheating
query = test_queries[1]
results = search_similar(vectorstore, query, k=3)
print_search_results(results, query)

In [None]:
# Test Query 3: Late delivery
query = test_queries[2]
results = search_similar(vectorstore, query, k=3)
print_search_results(results, query)

### Retrieval with Similarity Scores

We can also get similarity scores to understand how relevant each result is.
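
The scores are read here as distances, which is why lower means more similar. For unit-length vectors, squared Euclidean distance and cosine similarity are tied by `d² = 2 - 2·cos`, so both rank results the same way. The sketch below checks this on made-up vectors. Whether the notebook's embeddings are normalized, and exactly which distance `search_with_scores` returns, depends on `src/vectorstore.py` and is an assumption here.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    # Equals cosine similarity when both inputs are unit length
    return sum(x * y for x, y in zip(a, b))


# Made-up vectors standing in for a query and two indexed tickets
query = normalize([0.9, 0.1, 0.2])
candidates = {
    "refund for damaged item": normalize([0.8, 0.2, 0.1]),
    "device overheating": normalize([0.1, 0.9, 0.3]),
}

scores = {}
for label, vec in candidates.items():
    scores[label] = (squared_l2(query, vec), dot(query, vec))
    d2, cos = scores[label]
    print(f"{label}: distance={d2:.4f}  cosine={cos:.4f}  2-2cos={2 - 2 * cos:.4f}")
```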

In [None]:
# Search with scores
query = "customer wants a refund for damaged product"
results_with_scores = search_with_scores(vectorstore, query, k=5)

print(f"Query: '{query}'")
print("=" * 60)
print("\nResults with similarity scores:")
print("(Lower score = more similar)")
print("-" * 60)

for i, (doc, score) in enumerate(results_with_scores, 1):
    print(f"\n{i}. Score: {score:.4f}")
    print(f"   Ticket ID: {doc.metadata.get('ticket_id', 'N/A')}")
    print(f"   Product: {doc.metadata.get('product', 'N/A')}")
    print(f"   Preview: {doc.page_content[:100]}...")

---

## Step 8: Explore the Vector Store (FAISS)

FAISS stores vectors in an index.
We can inspect the number of vectors indexed and what files were saved to disk.

In [None]:
# Inspect FAISS index size
print("VECTOR STORE INFO (FAISS)")
print("=" * 40)

# ntotal = number of vectors stored in the FAISS index
try:
    print(f"Total vectors (ntotal): {int(vectorstore.index.ntotal):,}")
except Exception as e:
    print(f"Could not read FAISS ntotal: {e}")

print(f"Persist directory: {config.VECTOR_STORE_DIR}")

In [None]:
# Show the persisted files on disk
from pathlib import Path

persist_dir = Path(config.VECTOR_STORE_DIR)
files = sorted([p.name for p in persist_dir.glob("*")])

print("\nPERSISTED FILES")
print("=" * 40)
for f in files:
    print("-", f)

print("\nExpected FAISS files:")
print("- index.faiss  (the vector index)")
print("- index.pkl    (docstore + metadata)")

---

## Summary

### What We Accomplished
1. ✅ Loaded cleaned ticket data from Notebook 00
2. ✅ Converted DataFrame rows to LangChain Document objects
3. ✅ Chunked documents using RecursiveCharacterTextSplitter (500 chars, 50 overlap)
4. ✅ Created embeddings using all-MiniLM-L6-v2
5. ✅ Built and persisted a **FAISS** vector store
6. ✅ Tested retrieval with sample queries

### Key Takeaways
- **Chunking** splits large documents into searchable pieces
- **Embeddings** convert text to vectors for semantic search
- **FAISS** enables fast similarity search locally
- **Persistence** means we don't re-embed every time

### Files Created
- `vector_store/faiss/` - Persisted FAISS index
- `models/hf/` - Cached embedding model

### Next Steps
→ **Notebook 02**: Build the full RAG pipeline with LLM generation

In [None]:
print("\n" + "=" * 60)
print("🎉 Notebook 01 Complete!")
print("=" * 60)
print(f"\nVector store saved to: {config.VECTOR_STORE_DIR}")
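try:
    print(f"Total vectors indexed (ntotal): {int(vectorstore.index.ntotal):,}")
except Exception as e:
    # Report the problem instead of silently skipping the count
    print(f"Could not read FAISS ntotal: {e}")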
print("\nProceed to: 02_build_rag_pipeline.ipynb")