# 📓 **02 - Build Chunks, Embeddings & FAISS Indices**

This notebook transforms our cleaned product data into the `chunks`, `embeddings`, and `FAISS indices` that our RAG pipeline will use later. We split the text into chunks, convert each chunk into a numerical embedding, and build indices for efficient semantic search.

### **Objectives of this notebook**

* **Chunk** the cleaned product descriptions into manageable text segments using token-based overlapping windows
* **Generate embeddings** using Sentence Transformers to create dense vector representations
* **Build FAISS indices** for fast similarity search across different domains
* **Validate** the retrieval system with sample queries to ensure quality

---

## 🔧 **Pipeline Overview**

The vector store construction follows this workflow:

1. **Load Cleaned Data** → Read the preprocessed corpus from `cleaned_full_corpus.parquet`
2. **Text Chunking** → Split documents into overlapping token-based chunks
3. **Embedding Generation** → Convert text chunks to dense vector representations using Sentence Transformers
4. **Index Building** → Create FAISS indices for efficient similarity search
5. **Metadata Storage** → Save domain masks and mapping information for filtered retrieval

---

## **Import & Setup**

In [1]:
import os
import sys
import pandas as pd
import numpy as np

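# Make src importable by adding the project root to the Python path.
# abspath("__file__") resolves the literal string against the current working
# directory, so this assumes the notebook runs one level below the project root.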
PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.abspath("__file__")))
sys.path.insert(0, PROJECT_ROOT)
from src.paths import WORKED_FOLDER, TOKENIZATION_DATA, CHUNKS_PATH, EMBEDDINGS_PATH

# Load cleaned_full_corpus.parquet, created in the previous notebook
df = pd.read_parquet(os.path.join(WORKED_FOLDER, "cleaned_full_corpus.parquet"))
df['doc_id'] = np.arange(len(df), dtype=np.int64)

---

## 1. 🧠 **The Chunking Process**

### **Why Chunking is Necessary**

Product descriptions in our dataset can be quite lengthy. Chunking serves several important purposes:

* **Context Management**: LLMs and embedding models have limited context windows
* **Precision**: Smaller chunks allow more targeted retrieval of relevant information
* **Overlap Preservation**: 30-token overlap ensures we don't lose context at chunk boundaries
* **Efficiency**: Smaller chunks are faster to process and embed

### **Token-Based vs Character-Based Chunking**

Our pipeline uses **token-based chunking** with OpenAI's `tiktoken` tokenizer, which offers several advantages:

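* **Consistent with LLMs**: Uses the same tokenization as GPT models; the embedding model and any non-OpenAI LLM apply their own tokenizers, so the 200-token budget is approximate for them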
* **Language Agnostic**: Handles different languages and special characters better than character counts
* **Meaningful Units**: Tokens correspond more closely to semantic units than characters

### **Chunking Parameters**

* **Chunk Size**: 200 tokens - balances context richness with precision
* **Chunk Overlap**: 30 tokens - preserves context across chunk boundaries
* **Tokenizer**: `cl100k_base` encoding (same as GPT-4)
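
To make these parameters concrete, here is a minimal sketch of the window arithmetic: with a chunk size of 200 and an overlap of 30, each window starts 170 tokens after the previous one. The helper name `window_spans` is hypothetical, and it works on token counts only, so no tokenizer is needed.

```python
def window_spans(n_tokens: int, chunk_size: int = 200, chunk_overlap: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) token offsets of the overlapping windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # 200 - 30 = 170 tokens between window starts
    return [
        (start, min(start + chunk_size, n_tokens))
        for start in range(0, n_tokens, stride)
    ]


# A 500-token document produces three windows; neighbours share 30 tokens.
spans = window_spans(500)
print(spans)  # [(0, 200), (170, 370), (340, 500)]
```

The last window is usually shorter than 200 tokens, which is why chunk lengths vary across the corpus.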

In [2]:
# Example of what chunking achieves:
original_text = "This is a long product description that needs to be split into smaller pieces for better retrieval."
# After chunking (simplified):
chunk1 = "This is a long product description that needs to be"
chunk2 = "description that needs to be split into smaller pieces"
chunk3 = "into smaller pieces for better retrieval."

### Creating chunks

In [3]:
# 1) Create chunks.parquet (if it does not exist yet) using token-based chunking
import tiktoken
from tqdm import tqdm
def chunk_text(text: str, enc: tiktoken.Encoding, chunk_size: int=200, chunk_overlap: int=30) -> list:
    tokens = enc.encode(text)
    chunks = []

    for i in range(0, len(tokens), chunk_size - chunk_overlap):
        chunk = tokens[i: i + chunk_size]
        if not chunk:
            continue
        chunks.append(enc.decode(chunk))

    return chunks

def create_chunks(df: pd.DataFrame, path=CHUNKS_PATH) -> pd.DataFrame:
    if os.path.exists(path):
        print("üìÇ Loading existing chunks file...")
        chunks_df = pd.read_parquet(path)
    else:
        enc = tiktoken.get_encoding("cl100k_base")
        
        rows = []
        chunk_id = 0
        df = df.reset_index(drop=True)

        for idx, row in tqdm(df.iterrows(), total=len(df), desc="Chunking"):
            text = row.combined_text

            chunk_list = chunk_text(
                text=text,
                enc=enc,
                chunk_size=200,
                chunk_overlap=30
            )

            for chunk in chunk_list:
                rows.append({
                    "doc_id": idx,
                    "chunk_id": chunk_id,
                    "domain": row["domain"],
                    "price": row["price"],
                    "average_rating": row["average_rating"],
                    "title": row["title"],
                    "categories": row["categories"],
                    "text": chunk
                })
                chunk_id += 1
        
        chunks_df = pd.DataFrame(rows)

        # Save metadata for later
        chunks_df.to_parquet(path, index=False)
        print("üíæ chunks_parquet saved successfully.")
        

    return chunks_df
chunks_df = create_chunks(df)

📂 Loading existing chunks file...
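
One property of the loop in `chunk_text` is worth noting: the window keeps advancing until its start passes the last token, so a document of exactly 200 tokens yields a second chunk holding only tokens 170 to 199, all of which already appear in the first chunk. The sketch below is a possible variant, not the notebook's implementation. It stops once a window reaches the end of the document and rejects an overlap that is not smaller than the chunk size. It operates on a plain list of token ids, and `chunk_token_ids` is a hypothetical name.

```python
def chunk_token_ids(tokens: list[int], chunk_size: int = 200, chunk_overlap: int = 30) -> list[list[int]]:
    """Split token ids into overlapping windows without a redundant tail chunk."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    stride = chunk_size - chunk_overlap
    chunks: list[list[int]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already covers the end of the document
    return chunks


# Integers stand in for real token ids.
print(len(chunk_token_ids(list(range(200)))))  # 1
print(len(chunk_token_ids(list(range(500)))))  # 3
```

With `tiktoken`, each window would still be passed through `enc.decode` to recover text. Adopting a variant like this would change the chunk counts reported in this notebook.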



---

## 2. 🔬 **Embedding Generation**

### **What are Embeddings?**

Embeddings are numerical representations of text that capture semantic meaning. Similar products will have similar embedding vectors, enabling semantic search.

### **Sentence Transformer Model: BAAI/bge-base-en-v1.5**

We use this model because it's:
* **Specialized for Retrieval**: Optimized for semantic similarity tasks
* **English-Optimized**: Trained primarily on English text
* **High Quality**: Produces 768-dimensional vectors with strong semantic capture
* **Efficient**: Balances performance and computational requirements

### **Embedding Properties**

* **Dimension**: 768 dimensions per vector
* **Normalization**: All vectors are normalized to unit length
* **Similarity Metric**: Cosine similarity (equivalent to inner product for normalized vectors)
* **Storage**: float32 for FAISS compatibility and memory efficiency
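
The claim that cosine similarity equals the inner product for unit-length vectors can be checked with a few lines of plain Python. This is an illustrative sketch with made-up three-dimensional vectors; the real pipeline works on 768-dimensional NumPy arrays.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cos_ab = cosine(a, b)                    # cosine similarity of the raw vectors
ip_ab = dot(normalize(a), normalize(b))  # inner product after normalization

print(math.isclose(cos_ab, ip_ab))  # True
```

Because the embeddings are normalized once at encoding time, the index only needs to compute inner products at query time.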

### **Batch Processing**

* **Batch Size**: 256 texts - optimized for GPU memory if available
* **Progress Tracking**: Shows real-time progress for large datasets
* **Automatic Hardware Detection**: Uses GPU if available, falls back to CPU
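
The notebook passes `batch_size` to `model.encode`, which splits the input itself. The sketch below only illustrates what such a split looks like; `iter_batches` is a hypothetical helper and the texts are placeholders.

```python
from typing import Iterator, Sequence


def iter_batches(texts: Sequence[str], batch_size: int = 256) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = [f"chunk {i}" for i in range(600)]
sizes = [len(batch) for batch in iter_batches(texts)]
print(sizes)  # [256, 256, 88]
```

At this batch size, the 326,813 chunks in this corpus correspond to 1,277 batches. A smaller batch size lowers peak memory use at the cost of more forward passes.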

### Creating Embeddings

In [4]:
# 2) Generate embeddings using Sentence Transformers
from sentence_transformers import SentenceTransformer
import joblib
def create_embeddings(chunks_df: pd.DataFrame, model_name="BAAI/bge-base-en-v1.5", path=EMBEDDINGS_PATH, force_compute=False) -> tuple[np.ndarray, str]:    
    if os.path.exists(path) and not force_compute:
        print("üìÇ Loading existing embeddings...")
        embeddings = joblib.load(path)
        
    else:
        print("üî® Creating new embeddings...")
        model = SentenceTransformer(model_name)   # uses CPU/GPU automatically
        BATCH_SIZE = 256                              # tune to VRAM

        # Encoding
        texts = chunks_df["text"].tolist()
        embeddings = model.encode(
            texts,
            show_progress_bar=True,
            batch_size=BATCH_SIZE,
            normalize_embeddings=True,
            convert_to_numpy=True
        ).astype("float32")

        joblib.dump(embeddings, path)
        print("üíæ Embeddings saved successfully.")
    print("üî¢ Embedding shape: ", embeddings.shape)

    # Save metadat for later
    return embeddings, model_name

embeddings, model_name = create_embeddings(chunks_df)

📂 Loading existing embeddings...
🔢 Embedding shape:  (326813, 768)


---

## 3. 🗂️ **FAISS Index Architecture**

### **Why FAISS?**

Facebook AI Similarity Search (FAISS) is optimized for:
* **Fast nearest neighbor search** even in high-dimensional spaces
* **Memory efficiency** with large vector databases
* **GPU acceleration** support
* **Multiple index types** for different trade-offs

### **Index Structure**

We build three separate indices to enable flexible retrieval:

1. **`faiss_all.index`** - Complete corpus for general search across all products
2. **`faiss_beauty.index`** - Beauty domain products only for category-specific queries
3. **`faiss_electronics.index`** - Electronics domain products only

### **Index Configuration**

* **Index Type**: `IndexFlatIP` (exact, brute-force inner-product search; no training step required)
* **Similarity Metric**: Cosine similarity (via inner product on normalized vectors)
* **Domain Masks**: Boolean arrays that select each domain's chunks; they are saved to the metadata as integer row indices
* **Metadata Mapping**: Stores chunk-to-document relationships
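
Conceptually, a flat inner-product index scores the query against every stored vector and keeps the best `k`. The sketch below mimics that behaviour in plain Python for illustration only. It is not how FAISS is implemented internally, and `flat_ip_search` is a hypothetical helper working on toy two-dimensional unit vectors.

```python
def flat_ip_search(vectors: list[list[float]], query: list[float], k: int) -> tuple[list[float], list[int]]:
    """Exhaustively score every vector by inner product and return the top k."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in order], order


vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
]
query = [0.8, 0.6]

scores, ids = flat_ip_search(vectors, query, k=2)
print(ids)  # [2, 0]
```

Exhaustive scoring is exact but grows linearly with the number of vectors, which is the trade-off a flat index makes compared with approximate index types.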

In [5]:
# 3) Build FAISS indices for combined and domain-specific retrieval
# and save all artifacts to TOKENIZATION_DATA folder
import faiss
def build_index(embeddings: np.ndarray, chunks_df: pd.DataFrame, model_name):
    dim = embeddings.shape[1]
    
    index_all = faiss.IndexFlatIP(dim)
    index_all.add(embeddings)                       # type: ignore

    # domain masks
    beauty_mask = chunks_df["domain"] == "Beauty"
    electronics_mask = chunks_df["domain"] == "Electronics"

    index_beauty = faiss.IndexFlatIP(dim)
    index_beauty.add(embeddings[beauty_mask])       # type: ignore

    index_elec = faiss.IndexFlatIP(dim)
    index_elec.add(embeddings[electronics_mask])    # type: ignore

    # Save 
    save_artifacts(index_all, index_beauty, beauty_mask, index_elec, electronics_mask, model_name, dim)
    

def save_artifacts(index_all: faiss.Index, index_beauty: faiss.Index, beauty_mask: pd.Series, index_elec: faiss.Index, 
                   electronics_mask: pd.Series, model_name: str, dim: int):
    faiss.write_index(index_all, os.path.join(TOKENIZATION_DATA, "faiss_all.index"))
    faiss.write_index(index_beauty, os.path.join(TOKENIZATION_DATA, "faiss_beauty.index"))
    faiss.write_index(index_elec, os.path.join(TOKENIZATION_DATA, "faiss_electronics.index"))

    # Map arrays
    meta = {
        "model_name": model_name,
        "dim": dim,
        "doc_table_path": CHUNKS_PATH,
        "beauty_indices": np.where(beauty_mask)[0].astype("int64"),
        "electronics_indices": np.where(electronics_mask)[0].astype("int64"),   
    }
    joblib.dump(meta, os.path.join(TOKENIZATION_DATA, "meta.joblib"))

    print("üíæ Saved indices + metadata to ", TOKENIZATION_DATA)

build_index(embeddings, chunks_df, model_name)

💾 Saved indices + metadata to  c:\Users\hasee\Documents\Python_works\NLP\RAG_LLM\data/tokenized


---
## 💾 **Artifacts Created**

After running the pipeline, these files are generated in `TOKENIZATION_DATA`:

```text
tokenization_data/
├── chunks.parquet              # Text chunks with metadata (doc_id, chunk_id, domain, price, etc.)
├── embeddings.joblib           # Precomputed embeddings (float32 numpy array)
├── faiss_all.index             # Combined FAISS index for all products
├── faiss_beauty.index          # Beauty domain index only
├── faiss_electronics.index     # Electronics domain index only
└── meta.joblib                 # Metadata and domain mappings
```


### **Metadata Contents**

The `meta.joblib` file contains:
- `model_name`: Embedding model used
- `dim`: Embedding dimension (768)
- `doc_table_path`: Path to chunks dataframe
- `beauty_indices`: NumPy array of indices belonging to Beauty domain
- `electronics_indices`: NumPy array of indices belonging to Electronics domain
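
The two index arrays matter because a domain-specific index numbers its vectors from zero, so the positions it returns are local to that domain. The toy sketch below shows the translation back to rows of the full chunk table. Plain lists stand in for the NumPy arrays, and the domain labels and `local_ids` are made up for illustration.

```python
# Domain label of each row in a five-row chunk table (illustrative data).
domains = ["Beauty", "Electronics", "Beauty", "Electronics", "Beauty"]

# Equivalent of np.where(beauty_mask)[0]: global row positions of Beauty chunks.
beauty_indices = [i for i, d in enumerate(domains) if d == "Beauty"]

# Suppose a Beauty-only index returned these local positions for a query.
local_ids = [2, 0]

# Translate local positions into rows of the full chunk table.
global_ids = [beauty_indices[i] for i in local_ids]
print(global_ids)  # [4, 0]
```

Skipping this step and using `local_ids` directly on the full table would silently return the wrong rows.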

## 🎯 **Quality Assessment**

### **What to Look For in Results**

* **High Similarity Scores**: Values close to 1.0 indicate strong semantic match
* **Domain Consistency**: Beauty queries should return beauty products, electronics should return electronics
* **Relevant Content**: Retrieved chunks should directly address the query topic
* **Metadata Integrity**: Prices, ratings, and titles should match the chunk content

### **Common Issues to Watch For**

* **Low Scores** (< 0.3): May indicate poor semantic matching or need for better chunking
* **Cross-Domain Results**: Beauty queries returning electronics (or vice versa)
* **Irrelevant Content**: Chunks that don't address the query despite high scores
* **Incomplete Context**: Chunks that cut off important information
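
Two of these checks, low scores and cross-domain results, are easy to automate. The following is a minimal sketch assuming a hypothetical helper `summarize_hits`. The 0.3 threshold comes from the list above, and the scores and domains are made-up example data, not notebook output.

```python
def summarize_hits(scores: list[float], domains: list[str], expected_domain: str, low_score: float = 0.3) -> dict:
    """Summarize one query's results: domain consistency and weak matches."""
    in_domain = sum(d == expected_domain for d in domains)
    return {
        "domain_precision": in_domain / len(domains) if domains else 0.0,
        "low_score_count": sum(s < low_score for s in scores),
        "min_score": min(scores) if scores else None,
    }


report = summarize_hits(
    scores=[0.75, 0.74, 0.28],
    domains=["Beauty", "Beauty", "Electronics"],
    expected_domain="Beauty",
)
print(report["low_score_count"])  # 1
```

Relevance and truncated context still need a manual read of the retrieved text; a summary like this only flags where to look first.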



---

## 🔍 **Inspect Generated Chunks**

Let's examine the chunked data to understand the text segmentation:


In [6]:
# Summarize the chunks DataFrame created above
print(f"📊 Total chunks created: {len(chunks_df):,}")
print(f"📝 Unique documents chunked: {chunks_df['doc_id'].nunique():,}")

# Display first few chunks with their metadata
print("Sample chunks with metadata:")
chunks_df.head()

📊 Total chunks created: 326,813
📝 Unique documents chunked: 142,642
Sample chunks with metadata:


| | doc_id | chunk_id | domain | price | average_rating | title | categories | text |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | Beauty | 6.99 | 3.7 | Lurrose 100Pcs Full Cover Fake Toenails Artifi... | Other Beauty | Title: Lurrose 100Pcs Full Cover Fake Toenails... |
| 1 | 0 | 1 | Beauty | 6.99 | 3.7 | Lurrose 100Pcs Full Cover Fake Toenails Artifi... | Other Beauty | with perfect length. You have the option to w... |
| 2 | 1 | 2 | Beauty | 86.95 | 3.7 | Gold extatic Musk EDT 90ml | Other Beauty | Title: Gold extatic Musk EDT 90ml. Features: E... |
| 3 | 2 | 3 | Beauty | 79.5 | 3.3 | Brand New Headrang Face line Contour V-line Ma... | Skin Care | Title: Brand New Headrang Face line Contour V-... |
| 4 | 3 | 4 | Beauty | 5.99 | 4.4 | BioMiracle StarDust Pixie Bubble Mask, Clarify... | Skin Care Face Masks | Title: BioMiracle StarDust Pixie Bubble Mask, ... |


In [7]:
# Analyze chunk length distribution
chunk_stats = chunks_df["text"].str.len().describe()
print("üìè Chunk length statistics (characters):")
print(f"   Mean: {chunk_stats['mean']:.1f}")
print(f"   Std:  {chunk_stats['std']:.1f}")
print(f"   Min:  {chunk_stats['min']:.1f}")
print(f"   Max:  {chunk_stats['max']:.1f}")

📏 Chunk length statistics (characters):
   Mean: 596.1
   Std:  292.8
   Min:  1.0
   Max:  1308.0


In [8]:
# Check domain distribution in chunks
print("\nüåê Domain distribution in chunks:")
print(chunks_df['domain'].value_counts())


🌐 Domain distribution in chunks:
domain
Electronics    303099
Beauty          23714
Name: count, dtype: int64


In [9]:
# Show sample chunks from each domain
print("üíÑ Beauty domain sample chunks:")
beauty_sample = chunks_df[chunks_df['domain'] == 'Beauty'].head(3)
for idx, row in beauty_sample.iterrows():
    print(f"Chunk {row['chunk_id']}: {row['text'][:100]}...")

print("\nüîå Electronics domain sample chunks:")
electronics_sample = chunks_df[chunks_df['domain'] == 'Electronics'].head(3)
for idx, row in electronics_sample.iterrows():
    print(f"Chunk {row['chunk_id']}: {row['text'][:100]}...")

💄 Beauty domain sample chunks:
Chunk 0: Title: Lurrose 100Pcs Full Cover Fake Toenails Artificial Transparent Nail Tips Nail Art for DIY. Fe...
Chunk 1:  with perfect length. You have the option to wear them long or clip them short, easy to trim and fil...
Chunk 2: Title: Gold extatic Musk EDT 90ml. Features: Extatic Balmain Gold Musk By Balmain Edt Spray 3 Oz. De...

🔌 Electronics domain sample chunks:
Chunk 23714: Title: Digi-Tatoo Decal Skin Compatible With MacBook Pro 13 inch (Model A2338/ A2289/ A2251) - Prote...
Chunk 23715:  impressive looking. Take it out and get tons of compliments. Easy Apply. Easy, bubble-free installa...
Chunk 23716: Title: NotoCity Compatible with Vivoactive 4 band 22mm Quick Release Silicone Bands/Garmin Darth Vad...



---

## 🔎 **Manual Nearest Neighbor Inspection**

Let's test the retrieval system with sample queries to validate it's working correctly:

In [10]:
# Load the FAISS index and metadata
meta = joblib.load(os.path.join(TOKENIZATION_DATA, "meta.joblib"))
index = faiss.read_index(os.path.join(TOKENIZATION_DATA, "faiss_all.index"))
table = chunks_df
model = SentenceTransformer(meta["model_name"])

print("üîß Loaded retrieval components:")
print(f"   Model: {meta['model_name']}")
print(f"   Embedding dimension: {meta['dim']}")
print(f"   Total chunks in index: {index.ntotal}")

🔧 Loaded retrieval components:
   Model: BAAI/bge-base-en-v1.5
   Embedding dimension: 768
   Total chunks in index: 326813


### **Test Query 1: Beauty Product Search**


In [11]:
# Test query for beauty products
query = "moisturizing face cream for dry skin"
print(f"üîç Query: '{query}'")

# Encode query to embedding
q_emb = model.encode([query], normalize_embeddings=True).astype("float32")

# Search for top 5 most similar chunks
scores, ids = index.search(q_emb, 5)

print(f"üìà Top 5 results (scores: {scores[0]})")
print(f"üî¢ Chunk IDs: {ids[0]}")

# Display retrieved results with relevant metadata
results = table.iloc[ids[0]][["title", "price", "average_rating", "domain", "text"]]
print("\nüìã Retrieved results:")
results

🔍 Query: 'moisturizing face cream for dry skin'
📈 Top 5 results (scores: [0.7541969  0.7456913  0.74467677 0.74290526 0.74100345])
🔢 Chunk IDs: [ 2979 16409  4836 21289  8635]

📋 Retrieved results:


| | title | price | average_rating | domain | text |
|---|---|---|---|---|---|
| 2979 | Sebamed Moisturizing Cream, Sensitive Skin, 2.... | 83.95 | 5.0 | Beauty | Title: Sebamed Moisturizing Cream, Sensitive S... |
| 16409 | Facial Moisturizer. Preservative Free. Organic... | 29.99 | 4.3 | Beauty | Title: Facial Moisturizer. Preservative Free. ... |
| 4836 | Moisturising Cream, Body and Face Moisturizer ... | 8.99 | 4.2 | Beauty | Title: Moisturising Cream, Body and Face Moist... |
| 21289 | DayTime Moisturizer for Dry Skin | 70.0 | 4.1 | Beauty | Title: DayTime Moisturizer for Dry Skin. Featu... |
| 8635 | Fresh Vitamin Nectar Moisture Glow Face Cream ... | 19.0 | 3.5 | Beauty | Title: Fresh Vitamin Nectar Moisture Glow Face... |


### **Test Query 2: Electronics Product Search**


In [12]:
# Test query for electronics
query = "wireless bluetooth headphones with noise cancellation"
print(f"üîç Query: '{query}'")

# Encode query to embedding
q_emb = model.encode([query], normalize_embeddings=True).astype("float32")

# Search for top 5 most similar chunks
scores, ids = index.search(q_emb, 5)

print(f"üìà Top 5 results (scores: {scores[0]})")
print(f"üî¢ Chunk IDs: {ids[0]}")

# Display retrieved results
results = table.iloc[ids[0]][["title", "price", "average_rating", "domain", "text"]]
print("\nüìã Retrieved results:")
results

🔍 Query: 'wireless bluetooth headphones with noise cancellation'
📈 Top 5 results (scores: [0.8025458  0.7917396  0.7910613  0.78404117 0.77654284])
🔢 Chunk IDs: [ 48934 183928 220356 169812 181427]

📋 Retrieved results:


| | title | price | average_rating | domain | text |
|---|---|---|---|---|---|
| 48934 | Active Noise Cancelling Headphones,Wireless Bl... | 39.31 | 4.3 | Electronics | Title: Active Noise Cancelling Headphones,Wire... |
| 183928 | Bose QuietComfort 35 (Series I) Wireless Headp... | 174.95 | 4.4 | Electronics | Title: Bose QuietComfort 35 (Series I) Wireles... |
| 220356 | Sony Noise Cancelling Headphones WHCH710N: Wir... | 71.25 | 4.4 | Electronics | Title: Sony Noise Cancelling Headphones WHCH71... |
| 169812 | Krankz Audio Noise Cancelling Bluetooth Headph... | 149.95 | 3.9 | Electronics | Title: Krankz Audio Noise Cancelling Bluetooth... |
| 181427 | Bluetooth Headphones Noise-canceling Magnetic ... | 16.59 | 3.4 | Electronics | Title: Bluetooth Headphones Noise-canceling Ma... |


### **Domain-Specific Search Test**


In [13]:
# Test domain-specific search using beauty-only index
index_beauty = faiss.read_index(os.path.join(TOKENIZATION_DATA, "faiss_beauty.index"))
beauty_indices = meta["beauty_indices"]
beauty_table = table.iloc[beauty_indices]

query = "anti-aging serum with vitamin C"
print(f"üíÑ Domain-specific query (Beauty only): '{query}'")

q_emb = model.encode([query], normalize_embeddings=True).astype("float32")
scores, local_ids = index_beauty.search(q_emb, 3)

# Map back to original indices
global_ids = beauty_indices[local_ids[0]]

print(f"üìà Top 3 beauty results (scores: {scores[0]})")
results = table.iloc[global_ids][["title", "price", "average_rating", "domain", "text"]]
results

💄 Domain-specific query (Beauty only): 'anti-aging serum with vitamin C'
📈 Top 3 beauty results (scores: [0.80849946 0.8013123  0.80018973])


| | title | price | average_rating | domain | text |
|---|---|---|---|---|---|
| 838 | True Botanix Anti Aging Vitamin C Face Serum 3... | 18.0 | 4.9 | Beauty | Title: True Botanix Anti Aging Vitamin C Face ... |
| 3710 | Retinol Plus Anti Aging Day cream with Retinol... | 10.5 | 4.4 | Beauty | Title: Retinol Plus Anti Aging Day cream with ... |
| 21784 | Retinol Serum Advanced Formula with Vitamin C,... | 26.99 | 3.3 | Beauty | Title: Retinol Serum Advanced Formula with Vit... |


---

## 🔮 **Summary**

The vector store built in this notebook enables several applications:

* **Semantic Search**: Find products based on meaning rather than keywords
* **Domain-Filtered Retrieval**: Search within specific product categories
* **Hybrid Search**: Combine semantic search with metadata filtering (price, rating)
* **RAG Applications**: Use retrieved products as context for LLM-based recommendations
* **Similar Product Recommendations**: Find similar items based on embedding similarity

The built indices are now ready for integration with your retrieval system and can be used in downstream applications like web APIs, recommendation engines, or chat interfaces.
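
As an illustration of the hybrid search idea, the sketch below post-filters retrieved hits on the `price` and `average_rating` columns stored with each chunk. `filter_hits` is a hypothetical helper, and the hits are made-up records rather than real query results.

```python
from typing import Optional


def filter_hits(hits: list[dict], max_price: Optional[float] = None, min_rating: Optional[float] = None) -> list[dict]:
    """Keep hits that satisfy the optional price and rating constraints."""
    kept = []
    for hit in hits:
        if max_price is not None and hit["price"] > max_price:
            continue
        if min_rating is not None and hit["average_rating"] < min_rating:
            continue
        kept.append(hit)
    return kept


# Illustrative hits, already sorted by similarity score.
hits = [
    {"title": "Item A", "price": 83.95, "average_rating": 5.0, "score": 0.75},
    {"title": "Item B", "price": 29.99, "average_rating": 4.3, "score": 0.74},
    {"title": "Item C", "price": 8.99, "average_rating": 3.9, "score": 0.73},
]

affordable = filter_hits(hits, max_price=30.0, min_rating=4.0)
print([h["title"] for h in affordable])  # ['Item B']
```

Because filtering happens after retrieval, it can leave fewer than `k` results; requesting more candidates from the index than are finally needed is one way to compensate.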

---

## **Conclusion**

We've successfully transformed our cleaned product data into a powerful vector search system. The combination of thoughtful chunking, high-quality embeddings, and efficient FAISS indices creates a foundation for intelligent product discovery and recommendation.

The system can now understand nuanced queries like "affordable skincare for sensitive skin" or "wireless earbuds with long battery life" and return relevant products based on semantic similarity rather than just keyword matching.

In the next notebook, I'll walk through how I set up the RAG system and implement it with a local LLM (Mistral-7B-Instruct-v0.2).

---

**End of Notebook**