# Session 3.2: BakeryAI - Embeddings & Vector Stores



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1kYFKkdpVJ5imP048DXe92wZwlpJ1fy3x?usp=sharing)

## 🎯 Today's Goal

Transform our processed documents into **searchable knowledge** using vector embeddings!

### What Are Embeddings?

Embeddings convert text into numbers (vectors) that capture semantic meaning:

```
"chocolate cake" → [0.23, -0.45, 0.89, ...] (3072 dimensions)
"cocoa dessert"  → [0.25, -0.43, 0.87, ...] (similar vector!)
"car repair"     → [-0.67, 0.12, -0.34, ...] (very different)
```

**Similar meaning = Similar vectors** = Easy to search!

### What Are Vector Stores?

Databases optimized for storing and searching vectors:

- **FAISS**: Facebook's fast similarity search (in-memory)
- **Chroma**: Easy persistent storage
- **Qdrant**: Production-ready with advanced features
- **Pinecone**: Managed cloud service
- **Weaviate**: GraphQL + vectors

### Today's Pipeline:

```
Processed Chunks (from 3.1)
      ↓
[Generate Embeddings]
      ↓
[Store in Vector DB]
      ↓
[Semantic Search Ready!]
```
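
To make the three stages concrete before bringing in real models, here is a minimal sketch of the same pipeline in plain Python. The `toy_embed` function and the `TinyVectorStore` class are hypothetical stand-ins, not part of LangChain. The toy embedding only counts words, so it matches shared vocabulary, whereas a real embedding model captures meaning.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a sparse word-count vector (NOT semantic)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class TinyVectorStore:
    """Keeps (vector, text) pairs in a list and searches them by brute force."""

    def __init__(self):
        self.entries = []

    def add(self, texts):
        # Stages 1 and 2: embed each chunk, then store the vector with its text
        for text in texts:
            self.entries.append((toy_embed(text), text))

    def search(self, query, k=1):
        # Stage 3: embed the query and rank stored chunks by similarity
        query_vec = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = TinyVectorStore()
store.add([
    "refund policy: full refunds within 24 hours of order",
    "chocolate truffle cake with rich chocolate filling",
    "hygiene procedures: wash hands before handling food",
])
top_hit = store.search("chocolate cake", k=1)[0]
print(top_hit)
```

The rest of this session swaps `toy_embed` for an OpenAI embedding model and `TinyVectorStore` for FAISS and Chroma, but the add-then-search shape stays the same.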

Let's build it! 🚀

In [1]:
!pip install -q langchain langchain-openai langchain-community
!pip install -q faiss-cpu chromadb qdrant-client
!pip install -q tiktoken python-dotenv


In [2]:
!git clone https://github.com/IvanReznikov/mdx-langchain-conclave

Cloning into 'mdx-langchain-conclave'...
remote: Enumerating objects: 31, done.
remote: Counting objects: 100% (31/31), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 31 (delta 7), reused 27 (delta 3), pack-reused 0 (from 0)
Receiving objects: 100% (31/31), 261.79 KiB | 16.36 MiB/s, done.
Resolving deltas: 100% (7/7), done.


In [3]:
import pickle

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS, Chroma
from langchain_core.documents import Document

import os
from google.colab import userdata

# Set the OpenAI API key from Colab's secret store, falling back to a default
def set_openai_api_key(default_key: str = "YOUR_API_KEY") -> None:
    """Read MDX_OPENAI_API_KEY from Colab secrets; fall back to default_key if it is unavailable."""
    try:
        os.environ["OPENAI_API_KEY"] = userdata.get("MDX_OPENAI_API_KEY")
    except Exception:
        # Secret not defined, or the notebook has not been granted access to it
        os.environ["OPENAI_API_KEY"] = default_key

set_openai_api_key()
#set_openai_api_key("sk-...")

llm = ChatOpenAI(model="gpt-5-nano")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

print("✅ Environment ready!")
print("Embedding model: text-embedding-3-large (3072 dimensions)")

✅ Environment ready!
Embedding model: text-embedding-3-large (3072 dimensions)


## 1. Load Processed Documents from Session 3.1

In [4]:
# Load the chunks we processed in Session 3.1
try:
    with open('/content/mdx-langchain-conclave/rag_artifacts/bakery_knowledge_base.pkl', 'rb') as f:
        chunks = pickle.load(f)
    print(f"✅ Loaded {len(chunks)} chunks from previous session")
except FileNotFoundError:
    print("⚠️  Creating sample chunks for demo...")
    # Create sample documents if file not found
    chunks = [
        Document(
            page_content="Our refund policy: Full refunds within 24 hours of order. After that, store credit is offered.",
            metadata={"source": "policy.txt", "category": "policy"}
        ),
        Document(
            page_content="Chocolate Truffle Cake: Rich chocolate with Belgian truffle filling. Price $45. Serves 8-10. Allergens: dairy, eggs, gluten.",
            metadata={"source": "cakes.pdf", "category": "product"}
        ),
        Document(
            page_content="Hygiene procedures: Wash hands thoroughly before handling food. Use gloves when appropriate. Clean surfaces regularly.",
            metadata={"source": "sop_hygiene.txt", "category": "safety"}
        ),
        Document(
            page_content="Customer service standards: Respond within 2 hours during business hours. Always be professional and courteous.",
            metadata={"source": "policy.txt", "category": "policy"}
        ),
        Document(
            page_content="Red Velvet Cake: Velvety red sponge with cream cheese frosting. Price $50. Serves 10-12. Contains dairy, eggs, gluten.",
            metadata={"source": "cakes.pdf", "category": "product"}
        )
    ]

print(f"\n📚 Knowledge Base: {len(chunks)} chunks ready for embedding")

✅ Loaded 101 chunks from previous session

📚 Knowledge Base: 101 chunks ready for embedding


## 2. Understanding Embeddings

Let's see what embeddings look like.

In [5]:
# Generate embedding for a sample text
sample_text = "chocolate cake with rich frosting"
sample_embedding = embeddings.embed_query(sample_text)

print(f"📊 EMBEDDING ANALYSIS")
print("=" * 70)
print(f"Text: '{sample_text}'")
print(f"\nEmbedding dimensions: {len(sample_embedding)}")
print(f"First 10 values: {sample_embedding[:10]}")
print(f"\nVector characteristics:")
print(f"  - Min value: {min(sample_embedding):.4f}")
print(f"  - Max value: {max(sample_embedding):.4f}")
print(f"  - Mean: {sum(sample_embedding)/len(sample_embedding):.4f}")

📊 EMBEDDING ANALYSIS
Text: 'chocolate cake with rich frosting'

Embedding dimensions: 3072
First 10 values: [-0.00021281850058585405, -0.0380740761756897, -0.014900933019816875, -0.015393978916108608, -0.0002764821401797235, 0.026994245126843452, -0.0016777245327830315, 0.009361018426716328, -0.03486927971243858, 0.02941838651895523]

Vector characteristics:
  - Min value: -0.1705
  - Max value: 0.0740
  - Mean: -0.0005


In [6]:
# Compare similar vs different texts
import numpy as np

def cosine_similarity(v1, v2):
    """Calculate cosine similarity between two vectors"""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Test texts
text1 = "chocolate cake"
text2 = "cocoa dessert"  # Similar
text3 = "car repair"      # Different

emb1 = embeddings.embed_query(text1)
emb2 = embeddings.embed_query(text2)
emb3 = embeddings.embed_query(text3)

sim_1_2 = cosine_similarity(emb1, emb2)
sim_1_3 = cosine_similarity(emb1, emb3)

print("\n🔍 SEMANTIC SIMILARITY TEST")
print("=" * 70)
print(f"Text 1: '{text1}'")
print(f"Text 2: '{text2}'")
print(f"Similarity: {sim_1_2:.4f} ✨ (High - similar meaning)\n")

print(f"Text 1: '{text1}'")
print(f"Text 3: '{text3}'")
print(f"Similarity: {sim_1_3:.4f} (Low - different meaning)")

print("\n✅ Similar meanings = Higher similarity scores!")


🔍 SEMANTIC SIMILARITY TEST
Text 1: 'chocolate cake'
Text 2: 'cocoa dessert'
Similarity: 0.5879 ✨ (High - similar meaning)

Text 1: 'chocolate cake'
Text 3: 'car repair'
Similarity: 0.1916 (Low - different meaning)

✅ Similar meanings = Higher similarity scores!


## 3. Building a FAISS Vector Store

FAISS is fast and perfect for development/testing.

In [7]:
print("🔧 Building FAISS vector store...\n")
print(f"Processing {len(chunks)} chunks...")

# Create FAISS vector store
import time
start_time = time.time()

faiss_store = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings
)

elapsed = time.time() - start_time

print(f"\n✅ FAISS store created!")
print(f"   Time taken: {elapsed:.2f} seconds")
print(f"   Documents indexed: {len(chunks)}")
print(f"   Embedding dimensions: {faiss_store.index.d}")

🔧 Building FAISS vector store...

Processing 101 chunks...

✅ FAISS store created!
   Time taken: 1.51 seconds
   Documents indexed: 101
   Embedding dimensions: 3072


In [8]:
# Test similarity search
query = "What is the refund policy?"

print(f"\n🔍 Searching for: '{query}'\n")
print("=" * 70)

# Search for top 3 most similar chunks
results = faiss_store.similarity_search(query, k=3)

for i, doc in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(f"Content: {doc.page_content[:150]}...")
    print(f"Source: {doc.metadata.get('source', 'unknown')}")
    print(f"Category: {doc.metadata.get('category', 'unknown')}")

print("\n" + "=" * 70)
print("✅ Most relevant chunks retrieved!")


🔍 Searching for: 'What is the refund policy?'


Result 1:
Content: Delivery Issues:  Track order status immediately  Communicate transparently about delays  If late 30 minutes: waive delivery fee  If wrong address (ou...
Source: /content/mdx-langchain-conclave/data/Customer_Service_Policy.txt
Category: policy

Result 2:
Content: ELIGIBLE FOR REFUND:  Change of mind after custom order production begins  Allergic reactions (if allergens were disclosed)  Products purchased more t...
Source: /content/mdx-langchain-conclave/data/Customer_Service_Policy.txt
Category: policy

Result 3:
Content: samples  Apply up to 10 courtesy discount  Waive small fees (AED 25)  Make judgment calls to ensure customer satisfaction REQUIRES MANAGER APPROVAL:  ...
Source: /content/mdx-langchain-conclave/data/Customer_Service_Policy.txt
Category: policy

✅ Most relevant chunks retrieved!


In [9]:
# Search with similarity scores
query = "Tell me about cakes for kids"

print(f"\n🔍 Searching with scores: '{query}'\n")
print("=" * 70)

results_with_scores = faiss_store.similarity_search_with_score(query, k=3)

for i, (doc, score) in enumerate(results_with_scores, 1):
    print(f"\n📊 Result {i} (Score: {score:.4f}):")
    print(f"   {doc.page_content[:100]}...")
    print(f"   Category: {doc.metadata.get('category', 'unknown')}")

print("\n💡 Lower scores = More similar (FAISS uses L2 distance)")


🔍 Searching with scores: 'Tell me about cakes for kids'


📊 Result 1 (Score: 0.9135):
   . Its shaped like a winners cup and topped with a sugar medal that reads 1. This cake is not just de...
   Category: product

📊 Result 2 (Score: 0.9213):
   . Its shaped like a winners cup and topped with a sugar medal that reads 1.  This cake is not just d...
   Category: product

📊 Result 3 (Score: 0.9593):
   . Its cool, smooth, and indulgenta summer classic and a no-bake marvel. You cant go wrong with chees...
   Category: product

💡 Lower scores = More similar (FAISS uses L2 distance)
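
These distance scores can be related back to the cosine similarities from Section 2. A minimal sketch, assuming two things that are worth verifying in your own environment: the embeddings are unit-length (OpenAI documents its embeddings as normalized), and the default FAISS flat index reports squared L2 distance. Under those assumptions, squared distance = 2 - 2 * cosine, so a score converts directly. The helper name `l2_squared_to_cosine` is hypothetical.

```python
import math

def l2_squared_to_cosine(score: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - score / 2.0

# Check the identity on two hand-made unit vectors
a = [math.cos(0.3), math.sin(0.3)]
b = [math.cos(1.1), math.sin(1.1)]
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))
assert abs(l2_squared_to_cosine(squared_l2) - cosine) < 1e-12

# Applied to the top score reported above (0.9135)
print(f"{l2_squared_to_cosine(0.9135):.2f}")
```

Read this way, the best match for the "cakes for kids" query has a cosine similarity of roughly 0.54, which is in the same range as the "chocolate cake" vs "cocoa dessert" pair from Section 2.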


In [10]:
# Search with similarity scores
query = "Tell me about sport deserts"

print(f"\n🔍 Searching with scores: '{query}'\n")
print("=" * 70)

results_with_scores = faiss_store.similarity_search_with_score(query, k=5)

for i, (doc, score) in enumerate(results_with_scores, 1):
    print(f"\n📊 Result {i} (Score: {score:.4f}):")
    print(f"   {doc.page_content[:100]}...")
    print(f"   Category: {doc.metadata.get('category', 'unknown')}")

print("\n💡 Lower scores = More similar (FAISS uses L2 distance)")


🔍 Searching with scores: 'Tell me about sport deserts'


📊 Result 1 (Score: 1.4557):
   . This cake delivers a sugar and energy rush fitting for a race day celebration. Its an ideal gift f...
   Category: product

📊 Result 2 (Score: 1.4562):
   . This cake delivers a sugar and energy rush fitting for a race day celebration. Its an ideal gift f...
   Category: product

📊 Result 3 (Score: 1.4602):
   Topped with a tiny chocolate football and piped vanilla frosting shaped like goal nets, its a treat ...
   Category: product

📊 Result 4 (Score: 1.5032):
   . Whether your team wins or loses, youll feel victorious with every bite. This mid-sized cake is gre...
   Category: product

📊 Result 5 (Score: 1.6053):
   . Its shaped like a winners cup and topped with a sugar medal that reads 1.  This cake is not just d...
   Category: product

💡 Lower scores = More similar (FAISS uses L2 distance)


## 4. Saving and Loading FAISS Index

In [11]:
# Save FAISS index locally
faiss_store.save_local("bakery_faiss_index")
print("✅ FAISS index saved to: bakery_faiss_index/")

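# Load it back. The document store is saved as a pickle file, so LangChain
# requires an explicit opt-in; only enable it for index files you created yourself.
loaded_faiss = FAISS.load_local(
    "bakery_faiss_index",
    embeddings,
    allow_dangerous_deserialization=True
)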

print("✅ FAISS index loaded successfully!")

# Test loaded index
test_results = loaded_faiss.similarity_search("hygiene procedures", k=2)
print(f"\n✅ Test search found {len(test_results)} results")

✅ FAISS index saved to: bakery_faiss_index/
✅ FAISS index loaded successfully!

✅ Test search found 2 results


## 5. Building a Chroma Vector Store

Chroma persists the collection to disk, so it can be reopened in a later session without re-embedding the chunks.

In [12]:
from langchain_community.vectorstores.utils import filter_complex_metadata

print("🔧 Building Chroma vector store...\n")

# Filter out complex metadata (lists, dicts, etc.) before adding to Chroma
filtered_chunks = filter_complex_metadata(
    chunks,
    allowed_types=(str, bool, int, float)
)

# Create Chroma with persistence using filtered documents
chroma_store = Chroma.from_documents(
    documents=filtered_chunks,
    embedding=embeddings,
    persist_directory="./bakery_chroma_db",
    collection_name="bakery_knowledge"
)

print("✅ Chroma store created!")
print(f"   Persisted to: ./bakery_chroma_db")
print(f"   Collection: bakery_knowledge")

🔧 Building Chroma vector store...

✅ Chroma store created!
   Persisted to: ./bakery_chroma_db
   Collection: bakery_knowledge
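
Chroma expects simple scalar values in metadata, which is why the cell above filters the chunks first. As a rough sketch of what that filtering amounts to (a plain-Python illustration with a hypothetical `keep_simple_metadata` helper and made-up `allergens` and `layout` keys, not LangChain's actual implementation), entries whose values are not of an allowed type are dropped and the rest are kept:

```python
def keep_simple_metadata(metadata: dict, allowed_types=(str, bool, int, float)) -> dict:
    """Return a copy of metadata without values such as lists or dicts."""
    return {
        key: value
        for key, value in metadata.items()
        if isinstance(value, allowed_types)
    }

raw = {
    "source": "cakes.pdf",
    "page": 2,
    "category": "product",
    "allergens": ["dairy", "eggs"],  # list value: dropped
    "layout": {"columns": 2},        # dict value: dropped
}
clean = keep_simple_metadata(raw)
print(clean)
```

Anything dropped here is no longer available for filtering later, so a value you want to filter on should be stored as a string or number before this step.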


In [13]:
# Test Chroma search
query = "Tell me about cakes for kids"

print(f"\n🔍 Chroma Search: '{query}'\n")
print("=" * 70)

chroma_results = chroma_store.similarity_search(query, k=3)

for i, doc in enumerate(chroma_results, 1):
    print(f"\nResult {i}:")
    print(f"Content: {doc.page_content[:120]}...")
    print(f"Metadata: {doc.metadata}")


🔍 Chroma Search: 'Tell me about cakes for kids'


Result 1:
Content: . Its shaped like a winners cup and topped with a sugar medal that reads 1. This cake is not just deliciousits a showsto...
Metadata: {'category': 'product', 'source': '/content/mdx-langchain-conclave/data/cakes.docx', 'chunk_id': 78, 'chunk_size': 744}

Result 2:
Content: . Its shaped like a winners cup and topped with a sugar medal that reads 1.  This cake is not just deliciousits a showst...
Metadata: {'total_pages': 8, 'category': 'product', 'moddate': '2025-10-22T17:26:58+04:00', 'page_label': '3', 'chunk_id': 47, 'author': 'Ivan Reznikov', 'page': 2, 'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2025-10-22T17:26:58+04:00', 'chunk_size': 745, 'source': '/content/mdx-langchain-conclave/data/cakes.pdf'}

Result 3:
Content: . Its cool, smooth, and indulgenta summer classic and a no-bake marvel. You cant go wrong with cheesecakeit pleases ever...
Me

In [14]:
# Loading existing Chroma database
existing_chroma = Chroma(
    persist_directory="./bakery_chroma_db",
    embedding_function=embeddings,
    collection_name="bakery_knowledge"
)

print("✅ Loaded existing Chroma database")
print(f"   Documents in collection: {existing_chroma._collection.count()}")

✅ Loaded existing Chroma database
   Documents in collection: 101




## 6. Advanced Search: Metadata Filtering

In [15]:
# Search only within specific category
query = "What are the requirements?"

print("🔍 FILTERED SEARCH DEMO")
print("=" * 70)

# Search only policy documents
print("\n1️⃣ Search in POLICY documents only:")
policy_results = chroma_store.similarity_search(
    query,
    k=2,
    filter={"category": "policy"}
)

for doc in policy_results:
    print(f"   - {doc.page_content[:80]}...")
    print(f"     Category: {doc.metadata.get('category')}\n")

# Search only product documents
print("2️⃣ Search in PRODUCT documents only:")
product_results = chroma_store.similarity_search(
    query,
    k=2,
    filter={"category": "product"}
)

for doc in product_results:
    print(f"   - {doc.page_content[:80]}...")
    print(f"     Category: {doc.metadata.get('category')}\n")

print("✅ Metadata filtering allows targeted searches!")

🔍 FILTERED SEARCH DEMO

1️⃣ Search in POLICY documents only:
   - samples  Apply up to 10 courtesy discount  Waive small fees (AED 25)  Make judgm...
     Category: policy

   - ACCESSIBILITY:  Assist customers with disabilities  Ensure wheelchair access to ...
     Category: policy

2️⃣ Search in PRODUCT documents only:
   - Cupcakes 1. Midnight Mocha Cupcake A decadent dark chocolate cupcake infused wit...
     Category: product

   - . Allergens: dairy, gluten, eggs. Store chilled.  14. Red Velvet Cake (R) Known ...
     Category: product

✅ Metadata filtering allows targeted searches!
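
Conceptually, a filtered search restricts the candidate set by metadata and then ranks what remains by similarity. A minimal sketch with made-up records and precomputed similarity scores (the `filtered_top_k` helper and the numbers are illustrative assumptions, not Chroma internals):

```python
def filtered_top_k(records, metadata_filter, k=2):
    """Keep records whose metadata matches every filter key, then rank by similarity."""
    matches = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    return sorted(matches, key=lambda r: r["similarity"], reverse=True)[:k]

records = [
    {"text": "Refund rules", "similarity": 0.62, "metadata": {"category": "policy"}},
    {"text": "Red Velvet Cake", "similarity": 0.71, "metadata": {"category": "product"}},
    {"text": "Accessibility", "similarity": 0.40, "metadata": {"category": "policy"}},
    {"text": "Hand washing", "similarity": 0.55, "metadata": {"category": "safety"}},
]
policy_hits = filtered_top_k(records, {"category": "policy"})
print([r["text"] for r in policy_hits])
```

Note that the highest-scoring record overall is a product chunk, yet it never appears in the policy results. This is also why a filtered search can return weak matches: it returns the best of what the filter allows, even when nothing in that category is a good answer.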


## 7. MMR (Maximal Marginal Relevance) Search

MMR balances relevance with diversity to avoid returning similar duplicates.

In [16]:
query = "Tell me about cakes for kids"

print("🔍 COMPARING: Similarity vs MMR Search\n")
print("=" * 70)

# Regular similarity search
print("\n1️⃣ SIMILARITY SEARCH (may return similar results):")
sim_results = faiss_store.similarity_search(query, k=3)
for i, doc in enumerate(sim_results, 1):
    print(f"   {i}. {doc.page_content[:60]}...")

# MMR search (diverse results)
print("\n2️⃣ MMR SEARCH (diverse results):")
mmr_results = faiss_store.max_marginal_relevance_search(query, k=3, fetch_k=10)
for i, doc in enumerate(mmr_results, 1):
    print(f"   {i}. {doc.page_content[:60]}...")

print("\n💡 MMR ensures diversity in search results!")

🔍 COMPARING: Similarity vs MMR Search


1️⃣ SIMILARITY SEARCH (may return similar results):
   1. . Its shaped like a winners cup and topped with a sugar meda...
   2. . Its shaped like a winners cup and topped with a sugar meda...
   3. . Its cool, smooth, and indulgenta summer classic and a no-b...

2️⃣ MMR SEARCH (diverse results):
   1. . Its shaped like a winners cup and topped with a sugar meda...
   2. . Simple, airy, and beloved by all age groups. Its popular f...
   3. . Layers of vanilla sponge are soaked in strawberry syrup an...

💡 MMR ensures diversity in search results!
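
To see why MMR skips the near-duplicate chunk, here is a minimal sketch of the selection loop in plain Python. It is an illustration of the general technique rather than LangChain's code: each step picks the candidate with the best trade-off between relevance to the query and similarity to what has already been selected. The `mmr_select` and `unit` helpers and the 2-D vectors are hypothetical; `lambda_mult` mirrors the name of LangChain's trade-off parameter.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))

    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy: similarity to the closest already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)

    return selected

def unit(degrees):
    """2-D unit vector at the given angle, used as a toy embedding."""
    return [math.cos(math.radians(degrees)), math.sin(math.radians(degrees))]

query_vec = unit(0)
doc_vecs = [unit(10), unit(12), unit(-40)]  # docs 0 and 1 are near-duplicates
picked = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5)
print(picked)
```

A plain similarity ranking would return documents 0 and 1 here. MMR returns 0 and 2 because document 1 adds almost nothing beyond document 0. Raising `lambda_mult` toward 1 weights relevance more heavily and brings the near-duplicate back.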


## 8. Vector Store Comparison

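Compare query latency for FAISS and Chroma. Each `similarity_search` call first embeds the query through the OpenAI API, so the timings below are dominated by that network round trip rather than by the vector lookup itself.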

In [17]:
import time

test_queries = [
    "What is our refund policy?",
    "Tell me about chocolate cake",
    "Hygiene procedures for food handling"
]

print("⚡ VECTOR STORE PERFORMANCE COMPARISON")
print("=" * 70)

# Test FAISS
print("\n1️⃣ FAISS Performance:")
faiss_times = []
for query in test_queries:
    start = time.time()
    results = faiss_store.similarity_search(query, k=3)
    elapsed = time.time() - start
    faiss_times.append(elapsed)
    print(f"   Query: '{query[:30]}...' - {elapsed*1000:.2f}ms")

avg_faiss = sum(faiss_times) / len(faiss_times) * 1000
print(f"   Average: {avg_faiss:.2f}ms")

# Test Chroma
print("\n2️⃣ Chroma Performance:")
chroma_times = []
for query in test_queries:
    start = time.time()
    results = chroma_store.similarity_search(query, k=3)
    elapsed = time.time() - start
    chroma_times.append(elapsed)
    print(f"   Query: '{query[:30]}...' - {elapsed*1000:.2f}ms")

avg_chroma = sum(chroma_times) / len(chroma_times) * 1000
print(f"   Average: {avg_chroma:.2f}ms")

print("\n📊 Summary:")
print(f"   FAISS: {avg_faiss:.2f}ms (in-memory, fast)")
print(f"   Chroma: {avg_chroma:.2f}ms (persistent, feature-rich)")

⚡ VECTOR STORE PERFORMANCE COMPARISON

1️⃣ FAISS Performance:
   Query: 'What is our refund policy?...' - 246.44ms
   Query: 'Tell me about chocolate cake...' - 311.83ms
   Query: 'Hygiene procedures for food ha...' - 301.14ms
   Average: 286.47ms

2️⃣ Chroma Performance:
   Query: 'What is our refund policy?...' - 211.89ms
   Query: 'Tell me about chocolate cake...' - 148.68ms
   Query: 'Hygiene procedures for food ha...' - 437.89ms
   Average: 266.15ms

📊 Summary:
   FAISS: 286.47ms (in-memory, fast)
   Chroma: 266.15ms (persistent, feature-rich)
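
Because `similarity_search` embeds the query on every call, the timings above include an OpenAI API round trip. One way to isolate the lookup itself, sketched here on the assumption that both LangChain wrappers expose `similarity_search_by_vector`, is to embed the query once and time only the vector search. The `time_call` helper is hypothetical, and the notebook-specific lines are left as comments because they depend on objects from earlier cells.

```python
import time

def time_call(fn, repeats=5):
    """Run fn() several times and return the average wall-clock time in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations) * 1000

# Stand-in workload so the sketch runs on its own
demo_ms = time_call(lambda: sum(range(10_000)))
print(f"demo workload: {demo_ms:.3f}ms")

# In the notebook (uses embeddings, faiss_store and chroma_store from earlier cells):
# query_vec = embeddings.embed_query("What is our refund policy?")
# faiss_ms = time_call(lambda: faiss_store.similarity_search_by_vector(query_vec, k=3))
# chroma_ms = time_call(lambda: chroma_store.similarity_search_by_vector(query_vec, k=3))
```

Averaging over several repeats also smooths out one-off spikes like the 437.89ms Chroma query above.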


## 9. Building a BakeryAI Retriever

Wrap the vector store in a retriever interface.

In [18]:
# Create retriever from vector store
retriever = faiss_store.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 3,           # Return top 3 results
        "fetch_k": 10     # Fetch 10 for MMR diversity
    }
)

print("✅ BakeryAI Retriever Created!")
print("\nConfiguration:")
print("   - Search Type: MMR (diverse results)")
print("   - Top K: 3")
print("   - Fetch K: 10")

✅ BakeryAI Retriever Created!

Configuration:
   - Search Type: MMR (diverse results)
   - Top K: 3
   - Fetch K: 10


In [19]:
# Test retriever
query = "What allergens are in our products?"

print(f"\n🔍 Retriever Test: '{query}'\n")
print("=" * 70)

retrieved_docs = retriever.invoke(query)

print(f"\n✅ Retrieved {len(retrieved_docs)} documents:\n")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"{i}. {doc.page_content[:100]}...")
    print(f"   Source: {doc.metadata.get('source', 'unknown')}\n")


🔍 Retriever Test: 'What allergens are in our products?'


✅ Retrieved 3 documents:

1. . PRODUCT KNOWLEDGE  Know ingredients, allergens, and preparation methods for all items  Provide acc...
   Source: /content/mdx-langchain-conclave/data/Customer_Service_Policy.txt

2. . Allergens: dairy, gluten, eggs. Store chilled.  14. Red Velvet Cake (R) Known for its stunning red...
   Source: /content/mdx-langchain-conclave/data/cakes.docx

3. . CROSS-CONTAMINATION PREVENTION  Use separate cutting boards for different allergens  Color-coded s...
   Source: /content/mdx-langchain-conclave/data/SOP_Hygiene_Food_Safety.txt



## 10. Category-Specific Retrievers

Create specialized retrievers for different types of queries.

In [20]:
# Policy retriever
policy_retriever = chroma_store.as_retriever(
    search_kwargs={
        "k": 3,
        "filter": {"category": "policy"}
    }
)

# Product retriever
product_retriever = chroma_store.as_retriever(
    search_kwargs={
        "k": 3,
        "filter": {"category": "product"}
    }
)

# Safety retriever
safety_retriever = chroma_store.as_retriever(
    search_kwargs={
        "k": 3,
        "filter": {"category": "safety"}
    }
)

print("✅ Created specialized retrievers:")
print("   - Policy Retriever (customer service policies)")
print("   - Product Retriever (cake catalogs)")
print("   - Safety Retriever (hygiene & food safety)")

# Test each
print("\n🧪 Testing specialized retrievers:\n")

print("1️⃣ Policy Query: 'What is our refund policy?'")
policy_docs = policy_retriever.invoke("What is our refund policy?")
print(f"   Found {len(policy_docs)} policy documents\n")

print("2️⃣ Product Query: 'Tell me about chocolate cakes'")
product_docs = product_retriever.invoke("Tell me about chocolate cakes")
print(f"   Found {len(product_docs)} product documents\n")

print("3️⃣ Safety Query: 'What are hygiene procedures?'")
safety_docs = safety_retriever.invoke("What are hygiene procedures?")
print(f"   Found {len(safety_docs)} safety documents")

✅ Created specialized retrievers:
   - Policy Retriever (customer service policies)
   - Product Retriever (cake catalogs)
   - Safety Retriever (hygiene & food safety)

🧪 Testing specialized retrievers:

1️⃣ Policy Query: 'What is our refund policy?'
   Found 3 policy documents

2️⃣ Product Query: 'Tell me about chocolate cakes'
   Found 3 product documents

3️⃣ Safety Query: 'What are hygiene procedures?'
   Found 3 safety documents


## 🎯 Exercise 3: Build a Hybrid Search System

**Task**: Implement hybrid search combining:
1. Dense retrieval (embeddings)
2. Sparse retrieval (BM25/keyword search)
3. Combine and rerank results

In [21]:
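# BM25Retriever lives in langchain_community and needs the rank_bm25 package:
# !pip install -q rank_bm25
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# TODO: Create BM25 retriever for keyword search
# TODO: Combine with vector retriever
# TODO: Test on queries where keywords matter

# bm25_retriever = BM25Retriever.from_documents(chunks)
# faiss_retriever = faiss_store.as_retriever(search_kwargs={"k": 3})
# ensemble_retriever = EnsembleRetriever(
#     retrievers=[faiss_retriever, bm25_retriever],
#     weights=[0.5, 0.5]
# )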
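
For the "combine and rerank" step, LangChain's `EnsembleRetriever` merges the ranked lists with weighted Reciprocal Rank Fusion. The sketch below shows the idea on plain lists of document IDs. The function name, the sample IDs, and the constant `c=60` (a commonly used default) are illustrative assumptions rather than a copy of the library code.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings: each list adds weight / (c + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["a", "b", "c"]   # e.g. from the embedding retriever
sparse_ranking = ["c", "a", "d"]  # e.g. from BM25
fused = weighted_rrf([dense_ranking, sparse_ranking], weights=[0.5, 0.5])
print(fused)
```

Documents that appear in both lists ("a" and "c") rise to the top, which is the behavior you want when a query mixes exact keywords with a looser description. Only ranks are used, so the dense and sparse scores never need to be on the same scale.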

## 🎯 Exercise 4: Implement Custom Similarity Scoring

**Task**: Create a custom scoring function that:
1. Weights recent documents higher
2. Boosts results from specific categories
3. Penalizes very short chunks

In [22]:
def custom_score(doc, query_embedding, base_similarity):
    """Custom scoring with business logic"""
    score = base_similarity

    # TODO: Add recency boost
    # TODO: Add category boost
    # TODO: Add length penalty

    return score

# Implement and test your custom scorer
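
One possible starting point, offered as a sketch rather than a reference solution: treat the base score as a similarity (higher is better, unlike the FAISS distances earlier) and multiply it by three factors. Everything numeric here is a hypothetical choice to tune: the 180-day half-life, the 1.2 policy boost, and the 200-character minimum. The signature also differs from the skeleton above in that it takes the metadata dict and chunk length directly, so the example runs without LangChain. It reads the `moddate` field that the PDF chunks carry.

```python
from datetime import datetime, timezone

CATEGORY_BOOST = {"policy": 1.2}  # hypothetical: favor policy chunks

def custom_score(metadata, base_similarity, chunk_length, now,
                 half_life_days=180.0, min_length=200):
    """Rescale a similarity score (higher = better) with simple business rules."""
    score = base_similarity

    # Recency: the factor decays from 1.0 toward 0.5 as the document ages;
    # chunks without a modification date are left unchanged
    moddate = metadata.get("moddate")
    if moddate:
        age_days = max((now - datetime.fromisoformat(moddate)).days, 0)
        score *= 0.5 + 0.5 * 0.5 ** (age_days / half_life_days)

    # Category boost: multiply by a per-category factor (1.0 if not listed)
    score *= CATEGORY_BOOST.get(metadata.get("category"), 1.0)

    # Length penalty: scale down chunks shorter than min_length characters
    if chunk_length < min_length:
        score *= chunk_length / min_length

    return score

now = datetime(2025, 11, 1, tzinfo=timezone.utc)
fresh_policy = custom_score(
    {"category": "policy", "moddate": "2025-10-22T17:26:58+04:00"}, 0.6, 744, now
)
old_short_product = custom_score(
    {"category": "product", "moddate": "2024-01-01T00:00:00+00:00"}, 0.6, 50, now
)
print(f"fresh policy: {fresh_policy:.3f}, old short product: {old_short_product:.3f}")
```

Both chunks start from the same base similarity of 0.6, so any difference in the printed scores comes purely from the business rules.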

## Summary: What We Built

### ✅ Session 3.2 Achievements:

1. **Embeddings**: Generated vectors for all knowledge base chunks
2. **FAISS**: Fast in-memory vector store
3. **Chroma**: Persistent vector database
4. **Semantic Search**: Find documents by meaning
5. **MMR Search**: Diverse, non-redundant results
6. **Metadata Filtering**: Category-specific searches
7. **Retrievers**: Easy-to-use retrieval interfaces
8. **Specialized Retrievers**: Policy, Product, Safety retrievers

### 🎨 BakeryAI Search Capabilities:

✨ **Semantic Understanding**: "refund" matches "money back" and "return policy"  
✨ **Fast Retrieval**: Queries returned in roughly 150-450 ms on our 101 chunks, most of it spent embedding the query  
✨ **Persistent Storage**: Knowledge base saved for reuse  
✨ **Filtered Search**: Target specific document types  
✨ **Diverse Results**: MMR prevents repetitive answers  

### 📊 Vector Store Decision Guide:

**Use FAISS when:**
- Development/testing
- Need maximum speed
- Data fits in memory

**Use Chroma when:**
- Need persistence
- Want easy setup
- Building prototypes

**Use Qdrant/Pinecone/Weaviate when:**
- Production deployment
- Need scalability
- Multi-tenant applications

### 🚀 Next: Session 3.3

We'll build complete **RAG pipelines**:
- Question-answering chains
- Retrieval strategies and optimization
- Source citation and attribution
- Conversational RAG
- Evaluation and testing