# Notebook 3: Embedding Generation and Vector Database Setup

## Purpose
This notebook generates embeddings for the text chunks produced in Notebook 2 and stores them, together with their metadata, in a persistent ChromaDB collection for semantic search.

## Process
1. Load chunked text data from Notebook 2
2. Initialize embedding model (sentence-transformers)
3. Generate embeddings for all text chunks
4. Set up ChromaDB vector database
5. Store chunks with embeddings and metadata
6. Test similarity search functionality

## Output
- ChromaDB database with embedded chunks
- Embedding statistics and performance metrics
- Sample similarity search results

In [1]:
# Import required libraries
import os
import json
from pathlib import Path
from typing import List, Dict, Any
from tqdm import tqdm
import numpy as np
from dotenv import load_dotenv

# Embedding and vector database
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings

# Utilities
import time
import pandas as pd

In [2]:
# Load environment variables
load_dotenv()

# Configuration: Set up paths and parameters
BASE_DIR = Path(r"d:\AI Book RAG")
CHUNKS_DIR = BASE_DIR / "data" / "chunks"
CHROMA_DIR = BASE_DIR / "chroma_db"

# Create ChromaDB directory
CHROMA_DIR.mkdir(parents=True, exist_ok=True)

# Embedding model configuration
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
BATCH_SIZE = 32  # Process embeddings in batches for efficiency

# ChromaDB collection name
COLLECTION_NAME = "ai_books_collection"

print(f"Chunks Directory: {CHUNKS_DIR}")
print(f"ChromaDB Directory: {CHROMA_DIR}")
print(f"\nConfiguration:")
print(f"  Embedding Model: {EMBEDDING_MODEL}")
print(f"  Batch Size: {BATCH_SIZE}")
print(f"  Collection Name: {COLLECTION_NAME}")

Chunks Directory: d:\AI Book RAG\data\chunks
ChromaDB Directory: d:\AI Book RAG\chroma_db

Configuration:
  Embedding Model: sentence-transformers/all-MiniLM-L6-v2
  Batch Size: 32
  Collection Name: ai_books_collection
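
The hard-coded `BASE_DIR` above ties the notebook to one machine. A minimal sketch of one alternative, assuming a hypothetical `RAG_BASE_DIR` environment variable (not defined anywhere in this project) with a fallback to the current working directory:

```python
import os
from pathlib import Path


def resolve_base_dir(env_var: str = "RAG_BASE_DIR") -> Path:
    """Return the project root from an environment variable, else the current directory.

    RAG_BASE_DIR is a hypothetical variable name chosen for this sketch.
    """
    return Path(os.getenv(env_var) or Path.cwd())


base_dir = resolve_base_dir()
chunks_dir = base_dir / "data" / "chunks"   # same layout as CHUNKS_DIR above
chroma_dir = base_dir / "chroma_db"         # same layout as CHROMA_DIR above
print(chunks_dir)
print(chroma_dir)
```

Because `load_dotenv()` already runs in this cell, such a variable could also live in the project's `.env` file.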


In [3]:
# Load chunked data from Notebook 2
chunks_file = CHUNKS_DIR / "all_chunks_semantic.json"

print(f"Loading chunks from: {chunks_file}")

with open(chunks_file, 'r', encoding='utf-8') as f:
    all_chunks = json.load(f)

print(f"✓ Loaded {len(all_chunks)} chunks")

# Display sample chunk structure
if all_chunks:
    print("\nSample chunk structure:")
    sample_chunk = all_chunks[0]
    for key in sample_chunk.keys():
        if key != 'text':  # Don't print full text
            print(f"  {key}: {sample_chunk[key]}")
        else:
            print(f"  {key}: [text content - {len(sample_chunk[key])} chars]")

Loading chunks from: d:\AI Book RAG\data\chunks\all_chunks_semantic.json
✓ Loaded 10620 chunks

Sample chunk structure:
  chunk_id: 299f007b_p2_sc0
  global_index: 0
  book_title: AI Engineering
  chapter: Introduction
  page_number: 2
  chunk_index: 0
  text: [text content - 394 chars]
  text_with_context: Drawing on her deep expertise, AI Engineering is a comprehensive and
holistic guide to building generative AI applications in production.”
Luke Metz, cocreator of ChatGPT, former research manager at OpenAI
AI Engineering
Foundation models have enabled many new AI use cases while lowering the barriers to entry for
building AI products. “This book of fers a comprehensive, well-structured guide to the essential
aspects of building generative AI systems. A must-read for any professional
looking to scale AI across the enterprise.”
Vittorio Cretella, former global CIO at P&G and Mars
“Chip Huyen gets generative AI. She is a remarkable teacher and writer
whose work has been instrumental in
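
The sample above shows only the first chunk, so it cannot confirm that every chunk carries the fields used later for metadata. A small sketch of a schema check that could run right after loading; the field list mirrors the keys this notebook reads unconditionally, and the demo chunks are illustrative, not real data:

```python
from collections import Counter
from typing import Any, Dict, Iterable, List

# Fields that the metadata cell later in this notebook reads without a guard.
REQUIRED_FIELDS = [
    "chunk_id", "text", "book_title", "chapter", "page_number",
    "chunk_index", "citation", "char_count", "word_count",
]


def count_missing_fields(chunks: Iterable[Dict[str, Any]], required: List[str]) -> Counter:
    """Count, per field name, how many chunks lack that field."""
    missing = Counter()
    for chunk in chunks:
        for field in required:
            if field not in chunk:
                missing[field] += 1
    return missing


# Illustrative input only: the second chunk lacks three fields.
demo_chunks = [
    {"chunk_id": "a", "text": "x", "book_title": "B", "chapter": "C",
     "page_number": 1, "chunk_index": 0, "citation": "B, p. 1",
     "char_count": 1, "word_count": 1},
    {"chunk_id": "b", "text": "y", "book_title": "B", "chapter": "C",
     "page_number": 2, "chunk_index": 1},
]
print(dict(count_missing_fields(demo_chunks, REQUIRED_FIELDS)))
```

Running `count_missing_fields(all_chunks, REQUIRED_FIELDS)` before the 13-minute embedding step would surface schema problems early.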

In [4]:
# Initialize embedding model
print(f"\nInitializing embedding model: {EMBEDDING_MODEL}")
print("This may take a moment on first run (downloading model)...\n")

start_time = time.time()

# Load the sentence transformer model
embedding_model = SentenceTransformer(EMBEDDING_MODEL)

load_time = time.time() - start_time

print(f"✓ Model loaded successfully in {load_time:.2f} seconds")
print(f"  Model dimension: {embedding_model.get_sentence_embedding_dimension()}")
print(f"  Max sequence length: {embedding_model.max_seq_length}")


Initializing embedding model: sentence-transformers/all-MiniLM-L6-v2
This may take a moment on first run (downloading model)...

✓ Model loaded successfully in 3.38 seconds
  Model dimension: 384
  Max sequence length: 256
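
The model reports a maximum sequence length of 256 tokens, and input beyond that limit is truncated before embedding. The sketch below flags chunks that are likely to exceed it. It estimates tokens from whitespace-separated words using a rule-of-thumb ratio, which is an assumption of this sketch; an exact count would need the model's own tokenizer.

```python
from typing import List

MAX_SEQ_LENGTH = 256    # value reported by the model above
TOKENS_PER_WORD = 1.3   # rough rule of thumb, an assumption of this sketch


def likely_truncated(texts: List[str], max_tokens: int = MAX_SEQ_LENGTH) -> List[int]:
    """Return indices of texts whose estimated token count exceeds max_tokens."""
    flagged = []
    for i, text in enumerate(texts):
        estimated_tokens = len(text.split()) * TOKENS_PER_WORD
        if estimated_tokens > max_tokens:
            flagged.append(i)
    return flagged


# Illustrative input: a two-word text and a 300-word text.
demo_texts = ["short chunk", "word " * 300]
print(likely_truncated(demo_texts))
```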


In [5]:
# Test embedding generation with a sample
sample_text = "What is machine learning and how does it work?"
sample_embedding = embedding_model.encode(sample_text)

print(f"\nTest embedding generation:")
print(f"  Input text: '{sample_text}'")
print(f"  Embedding shape: {sample_embedding.shape}")
print(f"  Embedding type: {type(sample_embedding)}")
print(f"  First 5 values: {sample_embedding[:5]}")


Test embedding generation:
  Input text: 'What is machine learning and how does it work?'
  Embedding shape: (384,)
  Embedding type: <class 'numpy.ndarray'>
  First 5 values: [-0.03735023  0.01238793  0.0144394   0.01891053  0.03915351]
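
Each text becomes a 384-dimensional vector, and semantic search compares such vectors. As an illustration of the comparison itself, here is a plain-Python cosine similarity on toy two- and three-dimensional vectors (the inputs are made up for the example):

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]))  # opposite direction
```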


In [6]:
# Initialize ChromaDB client
print(f"\nInitializing ChromaDB client...")

# Create persistent client
chroma_client = chromadb.PersistentClient(
    path=str(CHROMA_DIR),
    settings=Settings(
        anonymized_telemetry=False,
        allow_reset=True
    )
)

print(f"✓ ChromaDB client initialized")
print(f"  Persist directory: {CHROMA_DIR}")

# List existing collections (if any)
existing_collections = chroma_client.list_collections()
print(f"  Existing collections: {len(existing_collections)}")
for col in existing_collections:
    print(f"    - {col.name}")


Initializing ChromaDB client...


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given


✓ ChromaDB client initialized
  Persist directory: d:\AI Book RAG\chroma_db
  Existing collections: 1
    - ai_books_collection


In [7]:
# Create or get collection
print(f"\nSetting up collection: {COLLECTION_NAME}")

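# Delete existing collection if it exists (for clean start)
try:
    chroma_client.delete_collection(name=COLLECTION_NAME)
    print(f"  Deleted existing collection: {COLLECTION_NAME}")
except Exception:  # avoid a bare except; the error type for a missing collection may differ between ChromaDB versions
    print(f"  No existing collection to delete")

# Create a new collection; embeddings are computed in this notebook and passed in explicitly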
collection = chroma_client.create_collection(
    name=COLLECTION_NAME,
    metadata={
        "description": "AI Books RAG Collection",
        "embedding_model": EMBEDDING_MODEL,
        "created_at": time.strftime("%Y-%m-%d %H:%M:%S")
    }
)

print(f"✓ Collection created: {COLLECTION_NAME}")
print(f"  Metadata: {collection.metadata}")


Setting up collection: ai_books_collection


Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


  Deleted existing collection: ai_books_collection
✓ Collection created: ai_books_collection
  Metadata: {'description': 'AI Books RAG Collection', 'embedding_model': 'sentence-transformers/all-MiniLM-L6-v2', 'created_at': '2025-12-30 02:01:03'}


In [8]:
# Helper function: Generate embeddings in batches
def generate_embeddings_batch(texts: List[str], batch_size: int = 32) -> np.ndarray:
    """
    Generate embeddings for a list of texts in batches.
    
    Args:
        texts: List of text strings to embed
        batch_size: Number of texts to process at once
    
    Returns:
        NumPy array of embeddings
    """
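    # np.vstack raises on an empty list, so return an empty (0, dim) array early
    if not texts:
        return np.empty((0, embedding_model.get_sentence_embedding_dimension()), dtype=np.float32)

    all_embeddings = []
    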
    for i in tqdm(range(0, len(texts), batch_size), desc="Generating embeddings"):
        batch = texts[i:i + batch_size]
        batch_embeddings = embedding_model.encode(
            batch,
            show_progress_bar=False,
            convert_to_numpy=True
        )
        all_embeddings.append(batch_embeddings)
    
    return np.vstack(all_embeddings)

In [9]:
# Generate embeddings for all chunks
print(f"\nGenerating embeddings for {len(all_chunks)} chunks...")
print(f"This may take several minutes depending on your hardware.\n")

start_time = time.time()

# Extract texts from chunks
chunk_texts = [chunk['text'] for chunk in all_chunks]

# Generate embeddings in batches
embeddings = generate_embeddings_batch(chunk_texts, batch_size=BATCH_SIZE)

embedding_time = time.time() - start_time

print(f"\n✓ Embeddings generated successfully")
print(f"  Total time: {embedding_time:.2f} seconds")
print(f"  Time per chunk: {embedding_time / len(all_chunks):.4f} seconds")
print(f"  Embeddings shape: {embeddings.shape}")
print(f"  Memory usage: {embeddings.nbytes / 1024 / 1024:.2f} MB")


Generating embeddings for 10620 chunks...
This may take several minutes depending on your hardware.



Generating embeddings: 100%|██████████| 332/332 [13:44<00:00,  2.48s/it]


✓ Embeddings generated successfully
  Total time: 824.56 seconds
  Time per chunk: 0.0776 seconds
  Embeddings shape: (10620, 384)
  Memory usage: 15.56 MB
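
The figures reported above can be reproduced with simple arithmetic. The check below assumes the embeddings are stored as 4-byte `float32` values, which is consistent with the reported memory usage:

```python
import math

num_chunks = 10620        # chunks loaded above
batch_size = 32           # BATCH_SIZE
dimension = 384           # model dimension
total_seconds = 824.56    # reported total time
bytes_per_value = 4       # assumes float32 embeddings

num_batches = math.ceil(num_chunks / batch_size)
memory_mb = num_chunks * dimension * bytes_per_value / 1024 / 1024
seconds_per_chunk = total_seconds / num_chunks

print(f"Batches: {num_batches}")
print(f"Memory usage: {memory_mb:.2f} MB")
print(f"Time per chunk: {seconds_per_chunk:.4f} seconds")
```

The results match the progress bar (332 iterations), the 15.56 MB memory figure, and the 0.0776 seconds per chunk shown in the output.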





In [None]:
# Prepare data for ChromaDB insertion
print(f"\nPreparing data for ChromaDB...")

# Extract IDs, documents, metadatas, and embeddings
ids = [chunk['chunk_id'] for chunk in all_chunks]
documents = [chunk['text'] for chunk in all_chunks]

# Handle both old (character-based) and new (semantic) chunk formats
metadatas = []
for chunk in all_chunks:
    metadata = {
        "book_title": chunk['book_title'],
        "chapter": chunk['chapter'],
        "page_number": chunk['page_number'],
        "chunk_index": chunk['chunk_index'],
        "citation": chunk['citation'],
        "char_count": chunk['char_count'],
        "word_count": chunk['word_count']
    }
    
    # Add optional fields only if present; indexing a missing key directly raises KeyError
    if 'token_count' in chunk:
        metadata['token_count'] = chunk['token_count']
    if 'sentence_count' in chunk:
        metadata['sentence_count'] = chunk['sentence_count']
    if 'chunking_method' in chunk:
        metadata['chunking_method'] = chunk['chunking_method']
    
    metadatas.append(metadata)

# Convert embeddings to list format for ChromaDB
embeddings_list = embeddings.tolist()

print(f"✓ Data prepared")
print(f"  IDs: {len(ids)}")
print(f"  Documents: {len(documents)}")
print(f"  Metadatas: {len(metadatas)}")
print(f"  Embeddings: {len(embeddings_list)}")

# Show sample metadata
if metadatas:
    print(f"\nSample metadata fields:")
    for key in metadatas[0].keys():
        print(f"  - {key}")


Preparing data for ChromaDB...


KeyError: 'token_count'
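
A `KeyError` like the one above arises when a field is indexed directly on a chunk that lacks it. One defensive pattern is to build the metadata with `dict.get` and skip absent values. The sketch below also converts values that are not `str`, `int`, `float`, or `bool` to strings, on the assumption that ChromaDB metadata is limited to those scalar types; check this against the installed version. The helper name and demo chunk are hypothetical.

```python
from typing import Any, Dict, Iterable

# Scalar types assumed to be acceptable as ChromaDB metadata values.
ALLOWED_TYPES = (str, int, float, bool)


def build_metadata(chunk: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Copy the given fields from a chunk, skipping missing or None values."""
    metadata = {}
    for field in fields:
        value = chunk.get(field)
        if value is None:
            continue  # field absent (or explicitly None): leave it out
        metadata[field] = value if isinstance(value, ALLOWED_TYPES) else str(value)
    return metadata


# Illustrative chunk: no token_count, a None chapter, and a list-valued field.
demo_chunk = {"book_title": "AI Engineering", "page_number": 2,
              "chapter": None, "tags": ["intro", "preface"]}
demo_fields = ["book_title", "chapter", "page_number", "token_count", "tags"]
print(build_metadata(demo_chunk, demo_fields))
```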

In [None]:
# Add data to ChromaDB collection in batches
print(f"\nAdding data to ChromaDB collection...")
print(f"Processing in batches of {BATCH_SIZE}...\n")

start_time = time.time()

# ChromaDB limits how many records a single add() call accepts, so insert in batches
for i in tqdm(range(0, len(ids), BATCH_SIZE), desc="Inserting into ChromaDB"):
    batch_end = min(i + BATCH_SIZE, len(ids))
    
    collection.add(
        ids=ids[i:batch_end],
        documents=documents[i:batch_end],
        metadatas=metadatas[i:batch_end],
        embeddings=embeddings_list[i:batch_end]
    )

insert_time = time.time() - start_time

print(f"\n✓ Data inserted successfully")
print(f"  Total time: {insert_time:.2f} seconds")
print(f"  Collection count: {collection.count()}")
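
The insertion loop relies on every `chunk_id` being unique within the collection. A short pre-flight check, sketched here with illustrative IDs in the same style as the sample chunk, can catch duplicates before any batch is written:

```python
from collections import Counter
from typing import List


def find_duplicate_ids(ids: List[str]) -> List[str]:
    """Return the IDs that occur more than once, in sorted order."""
    counts = Counter(ids)
    return sorted(id_ for id_, n in counts.items() if n > 1)


# Illustrative IDs only; the middle one is repeated.
demo_ids = ["299f007b_p2_sc0", "299f007b_p2_sc1", "299f007b_p2_sc1", "299f007b_p3_sc0"]
print(find_duplicate_ids(demo_ids))
```

An empty result from `find_duplicate_ids(ids)` is also what makes the count assertion in the verification cell meaningful.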

In [None]:
# Verify collection
print(f"\nVerifying ChromaDB collection...")

collection_count = collection.count()
print(f"✓ Collection verification:")
print(f"  Name: {collection.name}")
print(f"  Total documents: {collection_count}")
print(f"  Metadata: {collection.metadata}")

# Verify data integrity
assert collection_count == len(all_chunks), "Mismatch between chunks and collection count!"
print(f"\n✓ Data integrity verified: {collection_count} chunks stored successfully")

In [None]:
# Test similarity search functionality
print(f"\n{'='*80}")
print("TESTING SIMILARITY SEARCH")
print(f"{'='*80}\n")

# Test queries
test_queries = [
    "What is transfer learning?",
    "Explain the transformer architecture",
    "How do I train a neural network?",
    "What are the best practices for fine-tuning LLMs?"
]

# Number of results to retrieve
top_k = 3

for query in test_queries:
    print(f"\nQuery: '{query}'")
    print("-" * 80)
    
    # Generate query embedding
    query_embedding = embedding_model.encode(query).tolist()
    
    # Search in ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    
    # Display results
    for i, (doc, metadata, distance) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    ), 1):
        print(f"\nResult {i}:")
        print(f"  Book: {metadata['book_title']}")
        print(f"  Chapter: {metadata['chapter']}")
        print(f"  Page: {metadata['page_number']}")
        print(f"  Citation: {metadata['citation']}")
        print(f"  Distance: {distance:.4f}")
        print(f"  Text preview: {doc[:200]}...")
    
    print("-" * 80)
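
The `Distance` values printed by this cell are easier to interpret as similarities. The collection above is created without a distance setting, so this sketch assumes ChromaDB's default squared L2 distance and unit-length embeddings (both should be verified for the installed versions). Under those assumptions, for unit vectors $a$ and $b$:

$$\lVert a - b \rVert^2 = 2 - 2\cos(a, b) \quad\Longrightarrow\quad \cos(a, b) = 1 - \tfrac{1}{2}\lVert a - b \rVert^2$$

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(distance: float) -> float:
    """Convert a squared L2 distance to cosine similarity (valid for unit vectors only)."""
    return 1.0 - distance / 2.0


# Two unit vectors 60 degrees apart (toy data).
a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
d = squared_l2(a, b)
print(round(d, 4), round(cosine_from_squared_l2(d), 4))
```

Under the same assumptions, a distance of 0 means identical direction and a distance of 2 means orthogonal vectors.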

In [None]:
# Test filtering by book
print(f"\n{'='*80}")
print("TESTING FILTERED SEARCH (by book)")
print(f"{'='*80}\n")

# Get unique book titles (sorted so the test book is the same on every run)
book_titles = sorted(set(chunk['book_title'] for chunk in all_chunks))
test_book = book_titles[0] if book_titles else None

if test_book:
    query = "What is machine learning?"
    print(f"Query: '{query}'")
    print(f"Filter: Book = '{test_book}'")
    print("-" * 80)
    
    # Generate query embedding
    query_embedding = embedding_model.encode(query).tolist()
    
    # Search with filter
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3,
        where={"book_title": test_book},
        include=["documents", "metadatas", "distances"]
    )
    
    # Display results
    for i, (doc, metadata, distance) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    ), 1):
        print(f"\nResult {i}:")
        print(f"  Book: {metadata['book_title']}")
        print(f"  Chapter: {metadata['chapter']}")
        print(f"  Page: {metadata['page_number']}")
        print(f"  Distance: {distance:.4f}")
        print(f"  Text preview: {doc[:200]}...")
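
The filter above matches a single field. To combine several equality conditions, the clauses can be wrapped in one expression. The helper below is a hypothetical sketch that assumes the `$and` operator of ChromaDB's `where` syntax; it only builds the dictionary and does not call ChromaDB.

```python
from typing import Any, Dict, Optional


def build_where(**conditions: Any) -> Optional[Dict[str, Any]]:
    """Build a where filter from equality conditions, ignoring None values.

    Returns None when no condition is given, a single clause for one condition,
    and an {"$and": [...]} expression for two or more.
    """
    clauses = [{key: value} for key, value in conditions.items() if value is not None]
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


print(build_where(book_title="AI Engineering"))
print(build_where(book_title="AI Engineering", chapter="Introduction"))
```

The result could then be passed as `where=build_where(...)` in `collection.query`.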

In [None]:
# Display final statistics
print(f"\n{'='*80}")
print("FINAL STATISTICS")
print(f"{'='*80}\n")

# Collection statistics
print(f"ChromaDB Collection:")
print(f"  Name: {COLLECTION_NAME}")
print(f"  Total documents: {collection.count():,}")
print(f"  Embedding dimension: {embeddings.shape[1]}")
db_size_bytes = sum(f.stat().st_size for f in CHROMA_DIR.rglob('*') if f.is_file())
print(f"  Database size: {db_size_bytes / 1024 / 1024:.2f} MB")

# Books statistics
books_count = {}
for chunk in all_chunks:
    book = chunk['book_title']
    books_count[book] = books_count.get(book, 0) + 1

print(f"\nBooks in Collection:")
for book, count in sorted(books_count.items(), key=lambda x: x[1], reverse=True):
    print(f"  {book}: {count:,} chunks")

# Performance metrics
print(f"\nPerformance Metrics:")
print(f"  Embedding generation: {embedding_time:.2f} seconds")
print(f"  Database insertion: {insert_time:.2f} seconds")
print(f"  Total processing time: {embedding_time + insert_time:.2f} seconds")
print(f"  Average time per chunk: {(embedding_time + insert_time) / len(all_chunks):.4f} seconds")

print(f"\n{'='*80}")
print("✓ Vector database setup complete!")
print(f"{'='*80}")

In [None]:
# Save configuration for later use
config = {
    "collection_name": COLLECTION_NAME,
    "embedding_model": EMBEDDING_MODEL,
    "embedding_dimension": int(embeddings.shape[1]),
    "total_chunks": len(all_chunks),
    "chroma_path": str(CHROMA_DIR),
    "created_at": time.strftime("%Y-%m-%d %H:%M:%S"),
    "books": list(books_count.keys())
}

config_file = BASE_DIR / "vectordb_config.json"
with open(config_file, 'w', encoding='utf-8') as f:
    json.dump(config, f, indent=2)

print(f"\n✓ Configuration saved to: {config_file}")
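
A later notebook can read this file back instead of repeating the settings. A minimal sketch of such a loader, assuming the key names written above; the function name is hypothetical and the demo uses a temporary file with illustrative values rather than the real `vectordb_config.json`:

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

# Keys a retrieval notebook would need in order to reopen the collection.
REQUIRED_KEYS = {"collection_name", "embedding_model", "embedding_dimension", "chroma_path"}


def load_vectordb_config(path: Path) -> Dict[str, Any]:
    """Load the saved configuration and fail early if a required key is missing."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    return config


# Demo with a temporary file and illustrative values.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "vectordb_config.json"
    demo_path.write_text(json.dumps({
        "collection_name": "ai_books_collection",
        "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
        "embedding_dimension": 384,
        "chroma_path": "chroma_db",
    }), encoding="utf-8")
    loaded = load_vectordb_config(demo_path)

print(loaded["collection_name"], loaded["embedding_dimension"])
```

Loading the model name from the file helps ensure that queries are embedded with the same model that produced the stored vectors.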

## Next Steps

✅ Vector database setup complete!

The embedded chunks are now stored in ChromaDB and ready for retrieval in **Notebook 4** (RAG pipeline testing).

### What Was Created:
1. **ChromaDB Collection**: `ai_books_collection` with all embedded chunks
2. **Embeddings**: 384-dimensional vectors generated with `sentence-transformers/all-MiniLM-L6-v2` (the default `EMBEDDING_MODEL`)
3. **Metadata**: Book title, chapter, page number, citations preserved
4. **Configuration**: Saved to `vectordb_config.json`

### Key Features:
- ✅ Semantic search capability
- ✅ Metadata filtering (by book, chapter, etc.)
- ✅ Distance-based relevance scoring
- ✅ Persistent storage (survives restarts)

### What's Next:
In the next notebook, we will:
1. Test the complete RAG pipeline
2. Integrate with Groq LLM
3. Generate answers with citations
4. Link relevant images to responses
5. Evaluate response quality