# Notebook 03: Build FAISS Index & Test Retrieval

## Goal

Generate embeddings for all chunks, build a FAISS index, and test retrieval with example queries.


## FAISS Index Choice

### IndexFlatIP vs IndexFlatL2

- **IndexFlatIP (Inner Product)**: Ranks by inner product (higher is better); equals cosine similarity only when the vectors are L2-normalized
- **IndexFlatL2 (Euclidean)**: Ranks by squared Euclidean distance (lower is better); no normalization needed

We'll use **IndexFlatIP** with normalized embeddings because:

- Better semantic matching for text (cosine similarity)
- Same ranking as IndexFlatL2 on normalized vectors, with scores that read directly as cosine similarity
- Standard practice in semantic search
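
The relationship between inner product, cosine similarity, and L2 distance can be checked by hand. The following is a minimal, dependency-free sketch using toy 3-d vectors in place of real 384-d embeddings; the helper names (`l2_normalize`, `dot`, `cosine`) are illustrative and not part of this project's code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for embeddings (illustrative values only)
query = [1.0, 2.0, 2.0]
doc = [2.0, 0.0, 1.0]

q_hat, d_hat = l2_normalize(query), l2_normalize(doc)

inner = dot(q_hat, d_hat)
sq_l2 = sum((x - y) ** 2 for x, y in zip(q_hat, d_hat))

# Inner product of the unit vectors equals cosine similarity of the originals
print(math.isclose(inner, cosine(query, doc)))
# For unit vectors, squared L2 distance is 2 - 2 * inner product,
# so sorting by inner product (descending) matches sorting by L2 (ascending)
print(math.isclose(sq_l2, 2 - 2 * inner))
```

Because the two metrics order normalized vectors identically, the choice between them here is mostly about how readable the scores are.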


## Normalization & Search Latency

- **Normalization**: L2-normalize every embedding before adding it to the index, and normalize each query the same way
- **Search latency**: IndexFlatIP is exact (no approximation), so each query is compared against all n stored vectors (O(n)); that is fast enough for our corpus size
- **Future scaling**: For larger corpora, consider an approximate index such as IndexIVFFlat or IndexHNSWFlat
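
To make the O(n) cost concrete, here is a rough pure-Python sketch of what an exact flat inner-product search does conceptually: score every stored vector, then keep the best k. The function name `flat_ip_search` is hypothetical, and the real FAISS implementation is vectorized native code rather than a Python loop.

```python
import heapq

def flat_ip_search(query, vectors, k):
    """Exact top-k by inner product: score every stored vector, keep the best k."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    top = heapq.nlargest(k, scored)  # one full pass over all n vectors
    return [(idx, score) for score, idx in top]

# Three unit-length toy vectors (illustrative values only)
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
hits = flat_ip_search([0.6, 0.8], vectors, k=2)

for idx, score in hits:
    print(f"vector {idx}: score {score:.2f}")
```

Every query touches every vector, which is why approximate indexes become attractive once the corpus grows by several orders of magnitude.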


## Metadata Format for Citations

Each chunk's metadata should include:

- `book`: Source book name (e.g., "iliad", "dorian")
- `para_idx_start`: First paragraph index in this chunk
- `para_idx_end`: Last paragraph index in this chunk
- `chunk_id`: Unique identifier for the chunk
- `char_span`: Character start/end positions (optional, for precise citations)

This metadata enables us to generate citations like:

> "[1] Quote text..." - The Iliad, Book 1, paragraphs 5-7
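
As a sketch of how those fields could turn into a citation string, the helper below formats one retrieved quote. `format_citation` and the `titles` mapping are hypothetical names introduced for illustration. The metadata listed above has no "Book 1"-style division, so this version cites paragraph ranges only.

```python
def format_citation(n, quote, meta, titles=None):
    """Build a citation line from a chunk's metadata.

    `titles` optionally maps the short book key (e.g. "dorian") to a display title.
    """
    titles = titles or {}
    title = titles.get(meta["book"], meta["book"])
    start, end = meta["para_idx_start"], meta["para_idx_end"]
    span = f"paragraph {start}" if start == end else f"paragraphs {start}-{end}"
    return f'"[{n}] {quote}" - {title}, {span}'

# Example metadata row (values chosen for illustration)
meta = {"book": "dorian", "para_idx_start": 16, "para_idx_end": 18}
citation = format_citation(
    1,
    "In the centre of the room...",
    meta,
    titles={"dorian": "The Picture of Dorian Gray"},
)
print(citation)
```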


## Step 1: Load Chunks from Previous Notebook

Load the chunked data (or regenerate if needed).


In [25]:
# Load chunks from saved file (created in notebook 02)
import json
import yaml
from pathlib import Path

def load_config(path="../configs/app.yaml"):
    with open(path, 'r') as f:
        config = yaml.safe_load(f)
    return config

config = load_config()
book_name = config['book']

# Load chunks from JSON file
chunks_file = Path(f"../data/interim/chunks/{book_name}_chunks.json")

if not chunks_file.exists():
    raise FileNotFoundError(
        f"Chunks file not found: {chunks_file}\n"
        f"Please run notebook 02 first to generate chunks."
    )

with open(chunks_file, 'r', encoding='utf-8') as f:
    chunks = json.load(f)

print(f"✅ Loaded {len(chunks)} chunks from: {chunks_file}")
print(f"   Book: {book_name}")
print(f"   Sample chunk ID: {chunks[0]['id'] if chunks else 'N/A'}")

# Extract chunk texts and metadata for embedding
chunk_texts = [chunk['text'] for chunk in chunks]
print(f"   Total characters: {sum(len(text) for text in chunk_texts):,}")


✅ Loaded 374 chunks from: ../data/interim/chunks/dorian_chunks.json
   Book: dorian
   Sample chunk ID: dorian_chunk_0
   Total characters: 436,080
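
Later cells read `chunk['id']`, `chunk['text']`, and four keys under `chunk['meta']`, so it can help to confirm that shape right after loading. The check below is a minimal sketch; `find_invalid_chunks` and `REQUIRED_META_KEYS` are hypothetical names, and the sample data is made up for the demonstration.

```python
REQUIRED_KEYS = {"id", "text", "meta"}
REQUIRED_META_KEYS = {"book", "para_idx_start", "para_idx_end", "char_count"}

def find_invalid_chunks(chunks):
    """Return (position, reason) pairs for chunks that do not match the expected shape."""
    problems = []
    seen_ids = set()
    for pos, chunk in enumerate(chunks):
        missing = REQUIRED_KEYS - set(chunk)
        if missing:
            problems.append((pos, f"missing keys: {sorted(missing)}"))
            continue
        if chunk["id"] in seen_ids:
            problems.append((pos, f"duplicate id: {chunk['id']}"))
        seen_ids.add(chunk["id"])
        missing_meta = REQUIRED_META_KEYS - set(chunk["meta"])
        if missing_meta:
            problems.append((pos, f"missing meta keys: {sorted(missing_meta)}"))
    return problems

# Made-up sample: one valid chunk, one without text, one duplicate with thin metadata
full_meta = {"book": "demo", "para_idx_start": 0, "para_idx_end": 1, "char_count": 6}
sample = [
    {"id": "demo_chunk_0", "text": "First.", "meta": full_meta},
    {"id": "demo_chunk_1", "meta": full_meta},
    {"id": "demo_chunk_0", "text": "Again.", "meta": {"book": "demo"}},
]
problems = find_invalid_chunks(sample)
for pos, reason in problems:
    print(f"chunk at position {pos}: {reason}")
```

In the notebook this would be called as `find_invalid_chunks(chunks)`, with an empty list meaning the file matches what the embedding step expects.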


## Memory Check

Before embedding, let's verify we have enough chunks and check memory usage.


In [26]:
# Quick sanity and memory check

# Fail fast on an empty corpus (this also avoids dividing by zero below)
if len(chunks) == 0:
    raise ValueError("No chunks loaded! Please run notebook 02 first.")

total_chars = sum(len(chunk['text']) for chunk in chunks)

print("📊 Chunk Statistics:")
print(f"   Number of chunks: {len(chunks)}")
print(f"   Total characters: {total_chars:,}")
print(f"   Average chunk size: {total_chars / len(chunks):.0f} chars")

# Estimate memory needed for embeddings (384 dims * 4 bytes per float32 * num_chunks)
estimated_mb = (384 * 4 * len(chunks)) / (1024 * 1024)
print(f"\n💾 Estimated memory for embeddings: {estimated_mb:.2f} MB")
print("   (This should be manageable for most systems)")

print("\n✅ Ready to proceed with embedding")


📊 Chunk Statistics:
   Number of chunks: 374
   Total characters: 436,080
   Average chunk size: 1166 chars

💾 Estimated memory for embeddings: 0.55 MB
   (This should be manageable for most systems)

✅ Ready to proceed with embedding
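
The same estimate can be projected to larger corpora to see when raw embedding storage starts to matter. This is a small sketch: `embedding_memory_mb` is a hypothetical helper, and the corpus sizes other than 374 are invented for comparison. It counts only the float32 values, not index overhead or the model itself.

```python
def embedding_memory_mb(num_vectors, dim=384, bytes_per_value=4):
    """Raw float32 storage for num_vectors embeddings, in MB (1 MB = 1024 * 1024 bytes)."""
    return num_vectors * dim * bytes_per_value / (1024 * 1024)

# 374 is this notebook's corpus; the larger sizes are hypothetical
for n in (374, 100_000, 1_000_000):
    print(f"{n:>9,} chunks -> {embedding_memory_mb(n):>8.2f} MB")
```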


## Test with Small Sample First (Optional)

If you're experiencing crashes, test with a small sample first to isolate the issue.


In [27]:
# OPTIONAL: Smoke-test with the first 10 chunks to verify everything works
# Set TEST_MODE = True below to run on a small sample first

TEST_MODE = False  # Toggle to True only for quick smoke-tests
TEST_CHUNK_LIMIT = 10

ORIGINAL_CHUNK_COUNT = len(chunks)

if TEST_MODE:
    limit = min(TEST_CHUNK_LIMIT, ORIGINAL_CHUNK_COUNT)
    print(f"🧪 TEST MODE: Using first {limit} of {ORIGINAL_CHUNK_COUNT} chunks")
    chunks = chunks[:limit]
    print("   ⚠️ Index persistence is disabled while TEST_MODE is True.")
else:
    print(f"🚀 Using all {ORIGINAL_CHUNK_COUNT} chunks")


🚀 Using all 374 chunks


## Step 2: Build Full Embeddings & FAISS Index

Embed all chunks and build the FAISS index.


In [28]:
# === TODO (you code this) ===
# Build full embeddings and FAISS index; persist to data/index/.

import sys
from pathlib import Path
import importlib
import gc

sys.path.append(str(Path('..').resolve()))
from src import embed_index
importlib.reload(embed_index)  # Reload to get latest changes
from src.embed_index import embed_texts, build_faiss_index, save_index

# Track run configuration
TEST_MODE = globals().get('TEST_MODE', False)
ORIGINAL_CHUNK_COUNT = globals().get('ORIGINAL_CHUNK_COUNT', len(chunks))

# 1. Embed all chunk texts
# 2. Build FAISS index (IndexFlatIP with normalized vectors)
# 3. Save index and metadata to data/index/

# Use the chunks already loaded (they're already filtered by book from notebook 02)
chunk_texts = [chunk['text'] for chunk in chunks]
current_chunk_count = len(chunk_texts)
subset_warning = current_chunk_count < ORIGINAL_CHUNK_COUNT

if subset_warning:
    print(f"⚠️ Working with {current_chunk_count} of {ORIGINAL_CHUNK_COUNT} chunks")
    print("   Set TEST_MODE = False to embed the full corpus before persisting the index")

# Prepare metadata rows for saving
meta_rows = []
for chunk in chunks:
    meta_rows.append({
        'chunk_id': chunk['id'],
        'book': chunk['meta']['book'],
        'para_idx_start': chunk['meta']['para_idx_start'],
        'para_idx_end': chunk['meta']['para_idx_end'],
        'char_count': chunk['meta']['char_count']
    })

print(f"📚 Preparing to embed {current_chunk_count} chunks...")
print(f"   Total characters: {sum(len(text) for text in chunk_texts):,}")

# Embed in batches to avoid memory issues
try:
    print(f"📚 Embedding using {config['embedding_model']}...")
    embeddings, model = embed_texts(chunk_texts, config['embedding_model'])
    print(f"✅ Embedded {current_chunk_count} chunks")
    print(f"   Embedding shape: {embeddings.shape}")
    print(f"   Memory usage: {embeddings.nbytes / 1024 / 1024:.2f} MB")
except Exception as e:
    print(f"❌ Error during embedding: {e}")
    raise

# Free up memory by deleting the model if not needed
del model
gc.collect()

# Build FAISS index
try:
    print("🔨 Building FAISS index...")
    index = build_faiss_index(embeddings)
    print(f"✅ Built FAISS index with {index.ntotal} vectors")
except Exception as e:
    print(f"❌ Error building index: {e}")
    raise

# Save index and metadata
out_dir = '../data/index'
skip_persist = TEST_MODE and subset_warning

if skip_persist:
    print("⚠️ TEST MODE active: skipping save_index to avoid overwriting the full artifacts.")
    print("   Toggle TEST_MODE = False and rerun this cell when you're ready to persist the full index.")
else:
    try:
        save_index(index, meta_rows, out_dir)
        print(f"✅ Saved index to {out_dir}")
    except Exception as e:
        print(f"❌ Error saving index: {e}")
        raise

if skip_persist:
    print("\nℹ️ Index + metadata objects are available in-memory for experimentation, but disk files were left untouched.")
else:
    print("\n🎉 Successfully built and saved FAISS index!")


📚 Preparing to embed 374 chunks...
   Total characters: 436,080
📚 Embedding using sentence-transformers/all-MiniLM-L6-v2...



✅ Embedded 374 chunks
   Embedding shape: (374, 384)
   Memory usage: 0.55 MB
🔨 Building FAISS index...
✅ Built FAISS index with 374 vectors
✅ Saved index to: ../data/index/index.faiss
✅ Saved metadata to: ../data/index/metadata.parquet
   Index size: 374 vectors
   Metadata rows: 374
✅ Saved index to ../data/index

🎉 Successfully built and saved FAISS index!
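
`embed_texts` lives in `src/embed_index.py` and is not shown in this notebook, so how it batches is not visible here. As a rough illustration of the "embed in batches" idea, the sketch below splits the corpus into fixed-size slices. `iter_batches` is a hypothetical helper, the placeholder texts stand in for real chunks, and the batch size of 32 is an assumption rather than a value taken from this project.

```python
def iter_batches(items, batch_size=32):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder texts standing in for the 374 real chunks
texts = [f"chunk {i}" for i in range(374)]
batches = list(iter_batches(texts))

print(f"{len(batches)} batches, last batch has {len(batches[-1])} items")
```

Only one batch of texts has to be tokenized and run through the model at a time, which keeps peak memory roughly proportional to the batch size instead of the corpus size.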


## Step 3: Load Index & Test Retrieval

Load the saved index and test retrieval with example queries.


In [29]:
# === TODO (you code this) ===
# Load index & metadata; test a few queries.

import sys
sys.path.append('..')

from src.embed_index import load_index
from src.retrieve import retrieve
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# 1. Load index and metadata
# 2. Load embedding model
# 3. Test retrieval with example queries
in_dir = "../data/index"
index, metadata_df = load_index(in_dir)
print("Index loaded successfully")
print("Index info:")
print(f"  Number of vectors: {index.ntotal}")
print(f"  Dimension: {index.d}")
print(f"  Index type: {type(index)}")
print(f"  Metadata shape: {metadata_df.shape}")
print(f"  Metadata columns: {metadata_df.columns}")

config = load_config()
model = SentenceTransformer(config['embedding_model'])
print("\nModel loaded successfully. Config:")
print(f"  Embedding model: {config['embedding_model']}")
print(f"  Book: {config['book']}")

chunks_lookup = None
try:
    book_name = config['book']
    chunks_file = Path(f"../data/interim/chunks/{book_name}_chunks.json")
    if chunks_file.exists():
        with open(chunks_file, 'r', encoding='utf-8') as f:
            chunks_lookup = json.load(f)
        print(f"  Chunks file loaded successfully: {chunks_file}")
    else:
        print(f"❌ Chunks file not found: {chunks_file}")
except Exception as e:
    print(f"❌ Error loading chunks: {e}")
    raise

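def embed_query_fn(query: str) -> np.ndarray:
    """Embed a single query as a unit-length float32 vector."""
    embedding = model.encode([query], normalize_embeddings=True, show_progress_bar=True)
    embedding = np.array(embedding, dtype=np.float32)
    # Redundant after normalize_embeddings=True, but cheap; keeps the query
    # unit-length even if the encode option above is changed later
    faiss.normalize_L2(embedding)
    return embedding[0]

print("Ready to test retrieval")
print(f"   Top-k: {config.get('top_k', 5)}")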




✅ Loaded index: 374 vectors, dimension 384
✅ Loaded metadata: 374 rows
Index loaded successfully
Index info:
  Number of vectors: 374
  Dimension: 384
  Index type: <class 'faiss.swigfaiss.IndexFlat'>
  Metadata shape: (374, 5)
  Metadata columns: Index(['chunk_id', 'book', 'para_idx_start', 'para_idx_end', 'char_count'], dtype='object')

Model loaded successfully. Config:
  Embedding model: sentence-transformers/all-MiniLM-L6-v2
  Book: dorian
  Chunks file loaded successfully: ../data/interim/chunks/dorian_chunks.json
Ready to test retrieval
   Top-k: 5


In [30]:
# Fix chunks_lookup: Convert from list to dictionary
# The JSON file is a list, but retrieve() needs a dict mapping chunk_id -> chunk

if chunks_lookup is not None:
    # Check if it's a list (wrong) or dict (correct)
    if isinstance(chunks_lookup, list):
        print(f"⚠️  chunks_lookup is a list (length: {len(chunks_lookup)}), converting to dict...")
        # Convert list to dictionary: chunk['id'] -> chunk
        chunks_lookup = {chunk['id']: chunk for chunk in chunks_lookup}
        print(f"✅ Converted to dictionary with {len(chunks_lookup)} entries")
        print(f"   Sample keys: {list(chunks_lookup.keys())[:3]}")
    elif isinstance(chunks_lookup, dict):
        print(f"✅ chunks_lookup is already a dictionary with {len(chunks_lookup)} entries")
    else:
        print(f"❌ chunks_lookup is unexpected type: {type(chunks_lookup)}")
else:
    print("❌ chunks_lookup is None - need to reload chunks")
    # Reload and convert properly
    book_name = config['book']
    chunks_file = Path(f"../data/interim/chunks/{book_name}_chunks.json")
    if chunks_file.exists():
        with open(chunks_file, 'r', encoding='utf-8') as f:
            chunks_list = json.load(f)
            chunks_lookup = {chunk['id']: chunk for chunk in chunks_list}
        print(f"✅ Reloaded and converted {len(chunks_lookup)} chunks to dictionary")
    else:
        print(f"❌ Chunks file not found: {chunks_file}")


⚠️  chunks_lookup is a list (length: 374), converting to dict...
✅ Converted to dictionary with 374 entries
   Sample keys: ['dorian_chunk_0', 'dorian_chunk_1', 'dorian_chunk_2']
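
`retrieve()` is defined in `src/retrieve.py` and is not shown here, so the following is only a guess at why a `chunk_id -> chunk` dictionary is convenient. A FAISS search returns row positions and scores, and each position has to be joined back to its metadata row and then to the chunk text. `assemble_results` is a hypothetical helper that uses a plain list of dicts in place of the metadata DataFrame to stay dependency-free.

```python
def assemble_results(scores, indices, meta_rows, chunks_lookup):
    """Join search hits (row positions + scores) with metadata and chunk text."""
    results = []
    for score, idx in zip(scores, indices):
        if idx < 0:
            # FAISS pads with -1 when fewer than k results are available
            continue
        meta = meta_rows[idx]
        chunk = chunks_lookup.get(meta["chunk_id"])
        if chunk is None:
            continue  # metadata row without a matching chunk
        results.append({
            "score": float(score),
            "chunk_id": meta["chunk_id"],
            "text": chunk["text"],
            "meta": meta,
        })
    return results

# Made-up data in the same shape as the notebook's metadata and chunks
meta_rows = [
    {"chunk_id": "demo_chunk_0", "book": "demo", "para_idx_start": 0, "para_idx_end": 2},
    {"chunk_id": "demo_chunk_1", "book": "demo", "para_idx_start": 3, "para_idx_end": 5},
]
chunks_lookup_demo = {
    "demo_chunk_0": {"id": "demo_chunk_0", "text": "First chunk."},
    "demo_chunk_1": {"id": "demo_chunk_1", "text": "Second chunk."},
}
demo_results = assemble_results([0.9, 0.5, -1.0], [1, 0, -1], meta_rows, chunks_lookup_demo)
for r in demo_results:
    print(r["chunk_id"], r["score"], r["text"])
```

With a list, each join would need a linear scan for the matching `id`; the dictionary makes it a single lookup per hit.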


In [31]:
# Better filter function to exclude TOC/header chunks
import re

def is_toc_or_header_chunk(result):
    """
    Detect if a chunk is a TOC, header, or low-content chunk.
    Returns True if it should be filtered out.
    """
    text = result['text']
    chunk_id = result.get('chunk_id', '')
    meta = result.get('meta', {})
    
    # Chunk 0 (or the chunk starting at paragraph 0) is usually the TOC/preface;
    # drop it only if it actually looks like a table of contents
    if chunk_id.endswith('_chunk_0') or meta.get('para_idx_start', -1) == 0:
        if 'Contents' in text and text.count('CHAPTER') > 5:
            return True  # It's a TOC
    
    # Filter very short chunks
    if len(text) < 150:
        return True
    
    # Filter chunks with too many newlines (indicates headers/TOC)
    newline_ratio = text.count('\n') / len(text) if len(text) > 0 else 0
    if newline_ratio > 0.15:  # More than 15% newlines
        return True
    
    # Filter chunks that are mostly chapter titles (lines like "CHAPTER XIX.")
    lines = text.split('\n')
    chapter_lines = [line for line in lines if re.match(r'^\s*CHAPTER\s+[IVXLC]+\b', line, re.IGNORECASE)]
    if len(chapter_lines) > 3:  # More than 3 chapter title lines
        return True
    
    # Filter chunks that start with title/author/contents pattern
    # (first_100 is lowercased, so compare against lowercase strings)
    first_100 = text[:100].lower()
    if ('contents' in first_100 and 'chapter' in first_100) or \
       (text.startswith('The Picture of') and 'by oscar wilde' in first_100):
        # Check if it's mostly TOC (many short lines)
        short_lines = [line for line in lines[:30] if len(line.strip()) < 50]
        if len(short_lines) > 10:  # More than 10 short lines in first 30
            return True
    
    return False
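
To get a feel for the 15% newline threshold used above, the sketch below computes the ratio for a few made-up samples. `newline_ratio` is a hypothetical helper that mirrors the expression in the filter. A ratio above 0.15 means lines average fewer than about 7 characters including the line break, so ordinary chapter-title lines stay under it and have to be caught by the chapter-title rule instead.

```python
def newline_ratio(text):
    """Fraction of characters that are line breaks (0.0 for empty text)."""
    return text.count("\n") / len(text) if text else 0.0

# Made-up samples for illustration
numerals_only = "I\nII\nIII\nIV\nV\n"
chapter_titles = "Contents\nCHAPTER I\nCHAPTER II\nCHAPTER III\nCHAPTER IV\n"
prose = (
    "In the centre of the room, clamped to an upright easel, stood the\n"
    "full-length portrait of a young man of extraordinary personal beauty,\n"
    "and in front of it, some little distance away, was sitting the artist\n"
)

for name, sample in [("numerals", numerals_only), ("titles", chapter_titles), ("prose", prose)]:
    ratio = newline_ratio(sample)
    print(f"{name:>8}: ratio={ratio:.3f} filtered_by_newline_rule={ratio > 0.15}")
```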


## Example Queries & Manual Relevance Check

Test with queries like:

- "How does Homer portray Achilles' anger in Book 1?"
- "What does Lord Henry claim about influence on the young?"
- "Where does the poem describe the shield of Achilles?"

For each query, manually judge whether the retrieved snippets are relevant. This helps validate:

1. Embedding quality (semantic similarity)
2. Chunk size appropriateness (not too fragmented, not too broad)
3. Retrieval ranking (most relevant chunks appear first)
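
Manual judgments are easier to compare across queries once they are turned into a number. One simple option is a graded precision@k, sketched below. `precision_at_k`, the half-credit weight for "partially relevant", and the sample labels are all assumptions made for illustration, not results from this notebook.

```python
RELEVANCE_WEIGHTS = {
    "relevant": 1.0,
    "partially relevant": 0.5,  # assumed half credit
    "not relevant": 0.0,
}

def precision_at_k(labels, k):
    """Average relevance weight of the top-k results, given labels in rank order."""
    top = labels[:k]
    if not top:
        return 0.0
    return sum(RELEVANCE_WEIGHTS[label] for label in top) / len(top)

# Hypothetical judgments for one query, in rank order
judgments = ["relevant", "partially relevant", "not relevant", "relevant", "not relevant"]

print(f"P@1 = {precision_at_k(judgments, 1):.2f}")
print(f"P@3 = {precision_at_k(judgments, 3):.2f}")
print(f"P@5 = {precision_at_k(judgments, 5):.2f}")
```

A score that drops sharply between P@1 and P@5 suggests the ranking is good but the relevant material is concentrated in very few chunks.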


In [32]:
# Test queries and display top-k results
# For each query, show:
# - Query text
# - Top 3-5 retrieved chunks with scores
# - Manual relevance judgment (relevant/partially relevant/not relevant)

query = "Lord Henry says all influence is immoral"

results = retrieve(
    query=query, 
    index=index, 
    embed_fn=embed_query_fn, 
    metadata_df=metadata_df, 
    chunks_lookup=chunks_lookup,
    k=config.get('top_k', 5)
)

# After retrieval, filter out low-content chunks
print("Testing filter on current results...\n")

for i, result in enumerate(results, 1):
    should_filter = is_toc_or_header_chunk(result)
    status = "❌ FILTER OUT" if should_filter else "✅ KEEP"
    print(f"Result {i} ({result['chunk_id']}): {status}")
    if should_filter:
        print("  Reason: TOC/header detected")
    print()

# Apply the filter
filtered_results = [r for r in results if not is_toc_or_header_chunk(r)]

print("\n📊 Filtering Results:")
print(f"   Original: {len(results)} chunks")
print(f"   Filtered: {len(filtered_results)} chunks")
print(f"   Removed: {len(results) - len(filtered_results)} chunks")

for i, result in enumerate(filtered_results, 1):
    print(f"\nResult {i}:")
    print(f"  Score: {result['score']:.4f}")
    print(f"  Chunk ID: {result['chunk_id']}")
    print(f"  Book: {result['meta']['book']}")
    print(f"  Paragraph range: {result['meta']['para_idx_start']}-{result['meta']['para_idx_end']}")
    print(f"  Text: {result['text'][:300]}...")

query_2 = "Describe the appearance of the portrait painting of the young man"

results_2 = retrieve(
    query=query_2, 
    index=index, 
    embed_fn=embed_query_fn, 
    metadata_df=metadata_df, 
    chunks_lookup=chunks_lookup,
    k=config.get('top_k', 5)
)

# After retrieval, filter out low-content chunks
filtered_results_2 = []

for result in results_2:
    text = result['text']
    # Skip very short chunks and chunks dominated by line breaks (typical of headers/TOC)
    if len(text) < 200 or text.count('\n') / len(text) > 0.1:
        continue
    filtered_results_2.append(result)

for i, result in enumerate(filtered_results_2, 1):
    print(f"\nResult {i}:")
    print(f"  Score: {result['score']:.4f}")
    print(f"  Chunk ID: {result['chunk_id']}")
    print(f"  Book: {result['meta']['book']}")
    print(f"  Paragraph range: {result['meta']['para_idx_start']}-{result['meta']['para_idx_end']}")
    print(f"  Text: {result['text'][:300]}...")





Testing filter on current results...

Result 1 (dorian_chunk_349): ✅ KEEP

Result 2 (dorian_chunk_366): ✅ KEEP

Result 3 (dorian_chunk_290): ✅ KEEP

Result 4 (dorian_chunk_29): ✅ KEEP

Result 5 (dorian_chunk_56): ✅ KEEP


📊 Filtering Results:
   Original: 5 chunks
   Filtered: 5 chunks
   Removed: 0 chunks

Result 1:
  Score: 0.5828
  Chunk ID: dorian_chunk_349
  Book: dorian
  Paragraph range: 1449-1454
  Text: CHAPTER XIX.

“There is no use your telling me that you are going to be good,” cried
Lord Henry, dipping his white fingers into a red copper bowl filled
with rose-water. “You are quite perfect. Pray, don’t change.”

Dorian Gray shook his head. “No, Harry, I have done too many dreadful
things in my l...

Result 2:
  Score: 0.5699
  Chunk ID: dorian_chunk_366
  Book: dorian
  Paragraph range: 1506-1508
  Text: When he reached home, he found his servant waiting up for him. He sent
him to bed, and threw himself down on the sofa in the library, and
began to 

Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Result 1:
  Score: 0.5341
  Chunk ID: dorian_chunk_9
  Book: dorian
  Paragraph range: 34-40
  Text: “What is that?” said the painter, keeping his eyes fixed on the ground.

“You know quite well.”

“I do not, Harry.”

“Well, I will tell you what it is. I want you to explain to me why you
won’t exhibit Dorian Gray’s picture. I want the real reason.”

“I told you the real reason.”

“No, you did not. ...

Result 2:
  Score: 0.5257
  Chunk ID: dorian_chunk_3
  Book: dorian
  Paragraph range: 16-18
  Text: In the centre of the room, clamped to an upright easel, stood the
full-length portrait of a young man of extraordinary personal beauty,
and in front of it, some little distance away, was sitting the artist
himself, Basil Hallward, whose sudden disappearance some years ago
caused, at the time, such p...

Result 3:
  Score: 0.5127
  Chunk ID: dorian_chunk_17
  Book: dorian
  Paragraph range: 67-69
  Text: “Every day. I couldn’t be happy if I didn’t see him e

## Summary

At this point, you should have:

- ✅ Full FAISS index built and saved to `data/index/`
- ✅ Metadata persisted alongside the index
- ✅ Retrieval tested with example queries
- ✅ Manual validation that retrieved chunks are relevant

**Next notebook**: Build a small QA evaluation set, test answer composition, and wire up the Gradio demo.
