# Notebook 03: Build FAISS Index & Test Retrieval

## Goal

Generate embeddings for all chunks, build a FAISS index, and test retrieval with example queries.


## FAISS Index Choice

### IndexFlatIP vs IndexFlatL2

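- **IndexFlatIP (Inner Product)**: Scores by dot product (higher is better). The index does not normalize anything itself; scores equal cosine similarity only when both the indexed vectors and the query vectors are L2-normalized
- **IndexFlatL2 (Euclidean)**: Scores by squared Euclidean distance (lower is better); works directly on unnormalized vectors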

We'll use **IndexFlatIP** with normalized embeddings because:

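- Cosine similarity ignores vector magnitude, which suits semantic matching for text
- Scores are easy to read: bounded in [-1, 1], with higher meaning more similar (on unit vectors, IndexFlatL2 would give the same ranking at a similar cost)
- Standard practice in semantic search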
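
The relationship between the two metrics can be checked numerically. The sketch below uses plain NumPy on random toy vectors (not real embeddings, and not FAISS itself); `l2_normalize` is a helper name made up for illustration. It verifies that the inner product of unit vectors equals cosine similarity, and that squared L2 distance on unit vectors is `2 - 2 * inner_product`, so both metrics rank results identically.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8))   # 5 toy "chunk" vectors, dimension 8
query = rng.normal(size=(8,))    # 1 toy query vector


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row (or a single vector) to unit L2 norm."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / norms


docs_n = l2_normalize(docs)
query_n = l2_normalize(query)

# Cosine similarity computed from the raw (unnormalized) vectors.
cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Inner product on the normalized vectors.
inner = docs_n @ query_n

# Squared Euclidean distance on the normalized vectors.
sq_l2 = np.sum((docs_n - query_n) ** 2, axis=1)

ip_matches_cosine = np.allclose(inner, cosine)
l2_identity_holds = np.allclose(sq_l2, 2.0 - 2.0 * inner)
# Highest inner product first vs. smallest distance first.
same_ranking = np.array_equal(np.argsort(-inner), np.argsort(sq_l2))

print(ip_matches_cosine, l2_identity_holds, same_ranking)
```

The practical difference is therefore mostly in how scores read (similarity vs. distance), not in which chunks come back.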


## Normalization & Search Latency

- **Normalization**: L2-normalize all chunk embeddings before indexing, and normalize each query embedding the same way before searching
- **Search latency**: IndexFlatIP is exact (no approximation), so each query scans all n vectors (O(n·d) for embedding dimension d); this is fast enough for our corpus size
- **Future scaling**: For larger corpora, consider IndexIVFFlat or IndexHNSWFlat for approximate search
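
To make the cost of exact search concrete, here is a minimal NumPy stand-in for what a flat inner-product index computes: one dot product per stored vector, followed by a top-k selection. It is a sketch, not FAISS code; `normalize_rows` and `exact_ip_search` are hypothetical names, and the vectors are random placeholders rather than real embeddings.

```python
import numpy as np


def normalize_rows(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row; the epsilon guards against zero vectors."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def exact_ip_search(index_vecs: np.ndarray, queries: np.ndarray, k: int):
    """Brute-force inner-product search: O(n * d) work per query."""
    scores = queries @ index_vecs.T                 # shape (n_queries, n_vectors)
    top_ids = np.argsort(-scores, axis=1)[:, :k]    # best scores first
    top_scores = np.take_along_axis(scores, top_ids, axis=1)
    return top_scores, top_ids


rng = np.random.default_rng(42)
chunk_vecs = normalize_rows(rng.normal(size=(100, 16)))

# Build a query that is a slightly perturbed copy of chunk 7,
# then normalize it exactly like the indexed vectors.
query_vecs = normalize_rows(chunk_vecs[[7]] + 0.01 * rng.normal(size=(1, 16)))

scores, ids = exact_ip_search(chunk_vecs, query_vecs, k=3)
print(ids[0], scores[0])
```

With FAISS, the same steps would typically be `faiss.normalize_L2` (in place, on float32 arrays), then `add` and `search` on an `IndexFlatIP`; the project's own wrappers in `src.embed_index` may organize this differently.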


## Metadata Format for Citations

Each chunk's metadata should include:

- `book`: Source book name (e.g., "iliad", "dorian")
- `para_idx_start`: First paragraph index in this chunk
- `para_idx_end`: Last paragraph index in this chunk
- `chunk_id`: Unique identifier for the chunk
- `char_span`: Character start/end positions (optional, for precise citations)

This metadata enables us to generate citations like:

> "[1] Quote text..." — The Iliad, Book 1, paragraphs 5-7
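
One possible way to carry this metadata and turn it into a citation string is sketched below. The `ChunkMeta` class, the `BOOK_TITLES` mapping, the `format_citation` helper, and the sample `chunk_id` value are all assumptions made for illustration; the project may store metadata as plain dicts or JSON instead. The "Book 1" part of the example citation is left out here because none of the fields listed above carries it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ChunkMeta:
    """Per-chunk metadata, mirroring the fields listed above."""
    chunk_id: str
    book: str
    para_idx_start: int
    para_idx_end: int
    char_span: Optional[Tuple[int, int]] = None  # optional (start, end) offsets


# Hypothetical mapping from the short `book` key to a display title.
BOOK_TITLES = {
    "iliad": "The Iliad",
    "dorian": "The Picture of Dorian Gray",
}


def format_citation(rank: int, quote: str, meta: ChunkMeta) -> str:
    """Render a numbered citation from a quote and its chunk metadata."""
    title = BOOK_TITLES.get(meta.book, meta.book)
    if meta.para_idx_start == meta.para_idx_end:
        paras = f"paragraph {meta.para_idx_start}"
    else:
        paras = f"paragraphs {meta.para_idx_start}-{meta.para_idx_end}"
    return f'[{rank}] "{quote}" ({title}, {paras})'


meta = ChunkMeta(chunk_id="iliad-0003", book="iliad",
                 para_idx_start=5, para_idx_end=7)
citation = format_citation(1, "Quote text...", meta)
print(citation)  # [1] "Quote text..." (The Iliad, paragraphs 5-7)
```

Keeping the metadata list in the same order as the vectors added to the index means a FAISS result id can be used directly as a position in that list.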


## Step 1: Load Chunks from Previous Notebook

Load the chunked data (or regenerate if needed).


In [None]:
# Load chunks (either from memory or saved intermediate file)
# If needed, re-run chunking from notebook 02.


## Step 2: Build Full Embeddings & FAISS Index

Embed all chunks and build the FAISS index.


In [None]:
# === TODO (you code this) ===
# Build full embeddings and FAISS index; persist to data/index/.

from src.embed_index import embed_texts, build_faiss_index, save_index

# 1. Embed all chunk texts
# 2. Build FAISS index (IndexFlatIP with normalized vectors)
# 3. Save index and metadata to data/index/


## Step 3: Load Index & Test Retrieval

Load the saved index and test retrieval with example queries.


In [None]:
# === TODO (you code this) ===
# Load index & metadata; test a few queries.

from src.embed_index import load_index
from src.retrieve import retrieve

# 1. Load index and metadata
# 2. Load embedding model
# 3. Test retrieval with example queries


## Example Queries & Manual Relevance Check

Test with queries like:

- "How does Homer portray Achilles' anger in Book 1?"
- "What does Lord Henry claim about influence on the young?"
- "Where does the poem describe the shield of Achilles?"

For each query, manually judge whether the retrieved snippets are relevant. This helps validate:

1. Embedding quality (semantic similarity)
2. Chunk size appropriateness (not too fragmented, not too broad)
3. Retrieval ranking (most relevant chunks appear first)
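
Manual judgments are easier to compare across runs if they are tabulated. The sketch below is one simple way to do that; the helper names are hypothetical, and the labels are placeholders that only show the data shape, not real judgments from this corpus. It counts only "relevant" as a hit and reports the rank of the first relevant chunk, which speaks to point 3 above.

```python
from typing import List, Optional

# Labels are listed in rank order (position 0 is the top-ranked chunk).
# PLACEHOLDER values for illustration only; replace with your own judgments.
judgments = {
    "example query A": ["relevant", "partially relevant", "not relevant", "relevant"],
    "example query B": ["not relevant", "relevant", "relevant"],
}


def precision_at_k(labels: List[str], k: int) -> float:
    """Fraction of the top-k results that were judged fully relevant."""
    top = labels[:k]
    if not top:
        return 0.0
    return sum(label == "relevant" for label in top) / len(top)


def first_relevant_rank(labels: List[str]) -> Optional[int]:
    """1-based rank of the first relevant result, or None if there is none."""
    for rank, label in enumerate(labels, start=1):
        if label == "relevant":
            return rank
    return None


for query, labels in judgments.items():
    print(query, precision_at_k(labels, 3), first_relevant_rank(labels))
```

A first relevant rank that is often greater than 1 points to a ranking problem, while many "partially relevant" labels may suggest that chunks are too broad or too fragmented.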


In [None]:
# Test queries and display top-k results
# For each query, show:
# - Query text
# - Top 3-5 retrieved chunks with scores
# - Manual relevance judgment (relevant/partially relevant/not relevant)


## Summary

At this point, you should have:

- ✅ Full FAISS index built and saved to `data/index/`
- ✅ Metadata persisted alongside the index
- ✅ Retrieval tested with example queries
- ✅ Manual validation that retrieved chunks are relevant

**Next notebook**: Build a small QA evaluation set, test answer composition, and wire up the Gradio demo.
