# Notebook 02: Chunking Strategy & Embedding Preview

## Goal

Split cleaned text into overlapping chunks, then generate embeddings for a sample to verify the pipeline.


## Chunking Strategy Trade-offs

### Size

- **Too small** (< 400 chars): Loses context, may break sentences
- **Too large** (> 1200 chars): Dilutes relevance, harder to retrieve precise passages
- **Sweet spot**: ~800 characters balances context and precision

### Overlap

- **No overlap**: Risk of splitting important passages across chunk boundaries
- **Large overlap** (> 50%): Redundant storage, slower retrieval
- **Moderate overlap** (~15%): Preserves continuity without excessive redundancy

### Our Approach

1. Split into paragraphs first (preserve natural boundaries)
2. Accumulate paragraphs until we reach `chunk_size` characters
3. Start each new chunk `chunk_overlap` characters before the previous one ends, so the window advances by `chunk_size - chunk_overlap` characters
4. Attach metadata: book name, paragraph indices, character span
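The steps above can be sketched roughly as follows. This is a simplified illustration, not the implementation in `src.chunk`: the function name `chunk_paragraphs_sketch`, the id format, and the metadata keys are all assumptions, and it slides a fixed character window over the joined paragraphs rather than accumulating whole paragraphs. The defaults use the values discussed above (800 characters, with 15% overlap = 120 characters).

```python
def chunk_paragraphs_sketch(paragraphs, book, chunk_size=800, chunk_overlap=120):
    """Hypothetical stand-in for src.chunk.chunk_paragraphs (names are assumptions)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Join paragraphs, remembering the character span each one occupies.
    sep = "\n\n"
    spans, pos = [], 0
    for p in paragraphs:
        spans.append((pos, pos + len(p)))
        pos += len(p) + len(sep)
    text = sep.join(paragraphs)

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        # Paragraphs whose span intersects the window [start, end).
        para_ids = [i for i, (s, e) in enumerate(spans) if s < end and e > start]
        chunks.append({
            "id": f"{book}-{len(chunks):05d}",
            "text": text[start:end],
            "meta": {"book": book, "paragraphs": para_ids, "char_span": (start, end)},
        })
        if end == len(text):  # last window reached the end; avoid a redundant tail chunk
            break
    return chunks
```

With these defaults, consecutive chunks share 120 characters, and each chunk records which paragraphs it touches so a retrieved passage can be traced back to its place in the book.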


## Why sentence-transformers MiniLM?

We use `sentence-transformers/all-MiniLM-L6-v2` because:

- **Speed**: Fast encoding on CPU (no GPU required)
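- **Quality**: Good general-purpose semantic similarity for its size, including on literary text; outputs 384-dimensional vectors
- **Size**: Small model (~80MB), easy to load
- **Normalization**: `encode(..., normalize_embeddings=True)` returns unit-length vectors, so inner product in FAISS equals cosine similarity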


## Normalization for FAISS

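The two metrics most commonly used with FAISS are:

- **L2 (Euclidean)**: Works on raw vectors; no normalization needed
- **Inner Product (IP)**: Also works on raw vectors, but equals cosine similarity only when the vectors are normalized to unit length

We'll use **Inner Product** with normalized embeddings because:

- It is equivalent to cosine similarity for unit-length vectors
- Scores are easy to interpret: they fall in [-1, 1], and higher means more similar
- For unit-length vectors, IP and L2 produce the same ranking (since ‖a − b‖² = 2 − 2·a·b), so the choice is about interpretability rather than search speed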
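The relationship between inner product, cosine similarity, and L2 distance can be checked numerically. The sketch below uses NumPy only, with random vectors standing in for real embeddings; the helper name `l2_normalize` is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for embeddings; 384 matches the MiniLM output dimension.
vectors = rng.normal(size=(5, 384)).astype("float32")

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against zero vectors)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

unit = l2_normalize(vectors)

# Inner product of the normalized vectors.
ip = unit @ unit.T

# Cosine similarity computed directly from the raw vectors.
raw_norms = np.linalg.norm(vectors, axis=1)
cosine = (vectors @ vectors.T) / np.outer(raw_norms, raw_norms)

# Squared L2 distance between normalized vectors.
sq_l2 = ((unit[:, None, :] - unit[None, :, :]) ** 2).sum(axis=-1)

print(np.allclose(ip, cosine, atol=1e-5))         # IP on unit vectors matches cosine
print(np.allclose(sq_l2, 2 - 2 * ip, atol=1e-4))  # ||a - b||^2 = 2 - 2 a.b
```

In FAISS itself, the corresponding setup would typically be an inner-product index such as `faiss.IndexFlatIP(d)` fed with vectors normalized either at encoding time or with `faiss.normalize_L2`.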


## Step 1: Load Cleaned Text

Load the cleaned text from the previous notebook.


In [None]:
# Load cleaned text from data/interim/
# Use the book name from config to construct the file path.


## Step 2: Split into Paragraphs

Split the cleaned text into paragraphs (double newline boundaries).
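For orientation, splitting on blank lines might look like the sketch below. It is not the `split_into_paragraphs` function from `src.chunk`; the name `split_paragraphs_sketch` is hypothetical, and collapsing whitespace inside each paragraph is an extra assumption that may not match the project's behavior.

```python
import re

def split_paragraphs_sketch(text):
    """Hypothetical helper: split on blank lines and drop empty pieces."""
    # One or more blank (or whitespace-only) lines mark a paragraph boundary.
    parts = re.split(r"\n\s*\n", text.strip())
    # Collapse internal line breaks and runs of spaces into single spaces.
    return [" ".join(p.split()) for p in parts if p.strip()]

sample = "First line\nstill first.\n\n\nSecond.\n \nThird."
print(split_paragraphs_sketch(sample))
```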


In [None]:
# === TODO (you code this) ===
# Split cleaned text into paragraphs.
# Acceptance: list[str], len>100 for full books.

from src.chunk import split_into_paragraphs

# Call split_into_paragraphs and verify the result.


## Step 3: Create Overlapping Chunks

Chunk paragraphs into fixed-size overlapping segments with metadata.


In [None]:
# === TODO (you code this) ===
# Chunk paragraphs into fixed-size overlapping chunks with metadata.
# Acceptance: list[dict] with 'id','text','meta'.

from src.chunk import chunk_paragraphs

# Use chunk_size and chunk_overlap from config.
# Verify chunk structure: each should have id, text, and metadata.


## Step 4: Embed Sample Chunks

Generate embeddings for a small sample to verify shape and variance.
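One way to make "verify shape and variance" concrete is a small checking function like the one below. It is a sketch only: `check_embeddings` and its report keys are hypothetical, and random unit vectors stand in for the output of `embed_texts`, whose exact return type is not shown here.

```python
import numpy as np

def check_embeddings(emb, n_expected, expect_unit_norm=True):
    """Hypothetical sanity check for an (n, d) embedding matrix."""
    emb = np.asarray(emb, dtype=np.float32)
    if emb.ndim != 2 or emb.shape[0] != n_expected:
        raise ValueError(f"expected shape ({n_expected}, d), got {emb.shape}")
    if not np.isfinite(emb).all():
        raise ValueError("embeddings contain NaN or inf")
    norms = np.linalg.norm(emb, axis=1)
    if expect_unit_norm and not np.allclose(norms, 1.0, atol=1e-3):
        raise ValueError("embeddings are not unit length")

    report = {
        "shape": emb.shape,
        # Near-zero variance would suggest all chunks map to the same vector.
        "mean_dim_variance": float(emb.var(axis=0).mean()),
    }
    if len(emb) > 1:
        sims = emb @ emb.T
        off_diag = sims[~np.eye(len(emb), dtype=bool)]
        # A value of ~1.0 here would indicate (near-)duplicate chunks.
        report["max_offdiag_similarity"] = float(off_diag.max())
    return report

# Stand-in data: 10 random unit vectors of dimension 384.
rng = np.random.default_rng(0)
fake = rng.normal(size=(10, 384))
fake /= np.linalg.norm(fake, axis=1, keepdims=True)
report = check_embeddings(fake, n_expected=10)
print(report)
```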


In [None]:
# === TODO (you code this) ===
# Embed a small sample of chunks to sanity-check shape and variance.
# Acceptance: embedding array shape == (n_sample, d), where d == 384 for all-MiniLM-L6-v2.

from src.embed_index import embed_texts

# Take first 10 chunks, embed them, check shape and maybe compute pairwise similarity.


## Optional: 2D Projection Visualization

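Use PCA or t-SNE to project embeddings to 2D and visualize chunk similarity. This helps verify that semantically similar chunks cluster together. With only a handful of sample chunks, PCA is the safer choice: t-SNE's `perplexity` must be smaller than the number of samples, so scikit-learn's default of 30 will not work on a 10-chunk sample.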
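As a dependency-light alternative to scikit-learn, a PCA projection can be computed with NumPy's SVD. The sketch below assumes the embeddings are available as an `(n, d)` array; `pca_2d` is a hypothetical helper and random data stands in for real embeddings.

```python
import numpy as np

def pca_2d(emb):
    """Project rows of emb onto their first two principal components."""
    x = np.asarray(emb, dtype=np.float64)
    x = x - x.mean(axis=0, keepdims=True)  # center each dimension
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(10, 384))  # stand-in for real embeddings
coords = pca_2d(fake_embeddings)
print(coords.shape)
```

The resulting `coords` can be passed to a scatter plot, for example `plt.scatter(coords[:, 0], coords[:, 1])` with matplotlib, labeling each point with its chunk id.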


In [None]:
# Optional: 2D projection of embeddings
# from sklearn.manifold import TSNE
# import matplotlib.pyplot as plt
# 
# # Project and plot embeddings


## Token Coverage vs. Character Chunks

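**Note**: We're using character-based chunking (not token-based). This is simpler, but a chunk boundary can fall in the middle of a word, and a chunk's length in characters only approximates its length in tokens. The token count matters because `all-MiniLM-L6-v2` truncates input longer than 256 word pieces by default; an 800-character chunk of English prose typically stays under that limit. For production, consider token-aware chunking (e.g., measuring chunk length with the model's tokenizer and cutting only at word boundaries).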

For this project, character chunks work well because:

- Literary text has consistent word boundaries
- Simpler implementation
- Overlap mitigates boundary issues
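If mid-word cuts become a problem, a lightweight middle ground between pure character chunking and full token-aware chunking is to move each cut point back to the nearest whitespace. The helper below is a hypothetical sketch of that heuristic (the name `snap_to_whitespace` and the `max_back` parameter are assumptions); it does not use a tokenizer.

```python
def snap_to_whitespace(text, cut, max_back=40):
    """Move a cut index left to a word boundary, looking back at most max_back chars.

    Returns an index i such that text[:i] does not end in the middle of a word,
    or the original cut if no whitespace is found within the look-back range.
    """
    if cut >= len(text):
        return len(text)
    lowest = max(cut - max_back, 0)
    for i in range(cut, lowest, -1):
        # A boundary exists if the character on either side of i is whitespace.
        if text[i - 1].isspace() or text[i].isspace():
            return i
    return cut

text = "alpha beta gamma"
print(repr(text[:snap_to_whitespace(text, 8)]))   # cut inside "beta" moves back
print(repr(text[:snap_to_whitespace(text, 10)]))  # cut already on a boundary
```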


## Summary

At this point, you should have:

- ✅ Paragraphs split from cleaned text
- ✅ Overlapping chunks with metadata
- ✅ Sample embeddings verified (correct shape, reasonable variance)

**Next notebook**: Build the full FAISS index and test retrieval.
