Indexing is the process of preparing and organizing data (like documents, web pages, or text chunks) so that it can be efficiently searched and retrieved later.

### Problem Statement

RAG systems rely on effective indexing. Weak indexing can bury relevant content, omit key context, or fragment meaning—leading to incomplete or misleading results.

**Example:**

User query:
- "How do I reset my password?"

Problem:
- The document containing the answer says: “Click ‘Forgot Password’ on the login screen.”
- But the phrase “reset my password” (or any variant of “reset”) doesn't appear anywhere in the text.

Potential bad outcome:
- The retriever misses the relevant document.
- The LLM responds: “Sorry, I couldn’t find anything about that,” even though the answer exists in the corpus.
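To make the mismatch concrete, here is a minimal sketch that assumes a purely keyword-based retriever. The `tokenize` and `keyword_overlap` helpers and the stop-word list are hypothetical and exist only for this illustration.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

# Illustrative stop-word list for this toy retriever (an assumption).
STOP_WORDS = {"how", "do", "i", "my", "on", "the", "a", "to"}

def keyword_overlap(query: str, document: str) -> set[str]:
    """Return the content words shared by the query and the document."""
    return (tokenize(query) & tokenize(document)) - STOP_WORDS

query = "How do I reset my password?"
document = "Click 'Forgot Password' on the login screen."

shared = keyword_overlap(query, document)
has_reset = "reset" in tokenize(document)

print(shared)     # the only shared content word is 'password'
print(has_reset)  # False: the key verb of the query never appears
```

With only one weak keyword in common, any other document that mentions passwords more often could easily outrank the one that actually holds the answer.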

### Strategy #1: Chunking Optimization

**Idea**:

### Strategy #2: Multi-Representation Indexing

![Image](rsc/jupyter/multi_representational_indexing.png)

**Idea**: Instead of storing one embedding per document chunk, you store several, each capturing a different perspective, style, or abstraction level (e.g. original text, LLM short summary, LLM semantic summary, keyword list, etc.).

_**Note**: We'll focus on a simple case of this technique, in which we store a semantic summary of the document in our vector DB and use it to retrieve the original document (which we store separately in a document DB)._
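A minimal sketch of that simple case, under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the summaries are hand-written placeholders for LLM output, and plain Python containers stand in for the vector DB and document DB.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# "Document DB": doc_id -> full original text.
doc_store = {
    "doc-1": "Click 'Forgot Password' on the login screen.",
    "doc-2": "Invoices are emailed on the first day of each month.",
}

# Placeholder summaries; in practice an LLM would generate these.
summaries = {
    "doc-1": "How to reset your password if you forgot it.",
    "doc-2": "When billing invoices are sent.",
}

# "Vector DB": only the summary embeddings are indexed, keyed by doc_id.
vector_index = [(doc_id, embed(text)) for doc_id, text in summaries.items()]

def retrieve(query: str) -> str:
    """Search over summary embeddings, then return the original document."""
    q = embed(query)
    best_id, _ = max(vector_index, key=lambda item: cosine(q, item[1]))
    return doc_store[best_id]

result = retrieve("How do I reset my password?")
print(result)
```

The query matches the summary, which mentions "reset", while the text returned to the LLM is the untouched original document.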

### Strategy #3: RAPTOR

![Image](rsc/jupyter/raptor_indexing.png)

**Idea**: Build a hierarchy of summaries on top of the chunks, so that retrieval can draw on both fine-grained passages and higher-level overviews. RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) embeds the chunks, clusters similar ones, summarizes each cluster with an LLM, and repeats the process on the summaries until a tree is formed.

Traditional indexing splits documents into fixed-length chunks (e.g., 512 tokens) and retrieves only from those flat chunks, which can:
- Cut across logical boundaries (e.g., mid-sentence or mid-paragraph)
- Separate relevant context
- Leave broad questions that span many chunks without any single good match

RAPTOR addresses this by building the tree bottom-up:
- Embed the leaf chunks
- Cluster chunks with similar embeddings
- Summarize each cluster with an LLM
- Repeat on the summaries until no further grouping is useful

Then, it indexes every level of the tree, such as:
- Original text (leaf chunks)
- LLM-generated cluster summaries
- Hierarchical relationships (summary > child chunks)
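As a rough illustration of indexing summaries and hierarchical relationships alongside the original text, the sketch below builds a small summary tree in the way the RAPTOR paper describes: cluster similar chunks, summarize each cluster, and repeat. Everything here is a simplifying assumption: `embed` is a toy bag-of-words model, `cluster` is a greedy threshold grouping rather than the paper's clustering method, and `summarize` is a placeholder where an LLM call would go.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    children: list["Node"] = field(default_factory=list)

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def summarize(texts: list[str]) -> str:
    """Placeholder for an LLM summary: simply joins the child texts."""
    return " / ".join(texts)

def cluster(nodes: list[Node], threshold: float) -> list[list[Node]]:
    """Greedy grouping: join the first cluster whose seed is similar enough."""
    clusters: list[list[Node]] = []
    for node in nodes:
        for group in clusters:
            if cosine(embed(node.text), embed(group[0].text)) >= threshold:
                group.append(node)
                break
        else:
            clusters.append([node])
    return clusters

def build_tree(chunks: list[str], threshold: float = 0.1) -> list[list[Node]]:
    """Return the tree as a list of levels, from leaves up to the root."""
    level = [Node(text) for text in chunks]
    levels = [level]
    while len(level) > 1:
        groups = cluster(level, threshold)
        if len(groups) == len(level):  # nothing merged: collapse into one root
            groups = [level]
        level = [Node(summarize([n.text for n in g]), children=g) for g in groups]
        levels.append(level)
    return levels

chunks = [
    "Tesla was founded in 2003 by engineers.",
    "Tesla makes electric vehicles and solar panels.",
    "Click 'Forgot Password' on the login screen.",
    "The login screen also links to account settings.",
]
levels = build_tree(chunks)
all_nodes = [node for level in levels for node in level]  # every level is indexed
print([len(level) for level in levels])  # [4, 2, 1]
```

Indexing `all_nodes` rather than only the leaves lets a broad question match a summary node, while a specific question can still match a leaf chunk.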

### Strategy #4: ColBERT

**Idea**: Enable fine-grained, late-interaction retrieval, where each token in a query is matched against each token in a document. This allows more precise and expressive retrieval than a single-vector comparison, and it stays efficient because document token embeddings are computed offline, independently of the query.

With late interaction, the model doesn't collapse the input into a single vector. Instead:
- The query is encoded into a sequence of token vectors: `q₁, q₂, ..., qₙ`
- The document is encoded into a sequence of token vectors: `d₁, d₂, ..., dₘ`
- Each query token is compared with each document token using cosine similarity.

Then, for each query token, you take the maximum similarity across all document tokens (this is the `MaxSim` operation), and sum the results to get the final score.
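In other words, the score is the sum over query tokens `qᵢ` of the maximum over document tokens `dⱼ` of `cos(qᵢ, dⱼ)`. A minimal sketch of that scoring rule follows. The 2-dimensional vectors are made up for illustration and are not real ColBERT embeddings, and `maxsim_score` is a hypothetical helper name.

```python
import math

Vector = list[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def maxsim_score(query_vecs: list[Vector], doc_vecs: list[Vector]) -> float:
    """For each query token take its best-matching document token, then sum."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

# Toy token vectors (assumptions, chosen by hand).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has an exact match per query token
doc_b = [[1.0, 1.0]]                          # only a partial match for both

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)  # doc_a scores higher than doc_b
```

Because each query token picks its own best match, a document is rewarded for covering every part of the query rather than for being similar on average.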

**Example:**

Query:
- “When did Tesla begin?”

Two document chunks:
1. “Tesla was founded in 2003 by engineers...”
2. “Tesla makes electric vehicles and solar panels.”

In ColBERT:
- "when" in the query might match "2003" in document 1.
- "begin" might match "founded" in document 1.
- These specific token matches dominate the score, so document 1 ranks above document 2, whose only strong match is "Tesla".