In [None]:
# Indexing is the process of preparing and organizing data (like documents, web pages, or text chunks) so that it can be efficiently searched and retrieved later.

### Problem Statement

RAG systems rely on effective indexing. Weak indexing can bury relevant content, omit key context, or fragment meaning—leading to incomplete or misleading results.

**Example:**

User query:
- "How do I reset my password?"

Problem:
- The document containing the answer says: “Click ‘Forgot Password’ on the login screen.”
- But the phrase “reset your password” doesn't appear anywhere in the text.

Potential bad outcome:
- The retriever misses the relevant document.
- The LLM responds: “Sorry, I couldn’t find anything about that,” even though the answer exists in the corpus.

### Strategy #1: Chunking Optimization

**Idea**:

In [None]:
# todo: not covered in course; do separately
# https://www.youtube.com/watch?v=8OJC21T2SL4

### Strategy #2: Multi-Representation Indexing

![Image](rsc/jupyter/multi_representational_indexing.png)

**Idea**: Instead of storing one embedding per document chunk, you store several, each capturing a different perspective, style, or abstraction level (e.g. original text, LLM short summary, LLM semantic summary, keyword list, etc.).

_**Note**: We'll focus on a simple case of this technique, in which we store a semantic summary of the document in our vector DB and use it to retrieve the original document (which we store separately in a document DB)._
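
A minimal sketch of the simple case described in the note, using only the standard library. Everything here is an illustrative assumption: the bag-of-words `embed` function stands in for a real embedding model, the `summaries` are hand-written stand-ins for LLM output, and the plain `docstore` dict and `vector_index` list stand in for a document DB and a vector DB.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: a real system calls an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Document store: the full original documents, keyed by id.
docstore = {
    "doc-1": "Click 'Forgot Password' on the login screen.",
    "doc-2": "Invoices are emailed on the first day of each month.",
}

# Hand-written stand-ins for LLM-generated semantic summaries (hypothetical).
summaries = {
    "doc-1": "How to reset your password if you cannot log in.",
    "doc-2": "When and how billing invoices are sent to customers.",
}

# Vector index: one (doc_id, summary embedding) entry per document.
vector_index = [(doc_id, embed(summary)) for doc_id, summary in summaries.items()]


def retrieve(query: str) -> str:
    """Match the query against the summaries, then return the ORIGINAL document."""
    q = embed(query)
    best_id, _ = max(vector_index, key=lambda entry: cosine(q, entry[1]))
    return docstore[best_id]


result = retrieve("How do I reset my password?")
print(result)
```

The key design point is the shared `doc_id`: the summary is what gets searched, but the id links back to the full document, so the LLM still receives the complete original text as context.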

### Strategy #3: RAPTOR

![Image](rsc/jupyter/raptor_indexing.png)

**Idea**: Create a hierarchy of document summaries, with high-level summaries at the top of the hierarchy, lower-level summaries in the middle, and the original documents at the bottom. When user queries come in:
- Abstract high-level queries are handled by the high-level and mid-level summaries.
- Specific low-level queries are handled by the original documents.

---

Goals of RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval):
1. Create meaningful, retriever-friendly chunks (not arbitrary token splits).
2. Organize them hierarchically to reflect document structure.
3. Use LLM-generated summaries to describe higher-level sections.

At index time:
- Chunk documents semantically using heuristics or LLMs.
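- Organize chunks into a tree (conceptually paragraphs → sections → document; the RAPTOR paper builds the tree bottom-up by recursively clustering chunk embeddings and summarizing each cluster).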
- Summarize each node with an LLM.
- Index each level’s summary as a separate vector.

At query time:
- You can match the query against all levels.
- Traverse the hierarchy based on relevance.
- Retrieve the most informative chunk(s) at the right level of detail.
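
The index-time and query-time steps above can be sketched roughly as follows. This is a toy under stated assumptions, not the RAPTOR reference implementation: `summarize` is a hypothetical stand-in for an LLM call, grouping neighbouring nodes stands in for clustering, and the bag-of-words `embed` stands in for a real embedding model.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    level: int  # 0 = original chunk, higher = more abstract summary
    children: list = field(default_factory=list)


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def summarize(texts: list) -> str:
    """Stand-in for an LLM summary: keeps the first sentence of each child."""
    return " ".join(t.split(".")[0].strip() + "." for t in texts)


def build_tree(chunks: list, group_size: int = 2) -> list:
    """Return all levels, leaves first. Grouping neighbours is a stand-in for clustering."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    levels = [[Node(text=c, level=0) for c in chunks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        parents = []
        for i in range(0, len(prev), group_size):
            group = prev[i:i + group_size]
            summary = summarize([n.text for n in group])
            parents.append(Node(text=summary, level=len(levels), children=group))
        levels.append(parents)
    return levels


def retrieve(levels: list, query: str, k: int = 1) -> list:
    """Score the query against nodes from every level and return the top k."""
    q = embed(query)
    all_nodes = [node for level in levels for node in level]
    return sorted(all_nodes, key=lambda n: cosine(q, embed(n.text)), reverse=True)[:k]


chunks = [
    "Tesla was founded in 2003. The founders were engineers.",
    "Tesla makes electric vehicles. It also sells solar panels.",
    "SpaceX was founded in 2002. It builds rockets.",
    "SpaceX launches satellites. Starlink provides internet access.",
]
levels = build_tree(chunks)
top = retrieve(levels, "Starlink internet access")[0]
print([len(level) for level in levels], top.level, top.text)
```

Because `retrieve` searches nodes from all levels at once, a specific query can land on an original chunk while a broader query can land on a summary node higher in the tree.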

In [None]:
# todo: revisit, implement later
# Reference: https://github.com/langchain-ai/langchain/blob/master/cookbook/RAPTOR.ipynb

### Strategy #4: ColBERT

![Image](rsc/jupyter/colbert_indexing.png)

**Idea**: A fundamentally different algorithm for determining document-query similarity: each token in a query is matched against each token in a document, allowing for more precise and expressive retrieval (fine-grained, late-interaction retrieval).

---

With late interaction, the model doesn't collapse the input into a single vector. Instead:
- Query is encoded into a sequence of token vectors: `q₁, q₂, ..., qₙ`
- Document is encoded into a sequence of token vectors: `d₁, d₂, ..., dₘ`
- You match each query token against each document token, using cosine similarity.

Then, for each query token, you take the maximum similarity across all document tokens (this is the `MaxSim` operation), and sum the results to get the final score.
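
Written as a formula, using the token vectors defined above and cosine similarity as the per-token score (the name $S$ is introduced here only for illustration):

$$
S(q, d) = \sum_{i=1}^{n} \max_{1 \le j \le m} \cos(q_i, d_j)
$$

The inner $\max$ is the `MaxSim` step for a single query token $q_i$; the outer sum adds up one such value per query token.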

**Example:**

Query:
- “When did Tesla begin?”

Two document chunks:
1. “Tesla was founded in 2003 by engineers...”
2. “Tesla makes electric vehicles and solar panels.”

In ColBERT:
- "when" in the query might match "2003" in document 1.
- "begin" might match "founded" in document 1.
- These specific token matches dominate the score, so document 1 ranks above document 2, which has no token close to "when" or "begin".
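
To make the scoring concrete, here is a small sketch of the MaxSim computation on this example. The 4-dimensional vectors in `TOY_VECTORS` are hand-made assumptions chosen so that related words point in similar directions; a real ColBERT model produces contextual token embeddings from a trained encoder. For brevity, only a few tokens per text are included.

```python
import math

# Hand-crafted toy token vectors (assumption, not model output).
# Axes are loosely: [company, time, origin, product].
TOY_VECTORS = {
    "tesla":    (1.0, 0.0, 0.0, 0.0),
    "when":     (0.0, 1.0, 0.0, 0.0),
    "begin":    (0.0, 0.0, 1.0, 0.0),
    "2003":     (0.0, 0.9, 0.1, 0.0),
    "founded":  (0.0, 0.1, 0.9, 0.0),
    "makes":    (0.2, 0.0, 0.1, 0.9),
    "vehicles": (0.1, 0.0, 0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def maxsim_score(query_tokens, doc_tokens):
    """For each query token, take its best match in the document, then sum."""
    query_vecs = [TOY_VECTORS[t] for t in query_tokens]
    doc_vecs = [TOY_VECTORS[t] for t in doc_tokens]
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


query = ["when", "tesla", "begin"]       # "did" omitted to keep the toy small
doc_1 = ["tesla", "founded", "2003"]     # from "Tesla was founded in 2003 by engineers..."
doc_2 = ["tesla", "makes", "vehicles"]   # from "Tesla makes electric vehicles and solar panels."

score_1 = maxsim_score(query, doc_1)
score_2 = maxsim_score(query, doc_2)
print(f"doc 1: {score_1:.3f}, doc 2: {score_2:.3f}")
```

Both documents match "tesla" equally well, so the difference in score comes entirely from "when" and "begin" finding close partners ("2003" and "founded") only in document 1.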