# **Retrieval-Augmented Generation (RAG) and Advanced Indexing Techniques**

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant documents and supplying them as context for the model's response. In this note, we explore multi-representation indexing, RAPTOR, and ColBERT-based retrieval.

## **Indexing and Multi-Representation Indexing**
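Indexing is the process of structuring data for efficient retrieval. A traditional vector store embeds each document (or chunk) once and searches over that single vector. **Multi-representation indexing** decouples what is searched from what is returned: each document is indexed under one or more compact representations, such as a summary, while the full document is kept in a separate store and returned when any of its representations matches the query.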
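
The idea can be sketched without any external services. The example below is a toy illustration, not the LangChain implementation: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the document texts and ids are made up. It shows several searchable representations pointing at the same parent id, with the search returning the parent documents.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Full documents live in a plain key-value store, keyed by a parent id.
docstore = {
    "doc-agents": "Long article about LLM agents: planning, memory, tool use ...",
    "doc-data": "Long article about human annotation and data quality ...",
}

# Several searchable representations can point at the same parent id.
representations = [
    ("agents use short-term and long-term memory", "doc-agents"),
    ("task decomposition and planning for agents", "doc-agents"),
    ("annotator agreement and label quality", "doc-data"),
]
index = [(embed(text), parent_id) for text, parent_id in representations]

def retrieve(query, k=1):
    """Rank the representations, then return the de-duplicated parent documents."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids[:k]]

top = retrieve("memory in agents")
print(top[0])
```

The search runs over the short representations, but the caller receives the full parent text, which is the behavior the rest of this section builds with a real vector store.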


In [None]:
# install dependencies

!pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain youtube-transcript-api pytube python-dotenv beautifulsoup4

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()  

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'

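# Required API keys with validation
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Fail early with a clear message if a key is missing from the environment
required = {"LANGCHAIN_API_KEY": LANGCHAIN_API_KEY, "OPENAI_API_KEY": OPENAI_API_KEY}
missing = [name for name, value in required.items() if not value]
if missing:
    raise EnvironmentError(f"Missing environment variables: {', '.join(missing)}")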


### **Implementing Multi-Representation Indexing**
#### **Step 1: Load and Process Documents**
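We use `WebBaseLoader` from `langchain_community` to fetch two blog posts. The pages are kept whole here, because each one is summarized as a single document in Step 2; `RecursiveCharacterTextSplitter` is imported but is only needed if you also want to index smaller chunks: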


In [None]:
import os
os.environ["USER_AGENT"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"


In [None]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

#### **Step 2: Generate Summaries for Efficient Indexing**
We summarize each document with an LLM. The summaries, rather than the full text, are what gets embedded for search:

In [None]:
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(model="gpt-3.5-turbo", max_retries=0)
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})

#### **Step 3: Store Summaries and Documents in a Vector Store**
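We use **ChromaDB** to store the summary embeddings and an in-memory byte store to hold the full documents. A shared `doc_id` in each summary's metadata links it back to its parent document: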

In [None]:
from langchain.storage import InMemoryByteStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever
import uuid
from langchain_core.documents import Document

vectorstore = Chroma(collection_name="summaries",
                     embedding_function=OpenAIEmbeddings())

store = InMemoryByteStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

doc_ids = [str(uuid.uuid4()) for _ in docs]

summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

#### **Step 4: Querying the Indexed Data**
To retrieve documents related to a query:

In [None]:
query = "Memory in agents"
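# Returns the full parent documents whose summaries best match the query.
# To cap the number of hits, pass search_kwargs={"k": 1} when building the retriever.
retrieved_docs = retriever.invoke(query)
retrieved_docs[0].page_content[:500]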

## **RAPTOR: Advanced Retrieval for RAG**
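[RAPTOR](https://arxiv.org/pdf/2401.18059.pdf) (Recursive Abstractive Processing for Tree-Organized Retrieval) improves retrieval by building a tree of summaries over a corpus. Instead of indexing only flat chunks, RAPTOR recursively embeds, clusters, and summarizes text chunks, so the index holds both fine-grained passages and higher-level summaries.

- **Key Idea**: Leaf nodes are the original chunks; each higher level summarizes clusters of nodes from the level below, so a query can match either a specific detail or a broad summary.
- **Implementation**: Chunks are embedded and clustered, an LLM summarizes each cluster, and the process repeats on the summaries until a root is reached. At query time, retrieval either traverses the tree level by level or searches all nodes at once (the "collapsed tree" approach).

> RAPTOR helps RAG systems answer both detail-oriented and high-level questions from the same index.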
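
The tree-building loop described in the linked RAPTOR paper (recursively group nodes, summarize each group, repeat until one root remains) can be sketched in a few lines. This is a toy illustration under loud assumptions, not the paper's implementation: `summarize` is a hypothetical stand-in for an LLM call, and grouping adjacent nodes replaces the embedding-based clustering the paper uses. The chunk texts are made up.

```python
def summarize(texts):
    """Stand-in for an LLM summary: keep the first three words of each text."""
    return " / ".join(" ".join(t.split()[:3]) for t in texts)

def build_tree(chunks, group_size=2):
    """Return a list of levels. Level 0 holds the raw chunks; each higher level
    summarizes groups of nodes from the level below, ending in a single root.
    Adjacent grouping is an assumption standing in for real clustering."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each level shrinks")
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        nodes = levels[-1]
        groups = [nodes[i:i + group_size] for i in range(0, len(nodes), group_size)]
        levels.append([summarize(group) for group in groups])
    return levels

chunks = [
    "Agents plan tasks by decomposing goals into steps.",
    "Agents store observations in short-term memory.",
    "Long-term memory uses an external vector store.",
    "Tool use lets agents call external APIs.",
]
levels = build_tree(chunks)

# "Collapsed tree" retrieval searches leaves and summaries together,
# so every node from every level goes into one flat candidate list.
all_nodes = [node for level in levels for node in level]
print(len(levels), "levels,", len(all_nodes), "nodes")
```

In a real pipeline each node in `all_nodes` would be embedded and added to a vector store, so a query can land on a leaf chunk or on a summary node.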

---

## **ColBERT: Contextualized Embedding-Based Retrieval**
### **What is ColBERT?**
[ColBERT (Contextualized Late Interaction over BERT)](https://hackernoon.com/how-colbert-helps-developers-overcome-the-limits-of-rag) refines retrieval by representing each token in a passage as a separate embedding instead of a single vector for the document. 

- **Unlike single-vector embeddings**, ColBERT generates one context-aware embedding per token, so each document is represented by many vectors.
- **Late Interaction Mechanism**: Instead of compressing all embeddings into one vector, ColBERT **matches each query token to its most similar passage token** and sums those maximum similarities (the MaxSim operator) to score the passage, improving accuracy.
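
The late-interaction score can be written out directly. The sketch below is a minimal illustration of the MaxSim sum; the 2-D token vectors are hypothetical values chosen by hand, whereas real ColBERT embeddings are learned and much higher-dimensional.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: for each query token take its best-matching
    document token, then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Hypothetical token embeddings, one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]  # close match for both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]              # matches only the first query token well

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a > score_b)
```

Because every query token contributes its own best match, `doc_a` outranks `doc_b` even though both contain an equally good match for the first query token.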

### **Implementing ColBERT in RAG**
We use **RAGatouille**, a simplified interface for ColBERT retrieval.

#### **Step 1: Install and Load ColBERT**

In [None]:
! pip install -U ragatouille

In [None]:
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

#### **Step 2: Index a Wikipedia Document**
We fetch and index a Wikipedia page:

In [None]:
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}

    response = requests.get(URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()  # Fail loudly on HTTP errors
    data = response.json()

    # Extracting page content; returns None if the page has no extract
    page = next(iter(data["query"]["pages"].values()))
    return page.get("extract")

full_document = get_wikipedia_page("Hayao_Miyazaki")

In [None]:
RAG.index(
    collection=[full_document],
    index_name="Miyazaki-123",
    max_document_length=180,
    split_documents=True,
)

#### **Step 3: Perform a Semantic Search with ColBERT**


In [None]:
results = RAG.search(query="What animation studio did Miyazaki found?", k=3)
results

#### **Step 4: Use ColBERT as a LangChain Retriever**


In [None]:
retriever = RAG.as_langchain_retriever(k=3)
retriever.invoke("What animation studio did Miyazaki found?")