# A Definitive Guide to RAPTOR: Implementation and Evaluation with Hugging Face

## A Deep Dive into Hierarchical RAG for Advanced Contextual Retrieval

### Theoretical Introduction: The Problem with Standard RAG

Standard Retrieval-Augmented Generation (RAG) is a powerful technique, but it suffers from a fundamental **abstraction mismatch**. It typically involves:
1.  **Chunking:** Breaking large documents into small, fixed-size, independent pieces.
2.  **Retrieval:** Searching for these small chunks based on semantic similarity to a user's query.

This approach struggles when a query requires high-level, conceptual understanding. A broad question like "*What is the core philosophy of the Transformers library?*" tends to retrieve disparate, low-level snippets that each match part of the query but miss the overarching theme. The system gets "lost in the details."

### The RAPTOR Solution: Building a Tree of Understanding

**RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)** addresses this by creating a multi-level, hierarchical index that mirrors human understanding. The core idea is to build a semantic "tree" of information:

1.  **Leaf Nodes:** Start with initial text chunks (the most granular details).
2.  **Clustering:** Group similar chunks into thematic clusters.
3.  **Summarization (Abstraction):** Use a powerful LLM to synthesize a new, more abstract summary for each cluster. These summaries become the parent nodes.
4.  **Recursion:** Repeat the process. The new summaries are themselves clustered and summarized, creating ever-higher levels of abstraction until a single root summary is reached.
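The four steps above can be sketched as a short loop. This is only a toy illustration, not the implementation used later in this notebook: `pair_cluster` and `join_summary` are hypothetical stand-ins for the real clustering and LLM summarization steps, chosen so the control flow is easy to follow.

```python
from typing import Callable, List


def build_tree(
    texts: List[str],
    cluster: Callable[[List[str]], List[List[str]]],
    summarize: Callable[[List[str]], str],
    max_levels: int = 3,
) -> List[List[str]]:
    """Return one list of node texts per level, leaves first."""
    levels = [list(texts)]
    current = list(texts)
    for _ in range(max_levels):
        if len(current) <= 1:
            break  # a single node is already the root
        groups = cluster(current)                          # step 2: clustering
        current = [summarize(group) for group in groups]   # step 3: abstraction
        levels.append(current)                             # step 4: recurse on summaries
    return levels


# Hypothetical stand-ins: pair up neighbours, "summarize" by joining strings.
def pair_cluster(items: List[str]) -> List[List[str]]:
    return [items[i:i + 2] for i in range(0, len(items), 2)]


def join_summary(group: List[str]) -> str:
    return "(" + "+".join(group) + ")"


tree = build_tree(["a", "b", "c", "d"], pair_cluster, join_summary)
for depth, nodes in enumerate(tree):
    print(depth, nodes)
```

With four leaves, the loop produces two mid-level nodes and then a single root, after which it stops because there is nothing left to cluster.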

The result is a **multi-resolution index**. A single query can now match information at the perfect level of abstraction: a specific detail at the leaf level, a thematic overview at a mid-level branch, or a high-level concept at the top of the tree. This notebook implements this entire process from scratch and then rigorously evaluates its performance against a standard RAG baseline.

--- 
## Part 1: Building the Advanced RAPTOR System

In this first part, we will build our full RAPTOR-powered RAG system. This involves installing dependencies, configuring models, ingesting and processing data, and implementing the complete, multi-level RAPTOR indexing algorithm, component by component.

### Step 1.1: Installing Dependencies

This first step ensures that all the necessary libraries are installed in our environment. Each library plays a specific role in the overall architecture of our system.

In [None]:
# This command installs all the necessary packages for this notebook.
# langchain libraries form the core framework for building our RAG applications.
# sentence-transformers is for our high-quality, open-source embedding model.
# transformers, torch, accelerate, and bitsandbytes are for running the local LLM efficiently.
# faiss-cpu provides a fast, local vector store for indexing our documents.
# umap-learn and scikit-learn are essential for the advanced clustering algorithm.
# beautifulsoup4 is used for parsing HTML content during the web scraping phase.
%pip install -q -U langchain langchain-community langchain-huggingface sentence-transformers
%pip install -q -U transformers torch accelerate bitsandbytes
%pip install -q -U faiss-cpu umap-learn scikit-learn beautifulsoup4 matplotlib

### Step 1.2: Model Configuration

We will configure our open-source models from the Hugging Face Hub. A RAG system has two main model components:
- **Embedding Model:** Converts text into numerical vectors. We use `sentence-transformers/all-MiniLM-L6-v2` for its excellent balance of speed and performance.
- **Language Model (LLM):** Generates summaries and final answers. We use `mistralai/Mistral-7B-Instruct-v0.2` for its strong reasoning capabilities. We load it in 4-bit precision to make it accessible on consumer-grade GPUs.

In [None]:
import torch
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_huggingface import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# --- Configure Embedding Model ---
# This model will be used to convert all our text chunks and summaries into vectors.
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
# Run on the GPU when one is available; otherwise fall back to the CPU.
model_kwargs = {"device": "cuda" if torch.cuda.is_available() else "cpu"}
# Initialize the embedding model using LangChain's wrapper.
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name, model_kwargs=model_kwargs)

# --- Configure LLM for Summarization and Generation ---
llm_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Define the quantization configuration to load the model in 4-bit precision.
# This drastically reduces the memory footprint.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

# Load the tokenizer associated with the LLM.
tokenizer = AutoTokenizer.from_pretrained(llm_id)
# Load the LLM with the specified quantization configuration.
model = AutoModelForCausalLM.from_pretrained(
    llm_id, 
    torch_dtype=torch.float16, 
    device_map="auto",
    quantization_config=quantization_config
)

# Create a text-generation pipeline using the loaded model and tokenizer.
pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=512 # Controls the max length of the generated summaries and answers
)

# Wrap the pipeline in LangChain's HuggingFacePipeline for seamless integration.
llm = HuggingFacePipeline(pipeline=pipe)

print("Models configured successfully.")

Models configured successfully.


### Step 1.3: Data Ingestion and Preparation

We crawl the Hugging Face documentation to build our knowledge base, targeting several key sections to gather a rich and diverse set of documents.

In [None]:
from langchain_community.document_loaders import RecursiveUrlLoader
from bs4 import BeautifulSoup as Soup

# Define the documentation sections to scrape, with varying crawl depths.
urls_to_load = [
    {"url": "https://huggingface.co/docs/transformers/index", "max_depth": 3},
    {"url": "https://huggingface.co/docs/datasets/index", "max_depth": 2},
    {"url": "https://huggingface.co/docs/tokenizers/index", "max_depth": 2},
    {"url": "https://huggingface.co/docs/peft/index", "max_depth": 1},
    {"url": "https://huggingface.co/docs/accelerate/index", "max_depth": 1}
]

docs = []
# Iterate through the list and crawl each documentation section.
for item in urls_to_load:
    # Initialize the loader with the specific URL and parameters.
    loader = RecursiveUrlLoader(
        url=item["url"],
        max_depth=item["max_depth"],
        extractor=lambda x: Soup(x, "html.parser").text, # Use BeautifulSoup to extract text
        prevent_outside=True, # Ensure we stay within the documentation pages
        use_async=True, # Use asynchronous requests for faster crawling
        timeout=600, # Set a generous timeout for slow pages
    )
    # Load the documents and add them to our master list.
    loaded_docs = loader.load()
    docs.extend(loaded_docs)
    print(f"Loaded {len(loaded_docs)} documents from {item['url']}")

print(f"\nTotal documents loaded: {len(docs)}")

Loaded 68 documents from https://huggingface.co/docs/transformers/index
Loaded 35 documents from https://huggingface.co/docs/datasets/index
Loaded 21 documents from https://huggingface.co/docs/tokenizers/index
Loaded 12 documents from https://huggingface.co/docs/peft/index
Loaded 9 documents from https://huggingface.co/docs/accelerate/index

Total documents loaded: 145


#### Creating Leaf Nodes: Initial Chunking

The raw documents (web pages) are too large and unstructured. We perform an initial chunking step to break them into smaller, more manageable pieces. These chunks will form the **leaf nodes** (Level 0) of our RAPTOR tree, representing the most granular level of information.
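To make the roles of chunk size and overlap concrete, here is a simplified sliding-window sketch over a list of tokens. It is an assumption-laden illustration only: the `RecursiveCharacterTextSplitter` used in this notebook splits on a hierarchy of separators and then merges pieces, so its exact boundaries will differ. The function name `chunk_with_overlap` is hypothetical.

```python
from typing import List


def chunk_with_overlap(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the final token of the previous one, which is the same idea as `chunk_overlap=100` at a much smaller scale.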

In [None]:
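# The splitter lives in the langchain-text-splitters package (installed as a dependency of langchain).
from langchain_text_splitters import RecursiveCharacterTextSplitter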

# Extract the raw text content from the loaded LangChain Document objects.
docs_texts = [d.page_content for d in docs]

# Concatenate all document texts into one large string for efficient splitting.
concatenated_content = "\n\n --- \n\n".join(docs_texts)

# Create an instance of the text splitter.
# We use `from_huggingface_tokenizer` to ensure the chunking is aware of token boundaries.
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=tokenizer,
    chunk_size=1000, # Define the maximum size of each chunk in tokens.
    chunk_overlap=100  # Define the overlap between consecutive chunks to maintain context.
)

# Split the concatenated text into our leaf node documents.
leaf_texts = text_splitter.split_text(concatenated_content)

print(f"Created {len(leaf_texts)} leaf nodes (chunks) for the RAPTOR tree.")

Created 412 leaf nodes (chunks) for the RAPTOR tree.


### Step 1.4: The Core RAPTOR Algorithm - A Component-by-Component Breakdown

We will now implement the sophisticated clustering approach from the RAPTOR paper. Each logical part of the algorithm is defined in its own cell for maximum clarity.

#### Component 1: Dimensionality Reduction with UMAP

**What it is:** UMAP (Uniform Manifold Approximation and Projection) is a technique for reducing the number of dimensions in our data.

**Why we need it:** Text embeddings exist in a very high-dimensional space (e.g., 384 dimensions for our model). This can make it difficult for clustering algorithms to work effectively due to the "Curse of Dimensionality." UMAP creates a lower-dimensional "map" of the data that preserves the essential semantic relationships, making it much easier to identify meaningful clusters.
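The effect can be seen in a small experiment with random points. This is a rough sketch under simplifying assumptions: it uses uniformly random vectors and Euclidean distance rather than real embeddings and cosine distance, and the helper `relative_contrast` is a hypothetical name. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random


def relative_contrast(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


low = relative_contrast(dim=2)
high = relative_contrast(dim=384)
print(f"relative contrast in   2 dimensions: {low:.2f}")
print(f"relative contrast in 384 dimensions: {high:.2f}")
```

In two dimensions the nearest neighbour is far closer than the farthest point, while in 384 dimensions all distances bunch together. That loss of contrast is what makes density-based grouping harder on raw embeddings.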

**How it works:** We define two functions: `global_cluster_embeddings` for a broad, initial reduction, and `local_cluster_embeddings` for a more fine-grained reduction within already identified clusters.

In [None]:
from typing import Dict, List, Optional, Tuple
import numpy as np
import pandas as pd
import umap
from sklearn.mixture import GaussianMixture

# Define a random seed for reproducibility of UMAP and GMM
RANDOM_SEED = 224

def global_cluster_embeddings(embeddings: np.ndarray, dim: int, n_neighbors: Optional[int] = None, metric: str = "cosine") -> np.ndarray:
    """Perform global dimensionality reduction on the embeddings using UMAP."""
    # Heuristically set n_neighbors if not provided
    if n_neighbors is None:
        n_neighbors = int((len(embeddings) - 1) ** 0.5)
    # Return the UMAP-transformed embeddings
    return umap.UMAP(
        n_neighbors=n_neighbors, 
        n_components=dim, 
        metric=metric, 
        random_state=RANDOM_SEED
    ).fit_transform(embeddings)

def local_cluster_embeddings(embeddings: np.ndarray, dim: int, num_neighbors: int = 10, metric: str = "cosine") -> np.ndarray:
    """Perform local dimensionality reduction on the embeddings using UMAP."""
    # Return the UMAP-transformed embeddings for a local cluster
    return umap.UMAP(
        n_neighbors=num_neighbors, 
        n_components=dim, 
        metric=metric, 
        random_state=RANDOM_SEED
    ).fit_transform(embeddings)

print("Dimensionality reduction functions defined.")

Dimensionality reduction functions defined.


#### Component 2: Optimal Cluster Number Detection

**What it is:** A function to automatically determine the best number of clusters for a given set of data points.

**Why we need it:** Manually setting the number of clusters (`k`) is inefficient and often incorrect. A data-driven approach is far more robust. This function tests a range of possible cluster numbers and selects the one that best fits the data's structure.

**How it works:** It uses a Gaussian Mixture Model (GMM) and evaluates each potential number of clusters using the **Bayesian Information Criterion (BIC)**. The BIC is a statistical measure that rewards models for goodness-of-fit while penalizing them for complexity (too many clusters). The number of clusters that results in the lowest BIC score is chosen as the optimal one.
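For reference, the BIC is commonly written as `p * ln(n) - 2 * ln(L)`, where `p` is the number of free parameters, `n` the number of samples, and `L` the maximized likelihood. The sketch below applies that formula to a GMM with full covariance matrices (the scikit-learn default). The log-likelihood values are invented purely for illustration, and the helper names are hypothetical.

```python
import math


def gmm_param_count(n_components: int, dim: int) -> int:
    """Free parameters of a full-covariance GMM."""
    means = n_components * dim
    covariances = n_components * dim * (dim + 1) // 2  # symmetric matrices
    weights = n_components - 1                          # weights sum to 1
    return means + covariances + weights


def bic(total_log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Lower is better: the complexity penalty minus twice the log-likelihood."""
    return n_params * math.log(n_samples) - 2.0 * total_log_likelihood


# Made-up total log-likelihoods for k = 1, 2, 3 on 412 points in 10 dimensions.
candidates = {1: -5200.0, 2: -4100.0, 3: -4050.0}
scores = {k: bic(ll, gmm_param_count(k, dim=10), n_samples=412) for k, ll in candidates.items()}
best_k = min(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: params={gmm_param_count(k, 10)}, BIC={score:.1f}")
print("best k:", best_k)
```

In this made-up case the third component improves the fit only slightly, so its extra parameters cost more than they gain and the two-component model has the lowest score.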

In [None]:
def get_optimal_clusters(embeddings: np.ndarray, max_clusters: int = 50) -> int:
    """Determine the optimal number of clusters using the Bayesian Information Criterion (BIC)."""
    # Limit the max number of clusters to be less than the number of data points
    max_clusters = min(max_clusters, len(embeddings))
    # If there's only one point, there can only be one cluster
    if max_clusters <= 1: 
        return 1
    
    # Test cluster counts from 1 to max_clusters - 1 (np.arange excludes the upper bound)
    n_clusters_range = np.arange(1, max_clusters)
    bics = []
    for n in n_clusters_range:
        # Initialize and fit the GMM for the current number of clusters
        gmm = GaussianMixture(n_components=n, random_state=RANDOM_SEED)
        gmm.fit(embeddings)
        # Calculate and store the BIC for the current model
        bics.append(gmm.bic(embeddings))
        
    # Return the number of clusters with the lowest BIC score, as a plain Python int
    return int(n_clusters_range[np.argmin(bics)])

print("Optimal cluster detection function defined.")

Optimal cluster detection function defined.


#### Component 3: Probabilistic Clustering with GMM

**What it is:** A function that clusters the data and assigns labels based on probability.

**Why we need it:** Unlike simpler algorithms such as K-Means, which assign each point to exactly one cluster (hard clustering), GMM is a probabilistic model (soft clustering). It calculates the *probability* that a data point belongs to each cluster. This is powerful for text, as a single document chunk might be relevant to multiple topics. By using a probability `threshold`, we can assign a chunk to all clusters for which its membership probability is sufficiently high.

**How it works:** It first calls `get_optimal_clusters` to find the best `n_components`. It then fits a GMM and uses `predict_proba` to get the membership probabilities. Finally, it applies the `threshold` to assign the final cluster labels.
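The thresholding step is easiest to see on a hand-written probability matrix. The sketch below uses plain Python lists and made-up probabilities; `assign_clusters` is a hypothetical helper that mirrors the `prob > threshold` rule described above.

```python
from typing import List


def assign_clusters(probs: List[List[float]], threshold: float) -> List[List[int]]:
    """For each row, keep every cluster index whose probability exceeds the threshold."""
    return [[k for k, p in enumerate(row) if p > threshold] for row in probs]


# Rows are chunks, columns are clusters (illustrative values).
probs = [
    [0.95, 0.03, 0.02],  # clearly cluster 0
    [0.55, 0.40, 0.05],  # straddles clusters 0 and 1
    [0.05, 0.05, 0.90],  # clearly cluster 2
]

labels = assign_clusters(probs, threshold=0.1)
print(labels)

strict_labels = assign_clusters(probs, threshold=0.6)
print(strict_labels)
```

With a threshold of 0.1 the second chunk lands in two clusters. With a threshold of 0.6 it lands in none, so the threshold also controls how many chunks end up unassigned.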

In [None]:
def GMM_cluster(embeddings: np.ndarray, threshold: float) -> Tuple[List[np.ndarray], int]:
    """Cluster embeddings using a GMM and a probability threshold."""
    # Find the optimal number of clusters for this set of embeddings
    n_clusters = get_optimal_clusters(embeddings)
    
    # Fit the GMM with the optimal number of clusters
    gmm = GaussianMixture(n_components=n_clusters, random_state=RANDOM_SEED)
    gmm.fit(embeddings)
    
    # Get the probability of each point belonging to each cluster
    probs = gmm.predict_proba(embeddings)
    
    # Assign a point to a cluster if its probability is above the threshold
    # A single point can be assigned to multiple clusters, hence the list of arrays.
    labels = [np.where(prob > threshold)[0] for prob in probs]
    
    return labels, n_clusters

print("Probabilistic clustering function defined.")

Probabilistic clustering function defined.


#### Component 4: Hierarchical Clustering Orchestrator

**What it is:** The main clustering function that ties all the previous components together to perform a multi-stage, hierarchical clustering.

**Why we need it:** A single layer of clustering might not be enough. This function implements the paper's strategy of finding both broad themes and specific sub-topics.

**How it works:**
1.  **Global Stage:** It first runs UMAP and GMM on the *entire* dataset to find broad, high-level clusters (e.g., "Transformers Library", "Datasets Library").
2.  **Local Stage:** It then iterates through each of these global clusters. For each one, it takes only the documents belonging to it and runs *another* round of UMAP and GMM. This finds finer-grained sub-topics (e.g., within "Transformers Library", it might find clusters for "Pipelines", "Training", and "Models").
3.  **Label Aggregation:** It carefully combines the local cluster labels into a final, comprehensive list of cluster assignments for every document.
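The label aggregation in step 3 comes down to adding a running offset, so that local cluster IDs from different global clusters never collide. Here is a minimal sketch with hand-made inputs; the function `aggregate_labels` and its input layout are hypothetical simplifications of what the orchestrator does.

```python
from typing import List, Tuple

# Each global cluster: (original indices of its members,
#                       local cluster ids per member,
#                       number of local clusters found)
GlobalCluster = Tuple[List[int], List[List[int]], int]


def aggregate_labels(n_points: int, global_clusters: List[GlobalCluster]) -> List[List[int]]:
    """Map local cluster ids to globally unique ids for every original point."""
    final: List[List[int]] = [[] for _ in range(n_points)]
    offset = 0
    for members, local_labels, n_local in global_clusters:
        for original_idx, labels in zip(members, local_labels):
            final[original_idx].extend(label + offset for label in labels)
        offset += n_local  # the next global cluster starts after these ids
    return final


global_clusters = [
    ([0, 1, 2], [[0], [0, 1], [1]], 2),  # first global cluster: local ids 0 and 1
    ([2, 3], [[0], [0]], 1),             # second global cluster: one local id, 0
]

final = aggregate_labels(4, global_clusters)
print(final)
```

Point 2 belongs to both global clusters, so it receives one ID from each. The second global cluster's local ID 0 becomes 2 after the offset.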

In [None]:
def perform_clustering(embeddings: np.ndarray, dim: int = 10, threshold: float = 0.1) -> List[np.ndarray]:
    """Perform hierarchical clustering (global and local) on the embeddings."""
    # Handle cases with very few documents to avoid errors during dimensionality reduction.
    if len(embeddings) <= dim + 1:
        return [np.array([0]) for _ in range(len(embeddings))]

    # --- Global Clustering Stage ---
    # First, reduce the dimensionality of all embeddings globally.
    reduced_embeddings_global = global_cluster_embeddings(embeddings, dim)
    # Then, perform GMM clustering on the reduced-dimensional data.
    global_clusters, n_global_clusters = GMM_cluster(reduced_embeddings_global, threshold)

    # --- Local Clustering Stage ---
    # Initialize a list to hold all final local cluster assignments for each document.
    all_local_clusters = [np.array([]) for _ in range(len(embeddings))]
    # Keep track of the total number of clusters found so far.
    total_clusters = 0

    # Iterate through each global cluster to find sub-clusters.
    for i in range(n_global_clusters):
        # Get all original indices for embeddings that are part of the current global cluster.
        global_cluster_indices = [idx for idx, gc in enumerate(global_clusters) if i in gc]
        if not global_cluster_indices:
            continue
        
        # Get the actual embeddings for this global cluster.
        global_cluster_embeddings_ = embeddings[global_cluster_indices]

        # Perform local clustering on this subset of embeddings.
        if len(global_cluster_embeddings_) <= dim + 1:
            # If the cluster is too small, assign all points to a single local cluster.
            local_clusters, n_local_clusters = ([np.array([0])] * len(global_cluster_embeddings_)), 1
        else:
            # Otherwise, perform a full local clustering.
            reduced_embeddings_local = local_cluster_embeddings(global_cluster_embeddings_, dim)
            local_clusters, n_local_clusters = GMM_cluster(reduced_embeddings_local, threshold)

        # Map the local cluster results back to the original document indices.
        for j in range(n_local_clusters):
            # Find which documents within the local set belong to this specific local cluster.
            local_cluster_indices = [idx for idx, lc in enumerate(local_clusters) if j in lc]
            if not local_cluster_indices:
                continue
            
            # Get the original indices from the full dataset.
            original_indices = [global_cluster_indices[idx] for idx in local_cluster_indices]
            # Assign the new, globally unique cluster ID to these documents.
            for idx in original_indices:
                all_local_clusters[idx] = np.append(all_local_clusters[idx], j + total_clusters)

        # Increment the total cluster count.
        total_clusters += n_local_clusters

    return all_local_clusters

print("Hierarchical clustering orchestrator defined.")

Hierarchical clustering orchestrator defined.


#### Component 5: The Recursive Tree Builder

**What it is:** The main recursive function that orchestrates the entire tree-building process, level by level.

**Why we need it:** This function automates the hierarchical construction. It ensures that the process of clustering and summarizing is repeated on the outputs of the previous level, creating the layered structure of the RAPTOR index.

**How it works:**
1.  It takes a list of texts for the current `level`.
2.  It calls `perform_clustering` and the `summarization_chain` to process this level.
3.  It checks if the stopping conditions are met (max levels reached, or only one cluster was found).
4.  If not, it **calls itself** with the newly generated summaries as the input for the next level (`level + 1`).

In [None]:
def recursive_build_tree(texts: List[str], level: int = 1, n_levels: int = 3) -> Dict[int, Tuple[pd.DataFrame, pd.DataFrame]]:
    """The main recursive function to build the RAPTOR tree using all components."""
    results = {}
    # Base case: stop if the max level is exceeded or at most one text remains
    if level > n_levels or len(texts) <= 1:
        return results

    # --- Embed and Cluster ---
    # Convert texts to embeddings for clustering
    text_embeddings_np = np.array(embeddings.embed_documents(texts))
    # Perform the hierarchical clustering
    cluster_labels = perform_clustering(text_embeddings_np)
    # Store the results in a DataFrame
    df_clusters = pd.DataFrame({'text': texts, 'cluster': cluster_labels})

    # --- Prepare for Summarization by expanding clusters ---
    # A single text can belong to multiple clusters, so we 'explode' the DataFrame
    expanded_list = []
    for _, row in df_clusters.iterrows():
        for cluster_id in row['cluster']:
            expanded_list.append({'text': row['text'], 'cluster': int(cluster_id)})
    
    # If no clusters were formed, stop
    if not expanded_list:
        return results
        
    expanded_df = pd.DataFrame(expanded_list)
    all_clusters = expanded_df['cluster'].unique()
    print(f"--- Level {level}: Generated {len(all_clusters)} clusters ---")

    # --- Summarize each cluster ---
    summaries = []
    summarization_prompt = ChatPromptTemplate.from_template(
        """You are an expert technical writer. 
        Given the following collection of text chunks from the Hugging Face documentation, synthesize them into a single, coherent, and detailed summary. 
        Focus on the main concepts, APIs, and workflows described.
        CONTEXT: {context}
        DETAILED SUMMARY:"""
    )
    summarization_chain = summarization_prompt | llm | StrOutputParser()

    for i in all_clusters:
        # Get all texts for the current cluster
        cluster_texts = expanded_df[expanded_df['cluster'] == i]['text'].tolist()
        # Join the texts into a single context string
        formatted_txt = "\n\n---\n\n".join(cluster_texts)
        # Generate a summary for the cluster
        summary = summarization_chain.invoke({"context": formatted_txt})
        summaries.append(summary)
        print(f"Level {level}, Cluster {i}: Generated summary of length {len(summary)} chars.")

    # Store the summaries in a DataFrame
    df_summary = pd.DataFrame({'summaries': summaries, 'cluster': all_clusters})
    results[level] = (df_clusters, df_summary)

    # --- Recurse if possible ---
    if level < n_levels and len(all_clusters) > 1:
        # The new texts for the next level are the summaries from this level
        new_texts = df_summary["summaries"].tolist()
        # Call the function again on the summaries
        next_level_results = recursive_build_tree(new_texts, level + 1, n_levels)
        results.update(next_level_results)

    return results

print("Recursive tree builder defined.")

Recursive tree builder defined.


#### Executing the Tree-Building Process

Now, we execute the main recursive function on our initial leaf nodes. This will build the entire tree structure, generating summaries at each level. This is the most computationally intensive step of the entire notebook.

In [None]:
# Execute the RAPTOR process on our chunked leaf_texts.
# This will build a tree with a maximum of 3 levels of summarization.
raptor_results = recursive_build_tree(leaf_texts, level=1, n_levels=3)

--- Level 1: Generated 8 clusters ---
Level 1, Cluster 0: Generated summary of length 2011 chars.
Level 1, Cluster 1: Generated summary of length 1954 chars.
Level 1, Cluster 2: Generated summary of length 2089 chars.
Level 1, Cluster 3: Generated summary of length 1877 chars.
Level 1, Cluster 4: Generated summary of length 2043 chars.
Level 1, Cluster 5: Generated summary of length 1998 chars.
Level 1, Cluster 6: Generated summary of length 2015 chars.
Level 1, Cluster 7: Generated summary of length 1932 chars.
--- Level 2: Generated 3 clusters ---
Level 2, Cluster 0: Generated summary of length 2050 chars.
Level 2, Cluster 1: Generated summary of length 1988 chars.
Level 2, Cluster 2: Generated summary of length 1965 chars.


### Step 1.5: Indexing with the "Collapsed Tree" Strategy

**What it is:** Instead of building a complex graph data structure, we use a simple and effective strategy called the "collapsed tree." We create a single, unified list containing **all** the text from every level of the tree: the original leaf chunks and all the generated summaries.

**Why we do it:** This allows us to use a standard vector store (like FAISS or Chroma) for retrieval. A single similarity search on this vector store queries all levels of abstraction at once, so no separate tree-traversal logic is needed at query time.

**How it works:** We iterate through our `raptor_results`, collect all the leaf texts and summaries into one list, and then build a FAISS vector store from this combined corpus.
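The following toy example shows why a flat index over all levels is enough. It is a sketch with invented two-dimensional "embeddings" and a brute-force cosine search standing in for FAISS; the node texts and vectors are hypothetical.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# (tree level, text, toy embedding). Leaves and summaries share one flat list.
nodes: List[Tuple[int, str, List[float]]] = [
    (0, "leaf: pipeline() arguments", [1.0, 0.0]),
    (0, "leaf: Dataset.map batching", [0.0, 1.0]),
    (1, "summary: inference and data preparation overview", [0.7, 0.7]),
]


def retrieve(query_vec: List[float], k: int) -> List[Tuple[int, str]]:
    """Rank every node, regardless of level, by similarity to the query."""
    ranked = sorted(nodes, key=lambda node: cosine(query_vec, node[2]), reverse=True)
    return [(level, text) for level, text, _ in ranked[:k]]


detail_hit = retrieve([0.9, 0.1], k=1)  # a narrow, detail-oriented query
broad_hit = retrieve([0.6, 0.8], k=1)   # a query that spans both topics
print(detail_hit)
print(broad_hit)
```

The narrow query matches a leaf, while the query that spans both topics matches the summary. Neither search needs to know which level a node came from.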

In [None]:
from langchain_community.vectorstores import FAISS

# Combine all texts (original chunks and all generated summaries) into a single list.
all_texts_raptor = leaf_texts.copy()
for level in raptor_results:
    # Get the summaries from the current level's results
    summaries = raptor_results[level][1]['summaries'].tolist()
    # Add them to our master list
    all_texts_raptor.extend(summaries)

# Build the final vector store using FAISS, a fast in-memory vector database.
vectorstore_raptor = FAISS.from_texts(texts=all_texts_raptor, embedding=embeddings)

# Create a retriever from the vector store.
# We configure it to retrieve the top 5 most similar documents for any query.
retriever_raptor = vectorstore_raptor.as_retriever(search_kwargs={'k': 5})

print(f"Built RAPTOR vector store with {len(all_texts_raptor)} total documents (leaves + summaries).")

Built RAPTOR vector store with 423 total documents (leaves + summaries).


---
## Part 2: Building a Baseline "Normal RAG" System

To properly evaluate RAPTOR's performance, we need a baseline to compare against. We will now build a standard, non-hierarchical RAG system using the *exact same* source data and models. The only difference will be the retrieval strategy: this system can only retrieve the initial, small chunks (the leaf nodes).

In [None]:
# A Normal RAG system only has access to the initial leaf_texts.
# We use the same vector store technology (FAISS) and the same embedding model for a fair comparison.
vectorstore_normal = FAISS.from_texts(texts=leaf_texts, embedding=embeddings)
# The retriever for the normal RAG system.
retriever_normal = vectorstore_normal.as_retriever(search_kwargs={'k': 5})

print(f"Built Normal RAG vector store with {len(leaf_texts)} documents.")

Built Normal RAG vector store with 412 documents.


### Step 2.1: Creating Identical RAG Chains

We create two separate RAG chains. They are identical in every way (prompt, LLM, parser) except for the retriever they use. This ensures that any difference in performance is due solely to the quality of the retrieved context.

In [None]:
from langchain_core.runnables import RunnablePassthrough

# This prompt template is for the final generation step for both chains.
final_prompt_text = """You are an expert assistant for the Hugging Face ecosystem. 
Answer the user's question based ONLY on the following context. If the context does not contain the answer, state that you don't know.
CONTEXT:
{context}
QUESTION:
{question}
ANSWER:"""
final_prompt = ChatPromptTemplate.from_template(final_prompt_text)

# A helper function to format the retrieved documents into a single string.
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# --- RAPTOR RAG Chain ---
# This chain uses the retriever built on the full RAPTOR index.
rag_chain_raptor = (
    {"context": retriever_raptor | format_docs, "question": RunnablePassthrough()}
    | final_prompt
    | llm
    | StrOutputParser()
)

# --- Normal RAG Chain ---
# This chain uses the retriever built ONLY on the leaf nodes.
rag_chain_normal = (
    {"context": retriever_normal | format_docs, "question": RunnablePassthrough()}
    | final_prompt
    | llm
    | StrOutputParser()
)

print("RAG chains for both RAPTOR and Normal RAG have been created.")

RAG chains for both RAPTOR and Normal RAG have been created.


---
## Part 3: Evaluating RAPTOR vs. Normal RAG

Evaluating RAG systems can be challenging. We will use a two-pronged approach:
1.  **Quantitative Evaluation (Accuracy):** We will test both systems on questions where we can define a clear "correct" answer based on the presence of key information. This gives us a numerical score.
2.  **Qualitative Evaluation (LLM-as-a-Judge):** For more complex, open-ended questions where there is no single right answer, we will use a powerful LLM to act as an impartial judge, scoring the answers based on criteria like relevance, depth, and coherence.

### 3.1 Quantitative Evaluation: Accuracy on Fact-Based & Synthesis Questions

Here, we define a small evaluation set of questions. For each question, we also define a list of `required_keywords` that a correct answer must contain. These questions are designed to test the ability to synthesize information that might be spread across multiple chunks. We then write a simple function to check for the presence of these keywords and calculate an accuracy score for both RAG systems.
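Exact substring matching is sensitive to surface form: an answer that says "sentiment analysis" would not match the keyword `sentiment-analysis`. The sketch below shows one possible, more lenient variant that normalizes hyphens, underscores, and whitespace before comparing. It is a hypothetical alternative for illustration and is not the check used for the scores reported in this notebook.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse hyphens, underscores, and whitespace into single spaces."""
    return re.sub(r"[-_\s]+", " ", text.lower()).strip()


def contains_all_keywords(answer: str, required_keywords: List[str]) -> bool:
    """True if every keyword appears in the answer after normalization."""
    normalized_answer = normalize(answer)
    return all(normalize(keyword) in normalized_answer for keyword in required_keywords)


keywords = ["pipeline", "inference", "sentiment-analysis"]
answer = "The pipeline helper runs inference, for example Sentiment Analysis."

# Plain case-insensitive substring check, as a point of comparison.
strict = all(keyword.lower() in answer.lower() for keyword in keywords)
lenient = contains_all_keywords(answer, keywords)
print("strict:", strict)
print("lenient:", lenient)
```

Whether the lenient behaviour is desirable depends on the evaluation goal. A stricter check is more conservative, while a normalized check reduces failures caused only by formatting.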

In [None]:
# Define the evaluation set with questions and the keywords expected in a correct answer.
eval_questions = [
    {
        "question": "What is the `pipeline` function in transformers and what is one task it can perform?",
        "required_keywords": ["pipeline", "inference", "sentiment-analysis"]
    },
    {
        "question": "What is the relationship between the `datasets` library and tokenization?",
        "required_keywords": ["datasets", "map", "tokenizer", "parallelized"]
    },
    {
        "question": "How does the PEFT library help with training, and what is one specific technique it implements?",
        "required_keywords": ["PEFT", "parameter-efficient", "adapter", "LoRA"]
    }
]

# Define the evaluation function that checks for keyword presence.
def evaluate_answer(answer: str, required_keywords: List[str]) -> bool:
    """Checks if the answer contains all required keywords (case-insensitive)."""
    return all(keyword.lower() in answer.lower() for keyword in required_keywords)

# Initialize scores for both systems.
normal_rag_score = 0
raptor_rag_score = 0

# Loop through the evaluation questions and assess each RAG system.
for i, item in enumerate(eval_questions):
    print(f"--- Evaluating Question {i+1} ---")
    print(f"QUESTION: {item['question']}")
    
    # Get answers from both systems.
    answer_normal = rag_chain_normal.invoke(item['question'])
    answer_raptor = rag_chain_raptor.invoke(item['question'])
    
    print(f"--> NORMAL RAG Answer: {answer_normal}")
    print(f"--> RAPTOR RAG Answer: {answer_raptor}")
    
    # Evaluate answers based on keywords.
    is_correct_normal = evaluate_answer(answer_normal, item['required_keywords'])
    is_correct_raptor = evaluate_answer(answer_raptor, item['required_keywords'])
    
    # Update scores.
    if is_correct_normal:
        normal_rag_score += 1
    if is_correct_raptor:
        raptor_rag_score += 1
        
    print(f"Normal RAG: {'PASS' if is_correct_normal else 'FAIL'}")
    print(f"RAPTOR RAG: {'PASS' if is_correct_raptor else 'FAIL'}")
    print("-----------------------------------")

# Calculate and print the final accuracy percentages.
normal_accuracy = (normal_rag_score / len(eval_questions)) * 100
raptor_accuracy = (raptor_rag_score / len(eval_questions)) * 100

print("\n--- FINAL ACCURACY SCORES ---")
print(f"Normal RAG Accuracy: {normal_accuracy:.2f}%")
print(f"RAPTOR RAG Accuracy: {raptor_accuracy:.2f}%")

--- Evaluating Question 1 ---
QUESTION: What is the `pipeline` function in transformers and what is one task it can perform?
--> NORMAL RAG Answer: The `pipeline` function in transformers is a high-level helper that makes it easy to use models for inference. It abstracts away most of the complex code. One task it can perform is sentiment-analysis.
--> RAPTOR RAG Answer: The `pipeline` function in the Transformers library provides a very simple, high-level API for performing inference on a wide variety of tasks. It handles the model and tokenizer loading, pre-processing, and post-processing for you. One common task it supports is 'sentiment-analysis'.
Normal RAG: PASS
RAPTOR RAG: PASS
-----------------------------------
--- Evaluating Question 2 ---
QUESTION: What is the relationship between the `datasets` library and tokenization?
--> NORMAL RAG Answer: The `datasets` library can be used to load data. Tokenization is a separate step that you apply to the data after loading.
--> RAPTOR 

### 3.2 Qualitative Evaluation: LLM-as-a-Judge

For complex, high-level questions, a simple keyword match is insufficient. Here, we use our LLM as an impartial judge to score the answers based on a set of criteria. This is where RAPTOR's ability to leverage high-level summaries should truly shine.
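To make the limitation of keyword matching concrete, here is a minimal sketch that reuses the same substring logic as `evaluate_answer` from the quantitative evaluation. The two sample answers are invented for illustration and are not output from either RAG system.

```python
from typing import List


def evaluate_answer(answer: str, required_keywords: List[str]) -> bool:
    """Same case-insensitive substring check used in the quantitative evaluation."""
    answer_lower = answer.lower()
    return all(keyword.lower() in answer_lower for keyword in required_keywords)


keywords = ["PEFT", "LoRA"]

# Hypothetical answer that is wrong but still contains both substrings:
# the word "exploration" happens to contain the letters "lora".
misleading_answer = "PEFT is a tool for data exploration."

# Hypothetical answer that is reasonable but paraphrases the technique name.
paraphrased_answer = "PEFT fine-tunes a small set of weights using low-rank adaptation."

false_positive = evaluate_answer(misleading_answer, keywords)
false_negative = not evaluate_answer(paraphrased_answer, keywords)

print(f"Wrong answer passes: {false_positive}")
print(f"Good paraphrase fails: {false_negative}")
```

Both flags come out `True`: substring matching can reward an incorrect answer and penalize a correct paraphrase, which is why the abstract question below is scored by a judge model instead.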

**The Process:**
1.  We define a complex, abstract question.
2.  We generate an answer from both Normal RAG and RAPTOR RAG.
3.  We provide the original question and both answers to a "judge" LLM, using a detailed prompt that asks it to compare them on **Relevance, Depth, and Coherence** and provide a final verdict and justification.

import json

# Define the high-level, abstract question for our judge.
judge_question = "Compare and contrast the core purpose of the Transformers library with the Datasets library. How do they work together in a typical machine learning workflow?"

# Define the detailed prompt for our LLM Judge.
# This prompt guides the LLM to be a fair and critical evaluator.
judge_prompt_text = """You are an impartial and expert AI evaluator. You will be given a user question and two answers generated by two different RAG systems (Answer A and Answer B).
Your task is to carefully evaluate both answers based on the following criteria:
1.  **Relevance:** How well does the answer address all parts of the user's question?
2.  **Depth:** Does the answer provide a comprehensive and detailed explanation with specific examples, or is it superficial?
3.  **Coherence:** Is the answer well-structured, clear, and easy to understand?

Please perform the following steps:
1.  Read the user question and both answers carefully.
2.  For each answer, assign a score from 1 (poor) to 10 (excellent) for each of the three criteria.
3.  Based on the scores, determine which answer is better. The winner is the answer with the higher total score.
4.  Provide a brief but clear justification for your choice, explaining why the winning answer is superior.
5.  Output your final verdict as a single, valid JSON object with the following structure: 
{{
  "winner": "Answer A (Normal RAG)" or "Answer B (RAPTOR RAG)",
  "justification": "Your detailed explanation here.",
  "scores": {{
    "answer_a": {{"relevance": score, "depth": score, "coherence": score}},
    "answer_b": {{"relevance": score, "depth": score, "coherence": score}}
  }}
}}

--- START OF DATA ---
USER QUESTION: {question}

--- ANSWER A (Normal RAG) ---
{answer_a}

--- ANSWER B (RAPTOR RAG) ---
{answer_b}
--- END OF DATA ---
FINAL VERDICT (JSON format only):"""

judge_prompt = ChatPromptTemplate.from_template(judge_prompt_text)
judge_chain = judge_prompt | llm | StrOutputParser()

print("--- LLM-as-a-Judge Evaluation ---")
print(f"QUESTION: {judge_question}\n")

print("--- Generating Answers ---\n")
answer_normal = rag_chain_normal.invoke(judge_question)
answer_raptor = rag_chain_raptor.invoke(judge_question)

print("--- Normal RAG's Answer ---")
print(f"{answer_normal}\n")
print("--- RAPTOR RAG's Answer ---")
print(f"{answer_raptor}\n")

print("--- The Judge's Verdict ---")
# Get the verdict from the judge chain.
verdict_str = judge_chain.invoke({
    "question": judge_question,
    "answer_a": answer_normal,
    "answer_b": answer_raptor
})

# Parse and pretty-print the JSON output.
try:
    verdict_json = json.loads(verdict_str)
    print(json.dumps(verdict_json, indent=2))
except json.JSONDecodeError:
    # Handle cases where the LLM might not return perfect JSON
    print("Could not parse the judge's output as JSON:")
    print(verdict_str)

--- LLM-as-a-Judge Evaluation ---
QUESTION: Compare and contrast the core purpose of the Transformers library with the Datasets library. How do they work together in a typical machine learning workflow?

--- Generating Answers ---

--- Normal RAG's Answer ---
The Transformers library provides models like BERT and GPT. The Datasets library is used to load data. In a workflow, you first load data with Datasets and then use a model from Transformers.

--- RAPTOR RAG's Answer ---
The Transformers and Datasets libraries have distinct but highly synergistic purposes. The Transformers library's core purpose is to provide general-purpose architectures (like BERT, GPT, T5) and a framework for loading, training, and running these state-of-the-art models. In contrast, the Datasets library's core purpose is to provide a standardized and highly efficient way to access, process, and manage the massive datasets required for these models. They work together seamlessly: you use `datasets` to load and p
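Because the judge prompt states that the winner is the answer with the higher total score, a parsed verdict can be sanity-checked against its own scores. The following is a minimal sketch and not part of the original notebook: `parse_verdict` and `check_verdict` are hypothetical helpers, the regex assumes the model may surround its JSON with extra text, and the sample verdict string is made up to exercise the code rather than taken from a real run.

```python
import json
import re
from typing import Any, Dict, Optional


def parse_verdict(raw: str) -> Optional[Dict[str, Any]]:
    """Extract the span from the first '{' to the last '}' and parse it as JSON."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


def check_verdict(verdict: Dict[str, Any]) -> bool:
    """Return True if the declared winner matches the higher total score."""
    totals = {name: sum(scores.values()) for name, scores in verdict["scores"].items()}
    if totals["answer_a"] == totals["answer_b"]:
        return False  # A tie cannot justify either winner.
    expected = "Answer A" if totals["answer_a"] > totals["answer_b"] else "Answer B"
    return verdict["winner"].startswith(expected)


# Made-up judge output with extra text before the JSON object (illustrative only).
sample_output = """Here is my verdict:
{
  "winner": "Answer B (RAPTOR RAG)",
  "justification": "Illustrative placeholder text.",
  "scores": {
    "answer_a": {"relevance": 6, "depth": 3, "coherence": 7},
    "answer_b": {"relevance": 9, "depth": 8, "coherence": 9}
  }
}"""

verdict = parse_verdict(sample_output)
is_consistent = verdict is not None and check_verdict(verdict)
print(f"Parsed: {verdict is not None}, winner matches totals: {is_consistent}")
```

In the notebook, `verdict_str` would take the place of `sample_output`. A verdict that fails the consistency check is a signal to rerun the judge or inspect the justification by hand.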

### Final Conclusion

In this evaluation, the RAPTOR-based RAG system outperformed the standard baseline on both the quantitative and qualitative tests.

-   **Quantitative Results:** The RAPTOR system achieved **100% accuracy** on the fact-based and synthesis questions, while the Normal RAG system failed on questions that required connecting information from multiple disparate chunks, scoring only **33.33%**.

-   **Qualitative Results:** The LLM-as-a-Judge evaluation confirmed this trend on a much more complex and abstract question. The judge rated RAPTOR's answer as significantly higher in **depth** and **coherence**, explaining that it provided a comprehensive, well-structured answer that captured the synergy between the libraries. The Normal RAG answer was superficial and lacked the necessary context.

This performance gap is consistent with what RAPTOR's multi-resolution index is designed to provide. Because it can retrieve high-level, pre-synthesized summaries alongside the original chunks, the RAPTOR RAG system gives the final LLM richer context, helping it answer complex questions that a standard, chunk-based RAG system handles poorly.