# End-to-End Implementation of RAPTOR with Hugging Face

## A Deep Dive into Hierarchical RAG for Advanced Contextual Retrieval

### Theoretical Introduction: The Problem with Standard RAG

Standard Retrieval-Augmented Generation (RAG) is a powerful technique, but it suffers from a fundamental **abstraction mismatch**. It typically involves:
1.  **Chunking:** Breaking large documents into small, fixed-size, independent pieces.
2.  **Retrieval:** Searching for these small chunks based on semantic similarity to a user's query.

This approach fails when a query requires a high-level, conceptual understanding. A broad question like "*What is the core philosophy of the Transformers library?*" will retrieve disparate, low-level code snippets, failing to capture the overarching theme. The system gets "lost in the details."

### The RAPTOR Solution: Building a Tree of Understanding

**RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)** addresses this by creating a multi-level, hierarchical index that mirrors human understanding. The core idea is to build a semantic "tree" of information:

1.  **Leaf Nodes:** Start with initial text chunks (the most granular details).
2.  **Clustering:** Group similar chunks into thematic clusters.
3.  **Summarization (Abstraction):** Use a powerful LLM to synthesize a new, more abstract summary for each cluster. These summaries become the parent nodes.
4.  **Recursion:** Repeat the process. The new summaries are themselves clustered and summarized, creating ever-higher levels of abstraction until a single root summary is reached.

The result is a **multi-resolution index**. A single query can now match information at the appropriate level of abstraction: a specific detail at the leaf level, a thematic overview at a mid-level branch, or a high-level concept at the top of the tree. This notebook implements the entire process from scratch, using the clustering approach from the original paper (UMAP for dimensionality reduction plus Gaussian Mixture Models).
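
To make the four steps concrete, here is a minimal sketch of the tree as a plain data structure. The `Node` class, the `collect_all` helper, and the toy texts are hypothetical and are not used elsewhere in this notebook, which stores each level in DataFrames instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical tree node: a text plus the child nodes it summarizes."""
    text: str
    level: int  # 0 = leaf chunk, higher = more abstract
    children: List["Node"] = field(default_factory=list)


def collect_all(node: Node) -> List[Node]:
    """Return the node and all of its descendants (every level of the tree)."""
    nodes = [node]
    for child in node.children:
        nodes.extend(collect_all(child))
    return nodes


# Four leaf chunks, two mid-level summaries, one root summary.
leaves = [Node(f"chunk {i}", level=0) for i in range(4)]
mid = [Node("summary A", 1, leaves[:2]), Node("summary B", 1, leaves[2:])]
root = Node("root summary", 2, mid)

# Group every node's text by its level of abstraction.
by_level: Dict[int, List[str]] = {}
for n in collect_all(root):
    by_level.setdefault(n.level, []).append(n.text)
print(by_level)
```

Every entry in `by_level` is a candidate retrieval target, which is what "multi-resolution" means in practice.
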

### Step 1: Installing Dependencies

First, we install the necessary libraries. We'll use `langchain` to wire the pipeline together, `transformers` and `sentence-transformers` for our Hugging Face models, `faiss-cpu` for efficient vector indexing, and `umap-learn` with `scikit-learn` for the core clustering logic.

In [None]:
# This command installs all the necessary packages for this notebook.
# langchain libraries form the core framework.
# sentence-transformers is for our embedding model.
# transformers, torch, accelerate, and bitsandbytes are for running the local LLM.
# faiss-cpu provides a fast, local vector store.
# umap-learn and scikit-learn are for the clustering algorithm.
# pandas holds the per-level cluster and summary tables.
# beautifulsoup4 is used for parsing HTML during web scraping.
%pip install -q -U langchain langchain-community langchain-huggingface sentence-transformers
%pip install -q -U transformers torch accelerate bitsandbytes
%pip install -q -U faiss-cpu umap-learn scikit-learn beautifulsoup4 matplotlib pandas

### Step 2: Model Configuration

We will configure our models from the Hugging Face Hub. For this demonstration, we'll use:
- **Embedding Model:** `sentence-transformers/all-MiniLM-L6-v2`. A small, fast, and effective model for creating sentence and paragraph embeddings.
- **LLM for Summarization:** `mistralai/Mistral-7B-Instruct-v0.2`. A powerful yet manageable model for the summarization task. We'll load it in 4-bit precision using `bitsandbytes` to conserve memory and make it runnable on consumer-grade GPUs.

In [None]:
import torch
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_huggingface import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig

# --- Configure Embedding Model ---
# We use a sentence-transformer model for creating high-quality embeddings.
# 'all-MiniLM-L6-v2' is a great choice for its balance of speed and performance.
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {"device": "cuda" if torch.cuda.is_available() else "cpu"}  # Fall back to CPU when no GPU is present
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name, model_kwargs=model_kwargs)

# --- Configure LLM for Summarization ---
# We use Mistral-7B, a powerful open-source model.
# To make it runnable on a single GPU, we load it in 4-bit precision.
llm_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Configuration for 4-bit quantization to save memory
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

# Load the tokenizer and the 4-bit quantized model
tokenizer = AutoTokenizer.from_pretrained(llm_id)
model = AutoModelForCausalLM.from_pretrained(
    llm_id, 
    torch_dtype=torch.float16, 
    device_map="auto",
    quantization_config=quantization_config
)

# Create a text-generation pipeline from the loaded model and tokenizer
pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=512 # Controls the max length of the generated summaries
)

# Wrap the pipeline in LangChain's HuggingFacePipeline for seamless integration
llm = HuggingFacePipeline(pipeline=pipe)

print("Models configured successfully.")

Models configured successfully.
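
As a rough sanity check on the 4-bit choice, the arithmetic below estimates the memory needed for the weights alone. The parameter count is an approximation for a 7B-class model, and the estimate ignores activations, the KV cache, and quantization metadata, so treat the numbers as a lower bound rather than a measurement.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 7.2e9  # assumption: approximate parameter count of a 7B-class model

fp16 = weight_memory_gib(N_PARAMS, 16)
nf4 = weight_memory_gib(N_PARAMS, 4)
print(f"fp16 weights:  ~{fp16:.1f} GiB")
print(f"4-bit weights: ~{nf4:.1f} GiB")
```

The 4x reduction is what brings the model within reach of a single consumer GPU.
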


### Step 3: Data Ingestion and Preparation

We will crawl the Hugging Face documentation to build our knowledge base. We target several key sections with varying crawl depths to gather a rich and diverse set of documents. This mimics a real-world scenario where a knowledge base is built from multiple related sources.

In [None]:
from langchain_community.document_loaders import RecursiveUrlLoader
from bs4 import BeautifulSoup as Soup

# A list of starting URLs for our knowledge base, with different crawl depths
urls_to_load = [
    {"url": "https://huggingface.co/docs/transformers/index", "max_depth": 3},
    {"url": "https://huggingface.co/docs/datasets/index", "max_depth": 2},
    {"url": "https://huggingface.co/docs/tokenizers/index", "max_depth": 2},
    {"url": "https://huggingface.co/docs/peft/index", "max_depth": 1},
    {"url": "https://huggingface.co/docs/accelerate/index", "max_depth": 1}
]

docs = []
# Iterate through the list and crawl each documentation section
for item in urls_to_load:
    # Initialize the loader with the specific URL and parameters
    loader = RecursiveUrlLoader(
        url=item["url"],
        max_depth=item["max_depth"],
        extractor=lambda x: Soup(x, "html.parser").text, # Extracts plain text from HTML
        prevent_outside=True, # Only follow links that stay under the starting URL
        use_async=True, # Speeds up crawling with asynchronous requests
        timeout=600, # Increases timeout to handle slow pages
    )
    # Load the documents and add them to our list
    loaded_docs = loader.load()
    docs.extend(loaded_docs)
    print(f"Loaded {len(loaded_docs)} documents from {item['url']}")

print(f"\nTotal documents loaded: {len(docs)}")

Loaded 68 documents from https://huggingface.co/docs/transformers/index
Loaded 35 documents from https://huggingface.co/docs/datasets/index
Loaded 21 documents from https://huggingface.co/docs/tokenizers/index
Loaded 12 documents from https://huggingface.co/docs/peft/index
Loaded 9 documents from https://huggingface.co/docs/accelerate/index

Total documents loaded: 145


#### Document Analysis: Token Counting

Before we proceed, it's crucial to understand the size of our documents. We'll use the tokenizer from our chosen LLM (`Mistral-7B`) to accurately count the tokens. This analysis will justify the need for our initial chunking step to create the leaf nodes for RAPTOR.

In [None]:
import numpy as np

# Counting tokens with the LLM's own tokenizer keeps the counts consistent with what the model will see.
def count_tokens(text: str) -> int:
    """Counts the number of tokens in a text using the configured tokenizer."""
    # Ensure text is not None and is a string
    if not isinstance(text, str):
        return 0
    return len(tokenizer.encode(text))

# Extract the text content from the loaded LangChain Document objects
docs_texts = [d.page_content for d in docs]

# Calculate token counts for each document
token_counts = [count_tokens(text) for text in docs_texts]

# Print statistics to understand the document size distribution
print(f"Total documents: {len(docs_texts)}")
print(f"Total tokens in corpus: {np.sum(token_counts)}")
print(f"Average tokens per document: {np.mean(token_counts):.2f}")
print(f"Min tokens in a document: {np.min(token_counts)}")
print(f"Max tokens in a document: {np.max(token_counts)}")

Total documents: 145
Total tokens in corpus: 312560
Average tokens per document: 2155.59
Min tokens in a document: 312
Max tokens in a document: 12450


#### Visualizing Document Lengths

A histogram helps visualize the distribution of document lengths. The output (omitted here) typically shows a long tail, with many small documents and a few very large ones. Together with the statistics above (an average of roughly 2,000 tokens and a maximum above 12,000), this confirms that many documents are too large to serve as focused retrieval units and must be chunked.

In [None]:
import matplotlib.pyplot as plt

# This code generates a histogram to visually inspect the token counts.
plt.figure(figsize=(10, 6))
plt.hist(token_counts, bins=50, color='blue', alpha=0.7)
plt.title('Distribution of Document Token Counts')
plt.xlabel('Token Count')
plt.ylabel('Number of Documents')
plt.grid(True)
plt.show()

#### Creating Leaf Nodes: Initial Chunking

Our analysis shows many documents are too large. We now perform an initial chunking step. These chunks will form the **leaf nodes** of our RAPTOR tree. We choose a chunk size that is large enough to contain meaningful context (e.g., a full function definition with its docstring) but small enough to be a focused unit of information.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Concatenate all document texts into one large string for efficient splitting
concatenated_content = "\n\n --- \n\n".join(docs_texts)

# Create the text splitter, using the LLM's tokenizer for accurate splitting
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=tokenizer,
    chunk_size=1000, # The max number of tokens in a chunk
    chunk_overlap=100  # The number of tokens to overlap between chunks
)

# Split the text into chunks, which will be our leaf nodes
leaf_texts = text_splitter.split_text(concatenated_content)

print(f"Created {len(leaf_texts)} leaf nodes (chunks) for the RAPTOR tree.")

Created 412 leaf nodes (chunks) for the RAPTOR tree.
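
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-stride window over a list of token ids. It is only a sketch: `sliding_chunks` is a hypothetical helper, and `RecursiveCharacterTextSplitter` behaves differently, since it first splits on separators and then merges pieces up to the size limit.

```python
import math
from typing import List


def sliding_chunks(tokens: List[int], chunk_size: int, overlap: int) -> List[List[int]]:
    """Fixed-stride windows: each chunk repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


toy_tokens = list(range(25))
chunks = sliding_chunks(toy_tokens, chunk_size=10, overlap=2)
print([(c[0], c[-1]) for c in chunks])  # first and last token id of each chunk

# Idealized chunk count for the corpus above: 312,560 tokens, size 1000, overlap 100.
estimate = math.ceil((312_560 - 100) / (1000 - 100))
print(estimate)
```

The idealized estimate is lower than the 412 chunks reported above. That is consistent with a splitter that cuts at natural separators, which leaves many chunks shorter than the 1000-token maximum.
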


### Step 4: The Core RAPTOR Algorithm - A Component-by-Component Breakdown

We will now implement the sophisticated clustering approach from the RAPTOR paper. Each logical part of the algorithm is defined in its own cell for maximum clarity.

#### Component 1: Dimensionality Reduction with UMAP

**What it is:** UMAP (Uniform Manifold Approximation and Projection) is a technique for reducing the number of dimensions in our data.

**Why we need it:** Text embeddings exist in a very high-dimensional space (e.g., 384 dimensions for our model). This can make it difficult for clustering algorithms to work effectively due to the "Curse of Dimensionality." UMAP creates a lower-dimensional "map" of the data that preserves the essential semantic relationships, making it much easier to identify meaningful clusters.

**How it works:** We define two functions: `global_cluster_embeddings` for a broad, initial reduction, and `local_cluster_embeddings` for a more fine-grained reduction within already identified clusters.

In [None]:
from typing import Dict, List, Optional, Tuple
import numpy as np
import pandas as pd
import umap
from sklearn.mixture import GaussianMixture

RANDOM_SEED = 224  # for reproducibility

def global_cluster_embeddings(embeddings: np.ndarray, dim: int, n_neighbors: Optional[int] = None, metric: str = "cosine") -> np.ndarray:
    """Perform global dimensionality reduction on the embeddings using UMAP."""
    if n_neighbors is None:
        n_neighbors = int((len(embeddings) - 1) ** 0.5)
    return umap.UMAP(
        n_neighbors=n_neighbors, n_components=dim, metric=metric, random_state=RANDOM_SEED
    ).fit_transform(embeddings)

def local_cluster_embeddings(embeddings: np.ndarray, dim: int, num_neighbors: int = 10, metric: str = "cosine") -> np.ndarray:
    """Perform local dimensionality reduction on the embeddings using UMAP."""
    return umap.UMAP(
        n_neighbors=num_neighbors, n_components=dim, metric=metric, random_state=RANDOM_SEED
    ).fit_transform(embeddings)

print("Dimensionality reduction functions defined.")

Dimensionality reduction functions defined.
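
The two functions differ mainly in their neighborhood size. The global reduction scales `n_neighbors` with the square root of the input size, while the local one uses a fixed value of 10. In UMAP, a larger `n_neighbors` emphasizes broad structure and a smaller one preserves local detail. The sketch below isolates the heuristic so its behavior is easy to inspect; `default_n_neighbors` is a hypothetical name for the expression used in `global_cluster_embeddings`.

```python
def default_n_neighbors(n_points: int) -> int:
    """Same heuristic as in global_cluster_embeddings: floor(sqrt(n - 1))."""
    return int((n_points - 1) ** 0.5)


for n in (12, 50, 412, 5000):
    print(f"{n:>5} points -> n_neighbors = {default_n_neighbors(n)}")
```

For the 412 leaf chunks created earlier, the global stage therefore looks at 20 neighbors per point, twice the local default.
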


#### Component 2: Optimal Cluster Number Detection

**What it is:** A function to automatically determine the best number of clusters for a given set of data points.

**Why we need it:** Manually setting the number of clusters (`k`) is inefficient and often incorrect. A data-driven approach is far more robust. This function tests a range of possible cluster numbers and selects the one that best fits the data's structure.

**How it works:** It uses a Gaussian Mixture Model (GMM) and evaluates each potential number of clusters using the **Bayesian Information Criterion (BIC)**. The BIC is a statistical measure that rewards models for goodness-of-fit while penalizing them for complexity (too many clusters). The number of clusters that results in the lowest BIC score is chosen as the optimal one.

In [None]:
def get_optimal_clusters(embeddings: np.ndarray, max_clusters: int = 50) -> int:
    """Determine the optimal number of clusters using the Bayesian Information Criterion (BIC)."""
    # Limit the max number of clusters to be less than the number of data points
    max_clusters = min(max_clusters, len(embeddings))
    if max_clusters <= 1: 
        return 1
    
    # Test different numbers of clusters
    n_clusters_range = np.arange(1, max_clusters)
    bics = []
    for n in n_clusters_range:
        gmm = GaussianMixture(n_components=n, random_state=RANDOM_SEED)
        gmm.fit(embeddings)
        bics.append(gmm.bic(embeddings)) # Calculate BIC for the current model
        
    # Return the number of clusters with the lowest BIC score, as a plain Python int
    return int(n_clusters_range[np.argmin(bics)])

print("Optimal cluster detection function defined.")

Optimal cluster detection function defined.
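
To show how the BIC trades fit against complexity, the sketch below computes it by hand as `k * ln(n) - 2 * ln(L)`, where `k` is the number of free parameters, `n` the number of samples, and `L` the maximized likelihood. The parameter count assumes full covariance matrices (scikit-learn's default `covariance_type="full"`). The log-likelihood values are hypothetical numbers chosen for illustration, not results from this notebook.

```python
import math


def gmm_n_parameters(n_components: int, n_features: int) -> int:
    """Free parameters of a GMM with full covariance matrices."""
    means = n_components * n_features
    covariances = n_components * n_features * (n_features + 1) // 2
    weights = n_components - 1  # mixing weights sum to 1
    return means + covariances + weights


def bic(total_log_likelihood: float, n_params: int, n_samples: int) -> float:
    """BIC = k * ln(n) - 2 * ln(L); lower is better."""
    return n_params * math.log(n_samples) - 2.0 * total_log_likelihood


# Hypothetical total log-likelihoods for 400 points in 10 dimensions.
candidates = {2: -5200.0, 4: -4300.0, 8: -4100.0}
scores = {k: bic(ll, gmm_n_parameters(k, 10), 400) for k, ll in candidates.items()}
best_k = min(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: params={gmm_n_parameters(k, 10)}, BIC={score:.1f}")
print("best k:", best_k)
```

In this toy case, going from 4 to 8 components still improves the likelihood, but the extra parameters cost more than the fit gains, so the BIC selects 4.
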


#### Component 3: Probabilistic Clustering with GMM

**What it is:** A function that clusters the data and assigns labels based on probability.

**Why we need it:** Unlike simpler algorithms such as K-Means, which assign each point to exactly one cluster (hard clustering), GMM is a probabilistic model (soft clustering). It calculates the *probability* that a data point belongs to each cluster. This is powerful for text, as a single document chunk might be relevant to multiple topics. By using a probability `threshold`, we can assign a chunk to all clusters for which its membership probability is sufficiently high.

**How it works:** It first calls `get_optimal_clusters` to find the best `n_components`. It then fits a GMM and uses `predict_proba` to get the membership probabilities. Finally, it applies the `threshold` to assign the final cluster labels.

In [None]:
def GMM_cluster(embeddings: np.ndarray, threshold: float) -> Tuple[List[np.ndarray], int]:
    """Cluster embeddings using a GMM and a probability threshold."""
    # Find the optimal number of clusters for this set of embeddings
    n_clusters = get_optimal_clusters(embeddings)
    
    # Fit the GMM with the optimal number of clusters
    gmm = GaussianMixture(n_components=n_clusters, random_state=RANDOM_SEED)
    gmm.fit(embeddings)
    
    # Get the probability of each point belonging to each cluster
    probs = gmm.predict_proba(embeddings)
    
    # Assign a point to a cluster if its probability is above the threshold
    # A single point can be assigned to multiple clusters.
    labels = [np.where(prob > threshold)[0] for prob in probs]
    
    return labels, n_clusters

print("Probabilistic clustering function defined.")

Probabilistic clustering function defined.
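
The thresholding step is easy to inspect in isolation. The sketch below applies the same rule as `GMM_cluster` (keep every cluster whose probability exceeds the threshold) to a hand-written probability matrix. `assign_clusters` is a hypothetical pure-Python stand-in for the `np.where` expression, and the probabilities are made up.

```python
from typing import List


def assign_clusters(probs: List[List[float]], threshold: float) -> List[List[int]]:
    """For each point, keep every cluster whose membership probability exceeds the threshold."""
    return [[k for k, p in enumerate(row) if p > threshold] for row in probs]


# Hypothetical membership probabilities for three chunks and three clusters.
probs = [
    [0.95, 0.03, 0.02],  # clearly about one topic
    [0.55, 0.40, 0.05],  # straddles two topics
    [0.34, 0.33, 0.33],  # ambiguous
]

low = assign_clusters(probs, threshold=0.1)
high = assign_clusters(probs, threshold=0.5)
print("threshold=0.1:", low)
print("threshold=0.5:", high)
```

Note the trade-off: a low threshold lets chunks appear in several clusters, while a high threshold can leave an ambiguous chunk with no cluster at all, in which case it would not contribute to any summary.
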


#### Component 4: Hierarchical Clustering Orchestrator

**What it is:** The main clustering function that ties all the previous components together to perform a multi-stage, hierarchical clustering.

**Why we need it:** A single layer of clustering might not be enough. This function implements the paper's strategy of finding both broad themes and specific sub-topics.

**How it works:**
1.  **Global Stage:** It first runs UMAP and GMM on the *entire* dataset to find broad, high-level clusters (e.g., "Transformers Library", "Datasets Library").
2.  **Local Stage:** It then iterates through each of these global clusters. For each one, it takes only the documents belonging to it and runs *another* round of UMAP and GMM. This finds finer-grained sub-topics (e.g., within "Transformers Library", it might find clusters for "Pipelines", "Training", and "Models").
3.  **Label Aggregation:** It carefully combines the local cluster labels into a final, comprehensive list of cluster assignments for every document.

In [None]:
def perform_clustering(embeddings: np.ndarray, dim: int = 10, threshold: float = 0.1) -> List[np.ndarray]:
    """Perform hierarchical clustering (global and local) on the embeddings."""
    # Handle cases with very few documents to avoid errors
    if len(embeddings) <= dim + 1:
        return [np.array([0]) for _ in range(len(embeddings))]

    # --- Global Clustering Stage ---
    reduced_embeddings_global = global_cluster_embeddings(embeddings, dim)
    global_clusters, n_global_clusters = GMM_cluster(reduced_embeddings_global, threshold)

    # --- Local Clustering Stage ---
    all_local_clusters = [np.array([]) for _ in range(len(embeddings))]
    total_clusters = 0

    # Iterate through each global cluster to find sub-clusters
    for i in range(n_global_clusters):
        # Get all original indices for embeddings in the current global cluster
        global_cluster_indices = [idx for idx, gc in enumerate(global_clusters) if i in gc]
        if not global_cluster_indices:
            continue
        
        # Get the actual embeddings for this global cluster
        global_cluster_embeddings_ = embeddings[global_cluster_indices]

        # Perform local clustering on this subset of embeddings
        if len(global_cluster_embeddings_) <= dim + 1:
            local_clusters, n_local_clusters = ([np.array([0])] * len(global_cluster_embeddings_)), 1
        else:
            reduced_embeddings_local = local_cluster_embeddings(global_cluster_embeddings_, dim)
            local_clusters, n_local_clusters = GMM_cluster(reduced_embeddings_local, threshold)

        # Map the local cluster results back to the original document indices
        for j in range(n_local_clusters):
            local_cluster_indices = [idx for idx, lc in enumerate(local_clusters) if j in lc]
            if not local_cluster_indices:
                continue
            
            original_indices = [global_cluster_indices[idx] for idx in local_cluster_indices]
            for idx in original_indices:
                all_local_clusters[idx] = np.append(all_local_clusters[idx], j + total_clusters)

        total_clusters += n_local_clusters

    return all_local_clusters

print("Hierarchical clustering orchestrator defined.")

Hierarchical clustering orchestrator defined.
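
The label aggregation step is the least obvious part of the orchestrator: local cluster ids restart at 0 inside every global cluster, so they are shifted by a running offset (`total_clusters` in the code above) to keep the final ids unique. The sketch below reproduces just that bookkeeping. The `merge_local_labels` helper and its input format are hypothetical simplifications.

```python
from typing import Dict, List


def merge_local_labels(
    local_labels_per_global: List[Dict[int, List[int]]],
    n_local_per_global: List[int],
    n_points: int,
) -> List[List[int]]:
    """Offset each global cluster's local ids so that the final ids are unique."""
    final: List[List[int]] = [[] for _ in range(n_points)]
    offset = 0
    for local_labels, n_local in zip(local_labels_per_global, n_local_per_global):
        for point_idx, labels in local_labels.items():
            final[point_idx].extend(label + offset for label in labels)
        offset += n_local  # the next global cluster starts after these ids
    return final


# Global cluster 0 holds points 0-2 and found 2 local clusters;
# global cluster 1 holds points 2-4 and found 3 local clusters.
local_labels_per_global = [
    {0: [0], 1: [1], 2: [0, 1]},
    {2: [0], 3: [1], 4: [2]},
]
final_labels = merge_local_labels(local_labels_per_global, [2, 3], n_points=5)
print(final_labels)
```

Point 2 belongs to both global clusters, so it ends up with labels from both, which is how soft assignment carries through to the final cluster list.
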


### Step 5: Building the Tree

With all the clustering components defined, we now create the functions that will recursively build the tree. This involves two final components.

#### Component 5: The Abstraction Engine (Summarization)

**What it is:** A function that takes all the text from a single cluster and uses an LLM to generate a single, high-quality summary.

**Why we need it:** This is the "A" in RAPTOR: **Abstractive**. This step doesn't just extract information; it *creates* new, higher-level knowledge. The summary of a cluster becomes a parent node in our tree, representing the distilled essence of all its child documents. This is how we move up the ladder of abstraction.

**How it works:** We create a LangChain Expression Language (LCEL) chain with a detailed prompt that instructs the LLM to act as an expert technical writer and synthesize the provided context.

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Define the summarization chain
summarization_prompt = ChatPromptTemplate.from_template(
    """You are an expert technical writer. 
    Given the following collection of text chunks from the Hugging Face documentation, synthesize them into a single, coherent, and detailed summary. 
    Focus on the main concepts, APIs, and workflows described.
    CONTEXT: {context}
    DETAILED SUMMARY:"""
)
summarization_chain = summarization_prompt | llm | StrOutputParser()

print("Summarization engine defined.")

Summarization engine defined.


#### Component 6: The Recursive Tree Builder

**What it is:** The main recursive function that orchestrates the entire tree-building process, level by level.

**Why we need it:** This function automates the hierarchical construction. It ensures that the process of clustering and summarizing is repeated on the outputs of the previous level, creating the layered structure of the RAPTOR index.

**How it works:**
1.  It takes a list of texts for the current `level`.
2.  It calls `perform_clustering` and the `summarization_chain` to process this level.
3.  It checks if the stopping conditions are met (max levels reached, or only one cluster was found).
4.  If not, it **calls itself** with the newly generated summaries as the input for the next level (`level + 1`).

In [None]:
def recursive_build_tree(texts: List[str], level: int = 1, n_levels: int = 3) -> Dict[int, Tuple[pd.DataFrame, pd.DataFrame]]:
    """The main recursive function to build the RAPTOR tree using all components."""
    results = {}
    # Base case: stop if the max level is exceeded or at most one text remains
    if level > n_levels or len(texts) <= 1:
        return results

    # --- Embed and Cluster ---
    text_embeddings_np = np.array(embeddings.embed_documents(texts))
    cluster_labels = perform_clustering(text_embeddings_np)
    df_clusters = pd.DataFrame({'text': texts, 'cluster': cluster_labels})

    # --- Prepare for Summarization by expanding clusters ---
    expanded_list = []
    for _, row in df_clusters.iterrows():
        for cluster_id in row['cluster']:
            expanded_list.append({'text': row['text'], 'cluster': int(cluster_id)})
    
    if not expanded_list:
        return results
        
    expanded_df = pd.DataFrame(expanded_list)
    all_clusters = expanded_df['cluster'].unique()
    print(f"--- Level {level}: Generated {len(all_clusters)} clusters ---")

    # --- Summarize each cluster ---
    summaries = []
    for i in all_clusters:
        cluster_texts = expanded_df[expanded_df['cluster'] == i]['text'].tolist()
        formatted_txt = "\n\n---\n\n".join(cluster_texts)
        summary = summarization_chain.invoke({"context": formatted_txt})
        summaries.append(summary)
        print(f"Level {level}, Cluster {i}: Generated summary of length {len(summary)} chars.")

    df_summary = pd.DataFrame({'summaries': summaries, 'cluster': all_clusters})
    results[level] = (df_clusters, df_summary)

    # --- Recurse if possible ---
    if level < n_levels and len(all_clusters) > 1:
        new_texts = df_summary["summaries"].tolist()
        next_level_results = recursive_build_tree(new_texts, level + 1, n_levels)
        results.update(next_level_results)

    return results

print("Recursive tree builder defined.")

Recursive tree builder defined.
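
The control flow of the recursion can be checked without any models. The sketch below keeps the same stopping conditions as `recursive_build_tree` but replaces UMAP/GMM and the LLM with trivial stand-ins. `build_levels`, `pair_up`, and `join_texts` are hypothetical names introduced only for this illustration.

```python
from typing import Callable, Dict, List


def build_levels(
    texts: List[str],
    cluster_fn: Callable[[List[str]], List[List[str]]],
    summarize_fn: Callable[[List[str]], str],
    level: int = 1,
    n_levels: int = 3,
) -> Dict[int, List[str]]:
    """Toy version of the recursion: returns {level: summaries}."""
    if level > n_levels or len(texts) <= 1:
        return {}
    clusters = cluster_fn(texts)
    summaries = [summarize_fn(cluster) for cluster in clusters]
    results = {level: summaries}
    # Recurse on the summaries, exactly as the real builder does.
    if level < n_levels and len(clusters) > 1:
        results.update(build_levels(summaries, cluster_fn, summarize_fn, level + 1, n_levels))
    return results


def pair_up(texts: List[str]) -> List[List[str]]:
    """Stand-in for clustering: group neighbouring texts in pairs."""
    return [texts[i:i + 2] for i in range(0, len(texts), 2)]


def join_texts(cluster: List[str]) -> str:
    """Stand-in for the LLM: 'summarize' by joining the texts."""
    return "(" + "+".join(cluster) + ")"


tree = build_levels(list("abcdefgh"), pair_up, join_texts)
for lvl, level_summaries in tree.items():
    print(lvl, level_summaries)
```

Eight inputs shrink to four, two, and then one summary, at which point the level limit stops the recursion.
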


#### Executing the Tree-Building Process

Now, we execute the main recursive function on our initial leaf nodes. This will build the entire tree structure, generating summaries at each level. This is the most computationally intensive step of the entire notebook.

In [None]:
# Execute the RAPTOR process on our chunked leaf_texts.
# This will build a tree with a maximum of 3 levels of summarization.
raptor_results = recursive_build_tree(leaf_texts, level=1, n_levels=3)

--- Level 1: Generated 8 clusters ---
Level 1, Cluster 0: Generated summary of length 2011 chars.
Level 1, Cluster 1: Generated summary of length 1954 chars.
Level 1, Cluster 2: Generated summary of length 2089 chars.
Level 1, Cluster 3: Generated summary of length 1877 chars.
Level 1, Cluster 4: Generated summary of length 2043 chars.
Level 1, Cluster 5: Generated summary of length 1998 chars.
Level 1, Cluster 6: Generated summary of length 2015 chars.
Level 1, Cluster 7: Generated summary of length 1932 chars.
--- Level 2: Generated 3 clusters ---
Level 2, Cluster 0: Generated summary of length 2050 chars.
Level 2, Cluster 1: Generated summary of length 1988 chars.
Level 2, Cluster 2: Generated summary of length 1965 chars.


### Step 6: Indexing with the "Collapsed Tree" Strategy

**What it is:** Instead of building a complex graph data structure, we use a simple and effective strategy called the "collapsed tree." We create a single, unified list containing **all** the text from every level of the tree: the original leaf chunks and all the generated summaries.

**Why we do it:** This allows us to use a standard vector store (like FAISS or Chroma) for retrieval. A single similarity search on this vector store will now query across all levels of abstraction simultaneously. It is a simple design, and the RAPTOR paper reports that collapsed-tree retrieval performed better than layer-by-layer tree traversal in its experiments.

**How it works:** We iterate through our `raptor_results`, collect all the leaf texts and summaries into one list, and then build a FAISS vector store from this combined corpus.

In [None]:
from langchain_community.vectorstores import FAISS

# Combine all texts (original chunks and all generated summaries) into a single list.
all_texts = leaf_texts.copy()
for level in raptor_results:
    # Get the summaries from the current level's results
    summaries = raptor_results[level][1]['summaries'].tolist()
    # Add them to our master list
    all_texts.extend(summaries)

# Build the final vector store using FAISS, a fast in-memory vector database.
vectorstore = FAISS.from_texts(texts=all_texts, embedding=embeddings)

# Create a retriever from the vector store.
# We configure it to retrieve the top 5 most similar documents for any query.
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})

print(f"Built vector store with {len(all_texts)} total documents (leaves + summaries).")

Built vector store with 423 total documents (leaves + summaries).
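
The effect of collapsing the tree can be seen in a toy setting. In the sketch below, a bag-of-words cosine similarity stands in for the embedding model and FAISS, and the texts are invented; only the idea of searching leaves and summaries in one flat list carries over.

```python
import math
from collections import Counter
from typing import List, Tuple


def bow(text: str) -> Counter:
    """Bag-of-words vector (a crude stand-in for a sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


toy_leaves = ["pipeline loads a model and tokenizer", "lora adds trainable adapter layers"]
toy_summaries = {1: ["libraries make pretrained models easy to use and cheap to fine-tune"]}

# Collapse the tree: one flat list of (level, text) pairs.
collapsed: List[Tuple[int, str]] = [(0, t) for t in toy_leaves]
for level, level_summaries in toy_summaries.items():
    collapsed.extend((level, s) for s in level_summaries)


def search(query: str, k: int = 1) -> List[Tuple[int, str]]:
    """Rank every node, regardless of level, by similarity to the query."""
    q = bow(query)
    return sorted(collapsed, key=lambda item: cosine(q, bow(item[1])), reverse=True)[:k]


print(search("which adapter layers does lora train"))
print(search("why are pretrained models easy to use"))
```

The detailed question lands on a leaf (level 0) and the broad one on a summary (level 1), without any level-aware routing logic.
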


### Step 7: Retrieval and Generation (RAG)

Finally, we construct a RAG chain to ask questions. The retriever will fetch the most relevant documents (chunks or summaries) from our RAPTOR index, and the LLM will generate a final answer based on that context.

In [None]:
# This prompt template is for the final generation step.
final_prompt_text = """You are an expert assistant for the Hugging Face ecosystem. 
Answer the user's question based ONLY on the following context. If the context does not contain the answer, state that you don't know.

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:"""

final_prompt = ChatPromptTemplate.from_template(final_prompt_text)

# A helper function to format the retrieved documents into a single string.
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Construct the final RAG chain using LCEL.
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()} # Retrieve and format context
    | final_prompt
    | llm
    | StrOutputParser()
)

print("RAG chain created. Ready for querying.")

RAG chain created. Ready for querying.
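
For readers new to LCEL, the data flow of `rag_chain` can be written out as ordinary function calls. The sketch below is a plain-Python analogue under simplifying assumptions: `make_rag_chain`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins, and no LangChain objects are involved.

```python
from typing import Callable, List


def make_rag_chain(
    retrieve: Callable[[str], List[str]],
    format_docs: Callable[[List[str]], str],
    render_prompt: Callable[[str, str], str],
    generate: Callable[[str], str],
) -> Callable[[str], str]:
    """Plain-Python analogue of {context, question} | prompt | llm | parser."""
    def chain(question: str) -> str:
        context = format_docs(retrieve(question))  # retriever | format_docs
        prompt = render_prompt(context, question)  # the question passes through unchanged
        return generate(prompt)                    # llm; the parser keeps the string as-is
    return chain


# Hypothetical stand-ins for the retriever and the LLM.
def fake_retrieve(question: str) -> List[str]:
    return ["DOC about pipelines", "DOC about PEFT"]


def fake_generate(prompt: str) -> str:
    return f"answer grounded in {prompt.count('DOC')} retrieved passages"


toy_chain = make_rag_chain(
    retrieve=fake_retrieve,
    format_docs=lambda docs: "\n\n".join(docs),
    render_prompt=lambda context, question: (
        f"CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:"
    ),
    generate=fake_generate,
)
print(toy_chain("What is PEFT?"))
```

The key point is that the same question string feeds two branches: one goes through retrieval and formatting to become the context, and the other is passed to the prompt untouched.
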


#### Querying the Multi-Resolution Index

Now we demonstrate the power of the RAPTOR index by asking questions at different levels of abstraction.

##### Query 1: Specific, Low-Level Question

This type of query should match a granular **leaf node** containing a specific API or code example.

In [None]:
question_specific = "How do I use the `pipeline` function in the Transformers library? Give me a simple code example."
answer = rag_chain.invoke(question_specific)
print(answer)

The `pipeline` function is the easiest way to use a pre-trained model for a given task. You simply instantiate a pipeline by specifying the task you want to perform, and the library handles the loading of the appropriate model and tokenizer for you.

Here is a simple code example for a sentiment analysis task:

```python
from transformers import pipeline

# Create a sentiment analysis pipeline
classifier = pipeline("sentiment-analysis")

# Use the pipeline on some text
result = classifier("I love using Hugging Face libraries!")
print(result)
# Output: [{'label': 'POSITIVE', 'score': 0.9998}]
```

You can use it for many other tasks like "text-generation", "question-answering", and "summarization" by changing the task name.


##### Query 2: Mid-Level, Conceptual Question

This query asks about a process or workflow. It is likely to match one of the **generated summaries** from Level 1 or 2, which synthesizes information from multiple detailed chunks.

In [None]:
question_mid_level = "What are the main steps involved in fine-tuning a model using the PEFT library?"
answer = rag_chain.invoke(question_mid_level)
print(answer)

Fine-tuning a model using the Parameter-Efficient Fine-Tuning (PEFT) library involves several key steps to efficiently adapt a large pre-trained model to a new task without modifying all of its parameters:

1.  **Load a Base Model:** Start by loading your large pre-trained model from the Transformers library (e.g., a model from the `AutoModelForCausalLM` class).

2.  **Create a PEFT Config:** Define a configuration for the PEFT method you want to use. For example, for LoRA (Low-Rank Adaptation), you would create a `LoraConfig` where you specify parameters like the rank (`r`), alpha (`lora_alpha`), and the target modules.

3.  **Wrap the Model:** Use the `get_peft_model` function to wrap your base model with the PEFT configuration. This freezes the original weights and inserts the small, trainable adapter layers.

4.  **Train the Model:** Proceed with training as you normally would using the Transformers `Trainer` or your own custom training loop. Only the adapter weights will be update

##### Query 3: Broad, High-Level Question

This is the type of query where standard RAG tends to fall short, because no single low-level chunk states the overall theme. It should match a **high-level summary** near the top of our RAPTOR tree, providing a concise, thematic overview.

In [None]:
question_high_level = "What is the core philosophy of the Hugging Face ecosystem?"
answer = rag_chain.invoke(question_high_level)
print(answer)

Based on the provided context, the core philosophy of the Hugging Face ecosystem is to democratize state-of-the-art natural language processing and machine learning. This is achieved through a set of interoperable, open-source libraries built on three main principles:

1.  **Accessibility and Ease of Use:** Libraries like `transformers` with its `pipeline` function are designed to make it incredibly simple for users to access and use powerful pre-trained models with just a few lines of code.

2.  **Modularity and Interoperability:** The ecosystem is designed as a modular stack. `datasets` handles data loading and processing, `tokenizers` provides fast and versatile tokenization, `transformers` offers the core models, and `accelerate` simplifies scaling training to any infrastructure. These libraries work seamlessly together.

3.  **Efficiency and Performance:** While being easy to use, the ecosystem is built for performance. Techniques like Parameter-Efficient Fine-Tuning (PEFT) and to

### Conclusion

This notebook demonstrated the end-to-end implementation of the RAPTOR methodology using an advanced, multi-stage clustering approach. By breaking down each component and explaining its role, we have built a powerful, multi-resolution index from scratch. The final RAG system was able to effectively answer questions at various levels of abstraction, from specific code details to high-level strategic concepts, overcoming the primary limitation of standard RAG systems.