### Parallel Context Pre-processing for Accuracy Gains
The RAG patterns we have explored so far have focused on improving the initial retrieval step: finding more of the right documents. This pattern, Parallel Context Pre-processing, focuses on what happens after retrieval. A common strategy for maximizing recall is to retrieve a large number of candidate documents (k=10 or more).

However, passing this large, often noisy collection of documents directly into the final generator LLM's context window is problematic.

<p align="center">
  <img src="../../figures/parallel_context_processing.png" width="800">
</p>

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace

llm = ChatHuggingFace(
    llm=HuggingFaceEndpoint(
        model="Qwen/Qwen3-4B-Instruct-2507"
    )
)

Passing the raw documents straight through is slow, expensive (due to high token counts), and can actually harm accuracy by overwhelming the model with irrelevant information (the "lost in the middle" problem).

The architectural solution is to introduce an intermediate “distillation” step. After retrieving a large set of candidate documents, we use multiple, small, parallel LLM calls to process them. Each call acts as a highly-focused filter, checking a single document for its relevance to the specific question. Only the documents that pass this check are included in the final, “distilled” context that is sent to the main generator.
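As a rough illustration of this fan-out-and-filter shape, the sketch below swaps the LLM relevance check for a trivial keyword test. `is_relevant` and its hard-coded key terms are hypothetical stand-ins, not the check used later in this notebook. It runs one check per document in a thread pool and keeps only the documents that pass, in their original retrieval order.

```python
from concurrent.futures import ThreadPoolExecutor

def is_relevant(question: str, document: str) -> bool:
    """Hypothetical stand-in for a small LLM call.

    A real check would reason about `question`; this toy version only looks
    for assumed key terms so the example runs without a model.
    """
    key_terms = ["QLeap-V4", "power supply"]  # assumed terms, for illustration only
    return all(term.lower() in document.lower() for term in key_terms)

def distill(question: str, documents: list[str], max_workers: int = 5) -> list[str]:
    """Check every document independently and in parallel, then keep the passes."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order, so retrieval ranking is preserved.
        verdicts = list(executor.map(lambda doc: is_relevant(question, doc), documents))
    return [doc for doc, keep in zip(documents, verdicts) if keep]

candidates = [
    "For optimal performance with the QLeap-V4, a power supply unit of at least 1200W is recommended.",
    "The QLeap-V3 chip had a recommended power supply of 800W.",
    "The official price for the QLeap-V4 is $1,999 USD.",
]
distilled = distill("What is the recommended power supply for the QLeap-V4 processor?", candidates)
print(distilled)
```

Only the first candidate survives here: the second mentions a different chip, and the third never mentions a power supply.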

We will build and compare two RAG systems, one that uses a large, raw context and another that uses this parallel pre-processing step, to measure the effect on context size, latency, and answer quality.

### Creating the Knowledge Base
We'll create a slightly larger knowledge base with some documents that are only tangentially related to each other. This will create a scenario where a high-recall retrieval step pulls in some noisy, irrelevant documents, making the distillation step necessary.

In [3]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings  # the langchain_community import is deprecated
from langchain_core.documents import Document

kb_docs = [
    Document(page_content="The QLeap-V4 processor, released in 2023, is our flagship AI accelerator. Its primary use case is for training large language models.", metadata={"source": "QL-V4-SpecSheet"}),
    Document(page_content="A key feature of the QLeap-V4 is its advanced thermal management system. The official error code for overheating is 'ERR_THROTTLE_900'.", metadata={"source": "QL-V4-Troubleshooting"}),
    Document(page_content="For optimal performance with the QLeap-V4, a power supply unit of at least 1200W is recommended.", metadata={"source": "QL-V4-HardwareGuide"}),
    Document(page_content="Our previous generation chip, the QLeap-V3 (released in 2021), had a known issue with its memory controller that was fixed in later revisions.", metadata={"source": "QL-V3-KnownIssues"}),
    Document(page_content="The Aura Smart Ring uses a photoplethysmography (PPG) sensor to measure heart rate.", metadata={"source": "Aura-TechSpec"}),
    Document(page_content="The official price for the QLeap-V4 is $1,999 USD. Educational and volume discounts are available.", metadata={"source": "QL-V4-Pricing"}),
    Document(page_content="Software drivers for the QLeap-V4 are available for Linux and Windows. The latest driver version is 512.77.", metadata={"source": "QL-V4-Downloads"}),
    Document(page_content="Project 'Titan' is our company's initiative to develop energy-efficient hardware, but it is a separate research project from the QLeap product line.", metadata={"source": "Project-Titan-FAQ"}),
    Document(page_content="Warranty claims for the QLeap-V4 processor must be filed within 2 years of the purchase date.", metadata={"source": "QL-V4-Warranty"}),
    Document(page_content="The QLeap-V3 chip had a recommended power supply of 800W.", metadata={"source": "QL-V3-HardwareGuide"})
]

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(kb_docs, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10}) # High recall retriever

print(f"Knowledge Base created with {len(kb_docs)} documents.")

Knowledge Base created with 10 documents.


### Components for Token Counting
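To measure cost savings, we need a way to count the number of tokens in a prompt. We'll use the `tiktoken` library for this. Because `cl100k_base` is an OpenAI encoding rather than the Qwen tokenizer, the counts are estimates, which is sufficient for comparing the two systems against each other.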

In [4]:
import tiktoken

def count_tokens(text: str) -> int:
    """Counts the number of tokens in a string using tiktoken."""
    # Using a common encoding for estimation
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

### Building the RAG Systems
We'll build two systems: the simple, large-context baseline, and the advanced graph with the parallel distillation step.

#### The Simple RAG System (Baseline)
This is a standard RAG chain that retrieves 10 documents and sends them all to the generator.

In [5]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

generator_prompt_template = (
    "You are an expert technical support agent. Answer the user's question with high accuracy, based *only* on the following context. "
    "If the context does not contain the answer, state that clearly.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
generator_prompt = ChatPromptTemplate.from_template(generator_prompt_template)

def format_docs(docs):
    return "\n\n".join(f"[Source: {doc.metadata.get('source', 'N/A')}] {doc.page_content}" for doc in docs)

simple_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | generator_prompt
    | llm
    | StrOutputParser()
)

#### The Advanced RAG System with Parallel Distillation
This system uses a LangGraph graph to add a distill_context node between retrieval and generation.

In [12]:
from typing import TypedDict, List
from pydantic import BaseModel, Field  # langchain_core.pydantic_v1 is a deprecated shim
from concurrent.futures import ThreadPoolExecutor, as_completed
from langchain_core.output_parsers import JsonOutputParser

class RAGGraphState(TypedDict):
    question: str
    raw_docs: List[Document]
    distilled_docs: List[Document]
    final_answer: str

class RelevancyCheck(BaseModel):
    """A check for whether a document is relevant to a question."""
    is_relevant: bool = Field(description="True if the document contains information that directly helps answer the question.")
    brief_explanation: str = Field(description="A one-sentence explanation of why the document is or is not relevant.")

# Node 1: Retrieval
def retrieval_node(state: RAGGraphState):
    print("--- [Retriever] Retrieving initial set of 10 documents... ---")
    raw_docs = retriever.invoke(state['question'])
    return {"raw_docs": raw_docs}

# Node 2: Parallel Context Distillation
relevance_parser = JsonOutputParser(pydantic_object=RelevancyCheck)
distiller_prompt = ChatPromptTemplate.from_template(
    "Given the user's question, determine if the following document is relevant for answering it. "
    "Provide a brief explanation.\n\n{format_instructions}\n\n"
    "Question: {question}\n\nDocument:\n{document}"
).partial(format_instructions=relevance_parser.get_format_instructions())
distiller_chain = distiller_prompt | llm | relevance_parser

def distill_context_node(state: RAGGraphState):
    """Scans all retrieved documents in parallel to filter for relevance."""
    print(f"--- [Distiller] Pre-processing {len(state['raw_docs'])} raw documents in parallel... ---")
    
    relevant_docs = []
    with ThreadPoolExecutor(max_workers=5) as executor:
        future_to_doc = {executor.submit(distiller_chain.invoke, {"question": state['question'], "document": doc.page_content}): doc for doc in state['raw_docs']}
        for future in as_completed(future_to_doc):
            doc = future_to_doc[future]
            try:
                result = future.result()
                if result["is_relevant"]:
                    # Single quotes inside the f-string: nested double quotes are a syntax error before Python 3.12
                    print(f"  - Doc '{doc.metadata['source']}' IS relevant. Reason: {result['brief_explanation']}")
                    relevant_docs.append(doc)
                else:
                    print(f"  - Doc '{doc.metadata['source']}' is NOT relevant. Reason: {result['brief_explanation']}")
            except Exception as e:
                print(f"Error processing doc {doc.metadata['source']}: {e}")
    
    print(f"--- [Distiller] Distilled context down to {len(relevant_docs)} documents. ---")
    return {"distilled_docs": relevant_docs}

# Node 3: Generation
def generation_node(state: RAGGraphState):
    print("--- [Generator] Synthesizing final answer from distilled context... ---")
    context = format_docs(state['distilled_docs'])
    answer = (generator_prompt | llm | StrOutputParser()).invoke({"context": context, "question": state['question']})
    return {"final_answer": answer}
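One fragile spot in the distiller is the lookup of the `is_relevant` key: a small model can return JSON with the key missing, or with a string such as `"true"` instead of a boolean. A minimal sketch of a defensive normalizer follows. `normalize_verdict` is a hypothetical helper, not part of LangChain, and treating unrecognizable verdicts as not relevant is an assumed policy (defaulting to `True` instead would favor recall over precision).

```python
def normalize_verdict(result: object, default: bool = False) -> bool:
    """Coerce a parsed relevance verdict into a bool.

    Accepts the dict shape the distiller expects ({"is_relevant": ...}) and
    falls back to `default` when the value is missing or unrecognizable.
    """
    if not isinstance(result, dict):
        return default
    value = result.get("is_relevant")
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in {"true", "yes"}:
            return True
        if text in {"false", "no"}:
            return False
    return default

print(normalize_verdict({"is_relevant": True}))
print(normalize_verdict({"is_relevant": "false"}))
print(normalize_verdict({"brief_explanation": "flag missing"}))
```

Inside `distill_context_node`, the relevance test could then be written as `if normalize_verdict(result):`.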

In [13]:
from langgraph.graph import StateGraph, END

workflow = StateGraph(RAGGraphState)
workflow.add_node("retrieve", retrieval_node)
workflow.add_node("distill", distill_context_node)
workflow.add_node("generate", generation_node)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "distill")
workflow.add_edge("distill", "generate")
workflow.add_edge("generate", END)

advanced_rag_app = workflow.compile()

In [14]:
user_query = "What is the recommended power supply for the QLeap-V4 processor?"

### Running the Simple RAG System (Large Context)

In [15]:
import time

print("="*60)
print("                SIMPLE RAG SYSTEM (LARGE CONTEXT)")
print("="*60 + "\n")

start_time = time.time()
raw_docs_simple = retriever.invoke(user_query)
context_simple = format_docs(raw_docs_simple)
context_tokens_simple = count_tokens(context_simple)

print(f"--- Retrieved {len(raw_docs_simple)} Documents ---")
print(f"Context Size: {context_tokens_simple} tokens\n")

print("--- Generation ---")
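# Note: simple_rag_chain runs retrieval again internally, so this timing covers
# retrieval plus generation, not generation alone.
gen_start_time = time.time()
simple_answer = simple_rag_chain.invoke(user_query)
gen_time_simple = time.time() - gen_start_time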
print(f"Generation Time: {gen_time_simple:.2f} seconds")
print("Final Answer:")
print(simple_answer)

                SIMPLE RAG SYSTEM (LARGE CONTEXT)

--- Retrieved 10 Documents ---
Context Size: 358 tokens

--- Generation ---
Generation Time: 1.27 seconds
Final Answer:
The recommended power supply for the QLeap-V4 processor is at least 1200W.


### Running the Advanced RAG System (Distilled Context)

In [16]:
print("="*60)
print("             ADVANCED RAG SYSTEM (DISTILLED CONTEXT)")
print("="*60 + "\n")

inputs = {"question": user_query}
advanced_result = None
for output in advanced_rag_app.stream(inputs, stream_mode="values"):
    advanced_result = output

distilled_docs = advanced_result['distilled_docs']
context_advanced = format_docs(distilled_docs)
context_tokens_advanced = count_tokens(context_advanced)

print(f"Context Size: {context_tokens_advanced} tokens\n")

# Manually time the final generation step for comparison
print("--- [Generator] Synthesizing final answer from distilled context... ---")
gen_start_time = time.time()
advanced_answer = (generator_prompt | llm | StrOutputParser()).invoke({"context": context_advanced, "question": user_query})
gen_time_advanced = time.time() - gen_start_time
print(f"Generation Time: {gen_time_advanced:.2f} seconds")
print("Final Answer:")
print(advanced_answer)

             ADVANCED RAG SYSTEM (DISTILLED CONTEXT)

--- [Retriever] Retrieving initial set of 10 documents... ---
--- [Distiller] Pre-processing 10 raw documents in parallel... ---
  - Doc 'QL-V4-Pricing' is NOT relevant. Reason: The document does not mention any information about the power supply for the QLeap-V4 processor.
  - Doc 'QL-V4-SpecSheet' is NOT relevant. Reason: The document does not mention any information about the recommended power supply for the QLeap-V4 processor.
  - Doc 'QL-V4-Troubleshooting' is NOT relevant. Reason: The document does not mention anything about the power supply requirements for the QLeap-V4 processor.
  - Doc 'QL-V4-HardwareGuide' IS relevant. Reason: The document specifies that a 1200W power supply is recommended for optimal performance with the QLeap-V4 processor, directly answering the question.
  - Doc 'QL-V3-HardwareGuide' is NOT relevant. Reason: The document discusses the power supply for the QLeap-V3 chip, not the QLeap-V4 processor, maki
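The cell above times only the final generation call, so the cost of the distillation calls is not captured. One way to see the whole picture is a small per-stage timer, sketched below. `StageTimer` is a hypothetical helper, and the `time.sleep` calls merely stand in for the real distillation and generation steps.

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self) -> None:
        self.durations: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self) -> float:
        return sum(self.durations.values())

timer = StageTimer()
with timer.stage("distill"):
    time.sleep(0.02)  # placeholder for the parallel relevance checks
with timer.stage("generate"):
    time.sleep(0.01)  # placeholder for the final LLM call

for name, seconds in timer.durations.items():
    print(f"{name}: {seconds:.3f}s")
print(f"total: {timer.total():.3f}s")
```

Wrapping the retrieval, distillation, and generation calls of both systems in such stages would give a like-for-like end-to-end comparison.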

### Analysis

In [17]:
# --- Analysis Setup ---
context_tokens_simple = count_tokens(context_simple)
context_tokens_advanced = count_tokens(context_advanced)
token_improvement = (context_tokens_simple - context_tokens_advanced) / context_tokens_simple * 100
latency_improvement = (gen_time_simple - gen_time_advanced) / gen_time_simple * 100

# --- Print Results ---
print("="*60)
print("                  ACCURACY & QUALITY ANALYSIS")
print("="*60 + "\n")
print("**Simple RAG's Answer (from Large, Noisy Context):**")
print(f'"{simple_answer}"\n')
print("**Advanced RAG's Answer (from Distilled, Focused Context):**")
print(f'"{advanced_answer}"\n')

print("="*60)
print("                 LATENCY & COST (TOKEN) ANALYSIS")
print("="*60 + "\n")
print("| Metric                      | Simple RAG (Large Context) | Advanced RAG (Distilled Context) | Improvement |")
print("|-----------------------------|----------------------------|----------------------------------|-------------|")
print(f"| Context Size (Tokens)       | {context_tokens_simple:<26} | {context_tokens_advanced:<32} | **-{token_improvement:.0f}%**      |")
print(f"| Final Generation Time (s)   | {gen_time_simple:<26.2f} | {gen_time_advanced:<32.2f} | **-{latency_improvement:.0f}%**      |")

                  ACCURACY & QUALITY ANALYSIS

**Simple RAG's Answer (from Large, Noisy Context):**
"The recommended power supply for the QLeap-V4 processor is at least 1200W."

**Advanced RAG's Answer (from Distilled, Focused Context):**
"The recommended power supply for the QLeap-V4 is a power supply unit of at least 1200W."

                 LATENCY & COST (TOKEN) ANALYSIS

| Metric                      | Simple RAG (Large Context) | Advanced RAG (Distilled Context) | Improvement |
|-----------------------------|----------------------------|----------------------------------|-------------|
| Context Size (Tokens)       | 358                        | 35                               | **-90%**      |
| Final Generation Time (s)   | 1.27                       | 0.96                             | **-25%**      |
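The Improvement column is a relative reduction: (baseline - new) / baseline × 100. As a quick check, the sketch below recomputes it from the rounded figures in the table; `percent_reduction` is an illustrative helper name. Because the table's timings are rounded to two decimals, the latency figure recomputed this way can differ by a point from the value printed from the unrounded measurements.

```python
def percent_reduction(baseline: float, new: float) -> float:
    """Relative reduction of `new` versus `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - new) / baseline * 100

token_reduction = percent_reduction(358, 35)
latency_reduction = percent_reduction(1.27, 0.96)
print(f"tokens:  {token_reduction:.1f}%")
print(f"latency: {latency_reduction:.1f}%")
```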


The final analysis shows where the Parallel Context Pre-processing pattern helps in this run, and where the evidence is thinner.

1. Accuracy: Both systems returned the correct answer (at least 1200W), so this query does not show an accuracy gap. What the distillation step did do is remove the distracting document about the older “QLeap-V3” and its 800W recommendation before generation, which eliminates one way the generator could go wrong on larger, noisier contexts.

2. Lower Cost: We reduced the context fed to the final, expensive generator by 90% (from 358 to 35 tokens). In a production system processing millions of queries, this lowers the generator's input cost substantially. The distillation calls consume tokens too, so the net saving depends on running them on a smaller, cheaper model than the generator.

3. Lower Latency: The final generation step was 25% faster (0.96 seconds versus 1.27 seconds). The simple-chain timing also includes retrieval, and the distillation step's own overhead was not measured, so a fair end-to-end comparison would need to time the whole pipeline. Because the relevance checks run in parallel, that overhead is closer to the latency of one small LLM call per batch of workers than to one call per document.
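The token figure above covers only the final generator's context. A fuller accounting would also count the tokens spent on the per-document distillation calls. The sketch below does that arithmetic under stated assumptions: the 60-token instruction overhead per call and the relative prices are made-up placeholders, not measurements from this notebook. Only the 358 and 35 token context sizes and the 10 retrieved documents come from the run above.

```python
def pipeline_cost(context_tokens, distill_calls=0, distill_tokens_per_call=0,
                  generator_price=1.0, distiller_price=1.0):
    """Input-token cost of one query: distillation calls plus final generation.

    Prices are relative cost per input token (assumed units, not real pricing).
    """
    distill_cost = distill_calls * distill_tokens_per_call * distiller_price
    generation_cost = context_tokens * generator_price
    return distill_cost + generation_cost

# Assumed: each distillation call sends about 36 document tokens (358 / 10, rounded)
# plus a hypothetical 60-token instruction prompt.
tokens_per_call = 36 + 60

baseline = pipeline_cost(context_tokens=358)
same_model = pipeline_cost(35, distill_calls=10, distill_tokens_per_call=tokens_per_call)
cheap_distiller = pipeline_cost(35, distill_calls=10, distill_tokens_per_call=tokens_per_call,
                                distiller_price=0.1)
print(f"baseline:        {baseline:.0f}")
print(f"same model:      {same_model:.0f}")
print(f"cheap distiller: {cheap_distiller:.0f}")
```

Under these assumptions, distilling with the same model spends more input tokens overall than the baseline, while a distiller priced at a tenth of the generator comes out ahead. The pattern's cost benefit therefore hinges on using a cheaper model for the filter step, or on raw contexts much larger than the short documents used here.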
