In [27]:
import os
from dotenv import load_dotenv
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
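
If the `.env` file is missing or the variable name is misspelled, `os.getenv` quietly returns `None` and the problem only shows up at the first API call. A minimal sketch of an early check follows; the helper name `require_env` is an assumption for illustration and is not part of `python-dotenv` or LlamaIndex.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value
```

In this notebook it could replace the plain lookup, e.g. `OPENAI_API_KEY = require_env("OPENAI_API_KEY")`, so a missing key fails immediately with a readable message.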

In [28]:
import nest_asyncio

nest_asyncio.apply()

## Load Data

In [29]:
from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader(input_files=["rag.pdf"]).load_data()


## Split into Nodes, then Define LLM and Embedding model

In [30]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
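
To make the splitting step concrete, here is a rough pure-Python sketch of sentence-aware chunking. It is not how `SentenceSplitter` is implemented (the real class works on token counts and supports overlap); the function name `chunk_sentences` and the word-based size limit are assumptions made for illustration.

```python
import re


def chunk_sentences(text: str, max_words: int = 50) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `max_words` words.

    A single sentence longer than `max_words` becomes its own oversized chunk.
    """
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "RAG retrieves passages. It then generates text. Retrieval is non-parametric."
for chunk in chunk_sentences(sample, max_words=7):
    print(chunk)
```

Keeping sentences intact means a chunk never ends mid-thought, which tends to help both embedding quality and summarization.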

In [44]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-4.1-mini", api_key=OPENAI_API_KEY)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

## Define Summary Index and Vector Index over the Same Data

In [32]:
from llama_index.core import SummaryIndex, VectorStoreIndex

summary_index = SummaryIndex(nodes)
vector_index = VectorStoreIndex(nodes)

2025-09-11 13:28:00,142 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


## Define Query Engines and Set Metadata

In [34]:
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
vector_query_engine = vector_index.as_query_engine()

In [51]:
from llama_index.core.tools import QueryEngineTool


summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description=(
        "Useful for summarization questions related to MetaGPT"
    ),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from the MetaGPT paper."
    ),
)

## Define Router Query Engine

In [53]:
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector

query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
    verbose=True
)
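
Conceptually, a single selector is shown the tool descriptions and the query, and returns the index of one tool. `LLMSingleSelector` asks the LLM to make that choice; the sketch below swaps in plain word overlap so it runs without an API call. The names `ToyTool` and `select_tool` are hypothetical and only illustrate the routing idea, including what happens when descriptions are identical.

```python
from dataclasses import dataclass


@dataclass
class ToyTool:
    name: str
    description: str


def select_tool(tools: list[ToyTool], query: str) -> int:
    """Return the index of the tool whose description shares the most words with the query.

    Ties go to the lowest index, so identical descriptions always select tool 0.
    """
    query_words = set(query.lower().split())
    scores = [len(query_words & set(t.description.lower().split())) for t in tools]
    return scores.index(max(scores))


tools = [
    ToyTool("summary", "useful for summarization questions about the whole paper"),
    ToyTool("vector", "useful for retrieving specific details and facts from the paper"),
]
print(select_tool(tools, "give me a summarization of the paper"))
print(select_tool(tools, "which specific retriever details are reported"))
```

With identical descriptions the choice carries no information about the query, which is why distinct, specific descriptions matter for the real selector as well.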

In [43]:
response = query_engine.query("What is the summary of the document?")
print(str(response))

2025-09-11 13:47:14,879 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-09-11 13:47:14,885 - INFO - Selecting query engine 1: The question requests a summary, which is a summarization task; choice (2) is explicitly for summarization questions related to MetaGPT, while choice (1) is for retrieving specific context..



2025-09-11 13:47:15,851 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-09-11 13:47:43,123 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


The work presents Retrieval-Augmented Generation (RAG), which fuses a neural generator’s parametric knowledge with non-parametric retrieval over external documents. Two variants are explored: RAG-Token, which can blend information across multiple sources during token generation, and RAG-Sequence. Across tasks like Jeopardy-style question generation and MS MARCO, these models produce more specific and factually accurate outputs than a BART baseline, while also yielding higher diversity (RAG-Sequence > RAG-Token > BART). A case study shows how retrieval guides generation when producing book titles (e.g., “The Sun Also Rises,” “A Farewell to Arms”), after which the generator’s parameters can complete the titles—illustrating cooperation between retrieved evidence and stored knowledge. In fact verification (FEVER), performance is close to stronger pipeline systems: within 4.3% for 3-way classification, and within 2.7% of a RoBERTa model trained with gold evidence for 2-way classification, d

In [40]:
print(len(response.source_nodes))

2


In [45]:
response = query_engine.query(
    "How do agents share information with other agents?"
)
print(str(response))

2025-09-11 13:49:12,402 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-09-11 13:49:12,409 - INFO - Selecting query engine 0: The question asks for a specific mechanism/detail from the MetaGPT paper, requiring retrieval of precise context rather than a general summary..



2025-09-11 13:49:36,252 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


- Use a shared external memory: Maintain a common, non-parametric knowledge store (e.g., a dense index of text passages such as Wikipedia).
- Retrieve, don’t message: Each agent encodes its query, performs Maximum Inner Product Search over the shared index to retrieve top-K relevant passages, and conditions its output on those passages.
- Flexible conditioning: Agents can condition on the same retrieved passages for an entire output or vary passages token-by-token, enabling aggregation of evidence from multiple sources.
- Immediate updates: To share new or revised information, update or swap the shared index (add/edit documents). All agents that use this index gain the information immediately, without retraining.
- Provenance and interpretability: Because the memory consists of raw, human-readable text, agents’ outputs can be traced back to retrieved evidence.


## Let's put everything together

In [47]:
def get_router_query_engine(pdf_path):
    """Build a router query engine over one PDF, reusing the imports from the cells above."""
    # Load environment variables
    load_dotenv()
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

    # Load and split document
    documents = SimpleDirectoryReader(input_files=[pdf_path]).load_data()
    splitter = SentenceSplitter(chunk_size=1024)
    nodes = splitter.get_nodes_from_documents(documents)

    # Set LLM and embedding model
    Settings.llm = OpenAI(model="gpt-4.1-mini", api_key=OPENAI_API_KEY)
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

    # Create indices
    summary_index = SummaryIndex(nodes)
    vector_index = VectorStoreIndex(nodes)

    # Create query engines
    summary_query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize",
        use_async=True,
    )
    vector_query_engine = vector_index.as_query_engine()

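    # Create tools. No description is passed, so both tools fall back to the
    # same default description and the selector has nothing to tell them apart.
    summary_tool = QueryEngineTool.from_defaults(query_engine=summary_query_engine)
    vector_tool = QueryEngineTool.from_defaults(query_engine=vector_query_engine)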

    # Create and return router query engine
    query_engine = RouterQueryEngine(
        selector=LLMSingleSelector.from_defaults(),
        query_engine_tools=[summary_tool, vector_tool],
        verbose=True
    )
    return query_engine

In [48]:
query_engine = get_router_query_engine("rag.pdf")

2025-09-11 13:55:18,815 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [49]:
response = query_engine.query("what are the important aspects of rag?")
print(str(response))

2025-09-11 13:57:25,853 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-09-11 13:57:25,858 - INFO - Selecting query engine 0: Both choices provide the same summary, but choice 1 is selected as it is the first instance of the relevant summary describing the important aspect of RAG..



2025-09-11 13:57:35,213 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Important aspects of Retrieval-Augmented Generation (RAG) include:

1. Hybrid Memory Architecture: RAG combines parametric memory, represented by a pre-trained sequence-to-sequence (seq2seq) model (such as BART), with non-parametric memory, which is a dense vector index of external documents (e.g., Wikipedia) accessed via a neural retriever (such as Dense Passage Retriever, DPR).

2. Retrieval-Augmented Generation: The model retrieves relevant documents conditioned on the input query and uses these documents as additional context to generate the output sequence. This approach allows the model to access and incorporate up-to-date and extensive external knowledge beyond what is stored in its parameters.

3. Two Model Variants: 
   - RAG-Sequence: Uses the same retrieved document to generate the entire output sequence, marginalizing over the top-K retrieved documents.
   - RAG-Token: Allows different documents to be used for generating each token in the output sequence, enabling the model