# Recursive Retriever + Document Agents

This guide shows how to combine recursive retrieval with “document agents” for advanced decision-making over heterogeneous documents.


Two factors motivate this approach to better retrieval:

- Decoupling retrieval embeddings from chunk-based synthesis. Fetching documents by their summaries often returns more relevant context for a query than matching against raw chunks. Recursive retrieval enables this directly.

- Going beyond fact-based question answering. Within a document, users may need to perform other tasks dynamically. We introduce “document agents”: agents that have access to both vector search and summary tools for a given document.
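To make the first point concrete, here is a minimal, library-free sketch of matching a query against summaries while handing the linked full text to synthesis. The `bow_cosine` scorer, the `SummaryEntry` class and the sample texts are hypothetical stand-ins; a real pipeline would score with an embedding model rather than word counts.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts (toy stand-in for embeddings)."""
    ca = Counter(re.findall(r"[a-z]+", a.lower()))
    cb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class SummaryEntry:
    summary: str    # text that is matched against the query
    full_text: str  # text that is handed to synthesis


def retrieve_full_text(query: str, entries: List[SummaryEntry]) -> str:
    """Rank entries by their summaries, then return the linked full text."""
    best = max(entries, key=lambda e: bow_cosine(query, e.summary))
    return best.full_text


# Placeholder data, made up for illustration.
entries = [
    SummaryEntry("Overview of sports teams and stadiums in Boston", "<full Boston sports section>"),
    SummaryEntry("Overview of the climate and weather in Seattle", "<full Seattle climate section>"),
]
print(retrieve_full_text("Which sports teams play in Boston?", entries))
```

The point of the split is that the text used for matching (`summary`) and the text used for answering (`full_text`) can differ, which is what recursive retrieval exploits.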

In [13]:
from pathlib import Path
from llama_index.schema import IndexNode
from llama_index.agent import OpenAIAgent
from llama_index.llms import AzureOpenAI, OpenAI
from llama_index.llm_predictor import LLMPredictor
from llama_index import set_global_service_context
import yaml, os, requests, warnings, logging, textwrap
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.text_splitter import TokenTextSplitter
from llama_index.tools import (
                                QueryEngineTool, 
                                ToolMetadata
                                )
from llama_index import (
                        ServiceContext, 
                        SimpleDirectoryReader,
                        load_index_from_storage, 
                        VectorStoreIndex, 
                        StorageContext, 
                        SummaryIndex
                        )


logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)
warnings.filterwarnings("ignore")

In [2]:
with open('cadentials.yaml') as f:
    credentials = yaml.load(f, Loader=yaml.FullLoader)

## Configure LLMs

In [3]:
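llm_flag = 'DIRECT'  # set to 'AZURE' to use the Azure OpenAI deployment instead

embedding_llm = HuggingFaceEmbedding(
                                    model_name="BAAI/bge-large-en-v1.5",
                                    device='mps'  # Apple Silicon GPU; use 'cuda' or 'cpu' on other hardware
                                    )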

# OpenAIAgent.from_tools expects an OpenAI-style LLM (not an LLMPredictor),
# so chat_llm holds the LLM itself in both branches.
if llm_flag == 'AZURE':
    chat_llm = AzureOpenAI(
                    model=credentials['AZURE_ENGINE'],
                    api_key=credentials['AZURE_OPENAI_API_KEY'],
                    deployment_name=credentials['AZURE_DEPLOYMENT_ID'],
                    api_version=credentials['AZURE_OPENAI_API_VERSION'],
                    azure_endpoint=credentials['AZURE_OPENAI_API_BASE'],
                    temperature=0.3
                    )
else:
    chat_llm = OpenAI(
                    api_key=credentials['OPENAI_API_KEY'],
                    temperature=0.3
                    )

text_splitter = TokenTextSplitter(
                                separator=" ",
                                chunk_size=1024,
                                chunk_overlap=20,
                                backup_separators=["\n"]
                                )

# the same service context works for both the Azure and the direct OpenAI LLM
service_context = ServiceContext.from_defaults(
                                                text_splitter=text_splitter,
                                                embed_model=embedding_llm,
                                                llm=chat_llm
                                                )

set_global_service_context(service_context)

## Scrape Data

In [4]:
wiki_titles = [
            "Toronto", 
            "Seattle", 
            "Chicago", 
            "Boston", 
            "Houston"
            ]

In [5]:
data_path = Path("data/wiki_data_cities")
data_path.mkdir(parents=True, exist_ok=True)  # create the folder before inspecting it

if not any(data_path.iterdir()):  # download only when the folder is empty
    for title in wiki_titles:
        response = requests.get(
                                "https://en.wikipedia.org/w/api.php",
                                params={
                                        "action": "query",
                                        "format": "json",
                                        "titles": title,
                                        "prop": "extracts",
                                        "explaintext": True,
                                        },
                                ).json()
        # "pages" is keyed by page ID, so take its single entry
        page = next(iter(response["query"]["pages"].values()))
        wiki_text = page["extract"]

        with open(data_path / f"{title}.txt", "w", encoding="utf-8") as fp:
            fp.write(wiki_text)
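The lookup `next(iter(response["query"]["pages"].values()))` works because the MediaWiki API keys `pages` by page ID, so the code simply takes the single entry whatever its ID is. The sketch below exercises that parsing step on hand-written payloads; the `extract_text` helper and both sample payloads are illustrative and not part of the notebook. It assumes that a title that does not exist comes back as a page entry without an `extract` field, which the helper turns into an explicit error.

```python
def extract_text(payload: dict) -> str:
    """Return the plain-text extract of the single page in a MediaWiki query payload."""
    pages = payload["query"]["pages"]
    page = next(iter(pages.values()))  # pages is keyed by page ID; take the only entry
    if "extract" not in page:
        raise KeyError(f"No extract returned for title {page.get('title')!r}")
    return page["extract"]


# Hand-written payloads mimicking the response shape (all values are made up).
found = {"query": {"pages": {"12345": {"pageid": 12345, "title": "Toronto", "extract": "Toronto is ..."}}}}
missing = {"query": {"pages": {"-1": {"title": "Torontoo", "missing": ""}}}}

print(extract_text(found))
try:
    extract_text(missing)
except KeyError as err:
    print("skipped:", err)
```

A check like this keeps a misspelled title from surfacing later as an unexplained `KeyError: 'extract'` in the download loop.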

## Load Data

In [6]:
city_docs = {}
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(
        input_files=[f"data/wiki_data_cities/{wiki_title}.txt"]
    ).load_data()
city_docs

{'Toronto': [Document(id_='36f0cb92-f42f-4815-b84f-f731a89d3599', embedding=None, metadata={'file_path': 'data/wiki_data_cities/Toronto.txt', 'file_name': 'Toronto.txt', 'file_type': 'text/plain', 'file_size': 82021, 'creation_date': '2024-01-24', 'last_modified_date': '2024-01-24', 'last_accessed_date': '2024-01-24'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Toronto is the most populous city in Canada and the capital city of the Canadian province of Ontario. With a recorded population of 2,794,356 in 2021, it is the fourth-most populous city in North America. The city is the anchor of the Golden Horseshoe, an urban agglomeration of 9,765,188 people (as of 2021) surrounding the western end of Lake Ontario, while the Greater Toronto Area proper had

## Build Document Agent for each Document

- First we define both a vector index (for semantic search) and summary index (for summarization) for each document. The two query engines are then converted into tools that are passed to an OpenAI function calling agent.
- This document agent can dynamically choose to perform semantic search or summarization within a given document.
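As a rough, offline illustration of the idea, the sketch below models a document agent as an object that holds two named tools and picks one per query. In the notebook that choice is made by the OpenAI function-calling model from the tool descriptions; the keyword rule in `choose_tool`, the `ToyDocumentAgent` class and both tool functions are hypothetical stand-ins so the example runs without an API key.

```python
from typing import Callable, Dict


def vector_tool(query: str) -> str:
    # Stand-in for semantic search over the document's chunks.
    return f"[vector_tool] top chunks for: {query}"


def summary_tool(query: str) -> str:
    # Stand-in for summarization over the whole document.
    return f"[summary_tool] summary for: {query}"


def choose_tool(query: str) -> str:
    """Toy replacement for the LLM's tool choice: detect summary-style questions."""
    summary_cues = ("summary", "summarize", "overview")
    if any(cue in query.lower() for cue in summary_cues):
        return "summary_tool"
    return "vector_tool"


class ToyDocumentAgent:
    """Holds the tools for one document and dispatches each query to one of them."""

    def __init__(self, tools: Dict[str, Callable[[str], str]]):
        self.tools = tools

    def chat(self, query: str) -> str:
        return self.tools[choose_tool(query)](query)


agent = ToyDocumentAgent({"vector_tool": vector_tool, "summary_tool": summary_tool})
print(agent.chat("Give me a summary of Chicago"))
print(agent.chat("What is the population of Chicago?"))
```

A real agent can also call several tools in sequence or answer without any tool call; the sketch only captures the single-dispatch case.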

In [7]:
agents = {}

for wiki_title in wiki_titles:
    # build vector index
    vector_index = VectorStoreIndex.from_documents(
                                                city_docs[wiki_title], 
                                                service_context=service_context
                                                )
    # build summary index
    summary_index = SummaryIndex.from_documents(
                                                city_docs[wiki_title], 
                                                service_context=service_context
                                                )
    
    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    list_query_engine = summary_index.as_query_engine()

    # define tools
    query_engine_tools = [
                        QueryEngineTool(
                                        query_engine=vector_query_engine,
                                        metadata=ToolMetadata(
                                                            name="vector_tool",
                                                            description=(
                                                                f"Useful for retrieving specific context from {wiki_title}"
                                                            ),
                                        ),
                        ),
                        QueryEngineTool(
                                        query_engine=list_query_engine,
                                        metadata=ToolMetadata(
                                                            name="summary_tool",
                                                            description=(
                                                                "Useful for summarization questions related to"
                                                                f" {wiki_title}"
                                                            ),
                                        ),
                        )
    ]

    # build agent
    agent = OpenAIAgent.from_tools(
                                query_engine_tools,
                                llm=chat_llm,
                                verbose=True,
                                )

    agents[wiki_title] = agent

## Build Composable Retriever over these Agents

- Next we define a set of summary nodes, one per Wikipedia city article, each linking to the agent built for that article. We then build a composable retriever and query engine on top of these nodes: a query is routed to the best-matching node, which in turn passes it to the relevant document agent.
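The routing step can be pictured with a small offline sketch: each top-level node stores a summary plus a reference to an object, the retriever keeps the best-matching node, and the query is then forwarded to that node's object. `ToyNode`, `route` and the word-overlap score are assumptions made for illustration; the notebook itself relies on `IndexNode` and a `VectorStoreIndex` with real embeddings.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class ToyNode:
    text: str                  # summary used for matching
    index_id: str              # label of the linked object
    obj: Callable[[str], str]  # object the query is forwarded to


def _words(text: str) -> Set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(query: str, text: str) -> int:
    """Count distinct words shared by query and text (toy similarity)."""
    return len(_words(query) & _words(text))


def route(query: str, nodes: List[ToyNode]) -> str:
    # Similar in spirit to similarity_top_k=1: keep only the best-matching node.
    best = max(nodes, key=lambda n: overlap(query, n.text))
    return best.obj(query)  # hand the query down to the linked object


# Each node links to a placeholder "agent" (a plain function here).
nodes = [
    ToyNode(
        text=f"Use this index for specific facts about {city}.",
        index_id=city,
        obj=lambda q, city=city: f"{city} agent answering: {q}",
    )
    for city in ["Toronto", "Boston"]
]
print(route("Tell me about the sports teams in Boston", nodes))
```

Because only the top node is kept, a question spanning several cities would still reach a single agent, which matches the "Do not use this index if you want to analyze multiple cities" hint in the node summaries.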

In [8]:
objects = [] # define top-level nodes
for wiki_title in wiki_titles:
    wiki_summary = (
                    f"This content contains Wikipedia articles about {wiki_title}. Use"
                    " this index if you need to lookup specific facts about"
                    f" {wiki_title}.\nDo not use this index if you want to analyze"
                    " multiple cities."
                    )
    node = IndexNode(
                    text=wiki_summary, 
                    index_id=wiki_title, 
                    obj=agents[wiki_title]
                    ) # define index node that links to these agents
    objects.append(node)
objects

[IndexNode(id_='5ff6a594-8330-4ee3-bfe4-32e80dbbbc80', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='This content contains Wikipedia articles about Toronto. Use this index if you need to lookup specific facts about Toronto.\nDo not use this index if you want to analyze multiple cities.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='Toronto', obj=<llama_index.agent.openai.base.OpenAIAgent object at 0x2d130f850>),
 IndexNode(id_='096ab40e-2d0e-446f-a014-7d8dd6e958b8', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='This content contains Wikipedia articles about Seattle. Use this index if you need to lookup specific facts about Seattle.\nDo not use this index if you want to analyze multiple cities.', start_char_idx=None, end_char_idx=None, te

In [9]:
vector_index = VectorStoreIndex(
                                objects=objects, 
                                service_context=service_context
                                )
query_engine = vector_index.as_query_engine(
                                            similarity_top_k=1, 
                                            verbose=True
                                            )

## Querying

In [24]:
response = query_engine.query("Tell me about the sports teams in Boston")

print("\n======================================================================= RESPONSE =====================================================================")
print(textwrap.fill(str(response), 150))
print("======================================================================================================================================================")

Retrieval entering Boston: OpenAIAgent
Retrieving from object OpenAIAgent with query Tell me about the sports teams in Boston
Added user message to memory: Tell me about the sports teams in Boston

Boston is home to several professional sports teams that have a rich history and a dedicated fan base. The major sports teams in Boston include the
Boston Red Sox (MLB), the New England Patriots (NFL), the Boston Celtics (NBA), the Boston Bruins (NHL), and the New England Revolution (MLS). These
teams have achieved great success in their respective leagues and have a strong following in the city. The Boston Red Sox, for example, are one of the
oldest and most successful baseball teams in Major League Baseball, while the Boston Celtics have won a record 17 NBA championships. The New England
Patriots have been one of the most successful teams in the NFL, winning multiple Super Bowl championships. The Boston Bruins have a passionate fan
base a

In [25]:
response = query_engine.query(
    "Give me a summary on all the positive aspects of Chicago"
)
print("\n======================================================================= RESPONSE =====================================================================")
print(textwrap.fill(str(response), 150))
print("======================================================================================================================================================")

Retrieval entering Chicago: OpenAIAgent
Retrieving from object OpenAIAgent with query Give me a summary on all the positive aspects of Chicago
Added user message to memory: Give me a summary on all the positive aspects of Chicago
=== Calling Function ===
Calling function: summary_tool with args: {
  "input": "positive aspects of Chicago"
}
Got output: Chicago is a city with many positive aspects that contribute to its reputation as a major city. It has a vibrant and diverse cultural scene, thanks to its rich history of immigration. The city is known for its iconic steel-framed skyscrapers and its contributions to urban planning and architecture. Chicago is a significant player in the global economy, with a strong presence in finance, commerce, industry, education, technology, telecommunications, and transportation. Tourists can enjoy a wide range of attractions and activities, including a thriving arts scene with renowned institutions