# Multi-Document Agent

#### This notebook reproduces experiment 4 from the course Building Agentic RAG with LlamaIndex

#### Source: https://learn.deeplearning.ai/courses/building-agentic-rag-with-llamaindex/lesson/5/building-a-multi-document-agent

## Setup

In [1]:
from helper import get_openai_api_key
OPENAI_API_KEY = get_openai_api_key()

In [2]:
import nest_asyncio
nest_asyncio.apply()

### Set up an agent over 3 papers

In [3]:
urls = [
    "https://openreview.net/pdf?id=VtmBAGCN7o",
    "https://openreview.net/pdf?id=6PmJoRfdaK",
    "https://openreview.net/pdf?id=hSyW5go0v8",
]

papers = [
    "metagpt.pdf",
    "longlora.pdf",
    "selfrag.pdf",
]
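The cell above pairs each URL with a local filename, but nothing in this notebook fetches the files. A minimal sketch of one way to download them, assuming the two lists are in the same order; the helper name `download_missing` and its `fetch` parameter are illustrative and not part of the course code:

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_missing(urls, papers, fetch=urlretrieve):
    """Fetch each URL to its paired filename unless the file already exists.

    `fetch` defaults to urllib's urlretrieve; it is injectable so the helper
    can be exercised without network access. (Hypothetical helper.)
    """
    if len(urls) != len(papers):
        raise ValueError("urls and papers must have the same length")

    downloaded = []
    for url, paper in zip(urls, papers):
        if Path(paper).exists():
            continue  # already on disk, skip the download
        fetch(url, paper)
        downloaded.append(paper)
    return downloaded
```

Calling `download_missing(urls, papers)` before building the tools would leave existing PDFs untouched and return the list of files it fetched.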

In [4]:
from utils import get_doc_tools
from pathlib import Path

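# Build a vector tool and a summary tool for each paper.
paper_to_tools_dict = {}
for paper in papers:
    print(f"Getting tools for paper: {paper}")
    # The file stem (e.g. "longlora") becomes part of each tool's name.
    vector_tool, summary_tool = get_doc_tools(paper, Path(paper).stem)
    paper_to_tools_dict[paper] = [vector_tool, summary_tool]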

Getting tools for paper: metagpt.pdf
Getting tools for paper: longlora.pdf
Getting tools for paper: selfrag.pdf


In [5]:
initial_tools = [t for paper in papers for t in paper_to_tools_dict[paper]]

In [6]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

In [7]:
len(initial_tools)

6
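The count of 6 follows from two tools per paper across three papers. The sketch below reproduces the flattening step with plain strings standing in for the tool objects. The `summary_tool_<stem>` naming matches the agent logs in this notebook; the `vector_tool_<stem>` naming is an assumption made for illustration.

```python
from pathlib import Path


def flatten_tools(papers, paper_to_tools):
    """Flatten the per-paper tool lists into one list, preserving paper order."""
    return [tool for paper in papers for tool in paper_to_tools[paper]]


demo_papers = ["metagpt.pdf", "longlora.pdf", "selfrag.pdf"]

# Strings stand in for real tool objects; the vector tool names are assumed.
demo_tools = {
    paper: [f"vector_tool_{Path(paper).stem}", f"summary_tool_{Path(paper).stem}"]
    for paper in demo_papers
}

flat = flatten_tools(demo_papers, demo_tools)
print(len(flat))
```

With 11 papers the same step yields 22 tools, all of which would be passed to the LLM on every call unless a retriever narrows them down first.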

In [8]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    initial_tools, 
    llm=llm, 
    verbose=True
)
agent = AgentRunner(agent_worker)

In [9]:
response = agent.query(
    "Tell me about the evaluation dataset used in LongLoRA, "
    "and then tell me about the evaluation results"
)

Added user message to memory: Tell me about the evaluation dataset used in LongLoRA, and then tell me about the evaluation results
=== Calling Function ===
Calling function: summary_tool_longlora with args: {"input": "evaluation dataset"}
=== Function Output ===
The evaluation datasets mentioned in the context provided include the RedPajama dataset, PG19 validation set, PG19 test split, LongBench, LEval, book corpus dataset PG19, cleaned Arxiv Math proof-pile dataset, and the proposed dataset SAFECONV designed for evaluating conversation safety.
=== Calling Function ===
Calling function: summary_tool_longlora with args: {"input": "evaluation results"}
=== Function Output ===
The evaluation results across various studies and experiments consistently demonstrate the effectiveness of different attention patterns, such as shifted sparse attention (S2-Attn) and LoRA+, in extending the context window of pre-trained large language models (LLMs) while maintaining comparable performance to full

In [10]:
response = agent.query("Give me a summary of both Self-RAG and LongLoRA")
print(str(response))

Added user message to memory: Give me a summary of both Self-RAG and LongLoRA
=== Calling Function ===
Calling function: summary_tool_selfrag with args: {"input": "Self-RAG"}
=== Function Output ===
Self-RAG is a framework that utilizes reflection tokens to enhance the quality and factuality of a large language model through a combination of retrieval and self-reflection. It allows the model to adaptively retrieve passages on-demand, generate text informed by these retrieved passages, and reflect on both the retrieved passages and its own generated content using special tokens called reflection tokens. By incorporating this self-reflective process, Self-RAG aims to improve the overall generation quality, factuality, and verifiability of the language model without compromising its original creativity and versatility.
=== Calling Function ===
Calling function: summary_tool_longlora with args: {"input": "LongLoRA"}


Retrying llama_index.llms.openai.base.OpenAI._achat in 0.0009450649873383732 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-E5QvRBoFt632vt9JZf6l7XJR on tokens per min (TPM): Limit 60000, Used 59555, Requested 3397. Please try again in 2.952s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.6906522338644719 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-E5QvRBoFt632vt9JZf6l7XJR on tokens per min (TPM): Limit 60000, Used 59473, Requested 3400. Please try again in 2.873s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.9457556004

=== Function Output ===
LongLoRA is an efficient method used for extending the context length of Large Language Models (LLMs) with limited computation cost. It combines shifted sparse attention (S2-Attn) with LoRA to effectively and efficiently fine-tune models to longer context lengths. This approach allows for extending the context window of LLMs while retaining their original architectures, demonstrating strong empirical results on various tasks. Additionally, LongLoRA is compatible with existing techniques like Flash-Attention2 and is particularly effective when combined with trainable embedding and normalization layers.
=== LLM Response ===
Here are summaries of Self-RAG and LongLoRA:

1. Self-RAG:
Self-RAG is a framework that utilizes reflection tokens to enhance the quality and factuality of a large language model through a combination of retrieval and self-reflection. It allows the model to adaptively retrieve passages on-demand, generate text informed by these retrieved passag
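The `RateLimitError` lines in the output above show the client retrying after short, randomized waits. As a generic illustration of that pattern (not LlamaIndex's or OpenAI's actual implementation), here is a sketch of retry with exponential backoff and jitter. `RateLimited` is a hypothetical stand-in exception, and the parameter names and defaults are assumptions.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


def call_with_backoff(
    fn,
    max_retries=5,
    base_delay=0.5,
    max_delay=8.0,
    sleep=time.sleep,
    retry_on=(RateLimited,),
):
    """Call fn(), retrying on rate-limit errors with capped, jittered backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries, surface the error to the caller
            # The cap doubles each attempt; the actual wait is drawn
            # uniformly from [0, cap] ("full jitter").
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

When using the OpenAI client directly, one could pass `retry_on=(openai.RateLimitError,)`. Lowering the request rate, or the amount of text sent per tool call, would reduce how often the tokens-per-minute limit is hit in the first place.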

## Set up an agent over 11 papers

In [11]:
urls = [
    "https://openreview.net/pdf?id=VtmBAGCN7o",
    "https://openreview.net/pdf?id=6PmJoRfdaK",
    "https://openreview.net/pdf?id=LzPWWPAdY4",
    "https://openreview.net/pdf?id=VTF8yNQM66",
    "https://openreview.net/pdf?id=hSyW5go0v8",
    "https://openreview.net/pdf?id=9WD9KwssyT",
    "https://openreview.net/pdf?id=yV6fD7LYkF",
    "https://openreview.net/pdf?id=hnrB5YHoYu",
    "https://openreview.net/pdf?id=WbWtOYIzIK",
    "https://openreview.net/pdf?id=c5pwL0Soay",
    "https://openreview.net/pdf?id=TpD2aG1h0D"
]

papers = [
    "metagpt.pdf",
    "longlora.pdf",
    "loftq.pdf",
    "swebench.pdf",
    "selfrag.pdf",
    "zipformer.pdf",
    "values.pdf",
    "finetune_fair_diffusion.pdf",
    "knowledge_card.pdf",
    "metra.pdf",
    "vr_mcl.pdf"
]

In [12]:
from utils import get_doc_tools
from pathlib import Path

paper_to_tools_dict = {}
for paper in papers:
    print(f"Getting tools for paper: {paper}")
    vector_tool, summary_tool = get_doc_tools(paper, Path(paper).stem)
    paper_to_tools_dict[paper] = [vector_tool, summary_tool]

Getting tools for paper: metagpt.pdf
Getting tools for paper: longlora.pdf
Getting tools for paper: loftq.pdf
Getting tools for paper: swebench.pdf
Getting tools for paper: selfrag.pdf
Getting tools for paper: zipformer.pdf
Getting tools for paper: values.pdf
Getting tools for paper: finetune_fair_diffusion.pdf
Getting tools for paper: knowledge_card.pdf
Getting tools for paper: metra.pdf
Getting tools for paper: vr_mcl.pdf


## Extend the Agent with Tool Retrieval

In [13]:
all_tools = [t for paper in papers for t in paper_to_tools_dict[paper]]

In [14]:
# define an "object" index and retriever over these tools
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import ObjectIndex

obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
)

In [15]:
# Return the 3 most similar tools for a given query.
obj_retriever = obj_index.as_retriever(similarity_top_k=3)

In [16]:
tools = obj_retriever.retrieve(
    "Tell me about the eval dataset used in MetaGPT and SWE-Bench"
)

In [17]:
tools[2].metadata

ToolMetadata(description='Use ONLY IF you want to get a holistic summary of MetaGPT. Do NOT use if you have specific questions over MetaGPT.', name='summary_tool_values', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)
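To make the retrieval step concrete, the toy sketch below ranks tools by similarity between a query and each tool's text and keeps the top `k`. It swaps in bag-of-words cosine similarity for the embedding-based search that `VectorStoreIndex` performs, so it only illustrates the idea. The descriptions in `toy_tools` are invented for the example and are not the ones produced by `get_doc_tools`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_tools(query, tool_descriptions, top_k=3):
    """Return the names of the top_k tools most similar to the query."""
    q = tokenize(query)
    ranked = sorted(
        tool_descriptions,
        key=lambda name: cosine(q, tokenize(tool_descriptions[name])),
        reverse=True,
    )
    return ranked[:top_k]


# Invented descriptions, for illustration only.
toy_tools = {
    "summary_tool_metagpt": "Holistic summary of the MetaGPT paper",
    "summary_tool_swebench": "Holistic summary of the SWE-Bench paper",
    "summary_tool_zipformer": "Holistic summary of the Zipformer paper",
    "summary_tool_metra": "Holistic summary of the METRA paper",
}

top = retrieve_tools(
    "Tell me about the eval dataset used in MetaGPT and SWE-Bench",
    toy_tools,
    top_k=2,
)
print(top)
```

Ranking of this kind is only as good as the text being compared. In the metadata shown above, the tool named `summary_tool_values` carries a description that mentions MetaGPT, so if descriptions contribute to the indexed text, tools for different papers may be hard to tell apart.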

In [18]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

# Pass a tool retriever instead of a fixed tool list, so that the tools
# offered to the LLM are selected per query from the object index.
agent_worker = FunctionCallingAgentWorker.from_tools(
    tool_retriever=obj_retriever,
    llm=llm, 
    system_prompt=""" \
You are an agent designed to answer queries over a set of given papers.
Please always use the tools provided to answer a question. Do not rely on prior knowledge.\

""",
    verbose=True
)
agent = AgentRunner(agent_worker)

In [19]:
response = agent.query(
    "Tell me about the evaluation dataset used "
    "in MetaGPT and compare it against SWE-Bench"
)
print(str(response))

Added user message to memory: Tell me about the evaluation dataset used in MetaGPT and compare it against SWE-Bench
=== Calling Function ===
Calling function: summary_tool_metra with args: {"input": "evaluation dataset used in MetaGPT"}
=== Function Output ===
The evaluation dataset used in MetaGPT is not explicitly mentioned in the provided context information.
=== Calling Function ===
Calling function: summary_tool_swebench with args: {"input": "evaluation dataset used in SWE-Bench"}


Retrying llama_index.llms.openai.base.OpenAI._achat in 0.3748528395372219 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-E5QvRBoFt632vt9JZf6l7XJR on tokens per min (TPM): Limit 60000, Used 56247, Requested 3759. Please try again in 6ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.19364296370350298 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-E5QvRBoFt632vt9JZf6l7XJR on tokens per min (TPM): Limit 60000, Used 56082, Requested 3987. Please try again in 69ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.06395589210831942

=== Function Output ===
The evaluation dataset used in SWE-Bench consists of task instances drawn from real GitHub issues and corresponding pull requests across popular Python repositories. It includes task instructions, the issue text, retrieved files and documentation, example patch files, and prompts for generating patch files. The dataset is designed to be challenging and realistic for evaluating language models in the context of software engineering tasks. Task instances are validated for usability through execution-based validation and are grouped by repository and version. The dataset is continuously updated with new task instances based on pull requests created after the training date of any language model used. Additionally, the dataset involves steps such as applying prediction patches to the codebase, running testing scripts, and determining task completion based on test results. Models like ChatGPT-3.5, GPT-4, Claude 2, and SWE-Llama are evaluated on this dataset using diff
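In the run above, the question about MetaGPT was routed to `summary_tool_metra`, which then reported that the dataset was not mentioned in its context. A small sketch for spotting such mismatches by parsing the verbose log follows. It assumes the `Calling function: <name> with args: <json>` line format shown in this notebook and tool names of the form `<kind>_tool_<paper>`; both helper functions are hypothetical.

```python
import json
import re

CALL_PATTERN = re.compile(r"^Calling function: (\S+) with args: (\{.*\})$")


def parse_tool_calls(log_text):
    """Extract (tool_name, args) pairs from the agent's verbose log."""
    calls = []
    for line in log_text.splitlines():
        match = CALL_PATTERN.match(line.strip())
        if match:
            calls.append((match.group(1), json.loads(match.group(2))))
    return calls


def suspicious_calls(calls, known_papers):
    """Flag calls whose input names a paper other than the tool's own."""
    flagged = []
    for name, args in calls:
        tool_paper = name.split("_tool_", 1)[-1]
        # Drop hyphens so "SWE-Bench" matches the "swebench" stem.
        text = args.get("input", "").lower().replace("-", "")
        mentioned = {paper for paper in known_papers if paper in text}
        if mentioned and tool_paper not in mentioned:
            flagged.append(name)
    return flagged


log = """=== Calling Function ===
Calling function: summary_tool_metra with args: {"input": "evaluation dataset used in MetaGPT"}
=== Calling Function ===
Calling function: summary_tool_swebench with args: {"input": "evaluation dataset used in SWE-Bench"}
"""

flagged = suspicious_calls(parse_tool_calls(log), ["metagpt", "swebench", "metra"])
print(flagged)
```

The check is a plain substring heuristic, so it only catches inputs that name a paper explicitly.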

In [20]:
response = agent.query(
    "Compare and contrast the LoRA papers (LongLoRA, LoftQ). "
    "Analyze the approach in each paper first. "
)

Added user message to memory: Compare and contrast the LoRA papers (LongLoRA, LoftQ). Analyze the approach in each paper first. 
=== Calling Function ===
Calling function: summary_tool_longlora with args: {"input": "LongLoRA paper"}


Retrying llama_index.llms.openai.base.OpenAI._achat in 0.5211049228717677 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-E5QvRBoFt632vt9JZf6l7XJR on tokens per min (TPM): Limit 60000, Used 56651, Requested 3796. Please try again in 447ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.


=== Function Output ===
The LongLoRA paper introduces an efficient fine-tuning method for extending the context sizes of pre-trained large language models (LLMs) with limited computational cost. It presents shifted sparse attention (S2-Attn) as a way to effectively enable context extension during fine-tuning, leading to significant computation savings while maintaining performance. The paper combines this with an improved low-rank adaptation (LoRA) approach and demonstrates strong empirical results on various tasks with Llama2 models. LongLoRA allows for extending Llama2 models' context lengths while retaining their original architectures and is compatible with existing techniques like Flash-Attention2. The paper also discusses the importance of learnable embedding and normalization layers in unlocking long context LoRA fine-tuning.
=== Calling Function ===
Calling function: summary_tool_loftq with args: {"input": "LoftQ paper"}


Retrying llama_index.llms.openai.base.OpenAI._achat in 0.2301177672935023 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-E5QvRBoFt632vt9JZf6l7XJR on tokens per min (TPM): Limit 60000, Used 59665, Requested 3226. Please try again in 2.891s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.48680688053275356 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-E5QvRBoFt632vt9JZf6l7XJR on tokens per min (TPM): Limit 60000, Used 59534, Requested 3353. Please try again in 2.887s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.
Retrying llama_index.llms.openai.base.OpenAI._achat in 1.070012501528

=== Function Output ===
LoftQ is a novel quantization framework introduced in a paper presented at ICLR 2024. It involves quantizing pre-trained weight matrices and applying low-rank approximations to improve performance in downstream tasks, particularly in low-bit quantization scenarios. LoftQ integrates low-rank approximation with quantization to jointly approximate the original high-precision pre-trained weights, providing a beneficial initialization point for subsequent LoRA fine-tuning. The method has been shown to consistently outperform existing quantization methods, especially in challenging low-bit scenarios, and works effectively with different quantization methods. LoftQ has been evaluated on various tasks such as natural language understanding, question answering, summarization, and natural language generation, demonstrating robustness and improved performance compared to baseline methods like QLoRA and full fine-tuning.
=== LLM Response ===
The LongLoRA paper introduces an