In [1]:
import dotenv
%load_ext dotenv
%dotenv

In [2]:
import nest_asyncio
nest_asyncio.apply()
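
A notebook kernel already runs an asyncio event loop, and standard `asyncio` refuses to start a second one inside it. That is presumably why `nest_asyncio.apply()` is called before any LlamaIndex code here. The sketch below uses only the standard library to reproduce the underlying error; `fetch_answer` and `notebook_cell` are hypothetical names, and the snippet is meant to be run as a plain script rather than inside the notebook.

```python
import asyncio


async def fetch_answer() -> int:
    """Stand-in for any coroutine a library might want to run."""
    return 42


async def notebook_cell() -> str:
    # We are already inside a running loop here, much like code in a Jupyter cell.
    coro = fetch_answer()
    try:
        asyncio.run(coro)  # tries to start a second loop and fails
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(notebook_cell())
print(message)
```

With the nested-loop restriction patched, synchronous helpers that run coroutines internally can be called from a cell without hitting this error.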

In [3]:
papers = [
    "./datasets/lora_paper.pdf",
    "./datasets/longlora_efficient_fine_tuning.pdf"
]

In [4]:
from utils import create_doc_tools
from pathlib import Path


paper_to_tools_dict = {}

for paper in papers:
    print(f"Creating {paper} paper tool.")
    path = Path(paper)
    vector_tool, summary_tool = await create_doc_tools(doc_name=path.stem, document_fp=path)
    paper_to_tools_dict[path.stem] = [vector_tool, summary_tool]

Creating ./datasets/lora_paper.pdf paper tool.
Creating ./datasets/longlora_efficient_fine_tuning.pdf paper tool.


In [5]:
paper_to_tools_dict

{'lora_paper': [<llama_index.core.tools.query_engine.QueryEngineTool at 0x7864e89c5b10>,
  <llama_index.core.tools.query_engine.QueryEngineTool at 0x7864e89c5d80>],
 'longlora_efficient_fine_tuning': [<llama_index.core.tools.query_engine.QueryEngineTool at 0x7864e868e050>,
  <llama_index.core.tools.query_engine.QueryEngineTool at 0x7864e868de10>]}

In [6]:
initial_tools = [t for paper in papers for t in paper_to_tools_dict[Path(paper).stem]]
print(str(initial_tools))

[<llama_index.core.tools.query_engine.QueryEngineTool object at 0x7864e89c5b10>, <llama_index.core.tools.query_engine.QueryEngineTool object at 0x7864e89c5d80>, <llama_index.core.tools.query_engine.QueryEngineTool object at 0x7864e868e050>, <llama_index.core.tools.query_engine.QueryEngineTool object at 0x7864e868de10>]


In [7]:
len(initial_tools)

4
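
The comprehension in cell 6 flattens the per-paper tool pairs into a single list, using each file's stem as the dictionary key. A minimal sketch of the same pattern, with placeholder strings standing in for the `QueryEngineTool` objects (the string values are illustrative only):

```python
from pathlib import Path

papers = [
    "./datasets/lora_paper.pdf",
    "./datasets/longlora_efficient_fine_tuning.pdf",
]

# Placeholder strings stand in for the [vector_tool, summary_tool] pairs.
paper_to_tools = {
    Path(p).stem: [f"{Path(p).stem}_vector", f"{Path(p).stem}_summary"]
    for p in papers
}

# The outer loop walks the papers in order; the inner loop walks each paper's tools.
flat_tools = [t for p in papers for t in paper_to_tools[Path(p).stem]]

print(list(paper_to_tools))
print(len(flat_tools))
```

Two papers with two tools each give the four entries reported by `len(initial_tools)`.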

#### Create Agent Worker

In [8]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

In [9]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    initial_tools, 
    llm=llm, 
    verbose=True
)
agent = AgentRunner(agent_worker)

In [10]:
response = agent.query(
    "Explain to me what is Lora and why it's being used. "
    "Explain to me what is LongLoRA and why it's being used. "
    "Compare and contrast LongLoRA and Lora."
)
print(str(response))

Added user message to memory: Explain to me what is Lora and why it's being used. Explain to me what is LongLoRA and why it's being used. Compare and contrast LongLoRA and Lora.
=== Calling Function ===
Calling function: lora_paper_summary_query_engine_tool with args: {"input": "Explain what is Lora and why it's being used."}
=== Function Output ===
LoRA, or Low-Rank Adaptation, is a method utilized in deep learning to adapt large-scale pre-trained language models to specific tasks or domains. It involves introducing trainable rank decomposition matrices into each layer of the Transformer architecture while freezing the pre-trained model weights. This approach significantly reduces the number of trainable parameters for downstream tasks, making it more memory and computationally efficient. LoRA is being used to address the challenges associated with full fine-tuning, particularly for models with a large number of parameters like GPT-3, making it more feasible and cost-effective to adapt 
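
The multi-sentence queries in this notebook are written as adjacent string literals, which Python joins with no separator between them. A small sketch of that behavior, plus an alternative that makes the separator explicit with `" ".join` (the `questions` list is a hypothetical restructuring, not code from the notebook):

```python
# Adjacent literals are concatenated as-is: no space is inserted between them.
implicit = (
    "First sentence."
    "Second sentence."
)

# Keeping each question as a list item and joining makes the separator explicit.
questions = [
    "Explain to me what is LoRA and why it's being used.",
    "Explain to me what is LongLoRA and why it's being used.",
    "Compare and contrast LongLoRA and LoRA.",
]
query = " ".join(questions)

print(implicit)
print(query)
```

The joined `query` can be passed to `agent.query(...)` in place of the hand-concatenated literals.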

In [11]:
response = agent.query(
    "Write me a summary of the LoRA paper. "
    "Write me a summary of the LongLoRA paper. "
    "Compare and contrast LongLoRA and Lora."
)
print(str(response))

Added user message to memory: Write me a summary of the LoRA paper. Write me a summary of the LongLoRA paper. Compare and contrast LongLoRA and Lora.
=== Calling Function ===
Calling function: lora_paper_summary_query_engine_tool with args: {"input": "summary"}
=== Function Output ===
LoRA is a method proposed for efficient adaptation of large language models by injecting trainable rank decomposition matrices into each layer of the Transformer architecture. It reduces the number of trainable parameters, leading to reduced GPU memory requirements, while maintaining model quality comparable to full fine-tuning. Experiments focused on aspects like LORA module correlation, rank effects in GPT-2, amplification factor in low-rank matrices, and W and ∆W correlation, providing insights into model performance under different conditions.
=== Calling Function ===
Calling function: longlora_efficient_fine_tuning_summary_query_engine_tool with args: {"input": "summary"}
=== Function Output ===
A meth

### Multiple Documents Agentic RAG

##### Download Papers

In [12]:
urls = [
    "https://arxiv.org/pdf/2106.09685"
]

papers = [
    "lora_paper.pdf",
]

In [13]:
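# `!wget` below invokes the system wget binary from the shell,
# so the Python `wget` package does not need to be installed or imported.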

for url, paper in zip(urls, papers):
    !wget "{url}" -O "{paper}"

--2024-05-11 16:08:51--  https://arxiv.org/pdf/2106.09685
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.3.42, 151.101.131.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1609513 (1.5M) [application/pdf]
Saving to: ‘lora_paper.pdf’


2024-05-11 16:08:53 (1.63 MB/s) - ‘lora_paper.pdf’ saved [1609513/1609513]



##### Creating Tools

In [14]:
papers = [
    "./datasets/lora_paper.pdf",
    "./datasets/longlora_efficient_fine_tuning.pdf"
]

In [15]:
from utils import create_doc_tools
from pathlib import Path


paper_to_tools_dict = {}

for paper in papers:
    print(f"Creating {paper} paper tool.")
    path = Path(paper)
    vector_tool, summary_tool = await create_doc_tools(doc_name=path.stem, document_fp=path)
    paper_to_tools_dict[path.stem] = [vector_tool, summary_tool]

Creating ./datasets/lora_paper.pdf paper tool.
Creating ./datasets/longlora_efficient_fine_tuning.pdf paper tool.


In [16]:
tools_list = [t for paper in papers for t in paper_to_tools_dict[Path(paper).stem]]
print(str(tools_list))

[<llama_index.core.tools.query_engine.QueryEngineTool object at 0x7864e7627850>, <llama_index.core.tools.query_engine.QueryEngineTool object at 0x7864e76254e0>, <llama_index.core.tools.query_engine.QueryEngineTool object at 0x7864e7746290>, <llama_index.core.tools.query_engine.QueryEngineTool object at 0x7864e77476a0>]


In [17]:
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import ObjectIndex

obj_index = ObjectIndex.from_objects(
    tools_list,
    index_cls=VectorStoreIndex,
)

In [18]:
obj_retriever = obj_index.as_retriever(similarity_top_k=3)

In [19]:
retrieved_tools = obj_retriever.retrieve(
    "Write me a summary of the LoRA paper. "
    "Write me a summary of the LongLoRA paper. "
    "Compare and contrast LongLoRA and Lora."
)
print(str(retrieved_tools))

[<llama_index.core.tools.query_engine.QueryEngineTool object at 0x7864e76254e0>, <llama_index.core.tools.query_engine.QueryEngineTool object at 0x7864e77476a0>, <llama_index.core.tools.query_engine.QueryEngineTool object at 0x7864e7627850>]


In [22]:
for tool in retrieved_tools:
    print(tool.metadata.name)

lora_paper_summary_query_engine_tool
longlora_efficient_fine_tuning_summary_query_engine_tool
lora_paper_vector_query_engine_tool
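
The retriever above returns the three tools whose indexed representation is closest to the query. The sketch below illustrates that top-k idea with a toy bag-of-words cosine similarity in place of a real embedding model. It is not how `ObjectIndex` is implemented, and the tool descriptions are hypothetical (the real ones are set inside `create_doc_tools`).

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical descriptions keyed by the tool names seen in the notebook output.
tools = {
    "lora_paper_vector_query_engine_tool": "answer specific questions about the lora paper",
    "lora_paper_summary_query_engine_tool": "write a summary of the lora paper",
    "longlora_efficient_fine_tuning_vector_query_engine_tool": "answer specific questions about the longlora paper",
    "longlora_efficient_fine_tuning_summary_query_engine_tool": "write a summary of the longlora paper",
}


def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Return the names of the top_k tools most similar to the query."""
    q = embed(query)
    ranked = sorted(tools, key=lambda name: cosine(q, embed(tools[name])), reverse=True)
    return ranked[:top_k]


print(retrieve("Write me a summary of the LoRA paper."))
```

Passing only the retrieved subset to the agent keeps the prompt small as the number of documents, and therefore tools, grows.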


##### Creating Agents

In [23]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    tool_retriever=obj_retriever,
    llm=llm, 
    system_prompt="""\
You are an AI agent programmed to respond to questions based on a
specified collection of documents. Always use the available tools
to generate answers, ensuring that responses are based directly on the
provided materials rather than on any pre-existing knowledge.
Format all responses as Markdown.
""",
    verbose=True
)
agent = AgentRunner(agent_worker)

In [24]:
response = agent.query(
    "Write me a summary of the LoRA paper. "
    "Write me a summary of the LongLoRA paper. "
    "Compare and contrast LongLoRA and Lora."
)
print(str(response))

Added user message to memory: Write me a summary of the LoRA paper. Write me a summary of the LongLoRA paper. Compare and contrast LongLoRA and Lora.
=== Calling Function ===
Calling function: lora_paper_summary_query_engine_tool with args: {"input": "summary"}
=== Function Output ===
LoRA is an innovative adaptation strategy for large language models that involves freezing pre-trained model weights and incorporating trainable rank decomposition matrices into each layer of the Transformer architecture. This method has been proven to enhance model quality compared to traditional fine-tuning approaches, while also reducing GPU memory requirements and training throughput. It enables swift task-switching during deployment without introducing additional inference latency. Empirical validation on various language models like RoBERTa, DeBERTa, GPT-2, and GPT-3 has demonstrated the effectiveness of LoRA in maintaining high performance with fewer trainable parameters. The experiments conducted in