### Tool Calling With Agentic RAG Systems

In [1]:
import dotenv
%load_ext dotenv
%dotenv

In [2]:
import nest_asyncio
nest_asyncio.apply()

#### Sample Functions For Tools

In [9]:
def add(x: int, y: int) -> int:
    """Add two numbers together"""
    return x + y

def subtract(x: int, y: int) -> int:
    """Subtract two numbers"""
    return x - y

def multiply(x: int, y: int) -> int:
    """Multiply two numbers"""
    return x * y

def get_user_info(username: str) -> str:
    """Get user information as a formatted string"""
    # Small in-memory lookup table standing in for a real user database.
    database = {
        "john": {
            "name": "John Doe",
            "age": 25,
            "email": "johndoe@example.com"
        },
        "jane": {
            "name": "Jane Doe",
            "age": 20,
            "email": "janedoe@example.com"
        }
    }

    # Lookup is case-insensitive; unknown usernames fall back to a message.
    return f"Username: {username}, Info: {database.get(username.lower(), 'User not found')}"

#### Creating Tools From Python Functions

In [10]:
from llama_index.core.tools import FunctionTool

# Wrap each plain Python function as a tool the LLM can call.
addition_tool = FunctionTool.from_defaults(fn=add)
subtraction_tool = FunctionTool.from_defaults(fn=subtract)
multiplication_tool = FunctionTool.from_defaults(fn=multiply)
get_user_info_tool = FunctionTool.from_defaults(fn=get_user_info)

tools = [addition_tool, subtraction_tool, multiplication_tool, get_user_info_tool]
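
A tool built from a function needs at least a name, a description, and a parameter list for the LLM to pick it and fill in arguments. The sketch below is a plain-Python illustration of extracting that information from a signature and docstring. It is not LlamaIndex's implementation, and `describe_tool` is a hypothetical helper introduced only for this example.

```python
import inspect
from typing import Any, Callable, Dict


def describe_tool(fn: Callable[..., Any]) -> Dict[str, Any]:
    """Build a minimal tool description from a function's signature and docstring."""
    signature = inspect.signature(fn)
    # Map each parameter name to the name of its annotated type.
    parameters = {
        name: getattr(param.annotation, "__name__", str(param.annotation))
        for name, param in signature.parameters.items()
    }
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": parameters,
    }


def add(x: int, y: int) -> int:
    """Add two numbers together"""
    return x + y


tool_spec = describe_tool(add)
print(tool_spec)
```

This is why clear docstrings and type hints on the sample functions matter: they are the only information the model has when choosing a tool.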

#### Testing Out The Tool Calling

In [7]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

response = llm.predict_and_call(
    tools,
    "Add 2 and 3",
    verbose=True
)

print(str(response))

=== Calling Function ===
Calling function: add with args: {"x": 2, "y": 3}
=== Function Output ===
5
5


In [12]:
response = llm.predict_and_call(
    tools,
    "Tell me how old is John.",
    verbose=True
)

print(str(response))

=== Calling Function ===
Calling function: get_user_info with args: {"username": "John"}
=== Function Output ===
Username: John, Info: {'name': 'John Doe', 'age': 25, 'email': 'johndoe@example.com'}
Username: John, Info: {'name': 'John Doe', 'age': 25, 'email': 'johndoe@example.com'}
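
The verbose logs above show the pattern: the model returns a function name plus JSON arguments, and the framework runs the matching Python function. A minimal sketch of that dispatch step, assuming a hypothetical `TOOL_REGISTRY` dictionary and `dispatch_tool_call` helper (neither is part of LlamaIndex):

```python
import json
from typing import Any, Callable, Dict


def add(x: int, y: int) -> int:
    """Add two numbers together"""
    return x + y


def subtract(x: int, y: int) -> int:
    """Subtract two numbers"""
    return x - y


# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {
    "add": add,
    "subtract": subtract,
}


def dispatch_tool_call(name: str, raw_args: str) -> Any:
    """Look up a tool by name and call it with JSON-encoded keyword arguments."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"Unknown tool: {name}")
    kwargs = json.loads(raw_args)
    return TOOL_REGISTRY[name](**kwargs)


# Same name and arguments as in the first logged call above.
result = dispatch_tool_call("add", '{"x": 2, "y": 3}')
print(result)
```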


#### Vector Search With Metadata

In [15]:
from llama_index.core import SimpleDirectoryReader


# Read in the LoRA paper
documents = SimpleDirectoryReader(input_files=["./datasets/lora_paper.pdf"]).load_data()

In [16]:
from llama_index.core.node_parser import SentenceSplitter


splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents=documents)

In [17]:
len(nodes)

38

In [19]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding


Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

In [21]:
from llama_index.core import VectorStoreIndex


vector_index = VectorStoreIndex(nodes=nodes)

In [24]:
from llama_index.core.vector_stores import MetadataFilters


query_engine = vector_index.as_query_engine(
    similarity_top_k=3,
    filters=MetadataFilters.from_dicts(
        [
            {"key": "page_label", "value": "2"}
        ]
    )
)


response = query_engine.query("Tell me about the problem statement as explained in page 2")
print(str(response))   # print the response

The problem statement focuses on language modeling, particularly in the context of adapting a pre-trained autoregressive language model to downstream conditional text generation tasks. The goal is to maximize conditional probabilities given a task-specific prompt. The pre-trained model, such as GPT based on the Transformer architecture, is adapted for tasks like summarization, machine reading comprehension, and natural language to SQL. Each task involves a training dataset of context-target pairs, where the context and target are sequences of tokens. For instance, in NL2SQL, the context is a natural language query and the target is the corresponding SQL command; in summarization, the context is an article's content and the target is its summary.


In [29]:
for n in response.source_nodes:
    print(n.metadata)
    print("+++++++++++++")
    print(n.get_text())
    print("+++++++++++++")

{'page_label': '2', 'file_name': 'lora_paper.pdf', 'file_path': 'datasets/lora_paper.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2024-05-12', 'last_modified_date': '2024-05-12'}
+++++++++++++
often introduce inference latency (Houlsby et al., 2019; Rebufﬁ et al., 2017) by extending model
depth or reduce the model’s usable sequence length (Li & Liang, 2021; Lester et al., 2021; Ham-
bardzumyan et al., 2020; Liu et al., 2021) (Section 3). More importantly, these method often fail to
match the ﬁne-tuning baselines, posing a trade-off between efﬁciency and model quality.
We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned
over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the
change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed
Low-RankAdaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural
network indi

In [30]:
from typing import Any, List
from llama_index.core.vector_stores import FilterCondition

def vector_search_query(
    query: str,
    page_numbers: List[str]
) -> Any:
    """Conduct a vector search across an index using the following parameters:

    query (str): This is the text string you want to embed and search for within the index.
    page_numbers (List[str]): This parameter allows you to limit the search to
    specific pages. If left empty, the search will encompass all pages in the index.
    If page numbers are specified, the search will be filtered to only include those pages.

    """

    # One filter per requested page; combined with OR so any listed page matches.
    metadata_dicts = [
        {"key": "page_label", "value": p} for p in page_numbers
    ]

    query_engine = vector_index.as_query_engine(
        similarity_top_k=2,
        filters=MetadataFilters.from_dicts(
            metadata_dicts,
            condition=FilterCondition.OR
        )
    )
    # Return the response object (not a plain string) so that
    # source_nodes stay available to the caller.
    response = query_engine.query(query)
    return response
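
The page filter in the function above combines one condition per page with OR, and an empty page list means no restriction. The sketch below illustrates that logic in plain Python on a few made-up nodes. `NODES` and `filter_by_pages` are hypothetical names for illustration, and this is not how the vector store applies filters internally.

```python
from typing import Any, Dict, List

# Hypothetical stand-ins for indexed nodes: text plus metadata.
NODES: List[Dict[str, Any]] = [
    {"text": "Abstract and introduction", "metadata": {"page_label": "1"}},
    {"text": "Problem statement", "metadata": {"page_label": "2"}},
    {"text": "Method details", "metadata": {"page_label": "3"}},
]


def filter_by_pages(
    nodes: List[Dict[str, Any]], page_numbers: List[str]
) -> List[Dict[str, Any]]:
    """Keep nodes whose page_label matches any requested page (OR semantics)."""
    if not page_numbers:
        # An empty list means no restriction: search every page.
        return list(nodes)
    allowed = set(page_numbers)
    return [n for n in nodes if n["metadata"].get("page_label") in allowed]


selected = filter_by_pages(NODES, ["2", "3"])
print([n["text"] for n in selected])
```

Note that page labels are compared as strings, which matches the `'page_label': '2'` metadata shown in the output earlier.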

In [31]:
vector_query_tool = FunctionTool.from_defaults(
    fn=vector_search_query, 
    name="Vector_search_query_tool"
)

In [32]:
response = llm.predict_and_call(
    [vector_query_tool],
    "Tell me about the problem statement as explained in page 2",
    verbose=True
)

=== Calling Function ===
Calling function: Vector_search_query_tool with args: {"query": "problem statement", "page_numbers": ["2"]}
=== Function Output ===
The problem statement focuses on language modeling, particularly on maximizing conditional probabilities given a task-specific prompt. It discusses adapting a pre-trained autoregressive language model to downstream conditional text generation tasks like summarization, machine reading comprehension, and natural language to SQL. Each task is defined by a dataset of context-target pairs, where the context is a sequence of tokens and the target is the corresponding output, such as a SQL command for a natural language query in NL2SQL or a summary for an article in summarization.


In [33]:
for n in response.source_nodes:
    print(n.metadata)
    print("+++++++++++++")
    print(n.get_text())
    print("+++++++++++++")

{'page_label': '2', 'file_name': 'lora_paper.pdf', 'file_path': 'datasets/lora_paper.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2024-05-12', 'last_modified_date': '2024-05-12'}
+++++++++++++
often introduce inference latency (Houlsby et al., 2019; Rebufﬁ et al., 2017) by extending model
depth or reduce the model’s usable sequence length (Li & Liang, 2021; Lester et al., 2021; Ham-
bardzumyan et al., 2020; Liu et al., 2021) (Section 3). More importantly, these method often fail to
match the ﬁne-tuning baselines, posing a trade-off between efﬁciency and model quality.
We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned
over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the
change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed
Low-RankAdaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural
network indi

In [37]:
from llama_index.core import SummaryIndex
from llama_index.core.tools import QueryEngineTool


summary_index = SummaryIndex(nodes=nodes)

summary_query_engine = summary_index.as_query_engine(
    use_async=True,
    response_mode="tree_summarize"
)

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    name="Summary_tool",
    description="Useful for summarization of the LoRA paper."
)
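
The `tree_summarize` response mode builds a summary bottom-up: groups of chunks are summarized, then those summaries are summarized again until one remains. The sketch below shows that reduction pattern under simplifying assumptions. `tree_reduce` and `toy_summarize` are hypothetical, the toy summarizer stands in for an LLM call, and the real mode groups chunks by prompt size rather than a fixed count.

```python
from typing import Callable, List


def tree_reduce(
    chunks: List[str],
    summarize: Callable[[List[str]], str],
    fan_in: int = 2,
) -> str:
    """Repeatedly summarize groups of `fan_in` texts until one text remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = list(chunks)
    while len(level) > 1:
        # Summarize each group of neighbours to form the next, smaller level.
        level = [
            summarize(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


def toy_summarize(texts: List[str]) -> str:
    """Stand-in for an LLM call: keep only the first word of each text."""
    return " ".join(t.split()[0] for t in texts)


summary = tree_reduce(["alpha one", "beta two", "gamma three"], toy_summarize)
print(summary)
```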

In [40]:
response = llm.predict_and_call(
    [summary_tool, vector_query_tool],
    "Tell me about the lora paper in a summary format.",
    verbose=True
)

=== Calling Function ===
Calling function: Summary_tool with args: {"input": "lora paper"}
=== Function Output ===
The LoRA paper introduces a method called Low-Rank Adaptation (LoRA) for efficiently adapting large language models to specific tasks. This approach involves freezing the pre-trained model weights and incorporating trainable rank decomposition matrices into each layer of the Transformer architecture. By doing so, LoRA significantly reduces the number of trainable parameters for downstream tasks, maintaining high model quality without adding inference latency. The paper demonstrates the effectiveness of LoRA on various language models like RoBERTa, DeBERTa, GPT-2, and GPT-3, showing its capability to outperform or match traditional fine-tuning methods while decreasing the number of trainable parameters and memory requirements.


In [41]:
print(str(response))

The LoRA paper introduces a method called Low-Rank Adaptation (LoRA) for efficiently adapting large language models to specific tasks. This approach involves freezing the pre-trained model weights and incorporating trainable rank decomposition matrices into each layer of the Transformer architecture. By doing so, LoRA significantly reduces the number of trainable parameters for downstream tasks, maintaining high model quality without adding inference latency. The paper demonstrates the effectiveness of LoRA on various language models like RoBERTa, DeBERTa, GPT-2, and GPT-3, showing its capability to outperform or match traditional fine-tuning methods while decreasing the number of trainable parameters and memory requirements.


In [42]:
for n in response.source_nodes:
    print(n.metadata)
    print("+++++++++++++")
    print(n.get_text())
    print("+++++++++++++")

{'page_label': '1', 'file_name': 'lora_paper.pdf', 'file_path': 'datasets/lora_paper.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2024-05-12', 'last_modified_date': '2024-05-12'}
+++++++++++++
LORA: L OW-RANK ADAPTATION OF LARGE LAN-
GUAGE MODELS
Edward Hu∗Yelong Shen∗Phillip Wallis Zeyuan Allen-Zhu
Yuanzhi Li Shean Wang Lu Wang Weizhu Chen
Microsoft Corporation
{edwardhu, yeshe, phwallis, zeyuana,
yuanzhil, swang, luw, wzchen }@microsoft.com
yuanzhil@andrew.cmu.edu
(Version 2)
ABSTRACT
An important paradigm of natural language processing consists of large-scale pre-
training on general domain data and adaptation to particular tasks or domains. As
we pre-train larger models, full ﬁne-tuning, which retrains all model parameters,
becomes less feasible. Using GPT-3 175B as an example – deploying indepen-
dent instances of ﬁne-tuned models, each with 175B parameters, is prohibitively
expensive. We propose Low-RankAdaptation, or LoRA, which freezes the pre-
