## Tool Calling With Agentic RAG

In [1]:
import dotenv
%load_ext dotenv
%dotenv

In [2]:
import nest_asyncio
nest_asyncio.apply()

#### Sample Function Tools

In [4]:
def add(x: int, y: int) -> int:
    """Add two numbers together."""
    return x + y

# subtraction function
def sub(x: int, y: int) -> int:
    """Subtract y from x."""
    return x - y

# multiplication function
def mul(x: int, y: int) -> int:
    """Multiply two numbers."""
    return x * y


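# get user information
def get_user_info(name: str) -> str:
    """Get the age and location of a known user by full name."""
    data = {
        "John Doe": {
            "age": 30,
            "location": "USA"
        },
        "Jane Doe": {
            "age": 25,
            "location": "UK"
        }
    }
    # Return a readable message instead of raising KeyError for unknown names.
    if name not in data:
        return f"No information found for user {name}"
    return f'User name {name}, age is {data[name]["age"]} and location is {data[name]["location"]}'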

#### Creating Tool From Functions

In [8]:
from llama_index.core.tools import FunctionTool

addition_tool = FunctionTool.from_defaults(fn=add)
get_user_info_tool = FunctionTool.from_defaults(fn=get_user_info)
multiplication_tool = FunctionTool.from_defaults(fn=mul)
subtraction_tool = FunctionTool.from_defaults(fn=sub)

tools = [addition_tool, get_user_info_tool, multiplication_tool, subtraction_tool]
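
The type hints and docstrings on the sample functions matter because a tool wrapper needs a name, a description, and a parameter list to show to the model. The sketch below illustrates that idea with the standard library only. `describe_tool` is a hypothetical helper written for this example, not part of LlamaIndex, and the dictionary it returns is a simplified stand-in for the real tool schema.

```python
import inspect
from typing import Any, Callable, Dict


def describe_tool(fn: Callable[..., Any]) -> Dict[str, Any]:
    """Build a simplified tool description from a function (hypothetical helper)."""
    signature = inspect.signature(fn)
    parameters: Dict[str, str] = {}
    for name, param in signature.parameters.items():
        annotation = param.annotation
        # Fall back to "any" when a parameter has no type annotation.
        if annotation is inspect.Parameter.empty:
            type_name = "any"
        else:
            type_name = getattr(annotation, "__name__", str(annotation))
        parameters[name] = type_name
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": parameters,
    }


def add(x: int, y: int) -> int:
    """Add two numbers together."""
    return x + y


schema = describe_tool(add)
print(schema)
```

A function without a docstring would yield an empty description here, which gives the model little to go on when choosing between tools.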

In [9]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

response = llm.predict_and_call(
    tools, 
    "What is the product of 4 and 5", 
    verbose=True
)
print(str(response))

=== Calling Function ===
Calling function: mul with args: {"x": 4, "y": 5}
=== Function Output ===
20
20
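
The verbose log above shows two things: the model picked a tool name (`mul`) and supplied its arguments as JSON. Running the function is then an ordinary lookup and call. The sketch below covers only that last step, under the assumption that the model's choice arrives as a name plus a JSON string. `registry` and `dispatch` are hypothetical names for this example and do not mirror LlamaIndex internals.

```python
import json
from typing import Callable, Dict


def add(x: int, y: int) -> int:
    """Add two numbers together."""
    return x + y


def mul(x: int, y: int) -> int:
    """Multiply two numbers."""
    return x * y


# Hypothetical registry mapping tool names to plain Python callables.
registry: Dict[str, Callable[..., int]] = {"add": add, "mul": mul}


def dispatch(tool_name: str, raw_args: str) -> int:
    """Look up a tool by name and call it with JSON-encoded keyword arguments."""
    if tool_name not in registry:
        raise KeyError(f"Unknown tool: {tool_name}")
    kwargs = json.loads(raw_args)
    return registry[tool_name](**kwargs)


# Mirrors the call recorded in the verbose log.
result = dispatch("mul", '{"x": 4, "y": 5}')
print(result)
```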


In [10]:
response = llm.predict_and_call(
    tools, 
    "Give me more details about John Doe", 
    verbose=True
)
print(str(response))

=== Calling Function ===
Calling function: get_user_info with args: {"name": "John Doe"}
=== Function Output ===
User name John Doe, age is 30 and location is USA
User name John Doe, age is 30 and location is USA


#### Simple Vector Search Tool

In [11]:
from llama_index.core import SimpleDirectoryReader

# load lora_paper.pdf documents
documents = SimpleDirectoryReader(input_files=["./datasets/lora_paper.pdf"]).load_data()

In [12]:
from llama_index.core.node_parser import SentenceSplitter

# chunk_size is measured in tokens; 1024 is a reasonable default
splitter = SentenceSplitter(chunk_size=1024)
# Create nodes from documents
nodes = splitter.get_nodes_from_documents(documents)
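
To make the chunking step concrete, here is a minimal sketch of sentence-aware splitting: whole sentences are packed greedily into a chunk until the next one would overflow the limit. It is not the `SentenceSplitter` algorithm. This version counts characters rather than tokens, has no overlap between chunks, and `split_into_chunks` is a hypothetical helper for illustration.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 80) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = (
    "LoRA freezes the pre-trained weights. "
    "It trains small rank decomposition matrices instead. "
    "This reduces the number of trainable parameters."
)
chunks = split_into_chunks(sample, max_chars=80)
for chunk in chunks:
    print(repr(chunk))
```

A single sentence longer than `max_chars` is kept whole in this sketch, so the limit is a target rather than a hard bound.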

In [13]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# LLM model
Settings.llm = OpenAI(model="gpt-3.5-turbo")
# embedding model
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

In [15]:
from llama_index.core import VectorStoreIndex


# vector store index
vector_index = VectorStoreIndex(nodes)

#### Creating Query Engine With MetadataFilters

In [18]:
from llama_index.core.vector_stores import MetadataFilters

# Create vector search query engine
query_engine = vector_index.as_query_engine(
    similarity_top_k=2,
    filters=MetadataFilters.from_dicts(
        [
            {"key": "page_label", "value": "2"}
        ]
    )
)

response = query_engine.query(
    "Tell me about the Problem statement as explained", 
)
print(str(response))

The problem statement focuses on language modeling, particularly in the context of adapting a pre-trained autoregressive language model to downstream conditional text generation tasks. It discusses the use of a pre-trained model like GPT based on the Transformer architecture for tasks such as summarization, machine reading comprehension, and natural language to SQL. The goal is to maximize conditional probabilities given a task-specific prompt by adapting the pre-trained model to new tasks represented by context-target pairs in training datasets. Each downstream task involves sequences of tokens where the model needs to generate appropriate outputs based on the input context.
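
Conceptually, the query engine above combines two steps: keep only the nodes whose metadata matches the filter, then rank the survivors by embedding similarity and take the top `k`. The sketch below shows that combination on toy data. The two-dimensional vectors, the `id` field, and `filtered_top_k` are all made up for this example, and the order in which a real vector store applies filtering and ranking may differ.

```python
import math
from typing import Dict, List, Tuple

# Toy "nodes": (embedding, metadata). The vectors are invented for illustration.
Node = Tuple[List[float], Dict[str, str]]
toy_nodes: List[Node] = [
    ([1.0, 0.0], {"page_label": "1", "id": "a"}),
    ([0.9, 0.1], {"page_label": "2", "id": "b"}),
    ([0.0, 1.0], {"page_label": "2", "id": "c"}),
    ([0.7, 0.7], {"page_label": "2", "id": "d"}),
]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filtered_top_k(
    query_vec: List[float], nodes: List[Node], key: str, value: str, k: int = 2
) -> List[str]:
    """Keep nodes whose metadata[key] equals value, then return the k most similar ids."""
    candidates = [n for n in nodes if n[1].get(key) == value]
    ranked = sorted(candidates, key=lambda n: cosine(query_vec, n[0]), reverse=True)
    return [n[1]["id"] for n in ranked[:k]]


top_ids = filtered_top_k([1.0, 0.0], toy_nodes, key="page_label", value="2", k=2)
print(top_ids)
```

Node `a` is the closest match to the query vector, but it is on page 1, so the filter removes it before ranking.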


Verify that every retrieved source node comes from page 2:

In [21]:
for n in response.source_nodes:
    print(n.metadata)
    print("=============Text=============")
    print(n.get_text())
    print("=============Text=============")

{'page_label': '2', 'file_name': 'lora_paper.pdf', 'file_path': 'datasets/lora_paper.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2024-05-10', 'last_modified_date': '2024-05-10'}
often introduce inference latency (Houlsby et al., 2019; Rebufﬁ et al., 2017) by extending model
depth or reduce the model’s usable sequence length (Li & Liang, 2021; Lester et al., 2021; Ham-
bardzumyan et al., 2020; Liu et al., 2021) (Section 3). More importantly, these method often fail to
match the ﬁne-tuning baselines, posing a trade-off between efﬁciency and model quality.
We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned
over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the
change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed
Low-RankAdaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural
network indirectly by opti

In [22]:
from typing import List
from llama_index.core.vector_stores import FilterCondition


def vector_search_query(
    query: str, 
    page_numbers: List[str]
) -> str:
    """Run a vector search over the index.

    query (str): The text to embed and search for in the index.
    page_numbers (List[str]): Page labels to restrict the search to. If empty,
    the search covers all pages in the index. If page numbers are given, only
    nodes from those pages are searched.
    
    """

    # One exact-match filter per requested page label.
    metadata_dicts = [
        {"key": "page_label", "value": p} for p in page_numbers
    ]
    
    # Combine the filters with OR: a node has a single page_label,
    # so requiring all of them at once would match nothing.
    query_engine = vector_index.as_query_engine(
        similarity_top_k=2,
        filters=MetadataFilters.from_dicts(
            metadata_dicts,
            condition=FilterCondition.OR
        )
    )
    response = query_engine.query(query)
    # The response object is returned (rather than only its text) so that
    # source_nodes stay available; str(response) gives the answer text.
    return response
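
The `FilterCondition.OR` argument is what makes multi-page searches work. The sketch below shows why, using plain dictionaries in place of the LlamaIndex filter classes. `matches` is a hypothetical helper, and treating an empty filter list as "match everything" is an assumption of this sketch.

```python
from typing import Dict, List


def matches(
    metadata: Dict[str, str], filters: List[Dict[str, str]], condition: str = "or"
) -> bool:
    """Check node metadata against exact-match filters combined with 'or' or 'and'."""
    checks = [metadata.get(f["key"]) == f["value"] for f in filters]
    if not checks:
        # No filters: nothing is excluded (assumption of this sketch).
        return True
    return any(checks) if condition == "or" else all(checks)


# One toy node per page, and filters for pages 2 and 3.
node_metadata = [{"page_label": str(p)} for p in range(1, 5)]
page_filters = [{"key": "page_label", "value": p} for p in ["2", "3"]]

or_pages = [m["page_label"] for m in node_metadata if matches(m, page_filters, "or")]
and_pages = [m["page_label"] for m in node_metadata if matches(m, page_filters, "and")]
print(or_pages)
print(and_pages)
```

With `and`, no node passes, because a node cannot be on page 2 and page 3 at the same time.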

In [26]:
vector_search_query_tool = FunctionTool.from_defaults(
    name="vector_search_tool",
    fn=vector_search_query
)

In [30]:
response = llm.predict_and_call(
    [vector_search_query_tool], 
    "What was mentioned about the problem statement in page 2?", 
    verbose=True
)

=== Calling Function ===
Calling function: vector_search_tool with args: {"query": "problem statement", "page_numbers": ["2"]}
=== Function Output ===
The problem statement focuses on language modeling, particularly on maximizing conditional probabilities given a task-specific prompt. It discusses adapting a pre-trained autoregressive language model to downstream conditional text generation tasks like summarization, machine reading comprehension (MRC), and natural language to SQL (NL2SQL). Each task is defined by a dataset of context-target pairs, where both the context and target are sequences of tokens. For instance, in NL2SQL, the context is a natural language query and the target is the corresponding SQL command; in summarization, the context is an article's content and the target is its summary.


In [31]:
print(str(response))

The problem statement focuses on language modeling, particularly on maximizing conditional probabilities given a task-specific prompt. It discusses adapting a pre-trained autoregressive language model to downstream conditional text generation tasks like summarization, machine reading comprehension (MRC), and natural language to SQL (NL2SQL). Each task is defined by a dataset of context-target pairs, where both the context and target are sequences of tokens. For instance, in NL2SQL, the context is a natural language query and the target is the corresponding SQL command; in summarization, the context is an article's content and the target is its summary.


In [32]:
response = llm.predict_and_call(
    [vector_search_query_tool], 
    "What was mentioned about the problem statement in page 2?", 
    verbose=True
)

=== Calling Function ===
Calling function: vector_search_tool with args: {"query": "problem statement", "page_numbers": ["2"]}
=== Function Output ===
The problem statement focuses on language modeling, particularly on maximizing conditional probabilities given a task-specific prompt. It involves adapting a pre-trained autoregressive language model to downstream conditional text generation tasks like summarization, machine reading comprehension (MRC), and natural language to SQL (NL2SQL). Each task is defined by a dataset of context-target pairs, where both the context and target are sequences of tokens. For instance, in NL2SQL, the context is a natural language query and the target is the corresponding SQL command; in summarization, the context is an article's content and the target is its summary.


In [33]:
print(str(response))

The problem statement focuses on language modeling, particularly on maximizing conditional probabilities given a task-specific prompt. It involves adapting a pre-trained autoregressive language model to downstream conditional text generation tasks like summarization, machine reading comprehension (MRC), and natural language to SQL (NL2SQL). Each task is defined by a dataset of context-target pairs, where both the context and target are sequences of tokens. For instance, in NL2SQL, the context is a natural language query and the target is the corresponding SQL command; in summarization, the context is an article's content and the target is its summary.


In [34]:
for n in response.source_nodes:
    print(n.metadata)
    print("=============Text=============")
    print(n.get_text())
    print("=============Text=============")

{'page_label': '2', 'file_name': 'lora_paper.pdf', 'file_path': 'datasets/lora_paper.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2024-05-10', 'last_modified_date': '2024-05-10'}
often introduce inference latency (Houlsby et al., 2019; Rebufﬁ et al., 2017) by extending model
depth or reduce the model’s usable sequence length (Li & Liang, 2021; Lester et al., 2021; Ham-
bardzumyan et al., 2020; Liu et al., 2021) (Section 3). More importantly, these method often fail to
match the ﬁne-tuning baselines, posing a trade-off between efﬁciency and model quality.
We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned
over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the
change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed
Low-RankAdaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural
network indirectly by opti

In [35]:
from llama_index.core import SummaryIndex
from llama_index.core.tools import QueryEngineTool

summary_index = SummaryIndex(nodes)

summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)

summary_tool = QueryEngineTool.from_defaults(
    name="summary_tool",
    query_engine=summary_query_engine,
    description=(
        "Useful for summarization questions related to the LoRA paper."
    ),
)
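
The `tree_summarize` response mode builds an answer bottom-up: groups of chunks are summarized, then the summaries are summarized again until one remains. The sketch below shows only that recursive shape. The `summarize` argument is a stand-in for an LLM call, `fan_in` is a hypothetical parameter, and the real implementation groups text by what fits in the model's context window rather than by a fixed count.

```python
from typing import Callable, List


def tree_summarize(
    chunks: List[str], summarize: Callable[[List[str]], str], fan_in: int = 2
) -> str:
    """Repeatedly summarize groups of fan_in texts until a single text remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2, or the tree never shrinks")
    if not chunks:
        return ""
    level = list(chunks)
    while len(level) > 1:
        # Each pass replaces every group of fan_in texts with one summary.
        level = [summarize(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


def bracket(texts: List[str]) -> str:
    """Stand-in for an LLM summary: wrap the inputs so the tree shape stays visible."""
    return "(" + " + ".join(texts) + ")"


result = tree_summarize(["c1", "c2", "c3", "c4", "c5"], bracket, fan_in=2)
print(result)
```

Because every node in the index feeds into the tree, this mode suits whole-document questions, while the vector search tool suits questions about a specific page or passage.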

In [36]:
response = llm.predict_and_call(
    [vector_search_query_tool, summary_tool], 
    "What was mentioned about the problem statement in page 2?", 
    verbose=True
)

=== Calling Function ===
Calling function: vector_search_tool with args: {"query": "problem statement", "page_numbers": ["2"]}
=== Function Output ===
The problem statement focuses on language modeling, particularly on maximizing conditional probabilities given a task-specific prompt. It involves adapting a pre-trained autoregressive language model to downstream conditional text generation tasks like summarization, machine reading comprehension (MRC), and natural language to SQL (NL2SQL). Each task is defined by a training dataset of context-target pairs, where both the context (xi) and target (yi) are sequences of tokens. For instance, in NL2SQL, xi represents a natural language query and yi its corresponding SQL command; in summarization, xi is the article content and yi is its summary.


In [37]:
response = llm.predict_and_call(
    [vector_search_query_tool, summary_tool], 
    "Give me a summary of what was mentioned in page 2?", 
    verbose=True
)

=== Calling Function ===
Calling function: vector_search_tool with args: {"query": "page 2", "page_numbers": ["2"]}
=== Function Output ===
The proposed Low-RankAdaptation (LoRA) approach allows training some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers' change during adaptation while keeping the pre-trained weights frozen. This method is shown to be both storage- and compute-efficient, requiring very low rank matrices even when the full rank is high. Additionally, LoRA enables sharing a pre-trained model to build multiple small LoRA modules for different tasks, reducing storage requirements and task-switching overhead significantly. It also makes training more efficient by lowering the hardware barrier to entry and introduces no inference latency compared to a fully fine-tuned model when deployed.


In [38]:
response = llm.predict_and_call(
    [vector_search_query_tool, summary_tool], 
    "Give me a summary of the paper.", 
    verbose=True
)

=== Calling Function ===
Calling function: summary_tool with args: {"input": "summary"}
=== Function Output ===
LoRA is an adaptation strategy for large language models that involves freezing pre-trained model weights and adding trainable rank decomposition matrices to reduce the number of trainable parameters. Despite having fewer parameters, LoRA performs comparably or better than full fine-tuning on models like RoBERTa, DeBERTa, GPT-2, and GPT-3. It offers benefits such as reduced GPU memory usage, higher training speed, and no extra inference latency. The method allows for quick task-switching and can be integrated with PyTorch models. The paper discusses experiments on deep learning models, including adaptation methods, low-rank matrices, and subspace similarity measurements, providing insights into optimizing model performance and understanding deep learning dynamics.


In [40]:
for n in response.source_nodes:
    print(n.metadata)
    print("=============Text=============")
    print(n.get_text()[:10])
    print("=============Text=============")

{'page_label': '1', 'file_name': 'lora_paper.pdf', 'file_path': 'datasets/lora_paper.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2024-05-10', 'last_modified_date': '2024-05-10'}
LORA: L OW
{'page_label': '2', 'file_name': 'lora_paper.pdf', 'file_path': 'datasets/lora_paper.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2024-05-10', 'last_modified_date': '2024-05-10'}
often intr
{'page_label': '3', 'file_name': 'lora_paper.pdf', 'file_path': 'datasets/lora_paper.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2024-05-10', 'last_modified_date': '2024-05-10'}
During ful
{'page_label': '3', 'file_name': 'lora_paper.pdf', 'file_path': 'datasets/lora_paper.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2024-05-10', 'last_modified_date': '2024-05-10'}
This makes
{'page_label': '4', 'file_name': 'lora_paper.pdf', 'file_path': 'datasets/lora_paper.pdf', 'file_type': 'app