# Tool Calling

This project explores **Tool Calling**, which lets an LLM dynamically choose not only the best query engine but also the arguments to pass to it, for example a page number. As a result, the LLM does not just consume the output of a vector database; it also decides how to query it.

Tool Calling adds a layer of query understanding on top of the RAG pipeline, which enables users to ask more complex queries and receive more accurate answers.
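
As a rough illustration of the idea, the sketch below mimics what happens once the model has chosen a tool: the tool name and JSON arguments (hard-coded here in place of a real LLM response) are looked up in a registry and executed. `TOOLS`, `dispatch`, and `search_page` are hypothetical names used only for this sketch and are not part of LlamaIndex.

```python
import json


def add(x: int, y: int) -> int:
    """Adds two integers together."""
    return x + y


def search_page(query: str, page: int) -> str:
    """Hypothetical stand-in for a retrieval tool restricted to one page."""
    return f"results for {query!r} on page {page}"


# Registry mapping tool names to callables (an assumption of this sketch).
TOOLS = {"add": add, "search_page": search_page}


def dispatch(tool_call_json: str):
    """Run the tool named in a JSON tool call with the given arguments."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])


# In a real pipeline the LLM would produce this JSON; here it is hard-coded.
llm_choice = '{"name": "search_page", "arguments": {"query": "attention", "page": 2}}'
print(dispatch(llm_choice))
```

The point of the sketch is that the model contributes two decisions, which tool and which arguments, while the surrounding code stays a plain function call.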


# References

This project is based on the course **"Building Agentic RAG with LlamaIndex"** by **DeepLearning.AI**, which is available at the following [link](https://learn.deeplearning.ai/courses/building-agentic-rag-with-llamaindex/).

## Setup

In [None]:
# Mounting to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
cd "YOUR-PATH-HERE"

In [None]:
%%capture
!pip install llama-index llama-index-llms-openai llama-index-embeddings-openai openai pypdf

In [None]:
!pip list | grep llama-index

llama-index                             0.12.19
llama-index-agent-openai                0.4.6
llama-index-cli                         0.4.0
llama-index-core                        0.12.19
llama-index-embeddings-openai           0.3.1
llama-index-indices-managed-llama-cloud 0.6.7
llama-index-llms-openai                 0.3.20
llama-index-multi-modal-llms-openai     0.4.3
llama-index-program-openai              0.3.1
llama-index-question-gen-openai         0.3.0
llama-index-readers-file                0.4.5
llama-index-readers-llama-parse         0.4.0


In [None]:
import os
from llama_index.core import (
    VectorStoreIndex,
    SummaryIndex,
    SimpleDirectoryReader,
    Settings,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.tools import QueryEngineTool
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.prompts import PromptTemplate

In [None]:
# Set OpenAI API key
import openai

openai.api_key = 'YOUR-OPENAI-API-KEY-HERE'

In [None]:
import nest_asyncio
nest_asyncio.apply()

## Define an Example of a Simple Tool

In [None]:
from llama_index.core.tools import FunctionTool

def add(x: int, y: int) -> int:
    """Adds two integers together."""
    return x + y

def mystery(x: int, y: int) -> int:
    """Mystery function that operates on top of two numbers."""
    return (x + y) * (x + y)


add_tool = FunctionTool.from_defaults(fn=add)
mystery_tool = FunctionTool.from_defaults(fn=mystery)
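
`FunctionTool.from_defaults` relies on the function's name, type hints, and docstring to describe the tool to the model, which is why both matter even for a toy function. The sketch below approximates that idea with the standard library only; `describe_tool` is a hypothetical helper, and the dictionary it builds is a simplification rather than the schema LlamaIndex actually produces.

```python
import inspect


def mystery(x: int, y: int) -> int:
    """Mystery function that operates on top of two numbers."""
    return (x + y) * (x + y)


def describe_tool(fn) -> dict:
    """Build a simplified tool description from a function's signature."""
    sig = inspect.signature(fn)
    # Map each parameter name to the name of its annotated type.
    params = {
        name: getattr(p.annotation, "__name__", str(p.annotation))
        for name, p in sig.parameters.items()
    }
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "parameters": params,
    }


print(describe_tool(mystery))
```

Because the docstring of `mystery` says nothing about squaring a sum, the model can only select it from the name and description, not from its implementation.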

In [None]:
# Initialize OpenAI model
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

In [None]:
response = llm.predict_and_call(
    [add_tool, mystery_tool],
    "Tell me the output of the mystery function on 2 and 9",
    verbose=True
)
print(str(response))

=== Calling Function ===
Calling function: mystery with args: {"x": 2, "y": 9}
=== Function Output ===
121
121


## Define a Simple Auto-Retrieval Tool

In [None]:
# Load and parse the PDF document
reader = SimpleDirectoryReader(input_files=["transformers.pdf"])
documents = reader.load_data()

In [None]:
# Splitting documents into chunks
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
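
To make the chunking step concrete, here is a minimal sketch of sentence-aware splitting. It measures size in characters and splits on a naive regular expression, whereas `SentenceSplitter` measures `chunk_size` in tokens and uses more careful heuristics, so `split_into_chunks` should be read as a hypothetical illustration only.

```python
import re


def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Greedily pack whole sentences into chunks.

    A new chunk starts when adding the next sentence would exceed
    chunk_size characters. A single over-long sentence is kept whole.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Attention is all you need. Recurrence is dropped. Training is faster."
for chunk in split_into_chunks(sample, chunk_size=50):
    print(chunk)
```

Keeping sentences intact means each chunk remains readable on its own, which helps both embedding quality and the final answer synthesis.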

In [None]:
# Each node represents a chunk; let's inspect the first one
print(nodes[0].get_content(metadata_mode="all"))

page_label: 1
file_name: transformers.pdf
file_path: transformers.pdf
file_type: application/pdf
file_size: 2215244
creation_date: 2025-02-18
last_modified_date: 2025-02-14

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an atten

In [None]:
len(nodes)

15

In [None]:
# Define a vector index over the nodes
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine(similarity_top_k=2)

In [None]:
from llama_index.core.vector_stores import MetadataFilters

# Querying the RAG pipeline with metadata filters
query_engine = vector_index.as_query_engine(
    similarity_top_k=2,
    filters=MetadataFilters.from_dicts(
        [
            {"key": "page_label", "value": "2"}
        ]
    )
)
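
Conceptually, a metadata filter narrows the candidate nodes before similarity ranking picks the top k. The sketch below shows that order of operations with toy two-dimensional embeddings; the `nodes` data, `cosine`, and `filtered_top_k` are all hypothetical, and a real vector store may apply the filter differently internally. Note that the filter value is the string `"2"`, matching how `page_label` is stored in the node metadata.

```python
import math

# Toy nodes: each has metadata and a made-up 2-D embedding (assumption).
nodes = [
    {"text": "intro", "page_label": "1", "embedding": [1.0, 0.0]},
    {"text": "background", "page_label": "2", "embedding": [0.6, 0.8]},
    {"text": "architecture", "page_label": "2", "embedding": [0.0, 1.0]},
    {"text": "results", "page_label": "8", "embedding": [0.8, 0.6]},
]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def filtered_top_k(query_embedding, key, value, k):
    """Keep nodes whose metadata[key] == value, then rank by similarity."""
    candidates = [n for n in nodes if n.get(key) == value]
    ranked = sorted(
        candidates,
        key=lambda n: cosine(query_embedding, n["embedding"]),
        reverse=True,
    )
    return [n["text"] for n in ranked[:k]]


print(filtered_top_k([1.0, 0.0], "page_label", "2", k=2))
```

Even though the "intro" and "results" nodes are closer to the query vector, they are never considered because they fail the page filter.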

In [None]:
# Testing the RAG pipeline
import textwrap

query1 = "What are some high-level results of the document?"
response = query_engine.query(query1)

# Extract the response text
response_text = str(response) if isinstance(response, str) else response.response

# Wrap text to a readable width
wrapped_response = textwrap.fill(response_text, width=80)

# Print structured output
print("=" * 80)
print(f"**User Query:**\n{query1}\n")
print("=" * 80)
print("**Generated Response:**\n")
print(wrapped_response)
print("=" * 80)

**User Query:**
What are some high-level results of the document?

**Generated Response:**

The document discusses the Transformer model architecture, which relies entirely
on self-attention to draw global dependencies between input and output, allowing
for significantly more parallelization compared to recurrent models. The
Transformer has been shown to achieve a new state of the art in translation
quality after being trained for a relatively short period of time on multiple
GPUs. Additionally, the document mentions other models like Extended Neural GPU,
ByteNet, and ConvS2S that aim to reduce sequential computation through parallel
processing using convolutional neural networks.


In [None]:
# Let's check the source nodes
for n in response.source_nodes:
    print(n.metadata)

{'page_label': '2', 'file_name': 'transformers.pdf', 'file_path': 'transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2025-02-18', 'last_modified_date': '2025-02-14'}


## Define an Auto-Retrieval Tool

In [None]:
from typing import List, Optional
from llama_index.core.vector_stores import FilterCondition


def vector_query(
    query: str,
    page_numbers: Optional[List[str]] = None
) -> str:
    """Perform a vector search over an index.

    query (str): the string query to be embedded.
    page_numbers (Optional[List[str]]): Filter by set of pages. Leave as None to perform a vector search
        over all pages. Otherwise, filter by the set of specified pages.

    """

    # Treat a missing page list as "no filter" so the search covers all pages
    page_numbers = page_numbers or []
    metadata_dicts = [
        {"key": "page_label", "value": p} for p in page_numbers
    ]

    # OR lets a node match any one of the requested pages
    query_engine = vector_index.as_query_engine(
        similarity_top_k=2,
        filters=MetadataFilters.from_dicts(
            metadata_dicts,
            condition=FilterCondition.OR
        )
    )
    response = query_engine.query(query)
    # The full response object is returned, not just its text,
    # so that source_nodes can be inspected later
    return response


vector_query_tool = FunctionTool.from_defaults(
    name="vector_tool",
    fn=vector_query
)
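
The `FilterCondition.OR` argument matters once several pages are requested: a node carries a single `page_label`, so requiring every filter to match at once would exclude every node. The sketch below contrasts the two ways of combining exact-match filters using plain dictionaries; `matches` is a hypothetical helper, not a LlamaIndex function.

```python
def matches(metadata: dict, filters: list[dict], condition: str) -> bool:
    """Check metadata against exact-match filters combined with 'and' or 'or'."""
    if not filters:
        return True  # no filters means no restriction
    results = [metadata.get(f["key"]) == f["value"] for f in filters]
    return all(results) if condition == "and" else any(results)


pages = ["2", "8"]
filters = [{"key": "page_label", "value": p} for p in pages]
node_metadata = {"page_label": "8", "file_name": "transformers.pdf"}

# Page 8 is one of the requested pages, so OR accepts the node.
print(matches(node_metadata, filters, "or"))
# No node can be on page 2 and page 8 at once, so AND rejects it.
print(matches(node_metadata, filters, "and"))
```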

In [None]:
# Let's call this tool with an LLM
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)

response = llm.predict_and_call(
    [vector_query_tool],
    "What are the high-level results of the document as described on page 2?",
    verbose=True
)

=== Calling Function ===
Calling function: vector_tool with args: {"query": "high-level results", "page_numbers": ["2"]}
=== Function Output ===
The Transformer model architecture, which relies entirely on self-attention mechanisms without using recurrent neural networks or convolution, has shown significant improvements in translation quality. It allows for more parallelization during training and can achieve state-of-the-art performance in translation quality after just twelve hours of training on eight P100 GPUs.


In [None]:
for n in response.source_nodes:
    print(n.metadata)

{'page_label': '2', 'file_name': 'transformers.pdf', 'file_path': 'transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2025-02-18', 'last_modified_date': '2025-02-14'}


In [None]:
# Let's add a summary tool
summary_index = SummaryIndex(nodes)
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
summary_tool = QueryEngineTool.from_defaults(
    name="summary_tool",
    query_engine=summary_query_engine,
    description=(
        "Useful if you want to get a summary of the document"
    ),
)
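
Roughly speaking, the `tree_summarize` response mode builds the answer bottom-up: chunks are summarized in groups, and the intermediate summaries are summarized again until a single one remains. A minimal sketch of that recursion follows, with a trivial `summarize` stand-in where a real pipeline would call the LLM; the function names and the `fan_out` parameter are assumptions of this sketch, not LlamaIndex internals.

```python
def summarize(texts: list[str]) -> str:
    """Stand-in for an LLM call: join the first word of each input."""
    return " ".join(t.split()[0] for t in texts)


def tree_summarize(chunks: list[str], fan_out: int = 2) -> str:
    """Recursively summarize groups of chunks until a single summary remains."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    if not chunks:
        return ""
    level = chunks
    while len(level) > 1:
        # Summarize each group of fan_out items to form the next level.
        level = [
            summarize(level[i : i + fan_out])
            for i in range(0, len(level), fan_out)
        ]
    return level[0]


chunks = [
    "attention replaces recurrence",
    "training parallelizes well",
    "translation quality improves",
]
print(tree_summarize(chunks))
```

Unlike the vector tool, this approach reads every chunk, which is why the summary tool suits whole-document questions while the vector tool suits targeted ones.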

In [None]:
# Let's use tool calling again
response = llm.predict_and_call(
    [vector_query_tool, summary_tool],
    "Which results did the big transformer model achieve on the WMT 2014 English-to-German translation task, as described on page 8?",
    verbose=False
)
# Extract the response text
response_text = str(response) if isinstance(response, str) else response.response

# Wrap text to a readable width
wrapped_response = textwrap.fill(response_text, width=80)

print("=" * 80)
print("**Function Output:**\n")
print(wrapped_response)
print("=" * 80)

**Function Output:**

The big transformer model achieved a BLEU score of 28.4 on the WMT 2014 English-
to-German translation task, outperforming all previously reported models by more
than 2.0 BLEU.


In [None]:
# Page 8 is correctly selected

for n in response.source_nodes:
    print(n.metadata)

{'page_label': '8', 'file_name': 'transformers.pdf', 'file_path': 'transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2025-02-14', 'last_modified_date': '2025-02-14'}


In [None]:
# The LLM can also pick the summary tool when necessary
query3 = "Please make a summary of the document"
response = llm.predict_and_call(
    [vector_query_tool, summary_tool],
    query3,
    verbose=False
)
# Extract the response text
response_text = str(response) if isinstance(response, str) else response.response

# Wrap text to a readable width
wrapped_response = textwrap.fill(response_text, width=80)

# Print structured output
print("=" * 80)
print(f"**User Query:**\n{query3}\n")
print("=" * 80)
print("**Generated Response:**\n")
print(wrapped_response)
print("=" * 80)

**User Query:**
Please make a summary of the document

**Generated Response:**

The document discusses the Transformer model, a sequence transduction model
based solely on attention mechanisms, eliminating the need for recurrent or
convolutional layers. The model uses self-attention to compute representations
of input and output sequences. It allows for more parallelization, faster
training, and achieves state-of-the-art results in machine translation tasks.
The paper presents the model architecture, attention mechanisms, training
details, and results on tasks like English-to-German and English-to-French
translation. Additionally, the Transformer model is shown to generalize well to
English constituency parsing tasks. The document concludes by highlighting the
potential of attention-based models for various tasks and providing access to
the code used for training and evaluation.
