# Tool calling with an LLM
<hr>

Setup: load the environment variables and patch the event loop for notebook use.

In [1]:
import os

from dotenv import load_dotenv
load_dotenv()

import nest_asyncio
nest_asyncio.apply()

In [2]:
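# load_dotenv() has already exported the variables from .env; fail early if the key is missing
assert os.getenv('OPENAI_API_KEY'), 'OPENAI_API_KEY is not set; check your .env file'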

Define two simple functions and wrap them as tools.

In [3]:
from llama_index.core.tools import FunctionTool

def add(x: int, y: int) -> int:
    """Adds two numbers together"""
    return x + y


def mystry(x: int, y: int) -> int:
    """Mystery function that operates on top of two numbers"""
    return (x + y) * (x + y)


add_tool = FunctionTool.from_defaults(fn=add)
mystry_tool = FunctionTool.from_defaults(fn=mystry)
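
The docstrings and type hints matter here: the tool wrapper exposes each function's name, description, and arguments to the model, and the model relies on them to pick a tool. As a rough illustration only (the `describe_tool` helper is hypothetical and is not how LlamaIndex builds its schema internally), the same information can be collected with the standard library:

```python
import inspect
from typing import Any, Callable, Dict


# Hypothetical helper for illustration; not part of LlamaIndex.
def describe_tool(fn: Callable[..., Any]) -> Dict[str, Any]:
    """Collect the name, docstring, and parameter types of a function."""
    signature = inspect.signature(fn)
    parameters = {
        name: getattr(param.annotation, "__name__", str(param.annotation))
        for name, param in signature.parameters.items()
    }
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": parameters,
    }


def add(x: int, y: int) -> int:
    """Adds two numbers together"""
    return x + y


tool_spec = describe_tool(add)
print(tool_spec)
```

A function without a docstring or type hints would give the model much less to work with, which is why both are written out above.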

Integrate the tools with the LLM.

In [4]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")
response = llm.predict_and_call(
    [add_tool, mystry_tool],
    'Tell me the output of the mystery function on 2 and 9',
    verbose=True
)

=== Calling Function ===
Calling function: mystry with args: {"x": 2, "y": 9}
=== Function Output ===
121


In [5]:
print(response)

121
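
The verbose log above shows the two halves of a tool call: the model picks a tool name and arguments, and the library then runs the matching Python function. The dispatch half can be sketched in plain Python. The `TOOLS` registry and `dispatch` function are hypothetical names for illustration, and the tool name and JSON arguments are copied from the log rather than produced by a model:

```python
import json
from typing import Callable, Dict


def add(x: int, y: int) -> int:
    """Adds two numbers together"""
    return x + y


def mystry(x: int, y: int) -> int:
    """Mystery function that operates on top of two numbers"""
    return (x + y) * (x + y)


# Hypothetical registry mapping tool names to callables.
TOOLS: Dict[str, Callable[..., int]] = {"add": add, "mystry": mystry}


def dispatch(tool_name: str, arguments_json: str) -> int:
    """Look up the tool by name and call it with the decoded JSON arguments."""
    if tool_name not in TOOLS:
        raise KeyError(f"Unknown tool: {tool_name}")
    kwargs = json.loads(arguments_json)
    return TOOLS[tool_name](**kwargs)


result = dispatch("mystry", '{"x": 2, "y": 9}')
print(result)  # 121, since (2 + 9) * (2 + 9) == 121
```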


### Sourcing a specific page with metadata
<hr>

In [3]:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(input_files=["data/ragchecker.pdf"]).load_data()

In [4]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)

In [8]:
print(len(nodes))

41
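
The splitter turns the loaded pages into 41 chunks that each stay under a size budget while preferring sentence boundaries. The sketch below is a much simplified stand-in for that idea, not the `SentenceSplitter` algorithm. It assumes a word count in place of a tokenizer, uses no chunk overlap, and `pack_sentences` is a hypothetical name:

```python
import re
from typing import List


def pack_sentences(text: str, max_words: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_words = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when this sentence would exceed the budget.
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "RAG systems retrieve context. They then generate an answer. Evaluation is hard."
chunks = pack_sentences(sample, max_words=8)
print(chunks)
```

With an 8-word budget the first sentence forms its own chunk and the remaining two share the second one.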


Take a look at the metadata attached to each node:
```
page_label: 1
file_name: ragchecker.pdf
file_path: data/ragchecker.pdf
file_type: application/pdf
file_size: 2553412
creation_date: 2024-09-30
last_modified_date: 2024-09-30
```


In [21]:
print(nodes[4].get_content(metadata_mode="all"))

page_label: 5
file_name: ragchecker.pdf
file_path: data/ragchecker.pdf
file_type: application/pdf
file_size: 2553412
creation_date: 2024-09-30
last_modified_date: 2024-09-30

3.3.1 Overall Metrics
To assess the overall response quality of a RAG system from a user’s perspective, we can compute the
precision and recall at claim level for each model generated response against its paired ground-truth
answer. Specifically, we first extract claims from a model response mand a ground-truth answer gtas
{c(m)
i}and{c(gt)
i}respectively. Then, we define correct claims in the response as {c(m)
i|c(m)
i∈gt},
and correct claims in the ground-truth answer as {c(gt)
i|c(gt)
i∈m}. Two metrics can be computed
directly: precision is the proportion of correct claims in all response claims, and recall is the
proportion of correct claims in all ground-truth answer claims. Further, the harmonic average of
precision and recall gives the F1score, as the overall performance metric.
3.3.2 Retriever Metrics
Idea

Create a vector index and a query engine.

In [10]:
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine(similarity_top_k=2)

- Use metadata to filter the retrieved data by page

In [16]:
from llama_index.core.vector_stores import MetadataFilters

query_engine = vector_index.as_query_engine(
    similarity_top_k=2,
    filters=MetadataFilters.from_dicts(
        [
            {"key": "page_label", "value": "5"},
        ]
    )
)

response = query_engine.query(
    "what is the ragchecker metrics",
)

In [17]:
from llama_index.core.response.pprint_utils import pprint_response
print(response)
print("\n=============================\n")

pprint_response(response)

The RAGchecker metrics include precision, recall, and F1 score at claim level for assessing overall response quality, claim recall for measuring completeness of retrieved chunks, retriever precision at chunk-level, faithfulness metric for generator performance, relevant noise sensitivity, irrelevant noise sensitivity, hallucination metric, and self-knowledge metric for characterizing how a generator produces correct claims based on information sources.


Final Response: The RAGchecker metrics include precision, recall, and
F1 score at claim level for assessing overall response quality, claim
recall for measuring completeness of retrieved chunks, retriever
precision at chunk-level, faithfulness metric for generator
performance, relevant noise sensitivity, irrelevant noise sensitivity,
hallucination metric, and self-knowledge metric for characterizing how
a generator produces correct claims based on information sources.


Check the data source

In [18]:
for n in response.source_nodes:
    print(n.metadata)

{'page_label': '5', 'file_name': 'ragchecker.pdf', 'file_path': 'data/ragchecker.pdf', 'file_type': 'application/pdf', 'file_size': 2553412, 'creation_date': '2024-09-30', 'last_modified_date': '2024-09-30'}


### Enhancing data retrieval
<hr>

In [7]:
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex(nodes)

- Perform a vector search over the index, using the page numbers as a metadata filter

In [5]:
from typing import List
# MetadataFilters is imported here so this cell does not depend on a later import
from llama_index.core.vector_stores import FilterCondition, MetadataFilters
from llama_index.core.tools import FunctionTool

def vector_query(
    query: str,
    page_numbers: List[str],
) -> str:
    """Performs a vector search over an index.

    query (str): the string query to be embedded.
    page_numbers (List[str]): Filter by set of pages. Leave BLANK if we want to search
    over all pages. Otherwise, filter by the set of specified pages.
    """

    metadata_dicts = [
        {"key": "page_label", "value": p}
        for p in page_numbers
    ]

    query_engine = vector_index.as_query_engine(
        similarity_top_k=2,
        filters=MetadataFilters.from_dicts(
            metadata_dicts,
            condition=FilterCondition.OR
        )
    )
    response = query_engine.query(query)
    return response

vector_query_tool = FunctionTool.from_defaults(
    name="vector_query",
    fn=vector_query,
)
    

Call the LLM with the tool.

In [8]:
from llama_index.llms.openai import OpenAI
from llama_index.core.vector_stores import MetadataFilters

llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
response = llm.predict_and_call(
    [vector_query_tool], 
    'What is the ragchecker metrics as described in the page 5',
    verbose=True
)

=== Calling Function ===
Calling function: vector_query with args: {"query": "ragchecker metrics", "page_numbers": ["5"]}
=== Function Output ===
The metrics for evaluating a RAG system include precision, recall, and F1 score at claim level for overall response quality assessment. Additionally, retriever metrics involve measuring claim recall and retriever precision at chunk-level. Generator metrics include faithfulness, relevant noise sensitivity, irrelevant noise sensitivity, hallucination, and self-knowledge scores to assess the generator's performance in producing correct claims based on retrieved chunks.


Verify the source.

In [9]:
for n in response.source_nodes:
    print(n.metadata)

{'page_label': '5', 'file_name': 'ragchecker.pdf', 'file_path': 'data/ragchecker.pdf', 'file_type': 'application/pdf', 'file_size': 2553412, 'creation_date': '2024-09-30', 'last_modified_date': '2024-09-30'}


### Overall tool system
<hr>

Create a summary tool.

In [11]:
from llama_index.core import SummaryIndex
from llama_index.core.tools import QueryEngineTool

summary_index = SummaryIndex(nodes)
summary_query_engine = summary_index.as_query_engine(
    response_mode='tree_summarize',
    use_async=True,
)
summary_tool = QueryEngineTool.from_defaults(
    name="summary_tool",
    query_engine=summary_query_engine,
    description=(
        'Useful if you want to get a summary of the RAGChecker paper.'
    )
)

Test which tool the LLM selects for each question.

In [13]:
response = llm.predict_and_call(
    [vector_query_tool, summary_tool],
    'What is  averaged evaluation results for 8 RAG systems across 10 diverse domain datasets in the page 9',
    verbose=True
)

=== Calling Function ===
Calling function: vector_query with args: {"query": "averaged evaluation results for 8 RAG systems across 10 diverse domain datasets", "page_numbers": ["9"]}
=== Function Output ===
The averaged evaluation results for the 8 RAG systems across 10 diverse domain datasets showed significant variations in performance based on the modifications made to the RAG settings. The adjustments in the number and size of chunks, chunk overlap ratios, and generation prompts had varying impacts on the systems' recall, faithfulness, noise sensitivity, and overall performance. These findings provide valuable insights into the behaviors of the retriever and generator components within the RAG systems, highlighting the importance of carefully tuning these settings to achieve optimal performance across different domains.


In [15]:
response = llm.predict_and_call(
    [vector_query_tool, summary_tool],
    'What is  the summary of the paper',
    verbose=True
)

=== Calling Function ===
Calling function: summary_tool with args: {"input": "paper"}
=== Function Output ===
The paper discusses the benchmark dataset used for evaluating Retrieval-Augmented Generation (RAG) systems, the process of curating questions from various open-domain question answering datasets, the generation of long-form answers using GPT-4, and the validation process to ensure no hallucinations were present. It also mentions downsampling in the Science and Biomedical domains for efficient evaluation. Additionally, the paper presents evaluation results for different RAG systems on datasets like ClapNQ, NovelQA, RobustQA - Writing, BioASQ, Finance, Lifestyle, Science, Technology, and Recreation, showcasing performance metrics for both retriever and generator components. Furthermore, it discusses the performance of RefChecker on the RefChecker benchmark using Llama 3 70B Instruct as both the extractor and checker, comparing the results with the best open-sourced combinations r