# This is an agent reasoning model
<hr>

### Initial setup
<hr>

In [3]:
import os

from dotenv import load_dotenv
load_dotenv()

import nest_asyncio
nest_asyncio.apply()
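# Fail early with a clear message if the key was not loaded from .env
api_key = os.getenv('OPENAI_API_KEY')
if api_key is None:
    raise RuntimeError('OPENAI_API_KEY is not set; add it to your .env file')
os.environ['OPENAI_API_KEY'] = api_key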

<hr>

### Chunking and embedding

In [4]:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(input_files=["data/ragchecker.pdf"]).load_data()

In [5]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
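
To give a feel for what sentence-aware chunking does, here is a minimal pure-Python sketch. It is not how `SentenceSplitter` is implemented: the function name `split_into_chunks`, the regex-based sentence detection, and the word budget (instead of a token budget) are all assumptions made for illustration.

```python
import re
from typing import List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept intact as its own chunk.
    """
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


toy_text = "One two three. Four five. Six seven eight nine."
toy_chunks = split_into_chunks(toy_text, max_words=5)
```

The point of the sketch is only that chunk boundaries fall between sentences rather than in the middle of one.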

In [6]:
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex(nodes)
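
Conceptually, a vector index stores one embedding per node and, at query time, returns the nodes whose embeddings are most similar to the query embedding. The sketch below shows top-k retrieval with cosine similarity on hand-written toy vectors. The helper names `cosine_similarity` and `top_k` are hypothetical, and the choice of cosine similarity is an assumption of this sketch rather than a statement about the index's internals.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    node_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (node index, score) pairs for the k most similar node vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(node_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-d "embeddings"; real embeddings have many more dimensions.
toy_node_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_hits = top_k([1.0, 0.0], toy_node_vecs, k=2)
```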

<hr>

### Vector tool

In [7]:
from typing import List
from llama_index.core.vector_stores import FilterCondition
from llama_index.core.tools import FunctionTool
from llama_index.core.vector_stores import MetadataFilters

def vector_query(
    query : str,
    page_numbers: List[str],
) -> str:
    """Performs a vector search over an index.
    
    query (str): the query string to embed.
    page_numbers (List[str]): Filter by a set of pages. Leave BLANK to search
    over all pages. Otherwise, filter by the set of specified pages.
    """

    metadata_dicts = [
        {"key": "page_label", "value": p}
        for p in page_numbers
    ]

    query_engine = vector_index.as_query_engine(
        similarity_top_k=2,
        filters=MetadataFilters.from_dicts(
            metadata_dicts,
            condition=FilterCondition.OR
        )
    )
    response = query_engine.query(query)
    return response

vector_query_tool = FunctionTool.from_defaults(
    name="vector_query",
    fn=vector_query,
)
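
The `FilterCondition.OR` filter above is meant to keep a node when its `page_label` equals any of the requested pages. The following sketch models that behaviour on plain dictionaries. The names `matches_any_page`, `filter_nodes`, and `toy_nodes` are hypothetical, and treating an empty page list as "no filter" is an assumption that follows the docstring above, not a guarantee about the library.

```python
from typing import Dict, List


def matches_any_page(metadata: Dict[str, str], page_numbers: List[str]) -> bool:
    """OR-style filter: True if the node's page_label is one of page_numbers.

    An empty page_numbers list is treated as "search all pages" (assumption).
    """
    if not page_numbers:
        return True
    return metadata.get("page_label") in page_numbers


def filter_nodes(nodes: List[Dict[str, str]], page_numbers: List[str]) -> List[Dict[str, str]]:
    """Keep only the nodes that pass the page filter."""
    return [node for node in nodes if matches_any_page(node, page_numbers)]


# Toy metadata shaped like the page_label entries printed later in this notebook.
toy_nodes = [
    {"page_label": "1", "text": "abstract"},
    {"page_label": "2", "text": "introduction"},
    {"page_label": "7", "text": "experiments"},
]
selected = filter_nodes(toy_nodes, ["1", "7"])
```

Note that the page labels are strings, which is why the tool signature uses `List[str]` rather than `List[int]`.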

### Summary tool

In [8]:
from llama_index.core import SummaryIndex
from llama_index.core.tools import QueryEngineTool

summary_index = SummaryIndex(nodes)
summary_query_engine = summary_index.as_query_engine(
    response_mode='tree_summarize',
    use_async=True,
)
summary_tool = QueryEngineTool.from_defaults(
    name="summary_tool",
    query_engine=summary_query_engine,
    description=(
        'Useful if you want to get a summary of the ragchecker.'
    )
)
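
The `tree_summarize` response mode builds its answer bottom-up: groups of chunks are summarized, then the summaries are summarized again, until one text remains. Below is a rough sketch of that idea. The `summarize` callable stands in for an LLM call, and `toy_summarize` and the `fan_in` parameter are assumptions made for illustration, not part of the LlamaIndex API.

```python
from typing import Callable, List


def tree_summarize(
    chunks: List[str],
    summarize: Callable[[List[str]], str],
    fan_in: int = 2,
) -> str:
    """Summarize groups of fan_in texts level by level until one text remains.

    A single input chunk is returned unchanged; an empty input gives "".
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")
    if not chunks:
        return ""
    level = list(chunks)
    while len(level) > 1:
        # Each pass replaces every group of fan_in texts with one summary.
        level = [summarize(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


def toy_summarize(group: List[str]) -> str:
    """Stand-in for an LLM call: keep the first word of each text in the group."""
    return " ".join(text.split()[0] for text in group)


toy_summary = tree_summarize(["alpha one", "beta two", "gamma three"], toy_summarize)
```

In the real query engine each `summarize` call is an LLM request, which is why `use_async=True` can help: calls on the same level do not depend on each other.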

### LLM setup
<hr>

In [15]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model='gpt-3.5-turbo', temperature=0)

### The LlamaIndex agent
<hr>

In [16]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    tools=[vector_query_tool, summary_tool],
    llm=llm,
    verbose=True,
)

agent = AgentRunner(agent_worker)

Test the result

- Response from gpt-3.5-turbo

In [17]:
response = agent.query(
    "Tell me about the ragchecker metric,"
    " and how to evaluate the performance.",
)

Added user message to memory: Tell me about the ragchecker metric, and how to evaluate the performance.
=== Calling Function ===
Calling function: summary_tool with args: {"input": "ragchecker"}
=== Function Output ===
RAGCHECKER is an evaluation framework specifically designed for Retrieval-Augmented Generation (RAG) systems. It offers a comprehensive suite of diagnostic metrics to assess the performance of both the retrieval and generation modules within RAG systems. The framework has been validated through human assessments, demonstrating a strong correlation with human evaluations. By evaluating various RAG systems across diverse domain datasets, RAGCHECKER provides valuable insights into the behaviors of the retriever and generator components, highlighting trade-offs in RAG system designs and offering guidance for future advancements in RAG applications.
=== LLM Response ===
The RAGCHECKER metric is an evaluation framework tailored for Retrieval-Augmented Generation (RAG) systems.

In [18]:
print(response.source_nodes[0].get_content(metadata_mode='all'))

page_label: 1
file_name: ragchecker.pdf
file_path: data/ragchecker.pdf
file_type: application/pdf
file_size: 2553412
creation_date: 2024-09-30
last_modified_date: 2024-09-30

RAGCHECKER : A Fine-grained Framework for
Diagnosing Retrieval-Augmented Generation
Dongyu Ru1∗Lin Qiu1∗Xiangkun Hu1∗Tianhang Zhang1∗Peng Shi1∗
Shuaichen Chang1∗Cheng Jiayang1†Cunxiang Wang1†Shichao Sun2
Huanyu Li2Zizhao Zhang1†Binjie Wang1†Jiarong Jiang1Tong He1
Zhiguo Wang1Pengfei Liu2Yue Zhang3Zheng Zhang1
1Amazon AWS AI2Shanghai Jiaotong University3Westlake University
Abstract
Despite Retrieval-Augmented Generation (RAG) showing promising capability in
leveraging external knowledge, a comprehensive evaluation of RAG systems is still
challenging due to the modular nature of RAG, evaluation of long-form responses
and reliability of measurements. In this paper, we propose a fine-grained evaluation
framework, RAGCHECKER , that incorporates a suite of diagnostic metrics for both
the retrieval and generation modules

- Result with gpt-4o

In [14]:
response = agent.query(
    "Tell me about the ragchecker metric,"
    " and how to evaluate the performance.",
)

Added user message to memory: Tell me about the ragchecker metric, and how to evaluate the performance.
=== Calling Function ===
Calling function: summary_tool with args: {"input": "ragchecker metric"}
=== Function Output ===
The RAGCHECKER framework introduces a suite of metrics for evaluating Retrieval-Augmented Generation (RAG) systems. These metrics cover aspects such as claim recall, context precision, context utilization, noise sensitivity, faithfulness, precision, recall, F1 score, hallucination, self-knowledge, Correct Retrieval (CR), Correct Prediction (CP), Correct Update (CU), Non-Stopword Overlap (NS), Semantic Knowledge (SK), and more.
=== Calling Function ===
Calling function: vector_query with args: {"query": "evaluate the performance of ragchecker metric", "page_numbers": []}
=== Function Output ===
RAGCHECKER metric's performance was evaluated by comparing its correlations with human judgments to other evaluation metrics. The meta-evaluation confirmed that RAGCHECKER h
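
The traces above show the pattern behind function calling: the model emits a tool name plus JSON arguments, and the runtime looks up the tool and calls it. The sketch below reproduces only that dispatch step. `dispatch_tool_call` and the two toy tools are hypothetical stand-ins, not the LlamaIndex implementation.

```python
import json
from typing import Callable, Dict, List


def dispatch_tool_call(tools: Dict[str, Callable[..., str]], name: str, raw_args: str) -> str:
    """Look up a tool by name and call it with JSON-decoded keyword arguments."""
    if name not in tools:
        raise KeyError(f"Unknown tool: {name}")
    kwargs = json.loads(raw_args)
    return tools[name](**kwargs)


def toy_summary_tool(input: str) -> str:
    # The parameter is named "input" to mirror the args shown in the trace.
    return f"summary of {input}"


def toy_vector_query(query: str, page_numbers: List[str]) -> str:
    pages = ", ".join(page_numbers) if page_numbers else "all pages"
    return f"search '{query}' on {pages}"


toy_tools = {"summary_tool": toy_summary_tool, "vector_query": toy_vector_query}

first = dispatch_tool_call(toy_tools, "summary_tool", '{"input": "ragchecker"}')
second = dispatch_tool_call(
    toy_tools, "vector_query", '{"query": "evaluation", "page_numbers": []}'
)
```

The difference between the two runs above is in which tool calls the model chose to emit, not in the dispatch mechanism.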

### Agent reasoning loop
<hr>

In [24]:
response = agent.chat(
    "Tell me about the evaluation daatsets used."
)

Added user message to memory: Tell me about the evaluation daatsets used.
=== Calling Function ===
Calling function: summary_tool with args: {"input": "evaluation datasets"}
=== Function Output ===
The evaluation datasets used in the RAGCHECKER framework are repurposed from existing open-domain question answering datasets, including RobustQA, KIWI, ClapNQ, and NovelQA. The datasets cover various domains such as Biomedical, Finance, Lifestyle, Recreation, Technology, Science, and Writing. The short answers in these datasets are converted to long-form answers to match the capabilities of modern RAG systems. Additionally, the datasets are curated to ensure that the long-form answers are generated accurately without any hallucinations. The benchmark includes a total of 4,162 questions across these domains.
=== LLM Response ===
The evaluation datasets used in the RAGCHECKER framework are repurposed from existing open-domain question answering datasets, including RobustQA, KIWI, ClapNQ, and 

In [20]:
response = agent.chat(
    "Tell me the results over one of the above  datasets."
)

Added user message to memory: Tell me the results over one of the above  datasets.
=== Calling Function ===
Calling function: vector_query with args: {"query": "evaluation results", "page_numbers": ["1"]}
=== Function Output ===
The evaluation results of the RAG systems were obtained using a fine-grained evaluation framework called RAGCHECKER. This framework incorporates a suite of diagnostic metrics for both the retrieval and generation modules. Meta evaluation has shown that RAGCHECKER has significantly better correlations with human judgments compared to other evaluation metrics. Through the use of RAGCHECKER, the performance of 8 RAG systems was evaluated, revealing insightful patterns and trade-offs in the design choices of RAG architectures. The metrics provided by RAGCHECKER can guide researchers and practitioners in developing more effective RAG systems.
=== LLM Response ===
The evaluation results of the RAG systems over the specified dataset were obtained using the RAGCHECKER 

### Adding a task
<hr>

In [25]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    tools=[vector_query_tool, summary_tool],
    llm=llm,
    verbose=True,
)

agent = AgentRunner(agent_worker)

In [26]:
task = agent.create_task(
    "Tell me about the ragchecker metric,"
    " and how to evaluate the performance of RAG."
)

Run the first step

In [27]:
step_output = agent.run_step(task.task_id)

Added user message to memory: Tell me about the ragchecker metric, and how to evaluate the performance of RAG.
=== Calling Function ===
Calling function: summary_tool with args: {"input": "ragchecker"}
=== Function Output ===
RAGCHECKER is an evaluation framework specifically designed for Retrieval-Augmented Generation (RAG) systems. It assesses both the retrieval and generation components of RAG systems using various diagnostic metrics such as precision, recall, faithfulness, context precision, context utilization, noise sensitivity, hallucination, and self-knowledge. The framework has been validated through human assessments and has shown a strong correlation with human evaluations. It has been used to evaluate 8 different RAG systems across 10 diverse domain datasets, providing valuable insights into the behaviors of the retriever and generator components, highlighting trade-offs in RAG system designs, and offering guidance for future advancements in RAG applications.


Show how many steps have been completed

In [28]:
completed_steps = agent.get_completed_steps(task.task_id)
print(f"Num completed for task {task.task_id}: {len(completed_steps)}")
print(completed_steps[0].output.sources[0].raw_output)

Num completed for task 4c08a022-0073-4a48-ba94-ffd1ad4d171f: 1
RAGCHECKER is an evaluation framework specifically designed for Retrieval-Augmented Generation (RAG) systems. It assesses both the retrieval and generation components of RAG systems using various diagnostic metrics such as precision, recall, faithfulness, context precision, context utilization, noise sensitivity, hallucination, and self-knowledge. The framework has been validated through human assessments and has shown a strong correlation with human evaluations. It has been used to evaluate 8 different RAG systems across 10 diverse domain datasets, providing valuable insights into the behaviors of the retriever and generator components, highlighting trade-offs in RAG system designs, and offering guidance for future advancements in RAG applications.


In [29]:
upcoming_steps = agent.get_upcoming_steps(task.task_id)
print(f"Num upcoming steps for task {task.task_id}: {len(upcoming_steps)}")
upcoming_steps[0]

Num upcoming steps for task 4c08a022-0073-4a48-ba94-ffd1ad4d171f: 1


TaskStep(task_id='4c08a022-0073-4a48-ba94-ffd1ad4d171f', step_id='2aa255ba-ca01-49e7-951c-ea2e68e68b85', input=None, step_state={}, next_steps={}, prev_steps={}, is_ready=True)

In [30]:
# One more step
step_output = agent.run_step(
    task.task_id,
    input="What are the parameter used for evaluation?"    
)

Added user message to memory: What are the parameter used for evaluation?
=== LLM Response ===
The parameters used for evaluating the performance of RAG systems using the RAGCHECKER metric include:

1. Precision: Measures the proportion of generated responses that are correct and relevant to the input query.

2. Recall: Measures the proportion of relevant responses that are generated by the system.

3. Faithfulness: Evaluates the extent to which the generated responses are faithful to the retrieved context.

4. Context Precision: Measures the proportion of generated responses that are contextually relevant to the retrieved context.

5. Context Utilization: Assesses how effectively the system utilizes the retrieved context to generate responses.

6. Noise Sensitivity: Measures the system's sensitivity to noise in the retrieved context.

7. Hallucination: Evaluates the extent to which the system generates responses that are not supported by the retrieved context.

8. Self-Knowledge: Asse

Check whether this was the last step, then get the final answer

In [37]:
#step_output = agent.run_step(task.task_id)
print(step_output.is_last)

True


In [36]:
response = agent.finalize_response(task.task_id)
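
The manual sequence in this section (create a task, run steps, check `is_last`, finalize) can be wrapped in a small loop. The sketch below uses a scripted stand-in so it runs without an API key. `ToyAgent`, `ToyTask`, `ToyStepOutput`, and `run_to_completion` are hypothetical names; the stand-in mimics only the three methods used above and makes no claim about how the real agent behaves internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyTask:
    task_id: str


@dataclass
class ToyStepOutput:
    output: str
    is_last: bool


class ToyAgent:
    """Scripted stand-in exposing create_task, run_step and finalize_response."""

    def __init__(self, scripted_outputs: List[str]):
        self._outputs = scripted_outputs
        self._cursor = 0

    def create_task(self, prompt: str) -> ToyTask:
        self._cursor = 0
        return ToyTask(task_id="toy-task-id")

    def run_step(self, task_id: str) -> ToyStepOutput:
        text = self._outputs[self._cursor]
        self._cursor += 1
        return ToyStepOutput(output=text, is_last=self._cursor == len(self._outputs))

    def finalize_response(self, task_id: str) -> str:
        return self._outputs[self._cursor - 1]


def run_to_completion(agent, prompt: str, max_steps: int = 10) -> str:
    """Run steps until one reports is_last, then return the finalized response."""
    task = agent.create_task(prompt)
    for _ in range(max_steps):
        step_output = agent.run_step(task.task_id)
        if step_output.is_last:
            return agent.finalize_response(task.task_id)
    # The cap guards against an agent that never reports a last step.
    raise RuntimeError("agent did not finish within max_steps")


final_answer = run_to_completion(ToyAgent(["tool call", "final answer"]), "toy prompt")
```

Stepping manually, as done in the cells above, remains useful when you want to inspect intermediate output or inject new input between steps.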