# Evaluate RAG with LlamaIndex Locally

> 注意事项：
> 1. 我用 LM Studio 模拟 OpenAPI API 调用。
> 2. 本次使用的 LLM 模型为 mistral-7b-instruct-v0.2.Q6_K。

下面，我将探讨如何构建一个检索增强生成（Retrieval-Augmented Generation, RAG）的流程，并利用 LlamaIndex 对该流程进行评估。文档内容涵盖以下三大部分：

1. 理解检索增强生成（RAG）。
2. 利用 LlamaIndex 构建 RAG 流程。
3. 利用 LlamaIndex 对 RAG 进行评估。

## **检索增强生成 (Retrieval-Augmented Generation, RAG)**

大语言模型 (Large Language Models, LLM) 在庞大的数据集上训练，这些数据集往往不包含您个人的具体数据。检索增强生成技术 (RAG) 通过在生成过程中动态地结合用户数据，来弥补这一缺陷。重点在于，不是修改大语言模型的训练数据集，而是让模型能够实时接入并利用这些用户数据，从而提供更加定制化且与上下文相关的回答。

在 RAG 系统中，首先要做的是加载用户数据并为查询“建立索引”。当用户发起查询时，系统会在索引中过滤，找出与查询最相关的上下文。接着，这些相关的上下文和用户问题将一起提交给大语言模型，模型便据此提供相应的答案。

无论您打算构建的是聊天机器人还是自动应答代理，都需要掌握 RAG 技术，以便能够有效地将数据整合到您的应用中。

![RAG Overview](./images/llamaindex_rag_overview.png)

## **RAG 的关键阶段** 

RAG 包括五个关键阶段，这些都是构建任何大型应用程序不可或缺的一部分： 
1. **Loading加载：** 指的是将数据从其所在位置 — 如文本文件、PDF、其他网站、数据库或 API — 导入到您的处理流程中。LlamaHub 提供了数百种连接器，供您选择。 
2. **Indexing索引：** 创建一个支持数据查询的结构。对于大语言模型而言，这通常涉及创建向量嵌入（即数据的数值化语义表示），以及其他多种元数据策略，以简化并提高寻找相关上下文数据的准确性。 
3. **Storing存储：** 数据索引后，会希望存储该索引以及所有相关元数据，以省去重复索引的步骤。 
4. **Querying查询：** 在任何已确定的索引策略下，您都可以利用大语言模型和 LlamaIndex 数据结构来执行多种查询方式，包括子查询、多步查询和混合查询策略。 
5. **Evaluation评估：** 评估是流程中一个至关重要的步骤，它用于检查流程相较于其他策略的有效性或进行调整时的表现如何。评估为查询回应的准确性、一致性和速度提供了客观的量度标准。

# 代码实践：构建一个简单RAG系统，并且对其质量评估

In [None]:
%pip install llama-index

In [None]:
# `nest_asyncio` 模块允许异步函数在一个已经启动的异步循环内部进行嵌套执行。
# 这样做的必要性在于，Jupyter Notebook 这一工具天生就是在一个异步循环的环境下操作的。
# 通过使用 `nest_asyncio`，我们能够顺利地在这一现有的异步循环中添加并运行更多的异步函数，而不会引起冲突。
import nest_asyncio

nest_asyncio.apply()

from llama_index.evaluation import generate_question_context_pairs
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.evaluation import generate_question_context_pairs
from llama_index.evaluation import RetrieverEvaluator
from llama_index.llms import OpenAI
from llama_index.embeddings import resolve_embed_model

import os
import pandas as pd

#### 加载本地数据并构建索引

In [117]:
# load data from data directory
documents = SimpleDirectoryReader("data").load_data()

# bge-m3 embedding model
# https://huggingface.co/BAAI/bge-base-en-v1.5/tree/main
embed_model = resolve_embed_model("local:BAAI/bge-base-en-v1.5")

# Load LM Studio LLM model
llm = OpenAI(api_base="http://localhost:1234/v1", api_key="not-needed")

# Index the data
service_context = ServiceContext.from_defaults(
    embed_model=embed_model, llm=llm,
)

# Transform data to Nodes struct
node_parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=128)
nodes = node_parser.get_nodes_from_documents(documents)

# vetorize
vector_index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)



构建向量查询engine，并准备执行向量查询

In [118]:
query_engine = vector_index.as_query_engine()

In [122]:
response_vector = query_engine.query("What did the author do growing up?")

检查输出结果

In [123]:
response_vector.response

"The author grew up in New Hampshire and spent most of his time reading science fiction and painting. He attended a boarding school in Massachusetts for high school, where he continued to paint and read. After graduating from college with a degree in computer science, he worked at various software companies before starting his own company, Viaweb, which was sold to Yahoo in 1996. He then moved to California and tried to focus on painting, but found it difficult due to lack of energy and motivation. He eventually returned to New York and resumed painting, this time with more success.\n### Explanation:\nThe author's childhood was marked by a love for reading science fiction and painting. He attended a boarding school in Massachusetts, where he continued to pursue these interests. After college, he worked in the software industry before starting his own company, Viaweb, which was sold to Yahoo in 1996. Following the sale of Viaweb, the author moved to California with the intention of focu

默认情况下，系统会检索出两个与查询内容相似的节点或数据块。你可以通过修改 `vector_index.as_query_engine(similarity_top_k=k)` 函数中的参数来调整检索相似节点的数量。

我们接下来查看一下这些被检索出的节点中包含的文本内容。

In [124]:
# First retrieved node
response_vector.source_nodes[0].get_text()

'What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type programs on punch cards, then stack

In [125]:
# Second retrieved node
response_vector.source_nodes[1].get_text()

"Nor had I changed my grad student lifestyle significantly since we started. So when Yahoo bought us it felt like going from rags to riches. Since we were going to California, I bought a car, a yellow 1998 VW GTI. I remember thinking that its leather seats alone were by far the most luxurious thing I owned.\n\nThe next year, from the summer of 1998 to the summer of 1999, must have been the least productive of my life. I didn't realize it at the time, but I was worn out from the effort and stress of running Viaweb. For a while after I got to California I tried to continue my usual m.o. of programming till 3 in the morning, but fatigue combined with Yahoo's prematurely aged culture and grim cube farm in Santa Clara gradually dragged me down. After a few months it felt disconcertingly like working at Interleaf.\n\nYahoo had given us a lot of options when they bought us. At the time I thought Yahoo was so overvalued that they'd never be worth anything, but to my astonishment the stock went

我们已经搭建了一个简单的 RAG 系统，并且现在我们需要对它的性能进行评价。可以通过运用 LlamaIndex 提供的核心评估模块对这个 RAG 系统或查询引擎进行评估。下面，我们探究如何使用这些工具定量衡量检索增强生成系统的质量。

## Evaluation 评估

评价工作应当成为衡量 RAG 应用表现的重要指标。这关乎系统针对不同数据源和多样的查询是否能够给出准确答案。

起初，单独审查每一个查询和相应的响应有助于系统调优，但随着特殊情况和故障数量的增加，这种方式可能行不通。相比之下，建立一整套综合性评价指标或者自动化评估系统则更为高效。这类工具能够洞察系统整体性能并识别哪些领域需要进一步关注。

RAG 系统的评估主要聚焦于两个核心方面：

*   **检索评估：** 这是对系统检索出的信息的准确性与相关性进行评价的过程。
*   **响应评估：** 这是基于检索结果对系统生成回答的质量和恰当性进行测量的过程。

### 生成 “问题-上下文” 对：

在评价 RAG 系统时，关键在于有能力提出既能获取正确上下文，又能相应生成适当回答的问题。`LlamaIndex` 提供了一个 `generate_question_context_pairs` 模块，这个模块专门设计用来构建评价 RAG 系统的问题和上下文对，涵盖了检索评估和响应评估两大方面。如需了解更多关于问题生成的信息，请查阅[文档](https://docs.llamaindex.ai/en/stable/examples/evaluation/QuestionGeneration.html)。

In [126]:
qa_dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    num_questions_per_chunk=20
)

100%|██████████| 46/46 [03:53<00:00,  5.08s/it]


### 检索评估：

我们已经做好了开展检索评估的准备。我们将利用已生成的评估数据集来运行 `RetrieverEvaluator`。

首先，我们需要建立一个 `Retriever` 实例，随后定义两个函数：`get_eval_results` 负责在数据集上执行检索操作，`display_results` 用于展现评估结果。

In [127]:
retriever = vector_index.as_retriever(similarity_top_k=2)

### 定义检索评估器：

我们采用 **命中率 (Hit Rate)** 和 **平均倒数排名 (Mean Reciprocal Rank, MRR)** 这两项指标来对检索器进行评估。

**命中率：**

命中率衡量的是正确答案出现在检索结果前k个文档中的比例。换句话说，就是我们的系统在最开始的几次猜测中得到正确结果的频次。

**平均倒数排名（MRR）：**

MRR 通过分析最相关文档在检索结果里的排名来计算每个查询的准确性。更具体地说，它是所有查询的相关文档排名倒数的平均值。例如，若最相关的文档排在第一位，其倒数排名为 1；排在第二位时，为 1/2；以此类推。

我们来通过这些指标来了解我们的检索器的表现。

In [128]:
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

In [None]:
# Evaluate
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

定义一个函数，它可以将检索评估的结果以表格的形式展示出来。

In [131]:
def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df

In [None]:
display_results("bge-m3 Embedding Retriever", eval_results)

#### 结果分析：

结果显示 HitRate 比 MRR 的数值更大。
MRR 的表现不如命中率意味着排名靠前的结果并不总是最匹配的。为了提升 MRR，可能需要引入重新排序器（rerankers），这些工具用于优化检索到的文档顺序。若想深入理解重新排序器如何精细调优检索指标，请参考我们在[博客文章](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83)中的全面讲解。

## 质量评估工具：

1. 忠实度评估器（FaithfulnessEvaluator）：这个工具用来衡量查询引擎的响应是否与其它已知的信息源相符合，能有效判断响应中是否包含了凭空捏造的内容。
2. 相关度评估器（Relevancy Evaluator）：该评估器主要测量查询结果及其关联信息是否与用户的查询要求相匹配。

In [134]:
# Get the list of queries from the above created dataset
queries = list(qa_dataset.queries.values())

### Faithfulness Evaluator

我们先来看 Faithfulness 评估器：

In [135]:
vector_index = VectorStoreIndex(nodes, service_context = service_context)
query_engine = vector_index.as_query_engine()

创建一个 FaithfulnessEvaluator 实例

In [136]:
from llama_index.evaluation import FaithfulnessEvaluator
faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context)

任意评估一个问题是否相关

In [148]:
eval_query = queries[3]
eval_query

'Where did the writer have permission to use the IBM 1401 computer system?'

评估实例：

In [149]:
response_vector = query_engine.query(eval_query)

In [150]:
# Compute faithfulness evaluation
eval_result = faithfulness_evaluator.evaluate_response(response=response_vector)

In [151]:
# You can check passing parameter in eval_result if it passed the evaluation.
eval_result.passing

True

### Relevancy 相关度评估器

相关度评估器（Relevancy Evaluator）非常适用于判断响应内容和提供的信息源（检索到的背景资料）是否对查询进行了准确的匹配。此工具能够帮助我们确认响应内容是否确实解答了用户的问题。

In [143]:
from llama_index.evaluation import RelevancyEvaluator

relevancy_evaluator = RelevancyEvaluator(service_context=service_context)

任意选择一个问题进行评估

In [144]:
# Pick a query
query = queries[3]
query

'Where did the writer have permission to use the IBM 1401 computer system?'

In [145]:
# Generate response.
# response_vector has response and source nodes (retrieved context)
response_vector = query_engine.query(query)

# Relevancy evaluation
eval_result = relevancy_evaluator.evaluate_response(
    query=query, response=response_vector
)

In [146]:
# You can check passing parameter in eval_result if it passed the evaluation.
eval_result.passing

True

In [147]:
# You can get the feedback for the evaluation.
eval_result.feedback

'Yes. The context states that the writer and his friend Rich Draves had permission to use the IBM 1401 computer system in the basement of their junior high school. The response is consistent with this information.'

### Batch Evaluator 批次评估器：

在我们独立完成了忠实度和相关度的评估之后，LlamaIndex 提供了 `BatchEvalRunner` 工具，可以批次地进行多个评估的计算。
比如同时进行忠实度和相关度的评估：

In [152]:
from llama_index.evaluation import BatchEvalRunner

# Let's pick top 10 queries to do evaluation
batch_eval_queries = queries[:10]

# Initiate BatchEvalRunner to compute FaithFulness and Relevancy Evaluation.
runner = BatchEvalRunner(
    {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
    workers=8,
)

# Compute evaluation
eval_results = await runner.aevaluate_queries(
    query_engine, queries=batch_eval_queries
)

In [153]:
# Let's get faithfulness score
faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])
faithfulness_score

0.8

In [154]:
# Let's get relevancy score
relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])
relevancy_score

0.8

#### 结果分析
- 忠实度得分为 `0.8`，这意味着生成的答案存在不实之处，还可以继续优化。
- 相关度得分为 `0.8`，则表明生成的答案与检索到的背景信息和问题并非总是紧密相关。

## 总结

在上述研究中，我们研究了如何利用 LlamaIndex 构建和评估一个 RAG（检索增强型生成模型）流程，并特别关注如何对流程中的检索系统和生成的响应进行评价。

此外，LlamaIndex 还提供了许多其他的评价工具，你可以通过[此链接](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/root.html)了解更多相关细节和进阶使用方法。