# Automated, Effective Evaluation of RAG Systems with LangChain and RAGAS

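This notebook focuses on one important component of an LLM-based RAG question-answering system:

- Evaluation

We build the RAG question-answering system with LangChain and evaluate it with the RAGAS framework, which is gradually becoming a standard approach for evaluating RAG systems.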

### Install the dependencies

In [1]:
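!pip install -U -q langchain langchain-community langchain-openai langchain-huggingface sentence-transformers faiss-cpu openai ragas datasets arxiv pymupdf chromadb wandb tiktoken python-dotenv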

In [1]:
import os
from dotenv import load_dotenv
# Load environment variables from a .env file
load_dotenv()


True
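
Before going further, it can help to confirm that the credentials read by later cells are actually set. The sketch below is a small helper written for illustration (the name `missing_env_vars` is hypothetical); `OPENAI_API_KEY` and `OPENAI_API_BASE` are the variables this notebook reads when it creates the LLM.

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

# Variables read later when the LLM client is created.
required = ["OPENAI_API_KEY", "OPENAI_API_BASE"]
missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```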

### Data preparation

We use an arXiv paper as the evaluation example: `ArxivLoader` loads the paper, which then serves as the context for the RAG system.

- [`ArxivLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.arxiv.ArxivLoader.html#langchain_community.document_loaders.arxiv.ArxivLoader)


In [2]:
from langchain_community.document_loaders import ArxivLoader

paper_docs = ArxivLoader(query="2309.15217", load_max_docs=1).load()
len(paper_docs)

1

In [3]:
for doc in paper_docs:
  print(doc.metadata)

{'Published': '2025-04-28', 'Title': 'Ragas: Automated Evaluation of Retrieval Augmented Generation', 'Authors': 'Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert', 'Summary': 'We introduce Ragas (Retrieval Augmented Generation Assessment), a framework\nfor reference-free evaluation of Retrieval Augmented Generation (RAG)\npipelines. RAG systems are composed of a retrieval and an LLM based generation\nmodule, and provide LLMs with knowledge from a reference textual database,\nwhich enables them to act as a natural language layer between a user and\ntextual databases, reducing the risk of hallucinations. Evaluating RAG\narchitectures is, however, challenging because there are several dimensions to\nconsider: the ability of the retrieval system to identify relevant and focused\ncontext passages, the ability of the LLM to exploit such passages in a faithful\nway, or the quality of the generation itself. With Ragas, we put forward a\nsuite of metrics which can be used to eval

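### Set up text splitting, the embedding model, and the vector store

We split the text with `RecursiveCharacterTextSplitter`, encode the chunks with `HuggingFaceEmbeddings()`, and store the vectors in a `FAISS` vector store.

- `RecursiveCharacterTextSplitter()`
- `HuggingFaceEmbeddings()`
- `FAISS`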

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500)

docs = text_splitter.split_documents(paper_docs)

embeddings = HuggingFaceEmbeddings(
    model_name = "BAAI/bge-large-zh-v1.5",
    model_kwargs = {"device": "cuda"},  # runs on the GPU; switch to "cpu" if no GPU is available
    encode_kwargs = {"normalize_embeddings": True},
    show_progress=True,
)

vectorstore = FAISS.from_documents(
    documents=docs,  # index all chunks
    embedding=embeddings
)

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

In [5]:
len(docs)

107

In [6]:
print(max([len(chunk.page_content) for chunk in docs]))

497
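
The longest chunk stays under 500 characters because `chunk_size=500` caps each one. No `chunk_overlap` is passed, so the splitter's default applies (200 characters in the LangChain versions I am aware of; treat that figure as an assumption), which means neighbouring chunks share text. The sketch below is a simplified fixed-window splitter, not the recursive separator-based algorithm LangChain uses, and `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# Toy input: 1200 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(sample)
print(len(chunks), max(len(c) for c in chunks))
```

The overlap explains why retrieved chunks can repeat parts of each other: the same sentence may fall into two consecutive windows.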


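We can now retrieve documents through the `FAISS` vector store's `.as_retriever()` method. The main parameter to control is `k`, the number of chunks returned per query.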

In [7]:
base_retriever = vectorstore.as_retriever(search_kwargs={"k" : 3})

In [8]:
relevant_docs = base_retriever.invoke("What is Retrieval Augmented Generation?")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [9]:
len(relevant_docs)

3
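
Conceptually, the retriever embeds the query and returns the `k` chunks whose vectors are closest to it. The following is a minimal pure-Python sketch of that idea using cosine similarity on toy vectors. It is not how FAISS is implemented, and `top_k` and the example vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k document vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Four toy 2-D "embeddings"; the query points along the first axis.
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.0], toy_docs, k=3))
```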

### Create the prompt for answer generation
We need the `LLM` to generate an `answer` to each question from the retrieved `Context`.


In [10]:
from langchain_core.prompts import PromptTemplate

template = """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 

Question: {question} 

Context: {context} 

Answer:
"""

prompt = PromptTemplate(
    template=template, 
    input_variables=["context","question"]
  )

print(prompt)

input_variables=['context', 'question'] input_types={} partial_variables={} template="You are an assistant for question-answering tasks. \nUse the following pieces of retrieved context to answer the question. \nIf you don't know the answer, just say that you don't know. \n\nQuestion: {question} \n\nContext: {context} \n\nAnswer:\n"


### Generate the `answer` with the LLM
We use `Runnable` components to define a `chain` that implements the full RAG pipeline.

In [15]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI


llm = ChatOpenAI(
    model="deepseek-chat",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    base_url=os.getenv("OPENAI_API_BASE"),  # use base_url rather than openai_base
    temperature=0.7,
)

rag_chain = (
    {"context": base_retriever,  "question": RunnablePassthrough()} 
    | prompt 
    | llm
    | StrOutputParser() 
)
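
In the chain above, the retriever's output (a list of `Document` objects) goes straight into the `{context}` slot, so the prompt receives their string representation, metadata included. A common alternative is to join only the chunk text first. The sketch below shows such a helper on stand-in objects; `format_docs` and `FakeDoc` are hypothetical names, and in the real chain the helper would be composed as `base_retriever | format_docs`.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document: only the fields used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join the text of retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

retrieved = [FakeDoc("RAG combines retrieval and generation."),
             FakeDoc("Ragas evaluates RAG pipelines.", {"page": 1})]
print(format_docs(retrieved))
```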



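#### Build the data RAGAs requires
RAGAs expects four columns: `question`, `answer`, `contexts`, and the ground truth, which the code below stores under the key `reference`.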

In [20]:
# Ragas dataset format used here: ['question', 'answer', 'contexts', 'reference']
'''
{
    "question": [],  <-- questions based on the context
    "answer": [],    <-- answers generated by the LLM
    "contexts": [],  <-- retrieved context chunks
    "reference": []  <-- reference (ground-truth) answers
}
'''

from datasets import Dataset

questions = ["What is faithfulness ?", 
             "How many pages are included in the WikiEval dataset, and which years do they cover information from?",
             "Why is evaluating Retrieval Augmented Generation (RAG) systems challenging?",
            ]
ground_truths = [["Faithfulness refers to the idea that the answer should be grounded in the given context."],
                 [" To construct the dataset, we first selected 50 Wikipedia pages covering events that have happened since the start of 2022."],
                ["Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself."]]
answers = []
contexts = []

references = ["\n".join(refs) for refs in ground_truths]

# Generate the answers and collect the retrieved contexts
for query in questions:
    answers.append(rag_chain.invoke(query))
    contexts.append([doc.page_content for doc in base_retriever.invoke(query)])

# Build the dataset
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "reference": references
}
dataset = Dataset.from_dict(data)


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [21]:
dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'reference'],
    num_rows: 3
})
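
The per-sample table printed at the end of this notebook uses the column names `user_input`, `response`, and `retrieved_contexts` rather than `question`, `answer`, and `contexts`. The sketch below renames the keys of a column dictionary accordingly and checks that all columns have the same length. The mapping is inferred from that output, and `to_ragas_columns` is a hypothetical helper, not a RAGAs API.

```python
# Key used in this notebook -> name shown in the evaluation output (inferred).
COLUMN_MAP = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "reference": "reference",
}

def to_ragas_columns(data):
    """Rename known keys and verify every column has the same number of rows."""
    lengths = {key: len(values) for key, values in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Columns differ in length: {lengths}")
    return {COLUMN_MAP.get(key, key): values for key, values in data.items()}

# Placeholder rows, only to show the shape of each column.
example = {
    "question": ["What is faithfulness ?"],
    "answer": ["(LLM answer)"],
    "contexts": [["(chunk 1)", "(chunk 2)"]],
    "reference": ["(reference answer)"],
}
print(list(to_ragas_columns(example)))
```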

### Evaluate with RAGAs

In [22]:
from ragas import evaluate
from ragas.metrics import ContextPrecision, ContextRecall, Faithfulness, AnswerRelevancy

result = evaluate(
    dataset=dataset,
    metrics=[
        ContextPrecision(),
        ContextRecall(),
        Faithfulness(),
        AnswerRelevancy()
    ],
    llm=llm,
    embeddings=embeddings
)

result

Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

{'context_precision': 0.7778, 'context_recall': 0.6667, 'faithfulness': 0.6467, 'answer_relevancy': 0.8307}
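
Each aggregate score above is consistent with a plain mean of the per-sample scores listed in the table further down. The sketch below recomputes the four averages from those per-sample values, copied by hand from that table.

```python
from statistics import mean

# Per-sample scores copied from the per-sample results table in this notebook.
per_sample = {
    "context_precision": [0.833333, 1.0, 0.5],
    "context_recall": [1.0, 1.0, 0.0],
    "faithfulness": [0.5, 1.0, 0.44],
    "answer_relevancy": [0.641144, 0.908723, 0.942156],
}

# Average each metric over the three questions and round to four decimals.
aggregate = {name: round(mean(scores), 4) for name, scores in per_sample.items()}
print(aggregate)
```

With only three questions, a single weak sample moves the average a lot: the third question's `context_recall` of 0.0 alone pulls that metric down to 0.6667.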

In [23]:
import pandas as pd
pd.set_option("display.max_colwidth", None)

df = result.to_pandas()
df

Unnamed: 0,user_input,retrieved_contexts,response,reference,context_precision,context_recall,faithfulness,answer_relevancy
0,What is faithfulness ?,"[Faithfulness measures the information\nconsistency of the answer against the\ngiven context. Any claims that are made\nin the answer that cannot be deduced\nfrom context should be penalized.\nGiven an answer and context, assign a\nscore for faithfulness in the range 0-10.\ncontext: [context]\nanswer: [answer]\nTies, where the same score is assigned by the LLM\nto both answer candidates, were broken randomly.\nThe second baseline, shown as GPT Ranking, in-, considered quality dimensions. For faithfulness\nand context relevance, the two annotators agreed in\naround 95% of cases. For answer relevance, they\nagreed in around 90% of the cases. Disagreements\nwere resolved after a discussion between the anno-\ntators.\nFaithfulness\nTo obtain human judgements about\nfaithfulness, we first used ChatGPT to answer the\nquestion without access to any additional context.\nWe then asked the annotators to judge which of the, Faithfulness\nTo obtain human judgements about\nfaithfulness, we first used ChatGPT to answer the\nquestion without access to any additional context.\nWe then asked the annotators to judge which of the\ntwo answers was the most faithful (i.e. the standard\none or the one generated without context), given\nthe question and corresponding Wikipedia page.\nAnswer relevance\nWe first used ChatGPT to\nobtain candidate answers with lower answer rel-\nevance, using the following prompt:]","Faithfulness, in the context of Retrieval Augmented Generation (RAG) systems, measures the information consistency of the generated answer against the provided context. It ensures that any claims made in the answer can be deduced from the given context, penalizing those that cannot. Essentially, it evaluates how accurately and reliably the answer reflects the information present in the context, without introducing unsupported or hallucinated details. 
\n\nFor example, in the Ragas framework, faithfulness is scored on a scale of 0-10, where higher scores indicate better alignment between the answer and the context. This metric helps assess the LLM's ability to use retrieved passages in a truthful and grounded manner.",Faithfulness refers to the idea that the answer should be grounded in the given context.,0.833333,1.0,0.5,0.641144
1,"How many pages are included in the WikiEval dataset, and which years do they cover information from?","[which we refer to as WikiEval4. To construct the\ndataset, we first selected 50 Wikipedia pages cov-\nering events that have happened since the start of\n20225. In selecting these pages, we prioritised\nthose with recent edits. For each of the 50 pages,\nwe then asked ChatGPT to suggest a question that\ncan be answered based on the introductory section\nof the page, using the following prompt:\nYour task is to formulate a question from\ngiven context satisfying the rules given\nbelow:, which are annotated with human judgments. We\ncan then verify to what extent our metrics agree\nwith human assessments of faithfulness, answer\nrelevance and context relevance. Since we are not\naware of any publicly available datasets that could\nbe used for this purpose, we created a new dataset,\nwhich we refer to as WikiEval4. To construct the\ndataset, we first selected 50 Wikipedia pages cov-\nering events that have happened since the start of\n20225. In selecting these pages, we prioritised, the provided context that can potentially\nhelp answer the following question. If no\nrelevant sentences are found, or if you\nbelieve the question cannot be answered\nfrom the given context, return the phrase\n""Insufficient Information"". While extract-\ning candidate sentences you’re not al-\nlowed to make any changes to sentences\nfrom given context.\nThe context relevance score is then computed as:\nCR =\nnumber of extracted sentences\ntotal number of sentences in c(q)\n(2)\n4\nThe WikiEval Dataset]","The WikiEval dataset includes **50 Wikipedia pages**, covering events that have happened since the start of **2022**. \n\nThis information is derived from the context, which states: \n- ""we first selected 50 Wikipedia pages covering events that have happened since the start of 2022."" \n- The selection prioritized pages with recent edits. 
\n\nLet me know if you need further clarification!","To construct the dataset, we first selected 50 Wikipedia pages covering events that have happened since the start of 2022.",1.0,1.0,1.0,0.908723
2,Why is evaluating Retrieval Augmented Generation (RAG) systems challenging?,"[Abstract\nWe introduce Ragas (Retrieval Augmented\nGeneration Assessment), a framework for\nreference-free evaluation of Retrieval Aug-\nmented Generation (RAG) pipelines.\nRAG\nsystems are composed of a retrieval and an\nLLM based generation module, and provide\nLLMs with knowledge from a reference textual\ndatabase, which enables them to act as a natu-\nral language layer between a user and textual\ndatabases, reducing the risk of hallucinations.\nEvaluating RAG architectures is, however, chal-, retrieval-augmented systems is thus paramount. In\npractice, RAG systems are often evaluated in terms\nof the language modelling task itself, i.e. by mea-\nsuring perplexity on some reference corpus. How-\never, such evaluations are not always predictive\nof downstream performance (Wang et al., 2023c).\nMoreover, this evaluation strategy relies on the LM\nprobabilities, which are not accessible for some\nclosed models (e.g. ChatGPT and GPT-4). Ques-\ntion answering is another common evaluation task,, Ragas: Automated Evaluation of Retrieval Augmented Generation\nShahul Es†, Jithin James†, Luis Espinosa-Anke∗♢, Steven Schockaert∗\n†Exploding Gradients\n∗CardiffNLP, Cardiff University, United Kingdom\n♢AMPLYFI, United Kingdom\nshahules786@gmail.com,jamesjithin97@gmail.com\n{espinosa-ankel,schockaerts1}@cardiff.ac.uk\nAbstract\nWe introduce Ragas (Retrieval Augmented\nGeneration Assessment), a framework for\nreference-free evaluation of Retrieval Aug-\nmented Generation (RAG) pipelines.\nRAG]","Evaluating Retrieval Augmented Generation (RAG) systems is challenging due to several key reasons highlighted in the context:\n\n1. **Multiple Dimensions to Assess**: RAG systems involve both a retrieval module and a generation module, each requiring evaluation. 
The retrieval system must identify relevant and focused context passages, while the generation module (LLM) must faithfully exploit these passages and produce high-quality outputs. Balancing these dimensions complicates evaluation.\n\n2. **Lack of Ground Truth Annotations**: Traditional evaluation methods often rely on human-annotated ground truth, which can be scarce, expensive, or impractical to obtain. This makes it difficult to validate performance objectively.\n\n3. **Limitations of Existing Metrics**: Common evaluation strategies, such as measuring perplexity on a reference corpus, are not always predictive of downstream performance. Additionally, some metrics (e.g., LM probabilities) are inaccessible for closed models like ChatGPT or GPT-4.\n\n4. **Complexity of Downstream Tasks**: While question answering is a common evaluation task, it may not fully capture the system's capabilities or failures across diverse use cases, leading to incomplete assessments.\n\nThe Ragas framework addresses these challenges by proposing a suite of reference-free metrics to evaluate RAG systems without relying on human annotations, enabling faster and more scalable evaluation cycles.","Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself.",0.5,0.0,0.44,0.942156
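
To make the `context_precision` column easier to read: as I understand the metric, it averages precision@k over the ranks k at which a retrieved chunk was judged relevant. The sketch below implements that formula. The relevance verdicts are hypothetical, chosen because they reproduce the scores 0.833333, 1.0, and 0.5 reported above for three retrieved chunks; the actual verdicts produced by the evaluator LLM are not shown in this notebook.

```python
def context_precision(verdicts):
    """Mean of precision@k over the ranks k whose chunk is relevant (verdict 1)."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(verdicts, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

# Hypothetical verdicts for k=3 retrieved chunks (1 = relevant, 0 = not relevant).
print(round(context_precision([1, 0, 1]), 6))
print(round(context_precision([1, 1, 1]), 6))
print(round(context_precision([0, 1, 0]), 6))
```

Under this reading, the metric rewards ranking relevant chunks first: one relevant chunk in second place scores 0.5, whereas the same chunk in first place would score 1.0.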
