Evaluating Retrieval-Augmented Generation (RAG) systems spans multiple dimensions, each probing a different aspect of system behavior: retrieval quality, generation accuracy, and overall system effectiveness. This section groups the principal metrics by evaluation target; the tables below mark (✅) which of the following inputs each metric consumes:
- Q: Question
- RC: Retrieved Context
- GTC: Ground-Truth Context
- GA: Generated Answer
- GTA: Ground-Truth Answer
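To make the notation concrete, a single evaluation example can be bundled as one record. This is a minimal sketch; the `RAGEvalRecord` name and field layout are illustrative, not taken from any particular framework:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RAGEvalRecord:
    """One evaluation example, carrying the five inputs used in the tables below."""
    question: str                                       # Q
    retrieved_contexts: List[str]                       # RC: chunks returned by the retriever
    generated_answer: str                               # GA: the system's response
    ground_truth_contexts: Optional[List[str]] = None   # GTC: gold evidence, if labeled
    ground_truth_answer: Optional[str] = None           # GTA: reference answer, if labeled
```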
Retrieval-oriented metrics assess the quality of the retrieved context:

Metric | Type | Q | RC | GTC | GA | GTA |
---|---|---|---|---|---|---|
Haystack: Diversity | Semantic Similarity | | ✅ | | | |
MultiHop-RAG: Retrieval Evaluation | IR Metrics | | ✅ | ✅ | | |
ARES: Context Relevance | LLM classifier | ✅ | ✅ | | | |
RAGAS: Context Relevance | LLM-as-a-judge | ✅ | ✅ | | | |
TruLens: Context Relevance | LLM-as-a-judge | ✅ | ✅ | | | |
RAGAS: Context Entity Recall | LLM-as-a-tool, Entity Recall | | ✅ | | | ✅ |
RAGAS: Context Precision | LLM-as-a-judge, IR Metrics | ✅ | ✅ | | | ✅ |
RAGAS: Context Recall | LLM-as-a-judge | ✅ | ✅ | | | ✅ |
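The IR metrics used by benchmarks such as MultiHop-RAG (precision@k, recall@k, MRR) compare the ranked retrieved contexts against the ground-truth contexts. A minimal sketch, assuming contexts are matched by document ID; the `retrieval_metrics` helper and the IDs are illustrative:

```python
from typing import List

def retrieval_metrics(retrieved: List[str], relevant: List[str], k: int = 5) -> dict:
    """Precision@k, Recall@k, and MRR over document IDs.

    `retrieved` is the ranked list of retrieved-context IDs (RC);
    `relevant` is the list of ground-truth context IDs (GTC).
    """
    top_k = retrieved[:k]
    hits = [doc for doc in top_k if doc in relevant]
    precision = len(hits) / k
    recall = len(hits) / len(relevant) if relevant else 0.0
    # MRR: reciprocal rank of the first relevant document in the full ranking.
    mrr = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            mrr = 1.0 / rank
            break
    return {"precision@k": precision, "recall@k": recall, "mrr": mrr}

print(retrieval_metrics(["d3", "d1", "d9"], ["d1", "d7"], k=3))
# {'precision@k': 0.333..., 'recall@k': 0.5, 'mrr': 0.5}
```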
Generation-oriented metrics assess the quality of the generated answer, either against the retrieved context (faithfulness, groundedness) or against the ground-truth answer (correctness, similarity):

Metric | Type | Q | RC | GTC | GA | GTA |
---|---|---|---|---|---|---|
RAGAS: Answer Correctness | LLM-as-a-tool, Factual Consistency | | | | ✅ | ✅ |
MultiHop-RAG: Response Evaluation | Exact Match | | | | ✅ | ✅ |
RGB: Information Integration | Exact Match | | | | ✅ | ✅ |
BLEU | Token-wise Accuracy | | | | ✅ | ✅ |
ROUGE | Token-wise Accuracy | | | | ✅ | ✅ |
CDQA: F1-recall | Token-wise Accuracy | | | | ✅ | ✅ |
BERTScore | Semantic Similarity, Token-wise Accuracy | | | | ✅ | ✅ |
LangChain: EmbeddingDistance | Semantic Similarity | | | | ✅ | ✅ |
RAGAS: Answer Semantic Similarity | Semantic Similarity | | | | ✅ | ✅ |
RECALL: Counterfactual Robustness | Exact Match | | | | ✅ | ✅ |
RGB: Counterfactual Robustness | Exact Match | | | | ✅ | ✅ |
RGB: Negative Rejection | Exact Match | | | | ✅ | ✅ |
RGB: Noise Robustness | Exact Match | | | | ✅ | ✅ |
LangChain: Accuracy | LLM-as-a-judge | ✅ | | | ✅ | ✅ |
CRUD: RAGQuestEval | Token-wise Accuracy | | | | ✅ | ✅ |
ARES: Answer Faithfulness | LLM classifier | | ✅ | | ✅ | |
RAGAS: Faithfulness | LLM-as-a-tool, Factual Consistency | | ✅ | | ✅ | |
ARES: Answer Relevance | LLM classifier | ✅ | ✅ | | ✅ | |
TruLens: Groundedness | LLM-as-a-judge | | ✅ | | ✅ | |
Databricks: Comprehensiveness | LLM-as-a-judge | ✅ | | | ✅ | |
Databricks: Correctness | LLM-as-a-judge | | | | ✅ | ✅ |
RAGAS: Answer Relevance | Semantic Similarity | ✅ | | | ✅ | |
TruLens: Answer Relevance | LLM-as-a-judge | ✅ | | | ✅ | |
LangChain: Faithfulness | LLM-as-a-judge | ✅ | ✅ | | ✅ | |
Databricks: Readability | LLM-as-a-judge | | | | ✅ | |
RAGAS: Aspect Critique | LLM-as-a-judge | | | | ✅ | |
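The token-wise accuracy family (exact match, CDQA's F1-recall) reduces to token overlap between the generated and ground-truth answers. A minimal sketch of SQuAD-style exact match and token F1; normalization rules vary across benchmarks:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, squeeze whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)   # fraction of predicted tokens that are correct
    recall = overlap / len(ref_tokens)       # fraction of reference tokens that are covered
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))   # True
print(round(token_f1("Paris, France", "Paris"), 2))      # 0.67
```

Semantic-similarity metrics (BERTScore, LangChain's EmbeddingDistance, RAGAS's Answer Semantic Similarity) compare embeddings instead of surface tokens. A sketch using cosine similarity; `embed` is a stand-in for whatever embedding model you use:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: substitute a real embedding model
    (e.g. a sentence-transformers encoder)."""
    raise NotImplementedError

def answer_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between the GA and GTA embeddings, in [-1, 1]."""
    a, b = embed(generated), embed(reference)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Most of the remaining metrics are LLM-as-a-judge: the question, retrieved context, and generated answer are interpolated into a grading prompt, and a judge model returns a verdict. A hedged sketch of the pattern; the prompt wording and the injected `call_llm` callable are illustrative, not the prompts used by ARES, RAGAS, or TruLens:

```python
JUDGE_PROMPT = """You are grading a RAG system's answer for faithfulness.
Question: {question}
Retrieved context: {context}
Generated answer: {answer}
Does every claim in the answer follow from the retrieved context?
Reply with a single word: "yes" or "no"."""

def judge_faithfulness(question: str, context: str, answer: str, call_llm) -> bool:
    """`call_llm` is any callable that sends a prompt to the judge model
    and returns its text completion."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    return call_llm(prompt).strip().lower().startswith("yes")
```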
Beyond retrieval and generation quality, system efficiency is measured directly:

Metric | Type |
---|---|
Latency | System Efficiency |
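Latency is typically measured end to end per query and reported as percentiles rather than means. A minimal sketch using Python's `time.perf_counter`; `rag_pipeline` stands in for the system under test:

```python
import time
import statistics

def measure_latency(rag_pipeline, questions):
    """Wall-clock latency per query; report median and an approximate p95."""
    samples = []
    for q in questions:
        start = time.perf_counter()
        rag_pipeline(q)                      # retrieval + generation, end to end
        samples.append(time.perf_counter() - start)
    samples.sort()
    p95 = samples[int(0.95 * (len(samples) - 1))]   # nearest-rank approximation
    return {"median_s": statistics.median(samples), "p95_s": p95}
```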
Several frameworks implement these metrics end to end, most prominently LangChain, RAGAS, and TruLens.