Evaluating Retrieval-Augmented Generation (RAG) systems spans multiple dimensions, each probing a different aspect of system behavior: retrieval quality, generation accuracy, and overall system effectiveness. This section groups the principal metrics by evaluation target; the tables below mark (✅) which of the following inputs each metric consumes:
- Q: Question
- RC: Retrieved Context
- GTC: Ground-Truth Context
- GA: Generated Answer
- GTA: Ground-Truth Answer
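To make the notation concrete, a single evaluation example can be bundled as one record. This is a minimal sketch; the `RAGEvalRecord` name and field layout are illustrative, not taken from any particular framework:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RAGEvalRecord:
    """One evaluation example, carrying the five inputs used in the tables below."""
    question: str                                       # Q
    retrieved_contexts: List[str]                       # RC: chunks returned by the retriever
    generated_answer: str                               # GA: the system's response
    ground_truth_contexts: Optional[List[str]] = None   # GTC: gold evidence, if labeled
    ground_truth_answer: Optional[str] = None           # GTA: reference answer, if labeled
```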
Retrieval-oriented metrics assess the quality of the retrieved context:

Metric | Type | Q | RC | GTC | GA | GTA |
---|---|---|---|---|---|---|
Haystack: Diversity | Semantic Similarity | | ✅ | | | |
MultiHop-RAG: Retrieval Evaluation | IR Metrics | | ✅ | ✅ | | |
ARES: Context Relevance | LLM classifier | ✅ | ✅ | | | |
RAGAS: Context Relevance | LLM-as-a-judge | ✅ | ✅ | | | |
TruLens: Context Relevance | LLM-as-a-judge | ✅ | ✅ | | | |
RAGAS: Context Entity Recall | LLM-as-a-tool, Entity Recall | | ✅ | | | ✅ |
RAGAS: Context Precision | LLM-as-a-judge, IR Metrics | ✅ | ✅ | | | ✅ |
RAGAS: Context Recall | LLM-as-a-judge | ✅ | ✅ | | | ✅ |
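The IR metrics used by benchmarks such as MultiHop-RAG (precision@k, recall@k, MRR) compare the ranked retrieved contexts against the ground-truth contexts. A minimal sketch, assuming contexts are matched by document ID; the `retrieval_metrics` helper and the IDs are illustrative:

```python
from typing import List

def retrieval_metrics(retrieved: List[str], relevant: List[str], k: int = 5) -> dict:
    """Precision@k, Recall@k, and MRR over document IDs.

    `retrieved` is the ranked list of retrieved-context IDs (RC);
    `relevant` is the list of ground-truth context IDs (GTC).
    """
    top_k = retrieved[:k]
    hits = [doc for doc in top_k if doc in relevant]
    precision = len(hits) / k
    recall = len(hits) / len(relevant) if relevant else 0.0
    # MRR: reciprocal rank of the first relevant document in the full ranking.
    mrr = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            mrr = 1.0 / rank
            break
    return {"precision@k": precision, "recall@k": recall, "mrr": mrr}

print(retrieval_metrics(["d3", "d1", "d9"], ["d1", "d7"], k=3))
# {'precision@k': 0.333..., 'recall@k': 0.5, 'mrr': 0.5}
```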
Generation-oriented metrics assess the quality of the generated answer, either against the retrieved context (faithfulness, groundedness) or against the ground-truth answer (correctness, similarity):

Metric | Type | Q | RC | GTC | GA | GTA |
---|---|---|---|---|---|---|
RAGAS: Answer Correctness | LLM-as-a-tool, Factual Consistency | | | | ✅ | ✅ |
MultiHop-RAG: Response Evaluation | Exact Match | | | | ✅ | ✅ |
RGB: Information Integration | Exact Match | | | | ✅ | ✅ |
BLEU | Token-wise Accuracy | | | | ✅ | ✅ |
ROUGE | Token-wise Accuracy | | | | ✅ | ✅ |
CDQA: F1-recall | Token-wise Accuracy | | | | ✅ | ✅ |
BERTScore | Semantic Similarity, Token-wise Accuracy | | | | ✅ | ✅ |
LangChain: EmbeddingDistance | Semantic Similarity | | | | ✅ | ✅ |
RAGAS: Answer Semantic Similarity | Semantic Similarity | | | | ✅ | ✅ |
RECALL: Counterfactual Robustness | Exact Match | | | | ✅ | ✅ |
RGB: Counterfactual Robustness | Exact Match | | | | ✅ | ✅ |
RGB: Negative Rejection | Exact Match | | | | ✅ | ✅ |
RGB: Noise Robustness | Exact Match | | | | ✅ | ✅ |
LangChain: Accuracy | LLM-as-a-judge | ✅ | | | ✅ | ✅ |
CRUD: RAGQuestEval | Token-wise Accuracy | | | | ✅ | ✅ |
ARES: Answer Faithfulness | LLM classifier | | ✅ | | ✅ | |
RAGAS: Faithfulness | LLM-as-a-tool, Factual Consistency | | ✅ | | ✅ | |
ARES: Answer Relevance | LLM classifier | ✅ | ✅ | | ✅ | |
TruLens: Groundedness | LLM-as-a-judge | | ✅ | | ✅ | |
Databricks: Comprehensiveness | LLM-as-a-judge | ✅ | | | ✅ | |
Databricks: Correctness | LLM-as-a-judge | | | | ✅ | ✅ |
RAGAS: Answer Relevance | Semantic Similarity | ✅ | | | ✅ | |
TruLens: Answer Relevance | LLM-as-a-judge | ✅ | | | ✅ | |
LangChain: Faithfulness | LLM-as-a-judge | ✅ | ✅ | | ✅ | |
Databricks: Readability | LLM-as-a-judge | | | | ✅ | |
RAGAS: Aspect Critique | LLM-as-a-judge | | | | ✅ | |
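The token-wise accuracy family (exact match, CDQA's F1-recall) reduces to token overlap between the generated and ground-truth answers. A minimal sketch of SQuAD-style exact match and token F1; normalization rules vary across benchmarks:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, squeeze whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)   # fraction of predicted tokens that are correct
    recall = overlap / len(ref_tokens)       # fraction of reference tokens that are covered
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))   # True
print(round(token_f1("Paris, France", "Paris"), 2))      # 0.67
```

Semantic-similarity metrics (BERTScore, LangChain's EmbeddingDistance, RAGAS's Answer Semantic Similarity) compare embeddings instead of surface tokens. A sketch using cosine similarity; `embed` is a stand-in for whatever embedding model you use:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: substitute a real embedding model
    (e.g. a sentence-transformers encoder)."""
    raise NotImplementedError

def answer_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between the GA and GTA embeddings, in [-1, 1]."""
    a, b = embed(generated), embed(reference)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Most of the remaining metrics are LLM-as-a-judge: the question, retrieved context, and generated answer are interpolated into a grading prompt, and a judge model returns a verdict. A hedged sketch of the pattern; the prompt wording and the injected `call_llm` callable are illustrative, not the prompts used by ARES, RAGAS, or TruLens:

```python
JUDGE_PROMPT = """You are grading a RAG system's answer for faithfulness.
Question: {question}
Retrieved context: {context}
Generated answer: {answer}
Does every claim in the answer follow from the retrieved context?
Reply with a single word: "yes" or "no"."""

def judge_faithfulness(question: str, context: str, answer: str, call_llm) -> bool:
    """`call_llm` is any callable that sends a prompt to the judge model
    and returns its text completion."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    return call_llm(prompt).strip().lower().startswith("yes")
```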
Beyond retrieval and generation quality, system efficiency is measured directly:

Metric | Type |
---|---|
Latency | System Efficiency |
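Latency is typically measured end to end per query and reported as percentiles rather than means. A minimal sketch using Python's `time.perf_counter`; `rag_pipeline` stands in for the system under test:

```python
import time
import statistics

def measure_latency(rag_pipeline, questions):
    """Wall-clock latency per query; report median and an approximate p95."""
    samples = []
    for q in questions:
        start = time.perf_counter()
        rag_pipeline(q)                      # retrieval + generation, end to end
        samples.append(time.perf_counter() - start)
    samples.sort()
    p95 = samples[int(0.95 * (len(samples) - 1))]   # nearest-rank approximation
    return {"median_s": statistics.median(samples), "p95_s": p95}
```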
Several frameworks implement these metrics end to end, most prominently LangChain, RAGAS, and TruLens.