RAG System Evaluation

Evaluating a Retrieval-Augmented Generation (RAG) system involves several dimensions, each covering a different aspect of performance. This section lists evaluation metrics grouped by dimension: retrieval quality, generation accuracy, and overall system efficiency.

Legend

  • Q: Question
  • RC: Retrieved Context
  • GTC: Ground-Truth Context
  • GA: Generated Answer
  • GTA: Ground-Truth Answer

Retrieval

| Metric | Type | Q | RC | GTC | GA | GTA |
|---|---|---|---|---|---|---|
| Haystack: Diversity | Semantic Similarity | | | | | |
| MultiHop-RAG: Retrieval Evaluation | IR Metrics | | | | | |
| ARES: Context Relevance | LLM classifier | | | | | |
| RAGAS: Context Relevance | LLM-as-a-judge | | | | | |
| TruLens: Context Relevance | LLM-as-a-judge | | | | | |
| RAGAS: Context Entity Recall | LLM-as-a-tool, Entity Recall | | | | | |
| RAGAS: Context Precision | LLM-as-a-judge, IR Metrics | | | | | |
| RAGAS: Context Recall | LLM-as-a-judge | | | | | |
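
For the rows typed "IR Metrics", a small sketch may help show what such a score looks like. It assumes the retrieved context (RC) and ground-truth context (GTC) can be represented as chunk IDs. The function names and the IDs are hypothetical, and this is not the implementation used by MultiHop-RAG or RAGAS.

```python
from typing import Sequence, Set


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return 1.0 if any of the top-k retrieved IDs is relevant, else 0.0."""
    return float(any(chunk_id in relevant for chunk_id in retrieved[:k]))


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots that hold a relevant chunk."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(chunk_id in relevant for chunk_id in retrieved[:k]) / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical IDs: RC is the ranked retriever output, GTC the labelled evidence.
rc = ["c7", "c2", "c9", "c4"]
gtc = {"c2", "c4"}

print(hit_at_k(rc, gtc, k=1))        # 0.0
print(precision_at_k(rc, gtc, k=4))  # 0.5
print(reciprocal_rank(rc, gtc))      # 0.5
```

Averaging `reciprocal_rank` over a set of questions gives the mean reciprocal rank (MRR).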

Generation

| Metric | Type | Q | RC | GTC | GA | GTA |
|---|---|---|---|---|---|---|
| RAGAS: Answer Correctness | LLM-as-a-tool, Factual Consistency | | | | | |
| MultiHop-RAG: Response Evaluation | Exact Match | | | | | |
| RGB: Information Integration | Exact Match | | | | | |
| BLEU | Token-wise Accuracy | | | | | |
| ROUGE | Token-wise Accuracy | | | | | |
| CDQA: F1-recall | Token-wise Accuracy | | | | | |
| BERTScore | Semantic Similarity, Token-wise Accuracy | | | | | |
| LangChain: EmbeddingDistance | Semantic Similarity | | | | | |
| RAGAS: Answer Semantic Similarity | Semantic Similarity | | | | | |
| RECALL: Counterfactual Robustness | Exact Match | | | | | |
| RGB: Counterfactual Robustness | Exact Match | | | | | |
| RGB: Negative Rejection | Exact Match | | | | | |
| RGB: Noise Robustness | Exact Match | | | | | |
| LangChain: Accuracy | LLM-as-a-judge | | | | | |
| CRUD: RAGQuestEval | Token-wise Accuracy | | | | | |
| ARES: Answer Faithfulness | LLM classifier | | | | | |
| RAGAS: Faithfulness | LLM-as-a-tool, Factual Consistency | | | | | |
| ARES: Answer Relevance | LLM classifier | | | | | |
| TruLens: Groundedness | LLM-as-a-judge | | | | | |
| Databricks: Comprehensiveness | LLM-as-a-judge | | | | | |
| Databricks: Correctness | LLM-as-a-judge | | | | | |
| RAGAS: Answer Relevance | Semantic Similarity | | | | | |
| TruLens: Answer Relevance | LLM-as-a-judge | | | | | |
| LangChain: Faithfulness | LLM-as-a-judge | | | | | |
| Databricks: Readability | LLM-as-a-judge | | | | | |
| RAGAS: Aspect Critique | LLM-as-a-judge | | | | | |
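
To make the "Exact Match" and "Token-wise Accuracy" types more concrete, here is a minimal sketch that compares a generated answer (GA) with a ground-truth answer (GTA). The normalization (lowercasing, dropping punctuation and English articles) follows a common question-answering convention and is an assumption here. Each benchmark in the table defines its own variant, so treat the helper names below as hypothetical.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(generated: str, reference: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize(generated) == normalize(reference)


def token_f1(generated: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between GA and GTA."""
    gen, ref = normalize(generated), normalize(reference)
    if not gen or not ref:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(gen == ref)
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative strings only, not taken from any benchmark.
print(exact_match("The Eiffel Tower.", "eiffel tower"))  # True
print(round(token_f1("Paris, France", "Paris"), 3))      # 0.667
```

Exact match is strict and rewards only verbatim agreement, while token-level F1 gives partial credit, which is why token-overlap scores are often reported alongside it.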

Others

| Metric | Type |
|---|---|
| Latency | System Efficiency |
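
Latency can be estimated by timing end-to-end calls to the system. The sketch below assumes the RAG pipeline is exposed as a Python callable that takes a question and returns an answer. The `pipeline` argument and the stand-in function are hypothetical placeholders, not part of any tool listed here.

```python
import math
import statistics
import time
from typing import Callable, Dict, Sequence


def measure_latency(
    pipeline: Callable[[str], str], questions: Sequence[str]
) -> Dict[str, float]:
    """Time one call per question and summarize wall-clock latency in seconds."""
    if not questions:
        raise ValueError("need at least one question")
    durations = []
    for question in questions:
        start = time.perf_counter()
        pipeline(question)
        durations.append(time.perf_counter() - start)
    durations.sort()
    # Nearest-rank 95th percentile over the sorted durations.
    p95_index = min(len(durations) - 1, math.ceil(0.95 * len(durations)) - 1)
    return {
        "mean_s": statistics.fmean(durations),
        "p95_s": durations[p95_index],
        "max_s": durations[-1],
    }


def fake_pipeline(question: str) -> str:
    """Stand-in for a real RAG system (hypothetical)."""
    return question.upper()


print(measure_latency(fake_pipeline, ["What is RAG?", "Who wrote it?"]))
```

Reporting a tail percentile next to the mean is a design choice: a few slow retrieval or generation calls can dominate user-perceived latency while barely moving the average.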

Tools

  • LangChain
  • RAGAS
  • TruLens