## Key Metrics
| Metric                       | What it Measures                       | Why it Matters                           |
| ---------------------------- | -------------------------------------- | ---------------------------------------- |
| **Latency**                  | Time to retrieve and generate          | Affects user experience and scaling      |
| **Recall@k / Precision@k**   | Did the right doc(s) appear in top-k?  | Validates retriever quality              |
| **Context Relevance**        | Is the context useful for the LLM?     | Determines if the LLM has what it needs  |
| **Answer Accuracy**          | Is the final answer factually correct? | Measures end-to-end system effectiveness |



## RAG Benchmarking Checklist: What to Measure and Why

### 1. Embeddings

Are we generating useful representations of the content and the query?

- What to benchmark:
    - **Embedding coverage**: Are you embedding all the content or losing some due to truncation/errors?
    - **Semantic quality**: Do similar chunks/documents yield similar embeddings?
    - **Query–document alignment**: Are queries embedded in the same vector space as documents, so that relevant pairs land close together?
- How to test:
    - Visualize embedding space (e.g. with t-SNE or UMAP).
    - Run nearest neighbor searches on known similar chunks.
    - Use test queries and check if expected documents are close.
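
The nearest-neighbor check above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the 3-dimensional vectors and document ids are made-up stand-ins for real embeddings, and `nearest_neighbors` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def nearest_neighbors(query_vec, doc_vecs, k=3):
    """Return the ids of the k documents most similar to query_vec."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]


# Toy vectors (assumption): in practice these come from your embedding model.
doc_vecs = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.1],
    "return_process": [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

top = nearest_neighbors(query_vec, doc_vecs, k=2)
print(top)
```

If the expected documents for a test query do not show up among the nearest neighbors, that points at the embedding step rather than the retriever or the generator.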

### 2. Retriever

Are we retrieving the correct and useful content chunks?

- What to benchmark:
    - **Recall@k / Precision@k**: Are the top-k retrieved chunks relevant?
    - **Latency**: How fast is retrieval?
    - **Filtering/metadata match**: Are filters (e.g. date, source) respected?
    - **Chunk quality**: Are the retrieved chunks self-contained and understandable?
- How to test:
    - Create a small gold dataset of queries with expected docs.
    - Log top-k retrieval results for human inspection.
    - Use keyword matching or semantic similarity as automated checks.
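
As a rough sketch of how a gold dataset turns into Recall@k and Precision@k numbers, the snippet below scores logged retrieval results against expected documents. The query ids, document ids, and the `gold` / `runs` structures are illustrative assumptions; it also assumes each retrieved list contains no duplicate ids.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def evaluate(gold, runs, k):
    """Average both metrics over every query in the gold dataset."""
    precisions = [precision_at_k(runs[q], gold[q], k) for q in gold]
    recalls = [recall_at_k(runs[q], gold[q], k) for q in gold]
    return {
        "precision@k": sum(precisions) / len(precisions),
        "recall@k": sum(recalls) / len(recalls),
    }


# Hypothetical gold dataset: query id -> set of expected document ids.
gold = {"q1": {"d1", "d2"}, "q2": {"d5"}}
# Hypothetical logged results: query id -> ranked list of retrieved ids.
runs = {"q1": ["d1", "d3", "d2"], "q2": ["d4", "d6", "d7"]}

scores = evaluate(gold, runs, k=2)
print(scores)
```

Running the same evaluation at several values of k shows how much recall is gained by retrieving more chunks, which is useful when deciding how much context to pass to the LLM.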

### 3. Generator (LLM)

Does the answer contain the necessary information? Is it fluent, factual, and concise?

- What to benchmark:
    - **Answer groundedness**: Is the generated answer based on retrieved content?
    - **Factual accuracy**: Is it correct and non-hallucinated?
    - **Fluency & structure**: Is it clear and readable?
    - **Length control**: Is the output appropriately concise or detailed?
- How to test:
    - Compare answer to a reference answer (automatically or manually).
    - Use LLM-assisted QA evaluation (e.g. LangChain’s QAEvalChain).
    - Use human scoring rubrics (0–1 or 1–5 scale).
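
One simple way to compare an answer to a reference automatically is token-level F1, sketched below. This is only one possible automated check, chosen here for illustration: it rewards word overlap, so it does not detect paraphrases or verify facts. The tokenizer (lowercased `\w+` matches) is a simplifying assumption.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against a reference."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("The capital is Paris", "Paris"))
```

A low score on a long but correct answer usually signals verbosity rather than an error, so it is worth pairing this metric with human or LLM-assisted review.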

### 4. System Integration

Does the system perform consistently end-to-end?

- End-to-end latency
- Throughput (docs/sec indexed, QPS handled)
- Fallback behavior: What happens when retrieval fails?
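
End-to-end latency and throughput can be measured with a small timing harness like the sketch below. The `pipeline` argument is a hypothetical callable standing in for your full retrieve-and-generate function, and the reported QPS is for sequential, single-client calls only, so it will understate what a concurrent deployment can handle.

```python
import math
import statistics
import time


def measure_latency(pipeline, queries):
    """Time each call to pipeline(query) and summarize the results."""
    if not queries:
        raise ValueError("queries must not be empty")
    latencies = []
    start_all = time.perf_counter()
    for query in queries:
        start = time.perf_counter()
        pipeline(query)
        latencies.append(time.perf_counter() - start)
    total = time.perf_counter() - start_all

    ordered = sorted(latencies)
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": ordered[p95_index],
        # Sequential throughput: queries completed per second of wall time.
        "qps": len(queries) / total if total > 0 else float("inf"),
    }


# Stand-in pipeline (assumption): replace with the real RAG entry point.
stats = measure_latency(lambda q: q.upper(), ["a", "b", "c"])
print(stats)
```

Reporting a tail percentile alongside the median matters because a few slow retrievals or long generations can dominate the user experience even when the typical request is fast.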

## Factors That Impact the System

| Component                   | Why It Matters                                         |
| --------------------------- | ------------------------------------------------------ |
| **Pre-processing quality**  | Poor chunking/tokenization affects embedding quality   |
| **Chunk size sensitivity**  | Retrieval quality is heavily impacted by chunk size    |
| **Indexing strategy**       | Index type (HNSW, Flat, IVF) trades speed for recall   |
| **Reranking effectiveness** | Reordering top-k with a smarter model improves quality |
| **Prompt template design**  | Generation can vary wildly based on prompt structure   |
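
To probe chunk size sensitivity, one option is to re-chunk the same corpus at several sizes and rerun the retrieval metrics for each. The sketch below shows only the chunking step, using a simple word-based splitter with overlap; `chunk_words` is a hypothetical helper, and real pipelines often split on tokens or sentences instead.

```python
def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of chunk_size words, sharing overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


text = "a b c d e f g h i j"
# Sweep a few sizes; in a real benchmark each setting would be re-indexed
# and scored with the same gold queries.
for size in (4, 5):
    print(size, chunk_words(text, chunk_size=size, overlap=1))
```

Keeping the gold queries fixed while only the chunk size changes makes it possible to attribute any shift in Recall@k to chunking alone.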


## Research Frameworks & Benchmarking Tools

- [BERGEN: A Benchmarking Library for RAG](https://arxiv.org/abs/2407.01102)
Introduces an open-source library designed to standardize RAG evaluation with support for multiple retrievers, rerankers, LLMs, and metrics. Ideal for reproducible experiments.

- [RAGBench: Explainable Benchmark for RAG Systems](https://arxiv.org/abs/2407.11005)
Offers a labeled dataset (100k examples across industry-specific domains) and the TRACe framework, which provides actionable, explainable evaluation metrics for RAG pipelines.

- [RAGAS: Automated Evaluation of RAG](https://arxiv.org/abs/2309.15217)
Proposes a framework for reference-free evaluation of RAG, aiming to assess retrieval and generation quality without requiring ground-truth annotations.

## Domain-Specific Use Cases

- [MIRAGE: RAG Benchmarking for Medicine](https://aclanthology.org/2024.findings-acl.372/)
A healthcare-focused benchmark with over 7,600 questions, evaluating various combinations of corpora, retrievers, and LLMs for RAG in medical settings.

## Practical Guides & Best Practices

- "RAG Evaluation Metrics: Assessing Answer Relevancy, Faithfulness, and More" [(Confident AI blog)](https://www.confident-ai.com/blog/rag-evaluation-metrics-answer-relevancy-faithfulness-and-more): Breaks down core metrics such as contextual recall/precision, answer relevancy and faithfulness, and introduces the DeepEval open-source toolkit for streamlined implementation.

- "Evaluating RAG Pipelines" [(Neptune AI Blog)](https://neptune.ai/blog/evaluating-rag-pipelines)

Stresses multi-dimensional evaluation: retriever-based metrics (Recall@k, Precision@k, MRR), generator-based metrics (citation precision/recall, token-level F1), plus cost and latency considerations.

- "Best Practices for RAG Evaluation" [(Patronus AI)](https://www.patronus.ai/llm-testing/rag-evaluation-metrics)

Identifies key metrics like context relevance/sufficiency and answer relevance/correctness, and emphasizes CI integration, continuous monitoring, and evolving gold standards.

## Advanced Benchmarking Platforms

- [BenchmarkQED (Microsoft Research)](https://www.microsoft.com/en-us/research/blog/benchmarkqed-automated-benchmarking-of-rag-systems/)

Introduces an end-to-end, automated benchmarking suite for RAG (including query generation, evaluation, and dataset prep), particularly aligned with GraphRAG-style retrieval.

- [EvidentlyAI: 7 RAG Benchmarks](https://www.evidentlyai.com/blog/rag-benchmarks)

Summarizes several RAG benchmarks, including the Needle-in-a-Haystack test, that help measure retrieval accuracy under challenging scenarios.