# **Word based (Statistical)**
--------------------------------------------------
**BLEU** - how many word sequences (n-grams) in the generated output match the ground truth. Used for translation and generation accuracy. Focuses on precision (the share of output n-grams that are correct).
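
A minimal sketch of the precision idea behind BLEU, assuming whitespace-tokenized text and a single reference. It computes only the clipped (modified) n-gram precision; full BLEU also combines several n-gram orders and applies a brevity penalty. The function names are illustrative, not taken from any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited
    at most as many times as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
p1 = modified_precision(candidate, reference, 1)  # 5 of 6 unigrams match
p2 = modified_precision(candidate, reference, 2)  # 3 of 5 bigrams match
print(p1, p2)
```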

**ROUGE** - overlap of n-grams between the generated and reference summaries. Focuses on recall (how much of the reference is captured).
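
A rough sketch of ROUGE-N recall, assuming whitespace tokens and one reference summary. Other ROUGE variants (ROUGE-L, F-measure reporting, stemming) are left out, and the function name is illustrative.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngram_counts(candidate)
    ref_counts = ngram_counts(reference)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total

reference = "the quick brown fox jumps".split()
candidate = "the brown fox".split()
r1 = rouge_n_recall(candidate, reference, n=1)  # 3 of 5 reference unigrams covered
r2 = rouge_n_recall(candidate, reference, n=2)  # 1 of 4 reference bigrams covered
print(r1, r2)
```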

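**METEOR** - harmonic mean of unigram precision and recall (recall weighted more heavily), with a penalty for word-order differences between the generated and expected output. Also matches stems and synonyms, not only exact words. Focuses on balancing precision and recall.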
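
A sketch of how the harmonic mean and the word-order penalty combine, assuming the constants of the original METEOR formulation (recall weighted 9:1, penalty of 0.5 times the cubed chunk-to-match ratio). The word alignment itself is assumed to be done already: `matches` and `chunks` are hypothetical inputs, where a chunk is a run of matched words that is contiguous and in the same order in both texts.

```python
def meteor_score(matches, candidate_len, reference_len, chunks):
    """Combine unigram precision/recall with a fragmentation penalty.

    matches: number of aligned unigrams (assumed precomputed)
    chunks:  number of contiguous, same-order runs among the matches
    """
    if matches == 0:
        return 0.0
    precision = matches / candidate_len
    recall = matches / reference_len
    # Harmonic mean that weights recall nine times more than precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # More chunks means more reordering, hence a larger penalty.
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# "the cat sat on the mat" vs. "on the mat sat the cat":
# all 6 words match, but in 3 chunks ("the cat", "sat", "on the mat").
print(meteor_score(matches=6, candidate_len=6, reference_len=6, chunks=3))
```

With the same matches in a single chunk (identical word order), the penalty shrinks and the score rises toward 1.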

**Perplexity** - how well a model predicts the next word in a sequence. Lower is better. Focuses on model confidence in output likelihood.
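
As a small illustration, perplexity can be computed as the exponential of the average negative log-probability the model assigned to each actual next token. The probabilities below are made-up inputs; in practice they would come from a language model.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that gives every correct token probability 0.25 is, on average,
# as uncertain as a uniform choice among 4 options (perplexity about 4).
uniform_like = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95])
print(uniform_like, confident)
```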

# **Character based (Statistical)**
----------------------------------------------------
**Edit distance** - minimum number of single-character edits (insertions, deletions, substitutions) needed to change one string into another.
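
One common form of edit distance is the Levenshtein distance. A minimal dynamic-programming sketch, assuming unit cost for each insertion, deletion, and substitution:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```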

# **Embedding based (Statistical + LLM based)**
-----------------------------------------------

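**BERTScore** - semantic evaluation using contextual embeddings rather than exact matches. Each token is matched to its most similar token in the other text by cosine similarity.
Recall measures how well the reference’s tokens are covered by the candidate; precision measures the reverse, and F1 combines the two.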
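
The matching step can be sketched with toy vectors. This assumes token embeddings are already available as plain lists of floats (real BERTScore takes them from a BERT-style model and can add IDF weighting and baseline rescaling, both omitted here). The function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(cand_emb, ref_emb):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy 2-d "embeddings": the candidate covers only one of two reference tokens.
ref_emb = [[1.0, 0.0], [0.0, 1.0]]
cand_emb = [[1.0, 0.0]]
print(greedy_match_scores(cand_emb, ref_emb))
```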

# **NLP models (LLM as judge)**
---------------------------------------------------
**NLI** - use an NLI classification model to classify whether an LLM output is logically consistent with (entailment), contradictory to (contradiction), or unrelated to (neutral) a given reference text.
The score typically ranges from 0 (contradiction) to 1 (entailment).

Example dataset - *MultiNLI (Multi-Genre NLI): Expands SNLI to multiple genres and registers, supporting better generalization across domains.*

**Answer relevance (pairwise comparison)** - used for user-facing applications (chatbots).

Example datasets - *MS MARCO: widely used for training LLMs to judge answer relevance and rank candidate answers; Natural Questions: used for evaluating context relevance in retrieval-based chatbots*

**Fluency & Coherence** (grammar, sentence flow, logical progression) - human-as-judge on a Likert scale.

Example dataset - *IT-ConvAI2 and Blended Skill Talk (uses Likert scales): Common for benchmarking generation coherence and grammar*

**Toxicity** - classification using a curated dataset.

Example dataset - *Jigsaw Toxic Comment: a dataset (with a multilingual variant) for identifying online toxicity, offensive speech, and abuse*



# **RAG Retrieval**
---------------------------------------------------
**Mean Reciprocal Rank (MRR)** - average, across queries, of the reciprocal rank of the first relevant document.
Rewards systems that surface a relevant document earlier.
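
A minimal sketch of the computation, assuming each query's results are given as a ranked list of booleans (True means relevant). A query with no relevant result contributes 0, which is a common convention but still an assumption here.

```python
def mean_reciprocal_rank(ranked_relevance):
    """Average of 1/rank of the first relevant result per query."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(ranked_relevance)

queries = [
    [False, True, False],   # first relevant at rank 2 -> 1/2
    [True, False],          # first relevant at rank 1 -> 1
    [False, False, False],  # nothing relevant         -> 0
]
print(mean_reciprocal_rank(queries))  # 0.5
```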

**Normalized Discounted Cumulative Gain (nDCG@k)** - scores the top-k ranked results by summing relevance scores weighted by a discount factor that decreases with rank position, then normalizing by the score of the ideal ranking.
Captures both relevance grading and rank order.
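
A sketch of one common variant, assuming linear gain (the raw relevance grade) and a log2 rank discount. As a simplification, the ideal ranking is built by re-sorting the same retrieved list; a fuller evaluation would build it from all judged documents for the query.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the best possible ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

graded = [3, 2, 3, 0, 1, 2]  # relevance grades in retrieved order
print(round(ndcg_at_k(graded, k=6), 3))
```

A list that is already sorted by relevance scores 1.0; pushing highly relevant documents down the ranking lowers the score.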

**Context-Recall** - proportion of queries for which all necessary supporting passages are retrieved (regardless of order). Ensures the retrieval module brings back every piece of evidence needed.
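
Following the definition above, a minimal sketch that assumes passages are identified by hashable IDs and that the required passages per query are known. The function name and the sample IDs are illustrative.

```python
def context_recall(retrieved_per_query, required_per_query):
    """Share of queries whose required passages were all retrieved."""
    hits = sum(
        1
        for retrieved, required in zip(retrieved_per_query, required_per_query)
        if set(required) <= set(retrieved)  # order does not matter
    )
    return hits / len(required_per_query)

retrieved = [["a", "b", "c"], ["d"], ["e", "f"]]
required = [["a", "c"], ["d", "x"], ["f"]]
# Queries 1 and 3 are fully covered; query 2 is missing passage "x".
print(context_recall(retrieved, required))
```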

**Precision@k** - proportion of top-k retrieved documents that are relevant. Focuses on retrieval accuracy within the top set.

**Recall@k** - proportion of all relevant documents that appear in top-k retrieved set. Measures completeness of retrieval.
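
Both metrics can be sketched in a few lines, assuming document IDs and a known set of relevant documents per query. Dividing precision by `k` even when fewer than `k` documents are returned is a convention chosen here, not the only option.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
print(precision_at_k(retrieved, relevant, k=4))  # 0.5
print(recall_at_k(retrieved, relevant, k=4))     # 2 of 3 relevant docs found
```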


### Major LLM Benchmarks and Their Purposes

| Benchmark    | Main Use / Evaluation Area                           | Typical Scenario or Task                                  |
|--------------|-----------------------------------------------------|----------------------------------------------------------|
| MMLU         | General knowledge, reasoning, QA                    | Multiple-choice exams from 57 domains, e.g., history, law, STEM; measures breadth of model knowledge[1][5][2] |
| HellaSwag    | Commonsense reasoning, contextual inference         | Completing sentences/stories with the most plausible ending after adversarial filtering; detect real-world logic gaps[1][4][5] |
| BIG-bench    | Advanced reasoning, multi-step tasks, creativity    | Diverse set of tasks, including cognitive tests, logical puzzles, error analysis and complex linguistic challenges[1][5] |
| DROP         | Discrete reasoning, reading comprehension           | Requires sorting, counting, extraction and reasoning over long paragraphs (from Wikipedia)[1][4] |
| SQuAD        | Extractive question answering                       | Answering factual questions from Wikipedia passages; evaluates span extraction ability[1][5] |
| TruthfulQA   | Truthfulness in generated answers                   | Tests for avoidance of hallucinations and generation of factually correct statements[4] |
| GSM-8K       | Grade-school mathematical reasoning                 | Word problems requiring arithmetic/logical computation, tests chain-of-thought abilities[4][5] |
| MATH         | Advanced mathematics (problem solving, reasoning)   | Competition-level math problems, e.g., algebra, geometry, number theory, precalculus[5] |
| HumanEval    | Code generation correctness                         | Writing functions in Python from docstrings, focuses on code accuracy and execution[4][2][5] |
| MBPP         | Basic Python programming, functional correctness    | 900+ code generation tasks, measures performance on simple algorithmic problems[2] |
| MT-Bench     | Dialog/chat capabilities, instruction following     | Multi-turn conversation tests; covers coding, extraction, knowledge, roleplay, and more[2][5] |
| GLUE         | Core NLP tasks (classification, sentiment, QA)      | Suite of nine text tasks, yields aggregate language understanding score; widely used for baseline evaluations[5] |
| Chatbot Arena| Human preference, dialog/response quality           | Human crowdworkers compare conversational responses; measures preference and engagement[4][6] |
| LegalBench   | Legal text processing (domain-specific)             | Assessing if a model can interpret statutes, case law, legal arguments[3] |
| MedQA        | Medical knowledge, clinical reasoning               | Exam-style questions for doctors, patient scenario analysis; healthcare domain safety[3] |

### Matching Benchmarks to Use Cases

- **General-Purpose Evaluation**: MMLU, BIG-bench, GLUE — test versatility in broad applications and basic NLP.[5][1]
- **Knowledge & Reasoning**: MMLU, DROP, SQuAD, TruthfulQA — chosen for fact-extraction, comprehension, knowledge QA bots.[4][1]
- **Commonsense & Logic**: HellaSwag, BIG-bench, DROP — detect hallucinations, test model’s real-world reasoning.[1][4]
- **Math/Chain-of-Thought**: GSM-8K, MATH — important for agents needing stepwise math or logical planning.[4][5]
- **Coding**: HumanEval, MBPP — development assistants, IDE integrations, code-generation APIs.[2][4]
- **Chat, Conversation**: MT-Bench, Chatbot Arena — dialog agents, customer support bots, personalization.[6][2]
- **Domain-Specific**: LegalBench, MedQA, FinanceBench — specialty applications in law, medicine, finance.[3]

Each benchmark provides a reproducible way to probe whether an LLM is fit for general production rollout or requires targeted improvement via fine-tuning or transfer learning. For new model deployments, matching the benchmark’s domain and task structure to the application helps ensure robust pre-launch quality control.[3][5][1][4]



Well-known datasets for evaluating different types of LLMs and Retrieval-Augmented Generation (RAG) systems, categorized by their core purpose:

### Datasets for General LLM Evaluation

- **MMLU (Massive Multitask Language Understanding)**: Multitask benchmark covering diverse domains to evaluate general knowledge and reasoning.
- **HellaSwag**: Commonsense reasoning and contextual inference.
- **BIG-bench**: Wide-ranging tasks including reasoning, creativity, and linguistic knowledge.
- **OpenWebText**: Large-scale web text for training and evaluation of language understanding.
- **Stanford Human Preferences (SHP)**: Human preference annotations useful for RLHF and naturalness evaluation.

### Datasets for Code Generation and Reasoning

- **HumanEval**: Python function generation from docstring prompts.
- **MBPP (Mostly Basic Python Problems)**: Basic coding challenge set.
- **OpenMathInstruct-1**: Math problem solutions combining text and Python code.

### Datasets for RAG Evaluation

- **HotpotQA**: Multi-hop question answering requiring retrieving multiple documents.
- **MS MARCO**: Large-scale passage ranking for retrieval evaluation.
- **Natural Questions**: Real user queries with Wikipedia evidence passages.
- **CovidQA, PubMedQA**: Biomedical and clinical question answering datasets.
- **CUAD (Contract Understanding Atticus Dataset)**: Legal contracts QA dataset.
- **FinQA, TAT-QA**: Financial QA datasets requiring numerical and tabular reasoning.
- **FRAMES (Factuality, Retrieval, And reasoning MEasurement Set)**: Tests retrieval accuracy, reasoning, and factuality for end-to-end RAG systems.
- **RAGTruth**: Designed to analyze hallucination and assess hallucination detection techniques in RAG outputs.
- **DragonBall Dataset**: Multilingual, multi-domain RAG benchmark with diverse QA pairs.

### Additional Specialized LLM Datasets

- **OpenOrca**: Reasoning-focused dataset combining large-scale GPT-4/3.5 completions.
- **HellaSwag**: Dataset for commonsense reasoning (also listed under general LLM evaluation).
- **Ai2 ARC (AI2 Reasoning Challenge)**: Science questions requiring reasoning.
- **TriviaQA**: Open-domain question answering requiring external knowledge.

