

---

## 🧠 What is a **Summary Evaluator**?

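A **Summary Evaluator**, as used here, is an **LLM-as-Judge** evaluator for summarization tasks. (LangSmith's own `summary_evaluators` are a different thing: experiment-level metrics computed across all runs.) It compares: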

* A **source document** (what needs to be summarized),
* The **model-generated summary** (the prediction),
* And the **reference/golden summary** (what a good summary should look like).

Then it **automatically scores the summary** based on:

* **Faithfulness** (is it factually grounded in the source?)
* **Conciseness** (is it brief but informative?)
* **Relevance** (does it include the important parts?)
* **Fluency** (does it sound natural?)
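One way to turn these four criteria into something a judge model can act on is a grading prompt that shows it all three texts plus the rubric. The sketch below is illustrative only: `CRITERIA`, `build_judge_prompt`, the 1-5 scale, and the prompt wording are assumptions, not part of any library.

```python
# Hypothetical rubric: names and questions mirror the criteria listed above.
CRITERIA = {
    "faithfulness": "Is every claim grounded in the source document?",
    "conciseness": "Is it brief but still informative?",
    "relevance": "Does it cover the important parts?",
    "fluency": "Does it read naturally?",
}


def build_judge_prompt(document: str, prediction: str, reference: str) -> str:
    """Assemble a grading prompt containing the three texts and the rubric."""
    rubric = "\n".join(f"- {name}: {question}" for name, question in CRITERIA.items())
    return (
        "You are grading a summary on a 1-5 scale (assumed scale).\n\n"
        f"Source document:\n{document}\n\n"
        f"Candidate summary:\n{prediction}\n\n"
        f"Reference summary:\n{reference}\n\n"
        f"Criteria:\n{rubric}\n\n"
        'Reply with JSON: {"score": <1-5>, "reasoning": "<one sentence>"}'
    )


prompt = build_judge_prompt(
    document="Original article about AI...",
    prediction="AI is transforming many industries...",
    reference="Artificial Intelligence impacts industries across sectors...",
)
print(prompt)
```

Keeping the rubric in a dict makes it easy to add or drop a criterion without touching the prompt template.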

---

## 🔍 Why Do We Need Summary Evaluators?

Summarization is **subjective and fuzzy**:

* Multiple valid summaries can exist.
* A summary might look fine but **hallucinate facts**.
* Traditional metrics like **BLEU** or **ROUGE** measure n-gram overlap with the reference, so they don’t capture **faithfulness** well.
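To see why overlap metrics can mislead, here is a rough ROUGE-1-style score in plain Python. It is a simplified stand-in (set-based, whitespace tokens, no stemming), not the official ROUGE implementation, and the three sentences are made up for illustration.

```python
def unigram_f1(candidate: str, reference: str) -> float:
    """Rough ROUGE-1-style F1 over unique lowercase whitespace tokens."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the drug reduced blood sugar in most patients"
# Faithful paraphrase: same meaning, few shared words.
faithful = "most patients saw lower glucose levels on the medication"
# Hallucination: one word changed ("most" -> "all"), meaning is now wrong.
hallucinated = "the drug reduced blood sugar in all patients"

print("faithful paraphrase:", round(unigram_f1(faithful, reference), 3))
print("hallucinated copy:  ", round(unigram_f1(hallucinated, reference), 3))
```

The hallucinated sentence scores far higher than the faithful paraphrase because it shares almost every word with the reference, which is exactly the failure an LLM judge is meant to catch.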

✅ So we need Summary Evaluators to:

* Automatically **evaluate summaries** using an LLM.
* Judge if summaries are **accurate and helpful**, not just similar.

---

## ⚙️ How Does It Work?

### 🔁 The process:

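For each example, the evaluator receives the inputs, the model's prediction, and the reference. Conceptually (the exact field names depend on your dataset and target):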

```json
{
  "input": {
    "document": "Original article about AI...",
    "question": "Summarize this"
  },
  "prediction": "AI is transforming many industries...",
  "reference": "Artificial Intelligence impacts industries across sectors..."
}
```

The **Summary Evaluator** (LLM-powered) reviews:

* 🔍 Is the summary **factually correct**?
* 📋 Is it **complete and relevant**?
* 🧠 Is it **fluent and clear**?

Then it outputs:

```json
{
  "score": 4,
  "reasoning": "Summary is accurate and relevant, but slightly verbose."
}
```
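Judge models do not always return clean output, so it helps to validate the reply before logging it. The following is a minimal sketch; `JudgeResult`, `parse_judge_output`, and the 1-5 range are assumptions chosen to match the example output above.

```python
import json
from dataclasses import dataclass


@dataclass
class JudgeResult:
    score: int
    reasoning: str


def parse_judge_output(raw: str, min_score: int = 1, max_score: int = 5) -> JudgeResult:
    """Parse the judge's JSON reply and reject scores outside the expected range."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed replies
    score = int(data["score"])
    if not min_score <= score <= max_score:
        raise ValueError(f"score {score} outside [{min_score}, {max_score}]")
    return JudgeResult(score=score, reasoning=str(data.get("reasoning", "")))


raw_reply = '{"score": 4, "reasoning": "Summary is accurate and relevant, but slightly verbose."}'
result = parse_judge_output(raw_reply)
print(result.score, "-", result.reasoning)
```

Failing loudly on an out-of-range score keeps a misbehaving judge from silently skewing the aggregate results.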

---

## 📌 When Should You Use Summary Evaluators?

Use them when:

* ✅ You're building **summarization applications**
* ✅ You need to **compare summary versions** (prompt A vs prompt B)
* ✅ You want to **catch hallucinations** in generated summaries
* ✅ You want **automated but human-like scoring** of summary quality

---

## 🌟 What Benefits Does It Provide?

| Benefit                 | Description                                    |
| ----------------------- | ---------------------------------------------- |
| 🧠 Better Evaluation    | Judges based on **meaning**, not word match    |
| ⚡ Automation            | Evaluates large datasets without manual review |
| ✅ Reliability           | Helps detect **factual errors** in summaries   |
| 📈 Rich Feedback        | Gives reasoning alongside numeric scores       |
| 🧪 Supports A/B Testing | Can compare summary versions side-by-side      |

---

## ✅ Real-World Scenario + Solution

### 📚 Scenario:

You’re building an LLM app that summarizes **healthcare articles** for patients.

You want to test if your new summary prompt is better than the old one.

You’ve created a dataset:

```json
{
  "document": "Diabetes is a chronic condition ...",
  "expected_summary": "Diabetes affects how the body processes sugar."
}
```
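LangSmith dataset examples keep the inputs and the reference outputs in separate dicts, so a flat record like the one above usually needs to be split before upload. A small sketch, where `to_example` is a hypothetical helper and the key names simply follow the record shown:

```python
def to_example(record: dict) -> dict:
    """Split a flat record into the inputs/outputs pair a dataset example stores."""
    return {
        "inputs": {"document": record["document"]},
        "outputs": {"expected_summary": record["expected_summary"]},
    }


record = {
    "document": "Diabetes is a chronic condition ...",
    "expected_summary": "Diabetes affects how the body processes sugar.",
}
example = to_example(record)
print(example["inputs"])
print(example["outputs"])
```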

You run:

* **Prompt v1 → Summary A**
* **Prompt v2 → Summary B**

### 💡 Problem:

* Both summaries are fluent.
* You’re not sure which one is **more faithful** or **complete**.

### ✅ Solution: Use Summary Evaluators

```python
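from langsmith.evaluation import evaluate


def summarize_v1(inputs: dict) -> dict:
    # Placeholder target: call your prompt v1 model here and return {"summary": ...}.
    raise NotImplementedError


def summary_quality(run, example) -> dict:
    # Custom LLM-as-judge evaluator (written by you, not a LangSmith built-in).
    document = example.inputs["document"]
    reference = example.outputs["expected_summary"]
    prediction = run.outputs["summary"]
    # Placeholder: send the three texts to your judge model and parse its reply,
    # then return {"key": "summary_quality", "score": ..., "comment": ...}.
    raise NotImplementedError


# Runs the target over every example in the dataset and logs the evaluator's feedback.
evaluate(
    summarize_v1,
    data="health-summaries-v1",
    evaluators=[summary_quality],
    experiment_prefix="summary-eval-gpt4-v1",
)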
```

> The evaluator will score each summary for:
>
> * ✅ Accuracy (no hallucination)
> * ✅ Coverage (includes key points)
> * ✅ Clarity

And then you’ll see **per-example scores**, **justifications**, and **aggregated performance** in LangSmith.
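If you export the per-example scores, the prompt v1 vs prompt v2 comparison reduces to a few lines of aggregation. This is a sketch with made-up scores; `compare_versions` is a hypothetical helper, and it assumes both versions were scored on the same examples in the same order.

```python
from statistics import mean


def compare_versions(scores_a: list[int], scores_b: list[int]) -> dict:
    """Summarize per-example judge scores for two prompt versions."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both versions must be scored on the same examples")
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": len(scores_a) - wins_a - wins_b,
    }


# Made-up scores for illustration only.
summary_a_scores = [4, 3, 5, 2]
summary_b_scores = [4, 4, 5, 3]
report = compare_versions(summary_a_scores, summary_b_scores)
print(report)
```

Looking at per-example wins alongside the means helps distinguish a broad improvement from one driven by a handful of examples.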

---

## 🧠 Must-Know Summary

| Topic         | Key Point                                                 |
| ------------- | --------------------------------------------------------- |
| What is it?   | Evaluator for checking summary quality                    |
| Why needed?   | Summarization is subjective and error-prone               |
| How it works? | Compares document, prediction, and reference via LLM      |
| When to use?  | When building, testing, or comparing summaries            |
| Benefits?     | Human-like scoring, automation, reasoning, accuracy       |
| Scenario?     | Summarizing articles and evaluating factuality & coverage |

---
