
---

### 👇 **What We'll Learn**

1. What are Pairwise Experiments?
2. Why and when they are **crucial**
3. What you need to **run them effectively**
4. Full Python code example with explanations

---

## 🔷 1. What Are Pairwise Experiments?

> **Definition:**
> A **Pairwise Experiment** is a method of evaluation where **two outputs (A & B)** from different versions of your LLM app are **compared against each other**, not just scored in isolation.

Instead of asking:

> “Is A good?”

We ask:

> “Is A better than B?”

🔍 This mimics how **humans compare choices** in real life (and in A/B testing).

---

## 🧠 2. Why Are Pairwise Experiments Crucial?

### ✅ In real-world GenAI apps:

* Outputs may **both seem okay**, but one may be **subtly better**.
* LLM scoring in isolation may **not catch nuances**.
* Pairwise comparison helps when outputs are **open-ended or non-deterministic**, for example:

  * Summaries
  * Answers to complex questions
  * Chat responses
  * RAG results

### 🔴 Problems with isolated scoring:

| Method        | Problem                                   |
| ------------- | ----------------------------------------- |
| LLM-as-Judge  | May score both A and B highly             |
| Exact Match   | May fail if both are semantically correct |
| Embedding Sim | Cannot tell which phrasing is clearer     |

💡 **Pairwise helps** answer: *"Which version should I ship?"*
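
To make the exact-match row of the table concrete, here is a minimal sketch. The `reference` string and both outputs are made-up illustrations, not data from a real experiment.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Return True only when the two strings are identical after normalization."""
    return prediction.strip().lower() == reference.strip().lower()


# Hypothetical reference answer and two candidate outputs.
reference = "AI automates repetitive tasks."
output_a = "AI helps automate tasks and save time."
output_b = "Artificial intelligence improves efficiency by automating repetitive tasks."

# Both candidates are reasonable, yet both fail, so the metric cannot rank them.
isolated_scores = {
    "A": exact_match(output_a, reference),
    "B": exact_match(output_b, reference),
}
print(isolated_scores)
```

Because both entries come back identical, an isolated metric like this gives no signal about which version to ship.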

---

## 🧰 3. What Do You Need for Pairwise Experiments?

### ✅ A. **Two versions of your app**

You compare:

* Old prompt vs new prompt
* Model v1 vs v2
* Retrieval method A vs B

Each version should output predictions **for the same input dataset**.

---

### ✅ B. **The same dataset**

Use **one dataset**, and evaluate:

* Version A → Prediction A
* Version B → Prediction B
* Now compare Prediction A vs B

---

### ✅ C. **A pairwise evaluator**

Most commonly:

* `LLM-as-Judge` evaluator prompt:

  > “Between Output A and Output B, which is more helpful, accurate, and fluent?”

The evaluator's preference is recorded in LangSmith as one of three outcomes:

* `A wins`, `B wins`, or `Tie`
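
As a rough sketch of the judge side, the snippet below builds the comparison prompt and maps a judge's reply onto the three outcomes. The template wording, the one-word answer format, and the helper names `build_judge_prompt` and `parse_verdict` are assumptions made for illustration; they are not part of LangSmith.

```python
JUDGE_TEMPLATE = """You are comparing two responses to the same input.

Input:
{question}

Output A:
{output_a}

Output B:
{output_b}

Between Output A and Output B, which is more helpful, accurate, and fluent?
Answer with exactly one word on the first line: A, B, or TIE."""

# Maps the judge's one-word answer onto the labels used in this guide.
VERDICTS = {"A": "A wins", "B": "B wins", "TIE": "Tie"}


def build_judge_prompt(question: str, output_a: str, output_b: str) -> str:
    """Fill the template with one input and the two outputs to compare."""
    return JUDGE_TEMPLATE.format(question=question, output_a=output_a, output_b=output_b)


def parse_verdict(judge_reply: str) -> str:
    """Map the first word of the judge's reply onto one of the three labels."""
    words = judge_reply.strip().split()
    token = words[0].strip(".,:;!").upper() if words else ""
    if token not in VERDICTS:
        # Fail loudly instead of silently counting a malformed reply as a tie.
        raise ValueError(f"Unrecognized verdict: {judge_reply!r}")
    return VERDICTS[token]
```

Forcing a one-word first line keeps parsing trivial; the judge can still add its reasoning on later lines.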

---

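### ✅ D. **Run group metadata**

LangSmith needs to know:

* “These two runs belong to the same input, but different versions.”

Runs produced from the same dataset are linked through the dataset example each run references, so pairing does not depend on a custom ID. A shared tag such as `run_group_id` is a user-defined convention (not a built-in LangSmith field) that you can attach as metadata to make paired experiments easy to find and filter:

```python
import uuid

run_group_id = uuid.uuid4().hex
```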

---

## ✅ 4. Code: Running Pairwise Experiments in LangSmith

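```python
import uuid
from langsmith import evaluate
from langsmith.evaluation import evaluate_comparative

# Sketch based on the langsmith SDK's evaluate / evaluate_comparative helpers;
# check the signatures against your installed version.
dataset_name = "chatbot-qa-dataset"

# Optional user-defined tag to find the paired experiments later
run_group_id = uuid.uuid4().hex

# Placeholder targets: replace each body with a call to your real model or chain.
def app_v1(inputs: dict) -> dict:
    return {"output": f"[prompt v1] {inputs}"}

def app_v2(inputs: dict) -> dict:
    return {"output": f"[prompt v2] {inputs}"}

# Step 1: Run both versions over the same dataset (one experiment each)
results_a = evaluate(app_v1, data=dataset_name, experiment_prefix="version-a",
                     metadata={"version": "A", "run_group": run_group_id})
results_b = evaluate(app_v2, data=dataset_name, experiment_prefix="version-b",
                     metadata={"version": "B", "run_group": run_group_id})

# Step 2: Define a pairwise evaluator.
# It receives both runs for one dataset example and scores each run by its ID.
def prefer_longer(runs, example):
    # Toy heuristic for illustration only; swap in an LLM-as-Judge call here.
    lengths = [len(str((run.outputs or {}).get("output", ""))) for run in runs]
    if lengths[0] == lengths[1]:
        scores = {run.id: 0.5 for run in runs}  # tie
    else:
        winner = 0 if lengths[0] > lengths[1] else 1
        scores = {run.id: float(i == winner) for i, run in enumerate(runs)}
    return {"key": "preference", "scores": scores}

# Step 3: Run the pairwise comparison across the two experiments
evaluate_comparative(
    (results_a.experiment_name, results_b.experiment_name),
    evaluators=[prefer_longer],
    experiment_prefix="pairwise-compare-v1-v2",
    metadata={"experiment_type": "pairwise", "run_group": run_group_id},
    randomize_order=True,  # shuffles A/B order to reduce position bias
)
```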

---

## 🔍 What Happens Internally?

1. LangSmith matches runs across both versions by the dataset example they were produced from.
2. Output A and output B for that example are sent to the evaluator.
3. The evaluator (usually an LLM judge such as GPT-4) returns:

   * "A is better"
   * "B is better"
   * "Tie"
4. You get results in the LangSmith dashboard:

   * Win rate %
   * Score distribution
   * Comments / justifications
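
The win rate in step 4 is a simple proportion over the per-example verdicts. A minimal sketch, assuming the verdicts have already been collected as plain strings (the `verdicts` list below is invented sample data):

```python
from collections import Counter


def summarize_verdicts(verdicts: list[str]) -> dict[str, float]:
    """Return the share of comparisons won by A, won by B, and tied."""
    if not verdicts:
        raise ValueError("No verdicts to summarize")
    counts = Counter(verdicts)
    return {label: counts[label] / len(verdicts) for label in ("A wins", "B wins", "Tie")}


# Invented sample: one verdict per dataset example.
verdicts = ["B wins", "A wins", "B wins", "Tie"]
print(summarize_verdicts(verdicts))
```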

---

## ✅ Summary: When Should You Use Pairwise Experiments?

| Scenario                                                | Should you use pairwise? |
| ------------------------------------------------------- | ------------------------ |
| You changed a **prompt**?                               | ✅ Yes                    |
| You upgraded the **model** (GPT-3.5 → GPT-4)?           | ✅ Yes                    |
| You optimized your **retrieval chain**?                 | ✅ Yes                    |
| You just want **exact match accuracy**?                 | ❌ Not needed             |
| You are doing **creative generation or summarization**? | ✅ Definitely yes         |

---

## 🧠 Must-Know Reflection Questions:

1. ✅ Why is pairwise comparison better than isolated scoring in LLM evaluations?
2. ✅ What is the role of `run_group_id` in pairwise experiments?
3. ✅ Can you run pairwise on more than two versions?
4. ✅ When would you prefer traditional evaluation over pairwise?
5. ✅ How does LangSmith group predictions for comparison?

---

## ✅ **1. Why is pairwise comparison better than isolated scoring in LLM evaluations?**

### ✅ Answer:

Because **LLM outputs are often ambiguous**, it's hard to tell if one is “right” or “wrong” in isolation. But it's **much easier to say which of two options is better**.

### 🔍 Example:

| Input: "Summarize this paragraph"                                                        |
| ---------------------------------------------------------------------------------------- |
| Version A: "AI helps automate tasks and save time."                                      |
| Version B: "Artificial intelligence improves efficiency by automating repetitive tasks." |

* Both are **correct**.
* A standard evaluator might give both a score of 0.9.
* But **pairwise comparison** lets you say:
  🔹 “B is more detailed and clear than A.”

### 💡 Pairwise:

* Mimics **real-world human judgment**
* Captures **subtle quality differences**
* More helpful in **creative or open-ended outputs**

---

## ✅ **2. What is the role of `run_group_id` in pairwise experiments?**

### ✅ Answer:

`run_group_id` is a **unique identifier** that you attach as metadata to **link the runs** of different app versions that belong to the same comparison. It is a user-defined tag rather than a built-in LangSmith field.

It helps you:

* Group predictions from **version A** and **version B**
* Compare their outputs **side by side**
* Evaluate them **together** rather than independently

🔧 The pairing of individual outputs relies on both versions being run over the **same dataset examples**. If the versions were run on different inputs, LangSmith cannot tell which outputs belong to the **same input**, and the pairwise evaluation would fail or mismatch.

---

## ✅ **3. Can you run pairwise on more than two versions?**

### ✅ Answer:

Yes, **indirectly**.

A pairwise experiment compares **two versions (A vs B)** at a time. But if you have **3 or more versions**, you can run **multiple pairwise experiments**:

* A vs B
* A vs C
* B vs C

Then you **aggregate results** to find:

* Which version wins the most?
* Which version loses the most?
* Which one ties most often?

This is often called a **tournament-style evaluation**.
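
The aggregation step can be sketched as a round-robin tally. In this sketch, `compare` stands in for a full pairwise experiment between two versions and returns the winning version name, or `None` for a tie; the `toy_quality` scores are invented so the example runs without any API calls.

```python
from itertools import combinations
from typing import Callable, Optional


def round_robin(
    versions: list[str],
    compare: Callable[[str, str], Optional[str]],
) -> dict[str, dict[str, int]]:
    """Compare every pair of versions once and tally wins, losses, and ties."""
    tally = {v: {"wins": 0, "losses": 0, "ties": 0} for v in versions}
    for left, right in combinations(versions, 2):
        winner = compare(left, right)
        if winner is None:
            tally[left]["ties"] += 1
            tally[right]["ties"] += 1
        else:
            loser = right if winner == left else left
            tally[winner]["wins"] += 1
            tally[loser]["losses"] += 1
    return tally


# Invented per-version scores standing in for real pairwise experiment results.
toy_quality = {"A": 0.6, "B": 0.8, "C": 0.8}


def toy_compare(left: str, right: str) -> Optional[str]:
    if toy_quality[left] == toy_quality[right]:
        return None
    return left if toy_quality[left] > toy_quality[right] else right


standings = round_robin(["A", "B", "C"], toy_compare)
print(standings)
```

With three versions this runs exactly the three comparisons listed above (A vs B, A vs C, B vs C); the number of pairs grows quadratically as versions are added.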

---

## ✅ **4. When would you prefer traditional evaluation over pairwise?**

### ✅ Answer:

Use **traditional (isolated) evaluation** when:

| Situation                              | Use traditional evaluators |
| -------------------------------------- | -------------------------- |
| Deterministic tasks (math, names)      | ✅ Yes                      |
| You care about **exact correctness**   | ✅ Yes                      |
| You want **individual quality scores** | ✅ Yes                      |
| Need to automate CI/CD scoring         | ✅ Yes                      |

Use **pairwise** when:

* Outputs are **non-deterministic** (summaries, chat)
* You want to know **which version is better**
* You want **human-style comparison**
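
For the deterministic, CI-friendly case, an isolated metric with a pass threshold is usually enough. A minimal sketch, where the sample predictions, references, and the `MIN_ACCURACY` threshold are all assumed values chosen for illustration:

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("Nothing to score")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Invented deterministic answers (math and names), as in the table above.
predictions = ["4", "Paris", "12"]
references = ["4", "Paris", "10"]

MIN_ACCURACY = 0.6  # assumed CI threshold
accuracy = exact_match_accuracy(predictions, references)
passed = accuracy >= MIN_ACCURACY
print(f"accuracy={accuracy:.2f} passed={passed}")
```

A single number per version like this is easy to gate a build on, which is harder to do with a win rate that only has meaning relative to another version.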

---

## ✅ **5. How does LangSmith group predictions for comparison?**

### ✅ Answer:

LangSmith uses:

* The same **input example** from the dataset, which the runs of both versions reference
* Any shared metadata you attached, such as a `run_group_id`, to help you locate related experiments
* The **project names or metadata tags** to identify which output belongs to which version

Then it pairs:

* `prediction_A` (from version A)
* `prediction_B` (from version B)

…and passes them to an **LLM-based evaluator** (or custom logic) that compares and returns:

* A wins ✅
* B wins ✅
* Tie 🤝
* Reasoning 💬
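
The pairing step can be pictured as a join on the dataset example. The sketch below is a simplified illustration, not LangSmith's internal code: runs are plain dicts, and the `example_id` and `output` field names are assumptions.

```python
def pair_runs(runs_a: list[dict], runs_b: list[dict]) -> list[tuple[str, str, str]]:
    """Join two versions' runs on example_id, keeping only examples both versions ran."""
    outputs_b = {run["example_id"]: run["output"] for run in runs_b}
    return [
        (run["example_id"], run["output"], outputs_b[run["example_id"]])
        for run in runs_a
        if run["example_id"] in outputs_b
    ]


# Invented runs: version B has no run for example "ex-2".
runs_a = [
    {"example_id": "ex-1", "output": "AI helps automate tasks and save time."},
    {"example_id": "ex-2", "output": "Paris"},
]
runs_b = [
    {"example_id": "ex-1", "output": "Artificial intelligence improves efficiency."},
]

for example_id, prediction_a, prediction_b in pair_runs(runs_a, runs_b):
    print(example_id, "|", prediction_a, "|", prediction_b)
```

Examples that only one version produced an output for are dropped, since there is nothing to compare them against.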

---
