
---

# 🧠 What is a LangSmith Experiment?

### ✅ Definition:

A **LangSmith Experiment** is a **test run** of your LLM app **against a dataset**, where **each input is run through your app**, and the **outputs are evaluated** by one or more **evaluators**.

---

## 📊 Diagram: Experiment Lifecycle

```
       ┌──────────────┐
       │ Dataset      │ (List of examples: input + expected output)
       └─────┬────────┘
             │
             ▼
 ┌──────────────────────┐
 │  LLM Application     │ (Your chain/tool/function)
 │  Runs on each input  │
 └─────┬────────────────┘
       │ Each run produces...
       ▼
 ┌──────────────────────┐
 │ Prediction Output    │
 └─────┬────────────────┘
       │
       ▼
 ┌────────────────────────────────┐
 │   Evaluators (Auto or Custom)  │
 │ Compare Prediction vs Golden   │
 └────────────┬───────────────────┘
              ▼
    ┌──────────────────────┐
    │ Scores + Feedback    │
    └──────────────────────┘
```

---

# 🧪 Full Experiment Scenario

Let’s break it down in plain terms:

### 🔸 Step 1: Prepare your dataset

✅ A list of test examples:

```json
{ "input": "What is the capital of Japan?", "expected_output": "Tokyo" }
```

---

### 🔸 Step 2: Run your app over the dataset

For each input, your app (LLM chain, tool, or function) will:

* Run using that input
* Create a **Run**
* Log the `output`

---

### 🔸 Step 3: Evaluate the output

Evaluators (defined via UI or Python) compare:

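* the prediction (your app's `output`)
* the reference (the golden `expected_output`)

Evaluators then return:

* a `score`
* optional reasoning that explains the score (stored as the feedback `comment`)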
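
#### 📌 Sketch: Steps 1 to 3 in plain Python

The three steps above can be sketched without the SDK at all. This is only an illustration of the loop, not LangSmith code: `toy_app`, `run_experiment`, and the second dataset row are hypothetical stand-ins added for the example.

```python
# Plain-Python stand-in for the three steps; no LangSmith calls are made.

# Step 1: a tiny dataset (the second row is an illustrative addition).
dataset = [
    {"input": "What is the capital of Japan?", "expected_output": "Tokyo"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
]


def toy_app(question: str) -> str:
    """Hypothetical app under test: a lookup table instead of an LLM call."""
    answers = {"What is the capital of Japan?": "Tokyo"}
    return answers.get(question, "I don't know")


def exact_match(prediction: str, reference: str) -> dict:
    """Evaluator: compares prediction to reference, returns score + reasoning."""
    matched = prediction.strip() == reference.strip()
    reasoning = (
        "matches reference"
        if matched
        else f"expected {reference!r}, got {prediction!r}"
    )
    return {"score": int(matched), "reasoning": reasoning}


def run_experiment(dataset, app, evaluator) -> list:
    """Run the app on every input, then evaluate each output."""
    results = []
    for example in dataset:
        prediction = app(example["input"])  # Step 2: one run per input
        feedback = evaluator(prediction, example["expected_output"])  # Step 3
        results.append({"input": example["input"], "output": prediction, **feedback})
    return results


results = run_experiment(dataset, toy_app, exact_match)
for row in results:
    print(row["score"], row["reasoning"])
```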

---

## ⚙️ Evaluators in Experiments

### ✅ You can attach evaluators in two ways:

| Method                  | How it Works                                     |
| ----------------------- | ------------------------------------------------ |
| **Auto-Evaluator (UI)** | Select in LangSmith UI from dropdown (no coding) |
| **Custom Code**         | Define in Python and attach using SDK            |
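
#### 📌 Example: a custom evaluator function

A minimal sketch of the "Custom Code" route. As far as I know, the SDK accepts a plain function that receives a run and its dataset example and returns a dict with a `key` and a `score`; treat that shape, and the `"output"` key used here, as assumptions to check against your SDK version. The `SimpleNamespace` objects are local stand-ins so the function can be tried without a LangSmith account.

```python
from types import SimpleNamespace


def contains_reference(run, example) -> dict:
    """Custom evaluator: 1 if the reference answer appears in the prediction."""
    # Assumption: both the app and the dataset store the answer under "output".
    prediction = (run.outputs or {}).get("output", "")
    reference = (example.outputs or {}).get("output", "")
    found = bool(reference) and reference.lower() in prediction.lower()
    return {"key": "contains_reference", "score": int(found)}


# Local smoke test with stand-in objects (real runs and examples come from the SDK).
fake_run = SimpleNamespace(outputs={"output": "The capital of Japan is Tokyo."})
fake_example = SimpleNamespace(outputs={"output": "Tokyo"})

result = contains_reference(fake_run, fake_example)
print(result)
```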

---

## 🧪 Python Code to Run Experiment Locally

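Here's a **minimal sketch** using the `evaluate` function from the LangSmith SDK. Parameter names have changed between SDK versions, so check them against the version you have installed:

```python
from langsmith import Client
from langsmith.evaluation import evaluate

# Initialize (the client reads your LangSmith API key from the environment)
client = Client()

# Define evaluator(s): a plain function that scores one run against its example.
# Assumes the app and the dataset both store the answer under the key "output".
def exact_match(run, example):
    prediction = (run.outputs or {}).get("output")
    reference = (example.outputs or {}).get("output")
    return {"key": "exact_match", "score": int(prediction == reference)}

# Define dataset and target (chain, tool, or wrapper)
dataset_name = "qa-dataset-v1"

def my_app(inputs: dict) -> dict:
    # Placeholder: call your chain or model here and return its answer.
    return {"output": "Tokyo"}

# Run experiment. experiment_prefix names the project that stores the runs
# (the legacy run_on_dataset API called this project_name).
evaluate(
    my_app,
    data=dataset_name,
    evaluators=[exact_match],
    experiment_prefix="qa-experiment-01",
    client=client,
)
```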

> 🧠 You can attach multiple evaluators to get richer feedback.

---

## 🎯 Running Experiments on Dataset **Versions / Splits**

### ✅ Why?

* Dataset versions are snapshots of your examples, so pinning one keeps results comparable while the dataset keeps changing
* Dataset splits (e.g., `critical`, `edge`, `easy`) help you test smarter

### 📌 Example:

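Run only on the `"edge-cases"` split of the `qa-dataset-v2` dataset (a sketch; `my_app` stands for your target function and `exact_match` for your evaluator):

```python
from langsmith.evaluation import evaluate

# Fetch only the examples in the "edge-cases" split.
# To pin a dataset version as well, pass as_of=<version tag or timestamp>.
edge_examples = client.list_examples(
    dataset_name="qa-dataset-v2",
    splits=["edge-cases"],
)

evaluate(
    my_app,
    data=edge_examples,
    evaluators=[exact_match],
    experiment_prefix="qa-edge-eval",
)
```

---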

## 🧠 Must-Know Reflections:

1. ✅ What’s the benefit of attaching multiple evaluators to one experiment?
2. ✅ How do dataset versions or tags help in testing safely?
3. ✅ Why would you test only a subset of your examples (like edge cases)?
4. ✅ What’s the role of `project_name` when running an experiment?
5. ✅ Can you re-use the same evaluators for different experiments? (→ Yes!)

---

### ✅ **1. What’s the benefit of attaching multiple evaluators to one experiment?**

#### ✅ Answer:

Attaching multiple evaluators gives you a **multi-dimensional view** of your LLM application’s performance.

| Evaluator Type      | What It Evaluates                                  |
| ------------------- | -------------------------------------------------- |
| Exact match         | Precise correctness                                |
| Semantic similarity | Meaningful similarity, even if phrased differently |
| LLM-as-judge        | Subjective qualities like helpfulness, clarity     |
| Custom              | Domain-specific metrics (e.g., legal correctness)  |

🔍 **Why important?**
Sometimes one metric is misleading — e.g., exact match fails even when the answer is good. Combining evaluators gives a **more reliable, nuanced understanding**.
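
#### 📌 Sketch: two evaluators, two different verdicts

A small plain-Python illustration of why one metric can mislead. `token_recall` is a deliberately crude, hypothetical stand-in for a semantic-similarity evaluator (real ones typically use embeddings or an LLM judge); it is here only to show two evaluators disagreeing on the same answer.

```python
def tokens(text: str) -> set:
    """Lowercase the text, drop full stops, and split into a set of words."""
    return set(text.lower().replace(".", "").split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 only when the two strings are identical (ignoring case and edges)."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_recall(prediction: str, reference: str) -> float:
    """Crude similarity proxy: share of reference words found in the prediction."""
    ref_tokens = tokens(reference)
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & tokens(prediction)) / len(ref_tokens)


evaluators = {"exact_match": exact_match, "token_recall": token_recall}

prediction = "The capital of Japan is Tokyo."
reference = "Tokyo is the capital of Japan."

# Same answer, different word order: the two evaluators disagree.
scores = {name: fn(prediction, reference) for name, fn in evaluators.items()}
print(scores)
```

Looking at both scores together shows that the answer is right but phrased differently, which neither number shows alone.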

---

### ✅ **2. How do dataset versions or tags help in testing safely?**

#### ✅ Answer:

**Tags and versions** help you:

* Test specific groups of examples (e.g., only “edge-cases”)
* Track evaluation changes over time (e.g., v1 → v2 → v3)
* Avoid running full datasets unnecessarily
* **Pin evaluations to specific versions**, ensuring reproducibility

🧠 **Key idea:**
Without versions, you might accidentally evaluate on outdated or changed data, leading to **misleading results**.
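
#### 📌 Sketch: what pinning a version buys you

The idea of pinning can be illustrated in plain Python. This does not show how LangSmith stores versions internally; the `history` log, its dates, and `examples_as_of` are hypothetical and exist only to show that the same experiment sees different references depending on the point in time it is pinned to.

```python
from datetime import datetime

# Hypothetical change log: each entry is one example's state at a point in time.
history = [
    {"id": 1, "expected_output": "Tokyo", "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "expected_output": "Paris", "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "expected_output": "Paris, France", "modified_at": datetime(2024, 3, 1)},
]


def examples_as_of(history: list, as_of: datetime) -> dict:
    """Return the latest expected output of each example at or before `as_of`."""
    pinned = {}
    for entry in sorted(history, key=lambda e: e["modified_at"]):
        if entry["modified_at"] <= as_of:
            pinned[entry["id"]] = entry["expected_output"]
    return pinned


pinned_feb = examples_as_of(history, datetime(2024, 2, 1))
latest = examples_as_of(history, datetime(2024, 4, 1))

print(pinned_feb)
print(latest)
```

An exact-match evaluator would score the answer "Paris" as correct against the pinned February snapshot and as wrong against the latest data, even though the app did not change.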

---

### ✅ **3. Why would you test only a subset of your examples (like edge cases)?**

#### ✅ Answer:

Subsets (using tags or splits) let you:

* Focus on **high-impact or high-risk scenarios** (e.g., medical or legal)
* Run fast evaluations when time is limited
* Identify if a new prompt/model **solves hard problems**

🎯 **Example:**
You add a new RAG prompt — instead of testing 1,000 examples, you tag 50 known “tricky” ones as `critical` and test those first.
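
#### 📌 Sketch: selecting a tagged subset

A plain-Python sketch of the selection step, assuming each example carries a list of split names. The example rows and the `select_split` helper are hypothetical; with the SDK you would let LangSmith do this filtering for you.

```python
# Hypothetical examples, each labeled with the splits it belongs to.
examples = [
    {"input": "What is the capital of Japan?", "splits": ["easy"]},
    {"input": "What is the capital of Australia?", "splits": ["critical"]},
    {"input": "What is the capital of Bolivia?", "splits": ["critical", "edge"]},
]


def select_split(examples: list, split: str) -> list:
    """Keep only the examples that belong to the given split."""
    return [ex for ex in examples if split in ex["splits"]]


critical = select_split(examples, "critical")
print(len(critical), "of", len(examples), "examples selected")
```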

---

### ✅ **4. What’s the role of `"project_name"` when running an experiment?**

#### ✅ Answer:

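`project_name` names the project that stores an experiment's runs (the newer `evaluate` API sets it through `experiment_prefix`). In LangSmith it: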

* **Groups your experiment runs**
* Lets you **track results and scores** in the LangSmith dashboard
* Helps compare experiments side by side (e.g., `prompt-v1` vs `prompt-v2`)
* Makes it easy to **filter logs, scores, and traces**

💡 Best Practice: Use descriptive project names like `"summarization-v3-eval"`.
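
#### 📌 Sketch: comparing experiments by project name

To make the "side by side" point concrete, here is a plain-Python sketch that groups scores by project name. The `feedback` rows are made-up toy data, not real results, and `mean_score_by_project` is a hypothetical helper; the LangSmith dashboard does this comparison for you.

```python
from collections import defaultdict

# Made-up feedback rows, as if collected from two experiments.
feedback = [
    {"project_name": "prompt-v1", "score": 1},
    {"project_name": "prompt-v1", "score": 0},
    {"project_name": "prompt-v1", "score": 1},
    {"project_name": "prompt-v1", "score": 0},
    {"project_name": "prompt-v2", "score": 1},
    {"project_name": "prompt-v2", "score": 1},
    {"project_name": "prompt-v2", "score": 1},
    {"project_name": "prompt-v2", "score": 0},
]


def mean_score_by_project(rows: list) -> dict:
    """Group scores by project name and average each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["project_name"]].append(row["score"])
    return {name: sum(scores) / len(scores) for name, scores in grouped.items()}


summary = mean_score_by_project(feedback)
print(summary)
```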

---

### ✅ **5. Can you re-use the same evaluators for different experiments?**

#### ✅ Answer:

Yes — and **you should**!

✅ Evaluators, whether exact match, semantic similarity, LLM-as-judge, or your own custom logic, are **modular** and **reusable**.

This helps you:

* Save time
* Maintain consistency across experiments
* Benchmark models across datasets using the **same evaluation criteria**

📌 Example:

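```python
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Off-the-shelf semantic-similarity evaluator (needs langchain and an embedding model).
semantic = LangChainStringEvaluator("embedding_distance")

# The same `semantic` object can be passed to any number of experiments.
# `summarize` stands for your target function.
evaluate(
    summarize,
    data="news-summary-v1",
    evaluators=[semantic],
    experiment_prefix="summarization-semantic-eval",
)
```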

---