
# Testing a Dummy LLM Chain with LangSmith

This notebook demonstrates how to evaluate a simple language‑model chain using [LangSmith](https://smith.langchain.com/). LangSmith provides tools for debugging, evaluation, performance monitoring and observability so you can build reliable LLM applications[1]. Here we use a dummy chain to keep the example fully self‑contained; you can replace it with any LangChain chain or model.

We'll create a small dataset of questions and expected answers, run the dummy chain, compute a simple accuracy metric and (optionally) upload the results to LangSmith.


In [None]:

import pandas as pd

# If you want to upload results to LangSmith, install the package
# and set the LANGSMITH_API_KEY environment variable.
try:
    from langsmith import Client
except ImportError:
    Client = None


In [None]:

# Define a list of examples (question, expected answer)
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote 'Moby Dick'?", "expected": "Herman Melville"},
    {"question": "2 + 2 = ?", "expected": "4"},
    {"question": "What year did the first man land on the moon?", "expected": "1969"},
]

pd.DataFrame(examples)


In [None]:

# Define a dummy chain that returns hard‑coded answers

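def dummy_chain(question: str) -> str:
    # Compare against a lowercase literal: the question is lowercased first,
    # so a literal containing "France" could never match.
    if "capital of france" in question.lower():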
        return "Paris"
    if "moby dick" in question.lower():
        return "Herman Melville"
    if "2 + 2" in question or "2+2" in question:
        return "4"
    if "first man land on the moon" in question.lower():
        return "1969"
    return "I don't know"


In [None]:

# Evaluate the dummy chain on each example
results = []
for ex in examples:
    q = ex["question"]
    expected = ex["expected"]
    predicted = dummy_chain(q)
    results.append({
        "question": q,
        "expected": expected,
        "predicted": predicted,
        "correct": predicted.strip().lower() == expected.strip().lower(),
    })

results_df = pd.DataFrame(results)
accuracy = results_df["correct"].mean()

results_df, accuracy
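
The comparison in the cell above normalises only case and surrounding whitespace, so a prediction such as `"Paris."` would be marked wrong. The sketch below shows one possible, slightly more tolerant matcher. The helper names `normalize_answer` and `is_match` are hypothetical and are not part of the notebook or of LangSmith; whether stripping punctuation is appropriate depends on your data.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_match(predicted: str, expected: str) -> bool:
    """Exact match after normalisation (hypothetical helper)."""
    return normalize_answer(predicted) == normalize_answer(expected)


print(is_match("  Paris. ", "paris"))                   # True
print(is_match("Herman  Melville", "herman melville"))  # True
print(is_match("I don't know", "1969"))                 # False
```

Swapping `is_match` into the evaluation loop would change only how the `correct` column is computed; the rest of the pipeline stays the same.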


### Statistical evaluation

In addition to counting correct answers, we compute accuracy, precision, recall and F1‑score.
Because the answers are free‑text strings, each distinct answer is treated as its own class, and the
per‑class scores are macro‑averaged (every class is weighted equally). These metrics show not only how
often the model is correct but also how it balances false positives and false negatives. A confusion
matrix summarises predictions across classes and is a useful diagnostic when working with multi‑class outputs.
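
To make the macro average concrete, here is a small pure‑Python sketch that computes per‑class precision, recall and F1 by hand and then averages them. The labels are toy values invented for illustration, not the notebook's results. The code illustrates the definition (with undefined ratios treated as 0) and is not scikit‑learn's implementation.

```python
from collections import Counter

# Toy data, invented for illustration only.
y_true = ["Paris", "4", "1969", "Paris"]
y_pred = ["Paris", "4", "I don't know", "4"]

# Every value that appears as a label or a prediction is a class.
classes = sorted(set(y_true) | set(y_pred))

tp, fp, fn = Counter(), Counter(), Counter()
for t, p in zip(y_true, y_pred):
    if t == p:
        tp[t] += 1      # correct prediction for class t
    else:
        fp[p] += 1      # p was predicted but was wrong
        fn[t] += 1      # t was the true class but was missed


def safe_div(num: float, den: float) -> float:
    """Return 0.0 when the denominator is zero."""
    return num / den if den else 0.0


per_class = {}
for c in classes:
    precision = safe_div(tp[c], tp[c] + fp[c])
    recall = safe_div(tp[c], tp[c] + fn[c])
    f1 = safe_div(2 * precision * recall, precision + recall)
    per_class[c] = (precision, recall, f1)

# Macro average: unweighted mean over classes.
macro_precision = sum(v[0] for v in per_class.values()) / len(classes)
macro_recall = sum(v[1] for v in per_class.values()) / len(classes)
macro_f1 = sum(v[2] for v in per_class.values()) / len(classes)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy={accuracy:.3f}")
print(f"macro precision={macro_precision:.3f}")
print(f"macro recall={macro_recall:.3f}")
print(f"macro F1={macro_f1:.3f}")
```

Note that a wrong answer such as `"I don't know"` becomes a class of its own with zero precision and recall, which pulls the macro averages below the plain accuracy.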

In [None]:
# Compute statistical metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
import pandas as pd

labels = [r['expected'] for r in results]
preds = [r['predicted'] for r in results]

# Overall accuracy
acc = accuracy_score(labels, preds)
# Precision, recall and F1 (macro‑averaged)
precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro', zero_division=0)
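# Confusion matrix as a DataFrame. Include predicted values in the label set
# so that answers such as "I don't know" get their own row and column;
# otherwise the matrix shape would not match the index and columns.
all_labels = sorted(set(labels) | set(preds))
cm = pd.DataFrame(confusion_matrix(labels, preds, labels=all_labels),
                   index=all_labels,
                   columns=all_labels)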

print(f'Accuracy: {acc:.2f}')
print(f'Precision (macro): {precision:.2f}')
print(f'Recall (macro): {recall:.2f}')
print(f'F1‑score (macro): {f1:.2f}')
print('Confusion matrix:')
cm

In [None]:

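# (Optional) Upload the dataset and results to LangSmith
# Requires an API key set in LANGSMITH_API_KEY.
# Method names follow the langsmith Python client; check them against
# the version you have installed.
import os
if Client and os.getenv("LANGSMITH_API_KEY"):
    client = Client()
    dataset_name = "demo_dataset_notebook"
    # Reuse the dataset if it already exists; otherwise create it.
    if client.has_dataset(dataset_name=dataset_name):
        dataset = client.read_dataset(dataset_name=dataset_name)
    else:
        dataset = client.create_dataset(dataset_name, description="Demo dataset from notebook")
    # In a LangSmith dataset, "outputs" holds the reference (expected) answer.
    # The dummy chain's prediction is stored as metadata alongside it.
    client.create_examples(
        dataset_id=dataset.id,
        inputs=[{"question": row["question"]} for row in results],
        outputs=[{"answer": row["expected"]} for row in results],
        metadata=[{"predicted": row["predicted"]} for row in results],
    )
    print(f"Uploaded {len(results)} examples to LangSmith.")
else:
    print("Install langsmith and set LANGSMITH_API_KEY to upload results.")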



In this notebook we built a miniature evaluation pipeline for a language‑model chain. We created a dataset of question/answer pairs, defined a dummy chain that returns fixed answers, computed accuracy, precision, recall and F1‑score, and demonstrated how to upload the results to LangSmith. In a real project you would replace the dummy chain with your own LangChain chain or call to an LLM, and you could use LangSmith's built‑in evaluators to measure metrics such as relevance, exact match or embedding similarity[1].
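
As a rough illustration of what swapping in different evaluators might look like, the sketch below runs a chain against several scoring functions at once. None of this is LangSmith's API: `exact_match`, `string_similarity`, `evaluate` and `toy_chain` are hypothetical names, and the character‑level ratio from Python's `difflib` is only a cheap stand‑in for a real embedding‑similarity metric.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def exact_match(predicted: str, expected: str) -> float:
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(predicted.strip().lower() == expected.strip().lower())


def string_similarity(predicted: str, expected: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, predicted.lower(), expected.lower()).ratio()


def evaluate(
    chain: Callable[[str], str],
    examples: List[dict],
    scorers: Dict[str, Callable[[str, str], float]],
) -> List[dict]:
    """Run the chain on each example and apply every scorer to its output."""
    rows = []
    for ex in examples:
        predicted = chain(ex["question"])
        row = {"question": ex["question"], "predicted": predicted}
        for name, scorer in scorers.items():
            row[name] = scorer(predicted, ex["expected"])
        rows.append(row)
    return rows


def toy_chain(question: str) -> str:
    """Hypothetical chain that knows a single answer."""
    return "Paris" if "france" in question.lower() else "I don't know"


toy_examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote 'Moby Dick'?", "expected": "Herman Melville"},
]
scorers = {"exact_match": exact_match, "similarity": string_similarity}

rows = evaluate(toy_chain, toy_examples, scorers)
mean_exact = sum(r["exact_match"] for r in rows) / len(rows)
print(f"mean exact match: {mean_exact:.2f}")  # 0.50
```

Keeping scorers as plain `(predicted, expected) -> float` functions makes it straightforward to add or replace a metric without touching the evaluation loop.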

LangSmith's unified dashboard allows you to trace prompts and responses, debug errors and monitor performance in real time[1]. Continuous evaluation and monitoring make it easier to keep your LLM application reliable as it grows and changes[1].
