# Evaluators

At a high level, an evaluator judges an invocation of your LLM application against a reference example and returns an evaluation score.

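In LangSmith, we represent this process as a function that takes a Run (representing the LLM app invocation) and an Example (representing the data point to evaluate), and returns Feedback (representing the evaluator's score of the LLM app invocation). Evaluators can also be written against the unpacked `inputs`, `reference_outputs`, and `outputs` dictionaries, as several of the examples below do.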

![Evaluator](../../images/evaluator.png)

Here is an example of a very simple custom evaluator that compares the output of a model to the expected output in the dataset:

In [2]:
def correct_label(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    # Exact-match check: score is 1 if the model's output equals the reference label, else 0.
    score = outputs.get("output") == reference_outputs.get("label")
    return {"score": int(score), "key": "correct_label"}
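
To see how such an evaluator behaves over several data points, the sketch below applies `correct_label` to a small, made-up list of examples and averages the scores into an accuracy. The `examples` list and its labels are hypothetical and only illustrate the call pattern.

```python
def correct_label(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    # Same exact-match evaluator as above, repeated so this block runs on its own.
    score = outputs.get("output") == reference_outputs.get("label")
    return {"score": int(score), "key": "correct_label"}

# Hypothetical data: each entry pairs a dataset example with an app output.
examples = [
    {"inputs": {"text": "great product"}, "reference": {"label": "positive"}, "outputs": {"output": "positive"}},
    {"inputs": {"text": "terrible service"}, "reference": {"label": "negative"}, "outputs": {"output": "positive"}},
    {"inputs": {"text": "works fine"}, "reference": {"label": "positive"}, "outputs": {"output": "positive"}},
]

results = [correct_label(e["inputs"], e["reference"], e["outputs"]) for e in examples]
accuracy = sum(r["score"] for r in results) / len(results)
print(f"accuracy = {accuracy:.2f}")  # prints: accuracy = 0.67
```

Each evaluator call scores a single example; aggregation across a dataset is a separate step.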

### LLM-as-Judge Evaluation

LLM-as-judge evaluators use LLMs to score system output. To use them, you typically encode the grading rules / criteria in the LLM prompt. They can be reference-free (e.g., check if system output contains offensive content or adheres to specific criteria). Or, they can compare task output to a reference (e.g., check if the output is factually accurate relative to the reference).
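
As a rough sketch of the reference-free case, the evaluator below builds a grading prompt from the criteria and the output alone, with no reference answer. The `judge` callable, the `make_criteria_evaluator` name, and the PASS/FAIL convention are assumptions made for illustration; in practice `judge` would wrap a call to your LLM provider.

```python
from typing import Callable

def make_criteria_evaluator(criteria: str, judge: Callable[[str], str], key: str = "meets_criteria"):
    """Build a reference-free evaluator; `judge` maps a prompt to the model's text reply."""
    def evaluator(inputs: dict, outputs: dict) -> dict:
        prompt = (
            "You are a strict grader.\n"
            f"Criteria: {criteria}\n"
            f"Question: {inputs.get('question', '')}\n"
            f"Response: {outputs.get('output', '')}\n"
            "Answer with exactly one word: PASS or FAIL."
        )
        verdict = judge(prompt).strip().upper()
        # Anything other than a clear PASS counts as a failure.
        return {"score": int(verdict.startswith("PASS")), "key": key}
    return evaluator

# Stand-in judge so the sketch runs offline; replace it with a real LLM call.
def fake_judge(prompt: str) -> str:
    response_line = prompt.split("Response: ", 1)[1].splitlines()[0]
    return "FAIL" if "idiot" in response_line.lower() else "PASS"

polite = make_criteria_evaluator("The response must not be insulting.", fake_judge)
print(polite({"question": "How do I reset my password?"}, {"output": "Click 'Forgot password'."}))
print(polite({"question": "How do I reset my password?"}, {"output": "Figure it out, idiot."}))
```

Passing the judge in as a callable keeps the grading criteria separate from the model call, so the same evaluator can be tested offline with a stub.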

Here is an example of how you might define an LLM-as-judge evaluator that asks the model for structured (JSON) output:

In [None]:
# You can set API keys inline (the Anthropic client below reads ANTHROPIC_API_KEY)
import os
os.environ["ANTHROPIC_API_KEY"] = ""

In [1]:
# Or you can use a .env file
import os
from dotenv import load_dotenv
load_dotenv(dotenv_path="../../../.env", override=True)
os.environ["USER_AGENT"] = "496"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [3]:
from anthropic import Anthropic
from pydantic import BaseModel, Field
import json

anthropic_client = Anthropic()

class Similarity_Score(BaseModel):
    similarity_score: int = Field(description="Semantic similarity score between 1 and 10, where 1 means unrelated and 10 means identical.")

# NOTE: This is our LLM-as-judge evaluator
def compare_semantic_similarity(inputs: dict, reference_outputs: dict, outputs: dict):
    input_question = inputs["question"]
    reference_response = reference_outputs["output"]
    run_response = outputs["output"]

    judge_prompt = f"""You are a semantic similarity evaluator. Compare the meanings of two responses to a question.

Question: {input_question}

Reference Response (correct answer): {reference_response}

New Response (to evaluate): {run_response}

Compare these two responses and provide a similarity score between 1 and 10:
- 1 = completely unrelated
- 10 = identical in meaning

Even if the wording differs, focus on whether they convey the same information and answer the question in the same way.

Return your response as JSON: {{"similarity_score": <number>}}"""

    response = anthropic_client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=300,
        messages=[{"role": "user", "content": judge_prompt}]
    )

    # Parse the JSON object out of the response text
    try:
        text = response.content[0].text
        start = text.find("{")
        end = text.rfind("}") + 1
        result = json.loads(text[start:end])
        similarity_score = result.get("similarity_score", 5)
    except (ValueError, IndexError, AttributeError):
        # Missing or malformed JSON: fall back to a neutral mid-range score
        similarity_score = 5

    return {"score": similarity_score, "key": "similarity"}
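
The parsing step is the most fragile part of this evaluator, since the model may wrap the JSON in extra prose or return a value outside the 1 to 10 range. One possible way to isolate it is a small helper like the one below; `parse_similarity_score` and its clamping behaviour are illustrative assumptions, not part of the original notebook or the LangSmith API.

```python
import json

def parse_similarity_score(text: str, default: int = 5, low: int = 1, high: int = 10) -> int:
    """Extract `similarity_score` from a model reply and clamp it to [low, high]."""
    start, end = text.find("{"), text.rfind("}") + 1
    try:
        value = json.loads(text[start:end])["similarity_score"]
        score = int(value)
    except (ValueError, KeyError, TypeError):
        # Covers missing or malformed JSON, a missing key, and non-numeric values.
        return default
    return max(low, min(high, score))

print(parse_similarity_score('Sure! {"similarity_score": 8}'))  # prints 8
print(parse_similarity_score('{"similarity_score": 42}'))       # prints 10 (clamped)
print(parse_similarity_score("I cannot answer that."))          # prints 5 (default)
```

Keeping the parser separate from the API call makes it possible to test the fallback paths without invoking the model.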

Let's try this out!

NOTE: We purposely made this answer wrong, so we expect to see a low score.

In [4]:
# From Dataset Example
inputs = {
    "question": "Is LangSmith natively integrated with LangChain?"
}
reference_outputs = {
    "output": "Yes, LangSmith is natively integrated with LangChain, as well as LangGraph."
}

# From Run
outputs = {
    "output": "No, LangSmith is NOT integrated with LangChain."
}

similarity_score = compare_semantic_similarity(inputs, reference_outputs, outputs)
print(f"Semantic similarity score: {similarity_score}")

Semantic similarity score: {'score': 1, 'key': 'similarity'}


You can also define evaluators using Run and Example directly!

In [6]:
from langsmith.schemas import Run, Example

def compare_semantic_similarity_v2(root_run: Run, example: Example):
    input_question = example["inputs"]["question"]
    reference_response = example["outputs"]["output"]
    run_response = root_run["outputs"]["output"]

    user_message = f"""Compare the semantic similarity of these two responses to the question.

Question: {input_question}

Reference Response (correct answer): {reference_response}

New Response (to evaluate): {run_response}

Provide a similarity score between 1 and 10, where:
- 1 means completely unrelated
- 10 means identical in meaning

Return your response as JSON with a single field: similarity_score (integer)"""

    response = anthropic_client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system="You are a semantic similarity evaluator. Compare the meanings of two responses to a question, Reference Response and New Response, where the reference is the correct answer, and we are trying to judge if the new response is similar.",
        messages=[
            {"role": "user", "content": user_message}
        ]
    )

    # Parse the response: prefer a JSON object, otherwise fall back to the first integer in the text
    import re
    response_text = response.content[0].text

    try:
        if "{" in response_text and "}" in response_text:
            start = response_text.find("{")
            end = response_text.rfind("}") + 1
            json_str = response_text[start:end]
            parsed = json.loads(json_str)
            similarity_score = parsed.get("similarity_score", 5)
        else:
            numbers = re.findall(r'\d+', response_text)
            similarity_score = int(numbers[0]) if numbers else 5
    except (ValueError, AttributeError):
        # Malformed JSON: fall back to a neutral mid-range score
        similarity_score = 5

    return {"score": similarity_score, "key": "similarity"}

In [8]:
sample_run = {
    "name": "Sample Run",
    "inputs": {
        "question": "Is LangSmith natively integrated with LangChain?"
    },
    "outputs": {
        "output": "Yes, LangSmith is natively integrated with LangChain, as well as LangGraph."
    },
    "is_root": True,
    "status": "success",
    "extra": {
        "metadata": {
            "key": "value"
        }
    }
}

sample_example = {
    "inputs": {
        "question": "Is LangSmith natively integrated with LangChain?"
    },
    "outputs": {
        "output": "Yes, LangSmith is natively integrated with LangChain, as well as LangGraph."
    },
    "metadata": {
        "dataset_split": [
            "AI generated",
            "base"
        ]
    }
}

similarity_score = compare_semantic_similarity_v2(sample_run, sample_example)
print(f"Semantic similarity score: {similarity_score}")

Semantic similarity score: {'score': 10, 'key': 'similarity'}
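
The sample above passes plain dictionaries, so the evaluator indexes them with `[...]`. `Run` and `Example` objects from `langsmith.schemas` expose `inputs` and `outputs` as attributes instead. A minimal sketch of a helper that accepts either form is shown below; `get_field` and `unpack_run_and_example` are hypothetical names, and `SimpleNamespace` is only a stand-in for real `Run` and `Example` objects so the block runs without LangSmith installed.

```python
from types import SimpleNamespace
from typing import Any, Tuple

def get_field(obj: Any, name: str) -> dict:
    """Read `name` from a dict or an attribute-style object; return {} if missing or None."""
    value = obj.get(name) if isinstance(obj, dict) else getattr(obj, name, None)
    return value or {}

def unpack_run_and_example(root_run: Any, example: Any) -> Tuple[dict, dict, dict]:
    """Return (inputs, reference_outputs, outputs) for use inside an evaluator body."""
    return get_field(example, "inputs"), get_field(example, "outputs"), get_field(root_run, "outputs")

# Dict form, as in the sample above.
run_dict = {"outputs": {"output": "Yes."}}
example_dict = {"inputs": {"question": "Is it integrated?"}, "outputs": {"output": "Yes, natively."}}

# Attribute form, standing in for langsmith.schemas.Run / Example.
run_obj = SimpleNamespace(outputs={"output": "Yes."})
example_obj = SimpleNamespace(inputs={"question": "Is it integrated?"}, outputs={"output": "Yes, natively."})

inputs, reference_outputs, outputs = unpack_run_and_example(run_obj, example_obj)
print(inputs["question"], "|", reference_outputs["output"], "|", outputs["output"])
```

With a helper like this, the body of an evaluator such as `compare_semantic_similarity_v2` could read its three dictionaries once and work with both the sample dictionaries and real objects.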
