# Evaluators

At a high level, an evaluator judges an invocation of your LLM application against a reference example and returns an evaluation score.

In LangSmith, an evaluator is a function that takes in a Run (representing the LLM app invocation) and an Example (representing the data point to evaluate), and returns Feedback (representing the evaluator's score of the LLM app invocation).

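Here is an example of a very simple custom evaluator that compares the output of a model to the expected output in the dataset. It uses the simplified signature, in which the example's inputs, the example's reference outputs, and the run's outputs arrive as plain dictionaries: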

In [1]:
def correct_label(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    # Score 1 if the model's output matches the reference label, otherwise 0
    score = outputs.get("output") == reference_outputs.get("label")
    return {"score": int(score), "key": "correct_label"}
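
To see what this evaluator returns, you can call it directly on hand-written dictionaries without involving LangSmith. The sketch below restates `correct_label` so that it runs on its own; the sample text and the labels `"positive"` and `"negative"` are hypothetical values chosen only to show a matching and a non-matching case.

```python
def correct_label(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    # Same evaluator as above, repeated so this snippet is self-contained
    score = outputs.get("output") == reference_outputs.get("label")
    return {"score": int(score), "key": "correct_label"}

# Hypothetical dataset example and two hypothetical run outputs
sample_inputs = {"text": "I loved this movie."}
sample_reference = {"label": "positive"}

match = correct_label(sample_inputs, sample_reference, {"output": "positive"})
mismatch = correct_label(sample_inputs, sample_reference, {"output": "negative"})

print(match)     # {'score': 1, 'key': 'correct_label'}
print(mismatch)  # {'score': 0, 'key': 'correct_label'}
```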

### LLM-as-Judge Evaluation

LLM-as-judge evaluators use LLMs to score system output. To use them, you typically encode the grading rules / criteria in the LLM prompt. They can be reference-free (e.g., check if system output contains offensive content or adheres to specific criteria). Or, they can compare task output to a reference (e.g., check if the output is factually accurate relative to the reference).
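
The two modes differ mainly in what the judge prompt contains. As a rough sketch (the helper name `build_judge_messages` and the prompt wording are illustrative assumptions, not part of LangSmith or the OpenAI SDK), a single function can assemble chat messages for either mode, depending on whether a reference is supplied:

```python
from typing import Optional

def build_judge_messages(criteria: str, output: str, reference: Optional[str] = None) -> list:
    """Assemble chat messages for an LLM judge (hypothetical helper)."""
    user_parts = [f"Response to grade: {output}"]
    if reference is not None:
        # Reference-based: the judge compares the output to a known-good answer
        user_parts.insert(0, f"Reference Response: {reference}")
    return [
        {"role": "system", "content": f"You are a grader. Criteria: {criteria}"},
        {"role": "user", "content": "\n".join(user_parts)},
    ]

# Reference-free: the criteria alone define what a good output looks like
reference_free = build_judge_messages(
    criteria="The response must not contain offensive content.",
    output="LangSmith helps you trace and evaluate LLM apps.",
)

# Reference-based: the judge also sees the expected answer
reference_based = build_judge_messages(
    criteria="The response must be factually consistent with the reference.",
    output="No, LangSmith is NOT integrated with LangChain.",
    reference="Yes, LangSmith is natively integrated with LangChain.",
)

print(reference_free[1]["content"])
print(reference_based[1]["content"])
```

A list built this way could then be passed as the `messages` argument of a chat completion call.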

Here is an example of how you might define an LLM-as-judge evaluator with structured output:

In [None]:
# You can set your API key inline
import os
os.environ["OPENAI_API_KEY"] = ""

In [2]:
# Or you can use a .env file
from dotenv import load_dotenv
load_dotenv(dotenv_path=".env", override=True)

True

In [3]:
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

class Similarity_Score(BaseModel):
    similarity_score: int = Field(description="Semantic similarity score between 1 and 10, where 1 means unrelated and 10 means identical.")

# NOTE: This is our evaluator
def compare_semantic_similarity(inputs: dict, reference_outputs: dict, outputs: dict):
    input_question = inputs["question"]
    reference_response = reference_outputs["output"]
    run_response = outputs["output"]
    
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {   
                "role": "system",
                "content": (
                    "You are a semantic similarity evaluator. Compare the meanings of two responses to a question, "
                    "Reference Response and New Response, where the reference is the correct answer, and we are trying to judge if the new response is similar. "
                    "Provide a score between 1 and 10, where 1 means completely unrelated, and 10 means identical in meaning."
                ),
            },
            # Labels match the names used in the system prompt ("Reference Response" and "New Response")
            {"role": "user", "content": f"Question: {input_question}\n Reference Response: {reference_response}\n New Response: {run_response}"}
        ],
        response_format=Similarity_Score,
    )

    similarity_score = completion.choices[0].message.parsed
    return {"score": similarity_score.similarity_score, "key": "similarity"}


Let's try this out!

NOTE: We purposely made this answer wrong, so we expect to see a low score.

In [4]:
# From Dataset Example
inputs = {
  "question": "Is LangSmith natively integrated with LangChain?"
}
reference_outputs = {
  "output": "Yes, LangSmith is natively integrated with LangChain, as well as LangGraph."
}


# From Run
outputs = {
  "output": "No, LangSmith is NOT integrated with LangChain."
}

similarity_score = compare_semantic_similarity(inputs, reference_outputs, outputs)
print(f"Semantic similarity score: {similarity_score}")

Semantic similarity score: {'score': 1, 'key': 'similarity'}
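
The judge reports its score on a 1 to 10 scale. If you want to combine it with evaluators that score between 0 and 1, one option is to divide by the maximum score. The helper `normalize_score` below is a hypothetical name introduced for this sketch, not a LangSmith function:

```python
def normalize_score(result: dict, max_score: int = 10) -> dict:
    """Return a copy of an evaluator result with its score divided by max_score."""
    return {**result, "score": result["score"] / max_score}

raw = {"score": 1, "key": "similarity"}
normalized = normalize_score(raw)
print(normalized)  # {'score': 0.1, 'key': 'similarity'}
```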


You can also define evaluators using Run and Example directly!

In [5]:
from langsmith.schemas import Run, Example

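def compare_semantic_similarity_v2(root_run: Run, example: Example):
    # Run and Example objects expose their fields as attributes, while the
    # sample cell below passes plain dicts (key access); support both.
    def _get(obj, field):
        return obj[field] if isinstance(obj, dict) else getattr(obj, field)

    input_question = _get(example, "inputs")["question"]
    reference_response = _get(example, "outputs")["output"]
    run_response = _get(root_run, "outputs")["output"]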
    
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {   
                "role": "system",
                "content": (
                    "You are a semantic similarity evaluator. Compare the meanings of two responses to a question, "
                    "Reference Response and New Response, where the reference is the correct answer, and we are trying to judge if the new response is similar. "
                    "Provide a score between 1 and 10, where 1 means completely unrelated, and 10 means identical in meaning."
                ),
            },
            # Labels match the names used in the system prompt ("Reference Response" and "New Response")
            {"role": "user", "content": f"Question: {input_question}\n Reference Response: {reference_response}\n New Response: {run_response}"}
        ],
        response_format=Similarity_Score,
    )

    similarity_score = completion.choices[0].message.parsed
    return {"score": similarity_score.similarity_score, "key": "similarity"}

In [6]:
sample_run = {
  "name": "Sample Run",
  "inputs": {
    "question": "Is LangSmith natively integrated with LangChain?"
  },
  "outputs": {
    "output": "No, LangSmith is NOT integrated with LangChain."
  },
  "is_root": True,
  "status": "success",
  "extra": {
    "metadata": {
      "key": "value"
    }
  }
}

sample_example = {
  "inputs": {
    "question": "Is LangSmith natively integrated with LangChain?"
  },
  "outputs": {
    "output": "Yes, LangSmith is natively integrated with LangChain, as well as LangGraph."
  },
  "metadata": {
    "dataset_split": [
      "AI generated",
      "base"
    ]
  }
}

similarity_score = compare_semantic_similarity_v2(sample_run, sample_example)
print(f"Semantic similarity score: {similarity_score}")

Semantic similarity score: {'score': 1, 'key': 'similarity'}


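The LLM-as-judge evaluator can also be combined with simple heuristic evaluators. The next cell defines three of them (length, key-term coverage, and yes/no consistency) and compares all four scores on a correct answer and on a wrong answer.

In [7]: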
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

# Evaluator 1: Check response length
def evaluate_length(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    ref_length = len(reference_outputs["output"].split())
    output_length = len(outputs["output"].split())
    
    ratio = output_length / ref_length if ref_length > 0 else 0
    score = 1.0 if 0.5 <= ratio <= 1.5 else max(0, 1.0 - abs(1 - ratio))
    
    return {"score": round(score, 2), "key": "length_check"}

# Evaluator 2: Check for key terms
def evaluate_key_terms(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    ref_terms = set(word for word in reference_outputs["output"].split() 
                   if word[0].isupper() or len(word) > 8)
    output_terms = set(word for word in outputs["output"].split() 
                      if word[0].isupper() or len(word) > 8)
    
    if not ref_terms:
        return {"score": 1.0, "key": "term_coverage"}
    
    coverage = len(ref_terms & output_terms) / len(ref_terms)
    return {"score": round(coverage, 2), "key": "term_coverage"}

# Evaluator 3: Yes/No consistency
def evaluate_yes_no(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    ref_text = reference_outputs["output"].lower()
    output_text = outputs["output"].lower()
    
    ref_yes = any(w in ref_text[:50] for w in ["yes", "correct", "true"])
    ref_no = any(w in ref_text[:50] for w in ["no", "not", "incorrect"])
    
    out_yes = any(w in output_text[:50] for w in ["yes", "correct", "true"])
    out_no = any(w in output_text[:50] for w in ["no", "not", "incorrect"])
    
    if (ref_yes and out_yes) or (ref_no and out_no):
        score = 1.0
    elif (ref_yes and out_no) or (ref_no and out_yes):
        score = 0.0
    else:
        score = 0.5
    
    return {"score": score, "key": "yes_no_match"}

# Test all evaluators
test_inputs = {"question": "Is LangSmith natively integrated with LangChain?"}
test_reference = {"output": "Yes, LangSmith is natively integrated with LangChain, as well as LangGraph."}
test_correct = {"output": "Yes, LangSmith has native integration with both LangChain and LangGraph."}
test_wrong = {"output": "No, LangSmith is NOT integrated with LangChain."}

print("Testing with correct answer:")
e1 = compare_semantic_similarity(test_inputs, test_reference, test_correct)
e2 = evaluate_length(test_inputs, test_reference, test_correct)
e3 = evaluate_key_terms(test_inputs, test_reference, test_correct)
e4 = evaluate_yes_no(test_inputs, test_reference, test_correct)

print(f"Semantic similarity: {e1['score']}/10")
print(f"Length check: {e2['score']}")
print(f"Term coverage: {e3['score']}")
print(f"Yes/No match: {e4['score']}")
avg_correct = (e1['score']/10 + e2['score'] + e3['score'] + e4['score']) / 4
print(f"Average: {avg_correct:.2f}\n")

print("Testing with wrong answer:")
e1 = compare_semantic_similarity(test_inputs, test_reference, test_wrong)
e2 = evaluate_length(test_inputs, test_reference, test_wrong)
e3 = evaluate_key_terms(test_inputs, test_reference, test_wrong)
e4 = evaluate_yes_no(test_inputs, test_reference, test_wrong)

print(f"Semantic similarity: {e1['score']}/10")
print(f"Length check: {e2['score']}")
print(f"Term coverage: {e3['score']}")
print(f"Yes/No match: {e4['score']}")
avg_wrong = (e1['score']/10 + e2['score'] + e3['score'] + e4['score']) / 4
print(f"Average: {avg_wrong:.2f}")

Testing with correct answer:
Semantic similarity: 9/10
Length check: 1.0
Term coverage: 0.6
Yes/No match: 1.0
Average: 0.88

Testing with wrong answer:
Semantic similarity: 2/10
Length check: 1.0
Term coverage: 0.4
Yes/No match: 0.0
Average: 0.40
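
The heuristic evaluators above compare raw whitespace-separated tokens, so `LangChain,` and `LangChain` count as different terms, and a substring test such as `"no" in text` would also fire on a word like `unknown`. A possible refinement, sketched here under the assumption that stripping punctuation and matching whole words suits your data (the `_v2` names and the `tokenize` helper are hypothetical), is:

```python
import re

def tokenize(text: str) -> list:
    """Split text into word tokens, dropping surrounding punctuation."""
    return re.findall(r"[A-Za-z0-9']+", text)

def evaluate_key_terms_v2(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    def key_terms(text: str) -> set:
        # Same rule as the original: capitalized words or words longer than 8 characters
        return {w for w in tokenize(text) if w[0].isupper() or len(w) > 8}

    ref_terms = key_terms(reference_outputs["output"])
    if not ref_terms:
        return {"score": 1.0, "key": "term_coverage"}
    coverage = len(ref_terms & key_terms(outputs["output"])) / len(ref_terms)
    return {"score": round(coverage, 2), "key": "term_coverage"}

def evaluate_yes_no_v2(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    yes_words = {"yes", "correct", "true"}
    no_words = {"no", "not", "incorrect"}

    def polarity(text: str):
        # Look only at whole words within the first 50 characters
        words = set(tokenize(text[:50].lower()))
        return bool(words & yes_words), bool(words & no_words)

    ref_yes, ref_no = polarity(reference_outputs["output"])
    out_yes, out_no = polarity(outputs["output"])

    if (ref_yes and out_yes) or (ref_no and out_no):
        score = 1.0
    elif (ref_yes and out_no) or (ref_no and out_yes):
        score = 0.0
    else:
        score = 0.5
    return {"score": score, "key": "yes_no_match"}

reference = {"output": "Yes, LangSmith is natively integrated with LangChain, as well as LangGraph."}
correct = {"output": "Yes, LangSmith has native integration with both LangChain and LangGraph."}
wrong = {"output": "No, LangSmith is NOT integrated with LangChain."}

print(evaluate_key_terms_v2({}, reference, correct)["score"])  # 0.8
print(evaluate_key_terms_v2({}, reference, wrong)["score"])    # 0.6
print(evaluate_yes_no_v2({}, reference, correct)["score"])     # 1.0
print(evaluate_yes_no_v2({}, reference, wrong)["score"])       # 0.0
```

With punctuation stripped, term coverage for the two sample answers becomes 0.8 and 0.6 rather than 0.6 and 0.4, because `LangChain` now matches in both. The 50-character window can still cut a word in half, so treat this as a starting point rather than a robust check.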
