# Evaluators

At a high level, an evaluator judges an invocation of your LLM application against a reference example and returns an evaluation score.

In LangSmith evaluators, we represent this process as a function that takes in a Run (representing the LLM app invocation) and an Example (representing the data point to evaluate), and returns Feedback (representing the evaluator's score of the LLM app invocation).

![Evaluator](../../images/evaluator.png)

Here is an example of a very simple custom evaluator that compares the output of a model to the expected output in the dataset:

In [1]:
from langsmith.schemas import Example, Run

def correct_label(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
  score = outputs.get("output") == reference_outputs.get("label")
  return {"score": int(score), "key": "correct_label"}
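
To see the shape of the returned feedback without going through LangSmith, the evaluator can be called directly on hand-built dictionaries. The sketch below repeats `correct_label` so that the block runs on its own; the sample text and labels (`"Toxic"`, `"Not toxic"`) are illustrative assumptions, not values from a real dataset.

```python
def correct_label(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    # Exact match between the app's output and the dataset label.
    score = outputs.get("output") == reference_outputs.get("label")
    return {"score": int(score), "key": "correct_label"}


# Hypothetical data point and two hypothetical app outputs.
sample_inputs = {"text": "You are wonderful"}
sample_reference = {"label": "Not toxic"}

match = correct_label(sample_inputs, sample_reference, {"output": "Not toxic"})
mismatch = correct_label(sample_inputs, sample_reference, {"output": "Toxic"})

print(match)     # {'score': 1, 'key': 'correct_label'}
print(mismatch)  # {'score': 0, 'key': 'correct_label'}
```

One thing to keep in mind: `.get` returns `None` for a missing key, so an output and a reference that both lack their key would compare equal and score 1.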

### LLM-as-Judge Evaluation

LLM-as-judge evaluators use LLMs to score system output. To use them, you typically encode the grading rules or criteria in the LLM prompt. They can be reference-free (e.g., checking whether system output contains offensive content or adheres to specific criteria), or they can compare task output to a reference (e.g., checking whether the output is factually accurate relative to the reference).

Here is an example of how you might define an LLM-as-judge evaluator with structured output:

In [2]:
# Or you can use a .env file
from dotenv import load_dotenv
load_dotenv(override=True)

True

In [3]:
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

class Similarity_Score(BaseModel):
    similarity_score: int = Field(description="Semantic similarity score between 1 and 10, where 1 means unrelated and 10 means identical.")

# NOTE: This is our evaluator
def compare_semantic_similarity(inputs: dict, reference_outputs: dict, outputs: dict):
    input_question = inputs["question"]
    reference_response = reference_outputs["output"]
    run_response = outputs["output"]
    
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a semantic similarity evaluator. Compare the meanings of two responses to a question, "
                    "Reference Response and Run Response, where the reference is the correct answer, and we are trying to judge if the run response is similar. "
                    "Provide a score between 1 and 10, where 1 means completely unrelated, and 10 means identical in meaning."
                ),
            },
            {"role": "user", "content": f"Question: {input_question}\n Reference Response: {reference_response}\n Run Response: {run_response}"}
        ],
        response_format=Similarity_Score,
    )

    similarity_score = completion.choices[0].message.parsed
    return {"score": similarity_score.similarity_score, "key": "similarity"}


Let's try this out!

NOTE: We purposely made this answer wrong, so we expect to see a low score.

In [4]:
# From Dataset Example
inputs = {
  "question": "Is LangSmith natively integrated with LangChain?"
}
reference_outputs = {
  "output": "Yes, LangSmith is natively integrated with LangChain, as well as LangGraph."
}


# From Run
outputs = {
  "output": "No, LangSmith is NOT integrated with LangChain."
}

similarity_score = compare_semantic_similarity(inputs, reference_outputs, outputs)
print(f"Semantic similarity score: {similarity_score}")

Semantic similarity score: {'score': 2, 'key': 'similarity'}


You can also define evaluators using Run and Example directly!

In [5]:
from langsmith.schemas import Run, Example

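def compare_semantic_similarity_v2(root_run: Run, example: Example):
    # NOTE: The sample run and example used in this notebook are plain dicts, so we index with [].
    # Real Run and Example objects expose these fields as attributes
    # (root_run.outputs, example.inputs, example.outputs).
    input_question = example["inputs"]["question"]
    reference_response = example["outputs"]["output"]
    run_response = root_run["outputs"]["output"]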
    
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a semantic similarity evaluator. Compare the meanings of two responses to a question, "
                    "Reference Response and Run Response, where the reference is the correct answer, and we are trying to judge if the run response is similar. "
                    "Provide a score between 1 and 10, where 1 means completely unrelated, and 10 means identical in meaning."
                ),
            },
            {"role": "user", "content": f"Question: {input_question}\n Reference Response: {reference_response}\n Run Response: {run_response}"}
        ],
        response_format=Similarity_Score,
    )

    similarity_score = completion.choices[0].message.parsed
    return {"score": similarity_score.similarity_score, "key": "similarity"}

In [6]:
sample_run = {
  "name": "Sample Run",
  "inputs": {
    "question": "Is LangSmith natively integrated with LangChain?"
  },
  "outputs": {
    "output": "No, LangSmith is NOT integrated with LangChain."
  },
  "is_root": True,
  "status": "success",
  "extra": {
    "metadata": {
      "key": "value"
    }
  }
}

sample_example = {
  "inputs": {
    "question": "Is LangSmith natively integrated with LangChain?"
  },
  "outputs": {
    "output": "Yes, LangSmith is natively integrated with LangChain, as well as LangGraph."
  },
  "metadata": {
    "dataset_split": [
      "AI generated",
      "base"
    ]
  }
}

similarity_score = compare_semantic_similarity_v2(sample_run, sample_example)
print(f"Semantic similarity score: {similarity_score}")

Semantic similarity score: {'score': 1, 'key': 'similarity'}
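
The samples above are plain dictionaries, whereas `Run` and `Example` objects are read through attributes. A minimal sketch of a helper that reads a field from either shape follows; `get_field` is a hypothetical name, and `SimpleNamespace` merely stands in for a real `Run` so the block runs without LangSmith installed.

```python
from types import SimpleNamespace
from typing import Any


def get_field(obj: Any, name: str) -> Any:
    """Read `name` from a dict (by key) or from an object (by attribute)."""
    if isinstance(obj, dict):
        return obj[name]
    return getattr(obj, name)


answer = "No, LangSmith is NOT integrated with LangChain."

# Dict-shaped run, like the hand-built sample.
dict_run = {"outputs": {"output": answer}}

# Object-shaped stand-in (assumption: mimics attribute access on a Run).
object_run = SimpleNamespace(outputs={"output": answer})

# Both shapes yield the same response text.
print(get_field(dict_run, "outputs")["output"] == get_field(object_run, "outputs")["output"])
```

An evaluator written against `get_field` could then be tested locally with dictionaries and reused unchanged on real objects.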


### Adding my own examples

In [7]:
inputs = {
  "question": "What is the capital city of France?"
}
reference_outputs = {
  "output": "Paris is the capital city of France."
}
outputs = {
  "output": "France's capital is Paris."
}
similarity_score = compare_semantic_similarity(inputs, reference_outputs, outputs)
print(f"Semantic similarity score: {similarity_score}")

Semantic similarity score: {'score': 10, 'key': 'similarity'}


In [10]:
inputs = {
  "question": "What causes the seasons on Earth?"
}
reference_outputs = {
  "output": "The tilt of the Earth's axis as it orbits the sun causes the seasons."
}
outputs = {
  "output": "Seasons change because of how the Earth moves around the sun."
}
similarity_score = compare_semantic_similarity(inputs, reference_outputs, outputs)
print(f"Semantic similarity score: {similarity_score}")

Semantic similarity score: {'score': 8, 'key': 'similarity'}
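
Running examples one at a time gets repetitive, so a small loop can apply an evaluator to several data points and average the scores. The sketch below is only an illustration: `run_over_examples` is a hypothetical helper, and `exact_match` is a stand-in evaluator (not an LLM judge) used so the block runs without an API key. In this notebook you would pass `compare_semantic_similarity` instead.

```python
from statistics import mean
from typing import Callable


def run_over_examples(
    evaluator: Callable[[dict, dict, dict], dict], examples: list[dict]
) -> dict:
    """Apply `evaluator` to each example; return all results plus the mean score."""
    results = [
        evaluator(ex["inputs"], ex["reference_outputs"], ex["outputs"])
        for ex in examples
    ]
    return {"results": results, "mean_score": mean(r["score"] for r in results)}


def exact_match(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    # Stand-in evaluator: 1 on an exact string match, otherwise 0.
    score = int(outputs["output"] == reference_outputs["output"])
    return {"score": score, "key": "exact_match"}


examples = [
    {
        "inputs": {"question": "What is the capital city of France?"},
        "reference_outputs": {"output": "Paris is the capital city of France."},
        "outputs": {"output": "Paris is the capital city of France."},
    },
    {
        "inputs": {"question": "What causes the seasons on Earth?"},
        "reference_outputs": {"output": "The tilt of the Earth's axis as it orbits the sun causes the seasons."},
        "outputs": {"output": "Seasons change because of how the Earth moves around the sun."},
    },
]

summary = run_over_examples(exact_match, examples)
print(summary["mean_score"])  # 0.5
```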


### Defining my own evaluator that gives a helpfulness score between 1 and 5 using LLM-as-judge:

In [11]:
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

class Helpfulness_Score(BaseModel):
    helpfulness_score: int = Field(
        description="Helpfulness score between 1 and 5, where 1 is unhelpful and 5 is very helpful and informative."
    )

def evaluate_helpfulness(inputs: dict, reference_outputs: dict, outputs: dict):
    input_question = inputs["question"]
    reference_response = reference_outputs["output"]
    run_response = outputs["output"]
    
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an evaluator tasked with scoring how helpful a response is to a given user question. "
                    "The reference response is the ideal answer, and the run response is what the model produced. "
                    "Consider completeness, relevance, and usefulness when assigning a score.\n"
                    "Score from 1 to 5, where 1 is unhelpful, off-topic, or confusing — and 5 is highly informative, correct, and complete."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {input_question}\n\n"
                    f"Reference Response: {reference_response}\n\n"
                    f"Run Response: {run_response}"
                )
            }
        ],
        response_format=Helpfulness_Score,
    )

    helpfulness_score = completion.choices[0].message.parsed
    return {"score": helpfulness_score.helpfulness_score, "key": "helpfulness"}


In [12]:
inputs = {"question": "What are the symptoms of dehydration?"}
reference_outputs = {"output": "Common symptoms include dry mouth, fatigue, dizziness, and dark-colored urine."}
outputs = {"output": "Drink water when you're thirsty."}

result = evaluate_helpfulness(inputs, reference_outputs, outputs)
print(f"Helpfulness score: {result}")


Helpfulness score: {'score': 1, 'key': 'helpfulness'}


In [13]:
inputs = {
  "question": "Who discovered gravity?"
}

reference_outputs = {
  "output": "Sir Isaac Newton is credited with the discovery of gravity after observing an apple fall from a tree, which led to his formulation of the laws of motion and universal gravitation."
}

outputs = {
  "output": "Newton found out about gravity by seeing an apple fall."
}

result = evaluate_helpfulness(inputs, reference_outputs, outputs)
print(f"Helpfulness score: {result}")


Helpfulness score: {'score': 3, 'key': 'helpfulness'}
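
The `Field` description only tells the judge which range to use; nothing in the evaluator checks that the returned integer actually falls inside it. One possible guard, sketched below, validates the result after the call. `check_score_range` is a hypothetical helper, and raising `ValueError` (rather than clamping the value) is a design assumption.

```python
def check_score_range(result: dict, low: int = 1, high: int = 5) -> dict:
    """Return `result` unchanged if its score lies within [low, high], else raise."""
    score = result["score"]
    if not low <= score <= high:
        raise ValueError(f"{result['key']} score {score} is outside {low}-{high}")
    return result


# An in-range result passes through untouched.
ok = check_score_range({"score": 3, "key": "helpfulness"})
print(ok)  # {'score': 3, 'key': 'helpfulness'}

# An out-of-range result is rejected.
try:
    check_score_range({"score": 10, "key": "helpfulness"})
except ValueError as err:
    print(err)  # helpfulness score 10 is outside 1-5
```

Wrapping the evaluator's return value, as in `check_score_range(evaluate_helpfulness(inputs, reference_outputs, outputs))`, would surface a misbehaving judge early instead of silently logging a bad score.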


### We can add an evaluator to our datasets on LangSmith directly using this option:

![image.png](attachment:image.png)

### Made a similarity evaluator using LLM-as-judge on the LangSmith website:

![image-2.png](attachment:image-2.png)

We can also create a Custom Code evaluator using Python code on the LangSmith website:

![image-3.png](attachment:image-3.png)
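
As a rough sketch of what such a Custom Code evaluator might contain: the LangSmith editor is assumed here to expect a function named `perform_eval` that receives the run and the example as dictionaries and returns a dictionary mapping a feedback key to a score. Treat that name and signature as an assumption to confirm against the template shown in the UI; the `exact_match` key is purely illustrative.

```python
def perform_eval(run: dict, example: dict) -> dict:
    # Assumption: run and example arrive as plain dicts with an "outputs" key.
    run_output = (run.get("outputs") or {}).get("output")
    reference_output = (example.get("outputs") or {}).get("output")

    # 1 only if the app produced an answer and it matches the reference exactly.
    is_match = run_output is not None and run_output == reference_output
    return {"exact_match": int(is_match)}


# Local smoke test with hand-built dictionaries.
result = perform_eval(
    {"outputs": {"output": "France's capital is Paris."}},
    {"outputs": {"output": "Paris is the capital city of France."}},
)
print(result)  # {'exact_match': 0}
```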