# Evaluators

## What I Learned
This video explained how evaluators score model runs against reference outputs, using metrics such as correctness or similarity. Evaluators can be defined in code or in the LangSmith UI, either as LLM-as-a-judge evaluators or as custom Python functions.

## Changes in Code
I implemented two evaluators for my Q&A dataset: an exact-match correctness check, and a similarity-based evaluator that uses a custom LLM prompt to judge how well model responses match the expected answers. The code was updated to include these evaluators and run them on the dataset.

In [None]:
from langsmith import Client
from langsmith.evaluation import evaluate
from langchain_openai import ChatOpenAI

client = Client()

# Define a custom evaluator
def correctness_evaluator(run, example):
    """Evaluate if the output matches the expected answer"""
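    # run.outputs can be None when the run errored, so fall back to an empty dict
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")

    # Exact match, ignoring case and surrounding whitespace
    score = 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0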
    
    return {
        "key": "correctness",
        "score": score
    }

# LLM-as-a-judge evaluator
def llm_judge_evaluator(run, example):
    """Use an LLM to judge response quality"""
    # temperature=0 keeps the judge's scores as repeatable as possible
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    
    predicted = run.outputs.get("answer", "")
    expected = example.outputs.get("answer", "")
    
    prompt = f"""Rate the similarity between these two answers on a scale of 0-1:
    
Expected: {expected}
Predicted: {predicted}

Return only a number between 0 and 1."""
    
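    response = llm.invoke(prompt)
    # The judge may reply with extra text; fall back to 0.0 if it is not a number,
    # and clamp numeric replies to the 0-1 range the prompt asks for
    try:
        score = min(1.0, max(0.0, float(response.content.strip())))
    except ValueError:
        score = 0.0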
    
    return {
        "key": "llm_similarity",
        "score": score
    }

print("Evaluators defined successfully")
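
To check an evaluator without calling LangSmith or OpenAI, it can be run against stand-in objects. This is a minimal sketch, not part of the original notebook. It redefines a whitespace-tolerant variant of the exact-match evaluator so the block is self-contained, and it uses `SimpleNamespace` as a hypothetical stub for the `run` and `example` objects, assuming only that each exposes an `outputs` dict with an `"answer"` key.

```python
from types import SimpleNamespace


def correctness_evaluator(run, example):
    """Exact-match check, ignoring case and surrounding whitespace."""
    # outputs may be None (e.g. a failed run), so fall back to an empty dict
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    score = 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0
    return {"key": "correctness", "score": score}


# Hypothetical stand-ins for a LangSmith run and dataset example
stub_run = SimpleNamespace(outputs={"answer": " Paris "})
stub_example = SimpleNamespace(outputs={"answer": "paris"})
failed_run = SimpleNamespace(outputs=None)

match_result = correctness_evaluator(stub_run, stub_example)
failed_result = correctness_evaluator(failed_run, stub_example)

print(match_result)
print(failed_result)
```

Once the evaluators behave as expected on stubs, they could be passed to `evaluate` through its `evaluators` argument, together with a target function and a dataset name; neither of those is shown in this cell.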