### Building Evaluators from Scratch with LangChain

The best way to understand evaluation is to build it. Using basic LangChain components, we can create custom chains that instruct an LLM to act as an impartial "judge", grading our RAG system's output based on criteria we define in a prompt. This gives us maximum control and transparency.

Let's begin with `Correctness`. Our goal is to create a chain that compares the `generated_answer` to a `ground_truth` answer and returns a score from 0 to 1.

In [1]:
import os
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace

from dotenv import load_dotenv
load_dotenv()


True

In [73]:

def get_hf_llm(model_name):

    model = HuggingFaceEndpoint(
        model=model_name,
        max_new_tokens=1024,
        huggingfacehub_api_token=os.getenv("HUGGINGFACE_API_KEY")
    )

    llm = ChatHuggingFace(
        llm=model
    )
    return llm

llm = get_hf_llm("openai/gpt-oss-20b")

In [8]:
from pydantic import BaseModel, Field
from langchain_core.output_parsers import JsonOutputParser

class ResultScore(BaseModel):
    score: float = Field(..., description="The score of the result, ranging from 0 to 1 where 1 is the best possible score.")

output_parser = JsonOutputParser(
    name="eval_parser",
    pydantic_object=ResultScore
)

correctness_prompt = PromptTemplate(
    template="""
    Question: {question}
    Ground Truth: {ground_truth}
    Generated Answer: {generated_answer}

    Evaluate the correctness of the generated answer compared to the ground truth.
    Score from 0 to 1, where 1 is perfectly correct and 0 is completely incorrect.
    
    {format_instructions}
    """,
    input_variables=["question", "ground_truth", "generated_answer"],
    partial_variables={"format_instructions": output_parser.get_format_instructions()}
)

correctness_chain = (
    correctness_prompt 
    | llm 
    | output_parser
)

In [9]:
def evaluate_correctness(question, ground_truth, generated_answer):
    """A helper function to run our custom correctness evaluation chain."""
    result = correctness_chain.invoke({
        "question": question, 
        "ground_truth": ground_truth, 
        "generated_answer": generated_answer
    })
    return result["score"]


# Test the correctness chain with a partially correct answer.
question = "What is the capital of France and Spain?"
ground_truth = "Paris and Madrid"
generated_answer = "Paris"
score = evaluate_correctness(question, ground_truth, generated_answer)

print(f"Correctness Score: {score}")


Correctness Score: 0.5
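
A judge model does not always return clean JSON or a score inside the requested range. The sketch below shows one way to guard such a reply before using the score. `parse_judge_score` is a hypothetical helper, not part of LangChain, and it assumes the raw reply is a string containing a single flat JSON object with a `score` key.

```python
import json
import re


def parse_judge_score(raw_reply: str) -> float:
    """Extract a 0-1 score from a judge reply (hypothetical helper).

    Assumes the reply contains one flat JSON object with a "score" key,
    possibly surrounded by extra text.
    """
    match = re.search(r"\{.*\}", raw_reply, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in reply: {raw_reply!r}")
    payload = json.loads(match.group(0))
    score = float(payload["score"])
    # Clamp instead of failing, so one out-of-range reply does not stop a batch run.
    return min(1.0, max(0.0, score))


print(parse_judge_score('Here is my verdict: {"score": 0.5}'))  # 0.5
print(parse_judge_score('{"score": 1.7}'))                      # 1.0
```

Clamping is a design choice: raising an error instead would surface a misbehaving judge sooner, at the cost of interrupting long evaluation runs.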


### Faithfulness

Faithfulness measures whether the generated answer can be derived from the retrieved context. It is arguably more important than correctness for RAG, as it's our primary defense against hallucination.

In [10]:
faithfulness_prompt = PromptTemplate(
    input_variables=["question", "context", "generated_answer"],
    template="""
    Question: {question}
    Context: {context}
    Generated Answer: {generated_answer}

    Evaluate if the generated answer to the question can be deduced from the context.
    Score of 0 or 1, where 1 is perfectly faithful *AND CAN BE DERIVED FROM THE CONTEXT* and 0 otherwise.
    You don't mind if the answer is correct; all you care about is if the answer can be deduced from the context.
    
    [... a few examples from the notebook to guide the LLM ...]

    Example:
    Question: What is 2+2?
    Context: 4.
    Generated Answer: 4.
    In this case, the context states '4', but it does not provide information to deduce the answer to 'What is 2+2?', so the score should be 0.
    
    {format_instructions}
    """,
    partial_variables={"format_instructions": output_parser.get_format_instructions()}
)

faithfulness_chain = (
    faithfulness_prompt
    | llm
    | output_parser
)

In [12]:
def evaluate_faithfulness(question, context, generated_answer):
    """A helper function to run our custom faithfulness evaluation chain."""
    result = faithfulness_chain.invoke({
        "question": question, 
        "context": context, 
        "generated_answer": generated_answer
    })
    return result["score"]

# Test the faithfulness chain. The answer is correct, but is it faithful?
question = "what is 3+3?"
context = "6"
generated_answer = "6"
score = evaluate_faithfulness(question, context, generated_answer)

print(f"Faithfulness Score: {score}")

Faithfulness Score: 0


This shows how strict a well-defined faithfulness metric can be. Even though the answer 6 is factually correct, it could not be logically deduced from the provided context "6".

The context didn't say 3+3 equals 6. Our judge correctly flagged this as an unfaithful answer, which likely means the LLM relied on its own pre-trained knowledge instead of the provided context.
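
Because correctness and faithfulness answer different questions, it can help to read the two scores together. The following is a minimal sketch of how they might be combined into a diagnosis. The `triage` helper and its 0.5 threshold are assumptions made for illustration, not part of the notebook.

```python
def triage(correctness: float, faithfulness: float, threshold: float = 0.5) -> str:
    """Label an answer from its two judge scores (hypothetical helper).

    The 0.5 threshold is an assumption for illustration, not a recommended value.
    """
    correct = correctness >= threshold
    faithful = faithfulness >= threshold
    if correct and faithful:
        return "grounded and correct"
    if correct:
        return "correct but not grounded in the context"
    if faithful:
        return "grounded but incorrect: check the retrieved context"
    return "incorrect and not grounded"


# The 3+3 case: faithfulness was scored 0; a correctness of 1.0 is assumed here.
print(triage(correctness=1.0, faithfulness=0.0))
```

The "correct but not grounded" label is the one to watch in a RAG system, since the answer happened to be right without support from the retrieved documents.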

---

However, for faster and more robust testing, dedicated evaluation frameworks are the way to go.

We'll explore three popular frameworks:
- deepeval
- grouse
- RAGAS, a RAG-specific framework

### DeepEval

deepeval is a powerful, open-source framework designed to make LLM evaluation simple and intuitive. It provides a set of well-defined metrics that can be easily applied to your RAG pipeline's outputs.

The workflow involves creating `LLMTestCase` objects and measuring them against metrics such as `GEval` (used below with custom criteria to judge correctness), `FaithfulnessMetric`, and `ContextualRelevancyMetric`.

In [None]:
# Subclass DeepEvalBaseLLM so deepeval can use our custom LangChain chat model as the judge
from deepeval.models import DeepEvalBaseLLM

class HFModel(DeepEvalBaseLLM):
    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        return self.load_model().invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        res = await self.load_model().ainvoke(prompt)
        return res.content
    
    def get_model_name(self, *args, **kwargs):
        return super().get_model_name(*args, **kwargs)
    
custom_llm = HFModel(llm)
custom_llm.generate("Hello")

'Hello! 👋 How can I help you today?'

In [29]:
from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


# Create test cases
test_case_correctness = LLMTestCase(
    input="What is the capital of Spain?",
    expected_output="Madrid is the capital of Spain.",
    actual_output="MadriD."
)

answer_correctness = GEval(
    name="Answer Correctness",
    criteria="Evaluate if the actual output matches or closely matches the expected output. If the answer is not correct or complete, reduce score.",
    model=custom_llm,
    evaluation_params=[
        LLMTestCaseParams.INPUT, 
        LLMTestCaseParams.ACTUAL_OUTPUT, 
        LLMTestCaseParams.EXPECTED_OUTPUT
    ]
)

# The evaluate() function runs all test cases against all specified metrics
evaluation_results = evaluate(
    test_cases=[test_case_correctness],
    metrics=[answer_correctness]
)

print(evaluation_results)



Metrics Summary

  - ✅ Answer Correctness [GEval] (score: 0.5, threshold: 0.5, strict: False, evaluation model: None, reason: The actual output correctly identifies Madrid as the capital but omits the explanatory phrase "is the capital of Spain," so it only partially matches the expected content and lacks completeness., error: None)

For test case:

  - input: What is the capital of Spain?
  - actual output: MadriD.
  - expected output: Madrid is the capital of Spain.
  - context: None
  - retrieval context: None


Overall Metric Pass Rates

Answer Correctness [GEval]: 100.00% pass rate




test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Answer Correctness [GEval]', threshold=0.5, success=True, score=0.5, reason='The actual output correctly identifies Madrid as the capital but omits the explanatory phrase "is the capital of Spain," so it only partially matches the expected content and lacks completeness.', strict_mode=False, evaluation_model=None, error=None, evaluation_cost=None, verbose_logs='Criteria:\nEvaluate if the actual output matches or closely matches the expected output. If the answer is not correct or complete, reduce score. \n \nEvaluation Steps:\n[\n    "Check if Actual Output precisely matches Expected Output.",\n    "If not an exact match, assess whether the content is semantically or functionally equivalent.",\n    "Confirm that all required elements from Expected Output are present in Actual Output.",\n    "Adjust the score based on correctness and completeness."\n] \n \nRubric:\nNone \n \nScore: 0.5')], conversa
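
The summary above reports a score of 0.5 against a threshold of 0.5 and counts it as a pass. As a rough sketch of that bookkeeping, assuming a metric passes when its score is at least the threshold (consistent with this output, but not taken from deepeval's source), the pass rate could be computed as follows. `MetricResult` and `pass_rate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical container mirroring the fields printed in the summary."""
    name: str
    score: float
    threshold: float = 0.5

    @property
    def success(self) -> bool:
        # Assumption: a score equal to the threshold still passes.
        return self.score >= self.threshold


def pass_rate(results: list[MetricResult]) -> float:
    """Return the percentage of results that meet their threshold."""
    if not results:
        return 0.0
    return 100.0 * sum(r.success for r in results) / len(results)


results = [MetricResult("Answer Correctness [GEval]", score=0.5)]
print(f"{pass_rate(results):.2f}% pass rate")  # 100.00% pass rate
```

A borderline pass like this one is a reminder to choose the threshold deliberately rather than rely on the default.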

In [31]:
test_case_faithfulness = LLMTestCase(
    input="what is 3+3?",
    actual_output="6",
    retrieval_context=["6"]
)

answer_faithfulness = FaithfulnessMetric(
    model=custom_llm
)

evaluation_results = evaluate(
    test_cases=[test_case_faithfulness],
    metrics=[answer_faithfulness]
)

print(evaluation_results)



Metrics Summary

  - ✅ Faithfulness (score: 1.0, threshold: 0.5, strict: False, evaluation model: None, reason: The score is 1.00 because there are no contradictions to indicate any mismatch between the actual output and the retrieval context., error: None)

For test case:

  - input: what is 3+3?
  - actual output: 6
  - expected output: None
  - context: None
  - retrieval context: ['6']


Overall Metric Pass Rates

Faithfulness: 100.00% pass rate




test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Faithfulness', threshold=0.5, success=True, score=1.0, reason='The score is 1.00 because there are no contradictions to indicate any mismatch between the actual output and the retrieval context.', strict_mode=False, evaluation_model=None, error=None, evaluation_cost=None, verbose_logs='Truths (limit=None):\n[\n    "The text contains the digit 6.",\n    "6 is an integer."\n] \n \nClaims:\n[\n    "The AI output is 6."\n] \n \nVerdicts:\n[\n    {\n        "verdict": "idk",\n        "reason": "The retrieval context states that the text contains the digit 6 and that 6 is an integer, but it does not say that the AI output itself is 6."\n    }\n]')], conversational=False, multimodal=False, input='what is 3+3?', actual_output='6', expected_output=None, context=None, retrieval_context=['6'], turns=None, additional_metadata=None)] confident_link=None test_run_id=None
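
The verbose log helps explain why this score differs from the 0 returned by our custom judge: the single claim received an `idk` verdict, yet the metric reported 1.0 because nothing contradicted the context. A minimal sketch of a scoring rule consistent with that log is shown below. It assumes that only `no` verdicts count against the answer, which is inferred from this output and not confirmed against deepeval's implementation. `faithfulness_from_verdicts` is a hypothetical name.

```python
def faithfulness_from_verdicts(verdicts: list[str]) -> float:
    """Fraction of claims that are not contradicted by the context.

    Assumption: "yes" and "idk" both count as not contradicted;
    only "no" lowers the score.
    """
    if not verdicts:
        return 1.0  # Assumption: no claims means nothing to contradict.
    not_contradicted = sum(v.strip().lower() != "no" for v in verdicts)
    return not_contradicted / len(verdicts)


print(faithfulness_from_verdicts(["idk"]))        # 1.0
print(faithfulness_from_verdicts(["yes", "no"]))  # 0.5
```

Under a rule like this, an unsupported answer still scores well unless the context actively contradicts it. Our custom prompt is stricter because it requires the answer to be deducible from the context.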


### Another Powerful Alternative with grouse


`grouse` is another excellent open-source option, offering a similar suite of metrics but with a unique focus on allowing deep customization of the "judge" prompts. This is useful for fine-tuning evaluation criteria for a specific domain.

In [41]:
# Test the model with litellm, which grouse uses as its backend
from litellm import completion

response = completion(
    model="huggingface/novita/openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Say hello in JSON format: {\"greeting\": \"...\"}"}]
)
print(response)

ModelResponse(id='f2f1299ad1659341aa2b59853a27c9cb', created=1764249754, model='openai/gpt-oss-20b', object='chat.completion', system_fingerprint='', choices=[Choices(finish_reason='stop', index=0, message=Message(content='{"greeting":"Hello"}', role='assistant', tool_calls=None, function_call=None, reasoning_content='The user says: "Say hello in JSON format: {"greeting": "..."}"\n\nThey want presumably a JSON object with a greeting. So we output something like:\n\n{"greeting": "Hello"}\n\nThat\'s it. No extra text.', provider_specific_fields={'reasoning_content': 'The user says: "Say hello in JSON format: {"greeting": "..."}"\n\nThey want presumably a JSON object with a greeting. So we output something like:\n\n{"greeting": "Hello"}\n\nThat\'s it. No extra text.'}), provider_specific_fields={})], usage=Usage(completion_tokens=64, prompt_tokens=79, total_tokens=143, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=4

In [67]:
from grouse import EvaluationSample, GroundedQAEvaluator
import litellm

# litellm._turn_on_debug()  # for debugging
litellm.suppress_debug_info = True
litellm.set_verbose = False
litellm._logging._disable_debugging()  # Internal function to disable debug

evaluator = GroundedQAEvaluator(
    model_name="huggingface/nebius/Qwen/Qwen3-30B-A3B-Instruct-2507",
)
unfaithful_sample = EvaluationSample(
    input="Where is the Eiffel Tower located?",
    actual_output="The Eiffel Tower is located at Rue Rabelais in Paris.",
    expected_output="",
    references=[
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France",
        "Gustave Eiffel died in his apartment at Rue Rabelais in Paris."
    ]
)

result = evaluator.evaluate(
    eval_samples=[unfaithful_sample],
).evaluations[0]

print(f"Grouse Faithfulness Score (0 or 1): {result.faithfulness.faithfulness}")

100%|██████████| 1/1 [00:00<00:00,  1.26it/s]

2025-11-27 14:48:20,642 - LLM Call Tracker - INFO - Cost: 0.0000$


Grouse Faithfulness Score (0 or 1): 0




### Evaluation with RAGAS


While deepeval and grouse are great general-purpose evaluators, RAGAS (Retrieval-Augmented Generation Assessment) is a framework built specifically for evaluating RAG pipelines. It provides a comprehensive suite of metrics that measure every component of your system, from retriever to generator.



To use RAGAS, we first need to prepare our evaluation data in a specific format. It requires four key pieces of information for each test case:

- question: The user's input query.
- answer: The final answer generated by our RAG system.
- contexts: The list of documents retrieved by our retriever.
- ground_truth: The correct, reference answer.
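
Mismatched column lengths or a bare string where a list of documents is expected are easy mistakes when assembling this data by hand. The sketch below checks the four fields listed above before they are handed to an evaluator. `validate_samples` is a hypothetical helper, and the sample row is made up for illustration.

```python
def validate_samples(samples: dict) -> int:
    """Check a column-oriented sample dict and return its row count (hypothetical helper)."""
    required = ("question", "answer", "contexts", "ground_truth")
    missing = [key for key in required if key not in samples]
    if missing:
        raise KeyError(f"Missing columns: {missing}")
    lengths = {key: len(samples[key]) for key in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Columns have different lengths: {lengths}")
    for i, docs in enumerate(samples["contexts"]):
        # Each row needs a list of document strings, not one bare string.
        if not isinstance(docs, list) or not all(isinstance(d, str) for d in docs):
            raise TypeError(f"contexts[{i}] must be a list of strings")
    return lengths["question"]


# Made-up example row, for illustration only.
samples = {
    "question": ["Who wrote Hamlet?"],
    "answer": ["William Shakespeare wrote Hamlet."],
    "contexts": [["Hamlet is a tragedy written by William Shakespeare."]],
    "ground_truth": ["William Shakespeare"],
}
print(validate_samples(samples))  # 1
```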

In [68]:
# 1. Prepare the evaluation data
questions = [
    "What is the name of the three-headed dog guarding the Sorcerer's Stone?",
    "Who gave Harry Potter his first broomstick?",
    "Which house did the Sorting Hat initially consider for Harry?",
]

# These would be the answers generated by our RAG pipeline
generated_answers = [
    "The three-headed dog is named Fluffy.",
    "Professor McGonagall gave Harry his first broomstick, a Nimbus 2000.",
    "The Sorting Hat strongly considered putting Harry in Slytherin.",
]

# The ground truth, or "perfect" answers
ground_truth_answers = [
    "Fluffy",
    "Professor McGonagall",
    "Slytherin",
]

# The context retrieved by our RAG system for each question
retrieved_documents = [
    ["A massive, three-headed dog was guarding a trapdoor. Hagrid mentioned its name was Fluffy."],
    ["First years are not allowed brooms, but Professor McGonagall, head of Gryffindor, made an exception for Harry."],
    ["The Sorting Hat muttered in Harry's ear, 'You could be great, you know, it's all here in your head, and Slytherin will help you on the way to greatness...'"],
]

Next, we structure this data using the Hugging Face datasets library, which RAGAS integrates with seamlessly.

In [69]:
from datasets import Dataset

data_samples = {
    'question': questions,
    'answer': generated_answers,
    'contexts': retrieved_documents,
    'ground_truth': ground_truth_answers
}

dataset = Dataset.from_dict(data_samples)

Now, we can define our metrics and run the evaluation. RAGAS offers several powerful, RAG-specific metrics out of the box.

In [75]:
from langchain_huggingface import HuggingFaceEndpointEmbeddings

hf_embeddings = HuggingFaceEndpointEmbeddings(
    model="Qwen/Qwen3-Embedding-8B",
    task="feature-extraction",
    huggingfacehub_api_token=os.getenv("HUGGINGFACE_API_KEY")
)

In [76]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # How factually consistent is the answer with the context? (Prevents hallucination)
    answer_relevancy,   # How relevant is the answer to the question?
    context_recall,     # Did we retrieve all the necessary context to answer the question?
    answer_correctness, # How accurate is the answer compared to the ground truth?
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

llm = get_hf_llm("moonshotai/Kimi-K2-Instruct")
eval_llm = LangchainLLMWrapper(llm)
eval_embeddings = LangchainEmbeddingsWrapper(hf_embeddings)

metrics = [
    faithfulness,
    answer_relevancy, 
    context_recall,
    answer_correctness
]

result = evaluate(
    llm=eval_llm,
    dataset=dataset,
    metrics=metrics,
    embeddings=eval_embeddings
)

results_df = result.to_pandas()
print(results_df)

Evaluating: 100%|██████████| 12/12 [00:27<00:00,  2.27s/it]


                                          user_input  ... answer_correctness
0  What is the name of the three-headed dog guard...  ...           0.919079
1        Who gave Harry Potter his first broomstick?  ...           0.654729
2  Which house did the Sorting Hat initially cons...  ...           0.949443

[3 rows x 8 columns]


In [77]:
results_df["faithfulness"]

0    0.0
1    0.0
2    1.0
Name: faithfulness, dtype: float64

In [78]:
results_df["answer_correctness"]

0    0.919079
1    0.654729
2    0.949443
Name: answer_correctness, dtype: float64

In [80]:
results_df["context_recall"]

0    1.0
1    1.0
2    1.0
Name: context_recall, dtype: float64

In [82]:
results_df["answer_relevancy"]

0    0.902707
1    0.929892
2    0.941209
Name: answer_relevancy, dtype: float64
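
To read these columns side by side, a short summary can help. The following sketch uses plain Python with the per-row values copied from the outputs above. It averages each metric and lists the rows that fall below a threshold. The 0.5 threshold is an assumption chosen for illustration.

```python
from statistics import mean

# Per-row scores copied from the outputs above.
scores = {
    "faithfulness": [0.0, 0.0, 1.0],
    "answer_correctness": [0.919079, 0.654729, 0.949443],
    "context_recall": [1.0, 1.0, 1.0],
    "answer_relevancy": [0.902707, 0.929892, 0.941209],
}

# Mean score per metric, rounded for display.
averages = {name: round(mean(values), 3) for name, values in scores.items()}
for name, value in averages.items():
    print(f"{name}: {value}")

# Row indices that fall below an assumed threshold of 0.5, per metric.
threshold = 0.5
flagged = {
    name: [i for i, v in enumerate(values) if v < threshold]
    for name, values in scores.items()
}
print({name: rows for name, rows in flagged.items() if rows})  # {'faithfulness': [0, 1]}
```

Only faithfulness is flagged, for the first two questions, even though their correctness and relevancy scores are high. Those rows are worth inspecting by hand before trusting either the pipeline or the judge.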