# LLM-as-a-Judge: Evaluating AI Model Outputs with LLMs

## The True Purpose of LLM-as-a-Judge

**Important Distinction:** LLM-as-a-Judge is NOT about using an LLM to classify text. That's just an LLM classifier.

**LLM-as-a-Judge** is about using an LLM to **evaluate whether an existing AI model's output is correct, accurate, or high-quality**.

### The Real Use Case

Imagine you have:
- A classification model that predicts sentiment (Positive/Negative/Neutral)
- A chatbot that answers customer questions
- An AI that generates product descriptions
- A model that extracts entities from text

**How do you know if these outputs are good?**

Traditional approaches:
- Manual human evaluation (expensive, slow)
- Comparing to ground truth labels (requires labeled data)
- Simple metrics (accuracy, F1) - don't capture nuance

**LLM-as-a-Judge approach:**
- Use a powerful LLM to evaluate each prediction
- Get reasoning for why an output is good or bad
- Score outputs on multiple dimensions
- Enable quality control at scale

## What We'll Build

In this tutorial, we'll build a production-ready LLM-as-a-Judge framework that:

1. **Evaluates AI Outputs** - Not classifies, but judges existing classifications
2. **Uses Few-Shot Examples** - Shows the judge what good vs. bad outputs look like
3. **Provides ReAct-style Reasoning** - Confidence scores and chain-of-thought explanations
4. **Returns Structured Output** - Pydantic models for production systems
5. **Works with Any LLM** - OpenAI, Anthropic, or custom models


## Setup

First, let's install dependencies and import our framework.


In [2]:
# Uncomment to install
# !pip install openai pydantic pandas

import os
from openai import OpenAI
import pandas as pd
from typing import List, Tuple

# Import our LLM-as-a-Judge framework
from llm_judge import LLMJudge, FewShotExample, JudgmentResult


In [None]:
# Initialize OpenAI client
client = OpenAI()


## Part 1: Understanding the Task - Judging AI Outputs

### Example: Sentiment Classification Model

Let's say we have a sentiment classification model that we want to evaluate. The model takes customer reviews and predicts: Positive, Negative, or Neutral.

Here's sample data our model classified:


In [4]:
# Sample outputs from our AI sentiment classifier
model_predictions = [
    {
        "input": "This product is absolutely amazing! Best purchase ever!",
        "model_output": "Positive",
        "ground_truth": "Positive"  # We'll use this later
    },
    {
        "input": "The item arrived damaged and customer service was unhelpful.",
        "model_output": "Negative",
        "ground_truth": "Negative"
    },
    {
        "input": "It's okay, nothing special but does the job I guess.",
        "model_output": "Positive",  # This looks wrong!
        "ground_truth": "Neutral"
    },
    {
        "input": "Terrible quality. Broke after one use. Very disappointed.",
        "model_output": "Neutral",  # This also looks wrong!
        "ground_truth": "Negative"
    },
]

# Display as DataFrame
df = pd.DataFrame(model_predictions)
print("AI Model Predictions to Evaluate:")
print(df)


AI Model Predictions to Evaluate:
                                               input model_output ground_truth
0  This product is absolutely amazing! Best purch...     Positive     Positive
1  The item arrived damaged and customer service ...     Negative     Negative
2  It's okay, nothing special but does the job I ...     Positive      Neutral
3  Terrible quality. Broke after one use. Very di...      Neutral     Negative


### The Question

For each prediction, we want to know:
- **Is this output correct?**
- **How confident are we in that judgment?**
- **Why is it correct or incorrect?**
- **How would we score the quality?**

This is where LLM-as-a-Judge comes in!


## Part 2: Role and Task Definition

Just like with any prompt engineering, we need to clearly define:
1. **Who is the judge?** - Their expertise and background
2. **What are they evaluating?** - The specific task
3. **What criteria should they use?** - How to determine if an output is good


In [5]:
# Define the judge's role
judge_role = """You are an expert evaluator of sentiment classification models.
You have 10 years of experience in natural language processing and understand the nuances
of sentiment analysis. Your job is to assess whether an AI model's sentiment predictions
are accurate and appropriate given the input text."""

# Define the evaluation task
evaluation_task = """For each AI model prediction, you must:
1. Read the original customer review
2. Examine the AI model's predicted sentiment
3. Determine if the prediction is correct
4. Consider nuances like sarcasm, mixed sentiment, and intensity
5. Provide a quality score (0-100), verdict, and detailed reasoning"""

# Define evaluation criteria
evaluation_criteria = """
A sentiment prediction is correct if:
- Positive: The review expresses satisfaction, praise, or positive experiences
- Negative: The review expresses dissatisfaction, complaints, or negative experiences
- Neutral: The review is factual, balanced, or lacks strong sentiment

Consider:
- Intensity of language (e.g., "amazing" vs "okay")
- Context and implications
- Mixed sentiments (judge the dominant sentiment)
"""

# Valid verdicts
verdicts = ["Correct", "Incorrect", "Partially Correct"]


## Part 3: Few-Shot Examples - Teaching the Judge

Few-shot examples are crucial for LLM-as-a-Judge. They show:
- What a correct prediction looks like
- What an incorrect prediction looks like
- How to reason about edge cases
- How to calibrate scores and confidence

Let's create examples that demonstrate good judgment:


In [6]:
# Create few-shot examples showing how to judge outputs
# NOTE: We do NOT include ground_truth in the examples - the judge must evaluate
# based solely on the input and model output, as it would in real production scenarios
examples = [
    FewShotExample(
        input_text="This product exceeded all my expectations! Highly recommend!",
        model_output="Positive",
        expected_verdict="Correct",
        expected_score=100.0,
        reasoning="""The model correctly identified this as Positive sentiment. 
        The review contains strong positive indicators: 'exceeded expectations' and 
        'highly recommend'. The prediction is accurate and appropriate."""
    ),
    FewShotExample(
        input_text="Worst purchase of my life. Complete waste of money.",
        model_output="Negative",
        expected_verdict="Correct",
        expected_score=100.0,
        reasoning="""The model correctly identified this as Negative sentiment.
        The review contains strong negative indicators: 'worst' and 'waste of money'.
        The prediction is accurate."""
    ),
    FewShotExample(
        input_text="It's fine, does what it says on the box.",
        model_output="Positive",
        expected_verdict="Incorrect",
        expected_score=20.0,
        reasoning="""The model incorrectly classified this as Positive when it should be Neutral.
        The review is lukewarm at best - 'fine' and 'does what it says' indicate no strong sentiment.
        This is a clear misclassification that could skew model performance metrics."""
    ),
    FewShotExample(
        input_text="Great features but terrible battery life really ruins it.",
        model_output="Neutral",
        expected_verdict="Partially Correct",
        expected_score=60.0,
        reasoning="""The model classified this as Neutral, which is defensible but not ideal.
        The review has mixed sentiment: positive ('great features') and negative ('terrible battery').
        However, the word 'ruins' suggests the negative aspect dominates. A Negative classification
        would be more accurate, but Neutral shows the model recognized the mixed nature."""
    ),
    FewShotExample(
        input_text="Meh. Nothing to write home about.",
        model_output="Negative",
        expected_verdict="Incorrect",
        expected_score=30.0,
        reasoning="""The model incorrectly classified this as Negative when it should be Neutral.
        'Meh' and 'nothing to write home about' indicate indifference, not negativity.
        While slightly negative-leaning, this is more neutral disappointment than active dissatisfaction."""
    ),
]

print(f"Created {len(examples)} few-shot examples to guide the judge")


Created 5 few-shot examples to guide the judge


### Key Takeaway on Few-Shot Examples

Notice how our examples:
- Cover different scenarios: correct, incorrect, and partially correct
- Show score calibration (100 for perfect, 20-30 for clear errors, 60 for debatable cases)
- Explain the reasoning in detail
- Reference specific words/phrases from the input
- **Do NOT include ground truth** - the judge learns to evaluate quality intrinsically

This teaches the judge how to think about evaluations based solely on input quality, not by comparison to a known answer.


## Part 4: Creating the Judge

Now let's instantiate our LLM judge with all the components:


In [7]:
# Create the LLM Judge
sentiment_judge = LLMJudge(
    llm_client=client,
    role=judge_role,
    task_description=evaluation_task,
    evaluation_criteria=evaluation_criteria,
    valid_verdicts=verdicts,
    few_shot_examples=examples,
    model_name="gpt-4o-mini",
    temperature=0.3  # Low temperature for consistency
)

print("LLM Judge created successfully!")
print(f"\nSystem Prompt Preview (first 500 chars):")
print(sentiment_judge.get_system_prompt())


LLM Judge created successfully!

System Prompt Preview (first 500 chars):
# ROLE
You are an expert evaluator of sentiment classification models.
You have 10 years of experience in natural language processing and understand the nuances
of sentiment analysis. Your job is to assess whether an AI model's sentiment predictions
are accurate and appropriate given the input text.

# TASK
For each AI model prediction, you must:
1. Read the original customer review
2. Examine the AI model's predicted sentiment
3. Determine if the prediction is correct
4. Consider nuances like sarcasm, mixed sentiment, and intensity
5. Provide a quality score (0-100), verdict, and detailed reasoning

# EVALUATION CRITERIA

A sentiment prediction is correct if:
- Positive: The review expresses satisfaction, praise, or positive experiences
- Negative: The review expresses dissatisfaction, complaints, or negative experiences
- Neutral: The review is factual, balanced, or lacks strong sentiment

Consider:
- Intensity

## Part 5: Judging Model Outputs

Now let's use our judge to evaluate the model predictions from earlier:


In [8]:
# Judge each model prediction
print("=" * 100)
print("JUDGING AI MODEL OUTPUTS")
print("=" * 100)

for i, pred in enumerate(model_predictions, 1):
    print(f"\n### Case {i} ###")
    print(f"Input: {pred['input']}")
    print(f"Model's Prediction: {pred['model_output']}")
    print(f"Ground Truth (for reference): {pred['ground_truth']}")
    print("^ Note: Ground truth shown here for education, but NOT passed to the judge!")
    
    # Judge the model's output - NO ground_truth parameter!
    # The judge evaluates purely based on input + model_output
    judgment = sentiment_judge.judge_single(
        input_text=pred['input'],
        model_output=pred['model_output']
    )
    
    print(f"\n{'JUDGE VERDICT:':<20} {judgment.verdict}")
    print(f"{'Quality Score:':<20} {judgment.score}/100")
    print(f"{'Confidence:':<20} {judgment.confidence}%")
    print(f"\nReasoning: {judgment.reasoning}")
    if judgment.notes:
        print(f"Notes: {judgment.notes}")
    
    print("\n" + "=" * 100)


JUDGING AI MODEL OUTPUTS

### Case 1 ###
Input: This product is absolutely amazing! Best purchase ever!
Model's Prediction: Positive
Ground Truth (for reference): Positive
^ Note: Ground truth shown here for education, but NOT passed to the judge!

JUDGE VERDICT:       Correct
Quality Score:       100.0/100
Confidence:          100.0%

Reasoning: The model correctly identified this as Positive sentiment. The review contains strong positive indicators such as 'absolutely amazing' and 'best purchase ever', which clearly express satisfaction and enthusiasm. The prediction is accurate and aligns perfectly with the sentiment expressed in the review.


### Case 2 ###
Input: The item arrived damaged and customer service was unhelpful.
Model's Prediction: Negative
Ground Truth (for reference): Negative
^ Note: Ground truth shown here for education, but NOT passed to the judge!

JUDGE VERDICT:       Correct
Quality Score:       100.0/100
Confidence:          95.0%

Reasoning: The AI model corre

## Part 6: ReAct Pattern - Confidence and Reasoning

The ReAct pattern (Reasoning + Acting) is built into our framework. Each judgment includes:

1. **Score (0-100)**: Quantitative quality assessment
2. **Verdict**: Binary or categorical judgment
3. **Confidence**: How certain the judge is
4. **Reasoning**: Chain-of-thought explanation
5. **Notes**: Additional observations

This enables:
- **Transparency**: You can see why the judge made each decision
- **Debugging**: Identify patterns in errors
- **Human-in-the-loop**: Route low-confidence judgments to humans
- **Quality control**: Track judge performance over time


## Part 7: Batch Evaluation and Analytics

Let's evaluate multiple predictions and analyze the results:


In [9]:
batch_input = [
    (pred['input'], pred['model_output'])
    for pred in model_predictions
]

# Batch judge
judgments = sentiment_judge.judge_batch(batch_input)

# Create results DataFrame
results_df = pd.DataFrame([
    {
        'input': pred['input'][:50] + '...' if len(pred['input']) > 50 else pred['input'],
        'model_output': pred['model_output'],
        'ground_truth': pred['ground_truth'],
        'verdict': j.verdict,
        'score': j.score,
        'confidence': j.confidence,
    }
    for pred, j in zip(model_predictions, judgments)
])

print("Evaluation Results:")
print(results_df.to_string(index=False))

# Calculate metrics
print(f"\n{'=' * 60}")
print("SUMMARY STATISTICS")
print(f"{'=' * 60}")
print(f"Total Predictions Evaluated: {len(judgments)}")
print(f"Correct Predictions: {sum(1 for j in judgments if j.verdict == 'Correct')}")
print(f"Incorrect Predictions: {sum(1 for j in judgments if j.verdict == 'Incorrect')}")
print(f"Partially Correct: {sum(1 for j in judgments if j.verdict == 'Partially Correct')}")
print(f"\nAverage Quality Score: {results_df['score'].mean():.1f}/100")
print(f"Average Judge Confidence: {results_df['confidence'].mean():.1f}%")
print(f"\nModel Accuracy (by judge): {sum(1 for j in judgments if j.verdict == 'Correct') / len(judgments) * 100:.1f}%")


Evaluation Results:
                                                input model_output ground_truth   verdict  score  confidence
This product is absolutely amazing! Best purchase ...     Positive     Positive   Correct  100.0       100.0
The item arrived damaged and customer service was ...     Negative     Negative   Correct  100.0       100.0
It's okay, nothing special but does the job I gues...     Positive      Neutral Incorrect   25.0        90.0
Terrible quality. Broke after one use. Very disapp...      Neutral     Negative Incorrect   10.0        95.0

SUMMARY STATISTICS
Total Predictions Evaluated: 4
Correct Predictions: 2
Incorrect Predictions: 2
Partially Correct: 0

Average Quality Score: 58.8/100
Average Judge Confidence: 96.2%

Model Accuracy (by judge): 50.0%


In [10]:
results_df

Unnamed: 0,input,model_output,ground_truth,verdict,score,confidence
0,This product is absolutely amazing! Best purch...,Positive,Positive,Correct,100.0,100.0
1,The item arrived damaged and customer service ...,Negative,Negative,Correct,100.0,100.0
2,"It's okay, nothing special but does the job I ...",Positive,Neutral,Incorrect,25.0,90.0
3,Terrible quality. Broke after one use. Very di...,Neutral,Negative,Incorrect,10.0,95.0


## Part 8: Structured Output - Production Ready

Our framework uses Pydantic for structured output, ensuring:
- Type safety
- Validation
- Easy serialization
- Clear schema

Let's examine the structure:


In [None]:
# Examine the Pydantic model
print("JudgmentResult Schema:")
print(JudgmentResult.model_json_schema())

# Serialize a judgment to JSON
sample_judgment = judgments[0]
print("\nSample Judgment as JSON:")
print(sample_judgment.model_dump_json(indent=2))

# Save all results to CSV
full_results = []
for pred, j in zip(model_predictions, judgments):
    full_results.append({
        'input': pred['input'],
        'model_output': pred['model_output'],
        'ground_truth': pred['ground_truth'],
        'verdict': j.verdict,
        'score': j.score,
        'confidence': j.confidence,
        'reasoning': j.reasoning,
        'notes': j.notes
    })

results_df_full = pd.DataFrame(full_results)
results_df_full.to_csv('model_evaluation_results.csv', index=False)
print("\nResults saved to model_evaluation_results.csv")

JudgmentResult Schema:
{'description': "Structured output for judging an AI model's output.\n\nThis model enforces structured output from the LLM judge, ensuring:\n- A quality score evaluating the AI output\n- A binary verdict (correct/incorrect or pass/fail)\n- A confidence score (0-100) in the judgment\n- Chain-of-thought reasoning\n- Optional notes for additional context", 'properties': {'score': {'description': 'Quality score from 0-100 evaluating how good the AI output is', 'maximum': 100, 'minimum': 0, 'title': 'Score', 'type': 'number'}, 'verdict': {'description': "Binary judgment: 'Correct', 'Incorrect', 'Partially Correct', 'Pass', 'Fail', etc.", 'title': 'Verdict', 'type': 'string'}, 'confidence': {'description': "Confidence score from 0 to 100 indicating the judge's certainty in this evaluation", 'maximum': 100, 'minimum': 0, 'title': 'Confidence', 'type': 'number'}, 'reasoning': {'description': 'Chain-of-thought reasoning explaining why the AI output received this judgment'

In [13]:
results_df_full

Unnamed: 0,input,model_output,ground_truth,verdict,score,confidence,reasoning,notes
0,This product is absolutely amazing! Best purch...,Positive,Positive,Correct,100.0,100.0,The model correctly identified this as Positiv...,
1,The item arrived damaged and customer service ...,Negative,Negative,Correct,100.0,100.0,The model correctly identified this as Negativ...,
2,"It's okay, nothing special but does the job I ...",Positive,Neutral,Incorrect,25.0,90.0,The model incorrectly classified this review a...,The model should improve its ability to recogn...
3,Terrible quality. Broke after one use. Very di...,Neutral,Negative,Incorrect,10.0,95.0,The model incorrectly classified this review a...,The model should improve its ability to detect...


## Part 9: Real-World Application - Judging Without Ground Truth

In many real-world scenarios, you don't have ground truth labels. The judge can still evaluate quality based on the criteria you define:


In [None]:
# New predictions without ground truth
new_predictions = [
    {
        "input": "Love it! But wish it came in more colors.",
        "model_output": "Positive"
    },
    {
        "input": "Received it yesterday, setting it up now.",
        "model_output": "Neutral"
    },
    {
        "input": "Not bad for the price, I guess.",
        "model_output": "Positive"
    },
]

print("Judging without ground truth:")
print("=" * 100)

for i, pred in enumerate(new_predictions, 1):
    print(f"\n### Case {i} ###")
    print(f"Input: {pred['input']}")
    print(f"Model Output: {pred['model_output']}")
    
    # Judge the model output - the judge evaluates quality without ground truth
    judgment = sentiment_judge.judge_single(
        input_text=pred['input'],
        model_output=pred['model_output']
    )
    
    print(f"\nVerdict: {judgment.verdict}")
    print(f"Score: {judgment.score}/100")
    print(f"Confidence: {judgment.confidence}%")
    print(f"Reasoning: {judgment.reasoning}")
    print("=" * 100)


## Part 10: Advanced - Judging Other Types of AI Outputs

LLM-as-a-Judge isn't limited to classification. Let's see it evaluate chatbot responses:


In [12]:
# Create a judge for chatbot responses
chatbot_role = """You are an expert evaluator of customer service chatbot responses.
You assess whether chatbot answers are helpful, accurate, and appropriately professional."""

chatbot_task = """Evaluate whether the chatbot's response:
1. Answers the customer's question
2. Is accurate and helpful
3. Maintains appropriate tone
4. Provides actionable information"""

chatbot_criteria = """
A good chatbot response should:
- Directly address the customer's question
- Be accurate (no hallucinations or false information)
- Be concise but complete
- Use professional, friendly tone
- Provide next steps when appropriate
"""

chatbot_examples = [
    FewShotExample(
        input_text="How do I reset my password?",
        model_output="To reset your password, click 'Forgot Password' on the login page, enter your email, and follow the link sent to you.",
        expected_verdict="Correct",
        expected_score=95.0,
        reasoning="The response directly answers the question with clear, actionable steps. It's concise, accurate, and helpful."
    ),
    FewShotExample(
        input_text="When will my order arrive?",
        model_output="We offer free shipping on all orders!",
        expected_verdict="Incorrect",
        expected_score=10.0,
        reasoning="The response completely ignores the question about delivery time. It provides irrelevant information about shipping cost instead of addressing the customer's concern."
    ),
]

chatbot_judge = LLMJudge(
    llm_client=client,
    role=chatbot_role,
    task_description=chatbot_task,
    evaluation_criteria=chatbot_criteria,
    valid_verdicts=["Correct", "Incorrect", "Partially Correct"],
    few_shot_examples=chatbot_examples,
    model_name="gpt-4o-mini",
    temperature=0.3
)

# Test chatbot responses
chatbot_test = [
    {
        "input": "What's your return policy?",
        "output": "You can return items within 30 days for a full refund. Items must be unused and in original packaging."
    },
    {
        "input": "Is this product waterproof?",
        "output": "This is a great product! Many customers love it!"
    },
]

print("Evaluating Chatbot Responses:")
print("=" * 100)

for i, test in enumerate(chatbot_test, 1):
    print(f"\n### Chatbot Exchange {i} ###")
    print(f"Customer: {test['input']}")
    print(f"Chatbot: {test['output']}")
    
    judgment = chatbot_judge.judge_single(
        input_text=test['input'],
        model_output=test['output']
    )
    
    print(f"\nVerdict: {judgment.verdict}")
    print(f"Score: {judgment.score}/100")
    print(f"Reasoning: {judgment.reasoning}")
    print("=" * 100)


Evaluating Chatbot Responses:

### Chatbot Exchange 1 ###
Customer: What's your return policy?
Chatbot: You can return items within 30 days for a full refund. Items must be unused and in original packaging.

Verdict: Correct
Score: 90.0/100
Reasoning: The response accurately answers the customer's question about the return policy by specifying the time frame for returns and the condition of the items. It is clear, concise, and provides actionable information that the customer can follow. The tone is professional and appropriate for customer service.

### Chatbot Exchange 2 ###
Customer: Is this product waterproof?
Chatbot: This is a great product! Many customers love it!

Verdict: Incorrect
Score: 15.0/100
Reasoning: The response does not answer the customer's question about whether the product is waterproof. Instead, it provides irrelevant information about customer satisfaction, which does not address the inquiry. The lack of a direct answer makes it unhelpful and inaccurate.


## Conclusion: Best Practices for LLM-as-a-Judge

### 1. **Understand the Core Purpose**
- LLM-as-a-Judge evaluates existing AI outputs, it doesn't create them
- Use it for quality control, model evaluation, and human-in-the-loop workflows
- It's not a replacement for the AI system, but a QA layer on top

### 2. **Define Clear Evaluation Criteria**
- Be specific about what makes an output "good" or "bad"
- Provide concrete examples of correct and incorrect outputs
- Consider edge cases and ambiguous situations

### 3. **Use Comprehensive Few-Shot Examples**
- Include examples of various verdict types (correct, incorrect, partial)
- Show score calibration (what deserves 100 vs 60 vs 20)
- Demonstrate the reasoning process
- Cover edge cases your model struggles with

### 4. **Leverage the ReAct Pattern**
- Confidence scores help identify uncertain judgments
- Reasoning provides transparency and debuggability
- Notes can capture nuances and suggestions
- This enables human-in-the-loop at scale

### 5. **Use Structured Output**
- Pydantic ensures type safety and validation
- Makes integration with production systems easy
- Enables easy analysis and aggregation
- Provides clear schema documentation

### 6. **Production Considerations**
- **Cost**: Judge fewer samples or use cheaper models
- **Speed**: Batch API for large evaluations
- **Quality**: Use GPT-4 for critical judgments, GPT-3.5 for simpler ones
- **Monitoring**: Track judge confidence and agreement with ground truth
- **Iteration**: Continuously add new few-shot examples based on errors

### Real-World Use Cases

1. **Model Evaluation**: Compare model versions without manual labeling
2. **Quality Control**: Flag problematic AI outputs before they reach users
3. **Active Learning**: Identify examples that need human review
4. **Model Debugging**: Understand why/when your model fails
5. **A/B Testing**: Evaluate which model variant performs better
6. **Continuous Monitoring**: Track model quality in production

### Going Further

Extend this framework with:
- **Multi-judge consensus**: Use multiple judges and aggregate verdicts
- **Confidence-based routing**: Auto-approve high-confidence, review low-confidence
- **Judge calibration**: Compare judge verdicts to ground truth
- **Cost optimization**: Use smaller models for clearer cases
- **Specialized judges**: Different judges for different evaluation aspects

Happy judging! ðŸŽ¯


In [None]:
# YOUR TURN: Create your own judge

# 1. Define what you're judging
my_role = """TODO: What kind of AI outputs are you evaluating?"""

my_task = """TODO: What should the judge assess?"""

my_criteria = """TODO: What makes an output good or bad?"""

# 2. Create few-shot examples
my_examples = [
    # FewShotExample(
    #     input_text="...",
    #     model_output="...",
    #     expected_verdict="Correct",
    #     expected_score=...,
    #     reasoning="...",
    #     ground_truth="..."  # optional
    # ),
]

# 3. Create your judge
# my_judge = LLMJudge(
#     llm_client=client,
#     role=my_role,
#     task_description=my_task,
#     evaluation_criteria=my_criteria,
#     valid_verdicts=["Correct", "Incorrect"],
#     few_shot_examples=my_examples,
#     model_name="gpt-4o-mini",
#     temperature=0.3
# )

# 4. Test it
# judgment = my_judge.judge_single(
#     input_text="your AI's input",
#     model_output="your AI's output"
# )
# print(f"Verdict: {judgment.verdict}")
# print(f"Score: {judgment.score}")
# print(f"Reasoning: {judgment.reasoning}")
