# Custom Evaluator Patterns

This notebook provides practical patterns for building domain-specific evaluators to measure output quality consistently. Learn to combine LLM-as-Judge and code evaluators for comprehensive quality measurement.

**What You'll Learn:**
- Build LLM-as-Judge evaluators for subjective quality assessment
- Create code evaluators for deterministic checks
- Implement multi-stage evaluation pipelines
- Combine multiple evaluators with composite scoring
- Apply evaluators to real-world content generation tasks

**Prerequisites:**
- Python >=3.10, <3.14
- OpenAI API key
- Netra API key ([Get started here](https://docs.getnetra.ai/quick-start/Overview))

## Step 0: Install Packages

In [None]:
%pip install netra-sdk openai

## Step 1: Set Environment Variables

In [None]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key:")
os.environ["NETRA_API_KEY"] = getpass("Enter your Netra API Key:")
os.environ["NETRA_OTLP_ENDPOINT"] = getpass("Enter your Netra OTLP Endpoint:")

print("API keys configured!")



## Step 2: Initialize Netra

In [None]:
from netra import Netra
from netra.instrumentation.instruments import InstrumentSet

Netra.init(
    app_name="custom-evaluators",
    headers=f"x-api-key={os.getenv('NETRA_API_KEY')}",
    environment="testing",
    trace_content=True,
    instruments={InstrumentSet.OPENAI},
)

print("Netra initialized for custom evaluation!")

## Step 3: Implement LLM-as-Judge Evaluators

Create evaluators that use LLMs to assess subjective quality.
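
The judge replies in free text, so its scores have to be parsed before they can be weighted. The following is a minimal, standalone sketch of that parse-and-weight step, assuming the reply follows the `criterion_name: score (explanation)` format requested in the prompt below. The helper names `parse_reply` and `weighted_average` are hypothetical and exist only for illustration; no API call is made.

```python
import re
from typing import Dict, List

def parse_reply(reply: str, criteria: List[str]) -> Dict[str, float]:
    """Extract the first number that follows each criterion name (hypothetical helper)."""
    parsed: Dict[str, float] = {}
    for name in criteria:
        # [^\d\n]*? skips non-digit text on the same line, e.g. ": " or "**: "
        match = re.search(
            re.escape(name) + r"[^\d\n]*?(\d+(?:\.\d+)?)", reply, re.IGNORECASE
        )
        if match:
            # Clamp to the 1-10 scale the prompt asks for
            parsed[name] = min(max(float(match.group(1)), 1.0), 10.0)
    return parsed

def weighted_average(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean over the criteria that were actually scored."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("no scored criteria with non-zero weight")
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Illustrative judge reply and weights (made up for this sketch)
sample_reply = """clarity: 8 (easy to follow)
accuracy: 6 (one unsupported claim)"""
sample_weights = {"clarity": 0.25, "accuracy": 0.75}

sample_scores = parse_reply(sample_reply, list(sample_weights))
print(sample_scores)                                    # {'clarity': 8.0, 'accuracy': 6.0}
print(weighted_average(sample_scores, sample_weights))  # 6.5
```

Dividing by the total weight keeps the result on the 1-10 scale even when the weights do not sum to 1.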

In [None]:
from openai import OpenAI
from typing import Dict, List

class LLMJudgeEvaluator:
    """LLM-as-Judge evaluator for subjective quality assessment."""

    def __init__(self, criteria: Dict[str, str], weights: Dict[str, float] | None = None):
        """
        Initialize evaluator with criteria.
        
        Args:
            criteria: Dict of criterion_name -> description
            weights: Optional dict of criterion_name -> weight (default: equal)
        """
        self.openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.criteria = criteria
        
        if weights is None:
            # Equal weighting
            weight = 1.0 / len(criteria)
            self.weights = {k: weight for k in criteria}
        else:
            self.weights = weights

    def evaluate(self, output: str, context: str = "") -> Dict:
        """Evaluate output on all criteria."""
        criteria_text = "\n".join(
            f"- {name}: {desc}" for name, desc in self.criteria.items()
        )
        
        prompt = f"""You are an expert evaluator. Rate the following output on these criteria:

{criteria_text}

For each criterion, provide a score from 1-10 and brief explanation.
Format: criterion_name: score (explanation)

Context: {context}

Output to evaluate:
{output}

Evaluations:"""
        
        response = self.openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=500
        )
        
        scores = self._parse_scores(response.choices[0].message.content)
        
        return {
            "scores": scores,
            "weighted_score": self._calculate_weighted_score(scores),
            "feedback": response.choices[0].message.content
        }

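    def _parse_scores(self, feedback: str) -> Dict[str, float]:
        """Parse scores from evaluator feedback."""
        import re  # imported once here instead of inside the loop

        scores = {}
        for line in feedback.split("\n"):
            for criterion in self.criteria:
                if criterion in scores:
                    continue  # keep the first score reported for each criterion
                # Take the first number AFTER the criterion name, so list numbering
                # such as "1. clarity: 8" is not mistaken for the score
                match = re.search(
                    re.escape(criterion) + r"\D*?(\d+(?:\.\d+)?)", line, re.IGNORECASE
                )
                if match:
                    # Clamp to the 1-10 scale requested in the prompt
                    scores[criterion] = min(max(float(match.group(1)), 1.0), 10.0)

        # Fill missing scores with the average of the parsed ones (5.0 if none parsed)
        avg = sum(scores.values()) / len(scores) if scores else 5.0
        for criterion in self.criteria:
            scores.setdefault(criterion, avg)

        return scores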

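    def _calculate_weighted_score(self, scores: Dict[str, float]) -> float:
        """Calculate weighted average score, normalized by the total weight."""
        total_weight = sum(self.weights.get(c, 0) for c in self.criteria)
        if total_weight <= 0:
            return 5.0  # no usable weights; fall back to the neutral midpoint
        total = sum(
            scores.get(criterion, 5.0) * self.weights.get(criterion, 0)
            for criterion in self.criteria
        )
        # Dividing by total_weight keeps the result on the 1-10 scale even when
        # custom weights do not sum to 1
        return min(max(total / total_weight, 1), 10)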


print("LLM-as-Judge evaluator implemented!")

## Step 4: Implement Code Evaluators

Create deterministic evaluators for structural and format checks.
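
Every check in the cell below returns the same small dictionary (`passed`, `score`, plus diagnostic fields), which is what lets the checks be combined later. As a sketch of extending that pattern, the hypothetical `check_json_keys` below verifies the required top-level keys of a JSON object by parsing it first, rather than by substring matching. The function name and the proportional 0-10 scoring are assumptions for illustration.

```python
import json
from typing import Dict, List

def check_json_keys(output: str, required_keys: List[str]) -> Dict:
    """Check that output is a JSON object containing every required top-level key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as e:
        return {"passed": False, "score": 0.0,
                "missing_keys": list(required_keys), "error": str(e)}

    if not isinstance(data, dict):
        return {"passed": False, "score": 0.0,
                "missing_keys": list(required_keys),
                "error": "top-level JSON value is not an object"}

    missing = [key for key in required_keys if key not in data]
    # Score is proportional to the share of required keys that are present
    found = len(required_keys) - len(missing)
    coverage = found / len(required_keys) if required_keys else 1.0
    return {"passed": not missing, "score": coverage * 10.0,
            "missing_keys": missing, "error": None}

print(check_json_keys('{"title": "Q4 plan", "body": "..."}', ["title", "body", "author"]))
```

Parsing before checking avoids a false positive when a key name merely appears inside a string value.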

In [None]:
import json
import re

class CodeEvaluator:
    """Deterministic code-based evaluator for format and structure checks."""

    @staticmethod
    def check_length(output: str, min_length: int = 0, max_length: int | None = None) -> Dict:
        """Check if output length (in characters) is within acceptable range."""
        length = len(output)
        
        passed = length >= min_length
        # Compare against None explicitly so that max_length=0 is still enforced
        if max_length is not None:
            passed = passed and length <= max_length
        
        return {
            "passed": passed,
            "score": 10.0 if passed else 3.0,
            "actual_length": length,
            "min_length": min_length,
            "max_length": max_length
        }

    @staticmethod
    def check_json_valid(output: str) -> Dict:
        """Check if output is valid JSON."""
        try:
            json.loads(output)
            return {"passed": True, "score": 10.0, "error": None}
        except json.JSONDecodeError as e:
            return {"passed": False, "score": 0.0, "error": str(e)}

    @staticmethod
    def check_required_fields(output: str, required_fields: List[str]) -> Dict:
        """Check if required fields/keywords are present."""
        found_fields = [f for f in required_fields if f.lower() in output.lower()]
        coverage = len(found_fields) / len(required_fields) if required_fields else 1.0
        
        return {
            "passed": coverage == 1.0,
            "score": coverage * 10.0,
            "found_fields": found_fields,
            "missing_fields": [f for f in required_fields if f not in found_fields]
        }

    @staticmethod
    def check_word_count(output: str, min_words: int = 0, max_words: int | None = None) -> Dict:
        """Check if word count is within range."""
        word_count = len(output.split())
        passed = word_count >= min_words
        
        # Compare against None explicitly so that max_words=0 is still enforced
        if max_words is not None:
            passed = passed and word_count <= max_words
        
        return {
            "passed": passed,
            "score": 10.0 if passed else 3.0,
            "actual_words": word_count,
            "min_words": min_words,
            "max_words": max_words
        }

    @staticmethod
    def check_no_forbidden_content(output: str, forbidden: List[str]) -> Dict:
        """Check that forbidden words/phrases are not present."""
        found_forbidden = [f for f in forbidden if f.lower() in output.lower()]
        passed = len(found_forbidden) == 0
        
        return {
            "passed": passed,
            "score": 10.0 if passed else 0.0,
            "found_forbidden": found_forbidden
        }


print("Code evaluators implemented!")

## Step 5: Define Multi-Stage Evaluation Pipeline

Combine multiple evaluators for comprehensive quality assessment.
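
The pipeline below averages all evaluator scores equally. Where some checks matter more than others, a weighted mean is one alternative. The sketch assumes each result is a dict carrying either a `score` or a `weighted_score` key, as in this notebook; the `combine_scores` helper and the example weights are hypothetical.

```python
from typing import Dict

def combine_scores(evaluations: Dict[str, Dict], weights: Dict[str, float]) -> float:
    """Weighted mean of evaluator scores; evaluators without a weight default to 1.0."""
    weighted_sum = 0.0
    total_weight = 0.0
    for name, result in evaluations.items():
        # LLM-judge results expose "weighted_score"; code checks expose "score"
        score = result.get("weighted_score", result.get("score"))
        if score is None:
            continue  # skip results that carry no score at all
        weight = weights.get(name, 1.0)
        weighted_sum += score * weight
        total_weight += weight
    # Same neutral fallback (5.0) the notebook uses when nothing was scored
    return weighted_sum / total_weight if total_weight else 5.0

# Illustrative results and weights (made up for this sketch)
example_evaluations = {
    "content_quality": {"weighted_score": 8.0},
    "length_check": {"score": 10.0, "passed": True},
    "forbidden_content": {"score": 0.0, "passed": False},
}
example_weights = {"content_quality": 2.0, "length_check": 1.0, "forbidden_content": 1.0}

print(combine_scores(example_evaluations, example_weights))  # (16 + 10 + 0) / 4 = 6.5
```

With equal weights this reduces to the plain average used in `CompositeEvaluator`.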

In [None]:
class CompositeEvaluator:
    """Combine multiple evaluators for comprehensive assessment."""

    def __init__(self):
        # Content quality evaluator
        self.content_judge = LLMJudgeEvaluator(
            criteria={
                "clarity": "Is the text clear and easy to understand?",
                "completeness": "Does the output address all aspects of the request?",
                "accuracy": "Is the content factually accurate?",
                "engagement": "Is the content interesting and well-written?"
            },
            weights={
                "clarity": 0.25,
                "completeness": 0.25,
                "accuracy": 0.25,
                "engagement": 0.25
            }
        )

    def evaluate_blog_post(self, content: str) -> Dict:
        """Evaluate a blog post with multiple criteria."""
        results = {"evaluations": {}}
        
        # LLM-as-Judge evaluation
        judge_result = self.content_judge.evaluate(content)
        results["evaluations"]["content_quality"] = judge_result
        
        # Code-based evaluations
        results["evaluations"]["length_check"] = CodeEvaluator.check_word_count(
            content, min_words=100, max_words=2000
        )
        
        results["evaluations"]["structure_check"] = CodeEvaluator.check_required_fields(
            content, required_fields=["introduction", "conclusion"]
        )
        
        results["evaluations"]["forbidden_content"] = CodeEvaluator.check_no_forbidden_content(
            content, forbidden=["TODO", "[EDIT]", "[CITATION NEEDED]"]
        )
        
        # Calculate composite score
        scores = []
        for eval_result in results["evaluations"].values():
            if isinstance(eval_result, dict) and "score" in eval_result:
                scores.append(eval_result["score"])
            elif isinstance(eval_result, dict) and "weighted_score" in eval_result:
                scores.append(eval_result["weighted_score"])
        
        results["composite_score"] = sum(scores) / len(scores) if scores else 5.0
        results["passed"] = results["composite_score"] >= 7.0
        
        return results

    def evaluate_email(self, content: str) -> Dict:
        """Evaluate an email with format and tone checks."""
        results = {"evaluations": {}}
        
        # Tone evaluator
        tone_judge = LLMJudgeEvaluator(
            criteria={
                "professionalism": "Is the tone professional and appropriate?",
                "conciseness": "Is the email concise and to the point?",
                "clarity": "Is the action or request clear?"
            }
        )
        
        results["evaluations"]["tone"] = tone_judge.evaluate(content)
        
        # Format checks
        results["evaluations"]["length"] = CodeEvaluator.check_word_count(
            content, min_words=20, max_words=500
        )
        
        results["evaluations"]["structure"] = CodeEvaluator.check_required_fields(
            content, required_fields=["hello", "regards"]
        )
        
        # Calculate composite score
        scores = []
        for eval_result in results["evaluations"].values():
            if isinstance(eval_result, dict) and "score" in eval_result:
                scores.append(eval_result["score"])
            elif isinstance(eval_result, dict) and "weighted_score" in eval_result:
                scores.append(eval_result["weighted_score"])
        
        results["composite_score"] = sum(scores) / len(scores) if scores else 5.0
        results["passed"] = results["composite_score"] >= 7.0
        
        return results


print("Composite evaluator implemented!")

## Step 6: Test Evaluators on Content

Apply evaluators to real-world content examples.
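
Both test cells below print their results with the same loop. A small helper can flatten an evaluation result into rows first, which also makes the output easier to log or tabulate. This is a minimal sketch, assuming the result shape returned by `CompositeEvaluator`; the `summarize_result` name and the sample values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

def summarize_result(result: Dict) -> List[Tuple[str, Optional[float], Optional[bool]]]:
    """Flatten an evaluation result into (name, score, passed) rows."""
    rows = []
    for eval_name, evaluation in result["evaluations"].items():
        # Judge results carry "weighted_score"; code checks carry "score"
        score = evaluation.get("weighted_score", evaluation.get("score"))
        rows.append((eval_name, score, evaluation.get("passed")))
    return rows

# Illustrative result (made up; not produced by a real evaluation run)
example_result = {
    "evaluations": {
        "tone": {"weighted_score": 8.5, "scores": {"clarity": 9.0}},
        "length": {"score": 10.0, "passed": True},
    },
    "composite_score": 9.25,
    "passed": True,
}

for eval_name, score, passed in summarize_result(example_result):
    score_text = "n/a" if score is None else f"{score:.1f}/10"
    passed_text = "n/a" if passed is None else str(passed)
    print(f"{eval_name}: score={score_text} passed={passed_text}")
```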

In [None]:
# Initialize evaluator
evaluator = CompositeEvaluator()

# Test 1: Blog post
blog_post = """
Introduction

Artificial intelligence is transforming how we work. From automation to decision support,
AI tools are becoming essential across industries. Understanding AI fundamentals is now
a core skill for modern professionals.

What is AI?

Artificial Intelligence refers to computer systems designed to perform tasks that typically
require human intelligence. These include learning from data, recognizing patterns, and
making decisions.

Key Benefits

AI systems can process vast amounts of data quickly, identify subtle patterns, and operate
24/7 without fatigue. Organizations using AI report improved efficiency and better insights.

Implementation Challenges

While powerful, AI implementation requires careful planning. Data quality, model training,
and ongoing monitoring are critical success factors.

Conclusion

AI is no longer optional in modern business. Organizations that embrace AI thoughtfully
will gain competitive advantages in their markets.
"""

print("="*60)
print("EVALUATING BLOG POST")
print("="*60)

blog_result = evaluator.evaluate_blog_post(blog_post)

print(f"\nComposite Score: {blog_result['composite_score']:.1f}/10")
print(f"Status: {'✓ PASS' if blog_result['passed'] else '✗ FAIL'}")

print("\nDetailed Results:")
for eval_name, result in blog_result["evaluations"].items():
    print(f"\n{eval_name}:")
    if "weighted_score" in result:
        print(f"  Score: {result['weighted_score']:.1f}/10")
    elif "score" in result:
        print(f"  Score: {result['score']:.1f}/10")
    if "passed" in result:
        print(f"  Passed: {result['passed']}")

In [None]:
# Test 2: Email
email = """
Hello Sarah,

I hope this email finds you well. I wanted to reach out regarding the Q4 strategy meeting.
Would you be available next Tuesday at 2 PM to discuss our market positioning and goals?

Please let me know if this time works for you, or feel free to suggest an alternative.

Best regards,
John
"""

print("\n" + "="*60)
print("EVALUATING EMAIL")
print("="*60)

email_result = evaluator.evaluate_email(email)

print(f"\nComposite Score: {email_result['composite_score']:.1f}/10")
print(f"Status: {'✓ PASS' if email_result['passed'] else '✗ FAIL'}")

print("\nDetailed Results:")
for eval_name, result in email_result["evaluations"].items():
    print(f"\n{eval_name}:")
    if "weighted_score" in result:
        print(f"  Score: {result['weighted_score']:.1f}/10")
    elif "score" in result:
        print(f"  Score: {result['score']:.1f}/10")
    if "passed" in result:
        print(f"  Passed: {result['passed']}")

## Step 7: Best Practices for Custom Evaluators

Key principles for building effective evaluators:

In [None]:
print("=" * 60)
print("BEST PRACTICES FOR CUSTOM EVALUATORS")
print("=" * 60)
print("""
1. **Start Simple, Add Complexity**
   - Begin with basic checks (length, required fields)
   - Add LLM-based evaluation once baseline is working
   - Avoid over-engineering until you have metrics

2. **Match Evaluator Type to Measurement Need**
   - Use CODE evaluators for objective, deterministic checks
   - Use LLM-as-Judge for subjective, nuanced quality assessment
   - Combine both for comprehensive evaluation

3. **Provide Clear Rubrics for LLM-as-Judge**
   - Write explicit criteria descriptions
   - Use examples in prompts to anchor evaluations
   - Test with multiple outputs to ensure consistency

4. **Combine Multiple Evaluators**
   - Use weighted scoring to reflect importance
   - Balance structural (code) and quality (LLM) checks
   - Set meaningful pass/fail thresholds

5. **Iterate Based on Results**
   - If evaluator scores don't match your judgment, refine criteria
   - Track evaluator precision and recall
   - Update rubrics as requirements change

6. **Document Your Evaluation Logic**
   - Explain why each criterion matters
   - Record your weighting decisions
   - Make thresholds explicit and defensible
""")
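
Practice 5 mentions tracking evaluator precision and recall. One way to do that is to compare the evaluator's pass/fail verdicts against human labels for the same outputs, treating "fail" as the positive class, since the evaluator's job is to catch bad outputs. The sketch below is an assumption about how such a comparison could be set up; the function name and the sample labels are illustrative only.

```python
from typing import Dict, List

def evaluator_precision_recall(
    evaluator_failed: List[bool], human_failed: List[bool]
) -> Dict[str, float]:
    """Precision and recall of the evaluator's 'fail' verdicts against human labels."""
    if len(evaluator_failed) != len(human_failed):
        raise ValueError("both lists must have the same length")

    pairs = list(zip(evaluator_failed, human_failed))
    true_pos = sum(1 for e, h in pairs if e and h)       # both flagged the output
    false_pos = sum(1 for e, h in pairs if e and not h)  # evaluator flagged a good output
    false_neg = sum(1 for e, h in pairs if not e and h)  # evaluator missed a bad output

    flagged = true_pos + false_pos
    actually_bad = true_pos + false_neg
    return {
        "precision": true_pos / flagged if flagged else 0.0,
        "recall": true_pos / actually_bad if actually_bad else 0.0,
    }

# Illustrative labels for five outputs (True means "should fail")
evaluator_flags = [True, True, False, False, True]
human_flags = [True, False, False, True, True]

print(evaluator_precision_recall(evaluator_flags, human_flags))
```

Low precision suggests the criteria or thresholds are too strict; low recall suggests they are too lenient.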

---

## Evaluation in Netra

Once you've defined your evaluators, integrate them with Netra's evaluation framework:

1. **Create evaluators in the dashboard** with your custom logic
2. **Link to datasets** for automated evaluation
3. **Track metrics** over time to measure quality trends
4. **Compare configurations** using composite scores

## Documentation Links

- [Netra Documentation](https://docs.getnetra.ai)
- [Custom Evaluators](https://docs.getnetra.ai/Evaluation/Custom-Evaluators)
- [Evaluation Framework](https://docs.getnetra.ai/Evaluation)

## See Also

- [Evaluating RAG Quality](/Cookbooks/evaluation/evaluating-rag-quality) - RAG-specific evaluation
- [A/B Testing Configurations](/Cookbooks/evaluation/ab-testing-configurations) - Compare model configurations
- [Evaluating Agent Decisions](/Cookbooks/evaluation/evaluating-agent-decisions) - Agent behavior evaluation