
# Evaluator-Optimizer Workflow Learning Exercises

This notebook helps you master the **Evaluator-Optimizer** pattern through progressive exercises.

## The Pattern
In the evaluator-optimizer workflow:
1. **Generator LLM** creates content or solutions
2. **Evaluator LLM** assesses quality and provides feedback
3. **Loop continues** until the evaluator is satisfied
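
The three steps above can be sketched as a plain loop. This is a minimal illustration, not the notebook's reference solution: `run_loop`, `fake_generate`, and `fake_evaluate` are hypothetical names, and the two fake functions stand in for real LLM calls so the sketch runs offline.

```python
from typing import Callable, Tuple


def run_loop(
    task: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str], Tuple[bool, str]],
    max_iterations: int = 3,
) -> Tuple[str, bool, int]:
    """Generate, evaluate, and retry with feedback until approval or the cap."""
    feedback = ""
    content = ""
    for attempt in range(1, max_iterations + 1):
        content = generate(task, feedback)          # step 1: generator
        passed, feedback = evaluate(task, content)  # step 2: evaluator
        if passed:                                  # step 3: stop when satisfied
            return content, True, attempt
    return content, False, max_iterations


# Hypothetical stand-ins for the two LLM calls (assumptions for illustration).
def fake_generate(task: str, feedback: str) -> str:
    return f"{task} (revised: {feedback})" if feedback else task


def fake_evaluate(task: str, content: str) -> Tuple[bool, str]:
    if "revised" in content:
        return True, ""
    return False, "add more detail"


result = run_loop("Describe a bamboo toothbrush", fake_generate, fake_evaluate)
print(result)
```

Passing the generator and evaluator in as arguments keeps the loop itself independent of any particular model or prompt, which makes it easy to test before wiring in real API calls.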

## When to Use This Pattern
- When you have clear evaluation criteria
- When iterative refinement provides measurable value
- When LLM responses can be demonstrably improved with feedback
- Examples: literary translation, content writing, code optimization

## Setup
First, let's install the required packages and set up our API client.

In [None]:
!pip install anthropic

Collecting anthropic
  Downloading anthropic-0.57.1-py3-none-any.whl.metadata (27 kB)
Downloading anthropic-0.57.1-py3-none-any.whl (292 kB)
Installing collected packages: anthropic
Successfully installed anthropic-0.57.1


In [None]:
import anthropic
import re
from google.colab import userdata

# Initialize the Anthropic client
anthropic_api_key = userdata.get('ANTHROPIC_API_KEY')
client = anthropic.Anthropic(api_key=anthropic_api_key)

def call_claude(prompt, model="claude-3-5-haiku-latest"):
    """Make a call to the Claude API"""
    response = client.messages.create(
        model=model,
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

def extract_xml(text, tag):
    """Extract content from XML tags"""
    pattern = f'<{tag}>(.*?)</{tag}>'
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else ""

print("Setup complete!")

Setup complete!
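
To see how `extract_xml` behaves before relying on it, here is a small offline check. The helper is repeated so the snippet runs on its own, and `sample_response` is a made-up string standing in for real model output. Note the last case: if a response is cut off before its closing tag (for example, by the `max_tokens` limit), the helper returns an empty string.

```python
import re


def extract_xml(text, tag):
    """Same helper as in the setup cell, repeated so this runs standalone."""
    pattern = f'<{tag}>(.*?)</{tag}>'
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else ""


# A made-up response string (an assumption, not real model output).
sample_response = """Here is my review.
<score>NEEDS_IMPROVEMENT</score>
<feedback>
Tighten the opening line.
Name one concrete benefit.
</feedback>"""

score = extract_xml(sample_response, "score")
feedback = extract_xml(sample_response, "feedback")
missing = extract_xml(sample_response, "evaluation")  # tag not present
truncated = extract_xml("<content>cut off mid-sentence", "content")  # no closing tag

print(repr(score))
print(repr(feedback))
print(repr(missing))
print(repr(truncated))
```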


# Exercise 1: Understanding the Components (Warmup)

Before building the full workflow, let's understand each component.

## Exercise 1.1: Build a Simple Generator

**Your task**: Create a function that generates content based on a task description.

**Hints**:
- The generator should take a task and return generated content
- Use XML tags to structure the LLM's response
- Think about what makes a good prompt for content generation

In [None]:
def generate_content(task, context=None):
    """Generate content based on a task description.

    Args:
        task: String describing what to generate
        context: Optional context like previous attempts or feedback

    Returns:
        Generated content as a string
    """
    # TODO: Create a prompt that asks the LLM to generate content
    # - Include the task description
    # - If context is provided, include it to help improve the generation
    # - Ask the LLM to wrap its response in <content></content> tags

    context_prompt = ""
    if context:
        context_prompt += f"Take this additional information into consideration while you complete it: {context}\n"

    prompt = f"You are an expert copywriter. Complete this task: {task}. {context_prompt} Wrap your response in <content> tags."

    print(f"Prompt: {prompt}\n")

    # TODO: Call the LLM and extract the content
    response = call_claude(prompt)
    content = extract_xml(response, 'content')

    return content

# Test your generator
test_task = "Write a short product description for eco-friendly bamboo toothbrushes"
test_context = "Previous versions of this were too hokey and inauthentic."
result = generate_content(test_task, test_context)
print("Generated content:")
print(result)

Prompt: You are an expert copywriter. Complete this task: Write a short product description for eco-friendly bamboo toothbrushes. Take this additional information into consideration while you complete it: Previous versions of this were too hokey and inauthentic.
 Wrap your response in <content> tags.

Generated content:
Ditch plastic, embrace sustainability. Our bamboo toothbrushes are minimalist, functional, and kind to the planet. Crafted from 100% biodegradable bamboo with soft, BPA-free bristles, these brushes reduce landfill waste without compromising on cleaning performance. Each brush is ergonomically designed with a comfortable grip and naturally antimicrobial bamboo handle. Simple, effective, and environmentally responsible—because taking care of your teeth shouldn't cost the earth.


## Exercise 1.2: Build a Simple Evaluator

**Your task**: Create a function that evaluates content quality.

**Hints**:
- The evaluator should return both a pass/fail decision and specific feedback
- Think about what criteria make content good for the given task
- Use XML tags to structure the evaluation response
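
One way to turn a tagged evaluator response into a pass/fail decision is sketched below. It assumes the evaluator was asked to use `<score>` and `<feedback>` tags; `parse_evaluation` is a hypothetical helper name, and treating a missing or unrecognized score as a failure is a design choice rather than a requirement of the pattern.

```python
import re
from typing import Tuple


def parse_evaluation(response: str) -> Tuple[bool, str]:
    """Return (passed, feedback) parsed from <score> and <feedback> tags."""

    def tag(name: str) -> str:
        match = re.search(f"<{name}>(.*?)</{name}>", response, re.DOTALL)
        return match.group(1).strip() if match else ""

    score = tag("score").upper()
    feedback = tag("feedback")
    # Anything other than an exact PASS counts as a failure, including a missing tag.
    passed = score == "PASS"
    if not score:
        feedback = feedback or "Evaluator response had no <score> tag."
    return passed, feedback


print(parse_evaluation("<score>pass</score><feedback>Good.</feedback>"))
print(parse_evaluation("<score> NEEDS_IMPROVEMENT </score><feedback>Too long.</feedback>"))
print(parse_evaluation("no tags at all"))
```

Failing closed on malformed output means a garbled evaluator response triggers another iteration instead of silently approving the content.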

In [None]:
def evaluate_content(content, task, criteria=None):
    """Evaluate content quality against task requirements.

    Args:
        content: The generated content to evaluate
        task: Original task description
        criteria: Optional specific criteria to check

    Returns:
        Tuple of (evaluation, feedback, score)
        evaluation: Assessment text from <evaluation> tags (requested only when criteria are given)
        feedback: Specific suggestions for improvement
        score: "PASS" or "NEEDS_IMPROVEMENT"
    """
    # TODO: Create a prompt that asks the LLM to evaluate the content
    # - Include the original task and the content to evaluate
    # - Ask for specific evaluation criteria (clarity, completeness, etc.)
    # - Request both a PASS/NEEDS_IMPROVEMENT decision and specific feedback
    # - Use XML tags like <evaluation></evaluation> and <feedback></feedback>

    criteria_prompt = ""
    if criteria:
        criteria_prompt += f"Evaluate their content based on this criteria: {criteria}. Return your evaluation inside <evaluation> tags."

    prompt = f"""
You are an expert copywriter. Another copywriter was given this task: {task}.
{criteria_prompt}
Decide whether this content should get a PASS score or a NEEDS_IMPROVEMENT score, along with specific feedback about why you made that decision. Wrap your score inside <score> tags and your feedback inside <feedback> tags.

Content: {content}
    """

    print(f"Prompt: {prompt}")

    response = call_claude(prompt)
    evaluation = extract_xml(response, 'evaluation')
    feedback = extract_xml(response, 'feedback')
    score = extract_xml(response, 'score')

    return evaluation, feedback, score

test_task = "Write a short product description for eco-friendly bamboo toothbrushes"
test_content = "Ditch plastic, embrace sustainability. Our bamboo toothbrushes are minimalist, functional, and kind to the planet. Crafted from 100% biodegradable bamboo with soft, BPA-free bristles, these brushes reduce landfill waste without compromising on cleaning performance. Each brush is ergonomically designed with a comfortable grip and naturally antimicrobial bamboo handle. Simple, effective, and environmentally responsible—because taking care of your teeth shouldn't cost the earth."
test_criteria = "clarity, completeness, and originality"

evaluation, feedback, score = evaluate_content(test_content, test_task, test_criteria)

print(f"---- Evaluation: ----\n{evaluation}\n")
print(f"---- Score: ----\n{score}\n")
print(f"---- Feedback: ----\n{feedback}")

Prompt: 
You are an expert copywriter. Another copywriter was given this task: Write a short product description for eco-friendly bamboo toothbrushes.
Evaluate their content based on this criteria: clarity, completeness, and originality. Return your evaluation inside <evaluation> tags.
Decide whether this content should get a PASS score or a NEEDS_IMPROVEMENT score, along with specific feedback about why you made that decision. Wrap your score inside <score> tags and your feedback inside <feedback> tags.

Content: Ditch plastic, embrace sustainability. Our bamboo toothbrushes are minimalist, functional, and kind to the planet. Crafted from 100% biodegradable bamboo with soft, BPA-free bristles, these brushes reduce landfill waste without compromising on cleaning performance. Each brush is ergonomically designed with a comfortable grip and naturally antimicrobial bamboo handle. Simple, effective, and environmentally responsible—because taking care of your teeth shouldn't cost the earth.

# Exercise 2: Basic Evaluator-Optimizer Loop (Core Pattern)

Now let's combine the generator and evaluator into the full workflow pattern.

**Your task**: Build a complete evaluator-optimizer loop that iterates until the content passes evaluation.

**Key concepts to implement**:
- Iteration loop with maximum attempts
- Passing feedback from evaluator back to generator
- Stopping when content passes evaluation
- Tracking attempts for context
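
Passing feedback back to the generator usually means flattening the history into a single context string. The sketch below shows one possible layout; `build_context` is a hypothetical helper, and the example attempts and feedback are invented for illustration.

```python
from typing import List


def build_context(attempts: List[str], feedback_history: List[str]) -> str:
    """Pair each earlier attempt with the feedback it received, oldest first."""
    sections = []
    for number, (attempt, feedback) in enumerate(zip(attempts, feedback_history), start=1):
        sections.append(
            f"Attempt {number}:\n{attempt}\nFeedback on attempt {number}:\n{feedback}"
        )
    return "\n\n".join(sections)


# Invented history, standing in for what the loop would have collected.
context = build_context(
    ["Sorry your order is late.", "We apologize for the delay and will refund shipping."],
    ["Too curt; offer a solution.", "Add a new delivery date."],
)
print(context)
```

Keeping each attempt next to its own feedback lets the generator see which critique applied to which draft, which tends to matter once there are more than two iterations.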

In [None]:
def evaluator_optimizer_loop(task, max_iterations=3):
    """Run the complete evaluator-optimizer workflow.

    Args:
        task: The task description
        max_iterations: Maximum number of improvement attempts

    Returns:
        Final content (whether it passed or not)
    """
    attempts = []  # Track all attempts
    feedback_history = []  # Track feedback

    for iteration in range(max_iterations):
        print(f"\n=== ITERATION {iteration + 1} ===")

        # TODO: Generate content
        # - On first iteration, just use the task
        # - On later iterations, include previous attempts and feedback as context

        if iteration == 0:
            # First attempt - no context needed
            content = "YOUR CODE HERE"
        else:
            # Build context from previous attempts and feedback
            context = "YOUR CODE HERE (combine attempts and feedback)"
            content = "YOUR CODE HERE"

        print(f"Generated content: {content}")
        attempts.append(content)

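        # TODO: Evaluate the content
        # Note: evaluate_content from Exercise 1.2 returns three values
        # (evaluation, feedback, score), so unpack or adapt accordingly.
        evaluation, feedback = "YOUR CODE HERE"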

        print(f"Evaluation: {evaluation}")
        print(f"Feedback: {feedback}")

        feedback_history.append(feedback)

        # TODO: Check if we should stop iterating
        if "YOUR CODE HERE (check if evaluation passed)":
            print("\n✅ Content approved!")
            return content

    print("\n⚠️ Max iterations reached without approval")
    return attempts[-1]  # Return the last attempt

# Test the complete workflow
task = "Write an email to a customer explaining a delayed shipment, showing empathy and offering a solution"
final_content = evaluator_optimizer_loop(task)

print("\n" + "="*50)
print("FINAL RESULT:")
print(final_content)

# Exercise 3: Multi-Criteria Evaluation (Intermediate)

Real-world content often needs to meet multiple criteria. Let's build an evaluator that checks different aspects separately.

**Your task**: Create an evaluator that checks multiple criteria and only passes if ALL criteria are met.

**Concepts to learn**:
- Breaking evaluation into separate criteria
- Combining multiple evaluation results
- Providing targeted feedback for each criterion
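
The combining step can be tried in isolation before any LLM calls are involved. This sketch assumes each criterion result is a dict with `"evaluation"` and `"feedback"` keys, matching the scaffold below; `combine_results` is a hypothetical name and the sample results are invented.

```python
from typing import Dict, Tuple


def combine_results(criterion_results: Dict[str, Dict[str, str]]) -> Tuple[str, str]:
    """Pass only if every criterion passed; otherwise collect the failures' feedback."""
    failed = [
        f"{name}: {result['feedback']}"
        for name, result in criterion_results.items()
        if result["evaluation"] != "PASS"
    ]
    overall = "PASS" if not failed else "NEEDS_IMPROVEMENT"
    feedback = "\n".join(failed) if failed else "All criteria met!"
    return overall, feedback


# Invented per-criterion results for illustration.
sample = {
    "engagement": {"evaluation": "PASS", "feedback": ""},
    "clarity": {"evaluation": "NEEDS_IMPROVEMENT", "feedback": "Shorten the second sentence."},
}
print(combine_results(sample))
```

Only the failed criteria contribute feedback, so the generator's next prompt stays focused on what still needs fixing. An empty results dict counts as a pass here, which you may want to guard against in real use.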

In [None]:
def evaluate_multiple_criteria(content, task, criteria_dict):
    """Evaluate content against multiple criteria separately.

    Args:
        content: Content to evaluate
        task: Original task
        criteria_dict: Dict of {criterion_name: criterion_description}

    Returns:
        Tuple of (overall_result, detailed_feedback)
    """
    criterion_results = {}

    # TODO: Evaluate each criterion separately
    for criterion_name, criterion_description in criteria_dict.items():
        print(f"  Checking {criterion_name}...")

        # TODO: Create a prompt focused on this specific criterion
        # Include the criterion description and ask for pass/fail + feedback

        prompt = "YOUR CODE HERE"

        # TODO: Get evaluation for this criterion
        response = "YOUR CODE HERE"
        evaluation = "YOUR CODE HERE"
        feedback = "YOUR CODE HERE"

        criterion_results[criterion_name] = {
            "evaluation": evaluation,
            "feedback": feedback
        }

        print(f"    {criterion_name}: {evaluation}")
        if feedback:
            print(f"    Feedback: {feedback}")

    # TODO: Determine overall result
    # Should pass only if ALL criteria pass
    all_passed = "YOUR CODE HERE"

    # TODO: Compile feedback from failed criteria
    failed_feedback = []
    for criterion_name, result in criterion_results.items():
        if "YOUR CODE HERE (check if this criterion failed)":
            failed_feedback.append(f"{criterion_name}: {result['feedback']}")

    overall_feedback = "\n".join(failed_feedback) if failed_feedback else "All criteria met!"
    overall_result = "PASS" if all_passed else "NEEDS_IMPROVEMENT"

    return overall_result, overall_feedback

def multi_criteria_optimizer(task, criteria_dict, max_iterations=4):
    """Run evaluator-optimizer with multiple criteria."""
    attempts = []

    for iteration in range(max_iterations):
        print(f"\n=== ITERATION {iteration + 1} ===")

        # TODO: Generate content (similar to Exercise 2)
        if iteration == 0:
            content = generate_content(task)
        else:
            # Build context from previous attempts
            context = "YOUR CODE HERE"
            content = generate_content(task, context)

        print(f"Generated: {content}")
        attempts.append(content)

        # TODO: Evaluate against all criteria
        print("\nEvaluating criteria:")
        evaluation, feedback = "YOUR CODE HERE"

        print(f"\nOverall evaluation: {evaluation}")
        print(f"Feedback: {feedback}")

        # TODO: Check if we should stop
        if "YOUR CODE HERE":
            print("\n✅ All criteria satisfied!")
            return content

    print("\n⚠️ Max iterations reached")
    return attempts[-1]

# Test with multiple criteria
task = "Write a blog post introduction about the benefits of remote work"

criteria = {
    "engagement": "Check if the introduction is engaging and hooks the reader",
    "clarity": "Evaluate if the writing is clear and easy to understand",
    "relevance": "Assess if the content directly addresses remote work benefits",
    "structure": "Check if the introduction has good flow and structure"
}

result = multi_criteria_optimizer(task, criteria)
print(f"\nFinal result: {result}")

# Exercise 4: Code Quality Optimizer (Advanced)

Let's apply the evaluator-optimizer pattern to code generation and improvement.

**Your task**: Build a system that generates code, evaluates it for both functionality and quality, and iteratively improves it.

**Advanced concepts**:
- Domain-specific evaluation (code functionality vs. code quality)
- Handling more complex context and feedback
- Multiple evaluation dimensions
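
Handling richer context often starts with assembling the prompt from optional parts. Here is a minimal sketch of one way to do that for code generation; `build_code_prompt` is a hypothetical helper and the prompt wording is only an assumption, so adjust it to taste.

```python
from typing import List, Optional


def build_code_prompt(
    task: str,
    requirements: Optional[List[str]] = None,
    context: Optional[str] = None,
) -> str:
    """Assemble a code-generation prompt, skipping sections that were not supplied."""
    parts = [f"Write code for this task: {task}"]
    if requirements:
        bullets = "\n".join(f"- {item}" for item in requirements)
        parts.append(f"Requirements:\n{bullets}")
    if context:
        parts.append(f"Previous attempts and feedback:\n{context}")
    parts.append("Return complete, working code inside <code></code> tags.")
    return "\n\n".join(parts)


prompt = build_code_prompt(
    "Compute a factorial",
    requirements=["Handle 0", "Reject negatives"],
)
print(prompt)
```

Building the prompt from a list of sections avoids stray blank lines or dangling labels when `requirements` or `context` is missing on the first iteration.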

In [None]:
def generate_code(task, requirements=None, context=None):
    """Generate code based on task description.

    Args:
        task: Description of what the code should do
        requirements: Specific technical requirements
        context: Previous attempts and feedback

    Returns:
        Generated code as string
    """
    # TODO: Create a prompt for code generation
    # - Include the task description
    # - Include any specific requirements
    # - If context provided, include previous attempts and feedback
    # - Ask for complete, working code wrapped in <code></code> tags
    # - Emphasize good practices like comments, error handling, etc.

    prompt = "YOUR CODE HERE"

    response = call_claude(prompt)
    code = extract_xml(response, "code")

    return code

def evaluate_code_functionality(code, task, test_cases=None):
    """Evaluate if code meets functional requirements."""
    # TODO: Create a prompt to evaluate code functionality
    # - Include the original task
    # - Include the code to evaluate
    # - If test_cases provided, ask to consider them
    # - Focus on correctness, handling edge cases, etc.

    prompt = "YOUR CODE HERE"

    response = call_claude(prompt)
    evaluation = extract_xml(response, "evaluation")
    feedback = extract_xml(response, "feedback")

    return evaluation, feedback

def evaluate_code_quality(code):
    """Evaluate code quality and best practices."""
    # TODO: Create a prompt to evaluate code quality
    # - Focus on readability, maintainability, efficiency
    # - Check for proper variable names, comments, error handling
    # - Look for code smells or anti-patterns

    prompt = "YOUR CODE HERE"

    response = call_claude(prompt)
    evaluation = extract_xml(response, "evaluation")
    feedback = extract_xml(response, "feedback")

    return evaluation, feedback

def code_evaluator_optimizer(task, requirements=None, test_cases=None, max_iterations=4):
    """Generate and iteratively improve code."""
    attempts = []

    for iteration in range(max_iterations):
        print(f"\n=== ITERATION {iteration + 1} ===")

        # TODO: Generate code
        if iteration == 0:
            code = "YOUR CODE HERE"
        else:
            # Build context from previous attempts and feedback
            context = "YOUR CODE HERE"
            code = "YOUR CODE HERE"

        print(f"Generated code:\n{code}")
        attempts.append(code)

        # TODO: Evaluate both functionality and quality
        print("\nEvaluating functionality:")
        func_eval, func_feedback = "YOUR CODE HERE"

        print(f"  Functionality: {func_eval}")
        if func_feedback:
            print(f"  Feedback: {func_feedback}")

        print("\nEvaluating quality:")
        quality_eval, quality_feedback = "YOUR CODE HERE"

        print(f"  Quality: {quality_eval}")
        if quality_feedback:
            print(f"  Feedback: {quality_feedback}")

        # TODO: Determine if both aspects pass
        both_pass = "YOUR CODE HERE"

        # TODO: Combine feedback for next iteration
        combined_feedback = "YOUR CODE HERE"

        if both_pass:
            print("\n✅ Code passes both functionality and quality checks!")
            return code

    print("\n⚠️ Max iterations reached")
    return attempts[-1]

# Test the code optimizer
task = "Create a function that calculates the factorial of a number"
requirements = [
    "Handle edge cases (0, negative numbers)",
    "Include proper error handling",
    "Add docstring and comments",
    "Use efficient approach"
]

test_cases = [
    "factorial(5) should return 120",
    "factorial(0) should return 1",
    "factorial(-1) should handle error gracefully"
]

final_code = code_evaluator_optimizer(task, requirements, test_cases)
print(f"\nFinal code:\n{final_code}")

# Exercise 5: Creative Challenge (Open-ended)

Now that you understand the pattern, apply it to a domain of your choice!

**Your task**: Pick a creative application and implement an evaluator-optimizer workflow for it.

**Ideas to consider**:
- Recipe optimization (taste, nutrition, cost)
- Marketing copy improvement (persuasiveness, clarity, brand voice)
- Story writing (plot, character development, pacing)
- Technical documentation (accuracy, completeness, readability)
- UI/UX copy (clarity, helpfulness, tone)

**Requirements**:
- Define your own evaluation criteria
- Implement the full workflow
- Test with a realistic example

In [None]:
# YOUR CREATIVE IMPLEMENTATION HERE
# Choose your domain and implement a custom evaluator-optimizer workflow

def my_custom_generator(task, context=None):
    """Generate content for your chosen domain."""
    # YOUR CODE HERE
    pass

def my_custom_evaluator(content, task, criteria):
    """Evaluate content according to your domain's requirements."""
    # YOUR CODE HERE
    pass

def my_custom_optimizer(task, criteria, max_iterations=3):
    """Full workflow for your chosen domain."""
    # YOUR CODE HERE
    pass

# Test your implementation
# my_task = "..."
# my_criteria = {...}
# result = my_custom_optimizer(my_task, my_criteria)
# print(result)

# Summary and Key Takeaways

Congratulations! You've now learned the **Evaluator-Optimizer** pattern through hands-on practice. Here are the key concepts you've mastered:

## Core Pattern Components
1. **Generator**: Creates content based on task description and context
2. **Evaluator**: Assesses quality and provides specific feedback
3. **Loop**: Iterates until content meets standards or max attempts reached
4. **Context**: Previous attempts and feedback inform future generations

## When to Use This Pattern
- ✅ **Good for**: Content with measurable quality criteria
- ✅ **Good for**: Tasks where iteration provides clear value
- ✅ **Good for**: Complex outputs requiring multiple criteria
- ❌ **Avoid for**: Simple factual queries or single-shot tasks
- ❌ **Avoid for**: Real-time applications (due to multiple LLM calls)
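
To make the cost of multiple LLM calls concrete, here is a rough upper bound for the exercises in this notebook. It assumes one generation call per iteration and one call per evaluator, with no retries; `worst_case_calls` is a hypothetical helper.

```python
def worst_case_calls(max_iterations: int, evaluators_per_iteration: int = 1) -> int:
    """Upper bound on LLM calls: one generation plus each evaluation, every iteration."""
    return max_iterations * (1 + evaluators_per_iteration)


print(worst_case_calls(3))     # Exercise 2 defaults: one evaluator -> 6
print(worst_case_calls(4, 4))  # Exercise 3: four criteria checked separately -> 20
print(worst_case_calls(4, 2))  # Exercise 4: functionality + quality -> 12
```

Since the calls within a run happen one after another in these exercises, latency grows roughly in proportion to this count, which is why the pattern is a poor fit for real-time use.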

## Best Practices Learned
- Use XML tags for structured LLM responses
- Provide specific, actionable feedback
- Build rich context from previous attempts
- Set reasonable iteration limits
- Evaluate multiple criteria separately when needed
- Combine functional and qualitative evaluations for code

## Next Steps
- Try applying this pattern to your own use cases
- Experiment with different evaluation criteria
- Consider combining with other AI engineering patterns
- Optimize for your specific domain requirements

## Resources for Further Learning
- Anthropic's prompt engineering documentation
- Advanced prompt patterns and techniques
- Production considerations for multi-step workflows

Happy building! 🚀