# Agenta SDK Quick Start - Evaluations

This notebook demonstrates how to:
1. Create a simple application that returns country capitals
2. Create evaluators to check if the application's output is correct
3. Run an evaluation to test your application

The entire example takes fewer than 100 lines of code!

## Setup

First, install the Agenta SDK and set up your environment variables:

In [None]:
# Install Agenta SDK
%pip install agenta -q

In [None]:
import os
from getpass import getpass

# Set your API credentials
if not os.getenv("AGENTA_API_KEY"):
    os.environ["AGENTA_API_KEY"] = getpass("Enter your Agenta API key: ")

if not os.getenv("AGENTA_HOST"):
    os.environ["AGENTA_HOST"] = "https://cloud.agenta.ai"  # Change for self-hosted

# Set OpenAI API key (required for LLM-as-a-judge evaluator)
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

print("✅ Environment configured!")

## Initialize Agenta SDK

Initialize the SDK to connect to the Agenta platform:

In [None]:
import agenta as ag

ag.init()

print("✅ Agenta SDK initialized!")

## Step 1: Define Your Application

An application is any function decorated with `@ag.application`. It receives inputs from test data and returns outputs.

Let's create a simple application that returns country capitals:

In [None]:
@ag.application(
    slug="capital_finder",
    name="Capital Finder",
    description="Returns the capital of a given country"
)
async def capital_finder(country: str):
    """
    A simple application that returns country capitals.
    
    Args:
        country: The country name (from testcase)
    
    Returns:
        The capital city name
    """
    capitals = {
        "Germany": "Berlin",
        "France": "Paris",
        "Spain": "Madrid",
        "Italy": "Rome",
    }
    return capitals.get(country, "Unknown")

print("✅ Application defined!")

## Step 2: Create Custom Evaluators

Evaluators check if your application's output is correct. They receive:
- Fields from your testcase (e.g., `capital`)
- The application's output (always called `outputs`)

Let's create two evaluators:

In [None]:
@ag.evaluator(
    slug="exact_match",
    name="Exact Match Evaluator",
    description="Checks if the output exactly matches the expected answer"
)
async def exact_match(capital: str, outputs: str):
    """
    Evaluates if the application's output matches the expected answer.
    
    Args:
        capital: The expected capital (from testcase)
        outputs: What the application returned
    
    Returns:
        Dictionary with score and success flag
    """
    is_correct = outputs == capital
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }


@ag.evaluator(
    slug="case_insensitive_match",
    name="Case Insensitive Match",
    description="Checks if output matches ignoring case"
)
async def case_insensitive_match(capital: str, outputs: str):
    """
    Evaluates with case-insensitive comparison.
    """
    is_correct = outputs.lower() == capital.lower()
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }

print("✅ Evaluators defined!")
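
Before registering evaluators with Agenta, it can help to sanity-check the comparison logic on its own. The sketch below re-implements the two comparisons as plain synchronous functions and does not use the `@ag.evaluator` decorator. The names `exact_match_score` and `case_insensitive_score` are hypothetical and not part of the SDK.

```python
def exact_match_score(capital: str, outputs: str) -> dict:
    """Same comparison as the exact_match evaluator, without the decorator."""
    is_correct = outputs == capital
    return {"score": 1.0 if is_correct else 0.0, "success": is_correct}


def case_insensitive_score(capital: str, outputs: str) -> dict:
    """Same comparison as case_insensitive_match, without the decorator."""
    is_correct = outputs.lower() == capital.lower()
    return {"score": 1.0 if is_correct else 0.0, "success": is_correct}


# "berlin" differs from "Berlin" only by case, so the two checks disagree.
strict = exact_match_score(capital="Berlin", outputs="berlin")
relaxed = case_insensitive_score(capital="Berlin", outputs="berlin")
print(strict)
print(relaxed)
```

A case where the two evaluators disagree like this is a quick way to confirm that each one measures what you expect.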

## Step 3: Use Built-in Evaluators

Agenta provides built-in evaluators like LLM-as-a-judge. Let's create one:

In [None]:
from agenta.sdk.workflows import builtin

llm_judge = builtin.auto_ai_critique(
    slug="llm_judge",
    name="LLM Judge Evaluator",
    description="Uses an LLM to judge if the answer is correct",
    correct_answer_key="capital",
    model="gpt-4o-mini",
    prompt_template=[
        {
            "role": "system",
            "content": "You are a geography expert evaluating answers about world capitals.",
        },
        {
            "role": "user",
            "content": (
                "Expected capital: {{capital}}\n"
                "Student's answer: {{outputs}}\n\n"
                "Is the student's answer correct?\n"
                "Respond with ONLY a number from 0.0 (wrong) to 1.0 (correct).\n"
                "Nothing else - just the number."
            ),
        },
    ],
)

print("✅ LLM judge evaluator created!")
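
To make the prompt template more concrete, here is a rough sketch of what filling in the `{{capital}}` and `{{outputs}}` placeholders and reading back a numeric reply could look like. It assumes plain placeholder substitution and a reply that contains only a number. The SDK's actual templating and parsing may differ, and `render_template` and `parse_score` are hypothetical helpers, not Agenta functions.

```python
import re


def render_template(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value (illustrative only)."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # Leave unknown placeholders untouched so they are easy to spot.
        return str(variables.get(key, match.group(0)))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def parse_score(reply: str) -> float:
    """Convert the judge's reply to a float clamped to the range [0.0, 1.0]."""
    value = float(reply.strip())
    return max(0.0, min(1.0, value))


user_prompt = render_template(
    "Expected capital: {{capital}}\nStudent's answer: {{outputs}}",
    {"capital": "Berlin", "outputs": "Berlin"},
)
print(user_prompt)
print(parse_score(" 1.0\n"))
```

Asking the model for "just the number" keeps this kind of parsing simple: any extra words in the reply would make `float()` raise a `ValueError`.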

## Step 4: Create Test Data

Define test cases as a list of dictionaries:

In [None]:
test_data = [
    {"country": "Germany", "capital": "Berlin"},
    {"country": "France", "capital": "Paris"},
    {"country": "Spain", "capital": "Madrid"},
    {"country": "Italy", "capital": "Rome"},
]

print(f"✅ Created {len(test_data)} test cases")
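
Since the application reads `country` and the evaluators read `capital`, every test case needs both keys. A small check like the following can catch incomplete rows before a testset is created. This is an optional sketch; `find_incomplete_cases` is a hypothetical helper, not part of the Agenta SDK.

```python
def find_incomplete_cases(cases: list, required_keys: set) -> list:
    """Return (index, missing_keys) pairs for cases lacking a required key."""
    problems = []
    for index, case in enumerate(cases):
        missing = required_keys - set(case)
        if missing:
            problems.append((index, sorted(missing)))
    return problems


sample_cases = [
    {"country": "Germany", "capital": "Berlin"},
    {"country": "France"},  # deliberately missing "capital"
]
problems = find_incomplete_cases(sample_cases, {"country", "capital"})
print(problems)
```

An empty list means every case has the fields that the application and evaluators expect.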

## Step 5: Run the Evaluation

Now let's create a testset and run the evaluation!

In [None]:
from agenta.sdk.evaluations import aevaluate

# Create a testset
print("📝 Creating testset...")
testset = await ag.testsets.acreate(
    name="Country Capitals Quick Start",
    data=test_data,
)

if not testset or not testset.id:
    print("❌ Failed to create testset")
else:
    print(f"✅ Testset created with ID: {testset.id}")
    print(f"   Contains {len(test_data)} test cases\n")

In [None]:
# Run evaluation with all three evaluators
print("🚀 Running evaluation...\n")

result = await aevaluate(
    testsets=[testset.id],
    applications=[capital_finder],
    evaluators=[
        exact_match,
        case_insensitive_match,
        llm_judge,
    ],
)

print("\n" + "=" * 70)
print("✅ Evaluation Complete!")
print("=" * 70)

## View Results

The evaluation results are now available in the Agenta UI! You can:

1. **View detailed results** - See how each test case performed
2. **Compare evaluators** - See which evaluators flagged which test cases
3. **Analyze metrics** - View aggregated scores and success rates

You can also access results programmatically:

In [None]:
if result and "run" in result:
    print("\n📊 Evaluation Details:")
    print(f"   Run ID: {result['run'].id}")
    print(f"   Status: {result['run'].status}")
    print("\n🔗 View results in the Agenta UI")
else:
    print("No result data available")
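
The aggregated scores and success rates mentioned above are simple to reason about. As an illustration only, the sketch below computes both from a hand-written list of evaluator outputs shaped like the dictionaries returned in Step 2. The `sample_results` data and the `summarize` helper are hypothetical and do not reflect how the SDK stores or returns results.

```python
from statistics import mean


def summarize(results: list) -> dict:
    """Compute the mean score and success rate for a non-empty result list."""
    return {
        "mean_score": mean(r["score"] for r in results),
        "success_rate": sum(1 for r in results if r["success"]) / len(results),
    }


# Made-up evaluator outputs: three passes and one failure.
sample_results = [
    {"score": 1.0, "success": True},
    {"score": 1.0, "success": True},
    {"score": 0.0, "success": False},
    {"score": 1.0, "success": True},
]
summary = summarize(sample_results)
print(summary)
```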

## Understanding the Data Flow

When you run an evaluation, here's what happens:

1. **Testcase data flows to the application**
   - Input: `{"country": "Germany", "capital": "Berlin"}`
   - Application receives: `country="Germany"`
   - Application returns: `"Berlin"`

2. **Both testcase data and application output flow to evaluators**
   - Evaluator receives: `capital="Berlin"` (from testcase)
   - Evaluator receives: `outputs="Berlin"` (from application)
   - Evaluator compares and returns: `{"score": 1.0, "success": True}`

3. **Results are stored in Agenta**
   - View in web interface
   - Access programmatically
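
The flow above can be imitated in plain Python. The sketch below assumes that testcase fields are matched to function parameters by name, which is consistent with the steps listed but is not taken from Agenta's implementation. `select_arguments`, `capital_lookup`, and `exact_match_check` are hypothetical stand-ins for the decorated functions.

```python
import inspect


def select_arguments(func, available: dict) -> dict:
    """Keep only the entries whose keys match func's parameter names."""
    wanted = inspect.signature(func).parameters
    return {name: value for name, value in available.items() if name in wanted}


def capital_lookup(country: str) -> str:
    # Plain stand-in for the decorated capital_finder application.
    return {"Germany": "Berlin", "France": "Paris"}.get(country, "Unknown")


def exact_match_check(capital: str, outputs: str) -> dict:
    # Plain stand-in for the decorated exact_match evaluator.
    is_correct = outputs == capital
    return {"score": 1.0 if is_correct else 0.0, "success": is_correct}


testcase = {"country": "Germany", "capital": "Berlin"}

# 1. The application sees only the testcase fields it declares.
app_output = capital_lookup(**select_arguments(capital_lookup, testcase))

# 2. The evaluator sees the testcase fields plus the application's output.
evaluation = exact_match_check(
    **select_arguments(exact_match_check, {**testcase, "outputs": app_output})
)
print(app_output, evaluation)
```

Matching by name is why the evaluator parameter must be called `capital` here: it has to line up with the key used in the test data.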

## Next Steps

Now that you've created your first evaluation, explore:

- **[Configuring Evaluators](/evaluation/evaluation-from-sdk/configuring-evaluators)** - Create custom scoring logic
- **[Managing Testsets](/evaluation/evaluation-from-sdk/managing-testsets)** - Work with test data
- **[Running Evaluations](/evaluation/evaluation-from-sdk/running-evaluations)** - Advanced evaluation patterns

## Summary

In this notebook, you learned how to:

✅ Define an application with `@ag.application`  
✅ Create custom evaluators with `@ag.evaluator`  
✅ Use built-in evaluators like LLM-as-a-judge  
✅ Create testsets with `ag.testsets.acreate()`  
✅ Run evaluations with `aevaluate()`  
✅ View results in the Agenta UI  

Happy evaluating! 🎉